Web Crawling for AI Training Data at Scale

Meta’s Llama 3 was pre-trained on over 15 trillion tokens of web-crawled data. Llama 4, released in April 2025, more than doubled that to over 30 trillion tokens of multimodal content (with individual models ranging from 22 to 40 trillion tokens depending on the variant). Common Crawl’s March 2026 archive alone, one month of one nonprofit’s crawling, contained 1.97 billion pages and 344.64 TiB of uncompressed content. The actual volumes that OpenAI, Anthropic, and Google collect internally are almost certainly larger.

None of that data was handed over voluntarily. It was crawled, and the web has gotten significantly worse at tolerating crawlers since the AI training gold rush began. The sites you need to collect from are actively blocking AI crawlers, rate-limiting unfamiliar traffic, and deploying commercial anti-bot systems that treat any bulk request pattern as hostile.

If you are building training datasets for fine-tuning, domain-specific corpora for RAG, or a production pipeline that needs fresh content on a recurring basis, the crawling itself is the easy part. Keeping it running reliably across thousands of diverse targets without getting blocked into uselessness is the actual engineering problem. Most of that problem comes down to proxy infrastructure — and most teams underinvest in it until they have already burned weeks on crawl failures that were never a code issue.

How AI Data Collection Differs from Regular Web Scraping

A typical scraping job has a defined target. You know the site, you know the selectors, you build the spider, you run it, you are done. AI data collection breaks that model in a few ways that matter.

The volumes are different. Training a domain-specific model or building a comprehensive RAG knowledge base means crawling thousands — sometimes tens of thousands — of distinct websites. You are not going deep on one site; you are going broad across the web. Your crawler will encounter wildly different HTML structures, JavaScript rendering requirements, authentication walls, and anti-bot defenses across every target, and it needs to handle all of them without manual intervention.

The diversity requirements are different. A dataset skewed toward a handful of well-structured sites produces a model that performs well on content resembling those sites and poorly on everything else. Geographic diversity matters too. Training data collected exclusively through US-based IP addresses reflects the content, pricing, and language those IPs are served — a problem if your model needs to handle localized content across dozens of markets.

And unlike a one-off scraping project, the crawling never stops. RAG systems depend on fresh data. A knowledge base crawled once and left alone starts returning stale answers almost immediately. Pricing changes, product listings rotate, documentation gets revised. The crawl has to keep running.

All of this compounds the same underlying problem: at this scale, the bottleneck is not compute or storage. It is access. Your IP gets blocked, your requests get throttled, your crawl dies halfway through, and your dataset has gaps you might not notice until the model starts hallucinating about things it never saw.

The robots.txt Problem: Websites Are Actively Blocking AI Crawlers

The web’s relationship with crawlers changed once website owners realized AI companies were scraping their content to train commercial models. The response has been swift and measurable.

An analysis of robots.txt files across Cloudflare’s network on March 30, 2026, found that GPTBot (OpenAI’s training crawler) appeared in 5.52% of blocking rules, making it the most-blocked AI crawler. ClaudeBot (Anthropic) came in at 4.88%, CCBot (Common Crawl) at 5.08%, and Google-Extended at 4.44%. Those percentages sound modest until you consider the denominator. Across the broader web, Ahrefs reported GPTBot is blocked by approximately 5.89% of all sites they analyzed, with the figure climbing to 7.3% when measured across their full 461-million-domain robots.txt dataset.

Blocking among publishers is far more aggressive. Press Gazette found that 79% of the world's top news websites block at least one AI training crawler, and nearly half of news sites block GPTBot outright.

There is an important nuance here. Many site owners now distinguish between AI training crawlers and AI retrieval crawlers. A site might block GPTBot (which feeds training data to OpenAI) while allowing OAI-SearchBot (which powers ChatGPT’s real-time browsing). The same pattern shows up with Anthropic: ClaudeBot is blocked, but Claude-Web (used for live answer retrieval) is sometimes permitted. Publishers do not want their content absorbed into a training dataset without compensation, but they do want to appear in AI-powered search answers.

Cloudflare has made this distinction easier for site owners to enforce. Their AI Crawl Control feature, introduced and expanded through 2024 and 2025, provides managed robots.txt templates that block training-specific user agents while allowing search-adjacent ones. They have also built enforcement mechanisms that go beyond robots.txt, detecting crawlers that misidentify themselves and returning 402 responses.

What does this mean for your data pipeline? If you are crawling the web at AI scale using a well-known crawler user agent, you are going to be blocked by a significant portion of your targets. If you are crawling with a generic or custom user agent, you are going to hit IP-based detection instead — rate limiting, CAPTCHAs, and outright blocks. Either way, you need infrastructure that can handle rejection and adapt.

Why Proxies Are Non-Negotiable at This Scale

Websites do not rely solely on robots.txt. Most modern sites run anti-bot systems — Cloudflare, Akamai, DataDome, PerimeterX — that evaluate every incoming request against a set of signals. The IP address is the first and most decisive one.

When your crawl runs from a single server, every request comes from the same IP. It takes remarkably few requests before that IP gets flagged. The threshold varies by site, but dozens of requests per minute to a single domain from one address will trigger rate limiting on most sites protected by any commercial anti-bot solution. On aggressive targets, even a handful of rapid requests can get you blocked outright.

Running on cloud infrastructure makes it worse. Anti-bot systems maintain databases of IP ranges belonging to AWS, Google Cloud, DigitalOcean, Hetzner, and every other major hosting provider. A request from one of these ranges is automatically treated with higher suspicion. A request from a home broadband connection looks like a person browsing the web. A request from an EC2 instance looks like a bot — because it usually is.

A proxy server sits between your crawler and the target site, forwarding requests through a different IP address. When you route traffic through a pool of thousands of proxy IPs, each individual address makes only a few requests to any given site, keeping you below rate-limiting thresholds and avoiding IP-level bans.
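The client side of that rotation is simple to sketch. A minimal round-robin pool, assuming placeholder endpoints (the `proxy-N.example` URLs below are illustrative, not real provider addresses):

```python
# Minimal sketch of per-request proxy rotation. The pool URLs are
# placeholders -- substitute your provider's gateway endpoints.
import itertools

class ProxyPool:
    """Cycles through a fixed list of proxy URLs, one per request."""

    def __init__(self, proxy_urls):
        self._cycle = itertools.cycle(proxy_urls)

    def next_proxy(self):
        # Returns a dict in the shape most HTTP clients accept.
        url = next(self._cycle)
        return {"http": url, "https": url}

pool = ProxyPool([
    "http://user:pass@proxy-1.example:20000",  # placeholder endpoints
    "http://user:pass@proxy-2.example:20000",
    "http://user:pass@proxy-3.example:20000",
])

# Each call hands back the next address in the pool; with thousands of
# proxies, any single IP sees only a small slice of total request volume.
first = pool.next_proxy()
second = pool.next_proxy()
```

In practice you would pass the returned dict straight to your HTTP client (for example, `requests.get(url, proxies=pool.next_proxy())`), and a commercial rotating gateway does this server-side so your crawler only ever sees one endpoint.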

But the type of proxy matters enormously. Not all IPs carry the same weight.

Choosing the Right Proxy Type for AI Crawling

There are three categories, each with a distinct profile for large-scale data collection. Matching the right type to the right workload is where most of the cost and success-rate difference actually lives.

Mobile Proxies: The Highest-Trust Option

Mobile proxies route traffic through 4G and 5G connections assigned by real mobile carriers. To a target website, requests appear to originate from a mobile device on a cellular network rather than from a server or home broadband connection.

The defining technical characteristic that makes mobile proxies hard to detect and block is CGNAT (Carrier-Grade NAT). Mobile carriers do not have enough public IPv4 addresses to assign one to every subscriber, so they pool hundreds or thousands of devices behind each public-facing IP. From a website’s perspective, a single mobile IP could represent a single automated client or a large group of genuine customers, and there is no reliable way to tell the difference without side information. Blocking the IP risks cutting off real paying users in bulk, which raises the cost of a false positive and makes anti-bot systems more conservative about acting on mobile traffic.

The practical consequence for crawling is that mobile IPs tend to hold up on sites where residential IPs have already been flagged. Social platforms, large ecommerce retailers, ticketing sites, and sources sitting behind aggressive commercial anti-bot tooling are the clearest examples — these are the targets where detection is priced in heavily, and where mobile’s structural advantage translates directly into higher success rates.

IP rotation behaves differently on mobile than on other proxy types. Mobile networks rotate addresses as part of normal operation, as devices move between cell towers, as leases expire, and as carriers rebalance traffic. A proxy that cycles through mobile IPs is imitating behaviour the carrier itself already exhibits, which makes rotation less conspicuous than it can be on residential, where a home broadband IP suddenly changing is itself a signal worth flagging.

The tradeoff is cost. The infrastructure required to run proxies on carrier-assigned SIMs is more expensive per gigabyte than sourcing residential or datacenter IPs, and mobile proxies are generally priced accordingly.

Geographic targeting on mobile used to be carrier-only. That is no longer the case. Proxidize’s mobile pool supports city-level targeting across dozens of US locations with carrier selection as well, so you can match regional content variation without dropping down to residential.

Residential Proxies: Broad Geographic Coverage

Residential proxies source their IPs from real home internet connections assigned by ISPs. Traffic routed through a residential proxy appears to originate from an ordinary household broadband connection, which is the traffic profile most websites treat as default-trustworthy.

Residential proxy networks are typically the largest available by raw IP count, with the major providers operating pools in the millions across most countries. Country- and city-level targeting is standard among serious providers, which matters for collecting content that varies by geography. Retail pricing, product availability, language, and regional inventory all fall into this category. A product listing crawled from a single US IP reflects one slice of what the site actually serves; crawling it from IPs across 30 countries reveals the variation.

IP rotation on residential works differently from the mobile pattern. Most pools support per-request or timed rotation by assigning a fresh IP for each new connection, but the underlying addresses are static home broadband connections that do not organically rotate the way mobile networks do. For the target site, an abrupt IP change mid-session can itself read as anomalous — a real person does not typically switch from a Comcast cable connection in Ohio to a Verizon FiOS connection in Texas between pageviews. That is not a reason to avoid residential rotation, only a reason to match rotation cadence and session length to the target’s tolerance rather than set rotation aggressively by default.

Two structural properties limit residential proxies in ways worth understanding before committing to a pool. The first is IP reputation drift. Residential IPs are shared across many customers of the same provider over time, so if a previous user scraped a specific target aggressively from an IP that later rotates into your session, the address may already be flagged before you send your first request. This is why reputation-tracking anti-bot systems are more effective against lower-quality residential pools than they are against mobile — the shared-over-time nature of residential pools is a weakness that mobile’s CGNAT structure does not carry.

The second is sourcing. The provenance of residential IPs varies significantly across providers. Some are acquired through transparent opt-in SDKs and explicit consent, others through less transparent channels, and pool quality tends to correlate with how the provider obtains its IPs. This is a diligence question more than a technical one, but it matters for any crawl that needs to be defensible under legal or ethical scrutiny.

Datacenter Proxies: Lenient Targets Only

Datacenter proxies are the cheapest option by a wide margin, but they carry the highest detection risk. Their IPs come from commercial hosting ranges, and anti-bot systems know it.

Not every site on the internet runs Cloudflare, though. Government databases, academic repositories, open data portals, and many smaller sites have minimal bot detection. For these targets, datacenter proxies are fine — fast, cheap, and available in bulk. If you are crawling Common Crawl’s seed list and most of your targets are mid-tier sites without commercial anti-bot protection, a datacenter pool handles a significant portion of the work without eating into your residential or mobile budget.

Quick Comparison

| | Residential | Mobile | Datacenter |
|---|---|---|---|
| IP source | Home ISP connections | 4G/5G mobile carriers | Commercial hosting ranges |
| Detection risk | Low | Lowest (CGNAT shields IPs) | Highest |
| Cost | ~$1–12/GB (Proxidize) | From $2/GB, or unlimited data on per-proxy plans (Proxidize) | Cheapest per GB |
| Best for | Broad crawls across many domains where geographic breadth matters | High-value targets, social platforms, and any site where IP trust is the bottleneck; sustained crawls where per-proxy unlimited data beats per-GB billing | Lenient sites, bulk crawling where IP reputation is not the limiting factor |
| Geographic granularity | City-level, 190+ countries | City-level across US, carrier-selectable | Country-level |
| Rotation | Per-request or timed | Per-request or timed, 1–3s swap on Proxidize | Static or rotating |

The practical approach is to match proxy type to the distribution of difficulty in your target list, not to treat any single type as the default. For a crawl heavy on government, academic, and open-data sources, datacenter handles most of the volume and residential or mobile pick up the rest. For a crawl that leans on ecommerce, social, or publisher content, mobile earns its place as the primary choice because a large share of those targets sit behind anti-bot systems that residential alone struggles with. A mixed portfolio will use all three, often routed per target rather than per request. For a deeper look at how proxy types map to specific scraping scenarios, our guide to the best proxies for web scraping covers the decision framework in detail.

Building the Crawl Pipeline

The tools you use depend on whether you need browser rendering (for JavaScript-heavy sites) or can get away with raw HTTP requests (for static content). Most AI data collection pipelines need both.

Firecrawl: Built for LLM Pipelines

Firecrawl has become the default tool for teams that need web content converted directly into LLM-ready formats. It scrapes a URL and returns clean markdown or structured JSON, skipping the manual parsing step entirely. Its /map endpoint generates sitemaps of target domains, /crawl traverses sites to a specified depth, and /scrape extracts content from individual pages.

Version 2.8.0, released in February 2026, added parallel agent execution for running thousands of queries simultaneously with automatic failure handling. The tool is open-source under AGPL-3.0, which means you can self-host Firecrawl and retain full control of your infrastructure and data.

Self-hosting introduces the IP problem discussed earlier. When running Firecrawl on your own servers, every request originates from your machine’s static IP or your cloud provider’s datacenter range. Proxy configuration is handled at the Docker level. The simplest approach is setting environment variables in your .env file:

# .env — proxy credentials from your provider
HTTP_PROXY=http://username:password@pg.proxi.es:20000
HTTPS_PROXY=http://username:password@pg.proxi.es:20000

If you are using Firecrawl’s Playwright service for JavaScript rendering, you need to inject the proxy into the browser service specifically within your docker-compose.yaml:

services:
  playwright-service:
    image: mendableai/firecrawl-playwright-service:latest
    environment:
      - PROXY_SERVER=http://pg.proxi.es:20000
      - PROXY_USERNAME=username
      - PROXY_PASSWORD=password
      - HTTPS_PROXY=http://username:password@pg.proxi.es:20000

Crawlee: Production-Grade Crawling with Built-In Proxy Rotation

Crawlee, maintained by Apify, is a crawling framework available in both Node.js and Python. It supports multiple crawler types: CheerioCrawler for fast static HTML parsing, and PlaywrightCrawler for sites requiring full browser rendering. Proxy rotation, session management, and auto-scaling are built into the framework rather than bolted on.

In Python, configuring proxy rotation with Crawlee’s PlaywrightCrawler looks like this:

from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration

proxy_config = ProxyConfiguration(
    proxy_urls=[
        'http://user:pass@pg.proxi.es:20000',
        'http://user:pass@pg.proxi.es:20001',
    ]
)

crawler = PlaywrightCrawler(
    proxy_configuration=proxy_config,
    max_requests_per_crawl=10000,
)

Crawlee handles the rotation automatically, cycling through the provided proxy URLs across requests. You can feed it a list of thousands of target URLs and let the framework manage concurrency, retries, and proxy assignment without writing that orchestration yourself.

Scrapy: The Workhorse for Raw HTTP Crawling

For targets that do not require JavaScript rendering, Scrapy remains the fastest option. It is asynchronous, handles thousands of concurrent requests efficiently, and has a mature middleware ecosystem. Proxy integration happens through the request meta parameter:

yield scrapy.Request(
    url,
    callback=self.parse,
    meta={'proxy': 'http://user:pass@pg.proxi.es:20000'}
)

For production crawls, a custom downloader middleware that rotates proxies across requests is cleaner than embedding proxy URLs in every spider. If some of your targets require browser rendering while the rest do not, the Scrapy Playwright integration lets you handle both within the same framework.
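A rotating middleware can be this small. A sketch, assuming an illustrative proxy list baked into the class (in a real project it would come from settings or your provider's API), registered via `DOWNLOADER_MIDDLEWARES` in `settings.py`:

```python
# Sketch of a rotating-proxy downloader middleware for Scrapy. Scrapy's
# downloader reads request.meta['proxy'] before opening the connection,
# so assigning it here is all the integration needed. Proxy URLs are
# placeholders.
import random

class RotatingProxyMiddleware:
    PROXIES = [
        "http://user:pass@proxy-1.example:20000",  # placeholder endpoints
        "http://user:pass@proxy-2.example:20000",
        "http://user:pass@proxy-3.example:20000",
    ]

    def process_request(self, request, spider):
        # Pick a fresh proxy for every outgoing request.
        request.meta["proxy"] = random.choice(self.PROXIES)
```

Random choice is the simplest policy; production middlewares usually also track per-proxy failure counts and evict addresses that start returning blocks.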

Playwright Directly: When You Need Full Browser Control

Some tasks — particularly those involving single-page applications, infinite scroll, or content gated behind interaction — need direct headless browser automation. Playwright with its stealth plugins is the current standard for this in 2026. The Python playwright-stealth package patches common fingerprint detection vectors, though it is not a silver bullet against sophisticated anti-bot systems. For teams already invested in Selenium, proxy configuration follows a similar pattern through browser capabilities, but Playwright’s async API and tighter fingerprint control have made it the more common choice for new projects.

import asyncio
from playwright.async_api import async_playwright

async def fetch(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            proxy={
                'server': 'http://pg.proxi.es:20000',
                'username': 'user',
                'password': 'pass',
            }
        )
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(url)
        content = await page.content()
        await browser.close()
        return content

content = asyncio.run(fetch('https://target-site.com'))

Stealth patches address fingerprint-level detection: navigator properties, WebGL rendering, plugin enumeration. They do not solve IP reputation, TLS fingerprint analysis, or behavioral anomaly detection. Those are solved by the proxy layer and by request pacing. Combining a mobile or residential proxy with stealth patches and randomized timing between requests is what gets you through Cloudflare on the sites that matter most.
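The pacing piece is worth making concrete, because perfectly regular intervals are themselves a bot signal. A minimal jitter helper, with illustrative bounds you would tune per target:

```python
# Randomized pacing between requests. Sampling the delay from a range
# breaks the fixed rhythm that anti-bot systems flag; the base and spread
# here are illustrative starting points, not recommended values.
import random

def jittered_delay(base: float = 2.0, spread: float = 1.5) -> float:
    """Return a delay in seconds, uniform in [base, base + spread)."""
    return base + random.random() * spread
```

In a crawl loop, sleep for `jittered_delay()` between requests to the same domain; per-domain pacing matters far more than global request rate.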

For a broader look at tools in this space, our best AI web scrapers roundup compares the major options.

From Raw HTML to LLM-Ready Data

Crawling gets you raw content. Turning it into something a model can actually use is a separate pipeline.

Markdown Conversion

LLMs and embedding models work with text, not HTML. The first processing step is converting crawled pages into clean, structured text. Firecrawl does this automatically, returning markdown with preserved heading hierarchy, lists, and tables. If you are using Scrapy, Beautiful Soup, or a custom crawler, you need to strip HTML tags, remove navigation chrome, eliminate boilerplate (headers, footers, sidebars, cookie banners), and retain the semantic structure of the content.
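The shape of that stripping step can be shown with nothing but the standard library. This is a deliberately minimal sketch; real pipelines usually reach for purpose-built extractors (or let Firecrawl do it), but the task is the same: drop the chrome tags, keep the body text:

```python
# Minimal boilerplate stripper using only the standard library.
# Tags treated as chrome are an illustrative set, not a complete one.
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # >0 while inside a chrome element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)

page = ("<html><nav>Home | About</nav><h1>Title</h1>"
        "<p>Body text.</p><footer>(c) 2026</footer></html>")
# extract_text(page) keeps the heading and paragraph, drops nav and footer
```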

The heading hierarchy matters more than it might seem. When a document is later split into chunks for embedding, headers provide the context that keeps each chunk meaningful in isolation. A chunk that says “The price is $49/month” is useless without knowing which product and plan it refers to. If the heading structure is preserved, the chunking algorithm can include that context.

Deduplication

At AI crawling volumes, duplicate content is inevitable. The same article gets syndicated across dozens of sites. Boilerplate navigation and footer text repeats on every page of a domain. Near-identical product descriptions appear across thousands of retailer listings. Leaving duplicates in place skews model weights toward over-represented content and wastes storage and embedding compute.

Deduplication happens at two levels. URL-level dedup is straightforward: normalize URLs (strip tracking parameters, trailing slashes, protocol variations) and drop exact duplicates. Content-level dedup is harder. Exact-match hashing (MD5 or SHA-256 of the page body) catches verbatim copies but misses pages that differ by a single sentence or ad block. Fuzzy techniques like MinHash or SimHash generate compact signatures that detect near-duplicates efficiently at scale. Common Crawl applies its own deduplication pipeline to its monthly archives, and Meta describes using “semantic deduplication approaches” in the data curation process for Llama 3. If the largest labs consider it essential, smaller teams should too.
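The two cheap layers look like this in practice. A sketch of URL normalization plus exact-match hashing; the tracking-parameter list is illustrative, and near-duplicate detection (MinHash/SimHash) sits on top of this rather than replacing it:

```python
# URL-level and exact content-level dedup. Near-duplicates still need a
# fuzzy signature scheme; this catches the easy, high-volume cases.
import hashlib
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def normalize_url(url: str) -> str:
    p = urlparse(url)
    # Drop tracking parameters and sort the rest for a stable key;
    # collapse protocol and trailing-slash variations.
    query = sorted((k, v) for k, v in parse_qsl(p.query) if k not in TRACKING_PARAMS)
    path = p.path.rstrip("/") or "/"
    return urlunparse(("https", p.netloc.lower(), path, "", urlencode(query), ""))

def content_hash(body: str) -> str:
    # Exact-match only: one changed character yields a different digest.
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

a = normalize_url("http://Example.com/page/?utm_source=x&id=7")
b = normalize_url("https://example.com/page?id=7")
# a == b -> the two URLs collapse to a single crawl target
```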

Quality Filtering

Not everything the crawler brings back is worth keeping. A broad web crawl will inevitably collect pages that are mostly boilerplate with a few sentences of real content, auto-generated doorway pages, scraped-and-respun spam, cookie consent dialogs rendered as full pages, and content in languages your model does not need.

The simplest filters are heuristic: remove pages under a word-count threshold (say, 200 words of body text after boilerplate stripping), remove pages where the ratio of navigation to content exceeds a threshold, and remove pages that fail a language detection check. More sophisticated pipelines use classifier-based filtering — a small model trained to predict whether a page is “high quality” for the target domain. Meta used exactly this approach for Llama 3, training text-quality classifiers on Llama 2’s output to score and filter the pretraining dataset. You do not need a foundation model to build your own quality classifier; a fine-tuned distilled model scoring pages on a 1–5 scale works for most use cases.
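The heuristic tier reduces to a few lines. A sketch using the thresholds discussed above; the numbers are starting points to tune per domain, not universal constants:

```python
# Heuristic page filter: word-count floor plus navigation-to-content
# ratio cap. Both thresholds are illustrative defaults.
def passes_quality_filter(body_text: str, nav_text: str = "",
                          min_words: int = 200,
                          max_nav_ratio: float = 0.5) -> bool:
    body_words = len(body_text.split())
    nav_words = len(nav_text.split())
    if body_words < min_words:
        return False  # too little real content after boilerplate stripping
    if nav_words / (body_words + nav_words) > max_nav_ratio:
        return False  # page is mostly chrome
    return True
```

Language detection would be a third check in the same function; a classifier-based score replaces these heuristics once you have labeled examples for your domain.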

Chunking Strategy

Chunking is the process of splitting documents into segments sized for embedding and retrieval. It is also the step where most RAG pipelines quietly go wrong. Chunks that are too large dilute the relevance signal during vector search. Chunks that are too small lose context and produce incoherent retrievals. The right strategy depends on your content and query patterns, but a few principles hold broadly.

Respect document structure. Split on headings and section boundaries rather than at arbitrary character counts. A 512-token chunk that contains a complete section is far more useful than one that cuts a paragraph in half.

Overlap adjacent chunks by 10–15%. This prevents information that spans a section boundary from being lost entirely. If a key fact sits at the edge of a chunk, the overlap ensures it appears in at least one retrievable segment.

Store metadata with every chunk. At minimum: source URL, crawl date, document title, section heading, and chunk position. This metadata enables filtering during retrieval (only return chunks from the latest crawl, only return chunks from a specific domain) and is essential for citation in generated answers.
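The three principles combine into a small amount of code. A sketch that splits on pre-extracted sections, windows long ones with overlap, and attaches metadata to every chunk; token counts are approximated by whitespace words to keep it self-contained:

```python
# Structure-aware chunking: split on headings upstream, then window long
# sections with overlap and carry metadata on every chunk. Sizes are
# illustrative defaults (512-word chunks, 64-word overlap ~= 12%).
def chunk_section(words, max_words=512, overlap=64):
    """Window a list of words with fixed overlap between adjacent chunks."""
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap  # carry the tail of one chunk into the next
    return chunks

def chunk_document(sections, source_url, crawl_date, title):
    """sections: list of (heading, text) pairs already split on headings."""
    out = []
    for heading, text in sections:
        for i, body in enumerate(chunk_section(text.split())):
            out.append({
                "text": f"{title} > {heading}\n\n{body}",  # context stays in-band
                "source_url": source_url,
                "crawl_date": crawl_date,
                "section": heading,
                "chunk_index": i,
            })
    return out
```

Prefixing each chunk's text with its title and heading path is what keeps the "$49/month" chunk interpretable in isolation; the metadata dict is what makes it filterable and citable at retrieval time.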

Embedding and Vector Storage

Once chunked, each segment is passed through an embedding model that converts it into a high-dimensional vector representing its semantic meaning. These vectors go into a vector database for retrieval.

The vector database market in 2026 is mature. Qdrant delivers the lowest query latencies among purpose-built vector databases (in the 4–6ms range at the 50th percentile for 1 million vectors, depending on hardware and index configuration). Pinecone offers fully managed infrastructure with zero operational overhead. Weaviate combines vector search with knowledge graph capabilities. And for most pipelines under 5 million vectors, pgvector running inside your existing PostgreSQL instance handles the job without introducing another service.

The practical takeaway from production benchmarks is that the choice of vector database accounts for maybe 5–10% of your RAG system’s output quality. Chunking strategy, embedding model selection, and retrieval pipeline design matter far more.

Keeping Data Fresh: Continuous Crawling for RAG

RAG — retrieval-augmented generation — lets LLMs answer questions about data they were not trained on. Instead of baking knowledge into model weights, RAG retrieves relevant documents at query time and appends them to the prompt. It has become the dominant architecture for enterprise AI, with adoption accelerating sharply through 2025 and into 2026 as organizations move from proof-of-concept deployments to production systems.

Its value depends entirely on the freshness and accuracy of the documents it retrieves. A RAG pipeline built on a knowledge base that was crawled once and never updated gets worse over time without anyone noticing — until a user gets a confidently wrong answer.

Continuous crawling — re-crawling your source URLs on a schedule and updating your vector store with new and changed content — is what keeps a RAG system honest. The cadence depends on your domain. News and pricing data might need daily crawls. Product documentation and reference content might be fine weekly or biweekly. Regulatory and legal content changes less frequently, but missing an update can have serious consequences.

Each re-crawl consumes proxy bandwidth. Cutting your crawl frequency to save on proxy spend means your knowledge base drifts further from reality between updates. The right framing is not “how do I minimize proxy costs” but “what is the cost of stale data” — and then budgeting bandwidth accordingly.

Session rotation is particularly important for continuous crawling. If you are hitting the same sites on a recurring schedule, using the same IP or a small set of IPs will eventually get those addresses flagged. Rotating your proxy IP per request, or at short intervals, prevents pattern accumulation over time. Both mobile and rotating residential proxies handle this natively: each new request gets routed through a different address from the pool. Mobile has the added advantage that its rotation carries the CGNAT trust signal with every new IP, which matters when you are hitting the same domain week after week.
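The update side of continuous crawling is mostly change detection. A sketch of hash-based incremental refresh, with an in-memory dict standing in for the hash store that would normally live alongside your vector DB metadata:

```python
# Incremental refresh: keep a content hash per URL and only re-embed
# pages whose hash changed since the previous crawl. Unchanged pages get
# a crawl_date bump but skip the embedding step entirely.
import hashlib

def plan_updates(previous_hashes, crawled_pages):
    """Return (changed, unchanged) URL lists and the new hash map."""
    changed, unchanged, new_hashes = [], [], {}
    for url, body in crawled_pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        new_hashes[url] = digest
        if previous_hashes.get(url) == digest:
            unchanged.append(url)  # skip re-embedding
        else:
            changed.append(url)    # re-chunk and re-embed this page
    return changed, unchanged, new_hashes
```

Skipping unchanged pages cuts embedding compute on every cycle, but note that it does not save crawl bandwidth; you still fetched the page to discover it was unchanged, which is why conditional requests (ETag/Last-Modified) are worth layering on where targets support them.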

Bandwidth Planning and Cost Reality

At AI crawling scale, bandwidth is your dominant variable cost. Estimating it accurately before you start prevents both budget surprises and under-provisioned crawls that produce incomplete datasets.

A useful baseline: one million page requests at an average response size of 100KB consumes roughly 100GB of bandwidth. That figure covers only successful responses. Failed requests — 403s, CAPTCHAs, timeouts, and soft-block pages that return a 200 status with no usable data — still consume bandwidth on the proxy side without contributing to your dataset. If your success rate is 70%, you are effectively paying for 143GB to get 100GB of usable content. At 95%, you pay for 105GB. The proxy type you choose directly affects this ratio, which means the cheapest per-gigabyte option is not automatically the cheapest per-successful-page option.

A worked example frames the scale. A crawl of 5 million pages at 100KB each is roughly 500GB of raw bandwidth. At the low end of residential pricing — around $1/GB on high-volume plans — that is $500 before accounting for failures. At the upper end of the pay-as-you-go residential market, where pricing can reach $8–12/GB, the same crawl runs $4,000–$6,000. The per-successful-page gap between those tiers is wider than the per-GB gap alone suggests, because a larger bandwidth budget absorbs retries that would otherwise tip a crawl into incomplete territory.
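The arithmetic above generalizes to a one-line model: cost per gigabyte is only half the picture, because success rate sets how much bandwidth you buy per usable gigabyte.

```python
# Effective crawl cost: failed requests still bill bandwidth, so the
# billed volume is useful volume divided by success rate.
def crawl_cost(pages, avg_kb, price_per_gb, success_rate):
    useful_gb = pages * avg_kb / 1_000_000  # KB -> GB (decimal)
    billed_gb = useful_gb / success_rate    # failures inflate what you pay for
    return round(billed_gb * price_per_gb, 2)

# 5M pages at 100KB and $1/GB: $500 at a perfect success rate,
# ~$714 at 70% -- the failure tax shows up even at the cheap tier.
```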

Mobile proxies change the calculation in two ways. On per-GB billing, the raw cost per gigabyte is typically higher than residential, but a higher success rate on hardened targets narrows the gap in effective cost per successful page: a mobile pool achieving 95% against a site where residential achieves 70% pays for considerably less wasted bandwidth per record collected. On per-proxy billing, which is a common alternative model for mobile, you pay for an endpoint with a data allowance rather than drawing from a metered pool. For crawls that return to the same targets repeatedly, this removes bandwidth from the equation entirely — you are paying for the endpoint’s existence, not for its consumption.

Three things reduce wasted bandwidth in practice regardless of proxy type. First, filter aggressively before you crawl. Use sitemap analysis and URL pattern matching to avoid pages that will not contribute useful content, such as login pages, shopping carts, and duplicate parameter URLs. Second, cache robots.txt and DNS lookups. Redundant lookups on every request add latency and wasted bytes. Third, compress responses. Most modern web servers support gzip or brotli encoding, and requesting compressed content reduces transfer sizes by 60–80%.

Legal Considerations You Should Actually Understand

The legal landscape around web scraping for AI training is unsettled, and pretending otherwise would be irresponsible.

On one side, there is established precedent supporting the legality of scraping publicly available data. The Ninth Circuit’s 2022 ruling in hiQ Labs v. LinkedIn affirmed that scraping publicly accessible data is unlikely to violate the Computer Fraud and Abuse Act. That case was about traditional data scraping, not AI training specifically, but the principle that public data is public has held up in subsequent rulings.

On the other side, a wave of copyright lawsuits specifically targeting AI training has produced conflicting outcomes. In June 2025, two federal judges in California found that using copyrighted books to train AI models can qualify as fair use, partially ruling in favor of Anthropic (Bartz v. Anthropic) and Meta (Kadrey v. Meta). But in an earlier case, Thomson Reuters v. ROSS Intelligence, a Delaware federal court held in February 2025 that the fair use defense failed as a matter of law. Over 70 infringement lawsuits against AI companies were filed through 2025. The largest settlement announced to date is the $1.5 billion Bartz v. Anthropic agreement, which received preliminary court approval in September 2025 and is awaiting a final fairness hearing.

The practical guidance for data collection teams is straightforward, even if the law is not. Avoid scraping content that is paywalled, behind authentication, or clearly marked as not for reuse. Respect robots.txt directives, especially for sites that have explicitly opted out of AI training crawlers. Segregate any content of questionable provenance from your training data. Do not scrape personal data without a lawful basis, particularly if GDPR applies to any of your target jurisdictions. And document your data sourcing practices thoroughly — discovery in these lawsuits is increasingly focused on exactly how companies acquired their training data.

None of this is legal advice. If your AI training pipeline touches copyrighted content at scale, get a lawyer who specializes in this area. The cost of legal counsel is trivial compared to the cost of a statutory damages claim.

Bringing It Together

The crawling itself is the part most teams already know how to do. What separates a functioning pipeline from a side project that worked on five sites is keeping it running across thousands of diverse targets, at volumes that produce useful datasets, without getting blocked.

The stack has four layers. The crawling layer does the fetching: Firecrawl for LLM-ready output, Crawlee for managed browser crawling at scale, Scrapy for high-throughput HTTP requests — mixed and matched based on what each target demands. The processing layer does the cleaning: markdown conversion, deduplication, quality filtering, structured chunking, and embedding into a vector store with full metadata. The orchestration layer keeps it alive over time: scheduled re-crawls, success-rate monitoring, and automated retry logic.

Then there is the proxy layer. Mobile proxies carry the highest trust signal and the lowest block rate against commercial anti-bot systems, which makes them the natural fit for targets where IP reputation is the limiting factor. Residential proxies bring the geographic breadth that broad, many-domain crawls need. Datacenter proxies handle the long tail of lenient targets cheaply. The right mix depends on the distribution of targets in your crawl.

Most teams underinvest here and then spend weeks debugging crawl failures that were never a code problem — they were an access problem. Getting the proxy infrastructure right does not just save money on retries. It determines whether the dataset you build is good enough to train a model on or to back a RAG system.


What type of proxy is best for AI training data collection?

It depends on the mix of targets in your crawl. For broad, geographically diverse crawls across many lightly protected domains, residential is a strong general-purpose option. For target lists weighted toward social platforms, major ecommerce sites, publishers, or anything behind aggressive anti-bot systems, mobile proxies deliver the highest success rates because CGNAT IP sharing makes their addresses the hardest for sites to block without cutting off real customers. Most production pipelines use a mix across all three proxy types rather than treating any one as the default.

How much bandwidth do I need to crawl one million pages?

At an average page size of 100KB, one million pages consume approximately 100GB of bandwidth. Actual consumption depends on page complexity, whether you are rendering JavaScript (which loads additional resources), and your success rate. Failed requests consume bandwidth without yielding usable data, so a higher success rate directly reduces your effective cost per page. For sustained crawls against the same domains, per-proxy mobile plans with unlimited data can remove the per-GB calculation from the equation entirely.
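The effect of success rate on cost is worth making explicit. A small planning sketch; the retry model here, in which every failed request costs roughly one full page load and is retried until it succeeds, is a simplifying assumption:

```python
def crawl_bandwidth_gb(pages: int, avg_page_kb: float, success_rate: float) -> float:
    """Estimate total bandwidth in GB for a crawl, counting failed requests.

    Assumes failures are retried until they succeed and that a failure
    consumes about one full page load, so the effective request count
    is pages / success_rate.
    """
    requests = pages / success_rate
    return requests * avg_page_kb / 1_000_000  # KB -> GB (decimal units)

print(crawl_bandwidth_gb(1_000_000, 100, 1.0))  # -> 100.0 GB at a perfect success rate
print(crawl_bandwidth_gb(1_000_000, 100, 0.8))  # ~125 GB at an 80% success rate
```

The gap between those two numbers is pure waste, which is why block rate, not per-GB price alone, drives the real cost comparison between proxy types.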

Is it legal to scrape the web for AI training data?

The legality depends on what you scrape, how you use it, and which jurisdiction applies. Scraping publicly available, non-copyrighted data is generally supported by US precedent (hiQ v. LinkedIn). Scraping copyrighted content for model training is the subject of active litigation with conflicting court rulings, including the Thomson Reuters v. ROSS decision against fair use and the Bartz v. Anthropic and Kadrey v. Meta rulings finding fair use in specific circumstances. In the EU, GDPR applies to any personal data regardless of whether it is publicly visible. Respect robots.txt, avoid paywalled and authenticated content, and consult legal counsel for your specific use case.

How do I keep my RAG knowledge base from going stale?

Schedule recurring crawls of your source URLs and re-index the content in your vector database. Include a crawl_date field in your chunk metadata so you can filter queries to only return results from the latest crawl. The appropriate crawl frequency depends on how quickly your source content changes: daily for news and pricing, weekly for product documentation, monthly for reference material that rarely updates.
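The crawl_date filter itself can be as simple as keeping only chunks from the most recent crawl. A minimal sketch over plain dictionaries (the chunk layout is illustrative, not a specific vector-store API; real stores expose equivalent metadata filters at query time):

```python
# Chunks as they might sit in a vector store, each tagged with the
# date of the crawl that produced it (ISO dates sort lexically).
chunks = [
    {"text": "Old pricing page", "metadata": {"crawl_date": "2026-02-01"}},
    {"text": "New pricing page", "metadata": {"crawl_date": "2026-03-01"}},
    {"text": "Docs page",        "metadata": {"crawl_date": "2026-03-01"}},
]

# Find the most recent crawl and keep only its chunks.
latest = max(c["metadata"]["crawl_date"] for c in chunks)
fresh = [c for c in chunks if c["metadata"]["crawl_date"] == latest]

print(latest)      # 2026-03-01
print(len(fresh))  # 2
```

Applying the same predicate as a metadata filter inside the vector query, rather than post-filtering results, keeps stale chunks from crowding out fresh ones in the top-k retrieval.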

Can I use Firecrawl with proxies for self-hosted deployments?

Yes. Proxy configuration for self-hosted Firecrawl happens at the Docker level. Set HTTP_PROXY and HTTPS_PROXY environment variables in your .env file, or inject proxy credentials directly into the Playwright service configuration in docker-compose.yaml. Without proxies, self-hosted Firecrawl will use your server’s static IP, which gets flagged quickly on any site with anti-bot protection.
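A sketch of the .env approach; the proxy host, port, and credentials below are placeholders, and exact variable handling may vary between Firecrawl versions, so check the self-hosting docs for your release:

```shell
# .env for a self-hosted Firecrawl deployment.
# Replace host, port, and credentials with your own proxy details.
HTTP_PROXY=http://user:pass@proxy.example.com:8080
HTTPS_PROXY=http://user:pass@proxy.example.com:8080
```

Docker Compose picks these up for the services that read them, routing outbound fetches through the proxy instead of the server's static IP.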
