Crawl4AI is an open-source Python framework built for one job: turning websites into clean, structured data that AI models can actually use. It takes raw HTML, strips the noise, and outputs Markdown or JSON that feeds directly into LLM pipelines, RAG systems, and downstream automation without the usual cleanup overhead.
With over 66,000 GitHub stars and an Apache 2.0 license, it’s one of the most widely adopted tools in the AI data collection space. Unlike managed scraping APIs that charge per request, Crawl4AI runs locally on your infrastructure. You control the extraction logic, the browser behavior, and the proxy configuration.
That last point matters more than most guides acknowledge. Crawl4AI works fine against a handful of pages without any proxy. But production-volume crawling hits the same wall every scraper hits: rate limiting, IP bans, and anti-bot systems that flag automated traffic before your first batch finishes. Crawl4AI has built-in proxy support and rotation strategies specifically because the tool doesn’t function at scale without them.
This guide covers installation, extraction strategies, and proxy configuration with working code throughout.
Installation and Setup
Crawl4AI requires Python 3.9 or higher. Install it and run the setup command to pull browser dependencies:
pip install -U crawl4ai
crawl4ai-setup

The setup command installs Playwright’s Chromium browser, which Crawl4AI uses under the hood for rendering JavaScript-heavy pages. If you run into browser issues:
python -m playwright install --with-deps chromium

Run crawl4ai-doctor to verify your installation. It checks Python version, browser binaries, and core dependencies. All green means you’re ready.
Your First Crawl
The simplest Crawl4AI script:
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business"
        )
        print(result.markdown)

asyncio.run(main())

AsyncWebCrawler launches a headless Chromium instance, loads the page, waits for JavaScript to render, and returns a result object. result.markdown contains the page content converted to clean Markdown. With a content filter configured (Crawl4AI includes a BM25-based one), the output can also drop navigation, footers, ads, and other boilerplate that would waste tokens in an LLM context window.
The result object also exposes result.html for raw HTML, result.links for extracted links, result.media for images and video, and result.metadata for page-level information like title and description.
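As a rough illustration of how those fields might be consumed downstream, the sketch below builds a small summary from a result-like object. It assumes result.links is a dict with "internal" and "external" lists and that result.metadata is a dict. The summarize_result helper and the stand-in object are hypothetical names, used so the snippet runs without a live crawl.

```python
from types import SimpleNamespace

def summarize_result(result):
    """Return a small summary dict from a crawl-result-like object."""
    links = result.links or {}
    metadata = result.metadata or {}
    return {
        "title": metadata.get("title"),
        "markdown_chars": len(str(result.markdown or "")),
        "internal_links": len(links.get("internal", [])),
        "external_links": len(links.get("external", [])),
    }

# Stand-in for a real result so the example runs offline (made-up data).
fake_result = SimpleNamespace(
    markdown="# Business\n\nMarkets rallied today.",
    links={"internal": [{"href": "/markets"}], "external": []},
    metadata={"title": "Business News"},
)

summary = summarize_result(fake_result)
print(summary)
```

A summary like this is handy for logging one line per crawled page before the Markdown goes into a pipeline.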
For quick one-off jobs, the CLI works:
crwl https://example.com -o markdown
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

For production pipelines, stick with the Python API. That’s where you get extraction strategies, proxy rotation, and session management.
Extraction Strategies
Crawl4AI supports three extraction approaches, each suited to different situations.
CSS and XPath Schema Extraction
When you know the page structure, CSS and XPath selectors are the fastest and most reliable method. Define a schema mapping selectors to field names, and Crawl4AI returns structured JSON:
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Each element matching baseSelector becomes one JSON object.
schema = {
    "name": "Products",
    "baseSelector": "div.product-card",
    "fields": [
        {"name": "title", "selector": "h2.product-title", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
        {"name": "url", "selector": "a.product-link", "type": "attribute", "attribute": "href"},
    ],
}

run_config = CrawlerRunConfig(
    extraction_strategy=JsonCssExtractionStrategy(schema)
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/products", config=run_config)
        # extracted_content is a JSON string, so parse it before use.
        products = json.loads(result.extracted_content)
        print(products)

asyncio.run(main())

This works well for e-commerce sites, directories, and anything with a repeating layout. It’s also the cheapest extraction method since it doesn’t require an LLM API call.
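Text fields come back exactly as they appear on the page, so a small clean-up pass is usually the next step. The following is a minimal sketch of that step, assuming the extracted content is a JSON array of objects with the title, price, and url fields from the schema above. The clean_products helper and the sample data are hypothetical.

```python
import json
import re
from urllib.parse import urljoin

def clean_products(extracted_content, base_url):
    """Parse the JSON string and normalize price and URL fields."""
    cleaned = []
    for item in json.loads(extracted_content):
        # Pull the first number out of text such as "$1,299.00".
        match = re.search(r"\d[\d,]*(?:\.\d+)?", item.get("price") or "")
        price = float(match.group().replace(",", "")) if match else None
        cleaned.append({
            "title": (item.get("title") or "").strip(),
            "price": price,
            # Resolve relative hrefs against the page URL.
            "url": urljoin(base_url, item.get("url") or ""),
        })
    return cleaned

# Made-up sample standing in for result.extracted_content.
sample = json.dumps([
    {"title": " Laptop ", "price": "$1,299.00", "url": "/p/laptop"},
    {"title": "Cable", "price": "Call for price", "url": "https://example.com/p/cable"},
])

products = clean_products(sample, "https://example.com/products")
print(products)
```

Prices that contain no digits are kept as None rather than guessed, which makes gaps easy to spot later.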
LLM-Based Extraction
For pages with inconsistent or complex layouts, Crawl4AI can hand the content to an LLM and extract structured data based on a natural-language prompt. It supports OpenAI, Anthropic, DeepSeek, Groq, and any provider compatible with the LiteLLM interface.
from crawl4ai import CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

llm_config = LLMConfig(
    provider="openai/gpt-4o",
    api_token="env:OPENAI_API_KEY"
)

extraction = LLMExtractionStrategy(
    llm_config=llm_config,
    instruction="Extract all company names, their funding amounts, and founding years from this page."
)

run_config = CrawlerRunConfig(extraction_strategy=extraction)

The LLM receives cleaned Markdown rather than raw HTML, which cuts token usage and improves accuracy. It costs more per page than CSS selectors, but it handles messy layouts that would otherwise require constant selector patching. If a site rearranges its product cards every other week, LLM extraction keeps working while your CSS schema breaks.
Cosine Similarity and BM25 Filtering
Crawl4AI’s content filtering can use BM25 scoring to identify the blocks on a page that are most relevant to a query. You can also apply cosine similarity filtering to extract only content semantically related to a specific query, which is useful when collecting AI training data where relevance matters more than completeness. Both run locally with no API calls, keeping them viable for high-volume crawls where per-page LLM costs would add up fast.
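To give a feel for how BM25 ranks page blocks against a query, here is a small standalone sketch of the standard Okapi BM25 formula. It illustrates the scoring idea only and is not Crawl4AI’s internal implementation. The tokenizer, the k1 and b values (common defaults), and the sample blocks are all assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(blocks, query, k1=1.5, b=0.75):
    """Score each text block against the query with Okapi BM25."""
    docs = [tokenize(block) for block in blocks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many blocks contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up page blocks: navigation, body text, footer.
blocks = [
    "Home About Contact Subscribe Login",
    "Mobile proxies route traffic through carrier networks and rotate IP addresses.",
    "Copyright Example Corp. All rights reserved.",
]

scores = bm25_scores(blocks, "mobile proxies IP rotation")
best = blocks[scores.index(max(scores))]
print(best)
```

Blocks that share no terms with the query score zero, which is why navigation and footer text tends to fall away once a threshold is applied.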
Configuring Proxies
This is where most Crawl4AI deployments either work or fall apart. A single-page test against a cooperative site needs no proxy. A production crawl hitting hundreds of pages across multiple domains will get blocked without one.
The reasons are the same ones covered in any web scraping proxy guide: sites enforce rate limits per IP, anti-bot systems flag non-human traffic patterns, and your crawler’s single IP becomes a liability at any meaningful volume.
Crawl4AI handles proxy configuration through ProxyConfig in CrawlerRunConfig, giving you per-crawl control over routing.
Basic Setup
The simplest configuration:
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, ProxyConfig

run_config = CrawlerRunConfig(
    proxy_config=ProxyConfig(
        server="http://proxy.example.com:8080"
    )
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://target-site.com", config=run_config)
        print(result.markdown)

asyncio.run(main())

You can also pass the proxy as a string or dictionary:
# String format
run_config = CrawlerRunConfig(proxy_config="http://proxy.example.com:8080")
# Dictionary format
run_config = CrawlerRunConfig(proxy_config={"server": "http://proxy.example.com:8080"})

Authenticated Proxies
Most commercial proxy services require credentials:
run_config = CrawlerRunConfig(
    proxy_config=ProxyConfig(
        server="http://proxy.example.com:8080",
        username="your_username",
        password="your_password",
    )
)

ProxyConfig.from_string() handles inline credential formats too:
proxy = ProxyConfig.from_string("http://user:pass@proxy.example.com:8080")
run_config = CrawlerRunConfig(proxy_config=proxy)

Supported Formats
ProxyConfig.from_string() accepts:
| Format | Example |
|---|---|
| HTTP | http://user:pass@proxy.example.com:8080 |
| HTTPS | https://proxy.example.com:8080 |
| SOCKS5 | socks5://proxy.example.com:1080 |
| IP:port | 192.168.1.1:8080 |
| IP:port:user:pass | 192.168.1.1:8080:user:pass |
One caveat: Playwright does not support SOCKS5 proxies with authentication. If your provider requires credentials, use HTTP. This is a Playwright limitation, not a Crawl4AI one.
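If your proxy list arrives in mixed formats, it can help to normalize and sanity-check the entries before handing them to the crawler. The sketch below is a standalone illustration of parsing the formats in the table and flagging the SOCKS5-with-credentials case. It is not how ProxyConfig.from_string() is implemented. The parse_proxy and playwright_compatible helpers are hypothetical, and defaulting bare IP:port entries to http is an assumption.

```python
from urllib.parse import urlsplit

def parse_proxy(value):
    """Normalize a proxy string into server/username/password parts."""
    if "://" in value:
        parts = urlsplit(value)
        server = f"{parts.scheme}://{parts.hostname}:{parts.port}"
        return {"server": server, "username": parts.username, "password": parts.password}
    fields = value.split(":")
    if len(fields) == 2:
        host, port = fields
        # Assumption: bare host:port entries are plain HTTP proxies.
        return {"server": f"http://{host}:{port}", "username": None, "password": None}
    if len(fields) == 4:
        host, port, user, password = fields
        return {"server": f"http://{host}:{port}", "username": user, "password": password}
    raise ValueError(f"Unrecognized proxy format: {value!r}")

def playwright_compatible(proxy):
    """Return False for SOCKS5 proxies that carry credentials."""
    return not (proxy["server"].startswith("socks5://") and proxy["username"])

authed = parse_proxy("192.168.1.1:8080:user:pass")
socks = parse_proxy("socks5://user:pass@proxy.example.com:1080")
print(authed)
print(playwright_compatible(socks))
```

Running a check like this at startup surfaces an unusable entry immediately instead of partway through a crawl.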
Rotation
A single proxy IP gets flagged eventually, regardless of quality. Rotation distributes requests across multiple addresses so no individual IP draws enough attention to trigger blocks.
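The round-robin idea itself is simple enough to show with the standard library. This is a minimal sketch of the concept, not Crawl4AI’s rotation code, and the proxy addresses and URLs are placeholders.

```python
from itertools import cycle

proxies = [
    "http://proxy.example.com:8001",
    "http://proxy.example.com:8002",
    "http://proxy.example.com:8003",
]
urls = [f"https://example.com/page/{n}" for n in range(1, 6)]

# cycle() yields the proxies in order and wraps around at the end.
rotation = cycle(proxies)
assignments = [(url, next(rotation)) for url in urls]

for url, proxy in assignments:
    print(url, "->", proxy)
```

With three proxies and five URLs, the fourth request wraps back to the first proxy, so each address carries roughly a third of the traffic.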
Crawl4AI ships with RoundRobinProxyStrategy:
from crawl4ai import CrawlerRunConfig, ProxyConfig
from crawl4ai.proxy_strategy import RoundRobinProxyStrategy

proxies = [
    ProxyConfig(server="http://proxy.example.com:8001", username="user", password="pass"),
    ProxyConfig(server="http://proxy.example.com:8002", username="user", password="pass"),
    ProxyConfig(server="http://proxy.example.com:8003", username="user", password="pass"),
]
proxy_strategy = RoundRobinProxyStrategy(proxies)
run_config = CrawlerRunConfig(proxy_rotation_strategy=proxy_strategy)

For larger pools, load from environment variables:
import os

os.environ["PROXIES"] = "ip1:port1:user1:pass1,ip2:port2:user2:pass2,ip3:port3:user3:pass3"
proxies = ProxyConfig.from_env()
proxy_strategy = RoundRobinProxyStrategy(proxies)
run_config = CrawlerRunConfig(proxy_rotation_strategy=proxy_strategy)

Choosing the Right Proxy Type
The type of proxy matters as much as whether you use one.
| Proxy Type | IP Source | Trust Level | Best For |
|---|---|---|---|
| Datacenter | Hosting providers | Low | Low-protection targets, speed-priority jobs |
| Residential | Real ISP connections | Medium-High | General-purpose scraping, most sites |
| Mobile (4G/5G) | Carrier networks via CGNAT | Highest | Sites with aggressive bot protection |
Datacenter proxies are cheap and fast, but their IPs are registered to hosting companies. Any competent anti-bot system (Cloudflare, DataDome, Akamai) flags them on sight.
Residential proxies use IPs assigned by real ISPs to home internet connections, giving them higher trust scores with target sites. They handle most scraping targets without issues and are the solid middle-ground choice.
Mobile proxies are the hardest for anti-bot systems to act against. They use real 4G and 5G carrier IPs shared among thousands of legitimate users through CGNAT (Carrier-Grade NAT). When a site sees traffic from a mobile IP, blocking it risks cutting off every real user sharing that address. That trade-off is what makes them effective against even the most aggressive protection layers, and the best option for Crawl4AI deployments hitting well-defended sites.
Anti-Bot Detection and Proxy Escalation
Crawl4AI v0.8.5 introduced a multi-tier fallback system that automatically escalates when requests get blocked.
After each crawl attempt, Crawl4AI inspects the response for known anti-bot signals: Cloudflare challenge pages, 403/429 status codes, firewall blocks from Imperva, Sucuri, and similar services. If blocking is detected, escalation begins.
The first tier retries through your proxy list in order. The recommendation is to sort proxies cheapest-first: datacenter before residential, residential before mobile. Your most expensive IPs only fire when cheaper ones have already failed.
The second tier repeats the full proxy rotation for additional rounds, controlled by max_retries in CrawlerRunConfig. For a list of three proxies with max_retries=2, that’s nine total attempts before the system gives up or moves on.
The third tier is a fallback function: a custom async function you provide that receives the URL and returns raw HTML as a last resort. You might use it to hit an external scraping API, pull from cache, or try an alternative source entirely.
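The three tiers can be sketched as a small simulation. This follows the behavior as described above rather than Crawl4AI’s actual code: crawl_with_escalation, the fetch and fallback callables, and the COST_RANK ordering are all hypothetical, and a fetch returning None stands in for a blocked request.

```python
import asyncio

# Hypothetical cost ranking used to sort the pool cheapest-first.
COST_RANK = {"datacenter": 0, "residential": 1, "mobile": 2}

async def crawl_with_escalation(url, proxies, fetch, max_retries=0, fallback=None):
    """Try every proxy in order, repeat for extra rounds, then fall back."""
    attempts = 0
    for _round in range(1 + max_retries):  # first pass plus retry rounds
        for _kind, proxy in proxies:
            attempts += 1
            html = await fetch(url, proxy)
            if html is not None:  # None stands in for "blocked"
                return html, attempts
    if fallback is not None:
        return await fallback(url), attempts
    return None, attempts

async def always_blocked(url, proxy):
    return None

async def cached_copy(url):
    return "<html>cached</html>"

pool = [
    ("mobile", "http://proxy.example.com:8003"),
    ("datacenter", "http://proxy.example.com:8001"),
    ("residential", "http://proxy.example.com:8002"),
]
pool.sort(key=lambda entry: COST_RANK[entry[0]])

html, attempts = asyncio.run(
    crawl_with_escalation(
        "https://example.com", pool, always_blocked, max_retries=2, fallback=cached_copy
    )
)
print(attempts, html)
```

With three proxies and max_retries=2, every proxy is tried in three rounds, which gives the nine attempts mentioned above before the fallback runs.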
Combined with Crawl4AI’s other anti-detection features (user-agent randomization, viewport variation, Shadow DOM flattening), the escalation system means fewer dead requests and more complete datasets. If the sites you’re targeting also deploy CAPTCHAs, you’ll need a CAPTCHA solving service on top of this. Proxies handle IP-based blocks. CAPTCHA solvers handle the challenge layer. Different problems, different tools.
Production Configuration
A few settings make a noticeable difference at scale.
Speed and Resource Usage
If you only need text content, skip everything else:
from crawl4ai import BrowserConfig

browser_config = BrowserConfig(
    text_mode=True,
    avoid_ads=True,
    avoid_css=True
)

text_mode disables image loading. Stripping ads and CSS cuts additional bandwidth. The per-page savings are small, but they compound fast across thousands of pages.
Handling Dynamic Content
Sites that load content after the initial page render need special handling:
run_config = CrawlerRunConfig(
    wait_for="css:div.content-loaded",
    scan_full_page=True,
    page_timeout=30000
)

scan_full_page scrolls the entire page to trigger lazy-loaded content. Without it, you’ll miss anything below the fold. If you’ve ever looked at scraped output and wondered why half the data was missing, this was probably the reason.
Session and Identity Management
For sites that require login or maintain state, persistent context keeps cookies and session data between crawl runs:
browser_config = BrowserConfig(
    use_persistent_context=True,
    user_data_dir="./browser-data"
)

Pair this with identity settings that match your proxy’s geography:
browser_config = BrowserConfig(
    user_agent_mode="random",
)
run_config = CrawlerRunConfig(
    locale="en-US",
    timezone_id="America/New_York",
)

A US-based mobile proxy reporting ja-JP as the browser locale is the kind of mismatch that anti-bot systems pick up on immediately. Small detail, easy to overlook, but it’s the difference between a crawler that works and one that gets flagged on the second request.
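One way to keep identity and geography aligned is to derive the locale and timezone from the proxy’s country instead of hard-coding them. The sketch below assumes you know each proxy’s country code. GEO_PROFILES, identity_for_proxy, and is_consistent are hypothetical names, and using a single timezone per country is a simplification.

```python
# Hypothetical lookup table; extend it with the countries your proxies use.
GEO_PROFILES = {
    "US": {"locale": "en-US", "timezone_id": "America/New_York"},
    "GB": {"locale": "en-GB", "timezone_id": "Europe/London"},
    "JP": {"locale": "ja-JP", "timezone_id": "Asia/Tokyo"},
}

def identity_for_proxy(country_code):
    """Return locale/timezone settings matching the proxy's country."""
    try:
        return dict(GEO_PROFILES[country_code.upper()])
    except KeyError:
        raise ValueError(f"No identity profile for country {country_code!r}") from None

def is_consistent(country_code, locale, timezone_id):
    """Check whether a locale/timezone pair matches the proxy's country."""
    profile = GEO_PROFILES.get(country_code.upper())
    return profile == {"locale": locale, "timezone_id": timezone_id}

identity = identity_for_proxy("us")
print(identity)
print(is_consistent("US", "ja-JP", "Asia/Tokyo"))
```

Because the keys match the locale and timezone_id parameters shown above, the returned dict could be unpacked straight into CrawlerRunConfig(**identity).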
Crawl4AI vs Firecrawl
Both convert web content into LLM-ready formats, but the architecture is fundamentally different.
Crawl4AI is local-first. You run it on your own machine, manage browser instances and proxy configuration yourself, and pay nothing per request. The tradeoff is infrastructure overhead. Firecrawl is API-first, designed around a hosted service where you send URLs and get structured data back. You can self-host it, but the design centers on a centralized API with Docker.
If you need granular control over browser behavior, proxy rotation, and extraction logic, Crawl4AI is the better fit. If you want a managed pipeline without the infrastructure burden, Firecrawl is worth evaluating. For a broader look at the space, see the best AI web scrapers roundup.
Frequently Asked Questions
Is Crawl4AI free?
Yes. Open-source under Apache 2.0, usable commercially without licensing fees. You’ll pay for LLM API calls if you use LLM-based extraction, and you’ll need a proxy service for production-scale crawling. The team also offers a cloud service for users who’d rather not self-host.
What proxies work best with Crawl4AI?
Mobile proxies deliver the highest success rates. Their IPs are shared among thousands of real users via CGNAT, making them nearly impossible to block without collateral damage to legitimate traffic. Residential proxies are a strong second choice. Datacenter proxies work on low-protection targets but fail against modern anti-bot systems like Cloudflare and DataDome.
Can Crawl4AI handle JavaScript-rendered pages?
Yes. Playwright’s Chromium browser renders JavaScript fully before extraction. Use wait_for to hold for dynamic content and scan_full_page to trigger lazy-loaded elements.
Does Crawl4AI support proxy rotation?
RoundRobinProxyStrategy cycles through a proxy list, and ProxyConfig.from_env() loads lists from environment variables. Custom rotation strategies are also supported if the built-in options don’t fit your setup.
How does Crawl4AI compare to Scrapy?
Scrapy is optimized for speed with plain HTTP requests. Crawl4AI uses a headless browser, which is slower per page but handles JavaScript rendering and anti-bot challenges that Scrapy can’t without additional middleware. Static HTML pages favor Scrapy. JavaScript-heavy targets or LLM-ready output requirements favor Crawl4AI.
Is Crawl4AI good for large-scale scraping?
It handles large workloads, but browser resources are the bottleneck. Each concurrent crawl runs a browser context that uses more CPU and memory than a plain HTTP request. For scale, use Docker with multiple worker instances, enable text_mode to reduce resource consumption, and pair it with reliable proxy rotation so you’re not burning compute on requests that were going to get blocked anyway.
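Since browser contexts are the bottleneck, capping how many crawls run at once is the usual first lever. Here is a generic asyncio sketch of that cap. It is not a Crawl4AI API: crawl_all, the fake_crawl stand-in, and the limit of 3 are assumptions chosen for illustration.

```python
import asyncio

async def crawl_all(urls, crawl_one, max_concurrent=3):
    """Run crawl_one over urls with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:
            return await crawl_one(url)

    return await asyncio.gather(*(bounded(url) for url in urls))

# Stand-in crawl function that records peak concurrency.
state = {"active": 0, "peak": 0}

async def fake_crawl(url):
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)  # simulate page load time
    state["active"] -= 1
    return f"content of {url}"

urls = [f"https://example.com/page/{n}" for n in range(10)]
results = asyncio.run(crawl_all(urls, fake_crawl, max_concurrent=3))
print(len(results), state["peak"])
```

In a real deployment, crawl_one would wrap a call to the crawler, and the limit would be tuned to the CPU and memory available per worker.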



