Should I go with a cloud or self-hosted solution? The question of convenience vs control comes up in every developer Slack channel out there. As developers, we love having control over the things we build, but sometimes the associated cost and overhead for the team isn’t worth it. One side argues that you should pay the cloud to save time; the other claims that open-source self-hosting is the only way to scale without going bankrupt.
Let’s take Firecrawl as an example. A market intelligence agent needs to scrape thousands of websites to provide high-quality data. The discussion naturally turns to whether a Firecrawl self host would be a more reasonable solution than paying for the Firecrawl API.
I would argue that, beyond budget, there are always “hidden costs” that we as developers don’t take into account. We just love building, what can I say? Let’s put each option side by side. If you’re impatient, you can skip straight to the proxy integration.
The Firecrawl API is a great choice. It handles everything for you: you tell it what you want, and the data comes back clean and ready, with nothing else to worry about. You might start by scraping 100 pages today, but say the business thrives and you suddenly need 100,000 pages. The bill grows larger every month, and every month you postpone the dream of buying your own Porsche.
By contrast, Firecrawl self host is a good choice too. The code runs on your own servers, which is more secure, saves you money, and, as a developer, lets you modify it. Happy days! Then you deploy it, and soon enough you hit a wall. Your logs turn red with 403 Forbidden errors. The site works locally on your machine, but your DigitalOcean or AWS server is instantly blocked by Cloudflare.
The consequences of either choice become more obvious as you scale and scrape more data. Sticking with the cloud gets too expensive, and self-hosting reveals a brutal truth: the system works great, but your IP address is getting flagged as a bot or spam. It’s like having a Porsche engine you’re forced to drive at 20 mph.
Every project more complicated than a “Hello World” scrape explodes into conflicting opinions about rotating IPs, residential proxies, and avoiding CAPTCHAs. Another day in the life of a developer trying to scrape. It’s so easy to switch back to the cloud and pay up.
That’s why we are going to break the solution down practically. We will show you how to keep the cost savings of your Firecrawl self host and prevent websites from blocking you by using proxies. Here’s to building the ultimate AI data pipelines without getting banned.

What is Firecrawl?
To understand why Firecrawl is different, we first have to understand how traditional data extraction works. When you start web scraping, you usually get very messy HTML with a lot of <div> tags, navigation bars, and other nonsense. You have to clean it up with Python or JavaScript and hand clean JSON to the LLM to avoid wasting valuable tokens and adding extra cost.
Built to solve this kind of problem, Firecrawl scrapes websites and gives you clean JSON or markdown ready to be used in any LLM. This saves time and effort, as you don’t have to see extra data that you won’t use.
The “Web-to-LLM” Converter
Firecrawl acts as the middleman between the data and your AI model. You see, back in the old days you had to write custom selector code for every website you visited (crazy, I know). Nowadays, AI web scrapers do that automatically for you. Firecrawl scrapes any website you want and produces a standardized format (like llms.txt), a file that AI agents can read and understand.
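To make that concrete, the llms.txt proposal (llmstxt.org) describes a small markdown file served from a site’s root. A minimal example, with hypothetical names and URLs, looks roughly like this:

```markdown
# Example Project

> One-sentence summary of what this site or product does.

## Docs

- [Quickstart](https://example.com/docs/quickstart): Get running in five minutes
- [API Reference](https://example.com/docs/api): Endpoints and parameters

## Optional

- [Blog](https://example.com/blog): Longer-form articles
```

The H1 title and blockquote summary give an agent instant context, and the link lists point it at the pages worth reading in full.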
Beyond Basic Scraping: Maps & RAG
As we said, most scrapers are simple: they grab the HTML of a website of your choice and hand it to you. But we live in a fast-paced world now, so every second counts. Firecrawl’s AI agents go further and piece together the whole website you target, which is where it shines as a deep research tool.
The /map Endpoint (AI Sitemap Generator)
Before your AI can read any website, it needs to know how the website is structured and what pages exist. Normally every website has a file called sitemap.xml, but these files are often outdated or incomplete. The Firecrawl /map endpoint solves this problem by acting as a detective that discovers the pages of your target website.
- Why AI Needs It: In a nutshell, it gives you a good overview of the website before you start. Let’s say you’re building a customer support bot. You don’t want to scrape just the homepage; you need every piece of information you can find: FAQs, the help center, API documentation. The /map endpoint returns a clean list of every URL, which you can filter before wasting any scraping credits.
- The Hidden Challenge: Mapping a website means sending many HEAD requests within seconds. This behaviour is not normal browsing, so the chance of being flagged is high. That’s why you should have the option to rotate proxies or IPs to prevent blocking.
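The filtering step is just plain list work once the map comes back. A minimal sketch, with a hypothetical list of mapped URLs standing in for a real /map response:

```python
# Hypothetical URL list, as a /map call might return for a support-bot project
mapped_urls = [
    "https://example.com/",
    "https://example.com/docs/api",
    "https://example.com/help/faq",
    "https://example.com/blog/holiday-party",
    "https://example.com/help/getting-started",
]

# Keep only the sections a support bot actually needs,
# so no scraping credits are spent on blog posts or the homepage
KEEP_PREFIXES = ("https://example.com/docs/", "https://example.com/help/")

urls_to_scrape = [u for u in mapped_urls if u.startswith(KEEP_PREFIXES)]
```

Only the filtered list ever reaches the scraper, which is where the credit savings come from.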
Powering RAG Pipelines
RAG, or Retrieval-Augmented Generation, is the architecture that allows LLMs to “know” about private data. Firecrawl is purpose-built for this.
- Clean Data In, Answers Out: The best thing about Firecrawl is that it doesn’t just clean text; it returns clean markdown that preserves hierarchy (headers, lists, tables) and can be used with any LLM. This is very important for vector databases (like Pinecone, Weaviate, and MongoDB) because it keeps related information together.
- llms.txt Support: If you really care about your SEO and want to increase the likelihood of your website being recommended by AI chatbots, having an llms.txt file is important. It helps AI chatbots understand your website as efficiently as possible.
Deep Research and Batch Scraping
For agents that need to “browse” the web to answer a simple (or not very simple) question such as “Find all pricing plans for these 5 competitors”, Firecrawl offers deep research capabilities.
- The Crawl Endpoint: Unlike the scrape endpoint which hits one page, Firecrawl provides another endpoint called crawl. It traverses a site (BFS/DFS) to a specified depth.
- Batch Operations: You can submit a Firecrawl batch scrape job to process hundreds of URLs at the same time.
- The Trap: This is where the Firecrawl self host crowd gets banned the fastest. Deep crawling produces unusual traffic spikes that trigger rate limits right away. Without high-quality IPs masking this activity, your “deep research” agent will likely get banned after a couple of pages.
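As a sketch, the request bodies for a depth-limited crawl and a batch scrape might look like the following. The URLs and limits are hypothetical; the `maxDepth`, `limit`, `urls`, and `formats` fields follow the Firecrawl API conventions, but check the current docs for your version:

```python
import json

# Hypothetical depth-limited crawl job: start at one URL, follow links
crawl_job = {
    "url": "https://example.com",
    "maxDepth": 2,   # stop after two levels of links
    "limit": 100,    # cap the total number of pages
}

# Hypothetical batch scrape job: many known URLs in a single request
batch_job = {
    "urls": [f"https://example.com/page/{i}" for i in range(1, 4)],
    "formats": ["markdown"],
}

# Either body would be POSTed as JSON to the corresponding endpoint
crawl_body = json.dumps(crawl_job)
```

Capping depth and page count keeps a crawl’s traffic pattern closer to normal browsing, which matters even more once proxies are in play.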

The Hidden Problem with Firecrawl Self Hosting
Self-hosting is great. You deploy your own version of the code and you scrape whatever you want with it. You do whatever you want with the data. In other words, you are in control, with no middleman required. However, there is a catch. When you switch from cloud to self-host you gain a lot, but you lose something as well: the invisible infrastructure that makes scraping possible in the first place, i.e. IP rotation.
The “Localhost” IP Trap
When you self-host Firecrawl locally or deploy it via Docker, every single request comes from your machine’s static IP address. It’s like calling someone a million times from one number (you will get blocked eventually).
For its cloud version, Firecrawl has agreements with proxy providers (like Proxidize) that give it access to a large pool of IPs. Firecrawl uses those proxies while scraping or crawling websites, which makes it unlikely to get caught because it isn’t using the same IP for every request. Switching between IP addresses is the only way to do large-scale web scraping.
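The rotation idea itself is simple. A minimal sketch, assuming a hypothetical pool of gateways from your proxy provider, cycles through them so consecutive requests never share an IP:

```python
from itertools import cycle

# Hypothetical proxy gateways from your provider
PROXY_POOL = [
    "http://user123:pass123@gw1.proxy-provider.com:8080",
    "http://user123:pass123@gw2.proxy-provider.com:8080",
    "http://user123:pass123@gw3.proxy-provider.com:8080",
]

rotation = cycle(PROXY_POOL)

def proxies_for_next_request() -> dict:
    """Return a requests-style proxies dict, advancing the rotation."""
    proxy = next(rotation)
    return {"http": proxy, "https": proxy}
```

Each outgoing call would then be made as `requests.get(url, proxies=proxies_for_next_request())`, so no two consecutive requests leave through the same gateway.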
Why Cloudflare Hates Your VPS
Most developers deploy their self-hosted instances on cloud providers such as AWS, DigitalOcean, or Hetzner because it’s cheap and scalable. But something they sometimes forget to take into account is IP reputation. Modern anti-bot systems can see the reputation of the incoming IP addresses. If you have a bad one, you are in trouble:
- The Datacenter Flag: Let’s be clear here. Websites know that real people don’t doomscroll or tweet from AWS servers, so that’s already a red flag.
- The ASN Block: Security systems look at the ASN (Autonomous System Number) of your IP. If it belongs to a hosting provider like AWS or Google Cloud, for example, it is automatically flagged as “non-human” traffic.
Even if your scraper is the best, most efficient code in the world, you will probably hit 403 Forbidden or 502 errors before you even load the HTML. That’s why you need a high-quality third-party proxy provider to get past anti-bot systems and present yourself as a human.

Why You Need a Firecrawl Proxy
Since you are self-hosting Firecrawl, you are responsible for the networking layer. Most developers try to save money by buying cheap proxies, but this is a mistake that will cost you money and time. To scrape high-value data without problems, you need to fundamentally change how your crawler looks to the outside world. That starts with proxy servers.
The Problem with Cheap Proxy Providers
Buying cheap proxies won’t get you better scraping results, because their IPs have been abused by thousands of other users before you and are already on numerous blacklists.
- Instant Flagging: Providers like Cloudflare maintain vast databases of these “dirty IPs”. Using them gets you flagged and blacklisted before your first request even lands.
- The Captcha Loop: Maybe your first few requests go through and life is good. But your IP isn’t clean, so you’ll keep hitting CAPTCHAs, and that is more than just annoying. Solving them is expensive, so it’s a cost you should factor in as well.
Scraping Sophisticated Platforms
Platforms like LinkedIn and Instagram have really strict anti-scraping measures and plenty of mechanisms to verify that visitors are human. Trying to scrape them with a cheap proxy or your own local IP won’t work at scale. You will need high-quality IP addresses, which you can only get via trusted proxy providers.
Another thing to consider: once you crawl these websites, or any website in general, session continuity and sticky sessions become super important. You don’t want your crawl to fail mid-session after hours of waiting, and for the most sophisticated platforms you need sticky sessions to stay logged in and keep scraping non-stop.
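Many providers implement sticky sessions by encoding a session ID in the proxy username. The exact convention varies per provider, so treat the `user123-session-<id>` format below as a hypothetical sketch and check your provider’s docs:

```python
import uuid

def sticky_proxy_url(session_id: str) -> str:
    """Build a proxy URL that pins all requests to one session/exit IP.

    The 'user123-session-<id>' username convention is hypothetical;
    real providers each have their own format.
    """
    return (
        f"http://user123-session-{session_id}:pass123"
        "@gateway.proxy-provider.com:8080"
    )

# Generate one session ID and reuse it for the whole crawl,
# so every request in the session leaves through the same IP
session = uuid.uuid4().hex[:8]
proxy_url = sticky_proxy_url(session)
```

Reusing one session ID for the whole crawl keeps the exit IP stable; generating a fresh ID per job gives you a clean IP for the next run.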

Step-by-Step Guide: Configuring Proxies for Firecrawl Self Host
Since the Firecrawl Python SDK or Node client connects to your self host, your proxy configuration should happen on the Docker side, i.e. at the infrastructure level. If you don’t do that, your crawl will be exposed and blocked by Cloudflare, and we don’t want that.
Here are two ways you can inject your proxies and mask your request with them to prevent your Firecrawl self host from getting blocked while crawling websites.
Method 1: The Simple .env
This is one of the simplest ways out there. You just need to create an .env file in your project (preferably in the root). Here are the steps to make it easier for you:
- Navigate to the root of the project you are working on.
- Open (or create) your .env file.
- After creating the .env file, add the proxy variables to it. You normally get this information from the proxy provider you subscribed to.
# .env file configuration
# Format: http://username:password@proxy-host:port
# Standard Proxy Variables
HTTP_PROXY=http://user123:pass123@proxy-gateway.com:8080

This is a very simple, straightforward way to do it. When you run Docker, it reads the variables inside the .env file and uses them. Make sure you specify the location of the .env file so Docker knows where to get the data from.
Note: Proxy URL formats differ between providers. You can usually change the format, but each provider has its own way of doing that.
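Once those variables are set, most HTTP tooling picks them up automatically. As a quick sanity check using Python’s standard library (the gateway address is hypothetical, mirroring the .env example above):

```python
import os
import urllib.request

# Simulate what Docker injects from the .env file
os.environ["HTTP_PROXY"] = "http://user123:pass123@proxy-gateway.com:8080"
os.environ["HTTPS_PROXY"] = "http://user123:pass123@proxy-gateway.com:8080"

# urllib (and libraries built on it, like requests) read these
# environment variables when deciding how to route traffic
proxies = urllib.request.getproxies()
```

If `proxies` comes back empty inside a container, the .env file was not passed through, and your requests are leaving with your server’s bare IP.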
Method 2: Docker Compose for Playwright
For scraping at scale, Playwright, a browser automation library, is one of the best choices a developer can make. It’s open-source, easy to use, and lets you integrate proxies.
Sometimes the .env file isn’t the ideal solution since the Playwright container might be isolated. In that case, you need to inject proxies specifically into the browser service configuration:
- Open your docker-compose.yaml file.
- Locate the playwright-service section.
- Add the proxy variables under the environment key.
services:
  playwright-service:
    image: mendableai/firecrawl-playwright-service:latest
    environment:
      # Inject Proxy for Browser Traffic
      - PROXY_SERVER=http://proxy-gateway.com:8080
      - PROXY_USERNAME=user123
      - PROXY_PASSWORD=pass123
      # Fallback to standard conventions
      - HTTPS_PROXY=http://user123:pass123@proxy-gateway.com:8080
    depends_on:
      - redis

Pro Tip: If you do a large amount of web scraping, make sure your provider offers session rotation, i.e. the ability to rotate after every request or at will. It’s super important, since every request or every Playwright browser opened gets a fresh, clean IP, which prevents you from getting blocked.
Conclusion
Firecrawl is one of the best scraping platforms out there. Yes, it’s still a startup, but it has a large audience of developers and, let’s not forget, it’s also backed by Y Combinator, one of the most famous startup accelerators in the world.
Key takeaways:
- If you have the technical experience and you don’t mind putting in a little bit of effort, the Firecrawl self host option might be the right fit for you.
- The more you scale your project, the more you are going to pay to operate it. That’s life.
- If you prefer to have everything in one place and spare yourself a headache while scraping — and you don’t mind the cost — going with Firecrawl API is the best option for you.
- If you decide to self host Firecrawl, having a great proxy provider is essential to prevent any problems and get great results.
- Don’t ever use your local IP for scraping projects, large or small, since it might get blacklisted.
Firecrawl has some decent features to offer. That’s why people are using it. The ability to take messy HTML full of unused divs and CSS selectors and turn it into clean JSON or markdown that can be fed directly to LLMs is a great feature to have. It really saves time and effort. No need to write additional scripts to clean the data after collecting it, unlike with traditional scrapers.
The ability to choose between scraping, crawling, and mapping is great as well. If you just want to know what a website has to offer, generate a content map to list its URLs. If you want to follow the site’s links and extract every page, go with the crawl option. If you only need the information from a single page, scraping is the normal choice.
To get all of Firecrawl’s amazing features and remain in complete control, choose the Firecrawl self host option; a good proxy provider will be involved one way or another to prevent blocks or cutoffs while scraping. If you go with the cloud version, you don’t have to worry about any of that.
Frequently Asked Questions
Is Firecrawl free to use?
Yes, the self-hosted version of Firecrawl is open-source and can be used by anyone under the AGPL-3.0 license, though you will be responsible for the costs of the server (VPS) and proxy infrastructure needed to run Firecrawl’s self-hosted version.
What is the main difference between Firecrawl Cloud and self hosting Firecrawl?
In the cloud version the service is fully managed by the Firecrawl team. Both technical and non-technical people can use it. With Firecrawl’s self-hosted version you will have to manage all the infrastructure and the servers related to hosting the code, which requires some technical expertise.
Why am I getting 403 Forbidden errors on my self-hosted Firecrawl instance?
This happens because you are sending requests from your local IP or using servers like AWS or Google Cloud to do the scraping for you. Cloudflare will block these IPs. To prevent such errors you need to use mobile proxies or residential proxies.
Does Firecrawl respect robots.txt?
Yes. By default, it will look for a website’s robots.txt first, check which paths it is allowed and disallowed from, and adjust the scraper accordingly.
Can Firecrawl scrape websites behind a login?
Yes, Firecrawl does support that, but you will need to provide the credentials in the request headers. Using sticky sessions here is important so your session isn’t invalidated mid-crawl.
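As a sketch, the scrape request body with auth headers might look like this (the URL and cookie value are placeholders; the `headers` field follows the Firecrawl scrape API, but verify against the docs for your version):

```python
# Hypothetical request body for scraping a page behind a login,
# passing a session cookie through the "headers" field of /scrape
scrape_body = {
    "url": "https://example.com/account/dashboard",
    "formats": ["markdown"],
    "headers": {
        "Cookie": "session=YOUR_SESSION_COOKIE",  # placeholder value
    },
}
```

Pair this with a sticky proxy session so the site sees the login and the scrapes coming from the same IP.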
What is the difference between /map, /scrape, /crawl?
/scrape extracts data from a single URL into markdown or JSON; /map draws a sitemap of the website you are targeting without scraping it; and /crawl follows links across the site and scrapes every page it discovers.
What are the hardware requirements to self-host Firecrawl?
To run the server comfortably you will need Docker Compose, at least 2 GB of RAM, a PostgreSQL database, and a Redis instance.
Does Firecrawl self-host have rate limits?
No. There is no built-in rate limit; you can scrape as much as your hardware allows.



