I want to scrape Reddit data. Simple enough, right? It turns out that most Reddit scraping tutorials fall into two categories: quick-and-dirty “here’s how to grab 10 posts” scripts, or over-engineered enterprise solutions that feel like using a spaceship.
The problem is that real-world scraping is messy: some builders only need a couple of Reddit posts for a quick analysis, others want thousands of records for research, and others still only care about the comments. The truth is that you might start with a small job today and find yourself needing something that scales next month.
I needed something that could handle both kinds of requests without breaking, so I built a Reddit scraper that gets smarter as your needs grow. Small jobs stay simple and fast; large jobs automatically get proxy protection and async processing — same interface but different engines under the hood.
Here’s how I did it, the problems I tackled along the way, and why Reddit turned out to be surprisingly friendly to scrape (spoiler: it’s not like the other platforms). If you’re not interested in the journey and just want the code, here you go.

Why Python? (And Not the Reasons You Think)
I picked Python for this project, but not for the usual “Python is great for scraping” reasons that every builder agrees on.
Here’s the thing: Reddit scraping is not that complex, but it heavily depends on your use case. Scraping a couple of posts is not the same as scraping thousands: the more you scale, the more you have to deal with, like async requests, blocked IPs, and error handling that doesn’t fall over after the first timeout.
Most programming languages force you to pick a lane and stick to it. Go? Super fast, but verbose for simple tasks. JavaScript? It’s my personal favourite and it’s great for async, but it’s painful for data processing. PHP? I’ve been working with it recently for a custom plugin, and let’s not go there.
Python offered me something different: a way to take two completely different approaches in the same ecosystem.
For example, if I want to do a quick job, I can use requests, which is dead simple, reliable, and perfect for synchronous pagination:
def _make_request(self, url: str, params: Optional[Dict] = None) -> Optional[Dict[str, Any]]:
    try:
        response = self.session.get(url, params=params)
        response.raise_for_status()
        self._sleep_with_delay()
        return response.json()
    except requests.exceptions.RequestException as e:
        logger.error(f"Request failed for {url}: {e}")
        return None
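That _sleep_with_delay call is the polite-scraper part. It isn’t shown in the article, but a minimal version, assuming it just enforces the configured delay with a bit of jitter, could look like this:
import random
import time

def _sleep_with_delay(self) -> None:
    # Wait for the configured delay plus a little jitter so requests
    # don't land on a perfectly regular beat. Purely illustrative:
    # the real helper in the repo may differ.
    time.sleep(self.delay + random.uniform(0, 0.5))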
For the heavy lifting I used async with proxy rotation:
proxy_url = None
if self.proxy_manager:
    proxy_dict = self.proxy_manager.get_next_http_proxy()
    if proxy_dict:
        proxy_url = proxy_dict.get('http', proxy_dict.get('https'))

async with session.get(url, params=params, proxy=proxy_url) as response:
    if response.status == 200:
        data = await response.json()
        return data
I built both Reddit scrapers in the same codebase and let the system choose which one to use based on the job size. Python’s ecosystem just works perfectly for this kind of stuff (quick sketch after the list):
- Click for nice CLI interfaces
- Rich for progress bars that actually look good
- Pandas for when the CEO says “can you export this to CSV?”
- Aiohttp and requests because they play nice together
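To make that concrete, here is a toy example of Click and Rich working together. It’s not the repo’s actual CLI, just a sketch of how the pieces snap together:
import click
from rich.progress import track

@click.command()
@click.argument("subreddit")
@click.option("--limit", default=50, help="How many posts to fetch.")
def scrape(subreddit: str, limit: int) -> None:
    """Toy command: iterate with a Rich progress bar (no real scraping here)."""
    for _ in track(range(limit), description=f"Scraping r/{subreddit}..."):
        pass  # a real command would fetch and save posts here

if __name__ == "__main__":
    scrape()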
Python is the best for this kind of use case, and those who say otherwise… That’s their opinion and they can keep it to themselves.

The Pagination Problem (And Why It Broke My Brain)
I thought pagination would be the easy part. “Grab page 1, then page 2, then page 3,” right? Wrong.
Reddit doesn’t use page numbers; it uses cursor-based pagination with an after token. Each response gives you a token pointing to the next batch to scrape. If you miss that token or handle it wrong, it’s game over: you’ll be stuck in an infinite loop of the same 20–25 posts. To be fair, though, that wasn’t the real problem.
The real problem was that differently sized jobs needed completely different approaches. Here’s how I split them up:
- Small jobs: I just want to scrape a couple of posts. I want something fast, simple, and reliable; no fancy stuff.
- Large jobs: I want to scrape a few hundred, maybe thousands, of posts, which requires speed, error recovery, and proxy rotation to avoid getting blocked.
I was faced with an annoying decision: do I keep it simple and hit a wall later, or over-engineer a complex solution that will make me hate myself when I apply it to small tasks?
To which I thought: why not build both? I built two completely different pagination engines and let the system choose which one to use based on the job size.
The Direct Request
For small jobs, I keep it straightforward with synchronous pagination:
def scrape_subreddit_paginated(self, subreddit: str, sort_by: str = "hot",
                               max_posts: int = 1000, batch_size: int = 100):
    url = f"https://www.reddit.com/r/{subreddit}/{sort_by}.json"
    after = None
    posts_fetched = 0

    while posts_fetched < max_posts:
        params = {
            'limit': min(batch_size, 100),
            'raw_json': 1
        }
        if after:
            params['after'] = after

        data = self._make_request(url, params)
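The excerpt stops right after the request. The rest of the loop, roughly, pulls the posts out of the listing and carries the after token forward. Here is a sketch of that (it assumes a posts list was initialized next to after and posts_fetched, and it is not copied verbatim from the repo):
        # ...still inside the while loop, right after _make_request:
        if not data or 'data' not in data:
            break  # bad or empty response: stop rather than loop forever

        children = data['data'].get('children', [])
        if not children:
            break  # no more posts to fetch

        for child in children:
            posts.append(self._clean_post_data(child['data']))
            posts_fetched += 1
            if posts_fetched >= max_posts:
                break

        # The after token is what keeps pagination moving; when Reddit
        # returns None, we've reached the end of the listing.
        after = data['data'].get('after')
        if not after:
            break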
The Async Beast
For large jobs, I went fully async, with proxy rotation:
async def scrape_subreddit(self, subreddit: str, sort_by: str = "hot",
                           limit: int = 25) -> List[Dict[str, Any]]:
    while posts_fetched < limit:
        data = await self._make_request(url, params)
        await asyncio.sleep(self.delay)
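Driving it is the usual asyncio dance. Something along these lines, where AsyncRedditScraper is a stand-in name for whatever the async class is actually called:
import asyncio

async def main() -> None:
    # AsyncRedditScraper is a hypothetical name used for illustration;
    # the real class in the repo may be called something else.
    scraper = AsyncRedditScraper()
    posts = await scraper.scrape_subreddit("programming", sort_by="new", limit=500)
    print(f"Fetched {len(posts)} posts")

asyncio.run(main())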
With these two approaches, we’ve got the same interface but with a different engine for each use case!

Taming Reddit’s JSON Chaos
Reddit’s API gives you data of course, but calling it “clean” would be generous and you’ll probably need to see an ophthalmologist after sifting through it.
Here’s what a raw Reddit post looks like in JSON:
{
  "kind": "t3",
  "data": {
    "subreddit": "programming",
    "selftext": "",
    "author_fullname": "t2_abc123",
    "title": "Some programming post",
    "subreddit_name_prefixed": "r/programming",
    "ups": 42,
    "downs": 0,
    "score": 42,
    "created_utc": 1755604800.0,
    "num_comments": 15,
    "permalink": "/r/programming/comments/abc123/some_post/",
    "url": "https://example.com",
    "author": "username",
    "is_self": false,
    "stickied": false,
    "over_18": false,
    // ... and about 50 more fields you don't need
  }
}
This is honestly a nightmare for analysis:
- Timestamps are Unix epochs (good luck reading those)
- Inconsistent field names
- Tons of fields you will never use
- Some fields might be missing entirely
I needed consistent, clean data that would make my life easier and wouldn’t break my analysis code.
The Cleaning Process
Here’s how I transformed Reddit’s data into something nice and usable:
def _clean_post_data(self, raw_post: Dict) -> Dict[str, Any]:
    return {
        'id': raw_post.get('id'),
        'title': raw_post.get('title'),
        'author': raw_post.get('author'),
        'score': raw_post.get('score', 0),
        'upvotes': raw_post.get('ups', 0),
        'downvotes': raw_post.get('downs', 0),
        'upvote_ratio': raw_post.get('upvote_ratio', 0),
        'url': raw_post.get('url'),
        'permalink': f"https://reddit.com{raw_post.get('permalink', '')}",
        'created_utc': raw_post.get('created_utc'),
        'created_date': self._format_date(raw_post.get('created_utc')),
        'num_comments': raw_post.get('num_comments', 0),
        'subreddit': raw_post.get('subreddit'),
        'is_self_post': raw_post.get('is_self', False),
        'is_nsfw': raw_post.get('over_18', False),
        'is_stickied': raw_post.get('stickied', False),
        'flair': raw_post.get('link_flair_text'),
        'post_text': raw_post.get('selftext', ''),
        'domain': raw_post.get('domain')
    }
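The only non-obvious call in there is _format_date, which turns the Unix epoch into the readable created_date field. A minimal version, assuming UTC and the format used in the examples below, would be:
from datetime import datetime, timezone
from typing import Optional

def _format_date(self, timestamp: Optional[float]) -> Optional[str]:
    # Convert Reddit's created_utc (seconds since the Unix epoch, UTC)
    # into a human-readable string like "2025-08-19 12:00:00".
    if timestamp is None:
        return None
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')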
What This Gets You
This is the same post, but after cleanup it looks like this:
{
  "id": "abc123",
  "title": "Some programming post",
  "author": "username",
  "score": 42,
  "created_date": "2025-08-19 12:00:00",
  "num_comments": 15,
  "permalink": "https://reddit.com/r/programming/comments/abc123/some_post/",
  "is_nsfw": false,
  "subreddit": "programming"
}
Now this is the kind of data we can work with, and it won’t break the analysis code:
- Consistent field names across all posts
- Human-readable data
- No missing fields
- Only the data you actually need
Reddit changes its API (and it does that a lot). When it does, I only need to update the cleaning function in one place and we’re back in business.

Data Storage: Pick Your Favorite
Clean data is useless if you can’t get it where you need it. Facts. While building the code, I fielded a lot of requests from the team, things like “can we get this in CSV format?” or “can we use this to train models?”.
Trying to guess what someone needs is a losing game (I just hate talking to people). The data scientist wants JSON for flexibility, the analyst wants CSV for his beloved Excel sheet, so I built multiple output formats and let people decide for themselves.
JSON: The Default Choice
Most of the time, JSON just works (why would you choose anything else, really?).
import json
import pandas as pd  # used by the CSV branch shown below

def save_data(data, output_file: str, format: str = "json"):
    if format == "json":
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
It’s clean, readable, and preserves data types — perfect for feeding into other tools or doing analysis in general.
What you get, as we mentioned earlier:
[
  {
    "id": "abc123",
    "title": "Some programming post",
    "author": "username",
    "score": 42,
    "created_date": "2025-08-19 12:00:00",
    "comments": [
      {
        "author": "commenter1",
        "body": "Great post!",
        "score": 5
      }
    ]
  }
]
CSV: For the Excel(lent) People
Sometimes you just need a spreadsheet (don’t ask me why):
elif format == "csv":
    df = pd.DataFrame(data)
    df.to_csv(output_file, index=False)
What you get:
id,title,author,score,created_date,num_comments
abc123,Some programming post,username,42,2025-08-19 12:00:00,15
def456,Another post,user2,128,2025-08-19 13:00:00,42
The perfect format for quick analysis, sharing with non-technical people (I love you guys), or just importing it into a database.
As mentioned earlier, we added a CLI, which makes life easier from the command line:
python3 -m reddit_scraper.cli json subreddit programming --limit 50 --output posts.json
python3 -m reddit_scraper.cli json subreddit programming --limit 50 --output posts.csv --format csv
python3 -m reddit_scraper.cli interactive
Handling Edge Cases
Real data can sometimes have problems:
- Unicode characters in titles and comments (hence ensure_ascii=False)
- Large datasets that might not fit in memory
- Nested data like comments (JSON handles this gracefully)
At the end of the day, whether you’re a builder, data analyst, or just someone who loves spreadsheets, the code will handle all these cases for you.

The Brain: When to Go Simple vs When to Go Heavy
Here’s the part that took me way too long to figure out. After a couple of hours of testing, I found out when we actually need the complex stuff.
I started by always using the async scraper with proxy rotation. It seemed logical at the time, but I was wrong as usual. For small jobs, all of that complexity was just… slow.
Spinning up async sessions, initializing proxy managers, health-checking our Reddit proxies: all of that just to grab a few posts that a couple of simple GET requests could have handled in seconds.
The problem is that we can’t take the simple route all the time. For larger jobs, plain GET requests won’t cut it: we might hit Reddit’s API limit or get our IP addresses flagged. So we needed a decision engine that could look at the job and apply the right approach automatically.
The Magic Number Is 100
The determining factor is, unsurprisingly, Reddit’s API limit. I figured that out after a couple of tests (I should have read the documentation).
use_proxies = inputs['post_count'] > 100 and config_manager.has_proxies()

if inputs['post_count'] > 100:
    return await handle_large_scraping_job(
        inputs['subject'], inputs['post_count'], inputs['sort_method'],
        scraper_config, proxy_manager, captcha_solver,
        use_proxies, use_captcha
    )
else:
    return handle_regular_scraping_job(
        inputs['subject'], inputs['post_count'], inputs['sort_method'],
        scraper_config, proxy_manager, captcha_solver,
        use_proxies, use_captcha
    )
This translates to:
- <= 100 posts: direct requests work fine, Reddit doesn’t care, and the job is done in a matter of seconds.
- > 100 posts: you start hitting rate limits (now we’re talking), so proxy rotation kicks in.
What This Looks Like in Practice
Do you want 50 posts from r/programming?
Method: RequestsScraper direct (small job)
Proxies: No
- Direct HTTP requests
- No proxy overhead
- Done in 5–10 seconds
Do you want 500 posts from r/programming?
Method: JSONScraper with proxies (large job)
Proxies: Yes
- HTTP proxy rotation on every request
- Async pagination
- Protected from rate limits
The system will take the number of posts as input and then act based on that number.

It’s All About the Customization
Most web scraping tutorials serve a single use case, so while building this I focused on making the subreddit and the sorting easy to change.
Different Subreddits: Just Change the Parameter
python3 -m reddit_scraper.cli interactive
python3 -m reddit_scraper.cli json subreddit python --limit 50
python3 -m reddit_scraper.cli json subreddit technology --config config.json --limit 50
Different Sorting: Hot, New, Top, Rising
Reddit has different ways to sort posts, and each gives you different data:
python3 -m reddit_scraper.cli json subreddit programming --sort hot --limit 50
python3 -m reddit_scraper.cli json subreddit programming --sort new --limit 50
python3 -m reddit_scraper.cli json subreddit programming --sort top --limit 50
python3 -m reddit_scraper.cli json subreddit programming --sort rising --limit 50
Comments: When You Need the Full Picture
In some cases you need the full context of the post you are looking for:
python3 -m reddit_scraper.cli json subreddit programming --limit 100
python3 -m reddit_scraper.cli json subreddit-with-comments programming --limit 100 --include-comments --comment-limit 50
python3 -m reddit_scraper.cli json comments programming POST_ID --sort best --output single_post_comments.json
When you include comments, the scraper makes a separate API call for each post’s comments. That’s where proxy rotation really earns its keep: for every 100 posts, there are 101 total requests (1 for the posts plus 100 for the comments).
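For reference, each of those per-post calls hits Reddit’s public comments endpoint, which returns a two-element array: the post itself, then the comment tree. A bare-bones version of that request (a sketch, not the repo’s code) looks something like this:
import requests

def fetch_comments(subreddit: str, post_id: str, limit: int = 50) -> list:
    # Reddit returns [post listing, comment listing]; the comments live in element 1.
    url = f"https://www.reddit.com/r/{subreddit}/comments/{post_id}.json"
    response = requests.get(
        url,
        params={'limit': limit, 'raw_json': 1},
        headers={'User-Agent': 'RedditScraper/1.0.0'},
    )
    response.raise_for_status()
    listing = response.json()
    # 't1' entries are comments; 'more' stubs are skipped for simplicity.
    return [child['data'] for child in listing[1]['data']['children']
            if child['kind'] == 't1']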
Output Formats
python3 -m reddit_scraper.cli json subreddit programming --limit 50 --output analysis.json
python3 -m reddit_scraper.cli json subreddit programming --limit 50 --output analysis.csv --format csv
python3 -m reddit_scraper.cli interactive
Configuration File
The config.json file holds all the settings we need:
{
  "proxies": [
    {
      "host": "your-proxy.com",
      "port": 8080,
      "username": "your_username",
      "password": "your_password",
      "proxy_type": "http"
    }
  ],
  "scraping": {
    "default_delay": 1.0,
    "max_retries": 3,
    "requests_per_minute": 60,
    "user_agent": "RedditScraper/1.0.0"
  }
}
It’s flexible, and we don’t hardcode anything, obviously… We’re professionals here.
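Loading it is as boring as it should be. A minimal sketch (the repo has its own config manager; this is just the idea):
import json

def load_config(path: str = "config.json") -> dict:
    # Hypothetical helper for illustration; the repo's config_manager
    # does the real work.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

config = load_config()
delay = config["scraping"]["default_delay"]  # 1.0 in the example above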
The Interactive Mode
We wanted you to love the CLI more, so we created an interactive one for you!
Enter subreddit name: programming
How many posts? [50]: 200
Sort method (hot, new, top, rising) [hot]: new
Use captcha solving? [Y/n]: n
Proxy usage: Yes (automatic for >100 posts)
Output filename: programming_new_posts.json
Starting scrape:
Subreddit: r/programming
Posts: 200
Sort: new
Method: JSONScraper with proxies (large job)
Proxies: Yes
Captcha: No
Conclusion
Here’s the thing about Reddit: it’s one of the easiest platforms to scrape compared to the other big platforms. Clean JSON endpoints, reasonable rate limits; Reddit actually wants you to scrape its data, on its own terms.
However, the difference between building a small project and a production-scale project comes down to one key point: different job sizes need different approaches. The goal is not to build the most complex Reddit scraper of all time; it’s to build a system that adapts to the job at hand, regardless of size.
Real Lessons:
- Use JSON endpoints (/r/subreddit/hot.json) not HTML parsing
- Handle pagination properly with Reddit’s after token
- Clean the data early; Reddit’s JSON is messy
- Automate the complexity: users shouldn’t have to think about whether or not they should be using proxies
- Start simple, scale smart, add features when you actually need them
In the end, we built this project to scale and to fit different use cases. Feel free to take a look at the code: it’s an open-source Reddit scraper created for the builders who want to access data in the easiest way possible.