Reddit Scraper: How to Scrape Reddit for Free


I want to scrape Reddit data. Simple enough, right? It turns out that most Reddit scraping tutorials fall into two categories: the “here’s how to grab 10 posts” quick and messy scripts, or the over-engineered enterprise solutions that feel like using a spaceship.

The problem is that real-world scraping is messy: some builders only need a couple of Reddit posts for a quick analysis, others want thousands of records for research, and others still only care about the comments. The truth is that you might start with a small job today and find yourself needing something much bigger next month.

I needed something that could handle both kinds of jobs without breaking, so I built a Reddit scraper that gets smarter as your needs grow. Small jobs stay simple and fast; large jobs automatically get proxy protection and async processing — same interface, but different engines under the hood.

Here’s how I did it, the problems I tackled along the way, and why Reddit turned out to be surprisingly friendly to scrape (spoiler: it’s not like the other platforms). If you’re not interested in the journey and just want the code, here you go.

Why Python? (And Not the Reasons You Think)

I picked Python for this project, but not for the usual “Python is great for scraping” reasons that every builder agrees on.

Here’s the thing: Reddit scraping is not that complex, but it heavily depends on your use case. Scraping a couple of posts is not the same as scraping thousands. The more you try to scale, the more problems you run into: you need async requests, you have to deal with blocked IPs, and you need error handling that doesn’t fall over after the first timeout.

Most programming languages force you to pick a lane and stick to it. Go? Super fast but verbose for simple tasks. JavaScript? It’s my personal favourite and it’s great for async, but it’s painful for data processing. PHP? I’ve been working with it recently for a custom plugin, and let’s not go there.

Python offered me something different: a way to take two completely different approaches in the same ecosystem.

For example, if I want to do a quick job, I could use requests — dead simple, reliable, and perfect for synchronous pagination:

def _make_request(self, url: str, params: Optional[Dict] = None) -> Optional[Dict[str, Any]]:
    try:
        response = self.session.get(url, params=params)
        response.raise_for_status()   # turn 4xx/5xx responses into exceptions
        self._sleep_with_delay()      # politeness delay between requests
        return response.json()
    except requests.exceptions.RequestException as e:
        logger.error(f"Request failed for {url}: {e}")
        return None
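
For context, the session that _make_request relies on is just a requests.Session with a descriptive User-Agent and a configurable delay. Here is a minimal sketch of how a scraper class like RequestsScraper might set that up (the attribute names and defaults here are assumptions based on the snippets in this post):

import time
import requests

class RequestsScraper:
    def __init__(self, user_agent: str = "RedditScraper/1.0.0", delay: float = 1.0):
        # One shared session so headers and connection pooling apply to every request
        self.session = requests.Session()
        self.session.headers.update({"User-Agent": user_agent})
        self.delay = delay

    def _sleep_with_delay(self):
        # Politeness delay between requests (the real project may do more here)
        time.sleep(self.delay)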

For the heavy lifting, I used async with proxy rotation:

# Rotate to the next proxy for this request (falls back to a direct connection)
proxy_url = None
if self.proxy_manager:
    proxy_dict = self.proxy_manager.get_next_http_proxy()
    if proxy_dict:
        proxy_url = proxy_dict.get('http', proxy_dict.get('https'))

async with session.get(url, params=params, proxy=proxy_url) as response:
    if response.status == 200:
        data = await response.json()
        return data
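
That get_next_http_proxy() call comes from a proxy manager. Here is a minimal round-robin sketch of what it might look like; the real implementation also health-checks proxies, so treat this as an assumption rather than the project’s actual code:

import itertools
from typing import Dict, List, Optional

class ProxyManager:
    def __init__(self, proxies: List[Dict]):
        # proxies come straight from config.json: host, port, username, password
        self._cycle = itertools.cycle(proxies) if proxies else None

    def get_next_http_proxy(self) -> Optional[Dict[str, str]]:
        if not self._cycle:
            return None
        p = next(self._cycle)
        url = f"http://{p['username']}:{p['password']}@{p['host']}:{p['port']}"
        # Same shape requests and aiohttp expect: {"http": ..., "https": ...}
        return {"http": url, "https": url}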

I built both Reddit scrapers in the same codebase and let the system choose which one to use based on the job size. Python’s ecosystem just works perfectly for this kind of stuff:

  • Click for nice CLI interfaces
  • Rich for progress bars that actually look good
  • Pandas for when the CEO says “can you export this to CSV?”
  • Aiohttp and requests because they play nice together

Python is the best for this kind of use case, and those who say otherwise… That’s their opinion and they can keep it to themselves.

The Pagination Problem (And Why It Broke My Brain)

I thought pagination would be the easy part. “Grab page 1, then page 2, then page 3,” right? Wrong.

Reddit doesn’t use page numbers, it uses something called cursor-based pagination with an after token. Each response gives you a token pointing to the next batch to scrape. If you miss that token or handle it wrong, it’s game over. You’ll be stuck in an infinite loop of the same 20–25 posts. To be fair, though, that wasn’t the real problem.
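
Concretely, every listing response wraps the posts and the cursor in the same envelope, so pulling the token out looks something like this (a sketch; the token value is just a placeholder):

# A listing response looks like:
# {"kind": "Listing", "data": {"children": [...], "after": "t3_abc123", "before": null}}
payload = response.json()
posts = payload["data"]["children"]   # the posts in this batch
after = payload["data"]["after"]      # cursor for the next batch; None once you hit the end

if after:
    params["after"] = after           # feed the cursor into the next request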

The real problem was that differently sized jobs needed completely different approaches. Here’s how I split them up:

  • Small jobs: I just want to scrape a couple of posts. I want something fast, simple, and reliable; no fancy stuff.
  • Large jobs: I want to scrape a few hundred (maybe thousands) of posts, which requires speed, error recovery, and proxy rotation to avoid getting blocked.

I was faced with an annoying decision: Do I keep it simple and hit a wall later or over-engineer a complex solution that will make me hate myself when I apply it to small tasks?

To which I thought: why not build both? I built two completely different pagination engines and let the system choose which one to use based on the job size.

The Direct Request

For small jobs, I keep it straightforward with synchronous pagination:

def scrape_subreddit_paginated(self, subreddit: str, sort_by: str = "hot",
                               max_posts: int = 1000, batch_size: int = 100):
    url = f"https://www.reddit.com/r/{subreddit}/{sort_by}.json"
    after = None
    posts_fetched = 0
    all_posts = []

    while posts_fetched < max_posts:
        params = {
            'limit': min(batch_size, 100),  # Reddit caps a single page at 100
            'raw_json': 1
        }
        if after:
            params['after'] = after

        data = self._make_request(url, params)
        if not data:
            break  # request failed; keep whatever we already have
        children = data.get('data', {}).get('children', [])
        if not children:
            break  # nothing left to fetch
        for child in children:
            all_posts.append(self._clean_post_data(child.get('data', {})))
        posts_fetched += len(children)

        after = data.get('data', {}).get('after')
        if not after:
            break  # no cursor means we've reached the end

    return all_posts[:max_posts]

The Async Beast

For large jobs, I went fully async with proxy rotation:

async def scrape_subreddit(self, subreddit: str, sort_by: str = "hot",
                           limit: int = 25) -> List[Dict[str, Any]]:
    # (setup of url, params, and posts_fetched omitted; same pagination idea as above)
    while posts_fetched < limit:
        data = await self._make_request(url, params)  # async request, proxy-aware
        await asyncio.sleep(self.delay)               # stay polite between batches

With these two approaches, we’ve got the same interface but with a different engine for each use case!

Taming Reddit’s JSON Chaos

Reddit’s API gives you data, of course, but calling it “clean” would be generous; you’ll probably need to see an ophthalmologist after sifting through it.

Here’s what a raw Reddit post looks like in JSON:

{
  "kind": "t3",
  "data": {
    "subreddit": "programming",
    "selftext": "",
    "author_fullname": "t2_abc123",
    "title": "Some programming post",
    "subreddit_name_prefixed": "r/programming",
    "ups": 42,
    "downs": 0,
    "score": 42,
    "created_utc": 1703875200.0,
    "num_comments": 15,
    "permalink": "/r/programming/comments/abc123/some_post/",
    "url": "https://example.com",
    "author": "username",
    "is_self": false,
    "stickied": false,
    "over_18": false,
    // ... and about 50 more fields you don't need
  }
}

This is honestly a nightmare for analysis:

  • Timestamps are Unix epochs (good luck reading those)
  • Inconsistent field names
  • Tons of fields you will never use
  • Some fields might be missing entirely

I needed consistent, clean data that would make my life easier and wouldn’t break my analysis code.

The Cleaning Process

Here’s how I transformed Reddit’s data into something nice and usable:

def _clean_post_data(self, raw_post: Dict) -> Dict[str, Any]:
    return {
        'id': raw_post.get('id'),
        'title': raw_post.get('title'),
        'author': raw_post.get('author'),
        'score': raw_post.get('score', 0),
        'upvotes': raw_post.get('ups', 0),
        'downvotes': raw_post.get('downs', 0),
        'upvote_ratio': raw_post.get('upvote_ratio', 0),
        'url': raw_post.get('url'),
        'permalink': f"https://reddit.com{raw_post.get('permalink', '')}",
        'created_utc': raw_post.get('created_utc'),
        'created_date': self._format_date(raw_post.get('created_utc')),
        'num_comments': raw_post.get('num_comments', 0),
        'subreddit': raw_post.get('subreddit'),
        'is_self_post': raw_post.get('is_self', False),
        'is_nsfw': raw_post.get('over_18', False),
        'is_stickied': raw_post.get('stickied', False),
        'flair': raw_post.get('link_flair_text'),
        'post_text': raw_post.get('selftext', ''),
        'domain': raw_post.get('domain')
    } 
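
The created_date field comes from a small date helper. Here is a minimal sketch of what _format_date might look like (the exact implementation in the project is an assumption):

from datetime import datetime, timezone
from typing import Optional

def _format_date(self, created_utc: Optional[float]) -> Optional[str]:
    # created_utc is a Unix epoch in UTC; turn it into "YYYY-MM-DD HH:MM:SS"
    if created_utc is None:
        return None
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")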

What This Gets You

This is the same post; after cleaning it up, it looks like this:

{
  "id": "abc123",
  "title": "Some programming post", 
  "author": "username",
  "score": 42,
  "created_date": "2025-08-19 12:00:00",
  "num_comments": 15,
  "permalink": "https://reddit.com/r/programming/comments/abc123/some_post/",
  "is_nsfw": false,
  "subreddit": "programming"
}

Now this is the kind of data we can work with without breaking the analysis code:

  • Consistent field names across all posts
  • Human-readable data
  • No missing fields (sensible defaults fill the gaps)
  • Only the data you actually need

Reddit changes its API from time to time (more often than you’d like). When it does, I only need to update the cleaning function in one place and we’re back in business.

Data Storage: Pick Your Favorite

Clean data is useless if you can’t get it where you need it — facts. While building the code, I fielded a lot of requests from the team who said things like “can we get this in CSV format?” or “can we use this to train models?”.

Trying to guess what someone needs is a losing game (I just hate talking to people). The data scientist wants JSON for flexibility, the analyst wants CSV for their beloved Excel sheet, so I built multiple output formats and let people decide for themselves.

JSON: The Default Choice

Most of the time, JSON just works (why would you choose anything else, really?)

def save_data(data, output_file: str, format: str = "json"):
    if format == "json":
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2, ensure_ascii=False)

It’s clean, readable, and preserves data types — perfect for feeding into other tools or doing analysis in general.

What you get, as we mentioned earlier:

[
  {
    "id": "abc123",
    "title": "Some programming post",
    "author": "username", 
    "score": 42,
    "created_date": "2025-08-19 12:00:00",
    "comments": [
      {
        "author": "commenter1",
        "body": "Great post!",
        "score": 5
      }
    ]
  }
]

CSV: For the Excel(lent) People

Sometimes you just need a spreadsheet (don’t ask me why):

elif format == "csv":
    df = pd.DataFrame(data)
    df.to_csv(output_file, index=False)

What you get:

id,title,author,score,created_date,num_comments
abc123,Some programming post,username,42,2025-08-19 12:00:00,15
def456,Another post,user2,128,2025-08-19 13:00:00,42

The perfect format for quick analysis, sharing with non-technical people (I love you guys), or just importing it into a database.

As mentioned earlier, we also added a CLI to make our lives easier from the command line:

python3 -m reddit_scraper.cli json subreddit programming --limit 50 --output posts.json

python3 -m reddit_scraper.cli json subreddit programming --limit 50 --output posts.csv --format csv

python3 -m reddit_scraper.cli interactive

Handling Edge Cases

Real data can sometimes have problems:

  • Unicode characters in titles and comments (hence ensure_ascii=False)
  • Large datasets that might not fit in memory
  • Nested data like comments (JSON handles this gracefully)

At the end of the day, whether you’re a builder, data analyst, or just someone who loves spreadsheets, the code will handle all these cases for you.
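
For the “might not fit in memory” point in particular, one option (not something this project ships, just a sketch) is to append each scraped batch to disk as JSON Lines instead of holding everything in memory:

import json

def append_batch(posts, output_file: str):
    # One JSON object per line; memory use stays flat no matter how big the job gets
    with open(output_file, 'a', encoding='utf-8') as f:
        for post in posts:
            f.write(json.dumps(post, ensure_ascii=False) + "\n")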

The Brain: When to Go Simple vs When to Go Heavy

Here’s the part that took me way too long to figure out. After a couple of hours of testing, I found out when we actually need the complex stuff.

I started by always using the async scraper with proxy rotation. It seemed logical at the time, but I was wrong as usual. For small jobs, all of that complexity was just… slow.

Spinning up async sessions, initializing proxy managers, health-checking our Reddit proxies — all of that just to scrape a few posts that would have taken a couple of seconds to scrape using simple GET requests.

The problem is that we can’t take the simple route all the time. For larger jobs, plain GET requests won’t cut it: we might hit Reddit’s rate limit or get our IP addresses flagged. So I needed a decision engine that could look at the job and apply the right approach automatically.

The Magic Number Is 100

The determining factor is, unsurprisingly, Reddit’s API limit. I figured that out after a couple of tests (I should have read the documentation).

use_proxies = inputs['post_count'] > 100 and config_manager.has_proxies()

if inputs['post_count'] > 100:
    return await handle_large_scraping_job(
        inputs['subject'], inputs['post_count'], inputs['sort_method'],
        scraper_config, proxy_manager, captcha_solver,
        use_proxies, use_captcha
    )
else:
    return handle_regular_scraping_job(
        inputs['subject'], inputs['post_count'], inputs['sort_method'],
        scraper_config, proxy_manager, captcha_solver,
        use_proxies, use_captcha
    )

This will translate to:

  • <= 100 posts: Direct requests work fine, Reddit doesn’t care, job is done in a matter of seconds. 
  • > 100 posts: You start hitting rate limits (now we are talking), and here we will use proxy rotation.
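
Even with proxies in the mix, it helps to handle the occasional 429 gracefully. The config exposes max_retries, and one way to use it is a retry-with-backoff wrapper like this sketch (the project’s actual retry logic may differ):

import time
import requests

def get_with_backoff(session: requests.Session, url: str, params=None,
                     max_retries: int = 3, base_delay: float = 1.0):
    # Retry on rate limits and server errors, doubling the wait each attempt
    response = None
    for attempt in range(max_retries):
        response = session.get(url, params=params)
        if response.status_code not in (429, 500, 502, 503):
            return response
        time.sleep(base_delay * (2 ** attempt))
    return response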

What This Looks Like in Practice

Do you want 50 posts from r/programming?

Method: RequestsScraper direct (small job)
Proxies: No
  • Direct HTTP requests 
  • No proxy overhead
  • Done in 5–10 seconds

Do you want 500 posts from r/programming?

Method: JSONScraper with proxies (large job)
Proxies: Yes
  • HTTP proxy rotation on every request
  • Async pagination
  • Protected from rate limits

The system will take the number of posts as input and then act based on that number.

It’s All About the Customization

Most web scraping tutorials serve only one use case, so while building this I was thinking of dynamic ways to plug in your own subreddit or sorting method.

Different Subreddits: Just Change the Parameter

python3 -m reddit_scraper.cli interactive

python3 -m reddit_scraper.cli json subreddit python --limit 50

python3 -m reddit_scraper.cli json subreddit technology --config config.json --limit 50

Different Sorting: Hot, New, Top, Rising

Reddit has different ways to sort posts, and each gives you different data:

python3 -m reddit_scraper.cli json subreddit programming --sort hot --limit 50

python3 -m reddit_scraper.cli json subreddit programming --sort new --limit 50

python3 -m reddit_scraper.cli json subreddit programming --sort top --limit 50

python3 -m reddit_scraper.cli json subreddit programming --sort rising --limit 50

Comments: When You Need the Full Picture

In some cases you need the full context of the post you are looking for:

python3 -m reddit_scraper.cli json subreddit programming --limit 100

python3 -m reddit_scraper.cli json subreddit-with-comments programming --limit 100 --include-comments --comment-limit 50

python3 -m reddit_scraper.cli json comments programming POST_ID --sort best --output single_post_comments.json

When you include comments, the scraper makes a separate API call for each post’s comments. That’s where the proxy rotation really comes into play, because for each 100 posts there will be 101 total requests (1 for posts + 100 for comments).
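
Each post’s comments live at their own JSON endpoint, so a big comments job fans out into many requests. Here is a rough sketch of that fan-out with aiohttp and a concurrency cap (the function names are illustrative, not the project’s exact API):

import asyncio
import aiohttp

async def fetch_comments(session, subreddit: str, post_id: str, proxy_url=None):
    url = f"https://www.reddit.com/r/{subreddit}/comments/{post_id}.json"
    async with session.get(url, params={"raw_json": 1}, proxy=proxy_url) as resp:
        return await resp.json() if resp.status == 200 else None

async def fetch_all_comments(subreddit: str, post_ids, max_concurrency: int = 10):
    sem = asyncio.Semaphore(max_concurrency)  # cap how many requests run at once

    async def bounded(post_id):
        async with sem:
            return await fetch_comments(session, subreddit, post_id)

    async with aiohttp.ClientSession(headers={"User-Agent": "RedditScraper/1.0.0"}) as session:
        return await asyncio.gather(*(bounded(pid) for pid in post_ids))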

Output Formats

python3 -m reddit_scraper.cli json subreddit programming --limit 50 --output analysis.json
 
python3 -m reddit_scraper.cli json subreddit programming --limit 50 --output analysis.csv --format csv

python3 -m reddit_scraper.cli interactive

Configuration File

The config.json file holds all the settings we need:

{
  "proxies": [
    {
      "host": "your-proxy.com",
      "port": 8080,
      "username": "your_username", 
      "password": "your_password",
      "proxy_type": "http"
    }
  ],
  "scraping": {
    "default_delay": 1.0,
    "max_retries": 3,
    "requests_per_minute": 60,
    "user_agent": "RedditScraper/1.0.0"
  }
}

It’s flexible, and we don’t hardcode anything, obviously… We’re professionals here.
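
The has_proxies() check in the decision code reads straight from this file. Here is a minimal sketch of a loader (the real ConfigManager likely does more validation, so treat this as an assumption):

import json

class ConfigManager:
    def __init__(self, path: str = "config.json"):
        with open(path, encoding="utf-8") as f:
            self._config = json.load(f)

    def has_proxies(self) -> bool:
        # True when config.json lists at least one proxy entry
        return bool(self._config.get("proxies"))

    def scraping_settings(self) -> dict:
        # default_delay, max_retries, requests_per_minute, user_agent
        return self._config.get("scraping", {})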

The Interactive Mode

We wanted you to love the CLI even more, so we added an interactive mode!

Enter subreddit name: programming
How many posts? [50]: 200  
Sort method (hot, new, top, rising) [hot]: new
Use captcha solving? [Y/n]: n
Proxy usage: Yes (automatic for >100 posts)
Output filename: programming_new_posts.json

Starting scrape:
  Subreddit: r/programming
  Posts: 200
  Sort: new  
  Method: JSONScraper with proxies (large job)
  Proxies: Yes
  Captcha: No
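
Under the hood, the interactive mode is just a handful of Click prompts (with Rich for the pretty output). Here is a stripped-down sketch of how those answers might be collected; the wording and defaults are illustrative, not the project’s exact code:

import click

def interactive_inputs() -> dict:
    subreddit = click.prompt("Enter subreddit name")
    post_count = click.prompt("How many posts?", default=50, type=int)
    sort_method = click.prompt("Sort method (hot, new, top, rising)", default="hot")
    use_captcha = click.confirm("Use captcha solving?", default=True)
    output = click.prompt("Output filename", default=f"{subreddit}_{sort_method}_posts.json")
    # Proxy usage isn't asked for: it's decided automatically for jobs over 100 posts
    return {
        "subject": subreddit, "post_count": post_count, "sort_method": sort_method,
        "use_captcha": use_captcha, "output": output,
    }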

Conclusion

Here’s the thing about Reddit: compared to the other well-known platforms, it’s one of the easiest to scrape. Clean JSON endpoints, reasonable rate limits; Reddit actually wants you to scrape its data on its own terms.

However, the difference between a small project and a production-scale one comes down to one key point: different job sizes need different approaches. The goal is not to build the most complex Reddit scraper of all time; it’s to build a system that adapts to the job at hand, regardless of size.

Real Lessons:

  • Use JSON endpoints (/r/subreddit/hot.json) not HTML parsing
  • Handle pagination properly with Reddit’s after token
  • Clean the data early, Reddit’s JSON is messy
  • Automate the complexity: users shouldn’t have to think about whether or not they should be using proxies
  • Start simple, scale smart, add features when you actually need them

In the end, we built this project to scale and to be applicable to different use cases. Feel free to take a look at the code: it’s an open-source Reddit scraper created for builders who want to access data in the easiest way possible.

About the author

Yazan is a Software Engineer at Proxidize with a passion for technology and a love for building things with code. He has worked in several industries, including consulting and healthcare, and is currently focused on proxy technologies.