Twitter Scraper: How to Scrape Twitter for Free

Let’s say you want to know people’s opinions on a specific topic: maybe you want to run sentiment analysis on them for research purposes, or you’re a software engineer who has been tasked with scraping (this is me). There’s no better place to do it than Twitter/X. Millions of people use it every day to tweet about every possible topic under the sun. However, to accomplish this at any meaningful scale (or efficiency), you must scrape, i.e. use code that collects the data for you while you drink coffee or browse the internet.

If you’ve got a tech background, you’ll likely be familiar with web scraping. If not, we’ll walk you through it. This article will be a step-by-step breakdown of how I built an open-source Twitter scraper that allows you to scrape all (or some) of an X account’s posts. Too many guides leave you with just that, so we went a step further and have included an AI integration for analysis of the tweets you’ve scraped — something that might appeal to everyone from researchers and academics to OSINT Twitter.


Despite having far fewer monthly active users (611 million compared to Facebook’s 3 billion), Twitter statistics show that the average Twitter/X user engages with the site longer and more deeply than users of other social media sites. The rules are completely different, too: many people use Twitter at work, follow real-time news and events on it, and more.

As a platform, Twitter/X is one of the toughest to scrape. It throws everything at you: anti-bot detection, captchas, and IP bans, all to prevent people from scraping. But with proxies and IP rotation, everything is possible.

Together we’ll go on the step-by-step journey of how I suffered to deliver this amazing code. We’ll cover the technology stack I used, why cookies are so important, and how to use them properly. I’ll also explain how I figured out Twitter/X’s infinite scrolling, when to stop scrolling, and how to pick up where you left off across multiple sessions. If you’re not interested in the journey and just want the repo, here you go. We also have a Reddit scraper you might be interested in.

Twitter Scraping: The Technology Stack (It’s Not Just About Speed)

For this project I needed to choose an ecosystem that fit the needs of scraping a platform as strict as Twitter/X. My first instinct was to go for speed over quality, which turned out to be a big mistake. I started with Python and Selenium, but the accuracy of the scraped data wasn’t good. So I mixed things up and switched to Python with Playwright, which gave the project a great mix of speed and accuracy.

Python Twitter Scraper: The Obvious Choice 

Python dominates web scraping. If you’ve seen a web scraping tutorial, a data pipeline, or an automation script, chances are high it was written in Python.

Python isn’t just popular because it’s “easy to learn” (though it is). It’s popular because it covers the entire workflow. Web scraping, parsing, processing, analysing, exporting: there are mature libraries for every single step. You don’t have to switch languages halfway through your project; it’s Python from start to finish.

Python’s syntax reads like English, which means less time spent debugging errors (I hope) and more time building. What’s Python’s real power? Its ecosystem.

Selenium, Playwright, BeautifulSoup, Scrapy: every major scraping tool has Python support. The community is everywhere, and when you hit a problem, someone has already solved it on Stack Overflow, or you can just use AI to help you.

Here’s the part people miss: Python isn’t just good at scraping, it’s good at everything that comes after scraping. Some developers say “Python is only for scraping,” and they’re wrong. Python excels at data processing, data analysis, AI integration, and automation. 

What that means for our Twitter/X scraper is that we can:

  • Scrape with Playwright (browser automation + network interception)
  • Process with JMESPath (parsing X’s nested JSON)
  • Analyse with OpenAI (sentiment, topics, trends)
  • Export with Pandas (CSV for spreadsheet users)
  • Store with aiofiles (async file operations)
  • Display with Rich (beautiful terminal output)

And we can do it all in the same language. No context switching. No rewriting. The same codebase from data collection to AI-powered insights. That’s the beauty of Python: everything you need in one ecosystem.
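
As a rough sketch of how those pieces fit together after scraping, here’s what exporting with Pandas and displaying with Rich can look like. The file name and field layout below are illustrative assumptions based on the tweet structure used later in this article, not the repo’s exact code:

import json

import pandas as pd
from rich.console import Console
from rich.table import Table

# Illustrative input file: the JSON dump produced by a scraping session.
with open('tweets_FabrizioRomano.json', 'r', encoding='utf-8') as f:
    tweets = json.load(f)

# Flatten the nested tweet objects into spreadsheet-friendly rows.
rows = [
    {
        'id': t['id'],
        'text': t['text'],
        'created_at': t['created_at'],
        'retweets': t['metrics']['retweet_count'],
        'likes': t['metrics']['favorite_count'],
    }
    for t in tweets
]
df = pd.DataFrame(rows)
df.to_csv('tweets.csv', index=False)

# Rich renders a quick terminal summary of the most-liked tweets.
table = Table(title=f"Top tweets ({len(df)} scraped)")
table.add_column("Likes", justify="right")
table.add_column("Text")
for _, row in df.nlargest(5, 'likes').iterrows():
    table.add_row(str(row['likes']), row['text'][:80])
Console().print(table)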

Selenium Twitter Scraping: A Choice Between the New and the Old

I started this project with Selenium. I’ve used Selenium for many projects and have never had major issues with it. Selenium WebDriver and proxies seemed like the perfect combo for a Twitter scraper. This project was different.

Twitter/X’s anti-bot detection is aggressive and I kept running into walls:

  • Proxy connection issues: Selenium’s proxy authentication required a hacky workaround with Chrome extensions.
  • Getting blocked constantly: Twitter/X detected my scrolling patterns as bot behaviour.
  • Human behaviour simulation: Mimicking natural scrolling in Selenium felt clunky and unreliable.

I tried tweaking delays, randomizing scroll amounts, and rotating user agents, but X kept catching me. The scraper would run for 200–300 tweets, then get blocked. I would restart, and I even came up with the idea of scraping in sessions, but even then I kept getting blocked (frustrating).

That’s when I decided to switch to Playwright mid-project.

Why Playwright to Scrape Twitter?

  • Native proxy support: Built-in authentication without Chrome extensions or hacks (Thank god).
  • Network interception: Capture X’s GraphQL API responses instead of parsing HTML.
  • Better anti-detection: Playwright’s launch flags gave me a real hand in hiding automation.
  • Faster execution: Noticeable speed improvements over Selenium.

So, halfway through development, I reworked the entire codebase to use Playwright: a complete rewrite. Was it worth it?

Absolutely. The results were amazing:

  • Proxy connections are stable and fast
  • No more random blocks from Twitter/X
  • Scraping sessions became 2–3x faster
  • Cleaner code

I felt like Selenium was fighting Twitter/X’s UI. By comparison, Playwright intercepts X’s API. That’s the difference between scraping what users see versus scraping what the application actually uses. I don’t regret starting with Selenium — it helped me understand the problem. But switching to Playwright was the turning point for the project.

Discovering GraphQL: Twitter’s Hidden API Goldmine

Most web scrapers get this wrong: the first instinct is always to scrape the HTML. You load a profile page, find the divs with CSS selectors, and extract the text, and life is good. Then X changes its UI and suddenly all of your selectors return null.

I was doing the same thing with Selenium when I started: dealing with CSS selectors, parsing HTML, and cleaning up text. Then a friend asked me why I wasn’t just taking advantage of the XHR requests. So I opened Chrome DevTools and kept an eye on the Network tab for requests like these:

https://x.com/i/api/graphql/V7H0Br3k.../UserTweets
https://x.com/i/api/graphql/G3KGOASz.../UserByScreenName
https://x.com/i/api/graphql/B9Pw8l1f.../TweetDetail

They weren’t hidden or secret; they were just the endpoints Twitter/X’s front-end uses to load data. It was a real “Eureka!” moment for me. The responses were the clean, structured JSON I wanted. Twitter/X doesn’t render tweets as HTML on the server; it fetches JSON from GraphQL, then renders it in the browser with JavaScript.

What GraphQL Actually Gives Us

When I looked at the GraphQL responses, I realised X’s API returned more data than what was visible on the UI. That’s when I knew I needed to shift my focus to it. It’d be even more useful once I got AI analysis involved — the more information we have the better.

How to Scrape Twitter Profile: Grabbing Everything About the Account

To test out my code, I needed to pick someone’s profile. I chose Fabrizio Romano because the man tweets up to 30 times a day and has had strong opinions throughout his entire career.

Here’s the data our Twitter/X scraper grabbed from his profile:

{
  "data": {
    "user": {
      "result": {
        "rest_id": "330262748",
        "legacy": {
          "screen_name": "FabrizioRomano",
          "name": "Fabrizio Romano",
          "followers_count": 26479397,
          "friends_count": 2649,
          "statuses_count": 64187,
          "verified": true,
          "profile_image_url_https": "...",
          "profile_banner_url": "...",
          "description": "Here we go! ©...",
          "location": "",
          "created_at": "..."
        }
      }
    }
  }
}

And this is the data you can pull from individual tweets, which includes everything from the number of replies to how many views it got:

{
  "legacy": {
    "full_text": "🚨⚠️ Breaking transfer news...",
    "created_at": "Wed Oct 15 11:16:01 +0000 2025",
    "retweet_count": 152,
    "favorite_count": 2436,
    "reply_count": 250,
    "quote_count": 10,
    "entities": {
      "hashtags": [{"text": "TransferNews"}],
      "urls": [{
        "url": "https://t.co/...",
        "expanded_url": "https://..."
      }],
      "media": [{
        "type": "photo",
        "media_url_https": "https://pbs.twimg.com/..."
      }]
    }
  },
  "views": {
    "count": "118042"
  }
}

Concretely, this means that we now have access to the following information:

  • Full tweet text (not truncated)
  • Exact engagement metrics
  • View counts
  • Media URLs
  • Profile images and banners
  • Timestamps in proper ISO format
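
As a quick, hedged illustration of how JMESPath makes this payload easy to work with, here’s how the profile fields from the UserByScreenName response above could be pulled out. The variable names are mine, not the repo’s:

import jmespath

# profile_response stands in for the UserByScreenName payload shown above,
# e.g. the dict returned by `await response.json()` in the interceptor.
profile_response = {
    "data": {"user": {"result": {
        "rest_id": "330262748",
        "legacy": {"screen_name": "FabrizioRomano", "followers_count": 26479397},
    }}}
}

result = jmespath.search("data.user.result", profile_response)
profile = {
    "user_id": result.get("rest_id"),
    "username": jmespath.search("legacy.screen_name", result),
    "followers": jmespath.search("legacy.followers_count", result),
}
print(profile)  # {'user_id': '330262748', 'username': 'FabrizioRomano', 'followers': 26479397}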

How GraphQL Shifted the Paradigm

Making this discovery changed how I approached the problem in an important way.

  • Before: “How do I find the right CSS selectors for this data to extract from the HTML?”
  • After: “How do I intercept the GraphQL responses that X is already fetching?”

The GraphQL API was obviously not designed for scrapers. It was designed for X’s own engineers. As a result:

  • It’s stable: X’s own front-end depends on it, so it doesn’t break often.
  • Full of data: It shows more data than we can see on the UI itself.
  • It’s structured: It has a consistent JSON schema, not random HTML div soup.

This made it a great foundation for good data collection to feed into the AI sentiment analysis.

Network Interception: Capturing Clean JSON Instead of Messy HTML

Finding X’s GraphQL API was a big win, and I think that’s where Playwright won out for me. Selenium can’t intercept network responses out of the box; it’s built for DOM interactions like clicking buttons, filling forms, and finding elements.

If you want to capture API responses in Selenium, you need browser extensions, proxy servers, or hacky workarounds that break often. Playwright has network interception built in.

The Interceptor: One Line that Changes Everything

Here’s the one line of code that made it all possible:

self.page.on("response", self._intercept_response)

That’s it! Now every HTTP response the browser receives triggers your callback function. When X loads its own tweets, you just need to capture the JSON.

How to Intercept X’s API

The snippet below is the actual interceptor I use in the scraper:

async def _intercept_response(self, response: Response):
    try:
        if response.request.resource_type in ["xhr", "fetch"]:
            url = response.url
            
            if 'graphql' in url.lower() or 'api.twitter.com' in url or 'api.x.com' in url:
                
                if 'UserTweets' in url:
                    self.logger.info("Parsing UserTweets response")
                    data = await response.json()
                    self._parse_tweets_from_timeline(data)
                    
                elif 'UserByScreenName' in url:
                    self.logger.info("Parsing UserByScreenName response")
                    data = await response.json()
                    self._parse_user_data(data)
                    
                elif 'TweetDetail' in url or 'TweetResultByRestId' in url:
                    self.logger.info("Parsing TweetDetail response")
                    data = await response.json()
                    self._parse_single_tweet(data)
              
    except Exception as e:
        self.logger.debug(f"Error in response interceptor: {e}")

This runs in the background while the browser scrolls. X loads more tweets and the interceptor catches the responses.

Scraping Twitter with Selenium vs Playwright

Let’s illustrate how differently scraping Twitter/X plays out in practice with Selenium versus Playwright.

Scraping X with Selenium

Before I switched away from Selenium, my setup looked like this:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for at least one tweet element to render before touching the DOM.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '[data-testid="tweet"]'))
)

tweets = driver.find_elements(By.CSS_SELECTOR, '[data-testid="tweet"]')

for tweet in tweets:
    text = tweet.find_element(By.CSS_SELECTOR, '[data-testid="tweetText"]').text

With Selenium, you wait for elements to load, find all the tweet elements, then parse each one and hope the selectors still work. This is how most scrapers operate: targeting HTML.

Intercepting X API with Playwright

By contrast, using Playwright to intercept X’s API is simple.

await self.page.evaluate('window.scrollBy(0, window.innerHeight * 0.8)')

You’re not waiting for elements to render and you don’t have to hunt for selectors. You’re just capturing the data X is already fetching. The browser does the work and we simply intercept the results.

Parsing X’s Timeline Data

When X loads more tweets, here’s the parser that walks the GraphQL response:

def _parse_tweets_from_timeline(self, data: Dict):
    try:
        instructions = jmespath.search(
            'data.user.result.timeline_v2.timeline.instructions', 
            data
        )
        
        if not instructions:
            self.logger.warning("No timeline instructions found")
            return
        
        for instruction in instructions:
            if instruction.get('type') == 'TimelineAddEntries':
                entries = instruction.get('entries', [])
                self.logger.info(f"Found {len(entries)} entries in timeline")
                
                tweet_count = 0
                for entry in entries:
                    entry_id = entry.get('entryId', '')
                    
                    if not entry_id.startswith('tweet-'):
                        continue
                    
                    tweet_result = jmespath.search(
                        'content.itemContent.tweet_results.result', 
                        entry
                    )
                    
                    if tweet_result:
                        parsed_tweet = self._extract_tweet_data(tweet_result)
                        if parsed_tweet and parsed_tweet['id'] not in self.scraped_tweet_ids:
                            self.all_tweets.append(parsed_tweet)
                            self.scraped_tweet_ids.add(parsed_tweet['id'])
                            tweet_count += 1
                
                if tweet_count > 0:
                    self.logger.info(f"Extracted {tweet_count} tweets from this batch")
                    
    except Exception as e:
        self.logger.error(f"Error parsing timeline tweets: {e}", exc_info=True)

Extracting Individual Tweet Data

Once you have the tweet results object, extraction is clean and straightforward:

def _extract_tweet_data(self, tweet_result: Dict) -> Optional[Dict[str, Any]]:
    try:
        if tweet_result.get('__typename') == 'TweetWithVisibilityResults':
            tweet_result = tweet_result.get('tweet', {})
        
        legacy = tweet_result.get('legacy', {})
        tweet_id = tweet_result.get('rest_id', '')
        
        user_result = tweet_result.get('core', {}).get('user_results', {}).get('result', {})
        user_legacy = user_result.get('legacy', {})
        
        media = []
        extended_entities = legacy.get('extended_entities', {})
        for media_item in extended_entities.get('media', []):
            media_info = {
                'type': media_item.get('type', ''),
                'url': media_item.get('media_url_https', '')
            }
            if media_item.get('type') == 'video':
                variants = media_item.get('video_info', {}).get('variants', [])
                video_variants = [v for v in variants if v.get('content_type') == 'video/mp4']
                if video_variants:
                    media_info['video_url'] = max(video_variants, key=lambda x: x.get('bitrate', 0))['url']
            media.append(media_info)
        
        tweet_data = {
            'id': tweet_id,
            'text': legacy.get('full_text', ''),
            'created_at': legacy.get('created_at', ''),
            'user': {
                'username': user_legacy.get('screen_name', ''),
                'display_name': user_legacy.get('name', ''),
                'followers_count': user_legacy.get('followers_count', 0),
                'verified': user_result.get('is_blue_verified', False)
            },
            'metrics': {
                'retweet_count': legacy.get('retweet_count', 0),
                'favorite_count': legacy.get('favorite_count', 0),
                'reply_count': legacy.get('reply_count', 0),
                'quote_count': legacy.get('quote_count', 0),
                'view_count': tweet_result.get('views', {}).get('count', 0)
            },
            'hashtags': [ht.get('text', '') for ht in legacy.get('entities', {}).get('hashtags', [])],
            'media': media,
            'is_retweet': legacy.get('retweeted', False),
            'is_reply': legacy.get('in_reply_to_status_id_str') is not None,
            'scraped_at': time.time()
        }
        
        return tweet_data
        
    except Exception as e:
        self.logger.debug(f"Error extracting tweet data: {e}")
        return None

Built-in Duplicate Prevention

To prevent duplicates, I added a check that skips any tweet whose ID has already been scraped:

if parsed_tweet['id'] not in self.scraped_tweet_ids:
    self.all_tweets.append(parsed_tweet)
    self.scraped_tweet_ids.add(parsed_tweet['id'])

Built-in Proxy Support (Without the Headache)

Web scraping doesn’t strictly require proxies, but if you were to scrape from just one IP at any real scale, you’d get banned quickly. That’s why we use proxies and IP rotation: to stay unblocked and collect as much data as we can.

Another Reason I Didn’t Use Selenium

In Selenium, proxy authentication is a disaster. Basic proxies (no auth) are simple enough:

chrome_options = Options()
chrome_options.add_argument('--proxy-server=http://proxy.com:8080')
driver = webdriver.Chrome(options=chrome_options)

But let’s be real here: you need a proxy with authentication. You can still make that work with Selenium, but you have to jump through one of a few hoops:

  • Create a Chrome Extension: Building one is not that hard, but it will add complexity to the code and you will need to maintain it every now and then.
  • Use a Proxy Server Wrapper: Running a local proxy server that handles authentication, then point Selenium at localhost. More infrastructure. More complexity and of course more things to break.
  • Environment Variables: This works for some tools, but in my case it didn’t; it’s just not reliable enough.

Playwright Proxy Authentication: One Dictionary

Playwright’s proxy setup is one clean dictionary:

browser_args = {
    'proxy': {
        'server': 'your-server',
        'username': 'your-username',
        'password': 'your-password'
    }
}

browser = await self.playwright.chromium.launch(**browser_args)

That’s it. Native username/password authentication — no extensions; no local proxy servers; no environment variable hack.

Here’s the actual implementation from the scraper:

async def initialize(self):
    try:
        self.playwright = await async_playwright().start()
        
        browser_args = {
            'headless': False,
            'args': [
                '--disable-blink-features=AutomationControlled',
                '--disable-dev-shm-usage',
                '--no-sandbox',
            ]
        }
        
        if self.proxy_config and self.proxy_config.get('enable_proxy_rotation'):
            proxy_list = self.proxy_config.get('proxies', [])
            if proxy_list:
                proxy_str = proxy_list[0] 
                parts = proxy_str.split(':')
                
                if len(parts) == 4:
                    host, port, username, password = parts
                    browser_args['proxy'] = {
                        'server': f'http://{host}:{port}',
                        'username': username,
                        'password': password
                    }
                    self.logger.info(f"Using proxy: {username}@{host}:{port}")
                    self.logger.info("Note: First connection through proxy may take 30-60 seconds...")
        
        self.browser = await self.playwright.chromium.launch(**browser_args)
        
        self.context = await self.browser.new_context(
            viewport={'width': 1920, 'height': 1080},
            user_agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
            locale='en-US',
            timezone_id='America/New_York'
        )
        
        self.logger.info("Playwright browser initialized successfully")
        return True
        
    except Exception as e:
        self.logger.error(f"Failed to initialize Playwright: {e}")
        return False

The code above reads its proxy settings from the configuration file, config.ini, which should look something like this:

[PROXY]
enable_proxy_rotation = true
# Format: host:port:username:password
proxy_list = your-proxy
proxy_timeout = 15

The code parses the string, splits it by colons, and passes the result to Playwright.

Automatic IP Rotation: The Secret Weapon

Proxidize’s mobile proxies offer the ability to rotate IP addresses automatically, which helps us make the most of a single proxy when scraping a platform like X. We’re using mobile proxies specifically because they are hard to detect. I set it to rotate every minute, but you can set the rotation interval to whatever you need.

Why does this matter to us?

  • Mobile proxies: They are super helpful — it’s hard to detect them, since they seem like a real IP to the servers, and the barrier to banning them is higher because of CGNAT.
  • Automatic IP rotation: A new IP address every 60 seconds without having to intervene manually is a big plus here.

Normally the first connection takes 30–60 seconds because of connecting to the proxy server, establishing tunnels, and resolving DNS, but after that it becomes very fast.

HTTP vs SOCKS5 Support and the Anti-Detection Stack

Playwright supports both HTTP and SOCKS5 proxies. In our case we are using an HTTP proxy; we didn’t use SOCKS5 because it would add an unnecessary layer of complexity without any additional benefit.

Proxies alone are not enough to scrape Twitter, because X also checks:

  • User Agent: Does it look like a real browser?
  • Viewport Size: Is it a realistic screen resolution?
  • Locale/Timezone: Do location signals match?
  • Automation Flags: Does the browser show signs that it’s being automated?

Luckily, we can have Playwright handle all of this for us:

self.context = await self.browser.new_context(
    viewport={'width': 1920, 'height': 1080},  # Standard desktop resolution
    user_agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    locale='en-US',
    timezone_id='America/New_York'
)

Here we create a new browser context, set the viewport to a realistic resolution, define the User-Agent string, make ‘en-US’ the default language, and set the browser’s timezone to New York.

By adding the next bit of code, we can hide the automation flag.

'args': [
    '--disable-blink-features=AutomationControlled',
]

Put together, this tells Twitter/X that someone in New York is browsing the site from a Mac, which looks like a real user.

Scraping Twitter/X over Multiple Sessions (And Not Getting Banned)

X’s login flow is very strict — it’s like a guard hovering over your shoulder, asking you for your ID every time you want to do anything. It’s infuriating and instantly prompted the question: how do we avoid the constant checks?

Most scrapers treat authentication like a chore they have to repeat. Log in, scrape, close the browser, lose the session. It’s the same story the next time: Log in again, and again, and again. For each login attempt we:

  • Waste 30–60 seconds
  • Give X another chance to flag us for suspicious activity
  • Risk hitting rate limits

I learned that the best way to avoid a ban is to log in only once in a while. Not only does this look more human to Twitter/X, it also decreases the chances of our session being blocked.

The Cookie Strategy

Let’s be real here: cookies are your authentication insurance policy. When you successfully log into X, the browser stores authentication cookies. These cookies are proof that X already knows you and that you are verified. They contain session tokens, user IDs, authentication signatures, and more; in other words, everything about you. So we save those cookies to a file and load them next time to skip the entire login process. After you log in successfully, Playwright lets you export all cookies:

cookies = await self.context.cookies()
Path('playwright_cookies.json').write_text(json.dumps(cookies, indent=2))
self.logger.info(f"Saved {len(cookies)} cookies to playwright_cookies.json")

In one fell swoop, X’s entire authentication state gets exported to JSON. On the next run we can check if there’s a cookie, grab the necessary bits, and add those to the browser to log in.

if Path('playwright_cookies.json').exists():
    try:
        cookies_data = json.loads(Path('playwright_cookies.json').read_text())
        if cookies_data:
            await self.context.add_cookies(cookies_data)
            self.is_logged_in = True
            self.logger.info("Loaded saved cookies - will skip login")
    except Exception as e:
        self.logger.warning(f"Failed to load cookies: {e}")

This saves us a few minutes and verifies our session, which makes our odds of being banned quite low. Sometimes the cookies will expire or be invalidated by X, so we also run a quick test to see whether the cookie’s still valid. I did this by looking for the compose button (SideNav_NewTweet_Button), which only appears when you’re authenticated.
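
Here’s roughly what that validity check can look like in Playwright. The method name is an illustrative sketch, not the repo’s exact code; the data-testid is the compose-button selector mentioned above:

# A sketch of validating saved cookies by looking for the compose button,
# which only renders for an authenticated session (method name illustrative).
async def _cookies_still_valid(self) -> bool:
    await self.page.goto("https://x.com/home", wait_until="domcontentloaded")
    try:
        await self.page.wait_for_selector(
            '[data-testid="SideNav_NewTweet_Button"]', timeout=10000
        )
        return True  # compose button found: the session is still authenticated
    except Exception:
        self.logger.warning("Saved cookies look expired - falling back to login")
        return False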

Avoiding Detection: Looking Human (Enough) While Web Scraping

Every scraper faces the same problem: how do you look more human to the platform you’re scraping? Your browser fingerprint is everywhere. When you visit X, or any other platform, it usually looks for:

  • Your User Agent (browser version, OS)
  • Your screen resolution
  • Your timezone and locale
  • JavaScript capabilities
  • WebGL renderer information
  • Canvas fingerprinting
  • Automation signs or signals (the most important one here)

You can fake most of these, but the one that will kill your scraper outright is getting flagged as automation.

As we know, Selenium and Playwright are both automation tools. They’re designed to help us scrape websites to get the data we want. This is where it becomes difficult to avoid detection. For example, when Chrome launches via Selenium, it literally advertises itself as an automation tool!

navigator.webdriver === true  // "Hi, I'm automated!"

X checks this. If navigator.webdriver is true, you’re done: you’ll be flagged, blocked, and banned. Selenium tries to hide it, but unfortunately it doesn’t work every time.

Playwright solves this problem with a single flag that does most of the work.

browser_args = {
    'headless': False,
    'args': [
        '--disable-blink-features=AutomationControlled',
    ]
}

--disable-blink-features=AutomationControlled tells Chrome to stop advertising the fact it’s being automated. It’s not perfect; advanced fingerprinting can still detect Playwright, but X’s detection is not that strong (yet).

Human-Like Web Scraping: When “Good Enough” Is Good Enough

Notice that I’m not trying to be perfect here; I’m only trying to be good enough. Perfection would require a lot of work on the code that would end up being overkill in most cases. As long as the code’s good enough in the following areas, we can actually avoid detection:

  • Hide any automation flags
  • Use a realistic User Agent
  • Match viewport to user agent
  • Consistent locale/timezone
  • Residential/mobile proxies
  • Human-like scroll timing (3–6 second delays)

That’s good enough for now. You might ask: what if X updates its detection? It will; it always does. When that happens, go back through this list, check what changed, and update your side accordingly. To be fair, it’s a battle that will never end.

Error Handling: For When Things Inevitably Go Wrong

Your code will inevitably break; as developers, that’s something we all know. That’s why a good error handling system makes it easier to fix problems down the line.

With scrapers and proxies, these are the most common problems:

  • Authentication failures: Login broken, cookies expired, account locked
  • Network failures: Proxy timeouts, connection drops, rate limits
  • Parsing failures: GraphQL response changed, data format different
  • Browser failures: Playwright crashes, page won’t load, selectors missing

Each of these categories needs its own error handling, so building a system for it isn’t optional.

Network Failures: Retry and Move On

Proxies time out and connections drop. It happens.

async def _intercept_response(self, response: Response):
    try:
        if response.request.resource_type in ["xhr", "fetch"]:
            url = response.url
            
            if 'graphql' in url.lower():
                if 'UserTweets' in url:
                    try:
                        data = await response.json()
                        self._parse_tweets_from_timeline(data)
                    except Exception as e:
                        self.logger.warning(f"Failed to parse response from {url[:100]}: {e}")
                        
    except Exception as e:
        self.logger.debug(f"Error in response interceptor: {e}")

If one response fails to parse, we log it and move on. The scraper doesn’t stall out because of a single failure.

Parsing Failures: Defensive Extraction

X’s GraphQL responses are nested nightmares. Sometimes fields are empty or missing, or the structure changes.

def _extract_tweet_data(self, tweet_result: Dict) -> Optional[Dict[str, Any]]:
    try:
        if tweet_result.get('__typename') == 'TweetWithVisibilityResults':
            tweet_result = tweet_result.get('tweet', {})
        
        legacy = tweet_result.get('legacy', {})
        tweet_id = tweet_result.get('rest_id', '')
        
        tweet_data = {
            'id': tweet_id,
            'text': legacy.get('full_text', ''),
            'created_at': legacy.get('created_at', ''),
            'metrics': {
                'retweet_count': legacy.get('retweet_count', 0),
                'favorite_count': legacy.get('favorite_count', 0),
                'reply_count': legacy.get('reply_count', 0),
                'view_count': tweet_result.get('views', {}).get('count', 0)
            }
        }
        
        return tweet_data
        
    except Exception as e:
        self.logger.debug(f"Error extracting tweet data: {e}")
        return None

Every .get() has a fallback. If fields are missing, we use the default; if the structure looks unfamiliar, we return None; and, most importantly, we never stop or crash. The scraper will still get results even if the structure changes. The devs among you might want to update the keys to match the new structure.

The Screenshot Strategy

Screenshots are debugging gold. Whenever something breaks, you want to be able to see what the page looked like before it said its last words.

try:
    await self.page.screenshot(path=f"error_{username}_{timestamp}.png")
    self.logger.error(f"Screenshot saved: error_{username}_{timestamp}.png")
except Exception:
    # Screenshots are best-effort; never let debugging code crash the scraper.
    pass

Whenever a login fails or anything unexpected or broken happens, you will have visual evidence to help you debug the issue. The image will be saved as a .png in the root of the project.

Enough to Level a Forest: Logging and Logging and Logging

You will notice throughout the code that I have a lot of loggers. I believe they help a lot in tracking the progress of the scraping. It’s comforting to know that if something goes wrong I can go back to the logs and see exactly what happened.

Pagination Hell: When “Just Scroll Down” Becomes a Nightmare

X doesn’t have pages. It has an infinite scroll that fights back, rate limits you, randomly stops loading, and occasionally just gives up for no apparent reason. Most people think Twitter/X pagination is simple: scroll down, wait for more tweets, repeat. If only.

Here’s what actually happens:

  • Scroll too fast? X stops loading new content (you have been flagged)
  • Scroll too consistently? X will notice (you have been flagged)
  • Reach the “bottom”? X might still have more tweets (or not!), but you’ll need a certain number of scrolls to know for sure
  • Scroll for too long? X’s lazy loading will just stop responding

These are not bugs. This is X’s intentional design to prevent scraping its platform. Pagination on X isn’t a technical problem, it’s psychological warfare between your scraper and X’s anti-bot measures.

Let me spare you some of the suffering and share some tips that might help you to win.

The Infinite Scrolling Problem: No Pages, Just Chaos

Pagination on Twitter isn’t traditional, because you can’t predict what’s coming next. It’s utter chaos.

Here’s the traditional pagination:

Page 1: Tweets 1–20
Page 2: Tweets 21–40
Page 3: Tweets 41–60

Here’s X’s infinite scroll:

Scroll 1: Load 15–20 tweets (maybe)
Scroll 2: Load 8 tweets (why fewer?)
Scroll 3: Load 0 tweets (but there's more!)
Scroll 4: Load 22 tweets (now it works again?)

That being said, I still needed to find a solution. When do we stop? How do we pick up any new tweets along the way without missing any?

The solution we arrived at is good enough: track what you’ve already collected, detect when nothing new is coming in, and know when to stop.

async def _scroll_timeline(self, resume_from_tweet_id: Optional[str] = None):
    self.logger.info("Starting timeline scroll...")
    
    scroll_attempts = 0
    self.scroll_attempts_without_new = 0
    max_scroll_attempts = 5000 
    max_attempts_without_new = 50  
    
    while scroll_attempts < max_scroll_attempts:
        scroll_attempts += 1
        tweets_before = len(self.all_tweets)
        
        await self.page.evaluate('window.scrollBy(0, window.innerHeight * 0.8)')
        
        delay = random.uniform(self.scroll_delay_min, self.scroll_delay_max)
        await asyncio.sleep(delay)
        
        tweets_after = len(self.all_tweets)
        new_tweets = tweets_after - tweets_before
        
        if new_tweets > 0:
            self.logger.info(f"Scroll {scroll_attempts}: +{new_tweets} NEW tweets (total: {tweets_after})")
            self.scroll_attempts_without_new = 0
        else:
            self.scroll_attempts_without_new += 1
            if self.scroll_attempts_without_new >= max_attempts_without_new:
                self.logger.info(f"No new tweets for {max_attempts_without_new} scrolls - stopping")
                break
        
    self.logger.info(f"Scrolling completed after {scroll_attempts} attempts")

This method works. It’s fast and it keeps track of what we’re scraping. After scrolling 50 times without finding a new tweet, we call it a day and stop the process.

Randomized Scroll Delays: Acting Human to Avoid Detection

Bots scroll at perfect intervals, but humans don’t. Let’s compare how a bot and a human scroll.

Bot behavior:

await asyncio.sleep(2)
await self.page.evaluate('window.scrollBy(0, 800)')
await asyncio.sleep(2)
await self.page.evaluate('window.scrollBy(0, 800)')

The perfectly even scroll timing is instantly recognizable, and X will flag you right away.

Human behavior:

self.scroll_delay_min = 3.0
self.scroll_delay_max = 6.0

delay = random.uniform(self.scroll_delay_min, self.scroll_delay_max)
await asyncio.sleep(delay)

By introducing a bit of variance between scrolls, what X sees is a scroll, then 4.7 seconds of nothing, then another scroll. Maybe it’s 3.2 seconds the next time, then another scroll, and so on.

Why 3–6 seconds?

I tested different ranges:

  • 1–2 seconds was too fast and X noticed; we were flagged
  • 2–4 seconds was better, but still too inconsistent; X’s lazy loading couldn’t keep up
  • 3–6 seconds was the sweet spot; fast enough to be efficient and slow enough to look human
  • 5–10 seconds was too slow

It makes sense if you think about it. People rarely scroll consistently, and if you time it, the timing does shake out to be about 3–6 seconds.

The implementation:

delay = random.uniform(self.scroll_delay_min, self.scroll_delay_max)
await asyncio.sleep(delay)

# Real logs from scraping sessions
# Scroll 1: +12 NEW tweets (delay: 4.7s)
# Scroll 2: +8 NEW tweets (delay: 3.2s)
# Scroll 3: +15 NEW tweets (delay: 5.9s)

And there we have it: human-like scrolling. You could take it a step further and add even more randomization, but you run the risk of making it less human. Real users have mostly fixed patterns with small variations; scrolling too randomly risks X noticing and flagging you.

The “50 Scrolls Without New Content” Rule 

X’s infinite scroll has no end. It just never ends, unless you’ve been flagged. So how do you know when to stop?

Bad approach:

is_at_bottom = await self.page.evaluate('window.scrollY >= document.body.scrollHeight')
if is_at_bottom:
    break

This works if you know you’ll eventually hit a bottom, but you’ll never find one on X.

My approach: The 50-scroll rule

self.scroll_attempts_without_new = 0
max_attempts_without_new = 50

while scroll_attempts < max_scroll_attempts:
    # ... scroll logic ...
    
    if new_tweets > 0:
        self.scroll_attempts_without_new = 0  # Reset counter
    else:
        self.scroll_attempts_without_new += 1  # Increment counter
        
        if self.scroll_attempts_without_new >= max_attempts_without_new:
            self.logger.info(f"No new tweets for {max_attempts_without_new} scrolls - stopping")
            break

The logic:

  • Got new tweets? Reset counter to 0
  • No new tweets? Increment counter
  • Counter hits 50? Stop scraping

I tested a bunch of different thresholds and 50 seemed to work the best. Less than 50 was too aggressive or stopped too early. More than 50 meant we were wasting time. 50 scrolls works out to 3–5 minutes of waiting before stopping.

Creating a Checkpoint System (Because Losing Your Progress Sucks)

Interruptions are by definition unforeseen. The internet dies, there’s an error during scraping, and suddenly you’ve lost all your data. By implementing a checkpoint system you can save your progress and pick up where you left off.

Alongside checkpoints, our X scraper also has sessions. You’re not necessarily going to be able to grab every single tweet from a specific account all in one go. Sessions let you resume scraping a profile, which needs its own checkpoints. For example:

  • Session 1: Scraped 800 tweets (Oct 15 -> Sept 1), saved checkpoint
  • Session 2: Resumed from Sept 1, scraped another 800 tweets (Sept 1 -> July 15), checkpoint updated

Each session starts where the previous one stopped and it’s how we can be sure that we’re tracking every tweet, never losing our progress.

What Gets Saved: The Checkpoint File Format

The checkpoint file is just a small JSON file that contains the following:

{
  "total_tweets": 795,
  "oldest_tweet_id": "1962619400537653743",
  "oldest_tweet_date": "Mon Sep 01 20:51:43 +0000 2025",
  "newest_tweet_id": "1978419586904072698",
  "newest_tweet_date": "Wed Oct 15 11:16:01 +0000 2025",
  "session_count": 2,
  "last_session_tweets": 86,
  "username": "username",
  "last_updated": "2025-10-16T08:17:33.504827"
}

There’s nothing complicated happening here. A few important pieces of information are saved so you can continue a scraping session. The most important one is oldest_tweet_id, because that’s where the next session will start.

How do we create this cool JSON file?

# After scraping completes
checkpoint_data = {
    'total_tweets': len(all_tweets),
    'oldest_tweet_id': all_tweets[-1]['id'],
    'oldest_tweet_date': all_tweets[-1]['created_at'],
    'newest_tweet_id': all_tweets[0]['id'],
    'newest_tweet_date': all_tweets[0]['created_at'],
    'session_count': existing_checkpoint.get('session_count', 0) + 1,
    'last_session_tweets': len(new_tweets_this_session)
}

self.checkpoint_manager.save_checkpoint(username, checkpoint_data)

X’s tweet IDs are chronological, i.e. the newer the tweet, the higher the ID number.

  • all_tweets[0] = Newest tweet (highest ID)
  • all_tweets[-1] = Oldest tweet (lowest ID)

Thus, the oldest tweet becomes your resume point.

Resume Flow: Picking Up Where You Left Off

We created a --resume flag that lets you resume scraping from where you stopped. Here is an example of the command:

python main.py user -u username --resume

Step 1: Load the checkpoint

if resume:
    checkpoint = self.checkpoint_manager.load_checkpoint(username)
    if checkpoint:
        existing_tweets = self.checkpoint_manager.load_existing_tweets(username)
        resume_from_tweet_id = checkpoint.get('oldest_tweet_id')
        
        self.logger.info(f"Resuming from checkpoint with {len(existing_tweets)} existing tweets")
        self.logger.info(f"   Will continue from tweet: {resume_from_tweet_id}")
    else:
        self.logger.info(f"No checkpoint found for @{username}, starting fresh")

The console will show that it’s continuing from the last tweet ID, along with any other relevant information.

Step 2: Pass the resume point to the scraper

result = await self.playwright_scraper.scrape_user_tweets(
    username=username,
    resume_from_tweet_id=resume_from_tweet_id 
)

Here the scraper will scroll until it finds the specific ID we stopped at last session.

Step 3: Merge old and new tweets

all_tweets = self.checkpoint_manager.merge_tweets(
    existing_tweets,  
    result['tweets']
)

self.logger.info(f"Merged: {len(existing_tweets)} existing + {len(result['tweets'])} new = {len(all_tweets)} total")

Step 4: Update the checkpoint

new_checkpoint_data = {
    'total_tweets': len(all_tweets),
    'oldest_tweet_id': all_tweets[-1]['id'],
    'oldest_tweet_date': all_tweets[-1]['created_at'],
    'session_count': checkpoint.get('session_count', 0) + 1,
    'last_session_tweets': len(result['tweets'])
}

self.checkpoint_manager.save_checkpoint(username, new_checkpoint_data)

Now the checkpoint points to the oldest tweet from this combined dataset. The next session will resume from there.

Finding the Resume Point: The Needle in the Haystack

X doesn’t let you jump to a specific tweet, so starting where your last session stopped might be the most difficult part of this process.

You can’t tell X to take you to tweet ID 321321312321, but with the checkpoint we can go back to the tweet where we stopped and continue from there.

Here’s how we did it:

async def _scroll_timeline(self, resume_from_tweet_id: Optional[str] = None):
    scroll_attempts = 0
    resume_point_found = False if resume_from_tweet_id else True
    
    while scroll_attempts < max_scroll_attempts:
        scroll_attempts += 1
        tweets_before = len(self.all_tweets)
        
        await self.page.evaluate('window.scrollBy(0, window.innerHeight * 0.8)')
        delay = random.uniform(self.scroll_delay_min, self.scroll_delay_max)
        await asyncio.sleep(delay)
        
        tweets_after = len(self.all_tweets)
        new_tweets = tweets_after - tweets_before
        
        if resume_from_tweet_id and not resume_point_found:
            for tweet in self.all_tweets:
                if tweet.get('id') == resume_from_tweet_id:
                    resume_point_found = True
                    self.logger.info(f"Found resume point at tweet {resume_from_tweet_id}!")
                    self.logger.info(f"   Clearing {len(self.all_tweets)} duplicate tweets...")
                    
                    self.all_tweets.clear()
                    self.scraped_tweet_ids.clear()
                    break
        
        if new_tweets > 0:
            if not resume_point_found:
                self.logger.info(f"Scrolling to resume point... ({tweets_after} tweets checked)")
            else:
                self.logger.info(f"Scroll {scroll_attempts}: +{new_tweets} NEW tweets (total: {tweets_after})")
            self.scroll_attempts_without_new = 0
        else:
            self.scroll_attempts_without_new += 1
            if not resume_point_found and self.scroll_attempts_without_new >= 100:
                self.logger.warning(f"Scrolled 100 times without finding resume point - might not exist")
                break
            elif resume_point_found and self.scroll_attempts_without_new >= 50:
                self.logger.info(f"No new tweets for 50 scrolls - stopping")
                break

The code runs when the --resume flag is passed. First we check if we have a checkpoint. If we do, we grab the ID and scroll until we find that tweet. If there are new tweets we missed last time, we save them after checking for duplicates.

If we don’t find the old tweet we’re looking for, the code fails gracefully on its own, which is better than waiting around for nothing.

AI Analysis Integration (Making Sense of your Data)

You have collected enough tweets. For the sake of argument, let’s say you’ve scraped 3,000–4,000 of them. Perfect. Now you have thousands of lines of JSON waiting for you. Are you going to read them manually? You’d go insane, and working your way through even one file would take forever. Web scraping is hard, but making sense of the data is even harder.

Most scrapers just stop at data collection. They dump JSON files and call it a day, leaving you with raw data and no insight. “Well, that could be useful…” No, it isn’t, not on its own. As a software developer with a product/analytical mindset, I always want to know more about the data I’m collecting:

  • Do the tweets suggest an average sentiment? Are some people especially upset or happy about something?
  • What topics do people most talk about?
  • Which content gets the most engagement?
  • Are there trending patterns over time?

So it’s nice to make sense of the data you have by answering these questions. It will certainly get you further than staring at tens of thousands of lines of JSON.

Handling Data Overload: You Scraped It, Now What?

Scraping just 800 tweets will leave you with more than 200,000 words of text. That’s roughly a 400-page book, which seems like a silly amount of reading to do to get a general idea of “How do people feel about this topic on average?”

I might be the type to do that, to be honest, but normal people would consider that a waste of time. That’s where AI comes in. It reads the tweets and analyses them to give you a better sense of the data you have.

Before AI:

You just have the JSON files. You open them one by one, look at the data, and try to make sense of it, which takes a lot of time and effort.

After AI:

You can use a single command at the start of your scraping:

python main.py user -u username --analyze

The output is a JSON file that has everything you want:

{
  "sentiment": {
    "overall_sentiment": {
      "positive": 61,
      "negative": 21,
      "neutral": 17
    },
    "insights": "Predominantly positive sentiment around transfer news..."
  },
  "topics": {
    "top_topics": [
      {"topic": "Transfer News", "frequency": 0.42},
      {"topic": "Contract Extensions", "frequency": 0.28}
    ]
  }
}

We’ve just saved ourselves tons of time and the output will probably be more accurate than if a human had read it, given that AI can have all the tweets as context for the prompt.

Seven Analysis Types That Actually Matter

So we didn’t just plug in ChatGPT and tell it to “analyze this”. Instead, we built seven specific analysis types, each answering different questions:

  1. Sentiment analysis: Mainly used for brand monitoring, public opinion tracking, etc.
  2. Topic analysis: Content strategy, trend identification
  3. Summary generation: Quick briefings, stakeholder reports
  4. Classification: Helps categorize the data by type, e.g. news, opinion, personal, etc.
  5. Entity extraction: Competitive intelligence, relationships mapping
  6. Trend analysis: Predictive insights, content timing optimization
  7. Engagement analysis: Content optimization, social media strategy

Each type of analysis answers a specific question. You’re not getting generic answers but structured, actionable insight.

Let’s go back to our friend Fabrizio Romano to test this on a real-world example.

{
  "tweet_count": 795,
  "analyses": {
    "sentiment": {
      "overall_sentiment": {
        "positive": 61,
        "negative": 21,
        "neutral": 17
      },
      "individual_sentiments": [
        {
          "tweet_index": 2,
          "sentiment": "positive",
          "confidence": 0.9,
          "reasoning": "Breaking news with heart emoji suggests positive sentiment."
        },
        {
          "tweet_index": 7,
          "sentiment": "negative",
          "confidence": 0.95,
          "reasoning": "Injury context and warning emoji convey negative sentiment."
        }
      ]
    },
    "topics": {
      "top_topics": [
        {
          "topic": "Transfer News",
          "frequency": 0.42,
          "keywords": ["here we go", "confirmed", "deal"],
          "category": "Sports/Football"
        },
        {
          "topic": "Contract Extensions",
          "frequency": 0.28,
          "keywords": ["renewed", "extends", "stays"]
        }
      ]
    }
  }
}

You immediately know:

  • 61% of the tweets are positive (transfer excitement)
  • 21% are negative (injuries, failed deals)
  • Top topic is transfer news (42% of the content)
  • “Here we go” is a signature phrase 

Token Optimization and Smart Batching 

Adding AI is all well and good, but we also need to make sure nothing goes wrong, like accidentally blowing through a pile of money. I introduced a batching system to address this: the data is sent in groups rather than all at once, and since OpenAI charges by token, we only send the information that actually matters to us.

Full tweet object:

{
  "id": "1978419586904072698",
  "text": "🚨⚠️ Breaking transfer news...",
  "full_text": "🚨⚠️ Breaking transfer news...",
  "created_at": "Wed Oct 15 11:16:01 +0000 2025",
  "user": {
    "id": "330262748",
    "username": "FabrizioRomano",
    "display_name": "Fabrizio Romano",
    "followers_count": 26479397,
    "following_count": 2649,
    "verified": true,
    "profile_image_url": "https://...",
    "description": "..."
  },
  "metrics": {...},
  "media": [...],
  "urls": [...],
  "hashtags": [...],
  "scraped_at": 1729012345
}

In the JSON above, there’s a lot of information you’re sending to the LLM that doesn’t really matter, so it’s cheaper to send only the required data.

The solution: Extract only what matters

def _extract_essential_tweet_data(self, tweets: List[Dict[str, Any]]) -> Dict[str, Any]:
    essential_data = {
        'texts': [],
        'engagement_metrics': [],
        'metadata': []
    }
    
    for tweet in tweets:
        text = tweet.get('text', '').strip()
        if text:
            essential_data['texts'].append(text)
            
            metrics = tweet.get('metrics', {})
            essential_data['engagement_metrics'].append({
                'retweet_count': metrics.get('retweet_count', 0),
                'favorite_count': metrics.get('favorite_count', 0),
                'reply_count': metrics.get('reply_count', 0),
                'view_count': metrics.get('view_count', '0')
            })
            
            essential_data['metadata'].append({
                'created_at': tweet.get('created_at', ''),
                'has_media': len(tweet.get('media', [])) > 0,
                'hashtags': tweet.get('hashtags', []),
                'is_reply': tweet.get('is_reply', False)
            })
    
    return essential_data

Now we’re sending:

  • Tweet text (needed for analysis)
  • Engagement metrics (needed for engagement analysis)
  • Minimal metadata (dates, flags)

By doing this, we reduced the payload size by roughly 75–80%. On top of that, the batching system sends the data in groups rather than all at once, which keeps the token count per request under control.
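
The batching itself is simple. Here’s a hedged sketch of the idea; the batch size and function names are illustrative assumptions, not the repo’s exact values:

from typing import Any, Dict, Iterator, List

# Send tweet texts to the model in fixed-size chunks so a single request
# never blows past a predictable token budget (batch size is illustrative).
def batch_tweets(tweets: List[Dict[str, Any]], batch_size: int = 50) -> Iterator[List[str]]:
    texts = [t.get('text', '').strip() for t in tweets if t.get('text')]
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

# Each batch becomes one API call; the per-batch results are merged afterwards.
# for batch in batch_tweets(all_tweets):
#     results.append(analyze_sentiment_batch(batch))  # hypothetical per-batch call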

Structured Prompts 

Having a good prompt is essential if you want to get good results. Prompts are a science in themselves, but here’s an example of a bad prompt:

prompt = f"Analyze the sentiment of these tweets: {tweets}"

A prompt like this will give you inconsistent results because you’re not being specific enough about what you want. 

By contrast, this is an example of a good prompt.

prompt = f"""
Analyze the sentiment of the following {len(tweets)} tweets.

Provide:
1. Overall sentiment distribution (positive, negative, neutral percentages)
2. Individual tweet sentiments with confidence scores
3. Key emotional themes and patterns

Respond in JSON format with the following structure:
{{
    "overall_sentiment": {{
        "positive": percentage,
        "negative": percentage,
        "neutral": percentage
    }},
    "individual_sentiments": [
        {{"tweet_index": 1, "sentiment": "positive", "confidence": 0.85, "reasoning": "explanation"}}
    ],
    "emotional_themes": ["theme1", "theme2"],
    "insights": "Overall sentiment analysis insights"
}}

Tweets:
{chr(10).join([f"{i+1}. {text}" for i, text in enumerate(tweets)])}
"""

This will give you consistent results. You’re telling the AI specifically what you want it to do and how you want the data structured. From here, you can change the prompt to suit your needs. There’s a class in the code called AnalysisPrompts that contains all of the prompts.
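
For completeness, here’s roughly how a structured prompt like that can be sent to OpenAI and parsed back into a dict. The model name and function name are assumptions for illustration, not the repo’s exact code:

import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_sentiment_analysis(prompt: str) -> dict:
    # response_format forces the model to return valid JSON we can load directly.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)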

Conclusion

Scraping Twitter/X isn’t straightforward. The platform has strict rate limits and strong bot detection — a clear attempt to prevent web scraping. It’s easy enough to build a system that collects data, but it’s much harder to build a system that can handle a variety of different workloads.

Key takeaways:

  • Use Playwright, not Selenium: network interception beats HTML parsing. X’s UI changes on a weekly basis.
  • Intercept GraphQL responses: Stop parsing HTML. Capture the JSON X’s front-end already fetched.
  • Save cookies, avoid re-authentication: Login once, reuse sessions for weeks.
  • Randomize everything: Scroll delays (3–6 seconds), timing patterns, human-like behavior.
  • Implement checkpoints: Always save your progress so you never lose a session.
  • Use proxies from day one: Auto-rotating mobile proxies are super important.
  • Start simple, scale smart: Don’t go crazy on the first try; start step by step, then scale from there.

The difference between a scraper that gets 100 tweets and one that gets 10,000+ is a focus on resilience over perfection. The goal isn’t to build the most complex and advanced scraper, but to build one that’s good enough to get the job done.

The scraper isn’t perfect; it’s good enough to get past X’s detection and collect as many tweets as it can. We built this project to be resilient and applicable to real-world use cases. You can find the X scraper repo here.


Frequently Asked Questions

Do I need proxies to scrape Twitter/X?

For small projects (< 500 tweets), you might not need them. For anything serious, or at scale, you absolutely need proxies. X can track your IP and block you if you scrape too many tweets at a time.

Will Twitter/X ban me for web scraping?

If you scrape like a bot, you will be banned. If you scrape like a human, probably not. X can see the patterns in your scraper’s behavior and connect the dots. If it sees something very suspicious it will block you right away.

Do I need to provide my X credentials?

Yes. The scraper logs into your account to access the timelines. You need to provide a username and password to start scraping.

What happens when my cookies expire?

The scraper detects expired cookies automatically and re-authenticates.

Can I contribute to this project?

Yes. It’s an open-source project.
