Web Scraping With Selenium and Python

July 19, 2024

Imagine you are browsing an ecommerce website and you’re suddenly struck by a billion-dollar idea for a dropshipping company and you realize that to accomplish it, you need to scrape websites like the one you’re looking at to keep an eye on the products and provide the users with the best products.

Maybe you know a little about web scraping already and realize that this particular website (and many other ecommerce sites like it) are really hostile towards scrapers. They have a lot of complicated anti-bot features and check for things like “human-like behavior” — this is where web scraping with Selenium comes in.

Selenium is an automation framework that plays a prominent role in web scraping, which is clear when you take into account that around 1.8 million people were reportedly using it in 2024. What makes it so special — especially in today’s AI boom — is how it can mimic almost all aspects of human web browsing behavior.

Another advantage it offers developers is the fact it’s so adaptable, regardless of how complicated, modern, or dynamic a website gets. On top of that, it also works seamlessly with proxies. Its use cases can be summarized in a short list, but the depths of how useful it is in those situations is fascinating.

What is Selenium?

Selenium isn’t actually a single tool, it’s three, which together form an open-source framework for browser automation that can do (almost) everything a human can when browsing a website. Alongside the framework, a whole ecosystem of open-source projects have sprung up around Selenium.

Selenium’s sole purpose is to make it easier for people to do browser automation and browser testing. It’s now also widely used in web scraping as well, since it can be made to mimic human behaviour. Not only does it work with a variety of programming languages, it has the ability to bind several languages together.

Selenium started as a one-man project. Today, the project is overseen by a leadership committee and maintained by a small army of volunteers. As of 2018, the way Selenium WebDriver communicated with browsers was formalized as an internet standard, showing its lasting impact on how automation is done.

While all three main components of Selenium are important, for the purposes of web scraping, one stands out among the others, which is the one we’ll focus on now.

What is Selenium WebDriver?

Selenium WebDriver is the core component of the Selenium that is used for automating and creating human-like interactions with web browsers. It acts as a bridge that directly communicates with browsers, sending commands and receiving responses without the need for human interaction or an intermediary.

Selenium Language Bindings

Selenium WebDriver has language bindings for several languages, which means you can write code in Java, Python, C#, JavaScript, and more, and still interface with WebDriver. Having the choice of working in multiple languages is a huge advantage for developers, as each builder or team or company has their own preferences.

Using Selenium and Python for web scraping is particularly popular, which is unsurprising if you consider the hype around LLMs and AI — for which Python is the primary language. Taken together with how easy Python is to use and learn, and its own rich ecosystem of web scraping libraries, you can start to see why they’re a natural pairing.

Browser Drivers and Supported Browsers

WebDriver has drivers that provide browser-specific APIs to translate Selenium commands into browser actions. Each browser has its own dedicated driver:

ChromeDriver for Google Chrome and Chromium-based browsers
GeckoDriver for Mozilla Firefox
EdgeDriver for Microsoft Edge
SafariDriver for Safari on macOS
OperaChromiumDriver (an offshoot of ChromeDriver) for Opera

Each driver must be compatible with the installed version to make sure everything works smoothly and efficiently.

Selenium Headless Mode and Its Benefits

A headless browser is a browser that runs without a graphical user interface (GUI). It allows the user to control the browser session through command-line interface (CLI) and it’s very useful when running automated tasks.

Benefits of headless mode include:

Improved resource usage (CPU, memory)
Faster execution since no GUI rendering occurs
Sometimes better stealth in automated scraping scenarios

Most modern browsers supported by Selenium support headless mode via specific WebDriver options.

Python Example: Using Selenium WebDriver

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)

driver.get("https://www.python.org")

print(driver.title)

search_box = driver.find_element(By.NAME, "q")
search_box.clear()
search_box.send_keys("web scraping with selenium")
search_box.submit()
driver.quit()

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)

driver.get("https://www.python.org")

print(driver.title)

search_box = driver.find_element(By.NAME, "q")
search_box.clear()
search_box.send_keys("web scraping with selenium")
search_box.submit()
driver.quit()

The example above shows how Python works with a headless Chrome Browser, navigating to a webpage, interacting with elements, and closing the driver session with Python.

a drawing of a robot and a man sitting at a laptop under the title

Getting Started Web Scraping with Selenium and Python (Headless)

Let’s talk about how to set up your code to start working with Selenium with Python. Because we are using Python, we are going to use Poetry as the dependency manager. I recommend Poetry specifically because it’s fast, has a rich community, is really helpful with AI and Python, and makes managing libraries easy.

Download and Configure Browser Driver

First, we need to install the driver for the browser you wish to use. For the purposes of this example, we’ll be using ChromeDriver for Chrome. Make sure you have the driver version that matches the version installed on your machine.

On Google Chrome, you can check the version of your browser by typing the following into your address bar: chrome://settings/help
On Mozilla Firefox, you can check what browser version you have by going to:
☰ > Help > About Firefox
On Microsoft Edge, you can check what browser version you’re using by typing the following into your address bar: edge://settings/help
On Safari, you can check the version by going to: Menu > Safari > About Safari
On Opera, you can see what version you have installed by typing the following into your address bar: opera://about

Installing Selenium with Poetry

Create a new project with Poetry or add Selenium to an existing project:

poetry init
poetry add selenium

poetry init
poetry add selenium

In the first command we installed poetry in a project and in the second command we added Selenium to an existing one.

How to Set Up Headless Chrome in Python

Now we are going to run the code using the headless mode which is ideal for web scraping that doesn’t require a visual UI for the user. Not having to load a UI makes it much more efficient in terms of resources and it also makes the process faster.
Example Python script to initialise Selenium with Chrome Headless mode:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")

service = Service("/path/to/chromedriver")

driver = webdriver.Chrome(service=service, options=chrome_options)

driver.get("https://proxidize.com")

print("Page title is:", driver.title)

driver.quit()

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")

service = Service("/path/to/chromedriver")

driver = webdriver.Chrome(service=service, options=chrome_options)

driver.get("https://proxidize.com")

print("Page title is:", driver.title)

driver.quit()

Python and Poetry are the perfect mix for your project if you are looking for a tool that has good management for packages and Python for efficient performance and easy maintenance.

Web Scraping JavaScript-Rendered Content with Selenium

Nowadays modern websites are complex and super dynamic, which relies heavily on JavaScript to load or update content after an HTML page has been initialized. This creates challenges for web scraping communities, because the old way of web scraping relies heavily on static HTML. Now many important data might only appear after some conditions like interactions (scrolling, pressing a button, etc.) have been met or a script has been executed.

Selenium is unique and is built for these kinds of challenges: it automates a real-world browser, which means it acts like a real web browser in terms of fully rendering JavaScript. WebDriver can execute asynchronous calls and mimics user behaviors, which allows you to access dynamically loaded content right away.

Waiting for Elements to Load

A key technique in modern web scraping is to wait until a specific element appears, which has the content you need. Selenium has two built-in wait strategies: implicit and explicit waits.

Implicit Waits: The code will wait for a set amount of time for the element you want to appear.

Coding Example:

driver.implicitly_wait(2)

driver.implicitly_wait(2)

Explicit Waits: Instead of only waiting, you can add to the code the specific element you’re waiting for. WebDriver will check to see if the element is present, and if not, it will wait. This happens on a loop until the condition has been met before exiting the loop.

Coding Example:

wait = WebDriverWait(driver, timeout=2)
wait.until(lambda _ : revealed.is_displayed())

wait = WebDriverWait(driver, timeout=2)
wait.until(lambda _ : revealed.is_displayed())

Interacting with Dynamic Elements

Selenium can simulate complex user interactions to trigger JavaScript-driven content loading, including clicking buttons, filling fields, and scrolling. For instance, scrolling to the bottom of the page might trigger lazy loading or infinite scroll.

Code Example:

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

Say you want to log into a website. With Selenium you can do that, as filling fields is one of its key features. As a developer you’ll have to identify the HTML selectors for these inputs and define the values for each input, in this case username and password.

Code Example:

username_input = driver.find_element(By.ID, "username")
username_input.clear()
username_input.send_keys("myusername")

password_input = driver.find_element(By.ID, "password")
password_input.clear()
password_input.send_keys("mypassword")

login_button = driver.find_element(By.ID, "login-btn")
login_button.click()

username_input = driver.find_element(By.ID, "username")
username_input.clear()
username_input.send_keys("myusername")

password_input = driver.find_element(By.ID, "password")
password_input.clear()
password_input.send_keys("mypassword")

login_button = driver.find_element(By.ID, "login-btn")
login_button.click()

In the example above we identified the username and password selectors, then we added the values and logged in with a click.

A person in a hoodie typing at a laptop under the title

How to Use a Proxy With Selenium in Python

When you’re doing web scraping with Selenium and Python, especially at scale, you’ll eventually need to use proxies, which allow you to borrow another device’s IP address. This is inevitable if you’re doing any sort of automation or web scraping.

What type of proxy you choose should generally be in function of the sites you’re web scraping with Selenium. Your first choice should generally be a residential proxy, unless the site(s) you’re targeting have strong anti-bot measures or your room for error is very limited (usually social media sites, payment sites, financial sites, etc.). If you have to minimize your chances of being IP banned, choosing a mobile proxy is probably a better choice.

Running Selenium Webdriver with a Proxy in Python

A proxy acquired from a proxy provider will generally require authentication via username and password. For our example, we’ll be using a mobile proxy, for obvious reasons.

Using mobile proxies with Selenium and Python is relatively easy. The proxy credentials often follow a format combining host, port, username, and password, like this:

http://username:password@proxyhost:port

http://username:password@proxyhost:port

Now that we have our proxy, we can start writing our Python code to set it up. Some providers give you the option of rotating your IP via the dashboard but you can also do that from your code if you have multiple proxies.

Using Selenium Wire allows you to guarantee that you can use proxies directly in your Python Selenium scripts.

from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

proxy_host = "host"
proxy_port = 20000
proxy_user = "usename"
proxy_pass = "password"

proxy_url = f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}"

seleniumwire_options = {
    'proxy': {
        'http': proxy_url,
        'https': proxy_url,
        'no_proxy': 'localhost,127.0.0.1'  # Exclude local addresses from proxying
    }
}

chrome_options = Options()
chrome_options.add_argument("--headless")  # Optional headless mode

service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service, options=chrome_options, seleniumwire_options=seleniumwire_options)

driver.get("https://httpbin.org/ip")  # Verify proxy IP is used

print(driver.page_source)  # Should show the IP address of the mobile proxy

driver.quit()

from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

proxy_host = "host"
proxy_port = 20000
proxy_user = "usename"
proxy_pass = "password"

proxy_url = f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}"

seleniumwire_options = {
    'proxy': {
        'http': proxy_url,
        'https': proxy_url,
        'no_proxy': 'localhost,127.0.0.1'  # Exclude local addresses from proxying
    }
}

chrome_options = Options()
chrome_options.add_argument("--headless")  # Optional headless mode

service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service, options=chrome_options, seleniumwire_options=seleniumwire_options)

driver.get("https://httpbin.org/ip")  # Verify proxy IP is used

print(driver.page_source)  # Should show the IP address of the mobile proxy

driver.quit()

The code above ensures your Selenium will use authenticated mobile proxies, the IP rotation of which you can easily control (depending on your provider).

Best Practices for Using Proxies When Web Scraping with Selenium

Using and picking up the right proxies is crucial while using Selenium, because having a good provider can prevent IP bans, detection, and geo-blocking. Here are the best practices to keep in mind to have the best outcomes:

Choose High-Quality Proxies

Always go with high premium providers. Avoid any free proxies to ensure higher stability, speed, and uptime.
IP rotation is super important, so pick a provider that has this feature built in, to avoid IP bans and to better manage your proxies.
Consider having mobile proxies or residential proxies to mimic real-user IPs and reduce the detection risk.

Validate Proxies Before Use

Make sure to have a health-check before starting any action, just to make sure that your operations will work as expected.
Keep checking whether your IP has been blocked or blacklisted that fail repeatedly or show slow response time.

Monitor Latency and Errors

Keep an eye on the logs and proxies performance in general to ensure the best results.
Implement a fallback mechanism to switch proxies in case of a failure or a timeout.

A drawing of a person looking at two screens with error signs under the title

Common Issues and Troubleshooting Selenium

While Selenium is such a powerful tool, you will encounter issues that will interrupt or degrade script performance, like with anything else. Being familiar with these problems is essential to prevent them from happening and keeps your script as stable and reliable as much as possible.

Element Not Clickable/Interactable

Sometimes your script might not be interacting with an element correctly. Mistaking a slider for a button is a common example. It usually happens when Selenium tries to interact with elements that might be hidden, disabled, or overlapped by other elements.

Fixes:

You should go to inspect in the browser and highlight the element you want then take the elements and essentially guide Selenium on how to interact with that specific element.

NoSuchElementExeption

You might get these errors while trying to target elements that cannot be found on the page like a button, or a text field.

Fixes:

Make sure you’re not targeting incorrect or outdated locators for the elements.
Make sure the page is fully loaded (or has loaded the element you’re targeting).

Element References Exception

Happens if the DOM changes after an element is located, invalidating the references.

Fixes:

Try to relocate the elements before interacting with them again.
Use try-catch to refresh element references when an exception occurs.

Async Loading and Wait Problems

As we said earlier, modern websites load content asynchronously using JavaScript, so elements might not show up or be ready to be used by Selenium when it tries to access it. Selenium WebDriver does not automatically wait for these asynchronous actions, leading to field interactions or script errors.

Fixes:

Avoid using fixed delays as time.sleep() as they are unreliable and inefficient.
Use explicit waits with exceptions targeting elements visibly or clickability.

Of course there are other issues that are mentioned above, but these at the top are the most common problems you might encounter while dealing with Selenium.

Conclusion

When web scraping, Selenium is a strong choice for you as a developer. It’s a go-to choice for web scraping and browser automation, especially when handling very dynamic websites. If used together with proxies, even the most rigid anti-bot measures can be overcome.

While comprising three distinct components, we focused primarily on Selenium WebDriver, which is the interface between your code — that can be in one of 8 languages from Haskell to Go — and essentially every modern browser, allowing for automation via headless mode.

Key takeaways:

Use explicit waits to handle asynchronous loading and dynamic page elements efficiently.
Rotator proxies regularly to prevent IP bans.
Keep the driver and browser version in sync to avoid compatibility issues.
Integrate authenticated proxies using tools like Selenium Wire for secure and flexible proxy management.

By keeping these key points in mind, together with the others we’ve covered in this article, you can overcome many typical challenges they might encounter while doing browser automation or web scraping with Selenium.

Frequently Asked Questions

Can Selenium be used for web scraping?

Yes, Selenium can be used for web scraping and is in fact one of the best tools available for that purpose.

How do I handle dynamic content loading in Selenium?

Use explicit waits to wait for elements to be load/become visible before interacting with them to prevent any problems or failures.

What is the difference between implicit waits and explicit waits?

Implicit waits set a global delay for the element’s search, while explicit waits look for specific elements to appear in specific conditions.

How do I integrate proxies with Selenium?

You can do that by configuring proxies using browser options or libraries like Selenium Wire.

How do I handle iframes in Selenium?

This will require you to use Selenium WebDriver to focus on a desired iframer before you interact with a specific element, but to achieve that you need to use a method called driver.switchTo().frame().

Web Scraping With Selenium and Python

What is Selenium?

What is Selenium WebDriver?

Selenium Language Bindings

Browser Drivers and Supported Browsers

Selenium Headless Mode and Its Benefits

Getting Started Web Scraping with Selenium and Python (Headless)

Download and Configure Browser Driver

Installing Selenium with Poetry

How to Set Up Headless Chrome in Python

Web Scraping JavaScript-Rendered Content with Selenium

Waiting for Elements to Load

Interacting with Dynamic Elements

How to Use a Proxy With Selenium in Python

Running Selenium Webdriver with a Proxy in Python

Best Practices for Using Proxies When Web Scraping with Selenium

Choose High-Quality Proxies

Validate Proxies Before Use

Monitor Latency and Errors

Common Issues and Troubleshooting Selenium

Element Not Clickable/Interactable

NoSuchElementExeption

Element References Exception

Async Loading and Wait Problems

Conclusion

Frequently Asked Questions

Can Selenium be used for web scraping?

How do I handle dynamic content loading in Selenium?

What is the difference between implicit waits and explicit waits?

How do I integrate proxies with Selenium?

How do I handle iframes in Selenium?

Related articles

5 Criteria for Choosing the Best Proxy Server

Gather Insights With Mobile Proxies During Market Research

Using cURL with Python

What to Expect: