Skip to main content
Web Scraping & Automation7 min readJun 14, 2024

Basics of Web Scraping with Beautiful Soup

Zeid Abughazaleh
Zeid Abughazaleh

Jun 14, 2024

Web scraping is an important technique for extracting information from websites, allowing users to gather data efficiently and systematically. By automating the process of collecting data, web scraping can save time and provide access to large amounts of information that would be impossible to gather manually.

In this article, we will put our knowledge into practice with the basics of web scraping with Beautiful Soup, a powerful Python library designed for parsing HTML and XML documents. For a comprehensive understanding of web scraping, you can read our previous articles on web scraping and web scraping tools. They provide a solid foundation for anyone new to the field and looking to hit the ground running. If you want to understand how web crawling differs from scraping before diving in, we cover that separately.

what is beautifulsoup

What is Beautiful Soup?

Beautiful Soup is a versatile library in Python that can be used for web scraping. It pulls data out of HTML and XML files and creates a parse tree from page source codes that can be used to extract data easily. This lends itself well to applications in research, business intelligence, and competitive analysis, among others.

HTML and XML are standard languages used to create and structure data on the web. HTML  is mainly used to create web pages and web applications. It defines elements like headings, paragraphs, and links. XML, on the other hand, is designed to store and transport data in a way that is both readable by humans and machines.

A parse tree, in the context of web scraping, is a hierarchical structure that represents the syntax of a document. It breaks the document down into its constituent parts, such as tags, attributes, and text, in a tree-like format. This structure allows for easy navigation of the document, making data extraction more efficient.

When you make an HTTP request to a webpage, the server responds with the page’s HTML content — which you can see by looking at the page source of any web page. Beautiful Soup takes this raw HTML and transforms it into a parse tree, which lets you find specific elements as needed.

Compared to other web scraping libraries, Beautiful Soup has its advantages. Unlike Selenium, which is designed for interacting with dynamic and JavaScript-heavy web pages, Beautiful Soup excels at parsing and extracting data from static HTML content. It is also simpler and more lightweight than comprehensive frameworks like Scrapy.

HTML Structure

Understanding HTML Structure

HTML documents are structured with tags, which we’ve learned to identify elements like headings, paragraphs, links, and images. Each tag is enclosed in angle brackets (e.g. <tag>), and many elements have opening and closing tags (e.g. <p></p> for paragraphs). Tags can also have attributes that provide additional information about the element, such as id, class, and src for images.

Understanding HTML is important for web scraping because it lets you navigate to and extract the specific data you need from the page. This, in turn, will enable you to write more precise scripts.

Here is a simple example of HTML.

bash

And here you can see how the HTML translates to a web page.

Now that we have a basic understanding of HTML structure, let’s apply it in a hypothetical situation.

Guide to web scraping with beautifulsoup

Step-By-Step Guide to Web Scraping With Beautiful Soup

Web scraping with Beautiful Soup and mobile proxies is a powerful combination for extracting data from websites while maintaining anonymity and avoiding IP bans. Here’s a step-by-step guide on how to use Beautiful Soup with mobile proxies:

Step 1: Install Python

Make sure you have Python installed on your system. If not, download and install it from the official Python website.

Step 2: Install Required Libraries

Make sure you have Python installed on your system. You will need to install the following libraries, if you haven’t already, using pip (Python’s package manager):

python

Step 3: Import Libraries

In your Python script, import the necessary libraries:

python

Step 4: Make an HTTP Request

Use the requests library to make an HTTP request to your target URL. In this case we’ll be using the hypothetical link https://example.com/product-page.

We’ll go one step further and include a way to check whether the request was successful, too.

In the event the request is successful, we’ll create an object called soup from the HTML document. If it is unsuccessful, our code will give us an error message.

python

Step 5: Scrape Product Name and Price with Beautiful Soup

Now we can use Beautiful Soup to parse and extract the names and prices of all the products on the page.

We’ll start by creating lists to store the names and prices of each product.

For the sake of this example, we’ll assume that each product on the page is identified by the class product, its name is noted as product-name and its price as product-price.

In reality, these will be different on the website you want to scrape from. You’ll have to amend your script to reflect this.

python

For every product on the page, the script will find the name and price and add it to the lists.

The .text attribute in the code will remove the HTML tags and return only the text inside the element.

Step 6: Store Scraped Data in an Excel Sheet

Next, we’ll store the product’s name and price in a DataFrame and save it to an Excel sheet.

Then we’ll have our script tell us whether the data was successfully saved or not.

python
Web Scraping challenges

Common Web Scraping Challenges

Web scraping is rarely straightforward. There are many challenges that will hamper your activities, and you’ll have to account for each of them.

One common challenge is missing or inconsistent data. Websites may have varying structures and not all the elements you’re looking for will be present on every page. This means your script will have to account for and handle these inconsistencies.

Dynamic content is another obstacle, as some websites load data using JavaScript after the initial HTML page has been delivered. In these cases, libraries like Selenium would be more appropriate,  as they can render JavaScript and access the dynamically loaded content.

Anti-scraping mechanisms like CAPTCHAs are designed to prevent automated access to websites. While a nuisance, using captcha solvers can help bypass them. Websites may also implement rate limiting and IP blocking to combat bots. Using a pool of proxies can distribute your requests over several IPs and reduce the likelihood of being blocked. Choosing the right proxy type for your target matters as much as having enough IPs, our best proxies for web scraping guide covers the decision in detail.

Lastly, websites frequently change their structure, which can break your scripts. This means you’ll need to routinely update your scripts to handle any changes in the website’s layout or structure.

Conclusion

In this article, we covered the basics of web scraping with Beautiful Soup. We learned how to set up our environment by installing the necessary libraries and how to use Beautiful Soup to navigate and extract data from HTML content.

We also explored the structure of HTML documents, understanding how tags, attributes, and text elements are used to build web pages.

With our step-by-step guide, we learned how to make HTTP requests, parse HTML content with Beautiful Soup, and extract product names and prices from a sample e-commerce page. We then demonstrated how to store the extracted data in a DataFrame and save it to an Excel file, making it easy to manage and analyze the collected information.

With this basic guide, you should be more comfortable dipping your toes into starting own web scraping project and effectively gather data from websites.

FAQ

Got questions?
We've got answers.

Quick answers to the most common questions about this topic.

Web scraping is a technique used to extract information from websites automatically. It allows users to gather large amounts of data efficiently, which would be time-consuming to collect manually.

Beautiful Soup is a Python library designed for parsing HTML and XML documents. It simplifies the extraction of data from web pages by creating a parse tree from the page source.

Beautiful Soup works by taking raw HTML content from a webpage and transforming it into a parse tree. This structure allows users to navigate the document easily and extract specific elements as needed.

Web scraping focuses on extracting specific data from web pages, while web crawling involves systematically browsing the internet to index content. They serve different purposes but are often used together.

Beautiful Soup is simpler and more lightweight than comprehensive frameworks like Scrapy and is better suited for parsing static HTML content. In contrast, tools like Selenium are designed for interacting with dynamic web pages.

Ready to launch?

Proxies built for real operations.

For teams that depend on stability, not luck.