How to Parse XML in Python

Knowing how to parse XML in Python is a useful skill when working with structured data. XML is widely used for data storage and transfer because of its flexibility and readability, and parsing it comes in handy for extracting data from a website, processing configuration files, and analyzing large datasets. In this article, we will explain what XML is, look at its structure, and explore five different ways to parse XML in Python, so you have more options to choose from in your next project.

What is XML?

XML, or Extensible Markup Language, is a markup language used in a wide array of applications and systems. It is a structured, hierarchical data format that allows data to be stored and exchanged between platforms and applications. XML was designed to be readable by both humans and machines, which is why its design goals emphasize simplicity, generality, and usability. Its data is organized as a tree, which makes it simple to store and find information. XML is often a preferred format because it works on any operating system and is readable enough that non-developers can follow it.

Understanding the Structure

Before we go into how to parse XML in Python, it is best to understand the structure of an XML file. For this example, we will be using an XML file from Microsoft that shows a list of books. There are two ways to load an XML file: from a link that contains an XML document, or from an XML document saved on your device. The principle remains the same; only the source of the file differs. In this article, all our examples work from the link to the XML file from Microsoft. However, as the URL is an HTML page with the XML embedded as a code snippet, all our scripts first need to extract the XML from the HTML using BeautifulSoup.

An XML file like this one is made up of four kinds of building blocks:

Root Element: At the very top of the document is the root element. It describes what the document holds and contains all the other elements. In our example, this is the <catalog> tag.

Attribute: An attribute is a name-value pair attached to an element's opening tag. In our example, each <book> element carries an id attribute, as in <book id="bk101">, which identifies that specific book within the catalog.

Child Elements: These are the elements nested directly inside the root. In our example, each <book> is a child element of <catalog> and holds the details of one book.

Sub-Elements: A child element can contain further elements of its own. The <book> child element contains the sub-elements author, title, genre, price, publish_date, and description.
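
To make the four parts concrete, here is a minimal sketch that parses a shortened, two-book version of the catalog with ElementTree and prints each part. The inline SAMPLE_XML string is an abbreviated stand-in written for this illustration, not the full Microsoft file.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for the books catalog (assumption: trimmed for illustration).
SAMPLE_XML = """<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
  </book>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
  </book>
</catalog>"""

root = ET.fromstring(SAMPLE_XML)

print("Root element:", root.tag)
for book in root:  # each direct child of the root
    print("Child element:", book.tag)
    print("  Attribute id:", book.attrib["id"])
    for sub in book:  # the sub-elements inside each book
        print("  Sub-element:", sub.tag, "=", sub.text)
```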

Ways to Parse XML in Python

Python offers both built-in and third-party libraries for parsing XML. The built-in options covered here are ElementTree, miniDOM, and SAX, while the third-party options are BeautifulSoup, lxml, and Untangle. We will go through what each library is and how it can be used.

ElementTree

ElementTree is a built-in Python XML parser that provides functions to read, search, and modify XML files. It stores the data in a tree-like, hierarchical structure. The first step is to import xml.etree.ElementTree. To parse a file, pass its path to the parse() function and then call getroot() on the returned tree to retrieve the root element. If the XML is already in a string, pass it to fromstring() instead, which returns the root element directly.

import xml.etree.ElementTree as ET
import requests
from bs4 import BeautifulSoup

# Fetch and extract XML content from the URL
url = "https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms762271(v=vs.85)"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
xml_content = soup.find("pre").text.strip()  # Locate the <pre> tag containing the XML and trim surrounding whitespace

# Parse the XML content
root = ET.fromstring(xml_content)

# Example: Print all child tags
for child in root:
    print(child.tag, child.attrib)

Output:

book {'id': 'bk101'}
book {'id': 'bk102'}
book {'id': 'bk103'}
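
For an XML document saved on your device, the same idea can be sketched with parse() and getroot(). The example writes a tiny two-book sample to a temporary file first so that it runs on its own; the sample content and the temporary path are assumptions for illustration, and in practice you would pass the path of your own file.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Tiny stand-in document (assumption: not the full Microsoft sample).
SAMPLE_XML = """<catalog>
  <book id="bk101"><title>XML Developer's Guide</title></book>
  <book id="bk102"><title>Midnight Rain</title></book>
</catalog>"""

# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False, encoding="utf-8") as f:
    f.write(SAMPLE_XML)
    path = f.name

try:
    tree = ET.parse(path)   # returns an ElementTree object
    root = tree.getroot()   # the <catalog> element
    ids = [book.get("id") for book in root]
    print(root.tag, ids)
finally:
    os.remove(path)         # clean up the temporary file
```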

Similarly, the findall() and find() methods help you locate specific elements in the tree (iter() does the same at any depth). As an example, let us extract the ID, author, title, genre, price, publish date, and description of each book in the file:

import xml.etree.ElementTree as ET
import requests
from bs4 import BeautifulSoup

# Fetch and extract XML content from the URL
url = "https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms762271(v=vs.85)"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
xml_snippet = soup.find("pre")  # Locate the <pre> tag containing the XML

# Check if the XML snippet was found
if not xml_snippet:
    raise Exception("Could not find the XML snippet on the webpage.")

# Parse the XML content
xml_content = xml_snippet.text.strip()
root = ET.fromstring(xml_content)

# Iterate through all books
for book in root.findall("book"):
    book_id = book.attrib.get("id", "N/A")  # Access the 'id' attribute
    author = book.find("author").text.strip() if book.find("author") is not None else "Unknown Author"
    title = book.find("title").text.strip() if book.find("title") is not None else "Unknown Title"
    genre = book.find("genre").text.strip() if book.find("genre") is not None else "Unknown Genre"
    price = book.find("price").text.strip() if book.find("price") is not None else "Unknown Price"
    publish_date = book.find("publish_date").text.strip() if book.find("publish_date") is not None else "Unknown Publish Date"
    description = book.find("description").text.strip() if book.find("description") is not None else "Unknown Description"

    # Print the extracted data
    print(f"ID: {book_id}")
    print(f"Author: {author}")
    print(f"Title: {title}")
    print(f"Genre: {genre}")
    print(f"Price: {price}")
    print(f"Publish Date: {publish_date}")
    print(f"Description: {description}")
    print()

Output:

ID: bk101
Author: Gambardella, Matthew
Title: XML Developer's Guide
Genre: Computer
Price: 44.95
Publish Date: 2000-10-01
Description: An in-depth look at creating applications 
      with XML.

ID: bk102
Author: Ralls, Kim
Title: Midnight Rain
Genre: Fantasy
Price: 5.95
Publish Date: 2000-12-16
Description: A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.

ID: bk103
Author: Corets, Eva
Title: Maeve Ascendant
Genre: Fantasy
Price: 5.95
Publish Date: 2000-11-17
Description: After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.
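
The repeated "find, then check for None" pattern in the script above can be shortened. The following is a small sketch, using an abbreviated inline sample rather than the downloaded page, of two ElementTree methods: iter(), which walks the whole tree for a tag, and findtext(), which returns a default when a child element is missing.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in; the second book deliberately has no <price>.
SAMPLE_XML = """<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <price>44.95</price>
  </book>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
  </book>
</catalog>"""

root = ET.fromstring(SAMPLE_XML)

# iter() yields every element with the given tag, at any depth.
titles = [title.text for title in root.iter("title")]

# findtext() returns the child's text, or the default when the child is missing.
prices = [
    book.findtext("price", default="Unknown Price")
    for book in root.findall("book")
]

print(titles)
print(prices)
```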

MiniDOM

miniDOM, or Minimal Document Object Model, loads the input XML into memory and creates a tree-like structure referred to as a “DOM tree” to store elements, attributes, and text content. Since XML files already have a hierarchical tree structure, miniDOM is convenient for navigating and retrieving information. To get started, import xml.dom.minidom, call parse() for a file or parseString() for a string, and read the root element from the document’s documentElement attribute.

import xml.dom.minidom
import requests
from bs4 import BeautifulSoup

# Fetch and extract XML content from the URL
url = "https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms762271(v=vs.85)"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
xml_content = soup.find("pre").text

# Parse the XML content
xml_doc = xml.dom.minidom.parseString(xml_content)

# Example: Print the root element
root = xml_doc.documentElement
print("Root is", root)

# Example: Print each book's title and author
books = xml_doc.getElementsByTagName("book")
for book in books:
    title = book.getElementsByTagName("title")[0].childNodes[0].data
    author = book.getElementsByTagName("author")[0].childNodes[0].data
    print(f"Title: {title}, Author: {author}")

Output:

Root is <DOM Element: catalog at 0x1829286fbf0>
Title: XML Developer's Guide, Author: Gambardella, Matthew
Title: Midnight Rain, Author: Ralls, Kim
Title: Maeve Ascendant, Author: Corets, Eva

If you wish to print every field of each book, combine two methods. getAttribute() reads an attribute from an element’s opening tag, such as the id of each <book>. getElementsByTagName() returns all the elements with a given tag name, which is how the author, title, genre, price, publish date, and description are retrieved.

import xml.dom.minidom
import requests
from bs4 import BeautifulSoup

# Fetch and extract XML content from the URL
url = "https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms762271(v=vs.85)"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
xml_content = soup.find("pre").text

# Parse the XML content
xml_doc = xml.dom.minidom.parseString(xml_content)

# Get all the book elements
books = xml_doc.getElementsByTagName('book')

# Loop through the books and extract the data
for book in books:
    book_id = book.getAttribute('id')  # Access the 'id' attribute
    author = book.getElementsByTagName('author')[0].childNodes[0].data.strip()
    title = book.getElementsByTagName('title')[0].childNodes[0].data.strip()
    genre = book.getElementsByTagName('genre')[0].childNodes[0].data.strip()
    price = book.getElementsByTagName('price')[0].childNodes[0].data.strip()
    publish_date = book.getElementsByTagName('publish_date')[0].childNodes[0].data.strip()
    description = book.getElementsByTagName('description')[0].childNodes[0].data.strip()

    # Print the extracted data
    print(f"ID: {book_id}")
    print(f"Author: {author}")
    print(f"Title: {title}")
    print(f"Genre: {genre}")
    print(f"Price: {price}")
    print(f"Publish Date: {publish_date}")
    print(f"Description: {description}")
    print()

Output: 

ID: bk101
Author: Gambardella, Matthew
Title: XML Developer's Guide
Genre: Computer
Price: 44.95
Publish Date: 2000-10-01
Description: An in-depth look at creating applications 
      with XML.

ID: bk102
Author: Ralls, Kim
Title: Midnight Rain
Genre: Fantasy
Price: 5.95
Publish Date: 2000-12-16
Description: A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.

ID: bk103
Author: Corets, Eva
Title: Maeve Ascendant
Genre: Fantasy
Price: 5.95
Publish Date: 2000-11-17
Description: After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.
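
Indexing with [0] as in the script above raises an IndexError when a tag is missing or empty. One way to guard against that is a small helper; child_text below is a hypothetical name introduced for this sketch, and the inline sample is an abbreviated stand-in with one empty and one missing element.

```python
import xml.dom.minidom

# Abbreviated stand-in: bk102 has an empty <genre>, and no book has a <price>.
SAMPLE_XML = """<catalog>
  <book id="bk101">
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
  </book>
  <book id="bk102">
    <title>Midnight Rain</title>
    <genre></genre>
  </book>
</catalog>"""


def child_text(parent, tag, default="N/A"):
    """Return the stripped text of the first <tag> under parent, or default."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return default
    return nodes[0].firstChild.data.strip()


doc = xml.dom.minidom.parseString(SAMPLE_XML)
rows = [
    (
        book.getAttribute("id"),
        child_text(book, "title"),
        child_text(book, "genre"),
        child_text(book, "price"),
    )
    for book in doc.getElementsByTagName("book")
]

for row in rows:
    print(row)
```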

SAX Parser

SAX (Simple API for XML) is another way you could parse XML in Python. Unlike miniDOM, it reads the document sequentially and does not need to load the entire tree into memory, which lets you discard items as you go and save memory. SAX is event-driven: the parser calls handler methods as it encounters the start of an element, its text, and its end. To use it, define a custom handler class (BookstoreHandler in the example below) by sub-classing SAX’s ContentHandler and overriding the callbacks for the events you want to handle.

import xml.sax
import requests
from bs4 import BeautifulSoup

# Fetch and extract XML content from the URL
url = "https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms762271(v=vs.85)"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
xml_content = soup.find("pre").text

# Define a custom SAX ContentHandler
class BookstoreHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.current_book = {}
        self.books = []
        self.buffer = ""

    def startElement(self, name, attrs):
        self.buffer = ""  # Start each element with an empty text buffer
        if name == "book":
            self.current_book = {"id": attrs.get("id", "N/A")}

    def characters(self, content):
        # May be called several times per element, so accumulate the raw text
        self.buffer += content

    def endElement(self, name):
        if name in ["title", "author", "price", "description"]:
            # Collapse line breaks and indentation into single spaces
            self.current_book[name] = " ".join(self.buffer.split())
        if name == "book":
            self.books.append(self.current_book)
        self.buffer = ""  # Reset buffer after each element

# Parse the XML content
handler = BookstoreHandler()
xml.sax.parseString(xml_content, handler)

# Print parsed books
for book in handler.books:
    print(book)

Output:

{'id': 'bk101', 'author': 'Gambardella, Matthew', 'title': "XML Developer's Guide", 'price': '44.95', 'description': 'An in-depth look at creating applications with XML.'}
{'id': 'bk102', 'author': 'Ralls, Kim', 'title': 'Midnight Rain', 'price': '5.95', 'description': 'A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.'}
{'id': 'bk103', 'author': 'Corets, Eva', 'title': 'Maeve Ascendant', 'price': '5.95', 'description': 'After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.'}
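
Because SAX reads sequentially, the document does not even have to be available all at once. The sketch below feeds a small inline sample to the parser in 16-character chunks, as if it were arriving over a network. TitleCollector, SAMPLE_XML, and CHUNK_SIZE are hypothetical names chosen for this illustration; feed() and close() are the incremental interface of the parser returned by xml.sax.make_parser().

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler that collects book titles as they stream past."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._parts = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._parts = []

    def characters(self, content):
        # May be called several times for one element, so collect the pieces.
        if self._in_title:
            self._parts.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._parts).strip())
            self._in_title = False


# Small inline stand-in for a document that arrives piece by piece.
SAMPLE_XML = (
    "<catalog>"
    "<book id='bk101'><title>XML Developer's Guide</title></book>"
    "<book id='bk102'><title>Midnight Rain</title></book>"
    "</catalog>"
)

parser = xml.sax.make_parser()
handler = TitleCollector()
parser.setContentHandler(handler)

CHUNK_SIZE = 16
for start in range(0, len(SAMPLE_XML), CHUNK_SIZE):
    parser.feed(SAMPLE_XML[start:start + CHUNK_SIZE])
parser.close()  # signal the end of the document

print(handler.titles)
```

Collecting the text pieces in a list matters here: a title that is split across two chunks reaches characters() in two separate calls.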

BeautifulSoup and lxml

BeautifulSoup, one of the most popular Python libraries for web scraping, is typically used for parsing HTML. However, it can also handle XML. It provides a user-friendly interface to traverse, search, and modify XML documents, and is ideal for quickly extracting specific elements without requiring deep knowledge of XML parsing libraries. BeautifulSoup does not ship its own XML parser, though: its XML support relies on lxml, so both packages need to be installed (“pip install beautifulsoup4 lxml”). The example below selects lxml’s XML parser by passing "lxml-xml" as the second argument.

from bs4 import BeautifulSoup
import requests

# Fetch and extract XML content from the URL
url = "https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms762271(v=vs.85)"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
xml_content = soup.find("pre").text

# Parse the XML content using BeautifulSoup
soup = BeautifulSoup(xml_content, "lxml-xml")

# Example: Find and print all book titles and authors
for book in soup.find_all("book"):
    title = book.find("title").text if book.find("title") else "N/A"
    author = book.find("author").text if book.find("author") else "N/A"
    print(f"Title: {title}, Author: {author}")

Output:

Title: XML Developer's Guide, Author: Gambardella, Matthew
Title: Midnight Rain, Author: Ralls, Kim
Title: Maeve Ascendant, Author: Corets, Eva

Untangle

Untangle is a lightweight library used to parse XML in Python. While traditional parsers require you to navigate hierarchical structures explicitly, untangle converts an XML document into a Python object: child elements are accessed as attributes (for example, catalog.book), an element’s XML attributes are read with square brackets, and its text content is available as cdata. Untangle is a third-party package, so install it by running “pip install untangle” in your terminal before you do anything else.

import untangle
import requests
from bs4 import BeautifulSoup

# Fetch and extract XML content from the URL
url = "https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms762271(v=vs.85)"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
xml_content = soup.find("pre").text

# Parse the XML content with untangle
xml_obj = untangle.parse(xml_content)

# Example: Print all book titles and authors
for book in xml_obj.catalog.book:
    title = book.title.cdata if hasattr(book, "title") else "N/A"
    author = book.author.cdata if hasattr(book, "author") else "N/A"
    print(f"Title: {title}, Author: {author}")

Output:

Title: XML Developer's Guide, Author: Gambardella, Matthew
Title: Midnight Rain, Author: Ralls, Kim
Title: Maeve Ascendant, Author: Corets, Eva

Tips and Tricks to Parse XML in Python

We have explained and presented five different ways to parse XML using Python. This section provides insight into when one parsing library is a better fit than another, along with some helpful tips and tricks for parsing XML in Python.

Most of the time, choosing between the different parsers depends on the size of the XML file you are working with. If you are working with a small to medium-sized XML file, ElementTree will suffice thanks to its speed and ease of use. miniDOM is useful for smaller files as well, especially if you prefer working with DOM-like structures. For context, the XML file we have been experimenting with throughout this article (books.xml) is only about 5 KB, so ElementTree is a great choice for files of that size.

For larger XML files, lxml is the better choice: it is considerably faster than the standard-library parsers and adds functionality such as XPath support. BeautifulSoup can use lxml as its backend, but it does not support XPath itself and still builds the whole tree in memory. For extremely large files (a gigabyte or more) or streamed data, SAX is a good option because it processes data incrementally and keeps memory usage low. If you are looking for the simplest and most Pythonic option, Untangle is the best choice.
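
If you like ElementTree's interface but need incremental processing for a large file, the standard library also offers iterparse(), which this article has not covered so far. The following is a minimal sketch, assuming a small inline sample in place of a real large file; total_price is a hypothetical helper written for this illustration.

```python
import io
import xml.etree.ElementTree as ET

# Small stand-in for a large catalog (assumption: real input would be a file).
SAMPLE_XML = b"""<catalog>
  <book id="bk101"><title>XML Developer's Guide</title><price>44.95</price></book>
  <book id="bk102"><title>Midnight Rain</title><price>5.95</price></book>
  <book id="bk103"><title>Maeve Ascendant</title><price>5.95</price></book>
</catalog>"""


def total_price(source):
    """Sum the <price> of every book, handling one book at a time."""
    total = 0.0
    # "end" events fire once an element and all of its children are complete.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "book":
            total += float(elem.findtext("price", default="0"))
            elem.clear()  # drop the book's children and text once handled
    return total


# io.BytesIO stands in for a large file opened in binary mode.
total = total_price(io.BytesIO(SAMPLE_XML))
print(round(total, 2))
```

Clearing each book after it is processed keeps memory use roughly flat, although the emptied elements themselves stay attached to the root until parsing ends.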

One issue that comes up when parsing remote XML, or when web scraping in general, is an IP ban triggered by sending too many requests and exceeding the server's rate limits. A useful tool for avoiding this is a proxy. It is not necessary when parsing a file saved on your device, but it comes in handy when parsing an XML document through a URL, as we have been doing. If you are retrieving XML data from a remote server, you can route the request through a mobile proxy to hide your IP and avoid rate limits.
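
As a rough sketch of what routing a request through a proxy looks like in code, the example below builds a proxy-aware opener with the standard library's urllib instead of the requests library used in the scripts above (requests accepts a similar dictionary through its proxies argument). The proxy address and credentials are placeholders, build_proxy_opener is a hypothetical helper name, and no request is actually sent.

```python
import urllib.request

# Hypothetical proxy endpoint; replace with your own host, port, and credentials.
PROXY_URL = "http://user:password@proxy.example.com:8080"


def build_proxy_opener(proxy_url):
    """Return an opener that routes HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxy_opener(PROXY_URL)

# With a working proxy, the page could then be fetched like this:
# html = opener.open(url, timeout=10).read()
```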

Conclusion 

Parsing XML in Python allows you to handle structured data for many tasks, including web scraping and data analysis. With the five approaches presented in this article, you should have a solid understanding of the tools available, when to use them, and why. Whether you are working with small files, large datasets, or just want to make things easier for yourself, there is a parser available that will suit your needs.

About the author

Zeid is a content writer with over a decade of writing experience. He has written for publications in Canada and the United States before deciding to start writing informational articles for Proxidize. He has an interest in technology, with a specific interest in proxies.
