3 Ways to Scrape PDF in Python

There are three main ways to scrape PDF files: you can write a script that scrapes a PDF from a URL, scrape directly from a file path, or write a multifunctional scraper that can handle whatever document you feed it through your terminal.

This article will break down the three ways to scrape PDF in Python, giving you a step-by-step guide to writing the code for all three methods and introducing the challenges that might arise when scraping PDF files.

Challenges of Scraping PDF

PDF files contain unstructured data with differences in formatting such as font sizes, styles, and colors. Another challenge is the lack of standardized formatting: PDFs are designed to preserve a specific visual layout, with varying fonts, layouts, and graphic elements, which makes it difficult to extract data accurately because the text is not consistently structured. Occasionally, optical character recognition (OCR) is used to convert scanned documents into PDFs, but it is limited by issues such as image quality, language, and formatting errors. PDFs can also have different layouts with mixed content types, adding a layer of difficulty when parsing and extracting information.

When you decide to scrape PDF files, challenges arise in the form of maintaining varied formats, handling anti-scraping traps, and structuring and formatting the extracted data. Many PDF documents are scanned images, so scrapers cannot read them without an OCR step. Some automated PDF scrapers combine OCR, RPA, pattern and text recognition, and other techniques to scrape PDF reliably.

Error Handling, Performance, and Security

Before you begin to scrape PDF files, error handling is necessary to ensure reliable data extraction, especially when dealing with complex layouts and unstructured formats. PDFs might contain non-linear text flow, missing elements, or scanned images that need OCR processing. Implementing rule-based extraction methods, such as regular expressions for structured text or fallback mechanisms for missing values, can improve accuracy. Logging errors and handling exceptions such as file corruption, encryption, or parser failures keeps the extraction process robust and prevents the script from crashing.
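
As a minimal sketch of that idea, the snippet below wraps the PyMuPDF open call in exception handling and logs problems such as encrypted or corrupt files instead of letting the script crash (the file name is only a placeholder):

import logging
import fitz  # PyMuPDF

logging.basicConfig(level=logging.INFO)

def open_pdf_safely(pdf_path):
    """Open a PDF and return the document, or None if it cannot be read."""
    try:
        doc = fitz.open(pdf_path)
        if doc.needs_pass:
            logging.error("PDF is encrypted and needs a password: %s", pdf_path)
            return None
        return doc
    except Exception as exc:  # covers corrupt files and parser failures
        logging.error("Failed to open %s: %s", pdf_path, exc)
        return None

doc = open_pdf_safely("example.pdf")  # hypothetical file name
if doc is not None:
    print(f"Opened a document with {doc.page_count} pages")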

When scraping large volumes of PDFs, there are performance considerations to keep in mind. Batch processing and careful memory usage can improve speed. Loading only the pages you need instead of the entire document, and using techniques such as multi-threading, can reduce execution time. For large-scale applications, AI-powered extraction solutions or APIs like the GPT-4 Vision API can further improve efficiency and accuracy.
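
For illustration, here is one possible sketch of that approach: it reads only the first page of each document and processes a small batch of PDFs in parallel with a thread pool (the file names are placeholders, and the worker count is something you would tune for your machine):

from concurrent.futures import ThreadPoolExecutor
import fitz  # PyMuPDF

def first_page_text(pdf_path):
    # Load only the first page instead of reading the whole document.
    with fitz.open(pdf_path) as doc:
        return doc.load_page(0).get_text()

pdf_files = ["report_1.pdf", "report_2.pdf", "report_3.pdf"]  # placeholder batch

# Process several PDFs at the same time.
with ThreadPoolExecutor(max_workers=4) as pool:
    for path, text in zip(pdf_files, pool.map(first_page_text, pdf_files)):
        print(f"{path}: {len(text)} characters extracted from page 1")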

There are also security implications when handling sensitive information such as medical records, insurance forms, or business documents. Proper document handling should include security features such as encrypted storage, controlled access, and redaction of sensitive data. Be cautious when extracting data from email attachments, as malicious PDFs can pose security risks. Automated extraction software with built-in validation checks can help ensure that the extracted data is accurate.
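
If you need to redact sensitive values before storing or sharing a document, PyMuPDF's redaction annotations are one way to do it. The sketch below blacks out every occurrence of a sample string; the file name and search term are purely illustrative:

import fitz  # PyMuPDF

doc = fitz.open("insurance_form.pdf")  # hypothetical input file

for page in doc:
    # Find every occurrence of the sensitive value on this page.
    for rect in page.search_for("123-45-6789"):  # hypothetical value to redact
        page.add_redact_annot(rect, fill=(0, 0, 0))  # mark it with a black box
    page.apply_redactions()  # permanently remove the underlying text

doc.save("insurance_form_redacted.pdf")
print("Saved redacted copy")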

Setting Up Script to Scrape PDF

There are six libraries that can help you scrape PDF files, each specializing in a specific form of PDF scraping. For plain text scraping, you would use PyMuPDF, pdfplumber, or pdfminer.six. To scrape PDF tables, Camelot or pdfplumber is a good option. For image-based PDF scraping, pdf2image combined with pytesseract will do the trick. Below, we present scripts for scraping PDF text, tables, and images, along with scripts that scrape a PDF through a URL, directly from a file on your device, or through a scraper you can launch in your terminal.

For starters, open your IDE and create a Python project. Once that is done, install the necessary libraries by entering the following command in your terminal. This will install all the libraries we will be using for this project.

pip install requests PyMuPDF pdfplumber pdfminer.six "camelot-py[cv]" pdf2image pytesseract

It may take a few minutes to install everything, since there are several libraries, so be patient while it all comes through. Requests is a staple library for web scraping and is used here to download the PDF from a URL; it is not needed if you are scraping a PDF from a local file.
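
Note that the scripts in this article rely on PyMuPDF and Camelot, which work on machine-readable PDFs. For scanned, image-based documents you would fall back on pdf2image and pytesseract; a minimal sketch of that OCR route looks like the following (it assumes the Poppler and Tesseract binaries are installed on your system, and the file name is a placeholder):

from pdf2image import convert_from_path
import pytesseract

# Render each page of the scanned PDF as an image, then run OCR on it.
pages = convert_from_path("scanned.pdf", dpi=300)  # placeholder file name

for page_num, page_image in enumerate(pages, start=1):
    text = pytesseract.image_to_string(page_image)
    print(f"\n--- OCR Text from Page {page_num} ---\n")
    print(text)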

How to Scrape PDF

In this section, we will present ways to scrape PDF from a URL, from a file, and through the terminal, covering how to scrape text, images, and tables. For this example, we will be using two sample PDFs from Sling Academy: a text-and-images file for text and image scraping and a text-and-table file for table scraping (the full URLs appear in the scripts below). Be sure to download both PDF documents if you wish to follow this tutorial completely.

Scrape PDF from URL

The script below will scrape PDF content by extracting images and text from the first sample PDF. Once it is run, it will print the text to the terminal and save the images in the same folder as this script. If you wish to reuse this script, all you need to do is change the url variable to the URL of the PDF you want to scrape text and images from.

import requests
import fitz  # PyMuPDF
import io
from PIL import Image

# PDF URL
url = "https://api.slingacademy.com/v1/sample-data/files/text-and-images.pdf"

# Download the PDF
response = requests.get(url)
pdf_path = "downloaded.pdf"

with open(pdf_path, "wb") as f:
    f.write(response.content)

# Open the PDF
doc = fitz.open(pdf_path)

# Extract text and images
for page_num, page in enumerate(doc, start=1):
    text = page.get_text()
    print(f"\n--- Page {page_num} Text ---\n")
    print(text)

    # Extract images
    image_list = page.get_images(full=True)
    for img_index, img in enumerate(image_list, start=1):
        xref = img[0]
        base_image = doc.extract_image(xref)
        image_bytes = base_image["image"]

        # Convert image bytes to PIL image
        img = Image.open(io.BytesIO(image_bytes))

        # Save the image
        img_filename = f"page_{page_num}_image_{img_index}.png"
        img.save(img_filename)
        print(f"Saved image: {img_filename}")

print("\nExtraction complete!")

Below is the script to scrape PDF tables from the second sample PDF.

import requests
import camelot

# PDF URL
url = "https://api.slingacademy.com/v1/sample-data/files/text-and-table.pdf"

# Download the PDF
pdf_path = "downloaded_table.pdf"
response = requests.get(url)
with open(pdf_path, "wb") as f:
    f.write(response.content)

# Extract tables from the PDF
tables = camelot.read_pdf(pdf_path, pages="all")

# Print the number of tables found
print(f"Total tables extracted: {len(tables)}")

# Save each table as a CSV and print the extracted data
for i, table in enumerate(tables, start=1):
    csv_filename = f"table_{i}.csv"
    table.to_csv(csv_filename)
    print(f"\n--- Table {i} ---")
    print(table.df)  # Display the table as a Pandas DataFrame
    print(f"Saved table to {csv_filename}")

This will save each table as a CSV file and print it in the terminal of your IDE. If you wish to scrape a PDF document that includes text, images, and tables directly from a link, this is the script you should use:

import requests
import fitz  # PyMuPDF
import camelot
import io
from PIL import Image
import os

# URL of the PDF
url = "Insert-Your-URL-Here.pdf"

# Download the PDF
response = requests.get(url)
pdf_path = "sample_with_table.pdf"
with open(pdf_path, "wb") as f:
    f.write(response.content)

# Open the PDF with PyMuPDF
doc = fitz.open(pdf_path)

# Directory to save extracted images
image_dir = "extracted_images"
os.makedirs(image_dir, exist_ok=True)

# Extract text and images
for page_num in range(len(doc)):
    page = doc.load_page(page_num)
    
    # Extract text
    text = page.get_text()
    print(f"\n--- Text from Page {page_num + 1} ---\n")
    print(text)
    
    # Extract images
    image_list = page.get_images(full=True)
    for img_index, img in enumerate(image_list, start=1):
        xref = img[0]
        base_image = doc.extract_image(xref)
        image_bytes = base_image["image"]
        image_ext = base_image["ext"]
        image = Image.open(io.BytesIO(image_bytes))
        
        # Save image
        image_filename = f"{image_dir}/page_{page_num + 1}_image_{img_index}.{image_ext}"
        image.save(image_filename)
        print(f"Saved image: {image_filename}")

# Extract tables using Camelot
tables = camelot.read_pdf(pdf_path, pages="all")

# Directory to save extracted tables
table_dir = "extracted_tables"
os.makedirs(table_dir, exist_ok=True)

# Save each table as a CSV file
for i, table in enumerate(tables, start=1):
    csv_filename = f"{table_dir}/table_{i}.csv"
    table.to_csv(csv_filename)
    print(f"Saved table {i} to {csv_filename}")
    print(f"\n--- Table {i} ---\n")
    print(table.df)  # Display the table as a DataFrame

This script should print out the text and the tables while saving the images into the extracted_images folder. Remember to replace the url variable with the URL of your choice.

Scrape PDF from File

If you have your PDF file on your desktop rather than a URL, the script changes slightly to accommodate the new source. The script below will scrape PDF files directly from your device when given the path to the document.

import fitz  # PyMuPDF
import camelot
import io
from PIL import Image
import os

# Path to the local PDF file
pdf_path = r"C:\Users\Name\Documents\text-and-images.pdf"

# Open the PDF with PyMuPDF
doc = fitz.open(pdf_path)

# Directory to save extracted images
image_dir = "extracted_images"
os.makedirs(image_dir, exist_ok=True)

# Extract text and images
for page_num in range(len(doc)):
    page = doc.load_page(page_num)
    
    # Extract text
    text = page.get_text()
    print(f"\n--- Text from Page {page_num + 1} ---\n")
    print(text)
    
    # Extract images
    image_list = page.get_images(full=True)
    for img_index, img in enumerate(image_list, start=1):
        xref = img[0]
        base_image = doc.extract_image(xref)
        image_bytes = base_image["image"]
        image_ext = base_image["ext"]
        image = Image.open(io.BytesIO(image_bytes))
        
        # Save image
        image_filename = f"{image_dir}/page_{page_num + 1}_image_{img_index}.{image_ext}"
        image.save(image_filename)
        print(f"Saved image: {image_filename}")

# Extract tables using Camelot
tables = camelot.read_pdf(pdf_path, pages="all")

# Directory to save extracted tables
table_dir = "extracted_tables"
os.makedirs(table_dir, exist_ok=True)

# Save each table as a CSV file
for i, table in enumerate(tables, start=1):
    csv_filename = f"{table_dir}/table_{i}.csv"
    table.to_csv(csv_filename)
    print(f"Saved table {i} to {csv_filename}")
    print(f"\n--- Table {i} ---\n")
    print(table.df)

If you wish to scrape the tables document, the script remains the same but the pdf_path changes to the path of the tables document. It will look something like this:

import fitz  # PyMuPDF
import camelot
import io
from PIL import Image
import os

# Path to the local PDF file
pdf_path = r"C:\Users\Name\Documents\text-and-table.pdf"

# Open the PDF with PyMuPDF
doc = fitz.open(pdf_path)

# Directory to save extracted images
image_dir = "extracted_images"
os.makedirs(image_dir, exist_ok=True)

# Extract text and images
for page_num in range(len(doc)):
    page = doc.load_page(page_num)
    
    # Extract text
    text = page.get_text()
    print(f"\n--- Text from Page {page_num + 1} ---\n")
    print(text)
    
    # Extract images
    image_list = page.get_images(full=True)
    for img_index, img in enumerate(image_list, start=1):
        xref = img[0]
        base_image = doc.extract_image(xref)
        image_bytes = base_image["image"]
        image_ext = base_image["ext"]
        image = Image.open(io.BytesIO(image_bytes))
        
        # Save image
        image_filename = f"{image_dir}/page_{page_num + 1}_image_{img_index}.{image_ext}"
        image.save(image_filename)
        print(f"Saved image: {image_filename}")

# Extract tables using Camelot
tables = camelot.read_pdf(pdf_path, pages="all")

# Directory to save extracted tables
table_dir = "extracted_tables"
os.makedirs(table_dir, exist_ok=True)

# Save each table as a CSV file
for i, table in enumerate(tables, start=1):
    csv_filename = f"{table_dir}/table_{i}.csv"
    table.to_csv(csv_filename)
    print(f"Saved table {i} to {csv_filename}")
    print(f"\n--- Table {i} ---\n")
    print(table.df)

If you wish to scrape PDF files that contain text, images, and tables, this is the script you should use. Remember to change the pdf_path variable to the path of your PDF document:

import fitz  # PyMuPDF
import camelot
import io
from PIL import Image
import os

# Set the PDF path here
pdf_path = r"C:\Users\Name\Documents\sample.pdf"

# Open the PDF with PyMuPDF
doc = fitz.open(pdf_path)

# Directory to save extracted images
image_dir = "extracted_images"
os.makedirs(image_dir, exist_ok=True)

# Extract text and images
for page_num in range(len(doc)):
    page = doc.load_page(page_num)
    
    # Extract text
    text = page.get_text()
    print(f"\n--- Text from Page {page_num + 1} ---\n")
    print(text)
    
    # Extract images
    image_list = page.get_images(full=True)
    for img_index, img in enumerate(image_list, start=1):
        xref = img[0]
        base_image = doc.extract_image(xref)
        image_bytes = base_image["image"]
        image_ext = base_image["ext"]
        image = Image.open(io.BytesIO(image_bytes))
        
        # Save image
        image_filename = f"{image_dir}/page_{page_num + 1}_image_{img_index}.{image_ext}"
        image.save(image_filename)
        print(f"Saved image: {image_filename}")

# Extract tables using Camelot
tables = camelot.read_pdf(pdf_path, pages="all")

# Directory to save extracted tables
table_dir = "extracted_tables"
os.makedirs(table_dir, exist_ok=True)

# Save each table as a CSV file
for i, table in enumerate(tables, start=1):
    csv_filename = f"{table_dir}/table_{i}.csv"
    table.to_csv(csv_filename)
    print(f"Saved table {i} to {csv_filename}")
    print(f"\n--- Table {i} ---\n")
    print(table.df)

Scrape PDF through Terminal

Finally, we will explore how to create a scraper that runs through your terminal and scrapes a PDF directly when you pass it the path to the file you wish to extract. While you could use any of the scripts provided above and alter the path or URL each time, writing a dedicated scraper saves a bit of time. Once it is written, all you need to do is open your terminal, call the .py file, and follow it with the path to the PDF document. The script to scrape PDF will look like this:

import fitz  # PyMuPDF
import io
from PIL import Image
import os
import sys

def extract_text_and_images(pdf_path):
    # Open the PDF
    doc = fitz.open(pdf_path)

    # Create a directory to save extracted images
    image_dir = "extracted_images"
    os.makedirs(image_dir, exist_ok=True)

    # Extract text and images
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)

        # Extract text
        text = page.get_text()
        print(f"\n--- Text from Page {page_num + 1} ---\n")
        print(text)

        # Extract images
        image_list = page.get_images(full=True)
        for img_index, img in enumerate(image_list, start=1):
            xref = img[0]
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            image = Image.open(io.BytesIO(image_bytes))

            # Save image
            image_filename = f"{image_dir}/page_{page_num + 1}_image_{img_index}.{image_ext}"
            image.save(image_filename)
            print(f"Saved image: {image_filename}")

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python script.py <pdf_path>")
        sys.exit(1)

    pdf_path = sys.argv[1]
    extract_text_and_images(pdf_path)

Run this in your terminal but alter the paths to fit your paths of the .py script and the PDF file:

C:\Users\User\File\FileName\.venv\Scripts\python.exe "C:\Users\User\File\FileName\.venv\FileName.py" "C:\Users\User\Documents\text-and-images.pdf"

If you wish to extract tables from a PDF document, here is the scraper script:

import camelot
import sys
import os

def extract_tables(pdf_path):
    # Extract tables using Camelot
    tables = camelot.read_pdf(pdf_path, pages="all")

    # Create directory to save extracted tables
    table_dir = "extracted_tables"
    os.makedirs(table_dir, exist_ok=True)

    # Save each table as a CSV file
    for i, table in enumerate(tables, start=1):
        csv_filename = f"{table_dir}/table_{i}.csv"
        table.to_csv(csv_filename)
        print(f"Saved table {i} to {csv_filename}")
        print(f"\n--- Table {i} ---\n")
        print(table.df)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python script.py <pdf_path>")
        sys.exit(1)

    pdf_path = sys.argv[1]
    extract_tables(pdf_path)

Here is the terminal command:

C:\Users\User\File\FileName\.venv\Scripts\python.exe "C:\Users\User\File\FileName\.venv\FileName.py" "C:\Users\User\Documents\text-and-tables.pdf"

If you wish to scrape PDF files that include text, images, and tables, the script will look like this:

import fitz  # PyMuPDF
import camelot
import io
from PIL import Image
import os
import sys

def extract_text_and_images(pdf_path):
    doc = fitz.open(pdf_path)

    image_dir = "extracted_images"
    os.makedirs(image_dir, exist_ok=True)

    for page_num in range(len(doc)):
        page = doc.load_page(page_num)

        text = page.get_text()
        print(f"\n--- Text from Page {page_num + 1} ---\n")
        print(text)

        image_list = page.get_images(full=True)
        for img_index, img in enumerate(image_list, start=1):
            xref = img[0]
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            image = Image.open(io.BytesIO(image_bytes))

            image_filename = f"{image_dir}/page_{page_num + 1}_image_{img_index}.{image_ext}"
            image.save(image_filename)
            print(f"Saved image: {image_filename}")

def extract_tables(pdf_path):
    tables = camelot.read_pdf(pdf_path, pages="all")

    table_dir = "extracted_tables"
    os.makedirs(table_dir, exist_ok=True)

    for i, table in enumerate(tables, start=1):
        csv_filename = f"{table_dir}/table_{i}.csv"
        table.to_csv(csv_filename)
        print(f"Saved table {i} to {csv_filename}")
        print(f"\n--- Table {i} ---\n")
        print(table.df)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python script.py <pdf_path>")
        sys.exit(1)

    pdf_path = sys.argv[1]
    extract_text_and_images(pdf_path)
    extract_tables(pdf_path)

The terminal command would be this:

python "C:\Users\User\File\FileName\.venv\Example.py" "C:\Users\User\Documents\Example.pdf"

Conclusion 

Scraping PDF in Python is a useful skill that lets you turn unstructured data into usable formats. By using libraries like PyMuPDF, pdfplumber, and Camelot, you can handle text, images, and tables within PDFs.

Key Takeaways:

  • There are various approaches to scrape PDF, including URLs, local file paths, and terminal-based scripts.
  • Choosing the right library is crucial; for text extraction, PyMuPDF and pdfplumber are effective, while Camelot excels in table extraction.
  • PDFs often lack standardized formatting, presenting challenges in data extraction that require specialized handling.
  • Proper environment setup, including installing necessary libraries and understanding their dependencies, is essential to successfully scrape PDF.
  • Implementing PDF scraping can automate data extraction processes, reducing manual effort and increasing efficiency in tasks like data analysis and reporting.

While challenges exist due to the diverse nature of PDF structures, understanding the appropriate tools and techniques allows for effective data extraction and integration into various workflows.


Frequently Asked Questions

What is the best method for extracting text from PDFs?

The best method depends on the document format and structure of the PDF. If the text is machine-readable, libraries like PyMuPDF, pdfplumber, and pdfminer.six can efficiently extract text in a structured format. However, for scanned documents, OCR tools like pytesseract are required to convert document images into machine-readable text.
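
As a quick illustration, a pdfplumber version of plain text extraction could look like this (the file name is a placeholder):

import pdfplumber

with pdfplumber.open("sample.pdf") as pdf:  # placeholder file name
    for page_num, page in enumerate(pdf.pages, start=1):
        text = page.extract_text() or ""  # extract_text() can return None on empty pages
        print(f"\n--- Page {page_num} ---\n")
        print(text)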

How can I extract tables from a PDF file?

For PDFs with selectable text, Camelot and pdfplumber are effective PDF table extraction tools. If the tables are part of an image, an OCR-based approach with pytesseract is necessary. Additionally, rule-based data extraction can be used for specific table layouts.
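
For reference, a pdfplumber-based table sketch might look like this (again with a placeholder file name):

import pdfplumber

with pdfplumber.open("text-and-table.pdf") as pdf:  # placeholder file name
    for page_num, page in enumerate(pdf.pages, start=1):
        for table in page.extract_tables():  # each table is a list of rows
            print(f"Table found on page {page_num}:")
            for row in table:
                print(row)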

Can I automate PDF data extraction?

Yes, automated data extraction software can streamline the process. Using batch processing, Python scripts can handle large volumes of PDFs without manual intervention. Advanced solutions, such as AI-powered tools and the GPT-4 Vision API, can further enhance the extraction process.

What are the challenges of extracting data from PDFs?

The unstructured nature of many PDFs makes accurate data extraction difficult. PDFs may contain complex layouts, non-linear text flow, or multiple document formats. Additionally, embedded elements such as email attachments, column names, and party names may require special handling to ensure reliable results.

How do I extract images from a PDF?

Libraries like PyMuPDF allow you to extract images embedded within PDFs. Once extracted, these images can be processed using advanced methods such as AI-powered OCR or traditional text recognition.

What’s the difference between structured and unstructured PDFs?

A structured format PDF maintains clear organization with defined elements, such as labeled tables and paragraphs. In contrast, an unstructured format contains arbitrary layouts, making reliable data extraction solutions more difficult to implement. Advanced data parsers can help convert unstructured data into a usable format.

About the author

Zeid is a content writer with over a decade of writing experience. He wrote for publications in Canada and the United States before deciding to start writing informational articles for Proxidize. He gained an interest in technology, with a particular focus on proxies.
