pypdfium2 v5.0 breaking changes and migration guide

published: August 21, 2025

tl;dr

Breaking: PdfDocument.render() removed - use page-level rendering
Renamed: get_pos() → get_bounds(), get_size() → get_px_size()
New features: Coordinate conversion, context managers, mobile support
Performance: Parallel rendering removed due to bitmap transfer overhead
Migration: Update rendering loops, method names, TOC handling
Version: v5.0.0b2 beta - Release notes

version timeline

  • v4.30.0 - Last stable v4 release (2024-05-09)
  • v4.30.1 - YANKED - Text extraction regression (crbug.com/387277993)
  • v5.0.0b1 - YANKED - Text extraction regression (crbug.com/399937354)
  • v5.0.0b2 - Current beta (2024-07-31)
  • PDFium version: Updated from 6996 to 7323

important: yanked versions warning

⚠️ Versions 4.30.1 and 5.0.0b1 were yanked from PyPI due to text extraction regressions in the underlying PDFium library.

If you installed these versions, downgrade to v4.30.0 or upgrade to v5.0.0b2:

# Downgrade to stable v4
pip install pypdfium2==4.30.0

# Or upgrade to latest beta
pip install pypdfium2==5.0.0b2

critical breaking changes

document rendering removal

The most significant change in v5.0 is the removal of PdfDocument.render(), the parallel document rendering API:

# ❌ OLD v4 code - NO LONGER WORKS
bitmap_generator = pdf.render(scale=2.0, page_indices=[0, 1, 2])
for i, bitmap in enumerate(bitmap_generator):
    # Use a per-page filename so pages do not overwrite each other
    bitmap.to_pil().save(f"page_{i:03d}.png")

# ✅ NEW v5 code - iterate pages directly
for i, page in enumerate(pdf):
    bitmap = page.render(scale=2.0)
    bitmap.to_pil().save(f"page_{i:03d}.png")

Why removed: The parallel rendering API had a fundamental flaw - bitmap transfer overhead exceeded parallelization benefits. Memory copying between processes was too expensive.
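To get a feel for the transfer overhead, the following back-of-the-envelope sketch estimates how many bytes an uncompressed page bitmap occupies. The helper name `bitmap_nbytes` and the assumption of 4 bytes per pixel are illustrative and not taken from pypdfium2 itself.

```python
def bitmap_nbytes(width_pt, height_pt, scale=1.0, channels=4):
    """Approximate size in bytes of an uncompressed rendered page bitmap.

    Assumes one byte per channel and no row padding (illustrative only).
    """
    width_px = round(width_pt * scale)
    height_px = round(height_pt * scale)
    return width_px * height_px * channels


# An A4-sized page (595 x 842 pt) rendered at scale 2.0
size = bitmap_nbytes(595, 842, scale=2.0)
print(f"{size / 1e6:.1f} MB per page")
```

Each bitmap of roughly this size would have to be copied from a worker process back to the parent, which is the cost the removal rationale refers to.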

Migration strategies:

# Sequential processing (simplest)
import pypdfium2 as pdfium

def render_pdf_sequential(pdf_path, output_dir):
    pdf = pdfium.PdfDocument(pdf_path)
    for i, page in enumerate(pdf):
        bitmap = page.render(scale=2.0)
        bitmap.to_pil().save(f"{output_dir}/page_{i:03d}.png")
    pdf.close()  # release the document once all pages are written

# Parallel with multiprocessing (if needed)
from multiprocessing import Pool

def render_single_page(args):
    pdf_path, page_index, output_path, scale = args
    pdf = pdfium.PdfDocument(pdf_path)
    page = pdf[page_index]
    bitmap = page.render(scale=scale)
    bitmap.to_pil().save(output_path)
    pdf.close()

def render_pdf_parallel(pdf_path, output_dir, scale=2.0):
    pdf = pdfium.PdfDocument(pdf_path)
    page_count = len(pdf)
    pdf.close()

    args_list = [
        (pdf_path, i, f"{output_dir}/page_{i:03d}.png", scale)
        for i in range(page_count)
    ]

    with Pool() as pool:
        pool.map(render_single_page, args_list)
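The parallel variant above reopens the document once per page. One possible refinement, sketched here as an assumption rather than something the release notes prescribe, is to hand each worker a contiguous batch of page indices so that it opens the file only once. The helper `chunk_page_indices` is hypothetical and uses only the standard library.

```python
def chunk_page_indices(page_count, n_chunks):
    """Split range(page_count) into at most n_chunks contiguous batches."""
    if page_count <= 0:
        return []
    n_chunks = max(1, min(n_chunks, page_count))
    base, extra = divmod(page_count, n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` batches take one additional page each
        size = base + (1 if i < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


print(chunk_page_indices(10, 3))
```

Each batch could then be passed to a worker function that opens the PDF, renders its pages, and closes the document again.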

bitmap api changes

color parameter position

The fill_rect() method now takes color as the first parameter:

# ❌ OLD v4 code
bitmap.fill_rect(left=10, top=20, width=100, height=50, color=(255, 0, 0, 255))

# ✅ NEW v5 code
bitmap.fill_rect(color=(255, 0, 0, 255), left=10, top=20, width=100, height=50)

removed bitmap info

# ❌ OLD v4 code
info = bitmap.get_info()
print(info.width, info.height, info.format)

# ✅ NEW v5 code
print(bitmap.width, bitmap.height, bitmap.format)

numpy array shape changes

Grayscale bitmaps now return proper 2D arrays:

# v4: grayscale returned shape (height, width, 1)
# v5: grayscale returns shape (height, width)
arr = grayscale_bitmap.to_numpy()
assert arr.ndim == 2  # Now 2D for grayscale

method renames

Old Method (v4)           New Method (v5)                        Context
PdfObject.get_pos()       PdfObject.get_bounds()                 All page objects
PdfImage.get_size()       PdfImage.get_px_size()                 Image dimensions
PdfMatrix.mirror(v, h)    PdfMatrix.mirror(invert_x, invert_y)   Matrix operations
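Code that has to run against both v4 and v5 during a transition could route renamed methods through a small shim. This is only a sketch: `call_renamed` is a hypothetical helper, not part of pypdfium2, and it relies solely on the old and new method names listed in the table.

```python
def call_renamed(obj, new_name, old_name, *args, **kwargs):
    """Call obj.<new_name>() if it exists, otherwise fall back to obj.<old_name>()."""
    method = getattr(obj, new_name, None)
    if method is None:
        # Older object: only the pre-rename method is available
        method = getattr(obj, old_name)
    return method(*args, **kwargs)
```

With this in place, `call_renamed(obj, "get_bounds", "get_pos")` would work with either major version, and the shim can be deleted once v4 support is dropped.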

table of contents restructuring

The TOC API completely changed from namedtuples to method-oriented wrapper classes:

# ❌ OLD v4 code - namedtuple approach
for item in pdf.get_toc():
    print(f"Title: {item.title}")
    print(f"Page: {item.page_index}")
    print(f"View mode: {item.view_mode}")
    print(f"View pos: {item.view_pos}")

# ✅ NEW v5 code - wrapper class approach
for bookmark in pdf.get_toc():
    title = bookmark.get_title()
    dest = bookmark.get_dest()

    if dest:
        page_index = dest.get_index()
        view_mode = dest.get_view_mode()
        view_pos = dest.get_view_pos()

        print(f"Title: {title}")
        print(f"Page: {page_index}")
        print(f"View mode: {view_mode}")
        print(f"View pos: {view_pos}")

The new wrapper classes provide better encapsulation, and callers only retrieve the properties they actually need.
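As a sketch of how the wrapper objects might be consumed, the hypothetical helper below flattens bookmarks into simple (title, page index) pairs. It is duck-typed and only assumes the `get_title()`, `get_dest()` and `get_index()` methods shown in the example above.

```python
def toc_entries(bookmarks):
    """Yield (title, page_index) pairs; page_index is None without a destination."""
    for bookmark in bookmarks:
        dest = bookmark.get_dest()
        page_index = dest.get_index() if dest is not None else None
        yield bookmark.get_title(), page_index
```

Under these assumptions it would be called as `list(toc_entries(pdf.get_toc()))`, which gives v4-style code a plain data structure to work with.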

version api cleanup

All legacy version flags removed:

# ❌ OLD v4 code
from pypdfium2 import V_PYPDFIUM2, V_LIBPDFIUM, V_BUILDNAME
print(f"pypdfium2: {V_PYPDFIUM2}")
print(f"PDFium: {V_LIBPDFIUM}")

# ✅ NEW v5 code
from pypdfium2 import PYPDFIUM_INFO, PDFIUM_INFO
print(f"pypdfium2: {PYPDFIUM_INFO.version}")
print(f"PDFium: {PDFIUM_INFO.version}")
print(f"Build: {PDFIUM_INFO.build}")

image extraction changes

The extract() method no longer supports fallback rendering:

# ❌ OLD v4 code
data, info = image.extract(fb_render=True)  # fb_render removed

# ✅ NEW v5 code
try:
    data, info = image.extract()
except pdfium.PdfiumError:
    # Manual fallback if extraction fails
    bitmap = image.get_bitmap(render=True)
    data = bitmap.to_pil().tobytes()

new features and additions

coordinate conversion helper

New PdfPosConv class for coordinate translation:

# Create converter for page-to-bitmap coordinate mapping
page = pdf[0]
bitmap = page.render(scale=2.0, rotation=90)
conv = bitmap.get_posconv(page)

# Convert coordinates
bitmap_x, bitmap_y = conv.to_bitmap(page_x=100, page_y=200)
page_x, page_y = conv.to_page(bitmap_x=50, bitmap_y=75)

# Direct instantiation
from pypdfium2 import PdfPosConv
conv = PdfPosConv(
    page_width=595, page_height=842,
    bitmap_width=1190, bitmap_height=1684,
    scale=2.0, rotation=0
)
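To illustrate what such a conversion involves, here is a minimal sketch of the arithmetic for the simplest case: no rotation, no cropping, page origin at the bottom-left and bitmap origin at the top-left. The function names are hypothetical, and this is not the library's implementation.

```python
def page_to_bitmap(page_x, page_y, page_height, scale):
    """Map page coordinates (points, y up) to bitmap pixels (y down)."""
    return page_x * scale, (page_height - page_y) * scale


def bitmap_to_page(bitmap_x, bitmap_y, page_height, scale):
    """Inverse of page_to_bitmap for the same unrotated setup."""
    return bitmap_x / scale, page_height - bitmap_y / scale


bx, by = page_to_bitmap(100, 200, page_height=842, scale=2.0)
px, py = bitmap_to_page(bx, by, page_height=842, scale=2.0)
print((bx, by), (px, py))
```

Rotation and crop boxes add further terms, which is the bookkeeping a converter class takes off the caller's hands.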

quad points for transformed objects

Get corner positions of transformed objects:

# Get object corners (counter-clockwise from bottom-left)
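# next(iter(...)) takes the first object whether get_objects() returns a list or an iterator
obj = next(iter(page.get_objects()))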
if isinstance(obj, (pdfium.PdfImage, pdfium.PdfText)):
    corners = obj.get_quad_points()
    # Returns: (bottom_left, bottom_right, top_right, top_left)
    for x, y in corners:
        print(f"Corner at ({x}, {y})")
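When only an axis-aligned box is needed, the four corners can be reduced to one. The helper below is a hypothetical sketch that assumes the corners arrive as (x, y) pairs, as in the example above.

```python
def quad_bounds(corners):
    """Return (left, bottom, right, top) of the box enclosing four corner points."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)


# A square rotated by 45 degrees around the origin
print(quad_bounds([(0, -1), (1, 0), (0, 1), (-1, 0)]))
```

For a rotated object this box is larger than the object itself, which is why the quad points carry more information than a plain bounding rectangle.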

context manager support

Documents now support with-statements:

# Automatic cleanup
with pdfium.PdfDocument("document.pdf") as pdf:
    page = pdf[0]
    bitmap = page.render()
    # PDF automatically closed when exiting block

enhanced error handling

Errors now include PDFium error codes:

try:
    pdf = pdfium.PdfDocument("corrupted.pdf")
except pdfium.PdfiumError as e:
    print(f"Error code: {e.err_code}")
    # Error codes from PDFium:
    # 0: Success
    # 1: Unknown error
    # 2: File access error
    # 3: Format error
    # 4: Password error
    # 5: Security error
    # 6: Page not found
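Rather than comparing against bare integers, the codes listed above could be given names. The enum below is a sketch: the class and member names are chosen here for illustration, and only the numeric values come from the list above.

```python
from enum import IntEnum


class PdfiumErrCode(IntEnum):
    """Names for the numeric error codes listed above (names are illustrative)."""
    SUCCESS = 0
    UNKNOWN = 1
    FILE = 2
    FORMAT = 3
    PASSWORD = 4
    SECURITY = 5
    PAGE = 6


def describe_err_code(code):
    """Return a readable name for an error code, tolerating unknown values."""
    try:
        return PdfiumErrCode(code).name
    except ValueError:
        return f"UNRECOGNIZED({code})"
```

A handler can then write `if e.err_code == PdfiumErrCode.PASSWORD:` instead of remembering that 4 means a password error.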

page flattening

Previously private method now public:

# Flatten annotations and form fields into page content
page.flatten()
# Useful for:
# - Preventing annotation editing
# - Ensuring consistent rendering
# - Reducing file complexity

improved image rendering

New option for native resolution rendering:

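# Render image at original pixel size for best quality
# The FPDF_PAGEOBJ_* constants live in the raw bindings module
import pypdfium2.raw as pdfium_c

# next(iter(...)) takes the first match whether a list or an iterator is returned
image_obj = next(iter(page.get_objects(filter=[pdfium_c.FPDF_PAGEOBJ_IMAGE])))
bitmap = image_obj.get_bitmap(render=True, scale_to_original=True)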

# Previous approach scaled to page size
# New approach preserves original image dimensions

bitmap rendering options

Enhanced rendering control:

# Render with alpha channel for transparency
bitmap = page.render(use_alpha=True)

# Standard rendering without alpha
bitmap = page.render(use_alpha=False)

# Note: maybe_alpha is a CLI option, not an API method
# In code, explicitly check for transparency via the raw bindings:
import pypdfium2.raw as pdfium_c

if pdfium_c.FPDFPage_HasTransparency(page):
    bitmap = page.render(use_alpha=True)

cli enhancements

dark theme rendering

New options for dark mode PDFs:

# Invert lightness while preserving color relationships
pypdfium2 render input.pdf output --invert-lightness

# Exclude images from inversion
pypdfium2 render input.pdf output --invert-lightness --exclude-images

# Convert fills to strokes for better visibility
pypdfium2 render input.pdf output --fill-to-stroke

Implementation uses HLS color space for intelligent inversion.
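The idea behind lightness inversion can be sketched with Python's standard `colorsys` module: convert to HLS, flip the lightness channel, and convert back, so hue and saturation are preserved. This is an illustration of the technique on a single colour, not the CLI's actual code, and `invert_lightness` is a name chosen here.

```python
import colorsys


def invert_lightness(rgb):
    """Invert the HLS lightness of an 8-bit (r, g, b) colour, keeping hue and saturation."""
    r, g, b = (c / 255 for c in rgb)
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    r, g, b = colorsys.hls_to_rgb(h, 1.0 - l, s)
    return tuple(round(c * 255) for c in (r, g, b))


print(invert_lightness((255, 255, 255)))  # white paper becomes black
print(invert_lightness((255, 0, 0)))      # mid-lightness red is unchanged
```

Unlike a plain RGB negative, which would turn red into cyan, this keeps coloured elements recognisable while swapping light backgrounds for dark ones.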

enhanced rendering controls

# Render with custom color scheme
pypdfium2 render input.pdf output \
    --color-scheme custom \
    --fill-to-stroke \
    --invert-lightness

# Process specific pages with options
pypdfium2 render input.pdf output \
    --pages 1-10 \
    --scale 2.0 \
    --rotation 90 \
    --crop 10,10,10,10

platform support expansion

mobile platforms (experimental)

New in v5.0 - experimental support for mobile:

Android architectures (partially working):

  • arm64-v8a - Modern 64-bit devices
  • armeabi-v7a - Older 32-bit devices
  • x86_64 - 64-bit emulators
  • x86 - 32-bit emulators

iOS architectures (not yet functional):

  • arm64 - Physical devices
  • arm64 - M1+ Mac simulators
  • x86_64 - Intel Mac simulators

⚠️ Important limitations:

  • Android support is experimental but somewhat functional
  • iOS support infrastructure exists but does not actually work yet
  • No wheels are distributed for mobile platforms at this time
  • This is provided on a best-effort basis and largely untested

# Installation on mobile platforms (experimental)
# Android (via Termux or similar) - may work
pip install pypdfium2 --platform android_arm64

# iOS - infrastructure only, not functional yet
# pip install pypdfium2 --platform ios_arm64  # Won't work

enhanced linux support

Added support via cibuildwheel:

  • ARM64 (aarch64)
  • RISC-V (riscv64)
  • LoongArch64 (loongarch64)
  • PowerPC64 LE (ppc64le)
  • S390x (s390x)

build system improvements

native build option

Build without Google’s toolchain:

# Use system compiler (GCC/Clang)
python setupsrc/pypdfium2_setup/build_native.py

# Automatically triggered for unsupported platforms
pip install pypdfium2 --no-binary :all:

security verification

SLSA provenance verification support:

# Download and verify authenticity
pypdfium2 update --verify

# Requires slsa-verifier tool
# https://github.com/slsa-framework/slsa-verifier

platform detection

Improved platform detection in autorelease/bindings.py:

import pypdfium2

# Get platform information
info = pypdfium2.PDFIUM_INFO
print(f"Platform: {info.platform}")
print(f"Architecture: {info.arch}")
print(f"V8 Support: {info.v8_enabled}")
print(f"XFA Support: {info.xfa_enabled}")

performance optimizations

memory leak fixes

Fixed ctypes pointer caching issues:

# ❌ OLD problematic pattern
ptr = cast(address, POINTER(c_char)).contents

# ✅ NEW safe pattern
ptr = c_char.from_address(address)

Affected APIs:

  • Buffer operations
  • Bitmap creation
  • Document loading
  • Text extraction
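Both access patterns can be tried on a throwaway buffer with nothing but the standard library. The sketch below only demonstrates that the two forms read the same memory; it does not reproduce the caching problem itself, and the buffer contents are arbitrary.

```python
import ctypes

# A small buffer standing in for memory handed out by a C library
buf = ctypes.create_string_buffer(b"PDF!")
address = ctypes.addressof(buf)

# Pattern flagged as problematic above: go through a POINTER type
via_cast = ctypes.cast(address, ctypes.POINTER(ctypes.c_char)).contents

# Recommended pattern: construct the ctypes object directly at the address
via_from_address = ctypes.c_char.from_address(address)

# A whole region can be viewed the same way with an array type
view = (ctypes.c_char * 4).from_address(address)

print(via_cast.value, via_from_address.value, view.raw)
```

Note that objects created with `from_address()` do not own the memory, so the underlying buffer must stay alive for as long as the view is used.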

startup performance

Deferred imports for optional dependencies:

# PIL, NumPy, CV2 only imported when needed
# Reduces startup time by ~30% when not using image features

# These imports now happen lazily:
# - PIL: When using to_pil() or from_pil()
# - NumPy: When using to_numpy() or from_numpy()
# - CV2: When using OpenCV integration
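The deferred-import pattern itself is simple: the import statement moves from module level into the function that needs it. The sketch below uses the standard `json` module as a stand-in for a heavy optional dependency, and `encode_payload` is a hypothetical function name.

```python
def encode_payload(data):
    """Serialize data, importing the (stand-in) dependency only on first use."""
    # Deferred import: nothing is loaded until this function is actually called.
    # Python caches the module in sys.modules, so later calls pay almost nothing.
    import json
    return json.dumps(data)


print(encode_payload({"pages": 3}))
```

Callers that never touch the feature never pay for the import, and a missing optional package only raises ImportError at the point of use.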

version parsing optimization

Eliminated expensive git operations:

# Old: git ls-remote to check versions (slow)
# New: Read embedded VERSION file (instant)

# Version now cached at module level
from pypdfium2 import PDFIUM_INFO
# Subsequent access is instant

text extraction fixes and api changes

addressing the regression

The text extraction regressions that caused v4.30.1 and v5.0.0b1 to be yanked were addressed through API changes:

PDFium change: Reverted FPDFText_GetText() back to UCS-2 encoding (from UTF-16), which resolved memory allocation concerns but limited Unicode support.

pypdfium2 response: Removed implicit translation in get_text_range() and added explicit methods:

# For full Unicode support (recommended)
text = textpage.get_text_bounded()  # Handles all Unicode characters

# For basic text (faster, UCS-2 only)
text = textpage.get_text_range()  # Limited to UCS-2 character set

# Example: Extracting text with proper Unicode handling
with pdfium.PdfDocument("document.pdf") as pdf:
    page = pdf[0]
    textpage = page.get_textpage()

    # Full Unicode text extraction
    full_text = textpage.get_text_bounded()

    # Or with specific range
    partial_text = textpage.get_text_bounded(start_index=0, length=100)

best practices for text extraction

  1. Always use get_text_bounded() for documents with international characters
  2. Test text extraction after upgrading from affected versions
  3. Verify Unicode handling with documents containing emojis, CJK characters, or special symbols
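UCS-2 cannot represent code points above U+FFFF, which includes most emoji and some rarer CJK characters. When building test inputs for the checks above, a small helper like this hypothetical one can confirm that a reference string really exercises that range.

```python
def non_bmp_chars(text):
    """Return the characters in text that lie outside the Basic Multilingual Plane."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


print(non_bmp_chars("plain ASCII and \u6f22\u5b57"))  # common CJK is inside the BMP
print(non_bmp_chars("smile \U0001F600"))              # emoji are outside it
```

If the expected text of a test document contains such characters, comparing it with the extracted text is a direct check of the Unicode handling.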

migration checklist

pre-migration audit

# Script to find v4 code patterns that need updating
import ast
import sys
from pathlib import Path

def check_v4_patterns(file_path):
    # Read as UTF-8 explicitly so results do not depend on the platform default
    with open(file_path, 'r', encoding='utf-8') as f:
        tree = ast.parse(f.read())

    issues = []

    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute):
            # Check for removed/renamed methods
            if node.attr in ['render', 'get_pos', 'get_size', 'get_info']:
                issues.append(f"Line {node.lineno}: {node.attr} may need updating")

    return issues

# Run on your codebase
for py_file in Path(".").rglob("*.py"):
    issues = check_v4_patterns(py_file)
    if issues:
        print(f"\n{py_file}:")
        for issue in issues:
            print(f"  {issue}")

step-by-step migration

  1. Update imports:

# Replace version imports
# from pypdfium2 import V_PYPDFIUM2, V_LIBPDFIUM
from pypdfium2 import PYPDFIUM_INFO, PDFIUM_INFO

  2. Fix rendering loops:

# Replace pdf.render() with page iteration
for page in pdf:
    bitmap = page.render()

  3. Update method calls:

# Rename methods
# obj.get_pos() → obj.get_bounds()
# img.get_size() → img.get_px_size()

  4. Fix bitmap operations:

# Update fill_rect parameter order
# bitmap.fill_rect(x, y, w, h, color) → bitmap.fill_rect(color, x, y, w, h)

  5. Update TOC handling:

# Switch from namedtuple to wrapper classes
for bookmark in pdf.get_toc():
    title = bookmark.get_title()
    dest = bookmark.get_dest()

  6. Add error handling:

# Use new error codes
try:
    pdf = pdfium.PdfDocument(path)
except pdfium.PdfiumError as e:
    if e.err_code == 3:  # Format error
        handle_corrupt_pdf()

testing after migration

import pypdfium2 as pdfium
import tempfile
from pathlib import Path

def test_v5_compatibility():
    # Test basic functionality
    pdf = pdfium.PdfDocument.new()
    page = pdf.new_page(595, 842)

    # Test rendering
    bitmap = page.render(scale=2.0)
    assert bitmap.width == 1190

    # Test context manager
    with tempfile.NamedTemporaryFile(suffix='.pdf') as tmp:
        pdf.save(tmp.name)

        with pdfium.PdfDocument(tmp.name) as loaded:
            assert len(loaded) == 1

    # Test coordinate conversion
    conv = bitmap.get_posconv(page)
    bx, by = conv.to_bitmap(100, 100)
    px, py = conv.to_page(bx, by)
    assert abs(px - 100) < 0.01

    print("✓ All v5 compatibility tests passed")

test_v5_compatibility()
