pypdfium2 v5.0 breaking changes and migration guide

Breaking:	`PdfDocument.render()` removed - use page-level rendering
Renamed:	`get_pos()` → `get_bounds()`, `get_size()` → `get_px_size()`
New features:	Coordinate conversion, context managers, mobile support
Performance:	Parallel rendering removed due to bitmap transfer overhead
Migration:	Update rendering loops, method names, TOC handling
Version:	v5.0.0b2 beta - Release notes

version timeline

v4.30.0 - Last stable v4 release (2024-05-09)
v4.30.1 - YANKED - Text extraction regression (crbug.com/387277993)
v5.0.0b1 - YANKED - Text extraction regression (crbug.com/399937354)
v5.0.0b2 - Current beta (2025-07-31)
PDFium version: Updated from 6996 to 7323

important: yanked versions warning

⚠️ Versions 4.30.1 and 5.0.0b1 were yanked from PyPI due to text extraction regressions in the underlying PDFium library:

v4.30.1 (yanked 2024-12-19): Text extraction regression tracked as Chromium bug 387277993
v5.0.0b1 (yanked 2025-02-03): Text extraction regression tracked as Chromium bug 399937354

If you installed these versions, downgrade to v4.30.0 or upgrade to v5.0.0b2:

# Downgrade to stable v4
pip install pypdfium2==4.30.0

# Or upgrade to latest beta
pip install pypdfium2==5.0.0b2
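
To check from a script whether an environment still has one of the yanked releases, the installed version string can be compared against the two versions listed above. This is a minimal sketch using only the standard library; the helper names `is_yanked` and `installed_pypdfium2_version` are illustrative and not part of pypdfium2.

```python
from importlib import metadata

# Versions this guide lists as yanked from PyPI
YANKED_VERSIONS = {"4.30.1", "5.0.0b1"}

def is_yanked(version: str) -> bool:
    """Return True if the given version string is one of the yanked releases."""
    return version.strip() in YANKED_VERSIONS

def installed_pypdfium2_version():
    """Return the installed pypdfium2 version, or None if it is not installed."""
    try:
        return metadata.version("pypdfium2")
    except metadata.PackageNotFoundError:
        return None

if __name__ == "__main__":
    version = installed_pypdfium2_version()
    if version is None:
        print("pypdfium2 is not installed")
    elif is_yanked(version):
        print(f"pypdfium2 {version} was yanked: switch to 4.30.0 or 5.0.0b2")
    else:
        print(f"pypdfium2 {version} is not a yanked release")
```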

critical breaking changes

document rendering removal

The most significant change in v5.0 is the complete removal of parallel document rendering:

# ❌ OLD v4 code - NO LONGER WORKS
page_indices = [0, 1, 2]
images = pdf.render(pdfium.PdfBitmap.to_pil, scale=2.0, page_indices=page_indices)
for i, image in zip(page_indices, images):
    image.save(f"page_{i:03d}.png")

# ✅ NEW v5 code - iterate pages directly
for i, page in enumerate(pdf):
    bitmap = page.render(scale=2.0)
    bitmap.to_pil().save(f"page_{i:03d}.png")  # one file per page

Why removed: The parallel rendering API had a fundamental flaw. Rendered bitmaps had to be copied from the worker processes back to the parent, and the cost of that transfer exceeded the gains from parallelization.
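
To get a feeling for the amount of data involved, the raw size of one rendered bitmap can be estimated from the page size, the scale, and the number of bytes per pixel. The sketch below is a back-of-the-envelope calculation only: `bitmap_nbytes` is a hypothetical helper, it assumes pixel dimensions are rounded up, and it ignores any row padding.

```python
import math

def bitmap_nbytes(page_width_pt, page_height_pt, scale, n_channels=4):
    """Estimate the raw size of a rendered bitmap in bytes (row padding ignored)."""
    width_px = math.ceil(page_width_pt * scale)
    height_px = math.ceil(page_height_pt * scale)
    return width_px * height_px * n_channels

# A4 page (595 x 842 pt) at scale 2.0 with 4 bytes per pixel
size = bitmap_nbytes(595, 842, scale=2.0)
print(f"{size} bytes per page (~{size / 1e6:.1f} MB)")
```

At roughly 8 MB per A4 page at scale 2.0, a long document means a lot of memory moving between processes if the bitmaps themselves are returned to the parent.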

Migration strategies:

import pypdfium2 as pdfium

# Sequential processing (simplest)
def render_pdf_sequential(pdf_path, output_dir):
    pdf = pdfium.PdfDocument(pdf_path)
    for i, page in enumerate(pdf):
        bitmap = page.render(scale=2.0)
        bitmap.to_pil().save(f"{output_dir}/page_{i:03d}.png")
    pdf.close()

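# Parallel with multiprocessing (if needed)
# Each worker opens the document itself and writes its own output file,
# so only paths and page indices cross process boundaries
from multiprocessing import Pool

import pypdfium2 as pdfium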

def render_single_page(args):
    pdf_path, page_index, output_path, scale = args
    pdf = pdfium.PdfDocument(pdf_path)
    page = pdf[page_index]
    bitmap = page.render(scale=scale)
    bitmap.to_pil().save(output_path)
    pdf.close()

def render_pdf_parallel(pdf_path, output_dir, scale=2.0):
    pdf = pdfium.PdfDocument(pdf_path)
    page_count = len(pdf)
    pdf.close()

    args_list = [
        (pdf_path, i, f"{output_dir}/page_{i:03d}.png", scale)
        for i in range(page_count)
    ]

    with Pool() as pool:
        pool.map(render_single_page, args_list)
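
The parallel example above reopens the document once per page. One possible refinement is to hand each worker a contiguous chunk of page indices so that it opens the file only once. The helper below is a sketch of the chunking step alone; `split_page_indices` is a hypothetical name, and wiring it into the worker function is left to the caller.

```python
def split_page_indices(page_count, n_chunks):
    """Split range(page_count) into at most n_chunks contiguous, near-equal lists."""
    if page_count <= 0 or n_chunks <= 0:
        return []
    n_chunks = min(n_chunks, page_count)
    base, extra = divmod(page_count, n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional page each
        size = base + (1 if i < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks

print(split_page_indices(10, 3))
```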

bitmap api changes

color parameter position

The fill_rect() method now takes color as the first parameter:

# ❌ OLD v4 code
bitmap.fill_rect(left=10, top=20, width=100, height=50, color=(255, 0, 0, 255))

# ✅ NEW v5 code
bitmap.fill_rect(color=(255, 0, 0, 255), left=10, top=20, width=100, height=50)

removed bitmap info

# ❌ OLD v4 code
info = bitmap.get_info()
print(info.width, info.height, info.format)

# ✅ NEW v5 code
print(bitmap.width, bitmap.height, bitmap.format)

numpy array shape changes

Grayscale bitmaps now return proper 2D arrays:

# v4: grayscale returned shape (height, width, 1)
# v5: grayscale returns shape (height, width)
arr = grayscale_bitmap.to_numpy()
assert arr.ndim == 2  # Now 2D for grayscale

method renames

Old Method (v4)	New Method (v5)	Context
`PdfObject.get_pos()`	`PdfObject.get_bounds()`	All page objects
`PdfImage.get_size()`	`PdfImage.get_px_size()`	Image dimensions
`PdfMatrix.mirror(v, h)`	`PdfMatrix.mirror(invert_x, invert_y)`	Matrix operations
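
For larger codebases, the two plain method renames in the table can be applied mechanically. The following is a rough sketch based on a regular expression; `rewrite_renamed_calls` is a hypothetical helper, it does not know the type of the receiver, and it will also rewrite unrelated methods that happen to share a name (for example a `get_size()` call on a page object), so every change should be reviewed by hand.

```python
import re

# Old method name -> new method name, taken from the rename table
RENAMES = {"get_pos": "get_bounds", "get_size": "get_px_size"}

# Match ".old_name(" so that only method calls are rewritten
_PATTERN = re.compile(r"\.(" + "|".join(map(re.escape, RENAMES)) + r")\s*\(")

def rewrite_renamed_calls(source: str) -> str:
    """Replace calls to renamed methods; the receiver type is NOT checked."""
    return _PATTERN.sub(lambda m: "." + RENAMES[m.group(1)] + "(", source)

print(rewrite_renamed_calls("box = obj.get_pos()"))
```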

table of contents restructuring

The TOC API completely changed from namedtuples to method-oriented wrapper classes:

# ❌ OLD v4 code - namedtuple approach
for item in pdf.get_toc():
    print(f"Title: {item.title}")
    print(f"Page: {item.page_index}")
    print(f"View mode: {item.view_mode}")
    print(f"View pos: {item.view_pos}")

# ✅ NEW v5 code - wrapper class approach
for bookmark in pdf.get_toc():
    title = bookmark.get_title()
    dest = bookmark.get_dest()

    if dest:
        page_index = dest.get_index()
        view_mode, view_pos = dest.get_view()

        print(f"Title: {title}")
        print(f"Page: {page_index}")
        print(f"View mode: {view_mode}")
        print(f"View pos: {view_pos}")

New classes provide better encapsulation:

PdfBookmark - Represents a bookmark/outline item
PdfDest - Represents a destination

version api cleanup

All legacy version flags removed:

# ❌ OLD v4 code
from pypdfium2 import V_PYPDFIUM2, V_LIBPDFIUM, V_BUILDNAME
print(f"pypdfium2: {V_PYPDFIUM2}")
print(f"PDFium: {V_LIBPDFIUM}")

# ✅ NEW v5 code
from pypdfium2 import PYPDFIUM_INFO, PDFIUM_INFO
print(f"pypdfium2: {PYPDFIUM_INFO.version}")
print(f"PDFium: {PDFIUM_INFO.version}")
print(f"Build: {PDFIUM_INFO.build}")
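
Code that has to run against both v4 and v5 during a transition can probe for the new names and fall back to the old ones. The sketch below uses only the attribute names shown above; `describe_versions` is a hypothetical helper, and the stand-in objects carry made-up placeholder values so the example runs without pypdfium2 installed.

```python
from types import SimpleNamespace

def describe_versions(module):
    """Return (pypdfium2_version, pdfium_version) for a v4- or v5-style module."""
    if hasattr(module, "PYPDFIUM_INFO"):
        # v5 style: info objects with a .version attribute
        return str(module.PYPDFIUM_INFO.version), str(module.PDFIUM_INFO.version)
    # v4 style: plain module-level constants
    return str(module.V_PYPDFIUM2), str(module.V_LIBPDFIUM)

# Stand-ins for demonstration (placeholder values); in real code pass the
# imported pypdfium2 module instead
fake_v5 = SimpleNamespace(
    PYPDFIUM_INFO=SimpleNamespace(version="5.0.0b2"),
    PDFIUM_INFO=SimpleNamespace(version="7323"),
)
fake_v4 = SimpleNamespace(V_PYPDFIUM2="4.30.0", V_LIBPDFIUM="1234")

print(describe_versions(fake_v5))
print(describe_versions(fake_v4))
```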

image extraction changes

The extract() method no longer supports fallback rendering:

# ❌ OLD v4 code
data, info = image.extract(fb_render=True)  # fb_render removed

# ✅ NEW v5 code
try:
    data, info = image.extract()
except pdfium.PdfiumError:
    # Manual fallback if extraction fails
    bitmap = image.get_bitmap(render=True)
    data = bitmap.to_pil().tobytes()

new features and additions

coordinate conversion helper

New PdfPosConv class for coordinate translation:

# Create converter for page-to-bitmap coordinate mapping
page = pdf[0]
bitmap = page.render(scale=2.0, rotation=90)
conv = bitmap.get_posconv(page)

# Convert coordinates
bitmap_x, bitmap_y = conv.to_bitmap(page_x=100, page_y=200)
page_x, page_y = conv.to_page(bitmap_x=50, bitmap_y=75)

# Direct instantiation
from pypdfium2 import PdfPosConv
conv = PdfPosConv(
    page_width=595, page_height=842,
    bitmap_width=1190, bitmap_height=1684,
    scale=2.0, rotation=0
)
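
The underlying idea of the conversion can be illustrated with plain arithmetic. The sketch below is a simplified model, not the library's implementation: it assumes an unrotated, uncropped page, PDF coordinates with the origin at the bottom-left, and bitmap coordinates with the origin at the top-left. The function names are hypothetical.

```python
def page_to_bitmap(page_x, page_y, page_height, scale):
    """Map PDF page coordinates (origin bottom-left) to bitmap pixels (origin top-left)."""
    return page_x * scale, (page_height - page_y) * scale

def bitmap_to_page(bitmap_x, bitmap_y, page_height, scale):
    """Inverse mapping: bitmap pixels back to PDF page coordinates."""
    return bitmap_x / scale, page_height - bitmap_y / scale

# Round trip for a point on an A4 page rendered at scale 2.0
bx, by = page_to_bitmap(100, 200, page_height=842, scale=2.0)
px, py = bitmap_to_page(bx, by, page_height=842, scale=2.0)
print((bx, by), (px, py))
```

Rotation and cropping add further terms to both directions, which is the bookkeeping a dedicated converter takes over.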

quad points for transformed objects

Get corner positions of transformed objects:

# Get object corners (counter-clockwise from bottom-left)
obj = next(page.get_objects())  # get_objects() is a generator, so it cannot be indexed
if isinstance(obj, (pdfium.PdfImage, pdfium.PdfText)):
    corners = obj.get_quad_points()
    # Returns: (bottom_left, bottom_right, top_right, top_left)
    for x, y in corners:
        print(f"Corner at ({x}, {y})")

context manager support

Documents now support with-statements:

# Automatic cleanup
with pdfium.PdfDocument("document.pdf") as pdf:
    page = pdf[0]
    bitmap = page.render()
    # PDF automatically closed when exiting block

enhanced error handling

Errors now include PDFium error codes:

try:
    pdf = pdfium.PdfDocument("corrupted.pdf")
except pdfium.PdfiumError as e:
    print(f"Error code: {e.err_code}")
    # Error codes from PDFium:
    # 0: Success
    # 1: Unknown error
    # 2: File access error
    # 3: Format error
    # 4: Password error
    # 5: Security error
    # 6: Page not found
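
For logging, the numeric codes listed above can be turned into readable messages with a small lookup table. This is an illustrative sketch; `PDFIUM_ERROR_NAMES` and `describe_err_code` are hypothetical names, not part of pypdfium2.

```python
# Error codes and descriptions as listed in this guide
PDFIUM_ERROR_NAMES = {
    0: "Success",
    1: "Unknown error",
    2: "File access error",
    3: "Format error",
    4: "Password error",
    5: "Security error",
    6: "Page not found",
}

def describe_err_code(code):
    """Return a human-readable description for a PDFium error code."""
    return PDFIUM_ERROR_NAMES.get(code, f"Unrecognized error code {code}")

print(describe_err_code(3))
```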

page flattening

Previously private method now public:

# Flatten annotations and form fields into page content
page.flatten()
# Useful for:
# - Preventing annotation editing
# - Ensuring consistent rendering
# - Reducing file complexity

improved image rendering

New option for native resolution rendering:

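# Render image at original pixel size for best quality
# Object type constants live in the raw bindings, and get_objects() returns
# a generator, so take the first match with next()
import pypdfium2.raw as pdfium_c

image_obj = next(page.get_objects(filter=[pdfium_c.FPDF_PAGEOBJ_IMAGE]))
bitmap = image_obj.get_bitmap(render=True, scale_to_original=True)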

# Previous approach scaled to page size
# New approach preserves original image dimensions

bitmap rendering options

Enhanced rendering control:

# Render with alpha channel for transparency
bitmap = page.render(use_alpha=True)

# Standard rendering without alpha
bitmap = page.render(use_alpha=False)

# Note: maybe_alpha is a CLI option, not an API method
# In code, explicitly check for transparency via the raw bindings:
import pypdfium2.raw as pdfium_c

if pdfium_c.FPDFPage_HasTransparency(page):
    bitmap = page.render(use_alpha=True)

cli enhancements

dark theme rendering

New options for dark mode PDFs:

# Invert lightness while preserving color relationships
pypdfium2 render input.pdf output --invert-lightness

# Exclude images from inversion
pypdfium2 render input.pdf output --invert-lightness --exclude-images

# Convert fills to strokes for better visibility
pypdfium2 render input.pdf output --fill-to-stroke

Implementation uses HLS color space for intelligent inversion.
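
The idea of inverting lightness in HLS space can be sketched with the standard library's `colorsys` module. This illustrates the general technique only and is not the CLI's actual code; `invert_lightness` is a hypothetical helper.

```python
import colorsys

def invert_lightness(r, g, b):
    """Invert the HLS lightness of an 8-bit RGB color, keeping hue and saturation."""
    h, l, s = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
    r2, g2, b2 = colorsys.hls_to_rgb(h, 1.0 - l, s)
    return tuple(round(c * 255) for c in (r2, g2, b2))

# White paper becomes black, black text becomes white,
# and a fully saturated red (lightness 0.5) stays red
for color in [(255, 255, 255), (0, 0, 0), (255, 0, 0)]:
    print(color, "->", invert_lightness(*color))
```

Because only the lightness channel changes, hues keep their relationships, which is why this works better for dark themes than a plain RGB negative.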

enhanced rendering controls

# Render with custom color scheme
pypdfium2 render input.pdf output \
    --color-scheme custom \
    --fill-to-stroke \
    --invert-lightness

# Process specific pages with options
pypdfium2 render input.pdf output \
    --pages 1-10 \
    --scale 2.0 \
    --rotation 90 \
    --crop 10,10,10,10

platform support expansion

mobile platforms (experimental)

New in v5.0 - experimental support for mobile:

Android architectures (partially working):

arm64-v8a - Modern 64-bit devices
armeabi-v7a - Older 32-bit devices
x86_64 - 64-bit emulators
x86 - 32-bit emulators

iOS architectures (not yet functional):

arm64 - Physical devices
arm64 - M1+ Mac simulators
x86_64 - Intel Mac simulators

⚠️ Important limitations:

Android support is experimental but somewhat functional
iOS support infrastructure exists but does not actually work yet
No wheels are distributed for mobile platforms at this time
This is provided on a best-effort basis and largely untested

# Installation on mobile platforms (experimental)
# Android (via Termux or similar) - may work
pip install pypdfium2 --platform android_arm64

# iOS - infrastructure only, not functional yet
# pip install pypdfium2 --platform ios_arm64  # Won't work

enhanced linux support

Added support via cibuildwheel:

ARM64 (aarch64)
RISC-V (riscv64)
LoongArch64 (loongarch64)
PowerPC64 LE (ppc64le)
S390x (s390x)

build system improvements

native build option

Build without Google’s toolchain:

# Use system compiler (GCC/Clang)
python setupsrc/pypdfium2_setup/build_native.py

# Automatically triggered for unsupported platforms
pip install pypdfium2 --no-binary :all:

security verification

SLSA provenance verification support:

# Download and verify authenticity
pypdfium2 update --verify

# Requires slsa-verifier tool
# https://github.com/slsa-framework/slsa-verifier

platform detection

Improved platform detection in autorelease/bindings.py:

import pypdfium2

# Get platform information
info = pypdfium2.PDFIUM_INFO
print(f"Platform: {info.platform}")
print(f"Architecture: {info.arch}")
print(f"V8 Support: {info.v8_enabled}")
print(f"XFA Support: {info.xfa_enabled}")

performance optimizations

memory leak fixes

Fixed ctypes pointer caching issues:

# ❌ OLD problematic pattern
ptr = cast(address, POINTER(c_char)).contents

# ✅ NEW safe pattern
ptr = c_char.from_address(address)

Affected APIs:

Buffer operations
Bitmap creation
Document loading
Text extraction
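
Both patterns give access to the same memory; they differ in how the ctypes object is constructed. The standalone sketch below demonstrates that equivalence on a small buffer that stands in for memory owned by a C library. It does not reproduce the leak itself.

```python
import ctypes

# A small buffer standing in for memory owned by a C library
buf = ctypes.create_string_buffer(b"PDF")
address = ctypes.addressof(buf)

# Pattern 1: build a pointer type, cast the address, then dereference
via_cast = ctypes.cast(address, ctypes.POINTER(ctypes.c_char)).contents

# Pattern 2: create the ctypes object directly at the address
via_from_address = ctypes.c_char.from_address(address)

# Both views read the same first byte of the buffer
print(via_cast.value, via_from_address.value)
```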

startup performance

Deferred imports for optional dependencies:

# PIL, NumPy, CV2 only imported when needed
# Reduces startup time by ~30% when not using image features

# These imports now happen lazily:
# - PIL: When using to_pil() or from_pil()
# - NumPy: When using to_numpy() or from_numpy()
# - CV2: When using OpenCV integration
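
A deferred import can be as simple as moving the `import` statement into the function that needs it. The sketch below shows a slightly more general variant of the pattern, a proxy object that imports its module on first attribute access. It is an illustration of the technique, not pypdfium2's implementation; `LazyModule` is a hypothetical class.

```python
import importlib

class LazyModule:
    """Import a module on first attribute access instead of at startup."""

    def __init__(self, name):
        self._name = name
        self._module = None

    @property
    def loaded(self):
        return self._module is not None

    def __getattr__(self, attr):
        # Only called for names not found on the proxy itself, i.e. module members
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)

# The standard library's fractions module stands in for an optional dependency
fractions = LazyModule("fractions")
print(fractions.loaded)           # nothing imported by the proxy yet
half = fractions.Fraction(1, 2)   # first access triggers the import
print(fractions.loaded, half)
```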

version parsing optimization

Eliminated expensive git operations:

# Old: git ls-remote to check versions (slow)
# New: Read embedded VERSION file (instant)

# Version now cached at module level
from pypdfium2 import PDFIUM_INFO
# Subsequent access is instant

text extraction fixes and api changes

addressing the regression

The text extraction regressions that caused v4.30.1 and v5.0.0b1 to be yanked were addressed through API changes:

PDFium change: Reverted FPDFText_GetText() back to UCS-2 encoding (from UTF-16), which resolved memory allocation concerns but limited Unicode support.

pypdfium2 response: Removed implicit translation in get_text_range() and added explicit methods:

# For full Unicode support (recommended)
text = textpage.get_text_bounded()  # Handles all Unicode characters

# For basic text (faster, UCS-2 only)
text = textpage.get_text_range()  # Limited to UCS-2 character set

# Example: Extracting text with proper Unicode handling
with pdfium.PdfDocument("document.pdf") as pdf:
    page = pdf[0]
    textpage = page.get_textpage()

    # Full Unicode text extraction
    full_text = textpage.get_text_bounded()

    # Or restrict extraction to a rectangle (PDF page coordinates)
    partial_text = textpage.get_text_bounded(left=50, bottom=100, right=300, top=400)

best practices for text extraction

Always use get_text_bounded() for documents with international characters
Test text extraction after upgrading from affected versions
Verify Unicode handling with documents containing emojis, CJK characters, or special symbols
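
The difference between UCS-2 and UTF-16 only shows up for characters outside the Basic Multilingual Plane: UCS-2 has a single 16-bit unit per character, while UTF-16 encodes characters above U+FFFF as a surrogate pair. The helper functions below (hypothetical names, standard library only) can be used to find test strings that exercise this case.

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

def has_non_bmp(text):
    """True if any character lies outside the Basic Multilingual Plane (above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)

# ASCII and common CJK characters fit in one unit each; the emoji needs two
for sample in ["abc", "漢字", "😀"]:
    print(repr(sample), len(sample), utf16_code_units(sample), has_non_bmp(sample))
```

Documents containing such characters make good regression tests when comparing extraction results before and after an upgrade.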

migration checklist

pre-migration audit

# Script to find v4 code patterns that need updating
import ast
import sys
from pathlib import Path

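def check_v4_patterns(file_path):
    # Read as UTF-8 so the result does not depend on the platform's default encoding
    tree = ast.parse(Path(file_path).read_text(encoding="utf-8"))

    issues = []

    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute):
            # Check for removed/renamed methods by attribute name only:
            # page.render() and page.get_size() are still valid and will be flagged too
            if node.attr in ['render', 'get_pos', 'get_size', 'get_info']:
                issues.append(f"Line {node.lineno}: {node.attr} may need updating")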

    return issues

# Run on your codebase
for py_file in Path(".").rglob("*.py"):
    issues = check_v4_patterns(py_file)
    if issues:
        print(f"\n{py_file}:")
        for issue in issues:
            print(f"  {issue}")

step-by-step migration

Update imports:

# Replace version imports
# from pypdfium2 import V_PYPDFIUM2, V_LIBPDFIUM
from pypdfium2 import PYPDFIUM_INFO, PDFIUM_INFO

Fix rendering loops:

# Replace pdf.render() with page iteration
for page in pdf:
    bitmap = page.render()

Update method calls:

# Rename methods
# obj.get_pos() → obj.get_bounds()
# img.get_size() → img.get_px_size()

Fix bitmap operations:

# Update fill_rect parameter order
# bitmap.fill_rect(x, y, w, h, color) → bitmap.fill_rect(color, x, y, w, h)

Update TOC handling:

# Switch from namedtuple to wrapper classes
for bookmark in pdf.get_toc():
    title = bookmark.get_title()
    dest = bookmark.get_dest()

Add error handling:

# Use new error codes
try:
    pdf = pdfium.PdfDocument(path)
except pdfium.PdfiumError as e:
    if e.err_code == 3:  # Format error
        handle_corrupt_pdf()

testing after migration

import pypdfium2 as pdfium
import tempfile
from pathlib import Path

def test_v5_compatibility():
    # Test basic functionality
    pdf = pdfium.PdfDocument.new()
    page = pdf.new_page(595, 842)

    # Test rendering
    bitmap = page.render(scale=2.0)
    assert bitmap.width == 1190

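    # Test saving, then reopening with a context manager
    # (a temporary directory avoids reopening a still-open temp file, which fails on Windows)
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir) / "test.pdf"
        pdf.save(tmp_path)

        with pdfium.PdfDocument(tmp_path) as loaded:
            assert len(loaded) == 1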

    # Test coordinate conversion
    conv = bitmap.get_posconv(page)
    bx, by = conv.to_bitmap(100, 100)
    px, py = conv.to_page(bx, by)
    assert abs(px - 100) < 0.01

    print("✓ All v5 compatibility tests passed")

test_v5_compatibility()
