pypdfium2 v5.0 breaking changes and migration guide
tl;dr
Breaking: | PdfDocument.render() removed - use page-level rendering |
Renamed: | get_pos() → get_bounds(), get_size() → get_px_size() |
New features: | Coordinate conversion, context managers, mobile support |
Performance: | Parallel rendering removed due to bitmap transfer overhead |
Migration: | Update rendering loops, method names, TOC handling |
Version: | v5.0.0b2 beta - Release notes |
version timeline
- v4.30.0 - Last stable v4 release (2024-05-09)
- v4.30.1 - YANKED - Text extraction regression (crbug.com/387277993)
- v5.0.0b1 - YANKED - Text extraction regression (crbug.com/399937354)
- v5.0.0b2 - Current beta (2025-07-31)
- PDFium version: Updated from 6996 to 7323
important: yanked versions warning
⚠️ Versions 4.30.1 and 5.0.0b1 were yanked from PyPI due to text extraction regressions in the underlying PDFium library:
- v4.30.1 (yanked 2024-12-19): Text extraction regression tracked as Chromium bug 387277993
- v5.0.0b1 (yanked 2025-02-03): Text extraction regression tracked as Chromium bug 399937354
If you installed these versions, downgrade to v4.30.0 or upgrade to v5.0.0b2:
# Downgrade to stable v4
pip install pypdfium2==4.30.0
# Or upgrade to latest beta
pip install pypdfium2==5.0.0b2
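To check an environment for the two yanked releases, the installed version string can be compared against the versions listed above. This is a minimal sketch using only the standard library; is_yanked and installed_pypdfium2_version are hypothetical helper names, not part of pypdfium2.

```python
from importlib import metadata

# Versions listed above as yanked from PyPI
YANKED_VERSIONS = {"4.30.1", "5.0.0b1"}

def is_yanked(version: str) -> bool:
    """Return True if the given version string is one of the yanked releases."""
    return version.strip() in YANKED_VERSIONS

def installed_pypdfium2_version():
    """Return the installed pypdfium2 version, or None if it is not installed."""
    try:
        return metadata.version("pypdfium2")
    except metadata.PackageNotFoundError:
        return None

if __name__ == "__main__":
    version = installed_pypdfium2_version()
    if version is None:
        print("pypdfium2 is not installed")
    elif is_yanked(version):
        print(f"pypdfium2 {version} was yanked; switch to 4.30.0 or 5.0.0b2")
    else:
        print(f"pypdfium2 {version} is not one of the yanked releases")
```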
critical breaking changes
document rendering removal
The most significant change in v5.0 is the complete removal of parallel document rendering:
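# ❌ OLD v4 code - NO LONGER WORKS
# (v4's render() took a converter such as PdfBitmap.to_pil as its first argument)
page_indices = [0, 1, 2]
images = pdf.render(pdfium.PdfBitmap.to_pil, scale=2.0, page_indices=page_indices)
for i, image in zip(page_indices, images):
    image.save(f"page_{i}.png")
# ✅ NEW v5 code - iterate pages directly
for i, page in enumerate(pdf):
    bitmap = page.render(scale=2.0)
    bitmap.to_pil().save(f"page_{i}.png")  # one file per page instead of overwriting page.png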
Why removed: The parallel rendering API had a fundamental flaw: the cost of transferring rendered bitmaps between processes exceeded the gains from parallelization, because copying that much memory across process boundaries is expensive.
Migration strategies:
import pypdfium2 as pdfium
# Sequential processing (simplest)
def render_pdf_sequential(pdf_path, output_dir):
    pdf = pdfium.PdfDocument(pdf_path)
    for i, page in enumerate(pdf):
        bitmap = page.render(scale=2.0)
        bitmap.to_pil().save(f"{output_dir}/page_{i:03d}.png")
    pdf.close()
# Parallel with multiprocessing (if needed)
from multiprocessing import Pool
def render_single_page(args):
    # Each worker opens the document itself and writes its result to disk,
    # so no bitmap has to be copied back to the parent process
    pdf_path, page_index, output_path, scale = args
    pdf = pdfium.PdfDocument(pdf_path)
    page = pdf[page_index]
    bitmap = page.render(scale=scale)
    bitmap.to_pil().save(output_path)
    pdf.close()
def render_pdf_parallel(pdf_path, output_dir, scale=2.0):
    # Open once in the parent only to count pages
    pdf = pdfium.PdfDocument(pdf_path)
    page_count = len(pdf)
    pdf.close()
    args_list = [
        (pdf_path, i, f"{output_dir}/page_{i:03d}.png", scale)
        for i in range(page_count)
    ]
    with Pool() as pool:
        pool.map(render_single_page, args_list)
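One possible refinement of the multiprocessing strategy is to hand each worker a contiguous chunk of pages, so the document is opened once per worker rather than once per page. The sketch below is an illustration under that assumption; chunk_indices and render_chunk are hypothetical names, and render_chunk uses only the calls already shown above (PdfDocument, page indexing, render, to_pil).

```python
def chunk_indices(page_count, n_chunks):
    """Split range(page_count) into at most n_chunks contiguous, near-equal lists."""
    if page_count <= 0 or n_chunks <= 0:
        return []
    n_chunks = min(n_chunks, page_count)
    base, extra = divmod(page_count, n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional page
        size = base + (1 if i < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks

def render_chunk(args):
    """Worker: open the document once, render its share of pages, save to disk."""
    import pypdfium2 as pdfium  # imported inside the worker process
    pdf_path, indices, output_dir, scale = args
    pdf = pdfium.PdfDocument(pdf_path)
    try:
        for i in indices:
            bitmap = pdf[i].render(scale=scale)
            bitmap.to_pil().save(f"{output_dir}/page_{i:03d}.png")
    finally:
        pdf.close()
```

With a pool of n workers, the argument list would be [(pdf_path, chunk, output_dir, 2.0) for chunk in chunk_indices(page_count, n)], passed to pool.map(render_chunk, ...).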
bitmap api changes
color parameter position
The fill_rect() method now takes color as the first parameter:
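# ❌ OLD v4 code (color passed last)
bitmap.fill_rect(10, 20, 100, 50, (255, 0, 0, 255))
# ✅ NEW v5 code (color passed first)
bitmap.fill_rect((255, 0, 0, 255), 10, 20, 100, 50)
# Only positional calls depend on the order; calls that pass every argument
# by keyword (color=, left=, top=, width=, height=) read the same either way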
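Code that has to run against both major versions during a transition can sidestep the reordering by always passing keywords. The wrapper below is a hypothetical shim, not part of pypdfium2, and it assumes the parameter names (color, left, top, width, height) are identical in v4 and v5. The recording class is only a stand-in so the sketch runs without a real bitmap.

```python
def fill_rect_compat(bitmap, color, left, top, width, height):
    """Call fill_rect() with keyword arguments so positional order does not matter."""
    return bitmap.fill_rect(color=color, left=left, top=top, width=width, height=height)

class _RecordingBitmap:
    """Stand-in for a bitmap object; records the arguments it receives."""
    def __init__(self):
        self.calls = []

    def fill_rect(self, color, left, top, width, height):
        self.calls.append((color, left, top, width, height))

demo = _RecordingBitmap()
fill_rect_compat(demo, (255, 0, 0, 255), 10, 20, 100, 50)
```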
removed bitmap info
# ❌ OLD v4 code
info = bitmap.get_info()
print(info.width, info.height, info.format)
# ✅ NEW v5 code
print(bitmap.width, bitmap.height, bitmap.format)
numpy array shape changes
Grayscale bitmaps now return proper 2D arrays:
# v4: grayscale returned shape (height, width, 1)
# v5: grayscale returns shape (height, width)
arr = grayscale_bitmap.to_numpy()
assert arr.ndim == 2 # Now 2D for grayscale
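Downstream code that indexed the third axis will break on the new 2D grayscale arrays. A small shape check like the following accepts both layouts; channel_count is a hypothetical helper that works on the shape tuple alone, so it does not need NumPy itself.

```python
def channel_count(shape):
    """Return the number of channels implied by a bitmap array shape."""
    if len(shape) == 2:
        # v5 grayscale layout: (height, width)
        return 1
    if len(shape) == 3:
        # Colour layout, or v4 grayscale: (height, width, channels)
        return shape[2]
    raise ValueError(f"unexpected bitmap array shape: {shape!r}")
```

In practice this would be called as channel_count(bitmap.to_numpy().shape).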
method renames
Old Method (v4) | New Method (v5) | Context |
---|---|---|
PdfObject.get_pos() | PdfObject.get_bounds() | All page objects |
PdfImage.get_size() | PdfImage.get_px_size() | Image dimensions |
PdfMatrix.mirror(v, h) | PdfMatrix.mirror(invert_x, invert_y) | Matrix operations |
table of contents restructuring
The TOC API completely changed from namedtuples to method-oriented wrapper classes:
# ❌ OLD v4 code - namedtuple approach
for item in pdf.get_toc():
    print(f"Title: {item.title}")
    print(f"Page: {item.page_index}")
    print(f"View mode: {item.view_mode}")
    print(f"View pos: {item.view_pos}")
# ✅ NEW v5 code - wrapper class approach
for bookmark in pdf.get_toc():
    title = bookmark.get_title()
    dest = bookmark.get_dest()
    print(f"Title: {title}")
    if dest:  # a bookmark may have no destination
        page_index = dest.get_index()
        # get_view() returns the view mode and position together
        view_mode, view_pos = dest.get_view()
        print(f"Page: {page_index}")
        print(f"View mode: {view_mode}")
        print(f"View pos: {view_pos}")
New classes provide better encapsulation:
- PdfBookmark - Represents a bookmark/outline item
- PdfDest - Represents a destination
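Code that only needs titles and page numbers can hide the new wrapper classes behind a small adapter. The sketch below relies only on get_title(), get_dest() and get_index() as described above; toc_entries is a hypothetical helper, and the two fake classes are stand-ins so the example runs without a PDF.

```python
def toc_entries(bookmarks):
    """Yield (title, page_index) pairs; page_index is None without a destination."""
    for bookmark in bookmarks:
        dest = bookmark.get_dest()
        page_index = dest.get_index() if dest else None
        yield bookmark.get_title(), page_index

class _FakeDest:
    """Stand-in for a destination object."""
    def __init__(self, index):
        self._index = index

    def get_index(self):
        return self._index

class _FakeBookmark:
    """Stand-in for a bookmark object."""
    def __init__(self, title, dest):
        self._title, self._dest = title, dest

    def get_title(self):
        return self._title

    def get_dest(self):
        return self._dest

demo = list(toc_entries([
    _FakeBookmark("Intro", _FakeDest(0)),
    _FakeBookmark("Appendix", None),
]))
```

With a real document the call would be toc_entries(pdf.get_toc()).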
version api cleanup
All legacy version flags removed:
# ❌ OLD v4 code
from pypdfium2 import V_PYPDFIUM2, V_LIBPDFIUM, V_BUILDNAME
print(f"pypdfium2: {V_PYPDFIUM2}")
print(f"PDFium: {V_LIBPDFIUM}")
# ✅ NEW v5 code
from pypdfium2 import PYPDFIUM_INFO, PDFIUM_INFO
print(f"pypdfium2: {PYPDFIUM_INFO.version}")
print(f"PDFium: {PDFIUM_INFO.version}")
print(f"Build: {PDFIUM_INFO.build}")
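A library that must support both v4 and v5 for a while could probe for the new attributes and fall back to the old flags. This is a sketch under the assumption that the names are exactly those shown above; describe_versions is a hypothetical helper, and the SimpleNamespace objects carry illustrative values only.

```python
from types import SimpleNamespace

def describe_versions(module):
    """Return (pypdfium2_version, pdfium_version) as strings for a v4 or v5 module."""
    if hasattr(module, "PYPDFIUM_INFO"):
        # v5 style: version info objects
        return str(module.PYPDFIUM_INFO.version), str(module.PDFIUM_INFO.version)
    # v4 style: legacy flags
    return str(module.V_PYPDFIUM2), str(module.V_LIBPDFIUM)

# Stand-ins for the two module layouts (values are illustrative)
fake_v4 = SimpleNamespace(V_PYPDFIUM2="4.30.0", V_LIBPDFIUM="6462")
fake_v5 = SimpleNamespace(
    PYPDFIUM_INFO=SimpleNamespace(version="5.0.0b2"),
    PDFIUM_INFO=SimpleNamespace(version="7323"),
)
```

In real code the argument would be the imported pypdfium2 module itself.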
image extraction changes
The extract() method no longer supports fallback rendering:
# ❌ OLD v4 code
image.extract("image_out", fb_render=True)  # fb_render removed
# ✅ NEW v5 code
# extract() writes to a destination (file path prefix or byte buffer)
image.extract("image_out")
# If the rendered appearance is what you need, request it explicitly
bitmap = image.get_bitmap(render=True)
bitmap.to_pil().save("image_out_rendered.png")
new features and additions
coordinate conversion helper
New PdfPosConv class for coordinate translation:
# Create converter for page-to-bitmap coordinate mapping
page = pdf[0]
bitmap = page.render(scale=2.0, rotation=90)
conv = bitmap.get_posconv(page)
# Convert coordinates
bitmap_x, bitmap_y = conv.to_bitmap(page_x=100, page_y=200)
page_x, page_y = conv.to_page(bitmap_x=50, bitmap_y=75)
# Direct instantiation: takes the page and the canvas arguments
# (start_x, start_y, size_x, size_y, rotate) as used by FPDF_RenderPageBitmap()
from pypdfium2 import PdfPosConv
conv = PdfPosConv(page, (0, 0, bitmap.width, bitmap.height, 1))  # rotate=1 is 90 degrees
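For intuition, the mapping in the simplest case (no rotation, no crop) is a scale plus a vertical flip, because PDF page coordinates start at the bottom-left while bitmap coordinates start at the top-left. The functions below are a simplified sketch of that arithmetic only; they are hypothetical helpers, not the library's implementation, and they ignore rotation, cropping and pixel rounding.

```python
def page_to_bitmap(page_x, page_y, page_height, scale):
    """Map PDF points (origin bottom-left) to bitmap pixels (origin top-left)."""
    return page_x * scale, (page_height - page_y) * scale

def bitmap_to_page(bitmap_x, bitmap_y, page_height, scale):
    """Inverse mapping: bitmap pixels back to PDF points."""
    return bitmap_x / scale, page_height - bitmap_y / scale

# A point 100 pt from the left and 200 pt from the bottom of an A4 page at scale 2
bx, by = page_to_bitmap(100, 200, page_height=842, scale=2.0)
px, py = bitmap_to_page(bx, by, page_height=842, scale=2.0)
```

For rotated or cropped renders, use the converter returned by get_posconv() instead of hand-written arithmetic.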
quad points for transformed objects
Get corner positions of transformed objects:
# Get object corners (counter-clockwise from bottom-left)
import pypdfium2.raw as pdfium_c
# get_objects() is a generator, so take the first object with next()
obj = next(page.get_objects())
if obj.type in (pdfium_c.FPDF_PAGEOBJ_IMAGE, pdfium_c.FPDF_PAGEOBJ_TEXT):
    corners = obj.get_quad_points()
    # Returns: (bottom_left, bottom_right, top_right, top_left)
    for x, y in corners:
        print(f"Corner at ({x}, {y})")
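When an axis-aligned box is still needed, for example to crop a rendered region, it can be derived from the four corners. quad_bounds below is a hypothetical helper that assumes the corners arrive as (x, y) pairs as described above.

```python
def quad_bounds(corners):
    """Return (left, bottom, right, top) of the box enclosing four (x, y) corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)

# A square rotated by 45 degrees: its enclosing box is wider than any single edge
rotated_square = ((0, 0), (10, 10), (0, 20), (-10, 10))
box = quad_bounds(rotated_square)
```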
context manager support
Documents now support with-statements:
# Automatic cleanup
with pdfium.PdfDocument("document.pdf") as pdf:
    page = pdf[0]
    bitmap = page.render()
# PDF automatically closed when exiting block
enhanced error handling
Errors now include PDFium error codes:
try:
    pdf = pdfium.PdfDocument("corrupted.pdf")
except pdfium.PdfiumError as e:
    print(f"Error code: {e.err_code}")
# Error codes from PDFium:
# 0: Success
# 1: Unknown error
# 2: File access error
# 3: Format error
# 4: Password error
# 5: Security error
# 6: Page not found
page flattening
Previously private method now public:
# Flatten annotations and form fields into page content
page.flatten()
# Useful for:
# - Preventing annotation editing
# - Ensuring consistent rendering
# - Reducing file complexity
improved image rendering
New option for native resolution rendering:
# Render image at original pixel size for best quality
# (pdfium_c is pypdfium2.raw; get_objects() is a generator, hence next())
image_obj = next(page.get_objects(filter=[pdfium_c.FPDF_PAGEOBJ_IMAGE]))
bitmap = image_obj.get_bitmap(render=True, scale_to_original=True)
# Previous approach scaled to page size
# New approach preserves original image dimensions
bitmap rendering options
Enhanced rendering control:
# Render with alpha channel for transparency
bitmap = page.render(use_alpha=True)
# Standard rendering without alpha
bitmap = page.render(use_alpha=False)
# Note: maybe_alpha is a CLI option, not an API method
# In code, explicitly check for transparency:
if pdfium_c.FPDFPage_HasTransparency(page):
    bitmap = page.render(use_alpha=True)
cli enhancements
dark theme rendering
New options for dark mode PDFs:
# Invert lightness while preserving color relationships
pypdfium2 render input.pdf output --invert-lightness
# Exclude images from inversion
pypdfium2 render input.pdf output --invert-lightness --exclude-images
# Convert fills to strokes for better visibility
pypdfium2 render input.pdf output --fill-to-stroke
Implementation uses HLS color space for intelligent inversion.
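The idea behind lightness inversion can be reproduced with the standard library's colorsys module: convert to HLS, flip the lightness component, and convert back, which leaves hue and saturation alone. The function below is an illustration of the technique, not the CLI's actual code; invert_lightness is a hypothetical name.

```python
import colorsys

def invert_lightness(r, g, b):
    """Invert the lightness of an 8-bit RGB colour while keeping hue and saturation."""
    h, l, s = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
    r2, g2, b2 = colorsys.hls_to_rgb(h, 1.0 - l, s)
    return round(r2 * 255), round(g2 * 255), round(b2 * 255)

white_inverted = invert_lightness(255, 255, 255)  # a white page background
red_inverted = invert_lightness(255, 0, 0)        # a fully saturated mid-lightness colour
```

White becomes black, while pure red (lightness 0.5) is unchanged, which is why coloured elements stay recognizable in dark mode.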
enhanced rendering controls
# Render with custom color scheme
pypdfium2 render input.pdf output \
--color-scheme custom \
--fill-to-stroke \
--invert-lightness
# Process specific pages with options
pypdfium2 render input.pdf output \
--pages 1-10 \
--scale 2.0 \
--rotation 90 \
--crop 10,10,10,10
platform support expansion
mobile platforms (experimental)
New in v5.0 - experimental support for mobile:
Android architectures (partially working):
- arm64-v8a - Modern 64-bit devices
- armeabi-v7a - Older 32-bit devices
- x86_64 - 64-bit emulators
- x86 - 32-bit emulators
iOS architectures (not yet functional):
- arm64 - Physical devices
- arm64 - M1+ Mac simulators
- x86_64 - Intel Mac simulators
⚠️ Important limitations:
- Android support is experimental but somewhat functional
- iOS support infrastructure exists but does not actually work yet
- No wheels are distributed for mobile platforms at this time
- This is provided on a best-effort basis and largely untested
# Installation on mobile platforms (experimental)
# Android (via Termux or similar) - may work
pip install pypdfium2 --platform android_arm64
# iOS - infrastructure only, not functional yet
# pip install pypdfium2 --platform ios_arm64 # Won't work
enhanced linux support
Added support via cibuildwheel:
- ARM64 (aarch64)
- RISC-V (riscv64)
- LoongArch64 (loongarch64)
- PowerPC64 LE (ppc64le)
- S390x (s390x)
build system improvements
native build option
Build without Google’s toolchain:
# Use system compiler (GCC/Clang)
python setupsrc/pypdfium2_setup/build_native.py
# Automatically triggered for unsupported platforms
pip install pypdfium2 --no-binary :all:
security verification
SLSA provenance verification support:
# Download and verify authenticity
pypdfium2 update --verify
# Requires slsa-verifier tool
# https://github.com/slsa-framework/slsa-verifier
platform detection
Improved platform detection in autorelease/bindings.py:
import pypdfium2
# Get platform information
info = pypdfium2.PDFIUM_INFO
print(f"Platform: {info.platform}")
print(f"Architecture: {info.arch}")
print(f"V8 Support: {info.v8_enabled}")
print(f"XFA Support: {info.xfa_enabled}")
performance optimizations
memory leak fixes
Fixed ctypes pointer caching issues:
# ❌ OLD problematic pattern
ptr = cast(address, POINTER(c_char)).contents
# ✅ NEW safe pattern
ptr = c_char.from_address(address)
Affected APIs:
- Buffer operations
- Bitmap creation
- Document loading
- Text extraction
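The two patterns read the same memory, which a small standalone ctypes experiment can confirm. The sketch only demonstrates that equivalence on a local buffer; it does not reproduce the leak itself, and the variable names are illustrative.

```python
import ctypes

# A small buffer whose address is used for both access patterns
buffer = ctypes.create_string_buffer(b"PDF")
address = ctypes.addressof(buffer)

# Pattern 1: build a pointer type, cast the address to it, then dereference
via_cast = ctypes.cast(address, ctypes.POINTER(ctypes.c_char)).contents

# Pattern 2: create the ctypes object directly at the address, no pointer type involved
via_from_address = ctypes.c_char.from_address(address)

first_byte_cast = via_cast.value
first_byte_direct = via_from_address.value
```

Both variables hold the first byte of the buffer, so the second pattern is a drop-in replacement wherever only the referenced object is needed.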
startup performance
Deferred imports for optional dependencies:
# PIL, NumPy, CV2 only imported when needed
# Reduces startup time by ~30% when not using image features
# These imports now happen lazily:
# - PIL: When using to_pil() or from_pil()
# - NumPy: When using to_numpy() or from_numpy()
# - CV2: When using OpenCV integration
version parsing optimization
Eliminated expensive git operations:
# Old: git ls-remote to check versions (slow)
# New: Read embedded VERSION file (instant)
# Version now cached at module level
from pypdfium2 import PDFIUM_INFO
# Subsequent access is instant
text extraction fixes and api changes
addressing the regression
The text extraction regressions that caused v4.30.1 and v5.0.0b1 to be yanked were addressed through API changes:
PDFium change: Reverted FPDFText_GetText() back to UCS-2 encoding (from UTF-16), which resolved memory allocation concerns but limited Unicode support.
pypdfium2 response: Removed implicit translation in get_text_range() and added explicit methods:
# For full Unicode support (recommended)
text = textpage.get_text_bounded() # Handles all Unicode characters
# For basic text (faster, UCS-2 only)
text = textpage.get_text_range() # Limited to UCS-2 character set
# Example: Extracting text with proper Unicode handling
with pdfium.PdfDocument("document.pdf") as pdf:
    page = pdf[0]
    textpage = page.get_textpage()
    # Full Unicode text extraction
    full_text = textpage.get_text_bounded()
    # Or restricted to a rectangle in page coordinates
    # (get_text_bounded() takes bounds, not a character range)
    partial_text = textpage.get_text_bounded(left=0, bottom=0, right=300, top=400)
best practices for text extraction
- Always use get_text_bounded() for documents with international characters
- Test text extraction after upgrading from affected versions
- Verify Unicode handling with documents containing emojis, CJK characters, or special symbols
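UCS-2 can only represent code points in the Basic Multilingual Plane (up to U+FFFF), so characters beyond it, such as most emoji, are the ones worth including in test documents. The helper below is a hypothetical check on an extracted string, useful for confirming that a test sample actually exercises that case.

```python
def non_bmp_characters(text):
    """Return the characters in text that lie outside the Basic Multilingual Plane."""
    return [ch for ch in text if ord(ch) > 0xFFFF]

cjk_only = non_bmp_characters("abc 漢字")   # CJK ideographs here are inside the BMP
with_emoji = non_bmp_characters("ok 😀")    # U+1F600 is outside the BMP
```

An empty result means the sample would not reveal a UCS-2 limitation, even if it contains CJK text.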
migration checklist
pre-migration audit
# Script to find v4 code patterns that need updating
import ast
from pathlib import Path
def check_v4_patterns(file_path):
    try:
        tree = ast.parse(Path(file_path).read_text(encoding="utf-8"))
    except (SyntaxError, UnicodeDecodeError):
        return []  # skip files that cannot be parsed
    issues = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute):
            # Check for removed/renamed methods
            # (page.render() also matches 'render', so review each hit by hand)
            if node.attr in ['render', 'get_pos', 'get_size', 'get_info']:
                issues.append(f"Line {node.lineno}: {node.attr} may need updating")
    return issues
# Run on your codebase
for py_file in Path(".").rglob("*.py"):
    issues = check_v4_patterns(py_file)
    if issues:
        print(f"\n{py_file}:")
        for issue in issues:
            print(f"  {issue}")
step-by-step migration
- Update imports:
# Replace version imports
# from pypdfium2 import V_PYPDFIUM2, V_LIBPDFIUM
from pypdfium2 import PYPDFIUM_INFO, PDFIUM_INFO
- Fix rendering loops:
# Replace pdf.render() with page iteration
for page in pdf:
    bitmap = page.render()
- Update method calls:
# Rename methods
# obj.get_pos() → obj.get_bounds()
# img.get_size() → img.get_px_size()
- Fix bitmap operations:
# Update fill_rect parameter order
# bitmap.fill_rect(x, y, w, h, color) → bitmap.fill_rect(color, x, y, w, h)
- Update TOC handling:
# Switch from namedtuple to wrapper classes
for bookmark in pdf.get_toc():
    title = bookmark.get_title()
    dest = bookmark.get_dest()
- Add error handling:
# Use new error codes
try:
    pdf = pdfium.PdfDocument(path)
except pdfium.PdfiumError as e:
    if e.err_code == 3:  # Format error
        handle_corrupt_pdf()
testing after migration
import pypdfium2 as pdfium
import tempfile
from pathlib import Path
def test_v5_compatibility():
    # Test basic functionality
    pdf = pdfium.PdfDocument.new()
    page = pdf.new_page(595, 842)
    # Test rendering
    bitmap = page.render(scale=2.0)
    assert bitmap.width == 1190
    # Test context manager
    with tempfile.NamedTemporaryFile(suffix='.pdf') as tmp:
        pdf.save(tmp.name)
        with pdfium.PdfDocument(tmp.name) as loaded:
            assert len(loaded) == 1
    # Test coordinate conversion
    conv = bitmap.get_posconv(page)
    bx, by = conv.to_bitmap(100, 100)
    px, py = conv.to_page(bx, by)
    assert abs(px - 100) < 0.01
    print("✓ All v5 compatibility tests passed")
test_v5_compatibility()
references
- pypdfium2 v5.0.0b2 Release Notes - Latest beta release
- pypdfium2 v5.0.0b1 Release Notes - First beta release
- Migration Guide - Official migration documentation
- Issue #274: Parallel Rendering Removal - Discussion on render() removal
- pypdfium2 Source Code - GitHub repository
- PDFium Documentation - Upstream PDFium project
- pypdfium2 Changelog - Detailed changelog