OCR Integration for Scanned Documents
Optical Character Recognition (OCR) allows you to extract text from scanned documents where the text isn't embedded in the PDF. This tutorial demonstrates how to work with scanned documents.
#%pip install "natural-pdf[all]"
from natural_pdf import PDF
# Load a PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]
# Try extracting text without OCR
text_without_ocr = page.extract_text()
f"Without OCR: {len(text_without_ocr)} characters extracted"
'Without OCR: 0 characters extracted'
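The zero-character result above is the usual sign that a page is image-only. A minimal sketch of turning that check into a reusable test follows; the helper name `needs_ocr` and the 20-character threshold are illustrative assumptions, not part of natural-pdf.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Return True when a page yields too little embedded text to be usable.

    `min_chars` is an illustrative threshold, not a natural-pdf setting.
    """
    # Treat None and whitespace-only text the same as an empty string
    stripped = (extracted_text or "").strip()
    return len(stripped) < min_chars


print(needs_ocr(""))  # the scanned page above had no embedded text
print(needs_ocr("Jungle Health and Safety Inspection Service"))
```

In practice you would pass in `page.extract_text()` and only call `page.apply_ocr()` when the helper returns True.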
Applying OCR and Finding Elements
The core method is page.apply_ocr(). This runs the OCR process and adds TextElement objects to the page. You can specify the engine and languages.
Note: Re-applying OCR to the same page or region will automatically remove any previously generated OCR elements for that area before adding the new ones.
# Apply OCR using the default engine (EasyOCR) for English
page.apply_ocr(languages=['en'])
# Select all text pieces found by OCR
text_elements = page.find_all('text[source=ocr]')
print(f"Found {len(text_elements)} text elements using default OCR")
# Visualize the elements
text_elements.show()
Using CPU. Note: This module is much faster with a GPU.
Found 45 text elements using default OCR
# Apply OCR using PaddleOCR for English
page.apply_ocr(engine='paddle', languages=['en'])
print(f"Found {len(page.find_all('text[source=ocr]'))} elements after English OCR.")
# Apply OCR using PaddleOCR for Chinese
page.apply_ocr(engine='paddle', languages=['ch'])
print(f"Found {len(page.find_all('text[source=ocr]'))} elements after Chinese OCR.")
text_with_ocr = page.extract_text()
print(f"\nExtracted text after OCR:\n{text_with_ocr[:150]}...")
Creating model: ('PP-OCRv5_server_det', None)
Using official model (PP-OCRv5_server_det), the model files will be automatically downloaded and saved in /Users/soma/.paddlex/official_models.
Creating model: ('PP-OCRv5_server_rec', None)
Using official model (PP-OCRv5_server_rec), the model files will be automatically downloaded and saved in /Users/soma/.paddlex/official_models.
Found 43 elements after English OCR.
Found 43 elements after Chinese OCR.

Extracted text after OCR:
Jungle Health and Safety Inspection Service INS-UP70N51NCL41R Site: Durham's Meatpacking Chicago, III. Date: February 3, 1905 Violation Count: 7 Summ...
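Because re-applying OCR replaces the earlier OCR elements, counting `text[source=ocr]` after each run gives a clean per-engine number. A small sketch of that comparison loop, using only the `apply_ocr` and `find_all` calls shown in this tutorial; the helper name `compare_engines` is hypothetical.

```python
def compare_engines(page, runs):
    """Run each (engine, languages) pair on `page` and count OCR elements.

    Returns a list of (engine, languages, element_count) tuples in run order.
    """
    results = []
    for engine, languages in runs:
        # Each call discards the previous OCR elements before adding new ones
        page.apply_ocr(engine=engine, languages=languages)
        count = len(page.find_all("text[source=ocr]"))
        results.append((engine, languages, count))
    return results
```

For example, `compare_engines(page, [("easyocr", ["en"]), ("paddle", ["en"])])` would run both engines in turn on the page loaded above.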
You can also use .describe() to see a summary of the OCR outcome...
page.describe()
Page 1 Summary
Elements:
- text: 43 elements
- image: 1 elements
Text Analysis:
- typography:
- fonts:
- OCR: 43
- sizes:
- 29.0pt: 9
- 34.0pt: 9
- 23.0pt: 8
- 32.0pt: 8
- 39.0pt: 3
- 22.0pt: 2
- 31.0pt: 2
- 27.0pt: 1
- 16.0pt: 1
- styles: 43 highlight
- ocr quality:
- confidence stats:
- mean: 0.98
- min: 0.87
- max: 1.00
- quality distribution:
- 99%+ (20/43) 47%:
██████████████████░░░░░░░░░░░░░░░░░░░░░░
- 95%+ (39/43) 91%:
████████████████████████████████████░░░░
- 90%+ (42/43) 98%:
███████████████████████████████████████░
- lowest scoring:
- 1: 0.87: □
- 2: 0.91: □
- 3: 0.91: Date: February 3, 1905
- 4: 0.93: □
- 5: 0.95: These people could not be shown to the visitor-for the odor ...
- 6: 0.95: Unsanitary Working Conditions.
- 7: 0.96: exhibiting - sometimes they would be overlooked for days, ti...
- 8: 0.96: into the vats; and when they were fished out, there was neve...
- 9: 0.96: □
- 10: 0.96: Summary: Worst of any, however, were the fertilizer men, and...
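The quality distribution in this summary is a count of how many confidences sit at or above each threshold. A rough sketch of that arithmetic follows; `confidence_summary` is a hypothetical helper, not the code behind `.describe()`, and the input is only the ten lowest scores listed above, so the shares differ from the full-page figures.

```python
from statistics import mean


def confidence_summary(confidences, thresholds=(0.99, 0.95, 0.90)):
    """Summarise OCR confidences: basic stats plus the share at or above each threshold."""
    if not confidences:
        raise ValueError("need at least one confidence value")
    total = len(confidences)
    shares = {
        threshold: sum(1 for c in confidences if c >= threshold) / total
        for threshold in thresholds
    }
    return {
        "mean": mean(confidences),
        "min": min(confidences),
        "max": max(confidences),
        "share_at_or_above": shares,
    }


# The ten lowest scores from the summary above (not the whole page)
lowest = [0.87, 0.91, 0.91, 0.93, 0.95, 0.95, 0.96, 0.96, 0.96, 0.96]
summary = confidence_summary(lowest)
print(f"min={summary['min']:.2f} max={summary['max']:.2f}")
for threshold, share in summary["share_at_or_above"].items():
    print(f"{threshold:.0%}+ : {share:.0%}")
```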
...or .inspect() on the text elements for individual details.
page.find_all('text').inspect()
Collection Inspection (43 elements)
Word Elements
text | x0 | top | x1 | bottom | font_family | font_variant | size | bold | italic | strike | underline | highlight | source | confidence | color |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Jungle Health and Safety Inspection Service | 797 | 70 | 1131 | 92 | OCR | | 23 | False | False | False | False | True | ocr | 0.98 | #000000 |
INS-UP70N51NCL41R | 797 | 90 | 972 | 112 | OCR | | 23 | False | False | False | False | True | ocr | 1.00 | #000000 |
Site: Durham's Meatpacking Chicago, III. | 97 | 168 | 489 | 197 | OCR | | 29 | False | False | False | False | True | ocr | 0.96 | #000000 |
Date: February 3, 1905 | 97 | 211 | 328 | 240 | OCR | | 29 | False | False | False | False | True | ocr | 0.91 | #000000 |
Violation Count: 7 | 97 | 251 | 286 | 280 | OCR | | 29 | False | False | False | False | True | ocr | 0.97 | #000000 |
Summary: Worst of any, however, were the fertilize... | 100 | 296 | 1051 | 319 | OCR | | 23 | False | False | False | False | True | ocr | 0.96 | #000000 |
These people could not be shown to the visitor-for... | 97 | 325 | 1061 | 354 | OCR | | 29 | False | False | False | False | True | ocr | 0.95 | #000000 |
visitor at a hundred yards, and as for the other m... | 100 | 363 | 1016 | 386 | OCR | | 23 | False | False | False | False | True | ocr | 0.97 | #000000 |
some of which there were open vats near the level ... | 100 | 397 | 1034 | 419 | OCR | | 23 | False | False | False | False | True | ocr | 0.98 | #000000 |
into the vats; and when they were fished out, ther... | 100 | 431 | 963 | 453 | OCR | | 23 | False | False | False | False | True | ocr | 0.96 | #000000 |
exhibiting - sometimes they would be overlooked fo... | 100 | 465 | 1027 | 487 | OCR | | 22 | False | False | False | False | True | ocr | 0.96 | #000000 |
to the world as Durham's Pure Leaf Lard! | 100 | 496 | 483 | 518 | OCR | | 23 | False | False | False | False | True | ocr | 0.97 | #000000 |
Violations | 97 | 765 | 226 | 803 | OCR | | 39 | False | False | False | False | True | ocr | 1.00 | #000000 |
Description | 210 | 818 | 337 | 857 | OCR | | 39 | False | False | False | False | True | ocr | 1.00 | #000000 |
Statute | 106 | 821 | 191 | 855 | OCR | | 34 | False | False | False | False | True | ocr | 1.00 | #000000 |
Level | 939 | 821 | 1007 | 855 | OCR | | 34 | False | False | False | False | True | ocr | 1.00 | #000000 |
Repeat? | 1045 | 821 | 1138 | 855 | OCR | | 34 | False | False | False | False | True | ocr | 1.00 | #000000 |
4.12.7 | 106 | 863 | 177 | 895 | OCR | | 32 | False | False | False | False | True | ocr | 1.00 | #000000 |
Critical | 941 | 863 | 1016 | 895 | OCR | | 32 | False | False | False | False | True | ocr | 1.00 | #000000 |
Unsanitary Working Conditions. | 214 | 866 | 512 | 895 | OCR | | 29 | False | False | False | False | True | ocr | 0.95 | #000000 |
Serious | 938 | 901 | 1024 | 941 | OCR | | 39 | False | False | False | False | True | ocr | 1.00 | #000000 |
5.8.3 | 106 | 904 | 166 | 938 | OCR | | 34 | False | False | False | False | True | ocr | 1.00 | #000000 |
Inadequate Protective Equipment. | 213 | 906 | 534 | 935 | OCR | | 29 | False | False | False | False | True | ocr | 0.97 | #000000 |
□ | 1073 | 945 | 1107 | 978 | OCR | | 34 | False | False | False | False | True | ocr | 0.91 | #000000 |
6.3.9 | 106 | 946 | 166 | 980 | OCR | | 34 | False | False | False | False | True | ocr | 1.00 | #000000 |
Serious | 941 | 946 | 1023 | 978 | OCR | | 32 | False | False | False | False | True | ocr | 1.00 | #000000 |
Ineffective Injury Prevention. | 213 | 949 | 483 | 978 | OCR | | 29 | False | False | False | False | True | ocr | 0.98 | #000000 |
7.1.5 | 106 | 987 | 168 | 1021 | OCR | | 34 | False | False | False | False | True | ocr | 1.00 | #000000 |
□ | 1073 | 987 | 1107 | 1021 | OCR | | 34 | False | False | False | False | True | ocr | 0.87 | #000000 |
Critical | 941 | 989 | 1016 | 1021 | OCR | | 32 | False | False | False | False | True | ocr | 1.00 | #000000 |
Showing 30 of 43 elements (pass limit= to see more)
Setting Default OCR Options
You can set global default OCR options using natural_pdf.options. These defaults will be used automatically when you call apply_ocr() without specifying parameters.
import natural_pdf as npdf
# Set global OCR defaults
npdf.options.ocr.engine = 'surya' # Default OCR engine
npdf.options.ocr.min_confidence = 0.7 # Default confidence threshold
# Now all OCR calls use these defaults
pdf = npdf.PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
pdf.pages[0].apply_ocr() # Uses the defaults set above: engine='surya', min_confidence=0.7
# You can still override defaults for specific calls
pdf.pages[0].apply_ocr(engine='easyocr', languages=['fr']) # Override engine and languages
Loaded detection model s3://text_detection/2025_02_28 on device mps with dtype torch.float16
Loaded recognition model s3://text_recognition/2025_02_18 on device mps with dtype torch.float16
Detecting bboxes: 0%| | 0/1 [00:00<?, ?it/s]
Detecting bboxes: 100%|█████████████████████████| 1/1 [00:00<00:00, 1.26it/s]
Recognizing Text: 0%| | 0/1 [00:00<?, ?it/s]
Recognizing Text: 100%|█████████████████████████| 1/1 [00:10<00:00, 10.16s/it]
[2025-06-26 22:39:05,391] [ WARNING] easyocr.py:71 - Using CPU. Note: This module is much faster with a GPU.
<Page number=1 index=0>
This is especially useful when processing many documents with the same OCR settings, as you don't need to specify the parameters repeatedly.
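The defaults-plus-override behaviour described here can be pictured as a simple merge: start from the global settings, then let any argument passed to the call win. The sketch below is a stand-in for illustration only. `OcrDefaults` and `resolve_ocr_settings` are hypothetical names, and the initial field values are placeholders rather than natural-pdf's real defaults.

```python
from dataclasses import dataclass, asdict


@dataclass
class OcrDefaults:
    """Hypothetical stand-in for a global OCR options object."""
    engine: str = "easyocr"
    languages: tuple = ("en",)
    min_confidence: float = 0.5  # placeholder value


defaults = OcrDefaults()


def resolve_ocr_settings(**overrides):
    """Merge per-call overrides over the global defaults; None means 'not given'."""
    settings = asdict(defaults)
    for key, value in overrides.items():
        if key not in settings:
            raise TypeError(f"unknown OCR setting: {key}")
        if value is not None:
            settings[key] = value
    return settings


# Change the global defaults once...
defaults.engine = "surya"
defaults.min_confidence = 0.7

# ...then calls without arguments pick them up, and explicit arguments still win
print(resolve_ocr_settings())
print(resolve_ocr_settings(engine="easyocr", languages=("fr",)))
```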
Advanced OCR Configuration
For more control, import and use the specific Options class for your chosen engine within the apply_ocr call.
from natural_pdf.ocr import PaddleOCROptions, EasyOCROptions, SuryaOCROptions
# Re-apply OCR using EasyOCR with specific options
easy_opts = EasyOCROptions(
paragraph=False,
)
page.apply_ocr(engine='easyocr', languages=['en'], min_confidence=0.1, options=easy_opts)
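# Re-apply OCR using PaddleOCR with its default options
paddle_opts = PaddleOCROptions()
page.apply_ocr(engine='paddle', languages=['en'], options=paddle_opts)
# Re-apply OCR using Surya in detection-only mode (finds text boxes without recognizing their contents)
surya_opts = SuryaOCROptions()
page.apply_ocr(engine='surya', languages=['en'], min_confidence=0.5, detect_only=True, options=surya_opts)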
Loaded detection model s3://text_detection/2025_02_28 on device mps with dtype torch.float16
Loaded recognition model s3://text_recognition/2025_02_18 on device mps with dtype torch.float16
Detecting bboxes: 0%| | 0/1 [00:00<?, ?it/s]
Detecting bboxes: 100%|█████████████████████████| 1/1 [00:00<00:00, 1.44it/s]
<Page number=1 index=0>
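The calls above pass different `min_confidence` values. A minimal sketch of what such a threshold does, assuming it simply discards results that score below it (the tutorial does not spell out the exact rule). The helper name is hypothetical and the sample rows are copied from the inspection table earlier in this tutorial.

```python
def filter_by_confidence(results, min_confidence):
    """Keep (text, confidence) pairs whose confidence is at or above the threshold."""
    return [(text, conf) for text, conf in results if conf >= min_confidence]


# A few (text, confidence) rows from the inspection table
sample = [
    ("Date: February 3, 1905", 0.91),
    ("Violation Count: 7", 0.97),
    ("Unsanitary Working Conditions.", 0.95),
    ("□", 0.87),
]

for threshold in (0.5, 0.9, 0.96):
    kept = filter_by_confidence(sample, threshold)
    print(threshold, [text for text, _ in kept])
```

A low threshold keeps nearly everything, including doubtful fragments, while a high one trades recall for cleaner text.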
Interactive OCR Correction / Debugging
If OCR results aren't perfect, you can use the bundled interactive web application (SPA) to review and correct them.
1. Package the data: After running apply_ocr (or apply_layout), use create_correction_task_package to create a zip file containing the PDF images and detected elements.

   from natural_pdf.utils.packaging import create_correction_task_package

   page.apply_ocr()
   create_correction_task_package(pdf, "correction_package.zip", overwrite=True)

2. Run the SPA: Navigate to the SPA directory within the installed natural_pdf library in your terminal and start a simple web server.

3. Use the SPA: Open http://localhost:8000 in your browser. Drag the correction_package.zip file onto the page to load the document. You can then click on text elements to correct the OCR results.
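For the "simple web server" step, Python's standard library is enough. The sketch below serves a directory on port 8000 with `http.server`. The helper names are hypothetical, and the SPA directory is left as an argument because its exact location inside the installed package is not given here.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_static_server(directory, port=8000):
    """Build (but do not start) an HTTP server that serves files from `directory`."""
    handler = partial(SimpleHTTPRequestHandler, directory=str(directory))
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


def serve_spa(directory, port=8000):
    """Serve `directory` until interrupted with Ctrl+C."""
    server = make_static_server(directory, port)
    print(f"Serving {directory} at http://localhost:{server.server_port}")
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        pass
    finally:
        server.server_close()
```

Calling `serve_spa(spa_dir)` with the SPA directory's path does the same job as running `python -m http.server 8000 --directory <spa_dir>` from a terminal.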
Working with Multiple Pages
Apply OCR or layout analysis to all pages using the PDF object.
# Process all pages in the document
# Apply OCR to all pages (example using EasyOCR)
pdf.apply_ocr(engine='easyocr', languages=['en'])
print(f"Applied OCR to {len(pdf.pages)} pages.")
# Or apply layout analysis to all pages (example using Paddle)
# pdf.apply_layout(engine='paddle')
# print(f"Applied Layout Analysis to {len(pdf.pages)} pages.")
# Extract text from all pages (uses OCR results if available)
# page_separator is not a supported argument here (it is ignored with a warning), so it is omitted
all_text_content = pdf.extract_text()
print(f"\nCombined text from all pages:\n{all_text_content[:500]}...")
Applied OCR to 1 pages.

Combined text from all pages:
Jungle Health and Safety Inspection Service Violation Count: 7 These people could not be shown to the visitor for the odor of a fertilizer man would scare any ordinary some of which there were open vats near the level of the floor; their peculiar trouble was thattheyfell into the vats; and whentheywere fished out; there was never enough of them left to be worth exhibiting sometimestheywould be overlooked for days, till all but the bones of them had gone out Violations Statute Description Level 4...
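If you want a visible separator between pages, one option is to extract text page by page and join the pieces yourself. A short sketch, relying only on `pdf.pages` and `page.extract_text()` as used earlier in this tutorial; the helper name `join_page_texts` is hypothetical.

```python
def join_page_texts(pages, separator="\n\n---\n\n"):
    """Extract text from each page and join the results with a visible separator."""
    # Fall back to an empty string for pages that yield no text
    return separator.join((page.extract_text() or "") for page in pages)
```

With the document loaded above, `join_page_texts(pdf.pages)` returns one string with `---` between pages.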
Saving PDFs with Searchable Text
After applying OCR to a PDF, you can save a new version of the PDF where the recognized text is embedded as an invisible layer. This makes the text searchable and copyable in standard PDF viewers.
Use the save_searchable() method on the PDF object.
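A hedged sketch of the two-step flow, OCR first and then save. It assumes `save_searchable` takes an output path, as the `pdf.save_searchable("output.pdf")` call mentioned in the TODO list suggests. The wrapper name `ocr_and_save_searchable` is hypothetical.

```python
def ocr_and_save_searchable(pdf, output_path, **ocr_kwargs):
    """Apply OCR to every page, then write a copy with an invisible text layer.

    `ocr_kwargs` are forwarded to pdf.apply_ocr (for example engine=, languages=).
    """
    pdf.apply_ocr(**ocr_kwargs)
    # Assumption: save_searchable accepts the destination path as its argument
    pdf.save_searchable(output_path)
    return output_path
```

For example: `ocr_and_save_searchable(pdf, "output.pdf", engine="easyocr", languages=["en"])`.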
TODO
- Add guidance on installing only the OCR engines you need (e.g. pip install "natural-pdf[ai]" easyocr) instead of the heavy [all] extra.
- Show how to use detect_only=True to combine OCR detection with external recognition for higher accuracy (ties into fine-tuning tutorial).
- Include an example of saving a searchable PDF via pdf.save_searchable("output.pdf") after OCR.
- Mention resolution parameter trade-offs (speed vs accuracy) when calling apply_ocr.
- Provide a quick snippet demonstrating .viewer() for interactive visual QC of OCR results.