Categorizing documents
When working with a collection of PDFs, you might need to automatically categorize individual pages or entire PDFs.
#%pip install "natural-pdf[ai]"
from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/cia-doc.pdf")
pdf.pages.to_image(cols=6)
CropBox missing from /Page, defaulting to MediaBox
Vision classification
These pages are easily differentiable based on how they look, so we can most likely use a vision model to tell them apart.
pdf.classify_pages(['diagram', 'text', 'invoice', 'blank'], using='vision')
for page in pdf.pages:
    print(f"Page {page.number} is {page.category} - {page.category_confidence:0.3}")
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Device set to use mps:0
Page 1 is text - 0.633
Page 2 is text - 0.957
Page 3 is text - 0.921
Page 4 is diagram - 0.895
Page 5 is diagram - 0.891
Page 6 is invoice - 0.919
Page 7 is text - 0.834
Page 8 is invoice - 0.594
Page 9 is invoice - 0.971
Page 10 is invoice - 0.987
Page 11 is invoice - 0.994
Page 12 is invoice - 0.992
Page 13 is text - 0.822
Page 14 is text - 0.936
Page 15 is diagram - 0.913
Page 16 is text - 0.617
Page 17 is invoice - 0.868
How did it do?
(
    pdf.pages
    .filter(lambda page: page.category == 'diagram')
    .to_image(show_category=True)
)
Looks great! Note that I had to play around with the categories a bit before I got something that worked. The "blank" category never shows up, "invoice" did a lot better than "form," and so on. It's pretty quick and easy to sanity check, so you shouldn't have to suffer too much.
I can also save just those pages into a new PDF document.
(
    pdf.pages
    .filter(lambda page: page.category == 'diagram')
    .save_pdf("output.pdf", original=True)
)
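The remark above about "blank" never showing up can be checked quickly by tallying the assigned labels. This is a minimal sketch that uses the labels copied from the vision output above rather than a live PDF; with natural-pdf you would presumably build the same list from `page.category` while looping over `pdf.pages`. The names `vision_labels`, `candidates`, and `unused` are illustrative only.

```python
from collections import Counter

# Labels copied from the vision output above, in page order (pages 1-17).
vision_labels = [
    "text", "text", "text", "diagram", "diagram", "invoice", "text",
    "invoice", "invoice", "invoice", "invoice", "invoice", "text", "text",
    "diagram", "text", "invoice",
]

# The candidate categories passed to the classifier.
candidates = ["diagram", "text", "invoice", "blank"]

counts = Counter(vision_labels)

# Counter returns 0 for missing keys, so unused categories are easy to spot.
unused = [label for label in candidates if counts[label] == 0]

print(dict(counts))
print("never assigned:", unused)
```

A category that is never assigned is a hint that the label wording may need adjusting, or that the document simply has no such pages.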
Text classification (default)
By default, classification is done using text. It takes the text on the page and feeds it to the classifier along with the categories. Note that you might need to OCR your content first!
pdf.classify_pages(['diagram', 'text', 'invoice', 'blank'], using='text')
for page in pdf.pages:
    print(f"Page {page.number} is {page.category} - {page.category_confidence:0.3}")
Device set to use mps:0
Page 1 is text - 0.639
Page 2 is invoice - 0.498
Page 3 is invoice - 0.352
Page 4 is diagram - 0.801
Page 5 is diagram - 0.569
Page 6 is text - 0.6
Page 7 is diagram - 0.392
Page 8 is text - 0.61
Page 9 is invoice - 0.678
Page 10 is invoice - 0.472
Page 11 is text - 0.568
Page 12 is text - 0.554
Page 13 is text - 0.453
Page 14 is diagram - 0.423
Page 15 is diagram - 0.921
Page 16 is text - 0.5
Page 17 is text - 0.685
How does it compare to our vision option?
pdf.pages.filter(lambda page: page.category == 'diagram').to_image(show_category=True)
Some of these are clearly wrong, but more importantly, look at the confidence scores. Low scores are your best clue that something might be off (beyond manually checking things, of course).
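One way to act on those scores is to pull out the least confident pages for a manual look. The sketch below works on the numbers copied from the text-classification output above; the helper `low_confidence` and its `threshold` of 0.5 are assumptions made for illustration, not part of natural-pdf. With a live PDF, the tuples would presumably come from `page.number`, `page.category`, and `page.category_confidence`.

```python
# (page number, category, confidence) copied from the text-classification output above.
text_results = [
    (1, "text", 0.639), (2, "invoice", 0.498), (3, "invoice", 0.352),
    (4, "diagram", 0.801), (5, "diagram", 0.569), (6, "text", 0.6),
    (7, "diagram", 0.392), (8, "text", 0.61), (9, "invoice", 0.678),
    (10, "invoice", 0.472), (11, "text", 0.568), (12, "text", 0.554),
    (13, "text", 0.453), (14, "diagram", 0.423), (15, "diagram", 0.921),
    (16, "text", 0.5), (17, "text", 0.685),
]


def low_confidence(results, threshold=0.5):
    """Return results scoring strictly below threshold, least confident first.

    Hypothetical helper: the 0.5 default is an arbitrary starting point.
    """
    flagged = [row for row in results if row[2] < threshold]
    return sorted(flagged, key=lambda row: row[2])


for number, category, confidence in low_confidence(text_results):
    print(f"Check page {number}: {category} ({confidence})")
```

The right threshold depends on the model and the categories, so it is worth adjusting after eyeballing a few flagged pages.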
If you're processing text-heavy documents, you'll have much better luck with a text model than with a vision one.
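To compare the two approaches page by page, you can line up the labels from each run and list where they disagree. This sketch uses the labels copied from the two outputs above. It assumes you collect the labels after each run, since whether a second classification overwrites the first is not covered here. The variable names are illustrative.

```python
# Labels copied from the two outputs above, in page order (pages 1-17).
vision = [
    "text", "text", "text", "diagram", "diagram", "invoice", "text",
    "invoice", "invoice", "invoice", "invoice", "invoice", "text", "text",
    "diagram", "text", "invoice",
]
text = [
    "text", "invoice", "invoice", "diagram", "diagram", "text", "diagram",
    "text", "invoice", "invoice", "text", "text", "text", "diagram",
    "diagram", "text", "text",
]

# Collect (page number, vision label, text label) wherever the two runs differ.
disagreements = [
    (number, v, t)
    for number, (v, t) in enumerate(zip(vision, text), start=1)
    if v != t
]

agreeing = len(vision) - len(disagreements)
print(f"{agreeing} of {len(vision)} pages agree")
for number, v, t in disagreements:
    print(f"Page {number}: vision says {v}, text says {t}")
```

Pages where the two methods disagree are good candidates for a manual check, whichever method you end up trusting.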
PDF classification
If you want to classify entire PDFs, the process is similar. The only gotcha is that you can't use using="vision" with multi-page PDFs (yet?).
import natural_pdf
pdf_paths = [
"https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf",
"https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf"
]
# Import your PDFs
pdfs = natural_pdf.PDFCollection(pdf_paths)
# Run your classification
pdfs.classify_all(['school', 'business'], using='text')
CropBox missing from /Page, defaulting to MediaBox
<PDFCollection(count=2)>
What's the first PDF?
print(f"{pdfs[0].category} - confidence of {pdfs[0].category_confidence:0.3}")
# Look at the first page
pdfs[0].pages[0].to_image(width=500)
business - confidence of 0.837
How about the second?
print(f"{pdfs[1].category} - confidence of {pdfs[1].category_confidence:0.3}")
# Look at the first page
pdfs[1].pages[0].to_image(width=500)
school - confidence of 0.569
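Once every PDF has a category, a natural next step is to sort the files into groups and set aside the uncertain ones. The sketch below uses the two results printed above and assumes the collection keeps the order of `pdf_paths`, so the first result belongs to 01-practice.pdf. The helper `group_by_category`, the `review_below` cutoff of 0.6, and the "needs review" bucket are hypothetical choices for illustration.

```python
from collections import defaultdict

# (file name, category, confidence) taken from the outputs above.
# Assumes the collection preserves the order of pdf_paths.
results = [
    ("01-practice.pdf", "business", 0.837),
    ("Atlanta_Public_Schools_GA_sample.pdf", "school", 0.569),
]


def group_by_category(results, review_below=0.6):
    """Group file names by category, diverting low-confidence files.

    Hypothetical helper: files scoring below review_below go into a
    separate "needs review" bucket instead of their predicted category.
    """
    groups = defaultdict(list)
    for name, category, confidence in results:
        key = category if confidence >= review_below else "needs review"
        groups[key].append(name)
    return dict(groups)


groups = group_by_category(results)
for category, names in groups.items():
    print(category, names)
```

With the cutoff at 0.6, the school PDF (0.569) lands in the review bucket, which matches the advice above to treat low scores with suspicion.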
TODO
- Document advanced parameters for classification helpers (min_confidence, multi_label, analysis_key) so users can fine-tune behaviour or store multiple result sets.
- Add an example that passes an explicit Hugging Face model ID (e.g. model="openai/clip-vit-base-patch16") for reproducibility.
- Note that vision classification only works for single-page PDFs or per-page classification, not whole multi-page PDFs.
- Remind readers to install the AI stack if they skipped the earlier magic: pip install "natural-pdf[ai]".
- Suggest using pdf.pages.to_image(show_category=True) to visually QC an entire document after classification.