Categorizing documents¶
When working with a collection of PDFs, you might need to automatically categorize individual pages or entire documents.
#%pip install "natural-pdf[ai]"
from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/cia-doc.pdf")
pdf.pages.to_image(cols=6)
Vision classification¶
These pages look quite different from one another, so a vision model can most likely tell them apart.
pdf.classify_pages(['diagram', 'text', 'invoice', 'blank'], using='vision')
for page in pdf.pages:
print(f"Page {page.number} is {page.category} - {page.category_confidence:0.3}")
Page 1 is text - 0.633
Page 2 is text - 0.957
Page 3 is text - 0.921
Page 4 is diagram - 0.895
Page 5 is diagram - 0.891
Page 6 is invoice - 0.919
Page 7 is text - 0.834
Page 8 is invoice - 0.594
Page 9 is invoice - 0.971
Page 10 is invoice - 0.987
Page 11 is invoice - 0.994
Page 12 is invoice - 0.992
Page 13 is text - 0.822
Page 14 is text - 0.936
Page 15 is diagram - 0.913
Page 16 is text - 0.617
Page 17 is invoice - 0.868
How did it do?
(
pdf.pages
.filter(lambda page: page.category == 'diagram')
.to_image(show_category=True)
)
Looks great! Note that I had to play around with the categories a bit before I got something that worked: "blank" never shows up, "invoice" did a lot better than "form," and so on. It's quick and easy to sanity-check, though, so you shouldn't have to suffer too much.
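One quick sanity check is to count how many pages landed in each category; a label that never appears (like "blank" here) is a sign it isn't pulling its weight. This just uses the standard library on top of the page.category attribute we already have.

from collections import Counter

# How many pages ended up in each category?
Counter(page.category for page in pdf.pages)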
I can also save just those pages into a new PDF document.
(
pdf.pages
.filter(lambda page: page.category == 'diagram')
.save_pdf("output.pdf", original=True)
)
Text classification (default)¶
By default the classification is done using text: it takes the text on the page and feeds it to the classifier along with the categories. Note that you might need to OCR your content first!
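If your pages are scans without a text layer, run OCR before classifying. A minimal sketch, assuming the default OCR engine from the OCR tutorials (check those pages for engine options and exact parameters):

# Hypothetical sketch: add a text layer to scanned pages
# before running text-based classification
pdf.apply_ocr()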
pdf.classify_pages(['diagram', 'text', 'invoice', 'blank'], using='text')
for page in pdf.pages:
print(f"Page {page.number} is {page.category} - {page.category_confidence:0.3}")
Page 1 is text - 0.629
Page 2 is invoice - 0.468
Page 3 is invoice - 0.418
Page 4 is diagram - 0.832
Page 5 is diagram - 0.669
Page 6 is text - 0.6
Page 7 is diagram - 0.463
Page 8 is text - 0.61
Page 9 is invoice - 0.647
Page 10 is invoice - 0.462
Page 11 is text - 0.462
Page 12 is text - 0.546
Page 13 is text - 0.451
Page 14 is text - 0.388
Page 15 is diagram - 0.938
Page 16 is text - 0.603
Page 17 is text - 0.712
How does it compare to our vision option?
pdf.pages.filter(lambda page: page.category == 'diagram').to_image(show_category=True)
Yes, you can see that it's wrong, but more importantly, look at the confidence scores. Low scores are your best clue that something might not be right (beyond manually checking things, of course). If your documents are text-heavy, you'll generally have much better luck with a text model than with a vision one.
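One way to put that to work is to pull out the pages the model was least sure about and eyeball them. The 0.5 cutoff below is just a hypothetical threshold you'd tune for your own documents.

# Collect low-confidence pages for manual review
uncertain = pdf.pages.filter(lambda page: page.category_confidence < 0.5)
uncertain.to_image(show_category=True)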
PDF classification¶
If you want to classify entire PDFs, the process is similar. The only gotcha is that you can't use using="vision" with multi-page PDFs (yet?).
import natural_pdf
pdf_paths = [
"https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf",
"https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf"
]
# Import your PDFs
pdfs = natural_pdf.PDFCollection(pdf_paths)
# Run your classification
pdfs.classify_all(['school', 'business'], using='text')
<PDFCollection(count=2)>
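To see what each PDF was labeled, you can loop over the collection. The sketch below assumes each PDF picks up the same category and category_confidence attributes that pages do after classification, and that it exposes a path attribute; double-check against the API reference.

# Assumption: classified PDFs expose .category / .category_confidence like pages do
for doc in pdfs:
    print(f"{doc.path} is {doc.category} - {doc.category_confidence:0.3}")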