Categorizing documents¶
When working with a collection of PDFs, you might need to automatically categorize individual pages or entire documents.
#%pip install "natural-pdf[ai]"
from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/cia-doc.pdf")
pdf.pages.to_image(cols=6)
Vision classification¶
These pages look quite different from one another, so a vision model can most likely tell them apart.
pdf.classify_pages(['diagram', 'text', 'invoice', 'blank'], using='vision')
for page in pdf.pages:
print(f"Page {page.number} is {page.category} - {page.category_confidence:0.3}")
Page 1 is text - 0.633
Page 2 is text - 0.957
Page 3 is text - 0.921
Page 4 is diagram - 0.895
Page 5 is diagram - 0.891
Page 6 is invoice - 0.919
Page 7 is text - 0.834
Page 8 is invoice - 0.594
Page 9 is invoice - 0.971
Page 10 is invoice - 0.987
Page 11 is invoice - 0.994
Page 12 is invoice - 0.992
Page 13 is text - 0.822
Page 14 is text - 0.936
Page 15 is diagram - 0.913
Page 16 is text - 0.617
Page 17 is invoice - 0.868
How did it do?
(
pdf.pages
.filter(lambda page: page.category == 'diagram')
.to_image(show_category=True)
)
Looks great! Note that I had to play around with the categories a bit before I got something that worked: "blank" never shows up, "invoice" did a lot better than "form," and so on. It's quick and easy to sanity-check, though, so you shouldn't have to suffer too much.
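One quick sanity check is to count how many pages landed in each category; a label that never appears (like "blank" here) is a sign it isn't pulling its weight. This just uses the standard library on top of the page.category attribute we already have.

from collections import Counter

# How many pages ended up in each category?
Counter(page.category for page in pdf.pages)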
I can also save just those pages into a new PDF document.
(
pdf.pages
.filter(lambda page: page.category == 'diagram')
.save_pdf("output.pdf", original=True)
)
Text classification (default)¶
By default the classification is done using text: it takes the text on the page and feeds it to the classifier along with the categories. Note that you might need to OCR your content first!
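If your pages are scans without a text layer, run OCR before classifying. A minimal sketch, assuming the default OCR engine from the OCR tutorials (check those pages for engine options and exact parameters):

# Hypothetical sketch: add a text layer to scanned pages
# before running text-based classification
pdf.apply_ocr()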
pdf.classify_pages(['diagram', 'text', 'invoice', 'blank'], using='text')
for page in pdf.pages:
print(f"Page {page.number} is {page.category} - {page.category_confidence:0.3}")
Page 1 is text - 0.629
Page 2 is invoice - 0.468
Page 3 is invoice - 0.418
Page 4 is diagram - 0.832
Page 5 is diagram - 0.669
Page 6 is text - 0.6
Page 7 is diagram - 0.463
Page 8 is text - 0.61
Page 9 is invoice - 0.647
Page 10 is invoice - 0.462
Page 11 is text - 0.462
Page 12 is text - 0.546
Page 13 is text - 0.451
Page 14 is text - 0.388
Page 15 is diagram - 0.938
Page 16 is text - 0.603
Page 17 is text - 0.712
How does it compare to our vision option?
pdf.pages.filter(lambda page: page.category == 'diagram').to_image(show_category=True)
Yes, you can see that it's wrong, but more importantly, look at the confidence scores. Low scores are your best clue that something might not be right (beyond manually checking things, of course). If your documents are text-heavy, you'll generally have much better luck with a text model than with a vision one.
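One way to put that to work is to pull out the pages the model was least sure about and eyeball them. The 0.5 cutoff below is just a hypothetical threshold you'd tune for your own documents.

# Collect low-confidence pages for manual review
uncertain = pdf.pages.filter(lambda page: page.category_confidence < 0.5)
uncertain.to_image(show_category=True)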
PDF classification¶
If you want to classify entire PDFs, the process is similar. The only gotcha is that you can't use using="vision" with multi-page PDFs (yet?).
import natural_pdf
pdf_paths = [
"https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf",
"https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf"
]
# Import your PDFs
pdfs = natural_pdf.PDFCollection(pdf_paths)
# Run your classification
pdfs.classify_all(['school', 'business'], using='text')
<PDFCollection(count=2)>
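To see what each PDF was labeled, you can loop over the collection. The sketch below assumes each PDF picks up the same category and category_confidence attributes that pages do after classification, and that it exposes a path attribute; double-check against the API reference.

# Assumption: classified PDFs expose .category / .category_confidence like pages do
for doc in pdfs:
    print(f"{doc.path} is {doc.category} - {doc.category_confidence:0.3}")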