Semantic Search Across Multiple Documents

When working with a collection of PDFs, you might need to find information relevant to a specific query across all documents, not just within a single one. This tutorial demonstrates how to perform semantic search over a PDFCollection.

Semantic search works with the default install, but for better performance with LanceDB I recommend installing the search extension.

In [1]:

#%pip install "natural-pdf[search]"

In [2]:

import natural_pdf

# Define the paths to your PDF files
pdf_paths = [
    "https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf",
    "https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf"
]

# Or use glob patterns
# collection = natural_pdf.PDFCollection("pdfs/*.pdf")

# Create a PDFCollection
collection = natural_pdf.PDFCollection(pdf_paths)
print(f"Created collection with {len(collection.pdfs)} PDFs.")

CropBox missing from /Page, defaulting to MediaBox

Created collection with 2 PDFs.
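
If your PDFs sit in a local folder rather than at URLs, you can also build the list of paths yourself before creating the collection. The following is a minimal sketch using only the standard library; `list_pdf_paths` is a hypothetical helper name, not part of natural-pdf, and the `"pdfs"` folder in the usage comment is an assumption.

```python
from pathlib import Path


def list_pdf_paths(folder, recursive=False):
    """Return sorted paths (as strings) of the PDF files inside `folder`.

    `list_pdf_paths` is a hypothetical helper, not part of natural-pdf.
    """
    root = Path(folder)
    candidates = root.rglob("*") if recursive else root.iterdir()
    # Compare suffixes case-insensitively so "REPORT.PDF" is picked up too.
    return sorted(
        str(p) for p in candidates if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Example usage (assumes a local "pdfs" folder exists):
# pdf_paths = list_pdf_paths("pdfs")
# collection = natural_pdf.PDFCollection(pdf_paths)
```

Building the list explicitly makes it easy to inspect or filter the files before they are loaded.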

Initializing the Search Index

Before performing a search, you need to initialize the search capabilities for the collection. This involves processing the documents and building an index.

In [3]:

# Initialize search.
# index=True builds the searchable database immediately.
# persist=True (not used here) would save the index so it doesn't have to be rebuilt every time.
collection.init_search(index=True)
print("Search index initialized.")

Search index initialized.

Performing a Semantic Search

Once the index is ready, you can use the find_relevant() method to search for content semantically related to your query.

In [4]:

# Perform a search query
query = "american president"
results = collection.find_relevant(query)

print(f"Found {len(results)} results for '{query}':")

Found 6 results for 'american president':
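
To get a feel for what a vector-based search does, here is a toy sketch that ranks text chunks by cosine similarity to a query. It is not how natural-pdf or LanceDB work internally: a real semantic index uses a learned embedding model, whereas the `embed` function below is a stand-in that only counts words, so it matches on shared vocabulary rather than meaning. All function names and sample chunks are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_chunks(query, chunks, top_k=3):
    """Return up to `top_k` (score, chunk) pairs, best match first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up chunks, standing in for text extracted from PDF pages.
chunks = [
    "Library weeding log for a public school district",
    "Health and safety inspection of a meatpacking site",
    "Biography of an american president",
]
for score, chunk in rank_chunks("american president", chunks, top_k=2):
    print(f"{score:.3f}  {chunk}")
```

Swapping `embed` for a real embedding model is what would turn this from vocabulary matching into semantic matching; the ranking step stays the same.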

Understanding Search Results

The find_relevant() method returns a list of dictionaries, each representing a relevant text chunk found in one of the PDFs. Each result includes:

pdf_path: The path to the PDF document where the result was found.
page_number: The page number within the PDF.
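score: A relevance score (higher means more relevant). Scores can be negative, so compare them with each other rather than reading them as absolute values.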
content_snippet: A snippet of the text chunk that matched the query.

In the future, it should be possible to jump from a result directly to the matching page in the PDF.

In [5]:

# Process and display the results
if results:
    for i, result in enumerate(results, start=1):
        print(f"  {i}. PDF: {result['pdf_path']}")
        print(f"     Page: {result['page_number']} (Score: {result['score']:.4f})")
        # Display a snippet of the content
        snippet = result.get('content_snippet', '')
        print(f"     Snippet: {snippet}...")
else:
    print("  No relevant results found.")

# You can access the full content if needed via the result object, 
# though 'content_snippet' is usually sufficient for display.

  1. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp3zc9ulp0.pdf
     Page: 2 (Score: -0.8584)
     Snippet: Library Weeding Log Atlanta Public Schools
From: 8/1/2017 To: 6/30/2023
6/6/2023 - Copies Removed: 130
The Anasazi (Removed: 1)
Author: Petersen, David. ISBN: 0-516-01121-9 (trade) Published: 1991
Sit...
  2. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp3zc9ulp0.pdf
     Page: 5 (Score: -0.8661)
     Snippet: Library Weeding Log Atlanta Public Schools
From: 8/1/2017 To: 6/30/2023
6/6/2023 - Copies Removed: 130
Centennial Place 33170000562167 $13.10 11/5/1999 33554-43170
Academy (Charter)
Was Available -- W...
  3. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmpwxchhqlq.pdf
     Page: 1 (Score: -1.0080)
     Snippet: Jungle Health and Safety Inspection Service
INS-UP70N51NCL41R
Site: Durham’s Meatpacking Chicago, Ill.
Date: February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer men...
  4. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp3zc9ulp0.pdf
     Page: 4 (Score: -1.0489)
     Snippet: Library Weeding Log Atlanta Public Schools
From: 8/1/2017 To: 6/30/2023
6/6/2023 - Copies Removed: 130
Children of the Philippines (Removed: 1)
Author: Kinkade, Sheila, 1962- ISBN: 0-87614-993-X Publi...
  5. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp3zc9ulp0.pdf
     Page: 3 (Score: -1.0890)
     Snippet: Library Weeding Log Atlanta Public Schools
From: 8/1/2017 To: 6/30/2023
6/6/2023 - Copies Removed: 130
Centennial Place 33170000507600 $19.45 2/21/2000 33554-43170
Academy (Charter)
Was Available -- W...
  6. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp3zc9ulp0.pdf
     Page: 1 (Score: -1.0946)
     Snippet: Library Weeding Log Atlanta Public Schools
From: 8/1/2017 To: 6/30/2023
6/12/2023 - Copies Removed: 2
Tristan Strong punches a hole in the sky (Removed: 1)
Author: Mbalia, Kwame. ISBN: 978-1-36803993-...
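
Because each result is a plain dictionary, ordinary Python is enough to post-process the list. The sketch below keeps only the best-scoring hit per PDF, with an optional score cutoff. `best_hit_per_pdf` is a hypothetical helper, and the sample data is made up to mirror the shape of the results above; it assumes only the four keys described in this section.

```python
def best_hit_per_pdf(results, min_score=None):
    """Keep the highest-scoring result for each 'pdf_path'.

    Assumes each result is a dict with the keys 'pdf_path',
    'page_number', 'score', and 'content_snippet'.
    """
    best = {}
    for result in results:
        # Skip hits below the optional cutoff.
        if min_score is not None and result["score"] < min_score:
            continue
        path = result["pdf_path"]
        if path not in best or result["score"] > best[path]["score"]:
            best[path] = result
    # Highest score first, matching "higher means more relevant".
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)


# Made-up sample data in the same shape as the results above.
sample_results = [
    {"pdf_path": "schools.pdf", "page_number": 2, "score": -0.86,
     "content_snippet": "Library Weeding Log"},
    {"pdf_path": "inspection.pdf", "page_number": 1, "score": -1.01,
     "content_snippet": "Jungle Health and Safety"},
    {"pdf_path": "schools.pdf", "page_number": 4, "score": -1.05,
     "content_snippet": "Library Weeding Log"},
]

for hit in best_hit_per_pdf(sample_results):
    print(hit["pdf_path"], hit["page_number"], hit["score"])
```

With real output you would pass `results` from `find_relevant()` instead of `sample_results`. A sensible `min_score` depends on the scoring scale, so inspect a few queries before choosing one.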

Semantic search allows you to efficiently query large sets of documents to find the most relevant information without needing exact keyword matches, leveraging the meaning and context of your query.