Semantic Search Across Multiple Documents
When working with a collection of PDFs, you might need to find information relevant to a specific query across all documents, not just within a single one. This tutorial demonstrates how to perform semantic search over a PDFCollection.
Semantic search works with the default install, but for better performance via LanceDB I recommend installing the search extra.
#%pip install "natural-pdf[search]"
import natural_pdf
# Define the paths to your PDF files
pdf_paths = [
    "https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf",
    "https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf"
]
# Or use glob patterns
# collection = natural_pdf.PDFCollection("pdfs/*.pdf")
# Create a PDFCollection
collection = natural_pdf.PDFCollection(pdf_paths)
print(f"Created collection with {len(collection.pdfs)} PDFs.")
Created collection with 2 PDFs.
Initializing the Search Index
Before performing a search, you need to initialize the search capabilities for the collection. This involves processing the documents and building an index.
# Initialize search.
# index=True builds the searchable index immediately.
# persist=True (not used here) saves the index so you don't need to rebuild it every time.
collection.init_search(index=True)
print("Search index initialized.")
Search index initialized.
Performing a Semantic Search
Once the index is ready, you can use the find_relevant() method to search for content semantically related to your query.
# Perform a search query
query = "american president"
results = collection.find_relevant(query)
print(f"Found {len(results)} results for '{query}':")
Found 6 results for 'american president':
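To build some intuition for what "semantically related" means, here is a toy sketch of ranking text by vector similarity. It is not natural-pdf's implementation: the three-dimensional vectors are hand-written stand-ins for what an embedding model would produce, and the names `toy_chunks`, `toy_query_vector`, and `cosine_similarity` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hand-made "embeddings"; a real model outputs hundreds of dimensions.
toy_chunks = {
    "The president signed the bill": [0.9, 0.1, 0.0],
    "Library books removed in June": [0.1, 0.9, 0.1],
    "Meatpacking inspection report": [0.2, 0.2, 0.9],
}
# Stand-in for the embedded query "american president".
toy_query_vector = [0.8, 0.2, 0.1]

# Rank chunks from most to least similar to the query vector.
ranked = sorted(
    toy_chunks,
    key=lambda text: cosine_similarity(toy_query_vector, toy_chunks[text]),
    reverse=True,
)

for text in ranked:
    score = cosine_similarity(toy_query_vector, toy_chunks[text])
    print(f"{score:.3f}  {text}")
```

The chunk about the president ranks first even though it shares no exact wording with the query; the match comes from the vectors pointing in a similar direction.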
Understanding Search Results
The find_relevant() method returns a list of dictionaries, each representing a relevant text chunk found in one of the PDFs. Each result includes:
- pdf_path: The path to the PDF document where the result was found.
- page_number: The page number within the PDF.
- score: A relevance score (higher means more relevant; scores can be negative, so -0.86 ranks above -1.09).
- content_snippet: A snippet of the text chunk that matched the query.
Viewing the matched page directly from a result is not supported yet; it is planned for a future release.
# Process and display the results
if results:
    for i, result in enumerate(results):
        print(f"  {i+1}. PDF: {result['pdf_path']}")
        print(f"     Page: {result['page_number']} (Score: {result['score']:.4f})")
        # Display a snippet of the content
        snippet = result.get('content_snippet', '')
        print(f"     Snippet: {snippet}...") 
else:
    print("  No relevant results found.")
# You can access the full content if needed via the result object, 
# though 'content_snippet' is usually sufficient for display.
  1. PDF: https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf
     Page: 2 (Score: -0.8584)
     Snippet: Library Weeding Log Atlanta Public Schools
From: 8/1/2017 To: 6/30/2023
6/6/2023 - Copies Removed: 130
The Anasazi (Removed: 1)
Author: Petersen, David. ISBN: 0-516-01121-9 (trade) Published: 1991
Sit...
  2. PDF: https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf
     Page: 5 (Score: -0.8661)
     Snippet: Library Weeding Log Atlanta Public Schools
From: 8/1/2017 To: 6/30/2023
6/6/2023 - Copies Removed: 130
Centennial Place 33170000562167 $13.10 11/5/1999 33554-43170
Academy (Charter)
Was Available -- W...
  3. PDF: https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf
     Page: 1 (Score: -1.0080)
     Snippet: Jungle Health and Safety Inspection Service
INS-UP70N51NCL41R
Site: Durham’s Meatpacking Chicago, Ill.
Date: February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer men...
  4. PDF: https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf
     Page: 4 (Score: -1.0489)
     Snippet: Library Weeding Log Atlanta Public Schools
From: 8/1/2017 To: 6/30/2023
6/6/2023 - Copies Removed: 130
Children of the Philippines (Removed: 1)
Author: Kinkade, Sheila, 1962- ISBN: 0-87614-993-X Publi...
  5. PDF: https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf
     Page: 3 (Score: -1.0890)
     Snippet: Library Weeding Log Atlanta Public Schools
From: 8/1/2017 To: 6/30/2023
6/6/2023 - Copies Removed: 130
Centennial Place 33170000507600 $19.45 2/21/2000 33554-43170
Academy (Charter)
Was Available -- W...
  6. PDF: https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf
     Page: 1 (Score: -1.0946)
     Snippet: Library Weeding Log Atlanta Public Schools
From: 8/1/2017 To: 6/30/2023
6/12/2023 - Copies Removed: 2
Tristan Strong punches a hole in the sky (Removed: 1)
Author: Mbalia, Kwame. ISBN: 978-1-36803993-...
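Because each result is a plain dictionary, ordinary Python is enough to post-process them. The sketch below keeps only the best-scoring hit per PDF. It assumes results have the pdf_path, page_number, score, and content_snippet keys described above; `sample_results` is hypothetical data with shortened paths and scores copied from the output, and `best_hit_per_pdf` is an illustrative helper, not part of natural-pdf.

```python
# Hypothetical sample shaped like the documented find_relevant() output.
sample_results = [
    {"pdf_path": "Atlanta_Public_Schools_GA_sample.pdf", "page_number": 2,
     "score": -0.8584, "content_snippet": "Library Weeding Log Atlanta Public Schools"},
    {"pdf_path": "Atlanta_Public_Schools_GA_sample.pdf", "page_number": 5,
     "score": -0.8661, "content_snippet": "Library Weeding Log Atlanta Public Schools"},
    {"pdf_path": "01-practice.pdf", "page_number": 1,
     "score": -1.0080, "content_snippet": "Jungle Health and Safety Inspection Service"},
]

def best_hit_per_pdf(results):
    """Return the highest-scoring result for each PDF, best first."""
    best = {}
    for result in results:
        path = result["pdf_path"]
        # Higher score means more relevant, even when scores are negative.
        if path not in best or result["score"] > best[path]["score"]:
            best[path] = result
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)

for hit in best_hit_per_pdf(sample_results):
    print(f"{hit['pdf_path']} (page {hit['page_number']}, score {hit['score']:.4f})")
```

With real output, you would pass `results` from `find_relevant()` instead of `sample_results`.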
Semantic search allows you to efficiently query large sets of documents to find the most relevant information without needing exact keyword matches, leveraging the meaning and context of your query.
TODO
- Add example for using persist=True and collection_name in init_search to create a persistent on-disk index.
- Show how to override the embedding model (e.g. embedding_model="all-MiniLM-L12-v2").
- Mention top_k and filtering options available through SearchOptions when calling find_relevant.
- Provide a short snippet on visualising matched pages/elements once highlighting support lands (future feature).
- Clarify that installing the AI stack (natural-pdf[ai]) also pulls in sentence-transformers, which is needed for the in-memory NumPy fallback.