Semantic Search Across Multiple Documents
When working with a collection of PDFs, you might need to find information relevant to a specific query across all documents, not just within a single one. This tutorial demonstrates how to perform semantic search over a PDFCollection.
#%pip install "natural-pdf[all]"
#%pip install "natural-pdf[search]" # Ensure search dependencies are installed
import natural_pdf
# Define the paths to your PDF files
pdf_paths = [
"https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf",
"https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf"
]
# Or use glob patterns
# collection = natural_pdf.PDFCollection("pdfs/*.pdf")
# Create a PDFCollection
collection = natural_pdf.PDFCollection(pdf_paths)
print(f"Created collection with {len(collection.pdfs)} PDFs.")
Created collection with 2 PDFs.
Initializing the Search Index
Before performing a search, you need to initialize the search capabilities for the collection. This involves processing the documents and building an index.
# Initialize search.
# index=True builds the searchable database immediately
# persist=True (not passed here) would save it so you don't need to rebuild it every time
collection.init_search(index=True)
print("Search index initialized.")
Search index initialized.
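To make the idea of "processing the documents and building an index" more concrete, here is a toy, standard-library-only sketch. It is not how natural-pdf builds its index: the names `embed`, `build_index`, and `search`, along with the page texts, are hypothetical, and word counts stand in for the learned embeddings a real semantic index would use.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(pages):
    """Vectorize every page once. pages: (pdf_path, page_number, text) tuples."""
    return [(path, number, embed(text)) for path, number, text in pages]


def search(index, query, top_k=3):
    """Rank indexed pages by similarity to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, vec), path, number) for path, number, vec in index]
    return sorted(scored, key=lambda item: item[0], reverse=True)[:top_k]


# Hypothetical page texts, made up for illustration
toy_pages = [
    ("a.pdf", 1, "inspection report for a meatpacking site"),
    ("b.pdf", 1, "library weeding log of removed books"),
]
toy_index = build_index(toy_pages)
toy_hits = search(toy_index, "books removed from the library")
```

The expensive step (vectorizing each page) happens once in `build_index`, which is why building the index up front makes later queries cheap. Because this toy only matches shared words, it misses related wording; swapping `embed` for a learned embedding model is what would make the search semantic.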
Performing a Semantic Search
Once the index is ready, you can use the find_relevant() method to search for content semantically related to your query.
# Perform a search query
query = "american president"
results = collection.find_relevant(query)
print(f"Found {len(results)} results for '{query}':")
Found 6 results for 'american president':
Understanding Search Results
The find_relevant() method returns a list of dictionaries, each representing a relevant text chunk found in one of the PDFs. Each result includes:
pdf_path: The path to the PDF document where the result was found.
page_number: The page number within the PDF.
score: A relevance score (higher means more relevant).
content_snippet: A snippet of the text chunk that matched the query.
In the future, it should be possible to open the matching PDF page directly from a result.
# Process and display the results
if results:
    for i, result in enumerate(results):
        print(f" {i+1}. PDF: {result['pdf_path']}")
        print(f" Page: {result['page_number']} (Score: {result['score']:.4f})")
        # Display a snippet of the content
        snippet = result.get('content_snippet', '')
        print(f" Snippet: {snippet}...")
else:
    print(" No relevant results found.")
# You can access the full content if needed via the result object,
# though 'content_snippet' is usually sufficient for display.
1. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp9bnfjc3d.pdf
Page: 2 (Score: 0.0708)
Snippet: Library Weeding Log Atlanta Public Schools From: 8/1/2017 To: 6/30/2023 6/6/2023 - Copies Removed: 130 The Anasazi (Removed: 1) Author: Petersen, David. ISBN: 0-516-01121-9 (trade) Published: 1991 Sit...
2. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp9bnfjc3d.pdf
Page: 5 (Score: 0.0669)
Snippet: Library Weeding Log Atlanta Public Schools From: 8/1/2017 To: 6/30/2023 6/6/2023 - Copies Removed: 130 Centennial Place 33170000562167 $13.10 11/5/1999 33554-43170 Academy (Charter) Was Available -- W...
3. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmpopf4xujx.pdf
Page: 1 (Score: -0.0040)
Snippet: Jungle Health and Safety Inspection Service INS-UP70N51NCL41R Site: Durham’s Meatpacking Chicago, Ill. Date: February 3, 1905 Violation Count: 7 Summary: Worst of any, however, were the fertilizer men...
4. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp9bnfjc3d.pdf
Page: 4 (Score: -0.0245)
Snippet: Library Weeding Log Atlanta Public Schools From: 8/1/2017 To: 6/30/2023 6/6/2023 - Copies Removed: 130 Children of the Philippines (Removed: 1) Author: Kinkade, Sheila, 1962- ISBN: 0-87614-993-X Publi...
5. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp9bnfjc3d.pdf
Page: 3 (Score: -0.0445)
Snippet: Library Weeding Log Atlanta Public Schools From: 8/1/2017 To: 6/30/2023 6/6/2023 - Copies Removed: 130 Centennial Place 33170000507600 $19.45 2/21/2000 33554-43170 Academy (Charter) Was Available -- W...
6. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp9bnfjc3d.pdf
Page: 1 (Score: -0.0473)
Snippet: Library Weeding Log Atlanta Public Schools From: 8/1/2017 To: 6/30/2023 6/12/2023 - Copies Removed: 2 Tristan Strong punches a hole in the sky (Removed: 1) Author: Mbalia, Kwame. ISBN: 978-1-36803993-...
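The sample output above repeats one PDF several times and includes scores at or below zero. A small post-processing sketch can keep only the strongest hit per document. It assumes results are dictionaries with the keys described earlier (pdf_path, page_number, score, content_snippet); the helper `best_page_per_pdf`, its `min_score` cutoff, and the sample data are hypothetical and not part of natural-pdf.

```python
def best_page_per_pdf(results, min_score=0.0):
    """Drop results below min_score and keep the highest-scoring hit per PDF."""
    best = {}
    for result in results:
        if result["score"] < min_score:
            continue
        path = result["pdf_path"]
        if path not in best or result["score"] > best[path]["score"]:
            best[path] = result
    # Return the surviving hits, most relevant first
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)


# Hypothetical results shaped like the dictionaries described earlier
sample_results = [
    {"pdf_path": "a.pdf", "page_number": 2, "score": 0.0708, "content_snippet": "first chunk"},
    {"pdf_path": "a.pdf", "page_number": 5, "score": 0.0669, "content_snippet": "second chunk"},
    {"pdf_path": "b.pdf", "page_number": 1, "score": -0.0040, "content_snippet": "third chunk"},
]
top_hits = best_page_per_pdf(sample_results, min_score=0.0)
all_hits = best_page_per_pdf(sample_results, min_score=-1.0)
```

With real data you would pass the list returned by find_relevant() in place of `sample_results`. A sensible cutoff depends on the embedding model and the query, so treat the default of 0.0 as a starting point rather than a recommendation.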
Semantic search allows you to efficiently query large sets of documents to find the most relevant information without needing exact keyword matches, leveraging the meaning and context of your query.