Semantic Search Across Multiple Documents
When working with a collection of PDFs, you might need to find information relevant to a specific query across all documents, not just within a single one. This tutorial demonstrates how to perform semantic search over a PDFCollection.
You can do semantic search with the default install, but for increased performance with LanceDB I recommend installing the search extension.
#%pip install "natural-pdf[search]"
import natural_pdf
# Define the paths to your PDF files
pdf_paths = [
"https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf",
"https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf"
]
# Or use glob patterns
# collection = natural_pdf.PDFCollection("pdfs/*.pdf")
# Create a PDFCollection
collection = natural_pdf.PDFCollection(pdf_paths)
print(f"Created collection with {len(collection.pdfs)} PDFs.")
Created collection with 2 PDFs.
Initializing the Search Index
Before performing a search, you need to initialize the search capabilities for the collection. This involves processing the documents and building an index.
# Initialize search.
# index=True builds the searchable database immediately.
# persist=True (not used here) saves it to disk so you don't need to rebuild it every time.
collection.init_search(index=True)
print("Search index initialized.")
Search index initialized.
Performing a Semantic Search
Once the index is ready, you can use the find_relevant() method to search for content semantically related to your query.
# Perform a search query
query = "american president"
results = collection.find_relevant(query)
print(f"Found {len(results)} results for '{query}':")
Found 6 results for 'american president':
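To make the two steps (build an index, then query it) more concrete, here is a toy sketch in plain Python. It is not how natural-pdf implements search: the embed() function is a bag-of-words stand-in for a real embedding model, the page texts are made up for illustration, and all function names are hypothetical. A real semantic index replaces embed() with a learned sentence-embedding model, which is what lets related meanings match even when no words are shared.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(pages):
    """Indexing step: embed every (pdf_path, page_number, text) chunk once."""
    return [
        {"pdf_path": path, "page_number": num, "content": text, "vector": embed(text)}
        for path, num, text in pages
    ]


def search(index, query, top_k=3):
    """Query step: embed the query and rank chunks by similarity, best first."""
    query_vector = embed(query)
    scored = [
        {
            "pdf_path": entry["pdf_path"],
            "page_number": entry["page_number"],
            "score": cosine(query_vector, entry["vector"]),
            "content_snippet": entry["content"][:200],
        }
        for entry in index
    ]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:top_k]


# Hypothetical page texts, loosely based on the sample PDFs used in this tutorial.
pages = [
    ("01-practice.pdf", 1, "Jungle Health and Safety Inspection Service. Violation Count: 7"),
    ("Atlanta_Public_Schools_GA_sample.pdf", 1, "Library Weeding Log Atlanta Public Schools. Copies Removed: 2"),
    ("Atlanta_Public_Schools_GA_sample.pdf", 2, "Library Weeding Log Atlanta Public Schools. Copies Removed: 130"),
]

toy_index = build_index(pages)
toy_results = search(toy_index, "safety inspection violation", top_k=2)
for r in toy_results:
    print(r["pdf_path"], r["page_number"], round(r["score"], 3))
```

The expensive part (embedding every page) happens once in build_index(), which is why initializing the index is a separate step from querying it.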
Understanding Search Results
The find_relevant() method returns a list of dictionaries, each representing a relevant text chunk found in one of the PDFs. Each result includes:
- pdf_path: The path to the PDF document where the result was found.
- page_number: The page number within the PDF.
- score: A relevance score (higher means more relevant).
- content_snippet: A snippet of the text chunk that matched the query.
A future release should make it easy to view the matching PDF page directly.
# Process and display the results
if results:
    for i, result in enumerate(results):
        print(f" {i+1}. PDF: {result['pdf_path']}")
        print(f" Page: {result['page_number']} (Score: {result['score']:.4f})")
        # Display a snippet of the content
        snippet = result.get('content_snippet', '')
        print(f" Snippet: {snippet}...")
else:
    print(" No relevant results found.")
# You can access the full content if needed via the result object,
# though 'content_snippet' is usually sufficient for display.
1. PDF: https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf
   Page: 2 (Score: -0.8584)
   Snippet: Library Weeding Log Atlanta Public Schools From: 8/1/2017 To: 6/30/2023 6/6/2023 - Copies Removed: 130 The Anasazi (Removed: 1) Author: Petersen, David. ISBN: 0-516-01121-9 (trade) Published: 1991 Sit...
2. PDF: https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf
   Page: 5 (Score: -0.8661)
   Snippet: Library Weeding Log Atlanta Public Schools From: 8/1/2017 To: 6/30/2023 6/6/2023 - Copies Removed: 130 Centennial Place 33170000562167 $13.10 11/5/1999 33554-43170 Academy (Charter) Was Available -- W...
3. PDF: https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf
   Page: 1 (Score: -1.0080)
   Snippet: Jungle Health and Safety Inspection Service INS-UP70N51NCL41R Site: Durham’s Meatpacking Chicago, Ill. Date: February 3, 1905 Violation Count: 7 Summary: Worst of any, however, were the fertilizer men...
4. PDF: https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf
   Page: 4 (Score: -1.0489)
   Snippet: Library Weeding Log Atlanta Public Schools From: 8/1/2017 To: 6/30/2023 6/6/2023 - Copies Removed: 130 Children of the Philippines (Removed: 1) Author: Kinkade, Sheila, 1962- ISBN: 0-87614-993-X Publi...
5. PDF: https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf
   Page: 3 (Score: -1.0890)
   Snippet: Library Weeding Log Atlanta Public Schools From: 8/1/2017 To: 6/30/2023 6/6/2023 - Copies Removed: 130 Centennial Place 33170000507600 $19.45 2/21/2000 33554-43170 Academy (Charter) Was Available -- W...
6. PDF: https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf
   Page: 1 (Score: -1.0946)
   Snippet: Library Weeding Log Atlanta Public Schools From: 8/1/2017 To: 6/30/2023 6/12/2023 - Copies Removed: 2 Tristan Strong punches a hole in the sky (Removed: 1) Author: Mbalia, Kwame. ISBN: 978-1-36803993-...
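Because each result is a plain dictionary, ordinary Python is enough to post-process a result list. The sketch below uses hand-written sample data whose paths, page numbers, and scores are copied from the output above (snippets shortened), so it runs without a search index. The helper names and the -1.05 cutoff are hypothetical choices for illustration, not part of natural-pdf.

```python
from collections import defaultdict

ATLANTA = "https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf"
PRACTICE = "https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf"

# Sample data shaped like the find_relevant() results described above.
sample_results = [
    {"pdf_path": ATLANTA, "page_number": 2, "score": -0.8584, "content_snippet": "Library Weeding Log"},
    {"pdf_path": ATLANTA, "page_number": 5, "score": -0.8661, "content_snippet": "Library Weeding Log"},
    {"pdf_path": PRACTICE, "page_number": 1, "score": -1.0080, "content_snippet": "Jungle Health and Safety"},
    {"pdf_path": ATLANTA, "page_number": 4, "score": -1.0489, "content_snippet": "Library Weeding Log"},
    {"pdf_path": ATLANTA, "page_number": 3, "score": -1.0890, "content_snippet": "Library Weeding Log"},
    {"pdf_path": ATLANTA, "page_number": 1, "score": -1.0946, "content_snippet": "Library Weeding Log"},
]


def best_hit_per_pdf(results):
    """Keep only the highest-scoring result for each PDF."""
    best = {}
    for result in results:
        current = best.get(result["pdf_path"])
        if current is None or result["score"] > current["score"]:
            best[result["pdf_path"]] = result
    return best


def pages_by_pdf(results, min_score):
    """Group matching page numbers by PDF, dropping results below min_score."""
    grouped = defaultdict(list)
    for result in results:
        if result["score"] >= min_score:
            grouped[result["pdf_path"]].append(result["page_number"])
    return {path: sorted(numbers) for path, numbers in grouped.items()}


best = best_hit_per_pdf(sample_results)
grouped = pages_by_pdf(sample_results, min_score=-1.05)  # arbitrary cutoff for illustration

for path, result in best.items():
    print(f"{path}: best page {result['page_number']} (score {result['score']:.4f})")
print(grouped)
```

Note that the scores here are negative, so "higher" means closer to zero: -0.8584 outranks -1.0946. Any cutoff you pick depends on the scoring scale of the index in use.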
Semantic search allows you to efficiently query large sets of documents to find the most relevant information without needing exact keyword matches, leveraging the meaning and context of your query.
TODO
- Add example for using persist=True and collection_name in init_search to create a persistent on-disk index.
- Show how to override the embedding model (e.g. embedding_model="all-MiniLM-L12-v2").
- Mention top_k and filtering options available through SearchOptions when calling find_relevant.
- Provide a short snippet on visualising matched pages/elements once highlighting support lands (future feature).
- Clarify that installing the AI stack (natural-pdf[ai]) also pulls in sentence-transformers, which is needed for in-memory NumPy fallback.