Document Question Answering (QA)¶
Sometimes, instead of searching for specific text patterns, you just want to ask the document a question directly. `natural-pdf` includes an extractive Question Answering feature. "Extractive" means it finds the literal answer text within the document, rather than generating a new answer or summarizing.

Let's ask our `01-practice.pdf` a few questions.
#%pip install "natural-pdf[all]"
from natural_pdf import PDF
# Load the PDF and get the page
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]
# Ask about the date
question_1 = "What is the inspection date?"
answer_1 = page.ask(question_1)
# The result is a dictionary with the answer, confidence, etc.
answer_1
Device set to use mps:0
{'answer': 'February 3, 1905', 'confidence': 0.9979940056800842, 'start': 6, 'end': 6, 'found': True, 'page_num': 0, 'source_elements': <ElementCollection[TextElement](count=1)>}
# Ask about the company name
question_2 = "What company was inspected?"
answer_2 = page.ask(question_2)
# Display the answer dictionary
answer_2
{'answer': 'Jungle Health and Safety Inspection Service', 'confidence': 0.9988948106765747, 'start': 0, 'end': 0, 'found': True, 'page_num': 0, 'source_elements': <ElementCollection[TextElement](count=1)>}
# Ask about specific content from the table
question_3 = "What is statute 5.8.3 about?"
answer_3 = page.ask(question_3)
# Display the answer
answer_3
{'answer': 'Inadequate Protective Equipment.', 'confidence': 0.9997999668121338, 'start': 26, 'end': 26, 'found': True, 'page_num': 0, 'source_elements': <ElementCollection[TextElement](count=1)>}
The results include the extracted `answer`, a `confidence` score (useful for filtering uncertain answers), the `page_num`, and the `source_elements`.
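Since the dictionary includes both `found` and `confidence`, you can guard against weak answers before using them. Here is a minimal sketch using the earlier `question_1` and `answer_1`; the 0.8 cutoff is an arbitrary illustration, not a library default.

# Only use the answer if one was found with reasonable confidence
# (the 0.8 threshold is arbitrary - tune it for your documents)
if answer_1['found'] and answer_1['confidence'] > 0.8:
    print(f"{question_1} -> {answer_1['answer']}")
else:
    print(f"No confident answer for: {question_1}")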
Collecting Results into a DataFrame¶
If you're asking multiple questions, it's often useful to collect the results into a pandas DataFrame for easier analysis.
from natural_pdf import PDF
import pandas as pd
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]
# List of questions to ask
questions = [
    "What is the inspection date?",
    "What company was inspected?",
    "What is statute 5.8.3 about?",
    "How many violations were there in total?"  # This might be less reliable
]
# Collect answers for each question
results = []
for q in questions:
    answer_dict = page.ask(q)
    # Add the original question to the dictionary
    answer_dict['question'] = q
    results.append(answer_dict)
# Convert the list of dictionaries to a DataFrame
# We select only the most relevant columns here
df_results = pd.DataFrame(results)[['question', 'answer', 'confidence']]
# Display the DataFrame
df_results
| | question | answer | confidence |
|---|---|---|---|
| 0 | What is the inspection date? | February 3, 1905 | 0.997994 |
| 1 | What company was inspected? | Jungle Health and Safety Inspection Service | 0.998895 |
| 2 | What is statute 5.8.3 about? | Inadequate Protective Equipment. | 0.999800 |
| 3 | How many violations were there in total? | 4.12.7 | 0.662560 |
This shows how you can iterate through questions, collect the answer dictionaries, and then create a structured DataFrame, making it easy to review questions, answers, and their confidence levels together.
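Notice the last row: the counting question returned a statute number with noticeably lower confidence. A simple way to flag such cases is to filter the DataFrame on the `confidence` column; the 0.9 threshold below is just an example value.

# Keep only answers above an (arbitrary) confidence threshold
confident = df_results[df_results['confidence'] > 0.9]
confident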
QA Model and Limitations
* The QA system relies on underlying transformer models. Performance and confidence scores vary.
* It works best for questions where the answer is explicitly stated. It cannot synthesize information or perform calculations (e.g., counting items might fail or return text containing a number rather than the count itself).
* You can potentially specify different QA models via the `model=` argument in `page.ask()` if others are configured (see the sketch below).
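As a sketch of that last point: pass a model identifier to `model=`. The model name below is only a hypothetical placeholder; which models actually work depends on what you have configured, and omitting `model=` uses the default.

# Hypothetical example - "some-org/another-doc-qa-model" is a placeholder, not a recommendation
answer_alt = page.ask("What is the inspection date?", model="some-org/another-doc-qa-model")
answer_alt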