Document Question Answering (QA)¶
Sometimes, instead of searching for specific text patterns, you just want to ask the document a question directly. natural-pdf includes an extractive Question Answering feature.
"Extractive" means it finds the literal answer text within the document, rather than generating a new answer or summarizing.
Let's ask our 01-practice.pdf a few questions.
#%pip install "natural-pdf[ai]" # DocumentQA relies on torch + transformers
from natural_pdf import PDF
# Load the PDF and get the page
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]
# Ask about the date
question_1 = "What is the inspection date?"
answer_1 = page.ask(question_1)
# The result dictionary always contains:
# question – the original question
# answer – the extracted span (string, may be empty)
# confidence – model score between 0 and 1
# start / end – indices into page.words
# found – False if confidence < min_confidence
# page_num – the page the answer was found on
# source_elements – the elements containing the answer
answer_1
Device set to use mps
{'question': 'What is the inspection date?',
'answer': 'February 3, 1905',
'confidence': 0.9979940056800842,
'start': 6,
'end': 6,
'found': True,
'page_num': 0,
'source_elements': <ElementCollection[TextElement](count=1)>}
page.ask("What company was inspected?")
{'question': 'What company was inspected?',
'answer': 'Jungle Health and Safety Inspection Service',
'confidence': 0.9988948106765747,
'start': 0,
'end': 0,
'found': True,
'page_num': 0,
'source_elements': <ElementCollection[TextElement](count=1)>}
page.ask( "What is statute 5.8.3 about?")
{'question': 'What is statute 5.8.3 about?',
'answer': 'Inadequate Protective Equipment.',
'confidence': 0.9997999668121338,
'start': 26,
'end': 26,
'found': True,
'page_num': 0,
'source_elements': <ElementCollection[TextElement](count=1)>}
The results include the extracted answer, a confidence score (useful for filtering uncertain answers), the page_num, and the source_elements.
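For example, you can gate on found and confidence before trusting an answer. A minimal sketch using only the keys shown above:

answer = page.ask("What is the inspection date?")
# Only trust answers the model is reasonably sure about
if answer["found"] and answer["confidence"] > 0.9:
    print(answer["answer"])
else:
    print("No confident answer found")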
Visualising Where the Answer Came From¶
You can access the answer's source elements manually through answer['source_elements'], but it's much more fun to just use .show().
answer = page.ask("What is the inspection ID?")
answer.show()
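If you do want the elements themselves, source_elements holds a regular ElementCollection. The sketch below assumes collections support .show() the same way answers do:

# Hedged sketch: inspect the underlying elements directly
sources = answer["source_elements"]  # ElementCollection of the matched text
sources.show()  # assumption: ElementCollection exposes .show() like answers do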
Asking an entire PDF¶
You don't need to select a single page to use .ask! It also works on entire PDFs, regions, and more.
pdf.ask("What company was inspected?")
{'answer': 'Jungle Health and Safety Inspection Service',
'confidence': 0.9988948106765747,
'found': True,
'page_num': 1,
'source_elements': [],
'start': 0,
'end': 0}
Notice that the result records the page number, so you can track down where in the document the answer came from.
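For example, you could use it to pull up the relevant page. A sketch, assuming page_num can be used to index pdf.pages (check whether your version reports 0- or 1-based numbers; the outputs above are ambiguous):

answer = pdf.ask("What company was inspected?")
# Assumption: page_num indexes into pdf.pages
page_with_answer = pdf.pages[answer["page_num"]]
page_with_answer.show()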
Collecting Results into a DataFrame¶
If you're asking multiple questions, it's often useful to collect the results into a pandas DataFrame. page.ask supports passing a list of questions directly. This is far faster than looping because the underlying model is invoked only once.
from natural_pdf import PDF
import pandas as pd
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]
questions = [
"What is the inspection date?",
"What company was inspected?",
"What is statute 5.8.3 about?",
"How many violations were there in total?"
]
answers = page.ask(questions, min_confidence=0.2)
df = pd.json_normalize(answers)
df
|   | question | answer | confidence | start | end | found | page_num | source_elements |
|---|---|---|---|---|---|---|---|---|
| 0 | What is the inspection date? | February 3, 1905 | 0.997994 | 6 | 6 | True | 0 | [<TextElement text='February 3...' font='Helve... |
| 1 | What company was inspected? | Jungle Health and Safety Inspection Service | 0.998895 | 0 | 0 | True | 0 | [<TextElement text='Jungle Hea...' font='Helve... |
| 2 | What is statute 5.8.3 about? | Inadequate Protective Equipment. | 0.999800 | 26 | 26 | True | 0 | [<TextElement text='Inadequate...' font='Helve... |
| 3 | How many violations were there in total? | 4.12.7 | 0.662560 | 22 | 22 | True | 0 | [<TextElement text='4.12.7' font='Helvetica' s... |
pd.json_normalize flattens the list of answer dictionaries straight into a DataFrame, making it easy to inspect the questions, their extracted answers, and associated confidence scores.
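From there, ordinary pandas filtering applies. For example, to keep only the answers the model was confident about:

# Keep only high-confidence answers
confident = df[df["confidence"] > 0.9]
confident[["question", "answer", "confidence"]]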
TODO¶
- Demonstrate passing model="impira/layoutlm-document-qa" to switch models.
QA Model and Limitations¶
- The QA system relies on underlying transformer models. Performance and confidence scores vary.
- It works best for questions where the answer is explicitly stated. It cannot synthesize information or perform calculations (e.g., counting items might fail or return text containing a number rather than the count itself).
- You can potentially specify different QA models via the model= argument in page.ask() if others are configured (see the sketch below).
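As a sketch, switching models might look like the following, using the model ID from the TODO above (an assumption: the exact behavior depends on your installed version and which models are configured):

# Hypothetical: pass a Hugging Face model ID to page.ask()
answer = page.ask(
    "What is the inspection date?",
    model="impira/layoutlm-document-qa",  # assumption: this model is available/configured
)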