API Reference

This section provides detailed documentation for all the classes and methods in Natural PDF.

Core Classes

natural_pdf

Natural PDF - A more intuitive interface for working with PDFs.

Classes

natural_pdf.ConfigSection

A configuration section that holds key-value option pairs.

Source code in natural_pdf/__init__.py
class ConfigSection:
    """A configuration section that holds key-value option pairs."""

    def __init__(self, **defaults):
        self.__dict__.update(defaults)

    def __repr__(self):
        items = [f"{k}={v!r}" for k, v in self.__dict__.items()]
        return f"{self.__class__.__name__}({', '.join(items)})"
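A minimal sketch of how this container behaves — options are plain attributes set at construction, and `__repr__` lists them. The class is re-declared here so the snippet runs standalone:

```python
class ConfigSection:
    """A configuration section that holds key-value option pairs."""

    def __init__(self, **defaults):
        # Each keyword argument becomes an attribute on the section
        self.__dict__.update(defaults)

    def __repr__(self):
        items = [f"{k}={v!r}" for k, v in self.__dict__.items()]
        return f"{self.__class__.__name__}({', '.join(items)})"


section = ConfigSection(width=None, resolution=150)
section.resolution = 300   # options are ordinary attributes
print(section.resolution)  # → 300
print(section)             # → ConfigSection(width=None, resolution=300)
```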
natural_pdf.Options

Global options for natural-pdf, similar to pandas options.

Source code in natural_pdf/__init__.py
class Options:
    """Global options for natural-pdf, similar to pandas options."""

    def __init__(self):
        # Image rendering defaults
        self.image = ConfigSection(width=None, resolution=150)

        # OCR defaults
        self.ocr = ConfigSection(engine="easyocr", languages=["en"], min_confidence=0.5)

        # Text extraction defaults (empty for now)
        self.text = ConfigSection()
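A sketch of reading and overriding these defaults, pandas-style. Both classes are re-declared so the snippet runs standalone; whether natural-pdf exposes a module-level singleton named exactly `options` is an assumption here, not confirmed by this listing:

```python
class ConfigSection:
    def __init__(self, **defaults):
        self.__dict__.update(defaults)


class Options:
    def __init__(self):
        # Image rendering defaults
        self.image = ConfigSection(width=None, resolution=150)
        # OCR defaults
        self.ocr = ConfigSection(engine="easyocr", languages=["en"], min_confidence=0.5)
        # Text extraction defaults (empty for now)
        self.text = ConfigSection()


options = Options()
options.image.resolution = 300        # raise the default render DPI
options.ocr.languages = ["en", "es"]  # recognize English and Spanish
print(options.ocr.engine)             # → easyocr
```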
natural_pdf.PDF

Bases: ExtractionMixin, ExportMixin, ClassificationMixin

Enhanced PDF wrapper built on top of pdfplumber.

This class provides a fluent interface for working with PDF documents, with improved selection, navigation, and extraction capabilities. It integrates OCR, layout analysis, and AI-powered data extraction features while maintaining compatibility with the underlying pdfplumber API.

The PDF class supports loading from files, URLs, or streams, and provides spatial navigation, element selection with CSS-like selectors, and advanced document processing workflows including multi-page content flows.

Attributes:

pages (PageCollection): Lazy-loaded list of Page objects for document pages.

path: Resolved path to the PDF file or source identifier.

source_path: Original path, URL, or stream identifier provided during initialization.

highlighter: Service for rendering highlighted visualizations of document content.

Example

Basic usage:

import natural_pdf as npdf

pdf = npdf.PDF("document.pdf")
page = pdf.pages[0]
text_elements = page.find_all('text:contains("Summary")')

Advanced usage with OCR:

pdf = npdf.PDF("scanned_document.pdf")
pdf.apply_ocr(engine="easyocr", resolution=144)
tables = pdf.pages[0].find_all('table')

Source code in natural_pdf/core/pdf.py
class PDF(ExtractionMixin, ExportMixin, ClassificationMixin):
    """Enhanced PDF wrapper built on top of pdfplumber.

    This class provides a fluent interface for working with PDF documents,
    with improved selection, navigation, and extraction capabilities. It integrates
    OCR, layout analysis, and AI-powered data extraction features while maintaining
    compatibility with the underlying pdfplumber API.

    The PDF class supports loading from files, URLs, or streams, and provides
    spatial navigation, element selection with CSS-like selectors, and advanced
    document processing workflows including multi-page content flows.

    Attributes:
        pages: Lazy-loaded list of Page objects for document pages.
        path: Resolved path to the PDF file or source identifier.
        source_path: Original path, URL, or stream identifier provided during initialization.
        highlighter: Service for rendering highlighted visualizations of document content.

    Example:
        Basic usage:
        ```python
        import natural_pdf as npdf

        pdf = npdf.PDF("document.pdf")
        page = pdf.pages[0]
        text_elements = page.find_all('text:contains("Summary")')
        ```

        Advanced usage with OCR:
        ```python
        pdf = npdf.PDF("scanned_document.pdf")
        pdf.apply_ocr(engine="easyocr", resolution=144)
        tables = pdf.pages[0].find_all('table')
        ```
    """

    def __init__(
        self,
        path_or_url_or_stream,
        reading_order: bool = True,
        font_attrs: Optional[List[str]] = None,
        keep_spaces: bool = True,
        text_tolerance: Optional[dict] = None,
        auto_text_tolerance: bool = True,
        text_layer: bool = True,
    ):
        """Initialize the enhanced PDF object.

        Args:
            path_or_url_or_stream: Path to the PDF file (str/Path), a URL (str),
                or a file-like object (stream). URLs must start with 'http://' or 'https://'.
            reading_order: If True, use natural reading order for text extraction.
                Defaults to True.
            font_attrs: List of font attributes for grouping characters into words.
                Common attributes include ['fontname', 'size']. Defaults to None.
            keep_spaces: If True, include spaces in word elements during text extraction.
                Defaults to True.
            text_tolerance: PDFplumber-style tolerance settings for text grouping.
                Dictionary with keys like 'x_tolerance', 'y_tolerance'. Defaults to None.
            auto_text_tolerance: If True, automatically scale text tolerance based on
                font size and document characteristics. Defaults to True.
            text_layer: If True, preserve existing text layer from the PDF. If False,
                removes all existing text elements during initialization, useful for
                OCR-only workflows. Defaults to True.

        Raises:
            TypeError: If path_or_url_or_stream is not a valid type.
            IOError: If the PDF file cannot be opened or read.
            ValueError: If URL download fails.

        Example:
            ```python
            # From file path
            pdf = npdf.PDF("document.pdf")

            # From URL
            pdf = npdf.PDF("https://example.com/document.pdf")

            # From stream
            with open("document.pdf", "rb") as f:
                pdf = npdf.PDF(f)

            # With custom settings
            pdf = npdf.PDF("document.pdf",
                          reading_order=False,
                          text_layer=False,  # For OCR-only processing
                          font_attrs=['fontname', 'size', 'flags'])
            ```
        """
        self._original_path_or_stream = path_or_url_or_stream
        self._temp_file = None
        self._resolved_path = None
        self._is_stream = False
        self._text_layer = text_layer
        stream_to_open = None

        if hasattr(path_or_url_or_stream, "read"):  # Check if it's file-like
            logger.info("Initializing PDF from in-memory stream.")
            self._is_stream = True
            self._resolved_path = None  # No resolved file path for streams
            self.source_path = "<stream>"  # Identifier for source
            self.path = self.source_path  # Use source identifier as path for streams
            stream_to_open = path_or_url_or_stream
            try:
                if hasattr(path_or_url_or_stream, "read"):
                    # If caller provided an in-memory binary stream, capture bytes for potential re-export
                    current_pos = path_or_url_or_stream.tell()
                    path_or_url_or_stream.seek(0)
                    self._original_bytes = path_or_url_or_stream.read()
                    path_or_url_or_stream.seek(current_pos)
            except Exception:
                pass
        elif isinstance(path_or_url_or_stream, (str, Path)):
            path_or_url = str(path_or_url_or_stream)
            self.source_path = path_or_url  # Store original path/URL as source
            is_url = path_or_url.startswith("http://") or path_or_url.startswith("https://")

            if is_url:
                logger.info(f"Downloading PDF from URL: {path_or_url}")
                try:
                    with urllib.request.urlopen(path_or_url) as response:
                        data = response.read()
                    # Load directly into an in-memory buffer — no temp file needed
                    buffer = io.BytesIO(data)
                    buffer.seek(0)
                    self._temp_file = None  # No on-disk temp file
                    self._resolved_path = path_or_url  # For repr / get_id purposes
                    stream_to_open = buffer  # pdfplumber accepts file-like objects
                except Exception as e:
                    logger.error(f"Failed to download PDF from URL: {e}")
                    raise ValueError(f"Failed to download PDF from URL: {e}")
            else:
                self._resolved_path = str(Path(path_or_url).resolve())  # Resolve local paths
                stream_to_open = self._resolved_path
            self.path = self._resolved_path  # Use resolved path for file-based PDFs
        else:
            raise TypeError(
                f"Invalid input type: {type(path_or_url_or_stream)}. "
                f"Expected path (str/Path), URL (str), or file-like object."
            )

        logger.info(f"Opening PDF source: {self.source_path}")
        logger.debug(
            f"Parameters: reading_order={reading_order}, font_attrs={font_attrs}, keep_spaces={keep_spaces}"
        )

        try:
            self._pdf = pdfplumber.open(stream_to_open)
        except Exception as e:
            logger.error(f"Failed to open PDF: {e}", exc_info=True)
            self.close()  # Attempt cleanup if opening fails
            raise IOError(f"Failed to open PDF source: {self.source_path}") from e

        # Store configuration used for initialization
        self._reading_order = reading_order
        self._config = {"keep_spaces": keep_spaces}
        self._font_attrs = font_attrs

        self._ocr_manager = OCRManager() if OCRManager else None
        self._layout_manager = LayoutManager() if LayoutManager else None
        self.highlighter = HighlightingService(self)
        # self._classification_manager_instance = ClassificationManager() # Removed this line
        self._manager_registry = {}

        # Lazily instantiate pages only when accessed
        self._pages = _LazyPageList(
            self, self._pdf, font_attrs=font_attrs, load_text=self._text_layer
        )

        self._element_cache = {}
        self._exclusions = []
        self._regions = []

        logger.info(f"PDF '{self.source_path}' initialized with {len(self._pages)} pages.")

        self._initialize_managers()
        self._initialize_highlighter()

        # Remove text layer if requested
        if not self._text_layer:
            logger.info("Removing text layer as requested (text_layer=False)")
            # Text layer is not loaded when text_layer=False, so no need to remove
            pass

        # Analysis results accessed via self.analyses property (see below)

        # --- Automatic cleanup when object is garbage-collected ---
        self._finalizer = weakref.finalize(
            self,
            PDF._finalize_cleanup,
            self._pdf,
            getattr(self, "_temp_file", None),
            getattr(self, "_is_stream", False),
        )

        # --- Text tolerance settings ------------------------------------
        # Users can pass pdfplumber-style keys (x_tolerance, x_tolerance_ratio,
        # y_tolerance, etc.) via *text_tolerance*.  We also keep a flag that
        # enables automatic tolerance scaling when explicit values are not
        # supplied.
        self._config["auto_text_tolerance"] = bool(auto_text_tolerance)
        if text_tolerance:
            # Only copy recognised primitives (numbers / None); ignore junk.
            allowed = {
                "x_tolerance",
                "x_tolerance_ratio",
                "y_tolerance",
                "keep_blank_chars",  # passthrough convenience
            }
            for k, v in text_tolerance.items():
                if k in allowed:
                    self._config[k] = v

    def _initialize_managers(self):
        """Set up manager factories for lazy instantiation."""
        # Store factories/classes for each manager key
        self._manager_factories = dict(DEFAULT_MANAGERS)
        self._managers = {}  # Will hold instantiated managers

    def get_manager(self, key: str) -> Any:
        """Retrieve a manager instance by its key, instantiating it lazily if needed.

        Managers are specialized components that handle specific functionality like
        classification, structured data extraction, or OCR processing. They are
        instantiated on-demand to minimize memory usage and startup time.

        Args:
            key: The manager key to retrieve. Common keys include 'classification'
                and 'structured_data'.

        Returns:
            The manager instance for the specified key.

        Raises:
            KeyError: If no manager is registered for the given key.
            RuntimeError: If the manager failed to initialize.

        Example:
            ```python
            pdf = npdf.PDF("document.pdf")
            classification_mgr = pdf.get_manager('classification')
            structured_data_mgr = pdf.get_manager('structured_data')
            ```
        """
        # Check if already instantiated
        if key in self._managers:
            manager_instance = self._managers[key]
            if manager_instance is None:
                raise RuntimeError(f"Manager '{key}' failed to initialize previously.")
            return manager_instance

        # Not instantiated yet: get factory/class
        if not hasattr(self, "_manager_factories") or key not in self._manager_factories:
            raise KeyError(
                f"No manager registered for key '{key}'. Available: {list(getattr(self, '_manager_factories', {}).keys())}"
            )
        factory_or_class = self._manager_factories[key]
        try:
            resolved = factory_or_class
            # If it's a callable that's not a class, call it to get the class/instance
            if not isinstance(resolved, type) and callable(resolved):
                resolved = resolved()
            # If it's a class, instantiate it
            if isinstance(resolved, type):
                instance = resolved()
            else:
                instance = resolved  # Already an instance
            self._managers[key] = instance
            return instance
        except Exception as e:
            logger.error(f"Failed to initialize manager for key '{key}': {e}")
            self._managers[key] = None
            raise RuntimeError(f"Manager '{key}' failed to initialize: {e}") from e

    def _initialize_highlighter(self):
        pass

    @property
    def metadata(self) -> Dict[str, Any]:
        """Access PDF metadata as a dictionary.

        Returns document metadata such as title, author, creation date, and other
        properties embedded in the PDF file. The exact keys available depend on
        what metadata was included when the PDF was created.

        Returns:
            Dictionary containing PDF metadata. Common keys include 'Title',
            'Author', 'Subject', 'Creator', 'Producer', 'CreationDate', and
            'ModDate'. May be empty if no metadata is available.

        Example:
            ```python
            pdf = npdf.PDF("document.pdf")
            print(pdf.metadata.get('Title', 'No title'))
            print(f"Created: {pdf.metadata.get('CreationDate')}")
            ```
        """
        return self._pdf.metadata

    @property
    def pages(self) -> "PageCollection":
        """Access pages as a PageCollection object.

        Provides access to individual pages of the PDF document through a
        collection interface that supports indexing, slicing, and iteration.
        Pages are lazy-loaded to minimize memory usage.

        Returns:
            PageCollection object that provides list-like access to PDF pages.

        Raises:
            AttributeError: If PDF pages are not yet initialized.

        Example:
            ```python
            pdf = npdf.PDF("document.pdf")

            # Access individual pages
            first_page = pdf.pages[0]
            last_page = pdf.pages[-1]

            # Slice pages
            first_three = pdf.pages[0:3]

            # Iterate over pages
            for page in pdf.pages:
                print(f"Page {page.index} has {len(page.chars)} characters")
            ```
        """
        from natural_pdf.elements.collections import PageCollection

        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not yet initialized.")
        return PageCollection(self._pages)

    def clear_exclusions(self) -> "PDF":
        """Clear all exclusion functions from the PDF.

        Removes all previously added exclusion functions that were used to filter
        out unwanted content (like headers, footers, or administrative text) from
        text extraction and analysis operations.

        Returns:
            Self for method chaining.

        Raises:
            AttributeError: If PDF pages are not yet initialized.

        Example:
            ```python
            pdf = npdf.PDF("document.pdf")
            pdf.add_exclusion(lambda page: page.find('text:contains("CONFIDENTIAL")').above())

            # Later, remove all exclusions
            pdf.clear_exclusions()
            ```
        """
        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not yet initialized.")

        self._exclusions = []
        for page in self._pages:
            page.clear_exclusions()
        return self

    def add_exclusion(
        self, exclusion_func: Callable[["Page"], Optional["Region"]], label: Optional[str] = None
    ) -> "PDF":
        """Add an exclusion function to the PDF.

        Exclusion functions define regions of each page that should be ignored during
        text extraction and analysis operations. This is useful for filtering out headers,
        footers, watermarks, or other administrative content that shouldn't be included
        in the main document processing.

        Args:
            exclusion_func: A function that takes a Page object and returns a Region
                to exclude from processing, or None if no exclusion should be applied
                to that page. The function is called once per page.
            label: Optional descriptive label for this exclusion rule, useful for
                debugging and identification.

        Returns:
            Self for method chaining.

        Raises:
            AttributeError: If PDF pages are not yet initialized.

        Example:
            ```python
            pdf = npdf.PDF("document.pdf")

            # Exclude headers (top 50 points of each page)
            pdf.add_exclusion(
                lambda page: page.region(0, 0, page.width, 50),
                label="header_exclusion"
            )

            # Exclude any text containing "CONFIDENTIAL"
            pdf.add_exclusion(
                lambda page: page.find('text:contains("CONFIDENTIAL")').above(include_source=True)
                if page.find('text:contains("CONFIDENTIAL")') else None,
                label="confidential_exclusion"
            )

            # Chain multiple exclusions
            pdf.add_exclusion(header_func).add_exclusion(footer_func)
            ```
        """
        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not yet initialized.")

        exclusion_data = (exclusion_func, label)
        self._exclusions.append(exclusion_data)

        for page in self._pages:
            page.add_exclusion(exclusion_func, label=label)

        return self

    def apply_ocr(
        self,
        engine: Optional[str] = None,
        languages: Optional[List[str]] = None,
        min_confidence: Optional[float] = None,
        device: Optional[str] = None,
        resolution: Optional[int] = None,
        apply_exclusions: bool = True,
        detect_only: bool = False,
        replace: bool = True,
        options: Optional[Any] = None,
        pages: Optional[Union[Iterable[int], range, slice]] = None,
    ) -> "PDF":
        """Apply OCR to specified pages of the PDF using batch processing.

        Performs optical character recognition on the specified pages, converting
        image-based text into searchable and extractable text elements. This method
        supports multiple OCR engines and provides batch processing for efficiency.

        Args:
            engine: Name of the OCR engine to use. Supported engines include
                'easyocr' (default), 'surya', 'paddle', and 'doctr'. If None,
                uses the global default from natural_pdf.options.ocr.engine.
            languages: List of language codes for OCR recognition (e.g., ['en', 'es']).
                If None, uses the global default from natural_pdf.options.ocr.languages.
            min_confidence: Minimum confidence threshold (0.0-1.0) for accepting
                OCR results. Text with lower confidence will be filtered out.
                If None, uses the global default.
            device: Device to run OCR on ('cpu', 'cuda', 'mps'). Engine-specific
                availability varies. If None, uses engine defaults.
            resolution: DPI resolution for rendering pages to images before OCR.
                Higher values improve accuracy but increase processing time and memory.
                Typical values: 150 (fast), 300 (balanced), 600 (high quality).
            apply_exclusions: If True, mask excluded regions before OCR to prevent
                processing of headers, footers, or other unwanted content.
            detect_only: If True, only detect text bounding boxes without performing
                character recognition. Useful for layout analysis workflows.
            replace: If True, replace any existing OCR elements on the pages.
                If False, append new OCR results to existing elements.
            options: Engine-specific options object (e.g., EasyOCROptions, SuryaOptions).
                Allows fine-tuning of engine behavior beyond common parameters.
            pages: Page indices to process. Can be:
                - None: Process all pages
                - slice: Process a range of pages (e.g., slice(0, 10))
                - Iterable[int]: Process specific page indices (e.g., [0, 2, 5])

        Returns:
            Self for method chaining.

        Raises:
            ValueError: If invalid page index is provided.
            TypeError: If pages parameter has invalid type.
            RuntimeError: If OCR engine is not available or fails.

        Example:
            ```python
            pdf = npdf.PDF("scanned_document.pdf")

            # Basic OCR on all pages
            pdf.apply_ocr()

            # High-quality OCR with specific settings
            pdf.apply_ocr(
                engine='easyocr',
                languages=['en', 'es'],
                resolution=300,
                min_confidence=0.8
            )

            # OCR specific pages only
            pdf.apply_ocr(pages=[0, 1, 2])  # First 3 pages
            pdf.apply_ocr(pages=slice(5, 10))  # Pages 5-9

            # Detection-only workflow for layout analysis
            pdf.apply_ocr(detect_only=True, resolution=150)
            ```

        Note:
            OCR processing can be time and memory intensive, especially at high
            resolutions. Consider using exclusions to mask unwanted regions and
            processing pages in batches for large documents.
        """
        if not self._ocr_manager:
            logger.error("OCRManager not available. Cannot apply OCR.")
            return self

        # Apply global options as defaults, but allow explicit parameters to override
        import natural_pdf

        # Use global OCR options if parameters are not explicitly set
        if engine is None:
            engine = natural_pdf.options.ocr.engine
        if languages is None:
            languages = natural_pdf.options.ocr.languages
        if min_confidence is None:
            min_confidence = natural_pdf.options.ocr.min_confidence
        if device is None:
            pass  # No default device in options.ocr anymore

        thread_id = threading.current_thread().name
        logger.debug(f"[{thread_id}] PDF.apply_ocr starting for {self.path}")

        target_pages = []
        if pages is None:
            target_pages = self._pages
        elif isinstance(pages, slice):
            target_pages = self._pages[pages]
        elif hasattr(pages, "__iter__"):
            try:
                target_pages = [self._pages[i] for i in pages]
            except IndexError:
                raise ValueError("Invalid page index provided in 'pages' iterable.")
            except TypeError:
                raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")
        else:
            raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")

        if not target_pages:
            logger.warning("No pages selected for OCR processing.")
            return self

        page_numbers = [p.number for p in target_pages]
        logger.info(f"Applying batch OCR to pages: {page_numbers}...")

        final_resolution = resolution or getattr(self, "_config", {}).get("resolution", 150)
        logger.debug(f"Using OCR image resolution: {final_resolution} DPI")

        images_pil = []
        page_image_map = []
        logger.info(f"[{thread_id}] Rendering {len(target_pages)} pages...")
        failed_page_num = "unknown"
        render_start_time = time.monotonic()

        try:
            for i, page in enumerate(tqdm(target_pages, desc="Rendering pages", leave=False)):
                failed_page_num = page.number
                logger.debug(f"  Rendering page {page.number} (index {page.index})...")
                to_image_kwargs = {
                    "resolution": final_resolution,
                    "include_highlights": False,
                    "exclusions": "mask" if apply_exclusions else None,
                }
                img = page.to_image(**to_image_kwargs)
                if img is None:
                    logger.error(f"  Failed to render page {page.number} to image.")
                    continue
                images_pil.append(img)
                page_image_map.append((page, img))
        except Exception as e:
            logger.error(f"Failed to render pages for batch OCR: {e}")
            raise RuntimeError(f"Failed to render page {failed_page_num} for OCR.") from e

        render_end_time = time.monotonic()
        logger.debug(
            f"[{thread_id}] Finished rendering {len(images_pil)} images (Duration: {render_end_time - render_start_time:.2f}s)"
        )

        if not images_pil or not page_image_map:
            logger.error("No images were successfully rendered for batch OCR.")
            return self

        manager_args = {
            "images": images_pil,
            "engine": engine,
            "languages": languages,
            "min_confidence": min_confidence,
            "device": device,
            "options": options,
            "detect_only": detect_only,
        }
        manager_args = {k: v for k, v in manager_args.items() if v is not None}

        ocr_call_args = {k: v for k, v in manager_args.items() if k != "images"}
        logger.info(f"[{thread_id}] Calling OCR Manager with args: {ocr_call_args}...")
        ocr_start_time = time.monotonic()

        batch_results = self._ocr_manager.apply_ocr(**manager_args)

        if not isinstance(batch_results, list) or len(batch_results) != len(images_pil):
            logger.error("OCR Manager returned unexpected result format or length.")
            return self

        logger.info("OCR Manager batch processing complete.")

        ocr_end_time = time.monotonic()
        logger.debug(
            f"[{thread_id}] OCR processing finished (Duration: {ocr_end_time - ocr_start_time:.2f}s)"
        )

        logger.info("Adding OCR results to respective pages...")
        total_elements_added = 0

        for i, (page, img) in enumerate(page_image_map):
            results_for_page = batch_results[i]
            if not isinstance(results_for_page, list):
                logger.warning(
                    f"Skipping results for page {page.number}: Expected list, got {type(results_for_page)}"
                )
                continue

            logger.debug(f"  Processing {len(results_for_page)} results for page {page.number}...")
            try:
                if replace and hasattr(page, "_element_mgr"):
                    page._element_mgr.remove_ocr_elements()

                img_scale_x = page.width / img.width if img.width > 0 else 1
                img_scale_y = page.height / img.height if img.height > 0 else 1
                elements = page._element_mgr.create_text_elements_from_ocr(
                    results_for_page, img_scale_x, img_scale_y
                )

                if elements:
                    total_elements_added += len(elements)
                    logger.debug(f"  Added {len(elements)} OCR TextElements to page {page.number}.")
                else:
                    logger.debug(f"  No valid TextElements created for page {page.number}.")
            except Exception as e:
                logger.error(f"  Error adding OCR elements to page {page.number}: {e}")

        logger.info(f"Finished adding OCR results. Total elements added: {total_elements_added}")
        return self

    def add_region(
        self, region_func: Callable[["Page"], Optional["Region"]], name: str = None
    ) -> "PDF":
        """
        Add a region function to the PDF.

        Args:
            region_func: A function that takes a Page and returns a Region, or None
            name: Optional name for the region

        Returns:
            Self for method chaining
        """
        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not yet initialized.")

        region_data = (region_func, name)
        self._regions.append(region_data)

        for page in self._pages:
            try:
                region_instance = region_func(page)
                if region_instance and isinstance(region_instance, Region):
                    page.add_region(region_instance, name=name, source="named")
                elif region_instance is not None:
                    logger.warning(
                        f"Region function did not return a valid Region for page {page.number}"
                    )
            except Exception as e:
                logger.error(f"Error adding region for page {page.number}: {e}")

        return self

    @overload
    def find(
        self,
        *,
        text: str,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional[Any]: ...

    @overload
    def find(
        self,
        selector: str,
        *,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional[Any]: ...

    def find(
        self,
        selector: Optional[str] = None,
        *,
        text: Optional[str] = None,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional[Any]:
        """
        Find the first element matching the selector OR text content across all pages.

        Provide EITHER `selector` OR `text`, but not both.

        Args:
            selector: CSS-like selector string.
            text: Text content to search for (equivalent to 'text:contains(...)').
            apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
            regex: Whether to use regex for text search (`selector` or `text`) (default: False).
            case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
            **kwargs: Additional filter parameters.

        Returns:
            Element object or None if not found.
        """
        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not yet initialized.")

        if selector is not None and text is not None:
            raise ValueError("Provide either 'selector' or 'text', not both.")
        if selector is None and text is None:
            raise ValueError("Provide either 'selector' or 'text'.")

        # Construct selector if 'text' is provided
        effective_selector = ""
        if text is not None:
            escaped_text = text.replace('"', '\\"').replace("'", "\\'")
            effective_selector = f'text:contains("{escaped_text}")'
            logger.debug(
                f"Using text shortcut: find(text='{text}') -> find('{effective_selector}')"
            )
        elif selector is not None:
            effective_selector = selector
        else:
            raise ValueError("Internal error: No selector or text provided.")

        selector_obj = parse_selector(effective_selector)

        # Search page by page
        for page in self.pages:
            # Note: _apply_selector is on Page, so we call find directly here
            # We pass the constructed/validated effective_selector
            element = page.find(
                selector=effective_selector,  # Use the processed selector
                apply_exclusions=apply_exclusions,
                regex=regex,  # Pass down flags
                case=case,
                **kwargs,
            )
            if element:
                return element
        return None  # Not found on any page

    @overload
    def find_all(
        self,
        *,
        text: str,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection": ...

    @overload
    def find_all(
        self,
        selector: str,
        *,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection": ...

    def find_all(
        self,
        selector: Optional[str] = None,
        *,
        text: Optional[str] = None,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection":
        """
        Find all elements matching the selector OR text content across all pages.

        Provide EITHER `selector` OR `text`, but not both.

        Args:
            selector: CSS-like selector string.
            text: Text content to search for (equivalent to 'text:contains(...)').
            apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
            regex: Whether to use regex for text search (`selector` or `text`) (default: False).
            case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
            **kwargs: Additional filter parameters.

        Returns:
            ElementCollection with matching elements.
        """
        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not yet initialized.")

        if selector is not None and text is not None:
            raise ValueError("Provide either 'selector' or 'text', not both.")
        if selector is None and text is None:
            raise ValueError("Provide either 'selector' or 'text'.")

        # Construct selector if 'text' is provided
        effective_selector = ""
        if text is not None:
            escaped_text = text.replace('"', '\\"').replace("'", "\\'")
            effective_selector = f'text:contains("{escaped_text}")'
            logger.debug(
                f"Using text shortcut: find_all(text='{text}') -> find_all('{effective_selector}')"
            )
        elif selector is not None:
            effective_selector = selector
        else:
            raise ValueError("Internal error: No selector or text provided.")

        # Instead of parsing here, let each page parse and apply
        # This avoids parsing the same selector multiple times if not needed
        # selector_obj = parse_selector(effective_selector)

        # kwargs["regex"] = regex # Removed: Already passed explicitly
        # kwargs["case"] = case   # Removed: Already passed explicitly

        all_elements = []
        for page in self.pages:
            # Call page.find_all with the effective selector and flags
            page_elements = page.find_all(
                selector=effective_selector,
                apply_exclusions=apply_exclusions,
                regex=regex,
                case=case,
                **kwargs,
            )
            if page_elements:
                all_elements.extend(page_elements.elements)

        from natural_pdf.elements.collections import ElementCollection

        return ElementCollection(all_elements)

    def extract_text(
        self,
        selector: Optional[str] = None,
        preserve_whitespace=True,
        use_exclusions=True,
        debug_exclusions=False,
        **kwargs,
    ) -> str:
        """
        Extract text from the entire document or matching elements.

        Args:
            selector: Optional selector to filter elements
            preserve_whitespace: Whether to keep blank characters
            use_exclusions: Whether to apply exclusion regions
            debug_exclusions: Whether to output detailed debugging for exclusions
            **kwargs: Additional extraction parameters

        Returns:
            Extracted text as string
        """
        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not yet initialized.")

        if selector:
            elements = self.find_all(selector, apply_exclusions=use_exclusions, **kwargs)
            return elements.extract_text(preserve_whitespace=preserve_whitespace, **kwargs)

        if debug_exclusions:
            print(f"PDF: Extracting text with exclusions from {len(self.pages)} pages")
            print(f"PDF: Found {len(self._exclusions)} document-level exclusions")

        texts = []
        for page in self.pages:
            texts.append(
                page.extract_text(
                    preserve_whitespace=preserve_whitespace,
                    use_exclusions=use_exclusions,
                    debug_exclusions=debug_exclusions,
                    **kwargs,
                )
            )

        if debug_exclusions:
            print(f"PDF: Combined {len(texts)} pages of text")

        return "\n".join(texts)

    def extract_tables(
        self, selector: Optional[str] = None, merge_across_pages: bool = False, **kwargs
    ) -> List[Any]:
        """
        Extract tables from the document or matching elements.

        Args:
            selector: Optional selector to filter tables
            merge_across_pages: Whether to merge tables that span across pages
            **kwargs: Additional extraction parameters

        Returns:
            List of extracted tables
        """
        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not yet initialized.")

        logger.warning("PDF.extract_tables is not fully implemented yet.")
        all_tables = []

        for page in self.pages:
            if hasattr(page, "extract_tables"):
                all_tables.extend(page.extract_tables(**kwargs))
            else:
                logger.debug(f"Page {page.number} does not have extract_tables method.")

        if selector:
            logger.warning("Filtering extracted tables by selector is not implemented.")

        if merge_across_pages:
            logger.warning("Merging tables across pages is not implemented.")

        return all_tables

    def save_searchable(self, output_path: Union[str, "Path"], dpi: int = 300, **kwargs):
        """
        DEPRECATED: Use save_pdf(..., ocr=True) instead.
        Saves the PDF with an OCR text layer, making content searchable.

        Requires optional dependencies. Install with: pip install \"natural-pdf[ocr-export]\"

        Args:
            output_path: Path to save the searchable PDF
            dpi: Resolution for rendering and OCR overlay
            **kwargs: Additional keyword arguments passed to the exporter
        """
        logger.warning(
            "PDF.save_searchable() is deprecated. Use PDF.save_pdf(..., ocr=True) instead."
        )
        if create_searchable_pdf is None:
            raise ImportError(
                "Saving searchable PDF requires 'pikepdf'. "
                'Install with: pip install "natural-pdf[ocr-export]"'
            )
        output_path_str = str(output_path)
        # Call the exporter directly, passing self (the PDF instance)
        create_searchable_pdf(self, output_path_str, dpi=dpi, **kwargs)
        # Logger info is handled within the exporter now
        # logger.info(f"Searchable PDF saved to: {output_path_str}")

    def save_pdf(
        self,
        output_path: Union[str, Path],
        ocr: bool = False,
        original: bool = False,
        dpi: int = 300,
    ):
        """
        Saves the PDF object (all its pages) to a new file.

        Choose one saving mode:
        - `ocr=True`: Creates a new, image-based PDF using OCR results from all pages.
          Text generated during the natural-pdf session becomes searchable,
          but original vector content is lost. Requires 'ocr-export' extras.
        - `original=True`: Saves a copy of the original PDF file this object represents.
          Any OCR results or analyses from the natural-pdf session are NOT included.
          If the PDF was opened from an in-memory buffer, this mode may not be suitable.
          Requires 'ocr-export' extras.

        Args:
            output_path: Path to save the new PDF file.
            ocr: If True, save as a searchable, image-based PDF using OCR data.
            original: If True, save the original source PDF content.
            dpi: Resolution (dots per inch) used only when ocr=True.

        Raises:
            ValueError: If the PDF has no pages, if neither or both 'ocr'
                        and 'original' are True.
            ImportError: If required libraries are not installed for the chosen mode.
            RuntimeError: If an unexpected error occurs during saving.
        """
        if not self.pages:
            raise ValueError("Cannot save an empty PDF object.")

        if not (ocr ^ original):  # XOR: exactly one must be true
            raise ValueError("Exactly one of 'ocr' or 'original' must be True.")

        output_path_obj = Path(output_path)
        output_path_str = str(output_path_obj)

        if ocr:
            has_vector_elements = False
            for page in self.pages:
                if (
                    hasattr(page, "rects")
                    and page.rects
                    or hasattr(page, "lines")
                    and page.lines
                    or hasattr(page, "curves")
                    and page.curves
                    or (
                        hasattr(page, "chars")
                        and any(getattr(el, "source", None) != "ocr" for el in page.chars)
                    )
                    or (
                        hasattr(page, "words")
                        and any(getattr(el, "source", None) != "ocr" for el in page.words)
                    )
                ):
                    has_vector_elements = True
                    break
            if has_vector_elements:
                logger.warning(
                    "Warning: Saving with ocr=True creates an image-based PDF. "
                    "Original vector elements (rects, lines, non-OCR text/chars) "
                    "will not be preserved in the output file."
                )

            logger.info(f"Saving searchable PDF (OCR text layer) to: {output_path_str}")
            try:
                # Delegate to the searchable PDF exporter, passing self (PDF instance)
                create_searchable_pdf(self, output_path_str, dpi=dpi)
            except Exception as e:
                raise RuntimeError(f"Failed to create searchable PDF: {e}") from e

        elif original:
            if create_original_pdf is None:
                raise ImportError(
                    "Saving with original=True requires 'pikepdf'. "
                    'Install with: pip install "natural-pdf[ocr-export]"'
                )

            # Optional: Add warning about losing OCR data similar to PageCollection
            has_ocr_elements = False
            for page in self.pages:
                if hasattr(page, "find_all"):
                    ocr_text_elements = page.find_all("text[source=ocr]")
                    if ocr_text_elements:
                        has_ocr_elements = True
                        break
                elif hasattr(page, "words"):  # Fallback
                    if any(getattr(el, "source", None) == "ocr" for el in page.words):
                        has_ocr_elements = True
                        break
            if has_ocr_elements:
                logger.warning(
                    "Warning: Saving with original=True preserves original page content. "
                    "OCR text generated in this session will not be included in the saved file."
                )

            logger.info(f"Saving original PDF content to: {output_path_str}")
            try:
                # Delegate to the original PDF exporter, passing self (PDF instance)
                create_original_pdf(self, output_path_str)
            except Exception as e:
                # Re-raise exception from exporter
                raise e

    def ask(
        self,
        question: str,
        mode: str = "extractive",
        pages: Union[int, List[int], range] = None,
        min_confidence: float = 0.1,
        model: str = None,
        **kwargs,
    ) -> Dict[str, Any]:
        """
        Ask a single question about the document content.

        Args:
            question: Question string to ask about the document
            mode: "extractive" to extract answer from document, "generative" to generate
            pages: Specific pages to query (default: all pages)
            min_confidence: Minimum confidence threshold for answers
            model: Optional model name for question answering
            **kwargs: Additional parameters passed to the QA engine

        Returns:
            Dict containing: answer, confidence, found, page_num, source_elements, etc.
        """
        # Delegate to ask_batch and return the first result
        results = self.ask_batch([question], mode=mode, pages=pages, min_confidence=min_confidence, model=model, **kwargs)
        return results[0] if results else {
            "answer": None,
            "confidence": 0.0,
            "found": False,
            "page_num": None,
            "source_elements": [],
        }

    def ask_batch(
        self,
        questions: List[str],
        mode: str = "extractive",
        pages: Union[int, List[int], range] = None,
        min_confidence: float = 0.1,
        model: str = None,
        **kwargs,
    ) -> List[Dict[str, Any]]:
        """
        Ask multiple questions about the document content using batch processing.

        This method processes multiple questions efficiently in a single batch,
        avoiding the multiprocessing resource accumulation that can occur with
        sequential individual question calls.

        Args:
            questions: List of question strings to ask about the document
            mode: "extractive" to extract answer from document, "generative" to generate
            pages: Specific pages to query (default: all pages)
            min_confidence: Minimum confidence threshold for answers
            model: Optional model name for question answering
            **kwargs: Additional parameters passed to the QA engine

        Returns:
            List of Dicts, each containing: answer, confidence, found, page_num, source_elements, etc.
        """
        from natural_pdf.qa import get_qa_engine

        if not questions:
            return []

        if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
            raise TypeError("'questions' must be a list of strings")

        qa_engine = get_qa_engine() if model is None else get_qa_engine(model_name=model)

        # Resolve target pages
        if pages is None:
            target_pages = self.pages
        elif isinstance(pages, int):
            if 0 <= pages < len(self.pages):
                target_pages = [self.pages[pages]]
            else:
                raise IndexError(f"Page index {pages} out of range (0-{len(self.pages)-1})")
        elif isinstance(pages, (list, range)):
            target_pages = []
            for page_idx in pages:
                if 0 <= page_idx < len(self.pages):
                    target_pages.append(self.pages[page_idx])
                else:
                    logger.warning(f"Page index {page_idx} out of range, skipping")
        else:
            raise ValueError(f"Invalid pages parameter: {pages}")

        if not target_pages:
            logger.warning("No valid pages found for QA processing.")
            return [
                {
                    "answer": None,
                    "confidence": 0.0,
                    "found": False,
                    "page_num": None,
                    "source_elements": [],
                }
                for _ in questions
            ]

        logger.info(f"Processing {len(questions)} question(s) across {len(target_pages)} page(s) using batch QA...")

        # Collect all page images and metadata for batch processing
        page_images = []
        page_word_boxes = []
        page_metadata = []

        for page in target_pages:
            # Get page image
            try:
                page_image = page.to_image(resolution=150, include_highlights=False)
                if page_image is None:
                    logger.warning(f"Failed to render image for page {page.number}, skipping")
                    continue

                # Get text elements for word boxes
                elements = page.find_all("text")
                if not elements:
                    logger.warning(f"No text elements found on page {page.number}")
                    word_boxes = []
                else:
                    word_boxes = qa_engine._get_word_boxes_from_elements(elements, offset_x=0, offset_y=0)

                page_images.append(page_image)
                page_word_boxes.append(word_boxes)
                page_metadata.append({
                    "page_number": page.number,
                    "page_object": page
                })

            except Exception as e:
                logger.warning(f"Error processing page {page.number}: {e}")
                continue

        if not page_images:
            logger.warning("No page images could be processed for QA.")
            return [
                {
                    "answer": None,
                    "confidence": 0.0,
                    "found": False,
                    "page_num": None,
                    "source_elements": [],
                }
                for _ in questions
            ]

        # Process all questions against all pages in batch
        all_results = []

        for question_text in questions:
            question_results = []

            # Ask this question against each page (but in batch per page)
            for i, (page_image, word_boxes, page_meta) in enumerate(zip(page_images, page_word_boxes, page_metadata)):
                try:
                    # Use the DocumentQA batch interface 
                    page_result = qa_engine.ask(
                        image=page_image,
                        question=question_text,
                        word_boxes=word_boxes,
                        min_confidence=min_confidence,
                        **kwargs
                    )

                    if page_result and page_result.found:
                        # Add page metadata to result
                        page_result_dict = {
                            "answer": page_result.answer,
                            "confidence": page_result.confidence,
                            "found": page_result.found,
                            "page_num": page_meta["page_number"],
                            "source_elements": getattr(page_result, 'source_elements', []),
                            "start": getattr(page_result, 'start', -1),
                            "end": getattr(page_result, 'end', -1),
                        }
                        question_results.append(page_result_dict)

                except Exception as e:
                    logger.warning(f"Error processing question '{question_text}' on page {page_meta['page_number']}: {e}")
                    continue

            # Sort results by confidence and take the best one for this question
            question_results.sort(key=lambda x: x.get("confidence", 0), reverse=True)

            if question_results:
                all_results.append(question_results[0])
            else:
                # No results found for this question
                all_results.append({
                    "answer": None,
                    "confidence": 0.0,
                    "found": False,
                    "page_num": None,
                    "source_elements": [],
                })

        return all_results
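The per-question loop above reduces to: gather one candidate answer per page, then keep the highest-confidence hit, falling back to a not-found placeholder. A standalone sketch of that selection step (plain dicts standing in for the QA result objects, not the natural_pdf API):

```python
def best_answer(candidates):
    """Pick the highest-confidence found answer, or a not-found placeholder."""
    found = [c for c in candidates if c.get("found")]
    if not found:
        return {"answer": None, "confidence": 0.0, "found": False,
                "page_num": None, "source_elements": []}
    return max(found, key=lambda c: c.get("confidence", 0))

candidates = [
    {"answer": "Acme Corp", "confidence": 0.42, "found": True, "page_num": 1},
    {"answer": "Acme Corporation", "confidence": 0.91, "found": True, "page_num": 3},
]
print(best_answer(candidates)["page_num"])  # 3
```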

    def search_within_index(
        self,
        query: Union[str, Path, Image.Image, "Region"],
        search_service: "SearchServiceProtocol",
        options: Optional["SearchOptions"] = None,
    ) -> List[Dict[str, Any]]:
        """
        Finds relevant documents from this PDF within a search index.

        Args:
            query: The search query (text, image path, PIL Image, Region)
            search_service: A pre-configured SearchService instance
            options: Optional SearchOptions to configure the query

        Returns:
            A list of result dictionaries, sorted by relevance

        Raises:
            ImportError: If search dependencies are not installed
            ValueError: If search_service is None
            TypeError: If search_service does not conform to the protocol
            FileNotFoundError: If the collection managed by the service does not exist
            RuntimeError: For other search failures
        """
        if not search_service:
            raise ValueError("A configured SearchServiceProtocol instance must be provided.")

        collection_name = getattr(search_service, "collection_name", "<Unknown Collection>")
        logger.info(
            f"Searching within index '{collection_name}' for content from PDF '{self.path}'"
        )

        service = search_service

        query_input = query
        effective_options = copy.deepcopy(options) if options is not None else TextSearchOptions()

        if isinstance(query, Region):
            logger.debug("Query is a Region object. Extracting text.")
            if not isinstance(effective_options, TextSearchOptions):
                logger.warning(
                    "Querying with Region image requires MultiModalSearchOptions. Falling back to text extraction."
                )
            query_input = query.extract_text()
            if not query_input or query_input.isspace():
                logger.error("Region has no extractable text for query.")
                return []

        # Add filter to scope search to THIS PDF
        pdf_scope_filter = {
            "field": "pdf_path",
            "operator": "eq",
            "value": self.path,
        }
        logger.debug(f"Applying filter to scope search to PDF: {pdf_scope_filter}")

        # Combine with existing filters in options (if any)
        if effective_options.filters:
            logger.debug(f"Combining PDF scope filter with existing filters")
            if (
                isinstance(effective_options.filters, dict)
                and effective_options.filters.get("operator") == "AND"
            ):
                effective_options.filters["conditions"].append(pdf_scope_filter)
            elif isinstance(effective_options.filters, list):
                effective_options.filters = {
                    "operator": "AND",
                    "conditions": effective_options.filters + [pdf_scope_filter],
                }
            elif isinstance(effective_options.filters, dict):
                effective_options.filters = {
                    "operator": "AND",
                    "conditions": [effective_options.filters, pdf_scope_filter],
                }
            else:
                logger.warning(
                    f"Unsupported format for existing filters. Overwriting with PDF scope filter."
                )
                effective_options.filters = pdf_scope_filter
        else:
            effective_options.filters = pdf_scope_filter

        logger.debug(f"Final filters for service search: {effective_options.filters}")

        try:
            results = service.search(
                query=query_input,
                options=effective_options,
            )
            logger.info(f"SearchService returned {len(results)} results from PDF '{self.path}'")
            return results
        except FileNotFoundError as fnf:
            logger.error(f"Search failed: Collection not found. Error: {fnf}")
            raise
        except Exception as e:
            logger.error(f"SearchService search failed: {e}")
            raise RuntimeError(f"Search within index failed. See logs for details.") from e
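The filter-merging branches in `search_within_index` (existing AND-dict, list, plain dict, or nothing) can be isolated into a small helper for clarity. This sketch mirrors those branches but is an illustration, not part of the natural_pdf API:

```python
def combine_filters(existing, scope_filter):
    """Merge a scope filter into existing filters, mirroring search_within_index."""
    if not existing:
        return scope_filter
    if isinstance(existing, dict) and existing.get("operator") == "AND":
        existing["conditions"].append(scope_filter)
        return existing
    if isinstance(existing, list):
        return {"operator": "AND", "conditions": existing + [scope_filter]}
    if isinstance(existing, dict):
        return {"operator": "AND", "conditions": [existing, scope_filter]}
    return scope_filter  # unsupported format: overwrite

scope = {"field": "pdf_path", "operator": "eq", "value": "report.pdf"}
merged = combine_filters({"field": "year", "operator": "eq", "value": 2023}, scope)
print(merged["operator"])  # AND
```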

    def export_ocr_correction_task(self, output_zip_path: str, **kwargs):
        """
        Exports OCR results from this PDF into a correction task package.

        Args:
            output_zip_path: The path to save the output zip file
            **kwargs: Additional arguments passed to create_correction_task_package
        """
        try:
            from natural_pdf.utils.packaging import create_correction_task_package

            create_correction_task_package(source=self, output_zip_path=output_zip_path, **kwargs)
        except ImportError:
            logger.error(
                "Failed to import 'create_correction_task_package'. Packaging utility might be missing."
            )
        except Exception as e:
            logger.error(f"Failed to export correction task: {e}")
            raise

    def correct_ocr(
        self,
        correction_callback: Callable[[Any], Optional[str]],
        pages: Optional[Union[Iterable[int], range, slice]] = None,
        max_workers: Optional[int] = None,
        progress_callback: Optional[Callable[[], None]] = None,
    ) -> "PDF":
        """
        Applies corrections to OCR text elements using a callback function.

        Args:
            correction_callback: Function that takes an element and returns corrected text or None
            pages: Optional page indices/slice to limit the scope of correction
            max_workers: Maximum number of threads to use for parallel execution
            progress_callback: Optional callback function for progress updates

        Returns:
            Self for method chaining
        """
        target_page_indices = []
        if pages is None:
            target_page_indices = list(range(len(self._pages)))
        elif isinstance(pages, slice):
            target_page_indices = list(range(*pages.indices(len(self._pages))))
        elif hasattr(pages, "__iter__"):
            try:
                target_page_indices = [int(i) for i in pages]
                for idx in target_page_indices:
                    if not (0 <= idx < len(self._pages)):
                        raise IndexError(f"Page index {idx} out of range (0-{len(self._pages)-1}).")
            except (IndexError, TypeError, ValueError) as e:
                raise ValueError(f"Invalid page index in 'pages': {pages}. Error: {e}") from e
        else:
            raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")

        if not target_page_indices:
            logger.warning("No pages selected for OCR correction.")
            return self

        logger.info(f"Starting OCR correction for pages: {target_page_indices}")

        for page_idx in target_page_indices:
            page = self._pages[page_idx]
            try:
                page.correct_ocr(
                    correction_callback=correction_callback,
                    max_workers=max_workers,
                    progress_callback=progress_callback,
                )
            except Exception as e:
                logger.error(f"Error during correct_ocr on page {page_idx}: {e}")

        logger.info("OCR correction process finished.")
        return self
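The `pages` argument handling in `correct_ocr` (None, slice, or iterable of indices) is worth seeing in isolation. A dependency-free sketch, with `n_pages` standing in for `len(self._pages)` and validation errors raised directly rather than wrapped:

```python
def resolve_page_indices(pages, n_pages):
    """Resolve a pages argument into a list of valid 0-based page indices."""
    if pages is None:
        return list(range(n_pages))
    if isinstance(pages, slice):
        return list(range(*pages.indices(n_pages)))
    if hasattr(pages, "__iter__"):
        indices = [int(i) for i in pages]
        for idx in indices:
            if not 0 <= idx < n_pages:
                raise ValueError(f"Page index {idx} out of range (0-{n_pages - 1}).")
        return indices
    raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")

print(resolve_page_indices(slice(1, 4), 10))  # [1, 2, 3]
print(resolve_page_indices(None, 3))          # [0, 1, 2]
```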

    def __len__(self) -> int:
        """Return the number of pages in the PDF."""
        if not hasattr(self, "_pages"):
            return 0
        return len(self._pages)

    def __getitem__(self, key) -> Union["Page", "PageCollection"]:
        """Access pages by index or slice."""
        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not initialized yet.")

        if isinstance(key, slice):
            from natural_pdf.elements.collections import PageCollection

            return PageCollection(self._pages[key])

        if isinstance(key, int):
            if 0 <= key < len(self._pages):
                return self._pages[key]
            else:
                raise IndexError(f"Page index {key} out of range (0-{len(self._pages)-1}).")
        else:
            raise TypeError(f"Page indices must be integers or slices, not {type(key)}.")
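`__getitem__` dispatches on the key type: an int returns a single `Page`, a slice returns a `PageCollection`. The same pattern in miniature, using plain lists as hypothetical stand-ins for both return types:

```python
class Doc:
    """Toy container mirroring PDF.__getitem__ dispatch."""

    def __init__(self, pages):
        self._pages = pages

    def __len__(self):
        return len(self._pages)

    def __getitem__(self, key):
        if isinstance(key, slice):
            return self._pages[key]      # stands in for a PageCollection
        if isinstance(key, int):
            if 0 <= key < len(self._pages):
                return self._pages[key]  # stands in for a single Page
            raise IndexError(f"Page index {key} out of range.")
        raise TypeError(f"Page indices must be integers or slices, not {type(key)}.")

doc = Doc(["p0", "p1", "p2"])
print(doc[0])    # p0
print(doc[1:3])  # ['p1', 'p2']
```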

    def close(self):
        """Close the underlying PDF file and clean up any temporary files."""
        if hasattr(self, "_pdf") and self._pdf is not None:
            try:
                self._pdf.close()
                logger.debug(f"Closed pdfplumber PDF object for {self.source_path}")
            except Exception as e:
                logger.warning(f"Error closing pdfplumber object: {e}")
            finally:
                self._pdf = None

        if hasattr(self, "_temp_file") and self._temp_file is not None:
            temp_file_path = None
            try:
                if hasattr(self._temp_file, "name") and self._temp_file.name:
                    temp_file_path = self._temp_file.name
                    # Only unlink if it exists and _is_stream is False (meaning WE created it)
                    if not self._is_stream and os.path.exists(temp_file_path):
                        os.unlink(temp_file_path)
                        logger.debug(f"Removed temporary PDF file: {temp_file_path}")
            except Exception as e:
                logger.warning(f"Failed to clean up temporary file '{temp_file_path}': {e}")

        # Cancels the weakref finalizer so we don't double-clean
        if hasattr(self, "_finalizer") and self._finalizer.alive:
            self._finalizer()

    def __enter__(self):
        """Context manager entry."""
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit."""
        self.close()
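`__enter__`/`__exit__` simply hand cleanup to `close()`, so `with PDF(...) as pdf:` guarantees the file is released even on error. The protocol in miniature, with a toy resource in place of a PDF:

```python
class Resource:
    """Minimal __enter__/__exit__ pair, following the same pattern as PDF."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()

with Resource() as r:
    assert not r.closed  # still open inside the block
print(r.closed)  # True
```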

    def __repr__(self) -> str:
        """Return a string representation of the PDF object."""
        if not hasattr(self, "_pages"):
            page_count_str = "uninitialized"
        else:
            page_count_str = str(len(self._pages))

        source_info = getattr(self, "source_path", "unknown source")
        return f"<PDF source='{source_info}' pages={page_count_str}>"

    def get_id(self) -> str:
        """Get unique identifier for this PDF."""
        return self.path

    # --- Deskew Method --- #

    def deskew(
        self,
        pages: Optional[Union[Iterable[int], range, slice]] = None,
        resolution: int = 300,
        angle: Optional[float] = None,
        detection_resolution: int = 72,
        force_overwrite: bool = False,
        **deskew_kwargs,
    ) -> "PDF":
        """
        Creates a new, in-memory PDF object containing deskewed versions of the
        specified pages from the original PDF.

        This method renders each selected page, detects and corrects skew using the 'deskew'
        library, and then combines the resulting images into a new PDF using 'img2pdf'.
        The new PDF object is returned directly.

        Important: The returned PDF is image-based. Any existing text, OCR results,
        annotations, or other elements from the original pages will *not* be carried over.

        Args:
            pages: Page indices/slice to include (0-based). If None, processes all pages.
            resolution: DPI resolution for rendering the output deskewed pages.
            angle: The specific angle (in degrees) to rotate by. If None, detects automatically.
            detection_resolution: DPI resolution used for skew detection if angles are not
                                  already cached on the page objects.
            force_overwrite: If False (default), raises a ValueError if any target page
                             already contains processed elements (text, OCR, regions) to
                             prevent accidental data loss. Set to True to proceed anyway.
            **deskew_kwargs: Additional keyword arguments passed to `deskew.determine_skew`
                             during automatic detection (e.g., `max_angle`, `num_peaks`).

        Returns:
            A new PDF object representing the deskewed document.

        Raises:
            ImportError: If 'deskew' or 'img2pdf' libraries are not installed.
            ValueError: If `force_overwrite` is False and target pages contain elements.
            FileNotFoundError: If the source PDF cannot be read (if file-based).
            IOError: If creating the in-memory PDF fails.
            RuntimeError: If rendering or deskewing individual pages fails.
        """
        if not DESKEW_AVAILABLE:
            raise ImportError(
                "Deskew/img2pdf libraries missing. Install with: pip install natural-pdf[deskew]"
            )

        target_pages = self._get_target_pages(pages)  # Use helper to resolve pages

        # --- Safety Check --- #
        if not force_overwrite:
            for page in target_pages:
                # Check if the element manager has been initialized and contains any elements
                if (
                    hasattr(page, "_element_mgr")
                    and page._element_mgr
                    and page._element_mgr.has_elements()
                ):
                    raise ValueError(
                        f"Page {page.number} contains existing elements (text, OCR, etc.). "
                        f"Deskewing creates an image-only PDF, discarding these elements. "
                        f"Set force_overwrite=True to proceed."
                    )

        # --- Process Pages --- #
        deskewed_images_bytes = []
        logger.info(f"Deskewing {len(target_pages)} pages (output resolution={resolution} DPI)...")

        for page in tqdm(target_pages, desc="Deskewing Pages", leave=False):
            try:
                # Use page.deskew to get the corrected PIL image
                # Pass down resolutions and kwargs
                deskewed_img = page.deskew(
                    resolution=resolution,
                    angle=angle,  # Let page.deskew handle detection/caching
                    detection_resolution=detection_resolution,
                    **deskew_kwargs,
                )

                if not deskewed_img:
                    logger.warning(
                        f"Page {page.number}: Failed to generate deskewed image, skipping."
                    )
                    continue

                # Convert image to bytes for img2pdf (use PNG for lossless quality)
                with io.BytesIO() as buf:
                    deskewed_img.save(buf, format="PNG")
                    deskewed_images_bytes.append(buf.getvalue())

            except Exception as e:
                logger.error(
                    f"Page {page.number}: Failed during deskewing process: {e}", exc_info=True
                )
                # Raise so that a failure on any single page aborts the whole operation.
                raise RuntimeError(f"Failed to process page {page.number} during deskewing.") from e

        # --- Create PDF --- #
        if not deskewed_images_bytes:
            raise RuntimeError("No pages were successfully processed to create the deskewed PDF.")

        logger.info(f"Combining {len(deskewed_images_bytes)} deskewed images into in-memory PDF...")
        try:
            # Use img2pdf to combine image bytes into PDF bytes
            pdf_bytes = img2pdf.convert(deskewed_images_bytes)

            # Wrap bytes in a stream
            pdf_stream = io.BytesIO(pdf_bytes)

            # Create a new PDF object from the stream using original config
            logger.info("Creating new PDF object from deskewed stream...")
            new_pdf = PDF(
                pdf_stream,
                reading_order=self._reading_order,
                font_attrs=self._font_attrs,
                keep_spaces=self._config.get("keep_spaces", True),
                text_layer=self._text_layer,
            )
            return new_pdf
        except Exception as e:
            logger.error(f"Failed to create in-memory PDF using img2pdf/PDF init: {e}")
            raise IOError("Failed to create deskewed PDF object from image stream.") from e

    # --- End Deskew Method --- #
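The core of `deskew` is a staging pipeline: render each corrected page, serialize it to PNG bytes, and hand the byte list to `img2pdf.convert` for the output stream. A dependency-free sketch of that staging, where the `convert` callable and `FakeImage` are hypothetical stand-ins for `img2pdf.convert` and a PIL image:

```python
import io

def build_pdf_stream(images, convert):
    """Serialize each image to PNG bytes, then combine via a converter callable."""
    chunks = []
    for img in images:
        with io.BytesIO() as buf:
            img.save(buf, format="PNG")
            chunks.append(buf.getvalue())
    if not chunks:
        raise RuntimeError("No pages were successfully processed.")
    return io.BytesIO(convert(chunks))

class FakeImage:
    """Stand-in for a PIL image; save() just writes raw bytes."""
    def __init__(self, data):
        self.data = data
    def save(self, buf, format):
        buf.write(self.data)

stream = build_pdf_stream([FakeImage(b"a"), FakeImage(b"b")], convert=b"".join)
print(stream.read())  # b'ab'
```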

    # --- Classification Methods --- #

    def classify_pages(
        self,
        labels: List[str],
        model: Optional[str] = None,
        pages: Optional[Union[Iterable[int], range, slice]] = None,
        analysis_key: str = "classification",
        using: Optional[str] = None,
        **kwargs,
    ) -> "PDF":
        """
        Classifies specified pages of the PDF.

        Args:
            labels: List of category names
            model: Model identifier ('text', 'vision', or specific HF ID)
            pages: Page indices, slice, or None for all pages
            analysis_key: Key to store results in page's analyses dict
            using: Processing mode ('text' or 'vision')
            **kwargs: Additional arguments for the ClassificationManager

        Returns:
            Self for method chaining
        """
        if not labels:
            raise ValueError("Labels list cannot be empty.")

        try:
            manager = self.get_manager("classification")
        except (ValueError, RuntimeError) as e:
            raise ClassificationError(f"Cannot get ClassificationManager: {e}") from e

        if not manager or not manager.is_available():
            from natural_pdf.classification.manager import is_classification_available

            if not is_classification_available():
                raise ImportError(
                    "Classification dependencies missing. "
                    'Install with: pip install "natural-pdf[ai]"'
                )
            raise ClassificationError("ClassificationManager not available.")

        target_pages = []
        if pages is None:
            target_pages = self._pages
        elif isinstance(pages, slice):
            target_pages = self._pages[pages]
        elif hasattr(pages, "__iter__"):
            try:
                target_pages = [self._pages[i] for i in pages]
            except IndexError:
                raise ValueError("Invalid page index provided.")
            except TypeError:
                raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")
        else:
            raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")

        if not target_pages:
            logger.warning("No pages selected for classification.")
            return self

        inferred_using = manager.infer_using(model if model else manager.DEFAULT_TEXT_MODEL, using)
        logger.info(
            f"Classifying {len(target_pages)} pages using model '{model or '(default)'}' (mode: {inferred_using})"
        )

        page_contents = []
        pages_to_classify = []
        logger.debug(f"Gathering content for {len(target_pages)} pages...")

        for page in target_pages:
            try:
                content = page._get_classification_content(model_type=inferred_using, **kwargs)
                page_contents.append(content)
                pages_to_classify.append(page)
            except ValueError as e:
                logger.warning(f"Skipping page {page.number}: Cannot get content - {e}")
            except Exception as e:
                logger.warning(f"Skipping page {page.number}: Error getting content - {e}")

        if not page_contents:
            logger.warning("No content could be gathered for batch classification.")
            return self

        logger.debug(f"Gathered content for {len(pages_to_classify)} pages.")

        try:
            batch_results = manager.classify_batch(
                item_contents=page_contents,
                labels=labels,
                model_id=model,
                using=inferred_using,
                **kwargs,
            )
        except Exception as e:
            logger.error(f"Batch classification failed: {e}")
            raise ClassificationError(f"Batch classification failed: {e}") from e

        if len(batch_results) != len(pages_to_classify):
            logger.error(
                f"Mismatch between number of results ({len(batch_results)}) and pages ({len(pages_to_classify)})"
            )
            return self

        logger.debug(
            f"Distributing {len(batch_results)} results to pages under key '{analysis_key}'..."
        )
        for page, result_obj in zip(pages_to_classify, batch_results):
            try:
                if not hasattr(page, "analyses") or page.analyses is None:
                    page.analyses = {}
                page.analyses[analysis_key] = result_obj
            except Exception as e:
                logger.warning(
                    f"Failed to store classification results for page {page.number}: {e}"
                )

        logger.info(f"Finished classifying PDF pages.")
        return self

    # --- End Classification Methods --- #

    # --- Extraction Support --- #
    def _get_extraction_content(self, using: str = "text", **kwargs) -> Any:
        """
        Retrieves the content for the entire PDF.

        Args:
            using: 'text' or 'vision'
            **kwargs: Additional arguments passed to extract_text or page.to_image

        Returns:
            str: Extracted text if using='text'
            List[PIL.Image.Image]: List of page images if using='vision'
            None: If content cannot be retrieved
        """
        if using == "text":
            try:
                layout = kwargs.pop("layout", True)
                return self.extract_text(layout=layout, **kwargs)
            except Exception as e:
                logger.error(f"Error extracting text from PDF: {e}")
                return None
        elif using == "vision":
            page_images = []
            logger.info(f"Rendering {len(self.pages)} pages to images...")

            resolution = kwargs.pop("resolution", 72)
            include_highlights = kwargs.pop("include_highlights", False)
            labels = kwargs.pop("labels", False)

            try:
                for page in tqdm(self.pages, desc="Rendering Pages"):
                    img = page.to_image(
                        resolution=resolution,
                        include_highlights=include_highlights,
                        labels=labels,
                        **kwargs,
                    )
                    if img:
                        page_images.append(img)
                    else:
                        logger.warning(f"Failed to render page {page.number}, skipping.")
                if not page_images:
                    logger.error("Failed to render any pages.")
                    return None
                return page_images
            except Exception as e:
                logger.error(f"Error rendering pages: {e}")
                return None
        else:
            logger.error(f"Unsupported value for 'using': {using}")
            return None

    # --- End Extraction Support --- #

    def _gather_analysis_data(
        self,
        analysis_keys: List[str],
        include_content: bool,
        include_images: bool,
        image_dir: Optional[Path],
        image_format: str,
        image_resolution: int,
    ) -> List[Dict[str, Any]]:
        """
        Gather analysis data from all pages in the PDF.

        Args:
            analysis_keys: Keys in the analyses dictionary to export
            include_content: Whether to include extracted text
            include_images: Whether to export images
            image_dir: Directory to save images
            image_format: Format to save images
            image_resolution: Resolution for exported images

        Returns:
            List of dictionaries containing analysis data
        """
        if not hasattr(self, "_pages") or not self._pages:
            logger.warning(f"No pages found in PDF {self.path}")
            return []

        all_data = []

        for page in tqdm(self._pages, desc="Gathering page data", leave=False):
            # Basic page information
            page_data = {
                "pdf_path": self.path,
                "page_number": page.number,
                "page_index": page.index,
            }

            # Include extracted text if requested
            if include_content:
                try:
                    page_data["content"] = page.extract_text(preserve_whitespace=True)
                except Exception as e:
                    logger.error(f"Error extracting text from page {page.number}: {e}")
                    page_data["content"] = ""

            # Save image if requested
            if include_images:
                try:
                    # Create image filename
                    image_filename = f"pdf_{Path(self.path).stem}_page_{page.number}.{image_format}"
                    image_path = image_dir / image_filename

                    # Save image
                    page.save_image(
                        str(image_path), resolution=image_resolution, include_highlights=True
                    )

                    # Add relative path to data
                    page_data["image_path"] = str(Path(image_path).relative_to(image_dir.parent))
                except Exception as e:
                    logger.error(f"Error saving image for page {page.number}: {e}")
                    page_data["image_path"] = None

            # Add analyses data
            for key in analysis_keys:
                if not hasattr(page, "analyses") or not page.analyses:
                    raise ValueError(f"Page {page.number} does not have analyses data")

                if key not in page.analyses:
                    raise KeyError(f"Analysis key '{key}' not found in page {page.number}")

                # Get the analysis result
                analysis_result = page.analyses[key]

                # If the result has a to_dict method, use it
                if hasattr(analysis_result, "to_dict"):
                    analysis_data = analysis_result.to_dict()
                else:
                    # Otherwise, use the result directly if it's dict-like
                    try:
                        analysis_data = dict(analysis_result)
                    except (TypeError, ValueError):
                        # Last resort: convert to string
                        analysis_data = {"raw_result": str(analysis_result)}

                # Add analysis data to page data with the key as prefix
                for k, v in analysis_data.items():
                    page_data[f"{key}.{k}"] = v

            all_data.append(page_data)

        return all_data
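The last step of `_gather_analysis_data` flattens each analysis result into the page row, prefixing every field with its analysis key. That transform in isolation (an illustrative helper, not part of the library):

```python
def flatten_analysis(analysis_key, analysis_result):
    """Flatten one analysis result into prefixed keys, as _gather_analysis_data does."""
    if hasattr(analysis_result, "to_dict"):
        data = analysis_result.to_dict()
    else:
        try:
            data = dict(analysis_result)
        except (TypeError, ValueError):
            data = {"raw_result": str(analysis_result)}  # last resort: stringify
    return {f"{analysis_key}.{k}": v for k, v in data.items()}

row = flatten_analysis("classification", {"label": "invoice", "score": 0.97})
print(row)  # {'classification.label': 'invoice', 'classification.score': 0.97}
```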

    def _get_target_pages(
        self, pages: Optional[Union[Iterable[int], range, slice]] = None
    ) -> List["Page"]:
        """
        Helper method to get a list of Page objects based on the input pages.

        Args:
            pages: Page indices, slice, or None for all pages

        Returns:
            List of Page objects
        """
        if pages is None:
            return self._pages
        elif isinstance(pages, slice):
            return self._pages[pages]
        elif hasattr(pages, "__iter__"):
            try:
                return [self._pages[i] for i in pages]
            except IndexError:
                raise ValueError("Invalid page index provided in 'pages' iterable.")
            except TypeError:
                raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")
        else:
            raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")

    # --- Classification Mixin Implementation --- #

    def _get_classification_manager(self) -> "ClassificationManager":
        """Returns the ClassificationManager instance for this PDF."""
        try:
            return self.get_manager("classification")
        except (KeyError, RuntimeError) as e:
            raise AttributeError(f"Could not retrieve ClassificationManager: {e}") from e

    def _get_classification_content(self, model_type: str, **kwargs) -> Union[str, Image.Image]:
        """
        Provides the content for classifying the entire PDF.

        Args:
            model_type: 'text' or 'vision'.
            **kwargs: Additional arguments (e.g., for text extraction or image rendering).

        Returns:
            Extracted text (str) or the first page's image (PIL.Image).

        Raises:
            ValueError: If model_type is 'vision' and PDF has != 1 page,
                      or if model_type is unsupported, or if content cannot be generated.
        """
        if model_type == "text":
            try:
                # Extract text from the whole document
                text = self.extract_text(**kwargs)  # Pass relevant kwargs
                if not text or text.isspace():
                    raise ValueError("PDF contains no extractable text for classification.")
                return text
            except Exception as e:
                logger.error(f"Error extracting text for PDF classification: {e}")
                raise ValueError("Failed to extract text for classification.") from e

        elif model_type == "vision":
            if len(self.pages) == 1:
                # Use the single page's content method
                try:
                    return self.pages[0]._get_classification_content(model_type="vision", **kwargs)
                except Exception as e:
                    logger.error(f"Error getting image from single page for classification: {e}")
                    raise ValueError("Failed to get image from single page.") from e
            elif len(self.pages) == 0:
                raise ValueError("Cannot classify empty PDF using vision model.")
            else:
                raise ValueError(
                    f"Vision classification for a PDF object is only supported for single-page PDFs. "
                    f"This PDF has {len(self.pages)} pages. Use pdf.pages[0].classify() or pdf.classify_pages()."
                )
        else:
            raise ValueError(f"Unsupported model_type for PDF classification: {model_type}")

    # --- End Classification Mixin Implementation ---

    # ------------------------------------------------------------------
    # Unified analysis storage (maps to metadata["analysis"])
    # ------------------------------------------------------------------

    @property
    def analyses(self) -> Dict[str, Any]:
        if not hasattr(self, "metadata") or self.metadata is None:
            # For PDF, metadata property returns self._pdf.metadata which may be None
            self._pdf.metadata = self._pdf.metadata or {}
        if self.metadata is None:
            # Fallback safeguard
            self._pdf.metadata = {}
        return self.metadata.setdefault("analysis", {})  # type: ignore[attr-defined]

    @analyses.setter
    def analyses(self, value: Dict[str, Any]):
        if not hasattr(self, "metadata") or self.metadata is None:
            self._pdf.metadata = self._pdf.metadata or {}
        self.metadata["analysis"] = value  # type: ignore[attr-defined]

    # Static helper for weakref.finalize to avoid capturing 'self'
    @staticmethod
    def _finalize_cleanup(plumber_pdf, temp_file_obj, is_stream):
        try:
            if plumber_pdf is not None:
                plumber_pdf.close()
        except Exception:
            pass

        if temp_file_obj and not is_stream:
            try:
                path = temp_file_obj.name if hasattr(temp_file_obj, "name") else None
                if path and os.path.exists(path):
                    os.unlink(path)
            except Exception as e:
                logger.warning(f"Failed to clean up temporary file '{path}': {e}")
Attributes
natural_pdf.PDF.metadata property

Access PDF metadata as a dictionary.

Returns document metadata such as title, author, creation date, and other properties embedded in the PDF file. The exact keys available depend on what metadata was included when the PDF was created.

Returns:

Type Description
Dict[str, Any]

Dictionary containing PDF metadata. Common keys include 'Title', 'Author', 'Subject', 'Creator', 'Producer', 'CreationDate', and 'ModDate'. May be empty if no metadata is available.

Example
pdf = npdf.PDF("document.pdf")
print(pdf.metadata.get('Title', 'No title'))
print(f"Created: {pdf.metadata.get('CreationDate')}")
natural_pdf.PDF.pages property

Access pages as a PageCollection object.

Provides access to individual pages of the PDF document through a collection interface that supports indexing, slicing, and iteration. Pages are lazy-loaded to minimize memory usage.

Returns:

Type Description
PageCollection

PageCollection object that provides list-like access to PDF pages.

Raises:

Type Description
AttributeError

If PDF pages are not yet initialized.

Example
pdf = npdf.PDF("document.pdf")

# Access individual pages
first_page = pdf.pages[0]
last_page = pdf.pages[-1]

# Slice pages
first_three = pdf.pages[0:3]

# Iterate over pages
for page in pdf.pages:
    print(f"Page {page.index} has {len(page.chars)} characters")
Functions
natural_pdf.PDF.__enter__()

Context manager entry.

Source code in natural_pdf/core/pdf.py
def __enter__(self):
    """Context manager entry."""
    return self
natural_pdf.PDF.__exit__(exc_type, exc_val, exc_tb)

Context manager exit.

Source code in natural_pdf/core/pdf.py
def __exit__(self, exc_type, exc_val, exc_tb):
    """Context manager exit."""
    self.close()
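Together, `__enter__` and `__exit__` let a PDF be used in a `with` block so that `close()` runs automatically, even if the body raises. The protocol itself can be sketched with a stub class (`StubPDF` is illustrative, not part of the library):

```python
class StubPDF:
    """Stub mirroring PDF's context-manager protocol."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()


with StubPDF() as pdf:
    open_inside = not pdf.closed  # resource still open inside the block
```

With the real class this is simply `with npdf.PDF("document.pdf") as pdf: ...`.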
natural_pdf.PDF.__getitem__(key)

Access pages by index or slice.

Source code in natural_pdf/core/pdf.py
def __getitem__(self, key) -> Union["Page", "PageCollection"]:
    """Access pages by index or slice."""
    if not hasattr(self, "_pages"):
        raise AttributeError("PDF pages not initialized yet.")

    if isinstance(key, slice):
        from natural_pdf.elements.collections import PageCollection

        return PageCollection(self._pages[key])

    if isinstance(key, int):
        if 0 <= key < len(self._pages):
            return self._pages[key]
        else:
            raise IndexError(f"Page index {key} out of range (0-{len(self._pages)-1}).")
    else:
        raise TypeError(f"Page indices must be integers or slices, not {type(key)}.")
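The int/slice dispatch above can be sketched over a plain list. Note that, unlike plain list indexing, negative integer indices are rejected by the bounds check (`get_item` is an illustrative name; the real slice path wraps the result in a `PageCollection`):

```python
def get_item(pages, key):
    if isinstance(key, slice):
        return pages[key]  # the real method returns a PageCollection here
    if isinstance(key, int):
        if 0 <= key < len(pages):
            return pages[key]
        raise IndexError(f"Page index {key} out of range (0-{len(pages)-1}).")
    raise TypeError(f"Page indices must be integers or slices, not {type(key)}.")
```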
natural_pdf.PDF.__init__(path_or_url_or_stream, reading_order=True, font_attrs=None, keep_spaces=True, text_tolerance=None, auto_text_tolerance=True, text_layer=True)

Initialize the enhanced PDF object.

Parameters:

Name Type Description Default
path_or_url_or_stream

Path to the PDF file (str/Path), a URL (str), or a file-like object (stream). URLs must start with 'http://' or 'https://'.

required
reading_order bool

If True, use natural reading order for text extraction. Defaults to True.

True
font_attrs Optional[List[str]]

List of font attributes for grouping characters into words. Common attributes include ['fontname', 'size']. Defaults to None.

None
keep_spaces bool

If True, include spaces in word elements during text extraction. Defaults to True.

True
text_tolerance Optional[dict]

PDFplumber-style tolerance settings for text grouping. Dictionary with keys like 'x_tolerance', 'y_tolerance'. Defaults to None.

None
auto_text_tolerance bool

If True, automatically scale text tolerance based on font size and document characteristics. Defaults to True.

True
text_layer bool

If True, preserve existing text layer from the PDF. If False, removes all existing text elements during initialization, useful for OCR-only workflows. Defaults to True.

True

Raises:

Type Description
TypeError

If path_or_url_or_stream is not a valid type.

IOError

If the PDF file cannot be opened or read.

ValueError

If URL download fails.

Example
# From file path
pdf = npdf.PDF("document.pdf")

# From URL
pdf = npdf.PDF("https://example.com/document.pdf")

# From stream
with open("document.pdf", "rb") as f:
    pdf = npdf.PDF(f)

# With custom settings
pdf = npdf.PDF("document.pdf",
              reading_order=False,
              text_layer=False,  # For OCR-only processing
              font_attrs=['fontname', 'size', 'flags'])
Source code in natural_pdf/core/pdf.py
def __init__(
    self,
    path_or_url_or_stream,
    reading_order: bool = True,
    font_attrs: Optional[List[str]] = None,
    keep_spaces: bool = True,
    text_tolerance: Optional[dict] = None,
    auto_text_tolerance: bool = True,
    text_layer: bool = True,
):
    """Initialize the enhanced PDF object.

    Args:
        path_or_url_or_stream: Path to the PDF file (str/Path), a URL (str),
            or a file-like object (stream). URLs must start with 'http://' or 'https://'.
        reading_order: If True, use natural reading order for text extraction.
            Defaults to True.
        font_attrs: List of font attributes for grouping characters into words.
            Common attributes include ['fontname', 'size']. Defaults to None.
        keep_spaces: If True, include spaces in word elements during text extraction.
            Defaults to True.
        text_tolerance: PDFplumber-style tolerance settings for text grouping.
            Dictionary with keys like 'x_tolerance', 'y_tolerance'. Defaults to None.
        auto_text_tolerance: If True, automatically scale text tolerance based on
            font size and document characteristics. Defaults to True.
        text_layer: If True, preserve existing text layer from the PDF. If False,
            removes all existing text elements during initialization, useful for
            OCR-only workflows. Defaults to True.

    Raises:
        TypeError: If path_or_url_or_stream is not a valid type.
        IOError: If the PDF file cannot be opened or read.
        ValueError: If URL download fails.

    Example:
        ```python
        # From file path
        pdf = npdf.PDF("document.pdf")

        # From URL
        pdf = npdf.PDF("https://example.com/document.pdf")

        # From stream
        with open("document.pdf", "rb") as f:
            pdf = npdf.PDF(f)

        # With custom settings
        pdf = npdf.PDF("document.pdf",
                      reading_order=False,
                      text_layer=False,  # For OCR-only processing
                      font_attrs=['fontname', 'size', 'flags'])
        ```
    """
    self._original_path_or_stream = path_or_url_or_stream
    self._temp_file = None
    self._resolved_path = None
    self._is_stream = False
    self._text_layer = text_layer
    stream_to_open = None

    if hasattr(path_or_url_or_stream, "read"):  # Check if it's file-like
        logger.info("Initializing PDF from in-memory stream.")
        self._is_stream = True
        self._resolved_path = None  # No resolved file path for streams
        self.source_path = "<stream>"  # Identifier for source
        self.path = self.source_path  # Use source identifier as path for streams
        stream_to_open = path_or_url_or_stream
        try:
            if hasattr(path_or_url_or_stream, "read"):
                # If caller provided an in-memory binary stream, capture bytes for potential re-export
                current_pos = path_or_url_or_stream.tell()
                path_or_url_or_stream.seek(0)
                self._original_bytes = path_or_url_or_stream.read()
                path_or_url_or_stream.seek(current_pos)
        except Exception:
            pass
    elif isinstance(path_or_url_or_stream, (str, Path)):
        path_or_url = str(path_or_url_or_stream)
        self.source_path = path_or_url  # Store original path/URL as source
        is_url = path_or_url.startswith("http://") or path_or_url.startswith("https://")

        if is_url:
            logger.info(f"Downloading PDF from URL: {path_or_url}")
            try:
                with urllib.request.urlopen(path_or_url) as response:
                    data = response.read()
                # Load directly into an in-memory buffer — no temp file needed
                buffer = io.BytesIO(data)
                buffer.seek(0)
                self._temp_file = None  # No on-disk temp file
                self._resolved_path = path_or_url  # For repr / get_id purposes
                stream_to_open = buffer  # pdfplumber accepts file-like objects
            except Exception as e:
                logger.error(f"Failed to download PDF from URL: {e}")
                raise ValueError(f"Failed to download PDF from URL: {e}")
        else:
            self._resolved_path = str(Path(path_or_url).resolve())  # Resolve local paths
            stream_to_open = self._resolved_path
        self.path = self._resolved_path  # Use resolved path for file-based PDFs
    else:
        raise TypeError(
            f"Invalid input type: {type(path_or_url_or_stream)}. "
            f"Expected path (str/Path), URL (str), or file-like object."
        )

    logger.info(f"Opening PDF source: {self.source_path}")
    logger.debug(
        f"Parameters: reading_order={reading_order}, font_attrs={font_attrs}, keep_spaces={keep_spaces}"
    )

    try:
        self._pdf = pdfplumber.open(stream_to_open)
    except Exception as e:
        logger.error(f"Failed to open PDF: {e}", exc_info=True)
        self.close()  # Attempt cleanup if opening fails
        raise IOError(f"Failed to open PDF source: {self.source_path}") from e

    # Store configuration used for initialization
    self._reading_order = reading_order
    self._config = {"keep_spaces": keep_spaces}
    self._font_attrs = font_attrs

    self._ocr_manager = OCRManager() if OCRManager else None
    self._layout_manager = LayoutManager() if LayoutManager else None
    self.highlighter = HighlightingService(self)
    # self._classification_manager_instance = ClassificationManager() # Removed this line
    self._manager_registry = {}

    # Lazily instantiate pages only when accessed
    self._pages = _LazyPageList(
        self, self._pdf, font_attrs=font_attrs, load_text=self._text_layer
    )

    self._element_cache = {}
    self._exclusions = []
    self._regions = []

    logger.info(f"PDF '{self.source_path}' initialized with {len(self._pages)} pages.")

    self._initialize_managers()
    self._initialize_highlighter()

    # Remove text layer if requested
    if not self._text_layer:
        logger.info("Removing text layer as requested (text_layer=False)")
        # Text layer is not loaded when text_layer=False, so no need to remove
        pass

    # Analysis results accessed via self.analyses property (see below)

    # --- Automatic cleanup when object is garbage-collected ---
    self._finalizer = weakref.finalize(
        self,
        PDF._finalize_cleanup,
        self._pdf,
        getattr(self, "_temp_file", None),
        getattr(self, "_is_stream", False),
    )

    # --- Text tolerance settings ------------------------------------
    # Users can pass pdfplumber-style keys (x_tolerance, x_tolerance_ratio,
    # y_tolerance, etc.) via *text_tolerance*.  We also keep a flag that
    # enables automatic tolerance scaling when explicit values are not
    # supplied.
    self._config["auto_text_tolerance"] = bool(auto_text_tolerance)
    if text_tolerance:
        # Only copy recognised primitives (numbers / None); ignore junk.
        allowed = {
            "x_tolerance",
            "x_tolerance_ratio",
            "y_tolerance",
            "keep_blank_chars",  # passthrough convenience
        }
        for k, v in text_tolerance.items():
            if k in allowed:
                self._config[k] = v
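The tail of `__init__` copies only recognised pdfplumber-style keys from `text_tolerance` into the config, silently dropping anything else. That filtering step as a standalone sketch (`filter_tolerance` is an illustrative name, not library API):

```python
ALLOWED = {"x_tolerance", "x_tolerance_ratio", "y_tolerance", "keep_blank_chars"}


def filter_tolerance(text_tolerance):
    """Keep only recognised tolerance keys; ignore junk."""
    return {k: v for k, v in text_tolerance.items() if k in ALLOWED}
```

For example, `filter_tolerance({"x_tolerance": 2, "typo_key": 9})` keeps only the `x_tolerance` entry.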
natural_pdf.PDF.__len__()

Return the number of pages in the PDF.

Source code in natural_pdf/core/pdf.py
def __len__(self) -> int:
    """Return the number of pages in the PDF."""
    if not hasattr(self, "_pages"):
        return 0
    return len(self._pages)
natural_pdf.PDF.__repr__()

Return a string representation of the PDF object.

Source code in natural_pdf/core/pdf.py
def __repr__(self) -> str:
    """Return a string representation of the PDF object."""
    if not hasattr(self, "_pages"):
        page_count_str = "uninitialized"
    else:
        page_count_str = str(len(self._pages))

    source_info = getattr(self, "source_path", "unknown source")
    return f"<PDF source='{source_info}' pages={page_count_str}>"
natural_pdf.PDF.add_exclusion(exclusion_func, label=None)

Add an exclusion function to the PDF.

Exclusion functions define regions of each page that should be ignored during text extraction and analysis operations. This is useful for filtering out headers, footers, watermarks, or other administrative content that shouldn't be included in the main document processing.

Parameters:

Name Type Description Default
exclusion_func Callable[[Page], Optional[Region]]

A function that takes a Page object and returns a Region to exclude from processing, or None if no exclusion should be applied to that page. The function is called once per page.

required
label str

Optional descriptive label for this exclusion rule, useful for debugging and identification.

None

Returns:

Type Description
PDF

Self for method chaining.

Raises:

Type Description
AttributeError

If PDF pages are not yet initialized.

Example
pdf = npdf.PDF("document.pdf")

# Exclude headers (top 50 points of each page)
pdf.add_exclusion(
    lambda page: page.region(0, 0, page.width, 50),
    label="header_exclusion"
)

# Exclude any text containing "CONFIDENTIAL"
pdf.add_exclusion(
    lambda page: page.find('text:contains("CONFIDENTIAL")').above(include_source=True)
    if page.find('text:contains("CONFIDENTIAL")') else None,
    label="confidential_exclusion"
)

# Chain multiple exclusions
pdf.add_exclusion(header_func).add_exclusion(footer_func)
Source code in natural_pdf/core/pdf.py
def add_exclusion(
    self, exclusion_func: Callable[["Page"], Optional["Region"]], label: str = None
) -> "PDF":
    """Add an exclusion function to the PDF.

    Exclusion functions define regions of each page that should be ignored during
    text extraction and analysis operations. This is useful for filtering out headers,
    footers, watermarks, or other administrative content that shouldn't be included
    in the main document processing.

    Args:
        exclusion_func: A function that takes a Page object and returns a Region
            to exclude from processing, or None if no exclusion should be applied
            to that page. The function is called once per page.
        label: Optional descriptive label for this exclusion rule, useful for
            debugging and identification.

    Returns:
        Self for method chaining.

    Raises:
        AttributeError: If PDF pages are not yet initialized.

    Example:
        ```python
        pdf = npdf.PDF("document.pdf")

        # Exclude headers (top 50 points of each page)
        pdf.add_exclusion(
            lambda page: page.region(0, 0, page.width, 50),
            label="header_exclusion"
        )

        # Exclude any text containing "CONFIDENTIAL"
        pdf.add_exclusion(
            lambda page: page.find('text:contains("CONFIDENTIAL")').above(include_source=True)
            if page.find('text:contains("CONFIDENTIAL")') else None,
            label="confidential_exclusion"
        )

        # Chain multiple exclusions
        pdf.add_exclusion(header_func).add_exclusion(footer_func)
        ```
    """
    if not hasattr(self, "_pages"):
        raise AttributeError("PDF pages not yet initialized.")

    exclusion_data = (exclusion_func, label)
    self._exclusions.append(exclusion_data)

    for page in self._pages:
        page.add_exclusion(exclusion_func, label=label)

    return self
natural_pdf.PDF.add_region(region_func, name=None)

Add a region function to the PDF.

Parameters:

Name Type Description Default
region_func Callable[[Page], Optional[Region]]

A function that takes a Page and returns a Region, or None

required
name str

Optional name for the region

None

Returns:

Type Description
PDF

Self for method chaining

Source code in natural_pdf/core/pdf.py
def add_region(
    self, region_func: Callable[["Page"], Optional["Region"]], name: str = None
) -> "PDF":
    """
    Add a region function to the PDF.

    Args:
        region_func: A function that takes a Page and returns a Region, or None
        name: Optional name for the region

    Returns:
        Self for method chaining
    """
    if not hasattr(self, "_pages"):
        raise AttributeError("PDF pages not yet initialized.")

    region_data = (region_func, name)
    self._regions.append(region_data)

    for page in self._pages:
        try:
            region_instance = region_func(page)
            if region_instance and isinstance(region_instance, Region):
                page.add_region(region_instance, name=name, source="named")
            elif region_instance is not None:
                logger.warning(
                    f"Region function did not return a valid Region for page {page.number}"
                )
        except Exception as e:
            logger.error(f"Error adding region for page {page.number}: {e}")

    return self
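The region function `add_region` expects is a callable that takes a Page and returns a Region, or None to skip that page. A stub-based sketch of that contract (`FakePage` and `FakeRegion` are illustrative stand-ins, not library classes):

```python
class FakeRegion:
    pass


class FakePage:
    width, height = 612, 792  # US Letter points, for illustration


def top_half(page):
    # Return a region for every page; returning None would skip the page.
    return FakeRegion()


regions = [top_half(p) for p in (FakePage(), FakePage())]
```

With the real API this might look like `pdf.add_region(lambda page: page.region(0, 0, page.width, page.height / 2), name="top_half")`, assuming the `page.region` signature shown in the `add_exclusion` example above.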
natural_pdf.PDF.apply_ocr(engine=None, languages=None, min_confidence=None, device=None, resolution=None, apply_exclusions=True, detect_only=False, replace=True, options=None, pages=None)

Apply OCR to specified pages of the PDF using batch processing.

Performs optical character recognition on the specified pages, converting image-based text into searchable and extractable text elements. This method supports multiple OCR engines and provides batch processing for efficiency.

Parameters:

Name Type Description Default
engine Optional[str]

Name of the OCR engine to use. Supported engines include 'easyocr' (default), 'surya', 'paddle', and 'doctr'. If None, uses the global default from natural_pdf.options.ocr.engine.

None
languages Optional[List[str]]

List of language codes for OCR recognition (e.g., ['en', 'es']). If None, uses the global default from natural_pdf.options.ocr.languages.

None
min_confidence Optional[float]

Minimum confidence threshold (0.0-1.0) for accepting OCR results. Text with lower confidence will be filtered out. If None, uses the global default.

None
device Optional[str]

Device to run OCR on ('cpu', 'cuda', 'mps'). Engine-specific availability varies. If None, uses engine defaults.

None
resolution Optional[int]

DPI resolution for rendering pages to images before OCR. Higher values improve accuracy but increase processing time and memory. Typical values: 150 (fast), 300 (balanced), 600 (high quality).

None
apply_exclusions bool

If True, mask excluded regions before OCR to prevent processing of headers, footers, or other unwanted content.

True
detect_only bool

If True, only detect text bounding boxes without performing character recognition. Useful for layout analysis workflows.

False
replace bool

If True, replace any existing OCR elements on the pages. If False, append new OCR results to existing elements.

True
options Optional[Any]

Engine-specific options object (e.g., EasyOCROptions, SuryaOptions). Allows fine-tuning of engine behavior beyond common parameters.

None
pages Optional[Union[Iterable[int], range, slice]]

Page indices to process. Can be:

- None: Process all pages
- slice: Process a range of pages (e.g., slice(0, 10))
- Iterable[int]: Process specific page indices (e.g., [0, 2, 5])

None

Returns:

Type Description
PDF

Self for method chaining.

Raises:

Type Description
ValueError

If invalid page index is provided.

TypeError

If pages parameter has invalid type.

RuntimeError

If OCR engine is not available or fails.

Example
pdf = npdf.PDF("scanned_document.pdf")

# Basic OCR on all pages
pdf.apply_ocr()

# High-quality OCR with specific settings
pdf.apply_ocr(
    engine='easyocr',
    languages=['en', 'es'],
    resolution=300,
    min_confidence=0.8
)

# OCR specific pages only
pdf.apply_ocr(pages=[0, 1, 2])  # First 3 pages
pdf.apply_ocr(pages=slice(5, 10))  # Pages 5-9

# Detection-only workflow for layout analysis
pdf.apply_ocr(detect_only=True, resolution=150)
Note

OCR processing can be time and memory intensive, especially at high resolutions. Consider using exclusions to mask unwanted regions and processing pages in batches for large documents.
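The accepted forms of the `pages` argument mirror the page-normalization logic used inside this class. A minimal standalone sketch of that dispatch, using a plain list in place of the internal page list (`resolve_pages` is an illustrative name, not part of the API):

```python
def resolve_pages(all_pages, pages=None):
    """Normalize a pages argument (None, slice, or iterable of indices)."""
    if pages is None:
        return all_pages  # all pages
    if isinstance(pages, slice):
        return all_pages[pages]  # contiguous range, e.g. slice(5, 10)
    if hasattr(pages, "__iter__"):
        return [all_pages[i] for i in pages]  # explicit indices, e.g. [0, 2, 5]
    raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")
```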

Source code in natural_pdf/core/pdf.py
def apply_ocr(
    self,
    engine: Optional[str] = None,
    languages: Optional[List[str]] = None,
    min_confidence: Optional[float] = None,
    device: Optional[str] = None,
    resolution: Optional[int] = None,
    apply_exclusions: bool = True,
    detect_only: bool = False,
    replace: bool = True,
    options: Optional[Any] = None,
    pages: Optional[Union[Iterable[int], range, slice]] = None,
) -> "PDF":
    """Apply OCR to specified pages of the PDF using batch processing.

    Performs optical character recognition on the specified pages, converting
    image-based text into searchable and extractable text elements. This method
    supports multiple OCR engines and provides batch processing for efficiency.

    Args:
        engine: Name of the OCR engine to use. Supported engines include
            'easyocr' (default), 'surya', 'paddle', and 'doctr'. If None,
            uses the global default from natural_pdf.options.ocr.engine.
        languages: List of language codes for OCR recognition (e.g., ['en', 'es']).
            If None, uses the global default from natural_pdf.options.ocr.languages.
        min_confidence: Minimum confidence threshold (0.0-1.0) for accepting
            OCR results. Text with lower confidence will be filtered out.
            If None, uses the global default.
        device: Device to run OCR on ('cpu', 'cuda', 'mps'). Engine-specific
            availability varies. If None, uses engine defaults.
        resolution: DPI resolution for rendering pages to images before OCR.
            Higher values improve accuracy but increase processing time and memory.
            Typical values: 150 (fast), 300 (balanced), 600 (high quality).
        apply_exclusions: If True, mask excluded regions before OCR to prevent
            processing of headers, footers, or other unwanted content.
        detect_only: If True, only detect text bounding boxes without performing
            character recognition. Useful for layout analysis workflows.
        replace: If True, replace any existing OCR elements on the pages.
            If False, append new OCR results to existing elements.
        options: Engine-specific options object (e.g., EasyOCROptions, SuryaOptions).
            Allows fine-tuning of engine behavior beyond common parameters.
        pages: Page indices to process. Can be:
            - None: Process all pages
            - slice: Process a range of pages (e.g., slice(0, 10))
            - Iterable[int]: Process specific page indices (e.g., [0, 2, 5])

    Returns:
        Self for method chaining.

    Raises:
        ValueError: If invalid page index is provided.
        TypeError: If pages parameter has invalid type.
        RuntimeError: If OCR engine is not available or fails.

    Example:
        ```python
        pdf = npdf.PDF("scanned_document.pdf")

        # Basic OCR on all pages
        pdf.apply_ocr()

        # High-quality OCR with specific settings
        pdf.apply_ocr(
            engine='easyocr',
            languages=['en', 'es'],
            resolution=300,
            min_confidence=0.8
        )

        # OCR specific pages only
        pdf.apply_ocr(pages=[0, 1, 2])  # First 3 pages
        pdf.apply_ocr(pages=slice(5, 10))  # Pages 5-9

        # Detection-only workflow for layout analysis
        pdf.apply_ocr(detect_only=True, resolution=150)
        ```

    Note:
        OCR processing can be time and memory intensive, especially at high
        resolutions. Consider using exclusions to mask unwanted regions and
        processing pages in batches for large documents.
    """
    if not self._ocr_manager:
        logger.error("OCRManager not available. Cannot apply OCR.")
        return self

    # Apply global options as defaults, but allow explicit parameters to override
    import natural_pdf

    # Use global OCR options if parameters are not explicitly set
    if engine is None:
        engine = natural_pdf.options.ocr.engine
    if languages is None:
        languages = natural_pdf.options.ocr.languages
    if min_confidence is None:
        min_confidence = natural_pdf.options.ocr.min_confidence
    if device is None:
        pass  # No default device in options.ocr anymore

    thread_id = threading.current_thread().name
    logger.debug(f"[{thread_id}] PDF.apply_ocr starting for {self.path}")

    target_pages = []
    if pages is None:
        target_pages = self._pages
    elif isinstance(pages, slice):
        target_pages = self._pages[pages]
    elif hasattr(pages, "__iter__"):
        try:
            target_pages = [self._pages[i] for i in pages]
        except IndexError:
            raise ValueError("Invalid page index provided in 'pages' iterable.")
        except TypeError:
            raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")
    else:
        raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")

    if not target_pages:
        logger.warning("No pages selected for OCR processing.")
        return self

    page_numbers = [p.number for p in target_pages]
    logger.info(f"Applying batch OCR to pages: {page_numbers}...")

    final_resolution = resolution or getattr(self, "_config", {}).get("resolution", 150)
    logger.debug(f"Using OCR image resolution: {final_resolution} DPI")

    images_pil = []
    page_image_map = []
    logger.info(f"[{thread_id}] Rendering {len(target_pages)} pages...")
    failed_page_num = "unknown"
    render_start_time = time.monotonic()

    try:
        for i, page in enumerate(tqdm(target_pages, desc="Rendering pages", leave=False)):
            failed_page_num = page.number
            logger.debug(f"  Rendering page {page.number} (index {page.index})...")
            to_image_kwargs = {
                "resolution": final_resolution,
                "include_highlights": False,
                "exclusions": "mask" if apply_exclusions else None,
            }
            img = page.to_image(**to_image_kwargs)
            if img is None:
                logger.error(f"  Failed to render page {page.number} to image.")
                continue
            images_pil.append(img)
            page_image_map.append((page, img))
    except Exception as e:
        logger.error(f"Failed to render pages for batch OCR: {e}")
        raise RuntimeError(f"Failed to render page {failed_page_num} for OCR.") from e

    render_end_time = time.monotonic()
    logger.debug(
        f"[{thread_id}] Finished rendering {len(images_pil)} images (Duration: {render_end_time - render_start_time:.2f}s)"
    )

    if not images_pil or not page_image_map:
        logger.error("No images were successfully rendered for batch OCR.")
        return self

    manager_args = {
        "images": images_pil,
        "engine": engine,
        "languages": languages,
        "min_confidence": min_confidence,
        "device": device,
        "options": options,
        "detect_only": detect_only,
    }
    manager_args = {k: v for k, v in manager_args.items() if v is not None}

    ocr_call_args = {k: v for k, v in manager_args.items() if k != "images"}
    logger.info(f"[{thread_id}] Calling OCR Manager with args: {ocr_call_args}...")
    ocr_start_time = time.monotonic()

    batch_results = self._ocr_manager.apply_ocr(**manager_args)

    if not isinstance(batch_results, list) or len(batch_results) != len(images_pil):
        logger.error("OCR Manager returned unexpected result format or length.")
        return self

    logger.info("OCR Manager batch processing complete.")

    ocr_end_time = time.monotonic()
    logger.debug(
        f"[{thread_id}] OCR processing finished (Duration: {ocr_end_time - ocr_start_time:.2f}s)"
    )

    logger.info("Adding OCR results to respective pages...")
    total_elements_added = 0

    for i, (page, img) in enumerate(page_image_map):
        results_for_page = batch_results[i]
        if not isinstance(results_for_page, list):
            logger.warning(
                f"Skipping results for page {page.number}: Expected list, got {type(results_for_page)}"
            )
            continue

        logger.debug(f"  Processing {len(results_for_page)} results for page {page.number}...")
        try:
            if manager_args.get("replace", True) and hasattr(page, "_element_mgr"):
                page._element_mgr.remove_ocr_elements()

            img_scale_x = page.width / img.width if img.width > 0 else 1
            img_scale_y = page.height / img.height if img.height > 0 else 1
            elements = page._element_mgr.create_text_elements_from_ocr(
                results_for_page, img_scale_x, img_scale_y
            )

            if elements:
                total_elements_added += len(elements)
                logger.debug(f"  Added {len(elements)} OCR TextElements to page {page.number}.")
            else:
                logger.debug(f"  No valid TextElements created for page {page.number}.")
        except Exception as e:
            logger.error(f"  Error adding OCR elements to page {page.number}: {e}")

    logger.info(f"Finished adding OCR results. Total elements added: {total_elements_added}")
    return self
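The `pages` argument above is resolved to a list of page objects before any rendering happens: `None` means all pages, a slice selects a range, and any other iterable is treated as page indices. A minimal standalone sketch of that resolution logic (plain Python, no natural-pdf required):

```python
def resolve_pages(pages, all_pages):
    """Mirror apply_ocr's page selection: None, a slice, or an iterable of indices."""
    if pages is None:
        return list(all_pages)
    if isinstance(pages, slice):
        return all_pages[pages]
    if hasattr(pages, "__iter__"):
        try:
            return [all_pages[i] for i in pages]
        except IndexError:
            raise ValueError("Invalid page index provided in 'pages' iterable.")
        except TypeError:
            raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")
    raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")

pages = ["p0", "p1", "p2", "p3"]
print(resolve_pages(None, pages))         # all four pages
print(resolve_pages(slice(1, 3), pages))  # ['p1', 'p2']
print(resolve_pages([0, 3], pages))       # ['p0', 'p3']
```

The same three-way dispatch appears in `classify_pages` and `correct_ocr`, so a bad `pages` argument fails the same way everywhere.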
natural_pdf.PDF.ask(question, mode='extractive', pages=None, min_confidence=0.1, model=None, **kwargs)

Ask a single question about the document content.

Parameters:

Name Type Description Default
question str

Question string to ask about the document

required
mode str

"extractive" to extract answer from document, "generative" to generate

'extractive'
pages Union[int, List[int], range]

Specific pages to query (default: all pages)

None
min_confidence float

Minimum confidence threshold for answers

0.1
model str

Optional model name for question answering

None
**kwargs

Additional parameters passed to the QA engine

{}

Returns:

Type Description
Dict[str, Any]

Dict containing: answer, confidence, found, page_num, source_elements, etc.

Source code in natural_pdf/core/pdf.py
def ask(
    self,
    question: str,
    mode: str = "extractive",
    pages: Union[int, List[int], range] = None,
    min_confidence: float = 0.1,
    model: str = None,
    **kwargs,
) -> Dict[str, Any]:
    """
    Ask a single question about the document content.

    Args:
        question: Question string to ask about the document
        mode: "extractive" to extract answer from document, "generative" to generate
        pages: Specific pages to query (default: all pages)
        min_confidence: Minimum confidence threshold for answers
        model: Optional model name for question answering
        **kwargs: Additional parameters passed to the QA engine

    Returns:
        Dict containing: answer, confidence, found, page_num, source_elements, etc.
    """
    # Delegate to ask_batch and return the first result
    results = self.ask_batch([question], mode=mode, pages=pages, min_confidence=min_confidence, model=model, **kwargs)
    return results[0] if results else {
        "answer": None,
        "confidence": 0.0,
        "found": False,
        "page_num": None,
        "source_elements": [],
    }
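As the listing shows, `ask` simply delegates to `ask_batch` with a single-question list and takes the first result, falling back to an "empty answer" dict when the batch comes back empty. A standalone sketch of that fallback (the dict keys match the return shape above):

```python
EMPTY_ANSWER = {
    "answer": None, "confidence": 0.0, "found": False,
    "page_num": None, "source_elements": [],
}

def first_result_or_empty(batch_results):
    """Return the first batch result, or a fresh copy of the empty-answer dict."""
    return batch_results[0] if batch_results else dict(EMPTY_ANSWER)

print(first_result_or_empty([])["found"])                      # False
print(first_result_or_empty([{"answer": "42"}])["answer"])     # 42
```

Because the fallback is always a well-formed dict, callers can check `result["found"]` without guarding against `None`.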
natural_pdf.PDF.ask_batch(questions, mode='extractive', pages=None, min_confidence=0.1, model=None, **kwargs)

Ask multiple questions about the document content using batch processing.

This method processes multiple questions efficiently in a single batch, avoiding the multiprocessing resource accumulation that can occur with sequential individual question calls.

Parameters:

Name Type Description Default
questions List[str]

List of question strings to ask about the document

required
mode str

"extractive" to extract answer from document, "generative" to generate

'extractive'
pages Union[int, List[int], range]

Specific pages to query (default: all pages)

None
min_confidence float

Minimum confidence threshold for answers

0.1
model str

Optional model name for question answering

None
**kwargs

Additional parameters passed to the QA engine

{}

Returns:

Type Description
List[Dict[str, Any]]

List of Dicts, each containing: answer, confidence, found, page_num, source_elements, etc.

Source code in natural_pdf/core/pdf.py
def ask_batch(
    self,
    questions: List[str],
    mode: str = "extractive",
    pages: Union[int, List[int], range] = None,
    min_confidence: float = 0.1,
    model: str = None,
    **kwargs,
) -> List[Dict[str, Any]]:
    """
    Ask multiple questions about the document content using batch processing.

    This method processes multiple questions efficiently in a single batch,
    avoiding the multiprocessing resource accumulation that can occur with
    sequential individual question calls.

    Args:
        questions: List of question strings to ask about the document
        mode: "extractive" to extract answer from document, "generative" to generate
        pages: Specific pages to query (default: all pages)
        min_confidence: Minimum confidence threshold for answers
        model: Optional model name for question answering
        **kwargs: Additional parameters passed to the QA engine

    Returns:
        List of Dicts, each containing: answer, confidence, found, page_num, source_elements, etc.
    """
    from natural_pdf.qa import get_qa_engine

    if not questions:
        return []

    if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
        raise TypeError("'questions' must be a list of strings")

    qa_engine = get_qa_engine() if model is None else get_qa_engine(model_name=model)

    # Resolve target pages
    if pages is None:
        target_pages = self.pages
    elif isinstance(pages, int):
        if 0 <= pages < len(self.pages):
            target_pages = [self.pages[pages]]
        else:
            raise IndexError(f"Page index {pages} out of range (0-{len(self.pages)-1})")
    elif isinstance(pages, (list, range)):
        target_pages = []
        for page_idx in pages:
            if 0 <= page_idx < len(self.pages):
                target_pages.append(self.pages[page_idx])
            else:
                logger.warning(f"Page index {page_idx} out of range, skipping")
    else:
        raise ValueError(f"Invalid pages parameter: {pages}")

    if not target_pages:
        logger.warning("No valid pages found for QA processing.")
        return [
            {
                "answer": None,
                "confidence": 0.0,
                "found": False,
                "page_num": None,
                "source_elements": [],
            }
            for _ in questions
        ]

    logger.info(f"Processing {len(questions)} question(s) across {len(target_pages)} page(s) using batch QA...")

    # Collect all page images and metadata for batch processing
    page_images = []
    page_word_boxes = []
    page_metadata = []

    for page in target_pages:
        # Get page image
        try:
            page_image = page.to_image(resolution=150, include_highlights=False)
            if page_image is None:
                logger.warning(f"Failed to render image for page {page.number}, skipping")
                continue

            # Get text elements for word boxes
            elements = page.find_all("text")
            if not elements:
                logger.warning(f"No text elements found on page {page.number}")
                word_boxes = []
            else:
                word_boxes = qa_engine._get_word_boxes_from_elements(elements, offset_x=0, offset_y=0)

            page_images.append(page_image)
            page_word_boxes.append(word_boxes)
            page_metadata.append({
                "page_number": page.number,
                "page_object": page
            })

        except Exception as e:
            logger.warning(f"Error processing page {page.number}: {e}")
            continue

    if not page_images:
        logger.warning("No page images could be processed for QA.")
        return [
            {
                "answer": None,
                "confidence": 0.0,
                "found": False,
                "page_num": None,
                "source_elements": [],
            }
            for _ in questions
        ]

    # Process all questions against all pages in batch
    all_results = []

    for question_text in questions:
        question_results = []

        # Ask this question against each page (but in batch per page)
        for i, (page_image, word_boxes, page_meta) in enumerate(zip(page_images, page_word_boxes, page_metadata)):
            try:
                # Use the DocumentQA batch interface 
                page_result = qa_engine.ask(
                    image=page_image,
                    question=question_text,
                    word_boxes=word_boxes,
                    min_confidence=min_confidence,
                    **kwargs
                )

                if page_result and page_result.found:
                    # Add page metadata to result
                    page_result_dict = {
                        "answer": page_result.answer,
                        "confidence": page_result.confidence,
                        "found": page_result.found,
                        "page_num": page_meta["page_number"],
                        "source_elements": getattr(page_result, 'source_elements', []),
                        "start": getattr(page_result, 'start', -1),
                        "end": getattr(page_result, 'end', -1),
                    }
                    question_results.append(page_result_dict)

            except Exception as e:
                logger.warning(f"Error processing question '{question_text}' on page {page_meta['page_number']}: {e}")
                continue

        # Sort results by confidence and take the best one for this question
        question_results.sort(key=lambda x: x.get("confidence", 0), reverse=True)

        if question_results:
            all_results.append(question_results[0])
        else:
            # No results found for this question
            all_results.append({
                "answer": None,
                "confidence": 0.0,
                "found": False,
                "page_num": None,
                "source_elements": [],
            })

    return all_results
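For each question, `ask_batch` collects one candidate result per page, sorts them by confidence, and keeps only the best hit (or an empty-answer dict if no page produced one). That per-question reduction can be sketched on its own with plain dicts:

```python
def best_answer(page_results):
    """Mirror ask_batch's reduction: highest-confidence hit wins,
    otherwise an empty-answer dict."""
    ranked = sorted(page_results, key=lambda r: r.get("confidence", 0), reverse=True)
    if ranked:
        return ranked[0]
    return {"answer": None, "confidence": 0.0, "found": False,
            "page_num": None, "source_elements": []}

# Hypothetical per-page hits for one question:
hits = [
    {"answer": "Acme Corp", "confidence": 0.42, "found": True, "page_num": 3},
    {"answer": "Acme Co.", "confidence": 0.91, "found": True, "page_num": 1},
]
print(best_answer(hits)["page_num"])  # 1
print(best_answer([])["found"])       # False
```

Note that only the single best page survives per question; if you need every page's answer, call the QA engine per page yourself.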
natural_pdf.PDF.classify_pages(labels, model=None, pages=None, analysis_key='classification', using=None, **kwargs)

Classifies specified pages of the PDF.

Parameters:

Name Type Description Default
labels List[str]

List of category names

required
model Optional[str]

Model identifier ('text', 'vision', or specific HF ID)

None
pages Optional[Union[Iterable[int], range, slice]]

Page indices, slice, or None for all pages

None
analysis_key str

Key to store results in page's analyses dict

'classification'
using Optional[str]

Processing mode ('text' or 'vision')

None
**kwargs

Additional arguments for the ClassificationManager

{}

Returns:

Type Description
PDF

Self for method chaining

Source code in natural_pdf/core/pdf.py
def classify_pages(
    self,
    labels: List[str],
    model: Optional[str] = None,
    pages: Optional[Union[Iterable[int], range, slice]] = None,
    analysis_key: str = "classification",
    using: Optional[str] = None,
    **kwargs,
) -> "PDF":
    """
    Classifies specified pages of the PDF.

    Args:
        labels: List of category names
        model: Model identifier ('text', 'vision', or specific HF ID)
        pages: Page indices, slice, or None for all pages
        analysis_key: Key to store results in page's analyses dict
        using: Processing mode ('text' or 'vision')
        **kwargs: Additional arguments for the ClassificationManager

    Returns:
        Self for method chaining
    """
    if not labels:
        raise ValueError("Labels list cannot be empty.")

    try:
        manager = self.get_manager("classification")
    except (ValueError, RuntimeError) as e:
        raise ClassificationError(f"Cannot get ClassificationManager: {e}") from e

    if not manager or not manager.is_available():
        from natural_pdf.classification.manager import is_classification_available

        if not is_classification_available():
            raise ImportError(
                "Classification dependencies missing. "
                'Install with: pip install "natural-pdf[ai]"'
            )
        raise ClassificationError("ClassificationManager not available.")

    target_pages = []
    if pages is None:
        target_pages = self._pages
    elif isinstance(pages, slice):
        target_pages = self._pages[pages]
    elif hasattr(pages, "__iter__"):
        try:
            target_pages = [self._pages[i] for i in pages]
        except IndexError:
            raise ValueError("Invalid page index provided.")
        except TypeError:
            raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")
    else:
        raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")

    if not target_pages:
        logger.warning("No pages selected for classification.")
        return self

    inferred_using = manager.infer_using(model if model else manager.DEFAULT_TEXT_MODEL, using)
    logger.info(
        f"Classifying {len(target_pages)} pages using model '{model or '(default)'}' (mode: {inferred_using})"
    )

    page_contents = []
    pages_to_classify = []
    logger.debug(f"Gathering content for {len(target_pages)} pages...")

    for page in target_pages:
        try:
            content = page._get_classification_content(model_type=inferred_using, **kwargs)
            page_contents.append(content)
            pages_to_classify.append(page)
        except ValueError as e:
            logger.warning(f"Skipping page {page.number}: Cannot get content - {e}")
        except Exception as e:
            logger.warning(f"Skipping page {page.number}: Error getting content - {e}")

    if not page_contents:
        logger.warning("No content could be gathered for batch classification.")
        return self

    logger.debug(f"Gathered content for {len(pages_to_classify)} pages.")

    try:
        batch_results = manager.classify_batch(
            item_contents=page_contents,
            labels=labels,
            model_id=model,
            using=inferred_using,
            **kwargs,
        )
    except Exception as e:
        logger.error(f"Batch classification failed: {e}")
        raise ClassificationError(f"Batch classification failed: {e}") from e

    if len(batch_results) != len(pages_to_classify):
        logger.error(
            f"Mismatch between number of results ({len(batch_results)}) and pages ({len(pages_to_classify)})"
        )
        return self

    logger.debug(
        f"Distributing {len(batch_results)} results to pages under key '{analysis_key}'..."
    )
    for page, result_obj in zip(pages_to_classify, batch_results):
        try:
            if not hasattr(page, "analyses") or page.analyses is None:
                page.analyses = {}
            page.analyses[analysis_key] = result_obj
        except Exception as e:
            logger.warning(
                f"Failed to store classification results for page {page.number}: {e}"
            )

    logger.info("Finished classifying PDF pages.")
    return self
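The result-distribution step at the end of `classify_pages` pairs each successfully classified page with its result via `zip` and stores it under `analysis_key` in the page's `analyses` dict. A minimal stub-based sketch of that step (the `StubPage` class here is hypothetical; only the `analyses` attribute matters):

```python
class StubPage:
    """Hypothetical stand-in for a Page: only `number` and `analyses` matter here."""
    def __init__(self, number):
        self.number = number
        self.analyses = None

def distribute_results(pages, results, analysis_key="classification"):
    """Store one result per page under analysis_key, mirroring classify_pages."""
    if len(results) != len(pages):
        raise ValueError("Mismatch between number of results and pages")
    for page, result in zip(pages, results):
        if page.analyses is None:
            page.analyses = {}
        page.analyses[analysis_key] = result

pages = [StubPage(1), StubPage(2)]
distribute_results(pages, [{"label": "invoice"}, {"label": "report"}])
print(pages[0].analyses["classification"]["label"])  # invoice
```

Using a custom `analysis_key` lets you run several classification passes (say, by document type and by language) on the same pages without overwriting earlier results.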
natural_pdf.PDF.clear_exclusions()

Clear all exclusion functions from the PDF.

Removes all previously added exclusion functions that were used to filter out unwanted content (like headers, footers, or administrative text) from text extraction and analysis operations.

Returns:

Type Description
PDF

Self for method chaining.

Raises:

Type Description
AttributeError

If PDF pages are not yet initialized.

Example
pdf = npdf.PDF("document.pdf")
pdf.add_exclusion(lambda page: page.find('text:contains("CONFIDENTIAL")').above())

# Later, remove all exclusions
pdf.clear_exclusions()
Source code in natural_pdf/core/pdf.py
def clear_exclusions(self) -> "PDF":
    """Clear all exclusion functions from the PDF.

    Removes all previously added exclusion functions that were used to filter
    out unwanted content (like headers, footers, or administrative text) from
    text extraction and analysis operations.

    Returns:
        Self for method chaining.

    Raises:
        AttributeError: If PDF pages are not yet initialized.

    Example:
        ```python
        pdf = npdf.PDF("document.pdf")
        pdf.add_exclusion(lambda page: page.find('text:contains("CONFIDENTIAL")').above())

        # Later, remove all exclusions
        pdf.clear_exclusions()
        ```
    """
    if not hasattr(self, "_pages"):
        raise AttributeError("PDF pages not yet initialized.")

    self._exclusions = []
    for page in self._pages:
        page.clear_exclusions()
    return self
natural_pdf.PDF.close()

Close the underlying PDF file and clean up any temporary files.

Source code in natural_pdf/core/pdf.py
def close(self):
    """Close the underlying PDF file and clean up any temporary files."""
    if hasattr(self, "_pdf") and self._pdf is not None:
        try:
            self._pdf.close()
            logger.debug(f"Closed pdfplumber PDF object for {self.source_path}")
        except Exception as e:
            logger.warning(f"Error closing pdfplumber object: {e}")
        finally:
            self._pdf = None

    if hasattr(self, "_temp_file") and self._temp_file is not None:
        temp_file_path = None
        try:
            if hasattr(self._temp_file, "name") and self._temp_file.name:
                temp_file_path = self._temp_file.name
                # Only unlink if it exists and _is_stream is False (meaning WE created it)
                if not self._is_stream and os.path.exists(temp_file_path):
                    os.unlink(temp_file_path)
                    logger.debug(f"Removed temporary PDF file: {temp_file_path}")
        except Exception as e:
            logger.warning(f"Failed to clean up temporary file '{temp_file_path}': {e}")

    # Cancels the weakref finalizer so we don't double-clean
    if hasattr(self, "_finalizer") and self._finalizer.alive:
        self._finalizer()
natural_pdf.PDF.correct_ocr(correction_callback, pages=None, max_workers=None, progress_callback=None)

Applies corrections to OCR text elements using a callback function.

Parameters:

Name Type Description Default
correction_callback Callable[[Any], Optional[str]]

Function that takes an element and returns corrected text or None

required
pages Optional[Union[Iterable[int], range, slice]]

Optional page indices/slice to limit the scope of correction

None
max_workers Optional[int]

Maximum number of threads to use for parallel execution

None
progress_callback Optional[Callable[[], None]]

Optional callback function for progress updates

None

Returns:

Type Description
PDF

Self for method chaining


Source code in natural_pdf/core/pdf.py
def correct_ocr(
    self,
    correction_callback: Callable[[Any], Optional[str]],
    pages: Optional[Union[Iterable[int], range, slice]] = None,
    max_workers: Optional[int] = None,
    progress_callback: Optional[Callable[[], None]] = None,
) -> "PDF":
    """
    Applies corrections to OCR text elements using a callback function.

    Args:
        correction_callback: Function that takes an element and returns corrected text or None
        pages: Optional page indices/slice to limit the scope of correction
        max_workers: Maximum number of threads to use for parallel execution
        progress_callback: Optional callback function for progress updates

    Returns:
        Self for method chaining
    """
    target_page_indices = []
    if pages is None:
        target_page_indices = list(range(len(self._pages)))
    elif isinstance(pages, slice):
        target_page_indices = list(range(*pages.indices(len(self._pages))))
    elif hasattr(pages, "__iter__"):
        try:
            target_page_indices = [int(i) for i in pages]
            for idx in target_page_indices:
                if not (0 <= idx < len(self._pages)):
                    raise IndexError(f"Page index {idx} out of range (0-{len(self._pages)-1}).")
        except (IndexError, TypeError, ValueError) as e:
            raise ValueError(f"Invalid page index in 'pages': {pages}. Error: {e}") from e
    else:
        raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")

    if not target_page_indices:
        logger.warning("No pages selected for OCR correction.")
        return self

    logger.info(f"Starting OCR correction for pages: {target_page_indices}")

    for page_idx in target_page_indices:
        page = self._pages[page_idx]
        try:
            page.correct_ocr(
                correction_callback=correction_callback,
                max_workers=max_workers,
                progress_callback=progress_callback,
            )
        except Exception as e:
            logger.error(f"Error during correct_ocr on page {page_idx}: {e}")

    logger.info("OCR correction process finished.")
    return self
natural_pdf.PDF.deskew(pages=None, resolution=300, angle=None, detection_resolution=72, force_overwrite=False, **deskew_kwargs)

Creates a new, in-memory PDF object containing deskewed versions of the specified pages from the original PDF.

This method renders each selected page, detects and corrects skew using the 'deskew' library, and then combines the resulting images into a new PDF using 'img2pdf'. The new PDF object is returned directly.

Important: The returned PDF is image-based. Any existing text, OCR results, annotations, or other elements from the original pages will not be carried over.

Parameters:

Name Type Description Default
pages Optional[Union[Iterable[int], range, slice]]

Page indices/slice to include (0-based). If None, processes all pages.

None
resolution int

DPI resolution for rendering the output deskewed pages.

300
angle Optional[float]

The specific angle (in degrees) to rotate by. If None, detects automatically.

None
detection_resolution int

DPI resolution used for skew detection if angles are not already cached on the page objects.

72
force_overwrite bool

If False (default), raises a ValueError if any target page already contains processed elements (text, OCR, regions) to prevent accidental data loss. Set to True to proceed anyway.

False
**deskew_kwargs

Additional keyword arguments passed to deskew.determine_skew during automatic detection (e.g., max_angle, num_peaks).

{}

Returns:

Type Description
PDF

A new PDF object representing the deskewed document.

Raises:

Type Description
ImportError

If 'deskew' or 'img2pdf' libraries are not installed.

ValueError

If force_overwrite is False and target pages contain elements.

FileNotFoundError

If the source PDF cannot be read (if file-based).

IOError

If creating the in-memory PDF fails.

RuntimeError

If rendering or deskewing individual pages fails.

Source code in natural_pdf/core/pdf.py
def deskew(
    self,
    pages: Optional[Union[Iterable[int], range, slice]] = None,
    resolution: int = 300,
    angle: Optional[float] = None,
    detection_resolution: int = 72,
    force_overwrite: bool = False,
    **deskew_kwargs,
) -> "PDF":
    """
    Creates a new, in-memory PDF object containing deskewed versions of the
    specified pages from the original PDF.

    This method renders each selected page, detects and corrects skew using the 'deskew'
    library, and then combines the resulting images into a new PDF using 'img2pdf'.
    The new PDF object is returned directly.

    Important: The returned PDF is image-based. Any existing text, OCR results,
    annotations, or other elements from the original pages will *not* be carried over.

    Args:
        pages: Page indices/slice to include (0-based). If None, processes all pages.
        resolution: DPI resolution for rendering the output deskewed pages.
        angle: The specific angle (in degrees) to rotate by. If None, detects automatically.
        detection_resolution: DPI resolution used for skew detection if angles are not
                              already cached on the page objects.
        force_overwrite: If False (default), raises a ValueError if any target page
                         already contains processed elements (text, OCR, regions) to
                         prevent accidental data loss. Set to True to proceed anyway.
        **deskew_kwargs: Additional keyword arguments passed to `deskew.determine_skew`
                         during automatic detection (e.g., `max_angle`, `num_peaks`).

    Returns:
        A new PDF object representing the deskewed document.

    Raises:
        ImportError: If 'deskew' or 'img2pdf' libraries are not installed.
        ValueError: If `force_overwrite` is False and target pages contain elements.
        FileNotFoundError: If the source PDF cannot be read (if file-based).
        IOError: If creating the in-memory PDF fails.
        RuntimeError: If rendering or deskewing individual pages fails.
    """
    if not DESKEW_AVAILABLE:
        raise ImportError(
            "Deskew/img2pdf libraries missing. Install with: pip install natural-pdf[deskew]"
        )

    target_pages = self._get_target_pages(pages)  # Use helper to resolve pages

    # --- Safety Check --- #
    if not force_overwrite:
        for page in target_pages:
            # Check if the element manager has been initialized and contains any elements
            if (
                hasattr(page, "_element_mgr")
                and page._element_mgr
                and page._element_mgr.has_elements()
            ):
                raise ValueError(
                    f"Page {page.number} contains existing elements (text, OCR, etc.). "
                    f"Deskewing creates an image-only PDF, discarding these elements. "
                    f"Set force_overwrite=True to proceed."
                )

    # --- Process Pages --- #
    deskewed_images_bytes = []
    logger.info(f"Deskewing {len(target_pages)} pages (output resolution={resolution} DPI)...")

    for page in tqdm(target_pages, desc="Deskewing Pages", leave=False):
        try:
            # Use page.deskew to get the corrected PIL image
            # Pass down resolutions and kwargs
            deskewed_img = page.deskew(
                resolution=resolution,
                angle=angle,  # Let page.deskew handle detection/caching
                detection_resolution=detection_resolution,
                **deskew_kwargs,
            )

            if not deskewed_img:
                logger.warning(
                    f"Page {page.number}: Failed to generate deskewed image, skipping."
                )
                continue

            # Convert image to bytes for img2pdf (use PNG for lossless quality)
            with io.BytesIO() as buf:
                deskewed_img.save(buf, format="PNG")
                deskewed_images_bytes.append(buf.getvalue())

        except Exception as e:
            logger.error(
                f"Page {page.number}: Failed during deskewing process: {e}", exc_info=True
            )
            # Raise so the whole operation fails if any single page fails.
            raise RuntimeError(f"Failed to process page {page.number} during deskewing.") from e

    # --- Create PDF --- #
    if not deskewed_images_bytes:
        raise RuntimeError("No pages were successfully processed to create the deskewed PDF.")

    logger.info(f"Combining {len(deskewed_images_bytes)} deskewed images into in-memory PDF...")
    try:
        # Use img2pdf to combine image bytes into PDF bytes
        pdf_bytes = img2pdf.convert(deskewed_images_bytes)

        # Wrap bytes in a stream
        pdf_stream = io.BytesIO(pdf_bytes)

        # Create a new PDF object from the stream using original config
        logger.info("Creating new PDF object from deskewed stream...")
        new_pdf = PDF(
            pdf_stream,
            reading_order=self._reading_order,
            font_attrs=self._font_attrs,
            keep_spaces=self._config.get("keep_spaces", True),
            text_layer=self._text_layer,
        )
        return new_pdf
    except Exception as e:
        logger.error(f"Failed to create in-memory PDF using img2pdf/PDF init: {e}")
        raise IOError("Failed to create deskewed PDF object from image stream.") from e
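The per-page control flow above can be sketched in isolation. This is a minimal, hypothetical stand-in where `render_page` plays the role of `page.deskew(...)` plus PNG encoding, not the real natural-pdf API:

```python
def build_deskewed_image_bytes(page_numbers, render_page):
    """Mirror PDF.deskew's per-page loop: render each page, skip pages
    whose rendering returns None, and fail if nothing succeeded."""
    images_bytes = []
    for number in page_numbers:
        img_bytes = render_page(number)  # stand-in for page.deskew() + PNG save
        if img_bytes is None:
            continue  # matches the "skipping" warning branch above
        images_bytes.append(img_bytes)
    if not images_bytes:
        raise RuntimeError("No pages were successfully processed to create the deskewed PDF.")
    return images_bytes  # in the real method these bytes feed img2pdf.convert()
```

A page that fails to render is skipped with a warning rather than aborting the run, but an all-failure result still raises, matching the behavior of the method above.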
natural_pdf.PDF.export_ocr_correction_task(output_zip_path, **kwargs)

Exports OCR results from this PDF into a correction task package.

Parameters:

Name Type Description Default
output_zip_path str

The path to save the output zip file

required
**kwargs

Additional arguments passed to create_correction_task_package

{}
Source code in natural_pdf/core/pdf.py
def export_ocr_correction_task(self, output_zip_path: str, **kwargs):
    """
    Exports OCR results from this PDF into a correction task package.

    Args:
        output_zip_path: The path to save the output zip file
        **kwargs: Additional arguments passed to create_correction_task_package
    """
    try:
        from natural_pdf.utils.packaging import create_correction_task_package

        create_correction_task_package(source=self, output_zip_path=output_zip_path, **kwargs)
    except ImportError:
        logger.error(
            "Failed to import 'create_correction_task_package'. Packaging utility might be missing."
        )
    except Exception as e:
        logger.error(f"Failed to export correction task: {e}")
        raise
natural_pdf.PDF.extract_tables(selector=None, merge_across_pages=False, **kwargs)

Extract tables from the document or matching elements.

Parameters:

Name Type Description Default
selector Optional[str]

Optional selector to filter tables

None
merge_across_pages bool

Whether to merge tables that span across pages

False
**kwargs

Additional extraction parameters

{}

Returns:

Type Description
List[Any]

List of extracted tables

Source code in natural_pdf/core/pdf.py
def extract_tables(
    self, selector: Optional[str] = None, merge_across_pages: bool = False, **kwargs
) -> List[Any]:
    """
    Extract tables from the document or matching elements.

    Args:
        selector: Optional selector to filter tables
        merge_across_pages: Whether to merge tables that span across pages
        **kwargs: Additional extraction parameters

    Returns:
        List of extracted tables
    """
    if not hasattr(self, "_pages"):
        raise AttributeError("PDF pages not yet initialized.")

    logger.warning("PDF.extract_tables is not fully implemented yet.")
    all_tables = []

    for page in self.pages:
        if hasattr(page, "extract_tables"):
            all_tables.extend(page.extract_tables(**kwargs))
        else:
            logger.debug(f"Page {page.number} does not have extract_tables method.")

    if selector:
        logger.warning("Filtering extracted tables by selector is not implemented.")

    if merge_across_pages:
        logger.warning("Merging tables across pages is not implemented.")

    return all_tables
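As the warning in the source notes, `extract_tables` currently only aggregates per-page results; selector filtering and cross-page merging are not implemented. The aggregation pattern it does implement can be shown stand-alone with a hypothetical `StubPage` in place of real natural-pdf pages:

```python
class StubPage:
    """Hypothetical stand-in for a natural-pdf Page exposing extract_tables()."""

    def __init__(self, tables):
        self._tables = tables

    def extract_tables(self):
        return self._tables


def collect_tables(pages):
    """Mirror PDF.extract_tables' loop: flatten each page's table list
    into one document-level list, preserving page order."""
    all_tables = []
    for page in pages:
        all_tables.extend(page.extract_tables())
    return all_tables


tables = collect_tables([StubPage([[["a", "b"]]]), StubPage([]), StubPage([[["c"]]])])
```

Pages with no tables simply contribute nothing, so the result is a flat list ordered by page.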
natural_pdf.PDF.extract_text(selector=None, preserve_whitespace=True, use_exclusions=True, debug_exclusions=False, **kwargs)

Extract text from the entire document or matching elements.

Parameters:

Name Type Description Default
selector Optional[str]

Optional selector to filter elements

None
preserve_whitespace

Whether to keep blank characters

True
use_exclusions

Whether to apply exclusion regions

True
debug_exclusions

Whether to output detailed debugging for exclusions

False
**kwargs

Additional extraction parameters

{}

Returns:

Type Description
str

Extracted text as string

Source code in natural_pdf/core/pdf.py
def extract_text(
    self,
    selector: Optional[str] = None,
    preserve_whitespace=True,
    use_exclusions=True,
    debug_exclusions=False,
    **kwargs,
) -> str:
    """
    Extract text from the entire document or matching elements.

    Args:
        selector: Optional selector to filter elements
        preserve_whitespace: Whether to keep blank characters
        use_exclusions: Whether to apply exclusion regions
        debug_exclusions: Whether to output detailed debugging for exclusions
        **kwargs: Additional extraction parameters

    Returns:
        Extracted text as string
    """
    if not hasattr(self, "_pages"):
        raise AttributeError("PDF pages not yet initialized.")

    if selector:
        elements = self.find_all(selector, apply_exclusions=use_exclusions, **kwargs)
        return elements.extract_text(preserve_whitespace=preserve_whitespace, **kwargs)

    if debug_exclusions:
        print(f"PDF: Extracting text with exclusions from {len(self.pages)} pages")
        print(f"PDF: Found {len(self._exclusions)} document-level exclusions")

    texts = []
    for page in self.pages:
        texts.append(
            page.extract_text(
                preserve_whitespace=preserve_whitespace,
                use_exclusions=use_exclusions,
                debug_exclusions=debug_exclusions,
                **kwargs,
            )
        )

    if debug_exclusions:
        print(f"PDF: Combined {len(texts)} pages of text")

    return "\n".join(texts)
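When no selector is given, the method simply walks every page and joins the per-page text with newlines. A stripped-down sketch of that path, using a hypothetical `StubPage` rather than the real Page class:

```python
class StubPage:
    """Hypothetical stand-in for a natural-pdf Page exposing extract_text()."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


def document_text(pages):
    """Mirror PDF.extract_text's no-selector path: collect each page's
    text and join the pieces with newlines."""
    return "\n".join(page.extract_text() for page in pages)


combined = document_text([StubPage("Page one"), StubPage("Page two")])
```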
natural_pdf.PDF.find(selector=None, *, text=None, apply_exclusions=True, regex=False, case=True, **kwargs)
find(*, text: str, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> Optional[Any]
find(selector: str, *, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> Optional[Any]

Find the first element matching the selector OR text content across all pages.

Provide EITHER selector OR text, but not both.

Parameters:

Name Type Description Default
selector Optional[str]

CSS-like selector string.

None
text Optional[str]

Text content to search for (equivalent to 'text:contains(...)').

None
apply_exclusions bool

Whether to exclude elements in exclusion regions (default: True).

True
regex bool

Whether to use regex for text search (selector or text) (default: False).

False
case bool

Whether to do case-sensitive text search (selector or text) (default: True).

True
**kwargs

Additional filter parameters.

{}

Returns:

Type Description
Optional[Any]

Element object or None if not found.

Source code in natural_pdf/core/pdf.py
def find(
    self,
    selector: Optional[str] = None,
    *,
    text: Optional[str] = None,
    apply_exclusions: bool = True,
    regex: bool = False,
    case: bool = True,
    **kwargs,
) -> Optional[Any]:
    """
    Find the first element matching the selector OR text content across all pages.

    Provide EITHER `selector` OR `text`, but not both.

    Args:
        selector: CSS-like selector string.
        text: Text content to search for (equivalent to 'text:contains(...)').
        apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
        regex: Whether to use regex for text search (`selector` or `text`) (default: False).
        case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
        **kwargs: Additional filter parameters.

    Returns:
        Element object or None if not found.
    """
    if not hasattr(self, "_pages"):
        raise AttributeError("PDF pages not yet initialized.")

    if selector is not None and text is not None:
        raise ValueError("Provide either 'selector' or 'text', not both.")
    if selector is None and text is None:
        raise ValueError("Provide either 'selector' or 'text'.")

    # Construct selector if 'text' is provided
    effective_selector = ""
    if text is not None:
        escaped_text = text.replace('"', '\\"').replace("'", "\\'")
        effective_selector = f'text:contains("{escaped_text}")'
        logger.debug(
            f"Using text shortcut: find(text='{text}') -> find('{effective_selector}')"
        )
    elif selector is not None:
        effective_selector = selector
    else:
        raise ValueError("Internal error: No selector or text provided.")

    selector_obj = parse_selector(effective_selector)

    # Search page by page
    for page in self.pages:
        # Note: _apply_selector is on Page, so we call find directly here
        # We pass the constructed/validated effective_selector
        element = page.find(
            selector=effective_selector,  # Use the processed selector
            apply_exclusions=apply_exclusions,
            regex=regex,  # Pass down flags
            case=case,
            **kwargs,
        )
        if element:
            return element
    return None  # Not found on any page
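The `text=` shortcut is just selector construction: the query is quote-escaped and wrapped in a `text:contains(...)` selector, exactly as the body above does. Isolated as a small helper (a sketch, not a public API):

```python
def text_to_selector(text):
    """Mirror the text= shortcut in PDF.find/find_all: escape embedded
    quotes and wrap the query in a text:contains(...) selector."""
    escaped = text.replace('"', '\\"').replace("'", "\\'")
    return f'text:contains("{escaped}")'
```

So `find(text='Total')` behaves like `find('text:contains("Total")')`, with `regex=` and `case=` still applying to the contained text match.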
natural_pdf.PDF.find_all(selector=None, *, text=None, apply_exclusions=True, regex=False, case=True, **kwargs)
find_all(*, text: str, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> ElementCollection
find_all(selector: str, *, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> ElementCollection

Find all elements matching the selector OR text content across all pages.

Provide EITHER selector OR text, but not both.

Parameters:

Name Type Description Default
selector Optional[str]

CSS-like selector string.

None
text Optional[str]

Text content to search for (equivalent to 'text:contains(...)').

None
apply_exclusions bool

Whether to exclude elements in exclusion regions (default: True).

True
regex bool

Whether to use regex for text search (selector or text) (default: False).

False
case bool

Whether to do case-sensitive text search (selector or text) (default: True).

True
**kwargs

Additional filter parameters.

{}

Returns:

Type Description
ElementCollection

ElementCollection with matching elements.

Source code in natural_pdf/core/pdf.py
def find_all(
    self,
    selector: Optional[str] = None,
    *,
    text: Optional[str] = None,
    apply_exclusions: bool = True,
    regex: bool = False,
    case: bool = True,
    **kwargs,
) -> "ElementCollection":
    """
    Find all elements matching the selector OR text content across all pages.

    Provide EITHER `selector` OR `text`, but not both.

    Args:
        selector: CSS-like selector string.
        text: Text content to search for (equivalent to 'text:contains(...)').
        apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
        regex: Whether to use regex for text search (`selector` or `text`) (default: False).
        case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
        **kwargs: Additional filter parameters.

    Returns:
        ElementCollection with matching elements.
    """
    if not hasattr(self, "_pages"):
        raise AttributeError("PDF pages not yet initialized.")

    if selector is not None and text is not None:
        raise ValueError("Provide either 'selector' or 'text', not both.")
    if selector is None and text is None:
        raise ValueError("Provide either 'selector' or 'text'.")

    # Construct selector if 'text' is provided
    effective_selector = ""
    if text is not None:
        escaped_text = text.replace('"', '\\"').replace("'", "\\'")
        effective_selector = f'text:contains("{escaped_text}")'
        logger.debug(
            f"Using text shortcut: find_all(text='{text}') -> find_all('{effective_selector}')"
        )
    elif selector is not None:
        effective_selector = selector
    else:
        raise ValueError("Internal error: No selector or text provided.")

    # Instead of parsing here, let each page parse and apply
    # This avoids parsing the same selector multiple times if not needed
    # selector_obj = parse_selector(effective_selector)

    # kwargs["regex"] = regex # Removed: Already passed explicitly
    # kwargs["case"] = case   # Removed: Already passed explicitly

    all_elements = []
    for page in self.pages:
        # Call page.find_all with the effective selector and flags
        page_elements = page.find_all(
            selector=effective_selector,
            apply_exclusions=apply_exclusions,
            regex=regex,
            case=case,
            **kwargs,
        )
        if page_elements:
            all_elements.extend(page_elements.elements)

    from natural_pdf.elements.collections import ElementCollection

    return ElementCollection(all_elements)
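`find` and `find_all` share the same page-by-page traversal; the difference is that `find` returns on the first hit while `find_all` keeps collecting. A stripped-down sketch with plain lists standing in for pages and elements:

```python
def find_first(pages, predicate):
    """find(): scan pages in order and return the first matching element."""
    for page in pages:
        for element in page:
            if predicate(element):
                return element
    return None  # not found on any page


def find_every(pages, predicate):
    """find_all(): collect matching elements from every page into one list
    (the real method wraps this list in an ElementCollection)."""
    matches = []
    for page in pages:
        matches.extend(element for element in page if predicate(element))
    return matches
```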
natural_pdf.PDF.get_id()

Get unique identifier for this PDF.

Source code in natural_pdf/core/pdf.py
def get_id(self) -> str:
    """Get unique identifier for this PDF."""
    return self.path
natural_pdf.PDF.get_manager(key)

Retrieve a manager instance by its key, instantiating it lazily if needed.

Managers are specialized components that handle specific functionality like classification, structured data extraction, or OCR processing. They are instantiated on-demand to minimize memory usage and startup time.

Parameters:

Name Type Description Default
key str

The manager key to retrieve. Common keys include 'classification' and 'structured_data'.

required

Returns:

Type Description
Any

The manager instance for the specified key.

Raises:

Type Description
KeyError

If no manager is registered for the given key.

RuntimeError

If the manager failed to initialize.

Example
pdf = npdf.PDF("document.pdf")
classification_mgr = pdf.get_manager('classification')
structured_data_mgr = pdf.get_manager('structured_data')
Source code in natural_pdf/core/pdf.py
def get_manager(self, key: str) -> Any:
    """Retrieve a manager instance by its key, instantiating it lazily if needed.

    Managers are specialized components that handle specific functionality like
    classification, structured data extraction, or OCR processing. They are
    instantiated on-demand to minimize memory usage and startup time.

    Args:
        key: The manager key to retrieve. Common keys include 'classification'
            and 'structured_data'.

    Returns:
        The manager instance for the specified key.

    Raises:
        KeyError: If no manager is registered for the given key.
        RuntimeError: If the manager failed to initialize.

    Example:
        ```python
        pdf = npdf.PDF("document.pdf")
        classification_mgr = pdf.get_manager('classification')
        structured_data_mgr = pdf.get_manager('structured_data')
        ```
    """
    # Check if already instantiated
    if key in self._managers:
        manager_instance = self._managers[key]
        if manager_instance is None:
            raise RuntimeError(f"Manager '{key}' failed to initialize previously.")
        return manager_instance

    # Not instantiated yet: get factory/class
    if not hasattr(self, "_manager_factories") or key not in self._manager_factories:
        raise KeyError(
            f"No manager registered for key '{key}'. Available: {list(getattr(self, '_manager_factories', {}).keys())}"
        )
    factory_or_class = self._manager_factories[key]
    try:
        resolved = factory_or_class
        # If it's a callable that's not a class, call it to get the class/instance
        if not isinstance(resolved, type) and callable(resolved):
            resolved = resolved()
        # If it's a class, instantiate it
        if isinstance(resolved, type):
            instance = resolved()
        else:
            instance = resolved  # Already an instance
        self._managers[key] = instance
        return instance
    except Exception as e:
        logger.error(f"Failed to initialize manager for key '{key}': {e}")
        self._managers[key] = None
        raise RuntimeError(f"Manager '{key}' failed to initialize: {e}") from e
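The resolution step in the middle of `get_manager` accepts three shapes of registry entry: a class, a zero-argument factory returning a class or instance, or an instance. That logic can be lifted out as a sketch:

```python
def resolve_manager(factory_or_class):
    """Mirror get_manager's resolution: call a non-class callable to get
    a class or instance, then instantiate if a class remains."""
    resolved = factory_or_class
    if not isinstance(resolved, type) and callable(resolved):
        resolved = resolved()  # factory -> class or instance
    if isinstance(resolved, type):
        resolved = resolved()  # class -> instance
    return resolved  # already an instance
```

Registering a factory (e.g. a lambda) defers any heavy imports until the first `get_manager` call, which is the point of the lazy design.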
natural_pdf.PDF.save_pdf(output_path, ocr=False, original=False, dpi=300)

Saves the PDF object (all its pages) to a new file.

Choose one saving mode:

- `ocr=True`: Creates a new, image-based PDF using OCR results from all pages. Text generated during the natural-pdf session becomes searchable, but original vector content is lost. Requires 'ocr-export' extras.
- `original=True`: Saves a copy of the original PDF file this object represents. Any OCR results or analyses from the natural-pdf session are NOT included. If the PDF was opened from an in-memory buffer, this mode may not be suitable. Requires 'ocr-export' extras.

Parameters:

Name Type Description Default
output_path Union[str, Path]

Path to save the new PDF file.

required
ocr bool

If True, save as a searchable, image-based PDF using OCR data.

False
original bool

If True, save the original source PDF content.

False
dpi int

Resolution (dots per inch) used only when ocr=True.

300

Raises:

Type Description
ValueError

If the PDF has no pages, or if neither or both of 'ocr' and 'original' are True.

ImportError

If required libraries are not installed for the chosen mode.

RuntimeError

If an unexpected error occurs during saving.

Source code in natural_pdf/core/pdf.py
def save_pdf(
    self,
    output_path: Union[str, Path],
    ocr: bool = False,
    original: bool = False,
    dpi: int = 300,
):
    """
    Saves the PDF object (all its pages) to a new file.

    Choose one saving mode:
    - `ocr=True`: Creates a new, image-based PDF using OCR results from all pages.
      Text generated during the natural-pdf session becomes searchable,
      but original vector content is lost. Requires 'ocr-export' extras.
    - `original=True`: Saves a copy of the original PDF file this object represents.
      Any OCR results or analyses from the natural-pdf session are NOT included.
      If the PDF was opened from an in-memory buffer, this mode may not be suitable.
      Requires 'ocr-export' extras.

    Args:
        output_path: Path to save the new PDF file.
        ocr: If True, save as a searchable, image-based PDF using OCR data.
        original: If True, save the original source PDF content.
        dpi: Resolution (dots per inch) used only when ocr=True.

    Raises:
        ValueError: If the PDF has no pages, or if neither or both of
                    'ocr' and 'original' are True.
        ImportError: If required libraries are not installed for the chosen mode.
        RuntimeError: If an unexpected error occurs during saving.
    """
    if not self.pages:
        raise ValueError("Cannot save an empty PDF object.")

    if not (ocr ^ original):  # XOR: exactly one must be true
        raise ValueError("Exactly one of 'ocr' or 'original' must be True.")

    output_path_obj = Path(output_path)
    output_path_str = str(output_path_obj)

    if ocr:
        has_vector_elements = False
        for page in self.pages:
            if (
                hasattr(page, "rects")
                and page.rects
                or hasattr(page, "lines")
                and page.lines
                or hasattr(page, "curves")
                and page.curves
                or (
                    hasattr(page, "chars")
                    and any(getattr(el, "source", None) != "ocr" for el in page.chars)
                )
                or (
                    hasattr(page, "words")
                    and any(getattr(el, "source", None) != "ocr" for el in page.words)
                )
            ):
                has_vector_elements = True
                break
        if has_vector_elements:
            logger.warning(
                "Warning: Saving with ocr=True creates an image-based PDF. "
                "Original vector elements (rects, lines, non-OCR text/chars) "
                "will not be preserved in the output file."
            )

        logger.info(f"Saving searchable PDF (OCR text layer) to: {output_path_str}")
        try:
            # Delegate to the searchable PDF exporter, passing self (PDF instance)
            create_searchable_pdf(self, output_path_str, dpi=dpi)
        except Exception as e:
            raise RuntimeError(f"Failed to create searchable PDF: {e}") from e

    elif original:
        if create_original_pdf is None:
            raise ImportError(
                "Saving with original=True requires 'pikepdf'. "
                'Install with: pip install "natural-pdf[ocr-export]"'
            )

        # Optional: Add warning about losing OCR data similar to PageCollection
        has_ocr_elements = False
        for page in self.pages:
            if hasattr(page, "find_all"):
                ocr_text_elements = page.find_all("text[source=ocr]")
                if ocr_text_elements:
                    has_ocr_elements = True
                    break
            elif hasattr(page, "words"):  # Fallback
                if any(getattr(el, "source", None) == "ocr" for el in page.words):
                    has_ocr_elements = True
                    break
        if has_ocr_elements:
            logger.warning(
                "Warning: Saving with original=True preserves original page content. "
                "OCR text generated in this session will not be included in the saved file."
            )

        logger.info(f"Saving original PDF content to: {output_path_str}")
        try:
            # Delegate to the original PDF exporter, passing self (PDF instance)
            create_original_pdf(self, output_path_str)
        except Exception as e:
            # Re-raise exception from exporter
            raise e
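The mode check at the top of `save_pdf` uses XOR so that exactly one of the two flags must be set. Isolated as a sketch:

```python
def validate_save_mode(ocr: bool, original: bool) -> None:
    """Mirror save_pdf's mode check: exactly one of 'ocr' or 'original'
    must be True; XOR rejects both none-set and both-set."""
    if not (ocr ^ original):
        raise ValueError("Exactly one of 'ocr' or 'original' must be True.")
```

So `pdf.save_pdf(path)` and `pdf.save_pdf(path, ocr=True, original=True)` both fail fast, before any rendering work begins.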
natural_pdf.PDF.save_searchable(output_path, dpi=300, **kwargs)

DEPRECATED: Use save_pdf(..., ocr=True) instead. Saves the PDF with an OCR text layer, making content searchable.

Requires optional dependencies. Install with: pip install "natural-pdf[ocr-export]"

Parameters:

Name Type Description Default
output_path Union[str, Path]

Path to save the searchable PDF

required
dpi int

Resolution for rendering and OCR overlay

300
**kwargs

Additional keyword arguments passed to the exporter

{}
Source code in natural_pdf/core/pdf.py
def save_searchable(self, output_path: Union[str, "Path"], dpi: int = 300, **kwargs):
    """
    DEPRECATED: Use save_pdf(..., ocr=True) instead.
    Saves the PDF with an OCR text layer, making content searchable.

    Requires optional dependencies. Install with: pip install \"natural-pdf[ocr-export]\"

    Args:
        output_path: Path to save the searchable PDF
        dpi: Resolution for rendering and OCR overlay
        **kwargs: Additional keyword arguments passed to the exporter
    """
    logger.warning(
        "PDF.save_searchable() is deprecated. Use PDF.save_pdf(..., ocr=True) instead."
    )
    if create_searchable_pdf is None:
        raise ImportError(
            "Saving searchable PDF requires 'pikepdf'. "
            'Install with: pip install "natural-pdf[ocr-export]"'
        )
    output_path_str = str(output_path)
    # Call the exporter directly, passing self (the PDF instance)
    create_searchable_pdf(self, output_path_str, dpi=dpi, **kwargs)
natural_pdf.PDF.search_within_index(query, search_service, options=None)

Finds relevant documents from this PDF within a search index.

Parameters:

Name Type Description Default
query Union[str, Path, Image, Region]

The search query (text, image path, PIL Image, Region)

required
search_service SearchServiceProtocol

A pre-configured SearchService instance

required
options Optional[SearchOptions]

Optional SearchOptions to configure the query

None

Returns:

Type Description
List[Dict[str, Any]]

A list of result dictionaries, sorted by relevance

Raises:

Type Description
ImportError

If search dependencies are not installed

ValueError

If search_service is None

TypeError

If search_service does not conform to the protocol

FileNotFoundError

If the collection managed by the service does not exist

RuntimeError

For other search failures

Source code in natural_pdf/core/pdf.py
def search_within_index(
    self,
    query: Union[str, Path, Image.Image, "Region"],
    search_service: "SearchServiceProtocol",
    options: Optional["SearchOptions"] = None,
) -> List[Dict[str, Any]]:
    """
    Finds relevant documents from this PDF within a search index.

    Args:
        query: The search query (text, image path, PIL Image, Region)
        search_service: A pre-configured SearchService instance
        options: Optional SearchOptions to configure the query

    Returns:
        A list of result dictionaries, sorted by relevance

    Raises:
        ImportError: If search dependencies are not installed
        ValueError: If search_service is None
        TypeError: If search_service does not conform to the protocol
        FileNotFoundError: If the collection managed by the service does not exist
        RuntimeError: For other search failures
    """
    if not search_service:
        raise ValueError("A configured SearchServiceProtocol instance must be provided.")

    collection_name = getattr(search_service, "collection_name", "<Unknown Collection>")
    logger.info(
        f"Searching within index '{collection_name}' for content from PDF '{self.path}'"
    )

    service = search_service

    query_input = query
    effective_options = copy.deepcopy(options) if options is not None else TextSearchOptions()

    if isinstance(query, Region):
        logger.debug("Query is a Region object. Extracting text.")
        if not isinstance(effective_options, TextSearchOptions):
            logger.warning(
                "Querying with Region image requires MultiModalSearchOptions. Falling back to text extraction."
            )
        query_input = query.extract_text()
        if not query_input or query_input.isspace():
            logger.error("Region has no extractable text for query.")
            return []

    # Add filter to scope search to THIS PDF
    # Add filter to scope search to THIS PDF
    pdf_scope_filter = {
        "field": "pdf_path",
        "operator": "eq",
        "value": self.path,
    }
    logger.debug(f"Applying filter to scope search to PDF: {pdf_scope_filter}")

    # Combine with existing filters in options (if any)
    if effective_options.filters:
        logger.debug("Combining PDF scope filter with existing filters")
        if (
            isinstance(effective_options.filters, dict)
            and effective_options.filters.get("operator") == "AND"
        ):
            effective_options.filters["conditions"].append(pdf_scope_filter)
        elif isinstance(effective_options.filters, list):
            effective_options.filters = {
                "operator": "AND",
                "conditions": effective_options.filters + [pdf_scope_filter],
            }
        elif isinstance(effective_options.filters, dict):
            effective_options.filters = {
                "operator": "AND",
                "conditions": [effective_options.filters, pdf_scope_filter],
            }
        else:
            logger.warning(
                "Unsupported format for existing filters. Overwriting with PDF scope filter."
            )
            effective_options.filters = pdf_scope_filter
    else:
        effective_options.filters = pdf_scope_filter

    logger.debug(f"Final filters for service search: {effective_options.filters}")

    try:
        results = service.search(
            query=query_input,
            options=effective_options,
        )
        logger.info(f"SearchService returned {len(results)} results from PDF '{self.path}'")
        return results
    except FileNotFoundError as fnf:
        logger.error(f"Search failed: Collection not found. Error: {fnf}")
        raise
    except Exception as e:
        logger.error(f"SearchService search failed: {e}")
        raise RuntimeError("Search within index failed. See logs for details.") from e
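
search_within_index always scopes results to the current PDF by merging a pdf_path filter into whatever filters the options already carry. The merging logic can be sketched in isolation as a small pure-Python helper (combine_filters is hypothetical, written only to mirror the branching shown in the method above; it is not part of the natural-pdf API):

```python
def combine_filters(existing, scope):
    """Merge an existing filter spec with a PDF-scope filter.

    Mirrors the branching in search_within_index:
    - no existing filters: use the scope filter alone
    - existing AND-group dict: append the scope filter to its conditions
    - existing list of filters: wrap list + scope in a new AND group
    - existing single filter dict: wrap both in a new AND group
    - anything else: overwrite with the scope filter
    """
    if not existing:
        return scope
    if isinstance(existing, dict) and existing.get("operator") == "AND":
        existing["conditions"].append(scope)
        return existing
    if isinstance(existing, list):
        return {"operator": "AND", "conditions": existing + [scope]}
    if isinstance(existing, dict):
        return {"operator": "AND", "conditions": [existing, scope]}
    return scope


scope = {"field": "pdf_path", "operator": "eq", "value": "report.pdf"}
page_filter = {"field": "page", "operator": "eq", "value": 3}
merged = combine_filters(page_filter, scope)
# merged wraps both filters in an AND group:
# {"operator": "AND", "conditions": [page_filter, scope]}
```

Whichever shape the caller's filters take, the result always constrains matches to documents whose pdf_path equals this PDF's path.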
natural_pdf.Page

Bases: ClassificationMixin, ExtractionMixin, ShapeDetectionMixin, DescribeMixin

Enhanced Page wrapper built on top of pdfplumber.Page.

This class provides a fluent interface for working with PDF pages, with improved selection, navigation, extraction, and question-answering capabilities. It integrates multiple analysis capabilities through mixins and provides spatial navigation with CSS-like selectors.

The Page class serves as the primary interface for document analysis, offering:

- Element selection and spatial navigation
- OCR and layout analysis integration
- Table detection and extraction
- AI-powered classification and data extraction
- Visual debugging with highlighting and cropping
- Text style analysis and structure detection

Attributes:

- index (int): Zero-based index of this page in the PDF.
- number (int): One-based page number (index + 1).
- width (float): Page width in points.
- height (float): Page height in points.
- bbox (Tuple[float, float, float, float]): Bounding box tuple (x0, top, x1, bottom) of the page.
- chars (List[Any]): Collection of character elements on the page.
- words (List[Any]): Collection of word elements on the page.
- lines (List[Any]): Collection of line elements on the page.
- rects (List[Any]): Collection of rectangle elements on the page.
- images (List[Any]): Collection of image elements on the page.
- metadata (Dict[str, Any]): Dictionary for storing analysis results and custom data.

Example

Basic usage:

pdf = npdf.PDF("document.pdf")
page = pdf.pages[0]

# Find elements with CSS-like selectors
headers = page.find_all('text[size>12]:bold')
summaries = page.find('text:contains("Summary")')

# Spatial navigation
content_below = summaries.below(until='text[size>12]:bold')

# Table extraction
tables = page.extract_table()

Advanced usage:

# Apply OCR if needed
page.apply_ocr(engine='easyocr', resolution=300)

# Layout analysis
page.analyze_layout(engine='yolo')

# AI-powered extraction
data = page.extract_structured_data(MySchema)

# Visual debugging
page.find('text:contains("Important")').show()

Source code in natural_pdf/core/page.py
class Page(ClassificationMixin, ExtractionMixin, ShapeDetectionMixin, DescribeMixin):
    """Enhanced Page wrapper built on top of pdfplumber.Page.

    This class provides a fluent interface for working with PDF pages,
    with improved selection, navigation, extraction, and question-answering capabilities.
    It integrates multiple analysis capabilities through mixins and provides spatial
    navigation with CSS-like selectors.

    The Page class serves as the primary interface for document analysis, offering:
    - Element selection and spatial navigation
    - OCR and layout analysis integration
    - Table detection and extraction
    - AI-powered classification and data extraction
    - Visual debugging with highlighting and cropping
    - Text style analysis and structure detection

    Attributes:
        index: Zero-based index of this page in the PDF.
        number: One-based page number (index + 1).
        width: Page width in points.
        height: Page height in points.
        bbox: Bounding box tuple (x0, top, x1, bottom) of the page.
        chars: Collection of character elements on the page.
        words: Collection of word elements on the page.
        lines: Collection of line elements on the page.
        rects: Collection of rectangle elements on the page.
        images: Collection of image elements on the page.
        metadata: Dictionary for storing analysis results and custom data.

    Example:
        Basic usage:
        ```python
        pdf = npdf.PDF("document.pdf")
        page = pdf.pages[0]

        # Find elements with CSS-like selectors
        headers = page.find_all('text[size>12]:bold')
        summaries = page.find('text:contains("Summary")')

        # Spatial navigation
        content_below = summaries.below(until='text[size>12]:bold')

        # Table extraction
        tables = page.extract_table()
        ```

        Advanced usage:
        ```python
        # Apply OCR if needed
        page.apply_ocr(engine='easyocr', resolution=300)

        # Layout analysis
        page.analyze_layout(engine='yolo')

        # AI-powered extraction
        data = page.extract_structured_data(MySchema)

        # Visual debugging
        page.find('text:contains("Important")').show()
        ```
    """

    def __init__(
        self,
        page: "pdfplumber.page.Page",
        parent: "PDF",
        index: int,
        font_attrs=None,
        load_text: bool = True,
    ):
        """Initialize a page wrapper.

        Creates an enhanced Page object that wraps a pdfplumber page with additional
        functionality for spatial navigation, analysis, and AI-powered extraction.

        Args:
            page: The underlying pdfplumber page object that provides raw PDF data.
            parent: Parent PDF object that contains this page and provides access
                to managers and global settings.
            index: Zero-based index of this page in the PDF document.
            font_attrs: List of font attributes to consider when grouping characters
                into words. Common attributes include ['fontname', 'size', 'flags'].
                If None, uses default character-to-word grouping rules.
            load_text: If True, load and process text elements from the PDF's text layer.
                If False, skip text layer processing (useful for OCR-only workflows).

        Note:
            This constructor is typically called automatically when accessing pages
            through the PDF.pages collection. Direct instantiation is rarely needed.

        Example:
            ```python
            # Pages are usually accessed through the PDF object
            pdf = npdf.PDF("document.pdf")
            page = pdf.pages[0]  # Page object created automatically

            # Direct construction (advanced usage)
            import pdfplumber
            with pdfplumber.open("document.pdf") as plumber_pdf:
                plumber_page = plumber_pdf.pages[0]
                page = Page(plumber_page, pdf, 0, load_text=True)
            ```
        """
        self._page = page
        self._parent = parent
        self._index = index
        self._load_text = load_text
        self._text_styles = None  # Lazy-loaded text style analyzer results
        self._exclusions = []  # List to store exclusion functions/regions
        self._skew_angle: Optional[float] = None  # Stores detected skew angle

        # --- ADDED --- Metadata store for mixins
        self.metadata: Dict[str, Any] = {}
        # --- END ADDED ---

        # Region management
        self._regions = {
            "detected": [],  # Layout detection results
            "named": {},  # Named regions (name -> region)
        }

        # -------------------------------------------------------------
        # Page-scoped configuration begins as a shallow copy of the parent
        # PDF-level configuration so that auto-computed tolerances or other
        # page-specific values do not overwrite siblings.
        # -------------------------------------------------------------
        self._config = dict(getattr(self._parent, "_config", {}))

        # Initialize ElementManager, passing font_attrs
        self._element_mgr = ElementManager(self, font_attrs=font_attrs, load_text=self._load_text)
        # self._highlighter = HighlightingService(self) # REMOVED - Use property accessor
        # --- NEW --- Central registry for analysis results
        self.analyses: Dict[str, Any] = {}

        # --- Get OCR Manager Instance ---
        if (
            OCRManager
            and hasattr(parent, "_ocr_manager")
            and isinstance(parent._ocr_manager, OCRManager)
        ):
            self._ocr_manager = parent._ocr_manager
            logger.debug(f"Page {self.number}: Using OCRManager instance from parent PDF.")
        else:
            self._ocr_manager = None
            if OCRManager:
                logger.warning(
                    f"Page {self.number}: OCRManager instance not found on parent PDF object."
                )

        # --- Get Layout Manager Instance ---
        if (
            LayoutManager
            and hasattr(parent, "_layout_manager")
            and isinstance(parent._layout_manager, LayoutManager)
        ):
            self._layout_manager = parent._layout_manager
            logger.debug(f"Page {self.number}: Using LayoutManager instance from parent PDF.")
        else:
            self._layout_manager = None
            if LayoutManager:
                logger.warning(
                    f"Page {self.number}: LayoutManager instance not found on parent PDF object. Layout analysis will fail."
                )

        # Initialize the internal variable with a single underscore
        self._layout_analyzer = None

        self._load_elements()
        self._to_image_cache: Dict[tuple, Optional["Image.Image"]] = {}

    @property
    def pdf(self) -> "PDF":
        """Provides public access to the parent PDF object."""
        return self._parent

    @property
    def number(self) -> int:
        """Get page number (1-based)."""
        return self._page.page_number

    @property
    def page_number(self) -> int:
        """Get page number (1-based)."""
        return self._page.page_number

    @property
    def index(self) -> int:
        """Get page index (0-based)."""
        return self._index

    @property
    def width(self) -> float:
        """Get page width."""
        return self._page.width

    @property
    def height(self) -> float:
        """Get page height."""
        return self._page.height

    # --- Highlighting Service Accessor ---
    @property
    def _highlighter(self) -> "HighlightingService":
        """Provides access to the parent PDF's HighlightingService."""
        if not hasattr(self._parent, "highlighter"):
            # This should ideally not happen if PDF.__init__ works correctly
            raise AttributeError("Parent PDF object does not have a 'highlighter' attribute.")
        return self._parent.highlighter

    def clear_exclusions(self) -> "Page":
        """
        Clear all exclusions from the page.
        """
        self._exclusions = []
        return self

    def add_exclusion(
        self,
        exclusion_func_or_region: Union[Callable[["Page"], "Region"], "Region", Any],
        label: Optional[str] = None,
    ) -> "Page":
        """
        Add an exclusion to the page. Text from these regions will be excluded from extraction.
        Ensures non-callable items are stored as Region objects if possible.

        Args:
            exclusion_func_or_region: Either a callable function returning a Region,
                                      a Region object, or another object with a valid .bbox attribute.
            label: Optional label for this exclusion (e.g., 'header', 'footer').

        Returns:
            Self for method chaining

        Raises:
            TypeError: If a non-callable, non-Region object without a valid bbox is provided.
        """
        exclusion_data = None  # Initialize exclusion data

        if callable(exclusion_func_or_region):
            # Store callable functions along with their label
            exclusion_data = (exclusion_func_or_region, label)
            logger.debug(
                f"Page {self.index}: Added callable exclusion '{label}': {exclusion_func_or_region}"
            )
        elif isinstance(exclusion_func_or_region, Region):
            # Store Region objects directly, assigning the label
            exclusion_func_or_region.label = label  # Assign label
            exclusion_data = (exclusion_func_or_region, label)  # Store as tuple for consistency
            logger.debug(
                f"Page {self.index}: Added Region exclusion '{label}': {exclusion_func_or_region}"
            )
        elif (
            hasattr(exclusion_func_or_region, "bbox")
            and isinstance(getattr(exclusion_func_or_region, "bbox", None), (tuple, list))
            and len(exclusion_func_or_region.bbox) == 4
        ):
            # Convert objects with a valid bbox to a Region before storing
            try:
                bbox_coords = tuple(float(v) for v in exclusion_func_or_region.bbox)
                # Pass the label to the Region constructor
                region_to_add = Region(self, bbox_coords, label=label)
                exclusion_data = (region_to_add, label)  # Store as tuple
                logger.debug(
                    f"Page {self.index}: Added exclusion '{label}' converted to Region from {type(exclusion_func_or_region)}: {region_to_add}"
                )
            except Exception as e:
                # Raise an error if conversion fails
                raise TypeError(
                    f"Failed to convert exclusion object {exclusion_func_or_region} with bbox {getattr(exclusion_func_or_region, 'bbox', 'N/A')} to Region: {e}"
                ) from e
        else:
            # Reject invalid types
            raise TypeError(
                f"Invalid exclusion type: {type(exclusion_func_or_region)}. Must be callable, Region, or have a valid .bbox attribute."
            )

        # Append the stored data (tuple of object/callable and label)
        if exclusion_data:
            self._exclusions.append(exclusion_data)

        return self

    def add_region(self, region: "Region", name: Optional[str] = None) -> "Page":
        """
        Add a region to the page.

        Args:
            region: Region object to add
            name: Optional name for the region

        Returns:
            Self for method chaining
        """
        # Check if it's actually a Region object
        if not isinstance(region, Region):
            raise TypeError("region must be a Region object")

        # Set the source and name
        region.source = "named"

        if name:
            region.name = name
            # Add to named regions dictionary (overwriting if name already exists)
            self._regions["named"][name] = region
        else:
            # Add to detected regions list (unnamed but registered)
            self._regions["detected"].append(region)

        # Add to element manager for selector queries
        self._element_mgr.add_region(region)

        return self

    def add_regions(self, regions: List["Region"], prefix: Optional[str] = None) -> "Page":
        """
        Add multiple regions to the page.

        Args:
            regions: List of Region objects to add
            prefix: Optional prefix for automatic naming (regions will be named prefix_1, prefix_2, etc.)

        Returns:
            Self for method chaining
        """
        if prefix:
            # Add with automatic sequential naming
            for i, region in enumerate(regions):
                self.add_region(region, name=f"{prefix}_{i+1}")
        else:
            # Add without names
            for region in regions:
                self.add_region(region)

        return self
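The `prefix` handling above assigns 1-based sequential names. A minimal standalone sketch of that naming scheme (the `name_regions` helper is illustrative, not part of the natural-pdf API):

```python
def name_regions(regions, prefix=None):
    """Map regions to the names add_regions() would assign them.

    With a prefix, regions are registered as "<prefix>_1", "<prefix>_2", ...;
    without one, they stay unnamed (keyed here by index for illustration).
    """
    named = {}
    for i, region in enumerate(regions):
        key = f"{prefix}_{i + 1}" if prefix else i
        named[key] = region
    return named
```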

    def _get_exclusion_regions(self, include_callable=True, debug=False) -> List["Region"]:
        """
        Get all exclusion regions for this page.
        Assumes self._exclusions contains tuples of (callable/Region, label).

        Args:
            include_callable: Whether to evaluate callable exclusion functions
            debug: Enable verbose debug logging for exclusion evaluation

        Returns:
            List of Region objects to exclude, with labels assigned.
        """
        regions = []

        if debug:
            print(f"\nPage {self.index}: Evaluating {len(self._exclusions)} exclusions")

        for i, exclusion_data in enumerate(self._exclusions):
            # Unpack the exclusion object/callable and its label
            exclusion_item, label = exclusion_data
            exclusion_label = label if label else f"exclusion {i}"

            # Process callable exclusion functions
            if callable(exclusion_item) and include_callable:
                try:
                    if debug:
                        print(f"  - Evaluating callable '{exclusion_label}'...")

                    # Temporarily clear exclusions so the callable sees the
                    # unfiltered page; restore them even if the call raises.
                    temp_original_exclusions = self._exclusions
                    self._exclusions = []
                    try:
                        # Call the function - it should return a Region or None
                        region_result = exclusion_item(self)
                    finally:
                        # Restore exclusions
                        self._exclusions = temp_original_exclusions

                    if isinstance(region_result, Region):
                        # Assign the label to the returned region
                        region_result.label = label
                        regions.append(region_result)
                        if debug:
                            print(f"    ✓ Added region from callable '{label}': {region_result}")
                    elif region_result:
                        logger.warning(
                            f"Callable exclusion '{exclusion_label}' returned non-Region object: {type(region_result)}. Skipping."
                        )
                        if debug:
                            print(f"    ✗ Callable returned non-Region/None: {type(region_result)}")
                    else:
                        if debug:
                            print(
                                f"    ✗ Callable '{exclusion_label}' returned None, no region added"
                            )

                except Exception as e:
                    logger.error(
                        f"Error evaluating callable exclusion '{exclusion_label}' for page {self.index}: {e}",
                        exc_info=True,
                    )

            # Process direct Region objects (label was assigned in add_exclusion)
            elif isinstance(exclusion_item, Region):
                regions.append(exclusion_item)  # Label is already on the Region object
                if debug:
                    print(f"  - Added direct region '{label}': {exclusion_item}")
            # No else needed, add_exclusion should prevent invalid types

        if debug:
            print(f"Page {self.index}: Found {len(regions)} valid exclusion regions to apply")

        return regions

    def _filter_elements_by_exclusions(
        self, elements: List["Element"], debug_exclusions: bool = False
    ) -> List["Element"]:
        """
        Filters a list of elements, removing those within the page's exclusion regions.

        Args:
            elements: The list of elements to filter.
            debug_exclusions: Whether to output detailed exclusion debugging info (default: False).

        Returns:
            A new list containing only the elements not falling within any exclusion region.
        """
        if not self._exclusions:
            if debug_exclusions:
                print(
                    f"Page {self.index}: No exclusions defined, returning all {len(elements)} elements."
                )
            return elements

        # Get all exclusion regions, including evaluating callable functions
        exclusion_regions = self._get_exclusion_regions(
            include_callable=True, debug=debug_exclusions
        )

        if not exclusion_regions:
            if debug_exclusions:
                print(
                    f"Page {self.index}: No valid exclusion regions found, returning all {len(elements)} elements."
                )
            return elements

        if debug_exclusions:
            print(
                f"Page {self.index}: Applying {len(exclusion_regions)} exclusion regions to {len(elements)} elements."
            )

        filtered_elements = []
        excluded_count = 0
        for element in elements:
            exclude = False
            for region in exclusion_regions:
                # Use the region's method to check if the element is inside
                if region._is_element_in_region(element):
                    exclude = True
                    excluded_count += 1
                    break  # No need to check other regions for this element
            if not exclude:
                filtered_elements.append(element)

        if debug_exclusions:
            print(
                f"Page {self.index}: Excluded {excluded_count} elements, keeping {len(filtered_elements)}."
            )

        return filtered_elements
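The filtering loop above delegates the containment test to `Region._is_element_in_region`. As a hedged sketch of the same idea with plain `(x0, top, x1, bottom)` tuples, approximating containment with a "center point inside region" test:

```python
def filter_by_exclusions(elements, exclusion_boxes):
    """Keep only elements whose center lies outside every exclusion box.

    Elements and boxes are (x0, top, x1, bottom) tuples; this is an
    illustrative stand-in for the Region-based check, not the real one.
    """
    def center_inside(el, box):
        cx = (el[0] + el[2]) / 2
        cy = (el[1] + el[3]) / 2
        return box[0] <= cx <= box[2] and box[1] <= cy <= box[3]

    return [
        el for el in elements
        if not any(center_inside(el, box) for box in exclusion_boxes)
    ]
```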

    @overload
    def find(
        self,
        *,
        text: str,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional[Any]: ...

    @overload
    def find(
        self,
        selector: str,
        *,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional[Any]: ...

    def find(
        self,
        selector: Optional[str] = None,  # Now optional
        *,  # Force subsequent args to be keyword-only
        text: Optional[str] = None,  # New text parameter
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional[Any]:
        """
        Find first element on this page matching selector OR text content.

        Provide EITHER `selector` OR `text`, but not both.

        Args:
            selector: CSS-like selector string.
            text: Text content to search for (equivalent to 'text:contains(...)').
            apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
            regex: Whether to use regex for text search (`selector` or `text`) (default: False).
            case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
            **kwargs: Additional filter parameters.

        Returns:
            Element object or None if not found.
        """
        if selector is not None and text is not None:
            raise ValueError("Provide either 'selector' or 'text', not both.")
        if selector is None and text is None:
            raise ValueError("Provide either 'selector' or 'text'.")

        # Construct selector if 'text' is provided
        effective_selector = ""
        if text is not None:
            # Escape quotes within the text for the selector string
            escaped_text = text.replace('"', '\\"').replace("'", "\\'")
            # Default to 'text:contains(...)'
            effective_selector = f'text:contains("{escaped_text}")'
            # Note: regex/case handled by kwargs passed down
            logger.debug(
                f"Using text shortcut: find(text='{text}') -> find('{effective_selector}')"
            )
        elif selector is not None:
            effective_selector = selector
        else:
            # Should be unreachable due to checks above
            raise ValueError("Internal error: No selector or text provided.")

        selector_obj = parse_selector(effective_selector)

        # Pass regex and case flags to selector function via kwargs
        kwargs["regex"] = regex
        kwargs["case"] = case

        # First get all matching elements without applying exclusions initially within _apply_selector
        results_collection = self._apply_selector(
            selector_obj, **kwargs
        )  # _apply_selector doesn't filter

        # Filter the results based on exclusions if requested
        if apply_exclusions and self._exclusions and results_collection:
            filtered_elements = self._filter_elements_by_exclusions(results_collection.elements)
            # Return the first element from the filtered list
            return filtered_elements[0] if filtered_elements else None
        elif results_collection:
            # Return the first element from the unfiltered results
            return results_collection.first
        else:
            return None
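The `text=` shortcut in `find()` simply rewrites the argument into a `:contains()` selector, escaping any quotes first. A standalone sketch of that translation (the helper name is illustrative):

```python
def text_to_selector(text):
    """Rewrite a plain-text search into the selector find(text=...) builds."""
    escaped = text.replace('"', '\\"').replace("'", "\\'")
    return f'text:contains("{escaped}")'
```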

    @overload
    def find_all(
        self,
        *,
        text: str,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection": ...

    @overload
    def find_all(
        self,
        selector: str,
        *,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection": ...

    def find_all(
        self,
        selector: Optional[str] = None,  # Now optional
        *,  # Force subsequent args to be keyword-only
        text: Optional[str] = None,  # New text parameter
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection":
        """
        Find all elements on this page matching selector OR text content.

        Provide EITHER `selector` OR `text`, but not both.

        Args:
            selector: CSS-like selector string.
            text: Text content to search for (equivalent to 'text:contains(...)').
            apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
            regex: Whether to use regex for text search (`selector` or `text`) (default: False).
            case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
            **kwargs: Additional filter parameters.

        Returns:
            ElementCollection with matching elements.
        """
        from natural_pdf.elements.collections import ElementCollection  # Import here for type hint

        if selector is not None and text is not None:
            raise ValueError("Provide either 'selector' or 'text', not both.")
        if selector is None and text is None:
            raise ValueError("Provide either 'selector' or 'text'.")

        # Construct selector if 'text' is provided
        effective_selector = ""
        if text is not None:
            # Escape quotes within the text for the selector string
            escaped_text = text.replace('"', '\\"').replace("'", "\\'")
            # Default to 'text:contains(...)'
            effective_selector = f'text:contains("{escaped_text}")'
            logger.debug(
                f"Using text shortcut: find_all(text='{text}') -> find_all('{effective_selector}')"
            )
        elif selector is not None:
            effective_selector = selector
        else:
            # Should be unreachable due to checks above
            raise ValueError("Internal error: No selector or text provided.")

        selector_obj = parse_selector(effective_selector)

        # Pass regex and case flags to selector function via kwargs
        kwargs["regex"] = regex
        kwargs["case"] = case

        # First get all matching elements without applying exclusions initially within _apply_selector
        results_collection = self._apply_selector(
            selector_obj, **kwargs
        )  # _apply_selector doesn't filter

        # Filter the results based on exclusions if requested
        if apply_exclusions and self._exclusions and results_collection:
            filtered_elements = self._filter_elements_by_exclusions(results_collection.elements)
            return ElementCollection(filtered_elements)
        else:
            # Return the unfiltered collection
            return results_collection

    def _apply_selector(
        self, selector_obj: Dict, **kwargs
    ) -> "ElementCollection":  # Removed apply_exclusions arg
        """
        Apply selector to page elements.
        Exclusions are now handled by the calling methods (find, find_all) if requested.

        Args:
            selector_obj: Parsed selector dictionary (single or compound OR selector)
            **kwargs: Additional filter parameters including 'regex' and 'case'

        Returns:
            ElementCollection of matching elements (unfiltered by exclusions)
        """
        from natural_pdf.selectors.parser import selector_to_filter_func

        # Handle compound OR selectors
        if selector_obj.get("type") == "or":
            # For OR selectors, search all elements and let the filter function decide
            elements_to_search = self._element_mgr.get_all_elements()

            # Create filter function from compound selector
            filter_func = selector_to_filter_func(selector_obj, **kwargs)

            # Apply the filter to all elements
            matching_elements = [element for element in elements_to_search if filter_func(element)]

            # Sort elements in reading order if requested
            if kwargs.get("reading_order", True):
                if all(hasattr(el, "top") and hasattr(el, "x0") for el in matching_elements):
                    matching_elements.sort(key=lambda el: (el.top, el.x0))
                else:
                    logger.warning(
                        "Cannot sort elements in reading order: Missing required attributes (top, x0)."
                    )

            # Return result collection
            return ElementCollection(matching_elements)

        # Handle single selectors (existing logic)
        # Get element type to filter
        element_type = selector_obj.get("type", "any").lower()

        # Determine which elements to search based on element type
        elements_to_search = []
        if element_type == "any":
            elements_to_search = self._element_mgr.get_all_elements()
        elif element_type == "text":
            elements_to_search = self._element_mgr.words
        elif element_type == "char":
            elements_to_search = self._element_mgr.chars
        elif element_type == "word":
            elements_to_search = self._element_mgr.words
        elif element_type == "rect" or element_type == "rectangle":
            elements_to_search = self._element_mgr.rects
        elif element_type == "line":
            elements_to_search = self._element_mgr.lines
        elif element_type == "region":
            elements_to_search = self._element_mgr.regions
        else:
            elements_to_search = self._element_mgr.get_all_elements()

        # Create filter function from selector, passing any additional parameters
        filter_func = selector_to_filter_func(selector_obj, **kwargs)

        # Apply the filter to matching elements
        matching_elements = [element for element in elements_to_search if filter_func(element)]

        # Handle spatial pseudo-classes that require relationship checking
        for pseudo in selector_obj.get("pseudo_classes", []):
            name = pseudo.get("name")
            args = pseudo.get("args", "")

            if name in ("above", "below", "near", "left-of", "right-of"):
                # Find the reference element first
                from natural_pdf.selectors.parser import parse_selector

                ref_selector = parse_selector(args) if isinstance(args, str) else args
                # Recursively call _apply_selector for reference element (exclusions handled later)
                ref_elements = self._apply_selector(ref_selector, **kwargs)

                if not ref_elements:
                    return ElementCollection([])

                ref_element = ref_elements.first
                if not ref_element:
                    continue

                # Filter elements based on spatial relationship
                if name == "above":
                    matching_elements = [
                        el
                        for el in matching_elements
                        if hasattr(el, "bottom")
                        and hasattr(ref_element, "top")
                        and el.bottom <= ref_element.top
                    ]
                elif name == "below":
                    matching_elements = [
                        el
                        for el in matching_elements
                        if hasattr(el, "top")
                        and hasattr(ref_element, "bottom")
                        and el.top >= ref_element.bottom
                    ]
                elif name == "left-of":
                    matching_elements = [
                        el
                        for el in matching_elements
                        if hasattr(el, "x1")
                        and hasattr(ref_element, "x0")
                        and el.x1 <= ref_element.x0
                    ]
                elif name == "right-of":
                    matching_elements = [
                        el
                        for el in matching_elements
                        if hasattr(el, "x0")
                        and hasattr(ref_element, "x1")
                        and el.x0 >= ref_element.x1
                    ]
                elif name == "near":

                    def distance(el1, el2):
                        if not (
                            hasattr(el1, "x0")
                            and hasattr(el1, "x1")
                            and hasattr(el1, "top")
                            and hasattr(el1, "bottom")
                            and hasattr(el2, "x0")
                            and hasattr(el2, "x1")
                            and hasattr(el2, "top")
                            and hasattr(el2, "bottom")
                        ):
                            return float("inf")  # Cannot calculate distance
                        el1_center_x = (el1.x0 + el1.x1) / 2
                        el1_center_y = (el1.top + el1.bottom) / 2
                        el2_center_x = (el2.x0 + el2.x1) / 2
                        el2_center_y = (el2.top + el2.bottom) / 2
                        return (
                            (el1_center_x - el2_center_x) ** 2 + (el1_center_y - el2_center_y) ** 2
                        ) ** 0.5

                    threshold = kwargs.get("near_threshold", 50)
                    matching_elements = [
                        el for el in matching_elements if distance(el, ref_element) <= threshold
                    ]

        # Sort elements in reading order if requested
        if kwargs.get("reading_order", True):
            if all(hasattr(el, "top") and hasattr(el, "x0") for el in matching_elements):
                matching_elements.sort(key=lambda el: (el.top, el.x0))
            else:
                logger.warning(
                    "Cannot sort elements in reading order: Missing required attributes (top, x0)."
                )

        # Create result collection - exclusions are handled by the calling methods (find, find_all)
        result = ElementCollection(matching_elements)

        return result
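The `:near()` pseudo-class above compares center-to-center Euclidean distances against a threshold (default 50 points). The same calculation on plain `(x0, top, x1, bottom)` tuples, as a self-contained sketch:

```python
def center_distance(a, b):
    """Euclidean distance between the centers of two bounding boxes."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
```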

    def create_region(self, x0: float, top: float, x1: float, bottom: float) -> Any:
        """
        Create a region on this page with the specified coordinates.

        Args:
            x0: Left x-coordinate
            top: Top y-coordinate
            x1: Right x-coordinate
            bottom: Bottom y-coordinate

        Returns:
            Region object for the specified coordinates
        """
        from natural_pdf.elements.region import Region

        return Region(self, (x0, top, x1, bottom))

    def region(
        self,
        left: Optional[float] = None,
        top: Optional[float] = None,
        right: Optional[float] = None,
        bottom: Optional[float] = None,
        width: Union[str, float, None] = None,
        height: Optional[float] = None,
    ) -> Any:
        """
        Create a region on this page with more intuitive named parameters,
        allowing definition by coordinates or by coordinate + dimension.

        Args:
            left: Left x-coordinate (default: 0 if width not used).
            top: Top y-coordinate (default: 0 if height not used).
            right: Right x-coordinate (default: page width if width not used).
            bottom: Bottom y-coordinate (default: page height if height not used).
            width: Width definition. Can be:
                   - Numeric: The width of the region in points. Cannot be used with both left and right.
                   - String 'full': Sets region width to full page width (overrides left/right).
                   - String 'element' or None (default): Uses provided/calculated left/right,
                     defaulting to page width if neither are specified.
            height: Numeric height of the region. Cannot be used with both top and bottom.

        Returns:
            Region object for the specified coordinates

        Raises:
            ValueError: If conflicting arguments are provided (e.g., top, bottom, and height)
                      or if width is an invalid string.

        Examples:
            >>> page.region(top=100, height=50)  # Region from y=100 to y=150, default width
            >>> page.region(left=50, width=100)   # Region from x=50 to x=150, default height
            >>> page.region(bottom=500, height=50) # Region from y=450 to y=500
            >>> page.region(right=200, width=50)  # Region from x=150 to x=200
            >>> page.region(top=100, bottom=200, width="full") # Explicit full width
        """
        # ------------------------------------------------------------------
        # Percentage support – convert strings like "30%" to absolute values
        # based on page dimensions.  X-axis params (left, right, width) use
        # page.width; Y-axis params (top, bottom, height) use page.height.
        # ------------------------------------------------------------------

        def _pct_to_abs(val, axis: str):
            if isinstance(val, str) and val.strip().endswith("%"):
                try:
                    pct = float(val.strip()[:-1]) / 100.0
                except ValueError:
                    return val  # leave unchanged if not a number
                return pct * (self.width if axis == "x" else self.height)
            return val

        left = _pct_to_abs(left, "x")
        right = _pct_to_abs(right, "x")
        width = _pct_to_abs(width, "x")
        top = _pct_to_abs(top, "y")
        bottom = _pct_to_abs(bottom, "y")
        height = _pct_to_abs(height, "y")

        # --- Type checking and basic validation ---
        is_width_numeric = isinstance(width, (int, float))
        is_width_string = isinstance(width, str)
        width_mode = "element"  # Default mode

        if height is not None and top is not None and bottom is not None:
            raise ValueError("Cannot specify top, bottom, and height simultaneously.")
        if is_width_numeric and left is not None and right is not None:
            raise ValueError("Cannot specify left, right, and a numeric width simultaneously.")
        if is_width_string:
            width_lower = width.lower()
            if width_lower not in ["full", "element"]:
                raise ValueError("String width argument must be 'full' or 'element'.")
            width_mode = width_lower

        # --- Calculate Coordinates ---
        final_top = top
        final_bottom = bottom
        final_left = left
        final_right = right

        # Height calculations
        if height is not None:
            if top is not None:
                final_bottom = top + height
            elif bottom is not None:
                final_top = bottom - height
            else:  # Neither top nor bottom provided, default top to 0
                final_top = 0
                final_bottom = height

        # Width calculations (numeric only)
        if is_width_numeric:
            if left is not None:
                final_right = left + width
            elif right is not None:
                final_left = right - width
            else:  # Neither left nor right provided, default left to 0
                final_left = 0
                final_right = width

        # --- Apply Defaults for Unset Coordinates ---
        # Only default coordinates if they weren't set by dimension calculation
        if final_top is None:
            final_top = 0
        if final_bottom is None:
            # Check if bottom should have been set by height calc
            if height is None or top is None:
                final_bottom = self.height

        if final_left is None:
            final_left = 0
        if final_right is None:
            # Check if right should have been set by width calc
            if not is_width_numeric or left is None:
                final_right = self.width

        # --- Handle width_mode == 'full' ---
        if width_mode == "full":
            # Override left/right if mode is full
            final_left = 0
            final_right = self.width

        # --- Final Validation & Creation ---
        # Ensure coordinates are within page bounds (clamp)
        final_left = max(0, final_left)
        final_top = max(0, final_top)
        final_right = min(self.width, final_right)
        final_bottom = min(self.height, final_bottom)

        # Ensure valid box (x0<=x1, top<=bottom)
        if final_left > final_right:
            logger.warning(f"Calculated left ({final_left}) > right ({final_right}). Swapping.")
            final_left, final_right = final_right, final_left
        if final_top > final_bottom:
            logger.warning(f"Calculated top ({final_top}) > bottom ({final_bottom}). Swapping.")
            final_top, final_bottom = final_bottom, final_top

        from natural_pdf.elements.region import Region

        region = Region(self, (final_left, final_top, final_right, final_bottom))
        return region
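The percentage handling at the top of `region()` resolves strings like `"30%"` against the page dimension for that axis. A minimal standalone version of that conversion (illustrative helper, mirroring `_pct_to_abs` above):

```python
def pct_to_abs(val, page_size):
    """Convert a "NN%" string to an absolute coordinate; pass others through."""
    if isinstance(val, str) and val.strip().endswith("%"):
        try:
            pct = float(val.strip()[:-1]) / 100.0
        except ValueError:
            return val  # not a number; leave unchanged
        return pct * page_size
    return val
```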

    def get_elements(
        self, apply_exclusions=True, debug_exclusions: bool = False
    ) -> List["Element"]:
        """
        Get all elements on this page.

        Args:
            apply_exclusions: Whether to apply exclusion regions (default: True).
            debug_exclusions: Whether to output detailed exclusion debugging info (default: False).

        Returns:
            List of all elements on the page, potentially filtered by exclusions.
        """
        # Get all elements from the element manager
        all_elements = self._element_mgr.get_all_elements()

        # Apply exclusions if requested
        if apply_exclusions and self._exclusions:
            return self._filter_elements_by_exclusions(
                all_elements, debug_exclusions=debug_exclusions
            )
        else:
            if debug_exclusions:
                print(
                    f"Page {self.index}: get_elements returning all {len(all_elements)} elements (exclusions not applied)."
                )
            return all_elements

    def filter_elements(
        self, elements: List["Element"], selector: str, **kwargs
    ) -> List["Element"]:
        """
        Filter a list of elements based on a selector.

        Args:
            elements: List of elements to filter
            selector: CSS-like selector string
            **kwargs: Additional filter parameters

        Returns:
            List of elements that match the selector
        """
        from natural_pdf.selectors.parser import parse_selector, selector_to_filter_func

        # Parse the selector
        selector_obj = parse_selector(selector)

        # Create filter function from selector
        filter_func = selector_to_filter_func(selector_obj, **kwargs)

        # Apply the filter to the elements
        matching_elements = [element for element in elements if filter_func(element)]

        # Sort elements in reading order if requested
        if kwargs.get("reading_order", True):
            if all(hasattr(el, "top") and hasattr(el, "x0") for el in matching_elements):
                matching_elements.sort(key=lambda el: (el.top, el.x0))
            else:
                logger.warning(
                    "Cannot sort elements in reading order: Missing required attributes (top, x0)."
                )

        return matching_elements
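Several methods in this class sort matches by the same `(top, x0)` key. The reading-order sort in isolation, shown on plain dicts rather than element objects:

```python
def sort_reading_order(elements):
    """Order elements top-to-bottom, then left-to-right within a line."""
    return sorted(elements, key=lambda el: (el["top"], el["x0"]))
```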

    def until(self, selector: str, include_endpoint: bool = True, **kwargs) -> Any:
        """
        Select content from the top of the page until matching selector.

        Args:
            selector: CSS-like selector string
            include_endpoint: Whether to include the endpoint element in the region
            **kwargs: Additional selection parameters

        Returns:
            Region object representing the selected content

        Examples:
            >>> page.until('text:contains("Conclusion")')  # Select from top to conclusion
            >>> page.until('line[width>=2]', include_endpoint=False)  # Select up to thick line
        """
        # Find the target element
        target = self.find(selector, **kwargs)
        if not target:
            # If target not found, return a default region (full page)
            from natural_pdf.elements.region import Region

            return Region(self, (0, 0, self.width, self.height))

        # Create a region from the top of the page to the target
        from natural_pdf.elements.region import Region

        # Ensure target has positional attributes before using them
        target_top = getattr(target, "top", 0)
        target_bottom = getattr(target, "bottom", self.height)

        if include_endpoint:
            # Include the target element
            region = Region(self, (0, 0, self.width, target_bottom))
        else:
            # Up to the target element
            region = Region(self, (0, 0, self.width, target_top))

        region.end_element = target
        return region
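The boundary logic in `until()` reduces to one choice: the region always spans the full page width, and `include_endpoint` decides whether it ends at the target's bottom edge (inclusive) or its top edge (exclusive). A sketch of that bbox construction (hypothetical helper, not part of the API):

```python
def until_bbox(page_width, target_top, target_bottom, include_endpoint=True):
    """Compute the (x0, top, x1, bottom) bbox until() would produce."""
    end = target_bottom if include_endpoint else target_top
    return (0, 0, page_width, end)
```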

    def crop(self, bbox=None, **kwargs) -> Any:
        """
        Crop the page to the specified bounding box.

        This is a direct wrapper around pdfplumber's crop method.

        Args:
            bbox: Bounding box (x0, top, x1, bottom) or None
            **kwargs: Additional parameters (top, bottom, left, right)

        Returns:
            Cropped page object (pdfplumber.Page)
        """
        # Returns the pdfplumber page object, not a natural-pdf Page
        return self._page.crop(bbox, **kwargs)

    def extract_text(
        self, preserve_whitespace=True, use_exclusions=True, debug_exclusions=False, **kwargs
    ) -> str:
        """
        Extract text from this page, respecting exclusions and using pdfplumber's
        layout engine (chars_to_textmap) when layout arguments are provided or by default.

        Args:
            preserve_whitespace: Whether to preserve whitespace characters in the
                          extracted text (default: True).
            use_exclusions: Whether to apply exclusion regions (default: True).
            debug_exclusions: Whether to output detailed exclusion debugging info (default: False).
            **kwargs: Additional layout parameters passed directly to pdfplumber's
                      `chars_to_textmap` function. Common parameters include:
                      - layout (bool): If True (default), inserts spaces/newlines.
                      - x_density (float): Pixels per character horizontally.
                      - y_density (float): Pixels per line vertically.
                      - x_tolerance (float): Tolerance for horizontal character grouping.
                      - y_tolerance (float): Tolerance for vertical character grouping.
                      - line_dir (str): 'ttb', 'btt', 'ltr', 'rtl'
                      - char_dir (str): 'ttb', 'btt', 'ltr', 'rtl'
                      - bidi (bool): If True (default), apply the Unicode BiDi
                        algorithm to lines containing RTL characters.
                      See pdfplumber documentation for more.

        Returns:
            Extracted text as string, potentially with layout-based spacing.
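
        Examples:
            >>> page.extract_text()  # layout-aware text, exclusions applied
            >>> page.extract_text(layout=False)  # raw text without layout spacing
            >>> page.extract_text(use_exclusions=False)  # include excluded regions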
        """
        logger.debug(f"Page {self.number}: extract_text called with kwargs: {kwargs}")
        debug = kwargs.get("debug", debug_exclusions)  # Allow 'debug' kwarg

        # 1. Get Word Elements (triggers load_elements if needed)
        word_elements = self.words
        if not word_elements:
            logger.debug(f"Page {self.number}: No word elements found.")
            return ""

        # 2. Get Exclusions
        apply_exclusions_flag = use_exclusions  # honor the explicit parameter, not kwargs
        exclusion_regions = []
        if apply_exclusions_flag and self._exclusions:
            exclusion_regions = self._get_exclusion_regions(include_callable=True, debug=debug)
            if debug:
                logger.debug(f"Page {self.number}: Applying {len(exclusion_regions)} exclusions.")
        elif debug:
            logger.debug(f"Page {self.number}: Not applying exclusions.")

        # 3. Collect All Character Dictionaries from Word Elements
        all_char_dicts = []
        for word in word_elements:
            all_char_dicts.extend(getattr(word, "_char_dicts", []))

        # 4. Spatially Filter Characters
        filtered_chars = filter_chars_spatially(
            char_dicts=all_char_dicts,
            exclusion_regions=exclusion_regions,
            target_region=None,  # No target region for full page extraction
            debug=debug,
        )

        # 5. Generate Text Layout using Utility
        # Pass page bbox as layout context
        page_bbox = (0, 0, self.width, self.height)
        # Merge PDF-level default tolerances if caller did not override
        merged_kwargs = dict(kwargs)
        tol_keys = ["x_tolerance", "x_tolerance_ratio", "y_tolerance"]
        for k in tol_keys:
            if k not in merged_kwargs:
                if k in self._config:
                    merged_kwargs[k] = self._config[k]
                elif k in getattr(self._parent, "_config", {}):
                    merged_kwargs[k] = self._parent._config[k]

        result = generate_text_layout(
            char_dicts=filtered_chars,
            layout_context_bbox=page_bbox,
            user_kwargs=merged_kwargs,
        )

        # --- Optional: apply Unicode BiDi algorithm for mixed RTL/LTR correctness ---
        apply_bidi = kwargs.get("bidi", True)
        if apply_bidi and result:
            # Quick check for any RTL character
            import unicodedata

            def _contains_rtl(s):
                return any(unicodedata.bidirectional(ch) in ("R", "AL", "AN") for ch in s)

            if _contains_rtl(result):
                try:
                    from bidi.algorithm import get_display  # type: ignore

                    from natural_pdf.utils.bidi_mirror import mirror_brackets

                    result = "\n".join(
                        mirror_brackets(
                            get_display(
                                line,
                                base_dir=(
                                    "R"
                                    if any(
                                        unicodedata.bidirectional(ch) in ("R", "AL", "AN")
                                        for ch in line
                                    )
                                    else "L"
                                ),
                            )
                        )
                        for line in result.split("\n")
                    )
                except ModuleNotFoundError:
                    pass  # silently skip if python-bidi not available

        logger.debug(f"Page {self.number}: extract_text finished, result length: {len(result)}.")
        return result

    def extract_table(
        self,
        method: Optional[str] = None,
        table_settings: Optional[dict] = None,
        use_ocr: bool = False,
        ocr_config: Optional[dict] = None,
        text_options: Optional[Dict] = None,
        cell_extraction_func: Optional[Callable[["Region"], Optional[str]]] = None,
        show_progress: bool = False,
    ) -> List[List[Optional[str]]]:
        """
        Extract the largest table from this page using enhanced region-based extraction.

        Args:
            method: Method to use: 'tatr', 'pdfplumber', 'text', 'stream', 'lattice', or None (auto-detect).
            table_settings: Settings for pdfplumber table extraction.
            use_ocr: Whether to use OCR for text extraction (currently only applicable with 'tatr' method).
            ocr_config: OCR configuration parameters.
            text_options: Dictionary of options for the 'text' method.
            cell_extraction_func: Optional callable function that takes a cell Region object
                                  and returns its string content. For 'text' method only.
            show_progress: If True, display a progress bar during cell text extraction for the 'text' method.

        Returns:
            Table data as a list of rows, where each row is a list of cell values (str or None).
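
        Examples:
            >>> page.extract_table()  # auto-detect the best method
            >>> page.extract_table(method='text', show_progress=True)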
        """
        # Create a full-page region and delegate to its enhanced extract_table method
        page_region = self.create_region(0, 0, self.width, self.height)
        return page_region.extract_table(
            method=method,
            table_settings=table_settings,
            use_ocr=use_ocr,
            ocr_config=ocr_config,
            text_options=text_options,
            cell_extraction_func=cell_extraction_func,
            show_progress=show_progress,
        )

    def extract_tables(
        self,
        method: Optional[str] = None,
        table_settings: Optional[dict] = None,
        check_tatr: bool = True,
    ) -> List[List[List[str]]]:
        """
        Extract all tables from this page with enhanced method support.

        Args:
            method: Method to use: 'pdfplumber', 'stream', 'lattice', or None (auto-detect).
                    'stream' uses text-based strategies, 'lattice' uses line-based strategies.
                    Note: 'tatr' and 'text' methods are not supported for extract_tables.
            table_settings: Settings for pdfplumber table extraction.
            check_tatr: If True (default), first check for TATR-detected table regions
                        and extract from those before falling back to pdfplumber methods.

        Returns:
            List of tables, where each table is a list of rows, and each row is a list of cell values.
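
        Examples:
            >>> page.extract_tables()  # check TATR regions first, then auto-detect
            >>> page.extract_tables(method='lattice', check_tatr=False)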
        """
        if table_settings is None:
            table_settings = {}

        # Check for TATR-detected table regions first if enabled
        if check_tatr:
            try:
                tatr_tables = self.find_all("region[type=table][model=tatr]")
                if tatr_tables:
                    logger.debug(
                        f"Page {self.number}: Found {len(tatr_tables)} TATR table regions, extracting from those..."
                    )
                    extracted_tables = []
                    for table_region in tatr_tables:
                        try:
                            table_data = table_region.extract_table(method="tatr")
                            if table_data:  # Only add non-empty tables
                                extracted_tables.append(table_data)
                        except Exception as e:
                            logger.warning(
                                f"Failed to extract table from TATR region {table_region.bbox}: {e}"
                            )

                    if extracted_tables:
                        logger.debug(
                            f"Page {self.number}: Successfully extracted {len(extracted_tables)} tables from TATR regions"
                        )
                        return extracted_tables
                    else:
                        logger.debug(
                            f"Page {self.number}: TATR regions found but no tables extracted, falling back to pdfplumber"
                        )
                else:
                    logger.debug(
                        f"Page {self.number}: No TATR table regions found, using pdfplumber methods"
                    )
            except Exception as e:
                logger.debug(
                    f"Page {self.number}: Error checking TATR regions: {e}, falling back to pdfplumber"
                )

        # Auto-detect method if not specified (try lattice first, then stream)
        if method is None:
            logger.debug(f"Page {self.number}: Auto-detecting table extraction method...")

            # Try lattice first
            try:
                lattice_settings = table_settings.copy()
                lattice_settings.setdefault("vertical_strategy", "lines")
                lattice_settings.setdefault("horizontal_strategy", "lines")

                logger.debug(f"Page {self.number}: Trying 'lattice' method first for tables...")
                lattice_result = self._page.extract_tables(lattice_settings)

                # Check if lattice found meaningful tables
                if (
                    lattice_result
                    and len(lattice_result) > 0
                    and any(
                        any(
                            any(cell and cell.strip() for cell in row if cell)
                            for row in table
                            if table
                        )
                        for table in lattice_result
                    )
                ):
                    logger.debug(
                        f"Page {self.number}: 'lattice' method found {len(lattice_result)} tables"
                    )
                    return lattice_result
                else:
                    logger.debug(f"Page {self.number}: 'lattice' method found no meaningful tables")

            except Exception as e:
                logger.debug(f"Page {self.number}: 'lattice' method failed: {e}")

            # Fall back to stream
            logger.debug(f"Page {self.number}: Falling back to 'stream' method for tables...")
            stream_settings = table_settings.copy()
            stream_settings.setdefault("vertical_strategy", "text")
            stream_settings.setdefault("horizontal_strategy", "text")

            return self._page.extract_tables(stream_settings)

        effective_method = method

        # Handle method aliases
        if effective_method == "stream":
            logger.debug("Using 'stream' method alias for 'pdfplumber' with text-based strategies.")
            effective_method = "pdfplumber"
            table_settings.setdefault("vertical_strategy", "text")
            table_settings.setdefault("horizontal_strategy", "text")
        elif effective_method == "lattice":
            logger.debug(
                "Using 'lattice' method alias for 'pdfplumber' with line-based strategies."
            )
            effective_method = "pdfplumber"
            table_settings.setdefault("vertical_strategy", "lines")
            table_settings.setdefault("horizontal_strategy", "lines")

        # Use the selected method
        if effective_method == "pdfplumber":
            # ---------------------------------------------------------
            # Inject auto-computed or user-specified text tolerances so
            # pdfplumber uses the same numbers we used for word grouping
            # whenever the table algorithm relies on word positions.
            # ---------------------------------------------------------
            if "text" in (
                table_settings.get("vertical_strategy"),
                table_settings.get("horizontal_strategy"),
            ):
                logger.debug("Injecting text tolerances into table_settings for 'text' strategy.")
                pdf_cfg = getattr(self, "_config", getattr(self._parent, "_config", {}))
                if "text_x_tolerance" not in table_settings and "x_tolerance" not in table_settings:
                    x_tol = pdf_cfg.get("x_tolerance")
                    if x_tol is not None:
                        table_settings.setdefault("text_x_tolerance", x_tol)
                if "text_y_tolerance" not in table_settings and "y_tolerance" not in table_settings:
                    y_tol = pdf_cfg.get("y_tolerance")
                    if y_tol is not None:
                        table_settings.setdefault("text_y_tolerance", y_tol)

                # pdfplumber's text strategy benefits from a tight snap tolerance.
                if (
                    "snap_tolerance" not in table_settings
                    and "snap_x_tolerance" not in table_settings
                ):
                    # Derive from y_tol if available, else default 1
                    snap = max(1, round((pdf_cfg.get("y_tolerance", 1)) * 0.9))
                    table_settings.setdefault("snap_tolerance", snap)
                if (
                    "join_tolerance" not in table_settings
                    and "join_x_tolerance" not in table_settings
                ):
                    join = table_settings.get("snap_tolerance", 1)
                    table_settings.setdefault("join_tolerance", join)
                    table_settings.setdefault("join_x_tolerance", join)
                    table_settings.setdefault("join_y_tolerance", join)

            return self._page.extract_tables(table_settings)
        else:
            raise ValueError(
                f"Unknown table extraction method: '{method}'. Choose from 'pdfplumber', 'stream', 'lattice'."
            )

    def _load_elements(self):
        """Load all elements from the page via ElementManager."""
        self._element_mgr.load_elements()

    def _create_char_elements(self):
        """DEPRECATED: Use self._element_mgr.chars"""
        logger.warning("_create_char_elements is deprecated. Access via self._element_mgr.chars.")
        return self._element_mgr.chars  # Delegate

    def _process_font_information(self, char_dict):
        """DEPRECATED: Handled by ElementManager"""
        logger.warning("_process_font_information is deprecated. Handled by ElementManager.")
        # ElementManager handles this internally
        pass

    def _group_chars_into_words(self, keep_spaces=True, font_attrs=None):
        """DEPRECATED: Use self._element_mgr.words"""
        logger.warning("_group_chars_into_words is deprecated. Access via self._element_mgr.words.")
        return self._element_mgr.words  # Delegate

    def _process_line_into_words(self, line_chars, keep_spaces, font_attrs):
        """DEPRECATED: Handled by ElementManager"""
        logger.warning("_process_line_into_words is deprecated. Handled by ElementManager.")
        pass

    def _check_font_attributes_match(self, char, prev_char, font_attrs):
        """DEPRECATED: Handled by ElementManager"""
        logger.warning("_check_font_attributes_match is deprecated. Handled by ElementManager.")
        pass

    def _create_word_element(self, chars, font_attrs):
        """DEPRECATED: Handled by ElementManager"""
        logger.warning("_create_word_element is deprecated. Handled by ElementManager.")
        pass

    @property
    def chars(self) -> List[Any]:
        """Get all character elements on this page."""
        return self._element_mgr.chars

    @property
    def words(self) -> List[Any]:
        """Get all word elements on this page."""
        return self._element_mgr.words

    @property
    def rects(self) -> List[Any]:
        """Get all rectangle elements on this page."""
        return self._element_mgr.rects

    @property
    def lines(self) -> List[Any]:
        """Get all line elements on this page."""
        return self._element_mgr.lines

    def highlight(
        self,
        bbox: Optional[Tuple[float, float, float, float]] = None,
        color: Optional[Union[Tuple, str]] = None,
        label: Optional[str] = None,
        use_color_cycling: bool = False,
        element: Optional[Any] = None,
        include_attrs: Optional[List[str]] = None,
        existing: str = "append",
    ) -> "Page":
        """
        Highlight a bounding box or the entire page.
        Delegates to the central HighlightingService.

        Args:
            bbox: Bounding box (x0, top, x1, bottom). If None, highlight entire page.
            color: RGBA color tuple/string for the highlight.
            label: Optional label for the highlight.
            use_color_cycling: If True and no label/color, use next cycle color.
            element: Optional original element being highlighted (for attribute extraction).
            include_attrs: List of attribute names from 'element' to display.
            existing: How to handle existing highlights ('append' or 'replace').

        Returns:
            Self for method chaining.
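
        Examples:
            >>> page.highlight()  # highlight the entire page
            >>> page.highlight(bbox=(0, 0, 200, 100), color='red', label='Header')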
        """
        target_bbox = bbox if bbox is not None else (0, 0, self.width, self.height)
        self._highlighter.add(
            page_index=self.index,
            bbox=target_bbox,
            color=color,
            label=label,
            use_color_cycling=use_color_cycling,
            element=element,
            include_attrs=include_attrs,
            existing=existing,
        )
        return self

    def highlight_polygon(
        self,
        polygon: List[Tuple[float, float]],
        color: Optional[Union[Tuple, str]] = None,
        label: Optional[str] = None,
        use_color_cycling: bool = False,
        element: Optional[Any] = None,
        include_attrs: Optional[List[str]] = None,
        existing: str = "append",
    ) -> "Page":
        """
        Highlight a polygon shape on the page.
        Delegates to the central HighlightingService.

        Args:
            polygon: List of (x, y) points defining the polygon.
            color: RGBA color tuple/string for the highlight.
            label: Optional label for the highlight.
            use_color_cycling: If True and no label/color, use next cycle color.
            element: Optional original element being highlighted (for attribute extraction).
            include_attrs: List of attribute names from 'element' to display.
            existing: How to handle existing highlights ('append' or 'replace').

        Returns:
            Self for method chaining.
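
        Examples:
            >>> page.highlight_polygon([(10, 10), (100, 10), (55, 80)], label='Callout')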
        """
        self._highlighter.add_polygon(
            page_index=self.index,
            polygon=polygon,
            color=color,
            label=label,
            use_color_cycling=use_color_cycling,
            element=element,
            include_attrs=include_attrs,
            existing=existing,
        )
        return self

    def show(
        self,
        resolution: float = 144,
        width: Optional[int] = None,
        labels: bool = True,
        legend_position: str = "right",
        render_ocr: bool = False,
    ) -> Optional[Image.Image]:
        """
        Generates and returns an image of the page with persistent highlights rendered.

        Args:
            resolution: Resolution in DPI for rendering (default: 144 DPI, equivalent to previous scale=2.0).
            width: Optional width for the output image.
            labels: Whether to include a legend for labels.
            legend_position: Position of the legend.
            render_ocr: Whether to render OCR text.

        Returns:
            PIL Image object of the page with highlights, or None if rendering fails.
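
        Examples:
            >>> page.show()  # render at 144 DPI with a highlight legend
            >>> page.show(width=800, labels=False)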
        """
        return self.to_image(
            resolution=resolution,
            width=width,
            labels=labels,
            legend_position=legend_position,
            render_ocr=render_ocr,
            include_highlights=True,  # Ensure highlights are requested
        )

    def save_image(
        self,
        filename: str,
        width: Optional[int] = None,
        labels: bool = True,
        legend_position: str = "right",
        render_ocr: bool = False,
        include_highlights: bool = True,  # Allow saving without highlights
        resolution: float = 144,
        **kwargs,
    ) -> "Page":
        """
        Save the page image to a file, rendering highlights via HighlightingService.

        Args:
            filename: Path to save the image to.
            width: Optional width for the output image.
            labels: Whether to include a legend.
            legend_position: Position of the legend.
            render_ocr: Whether to render OCR text.
            include_highlights: Whether to render highlights.
            resolution: Resolution in DPI for base image rendering (default: 144 DPI, equivalent to previous scale=2.0).
            **kwargs: Additional args for pdfplumber's to_image.

        Returns:
            Self for method chaining.
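
        Examples:
            >>> page.save_image('page.png')  # save with highlights rendered
            >>> page.save_image('clean.png', include_highlights=False, resolution=300)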
        """
        # Use to_image to generate and save the image
        self.to_image(
            path=filename,
            width=width,
            labels=labels,
            legend_position=legend_position,
            render_ocr=render_ocr,
            include_highlights=include_highlights,
            resolution=resolution,
            **kwargs,
        )
        return self

    def clear_highlights(self) -> "Page":
        """
        Clear all highlights *from this specific page* via HighlightingService.

        Returns:
            Self for method chaining
        """
        self._highlighter.clear_page(self.index)
        return self

    def analyze_text_styles(
        self, options: Optional[TextStyleOptions] = None
    ) -> "ElementCollection":
        """
        Analyze text elements by style, adding attributes directly to elements.

        This method uses TextStyleAnalyzer to process text elements (typically words)
        on the page. It adds the following attributes to each processed element:
        - style_label: A descriptive or numeric label for the style group.
        - style_key: A hashable tuple representing the style properties used for grouping.
        - style_properties: A dictionary containing the extracted style properties.

        Args:
            options: Optional TextStyleOptions to configure the analysis.
                     If None, the analyzer's default options are used.

        Returns:
            ElementCollection containing all processed text elements with added style attributes.
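
        Examples:
            >>> styled = page.analyze_text_styles()
            >>> # elements now carry style_label, style_key, and style_properties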
        """
        # Create analyzer (optionally pass default options from PDF config here)
        # For now, it uses its own defaults if options=None
        analyzer = TextStyleAnalyzer()

        # Analyze the page. The analyzer now modifies elements directly
        # and returns the collection of processed elements.
        processed_elements_collection = analyzer.analyze(self, options=options)

        # Return the collection of elements which now have style attributes
        return processed_elements_collection

    def to_image(
        self,
        path: Optional[str] = None,
        width: Optional[int] = None,
        labels: bool = True,
        legend_position: str = "right",
        render_ocr: bool = False,
        resolution: Optional[float] = None,
        include_highlights: bool = True,
        exclusions: Optional[str] = None,  # New parameter
        **kwargs,
    ) -> Optional[Image.Image]:
        """
        Generate a PIL image of the page, using HighlightingService if needed.

        Args:
            path: Optional path to save the image to.
            width: Optional width for the output image.
            labels: Whether to include a legend for highlights.
            legend_position: Position of the legend.
            render_ocr: Whether to render OCR text on highlights.
            resolution: Resolution in DPI for base page image. If None, uses global setting or defaults to 144 DPI.
            include_highlights: Whether to render highlights.
            exclusions: Accepts one of the following:
                        • None  – no masking (default)
                        • "mask" – mask using solid white (back-compat)
                        • CSS/HTML colour string (e.g. "red", "#ff0000", "#ff000080")
                        • Tuple of RGB or RGBA values (ints 0-255 or floats 0-1)
                        All excluded regions are filled with this colour.
            **kwargs: Additional parameters for pdfplumber.to_image.

        Returns:
            PIL Image of the page, or None if rendering fails.
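
        Examples:
            >>> img = page.to_image(width=800)  # in-memory PIL image
            >>> page.to_image(path='page.png', exclusions='red')  # mask exclusions in red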
        """
        # Apply global options as defaults, but allow explicit parameters to override
        import natural_pdf

        # Use global options if parameters are not explicitly set
        if width is None:
            width = natural_pdf.options.image.width
        if resolution is None:
            if natural_pdf.options.image.resolution is not None:
                resolution = natural_pdf.options.image.resolution
            else:
                resolution = 144  # Default resolution when none specified
        # 1. Create cache key (excluding path)
        cache_key_parts = [
            width,
            labels,
            legend_position,
            render_ocr,
            resolution,
            include_highlights,
            exclusions,
        ]
        # Convert kwargs to a stable, hashable representation
        sorted_kwargs_list = []
        for k, v in sorted(kwargs.items()):
            if isinstance(v, list):
                try:
                    v = tuple(v)  # Convert lists to tuples
                except TypeError:  # pragma: no cover
                    # If list contains unhashable items, fall back to repr or skip
                    # For simplicity, we'll try to proceed; hashing will fail if v remains unhashable
                    logger.warning(
                        f"Cache key generation: List item in kwargs['{k}'] could not be converted to tuple due to unhashable elements."
                    )
            sorted_kwargs_list.append((k, v))

        cache_key_parts.append(tuple(sorted_kwargs_list))

        try:
            cache_key = tuple(cache_key_parts)
        except TypeError as e:  # pragma: no cover
            logger.warning(
                f"Page {self.index}: Could not create cache key for to_image due to unhashable item: {e}. Proceeding without cache for this call."
            )
            cache_key = None  # Fallback to not using cache for this call

        image_to_return: Optional[Image.Image] = None

        # 2. Check cache
        if cache_key is not None and cache_key in self._to_image_cache:
            image_to_return = self._to_image_cache[cache_key]
            logger.debug(f"Page {self.index}: Returning cached image for key: {cache_key}")
        else:
            # --- This is the original logic to generate the image ---
            rendered_image_component: Optional[Image.Image] = None
            render_resolution = resolution
            thread_id = threading.current_thread().name
            logger.debug(
                f"[{thread_id}] Page {self.index}: Attempting to acquire pdf_render_lock for to_image..."
            )
            lock_wait_start = time.monotonic()
            try:
                # Acquire the global PDF rendering lock
                with pdf_render_lock:
                    lock_acquired_time = time.monotonic()
                    logger.debug(
                        f"[{thread_id}] Page {self.index}: Acquired pdf_render_lock (waited {lock_acquired_time - lock_wait_start:.2f}s). Starting render..."
                    )
                    if include_highlights:
                        # Delegate rendering to the central service
                        rendered_image_component = self._highlighter.render_page(
                            page_index=self.index,
                            resolution=render_resolution,
                            labels=labels,
                            legend_position=legend_position,
                            render_ocr=render_ocr,
                            **kwargs,
                        )
                    else:
                        rendered_image_component = render_plain_page(self, render_resolution)
            except Exception as e:
                logger.error(f"Error rendering page {self.index}: {e}", exc_info=True)
                # rendered_image_component remains None
            finally:
                render_end_time = time.monotonic()
                logger.debug(
                    f"[{thread_id}] Page {self.index}: Released pdf_render_lock. Total render time (incl. lock wait): {render_end_time - lock_wait_start:.2f}s"
                )

            if rendered_image_component is None:
                if cache_key is not None:
                    self._to_image_cache[cache_key] = None  # Cache the failure
                # Nothing to save even if a path was provided
                return None

            # --- Apply exclusion masking if requested ---
            # This modifies 'rendered_image_component'
            image_after_masking = rendered_image_component  # Start with the rendered image

            # Determine if masking is requested and establish the fill colour
            mask_requested = exclusions is not None and self._exclusions
            mask_color: Union[str, Tuple[int, int, int, int]] = "white"  # default

            if mask_requested:
                if exclusions != "mask":
                    # Attempt to parse custom colour input
                    try:
                        if isinstance(exclusions, tuple):
                            # Handle RGB/RGBA tuples with ints 0-255 or floats 0-1
                            processed = []
                            all_float = all(isinstance(c, float) for c in exclusions)
                            for i, c in enumerate(exclusions):
                                if isinstance(c, float):
                                    val = int(c * 255) if all_float or i == 3 else int(c)
                                else:
                                    val = int(c)
                                processed.append(max(0, min(255, val)))
                            if len(processed) == 3:
                                processed.append(255)  # add full alpha
                            mask_color = tuple(processed)  # type: ignore[assignment]
                        elif isinstance(exclusions, str):
                            # Try using the optional 'colour' library for rich parsing
                            try:
                                from colour import Color  # type: ignore

                                color_obj = Color(exclusions)
                                mask_color = (
                                    int(color_obj.red * 255),
                                    int(color_obj.green * 255),
                                    int(color_obj.blue * 255),
                                    255,
                                )
                            except Exception:
                                # Fallback: if parsing fails, treat as plain string accepted by PIL
                                mask_color = exclusions  # e.g. "red"
                        else:
                            logger.warning(
                                f"Unsupported exclusions color spec: {exclusions!r}. Using white."
                            )
                    except Exception as color_parse_err:  # pragma: no cover
                        logger.warning(
                            f"Failed to parse exclusions color {exclusions!r}: {color_parse_err}. Using white."
                        )

                try:
                    # Ensure image is mutable (RGB or RGBA)
                    if image_after_masking.mode not in ("RGB", "RGBA"):
                        image_after_masking = image_after_masking.convert("RGB")

                    exclusion_regions = self._get_exclusion_regions(
                        include_callable=True, debug=False
                    )
                    if exclusion_regions:
                        draw = ImageDraw.Draw(image_after_masking)
                        # Scaling factor for converting PDF pts → image px
                        img_scale = render_resolution / 72.0

                        # Determine a fill color compatible with the current image mode
                        def _mode_compatible(color):
                            if isinstance(color, tuple) and image_after_masking.mode != "RGBA":
                                return color[:3]  # drop alpha for RGB images
                            return color

                        fill_color = _mode_compatible(mask_color)

                        for region in exclusion_regions:
                            img_x0 = region.x0 * img_scale
                            img_top = region.top * img_scale
                            img_x1 = region.x1 * img_scale
                            img_bottom = region.bottom * img_scale

                            # Clamp the rectangle to the image bounds
                            img_coords = (
                                max(0, img_x0),
                                max(0, img_top),
                                min(image_after_masking.width, img_x1),
                                min(image_after_masking.height, img_bottom),
                            )
                            if img_coords[0] < img_coords[2] and img_coords[1] < img_coords[3]:
                                draw.rectangle(img_coords, fill=fill_color)
                            else:  # pragma: no cover
                                logger.warning(
                                    f"Skipping invalid exclusion rect for masking: {img_coords}"
                                )
                        del draw  # Release drawing context
                except Exception as mask_error:  # pragma: no cover
                    logger.error(
                        f"Error applying exclusion mask to page {self.index}: {mask_error}",
                        exc_info=True,
                    )
                    # Continue with potentially unmasked or partially masked image

            # --- Resize the final image if width is provided ---
            image_final_content = image_after_masking  # Start with image after masking
            if width is not None and width > 0 and image_final_content.width > 0:
                aspect_ratio = image_final_content.height / image_final_content.width
                height = int(width * aspect_ratio)
                try:
                    image_final_content = image_final_content.resize(
                        (width, height), Image.Resampling.LANCZOS
                    )
                except Exception as resize_error:  # pragma: no cover
                    logger.warning(f"Could not resize image: {resize_error}")
                    # image_final_content remains the un-resized version if resize fails

            # Store in cache
            if cache_key is not None:
                self._to_image_cache[cache_key] = image_final_content
                logger.debug(f"Page {self.index}: Cached image for key: {cache_key}")
            image_to_return = image_final_content
        # --- End of cache miss block ---

        # Save the image (either from cache or newly generated) if path is provided
        if path and image_to_return:
            try:
                # Ensure directory exists
                if os.path.dirname(path):  # Only call makedirs if there's a directory part
                    os.makedirs(os.path.dirname(path), exist_ok=True)
                image_to_return.save(path)
                logger.debug(f"Saved page image to: {path}")
            except Exception as save_error:  # pragma: no cover
                logger.error(f"Failed to save image to {path}: {save_error}")

        return image_to_return

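The exclusion-mask handling in `to_image` above accepts RGB/RGBA tuples with either ints in 0-255 or floats in 0-1. A minimal standalone sketch of that normalization logic (the `normalize_color` helper is illustrative, not part of the library API):

```python
def normalize_color(spec):
    """Normalize an RGB/RGBA tuple to clamped 0-255 ints with an alpha channel."""
    processed = []
    all_float = all(isinstance(c, float) for c in spec)
    for i, c in enumerate(spec):
        if isinstance(c, float):
            # Floats are treated as 0-1 when the whole tuple is float;
            # the alpha channel (index 3) is always scaled.
            val = int(c * 255) if all_float or i == 3 else int(c)
        else:
            val = int(c)
        processed.append(max(0, min(255, val)))
    if len(processed) == 3:
        processed.append(255)  # add full alpha
    return tuple(processed)


normalize_color((1.0, 0.5, 0.0))  # → (255, 127, 0, 255)
```

Three-element tuples gain a fully opaque alpha so the result can be drawn onto RGBA images directly.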
    def _create_text_elements_from_ocr(
        self, ocr_results: List[Dict[str, Any]], image_width=None, image_height=None
    ) -> List["TextElement"]:
        """DEPRECATED: Use self._element_mgr.create_text_elements_from_ocr"""
        logger.warning(
            "_create_text_elements_from_ocr is deprecated. Use self._element_mgr version."
        )
        return self._element_mgr.create_text_elements_from_ocr(
            ocr_results, image_width, image_height
        )

    def apply_ocr(
        self,
        engine: Optional[str] = None,
        options: Optional["OCROptions"] = None,
        languages: Optional[List[str]] = None,
        min_confidence: Optional[float] = None,
        device: Optional[str] = None,
        resolution: Optional[int] = None,
        detect_only: bool = False,
        apply_exclusions: bool = True,
        replace: bool = True,
    ) -> "Page":
        """
        Apply OCR to THIS page and add results to page elements via PDF.apply_ocr.

        Args:
            engine: Name of the OCR engine.
            options: Engine-specific options object or dict.
            languages: List of engine-specific language codes.
            min_confidence: Minimum confidence threshold.
            device: Device to run OCR on.
            resolution: DPI resolution for rendering page image before OCR.
            detect_only: If True, only detect text bounding boxes; don't perform OCR.
            apply_exclusions: If True (default), render page image for OCR
                              with excluded areas masked (whited out).
            replace: If True (default), remove any existing OCR elements before
                    adding new ones. If False, add new OCR elements to existing ones.

        Returns:
            Self for method chaining.
        """
        if not hasattr(self._parent, "apply_ocr"):
            logger.error(f"Page {self.number}: Parent PDF missing 'apply_ocr'. Cannot apply OCR.")
            return self  # Return self for chaining

        # Remove existing OCR elements if replace is True
        if replace and hasattr(self, "_element_mgr"):
            logger.info(
                f"Page {self.number}: Removing existing OCR elements before applying new OCR."
            )
            self._element_mgr.remove_ocr_elements()

        logger.info(f"Page {self.number}: Delegating apply_ocr to PDF.apply_ocr.")
        # Delegate to parent PDF, targeting only this page's index
        # Pass all relevant parameters through, including apply_exclusions
        self._parent.apply_ocr(
            pages=[self.index],
            engine=engine,
            options=options,
            languages=languages,
            min_confidence=min_confidence,
            device=device,
            resolution=resolution,
            detect_only=detect_only,
            apply_exclusions=apply_exclusions,
            replace=replace,  # Pass the replace parameter to PDF.apply_ocr
        )

        # Return self for chaining
        return self

    def extract_ocr_elements(
        self,
        engine: Optional[str] = None,
        options: Optional["OCROptions"] = None,
        languages: Optional[List[str]] = None,
        min_confidence: Optional[float] = None,
        device: Optional[str] = None,
        resolution: Optional[int] = None,
    ) -> List["TextElement"]:
        """
        Extract text elements using OCR *without* adding them to the page's elements.
        Uses the shared OCRManager instance.

        Args:
            engine: Name of the OCR engine.
            options: Engine-specific options object or dict.
            languages: List of engine-specific language codes.
            min_confidence: Minimum confidence threshold.
            device: Device to run OCR on.
            resolution: DPI resolution for rendering page image before OCR.

        Returns:
            List of created TextElement objects derived from OCR results for this page.
        """
        if not self._ocr_manager:
            logger.error(
                f"Page {self.number}: OCRManager not available. Cannot extract OCR elements."
            )
            return []

        logger.info(f"Page {self.number}: Extracting OCR elements (extract only)...")

        # Determine rendering resolution
        final_resolution = resolution if resolution is not None else 150  # Default to 150 DPI
        logger.debug(f"  Using rendering resolution: {final_resolution} DPI")

        try:
            # Get base image without highlights using the determined resolution
            # Use the global PDF rendering lock
            with pdf_render_lock:
                image = self.to_image(resolution=final_resolution, include_highlights=False)
                if not image:
                    logger.error(
                        f"  Failed to render page {self.number} to image for OCR extraction."
                    )
                    return []
                logger.debug(f"  Rendered image size: {image.width}x{image.height}")
        except Exception as e:
            logger.error(f"  Failed to render page {self.number} to image: {e}", exc_info=True)
            return []

        # Prepare arguments for the OCR Manager call
        manager_args = {
            "images": image,
            "engine": engine,
            "languages": languages,
            "min_confidence": min_confidence,
            "device": device,
            "options": options,
        }
        manager_args = {k: v for k, v in manager_args.items() if v is not None}

        logger.debug(
            f"  Calling OCR Manager (extract only) with args: { {k:v for k,v in manager_args.items() if k != 'images'} }"
        )
        try:
            # apply_ocr now returns List[List[Dict]] or List[Dict]
            results_list = self._ocr_manager.apply_ocr(**manager_args)
            # If it returned a list of lists (batch mode), take the first list
            results = (
                results_list[0]
                if isinstance(results_list, list)
                and results_list
                and isinstance(results_list[0], list)
                else results_list
            )
            if not isinstance(results, list):
                logger.error(f"  OCR Manager returned unexpected type: {type(results)}")
                results = []
            logger.info(f"  OCR Manager returned {len(results)} results for extraction.")
        except Exception as e:
            logger.error(f"  OCR processing failed during extraction: {e}", exc_info=True)
            return []

        # Convert results but DO NOT add to ElementManager
        logger.debug(f"  Converting OCR results to TextElements (extract only)...")
        temp_elements = []
        scale_x = self.width / image.width if image.width else 1
        scale_y = self.height / image.height if image.height else 1
        for result in results:
            try:  # Skip malformed OCR results instead of failing the whole batch
                x0, top, x1, bottom = [float(c) for c in result["bbox"]]
                elem_data = {
                    "text": result["text"],
                    "confidence": result["confidence"],
                    "x0": x0 * scale_x,
                    "top": top * scale_y,
                    "x1": x1 * scale_x,
                    "bottom": bottom * scale_y,
                    "width": (x1 - x0) * scale_x,
                    "height": (bottom - top) * scale_y,
                    "object_type": "text",  # Using text for temporary elements
                    "source": "ocr",
                    "fontname": "OCR-extract",  # Different name for clarity
                    "size": 10.0,
                    "page_number": self.number,
                }
                temp_elements.append(TextElement(elem_data, self))
            except (KeyError, ValueError, TypeError) as convert_err:
                logger.warning(
                    f"  Skipping invalid OCR result during conversion: {result}. Error: {convert_err}"
                )

        logger.info(f"  Created {len(temp_elements)} TextElements from OCR (extract only).")
        return temp_elements

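`extract_ocr_elements` maps OCR bounding boxes from image pixels back into PDF points by scaling with the page-to-image size ratio. The same mapping in isolation (function name and shapes are illustrative, not library API):

```python
def scale_bbox_to_pdf(bbox, page_size, image_size):
    """Map an OCR bbox in image pixels into PDF point coordinates.

    page_size and image_size are (width, height) pairs; a zero image
    dimension falls back to a scale of 1, as in the method above.
    """
    page_w, page_h = page_size
    img_w, img_h = image_size
    scale_x = page_w / img_w if img_w else 1
    scale_y = page_h / img_h if img_h else 1
    x0, top, x1, bottom = (float(c) for c in bbox)
    return (x0 * scale_x, top * scale_y, x1 * scale_x, bottom * scale_y)
```

For example, a box detected on an image rendered at twice the page size is halved on the way back to PDF coordinates.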
    @property
    def size(self) -> Tuple[float, float]:
        """Get the size of the page in points."""
        return (self._page.width, self._page.height)

    @property
    def layout_analyzer(self) -> "LayoutAnalyzer":
        """Get or create the layout analyzer for this page."""
        if self._layout_analyzer is None:
            if not self._layout_manager:
                logger.warning("LayoutManager not available, cannot create LayoutAnalyzer.")
                return None
            self._layout_analyzer = LayoutAnalyzer(self)
        return self._layout_analyzer

    def analyze_layout(
        self,
        engine: Optional[str] = None,
        options: Optional["LayoutOptions"] = None,
        confidence: Optional[float] = None,
        classes: Optional[List[str]] = None,
        exclude_classes: Optional[List[str]] = None,
        device: Optional[str] = None,
        existing: str = "replace",
        model_name: Optional[str] = None,
        client: Optional[Any] = None,  # Add client parameter
    ) -> "ElementCollection[Region]":
        """
        Analyze the page layout using the configured LayoutManager.
        Adds detected Region objects to the page's element manager.

        Returns:
            ElementCollection containing the detected Region objects.
        """
        analyzer = self.layout_analyzer
        if not analyzer:
            logger.error(
                "Layout analysis failed: LayoutAnalyzer not initialized (is LayoutManager available?)."
            )
            return ElementCollection([])  # Return empty collection

        # Clear existing detected regions if 'replace' is specified
        if existing == "replace":
            self.clear_detected_layout_regions()

        # The analyzer's analyze_layout method already adds regions to the page
        # and its element manager. We just need to retrieve them.
        analyzer.analyze_layout(
            engine=engine,
            options=options,
            confidence=confidence,
            classes=classes,
            exclude_classes=exclude_classes,
            device=device,
            existing=existing,
            model_name=model_name,
            client=client,  # Pass client down
        )

        # Retrieve the detected regions from the element manager
        # Filter regions based on source='detected' and potentially the model used if available
        detected_regions = [
            r
            for r in self._element_mgr.regions
            if r.source == "detected" and (not engine or getattr(r, "model", None) == engine)
        ]

        return ElementCollection(detected_regions)

    def clear_detected_layout_regions(self) -> "Page":
        """
        Removes all regions from this page that were added by layout analysis
        (i.e., regions where `source` attribute is 'detected').

        This clears the regions both from the page's internal `_regions['detected']` list
        and from the ElementManager's internal list of regions.

        Returns:
            Self for method chaining.
        """
        if (
            not hasattr(self._element_mgr, "regions")
            or not hasattr(self._element_mgr, "_elements")
            or "regions" not in self._element_mgr._elements
        ):
            logger.debug(
                f"Page {self.index}: No regions found in ElementManager, nothing to clear."
            )
            self._regions["detected"] = []  # Ensure page's list is also clear
            return self

        # Filter ElementManager's list to keep only non-detected regions
        original_count = len(self._element_mgr.regions)
        self._element_mgr._elements["regions"] = [
            r for r in self._element_mgr.regions if getattr(r, "source", None) != "detected"
        ]
        new_count = len(self._element_mgr.regions)
        removed_count = original_count - new_count

        # Clear the page's specific list of detected regions
        self._regions["detected"] = []

        logger.info(f"Page {self.index}: Cleared {removed_count} detected layout regions.")
        return self

    def get_section_between(
        self, start_element=None, end_element=None, boundary_inclusion="both"
    ) -> Optional["Region"]:  # Return Optional
        """
        Get a section between two elements on this page.
        """
        # Create a full-page region to operate within
        page_region = self.create_region(0, 0, self.width, self.height)

        # Delegate to the region's method
        try:
            return page_region.get_section_between(
                start_element=start_element,
                end_element=end_element,
                boundary_inclusion=boundary_inclusion,
            )
        except Exception as e:
            logger.error(
                f"Error getting section between elements on page {self.index}: {e}", exc_info=True
            )
            return None

    def split(self, divider, **kwargs) -> "ElementCollection[Region]":
        """
        Divides the page into sections based on the provided divider elements.
        """
        sections = self.get_sections(start_elements=divider, **kwargs)
        top = self.region(0, 0, self.width, sections[0].top)
        sections.append(top)

        return sections

    def get_sections(
        self,
        start_elements=None,
        end_elements=None,
        boundary_inclusion="start",
        y_threshold=5.0,
        bounding_box=None,
    ) -> "ElementCollection[Region]":
        """
        Get sections of a page defined by start/end elements.
        Uses the page-level implementation.

        Returns:
            An ElementCollection containing the found Region objects.
        """

        # Helper function to get bounds from bounding_box parameter
        def get_bounds():
            if bounding_box:
                x0, top, x1, bottom = bounding_box
                # Clamp to page boundaries
                return max(0, x0), max(0, top), min(self.width, x1), min(self.height, bottom)
            else:
                return 0, 0, self.width, self.height

        regions = []

        # Handle cases where elements are provided as strings (selectors)
        if isinstance(start_elements, str):
            start_elements = self.find_all(start_elements).elements  # Get list of elements
        elif hasattr(start_elements, "elements"):  # Handle ElementCollection input
            start_elements = start_elements.elements

        if isinstance(end_elements, str):
            end_elements = self.find_all(end_elements).elements
        elif hasattr(end_elements, "elements"):
            end_elements = end_elements.elements

        # Ensure start_elements is a list
        if start_elements is None:
            start_elements = []
        if end_elements is None:
            end_elements = []

        valid_inclusions = ["start", "end", "both", "none"]
        if boundary_inclusion not in valid_inclusions:
            raise ValueError(f"boundary_inclusion must be one of {valid_inclusions}")

        if not start_elements:
            # Return an empty ElementCollection if no start elements
            return ElementCollection([])

        # Combine start and end elements with their type
        all_boundaries = []
        for el in start_elements:
            all_boundaries.append((el, "start"))
        for el in end_elements:
            all_boundaries.append((el, "end"))

        # Sort all boundary elements primarily by top, then x0
        try:
            all_boundaries.sort(key=lambda x: (x[0].top, x[0].x0))
        except AttributeError as e:
            logger.error(f"Error sorting boundaries: Element missing top/x0 attribute? {e}")
            return ElementCollection([])  # Cannot proceed if elements lack position

        # Process sorted boundaries to find sections
        current_start_element = None
        active_section_started = False

        for element, element_type in all_boundaries:
            if element_type == "start":
                # If we have an active section, this start implicitly ends it
                if active_section_started:
                    end_boundary_el = element  # Use this start as the end boundary
                    # Determine region boundaries
                    sec_top = (
                        current_start_element.top
                        if boundary_inclusion in ["start", "both"]
                        else current_start_element.bottom
                    )
                    sec_bottom = (
                        end_boundary_el.top
                        if boundary_inclusion not in ["end", "both"]
                        else end_boundary_el.bottom
                    )

                    if sec_top < sec_bottom:  # Ensure valid region
                        x0, _, x1, _ = get_bounds()
                        region = self.create_region(x0, sec_top, x1, sec_bottom)
                        region.start_element = current_start_element
                        region.end_element = end_boundary_el  # Mark the element that ended it
                        region.is_end_next_start = True  # Mark how it ended
                        regions.append(region)
                    active_section_started = False  # Reset for the new start

                # Set this as the potential start of the next section
                current_start_element = element
                active_section_started = True

            elif element_type == "end" and active_section_started:
                # We found an explicit end for the current section
                end_boundary_el = element
                sec_top = (
                    current_start_element.top
                    if boundary_inclusion in ["start", "both"]
                    else current_start_element.bottom
                )
                sec_bottom = (
                    end_boundary_el.bottom
                    if boundary_inclusion in ["end", "both"]
                    else end_boundary_el.top
                )

                if sec_top < sec_bottom:  # Ensure valid region
                    x0, _, x1, _ = get_bounds()
                    region = self.create_region(x0, sec_top, x1, sec_bottom)
                    region.start_element = current_start_element
                    region.end_element = end_boundary_el
                    region.is_end_next_start = False
                    regions.append(region)

                # Reset: section ended explicitly
                current_start_element = None
                active_section_started = False

        # Handle the last section if it was started but never explicitly ended
        if active_section_started:
            sec_top = (
                current_start_element.top
                if boundary_inclusion in ["start", "both"]
                else current_start_element.bottom
            )
            x0, _, x1, page_bottom = get_bounds()
            if sec_top < page_bottom:
                region = self.create_region(x0, sec_top, x1, page_bottom)
                region.start_element = current_start_element
                region.end_element = None  # Ended by page end
                region.is_end_next_start = False
                regions.append(region)

        return ElementCollection(regions)

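The sweep in `get_sections` sorts start/end boundaries by position, opens a section at each start, and closes it at the next explicit end or the next start. A simplified one-dimensional sketch of that idea (a hypothetical helper working on top-coordinates only, with `boundary_inclusion` effectively fixed to `"start"`):

```python
def sections_from_boundaries(starts, ends, page_bottom):
    """Return (top, bottom) spans: each start opens a section, the next
    start or end closes it; a still-open section runs to the page bottom."""
    boundaries = sorted([(y, "start") for y in starts] + [(y, "end") for y in ends])
    sections, current = [], None
    for y, kind in boundaries:
        if kind == "start":
            if current is not None and current < y:
                sections.append((current, y))  # implicitly ended by the next start
            current = y
        elif kind == "end" and current is not None:
            if current < y:
                sections.append((current, y))
            current = None  # explicitly ended
    if current is not None and current < page_bottom:
        sections.append((current, page_bottom))  # ended by the page end
    return sections
```

The real method additionally tracks the boundary elements themselves, handles all four `boundary_inclusion` modes, and clamps to an optional bounding box.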
    def __repr__(self) -> str:
        """String representation of the page."""
        return f"<Page number={self.number} index={self.index}>"

    def ask(
        self,
        question: Union[str, List[str], Tuple[str, ...]],
        min_confidence: float = 0.1,
        model: str = None,
        debug: bool = False,
        **kwargs,
    ) -> Union[Dict[str, Any], List[Dict[str, Any]]]:
        """
        Ask a question about the page content using document QA.
        """
        try:
            from natural_pdf.qa.document_qa import get_qa_engine

            # Get or initialize QA engine with specified model
            qa_engine = get_qa_engine(model_name=model) if model else get_qa_engine()
            # Ask the question using the QA engine
            return qa_engine.ask_pdf_page(
                self, question, min_confidence=min_confidence, debug=debug, **kwargs
            )
        except ImportError:
            logger.error(
                "Question answering requires the 'natural_pdf.qa' module. Please install necessary dependencies."
            )
            return {
                "answer": None,
                "confidence": 0.0,
                "found": False,
                "page_num": self.number,
                "source_elements": [],
            }
        except Exception as e:
            logger.error(f"Error during page.ask: {e}", exc_info=True)
            return {
                "answer": None,
                "confidence": 0.0,
                "found": False,
                "page_num": self.number,
                "source_elements": [],
            }

    def show_preview(
        self,
        temporary_highlights: List[Dict],
        resolution: float = 144,
        width: Optional[int] = None,
        labels: bool = True,
        legend_position: str = "right",
        render_ocr: bool = False,
    ) -> Optional[Image.Image]:
        """
        Generates and returns a non-stateful preview image containing only
        the provided temporary highlights.

        Args:
            temporary_highlights: List of highlight data dictionaries (as prepared by
                                  ElementCollection._prepare_highlight_data).
            resolution: Resolution in DPI for rendering (default: 144 DPI, equivalent to previous scale=2.0).
            width: Optional width for the output image.
            labels: Whether to include a legend.
            legend_position: Position of the legend.
            render_ocr: Whether to render OCR text.

        Returns:
            PIL Image object of the preview, or None if rendering fails.
        """
        try:
            # Delegate rendering to the highlighter service's preview method
            img = self._highlighter.render_preview(
                page_index=self.index,
                temporary_highlights=temporary_highlights,
                resolution=resolution,
                labels=labels,
                legend_position=legend_position,
                render_ocr=render_ocr,
            )
        except AttributeError:
            logger.error(f"HighlightingService does not have the required 'render_preview' method.")
            return None
        except Exception as e:
            logger.error(
                f"Error calling highlighter.render_preview for page {self.index}: {e}",
                exc_info=True,
            )
            return None

        # Return the rendered image directly
        return img

    @property
    def text_style_labels(self) -> List[str]:
        """
        Get a sorted list of unique text style labels found on the page.

        Runs text style analysis with default options if it hasn't been run yet.
        To use custom options, call `analyze_text_styles(options=...)` explicitly first.

        Returns:
            A sorted list of unique style label strings.
        """
        # Check if the summary attribute exists from a previous run
        if not hasattr(self, "_text_styles_summary") or not self._text_styles_summary:
            # If not, run the analysis with default options
            logger.debug(f"Page {self.number}: Running default text style analysis to get labels.")
            self.analyze_text_styles()  # Use default options

        # Extract labels from the summary dictionary
        if hasattr(self, "_text_styles_summary") and self._text_styles_summary:
            # The summary maps style_key -> {'label': ..., 'properties': ...}
            labels = {style_info["label"] for style_info in self._text_styles_summary.values()}
            return sorted(list(labels))
        else:
            # Fallback if summary wasn't created for some reason (e.g., no text elements)
            logger.warning(f"Page {self.number}: Text style summary not found after analysis.")
            return []

    def viewer(self) -> Optional["InteractiveViewerWidget"]:
        """
        Creates and returns an interactive ipywidget for exploring elements on this page.

        Uses InteractiveViewerWidget.from_page() to create the viewer.

        Returns:
            An InteractiveViewerWidget instance ready for display in Jupyter,
            or None if ipywidgets is not installed or widget creation fails.

        Raises:
            ValueError: If image rendering or data preparation fails within from_page.
        """
        # Check for availability using the imported flag and class variable
        if not _IPYWIDGETS_AVAILABLE or InteractiveViewerWidget is None:
            logger.error(
                "Interactive viewer requires 'ipywidgets'. "
                'Please install with: pip install "ipywidgets>=7.0.0,<10.0.0"'
            )
            return None  # Degrade gracefully instead of raising ImportError

        # If we reach here, InteractiveViewerWidget should be the actual class
        try:
            # Pass self (the Page object) to the factory method
            return InteractiveViewerWidget.from_page(self)
        except Exception as e:
            # Catch potential errors during widget creation (e.g., image rendering)
            logger.error(
                f"Error creating viewer widget from page {self.number}: {e}", exc_info=True
            )
            return None  # Degrade gracefully instead of re-raising

    # --- Indexable Protocol Methods ---
    def get_id(self) -> str:
        """Returns a unique identifier for the page (required by Indexable protocol)."""
        # Ensure path is safe for use in IDs (replace problematic chars)
        safe_path = re.sub(r"[^a-zA-Z0-9_-]", "_", str(self.pdf.path))
        return f"pdf_{safe_path}_page_{self.page_number}"

    def get_metadata(self) -> Dict[str, Any]:
        """Returns metadata associated with the page (required by Indexable protocol)."""
        # Add content hash here for sync
        metadata = {
            "pdf_path": str(self.pdf.path),
            "page_number": self.page_number,
            "width": self.width,
            "height": self.height,
            "content_hash": self.get_content_hash(),  # Include the hash
        }
        return metadata

    def get_content(self) -> "Page":
        """
        Returns the primary content object (self) for indexing (required by Indexable protocol).
        SearchService implementations decide how to process this (e.g., call extract_text).
        """
        return self  # Return the Page object itself

    def get_content_hash(self) -> str:
        """Returns a SHA256 hash of the extracted text content (required by Indexable for sync)."""
        # Hash the extracted text (without exclusions for consistency)
        # Consider if exclusions should be part of the hash? For now, hash raw text.
        # Using extract_text directly might be slow if called repeatedly. Cache? TODO: Optimization
        text_content = self.extract_text(
            use_exclusions=False, preserve_whitespace=False
        )  # Normalize whitespace?
        return hashlib.sha256(text_content.encode("utf-8")).hexdigest()

    # --- New Method: save_searchable ---
    def save_searchable(self, output_path: Union[str, "Path"], dpi: int = 300, **kwargs):
        """
        Saves the PDF page with an OCR text layer, making content searchable.

        Requires optional dependencies. Install with: pip install "natural-pdf[ocr-save]"

        Note: OCR must have been applied to the pages beforehand
              (e.g., pdf.apply_ocr()).

        Args:
            output_path: Path to save the searchable PDF.
            dpi: Resolution for rendering and OCR overlay (default 300).
            **kwargs: Additional keyword arguments passed to the exporter.
        """
        # Import moved here, assuming it's always available now
        from natural_pdf.exporters.searchable_pdf import create_searchable_pdf

        # Convert pathlib.Path to string if necessary
        output_path_str = str(output_path)

        create_searchable_pdf(self, output_path_str, dpi=dpi, **kwargs)
        logger.info(f"Searchable PDF saved to: {output_path_str}")

    # --- Added correct_ocr method ---
    def correct_ocr(
        self,
        correction_callback: Callable[[Any], Optional[str]],
        selector: Optional[str] = "text[source=ocr]",
        max_workers: Optional[int] = None,
        progress_callback: Optional[Callable[[], None]] = None,  # Added progress callback
    ) -> "Page":  # Return self for chaining
        """
        Applies corrections to OCR-generated text elements on this page
        using a user-provided callback function, potentially in parallel.

        Finds text elements on this page whose 'source' attribute starts
        with 'ocr' and calls the `correction_callback` for each, passing the
        element itself. Updates the element's text if the callback returns
        a new string.

        Args:
            correction_callback: A function accepting an element and returning
                                 `Optional[str]` (new text or None).
            selector: CSS-like selector for the elements to correct
                      (default: "text[source=ocr]").
            max_workers: The maximum number of threads to use for parallel execution.
                         If None, 0, or 1, runs sequentially.
            progress_callback: Optional callback function to call after processing each element.

        Returns:
            Self for method chaining.
        """
        logger.info(
            f"Page {self.number}: Starting OCR correction with callback '{correction_callback.__name__}' (max_workers={max_workers})"
        )

        target_elements_collection = self.find_all(selector=selector, apply_exclusions=False)
        target_elements = target_elements_collection.elements  # Get the list

        if not target_elements:
            logger.info(f"Page {self.number}: No OCR elements found to correct.")
            return self

        element_pbar = None
        try:
            element_pbar = tqdm(
                total=len(target_elements),
                desc=f"Correcting OCR Page {self.number}",
                unit="element",
                leave=False,
            )

            processed_count = 0
            updated_count = 0
            error_count = 0

            # Define the task to be run by the worker thread or sequentially
            def _process_element_task(element):
                try:
                    current_text = getattr(element, "text", None)
                    # Call the user-provided callback
                    corrected_text = correction_callback(element)

                    # Validate result type
                    if corrected_text is not None and not isinstance(corrected_text, str):
                        logger.warning(
                            f"Page {self.number}: Correction callback for element '{getattr(element, 'text', '')[:20]}...' returned non-string, non-None type: {type(corrected_text)}. Skipping update."
                        )
                        return element, None, None  # Treat as no correction

                    return element, corrected_text, None  # Return element, result, no error
                except Exception as e:
                    logger.error(
                        f"Page {self.number}: Error applying correction callback to element '{getattr(element, 'text', '')[:30]}...' ({element.bbox}): {e}",
                        exc_info=False,  # Keep log concise
                    )
                    return element, None, e  # Return element, no result, error
                finally:
                    # --- Update internal tqdm progress bar ---
                    if element_pbar:
                        element_pbar.update(1)
                    # --- Call user's progress callback --- #
                    if progress_callback:
                        try:
                            progress_callback()
                        except Exception as cb_e:
                            # Log error in callback itself, but don't stop processing
                            logger.error(
                                f"Page {self.number}: Error executing progress_callback: {cb_e}",
                                exc_info=False,
                            )

            # Choose execution strategy based on max_workers
            if max_workers is not None and max_workers > 1:
                # --- Parallel execution --- #
                logger.info(
                    f"Page {self.number}: Running OCR correction in parallel with {max_workers} workers."
                )
                futures = []
                with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
                    # Submit all tasks
                    future_to_element = {
                        executor.submit(_process_element_task, element): element
                        for element in target_elements
                    }

                    # Process results as they complete (progress_callback called by worker)
                    for future in concurrent.futures.as_completed(future_to_element):
                        processed_count += 1
                        try:
                            element, corrected_text, error = future.result()
                            if error:
                                error_count += 1
                                # Error already logged in worker
                            elif corrected_text is not None:
                                # Apply correction if text changed
                                current_text = getattr(element, "text", None)
                                if corrected_text != current_text:
                                    element.text = corrected_text
                                    updated_count += 1
                        except Exception as exc:
                            # Catch errors from future.result() itself
                            element = future_to_element[future]  # Find original element
                            logger.error(
                                f"Page {self.number}: Internal error retrieving correction result for element {element.bbox}: {exc}",
                                exc_info=True,
                            )
                            error_count += 1
                            # Note: progress_callback was already called in the worker's finally block

            else:
                # --- Sequential execution --- #
                logger.info(f"Page {self.number}: Running OCR correction sequentially.")
                for element in target_elements:
                    # Call the task function directly (it handles progress_callback)
                    processed_count += 1
                    _element, corrected_text, error = _process_element_task(element)
                    if error:
                        error_count += 1
                    elif corrected_text is not None:
                        # Apply correction if text changed
                        current_text = getattr(_element, "text", None)
                        if corrected_text != current_text:
                            _element.text = corrected_text
                            updated_count += 1

            logger.info(
                f"Page {self.number}: OCR correction finished. Processed: {processed_count}/{len(target_elements)}, Updated: {updated_count}, Errors: {error_count}."
            )

            return self  # Return self for chaining
        finally:
            if element_pbar:
                element_pbar.close()

    # --- Classification Mixin Implementation --- #
    def _get_classification_manager(self) -> "ClassificationManager":
        if not hasattr(self, "pdf") or not hasattr(self.pdf, "get_manager"):
            raise AttributeError(
                "ClassificationManager cannot be accessed: Parent PDF or get_manager method missing."
            )
        try:
            # Use the PDF's manager registry accessor
            return self.pdf.get_manager("classification")
        except (ValueError, RuntimeError, AttributeError) as e:
            # Wrap potential errors from get_manager for clarity
            raise AttributeError(f"Failed to get ClassificationManager from PDF: {e}") from e

    def _get_classification_content(
        self, model_type: str, **kwargs
    ) -> Union[str, "Image"]:  # Use "Image" for lazy import
        if model_type == "text":
            text_content = self.extract_text(
                layout=False, use_exclusions=False
            )  # Simple join, ignore exclusions for classification
            if not text_content or text_content.isspace():
                raise ValueError("Cannot classify page with 'text' model: No text content found.")
            return text_content
        elif model_type == "vision":
            # Get resolution from manager/kwargs if possible, else default
            manager = self._get_classification_manager()
            default_resolution = 150
            # Access kwargs passed to classify method if needed
            resolution = kwargs.get("resolution", default_resolution)

            # Use to_image, ensuring no highlights interfere
            img = self.to_image(
                resolution=resolution,
                include_highlights=False,
                labels=False,
                exclusions=None,  # Don't mask exclusions for classification input image
            )
            if img is None:
                raise ValueError(
                    "Cannot classify page with 'vision' model: Failed to render image."
                )
            return img
        else:
            raise ValueError(f"Unsupported model_type for classification: {model_type}")

    def _get_metadata_storage(self) -> Dict[str, Any]:
        # Ensure metadata exists
        if not hasattr(self, "metadata") or self.metadata is None:
            self.metadata = {}
        return self.metadata

    # --- Content Extraction ---

    # --- Skew Detection and Correction --- #

    @property
    def skew_angle(self) -> Optional[float]:
        """Get the detected skew angle for this page (if calculated)."""
        return self._skew_angle

    def detect_skew_angle(
        self,
        resolution: int = 72,
        grayscale: bool = True,
        force_recalculate: bool = False,
        **deskew_kwargs,
    ) -> Optional[float]:
        """
        Detects the skew angle of the page image and stores it.

        Args:
            resolution: DPI resolution for rendering the page image for detection.
            grayscale: Whether to convert the image to grayscale before detection.
            force_recalculate: If True, recalculate even if an angle exists.
            **deskew_kwargs: Additional keyword arguments passed to `deskew.determine_skew`
                             (e.g., `max_angle`, `num_peaks`).

        Returns:
            The detected skew angle in degrees, or None if detection failed.

        Raises:
            ImportError: If the 'deskew' library is not installed.
        """
        if not DESKEW_AVAILABLE:
            raise ImportError(
                "Deskew library not found. Install with: pip install natural-pdf[deskew]"
            )

        if self._skew_angle is not None and not force_recalculate:
            logger.debug(f"Page {self.number}: Returning cached skew angle: {self._skew_angle:.2f}")
            return self._skew_angle

        logger.debug(f"Page {self.number}: Detecting skew angle (resolution={resolution} DPI)...")
        try:
            # Render the page at the specified detection resolution
            img = self.to_image(resolution=resolution, include_highlights=False)
            if not img:
                logger.warning(f"Page {self.number}: Failed to render image for skew detection.")
                self._skew_angle = None
                return None

            # Convert to numpy array
            img_np = np.array(img)

            # Convert to grayscale if needed
            if grayscale:
                if len(img_np.shape) == 3 and img_np.shape[2] >= 3:
                    gray_np = np.mean(img_np[:, :, :3], axis=2).astype(np.uint8)
                elif len(img_np.shape) == 2:
                    gray_np = img_np  # Already grayscale
                else:
                    logger.warning(
                        f"Page {self.number}: Unexpected image shape {img_np.shape} for grayscale conversion."
                    )
                    gray_np = img_np  # Try using it anyway
            else:
                gray_np = img_np  # Use original if grayscale=False

            # Determine skew angle using the deskew library
            angle = determine_skew(gray_np, **deskew_kwargs)
            self._skew_angle = angle
            logger.debug(f"Page {self.number}: Detected skew angle = {angle}")
            return angle

        except Exception as e:
            logger.warning(f"Page {self.number}: Failed during skew detection: {e}", exc_info=True)
            self._skew_angle = None
            return None

    def deskew(
        self,
        resolution: int = 300,
        angle: Optional[float] = None,
        detection_resolution: int = 72,
        **deskew_kwargs,
    ) -> Optional[Image.Image]:
        """
        Creates and returns a deskewed PIL image of the page.

        If `angle` is not provided, it will first try to detect the skew angle
        using `detect_skew_angle` (or use the cached angle if available).

        Args:
            resolution: DPI resolution for the output deskewed image.
            angle: The specific angle (in degrees) to rotate by. If None, detects automatically.
            detection_resolution: DPI resolution used for detection if `angle` is None.
            **deskew_kwargs: Additional keyword arguments passed to `deskew.determine_skew`
                             if automatic detection is performed.

        Returns:
            A deskewed PIL.Image.Image object, or None if rendering/rotation fails.

        Raises:
            ImportError: If the 'deskew' library is not installed.
        """
        if not DESKEW_AVAILABLE:
            raise ImportError(
                "Deskew library not found. Install with: pip install natural-pdf[deskew]"
            )

        # Determine the angle to use
        rotation_angle = angle
        if rotation_angle is None:
            # Detect angle (or use cached) if not explicitly provided
            rotation_angle = self.detect_skew_angle(
                resolution=detection_resolution, **deskew_kwargs
            )

        logger.debug(
            f"Page {self.number}: Preparing to deskew (output resolution={resolution} DPI). Using angle: {rotation_angle}"
        )

        try:
            # Render the original page at the desired output resolution
            img = self.to_image(resolution=resolution, include_highlights=False)
            if not img:
                logger.error(f"Page {self.number}: Failed to render image for deskewing.")
                return None

            # Rotate if a significant angle was found/provided
            if rotation_angle is not None and abs(rotation_angle) > 0.05:
                logger.debug(f"Page {self.number}: Rotating by {rotation_angle:.2f} degrees.")
                # Determine fill color based on image mode
                fill = (255, 255, 255) if img.mode == "RGB" else 255  # White background
                # Rotate the image using PIL
                rotated_img = img.rotate(
                    rotation_angle,  # deskew provides angle, PIL rotates counter-clockwise
                    resample=Image.Resampling.BILINEAR,
                    expand=True,  # Expand image to fit rotated content
                    fillcolor=fill,
                )
                return rotated_img
            else:
                logger.debug(
                    f"Page {self.number}: No significant rotation needed (angle={rotation_angle}). Returning original render."
                )
                return img  # Return the original rendered image if no rotation needed

        except Exception as e:
            logger.error(
                f"Page {self.number}: Error during deskewing image generation: {e}", exc_info=True
            )
            return None

    # --- End Skew Detection and Correction --- #

    # ------------------------------------------------------------------
    # Unified analysis storage (maps to metadata["analysis"])
    # ------------------------------------------------------------------

    @property
    def analyses(self) -> Dict[str, Any]:
        if not hasattr(self, "metadata") or self.metadata is None:
            self.metadata = {}
        return self.metadata.setdefault("analysis", {})

    @analyses.setter
    def analyses(self, value: Dict[str, Any]):
        if not hasattr(self, "metadata") or self.metadata is None:
            self.metadata = {}
        self.metadata["analysis"] = value

    def inspect(self, limit: int = 30) -> "InspectionSummary":
        """
        Inspect all elements on this page with detailed tabular view.
        Equivalent to page.find_all('*').inspect().

        Args:
            limit: Maximum elements per type to show (default: 30)

        Returns:
            InspectionSummary with element tables showing coordinates,
            properties, and other details for each element
        """
        return self.find_all("*").inspect(limit=limit)

    def remove_text_layer(self) -> "Page":
        """
        Remove all text elements from this page.

        This removes all text elements (words and characters) from the page,
        effectively clearing the text layer.

        Returns:
            Self for method chaining
        """
        logger.info(f"Page {self.number}: Removing all text elements...")

        # Remove all words and chars from the element manager
        removed_words = len(self._element_mgr.words)
        removed_chars = len(self._element_mgr.chars)

        # Clear the lists
        self._element_mgr._elements["words"] = []
        self._element_mgr._elements["chars"] = []

        logger.info(
            f"Page {self.number}: Removed {removed_words} words and {removed_chars} characters"
        )
        return self

    @property
    def lines(self) -> List[Any]:
        """Get all line elements on this page."""
        return self._element_mgr.lines

    # ------------------------------------------------------------------
    # Image elements
    # ------------------------------------------------------------------

    @property
    def images(self) -> List[Any]:
        """Get all embedded raster images on this page."""
        return self._element_mgr.images
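
The parallel strategy in `correct_ocr` above — submit one task per element, collect `(element, result, error)` triples as futures complete, apply updates only when the text actually changed — can be sketched in isolation. `FakeWord` and `fix_text` here are hypothetical stand-ins for real OCR elements and a user callback, not part of the library:

```python
import concurrent.futures
from typing import Optional


class FakeWord:
    """Hypothetical stand-in for an OCR text element."""

    def __init__(self, text: str):
        self.text = text


def fix_text(element) -> Optional[str]:
    """User callback: return corrected text, or None to leave unchanged."""
    fixed = element.text.replace("0", "o")
    return fixed if fixed != element.text else None


def correct_all(elements, callback, max_workers: int = 4) -> int:
    """Apply callback to every element in parallel; return update count."""
    updated = 0

    def task(el):
        try:
            return el, callback(el), None
        except Exception as e:  # errors are collected, not raised
            return el, None, e

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task, el) for el in elements]
        for future in concurrent.futures.as_completed(futures):
            el, new_text, error = future.result()
            if error is None and new_text is not None and new_text != el.text:
                el.text = new_text
                updated += 1
    return updated


words = [FakeWord("c0rrected"), FakeWord("fine")]
print(correct_all(words, fix_text))  # 1 element updated
print(words[0].text)
```

Returning the error alongside the element, rather than raising inside the worker, is what lets one bad callback invocation be logged without aborting the rest of the batch.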
Attributes
natural_pdf.Page.chars property

Get all character elements on this page.

natural_pdf.Page.height property

Get page height.

natural_pdf.Page.images property

Get all embedded raster images on this page.

natural_pdf.Page.index property

Get page index (0-based).

natural_pdf.Page.layout_analyzer property

Get or create the layout analyzer for this page.

natural_pdf.Page.lines property

Get all line elements on this page.

natural_pdf.Page.number property

Get page number (1-based).

natural_pdf.Page.page_number property

Get page number (1-based).

natural_pdf.Page.pdf property

Provides public access to the parent PDF object.

natural_pdf.Page.rects property

Get all rectangle elements on this page.

natural_pdf.Page.size property

Get the size of the page in points.

natural_pdf.Page.skew_angle property

Get the detected skew angle for this page (if calculated).

natural_pdf.Page.text_style_labels property

Get a sorted list of unique text style labels found on the page.

Runs text style analysis with default options if it hasn't been run yet. To use custom options, call analyze_text_styles(options=...) explicitly first.

Returns:

Type Description
List[str]

A sorted list of unique style label strings.

natural_pdf.Page.width property

Get page width.

natural_pdf.Page.words property

Get all word elements on this page.

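The `index` (0-based) and `number`/`page_number` (1-based) properties above always differ by exactly one. A minimal stand-in illustrates the convention (`MiniPage` is hypothetical, not a library class):

```python
class MiniPage:
    """Hypothetical stand-in showing the index/number convention."""

    def __init__(self, index: int):
        self._index = index

    @property
    def index(self) -> int:
        """0-based position in pdf.pages."""
        return self._index

    @property
    def number(self) -> int:
        """1-based page number, as shown in PDF readers."""
        return self._index + 1


page = MiniPage(0)
print(page.index, page.number)  # first page
```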
Functions
natural_pdf.Page.__init__(page, parent, index, font_attrs=None, load_text=True)

Initialize a page wrapper.

Creates an enhanced Page object that wraps a pdfplumber page with additional functionality for spatial navigation, analysis, and AI-powered extraction.

Parameters:

Name Type Description Default
page Page

The underlying pdfplumber page object that provides raw PDF data.

required
parent PDF

Parent PDF object that contains this page and provides access to managers and global settings.

required
index int

Zero-based index of this page in the PDF document.

required
font_attrs

List of font attributes to consider when grouping characters into words. Common attributes include ['fontname', 'size', 'flags']. If None, uses default character-to-word grouping rules.

None
load_text bool

If True, load and process text elements from the PDF's text layer. If False, skip text layer processing (useful for OCR-only workflows).

True
Note

This constructor is typically called automatically when accessing pages through the PDF.pages collection. Direct instantiation is rarely needed.

Example
# Pages are usually accessed through the PDF object
pdf = npdf.PDF("document.pdf")
page = pdf.pages[0]  # Page object created automatically

# Direct construction (advanced usage)
import pdfplumber
with pdfplumber.open("document.pdf") as plumber_pdf:
    plumber_page = plumber_pdf.pages[0]
    page = Page(plumber_page, pdf, 0, load_text=True)
Source code in natural_pdf/core/page.py
def __init__(
    self,
    page: "pdfplumber.page.Page",
    parent: "PDF",
    index: int,
    font_attrs=None,
    load_text: bool = True,
):
    """Initialize a page wrapper.

    Creates an enhanced Page object that wraps a pdfplumber page with additional
    functionality for spatial navigation, analysis, and AI-powered extraction.

    Args:
        page: The underlying pdfplumber page object that provides raw PDF data.
        parent: Parent PDF object that contains this page and provides access
            to managers and global settings.
        index: Zero-based index of this page in the PDF document.
        font_attrs: List of font attributes to consider when grouping characters
            into words. Common attributes include ['fontname', 'size', 'flags'].
            If None, uses default character-to-word grouping rules.
        load_text: If True, load and process text elements from the PDF's text layer.
            If False, skip text layer processing (useful for OCR-only workflows).

    Note:
        This constructor is typically called automatically when accessing pages
        through the PDF.pages collection. Direct instantiation is rarely needed.

    Example:
        ```python
        # Pages are usually accessed through the PDF object
        pdf = npdf.PDF("document.pdf")
        page = pdf.pages[0]  # Page object created automatically

        # Direct construction (advanced usage)
        import pdfplumber
        with pdfplumber.open("document.pdf") as plumber_pdf:
            plumber_page = plumber_pdf.pages[0]
            page = Page(plumber_page, pdf, 0, load_text=True)
        ```
    """
    self._page = page
    self._parent = parent
    self._index = index
    self._load_text = load_text
    self._text_styles = None  # Lazy-loaded text style analyzer results
    self._exclusions = []  # List to store exclusion functions/regions
    self._skew_angle: Optional[float] = None  # Stores detected skew angle

    # --- ADDED --- Metadata store for mixins
    self.metadata: Dict[str, Any] = {}
    # --- END ADDED ---

    # Region management
    self._regions = {
        "detected": [],  # Layout detection results
        "named": {},  # Named regions (name -> region)
    }

    # -------------------------------------------------------------
    # Page-scoped configuration begins as a shallow copy of the parent
    # PDF-level configuration so that auto-computed tolerances or other
    # page-specific values do not overwrite siblings.
    # -------------------------------------------------------------
    self._config = dict(getattr(self._parent, "_config", {}))

    # Initialize ElementManager, passing font_attrs
    self._element_mgr = ElementManager(self, font_attrs=font_attrs, load_text=self._load_text)
    # self._highlighter = HighlightingService(self) # REMOVED - Use property accessor
    # --- NEW --- Central registry for analysis results
    self.analyses: Dict[str, Any] = {}

    # --- Get OCR Manager Instance ---
    if (
        OCRManager
        and hasattr(parent, "_ocr_manager")
        and isinstance(parent._ocr_manager, OCRManager)
    ):
        self._ocr_manager = parent._ocr_manager
        logger.debug(f"Page {self.number}: Using OCRManager instance from parent PDF.")
    else:
        self._ocr_manager = None
        if OCRManager:
            logger.warning(
                f"Page {self.number}: OCRManager instance not found on parent PDF object."
            )

    # --- Get Layout Manager Instance ---
    if (
        LayoutManager
        and hasattr(parent, "_layout_manager")
        and isinstance(parent._layout_manager, LayoutManager)
    ):
        self._layout_manager = parent._layout_manager
        logger.debug(f"Page {self.number}: Using LayoutManager instance from parent PDF.")
    else:
        self._layout_manager = None
        if LayoutManager:
            logger.warning(
                f"Page {self.number}: LayoutManager instance not found on parent PDF object. Layout analysis will fail."
            )

    # Initialize the internal variable with a single underscore
    self._layout_analyzer = None

    self._load_elements()
    self._to_image_cache: Dict[tuple, Optional["Image.Image"]] = {}
natural_pdf.Page.__repr__()

String representation of the page.

Source code in natural_pdf/core/page.py
def __repr__(self) -> str:
    """String representation of the page."""
    return f"<Page number={self.number} index={self.index}>"
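
The Indexable protocol methods shown earlier (`get_id`, `get_content_hash`) combine two stdlib idioms: sanitizing a path into an identifier with `re.sub`, and fingerprinting extracted text with SHA-256 so a sync layer can detect changed pages. A standalone sketch; the `make_page_id` / `content_hash` names are illustrative, not part of the library:

```python
import hashlib
import re


def make_page_id(pdf_path: str, page_number: int) -> str:
    """Replace characters unsafe for IDs, then build a stable key."""
    safe_path = re.sub(r"[^a-zA-Z0-9_-]", "_", pdf_path)
    return f"pdf_{safe_path}_page_{page_number}"


def content_hash(text: str) -> str:
    """SHA-256 of UTF-8 text; identical text always hashes identically."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


print(make_page_id("docs/report v2.pdf", 3))
# The hash changes iff the extracted text changes, which is all sync needs.
print(content_hash("Hello") == content_hash("Hello"))  # True
print(content_hash("Hello") == content_hash("hello"))  # False
```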
natural_pdf.Page.add_exclusion(exclusion_func_or_region, label=None)

Add an exclusion to the page. Text from these regions will be excluded from extraction. Ensures non-callable items are stored as Region objects if possible.

Parameters:

Name Type Description Default
exclusion_func_or_region Union[Callable[[Page], Region], Region, Any]

Either a callable function returning a Region, a Region object, or another object with a valid .bbox attribute.

required
label Optional[str]

Optional label for this exclusion (e.g., 'header', 'footer').

None

Returns:

Type Description
Page

Self for method chaining

Raises:

Type Description
TypeError

If a non-callable, non-Region object without a valid bbox is provided.

Source code in natural_pdf/core/page.py
def add_exclusion(
    self,
    exclusion_func_or_region: Union[Callable[["Page"], "Region"], "Region", Any],
    label: Optional[str] = None,
) -> "Page":
    """
    Add an exclusion to the page. Text from these regions will be excluded from extraction.
    Ensures non-callable items are stored as Region objects if possible.

    Args:
        exclusion_func_or_region: Either a callable function returning a Region,
                                  a Region object, or another object with a valid .bbox attribute.
        label: Optional label for this exclusion (e.g., 'header', 'footer').

    Returns:
        Self for method chaining

    Raises:
        TypeError: If a non-callable, non-Region object without a valid bbox is provided.
    """
    exclusion_data = None  # Initialize exclusion data

    if callable(exclusion_func_or_region):
        # Store callable functions along with their label
        exclusion_data = (exclusion_func_or_region, label)
        logger.debug(
            f"Page {self.index}: Added callable exclusion '{label}': {exclusion_func_or_region}"
        )
    elif isinstance(exclusion_func_or_region, Region):
        # Store Region objects directly, assigning the label
        exclusion_func_or_region.label = label  # Assign label
        exclusion_data = (exclusion_func_or_region, label)  # Store as tuple for consistency
        logger.debug(
            f"Page {self.index}: Added Region exclusion '{label}': {exclusion_func_or_region}"
        )
    elif (
        hasattr(exclusion_func_or_region, "bbox")
        and isinstance(getattr(exclusion_func_or_region, "bbox", None), (tuple, list))
        and len(exclusion_func_or_region.bbox) == 4
    ):
        # Convert objects with a valid bbox to a Region before storing
        try:
            bbox_coords = tuple(float(v) for v in exclusion_func_or_region.bbox)
            # Pass the label to the Region constructor
            region_to_add = Region(self, bbox_coords, label=label)
            exclusion_data = (region_to_add, label)  # Store as tuple
            logger.debug(
                f"Page {self.index}: Added exclusion '{label}' converted to Region from {type(exclusion_func_or_region)}: {region_to_add}"
            )
        except (ValueError, TypeError, Exception) as e:
            # Raise an error if conversion fails
            raise TypeError(
                f"Failed to convert exclusion object {exclusion_func_or_region} with bbox {getattr(exclusion_func_or_region, 'bbox', 'N/A')} to Region: {e}"
            ) from e
    else:
        # Reject invalid types
        raise TypeError(
            f"Invalid exclusion type: {type(exclusion_func_or_region)}. Must be callable, Region, or have a valid .bbox attribute."
        )

    # Append the stored data (tuple of object/callable and label)
    if exclusion_data:
        self._exclusions.append(exclusion_data)

    return self
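A short, hedged example of the callable form: the function below excludes a fixed header band. It is a sketch only — `create_region` matches the method documented further down this page, and the commented usage assumes a loaded Page.

```python
# Sketch of a callable exclusion: exclude the top 50 points of whatever page
# it is evaluated against. `create_region` is documented later on this page.
def header_exclusion(page):
    return page.create_region(0, 0, page.width, 50)

# Hypothetical usage (requires a loaded natural-pdf Page):
# page.add_exclusion(header_exclusion, label="header")
```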
natural_pdf.Page.add_region(region, name=None)

Add a region to the page.

Parameters:

    region (Region): Region object to add. Required.
    name (Optional[str]): Optional name for the region. Default: None.

Returns:

    Page: Self for method chaining.

Source code in natural_pdf/core/page.py, lines 377-406
def add_region(self, region: "Region", name: Optional[str] = None) -> "Page":
    """
    Add a region to the page.

    Args:
        region: Region object to add
        name: Optional name for the region

    Returns:
        Self for method chaining
    """
    # Check if it's actually a Region object
    if not isinstance(region, Region):
        raise TypeError("region must be a Region object")

    # Set the source and name
    region.source = "named"

    if name:
        region.name = name
        # Add to named regions dictionary (overwriting if name already exists)
        self._regions["named"][name] = region
    else:
        # Add to detected regions list (unnamed but registered)
        self._regions["detected"].append(region)

    # Add to element manager for selector queries
    self._element_mgr.add_region(region)

    return self
natural_pdf.Page.add_regions(regions, prefix=None)

Add multiple regions to the page.

Parameters:

    regions (List[Region]): List of Region objects to add. Required.
    prefix (Optional[str]): Optional prefix for automatic naming (regions will be named prefix_1, prefix_2, etc.). Default: None.

Returns:

    Page: Self for method chaining.

Source code in natural_pdf/core/page.py, lines 408-428
def add_regions(self, regions: List["Region"], prefix: Optional[str] = None) -> "Page":
    """
    Add multiple regions to the page.

    Args:
        regions: List of Region objects to add
        prefix: Optional prefix for automatic naming (regions will be named prefix_1, prefix_2, etc.)

    Returns:
        Self for method chaining
    """
    if prefix:
        # Add with automatic sequential naming
        for i, region in enumerate(regions):
            self.add_region(region, name=f"{prefix}_{i+1}")
    else:
        # Add without names
        for region in regions:
            self.add_region(region)

    return self
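The automatic naming scheme is simple enough to state exactly: with a prefix, regions receive 1-based sequential names. A tiny illustrative helper (not part of the API):

```python
# Mirrors add_regions' naming: with prefix "col", three regions become
# col_1, col_2, col_3 (1-based, per the loop above).
def sequential_names(prefix, count):
    return [f"{prefix}_{i + 1}" for i in range(count)]
```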
natural_pdf.Page.analyze_layout(engine=None, options=None, confidence=None, classes=None, exclude_classes=None, device=None, existing='replace', model_name=None, client=None)

Analyze the page layout using the configured LayoutManager. Adds detected Region objects to the page's element manager.

Returns:

    ElementCollection[Region]: ElementCollection containing the detected Region objects.

Source code in natural_pdf/core/page.py, lines 2217-2269
def analyze_layout(
    self,
    engine: Optional[str] = None,
    options: Optional["LayoutOptions"] = None,
    confidence: Optional[float] = None,
    classes: Optional[List[str]] = None,
    exclude_classes: Optional[List[str]] = None,
    device: Optional[str] = None,
    existing: str = "replace",
    model_name: Optional[str] = None,
    client: Optional[Any] = None,  # Add client parameter
) -> "ElementCollection[Region]":
    """
    Analyze the page layout using the configured LayoutManager.
    Adds detected Region objects to the page's element manager.

    Returns:
        ElementCollection containing the detected Region objects.
    """
    analyzer = self.layout_analyzer
    if not analyzer:
        logger.error(
            "Layout analysis failed: LayoutAnalyzer not initialized (is LayoutManager available?)."
        )
        return ElementCollection([])  # Return empty collection

    # Clear existing detected regions if 'replace' is specified
    if existing == "replace":
        self.clear_detected_layout_regions()

    # The analyzer's analyze_layout method already adds regions to the page
    # and its element manager. We just need to retrieve them.
    analyzer.analyze_layout(
        engine=engine,
        options=options,
        confidence=confidence,
        classes=classes,
        exclude_classes=exclude_classes,
        device=device,
        existing=existing,
        model_name=model_name,
        client=client,  # Pass client down
    )

    # Retrieve the detected regions from the element manager
    # Filter regions based on source='detected' and potentially the model used if available
    detected_regions = [
        r
        for r in self._element_mgr.regions
        if r.source == "detected" and (not engine or getattr(r, "model", None) == engine)
    ]

    return ElementCollection(detected_regions)
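The final list comprehension above applies a small predicate worth isolating: keep regions produced by layout analysis, optionally restricted to one engine. A sketch of that predicate, using illustrative attribute access (not part of the API):

```python
# Sketch of the filter analyze_layout applies when collecting results.
def matches_engine(region, engine=None):
    return getattr(region, "source", None) == "detected" and (
        not engine or getattr(region, "model", None) == engine
    )
```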
natural_pdf.Page.analyze_text_styles(options=None)

Analyze text elements by style, adding attributes directly to elements.

This method uses TextStyleAnalyzer to process text elements (typically words) on the page. It adds the following attributes to each processed element:

- style_label: A descriptive or numeric label for the style group.
- style_key: A hashable tuple representing the style properties used for grouping.
- style_properties: A dictionary containing the extracted style properties.

Parameters:

    options (Optional[TextStyleOptions]): Optional TextStyleOptions to configure the analysis. If None, the analyzer's default options are used. Default: None.

Returns:

    ElementCollection: ElementCollection containing all processed text elements with added style attributes.

Source code in natural_pdf/core/page.py, lines 1716-1744
def analyze_text_styles(
    self, options: Optional[TextStyleOptions] = None
) -> "ElementCollection":
    """
    Analyze text elements by style, adding attributes directly to elements.

    This method uses TextStyleAnalyzer to process text elements (typically words)
    on the page. It adds the following attributes to each processed element:
    - style_label: A descriptive or numeric label for the style group.
    - style_key: A hashable tuple representing the style properties used for grouping.
    - style_properties: A dictionary containing the extracted style properties.

    Args:
        options: Optional TextStyleOptions to configure the analysis.
                 If None, the analyzer's default options are used.

    Returns:
        ElementCollection containing all processed text elements with added style attributes.
    """
    # Create analyzer (optionally pass default options from PDF config here)
    # For now, it uses its own defaults if options=None
    analyzer = TextStyleAnalyzer()

    # Analyze the page. The analyzer now modifies elements directly
    # and returns the collection of processed elements.
    processed_elements_collection = analyzer.analyze(self, options=options)

    # Return the collection of elements which now have style attributes
    return processed_elements_collection
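Because each processed element carries a hashable `style_key`, grouping same-styled text afterwards is a plain dictionary build. A hedged sketch (the element objects here are stand-ins, not the library's types):

```python
from collections import defaultdict

# Group elements returned by analyze_text_styles by their `style_key`.
def group_by_style(elements):
    groups = defaultdict(list)
    for el in elements:
        groups[getattr(el, "style_key", None)].append(el)
    return dict(groups)
```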
natural_pdf.Page.apply_ocr(engine=None, options=None, languages=None, min_confidence=None, device=None, resolution=None, detect_only=False, apply_exclusions=True, replace=True)

Apply OCR to THIS page and add results to page elements via PDF.apply_ocr.

Parameters:

    engine (Optional[str]): Name of the OCR engine. Default: None.
    options (Optional[OCROptions]): Engine-specific options object or dict. Default: None.
    languages (Optional[List[str]]): List of engine-specific language codes. Default: None.
    min_confidence (Optional[float]): Minimum confidence threshold. Default: None.
    device (Optional[str]): Device to run OCR on. Default: None.
    resolution (Optional[int]): DPI resolution for rendering the page image before OCR. Default: None.
    apply_exclusions (bool): If True (default), render the page image for OCR with excluded areas masked (whited out). Default: True.
    detect_only (bool): If True, only detect text bounding boxes; don't perform OCR. Default: False.
    replace (bool): If True (default), remove any existing OCR elements before adding new ones. If False, add new OCR elements to existing ones. Default: True.

Returns:

    Page: Self for method chaining.

Source code in natural_pdf/core/page.py, lines 2025-2084
def apply_ocr(
    self,
    engine: Optional[str] = None,
    options: Optional["OCROptions"] = None,
    languages: Optional[List[str]] = None,
    min_confidence: Optional[float] = None,
    device: Optional[str] = None,
    resolution: Optional[int] = None,
    detect_only: bool = False,
    apply_exclusions: bool = True,
    replace: bool = True,
) -> "Page":
    """
    Apply OCR to THIS page and add results to page elements via PDF.apply_ocr.

    Args:
        engine: Name of the OCR engine.
        options: Engine-specific options object or dict.
        languages: List of engine-specific language codes.
        min_confidence: Minimum confidence threshold.
        device: Device to run OCR on.
        resolution: DPI resolution for rendering page image before OCR.
        apply_exclusions: If True (default), render page image for OCR
                          with excluded areas masked (whited out).
        detect_only: If True, only detect text bounding boxes, don't perform OCR.
        replace: If True (default), remove any existing OCR elements before
                adding new ones. If False, add new OCR elements to existing ones.

    Returns:
        Self for method chaining.
    """
    if not hasattr(self._parent, "apply_ocr"):
        logger.error(f"Page {self.number}: Parent PDF missing 'apply_ocr'. Cannot apply OCR.")
        return self  # Return self for chaining

    # Remove existing OCR elements if replace is True
    if replace and hasattr(self, "_element_mgr"):
        logger.info(
            f"Page {self.number}: Removing existing OCR elements before applying new OCR."
        )
        self._element_mgr.remove_ocr_elements()

    logger.info(f"Page {self.number}: Delegating apply_ocr to PDF.apply_ocr.")
    # Delegate to parent PDF, targeting only this page's index
    # Pass all relevant parameters through, including apply_exclusions
    self._parent.apply_ocr(
        pages=[self.index],
        engine=engine,
        options=options,
        languages=languages,
        min_confidence=min_confidence,
        device=device,
        resolution=resolution,
        detect_only=detect_only,
        apply_exclusions=apply_exclusions,
        replace=replace,  # Pass the replace parameter to PDF.apply_ocr
    )

    # Return self for chaining
    return self
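To illustrate what `min_confidence` implies, here is the kind of threshold filter an OCR pipeline applies to raw results. The dict shape is an assumption for illustration, not the engine's actual result type:

```python
# Illustrative only: keep OCR results at or above the confidence threshold.
def filter_by_confidence(results, min_confidence=0.5):
    return [r for r in results if r.get("confidence", 0.0) >= min_confidence]
```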
natural_pdf.Page.ask(question, min_confidence=0.1, model=None, debug=False, **kwargs)

Ask a question about the page content using document QA.

Source code in natural_pdf/core/page.py, lines 2486-2525
def ask(
    self,
    question: Union[str, List[str], Tuple[str, ...]],
    min_confidence: float = 0.1,
    model: str = None,
    debug: bool = False,
    **kwargs,
) -> Union[Dict[str, Any], List[Dict[str, Any]]]:
    """
    Ask a question about the page content using document QA.
    """
    try:
        from natural_pdf.qa.document_qa import get_qa_engine

        # Get or initialize QA engine with specified model
        qa_engine = get_qa_engine(model_name=model) if model else get_qa_engine()
        # Ask the question using the QA engine
        return qa_engine.ask_pdf_page(
            self, question, min_confidence=min_confidence, debug=debug, **kwargs
        )
    except ImportError:
        logger.error(
            "Question answering requires the 'natural_pdf.qa' module. Please install necessary dependencies."
        )
        return {
            "answer": None,
            "confidence": 0.0,
            "found": False,
            "page_num": self.number,
            "source_elements": [],
        }
    except Exception as e:
        logger.error(f"Error during page.ask: {e}", exc_info=True)
        return {
            "answer": None,
            "confidence": 0.0,
            "found": False,
            "page_num": self.number,
            "source_elements": [],
        }
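The fallback dicts in the source above show the result shape: at least `answer`, `confidence`, and `found` keys. A small caller-side guard, as a sketch:

```python
# Return the answer only when QA found something above the threshold.
def best_answer(result, min_confidence=0.1):
    if result.get("found") and result.get("confidence", 0.0) >= min_confidence:
        return result["answer"]
    return None
```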
natural_pdf.Page.clear_detected_layout_regions()

Removes all regions from this page that were added by layout analysis (i.e., regions whose source attribute is 'detected').

This clears the regions both from the page's internal _regions['detected'] list and from the ElementManager's internal list of regions.

Returns:

    Page: Self for method chaining.

Source code in natural_pdf/core/page.py, lines 2271-2305
def clear_detected_layout_regions(self) -> "Page":
    """
    Removes all regions from this page that were added by layout analysis
    (i.e., regions where `source` attribute is 'detected').

    This clears the regions both from the page's internal `_regions['detected']` list
    and from the ElementManager's internal list of regions.

    Returns:
        Self for method chaining.
    """
    if (
        not hasattr(self._element_mgr, "regions")
        or not hasattr(self._element_mgr, "_elements")
        or "regions" not in self._element_mgr._elements
    ):
        logger.debug(
            f"Page {self.index}: No regions found in ElementManager, nothing to clear."
        )
        self._regions["detected"] = []  # Ensure page's list is also clear
        return self

    # Filter ElementManager's list to keep only non-detected regions
    original_count = len(self._element_mgr.regions)
    self._element_mgr._elements["regions"] = [
        r for r in self._element_mgr.regions if getattr(r, "source", None) != "detected"
    ]
    new_count = len(self._element_mgr.regions)
    removed_count = original_count - new_count

    # Clear the page's specific list of detected regions
    self._regions["detected"] = []

    logger.info(f"Page {self.index}: Cleared {removed_count} detected layout regions.")
    return self
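The keep-filter used above, isolated as a sketch: retain only regions whose source is not "detected".

```python
# Keep regions that did NOT come from layout analysis.
def keep_non_detected(regions):
    return [r for r in regions if getattr(r, "source", None) != "detected"]
```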
natural_pdf.Page.clear_exclusions()

Clear all exclusions from the page.

Source code in natural_pdf/core/page.py, lines 304-309
def clear_exclusions(self) -> "Page":
    """
    Clear all exclusions from the page.
    """
    self._exclusions = []
    return self
natural_pdf.Page.clear_highlights()

Clear all highlights from this specific page via HighlightingService.

Returns:

    Page: Self for method chaining.

Source code in natural_pdf/core/page.py, lines 1706-1714
def clear_highlights(self) -> "Page":
    """
    Clear all highlights *from this specific page* via HighlightingService.

    Returns:
        Self for method chaining
    """
    self._highlighter.clear_page(self.index)
    return self
natural_pdf.Page.correct_ocr(correction_callback, selector='text[source=ocr]', max_workers=None, progress_callback=None)

Applies corrections to OCR-generated text elements on this page using a user-provided callback function, potentially in parallel.

Finds text elements on this page whose 'source' attribute starts with 'ocr' and calls the correction_callback for each, passing the element itself. Updates the element's text if the callback returns a new string.

Parameters:

    correction_callback (Callable[[Any], Optional[str]]): A function accepting an element and returning Optional[str] (new text or None). Required.
    selector (Optional[str]): Selector for the elements to correct. Default: 'text[source=ocr]'.
    max_workers (Optional[int]): The maximum number of threads to use for parallel execution. If None, 0, or 1, runs sequentially. Default: None.
    progress_callback (Optional[Callable[[], None]]): Optional callback function to call after processing each element. Default: None.

Returns:

    Page: Self for method chaining.

Source code in natural_pdf/core/page.py, lines 2703-2850
def correct_ocr(
    self,
    correction_callback: Callable[[Any], Optional[str]],
    selector: Optional[str] = "text[source=ocr]",
    max_workers: Optional[int] = None,
    progress_callback: Optional[Callable[[], None]] = None,  # Added progress callback
) -> "Page":  # Return self for chaining
    """
    Applies corrections to OCR-generated text elements on this page
    using a user-provided callback function, potentially in parallel.

    Finds text elements on this page whose 'source' attribute starts
    with 'ocr' and calls the `correction_callback` for each, passing the
    element itself. Updates the element's text if the callback returns
    a new string.

    Args:
        correction_callback: A function accepting an element and returning
                             `Optional[str]` (new text or None).
        max_workers: The maximum number of threads to use for parallel execution.
                     If None or 0 or 1, runs sequentially.
        progress_callback: Optional callback function to call after processing each element.

    Returns:
        Self for method chaining.
    """
    logger.info(
        f"Page {self.number}: Starting OCR correction with callback '{correction_callback.__name__}' (max_workers={max_workers})"
    )

    target_elements_collection = self.find_all(selector=selector, apply_exclusions=False)
    target_elements = target_elements_collection.elements  # Get the list

    if not target_elements:
        logger.info(f"Page {self.number}: No OCR elements found to correct.")
        return self

    element_pbar = None
    try:
        element_pbar = tqdm(
            total=len(target_elements),
            desc=f"Correcting OCR Page {self.number}",
            unit="element",
            leave=False,
        )

        processed_count = 0
        updated_count = 0
        error_count = 0

        # Define the task to be run by the worker thread or sequentially
        def _process_element_task(element):
            try:
                current_text = getattr(element, "text", None)
                # Call the user-provided callback
                corrected_text = correction_callback(element)

                # Validate result type
                if corrected_text is not None and not isinstance(corrected_text, str):
                    logger.warning(
                        f"Page {self.number}: Correction callback for element '{getattr(element, 'text', '')[:20]}...' returned non-string, non-None type: {type(corrected_text)}. Skipping update."
                    )
                    return element, None, None  # Treat as no correction

                return element, corrected_text, None  # Return element, result, no error
            except Exception as e:
                logger.error(
                    f"Page {self.number}: Error applying correction callback to element '{getattr(element, 'text', '')[:30]}...' ({element.bbox}): {e}",
                    exc_info=False,  # Keep log concise
                )
                return element, None, e  # Return element, no result, error
            finally:
                # --- Update internal tqdm progress bar ---
                if element_pbar:
                    element_pbar.update(1)
                # --- Call user's progress callback --- #
                if progress_callback:
                    try:
                        progress_callback()
                    except Exception as cb_e:
                        # Log error in callback itself, but don't stop processing
                        logger.error(
                            f"Page {self.number}: Error executing progress_callback: {cb_e}",
                            exc_info=False,
                        )

        # Choose execution strategy based on max_workers
        if max_workers is not None and max_workers > 1:
            # --- Parallel execution --- #
            logger.info(
                f"Page {self.number}: Running OCR correction in parallel with {max_workers} workers."
            )
            futures = []
            with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
                # Submit all tasks
                future_to_element = {
                    executor.submit(_process_element_task, element): element
                    for element in target_elements
                }

                # Process results as they complete (progress_callback called by worker)
                for future in concurrent.futures.as_completed(future_to_element):
                    processed_count += 1
                    try:
                        element, corrected_text, error = future.result()
                        if error:
                            error_count += 1
                            # Error already logged in worker
                        elif corrected_text is not None:
                            # Apply correction if text changed
                            current_text = getattr(element, "text", None)
                            if corrected_text != current_text:
                                element.text = corrected_text
                                updated_count += 1
                    except Exception as exc:
                        # Catch errors from future.result() itself
                        element = future_to_element[future]  # Find original element
                        logger.error(
                            f"Page {self.number}: Internal error retrieving correction result for element {element.bbox}: {exc}",
                            exc_info=True,
                        )
                        error_count += 1
                        # Note: progress_callback was already called in the worker's finally block

        else:
            # --- Sequential execution --- #
            logger.info(f"Page {self.number}: Running OCR correction sequentially.")
            for element in target_elements:
                # Call the task function directly (it handles progress_callback)
                processed_count += 1
                _element, corrected_text, error = _process_element_task(element)
                if error:
                    error_count += 1
                elif corrected_text is not None:
                    # Apply correction if text changed
                    current_text = getattr(_element, "text", None)
                    if corrected_text != current_text:
                        _element.text = corrected_text
                        updated_count += 1

        logger.info(
            f"Page {self.number}: OCR correction finished. Processed: {processed_count}/{len(target_elements)}, Updated: {updated_count}, Errors: {error_count}."
        )

        return self  # Return self for chaining
    finally:
        if element_pbar:
            element_pbar.close()
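A minimal correction callback, following the contract above: return the replacement text, or None to leave the element unchanged. The substitution is illustrative, and the commented usage assumes a loaded Page:

```python
# Replace the "|" glyph OCR often produces for "I"; return None when no change.
def fix_pipes(element):
    text = getattr(element, "text", None)
    if not text:
        return None
    corrected = text.replace("|", "I")
    return corrected if corrected != text else None

# Hypothetical usage (requires a loaded natural-pdf Page):
# page.correct_ocr(fix_pipes, max_workers=4)
```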
natural_pdf.Page.create_region(x0, top, x1, bottom)

Create a region on this page with the specified coordinates.

Parameters:

    x0 (float): Left x-coordinate. Required.
    top (float): Top y-coordinate. Required.
    x1 (float): Right x-coordinate. Required.
    bottom (float): Bottom y-coordinate. Required.

Returns:

    Any: Region object for the specified coordinates.

Source code in natural_pdf/core/page.py, lines 906-921
def create_region(self, x0: float, top: float, x1: float, bottom: float) -> Any:
    """
    Create a region on this page with the specified coordinates.

    Args:
        x0: Left x-coordinate
        top: Top y-coordinate
        x1: Right x-coordinate
        bottom: Bottom y-coordinate

    Returns:
        Region object for the specified coordinates
    """
    from natural_pdf.elements.region import Region

    return Region(self, (x0, top, x1, bottom))
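Coordinates follow pdfplumber's convention: origin at the top-left with y increasing downward, so top < bottom. A trivial size helper and a hedged usage sketch:

```python
# Width and height of a (x0, top, x1, bottom) box in points;
# assumes x1 >= x0 and bottom >= top.
def bbox_size(x0, top, x1, bottom):
    return (x1 - x0, bottom - top)

# Hypothetical usage (requires a loaded natural-pdf Page):
# left_half = page.create_region(0, 0, page.width / 2, page.height)
```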
natural_pdf.Page.crop(bbox=None, **kwargs)

Crop the page to the specified bounding box.

This is a direct wrapper around pdfplumber's crop method.

Parameters:

    bbox: Bounding box (x0, top, x1, bottom) or None. Default: None.
    **kwargs: Additional parameters (top, bottom, left, right). Default: {}.

Returns:

    Any: Cropped page object (pdfplumber.Page).

Source code in natural_pdf/core/page.py, lines 1172-1186
def crop(self, bbox=None, **kwargs) -> Any:
    """
    Crop the page to the specified bounding box.

    This is a direct wrapper around pdfplumber's crop method.

    Args:
        bbox: Bounding box (x0, top, x1, bottom) or None
        **kwargs: Additional parameters (top, bottom, left, right)

    Returns:
        Cropped page object (pdfplumber.Page)
    """
    # Returns the pdfplumber page object, not a natural-pdf Page
    return self._page.crop(bbox, **kwargs)
natural_pdf.Page.deskew(resolution=300, angle=None, detection_resolution=72, **deskew_kwargs)

Creates and returns a deskewed PIL image of the page.

If angle is not provided, it will first try to detect the skew angle using detect_skew_angle (or use the cached angle if available).

Parameters:

    resolution (int): DPI resolution for the output deskewed image. Default: 300.
    angle (Optional[float]): The specific angle (in degrees) to rotate by. If None, detects automatically. Default: None.
    detection_resolution (int): DPI resolution used for detection if angle is None. Default: 72.
    **deskew_kwargs: Additional keyword arguments passed to deskew.determine_skew if automatic detection is performed. Default: {}.

Returns:

    Optional[Image.Image]: A deskewed PIL.Image.Image object, or None if rendering/rotation fails.

Raises:

    ImportError: If the 'deskew' library is not installed.

Source code in natural_pdf/core/page.py, lines 2985-3058
def deskew(
    self,
    resolution: int = 300,
    angle: Optional[float] = None,
    detection_resolution: int = 72,
    **deskew_kwargs,
) -> Optional[Image.Image]:
    """
    Creates and returns a deskewed PIL image of the page.

    If `angle` is not provided, it will first try to detect the skew angle
    using `detect_skew_angle` (or use the cached angle if available).

    Args:
        resolution: DPI resolution for the output deskewed image.
        angle: The specific angle (in degrees) to rotate by. If None, detects automatically.
        detection_resolution: DPI resolution used for detection if `angle` is None.
        **deskew_kwargs: Additional keyword arguments passed to `deskew.determine_skew`
                         if automatic detection is performed.

    Returns:
        A deskewed PIL.Image.Image object, or None if rendering/rotation fails.

    Raises:
        ImportError: If the 'deskew' library is not installed.
    """
    if not DESKEW_AVAILABLE:
        raise ImportError(
            "Deskew library not found. Install with: pip install natural-pdf[deskew]"
        )

    # Determine the angle to use
    rotation_angle = angle
    if rotation_angle is None:
        # Detect angle (or use cached) if not explicitly provided
        rotation_angle = self.detect_skew_angle(
            resolution=detection_resolution, **deskew_kwargs
        )

    logger.debug(
        f"Page {self.number}: Preparing to deskew (output resolution={resolution} DPI). Using angle: {rotation_angle}"
    )

    try:
        # Render the original page at the desired output resolution
        img = self.to_image(resolution=resolution, include_highlights=False)
        if not img:
            logger.error(f"Page {self.number}: Failed to render image for deskewing.")
            return None

        # Rotate if a significant angle was found/provided
        if rotation_angle is not None and abs(rotation_angle) > 0.05:
            logger.debug(f"Page {self.number}: Rotating by {rotation_angle:.2f} degrees.")
            # Determine fill color based on image mode
            fill = (255, 255, 255) if img.mode == "RGB" else 255  # White background
            # Rotate the image using PIL
            rotated_img = img.rotate(
                rotation_angle,  # deskew provides angle, PIL rotates counter-clockwise
                resample=Image.Resampling.BILINEAR,
                expand=True,  # Expand image to fit rotated content
                fillcolor=fill,
            )
            return rotated_img
        else:
            logger.debug(
                f"Page {self.number}: No significant rotation needed (angle={rotation_angle}). Returning original render."
            )
            return img  # Return the original rendered image if no rotation needed

    except Exception as e:
        logger.error(
            f"Page {self.number}: Error during deskewing image generation: {e}", exc_info=True
        )
        return None
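The rotation-significance check used above, isolated as a sketch: angles within ±0.05 degrees are treated as "no rotation needed" and the original render is returned.

```python
# Rotate only for angles whose magnitude exceeds the threshold.
def needs_rotation(angle, threshold=0.05):
    return angle is not None and abs(angle) > threshold
```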
natural_pdf.Page.detect_skew_angle(resolution=72, grayscale=True, force_recalculate=False, **deskew_kwargs)

Detects the skew angle of the page image and stores it.

Parameters:

    resolution (int): DPI resolution for rendering the page image for detection. Default: 72.
    grayscale (bool): Whether to convert the image to grayscale before detection. Default: True.
    force_recalculate (bool): If True, recalculate even if an angle exists. Default: False.
    **deskew_kwargs: Additional keyword arguments passed to deskew.determine_skew (e.g., max_angle, num_peaks). Default: {}.

Returns:

    Optional[float]: The detected skew angle in degrees, or None if detection failed.

Raises:

    ImportError: If the 'deskew' library is not installed.

Source code in natural_pdf/core/page.py
Lines 2916–2983
def detect_skew_angle(
    self,
    resolution: int = 72,
    grayscale: bool = True,
    force_recalculate: bool = False,
    **deskew_kwargs,
) -> Optional[float]:
    """
    Detects the skew angle of the page image and stores it.

    Args:
        resolution: DPI resolution for rendering the page image for detection.
        grayscale: Whether to convert the image to grayscale before detection.
        force_recalculate: If True, recalculate even if an angle exists.
        **deskew_kwargs: Additional keyword arguments passed to `deskew.determine_skew`
                         (e.g., `max_angle`, `num_peaks`).

    Returns:
        The detected skew angle in degrees, or None if detection failed.

    Raises:
        ImportError: If the 'deskew' library is not installed.
    """
    if not DESKEW_AVAILABLE:
        raise ImportError(
            "Deskew library not found. Install with: pip install natural-pdf[deskew]"
        )

    if self._skew_angle is not None and not force_recalculate:
        logger.debug(f"Page {self.number}: Returning cached skew angle: {self._skew_angle:.2f}")
        return self._skew_angle

    logger.debug(f"Page {self.number}: Detecting skew angle (resolution={resolution} DPI)...")
    try:
        # Render the page at the specified detection resolution
        img = self.to_image(resolution=resolution, include_highlights=False)
        if not img:
            logger.warning(f"Page {self.number}: Failed to render image for skew detection.")
            self._skew_angle = None
            return None

        # Convert to numpy array
        img_np = np.array(img)

        # Convert to grayscale if needed
        if grayscale:
            if len(img_np.shape) == 3 and img_np.shape[2] >= 3:
                gray_np = np.mean(img_np[:, :, :3], axis=2).astype(np.uint8)
            elif len(img_np.shape) == 2:
                gray_np = img_np  # Already grayscale
            else:
                logger.warning(
                    f"Page {self.number}: Unexpected image shape {img_np.shape} for grayscale conversion."
                )
                gray_np = img_np  # Try using it anyway
        else:
            gray_np = img_np  # Use original if grayscale=False

        # Determine skew angle using the deskew library
        angle = determine_skew(gray_np, **deskew_kwargs)
        self._skew_angle = angle
        logger.debug(f"Page {self.number}: Detected skew angle = {angle}")
        return angle

    except Exception as e:
        logger.warning(f"Page {self.number}: Failed during skew detection: {e}", exc_info=True)
        self._skew_angle = None
        return None
natural_pdf.Page.extract_ocr_elements(engine=None, options=None, languages=None, min_confidence=None, device=None, resolution=None)

Extract text elements using OCR without adding them to the page's elements. Uses the shared OCRManager instance.

Parameters:

- engine (Optional[str], default None): Name of the OCR engine.
- options (Optional[OCROptions], default None): Engine-specific options object or dict.
- languages (Optional[List[str]], default None): List of engine-specific language codes.
- min_confidence (Optional[float], default None): Minimum confidence threshold.
- device (Optional[str], default None): Device to run OCR on.
- resolution (Optional[int], default None): DPI resolution for rendering the page image before OCR.

Returns:

- List[TextElement]: List of created TextElement objects derived from OCR results for this page.

Source code in natural_pdf/core/page.py
Lines 2086–2200
def extract_ocr_elements(
    self,
    engine: Optional[str] = None,
    options: Optional["OCROptions"] = None,
    languages: Optional[List[str]] = None,
    min_confidence: Optional[float] = None,
    device: Optional[str] = None,
    resolution: Optional[int] = None,
) -> List["TextElement"]:
    """
    Extract text elements using OCR *without* adding them to the page's elements.
    Uses the shared OCRManager instance.

    Args:
        engine: Name of the OCR engine.
        options: Engine-specific options object or dict.
        languages: List of engine-specific language codes.
        min_confidence: Minimum confidence threshold.
        device: Device to run OCR on.
        resolution: DPI resolution for rendering page image before OCR.

    Returns:
        List of created TextElement objects derived from OCR results for this page.
    """
    if not self._ocr_manager:
        logger.error(
            f"Page {self.number}: OCRManager not available. Cannot extract OCR elements."
        )
        return []

    logger.info(f"Page {self.number}: Extracting OCR elements (extract only)...")

    # Determine rendering resolution
    final_resolution = resolution if resolution is not None else 150  # Default to 150 DPI
    logger.debug(f"  Using rendering resolution: {final_resolution} DPI")

    try:
        # Get base image without highlights using the determined resolution
        # Use the global PDF rendering lock
        with pdf_render_lock:
            image = self.to_image(resolution=final_resolution, include_highlights=False)
            if not image:
                logger.error(
                    f"  Failed to render page {self.number} to image for OCR extraction."
                )
                return []
            logger.debug(f"  Rendered image size: {image.width}x{image.height}")
    except Exception as e:
        logger.error(f"  Failed to render page {self.number} to image: {e}", exc_info=True)
        return []

    # Prepare arguments for the OCR Manager call
    manager_args = {
        "images": image,
        "engine": engine,
        "languages": languages,
        "min_confidence": min_confidence,
        "device": device,
        "options": options,
    }
    manager_args = {k: v for k, v in manager_args.items() if v is not None}

    logger.debug(
        f"  Calling OCR Manager (extract only) with args: { {k:v for k,v in manager_args.items() if k != 'images'} }"
    )
    try:
        # apply_ocr now returns List[List[Dict]] or List[Dict]
        results_list = self._ocr_manager.apply_ocr(**manager_args)
        # If it returned a list of lists (batch mode), take the first list
        results = (
            results_list[0]
            if isinstance(results_list, list)
            and results_list
            and isinstance(results_list[0], list)
            else results_list
        )
        if not isinstance(results, list):
            logger.error(f"  OCR Manager returned unexpected type: {type(results)}")
            results = []
        logger.info(f"  OCR Manager returned {len(results)} results for extraction.")
    except Exception as e:
        logger.error(f"  OCR processing failed during extraction: {e}", exc_info=True)
        return []

    # Convert results but DO NOT add to ElementManager
    logger.debug(f"  Converting OCR results to TextElements (extract only)...")
    temp_elements = []
    scale_x = self.width / image.width if image.width else 1
    scale_y = self.height / image.height if image.height else 1
    for result in results:
        try:  # Added try-except around result processing
            x0, top, x1, bottom = [float(c) for c in result["bbox"]]
            elem_data = {
                "text": result["text"],
                "confidence": result["confidence"],
                "x0": x0 * scale_x,
                "top": top * scale_y,
                "x1": x1 * scale_x,
                "bottom": bottom * scale_y,
                "width": (x1 - x0) * scale_x,
                "height": (bottom - top) * scale_y,
                "object_type": "text",  # Using text for temporary elements
                "source": "ocr",
                "fontname": "OCR-extract",  # Different name for clarity
                "size": 10.0,
                "page_number": self.number,
            }
            temp_elements.append(TextElement(elem_data, self))
        except (KeyError, ValueError, TypeError) as convert_err:
            logger.warning(
                f"  Skipping invalid OCR result during conversion: {result}. Error: {convert_err}"
            )

    logger.info(f"  Created {len(temp_elements)} TextElements from OCR (extract only).")
    return temp_elements
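The conversion loop above maps OCR bounding boxes from image pixels back into PDF points using per-axis scale factors. A standalone sketch of that arithmetic (the function name `scale_bbox` is illustrative, not part of the API):

```python
def scale_bbox(bbox, page_w, page_h, img_w, img_h):
    """Map an image-pixel bbox to PDF points: scale = page_size / image_size."""
    sx = page_w / img_w if img_w else 1
    sy = page_h / img_h if img_h else 1
    x0, top, x1, bottom = bbox
    return (x0 * sx, top * sy, x1 * sx, bottom * sy)

# A 612x792 pt (US Letter) page rendered at 150 DPI is 1275x1650 px.
pdf_box = scale_bbox((255, 330, 510, 363), 612, 792, 1275, 1650)
print(tuple(round(v, 2) for v in pdf_box))  # (122.4, 158.4, 244.8, 174.24)
```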
natural_pdf.Page.extract_table(method=None, table_settings=None, use_ocr=False, ocr_config=None, text_options=None, cell_extraction_func=None, show_progress=False)

Extract the largest table from this page using enhanced region-based extraction.

Parameters:

- method (Optional[str], default None): Method to use: 'tatr', 'pdfplumber', 'text', 'stream', 'lattice', or None (auto-detect).
- table_settings (Optional[dict], default None): Settings for pdfplumber table extraction.
- use_ocr (bool, default False): Whether to use OCR for text extraction (currently only applicable with the 'tatr' method).
- ocr_config (Optional[dict], default None): OCR configuration parameters.
- text_options (Optional[Dict], default None): Dictionary of options for the 'text' method.
- cell_extraction_func (Optional[Callable[[Region], Optional[str]]], default None): Optional callable that takes a cell Region object and returns its string content. For the 'text' method only.
- show_progress (bool, default False): If True, display a progress bar during cell text extraction for the 'text' method.

Returns:

- List[List[Optional[str]]]: Table data as a list of rows, where each row is a list of cell values (str or None).

Source code in natural_pdf/core/page.py
Lines 1301–1337
def extract_table(
    self,
    method: Optional[str] = None,
    table_settings: Optional[dict] = None,
    use_ocr: bool = False,
    ocr_config: Optional[dict] = None,
    text_options: Optional[Dict] = None,
    cell_extraction_func: Optional[Callable[["Region"], Optional[str]]] = None,
    show_progress: bool = False,
) -> List[List[Optional[str]]]:
    """
    Extract the largest table from this page using enhanced region-based extraction.

    Args:
        method: Method to use: 'tatr', 'pdfplumber', 'text', 'stream', 'lattice', or None (auto-detect).
        table_settings: Settings for pdfplumber table extraction.
        use_ocr: Whether to use OCR for text extraction (currently only applicable with 'tatr' method).
        ocr_config: OCR configuration parameters.
        text_options: Dictionary of options for the 'text' method.
        cell_extraction_func: Optional callable function that takes a cell Region object
                              and returns its string content. For 'text' method only.
        show_progress: If True, display a progress bar during cell text extraction for the 'text' method.

    Returns:
        Table data as a list of rows, where each row is a list of cell values (str or None).
    """
    # Create a full-page region and delegate to its enhanced extract_table method
    page_region = self.create_region(0, 0, self.width, self.height)
    return page_region.extract_table(
        method=method,
        table_settings=table_settings,
        use_ocr=use_ocr,
        ocr_config=ocr_config,
        text_options=text_options,
        cell_extraction_func=cell_extraction_func,
        show_progress=show_progress,
    )
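extract_table accepts the same 'stream' and 'lattice' aliases documented for extract_tables, both of which resolve to pdfplumber with different strategy settings. A sketch of that resolution (the helper name `resolve_method` is illustrative, not part of the API):

```python
# Mirror of the alias handling: 'stream' -> text-based strategies,
# 'lattice' -> line-based strategies, applied via setdefault so any
# user-supplied table_settings take precedence.
def resolve_method(method, table_settings=None):
    settings = dict(table_settings or {})
    if method == "stream":
        settings.setdefault("vertical_strategy", "text")
        settings.setdefault("horizontal_strategy", "text")
        return "pdfplumber", settings
    if method == "lattice":
        settings.setdefault("vertical_strategy", "lines")
        settings.setdefault("horizontal_strategy", "lines")
        return "pdfplumber", settings
    return method, settings

print(resolve_method("stream"))
# → ('pdfplumber', {'vertical_strategy': 'text', 'horizontal_strategy': 'text'})
```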
natural_pdf.Page.extract_tables(method=None, table_settings=None, check_tatr=True)

Extract all tables from this page with enhanced method support.

Parameters:

- method (Optional[str], default None): Method to use: 'pdfplumber', 'stream', 'lattice', or None (auto-detect). 'stream' uses text-based strategies, 'lattice' uses line-based strategies. Note: 'tatr' and 'text' methods are not supported for extract_tables.
- table_settings (Optional[dict], default None): Settings for pdfplumber table extraction.
- check_tatr (bool, default True): If True (default), first check for TATR-detected table regions and extract from those before falling back to pdfplumber methods.

Returns:

- List[List[List[str]]]: List of tables, where each table is a list of rows, and each row is a list of cell values.

Source code in natural_pdf/core/page.py
Lines 1339–1502
def extract_tables(
    self,
    method: Optional[str] = None,
    table_settings: Optional[dict] = None,
    check_tatr: bool = True,
) -> List[List[List[str]]]:
    """
    Extract all tables from this page with enhanced method support.

    Args:
        method: Method to use: 'pdfplumber', 'stream', 'lattice', or None (auto-detect).
                'stream' uses text-based strategies, 'lattice' uses line-based strategies.
                Note: 'tatr' and 'text' methods are not supported for extract_tables.
        table_settings: Settings for pdfplumber table extraction.
        check_tatr: If True (default), first check for TATR-detected table regions
                    and extract from those before falling back to pdfplumber methods.

    Returns:
        List of tables, where each table is a list of rows, and each row is a list of cell values.
    """
    if table_settings is None:
        table_settings = {}

    # Check for TATR-detected table regions first if enabled
    if check_tatr:
        try:
            tatr_tables = self.find_all("region[type=table][model=tatr]")
            if tatr_tables:
                logger.debug(
                    f"Page {self.number}: Found {len(tatr_tables)} TATR table regions, extracting from those..."
                )
                extracted_tables = []
                for table_region in tatr_tables:
                    try:
                        table_data = table_region.extract_table(method="tatr")
                        if table_data:  # Only add non-empty tables
                            extracted_tables.append(table_data)
                    except Exception as e:
                        logger.warning(
                            f"Failed to extract table from TATR region {table_region.bbox}: {e}"
                        )

                if extracted_tables:
                    logger.debug(
                        f"Page {self.number}: Successfully extracted {len(extracted_tables)} tables from TATR regions"
                    )
                    return extracted_tables
                else:
                    logger.debug(
                        f"Page {self.number}: TATR regions found but no tables extracted, falling back to pdfplumber"
                    )
            else:
                logger.debug(
                    f"Page {self.number}: No TATR table regions found, using pdfplumber methods"
                )
        except Exception as e:
            logger.debug(
                f"Page {self.number}: Error checking TATR regions: {e}, falling back to pdfplumber"
            )

    # Auto-detect method if not specified (try lattice first, then stream)
    if method is None:
        logger.debug(f"Page {self.number}: Auto-detecting tables extraction method...")

        # Try lattice first
        try:
            lattice_settings = table_settings.copy()
            lattice_settings.setdefault("vertical_strategy", "lines")
            lattice_settings.setdefault("horizontal_strategy", "lines")

            logger.debug(f"Page {self.number}: Trying 'lattice' method first for tables...")
            lattice_result = self._page.extract_tables(lattice_settings)

            # Check if lattice found meaningful tables
            if (
                lattice_result
                and len(lattice_result) > 0
                and any(
                    any(
                        any(cell and cell.strip() for cell in row if cell)
                        for row in table
                        if table
                    )
                    for table in lattice_result
                )
            ):
                logger.debug(
                    f"Page {self.number}: 'lattice' method found {len(lattice_result)} tables"
                )
                return lattice_result
            else:
                logger.debug(f"Page {self.number}: 'lattice' method found no meaningful tables")

        except Exception as e:
            logger.debug(f"Page {self.number}: 'lattice' method failed: {e}")

        # Fall back to stream
        logger.debug(f"Page {self.number}: Falling back to 'stream' method for tables...")
        stream_settings = table_settings.copy()
        stream_settings.setdefault("vertical_strategy", "text")
        stream_settings.setdefault("horizontal_strategy", "text")

        return self._page.extract_tables(stream_settings)

    effective_method = method

    # Handle method aliases
    if effective_method == "stream":
        logger.debug("Using 'stream' method alias for 'pdfplumber' with text-based strategies.")
        effective_method = "pdfplumber"
        table_settings.setdefault("vertical_strategy", "text")
        table_settings.setdefault("horizontal_strategy", "text")
    elif effective_method == "lattice":
        logger.debug(
            "Using 'lattice' method alias for 'pdfplumber' with line-based strategies."
        )
        effective_method = "pdfplumber"
        table_settings.setdefault("vertical_strategy", "lines")
        table_settings.setdefault("horizontal_strategy", "lines")

    # Use the selected method
    if effective_method == "pdfplumber":
        # ---------------------------------------------------------
        # Inject auto-computed or user-specified text tolerances so
        # pdfplumber uses the same numbers we used for word grouping
        # whenever the table algorithm relies on word positions.
        # ---------------------------------------------------------
        if "text" in (
            table_settings.get("vertical_strategy"),
            table_settings.get("horizontal_strategy"),
        ):
            pdf_cfg = getattr(self, "_config", getattr(self._parent, "_config", {}))
            if "text_x_tolerance" not in table_settings and "x_tolerance" not in table_settings:
                x_tol = pdf_cfg.get("x_tolerance")
                if x_tol is not None:
                    table_settings.setdefault("text_x_tolerance", x_tol)
            if "text_y_tolerance" not in table_settings and "y_tolerance" not in table_settings:
                y_tol = pdf_cfg.get("y_tolerance")
                if y_tol is not None:
                    table_settings.setdefault("text_y_tolerance", y_tol)

            # pdfplumber's text strategy benefits from a tight snap tolerance.
            if (
                "snap_tolerance" not in table_settings
                and "snap_x_tolerance" not in table_settings
            ):
                # Derive from y_tol if available, else default 1
                snap = max(1, round((pdf_cfg.get("y_tolerance", 1)) * 0.9))
                table_settings.setdefault("snap_tolerance", snap)
            if (
                "join_tolerance" not in table_settings
                and "join_x_tolerance" not in table_settings
            ):
                join = table_settings.get("snap_tolerance", 1)
                table_settings.setdefault("join_tolerance", join)
                table_settings.setdefault("join_x_tolerance", join)
                table_settings.setdefault("join_y_tolerance", join)

        return self._page.extract_tables(table_settings)
    else:
        raise ValueError(
            f"Unknown tables extraction method: '{method}'. Choose from 'pdfplumber', 'stream', 'lattice'."
        )
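The auto-detect path above accepts the 'lattice' result only when at least one cell contains non-whitespace text; otherwise it falls back to 'stream'. That acceptance test, extracted as a standalone function for illustration:

```python
# Mirror of the "meaningful tables" check used by auto-detection:
# empty strings, None cells, and whitespace-only cells don't count.
def has_meaningful_table(tables):
    return bool(tables) and any(
        any(
            any(cell and cell.strip() for cell in row if cell)
            for row in table
            if table
        )
        for table in tables
    )

print(has_meaningful_table([[["", None], ["  ", ""]]]))            # False
print(has_meaningful_table([[["Name", "Qty"], ["Widget", "3"]]]))  # True
```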
natural_pdf.Page.extract_text(preserve_whitespace=True, use_exclusions=True, debug_exclusions=False, **kwargs)

Extract text from this page, respecting exclusions and using pdfplumber's layout engine (chars_to_textmap) if layout arguments are provided or default.

Parameters:

- use_exclusions (default True): Whether to apply exclusion regions. Note: Filtering logic is now always applied if exclusions exist.
- debug_exclusions (default False): Whether to output detailed exclusion debugging info.
- **kwargs: Additional layout parameters passed directly to pdfplumber's chars_to_textmap function. Common parameters include:
    - layout (bool): If True (default), inserts spaces/newlines.
    - x_density (float): Pixels per character horizontally.
    - y_density (float): Pixels per line vertically.
    - x_tolerance (float): Tolerance for horizontal character grouping.
    - y_tolerance (float): Tolerance for vertical character grouping.
    - line_dir (str): 'ttb', 'btt', 'ltr', 'rtl'
    - char_dir (str): 'ttb', 'btt', 'ltr', 'rtl'
  See the pdfplumber documentation for more.

Returns:

- str: Extracted text as a string, potentially with layout-based spacing.

Source code in natural_pdf/core/page.py
Lines 1188–1299
def extract_text(
    self, preserve_whitespace=True, use_exclusions=True, debug_exclusions=False, **kwargs
) -> str:
    """
    Extract text from this page, respecting exclusions and using pdfplumber's
    layout engine (chars_to_textmap) if layout arguments are provided or default.

    Args:
        use_exclusions: Whether to apply exclusion regions (default: True).
                      Note: Filtering logic is now always applied if exclusions exist.
        debug_exclusions: Whether to output detailed exclusion debugging info (default: False).
        **kwargs: Additional layout parameters passed directly to pdfplumber's
                  `chars_to_textmap` function. Common parameters include:
                  - layout (bool): If True (default), inserts spaces/newlines.
                  - x_density (float): Pixels per character horizontally.
                  - y_density (float): Pixels per line vertically.
                  - x_tolerance (float): Tolerance for horizontal character grouping.
                  - y_tolerance (float): Tolerance for vertical character grouping.
                  - line_dir (str): 'ttb', 'btt', 'ltr', 'rtl'
                  - char_dir (str): 'ttb', 'btt', 'ltr', 'rtl'
                  See pdfplumber documentation for more.

    Returns:
        Extracted text as string, potentially with layout-based spacing.
    """
    logger.debug(f"Page {self.number}: extract_text called with kwargs: {kwargs}")
    debug = kwargs.get("debug", debug_exclusions)  # Allow 'debug' kwarg

    # 1. Get Word Elements (triggers load_elements if needed)
    word_elements = self.words
    if not word_elements:
        logger.debug(f"Page {self.number}: No word elements found.")
        return ""

    # 2. Get Exclusions
    apply_exclusions_flag = use_exclusions  # named parameter; it never appears in kwargs
    exclusion_regions = []
    if apply_exclusions_flag and self._exclusions:
        exclusion_regions = self._get_exclusion_regions(include_callable=True, debug=debug)
        if debug:
            logger.debug(f"Page {self.number}: Applying {len(exclusion_regions)} exclusions.")
    elif debug:
        logger.debug(f"Page {self.number}: Not applying exclusions.")

    # 3. Collect All Character Dictionaries from Word Elements
    all_char_dicts = []
    for word in word_elements:
        all_char_dicts.extend(getattr(word, "_char_dicts", []))

    # 4. Spatially Filter Characters
    filtered_chars = filter_chars_spatially(
        char_dicts=all_char_dicts,
        exclusion_regions=exclusion_regions,
        target_region=None,  # No target region for full page extraction
        debug=debug,
    )

    # 5. Generate Text Layout using Utility
    # Pass page bbox as layout context
    page_bbox = (0, 0, self.width, self.height)
    # Merge PDF-level default tolerances if caller did not override
    merged_kwargs = dict(kwargs)
    tol_keys = ["x_tolerance", "x_tolerance_ratio", "y_tolerance"]
    for k in tol_keys:
        if k not in merged_kwargs:
            if k in self._config:
                merged_kwargs[k] = self._config[k]
            elif k in getattr(self._parent, "_config", {}):
                merged_kwargs[k] = self._parent._config[k]

    result = generate_text_layout(
        char_dicts=filtered_chars,
        layout_context_bbox=page_bbox,
        user_kwargs=merged_kwargs,
    )

    # --- Optional: apply Unicode BiDi algorithm for mixed RTL/LTR correctness ---
    apply_bidi = kwargs.get("bidi", True)
    if apply_bidi and result:
        # Quick check for any RTL character
        import unicodedata

        def _contains_rtl(s):
            return any(unicodedata.bidirectional(ch) in ("R", "AL", "AN") for ch in s)

        if _contains_rtl(result):
            try:
                from bidi.algorithm import get_display  # type: ignore

                from natural_pdf.utils.bidi_mirror import mirror_brackets

                result = "\n".join(
                    mirror_brackets(
                        get_display(
                            line,
                            base_dir=(
                                "R"
                                if any(
                                    unicodedata.bidirectional(ch) in ("R", "AL", "AN")
                                    for ch in line
                                )
                                else "L"
                            ),
                        )
                    )
                    for line in result.split("\n")
                )
            except ModuleNotFoundError:
                pass  # silently skip if python-bidi not available

    logger.debug(f"Page {self.number}: extract_text finished, result length: {len(result)}.")
    return result
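The optional BiDi pass above is gated by a quick scan for right-to-left characters; lines containing none are left untouched. A self-contained sketch of that check:

```python
import unicodedata

def contains_rtl(s):
    """True if s contains a character whose bidirectional class is RTL."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL", "AN") for ch in s)

print(contains_rtl("hello"))       # False
print(contains_rtl("שלום world"))  # True (Hebrew letters are class R)
print(contains_rtl("مرحبا"))       # True (Arabic letters are class AL)
```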
natural_pdf.Page.filter_elements(elements, selector, **kwargs)

Filter a list of elements based on a selector.

Parameters:

- elements (List[Element], required): List of elements to filter.
- selector (str, required): CSS-like selector string.
- **kwargs: Additional filter parameters.

Returns:

- List[Element]: List of elements that match the selector.

Source code in natural_pdf/core/page.py
Lines 1095–1129
def filter_elements(
    self, elements: List["Element"], selector: str, **kwargs
) -> List["Element"]:
    """
    Filter a list of elements based on a selector.

    Args:
        elements: List of elements to filter
        selector: CSS-like selector string
        **kwargs: Additional filter parameters

    Returns:
        List of elements that match the selector
    """
    from natural_pdf.selectors.parser import parse_selector, selector_to_filter_func

    # Parse the selector
    selector_obj = parse_selector(selector)

    # Create filter function from selector
    filter_func = selector_to_filter_func(selector_obj, **kwargs)

    # Apply the filter to the elements
    matching_elements = [element for element in elements if filter_func(element)]

    # Sort elements in reading order if requested
    if kwargs.get("reading_order", True):
        if all(hasattr(el, "top") and hasattr(el, "x0") for el in matching_elements):
            matching_elements.sort(key=lambda el: (el.top, el.x0))
        else:
            logger.warning(
                "Cannot sort elements in reading order: Missing required attributes (top, x0)."
            )

    return matching_elements
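Matches are returned in reading order by default: sorted top-to-bottom, then left-to-right, on the (top, x0) key shown above. A small demonstration with stand-in elements (SimpleNamespace objects substitute for real Element instances):

```python
from types import SimpleNamespace

# Three stand-in elements: two on the same line, one lower on the page.
els = [
    SimpleNamespace(top=200, x0=50, text="footer"),
    SimpleNamespace(top=100, x0=300, text="right"),
    SimpleNamespace(top=100, x0=50, text="left"),
]

# Same sort key as filter_elements' reading-order pass.
els.sort(key=lambda el: (el.top, el.x0))
print([e.text for e in els])  # ['left', 'right', 'footer']
```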
natural_pdf.Page.find(selector=None, *, text=None, apply_exclusions=True, regex=False, case=True, **kwargs)
find(*, text: str, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> Optional[Any]
find(selector: str, *, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> Optional[Any]

Find first element on this page matching selector OR text content.

Provide EITHER selector OR text, but not both.

Parameters:

- selector (Optional[str], default None): CSS-like selector string.
- text (Optional[str], default None): Text content to search for (equivalent to 'text:contains(...)').
- apply_exclusions (bool, default True): Whether to exclude elements in exclusion regions.
- regex (bool, default False): Whether to use regex for text search (selector or text).
- case (bool, default True): Whether to do case-sensitive text search (selector or text).
- **kwargs: Additional filter parameters.

Returns:

- Optional[Any]: Element object or None if not found.

Source code in natural_pdf/core/page.py
Lines 584–652
def find(
    self,
    selector: Optional[str] = None,  # Now optional
    *,  # Force subsequent args to be keyword-only
    text: Optional[str] = None,  # New text parameter
    apply_exclusions: bool = True,
    regex: bool = False,
    case: bool = True,
    **kwargs,
) -> Optional[Any]:
    """
    Find first element on this page matching selector OR text content.

    Provide EITHER `selector` OR `text`, but not both.

    Args:
        selector: CSS-like selector string.
        text: Text content to search for (equivalent to 'text:contains(...)').
        apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
        regex: Whether to use regex for text search (`selector` or `text`) (default: False).
        case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
        **kwargs: Additional filter parameters.

    Returns:
        Element object or None if not found.
    """
    if selector is not None and text is not None:
        raise ValueError("Provide either 'selector' or 'text', not both.")
    if selector is None and text is None:
        raise ValueError("Provide either 'selector' or 'text'.")

    # Construct selector if 'text' is provided
    effective_selector = ""
    if text is not None:
        # Escape quotes within the text for the selector string
        escaped_text = text.replace('"', '\\"').replace("'", "\\'")
        # Default to 'text:contains(...)'
        effective_selector = f'text:contains("{escaped_text}")'
        # Note: regex/case handled by kwargs passed down
        logger.debug(
            f"Using text shortcut: find(text='{text}') -> find('{effective_selector}')"
        )
    elif selector is not None:
        effective_selector = selector
    else:
        # Should be unreachable due to checks above
        raise ValueError("Internal error: No selector or text provided.")

    selector_obj = parse_selector(effective_selector)

    # Pass regex and case flags to selector function via kwargs
    kwargs["regex"] = regex
    kwargs["case"] = case

    # First get all matching elements without applying exclusions initially within _apply_selector
    results_collection = self._apply_selector(
        selector_obj, **kwargs
    )  # _apply_selector doesn't filter

    # Filter the results based on exclusions if requested
    if apply_exclusions and self._exclusions and results_collection:
        filtered_elements = self._filter_elements_by_exclusions(results_collection.elements)
        # Return the first element from the filtered list
        return filtered_elements[0] if filtered_elements else None
    elif results_collection:
        # Return the first element from the unfiltered results
        return results_collection.first
    else:
        return None
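The `text=` shortcut is just sugar for a `text:contains(...)` selector. A standalone sketch of that translation, mirroring the quote-escaping in the method body above (no natural-pdf import needed):

```python
def text_to_selector(text: str) -> str:
    """Translate find(text=...) into the equivalent CSS-like selector,
    escaping quotes the same way Page.find does."""
    escaped = text.replace('"', '\\"').replace("'", "\\'")
    return f'text:contains("{escaped}")'

print(text_to_selector("Total"))    # text:contains("Total")
print(text_to_selector('say "hi"'))
```

The escaped string keeps embedded quotes from terminating the selector argument early.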
natural_pdf.Page.find_all(selector=None, *, text=None, apply_exclusions=True, regex=False, case=True, **kwargs)
find_all(*, text: str, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> ElementCollection
find_all(selector: str, *, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> ElementCollection

Find all elements on this page matching selector OR text content.

Provide EITHER selector OR text, but not both.

Parameters:

    selector (Optional[str], default None):
        CSS-like selector string.
    text (Optional[str], default None):
        Text content to search for (equivalent to 'text:contains(...)').
    apply_exclusions (bool, default True):
        Whether to exclude elements in exclusion regions.
    regex (bool, default False):
        Whether to use regex for text search (selector or text).
    case (bool, default True):
        Whether to do case-sensitive text search (selector or text).
    **kwargs:
        Additional filter parameters.

Returns:

    ElementCollection: ElementCollection with matching elements.

Source code in natural_pdf/core/page.py
def find_all(
    self,
    selector: Optional[str] = None,  # Now optional
    *,  # Force subsequent args to be keyword-only
    text: Optional[str] = None,  # New text parameter
    apply_exclusions: bool = True,
    regex: bool = False,
    case: bool = True,
    **kwargs,
) -> "ElementCollection":
    """
    Find all elements on this page matching selector OR text content.

    Provide EITHER `selector` OR `text`, but not both.

    Args:
        selector: CSS-like selector string.
        text: Text content to search for (equivalent to 'text:contains(...)').
        apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
        regex: Whether to use regex for text search (`selector` or `text`) (default: False).
        case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
        **kwargs: Additional filter parameters.

    Returns:
        ElementCollection with matching elements.
    """
    from natural_pdf.elements.collections import ElementCollection  # Import here for type hint

    if selector is not None and text is not None:
        raise ValueError("Provide either 'selector' or 'text', not both.")
    if selector is None and text is None:
        raise ValueError("Provide either 'selector' or 'text'.")

    # Construct selector if 'text' is provided
    effective_selector = ""
    if text is not None:
        # Escape quotes within the text for the selector string
        escaped_text = text.replace('"', '\\"').replace("'", "\\'")
        # Default to 'text:contains(...)'
        effective_selector = f'text:contains("{escaped_text}")'
        logger.debug(
            f"Using text shortcut: find_all(text='{text}') -> find_all('{effective_selector}')"
        )
    elif selector is not None:
        effective_selector = selector
    else:
        # Should be unreachable due to checks above
        raise ValueError("Internal error: No selector or text provided.")

    selector_obj = parse_selector(effective_selector)

    # Pass regex and case flags to selector function via kwargs
    kwargs["regex"] = regex
    kwargs["case"] = case

    # First get all matching elements without applying exclusions initially within _apply_selector
    results_collection = self._apply_selector(
        selector_obj, **kwargs
    )  # _apply_selector doesn't filter

    # Filter the results based on exclusions if requested
    if apply_exclusions and self._exclusions and results_collection:
        filtered_elements = self._filter_elements_by_exclusions(results_collection.elements)
        return ElementCollection(filtered_elements)
    else:
        # Return the unfiltered collection
        return results_collection
natural_pdf.Page.get_content()

Returns the primary content object (self) for indexing (required by Indexable protocol). SearchService implementations decide how to process this (e.g., call extract_text).

Source code in natural_pdf/core/page.py
def get_content(self) -> "Page":
    """
    Returns the primary content object (self) for indexing (required by Indexable protocol).
    SearchService implementations decide how to process this (e.g., call extract_text).
    """
    return self  # Return the Page object itself
natural_pdf.Page.get_content_hash()

Returns a SHA256 hash of the extracted text content (required by Indexable for sync).

Source code in natural_pdf/core/page.py
def get_content_hash(self) -> str:
    """Returns a SHA256 hash of the extracted text content (required by Indexable for sync)."""
    # Hash the extracted text (without exclusions for consistency)
    # Consider if exclusions should be part of the hash? For now, hash raw text.
    # Using extract_text directly might be slow if called repeatedly. Cache? TODO: Optimization
    text_content = self.extract_text(
        use_exclusions=False, preserve_whitespace=False
    )  # Normalize whitespace?
    return hashlib.sha256(text_content.encode("utf-8")).hexdigest()
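The hash itself is plain SHA-256 over the UTF-8 bytes of the extracted text, as in the body above. A standalone sketch:

```python
import hashlib

def content_hash(text: str) -> str:
    """SHA-256 hex digest of a page's extracted text, mirroring
    Page.get_content_hash (which hashes text extracted without exclusions)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

h = content_hash("Hello PDF")
print(len(h))  # 64 hex characters
```

Because the digest depends only on the text, two pages with identical text content hash identically, which is what makes it usable as a sync marker.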
natural_pdf.Page.get_elements(apply_exclusions=True, debug_exclusions=False)

Get all elements on this page.

Parameters:

    apply_exclusions (bool, default True):
        Whether to apply exclusion regions.
    debug_exclusions (bool, default False):
        Whether to output detailed exclusion debugging info.

Returns:

    List[Element]: List of all elements on the page, potentially filtered by exclusions.

Source code in natural_pdf/core/page.py
def get_elements(
    self, apply_exclusions=True, debug_exclusions: bool = False
) -> List["Element"]:
    """
    Get all elements on this page.

    Args:
        apply_exclusions: Whether to apply exclusion regions (default: True).
        debug_exclusions: Whether to output detailed exclusion debugging info (default: False).

    Returns:
        List of all elements on the page, potentially filtered by exclusions.
    """
    # Get all elements from the element manager
    all_elements = self._element_mgr.get_all_elements()

    # Apply exclusions if requested
    if apply_exclusions and self._exclusions:
        return self._filter_elements_by_exclusions(
            all_elements, debug_exclusions=debug_exclusions
        )
    else:
        if debug_exclusions:
            print(
                f"Page {self.index}: get_elements returning all {len(all_elements)} elements (exclusions not applied)."
            )
        return all_elements
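The exclusion step drops elements that fall in excluded regions. A hypothetical geometric filter on plain bbox tuples, illustrating the idea only (this is not the library's actual `_filter_elements_by_exclusions` implementation):

```python
def filter_by_exclusions(elements, exclusions):
    """Drop elements whose bounding box intersects any exclusion region.
    Both elements and exclusions are (x0, top, x1, bottom) tuples here."""
    def intersects(a, b):
        # Two boxes overlap unless one is entirely left of, right of,
        # above, or below the other.
        return not (a[2] <= b[0] or a[0] >= b[2] or a[3] <= b[1] or a[1] >= b[3])
    return [el for el in elements if not any(intersects(el, ex) for ex in exclusions)]

header_band = (0, 0, 612, 50)  # exclude a header strip across the page top
els = [(10, 10, 100, 30), (10, 100, 100, 120)]
print(filter_by_exclusions(els, [header_band]))  # [(10, 100, 100, 120)]
```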
natural_pdf.Page.get_id()

Returns a unique identifier for the page (required by Indexable protocol).

Source code in natural_pdf/core/page.py
def get_id(self) -> str:
    """Returns a unique identifier for the page (required by Indexable protocol)."""
    # Ensure path is safe for use in IDs (replace problematic chars)
    safe_path = re.sub(r"[^a-zA-Z0-9_-]", "_", str(self.pdf.path))
    return f"pdf_{safe_path}_page_{self.page_number}"
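The identifier is built by sanitizing the PDF path so it is safe to embed in an ID. A standalone sketch of the same substitution:

```python
import re

def page_id(pdf_path: str, page_number: int) -> str:
    """Build the stable page identifier Page.get_id produces, replacing
    every character outside [a-zA-Z0-9_-] with an underscore."""
    safe_path = re.sub(r"[^a-zA-Z0-9_-]", "_", str(pdf_path))
    return f"pdf_{safe_path}_page_{page_number}"

print(page_id("docs/report-2023.pdf", 1))  # pdf_docs_report-2023_pdf_page_1
```

Note that slashes and dots both collapse to underscores, so two distinct paths can in principle collide; the page number disambiguates within one document.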
natural_pdf.Page.get_metadata()

Returns metadata associated with the page (required by Indexable protocol).

Source code in natural_pdf/core/page.py
def get_metadata(self) -> Dict[str, Any]:
    """Returns metadata associated with the page (required by Indexable protocol)."""
    # Add content hash here for sync
    metadata = {
        "pdf_path": str(self.pdf.path),
        "page_number": self.page_number,
        "width": self.width,
        "height": self.height,
        "content_hash": self.get_content_hash(),  # Include the hash
    }
    return metadata
natural_pdf.Page.get_section_between(start_element=None, end_element=None, boundary_inclusion='both')

Get a section between two elements on this page.

Source code in natural_pdf/core/page.py
def get_section_between(
    self, start_element=None, end_element=None, boundary_inclusion="both"
) -> Optional["Region"]:  # Return Optional
    """
    Get a section between two elements on this page.
    """
    # Create a full-page region to operate within
    page_region = self.create_region(0, 0, self.width, self.height)

    # Delegate to the region's method
    try:
        return page_region.get_section_between(
            start_element=start_element,
            end_element=end_element,
            boundary_inclusion=boundary_inclusion,
        )
    except Exception as e:
        logger.error(
            f"Error getting section between elements on page {self.index}: {e}", exc_info=True
        )
        return None
natural_pdf.Page.get_sections(start_elements=None, end_elements=None, boundary_inclusion='start', y_threshold=5.0, bounding_box=None)

Get sections of a page defined by start/end elements. Uses the page-level implementation.

Returns:

    ElementCollection[Region]: An ElementCollection containing the found Region objects.

Source code in natural_pdf/core/page.py
def get_sections(
    self,
    start_elements=None,
    end_elements=None,
    boundary_inclusion="start",
    y_threshold=5.0,
    bounding_box=None,
) -> "ElementCollection[Region]":
    """
    Get sections of a page defined by start/end elements.
    Uses the page-level implementation.

    Returns:
        An ElementCollection containing the found Region objects.
    """

    # Helper function to get bounds from bounding_box parameter
    def get_bounds():
        if bounding_box:
            x0, top, x1, bottom = bounding_box
            # Clamp to page boundaries
            return max(0, x0), max(0, top), min(self.width, x1), min(self.height, bottom)
        else:
            return 0, 0, self.width, self.height

    regions = []

    # Handle cases where elements are provided as strings (selectors)
    if isinstance(start_elements, str):
        start_elements = self.find_all(start_elements).elements  # Get list of elements
    elif hasattr(start_elements, "elements"):  # Handle ElementCollection input
        start_elements = start_elements.elements

    if isinstance(end_elements, str):
        end_elements = self.find_all(end_elements).elements
    elif hasattr(end_elements, "elements"):
        end_elements = end_elements.elements

    # Ensure start_elements is a list
    if start_elements is None:
        start_elements = []
    if end_elements is None:
        end_elements = []

    valid_inclusions = ["start", "end", "both", "none"]
    if boundary_inclusion not in valid_inclusions:
        raise ValueError(f"boundary_inclusion must be one of {valid_inclusions}")

    if not start_elements:
        # Return an empty ElementCollection if no start elements
        return ElementCollection([])

    # Combine start and end elements with their type
    all_boundaries = []
    for el in start_elements:
        all_boundaries.append((el, "start"))
    for el in end_elements:
        all_boundaries.append((el, "end"))

    # Sort all boundary elements primarily by top, then x0
    try:
        all_boundaries.sort(key=lambda x: (x[0].top, x[0].x0))
    except AttributeError as e:
        logger.error(f"Error sorting boundaries: Element missing top/x0 attribute? {e}")
        return ElementCollection([])  # Cannot proceed if elements lack position

    # Process sorted boundaries to find sections
    current_start_element = None
    active_section_started = False

    for element, element_type in all_boundaries:
        if element_type == "start":
            # If we have an active section, this start implicitly ends it
            if active_section_started:
                end_boundary_el = element  # Use this start as the end boundary
                # Determine region boundaries
                sec_top = (
                    current_start_element.top
                    if boundary_inclusion in ["start", "both"]
                    else current_start_element.bottom
                )
                sec_bottom = (
                    end_boundary_el.top
                    if boundary_inclusion not in ["end", "both"]
                    else end_boundary_el.bottom
                )

                if sec_top < sec_bottom:  # Ensure valid region
                    x0, _, x1, _ = get_bounds()
                    region = self.create_region(x0, sec_top, x1, sec_bottom)
                    region.start_element = current_start_element
                    region.end_element = end_boundary_el  # Mark the element that ended it
                    region.is_end_next_start = True  # Mark how it ended
                    regions.append(region)
                active_section_started = False  # Reset for the new start

            # Set this as the potential start of the next section
            current_start_element = element
            active_section_started = True

        elif element_type == "end" and active_section_started:
            # We found an explicit end for the current section
            end_boundary_el = element
            sec_top = (
                current_start_element.top
                if boundary_inclusion in ["start", "both"]
                else current_start_element.bottom
            )
            sec_bottom = (
                end_boundary_el.bottom
                if boundary_inclusion in ["end", "both"]
                else end_boundary_el.top
            )

            if sec_top < sec_bottom:  # Ensure valid region
                x0, _, x1, _ = get_bounds()
                region = self.create_region(x0, sec_top, x1, sec_bottom)
                region.start_element = current_start_element
                region.end_element = end_boundary_el
                region.is_end_next_start = False
                regions.append(region)

            # Reset: section ended explicitly
            current_start_element = None
            active_section_started = False

    # Handle the last section if it was started but never explicitly ended
    if active_section_started:
        sec_top = (
            current_start_element.top
            if boundary_inclusion in ["start", "both"]
            else current_start_element.bottom
        )
        x0, _, x1, page_bottom = get_bounds()
        if sec_top < page_bottom:
            region = self.create_region(x0, sec_top, x1, page_bottom)
            region.start_element = current_start_element
            region.end_element = None  # Ended by page end
            region.is_end_next_start = False
            regions.append(region)

    return ElementCollection(regions)
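The boundary-pairing loop above is easier to follow on bare y-coordinates. A reduced sketch with hypothetical numbers, fixing boundary_inclusion='start' and ignoring x-bounds: a new start implicitly closes the open section, an explicit end closes it at its own top, and a trailing start runs to the page bottom:

```python
def sections(starts, ends, page_bottom):
    """Pair sorted start/end y-coordinates into (top, bottom) sections,
    mimicking get_sections with boundary_inclusion='start'."""
    boundaries = sorted([(y, "start") for y in starts] + [(y, "end") for y in ends])
    out, current = [], None
    for y, kind in boundaries:
        if kind == "start":
            if current is not None:
                out.append((current, y))   # previous section ends where the new one starts
            current = y
        elif kind == "end" and current is not None:
            out.append((current, y))
            current = None
    if current is not None:
        out.append((current, page_bottom)) # last section runs to the bottom of the page
    return out

print(sections([100, 300], [500], 800))  # [(100, 300), (300, 500)]
```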
natural_pdf.Page.highlight(bbox=None, color=None, label=None, use_color_cycling=False, element=None, include_attrs=None, existing='append')

Highlight a bounding box or the entire page. Delegates to the central HighlightingService.

Parameters:

    bbox (Optional[Tuple[float, float, float, float]], default None):
        Bounding box (x0, top, x1, bottom). If None, highlight entire page.
    color (Optional[Union[Tuple, str]], default None):
        RGBA color tuple/string for the highlight.
    label (Optional[str], default None):
        Optional label for the highlight.
    use_color_cycling (bool, default False):
        If True and no label/color, use next cycle color.
    element (Optional[Any], default None):
        Optional original element being highlighted (for attribute extraction).
    include_attrs (Optional[List[str]], default None):
        List of attribute names from 'element' to display.
    existing (str, default 'append'):
        How to handle existing highlights ('append' or 'replace').

Returns:

    Page: Self for method chaining.

Source code in natural_pdf/core/page.py
def highlight(
    self,
    bbox: Optional[Tuple[float, float, float, float]] = None,
    color: Optional[Union[Tuple, str]] = None,
    label: Optional[str] = None,
    use_color_cycling: bool = False,
    element: Optional[Any] = None,
    include_attrs: Optional[List[str]] = None,
    existing: str = "append",
) -> "Page":
    """
    Highlight a bounding box or the entire page.
    Delegates to the central HighlightingService.

    Args:
        bbox: Bounding box (x0, top, x1, bottom). If None, highlight entire page.
        color: RGBA color tuple/string for the highlight.
        label: Optional label for the highlight.
        use_color_cycling: If True and no label/color, use next cycle color.
        element: Optional original element being highlighted (for attribute extraction).
        include_attrs: List of attribute names from 'element' to display.
        existing: How to handle existing highlights ('append' or 'replace').

    Returns:
        Self for method chaining.
    """
    target_bbox = bbox if bbox is not None else (0, 0, self.width, self.height)
    self._highlighter.add(
        page_index=self.index,
        bbox=target_bbox,
        color=color,
        label=label,
        use_color_cycling=use_color_cycling,
        element=element,
        include_attrs=include_attrs,
        existing=existing,
    )
    return self
natural_pdf.Page.highlight_polygon(polygon, color=None, label=None, use_color_cycling=False, element=None, include_attrs=None, existing='append')

Highlight a polygon shape on the page. Delegates to the central HighlightingService.

Parameters:

    polygon (List[Tuple[float, float]], required):
        List of (x, y) points defining the polygon.
    color (Optional[Union[Tuple, str]], default None):
        RGBA color tuple/string for the highlight.
    label (Optional[str], default None):
        Optional label for the highlight.
    use_color_cycling (bool, default False):
        If True and no label/color, use next cycle color.
    element (Optional[Any], default None):
        Optional original element being highlighted (for attribute extraction).
    include_attrs (Optional[List[str]], default None):
        List of attribute names from 'element' to display.
    existing (str, default 'append'):
        How to handle existing highlights ('append' or 'replace').

Returns:

    Page: Self for method chaining.

Source code in natural_pdf/core/page.py
def highlight_polygon(
    self,
    polygon: List[Tuple[float, float]],
    color: Optional[Union[Tuple, str]] = None,
    label: Optional[str] = None,
    use_color_cycling: bool = False,
    element: Optional[Any] = None,
    include_attrs: Optional[List[str]] = None,
    existing: str = "append",
) -> "Page":
    """
    Highlight a polygon shape on the page.
    Delegates to the central HighlightingService.

    Args:
        polygon: List of (x, y) points defining the polygon.
        color: RGBA color tuple/string for the highlight.
        label: Optional label for the highlight.
        use_color_cycling: If True and no label/color, use next cycle color.
        element: Optional original element being highlighted (for attribute extraction).
        include_attrs: List of attribute names from 'element' to display.
        existing: How to handle existing highlights ('append' or 'replace').

    Returns:
        Self for method chaining.
    """
    self._highlighter.add_polygon(
        page_index=self.index,
        polygon=polygon,
        color=color,
        label=label,
        use_color_cycling=use_color_cycling,
        element=element,
        include_attrs=include_attrs,
        existing=existing,
    )
    return self
natural_pdf.Page.inspect(limit=30)

Inspect all elements on this page with detailed tabular view. Equivalent to page.find_all('*').inspect().

Parameters:

    limit (int, default 30):
        Maximum elements per type to show.

Returns:

    InspectionSummary: Element tables showing coordinates, properties, and other details for each element.

Source code in natural_pdf/core/page.py
def inspect(self, limit: int = 30) -> "InspectionSummary":
    """
    Inspect all elements on this page with detailed tabular view.
    Equivalent to page.find_all('*').inspect().

    Args:
        limit: Maximum elements per type to show (default: 30)

    Returns:
        InspectionSummary with element tables showing coordinates,
        properties, and other details for each element
    """
    return self.find_all("*").inspect(limit=limit)
natural_pdf.Page.region(left=None, top=None, right=None, bottom=None, width=None, height=None)

Create a region on this page with more intuitive named parameters, allowing definition by coordinates or by coordinate + dimension.

Parameters:

    left (float, default None):
        Left x-coordinate (default: 0 if width not used).
    top (float, default None):
        Top y-coordinate (default: 0 if height not used).
    right (float, default None):
        Right x-coordinate (default: page width if width not used).
    bottom (float, default None):
        Bottom y-coordinate (default: page height if height not used).
    width (Union[str, float, None], default None):
        Width definition. Can be:
        - Numeric: the width of the region in points. Cannot be used with both left and right.
        - 'full': sets region width to full page width (overrides left/right).
        - 'element' or None (default): uses provided/calculated left/right, defaulting to page width if neither is specified.
    height (Optional[float], default None):
        Numeric height of the region. Cannot be used with both top and bottom.

Returns:

    Any: Region object for the specified coordinates.

Raises:

    ValueError: If conflicting arguments are provided (e.g., top, bottom, and height) or if width is an invalid string.

Examples:

>>> page.region(top=100, height=50)  # Region from y=100 to y=150, default width
>>> page.region(left=50, width=100)   # Region from x=50 to x=150, default height
>>> page.region(bottom=500, height=50) # Region from y=450 to y=500
>>> page.region(right=200, width=50)  # Region from x=150 to x=200
>>> page.region(top=100, bottom=200, width="full") # Explicit full width
Source code in natural_pdf/core/page.py
def region(
    self,
    left: float = None,
    top: float = None,
    right: float = None,
    bottom: float = None,
    width: Union[str, float, None] = None,
    height: Optional[float] = None,
) -> Any:
    """
    Create a region on this page with more intuitive named parameters,
    allowing definition by coordinates or by coordinate + dimension.

    Args:
        left: Left x-coordinate (default: 0 if width not used).
        top: Top y-coordinate (default: 0 if height not used).
        right: Right x-coordinate (default: page width if width not used).
        bottom: Bottom y-coordinate (default: page height if height not used).
        width: Width definition. Can be:
               - Numeric: The width of the region in points. Cannot be used with both left and right.
               - String 'full': Sets region width to full page width (overrides left/right).
               - String 'element' or None (default): Uses provided/calculated left/right,
                 defaulting to page width if neither are specified.
        height: Numeric height of the region. Cannot be used with both top and bottom.

    Returns:
        Region object for the specified coordinates

    Raises:
        ValueError: If conflicting arguments are provided (e.g., top, bottom, and height)
                  or if width is an invalid string.

    Examples:
        >>> page.region(top=100, height=50)  # Region from y=100 to y=150, default width
        >>> page.region(left=50, width=100)   # Region from x=50 to x=150, default height
        >>> page.region(bottom=500, height=50) # Region from y=450 to y=500
        >>> page.region(right=200, width=50)  # Region from x=150 to x=200
        >>> page.region(top=100, bottom=200, width="full") # Explicit full width
    """
    # ------------------------------------------------------------------
    # Percentage support – convert strings like "30%" to absolute values
    # based on page dimensions.  X-axis params (left, right, width) use
    # page.width; Y-axis params (top, bottom, height) use page.height.
    # ------------------------------------------------------------------

    def _pct_to_abs(val, axis: str):
        if isinstance(val, str) and val.strip().endswith("%"):
            try:
                pct = float(val.strip()[:-1]) / 100.0
            except ValueError:
                return val  # leave unchanged if not a number
            return pct * (self.width if axis == "x" else self.height)
        return val

    left = _pct_to_abs(left, "x")
    right = _pct_to_abs(right, "x")
    width = _pct_to_abs(width, "x")
    top = _pct_to_abs(top, "y")
    bottom = _pct_to_abs(bottom, "y")
    height = _pct_to_abs(height, "y")

    # --- Type checking and basic validation ---
    is_width_numeric = isinstance(width, (int, float))
    is_width_string = isinstance(width, str)
    width_mode = "element"  # Default mode

    if height is not None and top is not None and bottom is not None:
        raise ValueError("Cannot specify top, bottom, and height simultaneously.")
    if is_width_numeric and left is not None and right is not None:
        raise ValueError("Cannot specify left, right, and a numeric width simultaneously.")
    if is_width_string:
        width_lower = width.lower()
        if width_lower not in ["full", "element"]:
            raise ValueError("String width argument must be 'full' or 'element'.")
        width_mode = width_lower

    # --- Calculate Coordinates ---
    final_top = top
    final_bottom = bottom
    final_left = left
    final_right = right

    # Height calculations
    if height is not None:
        if top is not None:
            final_bottom = top + height
        elif bottom is not None:
            final_top = bottom - height
        else:  # Neither top nor bottom provided, default top to 0
            final_top = 0
            final_bottom = height

    # Width calculations (numeric only)
    if is_width_numeric:
        if left is not None:
            final_right = left + width
        elif right is not None:
            final_left = right - width
        else:  # Neither left nor right provided, default left to 0
            final_left = 0
            final_right = width

    # --- Apply Defaults for Unset Coordinates ---
    # Only default coordinates if they weren't set by dimension calculation
    if final_top is None:
        final_top = 0
    if final_bottom is None:
        # Check if bottom should have been set by height calc
        if height is None or top is None:
            final_bottom = self.height

    if final_left is None:
        final_left = 0
    if final_right is None:
        # Check if right should have been set by width calc
        if not is_width_numeric or left is None:
            final_right = self.width

    # --- Handle width_mode == 'full' ---
    if width_mode == "full":
        # Override left/right if mode is full
        final_left = 0
        final_right = self.width

    # --- Final Validation & Creation ---
    # Ensure coordinates are within page bounds (clamp)
    final_left = max(0, final_left)
    final_top = max(0, final_top)
    final_right = min(self.width, final_right)
    final_bottom = min(self.height, final_bottom)

    # Ensure valid box (x0<=x1, top<=bottom)
    if final_left > final_right:
        logger.warning(f"Calculated left ({final_left}) > right ({final_right}). Swapping.")
        final_left, final_right = final_right, final_left
    if final_top > final_bottom:
        logger.warning(f"Calculated top ({final_top}) > bottom ({final_bottom}). Swapping.")
        final_top, final_bottom = final_bottom, final_top

    from natural_pdf.elements.region import Region

    region = Region(self, (final_left, final_top, final_right, final_bottom))
    return region
natural_pdf.Page.remove_text_layer()

Remove all text elements from this page.

This removes all text elements (words and characters) from the page, effectively clearing the text layer.

Returns:

- Page: Self for method chaining.

Source code in natural_pdf/core/page.py
def remove_text_layer(self) -> "Page":
    """
    Remove all text elements from this page.

    This removes all text elements (words and characters) from the page,
    effectively clearing the text layer.

    Returns:
        Self for method chaining
    """
    logger.info(f"Page {self.number}: Removing all text elements...")

    # Remove all words and chars from the element manager
    removed_words = len(self._element_mgr.words)
    removed_chars = len(self._element_mgr.chars)

    # Clear the lists
    self._element_mgr._elements["words"] = []
    self._element_mgr._elements["chars"] = []

    logger.info(
        f"Page {self.number}: Removed {removed_words} words and {removed_chars} characters"
    )
    return self
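Because the method returns `self`, it composes with other page operations. A minimal standalone sketch of this clear-and-chain pattern (a toy `MiniPage` class, not part of natural-pdf):

```python
class MiniPage:
    """Toy stand-in illustrating remove_text_layer's clear-and-chain pattern."""

    def __init__(self, words, chars):
        self._elements = {"words": list(words), "chars": list(chars)}

    def remove_text_layer(self):
        removed_words = len(self._elements["words"])
        removed_chars = len(self._elements["chars"])
        # Clear both text layers, mirroring the real implementation
        self._elements["words"] = []
        self._elements["chars"] = []
        print(f"Removed {removed_words} words and {removed_chars} characters")
        return self  # returning self enables method chaining

    def word_count(self):
        return len(self._elements["words"])


page = MiniPage(words=["Hello", "world"], chars=list("Helloworld"))
assert page.remove_text_layer().word_count() == 0
```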
natural_pdf.Page.save_image(filename, width=None, labels=True, legend_position='right', render_ocr=False, include_highlights=True, resolution=144, **kwargs)

Save the page image to a file, rendering highlights via HighlightingService.

Parameters:

- filename (str, required): Path to save the image to.
- width (Optional[int], default None): Optional width for the output image.
- labels (bool, default True): Whether to include a legend.
- legend_position (str, default 'right'): Position of the legend.
- render_ocr (bool, default False): Whether to render OCR text.
- include_highlights (bool, default True): Whether to render highlights.
- resolution (float, default 144): Resolution in DPI for base image rendering (144 DPI is equivalent to the previous scale=2.0).
- **kwargs: Additional args for pdfplumber's to_image.

Returns:

- Page: Self for method chaining.

Source code in natural_pdf/core/page.py
def save_image(
    self,
    filename: str,
    width: Optional[int] = None,
    labels: bool = True,
    legend_position: str = "right",
    render_ocr: bool = False,
    include_highlights: bool = True,  # Allow saving without highlights
    resolution: float = 144,
    **kwargs,
) -> "Page":
    """
    Save the page image to a file, rendering highlights via HighlightingService.

    Args:
        filename: Path to save the image to.
        width: Optional width for the output image.
        labels: Whether to include a legend.
        legend_position: Position of the legend.
        render_ocr: Whether to render OCR text.
        include_highlights: Whether to render highlights.
        resolution: Resolution in DPI for base image rendering (default: 144 DPI, equivalent to previous scale=2.0).
        **kwargs: Additional args for pdfplumber's to_image.

    Returns:
        Self for method chaining.
    """
    # Use to_image to generate and save the image
    self.to_image(
        path=filename,
        width=width,
        labels=labels,
        legend_position=legend_position,
        render_ocr=render_ocr,
        include_highlights=include_highlights,
        resolution=resolution,
        **kwargs,
    )
    return self
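The 144 DPI default matches the previous `scale=2.0` because pdfplumber's scale factor is simply the DPI divided by the 72-points-per-inch PDF baseline. A quick sanity check:

```python
def dpi_to_scale(dpi):
    """pdfplumber-style scale factor equivalent to a given rendering DPI."""
    return dpi / 72.0

print(dpi_to_scale(144))  # 2.0 — the library's default
print(dpi_to_scale(300))  # 300/72 ≈ 4.17, common for OCR-quality rendering
```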
natural_pdf.Page.save_searchable(output_path, dpi=300, **kwargs)

Saves the PDF page with an OCR text layer, making content searchable.

Requires optional dependencies. Install with: pip install "natural-pdf[ocr-save]"

Note: OCR must have been applied to the page beforehand (e.g., pdf.apply_ocr()).

Parameters:

- output_path (Union[str, Path], required): Path to save the searchable PDF.
- dpi (int, default 300): Resolution for rendering and OCR overlay.
- **kwargs: Additional keyword arguments passed to the exporter.
Source code in natural_pdf/core/page.py
def save_searchable(self, output_path: Union[str, "Path"], dpi: int = 300, **kwargs):
    """
    Saves the PDF page with an OCR text layer, making content searchable.

    Requires optional dependencies. Install with: pip install "natural-pdf[ocr-save]"

    Note: OCR must have been applied to the pages beforehand
          (e.g., pdf.apply_ocr()).

    Args:
        output_path: Path to save the searchable PDF.
        dpi: Resolution for rendering and OCR overlay (default 300).
        **kwargs: Additional keyword arguments passed to the exporter.
    """
    # Import moved here, assuming it's always available now
    from natural_pdf.exporters.searchable_pdf import create_searchable_pdf

    # Convert pathlib.Path to string if necessary
    output_path_str = str(output_path)

    create_searchable_pdf(self, output_path_str, dpi=dpi, **kwargs)
    logger.info(f"Searchable PDF saved to: {output_path_str}")
natural_pdf.Page.show(resolution=144, width=None, labels=True, legend_position='right', render_ocr=False)

Generates and returns an image of the page with persistent highlights rendered.

Parameters:

- resolution (float, default 144): Resolution in DPI for rendering (144 DPI is equivalent to the previous scale=2.0).
- width (Optional[int], default None): Optional width for the output image.
- labels (bool, default True): Whether to include a legend for labels.
- legend_position (str, default 'right'): Position of the legend.
- render_ocr (bool, default False): Whether to render OCR text.

Returns:

- Optional[Image]: PIL Image object of the page with highlights, or None if rendering fails.

Source code in natural_pdf/core/page.py
def show(
    self,
    resolution: float = 144,
    width: Optional[int] = None,
    labels: bool = True,
    legend_position: str = "right",
    render_ocr: bool = False,
) -> Optional[Image.Image]:
    """
    Generates and returns an image of the page with persistent highlights rendered.

    Args:
        resolution: Resolution in DPI for rendering (default: 144 DPI, equivalent to previous scale=2.0).
        width: Optional width for the output image.
        labels: Whether to include a legend for labels.
        legend_position: Position of the legend.
        render_ocr: Whether to render OCR text.

    Returns:
        PIL Image object of the page with highlights, or None if rendering fails.
    """
    return self.to_image(
        resolution=resolution,
        width=width,
        labels=labels,
        legend_position=legend_position,
        render_ocr=render_ocr,
        include_highlights=True,  # Ensure highlights are requested
    )
natural_pdf.Page.show_preview(temporary_highlights, resolution=144, width=None, labels=True, legend_position='right', render_ocr=False)

Generates and returns a non-stateful preview image containing only the provided temporary highlights.

Parameters:

- temporary_highlights (List[Dict], required): List of highlight data dictionaries (as prepared by ElementCollection._prepare_highlight_data).
- resolution (float, default 144): Resolution in DPI for rendering (144 DPI is equivalent to the previous scale=2.0).
- width (Optional[int], default None): Optional width for the output image.
- labels (bool, default True): Whether to include a legend.
- legend_position (str, default 'right'): Position of the legend.
- render_ocr (bool, default False): Whether to render OCR text.

Returns:

- Optional[Image]: PIL Image object of the preview, or None if rendering fails.

Source code in natural_pdf/core/page.py
def show_preview(
    self,
    temporary_highlights: List[Dict],
    resolution: float = 144,
    width: Optional[int] = None,
    labels: bool = True,
    legend_position: str = "right",
    render_ocr: bool = False,
) -> Optional[Image.Image]:
    """
    Generates and returns a non-stateful preview image containing only
    the provided temporary highlights.

    Args:
        temporary_highlights: List of highlight data dictionaries (as prepared by
                              ElementCollection._prepare_highlight_data).
        resolution: Resolution in DPI for rendering (default: 144 DPI, equivalent to previous scale=2.0).
        width: Optional width for the output image.
        labels: Whether to include a legend.
        legend_position: Position of the legend.
        render_ocr: Whether to render OCR text.

    Returns:
        PIL Image object of the preview, or None if rendering fails.
    """
    try:
        # Delegate rendering to the highlighter service's preview method
        img = self._highlighter.render_preview(
            page_index=self.index,
            temporary_highlights=temporary_highlights,
            resolution=resolution,
            labels=labels,
            legend_position=legend_position,
            render_ocr=render_ocr,
        )
    except AttributeError:
        logger.error(f"HighlightingService does not have the required 'render_preview' method.")
        return None
    except Exception as e:
        logger.error(
            f"Error calling highlighter.render_preview for page {self.index}: {e}",
            exc_info=True,
        )
        return None

    # Return the rendered image directly
    return img
natural_pdf.Page.split(divider, **kwargs)

Divides the page into sections based on the provided divider elements.

Source code in natural_pdf/core/page.py
def split(self, divider, **kwargs) -> "ElementCollection[Region]":
    """
    Divides the page into sections based on the provided divider elements.
    """
    sections = self.get_sections(start_elements=divider, **kwargs)
    top = self.region(0, 0, self.width, sections[0].top)
    sections.append(top)

    return sections
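The interval logic behind `split` — one band above the first divider, then one band per divider down to the next — can be sketched with plain y-coordinates (a hypothetical helper, not the library API; note that the library appends the above-first-divider section last, while this sketch lists it first for readability):

```python
def split_by_dividers(page_height, divider_tops):
    """Bands (top, bottom): one above the first divider, then one per divider."""
    tops = sorted(divider_tops)
    if not tops:
        return [(0, page_height)]  # no dividers: the whole page is one section
    bands = []
    if tops[0] > 0:
        bands.append((0, tops[0]))  # region above the first divider
    for i, t in enumerate(tops):
        bottom = tops[i + 1] if i + 1 < len(tops) else page_height
        bands.append((t, bottom))
    return bands

# A US-Letter-height page (792 pt) split at two divider positions:
print(split_by_dividers(792, [100, 400]))  # [(0, 100), (100, 400), (400, 792)]
```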
natural_pdf.Page.to_image(path=None, width=None, labels=True, legend_position='right', render_ocr=False, resolution=None, include_highlights=True, exclusions=None, **kwargs)

Generate a PIL image of the page, using HighlightingService if needed.

Parameters:

- path (Optional[str], default None): Optional path to save the image to.
- width (Optional[int], default None): Optional width for the output image.
- labels (bool, default True): Whether to include a legend for highlights.
- legend_position (str, default 'right'): Position of the legend.
- render_ocr (bool, default False): Whether to render OCR text on highlights.
- resolution (Optional[float], default None): Resolution in DPI for the base page image. If None, uses the global setting or defaults to 144 DPI.
- include_highlights (bool, default True): Whether to render highlights.
- exclusions (Optional[str], default None): How to mask excluded regions. Accepts one of:
  - None: no masking (default)
  - "mask": mask using solid white (back-compat)
  - a CSS/HTML colour string (e.g. "red", "#ff0000", "#ff000080")
  - a tuple of RGB or RGBA values (ints 0-255 or floats 0-1)
  All excluded regions are filled with this colour.
- **kwargs: Additional parameters for pdfplumber.to_image.

Returns:

- Optional[Image]: PIL Image of the page, or None if rendering fails.

Source code in natural_pdf/core/page.py
def to_image(
    self,
    path: Optional[str] = None,
    width: Optional[int] = None,
    labels: bool = True,
    legend_position: str = "right",
    render_ocr: bool = False,
    resolution: Optional[float] = None,
    include_highlights: bool = True,
    exclusions: Optional[str] = None,  # New parameter
    **kwargs,
) -> Optional[Image.Image]:
    """
    Generate a PIL image of the page, using HighlightingService if needed.

    Args:
        path: Optional path to save the image to.
        width: Optional width for the output image.
        labels: Whether to include a legend for highlights.
        legend_position: Position of the legend.
        render_ocr: Whether to render OCR text on highlights.
        resolution: Resolution in DPI for base page image. If None, uses global setting or defaults to 144 DPI.
        include_highlights: Whether to render highlights.
        exclusions: Accepts one of the following:
                    • None  – no masking (default)
                    • "mask" – mask using solid white (back-compat)
                    • CSS/HTML colour string (e.g. "red", "#ff0000", "#ff000080")
                    • Tuple of RGB or RGBA values (ints 0-255 or floats 0-1)
                    All excluded regions are filled with this colour.
        **kwargs: Additional parameters for pdfplumber.to_image.

    Returns:
        PIL Image of the page, or None if rendering fails.
    """
    # Apply global options as defaults, but allow explicit parameters to override
    import natural_pdf

    # Use global options if parameters are not explicitly set
    if width is None:
        width = natural_pdf.options.image.width
    if resolution is None:
        if natural_pdf.options.image.resolution is not None:
            resolution = natural_pdf.options.image.resolution
        else:
            resolution = 144  # Default resolution when none specified
    # 1. Create cache key (excluding path)
    cache_key_parts = [
        width,
        labels,
        legend_position,
        render_ocr,
        resolution,
        include_highlights,
        exclusions,
    ]
    # Convert kwargs to a stable, hashable representation
    sorted_kwargs_list = []
    for k, v in sorted(kwargs.items()):
        if isinstance(v, list):
            try:
                v = tuple(v)  # Convert lists to tuples
            except TypeError:  # pragma: no cover
                # If list contains unhashable items, fall back to repr or skip
                # For simplicity, we'll try to proceed; hashing will fail if v remains unhashable
                logger.warning(
                    f"Cache key generation: List item in kwargs['{k}'] could not be converted to tuple due to unhashable elements."
                )
        sorted_kwargs_list.append((k, v))

    cache_key_parts.append(tuple(sorted_kwargs_list))

    try:
        cache_key = tuple(cache_key_parts)
    except TypeError as e:  # pragma: no cover
        logger.warning(
            f"Page {self.index}: Could not create cache key for to_image due to unhashable item: {e}. Proceeding without cache for this call."
        )
        cache_key = None  # Fallback to not using cache for this call

    image_to_return: Optional[Image.Image] = None

    # 2. Check cache
    if cache_key is not None and cache_key in self._to_image_cache:
        image_to_return = self._to_image_cache[cache_key]
        logger.debug(f"Page {self.index}: Returning cached image for key: {cache_key}")
    else:
        # --- This is the original logic to generate the image ---
        rendered_image_component: Optional[Image.Image] = (
            None  # Renamed from 'image' in original
        )
        render_resolution = resolution
        thread_id = threading.current_thread().name
        logger.debug(
            f"[{thread_id}] Page {self.index}: Attempting to acquire pdf_render_lock for to_image..."
        )
        lock_wait_start = time.monotonic()
        try:
            # Acquire the global PDF rendering lock
            with pdf_render_lock:
                lock_acquired_time = time.monotonic()
                logger.debug(
                    f"[{thread_id}] Page {self.index}: Acquired pdf_render_lock (waited {lock_acquired_time - lock_wait_start:.2f}s). Starting render..."
                )
                if include_highlights:
                    # Delegate rendering to the central service
                    rendered_image_component = self._highlighter.render_page(
                        page_index=self.index,
                        resolution=render_resolution,
                        labels=labels,
                        legend_position=legend_position,
                        render_ocr=render_ocr,
                        **kwargs,
                    )
                else:
                    rendered_image_component = render_plain_page(self, render_resolution)
        except Exception as e:
            logger.error(f"Error rendering page {self.index}: {e}", exc_info=True)
            # rendered_image_component remains None
        finally:
            render_end_time = time.monotonic()
            logger.debug(
                f"[{thread_id}] Page {self.index}: Released pdf_render_lock. Total render time (incl. lock wait): {render_end_time - lock_wait_start:.2f}s"
            )

        if rendered_image_component is None:
            if cache_key is not None:
                self._to_image_cache[cache_key] = None  # Cache the failure
            # Save the image if path is provided (will try to save None, handled by PIL/OS)
            if path:
                try:
                    if os.path.dirname(path):
                        os.makedirs(os.path.dirname(path), exist_ok=True)
                    if rendered_image_component is not None:  # Should be None here
                        rendered_image_component.save(path)  # This line won't be hit if None
                    # else: logger.debug("Not saving None image") # Not strictly needed
                except Exception as save_error:  # pragma: no cover
                    logger.error(f"Failed to save image to {path}: {save_error}")
            return None

        # --- Apply exclusion masking if requested ---
        # This modifies 'rendered_image_component'
        image_after_masking = rendered_image_component  # Start with the rendered image

        # Determine if masking is requested and establish the fill colour
        mask_requested = exclusions is not None and self._exclusions
        mask_color: Union[str, Tuple[int, int, int, int]] = "white"  # default

        if mask_requested:
            if exclusions != "mask":
                # Attempt to parse custom colour input
                try:
                    if isinstance(exclusions, tuple):
                        # Handle RGB/RGBA tuples with ints 0-255 or floats 0-1
                        processed = []
                        all_float = all(isinstance(c, float) for c in exclusions)
                        for i, c in enumerate(exclusions):
                            if isinstance(c, float):
                                val = int(c * 255) if all_float or i == 3 else int(c)
                            else:
                                val = int(c)
                            processed.append(max(0, min(255, val)))
                        if len(processed) == 3:
                            processed.append(255)  # add full alpha
                        mask_color = tuple(processed)  # type: ignore[assignment]
                    elif isinstance(exclusions, str):
                        # Try using the optional 'colour' library for rich parsing
                        try:
                            from colour import Color  # type: ignore

                            color_obj = Color(exclusions)
                            mask_color = (
                                int(color_obj.red * 255),
                                int(color_obj.green * 255),
                                int(color_obj.blue * 255),
                                255,
                            )
                        except Exception:
                            # Fallback: if parsing fails, treat as plain string accepted by PIL
                            mask_color = exclusions  # e.g. "red"
                    else:
                        logger.warning(
                            f"Unsupported exclusions colour spec: {exclusions!r}. Using white."
                        )
                except Exception as colour_parse_err:  # pragma: no cover
                    logger.warning(
                        f"Failed to parse exclusions colour {exclusions!r}: {colour_parse_err}. Using white."
                    )

            try:
                # Ensure image is mutable (RGB or RGBA)
                if image_after_masking.mode not in ("RGB", "RGBA"):
                    image_after_masking = image_after_masking.convert("RGB")

                exclusion_regions = self._get_exclusion_regions(
                    include_callable=True, debug=False
                )
                if exclusion_regions:
                    draw = ImageDraw.Draw(image_after_masking)
                    # Scaling factor for converting PDF pts → image px
                    img_scale = render_resolution / 72.0

                    # Determine fill colour compatible with current mode
                    def _mode_compatible(colour):
                        if isinstance(colour, tuple) and image_after_masking.mode != "RGBA":
                            return colour[:3]  # drop alpha for RGB images
                        return colour

                    fill_colour = _mode_compatible(mask_color)

                    for region in exclusion_regions:
                        img_x0 = region.x0 * img_scale
                        img_top = region.top * img_scale
                        img_x1 = region.x1 * img_scale
                        img_bottom = region.bottom * img_scale

                        img_coords = (
                            max(0, img_x0),
                            max(0, img_top),
                            min(image_after_masking.width, img_x1),
                            min(image_after_masking.height, img_bottom),
                        )
                        if img_coords[0] < img_coords[2] and img_coords[1] < img_coords[3]:
                            draw.rectangle(img_coords, fill=fill_colour)
                        else:  # pragma: no cover
                            logger.warning(
                                f"Skipping invalid exclusion rect for masking: {img_coords}"
                            )
                    del draw  # Release drawing context
            except Exception as mask_error:  # pragma: no cover
                logger.error(
                    f"Error applying exclusion mask to page {self.index}: {mask_error}",
                    exc_info=True,
                )
                # Continue with potentially unmasked or partially masked image

        # --- Resize the final image if width is provided ---
        image_final_content = image_after_masking  # Start with image after masking
        if width is not None and width > 0 and image_final_content.width > 0:
            aspect_ratio = image_final_content.height / image_final_content.width
            height = int(width * aspect_ratio)
            try:
                image_final_content = image_final_content.resize(
                    (width, height), Image.Resampling.LANCZOS
                )
            except Exception as resize_error:  # pragma: no cover
                logger.warning(f"Could not resize image: {resize_error}")
                # image_final_content remains the un-resized version if resize fails

        # Store in cache
        if cache_key is not None:
            self._to_image_cache[cache_key] = image_final_content
            logger.debug(f"Page {self.index}: Cached image for key: {cache_key}")
        image_to_return = image_final_content
    # --- End of cache miss block ---

    # Save the image (either from cache or newly generated) if path is provided
    if path and image_to_return:
        try:
            # Ensure directory exists
            if os.path.dirname(path):  # Only call makedirs if there's a directory part
                os.makedirs(os.path.dirname(path), exist_ok=True)
            image_to_return.save(path)
            logger.debug(f"Saved page image to: {path}")
        except Exception as save_error:  # pragma: no cover
            logger.error(f"Failed to save image to {path}: {save_error}")

    return image_to_return
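Exclusion rectangles above are converted from PDF points to image pixels with `resolution / 72.0` and clamped to the image bounds before drawing. A standalone sketch of that step (a hypothetical helper mirroring the masking loop):

```python
def mask_rect_px(region, img_w, img_h, dpi):
    """Scale a PDF-space rect (x0, top, x1, bottom) to image pixels and
    clamp it to the image bounds, as the exclusion-masking loop does."""
    s = dpi / 72.0  # PDF points -> pixels
    x0, top, x1, bottom = region
    rect = (max(0, x0 * s), max(0, top * s),
            min(img_w, x1 * s), min(img_h, bottom * s))
    # Only rects with positive area after clamping are drawable
    valid = rect[0] < rect[2] and rect[1] < rect[3]
    return rect, valid

# A header band on a US-Letter page rendered at 144 DPI (1224x1584 px image):
print(mask_rect_px((10, 20, 100, 50), 1224, 1584, 144))
# ((20.0, 40.0, 200.0, 100.0), True)
```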
natural_pdf.Page.until(selector, include_endpoint=True, **kwargs)

Select content from the top of the page until matching selector.

Parameters:

- selector (str, required): CSS-like selector string.
- include_endpoint (bool, default True): Whether to include the endpoint element in the region.
- **kwargs: Additional selection parameters.

Returns:

- Any: Region object representing the selected content.

Examples:

>>> page.until('text:contains("Conclusion")')  # Select from top to conclusion
>>> page.until('line[width>=2]', include_endpoint=False)  # Select up to thick line
Source code in natural_pdf/core/page.py
def until(self, selector: str, include_endpoint: bool = True, **kwargs) -> Any:
    """
    Select content from the top of the page until matching selector.

    Args:
        selector: CSS-like selector string
        include_endpoint: Whether to include the endpoint element in the region
        **kwargs: Additional selection parameters

    Returns:
        Region object representing the selected content

    Examples:
        >>> page.until('text:contains("Conclusion")')  # Select from top to conclusion
        >>> page.until('line[width>=2]', include_endpoint=False)  # Select up to thick line
    """
    # Find the target element
    target = self.find(selector, **kwargs)
    if not target:
        # If target not found, return a default region (full page)
        from natural_pdf.elements.region import Region

        return Region(self, (0, 0, self.width, self.height))

    # Create a region from the top of the page to the target
    from natural_pdf.elements.region import Region

    # Ensure target has positional attributes before using them
    target_top = getattr(target, "top", 0)
    target_bottom = getattr(target, "bottom", self.height)

    if include_endpoint:
        # Include the target element
        region = Region(self, (0, 0, self.width, target_bottom))
    else:
        # Up to the target element
        region = Region(self, (0, 0, self.width, target_top))

    region.end_element = target
    return region
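`include_endpoint` simply decides whether the region's bottom edge is the target's bottom (include it) or its top (stop just above it). Sketched with plain tuples (a hypothetical helper, not the library API):

```python
def region_until(page_w, target_top, target_bottom, include_endpoint=True):
    """Full-width bounding box from the top of the page down to the target."""
    bottom = target_bottom if include_endpoint else target_top
    return (0, 0, page_w, bottom)

print(region_until(612, 500, 520))                          # (0, 0, 612, 520)
print(region_until(612, 500, 520, include_endpoint=False))  # (0, 0, 612, 500)
```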
natural_pdf.Page.viewer()

Creates and returns an interactive ipywidget for exploring elements on this page.

Uses InteractiveViewerWidget.from_page() to create the viewer.

Returns:

- Optional[InteractiveViewerWidget]: An InteractiveViewerWidget instance ready for display in Jupyter, or None if ipywidgets is not installed or widget creation fails.

Raises:

- ValueError: If image rendering or data preparation fails within from_page. (An ImportError could be raised instead of returning None when ipywidgets is missing, but the current implementation returns None.)

Source code in natural_pdf/core/page.py
def viewer(
    self,
    # elements_to_render: Optional[List['Element']] = None, # No longer needed, from_page handles it
    # include_source_types: List[str] = ['word', 'line', 'rect', 'region'] # No longer needed
) -> Optional["InteractiveViewerWidget"]:  # Return type hint updated
    """
    Creates and returns an interactive ipywidget for exploring elements on this page.

    Uses InteractiveViewerWidget.from_page() to create the viewer.

    Returns:
        A InteractiveViewerWidget instance ready for display in Jupyter,
        or None if ipywidgets is not installed or widget creation fails.

    Raises:
        # Optional: Could raise ImportError instead of returning None
        # ImportError: If required dependencies (ipywidgets) are missing.
        ValueError: If image rendering or data preparation fails within from_page.
    """
    # Check for availability using the imported flag and class variable
    if not _IPYWIDGETS_AVAILABLE or InteractiveViewerWidget is None:
        logger.error(
            "Interactive viewer requires 'ipywidgets'. "
            'Please install with: pip install "ipywidgets>=7.0.0,<10.0.0"'
        )
        # raise ImportError("ipywidgets not found.") # Option 1: Raise error
        return None  # Option 2: Return None gracefully

    # If we reach here, InteractiveViewerWidget should be the actual class
    try:
        # Pass self (the Page object) to the factory method
        return InteractiveViewerWidget.from_page(self)
    except Exception as e:
        # Catch potential errors during widget creation (e.g., image rendering)
        logger.error(
            f"Error creating viewer widget from page {self.number}: {e}", exc_info=True
        )
        # raise # Option 1: Re-raise error (might include ValueError from from_page)
        return None  # Option 2: Return None on creation error
natural_pdf.Region

Bases: DirectionalMixin, ClassificationMixin, ExtractionMixin, ShapeDetectionMixin, DescribeMixin

Represents a rectangular region on a page.

Regions are fundamental building blocks in natural-pdf that define rectangular areas of a page for analysis, extraction, and navigation. They can be created manually or automatically through spatial navigation methods like .below(), .above(), .left(), and .right() from elements or other regions.

Regions integrate multiple analysis capabilities through mixins and provide:

- Element filtering and collection within the region boundary
- OCR processing for the region area
- Table detection and extraction
- AI-powered classification and structured data extraction
- Visual rendering and debugging capabilities
- Text extraction with spatial awareness

The Region class supports both rectangular and polygonal boundaries, making it suitable for complex document layouts and irregular shapes detected by layout analysis algorithms.

Attributes:

- page (Page): Reference to the parent Page object.
- bbox (Tuple[float, float, float, float]): Bounding box tuple (x0, top, x1, bottom) in PDF coordinates.
- x0 (float): Left x-coordinate.
- top (float): Top y-coordinate (minimum y).
- x1 (float): Right x-coordinate.
- bottom (float): Bottom y-coordinate (maximum y).
- width (float): Region width (x1 - x0).
- height (float): Region height (bottom - top).
- polygon (List[Tuple[float, float]]): List of coordinate points for non-rectangular regions.
- label: Optional descriptive label for the region.
- metadata (Dict[str, Any]): Dictionary for storing analysis results and custom data.

Example

Creating regions:

```python
pdf = npdf.PDF("document.pdf")
page = pdf.pages[0]

# Manual region creation
header_region = page.region(0, 0, page.width, 100)

# Spatial navigation from elements
summary_text = page.find('text:contains("Summary")')
content_region = summary_text.below(until='text[size>12]:bold')

# Extract content from region
tables = content_region.extract_table()
text = content_region.get_text()
```

Advanced usage:

```python
# OCR processing
region.apply_ocr(engine='easyocr', resolution=300)

# AI-powered extraction
data = region.extract_structured_data(MySchema)

# Visual debugging
region.show(highlights=['tables', 'text'])
```
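For polygonal regions, containment checks such as `is_point_inside` fall back to a simple bounding-box test for rectangles and use a ray-casting test for polygons. The following is a minimal standalone sketch of that ray-casting algorithm (independent of natural-pdf, for illustration only): a ray is cast rightward from the query point, and the point is inside if it crosses an odd number of polygon edges.

```python
from typing import List, Tuple


def point_in_polygon(x: float, y: float, polygon: List[Tuple[float, float]]) -> bool:
    """Ray-casting test: count how many polygon edges a rightward ray from (x, y) crosses."""
    inside = False
    j = len(polygon) - 1  # previous vertex index, wrapping around
    for i in range(len(polygon)):
        (xi, yi), (xj, yj) = polygon[i], polygon[j]
        # Edge straddles the horizontal line through y, and the crossing lies to the right of x
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside


square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(point_in_polygon(5, 5, square))   # point inside the square
print(point_in_polygon(15, 5, square))  # point outside the square
```

The parity flip on each crossing handles concave polygons as well, which is why the same test works for irregular shapes produced by layout analysis.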

Source code in natural_pdf/elements/region.py
class Region(
    DirectionalMixin, ClassificationMixin, ExtractionMixin, ShapeDetectionMixin, DescribeMixin
):
    """Represents a rectangular region on a page.

    Regions are fundamental building blocks in natural-pdf that define rectangular
    areas of a page for analysis, extraction, and navigation. They can be created
    manually or automatically through spatial navigation methods like .below(), .above(),
    .left(), and .right() from elements or other regions.

    Regions integrate multiple analysis capabilities through mixins and provide:
    - Element filtering and collection within the region boundary
    - OCR processing for the region area
    - Table detection and extraction
    - AI-powered classification and structured data extraction
    - Visual rendering and debugging capabilities
    - Text extraction with spatial awareness

    The Region class supports both rectangular and polygonal boundaries, making it
    suitable for complex document layouts and irregular shapes detected by layout
    analysis algorithms.

    Attributes:
        page: Reference to the parent Page object.
        bbox: Bounding box tuple (x0, top, x1, bottom) in PDF coordinates.
        x0: Left x-coordinate.
        top: Top y-coordinate (minimum y).
        x1: Right x-coordinate.
        bottom: Bottom y-coordinate (maximum y).
        width: Region width (x1 - x0).
        height: Region height (bottom - top).
        polygon: List of coordinate points for non-rectangular regions.
        label: Optional descriptive label for the region.
        metadata: Dictionary for storing analysis results and custom data.

    Example:
        Creating regions:
        ```python
        pdf = npdf.PDF("document.pdf")
        page = pdf.pages[0]

        # Manual region creation
        header_region = page.region(0, 0, page.width, 100)

        # Spatial navigation from elements
        summary_text = page.find('text:contains("Summary")')
        content_region = summary_text.below(until='text[size>12]:bold')

        # Extract content from region
        tables = content_region.extract_table()
        text = content_region.get_text()
        ```

        Advanced usage:
        ```python
        # OCR processing
        region.apply_ocr(engine='easyocr', resolution=300)

        # AI-powered extraction
        data = region.extract_structured_data(MySchema)

        # Visual debugging
        region.show(highlights=['tables', 'text'])
        ```
    """

    def __init__(
        self,
        page: "Page",
        bbox: Tuple[float, float, float, float],
        polygon: List[Tuple[float, float]] = None,
        parent=None,
        label: Optional[str] = None,
    ):
        """Initialize a region.

        Creates a Region object that represents a rectangular or polygonal area on a page.
        Regions are used for spatial navigation, content extraction, and analysis operations.

        Args:
            page: Parent Page object that contains this region and provides access
                to document elements and analysis capabilities.
            bbox: Bounding box coordinates as (x0, top, x1, bottom) tuple in PDF
                points, with the origin at the top-left of the page (pdfplumber's
                convention, where ``top`` is the distance from the top edge).
            polygon: Optional list of coordinate points [(x1,y1), (x2,y2), ...] for
                non-rectangular regions. If provided, the region will use polygon-based
                intersection calculations instead of simple rectangle overlap.
            parent: Optional parent region for hierarchical document structure.
                Useful for maintaining tree-like relationships between regions.
            label: Optional descriptive label for the region, useful for debugging
                and identification in complex workflows.

        Example:
            ```python
            pdf = npdf.PDF("document.pdf")
            page = pdf.pages[0]

            # Rectangular region
            header = Region(page, (0, 0, page.width, 100), label="header")

            # Polygonal region (from layout detection)
            table_polygon = [(50, 100), (300, 100), (300, 400), (50, 400)]
            table_region = Region(page, (50, 100, 300, 400),
                                polygon=table_polygon, label="table")
            ```

        Note:
            Regions are typically created through page methods like page.region() or
            spatial navigation methods like element.below(). Direct instantiation is
            used mainly for advanced workflows or layout analysis integration.
        """
        self._page = page
        self._bbox = bbox
        self._polygon = polygon

        self.metadata: Dict[str, Any] = {}
        # Analysis results live under self.metadata['analysis'] via property

        # Standard attributes for all elements
        self.object_type = "region"  # For selector compatibility

        # Layout detection attributes
        self.region_type = None
        self.normalized_type = None
        self.confidence = None
        self.model = None

        # Region management attributes
        self.name = None
        self.label = label
        self.source = None  # Will be set by creation methods

        # Hierarchy support for nested document structure
        self.parent_region = parent
        self.child_regions = []
        self.text_content = None  # Direct text content (e.g., from Docling)
        self.associated_text_elements = []  # Native text elements that overlap with this region

    def _direction(
        self,
        direction: str,
        size: Optional[float] = None,
        cross_size: str = "full",
        include_source: bool = False,
        until: Optional[str] = None,
        include_endpoint: bool = True,
        **kwargs,
    ) -> "Region":
        """
        Region-specific wrapper around :py:meth:`DirectionalMixin._direction`.

        It performs any pre-processing required by *Region* (none currently),
        delegates the core geometry work to the mix-in implementation via
        ``super()``, then attaches region-level metadata before returning the
        new :class:`Region` instance.
        """

        # Delegate to the shared implementation on DirectionalMixin
        region = super()._direction(
            direction=direction,
            size=size,
            cross_size=cross_size,
            include_source=include_source,
            until=until,
            include_endpoint=include_endpoint,
            **kwargs,
        )

        # Post-process: make sure callers can trace lineage and flags
        region.source_element = self
        region.includes_source = include_source

        return region

    def above(
        self,
        height: Optional[float] = None,
        width: str = "full",
        include_source: bool = False,
        until: Optional[str] = None,
        include_endpoint: bool = True,
        **kwargs,
    ) -> "Region":
        """
        Select region above this region.

        Args:
            height: Height of the region above, in points
            width: Width mode - "full" for full page width or "element" for element width
            include_source: Whether to include this region in the result (default: False)
            until: Optional selector string to specify an upper boundary element
            include_endpoint: Whether to include the boundary element in the region (default: True)
            **kwargs: Additional parameters

        Returns:
            Region object representing the area above
        """
        return self._direction(
            direction="above",
            size=height,
            cross_size=width,
            include_source=include_source,
            until=until,
            include_endpoint=include_endpoint,
            **kwargs,
        )

    def below(
        self,
        height: Optional[float] = None,
        width: str = "full",
        include_source: bool = False,
        until: Optional[str] = None,
        include_endpoint: bool = True,
        **kwargs,
    ) -> "Region":
        """
        Select region below this region.

        Args:
            height: Height of the region below, in points
            width: Width mode - "full" for full page width or "element" for element width
            include_source: Whether to include this region in the result (default: False)
            until: Optional selector string to specify a lower boundary element
            include_endpoint: Whether to include the boundary element in the region (default: True)
            **kwargs: Additional parameters

        Returns:
            Region object representing the area below
        """
        return self._direction(
            direction="below",
            size=height,
            cross_size=width,
            include_source=include_source,
            until=until,
            include_endpoint=include_endpoint,
            **kwargs,
        )

    def left(
        self,
        width: Optional[float] = None,
        height: str = "full",
        include_source: bool = False,
        until: Optional[str] = None,
        include_endpoint: bool = True,
        **kwargs,
    ) -> "Region":
        """
        Select region to the left of this region.

        Args:
            width: Width of the region to the left, in points
            height: Height mode - "full" for full page height or "element" for element height
            include_source: Whether to include this region in the result (default: False)
            until: Optional selector string to specify a left boundary element
            include_endpoint: Whether to include the boundary element in the region (default: True)
            **kwargs: Additional parameters

        Returns:
            Region object representing the area to the left
        """
        return self._direction(
            direction="left",
            size=width,
            cross_size=height,
            include_source=include_source,
            until=until,
            include_endpoint=include_endpoint,
            **kwargs,
        )

    def right(
        self,
        width: Optional[float] = None,
        height: str = "full",
        include_source: bool = False,
        until: Optional[str] = None,
        include_endpoint: bool = True,
        **kwargs,
    ) -> "Region":
        """
        Select region to the right of this region.

        Args:
            width: Width of the region to the right, in points
            height: Height mode - "full" for full page height or "element" for element height
            include_source: Whether to include this region in the result (default: False)
            until: Optional selector string to specify a right boundary element
            include_endpoint: Whether to include the boundary element in the region (default: True)
            **kwargs: Additional parameters

        Returns:
            Region object representing the area to the right
        """
        return self._direction(
            direction="right",
            size=width,
            cross_size=height,
            include_source=include_source,
            until=until,
            include_endpoint=include_endpoint,
            **kwargs,
        )

    @property
    def type(self) -> str:
        """Element type."""
        # Return the specific type if detected (e.g., from layout analysis)
        # or 'region' as a default.
        return self.region_type or "region"  # Prioritize specific region_type if set

    @property
    def page(self) -> "Page":
        """Get the parent page."""
        return self._page

    @property
    def bbox(self) -> Tuple[float, float, float, float]:
        """Get the bounding box as (x0, top, x1, bottom)."""
        return self._bbox

    @property
    def x0(self) -> float:
        """Get the left coordinate."""
        return self._bbox[0]

    @property
    def top(self) -> float:
        """Get the top coordinate."""
        return self._bbox[1]

    @property
    def x1(self) -> float:
        """Get the right coordinate."""
        return self._bbox[2]

    @property
    def bottom(self) -> float:
        """Get the bottom coordinate."""
        return self._bbox[3]

    @property
    def width(self) -> float:
        """Get the width of the region."""
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        """Get the height of the region."""
        return self.bottom - self.top

    @property
    def has_polygon(self) -> bool:
        """Check if this region has polygon coordinates."""
        return self._polygon is not None and len(self._polygon) >= 3

    @property
    def polygon(self) -> List[Tuple[float, float]]:
        """Get polygon coordinates if available, otherwise return rectangle corners."""
        if self._polygon:
            return self._polygon
        else:
            # Create rectangle corners from bbox as fallback
            return [
                (self.x0, self.top),  # top-left
                (self.x1, self.top),  # top-right
                (self.x1, self.bottom),  # bottom-right
                (self.x0, self.bottom),  # bottom-left
            ]

    def _is_point_in_polygon(self, x: float, y: float) -> bool:
        """
        Check if a point is inside the polygon using ray casting algorithm.

        Args:
            x: X coordinate of the point
            y: Y coordinate of the point

        Returns:
            bool: True if the point is inside the polygon
        """
        if not self.has_polygon:
            return (self.x0 <= x <= self.x1) and (self.top <= y <= self.bottom)

        # Ray casting algorithm
        inside = False
        j = len(self.polygon) - 1

        for i in range(len(self.polygon)):
            if ((self.polygon[i][1] > y) != (self.polygon[j][1] > y)) and (
                x
                < (self.polygon[j][0] - self.polygon[i][0])
                * (y - self.polygon[i][1])
                / (self.polygon[j][1] - self.polygon[i][1])
                + self.polygon[i][0]
            ):
                inside = not inside
            j = i

        return inside

    def is_point_inside(self, x: float, y: float) -> bool:
        """
        Check if a point is inside this region, using the ray casting algorithm
        for polygon regions and a simple bbox check otherwise.

        Args:
            x: X coordinate of the point
            y: Y coordinate of the point

        Returns:
            bool: True if the point is inside the region
        """
        # Delegate to the shared implementation rather than duplicating the
        # ray casting loop.
        return self._is_point_in_polygon(x, y)
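The ray casting test above counts how many polygon edges a horizontal ray from the point crosses; an odd count means the point is inside. A standalone sketch of the same even-odd rule (the function name and sample polygon here are illustrative, not part of natural-pdf):

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting: toggle `inside` each time a horizontal
    ray from (x, y) crosses an edge of the polygon."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Edge straddles the ray's y, and the crossing lies to the right of x
        if ((yi > y) != (yj > y)) and (x < (xj - xi) * (y - yi) / (yj - yi) + xi):
            inside = not inside
        j = i
    return inside

square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(point_in_polygon(5, 5, square))   # True  (inside)
print(point_in_polygon(15, 5, square))  # False (outside)
```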

    def is_element_center_inside(self, element: "Element") -> bool:
        """
        Check if the center point of an element's bounding box is inside this region.

        Args:
            element: Element to check

        Returns:
            True if the element's center point is inside the region, False otherwise.
        """
        # Check if element is on the same page
        if not hasattr(element, "page") or element.page != self._page:
            return False

        # Ensure element has necessary attributes
        if not all(hasattr(element, attr) for attr in ["x0", "x1", "top", "bottom"]):
            logger.warning(
                f"Element {element} lacks bounding box attributes. Cannot check center point."
            )
            return False  # Cannot determine position

        # Calculate center point
        center_x = (element.x0 + element.x1) / 2
        center_y = (element.top + element.bottom) / 2

        # Use the existing is_point_inside check
        return self.is_point_inside(center_x, center_y)

    def _is_element_in_region(self, element: "Element", use_boundary_tolerance=True) -> bool:
        """
        Check if an element intersects or is contained within this region.

        Args:
            element: Element to check
            use_boundary_tolerance: Whether to apply a small tolerance for boundary elements

        Returns:
            True if the element is in the region, False otherwise
        """
        # Check if element is on the same page
        if not hasattr(element, "page") or element.page != self._page:
            return False

        return self.is_element_center_inside(element)

    def contains(self, element: "Element") -> bool:
        """
        Check if this region completely contains an element.

        Args:
            element: Element to check

        Returns:
            True if the element is completely contained within the region, False otherwise
        """
        # Check if element is on the same page
        if not hasattr(element, "page") or element.page != self._page:
            return False

        # Ensure element has necessary attributes
        if not all(hasattr(element, attr) for attr in ["x0", "x1", "top", "bottom"]):
            return False  # Cannot determine position

        # For rectangular regions, check if element's bbox is fully inside region's bbox
        if not self.has_polygon:
            return (
                self.x0 <= element.x0
                and element.x1 <= self.x1
                and self.top <= element.top
                and element.bottom <= self.bottom
            )

        # For polygon regions, check if all corners of the element are inside the polygon
        element_corners = [
            (element.x0, element.top),  # top-left
            (element.x1, element.top),  # top-right
            (element.x1, element.bottom),  # bottom-right
            (element.x0, element.bottom),  # bottom-left
        ]

        return all(self.is_point_inside(x, y) for x, y in element_corners)

    def intersects(self, element: "Element") -> bool:
        """
        Check if this region intersects with an element (any overlap).

        Args:
            element: Element to check

        Returns:
            True if the element overlaps with the region at all, False otherwise
        """
        # Check if element is on the same page
        if not hasattr(element, "page") or element.page != self._page:
            return False

        # Ensure element has necessary attributes
        if not all(hasattr(element, attr) for attr in ["x0", "x1", "top", "bottom"]):
            return False  # Cannot determine position

        # For rectangular regions, check for bbox overlap
        if not self.has_polygon:
            return (
                self.x0 < element.x1
                and self.x1 > element.x0
                and self.top < element.bottom
                and self.bottom > element.top
            )

        # For polygon regions, check if any corner of the element is inside the polygon
        element_corners = [
            (element.x0, element.top),  # top-left
            (element.x1, element.top),  # top-right
            (element.x1, element.bottom),  # bottom-right
            (element.x0, element.bottom),  # bottom-left
        ]

        # First check if any element corner is inside the polygon
        if any(self.is_point_inside(x, y) for x, y in element_corners):
            return True

        # Also check if any polygon corner is inside the element's rectangle
        for x, y in self.polygon:
            if element.x0 <= x <= element.x1 and element.top <= y <= element.bottom:
                return True

        # Also check if any polygon edge intersects with any rectangle edge
        # This is a simplification - for complex cases, we'd need a full polygon-rectangle
        # intersection algorithm

        # For now, return True if bounding boxes overlap (approximation for polygon-rectangle case)
        return (
            self.x0 < element.x1
            and self.x1 > element.x0
            and self.top < element.bottom
            and self.bottom > element.top
        )
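`contains` requires the element's bbox to sit fully inside the region, while `intersects` only requires some overlap. A minimal sketch of the two rectangular checks, using plain `(x0, top, x1, bottom)` tuples (the helper names are illustrative):

```python
def bbox_contains(outer, inner):
    """True if inner bbox (x0, top, x1, bottom) lies fully inside outer."""
    return (outer[0] <= inner[0] and inner[2] <= outer[2]
            and outer[1] <= inner[1] and inner[3] <= outer[3])

def bbox_intersects(a, b):
    """True if the two bboxes overlap at all (edge-touching excluded)."""
    return a[0] < b[2] and a[2] > b[0] and a[1] < b[3] and a[3] > b[1]

region = (100, 100, 400, 300)
word = (150, 120, 200, 140)    # fully inside the region
table = (350, 250, 500, 400)   # straddles the right/bottom edge
print(bbox_contains(region, word))     # True
print(bbox_intersects(region, table))  # True
print(bbox_contains(region, table))    # False
```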

    def highlight(
        self,
        label: Optional[str] = None,
        color: Optional[Union[Tuple, str]] = None,
        use_color_cycling: bool = False,
        include_attrs: Optional[List[str]] = None,
        existing: str = "append",
    ) -> "Region":
        """
        Highlight this region on the page.

        Args:
            label: Optional label for the highlight
            color: Color tuple/string for the highlight, or None to use automatic color
            use_color_cycling: Force color cycling even with no label (default: False)
            include_attrs: List of attribute names to display on the highlight (e.g., ['confidence', 'type'])
            existing: How to handle existing highlights ('append' or 'replace').

        Returns:
            Self for method chaining
        """
        # Access the highlighter service correctly
        highlighter = self.page._highlighter

        # Prepare common arguments
        highlight_args = {
            "page_index": self.page.index,
            "color": color,
            "label": label,
            "use_color_cycling": use_color_cycling,
            "element": self,  # Pass the region itself so attributes can be accessed
            "include_attrs": include_attrs,
            "existing": existing,
        }

        # Call the appropriate service method
        if self.has_polygon:
            highlight_args["polygon"] = self.polygon
            highlighter.add_polygon(**highlight_args)
        else:
            highlight_args["bbox"] = self.bbox
            highlighter.add(**highlight_args)

        return self

    def to_image(
        self,
        resolution: Optional[float] = None,
        crop: bool = False,
        include_highlights: bool = True,
        **kwargs,
    ) -> Optional["Image.Image"]:
        """
        Generate an image of just this region.

        Args:
            resolution: Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI)
            crop: If True, only crop the region without highlighting its boundaries
            include_highlights: Whether to include existing highlights (default: True)
            **kwargs: Additional parameters for page.to_image()

        Returns:
            PIL Image of just this region, or None if the scaled crop has no area
        """
        # Apply global options as defaults
        import natural_pdf

        if resolution is None:
            if natural_pdf.options.image.resolution is not None:
                resolution = natural_pdf.options.image.resolution
            else:
                resolution = 144  # Default resolution when none specified

        # Handle the case where user wants the cropped region to have a specific width
        page_kwargs = kwargs.copy()
        effective_resolution = resolution  # Start with the provided resolution

        if crop and "width" in kwargs:
            target_width = kwargs["width"]
            # Calculate what resolution is needed to make the region crop have target_width
            region_width_points = self.width  # Region width in PDF points

            if region_width_points > 0:
                # Calculate scale needed: target_width / region_width_points
                required_scale = target_width / region_width_points
                # Convert scale to resolution: scale * 72 DPI
                effective_resolution = required_scale * 72.0
                page_kwargs.pop("width")  # Remove width parameter to avoid conflicts
                logger.debug(
                    f"Region {self.bbox}: Calculated required resolution {effective_resolution:.1f} DPI for region crop width {target_width}"
                )
            else:
                logger.warning(
                    f"Region {self.bbox}: Invalid region width {region_width_points}, using original resolution"
                )

        # First get the full page image with highlights if requested
        page_image = self._page.to_image(
            resolution=effective_resolution,
            include_highlights=include_highlights,
            **page_kwargs,
        )

        # Calculate the actual scale factor used by the page image
        if page_image.width > 0 and self._page.width > 0:
            scale_factor = page_image.width / self._page.width
        else:
            # Fallback to resolution-based calculation if dimensions are invalid.
            # Use the effective resolution, since that is what the page was rendered at.
            scale_factor = effective_resolution / 72.0

        # Apply scaling to the coordinates
        x0 = int(self.x0 * scale_factor)
        top = int(self.top * scale_factor)
        x1 = int(self.x1 * scale_factor)
        bottom = int(self.bottom * scale_factor)

        # Ensure coords are valid for cropping (left < right, top < bottom)
        if x0 >= x1:
            logger.warning(
                f"Region {self.bbox} resulted in non-positive width after scaling ({x0} >= {x1}). Cannot create image."
            )
            return None
        if top >= bottom:
            logger.warning(
                f"Region {self.bbox} resulted in non-positive height after scaling ({top} >= {bottom}). Cannot create image."
            )
            return None

        # Crop the image to just this region
        region_image = page_image.crop((x0, top, x1, bottom))

        # If not crop, add a border to highlight the region boundaries
        if not crop:
            from PIL import ImageDraw

            # Create a 1px border around the region
            draw = ImageDraw.Draw(region_image)
            draw.rectangle(
                (0, 0, region_image.width - 1, region_image.height - 1),
                outline=(255, 0, 0),
                width=1,
            )

        return region_image
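The coordinate math in `to_image` rests on PDF user space being 72 points per inch: rendering at resolution R DPI scales every coordinate by R/72, and a desired pixel width for a cropped region implies a required DPI. A sketch of both conversions (function names are illustrative, not part of the library):

```python
def crop_box_pixels(bbox, resolution=144):
    """Convert a PDF-point bbox to pixel coordinates at the given DPI.
    PDF user space is 72 points per inch, so scale = resolution / 72."""
    scale = resolution / 72.0
    return tuple(int(v * scale) for v in bbox)

def resolution_for_width(region_width_points, target_width_px):
    """DPI needed so the cropped region comes out target_width_px wide."""
    return (target_width_px / region_width_points) * 72.0

print(crop_box_pixels((100, 100, 400, 300), resolution=144))  # (200, 200, 800, 600)
print(resolution_for_width(300, 600))                         # 144.0
```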

    def show(
        self,
        resolution: Optional[float] = None,
        labels: bool = True,
        legend_position: str = "right",
        # Add a default color for standalone show
        color: Optional[Union[Tuple, str]] = "blue",
        label: Optional[str] = None,
        width: Optional[int] = None,  # Add width parameter
        crop: bool = False,  # NEW: Crop output to region bounds before legend
    ) -> "Image.Image":
        """
        Show the page with just this region highlighted temporarily.

        Args:
            resolution: Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI)
            labels: Whether to include a legend for labels
            legend_position: Position of the legend
            color: Color to highlight this region (default: blue)
            label: Optional label for this region in the legend
            width: Optional width for the output image in pixels
            crop: If True, crop the rendered image to this region's bounding box
                (with a small margin handled inside HighlightingService) before
                legends/overlays are added.

        Returns:
            PIL Image of the page with only this region highlighted
        """
        # Apply global options as defaults
        import natural_pdf

        if resolution is None:
            if natural_pdf.options.image.resolution is not None:
                resolution = natural_pdf.options.image.resolution
            else:
                resolution = 144  # Default resolution when none specified

        if not self._page:
            raise ValueError("Region must be associated with a page to show.")

        # Use the highlighting service via the page's property
        service = self._page._highlighter

        # Determine the label if not provided
        display_label = (
            label if label is not None else f"Region ({self.type})" if self.type else "Region"
        )

        # Prepare temporary highlight data for just this region
        temp_highlight_data = {
            "page_index": self._page.index,
            "bbox": self.bbox,
            "polygon": self.polygon if self.has_polygon else None,
            "color": color,  # Use provided or default color
            "label": display_label,
            "use_color_cycling": False,  # Explicitly false for single preview
        }

        # Determine crop bbox if requested
        crop_bbox = self.bbox if crop else None

        # Use render_preview to show only this highlight
        return service.render_preview(
            page_index=self._page.index,
            temporary_highlights=[temp_highlight_data],
            resolution=resolution,
            width=width,  # Pass the width parameter
            labels=labels,
            legend_position=legend_position,
            crop_bbox=crop_bbox,
        )

    def save(
        self,
        filename: str,
        resolution: Optional[float] = None,
        labels: bool = True,
        legend_position: str = "right",
    ) -> "Region":
        """
        Save the page with this region highlighted to an image file.

        Args:
            filename: Path to save the image to
            resolution: Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI)
            labels: Whether to include a legend for labels
            legend_position: Position of the legend

        Returns:
            Self for method chaining
        """
        # Apply global options as defaults
        import natural_pdf

        if resolution is None:
            if natural_pdf.options.image.resolution is not None:
                resolution = natural_pdf.options.image.resolution
            else:
                resolution = 144  # Default resolution when none specified

        # Highlight this region if not already highlighted
        self.highlight()

        # Save the highlighted image
        self._page.save_image(
            filename, resolution=resolution, labels=labels, legend_position=legend_position
        )
        return self

    def save_image(
        self,
        filename: str,
        resolution: Optional[float] = None,
        crop: bool = False,
        include_highlights: bool = True,
        **kwargs,
    ) -> "Region":
        """
        Save an image of just this region to a file.

        Args:
            filename: Path to save the image to
            resolution: Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI)
            crop: If True, only crop the region without highlighting its boundaries
            include_highlights: Whether to include existing highlights (default: True)
            **kwargs: Additional parameters for page.to_image()

        Returns:
            Self for method chaining
        """
        # Apply global options as defaults
        import natural_pdf

        if resolution is None:
            if natural_pdf.options.image.resolution is not None:
                resolution = natural_pdf.options.image.resolution
            else:
                resolution = 144  # Default resolution when none specified

        # Get the region image
        image = self.to_image(
            resolution=resolution,
            crop=crop,
            include_highlights=include_highlights,
            **kwargs,
        )

        # Save the image (to_image returns None if the region has no area)
        if image is None:
            raise ValueError(f"Cannot render region {self.bbox} to an image.")
        image.save(filename)
        return self

    def trim(
        self,
        padding: int = 1,
        threshold: float = 0.95,
        resolution: Optional[float] = None,
        pre_shrink: float = 0.5,
    ) -> "Region":
        """
        Trim visual whitespace from the edges of this region.

        Similar to Python's string .strip() method, but for visual whitespace in the region image.
        Uses pixel analysis to detect rows/columns that are predominantly whitespace.

        Args:
            padding: Number of pixels to keep as padding after trimming (default: 1)
            threshold: Threshold for considering a row/column as whitespace (0.0-1.0, default: 0.95)
                      Higher values mean more strict whitespace detection.
                      E.g., 0.95 means if 95% of pixels in a row/column are white, consider it whitespace.
            resolution: Resolution for image rendering in DPI (default: uses global options, fallback to 144 DPI)
            pre_shrink: Amount to shrink region before trimming, then expand back after (default: 0.5)
                       This helps avoid detecting box borders/slivers as content.

        Returns:
            New Region with visual whitespace trimmed from all edges.

        Examples:
        ```python
        # Basic trimming with 1 pixel padding and 0.5px pre-shrink
        trimmed = region.trim()

        # More aggressive trimming with no padding and no pre-shrink
        tight = region.trim(padding=0, threshold=0.9, pre_shrink=0)

        # Conservative trimming with more padding
        loose = region.trim(padding=3, threshold=0.98)
        ```
        """
        # Apply global options as defaults
        import natural_pdf

        if resolution is None:
            if natural_pdf.options.image.resolution is not None:
                resolution = natural_pdf.options.image.resolution
            else:
                resolution = 144  # Default resolution when none specified

        # Pre-shrink the region to avoid box slivers
        work_region = (
            self.expand(left=-pre_shrink, right=-pre_shrink, top=-pre_shrink, bottom=-pre_shrink)
            if pre_shrink > 0
            else self
        )

        # Get the region image
        image = work_region.to_image(resolution=resolution, crop=True, include_highlights=False)

        if image is None:
            logger.warning(
                f"Region {self.bbox}: Could not generate image for trimming. Returning original region."
            )
            return self

        # Convert to grayscale for easier analysis
        import numpy as np

        # Convert PIL image to numpy array
        img_array = np.array(image.convert("L"))  # Convert to grayscale
        height, width = img_array.shape

        if height == 0 or width == 0:
            logger.warning(
                f"Region {self.bbox}: Image has zero dimensions. Returning original region."
            )
            return self

        # Normalize pixel values to 0-1 range (255 = white = 1.0, 0 = black = 0.0)
        normalized = img_array.astype(np.float32) / 255.0

        # Find content boundaries by analyzing row and column averages

        # Analyze rows (horizontal strips) to find top and bottom boundaries
        row_averages = np.mean(normalized, axis=1)  # Average each row
        content_rows = row_averages < threshold  # True where there's content (not whitespace)

        # Find first and last rows with content
        content_row_indices = np.where(content_rows)[0]
        if len(content_row_indices) == 0:
            # No content found, return a minimal region at the center
            logger.warning(
                f"Region {self.bbox}: No content detected during trimming. Returning center point."
            )
            center_x = (self.x0 + self.x1) / 2
            center_y = (self.top + self.bottom) / 2
            return Region(self.page, (center_x, center_y, center_x, center_y))

        top_content_row = max(0, content_row_indices[0] - padding)
        bottom_content_row = min(height - 1, content_row_indices[-1] + padding)

        # Analyze columns (vertical strips) to find left and right boundaries
        col_averages = np.mean(normalized, axis=0)  # Average each column
        content_cols = col_averages < threshold  # True where there's content

        content_col_indices = np.where(content_cols)[0]
        if len(content_col_indices) == 0:
            # No content found in columns either
            logger.warning(
                f"Region {self.bbox}: No column content detected during trimming. Returning center point."
            )
            center_x = (self.x0 + self.x1) / 2
            center_y = (self.top + self.bottom) / 2
            return Region(self.page, (center_x, center_y, center_x, center_y))

        left_content_col = max(0, content_col_indices[0] - padding)
        right_content_col = min(width - 1, content_col_indices[-1] + padding)

        # Convert trimmed pixel coordinates back to PDF coordinates
        scale_factor = resolution / 72.0  # Scale factor used in to_image()

        # Calculate new PDF coordinates and ensure they are Python floats
        trimmed_x0 = float(work_region.x0 + (left_content_col / scale_factor))
        trimmed_top = float(work_region.top + (top_content_row / scale_factor))
        trimmed_x1 = float(
            work_region.x0 + ((right_content_col + 1) / scale_factor)
        )  # +1 because we want inclusive right edge
        trimmed_bottom = float(
            work_region.top + ((bottom_content_row + 1) / scale_factor)
        )  # +1 because we want inclusive bottom edge

        # Ensure the trimmed region doesn't exceed the work region boundaries
        final_x0 = max(work_region.x0, trimmed_x0)
        final_top = max(work_region.top, trimmed_top)
        final_x1 = min(work_region.x1, trimmed_x1)
        final_bottom = min(work_region.bottom, trimmed_bottom)

        # Ensure valid coordinates (width > 0, height > 0)
        if final_x1 <= final_x0 or final_bottom <= final_top:
            logger.warning(
                f"Region {self.bbox}: Trimming resulted in invalid dimensions. Returning original region."
            )
            return self

        # Create the trimmed region
        trimmed_region = Region(self.page, (final_x0, final_top, final_x1, final_bottom))

        # Expand back by the pre_shrink amount to restore original positioning
        if pre_shrink > 0:
            trimmed_region = trimmed_region.expand(
                left=pre_shrink, right=pre_shrink, top=pre_shrink, bottom=pre_shrink
            )

        # Copy relevant metadata
        trimmed_region.region_type = self.region_type
        trimmed_region.normalized_type = self.normalized_type
        trimmed_region.confidence = self.confidence
        trimmed_region.model = self.model
        trimmed_region.name = self.name
        trimmed_region.label = self.label
        trimmed_region.source = "trimmed"  # Indicate this is a derived region
        trimmed_region.parent_region = self

        logger.debug(
            f"Region {self.bbox}: Trimmed to {trimmed_region.bbox} (padding={padding}, threshold={threshold}, pre_shrink={pre_shrink})"
        )
        return trimmed_region
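The core of `trim` is the row/column average test: normalize the grayscale image to 0-1 (white = 1.0), then treat any row or column whose mean exceeds the threshold as whitespace. A self-contained sketch of that detection step on a synthetic array (the helper name is illustrative):

```python
import numpy as np

def trim_bounds(gray, threshold=0.95, padding=1):
    """Find content bounds of a normalized grayscale array (1.0 = white).
    A row/column counts as whitespace when its mean exceeds threshold."""
    rows = np.where(gray.mean(axis=1) < threshold)[0]
    cols = np.where(gray.mean(axis=0) < threshold)[0]
    if len(rows) == 0 or len(cols) == 0:
        return None  # nothing but whitespace
    h, w = gray.shape
    return (max(0, cols[0] - padding), max(0, rows[0] - padding),
            min(w - 1, cols[-1] + padding), min(h - 1, rows[-1] + padding))

img = np.ones((20, 30))   # all white
img[5:10, 8:20] = 0.0     # a dark content block
print(trim_bounds(img))   # (7, 4, 20, 10)
```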

    def clip(
        self,
        obj: Optional[Any] = None,
        left: Optional[float] = None,
        top: Optional[float] = None,
        right: Optional[float] = None,
        bottom: Optional[float] = None,
    ) -> "Region":
        """
        Clip this region to specific bounds, either from another object with bbox or explicit coordinates.

        The clipped region will be constrained to not exceed the specified boundaries.
        You can provide either an object with bounding box properties, specific coordinates, or both.
        When both are provided, explicit coordinates take precedence.

        Args:
            obj: Optional object with bbox properties (Region, Element, TextElement, etc.)
            left: Optional left boundary (x0) to clip to
            top: Optional top boundary to clip to
            right: Optional right boundary (x1) to clip to
            bottom: Optional bottom boundary to clip to

        Returns:
            New Region with bounds clipped to the specified constraints

        Examples:
            # Clip to another region's bounds
            clipped = region.clip(container_region)

            # Clip to any element's bounds
            clipped = region.clip(text_element)

            # Clip to specific coordinates
            clipped = region.clip(left=100, right=400)

            # Mix object bounds with specific overrides
            clipped = region.clip(obj=container, bottom=page.height/2)
        """
        from natural_pdf.elements.base import extract_bbox

        # Start with current region bounds
        clip_x0 = self.x0
        clip_top = self.top
        clip_x1 = self.x1
        clip_bottom = self.bottom

        # Apply object constraints if provided
        if obj is not None:
            obj_bbox = extract_bbox(obj)
            if obj_bbox is not None:
                obj_x0, obj_top, obj_x1, obj_bottom = obj_bbox
                # Constrain to the intersection with the provided object
                clip_x0 = max(clip_x0, obj_x0)
                clip_top = max(clip_top, obj_top)
                clip_x1 = min(clip_x1, obj_x1)
                clip_bottom = min(clip_bottom, obj_bottom)
            else:
                logger.warning(
                    f"Region {self.bbox}: Cannot extract bbox from clipping object {type(obj)}. "
                    "Object must have bbox property or x0/top/x1/bottom attributes."
                )

        # Apply explicit coordinate constraints (these take precedence)
        if left is not None:
            clip_x0 = max(clip_x0, left)
        if top is not None:
            clip_top = max(clip_top, top)
        if right is not None:
            clip_x1 = min(clip_x1, right)
        if bottom is not None:
            clip_bottom = min(clip_bottom, bottom)

        # Ensure valid coordinates
        if clip_x1 <= clip_x0 or clip_bottom <= clip_top:
            logger.warning(
                f"Region {self.bbox}: Clipping resulted in invalid dimensions "
                f"({clip_x0}, {clip_top}, {clip_x1}, {clip_bottom}). Returning minimal region."
            )
            # Return a minimal region at the clip area's top-left
            return Region(self.page, (clip_x0, clip_top, clip_x0, clip_top))

        # Create the clipped region
        clipped_region = Region(self.page, (clip_x0, clip_top, clip_x1, clip_bottom))

        # Copy relevant metadata
        clipped_region.region_type = self.region_type
        clipped_region.normalized_type = self.normalized_type
        clipped_region.confidence = self.confidence
        clipped_region.model = self.model
        clipped_region.name = self.name
        clipped_region.label = self.label
        clipped_region.source = "clipped"  # Indicate this is a derived region
        clipped_region.parent_region = self

        logger.debug(
            f"Region {self.bbox}: Clipped to {clipped_region.bbox} "
            f"(constraints: obj={type(obj).__name__ if obj else None}, "
            f"left={left}, top={top}, right={right}, bottom={bottom})"
        )
        return clipped_region
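Clipping is a bbox intersection: each boundary can only shrink the region, and explicit coordinates override nothing larger than the current bounds. A minimal sketch on plain tuples (the function name is illustrative):

```python
def clip_bbox(bbox, left=None, top=None, right=None, bottom=None):
    """Intersect a (x0, top, x1, bottom) bbox with optional boundaries."""
    x0, t, x1, b = bbox
    if left is not None:
        x0 = max(x0, left)
    if top is not None:
        t = max(t, top)
    if right is not None:
        x1 = min(x1, right)
    if bottom is not None:
        b = min(b, bottom)
    if x1 <= x0 or b <= t:
        return (x0, t, x0, t)  # degenerate: collapse to the top-left point
    return (x0, t, x1, b)

print(clip_bbox((50, 50, 500, 700), left=100, right=400))  # (100, 50, 400, 700)
```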

    def get_elements(
        self, selector: Optional[str] = None, apply_exclusions=True, **kwargs
    ) -> List["Element"]:
        """
        Get all elements within this region.

        Args:
            selector: Optional selector to filter elements
            apply_exclusions: Whether to apply exclusion regions
            **kwargs: Additional parameters for element filtering

        Returns:
            List of elements in the region
        """
        if selector:
            # Find elements on the page matching the selector
            page_elements = self.page.find_all(
                selector, apply_exclusions=apply_exclusions, **kwargs
            )
            # Filter those elements to only include ones within this region
            return [e for e in page_elements if self._is_element_in_region(e)]
        else:
            # Get all elements from the page
            page_elements = self.page.get_elements(apply_exclusions=apply_exclusions)
            # Filter to elements in this region
            return [e for e in page_elements if self._is_element_in_region(e)]

    def extract_text(self, apply_exclusions=True, debug=False, **kwargs) -> str:
        """
        Extract text from this region, respecting page exclusions and using pdfplumber's
        layout engine (chars_to_textmap).

        Args:
            apply_exclusions: Whether to apply exclusion regions defined on the parent page.
            debug: Enable verbose debugging output for filtering steps.
            **kwargs: Additional layout parameters passed directly to pdfplumber's
                      `chars_to_textmap` function (e.g., layout, x_density, y_density).
                      See Page.extract_text docstring for more.

        Returns:
            Extracted text as string, potentially with layout-based spacing.
        """
        # Allow 'debug_exclusions' for backward compatibility
        debug = kwargs.get("debug", debug or kwargs.get("debug_exclusions", False))
        logger.debug(f"Region {self.bbox}: extract_text called with kwargs: {kwargs}")

        # 1. Get Word Elements potentially within this region (initial broad phase)
        # Optimization: Could use spatial query if page elements were indexed
        page_words = self.page.words  # Get all words from the page

        # 2. Gather all character dicts from words potentially in region
        # We filter precisely in filter_chars_spatially
        all_char_dicts = []
        for word in page_words:
            # Quick bbox check to avoid processing words clearly outside
            if get_bbox_overlap(self.bbox, word.bbox) is not None:
                all_char_dicts.extend(getattr(word, "_char_dicts", []))

        if not all_char_dicts:
            logger.debug(f"Region {self.bbox}: No character dicts found overlapping region bbox.")
            return ""

        # 3. Get Relevant Exclusions (overlapping this region)
        apply_exclusions_flag = kwargs.get("apply_exclusions", apply_exclusions)
        exclusion_regions = []
        if apply_exclusions_flag and self._page._exclusions:
            all_page_exclusions = self._page._get_exclusion_regions(
                include_callable=True, debug=debug
            )
            overlapping_exclusions = []
            for excl in all_page_exclusions:
                if get_bbox_overlap(self.bbox, excl.bbox) is not None:
                    overlapping_exclusions.append(excl)
            exclusion_regions = overlapping_exclusions
            if debug:
                logger.debug(
                    f"Region {self.bbox}: Applying {len(exclusion_regions)} overlapping exclusions."
                )
        elif debug:
            logger.debug(f"Region {self.bbox}: Not applying exclusions.")

        # 4. Spatially Filter Characters using Utility
        # Pass self as the target_region for precise polygon checks etc.
        filtered_chars = filter_chars_spatially(
            char_dicts=all_char_dicts,
            exclusion_regions=exclusion_regions,
            target_region=self,  # Pass self!
            debug=debug,
        )

        # 5. Generate Text Layout using Utility
        result = generate_text_layout(
            char_dicts=filtered_chars,
            layout_context_bbox=self.bbox,  # Use region's bbox for context
            user_kwargs=kwargs,  # Pass original kwargs to layout generator
        )

        logger.debug(f"Region {self.bbox}: extract_text finished, result length: {len(result)}.")
        return result
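As a standalone illustration of the broad-phase filter in step 2, here is a minimal sketch of the bounding-box overlap test. The `bbox_overlap` helper below is hypothetical and only mirrors the role of `get_bbox_overlap`; boxes are `(x0, top, x1, bottom)` tuples as elsewhere in natural-pdf:

```python
def bbox_overlap(a, b):
    """Return the intersection of two (x0, top, x1, bottom) boxes, or None."""
    x0, top = max(a[0], b[0]), max(a[1], b[1])
    x1, bottom = min(a[2], b[2]), min(a[3], b[3])
    if x0 < x1 and top < bottom:
        return (x0, top, x1, bottom)
    return None

# Words clearly outside the region are skipped before the precise pass
region = (100, 100, 300, 200)
words = {"inside": (120, 110, 180, 130), "outside": (400, 400, 450, 420)}
kept = [name for name, box in words.items() if bbox_overlap(region, box) is not None]
```

Only words surviving this cheap test have their character dicts gathered for the precise spatial filtering step.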

    def extract_table(
        self,
        method: Optional[str] = None,
        table_settings: Optional[dict] = None,
        use_ocr: bool = False,
        ocr_config: Optional[dict] = None,
        text_options: Optional[Dict] = None,
        cell_extraction_func: Optional[Callable[["Region"], Optional[str]]] = None,
        show_progress: bool = False,  # Controls the progress bar for the 'text' method
    ) -> TableResult:  # Rows may contain Optional[str] cells
        """
        Extract a table from this region.

        Args:
            method: Method to use: 'tatr', 'pdfplumber', 'text', 'stream', 'lattice', or None (auto-detect).
                    'stream' is an alias for 'pdfplumber' with text-based strategies (equivalent to
                    setting `vertical_strategy` and `horizontal_strategy` to 'text').
                    'lattice' is an alias for 'pdfplumber' with line-based strategies (equivalent to
                    setting `vertical_strategy` and `horizontal_strategy` to 'lines').
            table_settings: Settings for pdfplumber table extraction (used with 'pdfplumber', 'stream', or 'lattice' methods).
            use_ocr: Whether to use OCR for text extraction (currently only applicable with 'tatr' method).
            ocr_config: OCR configuration parameters.
            text_options: Dictionary of options for the 'text' method, corresponding to arguments
                          of analyze_text_table_structure (e.g., snap_tolerance, expand_bbox).
            cell_extraction_func: Optional callable function that takes a cell Region object
                                  and returns its string content. Overrides default text extraction
                                  for the 'text' method.
            show_progress: If True, display a progress bar during cell text extraction for the 'text' method.

        Returns:
            A TableResult: a list-like sequence of rows, where each row is a list
            of cell values (str or None).
        """
        # Default settings if none provided
        if table_settings is None:
            table_settings = {}
        if text_options is None:
            text_options = {}  # Initialize empty dict

        # Auto-detect method if not specified
        if method is None:
            # If this is a TATR-detected region, use TATR method
            if hasattr(self, "model") and self.model == "tatr" and self.region_type == "table":
                effective_method = "tatr"
            else:
                # Try lattice first, then fall back to stream if no meaningful results
                logger.debug(f"Region {self.bbox}: Auto-detecting table extraction method...")

                # Prefer already-created table_cell regions if they exist
                try:
                    cell_regions_in_table = [
                        c
                        for c in self.page.find_all(
                            "region[type=table_cell]", apply_exclusions=False
                        )
                        if self.intersects(c)
                    ]
                except Exception:
                    cell_regions_in_table = []  # Fall back silently

                if cell_regions_in_table:
                    logger.debug(
                        f"Region {self.bbox}: Found {len(cell_regions_in_table)} pre-computed table_cell regions – using 'cells' method."
                    )
                    return TableResult(self._extract_table_from_cells(cell_regions_in_table))

                # --------------------------------------------------------------- #

                try:
                    logger.debug(f"Region {self.bbox}: Trying 'lattice' method first...")
                    lattice_result = self.extract_table(
                        "lattice", table_settings=table_settings.copy()
                    )

                    # Check if lattice found meaningful content
                    if (
                        lattice_result
                        and len(lattice_result) > 0
                        and any(
                            any(cell and cell.strip() for cell in row if cell)
                            for row in lattice_result
                        )
                    ):
                        logger.debug(
                            f"Region {self.bbox}: 'lattice' method found table with {len(lattice_result)} rows"
                        )
                        return lattice_result
                    else:
                        logger.debug(
                            f"Region {self.bbox}: 'lattice' method found no meaningful content"
                        )
                except Exception as e:
                    logger.debug(f"Region {self.bbox}: 'lattice' method failed: {e}")

                # Fall back to stream
                logger.debug(f"Region {self.bbox}: Falling back to 'stream' method...")
                return self.extract_table("stream", table_settings=table_settings.copy())
        else:
            effective_method = method

        # Handle method aliases for pdfplumber
        if effective_method == "stream":
            logger.debug("Using 'stream' method alias for 'pdfplumber' with text-based strategies.")
            effective_method = "pdfplumber"
            # Set default text strategies if not already provided by the user
            table_settings.setdefault("vertical_strategy", "text")
            table_settings.setdefault("horizontal_strategy", "text")
        elif effective_method == "lattice":
            logger.debug(
                "Using 'lattice' method alias for 'pdfplumber' with line-based strategies."
            )
            effective_method = "pdfplumber"
            # Set default line strategies if not already provided by the user
            table_settings.setdefault("vertical_strategy", "lines")
            table_settings.setdefault("horizontal_strategy", "lines")

        # -------------------------------------------------------------
        # Auto-inject tolerances when text-based strategies are requested.
        # This must happen AFTER alias handling (so strategies are final)
        # and BEFORE we delegate to _extract_table_* helpers.
        # -------------------------------------------------------------
        if "text" in (
            table_settings.get("vertical_strategy"),
            table_settings.get("horizontal_strategy"),
        ):
            page_cfg = getattr(self.page, "_config", {})
            # Ensure text_* tolerances passed to pdfplumber
            if "text_x_tolerance" not in table_settings and "x_tolerance" not in table_settings:
                if page_cfg.get("x_tolerance") is not None:
                    table_settings["text_x_tolerance"] = page_cfg["x_tolerance"]
            if "text_y_tolerance" not in table_settings and "y_tolerance" not in table_settings:
                if page_cfg.get("y_tolerance") is not None:
                    table_settings["text_y_tolerance"] = page_cfg["y_tolerance"]

            # Snap / join tolerances (~ line spacing)
            if "snap_tolerance" not in table_settings and "snap_x_tolerance" not in table_settings:
                snap = max(1, round((page_cfg.get("y_tolerance", 1)) * 0.9))
                table_settings["snap_tolerance"] = snap
            if "join_tolerance" not in table_settings and "join_x_tolerance" not in table_settings:
                # Use .get(): snap_tolerance may be absent if the caller supplied
                # snap_x_tolerance instead, which would otherwise raise KeyError
                table_settings["join_tolerance"] = table_settings.get("snap_tolerance", 1)

        logger.debug(f"Region {self.bbox}: Extracting table using method '{effective_method}'")

        # Use the selected method
        if effective_method == "tatr":
            table_rows = self._extract_table_tatr(use_ocr=use_ocr, ocr_config=ocr_config)
        elif effective_method == "text":
            current_text_options = text_options.copy()
            current_text_options["cell_extraction_func"] = cell_extraction_func
            current_text_options["show_progress"] = show_progress
            table_rows = self._extract_table_text(**current_text_options)
        elif effective_method == "pdfplumber":
            table_rows = self._extract_table_plumber(table_settings)
        else:
            raise ValueError(
                f"Unknown table extraction method: '{method}'. Choose from 'tatr', 'pdfplumber', 'text', 'stream', 'lattice'."
            )

        return TableResult(table_rows)
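During auto-detection, a `lattice` result is accepted only if at least one cell contains non-whitespace text; otherwise extraction falls back to `stream`. That acceptance test can be sketched on its own (the helper name below is illustrative, not part of the API):

```python
def has_meaningful_content(table):
    """True if any cell in a list-of-rows table holds non-whitespace text."""
    return bool(table) and any(
        any(cell and cell.strip() for cell in row if cell) for row in table
    )

# Lattice often yields a grid of empty cells on borderless tables
empty_grid = [[None, ""], ["   ", None]]
real_table = [["Name", "Total"], ["Widgets", "42"]]
```

A table that is merely non-empty structurally (rows of `None` or blank strings) does not count as meaningful, which is what triggers the stream fallback.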

    def extract_tables(
        self,
        method: Optional[str] = None,
        table_settings: Optional[dict] = None,
    ) -> List[List[List[str]]]:
        """
        Extract all tables from this region using pdfplumber-based methods.

        Note: Only 'pdfplumber', 'stream', and 'lattice' methods are supported for extract_tables.
        'tatr' and 'text' methods are designed for single table extraction only.

        Args:
            method: Method to use: 'pdfplumber', 'stream', 'lattice', or None (auto-detect).
                    'stream' uses text-based strategies, 'lattice' uses line-based strategies.
            table_settings: Settings for pdfplumber table extraction.

        Returns:
            List of tables, where each table is a list of rows, and each row is a list of cell values.
        """
        if table_settings is None:
            table_settings = {}

        # Auto-detect method if not specified (try lattice first, then stream)
        if method is None:
            logger.debug(f"Region {self.bbox}: Auto-detecting tables extraction method...")

            # Try lattice first
            try:
                lattice_settings = table_settings.copy()
                lattice_settings.setdefault("vertical_strategy", "lines")
                lattice_settings.setdefault("horizontal_strategy", "lines")

                logger.debug(f"Region {self.bbox}: Trying 'lattice' method first for tables...")
                lattice_result = self._extract_tables_plumber(lattice_settings)

                # Check if lattice found meaningful tables
                if (
                    lattice_result
                    and len(lattice_result) > 0
                    and any(
                        any(
                            any(cell and cell.strip() for cell in row if cell)
                            for row in table
                            if table
                        )
                        for table in lattice_result
                    )
                ):
                    logger.debug(
                        f"Region {self.bbox}: 'lattice' method found {len(lattice_result)} tables"
                    )
                    return lattice_result
                else:
                    logger.debug(f"Region {self.bbox}: 'lattice' method found no meaningful tables")

            except Exception as e:
                logger.debug(f"Region {self.bbox}: 'lattice' method failed: {e}")

            # Fall back to stream
            logger.debug(f"Region {self.bbox}: Falling back to 'stream' method for tables...")
            stream_settings = table_settings.copy()
            stream_settings.setdefault("vertical_strategy", "text")
            stream_settings.setdefault("horizontal_strategy", "text")

            return self._extract_tables_plumber(stream_settings)

        effective_method = method

        # Handle method aliases
        if effective_method == "stream":
            logger.debug("Using 'stream' method alias for 'pdfplumber' with text-based strategies.")
            effective_method = "pdfplumber"
            table_settings.setdefault("vertical_strategy", "text")
            table_settings.setdefault("horizontal_strategy", "text")
        elif effective_method == "lattice":
            logger.debug(
                "Using 'lattice' method alias for 'pdfplumber' with line-based strategies."
            )
            effective_method = "pdfplumber"
            table_settings.setdefault("vertical_strategy", "lines")
            table_settings.setdefault("horizontal_strategy", "lines")

        # Use the selected method
        if effective_method == "pdfplumber":
            return self._extract_tables_plumber(table_settings)
        else:
            raise ValueError(
                f"Unknown tables extraction method: '{method}'. Choose from 'pdfplumber', 'stream', 'lattice'."
            )

    def _extract_tables_plumber(self, table_settings: dict) -> List[List[List[str]]]:
        """
        Extract all tables using pdfplumber's table extraction.

        Args:
            table_settings: Settings for pdfplumber table extraction

        Returns:
            List of tables, where each table is a list of rows, and each row is a list of cell values
        """
        # Inject global PDF-level text tolerances if not explicitly present
        pdf_cfg = getattr(self.page, "_config", getattr(self.page._parent, "_config", {}))
        _uses_text = "text" in (
            table_settings.get("vertical_strategy"),
            table_settings.get("horizontal_strategy"),
        )
        if (
            _uses_text
            and "text_x_tolerance" not in table_settings
            and "x_tolerance" not in table_settings
        ):
            x_tol = pdf_cfg.get("x_tolerance")
            if x_tol is not None:
                table_settings.setdefault("text_x_tolerance", x_tol)
        if (
            _uses_text
            and "text_y_tolerance" not in table_settings
            and "y_tolerance" not in table_settings
        ):
            y_tol = pdf_cfg.get("y_tolerance")
            if y_tol is not None:
                table_settings.setdefault("text_y_tolerance", y_tol)

        if (
            _uses_text
            and "snap_tolerance" not in table_settings
            and "snap_x_tolerance" not in table_settings
        ):
            snap = max(1, round((pdf_cfg.get("y_tolerance", 1)) * 0.9))
            table_settings.setdefault("snap_tolerance", snap)
        if (
            _uses_text
            and "join_tolerance" not in table_settings
            and "join_x_tolerance" not in table_settings
        ):
            join = table_settings.get("snap_tolerance", 1)
            table_settings.setdefault("join_tolerance", join)
            table_settings.setdefault("join_x_tolerance", join)
            table_settings.setdefault("join_y_tolerance", join)

        # Create a crop of the page for this region
        cropped = self.page._page.crop(self.bbox)

        # Extract all tables from the cropped area
        tables = cropped.extract_tables(table_settings)

        # Return the tables or an empty list if none found
        return tables if tables else []
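The tolerance injection at the top of this helper can be shown in isolation. The function below is a hypothetical standalone version of the same logic: tolerances from a page/PDF config are copied into pdfplumber table settings only when text-based strategies are in play and the caller has not set them explicitly:

```python
def inject_text_tolerances(table_settings, pdf_cfg):
    """Fill in text/snap/join tolerances from a config dict, never overriding
    values the caller supplied. Mutates and returns table_settings."""
    uses_text = "text" in (
        table_settings.get("vertical_strategy"),
        table_settings.get("horizontal_strategy"),
    )
    if not uses_text:
        return table_settings
    if "text_x_tolerance" not in table_settings and "x_tolerance" not in table_settings:
        if pdf_cfg.get("x_tolerance") is not None:
            table_settings["text_x_tolerance"] = pdf_cfg["x_tolerance"]
    if "text_y_tolerance" not in table_settings and "y_tolerance" not in table_settings:
        if pdf_cfg.get("y_tolerance") is not None:
            table_settings["text_y_tolerance"] = pdf_cfg["y_tolerance"]
    if "snap_tolerance" not in table_settings and "snap_x_tolerance" not in table_settings:
        # Snap tolerance tracks line spacing: ~90% of the y tolerance, at least 1
        table_settings["snap_tolerance"] = max(1, round(pdf_cfg.get("y_tolerance", 1) * 0.9))
    if "join_tolerance" not in table_settings and "join_x_tolerance" not in table_settings:
        table_settings["join_tolerance"] = table_settings.get("snap_tolerance", 1)
    return table_settings

settings = inject_text_tolerances(
    {"vertical_strategy": "text"}, {"x_tolerance": 3, "y_tolerance": 3}
)
```

Because only `setdefault`-style checks are used, explicit caller settings always win over the injected defaults.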

    def _extract_table_plumber(self, table_settings: dict) -> List[List[str]]:
        """
        Extract table using pdfplumber's table extraction.
        This method extracts the largest table within the region.

        Args:
            table_settings: Settings for pdfplumber table extraction

        Returns:
            Table data as a list of rows, where each row is a list of cell values
        """
        # Inject global PDF-level text tolerances if not explicitly present
        pdf_cfg = getattr(self.page, "_config", getattr(self.page._parent, "_config", {}))
        _uses_text = "text" in (
            table_settings.get("vertical_strategy"),
            table_settings.get("horizontal_strategy"),
        )
        if (
            _uses_text
            and "text_x_tolerance" not in table_settings
            and "x_tolerance" not in table_settings
        ):
            x_tol = pdf_cfg.get("x_tolerance")
            if x_tol is not None:
                table_settings.setdefault("text_x_tolerance", x_tol)
        if (
            _uses_text
            and "text_y_tolerance" not in table_settings
            and "y_tolerance" not in table_settings
        ):
            y_tol = pdf_cfg.get("y_tolerance")
            if y_tol is not None:
                table_settings.setdefault("text_y_tolerance", y_tol)

        # Create a crop of the page for this region
        cropped = self.page._page.crop(self.bbox)

        # Extract the single largest table from the cropped area
        table = cropped.extract_table(table_settings)

        # Return the table or an empty list if none found
        if table:
            return table
        return []

    def _extract_table_tatr(self, use_ocr=False, ocr_config=None) -> List[List[str]]:
        """
        Extract table using TATR structure detection.

        Args:
            use_ocr: Whether to apply OCR to each cell for better text extraction
            ocr_config: Optional OCR configuration parameters

        Returns:
            Table data as a list of rows, where each row is a list of cell values
        """
        # Find all rows and headers in this table
        rows = self.page.find_all("region[type=table-row][model=tatr]")
        headers = self.page.find_all("region[type=table-column-header][model=tatr]")
        columns = self.page.find_all("region[type=table-column][model=tatr]")

        # Filter to only include rows/headers/columns that overlap with this table region
        def is_in_table(region):
            # Check for overlap - simplifying to center point for now
            region_center_x = (region.x0 + region.x1) / 2
            region_center_y = (region.top + region.bottom) / 2
            return (
                self.x0 <= region_center_x <= self.x1 and self.top <= region_center_y <= self.bottom
            )

        rows = [row for row in rows if is_in_table(row)]
        headers = [header for header in headers if is_in_table(header)]
        columns = [column for column in columns if is_in_table(column)]

        # Sort rows by vertical position (top to bottom)
        rows.sort(key=lambda r: r.top)

        # Sort columns by horizontal position (left to right)
        columns.sort(key=lambda c: c.x0)

        # Create table data structure
        table_data = []

        # Prepare OCR config if needed
        if use_ocr:
            # Default OCR config focuses on small text with low confidence
            default_ocr_config = {
                "enabled": True,
                "min_confidence": 0.1,  # Lower than default to catch more text
                "detection_params": {
                    "text_threshold": 0.1,  # Lower threshold for low-contrast text
                    "link_threshold": 0.1,  # Lower threshold for connecting text components
                },
            }

            # Merge with provided config if any
            if ocr_config:
                if isinstance(ocr_config, dict):
                    # Update default config with provided values
                    for key, value in ocr_config.items():
                        if (
                            isinstance(value, dict)
                            and key in default_ocr_config
                            and isinstance(default_ocr_config[key], dict)
                        ):
                            # Merge nested dicts
                            default_ocr_config[key].update(value)
                        else:
                            # Replace value
                            default_ocr_config[key] = value
                else:
                    # Not a dict, use as is
                    default_ocr_config = ocr_config

            # Use the merged config
            ocr_config = default_ocr_config

        # Add header row if headers were detected
        if headers:
            header_texts = []
            for header in headers:
                if use_ocr:
                    # Try OCR for better text extraction
                    ocr_elements = header.apply_ocr(**ocr_config)
                    if ocr_elements:
                        ocr_text = " ".join(e.text for e in ocr_elements).strip()
                        if ocr_text:
                            header_texts.append(ocr_text)
                            continue

                # Fallback to normal extraction
                header_texts.append(header.extract_text().strip())
            table_data.append(header_texts)

        # Process rows
        for row in rows:
            row_cells = []

            # If we have columns, use them to extract cells
            if columns:
                for column in columns:
                    # Create a cell region at the intersection of row and column
                    cell_bbox = (column.x0, row.top, column.x1, row.bottom)

                    # Create a region for this cell
                    from natural_pdf.elements.region import (  # Import here to avoid circular imports
                        Region,
                    )

                    cell_region = Region(self.page, cell_bbox)

                    # Extract text from the cell
                    if use_ocr:
                        # Apply OCR to the cell
                        ocr_elements = cell_region.apply_ocr(**ocr_config)
                        if ocr_elements:
                            # Get text from OCR elements
                            ocr_text = " ".join(e.text for e in ocr_elements).strip()
                            if ocr_text:
                                row_cells.append(ocr_text)
                                continue

                    # Fallback to normal extraction
                    cell_text = cell_region.extract_text().strip()
                    row_cells.append(cell_text)
            else:
                # No column information, just extract the whole row text
                if use_ocr:
                    # Try OCR on the whole row
                    ocr_elements = row.apply_ocr(**ocr_config)
                    if ocr_elements:
                        ocr_text = " ".join(e.text for e in ocr_elements).strip()
                        if ocr_text:
                            row_cells.append(ocr_text)
                            continue

                # Fallback to normal extraction
                row_cells.append(row.extract_text().strip())

            table_data.append(row_cells)

        return table_data
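The OCR-config merging above (nested dicts updated key by key, scalars replaced outright) amounts to a recursive deep merge. A standalone sketch, with an illustrative helper name:

```python
def merge_ocr_config(defaults, overrides):
    """Recursively merge user overrides into a default config: nested dicts
    are merged, all other values replace the default."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_ocr_config(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {
    "enabled": True,
    "min_confidence": 0.1,
    "detection_params": {"text_threshold": 0.1, "link_threshold": 0.1},
}
cfg = merge_ocr_config(
    defaults, {"min_confidence": 0.3, "detection_params": {"text_threshold": 0.2}}
)
```

Note that overriding `detection_params.text_threshold` leaves the sibling `link_threshold` default intact, which a plain `dict.update` would not.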

    def _extract_table_text(self, **text_options) -> List[List[Optional[str]]]:
        """
        Extract table content based on text alignment analysis.

        Args:
            **text_options: Options passed to analyze_text_table_structure,
                          plus optional 'cell_extraction_func', 'coordinate_grouping_tolerance',
                          and 'show_progress'.

        Returns:
            Table data as list of lists of strings (or None for empty cells).
        """
        cell_extraction_func = text_options.pop("cell_extraction_func", None)
        # --- Get show_progress option --- #
        show_progress = text_options.pop("show_progress", False)

        # Analyze structure first (or use cached results)
        if "text_table_structure" in self.analyses:
            analysis_results = self.analyses["text_table_structure"]
            logger.debug("Using cached text table structure analysis results.")
        else:
            analysis_results = self.analyze_text_table_structure(**text_options)

        if analysis_results is None or not analysis_results.get("cells"):
            logger.warning(f"Region {self.bbox}: No cells found using 'text' method.")
            return []

        cell_dicts = analysis_results["cells"]

        # --- Grid Reconstruction Logic --- #
        if not cell_dicts:
            return []

        # 1. Get unique sorted top and left coordinates (cell boundaries)
        coord_tolerance = text_options.get("coordinate_grouping_tolerance", 1)
        tops = sorted(
            list(set(round(c["top"] / coord_tolerance) * coord_tolerance for c in cell_dicts))
        )
        lefts = sorted(
            list(set(round(c["left"] / coord_tolerance) * coord_tolerance for c in cell_dicts))
        )

        # Refine boundaries (cluster_coords helper remains the same)
        def cluster_coords(coords):
            if not coords:
                return []
            clustered = []
            current_cluster = [coords[0]]
            for c in coords[1:]:
                if abs(c - current_cluster[-1]) <= coord_tolerance:
                    current_cluster.append(c)
                else:
                    clustered.append(min(current_cluster))
                    current_cluster = [c]
            clustered.append(min(current_cluster))
            return clustered

        unique_tops = cluster_coords(tops)
        unique_lefts = cluster_coords(lefts)

        # Determine iterable for tqdm
        cell_iterator = cell_dicts
        if show_progress:
            # Only wrap if progress should be shown
            cell_iterator = tqdm(
                cell_dicts,
                desc=f"Extracting text from {len(cell_dicts)} cells (text method)",
                unit="cell",
                leave=False,  # Remove the bar once extraction completes
            )
        # --- End tqdm Setup --- #

        # 2. Create a lookup map for cell text: {(rounded_top, rounded_left): cell_text}
        cell_text_map = {}
        # --- Use the potentially wrapped iterator --- #
        for cell_data in cell_iterator:
            try:
                cell_region = self.page.region(**cell_data)
                cell_value = None  # Initialize
                if callable(cell_extraction_func):
                    try:
                        cell_value = cell_extraction_func(cell_region)
                        if not isinstance(cell_value, (str, type(None))):
                            logger.warning(
                                f"Custom cell_extraction_func returned non-string/None type ({type(cell_value)}) for cell {cell_data}. Treating as None."
                            )
                            cell_value = None
                    except Exception as func_err:
                        logger.error(
                            f"Error executing custom cell_extraction_func for cell {cell_data}: {func_err}",
                            exc_info=True,
                        )
                        cell_value = None
                else:
                    cell_value = cell_region.extract_text(
                        layout=False, apply_exclusions=False
                    ).strip()

                rounded_top = round(cell_data["top"] / coord_tolerance) * coord_tolerance
                rounded_left = round(cell_data["left"] / coord_tolerance) * coord_tolerance
                cell_text_map[(rounded_top, rounded_left)] = cell_value
            except Exception as e:
                logger.warning(f"Could not process cell {cell_data} for text extraction: {e}")

        # 3. Build the final list-of-lists table (loop remains the same)
        final_table = []
        for row_top in unique_tops:
            row_data = []
            for col_left in unique_lefts:
                best_match_key = None
                min_dist_sq = float("inf")
                for map_top, map_left in cell_text_map.keys():
                    if (
                        abs(map_top - row_top) <= coord_tolerance
                        and abs(map_left - col_left) <= coord_tolerance
                    ):
                        dist_sq = (map_top - row_top) ** 2 + (map_left - col_left) ** 2
                        if dist_sq < min_dist_sq:
                            min_dist_sq = dist_sq
                            best_match_key = (map_top, map_left)
                cell_value = cell_text_map.get(best_match_key)
                row_data.append(cell_value)
            final_table.append(row_data)

        return final_table
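The `cluster_coords` helper above collapses nearly-identical cell boundaries into one row/column position per cluster. A self-contained version with a worked example (the default tolerance of 1 point is assumed):

```python
def cluster_coords(coords, tolerance=1):
    """Collapse sorted coordinates that differ by at most `tolerance`
    into a single representative (the cluster minimum)."""
    if not coords:
        return []
    clustered, current = [], [coords[0]]
    for c in coords[1:]:
        if abs(c - current[-1]) <= tolerance:
            current.append(c)
        else:
            clustered.append(min(current))
            current = [c]
    clustered.append(min(current))
    return clustered

# Cell tops that jitter by sub-point amounts resolve to three rows
tops = sorted([100.0, 100.6, 120.0, 120.4, 140.0])
rows = cluster_coords(tops)
```

Without this clustering, each sub-point jitter in detected cell edges would produce a spurious extra row or column in the reconstructed grid.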


    @overload
    def find(
        self,
        *,
        text: str,
        contains: str = "all",
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional["Element"]: ...

    @overload
    def find(
        self,
        selector: str,
        *,
        contains: str = "all",
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional["Element"]: ...

    def find(
        self,
        selector: Optional[str] = None,  # Now optional
        *,
        text: Optional[str] = None,  # New text parameter
        contains: str = "all",  # New parameter for containment behavior
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional["Element"]:
        """
        Find the first element in this region matching the selector OR text content.

        Provide EITHER `selector` OR `text`, but not both.

        Args:
            selector: CSS-like selector string.
            text: Text content to search for (equivalent to 'text:contains(...)').
            contains: How to determine if elements are inside: 'all' (fully inside),
                     'any' (any overlap), or 'center' (center point inside).
                     (default: "all")
            apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
            regex: Whether to use regex for text search (`selector` or `text`) (default: False).
            case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
            **kwargs: Additional parameters for element filtering.

        Returns:
            First matching element or None.
        """
        # Delegate validation and selector construction to find_all
        elements = self.find_all(
            selector=selector,
            text=text,
            contains=contains,
            apply_exclusions=apply_exclusions,
            regex=regex,
            case=case,
            **kwargs,
        )
        return elements.first if elements else None

    @overload
    def find_all(
        self,
        *,
        text: str,
        contains: str = "all",
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection": ...

    @overload
    def find_all(
        self,
        selector: str,
        *,
        contains: str = "all",
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection": ...

    def find_all(
        self,
        selector: Optional[str] = None,  # Now optional
        *,
        text: Optional[str] = None,  # New text parameter
        contains: str = "all",  # New parameter to control inside/overlap behavior
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection":
        """
        Find all elements in this region matching the selector OR text content.

        Provide EITHER `selector` OR `text`, but not both.

        Args:
            selector: CSS-like selector string.
            text: Text content to search for (equivalent to 'text:contains(...)').
            contains: How to determine if elements are inside: 'all' (fully inside),
                     'any' (any overlap), or 'center' (center point inside).
                     (default: "all")
            apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
            regex: Whether to use regex for text search; applies to `text` or
                   `text:contains(...)` selectors (default: False).
            case: Whether text search is case-sensitive; applies to `text` or
                  `text:contains(...)` selectors (default: True).
            **kwargs: Additional parameters for element filtering.

        Returns:
            ElementCollection with matching elements.
        """
        from natural_pdf.elements.collections import ElementCollection

        if selector is not None and text is not None:
            raise ValueError("Provide either 'selector' or 'text', not both.")
        if selector is None and text is None:
            raise ValueError("Provide either 'selector' or 'text'.")

        # Validate contains parameter
        if contains not in ["all", "any", "center"]:
            raise ValueError(
                f"Invalid contains value: {contains}. Must be 'all', 'any', or 'center'"
            )

        # Construct selector if 'text' is provided
        effective_selector = ""
        if text is not None:
            escaped_text = text.replace('"', '\\"').replace("'", "\\'")
            effective_selector = f'text:contains("{escaped_text}")'
            logger.debug(
                f"Using text shortcut: find_all(text='{text}') -> find_all('{effective_selector}')"
            )
        elif selector is not None:
            effective_selector = selector
        else:
            raise ValueError("Internal error: No selector or text provided.")

        # Normal case: Region is on a single page
        try:
            # Get all potentially relevant elements from the page
            # Let the page handle its exclusion logic if needed
            potential_elements = self.page.find_all(
                selector=effective_selector,
                apply_exclusions=apply_exclusions,
                regex=regex,
                case=case,
                **kwargs,
            )

            # Filter these elements based on the specified containment method
            region_bbox = self.bbox
            matching_elements = []

            if contains == "all":  # Fully inside (strict)
                matching_elements = [
                    el
                    for el in potential_elements
                    if el.x0 >= region_bbox[0]
                    and el.top >= region_bbox[1]
                    and el.x1 <= region_bbox[2]
                    and el.bottom <= region_bbox[3]
                ]
            elif contains == "any":  # Any overlap
                matching_elements = [el for el in potential_elements if self.intersects(el)]
            elif contains == "center":  # Center point inside
                matching_elements = [
                    el for el in potential_elements if self.is_element_center_inside(el)
                ]

            return ElementCollection(matching_elements)

        except Exception as e:
            logger.error(f"Error during find_all in region: {e}", exc_info=True)
            return ElementCollection([])
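The three `contains` modes above reduce to simple bounding-box arithmetic. A standalone sketch of that filter, using plain `(x0, top, x1, bottom)` tuples in place of element objects (`element_matches` is illustrative, not part of the API):

```python
def element_matches(region, el, contains="all"):
    """Return True if bbox `el` is inside bbox `region` per the given mode.

    Boxes are (x0, top, x1, bottom) tuples with top < bottom.
    """
    rx0, rtop, rx1, rbottom = region
    ex0, etop, ex1, ebottom = el
    if contains == "all":      # fully inside (strict)
        return ex0 >= rx0 and etop >= rtop and ex1 <= rx1 and ebottom <= rbottom
    if contains == "any":      # any overlap (negated disjointness test)
        return not (ex1 < rx0 or ex0 > rx1 or ebottom < rtop or etop > rbottom)
    if contains == "center":   # center point inside
        cx, cy = (ex0 + ex1) / 2, (etop + ebottom) / 2
        return rx0 <= cx <= rx1 and rtop <= cy <= rbottom
    raise ValueError(f"Invalid contains value: {contains}")

region = (0, 0, 100, 100)
straddle = (90, 90, 150, 150)   # crosses the region's bottom-right corner

assert element_matches(region, (10, 10, 40, 40), "all")
assert not element_matches(region, straddle, "all")
assert element_matches(region, straddle, "any")
assert not element_matches(region, straddle, "center")  # center at (120, 120)
```

This is why `contains="all"` can silently drop elements that poke even one point past the region edge, while `"center"` is a useful middle ground for OCR-ish boxes with noisy borders.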

    def apply_ocr(self, replace=True, **ocr_params) -> "Region":
        """
        Apply OCR to this region and return the created text elements.

        This method supports two modes:
        1. **Built-in OCR Engines** (default) – pass typical parameters like
           ``engine='easyocr'`` or ``languages=['en']`` and the request is routed
           through :class:`OCRManager`.
        2. **Custom OCR Function** – pass a *callable* under the keyword ``function`` (or
           ``ocr_function``). The callable will receive *this* Region instance and should
           return the extracted text (``str``) or ``None``.  Internally the call is
           delegated to :meth:`apply_custom_ocr` so the same logic (replacement, element
           creation, etc.) is re-used.

        Example:
            # Using a custom callable (e.g. an LLM) for OCR
            def llm_ocr(region):
                image = region.to_image(resolution=300, crop=True)
                return my_llm_client.ocr(image)

            region.apply_ocr(function=llm_ocr)

        Args:
            replace: Whether to remove existing OCR elements first (default ``True``).
            **ocr_params: Parameters for the built-in OCR manager *or* the special
                          ``function``/``ocr_function`` keyword to trigger custom mode.

        Returns:
            Self for method chaining.
        """
        # --- Custom OCR function path --------------------------------------------------
        custom_func = ocr_params.pop("function", None) or ocr_params.pop("ocr_function", None)
        if callable(custom_func):
            # Delegate to the specialised helper while preserving key kwargs
            return self.apply_custom_ocr(
                ocr_function=custom_func,
                source_label=ocr_params.pop("source_label", "custom-ocr"),
                replace=replace,
                confidence=ocr_params.pop("confidence", None),
                add_to_page=ocr_params.pop("add_to_page", True),
            )

        # --- Built-in OCR engine path --------------------------------------------------
        # Ensure OCRManager is available
        if not hasattr(self.page._parent, "_ocr_manager") or self.page._parent._ocr_manager is None:
            logger.error("OCRManager not available on parent PDF. Cannot apply OCR to region.")
            return self

        # If replace is True, find and remove existing OCR elements in this region
        if replace:
            logger.info(
                f"Region {self.bbox}: Removing existing OCR elements before applying new OCR."
            )

            # --- Robust removal: iterate through all OCR elements on the page and
            #     remove those that overlap this region. This avoids reliance on
            #     identity-based lookups that can break if the ElementManager
            #     rebuilt its internal lists.

            removed_count = 0

            # Helper to remove a single element safely
            def _safe_remove(elem):
                nonlocal removed_count
                success = False
                if hasattr(elem, "page") and hasattr(elem.page, "_element_mgr"):
                    etype = getattr(elem, "object_type", "word")
                    if etype == "word":
                        etype_key = "words"
                    elif etype == "char":
                        etype_key = "chars"
                    else:
                        etype_key = etype + "s" if not etype.endswith("s") else etype
                    try:
                        success = elem.page._element_mgr.remove_element(elem, etype_key)
                    except Exception:
                        success = False
                if success:
                    removed_count += 1

            # Remove OCR WORD elements overlapping region
            for word in list(self.page._element_mgr.words):
                if getattr(word, "source", None) == "ocr" and self.intersects(word):
                    _safe_remove(word)

            # Remove OCR CHAR dicts overlapping region
            for char in list(self.page._element_mgr.chars):
                # char can be dict or TextElement; normalise
                char_src = (
                    char.get("source") if isinstance(char, dict) else getattr(char, "source", None)
                )
                if char_src == "ocr":
                    # Rough bbox for dicts
                    if isinstance(char, dict):
                        cx0, ctop, cx1, cbottom = (
                            char.get("x0", 0),
                            char.get("top", 0),
                            char.get("x1", 0),
                            char.get("bottom", 0),
                        )
                    else:
                        cx0, ctop, cx1, cbottom = char.x0, char.top, char.x1, char.bottom
                    # Quick overlap check
                    if not (
                        cx1 < self.x0 or cx0 > self.x1 or cbottom < self.top or ctop > self.bottom
                    ):
                        _safe_remove(char)

            logger.info(
                f"Region {self.bbox}: Removed {removed_count} existing OCR elements (words & chars) before re-applying OCR."
            )

        ocr_mgr = self.page._parent._ocr_manager

        # Determine rendering resolution from parameters
        final_resolution = ocr_params.get("resolution")
        if final_resolution is None and hasattr(self.page, "_parent") and self.page._parent:
            final_resolution = getattr(self.page._parent, "_config", {}).get("resolution", 150)
        elif final_resolution is None:
            final_resolution = 150
        logger.debug(
            f"Region {self.bbox}: Applying OCR with resolution {final_resolution} DPI and params: {ocr_params}"
        )

        # Render the page region to an image using the determined resolution
        try:
            region_image = self.to_image(
                resolution=final_resolution, include_highlights=False, crop=True
            )
            if not region_image:
                logger.error("Failed to render region to image for OCR.")
                return self
            logger.debug(f"Region rendered to image size: {region_image.size}")
        except Exception as e:
            logger.error(f"Error rendering region to image for OCR: {e}", exc_info=True)
            return self

        # Prepare args for the OCR Manager
        manager_args = {
            "images": region_image,
            "engine": ocr_params.get("engine"),
            "languages": ocr_params.get("languages"),
            "min_confidence": ocr_params.get("min_confidence"),
            "device": ocr_params.get("device"),
            "options": ocr_params.get("options"),
            "detect_only": ocr_params.get("detect_only"),
        }
        manager_args = {k: v for k, v in manager_args.items() if v is not None}

        # Run OCR on this region's image using the manager
        results = ocr_mgr.apply_ocr(**manager_args)
        if not isinstance(results, list):
            logger.error(
                f"OCRManager returned unexpected type for single region image: {type(results)}"
            )
            return self
        logger.debug(f"Region OCR processing returned {len(results)} results.")

        # Convert results to TextElements
        scale_x = self.width / region_image.width if region_image.width > 0 else 1.0
        scale_y = self.height / region_image.height if region_image.height > 0 else 1.0
        logger.debug(f"Region OCR scaling factors (PDF/Img): x={scale_x:.2f}, y={scale_y:.2f}")
        created_elements = []
        for result in results:
            try:
                img_x0, img_top, img_x1, img_bottom = map(float, result["bbox"])
                pdf_height = (img_bottom - img_top) * scale_y
                page_x0 = self.x0 + (img_x0 * scale_x)
                page_top = self.top + (img_top * scale_y)
                page_x1 = self.x0 + (img_x1 * scale_x)
                page_bottom = self.top + (img_bottom * scale_y)
                raw_conf = result.get("confidence")
                # Convert confidence to float unless it is None/invalid
                try:
                    confidence_val = float(raw_conf) if raw_conf is not None else None
                except (TypeError, ValueError):
                    confidence_val = None

                text_val = result.get("text")  # May legitimately be None in detect_only mode

                element_data = {
                    "text": text_val,
                    "x0": page_x0,
                    "top": page_top,
                    "x1": page_x1,
                    "bottom": page_bottom,
                    "width": page_x1 - page_x0,
                    "height": page_bottom - page_top,
                    "object_type": "word",
                    "source": "ocr",
                    "confidence": confidence_val,
                    "fontname": "OCR",
                    "size": round(pdf_height) if pdf_height > 0 else 10.0,
                    "page_number": self.page.number,
                    "bold": False,
                    "italic": False,
                    "upright": True,
                    "doctop": page_top + self.page._page.initial_doctop,
                }
                ocr_char_dict = element_data.copy()
                ocr_char_dict["object_type"] = "char"
                ocr_char_dict.setdefault("adv", ocr_char_dict.get("width", 0))
                element_data["_char_dicts"] = [ocr_char_dict]
                from natural_pdf.elements.text import TextElement

                elem = TextElement(element_data, self.page)
                created_elements.append(elem)
                self.page._element_mgr.add_element(elem, element_type="words")
                self.page._element_mgr.add_element(ocr_char_dict, element_type="chars")
            except Exception as e:
                logger.error(
                    f"Failed to convert region OCR result to element: {result}. Error: {e}",
                    exc_info=True,
                )
        logger.info(f"Region {self.bbox}: Added {len(created_elements)} elements from OCR.")
        return self
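OCR results come back in image-pixel coordinates, so the loop above scales each bbox by the region-to-image ratio and offsets it by the region's origin. A self-contained sketch of that mapping (the helper name is illustrative):

```python
def image_bbox_to_pdf(img_bbox, region_bbox, img_size):
    """Map an OCR bbox in image pixels back to PDF coordinates.

    region_bbox: (x0, top, x1, bottom) of the region in PDF points.
    img_size:    (width, height) of the rendered image in pixels.
    """
    rx0, rtop, rx1, rbottom = region_bbox
    img_w, img_h = img_size
    # Guard against zero-sized renders, as apply_ocr does
    scale_x = (rx1 - rx0) / img_w if img_w > 0 else 1.0
    scale_y = (rbottom - rtop) / img_h if img_h > 0 else 1.0
    ix0, itop, ix1, ibottom = img_bbox
    return (rx0 + ix0 * scale_x, rtop + itop * scale_y,
            rx0 + ix1 * scale_x, rtop + ibottom * scale_y)

# A 200x100 pt region rendered at 2x gives a 400x200 px image; an OCR hit
# at pixels (40, 20, 120, 60) lands back in PDF points at:
pdf_box = image_bbox_to_pdf((40, 20, 120, 60), (50, 700, 250, 800), (400, 200))
assert pdf_box == (70.0, 710.0, 110.0, 730.0)
```

Rendering at a higher `resolution` only changes `img_size`; the scale factors absorb it, so the resulting PDF coordinates are resolution-independent.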

    def apply_custom_ocr(
        self,
        ocr_function: Callable[["Region"], Optional[str]],
        source_label: str = "custom-ocr",
        replace: bool = True,
        confidence: Optional[float] = None,
        add_to_page: bool = True,
    ) -> "Region":
        """
        Apply a custom OCR function to this region and create text elements from the results.

        This is useful when you want to use a custom OCR method (e.g., an LLM API,
        specialized OCR service, or any custom logic) instead of the built-in OCR engines.

        Args:
            ocr_function: A callable that takes a Region and returns the OCR'd text (or None).
                          The function receives this region as its argument and should return
                          the extracted text as a string, or None if no text was found.
            source_label: Label to identify the source of these text elements (default: "custom-ocr").
                          This will be set as the 'source' attribute on created elements.
            replace: If True (default), removes existing OCR elements in this region before
                     adding new ones. If False, adds new OCR elements alongside existing ones.
            confidence: Optional confidence score for the OCR result (0.0-1.0).
                        If None, defaults to 1.0 if text is returned, 0.0 if None is returned.
            add_to_page: If True (default), adds the created text element to the page.
                         If False, creates the element but doesn't add it to the page.

        Returns:
            Self for method chaining.

        Example:
            # Using with an LLM
            def ocr_with_llm(region):
                image = region.to_image(resolution=300, crop=True)
                # Call your LLM API here
                return llm_client.ocr(image)

            region.apply_custom_ocr(ocr_with_llm)

            # Using with a custom OCR service
            def ocr_with_service(region):
                img_bytes = region.to_image(crop=True).tobytes()
                response = ocr_service.process(img_bytes)
                return response.text

            region.apply_custom_ocr(ocr_with_service, source_label="my-ocr-service")
        """
        # If replace is True, remove existing OCR elements in this region
        if replace:
            logger.info(
                f"Region {self.bbox}: Removing existing OCR elements before applying custom OCR."
            )

            removed_count = 0

            # Helper to remove a single element safely
            def _safe_remove(elem):
                nonlocal removed_count
                success = False
                if hasattr(elem, "page") and hasattr(elem.page, "_element_mgr"):
                    etype = getattr(elem, "object_type", "word")
                    if etype == "word":
                        etype_key = "words"
                    elif etype == "char":
                        etype_key = "chars"
                    else:
                        etype_key = etype + "s" if not etype.endswith("s") else etype
                    try:
                        success = elem.page._element_mgr.remove_element(elem, etype_key)
                    except Exception:
                        success = False
                if success:
                    removed_count += 1

            # Remove ALL OCR elements overlapping this region
            # Remove elements with source=="ocr" (built-in OCR) or matching the source_label (previous custom OCR)
            for word in list(self.page._element_mgr.words):
                word_source = getattr(word, "source", "")
                # Match built-in OCR behavior: remove elements with source "ocr" exactly
                # Also remove elements with the same source_label to avoid duplicates
                if (word_source == "ocr" or word_source == source_label) and self.intersects(word):
                    _safe_remove(word)

            # Also remove char dicts if needed (matching built-in OCR)
            for char in list(self.page._element_mgr.chars):
                # char can be dict or TextElement; normalize
                char_src = (
                    char.get("source") if isinstance(char, dict) else getattr(char, "source", None)
                )
                if char_src == "ocr" or char_src == source_label:
                    # Rough bbox for dicts
                    if isinstance(char, dict):
                        cx0, ctop, cx1, cbottom = (
                            char.get("x0", 0),
                            char.get("top", 0),
                            char.get("x1", 0),
                            char.get("bottom", 0),
                        )
                    else:
                        cx0, ctop, cx1, cbottom = char.x0, char.top, char.x1, char.bottom
                    # Quick overlap check
                    if not (
                        cx1 < self.x0 or cx0 > self.x1 or cbottom < self.top or ctop > self.bottom
                    ):
                        _safe_remove(char)

            if removed_count > 0:
                logger.info(f"Region {self.bbox}: Removed {removed_count} existing OCR elements.")

        # Call the custom OCR function
        try:
            logger.debug(f"Region {self.bbox}: Calling custom OCR function...")
            ocr_text = ocr_function(self)

            if ocr_text is not None and not isinstance(ocr_text, str):
                logger.warning(
                    f"Custom OCR function returned non-string type ({type(ocr_text)}). "
                    f"Converting to string."
                )
                ocr_text = str(ocr_text)

        except Exception as e:
            logger.error(
                f"Error calling custom OCR function for region {self.bbox}: {e}", exc_info=True
            )
            return self

        # Create text element if we got text
        if ocr_text is not None:
            # Use the to_text_element method to create the element
            text_element = self.to_text_element(
                text_content=ocr_text,
                source_label=source_label,
                confidence=confidence,
                add_to_page=add_to_page,
            )

            logger.info(
                f"Region {self.bbox}: Created text element with {len(ocr_text)} chars"
                f"{' and added to page' if add_to_page else ''}"
            )
        else:
            logger.debug(f"Region {self.bbox}: Custom OCR function returned None (no text found)")

        return self
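The `_safe_remove` helper in both OCR paths derives the ElementManager list key from an element's `object_type`. That pluralization rule in isolation (helper name illustrative):

```python
def etype_key(object_type):
    """Map an element's object_type to its ElementManager list key.

    "word" and "char" have fixed keys; anything else is pluralized
    with a trailing "s" unless it already ends in one.
    """
    if object_type == "word":
        return "words"
    if object_type == "char":
        return "chars"
    return object_type if object_type.endswith("s") else object_type + "s"

assert etype_key("word") == "words"
assert etype_key("rect") == "rects"
assert etype_key("lines") == "lines"  # already plural, left as-is
```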

    def get_section_between(self, start_element=None, end_element=None, boundary_inclusion="both"):
        """
        Get a section between two elements within this region.

        Args:
            start_element: Element marking the start of the section
            end_element: Element marking the end of the section
            boundary_inclusion: How to include boundary elements: 'start', 'end', 'both', or 'none'

        Returns:
            Region representing the section
        """
        # Get elements only within this region first
        elements = self.get_elements()

        # No contained elements: warn and return an empty region at this region's origin
        if not elements:
            logger.warning(
                f"get_section_between called on region {self.bbox} with no contained elements."
            )
            # Return an empty region at the start of the parent region
            return Region(self.page, (self.x0, self.top, self.x0, self.top))

        # Sort elements in reading order
        elements.sort(key=lambda e: (e.top, e.x0))

        # Find start index
        start_idx = 0
        if start_element:
            try:
                start_idx = elements.index(start_element)
            except ValueError:
                # Start element not in region, use first element
                logger.debug("Start element not found in region, using first element.")
                start_element = elements[0]  # Use the actual first element
                start_idx = 0
        else:
            start_element = elements[0]  # Default start is first element

        # Find end index
        end_idx = len(elements) - 1
        if end_element:
            try:
                end_idx = elements.index(end_element)
            except ValueError:
                # End element not in region, use last element
                logger.debug("End element not found in region, using last element.")
                end_element = elements[-1]  # Use the actual last element
                end_idx = len(elements) - 1
        else:
            end_element = elements[-1]  # Default end is last element

        # Adjust indexes based on boundary inclusion
        start_element_for_bbox = start_element
        end_element_for_bbox = end_element

        if boundary_inclusion == "none":
            start_idx += 1
            end_idx -= 1
            start_element_for_bbox = elements[start_idx] if start_idx <= end_idx else None
            end_element_for_bbox = elements[end_idx] if start_idx <= end_idx else None
        elif boundary_inclusion == "start":
            end_idx -= 1
            end_element_for_bbox = elements[end_idx] if start_idx <= end_idx else None
        elif boundary_inclusion == "end":
            start_idx += 1
            start_element_for_bbox = elements[start_idx] if start_idx <= end_idx else None

        # Ensure valid indexes
        start_idx = max(0, start_idx)
        end_idx = min(len(elements) - 1, end_idx)

        # If no valid elements in range, return empty region
        if start_idx > end_idx or start_element_for_bbox is None or end_element_for_bbox is None:
            logger.debug("No valid elements in range for get_section_between.")
            # Return an empty region positioned at the start element boundary
            anchor = start_element if start_element else self
            return Region(self.page, (anchor.x0, anchor.top, anchor.x0, anchor.top))

        # Get elements in range based on adjusted indices
        section_elements = elements[start_idx : end_idx + 1]

        # Create bounding box around the ELEMENTS included based on indices
        x0 = min(e.x0 for e in section_elements)
        top = min(e.top for e in section_elements)
        x1 = max(e.x1 for e in section_elements)
        bottom = max(e.bottom for e in section_elements)

        # Create new region
        section = Region(self.page, (x0, top, x1, bottom))
        # Store the original boundary elements for reference
        section.start_element = start_element
        section.end_element = end_element

        return section
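The `boundary_inclusion` handling above is just index arithmetic on the sorted element list. A standalone sketch of the adjustment (function name illustrative):

```python
def section_index_range(start_idx, end_idx, boundary_inclusion="both"):
    """Adjust an inclusive [start_idx, end_idx] range per boundary_inclusion,
    mirroring get_section_between. Returns None when no elements remain."""
    if boundary_inclusion not in ("start", "end", "both", "none"):
        raise ValueError(f"Invalid boundary_inclusion: {boundary_inclusion}")
    if boundary_inclusion in ("none", "end"):
        start_idx += 1   # drop the start boundary element
    if boundary_inclusion in ("none", "start"):
        end_idx -= 1     # drop the end boundary element
    return (start_idx, end_idx) if start_idx <= end_idx else None

assert section_index_range(2, 5, "both") == (2, 5)
assert section_index_range(2, 5, "none") == (3, 4)
assert section_index_range(2, 5, "start") == (2, 4)
assert section_index_range(2, 3, "none") is None  # boundaries were adjacent
```

The `None` case is why `get_section_between` can return a zero-area region: excluding both boundaries of an already-tight range leaves nothing to bound.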

    def get_sections(
        self, start_elements=None, end_elements=None, boundary_inclusion="both"
    ) -> "ElementCollection[Region]":
        """
        Get sections within this region based on start/end elements.

        Args:
            start_elements: Elements or selector string that mark the start of sections
            end_elements: Elements or selector string that mark the end of sections
            boundary_inclusion: How to include boundary elements: 'start', 'end', 'both', or 'none'

        Returns:
            ElementCollection of Region objects representing the extracted sections
        """
        from natural_pdf.elements.collections import ElementCollection

        # Process string selectors to find elements WITHIN THIS REGION
        if isinstance(start_elements, str):
            start_elements = self.find_all(start_elements)  # Use region's find_all
            if hasattr(start_elements, "elements"):
                start_elements = start_elements.elements

        if isinstance(end_elements, str):
            end_elements = self.find_all(end_elements)  # Use region's find_all
            if hasattr(end_elements, "elements"):
                end_elements = end_elements.elements

        # Ensure start_elements is a list (or similar iterable)
        if start_elements is None or not hasattr(start_elements, "__iter__"):
            logger.warning(
                "get_sections requires valid start_elements (selector or list). Returning empty."
            )
            return ElementCollection([])
        # Ensure end_elements is a list if provided
        if end_elements is not None and not hasattr(end_elements, "__iter__"):
            logger.warning("end_elements must be iterable if provided. Ignoring.")
            end_elements = []
        elif end_elements is None:
            end_elements = []

        # If no start elements found within the region, return empty list
        if not start_elements:
            return ElementCollection([])

        # Sort all elements within the region in reading order
        all_elements_in_region = self.get_elements()
        all_elements_in_region.sort(key=lambda e: (e.top, e.x0))

        if not all_elements_in_region:
            return ElementCollection([])  # Cannot create sections if region is empty

        # Map elements to their indices in the sorted list
        element_to_index = {el: i for i, el in enumerate(all_elements_in_region)}

        # Mark section boundaries using indices from the sorted list
        section_boundaries = []

        # Add start element indexes
        for element in start_elements:
            idx = element_to_index.get(element)
            if idx is not None:
                section_boundaries.append({"index": idx, "element": element, "type": "start"})
            # else: elements matched by the selector but not geometrically inside the region are skipped

        # Add end element indexes if provided
        for element in end_elements:
            idx = element_to_index.get(element)
            if idx is not None:
                section_boundaries.append({"index": idx, "element": element, "type": "end"})

        # Sort boundaries by index (document order within the region)
        section_boundaries.sort(key=lambda x: x["index"])

        # Generate sections
        sections = []
        current_start_boundary = None

        for i, boundary in enumerate(section_boundaries):
            # If it's a start boundary and we don't have a current start
            if boundary["type"] == "start" and current_start_boundary is None:
                current_start_boundary = boundary

            # If it's an end boundary and we have a current start
            elif boundary["type"] == "end" and current_start_boundary is not None:
                # Create a section from current_start to this boundary
                start_element = current_start_boundary["element"]
                end_element = boundary["element"]
                # Use the helper, ensuring elements are from within the region
                section = self.get_section_between(start_element, end_element, boundary_inclusion)
                sections.append(section)
                current_start_boundary = None  # Reset

            # If it's another start boundary and we have a current start (split by starts only)
            elif (
                boundary["type"] == "start"
                and current_start_boundary is not None
                and not end_elements
            ):
                # End the previous section just before this start boundary
                start_element = current_start_boundary["element"]
                # Find the element immediately preceding this start in the sorted list
                end_idx = boundary["index"] - 1
                if end_idx >= 0 and end_idx >= current_start_boundary["index"]:
                    end_element = all_elements_in_region[end_idx]
                    section = self.get_section_between(
                        start_element, end_element, boundary_inclusion
                    )
                    sections.append(section)
                # Consecutive start elements with nothing between them produce no section

                # Start the new section
                current_start_boundary = boundary

        # Handle the last section if we have a current start
        if current_start_boundary is not None:
            start_element = current_start_boundary["element"]
            # End at the last element within the region
            end_element = all_elements_in_region[-1]
            section = self.get_section_between(start_element, end_element, boundary_inclusion)
            sections.append(section)

        return ElementCollection(sections)
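The pairing loop above can be sketched standalone with `(index, type)` tuples in place of the boundary dicts (names illustrative):

```python
def pair_sections(boundaries, have_end_markers, last_index):
    """Pair sorted (index, type) boundaries into (start_idx, end_idx) sections,
    mirroring the loop in get_sections. `type` is "start" or "end"."""
    sections, current = [], None
    for idx, btype in boundaries:
        if btype == "start" and current is None:
            current = idx
        elif btype == "end" and current is not None:
            sections.append((current, idx))
            current = None
        elif btype == "start" and current is not None and not have_end_markers:
            # With no end markers, a new start closes the previous section
            # at the element just before it
            if idx - 1 >= current:
                sections.append((current, idx - 1))
            current = idx
    if current is not None:
        sections.append((current, last_index))  # open section runs to the end
    return sections

# Headings at indices 0, 4, 9 split 12 elements into three sections:
assert pair_sections([(0, "start"), (4, "start"), (9, "start")], False, 11) == \
    [(0, 3), (4, 8), (9, 11)]
# Explicit start/end pairs leave trailing elements out:
assert pair_sections([(1, "start"), (5, "end")], True, 11) == [(1, 5)]
```

This is why passing only `start_elements` (e.g. a heading selector) tiles the region into back-to-back sections, while adding `end_elements` can leave gaps between them.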

    def create_cells(self):
        """
        Create cell regions for a detected table by intersecting its
        row and column regions, and add them to the page.

        Assumes child row and column regions are already present on the page.

        Returns:
            Self for method chaining.
        """
        # Ensure this is called on a table region
        if self.region_type not in (
            "table",
            "tableofcontents",
        ):  # Allow for ToC which might have structure
            raise ValueError(
                f"create_cells should be called on a 'table' or 'tableofcontents' region, not '{self.region_type}'"
            )

        # Find rows and columns associated with this page
        # Remove the model-specific filter
        rows = self.page.find_all("region[type=table-row]")
        columns = self.page.find_all("region[type=table-column]")

        # Filter to only include those that overlap with this table region
        def is_in_table(element):
            # Use a simple overlap check (more robust than just center point)
            # Check if element's bbox overlaps with self.bbox
            return (
                hasattr(element, "bbox")
                and element.x0 < self.x1  # Ensure element has bbox
                and element.x1 > self.x0
                and element.top < self.bottom
                and element.bottom > self.top
            )

        table_rows = [r for r in rows if is_in_table(r)]
        table_columns = [c for c in columns if is_in_table(c)]

        if not table_rows or not table_columns:
            # Use page's logger if available
            logger_instance = getattr(self._page, "logger", logger)
            logger_instance.warning(
                f"Region {self.bbox}: Cannot create cells. No overlapping row or column regions found."
            )
            return self  # Return self even if no cells created

        # Sort rows and columns
        table_rows.sort(key=lambda r: r.top)
        table_columns.sort(key=lambda c: c.x0)

        # Create cells and add them to the page's element manager
        created_count = 0
        for row in table_rows:
            for column in table_columns:
                # Calculate intersection bbox for the cell
                cell_x0 = max(row.x0, column.x0)
                cell_y0 = max(row.top, column.top)
                cell_x1 = min(row.x1, column.x1)
                cell_y1 = min(row.bottom, column.bottom)

                # Only create a cell if the intersection is valid (positive width/height)
                if cell_x1 > cell_x0 and cell_y1 > cell_y0:
                    # Create cell region at the intersection
                    cell = self.page.create_region(cell_x0, cell_y0, cell_x1, cell_y1)
                    # Set metadata
                    cell.source = "derived"
                    cell.region_type = "table-cell"  # Explicitly set type
                    cell.normalized_type = "table-cell"  # And normalized type
                    # Inherit model from the parent table region
                    cell.model = self.model
                    cell.parent_region = self  # Link cell to parent table region

                    # Add the cell region to the page's element manager
                    self.page._element_mgr.add_region(cell)
                    created_count += 1

        # Optional: Add created cells to the table region's children
        # self.child_regions.extend(cells_created_in_this_call) # Needs list management

        logger_instance = getattr(self._page, "logger", logger)
        logger_instance.info(
            f"Region {self.bbox} (Model: {self.model}): Created and added {created_count} cell regions."
        )

        return self  # Return self for chaining
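A minimal standalone sketch of the intersection step above (`intersect_cells` is a hypothetical helper, not part of the natural_pdf API), operating on plain `(x0, top, x1, bottom)` tuples:

```python
def intersect_cells(rows, columns):
    """Build cell bboxes by intersecting row bboxes with column bboxes.

    rows/columns are (x0, top, x1, bottom) tuples; a cell is kept only
    if the intersection has positive width and height, mirroring the
    check in create_cells.
    """
    cells = []
    for rx0, rtop, rx1, rbottom in sorted(rows, key=lambda r: r[1]):
        for cx0, ctop, cx1, cbottom in sorted(columns, key=lambda c: c[0]):
            x0, top = max(rx0, cx0), max(rtop, ctop)
            x1, bottom = min(rx1, cx1), min(rbottom, cbottom)
            if x1 > x0 and bottom > top:
                cells.append((x0, top, x1, bottom))
    return cells

# Two rows crossed with two columns yield four cells in reading order.
rows = [(0, 0, 100, 10), (0, 10, 100, 20)]
cols = [(0, 0, 50, 20), (50, 0, 100, 20)]
cells = intersect_cells(rows, cols)
```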

    def ask(
        self,
        question: Union[str, List[str], Tuple[str, ...]],
        min_confidence: float = 0.1,
        model: Optional[str] = None,
        debug: bool = False,
        **kwargs,
    ) -> Union[Dict[str, Any], List[Dict[str, Any]]]:
        """
        Ask a question about the region content using document QA.

        This method uses a document question answering model to extract answers from the region content.
        It leverages both textual content and layout information for better understanding.

        Args:
            question: The question (or list/tuple of questions) to ask about the region content
            min_confidence: Minimum confidence threshold for answers (0.0-1.0)
            model: Optional model name to use for QA (if None, uses the default model)
            debug: If True, enable debug output from the QA engine
            **kwargs: Additional parameters to pass to the QA engine

        Returns:
            Dictionary with answer details: {
                "answer": extracted text,
                "confidence": confidence score,
                "found": whether an answer was found,
                "page_num": page number,
                "region": reference to this region,
                "source_elements": list of elements that contain the answer (if found)
            }
        """
        try:
            from natural_pdf.qa.document_qa import get_qa_engine
        except ImportError:
            logger.error(
                "Question answering requires optional dependencies. Install with `pip install natural-pdf[ai]`"
            )
            return {
                "answer": None,
                "confidence": 0.0,
                "found": False,
                "page_num": self.page.number,
                "source_elements": [],
                "region": self,
            }

        # Get or initialize QA engine with specified model
        try:
            qa_engine = get_qa_engine(model_name=model) if model else get_qa_engine()
        except Exception as e:
            logger.error(f"Failed to initialize QA engine (model: {model}): {e}", exc_info=True)
            return {
                "answer": None,
                "confidence": 0.0,
                "found": False,
                "page_num": self.page.number,
                "source_elements": [],
                "region": self,
            }

        # Ask the question using the QA engine
        try:
            return qa_engine.ask_pdf_region(
                self, question, min_confidence=min_confidence, debug=debug, **kwargs
            )
        except Exception as e:
            logger.error(f"Error during qa_engine.ask_pdf_region: {e}", exc_info=True)
            return {
                "answer": None,
                "confidence": 0.0,
                "found": False,
                "page_num": self.page.number,
                "source_elements": [],
                "region": self,
            }

    def add_child(self, child):
        """
        Add a child region to this region.

        Used for hierarchical document structure when using models like Docling
        that understand document hierarchy.

        Args:
            child: Region object to add as a child

        Returns:
            Self for method chaining
        """
        self.child_regions.append(child)
        child.parent_region = self
        return self

    def get_children(self, selector=None):
        """
        Get immediate child regions, optionally filtered by selector.

        Args:
            selector: Optional selector to filter children

        Returns:
            List of child regions matching the selector
        """
        import logging

        logger = logging.getLogger("natural_pdf.elements.region")

        if selector is None:
            return self.child_regions

        # Use existing selector parser to filter
        try:
            selector_obj = parse_selector(selector)
            filter_func = selector_to_filter_func(selector_obj)  # Removed region=self
            matched = [child for child in self.child_regions if filter_func(child)]
            logger.debug(
                f"get_children: found {len(matched)} of {len(self.child_regions)} children matching '{selector}'"
            )
            return matched
        except Exception as e:
            logger.error(f"Error applying selector in get_children: {e}", exc_info=True)
            return []  # Return empty list on error

    def get_descendants(self, selector=None):
        """
        Get all descendant regions (children, grandchildren, etc.), optionally filtered by selector.

        Args:
            selector: Optional selector to filter descendants

        Returns:
            List of descendant regions matching the selector
        """
        import logging

        logger = logging.getLogger("natural_pdf.elements.region")

        all_descendants = []
        queue = list(self.child_regions)  # Start with direct children

        while queue:
            current = queue.pop(0)
            all_descendants.append(current)
            # Add current's children to the queue for processing
            if hasattr(current, "child_regions"):
                queue.extend(current.child_regions)

        logger.debug(f"get_descendants: found {len(all_descendants)} total descendants")

        # Filter by selector if provided
        if selector is not None:
            try:
                selector_obj = parse_selector(selector)
                filter_func = selector_to_filter_func(selector_obj)  # Removed region=self
                matched = [desc for desc in all_descendants if filter_func(desc)]
                logger.debug(f"get_descendants: filtered to {len(matched)} matching '{selector}'")
                return matched
            except Exception as e:
                logger.error(f"Error applying selector in get_descendants: {e}", exc_info=True)
                return []  # Return empty list on error

        return all_descendants
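The traversal in `get_descendants` is a plain breadth-first walk over `child_regions`. A standalone sketch with a hypothetical `Node` standing in for Region:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    child_regions: List["Node"] = field(default_factory=list)

    def add_child(self, child):
        # Same contract as Region.add_child: append and return self
        self.child_regions.append(child)
        return self

def descendants(node):
    """Breadth-first collection of all children, grandchildren, etc."""
    found, queue = [], list(node.child_regions)
    while queue:
        current = queue.pop(0)
        found.append(current)
        queue.extend(current.child_regions)
    return found

# table -> row -> cell: descendants of the table are [row, cell]
root = Node("table").add_child(Node("row").add_child(Node("cell")))
names = [n.name for n in descendants(root)]
```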

    def __repr__(self) -> str:
        """String representation of the region."""
        poly_info = " (Polygon)" if self.has_polygon else ""
        name_info = f" name='{self.name}'" if self.name else ""
        type_info = f" type='{self.region_type}'" if self.region_type else ""
        source_info = f" source='{self.source}'" if self.source else ""
        return f"<Region{name_info}{type_info}{source_info} bbox={self.bbox}{poly_info}>"

    def correct_ocr(
        self,
        correction_callback: Callable[[Any], Optional[str]],
    ) -> "Region":  # Return self for chaining
        """
        Applies corrections to OCR-generated text elements within this region
        using a user-provided callback function.

        Finds text elements within this region whose 'source' attribute starts
        with 'ocr' and calls the `correction_callback` for each, passing the
        element itself.

        The `correction_callback` should contain the logic to:
        1. Determine if the element needs correction.
        2. Perform the correction (e.g., call an LLM).
        3. Return the new text (`str`) or `None`.

        If the callback returns a string, the element's `.text` is updated.
        Metadata updates (source, confidence, etc.) should happen within the callback.

        Args:
            correction_callback: A function accepting an element and returning
                                 `Optional[str]` (new text or None).

        Returns:
            Self for method chaining.
        """
        # Find OCR elements specifically within this region
        # Note: We typically want to correct even if the element falls in an excluded area
        target_elements = self.find_all(selector="text[source=ocr]", apply_exclusions=False)

        # Delegate to the utility function
        _apply_ocr_correction_to_elements(
            elements=target_elements,  # Pass the ElementCollection directly
            correction_callback=correction_callback,
            caller_info=f"Region({self.bbox})",  # Pass caller info
        )

        return self  # Return self for chaining
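The callback contract is easiest to see with stand-ins. `FakeElement` and `apply_correction` below are hypothetical illustrations of the pattern (return a string to replace the text, `None` to leave it unchanged), not natural_pdf internals:

```python
class FakeElement:
    def __init__(self, text, source="ocr"):
        self.text = text
        self.source = source

def apply_correction(elements, correction_callback):
    """Update element.text when the callback returns a string; skip on None.

    Only elements whose source starts with 'ocr' are considered,
    mirroring the selector used by correct_ocr.
    """
    for el in elements:
        if not el.source.startswith("ocr"):
            continue
        new_text = correction_callback(el)
        if new_text is not None:
            el.text = new_text
    return elements

# Example callback: fix a common OCR confusion (digit 0 read as letter O),
# but only in strings that already contain digits.
def fix_zero(el):
    return el.text.replace("O", "0") if any(c.isdigit() for c in el.text) else None

els = [FakeElement("1O2"), FakeElement("hello"), FakeElement("3O", source="pdf")]
apply_correction(els, fix_zero)
```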

    # --- Classification Mixin Implementation --- #
    def _get_classification_manager(self) -> "ClassificationManager":
        if (
            not hasattr(self, "page")
            or not hasattr(self.page, "pdf")
            or not hasattr(self.page.pdf, "get_manager")
        ):
            raise AttributeError(
                "ClassificationManager cannot be accessed: Parent Page, PDF, or get_manager method missing."
            )
        try:
            # Use the PDF's manager registry accessor via page
            return self.page.pdf.get_manager("classification")
        except (ValueError, RuntimeError, AttributeError) as e:
            # Wrap potential errors from get_manager for clarity
            raise AttributeError(
                f"Failed to get ClassificationManager from PDF via Page: {e}"
            ) from e

    def _get_classification_content(
        self, model_type: str, **kwargs
    ) -> Union[str, "Image"]:  # Use "Image" for lazy import
        if model_type == "text":
            text_content = self.extract_text(layout=False)  # Simple join for classification
            if not text_content or text_content.isspace():
                raise ValueError("Cannot classify region with 'text' model: No text content found.")
            return text_content
        elif model_type == "vision":
            # Get resolution from kwargs if the caller provided one, else default.
            # Access the manager via the method to ensure it is available.
            manager = self._get_classification_manager()
            default_resolution = 150  # Manager doesn't store a default resolution
            # Note: classify() forwards resolution via **kwargs when the user sets it
            resolution = kwargs.get("resolution", default_resolution)

            img = self.to_image(
                resolution=resolution,
                include_highlights=False,  # No highlights for classification input
                crop=True,  # Just the region content
            )
            if img is None:
                raise ValueError(
                    "Cannot classify region with 'vision' model: Failed to render image."
                )
            return img
        else:
            raise ValueError(f"Unsupported model_type for classification: {model_type}")

    def _get_metadata_storage(self) -> Dict[str, Any]:
        # Ensure metadata exists
        if not hasattr(self, "metadata") or self.metadata is None:
            self.metadata = {}
        return self.metadata

    # --- End Classification Mixin Implementation --- #

    # --- NEW METHOD: analyze_text_table_structure ---
    def analyze_text_table_structure(
        self,
        snap_tolerance: int = 10,
        join_tolerance: int = 3,
        min_words_vertical: int = 3,
        min_words_horizontal: int = 1,
        intersection_tolerance: int = 3,
        expand_bbox: Optional[Dict[str, int]] = None,
        **kwargs,
    ) -> Optional[Dict]:
        """
        Analyzes the text elements within the region (or slightly expanded area)
        to find potential table structure (lines, cells) using text alignment logic
        adapted from pdfplumber.

        Args:
            snap_tolerance: Tolerance for snapping parallel lines.
            join_tolerance: Tolerance for joining collinear lines.
            min_words_vertical: Minimum words needed to define a vertical line.
            min_words_horizontal: Minimum words needed to define a horizontal line.
            intersection_tolerance: Tolerance for detecting line intersections.
            expand_bbox: Optional dictionary to expand the search area slightly beyond
                         the region's exact bounds (e.g., {'left': 5, 'right': 5}).
            **kwargs: Additional keyword arguments passed to
                      find_text_based_tables (e.g., specific x/y tolerances).

        Returns:
            A dictionary containing 'horizontal_edges', 'vertical_edges', 'cells' (list of dicts),
            and 'intersections', or None if pdfplumber is unavailable or an error occurs.
        """

        # Determine the search region (expand if requested)
        search_region = self
        if expand_bbox and isinstance(expand_bbox, dict):
            try:
                search_region = self.expand(**expand_bbox)
                logger.debug(
                    f"Expanded search region for text table analysis to: {search_region.bbox}"
                )
            except Exception as e:
                logger.warning(f"Could not expand region bbox: {e}. Using original region.")
                search_region = self

        # Find text elements within the search region
        text_elements = search_region.find_all(
            "text", apply_exclusions=False
        )  # Use unfiltered text
        if not text_elements:
            logger.info(f"Region {self.bbox}: No text elements found for text table analysis.")
            return {"horizontal_edges": [], "vertical_edges": [], "cells": [], "intersections": {}}

        # Extract bounding boxes
        bboxes = [element.bbox for element in text_elements if hasattr(element, "bbox")]
        if not bboxes:
            logger.info(f"Region {self.bbox}: No bboxes extracted from text elements.")
            return {"horizontal_edges": [], "vertical_edges": [], "cells": [], "intersections": {}}

        # Call the utility function
        try:
            analysis_results = find_text_based_tables(
                bboxes=bboxes,
                snap_tolerance=snap_tolerance,
                join_tolerance=join_tolerance,
                min_words_vertical=min_words_vertical,
                min_words_horizontal=min_words_horizontal,
                intersection_tolerance=intersection_tolerance,
                **kwargs,  # Pass through any extra specific tolerance args
            )
            # Store results in the region's analyses cache
            self.analyses["text_table_structure"] = analysis_results
            return analysis_results
        except ImportError:
            logger.error("pdfplumber library is required for 'text' table analysis but not found.")
            return None
        except Exception as e:
            logger.error(f"Error during text-based table analysis: {e}", exc_info=True)
            return None

    # --- END NEW METHOD ---
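The core idea borrowed from pdfplumber's text strategy is that word edges repeating at (nearly) the same x position imply a column boundary. A simplified sketch under that assumption (`vertical_lines_from_words` is hypothetical; the real logic lives in `find_text_based_tables` and also handles horizontal lines and intersections):

```python
def vertical_lines_from_words(bboxes, snap_tolerance=3, min_words=3):
    """Cluster word left edges; clusters with >= min_words become column lines.

    bboxes are (x0, top, x1, bottom) tuples. Edges within snap_tolerance
    of each other are grouped, and each qualifying group is reduced to
    its mean x position.
    """
    xs = sorted(b[0] for b in bboxes)
    clusters = []  # each cluster is a list of nearby x values
    for x in xs:
        if clusters and x - clusters[-1][-1] <= snap_tolerance:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return [sum(c) / len(c) for c in clusters if len(c) >= min_words]

# Three words starting near x=10 and three near x=60 imply two column lines.
words = [(10, 0, 40, 10), (11, 12, 38, 22), (10, 24, 42, 34),
         (60, 0, 90, 10), (61, 12, 88, 22), (60, 24, 92, 34)]
lines = vertical_lines_from_words(words)
```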

    # --- NEW METHOD: get_text_table_cells ---
    def get_text_table_cells(
        self,
        snap_tolerance: int = 10,
        join_tolerance: int = 3,
        min_words_vertical: int = 3,
        min_words_horizontal: int = 1,
        intersection_tolerance: int = 3,
        expand_bbox: Optional[Dict[str, int]] = None,
        **kwargs,
    ) -> "ElementCollection[Region]":
        """
        Analyzes text alignment to find table cells and returns them as
        temporary Region objects without adding them to the page.

        Args:
            snap_tolerance: Tolerance for snapping parallel lines.
            join_tolerance: Tolerance for joining collinear lines.
            min_words_vertical: Minimum words needed to define a vertical line.
            min_words_horizontal: Minimum words needed to define a horizontal line.
            intersection_tolerance: Tolerance for detecting line intersections.
            expand_bbox: Optional dictionary to expand the search area slightly beyond
                         the region's exact bounds (e.g., {'left': 5, 'right': 5}).
            **kwargs: Additional keyword arguments passed to
                      find_text_based_tables (e.g., specific x/y tolerances).

        Returns:
            An ElementCollection containing temporary Region objects for each detected cell,
            or an empty ElementCollection if no cells are found or an error occurs.
        """
        from natural_pdf.elements.collections import ElementCollection

        # 1. Perform the analysis (or use cached results)
        if "text_table_structure" in self.analyses:
            analysis_results = self.analyses["text_table_structure"]
            logger.debug("get_text_table_cells: Using cached analysis results.")
        else:
            analysis_results = self.analyze_text_table_structure(
                snap_tolerance=snap_tolerance,
                join_tolerance=join_tolerance,
                min_words_vertical=min_words_vertical,
                min_words_horizontal=min_words_horizontal,
                intersection_tolerance=intersection_tolerance,
                expand_bbox=expand_bbox,
                **kwargs,
            )

        # 2. Check if analysis was successful and cells were found
        if analysis_results is None or not analysis_results.get("cells"):
            logger.info(f"Region {self.bbox}: No cells found by text table analysis.")
            return ElementCollection([])  # Return empty collection

        # 3. Create temporary Region objects for each cell dictionary
        cell_regions = []
        for cell_data in analysis_results["cells"]:
            try:
                # Use page.region to create the region object
                # It expects left, top, right, bottom keys
                cell_region = self.page.region(**cell_data)

                # Set metadata on the temporary region
                cell_region.region_type = "table-cell"
                cell_region.normalized_type = "table-cell"
                cell_region.model = "pdfplumber-text"
                cell_region.source = "volatile"  # Indicate it's not managed/persistent
                cell_region.parent_region = self  # Link back to the region it came from

                cell_regions.append(cell_region)
            except Exception as e:
                logger.warning(f"Could not create Region object for cell data {cell_data}: {e}")

        # 4. Return the list wrapped in an ElementCollection
        logger.debug(f"get_text_table_cells: Created {len(cell_regions)} temporary cell regions.")
        return ElementCollection(cell_regions)

    # --- END NEW METHOD ---

    def to_text_element(
        self,
        text_content: Optional[Union[str, Callable[["Region"], Optional[str]]]] = None,
        source_label: str = "derived_from_region",
        object_type: str = "word",  # Or "char", controls how it's categorized
        default_font_size: float = 10.0,
        default_font_name: str = "RegionContent",
        confidence: Optional[float] = None,  # Allow overriding confidence
        add_to_page: bool = False,  # NEW: Option to add to page
    ) -> "TextElement":
        """
        Creates a new TextElement object based on this region's geometry.

        The text for the new TextElement can be provided directly,
        generated by a callback function, or left as None.

        Args:
            text_content:
                - If a string, this will be the text of the new TextElement.
                - If a callable, it will be called with this region instance
                  and its return value (a string or None) will be the text.
                - If None (default), the TextElement's text will be None.
            source_label: The 'source' attribute for the new TextElement.
            object_type: The 'object_type' for the TextElement's data dict
                         (e.g., "word", "char").
            default_font_size: Placeholder font size if text is generated.
            default_font_name: Placeholder font name if text is generated.
            confidence: Confidence score for the text. If text_content is None,
                        defaults to 0.0. If text is provided/generated, defaults to 1.0
                        unless specified.
            add_to_page: If True, the created TextElement will be added to the
                         region's parent page. (Default: False)

        Returns:
            A new TextElement instance.

        Raises:
            ValueError: If the region does not have a valid 'page' attribute.
        """
        actual_text: Optional[str] = None
        if isinstance(text_content, str):
            actual_text = text_content
        elif callable(text_content):
            try:
                actual_text = text_content(self)
            except Exception as e:
                logger.error(
                    f"Error executing text_content callback for region {self.bbox}: {e}",
                    exc_info=True,
                )
                actual_text = None  # Ensure actual_text is None on error

        final_confidence = confidence
        if final_confidence is None:
            final_confidence = 1.0 if actual_text is not None and actual_text.strip() else 0.0

        if not hasattr(self, "page") or self.page is None:
            raise ValueError("Region must have a valid 'page' attribute to create a TextElement.")

        # Create character dictionaries for the text
        char_dicts = []
        if actual_text:
            # Create a single character dict that spans the entire region
            # This is a simplified approach - OCR engines typically create one per character
            char_dict = {
                "text": actual_text,
                "x0": self.x0,
                "top": self.top,
                "x1": self.x1,
                "bottom": self.bottom,
                "width": self.width,
                "height": self.height,
                "object_type": "char",
                "page_number": self.page.page_number,
                "fontname": default_font_name,
                "size": default_font_size,
                "upright": True,
                "direction": 1,
                "adv": self.width,
                "source": source_label,
                "confidence": final_confidence,
                "stroking_color": (0, 0, 0),
                "non_stroking_color": (0, 0, 0),
            }
            char_dicts.append(char_dict)

        elem_data = {
            "text": actual_text,
            "x0": self.x0,
            "top": self.top,
            "x1": self.x1,
            "bottom": self.bottom,
            "width": self.width,
            "height": self.height,
            "object_type": object_type,
            "page_number": self.page.page_number,
            "stroking_color": getattr(self, "stroking_color", (0, 0, 0)),
            "non_stroking_color": getattr(self, "non_stroking_color", (0, 0, 0)),
            "fontname": default_font_name,
            "size": default_font_size,
            "upright": True,
            "direction": 1,
            "adv": self.width,
            "source": source_label,
            "confidence": final_confidence,
            "_char_dicts": char_dicts,
        }
        text_element = TextElement(elem_data, self.page)

        if add_to_page:
            if hasattr(self.page, "_element_mgr") and self.page._element_mgr is not None:
                add_as_type = (
                    "words"
                    if object_type == "word"
                    else "chars" if object_type == "char" else object_type
                )
                self.page._element_mgr.add_element(text_element, element_type=add_as_type)
                logger.debug(
                    f"TextElement created from region {self.bbox} and added to page {self.page.page_number} as {add_as_type}."
                )
                # Also add character dictionaries to the chars collection
                if char_dicts and object_type == "word":
                    for char_dict in char_dicts:
                        self.page._element_mgr.add_element(char_dict, element_type="chars")
            else:
                page_num_str = (
                    str(self.page.page_number) if hasattr(self.page, "page_number") else "N/A"
                )
                logger.warning(
                    f"Cannot add TextElement to page: Page {page_num_str} for region {self.bbox} is missing '_element_mgr'."
                )

        return text_element
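The confidence defaulting above is worth isolating: an explicit value wins; otherwise non-blank text gets 1.0 and blank or missing text gets 0.0. A standalone sketch of just that rule (hypothetical helper, for illustration):

```python
from typing import Optional

def default_confidence(text: Optional[str], confidence: Optional[float] = None) -> float:
    """Mirror to_text_element's rule: explicit confidence wins;
    otherwise 1.0 for non-whitespace text, 0.0 for None or blank text."""
    if confidence is not None:
        return confidence
    return 1.0 if text is not None and text.strip() else 0.0
```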

    # ------------------------------------------------------------------
    # Unified analysis storage (maps to metadata["analysis"])
    # ------------------------------------------------------------------

    @property
    def analyses(self) -> Dict[str, Any]:
        if not hasattr(self, "metadata") or self.metadata is None:
            self.metadata = {}
        return self.metadata.setdefault("analysis", {})

    @analyses.setter
    def analyses(self, value: Dict[str, Any]):
        if not hasattr(self, "metadata") or self.metadata is None:
            self.metadata = {}
        self.metadata["analysis"] = value

    # ------------------------------------------------------------------
    # New helper: build table from pre-computed table_cell regions
    # ------------------------------------------------------------------

    def _extract_table_from_cells(self, cell_regions: List["Region"]) -> List[List[Optional[str]]]:
        """Construct a table (list-of-lists) from table_cell regions.

        This assumes each cell Region has metadata.row_index / col_index as written
        by detect_table_structure_from_lines(). If these keys are missing, cell order
        is derived purely from geometry as a fallback.
        """
        if not cell_regions:
            return []

        # Attempt to use explicit indices first
        all_row_idxs = []
        all_col_idxs = []
        for cell in cell_regions:
            try:
                r_idx = int(cell.metadata.get("row_index"))
                c_idx = int(cell.metadata.get("col_index"))
                all_row_idxs.append(r_idx)
                all_col_idxs.append(c_idx)
            except Exception:
                # Not all cells have indices – clear the lists so we switch to geometric sorting
                all_row_idxs = []
                all_col_idxs = []
                break

        if all_row_idxs and all_col_idxs:
            num_rows = max(all_row_idxs) + 1
            num_cols = max(all_col_idxs) + 1

            # Initialise blank grid
            table_grid: List[List[Optional[str]]] = [[None] * num_cols for _ in range(num_rows)]

            for cell in cell_regions:
                try:
                    r_idx = int(cell.metadata.get("row_index"))
                    c_idx = int(cell.metadata.get("col_index"))
                    text_val = cell.extract_text(layout=False, apply_exclusions=False).strip()
                    table_grid[r_idx][c_idx] = text_val if text_val else None
                except Exception as _err:
                    # Skip problematic cell
                    continue

            return table_grid

        # ------------------------------------------------------------------
        # Fallback: derive order purely from geometry if indices are absent
        # ------------------------------------------------------------------
        # Sort unique centers to define ordering
        try:
            import numpy as np
        except ImportError:
            logger.warning("NumPy required for geometric cell ordering; returning empty result.")
            return []

        # Build arrays of centers
        centers = np.array([[(c.x0 + c.x1) / 2.0, (c.top + c.bottom) / 2.0] for c in cell_regions])
        xs = centers[:, 0]
        ys = centers[:, 1]

        # Cluster unique row Y positions and column X positions with a tolerance
        def _cluster(vals, tol=1.0):
            sorted_vals = np.sort(vals)
            groups = [[sorted_vals[0]]]
            for v in sorted_vals[1:]:
                if abs(v - groups[-1][-1]) <= tol:
                    groups[-1].append(v)
                else:
                    groups.append([v])
            return [np.mean(g) for g in groups]

        row_centers = _cluster(ys)
        col_centers = _cluster(xs)

        num_rows = len(row_centers)
        num_cols = len(col_centers)

        table_grid: List[List[Optional[str]]] = [[None] * num_cols for _ in range(num_rows)]

        # Assign each cell to nearest row & col center
        for cell, (cx, cy) in zip(cell_regions, centers):
            row_idx = int(np.argmin([abs(cy - rc) for rc in row_centers]))
            col_idx = int(np.argmin([abs(cx - cc) for cc in col_centers]))

            text_val = cell.extract_text(layout=False, apply_exclusions=False).strip()
            table_grid[row_idx][col_idx] = text_val if text_val else None

        return table_grid
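The geometric fallback reduces to 1-D clustering of cell centers followed by nearest-center assignment. A NumPy-free sketch of the same idea (`cluster_1d` and `nearest_index` are hypothetical helpers mirroring `_cluster` and the argmin step):

```python
def cluster_1d(vals, tol=1.0):
    """Group sorted values whose neighbors differ by <= tol; return group means."""
    if not vals:
        return []
    groups = [[]]
    for v in sorted(vals):
        if groups[-1] and abs(v - groups[-1][-1]) > tol:
            groups.append([])
        groups[-1].append(v)
    return [sum(g) / len(g) for g in groups]

def nearest_index(value, centers):
    """Index of the center closest to value (the argmin step)."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))

# Cell center y-values at ~10 and ~30 collapse into two row centers.
row_centers = cluster_1d([10.2, 10.0, 30.1, 29.9])
row_of_first_cell = nearest_index(9.8, row_centers)
```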
Attributes
natural_pdf.Region.bbox property

Get the bounding box as (x0, top, x1, bottom).

natural_pdf.Region.bottom property

Get the bottom coordinate.

natural_pdf.Region.has_polygon property

Check if this region has polygon coordinates.

natural_pdf.Region.height property

Get the height of the region.

natural_pdf.Region.page property

Get the parent page.

natural_pdf.Region.polygon property

Get polygon coordinates if available, otherwise return rectangle corners.

natural_pdf.Region.top property

Get the top coordinate.

natural_pdf.Region.type property

Element type.

natural_pdf.Region.width property

Get the width of the region.

natural_pdf.Region.x0 property

Get the left coordinate.

natural_pdf.Region.x1 property

Get the right coordinate.

Functions
natural_pdf.Region.__init__(page, bbox, polygon=None, parent=None, label=None)

Initialize a region.

Creates a Region object that represents a rectangular or polygonal area on a page. Regions are used for spatial navigation, content extraction, and analysis operations.

Parameters:

    page (Page, required): Parent Page object that contains this region and provides access to document elements and analysis capabilities.
    bbox (Tuple[float, float, float, float], required): Bounding box coordinates as (x0, top, x1, bottom) tuple in PDF coordinate system (points, with origin at bottom-left).
    polygon (List[Tuple[float, float]], default None): Optional list of coordinate points [(x1,y1), (x2,y2), ...] for non-rectangular regions. If provided, the region will use polygon-based intersection calculations instead of simple rectangle overlap.
    parent (default None): Optional parent region for hierarchical document structure. Useful for maintaining tree-like relationships between regions.
    label (Optional[str], default None): Optional descriptive label for the region, useful for debugging and identification in complex workflows.
Example
pdf = npdf.PDF("document.pdf")
page = pdf.pages[0]

# Rectangular region
header = Region(page, (0, 0, page.width, 100), label="header")

# Polygonal region (from layout detection)
table_polygon = [(50, 100), (300, 100), (300, 400), (50, 400)]
table_region = Region(page, (50, 100, 300, 400),
                    polygon=table_polygon, label="table")
Note

Regions are typically created through page methods like page.region() or spatial navigation methods like element.below(). Direct instantiation is used mainly for advanced workflows or layout analysis integration.

Source code in natural_pdf/elements/region.py
(lines 124-194)
def __init__(
    self,
    page: "Page",
    bbox: Tuple[float, float, float, float],
    polygon: List[Tuple[float, float]] = None,
    parent=None,
    label: Optional[str] = None,
):
    """Initialize a region.

    Creates a Region object that represents a rectangular or polygonal area on a page.
    Regions are used for spatial navigation, content extraction, and analysis operations.

    Args:
        page: Parent Page object that contains this region and provides access
            to document elements and analysis capabilities.
        bbox: Bounding box coordinates as (x0, top, x1, bottom) tuple in PDF
            coordinate system (points, with origin at bottom-left).
        polygon: Optional list of coordinate points [(x1,y1), (x2,y2), ...] for
            non-rectangular regions. If provided, the region will use polygon-based
            intersection calculations instead of simple rectangle overlap.
        parent: Optional parent region for hierarchical document structure.
            Useful for maintaining tree-like relationships between regions.
        label: Optional descriptive label for the region, useful for debugging
            and identification in complex workflows.

    Example:
        ```python
        pdf = npdf.PDF("document.pdf")
        page = pdf.pages[0]

        # Rectangular region
        header = Region(page, (0, 0, page.width, 100), label="header")

        # Polygonal region (from layout detection)
        table_polygon = [(50, 100), (300, 100), (300, 400), (50, 400)]
        table_region = Region(page, (50, 100, 300, 400),
                            polygon=table_polygon, label="table")
        ```

    Note:
        Regions are typically created through page methods like page.region() or
        spatial navigation methods like element.below(). Direct instantiation is
        used mainly for advanced workflows or layout analysis integration.
    """
    self._page = page
    self._bbox = bbox
    self._polygon = polygon

    self.metadata: Dict[str, Any] = {}
    # Analysis results live under self.metadata['analysis'] via property

    # Standard attributes for all elements
    self.object_type = "region"  # For selector compatibility

    # Layout detection attributes
    self.region_type = None
    self.normalized_type = None
    self.confidence = None
    self.model = None

    # Region management attributes
    self.name = None
    self.label = label
    self.source = None  # Will be set by creation methods

    # Hierarchy support for nested document structure
    self.parent_region = parent
    self.child_regions = []
    self.text_content = None  # Direct text content (e.g., from Docling)
    self.associated_text_elements = []  # Native text elements that overlap with this region
natural_pdf.Region.__repr__()

String representation of the region.

Source code in natural_pdf/elements/region.py
(lines 2938-2944)
def __repr__(self) -> str:
    """String representation of the region."""
    poly_info = " (Polygon)" if self.has_polygon else ""
    name_info = f" name='{self.name}'" if self.name else ""
    type_info = f" type='{self.region_type}'" if self.region_type else ""
    source_info = f" source='{self.source}'" if self.source else ""
    return f"<Region{name_info}{type_info}{source_info} bbox={self.bbox}{poly_info}>"
natural_pdf.Region.above(height=None, width='full', include_source=False, until=None, include_endpoint=True, **kwargs)

Select region above this region.

Parameters:

    height (Optional[float], default None): Height of the region above, in points.
    width (str, default 'full'): Width mode - "full" for full page width or "element" for element width.
    include_source (bool, default False): Whether to include this region in the result.
    until (Optional[str], default None): Optional selector string to specify an upper boundary element.
    include_endpoint (bool, default True): Whether to include the boundary element in the region.
    **kwargs: Additional parameters.

Returns:

    Region: Region object representing the area above.

Source code in natural_pdf/elements/region.py
(lines 232-263)
def above(
    self,
    height: Optional[float] = None,
    width: str = "full",
    include_source: bool = False,
    until: Optional[str] = None,
    include_endpoint: bool = True,
    **kwargs,
) -> "Region":
    """
    Select region above this region.

    Args:
        height: Height of the region above, in points
        width: Width mode - "full" for full page width or "element" for element width
        include_source: Whether to include this region in the result (default: False)
        until: Optional selector string to specify an upper boundary element
        include_endpoint: Whether to include the boundary element in the region (default: True)
        **kwargs: Additional parameters

    Returns:
        Region object representing the area above
    """
    return self._direction(
        direction="above",
        size=height,
        cross_size=width,
        include_source=include_source,
        until=until,
        include_endpoint=include_endpoint,
        **kwargs,
    )
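The geometry behind above() can be illustrated with a small standalone helper (a hypothetical sketch of the bbox arithmetic only, not the library's internal _direction implementation; `region_above` is an illustrative name):

```python
def region_above(bbox, page_width, height=None, width="full"):
    """Compute the bbox of the area above a region, mirroring the height/width options."""
    x0, top, x1, bottom = bbox
    # Without an explicit height, extend all the way to the top of the page (y = 0)
    new_top = max(0.0, top - height) if height is not None else 0.0
    if width == "full":
        return (0.0, new_top, page_width, top)
    return (x0, new_top, x1, top)

# Everything above a header at top=100 on a US-Letter-width (612 pt) page
print(region_above((50, 100, 300, 400), page_width=612))
```

With `until` set, the real method instead stops at the matched boundary element; this sketch covers only the fixed-height and full-page cases.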
natural_pdf.Region.add_child(child)

Add a child region to this region.

Used for hierarchical document structure when using models like Docling that understand document hierarchy.

Parameters:

    child (required): Region object to add as a child.

Returns:

    Self for method chaining.

Source code in natural_pdf/elements/region.py
(lines 2851-2866)
def add_child(self, child):
    """
    Add a child region to this region.

    Used for hierarchical document structure when using models like Docling
    that understand document hierarchy.

    Args:
        child: Region object to add as a child

    Returns:
        Self for method chaining
    """
    self.child_regions.append(child)
    child.parent_region = self
    return self
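Because add_child() returns self, children can be attached in a chain and the resulting tree walked recursively. A minimal standalone mock of the same wiring (Node and labels are illustrative names, not part of the natural_pdf API):

```python
class Node:
    """Minimal stand-in with the same parent/child wiring as Region.add_child."""
    def __init__(self, label):
        self.label = label
        self.parent_region = None
        self.child_regions = []

    def add_child(self, child):
        self.child_regions.append(child)
        child.parent_region = self
        return self  # enables chaining: section.add_child(a).add_child(b)

def labels(node):
    """Depth-first labels of a region tree."""
    yield node.label
    for child in node.child_regions:
        yield from labels(child)

section = Node("section")
section.add_child(Node("title")).add_child(Node("table"))
print(list(labels(section)))  # → ['section', 'title', 'table']
```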
natural_pdf.Region.analyze_text_table_structure(snap_tolerance=10, join_tolerance=3, min_words_vertical=3, min_words_horizontal=1, intersection_tolerance=3, expand_bbox=None, **kwargs)

Analyzes the text elements within the region (or slightly expanded area) to find potential table structure (lines, cells) using text alignment logic adapted from pdfplumber.

Parameters:

    snap_tolerance (int, default 10): Tolerance for snapping parallel lines.
    join_tolerance (int, default 3): Tolerance for joining collinear lines.
    min_words_vertical (int, default 3): Minimum words needed to define a vertical line.
    min_words_horizontal (int, default 1): Minimum words needed to define a horizontal line.
    intersection_tolerance (int, default 3): Tolerance for detecting line intersections.
    expand_bbox (Optional[Dict[str, int]], default None): Optional dictionary to expand the search area slightly beyond the region's exact bounds (e.g., {'left': 5, 'right': 5}).
    **kwargs: Additional keyword arguments passed to find_text_based_tables (e.g., specific x/y tolerances).

Returns:

    Optional[Dict]: A dictionary containing 'horizontal_edges', 'vertical_edges', 'cells' (list of dicts), and 'intersections', or None if pdfplumber is unavailable or an error occurs.

Source code in natural_pdf/elements/region.py
(lines 3047-3123)
def analyze_text_table_structure(
    self,
    snap_tolerance: int = 10,
    join_tolerance: int = 3,
    min_words_vertical: int = 3,
    min_words_horizontal: int = 1,
    intersection_tolerance: int = 3,
    expand_bbox: Optional[Dict[str, int]] = None,
    **kwargs,
) -> Optional[Dict]:
    """
    Analyzes the text elements within the region (or slightly expanded area)
    to find potential table structure (lines, cells) using text alignment logic
    adapted from pdfplumber.

    Args:
        snap_tolerance: Tolerance for snapping parallel lines.
        join_tolerance: Tolerance for joining collinear lines.
        min_words_vertical: Minimum words needed to define a vertical line.
        min_words_horizontal: Minimum words needed to define a horizontal line.
        intersection_tolerance: Tolerance for detecting line intersections.
        expand_bbox: Optional dictionary to expand the search area slightly beyond
                     the region's exact bounds (e.g., {'left': 5, 'right': 5}).
        **kwargs: Additional keyword arguments passed to
                  find_text_based_tables (e.g., specific x/y tolerances).

    Returns:
        A dictionary containing 'horizontal_edges', 'vertical_edges', 'cells' (list of dicts),
        and 'intersections', or None if pdfplumber is unavailable or an error occurs.
    """

    # Determine the search region (expand if requested)
    search_region = self
    if expand_bbox and isinstance(expand_bbox, dict):
        try:
            search_region = self.expand(**expand_bbox)
            logger.debug(
                f"Expanded search region for text table analysis to: {search_region.bbox}"
            )
        except Exception as e:
            logger.warning(f"Could not expand region bbox: {e}. Using original region.")
            search_region = self

    # Find text elements within the search region
    text_elements = search_region.find_all(
        "text", apply_exclusions=False
    )  # Use unfiltered text
    if not text_elements:
        logger.info(f"Region {self.bbox}: No text elements found for text table analysis.")
        return {"horizontal_edges": [], "vertical_edges": [], "cells": [], "intersections": {}}

    # Extract bounding boxes
    bboxes = [element.bbox for element in text_elements if hasattr(element, "bbox")]
    if not bboxes:
        logger.info(f"Region {self.bbox}: No bboxes extracted from text elements.")
        return {"horizontal_edges": [], "vertical_edges": [], "cells": [], "intersections": {}}

    # Call the utility function
    try:
        analysis_results = find_text_based_tables(
            bboxes=bboxes,
            snap_tolerance=snap_tolerance,
            join_tolerance=join_tolerance,
            min_words_vertical=min_words_vertical,
            min_words_horizontal=min_words_horizontal,
            intersection_tolerance=intersection_tolerance,
            **kwargs,  # Pass through any extra specific tolerance args
        )
        # Store results in the region's analyses cache
        self.analyses["text_table_structure"] = analysis_results
        return analysis_results
    except ImportError:
        logger.error("pdfplumber library is required for 'text' table analysis but not found.")
        return None
    except Exception as e:
        logger.error(f"Error during text-based table analysis: {e}", exc_info=True)
        return None
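The min_words_vertical idea (proposing a column boundary only where enough words share a left edge) can be sketched in isolation. This is a simplified illustration, not pdfplumber's actual snapping and joining logic, and `vertical_line_candidates` is a hypothetical name:

```python
from collections import Counter

def vertical_line_candidates(bboxes, min_words_vertical=3, tol=1.0):
    """Bucket word left edges by tolerance; buckets with enough words suggest column lines."""
    counts = Counter(round(x0 / tol) * tol for (x0, top, x1, bottom) in bboxes)
    return sorted(x for x, n in counts.items() if n >= min_words_vertical)

words = [
    (50.2, 10, 90, 20), (49.8, 30, 95, 40), (50.0, 50, 88, 60),        # column 1
    (200.1, 10, 240, 20), (199.9, 30, 245, 40), (200.0, 50, 238, 60),  # column 2
    (120.0, 70, 160, 80),                                              # stray word, below threshold
]
print(vertical_line_candidates(words))  # → [50.0, 200.0]
```

The real implementation additionally snaps nearby lines together (snap_tolerance), joins collinear segments (join_tolerance), and intersects the resulting horizontal and vertical edges into cells.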
natural_pdf.Region.apply_custom_ocr(ocr_function, source_label='custom-ocr', replace=True, confidence=None, add_to_page=True)

Apply a custom OCR function to this region and create text elements from the results.

This is useful when you want to use a custom OCR method (e.g., an LLM API, specialized OCR service, or any custom logic) instead of the built-in OCR engines.

Parameters:

    ocr_function (Callable[[Region], Optional[str]], required): A callable that takes a Region and returns the OCR'd text (or None). The function receives this region as its argument and should return the extracted text as a string, or None if no text was found.
    source_label (str, default 'custom-ocr'): Label to identify the source of these text elements. This will be set as the 'source' attribute on created elements.
    replace (bool, default True): If True, removes existing OCR elements in this region before adding new ones. If False, adds new OCR elements alongside existing ones.
    confidence (Optional[float], default None): Optional confidence score for the OCR result (0.0-1.0). If None, defaults to 1.0 if text is returned, 0.0 if None is returned.
    add_to_page (bool, default True): If True, adds the created text element to the page. If False, creates the element but doesn't add it to the page.

Returns:

    Region: Self for method chaining.

Example:

    # Using with an LLM
    def ocr_with_llm(region):
        image = region.to_image(resolution=300, crop=True)
        # Call your LLM API here
        return llm_client.ocr(image)

    region.apply_custom_ocr(ocr_with_llm)

    # Using with a custom OCR service
    def ocr_with_service(region):
        img_bytes = region.to_image(crop=True).tobytes()
        response = ocr_service.process(img_bytes)
        return response.text

    region.apply_custom_ocr(ocr_with_service, source_label="my-ocr-service")

Source code in natural_pdf/elements/region.py
(lines 2323-2467)
def apply_custom_ocr(
    self,
    ocr_function: Callable[["Region"], Optional[str]],
    source_label: str = "custom-ocr",
    replace: bool = True,
    confidence: Optional[float] = None,
    add_to_page: bool = True,
) -> "Region":
    """
    Apply a custom OCR function to this region and create text elements from the results.

    This is useful when you want to use a custom OCR method (e.g., an LLM API,
    specialized OCR service, or any custom logic) instead of the built-in OCR engines.

    Args:
        ocr_function: A callable that takes a Region and returns the OCR'd text (or None).
                      The function receives this region as its argument and should return
                      the extracted text as a string, or None if no text was found.
        source_label: Label to identify the source of these text elements (default: "custom-ocr").
                      This will be set as the 'source' attribute on created elements.
        replace: If True (default), removes existing OCR elements in this region before
                 adding new ones. If False, adds new OCR elements alongside existing ones.
        confidence: Optional confidence score for the OCR result (0.0-1.0).
                    If None, defaults to 1.0 if text is returned, 0.0 if None is returned.
        add_to_page: If True (default), adds the created text element to the page.
                     If False, creates the element but doesn't add it to the page.

    Returns:
        Self for method chaining.

    Example:
        # Using with an LLM
        def ocr_with_llm(region):
            image = region.to_image(resolution=300, crop=True)
            # Call your LLM API here
            return llm_client.ocr(image)

        region.apply_custom_ocr(ocr_with_llm)

        # Using with a custom OCR service
        def ocr_with_service(region):
            img_bytes = region.to_image(crop=True).tobytes()
            response = ocr_service.process(img_bytes)
            return response.text

        region.apply_custom_ocr(ocr_with_service, source_label="my-ocr-service")
    """
    # If replace is True, remove existing OCR elements in this region
    if replace:
        logger.info(
            f"Region {self.bbox}: Removing existing OCR elements before applying custom OCR."
        )

        removed_count = 0

        # Helper to remove a single element safely
        def _safe_remove(elem):
            nonlocal removed_count
            success = False
            if hasattr(elem, "page") and hasattr(elem.page, "_element_mgr"):
                etype = getattr(elem, "object_type", "word")
                if etype == "word":
                    etype_key = "words"
                elif etype == "char":
                    etype_key = "chars"
                else:
                    etype_key = etype + "s" if not etype.endswith("s") else etype
                try:
                    success = elem.page._element_mgr.remove_element(elem, etype_key)
                except Exception:
                    success = False
            if success:
                removed_count += 1

        # Remove ALL OCR elements overlapping this region
        # Remove elements with source=="ocr" (built-in OCR) or matching the source_label (previous custom OCR)
        for word in list(self.page._element_mgr.words):
            word_source = getattr(word, "source", "")
            # Match built-in OCR behavior: remove elements with source "ocr" exactly
            # Also remove elements with the same source_label to avoid duplicates
            if (word_source == "ocr" or word_source == source_label) and self.intersects(word):
                _safe_remove(word)

        # Also remove char dicts if needed (matching built-in OCR)
        for char in list(self.page._element_mgr.chars):
            # char can be dict or TextElement; normalize
            char_src = (
                char.get("source") if isinstance(char, dict) else getattr(char, "source", None)
            )
            if char_src == "ocr" or char_src == source_label:
                # Rough bbox for dicts
                if isinstance(char, dict):
                    cx0, ctop, cx1, cbottom = (
                        char.get("x0", 0),
                        char.get("top", 0),
                        char.get("x1", 0),
                        char.get("bottom", 0),
                    )
                else:
                    cx0, ctop, cx1, cbottom = char.x0, char.top, char.x1, char.bottom
                # Quick overlap check
                if not (
                    cx1 < self.x0 or cx0 > self.x1 or cbottom < self.top or ctop > self.bottom
                ):
                    _safe_remove(char)

        if removed_count > 0:
            logger.info(f"Region {self.bbox}: Removed {removed_count} existing OCR elements.")

    # Call the custom OCR function
    try:
        logger.debug(f"Region {self.bbox}: Calling custom OCR function...")
        ocr_text = ocr_function(self)

        if ocr_text is not None and not isinstance(ocr_text, str):
            logger.warning(
                f"Custom OCR function returned non-string type ({type(ocr_text)}). "
                f"Converting to string."
            )
            ocr_text = str(ocr_text)

    except Exception as e:
        logger.error(
            f"Error calling custom OCR function for region {self.bbox}: {e}", exc_info=True
        )
        return self

    # Create text element if we got text
    if ocr_text is not None:
        # Use the to_text_element method to create the element
        text_element = self.to_text_element(
            text_content=ocr_text,
            source_label=source_label,
            confidence=confidence,
            add_to_page=add_to_page,
        )

        logger.info(
            f"Region {self.bbox}: Created text element with {len(ocr_text)} chars"
            f"{' and added to page' if add_to_page else ''}"
        )
    else:
        logger.debug(f"Region {self.bbox}: Custom OCR function returned None (no text found)")

    return self
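The "quick overlap check" used when clearing old OCR characters is the standard axis-aligned rectangle intersection test. Stated on its own (a sketch; `rects_overlap` is an illustrative name, not a library function):

```python
def rects_overlap(a, b):
    """True if two (x0, top, x1, bottom) rectangles intersect (top-left origin)."""
    ax0, atop, ax1, abottom = a
    bx0, btop, bx1, bbottom = b
    # Disjoint exactly when one box lies entirely left, right, above, or below the other
    return not (ax1 < bx0 or ax0 > bx1 or abottom < btop or atop > bbottom)

region = (100, 100, 300, 200)
print(rects_overlap(region, (250, 150, 400, 250)))  # overlapping corner → True
print(rects_overlap(region, (400, 100, 500, 200)))  # entirely to the right → False
```

Note that boxes sharing only an edge count as overlapping here, since the comparisons are strict; that matches the inclusive check in the listing above.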
natural_pdf.Region.apply_ocr(replace=True, **ocr_params)

Apply OCR to this region and return the created text elements.

This method supports two modes:

1. Built-in OCR Engines (default): identical to previous behaviour. Pass typical parameters like engine='easyocr' or languages=['en'] and the method will route the request through OCRManager.
2. Custom OCR Function: pass a callable under the keyword function (or ocr_function). The callable will receive this Region instance and should return the extracted text (str) or None. Internally the call is delegated to apply_custom_ocr, so the same logic (replacement, element creation, etc.) is re-used.

Examples
def llm_ocr(region):
    image = region.to_image(resolution=300, crop=True)
    return my_llm_client.ocr(image)
region.apply_ocr(function=llm_ocr)

Parameters:

    replace (default True): Whether to remove existing OCR elements first.
    **ocr_params: Parameters for the built-in OCR manager, or the special function/ocr_function keyword to trigger custom mode.

Returns:

    Self, for chaining.
Source code in natural_pdf/elements/region.py
(lines 2104-2321)
def apply_ocr(self, replace=True, **ocr_params) -> "Region":
    """
    Apply OCR to this region and return the created text elements.

    This method supports two modes:
    1. **Built-in OCR Engines** (default) – identical to previous behaviour. Pass typical
       parameters like ``engine='easyocr'`` or ``languages=['en']`` and the method will
       route the request through :class:`OCRManager`.
    2. **Custom OCR Function** – pass a *callable* under the keyword ``function`` (or
       ``ocr_function``). The callable will receive *this* Region instance and should
       return the extracted text (``str``) or ``None``.  Internally the call is
       delegated to :pymeth:`apply_custom_ocr` so the same logic (replacement, element
       creation, etc.) is re-used.

    Examples
    ---------
    ```python
    def llm_ocr(region):
        image = region.to_image(resolution=300, crop=True)
        return my_llm_client.ocr(image)
    region.apply_ocr(function=llm_ocr)
    ```

    Args:
        replace: Whether to remove existing OCR elements first (default ``True``).
        **ocr_params: Parameters for the built-in OCR manager *or* the special
                      ``function``/``ocr_function`` keyword to trigger custom mode.

    Returns
    -------
        Self – for chaining.
    """
    # --- Custom OCR function path --------------------------------------------------
    custom_func = ocr_params.pop("function", None) or ocr_params.pop("ocr_function", None)
    if callable(custom_func):
        # Delegate to the specialised helper while preserving key kwargs
        return self.apply_custom_ocr(
            ocr_function=custom_func,
            source_label=ocr_params.pop("source_label", "custom-ocr"),
            replace=replace,
            confidence=ocr_params.pop("confidence", None),
            add_to_page=ocr_params.pop("add_to_page", True),
        )

    # --- Original built-in OCR engine path (unchanged except docstring) ------------
    # Ensure OCRManager is available
    if not hasattr(self.page._parent, "_ocr_manager") or self.page._parent._ocr_manager is None:
        logger.error("OCRManager not available on parent PDF. Cannot apply OCR to region.")
        return self

    # If replace is True, find and remove existing OCR elements in this region
    if replace:
        logger.info(
            f"Region {self.bbox}: Removing existing OCR elements before applying new OCR."
        )

        # --- Robust removal: iterate through all OCR elements on the page and
        #     remove those that overlap this region. This avoids reliance on
        #     identity‐based look-ups that can break if the ElementManager
        #     rebuilt its internal lists.

        removed_count = 0

        # Helper to remove a single element safely
        def _safe_remove(elem):
            nonlocal removed_count
            success = False
            if hasattr(elem, "page") and hasattr(elem.page, "_element_mgr"):
                etype = getattr(elem, "object_type", "word")
                if etype == "word":
                    etype_key = "words"
                elif etype == "char":
                    etype_key = "chars"
                else:
                    etype_key = etype + "s" if not etype.endswith("s") else etype
                try:
                    success = elem.page._element_mgr.remove_element(elem, etype_key)
                except Exception:
                    success = False
            if success:
                removed_count += 1

        # Remove OCR WORD elements overlapping region
        for word in list(self.page._element_mgr.words):
            if getattr(word, "source", None) == "ocr" and self.intersects(word):
                _safe_remove(word)

        # Remove OCR CHAR dicts overlapping region
        for char in list(self.page._element_mgr.chars):
            # char can be dict or TextElement; normalise
            char_src = (
                char.get("source") if isinstance(char, dict) else getattr(char, "source", None)
            )
            if char_src == "ocr":
                # Rough bbox for dicts
                if isinstance(char, dict):
                    cx0, ctop, cx1, cbottom = (
                        char.get("x0", 0),
                        char.get("top", 0),
                        char.get("x1", 0),
                        char.get("bottom", 0),
                    )
                else:
                    cx0, ctop, cx1, cbottom = char.x0, char.top, char.x1, char.bottom
                # Quick overlap check
                if not (
                    cx1 < self.x0 or cx0 > self.x1 or cbottom < self.top or ctop > self.bottom
                ):
                    _safe_remove(char)

        logger.info(
            f"Region {self.bbox}: Removed {removed_count} existing OCR elements (words & chars) before re-applying OCR."
        )

    ocr_mgr = self.page._parent._ocr_manager

    # Determine rendering resolution from parameters
    final_resolution = ocr_params.get("resolution")
    if final_resolution is None and hasattr(self.page, "_parent") and self.page._parent:
        final_resolution = getattr(self.page._parent, "_config", {}).get("resolution", 150)
    elif final_resolution is None:
        final_resolution = 150
    logger.debug(
        f"Region {self.bbox}: Applying OCR with resolution {final_resolution} DPI and params: {ocr_params}"
    )

    # Render the page region to an image using the determined resolution
    try:
        region_image = self.to_image(
            resolution=final_resolution, include_highlights=False, crop=True
        )
        if not region_image:
            logger.error("Failed to render region to image for OCR.")
            return self
        logger.debug(f"Region rendered to image size: {region_image.size}")
    except Exception as e:
        logger.error(f"Error rendering region to image for OCR: {e}", exc_info=True)
        return self

    # Prepare args for the OCR Manager
    manager_args = {
        "images": region_image,
        "engine": ocr_params.get("engine"),
        "languages": ocr_params.get("languages"),
        "min_confidence": ocr_params.get("min_confidence"),
        "device": ocr_params.get("device"),
        "options": ocr_params.get("options"),
        "detect_only": ocr_params.get("detect_only"),
    }
    manager_args = {k: v for k, v in manager_args.items() if v is not None}

    # Run OCR on this region's image using the manager
    results = ocr_mgr.apply_ocr(**manager_args)
    if not isinstance(results, list):
        logger.error(
            f"OCRManager returned unexpected type for single region image: {type(results)}"
        )
        return self
    logger.debug(f"Region OCR processing returned {len(results)} results.")

    # Convert results to TextElements
    scale_x = self.width / region_image.width if region_image.width > 0 else 1.0
    scale_y = self.height / region_image.height if region_image.height > 0 else 1.0
    logger.debug(f"Region OCR scaling factors (PDF/Img): x={scale_x:.2f}, y={scale_y:.2f}")
    created_elements = []
    for result in results:
        try:
            img_x0, img_top, img_x1, img_bottom = map(float, result["bbox"])
            pdf_height = (img_bottom - img_top) * scale_y
            page_x0 = self.x0 + (img_x0 * scale_x)
            page_top = self.top + (img_top * scale_y)
            page_x1 = self.x0 + (img_x1 * scale_x)
            page_bottom = self.top + (img_bottom * scale_y)
            raw_conf = result.get("confidence")
            # Convert confidence to float unless it is None/invalid
            try:
                confidence_val = float(raw_conf) if raw_conf is not None else None
            except (TypeError, ValueError):
                confidence_val = None

            text_val = result.get("text")  # May legitimately be None in detect_only mode

            element_data = {
                "text": text_val,
                "x0": page_x0,
                "top": page_top,
                "x1": page_x1,
                "bottom": page_bottom,
                "width": page_x1 - page_x0,
                "height": page_bottom - page_top,
                "object_type": "word",
                "source": "ocr",
                "confidence": confidence_val,
                "fontname": "OCR",
                "size": round(pdf_height) if pdf_height > 0 else 10.0,
                "page_number": self.page.number,
                "bold": False,
                "italic": False,
                "upright": True,
                "doctop": page_top + self.page._page.initial_doctop,
            }
            ocr_char_dict = element_data.copy()
            ocr_char_dict["object_type"] = "char"
            ocr_char_dict.setdefault("adv", ocr_char_dict.get("width", 0))
            element_data["_char_dicts"] = [ocr_char_dict]
            from natural_pdf.elements.text import TextElement

            elem = TextElement(element_data, self.page)
            created_elements.append(elem)
            self.page._element_mgr.add_element(elem, element_type="words")
            self.page._element_mgr.add_element(ocr_char_dict, element_type="chars")
        except Exception as e:
            logger.error(
                f"Failed to convert region OCR result to element: {result}. Error: {e}",
                exc_info=True,
            )
    logger.info(f"Region {self.bbox}: Added {len(created_elements)} elements from OCR.")
    return self
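The scaling above maps OCR bounding boxes from image pixels back into PDF points within the region. A standalone sketch of that conversion (the function name and sample numbers are illustrative, not part of the library API):

```python
def ocr_bbox_to_pdf(img_bbox, region_origin, region_size, image_size):
    """Map an OCR bbox in image pixels to PDF points inside a region."""
    img_x0, img_top, img_x1, img_bottom = img_bbox
    x0, top = region_origin          # region's (x0, top) in PDF points
    width, height = region_size      # region size in PDF points
    img_w, img_h = image_size        # rendered image size in pixels
    scale_x = width / img_w if img_w > 0 else 1.0
    scale_y = height / img_h if img_h > 0 else 1.0
    return (
        x0 + img_x0 * scale_x,
        top + img_top * scale_y,
        x0 + img_x1 * scale_x,
        top + img_bottom * scale_y,
    )

# A 100x50 px box at the image origin of a 200x100 pt region rendered at 400x200 px
print(ocr_bbox_to_pdf((0, 0, 100, 50), (72, 72), (200, 100), (400, 200)))
# (72.0, 72.0, 122.0, 97.0)
```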
natural_pdf.Region.ask(question, min_confidence=0.1, model=None, debug=False, **kwargs)

Ask a question about the region content using document QA.

This method uses a document question answering model to extract answers from the region content. It leverages both textual content and layout information for better understanding.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `question` | `Union[str, List[str], Tuple[str, ...]]` | The question to ask about the region content | *required* |
| `min_confidence` | `float` | Minimum confidence threshold for answers (0.0-1.0) | `0.1` |
| `model` | `str` | Optional model name to use for QA (if None, uses default model) | `None` |
| `**kwargs` | | Additional parameters to pass to the QA engine | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `Union[Dict[str, Any], List[Dict[str, Any]]]` | Dictionary with answer details: `{"answer": extracted text, "confidence": confidence score, "found": whether an answer was found, "page_num": page number, "region": reference to this region, "source_elements": list of elements that contain the answer (if found)}` |

Source code in natural_pdf/elements/region.py
def ask(
    self,
    question: Union[str, List[str], Tuple[str, ...]],
    min_confidence: float = 0.1,
    model: str = None,
    debug: bool = False,
    **kwargs,
) -> Union[Dict[str, Any], List[Dict[str, Any]]]:
    """
    Ask a question about the region content using document QA.

    This method uses a document question answering model to extract answers from the region content.
    It leverages both textual content and layout information for better understanding.

    Args:
        question: The question to ask about the region content
        min_confidence: Minimum confidence threshold for answers (0.0-1.0)
        model: Optional model name to use for QA (if None, uses default model)
        **kwargs: Additional parameters to pass to the QA engine

    Returns:
        Dictionary with answer details: {
            "answer": extracted text,
            "confidence": confidence score,
            "found": whether an answer was found,
            "page_num": page number,
            "region": reference to this region,
            "source_elements": list of elements that contain the answer (if found)
        }
    """
    try:
        from natural_pdf.qa.document_qa import get_qa_engine
    except ImportError:
        logger.error(
            "Question answering requires optional dependencies. Install with `pip install natural-pdf[ai]`"
        )
        return {
            "answer": None,
            "confidence": 0.0,
            "found": False,
            "page_num": self.page.number,
            "source_elements": [],
            "region": self,
        }

    # Get or initialize QA engine with specified model
    try:
        qa_engine = get_qa_engine(model_name=model) if model else get_qa_engine()
    except Exception as e:
        logger.error(f"Failed to initialize QA engine (model: {model}): {e}", exc_info=True)
        return {
            "answer": None,
            "confidence": 0.0,
            "found": False,
            "page_num": self.page.number,
            "source_elements": [],
            "region": self,
        }

    # Ask the question using the QA engine
    try:
        return qa_engine.ask_pdf_region(
            self, question, min_confidence=min_confidence, debug=debug, **kwargs
        )
    except Exception as e:
        logger.error(f"Error during qa_engine.ask_pdf_region: {e}", exc_info=True)
        return {
            "answer": None,
            "confidence": 0.0,
            "found": False,
            "page_num": self.page.number,
            "source_elements": [],
            "region": self,
        }
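Since `ask()` returns a dict (or a list of dicts) with `answer`, `confidence`, and `found` keys, downstream code typically filters on those fields. A minimal standalone sketch (the helper name is illustrative, not part of the library):

```python
def confident_answers(results, min_confidence=0.1):
    """Keep only answers that were found at or above the confidence threshold."""
    if isinstance(results, dict):  # ask() may return a single dict for one question
        results = [results]
    return [
        r for r in results
        if r.get("found") and r.get("confidence", 0.0) >= min_confidence
    ]

hits = confident_answers(
    [
        {"answer": "42", "confidence": 0.9, "found": True},
        {"answer": None, "confidence": 0.0, "found": False},
    ],
    min_confidence=0.5,
)
print([h["answer"] for h in hits])  # ['42']
```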
natural_pdf.Region.below(height=None, width='full', include_source=False, until=None, include_endpoint=True, **kwargs)

Select region below this region.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `height` | `Optional[float]` | Height of the region below, in points | `None` |
| `width` | `str` | Width mode - "full" for full page width or "element" for element width | `'full'` |
| `include_source` | `bool` | Whether to include this region in the result | `False` |
| `until` | `Optional[str]` | Optional selector string to specify a lower boundary element | `None` |
| `include_endpoint` | `bool` | Whether to include the boundary element in the region | `True` |
| `**kwargs` | | Additional parameters | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `Region` | Region object representing the area below |

Source code in natural_pdf/elements/region.py
def below(
    self,
    height: Optional[float] = None,
    width: str = "full",
    include_source: bool = False,
    until: Optional[str] = None,
    include_endpoint: bool = True,
    **kwargs,
) -> "Region":
    """
    Select region below this region.

    Args:
        height: Height of the region below, in points
        width: Width mode - "full" for full page width or "element" for element width
        include_source: Whether to include this region in the result (default: False)
        until: Optional selector string to specify a lower boundary element
        include_endpoint: Whether to include the boundary element in the region (default: True)
        **kwargs: Additional parameters

    Returns:
        Region object representing the area below
    """
    return self._direction(
        direction="below",
        size=height,
        cross_size=width,
        include_source=include_source,
        until=until,
        include_endpoint=include_endpoint,
        **kwargs,
    )
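Conceptually, `below()` builds a new bounding box that starts at this region's bottom edge and extends downward, either to a fixed height or to the bottom of the page. A standalone sketch of that geometry (names and page dimensions are illustrative):

```python
def bbox_below(bbox, page_width, page_height, height=None, width="full"):
    """Compute the bbox of the area below a region, in PDF points."""
    x0, top, x1, bottom = bbox
    new_top = bottom
    # Without an explicit height, extend to the bottom of the page
    new_bottom = page_height if height is None else min(page_height, bottom + height)
    if width == "full":
        return (0.0, new_top, page_width, new_bottom)
    return (x0, new_top, x1, new_bottom)  # "element": keep the source width

# 50 pt strip below a region on a US-letter page (612 x 792 pt)
print(bbox_below((100, 100, 200, 150), 612, 792, height=50))
# (0.0, 150, 612, 200)
```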
natural_pdf.Region.clip(obj=None, left=None, top=None, right=None, bottom=None)

Clip this region to specific bounds, either from another object with bbox or explicit coordinates.

The clipped region will be constrained to not exceed the specified boundaries. You can provide either an object with bounding box properties, specific coordinates, or both. When both are provided, explicit coordinates take precedence.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `obj` | `Optional[Any]` | Optional object with bbox properties (Region, Element, TextElement, etc.) | `None` |
| `left` | `Optional[float]` | Optional left boundary (x0) to clip to | `None` |
| `top` | `Optional[float]` | Optional top boundary to clip to | `None` |
| `right` | `Optional[float]` | Optional right boundary (x1) to clip to | `None` |
| `bottom` | `Optional[float]` | Optional bottom boundary to clip to | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Region` | New Region with bounds clipped to the specified constraints |

Examples:

    # Clip to another region's bounds
    clipped = region.clip(container_region)

    # Clip to any element's bounds
    clipped = region.clip(text_element)

    # Clip to specific coordinates
    clipped = region.clip(left=100, right=400)

    # Mix object bounds with specific overrides
    clipped = region.clip(obj=container, bottom=page.height/2)

Source code in natural_pdf/elements/region.py
def clip(
    self,
    obj: Optional[Any] = None,
    left: Optional[float] = None,
    top: Optional[float] = None,
    right: Optional[float] = None,
    bottom: Optional[float] = None,
) -> "Region":
    """
    Clip this region to specific bounds, either from another object with bbox or explicit coordinates.

    The clipped region will be constrained to not exceed the specified boundaries.
    You can provide either an object with bounding box properties, specific coordinates, or both.
    When both are provided, explicit coordinates take precedence.

    Args:
        obj: Optional object with bbox properties (Region, Element, TextElement, etc.)
        left: Optional left boundary (x0) to clip to
        top: Optional top boundary to clip to
        right: Optional right boundary (x1) to clip to
        bottom: Optional bottom boundary to clip to

    Returns:
        New Region with bounds clipped to the specified constraints

    Examples:
        # Clip to another region's bounds
        clipped = region.clip(container_region)

        # Clip to any element's bounds
        clipped = region.clip(text_element)

        # Clip to specific coordinates
        clipped = region.clip(left=100, right=400)

        # Mix object bounds with specific overrides
        clipped = region.clip(obj=container, bottom=page.height/2)
    """
    from natural_pdf.elements.base import extract_bbox

    # Start with current region bounds
    clip_x0 = self.x0
    clip_top = self.top
    clip_x1 = self.x1
    clip_bottom = self.bottom

    # Apply object constraints if provided
    if obj is not None:
        obj_bbox = extract_bbox(obj)
        if obj_bbox is not None:
            obj_x0, obj_top, obj_x1, obj_bottom = obj_bbox
            # Constrain to the intersection with the provided object
            clip_x0 = max(clip_x0, obj_x0)
            clip_top = max(clip_top, obj_top)
            clip_x1 = min(clip_x1, obj_x1)
            clip_bottom = min(clip_bottom, obj_bottom)
        else:
            logger.warning(
                f"Region {self.bbox}: Cannot extract bbox from clipping object {type(obj)}. "
                "Object must have bbox property or x0/top/x1/bottom attributes."
            )

    # Apply explicit coordinate constraints (these take precedence)
    if left is not None:
        clip_x0 = max(clip_x0, left)
    if top is not None:
        clip_top = max(clip_top, top)
    if right is not None:
        clip_x1 = min(clip_x1, right)
    if bottom is not None:
        clip_bottom = min(clip_bottom, bottom)

    # Ensure valid coordinates
    if clip_x1 <= clip_x0 or clip_bottom <= clip_top:
        logger.warning(
            f"Region {self.bbox}: Clipping resulted in invalid dimensions "
            f"({clip_x0}, {clip_top}, {clip_x1}, {clip_bottom}). Returning minimal region."
        )
        # Return a minimal region at the clip area's top-left
        return Region(self.page, (clip_x0, clip_top, clip_x0, clip_top))

    # Create the clipped region
    clipped_region = Region(self.page, (clip_x0, clip_top, clip_x1, clip_bottom))

    # Copy relevant metadata
    clipped_region.region_type = self.region_type
    clipped_region.normalized_type = self.normalized_type
    clipped_region.confidence = self.confidence
    clipped_region.model = self.model
    clipped_region.name = self.name
    clipped_region.label = self.label
    clipped_region.source = "clipped"  # Indicate this is a derived region
    clipped_region.parent_region = self

    logger.debug(
        f"Region {self.bbox}: Clipped to {clipped_region.bbox} "
        f"(constraints: obj={type(obj).__name__ if obj else None}, "
        f"left={left}, top={top}, right={right}, bottom={bottom})"
    )
    return clipped_region
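The clipping logic is a bbox intersection, with explicit coordinates applied after (and taking precedence over) the object's bounds. A standalone sketch of just that arithmetic (the function name is illustrative):

```python
def clip_bbox(bbox, obj_bbox=None, left=None, top=None, right=None, bottom=None):
    """Intersect bbox with an optional object bbox, then apply explicit overrides."""
    x0, t, x1, b = bbox
    if obj_bbox is not None:
        ox0, ot, ox1, ob = obj_bbox
        x0, t, x1, b = max(x0, ox0), max(t, ot), min(x1, ox1), min(b, ob)
    # Explicit coordinates tighten the result further
    if left is not None:
        x0 = max(x0, left)
    if top is not None:
        t = max(t, top)
    if right is not None:
        x1 = min(x1, right)
    if bottom is not None:
        b = min(b, bottom)
    if x1 <= x0 or b <= t:
        return (x0, t, x0, t)  # degenerate: clipping removed everything
    return (x0, t, x1, b)

print(clip_bbox((0, 0, 500, 500), obj_bbox=(100, 100, 400, 600), bottom=300))
# (100, 100, 400, 300)
```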
natural_pdf.Region.contains(element)

Check if this region completely contains an element.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `element` | `Element` | Element to check | *required* |

Returns:

| Type | Description |
| --- | --- |
| `bool` | True if the element is completely contained within the region, False otherwise |

Source code in natural_pdf/elements/region.py
def contains(self, element: "Element") -> bool:
    """
    Check if this region completely contains an element.

    Args:
        element: Element to check

    Returns:
        True if the element is completely contained within the region, False otherwise
    """
    # Check if element is on the same page
    if not hasattr(element, "page") or element.page != self._page:
        return False

    # Ensure element has necessary attributes
    if not all(hasattr(element, attr) for attr in ["x0", "x1", "top", "bottom"]):
        return False  # Cannot determine position

    # For rectangular regions, check if element's bbox is fully inside region's bbox
    if not self.has_polygon:
        return (
            self.x0 <= element.x0
            and element.x1 <= self.x1
            and self.top <= element.top
            and element.bottom <= self.bottom
        )

    # For polygon regions, check if all corners of the element are inside the polygon
    element_corners = [
        (element.x0, element.top),  # top-left
        (element.x1, element.top),  # top-right
        (element.x1, element.bottom),  # bottom-right
        (element.x0, element.bottom),  # bottom-left
    ]

    return all(self.is_point_inside(x, y) for x, y in element_corners)
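For rectangular regions, the containment test reduces to comparing the two bounding boxes. A standalone sketch of that check (the function name is illustrative):

```python
def bbox_contains(outer, inner):
    """True if the inner bbox (x0, top, x1, bottom) lies fully inside the outer bbox."""
    ox0, ot, ox1, ob = outer
    ix0, it, ix1, ib = inner
    return ox0 <= ix0 and ix1 <= ox1 and ot <= it and ib <= ob

print(bbox_contains((0, 0, 100, 100), (10, 10, 90, 90)))   # True
print(bbox_contains((0, 0, 100, 100), (10, 10, 110, 90)))  # False: spills past x1
```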
natural_pdf.Region.correct_ocr(correction_callback)

Applies corrections to OCR-generated text elements within this region using a user-provided callback function.

Finds text elements within this region whose 'source' attribute starts with 'ocr' and calls the correction_callback for each, passing the element itself.

The `correction_callback` should contain the logic to:

1. Determine if the element needs correction.
2. Perform the correction (e.g., call an LLM).
3. Return the new text (`str`) or `None`.

If the callback returns a string, the element's .text is updated. Metadata updates (source, confidence, etc.) should happen within the callback.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `correction_callback` | `Callable[[Any], Optional[str]]` | A function accepting an element and returning `Optional[str]` (new text or None). | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Region` | Self for method chaining. |

Source code in natural_pdf/elements/region.py
def correct_ocr(
    self,
    correction_callback: Callable[[Any], Optional[str]],
) -> "Region":  # Return self for chaining
    """
    Applies corrections to OCR-generated text elements within this region
    using a user-provided callback function.

    Finds text elements within this region whose 'source' attribute starts
    with 'ocr' and calls the `correction_callback` for each, passing the
    element itself.

    The `correction_callback` should contain the logic to:
    1. Determine if the element needs correction.
    2. Perform the correction (e.g., call an LLM).
    3. Return the new text (`str`) or `None`.

    If the callback returns a string, the element's `.text` is updated.
    Metadata updates (source, confidence, etc.) should happen within the callback.

    Args:
        correction_callback: A function accepting an element and returning
                             `Optional[str]` (new text or None).

    Returns:
        Self for method chaining.
    """
    # Find OCR elements specifically within this region
    # Note: We typically want to correct even if the element falls in an excluded area
    target_elements = self.find_all(selector="text[source=ocr]", apply_exclusions=False)

    # Delegate to the utility function
    _apply_ocr_correction_to_elements(
        elements=target_elements,  # Pass the ElementCollection directly
        correction_callback=correction_callback,
        caller_info=f"Region({self.bbox})",  # Pass caller info
    )

    return self  # Return self for chaining
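The callback contract above — return a replacement string to apply it, or `None` to leave the element alone — can be sketched standalone (dict-based elements and helper names here are illustrative, not the library's internal types):

```python
def apply_corrections(elements, correction_callback):
    """Update element text in place when the callback returns a replacement string."""
    corrected = 0
    for el in elements:
        new_text = correction_callback(el)
        if new_text is not None:  # None means "no change needed"
            el["text"] = new_text
            corrected += 1
    return corrected

elems = [{"text": "he1lo", "source": "ocr"}, {"text": "world", "source": "ocr"}]
fixed = apply_corrections(elems, lambda el: "hello" if el["text"] == "he1lo" else None)
print(fixed, elems[0]["text"])  # 1 hello
```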
natural_pdf.Region.create_cells()

Create cell regions for a detected table by intersecting its row and column regions, and add them to the page.

Assumes child row and column regions are already present on the page.

Returns:

Self for method chaining.

Source code in natural_pdf/elements/region.py
def create_cells(self):
    """
    Create cell regions for a detected table by intersecting its
    row and column regions, and add them to the page.

    Assumes child row and column regions are already present on the page.

    Returns:
        Self for method chaining.
    """
    # Ensure this is called on a table region
    if self.region_type not in (
        "table",
        "tableofcontents",
    ):  # Allow for ToC which might have structure
        raise ValueError(
            f"create_cells should be called on a 'table' or 'tableofcontents' region, not '{self.region_type}'"
        )

    # Find rows and columns associated with this page
    # Remove the model-specific filter
    rows = self.page.find_all("region[type=table-row]")
    columns = self.page.find_all("region[type=table-column]")

    # Filter to only include those that overlap with this table region
    def is_in_table(element):
        # Use a simple overlap check (more robust than just center point)
        # Check if element's bbox overlaps with self.bbox
        return (
            hasattr(element, "bbox")
            and element.x0 < self.x1  # Ensure element has bbox
            and element.x1 > self.x0
            and element.top < self.bottom
            and element.bottom > self.top
        )

    table_rows = [r for r in rows if is_in_table(r)]
    table_columns = [c for c in columns if is_in_table(c)]

    if not table_rows or not table_columns:
        # Use page's logger if available
        logger_instance = getattr(self._page, "logger", logger)
        logger_instance.warning(
            f"Region {self.bbox}: Cannot create cells. No overlapping row or column regions found."
        )
        return self  # Return self even if no cells created

    # Sort rows and columns
    table_rows.sort(key=lambda r: r.top)
    table_columns.sort(key=lambda c: c.x0)

    # Create cells and add them to the page's element manager
    created_count = 0
    for row in table_rows:
        for column in table_columns:
            # Calculate intersection bbox for the cell
            cell_x0 = max(row.x0, column.x0)
            cell_y0 = max(row.top, column.top)
            cell_x1 = min(row.x1, column.x1)
            cell_y1 = min(row.bottom, column.bottom)

            # Only create a cell if the intersection is valid (positive width/height)
            if cell_x1 > cell_x0 and cell_y1 > cell_y0:
                # Create cell region at the intersection
                cell = self.page.create_region(cell_x0, cell_y0, cell_x1, cell_y1)
                # Set metadata
                cell.source = "derived"
                cell.region_type = "table-cell"  # Explicitly set type
                cell.normalized_type = "table-cell"  # And normalized type
                # Inherit model from the parent table region
                cell.model = self.model
                cell.parent_region = self  # Link cell to parent table region

                # Add the cell region to the page's element manager
                self.page._element_mgr.add_region(cell)
                created_count += 1

    # Optional: Add created cells to the table region's children
    # self.child_regions.extend(cells_created_in_this_call) # Needs list management

    logger_instance = getattr(self._page, "logger", logger)
    logger_instance.info(
        f"Region {self.bbox} (Model: {self.model}): Created and added {created_count} cell regions."
    )

    return self  # Return self for chaining
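The core of the cell construction is the row x column bbox intersection, keeping only intersections with positive area. A standalone sketch of that step (the function name is illustrative):

```python
def make_cells(rows, columns):
    """Intersect row and column bboxes, keeping only cells with positive area."""
    cells = []
    for rx0, rtop, rx1, rbottom in rows:
        for cx0, ctop, cx1, cbottom in columns:
            x0, y0 = max(rx0, cx0), max(rtop, ctop)
            x1, y1 = min(rx1, cx1), min(rbottom, cbottom)
            if x1 > x0 and y1 > y0:  # skip degenerate intersections
                cells.append((x0, y0, x1, y1))
    return cells

rows = [(0, 0, 100, 10), (0, 10, 100, 20)]
cols = [(0, 0, 50, 20), (50, 0, 100, 20)]
print(make_cells(rows, cols))
# [(0, 0, 50, 10), (50, 0, 100, 10), (0, 10, 50, 20), (50, 10, 100, 20)]
```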
natural_pdf.Region.extract_table(method=None, table_settings=None, use_ocr=False, ocr_config=None, text_options=None, cell_extraction_func=None, show_progress=False)

Extract a table from this region.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `method` | `Optional[str]` | Method to use: 'tatr', 'pdfplumber', 'text', 'stream', 'lattice', or None (auto-detect). 'stream' is an alias for 'pdfplumber' with text-based strategies (equivalent to setting `vertical_strategy` and `horizontal_strategy` to 'text'). 'lattice' is an alias for 'pdfplumber' with line-based strategies (equivalent to setting `vertical_strategy` and `horizontal_strategy` to 'lines'). | `None` |
| `table_settings` | `Optional[dict]` | Settings for pdfplumber table extraction (used with 'pdfplumber', 'stream', or 'lattice' methods). | `None` |
| `use_ocr` | `bool` | Whether to use OCR for text extraction (currently only applicable with 'tatr' method). | `False` |
| `ocr_config` | `Optional[dict]` | OCR configuration parameters. | `None` |
| `text_options` | `Optional[Dict]` | Dictionary of options for the 'text' method, corresponding to arguments of analyze_text_table_structure (e.g., snap_tolerance, expand_bbox). | `None` |
| `cell_extraction_func` | `Optional[Callable[[Region], Optional[str]]]` | Optional callable function that takes a cell Region object and returns its string content. Overrides default text extraction for the 'text' method. | `None` |
| `show_progress` | `bool` | If True, display a progress bar during cell text extraction for the 'text' method. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `TableResult` | Table data as a list of rows, where each row is a list of cell values (str or None). |

Source code in natural_pdf/elements/region.py
def extract_table(
    self,
    method: Optional[str] = None,  # Make method optional
    table_settings: Optional[dict] = None,  # Use Optional
    use_ocr: bool = False,
    ocr_config: Optional[dict] = None,  # Use Optional
    text_options: Optional[Dict] = None,
    cell_extraction_func: Optional[Callable[["Region"], Optional[str]]] = None,
    # --- NEW: Add tqdm control option --- #
    show_progress: bool = False,  # Controls progress bar for text method
) -> TableResult:  # Return type allows Optional[str] for cells
    """
    Extract a table from this region.

    Args:
        method: Method to use: 'tatr', 'pdfplumber', 'text', 'stream', 'lattice', or None (auto-detect).
                'stream' is an alias for 'pdfplumber' with text-based strategies (equivalent to
                setting `vertical_strategy` and `horizontal_strategy` to 'text').
                'lattice' is an alias for 'pdfplumber' with line-based strategies (equivalent to
                setting `vertical_strategy` and `horizontal_strategy` to 'lines').
        table_settings: Settings for pdfplumber table extraction (used with 'pdfplumber', 'stream', or 'lattice' methods).
        use_ocr: Whether to use OCR for text extraction (currently only applicable with 'tatr' method).
        ocr_config: OCR configuration parameters.
        text_options: Dictionary of options for the 'text' method, corresponding to arguments
                      of analyze_text_table_structure (e.g., snap_tolerance, expand_bbox).
        cell_extraction_func: Optional callable function that takes a cell Region object
                              and returns its string content. Overrides default text extraction
                              for the 'text' method.
        show_progress: If True, display a progress bar during cell text extraction for the 'text' method.

    Returns:
        Table data as a list of rows, where each row is a list of cell values (str or None).
    """
    # Default settings if none provided
    if table_settings is None:
        table_settings = {}
    if text_options is None:
        text_options = {}  # Initialize empty dict

    # Auto-detect method if not specified
    if method is None:
        # If this is a TATR-detected region, use TATR method
        if hasattr(self, "model") and self.model == "tatr" and self.region_type == "table":
            effective_method = "tatr"
        else:
            # Try lattice first, then fall back to stream if no meaningful results
            logger.debug(f"Region {self.bbox}: Auto-detecting table extraction method...")

            # --- NEW: Prefer already-created table_cell regions if they exist --- #
            try:
                cell_regions_in_table = [
                    c
                    for c in self.page.find_all(
                        "region[type=table_cell]", apply_exclusions=False
                    )
                    if self.intersects(c)
                ]
            except Exception as _cells_err:
                cell_regions_in_table = []  # Fallback silently

            if cell_regions_in_table:
                logger.debug(
                    f"Region {self.bbox}: Found {len(cell_regions_in_table)} pre-computed table_cell regions – using 'cells' method."
                )
                return TableResult(self._extract_table_from_cells(cell_regions_in_table))

            # --------------------------------------------------------------- #

            try:
                logger.debug(f"Region {self.bbox}: Trying 'lattice' method first...")
                lattice_result = self.extract_table(
                    "lattice", table_settings=table_settings.copy()
                )

                # Check if lattice found meaningful content
                if (
                    lattice_result
                    and len(lattice_result) > 0
                    and any(
                        any(cell and cell.strip() for cell in row if cell)
                        for row in lattice_result
                    )
                ):
                    logger.debug(
                        f"Region {self.bbox}: 'lattice' method found table with {len(lattice_result)} rows"
                    )
                    return lattice_result
                else:
                    logger.debug(
                        f"Region {self.bbox}: 'lattice' method found no meaningful content"
                    )
            except Exception as e:
                logger.debug(f"Region {self.bbox}: 'lattice' method failed: {e}")

            # Fall back to stream
            logger.debug(f"Region {self.bbox}: Falling back to 'stream' method...")
            return self.extract_table("stream", table_settings=table_settings.copy())
    else:
        effective_method = method

    # Handle method aliases for pdfplumber
    if effective_method == "stream":
        logger.debug("Using 'stream' method alias for 'pdfplumber' with text-based strategies.")
        effective_method = "pdfplumber"
        # Set default text strategies if not already provided by the user
        table_settings.setdefault("vertical_strategy", "text")
        table_settings.setdefault("horizontal_strategy", "text")
    elif effective_method == "lattice":
        logger.debug(
            "Using 'lattice' method alias for 'pdfplumber' with line-based strategies."
        )
        effective_method = "pdfplumber"
        # Set default line strategies if not already provided by the user
        table_settings.setdefault("vertical_strategy", "lines")
        table_settings.setdefault("horizontal_strategy", "lines")

    # -------------------------------------------------------------
    # Auto-inject tolerances when text-based strategies are requested.
    # This must happen AFTER alias handling (so strategies are final)
    # and BEFORE we delegate to _extract_table_* helpers.
    # -------------------------------------------------------------
    if "text" in (
        table_settings.get("vertical_strategy"),
        table_settings.get("horizontal_strategy"),
    ):
        page_cfg = getattr(self.page, "_config", {})
        # Ensure text_* tolerances passed to pdfplumber
        if "text_x_tolerance" not in table_settings and "x_tolerance" not in table_settings:
            if page_cfg.get("x_tolerance") is not None:
                table_settings["text_x_tolerance"] = page_cfg["x_tolerance"]
        if "text_y_tolerance" not in table_settings and "y_tolerance" not in table_settings:
            if page_cfg.get("y_tolerance") is not None:
                table_settings["text_y_tolerance"] = page_cfg["y_tolerance"]

        # Snap / join tolerances (~ line spacing)
        if "snap_tolerance" not in table_settings and "snap_x_tolerance" not in table_settings:
            snap = max(1, round((page_cfg.get("y_tolerance", 1)) * 0.9))
            table_settings["snap_tolerance"] = snap
        if "join_tolerance" not in table_settings and "join_x_tolerance" not in table_settings:
            table_settings["join_tolerance"] = table_settings["snap_tolerance"]

    logger.debug(f"Region {self.bbox}: Extracting table using method '{effective_method}'")

    # Use the selected method
    if effective_method == "tatr":
        table_rows = self._extract_table_tatr(use_ocr=use_ocr, ocr_config=ocr_config)
    elif effective_method == "text":
        current_text_options = text_options.copy()
        current_text_options["cell_extraction_func"] = cell_extraction_func
        current_text_options["show_progress"] = show_progress
        table_rows = self._extract_table_text(**current_text_options)
    elif effective_method == "pdfplumber":
        table_rows = self._extract_table_plumber(table_settings)
    else:
        raise ValueError(
            f"Unknown table extraction method: '{method}'. Choose from 'tatr', 'pdfplumber', 'text', 'stream', 'lattice'."
        )

    return TableResult(table_rows)
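When the "text" strategy is selected, the branch above seeds pdfplumber's tolerances from the page configuration. A minimal standalone sketch of that derivation (the helper name and dict-in/dict-out interface are illustrative, not part of the natural_pdf API):

```python
def derive_text_tolerances(table_settings, page_cfg):
    """Sketch of the tolerance seeding above; not a natural_pdf function."""
    settings = dict(table_settings)
    # Forward page-level x/y tolerances as pdfplumber text_* tolerances.
    if "text_x_tolerance" not in settings and "x_tolerance" not in settings:
        if page_cfg.get("x_tolerance") is not None:
            settings["text_x_tolerance"] = page_cfg["x_tolerance"]
    if "text_y_tolerance" not in settings and "y_tolerance" not in settings:
        if page_cfg.get("y_tolerance") is not None:
            settings["text_y_tolerance"] = page_cfg["y_tolerance"]
    # Snap/join tolerances track line spacing: ~0.9 * y_tolerance, floor of 1.
    if "snap_tolerance" not in settings and "snap_x_tolerance" not in settings:
        settings["snap_tolerance"] = max(1, round(page_cfg.get("y_tolerance", 1) * 0.9))
    if "join_tolerance" not in settings and "join_x_tolerance" not in settings:
        settings["join_tolerance"] = settings.get("snap_tolerance", 1)
    return settings
```

Explicit `snap_*`/`join_*` values passed by the caller always win; the defaults only fill gaps.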
natural_pdf.Region.extract_tables(method=None, table_settings=None)

Extract all tables from this region using pdfplumber-based methods.

Note: Only 'pdfplumber', 'stream', and 'lattice' methods are supported for extract_tables. 'tatr' and 'text' methods are designed for single table extraction only.

Parameters:

Name Type Description Default
method Optional[str]

Method to use: 'pdfplumber', 'stream', 'lattice', or None (auto-detect). 'stream' uses text-based strategies, 'lattice' uses line-based strategies.

None
table_settings Optional[dict]

Settings for pdfplumber table extraction.

None

Returns:

Type Description
List[List[List[str]]]

List of tables, where each table is a list of rows, and each row is a list of cell values.

Source code in natural_pdf/elements/region.py, lines 1457-1545
def extract_tables(
    self,
    method: Optional[str] = None,
    table_settings: Optional[dict] = None,
) -> List[List[List[str]]]:
    """
    Extract all tables from this region using pdfplumber-based methods.

    Note: Only 'pdfplumber', 'stream', and 'lattice' methods are supported for extract_tables.
    'tatr' and 'text' methods are designed for single table extraction only.

    Args:
        method: Method to use: 'pdfplumber', 'stream', 'lattice', or None (auto-detect).
                'stream' uses text-based strategies, 'lattice' uses line-based strategies.
        table_settings: Settings for pdfplumber table extraction.

    Returns:
        List of tables, where each table is a list of rows, and each row is a list of cell values.
    """
    if table_settings is None:
        table_settings = {}

    # Auto-detect method if not specified (try lattice first, then stream)
    if method is None:
        logger.debug(f"Region {self.bbox}: Auto-detecting tables extraction method...")

        # Try lattice first
        try:
            lattice_settings = table_settings.copy()
            lattice_settings.setdefault("vertical_strategy", "lines")
            lattice_settings.setdefault("horizontal_strategy", "lines")

            logger.debug(f"Region {self.bbox}: Trying 'lattice' method first for tables...")
            lattice_result = self._extract_tables_plumber(lattice_settings)

            # Check if lattice found meaningful tables
            if (
                lattice_result
                and len(lattice_result) > 0
                and any(
                    any(
                        any(cell and cell.strip() for cell in row if cell)
                        for row in table
                        if table
                    )
                    for table in lattice_result
                )
            ):
                logger.debug(
                    f"Region {self.bbox}: 'lattice' method found {len(lattice_result)} tables"
                )
                return lattice_result
            else:
                logger.debug(f"Region {self.bbox}: 'lattice' method found no meaningful tables")

        except Exception as e:
            logger.debug(f"Region {self.bbox}: 'lattice' method failed: {e}")

        # Fall back to stream
        logger.debug(f"Region {self.bbox}: Falling back to 'stream' method for tables...")
        stream_settings = table_settings.copy()
        stream_settings.setdefault("vertical_strategy", "text")
        stream_settings.setdefault("horizontal_strategy", "text")

        return self._extract_tables_plumber(stream_settings)

    effective_method = method

    # Handle method aliases
    if effective_method == "stream":
        logger.debug("Using 'stream' method alias for 'pdfplumber' with text-based strategies.")
        effective_method = "pdfplumber"
        table_settings.setdefault("vertical_strategy", "text")
        table_settings.setdefault("horizontal_strategy", "text")
    elif effective_method == "lattice":
        logger.debug(
            "Using 'lattice' method alias for 'pdfplumber' with line-based strategies."
        )
        effective_method = "pdfplumber"
        table_settings.setdefault("vertical_strategy", "lines")
        table_settings.setdefault("horizontal_strategy", "lines")

    # Use the selected method
    if effective_method == "pdfplumber":
        return self._extract_tables_plumber(table_settings)
    else:
        raise ValueError(
            f"Unknown tables extraction method: '{method}'. Choose from 'pdfplumber', 'stream', 'lattice'."
        )
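The auto-detect path above accepts the lattice result only if it contains at least one non-blank cell. That check can be isolated as a small predicate (a sketch for illustration; not a natural_pdf function):

```python
def has_meaningful_table(tables):
    """True if any extracted table contains at least one non-blank cell."""
    return any(
        any(any(cell and cell.strip() for cell in row if cell) for row in table if table)
        for table in tables
    )
```

If this predicate fails on the lattice output, extraction falls back to the text-based "stream" strategies.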
natural_pdf.Region.extract_text(apply_exclusions=True, debug=False, **kwargs)

Extract text from this region, respecting page exclusions and using pdfplumber's layout engine (chars_to_textmap).

Parameters:

Name Type Description Default
apply_exclusions

Whether to apply exclusion regions defined on the parent page.

True
debug

Enable verbose debugging output for filtering steps.

False
**kwargs

Additional layout parameters passed directly to pdfplumber's chars_to_textmap function (e.g., layout, x_density, y_density). See Page.extract_text docstring for more.

{}

Returns:

Type Description
str

Extracted text as string, potentially with layout-based spacing.

Source code in natural_pdf/elements/region.py, lines 1224-1295
def extract_text(self, apply_exclusions=True, debug=False, **kwargs) -> str:
    """
    Extract text from this region, respecting page exclusions and using pdfplumber's
    layout engine (chars_to_textmap).

    Args:
        apply_exclusions: Whether to apply exclusion regions defined on the parent page.
        debug: Enable verbose debugging output for filtering steps.
        **kwargs: Additional layout parameters passed directly to pdfplumber's
                  `chars_to_textmap` function (e.g., layout, x_density, y_density).
                  See Page.extract_text docstring for more.

    Returns:
        Extracted text as string, potentially with layout-based spacing.
    """
    # Allow 'debug_exclusions' for backward compatibility
    debug = kwargs.get("debug", debug or kwargs.get("debug_exclusions", False))
    logger.debug(f"Region {self.bbox}: extract_text called with kwargs: {kwargs}")

    # 1. Get Word Elements potentially within this region (initial broad phase)
    # Optimization: Could use spatial query if page elements were indexed
    page_words = self.page.words  # Get all words from the page

    # 2. Gather all character dicts from words potentially in region
    # We filter precisely in filter_chars_spatially
    all_char_dicts = []
    for word in page_words:
        # Quick bbox check to avoid processing words clearly outside
        if get_bbox_overlap(self.bbox, word.bbox) is not None:
            all_char_dicts.extend(getattr(word, "_char_dicts", []))

    if not all_char_dicts:
        logger.debug(f"Region {self.bbox}: No character dicts found overlapping region bbox.")
        return ""

    # 3. Get Relevant Exclusions (overlapping this region)
    apply_exclusions_flag = kwargs.get("apply_exclusions", apply_exclusions)
    exclusion_regions = []
    if apply_exclusions_flag and self._page._exclusions:
        all_page_exclusions = self._page._get_exclusion_regions(
            include_callable=True, debug=debug
        )
        overlapping_exclusions = []
        for excl in all_page_exclusions:
            if get_bbox_overlap(self.bbox, excl.bbox) is not None:
                overlapping_exclusions.append(excl)
        exclusion_regions = overlapping_exclusions
        if debug:
            logger.debug(
                f"Region {self.bbox}: Applying {len(exclusion_regions)} overlapping exclusions."
            )
    elif debug:
        logger.debug(f"Region {self.bbox}: Not applying exclusions.")

    # 4. Spatially Filter Characters using Utility
    # Pass self as the target_region for precise polygon checks etc.
    filtered_chars = filter_chars_spatially(
        char_dicts=all_char_dicts,
        exclusion_regions=exclusion_regions,
        target_region=self,  # Pass self!
        debug=debug,
    )

    # 5. Generate Text Layout using Utility
    result = generate_text_layout(
        char_dicts=filtered_chars,
        layout_context_bbox=self.bbox,  # Use region's bbox for context
        user_kwargs=kwargs,  # Pass original kwargs to layout generator
    )

    logger.debug(f"Region {self.bbox}: extract_text finished, result length: {len(result)}.")
    return result
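The broad-phase filter in step 2 relies on a bounding-box intersection helper (`get_bbox_overlap`). A minimal sketch of such a test, assuming `(x0, top, x1, bottom)` tuples in top-left-origin coordinates (the library's helper may treat touching edges differently):

```python
def bbox_overlap(a, b):
    """Return the intersection of two (x0, top, x1, bottom) boxes, or None."""
    x0, top = max(a[0], b[0]), max(a[1], b[1])
    x1, bottom = min(a[2], b[2]), min(a[3], b[3])
    return (x0, top, x1, bottom) if x0 < x1 and top < bottom else None
```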
natural_pdf.Region.find(selector=None, *, text=None, contains='all', apply_exclusions=True, regex=False, case=True, **kwargs)
find(*, text: str, contains: str = 'all', apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> Optional[Element]
find(selector: str, *, contains: str = 'all', apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> Optional[Element]

Find the first element in this region matching the selector OR text content.

Provide EITHER selector OR text, but not both.

Parameters:

Name Type Description Default
selector Optional[str]

CSS-like selector string.

None
text Optional[str]

Text content to search for (equivalent to 'text:contains(...)').

None
contains str

How to determine if elements are inside: 'all' (fully inside), 'any' (any overlap), or 'center' (center point inside). (default: "all")

'all'
apply_exclusions bool

Whether to exclude elements in exclusion regions (default: True).

True
regex bool

Whether to use regex for text search (selector or text) (default: False).

False
case bool

Whether to do case-sensitive text search (selector or text) (default: True).

True
**kwargs

Additional parameters for element filtering.

{}

Returns:

Type Description
Optional[Element]

First matching element or None.

Source code in natural_pdf/elements/region.py, lines 1941-1981
def find(
    self,
    selector: Optional[str] = None,  # Now optional
    *,
    text: Optional[str] = None,  # New text parameter
    contains: str = "all",  # New parameter for containment behavior
    apply_exclusions: bool = True,
    regex: bool = False,
    case: bool = True,
    **kwargs,
) -> Optional["Element"]:
    """
    Find the first element in this region matching the selector OR text content.

    Provide EITHER `selector` OR `text`, but not both.

    Args:
        selector: CSS-like selector string.
        text: Text content to search for (equivalent to 'text:contains(...)').
        contains: How to determine if elements are inside: 'all' (fully inside),
                 'any' (any overlap), or 'center' (center point inside).
                 (default: "all")
        apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
        regex: Whether to use regex for text search (`selector` or `text`) (default: False).
        case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
        **kwargs: Additional parameters for element filtering.

    Returns:
        First matching element or None.
    """
    # Delegate validation and selector construction to find_all
    elements = self.find_all(
        selector=selector,
        text=text,
        contains=contains,
        apply_exclusions=apply_exclusions,
        regex=regex,
        case=case,
        **kwargs,
    )
    return elements.first if elements else None
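`find` delegates entirely to `find_all`, which converts the `text` shortcut into a `text:contains(...)` selector with quote escaping. The conversion in isolation (an illustrative sketch of the logic shown in `find_all` below, not a public helper):

```python
def text_to_selector(text):
    """Sketch of the text-shortcut conversion used by find/find_all."""
    escaped = text.replace('"', '\\"').replace("'", "\\'")
    return f'text:contains("{escaped}")'
```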
natural_pdf.Region.find_all(selector=None, *, text=None, contains='all', apply_exclusions=True, regex=False, case=True, **kwargs)
find_all(*, text: str, contains: str = 'all', apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> ElementCollection
find_all(selector: str, *, contains: str = 'all', apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> ElementCollection

Find all elements in this region matching the selector OR text content.

Provide EITHER selector OR text, but not both.

Parameters:

Name Type Description Default
selector Optional[str]

CSS-like selector string.

None
text Optional[str]

Text content to search for (equivalent to 'text:contains(...)').

None
contains str

How to determine if elements are inside: 'all' (fully inside), 'any' (any overlap), or 'center' (center point inside). (default: "all")

'all'
apply_exclusions bool

Whether to exclude elements in exclusion regions (default: True).

True
regex bool

Whether to use regex for text search (selector or text) (default: False).

False
case bool

Whether to do case-sensitive text search (selector or text) (default: True).

True
**kwargs

Additional parameters for element filtering.

{}

Returns:

Type Description
ElementCollection

ElementCollection with matching elements.

Source code in natural_pdf/elements/region.py, lines 2007-2102
def find_all(
    self,
    selector: Optional[str] = None,  # Now optional
    *,
    text: Optional[str] = None,  # New text parameter
    contains: str = "all",  # New parameter to control inside/overlap behavior
    apply_exclusions: bool = True,
    regex: bool = False,
    case: bool = True,
    **kwargs,
) -> "ElementCollection":
    """
    Find all elements in this region matching the selector OR text content.

    Provide EITHER `selector` OR `text`, but not both.

    Args:
        selector: CSS-like selector string.
        text: Text content to search for (equivalent to 'text:contains(...)').
        contains: How to determine if elements are inside: 'all' (fully inside),
                 'any' (any overlap), or 'center' (center point inside).
                 (default: "all")
        apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
        regex: Whether to use regex for text search (`selector` or `text`) (default: False).
        case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
        **kwargs: Additional parameters for element filtering.

    Returns:
        ElementCollection with matching elements.
    """
    from natural_pdf.elements.collections import ElementCollection

    if selector is not None and text is not None:
        raise ValueError("Provide either 'selector' or 'text', not both.")
    if selector is None and text is None:
        raise ValueError("Provide either 'selector' or 'text'.")

    # Validate contains parameter
    if contains not in ["all", "any", "center"]:
        raise ValueError(
            f"Invalid contains value: {contains}. Must be 'all', 'any', or 'center'"
        )

    # Construct selector if 'text' is provided
    effective_selector = ""
    if text is not None:
        escaped_text = text.replace('"', '\\"').replace("'", "\\'")
        effective_selector = f'text:contains("{escaped_text}")'
        logger.debug(
            f"Using text shortcut: find_all(text='{text}') -> find_all('{effective_selector}')"
        )
    elif selector is not None:
        effective_selector = selector
    else:
        raise ValueError("Internal error: No selector or text provided.")

    # Normal case: Region is on a single page
    try:
        # Parse the final selector string
        selector_obj = parse_selector(effective_selector)

        # Get all potentially relevant elements from the page
        # Let the page handle its exclusion logic if needed
        potential_elements = self.page.find_all(
            selector=effective_selector,
            apply_exclusions=apply_exclusions,
            regex=regex,
            case=case,
            **kwargs,
        )

        # Filter these elements based on the specified containment method
        region_bbox = self.bbox
        matching_elements = []

        if contains == "all":  # Fully inside (strict)
            matching_elements = [
                el
                for el in potential_elements
                if el.x0 >= region_bbox[0]
                and el.top >= region_bbox[1]
                and el.x1 <= region_bbox[2]
                and el.bottom <= region_bbox[3]
            ]
        elif contains == "any":  # Any overlap
            matching_elements = [el for el in potential_elements if self.intersects(el)]
        elif contains == "center":  # Center point inside
            matching_elements = [
                el for el in potential_elements if self.is_element_center_inside(el)
            ]

        return ElementCollection(matching_elements)

    except Exception as e:
        logger.error(f"Error during find_all in region: {e}", exc_info=True)
        return ElementCollection([])
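The three containment modes reduce to simple bounding-box arithmetic. A standalone sketch over plain `(x0, top, x1, bottom)` tuples (the real code defers `'any'` and `'center'` to `Region.intersects` and `Region.is_element_center_inside`, which also handle polygon regions):

```python
def filter_by_containment(region_bbox, boxes, contains="all"):
    """Filter (x0, top, x1, bottom) boxes by their relation to region_bbox."""
    rx0, rtop, rx1, rbottom = region_bbox
    if contains == "all":      # fully inside the region
        return [b for b in boxes
                if b[0] >= rx0 and b[1] >= rtop and b[2] <= rx1 and b[3] <= rbottom]
    if contains == "any":      # any overlap with the region
        return [b for b in boxes
                if b[0] < rx1 and b[2] > rx0 and b[1] < rbottom and b[3] > rtop]
    if contains == "center":   # center point inside the region
        return [b for b in boxes
                if rx0 <= (b[0] + b[2]) / 2 <= rx1
                and rtop <= (b[1] + b[3]) / 2 <= rbottom]
    raise ValueError(f"Invalid contains value: {contains}. Must be 'all', 'any', or 'center'")
```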
natural_pdf.Region.get_children(selector=None)

Get immediate child regions, optionally filtered by selector.

Parameters:

Name Type Description Default
selector

Optional selector to filter children

None

Returns:

Type Description

List of child regions matching the selector

Source code in natural_pdf/elements/region.py, lines 2868-2896
def get_children(self, selector=None):
    """
    Get immediate child regions, optionally filtered by selector.

    Args:
        selector: Optional selector to filter children

    Returns:
        List of child regions matching the selector
    """
    import logging

    logger = logging.getLogger("natural_pdf.elements.region")

    if selector is None:
        return self.child_regions

    # Use existing selector parser to filter
    try:
        selector_obj = parse_selector(selector)
        filter_func = selector_to_filter_func(selector_obj)  # Removed region=self
        matched = [child for child in self.child_regions if filter_func(child)]
        logger.debug(
            f"get_children: found {len(matched)} of {len(self.child_regions)} children matching '{selector}'"
        )
        return matched
    except Exception as e:
        logger.error(f"Error applying selector in get_children: {e}", exc_info=True)
        return []  # Return empty list on error
natural_pdf.Region.get_descendants(selector=None)

Get all descendant regions (children, grandchildren, etc.), optionally filtered by selector.

Parameters:

Name Type Description Default
selector

Optional selector to filter descendants

None

Returns:

Type Description

List of descendant regions matching the selector

Source code in natural_pdf/elements/region.py, lines 2898-2936
def get_descendants(self, selector=None):
    """
    Get all descendant regions (children, grandchildren, etc.), optionally filtered by selector.

    Args:
        selector: Optional selector to filter descendants

    Returns:
        List of descendant regions matching the selector
    """
    import logging

    logger = logging.getLogger("natural_pdf.elements.region")

    all_descendants = []
    queue = list(self.child_regions)  # Start with direct children

    while queue:
        current = queue.pop(0)
        all_descendants.append(current)
        # Add current's children to the queue for processing
        if hasattr(current, "child_regions"):
            queue.extend(current.child_regions)

    logger.debug(f"get_descendants: found {len(all_descendants)} total descendants")

    # Filter by selector if provided
    if selector is not None:
        try:
            selector_obj = parse_selector(selector)
            filter_func = selector_to_filter_func(selector_obj)  # Removed region=self
            matched = [desc for desc in all_descendants if filter_func(desc)]
            logger.debug(f"get_descendants: filtered to {len(matched)} matching '{selector}'")
            return matched
        except Exception as e:
            logger.error(f"Error applying selector in get_descendants: {e}", exc_info=True)
            return []  # Return empty list on error

    return all_descendants
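The traversal above is a plain breadth-first walk over `child_regions`. A self-contained sketch with a stand-in node class (not a natural_pdf type):

```python
class Node:
    """Stand-in for a region with child_regions (illustrative only)."""
    def __init__(self, name, children=()):
        self.name = name
        self.child_regions = list(children)

def descendants(region):
    """Sketch of the breadth-first walk used by get_descendants."""
    found, queue = [], list(region.child_regions)
    while queue:
        current = queue.pop(0)          # FIFO -> level-by-level order
        found.append(current)
        queue.extend(getattr(current, "child_regions", []))
    return found
```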
natural_pdf.Region.get_elements(selector=None, apply_exclusions=True, **kwargs)

Get all elements within this region.

Parameters:

Name Type Description Default
selector Optional[str]

Optional selector to filter elements

None
apply_exclusions

Whether to apply exclusion regions

True
**kwargs

Additional parameters for element filtering

{}

Returns:

Type Description
List[Element]

List of elements in the region

Source code in natural_pdf/elements/region.py, lines 1197-1222
def get_elements(
    self, selector: Optional[str] = None, apply_exclusions=True, **kwargs
) -> List["Element"]:
    """
    Get all elements within this region.

    Args:
        selector: Optional selector to filter elements
        apply_exclusions: Whether to apply exclusion regions
        **kwargs: Additional parameters for element filtering

    Returns:
        List of elements in the region
    """
    if selector:
        # Find elements on the page matching the selector
        page_elements = self.page.find_all(
            selector, apply_exclusions=apply_exclusions, **kwargs
        )
        # Filter those elements to only include ones within this region
        return [e for e in page_elements if self._is_element_in_region(e)]
    else:
        # Get all elements from the page
        page_elements = self.page.get_elements(apply_exclusions=apply_exclusions)
        # Filter to elements in this region
        return [e for e in page_elements if self._is_element_in_region(e)]
natural_pdf.Region.get_section_between(start_element=None, end_element=None, boundary_inclusion='both')

Get a section between two elements within this region.

Parameters:

Name Type Description Default
start_element

Element marking the start of the section

None
end_element

Element marking the end of the section

None
boundary_inclusion

How to include boundary elements: 'start', 'end', 'both', or 'none'

'both'

Returns:

Type Description

Region representing the section

Source code in natural_pdf/elements/region.py, lines 2469-2563
def get_section_between(self, start_element=None, end_element=None, boundary_inclusion="both"):
    """
    Get a section between two elements within this region.

    Args:
        start_element: Element marking the start of the section
        end_element: Element marking the end of the section
        boundary_inclusion: How to include boundary elements: 'start', 'end', 'both', or 'none'

    Returns:
        Region representing the section
    """
    # Get elements only within this region first
    elements = self.get_elements()

    # If no elements, return self or empty region?
    if not elements:
        logger.warning(
            f"get_section_between called on region {self.bbox} with no contained elements."
        )
        # Return an empty region at the start of the parent region
        return Region(self.page, (self.x0, self.top, self.x0, self.top))

    # Sort elements in reading order
    elements.sort(key=lambda e: (e.top, e.x0))

    # Find start index
    start_idx = 0
    if start_element:
        try:
            start_idx = elements.index(start_element)
        except ValueError:
            # Start element not in region, use first element
            logger.debug("Start element not found in region, using first element.")
            start_element = elements[0]  # Use the actual first element
            start_idx = 0
    else:
        start_element = elements[0]  # Default start is first element

    # Find end index
    end_idx = len(elements) - 1
    if end_element:
        try:
            end_idx = elements.index(end_element)
        except ValueError:
            # End element not in region, use last element
            logger.debug("End element not found in region, using last element.")
            end_element = elements[-1]  # Use the actual last element
            end_idx = len(elements) - 1
    else:
        end_element = elements[-1]  # Default end is last element

    # Adjust indexes based on boundary inclusion
    start_element_for_bbox = start_element
    end_element_for_bbox = end_element

    if boundary_inclusion == "none":
        start_idx += 1
        end_idx -= 1
        start_element_for_bbox = elements[start_idx] if start_idx <= end_idx else None
        end_element_for_bbox = elements[end_idx] if start_idx <= end_idx else None
    elif boundary_inclusion == "start":
        end_idx -= 1
        end_element_for_bbox = elements[end_idx] if start_idx <= end_idx else None
    elif boundary_inclusion == "end":
        start_idx += 1
        start_element_for_bbox = elements[start_idx] if start_idx <= end_idx else None

    # Ensure valid indexes
    start_idx = max(0, start_idx)
    end_idx = min(len(elements) - 1, end_idx)

    # If no valid elements in range, return empty region
    if start_idx > end_idx or start_element_for_bbox is None or end_element_for_bbox is None:
        logger.debug("No valid elements in range for get_section_between.")
        # Return an empty region positioned at the start element boundary
        anchor = start_element if start_element else self
        return Region(self.page, (anchor.x0, anchor.top, anchor.x0, anchor.top))

    # Get elements in range based on adjusted indices
    section_elements = elements[start_idx : end_idx + 1]

    # Create bounding box around the ELEMENTS included based on indices
    x0 = min(e.x0 for e in section_elements)
    top = min(e.top for e in section_elements)
    x1 = max(e.x1 for e in section_elements)
    bottom = max(e.bottom for e in section_elements)

    # Create new region
    section = Region(self.page, (x0, top, x1, bottom))
    # Store the original boundary elements for reference
    section.start_element = start_element
    section.end_element = end_element

    return section
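The `boundary_inclusion` argument only shifts the start/end indices before slicing. The adjustment in isolation (an illustrative sketch of the branch logic above):

```python
def adjust_bounds(start_idx, end_idx, boundary_inclusion):
    """Sketch of the index adjustment in get_section_between."""
    if boundary_inclusion == "none":     # drop both boundary elements
        start_idx += 1
        end_idx -= 1
    elif boundary_inclusion == "start":  # keep start, drop end
        end_idx -= 1
    elif boundary_inclusion == "end":    # drop start, keep end
        start_idx += 1
    return start_idx, end_idx            # "both" leaves indices unchanged
```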
natural_pdf.Region.get_sections(start_elements=None, end_elements=None, boundary_inclusion='both')

Get sections within this region based on start/end elements.

Parameters:

Name Type Description Default
start_elements

Elements or selector string that mark the start of sections

None
end_elements

Elements or selector string that mark the end of sections

None
boundary_inclusion

How to include boundary elements: 'start', 'end', 'both', or 'none'

'both'

Returns:

Type Description
ElementCollection[Region]

List of Region objects representing the extracted sections

Source code in natural_pdf/elements/region.py, lines 2565-2687
def get_sections(
    self, start_elements=None, end_elements=None, boundary_inclusion="both"
) -> "ElementCollection[Region]":
    """
    Get sections within this region based on start/end elements.

    Args:
        start_elements: Elements or selector string that mark the start of sections
        end_elements: Elements or selector string that mark the end of sections
        boundary_inclusion: How to include boundary elements: 'start', 'end', 'both', or 'none'

    Returns:
        List of Region objects representing the extracted sections
    """
    from natural_pdf.elements.collections import ElementCollection

    # Process string selectors to find elements WITHIN THIS REGION
    if isinstance(start_elements, str):
        start_elements = self.find_all(start_elements)  # Use region's find_all
        if hasattr(start_elements, "elements"):
            start_elements = start_elements.elements

    if isinstance(end_elements, str):
        end_elements = self.find_all(end_elements)  # Use region's find_all
        if hasattr(end_elements, "elements"):
            end_elements = end_elements.elements

    # Ensure start_elements is a list (or similar iterable)
    if start_elements is None or not hasattr(start_elements, "__iter__"):
        logger.warning(
            "get_sections requires valid start_elements (selector or list). Returning empty."
        )
        return ElementCollection([])
    # Ensure end_elements is a list if provided
    if end_elements is not None and not hasattr(end_elements, "__iter__"):
        logger.warning("end_elements must be iterable if provided. Ignoring.")
        end_elements = []
    elif end_elements is None:
        end_elements = []

    # If no start elements found within the region, return empty list
    if not start_elements:
        return ElementCollection([])

    # Sort all elements within the region in reading order
    all_elements_in_region = self.get_elements()
    all_elements_in_region.sort(key=lambda e: (e.top, e.x0))

    if not all_elements_in_region:
        return ElementCollection([])  # Cannot create sections if region is empty

    # Map elements to their indices in the sorted list
    element_to_index = {el: i for i, el in enumerate(all_elements_in_region)}

    # Mark section boundaries using indices from the sorted list
    section_boundaries = []

    # Add start element indexes
    for element in start_elements:
        idx = element_to_index.get(element)
        if idx is not None:
            section_boundaries.append({"index": idx, "element": element, "type": "start"})
        # else: Element found by selector might not be geometrically in region? Log warning?

    # Add end element indexes if provided
    for element in end_elements:
        idx = element_to_index.get(element)
        if idx is not None:
            section_boundaries.append({"index": idx, "element": element, "type": "end"})

    # Sort boundaries by index (document order within the region)
    section_boundaries.sort(key=lambda x: x["index"])

    # Generate sections
    sections = []
    current_start_boundary = None

    for i, boundary in enumerate(section_boundaries):
        # If it's a start boundary and we don't have a current start
        if boundary["type"] == "start" and current_start_boundary is None:
            current_start_boundary = boundary

        # If it's an end boundary and we have a current start
        elif boundary["type"] == "end" and current_start_boundary is not None:
            # Create a section from current_start to this boundary
            start_element = current_start_boundary["element"]
            end_element = boundary["element"]
            # Use the helper, ensuring elements are from within the region
            section = self.get_section_between(start_element, end_element, boundary_inclusion)
            sections.append(section)
            current_start_boundary = None  # Reset

        # If it's another start boundary and we have a current start (split by starts only)
        elif (
            boundary["type"] == "start"
            and current_start_boundary is not None
            and not end_elements
        ):
            # End the previous section just before this start boundary
            start_element = current_start_boundary["element"]
            # Find the element immediately preceding this start in the sorted list
            end_idx = boundary["index"] - 1
            if end_idx >= 0 and end_idx >= current_start_boundary["index"]:
                end_element = all_elements_in_region[end_idx]
                section = self.get_section_between(
                    start_element, end_element, boundary_inclusion
                )
                sections.append(section)
            # Else: Section started and ended by consecutive start elements? Create empty?
            # For now, just reset and start new section

            # Start the new section
            current_start_boundary = boundary

    # Handle the last section if we have a current start
    if current_start_boundary is not None:
        start_element = current_start_boundary["element"]
        # End at the last element within the region
        end_element = all_elements_in_region[-1]
        section = self.get_section_between(start_element, end_element, boundary_inclusion)
        sections.append(section)

    return ElementCollection(sections)
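The boundary-pairing logic above can be sketched independently of the library: a start boundary opens a section, an end boundary closes it, and — when no end elements were given — each new start closes the previous section just before itself. A minimal pure-Python sketch (hypothetical helper, simplified to index pairs rather than Region objects):

```python
def pair_sections(boundaries, last_index):
    """Pair (index, kind) boundaries into (start_idx, end_idx) sections.

    boundaries: list of (index, "start" | "end") tuples, sorted by index.
    last_index: index of the final element in the region.
    """
    has_ends = any(kind == "end" for _, kind in boundaries)
    sections = []
    current = None
    for idx, kind in boundaries:
        if kind == "start" and current is None:
            current = idx
        elif kind == "end" and current is not None:
            sections.append((current, idx))
            current = None
        elif kind == "start" and current is not None and not has_ends:
            # Split-by-starts mode: close the previous section just before this start.
            if idx - 1 >= current:
                sections.append((current, idx - 1))
            current = idx
    if current is not None:
        # An open section runs to the last element in the region.
        sections.append((current, last_index))
    return sections
```

For example, three start boundaries at indices 0, 3, and 7 over ten elements yield the sections (0, 2), (3, 6), and (7, 9).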
natural_pdf.Region.get_text_table_cells(snap_tolerance=10, join_tolerance=3, min_words_vertical=3, min_words_horizontal=1, intersection_tolerance=3, expand_bbox=None, **kwargs)

Analyzes text alignment to find table cells and returns them as temporary Region objects without adding them to the page.

Parameters:

- snap_tolerance (int, default 10): Tolerance for snapping parallel lines.
- join_tolerance (int, default 3): Tolerance for joining collinear lines.
- min_words_vertical (int, default 3): Minimum words needed to define a vertical line.
- min_words_horizontal (int, default 1): Minimum words needed to define a horizontal line.
- intersection_tolerance (int, default 3): Tolerance for detecting line intersections.
- expand_bbox (Optional[Dict[str, int]], default None): Optional dictionary to expand the search area slightly beyond the region's exact bounds (e.g., {'left': 5, 'right': 5}).
- **kwargs: Additional keyword arguments passed to find_text_based_tables (e.g., specific x/y tolerances).

Returns:

- ElementCollection[Region]: An ElementCollection containing temporary Region objects for each detected cell, or an empty ElementCollection if no cells are found or an error occurs.

Source code in natural_pdf/elements/region.py
def get_text_table_cells(
    self,
    snap_tolerance: int = 10,
    join_tolerance: int = 3,
    min_words_vertical: int = 3,
    min_words_horizontal: int = 1,
    intersection_tolerance: int = 3,
    expand_bbox: Optional[Dict[str, int]] = None,
    **kwargs,
) -> "ElementCollection[Region]":
    """
    Analyzes text alignment to find table cells and returns them as
    temporary Region objects without adding them to the page.

    Args:
        snap_tolerance: Tolerance for snapping parallel lines.
        join_tolerance: Tolerance for joining collinear lines.
        min_words_vertical: Minimum words needed to define a vertical line.
        min_words_horizontal: Minimum words needed to define a horizontal line.
        intersection_tolerance: Tolerance for detecting line intersections.
        expand_bbox: Optional dictionary to expand the search area slightly beyond
                     the region's exact bounds (e.g., {'left': 5, 'right': 5}).
        **kwargs: Additional keyword arguments passed to
                  find_text_based_tables (e.g., specific x/y tolerances).

    Returns:
        An ElementCollection containing temporary Region objects for each detected cell,
        or an empty ElementCollection if no cells are found or an error occurs.
    """
    from natural_pdf.elements.collections import ElementCollection

    # 1. Perform the analysis (or use cached results)
    if "text_table_structure" in self.analyses:
        analysis_results = self.analyses["text_table_structure"]
        logger.debug("get_text_table_cells: Using cached analysis results.")
    else:
        analysis_results = self.analyze_text_table_structure(
            snap_tolerance=snap_tolerance,
            join_tolerance=join_tolerance,
            min_words_vertical=min_words_vertical,
            min_words_horizontal=min_words_horizontal,
            intersection_tolerance=intersection_tolerance,
            expand_bbox=expand_bbox,
            **kwargs,
        )

    # 2. Check if analysis was successful and cells were found
    if analysis_results is None or not analysis_results.get("cells"):
        logger.info(f"Region {self.bbox}: No cells found by text table analysis.")
        return ElementCollection([])  # Return empty collection

    # 3. Create temporary Region objects for each cell dictionary
    cell_regions = []
    for cell_data in analysis_results["cells"]:
        try:
            # Use page.region to create the region object
            # It expects left, top, right, bottom keys
            cell_region = self.page.region(**cell_data)

            # Set metadata on the temporary region
            cell_region.region_type = "table-cell"
            cell_region.normalized_type = "table-cell"
            cell_region.model = "pdfplumber-text"
            cell_region.source = "volatile"  # Indicate it's not managed/persistent
            cell_region.parent_region = self  # Link back to the region it came from

            cell_regions.append(cell_region)
        except Exception as e:
            logger.warning(f"Could not create Region object for cell data {cell_data}: {e}")

    # 4. Return the list wrapped in an ElementCollection
    logger.debug(f"get_text_table_cells: Created {len(cell_regions)} temporary cell regions.")
    return ElementCollection(cell_regions)
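Step 3 above turns each cell dictionary from the analysis results into a Region, skipping malformed entries. The conversion can be sketched in isolation (hypothetical helper; the real method builds Region objects via page.region and logs a warning on failure):

```python
def cells_to_bboxes(analysis_results):
    """Convert analysis cell dicts to (left, top, right, bottom) tuples,
    skipping any entry that lacks the expected keys."""
    if not analysis_results or not analysis_results.get("cells"):
        return []  # analysis failed or found nothing
    bboxes = []
    for cell in analysis_results["cells"]:
        try:
            bboxes.append((cell["left"], cell["top"], cell["right"], cell["bottom"]))
        except KeyError:
            continue  # malformed cell data
    return bboxes
```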
natural_pdf.Region.highlight(label=None, color=None, use_color_cycling=False, include_attrs=None, existing='append')

Highlight this region on the page.

Parameters:

- label (Optional[str], default None): Optional label for the highlight.
- color (Optional[Union[Tuple, str]], default None): Color tuple/string for the highlight, or None to use an automatic color.
- use_color_cycling (bool, default False): Force color cycling even with no label.
- include_attrs (Optional[List[str]], default None): List of attribute names to display on the highlight (e.g., ['confidence', 'type']).
- existing (str, default 'append'): How to handle existing highlights ('append' or 'replace').

Returns:

- Region: Self for method chaining.

Source code in natural_pdf/elements/region.py
def highlight(
    self,
    label: Optional[str] = None,
    color: Optional[Union[Tuple, str]] = None,
    use_color_cycling: bool = False,
    include_attrs: Optional[List[str]] = None,
    existing: str = "append",
) -> "Region":
    """
    Highlight this region on the page.

    Args:
        label: Optional label for the highlight
        color: Color tuple/string for the highlight, or None to use automatic color
        use_color_cycling: Force color cycling even with no label (default: False)
        include_attrs: List of attribute names to display on the highlight (e.g., ['confidence', 'type'])
        existing: How to handle existing highlights ('append' or 'replace').

    Returns:
        Self for method chaining
    """
    # Access the highlighter service correctly
    highlighter = self.page._highlighter

    # Prepare common arguments
    highlight_args = {
        "page_index": self.page.index,
        "color": color,
        "label": label,
        "use_color_cycling": use_color_cycling,
        "element": self,  # Pass the region itself so attributes can be accessed
        "include_attrs": include_attrs,
        "existing": existing,
    }

    # Call the appropriate service method
    if self.has_polygon:
        highlight_args["polygon"] = self.polygon
        highlighter.add_polygon(**highlight_args)
    else:
        highlight_args["bbox"] = self.bbox
        highlighter.add(**highlight_args)

    return self
natural_pdf.Region.intersects(element)

Check if this region intersects with an element (any overlap).

Parameters:

- element (Element, required): Element to check.

Returns:

- bool: True if the element overlaps with the region at all, False otherwise.

Source code in natural_pdf/elements/region.py
def intersects(self, element: "Element") -> bool:
    """
    Check if this region intersects with an element (any overlap).

    Args:
        element: Element to check

    Returns:
        True if the element overlaps with the region at all, False otherwise
    """
    # Check if element is on the same page
    if not hasattr(element, "page") or element.page != self._page:
        return False

    # Ensure element has necessary attributes
    if not all(hasattr(element, attr) for attr in ["x0", "x1", "top", "bottom"]):
        return False  # Cannot determine position

    # For rectangular regions, check for bbox overlap
    if not self.has_polygon:
        return (
            self.x0 < element.x1
            and self.x1 > element.x0
            and self.top < element.bottom
            and self.bottom > element.top
        )

    # For polygon regions, check if any corner of the element is inside the polygon
    element_corners = [
        (element.x0, element.top),  # top-left
        (element.x1, element.top),  # top-right
        (element.x1, element.bottom),  # bottom-right
        (element.x0, element.bottom),  # bottom-left
    ]

    # First check if any element corner is inside the polygon
    if any(self.is_point_inside(x, y) for x, y in element_corners):
        return True

    # Also check if any polygon corner is inside the element's rectangle
    for x, y in self.polygon:
        if element.x0 <= x <= element.x1 and element.top <= y <= element.bottom:
            return True

    # Also check if any polygon edge intersects with any rectangle edge
    # This is a simplification - for complex cases, we'd need a full polygon-rectangle
    # intersection algorithm

    # For now, return True if bounding boxes overlap (approximation for polygon-rectangle case)
    return (
        self.x0 < element.x1
        and self.x1 > element.x0
        and self.top < element.bottom
        and self.bottom > element.top
    )
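For rectangular regions, the test above reduces to a standard axis-aligned bounding-box overlap check with strict inequalities, so boxes that merely share an edge do not count as intersecting. A standalone sketch of that predicate:

```python
def bboxes_intersect(a, b):
    """Strict overlap test for (x0, top, x1, bottom) boxes in a
    top-left-origin coordinate system (top < bottom)."""
    ax0, atop, ax1, abottom = a
    bx0, btop, bx1, bbottom = b
    return ax0 < bx1 and ax1 > bx0 and atop < bbottom and abottom > btop
```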
natural_pdf.Region.is_element_center_inside(element)

Check if the center point of an element's bounding box is inside this region.

Parameters:

- element (Element, required): Element to check.

Returns:

- bool: True if the element's center point is inside the region, False otherwise.

Source code in natural_pdf/elements/region.py
def is_element_center_inside(self, element: "Element") -> bool:
    """
    Check if the center point of an element's bounding box is inside this region.

    Args:
        element: Element to check

    Returns:
        True if the element's center point is inside the region, False otherwise.
    """
    # Check if element is on the same page
    if not hasattr(element, "page") or element.page != self._page:
        return False

    # Ensure element has necessary attributes
    if not all(hasattr(element, attr) for attr in ["x0", "x1", "top", "bottom"]):
        logger.warning(
            f"Element {element} lacks bounding box attributes. Cannot check center point."
        )
        return False  # Cannot determine position

    # Calculate center point
    center_x = (element.x0 + element.x1) / 2
    center_y = (element.top + element.bottom) / 2

    # Use the existing is_point_inside check
    return self.is_point_inside(center_x, center_y)
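The method simply averages the bounding-box coordinates and delegates to is_point_inside. For a rectangular region, that amounts to the following sketch (hypothetical helper operating on plain bbox tuples):

```python
def center_inside_bbox(elem_bbox, region_bbox):
    """True if the center of elem_bbox (x0, top, x1, bottom)
    falls inside region_bbox (inclusive bounds)."""
    ex0, etop, ex1, ebottom = elem_bbox
    cx, cy = (ex0 + ex1) / 2, (etop + ebottom) / 2  # bbox center point
    rx0, rtop, rx1, rbottom = region_bbox
    return rx0 <= cx <= rx1 and rtop <= cy <= rbottom
```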
natural_pdf.Region.is_point_inside(x, y)

Check if a point is inside this region using ray casting algorithm for polygons.

Parameters:

- x (float, required): X coordinate of the point.
- y (float, required): Y coordinate of the point.

Returns:

- bool: True if the point is inside the region.

Source code in natural_pdf/elements/region.py
def is_point_inside(self, x: float, y: float) -> bool:
    """
    Check if a point is inside this region using ray casting algorithm for polygons.

    Args:
        x: X coordinate of the point
        y: Y coordinate of the point

    Returns:
        bool: True if the point is inside the region
    """
    if not self.has_polygon:
        return (self.x0 <= x <= self.x1) and (self.top <= y <= self.bottom)

    # Ray casting algorithm
    inside = False
    j = len(self.polygon) - 1

    for i in range(len(self.polygon)):
        if ((self.polygon[i][1] > y) != (self.polygon[j][1] > y)) and (
            x
            < (self.polygon[j][0] - self.polygon[i][0])
            * (y - self.polygon[i][1])
            / (self.polygon[j][1] - self.polygon[i][1])
            + self.polygon[i][0]
        ):
            inside = not inside
        j = i

    return inside
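The polygon branch above is the classic ray-casting test: cast a horizontal ray from the point and count how many polygon edges it crosses; an odd count means the point is inside. The same loop, reproduced standalone over a plain vertex list:

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting point-in-polygon test; polygon is a list of (x, y) vertices."""
    inside = False
    j = len(polygon) - 1  # previous vertex index, wrapping around
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Edge straddles the ray's y, and the crossing lies to the right of the point
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside
```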
natural_pdf.Region.left(width=None, height='full', include_source=False, until=None, include_endpoint=True, **kwargs)

Select region to the left of this region.

Parameters:

- width (Optional[float], default None): Width of the region to the left, in points.
- height (str, default 'full'): Height mode - "full" for full page height or "element" for element height.
- include_source (bool, default False): Whether to include this region in the result.
- until (Optional[str], default None): Optional selector string to specify a left boundary element.
- include_endpoint (bool, default True): Whether to include the boundary element in the region.
- **kwargs: Additional parameters.

Returns:

- Region: Region object representing the area to the left.

Source code in natural_pdf/elements/region.py
def left(
    self,
    width: Optional[float] = None,
    height: str = "full",
    include_source: bool = False,
    until: Optional[str] = None,
    include_endpoint: bool = True,
    **kwargs,
) -> "Region":
    """
    Select region to the left of this region.

    Args:
        width: Width of the region to the left, in points
        height: Height mode - "full" for full page height or "element" for element height
        include_source: Whether to include this region in the result (default: False)
        until: Optional selector string to specify a left boundary element
        include_endpoint: Whether to include the boundary element in the region (default: True)
        **kwargs: Additional parameters

    Returns:
        Region object representing the area to the left
    """
    return self._direction(
        direction="left",
        size=width,
        cross_size=height,
        include_source=include_source,
        until=until,
        include_endpoint=include_endpoint,
        **kwargs,
    )
natural_pdf.Region.right(width=None, height='full', include_source=False, until=None, include_endpoint=True, **kwargs)

Select region to the right of this region.

Parameters:

- width (Optional[float], default None): Width of the region to the right, in points.
- height (str, default 'full'): Height mode - "full" for full page height or "element" for element height.
- include_source (bool, default False): Whether to include this region in the result.
- until (Optional[str], default None): Optional selector string to specify a right boundary element.
- include_endpoint (bool, default True): Whether to include the boundary element in the region.
- **kwargs: Additional parameters.

Returns:

- Region: Region object representing the area to the right.

Source code in natural_pdf/elements/region.py
def right(
    self,
    width: Optional[float] = None,
    height: str = "full",
    include_source: bool = False,
    until: Optional[str] = None,
    include_endpoint: bool = True,
    **kwargs,
) -> "Region":
    """
    Select region to the right of this region.

    Args:
        width: Width of the region to the right, in points
        height: Height mode - "full" for full page height or "element" for element height
        include_source: Whether to include this region in the result (default: False)
        until: Optional selector string to specify a right boundary element
        include_endpoint: Whether to include the boundary element in the region (default: True)
        **kwargs: Additional parameters

    Returns:
        Region object representing the area to the right
    """
    return self._direction(
        direction="right",
        size=width,
        cross_size=height,
        include_source=include_source,
        until=until,
        include_endpoint=include_endpoint,
        **kwargs,
    )
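Both left() and right() delegate to the internal _direction() helper. For the rectangular, full-height case the geometry is simple: the new region shares the page's vertical extent and extends horizontally from the source edge by the given width, or to the page edge when no width is given. A sketch of that bbox math (hypothetical helper; ignores the until boundary and "element" height mode):

```python
def horizontal_region(source_bbox, direction, page_width, page_height, size=None):
    """Compute the bbox of the area left/right of source_bbox.

    source_bbox: (x0, top, x1, bottom); size: width in points, or None
    to extend to the page edge. Returns (x0, top, x1, bottom).
    """
    sx0, _, sx1, _ = source_bbox
    if direction == "left":
        new_x1 = sx0
        new_x0 = max(0, sx0 - size) if size is not None else 0
    else:  # "right"
        new_x0 = sx1
        new_x1 = min(page_width, sx1 + size) if size is not None else page_width
    return (new_x0, 0, new_x1, page_height)  # "full" height mode
```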
natural_pdf.Region.save(filename, resolution=None, labels=True, legend_position='right')

Save the page with this region highlighted to an image file.

Parameters:

- filename (str, required): Path to save the image to.
- resolution (Optional[float], default None): Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI).
- labels (bool, default True): Whether to include a legend for labels.
- legend_position (str, default 'right'): Position of the legend.

Returns:

- Region: Self for method chaining.

Source code in natural_pdf/elements/region.py
def save(
    self,
    filename: str,
    resolution: Optional[float] = None,
    labels: bool = True,
    legend_position: str = "right",
) -> "Region":
    """
    Save the page with this region highlighted to an image file.

    Args:
        filename: Path to save the image to
        resolution: Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI)
        labels: Whether to include a legend for labels
        legend_position: Position of the legend

    Returns:
        Self for method chaining
    """
    # Apply global options as defaults
    import natural_pdf

    if resolution is None:
        if natural_pdf.options.image.resolution is not None:
            resolution = natural_pdf.options.image.resolution
        else:
            resolution = 144  # Default resolution when none specified

    # Highlight this region if not already highlighted
    self.highlight()

    # Save the highlighted image
    self._page.save_image(
        filename, resolution=resolution, labels=labels, legend_position=legend_position
    )
    return self
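The resolution fallback used here (and mirrored in save_image, show, and to_image) is a three-step chain: an explicit argument wins, then the global natural_pdf.options.image.resolution, then 144 DPI. As a standalone sketch:

```python
def resolve_resolution(explicit=None, global_option=None, fallback=144):
    """Pick the effective DPI: explicit argument > global option > fallback."""
    if explicit is not None:
        return explicit
    if global_option is not None:
        return global_option
    return fallback
```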
natural_pdf.Region.save_image(filename, resolution=None, crop=False, include_highlights=True, **kwargs)

Save an image of just this region to a file.

Parameters:

- filename (str, required): Path to save the image to.
- resolution (Optional[float], default None): Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI).
- crop (bool, default False): If True, only crop the region without highlighting its boundaries.
- include_highlights (bool, default True): Whether to include existing highlights.
- **kwargs: Additional parameters for page.to_image().

Returns:

- Region: Self for method chaining.

Source code in natural_pdf/elements/region.py
def save_image(
    self,
    filename: str,
    resolution: Optional[float] = None,
    crop: bool = False,
    include_highlights: bool = True,
    **kwargs,
) -> "Region":
    """
    Save an image of just this region to a file.

    Args:
        filename: Path to save the image to
        resolution: Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI)
        crop: If True, only crop the region without highlighting its boundaries
        include_highlights: Whether to include existing highlights (default: True)
        **kwargs: Additional parameters for page.to_image()

    Returns:
        Self for method chaining
    """
    # Apply global options as defaults
    import natural_pdf

    if resolution is None:
        if natural_pdf.options.image.resolution is not None:
            resolution = natural_pdf.options.image.resolution
        else:
            resolution = 144  # Default resolution when none specified

    # Get the region image
    image = self.to_image(
        resolution=resolution,
        crop=crop,
        include_highlights=include_highlights,
        **kwargs,
    )

    # Save the image
    image.save(filename)
    return self
natural_pdf.Region.show(resolution=None, labels=True, legend_position='right', color='blue', label=None, width=None, crop=False)

Show the page with just this region highlighted temporarily.

Parameters:

- resolution (Optional[float], default None): Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI).
- labels (bool, default True): Whether to include a legend for labels.
- legend_position (str, default 'right'): Position of the legend.
- color (Optional[Union[Tuple, str]], default 'blue'): Color to highlight this region.
- label (Optional[str], default None): Optional label for this region in the legend.
- width (Optional[int], default None): Optional width for the output image in pixels.
- crop (bool, default False): If True, crop the rendered image to this region's bounding box (with a small margin handled inside HighlightingService) before legends/overlays are added.

Returns:

- Image: PIL Image of the page with only this region highlighted.

Source code in natural_pdf/elements/region.py
def show(
    self,
    resolution: Optional[float] = None,
    labels: bool = True,
    legend_position: str = "right",
    # Add a default color for standalone show
    color: Optional[Union[Tuple, str]] = "blue",
    label: Optional[str] = None,
    width: Optional[int] = None,  # Add width parameter
    crop: bool = False,  # NEW: Crop output to region bounds before legend
) -> "Image.Image":
    """
    Show the page with just this region highlighted temporarily.

    Args:
        resolution: Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI)
        labels: Whether to include a legend for labels
        legend_position: Position of the legend
        color: Color to highlight this region (default: blue)
        label: Optional label for this region in the legend
        width: Optional width for the output image in pixels
        crop: If True, crop the rendered image to this region's
                    bounding box (with a small margin handled inside
                    HighlightingService) before legends/overlays are added.

    Returns:
        PIL Image of the page with only this region highlighted
    """
    # Apply global options as defaults
    import natural_pdf

    if resolution is None:
        if natural_pdf.options.image.resolution is not None:
            resolution = natural_pdf.options.image.resolution
        else:
            resolution = 144  # Default resolution when none specified

    if not self._page:
        raise ValueError("Region must be associated with a page to show.")

    # Use the highlighting service via the page's property
    service = self._page._highlighter

    # Determine the label if not provided
    display_label = (
        label if label is not None else f"Region ({self.type})" if self.type else "Region"
    )

    # Prepare temporary highlight data for just this region
    temp_highlight_data = {
        "page_index": self._page.index,
        "bbox": self.bbox,
        "polygon": self.polygon if self.has_polygon else None,
        "color": color,  # Use provided or default color
        "label": display_label,
        "use_color_cycling": False,  # Explicitly false for single preview
    }

    # Determine crop bbox if requested
    crop_bbox = self.bbox if crop else None

    # Use render_preview to show only this highlight
    return service.render_preview(
        page_index=self._page.index,
        temporary_highlights=[temp_highlight_data],
        resolution=resolution,
        width=width,  # Pass the width parameter
        labels=labels,
        legend_position=legend_position,
        crop_bbox=crop_bbox,
    )
natural_pdf.Region.to_image(resolution=None, crop=False, include_highlights=True, **kwargs)

Generate an image of just this region.

Parameters:

- resolution (Optional[float], default None): Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI).
- crop (bool, default False): If True, only crop the region without highlighting its boundaries.
- include_highlights (bool, default True): Whether to include existing highlights.
- **kwargs: Additional parameters for page.to_image().

Returns:

- Image: PIL Image of just this region.

Source code in natural_pdf/elements/region.py
def to_image(
    self,
    resolution: Optional[float] = None,
    crop: bool = False,
    include_highlights: bool = True,
    **kwargs,
) -> "Image.Image":
    """
    Generate an image of just this region.

    Args:
        resolution: Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI)
        crop: If True, only crop the region without highlighting its boundaries
        include_highlights: Whether to include existing highlights (default: True)
        **kwargs: Additional parameters for page.to_image()

    Returns:
        PIL Image of just this region
    """
    # Apply global options as defaults
    import natural_pdf

    if resolution is None:
        if natural_pdf.options.image.resolution is not None:
            resolution = natural_pdf.options.image.resolution
        else:
            resolution = 144  # Default resolution when none specified

    # Handle the case where user wants the cropped region to have a specific width
    page_kwargs = kwargs.copy()
    effective_resolution = resolution  # Start with the provided resolution

    if crop and "width" in kwargs:
        target_width = kwargs["width"]
        # Calculate what resolution is needed to make the region crop have target_width
        region_width_points = self.width  # Region width in PDF points

        if region_width_points > 0:
            # Calculate scale needed: target_width / region_width_points
            required_scale = target_width / region_width_points
            # Convert scale to resolution: scale * 72 DPI
            effective_resolution = required_scale * 72.0
            page_kwargs.pop("width")  # Remove width parameter to avoid conflicts
            logger.debug(
                f"Region {self.bbox}: Calculated required resolution {effective_resolution:.1f} DPI for region crop width {target_width}"
            )
        else:
            logger.warning(
                f"Region {self.bbox}: Invalid region width {region_width_points}, using original resolution"
            )

    # First get the full page image with highlights if requested
    page_image = self._page.to_image(
        resolution=effective_resolution,
        include_highlights=include_highlights,
        **page_kwargs,
    )

    # Calculate the actual scale factor used by the page image
    if page_image.width > 0 and self._page.width > 0:
        scale_factor = page_image.width / self._page.width
    else:
        # Fallback to resolution-based calculation if dimensions are invalid
        scale_factor = resolution / 72.0

    # Apply scaling to the coordinates
    x0 = int(self.x0 * scale_factor)
    top = int(self.top * scale_factor)
    x1 = int(self.x1 * scale_factor)
    bottom = int(self.bottom * scale_factor)

    # Ensure coords are valid for cropping (left < right, top < bottom)
    if x0 >= x1:
        logger.warning(
            f"Region {self.bbox} resulted in non-positive width after scaling ({x0} >= {x1}). Cannot create image."
        )
        return None
    if top >= bottom:
        logger.warning(
            f"Region {self.bbox} resulted in non-positive height after scaling ({top} >= {bottom}). Cannot create image."
        )
        return None

    # Crop the image to just this region
    region_image = page_image.crop((x0, top, x1, bottom))

    # If not crop, add a border to highlight the region boundaries
    if not crop:
        from PIL import ImageDraw

        # Create a 1px border around the region
        draw = ImageDraw.Draw(region_image)
        draw.rectangle(
            (0, 0, region_image.width - 1, region_image.height - 1),
            outline=(255, 0, 0),
            width=1,
        )

    return region_image
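When crop=True and a target pixel width is passed, the method back-computes the DPI needed so the cropped region renders at exactly that width: scale = target_width / region_width_points, and DPI = scale × 72, since one PDF point is 1/72 inch. That calculation in isolation:

```python
def dpi_for_target_width(target_px, region_width_points):
    """DPI needed so a region of the given width in PDF points
    renders to target_px pixels (1 point = 1/72 inch)."""
    if region_width_points <= 0:
        raise ValueError("region width must be positive")
    scale = target_px / region_width_points  # pixels per point
    return scale * 72.0
```

For example, a 72-point-wide region rendered at 144 pixels needs a 2x scale, i.e. 144 DPI.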
natural_pdf.Region.to_text_element(text_content=None, source_label='derived_from_region', object_type='word', default_font_size=10.0, default_font_name='RegionContent', confidence=None, add_to_page=False)

Creates a new TextElement object based on this region's geometry.

The text for the new TextElement can be provided directly, generated by a callback function, or left as None.

Parameters:

- text_content (Optional[Union[str, Callable[[Region], Optional[str]]]], default None): If a string, this will be the text of the new TextElement. If a callable, it will be called with this region instance and its return value (a string or None) will be the text. If None (default), the TextElement's text will be None.
- source_label (str, default 'derived_from_region'): The 'source' attribute for the new TextElement.
- object_type (str, default 'word'): The 'object_type' for the TextElement's data dict (e.g., "word", "char").
- default_font_size (float, default 10.0): Placeholder font size if text is generated.
- default_font_name (str, default 'RegionContent'): Placeholder font name if text is generated.
- confidence (Optional[float], default None): Confidence score for the text. If text_content is None, defaults to 0.0. If text is provided/generated, defaults to 1.0 unless specified.
- add_to_page (bool, default False): If True, the created TextElement will be added to the region's parent page.

Returns:

- TextElement: A new TextElement instance.

Raises:

- ValueError: If the region does not have a valid 'page' attribute.

Source code in natural_pdf/elements/region.py
def to_text_element(
    self,
    text_content: Optional[Union[str, Callable[["Region"], Optional[str]]]] = None,
    source_label: str = "derived_from_region",
    object_type: str = "word",  # Or "char", controls how it's categorized
    default_font_size: float = 10.0,
    default_font_name: str = "RegionContent",
    confidence: Optional[float] = None,  # Allow overriding confidence
    add_to_page: bool = False,  # NEW: Option to add to page
) -> "TextElement":
    """
    Creates a new TextElement object based on this region's geometry.

    The text for the new TextElement can be provided directly,
    generated by a callback function, or left as None.

    Args:
        text_content:
            - If a string, this will be the text of the new TextElement.
            - If a callable, it will be called with this region instance
              and its return value (a string or None) will be the text.
            - If None (default), the TextElement's text will be None.
        source_label: The 'source' attribute for the new TextElement.
        object_type: The 'object_type' for the TextElement's data dict
                     (e.g., "word", "char").
        default_font_size: Placeholder font size if text is generated.
        default_font_name: Placeholder font name if text is generated.
        confidence: Confidence score for the text. If text_content is None,
                    defaults to 0.0. If text is provided/generated, defaults to 1.0
                    unless specified.
        add_to_page: If True, the created TextElement will be added to the
                     region's parent page. (Default: False)

    Returns:
        A new TextElement instance.

    Raises:
        ValueError: If the region does not have a valid 'page' attribute.
    """
    actual_text: Optional[str] = None
    if isinstance(text_content, str):
        actual_text = text_content
    elif callable(text_content):
        try:
            actual_text = text_content(self)
        except Exception as e:
            logger.error(
                f"Error executing text_content callback for region {self.bbox}: {e}",
                exc_info=True,
            )
            actual_text = None  # Ensure actual_text is None on error

    final_confidence = confidence
    if final_confidence is None:
        final_confidence = 1.0 if actual_text is not None and actual_text.strip() else 0.0

    if not hasattr(self, "page") or self.page is None:
        raise ValueError("Region must have a valid 'page' attribute to create a TextElement.")

    # Create character dictionaries for the text
    char_dicts = []
    if actual_text:
        # Create a single character dict that spans the entire region
        # This is a simplified approach - OCR engines typically create one per character
        char_dict = {
            "text": actual_text,
            "x0": self.x0,
            "top": self.top,
            "x1": self.x1,
            "bottom": self.bottom,
            "width": self.width,
            "height": self.height,
            "object_type": "char",
            "page_number": self.page.page_number,
            "fontname": default_font_name,
            "size": default_font_size,
            "upright": True,
            "direction": 1,
            "adv": self.width,
            "source": source_label,
            "confidence": final_confidence,
            "stroking_color": (0, 0, 0),
            "non_stroking_color": (0, 0, 0),
        }
        char_dicts.append(char_dict)

    elem_data = {
        "text": actual_text,
        "x0": self.x0,
        "top": self.top,
        "x1": self.x1,
        "bottom": self.bottom,
        "width": self.width,
        "height": self.height,
        "object_type": object_type,
        "page_number": self.page.page_number,
        "stroking_color": getattr(self, "stroking_color", (0, 0, 0)),
        "non_stroking_color": getattr(self, "non_stroking_color", (0, 0, 0)),
        "fontname": default_font_name,
        "size": default_font_size,
        "upright": True,
        "direction": 1,
        "adv": self.width,
        "source": source_label,
        "confidence": final_confidence,
        "_char_dicts": char_dicts,
    }
    text_element = TextElement(elem_data, self.page)

    if add_to_page:
        if hasattr(self.page, "_element_mgr") and self.page._element_mgr is not None:
            add_as_type = (
                "words"
                if object_type == "word"
                else "chars" if object_type == "char" else object_type
            )
            # REMOVED try-except block around add_element
            self.page._element_mgr.add_element(text_element, element_type=add_as_type)
            logger.debug(
                f"TextElement created from region {self.bbox} and added to page {self.page.page_number} as {add_as_type}."
            )
            # Also add character dictionaries to the chars collection
            if char_dicts and object_type == "word":
                for char_dict in char_dicts:
                    self.page._element_mgr.add_element(char_dict, element_type="chars")
        else:
            page_num_str = (
                str(self.page.page_number) if hasattr(self.page, "page_number") else "N/A"
            )
            logger.warning(
                f"Cannot add TextElement to page: Page {page_num_str} for region {self.bbox} is missing '_element_mgr'."
            )

    return text_element
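Two of the defaulting rules above are easy to miss: text_content may be a plain string, a callback taking the region, or None; and when no confidence is given, non-empty text gets 1.0 while missing or whitespace-only text gets 0.0. The sketch below mirrors that logic in isolation (the helper names are ours, not library API):

```python
from typing import Callable, Optional, Union

# Illustrative helpers mirroring the defaulting rules in to_text_element.
def resolve_text(
    text_content: Optional[Union[str, Callable]], region
) -> Optional[str]:
    if isinstance(text_content, str):
        return text_content
    if callable(text_content):
        return text_content(region)  # the real callback receives the Region
    return None

def resolve_confidence(text: Optional[str], confidence: Optional[float] = None) -> float:
    if confidence is not None:
        return confidence  # an explicit value always wins
    return 1.0 if text is not None and text.strip() else 0.0

region = object()  # stand-in for a Region instance
assert resolve_text("Header", region) == "Header"
assert resolve_text(lambda r: "OCR'd text", region) == "OCR'd text"
assert resolve_confidence("Header") == 1.0
assert resolve_confidence("   ") == 0.0  # whitespace-only counts as empty
```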
natural_pdf.Region.trim(padding=1, threshold=0.95, resolution=None, pre_shrink=0.5)

Trim visual whitespace from the edges of this region.

Similar to Python's string .strip() method, but for visual whitespace in the region image. Uses pixel analysis to detect rows/columns that are predominantly whitespace.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| padding | int | Number of pixels to keep as padding after trimming. | 1 |
| threshold | float | Threshold for considering a row/column as whitespace (0.0-1.0). Higher values mean stricter whitespace detection; e.g., 0.95 means a row/column counts as whitespace if 95% of its pixels are white. | 0.95 |
| resolution | Optional[float] | Resolution for image rendering in DPI. Defaults to the global options value, falling back to 144 DPI. | None |
| pre_shrink | float | Amount to shrink the region before trimming, then expand back afterwards. This helps avoid detecting box borders/slivers as content. | 0.5 |

Returns:

A new Region with visual whitespace trimmed from all edges.

Examples:

```python
# Basic trimming with 1 pixel padding and 0.5px pre-shrink
trimmed = region.trim()

# More aggressive trimming with no padding and no pre-shrink
tight = region.trim(padding=0, threshold=0.9, pre_shrink=0)

# Conservative trimming with more padding
loose = region.trim(padding=3, threshold=0.98)
```
Source code in natural_pdf/elements/region.py
def trim(
    self,
    padding: int = 1,
    threshold: float = 0.95,
    resolution: Optional[float] = None,
    pre_shrink: float = 0.5,
) -> "Region":
    """
    Trim visual whitespace from the edges of this region.

    Similar to Python's string .strip() method, but for visual whitespace in the region image.
    Uses pixel analysis to detect rows/columns that are predominantly whitespace.

    Args:
        padding: Number of pixels to keep as padding after trimming (default: 1)
        threshold: Threshold for considering a row/column as whitespace (0.0-1.0, default: 0.95)
                  Higher values mean more strict whitespace detection.
                  E.g., 0.95 means if 95% of pixels in a row/column are white, consider it whitespace.
        resolution: Resolution for image rendering in DPI (default: uses global options, fallback to 144 DPI)
        pre_shrink: Amount to shrink region before trimming, then expand back after (default: 0.5)
                   This helps avoid detecting box borders/slivers as content.

    Returns
    -------

    New Region with visual whitespace trimmed from all edges

    Examples
    --------

    ```python
    # Basic trimming with 1 pixel padding and 0.5px pre-shrink
    trimmed = region.trim()

    # More aggressive trimming with no padding and no pre-shrink
    tight = region.trim(padding=0, threshold=0.9, pre_shrink=0)

    # Conservative trimming with more padding
    loose = region.trim(padding=3, threshold=0.98)
    ```
    """
    # Apply global options as defaults
    import natural_pdf

    if resolution is None:
        if natural_pdf.options.image.resolution is not None:
            resolution = natural_pdf.options.image.resolution
        else:
            resolution = 144  # Default resolution when none specified

    # Pre-shrink the region to avoid box slivers
    work_region = (
        self.expand(left=-pre_shrink, right=-pre_shrink, top=-pre_shrink, bottom=-pre_shrink)
        if pre_shrink > 0
        else self
    )

    # Get the region image
    image = work_region.to_image(resolution=resolution, crop=True, include_highlights=False)

    if image is None:
        logger.warning(
            f"Region {self.bbox}: Could not generate image for trimming. Returning original region."
        )
        return self

    # Convert to grayscale for easier analysis
    import numpy as np

    # Convert PIL image to numpy array
    img_array = np.array(image.convert("L"))  # Convert to grayscale
    height, width = img_array.shape

    if height == 0 or width == 0:
        logger.warning(
            f"Region {self.bbox}: Image has zero dimensions. Returning original region."
        )
        return self

    # Normalize pixel values to 0-1 range (255 = white = 1.0, 0 = black = 0.0)
    normalized = img_array.astype(np.float32) / 255.0

    # Find content boundaries by analyzing row and column averages

    # Analyze rows (horizontal strips) to find top and bottom boundaries
    row_averages = np.mean(normalized, axis=1)  # Average each row
    content_rows = row_averages < threshold  # True where there's content (not whitespace)

    # Find first and last rows with content
    content_row_indices = np.where(content_rows)[0]
    if len(content_row_indices) == 0:
        # No content found, return a minimal region at the center
        logger.warning(
            f"Region {self.bbox}: No content detected during trimming. Returning center point."
        )
        center_x = (self.x0 + self.x1) / 2
        center_y = (self.top + self.bottom) / 2
        return Region(self.page, (center_x, center_y, center_x, center_y))

    top_content_row = max(0, content_row_indices[0] - padding)
    bottom_content_row = min(height - 1, content_row_indices[-1] + padding)

    # Analyze columns (vertical strips) to find left and right boundaries
    col_averages = np.mean(normalized, axis=0)  # Average each column
    content_cols = col_averages < threshold  # True where there's content

    content_col_indices = np.where(content_cols)[0]
    if len(content_col_indices) == 0:
        # No content found in columns either
        logger.warning(
            f"Region {self.bbox}: No column content detected during trimming. Returning center point."
        )
        center_x = (self.x0 + self.x1) / 2
        center_y = (self.top + self.bottom) / 2
        return Region(self.page, (center_x, center_y, center_x, center_y))

    left_content_col = max(0, content_col_indices[0] - padding)
    right_content_col = min(width - 1, content_col_indices[-1] + padding)

    # Convert trimmed pixel coordinates back to PDF coordinates
    scale_factor = resolution / 72.0  # Scale factor used in to_image()

    # Calculate new PDF coordinates and ensure they are Python floats
    trimmed_x0 = float(work_region.x0 + (left_content_col / scale_factor))
    trimmed_top = float(work_region.top + (top_content_row / scale_factor))
    trimmed_x1 = float(
        work_region.x0 + ((right_content_col + 1) / scale_factor)
    )  # +1 because we want inclusive right edge
    trimmed_bottom = float(
        work_region.top + ((bottom_content_row + 1) / scale_factor)
    )  # +1 because we want inclusive bottom edge

    # Ensure the trimmed region doesn't exceed the work region boundaries
    final_x0 = max(work_region.x0, trimmed_x0)
    final_top = max(work_region.top, trimmed_top)
    final_x1 = min(work_region.x1, trimmed_x1)
    final_bottom = min(work_region.bottom, trimmed_bottom)

    # Ensure valid coordinates (width > 0, height > 0)
    if final_x1 <= final_x0 or final_bottom <= final_top:
        logger.warning(
            f"Region {self.bbox}: Trimming resulted in invalid dimensions. Returning original region."
        )
        return self

    # Create the trimmed region
    trimmed_region = Region(self.page, (final_x0, final_top, final_x1, final_bottom))

    # Expand back by the pre_shrink amount to restore original positioning
    if pre_shrink > 0:
        trimmed_region = trimmed_region.expand(
            left=pre_shrink, right=pre_shrink, top=pre_shrink, bottom=pre_shrink
        )

    # Copy relevant metadata
    trimmed_region.region_type = self.region_type
    trimmed_region.normalized_type = self.normalized_type
    trimmed_region.confidence = self.confidence
    trimmed_region.model = self.model
    trimmed_region.name = self.name
    trimmed_region.label = self.label
    trimmed_region.source = "trimmed"  # Indicate this is a derived region
    trimmed_region.parent_region = self

    logger.debug(
        f"Region {self.bbox}: Trimmed to {trimmed_region.bbox} (padding={padding}, threshold={threshold}, pre_shrink={pre_shrink})"
    )
    return trimmed_region
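The core of trim() is the row/column brightness test: on a grayscale image normalized to [0, 1], any row or column whose mean falls below the threshold contains content, and everything outside the first and last such row/column is trimmed. A toy version on a synthetic array (the function name and values are illustrative, not library API):

```python
import numpy as np

# Toy version of the pixel analysis in trim(): rows/columns whose mean
# brightness is below the threshold are treated as content.
def content_bounds(gray: np.ndarray, threshold: float = 0.95):
    norm = gray.astype(np.float32) / 255.0  # 255 = white = 1.0
    rows = np.where(np.mean(norm, axis=1) < threshold)[0]
    cols = np.where(np.mean(norm, axis=0) < threshold)[0]
    if len(rows) == 0 or len(cols) == 0:
        return None  # all whitespace, nothing to keep
    # (left, top, right, bottom) in pixel coordinates, inclusive
    return (int(cols[0]), int(rows[0]), int(cols[-1]), int(rows[-1]))

# A white 10x10 image with a dark block at rows 2-4, columns 3-6.
img = np.full((10, 10), 255, dtype=np.uint8)
img[2:5, 3:7] = 0
print(content_bounds(img))  # (3, 2, 6, 4)
```

The real method then maps these pixel bounds back to PDF coordinates by dividing by the resolution/72 scale factor, as shown in the source above.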

Functions

natural_pdf.configure_logging(level=logging.INFO, handler=None)

Configure logging for the natural_pdf package.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| level | | Logging level (e.g., logging.INFO, logging.DEBUG). | logging.INFO |
| handler | | Optional custom handler. Defaults to a StreamHandler. | None |
Source code in natural_pdf/__init__.py
def configure_logging(level=logging.INFO, handler=None):
    """Configure logging for the natural_pdf package.

    Args:
        level: Logging level (e.g., logging.INFO, logging.DEBUG)
        handler: Optional custom handler. Defaults to a StreamHandler.
    """
    # Avoid adding duplicate handlers
    if any(isinstance(h, logging.StreamHandler) for h in logger.handlers):
        return

    if handler is None:
        handler = logging.StreamHandler()
        formatter = logging.Formatter("%(name)s - %(levelname)s - %(message)s")
        handler.setFormatter(formatter)

    logger.addHandler(handler)
    logger.setLevel(level)

    logger.propagate = False
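In application code you would simply call natural_pdf.configure_logging(level=logging.DEBUG). The standalone sketch below reproduces the same wiring with the standard library so the effect is visible in isolation (the logger name "natural_pdf_demo" is ours, chosen to avoid touching the real package logger):

```python
import logging

# Same wiring as configure_logging above, on a throwaway logger.
logger = logging.getLogger("natural_pdf_demo")

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(name)s - %(levelname)s - %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep messages out of the root logger

logger.debug("layout analysis started")  # now visible at DEBUG level
```

Note the duplicate-handler guard in the real function: a second call returns early if a StreamHandler is already attached, so set the level you want on the first call.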