API Reference

This section provides detailed documentation for all the classes and methods in Natural PDF.

Core Classes

natural_pdf

Natural PDF - A more intuitive interface for working with PDFs.

Classes

natural_pdf.ConfigSection

A configuration section that holds key-value option pairs.

Source code in natural_pdf/__init__.py
class ConfigSection:
    """A configuration section that holds key-value option pairs."""

    def __init__(self, **defaults):
        self.__dict__.update(defaults)

    def __repr__(self):
        items = [f"{k}={v!r}" for k, v in self.__dict__.items()]
        return f"{self.__class__.__name__}({', '.join(items)})"
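
Example

A minimal sketch of how a ConfigSection behaves (the option names dpi and verbose are illustrative, not defaults defined by the library):

section = ConfigSection(dpi=150, verbose=False)
section.dpi        # 150
section.verbose    # False
repr(section)      # "ConfigSection(dpi=150, verbose=False)"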
natural_pdf.Flow

Bases: Visualizable

Defines a logical flow or sequence of physical Page or Region objects.

A Flow represents a continuous logical document structure that spans multiple pages or regions, enabling operations on content that flows across boundaries. This is essential for handling multi-page tables, articles that span columns, or any content that requires a consistent reading order across segments.

Flows specify arrangement (vertical/horizontal) and alignment rules to create a unified coordinate system for element extraction and text processing. They enable natural-pdf to treat fragmented content as a single continuous area for analysis and extraction operations.

The Flow system is particularly useful for:

- Multi-page tables that break across page boundaries
- Multi-column articles with complex reading order
- Forms that span multiple pages
- Any content requiring logical continuation across segments

Attributes:

segments (List[Region]): List of Page or Region objects in flow order.

arrangement (Literal['vertical', 'horizontal']): Primary flow direction ('vertical' or 'horizontal').

alignment (Literal['start', 'center', 'end', 'top', 'left', 'bottom', 'right']): Cross-axis alignment for segments of different sizes.

segment_gap (float): Virtual gap between segments in PDF points.

Example

Multi-page table flow:

import natural_pdf as npdf
from natural_pdf import Flow

pdf = npdf.PDF("multi_page_table.pdf")

# Create flow for table spanning pages 2-4
table_flow = Flow(
    segments=[pdf.pages[1], pdf.pages[2], pdf.pages[3]],
    arrangement='vertical',
    alignment='left',
    segment_gap=10.0
)

# Extract table as if it were continuous
table_data = table_flow.extract_table()
text_content = table_flow.get_text()

Multi-column article flow:

page = pdf.pages[0]
left_column = page.region(0, 0, 300, page.height)
right_column = page.region(320, 0, page.width, page.height)

# Create horizontal flow for columns
article_flow = Flow(
    segments=[left_column, right_column],
    arrangement='horizontal',
    alignment='top'
)

# Read in proper order
article_text = article_flow.get_text()

Note

Flows create virtual coordinate systems that map element positions across segments, enabling spatial navigation and element selection to work seamlessly across boundaries.
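
For example, element search works across segment boundaries through the same find/find_all interface shown in the source below (a minimal sketch; the search text is illustrative and table_flow is the flow from the example above):

# Search every segment; matches come back as FlowElements anchored to this Flow
totals = table_flow.find_all(text="Total", case=False)
first_total = table_flow.find(text="Total")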

Source code in natural_pdf/flows/flow.py
class Flow(Visualizable):
    """Defines a logical flow or sequence of physical Page or Region objects.

    A Flow represents a continuous logical document structure that spans
    multiple pages or regions, enabling operations on content that flows across
    boundaries. This is essential for handling multi-page tables, articles that
    span columns, or any content that requires a consistent reading order across
    segments.

    Flows specify arrangement (vertical/horizontal) and alignment rules to create
    a unified coordinate system for element extraction and text processing. They
    enable natural-pdf to treat fragmented content as a single continuous area
    for analysis and extraction operations.

    The Flow system is particularly useful for:
    - Multi-page tables that break across page boundaries
    - Multi-column articles with complex reading order
    - Forms that span multiple pages
    - Any content requiring logical continuation across segments

    Attributes:
        segments: List of Page or Region objects in flow order.
        arrangement: Primary flow direction ('vertical' or 'horizontal').
        alignment: Cross-axis alignment for segments of different sizes.
        segment_gap: Virtual gap between segments in PDF points.

    Example:
        Multi-page table flow:
        ```python
        pdf = npdf.PDF("multi_page_table.pdf")

        # Create flow for table spanning pages 2-4
        table_flow = Flow(
            segments=[pdf.pages[1], pdf.pages[2], pdf.pages[3]],
            arrangement='vertical',
            alignment='left',
            segment_gap=10.0
        )

        # Extract table as if it were continuous
        table_data = table_flow.extract_table()
        text_content = table_flow.get_text()
        ```

        Multi-column article flow:
        ```python
        page = pdf.pages[0]
        left_column = page.region(0, 0, 300, page.height)
        right_column = page.region(320, 0, page.width, page.height)

        # Create horizontal flow for columns
        article_flow = Flow(
            segments=[left_column, right_column],
            arrangement='horizontal',
            alignment='top'
        )

        # Read in proper order
        article_text = article_flow.get_text()
        ```

    Note:
        Flows create virtual coordinate systems that map element positions across
        segments, enabling spatial navigation and element selection to work
        seamlessly across boundaries.
    """

    def __init__(
        self,
        segments: Union[List[Union["Page", "PhysicalRegion"]], "PageCollection"],
        arrangement: Literal["vertical", "horizontal"],
        alignment: Literal["start", "center", "end", "top", "left", "bottom", "right"] = "start",
        segment_gap: float = 0.0,
    ):
        """
        Initializes a Flow object.

        Args:
            segments: An ordered list of natural_pdf.core.page.Page or
                      natural_pdf.elements.region.Region objects that constitute the flow,
                      or a PageCollection containing pages.
            arrangement: The primary direction of the flow.
                         - "vertical": Segments are stacked top-to-bottom.
                         - "horizontal": Segments are arranged left-to-right.
            alignment: How segments are aligned on their cross-axis if they have
                       differing dimensions. For a "vertical" arrangement:
                       - "left" (or "start"): Align left edges.
                       - "center": Align centers.
                       - "right" (or "end"): Align right edges.
                       For a "horizontal" arrangement:
                       - "top" (or "start"): Align top edges.
                       - "center": Align centers.
                       - "bottom" (or "end"): Align bottom edges.
            segment_gap: The virtual gap (in PDF points) between segments.
        """
        # Handle PageCollection input
        if hasattr(segments, "pages"):  # It's a PageCollection
            segments = list(segments.pages)

        if not segments:
            raise ValueError("Flow segments cannot be empty.")
        if arrangement not in ["vertical", "horizontal"]:
            raise ValueError("Arrangement must be 'vertical' or 'horizontal'.")

        self.segments: List["PhysicalRegion"] = self._normalize_segments(segments)
        self.arrangement: Literal["vertical", "horizontal"] = arrangement
        self.alignment: Literal["start", "center", "end", "top", "left", "bottom", "right"] = (
            alignment
        )
        self.segment_gap: float = segment_gap

        self._validate_alignment()

        # TODO: Pre-calculate segment offsets for faster lookups if needed

    def _normalize_segments(
        self, segments: List[Union["Page", "PhysicalRegion"]]
    ) -> List["PhysicalRegion"]:
        """Converts all Page segments to full-page Region objects for uniform processing."""
        normalized = []
        from natural_pdf.core.page import Page as CorePage
        from natural_pdf.elements.region import Region as ElementsRegion

        for i, segment in enumerate(segments):
            if isinstance(segment, CorePage):
                normalized.append(segment.region(0, 0, segment.width, segment.height))
            elif isinstance(segment, ElementsRegion):
                normalized.append(segment)
            elif hasattr(segment, "object_type") and segment.object_type == "page":
                if not isinstance(segment, CorePage):
                    raise TypeError(
                        f"Segment {i} has object_type 'page' but is not an instance of natural_pdf.core.page.Page. Got {type(segment)}"
                    )
                normalized.append(segment.region(0, 0, segment.width, segment.height))
            elif hasattr(segment, "object_type") and segment.object_type == "region":
                if not isinstance(segment, ElementsRegion):
                    raise TypeError(
                        f"Segment {i} has object_type 'region' but is not an instance of natural_pdf.elements.region.Region. Got {type(segment)}"
                    )
                normalized.append(segment)
            else:
                raise TypeError(
                    f"Segment {i} is not a valid Page or Region object. Got {type(segment)}."
                )
        return normalized

    def _validate_alignment(self) -> None:
        """Validates the alignment based on the arrangement."""
        valid_alignments = {
            "vertical": ["start", "center", "end", "left", "right"],
            "horizontal": ["start", "center", "end", "top", "bottom"],
        }
        if self.alignment not in valid_alignments[self.arrangement]:
            raise ValueError(
                f"Invalid alignment '{self.alignment}' for '{self.arrangement}' arrangement. "
                f"Valid options are: {valid_alignments[self.arrangement]}"
            )

    def _get_highlighter(self):
        """Get the highlighting service from the first segment."""
        if not self.segments:
            raise RuntimeError("Flow has no segments to get highlighter from")

        # Get highlighter from first segment
        first_segment = self.segments[0]
        if hasattr(first_segment, "_highlighter"):
            return first_segment._highlighter
        elif hasattr(first_segment, "page") and hasattr(first_segment.page, "_highlighter"):
            return first_segment.page._highlighter
        else:
            raise RuntimeError(
                f"Cannot find HighlightingService from Flow segments. "
                f"First segment type: {type(first_segment).__name__}"
            )

    def show(
        self,
        *,
        # Basic rendering options
        resolution: Optional[float] = None,
        width: Optional[int] = None,
        # Highlight options
        color: Optional[Union[str, Tuple[int, int, int]]] = None,
        labels: bool = True,
        label_format: Optional[str] = None,
        highlights: Optional[List[Dict[str, Any]]] = None,
        # Layout options for multi-page/region
        layout: Literal["stack", "grid", "single"] = "stack",
        stack_direction: Literal["vertical", "horizontal"] = "vertical",
        gap: int = 5,
        columns: Optional[int] = None,  # For grid layout
        # Cropping options
        crop: Union[bool, Literal["content"]] = False,
        crop_bbox: Optional[Tuple[float, float, float, float]] = None,
        # Flow-specific options
        in_context: bool = False,
        separator_color: Optional[Tuple[int, int, int]] = None,
        separator_thickness: int = 2,
        **kwargs,
    ) -> Optional["PIL_Image"]:
        """Generate a preview image with highlights.

        If in_context=True, shows segments as cropped images stacked together
        with separators between segments.

        Args:
            resolution: DPI for rendering (default from global settings)
            width: Target width in pixels (overrides resolution)
            color: Default highlight color
            labels: Whether to show labels for highlights
            label_format: Format string for labels
            highlights: Additional highlight groups to show
            layout: How to arrange multiple pages/regions
            stack_direction: Direction for stack layout
            gap: Pixels between stacked images
            columns: Number of columns for grid layout
            crop: Whether to crop
            crop_bbox: Explicit crop bounds
            in_context: If True, use special Flow visualization with separators
            separator_color: RGB color for separator lines (default: red)
            separator_thickness: Thickness of separator lines
            **kwargs: Additional parameters passed to rendering

        Returns:
            PIL Image object or None if nothing to render
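
        Example:
            Flow-specific preview with separators (a minimal sketch; assumes the
            table_flow object from the class docstring):
            ```python
            img = table_flow.show(in_context=True, separator_color=(255, 0, 0))
            ```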
        """
        if in_context:
            # Use the special in_context visualization
            return self._show_in_context(
                resolution=resolution or 150,
                width=width,
                stack_direction=stack_direction,
                stack_gap=gap,
                separator_color=separator_color or (255, 0, 0),
                separator_thickness=separator_thickness,
                **kwargs,
            )

        # Otherwise use the standard show method
        return super().show(
            resolution=resolution,
            width=width,
            color=color,
            labels=labels,
            label_format=label_format,
            highlights=highlights,
            layout=layout,
            stack_direction=stack_direction,
            gap=gap,
            columns=columns,
            crop=crop,
            crop_bbox=crop_bbox,
            **kwargs,
        )

    def find(
        self,
        selector: Optional[str] = None,
        *,
        text: Optional[str] = None,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional["FlowElement"]:
        """
        Finds the first element within the flow that matches the given selector or text criteria.

        Elements found are wrapped as FlowElement objects, anchored to this Flow.

        Args:
            selector: CSS-like selector string.
            text: Text content to search for.
            apply_exclusions: Whether to respect exclusion zones on the original pages/regions.
            regex: Whether the text search uses regex.
            case: Whether the text search is case-sensitive.
            **kwargs: Additional filter parameters for the underlying find operation.

        Returns:
            A FlowElement if a match is found, otherwise None.
        """
        results = self.find_all(
            selector=selector,
            text=text,
            apply_exclusions=apply_exclusions,
            regex=regex,
            case=case,
            **kwargs,
        )
        return results.first if results else None

    def find_all(
        self,
        selector: Optional[str] = None,
        *,
        text: Optional[str] = None,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "FlowElementCollection":
        """
        Finds all elements within the flow that match the given selector or text criteria.

        This method efficiently groups segments by their parent pages, searches at the page level,
        then filters results appropriately for each segment. This ensures elements that intersect
        with flow segments (but aren't fully contained) are still found.

        Elements found are wrapped as FlowElement objects, anchored to this Flow,
        and returned in a FlowElementCollection.
        """
        from .collections import FlowElementCollection
        from .element import FlowElement

        # Step 1: Group segments by their parent pages (like in analyze_layout)
        segments_by_page = {}  # Dict[Page, List[Segment]]

        for i, segment in enumerate(self.segments):
            # Determine the page for this segment - fix type detection
            if hasattr(segment, "page") and hasattr(segment.page, "find_all"):
                # It's a Region object (has a parent page)
                page_obj = segment.page
                segment_type = "region"
            elif (
                hasattr(segment, "find_all")
                and hasattr(segment, "width")
                and hasattr(segment, "height")
                and not hasattr(segment, "page")
            ):
                # It's a Page object (has find_all but no parent page)
                page_obj = segment
                segment_type = "page"
            else:
                logger.warning(f"Segment {i+1} does not support find_all, skipping")
                continue

            if page_obj not in segments_by_page:
                segments_by_page[page_obj] = []
            segments_by_page[page_obj].append((segment, segment_type))

        if not segments_by_page:
            logger.warning("No segments with searchable pages found")
            return FlowElementCollection([])

        # Step 2: Search each unique page only once
        all_flow_elements: List["FlowElement"] = []

        for page_obj, page_segments in segments_by_page.items():
            # Find all matching elements on this page
            page_matches = page_obj.find_all(
                selector=selector,
                text=text,
                apply_exclusions=apply_exclusions,
                regex=regex,
                case=case,
                **kwargs,
            )

            if not page_matches:
                continue

            # Step 3: For each segment on this page, collect relevant elements
            for segment, segment_type in page_segments:
                if segment_type == "page":
                    # Full page segment: include all elements
                    for phys_elem in page_matches.elements:
                        all_flow_elements.append(FlowElement(physical_object=phys_elem, flow=self))

                elif segment_type == "region":
                    # Region segment: filter to only intersecting elements
                    for phys_elem in page_matches.elements:
                        try:
                            # Check if element intersects with this flow segment
                            if segment.intersects(phys_elem):
                                all_flow_elements.append(
                                    FlowElement(physical_object=phys_elem, flow=self)
                                )
                        except Exception as intersect_error:
                            logger.debug(
                                f"Error checking intersection for element: {intersect_error}"
                            )
                            # Include the element anyway if intersection check fails
                            all_flow_elements.append(
                                FlowElement(physical_object=phys_elem, flow=self)
                            )

        # Step 4: Remove duplicates (can happen if multiple segments intersect the same element)
        unique_flow_elements = []
        seen_element_ids = set()

        for flow_elem in all_flow_elements:
            # Create a unique identifier for the underlying physical element
            phys_elem = flow_elem.physical_object
            elem_id = (
                (
                    getattr(phys_elem.page, "index", id(phys_elem.page))
                    if hasattr(phys_elem, "page")
                    else id(phys_elem)
                ),
                phys_elem.bbox if hasattr(phys_elem, "bbox") else id(phys_elem),
            )

            if elem_id not in seen_element_ids:
                unique_flow_elements.append(flow_elem)
                seen_element_ids.add(elem_id)

        return FlowElementCollection(unique_flow_elements)

    def __repr__(self) -> str:
        return (
            f"<Flow segments={len(self.segments)}, "
            f"arrangement='{self.arrangement}', alignment='{self.alignment}', gap={self.segment_gap}>"
        )

    @overload
    def extract_table(
        self,
        method: Optional[str] = None,
        table_settings: Optional[dict] = None,
        use_ocr: bool = False,
        ocr_config: Optional[dict] = None,
        text_options: Optional[dict] = None,
        cell_extraction_func: Optional[Any] = None,
        show_progress: bool = False,
        content_filter: Optional[Any] = None,
        stitch_rows: Optional[Callable[[List[Optional[str]]], bool]] = None,
    ) -> TableResult: ...

    @overload
    def extract_table(
        self,
        method: Optional[str] = None,
        table_settings: Optional[dict] = None,
        use_ocr: bool = False,
        ocr_config: Optional[dict] = None,
        text_options: Optional[dict] = None,
        cell_extraction_func: Optional[Any] = None,
        show_progress: bool = False,
        content_filter: Optional[Any] = None,
        stitch_rows: Optional[
            Callable[
                [List[Optional[str]], List[Optional[str]], int, Union["Page", "PhysicalRegion"]],
                bool,
            ]
        ] = None,
    ) -> TableResult: ...

    def extract_table(
        self,
        method: Optional[str] = None,
        table_settings: Optional[dict] = None,
        use_ocr: bool = False,
        ocr_config: Optional[dict] = None,
        text_options: Optional[dict] = None,
        cell_extraction_func: Optional[Any] = None,
        show_progress: bool = False,
        content_filter: Optional[Any] = None,
        stitch_rows: Optional[Callable] = None,
        merge_headers: Optional[bool] = None,
    ) -> TableResult:
        """
        Extract table data from all segments in the flow, combining results sequentially.

        This method extracts table data from each segment in flow order and combines
        the results into a single logical table. This is particularly useful for
        multi-page tables or tables that span across columns.

        Args:
            method: Method to use: 'tatr', 'pdfplumber', 'text', 'stream', 'lattice', or None (auto-detect).
            table_settings: Settings for pdfplumber table extraction.
            use_ocr: Whether to use OCR for text extraction (currently only applicable with 'tatr' method).
            ocr_config: OCR configuration parameters.
            text_options: Dictionary of options for the 'text' method.
            cell_extraction_func: Optional callable function that takes a cell Region object
                                  and returns its string content. For 'text' method only.
            show_progress: If True, display a progress bar during cell text extraction for the 'text' method.
            content_filter: Optional content filter to apply during cell text extraction.
            merge_headers: Whether to merge tables by removing repeated headers from subsequent
                segments. If None (default), auto-detects by checking if the first row
                of each segment matches the first row of the first segment. If segments have
                inconsistent header patterns (some repeat, others don't), raises ValueError.
                Useful for multi-page tables where headers repeat on each page.
            stitch_rows: Optional callable to determine when rows should be merged across
                         segment boundaries. Applied AFTER header removal if merge_headers
                         is enabled. Two overloaded signatures are supported:

                         • func(current_row) -> bool
                           Called only on the first row of each segment (after the first).
                           Return True to merge this first row with the last row from
                           the previous segment.

                         • func(prev_row, current_row, row_index, segment) -> bool
                           Called for every row. Return True to merge current_row with
                           the previous row in the aggregated results.

                         When True is returned, rows are concatenated cell-by-cell.
                         This is useful for handling table rows split across page
                         boundaries or segments. If None, rows are never merged.

        Returns:
            TableResult object containing the aggregated table data from all segments.

        Example:
            Multi-page table extraction:
            ```python
            pdf = npdf.PDF("multi_page_table.pdf")

            # Create flow for table spanning pages 2-4
            table_flow = Flow(
                segments=[pdf.pages[1], pdf.pages[2], pdf.pages[3]],
                arrangement='vertical'
            )

            # Extract table as if it were continuous
            table_data = table_flow.extract_table()
            df = table_data.df  # Convert to pandas DataFrame

            # Custom row stitching - single parameter (simple case)
            table_data = table_flow.extract_table(
                stitch_rows=lambda row: row and not (row[0] or "").strip()
            )

            # Custom row stitching - full parameters (advanced case)
            table_data = table_flow.extract_table(
                stitch_rows=lambda prev, curr, idx, seg: idx == 0 and curr and not (curr[0] or "").strip()
            )
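
            # Explicit header handling: merge_headers=True removes repeated
            # headers from every segment after the first
            table_data = table_flow.extract_table(merge_headers=True)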
            ```
        """
        logger.info(
            f"Extracting table from Flow with {len(self.segments)} segments (method: {method or 'auto'})"
        )

        if not self.segments:
            logger.warning("Flow has no segments, returning empty table")
            return TableResult([])

        # Resolve predicate and determine its signature
        predicate: Optional[Callable] = None
        predicate_type: str = "none"

        if callable(stitch_rows):
            import inspect

            sig = inspect.signature(stitch_rows)
            param_count = len(sig.parameters)

            if param_count == 1:
                predicate = stitch_rows
                predicate_type = "single_param"
            elif param_count == 4:
                predicate = stitch_rows
                predicate_type = "full_params"
            else:
                logger.warning(
                    f"stitch_rows function has {param_count} parameters, expected 1 or 4. Ignoring."
                )
                predicate = None
                predicate_type = "none"

        def _default_merge(
            prev_row: List[Optional[str]], cur_row: List[Optional[str]]
        ) -> List[Optional[str]]:
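            # Merge two physical rows into one logical row, cell by cell:
            # join with a space when both cells have text, otherwise keep
            # whichever side is non-empty.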
            from itertools import zip_longest

            merged: List[Optional[str]] = []
            for p, c in zip_longest(prev_row, cur_row, fillvalue=""):
                if (p or "").strip() and (c or "").strip():
                    merged.append(f"{p} {c}".strip())
                else:
                    merged.append((p or "") + (c or ""))
            return merged

        aggregated_rows: List[List[Optional[str]]] = []
        processed_segments = 0
        header_row: Optional[List[Optional[str]]] = None
        merge_headers_enabled = False
        headers_warned = False  # Track if we've already warned about dropping headers
        segment_has_repeated_header = []  # Track which segments have repeated headers

        for seg_idx, segment in enumerate(self.segments):
            try:
                logger.debug(f"  Extracting table from segment {seg_idx+1}/{len(self.segments)}")

                segment_result = segment.extract_table(
                    method=method,
                    table_settings=table_settings.copy() if table_settings else None,
                    use_ocr=use_ocr,
                    ocr_config=ocr_config,
                    text_options=text_options.copy() if text_options else None,
                    cell_extraction_func=cell_extraction_func,
                    show_progress=show_progress,
                    content_filter=content_filter,
                )

                if not segment_result:
                    continue

                if hasattr(segment_result, "_rows"):
                    segment_rows = list(segment_result._rows)
                else:
                    segment_rows = list(segment_result)

                if not segment_rows:
                    logger.debug(f"    No table data found in segment {seg_idx+1}")
                    continue

                # Handle header detection and merging for multi-page tables
                if seg_idx == 0:
                    # First segment: capture potential header row
                    if segment_rows:
                        header_row = segment_rows[0]
                        # Determine if we should merge headers
                        if merge_headers is None:
                            # Auto-detect: we'll check all subsequent segments
                            merge_headers_enabled = False  # Will be determined later
                        else:
                            merge_headers_enabled = merge_headers
                        # Track that first segment exists (for consistency checking)
                        segment_has_repeated_header.append(False)  # First segment doesn't "repeat"
                elif seg_idx == 1 and merge_headers is None:
                    # Auto-detection: check if first row of second segment matches header
                    has_header = segment_rows and header_row and segment_rows[0] == header_row
                    segment_has_repeated_header.append(has_header)

                    if has_header:
                        merge_headers_enabled = True
                        # Remove the detected repeated header from this segment
                        segment_rows = segment_rows[1:]
                        logger.debug(
                            f"    Auto-detected repeated header in segment {seg_idx+1}, removed"
                        )
                        if not headers_warned:
                            warnings.warn(
                                "Detected repeated headers in multi-page table. Merging by removing "
                                "repeated headers from subsequent pages.",
                                UserWarning,
                                stacklevel=2,
                            )
                            headers_warned = True
                    else:
                        merge_headers_enabled = False
                        logger.debug(f"    No repeated header detected in segment {seg_idx+1}")
                elif seg_idx > 1:
                    # Check consistency: all segments should have same pattern
                    has_header = segment_rows and header_row and segment_rows[0] == header_row
                    segment_has_repeated_header.append(has_header)

                    # Remove header if merging is enabled and header is present
                    if merge_headers_enabled and has_header:
                        segment_rows = segment_rows[1:]
                        logger.debug(f"    Removed repeated header from segment {seg_idx+1}")
                elif seg_idx > 0 and merge_headers_enabled:
                    # Explicit merge_headers=True: remove headers from subsequent segments
                    if segment_rows and header_row and segment_rows[0] == header_row:
                        segment_rows = segment_rows[1:]
                        logger.debug(f"    Removed repeated header from segment {seg_idx+1}")
                        if not headers_warned:
                            warnings.warn(
                                "Removing repeated headers from multi-page table during merge.",
                                UserWarning,
                                stacklevel=2,
                            )
                            headers_warned = True

                for row_idx, row in enumerate(segment_rows):
                    should_merge = False

                    if predicate is not None and aggregated_rows:
                        if predicate_type == "single_param":
                            # For single param: only call on first row of segment (row_idx == 0)
                            # and pass the current row
                            if row_idx == 0:
                                should_merge = predicate(row)
                        elif predicate_type == "full_params":
                            # For full params: call with all arguments
                            should_merge = predicate(aggregated_rows[-1], row, row_idx, segment)

                    if should_merge:
                        aggregated_rows[-1] = _default_merge(aggregated_rows[-1], row)
                    else:
                        aggregated_rows.append(row)

                processed_segments += 1
                logger.debug(
                    f"    Added {len(segment_rows)} rows (post-merge) from segment {seg_idx+1}"
                )

            except Exception as e:
                logger.error(f"Error extracting table from segment {seg_idx+1}: {e}", exc_info=True)
                continue

        # Check for inconsistent header patterns after processing all segments
        if merge_headers is None and len(segment_has_repeated_header) > 2:
            # During auto-detection, check for consistency across all segments
            expected_pattern = segment_has_repeated_header[1]  # Pattern from second segment
            for seg_idx, has_header in enumerate(segment_has_repeated_header[2:], 2):
                if has_header != expected_pattern:
                    # Inconsistent pattern detected
                    segments_with_headers = [
                        i for i, has_h in enumerate(segment_has_repeated_header[1:], 1) if has_h
                    ]
                    segments_without_headers = [
                        i for i, has_h in enumerate(segment_has_repeated_header[1:], 1) if not has_h
                    ]
                    raise ValueError(
                        f"Inconsistent header pattern in multi-page table: "
                        f"segments {segments_with_headers} have repeated headers, "
                        f"but segments {segments_without_headers} do not. "
                        f"All segments must have the same header pattern for reliable merging."
                    )

        logger.info(
            f"Flow table extraction complete: {len(aggregated_rows)} total rows from {processed_segments}/{len(self.segments)} segments"
        )
        return TableResult(aggregated_rows)

    def analyze_layout(
        self,
        engine: Optional[str] = None,
        options: Optional[Any] = None,
        confidence: Optional[float] = None,
        classes: Optional[List[str]] = None,
        exclude_classes: Optional[List[str]] = None,
        device: Optional[str] = None,
        existing: str = "replace",
        model_name: Optional[str] = None,
        client: Optional[Any] = None,
    ) -> "PhysicalElementCollection":
        """
        Analyze layout across all segments in the flow.

        This method efficiently groups segments by their parent pages, runs layout analysis
        only once per unique page, then filters results appropriately for each segment.
        This avoids redundant analysis when multiple flow segments come from the same page.

        Args:
            engine: Name of the layout engine (e.g., 'yolo', 'tatr'). Uses manager's default if None.
            options: Specific LayoutOptions object for advanced configuration.
            confidence: Minimum confidence threshold.
            classes: Specific classes to detect.
            exclude_classes: Classes to exclude.
            device: Device for inference.
            existing: How to handle existing detected regions: 'replace' (default) or 'append'.
            model_name: Optional model name for the engine.
            client: Optional client for API-based engines.

        Returns:
            ElementCollection containing all detected Region objects from all segments.

        Example:
            Multi-page layout analysis:
            ```python
            pdf = npdf.PDF("document.pdf")

            # Create flow for first 3 pages
            page_flow = Flow(
                segments=pdf.pages[:3],
                arrangement='vertical'
            )

            # Analyze layout across all pages (efficiently)
            all_regions = page_flow.analyze_layout(engine='yolo')

            # Find all tables across the flow
            tables = all_regions.filter('region[type=table]')
            ```
        """
        from natural_pdf.elements.element_collection import ElementCollection

        logger.info(
            f"Analyzing layout across Flow with {len(self.segments)} segments (engine: {engine or 'default'})"
        )

        if not self.segments:
            logger.warning("Flow has no segments, returning empty collection")
            return ElementCollection([])

        # Step 1: Group segments by their parent pages to avoid redundant analysis
        segments_by_page = {}  # Dict[Page, List[Segment]]

        for i, segment in enumerate(self.segments):
            # Determine the page for this segment
            if hasattr(segment, "analyze_layout"):
                # It's a Page object
                page_obj = segment
                segment_type = "page"
            elif hasattr(segment, "page") and hasattr(segment.page, "analyze_layout"):
                # It's a Region object
                page_obj = segment.page
                segment_type = "region"
            else:
                logger.warning(f"Segment {i+1} does not support layout analysis, skipping")
                continue

            if page_obj not in segments_by_page:
                segments_by_page[page_obj] = []
            segments_by_page[page_obj].append((segment, segment_type))

        if not segments_by_page:
            logger.warning("No segments with analyzable pages found")
            return ElementCollection([])

        logger.debug(
            f"  Grouped {len(self.segments)} segments into {len(segments_by_page)} unique pages"
        )

        # Step 2: Analyze each unique page only once
        all_detected_regions: List["PhysicalRegion"] = []
        processed_pages = 0

        for page_obj, page_segments in segments_by_page.items():
            try:
                logger.debug(
                    f"  Analyzing layout for page {getattr(page_obj, 'number', '?')} with {len(page_segments)} segments"
                )

                # Run layout analysis once for this page
                page_results = page_obj.analyze_layout(
                    engine=engine,
                    options=options,
                    confidence=confidence,
                    classes=classes,
                    exclude_classes=exclude_classes,
                    device=device,
                    existing=existing,
                    model_name=model_name,
                    client=client,
                )

                # Extract regions from results
                if hasattr(page_results, "elements"):
                    # It's an ElementCollection
                    page_regions = page_results.elements
                elif isinstance(page_results, list):
                    # It's a list of regions
                    page_regions = page_results
                else:
                    logger.warning(
                        f"Page {getattr(page_obj, 'number', '?')} returned unexpected layout analysis result type: {type(page_results)}"
                    )
                    continue

                if not page_regions:
                    logger.debug(
                        f"    No layout regions found on page {getattr(page_obj, 'number', '?')}"
                    )
                    continue

                # Step 3: For each segment on this page, collect relevant regions
                segments_processed_on_page = 0
                for segment, segment_type in page_segments:
                    if segment_type == "page":
                        # Full page segment: include all detected regions
                        all_detected_regions.extend(page_regions)
                        segments_processed_on_page += 1
                        logger.debug(f"    Added {len(page_regions)} regions for full-page segment")

                    elif segment_type == "region":
                        # Region segment: filter to only intersecting regions
                        intersecting_regions = []
                        for region in page_regions:
                            try:
                                if segment.intersects(region):
                                    intersecting_regions.append(region)
                            except Exception as intersect_error:
                                logger.debug(
                                    f"Error checking intersection for region: {intersect_error}"
                                )
                                # Include the region anyway if intersection check fails
                                intersecting_regions.append(region)

                        all_detected_regions.extend(intersecting_regions)
                        segments_processed_on_page += 1
                        logger.debug(
                            f"    Added {len(intersecting_regions)} intersecting regions for region segment {segment.bbox}"
                        )

                processed_pages += 1
                logger.debug(
                    f"    Processed {segments_processed_on_page} segments on page {getattr(page_obj, 'number', '?')}"
                )

            except Exception as e:
                logger.error(
                    f"Error analyzing layout for page {getattr(page_obj, 'number', '?')}: {e}",
                    exc_info=True,
                )
                continue

        # Step 4: Remove duplicates (can happen if multiple segments intersect the same region)
        unique_regions = []
        seen_region_ids = set()

        for region in all_detected_regions:
            # Create a unique identifier for this region (page + bbox)
            region_id = (
                getattr(region.page, "index", id(region.page)),
                region.bbox if hasattr(region, "bbox") else id(region),
            )

            if region_id not in seen_region_ids:
                unique_regions.append(region)
                seen_region_ids.add(region_id)

        dedupe_removed = len(all_detected_regions) - len(unique_regions)
        if dedupe_removed > 0:
            logger.debug(f"  Removed {dedupe_removed} duplicate regions")

        logger.info(
            f"Flow layout analysis complete: {len(unique_regions)} unique regions from {processed_pages} pages"
        )
        return ElementCollection(unique_regions)

    def _get_render_specs(
        self,
        mode: Literal["show", "render"] = "show",
        color: Optional[Union[str, Tuple[int, int, int]]] = None,
        highlights: Optional[List[Dict[str, Any]]] = None,
        crop: Union[bool, Literal["content"]] = False,
        crop_bbox: Optional[Tuple[float, float, float, float]] = None,
        label_prefix: Optional[str] = "FlowSegment",
        **kwargs,
    ) -> List[RenderSpec]:
        """Get render specifications for this flow.

        Args:
            mode: Rendering mode - 'show' includes highlights, 'render' is clean
            color: Color for highlighting segments in show mode
            highlights: Additional highlight groups to show
            crop: Whether to crop to segments
            crop_bbox: Explicit crop bounds
            label_prefix: Prefix for segment labels
            **kwargs: Additional parameters

        Returns:
            List of RenderSpec objects, one per page with segments
        """
        if not self.segments:
            return []

        # Group segments by their physical pages
        segments_by_page = {}  # Dict[Page, List[PhysicalRegion]]

        for i, segment in enumerate(self.segments):
            # Get the page for this segment
            if hasattr(segment, "page") and segment.page is not None:
                # It's a Region, use its page
                page_obj = segment.page
                if page_obj not in segments_by_page:
                    segments_by_page[page_obj] = []
                segments_by_page[page_obj].append(segment)
            elif (
                hasattr(segment, "index")
                and hasattr(segment, "width")
                and hasattr(segment, "height")
            ):
                # It's a full Page object, create a full-page region for it
                page_obj = segment
                full_page_region = segment.region(0, 0, segment.width, segment.height)
                if page_obj not in segments_by_page:
                    segments_by_page[page_obj] = []
                segments_by_page[page_obj].append(full_page_region)
            else:
                logger.warning(f"Segment {i+1} has no identifiable page, skipping")
                continue

        if not segments_by_page:
            return []

        # Create RenderSpec for each page
        specs = []

        # Sort pages by index for consistent output order
        sorted_pages = sorted(
            segments_by_page.keys(),
            key=lambda p: p.index if hasattr(p, "index") else getattr(p, "page_number", 0),
        )

        for page_idx, page_obj in enumerate(sorted_pages):
            segments_on_this_page = segments_by_page[page_obj]
            if not segments_on_this_page:
                continue

            spec = RenderSpec(page=page_obj)

            # Handle cropping
            if crop_bbox:
                spec.crop_bbox = crop_bbox
            elif crop == "content" or crop is True:
                # Calculate bounds of segments on this page
                x_coords = []
                y_coords = []
                for segment in segments_on_this_page:
                    if hasattr(segment, "bbox") and segment.bbox:
                        x0, y0, x1, y1 = segment.bbox
                        x_coords.extend([x0, x1])
                        y_coords.extend([y0, y1])

                if x_coords and y_coords:
                    spec.crop_bbox = (min(x_coords), min(y_coords), max(x_coords), max(y_coords))

            # Add highlights in show mode
            if mode == "show":
                # Highlight segments
                for i, segment in enumerate(segments_on_this_page):
                    segment_label = None
                    if label_prefix:
                        # Create label for this segment
                        global_segment_idx = None
                        try:
                            # Find the global index of this segment in the original flow
                            global_segment_idx = self.segments.index(segment)
                        except ValueError:
                            # If it's a generated full-page region, find its source page
                            for idx, orig_segment in enumerate(self.segments):
                                if (
                                    hasattr(orig_segment, "index")
                                    and hasattr(segment, "page")
                                    and orig_segment.index == segment.page.index
                                ):
                                    global_segment_idx = idx
                                    break

                        if global_segment_idx is not None:
                            segment_label = f"{label_prefix}_{global_segment_idx + 1}"
                        else:
                            segment_label = f"{label_prefix}_p{page_idx + 1}s{i + 1}"

                    spec.add_highlight(
                        bbox=segment.bbox,
                        polygon=segment.polygon if segment.has_polygon else None,
                        color=color or "blue",
                        label=segment_label,
                    )

                # Add additional highlight groups if provided
                if highlights:
                    for group in highlights:
                        group_elements = group.get("elements", [])
                        group_color = group.get("color", color)
                        group_label = group.get("label")

                        for elem in group_elements:
                            # Only add if element is on this page
                            if hasattr(elem, "page") and elem.page == page_obj:
                                spec.add_highlight(
                                    element=elem, color=group_color, label=group_label
                                )

            specs.append(spec)

        return specs

    def _show_in_context(
        self,
        resolution: float,
        width: Optional[int] = None,
        stack_direction: str = "vertical",
        stack_gap: int = 5,
        stack_background_color: Tuple[int, int, int] = (255, 255, 255),
        separator_color: Tuple[int, int, int] = (255, 0, 0),
        separator_thickness: int = 2,
        **kwargs,
    ) -> Optional["PIL_Image"]:
        """
        Show segments as cropped images stacked together with separators between segments.

        Args:
            resolution: Resolution in DPI for rendering segment images
            width: Optional width for segment images
            stack_direction: Direction to stack segments ('vertical', 'horizontal', or 'auto' to follow the flow's arrangement)
            stack_gap: Gap in pixels between segments
            stack_background_color: RGB background color for the final image
            separator_color: RGB color for separator lines between segments
            separator_thickness: Thickness in pixels of separator lines
            **kwargs: Additional arguments passed to segment rendering

        Returns:
            PIL Image with all segments stacked together
        """
        from PIL import Image, ImageDraw

        segment_images = []
        segment_pages = []

        # Determine stacking direction
        final_stack_direction = stack_direction
        if stack_direction == "auto":
            final_stack_direction = self.arrangement

        # Get cropped images for each segment
        for i, segment in enumerate(self.segments):
            # Get the page reference for this segment
            if hasattr(segment, "page") and segment.page is not None:
                segment_page = segment.page
                # Get cropped image of the segment
                # Use render() for clean image without highlights
                segment_image = segment.render(
                    resolution=resolution,
                    crop=True,
                    width=width,
                    **kwargs,
                )

            elif (
                hasattr(segment, "index")
                and hasattr(segment, "width")
                and hasattr(segment, "height")
            ):
                # It's a full Page object
                segment_page = segment
                # Use render() for clean image without highlights
                segment_image = segment.render(resolution=resolution, width=width, **kwargs)
            else:
                raise ValueError(
                    f"Segment {i+1} has no identifiable page. Segment type: {type(segment)}, attributes: {dir(segment)}"
                )

            if segment_image is not None:
                segment_images.append(segment_image)
                segment_pages.append(segment_page)
            else:
                logger.warning(f"Segment {i+1} render() returned None, skipping")

        # Check if we have any valid images
        if not segment_images:
            logger.error("No valid segment images could be rendered")
            return None

        # A single segment image needs no stacking; return it directly
        if len(segment_images) == 1:
            return segment_images[0]

        # Calculate dimensions for the final stacked image
        if final_stack_direction == "vertical":
            # Stack vertically
            final_width = max(img.width for img in segment_images)

            # Calculate total height including gaps and separators
            total_height = sum(img.height for img in segment_images)
            total_height += (len(segment_images) - 1) * stack_gap

            # Add separator thickness between all segments
            num_separators = len(segment_images) - 1 if len(segment_images) > 1 else 0
            total_height += num_separators * separator_thickness

            # Create the final image
            final_image = Image.new("RGB", (final_width, total_height), stack_background_color)
            draw = ImageDraw.Draw(final_image)

            current_y = 0

            for i, img in enumerate(segment_images):
                # Add separator line before each segment (except the first one)
                if i > 0:
                    # Draw separator line
                    draw.rectangle(
                        [(0, current_y), (final_width, current_y + separator_thickness)],
                        fill=separator_color,
                    )
                    current_y += separator_thickness

                # Paste the segment image
                paste_x = (final_width - img.width) // 2  # Center horizontally
                final_image.paste(img, (paste_x, current_y))
                current_y += img.height

                # Add gap after segment (except for the last one)
                if i < len(segment_images) - 1:
                    current_y += stack_gap

            return final_image

        elif final_stack_direction == "horizontal":
            # Stack horizontally
            final_height = max(img.height for img in segment_images)

            # Calculate total width including gaps and separators
            total_width = sum(img.width for img in segment_images)
            total_width += (len(segment_images) - 1) * stack_gap

            # Add separator thickness between all segments
            num_separators = len(segment_images) - 1 if len(segment_images) > 1 else 0
            total_width += num_separators * separator_thickness

            # Create the final image
            final_image = Image.new("RGB", (total_width, final_height), stack_background_color)
            draw = ImageDraw.Draw(final_image)

            current_x = 0

            for i, img in enumerate(segment_images):
                # Add separator line before each segment (except the first one)
                if i > 0:
                    # Draw separator line
                    draw.rectangle(
                        [(current_x, 0), (current_x + separator_thickness, final_height)],
                        fill=separator_color,
                    )
                    current_x += separator_thickness

                # Paste the segment image
                paste_y = (final_height - img.height) // 2  # Center vertically
                final_image.paste(img, (current_x, paste_y))
                current_x += img.width

                # Add gap after segment (except for the last one)
                if i < len(segment_images) - 1:
                    current_x += stack_gap

            return final_image

        else:
            raise ValueError(
                f"Invalid stack_direction '{final_stack_direction}' for in_context. Must be 'vertical' or 'horizontal'."
            )

    # --- Helper methods for coordinate transformations and segment iteration ---
    # These will be crucial for FlowElement's directional methods.

    def get_segment_bounding_box_in_flow(
        self, segment_index: int
    ) -> Optional[tuple[float, float, float, float]]:
        """
        Calculates the conceptual bounding box of a segment within the flow's coordinate system.
        This considers arrangement, alignment, and segment gaps.
        (This is a placeholder for more complex logic if a true virtual coordinate system is needed)
        For now, it might just return the physical segment's bbox if gaps are 0 and alignment is simple.
        """
        if segment_index < 0 or segment_index >= len(self.segments):
            return None

        # This is a simplified version. A full implementation would calculate offsets.
        # For now, we assume FlowElement directional logic handles segment traversal and uses physical coords.
        # If we were to *draw* the flow or get a FlowRegion bbox that spans gaps, this would be critical.
        # physical_segment = self.segments[segment_index]
        # return physical_segment.bbox
        raise NotImplementedError(
            "Calculating a segment's bbox *within the flow's virtual coordinate system* is not yet fully implemented."
        )

    def get_element_flow_coordinates(
        self, physical_element: "PhysicalElement"
    ) -> Optional[tuple[float, float, float, float]]:
        """
        Translates a physical element's coordinates into the flow's virtual coordinate system.
        (Placeholder - very complex if segment_gap > 0 or complex alignments)
        """
        # For now, elements operate in their own physical coordinates. This method would be needed
        # if FlowRegion.bbox or other operations needed to present a unified coordinate space.
        # As per our discussion, elements *within* a FlowRegion retain original physical coordinates.
        # So, this might not be strictly necessary for the current design's core functionality.
        raise NotImplementedError(
            "Translating element coordinates to a unified flow coordinate system is not yet implemented."
        )

    def get_sections(
        self,
        start_elements=None,
        end_elements=None,
        new_section_on_page_break: bool = False,
        include_boundaries: str = "both",
        orientation: str = "vertical",
    ) -> "ElementCollection":
        """
        Extract logical sections from the Flow based on *start* and *end* boundary
        elements, mirroring the behaviour of PDF/PageCollection.get_sections().

        Sections are resolved directly against the Flow's segments: boundary
        elements are located within each segment, matched in flow order, and
        sections that cross segment boundaries are returned as FlowRegion
        objects.  Any FlowElement / FlowElementCollection inputs are
        automatically unwrapped to their underlying physical elements so that
        the section-building logic can work with them directly.

        Args:
            start_elements: Elements or selector string that mark the start of
                sections (optional).
            end_elements: Elements or selector string that mark the end of
                sections (optional).
            new_section_on_page_break: Whether to start a new section at page
                boundaries (default: False).
            include_boundaries: How to include boundary elements: 'start',
                'end', 'both', or 'none' (default: 'both').
            orientation: 'vertical' (default) or 'horizontal' - determines section direction.

        Returns:
            ElementCollection of Region/FlowRegion objects representing the
            extracted sections.
        """
        # ------------------------------------------------------------------
        # Unwrap FlowElement(-Collection) inputs and selector strings so we
        # can reason about them generically.
        # ------------------------------------------------------------------
        from natural_pdf.flows.collections import FlowElementCollection
        from natural_pdf.flows.element import FlowElement

        def _unwrap(obj):
            """Convert Flow-specific wrappers to their underlying physical objects.

            Keeps selector strings as-is; converts FlowElement to its physical
            element; converts FlowElementCollection to list of physical
            elements; passes through ElementCollection by taking .elements.
            """

            if obj is None or isinstance(obj, str):
                return obj

            if isinstance(obj, FlowElement):
                return obj.physical_object

            if isinstance(obj, FlowElementCollection):
                return [fe.physical_object for fe in obj.flow_elements]

            if hasattr(obj, "elements"):
                return obj.elements

            if isinstance(obj, (list, tuple, set)):
                out = []
                for item in obj:
                    if isinstance(item, FlowElement):
                        out.append(item.physical_object)
                    else:
                        out.append(item)
                return out

            return obj  # Fallback – unknown type

        start_elements_unwrapped = _unwrap(start_elements)
        end_elements_unwrapped = _unwrap(end_elements)

        # ------------------------------------------------------------------
        # For Flow, we need to handle sections that may span segments
        # We'll process all segments together, not independently
        # ------------------------------------------------------------------
        from natural_pdf.elements.element_collection import ElementCollection
        from natural_pdf.elements.region import Region
        from natural_pdf.flows.element import FlowElement
        from natural_pdf.flows.region import FlowRegion

        # Helper to check if element is in segment
        def _element_in_segment(elem, segment):
            # Simple bbox check
            return (
                elem.page == segment.page
                and elem.top >= segment.top
                and elem.bottom <= segment.bottom
                and elem.x0 >= segment.x0
                and elem.x1 <= segment.x1
            )

        # Collect all boundary elements with their segment info
        all_starts = []
        all_ends = []

        for seg_idx, segment in enumerate(self.segments):
            # Find starts in this segment
            if isinstance(start_elements_unwrapped, str):
                seg_starts = segment.find_all(start_elements_unwrapped).elements
            elif start_elements_unwrapped:
                seg_starts = [
                    e for e in start_elements_unwrapped if _element_in_segment(e, segment)
                ]
            else:
                seg_starts = []

            for elem in seg_starts:
                all_starts.append((elem, seg_idx, segment))

            # Find ends in this segment
            if isinstance(end_elements_unwrapped, str):
                seg_ends = segment.find_all(end_elements_unwrapped).elements
            elif end_elements_unwrapped:
                seg_ends = [e for e in end_elements_unwrapped if _element_in_segment(e, segment)]
            else:
                seg_ends = []

            for elem in seg_ends:
                all_ends.append((elem, seg_idx, segment))

        # Sort by segment index, then position
        all_starts.sort(key=lambda x: (x[1], x[0].top, x[0].x0))
        all_ends.sort(key=lambda x: (x[1], x[0].top, x[0].x0))

        # If no boundary elements found, return empty collection
        if not all_starts and not all_ends:
            return ElementCollection([])

        sections = []

        # Case 1: Only start elements provided
        if all_starts and not all_ends:
            for i in range(len(all_starts)):
                start_elem, start_seg_idx, start_seg = all_starts[i]

                # Find end (next start or end of flow)
                if i + 1 < len(all_starts):
                    # Section ends at next start
                    end_elem, end_seg_idx, end_seg = all_starts[i + 1]

                    if start_seg_idx == end_seg_idx:
                        # Same segment - create regular Region
                        section = start_seg.get_section_between(
                            start_elem, end_elem, include_boundaries, orientation
                        )
                        if section:
                            sections.append(section)
                    else:
                        # Cross-segment - create FlowRegion
                        regions = []

                        # First segment: from start to bottom
                        if include_boundaries in ["both", "start"]:
                            top = start_elem.top
                        else:
                            top = start_elem.bottom
                        regions.append(
                            Region(
                                start_seg.page, (start_seg.x0, top, start_seg.x1, start_seg.bottom)
                            )
                        )

                        # Middle segments (full)
                        for idx in range(start_seg_idx + 1, end_seg_idx):
                            regions.append(self.segments[idx])

                        # Last segment: from top to end element
                        if include_boundaries in ["both", "end"]:
                            bottom = end_elem.bottom
                        else:
                            bottom = end_elem.top
                        regions.append(
                            Region(end_seg.page, (end_seg.x0, end_seg.top, end_seg.x1, bottom))
                        )

                        # Create FlowRegion
                        flow_element = FlowElement(physical_object=start_elem, flow=self)
                        flow_region = FlowRegion(
                            flow=self,
                            constituent_regions=regions,
                            source_flow_element=flow_element,
                            boundary_element_found=end_elem,
                        )
                        flow_region.start_element = start_elem
                        flow_region.end_element = end_elem
                        flow_region._boundary_exclusions = include_boundaries
                        sections.append(flow_region)
                else:
                    # Last section - goes to end of flow
                    if start_seg_idx == len(self.segments) - 1:
                        # Within last segment
                        section = start_seg.get_section_between(
                            start_elem, None, include_boundaries, orientation
                        )
                        if section:
                            sections.append(section)
                    else:
                        # Spans to end
                        regions = []

                        # First segment: from start to bottom
                        if include_boundaries in ["both", "start"]:
                            top = start_elem.top
                        else:
                            top = start_elem.bottom
                        regions.append(
                            Region(
                                start_seg.page, (start_seg.x0, top, start_seg.x1, start_seg.bottom)
                            )
                        )

                        # Remaining segments (full)
                        for idx in range(start_seg_idx + 1, len(self.segments)):
                            regions.append(self.segments[idx])

                        # Create FlowRegion
                        flow_element = FlowElement(physical_object=start_elem, flow=self)
                        flow_region = FlowRegion(
                            flow=self,
                            constituent_regions=regions,
                            source_flow_element=flow_element,
                            boundary_element_found=None,
                        )
                        flow_region.start_element = start_elem
                        flow_region._boundary_exclusions = include_boundaries
                        sections.append(flow_region)

        # Case 2: Both start and end elements
        elif all_starts and all_ends:
            # Match starts with ends
            used_ends = set()

            for start_elem, start_seg_idx, start_seg in all_starts:
                # Find matching end
                best_end = None

                for end_elem, end_seg_idx, end_seg in all_ends:
                    if id(end_elem) in used_ends:
                        continue

                    # End must come after start
                    if end_seg_idx > start_seg_idx or (
                        end_seg_idx == start_seg_idx and end_elem.top >= start_elem.bottom
                    ):
                        best_end = (end_elem, end_seg_idx, end_seg)
                        break

                if best_end:
                    end_elem, end_seg_idx, end_seg = best_end
                    used_ends.add(id(end_elem))

                    if start_seg_idx == end_seg_idx:
                        # Same segment
                        section = start_seg.get_section_between(
                            start_elem, end_elem, include_boundaries, orientation
                        )
                        if section:
                            sections.append(section)
                    else:
                        # Cross-segment FlowRegion
                        regions = []

                        # First segment
                        if include_boundaries in ["both", "start"]:
                            top = start_elem.top
                        else:
                            top = start_elem.bottom
                        regions.append(
                            Region(
                                start_seg.page, (start_seg.x0, top, start_seg.x1, start_seg.bottom)
                            )
                        )

                        # Middle segments
                        for idx in range(start_seg_idx + 1, end_seg_idx):
                            regions.append(self.segments[idx])

                        # Last segment
                        if include_boundaries in ["both", "end"]:
                            bottom = end_elem.bottom
                        else:
                            bottom = end_elem.top
                        regions.append(
                            Region(end_seg.page, (end_seg.x0, end_seg.top, end_seg.x1, bottom))
                        )

                        # Create FlowRegion
                        flow_element = FlowElement(physical_object=start_elem, flow=self)
                        flow_region = FlowRegion(
                            flow=self,
                            constituent_regions=regions,
                            source_flow_element=flow_element,
                            boundary_element_found=end_elem,
                        )
                        flow_region.start_element = start_elem
                        flow_region.end_element = end_elem
                        flow_region._boundary_exclusions = include_boundaries
                        sections.append(flow_region)

        # Case 3: Only end elements (sections from beginning to each end)
        elif not all_starts and all_ends:
            # TODO: Handle this case if needed
            pass

        return ElementCollection(sections)

    def highlights(self, show: bool = False):
        """
        Create a highlight context for accumulating highlights.

        This allows for clean syntax to show multiple highlight groups:

        Example:
            with flow.highlights() as h:
                h.add(flow.find_all('table'), label='tables', color='blue')
                h.add(flow.find_all('text:bold'), label='bold text', color='red')
                h.show()

        Or with automatic display:
            with flow.highlights(show=True) as h:
                h.add(flow.find_all('table'), label='tables')
                h.add(flow.find_all('text:bold'), label='bold')
                # Automatically shows when exiting the context

        Args:
            show: If True, automatically show highlights when exiting context

        Returns:
            HighlightContext for accumulating highlights
        """
        from natural_pdf.core.highlighting_service import HighlightContext

        return HighlightContext(self, show_on_exit=show)
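
Example

A hedged sketch of get_sections usage (the file name and selector are illustrative; it also assumes the returned sections expose extract_text(), as physical Regions do):

pdf = npdf.PDF("document.pdf")
flow = Flow(segments=pdf.pages, arrangement='vertical')

# Each section runs from one bold heading to the next; sections that
# cross a page boundary come back as FlowRegions
sections = flow.get_sections(start_elements='text:bold')
for section in sections:
    print(type(section).__name__, section.extract_text()[:60])
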
Functions
natural_pdf.Flow.__init__(segments, arrangement, alignment='start', segment_gap=0.0)

Initializes a Flow object.

Parameters:

segments (Union[List[Union[Page, Region]], PageCollection], required)
    An ordered list of natural_pdf.core.page.Page or natural_pdf.elements.region.Region objects that constitute the flow, or a PageCollection containing pages.

arrangement (Literal['vertical', 'horizontal'], required)
    The primary direction of the flow.
    - "vertical": Segments are stacked top-to-bottom.
    - "horizontal": Segments are arranged left-to-right.

alignment (Literal['start', 'center', 'end', 'top', 'left', 'bottom', 'right'], default: 'start')
    How segments are aligned on their cross-axis if they have differing dimensions.
    For a "vertical" arrangement:
    - "left" (or "start"): Align left edges.
    - "center": Align centers.
    - "right" (or "end"): Align right edges.
    For a "horizontal" arrangement:
    - "top" (or "start"): Align top edges.
    - "center": Align centers.
    - "bottom" (or "end"): Align bottom edges.

segment_gap (float, default: 0.0)
    The virtual gap (in PDF points) between segments.
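
Example

A minimal construction sketch (the file name and page indices are illustrative):

pdf = npdf.PDF("report.pdf")

# Stack the first two pages top-to-bottom, aligning their left edges,
# with a 10-point virtual gap between segments
flow = Flow(
    segments=[pdf.pages[0], pdf.pages[1]],
    arrangement='vertical',
    alignment='left',
    segment_gap=10.0,
)
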
Source code in natural_pdf/flows/flow.py
def __init__(
    self,
    segments: Union[List[Union["Page", "PhysicalRegion"]], "PageCollection"],
    arrangement: Literal["vertical", "horizontal"],
    alignment: Literal["start", "center", "end", "top", "left", "bottom", "right"] = "start",
    segment_gap: float = 0.0,
):
    """
    Initializes a Flow object.

    Args:
        segments: An ordered list of natural_pdf.core.page.Page or
                  natural_pdf.elements.region.Region objects that constitute the flow,
                  or a PageCollection containing pages.
        arrangement: The primary direction of the flow.
                     - "vertical": Segments are stacked top-to-bottom.
                     - "horizontal": Segments are arranged left-to-right.
        alignment: How segments are aligned on their cross-axis if they have
                   differing dimensions. For a "vertical" arrangement:
                   - "left" (or "start"): Align left edges.
                   - "center": Align centers.
                   - "right" (or "end"): Align right edges.
                   For a "horizontal" arrangement:
                   - "top" (or "start"): Align top edges.
                   - "center": Align centers.
                   - "bottom" (or "end"): Align bottom edges.
        segment_gap: The virtual gap (in PDF points) between segments.
    """
    # Handle PageCollection input
    if hasattr(segments, "pages"):  # It's a PageCollection
        segments = list(segments.pages)

    if not segments:
        raise ValueError("Flow segments cannot be empty.")
    if arrangement not in ["vertical", "horizontal"]:
        raise ValueError("Arrangement must be 'vertical' or 'horizontal'.")

    self.segments: List["PhysicalRegion"] = self._normalize_segments(segments)
    self.arrangement: Literal["vertical", "horizontal"] = arrangement
    self.alignment: Literal["start", "center", "end", "top", "left", "bottom", "right"] = (
        alignment
    )
    self.segment_gap: float = segment_gap

    self._validate_alignment()
natural_pdf.Flow.analyze_layout(engine=None, options=None, confidence=None, classes=None, exclude_classes=None, device=None, existing='replace', model_name=None, client=None)

Analyze layout across all segments in the flow.

This method efficiently groups segments by their parent pages, runs layout analysis only once per unique page, then filters results appropriately for each segment. This avoids redundant analysis when multiple flow segments come from the same page.

Parameters:

engine (Optional[str], default: None)
    Name of the layout engine (e.g., 'yolo', 'tatr'). Uses manager's default if None.

options (Optional[Any], default: None)
    Specific LayoutOptions object for advanced configuration.

confidence (Optional[float], default: None)
    Minimum confidence threshold.

classes (Optional[List[str]], default: None)
    Specific classes to detect.

exclude_classes (Optional[List[str]], default: None)
    Classes to exclude.

device (Optional[str], default: None)
    Device for inference.

existing (str, default: 'replace')
    How to handle existing detected regions: 'replace' (default) or 'append'.

model_name (Optional[str], default: None)
    Optional model name for the engine.

client (Optional[Any], default: None)
    Optional client for API-based engines.

Returns:

ElementCollection
    ElementCollection containing all detected Region objects from all segments.

Example

Multi-page layout analysis:

pdf = npdf.PDF("document.pdf")

# Create flow for first 3 pages
page_flow = Flow(
    segments=pdf.pages[:3],
    arrangement='vertical'
)

# Analyze layout across all pages (efficiently)
all_regions = page_flow.analyze_layout(engine='yolo')

# Find all tables across the flow
tables = all_regions.filter('region[type=table]')

Source code in natural_pdf/flows/flow.py
def analyze_layout(
    self,
    engine: Optional[str] = None,
    options: Optional[Any] = None,
    confidence: Optional[float] = None,
    classes: Optional[List[str]] = None,
    exclude_classes: Optional[List[str]] = None,
    device: Optional[str] = None,
    existing: str = "replace",
    model_name: Optional[str] = None,
    client: Optional[Any] = None,
) -> "PhysicalElementCollection":
    """
    Analyze layout across all segments in the flow.

    This method efficiently groups segments by their parent pages, runs layout analysis
    only once per unique page, then filters results appropriately for each segment.
    This avoids redundant analysis when multiple flow segments come from the same page.

    Args:
        engine: Name of the layout engine (e.g., 'yolo', 'tatr'). Uses manager's default if None.
        options: Specific LayoutOptions object for advanced configuration.
        confidence: Minimum confidence threshold.
        classes: Specific classes to detect.
        exclude_classes: Classes to exclude.
        device: Device for inference.
        existing: How to handle existing detected regions: 'replace' (default) or 'append'.
        model_name: Optional model name for the engine.
        client: Optional client for API-based engines.

    Returns:
        ElementCollection containing all detected Region objects from all segments.

    Example:
        Multi-page layout analysis:
        ```python
        pdf = npdf.PDF("document.pdf")

        # Create flow for first 3 pages
        page_flow = Flow(
            segments=pdf.pages[:3],
            arrangement='vertical'
        )

        # Analyze layout across all pages (efficiently)
        all_regions = page_flow.analyze_layout(engine='yolo')

        # Find all tables across the flow
        tables = all_regions.filter('region[type=table]')
        ```
    """
    from natural_pdf.elements.element_collection import ElementCollection

    logger.info(
        f"Analyzing layout across Flow with {len(self.segments)} segments (engine: {engine or 'default'})"
    )

    if not self.segments:
        logger.warning("Flow has no segments, returning empty collection")
        return ElementCollection([])

    # Step 1: Group segments by their parent pages to avoid redundant analysis
    segments_by_page = {}  # Dict[Page, List[Segment]]

    for i, segment in enumerate(self.segments):
        # Determine the page for this segment
        if hasattr(segment, "analyze_layout"):
            # It's a Page object
            page_obj = segment
            segment_type = "page"
        elif hasattr(segment, "page") and hasattr(segment.page, "analyze_layout"):
            # It's a Region object
            page_obj = segment.page
            segment_type = "region"
        else:
            logger.warning(f"Segment {i+1} does not support layout analysis, skipping")
            continue

        if page_obj not in segments_by_page:
            segments_by_page[page_obj] = []
        segments_by_page[page_obj].append((segment, segment_type))

    if not segments_by_page:
        logger.warning("No segments with analyzable pages found")
        return ElementCollection([])

    logger.debug(
        f"  Grouped {len(self.segments)} segments into {len(segments_by_page)} unique pages"
    )

    # Step 2: Analyze each unique page only once
    all_detected_regions: List["PhysicalRegion"] = []
    processed_pages = 0

    for page_obj, page_segments in segments_by_page.items():
        try:
            logger.debug(
                f"  Analyzing layout for page {getattr(page_obj, 'number', '?')} with {len(page_segments)} segments"
            )

            # Run layout analysis once for this page
            page_results = page_obj.analyze_layout(
                engine=engine,
                options=options,
                confidence=confidence,
                classes=classes,
                exclude_classes=exclude_classes,
                device=device,
                existing=existing,
                model_name=model_name,
                client=client,
            )

            # Extract regions from results
            if hasattr(page_results, "elements"):
                # It's an ElementCollection
                page_regions = page_results.elements
            elif isinstance(page_results, list):
                # It's a list of regions
                page_regions = page_results
            else:
                logger.warning(
                    f"Page {getattr(page_obj, 'number', '?')} returned unexpected layout analysis result type: {type(page_results)}"
                )
                continue

            if not page_regions:
                logger.debug(
                    f"    No layout regions found on page {getattr(page_obj, 'number', '?')}"
                )
                continue

            # Step 3: For each segment on this page, collect relevant regions
            segments_processed_on_page = 0
            for segment, segment_type in page_segments:
                if segment_type == "page":
                    # Full page segment: include all detected regions
                    all_detected_regions.extend(page_regions)
                    segments_processed_on_page += 1
                    logger.debug(f"    Added {len(page_regions)} regions for full-page segment")

                elif segment_type == "region":
                    # Region segment: filter to only intersecting regions
                    intersecting_regions = []
                    for region in page_regions:
                        try:
                            if segment.intersects(region):
                                intersecting_regions.append(region)
                        except Exception as intersect_error:
                            logger.debug(
                                f"Error checking intersection for region: {intersect_error}"
                            )
                            # Include the region anyway if intersection check fails
                            intersecting_regions.append(region)

                    all_detected_regions.extend(intersecting_regions)
                    segments_processed_on_page += 1
                    logger.debug(
                        f"    Added {len(intersecting_regions)} intersecting regions for region segment {segment.bbox}"
                    )

            processed_pages += 1
            logger.debug(
                f"    Processed {segments_processed_on_page} segments on page {getattr(page_obj, 'number', '?')}"
            )

        except Exception as e:
            logger.error(
                f"Error analyzing layout for page {getattr(page_obj, 'number', '?')}: {e}",
                exc_info=True,
            )
            continue

    # Step 4: Remove duplicates (can happen if multiple segments intersect the same region)
    unique_regions = []
    seen_region_ids = set()

    for region in all_detected_regions:
        # Create a unique identifier for this region (page + bbox)
        region_id = (
            getattr(region.page, "index", id(region.page)),
            region.bbox if hasattr(region, "bbox") else id(region),
        )

        if region_id not in seen_region_ids:
            unique_regions.append(region)
            seen_region_ids.add(region_id)

    dedupe_removed = len(all_detected_regions) - len(unique_regions)
    if dedupe_removed > 0:
        logger.debug(f"  Removed {dedupe_removed} duplicate regions")

    logger.info(
        f"Flow layout analysis complete: {len(unique_regions)} unique regions from {processed_pages} pages"
    )
    return ElementCollection(unique_regions)
natural_pdf.Flow.extract_table(method=None, table_settings=None, use_ocr=False, ocr_config=None, text_options=None, cell_extraction_func=None, show_progress=False, content_filter=None, stitch_rows=None, merge_headers=None)
extract_table(method: Optional[str] = None, table_settings: Optional[dict] = None, use_ocr: bool = False, ocr_config: Optional[dict] = None, text_options: Optional[dict] = None, cell_extraction_func: Optional[Any] = None, show_progress: bool = False, content_filter: Optional[Any] = None, stitch_rows: Callable[[List[Optional[str]]], bool] = None) -> TableResult
extract_table(method: Optional[str] = None, table_settings: Optional[dict] = None, use_ocr: bool = False, ocr_config: Optional[dict] = None, text_options: Optional[dict] = None, cell_extraction_func: Optional[Any] = None, show_progress: bool = False, content_filter: Optional[Any] = None, stitch_rows: Callable[[List[Optional[str]], List[Optional[str]], int, Union[Page, PhysicalRegion]], bool] = None) -> TableResult

Extract table data from all segments in the flow, combining results sequentially.

This method extracts table data from each segment in flow order and combines the results into a single logical table. This is particularly useful for multi-page tables or tables that span across columns.

Parameters:

method (Optional[str], default: None)
    Method to use: 'tatr', 'pdfplumber', 'text', 'stream', 'lattice', or None (auto-detect).

table_settings (Optional[dict], default: None)
    Settings for pdfplumber table extraction.

use_ocr (bool, default: False)
    Whether to use OCR for text extraction (currently only applicable with the 'tatr' method).

ocr_config (Optional[dict], default: None)
    OCR configuration parameters.

text_options (Optional[dict], default: None)
    Dictionary of options for the 'text' method.

cell_extraction_func (Optional[Any], default: None)
    Optional callable function that takes a cell Region object and returns its string content. For the 'text' method only.

show_progress (bool, default: False)
    If True, display a progress bar during cell text extraction for the 'text' method.

content_filter (Optional[Any], default: None)
    Optional content filter to apply during cell text extraction.

merge_headers (Optional[bool], default: None)
    Whether to merge tables by removing repeated headers from subsequent segments. If None (default), auto-detects by checking if the first row of each segment matches the first row of the first segment. If segments have inconsistent header patterns (some repeat, others don't), raises ValueError. Useful for multi-page tables where headers repeat on each page.

stitch_rows (Optional[Callable], default: None)
    Optional callable to determine when rows should be merged across segment boundaries. Applied AFTER header removal if merge_headers is enabled. Two overloaded signatures are supported:

    • func(current_row) -> bool
      Called only on the first row of each segment (after the first). Return True to merge this first row with the last row from the previous segment.

    • func(prev_row, current_row, row_index, segment) -> bool
      Called for every row. Return True to merge current_row with the previous row in the aggregated results.

    When True is returned, rows are concatenated cell-by-cell. This is useful for handling table rows split across page boundaries or segments. If None, rows are never merged.

Returns:

TableResult
    TableResult object containing the aggregated table data from all segments.

Example

Multi-page table extraction:

pdf = npdf.PDF("multi_page_table.pdf")

# Create flow for table spanning pages 2-4
table_flow = Flow(
    segments=[pdf.pages[1], pdf.pages[2], pdf.pages[3]],
    arrangement='vertical'
)

# Extract table as if it were continuous
table_data = table_flow.extract_table()
df = table_data.df  # Convert to pandas DataFrame

# Custom row stitching - single parameter (simple case)
table_data = table_flow.extract_table(
    stitch_rows=lambda row: row and not (row[0] or "").strip()
)

# Custom row stitching - full parameters (advanced case)
table_data = table_flow.extract_table(
    stitch_rows=lambda prev, curr, idx, seg: idx == 0 and curr and not (curr[0] or "").strip()
)
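
# Header handling, a hedged sketch: merge_headers=True drops a first row
# that repeats the first segment's header; False keeps every row as-is
table_data = table_flow.extract_table(merge_headers=True)
raw_rows = table_flow.extract_table(merge_headers=False)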

Source code in natural_pdf/flows/flow.py
def extract_table(
    self,
    method: Optional[str] = None,
    table_settings: Optional[dict] = None,
    use_ocr: bool = False,
    ocr_config: Optional[dict] = None,
    text_options: Optional[dict] = None,
    cell_extraction_func: Optional[Any] = None,
    show_progress: bool = False,
    content_filter: Optional[Any] = None,
    stitch_rows: Optional[Callable] = None,
    merge_headers: Optional[bool] = None,
) -> TableResult:
    """
    Extract table data from all segments in the flow, combining results sequentially.

    This method extracts table data from each segment in flow order and combines
    the results into a single logical table. This is particularly useful for
    multi-page tables or tables that span across columns.

    Args:
        method: Method to use: 'tatr', 'pdfplumber', 'text', 'stream', 'lattice', or None (auto-detect).
        table_settings: Settings for pdfplumber table extraction.
        use_ocr: Whether to use OCR for text extraction (currently only applicable with 'tatr' method).
        ocr_config: OCR configuration parameters.
        text_options: Dictionary of options for the 'text' method.
        cell_extraction_func: Optional callable function that takes a cell Region object
                              and returns its string content. For 'text' method only.
        show_progress: If True, display a progress bar during cell text extraction for the 'text' method.
        content_filter: Optional content filter to apply during cell text extraction.
        merge_headers: Whether to merge tables by removing repeated headers from subsequent
            segments. If None (default), auto-detects by checking if the first row
            of each segment matches the first row of the first segment. If segments have
            inconsistent header patterns (some repeat, others don't), raises ValueError.
            Useful for multi-page tables where headers repeat on each page.
        stitch_rows: Optional callable to determine when rows should be merged across
                     segment boundaries. Applied AFTER header removal if merge_headers
                     is enabled. Two overloaded signatures are supported:

                     • func(current_row) -> bool
                       Called only on the first row of each segment (after the first).
                       Return True to merge this first row with the last row from
                       the previous segment.

                     • func(prev_row, current_row, row_index, segment) -> bool
                       Called for every row. Return True to merge current_row with
                       the previous row in the aggregated results.

                     When True is returned, rows are concatenated cell-by-cell.
                     This is useful for handling table rows split across page
                     boundaries or segments. If None, rows are never merged.

    Returns:
        TableResult object containing the aggregated table data from all segments.

    Example:
        Multi-page table extraction:
        ```python
        pdf = npdf.PDF("multi_page_table.pdf")

        # Create flow for table spanning pages 2-4
        table_flow = Flow(
            segments=[pdf.pages[1], pdf.pages[2], pdf.pages[3]],
            arrangement='vertical'
        )

        # Extract table as if it were continuous
        table_data = table_flow.extract_table()
        df = table_data.df  # Convert to pandas DataFrame

        # Custom row stitching - single parameter (simple case)
        table_data = table_flow.extract_table(
            stitch_rows=lambda row: row and not (row[0] or "").strip()
        )

        # Custom row stitching - full parameters (advanced case)
        table_data = table_flow.extract_table(
            stitch_rows=lambda prev, curr, idx, seg: idx == 0 and curr and not (curr[0] or "").strip()
        )
        ```
    """
    logger.info(
        f"Extracting table from Flow with {len(self.segments)} segments (method: {method or 'auto'})"
    )

    if not self.segments:
        logger.warning("Flow has no segments, returning empty table")
        return TableResult([])

    # Resolve predicate and determine its signature
    predicate: Optional[Callable] = None
    predicate_type: str = "none"

    if callable(stitch_rows):
        import inspect

        sig = inspect.signature(stitch_rows)
        param_count = len(sig.parameters)

        if param_count == 1:
            predicate = stitch_rows
            predicate_type = "single_param"
        elif param_count == 4:
            predicate = stitch_rows
            predicate_type = "full_params"
        else:
            logger.warning(
                f"stitch_rows function has {param_count} parameters, expected 1 or 4. Ignoring."
            )
            predicate = None
            predicate_type = "none"

    def _default_merge(
        prev_row: List[Optional[str]], cur_row: List[Optional[str]]
    ) -> List[Optional[str]]:
        from itertools import zip_longest

        merged: List[Optional[str]] = []
        for p, c in zip_longest(prev_row, cur_row, fillvalue=""):
            if (p or "").strip() and (c or "").strip():
                merged.append(f"{p} {c}".strip())
            else:
                merged.append((p or "") + (c or ""))
        return merged

    aggregated_rows: List[List[Optional[str]]] = []
    processed_segments = 0
    header_row: Optional[List[Optional[str]]] = None
    merge_headers_enabled = False
    headers_warned = False  # Track if we've already warned about dropping headers
    segment_has_repeated_header = []  # Track which segments have repeated headers

    for seg_idx, segment in enumerate(self.segments):
        try:
            logger.debug(f"  Extracting table from segment {seg_idx+1}/{len(self.segments)}")

            segment_result = segment.extract_table(
                method=method,
                table_settings=table_settings.copy() if table_settings else None,
                use_ocr=use_ocr,
                ocr_config=ocr_config,
                text_options=text_options.copy() if text_options else None,
                cell_extraction_func=cell_extraction_func,
                show_progress=show_progress,
                content_filter=content_filter,
            )

            if not segment_result:
                continue

            if hasattr(segment_result, "_rows"):
                segment_rows = list(segment_result._rows)
            else:
                segment_rows = list(segment_result)

            if not segment_rows:
                logger.debug(f"    No table data found in segment {seg_idx+1}")
                continue

            # Handle header detection and merging for multi-page tables
            if seg_idx == 0:
                # First segment: capture potential header row
                if segment_rows:
                    header_row = segment_rows[0]
                    # Determine if we should merge headers
                    if merge_headers is None:
                        # Auto-detect: we'll check all subsequent segments
                        merge_headers_enabled = False  # Will be determined later
                    else:
                        merge_headers_enabled = merge_headers
                    # Track that first segment exists (for consistency checking)
                    segment_has_repeated_header.append(False)  # First segment doesn't "repeat"
            elif seg_idx == 1 and merge_headers is None:
                # Auto-detection: check if first row of second segment matches header
                has_header = segment_rows and header_row and segment_rows[0] == header_row
                segment_has_repeated_header.append(has_header)

                if has_header:
                    merge_headers_enabled = True
                    # Remove the detected repeated header from this segment
                    segment_rows = segment_rows[1:]
                    logger.debug(
                        f"    Auto-detected repeated header in segment {seg_idx+1}, removed"
                    )
                    if not headers_warned:
                        warnings.warn(
                            "Detected repeated headers in multi-page table. Merging by removing "
                            "repeated headers from subsequent pages.",
                            UserWarning,
                            stacklevel=2,
                        )
                        headers_warned = True
                else:
                    merge_headers_enabled = False
                    logger.debug(f"    No repeated header detected in segment {seg_idx+1}")
            elif seg_idx > 1:
                # Check consistency: all segments should have same pattern
                has_header = segment_rows and header_row and segment_rows[0] == header_row
                segment_has_repeated_header.append(has_header)

                # Remove header if merging is enabled and header is present
                if merge_headers_enabled and has_header:
                    segment_rows = segment_rows[1:]
                    logger.debug(f"    Removed repeated header from segment {seg_idx+1}")
            elif seg_idx > 0 and merge_headers_enabled:
                # Explicit merge_headers=True: remove headers from subsequent segments
                if segment_rows and header_row and segment_rows[0] == header_row:
                    segment_rows = segment_rows[1:]
                    logger.debug(f"    Removed repeated header from segment {seg_idx+1}")
                    if not headers_warned:
                        warnings.warn(
                            "Removing repeated headers from multi-page table during merge.",
                            UserWarning,
                            stacklevel=2,
                        )
                        headers_warned = True

            for row_idx, row in enumerate(segment_rows):
                should_merge = False

                if predicate is not None and aggregated_rows:
                    if predicate_type == "single_param":
                        # For single param: only call on first row of segment (row_idx == 0)
                        # and pass the current row
                        if row_idx == 0:
                            should_merge = predicate(row)
                    elif predicate_type == "full_params":
                        # For full params: call with all arguments
                        should_merge = predicate(aggregated_rows[-1], row, row_idx, segment)

                if should_merge:
                    aggregated_rows[-1] = _default_merge(aggregated_rows[-1], row)
                else:
                    aggregated_rows.append(row)

            processed_segments += 1
            logger.debug(
                f"    Added {len(segment_rows)} rows (post-merge) from segment {seg_idx+1}"
            )

        except Exception as e:
            logger.error(f"Error extracting table from segment {seg_idx+1}: {e}", exc_info=True)
            continue

    # Check for inconsistent header patterns after processing all segments
    if merge_headers is None and len(segment_has_repeated_header) > 2:
        # During auto-detection, check for consistency across all segments
        expected_pattern = segment_has_repeated_header[1]  # Pattern from second segment
        for seg_idx, has_header in enumerate(segment_has_repeated_header[2:], 2):
            if has_header != expected_pattern:
                # Inconsistent pattern detected
                segments_with_headers = [
                    i for i, has_h in enumerate(segment_has_repeated_header[1:], 1) if has_h
                ]
                segments_without_headers = [
                    i for i, has_h in enumerate(segment_has_repeated_header[1:], 1) if not has_h
                ]
                raise ValueError(
                    f"Inconsistent header pattern in multi-page table: "
                    f"segments {segments_with_headers} have repeated headers, "
                    f"but segments {segments_without_headers} do not. "
                    f"All segments must have the same header pattern for reliable merging."
                )

    logger.info(
        f"Flow table extraction complete: {len(aggregated_rows)} total rows from {processed_segments}/{len(self.segments)} segments"
    )
    return TableResult(aggregated_rows)
natural_pdf.Flow.find(selector=None, *, text=None, apply_exclusions=True, regex=False, case=True, **kwargs)

Finds the first element within the flow that matches the given selector or text criteria.

Elements found are wrapped as FlowElement objects, anchored to this Flow.

Parameters:

selector (Optional[str], default: None)
    CSS-like selector string.

text (Optional[str], default: None)
    Text content to search for.

apply_exclusions (bool, default: True)
    Whether to respect exclusion zones on the original pages/regions.

regex (bool, default: False)
    Whether the text search uses regex.

case (bool, default: True)
    Whether the text search is case-sensitive.

**kwargs (default: {})
    Additional filter parameters for the underlying find operation.

Returns:

Optional[FlowElement]
    A FlowElement if a match is found, otherwise None.

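Example

A short usage sketch (assumes pdf is an open npdf.PDF; the selector is illustrative):

flow = Flow(segments=pdf.pages, arrangement='vertical')

# First bold element anywhere in the flow, wrapped as a FlowElement
heading = flow.find('text:bold')
if heading is not None:
    # The underlying physical element stays accessible
    print(heading.physical_object)
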
Source code in natural_pdf/flows/flow.py
def find(
    self,
    selector: Optional[str] = None,
    *,
    text: Optional[str] = None,
    apply_exclusions: bool = True,
    regex: bool = False,
    case: bool = True,
    **kwargs,
) -> Optional["FlowElement"]:
    """
    Finds the first element within the flow that matches the given selector or text criteria.

    Elements found are wrapped as FlowElement objects, anchored to this Flow.

    Args:
        selector: CSS-like selector string.
        text: Text content to search for.
        apply_exclusions: Whether to respect exclusion zones on the original pages/regions.
        regex: Whether the text search uses regex.
        case: Whether the text search is case-sensitive.
        **kwargs: Additional filter parameters for the underlying find operation.

    Returns:
        A FlowElement if a match is found, otherwise None.
    """
    results = self.find_all(
        selector=selector,
        text=text,
        apply_exclusions=apply_exclusions,
        regex=regex,
        case=case,
        **kwargs,
    )
    return results.first if results else None
natural_pdf.Flow.find_all(selector=None, *, text=None, apply_exclusions=True, regex=False, case=True, **kwargs)

Finds all elements within the flow that match the given selector or text criteria.

This method efficiently groups segments by their parent pages, searches at the page level, then filters results appropriately for each segment. This ensures elements that intersect with flow segments (but aren't fully contained) are still found.

Elements found are wrapped as FlowElement objects, anchored to this Flow, and returned in a FlowElementCollection.
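For example, a sketch (FlowElementCollection exposes its wrapped elements via .flow_elements, as used in the library code further below):

# All bold text across every segment, including elements that only
# partially overlap a segment
bold = flow.find_all("text:bold")
for fe in bold.flow_elements:
    print(fe.bbox)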

Source code in natural_pdf/flows/flow.py
def find_all(
    self,
    selector: Optional[str] = None,
    *,
    text: Optional[str] = None,
    apply_exclusions: bool = True,
    regex: bool = False,
    case: bool = True,
    **kwargs,
) -> "FlowElementCollection":
    """
    Finds all elements within the flow that match the given selector or text criteria.

    This method efficiently groups segments by their parent pages, searches at the page level,
    then filters results appropriately for each segment. This ensures elements that intersect
    with flow segments (but aren't fully contained) are still found.

    Elements found are wrapped as FlowElement objects, anchored to this Flow,
    and returned in a FlowElementCollection.
    """
    from .collections import FlowElementCollection
    from .element import FlowElement

    # Step 1: Group segments by their parent pages (like in analyze_layout)
    segments_by_page = {}  # Dict[Page, List[Segment]]

    for i, segment in enumerate(self.segments):
        # Determine the page for this segment - fix type detection
        if hasattr(segment, "page") and hasattr(segment.page, "find_all"):
            # It's a Region object (has a parent page)
            page_obj = segment.page
            segment_type = "region"
        elif (
            hasattr(segment, "find_all")
            and hasattr(segment, "width")
            and hasattr(segment, "height")
            and not hasattr(segment, "page")
        ):
            # It's a Page object (has find_all but no parent page)
            page_obj = segment
            segment_type = "page"
        else:
            logger.warning(f"Segment {i+1} does not support find_all, skipping")
            continue

        if page_obj not in segments_by_page:
            segments_by_page[page_obj] = []
        segments_by_page[page_obj].append((segment, segment_type))

    if not segments_by_page:
        logger.warning("No segments with searchable pages found")
        return FlowElementCollection([])

    # Step 2: Search each unique page only once
    all_flow_elements: List["FlowElement"] = []

    for page_obj, page_segments in segments_by_page.items():
        # Find all matching elements on this page
        page_matches = page_obj.find_all(
            selector=selector,
            text=text,
            apply_exclusions=apply_exclusions,
            regex=regex,
            case=case,
            **kwargs,
        )

        if not page_matches:
            continue

        # Step 3: For each segment on this page, collect relevant elements
        for segment, segment_type in page_segments:
            if segment_type == "page":
                # Full page segment: include all elements
                for phys_elem in page_matches.elements:
                    all_flow_elements.append(FlowElement(physical_object=phys_elem, flow=self))

            elif segment_type == "region":
                # Region segment: filter to only intersecting elements
                for phys_elem in page_matches.elements:
                    try:
                        # Check if element intersects with this flow segment
                        if segment.intersects(phys_elem):
                            all_flow_elements.append(
                                FlowElement(physical_object=phys_elem, flow=self)
                            )
                    except Exception as intersect_error:
                        logger.debug(
                            f"Error checking intersection for element: {intersect_error}"
                        )
                        # Include the element anyway if intersection check fails
                        all_flow_elements.append(
                            FlowElement(physical_object=phys_elem, flow=self)
                        )

    # Step 4: Remove duplicates (can happen if multiple segments intersect the same element)
    unique_flow_elements = []
    seen_element_ids = set()

    for flow_elem in all_flow_elements:
        # Create a unique identifier for the underlying physical element
        phys_elem = flow_elem.physical_object
        elem_id = (
            (
                getattr(phys_elem.page, "index", id(phys_elem.page))
                if hasattr(phys_elem, "page")
                else id(phys_elem)
            ),
            phys_elem.bbox if hasattr(phys_elem, "bbox") else id(phys_elem),
        )

        if elem_id not in seen_element_ids:
            unique_flow_elements.append(flow_elem)
            seen_element_ids.add(elem_id)

    return FlowElementCollection(unique_flow_elements)
natural_pdf.Flow.get_element_flow_coordinates(physical_element)

Translates a physical element's coordinates into the flow's virtual coordinate system. (Placeholder: this becomes complex when segment_gap > 0 or non-trivial alignments are used.)

Source code in natural_pdf/flows/flow.py
def get_element_flow_coordinates(
    self, physical_element: "PhysicalElement"
) -> Optional[tuple[float, float, float, float]]:
    """
    Translates a physical element's coordinates into the flow's virtual coordinate system.
    (Placeholder: this becomes complex when segment_gap > 0 or non-trivial alignments are used.)
    """
    # For now, elements operate in their own physical coordinates. This method would be needed
    # if FlowRegion.bbox or other operations needed to present a unified coordinate space.
    # As per our discussion, elements *within* a FlowRegion retain original physical coordinates.
    # So, this might not be strictly necessary for the current design's core functionality.
    raise NotImplementedError(
        "Translating element coordinates to a unified flow coordinate system is not yet implemented."
    )
natural_pdf.Flow.get_sections(start_elements=None, end_elements=None, new_section_on_page_break=False, include_boundaries='both', orientation='vertical')

Extract logical sections from the Flow based on start and end boundary elements, mirroring the behaviour of PDF/PageCollection.get_sections().

Rather than delegating to a temporary PageCollection, this implementation works segment by segment: any FlowElement / FlowElementCollection inputs are first unwrapped to their underlying physical elements, boundary elements are located within each segment, and sections are then built as plain Region objects when they fall within a single segment or as FlowRegion objects when they span segments.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| start_elements | | Elements or selector string that mark the start of sections (optional). | None |
| end_elements | | Elements or selector string that mark the end of sections (optional). | None |
| new_section_on_page_break | bool | Whether to start a new section at page boundaries. | False |
| include_boundaries | str | How to include boundary elements: 'start', 'end', 'both', or 'none'. | 'both' |
| orientation | str | 'vertical' or 'horizontal'; determines section direction. | 'vertical' |

Returns:

| Type | Description |
|------|-------------|
| ElementCollection | ElementCollection of Region/FlowRegion objects representing the extracted sections. |
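A brief sketch, assuming bold headings mark section starts:

# One section per bold heading, running to the next heading or the end of the flow
sections = flow.get_sections(start_elements="text:bold")

for section in sections:
    # Each item is a Region (single segment) or FlowRegion (spans segments)
    print(section.extract_text()[:80])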

Source code in natural_pdf/flows/flow.py
def get_sections(
    self,
    start_elements=None,
    end_elements=None,
    new_section_on_page_break: bool = False,
    include_boundaries: str = "both",
    orientation: str = "vertical",
) -> "ElementCollection":
    """
    Extract logical sections from the Flow based on *start* and *end* boundary
    elements, mirroring the behaviour of PDF/PageCollection.get_sections().

    Rather than delegating to a temporary PageCollection, this
    implementation works segment by segment: any FlowElement /
    FlowElementCollection inputs are first unwrapped to their underlying
    physical elements, boundary elements are located within each
    segment, and sections are then built as plain Region objects (within
    a single segment) or FlowRegion objects (spanning segments).

    Args:
        start_elements: Elements or selector string that mark the start of
            sections (optional).
        end_elements: Elements or selector string that mark the end of
            sections (optional).
        new_section_on_page_break: Whether to start a new section at page
            boundaries (default: False).
        include_boundaries: How to include boundary elements: 'start',
            'end', 'both', or 'none' (default: 'both').
        orientation: 'vertical' (default) or 'horizontal' - determines section direction.

    Returns:
        ElementCollection of Region/FlowRegion objects representing the
        extracted sections.
    """
    # ------------------------------------------------------------------
    # Unwrap FlowElement(-Collection) inputs and selector strings so we
    # can reason about them generically.
    # ------------------------------------------------------------------
    from natural_pdf.flows.collections import FlowElementCollection
    from natural_pdf.flows.element import FlowElement

    def _unwrap(obj):
        """Convert Flow-specific wrappers to their underlying physical objects.

        Keeps selector strings as-is; converts FlowElement to its physical
        element; converts FlowElementCollection to list of physical
        elements; passes through ElementCollection by taking .elements.
        """

        if obj is None or isinstance(obj, str):
            return obj

        if isinstance(obj, FlowElement):
            return obj.physical_object

        if isinstance(obj, FlowElementCollection):
            return [fe.physical_object for fe in obj.flow_elements]

        if hasattr(obj, "elements"):
            return obj.elements

        if isinstance(obj, (list, tuple, set)):
            out = []
            for item in obj:
                if isinstance(item, FlowElement):
                    out.append(item.physical_object)
                else:
                    out.append(item)
            return out

        return obj  # Fallback – unknown type

    start_elements_unwrapped = _unwrap(start_elements)
    end_elements_unwrapped = _unwrap(end_elements)

    # ------------------------------------------------------------------
    # For Flow, we need to handle sections that may span segments
    # We'll process all segments together, not independently
    # ------------------------------------------------------------------
    from natural_pdf.elements.element_collection import ElementCollection
    from natural_pdf.elements.region import Region
    from natural_pdf.flows.element import FlowElement
    from natural_pdf.flows.region import FlowRegion

    # Helper to check if element is in segment
    def _element_in_segment(elem, segment):
        # Simple bbox check
        return (
            elem.page == segment.page
            and elem.top >= segment.top
            and elem.bottom <= segment.bottom
            and elem.x0 >= segment.x0
            and elem.x1 <= segment.x1
        )

    # Collect all boundary elements with their segment info
    all_starts = []
    all_ends = []

    for seg_idx, segment in enumerate(self.segments):
        # Find starts in this segment
        if isinstance(start_elements_unwrapped, str):
            seg_starts = segment.find_all(start_elements_unwrapped).elements
        elif start_elements_unwrapped:
            seg_starts = [
                e for e in start_elements_unwrapped if _element_in_segment(e, segment)
            ]
        else:
            seg_starts = []

        for elem in seg_starts:
            all_starts.append((elem, seg_idx, segment))

        # Find ends in this segment
        if isinstance(end_elements_unwrapped, str):
            seg_ends = segment.find_all(end_elements_unwrapped).elements
        elif end_elements_unwrapped:
            seg_ends = [e for e in end_elements_unwrapped if _element_in_segment(e, segment)]
        else:
            seg_ends = []

        for elem in seg_ends:
            all_ends.append((elem, seg_idx, segment))

    # Sort by segment index, then position
    all_starts.sort(key=lambda x: (x[1], x[0].top, x[0].x0))
    all_ends.sort(key=lambda x: (x[1], x[0].top, x[0].x0))

    # If no boundary elements found, return empty collection
    if not all_starts and not all_ends:
        return ElementCollection([])

    sections = []

    # Case 1: Only start elements provided
    if all_starts and not all_ends:
        for i in range(len(all_starts)):
            start_elem, start_seg_idx, start_seg = all_starts[i]

            # Find end (next start or end of flow)
            if i + 1 < len(all_starts):
                # Section ends at next start
                end_elem, end_seg_idx, end_seg = all_starts[i + 1]

                if start_seg_idx == end_seg_idx:
                    # Same segment - create regular Region
                    section = start_seg.get_section_between(
                        start_elem, end_elem, include_boundaries, orientation
                    )
                    if section:
                        sections.append(section)
                else:
                    # Cross-segment - create FlowRegion
                    regions = []

                    # First segment: from start to bottom
                    if include_boundaries in ["both", "start"]:
                        top = start_elem.top
                    else:
                        top = start_elem.bottom
                    regions.append(
                        Region(
                            start_seg.page, (start_seg.x0, top, start_seg.x1, start_seg.bottom)
                        )
                    )

                    # Middle segments (full)
                    for idx in range(start_seg_idx + 1, end_seg_idx):
                        regions.append(self.segments[idx])

                    # Last segment: from top to end element
                    if include_boundaries in ["both", "end"]:
                        bottom = end_elem.bottom
                    else:
                        bottom = end_elem.top
                    regions.append(
                        Region(end_seg.page, (end_seg.x0, end_seg.top, end_seg.x1, bottom))
                    )

                    # Create FlowRegion
                    flow_element = FlowElement(physical_object=start_elem, flow=self)
                    flow_region = FlowRegion(
                        flow=self,
                        constituent_regions=regions,
                        source_flow_element=flow_element,
                        boundary_element_found=end_elem,
                    )
                    flow_region.start_element = start_elem
                    flow_region.end_element = end_elem
                    flow_region._boundary_exclusions = include_boundaries
                    sections.append(flow_region)
            else:
                # Last section - goes to end of flow
                if start_seg_idx == len(self.segments) - 1:
                    # Within last segment
                    section = start_seg.get_section_between(
                        start_elem, None, include_boundaries, orientation
                    )
                    if section:
                        sections.append(section)
                else:
                    # Spans to end
                    regions = []

                    # First segment: from start to bottom
                    if include_boundaries in ["both", "start"]:
                        top = start_elem.top
                    else:
                        top = start_elem.bottom
                    regions.append(
                        Region(
                            start_seg.page, (start_seg.x0, top, start_seg.x1, start_seg.bottom)
                        )
                    )

                    # Remaining segments (full)
                    for idx in range(start_seg_idx + 1, len(self.segments)):
                        regions.append(self.segments[idx])

                    # Create FlowRegion
                    flow_element = FlowElement(physical_object=start_elem, flow=self)
                    flow_region = FlowRegion(
                        flow=self,
                        constituent_regions=regions,
                        source_flow_element=flow_element,
                        boundary_element_found=None,
                    )
                    flow_region.start_element = start_elem
                    flow_region._boundary_exclusions = include_boundaries
                    sections.append(flow_region)

    # Case 2: Both start and end elements
    elif all_starts and all_ends:
        # Match starts with ends
        used_ends = set()

        for start_elem, start_seg_idx, start_seg in all_starts:
            # Find matching end
            best_end = None

            for end_elem, end_seg_idx, end_seg in all_ends:
                if id(end_elem) in used_ends:
                    continue

                # End must come after start
                if end_seg_idx > start_seg_idx or (
                    end_seg_idx == start_seg_idx and end_elem.top >= start_elem.bottom
                ):
                    best_end = (end_elem, end_seg_idx, end_seg)
                    break

            if best_end:
                end_elem, end_seg_idx, end_seg = best_end
                used_ends.add(id(end_elem))

                if start_seg_idx == end_seg_idx:
                    # Same segment
                    section = start_seg.get_section_between(
                        start_elem, end_elem, include_boundaries, orientation
                    )
                    if section:
                        sections.append(section)
                else:
                    # Cross-segment FlowRegion
                    regions = []

                    # First segment
                    if include_boundaries in ["both", "start"]:
                        top = start_elem.top
                    else:
                        top = start_elem.bottom
                    regions.append(
                        Region(
                            start_seg.page, (start_seg.x0, top, start_seg.x1, start_seg.bottom)
                        )
                    )

                    # Middle segments
                    for idx in range(start_seg_idx + 1, end_seg_idx):
                        regions.append(self.segments[idx])

                    # Last segment
                    if include_boundaries in ["both", "end"]:
                        bottom = end_elem.bottom
                    else:
                        bottom = end_elem.top
                    regions.append(
                        Region(end_seg.page, (end_seg.x0, end_seg.top, end_seg.x1, bottom))
                    )

                    # Create FlowRegion
                    flow_element = FlowElement(physical_object=start_elem, flow=self)
                    flow_region = FlowRegion(
                        flow=self,
                        constituent_regions=regions,
                        source_flow_element=flow_element,
                        boundary_element_found=end_elem,
                    )
                    flow_region.start_element = start_elem
                    flow_region.end_element = end_elem
                    flow_region._boundary_exclusions = include_boundaries
                    sections.append(flow_region)

    # Case 3: Only end elements (sections from beginning to each end)
    elif not all_starts and all_ends:
        # TODO: Handle this case if needed
        pass

    return ElementCollection(sections)
natural_pdf.Flow.get_segment_bounding_box_in_flow(segment_index)

Calculates the conceptual bounding box of a segment within the flow's coordinate system. This considers arrangement, alignment, and segment gaps. (This is a placeholder for more complex logic if a true virtual coordinate system is needed.) For now, it might just return the physical segment's bbox if gaps are 0 and alignment is simple.

Source code in natural_pdf/flows/flow.py
def get_segment_bounding_box_in_flow(
    self, segment_index: int
) -> Optional[tuple[float, float, float, float]]:
    """
    Calculates the conceptual bounding box of a segment within the flow's coordinate system.
    This considers arrangement, alignment, and segment gaps.
    (This is a placeholder for more complex logic if a true virtual coordinate system is needed.)
    For now, it might just return the physical segment's bbox if gaps are 0 and alignment is simple.
    """
    if segment_index < 0 or segment_index >= len(self.segments):
        return None

    # This is a simplified version. A full implementation would calculate offsets.
    # For now, we assume FlowElement directional logic handles segment traversal and uses physical coords.
    # If we were to *draw* the flow or get a FlowRegion bbox that spans gaps, this would be critical.
    # physical_segment = self.segments[segment_index]
    # return physical_segment.bbox
    raise NotImplementedError(
        "Calculating a segment's bbox *within the flow's virtual coordinate system* is not yet fully implemented."
    )
natural_pdf.Flow.highlights(show=False)

Create a highlight context for accumulating highlights.

This allows for clean syntax to show multiple highlight groups:

Example

    with flow.highlights() as h:
        h.add(flow.find_all('table'), label='tables', color='blue')
        h.add(flow.find_all('text:bold'), label='bold text', color='red')
        h.show()

Or with automatic display:

    with flow.highlights(show=True) as h:
        h.add(flow.find_all('table'), label='tables')
        h.add(flow.find_all('text:bold'), label='bold')
        # Automatically shows when exiting the context

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| show | bool | If True, automatically show highlights when exiting the context. | False |

Returns:

| Type | Description |
|------|-------------|
| HighlightContext | HighlightContext for accumulating highlights. |

Source code in natural_pdf/flows/flow.py
def highlights(self, show: bool = False):
    """
    Create a highlight context for accumulating highlights.

    This allows for clean syntax to show multiple highlight groups:

    Example:
        with flow.highlights() as h:
            h.add(flow.find_all('table'), label='tables', color='blue')
            h.add(flow.find_all('text:bold'), label='bold text', color='red')
            h.show()

    Or with automatic display:
        with flow.highlights(show=True) as h:
            h.add(flow.find_all('table'), label='tables')
            h.add(flow.find_all('text:bold'), label='bold')
            # Automatically shows when exiting the context

    Args:
        show: If True, automatically show highlights when exiting context

    Returns:
        HighlightContext for accumulating highlights
    """
    from natural_pdf.core.highlighting_service import HighlightContext

    return HighlightContext(self, show_on_exit=show)
natural_pdf.Flow.show(*, resolution=None, width=None, color=None, labels=True, label_format=None, highlights=None, layout='stack', stack_direction='vertical', gap=5, columns=None, crop=False, crop_bbox=None, in_context=False, separator_color=None, separator_thickness=2, **kwargs)

Generate a preview image with highlights.

If in_context=True, shows segments as cropped images stacked together with separators between segments.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| resolution | Optional[float] | DPI for rendering (default from global settings). | None |
| width | Optional[int] | Target width in pixels (overrides resolution). | None |
| color | Optional[Union[str, Tuple[int, int, int]]] | Default highlight color. | None |
| labels | bool | Whether to show labels for highlights. | True |
| label_format | Optional[str] | Format string for labels. | None |
| highlights | Optional[List[Dict[str, Any]]] | Additional highlight groups to show. | None |
| layout | Literal['stack', 'grid', 'single'] | How to arrange multiple pages/regions. | 'stack' |
| stack_direction | Literal['vertical', 'horizontal'] | Direction for stack layout. | 'vertical' |
| gap | int | Pixels between stacked images. | 5 |
| columns | Optional[int] | Number of columns for grid layout. | None |
| crop | Union[bool, Literal['content']] | Whether to crop. | False |
| crop_bbox | Optional[Tuple[float, float, float, float]] | Explicit crop bounds. | None |
| in_context | bool | If True, use special Flow visualization with separators. | False |
| separator_color | Optional[Tuple[int, int, int]] | RGB color for separator lines (default: red). | None |
| separator_thickness | int | Thickness of separator lines. | 2 |
| `**kwargs` | | Additional parameters passed to rendering. | {} |

Returns:

| Type | Description |
|------|-------------|
| Optional[Image] | PIL Image object, or None if there is nothing to render. |
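Typical calls, as a sketch (parameters as documented above; `flow` is an existing Flow):

# Standard stacked preview with highlights and labels
img = flow.show(labels=True)

# Flow-specific view: segments cropped and stacked, separated by red lines
img = flow.show(in_context=True, separator_color=(255, 0, 0), separator_thickness=2)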

Source code in natural_pdf/flows/flow.py
def show(
    self,
    *,
    # Basic rendering options
    resolution: Optional[float] = None,
    width: Optional[int] = None,
    # Highlight options
    color: Optional[Union[str, Tuple[int, int, int]]] = None,
    labels: bool = True,
    label_format: Optional[str] = None,
    highlights: Optional[List[Dict[str, Any]]] = None,
    # Layout options for multi-page/region
    layout: Literal["stack", "grid", "single"] = "stack",
    stack_direction: Literal["vertical", "horizontal"] = "vertical",
    gap: int = 5,
    columns: Optional[int] = None,  # For grid layout
    # Cropping options
    crop: Union[bool, Literal["content"]] = False,
    crop_bbox: Optional[Tuple[float, float, float, float]] = None,
    # Flow-specific options
    in_context: bool = False,
    separator_color: Optional[Tuple[int, int, int]] = None,
    separator_thickness: int = 2,
    **kwargs,
) -> Optional["PIL_Image"]:
    """Generate a preview image with highlights.

    If in_context=True, shows segments as cropped images stacked together
    with separators between segments.

    Args:
        resolution: DPI for rendering (default from global settings)
        width: Target width in pixels (overrides resolution)
        color: Default highlight color
        labels: Whether to show labels for highlights
        label_format: Format string for labels
        highlights: Additional highlight groups to show
        layout: How to arrange multiple pages/regions
        stack_direction: Direction for stack layout
        gap: Pixels between stacked images
        columns: Number of columns for grid layout
        crop: Whether to crop
        crop_bbox: Explicit crop bounds
        in_context: If True, use special Flow visualization with separators
        separator_color: RGB color for separator lines (default: red)
        separator_thickness: Thickness of separator lines
        **kwargs: Additional parameters passed to rendering

    Returns:
        PIL Image object or None if nothing to render
    """
    if in_context:
        # Use the special in_context visualization
        return self._show_in_context(
            resolution=resolution or 150,
            width=width,
            stack_direction=stack_direction,
            stack_gap=gap,
            separator_color=separator_color or (255, 0, 0),
            separator_thickness=separator_thickness,
            **kwargs,
        )

    # Otherwise use the standard show method
    return super().show(
        resolution=resolution,
        width=width,
        color=color,
        labels=labels,
        label_format=label_format,
        highlights=highlights,
        layout=layout,
        stack_direction=stack_direction,
        gap=gap,
        columns=columns,
        crop=crop,
        crop_bbox=crop_bbox,
        **kwargs,
    )
natural_pdf.FlowRegion

Bases: Visualizable

Represents a selected area within a Flow, potentially composed of multiple physical Region objects (constituent_regions) that might span across different original pages or disjoint physical regions defined in the Flow.

A FlowRegion is the result of a directional operation (e.g., .below(), .above()) on a FlowElement.

Source code in natural_pdf/flows/region.py
class FlowRegion(Visualizable):
    """
    Represents a selected area within a Flow, potentially composed of multiple
    physical Region objects (constituent_regions) that might span across
    different original pages or disjoint physical regions defined in the Flow.

    A FlowRegion is the result of a directional operation (e.g., .below(), .above())
    on a FlowElement.
    """

    def __init__(
        self,
        flow: "Flow",
        constituent_regions: List["PhysicalRegion"],
        source_flow_element: "FlowElement",
        boundary_element_found: Optional["PhysicalElement"] = None,
    ):
        """
        Initializes a FlowRegion.

        Args:
            flow: The Flow instance this region belongs to.
            constituent_regions: A list of physical natural_pdf.elements.region.Region
                                 objects that make up this FlowRegion.
            source_flow_element: The FlowElement that created this FlowRegion.
            boundary_element_found: The physical element that stopped an 'until' search,
                                    if applicable.
        """
        self.flow: "Flow" = flow
        self.constituent_regions: List["PhysicalRegion"] = constituent_regions
        self.source_flow_element: "FlowElement" = source_flow_element
        self.boundary_element_found: Optional["PhysicalElement"] = boundary_element_found

        # Add attributes for grid building, similar to Region
        self.source: Optional[str] = None
        self.region_type: Optional[str] = None
        self.metadata: Dict[str, Any] = {}

        # Cache for expensive operations
        self._cached_text: Optional[str] = None
        self._cached_elements: Optional["ElementCollection"] = None  # Stringized
        self._cached_bbox: Optional[Tuple[float, float, float, float]] = None

    def _get_highlighter(self):
        """Get the highlighting service from constituent regions."""
        if not self.constituent_regions:
            raise RuntimeError("FlowRegion has no constituent regions to get highlighter from")

        # Get highlighter from first constituent region
        first_region = self.constituent_regions[0]
        if hasattr(first_region, "_highlighter"):
            return first_region._highlighter
        elif hasattr(first_region, "page") and hasattr(first_region.page, "_highlighter"):
            return first_region.page._highlighter
        else:
            raise RuntimeError(
                f"Cannot find HighlightingService from FlowRegion constituent regions. "
                f"First region type: {type(first_region).__name__}"
            )

    def _get_render_specs(
        self,
        mode: Literal["show", "render"] = "show",
        color: Optional[Union[str, Tuple[int, int, int]]] = None,
        highlights: Optional[List[Dict[str, Any]]] = None,
        crop: Union[bool, Literal["content"]] = False,
        crop_bbox: Optional[Tuple[float, float, float, float]] = None,
        **kwargs,
    ) -> List[RenderSpec]:
        """Get render specifications for this flow region.

        Args:
            mode: Rendering mode - 'show' includes highlights, 'render' is clean
            color: Color for highlighting this region in show mode
            highlights: Additional highlight groups to show
            crop: Whether to crop to constituent regions
            crop_bbox: Explicit crop bounds
            **kwargs: Additional parameters

        Returns:
            List of RenderSpec objects, one per page with constituent regions
        """
        if not self.constituent_regions:
            return []

        # Group constituent regions by page
        regions_by_page = {}
        for region in self.constituent_regions:
            if hasattr(region, "page") and region.page:
                page = region.page
                if page not in regions_by_page:
                    regions_by_page[page] = []
                regions_by_page[page].append(region)

        if not regions_by_page:
            return []

        # Create RenderSpec for each page
        specs = []
        for page, page_regions in regions_by_page.items():
            spec = RenderSpec(page=page)

            # Handle cropping
            if crop_bbox:
                spec.crop_bbox = crop_bbox
            elif crop == "content" or crop is True:
                # Calculate bounds of regions on this page
                x_coords = []
                y_coords = []
                for region in page_regions:
                    if hasattr(region, "bbox") and region.bbox:
                        x0, y0, x1, y1 = region.bbox
                        x_coords.extend([x0, x1])
                        y_coords.extend([y0, y1])

                if x_coords and y_coords:
                    spec.crop_bbox = (min(x_coords), min(y_coords), max(x_coords), max(y_coords))

            # Add highlights in show mode
            if mode == "show":
                # Highlight constituent regions
                for i, region in enumerate(page_regions):
                    # Label each part if multiple regions
                    label = None
                    if len(self.constituent_regions) > 1:
                        # Find global index
                        try:
                            global_idx = self.constituent_regions.index(region)
                            label = f"FlowPart_{global_idx + 1}"
                        except ValueError:
                            label = f"FlowPart_{i + 1}"
                    else:
                        label = "FlowRegion"

                    spec.add_highlight(
                        bbox=region.bbox,
                        polygon=region.polygon if region.has_polygon else None,
                        color=color or "fuchsia",
                        label=label,
                    )

                # Add additional highlight groups if provided
                if highlights:
                    for group in highlights:
                        group_elements = group.get("elements", [])
                        group_color = group.get("color", color)
                        group_label = group.get("label")

                        for elem in group_elements:
                            # Only add if element is on this page
                            if hasattr(elem, "page") and elem.page == page:
                                spec.add_highlight(
                                    element=elem, color=group_color, label=group_label
                                )

            specs.append(spec)

        return specs

    def __getattr__(self, name: str) -> Any:
        """
        Dynamically proxy attribute access to the source FlowElement for safe attributes only.
        Spatial methods (above, below, left, right) are explicitly implemented to prevent
        silent failures and incorrect behavior.
        """
        if name in self.__dict__:
            return self.__dict__[name]

        # List of methods that should NOT be proxied - they need proper FlowRegion implementation
        spatial_methods = {"above", "below", "left", "right", "to_region"}

        if name in spatial_methods:
            raise AttributeError(
                f"'{self.__class__.__name__}' object has no attribute '{name}'. "
                f"This method requires proper FlowRegion implementation to handle spatial relationships correctly."
            )

        # Only proxy safe attributes and methods
        if self.source_flow_element is not None:
            try:
                attr = getattr(self.source_flow_element, name)
                # Only proxy non-callable attributes and explicitly safe methods
                if not callable(attr) or name in {"page", "document"}:  # Add safe methods as needed
                    return attr
                else:
                    raise AttributeError(
                        f"Method '{name}' cannot be safely proxied from FlowElement to FlowRegion. "
                        f"It may need explicit implementation."
                    )
            except AttributeError:
                pass

        raise AttributeError(f"'{self.__class__.__name__}' object has no attribute '{name}'")

    @property
    def bbox(self) -> Optional[Tuple[float, float, float, float]]:
        """
        The bounding box that encloses all constituent regions.
        Calculated dynamically and cached.
        """
        if self._cached_bbox is not None:
            return self._cached_bbox
        if not self.constituent_regions:
            return None

        # Use merge_bboxes from pdfplumber.utils.geometry to merge bboxes
        # Extract bbox tuples from regions first
        region_bboxes = [
            region.bbox for region in self.constituent_regions if hasattr(region, "bbox")
        ]
        if not region_bboxes:
            return None

        self._cached_bbox = merge_bboxes(region_bboxes)
        return self._cached_bbox

    @property
    def x0(self) -> Optional[float]:
        return self.bbox[0] if self.bbox else None

    @property
    def top(self) -> Optional[float]:
        return self.bbox[1] if self.bbox else None

    @property
    def x1(self) -> Optional[float]:
        return self.bbox[2] if self.bbox else None

    @property
    def bottom(self) -> Optional[float]:
        return self.bbox[3] if self.bbox else None

    @property
    def width(self) -> Optional[float]:
        return self.x1 - self.x0 if self.bbox else None

    @property
    def height(self) -> Optional[float]:
        return self.bottom - self.top if self.bbox else None

    def extract_text(self, apply_exclusions: bool = True, **kwargs) -> str:
        """
        Extracts and concatenates text from all constituent physical regions.
        The order of concatenation respects the flow's arrangement.

        Args:
            apply_exclusions: Whether to respect PDF exclusion zones within each
                              constituent physical region during text extraction.
            **kwargs: Additional arguments passed to the underlying extract_text method
                      of each constituent region.

        Returns:
            The combined text content as a string.
        """
        if (
            self._cached_text is not None and apply_exclusions
        ):  # Simple cache check, might need refinement if kwargs change behavior
            return self._cached_text

        if not self.constituent_regions:
            return ""

        texts: List[str] = []
        # For now, simple concatenation. Order depends on how constituent_regions were added.
        # The FlowElement._flow_direction method is responsible for ordering constituent_regions correctly.
        for region in self.constituent_regions:
            texts.append(region.extract_text(apply_exclusions=apply_exclusions, **kwargs))

        # Join based on flow arrangement (e.g., newline for vertical, space for horizontal)
        # This is a simplification; true layout-aware joining would be more complex.
        joiner = (
            "\n" if self.flow.arrangement == "vertical" else " "
        )  # TODO: Consider flow.segment_gap for proportional spacing between segments
        extracted = joiner.join(t for t in texts if t)

        if apply_exclusions:  # Only cache if standard exclusion behavior
            self._cached_text = extracted
        return extracted
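
    # Usage sketch (illustrative names): text from a cross-segment region.
    # Parts are joined with "\n" for vertical flows and " " for horizontal:
    #
    #   region = flow.find(text="Revenue").below()  # -> FlowRegion
    #   combined = region.extract_text()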

    def elements(self, apply_exclusions: bool = True) -> "ElementCollection":  # Stringized return
        """
        Collects all unique physical elements from all constituent physical regions.

        Args:
            apply_exclusions: Whether to respect PDF exclusion zones within each
                              constituent physical region when gathering elements.

        Returns:
            An ElementCollection containing all unique elements.
        """
        from natural_pdf.elements.element_collection import (
            ElementCollection as RuntimeElementCollection,  # Local import
        )

        if self._cached_elements is not None and apply_exclusions:  # Simple cache check
            return self._cached_elements

        if not self.constituent_regions:
            return RuntimeElementCollection([])

        all_physical_elements: List["PhysicalElement"] = []  # Stringized item type
        seen_elements = (
            set()
        )  # To ensure uniqueness if elements are shared or duplicated by region definitions

        for region in self.constituent_regions:
            # Region.get_elements() returns a list, not ElementCollection
            elements_in_region: List["PhysicalElement"] = region.get_elements(
                apply_exclusions=apply_exclusions
            )
            for elem in elements_in_region:
                if elem not in seen_elements:  # Check for uniqueness based on object identity
                    all_physical_elements.append(elem)
                    seen_elements.add(elem)

        # Basic reading order sort based on original page and coordinates.
        def get_sort_key(phys_elem: "PhysicalElement"):  # Stringized param type
            page_idx = -1
            if hasattr(phys_elem, "page") and hasattr(phys_elem.page, "index"):
                page_idx = phys_elem.page.index
            return (page_idx, phys_elem.top, phys_elem.x0)

        try:
            sorted_physical_elements = sorted(all_physical_elements, key=get_sort_key)
        except AttributeError:
            logger.warning(
                "Could not sort elements in FlowRegion by reading order; some elements might be missing page, top or x0 attributes."
            )
            sorted_physical_elements = all_physical_elements

        result_collection = RuntimeElementCollection(sorted_physical_elements)
        if apply_exclusions:
            self._cached_elements = result_collection
        return result_collection

    def find(
        self, selector: Optional[str] = None, *, text: Optional[str] = None, **kwargs
    ) -> Optional["PhysicalElement"]:  # Stringized
        """
        Find the first element in flow order that matches the selector or text.

        This implementation iterates through the constituent regions *in the order
        they appear in ``self.constituent_regions`` (i.e. document flow order),
        delegating the search to each region's own ``find`` method.  It therefore
        avoids constructing a huge intermediate ElementCollection and returns as
        soon as a match is found, which is substantially faster and ensures that
        selectors such as 'table' work exactly as they do on an individual
        Region.
        """
        if not self.constituent_regions:
            return None

        for region in self.constituent_regions:
            try:
                result = region.find(selector=selector, text=text, **kwargs)
                if result is not None:
                    return result
            except Exception as e:
                logger.warning(
                    f"FlowRegion.find: error searching region {region}: {e}",
                    exc_info=False,
                )
        return None  # No match found

    def find_all(
        self, selector: Optional[str] = None, *, text: Optional[str] = None, **kwargs
    ) -> "ElementCollection":  # Stringized
        """
        Find **all** elements across the constituent regions that match the given
        selector or text.

        Rather than first materialising *every* element in the FlowRegion (which
        can be extremely slow for multi-page flows), this implementation simply
        chains each region's native ``find_all`` call and concatenates their
        results into a single ElementCollection while preserving flow order.
        """
        from natural_pdf.elements.element_collection import (
            ElementCollection as RuntimeElementCollection,
        )

        matched_elements = []  # type: List["PhysicalElement"]

        if not self.constituent_regions:
            return RuntimeElementCollection([])

        for region in self.constituent_regions:
            try:
                region_matches = region.find_all(selector=selector, text=text, **kwargs)
                if region_matches:
                    # ``region_matches`` is an ElementCollection – extend with its
                    # underlying list so we don't create nested collections.
                    matched_elements.extend(
                        region_matches.elements
                        if hasattr(region_matches, "elements")
                        else list(region_matches)
                    )
            except Exception as e:
                logger.warning(
                    f"FlowRegion.find_all: error searching region {region}: {e}",
                    exc_info=False,
                )

        return RuntimeElementCollection(matched_elements)

    def highlight(
        self, label: Optional[str] = None, color: Optional[Union[Tuple, str]] = None, **kwargs
    ) -> "FlowRegion":  # Stringized
        """
        Highlights all constituent physical regions on their respective pages.

        Args:
            label: A base label for the highlights. Each constituent region might get an indexed label.
            color: Color for the highlight.
            **kwargs: Additional arguments for the underlying highlight method.

        Returns:
            Self for method chaining.
        """
        if not self.constituent_regions:
            return self

        base_label = label if label else "FlowRegionPart"
        for i, region in enumerate(self.constituent_regions):
            current_label = (
                f"{base_label}_{i+1}" if len(self.constituent_regions) > 1 else base_label
            )
            region.highlight(label=current_label, color=color, **kwargs)
        return self

    def highlights(self, show: bool = False) -> "HighlightContext":
        """
        Create a highlight context for accumulating highlights.

        This allows for clean syntax to show multiple highlight groups:

        Example:
            with flow_region.highlights() as h:
                h.add(flow_region.find_all('table'), label='tables', color='blue')
                h.add(flow_region.find_all('text:bold'), label='bold text', color='red')
                h.show()

        Or with automatic display:
            with flow_region.highlights(show=True) as h:
                h.add(flow_region.find_all('table'), label='tables')
                h.add(flow_region.find_all('text:bold'), label='bold')
                # Automatically shows when exiting the context

        Args:
            show: If True, automatically show highlights when exiting context

        Returns:
            HighlightContext for accumulating highlights
        """
        from natural_pdf.core.highlighting_service import HighlightContext

        return HighlightContext(self, show_on_exit=show)

    def to_images(
        self,
        resolution: float = 150,
        **kwargs,
    ) -> List["PIL_Image"]:
        """
        Generates and returns a list of cropped PIL Images,
        one for each constituent physical region of this FlowRegion.
        """
        if not self.constituent_regions:
            logger.info("FlowRegion.to_images() called on an empty FlowRegion.")
            return []

        cropped_images: List["PIL_Image"] = []
        for region_part in self.constituent_regions:
            try:
                # Use render() for clean image without highlights
                img = region_part.render(resolution=resolution, crop=True, **kwargs)
                if img:
                    cropped_images.append(img)
            except Exception as e:
                logger.error(
                    f"Error generating image for constituent region {region_part.bbox}: {e}",
                    exc_info=True,
                )

        return cropped_images

    def __repr__(self) -> str:
        return (
            f"<FlowRegion constituents={len(self.constituent_regions)}, flow={self.flow}, "
            f"source_bbox={self.source_flow_element.bbox if self.source_flow_element else 'N/A'}>"
        )

    def expand(
        self,
        left: float = 0,
        right: float = 0,
        top: float = 0,
        bottom: float = 0,
        width_factor: float = 1.0,
        height_factor: float = 1.0,
    ) -> "FlowRegion":
        """
        Create a new FlowRegion with all constituent regions expanded.

        Args:
            left: Amount to expand left edge (positive value expands leftwards)
            right: Amount to expand right edge (positive value expands rightwards)
            top: Amount to expand top edge (positive value expands upwards)
            bottom: Amount to expand bottom edge (positive value expands downwards)
            width_factor: Factor to multiply width by (applied after absolute expansion)
            height_factor: Factor to multiply height by (applied after absolute expansion)

        Returns:
            New FlowRegion with expanded constituent regions
        """
        if not self.constituent_regions:
            return FlowRegion(
                flow=self.flow,
                constituent_regions=[],
                source_flow_element=self.source_flow_element,
                boundary_element_found=self.boundary_element_found,
            )

        expanded_regions = []
        for idx, region in enumerate(self.constituent_regions):
            # Determine which adjustments to apply based on flow arrangement
            apply_left = left
            apply_right = right
            apply_top = top
            apply_bottom = bottom

            if self.flow.arrangement == "vertical":
                # In a vertical flow, only the *first* region should react to `top`
                # and only the *last* region should react to `bottom`.  This keeps
                # the virtual contiguous area intact while allowing users to nudge
                # the flow boundaries.
                if idx != 0:
                    apply_top = 0
                if idx != len(self.constituent_regions) - 1:
                    apply_bottom = 0
                # left/right apply to every region (same column width change)
            else:  # horizontal flow
                # In a horizontal flow, only the first region reacts to `left`
                # and only the last region reacts to `right`.
                if idx != 0:
                    apply_left = 0
                if idx != len(self.constituent_regions) - 1:
                    apply_right = 0
                # top/bottom apply to every region in horizontal flows

            # Skip no-op expansion to avoid creating redundant Region objects.
            # Any non-zero absolute offset, or a factor other than 1.0, counts.
            needs_expansion = (
                any(v != 0 for v in (apply_left, apply_right, apply_top, apply_bottom))
                or width_factor != 1.0
                or height_factor != 1.0
            )

            try:
                expanded_region = (
                    region.expand(
                        left=apply_left,
                        right=apply_right,
                        top=apply_top,
                        bottom=apply_bottom,
                        width_factor=width_factor,
                        height_factor=height_factor,
                    )
                    if needs_expansion
                    else region
                )
                expanded_regions.append(expanded_region)
            except Exception as e:
                logger.warning(
                    f"FlowRegion.expand: Error expanding constituent region {region.bbox}: {e}",
                    exc_info=False,
                )
                expanded_regions.append(region)

        # Create new FlowRegion with expanded constituent regions
        new_flow_region = FlowRegion(
            flow=self.flow,
            constituent_regions=expanded_regions,
            source_flow_element=self.source_flow_element,
            boundary_element_found=self.boundary_element_found,
        )

        # Copy metadata
        new_flow_region.source = self.source
        new_flow_region.region_type = self.region_type
        new_flow_region.metadata = self.metadata.copy()

        # Clear caches since the regions have changed
        new_flow_region._cached_text = None
        new_flow_region._cached_elements = None
        new_flow_region._cached_bbox = None

        return new_flow_region

    def above(
        self,
        height: Optional[float] = None,
        width: str = "full",
        include_source: bool = False,
        until: Optional[str] = None,
        include_endpoint: bool = True,
        **kwargs,
    ) -> "FlowRegion":
        """
        Create a FlowRegion with regions above this FlowRegion.

        For vertical flows: Only expands the topmost constituent region upward.
        For horizontal flows: Expands all constituent regions upward.

        Args:
            height: Height of the region above, in points
            width: Width mode - "full" for full page width or "element" for element width
            include_source: Whether to include this FlowRegion in the result
            until: Optional selector string to specify an upper boundary element
            include_endpoint: Whether to include the boundary element in the region
            **kwargs: Additional parameters

        Returns:
            New FlowRegion with regions above
        """
        if not self.constituent_regions:
            return FlowRegion(
                flow=self.flow,
                constituent_regions=[],
                source_flow_element=self.source_flow_element,
                boundary_element_found=self.boundary_element_found,
            )

        new_regions = []

        if self.flow.arrangement == "vertical":
            # For vertical flow, use FLOW ORDER (index 0 is earliest). Only expand the
            # first constituent region in that order.
            for idx, region in enumerate(self.constituent_regions):
                if idx == 0:  # Only expand the first region (earliest in flow)
                    above_region = region.above(
                        height=height,
                        width="element",  # Keep original column width
                        include_source=include_source,
                        until=until,
                        include_endpoint=include_endpoint,
                        **kwargs,
                    )
                    new_regions.append(above_region)
                elif include_source:
                    new_regions.append(region)
        else:  # horizontal flow
            # For horizontal flow, expand all regions upward
            for region in self.constituent_regions:
                above_region = region.above(
                    height=height,
                    width=width,
                    include_source=include_source,
                    until=until,
                    include_endpoint=include_endpoint,
                    **kwargs,
                )
                new_regions.append(above_region)

        return FlowRegion(
            flow=self.flow,
            constituent_regions=new_regions,
            source_flow_element=self.source_flow_element,
            boundary_element_found=self.boundary_element_found,
        )

    def below(
        self,
        height: Optional[float] = None,
        width: str = "full",
        include_source: bool = False,
        until: Optional[str] = None,
        include_endpoint: bool = True,
        **kwargs,
    ) -> "FlowRegion":
        """
        Create a FlowRegion with regions below this FlowRegion.

        For vertical flows: Only expands the bottommost constituent region downward.
        For horizontal flows: Expands all constituent regions downward.

        Args:
            height: Height of the region below, in points
            width: Width mode - "full" for full page width or "element" for element width
            include_source: Whether to include this FlowRegion in the result
            until: Optional selector string to specify a lower boundary element
            include_endpoint: Whether to include the boundary element in the region
            **kwargs: Additional parameters

        Returns:
            New FlowRegion with regions below
        """
        if not self.constituent_regions:
            return FlowRegion(
                flow=self.flow,
                constituent_regions=[],
                source_flow_element=self.source_flow_element,
                boundary_element_found=self.boundary_element_found,
            )

        new_regions = []

        if self.flow.arrangement == "vertical":
            # For vertical flow, expand only the LAST constituent region in flow order.
            last_idx = len(self.constituent_regions) - 1
            for idx, region in enumerate(self.constituent_regions):
                if idx == last_idx:
                    below_region = region.below(
                        height=height,
                        width="element",
                        include_source=include_source,
                        until=until,
                        include_endpoint=include_endpoint,
                        **kwargs,
                    )
                    new_regions.append(below_region)
                elif include_source:
                    new_regions.append(region)
        else:  # horizontal flow
            # For horizontal flow, expand all regions downward
            for region in self.constituent_regions:
                below_region = region.below(
                    height=height,
                    width=width,
                    include_source=include_source,
                    until=until,
                    include_endpoint=include_endpoint,
                    **kwargs,
                )
                new_regions.append(below_region)

        return FlowRegion(
            flow=self.flow,
            constituent_regions=new_regions,
            source_flow_element=self.source_flow_element,
            boundary_element_found=self.boundary_element_found,
        )

    def left(
        self,
        width: Optional[float] = None,
        height: str = "full",
        include_source: bool = False,
        until: Optional[str] = None,
        include_endpoint: bool = True,
        **kwargs,
    ) -> "FlowRegion":
        """
        Create a FlowRegion with regions to the left of this FlowRegion.

        For vertical flows: Expands all constituent regions leftward.
        For horizontal flows: Only expands the leftmost constituent region leftward.

        Args:
            width: Width of the region to the left, in points
            height: Height mode - "full" for full page height or "element" for element height
            include_source: Whether to include this FlowRegion in the result
            until: Optional selector string to specify a left boundary element
            include_endpoint: Whether to include the boundary element in the region
            **kwargs: Additional parameters

        Returns:
            New FlowRegion with regions to the left
        """
        if not self.constituent_regions:
            return FlowRegion(
                flow=self.flow,
                constituent_regions=[],
                source_flow_element=self.source_flow_element,
                boundary_element_found=self.boundary_element_found,
            )

        new_regions = []

        if self.flow.arrangement == "vertical":
            # For vertical flow, expand all regions leftward
            for region in self.constituent_regions:
                left_region = region.left(
                    width=width,
                    height="element",
                    include_source=include_source,
                    until=until,
                    include_endpoint=include_endpoint,
                    **kwargs,
                )
                new_regions.append(left_region)
        else:  # horizontal flow
            # For horizontal flow, only expand the leftmost region leftward
            leftmost_region = min(self.constituent_regions, key=lambda r: r.x0)
            for region in self.constituent_regions:
                if region == leftmost_region:
                    # Expand this region leftward
                    left_region = region.left(
                        width=width,
                        height="element",
                        include_source=include_source,
                        until=until,
                        include_endpoint=include_endpoint,
                        **kwargs,
                    )
                    new_regions.append(left_region)
                elif include_source:
                    # Include other regions unchanged if include_source is True
                    new_regions.append(region)

        return FlowRegion(
            flow=self.flow,
            constituent_regions=new_regions,
            source_flow_element=self.source_flow_element,
            boundary_element_found=self.boundary_element_found,
        )

    def right(
        self,
        width: Optional[float] = None,
        height: str = "full",
        include_source: bool = False,
        until: Optional[str] = None,
        include_endpoint: bool = True,
        **kwargs,
    ) -> "FlowRegion":
        """
        Create a FlowRegion with regions to the right of this FlowRegion.

        For vertical flows: Expands all constituent regions rightward.
        For horizontal flows: Only expands the rightmost constituent region rightward.

        Args:
            width: Width of the region to the right, in points
            height: Height mode - "full" for full page height or "element" for element height
            include_source: Whether to include this FlowRegion in the result
            until: Optional selector string to specify a right boundary element
            include_endpoint: Whether to include the boundary element in the region
            **kwargs: Additional parameters

        Returns:
            New FlowRegion with regions to the right
        """
        if not self.constituent_regions:
            return FlowRegion(
                flow=self.flow,
                constituent_regions=[],
                source_flow_element=self.source_flow_element,
                boundary_element_found=self.boundary_element_found,
            )

        new_regions = []

        if self.flow.arrangement == "vertical":
            # For vertical flow, expand all regions rightward
            for region in self.constituent_regions:
                right_region = region.right(
                    width=width,
                    height="element",
                    include_source=include_source,
                    until=until,
                    include_endpoint=include_endpoint,
                    **kwargs,
                )
                new_regions.append(right_region)
        else:  # horizontal flow
            # For horizontal flow, only expand the rightmost region rightward
            rightmost_region = max(self.constituent_regions, key=lambda r: r.x1)
            for region in self.constituent_regions:
                if region == rightmost_region:
                    # Expand this region rightward
                    right_region = region.right(
                        width=width,
                        height="element",
                        include_source=include_source,
                        until=until,
                        include_endpoint=include_endpoint,
                        **kwargs,
                    )
                    new_regions.append(right_region)
                elif include_source:
                    # Include other regions unchanged if include_source is True
                    new_regions.append(region)

        return FlowRegion(
            flow=self.flow,
            constituent_regions=new_regions,
            source_flow_element=self.source_flow_element,
            boundary_element_found=self.boundary_element_found,
        )

    def to_region(self) -> "FlowRegion":
        """
        Convert this FlowRegion to a region (returns a copy).
        This is equivalent to calling expand() with no arguments.

        Returns:
            Copy of this FlowRegion
        """
        return self.expand()

    @property
    def is_empty(self) -> bool:
        """Checks if the FlowRegion contains no constituent regions or if all are empty."""
        if not self.constituent_regions:
            return True
        # A FlowRegion may hold constituent regions yet still be logically empty,
        # so check for actual content: no extractable text and no elements.
        try:
            return not bool(self.extract_text(apply_exclusions=False).strip()) and not bool(
                self.elements(apply_exclusions=False)
            )
        except Exception:
            return True  # If error during check, assume empty to be safe

    # ------------------------------------------------------------------
    # Table extraction helpers (delegates to underlying physical regions)
    # ------------------------------------------------------------------

    def extract_table(
        self,
        method: Optional[str] = None,
        table_settings: Optional[dict] = None,
        use_ocr: bool = False,
        ocr_config: Optional[dict] = None,
        text_options: Optional[Dict] = None,
        cell_extraction_func: Optional[Callable[["PhysicalRegion"], Optional[str]]] = None,
        show_progress: bool = False,
        # Optional row-level merge predicate. If provided, it decides whether
        # the current row (first row of a segment/page) should be merged with
        # the previous one (to handle multi-page spill-overs).
        stitch_rows: Optional[
            Callable[[List[Optional[str]], List[Optional[str]], int, "PhysicalRegion"], bool]
        ] = None,
        merge_headers: Optional[bool] = None,
        **kwargs,
    ) -> TableResult:
        """Extracts a single logical table from the FlowRegion.

        This is a convenience wrapper that iterates through the constituent
        physical regions **in flow order**, calls their ``extract_table``
        method, and concatenates the resulting rows.  It mirrors the public
        interface of :pymeth:`natural_pdf.elements.region.Region.extract_table`.

        Args:
            method, table_settings, use_ocr, ocr_config, text_options, cell_extraction_func, show_progress:
                Same as in :pymeth:`Region.extract_table` and are forwarded as-is
                to each physical region.
            merge_headers: Whether to merge tables by removing repeated headers from subsequent
                pages/segments. If None (default), auto-detects by checking if the first row
                of each segment matches the first row of the first segment. If segments have
                inconsistent header patterns (some repeat, others don't), raises ValueError.
                Useful for multi-page tables where headers repeat on each page.
            **kwargs: Additional keyword arguments forwarded to the underlying
                ``Region.extract_table`` implementation.

        Returns:
            A TableResult object containing the aggregated table data.  Rows returned from
            consecutive constituent regions are appended in document order.  If
            no tables are detected in any region, an empty TableResult is returned.

        stitch_rows parameter:
            Controls whether the first rows of subsequent segments/regions should be merged
            into the previous row (to handle spill-over across page breaks).
            Applied AFTER header removal if merge_headers is enabled.

            • None (default) – no merging (behaviour identical to previous versions).
            • Callable – custom predicate taking
                   (prev_row, cur_row, row_idx_in_segment, segment_object) → bool.
               Return True to merge `cur_row` into `prev_row` (default column-wise merge is used).
        """

        if table_settings is None:
            table_settings = {}
        if text_options is None:
            text_options = {}

        if not self.constituent_regions:
            return TableResult([])

        # Resolve stitch_rows predicate -------------------------------------------------------
        predicate: Optional[
            Callable[[List[Optional[str]], List[Optional[str]], int, "PhysicalRegion"], bool]
        ] = (stitch_rows if callable(stitch_rows) else None)

        def _default_merge(
            prev_row: List[Optional[str]], cur_row: List[Optional[str]]
        ) -> List[Optional[str]]:
            """Column-wise merge – concatenates non-empty strings with a space."""
            from itertools import zip_longest

            merged: List[Optional[str]] = []
            for p, c in zip_longest(prev_row, cur_row, fillvalue=""):
                if (p or "").strip() and (c or "").strip():
                    merged.append(f"{p} {c}".strip())
                else:
                    merged.append((p or "") + (c or ""))
            return merged

        aggregated_rows: List[List[Optional[str]]] = []
        header_row: Optional[List[Optional[str]]] = None
        merge_headers_enabled = False
        headers_warned = False  # Track if we've already warned about dropping headers
        segment_has_repeated_header = []  # Track which segments have repeated headers

        for region_idx, region in enumerate(self.constituent_regions):
            try:
                region_result = region.extract_table(
                    method=method,
                    table_settings=table_settings.copy(),  # Avoid side-effects
                    use_ocr=use_ocr,
                    ocr_config=ocr_config,
                    text_options=text_options.copy(),
                    cell_extraction_func=cell_extraction_func,
                    show_progress=show_progress,
                    **kwargs,
                )

                # Convert result to list of rows
                if not region_result:
                    continue

                # TableResult is iterable over its rows, so list() handles both
                # TableResult and plain list results uniformly.
                segment_rows = list(region_result)

                # Handle header detection and merging for multi-page tables
                if region_idx == 0:
                    # First segment: capture potential header row
                    if segment_rows:
                        header_row = segment_rows[0]
                        # Determine if we should merge headers
                        if merge_headers is None:
                            # Auto-detect: we'll check all subsequent segments
                            merge_headers_enabled = False  # Will be determined later
                        else:
                            merge_headers_enabled = merge_headers
                        # Track that first segment exists (for consistency checking)
                        segment_has_repeated_header.append(False)  # First segment doesn't "repeat"
                elif region_idx == 1 and merge_headers is None:
                    # Auto-detection: check if first row of second segment matches header
                    has_header = bool(
                        segment_rows and header_row and segment_rows[0] == header_row
                    )
                    segment_has_repeated_header.append(has_header)

                    if has_header:
                        merge_headers_enabled = True
                        # Remove the detected repeated header from this segment
                        segment_rows = segment_rows[1:]
                        if not headers_warned:
                            warnings.warn(
                                "Detected repeated headers in multi-page table. Merging by removing "
                                "repeated headers from subsequent pages.",
                                UserWarning,
                                stacklevel=2,
                            )
                            headers_warned = True
                    else:
                        merge_headers_enabled = False
                elif region_idx > 1:
                    # Check consistency: all segments should have same pattern
                    has_header = bool(
                        segment_rows and header_row and segment_rows[0] == header_row
                    )
                    segment_has_repeated_header.append(has_header)

                    # Remove header if merging is enabled and header is present
                    if merge_headers_enabled and has_header:
                        segment_rows = segment_rows[1:]
                elif region_idx > 0 and merge_headers_enabled:
                    # Reached only when region_idx == 1 and merge_headers=True was
                    # passed explicitly (auto-detection at idx 1 and all idx > 1
                    # segments are handled above): drop the repeated header here too.
                    if segment_rows and header_row and segment_rows[0] == header_row:
                        segment_rows = segment_rows[1:]
                        if not headers_warned:
                            warnings.warn(
                                "Removing repeated headers from multi-page table during merge.",
                                UserWarning,
                                stacklevel=2,
                            )
                            headers_warned = True

                # Process remaining rows with stitch_rows logic
                for row_idx, row in enumerate(segment_rows):
                    if (
                        predicate is not None
                        and aggregated_rows
                        and predicate(aggregated_rows[-1], row, row_idx, region)
                    ):
                        # Merge with previous row
                        aggregated_rows[-1] = _default_merge(aggregated_rows[-1], row)
                    else:
                        aggregated_rows.append(row)
            except Exception as e:
                logger.error(
                    f"FlowRegion.extract_table: Error extracting table from constituent region {region}: {e}",
                    exc_info=True,
                )

        # Check for inconsistent header patterns after processing all segments
        if merge_headers is None and len(segment_has_repeated_header) > 2:
            # During auto-detection, check for consistency across all segments
            expected_pattern = segment_has_repeated_header[1]  # Pattern from second segment
            for seg_idx, has_header in enumerate(segment_has_repeated_header[2:], 2):
                if has_header != expected_pattern:
                    # Inconsistent pattern detected
                    segments_with_headers = [
                        i for i, has_h in enumerate(segment_has_repeated_header[1:], 1) if has_h
                    ]
                    segments_without_headers = [
                        i for i, has_h in enumerate(segment_has_repeated_header[1:], 1) if not has_h
                    ]
                    raise ValueError(
                        f"Inconsistent header pattern in multi-page table: "
                        f"segments {segments_with_headers} have repeated headers, "
                        f"but segments {segments_without_headers} do not. "
                        f"All segments must have the same header pattern for reliable merging."
                    )

        return TableResult(aggregated_rows)

    def extract_tables(
        self,
        method: Optional[str] = None,
        table_settings: Optional[dict] = None,
        **kwargs,
    ) -> List[List[List[Optional[str]]]]:
        """Extract **all** tables from the FlowRegion.

        This simply chains :pymeth:`Region.extract_tables` over each physical
        region and concatenates their results, preserving flow order.

        Args:
            method, table_settings: Forwarded to underlying ``Region.extract_tables``.
            **kwargs: Additional keyword arguments forwarded.

        Returns:
            A list where each item is a full table (list of rows).  The order of
            tables follows the order of the constituent regions in the flow.
        """

        if table_settings is None:
            table_settings = {}

        if not self.constituent_regions:
            return []

        all_tables: List[List[List[Optional[str]]]] = []

        for region in self.constituent_regions:
            try:
                region_tables = region.extract_tables(
                    method=method,
                    table_settings=table_settings.copy(),
                    **kwargs,
                )
                # ``region_tables`` is a list (possibly empty).
                if region_tables:
                    all_tables.extend(region_tables)
            except Exception as e:
                logger.error(
                    f"FlowRegion.extract_tables: Error extracting tables from constituent region {region}: {e}",
                    exc_info=True,
                )

        return all_tables

    def get_sections(
        self,
        start_elements=None,
        end_elements=None,
        new_section_on_page_break: bool = False,
        include_boundaries: str = "both",
        orientation: str = "vertical",
    ) -> "ElementCollection":
        """
        Extract logical sections from this FlowRegion based on start/end boundary elements.

        This delegates to the parent Flow's get_sections() method, but only operates
        on the segments that are part of this FlowRegion.

        Args:
            start_elements: Elements or selector string that mark the start of sections
            end_elements: Elements or selector string that mark the end of sections
            new_section_on_page_break: Whether to start a new section at page boundaries
            include_boundaries: How to include boundary elements: 'start', 'end', 'both', or 'none'
            orientation: 'vertical' (default) or 'horizontal' - determines section direction

        Returns:
            ElementCollection of FlowRegion objects representing the extracted sections

        Example:
            # Split a multi-page table region by headers
            table_region = flow.find("text:contains('Table 4')").below(until="text:contains('Table 5')")
            sections = table_region.get_sections(start_elements="text:bold")
        """
        # Create a temporary Flow with just our constituent regions as segments
        from natural_pdf.flows.flow import Flow

        temp_flow = Flow(
            segments=self.constituent_regions,
            arrangement=self.flow.arrangement,
            alignment=self.flow.alignment,
            segment_gap=self.flow.segment_gap,
        )

        # Delegate to Flow's get_sections implementation
        return temp_flow.get_sections(
            start_elements=start_elements,
            end_elements=end_elements,
            new_section_on_page_break=new_section_on_page_break,
            include_boundaries=include_boundaries,
            orientation=orientation,
        )

    def split(
        self, by: Optional[str] = None, page_breaks: bool = True, **kwargs
    ) -> "ElementCollection":
        """
        Split this FlowRegion into sections.

        This is a convenience method that wraps get_sections() with common splitting patterns.

        Args:
            by: Selector string for elements that mark section boundaries (e.g., "text:bold")
            page_breaks: Whether to also split at page boundaries (default: True)
            **kwargs: Additional arguments passed to get_sections()

        Returns:
            ElementCollection of FlowRegion objects representing the sections

        Example:
            # Split by bold headers
            sections = flow_region.split(by="text:bold")

            # Split only by specific text pattern, ignoring page breaks
            sections = flow_region.split(
                by="text:contains('Section')",
                page_breaks=False
            )
        """
        return self.get_sections(start_elements=by, new_section_on_page_break=page_breaks, **kwargs)

    @property
    def normalized_type(self) -> Optional[str]:
        """
        Return the normalized type for selector compatibility.
        This allows FlowRegion to be found by selectors like 'table'.
        """
        if self.region_type:
            # Convert region_type to normalized format (replace spaces with underscores, lowercase)
            return self.region_type.lower().replace(" ", "_")
        return None

    @property
    def type(self) -> Optional[str]:
        """
        Return the type attribute for selector compatibility.
        This is an alias for normalized_type.
        """
        return self.normalized_type

    def get_highlight_specs(self) -> List[Dict[str, Any]]:
        """
        Get highlight specifications for all constituent regions.

        This implements the highlighting protocol for FlowRegions, returning
        specs for each constituent region so they can be highlighted on their
        respective pages.

        Returns:
            List of highlight specification dictionaries, one for each
            constituent region.
        """
        specs = []

        for region in self.constituent_regions:
            if not hasattr(region, "page") or region.page is None:
                continue

            if not hasattr(region, "bbox") or region.bbox is None:
                continue

            spec = {
                "page": region.page,
                "page_index": region.page.index if hasattr(region.page, "index") else 0,
                "bbox": region.bbox,
                "element": region,  # Reference to the constituent region
            }

            # Add polygon if available
            if hasattr(region, "polygon") and hasattr(region, "has_polygon") and region.has_polygon:
                spec["polygon"] = region.polygon

            specs.append(spec)

        return specs
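
Putting the pieces together, a short usage sketch (the file name and selectors are illustrative, and Flow's default alignment is assumed):

import natural_pdf as npdf
from natural_pdf import Flow

pdf = npdf.PDF("multi_page_report.pdf")
flow = Flow(segments=[pdf.pages[0], pdf.pages[1]], arrangement="vertical")

# Directional navigation produces a FlowRegion spanning both pages
table_region = flow.find("text:contains('Table 4')").below(until="text:contains('Table 5')")

bold_cells = table_region.find_all("text:bold")        # chained across constituent regions
table_region.highlight(label="Table 4")                # one highlight per physical region
page_images = table_region.to_images(resolution=150)   # one cropped image per segment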
Attributes
natural_pdf.FlowRegion.bbox property

The bounding box that encloses all constituent regions. Calculated dynamically and cached.

natural_pdf.FlowRegion.is_empty property

Checks if the FlowRegion contains no constituent regions or if all are empty.

natural_pdf.FlowRegion.normalized_type property

Return the normalized type for selector compatibility. This allows FlowRegion to be found by selectors like 'table'.

natural_pdf.FlowRegion.type property

Return the type attribute for selector compatibility. This is an alias for normalized_type.
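
As a quick sketch of these properties (flow_region is assumed to be an existing FlowRegion):

flow_region.bbox             # bounding box enclosing all constituent regions (cached)
flow_region.is_empty         # True when there is no extractable text and no elements
flow_region.normalized_type  # e.g. "table" when region_type is "Table"
flow_region.type             # alias for normalized_type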

Functions
natural_pdf.FlowRegion.__getattr__(name)

Dynamically proxy attribute access to the source FlowElement for safe attributes only. Spatial methods (above, below, left, right) are explicitly implemented to prevent silent failures and incorrect behavior.

Source code in natural_pdf/flows/region.py
def __getattr__(self, name: str) -> Any:
    """
    Dynamically proxy attribute access to the source FlowElement for safe attributes only.
    Spatial methods (above, below, left, right) are explicitly implemented to prevent
    silent failures and incorrect behavior.
    """
    if name in self.__dict__:
        return self.__dict__[name]

    # List of methods that should NOT be proxied - they need proper FlowRegion implementation
    spatial_methods = {"above", "below", "left", "right", "to_region"}

    if name in spatial_methods:
        raise AttributeError(
            f"'{self.__class__.__name__}' object has no attribute '{name}'. "
            f"This method requires proper FlowRegion implementation to handle spatial relationships correctly."
        )

    # Only proxy safe attributes and methods
    if self.source_flow_element is not None:
        try:
            attr = getattr(self.source_flow_element, name)
            # Only proxy non-callable attributes and explicitly safe methods
            if not callable(attr) or name in {"page", "document"}:  # Add safe methods as needed
                return attr
            else:
                raise AttributeError(
                    f"Method '{name}' cannot be safely proxied from FlowElement to FlowRegion. "
                    f"It may need explicit implementation."
                )
        except AttributeError:
            pass

    raise AttributeError(f"'{self.__class__.__name__}' object has no attribute '{name}'")
natural_pdf.FlowRegion.__init__(flow, constituent_regions, source_flow_element, boundary_element_found=None)

Initializes a FlowRegion.

Parameters:

Name Type Description Default
flow Flow The Flow instance this region belongs to. required
constituent_regions List[Region] A list of physical natural_pdf.elements.region.Region objects that make up this FlowRegion. required
source_flow_element FlowElement The FlowElement that created this FlowRegion. required
boundary_element_found Optional[Element] The physical element that stopped an 'until' search, if applicable. None
Source code in natural_pdf/flows/region.py
def __init__(
    self,
    flow: "Flow",
    constituent_regions: List["PhysicalRegion"],
    source_flow_element: "FlowElement",
    boundary_element_found: Optional["PhysicalElement"] = None,
):
    """
    Initializes a FlowRegion.

    Args:
        flow: The Flow instance this region belongs to.
        constituent_regions: A list of physical natural_pdf.elements.region.Region
                             objects that make up this FlowRegion.
        source_flow_element: The FlowElement that created this FlowRegion.
        boundary_element_found: The physical element that stopped an 'until' search,
                                if applicable.
    """
    self.flow: "Flow" = flow
    self.constituent_regions: List["PhysicalRegion"] = constituent_regions
    self.source_flow_element: "FlowElement" = source_flow_element
    self.boundary_element_found: Optional["PhysicalElement"] = boundary_element_found

    # Add attributes for grid building, similar to Region
    self.source: Optional[str] = None
    self.region_type: Optional[str] = None
    self.metadata: Dict[str, Any] = {}

    # Cache for expensive operations
    self._cached_text: Optional[str] = None
    self._cached_elements: Optional["ElementCollection"] = None  # Stringized
    self._cached_bbox: Optional[Tuple[float, float, float, float]] = None
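
FlowRegions are normally created for you by flow queries (via a FlowElement) rather than constructed by hand. A hypothetical direct construction, with every argument assumed to already exist:

flow_region = FlowRegion(
    flow=flow,
    constituent_regions=[region_on_page1, region_on_page2],
    source_flow_element=source_element,
    boundary_element_found=None,
)
flow_region.region_type = "table"  # optional metadata, as on Region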
natural_pdf.FlowRegion.above(height=None, width='full', include_source=False, until=None, include_endpoint=True, **kwargs)

Create a FlowRegion with regions above this FlowRegion.

For vertical flows: Only expands the topmost constituent region upward. For horizontal flows: Expands all constituent regions upward.

Parameters:

Name Type Description Default
height Optional[float] Height of the region above, in points None
width str Width mode - "full" for full page width or "element" for element width 'full'
include_source bool Whether to include this FlowRegion in the result False
until Optional[str] Optional selector string to specify an upper boundary element None
include_endpoint bool Whether to include the boundary element in the region True
**kwargs Additional parameters {}

Returns:

Type Description
FlowRegion New FlowRegion with regions above

Source code in natural_pdf/flows/region.py
def above(
    self,
    height: Optional[float] = None,
    width: str = "full",
    include_source: bool = False,
    until: Optional[str] = None,
    include_endpoint: bool = True,
    **kwargs,
) -> "FlowRegion":
    """
    Create a FlowRegion with regions above this FlowRegion.

    For vertical flows: Only expands the topmost constituent region upward.
    For horizontal flows: Expands all constituent regions upward.

    Args:
        height: Height of the region above, in points
        width: Width mode - "full" for full page width or "element" for element width
        include_source: Whether to include this FlowRegion in the result
        until: Optional selector string to specify an upper boundary element
        include_endpoint: Whether to include the boundary element in the region
        **kwargs: Additional parameters

    Returns:
        New FlowRegion with regions above
    """
    if not self.constituent_regions:
        return FlowRegion(
            flow=self.flow,
            constituent_regions=[],
            source_flow_element=self.source_flow_element,
            boundary_element_found=self.boundary_element_found,
        )

    new_regions = []

    if self.flow.arrangement == "vertical":
        # For vertical flow, use FLOW ORDER (index 0 is earliest). Only expand the
        # first constituent region in that order.
        for idx, region in enumerate(self.constituent_regions):
            if idx == 0:  # Only expand the first region (earliest in flow)
                above_region = region.above(
                    height=height,
                    width="element",  # Keep original column width
                    include_source=include_source,
                    until=until,
                    include_endpoint=include_endpoint,
                    **kwargs,
                )
                new_regions.append(above_region)
            elif include_source:
                new_regions.append(region)
    else:  # horizontal flow
        # For horizontal flow, expand all regions upward
        for region in self.constituent_regions:
            above_region = region.above(
                height=height,
                width=width,
                include_source=include_source,
                until=until,
                include_endpoint=include_endpoint,
                **kwargs,
            )
            new_regions.append(above_region)

    return FlowRegion(
        flow=self.flow,
        constituent_regions=new_regions,
        source_flow_element=self.source_flow_element,
        boundary_element_found=self.boundary_element_found,
    )
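
For example (sizes and selectors illustrative), in a vertical flow only the earliest segment grows upward, which keeps the virtual area contiguous:

header_strip = flow_region.above(height=40)  # 40 pt strip above the first segment
to_heading = flow_region.above(until="text:bold", include_endpoint=False)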
natural_pdf.FlowRegion.below(height=None, width='full', include_source=False, until=None, include_endpoint=True, **kwargs)

Create a FlowRegion with regions below this FlowRegion.

For vertical flows: Only expands the bottommost constituent region downward. For horizontal flows: Expands all constituent regions downward.

Parameters:

Name Type Description Default
height Optional[float] Height of the region below, in points None
width str Width mode - "full" for full page width or "element" for element width 'full'
include_source bool Whether to include this FlowRegion in the result False
until Optional[str] Optional selector string to specify a lower boundary element None
include_endpoint bool Whether to include the boundary element in the region True
**kwargs Additional parameters {}

Returns:

Type Description
FlowRegion New FlowRegion with regions below

Source code in natural_pdf/flows/region.py
def below(
    self,
    height: Optional[float] = None,
    width: str = "full",
    include_source: bool = False,
    until: Optional[str] = None,
    include_endpoint: bool = True,
    **kwargs,
) -> "FlowRegion":
    """
    Create a FlowRegion with regions below this FlowRegion.

    For vertical flows: Only expands the bottommost constituent region downward.
    For horizontal flows: Expands all constituent regions downward.

    Args:
        height: Height of the region below, in points
        width: Width mode - "full" for full page width or "element" for element width
        include_source: Whether to include this FlowRegion in the result
        until: Optional selector string to specify a lower boundary element
        include_endpoint: Whether to include the boundary element in the region
        **kwargs: Additional parameters

    Returns:
        New FlowRegion with regions below
    """
    if not self.constituent_regions:
        return FlowRegion(
            flow=self.flow,
            constituent_regions=[],
            source_flow_element=self.source_flow_element,
            boundary_element_found=self.boundary_element_found,
        )

    new_regions = []

    if self.flow.arrangement == "vertical":
        # For vertical flow, expand only the LAST constituent region in flow order.
        last_idx = len(self.constituent_regions) - 1
        for idx, region in enumerate(self.constituent_regions):
            if idx == last_idx:
                below_region = region.below(
                    height=height,
                    width="element",
                    include_source=include_source,
                    until=until,
                    include_endpoint=include_endpoint,
                    **kwargs,
                )
                new_regions.append(below_region)
            elif include_source:
                new_regions.append(region)
    else:  # horizontal flow
        # For horizontal flow, expand all regions downward
        for region in self.constituent_regions:
            below_region = region.below(
                height=height,
                width=width,
                include_source=include_source,
                until=until,
                include_endpoint=include_endpoint,
                **kwargs,
            )
            new_regions.append(below_region)

    return FlowRegion(
        flow=self.flow,
        constituent_regions=new_regions,
        source_flow_element=self.source_flow_element,
        boundary_element_found=self.boundary_element_found,
    )
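
The mirror image of above(): in a vertical flow only the last segment grows downward (values illustrative):

footer_strip = flow_region.below(height=30)  # 30 pt strip below the last segment
to_next_table = flow_region.below(until="text:contains('Table 5')")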
natural_pdf.FlowRegion.elements(apply_exclusions=True)

Collects all unique physical elements from all constituent physical regions.

Parameters:

Name Type Description Default
apply_exclusions bool Whether to respect PDF exclusion zones within each constituent physical region when gathering elements. True

Returns:

Type Description
ElementCollection An ElementCollection containing all unique elements.

Source code in natural_pdf/flows/region.py
def elements(self, apply_exclusions: bool = True) -> "ElementCollection":  # Stringized return
    """
    Collects all unique physical elements from all constituent physical regions.

    Args:
        apply_exclusions: Whether to respect PDF exclusion zones within each
                          constituent physical region when gathering elements.

    Returns:
        An ElementCollection containing all unique elements.
    """
    from natural_pdf.elements.element_collection import (
        ElementCollection as RuntimeElementCollection,  # Local import
    )

    if self._cached_elements is not None and apply_exclusions:  # Simple cache check
        return self._cached_elements

    if not self.constituent_regions:
        return RuntimeElementCollection([])

    all_physical_elements: List["PhysicalElement"] = []  # Stringized item type
    seen_elements = (
        set()
    )  # To ensure uniqueness if elements are shared or duplicated by region definitions

    for region in self.constituent_regions:
        # Region.get_elements() returns a list, not ElementCollection
        elements_in_region: List["PhysicalElement"] = region.get_elements(
            apply_exclusions=apply_exclusions
        )
        for elem in elements_in_region:
            if elem not in seen_elements:  # Check for uniqueness based on object identity
                all_physical_elements.append(elem)
                seen_elements.add(elem)

    # Basic reading order sort based on original page and coordinates.
    def get_sort_key(phys_elem: "PhysicalElement"):  # Stringized param type
        page_idx = -1
        if hasattr(phys_elem, "page") and hasattr(phys_elem.page, "index"):
            page_idx = phys_elem.page.index
        return (page_idx, phys_elem.top, phys_elem.x0)

    try:
        sorted_physical_elements = sorted(all_physical_elements, key=get_sort_key)
    except AttributeError:
        logger.warning(
            "Could not sort elements in FlowRegion by reading order; some elements might be missing page, top or x0 attributes."
        )
        sorted_physical_elements = all_physical_elements

    result_collection = RuntimeElementCollection(sorted_physical_elements)
    if apply_exclusions:
        self._cached_elements = result_collection
    return result_collection
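
A small sketch of gathering elements in reading order (flow_region assumed):

elems = flow_region.elements()                      # unique elements sorted by (page index, top, x0); cached
raw = flow_region.elements(apply_exclusions=False)  # ignore exclusion zones; bypasses the cache
for elem in elems:
    print(elem.page.index, elem.top, elem.x0)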
natural_pdf.FlowRegion.expand(left=0, right=0, top=0, bottom=0, width_factor=1.0, height_factor=1.0)

Create a new FlowRegion with all constituent regions expanded.

Parameters:

Name Type Description Default
left float Amount to expand left edge (positive value expands leftwards) 0
right float Amount to expand right edge (positive value expands rightwards) 0
top float Amount to expand top edge (positive value expands upwards) 0
bottom float Amount to expand bottom edge (positive value expands downwards) 0
width_factor float Factor to multiply width by (applied after absolute expansion) 1.0
height_factor float Factor to multiply height by (applied after absolute expansion) 1.0

Returns:

Type Description
FlowRegion New FlowRegion with expanded constituent regions

Source code in natural_pdf/flows/region.py
def expand(
    self,
    left: float = 0,
    right: float = 0,
    top: float = 0,
    bottom: float = 0,
    width_factor: float = 1.0,
    height_factor: float = 1.0,
) -> "FlowRegion":
    """
    Create a new FlowRegion with all constituent regions expanded.

    Args:
        left: Amount to expand left edge (positive value expands leftwards)
        right: Amount to expand right edge (positive value expands rightwards)
        top: Amount to expand top edge (positive value expands upwards)
        bottom: Amount to expand bottom edge (positive value expands downwards)
        width_factor: Factor to multiply width by (applied after absolute expansion)
        height_factor: Factor to multiply height by (applied after absolute expansion)

    Returns:
        New FlowRegion with expanded constituent regions
    """
    if not self.constituent_regions:
        return FlowRegion(
            flow=self.flow,
            constituent_regions=[],
            source_flow_element=self.source_flow_element,
            boundary_element_found=self.boundary_element_found,
        )

    expanded_regions = []
    for idx, region in enumerate(self.constituent_regions):
        # Determine which adjustments to apply based on flow arrangement
        apply_left = left
        apply_right = right
        apply_top = top
        apply_bottom = bottom

        if self.flow.arrangement == "vertical":
            # In a vertical flow, only the *first* region should react to `top`
            # and only the *last* region should react to `bottom`.  This keeps
            # the virtual contiguous area intact while allowing users to nudge
            # the flow boundaries.
            if idx != 0:
                apply_top = 0
            if idx != len(self.constituent_regions) - 1:
                apply_bottom = 0
            # left/right apply to every region (same column width change)
        else:  # horizontal flow
            # In a horizontal flow, only the first region reacts to `left`
            # and only the last region reacts to `right`.
            if idx != 0:
                apply_left = 0
            if idx != len(self.constituent_regions) - 1:
                apply_right = 0
            # top/bottom apply to every region in horizontal flows

        # Skip no-op expansion to avoid creating redundant Region objects.
        # Any non-zero absolute offset, or a factor other than 1.0, counts.
        needs_expansion = (
            any(v != 0 for v in (apply_left, apply_right, apply_top, apply_bottom))
            or width_factor != 1.0
            or height_factor != 1.0
        )

        try:
            expanded_region = (
                region.expand(
                    left=apply_left,
                    right=apply_right,
                    top=apply_top,
                    bottom=apply_bottom,
                    width_factor=width_factor,
                    height_factor=height_factor,
                )
                if needs_expansion
                else region
            )
            expanded_regions.append(expanded_region)
        except Exception as e:
            logger.warning(
                f"FlowRegion.expand: Error expanding constituent region {region.bbox}: {e}",
                exc_info=False,
            )
            expanded_regions.append(region)

    # Create new FlowRegion with expanded constituent regions
    new_flow_region = FlowRegion(
        flow=self.flow,
        constituent_regions=expanded_regions,
        source_flow_element=self.source_flow_element,
        boundary_element_found=self.boundary_element_found,
    )

    # Copy metadata
    new_flow_region.source = self.source
    new_flow_region.region_type = self.region_type
    new_flow_region.metadata = self.metadata.copy()

    # Clear caches since the regions have changed
    new_flow_region._cached_text = None
    new_flow_region._cached_elements = None
    new_flow_region._cached_bbox = None

    return new_flow_region
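
Because the offsets are edge-aware, a call like the following widens every segment while only stretching the outermost edges of the flow (values illustrative):

widened = flow_region.expand(left=5, right=5)      # widens every segment's column
stretched = flow_region.expand(top=10, bottom=10)  # first segment grows up, last grows down
copied = flow_region.expand()                      # no-op: a copy sharing the same segments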
natural_pdf.FlowRegion.extract_table(method=None, table_settings=None, use_ocr=False, ocr_config=None, text_options=None, cell_extraction_func=None, show_progress=False, stitch_rows=None, merge_headers=None, **kwargs)

Extracts a single logical table from the FlowRegion.

This is a convenience wrapper that iterates through the constituent physical regions in flow order, calls their extract_table method, and concatenates the resulting rows. It mirrors the public interface of :pymeth:natural_pdf.elements.region.Region.extract_table.

Parameters:

Name Type Description Default
method, table_settings, use_ocr, ocr_config, text_options, cell_extraction_func, show_progress Same as in :pymeth:Region.extract_table and are forwarded as-is to each physical region. required
merge_headers Optional[bool] Whether to merge tables by removing repeated headers from subsequent pages/segments. If None (default), auto-detects by checking if the first row of each segment matches the first row of the first segment. If segments have inconsistent header patterns (some repeat, others don't), raises ValueError. Useful for multi-page tables where headers repeat on each page. None
**kwargs Additional keyword arguments forwarded to the underlying Region.extract_table implementation. {}

Returns:

Type Description
TableResult A TableResult object containing the aggregated table data. Rows returned from consecutive constituent regions are appended in document order. If no tables are detected in any region, an empty TableResult is returned.

stitch_rows parameter

Controls whether the first rows of subsequent segments/regions should be merged into the previous row (to handle spill-over across page breaks). Applied AFTER header removal if merge_headers is enabled.

• None (default) – no merging (behaviour identical to previous versions). • Callable – custom predicate taking (prev_row, cur_row, row_idx_in_segment, segment_object) → bool. Return True to merge cur_row into prev_row (default column-wise merge is used).
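
A usage sketch, assuming flow_region is a FlowRegion spanning a multi-page table; the predicate name and the empty-first-cell heuristic are illustrative, not part of the library:

    # Drop repeated per-page headers, then stitch rows that spill across
    # a page break (detected here by an empty first cell).
    def continues_previous(prev_row, cur_row, row_idx, segment):
        return row_idx == 0 and not (cur_row and (cur_row[0] or "").strip())

    result = flow_region.extract_table(
        merge_headers=True,
        stitch_rows=continues_previous,
    )
    rows = list(result)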

Source code in natural_pdf/flows/region.py
def extract_table(
    self,
    method: Optional[str] = None,
    table_settings: Optional[dict] = None,
    use_ocr: bool = False,
    ocr_config: Optional[dict] = None,
    text_options: Optional[Dict] = None,
    cell_extraction_func: Optional[Callable[["PhysicalRegion"], Optional[str]]] = None,
    show_progress: bool = False,
    # Optional row-level merge predicate. If provided, it decides whether
    # the current row (first row of a segment/page) should be merged with
    # the previous one (to handle multi-page spill-overs).
    stitch_rows: Optional[
        Callable[[List[Optional[str]], List[Optional[str]], int, "PhysicalRegion"], bool]
    ] = None,
    merge_headers: Optional[bool] = None,
    **kwargs,
) -> TableResult:
    """Extracts a single logical table from the FlowRegion.

    This is a convenience wrapper that iterates through the constituent
    physical regions **in flow order**, calls their ``extract_table``
    method, and concatenates the resulting rows.  It mirrors the public
    interface of :pymeth:`natural_pdf.elements.region.Region.extract_table`.

    Args:
        method, table_settings, use_ocr, ocr_config, text_options, cell_extraction_func, show_progress:
            Same as in :pymeth:`Region.extract_table` and are forwarded as-is
            to each physical region.
        merge_headers: Whether to merge tables by removing repeated headers from subsequent
            pages/segments. If None (default), auto-detects by checking if the first row
            of each segment matches the first row of the first segment. If segments have
            inconsistent header patterns (some repeat, others don't), raises ValueError.
            Useful for multi-page tables where headers repeat on each page.
        **kwargs: Additional keyword arguments forwarded to the underlying
            ``Region.extract_table`` implementation.

    Returns:
        A TableResult object containing the aggregated table data.  Rows returned from
        consecutive constituent regions are appended in document order.  If
        no tables are detected in any region, an empty TableResult is returned.

    stitch_rows parameter:
        Controls whether the first rows of subsequent segments/regions should be merged
        into the previous row (to handle spill-over across page breaks).
        Applied AFTER header removal if merge_headers is enabled.

        • None (default) – no merging (behaviour identical to previous versions).
        • Callable – custom predicate taking
               (prev_row, cur_row, row_idx_in_segment, segment_object) → bool.
           Return True to merge `cur_row` into `prev_row` (default column-wise merge is used).
    """

    if table_settings is None:
        table_settings = {}
    if text_options is None:
        text_options = {}

    if not self.constituent_regions:
        return TableResult([])

    # Resolve stitch_rows predicate -------------------------------------------------------
    predicate: Optional[
        Callable[[List[Optional[str]], List[Optional[str]], int, "PhysicalRegion"], bool]
    ] = (stitch_rows if callable(stitch_rows) else None)

    def _default_merge(
        prev_row: List[Optional[str]], cur_row: List[Optional[str]]
    ) -> List[Optional[str]]:
        """Column-wise merge – concatenates non-empty strings with a space."""
        from itertools import zip_longest

        merged: List[Optional[str]] = []
        for p, c in zip_longest(prev_row, cur_row, fillvalue=""):
            if (p or "").strip() and (c or "").strip():
                merged.append(f"{p} {c}".strip())
            else:
                merged.append((p or "") + (c or ""))
        return merged

    aggregated_rows: List[List[Optional[str]]] = []
    header_row: Optional[List[Optional[str]]] = None
    merge_headers_enabled = False
    headers_warned = False  # Track if we've already warned about dropping headers
    segment_has_repeated_header = []  # Track which segments have repeated headers

    for region_idx, region in enumerate(self.constituent_regions):
        try:
            region_result = region.extract_table(
                method=method,
                table_settings=table_settings.copy(),  # Avoid side-effects
                use_ocr=use_ocr,
                ocr_config=ocr_config,
                text_options=text_options.copy(),
                cell_extraction_func=cell_extraction_func,
                show_progress=show_progress,
                **kwargs,
            )

            # Convert result to list of rows
            if not region_result:
                continue

            # TableResult is itself iterable, so list() normalises both
            # TableResult and plain-list results into a list of rows.
            segment_rows = list(region_result)

            # Handle header detection and merging for multi-page tables
            if region_idx == 0:
                # First segment: capture potential header row
                if segment_rows:
                    header_row = segment_rows[0]
                    # Determine if we should merge headers
                    if merge_headers is None:
                        # Auto-detect: we'll check all subsequent segments
                        merge_headers_enabled = False  # Will be determined later
                    else:
                        merge_headers_enabled = merge_headers
                    # Track that first segment exists (for consistency checking)
                    segment_has_repeated_header.append(False)  # First segment doesn't "repeat"
            elif region_idx == 1 and merge_headers is None:
                # Auto-detection: check if first row of second segment matches header
                has_header = segment_rows and header_row and segment_rows[0] == header_row
                segment_has_repeated_header.append(has_header)

                if has_header:
                    merge_headers_enabled = True
                    # Remove the detected repeated header from this segment
                    segment_rows = segment_rows[1:]
                    if not headers_warned:
                        warnings.warn(
                            "Detected repeated headers in multi-page table. Merging by removing "
                            "repeated headers from subsequent pages.",
                            UserWarning,
                            stacklevel=2,
                        )
                        headers_warned = True
                else:
                    merge_headers_enabled = False
            elif region_idx > 1:
                # Check consistency: all segments should have same pattern
                has_header = segment_rows and header_row and segment_rows[0] == header_row
                segment_has_repeated_header.append(has_header)

                # Remove header if merging is enabled and header is present
                if merge_headers_enabled and has_header:
                    segment_rows = segment_rows[1:]
            elif region_idx > 0 and merge_headers_enabled:
                # Explicit merge_headers=True: remove headers from subsequent segments
                if segment_rows and header_row and segment_rows[0] == header_row:
                    segment_rows = segment_rows[1:]
                    if not headers_warned:
                        warnings.warn(
                            "Removing repeated headers from multi-page table during merge.",
                            UserWarning,
                            stacklevel=2,
                        )
                        headers_warned = True

            # Process remaining rows with stitch_rows logic
            for row_idx, row in enumerate(segment_rows):
                if (
                    predicate is not None
                    and aggregated_rows
                    and predicate(aggregated_rows[-1], row, row_idx, region)
                ):
                    # Merge with previous row
                    aggregated_rows[-1] = _default_merge(aggregated_rows[-1], row)
                else:
                    aggregated_rows.append(row)
        except Exception as e:
            logger.error(
                f"FlowRegion.extract_table: Error extracting table from constituent region {region}: {e}",
                exc_info=True,
            )

    # Check for inconsistent header patterns after processing all segments
    if merge_headers is None and len(segment_has_repeated_header) > 2:
        # During auto-detection, check for consistency across all segments
        expected_pattern = segment_has_repeated_header[1]  # Pattern from second segment
        for seg_idx, has_header in enumerate(segment_has_repeated_header[2:], 2):
            if has_header != expected_pattern:
                # Inconsistent pattern detected
                segments_with_headers = [
                    i for i, has_h in enumerate(segment_has_repeated_header[1:], 1) if has_h
                ]
                segments_without_headers = [
                    i for i, has_h in enumerate(segment_has_repeated_header[1:], 1) if not has_h
                ]
                raise ValueError(
                    f"Inconsistent header pattern in multi-page table: "
                    f"segments {segments_with_headers} have repeated headers, "
                    f"but segments {segments_without_headers} do not. "
                    f"All segments must have the same header pattern for reliable merging."
                )

    return TableResult(aggregated_rows)
natural_pdf.FlowRegion.extract_tables(method=None, table_settings=None, **kwargs)

Extract all tables from the FlowRegion.

This simply chains Region.extract_tables over each physical region and concatenates their results, preserving flow order.

Parameters:

    method, table_settings:
        Forwarded to the underlying Region.extract_tables.
    **kwargs:
        Additional keyword arguments forwarded.

Returns:

    List[List[List[Optional[str]]]]: A list where each item is a full table (a list of rows). The order of tables follows the order of the constituent regions in the flow.
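
A short sketch contrasting this with extract_table, assuming flow_region is an existing FlowRegion:

    # Every detected table, kept separate and returned in flow order.
    tables = flow_region.extract_tables()
    for i, table in enumerate(tables):
        print(f"table {i}: {len(table)} rows")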

Source code in natural_pdf/flows/region.py
def extract_tables(
    self,
    method: Optional[str] = None,
    table_settings: Optional[dict] = None,
    **kwargs,
) -> List[List[List[Optional[str]]]]:
    """Extract **all** tables from the FlowRegion.

    This simply chains :pymeth:`Region.extract_tables` over each physical
    region and concatenates their results, preserving flow order.

    Args:
        method, table_settings: Forwarded to underlying ``Region.extract_tables``.
        **kwargs: Additional keyword arguments forwarded.

    Returns:
        A list where each item is a full table (list of rows).  The order of
        tables follows the order of the constituent regions in the flow.
    """

    if table_settings is None:
        table_settings = {}

    if not self.constituent_regions:
        return []

    all_tables: List[List[List[Optional[str]]]] = []

    for region in self.constituent_regions:
        try:
            region_tables = region.extract_tables(
                method=method,
                table_settings=table_settings.copy(),
                **kwargs,
            )
            # ``region_tables`` is a list (possibly empty).
            if region_tables:
                all_tables.extend(region_tables)
        except Exception as e:
            logger.error(
                f"FlowRegion.extract_tables: Error extracting tables from constituent region {region}: {e}",
                exc_info=True,
            )

    return all_tables
natural_pdf.FlowRegion.extract_text(apply_exclusions=True, **kwargs)

Extracts and concatenates text from all constituent physical regions. The order of concatenation respects the flow's arrangement.

Parameters:

    apply_exclusions (bool, default True):
        Whether to respect PDF exclusion zones within each constituent physical region during text extraction.
    **kwargs:
        Additional arguments passed to the underlying extract_text method of each constituent region.

Returns:

    str: The combined text content as a string.
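
As the source below shows, segment texts are joined with a newline in vertical flows and a space in horizontal ones. A minimal sketch, assuming flow_region is an existing FlowRegion:

    # Combined text across all segments, respecting exclusion zones.
    text = flow_region.extract_text(apply_exclusions=True)
    print(text[:200])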

Source code in natural_pdf/flows/region.py
def extract_text(self, apply_exclusions: bool = True, **kwargs) -> str:
    """
    Extracts and concatenates text from all constituent physical regions.
    The order of concatenation respects the flow's arrangement.

    Args:
        apply_exclusions: Whether to respect PDF exclusion zones within each
                          constituent physical region during text extraction.
        **kwargs: Additional arguments passed to the underlying extract_text method
                  of each constituent region.

    Returns:
        The combined text content as a string.
    """
    if (
        self._cached_text is not None and apply_exclusions
    ):  # Simple cache check, might need refinement if kwargs change behavior
        return self._cached_text

    if not self.constituent_regions:
        return ""

    texts: List[str] = []
    # For now, simple concatenation. Order depends on how constituent_regions were added.
    # The FlowElement._flow_direction method is responsible for ordering constituent_regions correctly.
    for region in self.constituent_regions:
        texts.append(region.extract_text(apply_exclusions=apply_exclusions, **kwargs))

    # Join based on flow arrangement (e.g., newline for vertical, space for horizontal)
    # This is a simplification; true layout-aware joining would be more complex.
    joiner = (
        "\n" if self.flow.arrangement == "vertical" else " "
    )  # TODO: Consider flow.segment_gap for proportional spacing between segments
    extracted = joiner.join(t for t in texts if t)

    if apply_exclusions:  # Only cache if standard exclusion behavior
        self._cached_text = extracted
    return extracted
natural_pdf.FlowRegion.find(selector=None, *, text=None, **kwargs)

Find the first element in flow order that matches the selector or text.

This implementation iterates through the constituent regions in the order they appear in self.constituent_regions (i.e. document flow order), delegating the search to each region's own find method. It therefore avoids constructing a huge intermediate ElementCollection and returns as soon as a match is found, which is substantially faster and ensures that selectors such as 'table' work exactly as they do on an individual Region.

Source code in natural_pdf/flows/region.py
def find(
    self, selector: Optional[str] = None, *, text: Optional[str] = None, **kwargs
) -> Optional["PhysicalElement"]:  # Stringized
    """
    Find the first element in flow order that matches the selector or text.

    This implementation iterates through the constituent regions *in the order
    they appear in ``self.constituent_regions`` (i.e. document flow order),
    delegating the search to each region's own ``find`` method.  It therefore
    avoids constructing a huge intermediate ElementCollection and returns as
    soon as a match is found, which is substantially faster and ensures that
    selectors such as 'table' work exactly as they do on an individual
    Region.
    """
    if not self.constituent_regions:
        return None

    for region in self.constituent_regions:
        try:
            result = region.find(selector=selector, text=text, **kwargs)
            if result is not None:
                return result
        except Exception as e:
            logger.warning(
                f"FlowRegion.find: error searching region {region}: {e}",
                exc_info=False,
            )
    return None  # No match found
natural_pdf.FlowRegion.find_all(selector=None, *, text=None, **kwargs)

Find all elements across the constituent regions that match the given selector or text.

Rather than first materialising every element in the FlowRegion (which can be extremely slow for multi-page flows), this implementation simply chains each region's native find_all call and concatenates their results into a single ElementCollection while preserving flow order.
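
A combined sketch of both lookups, assuming flow_region is an existing FlowRegion and that the selectors shown match something in the document:

    # find() returns the first match in flow order (or None).
    first_table = flow_region.find("table")

    # find_all() chains each segment's matches into one ElementCollection.
    bold_text = flow_region.find_all("text:bold")
    print(first_table, len(bold_text))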

Source code in natural_pdf/flows/region.py
def find_all(
    self, selector: Optional[str] = None, *, text: Optional[str] = None, **kwargs
) -> "ElementCollection":  # Stringized
    """
    Find **all** elements across the constituent regions that match the given
    selector or text.

    Rather than first materialising *every* element in the FlowRegion (which
    can be extremely slow for multi-page flows), this implementation simply
    chains each region's native ``find_all`` call and concatenates their
    results into a single ElementCollection while preserving flow order.
    """
    from natural_pdf.elements.element_collection import (
        ElementCollection as RuntimeElementCollection,
    )

    matched_elements = []  # type: List["PhysicalElement"]

    if not self.constituent_regions:
        return RuntimeElementCollection([])

    for region in self.constituent_regions:
        try:
            region_matches = region.find_all(selector=selector, text=text, **kwargs)
            if region_matches:
                # ``region_matches`` is an ElementCollection – extend with its
                # underlying list so we don't create nested collections.
                matched_elements.extend(
                    region_matches.elements
                    if hasattr(region_matches, "elements")
                    else list(region_matches)
                )
        except Exception as e:
            logger.warning(
                f"FlowRegion.find_all: error searching region {region}: {e}",
                exc_info=False,
            )

    return RuntimeElementCollection(matched_elements)
natural_pdf.FlowRegion.get_highlight_specs()

Get highlight specifications for all constituent regions.

This implements the highlighting protocol for FlowRegions, returning specs for each constituent region so they can be highlighted on their respective pages.

Returns:

    List[Dict[str, Any]]: A list of highlight specification dictionaries, one for each constituent region.
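
A small sketch of consuming the specs directly; the keys are the ones populated in the source below:

    for spec in flow_region.get_highlight_specs():
        # Each spec records the page, its index, and the bbox to draw.
        print(spec["page_index"], spec["bbox"])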

Source code in natural_pdf/flows/region.py
def get_highlight_specs(self) -> List[Dict[str, Any]]:
    """
    Get highlight specifications for all constituent regions.

    This implements the highlighting protocol for FlowRegions, returning
    specs for each constituent region so they can be highlighted on their
    respective pages.

    Returns:
        List of highlight specification dictionaries, one for each
        constituent region.
    """
    specs = []

    for region in self.constituent_regions:
        if not hasattr(region, "page") or region.page is None:
            continue

        if not hasattr(region, "bbox") or region.bbox is None:
            continue

        spec = {
            "page": region.page,
            "page_index": region.page.index if hasattr(region.page, "index") else 0,
            "bbox": region.bbox,
            "element": region,  # Reference to the constituent region
        }

        # Add polygon if available
        if hasattr(region, "polygon") and hasattr(region, "has_polygon") and region.has_polygon:
            spec["polygon"] = region.polygon

        specs.append(spec)

    return specs
natural_pdf.FlowRegion.get_sections(start_elements=None, end_elements=None, new_section_on_page_break=False, include_boundaries='both', orientation='vertical')

Extract logical sections from this FlowRegion based on start/end boundary elements.

This delegates to the parent Flow's get_sections() method, but only operates on the segments that are part of this FlowRegion.

Parameters:

    start_elements (default None):
        Elements or selector string that mark the start of sections.
    end_elements (default None):
        Elements or selector string that mark the end of sections.
    new_section_on_page_break (bool, default False):
        Whether to start a new section at page boundaries.
    include_boundaries (str, default 'both'):
        How to include boundary elements: 'start', 'end', 'both', or 'none'.
    orientation (str, default 'vertical'):
        'vertical' (default) or 'horizontal'; determines the section direction.

Returns:

    ElementCollection: ElementCollection of FlowRegion objects representing the extracted sections.

Example:

    # Split a multi-page table region by headers
    table_region = flow.find("text:contains('Table 4')").below(until="text:contains('Table 5')")
    sections = table_region.get_sections(start_elements="text:bold")

Source code in natural_pdf/flows/region.py
def get_sections(
    self,
    start_elements=None,
    end_elements=None,
    new_section_on_page_break: bool = False,
    include_boundaries: str = "both",
    orientation: str = "vertical",
) -> "ElementCollection":
    """
    Extract logical sections from this FlowRegion based on start/end boundary elements.

    This delegates to the parent Flow's get_sections() method, but only operates
    on the segments that are part of this FlowRegion.

    Args:
        start_elements: Elements or selector string that mark the start of sections
        end_elements: Elements or selector string that mark the end of sections
        new_section_on_page_break: Whether to start a new section at page boundaries
        include_boundaries: How to include boundary elements: 'start', 'end', 'both', or 'none'
        orientation: 'vertical' (default) or 'horizontal' - determines section direction

    Returns:
        ElementCollection of FlowRegion objects representing the extracted sections

    Example:
        # Split a multi-page table region by headers
        table_region = flow.find("text:contains('Table 4')").below(until="text:contains('Table 5')")
        sections = table_region.get_sections(start_elements="text:bold")
    """
    # Create a temporary Flow with just our constituent regions as segments
    from natural_pdf.flows.flow import Flow

    temp_flow = Flow(
        segments=self.constituent_regions,
        arrangement=self.flow.arrangement,
        alignment=self.flow.alignment,
        segment_gap=self.flow.segment_gap,
    )

    # Delegate to Flow's get_sections implementation
    return temp_flow.get_sections(
        start_elements=start_elements,
        end_elements=end_elements,
        new_section_on_page_break=new_section_on_page_break,
        include_boundaries=include_boundaries,
        orientation=orientation,
    )
natural_pdf.FlowRegion.highlight(label=None, color=None, **kwargs)

Highlights all constituent physical regions on their respective pages.

Parameters:

    label (Optional[str], default None):
        A base label for the highlights. Each constituent region may get an indexed label.
    color (Optional[Union[Tuple, str]], default None):
        Color for the highlight.
    **kwargs:
        Additional arguments for the underlying highlight method.

Returns:

    FlowRegion: Self, for method chaining.
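
A one-line sketch; the label and color values are placeholders:

    # Each part is highlighted on its own page; parts get indexed labels
    # ("Findings_1", "Findings_2", ...) when there is more than one.
    flow_region.highlight(label="Findings", color="red")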

Source code in natural_pdf/flows/region.py
def highlight(
    self, label: Optional[str] = None, color: Optional[Union[Tuple, str]] = None, **kwargs
) -> "FlowRegion":  # Stringized
    """
    Highlights all constituent physical regions on their respective pages.

    Args:
        label: A base label for the highlights. Each constituent region might get an indexed label.
        color: Color for the highlight.
        **kwargs: Additional arguments for the underlying highlight method.

    Returns:
        Self for method chaining.
    """
    if not self.constituent_regions:
        return self

    base_label = label if label else "FlowRegionPart"
    for i, region in enumerate(self.constituent_regions):
        current_label = (
            f"{base_label}_{i+1}" if len(self.constituent_regions) > 1 else base_label
        )
        region.highlight(label=current_label, color=color, **kwargs)
    return self
natural_pdf.FlowRegion.highlights(show=False)

Create a highlight context for accumulating highlights.

This allows for clean syntax to show multiple highlight groups:

Example:

    with flow_region.highlights() as h:
        h.add(flow_region.find_all('table'), label='tables', color='blue')
        h.add(flow_region.find_all('text:bold'), label='bold text', color='red')
        h.show()

Or with automatic display:

    with flow_region.highlights(show=True) as h:
        h.add(flow_region.find_all('table'), label='tables')
        h.add(flow_region.find_all('text:bold'), label='bold')
        # Automatically shows when exiting the context

Parameters:

    show (bool, default False):
        If True, automatically show highlights when exiting the context.

Returns:

    HighlightContext: HighlightContext for accumulating highlights.

Source code in natural_pdf/flows/region.py
def highlights(self, show: bool = False) -> "HighlightContext":
    """
    Create a highlight context for accumulating highlights.

    This allows for clean syntax to show multiple highlight groups:

    Example:
        with flow_region.highlights() as h:
            h.add(flow_region.find_all('table'), label='tables', color='blue')
            h.add(flow_region.find_all('text:bold'), label='bold text', color='red')
            h.show()

    Or with automatic display:
        with flow_region.highlights(show=True) as h:
            h.add(flow_region.find_all('table'), label='tables')
            h.add(flow_region.find_all('text:bold'), label='bold')
            # Automatically shows when exiting the context

    Args:
        show: If True, automatically show highlights when exiting context

    Returns:
        HighlightContext for accumulating highlights
    """
    from natural_pdf.core.highlighting_service import HighlightContext

    return HighlightContext(self, show_on_exit=show)
natural_pdf.FlowRegion.left(width=None, height='full', include_source=False, until=None, include_endpoint=True, **kwargs)

Create a FlowRegion with regions to the left of this FlowRegion.

For vertical flows: Expands all constituent regions leftward. For horizontal flows: Only expands the leftmost constituent region leftward.

Parameters:

    width (Optional[float], default None):
        Width of the region to the left, in points.
    height (str, default 'full'):
        Height mode: "full" for full page height or "element" for element height.
    include_source (bool, default False):
        Whether to include this FlowRegion in the result.
    until (Optional[str], default None):
        Optional selector string to specify a left boundary element.
    include_endpoint (bool, default True):
        Whether to include the boundary element in the region.
    **kwargs:
        Additional parameters.

Returns:

    FlowRegion: New FlowRegion with regions to the left.

Source code in natural_pdf/flows/region.py
def left(
    self,
    width: Optional[float] = None,
    height: str = "full",
    include_source: bool = False,
    until: Optional[str] = None,
    include_endpoint: bool = True,
    **kwargs,
) -> "FlowRegion":
    """
    Create a FlowRegion with regions to the left of this FlowRegion.

    For vertical flows: Expands all constituent regions leftward.
    For horizontal flows: Only expands the leftmost constituent region leftward.

    Args:
        width: Width of the region to the left, in points
        height: Height mode - "full" for full page height or "element" for element height
        include_source: Whether to include this FlowRegion in the result
        until: Optional selector string to specify a left boundary element
        include_endpoint: Whether to include the boundary element in the region
        **kwargs: Additional parameters

    Returns:
        New FlowRegion with regions to the left
    """
    if not self.constituent_regions:
        return FlowRegion(
            flow=self.flow,
            constituent_regions=[],
            source_flow_element=self.source_flow_element,
            boundary_element_found=self.boundary_element_found,
        )

    new_regions = []

    if self.flow.arrangement == "vertical":
        # For vertical flow, expand all regions leftward
        for region in self.constituent_regions:
            left_region = region.left(
                width=width,
                height="element",
                include_source=include_source,
                until=until,
                include_endpoint=include_endpoint,
                **kwargs,
            )
            new_regions.append(left_region)
    else:  # horizontal flow
        # For horizontal flow, only expand the leftmost region leftward
        leftmost_region = min(self.constituent_regions, key=lambda r: r.x0)
        for region in self.constituent_regions:
            if region == leftmost_region:
                # Expand this region leftward
                left_region = region.left(
                    width=width,
                    height="element",
                    include_source=include_source,
                    until=until,
                    include_endpoint=include_endpoint,
                    **kwargs,
                )
                new_regions.append(left_region)
            elif include_source:
                # Include other regions unchanged if include_source is True
                new_regions.append(region)

    return FlowRegion(
        flow=self.flow,
        constituent_regions=new_regions,
        source_flow_element=self.source_flow_element,
        boundary_element_found=self.boundary_element_found,
    )
natural_pdf.FlowRegion.right(width=None, height='full', include_source=False, until=None, include_endpoint=True, **kwargs)

Create a FlowRegion with regions to the right of this FlowRegion.

For vertical flows: Expands all constituent regions rightward. For horizontal flows: Only expands the rightmost constituent region rightward.

Parameters:

    width (Optional[float], default None):
        Width of the region to the right, in points.
    height (str, default 'full'):
        Height mode: "full" for full page height or "element" for element height.
    include_source (bool, default False):
        Whether to include this FlowRegion in the result.
    until (Optional[str], default None):
        Optional selector string to specify a right boundary element.
    include_endpoint (bool, default True):
        Whether to include the boundary element in the region.
    **kwargs:
        Additional parameters.

Returns:

    FlowRegion: New FlowRegion with regions to the right.
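
A sketch of both directional helpers, assuming flow_region belongs to a vertical flow; the widths are arbitrary:

    # In a vertical flow both calls act on every segment; in a horizontal
    # flow only the outermost segment grows in the requested direction.
    left_margin = flow_region.left(width=72)    # one inch to the left
    right_gutter = flow_region.right(width=36)  # half an inch to the right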

Source code in natural_pdf/flows/region.py
def right(
    self,
    width: Optional[float] = None,
    height: str = "full",
    include_source: bool = False,
    until: Optional[str] = None,
    include_endpoint: bool = True,
    **kwargs,
) -> "FlowRegion":
    """
    Create a FlowRegion with regions to the right of this FlowRegion.

    For vertical flows: Expands all constituent regions rightward.
    For horizontal flows: Only expands the rightmost constituent region rightward.

    Args:
        width: Width of the region to the right, in points
        height: Height mode - "full" for full page height or "element" for element height
        include_source: Whether to include this FlowRegion in the result
        until: Optional selector string to specify a right boundary element
        include_endpoint: Whether to include the boundary element in the region
        **kwargs: Additional parameters

    Returns:
        New FlowRegion with regions to the right
    """
    if not self.constituent_regions:
        return FlowRegion(
            flow=self.flow,
            constituent_regions=[],
            source_flow_element=self.source_flow_element,
            boundary_element_found=self.boundary_element_found,
        )

    new_regions = []

    if self.flow.arrangement == "vertical":
        # For vertical flow, expand all regions rightward
        for region in self.constituent_regions:
            right_region = region.right(
                width=width,
                height="element",
                include_source=include_source,
                until=until,
                include_endpoint=include_endpoint,
                **kwargs,
            )
            new_regions.append(right_region)
    else:  # horizontal flow
        # For horizontal flow, only expand the rightmost region rightward
        rightmost_region = max(self.constituent_regions, key=lambda r: r.x1)
        for region in self.constituent_regions:
            if region == rightmost_region:
                # Expand this region rightward
                right_region = region.right(
                    width=width,
                    height="element",
                    include_source=include_source,
                    until=until,
                    include_endpoint=include_endpoint,
                    **kwargs,
                )
                new_regions.append(right_region)
            elif include_source:
                # Include other regions unchanged if include_source is True
                new_regions.append(region)

    return FlowRegion(
        flow=self.flow,
        constituent_regions=new_regions,
        source_flow_element=self.source_flow_element,
        boundary_element_found=self.boundary_element_found,
    )
natural_pdf.FlowRegion.split(by=None, page_breaks=True, **kwargs)

Split this FlowRegion into sections.

This is a convenience method that wraps get_sections() with common splitting patterns.

Parameters:

    by (Optional[str], default None):
        Selector string for elements that mark section boundaries (e.g., "text:bold").
    page_breaks (bool, default True):
        Whether to also split at page boundaries.
    **kwargs:
        Additional arguments passed to get_sections().

Returns:

    ElementCollection: ElementCollection of FlowRegion objects representing the sections.

Example:

    # Split by bold headers
    sections = flow_region.split(by="text:bold")

    # Split only by a specific text pattern, ignoring page breaks
    sections = flow_region.split(
        by="text:contains('Section')",
        page_breaks=False,
    )

Source code in natural_pdf/flows/region.py
def split(
    self, by: Optional[str] = None, page_breaks: bool = True, **kwargs
) -> "ElementCollection":
    """
    Split this FlowRegion into sections.

    This is a convenience method that wraps get_sections() with common splitting patterns.

    Args:
        by: Selector string for elements that mark section boundaries (e.g., "text:bold")
        page_breaks: Whether to also split at page boundaries (default: True)
        **kwargs: Additional arguments passed to get_sections()

    Returns:
        ElementCollection of FlowRegion objects representing the sections

    Example:
        # Split by bold headers
        sections = flow_region.split(by="text:bold")

        # Split only by specific text pattern, ignoring page breaks
        sections = flow_region.split(
            by="text:contains('Section')",
            page_breaks=False
        )
    """
    return self.get_sections(start_elements=by, new_section_on_page_break=page_breaks, **kwargs)
natural_pdf.FlowRegion.to_images(resolution=150, **kwargs)

Generates and returns a list of cropped PIL Images, one for each constituent physical region of this FlowRegion.
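
A minimal sketch; the output file names are placeholders:

    # One cropped PIL image per constituent region.
    for i, img in enumerate(flow_region.to_images(resolution=150)):
        img.save(f"part_{i}.png")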

Source code in natural_pdf/flows/region.py
def to_images(
    self,
    resolution: float = 150,
    **kwargs,
) -> List["PIL_Image"]:
    """
    Generates and returns a list of cropped PIL Images,
    one for each constituent physical region of this FlowRegion.
    """
    if not self.constituent_regions:
        logger.info("FlowRegion.to_images() called on an empty FlowRegion.")
        return []

    cropped_images: List["PIL_Image"] = []
    for region_part in self.constituent_regions:
        try:
            # Use render() for clean image without highlights
            img = region_part.render(resolution=resolution, crop=True, **kwargs)
            if img:
                cropped_images.append(img)
        except Exception as e:
            logger.error(
                f"Error generating image for constituent region {region_part.bbox}: {e}",
                exc_info=True,
            )

    return cropped_images
natural_pdf.FlowRegion.to_region()

Convert this FlowRegion to a region (returns a copy). This is equivalent to calling expand() with no arguments.

Returns:

    FlowRegion: A copy of this FlowRegion.

Source code in natural_pdf/flows/region.py
def to_region(self) -> "FlowRegion":
    """
    Convert this FlowRegion to a region (returns a copy).
    This is equivalent to calling expand() with no arguments.

    Returns:
        Copy of this FlowRegion
    """
    return self.expand()
natural_pdf.Guides

Manages vertical and horizontal guide lines for table extraction and layout analysis.

Guides are collections of coordinates that can be used to define table boundaries, column positions, or general layout structures. They can be created through various detection methods or manually specified.

Attributes:

    verticals: List of x-coordinates for vertical guide lines.
    horizontals: List of y-coordinates for horizontal guide lines.
    context: Optional Page/Region that these guides relate to.
    bounds: Optional bounding box (x0, y0, x1, y1) for relative coordinate conversion.
    snap_behavior: How to handle failed snapping operations ('warn', 'ignore', 'raise').
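
A construction sketch based on the __init__ signature below; the coordinates are arbitrary, and the top-level import path is assumed from the natural_pdf.Guides heading above:

    from natural_pdf import Guides  # assumed re-export of analyzers.guides

    page = pdf.pages[0]

    # Explicit guide coordinates in PDF points, tied to a page context.
    guides = Guides(verticals=[72, 200, 340, 520], horizontals=[90, 400], context=page)

    # Shorthand: Guides(page) records only the context, with no coordinates yet.
    empty = Guides(page)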

Source code in natural_pdf/analyzers/guides.py
class Guides:
    """
    Manages vertical and horizontal guide lines for table extraction and layout analysis.

    Guides are collections of coordinates that can be used to define table boundaries,
    column positions, or general layout structures. They can be created through various
    detection methods or manually specified.

    Attributes:
        verticals: List of x-coordinates for vertical guide lines
        horizontals: List of y-coordinates for horizontal guide lines
        context: Optional Page/Region that these guides relate to
        bounds: Optional bounding box (x0, y0, x1, y1) for relative coordinate conversion
        snap_behavior: How to handle failed snapping operations ('warn', 'ignore', 'raise')
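
    Example:
        Illustrative usage; assumes page is a natural_pdf Page object:

        # Explicit coordinates
        guides = Guides(verticals=[72.0, 200.0], horizontals=[100.0, 300.0], context=page)

        # Shorthand: start from the page itself and build up guides
        guides = Guides(page).add_vertical(72.0).add_horizontal(100.0)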
    """

    def __init__(
        self,
        verticals: Optional[Union[List[float], "Page", "Region", "FlowRegion"]] = None,
        horizontals: Optional[List[float]] = None,
        context: Optional[Union["Page", "Region", "FlowRegion"]] = None,
        bounds: Optional[Tuple[float, float, float, float]] = None,
        relative: bool = False,
        snap_behavior: Literal["raise", "warn", "ignore"] = "warn",
    ):
        """
        Initialize a Guides object.

        Args:
            verticals: List of x-coordinates for vertical guides, or a Page/Region/FlowRegion as context
            horizontals: List of y-coordinates for horizontal guides
            context: Page, Region, or FlowRegion object these guides were created from
            bounds: Bounding box (x0, top, x1, bottom) if context not provided
            relative: Whether coordinates are relative (0-1) or absolute
            snap_behavior: How to handle snapping conflicts ('raise', 'warn', or 'ignore')
        """
        # Handle Guides(page) or Guides(flow_region) shorthand
        if (
            verticals is not None
            and not isinstance(verticals, (list, tuple))
            and horizontals is None
            and context is None
        ):
            # First argument is a page/region/flow_region, not coordinates
            context = verticals
            verticals = None

        self.context = context
        self.bounds = bounds
        self.relative = relative
        self.snap_behavior = snap_behavior

        # Check if we're dealing with a FlowRegion
        self.is_flow_region = hasattr(context, "constituent_regions")

        # If FlowRegion, we'll store guides per constituent region
        if self.is_flow_region:
            self._flow_guides: Dict["Region", Tuple[List[float], List[float]]] = {}
            # For unified view across all regions
            self._unified_vertical: List[Tuple[float, "Region"]] = []
            self._unified_horizontal: List[Tuple[float, "Region"]] = []
            # Cache for sorted unique coordinates
            self._vertical_cache: Optional[List[float]] = None
            self._horizontal_cache: Optional[List[float]] = None

        # Initialize with GuidesList instances
        self._vertical = GuidesList(self, "vertical", sorted([float(x) for x in (verticals or [])]))
        self._horizontal = GuidesList(
            self, "horizontal", sorted([float(y) for y in (horizontals or [])])
        )

        # Determine bounds from context if needed
        if self.bounds is None and self.context is not None:
            if hasattr(self.context, "bbox"):
                self.bounds = self.context.bbox
            elif hasattr(self.context, "x0"):
                self.bounds = (
                    self.context.x0,
                    self.context.top,
                    self.context.x1,
                    self.context.bottom,
                )

        # Convert relative to absolute if needed
        if self.relative and self.bounds:
            x0, top, x1, bottom = self.bounds
            width = x1 - x0
            height = bottom - top

            self._vertical.data = [x0 + v * width for v in self._vertical]
            self._horizontal.data = [top + h * height for h in self._horizontal]
            self.relative = False

    @property
    def vertical(self) -> GuidesList:
        """Get vertical guide coordinates."""
        if self.is_flow_region and self._vertical_cache is not None:
            # Return cached unified view
            self._vertical.data = self._vertical_cache
        elif self.is_flow_region and self._unified_vertical:
            # Build unified view from flow guides
            all_verticals = []
            for coord, region in self._unified_vertical:
                all_verticals.append(coord)
            # Remove duplicates and sort
            self._vertical_cache = sorted(list(set(all_verticals)))
            self._vertical.data = self._vertical_cache
        return self._vertical

    @vertical.setter
    def vertical(self, value: Union[List[float], "Guides", None]):
        """Set vertical guides from a list of coordinates or another Guides object."""
        if self.is_flow_region:
            # Invalidate cache when setting new values
            self._vertical_cache = None

        if value is None:
            self._vertical.data = []
        elif isinstance(value, Guides):
            # Extract vertical coordinates from another Guides object
            self._vertical.data = sorted([float(x) for x in value.vertical])
        elif isinstance(value, str):
            # Explicitly reject strings to avoid confusing iteration over characters
            raise TypeError(
                f"vertical cannot be a string, got '{value}'. Use a list of coordinates or Guides object."
            )
        elif hasattr(value, "__iter__"):
            # Handle list/tuple of coordinates
            try:
                self._vertical.data = sorted([float(x) for x in value])
            except (ValueError, TypeError) as e:
                raise TypeError(f"vertical must contain numeric values, got {value}: {e}")
        else:
            raise TypeError(f"vertical must be a list, Guides object, or None, got {type(value)}")

    @property
    def horizontal(self) -> GuidesList:
        """Get horizontal guide coordinates."""
        if self.is_flow_region and self._horizontal_cache is not None:
            # Return cached unified view
            self._horizontal.data = self._horizontal_cache
        elif self.is_flow_region and self._unified_horizontal:
            # Build unified view from flow guides
            all_horizontals = []
            for coord, region in self._unified_horizontal:
                all_horizontals.append(coord)
            # Remove duplicates and sort
            self._horizontal_cache = sorted(list(set(all_horizontals)))
            self._horizontal.data = self._horizontal_cache
        return self._horizontal

    @horizontal.setter
    def horizontal(self, value: Union[List[float], "Guides", None]):
        """Set horizontal guides from a list of coordinates or another Guides object."""
        if self.is_flow_region:
            # Invalidate cache when setting new values
            self._horizontal_cache = None

        if value is None:
            self._horizontal.data = []
        elif isinstance(value, Guides):
            # Extract horizontal coordinates from another Guides object
            self._horizontal.data = sorted([float(y) for y in value.horizontal])
        elif isinstance(value, str):
            # Explicitly reject strings
            raise TypeError(
                f"horizontal cannot be a string, got '{value}'. Use a list of coordinates or Guides object."
            )
        elif hasattr(value, "__iter__"):
            # Handle list/tuple of coordinates
            try:
                self._horizontal.data = sorted([float(y) for y in value])
            except (ValueError, TypeError) as e:
                raise TypeError(f"horizontal must contain numeric values, got {value}: {e}")
        else:
            raise TypeError(f"horizontal must be a list, Guides object, or None, got {type(value)}")

    def _get_context_bounds(self) -> Optional[Tuple[float, float, float, float]]:
        """Get bounds from context if available."""
        if self.context is None:
            return None

        if hasattr(self.context, "bbox"):
            return self.context.bbox
        elif hasattr(self.context, "x0") and hasattr(self.context, "top"):
            return (self.context.x0, self.context.top, self.context.x1, self.context.bottom)
        elif hasattr(self.context, "width") and hasattr(self.context, "height"):
            return (0, 0, self.context.width, self.context.height)
        return None

    # -------------------------------------------------------------------------
    # Factory Methods
    # -------------------------------------------------------------------------

    @classmethod
    def divide(
        cls,
        obj: Union["Page", "Region", Tuple[float, float, float, float]],
        n: Optional[int] = None,
        cols: Optional[int] = None,
        rows: Optional[int] = None,
        axis: Literal["vertical", "horizontal", "both"] = "both",
    ) -> "Guides":
        """
        Create guides by evenly dividing an object.

        Args:
            obj: Object to divide (Page, Region, or bbox tuple)
            n: Number of divisions (creates n+1 guides). Used if cols/rows not specified.
            cols: Number of columns (creates cols+1 vertical guides)
            rows: Number of rows (creates rows+1 horizontal guides)
            axis: Which axis to divide along

        Returns:
            New Guides object with evenly spaced lines

        Examples:
            # Divide into 3 columns
            guides = Guides.divide(page, cols=3)

            # Divide into 5 rows
            guides = Guides.divide(region, rows=5)

            # Divide both axes
            guides = Guides.divide(page, cols=3, rows=5)
        """
        # Extract bounds from object
        if isinstance(obj, tuple) and len(obj) == 4:
            bounds = obj
            context = None
        else:
            context = obj
            if hasattr(obj, "bbox"):
                bounds = obj.bbox
            elif hasattr(obj, "x0"):
                bounds = (obj.x0, obj.top, obj.x1, obj.bottom)
            else:
                bounds = (0, 0, obj.width, obj.height)

        x0, y0, x1, y1 = bounds
        verticals = []
        horizontals = []

        # Handle vertical guides
        if axis in ("vertical", "both"):
            n_vertical = cols + 1 if cols is not None else (n + 1 if n is not None else 0)
            if n_vertical > 0:
                for i in range(n_vertical):
                    x = x0 + (x1 - x0) * i / (n_vertical - 1)
                    verticals.append(float(x))

        # Handle horizontal guides
        if axis in ("horizontal", "both"):
            n_horizontal = rows + 1 if rows is not None else (n + 1 if n is not None else 0)
            if n_horizontal > 0:
                for i in range(n_horizontal):
                    y = y0 + (y1 - y0) * i / (n_horizontal - 1)
                    horizontals.append(float(y))

        return cls(verticals=verticals, horizontals=horizontals, context=context, bounds=bounds)

    @classmethod
    def from_lines(
        cls,
        obj: Union["Page", "Region", "FlowRegion"],
        axis: Literal["vertical", "horizontal", "both"] = "both",
        threshold: Union[float, str] = "auto",
        source_label: Optional[str] = None,
        max_lines_h: Optional[int] = None,
        max_lines_v: Optional[int] = None,
        outer: bool = False,
        detection_method: str = "pixels",
        resolution: int = 192,
        **detect_kwargs,
    ) -> "Guides":
        """
        Create guides from detected line elements.

        Args:
            obj: Page, Region, or FlowRegion to detect lines from
            axis: Which orientations to detect
            threshold: Detection threshold ('auto' or float 0.0-1.0) - used for pixel detection
            source_label: Filter for line source (vector method) or label for detected lines (pixel method)
            max_lines_h: Maximum number of horizontal lines to keep
            max_lines_v: Maximum number of vertical lines to keep
            outer: Whether to add outer boundary guides
            detection_method: 'vector' (use existing LineElements) or 'pixels' (detect from image)
            resolution: DPI for pixel-based detection (default: 192)
            **detect_kwargs: Additional parameters for pixel-based detection:
                - min_gap_h: Minimum gap between horizontal lines (pixels)
                - min_gap_v: Minimum gap between vertical lines (pixels)
                - binarization_method: 'adaptive' or 'otsu'
                - morph_op_h/v: Morphological operations ('open', 'close', 'none')
                - smoothing_sigma_h/v: Gaussian smoothing sigma
                - method: 'projection' (default) or 'lsd' (requires opencv)

        Returns:
            New Guides object with detected line positions
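
        Examples:
            A minimal sketch; assumes the page actually contains ruling lines:

            # Pixel-based detection (the default), keeping outer boundaries
            guides = Guides.from_lines(page, outer=True)

            # Reuse vector line elements already present on the page
            guides = Guides.from_lines(page, detection_method="vector", max_lines_v=4)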
        """
        # Handle FlowRegion
        if hasattr(obj, "constituent_regions"):
            guides = cls(context=obj)

            # Process each constituent region
            for region in obj.constituent_regions:
                # Create guides for this specific region
                region_guides = cls.from_lines(
                    region,
                    axis=axis,
                    threshold=threshold,
                    source_label=source_label,
                    max_lines_h=max_lines_h,
                    max_lines_v=max_lines_v,
                    outer=outer,
                    detection_method=detection_method,
                    resolution=resolution,
                    **detect_kwargs,
                )

                # Store in flow guides
                guides._flow_guides[region] = (
                    list(region_guides.vertical),
                    list(region_guides.horizontal),
                )

                # Add to unified view
                for v in region_guides.vertical:
                    guides._unified_vertical.append((v, region))
                for h in region_guides.horizontal:
                    guides._unified_horizontal.append((h, region))

            # Invalidate caches to force rebuild on next access
            guides._vertical_cache = None
            guides._horizontal_cache = None

            return guides

        # Original single-region logic follows...
        # Get bounds for potential outer guides
        if hasattr(obj, "bbox"):
            bounds = obj.bbox
        elif hasattr(obj, "x0"):
            bounds = (obj.x0, obj.top, obj.x1, obj.bottom)
        elif hasattr(obj, "width"):
            bounds = (0, 0, obj.width, obj.height)
        else:
            bounds = None

        verticals = []
        horizontals = []

        if detection_method == "pixels":
            # Use pixel-based line detection
            if not hasattr(obj, "detect_lines"):
                raise ValueError(f"Object {obj} does not support pixel-based line detection")

            # Set up detection parameters
            detect_params = {
                "resolution": resolution,
                "source_label": source_label or "guides_detection",
                "horizontal": axis in ("horizontal", "both"),
                "vertical": axis in ("vertical", "both"),
                "replace": True,  # Replace any existing lines with this source
                "method": detect_kwargs.get("method", "projection"),
            }

            # Handle threshold parameter (detection_method is always "pixels" here)
            if threshold == "auto":
                # Auto mode: moderate default thresholds, constrained by max_lines
                detect_params["peak_threshold_h"] = 0.5
                detect_params["peak_threshold_v"] = 0.5
            else:
                # Fixed threshold mode
                detect_params["peak_threshold_h"] = (
                    float(threshold) if axis in ("horizontal", "both") else 1.0
                )
                detect_params["peak_threshold_v"] = (
                    float(threshold) if axis in ("vertical", "both") else 1.0
                )
            detect_params["max_lines_h"] = max_lines_h
            detect_params["max_lines_v"] = max_lines_v

            # Add any additional detection parameters
            for key in [
                "min_gap_h",
                "min_gap_v",
                "binarization_method",
                "adaptive_thresh_block_size",
                "adaptive_thresh_C_val",
                "morph_op_h",
                "morph_kernel_h",
                "morph_op_v",
                "morph_kernel_v",
                "smoothing_sigma_h",
                "smoothing_sigma_v",
                "peak_width_rel_height",
            ]:
                if key in detect_kwargs:
                    detect_params[key] = detect_kwargs[key]

            # Perform the detection
            obj.detect_lines(**detect_params)

            # Now get the detected lines and use them
            if hasattr(obj, "lines"):
                lines = obj.lines
            elif hasattr(obj, "find_all"):
                lines = obj.find_all("line")
            else:
                lines = []

            # Filter by the source we just used
            lines = [
                l for l in lines if getattr(l, "source", None) == detect_params["source_label"]
            ]

        else:  # detection_method == 'vector' (default)
            # Get existing lines from the object
            if hasattr(obj, "lines"):
                lines = obj.lines
            elif hasattr(obj, "find_all"):
                lines = obj.find_all("line")
            else:
                logger.warning(f"Object {obj} has no lines or find_all method")
                lines = []

            # Filter by source if specified
            if source_label:
                lines = [l for l in lines if getattr(l, "source", None) == source_label]

        # Process lines (same logic for both methods)
        # Separate lines by orientation and collect with metadata for ranking
        h_line_data = []  # (y_coord, length, line_obj)
        v_line_data = []  # (x_coord, length, line_obj)

        for line in lines:
            if hasattr(line, "is_horizontal") and hasattr(line, "is_vertical"):
                if line.is_horizontal and axis in ("horizontal", "both"):
                    # Use the midpoint y-coordinate for horizontal lines
                    y = (line.top + line.bottom) / 2
                    # Calculate line length for ranking
                    length = getattr(
                        line, "width", abs(getattr(line, "x1", 0) - getattr(line, "x0", 0))
                    )
                    h_line_data.append((y, length, line))
                elif line.is_vertical and axis in ("vertical", "both"):
                    # Use the midpoint x-coordinate for vertical lines
                    x = (line.x0 + line.x1) / 2
                    # Calculate line length for ranking
                    length = getattr(
                        line, "height", abs(getattr(line, "bottom", 0) - getattr(line, "top", 0))
                    )
                    v_line_data.append((x, length, line))

        # Process horizontal lines
        if max_lines_h is not None and h_line_data:
            # Sort by length (longer lines are typically more significant)
            h_line_data.sort(key=lambda x: x[1], reverse=True)
            # Take the top N by length
            selected_h = h_line_data[:max_lines_h]
            # Extract just the coordinates and sort by position
            horizontals = sorted([coord for coord, _, _ in selected_h])
            logger.debug(
                f"Selected {len(horizontals)} horizontal lines from {len(h_line_data)} candidates"
            )
        else:
            # Use all horizontal lines (original behavior)
            horizontals = [coord for coord, _, _ in h_line_data]
            horizontals = sorted(list(set(horizontals)))

        # Process vertical lines
        if max_lines_v is not None and v_line_data:
            # Sort by length (longer lines are typically more significant)
            v_line_data.sort(key=lambda x: x[1], reverse=True)
            # Take the top N by length
            selected_v = v_line_data[:max_lines_v]
            # Extract just the coordinates and sort by position
            verticals = sorted([coord for coord, _, _ in selected_v])
            logger.debug(
                f"Selected {len(verticals)} vertical lines from {len(v_line_data)} candidates"
            )
        else:
            # Use all vertical lines (original behavior)
            verticals = [coord for coord, _, _ in v_line_data]
            verticals = sorted(list(set(verticals)))

        # Add outer guides if requested
        if outer and bounds:
            if axis in ("vertical", "both"):
                if not verticals or verticals[0] > bounds[0]:
                    verticals.insert(0, bounds[0])  # x0
                if not verticals or verticals[-1] < bounds[2]:
                    verticals.append(bounds[2])  # x1
            if axis in ("horizontal", "both"):
                if not horizontals or horizontals[0] > bounds[1]:
                    horizontals.insert(0, bounds[1])  # y0
                if not horizontals or horizontals[-1] < bounds[3]:
                    horizontals.append(bounds[3])  # y1

        # Remove duplicates and sort again
        verticals = sorted(list(set(verticals)))
        horizontals = sorted(list(set(horizontals)))

        return cls(verticals=verticals, horizontals=horizontals, context=obj, bounds=bounds)

    @classmethod
    def from_content(
        cls,
        obj: Union["Page", "Region", "FlowRegion"],
        axis: Literal["vertical", "horizontal"] = "vertical",
        markers: Union[str, List[str], "ElementCollection", None] = None,
        align: Union[
            Literal["left", "right", "center", "between"], Literal["top", "bottom"]
        ] = "left",
        outer: Union[bool, Literal["first", "last"]] = True,
        tolerance: float = 5,
        apply_exclusions: bool = True,
    ) -> "Guides":
        """
        Create guides based on text content positions.

        Args:
            obj: Page, Region, or FlowRegion to search for content
            axis: Whether to create vertical or horizontal guides
            markers: Content to search for. Can be:
                - str: single selector (e.g., 'text:contains("Name")') or literal text
                - List[str]: list of selectors or literal text strings
                - ElementCollection: collection of elements to extract text from
                - None: no markers
            align: Where to place guides relative to found text:
                - For vertical guides: 'left', 'right', 'center', 'between'
                - For horizontal guides: 'top', 'bottom', 'center', 'between'
            outer: Whether to add guides at the outer boundaries. Use True/False,
                or 'first'/'last' to add only the leading/trailing boundary.
            tolerance: Maximum distance to search for text
            apply_exclusions: Whether to apply exclusion zones when searching for text

        Returns:
            New Guides object aligned to text content
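
        Examples:
            Illustrative only; the marker strings are assumed column headers:

            # Vertical guides at the left edge of each header
            guides = Guides.from_content(page, markers=["Name", "Age", "City"])

            # Guides at the midpoints between adjacent headers
            guides = Guides.from_content(page, markers=["Name", "Age", "City"], align="between")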
        """
        # Normalize alignment for horizontal guides
        if axis == "horizontal":
            if align == "top":
                align = "left"
            elif align == "bottom":
                align = "right"

        # Handle FlowRegion
        if hasattr(obj, "constituent_regions"):
            guides = cls(context=obj)

            # Process each constituent region
            for region in obj.constituent_regions:
                # Create guides for this specific region
                region_guides = cls.from_content(
                    region,
                    axis=axis,
                    markers=markers,
                    align=align,
                    outer=outer,
                    tolerance=tolerance,
                    apply_exclusions=apply_exclusions,
                )

                # Store in flow guides
                guides._flow_guides[region] = (
                    list(region_guides.vertical),
                    list(region_guides.horizontal),
                )

                # Add to unified view
                for v in region_guides.vertical:
                    guides._unified_vertical.append((v, region))
                for h in region_guides.horizontal:
                    guides._unified_horizontal.append((h, region))

            # Invalidate caches
            guides._vertical_cache = None
            guides._horizontal_cache = None

            return guides

        # Original single-region logic follows...
        guides_coords = []
        bounds = None

        # Get bounds from object
        if hasattr(obj, "bbox"):
            bounds = obj.bbox
        elif hasattr(obj, "x0"):
            bounds = (obj.x0, obj.top, obj.x1, obj.bottom)
        elif hasattr(obj, "width"):
            bounds = (0, 0, obj.width, obj.height)

        # Handle different marker types
        elements_to_process = []

        # Check if markers is an ElementCollection or has elements attribute
        if hasattr(markers, "elements") or hasattr(markers, "_elements"):
            # It's an ElementCollection - use elements directly
            elements_to_process = getattr(markers, "elements", getattr(markers, "_elements", []))
        elif hasattr(markers, "__iter__") and not isinstance(markers, str):
            # Check if it's an iterable of elements (not strings)
            try:
                markers_list = list(markers)
                if markers_list and hasattr(markers_list[0], "x0"):
                    # It's a list of elements
                    elements_to_process = markers_list
            except Exception:
                pass

        if elements_to_process:
            # Process elements directly without text search
            for element in elements_to_process:
                if axis == "vertical":
                    if align == "left":
                        guides_coords.append(element.x0)
                    elif align == "right":
                        guides_coords.append(element.x1)
                    elif align == "center":
                        guides_coords.append((element.x0 + element.x1) / 2)
                    elif align == "between":
                        # For between, collect left edges for processing later
                        guides_coords.append(element.x0)
                else:  # horizontal
                    if align == "left":  # top for horizontal
                        guides_coords.append(element.top)
                    elif align == "right":  # bottom for horizontal
                        guides_coords.append(element.bottom)
                    elif align == "center":
                        guides_coords.append((element.top + element.bottom) / 2)
                    elif align == "between":
                        # For between, collect top edges for processing later
                        guides_coords.append(element.top)
        else:
            # Fall back to text-based search
            marker_texts = _normalize_markers(markers, obj)

            # Find each marker and determine guide position
            for marker in marker_texts:
                if hasattr(obj, "find"):
                    element = obj.find(
                        f'text:contains("{marker}")', apply_exclusions=apply_exclusions
                    )
                    if element:
                        if axis == "vertical":
                            if align == "left":
                                guides_coords.append(element.x0)
                            elif align == "right":
                                guides_coords.append(element.x1)
                            elif align == "center":
                                guides_coords.append((element.x0 + element.x1) / 2)
                            elif align == "between":
                                # For between, collect left edges for processing later
                                guides_coords.append(element.x0)
                        else:  # horizontal
                            if align == "left":  # top for horizontal
                                guides_coords.append(element.top)
                            elif align == "right":  # bottom for horizontal
                                guides_coords.append(element.bottom)
                            elif align == "center":
                                guides_coords.append((element.top + element.bottom) / 2)
                            elif align == "between":
                                # For between, collect top edges for processing later
                                guides_coords.append(element.top)

        # Handle 'between' alignment - find midpoints between adjacent markers
        if align == "between" and len(guides_coords) >= 2:
            # We need to get the right and left edges of each marker
            marker_bounds = []

            if elements_to_process:
                # Use elements directly
                for element in elements_to_process:
                    if axis == "vertical":
                        marker_bounds.append((element.x0, element.x1))
                    else:  # horizontal
                        marker_bounds.append((element.top, element.bottom))
            else:
                # Fall back to text search
                if "marker_texts" not in locals():
                    marker_texts = _normalize_markers(markers, obj)
                for marker in marker_texts:
                    if hasattr(obj, "find"):
                        element = obj.find(
                            f'text:contains("{marker}")', apply_exclusions=apply_exclusions
                        )
                        if element:
                            if axis == "vertical":
                                marker_bounds.append((element.x0, element.x1))
                            else:  # horizontal
                                marker_bounds.append((element.top, element.bottom))

            # Sort markers by their left edge (or top edge for horizontal)
            marker_bounds.sort(key=lambda x: x[0])

            # Create guides at midpoints between adjacent markers
            between_coords = []
            for i in range(len(marker_bounds) - 1):
                # Midpoint between right edge of current marker and left edge of next marker
                right_edge_current = marker_bounds[i][1]
                left_edge_next = marker_bounds[i + 1][0]
                midpoint = (right_edge_current + left_edge_next) / 2
                between_coords.append(midpoint)

            guides_coords = between_coords

        # Add outer guides if requested
        if outer and bounds:
            if axis == "vertical":
                if outer == True or outer == "first":
                    guides_coords.insert(0, bounds[0])  # x0
                if outer == True or outer == "last":
                    guides_coords.append(bounds[2])  # x1
            else:
                if outer == True or outer == "first":
                    guides_coords.insert(0, bounds[1])  # y0
                if outer == True or outer == "last":
                    guides_coords.append(bounds[3])  # y1

        # Remove duplicates and sort
        guides_coords = sorted(list(set(guides_coords)))

        # Create guides object
        if axis == "vertical":
            return cls(verticals=guides_coords, context=obj, bounds=bounds)
        else:
            return cls(horizontals=guides_coords, context=obj, bounds=bounds)

    @classmethod
    def from_whitespace(
        cls,
        obj: Union["Page", "Region", "FlowRegion"],
        axis: Literal["vertical", "horizontal", "both"] = "both",
        min_gap: float = 10,
    ) -> "Guides":
        """
        Create guides by detecting whitespace gaps.

        Args:
            obj: Page or Region to analyze
            axis: Which axes to analyze for gaps
            min_gap: Minimum gap size to consider as whitespace

        Returns:
            New Guides object positioned at whitespace gaps

        Note:
            Whitespace detection is not yet implemented; this method currently
            falls back to Guides.divide() with n=3.
        """
        # This is a placeholder - would need sophisticated gap detection
        logger.info("Whitespace detection not yet implemented, using divide instead")
        return cls.divide(obj, n=3, axis=axis)

    @classmethod
    def new(cls, context: Optional[Union["Page", "Region"]] = None) -> "Guides":
        """
        Create a new empty Guides object, optionally with a context.

        This provides a clean way to start building guides through chaining:
        guides = Guides.new(page).add_content(axis='vertical', markers=[...])

        Args:
            context: Optional Page or Region to use as default context for operations

        Returns:
            New empty Guides object
        """
        return cls(verticals=[], horizontals=[], context=context)

    # -------------------------------------------------------------------------
    # Manipulation Methods
    # -------------------------------------------------------------------------

    def snap_to_whitespace(
        self,
        axis: str = "vertical",
        min_gap: float = 10.0,
        detection_method: str = "pixels",  # 'pixels' or 'text'
        threshold: Union[
            float, str
        ] = "auto",  # threshold for what counts as a trough (0.0-1.0) or 'auto'
        on_no_snap: str = "warn",
    ) -> "Guides":
        """
        Snap guides to nearby whitespace gaps (troughs) using optimal assignment.
        Modifies this Guides object in place.

        Args:
            axis: Direction to snap ('vertical' or 'horizontal')
            min_gap: Minimum gap size to consider as a valid trough
            detection_method: Method for detecting troughs:
                            'pixels' - use pixel-based density analysis (default)
                            'text' - use text element spacing analysis
            threshold: Threshold for what counts as a trough:
                      - float (0.0-1.0): areas with this fraction or less of max density count as troughs
                      - 'auto': automatically find threshold that creates enough troughs for guides
            on_no_snap: Action when snapping fails ('warn', 'ignore', 'raise')

        Returns:
            Self for method chaining.
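
        Example:
            A sketch of typical chained use; assumes content-based guides exist:

            guides = Guides.from_content(page, markers=["Name", "Age"])
            guides.snap_to_whitespace(axis="vertical", min_gap=8.0)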
        """
        if not self.context:
            logger.warning("No context available for whitespace detection")
            return self

        # Handle FlowRegion case - collect all text elements across regions
        if self.is_flow_region:
            all_text_elements = []
            region_bounds = {}

            for region in self.context.constituent_regions:
                # Get text elements from this region
                if hasattr(region, "find_all"):
                    try:
                        text_elements = region.find_all("text", apply_exclusions=False)
                        elements = (
                            text_elements.elements
                            if hasattr(text_elements, "elements")
                            else text_elements
                        )
                        all_text_elements.extend(elements)

                        # Store bounds for each region
                        if hasattr(region, "bbox"):
                            region_bounds[region] = region.bbox
                        elif hasattr(region, "x0"):
                            region_bounds[region] = (
                                region.x0,
                                region.top,
                                region.x1,
                                region.bottom,
                            )
                    except Exception as e:
                        logger.warning(f"Error getting text elements from region: {e}")

            if not all_text_elements:
                logger.warning(
                    "No text elements found across flow regions for whitespace detection"
                )
                return self

            # Find whitespace gaps across all regions
            if axis == "vertical":
                gaps = self._find_vertical_whitespace_gaps(all_text_elements, min_gap, threshold)
                # Get all vertical guides across regions
                all_guides = []
                guide_to_region_map = {}  # Map guide coordinate to its original list of regions
                for coord, region in self._unified_vertical:
                    all_guides.append(coord)
                    guide_to_region_map.setdefault(coord, []).append(region)

                if gaps and all_guides:
                    # Keep a copy of original guides to maintain mapping
                    original_guides = all_guides.copy()

                    # Snap guides to gaps
                    self._snap_guides_to_gaps(all_guides, gaps, axis)

                    # Update the unified view with snapped positions
                    self._unified_vertical = []
                    for i, new_coord in enumerate(all_guides):
                        # Find the original region for this guide using the original position
                        original_coord = original_guides[i]
                        # A guide might be associated with multiple regions, add them all
                        regions = guide_to_region_map.get(original_coord, [])
                        for region in regions:
                            self._unified_vertical.append((new_coord, region))

                    # Update individual region guides
                    for region in self._flow_guides:
                        region_verticals = []
                        for coord, r in self._unified_vertical:
                            if r == region:
                                region_verticals.append(coord)
                        self._flow_guides[region] = (
                            sorted(list(set(region_verticals))),  # Deduplicate here
                            self._flow_guides[region][1],
                        )

                    # Invalidate cache
                    self._vertical_cache = None

            elif axis == "horizontal":
                gaps = self._find_horizontal_whitespace_gaps(all_text_elements, min_gap, threshold)
                # Get all horizontal guides across regions
                all_guides = []
                guide_to_region_map = {}  # Map guide coordinate to its original list of regions
                for coord, region in self._unified_horizontal:
                    all_guides.append(coord)
                    guide_to_region_map.setdefault(coord, []).append(region)

                if gaps and all_guides:
                    # Keep a copy of original guides to maintain mapping
                    original_guides = all_guides.copy()

                    # Snap guides to gaps
                    self._snap_guides_to_gaps(all_guides, gaps, axis)

                    # Update the unified view with snapped positions
                    self._unified_horizontal = []
                    for i, new_coord in enumerate(all_guides):
                        # Find the original region for this guide using the original position
                        original_coord = original_guides[i]
                        regions = guide_to_region_map.get(original_coord, [])
                        for region in regions:
                            self._unified_horizontal.append((new_coord, region))

                    # Update individual region guides
                    for region in self._flow_guides:
                        region_horizontals = []
                        for coord, r in self._unified_horizontal:
                            if r == region:
                                region_horizontals.append(coord)
                        self._flow_guides[region] = (
                            self._flow_guides[region][0],
                            sorted(list(set(region_horizontals))),  # Deduplicate here
                        )

                    # Invalidate cache
                    self._horizontal_cache = None

            else:
                raise ValueError("axis must be 'vertical' or 'horizontal'")

            return self

        # Original single-region logic
        # Get elements for trough detection
        text_elements = self._get_text_elements()
        if not text_elements:
            logger.warning("No text elements found for whitespace detection")
            return self

        if axis == "vertical":
            gaps = self._find_vertical_whitespace_gaps(text_elements, min_gap, threshold)
            if gaps:
                self._snap_guides_to_gaps(self.vertical.data, gaps, axis)
        elif axis == "horizontal":
            gaps = self._find_horizontal_whitespace_gaps(text_elements, min_gap, threshold)
            if gaps:
                self._snap_guides_to_gaps(self.horizontal.data, gaps, axis)
        else:
            raise ValueError("axis must be 'vertical' or 'horizontal'")

        # Ensure all coordinates are Python floats (not numpy types)
        self.vertical.data[:] = [float(x) for x in self.vertical.data]
        self.horizontal.data[:] = [float(y) for y in self.horizontal.data]

        return self

    def shift(
        self, index: int, offset: float, axis: Literal["vertical", "horizontal"] = "vertical"
    ) -> "Guides":
        """
        Move a specific guide by an offset amount.

        Args:
            index: Index of the guide to move
            offset: Amount to move (positive = right/down)
            axis: Which guide list to modify

        Returns:
            Self for method chaining
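
        Example:
            # Nudge the second vertical guide 5 points to the right (illustrative)
            guides.shift(1, 5.0)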
        """
        if axis == "vertical":
            if 0 <= index < len(self.vertical):
                self.vertical[index] += offset
                self.vertical = sorted(self.vertical)
            else:
                logger.warning(f"Vertical guide index {index} out of range")
        else:
            if 0 <= index < len(self.horizontal):
                self.horizontal[index] += offset
                self.horizontal = sorted(self.horizontal)
            else:
                logger.warning(f"Horizontal guide index {index} out of range")

        return self

    def add_vertical(self, x: float) -> "Guides":
        """Add a vertical guide at the specified x-coordinate."""
        self.vertical.append(x)
        self.vertical = sorted(self.vertical)
        return self

    def add_horizontal(self, y: float) -> "Guides":
        """Add a horizontal guide at the specified y-coordinate."""
        self.horizontal.append(y)
        self.horizontal = sorted(self.horizontal)
        return self

    def remove_vertical(self, index: int) -> "Guides":
        """Remove a vertical guide by index."""
        if 0 <= index < len(self.vertical):
            self.vertical.pop(index)
        return self

    def remove_horizontal(self, index: int) -> "Guides":
        """Remove a horizontal guide by index."""
        if 0 <= index < len(self.horizontal):
            self.horizontal.pop(index)
        return self

    # -------------------------------------------------------------------------
    # Region extraction properties
    # -------------------------------------------------------------------------

    @property
    def columns(self):
        """Access columns by index like guides.columns[0]."""
        return _ColumnAccessor(self)

    @property
    def rows(self):
        """Access rows by index like guides.rows[0]."""
        return _RowAccessor(self)

    @property
    def cells(self):
        """Access cells by index like guides.cells[row][col] or guides.cells[row, col]."""
        return _CellAccessor(self)

    # -------------------------------------------------------------------------
    # Region extraction methods (alternative API)
    # -------------------------------------------------------------------------

    def column(self, index: int, obj: Optional[Union["Page", "Region"]] = None) -> "Region":
        """
        Get a column region from the guides.

        Args:
            index: Column index (0-based)
            obj: Page or Region to create the column on (uses self.context if None)

        Returns:
            Region representing the specified column

        Raises:
            IndexError: If column index is out of range
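
        Example:
            # First column as a Region (assumes at least two vertical guides)
            first_col = guides.column(0)
            text = first_col.extract_text()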
        """
        target = obj or self.context
        if target is None:
            raise ValueError("No context available for region creation")

        if not self.vertical or index < 0 or index >= len(self.vertical) - 1:
            raise IndexError(
                f"Column index {index} out of range (have {len(self.vertical)-1} columns)"
            )

        # Get bounds from context
        bounds = self._get_context_bounds()
        if not bounds:
            raise ValueError("Could not determine bounds")
        _, y0, _, y1 = bounds

        # Get column boundaries
        x0 = self.vertical[index]
        x1 = self.vertical[index + 1]

        # Create region using absolute coordinates
        if hasattr(target, "region"):
            # Target has a region method (Page)
            return target.region(x0, y0, x1, y1)
        elif hasattr(target, "page"):
            # Target is a Region, use its parent page
            # The coordinates from guides are already absolute
            return target.page.region(x0, y0, x1, y1)
        else:
            raise TypeError(f"Cannot create region on {type(target)}")

    def row(self, index: int, obj: Optional[Union["Page", "Region"]] = None) -> "Region":
        """
        Get a row region from the guides.

        Args:
            index: Row index (0-based)
            obj: Page or Region to create the row on (uses self.context if None)

        Returns:
            Region representing the specified row

        Raises:
            IndexError: If row index is out of range
        """
        target = obj or self.context
        if target is None:
            raise ValueError("No context available for region creation")

        if not self.horizontal or index < 0 or index >= len(self.horizontal) - 1:
            raise IndexError(f"Row index {index} out of range (have {len(self.horizontal)-1} rows)")

        # Get bounds from context
        bounds = self._get_context_bounds()
        if not bounds:
            raise ValueError("Could not determine bounds")
        x0, _, x1, _ = bounds

        # Get row boundaries
        y0 = self.horizontal[index]
        y1 = self.horizontal[index + 1]

        # Create region using absolute coordinates
        if hasattr(target, "region"):
            # Target has a region method (Page)
            return target.region(x0, y0, x1, y1)
        elif hasattr(target, "page"):
            # Target is a Region, use its parent page
            # The coordinates from guides are already absolute
            return target.page.region(x0, y0, x1, y1)
        else:
            raise TypeError(f"Cannot create region on {type(target)}")

    def cell(self, row: int, col: int, obj: Optional[Union["Page", "Region"]] = None) -> "Region":
        """
        Get a cell region from the guides.

        Args:
            row: Row index (0-based)
            col: Column index (0-based)
            obj: Page or Region to create the cell on (uses self.context if None)

        Returns:
            Region representing the specified cell

        Raises:
            IndexError: If row or column index is out of range
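
        Example:
            # Cell at row 1, column 2 (illustrative indices)
            cell_region = guides.cell(1, 2)
            value = cell_region.extract_text()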
        """
        target = obj or self.context
        if target is None:
            raise ValueError("No context available for region creation")

        if not self.vertical or col < 0 or col >= len(self.vertical) - 1:
            raise IndexError(
                f"Column index {col} out of range (have {len(self.vertical)-1} columns)"
            )
        if not self.horizontal or row < 0 or row >= len(self.horizontal) - 1:
            raise IndexError(f"Row index {row} out of range (have {len(self.horizontal)-1} rows)")

        # Get cell boundaries
        x0 = self.vertical[col]
        x1 = self.vertical[col + 1]
        y0 = self.horizontal[row]
        y1 = self.horizontal[row + 1]

        # Create region using absolute coordinates
        if hasattr(target, "region"):
            # Target has a region method (Page)
            return target.region(x0, y0, x1, y1)
        elif hasattr(target, "page"):
            # Target is a Region, use its parent page
            # The coordinates from guides are already absolute
            return target.page.region(x0, y0, x1, y1)
        else:
            raise TypeError(f"Cannot create region on {type(target)}")

    def left_of(self, guide_index: int, obj: Optional[Union["Page", "Region"]] = None) -> "Region":
        """
        Get a region to the left of a vertical guide.

        Args:
            guide_index: Vertical guide index
            obj: Page or Region to create the region on (uses self.context if None)

        Returns:
            Region to the left of the specified guide
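
        Example:
            # Everything to the left of the second vertical guide (illustrative)
            left_region = guides.left_of(1)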
        """
        target = obj or self.context
        if target is None:
            raise ValueError("No context available for region creation")

        if not self.vertical or guide_index < 0 or guide_index >= len(self.vertical):
            raise IndexError(f"Guide index {guide_index} out of range")

        # Get bounds from context
        bounds = self._get_context_bounds()
        if not bounds:
            raise ValueError("Could not determine bounds")
        x0, y0, _, y1 = bounds

        # Create region from left edge to guide
        x1 = self.vertical[guide_index]

        if hasattr(target, "region"):
            return target.region(x0, y0, x1, y1)
        else:
            raise TypeError(f"Cannot create region on {type(target)}")

    def right_of(self, guide_index: int, obj: Optional[Union["Page", "Region"]] = None) -> "Region":
        """
        Get a region to the right of a vertical guide.

        Args:
            guide_index: Vertical guide index
            obj: Page or Region to create the region on (uses self.context if None)

        Returns:
            Region to the right of the specified guide
        """
        target = obj or self.context
        if target is None:
            raise ValueError("No context available for region creation")

        if not self.vertical or guide_index < 0 or guide_index >= len(self.vertical):
            raise IndexError(f"Guide index {guide_index} out of range")

        # Get bounds from context
        bounds = self._get_context_bounds()
        if not bounds:
            raise ValueError("Could not determine bounds")
        _, y0, x1, y1 = bounds

        # Create region from guide to right edge
        x0 = self.vertical[guide_index]

        if hasattr(target, "region"):
            return target.region(x0, y0, x1, y1)
        else:
            raise TypeError(f"Cannot create region on {type(target)}")

    def above(self, guide_index: int, obj: Optional[Union["Page", "Region"]] = None) -> "Region":
        """
        Get a region above a horizontal guide.

        Args:
            guide_index: Horizontal guide index
            obj: Page or Region to create the region on (uses self.context if None)

        Returns:
            Region above the specified guide
        """
        target = obj or self.context
        if target is None:
            raise ValueError("No context available for region creation")

        if not self.horizontal or guide_index < 0 or guide_index >= len(self.horizontal):
            raise IndexError(f"Guide index {guide_index} out of range")

        # Get bounds from context
        bounds = self._get_context_bounds()
        if not bounds:
            raise ValueError("Could not determine bounds")
        x0, y0, x1, _ = bounds

        # Create region from top edge to guide
        y1 = self.horizontal[guide_index]

        if hasattr(target, "region"):
            return target.region(x0, y0, x1, y1)
        else:
            raise TypeError(f"Cannot create region on {type(target)}")

    def below(self, guide_index: int, obj: Optional[Union["Page", "Region"]] = None) -> "Region":
        """
        Get a region below a horizontal guide.

        Args:
            guide_index: Horizontal guide index
            obj: Page or Region to create the region on (uses self.context if None)

        Returns:
            Region below the specified guide
        """
        target = obj or self.context
        if target is None:
            raise ValueError("No context available for region creation")

        if not self.horizontal or guide_index < 0 or guide_index >= len(self.horizontal):
            raise IndexError(f"Guide index {guide_index} out of range")

        # Get bounds from context
        bounds = self._get_context_bounds()
        if not bounds:
            raise ValueError("Could not determine bounds")
        x0, _, x1, y1 = bounds

        # Create region from guide to bottom edge
        y0 = self.horizontal[guide_index]

        if hasattr(target, "region"):
            return target.region(x0, y0, x1, y1)
        else:
            raise TypeError(f"Cannot create region on {type(target)}")

    def between_vertical(
        self, start_index: int, end_index: int, obj: Optional[Union["Page", "Region"]] = None
    ) -> "Region":
        """
        Get a region between two vertical guides.

        Args:
            start_index: Starting vertical guide index
            end_index: Ending vertical guide index
            obj: Page or Region to create the region on (uses self.context if None)

        Returns:
            Region between the specified guides
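
        Example:
            # Span from the first to the third vertical guide (illustrative)
            middle = guides.between_vertical(0, 2)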
        """
        target = obj or self.context
        if target is None:
            raise ValueError("No context available for region creation")

        if not self.vertical:
            raise ValueError("No vertical guides available")
        if start_index < 0 or start_index >= len(self.vertical):
            raise IndexError(f"Start index {start_index} out of range")
        if end_index < 0 or end_index >= len(self.vertical):
            raise IndexError(f"End index {end_index} out of range")
        if start_index >= end_index:
            raise ValueError("Start index must be less than end index")

        # Get bounds from context
        bounds = self._get_context_bounds()
        if not bounds:
            raise ValueError("Could not determine bounds")
        _, y0, _, y1 = bounds

        # Get horizontal boundaries
        x0 = self.vertical[start_index]
        x1 = self.vertical[end_index]

        if hasattr(target, "region"):
            return target.region(x0, y0, x1, y1)
        else:
            raise TypeError(f"Cannot create region on {type(target)}")

    def between_horizontal(
        self, start_index: int, end_index: int, obj: Optional[Union["Page", "Region"]] = None
    ) -> "Region":
        """
        Get a region between two horizontal guides.

        Args:
            start_index: Starting horizontal guide index
            end_index: Ending horizontal guide index
            obj: Page or Region to create the region on (uses self.context if None)

        Returns:
            Region between the specified guides
        """
        target = obj or self.context
        if target is None:
            raise ValueError("No context available for region creation")

        if not self.horizontal:
            raise ValueError("No horizontal guides available")
        if start_index < 0 or start_index >= len(self.horizontal):
            raise IndexError(f"Start index {start_index} out of range")
        if end_index < 0 or end_index >= len(self.horizontal):
            raise IndexError(f"End index {end_index} out of range")
        if start_index >= end_index:
            raise ValueError("Start index must be less than end index")

        # Get bounds from context
        bounds = self._get_context_bounds()
        if not bounds:
            raise ValueError("Could not determine bounds")
        x0, _, x1, _ = bounds

        # Get vertical boundaries
        y0 = self.horizontal[start_index]
        y1 = self.horizontal[end_index]

        if hasattr(target, "region"):
            return target.region(x0, y0, x1, y1)
        else:
            raise TypeError(f"Cannot create region on {type(target)}")
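
    # Usage sketch for between_vertical()/between_horizontal(): extract the
    # band bounded by two guides, e.g. one table column or one row strip.
    # Indices refer to positions in the sorted guide lists; names are
    # illustrative.
    #
    #   first_column = guides.between_vertical(0, 1)
    #   second_row_band = guides.between_horizontal(1, 2)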

    # -------------------------------------------------------------------------
    # Operations
    # -------------------------------------------------------------------------

    def __add__(self, other: "Guides") -> "Guides":
        """
        Combine two guide sets.

        Returns:
            New Guides object with combined coordinates
        """
        # Combine and deduplicate coordinates, ensuring Python floats
        combined_verticals = sorted([float(x) for x in set(self.vertical + other.vertical)])
        combined_horizontals = sorted([float(y) for y in set(self.horizontal + other.horizontal)])

        # Handle FlowRegion context merging
        new_context = self.context or other.context

        # If both are flow regions, we might need a more complex merge,
        # but for now, just picking one context is sufficient.

        # Create the new Guides object
        new_guides = Guides(
            verticals=combined_verticals,
            horizontals=combined_horizontals,
            context=new_context,
            bounds=self.bounds or other.bounds,
        )

        # If the new context is a FlowRegion, we need to rebuild the flow-related state
        if new_guides.is_flow_region:
            # Re-initialize flow guides from both sources
            # This is a simplification; a true merge would be more complex.
            # For now, we combine the flow_guides dictionaries.
            if hasattr(self, "_flow_guides"):
                new_guides._flow_guides.update(self._flow_guides)
            if hasattr(other, "_flow_guides"):
                new_guides._flow_guides.update(other._flow_guides)

            # Re-initialize unified views
            if hasattr(self, "_unified_vertical"):
                new_guides._unified_vertical.extend(self._unified_vertical)
            if hasattr(other, "_unified_vertical"):
                new_guides._unified_vertical.extend(other._unified_vertical)

            if hasattr(self, "_unified_horizontal"):
                new_guides._unified_horizontal.extend(self._unified_horizontal)
            if hasattr(other, "_unified_horizontal"):
                new_guides._unified_horizontal.extend(other._unified_horizontal)

            # Invalidate caches to force rebuild
            new_guides._vertical_cache = None
            new_guides._horizontal_cache = None

        return new_guides
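
    # Usage sketch for __add__: guide sets combine with `+`, deduplicating and
    # sorting coordinates, so column and row guides built separately can be
    # merged into one grid. Constructor arguments shown are illustrative.
    #
    #   cols = Guides(verticals=[50, 200, 400], context=page)
    #   rows = Guides(horizontals=[100, 300], context=page)
    #   grid = cols + rows   # carries both vertical and horizontal guides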

    def show(self, on=None, **kwargs):
        """
        Display the guides overlaid on a page or region.

        Args:
            on: Page, Region, PIL Image, or string to display guides on.
                If None, uses self.context (the object guides were created from).
                If string 'page', uses the page from self.context.
            **kwargs: Additional arguments passed to to_image() if applicable.

        Returns:
            PIL Image with guides drawn on it.
        """
        # Handle FlowRegion case
        if self.is_flow_region and (on is None or on == self.context):
            if not self._flow_guides:
                raise ValueError("No guides to show for FlowRegion")

            # Get stacking parameters from kwargs or use defaults
            stack_direction = kwargs.get("stack_direction", "vertical")
            stack_gap = kwargs.get("stack_gap", 5)
            stack_background_color = kwargs.get("stack_background_color", (255, 255, 255))

            # First, render all constituent regions without guides to get base images
            base_images = []
            region_infos = []  # Store region info for guide coordinate mapping

            for region in self.context.constituent_regions:
                try:
                    # Render the region without guides to get a base image
                    if hasattr(region, "render"):
                        img = region.render(
                            resolution=kwargs.get("resolution", 150),
                            width=kwargs.get("width", None),
                            crop=True,  # Always crop regions to their bounds
                        )
                    else:
                        # No render() available on this region; skip it and
                        # let the loop continue to the next region.
                        logger.warning(f"Region {region} does not support render(); skipping")
                        continue
                    if img:
                        base_images.append(img)

                        # Calculate scaling factors for this region
                        scale_x = img.width / region.width
                        scale_y = img.height / region.height

                        region_infos.append(
                            {
                                "region": region,
                                "img_width": img.width,
                                "img_height": img.height,
                                "scale_x": scale_x,
                                "scale_y": scale_y,
                                "pdf_x0": region.x0,
                                "pdf_top": region.top,
                                "pdf_x1": region.x1,
                                "pdf_bottom": region.bottom,
                            }
                        )
                except Exception as e:
                    logger.warning(f"Failed to render region: {e}")

            if not base_images:
                raise ValueError("Failed to render any images for FlowRegion")

            # Calculate final canvas size based on stacking direction
            if stack_direction == "vertical":
                final_width = max(img.width for img in base_images)
                final_height = (
                    sum(img.height for img in base_images) + (len(base_images) - 1) * stack_gap
                )
            else:  # horizontal
                final_width = (
                    sum(img.width for img in base_images) + (len(base_images) - 1) * stack_gap
                )
                final_height = max(img.height for img in base_images)

            # Create unified canvas
            canvas = Image.new("RGB", (final_width, final_height), stack_background_color)
            draw = ImageDraw.Draw(canvas)

            # Paste base images and track positions
            region_positions = []  # (region_info, paste_x, paste_y)

            if stack_direction == "vertical":
                current_y = 0
                for i, (img, info) in enumerate(zip(base_images, region_infos)):
                    paste_x = (final_width - img.width) // 2  # Center horizontally
                    canvas.paste(img, (paste_x, current_y))
                    region_positions.append((info, paste_x, current_y))
                    current_y += img.height + stack_gap
            else:  # horizontal
                current_x = 0
                for i, (img, info) in enumerate(zip(base_images, region_infos)):
                    paste_y = (final_height - img.height) // 2  # Center vertically
                    canvas.paste(img, (current_x, paste_y))
                    region_positions.append((info, current_x, paste_y))
                    current_x += img.width + stack_gap

            # Now draw guides on the unified canvas
            # Draw vertical guides (blue) - these extend through the full canvas height
            for v_coord in self.vertical:
                # Find which region(s) this guide intersects
                for info, paste_x, paste_y in region_positions:
                    if info["pdf_x0"] <= v_coord <= info["pdf_x1"]:
                        # This guide is within this region's x-bounds
                        # Convert PDF coordinate to pixel coordinate relative to the region
                        adjusted_x = v_coord - info["pdf_x0"]
                        pixel_x = adjusted_x * info["scale_x"] + paste_x

                        # Draw full-height line on canvas (not clipped to region)
                        if 0 <= pixel_x <= final_width:
                            x_pixel = int(pixel_x)
                            draw.line(
                                [(x_pixel, 0), (x_pixel, final_height - 1)],
                                fill=(0, 0, 255, 200),
                                width=2,
                            )
                        break  # Only draw once per guide

            # Draw horizontal guides (red) - these extend through the full canvas width
            for h_coord in self.horizontal:
                # Find which region(s) this guide intersects
                for info, paste_x, paste_y in region_positions:
                    if info["pdf_top"] <= h_coord <= info["pdf_bottom"]:
                        # This guide is within this region's y-bounds
                        # Convert PDF coordinate to pixel coordinate relative to the region
                        adjusted_y = h_coord - info["pdf_top"]
                        pixel_y = adjusted_y * info["scale_y"] + paste_y

                        # Draw full-width line on canvas (not clipped to region)
                        if 0 <= pixel_y <= final_height:
                            y_pixel = int(pixel_y)
                            draw.line(
                                [(0, y_pixel), (final_width - 1, y_pixel)],
                                fill=(255, 0, 0, 200),
                                width=2,
                            )
                        break  # Only draw once per guide

            return canvas

        # Original single-region logic follows...
        # Determine what to display guides on
        target = on if on is not None else self.context

        # Handle string shortcuts
        if isinstance(target, str):
            if target == "page":
                if hasattr(self.context, "page"):
                    target = self.context.page
                elif hasattr(self.context, "_page"):
                    target = self.context._page
                else:
                    raise ValueError("Cannot resolve 'page' - context has no page attribute")
            else:
                raise ValueError(f"Unknown string target: {target}. Only 'page' is supported.")

        if target is None:
            raise ValueError("No target specified and no context available for guides display")

        # Prepare kwargs for image generation
        image_kwargs = {}

        # Extract only the parameters that render() accepts
        if "resolution" in kwargs:
            image_kwargs["resolution"] = kwargs["resolution"]
        if "width" in kwargs:
            image_kwargs["width"] = kwargs["width"]
        if "crop" in kwargs:
            image_kwargs["crop"] = kwargs["crop"]

        # If target is a region-like object, crop to just that region
        if hasattr(target, "bbox") and hasattr(target, "page"):
            # This is likely a Region
            image_kwargs["crop"] = True

        # Get base image
        if hasattr(target, "render"):
            # Use the unified rendering system
            img = target.render(**image_kwargs)
        elif hasattr(target, "mode") and hasattr(target, "size"):
            # It's already a PIL Image
            img = target
        else:
            raise ValueError(f"Object {target} does not support render() and is not a PIL Image")

        if img is None:
            raise ValueError("Failed to generate base image")

        # Create a copy to draw on
        img = img.copy()
        draw = ImageDraw.Draw(img)

        # Determine scale factor for coordinate conversion
        if (
            hasattr(target, "width")
            and hasattr(target, "height")
            and not (hasattr(target, "mode") and hasattr(target, "size"))
        ):
            # target is a PDF object (Page/Region) with PDF coordinates
            scale_x = img.width / target.width
            scale_y = img.height / target.height

            # If we're showing guides on a region, we need to adjust coordinates
            # to be relative to the region's origin
            if hasattr(target, "bbox") and hasattr(target, "page"):
                # This is a Region - adjust guide coordinates to be relative to region
                region_x0, region_top = target.x0, target.top
            else:
                # This is a Page - no adjustment needed
                region_x0, region_top = 0, 0
        else:
            # target is already an image, no scaling needed
            scale_x = 1.0
            scale_y = 1.0
            region_x0, region_top = 0, 0

        # Draw vertical guides (blue)
        for x_coord in self.vertical:
            # Adjust coordinate if we're showing on a region
            adjusted_x = x_coord - region_x0
            pixel_x = adjusted_x * scale_x
            # Ensure guides at the edge are still visible by clamping to valid range
            if 0 <= pixel_x <= img.width - 1:
                x_pixel = int(min(pixel_x, img.width - 1))
                draw.line([(x_pixel, 0), (x_pixel, img.height - 1)], fill=(0, 0, 255, 200), width=2)

        # Draw horizontal guides (red)
        for y_coord in self.horizontal:
            # Adjust coordinate if we're showing on a region
            adjusted_y = y_coord - region_top
            pixel_y = adjusted_y * scale_y
            # Ensure guides at the edge are still visible by clamping to valid range
            if 0 <= pixel_y <= img.height - 1:
                y_pixel = int(min(pixel_y, img.height - 1))
                draw.line([(0, y_pixel), (img.width - 1, y_pixel)], fill=(255, 0, 0, 200), width=2)

        return img
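
    # Usage sketch for show(): render the guides over their context, or over
    # an explicit target. Keywords follow the docstring above; `page` is
    # illustrative.
    #
    #   img = guides.show()                       # draw on self.context
    #   img = guides.show(on="page")              # resolve the context's page
    #   img = guides.show(on=page, resolution=150)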

    # -------------------------------------------------------------------------
    # Utility Methods
    # -------------------------------------------------------------------------

    def get_cells(self) -> List[Tuple[float, float, float, float]]:
        """
        Get all cell bounding boxes from guide intersections.

        Returns:
            List of (x0, y0, x1, y1) tuples for each cell
        """
        cells = []

        # Create cells from guide intersections
        for i in range(len(self.vertical) - 1):
            for j in range(len(self.horizontal) - 1):
                x0 = self.vertical[i]
                x1 = self.vertical[i + 1]
                y0 = self.horizontal[j]
                y1 = self.horizontal[j + 1]
                cells.append((x0, y0, x1, y1))

        return cells
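
    # Usage sketch for get_cells(): n vertical and m horizontal guides yield
    # (n-1) * (m-1) cell bboxes, grouped by column (outer loop) then row.
    #
    #   for x0, y0, x1, y1 in guides.get_cells():
    #       cell = page.region(x0, y0, x1, y1)   # illustrative follow-up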

    def to_dict(self) -> Dict[str, Any]:
        """
        Convert to dictionary format suitable for pdfplumber table_settings.

        Returns:
            Dictionary with explicit_vertical_lines and explicit_horizontal_lines
        """
        return {
            "explicit_vertical_lines": self.vertical,
            "explicit_horizontal_lines": self.horizontal,
        }
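
    # Usage sketch for to_dict(): feed guide positions to pdfplumber-style
    # table settings as explicit lines (assumes a pdfplumber page object named
    # `plumber_page`; names are illustrative).
    #
    #   settings = {
    #       "vertical_strategy": "explicit",
    #       "horizontal_strategy": "explicit",
    #       **guides.to_dict(),
    #   }
    #   table = plumber_page.extract_table(settings)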

    def to_relative(self) -> "Guides":
        """
        Convert absolute coordinates to relative (0-1) coordinates.

        Returns:
            New Guides object with relative coordinates
        """
        if self.relative:
            return self  # Already relative

        if not self.bounds:
            raise ValueError("Cannot convert to relative without bounds")

        x0, y0, x1, y1 = self.bounds
        width = x1 - x0
        height = y1 - y0

        rel_verticals = [(x - x0) / width for x in self.vertical]
        rel_horizontals = [(y - y0) / height for y in self.horizontal]

        return Guides(
            verticals=rel_verticals,
            horizontals=rel_horizontals,
            context=self.context,
            bounds=(0, 0, 1, 1),
            relative=True,
        )

    def to_absolute(self, bounds: Tuple[float, float, float, float]) -> "Guides":
        """
        Convert relative coordinates to absolute coordinates.

        Args:
            bounds: Target bounding box (x0, y0, x1, y1)

        Returns:
            New Guides object with absolute coordinates
        """
        if not self.relative:
            return self  # Already absolute

        x0, y0, x1, y1 = bounds
        width = x1 - x0
        height = y1 - y0

        abs_verticals = [x0 + x * width for x in self.vertical]
        abs_horizontals = [y0 + y * height for y in self.horizontal]

        return Guides(
            verticals=abs_verticals,
            horizontals=abs_horizontals,
            context=self.context,
            bounds=bounds,
            relative=False,
        )
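
    # Usage sketch for to_relative()/to_absolute(): normalize guides to the
    # 0-1 range, then re-project them onto another page so the same grid can
    # be reused across pages with identical layout (names illustrative).
    #
    #   rel = guides.to_relative()
    #   reused = rel.to_absolute((0, 0, other_page.width, other_page.height))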

    @property
    def n_rows(self) -> int:
        """Number of rows defined by horizontal guides."""
        return max(0, len(self.horizontal) - 1)

    @property
    def n_cols(self) -> int:
        """Number of columns defined by vertical guides."""
        return max(0, len(self.vertical) - 1)

    def _handle_snap_failure(self, message: str):
        """Handle cases where snapping cannot be performed."""
        if hasattr(self, "on_no_snap"):
            if self.on_no_snap == "warn":
                logger.warning(message)
            elif self.on_no_snap == "raise":
                raise ValueError(message)
            # 'ignore' case: do nothing
        else:
            logger.warning(message)  # Default behavior

    def _find_vertical_whitespace_gaps(
        self, text_elements, min_gap: float, threshold: Union[float, str] = "auto"
    ) -> List[Tuple[float, float]]:
        """
        Find vertical whitespace gaps using bbox-based density analysis.
        Returns list of (start, end) tuples representing trough ranges.
        """
        if not self.bounds:
            return []

        x0, _, x1, _ = self.bounds
        width_pixels = int(x1 - x0)

        if width_pixels <= 0:
            return []

        # Create density histogram: count bbox overlaps per x-coordinate
        density = np.zeros(width_pixels)

        for element in text_elements:
            if not hasattr(element, "x0") or not hasattr(element, "x1"):
                continue

            # Clip coordinates to bounds
            elem_x0 = max(x0, element.x0) - x0
            elem_x1 = min(x1, element.x1) - x0

            if elem_x1 > elem_x0:
                start_px = int(elem_x0)
                end_px = int(elem_x1)
                density[start_px:end_px] += 1

        if density.max() == 0:
            return []

        # Determine the threshold value
        if threshold == "auto":
            # Auto mode: try different thresholds with step 0.05 until we have enough troughs
            guides_needing_troughs = len(
                [g for i, g in enumerate(self.vertical) if 0 < i < len(self.vertical) - 1]
            )
            if guides_needing_troughs == 0:
                threshold_val = 0.5  # Default when no guides need placement
            else:
                threshold_val = None
                for test_threshold in np.arange(0.1, 1.0, 0.05):
                    test_gaps = self._find_gaps_with_threshold(density, test_threshold, min_gap, x0)
                    if len(test_gaps) >= guides_needing_troughs:
                        threshold_val = test_threshold
                        logger.debug(
                            f"Auto threshold found: {test_threshold:.2f} (found {len(test_gaps)} troughs for {guides_needing_troughs} guides)"
                        )
                        break

                if threshold_val is None:
                    threshold_val = 0.8  # Fallback to permissive threshold
                    logger.debug(f"Auto threshold fallback to {threshold_val}")
        else:
            # Fixed threshold mode
            if not isinstance(threshold, (int, float)) or not (0.0 <= threshold <= 1.0):
                raise ValueError("threshold must be a number between 0.0 and 1.0, or 'auto'")
            threshold_val = float(threshold)

        return self._find_gaps_with_threshold(density, threshold_val, min_gap, x0)

    def _find_gaps_with_threshold(self, density, threshold_val, min_gap, x0):
        """Helper method to find gaps given a specific threshold value."""
        max_density = density.max()
        threshold_density = threshold_val * max_density

        # Smooth the density for better trough detection
        from scipy.ndimage import gaussian_filter1d

        smoothed_density = gaussian_filter1d(density.astype(float), sigma=1.0)

        # Find regions below threshold
        below_threshold = smoothed_density <= threshold_density

        # Find contiguous regions
        from scipy.ndimage import label as nd_label

        labeled_regions, num_regions = nd_label(below_threshold)

        gaps = []
        for region_id in range(1, num_regions + 1):
            region_mask = labeled_regions == region_id
            region_indices = np.where(region_mask)[0]

            if len(region_indices) == 0:
                continue

            start_px = region_indices[0]
            end_px = region_indices[-1] + 1

            # Convert back to PDF coordinates
            start_pdf = x0 + start_px
            end_pdf = x0 + end_px

            # Check minimum gap size
            if end_pdf - start_pdf >= min_gap:
                gaps.append((start_pdf, end_pdf))

        return gaps
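
    # Illustrative standalone sketch of the trough-finding technique above:
    # smooth a density histogram, threshold it against a fraction of the peak,
    # and label contiguous below-threshold runs. Plain numpy/scipy, not
    # library API; the array values are made up.
    #
    #   import numpy as np
    #   from scipy.ndimage import gaussian_filter1d, label
    #   density = np.array([3, 3, 0, 0, 0, 2, 2, 0, 0, 3], dtype=float)
    #   smoothed = gaussian_filter1d(density, sigma=1.0)
    #   below = smoothed <= 0.5 * density.max()   # threshold_val = 0.5
    #   runs, n_runs = label(below)               # contiguous trough candidates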

    def _find_horizontal_whitespace_gaps(
        self, text_elements, min_gap: float, threshold: Union[float, str] = "auto"
    ) -> List[Tuple[float, float]]:
        """
        Find horizontal whitespace gaps using bbox-based density analysis.
        Returns list of (start, end) tuples representing trough ranges.
        """
        if not self.bounds:
            return []

        _, y0, _, y1 = self.bounds
        height_pixels = int(y1 - y0)

        if height_pixels <= 0:
            return []

        # Create density histogram: count bbox overlaps per y-coordinate
        density = np.zeros(height_pixels)

        for element in text_elements:
            if not hasattr(element, "top") or not hasattr(element, "bottom"):
                continue

            # Clip coordinates to bounds
            elem_top = max(y0, element.top) - y0
            elem_bottom = min(y1, element.bottom) - y0

            if elem_bottom > elem_top:
                start_px = int(elem_top)
                end_px = int(elem_bottom)
                density[start_px:end_px] += 1

        if density.max() == 0:
            return []

        # Determine the threshold value (same logic as vertical)
        if threshold == "auto":
            guides_needing_troughs = len(
                [g for i, g in enumerate(self.horizontal) if 0 < i < len(self.horizontal) - 1]
            )
            if guides_needing_troughs == 0:
                threshold_val = 0.5  # Default when no guides need placement
            else:
                threshold_val = None
                for test_threshold in np.arange(0.1, 1.0, 0.05):
                    test_gaps = self._find_gaps_with_threshold_horizontal(
                        density, test_threshold, min_gap, y0
                    )
                    if len(test_gaps) >= guides_needing_troughs:
                        threshold_val = test_threshold
                        logger.debug(
                            f"Auto threshold found: {test_threshold:.2f} (found {len(test_gaps)} troughs for {guides_needing_troughs} guides)"
                        )
                        break

                if threshold_val is None:
                    threshold_val = 0.8  # Fallback to permissive threshold
                    logger.debug(f"Auto threshold fallback to {threshold_val}")
        else:
            # Fixed threshold mode
            if not isinstance(threshold, (int, float)) or not (0.0 <= threshold <= 1.0):
                raise ValueError("threshold must be a number between 0.0 and 1.0, or 'auto'")
            threshold_val = float(threshold)

        return self._find_gaps_with_threshold_horizontal(density, threshold_val, min_gap, y0)

    def _find_gaps_with_threshold_horizontal(self, density, threshold_val, min_gap, y0):
        """Helper method to find horizontal gaps given a specific threshold value."""
        max_density = density.max()
        threshold_density = threshold_val * max_density

        # Smooth the density for better trough detection
        from scipy.ndimage import gaussian_filter1d

        smoothed_density = gaussian_filter1d(density.astype(float), sigma=1.0)

        # Find regions below threshold
        below_threshold = smoothed_density <= threshold_density

        # Find contiguous regions
        from scipy.ndimage import label as nd_label

        labeled_regions, num_regions = nd_label(below_threshold)

        gaps = []
        for region_id in range(1, num_regions + 1):
            region_mask = labeled_regions == region_id
            region_indices = np.where(region_mask)[0]

            if len(region_indices) == 0:
                continue

            start_px = region_indices[0]
            end_px = region_indices[-1] + 1

            # Convert back to PDF coordinates
            start_pdf = y0 + start_px
            end_pdf = y0 + end_px

            # Check minimum gap size
            if end_pdf - start_pdf >= min_gap:
                gaps.append((start_pdf, end_pdf))

        return gaps

    def _find_vertical_element_gaps(
        self, text_elements, min_gap: float
    ) -> List[Tuple[float, float]]:
        """
        Find vertical whitespace gaps using text element spacing analysis.
        Returns list of (start, end) tuples representing trough ranges.
        """
        if not self.bounds or not text_elements:
            return []

        x0, _, x1, _ = self.bounds

        # Get all element right and left edges
        element_edges = []
        for element in text_elements:
            if not hasattr(element, "x0") or not hasattr(element, "x1"):
                continue
            # Only include elements that overlap vertically with our bounds
            if hasattr(element, "top") and hasattr(element, "bottom"):
                if element.bottom < self.bounds[1] or element.top > self.bounds[3]:
                    continue
            element_edges.extend([element.x0, element.x1])

        if not element_edges:
            return []

        # Sort edges and find gaps
        element_edges = sorted(set(element_edges))

        trough_ranges = []
        for i in range(len(element_edges) - 1):
            gap_start = element_edges[i]
            gap_end = element_edges[i + 1]
            gap_width = gap_end - gap_start

            if gap_width >= min_gap:
                # Check if this gap actually contains no text (is empty space)
                gap_has_text = False
                for element in text_elements:
                    if (
                        hasattr(element, "x0")
                        and hasattr(element, "x1")
                        and element.x0 < gap_end
                        and element.x1 > gap_start
                    ):
                        gap_has_text = True
                        break

                if not gap_has_text:
                    trough_ranges.append((gap_start, gap_end))

        return trough_ranges

    def _find_horizontal_element_gaps(
        self, text_elements, min_gap: float
    ) -> List[Tuple[float, float]]:
        """
        Find horizontal whitespace gaps using text element spacing analysis.
        Returns list of (start, end) tuples representing trough ranges.
        """
        if not self.bounds or not text_elements:
            return []

        _, y0, _, y1 = self.bounds

        # Get all element top and bottom edges
        element_edges = []
        for element in text_elements:
            if not hasattr(element, "top") or not hasattr(element, "bottom"):
                continue
            # Only include elements that overlap horizontally with our bounds
            if hasattr(element, "x0") and hasattr(element, "x1"):
                if element.x1 < self.bounds[0] or element.x0 > self.bounds[2]:
                    continue
            element_edges.extend([element.top, element.bottom])

        if not element_edges:
            return []

        # Sort edges and find gaps
        element_edges = sorted(set(element_edges))

        trough_ranges = []
        for i in range(len(element_edges) - 1):
            gap_start = element_edges[i]
            gap_end = element_edges[i + 1]
            gap_width = gap_end - gap_start

            if gap_width >= min_gap:
                # Check if this gap actually contains no text (is empty space)
                gap_has_text = False
                for element in text_elements:
                    if (
                        hasattr(element, "top")
                        and hasattr(element, "bottom")
                        and element.top < gap_end
                        and element.bottom > gap_start
                    ):
                        gap_has_text = True
                        break

                if not gap_has_text:
                    trough_ranges.append((gap_start, gap_end))

        return trough_ranges
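
    # Worked example of the edge-gap approach above (values illustrative):
    # elements spanning x in [10, 60] and [110, 180] give sorted edges
    # [10, 60, 110, 180]; only the span 60-110 contains no element, so with
    # min_gap <= 50 the single reported trough is (60, 110).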

    def _optimal_guide_assignment(
        self, guides: List[float], trough_ranges: List[Tuple[float, float]]
    ) -> Dict[int, int]:
        """
        Assign guides to trough ranges using the user's desired logic:
        - Guides already in a trough stay put
        - Only guides NOT in any trough get moved to available troughs
        - Prefer closest assignment for guides that need to move
        """
        if not guides or not trough_ranges:
            return {}

        assignments = {}

        # Step 1: Identify which guides are already in troughs
        guides_in_troughs = set()
        for i, guide_pos in enumerate(guides):
            for trough_start, trough_end in trough_ranges:
                if trough_start <= guide_pos <= trough_end:
                    guides_in_troughs.add(i)
                    logger.debug(
                        f"Guide {i} (pos {guide_pos:.1f}) is already in trough ({trough_start:.1f}-{trough_end:.1f}), keeping in place"
                    )
                    break

        # Step 2: Identify which troughs are already occupied
        occupied_troughs = set()
        for i in guides_in_troughs:
            guide_pos = guides[i]
            for j, (trough_start, trough_end) in enumerate(trough_ranges):
                if trough_start <= guide_pos <= trough_end:
                    occupied_troughs.add(j)
                    break

        # Step 3: Find guides that need reassignment (not in any trough)
        guides_to_move = []
        for i, guide_pos in enumerate(guides):
            if i not in guides_in_troughs:
                guides_to_move.append(i)
                logger.debug(
                    f"Guide {i} (pos {guide_pos:.1f}) is NOT in any trough, needs reassignment"
                )

        # Step 4: Find available troughs (not occupied by existing guides)
        available_troughs = []
        for j, (trough_start, trough_end) in enumerate(trough_ranges):
            if j not in occupied_troughs:
                available_troughs.append(j)
                logger.debug(f"Trough {j} ({trough_start:.1f}-{trough_end:.1f}) is available")

        # Step 5: Assign guides to move to closest available troughs
        if guides_to_move and available_troughs:
            # Calculate distances for all combinations
            distances = []
            for guide_idx in guides_to_move:
                guide_pos = guides[guide_idx]
                for trough_idx in available_troughs:
                    trough_start, trough_end = trough_ranges[trough_idx]
                    trough_center = (trough_start + trough_end) / 2
                    distance = abs(guide_pos - trough_center)
                    distances.append((distance, guide_idx, trough_idx))

            # Sort by distance and assign greedily
            distances.sort()
            used_troughs = set()

            for distance, guide_idx, trough_idx in distances:
                if guide_idx not in assignments and trough_idx not in used_troughs:
                    assignments[guide_idx] = trough_idx
                    used_troughs.add(trough_idx)
                    logger.debug(
                        f"Assigned guide {guide_idx} (pos {guides[guide_idx]:.1f}) to trough {trough_idx} (distance: {distance:.1f})"
                    )

        logger.debug(f"Final assignments: {assignments}")
        return assignments
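
    # Worked example of the assignment policy above (values illustrative):
    # guides at [100, 210, 400] with troughs (195, 225) and (390, 410).
    # Guides 1 and 2 already sit inside troughs, so they stay put and mark
    # both troughs occupied; guide 0 is outside every trough but no trough
    # remains free, so it is left in place and the mapping returned is empty.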

    def _snap_guides_to_gaps(self, guides: List[float], gaps: List[Tuple[float, float]], axis: str):
        """
        Snap guides to nearby gaps using optimal assignment.
        Only moves guides that are NOT already in a trough.
        """
        if not guides or not gaps:
            return

        logger.debug(f"Snapping {len(guides)} {axis} guides to {len(gaps)} trough ranges")
        for i, (start, end) in enumerate(gaps):
            center = (start + end) / 2
            logger.debug(f"  Trough {i}: {start:.1f} to {end:.1f} (center: {center:.1f})")

        # Get optimal assignments
        assignments = self._optimal_guide_assignment(guides, gaps)

        # Apply assignments (modify guides list in-place)
        for guide_idx, trough_idx in assignments.items():
            trough_start, trough_end = gaps[trough_idx]
            new_pos = (trough_start + trough_end) / 2  # Move to trough center
            old_pos = guides[guide_idx]
            guides[guide_idx] = new_pos
            logger.info(f"Snapped {axis} guide from {old_pos:.1f} to {new_pos:.1f}")

    def build_grid(
        self,
        target: Optional[Union["Page", "Region"]] = None,
        source: str = "guides",
        cell_padding: float = 0.5,
        include_outer_boundaries: bool = False,
        *,
        multi_page: Literal["auto", True, False] = "auto",
    ) -> Dict[str, Any]:
        """
        Create table structure (table, rows, columns, cells) from guide coordinates.

        Args:
            target: Page or Region to create regions on (uses self.context if None)
            source: Source label for created regions (for identification)
            cell_padding: Internal padding for cell regions in points
            include_outer_boundaries: Whether to add boundaries at edges if missing
            multi_page: Controls multi-region table creation for FlowRegions.
                - "auto": (default) Creates a unified grid if there are multiple regions or guides span pages.
                - True: Forces creation of a unified multi-region grid.
                - False: Creates separate grids for each region.

        Returns:
            Dictionary with 'counts' and 'regions' created.
        """
        # Dispatch to appropriate implementation based on context and flags
        if self.is_flow_region:
            # Check if we should create a unified multi-region grid
            has_multiple_regions = len(self.context.constituent_regions) > 1
            spans_pages = self._spans_pages()

            # Create unified grid if:
            # - multi_page is explicitly True, OR
            # - multi_page is "auto" AND (spans pages OR has multiple regions)
            if multi_page is True or (
                multi_page == "auto" and (spans_pages or has_multiple_regions)
            ):
                return self._build_grid_multi_page(
                    source=source,
                    cell_padding=cell_padding,
                    include_outer_boundaries=include_outer_boundaries,
                )
            else:
                # Single region FlowRegion or multi_page=False: create separate tables per region
                total_counts = {"table": 0, "rows": 0, "columns": 0, "cells": 0}
                all_regions = {"table": [], "rows": [], "columns": [], "cells": []}

                for region in self.context.constituent_regions:
                    if region in self._flow_guides:
                        verticals, horizontals = self._flow_guides[region]

                        region_guides = Guides(
                            verticals=verticals, horizontals=horizontals, context=region
                        )

                        try:
                            result = region_guides._build_grid_single_page(
                                target=region,
                                source=source,
                                cell_padding=cell_padding,
                                include_outer_boundaries=include_outer_boundaries,
                            )

                            for key in total_counts:
                                total_counts[key] += result["counts"][key]

                            if result["regions"]["table"]:
                                all_regions["table"].append(result["regions"]["table"])
                            all_regions["rows"].extend(result["regions"]["rows"])
                            all_regions["columns"].extend(result["regions"]["columns"])
                            all_regions["cells"].extend(result["regions"]["cells"])

                        except Exception as e:
                            logger.warning(f"Failed to build grid on region: {e}")

                logger.info(
                    f"Created {total_counts['table']} tables, {total_counts['rows']} rows, "
                    f"{total_counts['columns']} columns, and {total_counts['cells']} cells "
                    f"from guides across {len(self._flow_guides)} regions"
                )

                return {"counts": total_counts, "regions": all_regions}

        # Fallback for single page/region
        return self._build_grid_single_page(
            target=target,
            source=source,
            cell_padding=cell_padding,
            include_outer_boundaries=include_outer_boundaries,
        )
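
    # Usage sketch for build_grid(): materialize table/row/column/cell regions
    # from the current guides, then query them back by region type (selector
    # usage mirrors the docstrings above; other names are illustrative).
    #
    #   result = guides.build_grid(source="my_table")
    #   result["counts"]                  # {'table': 1, 'rows': ..., ...}
    #   rows = page.find_all("table_row")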

    def _build_grid_multi_page(
        self,
        source: str,
        cell_padding: float,
        include_outer_boundaries: bool,
    ) -> Dict[str, Any]:
        """
        Builds a single, coherent grid across multiple regions of a FlowRegion.

        Creates physical Region objects for each constituent region with _fragment
        region types (e.g., table_column_fragment), then stitches them into logical
        FlowRegion objects. Both are registered with pages, but the fragment types
        allow easy differentiation:
        - find_all('table_column') returns only logical columns
        - find_all('table_column_fragment') returns only physical fragments
        """
        from natural_pdf.flows.region import FlowRegion

        if not self.is_flow_region or not hasattr(self.context, "flow") or not self.context.flow:
            raise ValueError("Multi-page grid building requires a FlowRegion with a valid Flow.")

        # Determine flow orientation to guide stitching
        orientation = self._get_flow_orientation()

        # Phase 1: Build physical grid on each page, clipping guides to that page's region
        results_by_region = []
        unified_verticals = self.vertical.data
        unified_horizontals = self.horizontal.data

        for region in self.context.constituent_regions:
            bounds = region.bbox
            if not bounds:
                continue

            # Clip unified guides to the current region's bounds
            clipped_verticals = [v for v in unified_verticals if bounds[0] <= v <= bounds[2]]
            clipped_horizontals = [h for h in unified_horizontals if bounds[1] <= h <= bounds[3]]

            # Ensure the region's own boundaries are included to close off cells at page breaks
            clipped_verticals = sorted(list(set([bounds[0], bounds[2]] + clipped_verticals)))
            clipped_horizontals = sorted(list(set([bounds[1], bounds[3]] + clipped_horizontals)))

            if len(clipped_verticals) < 2 or len(clipped_horizontals) < 2:
                continue  # Not enough guides to form a cell

            region_guides = Guides(
                verticals=clipped_verticals,
                horizontals=clipped_horizontals,
                context=region,
            )

            grid_parts = region_guides._build_grid_single_page(
                target=region,
                source=source,
                cell_padding=cell_padding,
                include_outer_boundaries=False,  # Boundaries are already handled
            )

            if grid_parts["counts"]["table"] > 0:
                # Mark physical regions as fragments by updating their region_type
                # This happens before stitching into logical FlowRegions
                if len(self.context.constituent_regions) > 1:
                    # Update region types to indicate these are fragments
                    if grid_parts["regions"]["table"]:
                        grid_parts["regions"]["table"].region_type = "table_fragment"
                        grid_parts["regions"]["table"].metadata["is_fragment"] = True

                    for row in grid_parts["regions"]["rows"]:
                        row.region_type = "table_row_fragment"
                        row.metadata["is_fragment"] = True

                    for col in grid_parts["regions"]["columns"]:
                        col.region_type = "table_column_fragment"
                        col.metadata["is_fragment"] = True

                    for cell in grid_parts["regions"]["cells"]:
                        cell.region_type = "table_cell_fragment"
                        cell.metadata["is_fragment"] = True

                results_by_region.append(grid_parts)

        if not results_by_region:
            return {
                "counts": {"table": 0, "rows": 0, "columns": 0, "cells": 0},
                "regions": {"table": None, "rows": [], "columns": [], "cells": []},
            }

        # Phase 2: Stitch physical regions into logical FlowRegions based on orientation
        flow = self.context.flow

        # The overall table is always a FlowRegion
        physical_tables = [res["regions"]["table"] for res in results_by_region]
        multi_page_table = FlowRegion(
            flow=flow, constituent_regions=physical_tables, source_flow_element=None
        )
        multi_page_table.source = source
        multi_page_table.region_type = "table"
        multi_page_table.metadata.update(
            {"is_multi_page": True, "num_rows": self.n_rows, "num_cols": self.n_cols}
        )

        # Initialize final region collections
        final_rows = []
        final_cols = []
        final_cells = []

        if orientation == "vertical":
            # Start with all rows & cells from the first page's grid
            if results_by_region:
                # Make copies to modify
                page_rows = [res["regions"]["rows"] for res in results_by_region]
                page_cells = [res["regions"]["cells"] for res in results_by_region]

                # Iterate through page breaks to merge split rows/cells
                for i in range(len(results_by_region) - 1):
                    region_A = self.context.constituent_regions[i]

                    # Check if a guide exists at the boundary
                    is_break_bounded = any(
                        abs(h - region_A.bottom) < 0.1 for h in self.horizontal.data
                    )

                    if not is_break_bounded and page_rows[i] and page_rows[i + 1]:
                        # No guide at break -> merge last row of A with first row of B
                        last_row_A = page_rows[i].pop(-1)
                        first_row_B = page_rows[i + 1].pop(0)

                        merged_row = FlowRegion(
                            flow, [last_row_A, first_row_B], source_flow_element=None
                        )
                        merged_row.source = source
                        merged_row.region_type = "table_row"
                        merged_row.metadata.update(
                            {
                                "row_index": last_row_A.metadata.get("row_index"),
                                "is_multi_page": True,
                            }
                        )
                        page_rows[i].append(merged_row)  # Add merged row back in place of A's last

                        # Merge the corresponding cells using explicit row/col indices
                        last_row_idx = last_row_A.metadata.get("row_index")
                        first_row_idx = first_row_B.metadata.get("row_index")

                        # Cells belonging to those rows
                        last_cells_A = [
                            c for c in page_cells[i] if c.metadata.get("row_index") == last_row_idx
                        ]
                        first_cells_B = [
                            c
                            for c in page_cells[i + 1]
                            if c.metadata.get("row_index") == first_row_idx
                        ]

                        # Remove them from their page lists
                        page_cells[i] = [
                            c for c in page_cells[i] if c.metadata.get("row_index") != last_row_idx
                        ]
                        page_cells[i + 1] = [
                            c
                            for c in page_cells[i + 1]
                            if c.metadata.get("row_index") != first_row_idx
                        ]

                        # Sort both lists by column index to keep alignment stable
                        last_cells_A.sort(key=lambda c: c.metadata.get("col_index", 0))
                        first_cells_B.sort(key=lambda c: c.metadata.get("col_index", 0))

                        # Pair-wise merge
                        for cell_A, cell_B in zip(last_cells_A, first_cells_B):
                            merged_cell = FlowRegion(
                                flow, [cell_A, cell_B], source_flow_element=None
                            )
                            merged_cell.source = source
                            merged_cell.region_type = "table_cell"
                            merged_cell.metadata.update(
                                {
                                    "row_index": cell_A.metadata.get("row_index"),
                                    "col_index": cell_A.metadata.get("col_index"),
                                    "is_multi_page": True,
                                }
                            )
                            page_cells[i].append(merged_cell)

                # Flatten the potentially modified lists of rows and cells
                final_rows = [row for rows_list in page_rows for row in rows_list]
                final_cells = [cell for cells_list in page_cells for cell in cells_list]

                # Stitch columns, which always span vertically
                physical_cols_by_index = zip(
                    *(res["regions"]["columns"] for res in results_by_region)
                )
                for j, physical_cols in enumerate(physical_cols_by_index):
                    col_fr = FlowRegion(
                        flow=flow, constituent_regions=list(physical_cols), source_flow_element=None
                    )
                    col_fr.source = source
                    col_fr.region_type = "table_column"
                    col_fr.metadata.update({"col_index": j, "is_multi_page": True})
                    final_cols.append(col_fr)

        elif orientation == "horizontal":
            # Symmetric logic for horizontal flow is not yet implemented: it
            # would merge the last column of segment A with the first column
            # of segment B when no vertical guide sits at the break.
            logger.warning("Horizontal table stitching not fully implemented.")
            final_rows = [row for res in results_by_region for row in res["regions"]["rows"]]
            final_cols = [col for res in results_by_region for col in res["regions"]["columns"]]
            final_cells = [cell for res in results_by_region for cell in res["regions"]["cells"]]

        else:  # Unknown orientation, just flatten everything
            final_rows = [row for res in results_by_region for row in res["regions"]["rows"]]
            final_cols = [col for res in results_by_region for col in res["regions"]["columns"]]
            final_cells = [cell for res in results_by_region for cell in res["regions"]["cells"]]

        # SMART PAGE-LEVEL REGISTRY: Remove individual tables and replace with multi-page table
        # This ensures that page.find('table') finds the logical multi-page table, not fragments
        constituent_pages = set()
        for region in self.context.constituent_regions:
            if hasattr(region, "page") and hasattr(region.page, "_element_mgr"):
                constituent_pages.add(region.page)

        # Register the logical multi-page table with all constituent pages
        # Note: Physical table fragments are already registered with region_type="table_fragment"
        for page in constituent_pages:
            try:
                page._element_mgr.add_element(multi_page_table, element_type="regions")
                logger.debug(f"Registered multi-page table with page {page.page_number}")

            except Exception as e:
                logger.warning(
                    f"Failed to register multi-page table with page {page.page_number}: {e}"
                )

        # SMART PAGE-LEVEL REGISTRY: Register logical FlowRegion elements.
        # Physical fragments are already registered with their pages with _fragment region types,
        # so users can differentiate between logical regions and physical fragments.
        for page in constituent_pages:
            try:
                # Register all logical rows with this page
                for row in final_rows:
                    page._element_mgr.add_element(row, element_type="regions")

                # Register all logical columns with this page
                for col in final_cols:
                    page._element_mgr.add_element(col, element_type="regions")

                # Register all logical cells with this page
                for cell in final_cells:
                    page._element_mgr.add_element(cell, element_type="regions")

            except Exception as e:
                logger.warning(f"Failed to register multi-region table elements with page: {e}")

        final_counts = {
            "table": 1,
            "rows": len(final_rows),
            "columns": len(final_cols),
            "cells": len(final_cells),
        }
        final_regions = {
            "table": multi_page_table,
            "rows": final_rows,
            "columns": final_cols,
            "cells": final_cells,
        }

        logger.info(
            f"Created 1 multi-page table, {final_counts['rows']} logical rows, "
            f"{final_counts['columns']} logical columns from guides and registered with all constituent pages"
        )

        return {"counts": final_counts, "regions": final_regions}

    def _build_grid_single_page(
        self,
        target: Optional[Union["Page", "Region"]] = None,
        source: str = "guides",
        cell_padding: float = 0.5,
        include_outer_boundaries: bool = False,
    ) -> Dict[str, Any]:
        """
        Private method to create table structure on a single page or region.
        (Refactored from the original public build_grid method).
        """
        # This method now only handles a single page/region context.
        # Looping for FlowRegions is handled by the public `build_grid` method.

        # Original single-region logic follows...
        target_obj = target or self.context
        if not target_obj:
            raise ValueError("No target object available. Provide target parameter or context.")

        # Get the page for creating regions
        if hasattr(target_obj, "x0") and hasattr(
            target_obj, "top"
        ):  # Region (has bbox coordinates)
            page = target_obj._page
            origin_x, origin_y = target_obj.x0, target_obj.top
            context_width, context_height = target_obj.width, target_obj.height
        elif hasattr(target_obj, "_element_mgr") or hasattr(target_obj, "width"):  # Page
            page = target_obj
            origin_x, origin_y = 0.0, 0.0
            context_width, context_height = page.width, page.height
        else:
            raise ValueError(f"Target object {target_obj} is not a Page or Region")

        element_manager = page._element_mgr

        # Setup boundaries
        row_boundaries = list(self.horizontal)
        col_boundaries = list(self.vertical)

        # Add outer boundaries if requested and missing
        if include_outer_boundaries:
            if not row_boundaries or row_boundaries[0] > origin_y:
                row_boundaries.insert(0, origin_y)
            if not row_boundaries or row_boundaries[-1] < origin_y + context_height:
                row_boundaries.append(origin_y + context_height)

            if not col_boundaries or col_boundaries[0] > origin_x:
                col_boundaries.insert(0, origin_x)
            if not col_boundaries or col_boundaries[-1] < origin_x + context_width:
                col_boundaries.append(origin_x + context_width)

        # Remove duplicates and sort
        row_boundaries = sorted(list(set(row_boundaries)))
        col_boundaries = sorted(list(set(col_boundaries)))

        # ------------------------------------------------------------------
        # Clean-up: remove any previously created grid regions (table, rows,
        # columns, cells) that were generated by the same `source` label and
        # overlap the area we are about to populate.  This prevents the page's
        # `ElementManager` from accumulating stale/duplicate regions when the
        # user rebuilds the grid multiple times.
        # ------------------------------------------------------------------
        try:
            # Bounding box of the grid we are about to create
            if row_boundaries and col_boundaries:
                grid_bbox = (
                    col_boundaries[0],  # x0
                    row_boundaries[0],  # top
                    col_boundaries[-1],  # x1
                    row_boundaries[-1],  # bottom
                )

                def _bbox_overlap(b1, b2):
                    """Return True if two (x0, top, x1, bottom) bboxes overlap."""
                    return not (
                        b1[2] <= b2[0]  # b1 right ≤ b2 left
                        or b1[0] >= b2[2]  # b1 left ≥ b2 right
                        or b1[3] <= b2[1]  # b1 bottom ≤ b2 top
                        or b1[1] >= b2[3]  # b1 top ≥ b2 bottom
                    )

                # Collect existing regions that match the source & region types
                regions_to_remove = [
                    r
                    for r in element_manager.regions
                    if getattr(r, "source", None) == source
                    and getattr(r, "region_type", None)
                    in {"table", "table_row", "table_column", "table_cell"}
                    and hasattr(r, "bbox")
                    and _bbox_overlap(r.bbox, grid_bbox)
                ]

                for r in regions_to_remove:
                    element_manager.remove_element(r, element_type="regions")

                if regions_to_remove:
                    logger.debug(
                        f"Removed {len(regions_to_remove)} existing grid region(s) prior to rebuild"
                    )
        except Exception as cleanup_err:  # pragma: no cover – cleanup must never crash
            logger.warning(f"Grid cleanup failed: {cleanup_err}")

        logger.debug(
            f"Building grid with {len(row_boundaries)} row and {len(col_boundaries)} col boundaries"
        )

        # Track creation counts and regions
        counts = {"table": 0, "rows": 0, "columns": 0, "cells": 0}
        created_regions = {"table": None, "rows": [], "columns": [], "cells": []}

        # Create overall table region
        if len(row_boundaries) >= 2 and len(col_boundaries) >= 2:
            table_region = page.create_region(
                col_boundaries[0], row_boundaries[0], col_boundaries[-1], row_boundaries[-1]
            )
            table_region.source = source
            table_region.region_type = "table"
            table_region.normalized_type = "table"
            table_region.metadata.update(
                {
                    "source_guides": True,
                    "num_rows": len(row_boundaries) - 1,
                    "num_cols": len(col_boundaries) - 1,
                    "boundaries": {"rows": row_boundaries, "cols": col_boundaries},
                }
            )
            element_manager.add_element(table_region, element_type="regions")
            counts["table"] = 1
            created_regions["table"] = table_region

        # Create row regions
        if len(row_boundaries) >= 2 and len(col_boundaries) >= 2:
            for i in range(len(row_boundaries) - 1):
                row_region = page.create_region(
                    col_boundaries[0], row_boundaries[i], col_boundaries[-1], row_boundaries[i + 1]
                )
                row_region.source = source
                row_region.region_type = "table_row"
                row_region.normalized_type = "table_row"
                row_region.metadata.update({"row_index": i, "source_guides": True})
                element_manager.add_element(row_region, element_type="regions")
                counts["rows"] += 1
                created_regions["rows"].append(row_region)

        # Create column regions
        if len(col_boundaries) >= 2 and len(row_boundaries) >= 2:
            for j in range(len(col_boundaries) - 1):
                col_region = page.create_region(
                    col_boundaries[j], row_boundaries[0], col_boundaries[j + 1], row_boundaries[-1]
                )
                col_region.source = source
                col_region.region_type = "table_column"
                col_region.normalized_type = "table_column"
                col_region.metadata.update({"col_index": j, "source_guides": True})
                element_manager.add_element(col_region, element_type="regions")
                counts["columns"] += 1
                created_regions["columns"].append(col_region)

        # Create cell regions
        if len(row_boundaries) >= 2 and len(col_boundaries) >= 2:
            for i in range(len(row_boundaries) - 1):
                for j in range(len(col_boundaries) - 1):
                    # Apply padding
                    cell_x0 = col_boundaries[j] + cell_padding
                    cell_top = row_boundaries[i] + cell_padding
                    cell_x1 = col_boundaries[j + 1] - cell_padding
                    cell_bottom = row_boundaries[i + 1] - cell_padding

                    # Skip invalid cells
                    if cell_x1 <= cell_x0 or cell_bottom <= cell_top:
                        continue

                    cell_region = page.create_region(cell_x0, cell_top, cell_x1, cell_bottom)
                    cell_region.source = source
                    cell_region.region_type = "table_cell"
                    cell_region.normalized_type = "table_cell"
                    cell_region.metadata.update(
                        {
                            "row_index": i,
                            "col_index": j,
                            "source_guides": True,
                            "original_boundaries": {
                                "left": col_boundaries[j],
                                "top": row_boundaries[i],
                                "right": col_boundaries[j + 1],
                                "bottom": row_boundaries[i + 1],
                            },
                        }
                    )
                    element_manager.add_element(cell_region, element_type="regions")
                    counts["cells"] += 1
                    created_regions["cells"].append(cell_region)

        logger.info(
            f"Created {counts['table']} table, {counts['rows']} rows, "
            f"{counts['columns']} columns, and {counts['cells']} cells from guides"
        )

        return {"counts": counts, "regions": created_regions}

    def __repr__(self) -> str:
        """String representation of the guides."""
        return (
            f"Guides(verticals={len(self.vertical)}, "
            f"horizontals={len(self.horizontal)}, "
            f"cells={len(self.get_cells())})"
        )

    def _get_text_elements(self):
        """Get text elements from the context."""
        if not self.context:
            return []

        # Handle FlowRegion context
        if self.is_flow_region:
            all_text_elements = []
            for region in self.context.constituent_regions:
                if hasattr(region, "find_all"):
                    try:
                        text_elements = region.find_all("text", apply_exclusions=False)
                        elements = (
                            text_elements.elements
                            if hasattr(text_elements, "elements")
                            else text_elements
                        )
                        all_text_elements.extend(elements)
                    except Exception as e:
                        logger.warning(f"Error getting text elements from region: {e}")
            return all_text_elements

        # Original single-region logic
        # Get text elements from the context
        if hasattr(self.context, "find_all"):
            try:
                text_elements = self.context.find_all("text", apply_exclusions=False)
                return (
                    text_elements.elements if hasattr(text_elements, "elements") else text_elements
                )
            except Exception as e:
                logger.warning(f"Error getting text elements: {e}")
                return []
        else:
            logger.warning("Context does not support text element search")
            return []

    def _spans_pages(self) -> bool:
        """Check if any guides are defined across multiple pages in a FlowRegion."""
        if not self.is_flow_region:
            return False

        # Check vertical guides
        v_guide_pages = {}
        for coord, region in self._unified_vertical:
            v_guide_pages.setdefault(coord, set()).add(region.page.page_number)

        for pages in v_guide_pages.values():
            if len(pages) > 1:
                return True

        # Check horizontal guides
        h_guide_pages = {}
        for coord, region in self._unified_horizontal:
            h_guide_pages.setdefault(coord, set()).add(region.page.page_number)

        for pages in h_guide_pages.values():
            if len(pages) > 1:
                return True

        return False

    # -------------------------------------------------------------------------
    # Instance methods for fluent chaining (avoid name conflicts with class methods)
    # -------------------------------------------------------------------------

    def add_content(
        self,
        axis: Literal["vertical", "horizontal"] = "vertical",
        markers: Union[str, List[str], "ElementCollection", None] = None,
        obj: Optional[Union["Page", "Region"]] = None,
        align: Literal["left", "right", "center", "between"] = "left",
        outer: Union[str, bool] = True,
        tolerance: float = 5,
        apply_exclusions: bool = True,
    ) -> "Guides":
        """
        Instance method: Add guides from content, allowing chaining.
        This allows: Guides.new(page).add_content(axis='vertical', markers=[...])

        Args:
            axis: Which axis to create guides for
            markers: Content to search for. Can be:
                - str: single selector or literal text
                - List[str]: list of selectors or literal text strings
                - ElementCollection: collection of elements to extract text from
                - None: no markers
            obj: Page or Region to search (uses self.context if None)
            align: How to align guides relative to found elements
            outer: Whether to add outer boundary guides. Can be:
                - bool: True/False to add/not add both
                - "first": To add boundary before the first element
                - "last": To add boundary before the last element
            tolerance: Tolerance for snapping to element edges
            apply_exclusions: Whether to apply exclusion zones when searching for text

        Returns:
            Self for method chaining
        """
        # Use provided object or fall back to stored context
        target_obj = obj or self.context
        if target_obj is None:
            raise ValueError("No object provided and no context available")

        # Create new guides using the class method
        new_guides = Guides.from_content(
            obj=target_obj,
            axis=axis,
            markers=markers,
            align=align,
            outer=outer,
            tolerance=tolerance,
            apply_exclusions=apply_exclusions,
        )

        # Add the appropriate coordinates to this object
        if axis == "vertical":
            self.vertical = list(set(self.vertical + new_guides.vertical))
        else:
            self.horizontal = list(set(self.horizontal + new_guides.horizontal))

        return self

    def add_lines(
        self,
        axis: Literal["vertical", "horizontal", "both"] = "both",
        obj: Optional[Union["Page", "Region"]] = None,
        threshold: Union[float, str] = "auto",
        source_label: Optional[str] = None,
        max_lines_h: Optional[int] = None,
        max_lines_v: Optional[int] = None,
        outer: bool = False,
        detection_method: str = "vector",
        resolution: int = 192,
        **detect_kwargs,
    ) -> "Guides":
        """
        Instance method: Add guides from lines, allowing chaining.
        This allows: Guides.new(page).add_lines(axis='horizontal')

        Args:
            axis: Which axis to detect lines for
            obj: Page or Region to search (uses self.context if None)
            threshold: Line detection threshold ('auto' or float 0.0-1.0)
            source_label: Filter lines by source label (vector) or label for detected lines (pixels)
            max_lines_h: Maximum horizontal lines to use
            max_lines_v: Maximum vertical lines to use
            outer: Whether to add outer boundary guides
            detection_method: 'vector' (use existing LineElements) or 'pixels' (detect from image)
            resolution: DPI for pixel-based detection (default: 192)
            **detect_kwargs: Additional parameters for pixel detection (see from_lines)

        Returns:
            Self for method chaining
        """
        # Use provided object or fall back to stored context
        target_obj = obj or self.context
        if target_obj is None:
            raise ValueError("No object provided and no context available")

        # Create new guides using the class method
        new_guides = Guides.from_lines(
            obj=target_obj,
            axis=axis,
            threshold=threshold,
            source_label=source_label,
            max_lines_h=max_lines_h,
            max_lines_v=max_lines_v,
            outer=outer,
            detection_method=detection_method,
            resolution=resolution,
            **detect_kwargs,
        )

        # Add the appropriate coordinates to this object
        if axis in ("vertical", "both"):
            self.vertical = list(set(self.vertical + new_guides.vertical))
        if axis in ("horizontal", "both"):
            self.horizontal = list(set(self.horizontal + new_guides.horizontal))

        return self

    def add_whitespace(
        self,
        axis: Literal["vertical", "horizontal", "both"] = "both",
        obj: Optional[Union["Page", "Region"]] = None,
        min_gap: float = 10,
    ) -> "Guides":
        """
        Instance method: Add guides from whitespace, allowing chaining.
        This allows: Guides.new(page).add_whitespace(axis='both')

        Args:
            axis: Which axis to create guides for
            obj: Page or Region to search (uses self.context if None)
            min_gap: Minimum gap size to consider

        Returns:
            Self for method chaining
        """
        # Use provided object or fall back to stored context
        target_obj = obj or self.context
        if target_obj is None:
            raise ValueError("No object provided and no context available")

        # Create new guides using the class method
        new_guides = Guides.from_whitespace(obj=target_obj, axis=axis, min_gap=min_gap)

        # Add the appropriate coordinates to this object
        if axis in ("vertical", "both"):
            self.vertical = list(set(self.vertical + new_guides.vertical))
        if axis in ("horizontal", "both"):
            self.horizontal = list(set(self.horizontal + new_guides.horizontal))

        return self

    def extract_table(
        self,
        target: Optional[
            Union[
                "Page",
                "Region",
                "PageCollection",
                "ElementCollection",
                List[Union["Page", "Region"]],
            ]
        ] = None,
        source: str = "guides_temp",
        cell_padding: float = 0.5,
        include_outer_boundaries: bool = False,
        method: Optional[str] = None,
        table_settings: Optional[dict] = None,
        use_ocr: bool = False,
        ocr_config: Optional[dict] = None,
        text_options: Optional[Dict] = None,
        cell_extraction_func: Optional[Callable[["Region"], Optional[str]]] = None,
        show_progress: bool = False,
        content_filter: Optional[Union[str, Callable[[str], bool], List[str]]] = None,
        apply_exclusions: bool = True,
        *,
        multi_page: Literal["auto", True, False] = "auto",
        header: Union[str, List[str], None] = "first",
        skip_repeating_headers: Optional[bool] = None,
    ) -> "TableResult":
        """
        Extract table data directly from guides without leaving temporary regions.

        This method:
        1. Creates table structure using build_grid()
        2. Extracts table data from the created table region
        3. Cleans up all temporary regions
        4. Returns the TableResult

        When passed a collection (PageCollection, ElementCollection, or list), this method
        will extract tables from each element and combine them into a single result.

        Args:
            target: Page, Region, or collection of Pages/Regions to extract from (uses self.context if None)
            source: Source label for temporary regions (will be cleaned up)
            cell_padding: Internal padding for cell regions in points
            include_outer_boundaries: Whether to add boundaries at edges if missing
            method: Table extraction method ('tatr', 'pdfplumber', 'text', etc.)
            table_settings: Settings for pdfplumber table extraction
            use_ocr: Whether to use OCR for text extraction
            ocr_config: OCR configuration parameters
            text_options: Dictionary of options for the 'text' method
            cell_extraction_func: Optional callable for custom cell text extraction
            show_progress: Controls progress bar for text method
            content_filter: Content filtering function or patterns
            apply_exclusions: Whether to apply exclusion regions during text extraction (default: True)
            multi_page: Controls multi-region table creation for FlowRegions
            header: How to handle headers when extracting from collections:
                - "first": Use first row of first element as headers (default)
                - "all": Expect headers on each element, use from first element
                - None: No headers, use numeric indices
                - List[str]: Custom column names
            skip_repeating_headers: Whether to remove duplicate header rows when extracting from collections.
                Defaults to True when header is "first" or "all", False otherwise.

        Returns:
            TableResult: Extracted table data

        Raises:
            ValueError: If no table region is created from the guides

        Example:
            ```python
            from natural_pdf.analyzers import Guides

            # Single page extraction
            guides = Guides.from_lines(page, source_label="detected")
            table_data = guides.extract_table()
            df = table_data.to_df()

            # Multiple page extraction
            guides = Guides(pages[0])
            guides.vertical.from_content(['Column 1', 'Column 2'])
            table_result = guides.extract_table(pages, header=['Col1', 'Col2'])
            df = table_result.to_df()

            # Region collection extraction
            regions = pdf.find_all('region[type=table]')
            guides = Guides(regions[0])
            guides.vertical.from_lines(n=3)
            table_result = guides.extract_table(regions)
            ```
        """
        from natural_pdf.core.page_collection import PageCollection
        from natural_pdf.elements.element_collection import ElementCollection

        target_obj = target if target is not None else self.context
        if target_obj is None:
            raise ValueError("No target object available. Provide target parameter or context.")

        # Check if target is a collection - if so, delegate to _extract_table_from_collection
        if isinstance(target_obj, (PageCollection, ElementCollection, list)):
            # For collections, pass through most parameters as-is
            return self._extract_table_from_collection(
                elements=target_obj,
                header=header,
                skip_repeating_headers=skip_repeating_headers,
                method=method,
                table_settings=table_settings,
                use_ocr=use_ocr,
                ocr_config=ocr_config,
                text_options=text_options,
                cell_extraction_func=cell_extraction_func,
                show_progress=show_progress,
                content_filter=content_filter,
                apply_exclusions=apply_exclusions,
            )

        # Get the page for cleanup later
        if hasattr(target_obj, "x0") and hasattr(target_obj, "top"):  # Region
            page = target_obj._page
            element_manager = page._element_mgr
        elif hasattr(target_obj, "_element_mgr"):  # Page
            page = target_obj
            element_manager = page._element_mgr
        else:
            raise ValueError(f"Target object {target_obj} is not a Page or Region")

        # Check if we have guides in only one dimension
        has_verticals = len(self.vertical) > 0
        has_horizontals = len(self.horizontal) > 0

        # If we have guides in only one dimension, use direct extraction with explicit lines
        if (has_verticals and not has_horizontals) or (has_horizontals and not has_verticals):
            logger.debug(
                f"Partial guides detected - using direct extraction (v={has_verticals}, h={has_horizontals})"
            )

            # Extract directly from the target using explicit lines
            if hasattr(target_obj, "extract_table"):
                return target_obj.extract_table(
                    method=method,  # Let auto-detection work when None
                    table_settings=table_settings,
                    use_ocr=use_ocr,
                    ocr_config=ocr_config,
                    text_options=text_options,
                    cell_extraction_func=cell_extraction_func,
                    show_progress=show_progress,
                    content_filter=content_filter,
                    verticals=list(self.vertical) if has_verticals else None,
                    horizontals=list(self.horizontal) if has_horizontals else None,
                )
            else:
                raise ValueError(f"Target object {type(target_obj)} does not support extract_table")

        # Both dimensions have guides - use normal grid-based extraction
        try:
            # Step 1: Build grid structure (creates temporary regions)
            grid_result = self.build_grid(
                target=target_obj,
                source=source,
                cell_padding=cell_padding,
                include_outer_boundaries=include_outer_boundaries,
                multi_page=multi_page,
            )

            # Step 2: Get the table region and extract table data
            table_region = grid_result["regions"]["table"]
            if table_region is None:
                raise ValueError(
                    "No table region was created from the guides. Check that you have both vertical and horizontal guides."
                )

            # Handle multi-page case where table_region might be a list
            if isinstance(table_region, list):
                if not table_region:
                    raise ValueError("No table regions were created from the guides.")
                # Use the first table region for extraction
                table_region = table_region[0]

            # Step 3: Extract table data using the region's extract_table method
            table_result = table_region.extract_table(
                method=method,
                table_settings=table_settings,
                use_ocr=use_ocr,
                ocr_config=ocr_config,
                text_options=text_options,
                cell_extraction_func=cell_extraction_func,
                show_progress=show_progress,
                content_filter=content_filter,
                apply_exclusions=apply_exclusions,
            )

            return table_result

        finally:
            # Step 4: Clean up all temporary regions created by build_grid
            # This ensures no regions are left behind regardless of success/failure
            try:
                regions_to_remove = [
                    r
                    for r in element_manager.regions
                    if getattr(r, "source", None) == source
                    and getattr(r, "region_type", None)
                    in {"table", "table_row", "table_column", "table_cell"}
                ]

                for region in regions_to_remove:
                    element_manager.remove_element(region, element_type="regions")

                if regions_to_remove:
                    logger.debug(f"Cleaned up {len(regions_to_remove)} temporary regions")

            except Exception as cleanup_err:
                logger.warning(f"Failed to clean up temporary regions: {cleanup_err}")

    def _extract_table_from_collection(
        self,
        elements: Union["PageCollection", "ElementCollection", List[Union["Page", "Region"]]],
        header: Union[str, List[str], None] = "first",
        skip_repeating_headers: Optional[bool] = None,
        method: Optional[str] = None,
        table_settings: Optional[dict] = None,
        use_ocr: bool = False,
        ocr_config: Optional[dict] = None,
        text_options: Optional[Dict] = None,
        cell_extraction_func: Optional[Callable[["Region"], Optional[str]]] = None,
        show_progress: bool = True,
        content_filter: Optional[Union[str, Callable[[str], bool], List[str]]] = None,
        apply_exclusions: bool = True,
    ) -> "TableResult":
        """
        Extract tables from multiple pages or regions using this guide pattern.

        This method applies the guide to each element, extracts tables, and combines
        them into a single TableResult. Dynamic guides (using lambdas) are evaluated
        for each element.

        Args:
            elements: PageCollection, ElementCollection, or list of Pages/Regions to extract from
            header: How to handle headers:
                - "first": Use first row of first element as headers (default)
                - "all": Expect headers on each element, use from first element
                - None: No headers, use numeric indices
                - List[str]: Custom column names
            skip_repeating_headers: Whether to remove duplicate header rows.
                Defaults to True when header is "first" or "all", False otherwise.
            method: Table extraction method (passed to extract_table)
            table_settings: Settings for pdfplumber table extraction
            use_ocr: Whether to use OCR for text extraction
            ocr_config: OCR configuration parameters
            text_options: Dictionary of options for the 'text' method
            cell_extraction_func: Optional callable for custom cell text extraction
            show_progress: Show progress bar for multi-element extraction (default: True)
            content_filter: Content filtering function or patterns
            apply_exclusions: Whether to apply exclusion regions during extraction

        Returns:
            TableResult: Combined table data from all elements

        Example:
            ```python
            # Create guide with static vertical, dynamic horizontal
            guide = Guides(regions[0])
            guide.vertical.from_content(columns, outer="last")
            guide.horizontal.from_content(lambda r: r.find_all('text:starts-with(NF-)'))

            # Extract from all regions
            table_result = guide._extract_table_from_collection(regions, header=columns)
            df = table_result.to_df()
            ```
        """
        from natural_pdf.core.page_collection import PageCollection
        from natural_pdf.elements.element_collection import ElementCollection
        from natural_pdf.tables.result import TableResult

        # Convert to list if it's a collection
        if isinstance(elements, (PageCollection, ElementCollection)):
            element_list = list(elements)
        else:
            element_list = elements

        if not element_list:
            return TableResult([])

        # Determine header handling
        if skip_repeating_headers is None:
            skip_repeating_headers = header in ["first", "all"] or isinstance(header, list)

        all_rows = []
        header_row = None

        # Configure progress bar
        iterator = element_list
        if show_progress and len(element_list) > 1:
            try:
                from tqdm.auto import tqdm

                iterator = tqdm(
                    element_list, desc="Extracting tables from elements", unit="element"
                )
            except ImportError:
                pass

        for i, element in enumerate(iterator):
            # Create a new Guides object for this element
            element_guide = Guides(element)

            # Copy vertical guides (usually static)
            if hasattr(self.vertical, "_callable") and self.vertical._callable is not None:
                # If vertical is dynamic (lambda), evaluate it
                element_guide.vertical.from_content(self.vertical._callable(element))
            else:
                # Copy static vertical positions
                element_guide.vertical.data = self.vertical.data.copy()

            # Handle horizontal guides
            if hasattr(self.horizontal, "_callable") and self.horizontal._callable is not None:
                # If horizontal is dynamic (lambda), evaluate it
                element_guide.horizontal.from_content(self.horizontal._callable(element))
            else:
                # Copy static horizontal positions
                element_guide.horizontal.data = self.horizontal.data.copy()

            # Extract table from this element
            table_result = element_guide.extract_table(
                method=method,
                table_settings=table_settings,
                use_ocr=use_ocr,
                ocr_config=ocr_config,
                text_options=text_options,
                cell_extraction_func=cell_extraction_func,
                show_progress=False,  # Don't show nested progress
                content_filter=content_filter,
                apply_exclusions=apply_exclusions,
            )

            # Convert to list of rows
            rows = list(table_result)

            # Handle headers based on strategy
            if i == 0:  # First element
                if header == "first" or header == "all":
                    # Use first row as header
                    if rows:
                        header_row = rows[0]
                        rows = rows[1:]  # Remove header from data
                elif isinstance(header, list):
                    # Custom headers provided
                    header_row = header
            else:  # Subsequent elements
                if header == "all" and skip_repeating_headers and rows:
                    # Expect and remove header row
                    if rows and header_row and rows[0] == header_row:
                        rows = rows[1:]
                    elif rows:
                        # Still remove first row if it looks like a header
                        rows = rows[1:]

            # Add rows to combined result
            all_rows.extend(rows)

        # Create final TableResult
        if isinstance(header, list):
            # Custom headers - prepend to data
            final_result = TableResult(all_rows)
        elif header_row is not None:
            # Prepend discovered header
            final_result = TableResult([header_row] + all_rows)
        else:
            # No headers
            final_result = TableResult(all_rows)

        return final_result

    def _get_flow_orientation(self) -> Literal["vertical", "horizontal", "unknown"]:
        """Determines if a FlowRegion's constituent parts are arranged vertically or horizontally."""
        if not self.is_flow_region or len(self.context.constituent_regions) < 2:
            return "unknown"

        r1 = self.context.constituent_regions[0]
        r2 = self.context.constituent_regions[1]  # Compare first two regions

        if not r1.bbox or not r2.bbox:
            return "unknown"

        # Calculate non-overlapping distances.
        # This determines the primary direction of separation.
        x_dist = max(0, max(r1.x0, r2.x0) - min(r1.x1, r2.x1))
        y_dist = max(0, max(r1.top, r2.top) - min(r1.bottom, r2.bottom))

        if y_dist > x_dist:
            return "vertical"
        else:
            return "horizontal"
Attributes
natural_pdf.Guides.cells property

Access cells by index like guides.cells[row][col] or guides.cells[row, col].

natural_pdf.Guides.columns property

Access columns by index like guides.columns[0].

natural_pdf.Guides.horizontal property writable

Get horizontal guide coordinates.

natural_pdf.Guides.n_cols property

Number of columns defined by vertical guides.

natural_pdf.Guides.n_rows property

Number of rows defined by horizontal guides.

natural_pdf.Guides.rows property

Access rows by index like guides.rows[0].

natural_pdf.Guides.vertical property writable

Get vertical guide coordinates.
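
A short sketch of the property accessors above (`page` is any natural-pdf Page; the coordinates are illustrative):

```python
from natural_pdf.analyzers import Guides

guides = Guides(verticals=[0, 100, 200], horizontals=[0, 50, 100], context=page)

guides.n_cols       # 2 -- one fewer than the number of vertical guides
guides.n_rows       # 2
guides.vertical     # sorted x-coordinates: [0.0, 100.0, 200.0]
guides.rows[0]      # first row region
guides.cells[1][0]  # cell at row 1, col 0 (guides.cells[1, 0] also works)
```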

Functions
natural_pdf.Guides.__add__(other)

Combine two guide sets.

Returns:

| Type | Description |
|------|-------------|
| `Guides` | New Guides object with combined coordinates |

Source code in natural_pdf/analyzers/guides.py
def __add__(self, other: "Guides") -> "Guides":
    """
    Combine two guide sets.

    Returns:
        New Guides object with combined coordinates
    """
    # Combine and deduplicate coordinates, ensuring Python floats
    combined_verticals = sorted([float(x) for x in set(self.vertical + other.vertical)])
    combined_horizontals = sorted([float(y) for y in set(self.horizontal + other.horizontal)])

    # Handle FlowRegion context merging
    new_context = self.context or other.context

    # If both are flow regions, we might need a more complex merge,
    # but for now, just picking one context is sufficient.

    # Create the new Guides object
    new_guides = Guides(
        verticals=combined_verticals,
        horizontals=combined_horizontals,
        context=new_context,
        bounds=self.bounds or other.bounds,
    )

    # If the new context is a FlowRegion, we need to rebuild the flow-related state
    if new_guides.is_flow_region:
        # Re-initialize flow guides from both sources
        # This is a simplification; a true merge would be more complex.
        # For now, we combine the flow_guides dictionaries.
        if hasattr(self, "_flow_guides"):
            new_guides._flow_guides.update(self._flow_guides)
        if hasattr(other, "_flow_guides"):
            new_guides._flow_guides.update(other._flow_guides)

        # Re-initialize unified views
        if hasattr(self, "_unified_vertical"):
            new_guides._unified_vertical.extend(self._unified_vertical)
        if hasattr(other, "_unified_vertical"):
            new_guides._unified_vertical.extend(other._unified_vertical)

        if hasattr(self, "_unified_horizontal"):
            new_guides._unified_horizontal.extend(self._unified_horizontal)
        if hasattr(other, "_unified_horizontal"):
            new_guides._unified_horizontal.extend(other._unified_horizontal)

        # Invalidate caches to force rebuild
        new_guides._vertical_cache = None
        new_guides._horizontal_cache = None

    return new_guides
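Because `__add__` merges and deduplicates coordinates, guides built by different strategies can be unioned with `+`. A minimal sketch (the marker strings are placeholders):

```python
v = Guides.from_content(obj=page, axis="vertical", markers=["Name", "Amount"])
h = Guides.from_lines(page, axis="horizontal")

# Union of both coordinate sets; the context comes from the left operand when set
combined = v + h
```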
natural_pdf.Guides.__init__(verticals=None, horizontals=None, context=None, bounds=None, relative=False, snap_behavior='warn')

Initialize a Guides object.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `verticals` | `Optional[Union[List[float], Page, Region, FlowRegion]]` | List of x-coordinates for vertical guides, or a Page/Region/FlowRegion as context | `None` |
| `horizontals` | `Optional[List[float]]` | List of y-coordinates for horizontal guides | `None` |
| `context` | `Optional[Union[Page, Region, FlowRegion]]` | Page, Region, or FlowRegion object these guides were created from | `None` |
| `bounds` | `Optional[Tuple[float, float, float, float]]` | Bounding box (x0, top, x1, bottom) if context not provided | `None` |
| `relative` | `bool` | Whether coordinates are relative (0-1) or absolute | `False` |
| `snap_behavior` | `Literal['raise', 'warn', 'ignore']` | How to handle snapping conflicts ('raise', 'warn', or 'ignore') | `'warn'` |
Source code in natural_pdf/analyzers/guides.py
def __init__(
    self,
    verticals: Optional[Union[List[float], "Page", "Region", "FlowRegion"]] = None,
    horizontals: Optional[List[float]] = None,
    context: Optional[Union["Page", "Region", "FlowRegion"]] = None,
    bounds: Optional[Tuple[float, float, float, float]] = None,
    relative: bool = False,
    snap_behavior: Literal["raise", "warn", "ignore"] = "warn",
):
    """
    Initialize a Guides object.

    Args:
        verticals: List of x-coordinates for vertical guides, or a Page/Region/FlowRegion as context
        horizontals: List of y-coordinates for horizontal guides
        context: Page, Region, or FlowRegion object these guides were created from
        bounds: Bounding box (x0, top, x1, bottom) if context not provided
        relative: Whether coordinates are relative (0-1) or absolute
        snap_behavior: How to handle snapping conflicts ('raise', 'warn', or 'ignore')
    """
    # Handle Guides(page) or Guides(flow_region) shorthand
    if (
        verticals is not None
        and not isinstance(verticals, (list, tuple))
        and horizontals is None
        and context is None
    ):
        # First argument is a page/region/flow_region, not coordinates
        context = verticals
        verticals = None

    self.context = context
    self.bounds = bounds
    self.relative = relative
    self.snap_behavior = snap_behavior

    # Check if we're dealing with a FlowRegion
    self.is_flow_region = hasattr(context, "constituent_regions")

    # If FlowRegion, we'll store guides per constituent region
    if self.is_flow_region:
        self._flow_guides: Dict["Region", Tuple[List[float], List[float]]] = {}
        # For unified view across all regions
        self._unified_vertical: List[Tuple[float, "Region"]] = []
        self._unified_horizontal: List[Tuple[float, "Region"]] = []
        # Cache for sorted unique coordinates
        self._vertical_cache: Optional[List[float]] = None
        self._horizontal_cache: Optional[List[float]] = None

    # Initialize with GuidesList instances
    self._vertical = GuidesList(self, "vertical", sorted([float(x) for x in (verticals or [])]))
    self._horizontal = GuidesList(
        self, "horizontal", sorted([float(y) for y in (horizontals or [])])
    )

    # Determine bounds from context if needed
    if self.bounds is None and self.context is not None:
        if hasattr(self.context, "bbox"):
            self.bounds = self.context.bbox
        elif hasattr(self.context, "x0"):
            self.bounds = (
                self.context.x0,
                self.context.top,
                self.context.x1,
                self.context.bottom,
            )

    # Convert relative to absolute if needed
    if self.relative and self.bounds:
        x0, top, x1, bottom = self.bounds
        width = x1 - x0
        height = bottom - top

        self._vertical.data = [x0 + v * width for v in self._vertical]
        self._horizontal.data = [top + h * height for h in self._horizontal]
        self.relative = False
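Two construction patterns follow directly from the signature above; note that `relative=True` coordinates are converted to absolute positions against the context's bounds at construction time:

```python
# Absolute coordinates on a page
guides = Guides(verticals=[72, 200, 340], horizontals=[100, 400], context=page)

# Relative (0-1) coordinates, resolved against the page bounds immediately
thirds = Guides(verticals=[0.0, 1 / 3, 2 / 3, 1.0], context=page, relative=True)

# Shorthand: the first positional argument may be the context itself
empty = Guides(page)
```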
natural_pdf.Guides.__repr__()

String representation of the guides.

Source code in natural_pdf/analyzers/guides.py
def __repr__(self) -> str:
    """String representation of the guides."""
    return (
        f"Guides(verticals={len(self.vertical)}, "
        f"horizontals={len(self.horizontal)}, "
        f"cells={len(self.get_cells())})"
    )
natural_pdf.Guides.above(guide_index, obj=None)

Get a region above a horizontal guide.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `guide_index` | `int` | Horizontal guide index | *required* |
| `obj` | `Optional[Union[Page, Region]]` | Page or Region to create the region on (uses self.context if None) | `None` |

Returns:

| Type | Description |
|------|-------------|
| `Region` | Region above the specified guide |

Source code in natural_pdf/analyzers/guides.py
def above(self, guide_index: int, obj: Optional[Union["Page", "Region"]] = None) -> "Region":
    """
    Get a region above a horizontal guide.

    Args:
        guide_index: Horizontal guide index
        obj: Page or Region to create the region on (uses self.context if None)

    Returns:
        Region above the specified guide
    """
    target = obj or self.context
    if target is None:
        raise ValueError("No context available for region creation")

    if not self.horizontal or guide_index < 0 or guide_index >= len(self.horizontal):
        raise IndexError(f"Guide index {guide_index} out of range")

    # Get bounds from context
    bounds = self._get_context_bounds()
    if not bounds:
        raise ValueError("Could not determine bounds")
    x0, y0, x1, _ = bounds

    # Create region from top edge to guide
    y1 = self.horizontal[guide_index]

    if hasattr(target, "region"):
        return target.region(x0, y0, x1, y1)
    else:
        raise TypeError(f"Cannot create region on {type(target)}")
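For example, slicing off everything above the first horizontal guide (a sketch; `page` is a placeholder):

```python
guides = Guides.new(page).add_lines(axis="horizontal")

header_area = guides.above(0)  # region from the top edge down to guide 0
body_area = guides.below(0)    # complementary region below the same guide
```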
natural_pdf.Guides.add_content(axis='vertical', markers=None, obj=None, align='left', outer=True, tolerance=5, apply_exclusions=True)

Instance method: Add guides from content, allowing chaining. This allows: Guides.new(page).add_content(axis='vertical', markers=[...])

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `axis` | `Literal['vertical', 'horizontal']` | Which axis to create guides for | `'vertical'` |
| `markers` | `Union[str, List[str], ElementCollection, None]` | Content to search for: a single selector or literal text (`str`), a list of selectors or literal text strings (`List[str]`), an `ElementCollection` of elements to extract text from, or `None` for no markers | `None` |
| `obj` | `Optional[Union[Page, Region]]` | Page or Region to search (uses self.context if None) | `None` |
| `align` | `Literal['left', 'right', 'center', 'between']` | How to align guides relative to found elements | `'left'` |
| `outer` | `Union[str, bool]` | Whether to add outer boundary guides: `True`/`False` to add/not add both, `"first"` to add a boundary before the first element, `"last"` to add a boundary after the last element | `True` |
| `tolerance` | `float` | Tolerance for snapping to element edges | `5` |
| `apply_exclusions` | `bool` | Whether to apply exclusion zones when searching for text | `True` |

Returns:

| Type | Description |
|------|-------------|
| `Guides` | Self for method chaining |

Source code in natural_pdf/analyzers/guides.py
def add_content(
    self,
    axis: Literal["vertical", "horizontal"] = "vertical",
    markers: Union[str, List[str], "ElementCollection", None] = None,
    obj: Optional[Union["Page", "Region"]] = None,
    align: Literal["left", "right", "center", "between"] = "left",
    outer: Union[str, bool] = True,
    tolerance: float = 5,
    apply_exclusions: bool = True,
) -> "Guides":
    """
    Instance method: Add guides from content, allowing chaining.
    This allows: Guides.new(page).add_content(axis='vertical', markers=[...])

    Args:
        axis: Which axis to create guides for
        markers: Content to search for. Can be:
            - str: single selector or literal text
            - List[str]: list of selectors or literal text strings
            - ElementCollection: collection of elements to extract text from
            - None: no markers
        obj: Page or Region to search (uses self.context if None)
        align: How to align guides relative to found elements
        outer: Whether to add outer boundary guides. Can be:
            - bool: True/False to add/not add both
            - "first": To add boundary before the first element
            - "last": To add boundary before the last element
        tolerance: Tolerance for snapping to element edges
        apply_exclusions: Whether to apply exclusion zones when searching for text

    Returns:
        Self for method chaining
    """
    # Use provided object or fall back to stored context
    target_obj = obj or self.context
    if target_obj is None:
        raise ValueError("No object provided and no context available")

    # Create new guides using the class method
    new_guides = Guides.from_content(
        obj=target_obj,
        axis=axis,
        markers=markers,
        align=align,
        outer=outer,
        tolerance=tolerance,
        apply_exclusions=apply_exclusions,
    )

    # Add the appropriate coordinates to this object
    if axis == "vertical":
        self.vertical = list(set(self.vertical + new_guides.vertical))
    else:
        self.horizontal = list(set(self.horizontal + new_guides.horizontal))

    return self
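Because the instance methods return `self`, detection strategies chain naturally. A sketch using hypothetical column headers as markers:

```python
guides = (
    Guides.new(page)
    .add_content(axis="vertical", markers=["Date", "Description", "Total"])
    .add_lines(axis="horizontal")
)
table_result = guides.extract_table()
```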
natural_pdf.Guides.add_horizontal(y)

Add a horizontal guide at the specified y-coordinate.

Source code in natural_pdf/analyzers/guides.py
def add_horizontal(self, y: float) -> "Guides":
    """Add a horizontal guide at the specified y-coordinate."""
    self.horizontal.append(y)
    self.horizontal = sorted(self.horizontal)
    return self
natural_pdf.Guides.add_lines(axis='both', obj=None, threshold='auto', source_label=None, max_lines_h=None, max_lines_v=None, outer=False, detection_method='vector', resolution=192, **detect_kwargs)

Instance method: Add guides from lines, allowing chaining. This allows: Guides.new(page).add_lines(axis='horizontal')

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `axis` | `Literal['vertical', 'horizontal', 'both']` | Which axis to detect lines for | `'both'` |
| `obj` | `Optional[Union[Page, Region]]` | Page or Region to search (uses self.context if None) | `None` |
| `threshold` | `Union[float, str]` | Line detection threshold ('auto' or float 0.0-1.0) | `'auto'` |
| `source_label` | `Optional[str]` | Filter lines by source label (vector) or label for detected lines (pixels) | `None` |
| `max_lines_h` | `Optional[int]` | Maximum horizontal lines to use | `None` |
| `max_lines_v` | `Optional[int]` | Maximum vertical lines to use | `None` |
| `outer` | `bool` | Whether to add outer boundary guides | `False` |
| `detection_method` | `str` | 'vector' (use existing LineElements) or 'pixels' (detect from image) | `'vector'` |
| `resolution` | `int` | DPI for pixel-based detection (default: 192) | `192` |
| `**detect_kwargs` | | Additional parameters for pixel detection (see from_lines) | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `Guides` | Self for method chaining |

Source code in natural_pdf/analyzers/guides.py
def add_lines(
    self,
    axis: Literal["vertical", "horizontal", "both"] = "both",
    obj: Optional[Union["Page", "Region"]] = None,
    threshold: Union[float, str] = "auto",
    source_label: Optional[str] = None,
    max_lines_h: Optional[int] = None,
    max_lines_v: Optional[int] = None,
    outer: bool = False,
    detection_method: str = "vector",
    resolution: int = 192,
    **detect_kwargs,
) -> "Guides":
    """
    Instance method: Add guides from lines, allowing chaining.
    This allows: Guides.new(page).add_lines(axis='horizontal')

    Args:
        axis: Which axis to detect lines for
        obj: Page or Region to search (uses self.context if None)
        threshold: Line detection threshold ('auto' or float 0.0-1.0)
        source_label: Filter lines by source label (vector) or label for detected lines (pixels)
        max_lines_h: Maximum horizontal lines to use
        max_lines_v: Maximum vertical lines to use
        outer: Whether to add outer boundary guides
        detection_method: 'vector' (use existing LineElements) or 'pixels' (detect from image)
        resolution: DPI for pixel-based detection (default: 192)
        **detect_kwargs: Additional parameters for pixel detection (see from_lines)

    Returns:
        Self for method chaining
    """
    # Use provided object or fall back to stored context
    target_obj = obj or self.context
    if target_obj is None:
        raise ValueError("No object provided and no context available")

    # Create new guides using the class method
    new_guides = Guides.from_lines(
        obj=target_obj,
        axis=axis,
        threshold=threshold,
        source_label=source_label,
        max_lines_h=max_lines_h,
        max_lines_v=max_lines_v,
        outer=outer,
        detection_method=detection_method,
        resolution=resolution,
        **detect_kwargs,
    )

    # Add the appropriate coordinates to this object
    if axis in ("vertical", "both"):
        self.vertical = list(set(self.vertical + new_guides.vertical))
    if axis in ("horizontal", "both"):
        self.horizontal = list(set(self.horizontal + new_guides.horizontal))

    return self
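When a scanned PDF carries no vector line elements, the pixel-based detector can be used instead. A sketch (the parameter values are illustrative):

```python
guides = Guides.new(page).add_lines(
    axis="both",
    detection_method="pixels",  # rasterize the page and detect lines from the image
    resolution=192,             # DPI used for rasterization
    threshold="auto",
)
```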
natural_pdf.Guides.add_vertical(x)

Add a vertical guide at the specified x-coordinate.

Source code in natural_pdf/analyzers/guides.py
def add_vertical(self, x: float) -> "Guides":
    """Add a vertical guide at the specified x-coordinate."""
    self.vertical.append(x)
    self.vertical = sorted(self.vertical)
    return self
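Both single-guide helpers keep the coordinate list sorted and return `self`, so manual guides can be appended inline:

```python
guides = Guides.new(page)
guides.add_vertical(72.0).add_horizontal(144.0)  # each call returns the same Guides object
```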
natural_pdf.Guides.add_whitespace(axis='both', obj=None, min_gap=10)

Instance method: Add guides from whitespace, allowing chaining. This allows: Guides.new(page).add_whitespace(axis='both')

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `axis` | `Literal['vertical', 'horizontal', 'both']` | Which axis to create guides for | `'both'` |
| `obj` | `Optional[Union[Page, Region]]` | Page or Region to search (uses self.context if None) | `None` |
| `min_gap` | `float` | Minimum gap size to consider | `10` |

Returns:

| Type | Description |
|------|-------------|
| `Guides` | Self for method chaining |

Source code in natural_pdf/analyzers/guides.py
def add_whitespace(
    self,
    axis: Literal["vertical", "horizontal", "both"] = "both",
    obj: Optional[Union["Page", "Region"]] = None,
    min_gap: float = 10,
) -> "Guides":
    """
    Instance method: Add guides from whitespace, allowing chaining.
    This allows: Guides.new(page).add_whitespace(axis='both')

    Args:
        axis: Which axis to create guides for
        obj: Page or Region to search (uses self.context if None)
        min_gap: Minimum gap size to consider

    Returns:
        Self for method chaining
    """
    # Use provided object or fall back to stored context
    target_obj = obj or self.context
    if target_obj is None:
        raise ValueError("No object provided and no context available")

    # Create new guides using the class method
    new_guides = Guides.from_whitespace(obj=target_obj, axis=axis, min_gap=min_gap)

    # Add the appropriate coordinates to this object
    if axis in ("vertical", "both"):
        self.vertical = list(set(self.vertical + new_guides.vertical))
    if axis in ("horizontal", "both"):
        self.horizontal = list(set(self.horizontal + new_guides.horizontal))

    return self
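For whitespace-driven columns, only the minimum gap usually needs tuning. A sketch (the 15-point gap is illustrative):

```python
# Place vertical guides in whitespace gaps of at least 15 points
guides = Guides.new(page).add_whitespace(axis="vertical", min_gap=15)
```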
natural_pdf.Guides.below(guide_index, obj=None)

Get a region below a horizontal guide.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `guide_index` | `int` | Horizontal guide index | *required* |
| `obj` | `Optional[Union[Page, Region]]` | Page or Region to create the region on (uses self.context if None) | `None` |

Returns:

| Type | Description |
|------|-------------|
| `Region` | Region below the specified guide |

Source code in natural_pdf/analyzers/guides.py
def below(self, guide_index: int, obj: Optional[Union["Page", "Region"]] = None) -> "Region":
    """
    Get a region below a horizontal guide.

    Args:
        guide_index: Horizontal guide index
        obj: Page or Region to create the region on (uses self.context if None)

    Returns:
        Region below the specified guide
    """
    target = obj or self.context
    if target is None:
        raise ValueError("No context available for region creation")

    if not self.horizontal or guide_index < 0 or guide_index >= len(self.horizontal):
        raise IndexError(f"Guide index {guide_index} out of range")

    # Get bounds from context
    bounds = self._get_context_bounds()
    if not bounds:
        raise ValueError("Could not determine bounds")
    x0, _, x1, y1 = bounds

    # Create region from guide to bottom edge
    y0 = self.horizontal[guide_index]

    if hasattr(target, "region"):
        return target.region(x0, y0, x1, y1)
    else:
        raise TypeError(f"Cannot create region on {type(target)}")
natural_pdf.Guides.between_horizontal(start_index, end_index, obj=None)

Get a region between two horizontal guides.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `start_index` | `int` | Starting horizontal guide index | *required* |
| `end_index` | `int` | Ending horizontal guide index | *required* |
| `obj` | `Optional[Union[Page, Region]]` | Page or Region to create the region on (uses self.context if None) | `None` |

Returns:

| Type | Description |
|------|-------------|
| `Region` | Region between the specified guides |

Source code in natural_pdf/analyzers/guides.py
def between_horizontal(
    self, start_index: int, end_index: int, obj: Optional[Union["Page", "Region"]] = None
) -> "Region":
    """
    Get a region between two horizontal guides.

    Args:
        start_index: Starting horizontal guide index
        end_index: Ending horizontal guide index
        obj: Page or Region to create the region on (uses self.context if None)

    Returns:
        Region between the specified guides
    """
    target = obj or self.context
    if target is None:
        raise ValueError("No context available for region creation")

    if not self.horizontal:
        raise ValueError("No horizontal guides available")
    if start_index < 0 or start_index >= len(self.horizontal):
        raise IndexError(f"Start index {start_index} out of range")
    if end_index < 0 or end_index >= len(self.horizontal):
        raise IndexError(f"End index {end_index} out of range")
    if start_index >= end_index:
        raise ValueError("Start index must be less than end index")

    # Get bounds from context
    bounds = self._get_context_bounds()
    if not bounds:
        raise ValueError("Could not determine bounds")
    x0, _, x1, _ = bounds

    # Get vertical boundaries
    y0 = self.horizontal[start_index]
    y1 = self.horizontal[end_index]

    if hasattr(target, "region"):
        return target.region(x0, y0, x1, y1)
    else:
        raise TypeError(f"Cannot create region on {type(target)}")
natural_pdf.Guides.between_vertical(start_index, end_index, obj=None)

Get a region between two vertical guides.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `start_index` | `int` | Starting vertical guide index | *required* |
| `end_index` | `int` | Ending vertical guide index | *required* |
| `obj` | `Optional[Union[Page, Region]]` | Page or Region to create the region on (uses self.context if None) | `None` |

Returns:

| Type | Description |
|------|-------------|
| `Region` | Region between the specified guides |

Source code in natural_pdf/analyzers/guides.py
def between_vertical(
    self, start_index: int, end_index: int, obj: Optional[Union["Page", "Region"]] = None
) -> "Region":
    """
    Get a region between two vertical guides.

    Args:
        start_index: Starting vertical guide index
        end_index: Ending vertical guide index
        obj: Page or Region to create the region on (uses self.context if None)

    Returns:
        Region between the specified guides
    """
    target = obj or self.context
    if target is None:
        raise ValueError("No context available for region creation")

    if not self.vertical:
        raise ValueError("No vertical guides available")
    if start_index < 0 or start_index >= len(self.vertical):
        raise IndexError(f"Start index {start_index} out of range")
    if end_index < 0 or end_index >= len(self.vertical):
        raise IndexError(f"End index {end_index} out of range")
    if start_index >= end_index:
        raise ValueError("Start index must be less than end index")

    # Get bounds from context
    bounds = self._get_context_bounds()
    if not bounds:
        raise ValueError("Could not determine bounds")
    _, y0, _, y1 = bounds

    # Get horizontal boundaries
    x0 = self.vertical[start_index]
    x1 = self.vertical[end_index]

    if hasattr(target, "region"):
        return target.region(x0, y0, x1, y1)
    else:
        raise TypeError(f"Cannot create region on {type(target)}")
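The two `between_*` helpers slice the context by guide index. A sketch (the indices assume at least three guides on each axis):

```python
guides = Guides.new(page).add_lines(axis="both")

second_column = guides.between_vertical(1, 2)   # between vertical guides 1 and 2
top_two_rows = guides.between_horizontal(0, 2)  # spans from guide 0 down to guide 2
```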
natural_pdf.Guides.build_grid(target=None, source='guides', cell_padding=0.5, include_outer_boundaries=False, *, multi_page='auto')

Create table structure (table, rows, columns, cells) from guide coordinates.

Parameters:

- target (Optional[Union[Page, Region]], default None): Page or Region to create regions on (uses self.context if None).
- source (str, default 'guides'): Source label for created regions (for identification).
- cell_padding (float, default 0.5): Internal padding for cell regions in points.
- include_outer_boundaries (bool, default False): Whether to add boundaries at edges if missing.
- multi_page (Literal['auto', True, False], default 'auto'): Controls multi-region table creation for FlowRegions.
  - "auto" (default): Creates a unified grid if there are multiple regions or guides span pages.
  - True: Forces creation of a unified multi-region grid.
  - False: Creates separate grids for each region.

Returns:

- Dict[str, Any]: Dictionary with 'counts' and 'regions' created.

Source code in natural_pdf/analyzers/guides.py, lines 3585-3674
def build_grid(
    self,
    target: Optional[Union["Page", "Region"]] = None,
    source: str = "guides",
    cell_padding: float = 0.5,
    include_outer_boundaries: bool = False,
    *,
    multi_page: Literal["auto", True, False] = "auto",
) -> Dict[str, Any]:
    """
    Create table structure (table, rows, columns, cells) from guide coordinates.

    Args:
        target: Page or Region to create regions on (uses self.context if None)
        source: Source label for created regions (for identification)
        cell_padding: Internal padding for cell regions in points
        include_outer_boundaries: Whether to add boundaries at edges if missing
        multi_page: Controls multi-region table creation for FlowRegions.
            - "auto": (default) Creates a unified grid if there are multiple regions or guides span pages.
            - True: Forces creation of a unified multi-region grid.
            - False: Creates separate grids for each region.

    Returns:
        Dictionary with 'counts' and 'regions' created.
    """
    # Dispatch to appropriate implementation based on context and flags
    if self.is_flow_region:
        # Check if we should create a unified multi-region grid
        has_multiple_regions = len(self.context.constituent_regions) > 1
        spans_pages = self._spans_pages()

        # Create unified grid if:
        # - multi_page is explicitly True, OR
        # - multi_page is "auto" AND (spans pages OR has multiple regions)
        if multi_page is True or (
            multi_page == "auto" and (spans_pages or has_multiple_regions)
        ):
            return self._build_grid_multi_page(
                source=source,
                cell_padding=cell_padding,
                include_outer_boundaries=include_outer_boundaries,
            )
        else:
            # Single region FlowRegion or multi_page=False: create separate tables per region
            total_counts = {"table": 0, "rows": 0, "columns": 0, "cells": 0}
            all_regions = {"table": [], "rows": [], "columns": [], "cells": []}

            for region in self.context.constituent_regions:
                if region in self._flow_guides:
                    verticals, horizontals = self._flow_guides[region]

                    region_guides = Guides(
                        verticals=verticals, horizontals=horizontals, context=region
                    )

                    try:
                        result = region_guides._build_grid_single_page(
                            target=region,
                            source=source,
                            cell_padding=cell_padding,
                            include_outer_boundaries=include_outer_boundaries,
                        )

                        for key in total_counts:
                            total_counts[key] += result["counts"][key]

                        if result["regions"]["table"]:
                            all_regions["table"].append(result["regions"]["table"])
                        all_regions["rows"].extend(result["regions"]["rows"])
                        all_regions["columns"].extend(result["regions"]["columns"])
                        all_regions["cells"].extend(result["regions"]["cells"])

                    except Exception as e:
                        logger.warning(f"Failed to build grid on region: {e}")

            logger.info(
                f"Created {total_counts['table']} tables, {total_counts['rows']} rows, "
                f"{total_counts['columns']} columns, and {total_counts['cells']} cells "
                f"from guides across {len(self._flow_guides)} regions"
            )

            return {"counts": total_counts, "regions": all_regions}

    # Fallback for single page/region
    return self._build_grid_single_page(
        target=target,
        source=source,
        cell_padding=cell_padding,
        include_outer_boundaries=include_outer_boundaries,
    )
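Example (illustrative; the guide layout and the "my_table" label are placeholders):

guides = Guides.divide(page, cols=3, rows=10)
result = guides.build_grid(source="my_table")
print(result["counts"])                     # e.g. {'table': 1, 'rows': 10, 'columns': 3, 'cells': 30}
table_region = result["regions"]["table"]   # a single Region here; a list in the multi-page case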
natural_pdf.Guides.cell(row, col, obj=None)

Get a cell region from the guides.

Parameters:

- row (int, required): Row index (0-based).
- col (int, required): Column index (0-based).
- obj (Optional[Union[Page, Region]], default None): Page or Region to create the cell on (uses self.context if None).

Returns:

- Region: Region representing the specified cell.

Raises:

- IndexError: If row or column index is out of range.

Source code in natural_pdf/analyzers/guides.py, lines 2502-2543
def cell(self, row: int, col: int, obj: Optional[Union["Page", "Region"]] = None) -> "Region":
    """
    Get a cell region from the guides.

    Args:
        row: Row index (0-based)
        col: Column index (0-based)
        obj: Page or Region to create the cell on (uses self.context if None)

    Returns:
        Region representing the specified cell

    Raises:
        IndexError: If row or column index is out of range
    """
    target = obj or self.context
    if target is None:
        raise ValueError("No context available for region creation")

    if not self.vertical or col < 0 or col >= len(self.vertical) - 1:
        raise IndexError(
            f"Column index {col} out of range (have {len(self.vertical)-1} columns)"
        )
    if not self.horizontal or row < 0 or row >= len(self.horizontal) - 1:
        raise IndexError(f"Row index {row} out of range (have {len(self.horizontal)-1} rows)")

    # Get cell boundaries
    x0 = self.vertical[col]
    x1 = self.vertical[col + 1]
    y0 = self.horizontal[row]
    y1 = self.horizontal[row + 1]

    # Create region using absolute coordinates
    if hasattr(target, "region"):
        # Target has a region method (Page)
        return target.region(x0, y0, x1, y1)
    elif hasattr(target, "page"):
        # Target is a Region, use its parent page
        # The coordinates from guides are already absolute
        return target.page.region(x0, y0, x1, y1)
    else:
        raise TypeError(f"Cannot create region on {type(target)}")
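Example (sketch; assumes a grid built with Guides.divide):

guides = Guides.divide(page, cols=3, rows=5)
top_left = guides.cell(0, 0)               # first row, first column
value = guides.cell(2, 1).extract_text()   # row 2, column 1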
natural_pdf.Guides.column(index, obj=None)

Get a column region from the guides.

Parameters:

- index (int, required): Column index (0-based).
- obj (Optional[Union[Page, Region]], default None): Page or Region to create the column on (uses self.context if None).

Returns:

- Region: Region representing the specified column.

Raises:

- IndexError: If column index is out of range.

Source code in natural_pdf/analyzers/guides.py, lines 2416-2458
def column(self, index: int, obj: Optional[Union["Page", "Region"]] = None) -> "Region":
    """
    Get a column region from the guides.

    Args:
        index: Column index (0-based)
        obj: Page or Region to create the column on (uses self.context if None)

    Returns:
        Region representing the specified column

    Raises:
        IndexError: If column index is out of range
    """
    target = obj or self.context
    if target is None:
        raise ValueError("No context available for region creation")

    if not self.vertical or index < 0 or index >= len(self.vertical) - 1:
        raise IndexError(
            f"Column index {index} out of range (have {len(self.vertical)-1} columns)"
        )

    # Get bounds from context
    bounds = self._get_context_bounds()
    if not bounds:
        raise ValueError("Could not determine bounds")
    _, y0, _, y1 = bounds

    # Get column boundaries
    x0 = self.vertical[index]
    x1 = self.vertical[index + 1]

    # Create region using absolute coordinates
    if hasattr(target, "region"):
        # Target has a region method (Page)
        return target.region(x0, y0, x1, y1)
    elif hasattr(target, "page"):
        # Target is a Region, use its parent page
        # The coordinates from guides are already absolute
        return target.page.region(x0, y0, x1, y1)
    else:
        raise TypeError(f"Cannot create region on {type(target)}")
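Example (sketch; only vertical guides are needed, since column() takes its top and bottom edges from the context bounds):

guides = Guides.divide(page, cols=3, axis="vertical")
second_col = guides.column(1)   # columns are numbered 0-2 here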
natural_pdf.Guides.divide(obj, n=None, cols=None, rows=None, axis='both') classmethod

Create guides by evenly dividing an object.

Parameters:

- obj (Union[Page, Region, Tuple[float, float, float, float]], required): Object to divide (Page, Region, or bbox tuple).
- n (Optional[int], default None): Number of divisions (creates n+1 guides). Used if cols/rows not specified.
- cols (Optional[int], default None): Number of columns (creates cols+1 vertical guides).
- rows (Optional[int], default None): Number of rows (creates rows+1 horizontal guides).
- axis (Literal['vertical', 'horizontal', 'both'], default 'both'): Which axis to divide along.

Returns:

- Guides: New Guides object with evenly spaced lines.

Examples:

# Divide into 3 columns
guides = Guides.divide(page, cols=3)

# Divide into 5 rows
guides = Guides.divide(region, rows=5)

# Divide both axes
guides = Guides.divide(page, cols=3, rows=5)

Source code in natural_pdf/analyzers/guides.py, lines 1579-1644
@classmethod
def divide(
    cls,
    obj: Union["Page", "Region", Tuple[float, float, float, float]],
    n: Optional[int] = None,
    cols: Optional[int] = None,
    rows: Optional[int] = None,
    axis: Literal["vertical", "horizontal", "both"] = "both",
) -> "Guides":
    """
    Create guides by evenly dividing an object.

    Args:
        obj: Object to divide (Page, Region, or bbox tuple)
        n: Number of divisions (creates n+1 guides). Used if cols/rows not specified.
        cols: Number of columns (creates cols+1 vertical guides)
        rows: Number of rows (creates rows+1 horizontal guides)
        axis: Which axis to divide along

    Returns:
        New Guides object with evenly spaced lines

    Examples:
        # Divide into 3 columns
        guides = Guides.divide(page, cols=3)

        # Divide into 5 rows
        guides = Guides.divide(region, rows=5)

        # Divide both axes
        guides = Guides.divide(page, cols=3, rows=5)
    """
    # Extract bounds from object
    if isinstance(obj, tuple) and len(obj) == 4:
        bounds = obj
        context = None
    else:
        context = obj
        if hasattr(obj, "bbox"):
            bounds = obj.bbox
        elif hasattr(obj, "x0"):
            bounds = (obj.x0, obj.top, obj.x1, obj.bottom)
        else:
            bounds = (0, 0, obj.width, obj.height)

    x0, y0, x1, y1 = bounds
    verticals = []
    horizontals = []

    # Handle vertical guides
    if axis in ("vertical", "both"):
        n_vertical = cols + 1 if cols is not None else (n + 1 if n is not None else 0)
        if n_vertical > 0:
            for i in range(n_vertical):
                x = x0 + (x1 - x0) * i / (n_vertical - 1)
                verticals.append(float(x))

    # Handle horizontal guides
    if axis in ("horizontal", "both"):
        n_horizontal = rows + 1 if rows is not None else (n + 1 if n is not None else 0)
        if n_horizontal > 0:
            for i in range(n_horizontal):
                y = y0 + (y1 - y0) * i / (n_horizontal - 1)
                horizontals.append(float(y))

    return cls(verticals=verticals, horizontals=horizontals, context=context, bounds=bounds)
natural_pdf.Guides.extract_table(target=None, source='guides_temp', cell_padding=0.5, include_outer_boundaries=False, method=None, table_settings=None, use_ocr=False, ocr_config=None, text_options=None, cell_extraction_func=None, show_progress=False, content_filter=None, apply_exclusions=True, *, multi_page='auto', header='first', skip_repeating_headers=None)

Extract table data directly from guides without leaving temporary regions.

This method:

1. Creates table structure using build_grid()
2. Extracts table data from the created table region
3. Cleans up all temporary regions
4. Returns the TableResult

When passed a collection (PageCollection, ElementCollection, or list), this method will extract tables from each element and combine them into a single result.

Parameters:

- target (Optional[Union[Page, Region, PageCollection, ElementCollection, List[Union[Page, Region]]]], default None): Page, Region, or collection of Pages/Regions to extract from (uses self.context if None).
- source (str, default 'guides_temp'): Source label for temporary regions (will be cleaned up).
- cell_padding (float, default 0.5): Internal padding for cell regions in points.
- include_outer_boundaries (bool, default False): Whether to add boundaries at edges if missing.
- method (Optional[str], default None): Table extraction method ('tatr', 'pdfplumber', 'text', etc.).
- table_settings (Optional[dict], default None): Settings for pdfplumber table extraction.
- use_ocr (bool, default False): Whether to use OCR for text extraction.
- ocr_config (Optional[dict], default None): OCR configuration parameters.
- text_options (Optional[Dict], default None): Dictionary of options for the 'text' method.
- cell_extraction_func (Optional[Callable[[Region], Optional[str]]], default None): Optional callable for custom cell text extraction.
- show_progress (bool, default False): Controls progress bar for text method.
- content_filter (Optional[Union[str, Callable[[str], bool], List[str]]], default None): Content filtering function or patterns.
- apply_exclusions (bool, default True): Whether to apply exclusion regions during text extraction.
- multi_page (Literal['auto', True, False], default 'auto'): Controls multi-region table creation for FlowRegions.
- header (Union[str, List[str], None], default 'first'): How to handle headers when extracting from collections:
  - "first": Use first row of first element as headers (default)
  - "all": Expect headers on each element, use from first element
  - None: No headers, use numeric indices
  - List[str]: Custom column names
- skip_repeating_headers (Optional[bool], default None): Whether to remove duplicate header rows when extracting from collections. Defaults to True when header is "first" or "all", False otherwise.

Returns:

- TableResult: Extracted table data.

Raises:

- ValueError: If no table region is created from the guides.

Example
from natural_pdf.analyzers import Guides

# Single page extraction
guides = Guides.from_lines(page, source_label="detected")
table_data = guides.extract_table()
df = table_data.to_df()

# Multiple page extraction
guides = Guides(pages[0])
guides.vertical.from_content(['Column 1', 'Column 2'])
table_result = guides.extract_table(pages, header=['Col1', 'Col2'])
df = table_result.to_df()

# Region collection extraction
regions = pdf.find_all('region[type=table]')
guides = Guides(regions[0])
guides.vertical.from_lines(n=3)
table_result = guides.extract_table(regions)
Source code in natural_pdf/analyzers/guides.py, lines 4375-4587
def extract_table(
    self,
    target: Optional[
        Union[
            "Page",
            "Region",
            "PageCollection",
            "ElementCollection",
            List[Union["Page", "Region"]],
        ]
    ] = None,
    source: str = "guides_temp",
    cell_padding: float = 0.5,
    include_outer_boundaries: bool = False,
    method: Optional[str] = None,
    table_settings: Optional[dict] = None,
    use_ocr: bool = False,
    ocr_config: Optional[dict] = None,
    text_options: Optional[Dict] = None,
    cell_extraction_func: Optional[Callable[["Region"], Optional[str]]] = None,
    show_progress: bool = False,
    content_filter: Optional[Union[str, Callable[[str], bool], List[str]]] = None,
    apply_exclusions: bool = True,
    *,
    multi_page: Literal["auto", True, False] = "auto",
    header: Union[str, List[str], None] = "first",
    skip_repeating_headers: Optional[bool] = None,
) -> "TableResult":
    """
    Extract table data directly from guides without leaving temporary regions.

    This method:
    1. Creates table structure using build_grid()
    2. Extracts table data from the created table region
    3. Cleans up all temporary regions
    4. Returns the TableResult

    When passed a collection (PageCollection, ElementCollection, or list), this method
    will extract tables from each element and combine them into a single result.

    Args:
        target: Page, Region, or collection of Pages/Regions to extract from (uses self.context if None)
        source: Source label for temporary regions (will be cleaned up)
        cell_padding: Internal padding for cell regions in points
        include_outer_boundaries: Whether to add boundaries at edges if missing
        method: Table extraction method ('tatr', 'pdfplumber', 'text', etc.)
        table_settings: Settings for pdfplumber table extraction
        use_ocr: Whether to use OCR for text extraction
        ocr_config: OCR configuration parameters
        text_options: Dictionary of options for the 'text' method
        cell_extraction_func: Optional callable for custom cell text extraction
        show_progress: Controls progress bar for text method
        content_filter: Content filtering function or patterns
        apply_exclusions: Whether to apply exclusion regions during text extraction (default: True)
        multi_page: Controls multi-region table creation for FlowRegions
        header: How to handle headers when extracting from collections:
            - "first": Use first row of first element as headers (default)
            - "all": Expect headers on each element, use from first element
            - None: No headers, use numeric indices
            - List[str]: Custom column names
        skip_repeating_headers: Whether to remove duplicate header rows when extracting from collections.
            Defaults to True when header is "first" or "all", False otherwise.

    Returns:
        TableResult: Extracted table data

    Raises:
        ValueError: If no table region is created from the guides

    Example:
        ```python
        from natural_pdf.analyzers import Guides

        # Single page extraction
        guides = Guides.from_lines(page, source_label="detected")
        table_data = guides.extract_table()
        df = table_data.to_df()

        # Multiple page extraction
        guides = Guides(pages[0])
        guides.vertical.from_content(['Column 1', 'Column 2'])
        table_result = guides.extract_table(pages, header=['Col1', 'Col2'])
        df = table_result.to_df()

        # Region collection extraction
        regions = pdf.find_all('region[type=table]')
        guides = Guides(regions[0])
        guides.vertical.from_lines(n=3)
        table_result = guides.extract_table(regions)
        ```
    """
    from natural_pdf.core.page_collection import PageCollection
    from natural_pdf.elements.element_collection import ElementCollection

    target_obj = target if target is not None else self.context
    if target_obj is None:
        raise ValueError("No target object available. Provide target parameter or context.")

    # Check if target is a collection - if so, delegate to _extract_table_from_collection
    if isinstance(target_obj, (PageCollection, ElementCollection, list)):
        # For collections, pass through most parameters as-is
        return self._extract_table_from_collection(
            elements=target_obj,
            header=header,
            skip_repeating_headers=skip_repeating_headers,
            method=method,
            table_settings=table_settings,
            use_ocr=use_ocr,
            ocr_config=ocr_config,
            text_options=text_options,
            cell_extraction_func=cell_extraction_func,
            show_progress=show_progress,
            content_filter=content_filter,
            apply_exclusions=apply_exclusions,
        )

    # Get the page for cleanup later
    if hasattr(target_obj, "x0") and hasattr(target_obj, "top"):  # Region
        page = target_obj._page
        element_manager = page._element_mgr
    elif hasattr(target_obj, "_element_mgr"):  # Page
        page = target_obj
        element_manager = page._element_mgr
    else:
        raise ValueError(f"Target object {target_obj} is not a Page or Region")

    # Check if we have guides in only one dimension
    has_verticals = len(self.vertical) > 0
    has_horizontals = len(self.horizontal) > 0

    # If we have guides in only one dimension, use direct extraction with explicit lines
    if (has_verticals and not has_horizontals) or (has_horizontals and not has_verticals):
        logger.debug(
            f"Partial guides detected - using direct extraction (v={has_verticals}, h={has_horizontals})"
        )

        # Extract directly from the target using explicit lines
        if hasattr(target_obj, "extract_table"):
            return target_obj.extract_table(
                method=method,  # Let auto-detection work when None
                table_settings=table_settings,
                use_ocr=use_ocr,
                ocr_config=ocr_config,
                text_options=text_options,
                cell_extraction_func=cell_extraction_func,
                show_progress=show_progress,
                content_filter=content_filter,
                verticals=list(self.vertical) if has_verticals else None,
                horizontals=list(self.horizontal) if has_horizontals else None,
            )
        else:
            raise ValueError(f"Target object {type(target_obj)} does not support extract_table")

    # Both dimensions have guides - use normal grid-based extraction
    try:
        # Step 1: Build grid structure (creates temporary regions)
        grid_result = self.build_grid(
            target=target_obj,
            source=source,
            cell_padding=cell_padding,
            include_outer_boundaries=include_outer_boundaries,
            multi_page=multi_page,
        )

        # Step 2: Get the table region and extract table data
        table_region = grid_result["regions"]["table"]
        if table_region is None:
            raise ValueError(
                "No table region was created from the guides. Check that you have both vertical and horizontal guides."
            )

        # Handle multi-page case where table_region might be a list
        if isinstance(table_region, list):
            if not table_region:
                raise ValueError("No table regions were created from the guides.")
            # Use the first table region for extraction
            table_region = table_region[0]

        # Step 3: Extract table data using the region's extract_table method
        table_result = table_region.extract_table(
            method=method,
            table_settings=table_settings,
            use_ocr=use_ocr,
            ocr_config=ocr_config,
            text_options=text_options,
            cell_extraction_func=cell_extraction_func,
            show_progress=show_progress,
            content_filter=content_filter,
            apply_exclusions=apply_exclusions,
        )

        return table_result

    finally:
        # Step 4: Clean up all temporary regions created by build_grid
        # This ensures no regions are left behind regardless of success/failure
        try:
            regions_to_remove = [
                r
                for r in element_manager.regions
                if getattr(r, "source", None) == source
                and getattr(r, "region_type", None)
                in {"table", "table_row", "table_column", "table_cell"}
            ]

            for region in regions_to_remove:
                element_manager.remove_element(region, element_type="regions")

            if regions_to_remove:
                logger.debug(f"Cleaned up {len(regions_to_remove)} temporary regions")

        except Exception as cleanup_err:
            logger.warning(f"Failed to clean up temporary regions: {cleanup_err}")
natural_pdf.Guides.from_content(obj, axis='vertical', markers=None, align='left', outer=True, tolerance=5, apply_exclusions=True) classmethod

Create guides based on text content positions.

Parameters:

- obj (Union[Page, Region, FlowRegion], required): Page, Region, or FlowRegion to search for content.
- axis (Literal['vertical', 'horizontal'], default 'vertical'): Whether to create vertical or horizontal guides.
- markers (Union[str, List[str], ElementCollection, None], default None): Content to search for. Can be:
  - str: single selector (e.g., 'text:contains("Name")') or literal text
  - List[str]: list of selectors or literal text strings
  - ElementCollection: collection of elements to extract text from
  - None: no markers
- align (Union[Literal['left', 'right', 'center', 'between'], Literal['top', 'bottom']], default 'left'): Where to place guides relative to found text:
  - For vertical guides: 'left', 'right', 'center', 'between'
  - For horizontal guides: 'top', 'bottom', 'center', 'between'
- outer (bool, default True): Whether to add guides at the boundaries.
- tolerance (float, default 5): Maximum distance to search for text.
- apply_exclusions (bool, default True): Whether to apply exclusion zones when searching for text.

Returns:

- Guides: New Guides object aligned to text content.

Source code in natural_pdf/analyzers/guides.py, lines 1898-2119
@classmethod
def from_content(
    cls,
    obj: Union["Page", "Region", "FlowRegion"],
    axis: Literal["vertical", "horizontal"] = "vertical",
    markers: Union[str, List[str], "ElementCollection", None] = None,
    align: Union[
        Literal["left", "right", "center", "between"], Literal["top", "bottom"]
    ] = "left",
    outer: bool = True,
    tolerance: float = 5,
    apply_exclusions: bool = True,
) -> "Guides":
    """
    Create guides based on text content positions.

    Args:
        obj: Page, Region, or FlowRegion to search for content
        axis: Whether to create vertical or horizontal guides
        markers: Content to search for. Can be:
            - str: single selector (e.g., 'text:contains("Name")') or literal text
            - List[str]: list of selectors or literal text strings
            - ElementCollection: collection of elements to extract text from
            - None: no markers
        align: Where to place guides relative to found text:
            - For vertical guides: 'left', 'right', 'center', 'between'
            - For horizontal guides: 'top', 'bottom', 'center', 'between'
        outer: Whether to add guides at the boundaries
        tolerance: Maximum distance to search for text
        apply_exclusions: Whether to apply exclusion zones when searching for text

    Returns:
        New Guides object aligned to text content
    """
    # Normalize alignment for horizontal guides
    if axis == "horizontal":
        if align == "top":
            align = "left"
        elif align == "bottom":
            align = "right"

    # Handle FlowRegion
    if hasattr(obj, "constituent_regions"):
        guides = cls(context=obj)

        # Process each constituent region
        for region in obj.constituent_regions:
            # Create guides for this specific region
            region_guides = cls.from_content(
                region,
                axis=axis,
                markers=markers,
                align=align,
                outer=outer,
                tolerance=tolerance,
                apply_exclusions=apply_exclusions,
            )

            # Store in flow guides
            guides._flow_guides[region] = (
                list(region_guides.vertical),
                list(region_guides.horizontal),
            )

            # Add to unified view
            for v in region_guides.vertical:
                guides._unified_vertical.append((v, region))
            for h in region_guides.horizontal:
                guides._unified_horizontal.append((h, region))

        # Invalidate caches
        guides._vertical_cache = None
        guides._horizontal_cache = None

        return guides

    # Original single-region logic follows...
    guides_coords = []
    bounds = None

    # Get bounds from object
    if hasattr(obj, "bbox"):
        bounds = obj.bbox
    elif hasattr(obj, "x0"):
        bounds = (obj.x0, obj.top, obj.x1, obj.bottom)
    elif hasattr(obj, "width"):
        bounds = (0, 0, obj.width, obj.height)

    # Handle different marker types
    elements_to_process = []

    # Check if markers is an ElementCollection or has elements attribute
    if hasattr(markers, "elements") or hasattr(markers, "_elements"):
        # It's an ElementCollection - use elements directly
        elements_to_process = getattr(markers, "elements", getattr(markers, "_elements", []))
    elif hasattr(markers, "__iter__") and not isinstance(markers, str):
        # Check if it's an iterable of elements (not strings)
        try:
            markers_list = list(markers)
            if markers_list and hasattr(markers_list[0], "x0"):
                # It's a list of elements
                elements_to_process = markers_list
        except Exception:
            # markers is not an element sequence; fall through to text search
            pass

    if elements_to_process:
        # Process elements directly without text search
        for element in elements_to_process:
            if axis == "vertical":
                if align == "left":
                    guides_coords.append(element.x0)
                elif align == "right":
                    guides_coords.append(element.x1)
                elif align == "center":
                    guides_coords.append((element.x0 + element.x1) / 2)
                elif align == "between":
                    # For between, collect left edges for processing later
                    guides_coords.append(element.x0)
            else:  # horizontal
                if align == "left":  # top for horizontal
                    guides_coords.append(element.top)
                elif align == "right":  # bottom for horizontal
                    guides_coords.append(element.bottom)
                elif align == "center":
                    guides_coords.append((element.top + element.bottom) / 2)
                elif align == "between":
                    # For between, collect top edges for processing later
                    guides_coords.append(element.top)
    else:
        # Fall back to text-based search
        marker_texts = _normalize_markers(markers, obj)

        # Find each marker and determine guide position
        for marker in marker_texts:
            if hasattr(obj, "find"):
                element = obj.find(
                    f'text:contains("{marker}")', apply_exclusions=apply_exclusions
                )
                if element:
                    if axis == "vertical":
                        if align == "left":
                            guides_coords.append(element.x0)
                        elif align == "right":
                            guides_coords.append(element.x1)
                        elif align == "center":
                            guides_coords.append((element.x0 + element.x1) / 2)
                        elif align == "between":
                            # For between, collect left edges for processing later
                            guides_coords.append(element.x0)
                    else:  # horizontal
                        if align == "left":  # top for horizontal
                            guides_coords.append(element.top)
                        elif align == "right":  # bottom for horizontal
                            guides_coords.append(element.bottom)
                        elif align == "center":
                            guides_coords.append((element.top + element.bottom) / 2)
                        elif align == "between":
                            # For between, collect top edges for processing later
                            guides_coords.append(element.top)

    # Handle 'between' alignment - find midpoints between adjacent markers
    if align == "between" and len(guides_coords) >= 2:
        # We need to get the right and left edges of each marker
        marker_bounds = []

        if elements_to_process:
            # Use elements directly
            for element in elements_to_process:
                if axis == "vertical":
                    marker_bounds.append((element.x0, element.x1))
                else:  # horizontal
                    marker_bounds.append((element.top, element.bottom))
        else:
            # Fall back to text search
            if "marker_texts" not in locals():
                marker_texts = _normalize_markers(markers, obj)
            for marker in marker_texts:
                if hasattr(obj, "find"):
                    element = obj.find(
                        f'text:contains("{marker}")', apply_exclusions=apply_exclusions
                    )
                    if element:
                        if axis == "vertical":
                            marker_bounds.append((element.x0, element.x1))
                        else:  # horizontal
                            marker_bounds.append((element.top, element.bottom))

        # Sort markers by their left edge (or top edge for horizontal)
        marker_bounds.sort(key=lambda x: x[0])

        # Create guides at midpoints between adjacent markers
        between_coords = []
        for i in range(len(marker_bounds) - 1):
            # Midpoint between right edge of current marker and left edge of next marker
            right_edge_current = marker_bounds[i][1]
            left_edge_next = marker_bounds[i + 1][0]
            midpoint = (right_edge_current + left_edge_next) / 2
            between_coords.append(midpoint)

        guides_coords = between_coords

    # Add outer guides if requested
    if outer and bounds:
        if axis == "vertical":
            if outer == True or outer == "first":
                guides_coords.insert(0, bounds[0])  # x0
            if outer == True or outer == "last":
                guides_coords.append(bounds[2])  # x1
        else:
            if outer == True or outer == "first":
                guides_coords.insert(0, bounds[1])  # y0
            if outer == True or outer == "last":
                guides_coords.append(bounds[3])  # y1

    # Remove duplicates and sort
    guides_coords = sorted(list(set(guides_coords)))

    # Create guides object
    if axis == "vertical":
        return cls(verticals=guides_coords, context=obj, bounds=bounds)
    else:
        return cls(horizontals=guides_coords, context=obj, bounds=bounds)
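Example (a minimal sketch; the marker strings are placeholder column headers):

guides = Guides.from_content(
    page,
    axis="vertical",
    markers=["Name", "Amount", "Date"],   # placeholder header text
    align="left",
)
table = guides.extract_table(page)        # partial (vertical-only) guides are supported
df = table.to_df()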
natural_pdf.Guides.from_lines(obj, axis='both', threshold='auto', source_label=None, max_lines_h=None, max_lines_v=None, outer=False, detection_method='pixels', resolution=192, **detect_kwargs) classmethod

Create guides from detected line elements.

Parameters:

- obj (Union[Page, Region, FlowRegion], required): Page, Region, or FlowRegion to detect lines from.
- axis (Literal['vertical', 'horizontal', 'both'], default 'both'): Which orientations to detect.
- threshold (Union[float, str], default 'auto'): Detection threshold ('auto' or float 0.0-1.0); used for pixel detection.
- source_label (Optional[str], default None): Filter for line source (vector method) or label for detected lines (pixel method).
- max_lines_h (Optional[int], default None): Maximum number of horizontal lines to keep.
- max_lines_v (Optional[int], default None): Maximum number of vertical lines to keep.
- outer (bool, default False): Whether to add outer boundary guides.
- detection_method (str, default 'pixels'): 'vector' (use existing LineElements) or 'pixels' (detect from image).
- resolution (int, default 192): DPI for pixel-based detection.
- **detect_kwargs: Additional parameters for pixel-based detection:
  - min_gap_h: Minimum gap between horizontal lines (pixels)
  - min_gap_v: Minimum gap between vertical lines (pixels)
  - binarization_method: 'adaptive' or 'otsu'
  - morph_op_h/v: Morphological operations ('open', 'close', 'none')
  - smoothing_sigma_h/v: Gaussian smoothing sigma
  - method: 'projection' (default) or 'lsd' (requires opencv)

Returns:

- Guides: New Guides object with detected line positions.

Source code in natural_pdf/analyzers/guides.py, lines 1646-1896
@classmethod
def from_lines(
    cls,
    obj: Union["Page", "Region", "FlowRegion"],
    axis: Literal["vertical", "horizontal", "both"] = "both",
    threshold: Union[float, str] = "auto",
    source_label: Optional[str] = None,
    max_lines_h: Optional[int] = None,
    max_lines_v: Optional[int] = None,
    outer: bool = False,
    detection_method: str = "pixels",
    resolution: int = 192,
    **detect_kwargs,
) -> "Guides":
    """
    Create guides from detected line elements.

    Args:
        obj: Page, Region, or FlowRegion to detect lines from
        axis: Which orientations to detect
        threshold: Detection threshold ('auto' or float 0.0-1.0) - used for pixel detection
        source_label: Filter for line source (vector method) or label for detected lines (pixel method)
        max_lines_h: Maximum number of horizontal lines to keep
        max_lines_v: Maximum number of vertical lines to keep
        outer: Whether to add outer boundary guides
        detection_method: 'vector' (use existing LineElements) or 'pixels' (detect from image)
        resolution: DPI for pixel-based detection (default: 192)
        **detect_kwargs: Additional parameters for pixel-based detection:
            - min_gap_h: Minimum gap between horizontal lines (pixels)
            - min_gap_v: Minimum gap between vertical lines (pixels)
            - binarization_method: 'adaptive' or 'otsu'
            - morph_op_h/v: Morphological operations ('open', 'close', 'none')
            - smoothing_sigma_h/v: Gaussian smoothing sigma
            - method: 'projection' (default) or 'lsd' (requires opencv)

    Returns:
        New Guides object with detected line positions
    """
    # Handle FlowRegion
    if hasattr(obj, "constituent_regions"):
        guides = cls(context=obj)

        # Process each constituent region
        for region in obj.constituent_regions:
            # Create guides for this specific region
            region_guides = cls.from_lines(
                region,
                axis=axis,
                threshold=threshold,
                source_label=source_label,
                max_lines_h=max_lines_h,
                max_lines_v=max_lines_v,
                outer=outer,
                detection_method=detection_method,
                resolution=resolution,
                **detect_kwargs,
            )

            # Store in flow guides
            guides._flow_guides[region] = (
                list(region_guides.vertical),
                list(region_guides.horizontal),
            )

            # Add to unified view
            for v in region_guides.vertical:
                guides._unified_vertical.append((v, region))
            for h in region_guides.horizontal:
                guides._unified_horizontal.append((h, region))

        # Invalidate caches to force rebuild on next access
        guides._vertical_cache = None
        guides._horizontal_cache = None

        return guides

    # Original single-region logic follows...
    # Get bounds for potential outer guides
    if hasattr(obj, "bbox"):
        bounds = obj.bbox
    elif hasattr(obj, "x0"):
        bounds = (obj.x0, obj.top, obj.x1, obj.bottom)
    elif hasattr(obj, "width"):
        bounds = (0, 0, obj.width, obj.height)
    else:
        bounds = None

    verticals = []
    horizontals = []

    if detection_method == "pixels":
        # Use pixel-based line detection
        if not hasattr(obj, "detect_lines"):
            raise ValueError(f"Object {obj} does not support pixel-based line detection")

        # Set up detection parameters
        detect_params = {
            "resolution": resolution,
            "source_label": source_label or "guides_detection",
            "horizontal": axis in ("horizontal", "both"),
            "vertical": axis in ("vertical", "both"),
            "replace": True,  # Replace any existing lines with this source
            "method": detect_kwargs.get("method", "projection"),
        }

        # Handle threshold parameter (this branch only runs for pixel detection,
        # so the detection_method is already known to be "pixels")
        if threshold == "auto":
            # Auto mode: moderate peak thresholds, optionally constrained by max_lines
            detect_params["peak_threshold_h"] = 0.5
            detect_params["peak_threshold_v"] = 0.5
        else:
            # Fixed threshold mode
            detect_params["peak_threshold_h"] = (
                float(threshold) if axis in ("horizontal", "both") else 1.0
            )
            detect_params["peak_threshold_v"] = (
                float(threshold) if axis in ("vertical", "both") else 1.0
            )
        detect_params["max_lines_h"] = max_lines_h
        detect_params["max_lines_v"] = max_lines_v

        # Add any additional detection parameters
        for key in [
            "min_gap_h",
            "min_gap_v",
            "binarization_method",
            "adaptive_thresh_block_size",
            "adaptive_thresh_C_val",
            "morph_op_h",
            "morph_kernel_h",
            "morph_op_v",
            "morph_kernel_v",
            "smoothing_sigma_h",
            "smoothing_sigma_v",
            "peak_width_rel_height",
        ]:
            if key in detect_kwargs:
                detect_params[key] = detect_kwargs[key]

        # Perform the detection
        obj.detect_lines(**detect_params)

        # Now get the detected lines and use them
        if hasattr(obj, "lines"):
            lines = obj.lines
        elif hasattr(obj, "find_all"):
            lines = obj.find_all("line")
        else:
            lines = []

        # Filter by the source we just used

        lines = [
            l for l in lines if getattr(l, "source", None) == detect_params["source_label"]
        ]

    else:  # detection_method == 'vector' (default)
        # Get existing lines from the object
        if hasattr(obj, "lines"):
            lines = obj.lines
        elif hasattr(obj, "find_all"):
            lines = obj.find_all("line")
        else:
            logger.warning(f"Object {obj} has no lines or find_all method")
            lines = []

        # Filter by source if specified
        if source_label:
            lines = [l for l in lines if getattr(l, "source", None) == source_label]

    # Process lines (same logic for both methods)
    # Separate lines by orientation and collect with metadata for ranking
    h_line_data = []  # (y_coord, length, line_obj)
    v_line_data = []  # (x_coord, length, line_obj)

    for line in lines:
        if hasattr(line, "is_horizontal") and hasattr(line, "is_vertical"):
            if line.is_horizontal and axis in ("horizontal", "both"):
                # Use the midpoint y-coordinate for horizontal lines
                y = (line.top + line.bottom) / 2
                # Calculate line length for ranking
                length = getattr(
                    line, "width", abs(getattr(line, "x1", 0) - getattr(line, "x0", 0))
                )
                h_line_data.append((y, length, line))
            elif line.is_vertical and axis in ("vertical", "both"):
                # Use the midpoint x-coordinate for vertical lines
                x = (line.x0 + line.x1) / 2
                # Calculate line length for ranking
                length = getattr(
                    line, "height", abs(getattr(line, "bottom", 0) - getattr(line, "top", 0))
                )
                v_line_data.append((x, length, line))

    # Process horizontal lines
    if max_lines_h is not None and h_line_data:
        # Sort by length (longer lines are typically more significant)
        h_line_data.sort(key=lambda x: x[1], reverse=True)
        # Take the top N by length
        selected_h = h_line_data[:max_lines_h]
        # Extract just the coordinates and sort by position
        horizontals = sorted([coord for coord, _, _ in selected_h])
        logger.debug(
            f"Selected {len(horizontals)} horizontal lines from {len(h_line_data)} candidates"
        )
    else:
        # Use all horizontal lines (original behavior)
        horizontals = [coord for coord, _, _ in h_line_data]
        horizontals = sorted(list(set(horizontals)))

    # Process vertical lines
    if max_lines_v is not None and v_line_data:
        # Sort by length (longer lines are typically more significant)
        v_line_data.sort(key=lambda x: x[1], reverse=True)
        # Take the top N by length
        selected_v = v_line_data[:max_lines_v]
        # Extract just the coordinates and sort by position
        verticals = sorted([coord for coord, _, _ in selected_v])
        logger.debug(
            f"Selected {len(verticals)} vertical lines from {len(v_line_data)} candidates"
        )
    else:
        # Use all vertical lines (original behavior)
        verticals = [coord for coord, _, _ in v_line_data]
        verticals = sorted(list(set(verticals)))

    # Add outer guides if requested
    if outer and bounds:
        if axis in ("vertical", "both"):
            if not verticals or verticals[0] > bounds[0]:
                verticals.insert(0, bounds[0])  # x0
            if not verticals or verticals[-1] < bounds[2]:
                verticals.append(bounds[2])  # x1
        if axis in ("horizontal", "both"):
            if not horizontals or horizontals[0] > bounds[1]:
                horizontals.insert(0, bounds[1])  # y0
            if not horizontals or horizontals[-1] < bounds[3]:
                horizontals.append(bounds[3])  # y1

    # Remove duplicates and sort again
    verticals = sorted(list(set(verticals)))
    horizontals = sorted(list(set(horizontals)))

    return cls(verticals=verticals, horizontals=horizontals, context=obj, bounds=bounds)
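Example (illustrative; the detection parameters shown are the documented defaults):

# Detect ruling lines from the rendered page image and build guides
guides = Guides.from_lines(page, detection_method="pixels", resolution=192)
result = guides.extract_table(page)

# Or keep only the 5 longest vertical lines
guides = Guides.from_lines(page, axis="vertical", max_lines_v=5)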
natural_pdf.Guides.from_whitespace(obj, axis='both', min_gap=10) classmethod

Create guides by detecting whitespace gaps.

Parameters:

- obj (Union[Page, Region, FlowRegion], required): Page or Region to analyze.
- axis (Literal['vertical', 'horizontal', 'both'], default 'both'): Which axes to analyze for gaps.
- min_gap (float, default 10): Minimum gap size to consider as whitespace.

Returns:

- Guides: New Guides object positioned at whitespace gaps.

Source code in natural_pdf/analyzers/guides.py, lines 2121-2141
@classmethod
def from_whitespace(
    cls,
    obj: Union["Page", "Region", "FlowRegion"],
    axis: Literal["vertical", "horizontal", "both"] = "both",
    min_gap: float = 10,
) -> "Guides":
    """
    Create guides by detecting whitespace gaps.

    Args:
        obj: Page or Region to analyze
        min_gap: Minimum gap size to consider as whitespace
        axis: Which axes to analyze for gaps

    Returns:
        New Guides object positioned at whitespace gaps
    """
    # This is a placeholder - would need sophisticated gap detection
    logger.info("Whitespace detection not yet implemented, using divide instead")
    return cls.divide(obj, n=3, axis=axis)
natural_pdf.Guides.get_cells()

Get all cell bounding boxes from guide intersections.

Returns:

- List[Tuple[float, float, float, float]]: List of (x0, y0, x1, y1) tuples for each cell.

Source code in natural_pdf/analyzers/guides.py, lines 3060-3078
def get_cells(self) -> List[Tuple[float, float, float, float]]:
    """
    Get all cell bounding boxes from guide intersections.

    Returns:
        List of (x0, y0, x1, y1) tuples for each cell
    """
    cells = []

    # Create cells from guide intersections
    for i in range(len(self.vertical) - 1):
        for j in range(len(self.horizontal) - 1):
            x0 = self.vertical[i]
            x1 = self.vertical[i + 1]
            y0 = self.horizontal[j]
            y1 = self.horizontal[j + 1]
            cells.append((x0, y0, x1, y1))

    return cells
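Example (sketch):

guides = Guides.divide(page, cols=2, rows=2)
for x0, y0, x1, y1 in guides.get_cells():
    print(f"cell bbox: ({x0:.1f}, {y0:.1f}, {x1:.1f}, {y1:.1f})")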
natural_pdf.Guides.left_of(guide_index, obj=None)

Get a region to the left of a vertical guide.

Parameters:

- guide_index (int, required): Vertical guide index.
- obj (Optional[Union[Page, Region]], default None): Page or Region to create the region on (uses self.context if None).

Returns:

- Region: Region to the left of the specified guide.

Source code in natural_pdf/analyzers/guides.py, lines 2545-2575
def left_of(self, guide_index: int, obj: Optional[Union["Page", "Region"]] = None) -> "Region":
    """
    Get a region to the left of a vertical guide.

    Args:
        guide_index: Vertical guide index
        obj: Page or Region to create the region on (uses self.context if None)

    Returns:
        Region to the left of the specified guide
    """
    target = obj or self.context
    if target is None:
        raise ValueError("No context available for region creation")

    if not self.vertical or guide_index < 0 or guide_index >= len(self.vertical):
        raise IndexError(f"Guide index {guide_index} out of range")

    # Get bounds from context
    bounds = self._get_context_bounds()
    if not bounds:
        raise ValueError("Could not determine bounds")
    x0, y0, _, y1 = bounds

    # Create region from left edge to guide
    x1 = self.vertical[guide_index]

    if hasattr(target, "region"):
        return target.region(x0, y0, x1, y1)
    else:
        raise TypeError(f"Cannot create region on {type(target)}")
natural_pdf.Guides.new(context=None) classmethod

Create a new empty Guides object, optionally with a context.

This provides a clean way to start building guides through chaining: guides = Guides.new(page).add_content(axis='vertical', markers=[...])

Parameters:

- context (Optional[Union[Page, Region]], default None): Optional Page or Region to use as default context for operations.

Returns:

- Guides: New empty Guides object.

Source code in natural_pdf/analyzers/guides.py, lines 2143-2157
@classmethod
def new(cls, context: Optional[Union["Page", "Region"]] = None) -> "Guides":
    """
    Create a new empty Guides object, optionally with a context.

    This provides a clean way to start building guides through chaining:
    guides = Guides.new(page).add_content(axis='vertical', markers=[...])

    Args:
        context: Optional Page or Region to use as default context for operations

    Returns:
        New empty Guides object
    """
    return cls(verticals=[], horizontals=[], context=context)
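Example (a chaining sketch following the docstring above; add_content's full signature is documented separately, and the marker strings are placeholders):

guides = Guides.new(page).add_content(axis="vertical", markers=["Name", "Amount"])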
natural_pdf.Guides.remove_horizontal(index)

Remove a horizontal guide by index.

Source code in natural_pdf/analyzers/guides.py, lines 2387-2391
def remove_horizontal(self, index: int) -> "Guides":
    """Remove a horizontal guide by index."""
    if 0 <= index < len(self.horizontal):
        self.horizontal.pop(index)
    return self
natural_pdf.Guides.remove_vertical(index)

Remove a vertical guide by index.

Source code in natural_pdf/analyzers/guides.py, lines 2381-2385
def remove_vertical(self, index: int) -> "Guides":
    """Remove a vertical guide by index."""
    if 0 <= index < len(self.vertical):
        self.vertical.pop(index)
    return self
natural_pdf.Guides.right_of(guide_index, obj=None)

Get a region to the right of a vertical guide.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| guide_index | int | Vertical guide index | required |
| obj | Optional[Union[Page, Region]] | Page or Region to create the region on (uses self.context if None) | None |

Returns:

| Type | Description |
| --- | --- |
| Region | Region to the right of the specified guide |

Source code in natural_pdf/analyzers/guides.py, lines 2577-2607
def right_of(self, guide_index: int, obj: Optional[Union["Page", "Region"]] = None) -> "Region":
    """
    Get a region to the right of a vertical guide.

    Args:
        guide_index: Vertical guide index
        obj: Page or Region to create the region on (uses self.context if None)

    Returns:
        Region to the right of the specified guide
    """
    target = obj or self.context
    if target is None:
        raise ValueError("No context available for region creation")

    if not self.vertical or guide_index < 0 or guide_index >= len(self.vertical):
        raise IndexError(f"Guide index {guide_index} out of range")

    # Get bounds from context
    bounds = self._get_context_bounds()
    if not bounds:
        raise ValueError("Could not determine bounds")
    _, y0, x1, y1 = bounds

    # Create region from guide to right edge
    x0 = self.vertical[guide_index]

    if hasattr(target, "region"):
        return target.region(x0, y0, x1, y1)
    else:
        raise TypeError(f"Cannot create region on {type(target)}")
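The mirror image of left_of(); a sketch under the same placeholder assumptions:

```python
import natural_pdf as npdf
from natural_pdf import Guides

page = npdf.PDF("report.pdf").pages[0]  # placeholder file
guides = Guides(verticals=[200, 400], horizontals=[], context=page)
value_column = guides.right_of(1)  # from x=400 to the page's right edge
```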
natural_pdf.Guides.row(index, obj=None)

Get a row region from the guides.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| index | int | Row index (0-based) | required |
| obj | Optional[Union[Page, Region]] | Page or Region to create the row on (uses self.context if None) | None |

Returns:

| Type | Description |
| --- | --- |
| Region | Region representing the specified row |

Raises:

| Type | Description |
| --- | --- |
| IndexError | If row index is out of range |

Source code in natural_pdf/analyzers/guides.py, lines 2460-2500
def row(self, index: int, obj: Optional[Union["Page", "Region"]] = None) -> "Region":
    """
    Get a row region from the guides.

    Args:
        index: Row index (0-based)
        obj: Page or Region to create the row on (uses self.context if None)

    Returns:
        Region representing the specified row

    Raises:
        IndexError: If row index is out of range
    """
    target = obj or self.context
    if target is None:
        raise ValueError("No context available for region creation")

    if not self.horizontal or index < 0 or index >= len(self.horizontal) - 1:
        raise IndexError(f"Row index {index} out of range (have {len(self.horizontal)-1} rows)")

    # Get bounds from context
    bounds = self._get_context_bounds()
    if not bounds:
        raise ValueError("Could not determine bounds")
    x0, _, x1, _ = bounds

    # Get row boundaries
    y0 = self.horizontal[index]
    y1 = self.horizontal[index + 1]

    # Create region using absolute coordinates
    if hasattr(target, "region"):
        # Target has a region method (Page)
        return target.region(x0, y0, x1, y1)
    elif hasattr(target, "page"):
        # Target is a Region, use its parent page
        # The coordinates from guides are already absolute
        return target.page.region(x0, y0, x1, y1)
    else:
        raise TypeError(f"Cannot create region on {type(target)}")
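Rows are bounded by consecutive pairs of horizontal guides, so n guides yield n - 1 rows; a sketch (placeholder page as above):

```python
import natural_pdf as npdf
from natural_pdf import Guides

page = npdf.PDF("report.pdf").pages[0]  # placeholder file

# Three horizontal guides bound two rows
guides = Guides(verticals=[], horizontals=[100, 300, 500], context=page)
header_row = guides.row(0)  # spans y=100..300
body_row = guides.row(1)    # spans y=300..500
# guides.row(2) would raise IndexError: only len(horizontal) - 1 rows exist
```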
natural_pdf.Guides.shift(index, offset, axis='vertical')

Move a specific guide by an offset amount.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| index | int | Index of the guide to move | required |
| offset | float | Amount to move (positive = right/down) | required |
| axis | Literal['vertical', 'horizontal'] | Which guide list to modify | 'vertical' |

Returns:

| Type | Description |
| --- | --- |
| Guides | Self for method chaining |

Source code in natural_pdf/analyzers/guides.py, lines 2340-2367
def shift(
    self, index: int, offset: float, axis: Literal["vertical", "horizontal"] = "vertical"
) -> "Guides":
    """
    Move a specific guide by an offset amount.

    Args:
        index: Index of the guide to move
        offset: Amount to move (positive = right/down)
        axis: Which guide list to modify

    Returns:
        Self for method chaining
    """
    if axis == "vertical":
        if 0 <= index < len(self.vertical):
            self.vertical[index] += offset
            self.vertical = sorted(self.vertical)
        else:
            logger.warning(f"Vertical guide index {index} out of range")
    else:
        if 0 <= index < len(self.horizontal):
            self.horizontal[index] += offset
            self.horizontal = sorted(self.horizontal)
        else:
            logger.warning(f"Horizontal guide index {index} out of range")

    return self
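Because the guide list is re-sorted after each move, a shifted guide's index may change; a sketch (placeholder page as above):

```python
import natural_pdf as npdf
from natural_pdf import Guides

page = npdf.PDF("report.pdf").pages[0]  # placeholder file
guides = Guides(verticals=[200, 400], horizontals=[], context=page)
guides.shift(0, 15)                     # vertical guide 0: 200 -> 215
guides.shift(5, 10, axis="horizontal")  # out of range: warning logged, no change
```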
natural_pdf.Guides.show(on=None, **kwargs)

Display the guides overlaid on a page or region.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| on |  | Page, Region, PIL Image, or string to display guides on. If None, uses self.context (the object guides were created from). If string 'page', uses the page from self.context. | None |
| **kwargs |  | Additional arguments passed to to_image() if applicable. | {} |

Returns:

| Type | Description |
| --- | --- |
|  | PIL Image with guides drawn on it. |

Source code in natural_pdf/analyzers/guides.py, lines 2813-3054
def show(self, on=None, **kwargs):
    """
    Display the guides overlaid on a page or region.

    Args:
        on: Page, Region, PIL Image, or string to display guides on.
            If None, uses self.context (the object guides were created from).
            If string 'page', uses the page from self.context.
        **kwargs: Additional arguments passed to to_image() if applicable.

    Returns:
        PIL Image with guides drawn on it.
    """
    # Handle FlowRegion case
    if self.is_flow_region and (on is None or on == self.context):
        if not self._flow_guides:
            raise ValueError("No guides to show for FlowRegion")

        # Get stacking parameters from kwargs or use defaults
        stack_direction = kwargs.get("stack_direction", "vertical")
        stack_gap = kwargs.get("stack_gap", 5)
        stack_background_color = kwargs.get("stack_background_color", (255, 255, 255))

        # First, render all constituent regions without guides to get base images
        base_images = []
        region_infos = []  # Store region info for guide coordinate mapping

        for region in self.context.constituent_regions:
            try:
                # Render region without guides using new system
                if hasattr(region, "render"):
                    img = region.render(
                        resolution=kwargs.get("resolution", 150),
                        width=kwargs.get("width", None),
                        crop=True,  # Always crop regions to their bounds
                    )
                else:
                    # Fall back to the older to_image() API referenced in the docstring
                    img = region.to_image(**kwargs)
                if img:
                    base_images.append(img)

                    # Calculate scaling factors for this region
                    scale_x = img.width / region.width
                    scale_y = img.height / region.height

                    region_infos.append(
                        {
                            "region": region,
                            "img_width": img.width,
                            "img_height": img.height,
                            "scale_x": scale_x,
                            "scale_y": scale_y,
                            "pdf_x0": region.x0,
                            "pdf_top": region.top,
                            "pdf_x1": region.x1,
                            "pdf_bottom": region.bottom,
                        }
                    )
            except Exception as e:
                logger.warning(f"Failed to render region: {e}")

        if not base_images:
            raise ValueError("Failed to render any images for FlowRegion")

        # Calculate final canvas size based on stacking direction
        if stack_direction == "vertical":
            final_width = max(img.width for img in base_images)
            final_height = (
                sum(img.height for img in base_images) + (len(base_images) - 1) * stack_gap
            )
        else:  # horizontal
            final_width = (
                sum(img.width for img in base_images) + (len(base_images) - 1) * stack_gap
            )
            final_height = max(img.height for img in base_images)

        # Create unified canvas
        canvas = Image.new("RGB", (final_width, final_height), stack_background_color)
        draw = ImageDraw.Draw(canvas)

        # Paste base images and track positions
        region_positions = []  # (region_info, paste_x, paste_y)

        if stack_direction == "vertical":
            current_y = 0
            for i, (img, info) in enumerate(zip(base_images, region_infos)):
                paste_x = (final_width - img.width) // 2  # Center horizontally
                canvas.paste(img, (paste_x, current_y))
                region_positions.append((info, paste_x, current_y))
                current_y += img.height + stack_gap
        else:  # horizontal
            current_x = 0
            for i, (img, info) in enumerate(zip(base_images, region_infos)):
                paste_y = (final_height - img.height) // 2  # Center vertically
                canvas.paste(img, (current_x, paste_y))
                region_positions.append((info, current_x, paste_y))
                current_x += img.width + stack_gap

        # Now draw guides on the unified canvas
        # Draw vertical guides (blue) - these extend through the full canvas height
        for v_coord in self.vertical:
            # Find which region(s) this guide intersects
            for info, paste_x, paste_y in region_positions:
                if info["pdf_x0"] <= v_coord <= info["pdf_x1"]:
                    # This guide is within this region's x-bounds
                    # Convert PDF coordinate to pixel coordinate relative to the region
                    adjusted_x = v_coord - info["pdf_x0"]
                    pixel_x = adjusted_x * info["scale_x"] + paste_x

                    # Draw full-height line on canvas (not clipped to region)
                    if 0 <= pixel_x <= final_width:
                        x_pixel = int(pixel_x)
                        draw.line(
                            [(x_pixel, 0), (x_pixel, final_height - 1)],
                            fill=(0, 0, 255, 200),
                            width=2,
                        )
                    break  # Only draw once per guide

        # Draw horizontal guides (red) - these extend through the full canvas width
        for h_coord in self.horizontal:
            # Find which region(s) this guide intersects
            for info, paste_x, paste_y in region_positions:
                if info["pdf_top"] <= h_coord <= info["pdf_bottom"]:
                    # This guide is within this region's y-bounds
                    # Convert PDF coordinate to pixel coordinate relative to the region
                    adjusted_y = h_coord - info["pdf_top"]
                    pixel_y = adjusted_y * info["scale_y"] + paste_y

                    # Draw full-width line on canvas (not clipped to region)
                    if 0 <= pixel_y <= final_height:
                        y_pixel = int(pixel_y)
                        draw.line(
                            [(0, y_pixel), (final_width - 1, y_pixel)],
                            fill=(255, 0, 0, 200),
                            width=2,
                        )
                    break  # Only draw once per guide

        return canvas

    # Original single-region logic follows...
    # Determine what to display guides on
    target = on if on is not None else self.context

    # Handle string shortcuts
    if isinstance(target, str):
        if target == "page":
            if hasattr(self.context, "page"):
                target = self.context.page
            elif hasattr(self.context, "_page"):
                target = self.context._page
            else:
                raise ValueError("Cannot resolve 'page' - context has no page attribute")
        else:
            raise ValueError(f"Unknown string target: {target}. Only 'page' is supported.")

    if target is None:
        raise ValueError("No target specified and no context available for guides display")

    # Prepare kwargs for image generation
    image_kwargs = {}

    # Extract only the parameters that the new render() method accepts
    if "resolution" in kwargs:
        image_kwargs["resolution"] = kwargs["resolution"]
    if "width" in kwargs:
        image_kwargs["width"] = kwargs["width"]
    if "crop" in kwargs:
        image_kwargs["crop"] = kwargs["crop"]

    # If target is a region-like object, crop to just that region
    if hasattr(target, "bbox") and hasattr(target, "page"):
        # This is likely a Region
        image_kwargs["crop"] = True

    # Get base image
    if hasattr(target, "render"):
        # Use the new unified rendering system
        img = target.render(**image_kwargs)
    elif hasattr(target, "to_image"):
        # Fall back to the older to_image() API if render() is unavailable
        img = target.to_image(**image_kwargs)
    elif hasattr(target, "mode") and hasattr(target, "size"):
        # It's already a PIL Image
        img = target
    else:
        raise ValueError(f"Object {target} does not support render() and is not a PIL Image")

    if img is None:
        raise ValueError("Failed to generate base image")

    # Create a copy to draw on
    img = img.copy()
    draw = ImageDraw.Draw(img)

    # Determine scale factor for coordinate conversion
    if (
        hasattr(target, "width")
        and hasattr(target, "height")
        and not (hasattr(target, "mode") and hasattr(target, "size"))
    ):
        # target is a PDF object (Page/Region) with PDF coordinates
        scale_x = img.width / target.width
        scale_y = img.height / target.height

        # If we're showing guides on a region, we need to adjust coordinates
        # to be relative to the region's origin
        if hasattr(target, "bbox") and hasattr(target, "page"):
            # This is a Region - adjust guide coordinates to be relative to region
            region_x0, region_top = target.x0, target.top
        else:
            # This is a Page - no adjustment needed
            region_x0, region_top = 0, 0
    else:
        # target is already an image, no scaling needed
        scale_x = 1.0
        scale_y = 1.0
        region_x0, region_top = 0, 0

    # Draw vertical guides (blue)
    for x_coord in self.vertical:
        # Adjust coordinate if we're showing on a region
        adjusted_x = x_coord - region_x0
        pixel_x = adjusted_x * scale_x
        # Ensure guides at the edge are still visible by clamping to valid range
        if 0 <= pixel_x <= img.width - 1:
            x_pixel = int(min(pixel_x, img.width - 1))
            draw.line([(x_pixel, 0), (x_pixel, img.height - 1)], fill=(0, 0, 255, 200), width=2)

    # Draw horizontal guides (red)
    for y_coord in self.horizontal:
        # Adjust coordinate if we're showing on a region
        adjusted_y = y_coord - region_top
        pixel_y = adjusted_y * scale_y
        # Ensure guides at the edge are still visible by clamping to valid range
        if 0 <= pixel_y <= img.height - 1:
            y_pixel = int(min(pixel_y, img.height - 1))
            draw.line([(0, y_pixel), (img.width - 1, y_pixel)], fill=(255, 0, 0, 200), width=2)

    return img
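A sketch of previewing guides (assuming the context page supports the render() path described above; the output is a standard PIL Image):

```python
import natural_pdf as npdf
from natural_pdf import Guides

page = npdf.PDF("report.pdf").pages[0]  # placeholder file
guides = Guides(verticals=[200, 400], horizontals=[100, 500], context=page)
img = guides.show(resolution=150)   # vertical guides drawn blue, horizontal red
img.save("guides_preview.png")
```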
natural_pdf.Guides.snap_to_whitespace(axis='vertical', min_gap=10.0, detection_method='pixels', threshold='auto', on_no_snap='warn')

Snap guides to nearby whitespace gaps (troughs) using optimal assignment. Modifies this Guides object in place.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| axis | str | Direction to snap ('vertical' or 'horizontal') | 'vertical' |
| min_gap | float | Minimum gap size to consider as a valid trough | 10.0 |
| detection_method | str | Method for detecting troughs: 'pixels' uses pixel-based density analysis (default); 'text' uses text element spacing analysis | 'pixels' |
| threshold | Union[float, str] | Threshold for what counts as a trough: a float (0.0-1.0) means areas with this fraction or less of max density count as troughs; 'auto' automatically finds a threshold that creates enough troughs for the guides | 'auto' |
| on_no_snap | str | Action when snapping fails ('warn', 'ignore', 'raise') | 'warn' |

Returns:

| Type | Description |
| --- | --- |
| Guides | Self for method chaining. |

Source code in natural_pdf/analyzers/guides.py, lines 2163-2338
def snap_to_whitespace(
    self,
    axis: str = "vertical",
    min_gap: float = 10.0,
    detection_method: str = "pixels",  # 'pixels' or 'text'
    threshold: Union[
        float, str
    ] = "auto",  # threshold for what counts as a trough (0.0-1.0) or 'auto'
    on_no_snap: str = "warn",
) -> "Guides":
    """
    Snap guides to nearby whitespace gaps (troughs) using optimal assignment.
    Modifies this Guides object in place.

    Args:
        axis: Direction to snap ('vertical' or 'horizontal')
        min_gap: Minimum gap size to consider as a valid trough
        detection_method: Method for detecting troughs:
                        'pixels' - use pixel-based density analysis (default)
                        'text' - use text element spacing analysis
        threshold: Threshold for what counts as a trough:
                  - float (0.0-1.0): areas with this fraction or less of max density count as troughs
                  - 'auto': automatically find threshold that creates enough troughs for guides
        on_no_snap: Action when snapping fails ('warn', 'ignore', 'raise')

    Returns:
        Self for method chaining.
    """
    if not self.context:
        logger.warning("No context available for whitespace detection")
        return self

    # Handle FlowRegion case - collect all text elements across regions
    if self.is_flow_region:
        all_text_elements = []
        region_bounds = {}

        for region in self.context.constituent_regions:
            # Get text elements from this region
            if hasattr(region, "find_all"):
                try:
                    text_elements = region.find_all("text", apply_exclusions=False)
                    elements = (
                        text_elements.elements
                        if hasattr(text_elements, "elements")
                        else text_elements
                    )
                    all_text_elements.extend(elements)

                    # Store bounds for each region
                    if hasattr(region, "bbox"):
                        region_bounds[region] = region.bbox
                    elif hasattr(region, "x0"):
                        region_bounds[region] = (
                            region.x0,
                            region.top,
                            region.x1,
                            region.bottom,
                        )
                except Exception as e:
                    logger.warning(f"Error getting text elements from region: {e}")

        if not all_text_elements:
            logger.warning(
                "No text elements found across flow regions for whitespace detection"
            )
            return self

        # Find whitespace gaps across all regions
        if axis == "vertical":
            gaps = self._find_vertical_whitespace_gaps(all_text_elements, min_gap, threshold)
            # Get all vertical guides across regions
            all_guides = []
            guide_to_region_map = {}  # Map guide coordinate to its original list of regions
            for coord, region in self._unified_vertical:
                all_guides.append(coord)
                guide_to_region_map.setdefault(coord, []).append(region)

            if gaps and all_guides:
                # Keep a copy of original guides to maintain mapping
                original_guides = all_guides.copy()

                # Snap guides to gaps
                self._snap_guides_to_gaps(all_guides, gaps, axis)

                # Update the unified view with snapped positions
                self._unified_vertical = []
                for i, new_coord in enumerate(all_guides):
                    # Find the original region for this guide using the original position
                    original_coord = original_guides[i]
                    # A guide might be associated with multiple regions, add them all
                    regions = guide_to_region_map.get(original_coord, [])
                    for region in regions:
                        self._unified_vertical.append((new_coord, region))

                # Update individual region guides
                for region in self._flow_guides:
                    region_verticals = []
                    for coord, r in self._unified_vertical:
                        if r == region:
                            region_verticals.append(coord)
                    self._flow_guides[region] = (
                        sorted(list(set(region_verticals))),  # Deduplicate here
                        self._flow_guides[region][1],
                    )

                # Invalidate cache
                self._vertical_cache = None

        elif axis == "horizontal":
            gaps = self._find_horizontal_whitespace_gaps(all_text_elements, min_gap, threshold)
            # Get all horizontal guides across regions
            all_guides = []
            guide_to_region_map = {}  # Map guide coordinate to its original list of regions
            for coord, region in self._unified_horizontal:
                all_guides.append(coord)
                guide_to_region_map.setdefault(coord, []).append(region)

            if gaps and all_guides:
                # Keep a copy of original guides to maintain mapping
                original_guides = all_guides.copy()

                # Snap guides to gaps
                self._snap_guides_to_gaps(all_guides, gaps, axis)

                # Update the unified view with snapped positions
                self._unified_horizontal = []
                for i, new_coord in enumerate(all_guides):
                    # Find the original region for this guide using the original position
                    original_coord = original_guides[i]
                    regions = guide_to_region_map.get(original_coord, [])
                    for region in regions:
                        self._unified_horizontal.append((new_coord, region))

                # Update individual region guides
                for region in self._flow_guides:
                    region_horizontals = []
                    for coord, r in self._unified_horizontal:
                        if r == region:
                            region_horizontals.append(coord)
                    self._flow_guides[region] = (
                        self._flow_guides[region][0],
                        sorted(list(set(region_horizontals))),  # Deduplicate here
                    )

                # Invalidate cache
                self._horizontal_cache = None

        else:
            raise ValueError("axis must be 'vertical' or 'horizontal'")

        return self

    # Original single-region logic
    # Get elements for trough detection
    text_elements = self._get_text_elements()
    if not text_elements:
        logger.warning("No text elements found for whitespace detection")
        return self

    if axis == "vertical":
        gaps = self._find_vertical_whitespace_gaps(text_elements, min_gap, threshold)
        if gaps:
            self._snap_guides_to_gaps(self.vertical.data, gaps, axis)
    elif axis == "horizontal":
        gaps = self._find_horizontal_whitespace_gaps(text_elements, min_gap, threshold)
        if gaps:
            self._snap_guides_to_gaps(self.horizontal.data, gaps, axis)
    else:
        raise ValueError("axis must be 'vertical' or 'horizontal'")

    # Ensure all coordinates are Python floats (not numpy types)
    self.vertical.data[:] = [float(x) for x in self.vertical.data]
    self.horizontal.data[:] = [float(y) for y in self.horizontal.data]

    return self
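A sketch of refining roughly placed column guides (placeholder page as above; the exact snap positions depend on the document's whitespace):

```python
import natural_pdf as npdf
from natural_pdf import Guides

page = npdf.PDF("report.pdf").pages[0]  # placeholder file

# Roughly placed column guides, nudged into the nearest whitespace troughs
guides = Guides(verticals=[190, 410], horizontals=[], context=page)
guides.snap_to_whitespace(axis="vertical", min_gap=8.0, detection_method="text")
```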
natural_pdf.Guides.to_absolute(bounds)

Convert relative coordinates to absolute coordinates.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| bounds | Tuple[float, float, float, float] | Target bounding box (x0, y0, x1, y1) | required |

Returns:

| Type | Description |
| --- | --- |
| Guides | New Guides object with absolute coordinates |

Source code in natural_pdf/analyzers/guides.py, lines 3120-3146
def to_absolute(self, bounds: Tuple[float, float, float, float]) -> "Guides":
    """
    Convert relative coordinates to absolute coordinates.

    Args:
        bounds: Target bounding box (x0, y0, x1, y1)

    Returns:
        New Guides object with absolute coordinates
    """
    if not self.relative:
        return self  # Already absolute

    x0, y0, x1, y1 = bounds
    width = x1 - x0
    height = y1 - y0

    abs_verticals = [x0 + x * width for x in self.vertical]
    abs_horizontals = [y0 + y * height for y in self.horizontal]

    return Guides(
        verticals=abs_verticals,
        horizontals=abs_horizontals,
        context=self.context,
        bounds=bounds,
        relative=False,
    )
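A worked sketch using a US Letter page (612 x 792 points); the arithmetic follows the source above, and the bounds/relative constructor keywords appear in it:

```python
from natural_pdf import Guides

rel = Guides(verticals=[0.25, 0.75], horizontals=[0.5],
             bounds=(0, 0, 1, 1), relative=True)
abs_guides = rel.to_absolute((0, 0, 612, 792))
# verticals:   0 + 0.25 * 612 = 153.0,  0 + 0.75 * 612 = 459.0
# horizontals: 0 + 0.5  * 792 = 396.0
```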
natural_pdf.Guides.to_dict()

Convert to dictionary format suitable for pdfplumber table_settings.

Returns:

| Type | Description |
| --- | --- |
| Dict[str, Any] | Dictionary with explicit_vertical_lines and explicit_horizontal_lines |

Source code in natural_pdf/analyzers/guides.py, lines 3080-3090
def to_dict(self) -> Dict[str, Any]:
    """
    Convert to dictionary format suitable for pdfplumber table_settings.

    Returns:
        Dictionary with explicit_vertical_lines and explicit_horizontal_lines
    """
    return {
        "explicit_vertical_lines": self.vertical,
        "explicit_horizontal_lines": self.horizontal,
    }
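A sketch of handing the result to a pdfplumber-style extractor. The vertical_strategy/horizontal_strategy keys are standard pdfplumber table settings; whether extract_table here accepts a table_settings= argument is an assumption:

```python
import natural_pdf as npdf
from natural_pdf import Guides

page = npdf.PDF("report.pdf").pages[0]  # placeholder file
guides = Guides(verticals=[72, 200, 400, 540],
                horizontals=[100, 300, 500], context=page)

settings = guides.to_dict()
# pdfplumber also needs the strategies set to 'explicit' to use these lines
settings.update({"vertical_strategy": "explicit",
                 "horizontal_strategy": "explicit"})
table = page.extract_table(table_settings=settings)  # assumed parameter name
```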
natural_pdf.Guides.to_relative()

Convert absolute coordinates to relative (0-1) coordinates.

Returns:

| Type | Description |
| --- | --- |
| Guides | New Guides object with relative coordinates |

Source code in natural_pdf/analyzers/guides.py, lines 3092-3118
def to_relative(self) -> "Guides":
    """
    Convert absolute coordinates to relative (0-1) coordinates.

    Returns:
        New Guides object with relative coordinates
    """
    if self.relative:
        return self  # Already relative

    if not self.bounds:
        raise ValueError("Cannot convert to relative without bounds")

    x0, y0, x1, y1 = self.bounds
    width = x1 - x0
    height = y1 - y0

    rel_verticals = [(x - x0) / width for x in self.vertical]
    rel_horizontals = [(y - y0) / height for y in self.horizontal]

    return Guides(
        verticals=rel_verticals,
        horizontals=rel_horizontals,
        context=self.context,
        bounds=(0, 0, 1, 1),
        relative=True,
    )
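The inverse of to_absolute(); a sketch reversing the earlier example (the bounds keyword appears in the source above; without bounds this raises ValueError):

```python
from natural_pdf import Guides

guides = Guides(verticals=[153, 459], horizontals=[396],
                bounds=(0, 0, 612, 792))
rel = guides.to_relative()
# rel.vertical == [0.25, 0.75]; rel.horizontal == [0.5]
```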
natural_pdf.Judge

Visual classifier for regions using simple image metrics.

Requires class labels to be specified. For binary classification, requires at least one example of each class before making decisions.

Examples:

Checkbox detection:

```python
judge = Judge("checkboxes", labels=["unchecked", "checked"])
judge.add(empty_box, "unchecked")
judge.add(marked_box, "checked")

result = judge.decide(new_box)
if result.label == "checked":
    print("Box is checked!")
```

Signature detection:

```python
judge = Judge("signatures", labels=["unsigned", "signed"])
judge.add(blank_area, "unsigned")
judge.add(signature_area, "signed")

result = judge.decide(new_region)
print(f"Classification: {result.label} (confidence: {result.score})")
```

Source code in natural_pdf/judge.py, lines 33-1509
class Judge:
    """
    Visual classifier for regions using simple image metrics.

    Requires class labels to be specified. For binary classification,
    requires at least one example of each class before making decisions.

    Examples:
        Checkbox detection:
        ```python
        judge = Judge("checkboxes", labels=["unchecked", "checked"])
        judge.add(empty_box, "unchecked")
        judge.add(marked_box, "checked")

        result = judge.decide(new_box)
        if result.label == "checked":
            print("Box is checked!")
        ```

        Signature detection:
        ```python
        judge = Judge("signatures", labels=["unsigned", "signed"])
        judge.add(blank_area, "unsigned")
        judge.add(signature_area, "signed")

        result = judge.decide(new_region)
        print(f"Classification: {result.label} (confidence: {result.score})")
        ```
    """

    def __init__(
        self,
        name: str,
        labels: List[str],
        base_dir: Optional[str] = None,
        target_prior: Optional[float] = None,
    ):
        """
        Initialize a Judge for visual classification.

        Args:
            name: Name for this judge (used for folder name)
            labels: Class labels (required, typically 2 for binary classification)
            base_dir: Base directory for storage. Defaults to current directory
            target_prior: Target prior probability for the FIRST label in the labels list.
                         - 0.5 (default) = neutral, treats both classes equally
                         - >0.5 = favors labels[0]
                         - <0.5 = favors labels[1]
                         Example: Judge("cb", ["checked", "unchecked"], target_prior=0.6)
                         favors detecting "checked" checkboxes.
        """
        if not labels or len(labels) != 2:
            raise JudgeError("Judge requires exactly 2 class labels (binary classification only)")

        self.name = name
        self.labels = labels
        self.target_prior = target_prior if target_prior is not None else 0.5

        # Set up directory structure
        self.base_dir = Path(base_dir) if base_dir else Path.cwd()
        self.root_dir = self.base_dir / name
        self.root_dir.mkdir(exist_ok=True)

        # Create label directories
        for label in self.labels:
            (self.root_dir / label).mkdir(exist_ok=True)
        (self.root_dir / "unlabeled").mkdir(exist_ok=True)
        (self.root_dir / "_removed").mkdir(exist_ok=True)

        # Config file
        self.config_path = self.root_dir / "judge.json"

        # Load existing config or initialize
        self.thresholds = {}
        self.metrics_info = {}
        if self.config_path.exists():
            self._load_config()

    def add(self, region, label: Optional[str] = None) -> None:
        """
        Add a region to the judge's dataset.

        Args:
            region: Region object to add
            label: Class label. If None, added to unlabeled for later teaching

        Raises:
            JudgeError: If label is not in allowed labels
        """
        if label is not None and label not in self.labels:
            raise JudgeError(f"Label '{label}' not in allowed labels: {self.labels}")

        # Render region to image
        try:
            img = region.render(crop=True)
            if not isinstance(img, Image.Image):
                img = Image.fromarray(img)
        except Exception as e:
            raise JudgeError(f"Failed to render region: {e}")

        # Convert to RGB if needed
        if img.mode != "RGB":
            img = img.convert("RGB")

        # Generate hash from image content
        img_array = np.array(img)
        img_hash = hashlib.md5(img_array.tobytes()).hexdigest()[:12]

        # Determine target directory
        target_dir = self.root_dir / (label if label else "unlabeled")
        target_path = target_dir / f"{img_hash}.png"

        # Check if hash already exists anywhere
        existing_locations = []
        for check_label in self.labels + ["unlabeled", "_removed"]:
            check_path = self.root_dir / check_label / f"{img_hash}.png"
            if check_path.exists():
                existing_locations.append(check_label)

        if existing_locations:
            logger.warning(f"Duplicate image detected (hash: {img_hash})")
            logger.warning(f"Already exists in: {', '.join(existing_locations)}")
            print(f"⚠️  Duplicate image - already exists in: {', '.join(existing_locations)}")
            return

        # Save image
        img.save(target_path)
        logger.debug(f"Added image {img_hash} to {label if label else 'unlabeled'}")

    def teach(self, labels: Optional[List[str]] = None, review: bool = False) -> None:
        """
        Interactive teaching interface using IPython widgets.

        Args:
            labels: Labels to use for teaching. Defaults to self.labels
            review: If True, review already labeled images for re-classification
        """
        # Check for IPython environment
        try:
            import ipywidgets as widgets
            from IPython.display import clear_output, display
        except ImportError:
            raise JudgeError(
                "Teaching requires IPython and ipywidgets. Use 'pip install ipywidgets'"
            )

        labels = labels or self.labels

        # Get images to review
        if review:
            # Get all labeled images for review
            files_to_review = []
            for label in self.labels:
                label_dir = self.root_dir / label
                for img_path in sorted(label_dir.glob("*.png")):
                    files_to_review.append((img_path, label))

            if not files_to_review:
                print("No labeled images to review")
                return

            # Shuffle for review
            import random

            random.shuffle(files_to_review)
            review_files = [f[0] for f in files_to_review]
            original_labels = {str(f[0]): f[1] for f in files_to_review}
        else:
            # Get unlabeled images
            unlabeled_dir = self.root_dir / "unlabeled"
            review_files = sorted(unlabeled_dir.glob("*.png"))
            original_labels = {}

            if not review_files:
                print("No unlabeled images to teach")
                return

        # State for teaching
        self._teaching_state = {
            "current_index": 0,
            "labeled_count": 0,
            "removed_count": 0,
            "files": review_files,
            "labels": labels,
            "review_mode": review,
            "original_labels": original_labels,
        }

        # Create widgets
        image_widget = widgets.Image()
        status_label = widgets.Label()

        # Create buttons for labeling
        button_layout = widgets.Layout(width="auto", margin="5px")

        btn_prev = widgets.Button(description="↑ Previous", layout=button_layout)
        btn_class1 = widgets.Button(
            description=f"← {labels[0]}", layout=button_layout, button_style="primary"
        )
        btn_class2 = widgets.Button(
            description=f"→ {labels[1]}", layout=button_layout, button_style="success"
        )
        btn_skip = widgets.Button(description="↓ Skip", layout=button_layout)
        btn_remove = widgets.Button(
            description="✗ Remove", layout=button_layout, button_style="danger"
        )

        button_box = widgets.HBox([btn_prev, btn_class1, btn_class2, btn_skip, btn_remove])

        # Keyboard shortcuts info
        info_label = widgets.Label(
            value="Keys: ↑ prev | ← "
            + labels[0]
            + " | → "
            + labels[1]
            + " | ↓ skip | Delete remove"
        )

        def update_display():
            """Update the displayed image and status."""
            state = self._teaching_state
            if 0 <= state["current_index"] < len(state["files"]):
                img_path = state["files"][state["current_index"]]
                with open(img_path, "rb") as f:
                    image_widget.value = f.read()

                # Build status text
                status_text = f"Image {state['current_index'] + 1} of {len(state['files'])}"
                if state["review_mode"]:
                    current_label = state["original_labels"].get(str(img_path), "unknown")
                    status_text += f" (Currently: {current_label})"
                status_text += f" | Labeled: {state['labeled_count']}"
                if state["removed_count"] > 0:
                    status_text += f" | Removed: {state['removed_count']}"

                status_label.value = status_text

                # Update button states
                btn_prev.disabled = state["current_index"] == 0
            else:
                status_label.value = "Teaching complete!"
                # Hide the image widget instead of showing broken image
                image_widget.layout.display = "none"
                # Disable all buttons
                btn_prev.disabled = True
                btn_class1.disabled = True
                btn_class2.disabled = True
                btn_skip.disabled = True

                # Auto-retrain
                if state["labeled_count"] > 0 or state["removed_count"] > 0:
                    clear_output(wait=True)
                    print("Teaching complete!")
                    print(f"Labeled: {state['labeled_count']} images")
                    if state["removed_count"] > 0:
                        print(f"Removed: {state['removed_count']} images")

                    if state["labeled_count"] > 0:
                        print("\nRetraining with new examples...")
                        self._retrain()
                        print("✓ Training complete! Judge is ready to use.")
                else:
                    print("No changes made.")

        def move_file_to_class(class_index):
            """Move current file to specified class."""
            state = self._teaching_state
            if state["current_index"] >= len(state["files"]):
                return

            current_file = state["files"][state["current_index"]]
            target_dir = self.root_dir / labels[class_index]
            shutil.move(str(current_file), str(target_dir / current_file.name))
            state["labeled_count"] += 1
            state["current_index"] += 1
            update_display()

        # Button callbacks
        def on_prev(b):
            state = self._teaching_state
            if state["current_index"] > 0:
                state["current_index"] -= 1
                update_display()

        def on_class1(b):
            move_file_to_class(0)

        def on_class2(b):
            move_file_to_class(1)

        def on_skip(b):
            state = self._teaching_state
            state["current_index"] += 1
            update_display()

        def on_remove(b):
            state = self._teaching_state
            if state["current_index"] >= len(state["files"]):
                return

            current_file = state["files"][state["current_index"]]
            target_dir = self.root_dir / "_removed"
            shutil.move(str(current_file), str(target_dir / current_file.name))
            state["removed_count"] += 1
            state["current_index"] += 1
            update_display()

        # Connect buttons
        btn_prev.on_click(on_prev)
        btn_class1.on_click(on_class1)
        btn_class2.on_click(on_class2)
        btn_skip.on_click(on_skip)
        btn_remove.on_click(on_remove)

        # Create output widget for keyboard handling
        output = widgets.Output()

        # Keyboard event handler
        def on_key(event):
            """Handle keyboard events."""
            if event["type"] != "keydown":
                return

            key = event["key"]

            if key == "ArrowUp":
                on_prev(None)
            elif key == "ArrowLeft":
                on_class1(None)
            elif key == "ArrowRight":
                on_class2(None)
            elif key == "ArrowDown":
                on_skip(None)
            elif key in ["Delete", "Backspace"]:
                on_remove(None)

        # Display everything
        display(status_label)
        display(image_widget)
        display(button_box)
        display(info_label)
        display(output)

        # Show first image
        update_display()

        # Try to set up keyboard handling (may not work in all environments)
        try:
            from ipyevents import Event

            event_handler = Event(source=output, watched_events=["keydown"])
            event_handler.on_dom_event(on_key)
        except:
            # If ipyevents not available, just use buttons
            print("Note: Install ipyevents for keyboard shortcuts: pip install ipyevents")

    def decide(self, regions: Union["Region", List["Region"]]) -> Union[Decision, List[Decision]]:
        """
        Classify one or more regions.

        Args:
            regions: Single region or list of regions to classify

        Returns:
            Decision or list of Decisions with label and score

        Raises:
            JudgeError: If not enough training examples
        """
        # Check if we have examples
        for label in self.labels:
            label_dir = self.root_dir / label
            if not any(label_dir.glob("*.png")):
                raise JudgeError(f"Need at least one example of class '{label}' before deciding")

        # Ensure thresholds are current
        if not self.thresholds:
            self._retrain()

        # Handle single region
        single_input = not isinstance(regions, list)
        if single_input:
            regions = [regions]

        results = []
        for region in regions:
            # Extract metrics
            metrics = self._extract_metrics(region)

            # Apply thresholds with soft voting
            votes = {label: 0.0 for label in self.labels}
            total_weight = 0.0

            for metric_name, value in metrics.items():
                if metric_name in self.thresholds:
                    metric_info = self.thresholds[metric_name]
                    weight = metric_info["accuracy"]  # This is now Youden's J

                    # For binary classification
                    label1, label2 = self.labels
                    threshold1, direction1 = metric_info["thresholds"][label1]

                    # Get standard deviations for soft voting
                    stats = self.metrics_info.get(metric_name, {})
                    s1 = stats.get(f"std_{label1}", 0.0)
                    s2 = stats.get(f"std_{label2}", 0.0)
                    scale1 = s1 if s1 > 1e-6 else 1.0
                    scale2 = s2 if s2 > 1e-6 else 1.0

                    # Calculate signed margin (positive favors label1, negative favors label2)
                    if direction1 == "higher":
                        margin = (value - threshold1) / (scale1 if value >= threshold1 else scale2)
                    else:
                        margin = (threshold1 - value) / (scale1 if value <= threshold1 else scale2)

                    # Clip margin to avoid single metric dominating
                    margin = np.clip(margin, -6, 6)

                    # Soft votes using sigmoid
                    p1 = 1.0 / (1.0 + np.exp(-margin))
                    p2 = 1.0 - p1

                    votes[label1] += weight * p1
                    votes[label2] += weight * p2
                    total_weight += weight

            # Normalize votes
            if total_weight > 0:
                for label in votes:
                    votes[label] /= total_weight
            else:
                # Fallback: uniform votes so prior still works
                for label in votes:
                    votes[label] = 0.5
                total_weight = 1.0

            # Apply prior bias correction
            def _logit(p, eps=1e-6):
                p = max(eps, min(1 - eps, p))
                return np.log(p / (1 - p))

            def _sigmoid(x):
                if x >= 0:
                    z = np.exp(-x)
                    return 1.0 / (1.0 + z)
                else:
                    z = np.exp(x)
                    return z / (1.0 + z)

            # Estimate priors from training counts
            counts = self._get_training_counts()
            label1, label2 = self.labels
            n1 = counts.get(label1, 0)
            n2 = counts.get(label2, 0)
            total = max(1, n1 + n2)

            if n1 > 0 and n2 > 0:  # Only apply bias if we have examples of both classes
                emp_prior1 = n1 / total
                emp_prior2 = n2 / total

                # Target prior (0.5/0.5 neutralizes imbalance)
                target_prior1 = self.target_prior
                target_prior2 = 1.0 - self.target_prior

                # Calculate bias
                bias1 = _logit(target_prior1) - _logit(emp_prior1)
                bias2 = _logit(target_prior2) - _logit(emp_prior2)

                # Apply bias in logit space
                v1 = _sigmoid(_logit(votes[label1]) + bias1)
                v2 = _sigmoid(_logit(votes[label2]) + bias2)

                # Renormalize
                s = v1 + v2
                votes[label1] = v1 / s
                votes[label2] = v2 / s

            # Find best label
            best_label = max(votes.items(), key=lambda x: x[1])
            results.append(Decision(label=best_label[0], score=best_label[1]))

        return results[0] if single_input else results

    def pick(
        self, target_label: str, regions: List["Region"], labels: Optional[List[str]] = None
    ) -> PickResult:
        """
        Pick which region best matches the target label.

        Args:
            target_label: The class label to look for
            regions: List of regions to choose from
            labels: Optional human-friendly labels for each region

        Returns:
            PickResult with winning region, index, label (if provided), and score

        Raises:
            JudgeError: If target_label not in allowed labels
        """
        if target_label not in self.labels:
            raise JudgeError(f"Target label '{target_label}' not in allowed labels: {self.labels}")

        # Classify all regions
        decisions = self.decide(regions)

        # Find best match for target label
        best_index = -1
        best_score = -1.0

        for i, decision in enumerate(decisions):
            if decision.label == target_label and decision.score > best_score:
                best_score = decision.score
                best_index = i

        if best_index == -1:
            # No region matched the target label
            raise JudgeError(f"No region classified as '{target_label}'")

        # Build result
        region = regions[best_index]
        label = labels[best_index] if labels and best_index < len(labels) else None

        return PickResult(region=region, index=best_index, label=label, score=best_score)

    def count(self, target_label: str, regions: List["Region"]) -> int:
        """
        Count how many regions match the target label.

        Args:
            target_label: The class label to count
            regions: List of regions to check

        Returns:
            Number of regions classified as target_label
        """
        decisions = self.decide(regions)
        return sum(1 for d in decisions if d.label == target_label)

    def info(self) -> None:
        """
        Show configuration and training information for this Judge.
        """
        print(f"Judge: {self.name}")
        print(f"Labels: {self.labels}")
        if self.target_prior != 0.5:
            print(
                f"Target prior: {self.target_prior:.2f} (favors '{self.labels[0]}')"
                if self.target_prior > 0.5
                else f"Target prior: {self.target_prior:.2f} (favors '{self.labels[1]}')"
            )

        # Get training counts
        counts = self._get_training_counts()
        print(f"\nTraining examples:")
        for label in self.labels:
            count = counts.get(label, 0)
            print(f"  {label}: {count}")

        if counts.get("unlabeled", 0) > 0:
            print(f"  unlabeled: {counts['unlabeled']}")

        # Show actual imbalance
        labeled_counts = [counts.get(label, 0) for label in self.labels]
        if all(c > 0 for c in labeled_counts):
            max_count = max(labeled_counts)
            min_count = min(labeled_counts)
            if max_count != min_count:
                # Find which is which
                for i, label in enumerate(self.labels):
                    if counts.get(label, 0) == max_count:
                        majority_label = label
                    if counts.get(label, 0) == min_count:
                        minority_label = label

                ratio = max_count / min_count
                print(
                    f"\nClass imbalance: {majority_label}:{minority_label} = {max_count}:{min_count} ({ratio:.1f}:1)"
                )

                print("  Using Youden's J weights with soft voting and prior correction")

    def inspect(self, preview: bool = True) -> None:
        """
        Inspect all stored examples, showing their true labels and predicted labels/scores.
        Useful for debugging classification issues.

        Args:
            preview: If True (default), display images inline in HTML tables (requires IPython/Jupyter).
                     If False, use text-only output.
        """
        if not self.thresholds:
            print("No trained model yet. Add examples and the model will auto-train.")
            return

        if not preview:
            # Show basic info first
            self.info()
            print("-" * 80)

            print("\nThresholds learned:")
            for metric, info in self.thresholds.items():
                weight = info["accuracy"]  # This is now Youden's J
                selection_acc = info.get(
                    "selection_accuracy", info["accuracy"]
                )  # Fallback for old models
                print(f"  {metric}: weight={weight:.3f} (selection_accuracy={selection_acc:.3f})")
                for label, (threshold, direction) in info["thresholds"].items():
                    print(f"    {label}: {direction} than {threshold:.3f}")

                # Show metric distribution info if available
                if metric in self.metrics_info:
                    metric_stats = self.metrics_info[metric]
                    for label in self.labels:
                        mean_key = f"mean_{label}"
                        std_key = f"std_{label}"
                        if mean_key in metric_stats:
                            print(
                                f"    {label} distribution: mean={metric_stats[mean_key]:.3f}, std={metric_stats[std_key]:.3f}"
                            )

        if preview:
            # HTML preview mode
            try:
                import base64
                import io

                from IPython.display import HTML, display
            except ImportError:
                print("Preview mode requires IPython/Jupyter. Falling back to text mode.")
                preview = False

        if preview:
            # Build HTML tables for everything
            html_parts = []
            html_parts.append("<style>")
            html_parts.append("table { border-collapse: collapse; margin: 20px 0; }")
            html_parts.append("th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }")
            html_parts.append("th { background-color: #f2f2f2; font-weight: bold; }")
            html_parts.append("img { max-width: 60px; max-height: 60px; }")
            html_parts.append(".correct { color: green; }")
            html_parts.append(".incorrect { color: red; }")
            html_parts.append(".metrics { font-size: 0.9em; color: #666; }")
            html_parts.append("h3 { margin-top: 30px; }")
            html_parts.append(".imbalance-warning { background-color: #fff3cd; color: #856404; }")
            html_parts.append("</style>")

            # Configuration table
            html_parts.append("<h3>Judge Configuration</h3>")
            html_parts.append("<table>")
            html_parts.append("<tr><th>Property</th><th>Value</th></tr>")
            html_parts.append(f"<tr><td>Name</td><td>{self.name}</td></tr>")
            html_parts.append(f"<tr><td>Labels</td><td>{', '.join(self.labels)}</td></tr>")
            html_parts.append(f"<tr><td>Target Prior</td><td>{self.target_prior:.2f}")
            if self.target_prior != 0.5:
                html_parts.append(
                    f" (favors '{self.labels[0] if self.target_prior > 0.5 else self.labels[1]}')"
                )
            html_parts.append("</td></tr>")
            html_parts.append("</table>")

            # Training counts table
            counts = self._get_training_counts()
            html_parts.append("<h3>Training Examples</h3>")
            html_parts.append("<table>")
            html_parts.append("<tr><th>Class</th><th>Count</th></tr>")

            # Check for imbalance
            labeled_counts = [counts.get(label, 0) for label in self.labels]
            is_imbalanced = False
            if all(c > 0 for c in labeled_counts):
                max_count = max(labeled_counts)
                min_count = min(labeled_counts)
                if max_count != min_count:
                    ratio = max_count / min_count
                    is_imbalanced = ratio > 1.5

            for label in self.labels:
                count = counts.get(label, 0)
                row_class = ""
                if is_imbalanced:
                    if count == max(labeled_counts):
                        row_class = ' class="imbalance-warning"'
                html_parts.append(f"<tr{row_class}><td>{label}</td><td>{count}</td></tr>")

            if counts.get("unlabeled", 0) > 0:
                html_parts.append(f"<tr><td>unlabeled</td><td>{counts['unlabeled']}</td></tr>")

            html_parts.append("</table>")

            if is_imbalanced:
                html_parts.append(
                    f"<p><em>Class imbalance detected ({ratio:.1f}:1). Using Youden's J weights with prior correction.</em></p>"
                )

            # Thresholds table
            html_parts.append("<h3>Learned Thresholds</h3>")
            html_parts.append("<table>")
            html_parts.append(
                "<tr><th>Metric</th><th>Weight (Youden's J)</th><th>Selection Accuracy</th><th>Threshold Details</th></tr>"
            )

            for metric, info in self.thresholds.items():
                weight = info["accuracy"]  # This is Youden's J
                selection_acc = info.get("selection_accuracy", weight)

                # Build threshold details
                details = []
                for label, (threshold, direction) in info["thresholds"].items():
                    details.append(f"<br>{label}: {direction} than {threshold:.3f}")

                # Add distribution info if available
                if metric in self.metrics_info:
                    metric_stats = self.metrics_info[metric]
                    details.append("<br><em>Distributions:</em>")
                    for label in self.labels:
                        mean_key = f"mean_{label}"
                        std_key = f"std_{label}"
                        if mean_key in metric_stats:
                            details.append(
                                f"<br>&nbsp;&nbsp;{label}: μ={metric_stats[mean_key]:.1f}, σ={metric_stats[std_key]:.1f}"
                            )

                html_parts.append("<tr>")
                html_parts.append(f"<td>{metric}</td>")
                html_parts.append(f"<td>{weight:.3f}</td>")
                html_parts.append(f"<td>{selection_acc:.3f}</td>")
                html_parts.append(f"<td>{''.join(details)}</td>")
                html_parts.append("</tr>")

            html_parts.append("</table>")

            all_correct = 0
            all_total = 0

            # First show labeled examples
            for true_label in self.labels:
                label_dir = self.root_dir / true_label
                examples = list(label_dir.glob("*.png"))

                if not examples:
                    continue

                html_parts.append(
                    f"<h3>Predictions: {true_label.upper()} ({len(examples)} total)</h3>"
                )
                html_parts.append("<table>")
                html_parts.append(
                    "<tr><th>Image</th><th>Status</th><th>Predicted</th><th>Score</th><th>Key Metrics</th></tr>"
                )

                correct = 0

                for img_path in sorted(examples)[:20]:  # Show max 20 per class in preview
                    # Load image
                    img = Image.open(img_path)
                    mock_region = type("MockRegion", (), {"render": lambda self, crop=True: img})()

                    # Get prediction
                    decision = self.decide(mock_region)
                    is_correct = decision.label == true_label
                    if is_correct:
                        correct += 1

                    # Extract metrics
                    metrics = self._extract_metrics(mock_region)

                    # Convert image to base64
                    buffered = io.BytesIO()
                    img.save(buffered, format="PNG")
                    img_str = base64.b64encode(buffered.getvalue()).decode()

                    # Build row
                    status_class = "correct" if is_correct else "incorrect"
                    status_symbol = "✓" if is_correct else "✗"

                    # Format key metrics
                    metric_strs = []
                    for metric, value in sorted(metrics.items()):
                        if metric in self.thresholds:
                            metric_strs.append(f"{metric}={value:.1f}")
                    metrics_html = "<br>".join(metric_strs[:3])

                    html_parts.append("<tr>")
                    html_parts.append(f'<td><img src="data:image/png;base64,{img_str}" /></td>')
                    html_parts.append(f'<td class="{status_class}">{status_symbol}</td>')
                    html_parts.append(f"<td>{decision.label}</td>")
                    html_parts.append(f"<td>{decision.score:.3f}</td>")
                    html_parts.append(f'<td class="metrics">{metrics_html}</td>')
                    html_parts.append("</tr>")

                html_parts.append("</table>")

                # Accuracy here covers the examples actually shown (max 20)
                shown_count = min(len(examples), 20)
                accuracy = correct / shown_count if shown_count else 0
                all_correct += correct
                all_total += shown_count

                if len(examples) > 20:
                    html_parts.append(f"<p><em>... and {len(examples) - 20} more</em></p>")
                html_parts.append(
                    f"<p>Accuracy for {true_label}: <strong>{accuracy:.1%}</strong> ({correct}/{shown_count})</p>"
                )

            if all_total > 0:
                overall_accuracy = all_correct / all_total
                html_parts.append(
                    f"<h3>Overall accuracy: {overall_accuracy:.1%} ({all_correct}/{all_total})</h3>"
                )

            # Now show unlabeled examples with predictions
            unlabeled_dir = self.root_dir / "unlabeled"
            unlabeled_examples = list(unlabeled_dir.glob("*.png"))

            if unlabeled_examples:
                html_parts.append(
                    f"<h3>Predictions: UNLABELED ({len(unlabeled_examples)} total)</h3>"
                )
                html_parts.append("<table>")
                html_parts.append(
                    "<tr><th>Image</th><th>Predicted</th><th>Score</th><th>Key Metrics</th></tr>"
                )

                for img_path in sorted(unlabeled_examples)[:20]:  # Show max 20
                    # Load image
                    img = Image.open(img_path)
                    mock_region = type("MockRegion", (), {"render": lambda self, crop=True: img})()

                    # Get prediction
                    decision = self.decide(mock_region)

                    # Extract metrics
                    metrics = self._extract_metrics(mock_region)

                    # Convert image to base64
                    buffered = io.BytesIO()
                    img.save(buffered, format="PNG")
                    img_str = base64.b64encode(buffered.getvalue()).decode()

                    # Format key metrics
                    metric_strs = []
                    for metric, value in sorted(metrics.items()):
                        if metric in self.thresholds:
                            metric_strs.append(f"{metric}={value:.1f}")
                    metrics_html = "<br>".join(metric_strs[:3])

                    html_parts.append("<tr>")
                    html_parts.append(f'<td><img src="data:image/png;base64,{img_str}" /></td>')
                    html_parts.append(f"<td>{decision.label}</td>")
                    html_parts.append(f"<td>{decision.score:.3f}</td>")
                    html_parts.append(f'<td class="metrics">{metrics_html}</td>')
                    html_parts.append("</tr>")

                html_parts.append("</table>")

                if len(unlabeled_examples) > 20:
                    html_parts.append(
                        f"<p><em>... and {len(unlabeled_examples) - 20} more</em></p>"
                    )

            # Display HTML
            display(HTML("".join(html_parts)))

        else:
            # Text mode (original)
            print("\nPredictions on training data:")
            print("-" * 80)

            # Test each labeled example
            all_correct = 0
            all_total = 0

            for true_label in self.labels:
                label_dir = self.root_dir / true_label
                examples = list(label_dir.glob("*.png"))

                if not examples:
                    continue

                print(f"\n{true_label.upper()} examples ({len(examples)} total):")
                correct = 0

                for img_path in sorted(examples)[:10]:  # Show max 10 per class
                    # Load image and create mock region
                    img = Image.open(img_path)
                    mock_region = type("MockRegion", (), {"render": lambda self, crop=True: img})()

                    # Get prediction
                    decision = self.decide(mock_region)
                    is_correct = decision.label == true_label
                    if is_correct:
                        correct += 1

                    # Extract metrics for this example
                    metrics = self._extract_metrics(mock_region)

                    # Show result
                    status = "✓" if is_correct else "✗"
                    print(
                        f"  {status} {img_path.name}: predicted={decision.label} (score={decision.score:.3f})"
                    )

                    # Show key metric values
                    metric_strs = []
                    for metric, value in sorted(metrics.items()):
                        if metric in self.thresholds:
                            metric_strs.append(f"{metric}={value:.2f}")
                    if metric_strs:
                        print(f"     Metrics: {', '.join(metric_strs[:3])}")

                # Accuracy here covers the examples actually shown (max 10)
                shown_count = min(len(examples), 10)
                accuracy = correct / shown_count if shown_count else 0
                all_correct += correct
                all_total += shown_count

                if len(examples) > 10:
                    print(f"  ... and {len(examples) - 10} more")
                print(f"  Accuracy for {true_label}: {accuracy:.1%} ({correct}/{shown_count})")

            if all_total > 0:
                overall_accuracy = all_correct / all_total
                print(f"\nOverall accuracy: {overall_accuracy:.1%} ({all_correct}/{all_total})")

            # Show unlabeled examples with predictions
            unlabeled_dir = self.root_dir / "unlabeled"
            unlabeled_examples = list(unlabeled_dir.glob("*.png"))

            if unlabeled_examples:
                print(f"\nUNLABELED examples ({len(unlabeled_examples)} total) - predictions:")

                for img_path in sorted(unlabeled_examples)[:10]:  # Show max 10
                    # Load image and create mock region
                    img = Image.open(img_path)
                    mock_region = type("MockRegion", (), {"render": lambda self, crop=True: img})()

                    # Get prediction
                    decision = self.decide(mock_region)

                    # Extract metrics
                    metrics = self._extract_metrics(mock_region)

                    print(
                        f"  {img_path.name}: predicted={decision.label} (score={decision.score:.3f})"
                    )

                    # Show key metric values
                    metric_strs = []
                    for metric, value in sorted(metrics.items()):
                        if metric in self.thresholds:
                            metric_strs.append(f"{metric}={value:.2f}")
                    if metric_strs:
                        print(f"     Metrics: {', '.join(metric_strs[:3])}")

                if len(unlabeled_examples) > 10:
                    print(f"  ... and {len(unlabeled_examples) - 10} more")

    def lookup(self, region) -> Optional[Tuple[str, Image.Image]]:
        """
        Look up a region and return its hash and image if found in training data.

        Args:
            region: Region to look up

        Returns:
            Tuple of (hash, image) if found, None if not found
        """
        try:
            # Generate hash for the region
            img = region.render(crop=True)
            if not isinstance(img, Image.Image):
                img = Image.fromarray(img)
            if img.mode != "RGB":
                img = img.convert("RGB")
            img_array = np.array(img)
            img_hash = hashlib.md5(img_array.tobytes()).hexdigest()[:12]

            # Look in every label directory plus unlabeled/_removed
            for subdir in self.labels + ["unlabeled", "_removed"]:

                img_path = self.root_dir / subdir / f"{img_hash}.png"
                if img_path.exists():
                    stored_img = Image.open(img_path)
                    logger.debug(f"Found region in '{subdir}' with hash {img_hash}")
                    return (img_hash, stored_img)

            logger.debug(f"Region not found in training data (hash: {img_hash})")
            return None

        except Exception as e:
            logger.error(f"Failed to lookup region: {e}")
            return None

    def show(self, max_per_class: int = 10, size: Tuple[int, int] = (100, 100)) -> None:
        """
        Display a grid showing examples from each category.

        Args:
            max_per_class: Maximum number of examples to show per class
            size: Size of each image in pixels (width, height)
        """
        try:
            import ipywidgets as widgets
            from IPython.display import display
            from PIL import Image as PILImage
        except ImportError:
            print("Show requires IPython and ipywidgets")
            return

        # Collect images from each category
        categories = {}
        total_counts = {}
        for label in self.labels:
            label_dir = self.root_dir / label
            all_images = list(label_dir.glob("*.png"))
            total_counts[label] = len(all_images)
            images = sorted(all_images)[:max_per_class]
            if images:
                categories[label] = images

        # Add unlabeled if any
        unlabeled_dir = self.root_dir / "unlabeled"
        all_unlabeled = list(unlabeled_dir.glob("*.png"))
        total_counts["unlabeled"] = len(all_unlabeled)
        unlabeled = sorted(all_unlabeled)[:max_per_class]
        if unlabeled:
            categories["unlabeled"] = unlabeled

        if not categories:
            print("No images to show")
            return

        # Create grid layout
        rows = []

        # Check for class imbalance
        labeled_counts = {k: v for k, v in total_counts.items() if k != "unlabeled"}
        if labeled_counts and len(labeled_counts) >= 2:
            max_count = max(labeled_counts.values())
            min_count = min(labeled_counts.values())
            if min_count > 0 and max_count / min_count > 3:
                warning = widgets.HTML(
                    f'<div style="background: #fff3cd; padding: 10px; margin: 10px 0; border: 1px solid #ffeeba; border-radius: 4px;">'
                    f"<strong>⚠️ Class imbalance detected:</strong> {labeled_counts}<br>"
                    f"Consider adding more examples of the minority class for better accuracy."
                    f"</div>"
                )
                rows.append(warning)

        for category, image_paths in categories.items():
            # Category header showing total count
            shown = len(image_paths)
            total = total_counts[category]
            header_text = f"<h3>{category}"
            if shown < total:
                header_text += f" ({shown} of {total} shown)"
            else:
                header_text += f" ({total} total)"
            header_text += "</h3>"
            header = widgets.HTML(header_text)

            # Image row
            image_widgets = []
            for img_path in image_paths:
                # Load and resize image
                img = PILImage.open(img_path)
                img.thumbnail(size, PILImage.Resampling.LANCZOS)

                # Convert to bytes for display
                import io

                img_bytes = io.BytesIO()
                img.save(img_bytes, format="PNG")
                img_bytes.seek(0)

                # Create image widget
                img_widget = widgets.Image(value=img_bytes.read(), width=size[0], height=size[1])
                image_widgets.append(img_widget)

            # Create horizontal box for this category
            category_box = widgets.VBox([header, widgets.HBox(image_widgets)])
            rows.append(category_box)

        # Display all categories
        display(widgets.VBox(rows))

    def forget(self, region: Optional["Region"] = None, delete: bool = False) -> None:
        """
        Clear training data, delete all files, or move a specific region to unlabeled.

        Args:
            region: If provided, move this specific region to unlabeled
            delete: If True, permanently delete all files
        """
        # Handle specific region case
        if region is not None:
            # Get hash of the region
            try:
                img = region.render(crop=True)
                if not isinstance(img, Image.Image):
                    img = Image.fromarray(img)
                if img.mode != "RGB":
                    img = img.convert("RGB")
                img_array = np.array(img)
                img_hash = hashlib.md5(img_array.tobytes()).hexdigest()[:12]
            except Exception as e:
                logger.error(f"Failed to hash region: {e}")
                return

            # Find and move the image
            moved = False
            for label in self.labels + ["_removed"]:
                source_path = self.root_dir / label / f"{img_hash}.png"
                if source_path.exists():
                    target_path = self.root_dir / "unlabeled" / f"{img_hash}.png"
                    shutil.move(str(source_path), str(target_path))
                    print(f"Moved region from '{label}' to 'unlabeled'")
                    moved = True
                    break

            if not moved:
                print(f"Region not found in training data")
            return

        # Handle delete or clear training
        if delete:
            # Delete entire directory
            if self.root_dir.exists():
                shutil.rmtree(self.root_dir)
                print(f"Deleted all data for judge '{self.name}'")
            else:
                print(f"No data found for judge '{self.name}'")

            # Reset internal state
            self.thresholds = {}
            self.metrics_info = {}

            # Recreate directory structure
            self.root_dir.mkdir(exist_ok=True)
            for label in self.labels:
                (self.root_dir / label).mkdir(exist_ok=True)
            (self.root_dir / "unlabeled").mkdir(exist_ok=True)
            (self.root_dir / "_removed").mkdir(exist_ok=True)

        else:
            # Just clear training (move everything to unlabeled)
            moved_count = 0

            # Move all labeled images back to unlabeled
            unlabeled_dir = self.root_dir / "unlabeled"
            for label in self.labels:
                label_dir = self.root_dir / label
                if label_dir.exists():
                    for img_path in label_dir.glob("*.png"):
                        shutil.move(str(img_path), str(unlabeled_dir / img_path.name))
                        moved_count += 1

            # Clear thresholds
            self.thresholds = {}
            self.metrics_info = {}

            # Remove saved config
            if self.config_path.exists():
                self.config_path.unlink()

            print(f"Moved {moved_count} labeled images back to unlabeled.")
            print("Training data cleared. Judge is now untrained.")

    def save(self, path: Optional[str] = None) -> None:
        """
        Save the judge configuration (auto-retrains first).

        Args:
            path: Optional path to save to. Defaults to judge.json in root directory
        """
        # Retrain with current examples
        self._retrain()

        # Save config
        save_path = Path(path) if path else self.config_path

        config = {
            "name": self.name,
            "labels": self.labels,
            "target_prior": self.target_prior,
            "thresholds": self.thresholds,
            "metrics_info": self.metrics_info,
            "training_counts": self._get_training_counts(),
        }

        with open(save_path, "w") as f:
            json.dump(config, f, indent=2)

        logger.info(f"Saved judge to {save_path}")

    @classmethod
    def load(cls, path: str) -> "Judge":
        """
        Load a judge from a saved configuration.

        Args:
            path: Path to the saved judge.json file or the judge directory

        Returns:
            Loaded Judge instance
        """
        path = Path(path)

        # If path is a directory, look for judge.json inside
        if path.is_dir():
            config_path = path / "judge.json"
            base_dir = path.parent
            name = path.name
        else:
            config_path = path
            base_dir = path.parent.parent if path.parent.name != "." else path.parent
            # Try to infer name from path
            name = None

        with open(config_path, "r") as f:
            config = json.load(f)

        # Use saved name if we couldn't infer it
        if name is None:
            name = config["name"]

        # Create judge with saved config
        judge = cls(
            name,
            labels=config["labels"],
            base_dir=base_dir,
            target_prior=config.get("target_prior", 0.5),
        )  # Default to 0.5 for old configs
        judge.thresholds = config["thresholds"]
        judge.metrics_info = config.get("metrics_info", {})

        return judge

    # Private methods

    def _extract_metrics(self, region) -> Dict[str, float]:
        """Extract image metrics from a region."""
        try:
            img = region.render(crop=True)
            if not isinstance(img, Image.Image):
                img = Image.fromarray(img)

            # Convert to grayscale for analysis
            gray = np.array(img.convert("L"))

            metrics = {}

            # 1. Center darkness
            h, w = gray.shape
            cy, cx = h // 2, w // 2
            center_size = min(5, h // 4, w // 4)  # Adaptive center size
            center = gray[
                max(0, cy - center_size) : min(h, cy + center_size + 1),
                max(0, cx - center_size) : min(w, cx + center_size + 1),
            ]
            metrics["center_darkness"] = 255 - np.mean(center)

            # 2. Overall darkness (ink density)
            metrics["ink_density"] = 255 - np.mean(gray)

            # 3. Dark pixel ratio
            metrics["dark_pixel_ratio"] = np.sum(gray < 200) / gray.size

            # 4. Standard deviation (complexity)
            metrics["std_dev"] = np.std(gray)

            # 5. Edge vs center ratio
            edge_size = max(2, min(h // 10, w // 10))
            edge_mask = np.zeros_like(gray, dtype=bool)
            edge_mask[:edge_size, :] = True
            edge_mask[-edge_size:, :] = True
            edge_mask[:, :edge_size] = True
            edge_mask[:, -edge_size:] = True

            edge_mean = np.mean(gray[edge_mask]) if np.any(edge_mask) else 255
            center_mean = np.mean(center)
            metrics["edge_center_ratio"] = edge_mean / (center_mean + 1)

            # 6. Diagonal density (for X patterns)
            if h > 10 and w > 10:
                diag_mask = np.zeros_like(gray, dtype=bool)
                for i in range(min(h, w)):
                    if i < h and i < w:
                        diag_mask[i, i] = True
                        diag_mask[i, w - 1 - i] = True
                metrics["diagonal_density"] = 255 - np.mean(gray[diag_mask])
            else:
                metrics["diagonal_density"] = metrics["ink_density"]

            return metrics

        except Exception as e:
            raise JudgeError(f"Failed to extract metrics: {e}")

    def _retrain(self) -> None:
        """Retrain thresholds from current examples."""
        # Collect all examples
        examples = {label: [] for label in self.labels}

        for label in self.labels:
            label_dir = self.root_dir / label
            for img_path in label_dir.glob("*.png"):
                img = Image.open(img_path)
                # Create a mock region that just returns the image
                mock_region = type("MockRegion", (), {"render": lambda self, crop=True: img})()
                metrics = self._extract_metrics(mock_region)
                examples[label].append(metrics)

        # Check we have examples
        for label, exs in examples.items():
            if not exs:
                logger.warning(f"No examples for class '{label}'")
                return

        # Check for class imbalance
        example_counts = {label: len(exs) for label, exs in examples.items()}
        max_count = max(example_counts.values())
        min_count = min(example_counts.values())

        imbalance_ratio = max_count / min_count if min_count > 0 else float("inf")
        is_imbalanced = imbalance_ratio > 1.5  # Consider imbalanced if more than 1.5x difference

        if is_imbalanced:
            logger.info(
                f"Class imbalance detected: {example_counts} (ratio {imbalance_ratio:.1f}:1)"
            )
            logger.info("Using balanced accuracy for threshold selection")

        # Find best thresholds for each metric
        self.thresholds = {}
        self.metrics_info = {}
        metric_candidates = []  # Store all metrics with their scores

        all_metrics = set()
        for exs in examples.values():
            for ex in exs:
                all_metrics.update(ex.keys())

        for metric in all_metrics:
            # Get all values for this metric
            values_by_label = {}
            for label, exs in examples.items():
                values_by_label[label] = [ex.get(metric, 0) for ex in exs]

            # Find threshold that best separates classes (for binary)
            if len(self.labels) == 2:
                label1, label2 = self.labels
                vals1 = values_by_label[label1]
                vals2 = values_by_label[label2]

                # Try different thresholds
                all_vals = vals1 + vals2
                best_threshold = None
                best_accuracy = 0
                best_direction = None

                for threshold in np.percentile(all_vals, [10, 20, 30, 40, 50, 60, 70, 80, 90]):
                    # Test both directions
                    for direction in ["higher", "lower"]:
                        if direction == "higher":
                            correct1 = sum(1 for v in vals1 if v > threshold)
                            correct2 = sum(1 for v in vals2 if v <= threshold)
                        else:
                            correct1 = sum(1 for v in vals1 if v < threshold)
                            correct2 = sum(1 for v in vals2 if v >= threshold)

                        # Always use balanced accuracy for threshold selection
                        # This finds fair thresholds regardless of class imbalance
                        acc1 = correct1 / len(vals1) if len(vals1) > 0 else 0
                        acc2 = correct2 / len(vals2) if len(vals2) > 0 else 0
                        accuracy = (acc1 + acc2) / 2

                        if accuracy > best_accuracy:
                            best_accuracy = accuracy
                            best_threshold = threshold
                            best_direction = direction

                # Calculate Youden's J statistic for weight (TPR - FPR)
                if best_direction == "higher":
                    tp = sum(1 for v in vals1 if v > best_threshold)
                    fn = len(vals1) - tp
                    tn = sum(1 for v in vals2 if v <= best_threshold)
                    fp = len(vals2) - tn
                else:
                    tp = sum(1 for v in vals1 if v < best_threshold)
                    fn = len(vals1) - tp
                    tn = sum(1 for v in vals2 if v >= best_threshold)
                    fp = len(vals2) - tn

                tpr = tp / len(vals1) if len(vals1) > 0 else 0
                fpr = fp / len(vals2) if len(vals2) > 0 else 0
                youden_j = max(0.0, min(1.0, tpr - fpr))

                # Store all candidates
                metric_candidates.append(
                    {
                        "metric": metric,
                        "youden_j": youden_j,
                        "selection_accuracy": best_accuracy,
                        "threshold": best_threshold,
                        "direction": best_direction,
                        "label1": label1,
                        "label2": label2,
                        "stats": {
                            "mean_" + label1: np.mean(vals1),
                            "mean_" + label2: np.mean(vals2),
                            "std_" + label1: np.std(vals1),
                            "std_" + label2: np.std(vals2),
                        },
                    }
                )

        # Sort by selection accuracy
        metric_candidates.sort(key=lambda x: x["selection_accuracy"], reverse=True)

        # Use relaxed cutoff when imbalanced
        keep_cutoff = 0.55 if is_imbalanced else 0.60

        # Keep metrics that pass cutoff, or top 3 if none pass
        kept_metrics = [m for m in metric_candidates if m["selection_accuracy"] > keep_cutoff]
        if not kept_metrics and metric_candidates:
            # Keep top 3 metrics even if they don't pass cutoff
            kept_metrics = metric_candidates[:3]
            logger.warning(
                f"No metrics passed cutoff {keep_cutoff}, keeping top {len(kept_metrics)} metrics"
            )

        # Store selected metrics
        for candidate in kept_metrics:
            metric = candidate["metric"]
            label1 = candidate["label1"]
            label2 = candidate["label2"]
            self.thresholds[metric] = {
                "accuracy": candidate["youden_j"],  # Use Youden's J as weight
                "selection_accuracy": candidate["selection_accuracy"],
                "thresholds": {
                    label1: (candidate["threshold"], candidate["direction"]),
                    label2: (
                        candidate["threshold"],
                        "lower" if candidate["direction"] == "higher" else "higher",
                    ),
                },
            }
            self.metrics_info[metric] = candidate["stats"]

    def _load_config(self) -> None:
        """Load configuration from file."""
        try:
            with open(self.config_path, "r") as f:
                config = json.load(f)

            self.thresholds = config.get("thresholds", {})
            self.metrics_info = config.get("metrics_info", {})

            # Verify labels match
            if config.get("labels") != self.labels:
                logger.warning(
                    f"Saved labels {config.get('labels')} don't match current {self.labels}"
                )

        except Exception as e:
            logger.warning(f"Failed to load config: {e}")

    def _get_training_counts(self) -> Dict[str, int]:
        """Get count of examples per class."""
        counts = {}
        for label in self.labels:
            label_dir = self.root_dir / label
            counts[label] = len(list(label_dir.glob("*.png")))
        counts["unlabeled"] = len(list((self.root_dir / "unlabeled").glob("*.png")))
        return counts
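
Example

A minimal end-to-end sketch of the Judge workflow. The PDF path, the
find_all selector, and the label names here are illustrative assumptions,
not part of the API:

import natural_pdf as npdf
from natural_pdf import Judge

pdf = npdf.PDF("form.pdf")  # hypothetical input file
boxes = list(pdf.pages[0].find_all("rect"))  # hypothetical checkbox regions

judge = Judge("cb", labels=["checked", "unchecked"])

# Teach with a few known examples of each class
judge.add(boxes[0], label="checked")
judge.add(boxes[1], label="unchecked")

# Classify the rest (thresholds are trained lazily on first decide)
decisions = judge.decide(boxes[2:])
result = judge.pick("checked", boxes[2:])  # best "checked" candidate
print(result.index, f"{result.score:.3f}")

# Persist thresholds (retrains first) and reload later
judge.save()
judge = Judge.load("cb")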
Functions
natural_pdf.Judge.__init__(name, labels, base_dir=None, target_prior=None)

Initialize a Judge for visual classification.

Parameters:

name (str, required)
    Name for this judge (used for folder name)
labels (List[str], required)
    Class labels (required, typically 2 for binary classification)
base_dir (Optional[str], default None)
    Base directory for storage. Defaults to current directory
target_prior (Optional[float], default None)
    Target prior probability for the FIRST label in the labels list.
    - 0.5 (default) = neutral, treats both classes equally
    - >0.5 = favors labels[0]
    - <0.5 = favors labels[1]
    Example: Judge("cb", ["checked", "unchecked"], target_prior=0.6)
    favors detecting "checked" checkboxes.
Source code in natural_pdf/judge.py
def __init__(
    self,
    name: str,
    labels: List[str],
    base_dir: Optional[str] = None,
    target_prior: Optional[float] = None,
):
    """
    Initialize a Judge for visual classification.

    Args:
        name: Name for this judge (used for folder name)
        labels: Class labels (required, typically 2 for binary classification)
        base_dir: Base directory for storage. Defaults to current directory
        target_prior: Target prior probability for the FIRST label in the labels list.
                     - 0.5 (default) = neutral, treats both classes equally
                     - >0.5 = favors labels[0]
                     - <0.5 = favors labels[1]
                     Example: Judge("cb", ["checked", "unchecked"], target_prior=0.6)
                     favors detecting "checked" checkboxes.
    """
    if not labels or len(labels) != 2:
        raise JudgeError("Judge requires exactly 2 class labels (binary classification only)")

    self.name = name
    self.labels = labels
    self.target_prior = target_prior if target_prior is not None else 0.5

    # Set up directory structure
    self.base_dir = Path(base_dir) if base_dir else Path.cwd()
    self.root_dir = self.base_dir / name
    self.root_dir.mkdir(exist_ok=True)

    # Create label directories
    for label in self.labels:
        (self.root_dir / label).mkdir(exist_ok=True)
    (self.root_dir / "unlabeled").mkdir(exist_ok=True)
    (self.root_dir / "_removed").mkdir(exist_ok=True)

    # Config file
    self.config_path = self.root_dir / "judge.json"

    # Load existing config or initialize
    self.thresholds = {}
    self.metrics_info = {}
    if self.config_path.exists():
        self._load_config()
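
A short usage sketch (the names are illustrative). The judge stores its
images and config under base_dir/name, and target_prior tilts decisions
toward the first label:

from natural_pdf import Judge

# Neutral binary judge; data lives in ./cb/
judge = Judge("cb", labels=["checked", "unchecked"])

# Same task, but biased toward detecting "checked"
eager = Judge("cb_eager", labels=["checked", "unchecked"], target_prior=0.6)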
natural_pdf.Judge.add(region, label=None)

Add a region to the judge's dataset.

Parameters:

region (required)
    Region object to add
label (Optional[str], default None)
    Class label. If None, added to unlabeled for later teaching

Raises:

JudgeError
    If label is not in allowed labels
Source code in natural_pdf/judge.py
def add(self, region, label: Optional[str] = None) -> None:
    """
    Add a region to the judge's dataset.

    Args:
        region: Region object to add
        label: Class label. If None, added to unlabeled for later teaching

    Raises:
        JudgeError: If label is not in allowed labels
    """
    if label is not None and label not in self.labels:
        raise JudgeError(f"Label '{label}' not in allowed labels: {self.labels}")

    # Render region to image
    try:
        img = region.render(crop=True)
        if not isinstance(img, Image.Image):
            img = Image.fromarray(img)
    except Exception as e:
        raise JudgeError(f"Failed to render region: {e}")

    # Convert to RGB if needed
    if img.mode != "RGB":
        img = img.convert("RGB")

    # Generate hash from image content
    img_array = np.array(img)
    img_hash = hashlib.md5(img_array.tobytes()).hexdigest()[:12]

    # Determine target directory
    target_dir = self.root_dir / (label if label else "unlabeled")
    target_path = target_dir / f"{img_hash}.png"

    # Check if hash already exists anywhere
    existing_locations = []
    for check_label in self.labels + ["unlabeled", "_removed"]:
        check_path = self.root_dir / check_label / f"{img_hash}.png"
        if check_path.exists():
            existing_locations.append(check_label)

    if existing_locations:
        logger.warning(f"Duplicate image detected (hash: {img_hash})")
        logger.warning(f"Already exists in: {', '.join(existing_locations)}")
        print(f"⚠️  Duplicate image - already exists in: {', '.join(existing_locations)}")
        return

    # Save image
    img.save(target_path)
    logger.debug(f"Added image {img_hash} to {label if label else 'unlabeled'}")
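
A brief sketch (the region variables are assumed to come from earlier page
queries). Duplicates are detected by image hash and skipped:

judge.add(region_a, label="checked")    # labeled example
judge.add(region_b, label="unchecked")  # labeled example
judge.add(region_c)                     # unlabeled, for later teaching
judge.add(region_a, label="checked")    # duplicate image -> warned, not re-added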
natural_pdf.Judge.count(target_label, regions)

Count how many regions match the target label.

Parameters:

target_label (str, required)
    The class label to count
regions (List[Region], required)
    List of regions to check

Returns:

int
    Number of regions classified as target_label

Source code in natural_pdf/judge.py
def count(self, target_label: str, regions: List["Region"]) -> int:
    """
    Count how many regions match the target label.

    Args:
        target_label: The class label to count
        regions: List of regions to check

    Returns:
        Number of regions classified as target_label
    """
    decisions = self.decide(regions)
    return sum(1 for d in decisions if d.label == target_label)
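
For example, counting ticked checkboxes on a page (the selector and variable
names are illustrative):

boxes = list(pdf.pages[0].find_all("rect"))  # hypothetical checkbox regions
n_checked = judge.count("checked", boxes)
print(f"{n_checked} of {len(boxes)} boxes are checked")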
natural_pdf.Judge.decide(regions)

Classify one or more regions.

Parameters:

regions (Union[Region, List[Region]], required)
    Single region or list of regions to classify

Returns:

Union[Decision, List[Decision]]
    Decision or list of Decisions with label and score

Raises:

JudgeError
    If not enough training examples

Source code in natural_pdf/judge.py
def decide(self, regions: Union["Region", List["Region"]]) -> Union[Decision, List[Decision]]:
    """
    Classify one or more regions.

    Args:
        regions: Single region or list of regions to classify

    Returns:
        Decision or list of Decisions with label and score

    Raises:
        JudgeError: If not enough training examples
    """
    # Check if we have examples
    for label in self.labels:
        label_dir = self.root_dir / label
        if not any(label_dir.glob("*.png")):
            raise JudgeError(f"Need at least one example of class '{label}' before deciding")

    # Ensure thresholds are current
    if not self.thresholds:
        self._retrain()

    # Handle single region
    single_input = not isinstance(regions, list)
    if single_input:
        regions = [regions]

    results = []
    for region in regions:
        # Extract metrics
        metrics = self._extract_metrics(region)

        # Apply thresholds with soft voting
        votes = {label: 0.0 for label in self.labels}
        total_weight = 0.0

        for metric_name, value in metrics.items():
            if metric_name in self.thresholds:
                metric_info = self.thresholds[metric_name]
                weight = metric_info["accuracy"]  # This is now Youden's J

                # For binary classification
                label1, label2 = self.labels
                threshold1, direction1 = metric_info["thresholds"][label1]

                # Get standard deviations for soft voting
                stats = self.metrics_info.get(metric_name, {})
                s1 = stats.get(f"std_{label1}", 0.0)
                s2 = stats.get(f"std_{label2}", 0.0)
                scale1 = s1 if s1 > 1e-6 else 1.0
                scale2 = s2 if s2 > 1e-6 else 1.0

                # Calculate signed margin (positive favors label1, negative favors label2)
                if direction1 == "higher":
                    margin = (value - threshold1) / (scale1 if value >= threshold1 else scale2)
                else:
                    margin = (threshold1 - value) / (scale1 if value <= threshold1 else scale2)

                # Clip margin to avoid single metric dominating
                margin = np.clip(margin, -6, 6)

                # Soft votes using sigmoid
                p1 = 1.0 / (1.0 + np.exp(-margin))
                p2 = 1.0 - p1

                votes[label1] += weight * p1
                votes[label2] += weight * p2
                total_weight += weight

        # Normalize votes
        if total_weight > 0:
            for label in votes:
                votes[label] /= total_weight
        else:
            # Fallback: uniform votes so prior still works
            for label in votes:
                votes[label] = 0.5
            total_weight = 1.0

        # Apply prior bias correction
        def _logit(p, eps=1e-6):
            p = max(eps, min(1 - eps, p))
            return np.log(p / (1 - p))

        def _sigmoid(x):
            if x >= 0:
                z = np.exp(-x)
                return 1.0 / (1.0 + z)
            else:
                z = np.exp(x)
                return z / (1.0 + z)

        # Estimate priors from training counts
        counts = self._get_training_counts()
        label1, label2 = self.labels
        n1 = counts.get(label1, 0)
        n2 = counts.get(label2, 0)
        total = max(1, n1 + n2)

        if n1 > 0 and n2 > 0:  # Only apply bias if we have examples of both classes
            emp_prior1 = n1 / total
            emp_prior2 = n2 / total

            # Target prior (0.5/0.5 neutralizes imbalance)
            target_prior1 = self.target_prior
            target_prior2 = 1.0 - self.target_prior

            # Calculate bias
            bias1 = _logit(target_prior1) - _logit(emp_prior1)
            bias2 = _logit(target_prior2) - _logit(emp_prior2)

            # Apply bias in logit space
            v1 = _sigmoid(_logit(votes[label1]) + bias1)
            v2 = _sigmoid(_logit(votes[label2]) + bias2)

            # Renormalize
            s = v1 + v2
            votes[label1] = v1 / s
            votes[label2] = v2 / s

        # Find best label
        best_label = max(votes.items(), key=lambda x: x[1])
        results.append(Decision(label=best_label[0], score=best_label[1]))

    return results[0] if single_input else results
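
A usage sketch (region/regions are assumed to come from earlier queries),
plus a worked illustration of the prior correction applied above: normalized
votes are shifted in logit space by the gap between the target prior and the
empirical training prior, then renormalized.

decision = judge.decide(region)           # single region -> Decision
print(decision.label, f"{decision.score:.3f}")

decisions = judge.decide(list(regions))   # list in -> list of Decisions

# Illustration with made-up numbers: 30 "checked" vs 10 "unchecked" training
# examples give an empirical prior of 0.75, so with target_prior = 0.5 the
# bias is logit(0.5) - logit(0.75) = -ln(3) ≈ -1.099. A raw vote of 0.70 for
# "checked" becomes sigmoid(logit(0.70) - 1.099) ≈ 0.437, pulling the score
# back toward the under-represented class before renormalization.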
natural_pdf.Judge.forget(region=None, delete=False)

Clear training data, delete all files, or move a specific region to unlabeled.

Parameters:

region (Optional[Region], default None)
    If provided, move this specific region to unlabeled
delete (bool, default False)
    If True, permanently delete all files
Source code in natural_pdf/judge.py
def forget(self, region: Optional["Region"] = None, delete: bool = False) -> None:
    """
    Clear training data, delete all files, or move a specific region to unlabeled.

    Args:
        region: If provided, move this specific region to unlabeled
        delete: If True, permanently delete all files
    """
    # Handle specific region case
    if region is not None:
        # Get hash of the region
        try:
            img = region.render(crop=True)
            if not isinstance(img, Image.Image):
                img = Image.fromarray(img)
            if img.mode != "RGB":
                img = img.convert("RGB")
            img_array = np.array(img)
            img_hash = hashlib.md5(img_array.tobytes()).hexdigest()[:12]
        except Exception as e:
            logger.error(f"Failed to hash region: {e}")
            return

        # Find and move the image
        moved = False
        for label in self.labels + ["_removed"]:
            source_path = self.root_dir / label / f"{img_hash}.png"
            if source_path.exists():
                target_path = self.root_dir / "unlabeled" / f"{img_hash}.png"
                shutil.move(str(source_path), str(target_path))
                print(f"Moved region from '{label}' to 'unlabeled'")
                moved = True
                break

        if not moved:
            print(f"Region not found in training data")
        return

    # Handle delete or clear training
    if delete:
        # Delete entire directory
        if self.root_dir.exists():
            shutil.rmtree(self.root_dir)
            print(f"Deleted all data for judge '{self.name}'")
        else:
            print(f"No data found for judge '{self.name}'")

        # Reset internal state
        self.thresholds = {}
        self.metrics_info = {}

        # Recreate directory structure
        self.root_dir.mkdir(exist_ok=True)
        for label in self.labels:
            (self.root_dir / label).mkdir(exist_ok=True)
        (self.root_dir / "unlabeled").mkdir(exist_ok=True)
        (self.root_dir / "_removed").mkdir(exist_ok=True)

    else:
        # Just clear training (move everything to unlabeled)
        moved_count = 0

        # Move all labeled images back to unlabeled
        unlabeled_dir = self.root_dir / "unlabeled"
        for label in self.labels:
            label_dir = self.root_dir / label
            if label_dir.exists():
                for img_path in label_dir.glob("*.png"):
                    shutil.move(str(img_path), str(unlabeled_dir / img_path.name))
                    moved_count += 1

        # Clear thresholds
        self.thresholds = {}
        self.metrics_info = {}

        # Remove saved config
        if self.config_path.exists():
            self.config_path.unlink()

        print(f"Moved {moved_count} labeled images back to unlabeled.")
        print("Training data cleared. Judge is now untrained.")
natural_pdf.Judge.info()

Show configuration and training information for this Judge.

Source code in natural_pdf/judge.py
def info(self) -> None:
    """
    Show configuration and training information for this Judge.
    """
    print(f"Judge: {self.name}")
    print(f"Labels: {self.labels}")
    if self.target_prior != 0.5:
        print(
            f"Target prior: {self.target_prior:.2f} (favors '{self.labels[0]}')"
            if self.target_prior > 0.5
            else f"Target prior: {self.target_prior:.2f} (favors '{self.labels[1]}')"
        )

    # Get training counts
    counts = self._get_training_counts()
    print(f"\nTraining examples:")
    for label in self.labels:
        count = counts.get(label, 0)
        print(f"  {label}: {count}")

    if counts.get("unlabeled", 0) > 0:
        print(f"  unlabeled: {counts['unlabeled']}")

    # Show actual imbalance
    labeled_counts = [counts.get(label, 0) for label in self.labels]
    if all(c > 0 for c in labeled_counts):
        max_count = max(labeled_counts)
        min_count = min(labeled_counts)
        if max_count != min_count:
            # Find which is which
            for i, label in enumerate(self.labels):
                if counts.get(label, 0) == max_count:
                    majority_label = label
                if counts.get(label, 0) == min_count:
                    minority_label = label

            ratio = max_count / min_count
            print(
                f"\nClass imbalance: {majority_label}:{minority_label} = {max_count}:{min_count} ({ratio:.1f}:1)"
            )

            print("  Using Youden's J weights with soft voting and prior correction")
natural_pdf.Judge.inspect(preview=True)

Inspect all stored examples, showing their true labels and predicted labels/scores. Useful for debugging classification issues.

Parameters:

preview (bool, default True)
    If True (default), display images inline in HTML tables (requires
    IPython/Jupyter). If False, use text-only output.
Source code in natural_pdf/judge.py
def inspect(self, preview: bool = True) -> None:
    """
    Inspect all stored examples, showing their true labels and predicted labels/scores.
    Useful for debugging classification issues.

    Args:
        preview: If True (default), display images inline in HTML tables (requires IPython/Jupyter).
                 If False, use text-only output.
    """
    if not self.thresholds:
        print("No trained model yet. Add examples and the model will auto-train.")
        return

    if not preview:
        # Show basic info first
        self.info()
        print("-" * 80)

        print("\nThresholds learned:")
        for metric, info in self.thresholds.items():
            weight = info["accuracy"]  # This is now Youden's J
            selection_acc = info.get(
                "selection_accuracy", info["accuracy"]
            )  # Fallback for old models
            print(f"  {metric}: weight={weight:.3f} (selection_accuracy={selection_acc:.3f})")
            for label, (threshold, direction) in info["thresholds"].items():
                print(f"    {label}: {direction} than {threshold:.3f}")

            # Show metric distribution info if available
            if metric in self.metrics_info:
                metric_stats = self.metrics_info[metric]
                for label in self.labels:
                    mean_key = f"mean_{label}"
                    std_key = f"std_{label}"
                    if mean_key in metric_stats:
                        print(
                            f"    {label} distribution: mean={metric_stats[mean_key]:.3f}, std={metric_stats[std_key]:.3f}"
                        )

    if preview:
        # HTML preview mode
        try:
            import base64
            import io

            from IPython.display import HTML, display
        except ImportError:
            print("Preview mode requires IPython/Jupyter. Falling back to text mode.")
            preview = False

    if preview:
        # Build HTML tables for everything
        html_parts = []
        html_parts.append("<style>")
        html_parts.append("table { border-collapse: collapse; margin: 20px 0; }")
        html_parts.append("th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }")
        html_parts.append("th { background-color: #f2f2f2; font-weight: bold; }")
        html_parts.append("img { max-width: 60px; max-height: 60px; }")
        html_parts.append(".correct { color: green; }")
        html_parts.append(".incorrect { color: red; }")
        html_parts.append(".metrics { font-size: 0.9em; color: #666; }")
        html_parts.append("h3 { margin-top: 30px; }")
        html_parts.append(".imbalance-warning { background-color: #fff3cd; color: #856404; }")
        html_parts.append("</style>")

        # Configuration table
        html_parts.append("<h3>Judge Configuration</h3>")
        html_parts.append("<table>")
        html_parts.append("<tr><th>Property</th><th>Value</th></tr>")
        html_parts.append(f"<tr><td>Name</td><td>{self.name}</td></tr>")
        html_parts.append(f"<tr><td>Labels</td><td>{', '.join(self.labels)}</td></tr>")
        html_parts.append(f"<tr><td>Target Prior</td><td>{self.target_prior:.2f}")
        if self.target_prior != 0.5:
            html_parts.append(
                f" (favors '{self.labels[0] if self.target_prior > 0.5 else self.labels[1]}')"
            )
        html_parts.append("</td></tr>")
        html_parts.append("</table>")

        # Training counts table
        counts = self._get_training_counts()
        html_parts.append("<h3>Training Examples</h3>")
        html_parts.append("<table>")
        html_parts.append("<tr><th>Class</th><th>Count</th></tr>")

        # Check for imbalance
        labeled_counts = [counts.get(label, 0) for label in self.labels]
        is_imbalanced = False
        if all(c > 0 for c in labeled_counts):
            max_count = max(labeled_counts)
            min_count = min(labeled_counts)
            if max_count != min_count:
                ratio = max_count / min_count
                is_imbalanced = ratio > 1.5

        for label in self.labels:
            count = counts.get(label, 0)
            row_class = ""
            if is_imbalanced:
                if count == max(labeled_counts):
                    row_class = ' class="imbalance-warning"'
            html_parts.append(f"<tr{row_class}><td>{label}</td><td>{count}</td></tr>")

        if counts.get("unlabeled", 0) > 0:
            html_parts.append(f"<tr><td>unlabeled</td><td>{counts['unlabeled']}</td></tr>")

        html_parts.append("</table>")

        if is_imbalanced:
            html_parts.append(
                f"<p><em>Class imbalance detected ({ratio:.1f}:1). Using Youden's J weights with prior correction.</em></p>"
            )

        # Thresholds table
        html_parts.append("<h3>Learned Thresholds</h3>")
        html_parts.append("<table>")
        html_parts.append(
            "<tr><th>Metric</th><th>Weight (Youden's J)</th><th>Selection Accuracy</th><th>Threshold Details</th></tr>"
        )

        for metric, info in self.thresholds.items():
            weight = info["accuracy"]  # This is Youden's J
            selection_acc = info.get("selection_accuracy", weight)

            # Build threshold details
            details = []
            for label, (threshold, direction) in info["thresholds"].items():
                details.append(f"<br>{label}: {direction} than {threshold:.3f}")

            # Add distribution info if available
            if metric in self.metrics_info:
                metric_stats = self.metrics_info[metric]
                details.append("<br><em>Distributions:</em>")
                for label in self.labels:
                    mean_key = f"mean_{label}"
                    std_key = f"std_{label}"
                    if mean_key in metric_stats:
                        details.append(
                            f"<br>&nbsp;&nbsp;{label}: μ={metric_stats[mean_key]:.1f}, σ={metric_stats[std_key]:.1f}"
                        )

            html_parts.append("<tr>")
            html_parts.append(f"<td>{metric}</td>")
            html_parts.append(f"<td>{weight:.3f}</td>")
            html_parts.append(f"<td>{selection_acc:.3f}</td>")
            html_parts.append(f"<td>{''.join(details)}</td>")
            html_parts.append("</tr>")

        html_parts.append("</table>")

        all_correct = 0
        all_total = 0

        # First show labeled examples
        for true_label in self.labels:
            label_dir = self.root_dir / true_label
            examples = list(label_dir.glob("*.png"))

            if not examples:
                continue

            html_parts.append(
                f"<h3>Predictions: {true_label.upper()} ({len(examples)} total)</h3>"
            )
            html_parts.append("<table>")
            html_parts.append(
                "<tr><th>Image</th><th>Status</th><th>Predicted</th><th>Score</th><th>Key Metrics</th></tr>"
            )

            correct = 0

            for img_path in sorted(examples)[:20]:  # Show max 20 per class in preview
                # Load image
                img = Image.open(img_path)
                mock_region = type("MockRegion", (), {"render": lambda self, crop=True: img})()

                # Get prediction
                decision = self.decide(mock_region)
                is_correct = decision.label == true_label
                if is_correct:
                    correct += 1

                # Extract metrics
                metrics = self._extract_metrics(mock_region)

                # Convert image to base64
                buffered = io.BytesIO()
                img.save(buffered, format="PNG")
                img_str = base64.b64encode(buffered.getvalue()).decode()

                # Build row
                status_class = "correct" if is_correct else "incorrect"
                status_symbol = "✓" if is_correct else "✗"

                # Format key metrics
                metric_strs = []
                for metric, value in sorted(metrics.items()):
                    if metric in self.thresholds:
                        metric_strs.append(f"{metric}={value:.1f}")
                metrics_html = "<br>".join(metric_strs[:3])

                html_parts.append("<tr>")
                html_parts.append(f'<td><img src="data:image/png;base64,{img_str}" /></td>')
                html_parts.append(f'<td class="{status_class}">{status_symbol}</td>')
                html_parts.append(f"<td>{decision.label}</td>")
                html_parts.append(f"<td>{decision.score:.3f}</td>")
                html_parts.append(f'<td class="metrics">{metrics_html}</td>')
                html_parts.append("</tr>")

            html_parts.append("</table>")

            accuracy = correct / len(examples) if examples else 0
            all_correct += correct
            all_total += len(examples)

            if len(examples) > 20:
                html_parts.append(f"<p><em>... and {len(examples) - 20} more</em></p>")
            html_parts.append(
                f"<p>Accuracy for {true_label}: <strong>{accuracy:.1%}</strong> ({correct}/{len(examples)})</p>"
            )

        if all_total > 0:
            overall_accuracy = all_correct / all_total
            html_parts.append(
                f"<h3>Overall accuracy: {overall_accuracy:.1%} ({all_correct}/{all_total})</h3>"
            )

        # Now show unlabeled examples with predictions
        unlabeled_dir = self.root_dir / "unlabeled"
        unlabeled_examples = list(unlabeled_dir.glob("*.png"))

        if unlabeled_examples:
            html_parts.append(
                f"<h3>Predictions: UNLABELED ({len(unlabeled_examples)} total)</h3>"
            )
            html_parts.append("<table>")
            html_parts.append(
                "<tr><th>Image</th><th>Predicted</th><th>Score</th><th>Key Metrics</th></tr>"
            )

            for img_path in sorted(unlabeled_examples)[:20]:  # Show max 20
                # Load image
                img = Image.open(img_path)
                mock_region = type("MockRegion", (), {"render": lambda self, crop=True: img})()

                # Get prediction
                decision = self.decide(mock_region)

                # Extract metrics
                metrics = self._extract_metrics(mock_region)

                # Convert image to base64
                buffered = io.BytesIO()
                img.save(buffered, format="PNG")
                img_str = base64.b64encode(buffered.getvalue()).decode()

                # Format key metrics
                metric_strs = []
                for metric, value in sorted(metrics.items()):
                    if metric in self.thresholds:
                        metric_strs.append(f"{metric}={value:.1f}")
                metrics_html = "<br>".join(metric_strs[:3])

                html_parts.append("<tr>")
                html_parts.append(f'<td><img src="data:image/png;base64,{img_str}" /></td>')
                html_parts.append(f"<td>{decision.label}</td>")
                html_parts.append(f"<td>{decision.score:.3f}</td>")
                html_parts.append(f'<td class="metrics">{metrics_html}</td>')
                html_parts.append("</tr>")

            html_parts.append("</table>")

            if len(unlabeled_examples) > 20:
                html_parts.append(
                    f"<p><em>... and {len(unlabeled_examples) - 20} more</em></p>"
                )

        # Display HTML
        display(HTML("".join(html_parts)))

    else:
        # Text mode (original)
        print("\nPredictions on training data:")
        print("-" * 80)

        # Test each labeled example
        all_correct = 0
        all_total = 0

        for true_label in self.labels:
            label_dir = self.root_dir / true_label
            examples = list(label_dir.glob("*.png"))

            if not examples:
                continue

            print(f"\n{true_label.upper()} examples ({len(examples)} total):")
            correct = 0

            for img_path in sorted(examples)[:10]:  # Show max 10 per class
                # Load image and create mock region
                img = Image.open(img_path)
                mock_region = type("MockRegion", (), {"render": lambda self, crop=True: img})()

                # Get prediction
                decision = self.decide(mock_region)
                is_correct = decision.label == true_label
                if is_correct:
                    correct += 1

                # Extract metrics for this example
                metrics = self._extract_metrics(mock_region)

                # Show result
                status = "✓" if is_correct else "✗"
                print(
                    f"  {status} {img_path.name}: predicted={decision.label} (score={decision.score:.3f})"
                )

                # Show key metric values
                metric_strs = []
                for metric, value in sorted(metrics.items()):
                    if metric in self.thresholds:
                        metric_strs.append(f"{metric}={value:.2f}")
                if metric_strs:
                    print(f"     Metrics: {', '.join(metric_strs[:3])}")

            accuracy = correct / len(examples) if examples else 0
            all_correct += correct
            all_total += len(examples)

            if len(examples) > 10:
                print(f"  ... and {len(examples) - 10} more")
            print(f"  Accuracy for {true_label}: {accuracy:.1%} ({correct}/{len(examples)})")

        if all_total > 0:
            overall_accuracy = all_correct / all_total
            print(f"\nOverall accuracy: {overall_accuracy:.1%} ({all_correct}/{all_total})")

        # Show unlabeled examples with predictions
        unlabeled_dir = self.root_dir / "unlabeled"
        unlabeled_examples = list(unlabeled_dir.glob("*.png"))

        if unlabeled_examples:
            print(f"\nUNLABELED examples ({len(unlabeled_examples)} total) - predictions:")

            for img_path in sorted(unlabeled_examples)[:10]:  # Show max 10
                # Load image and create mock region
                img = Image.open(img_path)
                mock_region = type("MockRegion", (), {"render": lambda self, crop=True: img})()

                # Get prediction
                decision = self.decide(mock_region)

                # Extract metrics
                metrics = self._extract_metrics(mock_region)

                print(
                    f"  {img_path.name}: predicted={decision.label} (score={decision.score:.3f})"
                )

                # Show key metric values
                metric_strs = []
                for metric, value in sorted(metrics.items()):
                    if metric in self.thresholds:
                        metric_strs.append(f"{metric}={value:.2f}")
                if metric_strs:
                    print(f"     Metrics: {', '.join(metric_strs[:3])}")

            if len(unlabeled_examples) > 10:
                print(f"  ... and {len(unlabeled_examples) - 10} more")
natural_pdf.Judge.load(path) classmethod

Load a judge from a saved configuration.

Parameters:

path (str, required): Path to the saved judge.json file or the judge directory.

Returns:

Judge: Loaded Judge instance.
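
Both call forms below resolve to the same configuration (paths are illustrative):

judge = Judge.load("judges/checkbox")             # directory containing judge.json
judge = Judge.load("judges/checkbox/judge.json")  # or the config file itself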

Source code in natural_pdf/judge.py
@classmethod
def load(cls, path: str) -> "Judge":
    """
    Load a judge from a saved configuration.

    Args:
        path: Path to the saved judge.json file or the judge directory

    Returns:
        Loaded Judge instance
    """
    path = Path(path)

    # If path is a directory, look for judge.json inside
    if path.is_dir():
        config_path = path / "judge.json"
        base_dir = path.parent
        name = path.name
    else:
        config_path = path
        base_dir = path.parent.parent if path.parent.name != "." else path.parent
        # Try to infer name from path
        name = None

    with open(config_path, "r") as f:
        config = json.load(f)

    # Use saved name if we couldn't infer it
    if name is None:
        name = config["name"]

    # Create judge with saved config
    judge = cls(
        name,
        labels=config["labels"],
        base_dir=base_dir,
        target_prior=config.get("target_prior", 0.5),
    )  # Default to 0.5 for old configs
    judge.thresholds = config["thresholds"]
    judge.metrics_info = config.get("metrics_info", {})

    return judge
natural_pdf.Judge.lookup(region)

Look up a region and return its hash and image if found in training data.

Parameters:

region (required): Region to look up.

Returns:

Optional[Tuple[str, Image]]: Tuple of (hash, image) if found, None if not found.
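
A quick way to check whether a region is already in the training data (sketch):

hit = judge.lookup(region)
if hit:
    img_hash, stored_img = hit
    print(f"Already stored as {img_hash}.png")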

Source code in natural_pdf/judge.py
def lookup(self, region) -> Optional[Tuple[str, Image.Image]]:
    """
    Look up a region and return its hash and image if found in training data.

    Args:
        region: Region to look up

    Returns:
        Tuple of (hash, image) if found, None if not found
    """
    try:
        # Generate hash for the region
        img = region.render(crop=True)
        if not isinstance(img, Image.Image):
            img = Image.fromarray(img)
        if img.mode != "RGB":
            img = img.convert("RGB")
        img_array = np.array(img)
        img_hash = hashlib.md5(img_array.tobytes()).hexdigest()[:12]

        # Look for the image in all directories
        for subdir in ["checked", "unchecked", "unlabeled", "_removed"]:
            if subdir == "checked" or subdir == "unchecked":
                # Only look in valid label directories
                if subdir not in self.labels:
                    continue

            img_path = self.root_dir / subdir / f"{img_hash}.png"
            if img_path.exists():
                stored_img = Image.open(img_path)
                logger.debug(f"Found region in '{subdir}' with hash {img_hash}")
                return (img_hash, stored_img)

        logger.debug(f"Region not found in training data (hash: {img_hash})")
        return None

    except Exception as e:
        logger.error(f"Failed to lookup region: {e}")
        return None
natural_pdf.Judge.pick(target_label, regions, labels=None)

Pick which region best matches the target label.

Parameters:

target_label (str, required): The class label to look for.
regions (List[Region], required): List of regions to choose from.
labels (Optional[List[str]], default: None): Optional human-friendly labels for each region.

Returns:

PickResult: PickResult with winning region, index, label (if provided), and score.

Raises:

JudgeError: If target_label is not in the allowed labels, or if no region is classified as target_label.
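
For example, to find which of several candidate regions is the checked box (region labels are illustrative):

result = judge.pick("checked", regions, labels=["Q1", "Q2", "Q3"])
print(result.index, result.label, result.score)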

Source code in natural_pdf/judge.py
def pick(
    self, target_label: str, regions: List["Region"], labels: Optional[List[str]] = None
) -> PickResult:
    """
    Pick which region best matches the target label.

    Args:
        target_label: The class label to look for
        regions: List of regions to choose from
        labels: Optional human-friendly labels for each region

    Returns:
        PickResult with winning region, index, label (if provided), and score

    Raises:
        JudgeError: If target_label not in allowed labels
    """
    if target_label not in self.labels:
        raise JudgeError(f"Target label '{target_label}' not in allowed labels: {self.labels}")

    # Classify all regions
    decisions = self.decide(regions)

    # Find best match for target label
    best_index = -1
    best_score = -1.0

    for i, decision in enumerate(decisions):
        if decision.label == target_label and decision.score > best_score:
            best_score = decision.score
            best_index = i

    if best_index == -1:
        # No region matched the target label
        raise JudgeError(f"No region classified as '{target_label}'")

    # Build result
    region = regions[best_index]
    label = labels[best_index] if labels and best_index < len(labels) else None

    return PickResult(region=region, index=best_index, label=label, score=best_score)
natural_pdf.Judge.save(path=None)

Save the judge configuration (auto-retrains first).

Parameters:

path (Optional[str], default: None): Optional path to save to. Defaults to judge.json in the root directory.
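
Saving retrains first, so a typical end-of-session call is just (paths are illustrative):

judge.save()                         # retrain, then write judge.json in the judge's directory
judge.save("backups/checkbox.json")  # or write the config somewhere else
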
Source code in natural_pdf/judge.py
def save(self, path: Optional[str] = None) -> None:
    """
    Save the judge configuration (auto-retrains first).

    Args:
        path: Optional path to save to. Defaults to judge.json in root directory
    """
    # Retrain with current examples
    self._retrain()

    # Save config
    save_path = Path(path) if path else self.config_path

    config = {
        "name": self.name,
        "labels": self.labels,
        "target_prior": self.target_prior,
        "thresholds": self.thresholds,
        "metrics_info": self.metrics_info,
        "training_counts": self._get_training_counts(),
    }

    with open(save_path, "w") as f:
        json.dump(config, f, indent=2)

    logger.info(f"Saved judge to {save_path}")
natural_pdf.Judge.show(max_per_class=10, size=(100, 100))

Display a grid showing examples from each category.

Parameters:

max_per_class (int, default: 10): Maximum number of examples to show per class.
size (Tuple[int, int], default: (100, 100)): Size of each image in pixels (width, height).
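
In a notebook, a quick look at the current training set (sketch):

judge.show()                                 # up to 10 thumbnails per class
judge.show(max_per_class=25, size=(80, 80))  # more examples, smaller thumbnails
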
Source code in natural_pdf/judge.py
def show(self, max_per_class: int = 10, size: Tuple[int, int] = (100, 100)) -> None:
    """
    Display a grid showing examples from each category.

    Args:
        max_per_class: Maximum number of examples to show per class
        size: Size of each image in pixels (width, height)
    """
    try:
        import ipywidgets as widgets
        from IPython.display import display
        from PIL import Image as PILImage
    except ImportError:
        print("Show requires IPython and ipywidgets")
        return

    # Collect images from each category
    categories = {}
    total_counts = {}
    for label in self.labels:
        label_dir = self.root_dir / label
        all_images = list(label_dir.glob("*.png"))
        total_counts[label] = len(all_images)
        images = sorted(all_images)[:max_per_class]
        if images:
            categories[label] = images

    # Add unlabeled if any
    unlabeled_dir = self.root_dir / "unlabeled"
    all_unlabeled = list(unlabeled_dir.glob("*.png"))
    total_counts["unlabeled"] = len(all_unlabeled)
    unlabeled = sorted(all_unlabeled)[:max_per_class]
    if unlabeled:
        categories["unlabeled"] = unlabeled

    if not categories:
        print("No images to show")
        return

    # Create grid layout
    rows = []

    # Check for class imbalance
    labeled_counts = {k: v for k, v in total_counts.items() if k != "unlabeled"}
    if labeled_counts and len(labeled_counts) >= 2:
        max_count = max(labeled_counts.values())
        min_count = min(labeled_counts.values())
        if min_count > 0 and max_count / min_count > 3:
            warning = widgets.HTML(
                f'<div style="background: #fff3cd; padding: 10px; margin: 10px 0; border: 1px solid #ffeeba; border-radius: 4px;">'
                f"<strong>⚠️ Class imbalance detected:</strong> {labeled_counts}<br>"
                f"Consider adding more examples of the minority class for better accuracy."
                f"</div>"
            )
            rows.append(warning)

    for category, image_paths in categories.items():
        # Category header showing total count
        shown = len(image_paths)
        total = total_counts[category]
        header_text = f"<h3>{category}"
        if shown < total:
            header_text += f" ({shown} of {total} shown)"
        else:
            header_text += f" ({total} total)"
        header_text += "</h3>"
        header = widgets.HTML(header_text)

        # Image row
        image_widgets = []
        for img_path in image_paths:
            # Load and resize image
            img = PILImage.open(img_path)
            img.thumbnail(size, PILImage.Resampling.LANCZOS)

            # Convert to bytes for display
            import io

            img_bytes = io.BytesIO()
            img.save(img_bytes, format="PNG")
            img_bytes.seek(0)

            # Create image widget
            img_widget = widgets.Image(value=img_bytes.read(), width=size[0], height=size[1])
            image_widgets.append(img_widget)

        # Create horizontal box for this category
        category_box = widgets.VBox([header, widgets.HBox(image_widgets)])
        rows.append(category_box)

    # Display all categories
    display(widgets.VBox(rows))
natural_pdf.Judge.teach(labels=None, review=False)

Interactive teaching interface using IPython widgets.

Parameters:

labels (Optional[List[str]], default: None): Labels to use for teaching. Defaults to self.labels.
review (bool, default: False): If True, review already labeled images for re-classification.
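
A typical Jupyter flow (sketch): label the queued unlabeled crops first, then revisit earlier decisions:

judge.teach()             # label unlabeled images with the buttons or arrow keys
judge.teach(review=True)  # re-check images that were already labeled
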
Source code in natural_pdf/judge.py
def teach(self, labels: Optional[List[str]] = None, review: bool = False) -> None:
    """
    Interactive teaching interface using IPython widgets.

    Args:
        labels: Labels to use for teaching. Defaults to self.labels
        review: If True, review already labeled images for re-classification
    """
    # Check for IPython environment
    try:
        import ipywidgets as widgets
        from IPython.display import clear_output, display
    except ImportError:
        raise JudgeError(
            "Teaching requires IPython and ipywidgets. Use 'pip install ipywidgets'"
        )

    labels = labels or self.labels

    # Get images to review
    if review:
        # Get all labeled images for review
        files_to_review = []
        for label in self.labels:
            label_dir = self.root_dir / label
            for img_path in sorted(label_dir.glob("*.png")):
                files_to_review.append((img_path, label))

        if not files_to_review:
            print("No labeled images to review")
            return

        # Shuffle for review
        import random

        random.shuffle(files_to_review)
        review_files = [f[0] for f in files_to_review]
        original_labels = {str(f[0]): f[1] for f in files_to_review}
    else:
        # Get unlabeled images
        unlabeled_dir = self.root_dir / "unlabeled"
        review_files = sorted(unlabeled_dir.glob("*.png"))
        original_labels = {}

        if not review_files:
            print("No unlabeled images to teach")
            return

    # State for teaching
    self._teaching_state = {
        "current_index": 0,
        "labeled_count": 0,
        "removed_count": 0,
        "files": review_files,
        "labels": labels,
        "review_mode": review,
        "original_labels": original_labels,
    }

    # Create widgets
    image_widget = widgets.Image()
    status_label = widgets.Label()

    # Create buttons for labeling
    button_layout = widgets.Layout(width="auto", margin="5px")

    btn_prev = widgets.Button(description="↑ Previous", layout=button_layout)
    btn_class1 = widgets.Button(
        description=f"← {labels[0]}", layout=button_layout, button_style="primary"
    )
    btn_class2 = widgets.Button(
        description=f"→ {labels[1]}", layout=button_layout, button_style="success"
    )
    btn_skip = widgets.Button(description="↓ Skip", layout=button_layout)
    btn_remove = widgets.Button(
        description="✗ Remove", layout=button_layout, button_style="danger"
    )

    button_box = widgets.HBox([btn_prev, btn_class1, btn_class2, btn_skip, btn_remove])

    # Keyboard shortcuts info
    info_label = widgets.Label(
        value="Keys: ↑ prev | ← "
        + labels[0]
        + " | → "
        + labels[1]
        + " | ↓ skip | Delete remove"
    )

    def update_display():
        """Update the displayed image and status."""
        state = self._teaching_state
        if 0 <= state["current_index"] < len(state["files"]):
            img_path = state["files"][state["current_index"]]
            with open(img_path, "rb") as f:
                image_widget.value = f.read()

            # Build status text
            status_text = f"Image {state['current_index'] + 1} of {len(state['files'])}"
            if state["review_mode"]:
                current_label = state["original_labels"].get(str(img_path), "unknown")
                status_text += f" (Currently: {current_label})"
            status_text += f" | Labeled: {state['labeled_count']}"
            if state["removed_count"] > 0:
                status_text += f" | Removed: {state['removed_count']}"

            status_label.value = status_text

            # Update button states
            btn_prev.disabled = state["current_index"] == 0
        else:
            status_label.value = "Teaching complete!"
            # Hide the image widget instead of showing broken image
            image_widget.layout.display = "none"
            # Disable all buttons
            btn_prev.disabled = True
            btn_class1.disabled = True
            btn_class2.disabled = True
            btn_skip.disabled = True

            # Auto-retrain
            if state["labeled_count"] > 0 or state["removed_count"] > 0:
                clear_output(wait=True)
                print("Teaching complete!")
                print(f"Labeled: {state['labeled_count']} images")
                if state["removed_count"] > 0:
                    print(f"Removed: {state['removed_count']} images")

                if state["labeled_count"] > 0:
                    print("\nRetraining with new examples...")
                    self._retrain()
                    print("✓ Training complete! Judge is ready to use.")
            else:
                print("No changes made.")

    def move_file_to_class(class_index):
        """Move current file to specified class."""
        state = self._teaching_state
        if state["current_index"] >= len(state["files"]):
            return

        current_file = state["files"][state["current_index"]]
        target_dir = self.root_dir / labels[class_index]
        shutil.move(str(current_file), str(target_dir / current_file.name))
        state["labeled_count"] += 1
        state["current_index"] += 1
        update_display()

    # Button callbacks
    def on_prev(b):
        state = self._teaching_state
        if state["current_index"] > 0:
            state["current_index"] -= 1
            update_display()

    def on_class1(b):
        move_file_to_class(0)

    def on_class2(b):
        move_file_to_class(1)

    def on_skip(b):
        state = self._teaching_state
        state["current_index"] += 1
        update_display()

    def on_remove(b):
        state = self._teaching_state
        if state["current_index"] >= len(state["files"]):
            return

        current_file = state["files"][state["current_index"]]
        target_dir = self.root_dir / "_removed"
        shutil.move(str(current_file), str(target_dir / current_file.name))
        state["removed_count"] += 1
        state["current_index"] += 1
        update_display()

    # Connect buttons
    btn_prev.on_click(on_prev)
    btn_class1.on_click(on_class1)
    btn_class2.on_click(on_class2)
    btn_skip.on_click(on_skip)
    btn_remove.on_click(on_remove)

    # Create output widget for keyboard handling
    output = widgets.Output()

    # Keyboard event handler
    def on_key(event):
        """Handle keyboard events."""
        if event["type"] != "keydown":
            return

        key = event["key"]

        if key == "ArrowUp":
            on_prev(None)
        elif key == "ArrowLeft":
            on_class1(None)
        elif key == "ArrowRight":
            on_class2(None)
        elif key == "ArrowDown":
            on_skip(None)
        elif key in ["Delete", "Backspace"]:
            on_remove(None)

    # Display everything
    display(status_label)
    display(image_widget)
    display(button_box)
    display(info_label)
    display(output)

    # Show first image
    update_display()

    # Try to set up keyboard handling (may not work in all environments)
    try:
        from ipyevents import Event

        event_handler = Event(source=output, watched_events=["keydown"])
        event_handler.on_dom_event(on_key)
    except Exception:
        # If ipyevents not available, just use buttons
        print("Note: Install ipyevents for keyboard shortcuts: pip install ipyevents")
natural_pdf.JudgeError

Bases: Exception

Raised when Judge operations fail.
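
For example, Judge.pick raises JudgeError when no region matches the target label, so callers may want to catch it:

try:
    result = judge.pick("checked", regions)
except JudgeError:
    result = None  # nothing was classified as "checked"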

Source code in natural_pdf/judge.py
class JudgeError(Exception):
    """Raised when Judge operations fail."""

    pass
natural_pdf.Options

Global options for natural-pdf, similar to pandas options.
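
Assuming the library exposes a module-level instance of this class (written here as npdf.options, by analogy with pandas), defaults can be changed globally:

import natural_pdf as npdf

npdf.options.image.resolution = 300        # render page images at 300 DPI by default
npdf.options.ocr.languages = ["en", "fr"]  # OCR in English and French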

Source code in natural_pdf/__init__.py
class Options:
    """Global options for natural-pdf, similar to pandas options."""

    def __init__(self):
        # Image rendering defaults
        self.image = ConfigSection(width=None, resolution=150)

        # OCR defaults
        self.ocr = ConfigSection(engine="easyocr", languages=["en"], min_confidence=0.5)

        # Text extraction defaults (empty for now)
        self.text = ConfigSection()

        # Layout and navigation defaults
        self.layout = ConfigSection(
            directional_offset=0.01,  # Offset in points when using directional methods
            auto_multipage=False,  # Whether directional methods span pages by default
            directional_within=None,  # Region to constrain directional operations to
        )
natural_pdf.PDF

Bases: TextMixin, ExtractionMixin, ExportMixin, ClassificationMixin, CheckboxDetectionMixin, VisualSearchMixin, Visualizable

Enhanced PDF wrapper built on top of pdfplumber.

This class provides a fluent interface for working with PDF documents, with improved selection, navigation, and extraction capabilities. It integrates OCR, layout analysis, and AI-powered data extraction features while maintaining compatibility with the underlying pdfplumber API.

The PDF class supports loading from files, URLs, or streams, and provides spatial navigation, element selection with CSS-like selectors, and advanced document processing workflows including multi-page content flows.

Attributes:

pages (PageCollection): Lazy-loaded list of Page objects for document pages.
path: Resolved path to the PDF file or source identifier.
source_path: Original path, URL, or stream identifier provided during initialization.
highlighter: Service for rendering highlighted visualizations of document content.

Example

Basic usage:

import natural_pdf as npdf

pdf = npdf.PDF("document.pdf")
page = pdf.pages[0]
text_elements = page.find_all('text:contains("Summary")')

Advanced usage with OCR:

pdf = npdf.PDF("scanned_document.pdf")
pdf.apply_ocr(engine="easyocr", resolution=144)
tables = pdf.pages[0].find_all('table')

Source code in natural_pdf/core/pdf.py
class PDF(
    TextMixin,
    ExtractionMixin,
    ExportMixin,
    ClassificationMixin,
    CheckboxDetectionMixin,
    VisualSearchMixin,
    Visualizable,
):
    """Enhanced PDF wrapper built on top of pdfplumber.

    This class provides a fluent interface for working with PDF documents,
    with improved selection, navigation, and extraction capabilities. It integrates
    OCR, layout analysis, and AI-powered data extraction features while maintaining
    compatibility with the underlying pdfplumber API.

    The PDF class supports loading from files, URLs, or streams, and provides
    spatial navigation, element selection with CSS-like selectors, and advanced
    document processing workflows including multi-page content flows.

    Attributes:
        pages: Lazy-loaded list of Page objects for document pages.
        path: Resolved path to the PDF file or source identifier.
        source_path: Original path, URL, or stream identifier provided during initialization.
        highlighter: Service for rendering highlighted visualizations of document content.

    Example:
        Basic usage:
        ```python
        import natural_pdf as npdf

        pdf = npdf.PDF("document.pdf")
        page = pdf.pages[0]
        text_elements = page.find_all('text:contains("Summary")')
        ```

        Advanced usage with OCR:
        ```python
        pdf = npdf.PDF("scanned_document.pdf")
        pdf.apply_ocr(engine="easyocr", resolution=144)
        tables = pdf.pages[0].find_all('table')
        ```
    """

    @classmethod
    def from_images(
        cls,
        images: Union["Image.Image", List["Image.Image"], str, List[str], Path, List[Path]],
        resolution: int = 300,
        apply_ocr: bool = True,
        ocr_engine: Optional[str] = None,
        **pdf_options,
    ) -> "PDF":
        """Create a PDF from image(s).

        Args:
            images: Single image, list of images, or path(s)/URL(s) to image files
            resolution: DPI for the PDF (default: 300, good for OCR and viewing)
            apply_ocr: Apply OCR to make searchable (default: True)
            ocr_engine: OCR engine to use (default: auto-detect)
            **pdf_options: Options passed to PDF constructor

        Returns:
            PDF object containing the images as pages

        Example:
            ```python
            # Simple scan to searchable PDF
            pdf = PDF.from_images("scan.jpg")

            # From URL
            pdf = PDF.from_images("https://example.com/image.png")

            # Multiple pages (mix of local and URLs)
            pdf = PDF.from_images(["page1.png", "https://example.com/page2.jpg"])

            # Without OCR
            pdf = PDF.from_images(images, apply_ocr=False)

            # With specific engine
            pdf = PDF.from_images(images, ocr_engine='surya')
            ```
        """
        import urllib.request

        from PIL import ImageOps

        def _open_image(source):
            """Open an image from file path, URL, or return PIL Image as-is."""
            if isinstance(source, Image.Image):
                return source

            source_str = str(source)
            if source_str.startswith(("http://", "https://")):
                # Download from URL
                with urllib.request.urlopen(source_str) as response:
                    img_data = response.read()
                return Image.open(io.BytesIO(img_data))
            else:
                # Local file path
                return Image.open(source)

        # Normalize inputs to list of PIL Images
        if isinstance(images, (str, Path)):
            images = [_open_image(images)]
        elif isinstance(images, Image.Image):
            images = [images]
        elif isinstance(images, list):
            processed = []
            for img in images:
                processed.append(_open_image(img))
            images = processed

        # Process images
        processed_images = []
        for img in images:
            # Fix EXIF rotation
            img = ImageOps.exif_transpose(img) or img

            # Convert RGBA to RGB (PDF doesn't handle transparency well)
            if img.mode == "RGBA":
                bg = Image.new("RGB", img.size, "white")
                bg.paste(img, mask=img.split()[3])
                img = bg
            elif img.mode not in ["RGB", "L", "1", "CMYK"]:
                img = img.convert("RGB")

            processed_images.append(img)

        # Create PDF at specified resolution
        # Use BytesIO to keep in memory
        pdf_buffer = io.BytesIO()
        processed_images[0].save(
            pdf_buffer,
            "PDF",
            save_all=True,
            append_images=processed_images[1:] if len(processed_images) > 1 else [],
            resolution=resolution,
        )
        pdf_buffer.seek(0)

        # Create PDF object
        pdf = cls(pdf_buffer, **pdf_options)

        # Store metadata about source
        pdf._from_images = True
        pdf._source_metadata = {
            "type": "images",
            "count": len(processed_images),
            "resolution": resolution,
        }

        # Apply OCR if requested
        if apply_ocr:
            pdf.apply_ocr(engine=ocr_engine, resolution=resolution)

        return pdf

    def __init__(
        self,
        path_or_url_or_stream,
        reading_order: bool = True,
        font_attrs: Optional[List[str]] = None,
        keep_spaces: bool = True,
        text_tolerance: Optional[dict] = None,
        auto_text_tolerance: bool = True,
        text_layer: bool = True,
    ):
        """Initialize the enhanced PDF object.

        Args:
            path_or_url_or_stream: Path to the PDF file (str/Path), a URL (str),
                or a file-like object (stream). URLs must start with 'http://' or 'https://'.
            reading_order: If True, use natural reading order for text extraction.
                Defaults to True.
            font_attrs: List of font attributes for grouping characters into words.
                Common attributes include ['fontname', 'size']. Defaults to None.
            keep_spaces: If True, include spaces in word elements during text extraction.
                Defaults to True.
            text_tolerance: PDFplumber-style tolerance settings for text grouping.
                Dictionary with keys like 'x_tolerance', 'y_tolerance'. Defaults to None.
            auto_text_tolerance: If True, automatically scale text tolerance based on
                font size and document characteristics. Defaults to True.
            text_layer: If True, preserve existing text layer from the PDF. If False,
                removes all existing text elements during initialization, useful for
                OCR-only workflows. Defaults to True.

        Raises:
            TypeError: If path_or_url_or_stream is not a valid type.
            IOError: If the PDF file cannot be opened or read.
            ValueError: If URL download fails.

        Example:
            ```python
            # From file path
            pdf = npdf.PDF("document.pdf")

            # From URL
            pdf = npdf.PDF("https://example.com/document.pdf")

            # From stream
            with open("document.pdf", "rb") as f:
                pdf = npdf.PDF(f)

            # With custom settings
            pdf = npdf.PDF("document.pdf",
                          reading_order=False,
                          text_layer=False,  # For OCR-only processing
                          font_attrs=['fontname', 'size', 'flags'])
            ```
        """
        self._original_path_or_stream = path_or_url_or_stream
        self._temp_file = None
        self._resolved_path = None
        self._is_stream = False
        self._text_layer = text_layer
        stream_to_open = None

        if hasattr(path_or_url_or_stream, "read"):  # Check if it's file-like
            logger.info("Initializing PDF from in-memory stream.")
            self._is_stream = True
            self._resolved_path = None  # No resolved file path for streams
            self.source_path = "<stream>"  # Identifier for source
            self.path = self.source_path  # Use source identifier as path for streams
            stream_to_open = path_or_url_or_stream
            try:
                # The outer branch already guarantees a file-like object, so
                # capture its bytes directly for potential re-export.
                current_pos = path_or_url_or_stream.tell()
                path_or_url_or_stream.seek(0)
                self._original_bytes = path_or_url_or_stream.read()
                path_or_url_or_stream.seek(current_pos)
            except Exception:
                pass
        elif isinstance(path_or_url_or_stream, (str, Path)):
            path_or_url = str(path_or_url_or_stream)
            self.source_path = path_or_url  # Store original path/URL as source
            is_url = path_or_url.startswith("http://") or path_or_url.startswith("https://")

            if is_url:
                logger.info(f"Downloading PDF from URL: {path_or_url}")
                try:
                    with urllib.request.urlopen(path_or_url) as response:
                        data = response.read()
                    # Load directly into an in-memory buffer — no temp file needed
                    buffer = io.BytesIO(data)
                    buffer.seek(0)
                    self._temp_file = None  # No on-disk temp file
                    self._resolved_path = path_or_url  # For repr / get_id purposes
                    stream_to_open = buffer  # pdfplumber accepts file-like objects
                except Exception as e:
                    logger.error(f"Failed to download PDF from URL: {e}")
                    raise ValueError(f"Failed to download PDF from URL: {e}")
            else:
                self._resolved_path = str(Path(path_or_url).resolve())  # Resolve local paths
                stream_to_open = self._resolved_path
            self.path = self._resolved_path  # Use resolved path for file-based PDFs
        else:
            raise TypeError(
                f"Invalid input type: {type(path_or_url_or_stream)}. "
                f"Expected path (str/Path), URL (str), or file-like object."
            )

        logger.info(f"Opening PDF source: {self.source_path}")
        logger.debug(
            f"Parameters: reading_order={reading_order}, font_attrs={font_attrs}, keep_spaces={keep_spaces}"
        )

        try:
            self._pdf = pdfplumber.open(stream_to_open)
        except Exception as e:
            logger.error(f"Failed to open PDF: {e}", exc_info=True)
            self.close()  # Attempt cleanup if opening fails
            raise IOError(f"Failed to open PDF source: {self.source_path}") from e

        # Store configuration used for initialization
        self._reading_order = reading_order
        self._config = {"keep_spaces": keep_spaces}
        self._font_attrs = font_attrs

        self._ocr_manager = OCRManager() if OCRManager else None
        self._layout_manager = LayoutManager() if LayoutManager else None
        self.highlighter = HighlightingService(self)
        self._manager_registry = {}

        # Lazily instantiate pages only when accessed
        self._pages = _LazyPageList(
            self, self._pdf, font_attrs=font_attrs, load_text=self._text_layer
        )

        self._element_cache = {}
        self._exclusions = []
        self._regions = []

        logger.info(f"PDF '{self.source_path}' initialized with {len(self._pages)} pages.")

        self._initialize_managers()
        self._initialize_highlighter()

        # When text_layer=False, the text layer is simply never loaded by
        # _LazyPageList, so there is nothing to remove here.
        if not self._text_layer:
            logger.info("Text layer disabled (text_layer=False); pages load without text elements.")

        # Analysis results accessed via self.analyses property (see below)

        # --- Automatic cleanup when object is garbage-collected ---
        self._finalizer = weakref.finalize(
            self,
            PDF._finalize_cleanup,
            self._pdf,
            getattr(self, "_temp_file", None),
            getattr(self, "_is_stream", False),
        )

        # --- Text tolerance settings ------------------------------------
        # Users can pass pdfplumber-style keys (x_tolerance, x_tolerance_ratio,
        # y_tolerance, etc.) via *text_tolerance*.  We also keep a flag that
        # enables automatic tolerance scaling when explicit values are not
        # supplied.
        self._config["auto_text_tolerance"] = bool(auto_text_tolerance)
        if text_tolerance:
            # Only copy recognised primitives (numbers / None); ignore junk.
            allowed = {
                "x_tolerance",
                "x_tolerance_ratio",
                "y_tolerance",
                "keep_blank_chars",  # passthrough convenience
            }
            for k, v in text_tolerance.items():
                if k in allowed:
                    self._config[k] = v

    def _initialize_managers(self):
        """Set up manager factories for lazy instantiation."""
        # Store factories/classes for each manager key
        self._manager_factories = dict(DEFAULT_MANAGERS)
        self._managers = {}  # Will hold instantiated managers

    def get_manager(self, key: str) -> Any:
        """Retrieve a manager instance by its key, instantiating it lazily if needed.

        Managers are specialized components that handle specific functionality like
        classification, structured data extraction, or OCR processing. They are
        instantiated on-demand to minimize memory usage and startup time.

        Args:
            key: The manager key to retrieve. Common keys include 'classification'
                and 'structured_data'.

        Returns:
            The manager instance for the specified key.

        Raises:
            KeyError: If no manager is registered for the given key.
            RuntimeError: If the manager failed to initialize.

        Example:
            ```python
            pdf = npdf.PDF("document.pdf")
            classification_mgr = pdf.get_manager('classification')
            structured_data_mgr = pdf.get_manager('structured_data')
            ```
        """
        # Check if already instantiated
        if key in self._managers:
            manager_instance = self._managers[key]
            if manager_instance is None:
                raise RuntimeError(f"Manager '{key}' failed to initialize previously.")
            return manager_instance

        # Not instantiated yet: get factory/class
        if not hasattr(self, "_manager_factories") or key not in self._manager_factories:
            raise KeyError(
                f"No manager registered for key '{key}'. Available: {list(getattr(self, '_manager_factories', {}).keys())}"
            )
        factory_or_class = self._manager_factories[key]
        try:
            resolved = factory_or_class
            # If it's a callable that's not a class, call it to get the class/instance
            if not isinstance(resolved, type) and callable(resolved):
                resolved = resolved()
            # If it's a class, instantiate it
            if isinstance(resolved, type):
                instance = resolved()
            else:
                instance = resolved  # Already an instance
            self._managers[key] = instance
            return instance
        except Exception as e:
            logger.error(f"Failed to initialize manager for key '{key}': {e}")
            self._managers[key] = None
            raise RuntimeError(f"Manager '{key}' failed to initialize: {e}") from e

    def _initialize_highlighter(self):
        pass

    @property
    def metadata(self) -> Dict[str, Any]:
        """Access PDF metadata as a dictionary.

        Returns document metadata such as title, author, creation date, and other
        properties embedded in the PDF file. The exact keys available depend on
        what metadata was included when the PDF was created.

        Returns:
            Dictionary containing PDF metadata. Common keys include 'Title',
            'Author', 'Subject', 'Creator', 'Producer', 'CreationDate', and
            'ModDate'. May be empty if no metadata is available.

        Example:
            ```python
            pdf = npdf.PDF("document.pdf")
            print(pdf.metadata.get('Title', 'No title'))
            print(f"Created: {pdf.metadata.get('CreationDate')}")
            ```
        """
        return self._pdf.metadata

    @property
    def pages(self) -> "PageCollection":
        """Access pages as a PageCollection object.

        Provides access to individual pages of the PDF document through a
        collection interface that supports indexing, slicing, and iteration.
        Pages are lazy-loaded to minimize memory usage.

        Returns:
            PageCollection object that provides list-like access to PDF pages.

        Raises:
            AttributeError: If PDF pages are not yet initialized.

        Example:
            ```python
            pdf = npdf.PDF("document.pdf")

            # Access individual pages
            first_page = pdf.pages[0]
            last_page = pdf.pages[-1]

            # Slice pages
            first_three = pdf.pages[0:3]

            # Iterate over pages
            for page in pdf.pages:
                print(f"Page {page.index} has {len(page.chars)} characters")
            ```
        """
        from natural_pdf.core.page_collection import PageCollection

        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not yet initialized.")
        return PageCollection(self._pages)

    def clear_exclusions(self) -> "PDF":
        """Clear all exclusion functions from the PDF.

        Removes all previously added exclusion functions that were used to filter
        out unwanted content (like headers, footers, or administrative text) from
        text extraction and analysis operations.

        Returns:
            Self for method chaining.

        Raises:
            AttributeError: If PDF pages are not yet initialized.

        Example:
            ```python
            pdf = npdf.PDF("document.pdf")
            pdf.add_exclusion(lambda page: page.find('text:contains("CONFIDENTIAL")').above())

            # Later, remove all exclusions
            pdf.clear_exclusions()
            ```
        """
        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not yet initialized.")

        self._exclusions = []

        # Clear exclusions only from already-created (cached) pages to avoid forcing page creation
        for i in range(len(self._pages)):
            if self._pages._cache[i] is not None:  # Only clear from existing pages
                try:
                    self._pages._cache[i].clear_exclusions()
                except Exception as e:
                    logger.warning(f"Failed to clear exclusions from existing page {i}: {e}")
        return self

    def add_exclusion(self, exclusion_func, label: str = None) -> "PDF":
        """Add an exclusion function to the PDF.

        Exclusion functions define regions of each page that should be ignored during
        text extraction and analysis operations. This is useful for filtering out headers,
        footers, watermarks, or other administrative content that shouldn't be included
        in the main document processing.

        Args:
            exclusion_func: A function that takes a Page object and returns a Region
                to exclude from processing, or None if no exclusion should be applied
                to that page. The function is called once per page.
            label: Optional descriptive label for this exclusion rule, useful for
                debugging and identification.

        Returns:
            Self for method chaining.

        Raises:
            AttributeError: If PDF pages are not yet initialized.

        Example:
            ```python
            pdf = npdf.PDF("document.pdf")

            # Exclude headers (top 50 points of each page)
            pdf.add_exclusion(
                lambda page: page.region(0, 0, page.width, 50),
                label="header_exclusion"
            )

            # Exclude any text containing "CONFIDENTIAL"
            pdf.add_exclusion(
                lambda page: page.find('text:contains("CONFIDENTIAL")').above(include_source=True)
                if page.find('text:contains("CONFIDENTIAL")') else None,
                label="confidential_exclusion"
            )

            # Chain multiple exclusions
            pdf.add_exclusion(header_func).add_exclusion(footer_func)
            ```
        """
        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not yet initialized.")

        # ------------------------------------------------------------------
        # Support selector strings and ElementCollection objects directly.
        # Store exclusion and apply only to already-created pages.
        # ------------------------------------------------------------------
        from natural_pdf.elements.element_collection import ElementCollection  # local import

        if isinstance(exclusion_func, str) or isinstance(exclusion_func, ElementCollection):
            # Store for bookkeeping and lazy application
            self._exclusions.append((exclusion_func, label))

            # Don't modify already-cached pages - they will get PDF-level exclusions
            # dynamically through _get_exclusion_regions()
            return self

        # Fallback to original callable / Region behaviour ------------------
        exclusion_data = (exclusion_func, label)
        self._exclusions.append(exclusion_data)

        # Don't modify already-cached pages - they will get PDF-level exclusions
        # dynamically through _get_exclusion_regions()

        return self

    def apply_ocr(
        self,
        engine: Optional[str] = None,
        languages: Optional[List[str]] = None,
        min_confidence: Optional[float] = None,
        device: Optional[str] = None,
        resolution: Optional[int] = None,
        apply_exclusions: bool = True,
        detect_only: bool = False,
        replace: bool = True,
        options: Optional[Any] = None,
        pages: Optional[Union[Iterable[int], range, slice]] = None,
    ) -> "PDF":
        """Apply OCR to specified pages of the PDF using batch processing.

        Performs optical character recognition on the specified pages, converting
        image-based text into searchable and extractable text elements. This method
        supports multiple OCR engines and provides batch processing for efficiency.

        Args:
            engine: Name of the OCR engine to use. Supported engines include
                'easyocr' (default), 'surya', 'paddle', and 'doctr'. If None,
                uses the global default from natural_pdf.options.ocr.engine.
            languages: List of language codes for OCR recognition (e.g., ['en', 'es']).
                If None, uses the global default from natural_pdf.options.ocr.languages.
            min_confidence: Minimum confidence threshold (0.0-1.0) for accepting
                OCR results. Text with lower confidence will be filtered out.
                If None, uses the global default.
            device: Device to run OCR on ('cpu', 'cuda', 'mps'). Engine-specific
                availability varies. If None, uses engine defaults.
            resolution: DPI resolution for rendering pages to images before OCR.
                Higher values improve accuracy but increase processing time and memory.
                Typical values: 150 (fast), 300 (balanced), 600 (high quality).
            apply_exclusions: If True, mask excluded regions before OCR to prevent
                processing of headers, footers, or other unwanted content.
            detect_only: If True, only detect text bounding boxes without performing
                character recognition. Useful for layout analysis workflows.
            replace: If True, replace any existing OCR elements on the pages.
                If False, append new OCR results to existing elements.
            options: Engine-specific options object (e.g., EasyOCROptions, SuryaOptions).
                Allows fine-tuning of engine behavior beyond common parameters.
            pages: Page indices to process. Can be:
                - None: Process all pages
                - slice: Process a range of pages (e.g., slice(0, 10))
                - Iterable[int]: Process specific page indices (e.g., [0, 2, 5])

        Returns:
            Self for method chaining.

        Raises:
            ValueError: If invalid page index is provided.
            TypeError: If pages parameter has invalid type.
            RuntimeError: If OCR engine is not available or fails.

        Example:
            ```python
            pdf = npdf.PDF("scanned_document.pdf")

            # Basic OCR on all pages
            pdf.apply_ocr()

            # High-quality OCR with specific settings
            pdf.apply_ocr(
                engine='easyocr',
                languages=['en', 'es'],
                resolution=300,
                min_confidence=0.8
            )

            # OCR specific pages only
            pdf.apply_ocr(pages=[0, 1, 2])  # First 3 pages
            pdf.apply_ocr(pages=slice(5, 10))  # Pages 5-9

            # Detection-only workflow for layout analysis
            pdf.apply_ocr(detect_only=True, resolution=150)
            ```

        Note:
            OCR processing can be time and memory intensive, especially at high
            resolutions. Consider using exclusions to mask unwanted regions and
            processing pages in batches for large documents.
        """
        if not self._ocr_manager:
            logger.error("OCRManager not available. Cannot apply OCR.")
            return self

        # Apply global options as defaults, but allow explicit parameters to override
        import natural_pdf

        # Use global OCR options if parameters are not explicitly set
        if engine is None:
            engine = natural_pdf.options.ocr.engine
        if languages is None:
            languages = natural_pdf.options.ocr.languages
        if min_confidence is None:
            min_confidence = natural_pdf.options.ocr.min_confidence
        if device is None:
            pass  # No default device in options.ocr anymore

        thread_id = threading.current_thread().name
        logger.debug(f"[{thread_id}] PDF.apply_ocr starting for {self.path}")

        target_pages = []
        if pages is None:
            target_pages = self._pages
        elif isinstance(pages, slice):
            target_pages = self._pages[pages]
        elif hasattr(pages, "__iter__"):
            try:
                target_pages = [self._pages[i] for i in pages]
            except IndexError:
                raise ValueError("Invalid page index provided in 'pages' iterable.")
            except TypeError:
                raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")
        else:
            raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")

        if not target_pages:
            logger.warning("No pages selected for OCR processing.")
            return self

        page_numbers = [p.number for p in target_pages]
        logger.info(f"Applying batch OCR to pages: {page_numbers}...")

        final_resolution = resolution or getattr(self, "_config", {}).get("resolution", 150)
        logger.debug(f"Using OCR image resolution: {final_resolution} DPI")

        images_pil = []
        page_image_map = []
        logger.info(f"[{thread_id}] Rendering {len(target_pages)} pages...")
        failed_page_num = "unknown"
        render_start_time = time.monotonic()

        try:
            for i, page in enumerate(tqdm(target_pages, desc="Rendering pages", leave=False)):
                failed_page_num = page.number
                logger.debug(f"  Rendering page {page.number} (index {page.index})...")
                # Use render() for a clean image without highlights
                img = page.render(resolution=final_resolution)
                if img is None:
                    logger.error(f"  Failed to render page {page.number} to image.")
                    continue
                images_pil.append(img)
                page_image_map.append((page, img))
        except Exception as e:
            logger.error(f"Failed to render pages for batch OCR: {e}")
            logger.error(f"Failed to render pages for batch OCR: {e}")
            raise RuntimeError(f"Failed to render page {failed_page_num} for OCR.") from e

        render_end_time = time.monotonic()
        logger.debug(
            f"[{thread_id}] Finished rendering {len(images_pil)} images (Duration: {render_end_time - render_start_time:.2f}s)"
        )

        if not images_pil or not page_image_map:
            logger.error("No images were successfully rendered for batch OCR.")
            return self

        manager_args = {
            "images": images_pil,
            "engine": engine,
            "languages": languages,
            "min_confidence": min_confidence,
            "min_confidence": min_confidence,
            "device": device,
            "options": options,
            "detect_only": detect_only,
        }
        manager_args = {k: v for k, v in manager_args.items() if v is not None}

        ocr_call_args = {k: v for k, v in manager_args.items() if k != "images"}
        logger.info(f"[{thread_id}] Calling OCR Manager with args: {ocr_call_args}...")
        logger.info(f"[{thread_id}] Calling OCR Manager with args: {ocr_call_args}...")
        ocr_start_time = time.monotonic()

        batch_results = self._ocr_manager.apply_ocr(**manager_args)

        if not isinstance(batch_results, list) or len(batch_results) != len(images_pil):
            logger.error(f"OCR Manager returned unexpected result format or length.")
            return self

        logger.info("OCR Manager batch processing complete.")

        ocr_end_time = time.monotonic()
        logger.debug(
            f"[{thread_id}] OCR processing finished (Duration: {ocr_end_time - ocr_start_time:.2f}s)"
        )

        logger.info("Adding OCR results to respective pages...")
        total_elements_added = 0

        for i, (page, img) in enumerate(page_image_map):
            results_for_page = batch_results[i]
            if not isinstance(results_for_page, list):
                logger.warning(
                    f"Skipping results for page {page.number}: Expected list, got {type(results_for_page)}"
                )
                continue

            logger.debug(f"  Processing {len(results_for_page)} results for page {page.number}...")
            try:
                if manager_args.get("replace", True) and hasattr(page, "_element_mgr"):
                    page._element_mgr.remove_ocr_elements()

                img_scale_x = page.width / img.width if img.width > 0 else 1
                img_scale_y = page.height / img.height if img.height > 0 else 1
                elements = page._element_mgr.create_text_elements_from_ocr(
                    results_for_page, img_scale_x, img_scale_y
                )

                if elements:
                    total_elements_added += len(elements)
                    logger.debug(f"  Added {len(elements)} OCR TextElements to page {page.number}.")
                else:
                    logger.debug(f"  No valid TextElements created for page {page.number}.")
            except Exception as e:
                logger.error(f"  Error adding OCR elements to page {page.number}: {e}")

        logger.info(f"Finished adding OCR results. Total elements added: {total_elements_added}")
        return self

    def add_region(
        self, region_func: Callable[["Page"], Optional["Region"]], name: str = None
    ) -> "PDF":
        """
        Add a region function to the PDF.

        Args:
            region_func: A function that takes a Page and returns a Region, or None
            name: Optional name for the region

        Returns:
            Self for method chaining
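
        Example:
            A minimal sketch; the header coordinates are illustrative:
            ```python
            pdf = npdf.PDF("document.pdf")

            # Register the top 80 points of each page as a named region
            pdf.add_region(
                lambda page: page.region(0, 0, page.width, 80),
                name="header"
            )
            ```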
        """
        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not yet initialized.")

        region_data = (region_func, name)
        self._regions.append(region_data)

        # Apply only to already-created (cached) pages to avoid forcing page creation
        for i in range(len(self._pages)):
            if self._pages._cache[i] is not None:  # Only apply to existing pages
                page = self._pages._cache[i]
                try:
                    region_instance = region_func(page)
                    if region_instance and isinstance(region_instance, Region):
                        page.add_region(region_instance, name=name, source="named")
                    elif region_instance is not None:
                        logger.warning(
                            f"Region function did not return a valid Region for page {page.number}"
                        )
                except Exception as e:
                    logger.error(f"Error adding region for page {page.number}: {e}")

        return self

    @overload
    def find(
        self,
        *,
        text: str,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional[Any]: ...

    @overload
    def find(
        self,
        selector: str,
        *,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional[Any]: ...

    def find(
        self,
        selector: Optional[str] = None,
        *,
        text: Optional[str] = None,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional[Any]:
        """
        Find the first element matching the selector OR text content across all pages.

        Provide EITHER `selector` OR `text`, but not both.

        Args:
            selector: CSS-like selector string.
            text: Text content to search for (equivalent to 'text:contains(...)').
            apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
            regex: Whether to use regex for text search (`selector` or `text`) (default: False).
            case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
            **kwargs: Additional filter parameters.

        Returns:
            Element object or None if not found.
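
        Example:
            A short sketch; the search strings are illustrative:
            ```python
            pdf = npdf.PDF("document.pdf")

            # Selector form and text-shortcut form are equivalent here
            element = pdf.find('text:contains("Total")')
            element = pdf.find(text="Total")

            if element:
                print(element.extract_text())
            ```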
        """
        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not yet initialized.")

        if selector is not None and text is not None:
            raise ValueError("Provide either 'selector' or 'text', not both.")
        if selector is None and text is None:
            raise ValueError("Provide either 'selector' or 'text'.")

        # Construct selector if 'text' is provided
        effective_selector = ""
        if text is not None:
            escaped_text = text.replace('"', '\\"').replace("'", "\\'")
            effective_selector = f'text:contains("{escaped_text}")'
            logger.debug(
                f"Using text shortcut: find(text='{text}') -> find('{effective_selector}')"
            )
        elif selector is not None:
            effective_selector = selector
        else:
            raise ValueError("Internal error: No selector or text provided.")

        selector_obj = parse_selector(effective_selector)

        # Search page by page
        for page in self.pages:
            # Note: _apply_selector is on Page, so we call find directly here
            # We pass the constructed/validated effective_selector
            element = page.find(
                selector=effective_selector,  # Use the processed selector
                apply_exclusions=apply_exclusions,
                regex=regex,  # Pass down flags
                case=case,
                **kwargs,
            )
            if element:
                return element
        return None  # Not found on any page

    @overload
    def find_all(
        self,
        *,
        text: str,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection": ...

    @overload
    def find_all(
        self,
        selector: str,
        *,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection": ...

    def find_all(
        self,
        selector: Optional[str] = None,
        *,
        text: Optional[str] = None,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection":
        """
        Find all elements matching the selector OR text content across all pages.

        Provide EITHER `selector` OR `text`, but not both.

        Args:
            selector: CSS-like selector string.
            text: Text content to search for (equivalent to 'text:contains(...)').
            apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
            regex: Whether to use regex for text search (`selector` or `text`) (default: False).
            case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
            **kwargs: Additional filter parameters.

        Returns:
            ElementCollection with matching elements.
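
        Example:
            A short sketch; the selector values are illustrative:
            ```python
            pdf = npdf.PDF("document.pdf")

            # All bold text larger than 12 points, across every page
            headings = pdf.find_all('text[size>12]:bold')

            # Case-insensitive search via the text shortcut
            totals = pdf.find_all(text="total", case=False)
            ```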
        """
        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not yet initialized.")

        if selector is not None and text is not None:
            raise ValueError("Provide either 'selector' or 'text', not both.")
        if selector is None and text is None:
            raise ValueError("Provide either 'selector' or 'text'.")

        # Construct selector if 'text' is provided
        effective_selector = ""
        if text is not None:
            escaped_text = text.replace('"', '\\"').replace("'", "\\'")
            effective_selector = f'text:contains("{escaped_text}")'
            logger.debug(
                f"Using text shortcut: find_all(text='{text}') -> find_all('{effective_selector}')"
            )
        elif selector is not None:
            effective_selector = selector
        else:
            raise ValueError("Internal error: No selector or text provided.")

        # Instead of parsing here, let each page parse and apply
        # This avoids parsing the same selector multiple times if not needed
        # selector_obj = parse_selector(effective_selector)

        # kwargs["regex"] = regex # Removed: Already passed explicitly
        # kwargs["case"] = case   # Removed: Already passed explicitly

        all_elements = []
        for page in self.pages:
            # Call page.find_all with the effective selector and flags
            page_elements = page.find_all(
                selector=effective_selector,
                apply_exclusions=apply_exclusions,
                regex=regex,
                case=case,
                **kwargs,
            )
            if page_elements:
                all_elements.extend(page_elements.elements)

        from natural_pdf.elements.element_collection import ElementCollection

        return ElementCollection(all_elements)

    def extract_text(
        self,
        selector: Optional[str] = None,
        preserve_whitespace=True,
        use_exclusions=True,
        debug_exclusions=False,
        **kwargs,
    ) -> str:
        """
        Extract text from the entire document or matching elements.

        Args:
            selector: Optional selector to filter elements
            preserve_whitespace: Whether to keep blank characters
            use_exclusions: Whether to apply exclusion regions
            debug_exclusions: Whether to output detailed debugging for exclusions
            **kwargs: Additional extraction parameters

        Returns:
            Extracted text as string
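
        Example:
            A short sketch:
            ```python
            pdf = npdf.PDF("document.pdf")

            # Whole-document text, honoring any exclusion regions
            text = pdf.extract_text()

            # Text from matching elements only (selector is illustrative)
            headings = pdf.extract_text(selector='text[size>14]:bold')
            ```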
        """
        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not yet initialized.")

        if selector:
            elements = self.find_all(selector, apply_exclusions=use_exclusions, **kwargs)
            return elements.extract_text(preserve_whitespace=preserve_whitespace, **kwargs)

        if debug_exclusions:
            print(f"PDF: Extracting text with exclusions from {len(self.pages)} pages")
            print(f"PDF: Found {len(self._exclusions)} document-level exclusions")

        texts = []
        for page in self.pages:
            texts.append(
                page.extract_text(
                    preserve_whitespace=preserve_whitespace,
                    use_exclusions=use_exclusions,
                    debug_exclusions=debug_exclusions,
                    **kwargs,
                )
            )

        if debug_exclusions:
            print(f"PDF: Combined {len(texts)} pages of text")

        return "\n".join(texts)

    def extract_tables(
        self, selector: Optional[str] = None, merge_across_pages: bool = False, **kwargs
    ) -> List[Any]:
        """
        Extract tables from the document or matching elements.

        Args:
            selector: Optional selector to filter tables
            merge_across_pages: Whether to merge tables that span across pages
            **kwargs: Additional extraction parameters

        Returns:
            List of extracted tables
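
        Example:
            A minimal sketch (selector filtering and cross-page merging are
            not implemented yet, so only per-page extraction is shown):
            ```python
            pdf = npdf.PDF("document.pdf")

            for table in pdf.extract_tables():
                print(table)
            ```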
        """
        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not yet initialized.")

        logger.warning("PDF.extract_tables is not fully implemented yet.")
        all_tables = []

        for page in self.pages:
            if hasattr(page, "extract_tables"):
                all_tables.extend(page.extract_tables(**kwargs))
            else:
                logger.debug(f"Page {page.number} does not have extract_tables method.")

        if selector:
            logger.warning("Filtering extracted tables by selector is not implemented.")

        if merge_across_pages:
            logger.warning("Merging tables across pages is not implemented.")

        return all_tables

    def get_sections(
        self,
        start_elements=None,
        end_elements=None,
        new_section_on_page_break=False,
        include_boundaries="both",
        orientation="vertical",
    ) -> "ElementCollection":
        """
        Extract sections from the entire PDF based on start/end elements.

        This method delegates to the PageCollection.get_sections() method,
        providing a convenient way to extract document sections across all pages.

        Args:
            start_elements: Elements or selector string that mark the start of sections (optional)
            end_elements: Elements or selector string that mark the end of sections (optional)
            new_section_on_page_break: Whether to start a new section at page boundaries (default: False)
            include_boundaries: How to include boundary elements: 'start', 'end', 'both', or 'none' (default: 'both')
            orientation: 'vertical' (default) or 'horizontal' - determines section direction

        Returns:
            ElementCollection of Region objects representing the extracted sections

        Example:
            Extract sections between headers:
            ```python
            pdf = npdf.PDF("document.pdf")

            # Get sections between headers
            sections = pdf.get_sections(
                start_elements='text[size>14]:bold',
                end_elements='text[size>14]:bold'
            )

            # Get sections that break at page boundaries
            sections = pdf.get_sections(
                start_elements='text:contains("Chapter")',
                new_section_on_page_break=True
            )
            ```

        Note:
            You can provide only start_elements, only end_elements, or both.
            - With only start_elements: sections go from each start to the next start (or end of document)
            - With only end_elements: sections go from beginning of document to each end
            - With both: sections go from each start to the corresponding end
        """
        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not yet initialized.")

        return self.pages.get_sections(
            start_elements=start_elements,
            end_elements=end_elements,
            new_section_on_page_break=new_section_on_page_break,
            include_boundaries=include_boundaries,
            orientation=orientation,
        )

    def split(self, divider, **kwargs) -> "ElementCollection":
        """
        Divide the PDF into sections based on the provided divider elements.

        Args:
            divider: Elements or selector string that mark section boundaries
            **kwargs: Additional parameters passed to get_sections()
                - include_boundaries: How to include boundary elements (default: 'start')
                - orientation: 'vertical' or 'horizontal' (default: 'vertical')
                - new_section_on_page_break: Whether to split at page boundaries (default: False)

        Returns:
            ElementCollection of Region objects representing the sections

        Example:
            # Split a PDF by chapter titles
            chapters = pdf.split("text[size>20]:contains('Chapter')")

            # Export each chapter to a separate file
            for i, chapter in enumerate(chapters):
                chapter_text = chapter.extract_text()
                with open(f"chapter_{i+1}.txt", "w") as f:
                    f.write(chapter_text)

            # Split by horizontal rules/lines
            sections = pdf.split("line[orientation=horizontal]")

            # Split only by page breaks (no divider elements)
            pages = pdf.split(None, new_section_on_page_break=True)
        """
        # Delegate to pages collection
        return self.pages.split(divider, **kwargs)

    def save_searchable(self, output_path: Union[str, "Path"], dpi: int = 300, **kwargs):
        """
        DEPRECATED: Use save_pdf(..., ocr=True) instead.
        Saves the PDF with an OCR text layer, making content searchable.

        Requires optional dependencies. Install with: pip install \"natural-pdf[ocr-export]\"

        Args:
            output_path: Path to save the searchable PDF
            dpi: Resolution for rendering and OCR overlay
            **kwargs: Additional keyword arguments passed to the exporter
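
        Example:
            Prefer the replacement API:
            ```python
            pdf.save_pdf("searchable.pdf", ocr=True, dpi=300)
            ```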
        """
        logger.warning(
            "PDF.save_searchable() is deprecated. Use PDF.save_pdf(..., ocr=True) instead."
        )
        if create_searchable_pdf is None:
            raise ImportError(
                "Saving searchable PDF requires 'pikepdf'. "
                'Install with: pip install "natural-pdf[ocr-export]"'
            )
        output_path_str = str(output_path)
        # Call the exporter directly, passing self (the PDF instance)
        create_searchable_pdf(self, output_path_str, dpi=dpi, **kwargs)

    def save_pdf(
        self,
        output_path: Union[str, Path],
        ocr: bool = False,
        original: bool = False,
        dpi: int = 300,
    ):
        """
        Saves the PDF object (all its pages) to a new file.

        Choose one saving mode:
        - `ocr=True`: Creates a new, image-based PDF using OCR results from all pages.
          Text generated during the natural-pdf session becomes searchable,
          but original vector content is lost. Requires 'ocr-export' extras.
        - `original=True`: Saves a copy of the original PDF file this object represents.
          Any OCR results or analyses from the natural-pdf session are NOT included.
          If the PDF was opened from an in-memory buffer, this mode may not be suitable.
          Requires 'ocr-export' extras.

        Args:
            output_path: Path to save the new PDF file.
            ocr: If True, save as a searchable, image-based PDF using OCR data.
            original: If True, save the original source PDF content.
            dpi: Resolution (dots per inch) used only when ocr=True.

        Raises:
            ValueError: If the PDF has no pages, if neither or both 'ocr'
                        and 'original' are True.
            ImportError: If required libraries are not installed for the chosen mode.
            RuntimeError: If an unexpected error occurs during saving.
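
        Example:
            A short sketch; file names are illustrative:
            ```python
            pdf = npdf.PDF("scanned.pdf")
            pdf.apply_ocr()

            # Image-based PDF carrying the session's OCR text layer
            pdf.save_pdf("searchable.pdf", ocr=True)

            # Copy of the original source content (session OCR not included)
            pdf.save_pdf("copy.pdf", original=True)
            ```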
        """
        if not self.pages:
            raise ValueError("Cannot save an empty PDF object.")

        if not (ocr ^ original):  # XOR: exactly one must be true
            raise ValueError("Exactly one of 'ocr' or 'original' must be True.")

        output_path_obj = Path(output_path)
        output_path_str = str(output_path_obj)

        if ocr:
            if create_searchable_pdf is None:
                raise ImportError(
                    "Saving with ocr=True requires 'pikepdf'. "
                    'Install with: pip install "natural-pdf[ocr-export]"'
                )
            has_vector_elements = False
            for page in self.pages:
                if (
                    hasattr(page, "rects")
                    and page.rects
                    or hasattr(page, "lines")
                    and page.lines
                    or hasattr(page, "curves")
                    and page.curves
                    or (
                        hasattr(page, "chars")
                        and any(getattr(el, "source", None) != "ocr" for el in page.chars)
                    )
                    or (
                        hasattr(page, "words")
                        and any(getattr(el, "source", None) != "ocr" for el in page.words)
                    )
                ):
                    has_vector_elements = True
                    break
            if has_vector_elements:
                logger.warning(
                    "Warning: Saving with ocr=True creates an image-based PDF. "
                    "Original vector elements (rects, lines, non-OCR text/chars) "
                    "will not be preserved in the output file."
                )

            logger.info(f"Saving searchable PDF (OCR text layer) to: {output_path_str}")
            try:
                # Delegate to the searchable PDF exporter, passing self (PDF instance)
                create_searchable_pdf(self, output_path_str, dpi=dpi)
            except Exception as e:
                raise RuntimeError(f"Failed to create searchable PDF: {e}") from e

        elif original:
            if create_original_pdf is None:
                raise ImportError(
                    "Saving with original=True requires 'pikepdf'. "
                    'Install with: pip install "natural-pdf[ocr-export]"'
                )

            # Optional: Add warning about losing OCR data similar to PageCollection
            has_ocr_elements = False
            for page in self.pages:
                if hasattr(page, "find_all"):
                    ocr_text_elements = page.find_all("text[source=ocr]")
                    if ocr_text_elements:
                        has_ocr_elements = True
                        break
                elif hasattr(page, "words"):  # Fallback
                    if any(getattr(el, "source", None) == "ocr" for el in page.words):
                        has_ocr_elements = True
                        break
            if has_ocr_elements:
                logger.warning(
                    "Warning: Saving with original=True preserves original page content. "
                    "OCR text generated in this session will not be included in the saved file."
                )

            logger.info(f"Saving original PDF content to: {output_path_str}")
            try:
                # Delegate to the original PDF exporter, passing self (PDF instance)
                create_original_pdf(self, output_path_str)
            except Exception:
                # Re-raise exception from exporter unchanged
                raise

    def _get_render_specs(
        self,
        mode: Literal["show", "render"] = "show",
        color: Optional[Union[str, Tuple[int, int, int]]] = None,
        highlights: Optional[List[Dict[str, Any]]] = None,
        crop: Union[bool, Literal["content"]] = False,
        crop_bbox: Optional[Tuple[float, float, float, float]] = None,
        **kwargs,
    ) -> List[RenderSpec]:
        """Get render specifications for this PDF.

        For PDF objects, this delegates to the pages collection to handle
        multi-page rendering.

        Args:
            mode: Rendering mode - 'show' includes highlights, 'render' is clean
            color: Color for highlighting pages in show mode
            highlights: Additional highlight groups to show
            crop: Whether to crop pages
            crop_bbox: Explicit crop bounds
            **kwargs: Additional parameters

        Returns:
            List of RenderSpec objects, one per page
        """
        # Delegate to pages collection
        return self.pages._get_render_specs(
            mode=mode, color=color, highlights=highlights, crop=crop, crop_bbox=crop_bbox, **kwargs
        )

    def ask(
        self,
        question: str,
        mode: str = "extractive",
        pages: Union[int, List[int], range] = None,
        min_confidence: float = 0.1,
        model: str = None,
        **kwargs,
    ) -> Dict[str, Any]:
        """
        Ask a single question about the document content.

        Args:
            question: Question string to ask about the document
            mode: "extractive" to extract answer from document, "generative" to generate
            pages: Specific pages to query (default: all pages)
            min_confidence: Minimum confidence threshold for answers
            model: Optional model name for question answering
            **kwargs: Additional parameters passed to the QA engine

        Returns:
            Dict containing: answer, confidence, found, page_num, source_elements, etc.
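
        Example:
            A short sketch; the question text is illustrative:
            ```python
            pdf = npdf.PDF("document.pdf")

            result = pdf.ask("What is the total amount due?")
            if result["found"]:
                print(result["answer"], result["confidence"], result["page_num"])
            ```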
        """
        # Delegate to ask_batch and return the first result
        results = self.ask_batch(
            [question], mode=mode, pages=pages, min_confidence=min_confidence, model=model, **kwargs
        )
        return (
            results[0]
            if results
            else {
                "answer": None,
                "confidence": 0.0,
                "found": False,
                "page_num": None,
                "source_elements": [],
            }
        )

    def ask_batch(
        self,
        questions: List[str],
        mode: str = "extractive",
        pages: Union[int, List[int], range] = None,
        min_confidence: float = 0.1,
        model: str = None,
        **kwargs,
    ) -> List[Dict[str, Any]]:
        """
        Ask multiple questions about the document content using batch processing.

        This method processes multiple questions efficiently in a single batch,
        avoiding the multiprocessing resource accumulation that can occur with
        sequential individual question calls.

        Args:
            questions: List of question strings to ask about the document
            mode: "extractive" to extract answer from document, "generative" to generate
            pages: Specific pages to query (default: all pages)
            min_confidence: Minimum confidence threshold for answers
            model: Optional model name for question answering
            **kwargs: Additional parameters passed to the QA engine

        Returns:
            List of Dicts, each containing: answer, confidence, found, page_num, source_elements, etc.
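
        Example:
            A short sketch; the questions are illustrative:
            ```python
            pdf = npdf.PDF("document.pdf")

            results = pdf.ask_batch(
                ["Who is the sender?", "What is the invoice date?"],
                pages=range(2),  # first two pages only
            )
            for result in results:
                print(result["answer"], result["confidence"])
            ```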
        """
        from natural_pdf.qa import get_qa_engine

        if not questions:
            return []

        if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
            raise TypeError("'questions' must be a list of strings")

        qa_engine = get_qa_engine() if model is None else get_qa_engine(model_name=model)

        # Resolve target pages
        if pages is None:
            target_pages = self.pages
        elif isinstance(pages, int):
            if 0 <= pages < len(self.pages):
                target_pages = [self.pages[pages]]
            else:
                raise IndexError(f"Page index {pages} out of range (0-{len(self.pages)-1})")
        elif isinstance(pages, (list, range)):
            target_pages = []
            for page_idx in pages:
                if 0 <= page_idx < len(self.pages):
                    target_pages.append(self.pages[page_idx])
                else:
                    logger.warning(f"Page index {page_idx} out of range, skipping")
        else:
            raise ValueError(f"Invalid pages parameter: {pages}")

        if not target_pages:
            logger.warning("No valid pages found for QA processing.")
            return [
                {
                    "answer": None,
                    "confidence": 0.0,
                    "found": False,
                    "page_num": None,
                    "source_elements": [],
                }
                for _ in questions
            ]

        logger.info(
            f"Processing {len(questions)} question(s) across {len(target_pages)} page(s) using batch QA..."
        )

        # Collect all page images and metadata for batch processing
        page_images = []
        page_word_boxes = []
        page_metadata = []

        for page in target_pages:
            # Get page image
            try:
                # Use render() for clean image without highlights
                page_image = page.render(resolution=150)
                if page_image is None:
                    logger.warning(f"Failed to render image for page {page.number}, skipping")
                    continue

                # Get text elements for word boxes
                elements = page.find_all("text")
                if not elements:
                    logger.warning(f"No text elements found on page {page.number}")
                    word_boxes = []
                else:
                    word_boxes = qa_engine._get_word_boxes_from_elements(
                        elements, offset_x=0, offset_y=0
                    )

                page_images.append(page_image)
                page_word_boxes.append(word_boxes)
                page_metadata.append({"page_number": page.number, "page_object": page})

            except Exception as e:
                logger.warning(f"Error processing page {page.number}: {e}")
                continue

        if not page_images:
            logger.warning("No page images could be processed for QA.")
            return [
                {
                    "answer": None,
                    "confidence": 0.0,
                    "found": False,
                    "page_num": None,
                    "source_elements": [],
                }
                for _ in questions
            ]

        # Process all questions against all pages in batch
        all_results = []

        for question_text in questions:
            question_results = []

            # Ask this question against each page (but in batch per page)
            for i, (page_image, word_boxes, page_meta) in enumerate(
                zip(page_images, page_word_boxes, page_metadata)
            ):
                try:
                    # Use the DocumentQA batch interface
                    page_result = qa_engine.ask(
                        image=page_image,
                        question=question_text,
                        word_boxes=word_boxes,
                        min_confidence=min_confidence,
                        **kwargs,
                    )

                    if page_result and page_result.found:
                        # Add page metadata to result
                        page_result_dict = {
                            "answer": page_result.answer,
                            "confidence": page_result.confidence,
                            "found": page_result.found,
                            "page_num": page_meta["page_number"],
                            "source_elements": getattr(page_result, "source_elements", []),
                            "start": getattr(page_result, "start", -1),
                            "end": getattr(page_result, "end", -1),
                        }
                        question_results.append(page_result_dict)

                except Exception as e:
                    logger.warning(
                        f"Error processing question '{question_text}' on page {page_meta['page_number']}: {e}"
                    )
                    continue

            # Sort results by confidence and take the best one for this question
            question_results.sort(key=lambda x: x.get("confidence", 0), reverse=True)

            if question_results:
                all_results.append(question_results[0])
            else:
                # No results found for this question
                all_results.append(
                    {
                        "answer": None,
                        "confidence": 0.0,
                        "found": False,
                        "page_num": None,
                        "source_elements": [],
                    }
                )

        return all_results

    def search_within_index(
        self,
        query: Union[str, Path, Image.Image, "Region"],
        search_service: "SearchServiceProtocol",
        options: Optional["SearchOptions"] = None,
    ) -> List[Dict[str, Any]]:
        """
        Finds relevant documents from this PDF within a search index.

        Args:
            query: The search query (text, image path, PIL Image, Region)
            search_service: A pre-configured SearchService instance
            options: Optional SearchOptions to configure the query

        Returns:
            A list of result dictionaries, sorted by relevance

        Raises:
            ImportError: If search dependencies are not installed
            ValueError: If search_service is None
            TypeError: If search_service does not conform to the protocol
            FileNotFoundError: If the collection managed by the service does not exist
            RuntimeError: For other search failures
        """
        if not search_service:
            raise ValueError("A configured SearchServiceProtocol instance must be provided.")

        collection_name = getattr(search_service, "collection_name", "<Unknown Collection>")
        logger.info(
            f"Searching within index '{collection_name}' for content from PDF '{self.path}'"
        )

        service = search_service

        query_input = query
        effective_options = copy.deepcopy(options) if options is not None else TextSearchOptions()

        if isinstance(query, Region):
            logger.debug("Query is a Region object. Extracting text.")
            if not isinstance(effective_options, TextSearchOptions):
                logger.warning(
                    "Querying with Region image requires MultiModalSearchOptions. Falling back to text extraction."
                )
            query_input = query.extract_text()
            if not query_input or query_input.isspace():
                logger.error("Region has no extractable text for query.")
                return []

        # Add filter to scope search to THIS PDF
        pdf_scope_filter = {
            "field": "pdf_path",
            "operator": "eq",
            "value": self.path,
        }
        logger.debug(f"Applying filter to scope search to PDF: {pdf_scope_filter}")

        # Combine with existing filters in options (if any)
        if effective_options.filters:
            logger.debug(f"Combining PDF scope filter with existing filters")
            if (
                isinstance(effective_options.filters, dict)
                and effective_options.filters.get("operator") == "AND"
            ):
                effective_options.filters["conditions"].append(pdf_scope_filter)
            elif isinstance(effective_options.filters, list):
                effective_options.filters = {
                    "operator": "AND",
                    "conditions": effective_options.filters + [pdf_scope_filter],
                }
            elif isinstance(effective_options.filters, dict):
                effective_options.filters = {
                    "operator": "AND",
                    "conditions": [effective_options.filters, pdf_scope_filter],
                }
            else:
                logger.warning(
                    f"Unsupported format for existing filters. Overwriting with PDF scope filter."
                )
                effective_options.filters = pdf_scope_filter
        else:
            effective_options.filters = pdf_scope_filter

        logger.debug(f"Final filters for service search: {effective_options.filters}")

        try:
            results = service.search(
                query=query_input,
                options=effective_options,
            )
            logger.info(f"SearchService returned {len(results)} results from PDF '{self.path}'")
            return results
        except FileNotFoundError as fnf:
            logger.error(f"Search failed: Collection not found. Error: {fnf}")
            raise
            logger.error(f"Search failed: Collection not found. Error: {fnf}")
            raise
        except Exception as e:
            logger.error(f"SearchService search failed: {e}")
            raise RuntimeError(f"Search within index failed. See logs for details.") from e
            logger.error(f"SearchService search failed: {e}")
            raise RuntimeError(f"Search within index failed. See logs for details.") from e
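
    # Illustrative usage sketch (not part of the class): querying a pre-built
    # index for content from this PDF. The factory name below is hypothetical;
    # any object satisfying SearchServiceProtocol works.
    #
    #     service = get_search_service(collection_name="my_docs")  # hypothetical factory
    #     hits = pdf.search_within_index("total revenue", search_service=service)
    #     for hit in hits:
    #         print(hit)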

    def export_ocr_correction_task(self, output_zip_path: str, **kwargs):
        """
        Exports OCR results from this PDF into a correction task package.

        Args:
            output_zip_path: The path to save the output zip file
            **kwargs: Additional arguments passed to create_correction_task_package
        """
        try:
            from natural_pdf.utils.packaging import create_correction_task_package

            create_correction_task_package(source=self, output_zip_path=output_zip_path, **kwargs)
        except ImportError:
            logger.error(
                "Failed to import 'create_correction_task_package'. Packaging utility might be missing."
            )
        except Exception as e:
            logger.error(f"Failed to export correction task: {e}")
            raise
            logger.error(f"Failed to export correction task: {e}")
            raise
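
    # Illustrative usage sketch: bundle OCR output for manual review.
    #
    #     pdf.apply_ocr()
    #     pdf.export_ocr_correction_task("correction_task.zip")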

    def update_text(
        self,
        transform: Callable[[Any], Optional[str]],
        pages: Optional[Union[Iterable[int], range, slice]] = None,
        selector: str = "text",
        max_workers: Optional[int] = None,
        progress_callback: Optional[Callable[[], None]] = None,
    ) -> "PDF":
        """
        Applies corrections to text elements using a callback function.

        Args:
            transform: Function that takes an element and returns corrected text or None
            pages: Optional page indices/slice to limit the scope of correction
            selector: Selector to apply corrections to (default: "text")
            max_workers: Maximum number of threads to use for parallel execution
            progress_callback: Optional callback function for progress updates

        Returns:
            Self for method chaining
        """
        target_page_indices = []
        if pages is None:
            target_page_indices = list(range(len(self._pages)))
        elif isinstance(pages, slice):
            target_page_indices = list(range(*pages.indices(len(self._pages))))
        elif hasattr(pages, "__iter__"):
            try:
                target_page_indices = [int(i) for i in pages]
                for idx in target_page_indices:
                    if not (0 <= idx < len(self._pages)):
                        raise IndexError(f"Page index {idx} out of range (0-{len(self._pages)-1}).")
            except (IndexError, TypeError, ValueError) as e:
                raise ValueError(f"Invalid page index in 'pages': {pages}. Error: {e}") from e
        else:
            raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")

        if not target_page_indices:
            logger.warning("No pages selected for text update.")
            return self

        logger.info(
            f"Starting text update for pages: {target_page_indices} with selector='{selector}'"
        )

        for page_idx in target_page_indices:
            page = self._pages[page_idx]
            try:
                page.update_text(
                    transform=transform,
                    selector=selector,
                    max_workers=max_workers,
                    progress_callback=progress_callback,
                )
            except Exception as e:
                logger.error(f"Error during text update on page {page_idx}: {e}")
                logger.error(f"Error during text update on page {page_idx}: {e}")

        logger.info("Text update process finished.")
        return self
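
    # Illustrative usage sketch (not part of the class): a transform that
    # strips trailing whitespace from text elements on the first two pages;
    # returning None leaves an element unchanged. Assumes text elements
    # expose a `.text` attribute.
    #
    #     pdf.update_text(
    #         transform=lambda el: el.text.rstrip() if el.text else None,
    #         pages=range(2),
    #     )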

    def __len__(self) -> int:
        """Return the number of pages in the PDF."""
        if not hasattr(self, "_pages"):
            return 0
        return len(self._pages)

    def __getitem__(self, key) -> Union["Page", "PageCollection"]:
        """Access pages by index or slice."""
        if not hasattr(self, "_pages"):
            raise AttributeError("PDF pages not initialized yet.")

        if isinstance(key, slice):
            from natural_pdf.core.page_collection import PageCollection

            # Use the lazy page list's slicing which returns another _LazyPageList
            lazy_slice = self._pages[key]
            # Wrap in PageCollection for compatibility
            return PageCollection(lazy_slice)
        elif isinstance(key, int):
            if 0 <= key < len(self._pages):
                return self._pages[key]
            else:
                raise IndexError(f"Page index {key} out of range (0-{len(self._pages)-1}).")
        else:
            raise TypeError(f"Page indices must be integers or slices, not {type(key)}.")

    def close(self):
        """Close the underlying PDF file and clean up any temporary files."""
        if hasattr(self, "_pdf") and self._pdf is not None:
            try:
                self._pdf.close()
                logger.debug(f"Closed pdfplumber PDF object for {self.source_path}")
            except Exception as e:
                logger.warning(f"Error closing pdfplumber object: {e}")
            finally:
                self._pdf = None

        if hasattr(self, "_temp_file") and self._temp_file is not None:
            temp_file_path = None
            try:
                if hasattr(self._temp_file, "name") and self._temp_file.name:
                    temp_file_path = self._temp_file.name
                    # Only unlink if it exists and _is_stream is False (meaning WE created it)
                    if not self._is_stream and os.path.exists(temp_file_path):
                        os.unlink(temp_file_path)
                        logger.debug(f"Removed temporary PDF file: {temp_file_path}")
            except Exception as e:
                logger.warning(f"Failed to clean up temporary file '{temp_file_path}': {e}")

        # Cancels the weakref finalizer so we don't double-clean
        if hasattr(self, "_finalizer") and self._finalizer.alive:
            self._finalizer()

    def __enter__(self):
        """Context manager entry."""
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit."""
        self.close()
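
    # Illustrative usage sketch: the context manager guarantees close() runs,
    # releasing the pdfplumber handle and any temporary download file.
    #
    #     with PDF("document.pdf") as pdf:
    #         text = pdf.pages[0].extract_text()
    #     # resources are released here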

    def __repr__(self) -> str:
        """Return a string representation of the PDF object."""
        if not hasattr(self, "_pages"):
            page_count_str = "uninitialized"
        else:
            page_count_str = str(len(self._pages))

        source_info = getattr(self, "source_path", "unknown source")
        return f"<PDF source='{source_info}' pages={page_count_str}>"

    def get_id(self) -> str:
        """Get unique identifier for this PDF."""
        """Get unique identifier for this PDF."""
        return self.path

    # --- Deskew Method --- #

    def deskew(
        self,
        pages: Optional[Union[Iterable[int], range, slice]] = None,
        resolution: int = 300,
        angle: Optional[float] = None,
        detection_resolution: int = 72,
        force_overwrite: bool = False,
        **deskew_kwargs,
    ) -> "PDF":
        """
        Creates a new, in-memory PDF object containing deskewed versions of the
        specified pages from the original PDF.

        This method renders each selected page, detects and corrects skew using the 'deskew'
        library, and then combines the resulting images into a new PDF using 'img2pdf'.
        The new PDF object is returned directly.

        Important: The returned PDF is image-based. Any existing text, OCR results,
        annotations, or other elements from the original pages will *not* be carried over.

        Args:
            pages: Page indices/slice to include (0-based). If None, processes all pages.
            resolution: DPI resolution for rendering the output deskewed pages.
            angle: The specific angle (in degrees) to rotate by. If None, detects automatically.
            detection_resolution: DPI resolution used for skew detection if angles are not
                                  already cached on the page objects.
            force_overwrite: If False (default), raises a ValueError if any target page
                             already contains processed elements (text, OCR, regions) to
                             prevent accidental data loss. Set to True to proceed anyway.
            **deskew_kwargs: Additional keyword arguments passed to `deskew.determine_skew`
                             during automatic detection (e.g., `max_angle`, `num_peaks`).

        Returns:
            A new PDF object representing the deskewed document.

        Raises:
            ImportError: If 'deskew' or 'img2pdf' libraries are not installed.
            ValueError: If `force_overwrite` is False and target pages contain elements.
            FileNotFoundError: If the source PDF cannot be read (if file-based).
            IOError: If creating the in-memory PDF fails.
            RuntimeError: If rendering or deskewing individual pages fails.
        """
        if not DESKEW_AVAILABLE:
            raise ImportError(
                "Deskew/img2pdf libraries missing. Install with: pip install natural-pdf[deskew]"
            )

        target_pages = self._get_target_pages(pages)  # Use helper to resolve pages

        # --- Safety Check --- #
        if not force_overwrite:
            for page in target_pages:
                # Check if the element manager has been initialized and contains any elements
                if (
                    hasattr(page, "_element_mgr")
                    and page._element_mgr
                    and page._element_mgr.has_elements()
                ):
                    raise ValueError(
                        f"Page {page.number} contains existing elements (text, OCR, etc.). "
                        f"Deskewing creates an image-only PDF, discarding these elements. "
                        f"Set force_overwrite=True to proceed."
                    )

        # --- Process Pages --- #
        deskewed_images_bytes = []
        logger.info(f"Deskewing {len(target_pages)} pages (output resolution={resolution} DPI)...")

        for page in tqdm(target_pages, desc="Deskewing Pages", leave=False):
            try:
                # Use page.deskew to get the corrected PIL image
                # Pass down resolutions and kwargs
                deskewed_img = page.deskew(
                    resolution=resolution,
                    angle=angle,  # Let page.deskew handle detection/caching
                    detection_resolution=detection_resolution,
                    **deskew_kwargs,
                )

                if not deskewed_img:
                    logger.warning(
                        f"Page {page.number}: Failed to generate deskewed image, skipping."
                    )
                    continue

                # Convert image to bytes for img2pdf (use PNG for lossless quality)
                with io.BytesIO() as buf:
                    deskewed_img.save(buf, format="PNG")
                    deskewed_images_bytes.append(buf.getvalue())

            except Exception as e:
                logger.error(
                    f"Page {page.number}: Failed during deskewing process: {e}", exc_info=True
                )
                # Option: Raise a runtime error, or continue and skip the page?
                # Raising makes the whole operation fail if one page fails.
                raise RuntimeError(f"Failed to process page {page.number} during deskewing.") from e

        # --- Create PDF --- #
        if not deskewed_images_bytes:
            raise RuntimeError("No pages were successfully processed to create the deskewed PDF.")

        logger.info(f"Combining {len(deskewed_images_bytes)} deskewed images into in-memory PDF...")
        try:
            # Use img2pdf to combine image bytes into PDF bytes
            pdf_bytes = img2pdf.convert(deskewed_images_bytes)

            # Wrap bytes in a stream
            pdf_stream = io.BytesIO(pdf_bytes)

            # Create a new PDF object from the stream using original config
            logger.info("Creating new PDF object from deskewed stream...")
            new_pdf = PDF(
                pdf_stream,
                reading_order=self._reading_order,
                font_attrs=self._font_attrs,
                keep_spaces=self._config.get("keep_spaces", True),
                text_layer=self._text_layer,
            )
            return new_pdf
        except Exception as e:
            logger.error(f"Failed to create in-memory PDF using img2pdf/PDF init: {e}")
            raise IOError("Failed to create deskewed PDF object from image stream.") from e
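
    # Illustrative usage sketch: produce a straightened, image-only copy.
    # Whether OCR can then be applied depends on the installed extras.
    #
    #     straight = pdf.deskew(resolution=300, force_overwrite=True)
    #     straight.apply_ocr()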

    # --- End Deskew Method --- #

    # --- Classification Methods --- #

    def classify_pages(
        self,
        labels: List[str],
        model: Optional[str] = None,
        pages: Optional[Union[Iterable[int], range, slice]] = None,
        analysis_key: str = "classification",
        using: Optional[str] = None,
        **kwargs,
    ) -> "PDF":
        """
        Classifies specified pages of the PDF.

        Args:
            labels: List of category names
            model: Model identifier ('text', 'vision', or specific HF ID)
            pages: Page indices, slice, or None for all pages
            analysis_key: Key to store results in page's analyses dict
            using: Processing mode ('text' or 'vision')
            **kwargs: Additional arguments for the ClassificationManager

        Returns:
            Self for method chaining
        """
        if not labels:
            raise ValueError("Labels list cannot be empty.")

        try:
            manager = self.get_manager("classification")
        except (ValueError, RuntimeError) as e:
            raise ClassificationError(f"Cannot get ClassificationManager: {e}") from e

        if not manager or not manager.is_available():
            from natural_pdf.classification.manager import is_classification_available

            if not is_classification_available():
                raise ImportError(
                    "Classification dependencies missing. "
                    'Install with: pip install "natural-pdf[ai]"'
                )
            raise ClassificationError("ClassificationManager not available.")

        target_pages = []
        if pages is None:
            target_pages = self._pages
        elif isinstance(pages, slice):
            target_pages = self._pages[pages]
        elif hasattr(pages, "__iter__"):
            try:
                target_pages = [self._pages[i] for i in pages]
            except IndexError:
                raise ValueError("Invalid page index provided.")
            except TypeError:
                raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")
        else:
            raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")

        if not target_pages:
            logger.warning("No pages selected for classification.")
            return self

        inferred_using = manager.infer_using(model if model else manager.DEFAULT_TEXT_MODEL, using)
        logger.info(
            f"Classifying {len(target_pages)} pages using model '{model or '(default)'}' (mode: {inferred_using})"
        )

        page_contents = []
        pages_to_classify = []
        logger.debug(f"Gathering content for {len(target_pages)} pages...")

        for page in target_pages:
            try:
                content = page._get_classification_content(model_type=inferred_using, **kwargs)
                page_contents.append(content)
                pages_to_classify.append(page)
            except ValueError as e:
                logger.warning(f"Skipping page {page.number}: Cannot get content - {e}")
            except Exception as e:
                logger.warning(f"Skipping page {page.number}: Error getting content - {e}")

        if not page_contents:
            logger.warning("No content could be gathered for batch classification.")
            return self

        logger.debug(f"Gathered content for {len(pages_to_classify)} pages.")

        try:
            batch_results = manager.classify_batch(
                item_contents=page_contents,
                labels=labels,
                model_id=model,
                using=inferred_using,
                **kwargs,
            )
        except Exception as e:
            logger.error(f"Batch classification failed: {e}")
            raise ClassificationError(f"Batch classification failed: {e}") from e

        if len(batch_results) != len(pages_to_classify):
            logger.error(
                f"Mismatch between number of results ({len(batch_results)}) and pages ({len(pages_to_classify)})"
            )
            return self

        logger.debug(
            f"Distributing {len(batch_results)} results to pages under key '{analysis_key}'..."
        )
        for page, result_obj in zip(pages_to_classify, batch_results):
            try:
                if not hasattr(page, "analyses") or page.analyses is None:
                    page.analyses = {}
                page.analyses[analysis_key] = result_obj
            except Exception as e:
                logger.warning(
                    f"Failed to store classification results for page {page.number}: {e}"
                )

        logger.info(f"Finished classifying PDF pages.")
        return self
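
    # Illustrative usage sketch: label every page, then read each result back
    # from the page's analyses dict (the label set here is an assumption).
    #
    #     pdf.classify_pages(labels=["invoice", "letter", "form"], using="text")
    #     for page in pdf.pages:
    #         print(page.number, page.analyses.get("classification"))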

    # --- End Classification Methods --- #

    # --- Extraction Support --- #
    def _get_extraction_content(self, using: str = "text", **kwargs) -> Any:
        """
        Retrieves the content for the entire PDF.

        Args:
            using: 'text' or 'vision'
            **kwargs: Additional arguments passed to extract_text or page.to_image

        Returns:
            str: Extracted text if using='text'
            List[PIL.Image.Image]: List of page images if using='vision'
            None: If content cannot be retrieved
        """
        if using == "text":
            try:
                layout = kwargs.pop("layout", True)
                return self.extract_text(layout=layout, **kwargs)
            except Exception as e:
                logger.error(f"Error extracting text from PDF: {e}")
                return None
        elif using == "vision":
            page_images = []
            logger.info(f"Rendering {len(self.pages)} pages to images...")

            resolution = kwargs.pop("resolution", 72)
            include_highlights = kwargs.pop("include_highlights", False)
            labels = kwargs.pop("labels", False)

            try:
                for page in tqdm(self.pages, desc="Rendering Pages"):
                    # Use render() for clean images
                    img = page.render(
                        resolution=resolution,
                        **kwargs,
                    )
                    if img:
                        page_images.append(img)
                    else:
                        logger.warning(f"Failed to render page {page.number}, skipping.")
                if not page_images:
                    logger.error("Failed to render any pages.")
                    return None
                return page_images
            except Exception as e:
                logger.error(f"Error rendering pages: {e}")
                return None
        else:
            logger.error(f"Unsupported value for 'using': {using}")
            return None

    # --- End Extraction Support --- #

    def _gather_analysis_data(
        self,
        analysis_keys: List[str],
        include_content: bool,
        include_images: bool,
        image_dir: Optional[Path],
        image_format: str,
        image_resolution: int,
    ) -> List[Dict[str, Any]]:
        """
        Gather analysis data from all pages in the PDF.

        Args:
            analysis_keys: Keys in the analyses dictionary to export
            include_content: Whether to include extracted text
            include_images: Whether to export images
            image_dir: Directory to save images
            image_format: Format to save images
            image_resolution: Resolution for exported images

        Returns:
            List of dictionaries containing analysis data
        """
        if not hasattr(self, "_pages") or not self._pages:
            logger.warning(f"No pages found in PDF {self.path}")
            return []

        all_data = []

        for page in tqdm(self._pages, desc="Gathering page data", leave=False):
            # Basic page information
            page_data = {
                "pdf_path": self.path,
                "page_number": page.number,
                "page_index": page.index,
            }

            # Include extracted text if requested
            if include_content:
                try:
                    page_data["content"] = page.extract_text(preserve_whitespace=True)
                except Exception as e:
                    logger.error(f"Error extracting text from page {page.number}: {e}")
                    page_data["content"] = ""

            # Save image if requested
            if include_images:
                try:
                    # Create image filename
                    image_filename = f"pdf_{Path(self.path).stem}_page_{page.number}.{image_format}"
                    image_path = image_dir / image_filename

                    # Save image
                    page.save_image(
                        str(image_path), resolution=image_resolution, include_highlights=True
                    )

                    # Add relative path to data
                    page_data["image_path"] = str(Path(image_path).relative_to(image_dir.parent))
                except Exception as e:
                    logger.error(f"Error saving image for page {page.number}: {e}")
                    page_data["image_path"] = None

            # Add analyses data
            for key in analysis_keys:
                if not hasattr(page, "analyses") or not page.analyses:
                    raise ValueError(f"Page {page.number} does not have analyses data")

                if key not in page.analyses:
                    raise KeyError(f"Analysis key '{key}' not found in page {page.number}")

                # Get the analysis result
                analysis_result = page.analyses[key]

                # If the result has a to_dict method, use it
                if hasattr(analysis_result, "to_dict"):
                    analysis_data = analysis_result.to_dict()
                else:
                    # Otherwise, use the result directly if it's dict-like
                    try:
                        analysis_data = dict(analysis_result)
                    except (TypeError, ValueError):
                        # Last resort: convert to string
                        analysis_data = {"raw_result": str(analysis_result)}

                # Add analysis data to page data with the key as prefix
                for k, v in analysis_data.items():
                    page_data[f"{key}.{k}"] = v

            all_data.append(page_data)

        return all_data

    def _get_target_pages(
        self, pages: Optional[Union[Iterable[int], range, slice]] = None
    ) -> List["Page"]:
        """
        Helper method to get a list of Page objects based on the input pages.

        Args:
            pages: Page indices, slice, or None for all pages

        Returns:
            List of Page objects
        """
        if pages is None:
            return self._pages
        elif isinstance(pages, slice):
            return self._pages[pages]
        elif hasattr(pages, "__iter__"):
            try:
                return [self._pages[i] for i in pages]
            except IndexError:
                raise ValueError("Invalid page index provided in 'pages' iterable.")
            except TypeError:
                raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")
        else:
            raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")

    # --- Classification Mixin Implementation --- #

    def _get_classification_manager(self) -> "ClassificationManager":
        """Returns the ClassificationManager instance for this PDF."""
        try:
            return self.get_manager("classification")
        except (KeyError, RuntimeError) as e:
            raise AttributeError(f"Could not retrieve ClassificationManager: {e}") from e

    def _get_classification_content(self, model_type: str, **kwargs) -> Union[str, Image.Image]:
        """
        Provides the content for classifying the entire PDF.

        Args:
            model_type: 'text' or 'vision'.
            **kwargs: Additional arguments (e.g., for text extraction or image rendering).

        Returns:
            Extracted text (str) or the first page's image (PIL.Image).

        Raises:
            ValueError: If model_type is 'vision' and PDF has != 1 page,
                      or if model_type is unsupported, or if content cannot be generated.
        """
        if model_type == "text":
            try:
                # Extract text from the whole document
                text = self.extract_text(**kwargs)  # Pass relevant kwargs
                if not text or text.isspace():
                    raise ValueError("PDF contains no extractable text for classification.")
                return text
            except Exception as e:
                logger.error(f"Error extracting text for PDF classification: {e}")
                raise ValueError("Failed to extract text for classification.") from e

        elif model_type == "vision":
            if len(self.pages) == 1:
                # Use the single page's content method
                try:
                    return self.pages[0]._get_classification_content(model_type="vision", **kwargs)
                except Exception as e:
                    logger.error(f"Error getting image from single page for classification: {e}")
                    raise ValueError("Failed to get image from single page.") from e
            elif len(self.pages) == 0:
                raise ValueError("Cannot classify empty PDF using vision model.")
            else:
                raise ValueError(
                    f"Vision classification for a PDF object is only supported for single-page PDFs. "
                    f"This PDF has {len(self.pages)} pages. Use pdf.pages[0].classify() or pdf.classify_pages()."
                )
        else:
            raise ValueError(f"Unsupported model_type for PDF classification: {model_type}")

    # --- End Classification Mixin Implementation ---

    # ------------------------------------------------------------------
    # Unified analysis storage (maps to metadata["analysis"])
    # ------------------------------------------------------------------

    @property
    def analyses(self) -> Dict[str, Any]:
        if not hasattr(self, "metadata") or self.metadata is None:
            # For PDF, metadata property returns self._pdf.metadata which may be None
            self._pdf.metadata = self._pdf.metadata or {}
        if self.metadata is None:
            # Fallback safeguard
            self._pdf.metadata = {}
        return self.metadata.setdefault("analysis", {})  # type: ignore[attr-defined]

    @analyses.setter
    def analyses(self, value: Dict[str, Any]):
        if not hasattr(self, "metadata") or self.metadata is None:
            self._pdf.metadata = self._pdf.metadata or {}
        self.metadata["analysis"] = value  # type: ignore[attr-defined]

    # Static helper for weakref.finalize to avoid capturing 'self'
    @staticmethod
    def _finalize_cleanup(plumber_pdf, temp_file_obj, is_stream):
        try:
            if plumber_pdf is not None:
                plumber_pdf.close()
        except Exception:
            pass

        if temp_file_obj and not is_stream:
            try:
                path = temp_file_obj.name if hasattr(temp_file_obj, "name") else None
                if path and os.path.exists(path):
                    os.unlink(path)
            except Exception as e:
                logger.warning(f"Failed to clean up temporary file '{path}': {e}")

    def analyze_layout(self, *args, **kwargs) -> "ElementCollection[Region]":
        """
        Analyzes the layout of all pages in the PDF.

        This is a convenience method that calls analyze_layout on the PDF's
        page collection.

        Args:
            *args: Positional arguments passed to pages.analyze_layout().
            **kwargs: Keyword arguments passed to pages.analyze_layout().

        Returns:
            An ElementCollection of all detected Region objects.
        """
        return self.pages.analyze_layout(*args, **kwargs)

    def detect_checkboxes(self, *args, **kwargs) -> "ElementCollection[Region]":
        """
        Detects checkboxes on all pages in the PDF.

        This is a convenience method that calls detect_checkboxes on the PDF's
        page collection.

        Args:
            *args: Positional arguments passed to pages.detect_checkboxes().
            **kwargs: Keyword arguments passed to pages.detect_checkboxes().

        Returns:
            An ElementCollection of all detected checkbox Region objects.
        """
        return self.pages.detect_checkboxes(*args, **kwargs)
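
    # Illustrative usage sketch: run both convenience wrappers and count what
    # they find across the document.
    #
    #     regions = pdf.analyze_layout()
    #     checkboxes = pdf.detect_checkboxes()
    #     print(len(regions), "regions,", len(checkboxes), "checkboxes")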

    def highlights(self, show: bool = False) -> "HighlightContext":
        """
        Create a highlight context for accumulating highlights.

        This allows for clean syntax to show multiple highlight groups:

        Example:
            with pdf.highlights() as h:
                h.add(pdf.find_all('table'), label='tables', color='blue')
                h.add(pdf.find_all('text:bold'), label='bold text', color='red')
                h.show()

        Or with automatic display:
            with pdf.highlights(show=True) as h:
                h.add(pdf.find_all('table'), label='tables')
                h.add(pdf.find_all('text:bold'), label='bold')
                # Automatically shows when exiting the context

        Args:
            show: If True, automatically show highlights when exiting context

        Returns:
            HighlightContext for accumulating highlights
        """
        from natural_pdf.core.highlighting_service import HighlightContext

        return HighlightContext(self, show_on_exit=show)
Attributes
natural_pdf.PDF.metadata property

Access PDF metadata as a dictionary.

Returns document metadata such as title, author, creation date, and other properties embedded in the PDF file. The exact keys available depend on what metadata was included when the PDF was created.

Returns:

Type Description
Dict[str, Any]

Dictionary containing PDF metadata. Common keys include 'Title', 'Author', 'Subject', 'Creator', 'Producer', 'CreationDate', and 'ModDate'. May be empty if no metadata is available.

Example
pdf = npdf.PDF("document.pdf")
print(pdf.metadata.get('Title', 'No title'))
print(f"Created: {pdf.metadata.get('CreationDate')}")
natural_pdf.PDF.pages property

Access pages as a PageCollection object.

Provides access to individual pages of the PDF document through a collection interface that supports indexing, slicing, and iteration. Pages are lazy-loaded to minimize memory usage.

Returns:

Type Description
PageCollection

PageCollection object that provides list-like access to PDF pages.

Raises:

Type Description
AttributeError

If PDF pages are not yet initialized.

Example
pdf = npdf.PDF("document.pdf")

# Access individual pages
first_page = pdf.pages[0]
last_page = pdf.pages[-1]

# Slice pages
first_three = pdf.pages[0:3]

# Iterate over pages
for page in pdf.pages:
    print(f"Page {page.index} has {len(page.chars)} characters")
Functions
natural_pdf.PDF.__enter__()

Context manager entry.

Source code in natural_pdf/core/pdf.py
def __enter__(self):
    """Context manager entry."""
    return self
natural_pdf.PDF.__exit__(exc_type, exc_val, exc_tb)

Context manager exit.

Source code in natural_pdf/core/pdf.py
def __exit__(self, exc_type, exc_val, exc_tb):
    """Context manager exit."""
    self.close()
natural_pdf.PDF.__getitem__(key)

Access pages by index or slice.

Source code in natural_pdf/core/pdf.py
def __getitem__(self, key) -> Union["Page", "PageCollection"]:
    """Access pages by index or slice."""
    if not hasattr(self, "_pages"):
        raise AttributeError("PDF pages not initialized yet.")

    if isinstance(key, slice):
        from natural_pdf.core.page_collection import PageCollection

        # Use the lazy page list's slicing which returns another _LazyPageList
        lazy_slice = self._pages[key]
        # Wrap in PageCollection for compatibility
        return PageCollection(lazy_slice)
    elif isinstance(key, int):
        if 0 <= key < len(self._pages):
            return self._pages[key]
        else:
            raise IndexError(f"Page index {key} out of range (0-{len(self._pages)-1}).")
    else:
        raise TypeError(f"Page indices must be integers or slices, not {type(key)}.")
natural_pdf.PDF.__init__(path_or_url_or_stream, reading_order=True, font_attrs=None, keep_spaces=True, text_tolerance=None, auto_text_tolerance=True, text_layer=True)

Initialize the enhanced PDF object.

Parameters:

Name Type Description Default
path_or_url_or_stream

Path to the PDF file (str/Path), a URL (str), or a file-like object (stream). URLs must start with 'http://' or 'https://'.

required
reading_order bool

If True, use natural reading order for text extraction. Defaults to True.

True
font_attrs Optional[List[str]]

List of font attributes for grouping characters into words. Common attributes include ['fontname', 'size']. Defaults to None.

None
keep_spaces bool

If True, include spaces in word elements during text extraction. Defaults to True.

True
text_tolerance Optional[dict]

PDFplumber-style tolerance settings for text grouping. Dictionary with keys like 'x_tolerance', 'y_tolerance'. Defaults to None.

None
auto_text_tolerance bool

If True, automatically scale text tolerance based on font size and document characteristics. Defaults to True.

True
text_layer bool

If True, preserve existing text layer from the PDF. If False, removes all existing text elements during initialization, useful for OCR-only workflows. Defaults to True.

True

Raises:

Type Description
TypeError

If path_or_url_or_stream is not a valid type.

IOError

If the PDF file cannot be opened or read.

ValueError

If URL download fails.

Example
# From file path
pdf = npdf.PDF("document.pdf")

# From URL
pdf = npdf.PDF("https://example.com/document.pdf")

# From stream
with open("document.pdf", "rb") as f:
    pdf = npdf.PDF(f)

# With custom settings
pdf = npdf.PDF("document.pdf",
              reading_order=False,
              text_layer=False,  # For OCR-only processing
              font_attrs=['fontname', 'size', 'flags'])
Source code in natural_pdf/core/pdf.py
def __init__(
    self,
    path_or_url_or_stream,
    reading_order: bool = True,
    font_attrs: Optional[List[str]] = None,
    keep_spaces: bool = True,
    text_tolerance: Optional[dict] = None,
    auto_text_tolerance: bool = True,
    text_layer: bool = True,
):
    """Initialize the enhanced PDF object.

    Args:
        path_or_url_or_stream: Path to the PDF file (str/Path), a URL (str),
            or a file-like object (stream). URLs must start with 'http://' or 'https://'.
        reading_order: If True, use natural reading order for text extraction.
            Defaults to True.
        font_attrs: List of font attributes for grouping characters into words.
            Common attributes include ['fontname', 'size']. Defaults to None.
        keep_spaces: If True, include spaces in word elements during text extraction.
            Defaults to True.
        text_tolerance: PDFplumber-style tolerance settings for text grouping.
            Dictionary with keys like 'x_tolerance', 'y_tolerance'. Defaults to None.
        auto_text_tolerance: If True, automatically scale text tolerance based on
            font size and document characteristics. Defaults to True.
        text_layer: If True, preserve existing text layer from the PDF. If False,
            removes all existing text elements during initialization, useful for
            OCR-only workflows. Defaults to True.

    Raises:
        TypeError: If path_or_url_or_stream is not a valid type.
        IOError: If the PDF file cannot be opened or read.
        ValueError: If URL download fails.

    Example:
        ```python
        # From file path
        pdf = npdf.PDF("document.pdf")

        # From URL
        pdf = npdf.PDF("https://example.com/document.pdf")

        # From stream
        with open("document.pdf", "rb") as f:
            pdf = npdf.PDF(f)

        # With custom settings
        pdf = npdf.PDF("document.pdf",
                      reading_order=False,
                      text_layer=False,  # For OCR-only processing
                      font_attrs=['fontname', 'size', 'flags'])
        ```
    """
    self._original_path_or_stream = path_or_url_or_stream
    self._temp_file = None
    self._resolved_path = None
    self._is_stream = False
    self._text_layer = text_layer
    stream_to_open = None

    if hasattr(path_or_url_or_stream, "read"):  # Check if it's file-like
        logger.info("Initializing PDF from in-memory stream.")
        self._is_stream = True
        self._resolved_path = None  # No resolved file path for streams
        self.source_path = "<stream>"  # Identifier for source
        self.path = self.source_path  # Use source identifier as path for streams
        stream_to_open = path_or_url_or_stream
        try:
            if hasattr(path_or_url_or_stream, "read"):
                # If caller provided an in-memory binary stream, capture bytes for potential re-export
                current_pos = path_or_url_or_stream.tell()
                path_or_url_or_stream.seek(0)
                self._original_bytes = path_or_url_or_stream.read()
                path_or_url_or_stream.seek(current_pos)
        except Exception:
            pass
    elif isinstance(path_or_url_or_stream, (str, Path)):
        path_or_url = str(path_or_url_or_stream)
        self.source_path = path_or_url  # Store original path/URL as source
        is_url = path_or_url.startswith("http://") or path_or_url.startswith("https://")

        if is_url:
            logger.info(f"Downloading PDF from URL: {path_or_url}")
            try:
                with urllib.request.urlopen(path_or_url) as response:
                    data = response.read()
                # Load directly into an in-memory buffer — no temp file needed
                buffer = io.BytesIO(data)
                buffer.seek(0)
                self._temp_file = None  # No on-disk temp file
                self._resolved_path = path_or_url  # For repr / get_id purposes
                stream_to_open = buffer  # pdfplumber accepts file-like objects
            except Exception as e:
                logger.error(f"Failed to download PDF from URL: {e}")
                raise ValueError(f"Failed to download PDF from URL: {e}")
        else:
            self._resolved_path = str(Path(path_or_url).resolve())  # Resolve local paths
            stream_to_open = self._resolved_path
        self.path = self._resolved_path  # Use resolved path for file-based PDFs
    else:
        raise TypeError(
            f"Invalid input type: {type(path_or_url_or_stream)}. "
            f"Expected path (str/Path), URL (str), or file-like object."
        )

    logger.info(f"Opening PDF source: {self.source_path}")
    logger.debug(
        f"Parameters: reading_order={reading_order}, font_attrs={font_attrs}, keep_spaces={keep_spaces}"
    )

    try:
        self._pdf = pdfplumber.open(stream_to_open)
    except Exception as e:
        logger.error(f"Failed to open PDF: {e}", exc_info=True)
        self.close()  # Attempt cleanup if opening fails
        raise IOError(f"Failed to open PDF source: {self.source_path}") from e

    # Store configuration used for initialization
    self._reading_order = reading_order
    self._config = {"keep_spaces": keep_spaces}
    self._font_attrs = font_attrs

    self._ocr_manager = OCRManager() if OCRManager else None
    self._layout_manager = LayoutManager() if LayoutManager else None
    self.highlighter = HighlightingService(self)
    # self._classification_manager_instance = ClassificationManager() # Removed this line
    self._manager_registry = {}

    # Lazily instantiate pages only when accessed
    self._pages = _LazyPageList(
        self, self._pdf, font_attrs=font_attrs, load_text=self._text_layer
    )

    self._element_cache = {}
    self._exclusions = []
    self._regions = []

    logger.info(f"PDF '{self.source_path}' initialized with {len(self._pages)} pages.")

    self._initialize_managers()
    self._initialize_highlighter()

    # Remove text layer if requested
    if not self._text_layer:
        logger.info("Removing text layer as requested (text_layer=False)")
        # Text layer is not loaded when text_layer=False, so no need to remove
        pass

    # Analysis results accessed via self.analyses property (see below)

    # --- Automatic cleanup when object is garbage-collected ---
    self._finalizer = weakref.finalize(
        self,
        PDF._finalize_cleanup,
        self._pdf,
        getattr(self, "_temp_file", None),
        getattr(self, "_is_stream", False),
    )

    # --- Text tolerance settings ------------------------------------
    # Users can pass pdfplumber-style keys (x_tolerance, x_tolerance_ratio,
    # y_tolerance, etc.) via *text_tolerance*.  We also keep a flag that
    # enables automatic tolerance scaling when explicit values are not
    # supplied.
    self._config["auto_text_tolerance"] = bool(auto_text_tolerance)
    if text_tolerance:
        # Only copy recognised primitives (numbers / None); ignore junk.
        allowed = {
            "x_tolerance",
            "x_tolerance_ratio",
            "y_tolerance",
            "keep_blank_chars",  # passthrough convenience
        }
        for k, v in text_tolerance.items():
            if k in allowed:
                self._config[k] = v
natural_pdf.PDF.__len__()

Return the number of pages in the PDF.

Source code in natural_pdf/core/pdf.py
def __len__(self) -> int:
    """Return the number of pages in the PDF."""
    if not hasattr(self, "_pages"):
        return 0
    return len(self._pages)
natural_pdf.PDF.__repr__()

Return a string representation of the PDF object.

Source code in natural_pdf/core/pdf.py
def __repr__(self) -> str:
    """Return a string representation of the PDF object."""
    if not hasattr(self, "_pages"):
        page_count_str = "uninitialized"
    else:
        page_count_str = str(len(self._pages))

    source_info = getattr(self, "source_path", "unknown source")
    return f"<PDF source='{source_info}' pages={page_count_str}>"
natural_pdf.PDF.add_exclusion(exclusion_func, label=None)

Add an exclusion function to the PDF.

Exclusion functions define regions of each page that should be ignored during text extraction and analysis operations. This is useful for filtering out headers, footers, watermarks, or other administrative content that shouldn't be included in the main document processing.

Parameters:

Name Type Description Default
exclusion_func

A function that takes a Page object and returns a Region to exclude from processing, or None if no exclusion should be applied to that page. The function is called once per page.

required
label str

Optional descriptive label for this exclusion rule, useful for debugging and identification.

None

Returns:

Type Description
PDF

Self for method chaining.

Raises:

Type Description
AttributeError

If PDF pages are not yet initialized.

Example
pdf = npdf.PDF("document.pdf")

# Exclude headers (top 50 points of each page)
pdf.add_exclusion(
    lambda page: page.region(0, 0, page.width, 50),
    label="header_exclusion"
)

# Exclude any text containing "CONFIDENTIAL"
pdf.add_exclusion(
    lambda page: page.find('text:contains("CONFIDENTIAL")').above(include_source=True)
    if page.find('text:contains("CONFIDENTIAL")') else None,
    label="confidential_exclusion"
)

# Chain multiple exclusions
pdf.add_exclusion(header_func).add_exclusion(footer_func)
Source code in natural_pdf/core/pdf.py
def add_exclusion(self, exclusion_func, label: str = None) -> "PDF":
    """Add an exclusion function to the PDF.

    Exclusion functions define regions of each page that should be ignored during
    text extraction and analysis operations. This is useful for filtering out headers,
    footers, watermarks, or other administrative content that shouldn't be included
    in the main document processing.

    Args:
        exclusion_func: A function that takes a Page object and returns a Region
            to exclude from processing, or None if no exclusion should be applied
            to that page. The function is called once per page.
        label: Optional descriptive label for this exclusion rule, useful for
            debugging and identification.

    Returns:
        Self for method chaining.

    Raises:
        AttributeError: If PDF pages are not yet initialized.

    Example:
        ```python
        pdf = npdf.PDF("document.pdf")

        # Exclude headers (top 50 points of each page)
        pdf.add_exclusion(
            lambda page: page.region(0, 0, page.width, 50),
            label="header_exclusion"
        )

        # Exclude any text containing "CONFIDENTIAL"
        pdf.add_exclusion(
            lambda page: page.find('text:contains("CONFIDENTIAL")').above(include_source=True)
            if page.find('text:contains("CONFIDENTIAL")') else None,
            label="confidential_exclusion"
        )

        # Chain multiple exclusions
        pdf.add_exclusion(header_func).add_exclusion(footer_func)
        ```
    """
    if not hasattr(self, "_pages"):
        raise AttributeError("PDF pages not yet initialized.")

    # ------------------------------------------------------------------
    # Support selector strings and ElementCollection objects directly.
    # Store exclusion and apply only to already-created pages.
    # ------------------------------------------------------------------
    from natural_pdf.elements.element_collection import ElementCollection  # local import

    if isinstance(exclusion_func, str) or isinstance(exclusion_func, ElementCollection):
        # Store for bookkeeping and lazy application
        self._exclusions.append((exclusion_func, label))

        # Don't modify already-cached pages - they will get PDF-level exclusions
        # dynamically through _get_exclusion_regions()
        return self

    # Fallback to original callable / Region behaviour ------------------
    exclusion_data = (exclusion_func, label)
    self._exclusions.append(exclusion_data)

    # Don't modify already-cached pages - they will get PDF-level exclusions
    # dynamically through _get_exclusion_regions()

    return self
natural_pdf.PDF.add_region(region_func, name=None)

Add a region function to the PDF.

Parameters:

Name Type Description Default
region_func Callable[[Page], Optional[Region]]

A function that takes a Page and returns a Region, or None

required
name str

Optional name for the region

None

Returns:

Type Description
PDF

Self for method chaining

Source code in natural_pdf/core/pdf.py
def add_region(
    self, region_func: Callable[["Page"], Optional["Region"]], name: str = None
) -> "PDF":
    """
    Add a region function to the PDF.

    Args:
        region_func: A function that takes a Page and returns a Region, or None
        name: Optional name for the region

    Returns:
        Self for method chaining
    """
    if not hasattr(self, "_pages"):
        raise AttributeError("PDF pages not yet initialized.")

    region_data = (region_func, name)
    self._regions.append(region_data)

    # Apply only to already-created (cached) pages to avoid forcing page creation
    for i in range(len(self._pages)):
        if self._pages._cache[i] is not None:  # Only apply to existing pages
            page = self._pages._cache[i]
            try:
                region_instance = region_func(page)
                if region_instance and isinstance(region_instance, Region):
                    page.add_region(region_instance, name=name, source="named")
                elif region_instance is not None:
                    logger.warning(
                        f"Region function did not return a valid Region for page {page.number}"
                    )
            except Exception as e:
                logger.error(f"Error adding region for page {page.number}: {e}")

    return self
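
A minimal usage sketch (the region coordinates are assumptions about a particular layout):

pdf = npdf.PDF("document.pdf")

# Name the left half of every page as a reusable region
pdf.add_region(
    lambda page: page.region(0, 0, page.width / 2, page.height),
    name="left_half"
)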
natural_pdf.PDF.analyze_layout(*args, **kwargs)

Analyzes the layout of all pages in the PDF.

This is a convenience method that calls analyze_layout on the PDF's page collection.

Parameters:

Name Type Description Default
*args

Positional arguments passed to pages.analyze_layout().

()
**kwargs

Keyword arguments passed to pages.analyze_layout().

{}

Returns:

Type Description
ElementCollection[Region]

An ElementCollection of all detected Region objects.

Source code in natural_pdf/core/pdf.py
def analyze_layout(self, *args, **kwargs) -> "ElementCollection[Region]":
    """
    Analyzes the layout of all pages in the PDF.

    This is a convenience method that calls analyze_layout on the PDF's
    page collection.

    Args:
        *args: Positional arguments passed to pages.analyze_layout().
        **kwargs: Keyword arguments passed to pages.analyze_layout().

    Returns:
        An ElementCollection of all detected Region objects.
    """
    return self.pages.analyze_layout(*args, **kwargs)
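
A minimal usage sketch: run layout analysis across the whole document and inspect what was detected (any engine-specific kwargs are passed through to pages.analyze_layout()):

pdf = npdf.PDF("document.pdf")
regions = pdf.analyze_layout()
for region in regions:
    print(region)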
natural_pdf.PDF.apply_ocr(engine=None, languages=None, min_confidence=None, device=None, resolution=None, apply_exclusions=True, detect_only=False, replace=True, options=None, pages=None)

Apply OCR to specified pages of the PDF using batch processing.

Performs optical character recognition on the specified pages, converting image-based text into searchable and extractable text elements. This method supports multiple OCR engines and provides batch processing for efficiency.

Parameters:

Name Type Description Default
engine Optional[str]

Name of the OCR engine to use. Supported engines include 'easyocr' (default), 'surya', 'paddle', and 'doctr'. If None, uses the global default from natural_pdf.options.ocr.engine.

None
languages Optional[List[str]]

List of language codes for OCR recognition (e.g., ['en', 'es']). If None, uses the global default from natural_pdf.options.ocr.languages.

None
min_confidence Optional[float]

Minimum confidence threshold (0.0-1.0) for accepting OCR results. Text with lower confidence will be filtered out. If None, uses the global default.

None
device Optional[str]

Device to run OCR on ('cpu', 'cuda', 'mps'). Engine-specific availability varies. If None, uses engine defaults.

None
resolution Optional[int]

DPI resolution for rendering pages to images before OCR. Higher values improve accuracy but increase processing time and memory. Typical values: 150 (fast), 300 (balanced), 600 (high quality).

None
apply_exclusions bool

If True, mask excluded regions before OCR to prevent processing of headers, footers, or other unwanted content.

True
detect_only bool

If True, only detect text bounding boxes without performing character recognition. Useful for layout analysis workflows.

False
replace bool

If True, replace any existing OCR elements on the pages. If False, append new OCR results to existing elements.

True
options Optional[Any]

Engine-specific options object (e.g., EasyOCROptions, SuryaOptions). Allows fine-tuning of engine behavior beyond common parameters.

None
pages Optional[Union[Iterable[int], range, slice]]

Page indices to process. Can be: - None: Process all pages - slice: Process a range of pages (e.g., slice(0, 10)) - Iterable[int]: Process specific page indices (e.g., [0, 2, 5])

None

Returns:

Type Description
PDF

Self for method chaining.

Raises:

Type Description
ValueError

If invalid page index is provided.

TypeError

If pages parameter has invalid type.

RuntimeError

If OCR engine is not available or fails.

Example
pdf = npdf.PDF("scanned_document.pdf")

# Basic OCR on all pages
pdf.apply_ocr()

# High-quality OCR with specific settings
pdf.apply_ocr(
    engine='easyocr',
    languages=['en', 'es'],
    resolution=300,
    min_confidence=0.8
)

# OCR specific pages only
pdf.apply_ocr(pages=[0, 1, 2])  # First 3 pages
pdf.apply_ocr(pages=slice(5, 10))  # Pages 5-9

# Detection-only workflow for layout analysis
pdf.apply_ocr(detect_only=True, resolution=150)
Note

OCR processing can be time and memory intensive, especially at high resolutions. Consider using exclusions to mask unwanted regions and processing pages in batches for large documents.

Source code in natural_pdf/core/pdf.py (lines 867-1096)
def apply_ocr(
    self,
    engine: Optional[str] = None,
    languages: Optional[List[str]] = None,
    min_confidence: Optional[float] = None,
    device: Optional[str] = None,
    resolution: Optional[int] = None,
    apply_exclusions: bool = True,
    detect_only: bool = False,
    replace: bool = True,
    options: Optional[Any] = None,
    pages: Optional[Union[Iterable[int], range, slice]] = None,
) -> "PDF":
    """Apply OCR to specified pages of the PDF using batch processing.

    Performs optical character recognition on the specified pages, converting
    image-based text into searchable and extractable text elements. This method
    supports multiple OCR engines and provides batch processing for efficiency.

    Args:
        engine: Name of the OCR engine to use. Supported engines include
            'easyocr' (default), 'surya', 'paddle', and 'doctr'. If None,
            uses the global default from natural_pdf.options.ocr.engine.
        languages: List of language codes for OCR recognition (e.g., ['en', 'es']).
            If None, uses the global default from natural_pdf.options.ocr.languages.
        min_confidence: Minimum confidence threshold (0.0-1.0) for accepting
            OCR results. Text with lower confidence will be filtered out.
            If None, uses the global default.
        device: Device to run OCR on ('cpu', 'cuda', 'mps'). Engine-specific
            availability varies. If None, uses engine defaults.
        resolution: DPI resolution for rendering pages to images before OCR.
            Higher values improve accuracy but increase processing time and memory.
            Typical values: 150 (fast), 300 (balanced), 600 (high quality).
        apply_exclusions: If True, mask excluded regions before OCR to prevent
            processing of headers, footers, or other unwanted content.
        detect_only: If True, only detect text bounding boxes without performing
            character recognition. Useful for layout analysis workflows.
        replace: If True, replace any existing OCR elements on the pages.
            If False, append new OCR results to existing elements.
        options: Engine-specific options object (e.g., EasyOCROptions, SuryaOptions).
            Allows fine-tuning of engine behavior beyond common parameters.
        pages: Page indices to process. Can be:
            - None: Process all pages
            - slice: Process a range of pages (e.g., slice(0, 10))
            - Iterable[int]: Process specific page indices (e.g., [0, 2, 5])

    Returns:
        Self for method chaining.

    Raises:
        ValueError: If invalid page index is provided.
        TypeError: If pages parameter has invalid type.
        RuntimeError: If OCR engine is not available or fails.

    Example:
        ```python
        pdf = npdf.PDF("scanned_document.pdf")

        # Basic OCR on all pages
        pdf.apply_ocr()

        # High-quality OCR with specific settings
        pdf.apply_ocr(
            engine='easyocr',
            languages=['en', 'es'],
            resolution=300,
            min_confidence=0.8
        )

        # OCR specific pages only
        pdf.apply_ocr(pages=[0, 1, 2])  # First 3 pages
        pdf.apply_ocr(pages=slice(5, 10))  # Pages 5-9

        # Detection-only workflow for layout analysis
        pdf.apply_ocr(detect_only=True, resolution=150)
        ```

    Note:
        OCR processing can be time and memory intensive, especially at high
        resolutions. Consider using exclusions to mask unwanted regions and
        processing pages in batches for large documents.
    """
    if not self._ocr_manager:
        logger.error("OCRManager not available. Cannot apply OCR.")
        return self

    # Apply global options as defaults, but allow explicit parameters to override
    import natural_pdf

    # Use global OCR options if parameters are not explicitly set
    if engine is None:
        engine = natural_pdf.options.ocr.engine
    if languages is None:
        languages = natural_pdf.options.ocr.languages
    if min_confidence is None:
        min_confidence = natural_pdf.options.ocr.min_confidence
    if device is None:
        pass  # No default device in options.ocr anymore

    thread_id = threading.current_thread().name
    logger.debug(f"[{thread_id}] PDF.apply_ocr starting for {self.path}")

    target_pages = []
    if pages is None:
        target_pages = self._pages
    elif isinstance(pages, slice):
        target_pages = self._pages[pages]
    elif hasattr(pages, "__iter__"):
        try:
            target_pages = [self._pages[i] for i in pages]
        except IndexError:
            raise ValueError("Invalid page index provided in 'pages' iterable.")
        except TypeError:
            raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")
    else:
        raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")

    if not target_pages:
        logger.warning("No pages selected for OCR processing.")
        return self

    page_numbers = [p.number for p in target_pages]
    logger.info(f"Applying batch OCR to pages: {page_numbers}...")

    final_resolution = resolution or getattr(self, "_config", {}).get("resolution", 150)
    logger.debug(f"Using OCR image resolution: {final_resolution} DPI")

    images_pil = []
    page_image_map = []
    logger.info(f"[{thread_id}] Rendering {len(target_pages)} pages...")
    failed_page_num = "unknown"
    render_start_time = time.monotonic()

    try:
        for i, page in enumerate(tqdm(target_pages, desc="Rendering pages", leave=False)):
            failed_page_num = page.number
            logger.debug(f"  Rendering page {page.number} (index {page.index})...")
            # NOTE: these kwargs are currently unused; render() below is called
            # with resolution only, so exclusion masking is not applied at this step
            to_image_kwargs = {
                "resolution": final_resolution,
                "include_highlights": False,
                "exclusions": "mask" if apply_exclusions else None,
            }
            # Use render() for clean image without highlights
            img = page.render(resolution=final_resolution)
            if img is None:
                logger.error(f"  Failed to render page {page.number} to image.")
                continue
            images_pil.append(img)
            page_image_map.append((page, img))
    except Exception as e:
        logger.error(f"Failed to render pages for batch OCR: {e}")
        logger.error(f"Failed to render pages for batch OCR: {e}")
        raise RuntimeError(f"Failed to render page {failed_page_num} for OCR.") from e

    render_end_time = time.monotonic()
    logger.debug(
        f"[{thread_id}] Finished rendering {len(images_pil)} images (Duration: {render_end_time - render_start_time:.2f}s)"
    )

    if not images_pil or not page_image_map:
        logger.error("No images were successfully rendered for batch OCR.")
        return self

    manager_args = {
        "images": images_pil,
        "engine": engine,
        "languages": languages,
        "min_confidence": min_confidence,
        "min_confidence": min_confidence,
        "device": device,
        "options": options,
        "detect_only": detect_only,
    }
    manager_args = {k: v for k, v in manager_args.items() if v is not None}

    ocr_call_args = {k: v for k, v in manager_args.items() if k != "images"}
    logger.info(f"[{thread_id}] Calling OCR Manager with args: {ocr_call_args}...")
    logger.info(f"[{thread_id}] Calling OCR Manager with args: {ocr_call_args}...")
    ocr_start_time = time.monotonic()

    batch_results = self._ocr_manager.apply_ocr(**manager_args)

    if not isinstance(batch_results, list) or len(batch_results) != len(images_pil):
        logger.error(f"OCR Manager returned unexpected result format or length.")
        return self

    logger.info("OCR Manager batch processing complete.")

    ocr_end_time = time.monotonic()
    logger.debug(
        f"[{thread_id}] OCR processing finished (Duration: {ocr_end_time - ocr_start_time:.2f}s)"
    )

    logger.info("Adding OCR results to respective pages...")
    total_elements_added = 0

    for i, (page, img) in enumerate(page_image_map):
        results_for_page = batch_results[i]
        if not isinstance(results_for_page, list):
            logger.warning(
                f"Skipping results for page {page.number}: Expected list, got {type(results_for_page)}"
            )
            continue

        logger.debug(f"  Processing {len(results_for_page)} results for page {page.number}...")
        try:
            if manager_args.get("replace", True) and hasattr(page, "_element_mgr"):
                page._element_mgr.remove_ocr_elements()

            img_scale_x = page.width / img.width if img.width > 0 else 1
            img_scale_y = page.height / img.height if img.height > 0 else 1
            elements = page._element_mgr.create_text_elements_from_ocr(
                results_for_page, img_scale_x, img_scale_y
            )

            if elements:
                total_elements_added += len(elements)
                logger.debug(f"  Added {len(elements)} OCR TextElements to page {page.number}.")
            else:
                logger.debug(f"  No valid TextElements created for page {page.number}.")
        except Exception as e:
            logger.error(f"  Error adding OCR elements to page {page.number}: {e}")

    logger.info(f"Finished adding OCR results. Total elements added: {total_elements_added}")
    return self
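The engine, languages, and min_confidence defaults can also be set once on natural_pdf.options.ocr (the same attributes read in the source above) instead of being passed on every call. A minimal sketch with an illustrative file name:

import natural_pdf as npdf

# Library-wide OCR defaults, read by apply_ocr when the arguments are None
npdf.options.ocr.engine = "easyocr"
npdf.options.ocr.languages = ["en"]
npdf.options.ocr.min_confidence = 0.5

pdf = npdf.PDF("scanned.pdf")
pdf.apply_ocr()  # picks up the defaults set above
print(pdf.extract_text()[:200])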
natural_pdf.PDF.ask(question, mode='extractive', pages=None, min_confidence=0.1, model=None, **kwargs)

Ask a single question about the document content.

Parameters:

Name Type Description Default
question str

Question string to ask about the document

required
mode str

"extractive" to extract answer from document, "generative" to generate

'extractive'
pages Union[int, List[int], range]

Specific pages to query (default: all pages)

None
min_confidence float

Minimum confidence threshold for answers

0.1
model str

Optional model name for question answering

None
**kwargs

Additional parameters passed to the QA engine

{}

Returns:

Type Description
Dict[str, Any]

Dict containing: answer, confidence, found, page_num, source_elements, etc.

Source code in natural_pdf/core/pdf.py (lines 1654-1691)
def ask(
    self,
    question: str,
    mode: str = "extractive",
    pages: Union[int, List[int], range] = None,
    min_confidence: float = 0.1,
    model: str = None,
    **kwargs,
) -> Dict[str, Any]:
    """
    Ask a single question about the document content.

    Args:
        question: Question string to ask about the document
        mode: "extractive" to extract answer from document, "generative" to generate
        pages: Specific pages to query (default: all pages)
        min_confidence: Minimum confidence threshold for answers
        model: Optional model name for question answering
        **kwargs: Additional parameters passed to the QA engine

    Returns:
        Dict containing: answer, confidence, found, page_num, source_elements, etc.
    """
    # Delegate to ask_batch and return the first result
    results = self.ask_batch(
        [question], mode=mode, pages=pages, min_confidence=min_confidence, model=model, **kwargs
    )
    return (
        results[0]
        if results
        else {
            "answer": None,
            "confidence": 0.0,
            "found": False,
            "page_num": None,
            "source_elements": [],
        }
    )
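A minimal sketch (the question and file name are illustrative; the result keys are those listed above):

import natural_pdf as npdf

pdf = npdf.PDF("contract.pdf")
result = pdf.ask("What is the effective date?")
if result["found"]:
    print(result["answer"], f"(confidence {result['confidence']:.2f}, page {result['page_num']})")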
natural_pdf.PDF.ask_batch(questions, mode='extractive', pages=None, min_confidence=0.1, model=None, **kwargs)

Ask multiple questions about the document content using batch processing.

This method processes multiple questions efficiently in a single batch, avoiding the multiprocessing resource accumulation that can occur with sequential individual question calls.

Parameters:

Name Type Description Default
questions List[str]

List of question strings to ask about the document

required
mode str

"extractive" to extract answer from document, "generative" to generate

'extractive'
pages Union[int, List[int], range]

Specific pages to query (default: all pages)

None
min_confidence float

Minimum confidence threshold for answers

0.1
model str

Optional model name for question answering

None
**kwargs

Additional parameters passed to the QA engine

{}

Returns:

Type Description
List[Dict[str, Any]]

List of Dicts, each containing: answer, confidence, found, page_num, source_elements, etc.

Source code in natural_pdf/core/pdf.py (lines 1693-1866)
def ask_batch(
    self,
    questions: List[str],
    mode: str = "extractive",
    pages: Union[int, List[int], range] = None,
    min_confidence: float = 0.1,
    model: str = None,
    **kwargs,
) -> List[Dict[str, Any]]:
    """
    Ask multiple questions about the document content using batch processing.

    This method processes multiple questions efficiently in a single batch,
    avoiding the multiprocessing resource accumulation that can occur with
    sequential individual question calls.

    Args:
        questions: List of question strings to ask about the document
        mode: "extractive" to extract answer from document, "generative" to generate
        pages: Specific pages to query (default: all pages)
        min_confidence: Minimum confidence threshold for answers
        model: Optional model name for question answering
        **kwargs: Additional parameters passed to the QA engine

    Returns:
        List of Dicts, each containing: answer, confidence, found, page_num, source_elements, etc.
    """
    from natural_pdf.qa import get_qa_engine

    if not questions:
        return []

    if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
        raise TypeError("'questions' must be a list of strings")

    qa_engine = get_qa_engine() if model is None else get_qa_engine(model_name=model)

    # Resolve target pages
    if pages is None:
        target_pages = self.pages
    elif isinstance(pages, int):
        if 0 <= pages < len(self.pages):
            target_pages = [self.pages[pages]]
        else:
            raise IndexError(f"Page index {pages} out of range (0-{len(self.pages)-1})")
    elif isinstance(pages, (list, range)):
        target_pages = []
        for page_idx in pages:
            if 0 <= page_idx < len(self.pages):
                target_pages.append(self.pages[page_idx])
            else:
                logger.warning(f"Page index {page_idx} out of range, skipping")
    else:
        raise ValueError(f"Invalid pages parameter: {pages}")

    if not target_pages:
        logger.warning("No valid pages found for QA processing.")
        return [
            {
                "answer": None,
                "confidence": 0.0,
                "found": False,
                "page_num": None,
                "source_elements": [],
            }
            for _ in questions
        ]

    logger.info(
        f"Processing {len(questions)} question(s) across {len(target_pages)} page(s) using batch QA..."
    )

    # Collect all page images and metadata for batch processing
    page_images = []
    page_word_boxes = []
    page_metadata = []

    for page in target_pages:
        # Get page image
        try:
            # Use render() for clean image without highlights
            page_image = page.render(resolution=150)
            if page_image is None:
                logger.warning(f"Failed to render image for page {page.number}, skipping")
                continue

            # Get text elements for word boxes
            elements = page.find_all("text")
            if not elements:
                logger.warning(f"No text elements found on page {page.number}")
                word_boxes = []
            else:
                word_boxes = qa_engine._get_word_boxes_from_elements(
                    elements, offset_x=0, offset_y=0
                )

            page_images.append(page_image)
            page_word_boxes.append(word_boxes)
            page_metadata.append({"page_number": page.number, "page_object": page})

        except Exception as e:
            logger.warning(f"Error processing page {page.number}: {e}")
            continue

    if not page_images:
        logger.warning("No page images could be processed for QA.")
        return [
            {
                "answer": None,
                "confidence": 0.0,
                "found": False,
                "page_num": None,
                "source_elements": [],
            }
            for _ in questions
        ]

    # Process all questions against all pages in batch
    all_results = []

    for question_text in questions:
        question_results = []

        # Ask this question against each page (but in batch per page)
        for i, (page_image, word_boxes, page_meta) in enumerate(
            zip(page_images, page_word_boxes, page_metadata)
        ):
            try:
                # Use the DocumentQA batch interface
                page_result = qa_engine.ask(
                    image=page_image,
                    question=question_text,
                    word_boxes=word_boxes,
                    min_confidence=min_confidence,
                    **kwargs,
                )

                if page_result and page_result.found:
                    # Add page metadata to result
                    page_result_dict = {
                        "answer": page_result.answer,
                        "confidence": page_result.confidence,
                        "found": page_result.found,
                        "page_num": page_meta["page_number"],
                        "source_elements": getattr(page_result, "source_elements", []),
                        "start": getattr(page_result, "start", -1),
                        "end": getattr(page_result, "end", -1),
                    }
                    question_results.append(page_result_dict)

            except Exception as e:
                logger.warning(
                    f"Error processing question '{question_text}' on page {page_meta['page_number']}: {e}"
                )
                continue

        # Sort results by confidence and take the best one for this question
        question_results.sort(key=lambda x: x.get("confidence", 0), reverse=True)

        if question_results:
            all_results.append(question_results[0])
        else:
            # No results found for this question
            all_results.append(
                {
                    "answer": None,
                    "confidence": 0.0,
                    "found": False,
                    "page_num": None,
                    "source_elements": [],
                }
            )

    return all_results
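Batching renders each page image once and reuses it for every question, which is the efficiency win described above. A minimal sketch (questions are illustrative):

questions = ["Who is the buyer?", "Who is the seller?"]
answers = pdf.ask_batch(questions)
for question, result in zip(questions, answers):
    print(question, "->", result["answer"] if result["found"] else "(not found)")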
natural_pdf.PDF.classify_pages(labels, model=None, pages=None, analysis_key='classification', using=None, **kwargs)

Classifies specified pages of the PDF.

Parameters:

Name Type Description Default
labels List[str]

List of category names

required
model Optional[str]

Model identifier ('text', 'vision', or specific HF ID)

None
pages Optional[Union[Iterable[int], range, slice]]

Page indices, slice, or None for all pages

None
analysis_key str

Key to store results in page's analyses dict

'classification'
using Optional[str]

Processing mode ('text' or 'vision')

None
**kwargs

Additional arguments for the ClassificationManager

{}

Returns:

Type Description
PDF

Self for method chaining

Source code in natural_pdf/core/pdf.py (lines 2275-2392)
def classify_pages(
    self,
    labels: List[str],
    model: Optional[str] = None,
    pages: Optional[Union[Iterable[int], range, slice]] = None,
    analysis_key: str = "classification",
    using: Optional[str] = None,
    **kwargs,
) -> "PDF":
    """
    Classifies specified pages of the PDF.

    Args:
        labels: List of category names
        model: Model identifier ('text', 'vision', or specific HF ID)
        pages: Page indices, slice, or None for all pages
        analysis_key: Key to store results in page's analyses dict
        using: Processing mode ('text' or 'vision')
        **kwargs: Additional arguments for the ClassificationManager

    Returns:
        Self for method chaining
    """
    if not labels:
        raise ValueError("Labels list cannot be empty.")

    try:
        manager = self.get_manager("classification")
    except (ValueError, RuntimeError) as e:
        raise ClassificationError(f"Cannot get ClassificationManager: {e}") from e

    if not manager or not manager.is_available():
        from natural_pdf.classification.manager import is_classification_available

        if not is_classification_available():
            raise ImportError(
                "Classification dependencies missing. "
                'Install with: pip install "natural-pdf[ai]"'
            )
        raise ClassificationError("ClassificationManager not available.")

    target_pages = []
    if pages is None:
        target_pages = self._pages
    elif isinstance(pages, slice):
        target_pages = self._pages[pages]
    elif hasattr(pages, "__iter__"):
        try:
            target_pages = [self._pages[i] for i in pages]
        except IndexError:
            raise ValueError("Invalid page index provided.")
        except TypeError:
            raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")
    else:
        raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")

    if not target_pages:
        logger.warning("No pages selected for classification.")
        return self

    inferred_using = manager.infer_using(model if model else manager.DEFAULT_TEXT_MODEL, using)
    logger.info(
        f"Classifying {len(target_pages)} pages using model '{model or '(default)'}' (mode: {inferred_using})"
    )

    page_contents = []
    pages_to_classify = []
    logger.debug(f"Gathering content for {len(target_pages)} pages...")

    for page in target_pages:
        try:
            content = page._get_classification_content(model_type=inferred_using, **kwargs)
            page_contents.append(content)
            pages_to_classify.append(page)
        except ValueError as e:
            logger.warning(f"Skipping page {page.number}: Cannot get content - {e}")
        except Exception as e:
            logger.warning(f"Skipping page {page.number}: Error getting content - {e}")

    if not page_contents:
        logger.warning("No content could be gathered for batch classification.")
        return self

    logger.debug(f"Gathered content for {len(pages_to_classify)} pages.")

    try:
        batch_results = manager.classify_batch(
            item_contents=page_contents,
            labels=labels,
            model_id=model,
            using=inferred_using,
            **kwargs,
        )
    except Exception as e:
        logger.error(f"Batch classification failed: {e}")
        raise ClassificationError(f"Batch classification failed: {e}") from e

    if len(batch_results) != len(pages_to_classify):
        logger.error(
            f"Mismatch between number of results ({len(batch_results)}) and pages ({len(pages_to_classify)})"
        )
        return self

    logger.debug(
        f"Distributing {len(batch_results)} results to pages under key '{analysis_key}'..."
    )
    for page, result_obj in zip(pages_to_classify, batch_results):
        try:
            if not hasattr(page, "analyses") or page.analyses is None:
                page.analyses = {}
            page.analyses[analysis_key] = result_obj
        except Exception as e:
            logger.warning(
                f"Failed to store classification results for page {page.number}: {e}"
            )

    logger.info(f"Finished classifying PDF pages.")
    return self
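Results are stored per page under analysis_key (default "classification"). A minimal sketch with illustrative labels:

pdf.classify_pages(labels=["invoice", "contract", "memo"], using="text")
for page in pdf.pages:
    print(page.number, page.analyses.get("classification"))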
natural_pdf.PDF.clear_exclusions()

Clear all exclusion functions from the PDF.

Removes all previously added exclusion functions that were used to filter out unwanted content (like headers, footers, or administrative text) from text extraction and analysis operations.

Returns:

Type Description
PDF

Self for method chaining.

Raises:

Type Description
AttributeError

If PDF pages are not yet initialized.

Example
pdf = npdf.PDF("document.pdf")
pdf.add_exclusion(lambda page: page.find('text:contains("CONFIDENTIAL")').above())

# Later, remove all exclusions
pdf.clear_exclusions()
Source code in natural_pdf/core/pdf.py (lines 763-797)
def clear_exclusions(self) -> "PDF":
    """Clear all exclusion functions from the PDF.

    Removes all previously added exclusion functions that were used to filter
    out unwanted content (like headers, footers, or administrative text) from
    text extraction and analysis operations.

    Returns:
        Self for method chaining.

    Raises:
        AttributeError: If PDF pages are not yet initialized.

    Example:
        ```python
        pdf = npdf.PDF("document.pdf")
        pdf.add_exclusion(lambda page: page.find('text:contains("CONFIDENTIAL")').above())

        # Later, remove all exclusions
        pdf.clear_exclusions()
        ```
    """
    if not hasattr(self, "_pages"):
        raise AttributeError("PDF pages not yet initialized.")

    self._exclusions = []

    # Clear exclusions only from already-created (cached) pages to avoid forcing page creation
    for i in range(len(self._pages)):
        if self._pages._cache[i] is not None:  # Only clear from existing pages
            try:
                self._pages._cache[i].clear_exclusions()
            except Exception as e:
                logger.warning(f"Failed to clear exclusions from existing page {i}: {e}")
    return self
natural_pdf.PDF.close()

Close the underlying PDF file and clean up any temporary files.

Source code in natural_pdf/core/pdf.py (lines 2095-2120)
def close(self):
    """Close the underlying PDF file and clean up any temporary files."""
    if hasattr(self, "_pdf") and self._pdf is not None:
        try:
            self._pdf.close()
            logger.debug(f"Closed pdfplumber PDF object for {self.source_path}")
        except Exception as e:
            logger.warning(f"Error closing pdfplumber object: {e}")
        finally:
            self._pdf = None

    if hasattr(self, "_temp_file") and self._temp_file is not None:
        temp_file_path = None
        try:
            if hasattr(self._temp_file, "name") and self._temp_file.name:
                temp_file_path = self._temp_file.name
                # Only unlink if it exists and _is_stream is False (meaning WE created it)
                if not self._is_stream and os.path.exists(temp_file_path):
                    os.unlink(temp_file_path)
                    logger.debug(f"Removed temporary PDF file: {temp_file_path}")
        except Exception as e:
            logger.warning(f"Failed to clean up temporary file '{temp_file_path}': {e}")

    # Cancels the weakref finalizer so we don't double-clean
    if hasattr(self, "_finalizer") and self._finalizer.alive:
        self._finalizer()
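A minimal sketch of explicit cleanup; closing releases the underlying pdfplumber handle and any temporary file, as described above:

pdf = npdf.PDF("document.pdf")
try:
    text = pdf.extract_text()
finally:
    pdf.close()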
natural_pdf.PDF.deskew(pages=None, resolution=300, angle=None, detection_resolution=72, force_overwrite=False, **deskew_kwargs)

Creates a new, in-memory PDF object containing deskewed versions of the specified pages from the original PDF.

This method renders each selected page, detects and corrects skew using the 'deskew' library, and then combines the resulting images into a new PDF using 'img2pdf'. The new PDF object is returned directly.

Important: The returned PDF is image-based. Any existing text, OCR results, annotations, or other elements from the original pages will not be carried over.

Parameters:

Name Type Description Default
pages Optional[Union[Iterable[int], range, slice]]

Page indices/slice to include (0-based). If None, processes all pages.

None
resolution int

DPI resolution for rendering the output deskewed pages.

300
angle Optional[float]

The specific angle (in degrees) to rotate by. If None, detects automatically.

None
detection_resolution int

DPI resolution used for skew detection if angles are not already cached on the page objects.

72
force_overwrite bool

If False (default), raises a ValueError if any target page already contains processed elements (text, OCR, regions) to prevent accidental data loss. Set to True to proceed anyway.

False
**deskew_kwargs

Additional keyword arguments passed to deskew.determine_skew during automatic detection (e.g., max_angle, num_peaks).

{}

Returns:

Type Description
PDF

A new PDF object representing the deskewed document.

Raises:

Type Description
ImportError

If 'deskew' or 'img2pdf' libraries are not installed.

ValueError

If force_overwrite is False and target pages contain elements.

FileNotFoundError

If the source PDF cannot be read (if file-based).

IOError

If creating the in-memory PDF fails.

RuntimeError

If rendering or deskewing individual pages fails.

Source code in natural_pdf/core/pdf.py (lines 2147-2269)
def deskew(
    self,
    pages: Optional[Union[Iterable[int], range, slice]] = None,
    resolution: int = 300,
    angle: Optional[float] = None,
    detection_resolution: int = 72,
    force_overwrite: bool = False,
    **deskew_kwargs,
) -> "PDF":
    """
    Creates a new, in-memory PDF object containing deskewed versions of the
    specified pages from the original PDF.

    This method renders each selected page, detects and corrects skew using the 'deskew'
    library, and then combines the resulting images into a new PDF using 'img2pdf'.
    The new PDF object is returned directly.

    Important: The returned PDF is image-based. Any existing text, OCR results,
    annotations, or other elements from the original pages will *not* be carried over.

    Args:
        pages: Page indices/slice to include (0-based). If None, processes all pages.
        resolution: DPI resolution for rendering the output deskewed pages.
        angle: The specific angle (in degrees) to rotate by. If None, detects automatically.
        detection_resolution: DPI resolution used for skew detection if angles are not
                              already cached on the page objects.
        force_overwrite: If False (default), raises a ValueError if any target page
                         already contains processed elements (text, OCR, regions) to
                         prevent accidental data loss. Set to True to proceed anyway.
        **deskew_kwargs: Additional keyword arguments passed to `deskew.determine_skew`
                         during automatic detection (e.g., `max_angle`, `num_peaks`).

    Returns:
        A new PDF object representing the deskewed document.

    Raises:
        ImportError: If 'deskew' or 'img2pdf' libraries are not installed.
        ValueError: If `force_overwrite` is False and target pages contain elements.
        FileNotFoundError: If the source PDF cannot be read (if file-based).
        IOError: If creating the in-memory PDF fails.
        RuntimeError: If rendering or deskewing individual pages fails.
    """
    if not DESKEW_AVAILABLE:
        raise ImportError(
            "Deskew/img2pdf libraries missing. Install with: pip install natural-pdf[deskew]"
        )

    target_pages = self._get_target_pages(pages)  # Use helper to resolve pages

    # --- Safety Check --- #
    if not force_overwrite:
        for page in target_pages:
            # Check if the element manager has been initialized and contains any elements
            if (
                hasattr(page, "_element_mgr")
                and page._element_mgr
                and page._element_mgr.has_elements()
            ):
                raise ValueError(
                    f"Page {page.number} contains existing elements (text, OCR, etc.). "
                    f"Deskewing creates an image-only PDF, discarding these elements. "
                    f"Set force_overwrite=True to proceed."
                )

    # --- Process Pages --- #
    deskewed_images_bytes = []
    logger.info(f"Deskewing {len(target_pages)} pages (output resolution={resolution} DPI)...")

    for page in tqdm(target_pages, desc="Deskewing Pages", leave=False):
        try:
            # Use page.deskew to get the corrected PIL image
            # Pass down resolutions and kwargs
            deskewed_img = page.deskew(
                resolution=resolution,
                angle=angle,  # Let page.deskew handle detection/caching
                detection_resolution=detection_resolution,
                **deskew_kwargs,
            )

            if not deskewed_img:
                logger.warning(
                    f"Page {page.number}: Failed to generate deskewed image, skipping."
                )
                continue

            # Convert image to bytes for img2pdf (use PNG for lossless quality)
            with io.BytesIO() as buf:
                deskewed_img.save(buf, format="PNG")
                deskewed_images_bytes.append(buf.getvalue())

        except Exception as e:
            logger.error(
                f"Page {page.number}: Failed during deskewing process: {e}", exc_info=True
            )
            # Option: Raise a runtime error, or continue and skip the page?
            # Raising makes the whole operation fail if one page fails.
            raise RuntimeError(f"Failed to process page {page.number} during deskewing.") from e

    # --- Create PDF --- #
    if not deskewed_images_bytes:
        raise RuntimeError("No pages were successfully processed to create the deskewed PDF.")

    logger.info(f"Combining {len(deskewed_images_bytes)} deskewed images into in-memory PDF...")
    try:
        # Use img2pdf to combine image bytes into PDF bytes
        pdf_bytes = img2pdf.convert(deskewed_images_bytes)

        # Wrap bytes in a stream
        pdf_stream = io.BytesIO(pdf_bytes)

        # Create a new PDF object from the stream using original config
        logger.info("Creating new PDF object from deskewed stream...")
        new_pdf = PDF(
            pdf_stream,
            reading_order=self._reading_order,
            font_attrs=self._font_attrs,
            keep_spaces=self._config.get("keep_spaces", True),
            text_layer=self._text_layer,
        )
        return new_pdf
    except Exception as e:
        logger.error(f"Failed to create in-memory PDF using img2pdf/PDF init: {e}")
        raise IOError("Failed to create deskewed PDF object from image stream.") from e
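Because the returned PDF is image-only, a typical follow-up is to re-run OCR on it. A minimal sketch:

deskewed = pdf.deskew(resolution=300)  # new, image-based PDF object
deskewed.apply_ocr(resolution=300)     # rebuild a text layer on the deskewed pages
print(deskewed.extract_text()[:200])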
natural_pdf.PDF.detect_checkboxes(*args, **kwargs)

Detects checkboxes on all pages in the PDF.

This is a convenience method that calls detect_checkboxes on the PDF's page collection.

Parameters:

Name Type Description Default
*args

Positional arguments passed to pages.detect_checkboxes().

()
**kwargs

Keyword arguments passed to pages.detect_checkboxes().

{}

Returns:

Type Description
ElementCollection[Region]

An ElementCollection of all detected checkbox Region objects.

Source code in natural_pdf/core/pdf.py (lines 2676-2690)
def detect_checkboxes(self, *args, **kwargs) -> "ElementCollection[Region]":
    """
    Detects checkboxes on all pages in the PDF.

    This is a convenience method that calls detect_checkboxes on the PDF's
    page collection.

    Args:
        *args: Positional arguments passed to pages.detect_checkboxes().
        **kwargs: Keyword arguments passed to pages.detect_checkboxes().

    Returns:
        An ElementCollection of all detected checkbox Region objects.
    """
    return self.pages.detect_checkboxes(*args, **kwargs)
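A minimal sketch (members are accessed via .elements, as in the find_all source below):

checkboxes = pdf.detect_checkboxes()
print(f"Found {len(checkboxes.elements)} checkbox regions")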
natural_pdf.PDF.export_ocr_correction_task(output_zip_path, **kwargs)

Exports OCR results from this PDF into a correction task package.

Parameters:

Name Type Description Default
output_zip_path str

The path to save the output zip file

required
**kwargs

Additional arguments passed to create_correction_task_package

{}
Source code in natural_pdf/core/pdf.py (lines 1981-2006)
def export_ocr_correction_task(self, output_zip_path: str, **kwargs):
    """
    Exports OCR results from this PDF into a correction task package.

    Args:
        output_zip_path: The path to save the output zip file
        **kwargs: Additional arguments passed to create_correction_task_package
    """
    try:
        from natural_pdf.utils.packaging import create_correction_task_package

        create_correction_task_package(source=self, output_zip_path=output_zip_path, **kwargs)
    except ImportError:
        logger.error(
            "Failed to import 'create_correction_task_package'. Packaging utility might be missing."
        )
    except Exception as e:
        logger.error(f"Failed to export correction task: {e}")
        raise
        logger.error(f"Failed to export correction task: {e}")
        raise
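A minimal sketch; the output path is illustrative:

pdf = npdf.PDF("scanned.pdf")
pdf.apply_ocr()  # produce OCR results to review
pdf.export_ocr_correction_task("correction_task.zip")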
natural_pdf.PDF.extract_tables(selector=None, merge_across_pages=False, **kwargs)

Extract tables from the document or matching elements.

Parameters:

Name Type Description Default
selector Optional[str]

Optional selector to filter tables

None
merge_across_pages bool

Whether to merge tables that span across pages

False
**kwargs

Additional extraction parameters

{}

Returns:

Type Description
List[Any]

List of extracted tables

Source code in natural_pdf/core/pdf.py (lines 1364-1396)
def extract_tables(
    self, selector: Optional[str] = None, merge_across_pages: bool = False, **kwargs
) -> List[Any]:
    """
    Extract tables from the document or matching elements.

    Args:
        selector: Optional selector to filter tables
        merge_across_pages: Whether to merge tables that span across pages
        **kwargs: Additional extraction parameters

    Returns:
        List of extracted tables
    """
    if not hasattr(self, "_pages"):
        raise AttributeError("PDF pages not yet initialized.")

    logger.warning("PDF.extract_tables is not fully implemented yet.")
    all_tables = []

    for page in self.pages:
        if hasattr(page, "extract_tables"):
            all_tables.extend(page.extract_tables(**kwargs))
        else:
            logger.debug(f"Page {page.number} does not have extract_tables method.")

    if selector:
        logger.warning("Filtering extracted tables by selector is not implemented.")

    if merge_across_pages:
        logger.warning("Merging tables across pages is not implemented.")

    return all_tables
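A minimal sketch; note that, per the warnings in the source above, selector filtering and cross-page merging are not yet implemented:

tables = pdf.extract_tables()
for table in tables:
    print(table)  # one entry per table found on each page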
natural_pdf.PDF.extract_text(selector=None, preserve_whitespace=True, use_exclusions=True, debug_exclusions=False, **kwargs)

Extract text from the entire document or matching elements.

Parameters:

Name Type Description Default
selector Optional[str]

Optional selector to filter elements

None
preserve_whitespace

Whether to keep blank characters

True
use_exclusions

Whether to apply exclusion regions

True
debug_exclusions

Whether to output detailed debugging for exclusions

False
**kwargs

Additional extraction parameters

{}

Returns:

Type Description
str

Extracted text as string

Source code in natural_pdf/core/pdf.py (lines 1313-1362)
def extract_text(
    self,
    selector: Optional[str] = None,
    preserve_whitespace=True,
    use_exclusions=True,
    debug_exclusions=False,
    **kwargs,
) -> str:
    """
    Extract text from the entire document or matching elements.

    Args:
        selector: Optional selector to filter elements
        preserve_whitespace: Whether to keep blank characters
        use_exclusions: Whether to apply exclusion regions
        debug_exclusions: Whether to output detailed debugging for exclusions
        **kwargs: Additional extraction parameters

    Returns:
        Extracted text as string
    """
    if not hasattr(self, "_pages"):
        raise AttributeError("PDF pages not yet initialized.")

    if selector:
        elements = self.find_all(selector, apply_exclusions=use_exclusions, **kwargs)
        return elements.extract_text(preserve_whitespace=preserve_whitespace, **kwargs)

    if debug_exclusions:
        print(f"PDF: Extracting text with exclusions from {len(self.pages)} pages")
        print(f"PDF: Found {len(self._exclusions)} document-level exclusions")

    texts = []
    for page in self.pages:
        texts.append(
            page.extract_text(
                preserve_whitespace=preserve_whitespace,
                use_exclusions=use_exclusions,
                debug_exclusions=debug_exclusions,
                **kwargs,
            )
        )

    if debug_exclusions:
        print(f"PDF: Combined {len(texts)} pages of text")

    return "\n".join(texts)
natural_pdf.PDF.find(selector=None, *, text=None, apply_exclusions=True, regex=False, case=True, **kwargs)
find(*, text: str, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> Optional[Any]
find(selector: str, *, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> Optional[Any]

Find the first element matching the selector OR text content across all pages.

Provide EITHER selector OR text, but not both.

Parameters:

Name Type Description Default
selector Optional[str]

CSS-like selector string.

None
text Optional[str]

Text content to search for (equivalent to 'text:contains(...)').

None
apply_exclusions bool

Whether to exclude elements in exclusion regions (default: True).

True
regex bool

Whether to use regex for text search (selector or text) (default: False).

False
case bool

Whether to do case-sensitive text search (selector or text) (default: True).

True
**kwargs

Additional filter parameters.

{}

Returns:

Type Description
Optional[Any]

Element object or None if not found.

Source code in natural_pdf/core/pdf.py (lines 1156-1218)
def find(
    self,
    selector: Optional[str] = None,
    *,
    text: Optional[str] = None,
    apply_exclusions: bool = True,
    regex: bool = False,
    case: bool = True,
    **kwargs,
) -> Optional[Any]:
    """
    Find the first element matching the selector OR text content across all pages.

    Provide EITHER `selector` OR `text`, but not both.

    Args:
        selector: CSS-like selector string.
        text: Text content to search for (equivalent to 'text:contains(...)').
        apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
        regex: Whether to use regex for text search (`selector` or `text`) (default: False).
        case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
        **kwargs: Additional filter parameters.

    Returns:
        Element object or None if not found.
    """
    if not hasattr(self, "_pages"):
        raise AttributeError("PDF pages not yet initialized.")

    if selector is not None and text is not None:
        raise ValueError("Provide either 'selector' or 'text', not both.")
    if selector is None and text is None:
        raise ValueError("Provide either 'selector' or 'text'.")

    # Construct selector if 'text' is provided
    effective_selector = ""
    if text is not None:
        escaped_text = text.replace('"', '\\"').replace("'", "\\'")
        effective_selector = f'text:contains("{escaped_text}")'
        logger.debug(
            f"Using text shortcut: find(text='{text}') -> find('{effective_selector}')"
        )
    elif selector is not None:
        effective_selector = selector
    else:
        raise ValueError("Internal error: No selector or text provided.")

    selector_obj = parse_selector(effective_selector)

    # Search page by page
    for page in self.pages:
        # Note: _apply_selector is on Page, so we call find directly here
        # We pass the constructed/validated effective_selector
        element = page.find(
            selector=effective_selector,  # Use the processed selector
            apply_exclusions=apply_exclusions,
            regex=regex,  # Pass down flags
            case=case,
            **kwargs,
        )
        if element:
            return element
    return None  # Not found on any page
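A minimal sketch of the two calling styles (the search string is illustrative):

element = pdf.find(text="Total")  # shortcut for 'text:contains("Total")'
same = pdf.find('text:contains("Total")')
if element is not None:
    print(element)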
natural_pdf.PDF.find_all(selector=None, *, text=None, apply_exclusions=True, regex=False, case=True, **kwargs)
find_all(*, text: str, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> ElementCollection
find_all(selector: str, *, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> ElementCollection

Find all elements matching the selector OR text content across all pages.

Provide EITHER selector OR text, but not both.

Parameters:

Name Type Description Default
selector Optional[str]

CSS-like selector string.

None
text Optional[str]

Text content to search for (equivalent to 'text:contains(...)').

None
apply_exclusions bool

Whether to exclude elements in exclusion regions (default: True).

True
regex bool

Whether to use regex for text search (selector or text) (default: False).

False
case bool

Whether to do case-sensitive text search (selector or text) (default: True).

True
**kwargs

Additional filter parameters.

{}

Returns:

Type Description
ElementCollection

ElementCollection with matching elements.

Source code in natural_pdf/core/pdf.py (lines 1242-1311)
def find_all(
    self,
    selector: Optional[str] = None,
    *,
    text: Optional[str] = None,
    apply_exclusions: bool = True,
    regex: bool = False,
    case: bool = True,
    **kwargs,
) -> "ElementCollection":
    """
    Find all elements matching the selector OR text content across all pages.

    Provide EITHER `selector` OR `text`, but not both.

    Args:
        selector: CSS-like selector string.
        text: Text content to search for (equivalent to 'text:contains(...)').
        apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
        regex: Whether to use regex for text search (`selector` or `text`) (default: False).
        case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
        **kwargs: Additional filter parameters.

    Returns:
        ElementCollection with matching elements.
    """
    if not hasattr(self, "_pages"):
        raise AttributeError("PDF pages not yet initialized.")

    if selector is not None and text is not None:
        raise ValueError("Provide either 'selector' or 'text', not both.")
    if selector is None and text is None:
        raise ValueError("Provide either 'selector' or 'text'.")

    # Construct selector if 'text' is provided
    effective_selector = ""
    if text is not None:
        escaped_text = text.replace('"', '\\"').replace("'", "\\'")
        effective_selector = f'text:contains("{escaped_text}")'
        logger.debug(
            f"Using text shortcut: find_all(text='{text}') -> find_all('{effective_selector}')"
        )
    elif selector is not None:
        effective_selector = selector
    else:
        raise ValueError("Internal error: No selector or text provided.")

    # Instead of parsing here, let each page parse and apply
    # This avoids parsing the same selector multiple times if not needed
    # selector_obj = parse_selector(effective_selector)

    # kwargs["regex"] = regex # Removed: Already passed explicitly
    # kwargs["case"] = case   # Removed: Already passed explicitly

    all_elements = []
    for page in self.pages:
        # Call page.find_all with the effective selector and flags
        page_elements = page.find_all(
            selector=effective_selector,
            apply_exclusions=apply_exclusions,
            regex=regex,
            case=case,
            **kwargs,
        )
        if page_elements:
            all_elements.extend(page_elements.elements)

    from natural_pdf.elements.element_collection import ElementCollection

    return ElementCollection(all_elements)
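A minimal sketch (the search string is illustrative; members are exposed via .elements, as in the source above):

matches = pdf.find_all(text="Invoice", case=False)
print(len(matches.elements), "matches across the document")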
natural_pdf.PDF.from_images(images, resolution=300, apply_ocr=True, ocr_engine=None, **pdf_options) classmethod

Create a PDF from image(s).

Parameters:

Name Type Description Default
images Union[Image, List[Image], str, List[str], Path, List[Path]]

Single image, list of images, or path(s)/URL(s) to image files

required
resolution int

DPI for the PDF (default: 300, good for OCR and viewing)

300
apply_ocr bool

Apply OCR to make searchable (default: True)

True
ocr_engine Optional[str]

OCR engine to use (default: auto-detect)

None
**pdf_options

Options passed to PDF constructor

{}

Returns:

Type Description
PDF

PDF object containing the images as pages

Example
# Simple scan to searchable PDF
pdf = PDF.from_images("scan.jpg")

# From URL
pdf = PDF.from_images("https://example.com/image.png")

# Multiple pages (mix of local and URLs)
pdf = PDF.from_images(["page1.png", "https://example.com/page2.jpg"])

# Without OCR
pdf = PDF.from_images(images, apply_ocr=False)

# With specific engine
pdf = PDF.from_images(images, ocr_engine='surya')
Source code in natural_pdf/core/pdf.py (lines 350-462)
@classmethod
def from_images(
    cls,
    images: Union["Image.Image", List["Image.Image"], str, List[str], Path, List[Path]],
    resolution: int = 300,
    apply_ocr: bool = True,
    ocr_engine: Optional[str] = None,
    **pdf_options,
) -> "PDF":
    """Create a PDF from image(s).

    Args:
        images: Single image, list of images, or path(s)/URL(s) to image files
        resolution: DPI for the PDF (default: 300, good for OCR and viewing)
        apply_ocr: Apply OCR to make searchable (default: True)
        ocr_engine: OCR engine to use (default: auto-detect)
        **pdf_options: Options passed to PDF constructor

    Returns:
        PDF object containing the images as pages

    Example:
        ```python
        # Simple scan to searchable PDF
        pdf = PDF.from_images("scan.jpg")

        # From URL
        pdf = PDF.from_images("https://example.com/image.png")

        # Multiple pages (mix of local and URLs)
        pdf = PDF.from_images(["page1.png", "https://example.com/page2.jpg"])

        # Without OCR
        pdf = PDF.from_images(images, apply_ocr=False)

        # With specific engine
        pdf = PDF.from_images(images, ocr_engine='surya')
        ```
    """
    import urllib.request

    from PIL import ImageOps

    def _open_image(source):
        """Open an image from file path, URL, or return PIL Image as-is."""
        if isinstance(source, Image.Image):
            return source

        source_str = str(source)
        if source_str.startswith(("http://", "https://")):
            # Download from URL
            with urllib.request.urlopen(source_str) as response:
                img_data = response.read()
            return Image.open(io.BytesIO(img_data))
        else:
            # Local file path
            return Image.open(source)

    # Normalize inputs to list of PIL Images
    if isinstance(images, (str, Path)):
        images = [_open_image(images)]
    elif isinstance(images, Image.Image):
        images = [images]
    elif isinstance(images, list):
        processed = []
        for img in images:
            processed.append(_open_image(img))
        images = processed

    # Process images
    processed_images = []
    for img in images:
        # Fix EXIF rotation
        img = ImageOps.exif_transpose(img) or img

        # Convert RGBA to RGB (PDF doesn't handle transparency well)
        if img.mode == "RGBA":
            bg = Image.new("RGB", img.size, "white")
            bg.paste(img, mask=img.split()[3])
            img = bg
        elif img.mode not in ["RGB", "L", "1", "CMYK"]:
            img = img.convert("RGB")

        processed_images.append(img)

    # Create PDF at specified resolution
    # Use BytesIO to keep in memory
    pdf_buffer = io.BytesIO()
    processed_images[0].save(
        pdf_buffer,
        "PDF",
        save_all=True,
        append_images=processed_images[1:] if len(processed_images) > 1 else [],
        resolution=resolution,
    )
    pdf_buffer.seek(0)

    # Create PDF object
    pdf = cls(pdf_buffer, **pdf_options)

    # Store metadata about source
    pdf._from_images = True
    pdf._source_metadata = {
        "type": "images",
        "count": len(processed_images),
        "resolution": resolution,
    }

    # Apply OCR if requested
    if apply_ocr:
        pdf.apply_ocr(engine=ocr_engine, resolution=resolution)

    return pdf
natural_pdf.PDF.get_id()

Get unique identifier for this PDF.

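As the implementation below shows, the identifier is simply the PDF's path. A minimal usage sketch (the filename is hypothetical):

pdf = npdf.PDF("document.pdf")
print(pdf.get_id())  # prints the path the PDF was opened from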
Source code in natural_pdf/core/pdf.py
def get_id(self) -> str:
    """Get unique identifier for this PDF."""
    """Get unique identifier for this PDF."""
    return self.path
natural_pdf.PDF.get_manager(key)

Retrieve a manager instance by its key, instantiating it lazily if needed.

Managers are specialized components that handle specific functionality like classification, structured data extraction, or OCR processing. They are instantiated on-demand to minimize memory usage and startup time.

Parameters:

Name Type Description Default
key str

The manager key to retrieve. Common keys include 'classification' and 'structured_data'.

required

Returns:

Type Description
Any

The manager instance for the specified key.

Raises:

Type Description
KeyError

If no manager is registered for the given key.

RuntimeError

If the manager failed to initialize.

Example
pdf = npdf.PDF("document.pdf")
classification_mgr = pdf.get_manager('classification')
structured_data_mgr = pdf.get_manager('structured_data')
Source code in natural_pdf/core/pdf.py
def get_manager(self, key: str) -> Any:
    """Retrieve a manager instance by its key, instantiating it lazily if needed.

    Managers are specialized components that handle specific functionality like
    classification, structured data extraction, or OCR processing. They are
    instantiated on-demand to minimize memory usage and startup time.

    Args:
        key: The manager key to retrieve. Common keys include 'classification'
            and 'structured_data'.

    Returns:
        The manager instance for the specified key.

    Raises:
        KeyError: If no manager is registered for the given key.
        RuntimeError: If the manager failed to initialize.

    Example:
        ```python
        pdf = npdf.PDF("document.pdf")
        classification_mgr = pdf.get_manager('classification')
        structured_data_mgr = pdf.get_manager('structured_data')
        ```
    """
    # Check if already instantiated
    if key in self._managers:
        manager_instance = self._managers[key]
        if manager_instance is None:
            raise RuntimeError(f"Manager '{key}' failed to initialize previously.")
        return manager_instance

    # Not instantiated yet: get factory/class
    if not hasattr(self, "_manager_factories") or key not in self._manager_factories:
        raise KeyError(
            f"No manager registered for key '{key}'. Available: {list(getattr(self, '_manager_factories', {}).keys())}"
        )
    factory_or_class = self._manager_factories[key]
    try:
        resolved = factory_or_class
        # If it's a callable that's not a class, call it to get the class/instance
        if not isinstance(resolved, type) and callable(resolved):
            resolved = resolved()
        # If it's a class, instantiate it
        if isinstance(resolved, type):
            instance = resolved()
        else:
            instance = resolved  # Already an instance
        self._managers[key] = instance
        return instance
    except Exception as e:
        logger.error(f"Failed to initialize manager for key '{key}': {e}")
        self._managers[key] = None
        raise RuntimeError(f"Manager '{key}' failed to initialize: {e}") from e
natural_pdf.PDF.get_sections(start_elements=None, end_elements=None, new_section_on_page_break=False, include_boundaries='both', orientation='vertical')

Extract sections from the entire PDF based on start/end elements.

This method delegates to the PageCollection.get_sections() method, providing a convenient way to extract document sections across all pages.

Parameters:

Name Type Description Default
start_elements

Elements or selector string that mark the start of sections (optional)

None
end_elements

Elements or selector string that mark the end of sections (optional)

None
new_section_on_page_break

Whether to start a new section at page boundaries (default: False)

False
include_boundaries

How to include boundary elements: 'start', 'end', 'both', or 'none' (default: 'both')

'both'
orientation

'vertical' (default) or 'horizontal' - determines section direction

'vertical'

Returns:

Type Description
ElementCollection

ElementCollection of Region objects representing the extracted sections

Example

Extract sections between headers:

pdf = npdf.PDF("document.pdf")

# Get sections between headers
sections = pdf.get_sections(
    start_elements='text[size>14]:bold',
    end_elements='text[size>14]:bold'
)

# Get sections that break at page boundaries
sections = pdf.get_sections(
    start_elements='text:contains("Chapter")',
    new_section_on_page_break=True
)

Note

You can provide only start_elements, only end_elements, or both.

- With only start_elements: sections go from each start to the next start (or the end of the document)
- With only end_elements: sections go from the beginning of the document to each end
- With both: sections go from each start to the corresponding end

Source code in natural_pdf/core/pdf.py
def get_sections(
    self,
    start_elements=None,
    end_elements=None,
    new_section_on_page_break=False,
    include_boundaries="both",
    orientation="vertical",
) -> "ElementCollection":
    """
    Extract sections from the entire PDF based on start/end elements.

    This method delegates to the PageCollection.get_sections() method,
    providing a convenient way to extract document sections across all pages.

    Args:
        start_elements: Elements or selector string that mark the start of sections (optional)
        end_elements: Elements or selector string that mark the end of sections (optional)
        new_section_on_page_break: Whether to start a new section at page boundaries (default: False)
        include_boundaries: How to include boundary elements: 'start', 'end', 'both', or 'none' (default: 'both')
        orientation: 'vertical' (default) or 'horizontal' - determines section direction

    Returns:
        ElementCollection of Region objects representing the extracted sections

    Example:
        Extract sections between headers:
        ```python
        pdf = npdf.PDF("document.pdf")

        # Get sections between headers
        sections = pdf.get_sections(
            start_elements='text[size>14]:bold',
            end_elements='text[size>14]:bold'
        )

        # Get sections that break at page boundaries
        sections = pdf.get_sections(
            start_elements='text:contains("Chapter")',
            new_section_on_page_break=True
        )
        ```

    Note:
        You can provide only start_elements, only end_elements, or both.
        - With only start_elements: sections go from each start to the next start (or end of document)
        - With only end_elements: sections go from beginning of document to each end
        - With both: sections go from each start to the corresponding end
    """
    if not hasattr(self, "_pages"):
        raise AttributeError("PDF pages not yet initialized.")

    return self.pages.get_sections(
        start_elements=start_elements,
        end_elements=end_elements,
        new_section_on_page_break=new_section_on_page_break,
        include_boundaries=include_boundaries,
        orientation=orientation,
    )
natural_pdf.PDF.highlights(show=False)

Create a highlight context for accumulating highlights.

This allows for clean syntax to show multiple highlight groups:

Example

with pdf.highlights() as h:
    h.add(pdf.find_all('table'), label='tables', color='blue')
    h.add(pdf.find_all('text:bold'), label='bold text', color='red')
    h.show()

Or with automatic display

with pdf.highlights(show=True) as h:
    h.add(pdf.find_all('table'), label='tables')
    h.add(pdf.find_all('text:bold'), label='bold')
    # Automatically shows when exiting the context

Parameters:

Name Type Description Default
show bool

If True, automatically show highlights when exiting context

False

Returns:

Type Description
HighlightContext

HighlightContext for accumulating highlights

Source code in natural_pdf/core/pdf.py
def highlights(self, show: bool = False) -> "HighlightContext":
    """
    Create a highlight context for accumulating highlights.

    This allows for clean syntax to show multiple highlight groups:

    Example:
        with pdf.highlights() as h:
            h.add(pdf.find_all('table'), label='tables', color='blue')
            h.add(pdf.find_all('text:bold'), label='bold text', color='red')
            h.show()

    Or with automatic display:
        with pdf.highlights(show=True) as h:
            h.add(pdf.find_all('table'), label='tables')
            h.add(pdf.find_all('text:bold'), label='bold')
            # Automatically shows when exiting the context

    Args:
        show: If True, automatically show highlights when exiting context

    Returns:
        HighlightContext for accumulating highlights
    """
    from natural_pdf.core.highlighting_service import HighlightContext

    return HighlightContext(self, show_on_exit=show)
natural_pdf.PDF.save_pdf(output_path, ocr=False, original=False, dpi=300)

Saves the PDF object (all its pages) to a new file.

Choose one saving mode:

- ocr=True: Creates a new, image-based PDF using OCR results from all pages. Text generated during the natural-pdf session becomes searchable, but original vector content is lost. Requires 'ocr-export' extras.
- original=True: Saves a copy of the original PDF file this object represents. Any OCR results or analyses from the natural-pdf session are NOT included. If the PDF was opened from an in-memory buffer, this mode may not be suitable. Requires 'ocr-export' extras.

Parameters:

Name Type Description Default
output_path Union[str, Path]

Path to save the new PDF file.

required
ocr bool

If True, save as a searchable, image-based PDF using OCR data.

False
original bool

If True, save the original source PDF content.

False
dpi int

Resolution (dots per inch) used only when ocr=True.

300

Raises:

Type Description
ValueError

If the PDF has no pages, or if neither or both of 'ocr' and 'original' are True.

ImportError

If required libraries are not installed for the chosen mode.

RuntimeError

If an unexpected error occurs during saving.
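
The two modes are mutually exclusive. A minimal sketch of both (file names are hypothetical; ocr=True assumes OCR was applied earlier in the session and the 'ocr-export' extras are installed):

pdf = npdf.PDF("scanned_report.pdf")
pdf.apply_ocr()  # build an OCR text layer for this session

# Mode 1: new image-based PDF with the session's OCR text baked in
pdf.save_pdf("report_searchable.pdf", ocr=True, dpi=300)

# Mode 2: untouched copy of the original file (session OCR is dropped)
pdf.save_pdf("report_copy.pdf", original=True)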

Source code in natural_pdf/core/pdf.py
def save_pdf(
    self,
    output_path: Union[str, Path],
    ocr: bool = False,
    original: bool = False,
    dpi: int = 300,
):
    """
    Saves the PDF object (all its pages) to a new file.

    Choose one saving mode:
    - `ocr=True`: Creates a new, image-based PDF using OCR results from all pages.
      Text generated during the natural-pdf session becomes searchable,
      but original vector content is lost. Requires 'ocr-export' extras.
    - `original=True`: Saves a copy of the original PDF file this object represents.
      Any OCR results or analyses from the natural-pdf session are NOT included.
      If the PDF was opened from an in-memory buffer, this mode may not be suitable.
      Requires 'ocr-export' extras.

    Args:
        output_path: Path to save the new PDF file.
        ocr: If True, save as a searchable, image-based PDF using OCR data.
        original: If True, save the original source PDF content.
        dpi: Resolution (dots per inch) used only when ocr=True.

    Raises:
        ValueError: If the PDF has no pages, or if neither or both of
                    'ocr' and 'original' are True.
        ImportError: If required libraries are not installed for the chosen mode.
        RuntimeError: If an unexpected error occurs during saving.
    """
    if not self.pages:
        raise ValueError("Cannot save an empty PDF object.")

    if not (ocr ^ original):  # XOR: exactly one must be true
        raise ValueError("Exactly one of 'ocr' or 'original' must be True.")

    output_path_obj = Path(output_path)
    output_path_str = str(output_path_obj)

    if ocr:
        has_vector_elements = False
        for page in self.pages:
            if (
                hasattr(page, "rects")
                and page.rects
                or hasattr(page, "lines")
                and page.lines
                or hasattr(page, "curves")
                and page.curves
                or (
                    hasattr(page, "chars")
                    and any(getattr(el, "source", None) != "ocr" for el in page.chars)
                )
                or (
                    hasattr(page, "words")
                    and any(getattr(el, "source", None) != "ocr" for el in page.words)
                )
            ):
                has_vector_elements = True
                break
        if has_vector_elements:
            logger.warning(
                "Warning: Saving with ocr=True creates an image-based PDF. "
                "Original vector elements (rects, lines, non-OCR text/chars) "
                "will not be preserved in the output file."
            )

        logger.info(f"Saving searchable PDF (OCR text layer) to: {output_path_str}")
        try:
            # Delegate to the searchable PDF exporter, passing self (PDF instance)
            create_searchable_pdf(self, output_path_str, dpi=dpi)
        except Exception as e:
            raise RuntimeError(f"Failed to create searchable PDF: {e}") from e

    elif original:
        if create_original_pdf is None:
            raise ImportError(
                "Saving with original=True requires 'pikepdf'. "
                'Install with: pip install "natural-pdf[ocr-export]"'
            )

        # Optional: Add warning about losing OCR data similar to PageCollection
        has_ocr_elements = False
        for page in self.pages:
            if hasattr(page, "find_all"):
                ocr_text_elements = page.find_all("text[source=ocr]")
                if ocr_text_elements:
                    has_ocr_elements = True
                    break
            elif hasattr(page, "words"):  # Fallback
                if any(getattr(el, "source", None) == "ocr" for el in page.words):
                    has_ocr_elements = True
                    break
        if has_ocr_elements:
            logger.warning(
                "Warning: Saving with original=True preserves original page content. "
                "OCR text generated in this session will not be included in the saved file."
            )

        logger.info(f"Saving original PDF content to: {output_path_str}")
        try:
            # Delegate to the original PDF exporter, passing self (PDF instance)
            create_original_pdf(self, output_path_str)
        except Exception as e:
            # Re-raise exception from exporter
            raise e
natural_pdf.PDF.save_searchable(output_path, dpi=300, **kwargs)

DEPRECATED: Use save_pdf(..., ocr=True) instead. Saves the PDF with an OCR text layer, making content searchable.

Requires optional dependencies. Install with: pip install "natural-pdf[ocr-export]"

Parameters:

Name Type Description Default
output_path Union[str, Path]

Path to save the searchable PDF

required
dpi int

Resolution for rendering and OCR overlay

300
**kwargs

Additional keyword arguments passed to the exporter

{}
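
Since this method is deprecated, a sketch of the equivalent replacement call (output path hypothetical):

# Deprecated:
pdf.save_searchable("output.pdf", dpi=300)

# Preferred equivalent:
pdf.save_pdf("output.pdf", ocr=True, dpi=300)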
Source code in natural_pdf/core/pdf.py
def save_searchable(self, output_path: Union[str, "Path"], dpi: int = 300, **kwargs):
    """
    DEPRECATED: Use save_pdf(..., ocr=True) instead.
    Saves the PDF with an OCR text layer, making content searchable.

    Requires optional dependencies. Install with: pip install \"natural-pdf[ocr-export]\"

    Args:
        output_path: Path to save the searchable PDF
        dpi: Resolution for rendering and OCR overlay
        **kwargs: Additional keyword arguments passed to the exporter
    """
    logger.warning(
        "PDF.save_searchable() is deprecated. Use PDF.save_pdf(..., ocr=True) instead."
    )
    if create_searchable_pdf is None:
        raise ImportError(
            "Saving searchable PDF requires 'pikepdf'. "
            'Install with: pip install "natural-pdf[ocr-export]"'
        )
    output_path_str = str(output_path)
    # Call the exporter directly, passing self (the PDF instance)
    create_searchable_pdf(self, output_path_str, dpi=dpi, **kwargs)
natural_pdf.PDF.search_within_index(query, search_service, options=None)

Finds relevant documents from this PDF within a search index.

Parameters:

Name Type Description Default
query Union[str, Path, Image, Region]

The search query (text, image path, PIL Image, Region)

required
search_service SearchServiceProtocol

A pre-configured SearchService instance

required
options Optional[SearchOptions]

Optional SearchOptions to configure the query

None

Returns:

Type Description
List[Dict[str, Any]]

A list of result dictionaries, sorted by relevance


Raises:

Type Description
ImportError

If search dependencies are not installed

ValueError

If search_service is None

TypeError

If search_service does not conform to the protocol

FileNotFoundError

If the collection managed by the service does not exist

RuntimeError

For other search failures
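
A minimal usage sketch, assuming a pre-configured search_service whose index already contains this PDF's content (how the service and its index are created is outside this method and not shown here):

results = pdf.search_within_index(
    "environmental review findings",  # hypothetical text query
    search_service=search_service,
)
for result in results:  # sorted by relevance
    print(result)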


Source code in natural_pdf/core/pdf.py
def search_within_index(
    self,
    query: Union[str, Path, Image.Image, "Region"],
    search_service: "SearchServiceProtocol",
    options: Optional["SearchOptions"] = None,
) -> List[Dict[str, Any]]:
    """
    Finds relevant documents from this PDF within a search index.

    Args:
        query: The search query (text, image path, PIL Image, Region)
        search_service: A pre-configured SearchService instance
        options: Optional SearchOptions to configure the query

    Returns:
        A list of result dictionaries, sorted by relevance

    Raises:
        ImportError: If search dependencies are not installed
        ValueError: If search_service is None
        TypeError: If search_service does not conform to the protocol
        FileNotFoundError: If the collection managed by the service does not exist
        RuntimeError: For other search failures
    """
    if not search_service:
        raise ValueError("A configured SearchServiceProtocol instance must be provided.")

    collection_name = getattr(search_service, "collection_name", "<Unknown Collection>")
    logger.info(
        f"Searching within index '{collection_name}' for content from PDF '{self.path}'"
    )

    service = search_service

    query_input = query
    effective_options = copy.deepcopy(options) if options is not None else TextSearchOptions()

    if isinstance(query, Region):
        logger.debug("Query is a Region object. Extracting text.")
        if not isinstance(effective_options, TextSearchOptions):
            logger.warning(
                "Querying with Region image requires MultiModalSearchOptions. Falling back to text extraction."
            )
        query_input = query.extract_text()
        if not query_input or query_input.isspace():
            logger.error("Region has no extractable text for query.")
            return []

    # Add filter to scope search to THIS PDF
    pdf_scope_filter = {
        "field": "pdf_path",
        "operator": "eq",
        "value": self.path,
    }
    logger.debug(f"Applying filter to scope search to PDF: {pdf_scope_filter}")

    # Combine with existing filters in options (if any)
    if effective_options.filters:
        logger.debug(f"Combining PDF scope filter with existing filters")
        if (
            isinstance(effective_options.filters, dict)
            and effective_options.filters.get("operator") == "AND"
        ):
            effective_options.filters["conditions"].append(pdf_scope_filter)
        elif isinstance(effective_options.filters, list):
            effective_options.filters = {
                "operator": "AND",
                "conditions": effective_options.filters + [pdf_scope_filter],
            }
        elif isinstance(effective_options.filters, dict):
            effective_options.filters = {
                "operator": "AND",
                "conditions": [effective_options.filters, pdf_scope_filter],
            }
        else:
            logger.warning(
                f"Unsupported format for existing filters. Overwriting with PDF scope filter."
            )
            effective_options.filters = pdf_scope_filter
    else:
        effective_options.filters = pdf_scope_filter

    logger.debug(f"Final filters for service search: {effective_options.filters}")

    try:
        results = service.search(
            query=query_input,
            options=effective_options,
        )
        logger.info(f"SearchService returned {len(results)} results from PDF '{self.path}'")
        return results
    except FileNotFoundError as fnf:
        logger.error(f"Search failed: Collection not found. Error: {fnf}")
        raise
    except Exception as e:
        logger.error(f"SearchService search failed: {e}")
        raise RuntimeError("Search within index failed. See logs for details.") from e
natural_pdf.PDF.split(divider, **kwargs)

Divide the PDF into sections based on the provided divider elements.

Parameters:

Name Type Description Default
divider

Elements or selector string that mark section boundaries

required
**kwargs

Additional parameters passed to get_sections():

- include_boundaries: How to include boundary elements (default: 'start')
- orientation: 'vertical' or 'horizontal' (default: 'vertical')
- new_section_on_page_break: Whether to split at page boundaries (default: False)

{}

Returns:

Type Description
ElementCollection

ElementCollection of Region objects representing the sections

Example
Split a PDF by chapter titles

chapters = pdf.split("text[size>20]:contains('Chapter')")

Export each chapter to a separate file

for i, chapter in enumerate(chapters):
    chapter_text = chapter.extract_text()
    with open(f"chapter_{i+1}.txt", "w") as f:
        f.write(chapter_text)

Split by horizontal rules/lines

sections = pdf.split("line[orientation=horizontal]")

Split only by page breaks (no divider elements)

pages = pdf.split(None, new_section_on_page_break=True)

Source code in natural_pdf/core/pdf.py
def split(self, divider, **kwargs) -> "ElementCollection":
    """
    Divide the PDF into sections based on the provided divider elements.

    Args:
        divider: Elements or selector string that mark section boundaries
        **kwargs: Additional parameters passed to get_sections()
            - include_boundaries: How to include boundary elements (default: 'start')
            - orientation: 'vertical' or 'horizontal' (default: 'vertical')
            - new_section_on_page_break: Whether to split at page boundaries (default: False)

    Returns:
        ElementCollection of Region objects representing the sections

    Example:
        # Split a PDF by chapter titles
        chapters = pdf.split("text[size>20]:contains('Chapter')")

        # Export each chapter to a separate file
        for i, chapter in enumerate(chapters):
            chapter_text = chapter.extract_text()
            with open(f"chapter_{i+1}.txt", "w") as f:
                f.write(chapter_text)

        # Split by horizontal rules/lines
        sections = pdf.split("line[orientation=horizontal]")

        # Split only by page breaks (no divider elements)
        pages = pdf.split(None, new_section_on_page_break=True)
    """
    # Delegate to pages collection
    return self.pages.split(divider, **kwargs)
natural_pdf.PDF.update_text(transform, pages=None, selector='text', max_workers=None, progress_callback=None)

Applies corrections to text elements using a callback function.

Parameters:

Name Type Description Default
transform Callable[[Any], Optional[str]]

Function that takes an element and returns corrected text or None

required
pages Optional[Union[Iterable[int], range, slice]]

Optional page indices/slice to limit the scope of correction

None
selector str

Selector to apply corrections to (default: "text")

'text'
max_workers Optional[int]

Maximum number of threads to use for parallel execution

None
progress_callback Optional[Callable[[], None]]

Optional callback function for progress updates

None

Returns:

Type Description
PDF

Self for method chaining
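
A minimal sketch of a transform callback (the cleanup rule is hypothetical, and it assumes text elements expose a .text attribute):

def fix_common_ocr_errors(element):
    """Return corrected text, or None to leave the element unchanged."""
    text = element.text
    if text and "|" in text:
        return text.replace("|", "I")  # hypothetical OCR confusion fix
    return None

pdf.update_text(fix_common_ocr_errors, selector="text[source=ocr]")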

Source code in natural_pdf/core/pdf.py
def update_text(
    self,
    transform: Callable[[Any], Optional[str]],
    pages: Optional[Union[Iterable[int], range, slice]] = None,
    selector: str = "text",
    max_workers: Optional[int] = None,
    progress_callback: Optional[Callable[[], None]] = None,
) -> "PDF":
    """
    Applies corrections to text elements using a callback function.

    Args:
        transform: Function that takes an element and returns corrected text or None
        pages: Optional page indices/slice to limit the scope of correction
        selector: Selector to apply corrections to (default: "text")
        max_workers: Maximum number of threads to use for parallel execution
        progress_callback: Optional callback function for progress updates

    Returns:
        Self for method chaining
    """
    target_page_indices = []
    if pages is None:
        target_page_indices = list(range(len(self._pages)))
    elif isinstance(pages, slice):
        target_page_indices = list(range(*pages.indices(len(self._pages))))
    elif hasattr(pages, "__iter__"):
        try:
            target_page_indices = [int(i) for i in pages]
            for idx in target_page_indices:
                if not (0 <= idx < len(self._pages)):
                    raise IndexError(f"Page index {idx} out of range (0-{len(self._pages)-1}).")
        except (IndexError, TypeError, ValueError) as e:
            raise ValueError(f"Invalid page index in 'pages': {pages}. Error: {e}") from e
    else:
        raise TypeError("'pages' must be None, a slice, or an iterable of page indices.")

    if not target_page_indices:
        logger.warning("No pages selected for text update.")
        return self

    logger.info(
        f"Starting text update for pages: {target_page_indices} with selector='{selector}'"
    )

    for page_idx in target_page_indices:
        page = self._pages[page_idx]
        try:
            page.update_text(
                transform=transform,
                selector=selector,
                max_workers=max_workers,
                progress_callback=progress_callback,
            )
        except Exception as e:
            logger.error(f"Error during text update on page {page_idx}: {e}")

    logger.info("Text update process finished.")
    return self
natural_pdf.PDFCollection

Bases: SearchableMixin, ApplyMixin, ExportMixin, ShapeDetectionMixin, VisualSearchMixin
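
From the constructor below, a collection can be built from existing PDF objects, explicit paths/URLs, a directory, or a glob pattern. A minimal sketch (paths are hypothetical):

import natural_pdf as npdf

# From a glob pattern (recursive by default)
collection = npdf.PDFCollection("reports/**/*.pdf")

# From an explicit list of paths and URLs
collection = npdf.PDFCollection(["a.pdf", "https://example.com/b.pdf"])

print(len(collection))     # number of successfully loaded PDFs
first_pdf = collection[0]  # integer index returns a PDF; a slice returns a PDFCollection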

Source code in natural_pdf/core/pdf_collection.py
class PDFCollection(
    SearchableMixin, ApplyMixin, ExportMixin, ShapeDetectionMixin, VisualSearchMixin
):
    def __init__(
        self,
        source: Union[str, Iterable[Union[str, "PDF"]]],
        recursive: bool = True,
        **pdf_options: Any,
    ):
        """
        Initializes a collection of PDF documents from various sources.

        Args:
            source: The source of PDF documents. Can be:
                - An iterable (e.g., list) of existing PDF objects.
                - An iterable (e.g., list) of file paths/URLs/globs (strings).
                - A single file path/URL/directory/glob string.
            recursive: If source involves directories or glob patterns,
                       whether to search recursively (default: True).
            **pdf_options: Keyword arguments passed to the PDF constructor.
        """
        self._pdfs: List["PDF"] = []
        self._pdf_options = pdf_options  # Store options for potential slicing later
        self._recursive = recursive  # Store setting for potential slicing

        # Dynamically import PDF class within methods to avoid circular import at module load time
        PDF = self._get_pdf_class()

        if hasattr(source, "__iter__") and not isinstance(source, str):
            source_list = list(source)
            if not source_list:
                return  # Empty list source
            if isinstance(source_list[0], PDF):
                if all(isinstance(item, PDF) for item in source_list):
                    self._pdfs = source_list  # Direct assignment
                    # Don't adopt search context anymore
                    return
                else:
                    raise TypeError("Iterable source has mixed PDF/non-PDF objects.")
            # If it's an iterable but not PDFs, fall through to resolve sources

        # Resolve string, iterable of strings, or single string source to paths/URLs
        resolved_paths_or_urls = self._resolve_sources_to_paths(source)
        self._initialize_pdfs(resolved_paths_or_urls, PDF)  # Pass PDF class

        self._iter_index = 0

        # Initialize internal search service reference
        self._search_service: Optional[SearchServiceProtocol] = None

    @staticmethod
    def _get_pdf_class():
        """Helper method to dynamically import the PDF class."""
        from natural_pdf.core.pdf import PDF

        return PDF

    # --- Internal Helpers ---

    def _is_url(self, s: str) -> bool:
        return s.startswith(("http://", "https://"))

    def _has_glob_magic(self, s: str) -> bool:
        return py_glob.has_magic(s)

    def _execute_glob(self, pattern: str) -> Set[str]:
        """Glob for paths and return a set of valid PDF paths."""
        found_paths = set()
        # Use iglob for potentially large directories/matches
        paths_iter = py_glob.iglob(pattern, recursive=self._recursive)
        for path_str in paths_iter:
            # Use Path object for easier checking
            p = Path(path_str)
            if p.is_file() and p.suffix.lower() == ".pdf":
                found_paths.add(str(p.resolve()))  # Store resolved absolute path
        return found_paths

    def _resolve_sources_to_paths(self, source: Union[str, Iterable[str]]) -> List[str]:
        """Resolves various source types into a list of unique PDF paths/URLs."""
        final_paths = set()
        sources_to_process = []

        if isinstance(source, str):
            sources_to_process.append(source)
        elif hasattr(source, "__iter__"):
            sources_to_process.extend(list(source))
        else:  # Should not happen based on __init__ checks, but safeguard
            raise TypeError(f"Unexpected source type in _resolve_sources_to_paths: {type(source)}")

        for item in sources_to_process:
            if not isinstance(item, str):
                logger.warning(f"Skipping non-string item in source list: {type(item)}")
                continue

            item_path = Path(item)

            if self._is_url(item):
                final_paths.add(item)  # Add URL directly
            elif self._has_glob_magic(item):
                glob_results = self._execute_glob(item)
                final_paths.update(glob_results)
            elif item_path.is_dir():
                # Use glob to find PDFs in directory, respecting recursive flag
                dir_pattern = (
                    str(item_path / "**" / "*.pdf") if self._recursive else str(item_path / "*.pdf")
                )
                dir_glob_results = self._execute_glob(dir_pattern)
                final_paths.update(dir_glob_results)
            elif item_path.is_file() and item_path.suffix.lower() == ".pdf":
                final_paths.add(str(item_path.resolve()))  # Add resolved file path
            else:
                logger.warning(
                    f"Source item ignored (not a valid URL, directory, file, or glob): {item}"
                )

        return sorted(list(final_paths))

    def _initialize_pdfs(self, paths_or_urls: List[str], PDF_cls: Type):
        """Initializes PDF objects from a list of paths/URLs."""
        logger.info(f"Initializing {len(paths_or_urls)} PDF objects...")
        failed_count = 0
        for path_or_url in tqdm(paths_or_urls, desc="Loading PDFs"):
            try:
                pdf_instance = PDF_cls(path_or_url, **self._pdf_options)
                self._pdfs.append(pdf_instance)
            except Exception as e:
                logger.error(
                    f"Failed to load PDF: {path_or_url}. Error: {e}", exc_info=False
                )  # Keep log concise
                failed_count += 1
        logger.info(f"Successfully initialized {len(self._pdfs)} PDFs. Failed: {failed_count}")

    # --- Public Factory Class Methods (Simplified) ---

    @classmethod
    def from_paths(cls, paths_or_urls: List[str], **pdf_options: Any) -> "PDFCollection":
        """Creates a PDFCollection explicitly from a list of file paths or URLs."""
        # __init__ can handle List[str] directly now
        return cls(paths_or_urls, **pdf_options)

    @classmethod
    def from_glob(cls, pattern: str, recursive: bool = True, **pdf_options: Any) -> "PDFCollection":
        """Creates a PDFCollection explicitly from a single glob pattern."""
        # __init__ can handle single glob string directly
        return cls(pattern, recursive=recursive, **pdf_options)

    @classmethod
    def from_globs(
        cls, patterns: List[str], recursive: bool = True, **pdf_options: Any
    ) -> "PDFCollection":
        """Creates a PDFCollection explicitly from a list of glob patterns."""
        # __init__ can handle List[str] containing globs directly
        return cls(patterns, recursive=recursive, **pdf_options)

    @classmethod
    def from_directory(
        cls, directory_path: str, recursive: bool = True, **pdf_options: Any
    ) -> "PDFCollection":
        """Creates a PDFCollection explicitly from PDF files within a directory."""
        # __init__ can handle single directory string directly
        return cls(directory_path, recursive=recursive, **pdf_options)

    # --- Core Collection Methods ---
    def __len__(self) -> int:
        return len(self._pdfs)

    def __getitem__(self, key) -> Union["PDF", "PDFCollection"]:
        # Use dynamic import here as well
        PDF = self._get_pdf_class()
        if isinstance(key, slice):
            # Create a new collection with the sliced PDFs and original options
            new_collection = PDFCollection.__new__(PDFCollection)  # Create blank instance
            new_collection._pdfs = self._pdfs[key]
            new_collection._pdf_options = self._pdf_options
            new_collection._recursive = self._recursive
            # Search context is not copied/inherited anymore
            return new_collection
        elif isinstance(key, int):
            # Check bounds
            if 0 <= key < len(self._pdfs):
                return self._pdfs[key]
            else:
                raise IndexError(f"PDF index {key} out of range (0-{len(self._pdfs)-1}).")
        else:
            raise TypeError(f"PDF indices must be integers or slices, not {type(key)}.")

    def __iter__(self):
        return iter(self._pdfs)

    def __repr__(self) -> str:
        return f"<PDFCollection(count={len(self._pdfs)})>"

    @property
    def pdfs(self) -> List["PDF"]:
        """Returns the list of PDF objects held by the collection."""
        return self._pdfs

    def show(self, limit: Optional[int] = 30, per_pdf_limit: Optional[int] = 10, **kwargs):
        """
        Display all PDFs in the collection with labels.

        Each PDF is shown with its pages in a grid layout (6 columns by default),
        and all PDFs are stacked vertically with labels.

        Args:
            limit: Maximum total pages to show across all PDFs (default: 30)
            per_pdf_limit: Maximum pages to show per PDF (default: 10)
            **kwargs: Additional arguments passed to each PDF's show() method
                     (e.g., columns, exclusions, resolution, etc.)

        Returns:
            Displayed image in Jupyter or None
        """
        if not self._pdfs:
            print("Empty collection")
            return None

        # Import here to avoid circular imports
        import numpy as np
        from PIL import Image, ImageDraw, ImageFont

        # Calculate pages per PDF if total limit is set
        if limit and not per_pdf_limit:
            per_pdf_limit = max(1, limit // len(self._pdfs))

        # Collect images from each PDF
        all_images = []
        total_pages_shown = 0

        for pdf in self._pdfs:
            if limit and total_pages_shown >= limit:
                break

            # Calculate limit for this PDF
            pdf_limit = per_pdf_limit
            if limit:
                remaining = limit - total_pages_shown
                pdf_limit = min(per_pdf_limit or remaining, remaining)

            # Get PDF identifier
            pdf_name = getattr(pdf, "filename", None) or getattr(pdf, "path", "Unknown")
            if isinstance(pdf_name, Path):
                pdf_name = pdf_name.name
            elif "/" in str(pdf_name):
                pdf_name = str(pdf_name).split("/")[-1]

            # Render this PDF
            try:
                # Get render specs from the PDF
                render_specs = pdf._get_render_specs(mode="show", max_pages=pdf_limit, **kwargs)

                if not render_specs:
                    continue

                # Get the highlighter and render without displaying
                highlighter = pdf._get_highlighter()
                pdf_image = highlighter.unified_render(
                    specs=render_specs,
                    layout="grid" if len(render_specs) > 1 else "single",
                    columns=6,
                    **kwargs,
                )

                if pdf_image:
                    # Add label above the PDF image
                    label_height = 40
                    label_bg_color = (240, 240, 240)
                    label_text_color = (0, 0, 0)

                    # Create new image with space for label
                    width, height = pdf_image.size
                    labeled_image = Image.new("RGB", (width, height + label_height), "white")

                    # Draw label background
                    draw = ImageDraw.Draw(labeled_image)
                    draw.rectangle([0, 0, width, label_height], fill=label_bg_color)

                    # Draw label text
                    try:
                        # Try to use a nice font if available
                        font = ImageFont.truetype("Arial", 20)
                    except Exception:
                        # Fallback to default font
                        font = ImageFont.load_default()

                    label_text = f"{pdf_name} ({len(pdf.pages)} pages)"
                    draw.text((10, 10), label_text, fill=label_text_color, font=font)

                    # Paste PDF image below label
                    labeled_image.paste(pdf_image, (0, label_height))

                    all_images.append(labeled_image)
                    total_pages_shown += min(pdf_limit, len(pdf.pages))

            except Exception as e:
                logger.warning(f"Failed to render PDF {pdf_name}: {e}")
                continue

        if not all_images:
            print("No PDFs could be rendered")
            return None

        # Combine all images vertically
        if len(all_images) == 1:
            combined = all_images[0]
        else:
            # Add spacing between PDFs
            spacing = 20
            total_height = sum(img.height for img in all_images) + spacing * (len(all_images) - 1)
            max_width = max(img.width for img in all_images)

            combined = Image.new("RGB", (max_width, total_height), "white")

            y_offset = 0
            for i, img in enumerate(all_images):
                # Center images if they're narrower than max width
                x_offset = (max_width - img.width) // 2
                combined.paste(img, (x_offset, y_offset))
                y_offset += img.height
                if i < len(all_images) - 1:
                    y_offset += spacing

        # Return the combined image (Jupyter will display it automatically)
        return combined

    @overload
    def find_all(
        self,
        *,
        text: str,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection": ...

    @overload
    def find_all(
        self,
        selector: str,
        *,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection": ...

    def find_all(
        self,
        selector: Optional[str] = None,  # Now optional
        *,
        text: Optional[str] = None,  # New text parameter
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection":
        """
        Find all elements matching the selector OR text across all PDFs in the collection.

        Provide EITHER `selector` OR `text`, but not both.

        This creates an ElementCollection that can span multiple PDFs. Note that
        some ElementCollection methods have limitations when spanning PDFs.

        Args:
            selector: CSS-like selector string to query elements.
            text: Text content to search for (equivalent to 'text:contains(...)').
            apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
            regex: Whether to use regex for text search (`selector` or `text`) (default: False).
            case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
            **kwargs: Additional keyword arguments passed to the find_all method of each PDF.

        Returns:
            ElementCollection containing all matching elements across all PDFs.
        """
        # Validation happens within pdf.find_all

        # Collect elements from all PDFs
        all_elements = []
        for pdf in self._pdfs:
            try:
                # Pass the relevant arguments down to each PDF's find_all
                elements = pdf.find_all(
                    selector=selector,
                    text=text,
                    apply_exclusions=apply_exclusions,
                    regex=regex,
                    case=case,
                    **kwargs,
                )
                all_elements.extend(elements.elements)
            except Exception as e:
                logger.error(f"Error finding elements in {pdf.path}: {e}", exc_info=True)

        return ElementCollection(all_elements)

    def apply_ocr(
        self,
        engine: Optional[str] = None,
        languages: Optional[List[str]] = None,
        min_confidence: Optional[float] = None,
        device: Optional[str] = None,
        resolution: Optional[int] = None,
        apply_exclusions: bool = True,
        detect_only: bool = False,
        replace: bool = True,
        options: Optional[Any] = None,
        pages: Optional[Union[slice, List[int]]] = None,
        max_workers: Optional[int] = None,
    ) -> "PDFCollection":
        """
        Apply OCR to all PDFs in the collection, potentially in parallel.

        Args:
            engine: OCR engine to use (e.g., 'easyocr', 'paddleocr', 'surya')
            languages: List of language codes for OCR
            min_confidence: Minimum confidence threshold for text detection
            device: Device to use for OCR (e.g., 'cpu', 'cuda')
            resolution: DPI resolution for page rendering
            apply_exclusions: Whether to apply exclusion regions
            detect_only: If True, only detect text regions without extracting text
            replace: If True, replace existing OCR elements
            options: Engine-specific options
            pages: Specific pages to process (None for all pages)
            max_workers: Maximum number of threads to process PDFs concurrently.
                         If None or 1, processing is sequential. (default: None)

        Returns:
            Self for method chaining
        """
        PDF = self._get_pdf_class()
        logger.info(
            f"Applying OCR to {len(self._pdfs)} PDFs in collection (max_workers={max_workers})..."
        )

        # Worker function takes PDF object again
        def _process_pdf(pdf: PDF):
            """Helper function to apply OCR to a single PDF, handling errors."""
            thread_id = threading.current_thread().name  # Get thread name for logging
            pdf_path = pdf.path  # Get path for logging
            logger.debug(f"[{thread_id}] Starting OCR process for: {pdf_path}")
            start_time = time.monotonic()
            pdf.apply_ocr(  # Call apply_ocr on the original PDF object
                pages=pages,
                engine=engine,
                languages=languages,
                min_confidence=min_confidence,
                device=device,
                resolution=resolution,
                apply_exclusions=apply_exclusions,
                detect_only=detect_only,
                replace=replace,
                options=options,
                # Note: We might want a max_workers here too for page rendering?
                # For now, PDF.apply_ocr doesn't have it.
            )
            end_time = time.monotonic()
            logger.debug(
                f"[{thread_id}] Finished OCR process for: {pdf_path} (Duration: {end_time - start_time:.2f}s)"
            )
            return pdf_path, None

        # Use ThreadPoolExecutor for parallel processing if max_workers > 1
        if max_workers is not None and max_workers > 1:
            futures = []
            with concurrent.futures.ThreadPoolExecutor(
                max_workers=max_workers, thread_name_prefix="OCRWorker"
            ) as executor:
                for pdf in self._pdfs:
                    # Submit the PDF object to the worker function
                    futures.append(executor.submit(_process_pdf, pdf))

            # Use the selected tqdm class with as_completed for progress tracking
            progress_bar = tqdm(
                concurrent.futures.as_completed(futures),
                total=len(self._pdfs),
                desc="Applying OCR (Parallel)",
                unit="pdf",
            )

            for future in progress_bar:
                pdf_path, error = future.result()  # Get result (or exception)
                if error:
                    progress_bar.set_postfix_str(f"Error: {pdf_path}", refresh=True)
                # Progress is updated automatically by tqdm

        else:  # Sequential processing (max_workers is None or 1)
            logger.info("Applying OCR sequentially...")
            # Use the selected tqdm class for sequential too for consistency
            # Iterate over PDF objects directly for sequential
            for pdf in tqdm(self._pdfs, desc="Applying OCR (Sequential)", unit="pdf"):
                _process_pdf(pdf)  # Call helper directly with PDF object

        logger.info("Finished applying OCR across the collection.")
        return self

    def correct_ocr(
        self,
        correction_callback: Callable[[Any], Optional[str]],
        max_workers: Optional[int] = None,
        progress_callback: Optional[Callable[[], None]] = None,
    ) -> "PDFCollection":
        """
        Apply OCR correction to all relevant elements across all pages and PDFs
        in the collection using a single progress bar.

        Args:
            correction_callback: Function to apply to each OCR element.
                                 It receives the element and should return
                                 the corrected text (str) or None.
            max_workers: Max threads to use for parallel execution within each page.
            progress_callback: Optional callback function to call after processing each element.

        Returns:
            Self for method chaining.
        """
        PDF = self._get_pdf_class()  # Ensure PDF class is available
        if not callable(correction_callback):
            raise TypeError("`correction_callback` must be a callable function.")

        logger.info(f"Gathering OCR elements from {len(self._pdfs)} PDFs for correction...")

        # 1. Gather all target elements using the collection's find_all
        #    Crucially, set apply_exclusions=False to include elements in headers/footers etc.
        all_ocr_elements = self.find_all("text[source=ocr]", apply_exclusions=False).elements

        if not all_ocr_elements:
            logger.info("No OCR elements found in the collection to correct.")
            return self

        total_elements = len(all_ocr_elements)
        logger.info(
            f"Found {total_elements} OCR elements across the collection. Starting correction process..."
        )

        # 2. Initialize the progress bar
        progress_bar = tqdm(total=total_elements, desc="Correcting OCR Elements", unit="element")

        # 3. Iterate through PDFs and delegate to PDF.correct_ocr
        #    PDF.correct_ocr handles page iteration and passing the progress callback down.
        for pdf in self._pdfs:
            if not pdf.pages:
                continue
            try:
                pdf.correct_ocr(
                    correction_callback=correction_callback,
                    max_workers=max_workers,
                    progress_callback=progress_bar.update,  # Pass the bar's update method
                )
            except Exception as e:
                logger.error(
                    f"Error occurred during correction process for PDF {pdf.path}: {e}",
                    exc_info=True,
                )
                # Continue with the remaining PDFs; one failure should not abort the batch.

        progress_bar.close()

        return self

    def categorize(self, labels: List[str], **kwargs):
        """Categorizes PDFs in the collection based on content or features."""
        # Implementation requires integrating with classification models or logic
        raise NotImplementedError("categorize requires classification implementation.")

    def export_ocr_correction_task(self, output_zip_path: str, **kwargs):
        """
        Exports OCR results from all PDFs in this collection into a single
        correction task package (zip file).

        Args:
            output_zip_path: The path to save the output zip file.
            **kwargs: Additional arguments passed to create_correction_task_package
                      (e.g., image_render_scale, overwrite).
        """
        from natural_pdf.utils.packaging import create_correction_task_package

        # Pass the collection itself (self) as the source
        create_correction_task_package(source=self, output_zip_path=output_zip_path, **kwargs)

    # --- Mixin Required Implementation ---
    def get_indexable_items(self) -> Iterable[Indexable]:
        """Yields Page objects from the collection, conforming to Indexable."""
        if not self._pdfs:
            return  # Return empty iterator if no PDFs

        for pdf in self._pdfs:
            if not pdf.pages:  # Handle case where a PDF might have 0 pages after loading
                logger.warning(f"PDF '{pdf.path}' has no pages. Skipping.")
                continue
            for page in pdf.pages:
                # Optional: Add filtering here if needed (e.g., skip empty pages)
                # Assuming Page object conforms to Indexable
                # We might still want the empty page check here for efficiency
                # if not page.extract_text(use_exclusions=False).strip():
                #     logger.debug(f"Skipping empty page {page.page_number} from PDF '{pdf.path}'.")
                #     continue
                yield page

    # --- Classification Method --- #
    def classify_all(
        self,
        labels: List[str],
        using: Optional[str] = None,  # Default handled by PDF.classify -> manager
        model: Optional[str] = None,  # Optional model ID
        analysis_key: str = "classification",  # Key for storing result in PDF.analyses
        **kwargs,
    ) -> "PDFCollection":
        """
        Classify each PDF document in the collection using batch processing.

        This method gathers content from all PDFs and processes them in a single
        batch to avoid multiprocessing resource accumulation that can occur with
        sequential individual classifications.

        Args:
            labels: A list of string category names.
            using: Processing mode ('text', 'vision'). If None, manager infers (defaulting to text).
            model: Optional specific model identifier (e.g., HF ID). If None, manager uses default for 'using' mode.
            analysis_key: Key under which to store the ClassificationResult in each PDF's `analyses` dict.
            **kwargs: Additional arguments passed down to the ClassificationManager.

        Returns:
            Self for method chaining.

        Raises:
            ValueError: If labels list is empty, or if using='vision' on a multi-page PDF.
            ClassificationError: If classification fails.
            ImportError: If classification dependencies are missing.
        """
        if not labels:
            raise ValueError("Labels list cannot be empty.")

        if not self._pdfs:
            logger.warning("PDFCollection is empty, skipping classification.")
            return self

        mode_desc = f"using='{using}'" if using else f"model='{model}'" if model else "default text"
        logger.info(
            f"Starting batch classification for {len(self._pdfs)} PDFs in collection ({mode_desc})..."
        )

        # Get classification manager from first PDF
        try:
            first_pdf = self._pdfs[0]
            if not hasattr(first_pdf, "get_manager"):
                raise RuntimeError("PDFs do not support classification manager")
            manager = first_pdf.get_manager("classification")
            if not manager or not manager.is_available():
                raise RuntimeError("ClassificationManager is not available")
        except Exception as e:
            from natural_pdf.classification.manager import ClassificationError

            raise ClassificationError(f"Cannot access ClassificationManager: {e}") from e

        # Determine processing mode early
        inferred_using = manager.infer_using(model if model else manager.DEFAULT_TEXT_MODEL, using)

        # Gather content from all PDFs
        pdf_contents = []
        valid_pdfs = []

        logger.info(f"Gathering content from {len(self._pdfs)} PDFs for batch classification...")

        for pdf in self._pdfs:
            try:
                # Get the content for classification - use the same logic as individual PDF classify
                if inferred_using == "text":
                    # Extract text content from PDF
                    content = pdf.extract_text()
                    if not content or content.isspace():
                        logger.warning(f"Skipping PDF {pdf.path}: No text content found")
                        continue
                elif inferred_using == "vision":
                    # For vision, we need single-page PDFs only
                    if len(pdf.pages) != 1:
                        logger.warning(
                            f"Skipping PDF {pdf.path}: Vision classification requires single-page PDFs"
                        )
                        continue
                    # Get first page image
                    content = pdf.pages[0].render()
                else:
                    raise ValueError(f"Unsupported using mode: {inferred_using}")

                pdf_contents.append(content)
                valid_pdfs.append(pdf)

            except Exception as e:
                logger.warning(f"Skipping PDF {pdf.path}: Error getting content - {e}")
                continue

        if not pdf_contents:
            logger.warning("No valid content could be gathered from PDFs for classification.")
            return self

        logger.info(
            f"Gathered content from {len(valid_pdfs)} PDFs. Running batch classification..."
        )

        # Run batch classification
        try:
            batch_results = manager.classify_batch(
                item_contents=pdf_contents,
                labels=labels,
                model_id=model,
                using=inferred_using,
                progress_bar=True,  # Let the manager handle progress display
                **kwargs,
            )
        except Exception as e:
            logger.error(f"Batch classification failed: {e}")
            from natural_pdf.classification.manager import ClassificationError

            raise ClassificationError(f"Batch classification failed: {e}") from e

        # Assign results back to PDFs
        if len(batch_results) != len(valid_pdfs):
            logger.error(
                f"Batch classification result count ({len(batch_results)}) mismatch "
                f"with PDFs processed ({len(valid_pdfs)}). Cannot assign results."
            )
            from natural_pdf.classification.manager import ClassificationError

            raise ClassificationError("Batch result count mismatch with input PDFs")

        logger.info(f"Assigning {len(batch_results)} results to PDFs under key '{analysis_key}'.")

        processed_count = 0
        for pdf, result_obj in zip(valid_pdfs, batch_results):
            try:
                if not hasattr(pdf, "analyses") or pdf.analyses is None:
                    pdf.analyses = {}
                pdf.analyses[analysis_key] = result_obj
                processed_count += 1
            except Exception as e:
                logger.warning(f"Failed to store classification result for {pdf.path}: {e}")

        skipped_count = len(self._pdfs) - processed_count
        final_message = f"Finished batch classification. Processed: {processed_count}"
        if skipped_count > 0:
            final_message += f", Skipped: {skipped_count}"
        logger.info(final_message + ".")

        return self

    # --- End Classification Method --- #

    def _gather_analysis_data(
        self,
        analysis_keys: List[str],
        include_content: bool,
        include_images: bool,
        image_dir: Optional[Path],
        image_format: str,
        image_resolution: int,
    ) -> List[Dict[str, Any]]:
        """
        Gather analysis data from all PDFs in the collection.

        Args:
            analysis_keys: Keys in the analyses dictionary to export
            include_content: Whether to include extracted text
            include_images: Whether to export images
            image_dir: Directory to save images
            image_format: Format to save images
            image_resolution: Resolution for exported images

        Returns:
            List of dictionaries containing analysis data
        """
        if not self._pdfs:
            logger.warning("No PDFs found in collection")
            return []

        all_data = []
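        # NOTE: only PDF-level path, page-count, and metadata fields are gathered
        # in the loop below; the analysis_keys, content, and image options above
        # are not yet consumed by this helper.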

        for pdf in tqdm(self._pdfs, desc="Gathering PDF data", leave=False):
            # PDF level data
            pdf_data = {
                "pdf_path": pdf.path,
                "pdf_filename": Path(pdf.path).name,
                "total_pages": len(pdf.pages) if hasattr(pdf, "pages") else 0,
            }

            # Add metadata if available
            if hasattr(pdf, "metadata") and pdf.metadata:
                for k, v in pdf.metadata.items():
                    if v:  # Only add non-empty metadata
                        pdf_data[f"metadata.{k}"] = str(v)

            all_data.append(pdf_data)

        return all_data
Attributes
natural_pdf.PDFCollection.pdfs property

Returns the list of PDF objects held by the collection.

Functions
natural_pdf.PDFCollection.__init__(source, recursive=True, **pdf_options)

Initializes a collection of PDF documents from various sources.

Parameters:

Name Type Description Default
source Union[str, Iterable[Union[str, PDF]]]

The source of PDF documents. Can be:
- An iterable (e.g., list) of existing PDF objects.
- An iterable (e.g., list) of file paths/URLs/globs (strings).
- A single file path/URL/directory/glob string.

required
recursive bool

If source involves directories or glob patterns, whether to search recursively (default: True).

True
**pdf_options Any

Keyword arguments passed to the PDF constructor.

{}
Source code in natural_pdf/core/pdf_collection.py
def __init__(
    self,
    source: Union[str, Iterable[Union[str, "PDF"]]],
    recursive: bool = True,
    **pdf_options: Any,
):
    """
    Initializes a collection of PDF documents from various sources.

    Args:
        source: The source of PDF documents. Can be:
            - An iterable (e.g., list) of existing PDF objects.
            - An iterable (e.g., list) of file paths/URLs/globs (strings).
            - A single file path/URL/directory/glob string.
        recursive: If source involves directories or glob patterns,
                   whether to search recursively (default: True).
        **pdf_options: Keyword arguments passed to the PDF constructor.
    """
    self._pdfs: List["PDF"] = []
    self._pdf_options = pdf_options  # Store options for potential slicing later
    self._recursive = recursive  # Store setting for potential slicing

    # Dynamically import PDF class within methods to avoid circular import at module load time
    PDF = self._get_pdf_class()

    if hasattr(source, "__iter__") and not isinstance(source, str):
        source_list = list(source)
        if not source_list:
            return  # Empty list source
        if isinstance(source_list[0], PDF):
            if all(isinstance(item, PDF) for item in source_list):
                self._pdfs = source_list  # Direct assignment
                # Don't adopt search context anymore
                return
            else:
                raise TypeError("Iterable source has mixed PDF/non-PDF objects.")
        # If it's an iterable but not PDFs, fall through to resolve sources

    # Resolve string, iterable of strings, or single string source to paths/URLs
    resolved_paths_or_urls = self._resolve_sources_to_paths(source)
    self._initialize_pdfs(resolved_paths_or_urls, PDF)  # Pass PDF class

    self._iter_index = 0

    # Initialize internal search service reference
    self._search_service: Optional[SearchServiceProtocol] = None
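
Example

A minimal construction sketch (the "reports/" directory and file paths are hypothetical):

from natural_pdf import PDFCollection

# From a directory (searched recursively by default)
collection = PDFCollection("reports/", recursive=True)

# From explicit paths or URLs
collection = PDFCollection(["a.pdf", "https://example.com/b.pdf"])

print(f"Loaded {len(collection.pdfs)} PDFs")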
natural_pdf.PDFCollection.apply_ocr(engine=None, languages=None, min_confidence=None, device=None, resolution=None, apply_exclusions=True, detect_only=False, replace=True, options=None, pages=None, max_workers=None)

Apply OCR to all PDFs in the collection, potentially in parallel.

Parameters:

Name Type Description Default
engine Optional[str]

OCR engine to use (e.g., 'easyocr', 'paddleocr', 'surya')

None
languages Optional[List[str]]

List of language codes for OCR

None
min_confidence Optional[float]

Minimum confidence threshold for text detection

None
device Optional[str]

Device to use for OCR (e.g., 'cpu', 'cuda')

None
resolution Optional[int]

DPI resolution for page rendering

None
apply_exclusions bool

Whether to apply exclusion regions

True
detect_only bool

If True, only detect text regions without extracting text

False
replace bool

If True, replace existing OCR elements

True
options Optional[Any]

Engine-specific options

None
pages Optional[Union[slice, List[int]]]

Specific pages to process (None for all pages)

None
max_workers Optional[int]

Maximum number of threads to process PDFs concurrently. If None or 1, processing is sequential. (default: None)

None

Returns:

Type Description
PDFCollection

Self for method chaining

Source code in natural_pdf/core/pdf_collection.py
def apply_ocr(
    self,
    engine: Optional[str] = None,
    languages: Optional[List[str]] = None,
    min_confidence: Optional[float] = None,
    device: Optional[str] = None,
    resolution: Optional[int] = None,
    apply_exclusions: bool = True,
    detect_only: bool = False,
    replace: bool = True,
    options: Optional[Any] = None,
    pages: Optional[Union[slice, List[int]]] = None,
    max_workers: Optional[int] = None,
) -> "PDFCollection":
    """
    Apply OCR to all PDFs in the collection, potentially in parallel.

    Args:
        engine: OCR engine to use (e.g., 'easyocr', 'paddleocr', 'surya')
        languages: List of language codes for OCR
        min_confidence: Minimum confidence threshold for text detection
        device: Device to use for OCR (e.g., 'cpu', 'cuda')
        resolution: DPI resolution for page rendering
        apply_exclusions: Whether to apply exclusion regions
        detect_only: If True, only detect text regions without extracting text
        replace: If True, replace existing OCR elements
        options: Engine-specific options
        pages: Specific pages to process (None for all pages)
        max_workers: Maximum number of threads to process PDFs concurrently.
                     If None or 1, processing is sequential. (default: None)

    Returns:
        Self for method chaining
    """
    PDF = self._get_pdf_class()
    logger.info(
        f"Applying OCR to {len(self._pdfs)} PDFs in collection (max_workers={max_workers})..."
    )

    # Worker function operates on a single PDF object
    def _process_pdf(pdf: PDF):
        """Helper function to apply OCR to a single PDF, handling errors."""
        thread_id = threading.current_thread().name  # Thread name for logging
        pdf_path = pdf.path  # Path for logging
        logger.debug(f"[{thread_id}] Starting OCR process for: {pdf_path}")
        start_time = time.monotonic()
        try:
            pdf.apply_ocr(  # Call apply_ocr on the original PDF object
                pages=pages,
                engine=engine,
                languages=languages,
                min_confidence=min_confidence,
                device=device,
                resolution=resolution,
                apply_exclusions=apply_exclusions,
                detect_only=detect_only,
                replace=replace,
                options=options,
                # Note: PDF.apply_ocr does not currently accept a max_workers
                # argument for page rendering within a single PDF.
            )
        except Exception as e:
            logger.error(f"[{thread_id}] OCR failed for {pdf_path}: {e}", exc_info=True)
            return pdf_path, e
        end_time = time.monotonic()
        logger.debug(
            f"[{thread_id}] Finished OCR process for: {pdf_path} (Duration: {end_time - start_time:.2f}s)"
        )
        return pdf_path, None

    # Use ThreadPoolExecutor for parallel processing if max_workers > 1
    if max_workers is not None and max_workers > 1:
        futures = []
        with concurrent.futures.ThreadPoolExecutor(
            max_workers=max_workers, thread_name_prefix="OCRWorker"
        ) as executor:
            for pdf in self._pdfs:
                # Submit the PDF object to the worker function
                futures.append(executor.submit(_process_pdf, pdf))

        # Use the selected tqdm class with as_completed for progress tracking
        progress_bar = tqdm(
            concurrent.futures.as_completed(futures),
            total=len(self._pdfs),
            desc="Applying OCR (Parallel)",
            unit="pdf",
        )

        for future in progress_bar:
            pdf_path, error = future.result()  # (pdf_path, error) tuple from the worker
            if error:
                progress_bar.set_postfix_str(f"Error: {pdf_path}", refresh=True)
            # Progress is updated automatically by tqdm

    else:  # Sequential processing (max_workers is None or 1)
        logger.info("Applying OCR sequentially...")
        # Use the selected tqdm class for sequential too for consistency
        # Iterate over PDF objects directly for sequential
        for pdf in tqdm(self._pdfs, desc="Applying OCR (Sequential)", unit="pdf"):
            _process_pdf(pdf)  # Call helper directly with PDF object

    logger.info("Finished applying OCR across the collection.")
    return self
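
Example

A usage sketch; the engine name and settings below follow the parameter list above:

# OCR every PDF in the collection, four documents at a time
collection.apply_ocr(
    engine="easyocr",
    languages=["en"],
    resolution=300,
    max_workers=4,  # >1 switches to the ThreadPoolExecutor path
)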
natural_pdf.PDFCollection.categorize(labels, **kwargs)

Categorizes PDFs in the collection based on content or features.

Source code in natural_pdf/core/pdf_collection.py
def categorize(self, labels: List[str], **kwargs):
    """Categorizes PDFs in the collection based on content or features."""
    # Implementation requires integrating with classification models or logic
    raise NotImplementedError("categorize requires classification implementation.")
natural_pdf.PDFCollection.classify_all(labels, using=None, model=None, analysis_key='classification', **kwargs)

Classify each PDF document in the collection using batch processing.

This method gathers content from all PDFs and processes them in a single batch to avoid multiprocessing resource accumulation that can occur with sequential individual classifications.

Parameters:

Name Type Description Default
labels List[str]

A list of string category names.

required
using Optional[str]

Processing mode ('text', 'vision'). If None, manager infers (defaulting to text).

None
model Optional[str]

Optional specific model identifier (e.g., HF ID). If None, manager uses default for 'using' mode.

None
analysis_key str

Key under which to store the ClassificationResult in each PDF's analyses dict.

'classification'
**kwargs

Additional arguments passed down to the ClassificationManager.

{}

Returns:

Type Description
PDFCollection

Self for method chaining.

Raises:

Type Description
ValueError

If labels list is empty, or if using='vision' on a multi-page PDF.

ClassificationError

If classification fails.

ImportError

If classification dependencies are missing.

Source code in natural_pdf/core/pdf_collection.py
def classify_all(
    self,
    labels: List[str],
    using: Optional[str] = None,  # Default handled by PDF.classify -> manager
    model: Optional[str] = None,  # Optional model ID
    analysis_key: str = "classification",  # Key for storing result in PDF.analyses
    **kwargs,
) -> "PDFCollection":
    """
    Classify each PDF document in the collection using batch processing.

    This method gathers content from all PDFs and processes them in a single
    batch to avoid multiprocessing resource accumulation that can occur with
    sequential individual classifications.

    Args:
        labels: A list of string category names.
        using: Processing mode ('text', 'vision'). If None, manager infers (defaulting to text).
        model: Optional specific model identifier (e.g., HF ID). If None, manager uses default for 'using' mode.
        analysis_key: Key under which to store the ClassificationResult in each PDF's `analyses` dict.
        **kwargs: Additional arguments passed down to the ClassificationManager.

    Returns:
        Self for method chaining.

    Raises:
        ValueError: If labels list is empty, or if using='vision' on a multi-page PDF.
        ClassificationError: If classification fails.
        ImportError: If classification dependencies are missing.
    """
    if not labels:
        raise ValueError("Labels list cannot be empty.")

    if not self._pdfs:
        logger.warning("PDFCollection is empty, skipping classification.")
        return self

    mode_desc = f"using='{using}'" if using else f"model='{model}'" if model else "default text"
    logger.info(
        f"Starting batch classification for {len(self._pdfs)} PDFs in collection ({mode_desc})..."
    )

    # Get classification manager from first PDF
    try:
        first_pdf = self._pdfs[0]
        if not hasattr(first_pdf, "get_manager"):
            raise RuntimeError("PDFs do not support classification manager")
        manager = first_pdf.get_manager("classification")
        if not manager or not manager.is_available():
            raise RuntimeError("ClassificationManager is not available")
    except Exception as e:
        from natural_pdf.classification.manager import ClassificationError

        raise ClassificationError(f"Cannot access ClassificationManager: {e}") from e

    # Determine processing mode early
    inferred_using = manager.infer_using(model if model else manager.DEFAULT_TEXT_MODEL, using)

    # Gather content from all PDFs
    pdf_contents = []
    valid_pdfs = []

    logger.info(f"Gathering content from {len(self._pdfs)} PDFs for batch classification...")

    for pdf in self._pdfs:
        try:
            # Get the content for classification - use the same logic as individual PDF classify
            if inferred_using == "text":
                # Extract text content from PDF
                content = pdf.extract_text()
                if not content or content.isspace():
                    logger.warning(f"Skipping PDF {pdf.path}: No text content found")
                    continue
            elif inferred_using == "vision":
                # For vision, we need single-page PDFs only
                if len(pdf.pages) != 1:
                    logger.warning(
                        f"Skipping PDF {pdf.path}: Vision classification requires single-page PDFs"
                    )
                    continue
                # Get first page image
                content = pdf.pages[0].render()
            else:
                raise ValueError(f"Unsupported using mode: {inferred_using}")

            pdf_contents.append(content)
            valid_pdfs.append(pdf)

        except Exception as e:
            logger.warning(f"Skipping PDF {pdf.path}: Error getting content - {e}")
            continue

    if not pdf_contents:
        logger.warning("No valid content could be gathered from PDFs for classification.")
        return self

    logger.info(
        f"Gathered content from {len(valid_pdfs)} PDFs. Running batch classification..."
    )

    # Run batch classification
    try:
        batch_results = manager.classify_batch(
            item_contents=pdf_contents,
            labels=labels,
            model_id=model,
            using=inferred_using,
            progress_bar=True,  # Let the manager handle progress display
            **kwargs,
        )
    except Exception as e:
        logger.error(f"Batch classification failed: {e}")
        from natural_pdf.classification.manager import ClassificationError

        raise ClassificationError(f"Batch classification failed: {e}") from e

    # Assign results back to PDFs
    if len(batch_results) != len(valid_pdfs):
        logger.error(
            f"Batch classification result count ({len(batch_results)}) mismatch "
            f"with PDFs processed ({len(valid_pdfs)}). Cannot assign results."
        )
        from natural_pdf.classification.manager import ClassificationError

        raise ClassificationError("Batch result count mismatch with input PDFs")

    logger.info(f"Assigning {len(batch_results)} results to PDFs under key '{analysis_key}'.")

    processed_count = 0
    for pdf, result_obj in zip(valid_pdfs, batch_results):
        try:
            if not hasattr(pdf, "analyses") or pdf.analyses is None:
                pdf.analyses = {}
            pdf.analyses[analysis_key] = result_obj
            processed_count += 1
        except Exception as e:
            logger.warning(f"Failed to store classification result for {pdf.path}: {e}")

    skipped_count = len(self._pdfs) - processed_count
    final_message = f"Finished batch classification. Processed: {processed_count}"
    if skipped_count > 0:
        final_message += f", Skipped: {skipped_count}"
    logger.info(final_message + ".")

    return self
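
Example

A minimal sketch, assuming the text of each PDF is enough to distinguish the labels:

collection.classify_all(
    labels=["invoice", "contract", "report"],
    using="text",               # 'vision' requires single-page PDFs
    analysis_key="doc_type",
)

# Results are stored on each PDF under the chosen key
for pdf in collection.pdfs:
    print(pdf.path, pdf.analyses.get("doc_type"))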
natural_pdf.PDFCollection.correct_ocr(correction_callback, max_workers=None, progress_callback=None)

Apply OCR correction to all relevant elements across all pages and PDFs in the collection using a single progress bar.

Parameters:

Name Type Description Default
correction_callback Callable[[Any], Optional[str]]

Function to apply to each OCR element. It receives the element and should return the corrected text (str) or None.

required
max_workers Optional[int]

Max threads to use for parallel execution within each page.

None
progress_callback Optional[Callable[[], None]]

Optional callback function to call after processing each element.

None

Returns:

Type Description
PDFCollection

Self for method chaining.

Source code in natural_pdf/core/pdf_collection.py
def correct_ocr(
    self,
    correction_callback: Callable[[Any], Optional[str]],
    max_workers: Optional[int] = None,
    progress_callback: Optional[Callable[[], None]] = None,
) -> "PDFCollection":
    """
    Apply OCR correction to all relevant elements across all pages and PDFs
    in the collection using a single progress bar.

    Args:
        correction_callback: Function to apply to each OCR element.
                             It receives the element and should return
                             the corrected text (str) or None.
        max_workers: Max threads to use for parallel execution within each page.
        progress_callback: Optional callback function to call after processing each element.

    Returns:
        Self for method chaining.
    """
    PDF = self._get_pdf_class()  # Ensure PDF class is available
    if not callable(correction_callback):
        raise TypeError("`correction_callback` must be a callable function.")

    logger.info(f"Gathering OCR elements from {len(self._pdfs)} PDFs for correction...")

    # 1. Gather all target elements using the collection's find_all
    #    Crucially, set apply_exclusions=False to include elements in headers/footers etc.
    all_ocr_elements = self.find_all("text[source=ocr]", apply_exclusions=False).elements

    if not all_ocr_elements:
        logger.info("No OCR elements found in the collection to correct.")
        return self

    total_elements = len(all_ocr_elements)
    logger.info(
        f"Found {total_elements} OCR elements across the collection. Starting correction process..."
    )

    # 2. Initialize the progress bar
    progress_bar = tqdm(total=total_elements, desc="Correcting OCR Elements", unit="element")

    # 3. Iterate through PDFs and delegate to PDF.correct_ocr
    #    PDF.correct_ocr handles page iteration and passing the progress callback down.
    for pdf in self._pdfs:
        if not pdf.pages:
            continue
        try:
            pdf.correct_ocr(
                correction_callback=correction_callback,
                max_workers=max_workers,
                progress_callback=progress_bar.update,  # Pass the bar's update method
            )
        except Exception as e:
            logger.error(
                f"Error occurred during correction process for PDF {pdf.path}: {e}",
                exc_info=True,
            )
            # Continue with the remaining PDFs; one failure should not abort the batch.

    progress_bar.close()

    return self
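
Example

A sketch of a correction callback; it assumes OCR text elements expose a `text` attribute:

def fix_common_misreads(element):
    """Return corrected text, or None to keep the original."""
    corrected = element.text.replace("0ffice", "Office")  # hypothetical fix
    return corrected if corrected != element.text else None

collection.correct_ocr(fix_common_misreads, max_workers=4)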
natural_pdf.PDFCollection.export_ocr_correction_task(output_zip_path, **kwargs)

Exports OCR results from all PDFs in this collection into a single correction task package (zip file).

Parameters:

Name Type Description Default
output_zip_path str

The path to save the output zip file.

required
**kwargs

Additional arguments passed to create_correction_task_package (e.g., image_render_scale, overwrite).

{}
Source code in natural_pdf/core/pdf_collection.py
def export_ocr_correction_task(self, output_zip_path: str, **kwargs):
    """
    Exports OCR results from all PDFs in this collection into a single
    correction task package (zip file).

    Args:
        output_zip_path: The path to save the output zip file.
        **kwargs: Additional arguments passed to create_correction_task_package
                  (e.g., image_render_scale, overwrite).
    """
    from natural_pdf.utils.packaging import create_correction_task_package

    # Pass the collection itself (self) as the source
    create_correction_task_package(source=self, output_zip_path=output_zip_path, **kwargs)
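
Example

A one-line usage sketch (the output filename is arbitrary):

collection.export_ocr_correction_task("correction_task.zip", overwrite=True)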
natural_pdf.PDFCollection.find_all(selector=None, *, text=None, apply_exclusions=True, regex=False, case=True, **kwargs)
find_all(*, text: str, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> ElementCollection
find_all(selector: str, *, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> ElementCollection

Find all elements matching the selector OR text across all PDFs in the collection.

Provide EITHER selector OR text, but not both.

This creates an ElementCollection that can span multiple PDFs. Note that some ElementCollection methods have limitations when spanning PDFs.

Parameters:

Name Type Description Default
selector Optional[str]

CSS-like selector string to query elements.

None
text Optional[str]

Text content to search for (equivalent to 'text:contains(...)').

None
apply_exclusions bool

Whether to exclude elements in exclusion regions (default: True).

True
regex bool

Whether to use regex for text search (selector or text) (default: False).

False
case bool

Whether to do case-sensitive text search (selector or text) (default: True).

True
**kwargs

Additional keyword arguments passed to the find_all method of each PDF.

{}

Returns:

Type Description
ElementCollection

ElementCollection containing all matching elements across all PDFs.

Source code in natural_pdf/core/pdf_collection.py
def find_all(
    self,
    selector: Optional[str] = None,  # Now optional
    *,
    text: Optional[str] = None,  # New text parameter
    apply_exclusions: bool = True,
    regex: bool = False,
    case: bool = True,
    **kwargs,
) -> "ElementCollection":
    """
    Find all elements matching the selector OR text across all PDFs in the collection.

    Provide EITHER `selector` OR `text`, but not both.

    This creates an ElementCollection that can span multiple PDFs. Note that
    some ElementCollection methods have limitations when spanning PDFs.

    Args:
        selector: CSS-like selector string to query elements.
        text: Text content to search for (equivalent to 'text:contains(...)').
        apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
        regex: Whether to use regex for text search (`selector` or `text`) (default: False).
        case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
        **kwargs: Additional keyword arguments passed to the find_all method of each PDF.

    Returns:
        ElementCollection containing all matching elements across all PDFs.
    """
    # Validation happens within pdf.find_all

    # Collect elements from all PDFs
    all_elements = []
    for pdf in self._pdfs:
        try:
            # Pass the relevant arguments down to each PDF's find_all
            elements = pdf.find_all(
                selector=selector,
                text=text,
                apply_exclusions=apply_exclusions,
                regex=regex,
                case=case,
                **kwargs,
            )
            all_elements.extend(elements.elements)
        except Exception as e:
            logger.error(f"Error finding elements in {pdf.path}: {e}", exc_info=True)

    return ElementCollection(all_elements)
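
Example

A sketch using the selector syntax shown elsewhere in this reference:

# All bold headings across every PDF in the collection
headings = collection.find_all("text[size>12]:bold")

# Or search by literal text, case-insensitively
totals = collection.find_all(text="Total", case=False)
print(f"{len(totals.elements)} matches")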
natural_pdf.PDFCollection.from_directory(directory_path, recursive=True, **pdf_options) classmethod

Creates a PDFCollection explicitly from PDF files within a directory.

Source code in natural_pdf/core/pdf_collection.py
@classmethod
def from_directory(
    cls, directory_path: str, recursive: bool = True, **pdf_options: Any
) -> "PDFCollection":
    """Creates a PDFCollection explicitly from PDF files within a directory."""
    # __init__ can handle single directory string directly
    return cls(directory_path, recursive=recursive, **pdf_options)
natural_pdf.PDFCollection.from_glob(pattern, recursive=True, **pdf_options) classmethod

Creates a PDFCollection explicitly from a single glob pattern.

Source code in natural_pdf/core/pdf_collection.py
@classmethod
def from_glob(cls, pattern: str, recursive: bool = True, **pdf_options: Any) -> "PDFCollection":
    """Creates a PDFCollection explicitly from a single glob pattern."""
    # __init__ can handle single glob string directly
    return cls(pattern, recursive=recursive, **pdf_options)
natural_pdf.PDFCollection.from_globs(patterns, recursive=True, **pdf_options) classmethod

Creates a PDFCollection explicitly from a list of glob patterns.

Source code in natural_pdf/core/pdf_collection.py
@classmethod
def from_globs(
    cls, patterns: List[str], recursive: bool = True, **pdf_options: Any
) -> "PDFCollection":
    """Creates a PDFCollection explicitly from a list of glob patterns."""
    # __init__ can handle List[str] containing globs directly
    return cls(patterns, recursive=recursive, **pdf_options)
natural_pdf.PDFCollection.from_paths(paths_or_urls, **pdf_options) classmethod

Creates a PDFCollection explicitly from a list of file paths or URLs.

Source code in natural_pdf/core/pdf_collection.py
@classmethod
def from_paths(cls, paths_or_urls: List[str], **pdf_options: Any) -> "PDFCollection":
    """Creates a PDFCollection explicitly from a list of file paths or URLs."""
    # __init__ can handle List[str] directly now
    return cls(paths_or_urls, **pdf_options)
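
Example

The four explicit constructors side by side (paths and patterns are hypothetical):

c1 = PDFCollection.from_paths(["a.pdf", "https://example.com/b.pdf"])
c2 = PDFCollection.from_glob("reports/**/*.pdf")
c3 = PDFCollection.from_globs(["invoices/*.pdf", "contracts/*.pdf"])
c4 = PDFCollection.from_directory("reports/", recursive=True)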
natural_pdf.PDFCollection.get_indexable_items()

Yields Page objects from the collection, conforming to Indexable.

Source code in natural_pdf/core/pdf_collection.py
def get_indexable_items(self) -> Iterable[Indexable]:
    """Yields Page objects from the collection, conforming to Indexable."""
    if not self._pdfs:
        return  # Return empty iterator if no PDFs

    for pdf in self._pdfs:
        if not pdf.pages:  # Handle case where a PDF might have 0 pages after loading
            logger.warning(f"PDF '{pdf.path}' has no pages. Skipping.")
            continue
        for page in pdf.pages:
            # Optional: Add filtering here if needed (e.g., skip empty pages)
            # Assuming Page object conforms to Indexable
            # We might still want the empty page check here for efficiency
            # if not page.extract_text(use_exclusions=False).strip():
            #     logger.debug(f"Skipping empty page {page.page_number} from PDF '{pdf.path}'.")
            #     continue
            yield page
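
Example

A minimal sketch of iterating indexable pages (e.g., to feed a search index):

for page in collection.get_indexable_items():
    text = page.extract_text()
    print(page.number, text[:40])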
natural_pdf.PDFCollection.show(limit=30, per_pdf_limit=10, **kwargs)

Display all PDFs in the collection with labels.

Each PDF is shown with its pages in a grid layout (6 columns by default), and all PDFs are stacked vertically with labels.

Parameters:

Name Type Description Default
limit Optional[int]

Maximum total pages to show across all PDFs (default: 30)

30
per_pdf_limit Optional[int]

Maximum pages to show per PDF (default: 10)

10
**kwargs

Additional arguments passed to each PDF's show() method (e.g., columns, exclusions, resolution, etc.)

{}

Returns:

Type Description

Displayed image in Jupyter or None

Source code in natural_pdf/core/pdf_collection.py
def show(self, limit: Optional[int] = 30, per_pdf_limit: Optional[int] = 10, **kwargs):
    """
    Display all PDFs in the collection with labels.

    Each PDF is shown with its pages in a grid layout (6 columns by default),
    and all PDFs are stacked vertically with labels.

    Args:
        limit: Maximum total pages to show across all PDFs (default: 30)
        per_pdf_limit: Maximum pages to show per PDF (default: 10)
        **kwargs: Additional arguments passed to each PDF's show() method
                 (e.g., columns, exclusions, resolution, etc.)

    Returns:
        Displayed image in Jupyter or None
    """
    if not self._pdfs:
        print("Empty collection")
        return None

    # Import here to avoid circular imports
    from PIL import Image, ImageDraw, ImageFont

    # Calculate pages per PDF if total limit is set
    if limit and not per_pdf_limit:
        per_pdf_limit = max(1, limit // len(self._pdfs))

    # Collect images from each PDF
    all_images = []
    total_pages_shown = 0

    for pdf in self._pdfs:
        if limit and total_pages_shown >= limit:
            break

        # Calculate limit for this PDF
        pdf_limit = per_pdf_limit
        if limit:
            remaining = limit - total_pages_shown
            pdf_limit = min(per_pdf_limit or remaining, remaining)

        # Get PDF identifier
        pdf_name = getattr(pdf, "filename", None) or getattr(pdf, "path", "Unknown")
        if isinstance(pdf_name, Path):
            pdf_name = pdf_name.name
        elif "/" in str(pdf_name):
            pdf_name = str(pdf_name).split("/")[-1]

        # Render this PDF
        try:
            # Get render specs from the PDF
            render_specs = pdf._get_render_specs(mode="show", max_pages=pdf_limit, **kwargs)

            if not render_specs:
                continue

            # Get the highlighter and render without displaying
            highlighter = pdf._get_highlighter()
            pdf_image = highlighter.unified_render(
                specs=render_specs,
                layout="grid" if len(render_specs) > 1 else "single",
                columns=6,
                **kwargs,
            )

            if pdf_image:
                # Add label above the PDF image
                label_height = 40
                label_bg_color = (240, 240, 240)
                label_text_color = (0, 0, 0)

                # Create new image with space for label
                width, height = pdf_image.size
                labeled_image = Image.new("RGB", (width, height + label_height), "white")

                # Draw label background
                draw = ImageDraw.Draw(labeled_image)
                draw.rectangle([0, 0, width, label_height], fill=label_bg_color)

                # Draw label text
                try:
                    # Try to use a nicer TrueType font if available
                    font = ImageFont.truetype("Arial", 20)
                except OSError:
                    # Fall back to PIL's built-in default font
                    font = ImageFont.load_default()

                label_text = f"{pdf_name} ({len(pdf.pages)} pages)"
                draw.text((10, 10), label_text, fill=label_text_color, font=font)

                # Paste PDF image below label
                labeled_image.paste(pdf_image, (0, label_height))

                all_images.append(labeled_image)
                total_pages_shown += min(pdf_limit, len(pdf.pages))

        except Exception as e:
            logger.warning(f"Failed to render PDF {pdf_name}: {e}")
            continue

    if not all_images:
        print("No PDFs could be rendered")
        return None

    # Combine all images vertically
    if len(all_images) == 1:
        combined = all_images[0]
    else:
        # Add spacing between PDFs
        spacing = 20
        total_height = sum(img.height for img in all_images) + spacing * (len(all_images) - 1)
        max_width = max(img.width for img in all_images)

        combined = Image.new("RGB", (max_width, total_height), "white")

        y_offset = 0
        for i, img in enumerate(all_images):
            # Center images if they're narrower than max width
            x_offset = (max_width - img.width) // 2
            combined.paste(img, (x_offset, y_offset))
            y_offset += img.height
            if i < len(all_images) - 1:
                y_offset += spacing

    # Return the combined image (Jupyter will display it automatically)
    return combined
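
Example

A quick preview sketch; in Jupyter the returned image is displayed automatically:

# Up to 20 pages overall, at most 5 per PDF
collection.show(limit=20, per_pdf_limit=5)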
natural_pdf.Page

Bases: TextMixin, ClassificationMixin, ExtractionMixin, ShapeDetectionMixin, CheckboxDetectionMixin, DescribeMixin, VisualSearchMixin, Visualizable

Enhanced Page wrapper built on top of pdfplumber.Page.

This class provides a fluent interface for working with PDF pages, with improved selection, navigation, extraction, and question-answering capabilities. It integrates multiple analysis capabilities through mixins and provides spatial navigation with CSS-like selectors.

The Page class serves as the primary interface for document analysis, offering:

- Element selection and spatial navigation
- OCR and layout analysis integration
- Table detection and extraction
- AI-powered classification and data extraction
- Visual debugging with highlighting and cropping
- Text style analysis and structure detection

Attributes:

Name Type Description
index int

Zero-based index of this page in the PDF.

number int

One-based page number (index + 1).

width float

Page width in points.

height float

Page height in points.

bbox Tuple[float, float, float, float]

Bounding box tuple (x0, top, x1, bottom) of the page.

chars List[Any]

Collection of character elements on the page.

words List[Any]

Collection of word elements on the page.

lines List[Any]

Collection of line elements on the page.

rects List[Any]

Collection of rectangle elements on the page.

images List[Any]

Collection of image elements on the page.

metadata Dict[str, Any]

Dictionary for storing analysis results and custom data.

Example

Basic usage:

pdf = npdf.PDF("document.pdf")
page = pdf.pages[0]

# Find elements with CSS-like selectors
headers = page.find_all('text[size>12]:bold')
summaries = page.find('text:contains("Summary")')

# Spatial navigation
content_below = summaries.below(until='text[size>12]:bold')

# Table extraction
tables = page.extract_table()

Advanced usage:

# Apply OCR if needed
page.apply_ocr(engine='easyocr', resolution=300)

# Layout analysis
page.analyze_layout(engine='yolo')

# AI-powered extraction
data = page.extract_structured_data(MySchema)

# Visual debugging
page.find('text:contains("Important")').show()

Source code in natural_pdf/core/page.py
2349
2350
2351
2352
2353
2354
2355
2356
2357
2358
2359
2360
2361
2362
2363
2364
2365
2366
2367
2368
2369
2370
2371
2372
2373
2374
2375
2376
2377
2378
2379
2380
2381
2382
2383
2384
2385
2386
2387
2388
2389
2390
2391
2392
2393
2394
2395
2396
2397
2398
2399
2400
2401
2402
2403
2404
2405
2406
2407
2408
2409
2410
2411
2412
2413
2414
2415
2416
2417
2418
2419
2420
2421
2422
2423
2424
2425
2426
2427
2428
2429
2430
2431
2432
2433
2434
2435
2436
2437
2438
2439
2440
2441
2442
2443
2444
2445
2446
2447
2448
2449
2450
2451
2452
2453
2454
2455
2456
2457
2458
2459
2460
2461
2462
2463
2464
2465
2466
2467
2468
2469
2470
2471
2472
2473
2474
2475
2476
2477
2478
2479
2480
2481
2482
2483
2484
2485
2486
2487
2488
2489
2490
2491
2492
2493
2494
2495
2496
2497
2498
2499
2500
2501
2502
2503
2504
2505
2506
2507
2508
2509
2510
2511
2512
2513
2514
2515
2516
2517
2518
2519
2520
2521
2522
2523
2524
2525
2526
2527
2528
2529
2530
2531
2532
2533
2534
2535
2536
2537
2538
2539
2540
2541
2542
2543
2544
2545
2546
2547
2548
2549
2550
2551
2552
2553
2554
2555
2556
2557
2558
2559
2560
2561
2562
2563
2564
2565
2566
2567
2568
2569
2570
2571
2572
2573
2574
2575
2576
2577
2578
2579
2580
2581
2582
2583
2584
2585
2586
2587
2588
2589
2590
2591
2592
2593
2594
2595
2596
2597
2598
2599
2600
2601
2602
2603
2604
2605
2606
2607
2608
2609
2610
2611
2612
2613
2614
2615
2616
2617
2618
2619
2620
2621
2622
2623
2624
2625
2626
2627
2628
2629
2630
2631
2632
2633
2634
2635
2636
2637
2638
2639
2640
2641
2642
2643
2644
2645
2646
2647
2648
2649
2650
2651
2652
2653
2654
2655
2656
2657
2658
2659
2660
2661
2662
2663
2664
2665
2666
2667
2668
2669
2670
2671
2672
2673
2674
2675
2676
2677
2678
2679
2680
2681
2682
2683
2684
2685
2686
2687
2688
2689
2690
2691
2692
2693
2694
2695
2696
2697
2698
2699
2700
2701
2702
2703
2704
2705
2706
2707
2708
2709
2710
2711
2712
2713
2714
2715
2716
2717
2718
2719
2720
2721
2722
2723
2724
2725
2726
2727
2728
2729
2730
2731
2732
2733
2734
2735
2736
2737
2738
2739
2740
2741
2742
2743
2744
2745
2746
2747
2748
2749
2750
2751
2752
2753
2754
2755
2756
2757
2758
2759
2760
2761
2762
2763
2764
2765
2766
2767
2768
2769
2770
2771
2772
2773
2774
2775
2776
2777
2778
2779
2780
2781
2782
2783
2784
2785
2786
2787
2788
2789
2790
2791
2792
2793
2794
2795
2796
2797
2798
2799
2800
2801
2802
2803
2804
2805
2806
2807
2808
2809
2810
2811
2812
2813
2814
2815
2816
2817
2818
2819
2820
2821
2822
2823
2824
2825
2826
2827
2828
2829
2830
2831
2832
2833
2834
2835
2836
2837
2838
2839
2840
2841
2842
2843
2844
2845
2846
2847
2848
2849
2850
2851
2852
2853
2854
2855
2856
2857
2858
2859
2860
2861
2862
2863
2864
2865
2866
2867
2868
2869
2870
2871
2872
2873
2874
2875
2876
2877
2878
2879
2880
2881
2882
2883
2884
2885
2886
2887
2888
2889
2890
2891
2892
2893
2894
2895
2896
2897
2898
2899
2900
2901
2902
2903
2904
2905
2906
2907
2908
2909
2910
2911
2912
2913
2914
2915
2916
2917
2918
2919
2920
2921
2922
2923
2924
2925
2926
2927
2928
2929
2930
2931
2932
2933
2934
2935
2936
2937
2938
2939
2940
2941
2942
2943
2944
2945
2946
2947
2948
2949
2950
2951
2952
2953
2954
2955
2956
2957
2958
2959
2960
2961
2962
2963
2964
2965
2966
2967
2968
2969
2970
2971
2972
2973
2974
2975
2976
2977
2978
2979
2980
2981
2982
2983
2984
2985
2986
2987
2988
2989
2990
2991
2992
2993
2994
2995
2996
2997
2998
2999
3000
3001
3002
3003
3004
3005
3006
3007
3008
3009
3010
3011
3012
3013
3014
3015
3016
3017
3018
3019
3020
3021
3022
3023
3024
3025
3026
3027
3028
3029
3030
3031
3032
3033
3034
3035
3036
3037
3038
3039
3040
3041
3042
3043
3044
3045
3046
3047
3048
3049
3050
3051
3052
3053
3054
3055
3056
3057
3058
3059
3060
3061
3062
3063
3064
3065
3066
3067
3068
3069
3070
3071
3072
3073
3074
3075
3076
3077
3078
3079
3080
3081
3082
3083
3084
3085
3086
3087
3088
3089
3090
3091
3092
3093
3094
3095
3096
3097
3098
3099
3100
3101
3102
3103
3104
3105
3106
3107
3108
3109
3110
3111
3112
3113
3114
3115
3116
3117
3118
3119
3120
3121
3122
3123
3124
3125
3126
3127
3128
3129
3130
3131
3132
3133
3134
3135
3136
3137
3138
3139
3140
3141
3142
3143
3144
3145
3146
3147
3148
3149
3150
3151
3152
3153
3154
3155
3156
3157
3158
3159
3160
3161
3162
3163
3164
3165
3166
3167
3168
3169
3170
3171
3172
3173
3174
3175
3176
3177
3178
3179
3180
3181
3182
3183
3184
3185
3186
3187
3188
3189
3190
3191
3192
3193
3194
3195
3196
3197
3198
3199
3200
3201
3202
3203
3204
3205
3206
3207
3208
3209
3210
3211
3212
3213
3214
3215
3216
3217
3218
3219
3220
3221
3222
3223
3224
3225
3226
3227
3228
3229
3230
3231
3232
3233
3234
3235
3236
3237
3238
3239
3240
3241
3242
3243
3244
3245
3246
3247
3248
3249
3250
3251
3252
3253
3254
3255
3256
3257
3258
3259
3260
3261
3262
3263
3264
3265
3266
3267
3268
3269
3270
3271
3272
3273
3274
3275
3276
3277
3278
3279
3280
3281
3282
3283
3284
3285
3286
3287
3288
3289
3290
3291
3292
3293
3294
3295
3296
3297
3298
3299
3300
3301
3302
3303
3304
3305
3306
3307
3308
3309
3310
3311
3312
3313
3314
3315
3316
3317
3318
3319
3320
3321
3322
3323
3324
3325
3326
3327
3328
3329
3330
3331
3332
3333
3334
3335
3336
3337
3338
3339
3340
3341
3342
3343
3344
3345
3346
3347
3348
3349
3350
3351
3352
3353
3354
3355
3356
3357
3358
3359
3360
3361
3362
3363
3364
3365
3366
3367
3368
3369
3370
3371
3372
3373
3374
3375
3376
3377
3378
3379
3380
3381
3382
3383
3384
3385
3386
3387
3388
3389
3390
3391
3392
3393
3394
3395
3396
3397
3398
3399
3400
3401
3402
3403
3404
3405
3406
3407
3408
3409
3410
3411
3412
3413
3414
3415
3416
3417
3418
3419
3420
3421
3422
3423
3424
3425
3426
3427
3428
3429
3430
3431
3432
3433
3434
3435
3436
3437
3438
3439
3440
3441
3442
3443
3444
3445
3446
3447
3448
3449
3450
3451
3452
3453
3454
3455
3456
3457
3458
3459
3460
3461
3462
3463
3464
3465
3466
3467
3468
3469
3470
3471
3472
3473
3474
3475
3476
3477
3478
3479
3480
3481
3482
3483
3484
3485
3486
3487
3488
3489
3490
3491
3492
3493
3494
3495
3496
3497
3498
3499
3500
3501
3502
3503
3504
3505
3506
3507
3508
3509
3510
3511
3512
3513
3514
3515
3516
3517
3518
3519
3520
3521
3522
3523
3524
3525
3526
3527
3528
3529
3530
3531
3532
3533
3534
3535
3536
3537
3538
3539
3540
3541
3542
3543
3544
3545
3546
3547
3548
3549
3550
3551
3552
3553
3554
3555
3556
3557
3558
3559
3560
3561
3562
3563
3564
3565
3566
3567
3568
3569
3570
3571
3572
3573
3574
3575
3576
3577
3578
3579
3580
3581
3582
3583
3584
3585
3586
3587
3588
3589
3590
3591
3592
3593
3594
3595
3596
3597
3598
3599
3600
3601
3602
3603
3604
3605
3606
3607
3608
3609
3610
3611
3612
3613
3614
3615
3616
3617
3618
3619
3620
3621
3622
3623
3624
3625
3626
3627
3628
3629
3630
3631
3632
3633
3634
3635
3636
3637
3638
3639
3640
3641
3642
3643
3644
3645
3646
3647
3648
3649
3650
3651
3652
3653
3654
3655
3656
3657
3658
3659
3660
3661
3662
3663
3664
3665
3666
3667
3668
3669
3670
3671
3672
3673
3674
3675
3676
3677
3678
3679
3680
3681
3682
3683
3684
3685
3686
3687
3688
3689
3690
3691
3692
3693
3694
3695
3696
3697
3698
3699
3700
3701
3702
3703
3704
3705
3706
3707
3708
3709
3710
3711
3712
3713
3714
3715
3716
3717
3718
3719
3720
3721
3722
3723
3724
3725
3726
3727
3728
3729
3730
3731
3732
3733
3734
3735
3736
3737
3738
3739
3740
3741
3742
3743
3744
3745
3746
3747
3748
3749
3750
3751
3752
3753
3754
3755
class Page(
    TextMixin,
    ClassificationMixin,
    ExtractionMixin,
    ShapeDetectionMixin,
    CheckboxDetectionMixin,
    DescribeMixin,
    VisualSearchMixin,
    Visualizable,
):
    """Enhanced Page wrapper built on top of pdfplumber.Page.

    This class provides a fluent interface for working with PDF pages,
    with improved selection, navigation, extraction, and question-answering capabilities.
    It integrates multiple analysis capabilities through mixins and provides spatial
    navigation with CSS-like selectors.

    The Page class serves as the primary interface for document analysis, offering:
    - Element selection and spatial navigation
    - OCR and layout analysis integration
    - Table detection and extraction
    - AI-powered classification and data extraction
    - Visual debugging with highlighting and cropping
    - Text style analysis and structure detection

    Attributes:
        index: Zero-based index of this page in the PDF.
        number: One-based page number (index + 1).
        width: Page width in points.
        height: Page height in points.
        bbox: Bounding box tuple (x0, top, x1, bottom) of the page.
        chars: Collection of character elements on the page.
        words: Collection of word elements on the page.
        lines: Collection of line elements on the page.
        rects: Collection of rectangle elements on the page.
        images: Collection of image elements on the page.
        metadata: Dictionary for storing analysis results and custom data.

    Example:
        Basic usage:
        ```python
        pdf = npdf.PDF("document.pdf")
        page = pdf.pages[0]

        # Find elements with CSS-like selectors
        headers = page.find_all('text[size>12]:bold')
        summaries = page.find('text:contains("Summary")')

        # Spatial navigation
        content_below = summaries.below(until='text[size>12]:bold')

        # Table extraction
        tables = page.extract_table()
        ```

        Advanced usage:
        ```python
        # Apply OCR if needed
        page.apply_ocr(engine='easyocr', resolution=300)

        # Layout analysis
        page.analyze_layout(engine='yolo')

        # AI-powered extraction
        data = page.extract_structured_data(MySchema)

        # Visual debugging
        page.find('text:contains("Important")').show()
        ```
    """

    def __init__(
        self,
        page: "pdfplumber.page.Page",
        parent: "PDF",
        index: int,
        font_attrs=None,
        load_text: bool = True,
    ):
        """Initialize a page wrapper.

        Creates an enhanced Page object that wraps a pdfplumber page with additional
        functionality for spatial navigation, analysis, and AI-powered extraction.

        Args:
            page: The underlying pdfplumber page object that provides raw PDF data.
            parent: Parent PDF object that contains this page and provides access
                to managers and global settings.
            index: Zero-based index of this page in the PDF document.
            font_attrs: List of font attributes to consider when grouping characters
                into words. Common attributes include ['fontname', 'size', 'flags'].
                If None, uses default character-to-word grouping rules.
            load_text: If True, load and process text elements from the PDF's text layer.
                If False, skip text layer processing (useful for OCR-only workflows).

        Note:
            This constructor is typically called automatically when accessing pages
            through the PDF.pages collection. Direct instantiation is rarely needed.

        Example:
            ```python
            # Pages are usually accessed through the PDF object
            pdf = npdf.PDF("document.pdf")
            page = pdf.pages[0]  # Page object created automatically

            # Direct construction (advanced usage)
            import pdfplumber
            with pdfplumber.open("document.pdf") as plumber_pdf:
                plumber_page = plumber_pdf.pages[0]
                page = Page(plumber_page, pdf, 0, load_text=True)
            ```
        """
        self._page = page
        self._parent = parent
        self._index = index
        self._load_text = load_text
        self._text_styles = None  # Lazy-loaded text style analyzer results
        self._exclusions = []  # List to store exclusion functions/regions
        self._skew_angle: Optional[float] = None  # Stores detected skew angle

        # --- ADDED --- Metadata store for mixins
        self.metadata: Dict[str, Any] = {}
        # --- END ADDED ---

        # Region management
        self._regions = {
            "detected": [],  # Layout detection results
            "named": {},  # Named regions (name -> region)
        }

        # -------------------------------------------------------------
        # Page-scoped configuration begins as a shallow copy of the parent
        # PDF-level configuration so that auto-computed tolerances or other
        # page-specific values do not leak into sibling pages.
        # -------------------------------------------------------------
        self._config = dict(getattr(self._parent, "_config", {}))

        # Initialize ElementManager, passing font_attrs
        self._element_mgr = ElementManager(self, font_attrs=font_attrs, load_text=self._load_text)
        # self._highlighter = HighlightingService(self) # REMOVED - Use property accessor
        # --- NEW --- Central registry for analysis results
        self.analyses: Dict[str, Any] = {}

        # --- Get OCR Manager Instance ---
        if (
            OCRManager
            and hasattr(parent, "_ocr_manager")
            and isinstance(parent._ocr_manager, OCRManager)
        ):
            self._ocr_manager = parent._ocr_manager
            logger.debug(f"Page {self.number}: Using OCRManager instance from parent PDF.")
        else:
            self._ocr_manager = None
            if OCRManager:
                logger.warning(
                    f"Page {self.number}: OCRManager instance not found on parent PDF object."
                )

        # --- Get Layout Manager Instance ---
        if (
            LayoutManager
            and hasattr(parent, "_layout_manager")
            and isinstance(parent._layout_manager, LayoutManager)
        ):
            self._layout_manager = parent._layout_manager
            logger.debug(f"Page {self.number}: Using LayoutManager instance from parent PDF.")
        else:
            self._layout_manager = None
            if LayoutManager:
                logger.warning(
                    f"Page {self.number}: LayoutManager instance not found on parent PDF object. Layout analysis will fail."
                )

        # Initialize the internal variable with a single underscore
        self._layout_analyzer = None

        self._load_elements()
        self._to_image_cache: Dict[tuple, Optional["Image.Image"]] = {}

        # Flag to prevent infinite recursion when computing exclusions
        self._computing_exclusions = False

    def _get_render_specs(
        self,
        mode: Literal["show", "render"] = "show",
        color: Optional[Union[str, Tuple[int, int, int]]] = None,
        highlights: Optional[List[Dict[str, Any]]] = None,
        crop: Union[bool, Literal["content"]] = False,
        crop_bbox: Optional[Tuple[float, float, float, float]] = None,
        **kwargs,
    ) -> List[RenderSpec]:
        """Get render specifications for this page.

        Args:
            mode: Rendering mode - 'show' includes page highlights, 'render' produces a clean image
            color: Default color for highlights in show mode
            highlights: Additional highlight groups to show
            crop: Whether to crop the page
            crop_bbox: Explicit crop bounds
            **kwargs: Additional parameters

        Returns:
            List containing a single RenderSpec for this page
        """
        spec = RenderSpec(page=self)

        # Handle cropping
        if crop_bbox:
            spec.crop_bbox = crop_bbox
        elif crop == "content":
            # Calculate content bounds from all elements
            elements = self.get_elements(apply_exclusions=False)
            if elements:
                # Get bounding box of all elements
                x_coords = []
                y_coords = []
                for elem in elements:
                    if hasattr(elem, "bbox") and elem.bbox:
                        x0, y0, x1, y1 = elem.bbox
                        x_coords.extend([x0, x1])
                        y_coords.extend([y0, y1])

                if x_coords and y_coords:
                    spec.crop_bbox = (min(x_coords), min(y_coords), max(x_coords), max(y_coords))
        elif crop is True:
            # Crop to full page (no-op, but included for consistency)
            spec.crop_bbox = (0, 0, self.width, self.height)

        # Add highlights in show mode
        if mode == "show":
            # Add page's persistent highlights if any
            page_highlights = self._highlighter.get_highlights_for_page(self.index)
            for highlight in page_highlights:
                spec.add_highlight(
                    bbox=highlight.bbox,
                    polygon=highlight.polygon,
                    color=highlight.color,
                    label=highlight.label,
                    element=None,  # Persistent highlights don't have element refs
                )

            # Add additional highlight groups if provided
            if highlights:
                for group in highlights:
                    elements = group.get("elements", [])
                    group_color = group.get("color", color)
                    group_label = group.get("label")

                    for elem in elements:
                        spec.add_highlight(element=elem, color=group_color, label=group_label)

            # Handle exclusions visualization
            exclusions_param = kwargs.get("exclusions")
            if exclusions_param:
                # Get exclusion regions
                exclusion_regions = self._get_exclusion_regions(include_callable=True)

                if exclusion_regions:
                    # Determine color for exclusions
                    exclusion_color = (
                        exclusions_param if isinstance(exclusions_param, str) else "red"
                    )

                    # Add exclusion regions as highlights
                    for region in exclusion_regions:
                        spec.add_highlight(
                            element=region,
                            color=exclusion_color,
                            label=f"Exclusion: {region.label or 'unnamed'}",
                        )

        return [spec]

    @property
    def pdf(self) -> "PDF":
        """Provides public access to the parent PDF object."""
        return self._parent

    @property
    def number(self) -> int:
        """Get page number (1-based)."""
        return self._page.page_number

    @property
    def page_number(self) -> int:
        """Get page number (1-based)."""
        return self._page.page_number

    @property
    def index(self) -> int:
        """Get page index (0-based)."""
        return self._index

    @property
    def width(self) -> float:
        """Get page width."""
        return self._page.width

    @property
    def height(self) -> float:
        """Get page height."""
        return self._page.height

    # --- Highlighting Service Accessor ---
    @property
    def _highlighter(self) -> "HighlightingService":
        """Provides access to the parent PDF's HighlightingService."""
        if not hasattr(self._parent, "highlighter"):
            # This should ideally not happen if PDF.__init__ works correctly
            raise AttributeError("Parent PDF object does not have a 'highlighter' attribute.")
        return self._parent.highlighter

    def clear_exclusions(self) -> "Page":
        """
        Clear all exclusions from the page.

        Returns:
            Self for method chaining.
        """
        self._exclusions = []
        return self

    @contextlib.contextmanager
    def without_exclusions(self):
        """
        Context manager that temporarily disables exclusion processing.

        This prevents infinite recursion when exclusion callables themselves
        use find() operations. While in this context, all find operations
        will skip exclusion filtering.

        Example:
            ```python
            # This exclusion would normally cause infinite recursion:
            page.add_exclusion(lambda p: p.find("text:contains('Header')").expand())

            # But internally, it's safe because we use:
            with page.without_exclusions():
                region = exclusion_callable(page)
            ```

        Yields:
            The page object with exclusions temporarily disabled.
        """
        old_value = self._computing_exclusions
        self._computing_exclusions = True
        try:
            yield self
        finally:
            self._computing_exclusions = old_value

    def add_exclusion(
        self,
        exclusion_func_or_region: Union[
            Callable[["Page"], "Region"], "Region", List[Any], Tuple[Any, ...], Any
        ],
        label: Optional[str] = None,
        method: str = "region",
    ) -> "Page":
        """
        Add an exclusion to the page. Text from these regions will be excluded from extraction.
        Ensures non-callable items are stored as Region objects if possible.

        Args:
            exclusion_func_or_region: Either a callable function returning a Region,
                                      a Region object, a list/tuple of regions or elements,
                                      or another object with a valid .bbox attribute.
            label: Optional label for this exclusion (e.g., 'header', 'footer').
            method: Exclusion method - 'region' (exclude all elements in bounding box) or
                    'element' (exclude only the specific elements). Default: 'region'.

        Returns:
            Self for method chaining

        Raises:
            TypeError: If a non-callable, non-Region object without a valid bbox is provided.
            ValueError: If method is not 'region' or 'element'.
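
        Example:
            A minimal sketch (the selector text, coordinates, and labels are
            illustrative):
            ```python
            # Exclude a fixed header band via a callable returning a Region
            page.add_exclusion(lambda p: p.region(0, 0, p.width, 50), label="header")

            # Exclude only the matching elements themselves, not their bounding area
            page.add_exclusion('text:contains("DRAFT")', method="element")
            ```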
        """
        # Validate method parameter
        if method not in ("region", "element"):
            raise ValueError(f"Invalid exclusion method '{method}'. Must be 'region' or 'element'.")

        # ------------------------------------------------------------------
        # NEW: Handle selector strings and ElementCollection instances
        # ------------------------------------------------------------------
        # If a user supplies a selector string (e.g. "text:bold") we resolve it
        # immediately *on this page* to the matching elements and turn each into
        # a Region object which is added to the internal exclusions list.
        #
        # Likewise, if an ElementCollection is passed we iterate over its
        # elements and create Regions for each one.
        # ------------------------------------------------------------------
        # Import ElementCollection from the new module path (old path removed)
        from natural_pdf.elements.element_collection import ElementCollection

        # Selector string ---------------------------------------------------
        if isinstance(exclusion_func_or_region, str):
            selector_str = exclusion_func_or_region
            matching_elements = self.find_all(selector_str, apply_exclusions=False)

            if not matching_elements:
                logger.warning(
                    f"Page {self.index}: Selector '{selector_str}' returned no elements – no exclusions added."
                )
            else:
                if method == "element":
                    # Store the actual elements for element-based exclusion
                    for el in matching_elements:
                        self._exclusions.append((el, label, method))
                        logger.debug(
                            f"Page {self.index}: Added element exclusion from selector '{selector_str}' -> {el}"
                        )
                else:  # method == "region"
                    for el in matching_elements:
                        try:
                            bbox_coords = (
                                float(el.x0),
                                float(el.top),
                                float(el.x1),
                                float(el.bottom),
                            )
                            region = Region(self, bbox_coords, label=label)
                            # Store directly as a Region tuple so we don't recurse endlessly
                            self._exclusions.append((region, label, method))
                            logger.debug(
                                f"Page {self.index}: Added exclusion region from selector '{selector_str}' -> {bbox_coords}"
                            )
                        except Exception as e:
                            # Re-raise so calling code/test sees the failure immediately
                            logger.error(
                                f"Page {self.index}: Failed to create exclusion region from element {el}: {e}",
                                exc_info=False,
                            )
                            raise
            # Invalidate ElementManager cache since exclusions affect element filtering
            if hasattr(self, "_element_mgr") and self._element_mgr:
                self._element_mgr.invalidate_cache()
            return self  # Completed processing for selector input

        # ElementCollection -----------------------------------------------
        if isinstance(exclusion_func_or_region, ElementCollection):
            if method == "element":
                # Store the actual elements for element-based exclusion
                for el in exclusion_func_or_region:
                    self._exclusions.append((el, label, method))
                    logger.debug(
                        f"Page {self.index}: Added element exclusion from ElementCollection -> {el}"
                    )
            else:  # method == "region"
                # Convert each element to a Region and add
                for el in exclusion_func_or_region:
                    try:
                        if not (hasattr(el, "bbox") and len(el.bbox) == 4):
                            logger.warning(
                                f"Page {self.index}: Skipping element without bbox in ElementCollection exclusion: {el}"
                            )
                            continue
                        bbox_coords = tuple(float(v) for v in el.bbox)
                        region = Region(self, bbox_coords, label=label)
                        self._exclusions.append((region, label, method))
                        logger.debug(
                            f"Page {self.index}: Added exclusion region from ElementCollection element {bbox_coords}"
                        )
                    except Exception as e:
                        logger.error(
                            f"Page {self.index}: Failed to convert ElementCollection element to Region: {e}",
                            exc_info=False,
                        )
                        raise
            # Invalidate ElementManager cache since exclusions affect element filtering
            if hasattr(self, "_element_mgr") and self._element_mgr:
                self._element_mgr.invalidate_cache()
            return self  # Completed processing for ElementCollection input

        # ------------------------------------------------------------------
        # Existing logic (callable, Region, bbox-bearing objects)
        # ------------------------------------------------------------------
        exclusion_data = None  # Initialize exclusion data

        if callable(exclusion_func_or_region):
            # Store callable functions along with their label and method
            exclusion_data = (exclusion_func_or_region, label, method)
            logger.debug(
                f"Page {self.index}: Added callable exclusion '{label}' with method '{method}': {exclusion_func_or_region}"
            )
        elif isinstance(exclusion_func_or_region, Region):
            # Store Region objects directly, assigning the label
            exclusion_func_or_region.label = label  # Assign label
            exclusion_data = (
                exclusion_func_or_region,
                label,
                method,
            )  # Store as tuple for consistency
            logger.debug(
                f"Page {self.index}: Added Region exclusion '{label}' with method '{method}': {exclusion_func_or_region}"
            )
        elif (
            hasattr(exclusion_func_or_region, "bbox")
            and isinstance(getattr(exclusion_func_or_region, "bbox", None), (tuple, list))
            and len(exclusion_func_or_region.bbox) == 4
        ):
            if method == "element":
                # For element method, store the element directly
                exclusion_data = (exclusion_func_or_region, label, method)
                logger.debug(
                    f"Page {self.index}: Added element exclusion '{label}': {exclusion_func_or_region}"
                )
            else:  # method == "region"
                # Convert objects with a valid bbox to a Region before storing
                try:
                    bbox_coords = tuple(float(v) for v in exclusion_func_or_region.bbox)
                    # Pass the label to the Region constructor
                    region_to_add = Region(self, bbox_coords, label=label)
                    exclusion_data = (region_to_add, label, method)  # Store as tuple
                    logger.debug(
                        f"Page {self.index}: Added exclusion '{label}' with method '{method}' converted to Region from {type(exclusion_func_or_region)}: {region_to_add}"
                    )
                except (ValueError, TypeError, Exception) as e:
                    # Raise an error if conversion fails
                    raise TypeError(
                        f"Failed to convert exclusion object {exclusion_func_or_region} with bbox {getattr(exclusion_func_or_region, 'bbox', 'N/A')} to Region: {e}"
                    ) from e
        elif isinstance(exclusion_func_or_region, (list, tuple)):
            # Handle lists/tuples of regions or elements
            if not exclusion_func_or_region:
                logger.warning(f"Page {self.index}: Empty list provided for exclusion, ignoring.")
                return self

            if method == "element":
                # Store each element directly
                for item in exclusion_func_or_region:
                    if hasattr(item, "bbox") and len(getattr(item, "bbox", [])) == 4:
                        self._exclusions.append((item, label, method))
                        logger.debug(
                            f"Page {self.index}: Added element exclusion from list -> {item}"
                        )
                    else:
                        logger.warning(
                            f"Page {self.index}: Skipping item without valid bbox in list: {item}"
                        )
            else:  # method == "region"
                # Convert each item to a Region and add
                for item in exclusion_func_or_region:
                    try:
                        if isinstance(item, Region):
                            item.label = label
                            self._exclusions.append((item, label, method))
                            logger.debug(f"Page {self.index}: Added Region from list: {item}")
                        elif hasattr(item, "bbox") and len(getattr(item, "bbox", [])) == 4:
                            bbox_coords = tuple(float(v) for v in item.bbox)
                            region = Region(self, bbox_coords, label=label)
                            self._exclusions.append((region, label, method))
                            logger.debug(
                                f"Page {self.index}: Added exclusion region from list item {bbox_coords}"
                            )
                        else:
                            logger.warning(
                                f"Page {self.index}: Skipping item without valid bbox in list: {item}"
                            )
                    except Exception as e:
                        logger.error(
                            f"Page {self.index}: Failed to convert list item to Region: {e}"
                        )
                        continue
            # Invalidate ElementManager cache since exclusions affect element filtering
            if hasattr(self, "_element_mgr") and self._element_mgr:
                self._element_mgr.invalidate_cache()
            return self
        else:
            # Reject invalid types
            raise TypeError(
                f"Invalid exclusion type: {type(exclusion_func_or_region)}. Must be callable, Region, list/tuple of regions/elements, or have a valid .bbox attribute."
            )

        # Append the stored data (tuple of object/callable, label, and method)
        if exclusion_data:
            self._exclusions.append(exclusion_data)

        # Invalidate ElementManager cache since exclusions affect element filtering
        if hasattr(self, "_element_mgr") and self._element_mgr:
            self._element_mgr.invalidate_cache()

        return self

    def add_region(self, region: "Region", name: Optional[str] = None) -> "Page":
        """
        Add a region to the page.

        Args:
            region: Region object to add
            name: Optional name for the region

        Returns:
            Self for method chaining
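
        Example:
            A minimal sketch (the coordinates are illustrative):
            ```python
            sidebar = page.region(0, 0, 150, page.height)
            page.add_region(sidebar, name="sidebar")
            # The region is registered with the element manager for selector queries
            ```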
        """
        # Check if it's actually a Region object
        if not isinstance(region, Region):
            raise TypeError("region must be a Region object")

        # Set the source and name
        region.source = "named"

        if name:
            region.name = name
            # Add to named regions dictionary (overwriting if name already exists)
            self._regions["named"][name] = region
        else:
            # Add to detected regions list (unnamed but registered)
            self._regions["detected"].append(region)

        # Add to element manager for selector queries
        self._element_mgr.add_region(region)

        return self

    def add_regions(self, regions: List["Region"], prefix: Optional[str] = None) -> "Page":
        """
        Add multiple regions to the page.

        Args:
            regions: List of Region objects to add
            prefix: Optional prefix for automatic naming (regions will be named prefix_1, prefix_2, etc.)

        Returns:
            Self for method chaining
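
        Example:
            A minimal sketch (assuming `detected` is a list of Region objects):
            ```python
            page.add_regions(detected, prefix="table")  # named table_1, table_2, ...
            ```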
        """
        if prefix:
            # Add with automatic sequential naming
            for i, region in enumerate(regions):
                self.add_region(region, name=f"{prefix}_{i+1}")
        else:
            # Add without names
            for region in regions:
                self.add_region(region)

        return self

    def _get_exclusion_regions(self, include_callable=True, debug=False) -> List["Region"]:
        """
        Get all exclusion regions for this page.
        Handles both region-based and element-based exclusions.
        Assumes self._exclusions contains tuples of (callable/Region/Element, label, method).

        Args:
            include_callable: Whether to evaluate callable exclusion functions
            debug: Enable verbose debug logging for exclusion evaluation

        Returns:
            List of Region objects to exclude, with labels assigned.
        """
        regions = []

        # Combine page-specific exclusions with PDF-level exclusions
        all_exclusions = list(self._exclusions)  # Start with page-specific

        # Add PDF-level exclusions if we have a parent PDF
        if hasattr(self, "_parent") and self._parent and hasattr(self._parent, "_exclusions"):
            # Get existing labels to check for duplicates
            existing_labels = set()
            for exc in all_exclusions:
                if len(exc) >= 2 and exc[1]:  # Has a label
                    existing_labels.add(exc[1])

            for pdf_exclusion in self._parent._exclusions:
                # Check if this exclusion label is already in our list (avoid duplicates)
                label = pdf_exclusion[1] if len(pdf_exclusion) >= 2 else None
                if label and label in existing_labels:
                    continue  # Skip this exclusion as it's already been applied

                # Ensure consistent format (PDF exclusions might be 2-tuples, need to be 3-tuples)
                if len(pdf_exclusion) == 2:
                    # Convert to 3-tuple format with default method
                    pdf_exclusion = (pdf_exclusion[0], pdf_exclusion[1], "region")
                all_exclusions.append(pdf_exclusion)

        if debug:
            print(
                f"\nPage {self.index}: Evaluating {len(all_exclusions)} exclusions ({len(self._exclusions)} page-specific, {len(all_exclusions) - len(self._exclusions)} from PDF)"
            )

        for i, exclusion_data in enumerate(all_exclusions):
            # Handle both old format (2-tuple) and new format (3-tuple) for backward compatibility
            if len(exclusion_data) == 2:
                # Old format: (exclusion_item, label)
                exclusion_item, label = exclusion_data
                method = "region"  # Default to region for old format
            else:
                # New format: (exclusion_item, label, method)
                exclusion_item, label, method = exclusion_data

            exclusion_label = label if label else f"exclusion {i}"

            # Process callable exclusion functions
            if callable(exclusion_item) and include_callable:
                try:
                    if debug:
                        print(f"  - Evaluating callable '{exclusion_label}'...")

                    # Use context manager to prevent infinite recursion
                    with self.without_exclusions():
                        # Call the function - Expects it to return a Region or None
                        region_result = exclusion_item(self)

                    if isinstance(region_result, Region):
                        # Assign the label to the returned region
                        region_result.label = label
                        regions.append(region_result)
                        if debug:
                            print(f"    ✓ Added region from callable '{label}': {region_result}")
                    elif hasattr(region_result, "__iter__") and hasattr(region_result, "__len__"):
                        # Handle ElementCollection or other iterables
                        from natural_pdf.elements.element_collection import ElementCollection

                        if isinstance(region_result, ElementCollection) or (
                            hasattr(region_result, "__iter__") and region_result
                        ):
                            if debug:
                                print(
                                    f"    Converting {type(region_result)} with {len(region_result)} elements to regions..."
                                )

                            # Convert each element to a region
                            for elem in region_result:
                                try:
                                    if hasattr(elem, "bbox") and len(elem.bbox) == 4:
                                        bbox_coords = tuple(float(v) for v in elem.bbox)
                                        region = Region(self, bbox_coords, label=label)
                                        regions.append(region)
                                        if debug:
                                            print(
                                                f"      ✓ Added region from element: {bbox_coords}"
                                            )
                                    else:
                                        if debug:
                                            print(
                                                f"      ✗ Skipping element without valid bbox: {elem}"
                                            )
                                except Exception as e:
                                    if debug:
                                        print(f"      ✗ Failed to convert element to region: {e}")
                                    continue

                            if debug and len(region_result) > 0:
                                print(
                                    f"    ✓ Converted {len(region_result)} elements from callable '{label}'"
                                )
                        else:
                            if debug:
                                print(f"    ✗ Empty iterable returned from callable '{label}'")
                    elif region_result:
                        # Check if it's a single Element that can be converted to a Region
                        from natural_pdf.elements.base import Element

                        if isinstance(region_result, Element) or (
                            hasattr(region_result, "bbox") and hasattr(region_result, "expand")
                        ):
                            try:
                                # Convert Element to Region using expand()
                                expanded_region = region_result.expand()
                                if isinstance(expanded_region, Region):
                                    expanded_region.label = label
                                    regions.append(expanded_region)
                                    if debug:
                                        print(
                                            f"    ✓ Converted Element to Region from callable '{label}': {expanded_region}"
                                        )
                                else:
                                    if debug:
                                        print(
                                            f"    ✗ Element.expand() did not return a Region: {type(expanded_region)}"
                                        )
                            except Exception as e:
                                if debug:
                                    print(f"    ✗ Failed to convert Element to Region: {e}")
                        else:
                            logger.warning(
                                f"Callable exclusion '{exclusion_label}' returned non-Region object: {type(region_result)}. Skipping."
                            )
                            if debug:
                                print(
                                    f"    ✗ Callable returned non-Region/None: {type(region_result)}"
                                )
                    else:
                        if debug:
                            print(
                                f"    ✗ Callable '{exclusion_label}' returned None, no region added"
                            )

                except Exception as e:
                    error_msg = f"Error evaluating callable exclusion '{exclusion_label}' for page {self.index}: {e}"
                    print(error_msg)
                    import traceback

                    print(f"    Traceback: {traceback.format_exc().splitlines()[-3:]}")

            # Process direct Region objects (label was assigned in add_exclusion)
            elif isinstance(exclusion_item, Region):
                regions.append(exclusion_item)  # Label is already on the Region object
                if debug:
                    print(f"  - Added direct region '{label}': {exclusion_item}")

            # Process direct Element objects - only convert to Region if method is "region"
            elif hasattr(exclusion_item, "bbox") and hasattr(exclusion_item, "expand"):
                if method == "region":
                    try:
                        # Convert Element to Region using expand()
                        expanded_region = exclusion_item.expand()
                        if isinstance(expanded_region, Region):
                            expanded_region.label = label
                            regions.append(expanded_region)
                            if debug:
                                print(
                                    f"  - Converted direct Element to Region '{label}': {expanded_region}"
                                )
                        else:
                            if debug:
                                print(
                                    f"  - Element.expand() did not return a Region: {type(expanded_region)}"
                                )
                    except Exception as e:
                        if debug:
                            print(f"  - Failed to convert Element to Region: {e}")
                else:
                    # method == "element" - will be handled in _filter_elements_by_exclusions
                    if debug:
                        print(
                            f"  - Skipping element '{label}' (will be handled as element-based exclusion)"
                        )

            # Process string selectors (from PDF-level exclusions)
            elif isinstance(exclusion_item, str):
                selector_str = exclusion_item
                matching_elements = self.find_all(selector_str, apply_exclusions=False)

                if debug:
                    print(
                        f"  - Evaluating selector '{exclusion_label}': found {len(matching_elements)} elements"
                    )

                if method == "region":
                    # Convert each matching element to a region
                    for el in matching_elements:
                        try:
                            bbox_coords = (
                                float(el.x0),
                                float(el.top),
                                float(el.x1),
                                float(el.bottom),
                            )
                            region = Region(self, bbox_coords, label=label)
                            regions.append(region)
                            if debug:
                                print(f"    ✓ Added region from selector match: {bbox_coords}")
                        except Exception as e:
                            if debug:
                                print(f"    ✗ Failed to create region from element: {e}")
                # If method is "element", it will be handled in _filter_elements_by_exclusions

            # Element-based exclusions are not converted to regions here
            # They will be handled separately in _filter_elements_by_exclusions

        if debug:
            print(f"Page {self.index}: Found {len(regions)} valid exclusion regions to apply")

        return regions

    def _filter_elements_by_exclusions(
        self, elements: List["Element"], debug_exclusions: bool = False
    ) -> List["Element"]:
        """
        Filters a list of elements, removing any that match the page's exclusion rules.
        Handles both region-based exclusions (exclude all in area) and
        element-based exclusions (exclude only specific elements).

        Args:
            elements: The list of elements to filter.
            debug_exclusions: Whether to output detailed exclusion debugging info (default: False).

        Returns:
            A new list containing only the elements not excluded.
        """
        # Skip exclusion filtering if we're currently computing exclusions
        # This prevents infinite recursion when exclusion callables use find operations
        if self._computing_exclusions:
            return elements

        # Check both page-level and PDF-level exclusions
        has_page_exclusions = bool(self._exclusions)
        has_pdf_exclusions = (
            hasattr(self, "_parent")
            and self._parent
            and hasattr(self._parent, "_exclusions")
            and bool(self._parent._exclusions)
        )

        if not has_page_exclusions and not has_pdf_exclusions:
            if debug_exclusions:
                print(
                    f"Page {self.index}: No exclusions defined, returning all {len(elements)} elements."
                )
            return elements

        # Get all exclusion regions, including evaluating callable functions
        exclusion_regions = self._get_exclusion_regions(
            include_callable=True, debug=debug_exclusions
        )

        # Collect element-based exclusions
        # Store element bboxes for comparison instead of object ids
        excluded_element_bboxes = set()  # Use set for O(1) lookup

        # Process both page-level and PDF-level exclusions
        all_exclusions = list(self._exclusions) if has_page_exclusions else []
        if has_pdf_exclusions:
            all_exclusions.extend(self._parent._exclusions)

        for exclusion_data in all_exclusions:
            # Handle both old format (2-tuple) and new format (3-tuple)
            if len(exclusion_data) == 2:
                exclusion_item, label = exclusion_data
                method = "region"
            else:
                exclusion_item, label, method = exclusion_data

            # Skip callables (already handled in _get_exclusion_regions)
            if callable(exclusion_item):
                continue

            # Skip regions (already in exclusion_regions)
            if isinstance(exclusion_item, Region):
                continue

            # Handle string selectors for element-based exclusions
            if isinstance(exclusion_item, str) and method == "element":
                selector_str = exclusion_item
                matching_elements = self.find_all(selector_str, apply_exclusions=False)
                for el in matching_elements:
                    if hasattr(el, "bbox"):
                        bbox = tuple(el.bbox)
                        excluded_element_bboxes.add(bbox)
                        if debug_exclusions:
                            print(
                                f"  - Added element exclusion from selector '{selector_str}': {bbox}"
                            )

            # Handle element-based exclusions
            elif method == "element" and hasattr(exclusion_item, "bbox"):
                # Store bbox tuple for comparison
                bbox = tuple(exclusion_item.bbox)
                excluded_element_bboxes.add(bbox)
                if debug_exclusions:
                    print(f"  - Added element exclusion with bbox {bbox}: {exclusion_item}")

        if debug_exclusions:
            print(
                f"Page {self.index}: Applying {len(exclusion_regions)} region exclusions "
                f"and {len(excluded_element_bboxes)} element exclusions to {len(elements)} elements."
            )

        filtered_elements = []
        region_excluded_count = 0
        element_excluded_count = 0

        for element in elements:
            exclude = False

            # Check element-based exclusions first (faster)
            if hasattr(element, "bbox") and tuple(element.bbox) in excluded_element_bboxes:
                exclude = True
                element_excluded_count += 1
                if debug_exclusions:
                    print(f"    Element {element} excluded by element-based rule")
            else:
                # Check region-based exclusions
                for region in exclusion_regions:
                    # Use the region's method to check if the element is inside
                    if region._is_element_in_region(element):
                        exclude = True
                        region_excluded_count += 1
                        if debug_exclusions:
                            print(f"    Element {element} excluded by region {region}")
                        break  # No need to check other regions for this element

            if not exclude:
                filtered_elements.append(element)

        if debug_exclusions:
            print(
                f"Page {self.index}: Excluded {region_excluded_count} by regions, "
                f"{element_excluded_count} by elements, keeping {len(filtered_elements)}."
            )

        return filtered_elements

    @overload
    def find(
        self,
        *,
        text: str,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional[Any]: ...

    @overload
    def find(
        self,
        selector: str,
        *,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional[Any]: ...

    def find(
        self,
        selector: Optional[str] = None,  # Now optional
        *,  # Force subsequent args to be keyword-only
        text: Optional[str] = None,  # New text parameter
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional[Any]:
        """
        Find first element on this page matching selector OR text content.

        Provide EITHER `selector` OR `text`, but not both.

        Args:
            selector: CSS-like selector string.
            text: Text content to search for (equivalent to 'text:contains(...)').
            apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
            regex: Whether to use regex for text search (`selector` or `text`) (default: False).
            case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
            **kwargs: Additional filter parameters.

        Returns:
            Element object or None if not found.
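
        Example:
            Two equivalent calls (a sketch; the search text is illustrative):
            ```python
            el = page.find('text:contains("Total")')
            el = page.find(text="Total")  # shortcut for the selector above
            ```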
        """
        if selector is not None and text is not None:
            raise ValueError("Provide either 'selector' or 'text', not both.")
        if selector is None and text is None:
            raise ValueError("Provide either 'selector' or 'text'.")

        # Construct selector if 'text' is provided
        effective_selector = ""
        if text is not None:
            # Escape quotes within the text for the selector string
            escaped_text = text.replace('"', '\\"').replace("'", "\\'")
            # Default to 'text:contains(...)'
            effective_selector = f'text:contains("{escaped_text}")'
            # Note: regex/case handled by kwargs passed down
            logger.debug(
                f"Using text shortcut: find(text='{text}') -> find('{effective_selector}')"
            )
        elif selector is not None:
            effective_selector = selector
        else:
            # Should be unreachable due to checks above
            raise ValueError("Internal error: No selector or text provided.")

        selector_obj = parse_selector(effective_selector)

        # Pass regex and case flags to selector function via kwargs
        kwargs["regex"] = regex
        kwargs["case"] = case

        # Get all matching elements; _apply_selector itself does not apply exclusions
        results_collection = self._apply_selector(selector_obj, **kwargs)

        # Filter the results based on exclusions if requested
        if apply_exclusions and results_collection:
            filtered_elements = self._filter_elements_by_exclusions(results_collection.elements)
            # Return the first element from the filtered list
            return filtered_elements[0] if filtered_elements else None
        elif results_collection:
            # Return the first element from the unfiltered results
            return results_collection.first
        else:
            return None

    @overload
    def find_all(
        self,
        *,
        text: str,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection": ...

    @overload
    def find_all(
        self,
        selector: str,
        *,
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection": ...

    def find_all(
        self,
        selector: Optional[str] = None,  # Now optional
        *,  # Force subsequent args to be keyword-only
        text: Optional[str] = None,  # New text parameter
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection":
        """
        Find all elements on this page matching selector OR text content.

        Provide EITHER `selector` OR `text`, but not both.

        Args:
            selector: CSS-like selector string.
            text: Text content to search for (equivalent to 'text:contains(...)').
            apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
            regex: Whether to use regex for text search (`selector` or `text`) (default: False).
            case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
            **kwargs: Additional filter parameters.

        Returns:
            ElementCollection with matching elements.
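
        Example:
            Illustrative sketch; assumes the page contains matching text and
            TATR-detected table regions:

            >>> totals = page.find_all(text="Total", case=False)
            >>> tables = page.find_all("region[type=table][model=tatr]")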
        """
        from natural_pdf.elements.element_collection import (  # Import here for type hint
            ElementCollection,
        )

        if selector is not None and text is not None:
            raise ValueError("Provide either 'selector' or 'text', not both.")
        if selector is None and text is None:
            raise ValueError("Provide either 'selector' or 'text'.")

        # Construct selector if 'text' is provided
        effective_selector = ""
        if text is not None:
            # Escape quotes within the text for the selector string
            escaped_text = text.replace('"', '\\"').replace("'", "\\'")
            # Default to 'text:contains(...)'
            effective_selector = f'text:contains("{escaped_text}")'
            logger.debug(
                f"Using text shortcut: find_all(text='{text}') -> find_all('{effective_selector}')"
            )
        elif selector is not None:
            effective_selector = selector
        else:
            # Should be unreachable due to checks above
            raise ValueError("Internal error: No selector or text provided.")

        selector_obj = parse_selector(effective_selector)

        # Pass regex and case flags to selector function via kwargs
        kwargs["regex"] = regex
        kwargs["case"] = case

        # Get all matching elements; _apply_selector itself does not apply exclusions
        results_collection = self._apply_selector(selector_obj, **kwargs)

        # Filter the results based on exclusions if requested
        if apply_exclusions and results_collection:
            filtered_elements = self._filter_elements_by_exclusions(results_collection.elements)
            return ElementCollection(filtered_elements)
        else:
            # Return the unfiltered collection
            return results_collection

    def _apply_selector(
        self, selector_obj: Dict, **kwargs
    ) -> "ElementCollection":  # Removed apply_exclusions arg
        """
        Apply selector to page elements.
        Exclusions are now handled by the calling methods (find, find_all) if requested.

        Args:
            selector_obj: Parsed selector dictionary (single or compound OR selector)
            **kwargs: Additional filter parameters including 'regex' and 'case'

        Returns:
            ElementCollection of matching elements (unfiltered by exclusions)
        """
        from natural_pdf.selectors.parser import _calculate_aggregates, selector_to_filter_func

        # Handle compound OR selectors
        if selector_obj.get("type") == "or":
            # For OR selectors, search all elements and let the filter function decide
            elements_to_search = self._element_mgr.get_all_elements()

            # Check if any sub-selector contains aggregate functions
            has_aggregates = False
            for sub_selector in selector_obj.get("selectors", []):
                for attr in sub_selector.get("attributes", []):
                    value = attr.get("value")
                    if isinstance(value, dict) and value.get("type") == "aggregate":
                        has_aggregates = True
                        break
                if has_aggregates:
                    break

            # Calculate aggregates if needed - for OR selectors we calculate on ALL elements
            aggregates = {}
            if has_aggregates:
                # Need to calculate aggregates for each sub-selector type
                for sub_selector in selector_obj.get("selectors", []):
                    sub_type = sub_selector.get("type", "any").lower()
                    if sub_type == "text":
                        sub_elements = self._element_mgr.words
                    elif sub_type == "rect":
                        sub_elements = self._element_mgr.rects
                    elif sub_type == "line":
                        sub_elements = self._element_mgr.lines
                    elif sub_type == "region":
                        sub_elements = self._element_mgr.regions
                    else:
                        sub_elements = elements_to_search

                    sub_aggregates = _calculate_aggregates(sub_elements, sub_selector)
                    aggregates.update(sub_aggregates)

            # Create filter function from compound selector
            filter_func = selector_to_filter_func(selector_obj, aggregates=aggregates, **kwargs)

            # Apply the filter to all elements
            matching_elements = [element for element in elements_to_search if filter_func(element)]

            # Sort elements in reading order if requested
            if kwargs.get("reading_order", True):
                if all(hasattr(el, "top") and hasattr(el, "x0") for el in matching_elements):
                    matching_elements.sort(key=lambda el: (el.top, el.x0))
                else:
                    logger.warning(
                        "Cannot sort elements in reading order: Missing required attributes (top, x0)."
                    )

            # Handle collection-level pseudo-classes (:first, :last) for OR selectors
            # Note: We only apply :first/:last if they appear in any of the sub-selectors
            has_first = False
            has_last = False
            for sub_selector in selector_obj.get("selectors", []):
                for pseudo in sub_selector.get("pseudo_classes", []):
                    if pseudo.get("name") == "first":
                        has_first = True
                    elif pseudo.get("name") == "last":
                        has_last = True

            if has_first:
                matching_elements = matching_elements[:1] if matching_elements else []
            elif has_last:
                matching_elements = matching_elements[-1:] if matching_elements else []

            # Return result collection
            return ElementCollection(matching_elements)

        # Handle single selectors (existing logic)
        # Get element type to filter
        element_type = selector_obj.get("type", "any").lower()

        # Determine which elements to search based on element type
        elements_to_search = []
        if element_type == "any":
            elements_to_search = self._element_mgr.get_all_elements()
        elif element_type == "text":
            elements_to_search = self._element_mgr.words
        elif element_type == "char":
            elements_to_search = self._element_mgr.chars
        elif element_type == "word":
            elements_to_search = self._element_mgr.words
        elif element_type == "rect" or element_type == "rectangle":
            elements_to_search = self._element_mgr.rects
        elif element_type == "line":
            elements_to_search = self._element_mgr.lines
        elif element_type == "region":
            elements_to_search = self._element_mgr.regions
        else:
            elements_to_search = self._element_mgr.get_all_elements()

        # Check if selector contains aggregate functions
        has_aggregates = False
        for attr in selector_obj.get("attributes", []):
            value = attr.get("value")
            if isinstance(value, dict) and value.get("type") == "aggregate":
                has_aggregates = True
                break

        # Calculate aggregates if needed
        aggregates = {}
        if has_aggregates:
            # For aggregates, we need to calculate based on ALL elements of the same type
            # not just the filtered subset
            aggregates = _calculate_aggregates(elements_to_search, selector_obj)

        # Create filter function from selector, passing any additional parameters
        filter_func = selector_to_filter_func(selector_obj, aggregates=aggregates, **kwargs)

        # Apply the filter to matching elements
        matching_elements = [element for element in elements_to_search if filter_func(element)]

        # Handle spatial pseudo-classes that require relationship checking
        for pseudo in selector_obj.get("pseudo_classes", []):
            name = pseudo.get("name")
            args = pseudo.get("args", "")

            if name in ("above", "below", "near", "left-of", "right-of"):
                # Find the reference element first
                from natural_pdf.selectors.parser import parse_selector

                ref_selector = parse_selector(args) if isinstance(args, str) else args
                # Recursively call _apply_selector for reference element (exclusions handled later)
                ref_elements = self._apply_selector(ref_selector, **kwargs)

                if not ref_elements:
                    return ElementCollection([])

                ref_element = ref_elements.first
                if not ref_element:
                    continue

                # Filter elements based on spatial relationship
                if name == "above":
                    matching_elements = [
                        el
                        for el in matching_elements
                        if hasattr(el, "bottom")
                        and hasattr(ref_element, "top")
                        and el.bottom <= ref_element.top
                    ]
                elif name == "below":
                    matching_elements = [
                        el
                        for el in matching_elements
                        if hasattr(el, "top")
                        and hasattr(ref_element, "bottom")
                        and el.top >= ref_element.bottom
                    ]
                elif name == "left-of":
                    matching_elements = [
                        el
                        for el in matching_elements
                        if hasattr(el, "x1")
                        and hasattr(ref_element, "x0")
                        and el.x1 <= ref_element.x0
                    ]
                elif name == "right-of":
                    matching_elements = [
                        el
                        for el in matching_elements
                        if hasattr(el, "x0")
                        and hasattr(ref_element, "x1")
                        and el.x0 >= ref_element.x1
                    ]
                elif name == "near":

                    def distance(el1, el2):
                        if not (
                            hasattr(el1, "x0")
                            and hasattr(el1, "x1")
                            and hasattr(el1, "top")
                            and hasattr(el1, "bottom")
                            and hasattr(el2, "x0")
                            and hasattr(el2, "x1")
                            and hasattr(el2, "top")
                            and hasattr(el2, "bottom")
                        ):
                            return float("inf")  # Cannot calculate distance
                        el1_center_x = (el1.x0 + el1.x1) / 2
                        el1_center_y = (el1.top + el1.bottom) / 2
                        el2_center_x = (el2.x0 + el2.x1) / 2
                        el2_center_y = (el2.top + el2.bottom) / 2
                        return (
                            (el1_center_x - el2_center_x) ** 2 + (el1_center_y - el2_center_y) ** 2
                        ) ** 0.5

                    threshold = kwargs.get("near_threshold", 50)
                    matching_elements = [
                        el for el in matching_elements if distance(el, ref_element) <= threshold
                    ]

        # Sort elements in reading order if requested
        if kwargs.get("reading_order", True):
            if all(hasattr(el, "top") and hasattr(el, "x0") for el in matching_elements):
                matching_elements.sort(key=lambda el: (el.top, el.x0))
            else:
                logger.warning(
                    "Cannot sort elements in reading order: Missing required attributes (top, x0)."
                )

        # Handle :closest pseudo-class for fuzzy text matching
        for pseudo in selector_obj.get("pseudo_classes", []):
            name = pseudo.get("name")
            if name == "closest" and pseudo.get("args") is not None:
                import difflib

                # Parse search text and threshold
                search_text = str(pseudo["args"]).strip()
                threshold = 0.0  # Default threshold

                # Handle empty search text
                if not search_text:
                    matching_elements = []
                    break

                # Check if threshold is specified with @ separator
                if "@" in search_text and search_text.count("@") == 1:
                    text_part, threshold_part = search_text.rsplit("@", 1)
                    try:
                        threshold = float(threshold_part)
                        search_text = text_part.strip()
                    except (ValueError, TypeError):
                        pass  # Keep original search_text and default threshold

                # Determine case sensitivity
                ignore_case = not kwargs.get("case", False)

                # Calculate similarity scores for all elements
                scored_elements = []

                for el in matching_elements:
                    if hasattr(el, "text") and el.text:
                        el_text = el.text.strip()
                        search_term = search_text

                        if ignore_case:
                            el_text = el_text.lower()
                            search_term = search_term.lower()

                        # Calculate similarity ratio
                        ratio = difflib.SequenceMatcher(None, search_term, el_text).ratio()

                        # Check if element contains the search term as substring
                        contains_match = search_term in el_text

                        # Store element with its similarity score and contains flag
                        if ratio >= threshold:
                            scored_elements.append((ratio, contains_match, el))

                # Sort by:
                # 1. Contains match (True before False)
                # 2. Similarity score (highest first)
                # This ensures substring matches come first but are sorted by similarity
                scored_elements.sort(key=lambda x: (x[1], x[0]), reverse=True)

                # Extract just the elements
                matching_elements = [el for _, _, el in scored_elements]
                break  # Only process the first :closest pseudo-class

        # Handle collection-level pseudo-classes (:first, :last)
        for pseudo in selector_obj.get("pseudo_classes", []):
            name = pseudo.get("name")

            if name == "first":
                matching_elements = matching_elements[:1] if matching_elements else []
            elif name == "last":
                matching_elements = matching_elements[-1:] if matching_elements else []

        # Create result collection - exclusions are handled by the calling methods (find, find_all)
        result = ElementCollection(matching_elements)

        return result

    def create_region(self, x0: float, top: float, x1: float, bottom: float) -> Any:
        """
        Create a region on this page with the specified coordinates.

        Args:
            x0: Left x-coordinate
            top: Top y-coordinate
            x1: Right x-coordinate
            bottom: Bottom y-coordinate

        Returns:
            Region object for the specified coordinates
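
        Example:
            >>> # Illustrative: a region covering the left half of the page
            >>> left_half = page.create_region(0, 0, page.width / 2, page.height)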
        """
        from natural_pdf.elements.region import Region

        return Region(self, (x0, top, x1, bottom))

    def region(
        self,
        left: Optional[float] = None,
        top: Optional[float] = None,
        right: Optional[float] = None,
        bottom: Optional[float] = None,
        width: Union[str, float, None] = None,
        height: Optional[float] = None,
    ) -> Any:
        """
        Create a region on this page with more intuitive named parameters,
        allowing definition by coordinates or by coordinate + dimension.

        Args:
            left: Left x-coordinate (default: 0 if width not used).
            top: Top y-coordinate (default: 0 if height not used).
            right: Right x-coordinate (default: page width if width not used).
            bottom: Bottom y-coordinate (default: page height if height not used).
            width: Width definition. Can be:
                   - Numeric: The width of the region in points. Cannot be used with both left and right.
                   - String 'full': Sets region width to full page width (overrides left/right).
                   - String 'element' or None (default): Uses provided/calculated left/right,
                     defaulting to page width if neither are specified.
            height: Numeric height of the region. Cannot be used with both top and bottom.

        Returns:
            Region object for the specified coordinates

        Raises:
            ValueError: If conflicting arguments are provided (e.g., top, bottom, and height)
                      or if width is an invalid string.

        Examples:
            >>> page.region(top=100, height=50)  # Region from y=100 to y=150, default width
            >>> page.region(left=50, width=100)   # Region from x=50 to x=150, default height
            >>> page.region(bottom=500, height=50) # Region from y=450 to y=500
            >>> page.region(right=200, width=50)  # Region from x=150 to x=200
            >>> page.region(top=100, bottom=200, width="full") # Explicit full width
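            >>> page.region(left="10%", width="50%") # Percentage strings are converted to points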
        """
        # ------------------------------------------------------------------
        # Percentage support – convert strings like "30%" to absolute values
        # based on page dimensions.  X-axis params (left, right, width) use
        # page.width; Y-axis params (top, bottom, height) use page.height.
        # ------------------------------------------------------------------

        def _pct_to_abs(val, axis: str):
            if isinstance(val, str) and val.strip().endswith("%"):
                try:
                    pct = float(val.strip()[:-1]) / 100.0
                except ValueError:
                    return val  # leave unchanged if not a number
                return pct * (self.width if axis == "x" else self.height)
            return val

        left = _pct_to_abs(left, "x")
        right = _pct_to_abs(right, "x")
        width = _pct_to_abs(width, "x")
        top = _pct_to_abs(top, "y")
        bottom = _pct_to_abs(bottom, "y")
        height = _pct_to_abs(height, "y")

        # --- Type checking and basic validation ---
        is_width_numeric = isinstance(width, (int, float))
        is_width_string = isinstance(width, str)
        width_mode = "element"  # Default mode

        if height is not None and top is not None and bottom is not None:
            raise ValueError("Cannot specify top, bottom, and height simultaneously.")
        if is_width_numeric and left is not None and right is not None:
            raise ValueError("Cannot specify left, right, and a numeric width simultaneously.")
        if is_width_string:
            width_lower = width.lower()
            if width_lower not in ["full", "element"]:
                raise ValueError("String width argument must be 'full' or 'element'.")
            width_mode = width_lower

        # --- Calculate Coordinates ---
        final_top = top
        final_bottom = bottom
        final_left = left
        final_right = right

        # Height calculations
        if height is not None:
            if top is not None:
                final_bottom = top + height
            elif bottom is not None:
                final_top = bottom - height
            else:  # Neither top nor bottom provided, default top to 0
                final_top = 0
                final_bottom = height

        # Width calculations (numeric only)
        if is_width_numeric:
            if left is not None:
                final_right = left + width
            elif right is not None:
                final_left = right - width
            else:  # Neither left nor right provided, default left to 0
                final_left = 0
                final_right = width

        # --- Apply Defaults for Unset Coordinates ---
        # Only default coordinates if they weren't set by dimension calculation
        if final_top is None:
            final_top = 0
        if final_bottom is None:
            # Check if bottom should have been set by height calc
            if height is None or top is None:
                final_bottom = self.height

        if final_left is None:
            final_left = 0
        if final_right is None:
            # Check if right should have been set by width calc
            if not is_width_numeric or left is None:
                final_right = self.width

        # --- Handle width_mode == 'full' ---
        if width_mode == "full":
            # Override left/right if mode is full
            final_left = 0
            final_right = self.width

        # --- Final Validation & Creation ---
        # Ensure coordinates are within page bounds (clamp)
        final_left = max(0, final_left)
        final_top = max(0, final_top)
        final_right = min(self.width, final_right)
        final_bottom = min(self.height, final_bottom)

        # Ensure valid box (x0<=x1, top<=bottom)
        if final_left > final_right:
            logger.warning(f"Calculated left ({final_left}) > right ({final_right}). Swapping.")
            final_left, final_right = final_right, final_left
        if final_top > final_bottom:
            logger.warning(f"Calculated top ({final_top}) > bottom ({final_bottom}). Swapping.")
            final_top, final_bottom = final_bottom, final_top

        from natural_pdf.elements.region import Region

        region = Region(self, (final_left, final_top, final_right, final_bottom))
        return region

    def get_elements(
        self, apply_exclusions=True, debug_exclusions: bool = False
    ) -> List["Element"]:
        """
        Get all elements on this page.

        Args:
            apply_exclusions: Whether to apply exclusion regions (default: True).
            debug_exclusions: Whether to output detailed exclusion debugging info (default: False).

        Returns:
            List of all elements on the page, potentially filtered by exclusions.
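
        Example:
            >>> elements = page.get_elements()                          # exclusions applied
            >>> everything = page.get_elements(apply_exclusions=False)  # raw element list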
        """
        # Get all elements from the element manager
        all_elements = self._element_mgr.get_all_elements()

        # Apply exclusions if requested
        if apply_exclusions:
            return self._filter_elements_by_exclusions(
                all_elements, debug_exclusions=debug_exclusions
            )
        else:
            if debug_exclusions:
                print(
                    f"Page {self.index}: get_elements returning all {len(all_elements)} elements (exclusions not applied)."
                )
            return all_elements

    def filter_elements(
        self, elements: List["Element"], selector: str, **kwargs
    ) -> List["Element"]:
        """
        Filter a list of elements based on a selector.

        Args:
            elements: List of elements to filter
            selector: CSS-like selector string
            **kwargs: Additional filter parameters

        Returns:
            List of elements that match the selector
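
        Example:
            Illustrative sketch using the page's word elements:

            >>> totals = page.filter_elements(page.words, 'text:contains("Total")')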
        """
        from natural_pdf.selectors.parser import parse_selector, selector_to_filter_func

        # Parse the selector
        selector_obj = parse_selector(selector)

        # Create filter function from selector
        filter_func = selector_to_filter_func(selector_obj, **kwargs)

        # Apply the filter to the elements
        matching_elements = [element for element in elements if filter_func(element)]

        # Sort elements in reading order if requested
        if kwargs.get("reading_order", True):
            if all(hasattr(el, "top") and hasattr(el, "x0") for el in matching_elements):
                matching_elements.sort(key=lambda el: (el.top, el.x0))
            else:
                logger.warning(
                    "Cannot sort elements in reading order: Missing required attributes (top, x0)."
                )

        return matching_elements

    def until(self, selector: str, include_endpoint: bool = True, **kwargs) -> Any:
        """
        Select content from the top of the page until matching selector.

        Args:
            selector: CSS-like selector string
            include_endpoint: Whether to include the endpoint element in the region
            **kwargs: Additional selection parameters

        Returns:
            Region object representing the selected content

        Examples:
            >>> page.until('text:contains("Conclusion")')  # Select from top to conclusion
            >>> page.until('line[width>=2]', include_endpoint=False)  # Select up to thick line
        """
        # Find the target element
        target = self.find(selector, **kwargs)
        if not target:
            # If target not found, return a default region (full page)
            from natural_pdf.elements.region import Region

            return Region(self, (0, 0, self.width, self.height))

        # Create a region from the top of the page to the target
        from natural_pdf.elements.region import Region

        # Ensure target has positional attributes before using them
        target_top = getattr(target, "top", 0)
        target_bottom = getattr(target, "bottom", self.height)

        if include_endpoint:
            # Include the target element
            region = Region(self, (0, 0, self.width, target_bottom))
        else:
            # Up to the target element
            region = Region(self, (0, 0, self.width, target_top))

        region.end_element = target
        return region

    def crop(self, bbox=None, **kwargs) -> Any:
        """
        Crop the page to the specified bounding box.

        This is a direct wrapper around pdfplumber's crop method.

        Args:
            bbox: Bounding box (x0, top, x1, bottom) or None
            **kwargs: Additional parameters (top, bottom, left, right)

        Returns:
            Cropped page object (pdfplumber.Page)
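
        Example:
            >>> cropped = page.crop((0, 0, page.width, page.height / 2))  # top half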
        """
        # Returns the pdfplumber page object, not a natural-pdf Page
        return self._page.crop(bbox, **kwargs)

    def extract_text(
        self,
        preserve_whitespace=True,
        use_exclusions=True,
        debug_exclusions=False,
        content_filter=None,
        **kwargs,
    ) -> str:
        """
        Extract text from this page, respecting exclusions and using pdfplumber's
        layout engine (chars_to_textmap) when layout arguments are provided or
        left at their defaults.

        Args:
            use_exclusions: Whether to apply exclusion regions (default: True).
                          When True, both element-based and region-based exclusions are applied.
            debug_exclusions: Whether to output detailed exclusion debugging info (default: False).
            content_filter: Optional content filter to exclude specific text patterns. Can be:
                - A regex pattern string (characters matching the pattern are EXCLUDED)
                - A callable that takes text and returns True to KEEP the character
                - A list of regex patterns (characters matching ANY pattern are EXCLUDED)
            **kwargs: Additional layout parameters passed directly to pdfplumber's
                      `chars_to_textmap` function. Common parameters include:
                      - layout (bool): If True (default), inserts spaces/newlines.
                      - x_density (float): Pixels per character horizontally.
                      - y_density (float): Pixels per line vertically.
                      - x_tolerance (float): Tolerance for horizontal character grouping.
                      - y_tolerance (float): Tolerance for vertical character grouping.
                      - line_dir (str): 'ttb', 'btt', 'ltr', 'rtl'
                      - char_dir (str): 'ttb', 'btt', 'ltr', 'rtl'
                      See pdfplumber documentation for more.

        Returns:
            Extracted text as string, potentially with layout-based spacing.
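
        Example:
            >>> text = page.extract_text()              # layout-aware extraction
            >>> raw = page.extract_text(layout=False)   # no layout-based spacing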
        """
        logger.debug(f"Page {self.number}: extract_text called with kwargs: {kwargs}")
        debug = kwargs.get("debug", debug_exclusions)  # Allow 'debug' kwarg

        # 1. Get Word Elements (triggers load_elements if needed)
        word_elements = self.words
        if not word_elements:
            logger.debug(f"Page {self.number}: No word elements found.")
            return ""

        # 2. Apply element-based exclusions if enabled
        # Check both page-level and PDF-level exclusions
        has_exclusions = bool(self._exclusions) or (
            hasattr(self, "_parent")
            and self._parent
            and hasattr(self._parent, "_exclusions")
            and self._parent._exclusions
        )
        if use_exclusions and has_exclusions:
            # Filter word elements through _filter_elements_by_exclusions
            # This handles both element-based and region-based exclusions
            word_elements = self._filter_elements_by_exclusions(
                word_elements, debug_exclusions=debug
            )
            if debug:
                logger.debug(
                    f"Page {self.number}: {len(word_elements)} words remaining after exclusion filtering."
                )

        # 3. Get region-based exclusions for spatial filtering
        apply_exclusions_flag = kwargs.get("use_exclusions", use_exclusions)
        exclusion_regions = []
        if apply_exclusions_flag and has_exclusions:
            exclusion_regions = self._get_exclusion_regions(include_callable=True, debug=debug)
            if debug:
                logger.debug(
                    f"Page {self.number}: Found {len(exclusion_regions)} region exclusions for spatial filtering."
                )
        elif debug:
            logger.debug(f"Page {self.number}: Not applying exclusions.")

        # 4. Collect All Character Dictionaries from remaining Word Elements
        all_char_dicts = []
        for word in word_elements:
            all_char_dicts.extend(getattr(word, "_char_dicts", []))

        # 5. Spatially Filter Characters (only by regions, elements already filtered above)
        filtered_chars = filter_chars_spatially(
            char_dicts=all_char_dicts,
            exclusion_regions=exclusion_regions,
            target_region=None,  # No target region for full page extraction
            debug=debug,
        )

        # 6. Generate Text Layout using Utility
        # Pass page bbox as layout context
        page_bbox = (0, 0, self.width, self.height)
        # Merge PDF-level default tolerances if caller did not override
        merged_kwargs = dict(kwargs)
        tol_keys = ["x_tolerance", "x_tolerance_ratio", "y_tolerance"]
        for k in tol_keys:
            if k not in merged_kwargs:
                if k in self._config:
                    merged_kwargs[k] = self._config[k]
                elif k in getattr(self._parent, "_config", {}):
                    merged_kwargs[k] = self._parent._config[k]

        # Add content_filter to kwargs if provided
        if content_filter is not None:
            merged_kwargs["content_filter"] = content_filter

        result = generate_text_layout(
            char_dicts=filtered_chars,
            layout_context_bbox=page_bbox,
            user_kwargs=merged_kwargs,
        )

        # --- Optional: apply Unicode BiDi algorithm for mixed RTL/LTR correctness ---
        apply_bidi = kwargs.get("bidi", True)
        if apply_bidi and result:
            # Quick check for any RTL character
            import unicodedata

            def _contains_rtl(s):
                return any(unicodedata.bidirectional(ch) in ("R", "AL", "AN") for ch in s)

            if _contains_rtl(result):
                try:
                    from bidi.algorithm import get_display  # type: ignore

                    from natural_pdf.utils.bidi_mirror import mirror_brackets

                    result = "\n".join(
                        mirror_brackets(
                            get_display(
                                line,
                                base_dir=(
                                    "R"
                                    if any(
                                        unicodedata.bidirectional(ch) in ("R", "AL", "AN")
                                        for ch in line
                                    )
                                    else "L"
                                ),
                            )
                        )
                        for line in result.split("\n")
                    )
                except ModuleNotFoundError:
                    pass  # silently skip if python-bidi not available

        logger.debug(f"Page {self.number}: extract_text finished, result length: {len(result)}.")
        return result

    def extract_table(
        self,
        method: Optional[str] = None,
        table_settings: Optional[dict] = None,
        use_ocr: bool = False,
        ocr_config: Optional[dict] = None,
        text_options: Optional[Dict] = None,
        cell_extraction_func: Optional[Callable[["Region"], Optional[str]]] = None,
        show_progress: bool = False,
        content_filter=None,
        verticals: Optional[List[float]] = None,
        horizontals: Optional[List[float]] = None,
    ) -> TableResult:
        """
        Extract the largest table from this page using enhanced region-based extraction.

        Args:
            method: Method to use: 'tatr', 'pdfplumber', 'text', 'stream', 'lattice', or None (auto-detect).
            table_settings: Settings for pdfplumber table extraction.
            use_ocr: Whether to use OCR for text extraction (currently only applicable with 'tatr' method).
            ocr_config: OCR configuration parameters.
            text_options: Dictionary of options for the 'text' method.
            cell_extraction_func: Optional callable function that takes a cell Region object
                                  and returns its string content. For 'text' method only.
            show_progress: If True, display a progress bar during cell text extraction for the 'text' method.
            content_filter: Optional content filter to apply during cell text extraction. Can be:
                - A regex pattern string (characters matching the pattern are EXCLUDED)
                - A callable that takes text and returns True to KEEP the character
                - A list of regex patterns (characters matching ANY pattern are EXCLUDED)
            verticals: Optional list of x-coordinates for explicit vertical table lines.
            horizontals: Optional list of y-coordinates for explicit horizontal table lines.

        Returns:
            TableResult: A sequence-like object containing table rows that also provides .to_df() for pandas conversion.
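
        Example:
            >>> result = page.extract_table()   # auto-detect method
            >>> rows = list(result)             # sequence of rows
            >>> df = result.to_df()             # pandas DataFrame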
        """
        # Create a full-page region and delegate to its enhanced extract_table method
        page_region = self.create_region(0, 0, self.width, self.height)
        return page_region.extract_table(
            method=method,
            table_settings=table_settings,
            use_ocr=use_ocr,
            ocr_config=ocr_config,
            text_options=text_options,
            cell_extraction_func=cell_extraction_func,
            show_progress=show_progress,
            content_filter=content_filter,
            verticals=verticals,
            horizontals=horizontals,
        )

    def extract_tables(
        self,
        method: Optional[str] = None,
        table_settings: Optional[dict] = None,
        check_tatr: bool = True,
    ) -> List[List[List[str]]]:
        """
        Extract all tables from this page with enhanced method support.

        Args:
            method: Method to use: 'pdfplumber', 'stream', 'lattice', or None (auto-detect).
                    'stream' uses text-based strategies, 'lattice' uses line-based strategies.
                    Note: 'tatr' and 'text' methods are not supported for extract_tables.
            table_settings: Settings for pdfplumber table extraction.
            check_tatr: If True (default), first check for TATR-detected table regions
                        and extract from those before falling back to pdfplumber methods.

        Returns:
            List of tables, where each table is a list of rows, and each row is a list of cell values.
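
        Example:
            >>> tables = page.extract_tables(method="lattice")
            >>> for table in tables:
            ...     print(len(table), "rows")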
        """
        if table_settings is None:
            table_settings = {}

        # Check for TATR-detected table regions first if enabled
        if check_tatr:
            try:
                tatr_tables = self.find_all("region[type=table][model=tatr]")
                if tatr_tables:
                    logger.debug(
                        f"Page {self.number}: Found {len(tatr_tables)} TATR table regions, extracting from those..."
                    )
                    extracted_tables = []
                    for table_region in tatr_tables:
                        try:
                            table_data = table_region.extract_table(method="tatr")
                            if table_data:  # Only add non-empty tables
                                extracted_tables.append(table_data)
                        except Exception as e:
                            logger.warning(
                                f"Failed to extract table from TATR region {table_region.bbox}: {e}"
                            )

                    if extracted_tables:
                        logger.debug(
                            f"Page {self.number}: Successfully extracted {len(extracted_tables)} tables from TATR regions"
                        )
                        return extracted_tables
                    else:
                        logger.debug(
                            f"Page {self.number}: TATR regions found but no tables extracted, falling back to pdfplumber"
                        )
                else:
                    logger.debug(
                        f"Page {self.number}: No TATR table regions found, using pdfplumber methods"
                    )
            except Exception as e:
                logger.debug(
                    f"Page {self.number}: Error checking TATR regions: {e}, falling back to pdfplumber"
                )

        # Auto-detect method if not specified (try lattice first, then stream)
        if method is None:
            logger.debug(f"Page {self.number}: Auto-detecting tables extraction method...")

            # Try lattice first
            try:
                lattice_settings = table_settings.copy()
                lattice_settings.setdefault("vertical_strategy", "lines")
                lattice_settings.setdefault("horizontal_strategy", "lines")

                logger.debug(f"Page {self.number}: Trying 'lattice' method first for tables...")
                lattice_result = self._page.extract_tables(lattice_settings)

                # Check if lattice found meaningful tables
                if (
                    lattice_result
                    and len(lattice_result) > 0
                    and any(
                        any(
                            any(cell and cell.strip() for cell in row if cell)
                            for row in table
                            if table
                        )
                        for table in lattice_result
                    )
                ):
                    logger.debug(
                        f"Page {self.number}: 'lattice' method found {len(lattice_result)} tables"
                    )
                    return lattice_result
                else:
                    logger.debug(f"Page {self.number}: 'lattice' method found no meaningful tables")

            except Exception as e:
                logger.debug(f"Page {self.number}: 'lattice' method failed: {e}")

            # Fall back to stream
            logger.debug(f"Page {self.number}: Falling back to 'stream' method for tables...")
            stream_settings = table_settings.copy()
            stream_settings.setdefault("vertical_strategy", "text")
            stream_settings.setdefault("horizontal_strategy", "text")

            return self._page.extract_tables(stream_settings)

        effective_method = method

        # Handle method aliases
        if effective_method == "stream":
            logger.debug("Using 'stream' method alias for 'pdfplumber' with text-based strategies.")
            effective_method = "pdfplumber"
            table_settings.setdefault("vertical_strategy", "text")
            table_settings.setdefault("horizontal_strategy", "text")
        elif effective_method == "lattice":
            logger.debug(
                "Using 'lattice' method alias for 'pdfplumber' with line-based strategies."
            )
            effective_method = "pdfplumber"
            table_settings.setdefault("vertical_strategy", "lines")
            table_settings.setdefault("horizontal_strategy", "lines")

        # Use the selected method
        if effective_method == "pdfplumber":
            # ---------------------------------------------------------
            # Inject auto-computed or user-specified text tolerances so
            # pdfplumber uses the same numbers we used for word grouping
            # whenever the table algorithm relies on word positions.
            # ---------------------------------------------------------
            if "text" in (
                table_settings.get("vertical_strategy"),
                table_settings.get("horizontal_strategy"),
            ):
                print("SETTING IT UP")
                pdf_cfg = getattr(self, "_config", getattr(self._parent, "_config", {}))
                if "text_x_tolerance" not in table_settings and "x_tolerance" not in table_settings:
                    x_tol = pdf_cfg.get("x_tolerance")
                    if x_tol is not None:
                        table_settings.setdefault("text_x_tolerance", x_tol)
                if "text_y_tolerance" not in table_settings and "y_tolerance" not in table_settings:
                    y_tol = pdf_cfg.get("y_tolerance")
                    if y_tol is not None:
                        table_settings.setdefault("text_y_tolerance", y_tol)

                # pdfplumber's text strategy benefits from a tight snap tolerance.
                if (
                    "snap_tolerance" not in table_settings
                    and "snap_x_tolerance" not in table_settings
                ):
                    # Derive from y_tol if available, else default 1
                    snap = max(1, round((pdf_cfg.get("y_tolerance", 1)) * 0.9))
                    table_settings.setdefault("snap_tolerance", snap)
                if (
                    "join_tolerance" not in table_settings
                    and "join_x_tolerance" not in table_settings
                ):
                    join = table_settings.get("snap_tolerance", 1)
                    table_settings.setdefault("join_tolerance", join)
                    table_settings.setdefault("join_x_tolerance", join)
                    table_settings.setdefault("join_y_tolerance", join)

            raw_tables = self._page.extract_tables(table_settings)

            # Apply RTL text processing to all extracted tables
            if raw_tables:
                processed_tables = []
                for table in raw_tables:
                    processed_table = []
                    for row in table:
                        processed_row = []
                        for cell in row:
                            if cell is not None:
                                # Apply RTL text processing to each cell
                                rtl_processed_cell = self._apply_rtl_processing_to_text(cell)
                                processed_row.append(rtl_processed_cell)
                            else:
                                processed_row.append(cell)
                        processed_table.append(processed_row)
                    processed_tables.append(processed_table)
                return processed_tables

            return raw_tables
        else:
            raise ValueError(
                f"Unknown tables extraction method: '{method}'. Choose from 'pdfplumber', 'stream', 'lattice'."
            )

    def _load_elements(self):
        """Load all elements from the page via ElementManager."""
        self._element_mgr.load_elements()

    def _create_char_elements(self):
        """DEPRECATED: Use self._element_mgr.chars"""
        logger.warning("_create_char_elements is deprecated. Access via self._element_mgr.chars.")
        return self._element_mgr.chars  # Delegate

    def _process_font_information(self, char_dict):
        """DEPRECATED: Handled by ElementManager"""
        logger.warning("_process_font_information is deprecated. Handled by ElementManager.")
        # ElementManager handles this internally
        pass

    def _group_chars_into_words(self, keep_spaces=True, font_attrs=None):
        """DEPRECATED: Use self._element_mgr.words"""
        logger.warning("_group_chars_into_words is deprecated. Access via self._element_mgr.words.")
        return self._element_mgr.words  # Delegate

    def _process_line_into_words(self, line_chars, keep_spaces, font_attrs):
        """DEPRECATED: Handled by ElementManager"""
        logger.warning("_process_line_into_words is deprecated. Handled by ElementManager.")
        pass

    def _check_font_attributes_match(self, char, prev_char, font_attrs):
        """DEPRECATED: Handled by ElementManager"""
        logger.warning("_check_font_attributes_match is deprecated. Handled by ElementManager.")
        pass

    def _create_word_element(self, chars, font_attrs):
        """DEPRECATED: Handled by ElementManager"""
        logger.warning("_create_word_element is deprecated. Handled by ElementManager.")
        pass

    @property
    def chars(self) -> List[Any]:
        """Get all character elements on this page."""
        return self._element_mgr.chars

    @property
    def words(self) -> List[Any]:
        """Get all word elements on this page."""
        return self._element_mgr.words

    @property
    def rects(self) -> List[Any]:
        """Get all rectangle elements on this page."""
        return self._element_mgr.rects

    @property
    def lines(self) -> List[Any]:
        """Get all line elements on this page."""
        return self._element_mgr.lines

    def add_highlight(
        self,
        bbox: Optional[Tuple[float, float, float, float]] = None,
        color: Optional[Union[Tuple, str]] = None,
        label: Optional[str] = None,
        use_color_cycling: bool = False,
        element: Optional[Any] = None,
        annotate: Optional[List[str]] = None,
        existing: str = "append",
    ) -> "Page":
        """
        Add a highlight to a bounding box or the entire page.
        Delegates to the central HighlightingService.

        Args:
            bbox: Bounding box (x0, top, x1, bottom). If None, highlight entire page.
            color: RGBA color tuple/string for the highlight.
            label: Optional label for the highlight.
            use_color_cycling: If True and no label/color, use next cycle color.
            element: Optional original element being highlighted (for attribute extraction).
            annotate: List of attribute names from 'element' to display.
            existing: How to handle existing highlights ('append' or 'replace').

        Returns:
            Self for method chaining.
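
        Example:
            >>> page.add_highlight((50, 100, 300, 150), label="Header")
            >>> page.add_highlight()  # highlight the entire page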
        """
        target_bbox = bbox if bbox is not None else (0, 0, self.width, self.height)
        self._highlighter.add(
            page_index=self.index,
            bbox=target_bbox,
            color=color,
            label=label,
            use_color_cycling=use_color_cycling,
            element=element,
            annotate=annotate,
            existing=existing,
        )
        return self

    def add_highlight_polygon(
        self,
        polygon: List[Tuple[float, float]],
        color: Optional[Union[Tuple, str]] = None,
        label: Optional[str] = None,
        use_color_cycling: bool = False,
        element: Optional[Any] = None,
        annotate: Optional[List[str]] = None,
        existing: str = "append",
    ) -> "Page":
        """
        Highlight a polygon shape on the page.
        Delegates to the central HighlightingService.

        Args:
            polygon: List of (x, y) points defining the polygon.
            color: RGBA color tuple/string for the highlight.
            label: Optional label for the highlight.
            use_color_cycling: If True and no label/color, use next cycle color.
            element: Optional original element being highlighted (for attribute extraction).
            annotate: List of attribute names from 'element' to display.
            existing: How to handle existing highlights ('append' or 'replace').

        Returns:
            Self for method chaining.
        """
        self._highlighter.add_polygon(
            page_index=self.index,
            polygon=polygon,
            color=color,
            label=label,
            use_color_cycling=use_color_cycling,
            element=element,
            annotate=annotate,
            existing=existing,
        )
        return self

    def save_image(
        self,
        filename: str,
        width: Optional[int] = None,
        labels: bool = True,
        legend_position: str = "right",
        render_ocr: bool = False,
        include_highlights: bool = True,  # Allow saving without highlights
        resolution: float = 144,
        **kwargs,
    ) -> "Page":
        """
        Save the page image to a file, rendering highlights via HighlightingService.

        Args:
            filename: Path to save the image to.
            width: Optional width for the output image.
            labels: Whether to include a legend.
            legend_position: Position of the legend.
            render_ocr: Whether to render OCR text.
            include_highlights: Whether to render highlights.
            resolution: Resolution in DPI for base image rendering (default: 144 DPI, equivalent to previous scale=2.0).
            **kwargs: Additional args for pdfplumber's internal to_image.

        Returns:
            Self for method chaining.
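
        Example:
            >>> page.save_image("page.png", width=800)
            >>> page.save_image("clean.png", include_highlights=False)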
        """
        # Use export() to save the image
        if include_highlights:
            self.export(
                path=filename,
                resolution=resolution,
                width=width,
                labels=labels,
                legend_position=legend_position,
                render_ocr=render_ocr,
                **kwargs,
            )
        else:
            # For saving without highlights, use render() and save manually
            img = self.render(resolution=resolution, **kwargs)
            if img:
                # Resize if width is specified
                if width is not None and width > 0 and img.width > 0:
                    aspect_ratio = img.height / img.width
                    height = int(width * aspect_ratio)
                    try:
                        img = img.resize((width, height), Image.Resampling.LANCZOS)
                    except Exception as e:
                        logger.warning(f"Could not resize image: {e}")

                # Save the image
                try:
                    if os.path.dirname(filename):
                        os.makedirs(os.path.dirname(filename), exist_ok=True)
                    img.save(filename)
                except Exception as e:
                    logger.error(f"Failed to save image to {filename}: {e}")

        return self

    def clear_highlights(self) -> "Page":
        """
        Clear all highlights *from this specific page* via HighlightingService.

        Returns:
            Self for method chaining
        """
        self._highlighter.clear_page(self.index)
        return self

    def analyze_text_styles(
        self, options: Optional[TextStyleOptions] = None
    ) -> "ElementCollection":
        """
        Analyze text elements by style, adding attributes directly to elements.

        This method uses TextStyleAnalyzer to process text elements (typically words)
        on the page. It adds the following attributes to each processed element:
        - style_label: A descriptive or numeric label for the style group.
        - style_key: A hashable tuple representing the style properties used for grouping.
        - style_properties: A dictionary containing the extracted style properties.

        Args:
            options: Optional TextStyleOptions to configure the analysis.
                     If None, the analyzer's default options are used.

        Returns:
            ElementCollection containing all processed text elements with added style attributes.
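
        Example:
            Illustrative sketch; the attributes below are added by the analyzer:

            >>> styled = page.analyze_text_styles()
            >>> first = styled.first
            >>> first.style_label, first.style_properties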
        """
        # Create analyzer (optionally pass default options from PDF config here)
        # For now, it uses its own defaults if options=None
        analyzer = TextStyleAnalyzer()

        # Analyze the page. The analyzer now modifies elements directly
        # and returns the collection of processed elements.
        processed_elements_collection = analyzer.analyze(self, options=options)

        # Return the collection of elements which now have style attributes
        return processed_elements_collection

    def _create_text_elements_from_ocr(
        self, ocr_results: List[Dict[str, Any]], image_width=None, image_height=None
    ) -> List["TextElement"]:
        """DEPRECATED: Use self._element_mgr.create_text_elements_from_ocr"""
        logger.warning(
            "_create_text_elements_from_ocr is deprecated. Use self._element_mgr version."
        )
        return self._element_mgr.create_text_elements_from_ocr(
            ocr_results, image_width, image_height
        )

    def apply_ocr(
        self,
        engine: Optional[str] = None,
        options: Optional["OCROptions"] = None,
        languages: Optional[List[str]] = None,
        min_confidence: Optional[float] = None,
        device: Optional[str] = None,
        resolution: Optional[int] = None,
        detect_only: bool = False,
        apply_exclusions: bool = True,
        replace: bool = True,
    ) -> "Page":
        """
        Apply OCR to THIS page and add results to page elements via PDF.apply_ocr.

        Args:
            engine: Name of the OCR engine.
            options: Engine-specific options object or dict.
            languages: List of engine-specific language codes.
            min_confidence: Minimum confidence threshold.
            device: Device to run OCR on.
            resolution: DPI resolution for rendering page image before OCR.
            detect_only: If True, only detect text bounding boxes, don't perform OCR.
            apply_exclusions: If True (default), render page image for OCR
                              with excluded areas masked (whited out).
            replace: If True (default), remove any existing OCR elements before
                    adding new ones. If False, add new OCR elements to existing ones.

        Returns:
            Self for method chaining.
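        Example:
            A minimal sketch (the engine name and language code are illustrative
            assumptions; available engines depend on installed extras):

            ```python
            page.apply_ocr(engine="easyocr", languages=["en"], resolution=300)
            text = page.extract_text()
            ```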
        """
        if not hasattr(self._parent, "apply_ocr"):
            logger.error(f"Page {self.number}: Parent PDF missing 'apply_ocr'. Cannot apply OCR.")
            return self  # Return self for chaining

        # Remove existing OCR elements if replace is True
        if replace and hasattr(self, "_element_mgr"):
            logger.info(
                f"Page {self.number}: Removing existing OCR elements before applying new OCR."
            )
            self._element_mgr.remove_ocr_elements()

        logger.info(f"Page {self.number}: Delegating apply_ocr to PDF.apply_ocr.")
        # Delegate to parent PDF, targeting only this page's index
        # Pass all relevant parameters through, including apply_exclusions
        self._parent.apply_ocr(
            pages=[self.index],
            engine=engine,
            options=options,
            languages=languages,
            min_confidence=min_confidence,
            device=device,
            resolution=resolution,
            detect_only=detect_only,
            apply_exclusions=apply_exclusions,
            replace=replace,  # Pass the replace parameter to PDF.apply_ocr
        )

        # Return self for chaining
        return self

    def extract_ocr_elements(
        self,
        engine: Optional[str] = None,
        options: Optional["OCROptions"] = None,
        languages: Optional[List[str]] = None,
        min_confidence: Optional[float] = None,
        device: Optional[str] = None,
        resolution: Optional[int] = None,
    ) -> List["TextElement"]:
        """
        Extract text elements using OCR *without* adding them to the page's elements.
        Uses the shared OCRManager instance.

        Args:
            engine: Name of the OCR engine.
            options: Engine-specific options object or dict.
            languages: List of engine-specific language codes.
            min_confidence: Minimum confidence threshold.
            device: Device to run OCR on.
            resolution: DPI resolution for rendering page image before OCR.

        Returns:
            List of created TextElement objects derived from OCR results for this page.
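        Example:
            A minimal sketch (attribute access mirrors the OCR result fields;
            the returned elements are not added to the page):

            ```python
            candidates = page.extract_ocr_elements(min_confidence=0.5)
            for el in candidates:
                print(el.text, el.confidence)
            ```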
        """
        if not self._ocr_manager:
            logger.error(
                f"Page {self.number}: OCRManager not available. Cannot extract OCR elements."
            )
            return []

        logger.info(f"Page {self.number}: Extracting OCR elements (extract only)...")

        # Determine rendering resolution
        final_resolution = resolution if resolution is not None else 150  # Default to 150 DPI
        logger.debug(f"  Using rendering resolution: {final_resolution} DPI")

        try:
            # Get base image without highlights using the determined resolution
            # Use the global PDF rendering lock
            with pdf_render_lock:
                # Use render() for clean image without highlights
                image = self.render(resolution=final_resolution)
                if not image:
                    logger.error(
                        f"  Failed to render page {self.number} to image for OCR extraction."
                    )
                    return []
                logger.debug(f"  Rendered image size: {image.width}x{image.height}")
        except Exception as e:
            logger.error(f"  Failed to render page {self.number} to image: {e}", exc_info=True)
            return []

        # Prepare arguments for the OCR Manager call
        manager_args = {
            "images": image,
            "engine": engine,
            "languages": languages,
            "min_confidence": min_confidence,
            "device": device,
            "options": options,
        }
        manager_args = {k: v for k, v in manager_args.items() if v is not None}

        logger.debug(
            f"  Calling OCR Manager (extract only) with args: { {k:v for k,v in manager_args.items() if k != 'images'} }"
        )
        try:
            # apply_ocr now returns List[List[Dict]] or List[Dict]
            results_list = self._ocr_manager.apply_ocr(**manager_args)
            # If it returned a list of lists (batch mode), take the first list
            results = (
                results_list[0]
                if isinstance(results_list, list)
                and results_list
                and isinstance(results_list[0], list)
                else results_list
            )
            if not isinstance(results, list):
                logger.error(f"  OCR Manager returned unexpected type: {type(results)}")
                results = []
            logger.info(f"  OCR Manager returned {len(results)} results for extraction.")
        except Exception as e:
            logger.error(f"  OCR processing failed during extraction: {e}", exc_info=True)
            return []

        # Convert results but DO NOT add to ElementManager
        logger.debug(f"  Converting OCR results to TextElements (extract only)...")
        temp_elements = []
        scale_x = self.width / image.width if image.width else 1
        scale_y = self.height / image.height if image.height else 1
        for result in results:
            try:  # Added try-except around result processing
                x0, top, x1, bottom = [float(c) for c in result["bbox"]]
                elem_data = {
                    "text": result["text"],
                    "confidence": result["confidence"],
                    "x0": x0 * scale_x,
                    "top": top * scale_y,
                    "x1": x1 * scale_x,
                    "bottom": bottom * scale_y,
                    "width": (x1 - x0) * scale_x,
                    "height": (bottom - top) * scale_y,
                    "object_type": "text",  # Using text for temporary elements
                    "source": "ocr",
                    "fontname": "OCR-extract",  # Different name for clarity
                    "size": 10.0,
                    "page_number": self.number,
                }
                temp_elements.append(TextElement(elem_data, self))
            except (KeyError, ValueError, TypeError) as convert_err:
                logger.warning(
                    f"  Skipping invalid OCR result during conversion: {result}. Error: {convert_err}"
                )

        logger.info(f"  Created {len(temp_elements)} TextElements from OCR (extract only).")
        return temp_elements

    @property
    def size(self) -> Tuple[float, float]:
        """Get the size of the page in points."""
        return (self._page.width, self._page.height)

    @property
    def layout_analyzer(self) -> "LayoutAnalyzer":
        """Get or create the layout analyzer for this page."""
        if self._layout_analyzer is None:
            if not self._layout_manager:
                logger.warning("LayoutManager not available, cannot create LayoutAnalyzer.")
                return None
            self._layout_analyzer = LayoutAnalyzer(self)
        return self._layout_analyzer

    def analyze_layout(
        self,
        engine: Optional[str] = None,
        options: Optional["LayoutOptions"] = None,
        confidence: Optional[float] = None,
        classes: Optional[List[str]] = None,
        exclude_classes: Optional[List[str]] = None,
        device: Optional[str] = None,
        existing: str = "replace",
        model_name: Optional[str] = None,
        client: Optional[Any] = None,  # Add client parameter
    ) -> "ElementCollection[Region]":
        """
        Analyze the page layout using the configured LayoutManager.
        Adds detected Region objects to the page's element manager.

        Args:
            engine: Name of the layout detection engine to use.
            options: Engine-specific options object or dict.
            confidence: Minimum confidence threshold for detected regions.
            classes: List of region classes to include.
            exclude_classes: List of region classes to exclude.
            device: Device to run layout detection on.
            existing: How to handle previously detected regions: 'replace'
                      (default) clears them before analysis.
            model_name: Optional model name for the engine.
            client: Optional client object for API-based engines.

        Returns:
            ElementCollection containing the detected Region objects.
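        Example:
            A minimal sketch (the engine name is an illustrative assumption):

            ```python
            regions = page.analyze_layout(engine="yolo", confidence=0.3)
            for region in regions.elements:
                print(region.source, getattr(region, "model", None))
            ```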
        """
        analyzer = self.layout_analyzer
        if not analyzer:
            logger.error(
                "Layout analysis failed: LayoutAnalyzer not initialized (is LayoutManager available?)."
            )
            return ElementCollection([])  # Return empty collection

        # Clear existing detected regions if 'replace' is specified
        if existing == "replace":
            self.clear_detected_layout_regions()

        # The analyzer's analyze_layout method already adds regions to the page
        # and its element manager. We just need to retrieve them.
        analyzer.analyze_layout(
            engine=engine,
            options=options,
            confidence=confidence,
            classes=classes,
            exclude_classes=exclude_classes,
            device=device,
            existing=existing,
            model_name=model_name,
            client=client,  # Pass client down
        )

        # Retrieve the detected regions from the element manager
        # Filter regions based on source='detected' and potentially the model used if available
        detected_regions = [
            r
            for r in self._element_mgr.regions
            if r.source == "detected" and (not engine or getattr(r, "model", None) == engine)
        ]

        return ElementCollection(detected_regions)

    def clear_detected_layout_regions(self) -> "Page":
        """
        Removes all regions from this page that were added by layout analysis
        (i.e., regions where `source` attribute is 'detected').

        This clears the regions both from the page's internal `_regions['detected']` list
        and from the ElementManager's internal list of regions.

        Returns:
            Self for method chaining.
        """
        if (
            not hasattr(self._element_mgr, "regions")
            or not hasattr(self._element_mgr, "_elements")
            or "regions" not in self._element_mgr._elements
        ):
            logger.debug(
                f"Page {self.index}: No regions found in ElementManager, nothing to clear."
            )
            self._regions["detected"] = []  # Ensure page's list is also clear
            return self

        # Filter ElementManager's list to keep only non-detected regions
        original_count = len(self._element_mgr.regions)
        self._element_mgr._elements["regions"] = [
            r for r in self._element_mgr.regions if getattr(r, "source", None) != "detected"
        ]
        new_count = len(self._element_mgr.regions)
        removed_count = original_count - new_count

        # Clear the page's specific list of detected regions
        self._regions["detected"] = []

        logger.info(f"Page {self.index}: Cleared {removed_count} detected layout regions.")
        return self

    def get_section_between(
        self,
        start_element=None,
        end_element=None,
        include_boundaries="both",
        orientation="vertical",
    ) -> Optional["Region"]:  # Return Optional
        """
        Get a section between two elements on this page.

        Args:
            start_element: Element marking the start of the section
            end_element: Element marking the end of the section
            include_boundaries: How to include boundary elements: 'start', 'end', 'both', or 'none'
            orientation: 'vertical' (default) or 'horizontal' - determines section direction

        Returns:
            Region representing the section, or None if the section could not
            be determined.
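        Example:
            A minimal sketch (the selector is an illustrative assumption):

            ```python
            headings = page.find_all("text:bold").elements
            section = page.get_section_between(headings[0], headings[1])
            if section:
                print(section.extract_text())
            ```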
        """
        # Create a full-page region to operate within
        page_region = self.create_region(0, 0, self.width, self.height)

        # Delegate to the region's method
        try:
            return page_region.get_section_between(
                start_element=start_element,
                end_element=end_element,
                include_boundaries=include_boundaries,
                orientation=orientation,
            )
        except Exception as e:
            logger.error(
                f"Error getting section between elements on page {self.index}: {e}", exc_info=True
            )
            return None

    def split(self, divider, **kwargs) -> "ElementCollection[Region]":
        """
        Divides the page into sections based on the provided divider elements.

        Args:
            divider: Elements (or selector string) marking where each section
                     starts; passed through to get_sections().
            **kwargs: Additional arguments forwarded to get_sections().

        Returns:
            An ElementCollection of Region objects covering the page, including
            the region above the first divider.
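        Example:
            A minimal sketch (the selector is an illustrative assumption):

            ```python
            # Returns the sections between dividers plus the region above
            # the first divider
            sections = page.split("text:bold")
            ```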
        """
        sections = self.get_sections(start_elements=divider, **kwargs)
        if not sections:
            # No dividers found: treat the whole page as a single section
            return ElementCollection([self.region(0, 0, self.width, self.height)])
        # Include the region above the first divider so the whole page is covered
        top = self.region(0, 0, self.width, sections[0].top)
        sections.append(top)

        return sections

    def get_sections(
        self,
        start_elements=None,
        end_elements=None,
        include_boundaries="start",
        y_threshold=5.0,
        bounding_box=None,
        orientation="vertical",
    ) -> "ElementCollection[Region]":
        """
        Get sections of a page defined by start/end elements.
        Uses the page-level implementation.

        Args:
            start_elements: Elements or selector string that mark the start of sections
            end_elements: Elements or selector string that mark the end of sections
            include_boundaries: How to include boundary elements: 'start', 'end', 'both', or 'none'
            y_threshold: Threshold for vertical alignment (only used for vertical orientation)
            bounding_box: Optional bounding box to constrain sections
            orientation: 'vertical' (default) or 'horizontal' - determines section direction

        Returns:
            An ElementCollection containing the found Region objects.
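        Example:
            A minimal sketch (the selector is an illustrative assumption):

            ```python
            # Each bold line starts a section that runs to the next bold line
            sections = page.get_sections(start_elements="text:bold")
            for section in sections.elements:
                print(section.extract_text()[:60])
            ```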
        """

        # Helper function to get bounds from bounding_box parameter
        def get_bounds():
            if bounding_box:
                x0, top, x1, bottom = bounding_box
                # Clamp to page boundaries
                return max(0, x0), max(0, top), min(self.width, x1), min(self.height, bottom)
            else:
                return 0, 0, self.width, self.height

        regions = []

        # Handle cases where elements are provided as strings (selectors)
        if isinstance(start_elements, str):
            start_elements = self.find_all(start_elements).elements  # Get list of elements
        elif hasattr(start_elements, "elements"):  # Handle ElementCollection input
            start_elements = start_elements.elements

        if isinstance(end_elements, str):
            end_elements = self.find_all(end_elements).elements
        elif hasattr(end_elements, "elements"):
            end_elements = end_elements.elements

        # Ensure start_elements is a list
        if start_elements is None:
            start_elements = []
        if end_elements is None:
            end_elements = []

        valid_inclusions = ["start", "end", "both", "none"]
        if include_boundaries not in valid_inclusions:
            raise ValueError(f"include_boundaries must be one of {valid_inclusions}")

        if not start_elements and not end_elements:
            # Return an empty ElementCollection if no boundary elements at all
            return ElementCollection([])

        # If we only have end elements, create implicit start elements
        if not start_elements and end_elements:
            # Delegate to PageCollection implementation for consistency
            from natural_pdf.core.page_collection import PageCollection

            pages = PageCollection([self])
            return pages.get_sections(
                start_elements=start_elements,
                end_elements=end_elements,
                include_boundaries=include_boundaries,
                orientation=orientation,
            )

        # Combine start and end elements with their type
        all_boundaries = []
        for el in start_elements:
            all_boundaries.append((el, "start"))
        for el in end_elements:
            all_boundaries.append((el, "end"))

        # Sort all boundary elements based on orientation
        try:
            if orientation == "vertical":
                all_boundaries.sort(key=lambda x: (x[0].top, x[0].x0))
            else:  # horizontal
                all_boundaries.sort(key=lambda x: (x[0].x0, x[0].top))
        except AttributeError as e:
            logger.error(f"Error sorting boundaries: Element missing position attribute? {e}")
            return ElementCollection([])  # Cannot proceed if elements lack position

        # Process sorted boundaries to find sections
        current_start_element = None
        active_section_started = False

        for element, element_type in all_boundaries:
            if element_type == "start":
                # If we have an active section, this start implicitly ends it
                if active_section_started:
                    end_boundary_el = element  # Use this start as the end boundary
                    # Determine region boundaries based on orientation
                    if orientation == "vertical":
                        sec_top = (
                            current_start_element.top
                            if include_boundaries in ["start", "both"]
                            else current_start_element.bottom
                        )
                        sec_bottom = (
                            end_boundary_el.top
                            if include_boundaries not in ["end", "both"]
                            else end_boundary_el.bottom
                        )

                        if sec_top < sec_bottom:  # Ensure valid region
                            x0, _, x1, _ = get_bounds()
                            region = self.create_region(x0, sec_top, x1, sec_bottom)
                            region.start_element = current_start_element
                            region.end_element = end_boundary_el  # Mark the element that ended it
                            region.is_end_next_start = True  # Mark how it ended
                            region._boundary_exclusions = include_boundaries
                            regions.append(region)
                    else:  # horizontal
                        sec_left = (
                            current_start_element.x0
                            if include_boundaries in ["start", "both"]
                            else current_start_element.x1
                        )
                        sec_right = (
                            end_boundary_el.x0
                            if include_boundaries not in ["end", "both"]
                            else end_boundary_el.x1
                        )

                        if sec_left < sec_right:  # Ensure valid region
                            _, y0, _, y1 = get_bounds()
                            region = self.create_region(sec_left, y0, sec_right, y1)
                            region.start_element = current_start_element
                            region.end_element = end_boundary_el  # Mark the element that ended it
                            region.is_end_next_start = True  # Mark how it ended
                            region._boundary_exclusions = include_boundaries
                            regions.append(region)
                    active_section_started = False  # Reset for the new start

                # Set this as the potential start of the next section
                current_start_element = element
                active_section_started = True

            elif element_type == "end" and active_section_started:
                # We found an explicit end for the current section
                end_boundary_el = element
                if orientation == "vertical":
                    sec_top = (
                        current_start_element.top
                        if include_boundaries in ["start", "both"]
                        else current_start_element.bottom
                    )
                    sec_bottom = (
                        end_boundary_el.bottom
                        if include_boundaries in ["end", "both"]
                        else end_boundary_el.top
                    )

                    if sec_top < sec_bottom:  # Ensure valid region
                        x0, _, x1, _ = get_bounds()
                        region = self.create_region(x0, sec_top, x1, sec_bottom)
                        region.start_element = current_start_element
                        region.end_element = end_boundary_el
                        region.is_end_next_start = False
                        region._boundary_exclusions = include_boundaries
                        regions.append(region)
                else:  # horizontal
                    sec_left = (
                        current_start_element.x0
                        if include_boundaries in ["start", "both"]
                        else current_start_element.x1
                    )
                    sec_right = (
                        end_boundary_el.x1
                        if include_boundaries in ["end", "both"]
                        else end_boundary_el.x0
                    )

                    if sec_left < sec_right:  # Ensure valid region
                        _, y0, _, y1 = get_bounds()
                        region = self.create_region(sec_left, y0, sec_right, y1)
                        region.start_element = current_start_element
                        region.end_element = end_boundary_el
                        region.is_end_next_start = False
                        region._boundary_exclusions = include_boundaries
                        regions.append(region)

                # Reset: section ended explicitly
                current_start_element = None
                active_section_started = False

        # Handle the last section if it was started but never explicitly ended
        if active_section_started:
            if orientation == "vertical":
                sec_top = (
                    current_start_element.top
                    if include_boundaries in ["start", "both"]
                    else current_start_element.bottom
                )
                x0, _, x1, page_bottom = get_bounds()
                if sec_top < page_bottom:
                    region = self.create_region(x0, sec_top, x1, page_bottom)
                    region.start_element = current_start_element
                    region.end_element = None  # Ended by page end
                    region.is_end_next_start = False
                    region._boundary_exclusions = include_boundaries
                    regions.append(region)
            else:  # horizontal
                sec_left = (
                    current_start_element.x0
                    if include_boundaries in ["start", "both"]
                    else current_start_element.x1
                )
                page_left, y0, page_right, y1 = get_bounds()
                if sec_left < page_right:
                    region = self.create_region(sec_left, y0, page_right, y1)
                    region.start_element = current_start_element
                    region.end_element = None  # Ended by page end
                    region.is_end_next_start = False
                    region._boundary_exclusions = include_boundaries
                    regions.append(region)

        return ElementCollection(regions)

    def __repr__(self) -> str:
        """String representation of the page."""
        return f"<Page number={self.number} index={self.index}>"

    def ask(
        self,
        question: Union[str, List[str], Tuple[str, ...]],
        min_confidence: float = 0.1,
        model: str = None,
        debug: bool = False,
        **kwargs,
    ) -> Union[Dict[str, Any], List[Dict[str, Any]]]:
        """
        Ask a question about the page content using document QA.

        Args:
            question: A single question string, or a list/tuple of questions.
            min_confidence: Minimum confidence threshold for answers (default: 0.1).
            model: Optional QA model name; the default engine is used if None.
            debug: If True, enable debug output from the QA engine.
            **kwargs: Additional arguments passed to the QA engine.

        Returns:
            A result dictionary (or a list of dictionaries, one per question)
            with keys including 'answer', 'confidence', 'found', 'page_num',
            and 'source_elements'.
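        Example:
            A minimal sketch (the question text is an illustrative assumption):

            ```python
            result = page.ask("What is the invoice total?")
            if result["found"]:
                print(result["answer"], result["confidence"])
            ```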
        """
        try:
            from natural_pdf.qa.document_qa import get_qa_engine

            # Get or initialize QA engine with specified model
            qa_engine = get_qa_engine(model_name=model) if model else get_qa_engine()
            # Ask the question using the QA engine
            return qa_engine.ask_pdf_page(
                self, question, min_confidence=min_confidence, debug=debug, **kwargs
            )
        except ImportError:
            logger.error(
                "Question answering requires the 'natural_pdf.qa' module. Please install necessary dependencies."
            )
            return {
                "answer": None,
                "confidence": 0.0,
                "found": False,
                "page_num": self.number,
                "source_elements": [],
            }
        except Exception as e:
            logger.error(f"Error during page.ask: {e}", exc_info=True)
            return {
                "answer": None,
                "confidence": 0.0,
                "found": False,
                "page_num": self.number,
                "source_elements": [],
            }

    def show_preview(
        self,
        temporary_highlights: List[Dict],
        resolution: float = 144,
        width: Optional[int] = None,
        labels: bool = True,
        legend_position: str = "right",
        render_ocr: bool = False,
    ) -> Optional[Image.Image]:
        """
        Generates and returns a non-stateful preview image containing only
        the provided temporary highlights.

        Args:
            temporary_highlights: List of highlight data dictionaries (as prepared by
                                  ElementCollection._prepare_highlight_data).
            resolution: Resolution in DPI for rendering (default: 144 DPI, equivalent to previous scale=2.0).
            width: Optional width for the output image.
            labels: Whether to include a legend.
            legend_position: Position of the legend.
            render_ocr: Whether to render OCR text.

        Returns:
            PIL Image object of the preview, or None if rendering fails.
        """
        try:
            # Delegate rendering to the highlighter service's preview method
            img = self._highlighter.render_preview(
                page_index=self.index,
                temporary_highlights=temporary_highlights,
                resolution=resolution,
                labels=labels,
                legend_position=legend_position,
                render_ocr=render_ocr,
            )
        except AttributeError:
            logger.error(f"HighlightingService does not have the required 'render_preview' method.")
            return None
        except Exception as e:
            logger.error(
                f"Error calling highlighter.render_preview for page {self.index}: {e}",
                exc_info=True,
            )
            return None

        # Return the rendered image directly
        return img

    @property
    def text_style_labels(self) -> List[str]:
        """
        Get a sorted list of unique text style labels found on the page.

        Runs text style analysis with default options if it hasn't been run yet.
        To use custom options, call `analyze_text_styles(options=...)` explicitly first.

        Returns:
            A sorted list of unique style label strings.
        """
        # Check if the summary attribute exists from a previous run
        if not hasattr(self, "_text_styles_summary") or not self._text_styles_summary:
            # If not, run the analysis with default options
            logger.debug(f"Page {self.number}: Running default text style analysis to get labels.")
            self.analyze_text_styles()  # Use default options

        # Extract labels from the summary dictionary
        if hasattr(self, "_text_styles_summary") and self._text_styles_summary:
            # The summary maps style_key -> {'label': ..., 'properties': ...}
            labels = {style_info["label"] for style_info in self._text_styles_summary.values()}
            return sorted(list(labels))
        else:
            # Fallback if summary wasn't created for some reason (e.g., no text elements)
            logger.warning(f"Page {self.number}: Text style summary not found after analysis.")
            return []

    def viewer(
        self,
        # elements_to_render: Optional[List['Element']] = None, # No longer needed, from_page handles it
        # include_source_types: List[str] = ['word', 'line', 'rect', 'region'] # No longer needed
    ) -> Optional["InteractiveViewerWidget"]:  # Return type hint updated
        """
        Creates and returns an interactive ipywidget for exploring elements on this page.

        Uses InteractiveViewerWidget.from_page() to create the viewer.

        Returns:
            An InteractiveViewerWidget instance ready for display in Jupyter,
            or None if ipywidgets is not installed or widget creation fails.

        Raises:
            # Optional: Could raise ImportError instead of returning None
            # ImportError: If required dependencies (ipywidgets) are missing.
            ValueError: If image rendering or data preparation fails within from_page.
        """
        # Check for availability using the imported flag and class variable
        if not _IPYWIDGETS_AVAILABLE or InteractiveViewerWidget is None:
            logger.error(
                "Interactive viewer requires 'ipywidgets'. "
                'Please install with: pip install "ipywidgets>=7.0.0,<10.0.0"'
            )
            # raise ImportError("ipywidgets not found.") # Option 1: Raise error
            return None  # Option 2: Return None gracefully

        # If we reach here, InteractiveViewerWidget should be the actual class
        try:
            # Pass self (the Page object) to the factory method
            return InteractiveViewerWidget.from_page(self)
        except Exception as e:
            # Catch potential errors during widget creation (e.g., image rendering)
            logger.error(
                f"Error creating viewer widget from page {self.number}: {e}", exc_info=True
            )
            # raise # Option 1: Re-raise error (might include ValueError from from_page)
            return None  # Option 2: Return None on creation error

    # --- Indexable Protocol Methods ---
    def get_id(self) -> str:
        """Returns a unique identifier for the page (required by Indexable protocol)."""
        # Ensure path is safe for use in IDs (replace problematic chars)
        safe_path = re.sub(r"[^a-zA-Z0-9_-]", "_", str(self.pdf.path))
        return f"pdf_{safe_path}_page_{self.page_number}"

    def get_metadata(self) -> Dict[str, Any]:
        """Returns metadata associated with the page (required by Indexable protocol)."""
        # Add content hash here for sync
        metadata = {
            "pdf_path": str(self.pdf.path),
            "page_number": self.page_number,
            "width": self.width,
            "height": self.height,
            "content_hash": self.get_content_hash(),  # Include the hash
        }
        return metadata

    def get_content(self) -> "Page":
        """
        Returns the primary content object (self) for indexing (required by Indexable protocol).
        SearchService implementations decide how to process this (e.g., call extract_text).
        """
        return self  # Return the Page object itself

    def get_content_hash(self) -> str:
        """Returns a SHA256 hash of the extracted text content (required by Indexable for sync)."""
        # Hash the extracted text (without exclusions for consistency)
        # Consider if exclusions should be part of the hash? For now, hash raw text.
        # Using extract_text directly might be slow if called repeatedly. Cache? TODO: Optimization
        text_content = self.extract_text(
            use_exclusions=False, preserve_whitespace=False
        )  # Normalize whitespace?
        return hashlib.sha256(text_content.encode("utf-8")).hexdigest()

    # --- New Method: save_searchable ---
    def save_searchable(self, output_path: Union[str, "Path"], dpi: int = 300, **kwargs):
        """
        Saves the PDF page with an OCR text layer, making content searchable.

        Requires optional dependencies. Install with: pip install "natural-pdf[ocr-save]"

        Note: OCR must have been applied to the pages beforehand
              (e.g., pdf.apply_ocr()).

        Args:
            output_path: Path to save the searchable PDF.
            dpi: Resolution for rendering and OCR overlay (default 300).
            **kwargs: Additional keyword arguments passed to the exporter.
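        Example:
            A minimal sketch (the output path is an illustrative assumption):

            ```python
            page.apply_ocr()  # OCR must be applied first
            page.save_searchable("searchable.pdf", dpi=300)
            ```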
        """
        # Import moved here, assuming it's always available now
        from natural_pdf.exporters.searchable_pdf import create_searchable_pdf

        # Convert pathlib.Path to string if necessary
        output_path_str = str(output_path)

        create_searchable_pdf(self, output_path_str, dpi=dpi, **kwargs)
        logger.info(f"Searchable PDF saved to: {output_path_str}")

    # --- Added correct_ocr method ---
    def update_text(
        self,
        transform: Callable[[Any], Optional[str]],
        selector: str = "text",
        max_workers: Optional[int] = None,
        progress_callback: Optional[Callable[[], None]] = None,  # Added progress callback
    ) -> "Page":  # Return self for chaining
        """
        Applies corrections to text elements on this page
        using a user-provided callback function, potentially in parallel.

        Finds text elements on this page matching the *selector* argument,
        calls ``transform`` for each one (passing the element itself), and
        updates the element's text if the callback returns a new string.

        Args:
            transform: A function accepting an element and returning
                       `Optional[str]` (new text or None).
            selector: CSS-like selector string to match text elements.
            max_workers: The maximum number of threads to use for parallel execution.
                         If None, 0, or 1, runs sequentially.
            progress_callback: Optional callback function to call after processing each element.

        Returns:
            Self for method chaining.
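        Example:
            A minimal sketch that upper-cases matched text (the selector is an
            illustrative assumption):

            ```python
            def shout(element):
                text = getattr(element, "text", None)
                return text.upper() if text else None

            page.update_text(shout, selector="text")
            ```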
        """
        logger.info(
            f"Page {self.number}: Starting text update with callback '{transform.__name__}' (max_workers={max_workers}) and selector='{selector}'"
        )

        target_elements_collection = self.find_all(selector=selector, apply_exclusions=False)
        target_elements = target_elements_collection.elements  # Get the list

        if not target_elements:
            logger.info(f"Page {self.number}: No text elements found to update.")
            return self

        element_pbar = None
        try:
            element_pbar = tqdm(
                total=len(target_elements),
                desc=f"Updating text Page {self.number}",
                unit="element",
                leave=False,
            )

            processed_count = 0
            updated_count = 0
            error_count = 0

            # Define the task to be run by the worker thread or sequentially
            def _process_element_task(element):
                try:
                    # Call the user-provided callback
                    corrected_text = transform(element)

                    # Validate result type
                    if corrected_text is not None and not isinstance(corrected_text, str):
                        logger.warning(
                            f"Page {self.number}: Correction callback for element '{getattr(element, 'text', '')[:20]}...' returned non-string, non-None type: {type(corrected_text)}. Skipping update."
                        )
                        return element, None, None  # Treat as no correction

                    return element, corrected_text, None  # Return element, result, no error
                except Exception as e:
                    logger.error(
                        f"Page {self.number}: Error applying correction callback to element '{getattr(element, 'text', '')[:30]}...' ({element.bbox}): {e}",
                        exc_info=False,  # Keep log concise
                    )
                    return element, None, e  # Return element, no result, error
                finally:
                    # --- Update internal tqdm progress bar ---
                    if element_pbar:
                        element_pbar.update(1)
                    # --- Call user's progress callback --- #
                    if progress_callback:
                        try:
                            progress_callback()
                        except Exception as cb_e:
                            # Log error in callback itself, but don't stop processing
                            logger.error(
                                f"Page {self.number}: Error executing progress_callback: {cb_e}",
                                exc_info=False,
                            )

            # Choose execution strategy based on max_workers
            if max_workers is not None and max_workers > 1:
                # --- Parallel execution --- #
                logger.info(
                    f"Page {self.number}: Running text update in parallel with {max_workers} workers."
                )
                futures = []
                with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
                    # Submit all tasks
                    future_to_element = {
                        executor.submit(_process_element_task, element): element
                        for element in target_elements
                    }

                    # Process results as they complete (progress_callback called by worker)
                    for future in concurrent.futures.as_completed(future_to_element):
                        processed_count += 1
                        try:
                            element, corrected_text, error = future.result()
                            if error:
                                error_count += 1
                                # Error already logged in worker
                            elif corrected_text is not None:
                                # Apply correction if text changed
                                current_text = getattr(element, "text", None)
                                if corrected_text != current_text:
                                    element.text = corrected_text
                                    updated_count += 1
                        except Exception as exc:
                            # Catch errors from future.result() itself
                            element = future_to_element[future]  # Find original element
                            logger.error(
                                f"Page {self.number}: Internal error retrieving correction result for element {element.bbox}: {exc}",
                                exc_info=True,
                            )
                            error_count += 1
                            # Note: progress_callback was already called in the worker's finally block

            else:
                # --- Sequential execution --- #
                logger.info(f"Page {self.number}: Running text update sequentially.")
                for element in target_elements:
                    # Call the task function directly (it handles progress_callback)
                    processed_count += 1
                    _element, corrected_text, error = _process_element_task(element)
                    if error:
                        error_count += 1
                    elif corrected_text is not None:
                        # Apply correction if text changed
                        current_text = getattr(_element, "text", None)
                        if corrected_text != current_text:
                            _element.text = corrected_text
                            updated_count += 1

            logger.info(
                f"Page {self.number}: Text update finished. Processed: {processed_count}/{len(target_elements)}, Updated: {updated_count}, Errors: {error_count}."
            )

            return self  # Return self for chaining
        finally:
            if element_pbar:
                element_pbar.close()

    # --- Classification Mixin Implementation --- #
    def _get_classification_manager(self) -> "ClassificationManager":
        if not hasattr(self, "pdf") or not hasattr(self.pdf, "get_manager"):
            raise AttributeError(
                "ClassificationManager cannot be accessed: Parent PDF or get_manager method missing."
            )
        try:
            # Use the PDF's manager registry accessor
            return self.pdf.get_manager("classification")
        except (ValueError, RuntimeError, AttributeError) as e:
            # Wrap potential errors from get_manager for clarity
            raise AttributeError(f"Failed to get ClassificationManager from PDF: {e}") from e

    def _get_classification_content(
        self, model_type: str, **kwargs
    ) -> Union[str, "Image"]:  # Use "Image" for lazy import
        if model_type == "text":
            text_content = self.extract_text(
                layout=False, use_exclusions=False
            )  # Simple join, ignore exclusions for classification
            if not text_content or text_content.isspace():
                raise ValueError("Cannot classify page with 'text' model: No text content found.")
            return text_content
        elif model_type == "vision":
            # Get the rendering resolution from kwargs if provided, else default
            default_resolution = 150
            resolution = kwargs.get("resolution", default_resolution)

            # Use render() for clean image without highlights
            img = self.render(resolution=resolution)
            if img is None:
                raise ValueError(
                    "Cannot classify page with 'vision' model: Failed to render image."
                )
            return img
        else:
            raise ValueError(f"Unsupported model_type for classification: {model_type}")

    def _get_metadata_storage(self) -> Dict[str, Any]:
        # Ensure metadata exists
        if not hasattr(self, "metadata") or self.metadata is None:
            self.metadata = {}
        return self.metadata

    # --- Content Extraction ---

    # --- Skew Detection and Correction --- #

    @property
    def skew_angle(self) -> Optional[float]:
        """Get the detected skew angle for this page (if calculated)."""
        return self._skew_angle

    def detect_skew_angle(
        self,
        resolution: int = 72,
        grayscale: bool = True,
        force_recalculate: bool = False,
        **deskew_kwargs,
    ) -> Optional[float]:
        """
        Detects the skew angle of the page image and stores it.

        Args:
            resolution: DPI resolution for rendering the page image for detection.
            grayscale: Whether to convert the image to grayscale before detection.
            force_recalculate: If True, recalculate even if an angle exists.
            **deskew_kwargs: Additional keyword arguments passed to `deskew.determine_skew`
                             (e.g., `max_angle`, `num_peaks`).

        Returns:
            The detected skew angle in degrees, or None if detection failed.

        Raises:
            ImportError: If the 'deskew' library is not installed.
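        Example:
            A minimal sketch:

            ```python
            angle = page.detect_skew_angle(resolution=72)
            if angle is not None:
                print(f"Detected skew: {angle:.2f} degrees")
            ```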
        """
        if not DESKEW_AVAILABLE:
            raise ImportError(
                "Deskew library not found. Install with: pip install natural-pdf[deskew]"
            )

        if self._skew_angle is not None and not force_recalculate:
            logger.debug(f"Page {self.number}: Returning cached skew angle: {self._skew_angle:.2f}")
            return self._skew_angle

        logger.debug(f"Page {self.number}: Detecting skew angle (resolution={resolution} DPI)...")
        try:
            # Render the page at the specified detection resolution
            # Use render() for clean image without highlights
            img = self.render(resolution=resolution)
            if not img:
                logger.warning(f"Page {self.number}: Failed to render image for skew detection.")
                self._skew_angle = None
                return None

            # Convert to numpy array
            img_np = np.array(img)

            # Convert to grayscale if needed
            if grayscale:
                if len(img_np.shape) == 3 and img_np.shape[2] >= 3:
                    gray_np = np.mean(img_np[:, :, :3], axis=2).astype(np.uint8)
                elif len(img_np.shape) == 2:
                    gray_np = img_np  # Already grayscale
                else:
                    logger.warning(
                        f"Page {self.number}: Unexpected image shape {img_np.shape} for grayscale conversion."
                    )
                    gray_np = img_np  # Try using it anyway
            else:
                gray_np = img_np  # Use original if grayscale=False

            # Determine skew angle using the deskew library
            angle = determine_skew(gray_np, **deskew_kwargs)
            self._skew_angle = angle
            logger.debug(f"Page {self.number}: Detected skew angle = {angle}")
            return angle

        except Exception as e:
            logger.warning(f"Page {self.number}: Failed during skew detection: {e}", exc_info=True)
            self._skew_angle = None
            return None

    def deskew(
        self,
        resolution: int = 300,
        angle: Optional[float] = None,
        detection_resolution: int = 72,
        **deskew_kwargs,
    ) -> Optional[Image.Image]:
        """
        Creates and returns a deskewed PIL image of the page.

        If `angle` is not provided, it will first try to detect the skew angle
        using `detect_skew_angle` (or use the cached angle if available).

        Args:
            resolution: DPI resolution for the output deskewed image.
            angle: The specific angle (in degrees) to rotate by. If None, detects automatically.
            detection_resolution: DPI resolution used for detection if `angle` is None.
            **deskew_kwargs: Additional keyword arguments passed to `deskew.determine_skew`
                             if automatic detection is performed.

        Returns:
            A deskewed PIL.Image.Image object, or None if rendering/rotation fails.

        Raises:
            ImportError: If the 'deskew' library is not installed.
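        Example:
            A minimal sketch (the output file name is an illustrative assumption):

            ```python
            img = page.deskew(resolution=300)
            if img:
                img.save("page_deskewed.png")
            ```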
        """
        if not DESKEW_AVAILABLE:
            raise ImportError(
                "Deskew library not found. Install with: pip install natural-pdf[deskew]"
            )

        # Determine the angle to use
        rotation_angle = angle
        if rotation_angle is None:
            # Detect angle (or use cached) if not explicitly provided
            rotation_angle = self.detect_skew_angle(
                resolution=detection_resolution, **deskew_kwargs
            )

        logger.debug(
            f"Page {self.number}: Preparing to deskew (output resolution={resolution} DPI). Using angle: {rotation_angle}"
        )

        try:
            # Render the original page at the desired output resolution
            # Use render() for clean image without highlights
            img = self.render(resolution=resolution)
            if not img:
                logger.error(f"Page {self.number}: Failed to render image for deskewing.")
                return None

            # Rotate if a significant angle was found/provided
            if rotation_angle is not None and abs(rotation_angle) > 0.05:
                logger.debug(f"Page {self.number}: Rotating by {rotation_angle:.2f} degrees.")
                # Determine fill color based on image mode
                fill = (255, 255, 255) if img.mode == "RGB" else 255  # White background
                # Rotate the image using PIL
                rotated_img = img.rotate(
                    rotation_angle,  # deskew provides angle, PIL rotates counter-clockwise
                    resample=Image.Resampling.BILINEAR,
                    expand=True,  # Expand image to fit rotated content
                    fillcolor=fill,
                )
                return rotated_img
            else:
                logger.debug(
                    f"Page {self.number}: No significant rotation needed (angle={rotation_angle}). Returning original render."
                )
                return img  # Return the original rendered image if no rotation needed

        except Exception as e:
            logger.error(
                f"Page {self.number}: Error during deskewing image generation: {e}", exc_info=True
            )
            return None

    # --- End Skew Detection and Correction --- #

    # ------------------------------------------------------------------
    # Unified analysis storage (maps to metadata["analysis"])
    # ------------------------------------------------------------------

    @property
    def analyses(self) -> Dict[str, Any]:
        if not hasattr(self, "metadata") or self.metadata is None:
            self.metadata = {}
        return self.metadata.setdefault("analysis", {})

    @analyses.setter
    def analyses(self, value: Dict[str, Any]):
        if not hasattr(self, "metadata") or self.metadata is None:
            self.metadata = {}
        self.metadata["analysis"] = value

    def inspect(self, limit: int = 30) -> "InspectionSummary":
        """
        Inspect all elements on this page with detailed tabular view.
        Equivalent to page.find_all('*').inspect().

        Args:
            limit: Maximum elements per type to show (default: 30)

        Returns:
            InspectionSummary with element tables showing coordinates,
            properties, and other details for each element
        """
        return self.find_all("*").inspect(limit=limit)

    def remove_text_layer(self) -> "Page":
        """
        Remove all text elements from this page.

        This removes all text elements (words and characters) from the page,
        effectively clearing the text layer.

        Returns:
            Self for method chaining
        """
        logger.info(f"Page {self.number}: Removing all text elements...")

        # Remove all words and chars from the element manager
        removed_words = len(self._element_mgr.words)
        removed_chars = len(self._element_mgr.chars)

        # Clear the lists
        self._element_mgr._elements["words"] = []
        self._element_mgr._elements["chars"] = []

        logger.info(
            f"Page {self.number}: Removed {removed_words} words and {removed_chars} characters"
        )
        return self

    def _apply_rtl_processing_to_text(self, text: str) -> str:
        """
        Apply RTL (Right-to-Left) text processing to a string.

        This converts visual order text (as stored in PDFs) to logical order
        for proper display of Arabic, Hebrew, and other RTL scripts.

        Args:
            text: Input text string in visual order

        Returns:
            Text string in logical order
        """
        if not text or not text.strip():
            return text

        # Quick check for RTL characters - if none found, return as-is
        import unicodedata

        def _contains_rtl(s):
            return any(unicodedata.bidirectional(ch) in ("R", "AL", "AN") for ch in s)

        if not _contains_rtl(text):
            return text

        try:
            from bidi.algorithm import get_display  # type: ignore

            from natural_pdf.utils.bidi_mirror import mirror_brackets

            # Apply BiDi algorithm to convert from visual to logical order
            # Process line by line to handle mixed content properly
            processed_lines = []
            for line in text.split("\n"):
                if line.strip():
                    # Determine base direction for this line
                    base_dir = "R" if _contains_rtl(line) else "L"
                    logical_line = get_display(line, base_dir=base_dir)
                    # Apply bracket mirroring for correct logical order
                    processed_lines.append(mirror_brackets(logical_line))
                else:
                    processed_lines.append(line)

            return "\n".join(processed_lines)

        except Exception:
            # If the bidi library is not available (ImportError) or fails,
            # return the original text unchanged
            return text

    @property
    def lines(self) -> List[Any]:
        """Get all line elements on this page."""
        return self._element_mgr.lines

    # ------------------------------------------------------------------
    # Image elements
    # ------------------------------------------------------------------

    @property
    def images(self) -> List[Any]:
        """Get all embedded raster images on this page."""
        return self._element_mgr.images

    def highlights(self, show: bool = False) -> "HighlightContext":
        """
        Create a highlight context for accumulating highlights.

        This allows for clean syntax to show multiple highlight groups:

        Example:
            with page.highlights() as h:
                h.add(page.find_all('table'), label='tables', color='blue')
                h.add(page.find_all('text:bold'), label='bold text', color='red')
                h.show()

        Or with automatic display:
            with page.highlights(show=True) as h:
                h.add(page.find_all('table'), label='tables')
                h.add(page.find_all('text:bold'), label='bold')
                # Automatically shows when exiting the context

        Args:
            show: If True, automatically show highlights when exiting context

        Returns:
            HighlightContext for accumulating highlights
        """
        from natural_pdf.core.highlighting_service import HighlightContext

        return HighlightContext(self, show_on_exit=show)
Attributes
natural_pdf.Page.chars property

Get all character elements on this page.

natural_pdf.Page.height property

Get page height.

natural_pdf.Page.images property

Get all embedded raster images on this page.

natural_pdf.Page.index property

Get page index (0-based).

natural_pdf.Page.layout_analyzer property

Get or create the layout analyzer for this page.

natural_pdf.Page.lines property

Get all line elements on this page.

natural_pdf.Page.number property

Get page number (1-based).

natural_pdf.Page.page_number property

Get page number (1-based).

natural_pdf.Page.pdf property

Provides public access to the parent PDF object.

natural_pdf.Page.rects property

Get all rectangle elements on this page.

natural_pdf.Page.size property

Get the size of the page in points.

natural_pdf.Page.skew_angle property

Get the detected skew angle for this page (if calculated).

natural_pdf.Page.text_style_labels property

Get a sorted list of unique text style labels found on the page.

Runs text style analysis with default options if it hasn't been run yet. To use custom options, call analyze_text_styles(options=...) explicitly first.

Returns:

- List[str]: A sorted list of unique style label strings.

natural_pdf.Page.width property

Get page width.

natural_pdf.Page.words property

Get all word elements on this page.
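
Example

A minimal sketch of reading these properties (names as documented above; printed values are illustrative):

pdf = npdf.PDF("document.pdf")
page = pdf.pages[0]
print(page.number, page.index)        # 1-based vs. 0-based
print(page.size)                      # (width, height) in PDF points
print(len(page.words), len(page.images), len(page.lines))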

Functions
natural_pdf.Page.__init__(page, parent, index, font_attrs=None, load_text=True)

Initialize a page wrapper.

Creates an enhanced Page object that wraps a pdfplumber page with additional functionality for spatial navigation, analysis, and AI-powered extraction.

Parameters:

- page (Page, required): The underlying pdfplumber page object that provides raw PDF data.
- parent (PDF, required): Parent PDF object that contains this page and provides access to managers and global settings.
- index (int, required): Zero-based index of this page in the PDF document.
- font_attrs (default: None): List of font attributes to consider when grouping characters into words. Common attributes include ['fontname', 'size', 'flags']. If None, uses default character-to-word grouping rules.
- load_text (bool, default: True): If True, load and process text elements from the PDF's text layer. If False, skip text layer processing (useful for OCR-only workflows).
Note

This constructor is typically called automatically when accessing pages through the PDF.pages collection. Direct instantiation is rarely needed.

Example
# Pages are usually accessed through the PDF object
pdf = npdf.PDF("document.pdf")
page = pdf.pages[0]  # Page object created automatically

# Direct construction (advanced usage)
import pdfplumber
with pdfplumber.open("document.pdf") as plumber_pdf:
    plumber_page = plumber_pdf.pages[0]
    page = Page(plumber_page, pdf, 0, load_text=True)
Source code in natural_pdf/core/page.py
def __init__(
    self,
    page: "pdfplumber.page.Page",
    parent: "PDF",
    index: int,
    font_attrs=None,
    load_text: bool = True,
):
    """Initialize a page wrapper.

    Creates an enhanced Page object that wraps a pdfplumber page with additional
    functionality for spatial navigation, analysis, and AI-powered extraction.

    Args:
        page: The underlying pdfplumber page object that provides raw PDF data.
        parent: Parent PDF object that contains this page and provides access
            to managers and global settings.
        index: Zero-based index of this page in the PDF document.
        font_attrs: List of font attributes to consider when grouping characters
            into words. Common attributes include ['fontname', 'size', 'flags'].
            If None, uses default character-to-word grouping rules.
        load_text: If True, load and process text elements from the PDF's text layer.
            If False, skip text layer processing (useful for OCR-only workflows).

    Note:
        This constructor is typically called automatically when accessing pages
        through the PDF.pages collection. Direct instantiation is rarely needed.

    Example:
        ```python
        # Pages are usually accessed through the PDF object
        pdf = npdf.PDF("document.pdf")
        page = pdf.pages[0]  # Page object created automatically

        # Direct construction (advanced usage)
        import pdfplumber
        with pdfplumber.open("document.pdf") as plumber_pdf:
            plumber_page = plumber_pdf.pages[0]
            page = Page(plumber_page, pdf, 0, load_text=True)
        ```
    """
    self._page = page
    self._parent = parent
    self._index = index
    self._load_text = load_text
    self._text_styles = None  # Lazy-loaded text style analyzer results
    self._exclusions = []  # List to store exclusion functions/regions
    self._skew_angle: Optional[float] = None  # Stores detected skew angle

    # --- ADDED --- Metadata store for mixins
    self.metadata: Dict[str, Any] = {}
    # --- END ADDED ---

    # Region management
    self._regions = {
        "detected": [],  # Layout detection results
        "named": {},  # Named regions (name -> region)
    }

    # -------------------------------------------------------------
    # Page-scoped configuration begins as a shallow copy of the parent
    # PDF-level configuration so that auto-computed tolerances or other
    # page-specific values do not overwrite siblings.
    # -------------------------------------------------------------
    self._config = dict(getattr(self._parent, "_config", {}))

    # Initialize ElementManager, passing font_attrs
    self._element_mgr = ElementManager(self, font_attrs=font_attrs, load_text=self._load_text)
    # self._highlighter = HighlightingService(self) # REMOVED - Use property accessor
    # --- NEW --- Central registry for analysis results
    self.analyses: Dict[str, Any] = {}

    # --- Get OCR Manager Instance ---
    if (
        OCRManager
        and hasattr(parent, "_ocr_manager")
        and isinstance(parent._ocr_manager, OCRManager)
    ):
        self._ocr_manager = parent._ocr_manager
        logger.debug(f"Page {self.number}: Using OCRManager instance from parent PDF.")
    else:
        self._ocr_manager = None
        if OCRManager:
            logger.warning(
                f"Page {self.number}: OCRManager instance not found on parent PDF object."
            )

    # --- Get Layout Manager Instance ---
    if (
        LayoutManager
        and hasattr(parent, "_layout_manager")
        and isinstance(parent._layout_manager, LayoutManager)
    ):
        self._layout_manager = parent._layout_manager
        logger.debug(f"Page {self.number}: Using LayoutManager instance from parent PDF.")
    else:
        self._layout_manager = None
        if LayoutManager:
            logger.warning(
                f"Page {self.number}: LayoutManager instance not found on parent PDF object. Layout analysis will fail."
            )

    # Initialize the internal variable with a single underscore
    self._layout_analyzer = None

    self._load_elements()
    self._to_image_cache: Dict[tuple, Optional["Image.Image"]] = {}

    # Flag to prevent infinite recursion when computing exclusions
    self._computing_exclusions = False
natural_pdf.Page.__repr__()

String representation of the page.

Source code in natural_pdf/core/page.py
def __repr__(self) -> str:
    """String representation of the page."""
    return f"<Page number={self.number} index={self.index}>"
natural_pdf.Page.add_exclusion(exclusion_func_or_region, label=None, method='region')

Add an exclusion to the page. Text from these regions will be excluded from extraction. Ensures non-callable items are stored as Region objects if possible.

Parameters:

- exclusion_func_or_region (Union[Callable[[Page], Region], Region, List[Any], Tuple[Any, ...], Any], required): Either a callable function returning a Region, a Region object, a list/tuple of regions or elements, or another object with a valid .bbox attribute.
- label (Optional[str], default: None): Optional label for this exclusion (e.g., 'header', 'footer').
- method (str, default: 'region'): Exclusion method - 'region' (exclude all elements in bounding box) or 'element' (exclude only the specific elements).

Returns:

- Page: Self for method chaining.

Raises:

- TypeError: If a non-callable, non-Region object without a valid bbox is provided.
- ValueError: If method is not 'region' or 'element'.

Source code in natural_pdf/core/page.py
def add_exclusion(
    self,
    exclusion_func_or_region: Union[
        Callable[["Page"], "Region"], "Region", List[Any], Tuple[Any, ...], Any
    ],
    label: Optional[str] = None,
    method: str = "region",
) -> "Page":
    """
    Add an exclusion to the page. Text from these regions will be excluded from extraction.
    Ensures non-callable items are stored as Region objects if possible.

    Args:
        exclusion_func_or_region: Either a callable function returning a Region,
                                  a Region object, a list/tuple of regions or elements,
                                  or another object with a valid .bbox attribute.
        label: Optional label for this exclusion (e.g., 'header', 'footer').
        method: Exclusion method - 'region' (exclude all elements in bounding box) or
                'element' (exclude only the specific elements). Default: 'region'.

    Returns:
        Self for method chaining

    Raises:
        TypeError: If a non-callable, non-Region object without a valid bbox is provided.
        ValueError: If method is not 'region' or 'element'.
    """
    # Validate method parameter
    if method not in ("region", "element"):
        raise ValueError(f"Invalid exclusion method '{method}'. Must be 'region' or 'element'.")

    # ------------------------------------------------------------------
    # NEW: Handle selector strings and ElementCollection instances
    # ------------------------------------------------------------------
    # If a user supplies a selector string (e.g. "text:bold") we resolve it
    # immediately *on this page* to the matching elements and turn each into
    # a Region object which is added to the internal exclusions list.
    #
    # Likewise, if an ElementCollection is passed we iterate over its
    # elements and create Regions for each one.
    # ------------------------------------------------------------------
    # Import ElementCollection from the new module path (old path removed)
    from natural_pdf.elements.element_collection import ElementCollection

    # Selector string ---------------------------------------------------
    if isinstance(exclusion_func_or_region, str):
        selector_str = exclusion_func_or_region
        matching_elements = self.find_all(selector_str, apply_exclusions=False)

        if not matching_elements:
            logger.warning(
                f"Page {self.index}: Selector '{selector_str}' returned no elements – no exclusions added."
            )
        else:
            if method == "element":
                # Store the actual elements for element-based exclusion
                for el in matching_elements:
                    self._exclusions.append((el, label, method))
                    logger.debug(
                        f"Page {self.index}: Added element exclusion from selector '{selector_str}' -> {el}"
                    )
            else:  # method == "region"
                for el in matching_elements:
                    try:
                        bbox_coords = (
                            float(el.x0),
                            float(el.top),
                            float(el.x1),
                            float(el.bottom),
                        )
                        region = Region(self, bbox_coords, label=label)
                        # Store directly as a Region tuple so we don't recurse endlessly
                        self._exclusions.append((region, label, method))
                        logger.debug(
                            f"Page {self.index}: Added exclusion region from selector '{selector_str}' -> {bbox_coords}"
                        )
                    except Exception as e:
                        # Re-raise so calling code/test sees the failure immediately
                        logger.error(
                            f"Page {self.index}: Failed to create exclusion region from element {el}: {e}",
                            exc_info=False,
                        )
                        raise
        # Invalidate ElementManager cache since exclusions affect element filtering
        if hasattr(self, "_element_mgr") and self._element_mgr:
            self._element_mgr.invalidate_cache()
        return self  # Completed processing for selector input

    # ElementCollection -----------------------------------------------
    if isinstance(exclusion_func_or_region, ElementCollection):
        if method == "element":
            # Store the actual elements for element-based exclusion
            for el in exclusion_func_or_region:
                self._exclusions.append((el, label, method))
                logger.debug(
                    f"Page {self.index}: Added element exclusion from ElementCollection -> {el}"
                )
        else:  # method == "region"
            # Convert each element to a Region and add
            for el in exclusion_func_or_region:
                try:
                    if not (hasattr(el, "bbox") and len(el.bbox) == 4):
                        logger.warning(
                            f"Page {self.index}: Skipping element without bbox in ElementCollection exclusion: {el}"
                        )
                        continue
                    bbox_coords = tuple(float(v) for v in el.bbox)
                    region = Region(self, bbox_coords, label=label)
                    self._exclusions.append((region, label, method))
                    logger.debug(
                        f"Page {self.index}: Added exclusion region from ElementCollection element {bbox_coords}"
                    )
                except Exception as e:
                    logger.error(
                        f"Page {self.index}: Failed to convert ElementCollection element to Region: {e}",
                        exc_info=False,
                    )
                    raise
        # Invalidate ElementManager cache since exclusions affect element filtering
        if hasattr(self, "_element_mgr") and self._element_mgr:
            self._element_mgr.invalidate_cache()
        return self  # Completed processing for ElementCollection input

    # ------------------------------------------------------------------
    # Existing logic (callable, Region, bbox-bearing objects)
    # ------------------------------------------------------------------
    exclusion_data = None  # Initialize exclusion data

    if callable(exclusion_func_or_region):
        # Store callable functions along with their label and method
        exclusion_data = (exclusion_func_or_region, label, method)
        logger.debug(
            f"Page {self.index}: Added callable exclusion '{label}' with method '{method}': {exclusion_func_or_region}"
        )
    elif isinstance(exclusion_func_or_region, Region):
        # Store Region objects directly, assigning the label
        exclusion_func_or_region.label = label  # Assign label
        exclusion_data = (
            exclusion_func_or_region,
            label,
            method,
        )  # Store as tuple for consistency
        logger.debug(
            f"Page {self.index}: Added Region exclusion '{label}' with method '{method}': {exclusion_func_or_region}"
        )
    elif (
        hasattr(exclusion_func_or_region, "bbox")
        and isinstance(getattr(exclusion_func_or_region, "bbox", None), (tuple, list))
        and len(exclusion_func_or_region.bbox) == 4
    ):
        if method == "element":
            # For element method, store the element directly
            exclusion_data = (exclusion_func_or_region, label, method)
            logger.debug(
                f"Page {self.index}: Added element exclusion '{label}': {exclusion_func_or_region}"
            )
        else:  # method == "region"
            # Convert objects with a valid bbox to a Region before storing
            try:
                bbox_coords = tuple(float(v) for v in exclusion_func_or_region.bbox)
                # Pass the label to the Region constructor
                region_to_add = Region(self, bbox_coords, label=label)
                exclusion_data = (region_to_add, label, method)  # Store as tuple
                logger.debug(
                    f"Page {self.index}: Added exclusion '{label}' with method '{method}' converted to Region from {type(exclusion_func_or_region)}: {region_to_add}"
                )
            except (ValueError, TypeError, Exception) as e:
                # Raise an error if conversion fails
                raise TypeError(
                    f"Failed to convert exclusion object {exclusion_func_or_region} with bbox {getattr(exclusion_func_or_region, 'bbox', 'N/A')} to Region: {e}"
                ) from e
    elif isinstance(exclusion_func_or_region, (list, tuple)):
        # Handle lists/tuples of regions or elements
        if not exclusion_func_or_region:
            logger.warning(f"Page {self.index}: Empty list provided for exclusion, ignoring.")
            return self

        if method == "element":
            # Store each element directly
            for item in exclusion_func_or_region:
                if hasattr(item, "bbox") and len(getattr(item, "bbox", [])) == 4:
                    self._exclusions.append((item, label, method))
                    logger.debug(
                        f"Page {self.index}: Added element exclusion from list -> {item}"
                    )
                else:
                    logger.warning(
                        f"Page {self.index}: Skipping item without valid bbox in list: {item}"
                    )
        else:  # method == "region"
            # Convert each item to a Region and add
            for item in exclusion_func_or_region:
                try:
                    if isinstance(item, Region):
                        item.label = label
                        self._exclusions.append((item, label, method))
                        logger.debug(f"Page {self.index}: Added Region from list: {item}")
                    elif hasattr(item, "bbox") and len(getattr(item, "bbox", [])) == 4:
                        bbox_coords = tuple(float(v) for v in item.bbox)
                        region = Region(self, bbox_coords, label=label)
                        self._exclusions.append((region, label, method))
                        logger.debug(
                            f"Page {self.index}: Added exclusion region from list item {bbox_coords}"
                        )
                    else:
                        logger.warning(
                            f"Page {self.index}: Skipping item without valid bbox in list: {item}"
                        )
                except Exception as e:
                    logger.error(
                        f"Page {self.index}: Failed to convert list item to Region: {e}"
                    )
                    continue
        # Invalidate ElementManager cache since exclusions affect element filtering
        if hasattr(self, "_element_mgr") and self._element_mgr:
            self._element_mgr.invalidate_cache()
        return self
    else:
        # Reject invalid types
        raise TypeError(
            f"Invalid exclusion type: {type(exclusion_func_or_region)}. Must be callable, Region, list/tuple of regions/elements, or have a valid .bbox attribute."
        )

    # Append the stored data (tuple of object/callable, label, and method)
    if exclusion_data:
        self._exclusions.append(exclusion_data)

    # Invalidate ElementManager cache since exclusions affect element filtering
    if hasattr(self, "_element_mgr") and self._element_mgr:
        self._element_mgr.invalidate_cache()

    return self
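
Example

A minimal usage sketch. The selector string and page.region() follow the conventions shown elsewhere in this reference; extract_text() is assumed to honor exclusions as described above:

page = pdf.pages[0]

# Exclude a 50-point header strip (region method: everything in the bbox)
page.add_exclusion(page.region(0, 0, page.width, 50), label="header")

# Exclude only the matching elements themselves
page.add_exclusion("text:bold", label="bold", method="element")

text = page.extract_text()  # content in excluded areas is skipped
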
natural_pdf.Page.add_highlight(bbox=None, color=None, label=None, use_color_cycling=False, element=None, annotate=None, existing='append')

Add a highlight to a bounding box or the entire page. Delegates to the central HighlightingService.

Parameters:

- bbox (Optional[Tuple[float, float, float, float]], default: None): Bounding box (x0, top, x1, bottom). If None, highlight entire page.
- color (Optional[Union[Tuple, str]], default: None): RGBA color tuple/string for the highlight.
- label (Optional[str], default: None): Optional label for the highlight.
- use_color_cycling (bool, default: False): If True and no label/color, use next cycle color.
- element (Optional[Any], default: None): Optional original element being highlighted (for attribute extraction).
- annotate (Optional[List[str]], default: None): List of attribute names from 'element' to display.
- existing (str, default: 'append'): How to handle existing highlights ('append' or 'replace').

Returns:

- Page: Self for method chaining.

Source code in natural_pdf/core/page.py
def add_highlight(
    self,
    bbox: Optional[Tuple[float, float, float, float]] = None,
    color: Optional[Union[Tuple, str]] = None,
    label: Optional[str] = None,
    use_color_cycling: bool = False,
    element: Optional[Any] = None,
    annotate: Optional[List[str]] = None,
    existing: str = "append",
) -> "Page":
    """
    Add a highlight to a bounding box or the entire page.
    Delegates to the central HighlightingService.

    Args:
        bbox: Bounding box (x0, top, x1, bottom). If None, highlight entire page.
        color: RGBA color tuple/string for the highlight.
        label: Optional label for the highlight.
        use_color_cycling: If True and no label/color, use next cycle color.
        element: Optional original element being highlighted (for attribute extraction).
        annotate: List of attribute names from 'element' to display.
        existing: How to handle existing highlights ('append' or 'replace').

    Returns:
        Self for method chaining.
    """
    target_bbox = bbox if bbox is not None else (0, 0, self.width, self.height)
    self._highlighter.add(
        page_index=self.index,
        bbox=target_bbox,
        color=color,
        label=label,
        use_color_cycling=use_color_cycling,
        element=element,
        annotate=annotate,
        existing=existing,
    )
    return self
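
Example

A minimal sketch of direct highlighting (page.show() is assumed here to render the accumulated highlights; for grouped highlights see page.highlights() above):

page.add_highlight(bbox=(50, 100, 300, 150), color="red", label="title")
page.add_highlight(label="full page")  # no bbox highlights the entire page
page.show()
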
natural_pdf.Page.add_highlight_polygon(polygon, color=None, label=None, use_color_cycling=False, element=None, annotate=None, existing='append')

Highlight a polygon shape on the page. Delegates to the central HighlightingService.

Parameters:

- polygon (List[Tuple[float, float]], required): List of (x, y) points defining the polygon.
- color (Optional[Union[Tuple, str]], default: None): RGBA color tuple/string for the highlight.
- label (Optional[str], default: None): Optional label for the highlight.
- use_color_cycling (bool, default: False): If True and no label/color, use next cycle color.
- element (Optional[Any], default: None): Optional original element being highlighted (for attribute extraction).
- annotate (Optional[List[str]], default: None): List of attribute names from 'element' to display.
- existing (str, default: 'append'): How to handle existing highlights ('append' or 'replace').

Returns:

- Page: Self for method chaining.

Source code in natural_pdf/core/page.py
def add_highlight_polygon(
    self,
    polygon: List[Tuple[float, float]],
    color: Optional[Union[Tuple, str]] = None,
    label: Optional[str] = None,
    use_color_cycling: bool = False,
    element: Optional[Any] = None,
    annotate: Optional[List[str]] = None,
    existing: str = "append",
) -> "Page":
    """
    Highlight a polygon shape on the page.
    Delegates to the central HighlightingService.

    Args:
        polygon: List of (x, y) points defining the polygon.
        color: RGBA color tuple/string for the highlight.
        label: Optional label for the highlight.
        use_color_cycling: If True and no label/color, use next cycle color.
        element: Optional original element being highlighted (for attribute extraction).
        annotate: List of attribute names from 'element' to display.
        existing: How to handle existing highlights ('append' or 'replace').

    Returns:
        Self for method chaining.
    """
    self._highlighter.add_polygon(
        page_index=self.index,
        polygon=polygon,
        color=color,
        label=label,
        use_color_cycling=use_color_cycling,
        element=element,
        annotate=annotate,
        existing=existing,
    )
    return self
natural_pdf.Page.add_region(region, name=None)

Add a region to the page.

Parameters:

- region (Region, required): Region object to add.
- name (Optional[str], default: None): Optional name for the region.

Returns:

- Page: Self for method chaining.

Source code in natural_pdf/core/page.py
def add_region(self, region: "Region", name: Optional[str] = None) -> "Page":
    """
    Add a region to the page.

    Args:
        region: Region object to add
        name: Optional name for the region

    Returns:
        Self for method chaining
    """
    # Check if it's actually a Region object
    if not isinstance(region, Region):
        raise TypeError("region must be a Region object")

    # Set the source and name
    region.source = "named"

    if name:
        region.name = name
        # Add to named regions dictionary (overwriting if name already exists)
        self._regions["named"][name] = region
    else:
        # Add to detected regions list (unnamed but registered)
        self._regions["detected"].append(region)

    # Add to element manager for selector queries
    self._element_mgr.add_region(region)

    return self
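
Example

An illustrative sketch using create_region() (documented below):

header = page.create_region(0, 0, page.width, 80)
page.add_region(header, name="header")  # registered for selector queries
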
natural_pdf.Page.add_regions(regions, prefix=None)

Add multiple regions to the page.

Parameters:

- regions (List[Region], required): List of Region objects to add.
- prefix (Optional[str], default: None): Optional prefix for automatic naming (regions will be named prefix_1, prefix_2, etc.).

Returns:

- Page: Self for method chaining.

Source code in natural_pdf/core/page.py
def add_regions(self, regions: List["Region"], prefix: Optional[str] = None) -> "Page":
    """
    Add multiple regions to the page.

    Args:
        regions: List of Region objects to add
        prefix: Optional prefix for automatic naming (regions will be named prefix_1, prefix_2, etc.)

    Returns:
        Self for method chaining
    """
    if prefix:
        # Add with automatic sequential naming
        for i, region in enumerate(regions):
            self.add_region(region, name=f"{prefix}_{i+1}")
    else:
        # Add without names
        for region in regions:
            self.add_region(region)

    return self
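
Example

An illustrative sketch; the two column regions are built with create_region() (documented below):

left = page.create_region(0, 0, page.width / 2, page.height)
right = page.create_region(page.width / 2, 0, page.width, page.height)
page.add_regions([left, right], prefix="col")  # named col_1 and col_2
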
natural_pdf.Page.analyze_layout(engine=None, options=None, confidence=None, classes=None, exclude_classes=None, device=None, existing='replace', model_name=None, client=None)

Analyze the page layout using the configured LayoutManager. Adds detected Region objects to the page's element manager.

Returns:

- ElementCollection[Region]: ElementCollection containing the detected Region objects.

Source code in natural_pdf/core/page.py
def analyze_layout(
    self,
    engine: Optional[str] = None,
    options: Optional["LayoutOptions"] = None,
    confidence: Optional[float] = None,
    classes: Optional[List[str]] = None,
    exclude_classes: Optional[List[str]] = None,
    device: Optional[str] = None,
    existing: str = "replace",
    model_name: Optional[str] = None,
    client: Optional[Any] = None,  # Add client parameter
) -> "ElementCollection[Region]":
    """
    Analyze the page layout using the configured LayoutManager.
    Adds detected Region objects to the page's element manager.

    Returns:
        ElementCollection containing the detected Region objects.
    """
    analyzer = self.layout_analyzer
    if not analyzer:
        logger.error(
            "Layout analysis failed: LayoutAnalyzer not initialized (is LayoutManager available?)."
        )
        return ElementCollection([])  # Return empty collection

    # Clear existing detected regions if 'replace' is specified
    if existing == "replace":
        self.clear_detected_layout_regions()

    # The analyzer's analyze_layout method already adds regions to the page
    # and its element manager. We just need to retrieve them.
    analyzer.analyze_layout(
        engine=engine,
        options=options,
        confidence=confidence,
        classes=classes,
        exclude_classes=exclude_classes,
        device=device,
        existing=existing,
        model_name=model_name,
        client=client,  # Pass client down
    )

    # Retrieve the detected regions from the element manager
    # Filter regions based on source='detected' and potentially the model used if available
    detected_regions = [
        r
        for r in self._element_mgr.regions
        if r.source == "detected" and (not engine or getattr(r, "model", None) == engine)
    ]

    return ElementCollection(detected_regions)
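
Example

A minimal sketch; available engine names depend on which layout backends are installed, so the default engine is used here:

regions = page.analyze_layout()
print(f"Detected {len(regions)} regions")
for r in regions:
    print(getattr(r, "model", None), r.bbox)  # engine name and bounding box
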
natural_pdf.Page.analyze_text_styles(options=None)

Analyze text elements by style, adding attributes directly to elements.

This method uses TextStyleAnalyzer to process text elements (typically words) on the page. It adds the following attributes to each processed element:

- style_label: A descriptive or numeric label for the style group.
- style_key: A hashable tuple representing the style properties used for grouping.
- style_properties: A dictionary containing the extracted style properties.

Parameters:

- options (Optional[TextStyleOptions], default: None): Optional TextStyleOptions to configure the analysis. If None, the analyzer's default options are used.

Returns:

- ElementCollection: ElementCollection containing all processed text elements with added style attributes.

Source code in natural_pdf/core/page.py
def analyze_text_styles(
    self, options: Optional[TextStyleOptions] = None
) -> "ElementCollection":
    """
    Analyze text elements by style, adding attributes directly to elements.

    This method uses TextStyleAnalyzer to process text elements (typically words)
    on the page. It adds the following attributes to each processed element:
    - style_label: A descriptive or numeric label for the style group.
    - style_key: A hashable tuple representing the style properties used for grouping.
    - style_properties: A dictionary containing the extracted style properties.

    Args:
        options: Optional TextStyleOptions to configure the analysis.
                 If None, the analyzer's default options are used.

    Returns:
        ElementCollection containing all processed text elements with added style attributes.
    """
    # Create analyzer (optionally pass default options from PDF config here)
    # For now, it uses its own defaults if options=None
    analyzer = TextStyleAnalyzer()

    # Analyze the page. The analyzer now modifies elements directly
    # and returns the collection of processed elements.
    processed_elements_collection = analyzer.analyze(self, options=options)

    # Return the collection of elements which now have style attributes
    return processed_elements_collection
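
Example

An illustrative sketch; the style_label attribute and the text_style_labels property are documented above:

styled = page.analyze_text_styles()
print(page.text_style_labels)      # sorted unique style labels
for el in styled:
    print(el.style_label, el.text)
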
natural_pdf.Page.apply_ocr(engine=None, options=None, languages=None, min_confidence=None, device=None, resolution=None, detect_only=False, apply_exclusions=True, replace=True)

Apply OCR to THIS page and add results to page elements via PDF.apply_ocr.

Parameters:

- engine (Optional[str], default: None): Name of the OCR engine.
- options (Optional[OCROptions], default: None): Engine-specific options object or dict.
- languages (Optional[List[str]], default: None): List of engine-specific language codes.
- min_confidence (Optional[float], default: None): Minimum confidence threshold.
- device (Optional[str], default: None): Device to run OCR on.
- resolution (Optional[int], default: None): DPI resolution for rendering page image before OCR.
- apply_exclusions (bool, default: True): If True (default), render page image for OCR with excluded areas masked (whited out).
- detect_only (bool, default: False): If True, only detect text bounding boxes, don't perform OCR.
- replace (bool, default: True): If True (default), remove any existing OCR elements before adding new ones. If False, add new OCR elements to existing ones.

Returns:

- Page: Self for method chaining.

Source code in natural_pdf/core/page.py
def apply_ocr(
    self,
    engine: Optional[str] = None,
    options: Optional["OCROptions"] = None,
    languages: Optional[List[str]] = None,
    min_confidence: Optional[float] = None,
    device: Optional[str] = None,
    resolution: Optional[int] = None,
    detect_only: bool = False,
    apply_exclusions: bool = True,
    replace: bool = True,
) -> "Page":
    """
    Apply OCR to THIS page and add results to page elements via PDF.apply_ocr.

    Args:
        engine: Name of the OCR engine.
        options: Engine-specific options object or dict.
        languages: List of engine-specific language codes.
        min_confidence: Minimum confidence threshold.
        device: Device to run OCR on.
        resolution: DPI resolution for rendering page image before OCR.
        apply_exclusions: If True (default), render page image for OCR
                          with excluded areas masked (whited out).
        detect_only: If True, only detect text bounding boxes, don't perform OCR.
        replace: If True (default), remove any existing OCR elements before
                adding new ones. If False, add new OCR elements to existing ones.

    Returns:
        Self for method chaining.
    """
    if not hasattr(self._parent, "apply_ocr"):
        logger.error(f"Page {self.number}: Parent PDF missing 'apply_ocr'. Cannot apply OCR.")
        return self  # Return self for chaining

    # Remove existing OCR elements if replace is True
    if replace and hasattr(self, "_element_mgr"):
        logger.info(
            f"Page {self.number}: Removing existing OCR elements before applying new OCR."
        )
        self._element_mgr.remove_ocr_elements()

    logger.info(f"Page {self.number}: Delegating apply_ocr to PDF.apply_ocr.")
    # Delegate to parent PDF, targeting only this page's index
    # Pass all relevant parameters through, including apply_exclusions
    self._parent.apply_ocr(
        pages=[self.index],
        engine=engine,
        options=options,
        languages=languages,
        min_confidence=min_confidence,
        device=device,
        resolution=resolution,
        detect_only=detect_only,
        apply_exclusions=apply_exclusions,
        replace=replace,  # Pass the replace parameter to PDF.apply_ocr
    )

    # Return self for chaining
    return self
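
Example

A minimal sketch; the language code is an assumption and the default engine is used:

page.apply_ocr(languages=["en"], resolution=300, min_confidence=0.5)
text = page.extract_text()  # OCR-derived elements now participate in extraction
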
natural_pdf.Page.ask(question, min_confidence=0.1, model=None, debug=False, **kwargs)

Ask a question about the page content using document QA.

Source code in natural_pdf/core/page.py
def ask(
    self,
    question: Union[str, List[str], Tuple[str, ...]],
    min_confidence: float = 0.1,
    model: str = None,
    debug: bool = False,
    **kwargs,
) -> Union[Dict[str, Any], List[Dict[str, Any]]]:
    """
    Ask a question about the page content using document QA.
    """
    try:
        from natural_pdf.qa.document_qa import get_qa_engine

        # Get or initialize QA engine with specified model
        qa_engine = get_qa_engine(model_name=model) if model else get_qa_engine()
        # Ask the question using the QA engine
        return qa_engine.ask_pdf_page(
            self, question, min_confidence=min_confidence, debug=debug, **kwargs
        )
    except ImportError:
        logger.error(
            "Question answering requires the 'natural_pdf.qa' module. Please install necessary dependencies."
        )
        return {
            "answer": None,
            "confidence": 0.0,
            "found": False,
            "page_num": self.number,
            "source_elements": [],
        }
    except Exception as e:
        logger.error(f"Error during page.ask: {e}", exc_info=True)
        return {
            "answer": None,
            "confidence": 0.0,
            "found": False,
            "page_num": self.number,
            "source_elements": [],
        }
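
Example

An illustrative sketch; the result keys mirror the dictionary shape shown in the source above:

result = page.ask("What is the total amount due?")
if result["found"]:
    print(result["answer"], f"(confidence: {result['confidence']:.2f})")
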
natural_pdf.Page.clear_detected_layout_regions()

Removes all regions from this page that were added by layout analysis (i.e., regions where source attribute is 'detected').

This clears the regions both from the page's internal _regions['detected'] list and from the ElementManager's internal list of regions.

Returns:

- Page: Self for method chaining.

Source code in natural_pdf/core/page.py
def clear_detected_layout_regions(self) -> "Page":
    """
    Removes all regions from this page that were added by layout analysis
    (i.e., regions where `source` attribute is 'detected').

    This clears the regions both from the page's internal `_regions['detected']` list
    and from the ElementManager's internal list of regions.

    Returns:
        Self for method chaining.
    """
    if (
        not hasattr(self._element_mgr, "regions")
        or not hasattr(self._element_mgr, "_elements")
        or "regions" not in self._element_mgr._elements
    ):
        logger.debug(
            f"Page {self.index}: No regions found in ElementManager, nothing to clear."
        )
        self._regions["detected"] = []  # Ensure page's list is also clear
        return self

    # Filter ElementManager's list to keep only non-detected regions
    original_count = len(self._element_mgr.regions)
    self._element_mgr._elements["regions"] = [
        r for r in self._element_mgr.regions if getattr(r, "source", None) != "detected"
    ]
    new_count = len(self._element_mgr.regions)
    removed_count = original_count - new_count

    # Clear the page's specific list of detected regions
    self._regions["detected"] = []

    logger.info(f"Page {self.index}: Cleared {removed_count} detected layout regions.")
    return self
natural_pdf.Page.clear_exclusions()

Clear all exclusions from the page.

Source code in natural_pdf/core/page.py
def clear_exclusions(self) -> "Page":
    """
    Clear all exclusions from the page.
    """
    self._exclusions = []
    return self
natural_pdf.Page.clear_highlights()

Clear all highlights from this specific page via HighlightingService.

Returns:

- Page: Self for method chaining.

Source code in natural_pdf/core/page.py
def clear_highlights(self) -> "Page":
    """
    Clear all highlights *from this specific page* via HighlightingService.

    Returns:
        Self for method chaining
    """
    self._highlighter.clear_page(self.index)
    return self
natural_pdf.Page.create_region(x0, top, x1, bottom)

Create a region on this page with the specified coordinates.

Parameters:

- x0 (float, required): Left x-coordinate.
- top (float, required): Top y-coordinate.
- x1 (float, required): Right x-coordinate.
- bottom (float, required): Bottom y-coordinate.

Returns:

- Any: Region object for the specified coordinates.

Source code in natural_pdf/core/page.py
def create_region(self, x0: float, top: float, x1: float, bottom: float) -> Any:
    """
    Create a region on this page with the specified coordinates.

    Args:
        x0: Left x-coordinate
        top: Top y-coordinate
        x1: Right x-coordinate
        bottom: Bottom y-coordinate

    Returns:
        Region object for the specified coordinates
    """
    from natural_pdf.elements.region import Region

    return Region(self, (x0, top, x1, bottom))
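
Example

An illustrative sketch (Region.extract_text() is assumed from the Region API):

left_half = page.create_region(0, 0, page.width / 2, page.height)
print(left_half.extract_text())
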
natural_pdf.Page.crop(bbox=None, **kwargs)

Crop the page to the specified bounding box.

This is a direct wrapper around pdfplumber's crop method.

Parameters:

- bbox (default: None): Bounding box (x0, top, x1, bottom) or None.
- **kwargs (default: {}): Additional parameters (top, bottom, left, right).

Returns:

- Any: Cropped page object (pdfplumber.Page).

Source code in natural_pdf/core/page.py
def crop(self, bbox=None, **kwargs) -> Any:
    """
    Crop the page to the specified bounding box.

    This is a direct wrapper around pdfplumber's crop method.

    Args:
        bbox: Bounding box (x0, top, x1, bottom) or None
        **kwargs: Additional parameters (top, bottom, left, right)

    Returns:
        Cropped page object (pdfplumber.Page)
    """
    # Returns the pdfplumber page object, not a natural-pdf Page
    return self._page.crop(bbox, **kwargs)
natural_pdf.Page.deskew(resolution=300, angle=None, detection_resolution=72, **deskew_kwargs)

Creates and returns a deskewed PIL image of the page.

If angle is not provided, it will first try to detect the skew angle using detect_skew_angle (or use the cached angle if available).

Parameters:

- resolution (int, default: 300): DPI resolution for the output deskewed image.
- angle (Optional[float], default: None): The specific angle (in degrees) to rotate by. If None, detects automatically.
- detection_resolution (int, default: 72): DPI resolution used for detection if angle is None.
- **deskew_kwargs (default: {}): Additional keyword arguments passed to deskew.determine_skew if automatic detection is performed.

Returns:

- Optional[Image]: A deskewed PIL.Image.Image object, or None if rendering/rotation fails.

Raises:

- ImportError: If the 'deskew' library is not installed.

Source code in natural_pdf/core/page.py
def deskew(
    self,
    resolution: int = 300,
    angle: Optional[float] = None,
    detection_resolution: int = 72,
    **deskew_kwargs,
) -> Optional[Image.Image]:
    """
    Creates and returns a deskewed PIL image of the page.

    If `angle` is not provided, it will first try to detect the skew angle
    using `detect_skew_angle` (or use the cached angle if available).

    Args:
        resolution: DPI resolution for the output deskewed image.
        angle: The specific angle (in degrees) to rotate by. If None, detects automatically.
        detection_resolution: DPI resolution used for detection if `angle` is None.
        **deskew_kwargs: Additional keyword arguments passed to `deskew.determine_skew`
                         if automatic detection is performed.

    Returns:
        A deskewed PIL.Image.Image object, or None if rendering/rotation fails.

    Raises:
        ImportError: If the 'deskew' library is not installed.
    """
    if not DESKEW_AVAILABLE:
        raise ImportError(
            "Deskew library not found. Install with: pip install natural-pdf[deskew]"
        )

    # Determine the angle to use
    rotation_angle = angle
    if rotation_angle is None:
        # Detect angle (or use cached) if not explicitly provided
        rotation_angle = self.detect_skew_angle(
            resolution=detection_resolution, **deskew_kwargs
        )

    logger.debug(
        f"Page {self.number}: Preparing to deskew (output resolution={resolution} DPI). Using angle: {rotation_angle}"
    )

    try:
        # Render the original page at the desired output resolution
        # Use render() for clean image without highlights
        img = self.render(resolution=resolution)
        if not img:
            logger.error(f"Page {self.number}: Failed to render image for deskewing.")
            return None

        # Rotate if a significant angle was found/provided
        if rotation_angle is not None and abs(rotation_angle) > 0.05:
            logger.debug(f"Page {self.number}: Rotating by {rotation_angle:.2f} degrees.")
            # Determine fill color based on image mode
            fill = (255, 255, 255) if img.mode == "RGB" else 255  # White background
            # Rotate the image using PIL
            rotated_img = img.rotate(
                rotation_angle,  # deskew provides angle, PIL rotates counter-clockwise
                resample=Image.Resampling.BILINEAR,
                expand=True,  # Expand image to fit rotated content
                fillcolor=fill,
            )
            return rotated_img
        else:
            logger.debug(
                f"Page {self.number}: No significant rotation needed (angle={rotation_angle}). Returning original render."
            )
            return img  # Return the original rendered image if no rotation needed

    except Exception as e:
        logger.error(
            f"Page {self.number}: Error during deskewing image generation: {e}", exc_info=True
        )
        return None
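
Example

A minimal sketch; detect_skew_angle() (documented below) caches its result, so deskew() reuses it:

angle = page.detect_skew_angle()
if angle is not None:
    img = page.deskew(resolution=300)
    if img:
        img.save("page_deskewed.png")
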
natural_pdf.Page.detect_skew_angle(resolution=72, grayscale=True, force_recalculate=False, **deskew_kwargs)

Detects the skew angle of the page image and stores it.

Parameters:

- resolution (int, default: 72): DPI resolution for rendering the page image for detection.
- grayscale (bool, default: True): Whether to convert the image to grayscale before detection.
- force_recalculate (bool, default: False): If True, recalculate even if an angle exists.
- **deskew_kwargs (default: {}): Additional keyword arguments passed to deskew.determine_skew (e.g., max_angle, num_peaks).

Returns:

- Optional[float]: The detected skew angle in degrees, or None if detection failed.

Raises:

- ImportError: If the 'deskew' library is not installed.

Source code in natural_pdf/core/page.py
def detect_skew_angle(
    self,
    resolution: int = 72,
    grayscale: bool = True,
    force_recalculate: bool = False,
    **deskew_kwargs,
) -> Optional[float]:
    """
    Detects the skew angle of the page image and stores it.

    Args:
        resolution: DPI resolution for rendering the page image for detection.
        grayscale: Whether to convert the image to grayscale before detection.
        force_recalculate: If True, recalculate even if an angle exists.
        **deskew_kwargs: Additional keyword arguments passed to `deskew.determine_skew`
                         (e.g., `max_angle`, `num_peaks`).

    Returns:
        The detected skew angle in degrees, or None if detection failed.

    Raises:
        ImportError: If the 'deskew' library is not installed.
    """
    if not DESKEW_AVAILABLE:
        raise ImportError(
            "Deskew library not found. Install with: pip install natural-pdf[deskew]"
        )

    if self._skew_angle is not None and not force_recalculate:
        logger.debug(f"Page {self.number}: Returning cached skew angle: {self._skew_angle:.2f}")
        return self._skew_angle

    logger.debug(f"Page {self.number}: Detecting skew angle (resolution={resolution} DPI)...")
    try:
        # Render the page at the specified detection resolution
        # Use render() for clean image without highlights
        img = self.render(resolution=resolution)
        if not img:
            logger.warning(f"Page {self.number}: Failed to render image for skew detection.")
            self._skew_angle = None
            return None

        # Convert to numpy array
        img_np = np.array(img)

        # Convert to grayscale if needed
        if grayscale:
            if len(img_np.shape) == 3 and img_np.shape[2] >= 3:
                gray_np = np.mean(img_np[:, :, :3], axis=2).astype(np.uint8)
            elif len(img_np.shape) == 2:
                gray_np = img_np  # Already grayscale
            else:
                logger.warning(
                    f"Page {self.number}: Unexpected image shape {img_np.shape} for grayscale conversion."
                )
                gray_np = img_np  # Try using it anyway
        else:
            gray_np = img_np  # Use original if grayscale=False

        # Determine skew angle using the deskew library
        angle = determine_skew(gray_np, **deskew_kwargs)
        self._skew_angle = angle
        logger.debug(f"Page {self.number}: Detected skew angle = {angle}")
        return angle

    except Exception as e:
        logger.warning(f"Page {self.number}: Failed during skew detection: {e}", exc_info=True)
        self._skew_angle = None
        return None
natural_pdf.Page.extract_ocr_elements(engine=None, options=None, languages=None, min_confidence=None, device=None, resolution=None)

Extract text elements using OCR without adding them to the page's elements. Uses the shared OCRManager instance.

Parameters:

- engine (Optional[str], default: None): Name of the OCR engine.
- options (Optional[OCROptions], default: None): Engine-specific options object or dict.
- languages (Optional[List[str]], default: None): List of engine-specific language codes.
- min_confidence (Optional[float], default: None): Minimum confidence threshold.
- device (Optional[str], default: None): Device to run OCR on.
- resolution (Optional[int], default: None): DPI resolution for rendering page image before OCR.

Returns:

- List[TextElement]: List of created TextElement objects derived from OCR results for this page.

Source code in natural_pdf/core/page.py
def extract_ocr_elements(
    self,
    engine: Optional[str] = None,
    options: Optional["OCROptions"] = None,
    languages: Optional[List[str]] = None,
    min_confidence: Optional[float] = None,
    device: Optional[str] = None,
    resolution: Optional[int] = None,
) -> List["TextElement"]:
    """
    Extract text elements using OCR *without* adding them to the page's elements.
    Uses the shared OCRManager instance.

    Args:
        engine: Name of the OCR engine.
        options: Engine-specific options object or dict.
        languages: List of engine-specific language codes.
        min_confidence: Minimum confidence threshold.
        device: Device to run OCR on.
        resolution: DPI resolution for rendering page image before OCR.

    Returns:
        List of created TextElement objects derived from OCR results for this page.
    """
    if not self._ocr_manager:
        logger.error(
            f"Page {self.number}: OCRManager not available. Cannot extract OCR elements."
        )
        return []

    logger.info(f"Page {self.number}: Extracting OCR elements (extract only)...")

    # Determine rendering resolution
    final_resolution = resolution if resolution is not None else 150  # Default to 150 DPI
    logger.debug(f"  Using rendering resolution: {final_resolution} DPI")

    try:
        # Get base image without highlights using the determined resolution
        # Use the global PDF rendering lock
        with pdf_render_lock:
            # Use render() for clean image without highlights
            image = self.render(resolution=final_resolution)
            if not image:
                logger.error(
                    f"  Failed to render page {self.number} to image for OCR extraction."
                )
                return []
            logger.debug(f"  Rendered image size: {image.width}x{image.height}")
    except Exception as e:
        logger.error(f"  Failed to render page {self.number} to image: {e}", exc_info=True)
        return []

    # Prepare arguments for the OCR Manager call
    manager_args = {
        "images": image,
        "engine": engine,
        "languages": languages,
        "min_confidence": min_confidence,
        "device": device,
        "options": options,
    }
    manager_args = {k: v for k, v in manager_args.items() if v is not None}

    logger.debug(
        f"  Calling OCR Manager (extract only) with args: { {k:v for k,v in manager_args.items() if k != 'images'} }"
    )
    try:
        # apply_ocr now returns List[List[Dict]] or List[Dict]
        results_list = self._ocr_manager.apply_ocr(**manager_args)
        # If it returned a list of lists (batch mode), take the first list
        results = (
            results_list[0]
            if isinstance(results_list, list)
            and results_list
            and isinstance(results_list[0], list)
            else results_list
        )
        if not isinstance(results, list):
            logger.error(f"  OCR Manager returned unexpected type: {type(results)}")
            results = []
        logger.info(f"  OCR Manager returned {len(results)} results for extraction.")
    except Exception as e:
        logger.error(f"  OCR processing failed during extraction: {e}", exc_info=True)
        return []

    # Convert results but DO NOT add to ElementManager
    logger.debug(f"  Converting OCR results to TextElements (extract only)...")
    temp_elements = []
    scale_x = self.width / image.width if image.width else 1
    scale_y = self.height / image.height if image.height else 1
    for result in results:
        try:  # Added try-except around result processing
            x0, top, x1, bottom = [float(c) for c in result["bbox"]]
            elem_data = {
                "text": result["text"],
                "confidence": result["confidence"],
                "x0": x0 * scale_x,
                "top": top * scale_y,
                "x1": x1 * scale_x,
                "bottom": bottom * scale_y,
                "width": (x1 - x0) * scale_x,
                "height": (bottom - top) * scale_y,
                "object_type": "text",  # Using text for temporary elements
                "source": "ocr",
                "fontname": "OCR-extract",  # Different name for clarity
                "size": 10.0,
                "page_number": self.number,
            }
            temp_elements.append(TextElement(elem_data, self))
        except (KeyError, ValueError, TypeError) as convert_err:
            logger.warning(
                f"  Skipping invalid OCR result during conversion: {result}. Error: {convert_err}"
            )

    logger.info(f"  Created {len(temp_elements)} TextElements from OCR (extract only).")
    return temp_elements
natural_pdf.Page.extract_table(method=None, table_settings=None, use_ocr=False, ocr_config=None, text_options=None, cell_extraction_func=None, show_progress=False, content_filter=None, verticals=None, horizontals=None)

Extract the largest table from this page using enhanced region-based extraction.

Parameters:

method (Optional[str], default: None): Method to use: 'tatr', 'pdfplumber', 'text', 'stream', 'lattice', or None (auto-detect).
table_settings (Optional[dict], default: None): Settings for pdfplumber table extraction.
use_ocr (bool, default: False): Whether to use OCR for text extraction (currently only applicable with 'tatr' method).
ocr_config (Optional[dict], default: None): OCR configuration parameters.
text_options (Optional[Dict], default: None): Dictionary of options for the 'text' method.
cell_extraction_func (Optional[Callable[[Region], Optional[str]]], default: None): Optional callable function that takes a cell Region object and returns its string content. For 'text' method only.
show_progress (bool, default: False): If True, display a progress bar during cell text extraction for the 'text' method.
content_filter (default: None): Optional content filter to apply during cell text extraction. Can be:
    - A regex pattern string (characters matching the pattern are EXCLUDED)
    - A callable that takes text and returns True to KEEP the character
    - A list of regex patterns (characters matching ANY pattern are EXCLUDED)
verticals (Optional[List[float]], default: None): Optional list of x-coordinates for explicit vertical table lines.
horizontals (Optional[List[float]], default: None): Optional list of y-coordinates for explicit horizontal table lines.

Returns:

TableResult: A sequence-like object containing table rows that also provides .to_df() for pandas conversion.
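
A minimal usage sketch (the file name is hypothetical; assumes the page contains a ruled table):

import natural_pdf as npdf

page = npdf.PDF("report.pdf").pages[0]  # hypothetical path

# Auto-detect the extraction method, then convert the rows to a DataFrame
df = page.extract_table().to_df()

# Or force line-based detection with explicit column boundaries
rows = page.extract_table(method="lattice", verticals=[50, 200, 350, 500])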

Source code in natural_pdf/core/page.py
def extract_table(
    self,
    method: Optional[str] = None,
    table_settings: Optional[dict] = None,
    use_ocr: bool = False,
    ocr_config: Optional[dict] = None,
    text_options: Optional[Dict] = None,
    cell_extraction_func: Optional[Callable[["Region"], Optional[str]]] = None,
    show_progress: bool = False,
    content_filter=None,
    verticals: Optional[List[float]] = None,
    horizontals: Optional[List[float]] = None,
) -> TableResult:
    """
    Extract the largest table from this page using enhanced region-based extraction.

    Args:
        method: Method to use: 'tatr', 'pdfplumber', 'text', 'stream', 'lattice', or None (auto-detect).
        table_settings: Settings for pdfplumber table extraction.
        use_ocr: Whether to use OCR for text extraction (currently only applicable with 'tatr' method).
        ocr_config: OCR configuration parameters.
        text_options: Dictionary of options for the 'text' method.
        cell_extraction_func: Optional callable function that takes a cell Region object
                              and returns its string content. For 'text' method only.
        show_progress: If True, display a progress bar during cell text extraction for the 'text' method.
        content_filter: Optional content filter to apply during cell text extraction. Can be:
            - A regex pattern string (characters matching the pattern are EXCLUDED)
            - A callable that takes text and returns True to KEEP the character
            - A list of regex patterns (characters matching ANY pattern are EXCLUDED)
        verticals: Optional list of x-coordinates for explicit vertical table lines.
        horizontals: Optional list of y-coordinates for explicit horizontal table lines.

    Returns:
        TableResult: A sequence-like object containing table rows that also provides .to_df() for pandas conversion.
    """
    # Create a full-page region and delegate to its enhanced extract_table method
    page_region = self.create_region(0, 0, self.width, self.height)
    return page_region.extract_table(
        method=method,
        table_settings=table_settings,
        use_ocr=use_ocr,
        ocr_config=ocr_config,
        text_options=text_options,
        cell_extraction_func=cell_extraction_func,
        show_progress=show_progress,
        content_filter=content_filter,
        verticals=verticals,
        horizontals=horizontals,
    )
natural_pdf.Page.extract_tables(method=None, table_settings=None, check_tatr=True)

Extract all tables from this page with enhanced method support.

Parameters:

method (Optional[str], default: None): Method to use: 'pdfplumber', 'stream', 'lattice', or None (auto-detect). 'stream' uses text-based strategies, 'lattice' uses line-based strategies. Note: 'tatr' and 'text' methods are not supported for extract_tables.
table_settings (Optional[dict], default: None): Settings for pdfplumber table extraction.
check_tatr (bool, default: True): If True (default), first check for TATR-detected table regions and extract from those before falling back to pdfplumber methods.

Returns:

List[List[List[str]]]: List of tables, where each table is a list of rows, and each row is a list of cell values.
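
A short sketch of pulling every table from a page at once (the file name is hypothetical):

import natural_pdf as npdf

page = npdf.PDF("report.pdf").pages[0]  # hypothetical path

for i, table in enumerate(page.extract_tables()):
    rows = len(table)
    cols = len(table[0]) if table else 0
    print(f"Table {i}: {rows} rows x {cols} columns")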

Source code in natural_pdf/core/page.py
def extract_tables(
    self,
    method: Optional[str] = None,
    table_settings: Optional[dict] = None,
    check_tatr: bool = True,
) -> List[List[List[str]]]:
    """
    Extract all tables from this page with enhanced method support.

    Args:
        method: Method to use: 'pdfplumber', 'stream', 'lattice', or None (auto-detect).
                'stream' uses text-based strategies, 'lattice' uses line-based strategies.
                Note: 'tatr' and 'text' methods are not supported for extract_tables.
        table_settings: Settings for pdfplumber table extraction.
        check_tatr: If True (default), first check for TATR-detected table regions
                    and extract from those before falling back to pdfplumber methods.

    Returns:
        List of tables, where each table is a list of rows, and each row is a list of cell values.
    """
    if table_settings is None:
        table_settings = {}

    # Check for TATR-detected table regions first if enabled
    if check_tatr:
        try:
            tatr_tables = self.find_all("region[type=table][model=tatr]")
            if tatr_tables:
                logger.debug(
                    f"Page {self.number}: Found {len(tatr_tables)} TATR table regions, extracting from those..."
                )
                extracted_tables = []
                for table_region in tatr_tables:
                    try:
                        table_data = table_region.extract_table(method="tatr")
                        if table_data:  # Only add non-empty tables
                            extracted_tables.append(table_data)
                    except Exception as e:
                        logger.warning(
                            f"Failed to extract table from TATR region {table_region.bbox}: {e}"
                        )

                if extracted_tables:
                    logger.debug(
                        f"Page {self.number}: Successfully extracted {len(extracted_tables)} tables from TATR regions"
                    )
                    return extracted_tables
                else:
                    logger.debug(
                        f"Page {self.number}: TATR regions found but no tables extracted, falling back to pdfplumber"
                    )
            else:
                logger.debug(
                    f"Page {self.number}: No TATR table regions found, using pdfplumber methods"
                )
        except Exception as e:
            logger.debug(
                f"Page {self.number}: Error checking TATR regions: {e}, falling back to pdfplumber"
            )

    # Auto-detect method if not specified (try lattice first, then stream)
    if method is None:
        logger.debug(f"Page {self.number}: Auto-detecting tables extraction method...")

        # Try lattice first
        try:
            lattice_settings = table_settings.copy()
            lattice_settings.setdefault("vertical_strategy", "lines")
            lattice_settings.setdefault("horizontal_strategy", "lines")

            logger.debug(f"Page {self.number}: Trying 'lattice' method first for tables...")
            lattice_result = self._page.extract_tables(lattice_settings)

            # Check if lattice found meaningful tables
            if (
                lattice_result
                and len(lattice_result) > 0
                and any(
                    any(
                        any(cell and cell.strip() for cell in row if cell)
                        for row in table
                        if table
                    )
                    for table in lattice_result
                )
            ):
                logger.debug(
                    f"Page {self.number}: 'lattice' method found {len(lattice_result)} tables"
                )
                return lattice_result
            else:
                logger.debug(f"Page {self.number}: 'lattice' method found no meaningful tables")

        except Exception as e:
            logger.debug(f"Page {self.number}: 'lattice' method failed: {e}")

        # Fall back to stream
        logger.debug(f"Page {self.number}: Falling back to 'stream' method for tables...")
        stream_settings = table_settings.copy()
        stream_settings.setdefault("vertical_strategy", "text")
        stream_settings.setdefault("horizontal_strategy", "text")

        return self._page.extract_tables(stream_settings)

    effective_method = method

    # Handle method aliases
    if effective_method == "stream":
        logger.debug("Using 'stream' method alias for 'pdfplumber' with text-based strategies.")
        effective_method = "pdfplumber"
        table_settings.setdefault("vertical_strategy", "text")
        table_settings.setdefault("horizontal_strategy", "text")
    elif effective_method == "lattice":
        logger.debug(
            "Using 'lattice' method alias for 'pdfplumber' with line-based strategies."
        )
        effective_method = "pdfplumber"
        table_settings.setdefault("vertical_strategy", "lines")
        table_settings.setdefault("horizontal_strategy", "lines")

    # Use the selected method
    if effective_method == "pdfplumber":
        # ---------------------------------------------------------
        # Inject auto-computed or user-specified text tolerances so
        # pdfplumber uses the same numbers we used for word grouping
        # whenever the table algorithm relies on word positions.
        # ---------------------------------------------------------
        if "text" in (
            table_settings.get("vertical_strategy"),
            table_settings.get("horizontal_strategy"),
        ):
            logger.debug("Configuring text tolerances for text-based table strategies")
            pdf_cfg = getattr(self, "_config", getattr(self._parent, "_config", {}))
            if "text_x_tolerance" not in table_settings and "x_tolerance" not in table_settings:
                x_tol = pdf_cfg.get("x_tolerance")
                if x_tol is not None:
                    table_settings.setdefault("text_x_tolerance", x_tol)
            if "text_y_tolerance" not in table_settings and "y_tolerance" not in table_settings:
                y_tol = pdf_cfg.get("y_tolerance")
                if y_tol is not None:
                    table_settings.setdefault("text_y_tolerance", y_tol)

            # pdfplumber's text strategy benefits from a tight snap tolerance.
            if (
                "snap_tolerance" not in table_settings
                and "snap_x_tolerance" not in table_settings
            ):
                # Derive from y_tol if available, else default 1
                snap = max(1, round((pdf_cfg.get("y_tolerance", 1)) * 0.9))
                table_settings.setdefault("snap_tolerance", snap)
            if (
                "join_tolerance" not in table_settings
                and "join_x_tolerance" not in table_settings
            ):
                join = table_settings.get("snap_tolerance", 1)
                table_settings.setdefault("join_tolerance", join)
                table_settings.setdefault("join_x_tolerance", join)
                table_settings.setdefault("join_y_tolerance", join)

        raw_tables = self._page.extract_tables(table_settings)

        # Apply RTL text processing to all extracted tables
        if raw_tables:
            processed_tables = []
            for table in raw_tables:
                processed_table = []
                for row in table:
                    processed_row = []
                    for cell in row:
                        if cell is not None:
                            # Apply RTL text processing to each cell
                            rtl_processed_cell = self._apply_rtl_processing_to_text(cell)
                            processed_row.append(rtl_processed_cell)
                        else:
                            processed_row.append(cell)
                    processed_table.append(processed_row)
                processed_tables.append(processed_table)
            return processed_tables

        return raw_tables
    else:
        raise ValueError(
            f"Unknown tables extraction method: '{method}'. Choose from 'pdfplumber', 'stream', 'lattice'."
        )
natural_pdf.Page.extract_text(preserve_whitespace=True, use_exclusions=True, debug_exclusions=False, content_filter=None, **kwargs)

Extract text from this page, respecting exclusions and using pdfplumber's layout engine (chars_to_textmap), which is applied by default or whenever layout arguments are provided.

Parameters:

use_exclusions (default: True): Whether to apply exclusion regions. Note: Filtering logic is now always applied if exclusions exist.
debug_exclusions (default: False): Whether to output detailed exclusion debugging info.
content_filter (default: None): Optional content filter to exclude specific text patterns. Can be:
    - A regex pattern string (characters matching the pattern are EXCLUDED)
    - A callable that takes text and returns True to KEEP the character
    - A list of regex patterns (characters matching ANY pattern are EXCLUDED)
**kwargs (default: {}): Additional layout parameters passed directly to pdfplumber's chars_to_textmap function. Common parameters include:
    - layout (bool): If True (default), inserts spaces/newlines.
    - x_density (float): Pixels per character horizontally.
    - y_density (float): Pixels per line vertically.
    - x_tolerance (float): Tolerance for horizontal character grouping.
    - y_tolerance (float): Tolerance for vertical character grouping.
    - line_dir (str): 'ttb', 'btt', 'ltr', 'rtl'
    - char_dir (str): 'ttb', 'btt', 'ltr', 'rtl'
    See pdfplumber documentation for more.

Returns:

str: Extracted text as string, potentially with layout-based spacing.
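
A brief sketch of layout-aware extraction with a content filter (file name and pattern are illustrative):

import natural_pdf as npdf

page = npdf.PDF("report.pdf").pages[0]  # hypothetical path

text = page.extract_text()  # layout-based spacing by default

# Exclude digits and tighten horizontal character grouping
filtered = page.extract_text(content_filter=r"\d", x_tolerance=1.5)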

Source code in natural_pdf/core/page.py
def extract_text(
    self,
    preserve_whitespace=True,
    use_exclusions=True,
    debug_exclusions=False,
    content_filter=None,
    **kwargs,
) -> str:
    """
    Extract text from this page, respecting exclusions and using pdfplumber's
    layout engine (chars_to_textmap), which is applied by default or whenever
    layout arguments are provided.

    Args:
        use_exclusions: Whether to apply exclusion regions (default: True).
                      Note: Filtering logic is now always applied if exclusions exist.
        debug_exclusions: Whether to output detailed exclusion debugging info (default: False).
        content_filter: Optional content filter to exclude specific text patterns. Can be:
            - A regex pattern string (characters matching the pattern are EXCLUDED)
            - A callable that takes text and returns True to KEEP the character
            - A list of regex patterns (characters matching ANY pattern are EXCLUDED)
        **kwargs: Additional layout parameters passed directly to pdfplumber's
                  `chars_to_textmap` function. Common parameters include:
                  - layout (bool): If True (default), inserts spaces/newlines.
                  - x_density (float): Pixels per character horizontally.
                  - y_density (float): Pixels per line vertically.
                  - x_tolerance (float): Tolerance for horizontal character grouping.
                  - y_tolerance (float): Tolerance for vertical character grouping.
                  - line_dir (str): 'ttb', 'btt', 'ltr', 'rtl'
                  - char_dir (str): 'ttb', 'btt', 'ltr', 'rtl'
                  See pdfplumber documentation for more.

    Returns:
        Extracted text as string, potentially with layout-based spacing.
    """
    logger.debug(f"Page {self.number}: extract_text called with kwargs: {kwargs}")
    debug = kwargs.get("debug", debug_exclusions)  # Allow 'debug' kwarg

    # 1. Get Word Elements (triggers load_elements if needed)
    word_elements = self.words
    if not word_elements:
        logger.debug(f"Page {self.number}: No word elements found.")
        return ""

    # 2. Apply element-based exclusions if enabled
    # Check both page-level and PDF-level exclusions
    has_exclusions = bool(self._exclusions) or (
        hasattr(self, "_parent")
        and self._parent
        and hasattr(self._parent, "_exclusions")
        and self._parent._exclusions
    )
    if use_exclusions and has_exclusions:
        # Filter word elements through _filter_elements_by_exclusions
        # This handles both element-based and region-based exclusions
        word_elements = self._filter_elements_by_exclusions(
            word_elements, debug_exclusions=debug
        )
        if debug:
            logger.debug(
                f"Page {self.number}: {len(word_elements)} words remaining after exclusion filtering."
            )

    # 3. Get region-based exclusions for spatial filtering
    apply_exclusions_flag = kwargs.get("use_exclusions", use_exclusions)
    exclusion_regions = []
    if apply_exclusions_flag and has_exclusions:
        exclusion_regions = self._get_exclusion_regions(include_callable=True, debug=debug)
        if debug:
            logger.debug(
                f"Page {self.number}: Found {len(exclusion_regions)} region exclusions for spatial filtering."
            )
    elif debug:
        logger.debug(f"Page {self.number}: Not applying exclusions.")

    # 4. Collect All Character Dictionaries from remaining Word Elements
    all_char_dicts = []
    for word in word_elements:
        all_char_dicts.extend(getattr(word, "_char_dicts", []))

    # 5. Spatially Filter Characters (only by regions, elements already filtered above)
    filtered_chars = filter_chars_spatially(
        char_dicts=all_char_dicts,
        exclusion_regions=exclusion_regions,
        target_region=None,  # No target region for full page extraction
        debug=debug,
    )

    # 6. Generate Text Layout using Utility
    # Pass page bbox as layout context
    page_bbox = (0, 0, self.width, self.height)
    # Merge PDF-level default tolerances if caller did not override
    merged_kwargs = dict(kwargs)
    tol_keys = ["x_tolerance", "x_tolerance_ratio", "y_tolerance"]
    for k in tol_keys:
        if k not in merged_kwargs:
            if k in self._config:
                merged_kwargs[k] = self._config[k]
            elif k in getattr(self._parent, "_config", {}):
                merged_kwargs[k] = self._parent._config[k]

    # Add content_filter to kwargs if provided
    if content_filter is not None:
        merged_kwargs["content_filter"] = content_filter

    result = generate_text_layout(
        char_dicts=filtered_chars,
        layout_context_bbox=page_bbox,
        user_kwargs=merged_kwargs,
    )

    # --- Optional: apply Unicode BiDi algorithm for mixed RTL/LTR correctness ---
    apply_bidi = kwargs.get("bidi", True)
    if apply_bidi and result:
        # Quick check for any RTL character
        import unicodedata

        def _contains_rtl(s):
            return any(unicodedata.bidirectional(ch) in ("R", "AL", "AN") for ch in s)

        if _contains_rtl(result):
            try:
                from bidi.algorithm import get_display  # type: ignore

                from natural_pdf.utils.bidi_mirror import mirror_brackets

                result = "\n".join(
                    mirror_brackets(
                        get_display(
                            line,
                            base_dir=(
                                "R"
                                if any(
                                    unicodedata.bidirectional(ch) in ("R", "AL", "AN")
                                    for ch in line
                                )
                                else "L"
                            ),
                        )
                    )
                    for line in result.split("\n")
                )
            except ModuleNotFoundError:
                pass  # silently skip if python-bidi not available

    logger.debug(f"Page {self.number}: extract_text finished, result length: {len(result)}.")
    return result
natural_pdf.Page.filter_elements(elements, selector, **kwargs)

Filter a list of elements based on a selector.

Parameters:

elements (List[Element], required): List of elements to filter.
selector (str, required): CSS-like selector string.
**kwargs (default: {}): Additional filter parameters.

Returns:

List[Element]: List of elements that match the selector.
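
A short sketch of filtering a pre-fetched element list (assuming page is a natural_pdf Page; selector strings are illustrative):

# Collect every element once, then narrow the list without re-querying the page
all_elements = page.get_elements()
bold = page.filter_elements(all_elements, "text:bold")
large = page.filter_elements(all_elements, "text[size>=12]")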

Source code in natural_pdf/core/page.py
def filter_elements(
    self, elements: List["Element"], selector: str, **kwargs
) -> List["Element"]:
    """
    Filter a list of elements based on a selector.

    Args:
        elements: List of elements to filter
        selector: CSS-like selector string
        **kwargs: Additional filter parameters

    Returns:
        List of elements that match the selector
    """
    from natural_pdf.selectors.parser import parse_selector, selector_to_filter_func

    # Parse the selector
    selector_obj = parse_selector(selector)

    # Create filter function from selector
    filter_func = selector_to_filter_func(selector_obj, **kwargs)

    # Apply the filter to the elements
    matching_elements = [element for element in elements if filter_func(element)]

    # Sort elements in reading order if requested
    if kwargs.get("reading_order", True):
        if all(hasattr(el, "top") and hasattr(el, "x0") for el in matching_elements):
            matching_elements.sort(key=lambda el: (el.top, el.x0))
        else:
            logger.warning(
                "Cannot sort elements in reading order: Missing required attributes (top, x0)."
            )

    return matching_elements
natural_pdf.Page.find(selector=None, *, text=None, apply_exclusions=True, regex=False, case=True, **kwargs)
find(*, text: str, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> Optional[Any]
find(selector: str, *, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> Optional[Any]

Find first element on this page matching selector OR text content.

Provide EITHER selector OR text, but not both.

Parameters:

selector (Optional[str], default: None): CSS-like selector string.
text (Optional[str], default: None): Text content to search for (equivalent to 'text:contains(...)').
apply_exclusions (bool, default: True): Whether to exclude elements in exclusion regions.
regex (bool, default: False): Whether to use regex for text search (selector or text).
case (bool, default: True): Whether to do case-sensitive text search (selector or text).
**kwargs (default: {}): Additional filter parameters.

Returns:

Optional[Any]: Element object or None if not found.
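
A minimal sketch of the two calling styles (assuming page is a natural_pdf Page; search strings are illustrative):

heading = page.find("text:bold")             # selector form
total = page.find(text="Total", case=False)  # text shortcut, expands to text:contains(...)
if total is not None:
    print(total.extract_text())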

Source code in natural_pdf/core/page.py
def find(
    self,
    selector: Optional[str] = None,  # Now optional
    *,  # Force subsequent args to be keyword-only
    text: Optional[str] = None,  # New text parameter
    apply_exclusions: bool = True,
    regex: bool = False,
    case: bool = True,
    **kwargs,
) -> Optional[Any]:
    """
    Find first element on this page matching selector OR text content.

    Provide EITHER `selector` OR `text`, but not both.

    Args:
        selector: CSS-like selector string.
        text: Text content to search for (equivalent to 'text:contains(...)').
        apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
        regex: Whether to use regex for text search (`selector` or `text`) (default: False).
        case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
        **kwargs: Additional filter parameters.

    Returns:
        Element object or None if not found.
    """
    if selector is not None and text is not None:
        raise ValueError("Provide either 'selector' or 'text', not both.")
    if selector is None and text is None:
        raise ValueError("Provide either 'selector' or 'text'.")

    # Construct selector if 'text' is provided
    effective_selector = ""
    if text is not None:
        # Escape quotes within the text for the selector string
        escaped_text = text.replace('"', '\\"').replace("'", "\\'")
        # Default to 'text:contains(...)'
        effective_selector = f'text:contains("{escaped_text}")'
        # Note: regex/case handled by kwargs passed down
        logger.debug(
            f"Using text shortcut: find(text='{text}') -> find('{effective_selector}')"
        )
    elif selector is not None:
        effective_selector = selector
    else:
        # Should be unreachable due to checks above
        raise ValueError("Internal error: No selector or text provided.")

    selector_obj = parse_selector(effective_selector)

    # Pass regex and case flags to selector function via kwargs
    kwargs["regex"] = regex
    kwargs["case"] = case

    # First get all matching elements without applying exclusions initially within _apply_selector
    results_collection = self._apply_selector(
        selector_obj, **kwargs
    )  # _apply_selector doesn't filter

    # Filter the results based on exclusions if requested
    if apply_exclusions and results_collection:
        filtered_elements = self._filter_elements_by_exclusions(results_collection.elements)
        # Return the first element from the filtered list
        return filtered_elements[0] if filtered_elements else None
    elif results_collection:
        # Return the first element from the unfiltered results
        return results_collection.first
    else:
        return None
natural_pdf.Page.find_all(selector=None, *, text=None, apply_exclusions=True, regex=False, case=True, **kwargs)
find_all(*, text: str, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> ElementCollection
find_all(selector: str, *, apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> ElementCollection

Find all elements on this page matching selector OR text content.

Provide EITHER selector OR text, but not both.

Parameters:

selector (Optional[str], default: None): CSS-like selector string.
text (Optional[str], default: None): Text content to search for (equivalent to 'text:contains(...)').
apply_exclusions (bool, default: True): Whether to exclude elements in exclusion regions.
regex (bool, default: False): Whether to use regex for text search (selector or text).
case (bool, default: True): Whether to do case-sensitive text search (selector or text).
**kwargs (default: {}): Additional filter parameters.

Returns:

ElementCollection: ElementCollection with matching elements.
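
A short sketch combining the regex flag with the text shortcut (assuming page is a natural_pdf Page; the pattern is illustrative):

bold = page.find_all("text:bold")
dates = page.find_all(text=r"\d{4}-\d{2}-\d{2}", regex=True)
print(f"{len(dates)} date-like strings found")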

Source code in natural_pdf/core/page.py
def find_all(
    self,
    selector: Optional[str] = None,  # Now optional
    *,  # Force subsequent args to be keyword-only
    text: Optional[str] = None,  # New text parameter
    apply_exclusions: bool = True,
    regex: bool = False,
    case: bool = True,
    **kwargs,
) -> "ElementCollection":
    """
    Find all elements on this page matching selector OR text content.

    Provide EITHER `selector` OR `text`, but not both.

    Args:
        selector: CSS-like selector string.
        text: Text content to search for (equivalent to 'text:contains(...)').
        apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
        regex: Whether to use regex for text search (`selector` or `text`) (default: False).
        case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
        **kwargs: Additional filter parameters.

    Returns:
        ElementCollection with matching elements.
    """
    from natural_pdf.elements.element_collection import (  # Import here for type hint
        ElementCollection,
    )

    if selector is not None and text is not None:
        raise ValueError("Provide either 'selector' or 'text', not both.")
    if selector is None and text is None:
        raise ValueError("Provide either 'selector' or 'text'.")

    # Construct selector if 'text' is provided
    effective_selector = ""
    if text is not None:
        # Escape quotes within the text for the selector string
        escaped_text = text.replace('"', '\\"').replace("'", "\\'")
        # Default to 'text:contains(...)'
        effective_selector = f'text:contains("{escaped_text}")'
        logger.debug(
            f"Using text shortcut: find_all(text='{text}') -> find_all('{effective_selector}')"
        )
    elif selector is not None:
        effective_selector = selector
    else:
        # Should be unreachable due to checks above
        raise ValueError("Internal error: No selector or text provided.")

    selector_obj = parse_selector(effective_selector)

    # Pass regex and case flags to selector function via kwargs
    kwargs["regex"] = regex
    kwargs["case"] = case

    # First get all matching elements without applying exclusions initially within _apply_selector
    results_collection = self._apply_selector(
        selector_obj, **kwargs
    )  # _apply_selector doesn't filter

    # Filter the results based on exclusions if requested
    if apply_exclusions and results_collection:
        filtered_elements = self._filter_elements_by_exclusions(results_collection.elements)
        return ElementCollection(filtered_elements)
    else:
        # Return the unfiltered collection
        return results_collection
natural_pdf.Page.get_content()

Returns the primary content object (self) for indexing (required by Indexable protocol). SearchService implementations decide how to process this (e.g., call extract_text).

Source code in natural_pdf/core/page.py
def get_content(self) -> "Page":
    """
    Returns the primary content object (self) for indexing (required by Indexable protocol).
    SearchService implementations decide how to process this (e.g., call extract_text).
    """
    return self  # Return the Page object itself
natural_pdf.Page.get_content_hash()

Returns a SHA256 hash of the extracted text content (required by Indexable for sync).

Source code in natural_pdf/core/page.py
def get_content_hash(self) -> str:
    """Returns a SHA256 hash of the extracted text content (required by Indexable for sync)."""
    # Hash the extracted text (without exclusions for consistency)
    # Consider if exclusions should be part of the hash? For now, hash raw text.
    # Using extract_text directly might be slow if called repeatedly. Cache? TODO: Optimization
    text_content = self.extract_text(
        use_exclusions=False, preserve_whitespace=False
    )  # Normalize whitespace?
    return hashlib.sha256(text_content.encode("utf-8")).hexdigest()
natural_pdf.Page.get_elements(apply_exclusions=True, debug_exclusions=False)

Get all elements on this page.

Parameters:

apply_exclusions (default: True): Whether to apply exclusion regions.
debug_exclusions (bool, default: False): Whether to output detailed exclusion debugging info.

Returns:

List[Element]: List of all elements on the page, potentially filtered by exclusions.
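
A quick sketch comparing filtered and unfiltered element counts (assuming page is a natural_pdf Page):

filtered = page.get_elements()                          # exclusions applied (default)
everything = page.get_elements(apply_exclusions=False)  # raw element list
print(f"{len(everything) - len(filtered)} elements fall inside exclusion regions")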

Source code in natural_pdf/core/page.py
def get_elements(
    self, apply_exclusions=True, debug_exclusions: bool = False
) -> List["Element"]:
    """
    Get all elements on this page.

    Args:
        apply_exclusions: Whether to apply exclusion regions (default: True).
        debug_exclusions: Whether to output detailed exclusion debugging info (default: False).

    Returns:
        List of all elements on the page, potentially filtered by exclusions.
    """
    # Get all elements from the element manager
    all_elements = self._element_mgr.get_all_elements()

    # Apply exclusions if requested
    if apply_exclusions:
        return self._filter_elements_by_exclusions(
            all_elements, debug_exclusions=debug_exclusions
        )
    else:
        if debug_exclusions:
            print(
                f"Page {self.index}: get_elements returning all {len(all_elements)} elements (exclusions not applied)."
            )
        return all_elements
natural_pdf.Page.get_id()

Returns a unique identifier for the page (required by Indexable protocol).

Source code in natural_pdf/core/page.py
def get_id(self) -> str:
    """Returns a unique identifier for the page (required by Indexable protocol)."""
    # Ensure path is safe for use in IDs (replace problematic chars)
    safe_path = re.sub(r"[^a-zA-Z0-9_-]", "_", str(self.pdf.path))
    return f"pdf_{safe_path}_page_{self.page_number}"
natural_pdf.Page.get_metadata()

Returns metadata associated with the page (required by Indexable protocol).

Source code in natural_pdf/core/page.py
def get_metadata(self) -> Dict[str, Any]:
    """Returns metadata associated with the page (required by Indexable protocol)."""
    # Add content hash here for sync
    metadata = {
        "pdf_path": str(self.pdf.path),
        "page_number": self.page_number,
        "width": self.width,
        "height": self.height,
        "content_hash": self.get_content_hash(),  # Include the hash
    }
    return metadata
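
Together with get_id, get_content, and get_content_hash, this metadata lets a Page act as an Indexable for SearchService implementations. A minimal sketch of reading these values (assuming page is a natural_pdf Page):

print(page.get_id())              # stable identifier built from the PDF path and page number
meta = page.get_metadata()
print(meta["content_hash"][:12])  # SHA256 of the page text, handy for change detection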
natural_pdf.Page.get_section_between(start_element=None, end_element=None, include_boundaries='both', orientation='vertical')

Get a section between two elements on this page.

Parameters:

start_element (default: None): Element marking the start of the section.
end_element (default: None): Element marking the end of the section.
include_boundaries (default: 'both'): How to include boundary elements: 'start', 'end', 'both', or 'none'.
orientation (default: 'vertical'): 'vertical' or 'horizontal'; determines the section direction.

Returns:

Optional[Region]: Region representing the section.
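
A minimal sketch of carving out the region between two headings (assuming page is a natural_pdf Page; search strings are illustrative):

start = page.find(text="Introduction")
end = page.find(text="Methods")
section = page.get_section_between(start, end, include_boundaries="start")
if section is not None:
    print(section.extract_text())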

Source code in natural_pdf/core/page.py
def get_section_between(
    self,
    start_element=None,
    end_element=None,
    include_boundaries="both",
    orientation="vertical",
) -> Optional["Region"]:  # Return Optional
    """
    Get a section between two elements on this page.

    Args:
        start_element: Element marking the start of the section
        end_element: Element marking the end of the section
        include_boundaries: How to include boundary elements: 'start', 'end', 'both', or 'none'
        orientation: 'vertical' (default) or 'horizontal' - determines section direction

    Returns:
        Region representing the section
    """
    # Create a full-page region to operate within
    page_region = self.create_region(0, 0, self.width, self.height)

    # Delegate to the region's method
    try:
        return page_region.get_section_between(
            start_element=start_element,
            end_element=end_element,
            include_boundaries=include_boundaries,
            orientation=orientation,
        )
    except Exception as e:
        logger.error(
            f"Error getting section between elements on page {self.index}: {e}", exc_info=True
        )
        return None
natural_pdf.Page.get_sections(start_elements=None, end_elements=None, include_boundaries='start', y_threshold=5.0, bounding_box=None, orientation='vertical')

Get sections of a page defined by start/end elements. Uses the page-level implementation.

Parameters:

start_elements (default: None): Elements or selector string that mark the start of sections.
end_elements (default: None): Elements or selector string that mark the end of sections.
include_boundaries (default: 'start'): How to include boundary elements: 'start', 'end', 'both', or 'none'.
y_threshold (default: 5.0): Threshold for vertical alignment (only used for vertical orientation).
bounding_box (default: None): Optional bounding box to constrain sections.
orientation (default: 'vertical'): 'vertical' or 'horizontal'; determines the section direction.

Returns:

ElementCollection[Region]: An ElementCollection containing the found Region objects.
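
A short sketch splitting a page into sections at each bold heading (assuming page is a natural_pdf Page; the selector is illustrative):

sections = page.get_sections(start_elements="text:bold")
for section in sections:
    print(section.extract_text()[:60])  # first 60 characters of each section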

Source code in natural_pdf/core/page.py
def get_sections(
    self,
    start_elements=None,
    end_elements=None,
    include_boundaries="start",
    y_threshold=5.0,
    bounding_box=None,
    orientation="vertical",
) -> "ElementCollection[Region]":
    """
    Get sections of a page defined by start/end elements.
    Uses the page-level implementation.

    Args:
        start_elements: Elements or selector string that mark the start of sections
        end_elements: Elements or selector string that mark the end of sections
        include_boundaries: How to include boundary elements: 'start', 'end', 'both', or 'none'
        y_threshold: Threshold for vertical alignment (only used for vertical orientation)
        bounding_box: Optional bounding box to constrain sections
        orientation: 'vertical' (default) or 'horizontal' - determines section direction

    Returns:
        An ElementCollection containing the found Region objects.
    """

    # Helper function to get bounds from bounding_box parameter
    def get_bounds():
        if bounding_box:
            x0, top, x1, bottom = bounding_box
            # Clamp to page boundaries
            return max(0, x0), max(0, top), min(self.width, x1), min(self.height, bottom)
        else:
            return 0, 0, self.width, self.height

    regions = []

    # Handle cases where elements are provided as strings (selectors)
    if isinstance(start_elements, str):
        start_elements = self.find_all(start_elements).elements  # Get list of elements
    elif hasattr(start_elements, "elements"):  # Handle ElementCollection input
        start_elements = start_elements.elements

    if isinstance(end_elements, str):
        end_elements = self.find_all(end_elements).elements
    elif hasattr(end_elements, "elements"):
        end_elements = end_elements.elements

    # Ensure start_elements is a list
    if start_elements is None:
        start_elements = []
    if end_elements is None:
        end_elements = []

    valid_inclusions = ["start", "end", "both", "none"]
    if include_boundaries not in valid_inclusions:
        raise ValueError(f"include_boundaries must be one of {valid_inclusions}")

    if not start_elements and not end_elements:
        # Return an empty ElementCollection if no boundary elements at all
        return ElementCollection([])

    # If we only have end elements, create implicit start elements
    if not start_elements and end_elements:
        # Delegate to PageCollection implementation for consistency
        from natural_pdf.core.page_collection import PageCollection

        pages = PageCollection([self])
        return pages.get_sections(
            start_elements=start_elements,
            end_elements=end_elements,
            include_boundaries=include_boundaries,
            orientation=orientation,
        )

    # Combine start and end elements with their type
    all_boundaries = []
    for el in start_elements:
        all_boundaries.append((el, "start"))
    for el in end_elements:
        all_boundaries.append((el, "end"))

    # Sort all boundary elements based on orientation
    try:
        if orientation == "vertical":
            all_boundaries.sort(key=lambda x: (x[0].top, x[0].x0))
        else:  # horizontal
            all_boundaries.sort(key=lambda x: (x[0].x0, x[0].top))
    except AttributeError as e:
        logger.error(f"Error sorting boundaries: Element missing position attribute? {e}")
        return ElementCollection([])  # Cannot proceed if elements lack position

    # Process sorted boundaries to find sections
    current_start_element = None
    active_section_started = False

    for element, element_type in all_boundaries:
        if element_type == "start":
            # If we have an active section, this start implicitly ends it
            if active_section_started:
                end_boundary_el = element  # Use this start as the end boundary
                # Determine region boundaries based on orientation
                if orientation == "vertical":
                    sec_top = (
                        current_start_element.top
                        if include_boundaries in ["start", "both"]
                        else current_start_element.bottom
                    )
                    sec_bottom = (
                        end_boundary_el.top
                        if include_boundaries not in ["end", "both"]
                        else end_boundary_el.bottom
                    )

                    if sec_top < sec_bottom:  # Ensure valid region
                        x0, _, x1, _ = get_bounds()
                        region = self.create_region(x0, sec_top, x1, sec_bottom)
                        region.start_element = current_start_element
                        region.end_element = end_boundary_el  # Mark the element that ended it
                        region.is_end_next_start = True  # Mark how it ended
                        region._boundary_exclusions = include_boundaries
                        regions.append(region)
                else:  # horizontal
                    sec_left = (
                        current_start_element.x0
                        if include_boundaries in ["start", "both"]
                        else current_start_element.x1
                    )
                    sec_right = (
                        end_boundary_el.x0
                        if include_boundaries not in ["end", "both"]
                        else end_boundary_el.x1
                    )

                    if sec_left < sec_right:  # Ensure valid region
                        _, y0, _, y1 = get_bounds()
                        region = self.create_region(sec_left, y0, sec_right, y1)
                        region.start_element = current_start_element
                        region.end_element = end_boundary_el  # Mark the element that ended it
                        region.is_end_next_start = True  # Mark how it ended
                        region._boundary_exclusions = include_boundaries
                        regions.append(region)
                active_section_started = False  # Reset for the new start

            # Set this as the potential start of the next section
            current_start_element = element
            active_section_started = True

        elif element_type == "end" and active_section_started:
            # We found an explicit end for the current section
            end_boundary_el = element
            if orientation == "vertical":
                sec_top = (
                    current_start_element.top
                    if include_boundaries in ["start", "both"]
                    else current_start_element.bottom
                )
                sec_bottom = (
                    end_boundary_el.bottom
                    if include_boundaries in ["end", "both"]
                    else end_boundary_el.top
                )

                if sec_top < sec_bottom:  # Ensure valid region
                    x0, _, x1, _ = get_bounds()
                    region = self.create_region(x0, sec_top, x1, sec_bottom)
                    region.start_element = current_start_element
                    region.end_element = end_boundary_el
                    region.is_end_next_start = False
                    region._boundary_exclusions = include_boundaries
                    regions.append(region)
            else:  # horizontal
                sec_left = (
                    current_start_element.x0
                    if include_boundaries in ["start", "both"]
                    else current_start_element.x1
                )
                sec_right = (
                    end_boundary_el.x1
                    if include_boundaries in ["end", "both"]
                    else end_boundary_el.x0
                )

                if sec_left < sec_right:  # Ensure valid region
                    _, y0, _, y1 = get_bounds()
                    region = self.create_region(sec_left, y0, sec_right, y1)
                    region.start_element = current_start_element
                    region.end_element = end_boundary_el
                    region.is_end_next_start = False
                    region._boundary_exclusions = include_boundaries
                    regions.append(region)

            # Reset: section ended explicitly
            current_start_element = None
            active_section_started = False

    # Handle the last section if it was started but never explicitly ended
    if active_section_started:
        if orientation == "vertical":
            sec_top = (
                current_start_element.top
                if include_boundaries in ["start", "both"]
                else current_start_element.bottom
            )
            x0, _, x1, page_bottom = get_bounds()
            if sec_top < page_bottom:
                region = self.create_region(x0, sec_top, x1, page_bottom)
                region.start_element = current_start_element
                region.end_element = None  # Ended by page end
                region.is_end_next_start = False
                region._boundary_exclusions = include_boundaries
                regions.append(region)
        else:  # horizontal
            sec_left = (
                current_start_element.x0
                if include_boundaries in ["start", "both"]
                else current_start_element.x1
            )
            page_left, y0, page_right, y1 = get_bounds()
            if sec_left < page_right:
                region = self.create_region(sec_left, y0, page_right, y1)
                region.start_element = current_start_element
                region.end_element = None  # Ended by page end
                region.is_end_next_start = False
                region._boundary_exclusions = include_boundaries
                regions.append(region)

    return ElementCollection(regions)
natural_pdf.Page.highlights(show=False)

Create a highlight context for accumulating highlights.

This allows for clean syntax to show multiple highlight groups:

Example:

with page.highlights() as h:
    h.add(page.find_all('table'), label='tables', color='blue')
    h.add(page.find_all('text:bold'), label='bold text', color='red')
    h.show()

Or with automatic display:

with page.highlights(show=True) as h:
    h.add(page.find_all('table'), label='tables')
    h.add(page.find_all('text:bold'), label='bold')
    # Automatically shows when exiting the context

Parameters:

show (bool, default: False): If True, automatically show highlights when exiting the context.

Returns:

HighlightContext: HighlightContext for accumulating highlights.

Source code in natural_pdf/core/page.py
def highlights(self, show: bool = False) -> "HighlightContext":
    """
    Create a highlight context for accumulating highlights.

    This allows for clean syntax to show multiple highlight groups:

    Example:
        with page.highlights() as h:
            h.add(page.find_all('table'), label='tables', color='blue')
            h.add(page.find_all('text:bold'), label='bold text', color='red')
            h.show()

    Or with automatic display:
        with page.highlights(show=True) as h:
            h.add(page.find_all('table'), label='tables')
            h.add(page.find_all('text:bold'), label='bold')
            # Automatically shows when exiting the context

    Args:
        show: If True, automatically show highlights when exiting context

    Returns:
        HighlightContext for accumulating highlights
    """
    from natural_pdf.core.highlighting_service import HighlightContext

    return HighlightContext(self, show_on_exit=show)
natural_pdf.Page.inspect(limit=30)

Inspect all elements on this page with detailed tabular view. Equivalent to page.find_all('*').inspect().

Parameters:

limit (int, default: 30): Maximum elements per type to show.

Returns:

InspectionSummary: InspectionSummary with element tables showing coordinates, properties, and other details for each element.
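
A one-line sketch for a quick structural overview (assuming page is a natural_pdf Page):

page.inspect(limit=10)  # tabulates up to 10 elements of each type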

Source code in natural_pdf/core/page.py
def inspect(self, limit: int = 30) -> "InspectionSummary":
    """
    Inspect all elements on this page with detailed tabular view.
    Equivalent to page.find_all('*').inspect().

    Args:
        limit: Maximum elements per type to show (default: 30)

    Returns:
        InspectionSummary with element tables showing coordinates,
        properties, and other details for each element
    """
    return self.find_all("*").inspect(limit=limit)
natural_pdf.Page.region(left=None, top=None, right=None, bottom=None, width=None, height=None)

Create a region on this page with more intuitive named parameters, allowing definition by coordinates or by coordinate + dimension.

Parameters:

left (float, default: None): Left x-coordinate (default: 0 if width not used).
top (float, default: None): Top y-coordinate (default: 0 if height not used).
right (float, default: None): Right x-coordinate (default: page width if width not used).
bottom (float, default: None): Bottom y-coordinate (default: page height if height not used).
width (Union[str, float, None], default: None): Width definition. Can be:
    - Numeric: The width of the region in points. Cannot be used with both left and right.
    - String 'full': Sets region width to full page width (overrides left/right).
    - String 'element' or None (default): Uses provided/calculated left/right, defaulting to page width if neither is specified.
height (Optional[float], default: None): Numeric height of the region. Cannot be used with both top and bottom.

Returns:

Any: Region object for the specified coordinates.

Raises:

ValueError: If conflicting arguments are provided (e.g., top, bottom, and height) or if width is an invalid string.

Examples:

>>> page.region(top=100, height=50)  # Region from y=100 to y=150, default width
>>> page.region(left=50, width=100)   # Region from x=50 to x=150, default height
>>> page.region(bottom=500, height=50) # Region from y=450 to y=500
>>> page.region(right=200, width=50)  # Region from x=150 to x=200
>>> page.region(top=100, bottom=200, width="full") # Explicit full width
Source code in natural_pdf/core/page.py
def region(
    self,
    left: float = None,
    top: float = None,
    right: float = None,
    bottom: float = None,
    width: Union[str, float, None] = None,
    height: Optional[float] = None,
) -> Any:
    """
    Create a region on this page with more intuitive named parameters,
    allowing definition by coordinates or by coordinate + dimension.

    Args:
        left: Left x-coordinate (default: 0 if width not used).
        top: Top y-coordinate (default: 0 if height not used).
        right: Right x-coordinate (default: page width if width not used).
        bottom: Bottom y-coordinate (default: page height if height not used).
        width: Width definition. Can be:
               - Numeric: The width of the region in points. Cannot be used with both left and right.
               - String 'full': Sets region width to full page width (overrides left/right).
               - String 'element' or None (default): Uses provided/calculated left/right,
                 defaulting to page width if neither are specified.
        height: Numeric height of the region. Cannot be used with both top and bottom.

    Returns:
        Region object for the specified coordinates

    Raises:
        ValueError: If conflicting arguments are provided (e.g., top, bottom, and height)
                  or if width is an invalid string.

    Examples:
        >>> page.region(top=100, height=50)  # Region from y=100 to y=150, default width
        >>> page.region(left=50, width=100)   # Region from x=50 to x=150, default height
        >>> page.region(bottom=500, height=50) # Region from y=450 to y=500
        >>> page.region(right=200, width=50)  # Region from x=150 to x=200
        >>> page.region(top=100, bottom=200, width="full") # Explicit full width
    """
    # ------------------------------------------------------------------
    # Percentage support – convert strings like "30%" to absolute values
    # based on page dimensions.  X-axis params (left, right, width) use
    # page.width; Y-axis params (top, bottom, height) use page.height.
    # ------------------------------------------------------------------

    def _pct_to_abs(val, axis: str):
        if isinstance(val, str) and val.strip().endswith("%"):
            try:
                pct = float(val.strip()[:-1]) / 100.0
            except ValueError:
                return val  # leave unchanged if not a number
            return pct * (self.width if axis == "x" else self.height)
        return val

    left = _pct_to_abs(left, "x")
    right = _pct_to_abs(right, "x")
    width = _pct_to_abs(width, "x")
    top = _pct_to_abs(top, "y")
    bottom = _pct_to_abs(bottom, "y")
    height = _pct_to_abs(height, "y")

    # --- Type checking and basic validation ---
    is_width_numeric = isinstance(width, (int, float))
    is_width_string = isinstance(width, str)
    width_mode = "element"  # Default mode

    if height is not None and top is not None and bottom is not None:
        raise ValueError("Cannot specify top, bottom, and height simultaneously.")
    if is_width_numeric and left is not None and right is not None:
        raise ValueError("Cannot specify left, right, and a numeric width simultaneously.")
    if is_width_string:
        width_lower = width.lower()
        if width_lower not in ["full", "element"]:
            raise ValueError("String width argument must be 'full' or 'element'.")
        width_mode = width_lower

    # --- Calculate Coordinates ---
    final_top = top
    final_bottom = bottom
    final_left = left
    final_right = right

    # Height calculations
    if height is not None:
        if top is not None:
            final_bottom = top + height
        elif bottom is not None:
            final_top = bottom - height
        else:  # Neither top nor bottom provided, default top to 0
            final_top = 0
            final_bottom = height

    # Width calculations (numeric only)
    if is_width_numeric:
        if left is not None:
            final_right = left + width
        elif right is not None:
            final_left = right - width
        else:  # Neither left nor right provided, default left to 0
            final_left = 0
            final_right = width

    # --- Apply Defaults for Unset Coordinates ---
    # Only default coordinates if they weren't set by dimension calculation
    if final_top is None:
        final_top = 0
    if final_bottom is None:
        # Check if bottom should have been set by height calc
        if height is None or top is None:
            final_bottom = self.height

    if final_left is None:
        final_left = 0
    if final_right is None:
        # Check if right should have been set by width calc
        if not is_width_numeric or left is None:
            final_right = self.width

    # --- Handle width_mode == 'full' ---
    if width_mode == "full":
        # Override left/right if mode is full
        final_left = 0
        final_right = self.width

    # --- Final Validation & Creation ---
    # Ensure coordinates are within page bounds (clamp)
    final_left = max(0, final_left)
    final_top = max(0, final_top)
    final_right = min(self.width, final_right)
    final_bottom = min(self.height, final_bottom)

    # Ensure valid box (x0<=x1, top<=bottom)
    if final_left > final_right:
        logger.warning(f"Calculated left ({final_left}) > right ({final_right}). Swapping.")
        final_left, final_right = final_right, final_left
    if final_top > final_bottom:
        logger.warning(f"Calculated top ({final_top}) > bottom ({final_bottom}). Swapping.")
        final_top, final_bottom = final_bottom, final_top

    from natural_pdf.elements.region import Region

    region = Region(self, (final_left, final_top, final_right, final_bottom))
    return region
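
The docstring examples above omit the percentage support implemented in `_pct_to_abs`; a sketch of both styles (coordinates are illustrative):

```python
header = page.region(top=0, height=80)            # top 80 points, full width
left_half = page.region(left=0, width="50%")      # "50%" -> 0.5 * page.width
band = page.region(top="10%", bottom="30%")       # vertical band by percentage
body = page.region(top=100, bottom=200, width="full")  # explicit full width
```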
natural_pdf.Page.remove_text_layer()

Remove all text elements from this page.

This removes all text elements (words and characters) from the page, effectively clearing the text layer.

Returns:

| Type | Description |
|------|-------------|
| `Page` | Self for method chaining. |

Source code in natural_pdf/core/page.py
def remove_text_layer(self) -> "Page":
    """
    Remove all text elements from this page.

    This removes all text elements (words and characters) from the page,
    effectively clearing the text layer.

    Returns:
        Self for method chaining
    """
    logger.info(f"Page {self.number}: Removing all text elements...")

    # Remove all words and chars from the element manager
    removed_words = len(self._element_mgr.words)
    removed_chars = len(self._element_mgr.chars)

    # Clear the lists
    self._element_mgr._elements["words"] = []
    self._element_mgr._elements["chars"] = []

    logger.info(
        f"Page {self.number}: Removed {removed_words} words and {removed_chars} characters"
    )
    return self
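
A common pattern is clearing a poor embedded text layer before re-running OCR; a sketch (the OCR call mirrors the `pdf.apply_ocr(pages=...)` signature used by `PageCollection.apply_ocr` below, but the engine choice is illustrative):

```python
page.remove_text_layer()                              # drop existing words/chars
pdf.apply_ocr(pages=[page.index], engine="easyocr")   # rebuild the text layer
```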
natural_pdf.Page.save_image(filename, width=None, labels=True, legend_position='right', render_ocr=False, include_highlights=True, resolution=144, **kwargs)

Save the page image to a file, rendering highlights via HighlightingService.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `filename` | `str` | Path to save the image to. | required |
| `width` | `Optional[int]` | Optional width for the output image. | `None` |
| `labels` | `bool` | Whether to include a legend. | `True` |
| `legend_position` | `str` | Position of the legend. | `'right'` |
| `render_ocr` | `bool` | Whether to render OCR text. | `False` |
| `include_highlights` | `bool` | Whether to render highlights. | `True` |
| `resolution` | `float` | Resolution in DPI for base image rendering (144 DPI is equivalent to the previous scale=2.0). | `144` |
| `**kwargs` | | Additional args for pdfplumber's internal to_image. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `Page` | Self for method chaining. |

Source code in natural_pdf/core/page.py
def save_image(
    self,
    filename: str,
    width: Optional[int] = None,
    labels: bool = True,
    legend_position: str = "right",
    render_ocr: bool = False,
    include_highlights: bool = True,  # Allow saving without highlights
    resolution: float = 144,
    **kwargs,
) -> "Page":
    """
    Save the page image to a file, rendering highlights via HighlightingService.

    Args:
        filename: Path to save the image to.
        width: Optional width for the output image.
        labels: Whether to include a legend.
        legend_position: Position of the legend.
        render_ocr: Whether to render OCR text.
        include_highlights: Whether to render highlights.
        resolution: Resolution in DPI for base image rendering (default: 144 DPI, equivalent to previous scale=2.0).
        **kwargs: Additional args for pdfplumber's internal to_image.

    Returns:
        Self for method chaining.
    """
    # Use export() to save the image
    if include_highlights:
        self.export(
            path=filename,
            resolution=resolution,
            width=width,
            labels=labels,
            legend_position=legend_position,
            render_ocr=render_ocr,
            **kwargs,
        )
    else:
        # For saving without highlights, use render() and save manually
        img = self.render(resolution=resolution, **kwargs)
        if img:
            # Resize if width is specified
            if width is not None and width > 0 and img.width > 0:
                aspect_ratio = img.height / img.width
                height = int(width * aspect_ratio)
                try:
                    img = img.resize((width, height), Image.Resampling.LANCZOS)
                except Exception as e:
                    logger.warning(f"Could not resize image: {e}")

            # Save the image
            try:
                if os.path.dirname(filename):
                    os.makedirs(os.path.dirname(filename), exist_ok=True)
                img.save(filename)
            except Exception as e:
                logger.error(f"Failed to save image to {filename}: {e}")

    return self
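
A sketch of both branches (file names and sizes are illustrative):

```python
# With highlights and a legend (delegates to export()):
page.save_image("page_highlighted.png", width=1200, legend_position="right")

# Clean render without highlights (delegates to render()):
page.save_image("page_plain.png", include_highlights=False, resolution=200)
```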
natural_pdf.Page.save_searchable(output_path, dpi=300, **kwargs)

Saves the PDF page with an OCR text layer, making content searchable.

Requires optional dependencies. Install with: pip install "natural-pdf[ocr-save]"

Note: OCR must have been applied to the pages beforehand (e.g., `pdf.apply_ocr()`).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `output_path` | `Union[str, Path]` | Path to save the searchable PDF. | required |
| `dpi` | `int` | Resolution for rendering and OCR overlay. | `300` |
| `**kwargs` | | Additional keyword arguments passed to the exporter. | `{}` |
Source code in natural_pdf/core/page.py
def save_searchable(self, output_path: Union[str, "Path"], dpi: int = 300, **kwargs):
    """
    Saves the PDF page with an OCR text layer, making content searchable.

    Requires optional dependencies. Install with: pip install "natural-pdf[ocr-save]"

    Note: OCR must have been applied to the pages beforehand
          (e.g., pdf.apply_ocr()).

    Args:
        output_path: Path to save the searchable PDF.
        dpi: Resolution for rendering and OCR overlay (default 300).
        **kwargs: Additional keyword arguments passed to the exporter.
    """
    # Import moved here, assuming it's always available now
    from natural_pdf.exporters.searchable_pdf import create_searchable_pdf

    # Convert pathlib.Path to string if necessary
    output_path_str = str(output_path)

    create_searchable_pdf(self, output_path_str, dpi=dpi, **kwargs)
    logger.info(f"Searchable PDF saved to: {output_path_str}")
natural_pdf.Page.show_preview(temporary_highlights, resolution=144, width=None, labels=True, legend_position='right', render_ocr=False)

Generates and returns a non-stateful preview image containing only the provided temporary highlights.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `temporary_highlights` | `List[Dict]` | List of highlight data dictionaries (as prepared by ElementCollection._prepare_highlight_data). | required |
| `resolution` | `float` | Resolution in DPI for rendering (144 DPI is equivalent to the previous scale=2.0). | `144` |
| `width` | `Optional[int]` | Optional width for the output image. | `None` |
| `labels` | `bool` | Whether to include a legend. | `True` |
| `legend_position` | `str` | Position of the legend. | `'right'` |
| `render_ocr` | `bool` | Whether to render OCR text. | `False` |

Returns:

| Type | Description |
|------|-------------|
| `Optional[Image]` | PIL Image object of the preview, or None if rendering fails. |

Source code in natural_pdf/core/page.py
def show_preview(
    self,
    temporary_highlights: List[Dict],
    resolution: float = 144,
    width: Optional[int] = None,
    labels: bool = True,
    legend_position: str = "right",
    render_ocr: bool = False,
) -> Optional[Image.Image]:
    """
    Generates and returns a non-stateful preview image containing only
    the provided temporary highlights.

    Args:
        temporary_highlights: List of highlight data dictionaries (as prepared by
                              ElementCollection._prepare_highlight_data).
        resolution: Resolution in DPI for rendering (default: 144 DPI, equivalent to previous scale=2.0).
        width: Optional width for the output image.
        labels: Whether to include a legend.
        legend_position: Position of the legend.
        render_ocr: Whether to render OCR text.

    Returns:
        PIL Image object of the preview, or None if rendering fails.
    """
    try:
        # Delegate rendering to the highlighter service's preview method
        img = self._highlighter.render_preview(
            page_index=self.index,
            temporary_highlights=temporary_highlights,
            resolution=resolution,
            labels=labels,
            legend_position=legend_position,
            render_ocr=render_ocr,
        )
    except AttributeError:
        logger.error(f"HighlightingService does not have the required 'render_preview' method.")
        return None
    except Exception as e:
        logger.error(
            f"Error calling highlighter.render_preview for page {self.index}: {e}",
            exc_info=True,
        )
        return None

    # Return the rendered image directly
    return img
natural_pdf.Page.split(divider, **kwargs)

Divides the page into sections based on the provided divider elements.

Source code in natural_pdf/core/page.py
def split(self, divider, **kwargs) -> "ElementCollection[Region]":
    """
    Divides the page into sections based on the provided divider elements.
    """
    sections = self.get_sections(start_elements=divider, **kwargs)
    top = self.region(0, 0, self.width, sections[0].top)
    sections.append(top)

    return sections
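
A sketch using a selector to pick dividers (the selector and the use of `extract_text` on the resulting regions are illustrative):

```python
dividers = page.find_all("line[width>=2]")   # thick horizontal rules
sections = page.split(dividers)
for section in sections:
    print(section.extract_text()[:60])       # first characters of each section
```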
natural_pdf.Page.until(selector, include_endpoint=True, **kwargs)

Select content from the top of the page down to the element matching the selector.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `selector` | `str` | CSS-like selector string. | required |
| `include_endpoint` | `bool` | Whether to include the endpoint element in the region. | `True` |
| `**kwargs` | | Additional selection parameters. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `Any` | Region object representing the selected content. |

Examples:

>>> page.until('text:contains("Conclusion")')  # Select from top to conclusion
>>> page.until('line[width>=2]', include_endpoint=False)  # Select up to thick line
Source code in natural_pdf/core/page.py
def until(self, selector: str, include_endpoint: bool = True, **kwargs) -> Any:
    """
    Select content from the top of the page until matching selector.

    Args:
        selector: CSS-like selector string
        include_endpoint: Whether to include the endpoint element in the region
        **kwargs: Additional selection parameters

    Returns:
        Region object representing the selected content

    Examples:
        >>> page.until('text:contains("Conclusion")')  # Select from top to conclusion
        >>> page.until('line[width>=2]', include_endpoint=False)  # Select up to thick line
    """
    # Find the target element
    target = self.find(selector, **kwargs)
    if not target:
        # If target not found, return a default region (full page)
        from natural_pdf.elements.region import Region

        return Region(self, (0, 0, self.width, self.height))

    # Create a region from the top of the page to the target
    from natural_pdf.elements.region import Region

    # Ensure target has positional attributes before using them
    target_top = getattr(target, "top", 0)
    target_bottom = getattr(target, "bottom", self.height)

    if include_endpoint:
        # Include the target element
        region = Region(self, (0, 0, self.width, target_bottom))
    else:
        # Up to the target element
        region = Region(self, (0, 0, self.width, target_top))

    region.end_element = target
    return region
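
Building on the docstring examples, a sketch that chains the returned region into text extraction (the heading text is hypothetical):

```python
intro = page.until('text:contains("Conclusion")', include_endpoint=False)
intro_text = intro.extract_text()   # everything above the heading
print(intro.end_element)            # the matched boundary element, set above
```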
natural_pdf.Page.update_text(transform, selector='text', max_workers=None, progress_callback=None)

Applies corrections to text elements on this page using a user-provided callback function, potentially in parallel.

Finds text elements on this page matching the selector argument and calls the transform for each, passing the element itself. Updates the element's text if the callback returns a new string.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `transform` | `Callable[[Any], Optional[str]]` | A function accepting an element and returning `Optional[str]` (new text or None). | required |
| `selector` | `str` | CSS-like selector string to match text elements. | `'text'` |
| `max_workers` | `Optional[int]` | The maximum number of threads to use for parallel execution. If None, 0, or 1, runs sequentially. | `None` |
| `progress_callback` | `Optional[Callable[[], None]]` | Optional callback function to call after processing each element. | `None` |

Returns:

| Type | Description |
|------|-------------|
| `Page` | Self for method chaining. |

Source code in natural_pdf/core/page.py
def update_text(
    self,
    transform: Callable[[Any], Optional[str]],
    selector: str = "text",
    max_workers: Optional[int] = None,
    progress_callback: Optional[Callable[[], None]] = None,  # Added progress callback
) -> "Page":  # Return self for chaining
    """
    Applies corrections to text elements on this page
    using a user-provided callback function, potentially in parallel.

    Finds text elements on this page matching the *selector* argument and
    calls the ``transform`` for each, passing the element itself.
    Updates the element's text if the callback returns a new string.

    Args:
        transform: A function accepting an element and returning
                   `Optional[str]` (new text or None).
        selector: CSS-like selector string to match text elements.
        max_workers: The maximum number of threads to use for parallel execution.
                     If None or 0 or 1, runs sequentially.
        progress_callback: Optional callback function to call after processing each element.

    Returns:
        Self for method chaining.
    """
    logger.info(
        f"Page {self.number}: Starting text update with callback '{transform.__name__}' (max_workers={max_workers}) and selector='{selector}'"
    )

    target_elements_collection = self.find_all(selector=selector, apply_exclusions=False)
    target_elements = target_elements_collection.elements  # Get the list

    if not target_elements:
        logger.info(f"Page {self.number}: No text elements found to update.")
        return self

    element_pbar = None
    try:
        element_pbar = tqdm(
            total=len(target_elements),
            desc=f"Updating text Page {self.number}",
            unit="element",
            leave=False,
        )

        processed_count = 0
        updated_count = 0
        error_count = 0

        # Define the task to be run by the worker thread or sequentially
        def _process_element_task(element):
            try:
                current_text = getattr(element, "text", None)
                # Call the user-provided callback
                corrected_text = transform(element)

                # Validate result type
                if corrected_text is not None and not isinstance(corrected_text, str):
                    logger.warning(
                        f"Page {self.number}: Correction callback for element '{getattr(element, 'text', '')[:20]}...' returned non-string, non-None type: {type(corrected_text)}. Skipping update."
                    )
                    return element, None, None  # Treat as no correction

                return element, corrected_text, None  # Return element, result, no error
            except Exception as e:
                logger.error(
                    f"Page {self.number}: Error applying correction callback to element '{getattr(element, 'text', '')[:30]}...' ({element.bbox}): {e}",
                    exc_info=False,  # Keep log concise
                )
                return element, None, e  # Return element, no result, error
            finally:
                # --- Update internal tqdm progress bar ---
                if element_pbar:
                    element_pbar.update(1)
                # --- Call user's progress callback --- #
                if progress_callback:
                    try:
                        progress_callback()
                    except Exception as cb_e:
                        # Log error in callback itself, but don't stop processing
                        logger.error(
                            f"Page {self.number}: Error executing progress_callback: {cb_e}",
                            exc_info=False,
                        )

        # Choose execution strategy based on max_workers
        if max_workers is not None and max_workers > 1:
            # --- Parallel execution --- #
            logger.info(
                f"Page {self.number}: Running text update in parallel with {max_workers} workers."
            )
            futures = []
            with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
                # Submit all tasks
                future_to_element = {
                    executor.submit(_process_element_task, element): element
                    for element in target_elements
                }

                # Process results as they complete (progress_callback called by worker)
                for future in concurrent.futures.as_completed(future_to_element):
                    processed_count += 1
                    try:
                        element, corrected_text, error = future.result()
                        if error:
                            error_count += 1
                            # Error already logged in worker
                        elif corrected_text is not None:
                            # Apply correction if text changed
                            current_text = getattr(element, "text", None)
                            if corrected_text != current_text:
                                element.text = corrected_text
                                updated_count += 1
                    except Exception as exc:
                        # Catch errors from future.result() itself
                        element = future_to_element[future]  # Find original element
                        logger.error(
                            f"Page {self.number}: Internal error retrieving correction result for element {element.bbox}: {exc}",
                            exc_info=True,
                        )
                        error_count += 1
                        # Note: progress_callback was already called in the worker's finally block

        else:
            # --- Sequential execution --- #
            logger.info(f"Page {self.number}: Running text update sequentially.")
            for element in target_elements:
                # Call the task function directly (it handles progress_callback)
                processed_count += 1
                _element, corrected_text, error = _process_element_task(element)
                if error:
                    error_count += 1
                elif corrected_text is not None:
                    # Apply correction if text changed
                    current_text = getattr(_element, "text", None)
                    if corrected_text != current_text:
                        _element.text = corrected_text
                        updated_count += 1

        logger.info(
            f"Page {self.number}: Text update finished. Processed: {processed_count}/{len(target_elements)}, Updated: {updated_count}, Errors: {error_count}."
        )

        return self  # Return self for chaining
    finally:
        if element_pbar:
            element_pbar.close()
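
A sketch of a transform that normalizes whitespace, run in parallel (the callback itself is illustrative; returning None leaves an element unchanged, as the worker code above shows):

```python
def squash_spaces(element):
    text = element.text or ""
    cleaned = " ".join(text.split())
    return cleaned if cleaned != text else None  # None = no update

page.update_text(squash_spaces, selector="text", max_workers=4)
```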
natural_pdf.Page.viewer()

Creates and returns an interactive ipywidget for exploring elements on this page.

Uses InteractiveViewerWidget.from_page() to create the viewer.

Returns:

| Type | Description |
|------|-------------|
| `Optional[InteractiveViewerWidget]` | An InteractiveViewerWidget instance ready for display in Jupyter, or None if ipywidgets is not installed or widget creation fails. |

Raises:

| Type | Description |
|------|-------------|
| `ValueError` | If image rendering or data preparation fails within from_page. |

(The source notes it could optionally raise `ImportError` instead of returning None when ipywidgets is missing.)

Source code in natural_pdf/core/page.py
def viewer(
    self,
    # elements_to_render: Optional[List['Element']] = None, # No longer needed, from_page handles it
    # include_source_types: List[str] = ['word', 'line', 'rect', 'region'] # No longer needed
) -> Optional["InteractiveViewerWidget"]:  # Return type hint updated
    """
    Creates and returns an interactive ipywidget for exploring elements on this page.

    Uses InteractiveViewerWidget.from_page() to create the viewer.

    Returns:
        A InteractiveViewerWidget instance ready for display in Jupyter,
        or None if ipywidgets is not installed or widget creation fails.

    Raises:
        # Optional: Could raise ImportError instead of returning None
        # ImportError: If required dependencies (ipywidgets) are missing.
        ValueError: If image rendering or data preparation fails within from_page.
    """
    # Check for availability using the imported flag and class variable
    if not _IPYWIDGETS_AVAILABLE or InteractiveViewerWidget is None:
        logger.error(
            "Interactive viewer requires 'ipywidgets'. "
            'Please install with: pip install "ipywidgets>=7.0.0,<10.0.0"'
        )
        # raise ImportError("ipywidgets not found.") # Option 1: Raise error
        return None  # Option 2: Return None gracefully

    # If we reach here, InteractiveViewerWidget should be the actual class
    try:
        # Pass self (the Page object) to the factory method
        return InteractiveViewerWidget.from_page(self)
    except Exception as e:
        # Catch potential errors during widget creation (e.g., image rendering)
        logger.error(
            f"Error creating viewer widget from page {self.number}: {e}", exc_info=True
        )
        # raise # Option 1: Re-raise error (might include ValueError from from_page)
        return None  # Option 2: Return None on creation error
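
In a notebook, a sketch that guards for the None return when ipywidgets is missing:

```python
widget = page.viewer()
if widget is None:
    print("Viewer unavailable; install ipywidgets (see error message above)")
widget  # as the final expression in a Jupyter cell, this displays the widget
```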
natural_pdf.Page.without_exclusions()

Context manager that temporarily disables exclusion processing.

This prevents infinite recursion when exclusion callables themselves use find() operations. While in this context, all find operations will skip exclusion filtering.

Example:

```python
# This exclusion would normally cause infinite recursion:
page.add_exclusion(lambda p: p.find("text:contains('Header')").expand())

# But internally, it's safe because we use:
with page.without_exclusions():
    region = exclusion_callable(page)
```

Yields:

The page object with exclusions temporarily disabled.

Source code in natural_pdf/core/page.py
@contextlib.contextmanager
def without_exclusions(self):
    """
    Context manager that temporarily disables exclusion processing.

    This prevents infinite recursion when exclusion callables themselves
    use find() operations. While in this context, all find operations
    will skip exclusion filtering.

    Example:
        ```python
        # This exclusion would normally cause infinite recursion:
        page.add_exclusion(lambda p: p.find("text:contains('Header')").expand())

        # But internally, it's safe because we use:
        with page.without_exclusions():
            region = exclusion_callable(page)
        ```

    Yields:
        The page object with exclusions temporarily disabled.
    """
    old_value = self._computing_exclusions
    self._computing_exclusions = True
    try:
        yield self
    finally:
        self._computing_exclusions = old_value
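
A sketch showing why this matters for callable exclusions (the selector text is illustrative):

```python
# The callable below calls find(), which would recurse into exclusion
# processing; add_exclusion is safe because the library evaluates it
# inside without_exclusions():
page.add_exclusion(lambda p: p.find("text:contains('CONFIDENTIAL')").expand())
text = page.extract_text()  # exclusion applied, no infinite recursion
```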
natural_pdf.PageCollection

Bases: TextMixin, Generic[P], ApplyMixin, ShapeDetectionMixin, CheckboxDetectionMixin, Visualizable

Represents a collection of Page objects, often from a single PDF document. Provides methods for batch operations on these pages.

Source code in natural_pdf/core/page_collection.py
class PageCollection(
    TextMixin, Generic[P], ApplyMixin, ShapeDetectionMixin, CheckboxDetectionMixin, Visualizable
):
    """
    Represents a collection of Page objects, often from a single PDF document.
    Provides methods for batch operations on these pages.
    """

    def __init__(self, pages: Union[List[P], Sequence[P]]):
        """
        Initialize a page collection.

        Args:
            pages: List or sequence of Page objects (can be lazy)
        """
        # Store the sequence as-is to preserve lazy behavior
        # Only convert to list if we need list-specific operations
        if hasattr(pages, "__iter__") and hasattr(pages, "__len__"):
            self.pages = pages
        else:
            # Fallback for non-sequence types
            self.pages = list(pages)

    def __len__(self) -> int:
        """Return the number of pages in the collection."""
        return len(self.pages)

    def __getitem__(self, idx) -> Union[P, "PageCollection[P]"]:
        """Support indexing and slicing."""
        if isinstance(idx, slice):
            return PageCollection(self.pages[idx])
        return self.pages[idx]

    def __iter__(self) -> Iterator[P]:
        """Support iteration."""
        return iter(self.pages)

    def __repr__(self) -> str:
        """Return a string representation showing the page count."""
        return f"<PageCollection(count={len(self)})>"

    def _get_items_for_apply(self) -> Iterator[P]:
        """
        Override ApplyMixin's _get_items_for_apply to preserve lazy behavior.

        Returns an iterator that yields pages on-demand rather than materializing
        all pages at once, maintaining the lazy loading behavior.
        """
        return iter(self.pages)

    def _get_page_indices(self) -> List[int]:
        """
        Get page indices without forcing materialization of pages.

        Returns:
            List of page indices for the pages in this collection.
        """
        # Handle different types of page sequences efficiently
        if hasattr(self.pages, "_indices"):
            # If it's a _LazyPageList (or slice), get indices directly
            return list(self.pages._indices)
        else:
            # Fallback: if pages are already materialized, get indices normally
            # This will force materialization but only if pages aren't lazy
            return [p.index for p in self.pages]

    def extract_text(
        self,
        keep_blank_chars: bool = True,
        apply_exclusions: bool = True,
        strip: Optional[bool] = None,
        **kwargs,
    ) -> str:
        """
        Extract text from all pages in the collection.

        Args:
            keep_blank_chars: Whether to keep blank characters (default: True)
            apply_exclusions: Whether to apply exclusion regions (default: True)
            strip: Whether to strip whitespace from the extracted text.
            **kwargs: Additional extraction parameters

        Returns:
            Combined text from all pages
        """
        texts = []
        for page in self.pages:
            text = page.extract_text(
                keep_blank_chars=keep_blank_chars,
                apply_exclusions=apply_exclusions,
                **kwargs,
            )
            texts.append(text)

        combined = "\n".join(texts)

        # Default strip behaviour: if caller picks, honour; else respect layout flag passed via kwargs.
        use_layout = kwargs.get("layout", False)
        strip_final = strip if strip is not None else (not use_layout)

        if strip_final:
            combined = "\n".join(line.rstrip() for line in combined.splitlines()).strip()

        return combined

    def apply_ocr(
        self,
        engine: Optional[str] = None,
        # --- Common OCR Parameters (Direct Arguments) ---
        languages: Optional[List[str]] = None,
        min_confidence: Optional[float] = None,  # Min confidence threshold
        device: Optional[str] = None,
        resolution: Optional[int] = None,  # DPI for rendering
        apply_exclusions: bool = True,  # New parameter
        replace: bool = True,  # Whether to replace existing OCR elements
        # --- Engine-Specific Options ---
        options: Optional[Any] = None,  # e.g., EasyOCROptions(...)
    ) -> "PageCollection[P]":
        """
        Applies OCR to all pages within this collection using batch processing.

        This delegates the work to the parent PDF object's `apply_ocr` method.

        Args:
            engine: Name of the OCR engine (e.g., 'easyocr', 'paddleocr').
            languages: List of language codes (e.g., ['en', 'fr'], ['en', 'ch']).
                       **Must be codes understood by the specific selected engine.**
                       No mapping is performed.
            min_confidence: Minimum confidence threshold for detected text (0.0 to 1.0).
            device: Device to run OCR on (e.g., 'cpu', 'cuda', 'mps').
            resolution: DPI resolution to render page images before OCR (e.g., 150, 300).
            apply_exclusions: If True (default), render page images for OCR with
                              excluded areas masked (whited out). If False, OCR
                              the raw page images without masking exclusions.
            replace: If True (default), remove any existing OCR elements before
                    adding new ones. If False, add new OCR elements to existing ones.
            options: An engine-specific options object (e.g., EasyOCROptions) or dict.

        Returns:
            Self for method chaining.

        Raises:
            RuntimeError: If pages lack a parent PDF or parent lacks `apply_ocr`.
            (Propagates exceptions from PDF.apply_ocr)
        """
        if not self.pages:
            logger.warning("Cannot apply OCR to an empty PageCollection.")
            return self

        # Assume all pages share the same parent PDF object
        first_page = self.pages[0]
        if not hasattr(first_page, "_parent") or not first_page._parent:
            raise RuntimeError("Pages in this collection do not have a parent PDF reference.")

        parent_pdf = first_page._parent

        if not hasattr(parent_pdf, "apply_ocr") or not callable(parent_pdf.apply_ocr):
            raise RuntimeError("Parent PDF object does not have the required 'apply_ocr' method.")

        # Get the 0-based indices of the pages in this collection
        page_indices = self._get_page_indices()

        logger.info(f"Applying OCR via parent PDF to page indices: {page_indices} in collection.")

        # Delegate the batch call to the parent PDF object, passing direct args and apply_exclusions
        parent_pdf.apply_ocr(
            pages=page_indices,
            engine=engine,
            languages=languages,
            min_confidence=min_confidence,  # Pass the renamed parameter
            device=device,
            resolution=resolution,
            apply_exclusions=apply_exclusions,  # Pass down
            replace=replace,  # Pass the replace parameter
            options=options,
        )
        # The PDF method modifies the Page objects directly by adding elements.

        return self  # Return self for chaining

    @overload
    def find(
        self,
        *,
        text: str,
        overlap: str = "full",
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional[T]: ...

    @overload
    def find(
        self,
        selector: str,
        *,
        overlap: str = "full",
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional[T]: ...

    def find(
        self,
        selector: Optional[str] = None,
        *,
        text: Optional[str] = None,
        overlap: str = "full",
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional[T]:
        """
        Find the first element matching the selector OR text across all pages in the collection.

        Provide EITHER `selector` OR `text`, but not both.

        Args:
            selector: CSS-like selector string.
            text: Text content to search for (equivalent to 'text:contains(...)').
            overlap: How to determine if elements overlap: 'full' (fully inside),
                     'partial' (any overlap), or 'center' (center point inside).
                     (default: "full")
            apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
            regex: Whether to use regex for text search (`selector` or `text`) (default: False).
            case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
            **kwargs: Additional filter parameters.

        Returns:
            First matching element or None.
        """
        # Input validation happens within page.find
        for page in self.pages:
            element = page.find(
                selector=selector,
                text=text,
                overlap=overlap,
                apply_exclusions=apply_exclusions,
                regex=regex,
                case=case,
                **kwargs,
            )
            if element:
                return element
        return None

    @overload
    def find_all(
        self,
        *,
        text: str,
        overlap: str = "full",
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection": ...

    @overload
    def find_all(
        self,
        selector: str,
        *,
        overlap: str = "full",
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection": ...

    def find_all(
        self,
        selector: Optional[str] = None,
        *,
        text: Optional[str] = None,
        overlap: str = "full",
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection":
        """
        Find all elements matching the selector OR text across all pages in the collection.

        Provide EITHER `selector` OR `text`, but not both.

        Args:
            selector: CSS-like selector string.
            text: Text content to search for (equivalent to 'text:contains(...)').
            overlap: How to determine if elements overlap: 'full' (fully inside),
                     'partial' (any overlap), or 'center' (center point inside).
                     (default: "full")
            apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
            regex: Whether to use regex for text search (`selector` or `text`) (default: False).
            case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
            **kwargs: Additional filter parameters.

        Returns:
            ElementCollection with matching elements from all pages.
        """
        all_elements = []
        # Input validation happens within page.find_all
        for page in self.pages:
            elements = page.find_all(
                selector=selector,
                text=text,
                overlap=overlap,
                apply_exclusions=apply_exclusions,
                regex=regex,
                case=case,
                **kwargs,
            )
            if elements:
                all_elements.extend(elements.elements)

        return ElementCollection(all_elements)

    def update_text(
        self,
        transform: Callable[[Any], Optional[str]],
        selector: str = "text",
        max_workers: Optional[int] = None,
    ) -> "PageCollection[P]":
        """
        Applies corrections to text elements across all pages
        in this collection using a user-provided callback function, executed
        in parallel if `max_workers` is specified.

        This method delegates to the parent PDF's `update_text` method,
        targeting all pages within this collection.

        Args:
            transform: A function that accepts a single argument (an element
                       object) and returns `Optional[str]` (new text or None).
            selector: The attribute name to update. Default is 'text'.
            max_workers: The maximum number of worker threads to use for parallel
                         correction on each page. If None, defaults are used.

        Returns:
            Self for method chaining.

        Raises:
            RuntimeError: If the collection is empty, pages lack a parent PDF reference,
                          or the parent PDF lacks the `update_text` method.
        """
        if not self.pages:
            logger.warning("Cannot update text for an empty PageCollection.")
            # Return self even if empty to maintain chaining consistency
            return self

        # Assume all pages share the same parent PDF object
        parent_pdf = self.pages[0]._parent
        if (
            not parent_pdf
            or not hasattr(parent_pdf, "update_text")
            or not callable(parent_pdf.update_text)
        ):
            raise RuntimeError(
                "Parent PDF reference not found or parent PDF lacks the required 'update_text' method."
            )

        page_indices = self._get_page_indices()
        logger.info(
            f"PageCollection: Delegating text update to parent PDF for page indices: {page_indices} with max_workers={max_workers} and selector='{selector}'."
        )

        # Delegate the call to the parent PDF object for the relevant pages
        # Pass the max_workers parameter down
        parent_pdf.update_text(
            transform=transform,
            pages=page_indices,
            selector=selector,
            max_workers=max_workers,
        )

        return self

    def get_sections(
        self,
        start_elements=None,
        end_elements=None,
        new_section_on_page_break=False,
        include_boundaries="both",
        orientation="vertical",
    ) -> "ElementCollection[Region]":
        """
        Extract sections from a page collection based on start/end elements.

        Args:
            start_elements: Elements or selector string that mark the start of sections (optional)
            end_elements: Elements or selector string that mark the end of sections (optional)
            new_section_on_page_break: Whether to start a new section at page boundaries (default: False)
            include_boundaries: How to include boundary elements: 'start', 'end', 'both', or 'none' (default: 'both')
            orientation: 'vertical' (default) or 'horizontal' - determines section direction

        Returns:
            List of Region objects representing the extracted sections

        Note:
            You can provide only start_elements, only end_elements, or both.
            - With only start_elements: sections go from each start to the next start (or end of page)
            - With only end_elements: sections go from beginning of document/page to each end
            - With both: sections go from each start to the corresponding end
        """
        # Find start and end elements across all pages
        if isinstance(start_elements, str):
            start_elements = self.find_all(start_elements).elements

        if isinstance(end_elements, str):
            end_elements = self.find_all(end_elements).elements

        # If no start elements and no end elements, return empty list
        if not start_elements and not end_elements:
            return []

        # If there are page break boundaries, we'll need to add them
        if new_section_on_page_break:
            # For each page boundary, create virtual "end" and "start" elements
            for i in range(len(self.pages) - 1):
                # Add a virtual "end" element at the bottom of the current page
                page = self.pages[i]
                # If end_elements is None, initialize it as an empty list
                if end_elements is None:
                    end_elements = []

                # Create a region at the bottom of the page as an artificial end marker
                from natural_pdf.elements.region import Region

                bottom_region = Region(page, (0, page.height - 1, page.width, page.height))
                bottom_region.is_page_boundary = True  # Mark it as a special boundary
                end_elements.append(bottom_region)

                # Add a virtual "start" element at the top of the next page
                next_page = self.pages[i + 1]
                top_region = Region(next_page, (0, 0, next_page.width, 1))
                top_region.is_page_boundary = True  # Mark it as a special boundary
                # If start_elements is None, initialize it as an empty list
                if start_elements is None:
                    start_elements = []
                start_elements.append(top_region)

        # Get all elements from all pages and sort them in document order
        all_elements = []
        for page in self.pages:
            elements = page.get_elements()
            all_elements.extend(elements)

        # Sort by page index, then vertical position, then horizontal position
        all_elements.sort(key=lambda e: (e.page.index, e.top, e.x0))

        # If we only have end_elements (no start_elements), create implicit start elements
        if not start_elements and end_elements:
            from natural_pdf.elements.region import Region

            start_elements = []

            # Add implicit start at the beginning of the first page
            first_page = self.pages[0]
            first_start = Region(first_page, (0, 0, first_page.width, 1))
            first_start.is_implicit_start = True
            # Don't mark this as created from any end element, so it can pair with any end
            start_elements.append(first_start)

            # For each end element (except the last), add an implicit start after it
            # Sort by page, then top, then bottom (for elements with same top), then x0
            sorted_end_elements = sorted(
                end_elements, key=lambda e: (e.page.index, e.top, e.bottom, e.x0)
            )
            for i, end_elem in enumerate(sorted_end_elements[:-1]):  # Exclude last end element
                # Create implicit start element right after this end element
                implicit_start = Region(
                    end_elem.page, (0, end_elem.bottom, end_elem.page.width, end_elem.bottom + 1)
                )
                implicit_start.is_implicit_start = True
                # Track which end element this implicit start was created from
                # to avoid pairing them together (which would create zero height)
                implicit_start.created_from_end = end_elem
                start_elements.append(implicit_start)

        # Mark section boundaries
        section_boundaries = []

        # Add start element boundaries
        for element in start_elements:
            if element in all_elements:
                idx = all_elements.index(element)
                section_boundaries.append(
                    {
                        "index": idx,
                        "element": element,
                        "type": "start",
                        "page_idx": element.page.index,
                    }
                )
            elif hasattr(element, "is_page_boundary") and element.is_page_boundary:
                # This is a virtual page boundary element
                section_boundaries.append(
                    {
                        "index": -1,  # Special index for page boundaries
                        "element": element,
                        "type": "start",
                        "page_idx": element.page.index,
                    }
                )
            elif hasattr(element, "is_implicit_start") and element.is_implicit_start:
                # This is an implicit start element
                section_boundaries.append(
                    {
                        "index": -2,  # Special index for implicit starts
                        "element": element,
                        "type": "start",
                        "page_idx": element.page.index,
                    }
                )

        # Add end element boundaries if provided
        if end_elements:
            for element in end_elements:
                if element in all_elements:
                    idx = all_elements.index(element)
                    section_boundaries.append(
                        {
                            "index": idx,
                            "element": element,
                            "type": "end",
                            "page_idx": element.page.index,
                        }
                    )
                elif hasattr(element, "is_page_boundary") and element.is_page_boundary:
                    # This is a virtual page boundary element
                    section_boundaries.append(
                        {
                            "index": -1,  # Special index for page boundaries
                            "element": element,
                            "type": "end",
                            "page_idx": element.page.index,
                        }
                    )

        # Sort boundaries by page index, then by actual document position
        def _sort_key(boundary):
            """Sort boundaries by (page_idx, position, priority)."""
            page_idx = boundary["page_idx"]
            element = boundary["element"]

            # Position on the page based on orientation
            if orientation == "vertical":
                pos = getattr(element, "top", 0.0)
            else:  # horizontal
                pos = getattr(element, "x0", 0.0)

            # Ensure starts come before ends at the same coordinate
            priority = 0 if boundary["type"] == "start" else 1

            return (page_idx, pos, priority)

        section_boundaries.sort(key=_sort_key)

        # Generate sections
        sections = []

        # --- Helper: build a FlowRegion spanning multiple pages ---
        def _build_flow_region(start_el, end_el, include_boundaries="both", orientation="vertical"):
            """Return a FlowRegion that covers from *start_el* to *end_el*.
            If *end_el* is None, the region continues to the bottom/right of the last
            page in this PageCollection.

            Args:
                start_el: Start element
                end_el: End element
                include_boundaries: How to include boundary elements: 'start', 'end', 'both', or 'none'
                orientation: 'vertical' or 'horizontal' - determines section direction
            """
            # Local imports to avoid top-level cycles
            from natural_pdf.elements.region import Region
            from natural_pdf.flows.element import FlowElement
            from natural_pdf.flows.flow import Flow
            from natural_pdf.flows.region import FlowRegion

            start_pg = start_el.page
            end_pg = end_el.page if end_el is not None else self.pages[-1]

            parts: list[Region] = []

            if orientation == "vertical":
                # Determine the start_top based on include_boundaries
                start_top = start_el.top
                if include_boundaries == "none" or include_boundaries == "end":
                    # Exclude start boundary
                    start_top = start_el.bottom if hasattr(start_el, "bottom") else start_el.top

                # Slice of first page beginning at *start_top*
                parts.append(Region(start_pg, (0, start_top, start_pg.width, start_pg.height)))
            else:  # horizontal
                # Determine the start_left based on include_boundaries
                start_left = start_el.x0
                if include_boundaries == "none" or include_boundaries == "end":
                    # Exclude start boundary
                    start_left = start_el.x1 if hasattr(start_el, "x1") else start_el.x0

                # Slice of first page beginning at *start_left*
                parts.append(Region(start_pg, (start_left, 0, start_pg.width, start_pg.height)))

            # Full middle pages
            for pg_idx in range(start_pg.index + 1, end_pg.index):
                mid_pg = self.pages[pg_idx]
                parts.append(Region(mid_pg, (0, 0, mid_pg.width, mid_pg.height)))

            # Slice of last page (if distinct)
            if end_pg is not start_pg:
                if orientation == "vertical":
                    # Determine the bottom based on include_boundaries
                    if end_el is not None:
                        if include_boundaries == "none" or include_boundaries == "start":
                            # Exclude end boundary
                            bottom = end_el.top if hasattr(end_el, "top") else end_el.bottom
                        else:
                            # Include end boundary
                            bottom = end_el.bottom
                    else:
                        bottom = end_pg.height
                    parts.append(Region(end_pg, (0, 0, end_pg.width, bottom)))
                else:  # horizontal
                    # Determine the right based on include_boundaries
                    if end_el is not None:
                        if include_boundaries == "none" or include_boundaries == "start":
                            # Exclude end boundary
                            right = end_el.x0 if hasattr(end_el, "x0") else end_el.x1
                        else:
                            # Include end boundary
                            right = end_el.x1
                    else:
                        right = end_pg.width
                    parts.append(Region(end_pg, (0, 0, right, end_pg.height)))

            flow = Flow(segments=parts, arrangement=orientation)
            src_fe = FlowElement(physical_object=start_el, flow=flow)
            return FlowRegion(
                flow=flow,
                constituent_regions=parts,
                source_flow_element=src_fe,
                boundary_element_found=end_el,
            )

        # ------------------------------------------------------------------

        current_start = None

        for i, boundary in enumerate(section_boundaries):
            # If it's a start boundary and we don't have a current start
            if boundary["type"] == "start" and current_start is None:
                current_start = boundary

            # If it's an end boundary and we have a current start
            elif boundary["type"] == "end" and current_start is not None:
                # Create a section from current_start to this boundary
                start_element = current_start["element"]
                end_element = boundary["element"]

                # Check if this is an implicit start created from this same end element
                # This would create a zero-height section, so skip this pairing
                if (
                    hasattr(start_element, "is_implicit_start")
                    and hasattr(start_element, "created_from_end")
                    and start_element.created_from_end is end_element
                ):
                    # Skip this pairing - keep current_start for next end element
                    continue

                # If both elements are on the same page, use the page's get_section_between
                if start_element.page == end_element.page:
                    # For implicit start elements, create a region from the top of the page
                    if hasattr(start_element, "is_implicit_start"):
                        from natural_pdf.elements.region import Region

                        # Adjust boundaries based on include_boundaries parameter and orientation
                        if orientation == "vertical":
                            top = start_element.top
                            bottom = end_element.bottom

                            if include_boundaries == "none":
                                # Exclude both boundaries - move past them
                                top = (
                                    start_element.bottom
                                    if hasattr(start_element, "bottom")
                                    else start_element.top
                                )
                                bottom = (
                                    end_element.top
                                    if hasattr(end_element, "top")
                                    else end_element.bottom
                                )
                            elif include_boundaries == "start":
                                # Include start, exclude end
                                bottom = (
                                    end_element.top
                                    if hasattr(end_element, "top")
                                    else end_element.bottom
                                )
                            elif include_boundaries == "end":
                                # Exclude start, include end
                                top = (
                                    start_element.bottom
                                    if hasattr(start_element, "bottom")
                                    else start_element.top
                                )
                            # "both" is default - no adjustment needed

                            section = Region(
                                start_element.page,
                                (0, top, start_element.page.width, bottom),
                            )
                            section._boundary_exclusions = include_boundaries
                        else:  # horizontal
                            left = start_element.x0
                            right = end_element.x1

                            if include_boundaries == "none":
                                # Exclude both boundaries - move past them
                                left = (
                                    start_element.x1
                                    if hasattr(start_element, "x1")
                                    else start_element.x0
                                )
                                right = (
                                    end_element.x0 if hasattr(end_element, "x0") else end_element.x1
                                )
                            elif include_boundaries == "start":
                                # Include start, exclude end
                                right = (
                                    end_element.x0 if hasattr(end_element, "x0") else end_element.x1
                                )
                            elif include_boundaries == "end":
                                # Exclude start, include end
                                left = (
                                    start_element.x1
                                    if hasattr(start_element, "x1")
                                    else start_element.x0
                                )
                            # "both" is default - no adjustment needed

                            section = Region(
                                start_element.page,
                                (left, 0, right, start_element.page.height),
                            )
                            section._boundary_exclusions = include_boundaries
                        section.start_element = start_element
                        section.boundary_element_found = end_element
                    else:
                        section = start_element.page.get_section_between(
                            start_element, end_element, include_boundaries, orientation
                        )
                    sections.append(section)
                else:
                    # Create FlowRegion spanning pages
                    flow_region = _build_flow_region(
                        start_element, end_element, include_boundaries, orientation
                    )
                    sections.append(flow_region)

                current_start = None

            # If it's another start boundary and we have a current start (for splitting by starts only)
            elif boundary["type"] == "start" and current_start is not None and not end_elements:
                # Create a section from current_start to just before this boundary
                start_element = current_start["element"]

                # Create section from current start to just before this new start
                if start_element.page == boundary["element"].page:
                    from natural_pdf.elements.region import Region

                    next_start = boundary["element"]

                    # Create section based on orientation
                    if orientation == "vertical":
                        # Determine vertical bounds
                        if include_boundaries in ["start", "both"]:
                            top = start_element.top
                        else:
                            top = start_element.bottom

                        # The section ends just before the next start
                        bottom = next_start.top

                        # Create the section with full page width
                        if top < bottom:
                            section = Region(
                                start_element.page, (0, top, start_element.page.width, bottom)
                            )
                            section.start_element = start_element
                            section.end_element = (
                                next_start  # The next start is the end of this section
                            )
                            section._boundary_exclusions = include_boundaries
                            sections.append(section)
                    else:  # horizontal
                        # Determine horizontal bounds
                        if include_boundaries in ["start", "both"]:
                            left = start_element.x0
                        else:
                            left = start_element.x1

                        # The section ends just before the next start
                        right = next_start.x0

                        # Create the section with full page height
                        if left < right:
                            section = Region(
                                start_element.page, (left, 0, right, start_element.page.height)
                            )
                            section.start_element = start_element
                            section.end_element = (
                                next_start  # The next start is the end of this section
                            )
                            section._boundary_exclusions = include_boundaries
                            sections.append(section)
                else:
                    # Cross-page section - create from current_start to the end of its page
                    from natural_pdf.elements.region import Region

                    start_page = start_element.page

                    # Handle implicit start elements and respect include_boundaries
                    if orientation == "vertical":
                        if include_boundaries in ["none", "end"]:
                            # Exclude start boundary
                            start_top = (
                                start_element.bottom
                                if hasattr(start_element, "bottom")
                                else start_element.top
                            )
                        else:
                            # Include start boundary
                            start_top = start_element.top

                        region = Region(
                            start_page, (0, start_top, start_page.width, start_page.height)
                        )
                    else:  # horizontal
                        if include_boundaries in ["none", "end"]:
                            # Exclude start boundary
                            start_left = (
                                start_element.x1
                                if hasattr(start_element, "x1")
                                else start_element.x0
                            )
                        else:
                            # Include start boundary
                            start_left = start_element.x0

                        region = Region(
                            start_page, (start_left, 0, start_page.width, start_page.height)
                        )
                    region.start_element = start_element
                    sections.append(region)

                current_start = boundary

        # Handle the last section if we have a current start
        if current_start is not None:
            start_element = current_start["element"]
            start_page = start_element.page

            if end_elements:
                # With end_elements, we need an explicit end - use the last element
                # on the last page of the collection
                last_page = self.pages[-1]
                last_page_elements = [e for e in all_elements if e.page == last_page]
                if orientation == "vertical":
                    last_page_elements.sort(key=lambda e: (e.top, e.x0))
                else:  # horizontal
                    last_page_elements.sort(key=lambda e: (e.x0, e.top))
                end_element = last_page_elements[-1] if last_page_elements else None

                # Create FlowRegion spanning multiple pages using helper
                flow_region = _build_flow_region(
                    start_element, end_element, include_boundaries, orientation
                )
                sections.append(flow_region)
            else:
                # With start_elements only, create a section to the end of the current page
                from natural_pdf.elements.region import Region

                # Handle implicit start elements and respect include_boundaries
                if orientation == "vertical":
                    if include_boundaries in ["none", "end"]:
                        # Exclude start boundary
                        start_top = (
                            start_element.bottom
                            if hasattr(start_element, "bottom")
                            else start_element.top
                        )
                    else:
                        # Include start boundary
                        start_top = start_element.top

                    region = Region(start_page, (0, start_top, start_page.width, start_page.height))
                else:  # horizontal
                    if include_boundaries in ["none", "end"]:
                        # Exclude start boundary
                        start_left = (
                            start_element.x1 if hasattr(start_element, "x1") else start_element.x0
                        )
                    else:
                        # Include start boundary
                        start_left = start_element.x0

                    region = Region(
                        start_page, (start_left, 0, start_page.width, start_page.height)
                    )
                region.start_element = start_element
                sections.append(region)

        return ElementCollection(sections)

    def split(self, divider, **kwargs) -> "ElementCollection[Region]":
        """
        Divide this page collection into sections based on the provided divider elements.

        Args:
            divider: Elements or selector string that mark section boundaries
            **kwargs: Additional parameters passed to get_sections()
                - include_boundaries: How to include boundary elements (default: 'start')
                - orientation: 'vertical' or 'horizontal' (default: 'vertical')
                - new_section_on_page_break: Whether to split at page boundaries (default: False)

        Returns:
            ElementCollection of Region objects representing the sections

        Example:
            # Split a PDF by chapter titles
            chapters = pdf.pages.split("text[size>20]:contains('CHAPTER')")

            # Split by page breaks
            page_sections = pdf.pages.split(None, new_section_on_page_break=True)

            # Split multi-page document by section headers
            sections = pdf.pages[10:20].split("text:bold:contains('Section')")
        """
        # Default to 'start' boundaries for split (include divider at start of each section)
        if "include_boundaries" not in kwargs:
            kwargs["include_boundaries"] = "start"

        sections = self.get_sections(start_elements=divider, **kwargs)

        # Add initial section if there's content before the first divider
        if sections and divider is not None:
            # Get all elements across all pages
            all_elements = []
            for page in self.pages:
                all_elements.extend(page.get_elements())

            if all_elements:
                # Find first divider
                if isinstance(divider, str):
                    # Search for first matching element
                    first_divider = None
                    for page in self.pages:
                        match = page.find(divider)
                        if match:
                            first_divider = match
                            break
                else:
                    # divider is already elements
                    first_divider = divider[0] if hasattr(divider, "__getitem__") else divider

                if first_divider and all_elements[0] != first_divider:
                    # There's content before the first divider
                    # Get section from start to first divider
                    initial_sections = self.get_sections(
                        start_elements=None,
                        end_elements=[first_divider],
                        include_boundaries="none",
                        orientation=kwargs.get("orientation", "vertical"),
                    )
                    if initial_sections:
                        sections = ElementCollection([initial_sections[0]] + list(sections))

        return sections

    def _gather_analysis_data(
        self,
        analysis_keys: List[str],
        include_content: bool,
        include_images: bool,
        image_dir: Optional[Path],
        image_format: str,
        image_resolution: int,
    ) -> List[Dict[str, Any]]:
        """
        Gather analysis data from all pages in the collection.

        Args:
            analysis_keys: Keys in the analyses dictionary to export
            include_content: Whether to include extracted text
            include_images: Whether to export images
            image_dir: Directory to save images
            image_format: Format to save images
            image_resolution: Resolution for exported images

        Returns:
            List of dictionaries containing analysis data
        """
        if not self.elements:
            logger.warning("No pages found in collection")
            return []

        all_data = []

        for page in self.elements:
            # Basic page information
            page_data = {
                "page_number": page.number,
                "page_index": page.index,
                "width": page.width,
                "height": page.height,
            }

            # Add PDF information if available
            if hasattr(page, "pdf") and page.pdf:
                page_data["pdf_path"] = page.pdf.path
                page_data["pdf_filename"] = Path(page.pdf.path).name

            # Include extracted text if requested
            if include_content:
                try:
                    page_data["content"] = page.extract_text(preserve_whitespace=True)
                except Exception as e:
                    logger.error(f"Error extracting text from page {page.number}: {e}")
                    page_data["content"] = ""

            # Save image if requested
            if include_images:
                try:
                    # Create image filename
                    pdf_name = "unknown"
                    if hasattr(page, "pdf") and page.pdf:
                        pdf_name = Path(page.pdf.path).stem

                    image_filename = f"{pdf_name}_page_{page.number}.{image_format}"
                    image_path = image_dir / image_filename

                    # Save image
                    page.save_image(
                        str(image_path), resolution=image_resolution, include_highlights=True
                    )

                    # Add relative path to data
                    page_data["image_path"] = str(Path(image_path).relative_to(image_dir.parent))
                except Exception as e:
                    logger.error(f"Error saving image for page {page.number}: {e}")
                    page_data["image_path"] = None

            # Add analyses data
            if hasattr(page, "analyses") and page.analyses:
                for key in analysis_keys:
                    if key not in page.analyses:
                        raise KeyError(f"Analysis key '{key}' not found in page {page.number}")

                    # Get the analysis result
                    analysis_result = page.analyses[key]

                    # If the result has a to_dict method, use it
                    if hasattr(analysis_result, "to_dict"):
                        analysis_data = analysis_result.to_dict()
                    else:
                        # Otherwise, use the result directly if it's dict-like
                        try:
                            analysis_data = dict(analysis_result)
                        except (TypeError, ValueError):
                            # Last resort: convert to string
                            analysis_data = {"raw_result": str(analysis_result)}

                    # Add analysis data to page data with the key as prefix
                    for k, v in analysis_data.items():
                        page_data[f"{key}.{k}"] = v

            all_data.append(page_data)

        return all_data

    # --- Deskew Method --- #

    def deskew(
        self,
        resolution: int = 300,
        detection_resolution: int = 72,
        force_overwrite: bool = False,
        **deskew_kwargs,
    ) -> "PDF":  # Changed return type
        """
        Creates a new, in-memory PDF object containing deskewed versions of the pages
        in this collection.

        This method delegates the actual processing to the parent PDF object's
        `deskew` method.

        Important: The returned PDF is image-based. Any existing text, OCR results,
        annotations, or other elements from the original pages will *not* be carried over.

        Args:
            resolution: DPI resolution for rendering the output deskewed pages.
            detection_resolution: DPI resolution used for skew detection if angles are not
                                  already cached on the page objects.
            force_overwrite: If False (default), raises a ValueError if any target page
                             already contains processed elements (text, OCR, regions) to
                             prevent accidental data loss. Set to True to proceed anyway.
            **deskew_kwargs: Additional keyword arguments passed to `deskew.determine_skew`
                             during automatic detection (e.g., `max_angle`, `num_peaks`).

        Returns:
            A new PDF object representing the deskewed document.

        Raises:
            ImportError: If 'deskew' or 'img2pdf' libraries are not installed (raised by PDF.deskew).
            ValueError: If `force_overwrite` is False and target pages contain elements (raised by PDF.deskew),
                        or if the collection is empty.
            RuntimeError: If pages lack a parent PDF reference, or the parent PDF lacks the `deskew` method.
        """
        if not self.pages:
            logger.warning("Cannot deskew an empty PageCollection.")
            raise ValueError("Cannot deskew an empty PageCollection.")

        # Assume all pages share the same parent PDF object
        # Need to hint the type of _parent for type checkers
        if TYPE_CHECKING:
            parent_pdf: "natural_pdf.core.pdf.PDF" = self.pages[0]._parent
        else:
            parent_pdf = self.pages[0]._parent

        if not parent_pdf or not hasattr(parent_pdf, "deskew") or not callable(parent_pdf.deskew):
            raise RuntimeError(
                "Parent PDF reference not found or parent PDF lacks the required 'deskew' method."
            )

        # Get the 0-based indices of the pages in this collection
        page_indices = self._get_page_indices()
        logger.info(
            f"PageCollection: Delegating deskew to parent PDF for page indices: {page_indices}"
        )

        # Delegate the call to the parent PDF object for the relevant pages
        # Pass all relevant arguments through (no output_path anymore)
        return parent_pdf.deskew(
            pages=page_indices,
            resolution=resolution,
            detection_resolution=detection_resolution,
            force_overwrite=force_overwrite,
            **deskew_kwargs,
        )

    # --- End Deskew Method --- #
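    # Usage sketch (hypothetical file name; requires the optional 'deskew'
    # and 'img2pdf' dependencies):
    #   straightened = npdf.PDF("scanned.pdf").pages.deskew(resolution=300)
    #   straightened.apply_ocr()  # the result is image-based, so re-run OCR on it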

    def _get_render_specs(
        self,
        mode: Literal["show", "render"] = "show",
        color: Optional[Union[str, Tuple[int, int, int]]] = None,
        highlights: Optional[List[Dict[str, Any]]] = None,
        crop: Union[bool, Literal["content"]] = False,
        crop_bbox: Optional[Tuple[float, float, float, float]] = None,
        **kwargs,
    ) -> List[RenderSpec]:
        """Get render specifications for this page collection.

        For page collections, we return specs for all pages that will be
        rendered into a grid layout.

        Args:
            mode: Rendering mode - 'show' includes highlights, 'render' is clean
            color: Color for highlighting pages in show mode
            highlights: Additional highlight groups to show
            crop: Whether to crop pages
            crop_bbox: Explicit crop bounds
            **kwargs: Additional parameters

        Returns:
            List of RenderSpec objects, one per page
        """
        specs = []

        # Get max pages from kwargs if specified
        max_pages = kwargs.get("max_pages")
        pages_to_render = self.pages[:max_pages] if max_pages else self.pages

        for page in pages_to_render:
            if hasattr(page, "_get_render_specs"):
                # Page has the new unified rendering
                page_specs = page._get_render_specs(
                    mode=mode,
                    color=color,
                    highlights=highlights,
                    crop=crop,
                    crop_bbox=crop_bbox,
                    **kwargs,
                )
                specs.extend(page_specs)
            else:
                # Fallback for pages without unified rendering
                spec = RenderSpec(page=page)
                if crop_bbox:
                    spec.crop_bbox = crop_bbox
                specs.append(spec)

        return specs

    def save_pdf(
        self,
        output_path: Union[str, Path],
        ocr: bool = False,
        original: bool = False,
        dpi: int = 300,
    ):
        """
        Saves the pages in this collection to a new PDF file.

        Choose one saving mode:
        - `ocr=True`: Creates a new, image-based PDF using OCR results. This
          makes the text generated during the natural-pdf session searchable,
          but loses original vector content. Requires 'ocr-export' extras.
        - `original=True`: Extracts the original pages from the source PDF,
          preserving all vector content, fonts, and annotations. OCR results
          from the natural-pdf session are NOT included. Requires 'ocr-export' extras.

        Args:
            output_path: Path to save the new PDF file.
            ocr: If True, save as a searchable, image-based PDF using OCR data.
            original: If True, save the original, vector-based pages.
            dpi: Resolution (dots per inch) used only when ocr=True for
                 rendering page images and aligning the text layer.

        Raises:
            ValueError: If the collection is empty, if neither or both 'ocr'
                        and 'original' are True, or if 'original=True' and
                        pages originate from different PDFs.
            ImportError: If required libraries ('pikepdf', 'Pillow')
                         are not installed for the chosen mode.
            RuntimeError: If an unexpected error occurs during saving.
        """
        if not self.pages:
            raise ValueError("Cannot save an empty PageCollection.")

        if not (ocr ^ original):  # XOR: exactly one must be true
            raise ValueError("Exactly one of 'ocr' or 'original' must be True.")

        output_path_obj = Path(output_path)
        output_path_str = str(output_path_obj)

        if ocr:
            if create_searchable_pdf is None:
                raise ImportError(
                    "Saving with ocr=True requires 'pikepdf' and 'Pillow'. "
                    'Install with: pip install "natural-pdf[ocr-export]"'
                )

            # Check for non-OCR vector elements (provide a warning)
            has_vector_elements = False
            for page in self.pages:
                # Simplified check for common vector types or non-OCR chars/words
                if (
                    hasattr(page, "rects")
                    and page.rects
                    or hasattr(page, "lines")
                    and page.lines
                    or hasattr(page, "curves")
                    and page.curves
                    or (
                        hasattr(page, "chars")
                        and any(getattr(el, "source", None) != "ocr" for el in page.chars)
                    )
                    or (
                        hasattr(page, "words")
                        and any(getattr(el, "source", None) != "ocr" for el in page.words)
                    )
                ):
                    has_vector_elements = True
                    break
            if has_vector_elements:
                logger.warning(
                    "Warning: Saving with ocr=True creates an image-based PDF. "
                    "Original vector elements (rects, lines, non-OCR text/chars) "
                    "on selected pages will not be preserved in the output file."
                )

            logger.info(f"Saving searchable PDF (OCR text layer) to: {output_path_str}")
            try:
                # Delegate to the searchable PDF exporter function
                # Pass `self` (the PageCollection instance) as the source
                create_searchable_pdf(self, output_path_str, dpi=dpi)
            except Exception as e:
                logger.error(f"Failed to create searchable PDF: {e}", exc_info=True)
                # Re-raise as RuntimeError for consistency, potentially handled in exporter too
                raise RuntimeError(f"Failed to create searchable PDF: {e}") from e

        elif original:
            if create_original_pdf is None:
                raise ImportError(
                    "Saving with original=True requires 'pikepdf'. "
                    'Install with: pip install "natural-pdf[ocr-export]"'
                )

            # Check for OCR elements (provide a warning) - keep this check here
            has_ocr_elements = False
            for page in self.pages:
                # Use find_all which returns a collection; check if it's non-empty
                if hasattr(page, "find_all"):
                    ocr_text_elements = page.find_all("text[source=ocr]")
                    if ocr_text_elements:  # Check truthiness of collection
                        has_ocr_elements = True
                        break
                elif hasattr(page, "words"):  # Fallback check if find_all isn't present?
                    if any(getattr(el, "source", None) == "ocr" for el in page.words):
                        has_ocr_elements = True
                        break

            if has_ocr_elements:
                logger.warning(
                    "Warning: Saving with original=True preserves original page content. "
                    "OCR text generated in this session will not be included in the saved file."
                )

            logger.info(f"Saving original pages PDF to: {output_path_str}")
            try:
                # Delegate to the original PDF exporter function
                # Pass `self` (the PageCollection instance) as the source
                create_original_pdf(self, output_path_str)
            except Exception as e:
                # Error logging is handled within create_original_pdf
                # Re-raise the exception caught from the exporter
                raise e  # Keep the original exception type (ValueError, RuntimeError, etc.)
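    # Usage sketch (hypothetical paths; both modes need the 'ocr-export' extras):
    #   pages.save_pdf("searchable.pdf", ocr=True, dpi=300)  # image-based + OCR text layer
    #   pages.save_pdf("subset.pdf", original=True)          # original vector pages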

    def to_flow(
        self,
        arrangement: Literal["vertical", "horizontal"] = "vertical",
        alignment: Literal["start", "center", "end", "top", "left", "bottom", "right"] = "start",
        segment_gap: float = 0.0,
    ) -> "Flow":
        """
        Convert this PageCollection to a Flow for cross-page operations.

        This enables treating multiple pages as a continuous logical document
        structure, useful for multi-page tables, articles spanning columns,
        or any content requiring reading order across page boundaries.

        Args:
            arrangement: Primary flow direction ('vertical' or 'horizontal').
                        'vertical' stacks pages top-to-bottom (most common).
                        'horizontal' arranges pages left-to-right.
            alignment: Cross-axis alignment for pages of different sizes:
                      For vertical: 'left'/'start', 'center', 'right'/'end'
                      For horizontal: 'top'/'start', 'center', 'bottom'/'end'
            segment_gap: Virtual gap between pages in PDF points (default: 0.0).

        Returns:
            Flow object that can perform operations across all pages in sequence.

        Example:
            Multi-page table extraction:
            ```python
            pdf = npdf.PDF("multi_page_report.pdf")

            # Create flow for pages 2-4 containing a table
            table_flow = pdf.pages[1:4].to_flow()

            # Extract table as if it were continuous
            table_data = table_flow.extract_table()
            df = table_data.df
            ```

            Cross-page element search:
            ```python
            # Find all headers across multiple pages
            headers = pdf.pages[5:10].to_flow().find_all('text[size>12]:bold')

            # Analyze layout across pages
            regions = pdf.pages.to_flow().analyze_layout(engine='yolo')
            ```
        """
        from natural_pdf.flows.flow import Flow

        return Flow(
            segments=self,  # Flow constructor now handles PageCollection
            arrangement=arrangement,
            alignment=alignment,
            segment_gap=segment_gap,
        )

    def analyze_layout(self, *args, **kwargs) -> "ElementCollection[Region]":
        """
        Analyzes the layout of each page in the collection.

        This method iterates through each page, calls its analyze_layout method,
        and returns a single ElementCollection containing all the detected layout
        regions from all pages.

        Args:
            *args: Positional arguments to pass to each page's analyze_layout method.
            **kwargs: Keyword arguments to pass to each page's analyze_layout method.
                      A 'show_progress' kwarg can be included to show a progress bar.

        Returns:
            An ElementCollection of all detected Region objects.
        """
        all_regions = []

        show_progress = kwargs.pop("show_progress", True)

        iterator = self.pages
        if show_progress:
            try:
                from tqdm.auto import tqdm

                iterator = tqdm(self.pages, desc="Analyzing layout")
            except ImportError:
                pass  # tqdm not installed

        for page in iterator:
            # Each page's analyze_layout method returns an ElementCollection
            regions_collection = page.analyze_layout(*args, **kwargs)
            if regions_collection:
                all_regions.extend(regions_collection.elements)

        return ElementCollection(all_regions)

    def detect_checkboxes(self, *args, **kwargs) -> "ElementCollection[Region]":
        """
        Detects checkboxes on each page in the collection.

        This method iterates through each page, calls its detect_checkboxes method,
        and returns a single ElementCollection containing all detected checkbox
        regions from all pages.

        Args:
            *args: Positional arguments to pass to each page's detect_checkboxes method.
            **kwargs: Keyword arguments to pass to each page's detect_checkboxes method.
                      A 'show_progress' kwarg can be included to show a progress bar.

        Returns:
            An ElementCollection of all detected checkbox Region objects.
        """
        all_checkboxes = []

        show_progress = kwargs.pop("show_progress", True)

        iterator = self.pages
        if show_progress:
            try:
                from tqdm.auto import tqdm

                iterator = tqdm(self.pages, desc="Detecting checkboxes")
            except ImportError:
                pass  # tqdm not installed

        for page in iterator:
            # Each page's detect_checkboxes method returns an ElementCollection
            checkbox_collection = page.detect_checkboxes(*args, **kwargs)
            if checkbox_collection:
                all_checkboxes.extend(checkbox_collection.elements)

        return ElementCollection(all_checkboxes)
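    # Usage sketch (illustrative): collect checkbox regions from every page
    #   boxes = pdf.pages.detect_checkboxes(show_progress=False)
    #   print(len(boxes.elements))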

    def highlights(self, show: bool = False) -> "HighlightContext":
        """
        Create a highlight context for accumulating highlights.

        This allows for clean syntax to show multiple highlight groups:

        Example:
            with pages.highlights() as h:
                h.add(pages.find_all('table'), label='tables', color='blue')
                h.add(pages.find_all('text:bold'), label='bold text', color='red')
                h.show()

        Or with automatic display:
            with pages.highlights(show=True) as h:
                h.add(pages.find_all('table'), label='tables')
                h.add(pages.find_all('text:bold'), label='bold')
                # Automatically shows when exiting the context

        Args:
            show: If True, automatically show highlights when exiting context

        Returns:
            HighlightContext for accumulating highlights
        """
        from natural_pdf.core.highlighting_service import HighlightContext

        return HighlightContext(self, show_on_exit=show)

    def groupby(self, by: Union[str, Callable], *, show_progress: bool = True) -> "PageGroupBy":
        """
        Group pages by selector text or callable result.

        Args:
            by: CSS selector string or callable function
            show_progress: Whether to show progress bar during computation (default: True)

        Returns:
            PageGroupBy object supporting iteration and dict-like access

        Examples:
            # Group by header text
            for title, pages in pdf.pages.groupby('text[size=16]'):
                print(f"Section: {title}")

            # Group by callable
            for city, pages in pdf.pages.groupby(lambda p: p.find('text:contains("CITY")').extract_text()):
                process_city_pages(pages)

            # Quick exploration with indexing
            grouped = pdf.pages.groupby('text[size=16]')
            grouped.info()                    # Show all groups
            first_section = grouped[0]        # First group
            last_section = grouped[-1]       # Last group

            # Dict-like access by name
            madison_pages = grouped.get('CITY OF MADISON')
            madison_pages = grouped['CITY OF MADISON']  # Alternative

            # Disable progress bar for small collections
            grouped = pdf.pages.groupby('text[size=16]', show_progress=False)
        """
        from natural_pdf.core.page_groupby import PageGroupBy

        return PageGroupBy(self, by, show_progress=show_progress)
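
The sectioning methods above are easier to grasp with a concrete sketch. A minimal example, assuming a hypothetical `report.pdf` whose section headings are bold (the file name and selector are illustrative):

```python
import natural_pdf as npdf

pdf = npdf.PDF("report.pdf")  # hypothetical file

# One section per bold heading; sections that cross a page
# boundary come back as FlowRegion objects
sections = pdf.pages.get_sections(
    start_elements="text:bold",
    include_boundaries="start",
)
for section in sections:
    print(section.extract_text()[:80])

# split() is the divider-based shorthand, and also prepends any
# content that appears before the first divider
chapters = pdf.pages.split("text:bold")
```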
Functions
natural_pdf.PageCollection.__getitem__(idx)

Support indexing and slicing.

Source code in natural_pdf/core/page_collection.py
def __getitem__(self, idx) -> Union[P, "PageCollection[P]"]:
    """Support indexing and slicing."""
    if isinstance(idx, slice):
        return PageCollection(self.pages[idx])
    return self.pages[idx]
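
A quick sketch of indexing and slicing (file name illustrative):

```python
import natural_pdf as npdf

pdf = npdf.PDF("document.pdf")  # hypothetical file

first = pdf.pages[0]       # a single Page
subset = pdf.pages[2:5]    # slicing returns a new PageCollection
last = pdf.pages[-1]       # negative indices behave like list indexing
```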
natural_pdf.PageCollection.__init__(pages)

Initialize a page collection.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `pages` | `Union[List[P], Sequence[P]]` | List or sequence of Page objects (can be lazy) | *required* |
Source code in natural_pdf/core/page_collection.py
def __init__(self, pages: Union[List[P], Sequence[P]]):
    """
    Initialize a page collection.

    Args:
        pages: List or sequence of Page objects (can be lazy)
    """
    # Store the sequence as-is to preserve lazy behavior
    # Only convert to list if we need list-specific operations
    if hasattr(pages, "__iter__") and hasattr(pages, "__len__"):
        self.pages = pages
    else:
        # Fallback for non-sequence types
        self.pages = list(pages)
natural_pdf.PageCollection.__iter__()

Support iteration.

Source code in natural_pdf/core/page_collection.py
def __iter__(self) -> Iterator[P]:
    """Support iteration."""
    return iter(self.pages)
natural_pdf.PageCollection.__len__()

Return the number of pages in the collection.

Source code in natural_pdf/core/page_collection.py
def __len__(self) -> int:
    """Return the number of pages in the collection."""
    return len(self.pages)
natural_pdf.PageCollection.__repr__()

Return a string representation showing the page count.

Source code in natural_pdf/core/page_collection.py
def __repr__(self) -> str:
    """Return a string representation showing the page count."""
    return f"<PageCollection(count={len(self)})>"
natural_pdf.PageCollection.analyze_layout(*args, **kwargs)

Analyzes the layout of each page in the collection.

This method iterates through each page, calls its analyze_layout method, and returns a single ElementCollection containing all the detected layout regions from all pages.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `*args` |  | Positional arguments to pass to each page's analyze_layout method. | `()` |
| `**kwargs` |  | Keyword arguments to pass to each page's analyze_layout method. A 'show_progress' kwarg can be included to show a progress bar. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `ElementCollection[Region]` | An ElementCollection of all detected Region objects. |

Source code in natural_pdf/core/page_collection.py
def analyze_layout(self, *args, **kwargs) -> "ElementCollection[Region]":
    """
    Analyzes the layout of each page in the collection.

    This method iterates through each page, calls its analyze_layout method,
    and returns a single ElementCollection containing all the detected layout
    regions from all pages.

    Args:
        *args: Positional arguments to pass to each page's analyze_layout method.
        **kwargs: Keyword arguments to pass to each page's analyze_layout method.
                  A 'show_progress' kwarg can be included to show a progress bar.

    Returns:
        An ElementCollection of all detected Region objects.
    """
    all_regions = []

    show_progress = kwargs.pop("show_progress", True)

    iterator = self.pages
    if show_progress:
        try:
            from tqdm.auto import tqdm

            iterator = tqdm(self.pages, desc="Analyzing layout")
        except ImportError:
            pass  # tqdm not installed

    for page in iterator:
        # Each page's analyze_layout method returns an ElementCollection
        regions_collection = page.analyze_layout(*args, **kwargs)
        if regions_collection:
            all_regions.extend(regions_collection.elements)

    return ElementCollection(all_regions)
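
A minimal usage sketch (the `engine="yolo"` value mirrors the example in `to_flow` above; adjust to whichever layout engine you have installed):

```python
# Run layout analysis on every page, with the default progress bar
regions = pdf.pages.analyze_layout(engine="yolo")

# The combined result is one ElementCollection of Region objects
for region in regions.elements:
    print(region)
```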
natural_pdf.PageCollection.apply_ocr(engine=None, languages=None, min_confidence=None, device=None, resolution=None, apply_exclusions=True, replace=True, options=None)

Applies OCR to all pages within this collection using batch processing.

This delegates the work to the parent PDF object's apply_ocr method.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `engine` | `Optional[str]` | Name of the OCR engine (e.g., 'easyocr', 'paddleocr'). | `None` |
| `languages` | `Optional[List[str]]` | List of language codes (e.g., ['en', 'fr'], ['en', 'ch']). Must be codes understood by the specific selected engine. No mapping is performed. | `None` |
| `min_confidence` | `Optional[float]` | Minimum confidence threshold for detected text (0.0 to 1.0). | `None` |
| `device` | `Optional[str]` | Device to run OCR on (e.g., 'cpu', 'cuda', 'mps'). | `None` |
| `resolution` | `Optional[int]` | DPI resolution to render page images before OCR (e.g., 150, 300). | `None` |
| `apply_exclusions` | `bool` | If True (default), render page images for OCR with excluded areas masked (whited out). If False, OCR the raw page images without masking exclusions. | `True` |
| `replace` | `bool` | If True (default), remove any existing OCR elements before adding new ones. If False, add new OCR elements to existing ones. | `True` |
| `options` | `Optional[Any]` | An engine-specific options object (e.g., EasyOCROptions) or dict. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `PageCollection[P]` | Self for method chaining. |

Raises:

| Type | Description |
| --- | --- |
| `RuntimeError` | If pages lack a parent PDF or parent lacks `apply_ocr`. |

Source code in natural_pdf/core/page_collection.py
def apply_ocr(
    self,
    engine: Optional[str] = None,
    # --- Common OCR Parameters (Direct Arguments) ---
    languages: Optional[List[str]] = None,
    min_confidence: Optional[float] = None,  # Min confidence threshold
    device: Optional[str] = None,
    resolution: Optional[int] = None,  # DPI for rendering
    apply_exclusions: bool = True,  # New parameter
    replace: bool = True,  # Whether to replace existing OCR elements
    # --- Engine-Specific Options ---
    options: Optional[Any] = None,  # e.g., EasyOCROptions(...)
) -> "PageCollection[P]":
    """
    Applies OCR to all pages within this collection using batch processing.

    This delegates the work to the parent PDF object's `apply_ocr` method.

    Args:
        engine: Name of the OCR engine (e.g., 'easyocr', 'paddleocr').
        languages: List of language codes (e.g., ['en', 'fr'], ['en', 'ch']).
                   **Must be codes understood by the specific selected engine.**
                   No mapping is performed.
        min_confidence: Minimum confidence threshold for detected text (0.0 to 1.0).
        device: Device to run OCR on (e.g., 'cpu', 'cuda', 'mps').
        resolution: DPI resolution to render page images before OCR (e.g., 150, 300).
        apply_exclusions: If True (default), render page images for OCR with
                          excluded areas masked (whited out). If False, OCR
                          the raw page images without masking exclusions.
        replace: If True (default), remove any existing OCR elements before
                adding new ones. If False, add new OCR elements to existing ones.
        options: An engine-specific options object (e.g., EasyOCROptions) or dict.

    Returns:
        Self for method chaining.

    Raises:
        RuntimeError: If pages lack a parent PDF or parent lacks `apply_ocr`.
        (Propagates exceptions from PDF.apply_ocr)
    """
    if not self.pages:
        logger.warning("Cannot apply OCR to an empty PageCollection.")
        return self

    # Assume all pages share the same parent PDF object
    first_page = self.pages[0]
    if not hasattr(first_page, "_parent") or not first_page._parent:
        raise RuntimeError("Pages in this collection do not have a parent PDF reference.")

    parent_pdf = first_page._parent

    if not hasattr(parent_pdf, "apply_ocr") or not callable(parent_pdf.apply_ocr):
        raise RuntimeError("Parent PDF object does not have the required 'apply_ocr' method.")

    # Get the 0-based indices of the pages in this collection
    page_indices = self._get_page_indices()

    logger.info(f"Applying OCR via parent PDF to page indices: {page_indices} in collection.")

    # Delegate the batch call to the parent PDF object, passing direct args and apply_exclusions
    parent_pdf.apply_ocr(
        pages=page_indices,
        engine=engine,
        languages=languages,
        min_confidence=min_confidence,  # Pass the renamed parameter
        device=device,
        resolution=resolution,
        apply_exclusions=apply_exclusions,  # Pass down
        replace=replace,  # Pass the replace parameter
        options=options,
    )
    # The PDF method modifies the Page objects directly by adding elements.

    return self  # Return self for chaining
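
A minimal usage sketch (engine and language values are the ones the parameter docs above use as examples):

```python
# Batch-OCR three pages via the parent PDF, then chain a search
# for the newly created OCR text elements
ocr_text = (
    pdf.pages[0:3]
    .apply_ocr(engine="easyocr", languages=["en"], min_confidence=0.5)
    .find_all("text[source=ocr]")
)
```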
natural_pdf.PageCollection.deskew(resolution=300, detection_resolution=72, force_overwrite=False, **deskew_kwargs)

Creates a new, in-memory PDF object containing deskewed versions of the pages in this collection.

This method delegates the actual processing to the parent PDF object's deskew method.

Important: The returned PDF is image-based. Any existing text, OCR results, annotations, or other elements from the original pages will not be carried over.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `resolution` | `int` | DPI resolution for rendering the output deskewed pages. | `300` |
| `detection_resolution` | `int` | DPI resolution used for skew detection if angles are not already cached on the page objects. | `72` |
| `force_overwrite` | `bool` | If False (default), raises a ValueError if any target page already contains processed elements (text, OCR, regions) to prevent accidental data loss. Set to True to proceed anyway. | `False` |
| `**deskew_kwargs` |  | Additional keyword arguments passed to `deskew.determine_skew` during automatic detection (e.g., `max_angle`, `num_peaks`). | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `PDF` | A new PDF object representing the deskewed document. |

Raises:

| Type | Description |
| --- | --- |
| `ImportError` | If 'deskew' or 'img2pdf' libraries are not installed (raised by PDF.deskew). |
| `ValueError` | If `force_overwrite` is False and target pages contain elements (raised by PDF.deskew), or if the collection is empty. |
| `RuntimeError` | If pages lack a parent PDF reference, or the parent PDF lacks the `deskew` method. |

Source code in natural_pdf/core/page_collection.py
def deskew(
    self,
    resolution: int = 300,
    detection_resolution: int = 72,
    force_overwrite: bool = False,
    **deskew_kwargs,
) -> "PDF":  # Changed return type
    """
    Creates a new, in-memory PDF object containing deskewed versions of the pages
    in this collection.

    This method delegates the actual processing to the parent PDF object's
    `deskew` method.

    Important: The returned PDF is image-based. Any existing text, OCR results,
    annotations, or other elements from the original pages will *not* be carried over.

    Args:
        resolution: DPI resolution for rendering the output deskewed pages.
        detection_resolution: DPI resolution used for skew detection if angles are not
                              already cached on the page objects.
        force_overwrite: If False (default), raises a ValueError if any target page
                         already contains processed elements (text, OCR, regions) to
                         prevent accidental data loss. Set to True to proceed anyway.
        **deskew_kwargs: Additional keyword arguments passed to `deskew.determine_skew`
                         during automatic detection (e.g., `max_angle`, `num_peaks`).

    Returns:
        A new PDF object representing the deskewed document.

    Raises:
        ImportError: If 'deskew' or 'img2pdf' libraries are not installed (raised by PDF.deskew).
        ValueError: If `force_overwrite` is False and target pages contain elements (raised by PDF.deskew),
                    or if the collection is empty.
        RuntimeError: If pages lack a parent PDF reference, or the parent PDF lacks the `deskew` method.
    """
    if not self.pages:
        logger.warning("Cannot deskew an empty PageCollection.")
        raise ValueError("Cannot deskew an empty PageCollection.")

    # Assume all pages share the same parent PDF object
    # Need to hint the type of _parent for type checkers
    if TYPE_CHECKING:
        parent_pdf: "natural_pdf.core.pdf.PDF" = self.pages[0]._parent
    else:
        parent_pdf = self.pages[0]._parent

    if not parent_pdf or not hasattr(parent_pdf, "deskew") or not callable(parent_pdf.deskew):
        raise RuntimeError(
            "Parent PDF reference not found or parent PDF lacks the required 'deskew' method."
        )

    # Get the 0-based indices of the pages in this collection
    page_indices = self._get_page_indices()
    logger.info(
        f"PageCollection: Delegating deskew to parent PDF for page indices: {page_indices}"
    )

    # Delegate the call to the parent PDF object for the relevant pages
    # Pass all relevant arguments through (no output_path anymore)
    return parent_pdf.deskew(
        pages=page_indices,
        resolution=resolution,
        detection_resolution=detection_resolution,
        force_overwrite=force_overwrite,
        **deskew_kwargs,
    )
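A short sketch of the delegation above, assuming the 'deskew' and 'img2pdf' extras are installed and using a hypothetical file name:

import natural_pdf as npdf

pdf = npdf.PDF("skewed_scan.pdf")  # hypothetical input file

# Returns a new, image-based PDF with straightened pages;
# force_overwrite=True bypasses the existing-elements safety check.
deskewed = pdf.pages.deskew(resolution=300, force_overwrite=True)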
natural_pdf.PageCollection.detect_checkboxes(*args, **kwargs)

Detects checkboxes on each page in the collection.

This method iterates through each page, calls its detect_checkboxes method, and returns a single ElementCollection containing all detected checkbox regions from all pages.

Parameters:

- *args: Positional arguments to pass to each page's detect_checkboxes method.
- **kwargs: Keyword arguments to pass to each page's detect_checkboxes method. A 'show_progress' kwarg can be included to show a progress bar.

Returns:

- ElementCollection[Region]: An ElementCollection of all detected checkbox Region objects.

Source code in natural_pdf/core/page_collection.py
def detect_checkboxes(self, *args, **kwargs) -> "ElementCollection[Region]":
    """
    Detects checkboxes on each page in the collection.

    This method iterates through each page, calls its detect_checkboxes method,
    and returns a single ElementCollection containing all detected checkbox
    regions from all pages.

    Args:
        *args: Positional arguments to pass to each page's detect_checkboxes method.
        **kwargs: Keyword arguments to pass to each page's detect_checkboxes method.
                  A 'show_progress' kwarg can be included to show a progress bar.

    Returns:
        An ElementCollection of all detected checkbox Region objects.
    """
    all_checkboxes = []

    show_progress = kwargs.pop("show_progress", True)

    iterator = self.pages
    if show_progress:
        try:
            from tqdm.auto import tqdm

            iterator = tqdm(self.pages, desc="Detecting checkboxes")
        except ImportError:
            pass  # tqdm not installed

    for page in iterator:
        # Each page's detect_checkboxes method returns an ElementCollection
        checkbox_collection = page.detect_checkboxes(*args, **kwargs)
        if checkbox_collection:
            all_checkboxes.extend(checkbox_collection.elements)

    return ElementCollection(all_checkboxes)
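A quick sketch of collecting checkboxes across a collection (assumes pdf is an open natural-pdf PDF, and that ElementCollection supports len()):

# Detect on every page; disable the per-page progress bar.
checkboxes = pdf.pages.detect_checkboxes(show_progress=False)
print(f"Found {len(checkboxes)} checkbox regions")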
natural_pdf.PageCollection.extract_text(keep_blank_chars=True, apply_exclusions=True, strip=None, **kwargs)

Extract text from all pages in the collection.

Parameters:

- keep_blank_chars (bool, default True): Whether to keep blank characters.
- apply_exclusions (bool, default True): Whether to apply exclusion regions.
- strip (Optional[bool], default None): Whether to strip whitespace from the extracted text.
- **kwargs: Additional extraction parameters.

Returns:

- str: Combined text from all pages.

Source code in natural_pdf/core/page_collection.py
def extract_text(
    self,
    keep_blank_chars: bool = True,
    apply_exclusions: bool = True,
    strip: Optional[bool] = None,
    **kwargs,
) -> str:
    """
    Extract text from all pages in the collection.

    Args:
        keep_blank_chars: Whether to keep blank characters (default: True)
        apply_exclusions: Whether to apply exclusion regions (default: True)
        strip: Whether to strip whitespace from the extracted text.
        **kwargs: Additional extraction parameters

    Returns:
        Combined text from all pages
    """
    texts = []
    for page in self.pages:
        text = page.extract_text(
            keep_blank_chars=keep_blank_chars,
            apply_exclusions=apply_exclusions,
            **kwargs,
        )
        texts.append(text)

    combined = "\n".join(texts)

    # Strip behaviour: honour an explicit caller choice; otherwise default to
    # stripping unless layout-preserving extraction was requested via kwargs.
    use_layout = kwargs.get("layout", False)
    strip_final = strip if strip is not None else (not use_layout)

    if strip_final:
        combined = "\n".join(line.rstrip() for line in combined.splitlines()).strip()

    return combined
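A one-line sketch (the page slice is arbitrary; slicing pdf.pages yields another PageCollection, as in the split() examples further down):

# Extract text from pages 3-5 only, stripping trailing whitespace.
text = pdf.pages[2:5].extract_text(strip=True)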
natural_pdf.PageCollection.find(selector=None, *, text=None, overlap='full', apply_exclusions=True, regex=False, case=True, **kwargs)
find(*, text: str, overlap: str = 'full', apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> Optional[T]
find(selector: str, *, overlap: str = 'full', apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> Optional[T]

Find the first element matching the selector OR text across all pages in the collection.

Provide EITHER selector OR text, but not both.

Parameters:

- selector (Optional[str], default None): CSS-like selector string.
- text (Optional[str], default None): Text content to search for (equivalent to 'text:contains(...)').
- overlap (str, default 'full'): How to determine if elements overlap: 'full' (fully inside), 'partial' (any overlap), or 'center' (center point inside).
- apply_exclusions (bool, default True): Whether to exclude elements in exclusion regions.
- regex (bool, default False): Whether to use regex for text search (selector or text).
- case (bool, default True): Whether to do case-sensitive text search (selector or text).
- **kwargs: Additional filter parameters.

Returns:

- Optional[T]: First matching element or None.

Source code in natural_pdf/core/page_collection.py
def find(
    self,
    selector: Optional[str] = None,
    *,
    text: Optional[str] = None,
    overlap: str = "full",
    apply_exclusions: bool = True,
    regex: bool = False,
    case: bool = True,
    **kwargs,
) -> Optional[T]:
    """
    Find the first element matching the selector OR text across all pages in the collection.

    Provide EITHER `selector` OR `text`, but not both.

    Args:
        selector: CSS-like selector string.
        text: Text content to search for (equivalent to 'text:contains(...)').
        overlap: How to determine if elements overlap: 'full' (fully inside),
                 'partial' (any overlap), or 'center' (center point inside).
                 (default: "full")
        apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
        regex: Whether to use regex for text search (`selector` or `text`) (default: False).
        case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
        **kwargs: Additional filter parameters.

    Returns:
        First matching element or None.
    """
    # Input validation happens within page.find
    for page in self.pages:
        element = page.find(
            selector=selector,
            text=text,
            overlap=overlap,
            apply_exclusions=apply_exclusions,
            regex=regex,
            case=case,
            **kwargs,
        )
        if element:
            return element
    return None
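A sketch of both calling forms (the selector and search string are placeholders):

# Pages are scanned in order; the first hit is returned.
heading = pdf.pages.find("text:bold")               # selector form
total = pdf.pages.find(text="Total", case=False)    # text form
if total is not None:
    print(total.extract_text())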
natural_pdf.PageCollection.find_all(selector=None, *, text=None, overlap='full', apply_exclusions=True, regex=False, case=True, **kwargs)
find_all(*, text: str, overlap: str = 'full', apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> ElementCollection
find_all(selector: str, *, overlap: str = 'full', apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> ElementCollection

Find all elements matching the selector OR text across all pages in the collection.

Provide EITHER selector OR text, but not both.

Parameters:

- selector (Optional[str], default None): CSS-like selector string.
- text (Optional[str], default None): Text content to search for (equivalent to 'text:contains(...)').
- overlap (str, default 'full'): How to determine if elements overlap: 'full' (fully inside), 'partial' (any overlap), or 'center' (center point inside).
- apply_exclusions (bool, default True): Whether to exclude elements in exclusion regions.
- regex (bool, default False): Whether to use regex for text search (selector or text).
- case (bool, default True): Whether to do case-sensitive text search (selector or text).
- **kwargs: Additional filter parameters.

Returns:

- ElementCollection: ElementCollection with matching elements from all pages.

Source code in natural_pdf/core/page_collection.py
def find_all(
    self,
    selector: Optional[str] = None,
    *,
    text: Optional[str] = None,
    overlap: str = "full",
    apply_exclusions: bool = True,
    regex: bool = False,
    case: bool = True,
    **kwargs,
) -> "ElementCollection":
    """
    Find all elements matching the selector OR text across all pages in the collection.

    Provide EITHER `selector` OR `text`, but not both.

    Args:
        selector: CSS-like selector string.
        text: Text content to search for (equivalent to 'text:contains(...)').
        overlap: How to determine if elements overlap: 'full' (fully inside),
                 'partial' (any overlap), or 'center' (center point inside).
                 (default: "full")
        apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
        regex: Whether to use regex for text search (`selector` or `text`) (default: False).
        case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
        **kwargs: Additional filter parameters.

    Returns:
        ElementCollection with matching elements from all pages.
    """
    all_elements = []
    # Input validation happens within page.find_all
    for page in self.pages:
        elements = page.find_all(
            selector=selector,
            text=text,
            overlap=overlap,
            apply_exclusions=apply_exclusions,
            regex=regex,
            case=case,
            **kwargs,
        )
        if elements:
            all_elements.extend(elements.elements)

    return ElementCollection(all_elements)
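A sketch of a cross-page query (the selector is a placeholder; iterating an ElementCollection is assumed to yield its elements):

# Every bold heading larger than 12pt, from every page in the collection.
headings = pdf.pages.find_all("text[size>12]:bold")
for el in headings:
    print(el.page.index, el.extract_text())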
natural_pdf.PageCollection.get_sections(start_elements=None, end_elements=None, new_section_on_page_break=False, include_boundaries='both', orientation='vertical')

Extract sections from a page collection based on start/end elements.

Parameters:

- start_elements (default None): Elements or selector string that mark the start of sections (optional).
- end_elements (default None): Elements or selector string that mark the end of sections (optional).
- new_section_on_page_break (default False): Whether to start a new section at page boundaries.
- include_boundaries (default 'both'): How to include boundary elements: 'start', 'end', 'both', or 'none'.
- orientation (default 'vertical'): 'vertical' or 'horizontal' - determines section direction.

Returns:

- ElementCollection[Region]: List of Region objects representing the extracted sections.

Note:

You can provide only start_elements, only end_elements, or both.
- With only start_elements: sections go from each start to the next start (or end of page)
- With only end_elements: sections go from beginning of document/page to each end
- With both: sections go from each start to the corresponding end

Source code in natural_pdf/core/page_collection.py
def get_sections(
    self,
    start_elements=None,
    end_elements=None,
    new_section_on_page_break=False,
    include_boundaries="both",
    orientation="vertical",
) -> "ElementCollection[Region]":
    """
    Extract sections from a page collection based on start/end elements.

    Args:
        start_elements: Elements or selector string that mark the start of sections (optional)
        end_elements: Elements or selector string that mark the end of sections (optional)
        new_section_on_page_break: Whether to start a new section at page boundaries (default: False)
        include_boundaries: How to include boundary elements: 'start', 'end', 'both', or 'none' (default: 'both')
        orientation: 'vertical' (default) or 'horizontal' - determines section direction

    Returns:
        List of Region objects representing the extracted sections

    Note:
        You can provide only start_elements, only end_elements, or both.
        - With only start_elements: sections go from each start to the next start (or end of page)
        - With only end_elements: sections go from beginning of document/page to each end
        - With both: sections go from each start to the corresponding end
    """
    # Find start and end elements across all pages
    if isinstance(start_elements, str):
        start_elements = self.find_all(start_elements).elements

    if isinstance(end_elements, str):
        end_elements = self.find_all(end_elements).elements

    # If no start elements and no end elements, return an empty collection
    if not start_elements and not end_elements:
        return ElementCollection([])

    # If there are page break boundaries, we'll need to add them
    if new_section_on_page_break:
        # For each page boundary, create virtual "end" and "start" elements
        for i in range(len(self.pages) - 1):
            # Add a virtual "end" element at the bottom of the current page
            page = self.pages[i]
            # If end_elements is None, initialize it as an empty list
            if end_elements is None:
                end_elements = []

            # Create a region at the bottom of the page as an artificial end marker
            from natural_pdf.elements.region import Region

            bottom_region = Region(page, (0, page.height - 1, page.width, page.height))
            bottom_region.is_page_boundary = True  # Mark it as a special boundary
            end_elements.append(bottom_region)

            # Add a virtual "start" element at the top of the next page
            next_page = self.pages[i + 1]
            top_region = Region(next_page, (0, 0, next_page.width, 1))
            top_region.is_page_boundary = True  # Mark it as a special boundary
            # If start_elements is None, initialize it as an empty list
            if start_elements is None:
                start_elements = []
            start_elements.append(top_region)

    # Get all elements from all pages and sort them in document order
    all_elements = []
    for page in self.pages:
        elements = page.get_elements()
        all_elements.extend(elements)

    # Sort by page index, then vertical position, then horizontal position
    all_elements.sort(key=lambda e: (e.page.index, e.top, e.x0))

    # If we only have end_elements (no start_elements), create implicit start elements
    if not start_elements and end_elements:
        from natural_pdf.elements.region import Region

        start_elements = []

        # Add implicit start at the beginning of the first page
        first_page = self.pages[0]
        first_start = Region(first_page, (0, 0, first_page.width, 1))
        first_start.is_implicit_start = True
        # Don't mark this as created from any end element, so it can pair with any end
        start_elements.append(first_start)

        # For each end element (except the last), add an implicit start after it
        # Sort by page, then top, then bottom (for elements with same top), then x0
        sorted_end_elements = sorted(
            end_elements, key=lambda e: (e.page.index, e.top, e.bottom, e.x0)
        )
        for i, end_elem in enumerate(sorted_end_elements[:-1]):  # Exclude last end element
            # Create implicit start element right after this end element
            implicit_start = Region(
                end_elem.page, (0, end_elem.bottom, end_elem.page.width, end_elem.bottom + 1)
            )
            implicit_start.is_implicit_start = True
            # Track which end element this implicit start was created from
            # to avoid pairing them together (which would create zero height)
            implicit_start.created_from_end = end_elem
            start_elements.append(implicit_start)

    # Mark section boundaries
    section_boundaries = []

    # Add start element boundaries
    for element in start_elements:
        if element in all_elements:
            idx = all_elements.index(element)
            section_boundaries.append(
                {
                    "index": idx,
                    "element": element,
                    "type": "start",
                    "page_idx": element.page.index,
                }
            )
        elif hasattr(element, "is_page_boundary") and element.is_page_boundary:
            # This is a virtual page boundary element
            section_boundaries.append(
                {
                    "index": -1,  # Special index for page boundaries
                    "element": element,
                    "type": "start",
                    "page_idx": element.page.index,
                }
            )
        elif hasattr(element, "is_implicit_start") and element.is_implicit_start:
            # This is an implicit start element
            section_boundaries.append(
                {
                    "index": -2,  # Special index for implicit starts
                    "element": element,
                    "type": "start",
                    "page_idx": element.page.index,
                }
            )

    # Add end element boundaries if provided
    if end_elements:
        for element in end_elements:
            if element in all_elements:
                idx = all_elements.index(element)
                section_boundaries.append(
                    {
                        "index": idx,
                        "element": element,
                        "type": "end",
                        "page_idx": element.page.index,
                    }
                )
            elif hasattr(element, "is_page_boundary") and element.is_page_boundary:
                # This is a virtual page boundary element
                section_boundaries.append(
                    {
                        "index": -1,  # Special index for page boundaries
                        "element": element,
                        "type": "end",
                        "page_idx": element.page.index,
                    }
                )

    # Sort boundaries by page index, then by actual document position
    def _sort_key(boundary):
        """Sort boundaries by (page_idx, position, priority)."""
        page_idx = boundary["page_idx"]
        element = boundary["element"]

        # Position on the page based on orientation
        if orientation == "vertical":
            pos = getattr(element, "top", 0.0)
        else:  # horizontal
            pos = getattr(element, "x0", 0.0)

        # Ensure starts come before ends at the same coordinate
        priority = 0 if boundary["type"] == "start" else 1

        return (page_idx, pos, priority)

    section_boundaries.sort(key=_sort_key)

    # Generate sections
    sections = []

    # --- Helper: build a FlowRegion spanning multiple pages ---
    def _build_flow_region(start_el, end_el, include_boundaries="both", orientation="vertical"):
        """Return a FlowRegion that covers from *start_el* to *end_el*.
        If *end_el* is None, the region continues to the bottom/right of the last
        page in this PageCollection.

        Args:
            start_el: Start element
            end_el: End element
            include_boundaries: How to include boundary elements: 'start', 'end', 'both', or 'none'
            orientation: 'vertical' or 'horizontal' - determines section direction
        """
        # Local imports to avoid top-level cycles
        from natural_pdf.elements.region import Region
        from natural_pdf.flows.element import FlowElement
        from natural_pdf.flows.flow import Flow
        from natural_pdf.flows.region import FlowRegion

        start_pg = start_el.page
        end_pg = end_el.page if end_el is not None else self.pages[-1]

        parts: list[Region] = []

        if orientation == "vertical":
            # Determine the start_top based on include_boundaries
            start_top = start_el.top
            if include_boundaries == "none" or include_boundaries == "end":
                # Exclude start boundary
                start_top = start_el.bottom if hasattr(start_el, "bottom") else start_el.top

            # Slice of first page beginning at *start_top*
            parts.append(Region(start_pg, (0, start_top, start_pg.width, start_pg.height)))
        else:  # horizontal
            # Determine the start_left based on include_boundaries
            start_left = start_el.x0
            if include_boundaries == "none" or include_boundaries == "end":
                # Exclude start boundary
                start_left = start_el.x1 if hasattr(start_el, "x1") else start_el.x0

            # Slice of first page beginning at *start_left*
            parts.append(Region(start_pg, (start_left, 0, start_pg.width, start_pg.height)))

        # Full middle pages
        for pg_idx in range(start_pg.index + 1, end_pg.index):
            mid_pg = self.pages[pg_idx]
            parts.append(Region(mid_pg, (0, 0, mid_pg.width, mid_pg.height)))

        # Slice of last page (if distinct)
        if end_pg is not start_pg:
            if orientation == "vertical":
                # Determine the bottom based on include_boundaries
                if end_el is not None:
                    if include_boundaries == "none" or include_boundaries == "start":
                        # Exclude end boundary
                        bottom = end_el.top if hasattr(end_el, "top") else end_el.bottom
                    else:
                        # Include end boundary
                        bottom = end_el.bottom
                else:
                    bottom = end_pg.height
                parts.append(Region(end_pg, (0, 0, end_pg.width, bottom)))
            else:  # horizontal
                # Determine the right based on include_boundaries
                if end_el is not None:
                    if include_boundaries == "none" or include_boundaries == "start":
                        # Exclude end boundary
                        right = end_el.x0 if hasattr(end_el, "x0") else end_el.x1
                    else:
                        # Include end boundary
                        right = end_el.x1
                else:
                    right = end_pg.width
                parts.append(Region(end_pg, (0, 0, right, end_pg.height)))

        flow = Flow(segments=parts, arrangement=orientation)
        src_fe = FlowElement(physical_object=start_el, flow=flow)
        return FlowRegion(
            flow=flow,
            constituent_regions=parts,
            source_flow_element=src_fe,
            boundary_element_found=end_el,
        )

    # ------------------------------------------------------------------

    current_start = None

    for i, boundary in enumerate(section_boundaries):
        # If it's a start boundary and we don't have a current start
        if boundary["type"] == "start" and current_start is None:
            current_start = boundary

        # If it's an end boundary and we have a current start
        elif boundary["type"] == "end" and current_start is not None:
            # Create a section from current_start to this boundary
            start_element = current_start["element"]
            end_element = boundary["element"]

            # Check if this is an implicit start created from this same end element
            # This would create a zero-height section, so skip this pairing
            if (
                hasattr(start_element, "is_implicit_start")
                and hasattr(start_element, "created_from_end")
                and start_element.created_from_end is end_element
            ):
                # Skip this pairing - keep current_start for next end element
                continue

            # If both elements are on the same page, use the page's get_section_between
            if start_element.page == end_element.page:
                # For implicit start elements, create a region from the top of the page
                if hasattr(start_element, "is_implicit_start"):
                    from natural_pdf.elements.region import Region

                    # Adjust boundaries based on include_boundaries parameter and orientation
                    if orientation == "vertical":
                        top = start_element.top
                        bottom = end_element.bottom

                        if include_boundaries == "none":
                            # Exclude both boundaries - move past them
                            top = (
                                start_element.bottom
                                if hasattr(start_element, "bottom")
                                else start_element.top
                            )
                            bottom = (
                                end_element.top
                                if hasattr(end_element, "top")
                                else end_element.bottom
                            )
                        elif include_boundaries == "start":
                            # Include start, exclude end
                            bottom = (
                                end_element.top
                                if hasattr(end_element, "top")
                                else end_element.bottom
                            )
                        elif include_boundaries == "end":
                            # Exclude start, include end
                            top = (
                                start_element.bottom
                                if hasattr(start_element, "bottom")
                                else start_element.top
                            )
                        # "both" is default - no adjustment needed

                        section = Region(
                            start_element.page,
                            (0, top, start_element.page.width, bottom),
                        )
                        section._boundary_exclusions = include_boundaries
                    else:  # horizontal
                        left = start_element.x0
                        right = end_element.x1

                        if include_boundaries == "none":
                            # Exclude both boundaries - move past them
                            left = (
                                start_element.x1
                                if hasattr(start_element, "x1")
                                else start_element.x0
                            )
                            right = (
                                end_element.x0 if hasattr(end_element, "x0") else end_element.x1
                            )
                        elif include_boundaries == "start":
                            # Include start, exclude end
                            right = (
                                end_element.x0 if hasattr(end_element, "x0") else end_element.x1
                            )
                        elif include_boundaries == "end":
                            # Exclude start, include end
                            left = (
                                start_element.x1
                                if hasattr(start_element, "x1")
                                else start_element.x0
                            )
                        # "both" is default - no adjustment needed

                        section = Region(
                            start_element.page,
                            (left, 0, right, start_element.page.height),
                        )
                        section._boundary_exclusions = include_boundaries
                    section.start_element = start_element
                    section.boundary_element_found = end_element
                else:
                    section = start_element.page.get_section_between(
                        start_element, end_element, include_boundaries, orientation
                    )
                sections.append(section)
            else:
                # Create FlowRegion spanning pages
                flow_region = _build_flow_region(
                    start_element, end_element, include_boundaries, orientation
                )
                sections.append(flow_region)

            current_start = None

        # If it's another start boundary and we have a current start (for splitting by starts only)
        elif boundary["type"] == "start" and current_start is not None and not end_elements:
            # Create a section from current_start to just before this boundary
            start_element = current_start["element"]

            # Create section from current start to just before this new start
            if start_element.page == boundary["element"].page:
                from natural_pdf.elements.region import Region

                next_start = boundary["element"]

                # Create section based on orientation
                if orientation == "vertical":
                    # Determine vertical bounds
                    if include_boundaries in ["start", "both"]:
                        top = start_element.top
                    else:
                        top = start_element.bottom

                    # The section ends just before the next start
                    bottom = next_start.top

                    # Create the section with full page width
                    if top < bottom:
                        section = Region(
                            start_element.page, (0, top, start_element.page.width, bottom)
                        )
                        section.start_element = start_element
                        section.end_element = (
                            next_start  # The next start is the end of this section
                        )
                        section._boundary_exclusions = include_boundaries
                        sections.append(section)
                else:  # horizontal
                    # Determine horizontal bounds
                    if include_boundaries in ["start", "both"]:
                        left = start_element.x0
                    else:
                        left = start_element.x1

                    # The section ends just before the next start
                    right = next_start.x0

                    # Create the section with full page height
                    if left < right:
                        section = Region(
                            start_element.page, (left, 0, right, start_element.page.height)
                        )
                        section.start_element = start_element
                        section.end_element = (
                            next_start  # The next start is the end of this section
                        )
                        section._boundary_exclusions = include_boundaries
                        sections.append(section)
            else:
                # Cross-page section - create from current_start to the end of its page
                from natural_pdf.elements.region import Region

                start_page = start_element.page

                # Handle implicit start elements and respect include_boundaries
                if orientation == "vertical":
                    if include_boundaries in ["none", "end"]:
                        # Exclude start boundary
                        start_top = (
                            start_element.bottom
                            if hasattr(start_element, "bottom")
                            else start_element.top
                        )
                    else:
                        # Include start boundary
                        start_top = start_element.top

                    region = Region(
                        start_page, (0, start_top, start_page.width, start_page.height)
                    )
                else:  # horizontal
                    if include_boundaries in ["none", "end"]:
                        # Exclude start boundary
                        start_left = (
                            start_element.x1
                            if hasattr(start_element, "x1")
                            else start_element.x0
                        )
                    else:
                        # Include start boundary
                        start_left = start_element.x0

                    region = Region(
                        start_page, (start_left, 0, start_page.width, start_page.height)
                    )
                region.start_element = start_element
                sections.append(region)

            current_start = boundary

    # Handle the last section if we have a current start
    if current_start is not None:
        start_element = current_start["element"]
        start_page = start_element.page

        if end_elements:
            # With end_elements, we need an explicit end - use the last element
            # on the last page of the collection
            last_page = self.pages[-1]
            last_page_elements = [e for e in all_elements if e.page == last_page]
            if orientation == "vertical":
                last_page_elements.sort(key=lambda e: (e.top, e.x0))
            else:  # horizontal
                last_page_elements.sort(key=lambda e: (e.x0, e.top))
            end_element = last_page_elements[-1] if last_page_elements else None

            # Create FlowRegion spanning multiple pages using helper
            flow_region = _build_flow_region(
                start_element, end_element, include_boundaries, orientation
            )
            sections.append(flow_region)
        else:
            # With start_elements only, create a section to the end of the current page
            from natural_pdf.elements.region import Region

            # Handle implicit start elements and respect include_boundaries
            if orientation == "vertical":
                if include_boundaries in ["none", "end"]:
                    # Exclude start boundary
                    start_top = (
                        start_element.bottom
                        if hasattr(start_element, "bottom")
                        else start_element.top
                    )
                else:
                    # Include start boundary
                    start_top = start_element.top

                region = Region(start_page, (0, start_top, start_page.width, start_page.height))
            else:  # horizontal
                if include_boundaries in ["none", "end"]:
                    # Exclude start boundary
                    start_left = (
                        start_element.x1 if hasattr(start_element, "x1") else start_element.x0
                    )
                else:
                    # Include start boundary
                    start_left = start_element.x0

                region = Region(
                    start_page, (start_left, 0, start_page.width, start_page.height)
                )
            region.start_element = start_element
            sections.append(region)

    return ElementCollection(sections)
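A sketch of the common "split on headings" case (the selector is a placeholder; same-page sections come back as Regions and cross-page ones as FlowRegions, both of which are assumed to expose extract_text() as elsewhere in natural-pdf):

sections = pdf.pages.get_sections(
    start_elements="text[size>14]:bold",
    include_boundaries="start",
)
for section in sections:
    print(section.extract_text()[:80])  # first 80 chars of each section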
natural_pdf.PageCollection.groupby(by, *, show_progress=True)

Group pages by selector text or callable result.

Parameters:

- by (Union[str, Callable]): CSS selector string or callable function. Required.
- show_progress (bool, default True): Whether to show progress bar during computation.

Returns:

- PageGroupBy: PageGroupBy object supporting iteration and dict-like access.

Examples:

# Group by header text
for title, pages in pdf.pages.groupby('text[size=16]'):
    print(f"Section: {title}")

# Group by callable
for city, pages in pdf.pages.groupby(lambda p: p.find('text:contains("CITY")').extract_text()):
    process_city_pages(pages)

# Quick exploration with indexing
grouped = pdf.pages.groupby('text[size=16]')
grouped.info()                    # Show all groups
first_section = grouped[0]        # First group
last_section = grouped[-1]        # Last group

# Dict-like access by name
madison_pages = grouped.get('CITY OF MADISON')
madison_pages = grouped['CITY OF MADISON']  # Alternative

# Disable progress bar for small collections
grouped = pdf.pages.groupby('text[size=16]', show_progress=False)

Source code in natural_pdf/core/page_collection.py
def groupby(self, by: Union[str, Callable], *, show_progress: bool = True) -> "PageGroupBy":
    """
    Group pages by selector text or callable result.

    Args:
        by: CSS selector string or callable function
        show_progress: Whether to show progress bar during computation (default: True)

    Returns:
        PageGroupBy object supporting iteration and dict-like access

    Examples:
        # Group by header text
        for title, pages in pdf.pages.groupby('text[size=16]'):
            print(f"Section: {title}")

        # Group by callable
        for city, pages in pdf.pages.groupby(lambda p: p.find('text:contains("CITY")').extract_text()):
            process_city_pages(pages)

        # Quick exploration with indexing
        grouped = pdf.pages.groupby('text[size=16]')
        grouped.info()                    # Show all groups
        first_section = grouped[0]        # First group
        last_section = grouped[-1]       # Last group

        # Dict-like access by name
        madison_pages = grouped.get('CITY OF MADISON')
        madison_pages = grouped['CITY OF MADISON']  # Alternative

        # Disable progress bar for small collections
        grouped = pdf.pages.groupby('text[size=16]', show_progress=False)
    """
    from natural_pdf.core.page_groupby import PageGroupBy

    return PageGroupBy(self, by, show_progress=show_progress)
natural_pdf.PageCollection.highlights(show=False)

Create a highlight context for accumulating highlights.

This allows for clean syntax to show multiple highlight groups:

Example:

with pages.highlights() as h:
    h.add(pages.find_all('table'), label='tables', color='blue')
    h.add(pages.find_all('text:bold'), label='bold text', color='red')
    h.show()

Or with automatic display:

with pages.highlights(show=True) as h:
    h.add(pages.find_all('table'), label='tables')
    h.add(pages.find_all('text:bold'), label='bold')
    # Automatically shows when exiting the context

Parameters:

- show (bool, default False): If True, automatically show highlights when exiting context.

Returns:

- HighlightContext: HighlightContext for accumulating highlights.

Source code in natural_pdf/core/page_collection.py
def highlights(self, show: bool = False) -> "HighlightContext":
    """
    Create a highlight context for accumulating highlights.

    This allows for clean syntax to show multiple highlight groups:

    Example:
        with pages.highlights() as h:
            h.add(pages.find_all('table'), label='tables', color='blue')
            h.add(pages.find_all('text:bold'), label='bold text', color='red')
            h.show()

    Or with automatic display:
        with pages.highlights(show=True) as h:
            h.add(pages.find_all('table'), label='tables')
            h.add(pages.find_all('text:bold'), label='bold')
            # Automatically shows when exiting the context

    Args:
        show: If True, automatically show highlights when exiting context

    Returns:
        HighlightContext for accumulating highlights
    """
    from natural_pdf.core.highlighting_service import HighlightContext

    return HighlightContext(self, show_on_exit=show)
natural_pdf.PageCollection.save_pdf(output_path, ocr=False, original=False, dpi=300)

Saves the pages in this collection to a new PDF file.

Choose one saving mode:

- ocr=True: Creates a new, image-based PDF using OCR results. This makes the text generated during the natural-pdf session searchable, but loses original vector content. Requires 'ocr-export' extras.
- original=True: Extracts the original pages from the source PDF, preserving all vector content, fonts, and annotations. OCR results from the natural-pdf session are NOT included. Requires 'ocr-export' extras.

Parameters:

- output_path (Union[str, Path]): Path to save the new PDF file. Required.
- ocr (bool, default False): If True, save as a searchable, image-based PDF using OCR data.
- original (bool, default False): If True, save the original, vector-based pages.
- dpi (int, default 300): Resolution (dots per inch) used only when ocr=True for rendering page images and aligning the text layer.

Raises:

- ValueError: If the collection is empty, if neither or both 'ocr' and 'original' are True, or if original=True and pages originate from different PDFs.
- ImportError: If required libraries ('pikepdf', 'Pillow') are not installed for the chosen mode.
- RuntimeError: If an unexpected error occurs during saving.

Source code in natural_pdf/core/page_collection.py
def save_pdf(
    self,
    output_path: Union[str, Path],
    ocr: bool = False,
    original: bool = False,
    dpi: int = 300,
):
    """
    Saves the pages in this collection to a new PDF file.

    Choose one saving mode:
    - `ocr=True`: Creates a new, image-based PDF using OCR results. This
      makes the text generated during the natural-pdf session searchable,
      but loses original vector content. Requires 'ocr-export' extras.
    - `original=True`: Extracts the original pages from the source PDF,
      preserving all vector content, fonts, and annotations. OCR results
      from the natural-pdf session are NOT included. Requires 'ocr-export' extras.

    Args:
        output_path: Path to save the new PDF file.
        ocr: If True, save as a searchable, image-based PDF using OCR data.
        original: If True, save the original, vector-based pages.
        dpi: Resolution (dots per inch) used only when ocr=True for
             rendering page images and aligning the text layer.

    Raises:
        ValueError: If the collection is empty, if neither or both 'ocr'
                    and 'original' are True, or if 'original=True' and
                    pages originate from different PDFs.
        ImportError: If required libraries ('pikepdf', 'Pillow')
                     are not installed for the chosen mode.
        RuntimeError: If an unexpected error occurs during saving.
    """
    if not self.pages:
        raise ValueError("Cannot save an empty PageCollection.")

    if not (ocr ^ original):  # XOR: exactly one must be true
        raise ValueError("Exactly one of 'ocr' or 'original' must be True.")

    output_path_obj = Path(output_path)
    output_path_str = str(output_path_obj)

    if ocr:
        if create_searchable_pdf is None:
            raise ImportError(
                "Saving with ocr=True requires 'pikepdf' and 'Pillow'. "
                'Install with: pip install "natural-pdf[ocr-export]"'
            )

        # Check for non-OCR vector elements (provide a warning)
        has_vector_elements = False
        for page in self.pages:
            # Simplified check for common vector types or non-OCR chars/words
            if (
                hasattr(page, "rects")
                and page.rects
                or hasattr(page, "lines")
                and page.lines
                or hasattr(page, "curves")
                and page.curves
                or (
                    hasattr(page, "chars")
                    and any(getattr(el, "source", None) != "ocr" for el in page.chars)
                )
                or (
                    hasattr(page, "words")
                    and any(getattr(el, "source", None) != "ocr" for el in page.words)
                )
            ):
                has_vector_elements = True
                break
        if has_vector_elements:
            logger.warning(
                "Warning: Saving with ocr=True creates an image-based PDF. "
                "Original vector elements (rects, lines, non-OCR text/chars) "
                "on selected pages will not be preserved in the output file."
            )

        logger.info(f"Saving searchable PDF (OCR text layer) to: {output_path_str}")
        try:
            # Delegate to the searchable PDF exporter function
            # Pass `self` (the PageCollection instance) as the source
            create_searchable_pdf(self, output_path_str, dpi=dpi)
            # Success log is now inside create_searchable_pdf if needed, or keep here
            # logger.info(f"Successfully saved searchable PDF to: {output_path_str}")
        except Exception as e:
            logger.error(f"Failed to create searchable PDF: {e}", exc_info=True)
            # Re-raise as RuntimeError for consistency, potentially handled in exporter too
            raise RuntimeError(f"Failed to create searchable PDF: {e}") from e

    elif original:
        # ---> MODIFIED: Call the new exporter
        if create_original_pdf is None:
            raise ImportError(
                "Saving with original=True requires 'pikepdf'. "
                'Install with: pip install "natural-pdf[ocr-export]"'
            )

        # Check for OCR elements (provide a warning) - keep this check here
        has_ocr_elements = False
        for page in self.pages:
            # Use find_all which returns a collection; check if it's non-empty
            if hasattr(page, "find_all"):
                ocr_text_elements = page.find_all("text[source=ocr]")
                if ocr_text_elements:  # Check truthiness of collection
                    has_ocr_elements = True
                    break
            elif hasattr(page, "words"):  # Fallback check if find_all isn't present?
                if any(getattr(el, "source", None) == "ocr" for el in page.words):
                    has_ocr_elements = True
                    break

        if has_ocr_elements:
            logger.warning(
                "Warning: Saving with original=True preserves original page content. "
                "OCR text generated in this session will not be included in the saved file."
            )

        logger.info(f"Saving original pages PDF to: {output_path_str}")
        try:
            # Delegate to the original PDF exporter function
            # Pass `self` (the PageCollection instance) as the source
            create_original_pdf(self, output_path_str)
            # Success log is now inside create_original_pdf
            # logger.info(f"Successfully saved original pages PDF to: {output_path_str}")
        except Exception as e:
            # Error logging is handled within create_original_pdf
            # Re-raise the exception caught from the exporter
            raise e  # Keep the original exception type (ValueError, RuntimeError, etc.)
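A sketch of the two mutually exclusive modes (the output file names are placeholders, and the 'ocr-export' extras are assumed to be installed):

# Mode 1: bake this session's OCR text into a searchable, image-based PDF.
pdf.pages.apply_ocr(engine="easyocr")
pdf.pages.save_pdf("searchable.pdf", ocr=True, dpi=300)

# Mode 2: copy the original vector pages through unchanged (no session OCR).
pdf.pages[0:3].save_pdf("excerpt.pdf", original=True)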
natural_pdf.PageCollection.split(divider, **kwargs)

Divide this page collection into sections based on the provided divider elements.

Parameters:

- divider: Elements or selector string that mark section boundaries. Required.
- **kwargs: Additional parameters passed to get_sections():
    - include_boundaries: How to include boundary elements (default: 'start')
    - orientation: 'vertical' or 'horizontal' (default: 'vertical')
    - new_section_on_page_break: Whether to split at page boundaries (default: False)

Returns:

- ElementCollection[Region]: ElementCollection of Region objects representing the sections.

Example:

# Split a PDF by chapter titles
chapters = pdf.pages.split("text[size>20]:contains('CHAPTER')")

# Split by page breaks
page_sections = pdf.pages.split(None, new_section_on_page_break=True)

# Split multi-page document by section headers
sections = pdf.pages[10:20].split("text:bold:contains('Section')")

Source code in natural_pdf/core/page_collection.py
def split(self, divider, **kwargs) -> "ElementCollection[Region]":
    """
    Divide this page collection into sections based on the provided divider elements.

    Args:
        divider: Elements or selector string that mark section boundaries
        **kwargs: Additional parameters passed to get_sections()
            - include_boundaries: How to include boundary elements (default: 'start')
            - orientation: 'vertical' or 'horizontal' (default: 'vertical')
            - new_section_on_page_break: Whether to split at page boundaries (default: False)

    Returns:
        ElementCollection of Region objects representing the sections

    Example:
        # Split a PDF by chapter titles
        chapters = pdf.pages.split("text[size>20]:contains('CHAPTER')")

        # Split by page breaks
        page_sections = pdf.pages.split(None, new_section_on_page_break=True)

        # Split multi-page document by section headers
        sections = pdf.pages[10:20].split("text:bold:contains('Section')")
    """
    # Default to 'start' boundaries for split (include divider at start of each section)
    if "include_boundaries" not in kwargs:
        kwargs["include_boundaries"] = "start"

    sections = self.get_sections(start_elements=divider, **kwargs)

    # Add initial section if there's content before the first divider
    if sections and divider is not None:
        # Get all elements across all pages
        all_elements = []
        for page in self.pages:
            all_elements.extend(page.get_elements())

        if all_elements:
            # Find first divider
            if isinstance(divider, str):
                # Search for first matching element
                first_divider = None
                for page in self.pages:
                    match = page.find(divider)
                    if match:
                        first_divider = match
                        break
            else:
                # divider is already elements
                first_divider = divider[0] if hasattr(divider, "__getitem__") else divider

            if first_divider and all_elements[0] != first_divider:
                # There's content before the first divider
                # Get section from start to first divider
                initial_sections = self.get_sections(
                    start_elements=None,
                    end_elements=[first_divider],
                    include_boundaries="none",
                    orientation=kwargs.get("orientation", "vertical"),
                )
                if initial_sections:
                    sections = ElementCollection([initial_sections[0]] + list(sections))

    return sections
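As the listing above shows, split() prepends an extra section when content precedes the first divider. A short sketch of the expected result, with an illustrative selector:

```python
chapters = pdf.pages.split("text[size>20]:contains('CHAPTER')")

# If a title page precedes the first 'CHAPTER' heading, it is expected
# to arrive as chapters[0]; the first heading then opens chapters[1].
front_matter = chapters[0]
```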
natural_pdf.PageCollection.to_flow(arrangement='vertical', alignment='start', segment_gap=0.0)

Convert this PageCollection to a Flow for cross-page operations.

This enables treating multiple pages as a continuous logical document structure, useful for multi-page tables, articles spanning columns, or any content requiring reading order across page boundaries.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| arrangement | Literal['vertical', 'horizontal'] | Primary flow direction. 'vertical' stacks pages top-to-bottom (most common); 'horizontal' arranges pages left-to-right. | 'vertical' |
| alignment | Literal['start', 'center', 'end', 'top', 'left', 'bottom', 'right'] | Cross-axis alignment for pages of different sizes. For vertical flows: 'left'/'start', 'center', 'right'/'end'. For horizontal flows: 'top'/'start', 'center', 'bottom'/'end'. | 'start' |
| segment_gap | float | Virtual gap between pages in PDF points. | 0.0 |

Returns:

| Type | Description |
| --- | --- |
| Flow | Flow object that can perform operations across all pages in sequence. |

Example

Multi-page table extraction:

```python
pdf = npdf.PDF("multi_page_report.pdf")

# Create flow for pages 2-4 containing a table
table_flow = pdf.pages[1:4].to_flow()

# Extract table as if it were continuous
table_data = table_flow.extract_table()
df = table_data.df
```

Cross-page element search:

```python
# Find all headers across multiple pages
headers = pdf.pages[5:10].to_flow().find_all('text[size>12]:bold')

# Analyze layout across pages
regions = pdf.pages.to_flow().analyze_layout(engine='yolo')
```
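The examples above use the vertical default. A sketch of a horizontal arrangement for a two-page spread; the gap value is illustrative:

```python
# Arrange two facing pages left-to-right, top-aligned, with a small
# virtual gap so elements of the second page don't abut the first.
spread_flow = pdf.pages[0:2].to_flow(
    arrangement="horizontal",
    alignment="top",
    segment_gap=12.0,  # PDF points; illustrative value
)

bold_text = spread_flow.find_all("text:bold")
```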

Source code in natural_pdf/core/page_collection.py
def to_flow(
    self,
    arrangement: Literal["vertical", "horizontal"] = "vertical",
    alignment: Literal["start", "center", "end", "top", "left", "bottom", "right"] = "start",
    segment_gap: float = 0.0,
) -> "Flow":
    """
    Convert this PageCollection to a Flow for cross-page operations.

    This enables treating multiple pages as a continuous logical document
    structure, useful for multi-page tables, articles spanning columns,
    or any content requiring reading order across page boundaries.

    Args:
        arrangement: Primary flow direction ('vertical' or 'horizontal').
                    'vertical' stacks pages top-to-bottom (most common).
                    'horizontal' arranges pages left-to-right.
        alignment: Cross-axis alignment for pages of different sizes:
                  For vertical: 'left'/'start', 'center', 'right'/'end'
                  For horizontal: 'top'/'start', 'center', 'bottom'/'end'
        segment_gap: Virtual gap between pages in PDF points (default: 0.0).

    Returns:
        Flow object that can perform operations across all pages in sequence.

    Example:
        Multi-page table extraction:
        ```python
        pdf = npdf.PDF("multi_page_report.pdf")

        # Create flow for pages 2-4 containing a table
        table_flow = pdf.pages[1:4].to_flow()

        # Extract table as if it were continuous
        table_data = table_flow.extract_table()
        df = table_data.df
        ```

        Cross-page element search:
        ```python
        # Find all headers across multiple pages
        headers = pdf.pages[5:10].to_flow().find_all('text[size>12]:bold')

        # Analyze layout across pages
        regions = pdf.pages.to_flow().analyze_layout(engine='yolo')
        ```
    """
    from natural_pdf.flows.flow import Flow

    return Flow(
        segments=self,  # Flow constructor now handles PageCollection
        arrangement=arrangement,
        alignment=alignment,
        segment_gap=segment_gap,
    )
natural_pdf.PageCollection.update_text(transform, selector='text', max_workers=None)

Applies corrections to text elements across all pages in this collection using a user-provided callback function, executed in parallel if max_workers is specified.

This method delegates to the parent PDF's update_text method, targeting all pages within this collection.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| transform | Callable[[Any], Optional[str]] | A function that accepts a single element object and returns Optional[str] (the new text, or None to leave the element unchanged). | required |
| selector | str | Selector string identifying which elements to update. | 'text' |
| max_workers | Optional[int] | Maximum number of worker threads to use for parallel correction on each page. If None, defaults are used. | None |

Returns:

| Type | Description |
| --- | --- |
| PageCollection[P] | Self for method chaining. |

Raises:

| Type | Description |
| --- | --- |
| RuntimeError | If the collection is empty, pages lack a parent PDF reference, or the parent PDF lacks the update_text method. |
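A minimal sketch of a correction callback, assuming text elements expose their content via a .text attribute; the substitution itself is illustrative:

```python
# Normalize a common OCR confusion: pipe characters misread for 'I'.
# Returning None leaves the element unchanged.
def fix_pipes(element):
    text = element.text or ""
    return text.replace("|", "I") if "|" in text else None

pdf.pages.update_text(fix_pipes, max_workers=4)
```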

Source code in natural_pdf/core/page_collection.py
def update_text(
    self,
    transform: Callable[[Any], Optional[str]],
    selector: str = "text",
    max_workers: Optional[int] = None,
) -> "PageCollection[P]":
    """
    Applies corrections to text elements across all pages
    in this collection using a user-provided callback function, executed
    in parallel if `max_workers` is specified.

    This method delegates to the parent PDF's `update_text` method,
    targeting all pages within this collection.

    Args:
        transform: A function that accepts a single argument (an element
                   object) and returns `Optional[str]` (new text or None).
        selector: Selector string identifying which elements to update. Default is 'text'.
        max_workers: The maximum number of worker threads to use for parallel
                     correction on each page. If None, defaults are used.

    Returns:
        Self for method chaining.

    Raises:
        RuntimeError: If the collection is empty, pages lack a parent PDF reference,
                      or the parent PDF lacks the `update_text` method.
    """
    if not self.pages:
        logger.warning("Cannot update text for an empty PageCollection.")
        # Return self even if empty to maintain chaining consistency
        return self

    # Assume all pages share the same parent PDF object
    parent_pdf = self.pages[0]._parent
    if (
        not parent_pdf
        or not hasattr(parent_pdf, "update_text")
        or not callable(parent_pdf.update_text)
    ):
        raise RuntimeError(
            "Parent PDF reference not found or parent PDF lacks the required 'update_text' method."
        )

    page_indices = self._get_page_indices()
    logger.info(
        f"PageCollection: Delegating text update to parent PDF for page indices: {page_indices} with max_workers={max_workers} and selector='{selector}'."
    )

    # Delegate the call to the parent PDF object for the relevant pages
    # Pass the max_workers parameter down
    parent_pdf.update_text(
        transform=transform,
        pages=page_indices,
        selector=selector,
        max_workers=max_workers,
    )

    return self
natural_pdf.Region

Bases: TextMixin, DirectionalMixin, ClassificationMixin, ExtractionMixin, ShapeDetectionMixin, CheckboxDetectionMixin, DescribeMixin, VisualSearchMixin, Visualizable

Represents a rectangular region on a page.

Regions are fundamental building blocks in natural-pdf that define rectangular areas of a page for analysis, extraction, and navigation. They can be created manually or automatically through spatial navigation methods like .below(), .above(), .left(), and .right() from elements or other regions.

Regions integrate multiple analysis capabilities through mixins and provide:

- Element filtering and collection within the region boundary
- OCR processing for the region area
- Table detection and extraction
- AI-powered classification and structured data extraction
- Visual rendering and debugging capabilities
- Text extraction with spatial awareness

The Region class supports both rectangular and polygonal boundaries, making it suitable for complex document layouts and irregular shapes detected by layout analysis algorithms.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| page | Page | Reference to the parent Page object. |
| bbox | Tuple[float, float, float, float] | Bounding box tuple (x0, top, x1, bottom) in PDF coordinates. |
| x0 | float | Left x-coordinate. |
| top | float | Top y-coordinate (minimum y). |
| x1 | float | Right x-coordinate. |
| bottom | float | Bottom y-coordinate (maximum y). |
| width | float | Region width (x1 - x0). |
| height | float | Region height (bottom - top). |
| polygon | List[Tuple[float, float]] | List of coordinate points for non-rectangular regions. |
| label | | Optional descriptive label for the region. |
| metadata | Dict[str, Any] | Dictionary for storing analysis results and custom data. |

Example

Creating regions:

```python
pdf = npdf.PDF("document.pdf")
page = pdf.pages[0]

# Manual region creation
header_region = page.region(0, 0, page.width, 100)

# Spatial navigation from elements
summary_text = page.find('text:contains("Summary")')
content_region = summary_text.below(until='text[size>12]:bold')

# Extract content from region
tables = content_region.extract_table()
text = content_region.get_text()
```

Advanced usage:

```python
# OCR processing
region.apply_ocr(engine='easyocr', resolution=300)

# AI-powered extraction
data = region.extract_structured_data(MySchema)

# Visual debugging
region.show(highlights=['tables', 'text'])
```
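A sketch of chaining the directional helpers documented in the listing below; the selectors and sizes are illustrative:

```python
# Capture the block between a "Summary" heading and the next bold line,
# then look at the full-width band directly above that block.
summary = page.find('text:contains("Summary")')
body = summary.below(until="text:bold", include_endpoint=False)
band_above = body.above(height=40, width="full")

print(body.get_text())
```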

Source code in natural_pdf/elements/region.py
class Region(
    TextMixin,
    DirectionalMixin,
    ClassificationMixin,
    ExtractionMixin,
    ShapeDetectionMixin,
    CheckboxDetectionMixin,
    DescribeMixin,
    VisualSearchMixin,
    Visualizable,
):
    """Represents a rectangular region on a page.

    Regions are fundamental building blocks in natural-pdf that define rectangular
    areas of a page for analysis, extraction, and navigation. They can be created
    manually or automatically through spatial navigation methods like .below(), .above(),
    .left(), and .right() from elements or other regions.

    Regions integrate multiple analysis capabilities through mixins and provide:
    - Element filtering and collection within the region boundary
    - OCR processing for the region area
    - Table detection and extraction
    - AI-powered classification and structured data extraction
    - Visual rendering and debugging capabilities
    - Text extraction with spatial awareness

    The Region class supports both rectangular and polygonal boundaries, making it
    suitable for complex document layouts and irregular shapes detected by layout
    analysis algorithms.

    Attributes:
        page: Reference to the parent Page object.
        bbox: Bounding box tuple (x0, top, x1, bottom) in PDF coordinates.
        x0: Left x-coordinate.
        top: Top y-coordinate (minimum y).
        x1: Right x-coordinate.
        bottom: Bottom y-coordinate (maximum y).
        width: Region width (x1 - x0).
        height: Region height (bottom - top).
        polygon: List of coordinate points for non-rectangular regions.
        label: Optional descriptive label for the region.
        metadata: Dictionary for storing analysis results and custom data.

    Example:
        Creating regions:
        ```python
        pdf = npdf.PDF("document.pdf")
        page = pdf.pages[0]

        # Manual region creation
        header_region = page.region(0, 0, page.width, 100)

        # Spatial navigation from elements
        summary_text = page.find('text:contains("Summary")')
        content_region = summary_text.below(until='text[size>12]:bold')

        # Extract content from region
        tables = content_region.extract_table()
        text = content_region.get_text()
        ```

        Advanced usage:
        ```python
        # OCR processing
        region.apply_ocr(engine='easyocr', resolution=300)

        # AI-powered extraction
        data = region.extract_structured_data(MySchema)

        # Visual debugging
        region.show(highlights=['tables', 'text'])
        ```
    """

    def __init__(
        self,
        page: "Page",
        bbox: Tuple[float, float, float, float],
        polygon: List[Tuple[float, float]] = None,
        parent=None,
        label: Optional[str] = None,
    ):
        """Initialize a region.

        Creates a Region object that represents a rectangular or polygonal area on a page.
        Regions are used for spatial navigation, content extraction, and analysis operations.

        Args:
            page: Parent Page object that contains this region and provides access
                to document elements and analysis capabilities.
            bbox: Bounding box coordinates as (x0, top, x1, bottom) tuple in PDF
                coordinate system (points, with the origin at the top-left; top is the minimum y).
            polygon: Optional list of coordinate points [(x1,y1), (x2,y2), ...] for
                non-rectangular regions. If provided, the region will use polygon-based
                intersection calculations instead of simple rectangle overlap.
            parent: Optional parent region for hierarchical document structure.
                Useful for maintaining tree-like relationships between regions.
            label: Optional descriptive label for the region, useful for debugging
                and identification in complex workflows.

        Example:
            ```python
            pdf = npdf.PDF("document.pdf")
            page = pdf.pages[0]

            # Rectangular region
            header = Region(page, (0, 0, page.width, 100), label="header")

            # Polygonal region (from layout detection)
            table_polygon = [(50, 100), (300, 100), (300, 400), (50, 400)]
            table_region = Region(page, (50, 100, 300, 400),
                                polygon=table_polygon, label="table")
            ```

        Note:
            Regions are typically created through page methods like page.region() or
            spatial navigation methods like element.below(). Direct instantiation is
            used mainly for advanced workflows or layout analysis integration.
        """
        self._page = page
        self._bbox = bbox
        self._polygon = polygon

        self.metadata: Dict[str, Any] = {}
        # Analysis results live under self.metadata['analysis'] via property

        # Standard attributes for all elements
        self.object_type = "region"  # For selector compatibility

        # Layout detection attributes
        self.region_type = None
        self.normalized_type = None
        self.confidence = None
        self.model = None

        # Region management attributes
        self.name = None
        self.label = label
        self.source = None  # Will be set by creation methods

        # Hierarchy support for nested document structure
        self.parent_region = parent
        self.child_regions = []
        self.text_content = None  # Direct text content (e.g., from Docling)
        self.associated_text_elements = []  # Native text elements that overlap with this region

    def _get_render_specs(
        self,
        mode: Literal["show", "render"] = "show",
        color: Optional[Union[str, Tuple[int, int, int]]] = None,
        highlights: Optional[Union[List[Dict[str, Any]], bool]] = None,
        crop: Union[
            bool, int, str, "Region", Literal["wide"]
        ] = True,  # Default to True for regions
        crop_bbox: Optional[Tuple[float, float, float, float]] = None,
        **kwargs,
    ) -> List[RenderSpec]:
        """Get render specifications for this region.

        Args:
            mode: Rendering mode - 'show' includes highlights, 'render' is clean
            color: Color for highlighting this region in show mode
            highlights: Additional highlight groups to show, or False to disable all highlights
            crop: Cropping mode:
                - False: No cropping
                - True: Crop to region bounds (default for regions)
                - int: Padding in pixels around region
                - 'wide': Full page width, cropped vertically to region
                - Region: Crop to the bounds of another region
            crop_bbox: Explicit crop bounds (overrides region bounds)
            **kwargs: Additional parameters

        Returns:
            List containing a single RenderSpec for this region's page
        """
        from typing import Literal

        spec = RenderSpec(page=self.page)

        # Handle cropping
        if crop_bbox:
            spec.crop_bbox = crop_bbox
        elif crop:
            x0, y0, x1, y1 = self.bbox

            if crop is True:
                # Crop to region bounds
                spec.crop_bbox = self.bbox
            elif isinstance(crop, (int, float)):
                # Add padding around region
                padding = float(crop)
                spec.crop_bbox = (
                    max(0, x0 - padding),
                    max(0, y0 - padding),
                    min(self.page.width, x1 + padding),
                    min(self.page.height, y1 + padding),
                )
            elif crop == "wide":
                # Full page width, cropped vertically to region
                spec.crop_bbox = (0, y0, self.page.width, y1)
            elif hasattr(crop, "bbox"):
                # Crop to another region's bounds
                spec.crop_bbox = crop.bbox

        # Add highlights in show mode (unless explicitly disabled with highlights=False)
        if mode == "show" and highlights is not False:
            # Only highlight this region if:
            # 1. We're not cropping, OR
            # 2. We're cropping but color was explicitly specified, OR
            # 3. We're cropping to another region (not tight crop)
            if not crop or color is not None or (crop and not isinstance(crop, bool)):
                spec.add_highlight(
                    bbox=self.bbox,
                    polygon=self.polygon if self.has_polygon else None,
                    color=color or "blue",
                    label=self.label or self.name or "Region",
                )

            # Add additional highlight groups if provided (and highlights is a list)
            if highlights and isinstance(highlights, list):
                for group in highlights:
                    elements = group.get("elements", [])
                    group_color = group.get("color", color)
                    group_label = group.get("label")

                    for elem in elements:
                        spec.add_highlight(element=elem, color=group_color, label=group_label)

        return [spec]

    def _direction(
        self,
        direction: str,
        size: Optional[float] = None,
        cross_size: str = "full",
        include_source: bool = False,
        until: Optional[str] = None,
        include_endpoint: bool = True,
        **kwargs,
    ) -> "Region":
        """
        Region-specific wrapper around :py:meth:`DirectionalMixin._direction`.

        It performs any pre-processing required by *Region* (none currently),
        delegates the core geometry work to the mix-in implementation via
        ``super()``, then attaches region-level metadata before returning the
        new :class:`Region` instance.
        """

        # Delegate to the shared implementation on DirectionalMixin
        region = super()._direction(
            direction=direction,
            size=size,
            cross_size=cross_size,
            include_source=include_source,
            until=until,
            include_endpoint=include_endpoint,
            **kwargs,
        )

        # Post-process: make sure callers can trace lineage and flags
        region.source_element = self
        region.includes_source = include_source

        return region

    def above(
        self,
        height: Optional[float] = None,
        width: str = "full",
        include_source: bool = False,
        until: Optional[str] = None,
        include_endpoint: bool = True,
        offset: Optional[float] = None,
        **kwargs,
    ) -> "Region":
        """
        Select region above this region.

        Args:
            height: Height of the region above, in points
            width: Width mode - "full" for full page width or "element" for element width
            include_source: Whether to include this region in the result (default: False)
            until: Optional selector string to specify an upper boundary element
            include_endpoint: Whether to include the boundary element in the region (default: True)
            offset: Pixel offset when excluding source/endpoint (default: None, uses natural_pdf.options.layout.directional_offset)
            **kwargs: Additional parameters

        Returns:
            Region object representing the area above
        """
        # Use global default if offset not provided
        if offset is None:
            import natural_pdf

            offset = natural_pdf.options.layout.directional_offset

        return self._direction(
            direction="above",
            size=height,
            cross_size=width,
            include_source=include_source,
            until=until,
            include_endpoint=include_endpoint,
            offset=offset,
            **kwargs,
        )

    def below(
        self,
        height: Optional[float] = None,
        width: str = "full",
        include_source: bool = False,
        until: Optional[str] = None,
        include_endpoint: bool = True,
        offset: Optional[float] = None,
        **kwargs,
    ) -> "Region":
        """
        Select region below this region.

        Args:
            height: Height of the region below, in points
            width: Width mode - "full" for full page width or "element" for element width
            include_source: Whether to include this region in the result (default: False)
            until: Optional selector string to specify a lower boundary element
            include_endpoint: Whether to include the boundary element in the region (default: True)
            offset: Pixel offset when excluding source/endpoint (default: None, uses natural_pdf.options.layout.directional_offset)
            **kwargs: Additional parameters

        Returns:
            Region object representing the area below
        """
        # Use global default if offset not provided
        if offset is None:
            import natural_pdf

            offset = natural_pdf.options.layout.directional_offset

        return self._direction(
            direction="below",
            size=height,
            cross_size=width,
            include_source=include_source,
            until=until,
            include_endpoint=include_endpoint,
            offset=offset,
            **kwargs,
        )

    def left(
        self,
        width: Optional[float] = None,
        height: str = "element",
        include_source: bool = False,
        until: Optional[str] = None,
        include_endpoint: bool = True,
        offset: Optional[float] = None,
        **kwargs,
    ) -> "Region":
        """
        Select region to the left of this region.

        Args:
            width: Width of the region to the left, in points
            height: Height mode - "full" for full page height or "element" for element height
            include_source: Whether to include this region in the result (default: False)
            until: Optional selector string to specify a left boundary element
            include_endpoint: Whether to include the boundary element in the region (default: True)
            offset: Pixel offset when excluding source/endpoint (default: None, uses natural_pdf.options.layout.directional_offset)
            **kwargs: Additional parameters

        Returns:
            Region object representing the area to the left
        """
        # Use global default if offset not provided
        if offset is None:
            import natural_pdf

            offset = natural_pdf.options.layout.directional_offset

        return self._direction(
            direction="left",
            size=width,
            cross_size=height,
            include_source=include_source,
            until=until,
            include_endpoint=include_endpoint,
            offset=offset,
            **kwargs,
        )

    def right(
        self,
        width: Optional[float] = None,
        height: str = "element",
        include_source: bool = False,
        until: Optional[str] = None,
        include_endpoint: bool = True,
        offset: Optional[float] = None,
        **kwargs,
    ) -> "Region":
        """
        Select region to the right of this region.

        Args:
            width: Width of the region to the right, in points
            height: Height mode - "full" for full page height or "element" for element height
            include_source: Whether to include this region in the result (default: False)
            until: Optional selector string to specify a right boundary element
            include_endpoint: Whether to include the boundary element in the region (default: True)
            offset: Pixel offset when excluding source/endpoint (default: None, uses natural_pdf.options.layout.directional_offset)
            **kwargs: Additional parameters

        Returns:
            Region object representing the area to the right
        """
        # Use global default if offset not provided
        if offset is None:
            import natural_pdf

            offset = natural_pdf.options.layout.directional_offset

        return self._direction(
            direction="right",
            size=width,
            cross_size=height,
            include_source=include_source,
            until=until,
            include_endpoint=include_endpoint,
            offset=offset,
            **kwargs,
        )

    @property
    def type(self) -> str:
        """Element type."""
        # Return the specific type if detected (e.g., from layout analysis)
        # or 'region' as a default.
        return self.region_type or "region"  # Prioritize specific region_type if set

    @property
    def page(self) -> "Page":
        """Get the parent page."""
        return self._page

    @property
    def bbox(self) -> Tuple[float, float, float, float]:
        """Get the bounding box as (x0, top, x1, bottom)."""
        return self._bbox

    @property
    def x0(self) -> float:
        """Get the left coordinate."""
        return self._bbox[0]

    @property
    def top(self) -> float:
        """Get the top coordinate."""
        return self._bbox[1]

    @property
    def x1(self) -> float:
        """Get the right coordinate."""
        return self._bbox[2]

    @property
    def bottom(self) -> float:
        """Get the bottom coordinate."""
        return self._bbox[3]

    @property
    def width(self) -> float:
        """Get the width of the region."""
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        """Get the height of the region."""
        return self.bottom - self.top

    @property
    def has_polygon(self) -> bool:
        """Check if this region has polygon coordinates."""
        return self._polygon is not None and len(self._polygon) >= 3

    @property
    def polygon(self) -> List[Tuple[float, float]]:
        """Get polygon coordinates if available, otherwise return rectangle corners."""
        if self._polygon:
            return self._polygon
        else:
            # Create rectangle corners from bbox as fallback
            return [
                (self.x0, self.top),  # top-left
                (self.x1, self.top),  # top-right
                (self.x1, self.bottom),  # bottom-right
                (self.x0, self.bottom),  # bottom-left
            ]

    @property
    def origin(self) -> Optional[Union["Element", "Region"]]:
        """The element/region that created this region (if it was created via directional method)."""
        return getattr(self, "source_element", None)

    @property
    def endpoint(self) -> Optional["Element"]:
        """The element where this region stopped (if created with 'until' parameter)."""
        return getattr(self, "boundary_element", None)

    def _is_point_in_polygon(self, x: float, y: float) -> bool:
        """
        Check if a point is inside the polygon using ray casting algorithm.

        Args:
            x: X coordinate of the point
            y: Y coordinate of the point

        Returns:
            bool: True if the point is inside the polygon
        """
        if not self.has_polygon:
            return (self.x0 <= x <= self.x1) and (self.top <= y <= self.bottom)

        # Ray casting algorithm
        inside = False
        j = len(self.polygon) - 1

        for i in range(len(self.polygon)):
            if ((self.polygon[i][1] > y) != (self.polygon[j][1] > y)) and (
                x
                < (self.polygon[j][0] - self.polygon[i][0])
                * (y - self.polygon[i][1])
                / (self.polygon[j][1] - self.polygon[i][1])
                + self.polygon[i][0]
            ):
                inside = not inside
            j = i

        return inside

    def is_point_inside(self, x: float, y: float) -> bool:
        """
        Check if a point is inside this region using ray casting algorithm for polygons.

        Args:
            x: X coordinate of the point
            y: Y coordinate of the point

        Returns:
            bool: True if the point is inside the region
        """
        if not self.has_polygon:
            return (self.x0 <= x <= self.x1) and (self.top <= y <= self.bottom)

        # Ray casting algorithm
        inside = False
        j = len(self.polygon) - 1

        for i in range(len(self.polygon)):
            if ((self.polygon[i][1] > y) != (self.polygon[j][1] > y)) and (
                x
                < (self.polygon[j][0] - self.polygon[i][0])
                * (y - self.polygon[i][1])
                / (self.polygon[j][1] - self.polygon[i][1])
                + self.polygon[i][0]
            ):
                inside = not inside
            j = i

        return inside

    def is_element_center_inside(self, element: "Element") -> bool:
        """
        Check if the center point of an element's bounding box is inside this region.

        Args:
            element: Element to check

        Returns:
            True if the element's center point is inside the region, False otherwise.
        """
        # Check if element is on the same page
        if not hasattr(element, "page") or element.page != self._page:
            return False

        # Ensure element has necessary attributes
        if not all(hasattr(element, attr) for attr in ["x0", "x1", "top", "bottom"]):
            logger.warning(
                f"Element {element} lacks bounding box attributes. Cannot check center point."
            )
            return False  # Cannot determine position

        # Calculate center point
        center_x = (element.x0 + element.x1) / 2
        center_y = (element.top + element.bottom) / 2

        # Use the existing is_point_inside check
        return self.is_point_inside(center_x, center_y)

    def _is_element_in_region(self, element: "Element", use_boundary_tolerance=True) -> bool:
        """
        Check if an element intersects or is contained within this region.

        Args:
            element: Element to check
            use_boundary_tolerance: Whether to apply a small tolerance for boundary elements

        Returns:
            True if the element is in the region, False otherwise
        """
        # Use centralized spatial utility for consistency
        from natural_pdf.utils.spatial import is_element_in_region

        return is_element_in_region(element, self, strategy="center", check_page=True)

    def contains(self, element: "Element") -> bool:
        """
        Check if this region completely contains an element.

        Args:
            element: Element to check

        Returns:
            True if the element is completely contained within the region, False otherwise
        """
        # Check if element is on the same page
        if not hasattr(element, "page") or element.page != self._page:
            return False

        # Ensure element has necessary attributes
        if not all(hasattr(element, attr) for attr in ["x0", "x1", "top", "bottom"]):
            return False  # Cannot determine position

        # For rectangular regions, check if element's bbox is fully inside region's bbox
        if not self.has_polygon:
            return (
                self.x0 <= element.x0
                and element.x1 <= self.x1
                and self.top <= element.top
                and element.bottom <= self.bottom
            )

        # For polygon regions, check if all corners of the element are inside the polygon
        element_corners = [
            (element.x0, element.top),  # top-left
            (element.x1, element.top),  # top-right
            (element.x1, element.bottom),  # bottom-right
            (element.x0, element.bottom),  # bottom-left
        ]

        return all(self.is_point_inside(x, y) for x, y in element_corners)

    def intersects(self, element: "Element") -> bool:
        """
        Check if this region intersects with an element (any overlap).

        Args:
            element: Element to check

        Returns:
            True if the element overlaps with the region at all, False otherwise
        """
        # Check if element is on the same page
        if not hasattr(element, "page") or element.page != self._page:
            return False

        # Ensure element has necessary attributes
        if not all(hasattr(element, attr) for attr in ["x0", "x1", "top", "bottom"]):
            return False  # Cannot determine position

        # For rectangular regions, check for bbox overlap
        if not self.has_polygon:
            return (
                self.x0 < element.x1
                and self.x1 > element.x0
                and self.top < element.bottom
                and self.bottom > element.top
            )

        # For polygon regions, check if any corner of the element is inside the polygon
        element_corners = [
            (element.x0, element.top),  # top-left
            (element.x1, element.top),  # top-right
            (element.x1, element.bottom),  # bottom-right
            (element.x0, element.bottom),  # bottom-left
        ]

        # First check if any element corner is inside the polygon
        if any(self.is_point_inside(x, y) for x, y in element_corners):
            return True

        # Also check if any polygon corner is inside the element's rectangle
        for x, y in self.polygon:
            if element.x0 <= x <= element.x1 and element.top <= y <= element.bottom:
                return True

        # Also check if any polygon edge intersects with any rectangle edge
        # This is a simplification - for complex cases, we'd need a full polygon-rectangle
        # intersection algorithm

        # For now, return True if bounding boxes overlap (approximation for polygon-rectangle case)
        return (
            self.x0 < element.x1
            and self.x1 > element.x0
            and self.top < element.bottom
            and self.bottom > element.top
        )
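
    # Usage sketch (illustrative, not part of the library source). Shows the
    # difference between containment and overlap; assumes `page` is a
    # natural_pdf Page and the selector matches an element:
    #
    #   header = page.region(0, 0, page.width, 80)
    #   title = page.find('text:contains("Title")')
    #   header.intersects(title)  # True on any overlap
    #   header.contains(title)    # True only if the element is fully inside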

    def exclude(self):
        """
        Exclude this region from text extraction and other operations.

        This excludes everything within the region's bounds.
        """
        self.page.add_exclusion(self, method="region")
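
    # Usage sketch (illustrative, not part of the library source). A common
    # pattern is excluding a repeated header band before extraction; `pdf`
    # and the coordinates are illustrative:
    #
    #   page = pdf.pages[0]
    #   page.region(0, 0, page.width, 50).exclude()  # drop the header band
    #   body_text = page.extract_text()              # exclusion now applies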

    def highlight(
        self,
        label: Optional[str] = None,
        color: Optional[Union[Tuple, str]] = None,
        use_color_cycling: bool = False,
        annotate: Optional[List[str]] = None,
        existing: str = "append",
    ) -> "Region":
        """
        Highlight this region on the page.

        Args:
            label: Optional label for the highlight
            color: Color tuple/string for the highlight, or None to use automatic color
            use_color_cycling: Force color cycling even with no label (default: False)
            annotate: List of attribute names to display on the highlight (e.g., ['confidence', 'type'])
            existing: How to handle existing highlights ('append' or 'replace').

        Returns:
            Self for method chaining
        """
        # Access the highlighter service correctly
        highlighter = self.page._highlighter

        # Prepare common arguments
        highlight_args = {
            "page_index": self.page.index,
            "color": color,
            "label": label,
            "use_color_cycling": use_color_cycling,
            "element": self,  # Pass the region itself so attributes can be accessed
            "annotate": annotate,
            "existing": existing,
        }

        # Call the appropriate service method
        if self.has_polygon:
            highlight_args["polygon"] = self.polygon
            highlighter.add_polygon(**highlight_args)
        else:
            highlight_args["bbox"] = self.bbox
            highlighter.add(**highlight_args)

        return self
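
    # Usage sketch (illustrative, not part of the library source). Labels
    # group highlights in the legend and `annotate` overlays attribute values;
    # the selector is illustrative:
    #
    #   for tbl in page.find_all("region[type=table]"):
    #       tbl.highlight(label="tables", annotate=["confidence"])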

    def save(
        self,
        filename: str,
        resolution: Optional[float] = None,
        labels: bool = True,
        legend_position: str = "right",
    ) -> "Region":
        """
        Save the page with this region highlighted to an image file.

        Args:
            filename: Path to save the image to
            resolution: Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI)
            labels: Whether to include a legend for labels
            legend_position: Position of the legend

        Returns:
            Self for method chaining
        """
        # Apply global options as defaults
        import natural_pdf

        if resolution is None:
            if natural_pdf.options.image.resolution is not None:
                resolution = natural_pdf.options.image.resolution
            else:
                resolution = 144  # Default resolution when none specified

        # Highlight this region if not already highlighted
        self.highlight()

        # Save the highlighted image
        self._page.save_image(
            filename, resolution=resolution, labels=labels, legend_position=legend_position
        )
        return self
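
    # Usage sketch (illustrative, not part of the library source). save()
    # renders the whole page with this region highlighted, which is handy for
    # quick visual QA; the filename is illustrative:
    #
    #   region.save("page_with_region.png", resolution=144, labels=True)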

    def save_image(
        self,
        filename: str,
        resolution: Optional[float] = None,
        crop: bool = False,
        include_highlights: bool = True,
        **kwargs,
    ) -> "Region":
        """
        Save an image of just this region to a file.

        Args:
            filename: Path to save the image to
            resolution: Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI)
            crop: If True, only crop the region without highlighting its boundaries
            include_highlights: Whether to include existing highlights (default: True)
            **kwargs: Additional parameters for rendering

        Returns:
            Self for method chaining
        """
        # Apply global options as defaults
        import natural_pdf

        if resolution is None:
            if natural_pdf.options.image.resolution is not None:
                resolution = natural_pdf.options.image.resolution
            else:
                resolution = 144  # Default resolution when none specified

        # Use export() to save the image
        if include_highlights:
            # With highlights, use export() which includes them
            self.export(
                path=filename,
                resolution=resolution,
                crop=crop,
                **kwargs,
            )
        else:
            # Without highlights, use render() and save manually
            image = self.render(resolution=resolution, crop=crop, **kwargs)
            if image:
                image.save(filename)
            else:
                logger.error(f"Failed to render region image for saving to {filename}")

        return self
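
    # Usage sketch (illustrative, not part of the library source). Unlike
    # save(), save_image() renders only this region; crop=True skips the
    # boundary highlight. Filenames are illustrative:
    #
    #   region.save_image("region_plain.png", crop=True, include_highlights=False)
    #   region.save_image("region_highlighted.png", resolution=216)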

    def trim(
        self,
        padding: int = 1,
        threshold: float = 0.95,
        resolution: Optional[float] = None,
        pre_shrink: float = 0.5,
    ) -> "Region":
        """
        Trim visual whitespace from the edges of this region.

        Similar to Python's string .strip() method, but for visual whitespace in the region image.
        Uses pixel analysis to detect rows/columns that are predominantly whitespace.

        Args:
            padding: Number of pixels to keep as padding after trimming (default: 1)
            threshold: Threshold for considering a row/column as whitespace (0.0-1.0, default: 0.95)
                      Higher values mean more strict whitespace detection.
                      E.g., 0.95 means if 95% of pixels in a row/column are white, consider it whitespace.
            resolution: Resolution for image rendering in DPI (default: uses global options, fallback to 144 DPI)
            pre_shrink: Amount to shrink region before trimming, then expand back after (default: 0.5)
                       This helps avoid detecting box borders/slivers as content.

        Returns:
            New Region with visual whitespace trimmed from all edges

        Examples:
            # Basic trimming with 1 pixel padding and 0.5px pre-shrink
            trimmed = region.trim()

            # More aggressive trimming with no padding and no pre-shrink
            tight = region.trim(padding=0, threshold=0.9, pre_shrink=0)

            # Conservative trimming with more padding
            loose = region.trim(padding=3, threshold=0.98)
        """
        # Apply global options as defaults
        import natural_pdf

        if resolution is None:
            if natural_pdf.options.image.resolution is not None:
                resolution = natural_pdf.options.image.resolution
            else:
                resolution = 144  # Default resolution when none specified

        # Pre-shrink the region to avoid box slivers
        work_region = (
            self.expand(left=-pre_shrink, right=-pre_shrink, top=-pre_shrink, bottom=-pre_shrink)
            if pre_shrink > 0
            else self
        )

        # Get the region image
        # Use render() for clean image without highlights, with cropping
        image = work_region.render(resolution=resolution, crop=True)

        if image is None:
            logger.warning(
                f"Region {self.bbox}: Could not generate image for trimming. Returning original region."
            )
            return self

        # Convert to grayscale for easier analysis
        import numpy as np

        # Convert PIL image to numpy array
        img_array = np.array(image.convert("L"))  # Convert to grayscale
        height, width = img_array.shape

        if height == 0 or width == 0:
            logger.warning(
                f"Region {self.bbox}: Image has zero dimensions. Returning original region."
            )
            return self

        # Normalize pixel values to 0-1 range (255 = white = 1.0, 0 = black = 0.0)
        normalized = img_array.astype(np.float32) / 255.0

        # Find content boundaries by analyzing row and column averages

        # Analyze rows (horizontal strips) to find top and bottom boundaries
        row_averages = np.mean(normalized, axis=1)  # Average each row
        content_rows = row_averages < threshold  # True where there's content (not whitespace)

        # Find first and last rows with content
        content_row_indices = np.where(content_rows)[0]
        if len(content_row_indices) == 0:
            # No content found, return a minimal region at the center
            logger.warning(
                f"Region {self.bbox}: No content detected during trimming. Returning center point."
            )
            center_x = (self.x0 + self.x1) / 2
            center_y = (self.top + self.bottom) / 2
            return Region(self.page, (center_x, center_y, center_x, center_y))

        top_content_row = max(0, content_row_indices[0] - padding)
        bottom_content_row = min(height - 1, content_row_indices[-1] + padding)

        # Analyze columns (vertical strips) to find left and right boundaries
        col_averages = np.mean(normalized, axis=0)  # Average each column
        content_cols = col_averages < threshold  # True where there's content

        content_col_indices = np.where(content_cols)[0]
        if len(content_col_indices) == 0:
            # No content found in columns either
            logger.warning(
                f"Region {self.bbox}: No column content detected during trimming. Returning center point."
            )
            center_x = (self.x0 + self.x1) / 2
            center_y = (self.top + self.bottom) / 2
            return Region(self.page, (center_x, center_y, center_x, center_y))

        left_content_col = max(0, content_col_indices[0] - padding)
        right_content_col = min(width - 1, content_col_indices[-1] + padding)

        # Convert trimmed pixel coordinates back to PDF coordinates
        scale_factor = resolution / 72.0  # Scale factor used in render()
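        # e.g. at resolution=144 DPI, scale_factor = 144 / 72 = 2.0, so pixel
        # column 40 in the rendered crop maps to 20 PDF points from the left edge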

        # Calculate new PDF coordinates and ensure they are Python floats
        trimmed_x0 = float(work_region.x0 + (left_content_col / scale_factor))
        trimmed_top = float(work_region.top + (top_content_row / scale_factor))
        trimmed_x1 = float(
            work_region.x0 + ((right_content_col + 1) / scale_factor)
        )  # +1 because we want inclusive right edge
        trimmed_bottom = float(
            work_region.top + ((bottom_content_row + 1) / scale_factor)
        )  # +1 because we want inclusive bottom edge

        # Ensure the trimmed region doesn't exceed the work region boundaries
        final_x0 = max(work_region.x0, trimmed_x0)
        final_top = max(work_region.top, trimmed_top)
        final_x1 = min(work_region.x1, trimmed_x1)
        final_bottom = min(work_region.bottom, trimmed_bottom)

        # Ensure valid coordinates (width > 0, height > 0)
        if final_x1 <= final_x0 or final_bottom <= final_top:
            logger.warning(
                f"Region {self.bbox}: Trimming resulted in invalid dimensions. Returning original region."
            )
            return self

        # Create the trimmed region
        trimmed_region = Region(self.page, (final_x0, final_top, final_x1, final_bottom))

        # Expand back by the pre_shrink amount to restore original positioning
        if pre_shrink > 0:
            trimmed_region = trimmed_region.expand(
                left=pre_shrink, right=pre_shrink, top=pre_shrink, bottom=pre_shrink
            )

        # Copy relevant metadata
        trimmed_region.region_type = self.region_type
        trimmed_region.normalized_type = self.normalized_type
        trimmed_region.confidence = self.confidence
        trimmed_region.model = self.model
        trimmed_region.name = self.name
        trimmed_region.label = self.label
        trimmed_region.source = "trimmed"  # Indicate this is a derived region
        trimmed_region.parent_region = self

        logger.debug(
            f"Region {self.bbox}: Trimmed to {trimmed_region.bbox} (padding={padding}, threshold={threshold}, pre_shrink={pre_shrink})"
        )
        return trimmed_region

    def clip(
        self,
        obj: Optional[Any] = None,
        left: Optional[float] = None,
        top: Optional[float] = None,
        right: Optional[float] = None,
        bottom: Optional[float] = None,
    ) -> "Region":
        """
        Clip this region to specific bounds, either from another object with bbox or explicit coordinates.

        The clipped region will be constrained to not exceed the specified boundaries.
        You can provide either an object with bounding box properties, specific coordinates, or both.
        When both are provided, explicit coordinates take precedence.

        Args:
            obj: Optional object with bbox properties (Region, Element, TextElement, etc.)
            left: Optional left boundary (x0) to clip to
            top: Optional top boundary to clip to
            right: Optional right boundary (x1) to clip to
            bottom: Optional bottom boundary to clip to

        Returns:
            New Region with bounds clipped to the specified constraints

        Examples:
            # Clip to another region's bounds
            clipped = region.clip(container_region)

            # Clip to any element's bounds
            clipped = region.clip(text_element)

            # Clip to specific coordinates
            clipped = region.clip(left=100, right=400)

            # Mix object bounds with specific overrides
            clipped = region.clip(obj=container, bottom=page.height/2)
        """
        from natural_pdf.elements.base import extract_bbox

        # Start with current region bounds
        clip_x0 = self.x0
        clip_top = self.top
        clip_x1 = self.x1
        clip_bottom = self.bottom

        # Apply object constraints if provided
        if obj is not None:
            obj_bbox = extract_bbox(obj)
            if obj_bbox is not None:
                obj_x0, obj_top, obj_x1, obj_bottom = obj_bbox
                # Constrain to the intersection with the provided object
                clip_x0 = max(clip_x0, obj_x0)
                clip_top = max(clip_top, obj_top)
                clip_x1 = min(clip_x1, obj_x1)
                clip_bottom = min(clip_bottom, obj_bottom)
            else:
                logger.warning(
                    f"Region {self.bbox}: Cannot extract bbox from clipping object {type(obj)}. "
                    "Object must have bbox property or x0/top/x1/bottom attributes."
                )

        # Apply explicit coordinate constraints (these take precedence)
        if left is not None:
            clip_x0 = max(clip_x0, left)
        if top is not None:
            clip_top = max(clip_top, top)
        if right is not None:
            clip_x1 = min(clip_x1, right)
        if bottom is not None:
            clip_bottom = min(clip_bottom, bottom)

        # Ensure valid coordinates
        if clip_x1 <= clip_x0 or clip_bottom <= clip_top:
            logger.warning(
                f"Region {self.bbox}: Clipping resulted in invalid dimensions "
                f"({clip_x0}, {clip_top}, {clip_x1}, {clip_bottom}). Returning minimal region."
            )
            # Return a minimal region at the clip area's top-left
            return Region(self.page, (clip_x0, clip_top, clip_x0, clip_top))

        # Create the clipped region
        clipped_region = Region(self.page, (clip_x0, clip_top, clip_x1, clip_bottom))

        # Copy relevant metadata
        clipped_region.region_type = self.region_type
        clipped_region.normalized_type = self.normalized_type
        clipped_region.confidence = self.confidence
        clipped_region.model = self.model
        clipped_region.name = self.name
        clipped_region.label = self.label
        clipped_region.source = "clipped"  # Indicate this is a derived region
        clipped_region.parent_region = self

        logger.debug(
            f"Region {self.bbox}: Clipped to {clipped_region.bbox} "
            f"(constraints: obj={type(obj).__name__ if obj else None}, "
            f"left={left}, top={top}, right={right}, bottom={bottom})"
        )
        return clipped_region

    def region(
        self,
        left: float = None,
        top: float = None,
        right: float = None,
        bottom: float = None,
        width: Union[str, float, None] = None,
        height: Optional[float] = None,
        relative: bool = False,
    ) -> "Region":
        """
        Create a sub-region within this region using the same API as Page.region().

        By default, coordinates are absolute (relative to the page), matching Page.region().
        Set relative=True to use coordinates relative to this region's top-left corner.

        Args:
            left: Left x-coordinate (absolute by default, or relative to region if relative=True)
            top: Top y-coordinate (absolute by default, or relative to region if relative=True)
            right: Right x-coordinate (absolute by default, or relative to region if relative=True)
            bottom: Bottom y-coordinate (absolute by default, or relative to region if relative=True)
            width: Width definition (same as Page.region())
            height: Height of the region (same as Page.region())
            relative: If True, coordinates are relative to this region's top-left (0,0).
                     If False (default), coordinates are absolute page coordinates.

        Returns:
            Region object for the specified coordinates, clipped to this region's bounds

        Examples:
            # Absolute coordinates (default) - same as page.region()
            sub = region.region(left=100, top=200, width=50, height=30)

            # Relative to region's top-left
            sub = region.region(left=10, top=10, width=50, height=30, relative=True)

            # Mix relative positioning with this region's bounds
            sub = region.region(left=region.x0 + 10, width=50, height=30)
        """
        # If relative coordinates requested, convert to absolute
        if relative:
            if left is not None:
                left = self.x0 + left
            if top is not None:
                top = self.top + top
            if right is not None:
                right = self.x0 + right
            if bottom is not None:
                bottom = self.top + bottom

            # Numeric width/height values are resolved by Page.region() against
            # the absolute coordinates computed above, so no further adjustment
            # is needed here.

        # Use the parent page's region method to create the region with all its logic
        new_region = self.page.region(
            left=left, top=top, right=right, bottom=bottom, width=width, height=height
        )

        # Clip the new region to this region's bounds
        return new_region.clip(self)

    def get_elements(
        self, selector: Optional[str] = None, apply_exclusions=True, **kwargs
    ) -> List["Element"]:
        """
        Get all elements within this region.

        Args:
            selector: Optional selector to filter elements
            apply_exclusions: Whether to apply exclusion regions
            **kwargs: Additional parameters for element filtering

        Returns:
            List of elements in the region
        """
        if selector:
            # Find elements on the page matching the selector
            page_elements = self.page.find_all(
                selector, apply_exclusions=apply_exclusions, **kwargs
            )
            # Filter those elements to only include ones within this region
            elements = [e for e in page_elements if self._is_element_in_region(e)]
        else:
            # Get all elements from the page
            page_elements = self.page.get_elements(apply_exclusions=apply_exclusions)
            # Filter to elements in this region
            elements = [e for e in page_elements if self._is_element_in_region(e)]

        # Apply boundary exclusions if this is a section with boundary settings
        if hasattr(self, "_boundary_exclusions") and self._boundary_exclusions != "both":
            excluded_ids = set()

            if self._boundary_exclusions == "none":
                # Exclude both start and end elements
                if hasattr(self, "start_element") and self.start_element:
                    excluded_ids.add(id(self.start_element))
                if hasattr(self, "end_element") and self.end_element:
                    excluded_ids.add(id(self.end_element))
            elif self._boundary_exclusions == "start":
                # Exclude only end element
                if hasattr(self, "end_element") and self.end_element:
                    excluded_ids.add(id(self.end_element))
            elif self._boundary_exclusions == "end":
                # Exclude only start element
                if hasattr(self, "start_element") and self.start_element:
                    excluded_ids.add(id(self.start_element))

            if excluded_ids:
                elements = [e for e in elements if id(e) not in excluded_ids]

        return elements
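
    # Usage sketch (illustrative, not part of the library source). The
    # selector filters by type/attributes first, then results are restricted
    # spatially to this region; selectors are illustrative:
    #
    #   words = region.get_elements("text")                     # all text in region
    #   totals = region.get_elements('text:contains("Total")')  # filtered subset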

    def attr(self, name: str) -> Any:
        """
        Get an attribute value from this region.

        This method provides a consistent interface for attribute access that works
        on both individual regions/elements and collections. When called on a single
        region, it simply returns the attribute value. When called on collections,
        it extracts the attribute from all items.

        Args:
            name: The attribute name to retrieve (e.g., 'text', 'width', 'height')

        Returns:
            The attribute value, or None if the attribute doesn't exist

        Examples:
            # On a single region
            region = page.find('text:contains("Title")').expand(10)
            width = region.attr('width')  # Same as region.width

            # Consistent API across elements and regions
            obj = page.find('*:contains("Title")')  # Could be element or region
            text = obj.attr('text')  # Works for both
        """
        return getattr(self, name, None)

    def extract_text(
        self,
        granularity: str = "chars",
        apply_exclusions: bool = True,
        debug: bool = False,
        *,
        overlap: str = "center",
        newlines: Union[bool, str] = True,
        content_filter=None,
        **kwargs,
    ) -> str:
        """
        Extract text from this region, respecting page exclusions and using pdfplumber's
        layout engine (chars_to_textmap).

        Args:
            granularity: Level of text extraction - 'chars' (default) or 'words'.
                - 'chars': Character-by-character extraction (current behavior)
                - 'words': Word-level extraction with configurable overlap
            apply_exclusions: Whether to apply exclusion regions defined on the parent page.
            debug: Enable verbose debugging output for filtering steps.
            overlap: How to determine if words overlap with the region (only used when granularity='words'):
                - 'center': Word center point must be inside (default)
                - 'full': Word must be fully inside the region
                - 'partial': Any overlap includes the word
            newlines: Whether to strip newline characters from the extracted text.
            content_filter: Optional content filter to exclude specific text patterns. Can be:
                - A regex pattern string (characters matching the pattern are EXCLUDED)
                - A callable that takes text and returns True to KEEP the character
                - A list of regex patterns (characters matching ANY pattern are EXCLUDED)
            **kwargs: Additional layout parameters passed directly to pdfplumber's
                      `chars_to_textmap` function (e.g., layout, x_density, y_density).
                      See Page.extract_text docstring for more.

        Returns:
            Extracted text as string, potentially with layout-based spacing.
        """
        # Validate granularity parameter
        if granularity not in ("chars", "words"):
            raise ValueError(f"granularity must be 'chars' or 'words', got '{granularity}'")

        # Allow 'debug_exclusions' for backward compatibility
        # ("debug" is a named parameter, so it can never appear in kwargs)
        debug = debug or kwargs.get("debug_exclusions", False)
        logger.debug(
            f"Region {self.bbox}: extract_text called with granularity='{granularity}', overlap='{overlap}', kwargs: {kwargs}"
        )

        # Handle word-level extraction
        if granularity == "words":
            # Use find_all to get words with proper overlap and exclusion handling
            word_elements = self.find_all(
                "text", overlap=overlap, apply_exclusions=apply_exclusions
            )

            # Join the text from all matching words
            text_parts = []
            for word in word_elements:
                word_text = word.extract_text()
                if word_text:  # Skip empty strings
                    text_parts.append(word_text)

            result = " ".join(text_parts)

            # Apply newlines processing if requested
            if newlines is False:
                result = result.replace("\n", " ").replace("\r", " ")
            elif isinstance(newlines, str):
                result = result.replace("\n", newlines).replace("\r", newlines)

            return result

        # Original character-level extraction logic follows...
        # 1. Get Word Elements potentially within this region (initial broad phase)
        # Optimization: Could use spatial query if page elements were indexed
        page_words = self.page.words  # Get all words from the page

        # 2. Gather all character dicts from words potentially in region
        # We filter precisely in filter_chars_spatially
        all_char_dicts = []
        for word in page_words:
            # Quick bbox check to avoid processing words clearly outside
            if get_bbox_overlap(self.bbox, word.bbox) is not None:
                all_char_dicts.extend(getattr(word, "_char_dicts", []))

        if not all_char_dicts:
            logger.debug(f"Region {self.bbox}: No character dicts found overlapping region bbox.")
            return ""

        # 3. Get Relevant Exclusions (overlapping this region)
        # ("apply_exclusions" is a named parameter, so it can never appear in kwargs)
        apply_exclusions_flag = apply_exclusions
        exclusion_regions = []
        if apply_exclusions_flag:
            # Always call _get_exclusion_regions to get both page and PDF level exclusions
            all_page_exclusions = self._page._get_exclusion_regions(
                include_callable=True, debug=debug
            )
            overlapping_exclusions = []
            for excl in all_page_exclusions:
                if get_bbox_overlap(self.bbox, excl.bbox) is not None:
                    overlapping_exclusions.append(excl)
            exclusion_regions = overlapping_exclusions
            if debug:
                logger.debug(
                    f"Region {self.bbox}: Found {len(all_page_exclusions)} total exclusions, "
                    f"{len(exclusion_regions)} overlapping this region."
                )
        elif debug:
            logger.debug(f"Region {self.bbox}: Not applying exclusions (apply_exclusions=False).")

        # Add boundary element exclusions if this is a section with boundary settings
        if hasattr(self, "_boundary_exclusions") and self._boundary_exclusions != "both":
            boundary_exclusions = []

            if self._boundary_exclusions == "none":
                # Exclude both start and end elements
                if hasattr(self, "start_element") and self.start_element:
                    boundary_exclusions.append(self.start_element)
                if hasattr(self, "end_element") and self.end_element:
                    boundary_exclusions.append(self.end_element)
            elif self._boundary_exclusions == "start":
                # Exclude only end element
                if hasattr(self, "end_element") and self.end_element:
                    boundary_exclusions.append(self.end_element)
            elif self._boundary_exclusions == "end":
                # Exclude only start element
                if hasattr(self, "start_element") and self.start_element:
                    boundary_exclusions.append(self.start_element)

            # Add boundary elements as exclusion regions
            for elem in boundary_exclusions:
                if hasattr(elem, "bbox"):
                    exclusion_regions.append(elem)
                    if debug:
                        logger.debug(
                            f"Adding boundary exclusion: {elem.extract_text().strip()} at {elem.bbox}"
                        )

        # 4. Spatially Filter Characters using Utility
        # Pass self as the target_region for precise polygon checks etc.
        filtered_chars = filter_chars_spatially(
            char_dicts=all_char_dicts,
            exclusion_regions=exclusion_regions,
            target_region=self,  # Pass self!
            debug=debug,
        )

        # 5. Generate Text Layout using Utility
        # Add content_filter to kwargs if provided
        final_kwargs = kwargs.copy()
        if content_filter is not None:
            final_kwargs["content_filter"] = content_filter

        result = generate_text_layout(
            char_dicts=filtered_chars,
            layout_context_bbox=self.bbox,  # Use region's bbox for context
            user_kwargs=final_kwargs,  # Pass kwargs including content_filter
        )

        # Flexible newline handling (same logic as TextElement)
        if isinstance(newlines, bool):
            if newlines is False:
                replacement = " "
            else:
                replacement = None
        else:
            replacement = str(newlines)

        if replacement is not None:
            result = result.replace("\n", replacement).replace("\r", replacement)

        logger.debug(f"Region {self.bbox}: extract_text finished, result length: {len(result)}.")
        return result
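
    # Usage sketch (illustrative, not part of the library source).
    # granularity='words' joins word-level text using the given overlap rule,
    # while content_filter drops characters matching a pattern:
    #
    #   text = region.extract_text()                               # char-level
    #   words = region.extract_text(granularity="words", overlap="partial")
    #   no_digits = region.extract_text(content_filter=r"\d")     # strip digits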

    def extract_table(
        self,
        method: Optional[str] = None,  # Make method optional
        table_settings: Optional[dict] = None,  # Use Optional
        use_ocr: bool = False,
        ocr_config: Optional[dict] = None,  # Use Optional
        text_options: Optional[Dict] = None,
        cell_extraction_func: Optional[Callable[["Region"], Optional[str]]] = None,
        # --- NEW: Add tqdm control option --- #
        show_progress: bool = False,  # Controls progress bar for text method
        content_filter: Optional[
            Union[str, Callable[[str], bool], List[str]]
        ] = None,  # NEW: Content filtering
        apply_exclusions: bool = True,  # Whether to apply exclusion regions during extraction
        verticals: Optional[List] = None,  # Explicit vertical lines
        horizontals: Optional[List] = None,  # Explicit horizontal lines
    ) -> TableResult:  # Return type allows Optional[str] for cells
        """
        Extract a table from this region.

        Args:
            method: Method to use: 'tatr', 'pdfplumber', 'text', 'stream', 'lattice', or None (auto-detect).
                    'stream' is an alias for 'pdfplumber' with text-based strategies (equivalent to
                    setting `vertical_strategy` and `horizontal_strategy` to 'text').
                    'lattice' is an alias for 'pdfplumber' with line-based strategies (equivalent to
                    setting `vertical_strategy` and `horizontal_strategy` to 'lines').
            table_settings: Settings for pdfplumber table extraction (used with 'pdfplumber', 'stream', or 'lattice' methods).
            use_ocr: Whether to use OCR for text extraction (currently only applicable with 'tatr' method).
            ocr_config: OCR configuration parameters.
            text_options: Dictionary of options for the 'text' method, corresponding to arguments
                          of analyze_text_table_structure (e.g., snap_tolerance, expand_bbox).
            cell_extraction_func: Optional callable function that takes a cell Region object
                                  and returns its string content. Overrides default text extraction
                                  for the 'text' method.
            show_progress: If True, display a progress bar during cell text extraction for the 'text' method.
            content_filter: Optional content filter to apply during cell text extraction. Can be:
                - A regex pattern string (characters matching the pattern are EXCLUDED)
                - A callable that takes text and returns True to KEEP the character
                - A list of regex patterns (characters matching ANY pattern are EXCLUDED)
                Works with all extraction methods by filtering cell content.
            apply_exclusions: Whether to apply exclusion regions during text extraction (default: True).
                When True, text within excluded regions (e.g., headers/footers) will not be extracted.
            verticals: Optional list of explicit vertical lines for table extraction. When provided,
                       automatically sets vertical_strategy='explicit' and explicit_vertical_lines.
            horizontals: Optional list of explicit horizontal lines for table extraction. When provided,
                         automatically sets horizontal_strategy='explicit' and explicit_horizontal_lines.

        Returns:
            A TableResult containing the rows, where each row is a list of cell values (str or None).
        """
        # Default settings if none provided
        if table_settings is None:
            table_settings = {}
        if text_options is None:
            text_options = {}  # Initialize empty dict

        # Handle explicit vertical and horizontal lines
        if verticals is not None:
            table_settings["vertical_strategy"] = "explicit"
            table_settings["explicit_vertical_lines"] = verticals
        if horizontals is not None:
            table_settings["horizontal_strategy"] = "explicit"
            table_settings["explicit_horizontal_lines"] = horizontals

        # Auto-detect method if not specified
        if method is None:
            # If this is a TATR-detected region, use TATR method
            if hasattr(self, "model") and self.model == "tatr" and self.region_type == "table":
                effective_method = "tatr"
            else:
                # Try lattice first, then fall back to stream if no meaningful results
                logger.debug(f"Region {self.bbox}: Auto-detecting table extraction method...")

                # --- NEW: Prefer already-created table_cell regions if they exist --- #
                try:
                    cell_regions_in_table = [
                        c
                        for c in self.page.find_all(
                            "region[type=table_cell]", apply_exclusions=False
                        )
                        if self.intersects(c)
                    ]
                except Exception as _cells_err:
                    cell_regions_in_table = []  # Fallback silently

                if cell_regions_in_table:
                    logger.debug(
                        f"Region {self.bbox}: Found {len(cell_regions_in_table)} pre-computed table_cell regions – using 'cells' method."
                    )
                    return TableResult(
                        self._extract_table_from_cells(
                            cell_regions_in_table,
                            content_filter=content_filter,
                            apply_exclusions=apply_exclusions,
                        )
                    )

                # --------------------------------------------------------------- #

                try:
                    logger.debug(f"Region {self.bbox}: Trying 'lattice' method first...")
                    lattice_result = self.extract_table(
                        "lattice", table_settings=table_settings.copy()
                    )

                    # Check if lattice found meaningful content
                    if (
                        lattice_result
                        and len(lattice_result) > 0
                        and any(
                            any(cell and cell.strip() for cell in row if cell)
                            for row in lattice_result
                        )
                    ):
                        logger.debug(
                            f"Region {self.bbox}: 'lattice' method found table with {len(lattice_result)} rows"
                        )
                        return lattice_result
                    else:
                        logger.debug(
                            f"Region {self.bbox}: 'lattice' method found no meaningful content"
                        )
                except Exception as e:
                    logger.debug(f"Region {self.bbox}: 'lattice' method failed: {e}")

                # Fall back to stream
                logger.debug(f"Region {self.bbox}: Falling back to 'stream' method...")
                return self.extract_table("stream", table_settings=table_settings.copy())
        else:
            effective_method = method

        # Handle method aliases for pdfplumber
        if effective_method == "stream":
            logger.debug("Using 'stream' method alias for 'pdfplumber' with text-based strategies.")
            effective_method = "pdfplumber"
            # Set default text strategies if not already provided by the user
            table_settings.setdefault("vertical_strategy", "text")
            table_settings.setdefault("horizontal_strategy", "text")
        elif effective_method == "lattice":
            logger.debug(
                "Using 'lattice' method alias for 'pdfplumber' with line-based strategies."
            )
            effective_method = "pdfplumber"
            # Set default line strategies if not already provided by the user
            table_settings.setdefault("vertical_strategy", "lines")
            table_settings.setdefault("horizontal_strategy", "lines")

        # -------------------------------------------------------------
        # Auto-inject tolerances when text-based strategies are requested.
        # This must happen AFTER alias handling (so strategies are final)
        # and BEFORE we delegate to _extract_table_* helpers.
        # -------------------------------------------------------------
        if "text" in (
            table_settings.get("vertical_strategy"),
            table_settings.get("horizontal_strategy"),
        ):
            page_cfg = getattr(self.page, "_config", {})
            # Ensure text_* tolerances passed to pdfplumber
            if "text_x_tolerance" not in table_settings and "x_tolerance" not in table_settings:
                if page_cfg.get("x_tolerance") is not None:
                    table_settings["text_x_tolerance"] = page_cfg["x_tolerance"]
            if "text_y_tolerance" not in table_settings and "y_tolerance" not in table_settings:
                if page_cfg.get("y_tolerance") is not None:
                    table_settings["text_y_tolerance"] = page_cfg["y_tolerance"]

            # Snap / join tolerances (~ line spacing)
            if "snap_tolerance" not in table_settings and "snap_x_tolerance" not in table_settings:
                snap = max(1, round((page_cfg.get("y_tolerance", 1)) * 0.9))
                table_settings["snap_tolerance"] = snap
            if "join_tolerance" not in table_settings and "join_x_tolerance" not in table_settings:
                table_settings["join_tolerance"] = table_settings["snap_tolerance"]

        logger.debug(f"Region {self.bbox}: Extracting table using method '{effective_method}'")

        # For stream method with text-based edge detection and explicit vertical lines,
        # adjust guides to ensure they fall within text bounds for proper intersection
        if (
            effective_method == "pdfplumber"
            and table_settings.get("horizontal_strategy") == "text"
            and table_settings.get("vertical_strategy") == "explicit"
            and "explicit_vertical_lines" in table_settings
        ):
            text_elements = self.find_all("text", apply_exclusions=apply_exclusions)
            if text_elements:
                text_bounds = text_elements.merge().bbox
                text_left = text_bounds[0]
                text_right = text_bounds[2]

                # Adjust vertical guides to fall within text bounds
                original_verticals = table_settings["explicit_vertical_lines"]
                adjusted_verticals = []

                for v in original_verticals:
                    if v < text_left:
                        # Guide is left of text bounds, clip to text start
                        adjusted_verticals.append(text_left)
                        logger.debug(
                            f"Region {self.bbox}: Adjusted left guide from {v:.1f} to {text_left:.1f}"
                        )
                    elif v > text_right:
                        # Guide is right of text bounds, clip to text end
                        adjusted_verticals.append(text_right)
                        logger.debug(
                            f"Region {self.bbox}: Adjusted right guide from {v:.1f} to {text_right:.1f}"
                        )
                    else:
                        # Guide is within text bounds, keep as is
                        adjusted_verticals.append(v)

                # Update table settings with adjusted guides
                table_settings["explicit_vertical_lines"] = adjusted_verticals
                logger.debug(
                    f"Region {self.bbox}: Adjusted {len(original_verticals)} guides for stream extraction. "
                    f"Text bounds: {text_left:.1f}-{text_right:.1f}"
                )

        # Use the selected method
        if effective_method == "tatr":
            table_rows = self._extract_table_tatr(
                use_ocr=use_ocr,
                ocr_config=ocr_config,
                content_filter=content_filter,
                apply_exclusions=apply_exclusions,
            )
        elif effective_method == "text":
            current_text_options = text_options.copy()
            current_text_options["cell_extraction_func"] = cell_extraction_func
            current_text_options["show_progress"] = show_progress
            current_text_options["content_filter"] = content_filter
            current_text_options["apply_exclusions"] = apply_exclusions
            table_rows = self._extract_table_text(**current_text_options)
        elif effective_method == "pdfplumber":
            table_rows = self._extract_table_plumber(
                table_settings, content_filter=content_filter, apply_exclusions=apply_exclusions
            )
        else:
            raise ValueError(
                f"Unknown table extraction method: '{method}'. Choose from 'tatr', 'pdfplumber', 'text', 'stream', 'lattice'."
            )

        return TableResult(table_rows)
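
    # Usage sketch (illustrative, not part of the library source).
    # 'lattice'/'stream' are pdfplumber aliases, and explicit guide lists
    # override the corresponding strategy; coordinates are illustrative:
    #
    #   data = region.extract_table()                        # auto-detect
    #   ruled = region.extract_table("lattice")              # line-based grid
    #   cols = region.extract_table(verticals=[72, 200, 330, 540])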

    def extract_tables(
        self,
        method: Optional[str] = None,
        table_settings: Optional[dict] = None,
    ) -> List[List[List[str]]]:
        """
        Extract all tables from this region using pdfplumber-based methods.

        Note: Only 'pdfplumber', 'stream', and 'lattice' methods are supported for extract_tables.
        'tatr' and 'text' methods are designed for single table extraction only.

        Args:
            method: Method to use: 'pdfplumber', 'stream', 'lattice', or None (auto-detect).
                    'stream' uses text-based strategies, 'lattice' uses line-based strategies.
            table_settings: Settings for pdfplumber table extraction.

        Returns:
            List of tables, where each table is a list of rows, and each row is a list of cell values.
        """
        if table_settings is None:
            table_settings = {}

        # Auto-detect method if not specified (try lattice first, then stream)
        if method is None:
            logger.debug(f"Region {self.bbox}: Auto-detecting tables extraction method...")

            # Try lattice first
            try:
                lattice_settings = table_settings.copy()
                lattice_settings.setdefault("vertical_strategy", "lines")
                lattice_settings.setdefault("horizontal_strategy", "lines")

                logger.debug(f"Region {self.bbox}: Trying 'lattice' method first for tables...")
                lattice_result = self._extract_tables_plumber(lattice_settings)

                # Check if lattice found meaningful tables
                if (
                    lattice_result
                    and len(lattice_result) > 0
                    and any(
                        any(
                            any(cell and cell.strip() for cell in row if cell)
                            for row in table
                            if table
                        )
                        for table in lattice_result
                    )
                ):
                    logger.debug(
                        f"Region {self.bbox}: 'lattice' method found {len(lattice_result)} tables"
                    )
                    return lattice_result
                else:
                    logger.debug(f"Region {self.bbox}: 'lattice' method found no meaningful tables")

            except Exception as e:
                logger.debug(f"Region {self.bbox}: 'lattice' method failed: {e}")

            # Fall back to stream
            logger.debug(f"Region {self.bbox}: Falling back to 'stream' method for tables...")
            stream_settings = table_settings.copy()
            stream_settings.setdefault("vertical_strategy", "text")
            stream_settings.setdefault("horizontal_strategy", "text")

            return self._extract_tables_plumber(stream_settings)

        effective_method = method

        # Handle method aliases
        if effective_method == "stream":
            logger.debug("Using 'stream' method alias for 'pdfplumber' with text-based strategies.")
            effective_method = "pdfplumber"
            table_settings.setdefault("vertical_strategy", "text")
            table_settings.setdefault("horizontal_strategy", "text")
        elif effective_method == "lattice":
            logger.debug(
                "Using 'lattice' method alias for 'pdfplumber' with line-based strategies."
            )
            effective_method = "pdfplumber"
            table_settings.setdefault("vertical_strategy", "lines")
            table_settings.setdefault("horizontal_strategy", "lines")

        # Use the selected method
        if effective_method == "pdfplumber":
            return self._extract_tables_plumber(table_settings)
        else:
            raise ValueError(
                f"Unknown tables extraction method: '{method}'. Choose from 'pdfplumber', 'stream', 'lattice'."
            )
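
    # Usage sketch (illustrative, not part of the library source).
    # extract_tables() returns every detected table in the region:
    #
    #   for i, table in enumerate(region.extract_tables("lattice")):
    #       print(f"table {i}: {len(table)} rows")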

    def _extract_tables_plumber(self, table_settings: dict) -> List[List[List[str]]]:
        """
        Extract all tables using pdfplumber's table extraction.

        Args:
            table_settings: Settings for pdfplumber table extraction

        Returns:
            List of tables, where each table is a list of rows, and each row is a list of cell values
        """
        # Inject global PDF-level text tolerances if not explicitly present
        pdf_cfg = getattr(self.page, "_config", getattr(self.page._parent, "_config", {}))
        _uses_text = "text" in (
            table_settings.get("vertical_strategy"),
            table_settings.get("horizontal_strategy"),
        )
        if (
            _uses_text
            and "text_x_tolerance" not in table_settings
            and "x_tolerance" not in table_settings
        ):
            x_tol = pdf_cfg.get("x_tolerance")
            if x_tol is not None:
                table_settings.setdefault("text_x_tolerance", x_tol)
        if (
            _uses_text
            and "text_y_tolerance" not in table_settings
            and "y_tolerance" not in table_settings
        ):
            y_tol = pdf_cfg.get("y_tolerance")
            if y_tol is not None:
                table_settings.setdefault("text_y_tolerance", y_tol)

        if (
            _uses_text
            and "snap_tolerance" not in table_settings
            and "snap_x_tolerance" not in table_settings
        ):
            snap = max(1, round((pdf_cfg.get("y_tolerance", 1)) * 0.9))
            table_settings.setdefault("snap_tolerance", snap)
        if (
            _uses_text
            and "join_tolerance" not in table_settings
            and "join_x_tolerance" not in table_settings
        ):
            join = table_settings.get("snap_tolerance", 1)
            table_settings.setdefault("join_tolerance", join)
            table_settings.setdefault("join_x_tolerance", join)
            table_settings.setdefault("join_y_tolerance", join)

        # -------------------------------------------------------------
        # Apply char-level exclusion filtering, if any exclusions are
        # defined on the parent Page.  We create a lightweight
        # pdfplumber.Page copy whose .chars list omits characters that
        # fall inside any exclusion Region.  Other object types are
        # left untouched for now ("chars-only" strategy).
        # -------------------------------------------------------------
        base_plumber_page = self.page._page

        if getattr(self.page, "_exclusions", None):
            # Resolve exclusion Regions (callables already evaluated)
            exclusion_regions = self.page._get_exclusion_regions(include_callable=True)

            def _keep_char(obj):
                """Return True if pdfplumber obj should be kept."""
                if obj.get("object_type") != "char":
                    # Keep non-char objects unchanged – lattice grids etc.
                    return True

                # Compute character centre point
                cx = (obj["x0"] + obj["x1"]) / 2.0
                cy = (obj["top"] + obj["bottom"]) / 2.0

                # Reject if the centre lies inside ANY exclusion Region
                for reg in exclusion_regions:
                    if reg.x0 <= cx <= reg.x1 and reg.top <= cy <= reg.bottom:
                        return False
                return True

            try:
                filtered_page = base_plumber_page.filter(_keep_char)
            except Exception as _filter_err:
                # Fallback – if filtering fails, log and proceed unfiltered
                logger.warning(
                    f"Region {self.bbox}: Failed to filter pdfplumber chars for exclusions: {_filter_err}"
                )
                filtered_page = base_plumber_page
        else:
            filtered_page = base_plumber_page

        # Ensure bbox is within pdfplumber page bounds
        page_bbox = filtered_page.bbox
        clipped_bbox = (
            max(self.bbox[0], page_bbox[0]),  # x0
            max(self.bbox[1], page_bbox[1]),  # y0
            min(self.bbox[2], page_bbox[2]),  # x1
            min(self.bbox[3], page_bbox[3]),  # y1
        )

        # Only crop if the clipped bbox is valid (has positive width and height)
        if clipped_bbox[2] > clipped_bbox[0] and clipped_bbox[3] > clipped_bbox[1]:
            cropped = filtered_page.crop(clipped_bbox)
        else:
            # If the region is completely outside the page bounds, return empty list
            return []

        # Extract all tables from the cropped area
        tables = cropped.extract_tables(table_settings)

        # Apply RTL text processing to all tables
        if tables:
            processed_tables = []
            for table in tables:
                processed_table = []
                for row in table:
                    processed_row = []
                    for cell in row:
                        if cell is not None:
                            # Apply RTL text processing to each cell
                            rtl_processed_cell = self._apply_rtl_processing_to_text(cell)
                            processed_row.append(rtl_processed_cell)
                        else:
                            processed_row.append(cell)
                    processed_table.append(processed_row)
                processed_tables.append(processed_table)
            return processed_tables

        # Return empty list if no tables found
        return []

    def _extract_table_plumber(
        self, table_settings: dict, content_filter=None, apply_exclusions=True
    ) -> List[List[str]]:
        """
        Extract table using pdfplumber's table extraction.
        This method extracts the largest table within the region.

        Args:
            table_settings: Settings for pdfplumber table extraction
            content_filter: Optional content filter to apply to cell values

        Returns:
            Table data as a list of rows, where each row is a list of cell values
        """
        # Inject global PDF-level text tolerances if not explicitly present
        pdf_cfg = getattr(self.page, "_config", getattr(self.page._parent, "_config", {}))
        _uses_text = "text" in (
            table_settings.get("vertical_strategy"),
            table_settings.get("horizontal_strategy"),
        )
        if (
            _uses_text
            and "text_x_tolerance" not in table_settings
            and "x_tolerance" not in table_settings
        ):
            x_tol = pdf_cfg.get("x_tolerance")
            if x_tol is not None:
                table_settings.setdefault("text_x_tolerance", x_tol)
        if (
            _uses_text
            and "text_y_tolerance" not in table_settings
            and "y_tolerance" not in table_settings
        ):
            y_tol = pdf_cfg.get("y_tolerance")
            if y_tol is not None:
                table_settings.setdefault("text_y_tolerance", y_tol)

        # -------------------------------------------------------------
        # Apply char-level exclusion filtering (chars only) just like in
        # _extract_tables_plumber so header/footer text does not appear
        # in extracted tables.
        # -------------------------------------------------------------
        base_plumber_page = self.page._page

        if apply_exclusions and getattr(self.page, "_exclusions", None):
            exclusion_regions = self.page._get_exclusion_regions(include_callable=True)

            def _keep_char(obj):
                if obj.get("object_type") != "char":
                    return True
                cx = (obj["x0"] + obj["x1"]) / 2.0
                cy = (obj["top"] + obj["bottom"]) / 2.0
                for reg in exclusion_regions:
                    if reg.x0 <= cx <= reg.x1 and reg.top <= cy <= reg.bottom:
                        return False
                return True

            try:
                filtered_page = base_plumber_page.filter(_keep_char)
            except Exception as _filter_err:
                logger.warning(
                    f"Region {self.bbox}: Failed to filter pdfplumber chars for exclusions (single table): {_filter_err}"
                )
                filtered_page = base_plumber_page
        else:
            filtered_page = base_plumber_page

        # Now crop the (possibly filtered) page to the region bbox
        # Ensure bbox is within pdfplumber page bounds
        page_bbox = filtered_page.bbox
        clipped_bbox = (
            max(self.bbox[0], page_bbox[0]),  # x0
            max(self.bbox[1], page_bbox[1]),  # y0
            min(self.bbox[2], page_bbox[2]),  # x1
            min(self.bbox[3], page_bbox[3]),  # y1
        )

        # Only crop if the clipped bbox is valid (has positive width and height)
        if clipped_bbox[2] > clipped_bbox[0] and clipped_bbox[3] > clipped_bbox[1]:
            cropped = filtered_page.crop(clipped_bbox)
        else:
            # If the region is completely outside the page bounds, return empty table
            return []

        # Extract the single largest table from the cropped area
        table = cropped.extract_table(table_settings)

        # Return the table or an empty list if none found
        if table:
            # Apply RTL text processing and content filtering if provided
            processed_table = []
            for row in table:
                processed_row = []
                for cell in row:
                    if cell is not None:
                        # Apply RTL text processing first
                        rtl_processed_cell = self._apply_rtl_processing_to_text(cell)

                        # Then apply content filter if provided
                        if content_filter is not None:
                            filtered_cell = self._apply_content_filter_to_text(
                                rtl_processed_cell, content_filter
                            )
                            processed_row.append(filtered_cell)
                        else:
                            processed_row.append(rtl_processed_cell)
                    else:
                        processed_row.append(cell)
                processed_table.append(processed_row)
            return processed_table
        return []

    def _extract_table_tatr(
        self, use_ocr=False, ocr_config=None, content_filter=None, apply_exclusions=True
    ) -> List[List[str]]:
        """
        Extract table using TATR structure detection.

        Args:
            use_ocr: Whether to apply OCR to each cell for better text extraction
            ocr_config: Optional OCR configuration parameters
            content_filter: Optional content filter to apply to cell values
            apply_exclusions: Whether to apply exclusion regions when extracting cell text (default: True)

        Returns:
            Table data as a list of rows, where each row is a list of cell values
        """
        # Find all rows and headers in this table
        rows = self.page.find_all(f"region[type=table-row][model=tatr]")
        headers = self.page.find_all(f"region[type=table-column-header][model=tatr]")
        columns = self.page.find_all(f"region[type=table-column][model=tatr]")

        # Filter to only include rows/headers/columns that overlap with this table region
        def is_in_table(region):
            # Check for overlap - simplifying to center point for now
            region_center_x = (region.x0 + region.x1) / 2
            region_center_y = (region.top + region.bottom) / 2
            return (
                self.x0 <= region_center_x <= self.x1 and self.top <= region_center_y <= self.bottom
            )

        rows = [row for row in rows if is_in_table(row)]
        headers = [header for header in headers if is_in_table(header)]
        columns = [column for column in columns if is_in_table(column)]

        # Sort rows by vertical position (top to bottom)
        rows.sort(key=lambda r: r.top)

        # Sort columns by horizontal position (left to right)
        columns.sort(key=lambda c: c.x0)

        # Create table data structure
        table_data = []

        # Prepare OCR config if needed
        if use_ocr:
            # Default OCR config focuses on small text with low confidence
            default_ocr_config = {
                "enabled": True,
                "min_confidence": 0.1,  # Lower than default to catch more text
                "detection_params": {
                    "text_threshold": 0.1,  # Lower threshold for low-contrast text
                    "link_threshold": 0.1,  # Lower threshold for connecting text components
                },
            }

            # Merge with provided config if any
            if ocr_config:
                if isinstance(ocr_config, dict):
                    # Update default config with provided values
                    for key, value in ocr_config.items():
                        if (
                            isinstance(value, dict)
                            and key in default_ocr_config
                            and isinstance(default_ocr_config[key], dict)
                        ):
                            # Merge nested dicts
                            default_ocr_config[key].update(value)
                        else:
                            # Replace value
                            default_ocr_config[key] = value
                else:
                    # Not a dict, use as is
                    default_ocr_config = ocr_config

            # Use the merged config
            ocr_config = default_ocr_config

        # Add header row if headers were detected
        if headers:
            header_texts = []
            for header in headers:
                if use_ocr:
                    # Try OCR for better text extraction
                    ocr_elements = header.apply_ocr(**ocr_config)
                    if ocr_elements:
                        ocr_text = " ".join(e.text for e in ocr_elements).strip()
                        if ocr_text:
                            header_texts.append(ocr_text)
                            continue

                # Fallback to normal extraction
                header_text = header.extract_text(apply_exclusions=apply_exclusions).strip()
                if content_filter is not None:
                    header_text = self._apply_content_filter_to_text(header_text, content_filter)
                header_texts.append(header_text)
            table_data.append(header_texts)

        # Process rows
        for row in rows:
            row_cells = []

            # If we have columns, use them to extract cells
            if columns:
                for column in columns:
                    # Create a cell region at the intersection of row and column
                    cell_bbox = (column.x0, row.top, column.x1, row.bottom)

                    # Create a region for this cell
                    from natural_pdf.elements.region import (  # Import here to avoid circular imports
                        Region,
                    )

                    cell_region = Region(self.page, cell_bbox)

                    # Extract text from the cell
                    if use_ocr:
                        # Apply OCR to the cell
                        ocr_elements = cell_region.apply_ocr(**ocr_config)
                        if ocr_elements:
                            # Get text from OCR elements
                            ocr_text = " ".join(e.text for e in ocr_elements).strip()
                            if ocr_text:
                                row_cells.append(ocr_text)
                                continue

                    # Fallback to normal extraction
                    cell_text = cell_region.extract_text(apply_exclusions=apply_exclusions).strip()
                    if content_filter is not None:
                        cell_text = self._apply_content_filter_to_text(cell_text, content_filter)
                    row_cells.append(cell_text)
            else:
                # No column information, just extract the whole row text
                if use_ocr:
                    # Try OCR on the whole row
                    ocr_elements = row.apply_ocr(**ocr_config)
                    if ocr_elements:
                        ocr_text = " ".join(e.text for e in ocr_elements).strip()
                        if ocr_text:
                            row_cells.append(ocr_text)
                            continue

                # Fallback to normal extraction
                row_text = row.extract_text(apply_exclusions=apply_exclusions).strip()
                if content_filter is not None:
                    row_text = self._apply_content_filter_to_text(row_text, content_filter)
                row_cells.append(row_text)

            table_data.append(row_cells)

        return table_data

    def _extract_table_text(self, **text_options) -> List[List[Optional[str]]]:
        """
        Extracts table content based on text alignment analysis.

        Args:
            **text_options: Options passed to analyze_text_table_structure,
                          plus optional 'cell_extraction_func', 'coordinate_grouping_tolerance',
                          'show_progress', 'content_filter', and 'apply_exclusions'.

        Returns:
            Table data as list of lists of strings (or None for empty cells).
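
        Example:
            # Illustrative sketch: a custom extractor passed via 'cell_extraction_func'
            # receives each cell Region and must return a string (or None)
            def my_cell_func(cell_region):
                return cell_region.extract_text().strip()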
        """
        cell_extraction_func = text_options.pop("cell_extraction_func", None)
        # --- Get show_progress option --- #
        show_progress = text_options.pop("show_progress", False)
        # --- Get content_filter option --- #
        content_filter = text_options.pop("content_filter", None)
        # --- Get apply_exclusions option --- #
        apply_exclusions = text_options.pop("apply_exclusions", True)

        # Analyze structure first (or use cached results)
        if "text_table_structure" in self.analyses:
            analysis_results = self.analyses["text_table_structure"]
            logger.debug("Using cached text table structure analysis results.")
        else:
            analysis_results = self.analyze_text_table_structure(**text_options)

        if analysis_results is None or not analysis_results.get("cells"):
            logger.warning(f"Region {self.bbox}: No cells found using 'text' method.")
            return []

        cell_dicts = analysis_results["cells"]

        # --- Grid Reconstruction Logic --- #
        if not cell_dicts:
            return []

        # 1. Get unique sorted top and left coordinates (cell boundaries)
        coord_tolerance = text_options.get("coordinate_grouping_tolerance", 1)
        tops = sorted(
            list(set(round(c["top"] / coord_tolerance) * coord_tolerance for c in cell_dicts))
        )
        lefts = sorted(
            list(set(round(c["left"] / coord_tolerance) * coord_tolerance for c in cell_dicts))
        )

        # Refine boundaries (cluster_coords helper remains the same)
        def cluster_coords(coords):
            if not coords:
                return []
            clustered = []
            current_cluster = [coords[0]]
            for c in coords[1:]:
                if abs(c - current_cluster[-1]) <= coord_tolerance:
                    current_cluster.append(c)
                else:
                    clustered.append(min(current_cluster))
                    current_cluster = [c]
            clustered.append(min(current_cluster))
            return clustered

        unique_tops = cluster_coords(tops)
        unique_lefts = cluster_coords(lefts)

        # Determine iterable for tqdm
        cell_iterator = cell_dicts
        if show_progress:
            # Only wrap if progress should be shown
            cell_iterator = tqdm(
                cell_dicts,
                desc=f"Extracting text from {len(cell_dicts)} cells (text method)",
                unit="cell",
                leave=False,  # Remove the progress bar after completion
            )
        # --- End tqdm Setup --- #

        # 2. Create a lookup map for cell text: {(rounded_top, rounded_left): cell_text}
        cell_text_map = {}
        # --- Use the potentially wrapped iterator --- #
        for cell_data in cell_iterator:
            try:
                cell_region = self.page.region(**cell_data)
                cell_value = None  # Initialize
                if callable(cell_extraction_func):
                    try:
                        cell_value = cell_extraction_func(cell_region)
                        if not isinstance(cell_value, (str, type(None))):
                            logger.warning(
                                f"Custom cell_extraction_func returned non-string/None type ({type(cell_value)}) for cell {cell_data}. Treating as None."
                            )
                            cell_value = None
                    except Exception as func_err:
                        logger.error(
                            f"Error executing custom cell_extraction_func for cell {cell_data}: {func_err}",
                            exc_info=True,
                        )
                        cell_value = None
                else:
                    cell_value = cell_region.extract_text(
                        layout=False,
                        apply_exclusions=apply_exclusions,
                        content_filter=content_filter,
                    ).strip()

                rounded_top = round(cell_data["top"] / coord_tolerance) * coord_tolerance
                rounded_left = round(cell_data["left"] / coord_tolerance) * coord_tolerance
                cell_text_map[(rounded_top, rounded_left)] = cell_value
            except Exception as e:
                logger.warning(f"Could not process cell {cell_data} for text extraction: {e}")

        # 3. Build the final list-of-lists table (loop remains the same)
        final_table = []
        for row_top in unique_tops:
            row_data = []
            for col_left in unique_lefts:
                best_match_key = None
                min_dist_sq = float("inf")
                for map_top, map_left in cell_text_map.keys():
                    if (
                        abs(map_top - row_top) <= coord_tolerance
                        and abs(map_left - col_left) <= coord_tolerance
                    ):
                        dist_sq = (map_top - row_top) ** 2 + (map_left - col_left) ** 2
                        if dist_sq < min_dist_sq:
                            min_dist_sq = dist_sq
                            best_match_key = (map_top, map_left)
                cell_value = cell_text_map.get(best_match_key)
                row_data.append(cell_value)
            final_table.append(row_data)

        return final_table

    @overload
    def find(
        self,
        *,
        text: str,
        overlap: str = "full",
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional["Element"]: ...

    @overload
    def find(
        self,
        selector: str,
        *,
        overlap: str = "full",
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional["Element"]: ...

    def find(
        self,
        selector: Optional[str] = None,  # Now optional
        *,
        text: Optional[str] = None,  # New text parameter
        overlap: str = "full",  # How elements overlap with the region
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> Optional["Element"]:
        """
        Find the first element in this region matching the selector OR text content.

        Provide EITHER `selector` OR `text`, but not both.

        Args:
            selector: CSS-like selector string.
            text: Text content to search for (equivalent to 'text:contains(...)').
            overlap: How to determine if elements overlap with the region: 'full' (fully inside),
                     'partial' (any overlap), or 'center' (center point inside).
                     (default: "full")
            apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
            regex: Whether to use regex for text search (`selector` or `text`) (default: False).
            case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
            **kwargs: Additional parameters for element filtering.

        Returns:
            First matching element or None.
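
        Example:
            # Illustrative sketch; assumes 'region' is a Region and the page
            # contains the text "Total"
            label = region.find(text="Total")
            bold = region.find("text:bold", overlap="partial")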
        """
        # Delegate validation and selector construction to find_all
        elements = self.find_all(
            selector=selector,
            text=text,
            overlap=overlap,
            apply_exclusions=apply_exclusions,
            regex=regex,
            case=case,
            **kwargs,
        )
        return elements.first if elements else None

    @overload
    def find_all(
        self,
        *,
        text: str,
        overlap: str = "full",
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection": ...

    @overload
    def find_all(
        self,
        selector: str,
        *,
        overlap: str = "full",
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection": ...

    def find_all(
        self,
        selector: Optional[str] = None,  # Now optional
        *,
        text: Optional[str] = None,  # New text parameter
        overlap: str = "full",  # How elements overlap with the region
        apply_exclusions: bool = True,
        regex: bool = False,
        case: bool = True,
        **kwargs,
    ) -> "ElementCollection":
        """
        Find all elements in this region matching the selector OR text content.

        Provide EITHER `selector` OR `text`, but not both.

        Args:
            selector: CSS-like selector string.
            text: Text content to search for (equivalent to 'text:contains(...)').
            overlap: How to determine if elements overlap with the region: 'full' (fully inside),
                     'partial' (any overlap), or 'center' (center point inside).
                     (default: "full")
            apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
            regex: Whether to use regex for text search (`selector` or `text`) (default: False).
            case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
            **kwargs: Additional parameters for element filtering.

        Returns:
            ElementCollection with matching elements.
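
        Example:
            # Illustrative sketch; same selector syntax as find()
            words = region.find_all("text", overlap="partial")
            totals = region.find_all(text="Total", case=False)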
        """
        from natural_pdf.elements.element_collection import ElementCollection

        if selector is not None and text is not None:
            raise ValueError("Provide either 'selector' or 'text', not both.")
        if selector is None and text is None:
            raise ValueError("Provide either 'selector' or 'text'.")

        # Validate overlap parameter
        if overlap not in ["full", "partial", "center"]:
            raise ValueError(
                f"Invalid overlap value: {overlap}. Must be 'full', 'partial', or 'center'"
            )

        # Construct selector if 'text' is provided
        effective_selector = ""
        if text is not None:
            escaped_text = text.replace('"', '\\"').replace("'", "\\'")
            effective_selector = f'text:contains("{escaped_text}")'
            logger.debug(
                f"Using text shortcut: find_all(text='{text}') -> find_all('{effective_selector}')"
            )
        elif selector is not None:
            effective_selector = selector
        else:
            raise ValueError("Internal error: No selector or text provided.")

        # Normal case: Region is on a single page
        try:

            # Get all potentially relevant elements from the page
            # Let the page handle its exclusion logic if needed
            potential_elements = self.page.find_all(
                selector=effective_selector,
                apply_exclusions=apply_exclusions,
                regex=regex,
                case=case,
                **kwargs,
            )

            # Filter these elements based on the specified containment method
            region_bbox = self.bbox
            matching_elements = []

            if overlap == "full":  # Fully inside (strict)
                matching_elements = [
                    el
                    for el in potential_elements
                    if el.x0 >= region_bbox[0]
                    and el.top >= region_bbox[1]
                    and el.x1 <= region_bbox[2]
                    and el.bottom <= region_bbox[3]
                ]
            elif overlap == "partial":  # Any overlap
                matching_elements = [el for el in potential_elements if self.intersects(el)]
            elif overlap == "center":  # Center point inside
                matching_elements = [
                    el for el in potential_elements if self.is_element_center_inside(el)
                ]

            return ElementCollection(matching_elements)

        except Exception as e:
            logger.error(f"Error during find_all in region: {e}", exc_info=True)
            return ElementCollection([])

    def apply_ocr(self, replace=True, **ocr_params) -> "Region":
        """
        Apply OCR to this region, adding the resulting text elements to the page.

        This method supports two modes:
        1. **Built-in OCR Engines** (default). Pass typical
           parameters like ``engine='easyocr'`` or ``languages=['en']`` and the method will
           route the request through :class:`OCRManager`.
        2. **Custom OCR Function** – pass a *callable* under the keyword ``function`` (or
           ``ocr_function``). The callable will receive *this* Region instance and should
           return the extracted text (``str``) or ``None``. Internally the call is
           delegated to :meth:`apply_custom_ocr` so the same logic (replacement, element
           creation, etc.) is re-used.

        Example:
            ```python
            def llm_ocr(region):
                image = region.render(resolution=300, crop=True)
                return my_llm_client.ocr(image)

            region.apply_ocr(function=llm_ocr)
            ```

        Args:
            replace: Whether to remove existing OCR elements first (default ``True``).
            **ocr_params: Parameters for the built-in OCR manager *or* the special
                          ``function``/``ocr_function`` keyword to trigger custom mode.

        Returns:
            Self for method chaining.
        """
        # --- Custom OCR function path --------------------------------------------------
        custom_func = ocr_params.pop("function", None) or ocr_params.pop("ocr_function", None)
        if callable(custom_func):
            # Delegate to the specialised helper while preserving key kwargs
            return self.apply_custom_ocr(
                ocr_function=custom_func,
                source_label=ocr_params.pop("source_label", "custom-ocr"),
                replace=replace,
                confidence=ocr_params.pop("confidence", None),
                add_to_page=ocr_params.pop("add_to_page", True),
            )

        # --- Built-in OCR engine path ---------------------------------------------------
        # Ensure OCRManager is available
        if not hasattr(self.page._parent, "_ocr_manager") or self.page._parent._ocr_manager is None:
            logger.error("OCRManager not available on parent PDF. Cannot apply OCR to region.")
            return self

        # If replace is True, find and remove existing OCR elements in this region
        if replace:
            logger.info(
                f"Region {self.bbox}: Removing existing OCR elements before applying new OCR."
            )

            # --- Robust removal: iterate through all OCR elements on the page and
            #     remove those that overlap this region. This avoids reliance on
            #     identity-based look-ups that can break if the ElementManager
            #     has rebuilt its internal lists.

            removed_count = 0

            # Helper to remove a single element safely
            def _safe_remove(elem):
                nonlocal removed_count
                success = False
                if hasattr(elem, "page") and hasattr(elem.page, "_element_mgr"):
                    etype = getattr(elem, "object_type", "word")
                    if etype == "word":
                        etype_key = "words"
                    elif etype == "char":
                        etype_key = "chars"
                    else:
                        etype_key = etype + "s" if not etype.endswith("s") else etype
                    try:
                        success = elem.page._element_mgr.remove_element(elem, etype_key)
                    except Exception:
                        success = False
                if success:
                    removed_count += 1

            # Remove OCR WORD elements overlapping region
            for word in list(self.page._element_mgr.words):
                if getattr(word, "source", None) == "ocr" and self.intersects(word):
                    _safe_remove(word)

            # Remove OCR CHAR dicts overlapping region
            for char in list(self.page._element_mgr.chars):
                # char can be dict or TextElement; normalise
                char_src = (
                    char.get("source") if isinstance(char, dict) else getattr(char, "source", None)
                )
                if char_src == "ocr":
                    # Rough bbox for dicts
                    if isinstance(char, dict):
                        cx0, ctop, cx1, cbottom = (
                            char.get("x0", 0),
                            char.get("top", 0),
                            char.get("x1", 0),
                            char.get("bottom", 0),
                        )
                    else:
                        cx0, ctop, cx1, cbottom = char.x0, char.top, char.x1, char.bottom
                    # Quick overlap check
                    if not (
                        cx1 < self.x0 or cx0 > self.x1 or cbottom < self.top or ctop > self.bottom
                    ):
                        _safe_remove(char)

            logger.info(
                f"Region {self.bbox}: Removed {removed_count} existing OCR elements (words & chars) before re-applying OCR."
            )

        ocr_mgr = self.page._parent._ocr_manager

        # Determine rendering resolution from parameters
        final_resolution = ocr_params.get("resolution")
        if final_resolution is None and hasattr(self.page, "_parent") and self.page._parent:
            final_resolution = getattr(self.page._parent, "_config", {}).get("resolution", 150)
        elif final_resolution is None:
            final_resolution = 150
        logger.debug(
            f"Region {self.bbox}: Applying OCR with resolution {final_resolution} DPI and params: {ocr_params}"
        )

        # Render the page region to an image using the determined resolution
        try:
            # Use render() for clean image without highlights, with cropping
            region_image = self.render(resolution=final_resolution, crop=True)
            if not region_image:
                logger.error("Failed to render region to image for OCR.")
                return self
            logger.debug(f"Region rendered to image size: {region_image.size}")
        except Exception as e:
            logger.error(f"Error rendering region to image for OCR: {e}", exc_info=True)
            return self

        # Prepare args for the OCR Manager
        manager_args = {
            "images": region_image,
            "engine": ocr_params.get("engine"),
            "languages": ocr_params.get("languages"),
            "min_confidence": ocr_params.get("min_confidence"),
            "device": ocr_params.get("device"),
            "options": ocr_params.get("options"),
            "detect_only": ocr_params.get("detect_only"),
        }
        manager_args = {k: v for k, v in manager_args.items() if v is not None}

        # Run OCR on this region's image using the manager
        results = ocr_mgr.apply_ocr(**manager_args)
        if not isinstance(results, list):
            logger.error(
                f"OCRManager returned unexpected type for single region image: {type(results)}"
            )
            return self
        logger.debug(f"Region OCR processing returned {len(results)} results.")

        # Convert results to TextElements
        scale_x = self.width / region_image.width if region_image.width > 0 else 1.0
        scale_y = self.height / region_image.height if region_image.height > 0 else 1.0
        logger.debug(f"Region OCR scaling factors (PDF/Img): x={scale_x:.2f}, y={scale_y:.2f}")
        created_elements = []
        for result in results:
            try:
                img_x0, img_top, img_x1, img_bottom = map(float, result["bbox"])
                pdf_height = (img_bottom - img_top) * scale_y
                page_x0 = self.x0 + (img_x0 * scale_x)
                page_top = self.top + (img_top * scale_y)
                page_x1 = self.x0 + (img_x1 * scale_x)
                page_bottom = self.top + (img_bottom * scale_y)
                raw_conf = result.get("confidence")
                # Convert confidence to float unless it is None/invalid
                try:
                    confidence_val = float(raw_conf) if raw_conf is not None else None
                except (TypeError, ValueError):
                    confidence_val = None

                text_val = result.get("text")  # May legitimately be None in detect_only mode

                element_data = {
                    "text": text_val,
                    "x0": page_x0,
                    "top": page_top,
                    "x1": page_x1,
                    "bottom": page_bottom,
                    "width": page_x1 - page_x0,
                    "height": page_bottom - page_top,
                    "object_type": "word",
                    "source": "ocr",
                    "confidence": confidence_val,
                    "fontname": "OCR",
                    "size": round(pdf_height) if pdf_height > 0 else 10.0,
                    "page_number": self.page.number,
                    "bold": False,
                    "italic": False,
                    "upright": True,
                    "doctop": page_top + self.page._page.initial_doctop,
                }
                ocr_char_dict = element_data.copy()
                ocr_char_dict["object_type"] = "char"
                ocr_char_dict.setdefault("adv", ocr_char_dict.get("width", 0))
                element_data["_char_dicts"] = [ocr_char_dict]
                from natural_pdf.elements.text import TextElement

                elem = TextElement(element_data, self.page)
                created_elements.append(elem)
                self.page._element_mgr.add_element(elem, element_type="words")
                self.page._element_mgr.add_element(ocr_char_dict, element_type="chars")
            except Exception as e:
                logger.error(
                    f"Failed to convert region OCR result to element: {result}. Error: {e}",
                    exc_info=True,
                )
        logger.info(f"Region {self.bbox}: Added {len(created_elements)} elements from OCR.")
        return self

    def apply_custom_ocr(
        self,
        ocr_function: Callable[["Region"], Optional[str]],
        source_label: str = "custom-ocr",
        replace: bool = True,
        confidence: Optional[float] = None,
        add_to_page: bool = True,
    ) -> "Region":
        """
        Apply a custom OCR function to this region and create text elements from the results.

        This is useful when you want to use a custom OCR method (e.g., an LLM API,
        specialized OCR service, or any custom logic) instead of the built-in OCR engines.

        Args:
            ocr_function: A callable that takes a Region and returns the OCR'd text (or None).
                          The function receives this region as its argument and should return
                          the extracted text as a string, or None if no text was found.
            source_label: Label to identify the source of these text elements (default: "custom-ocr").
                          This will be set as the 'source' attribute on created elements.
            replace: If True (default), removes existing OCR elements in this region before
                     adding new ones. If False, adds new OCR elements alongside existing ones.
            confidence: Optional confidence score for the OCR result (0.0-1.0).
                        If None, defaults to 1.0 if text is returned, 0.0 if None is returned.
            add_to_page: If True (default), adds the created text element to the page.
                         If False, creates the element but doesn't add it to the page.

        Returns:
            Self for method chaining.

        Example:
            # Using with an LLM
            def ocr_with_llm(region):
                image = region.render(resolution=300, crop=True)
                # Call your LLM API here
                return llm_client.ocr(image)

            region.apply_custom_ocr(ocr_with_llm)

            # Using with a custom OCR service
            def ocr_with_service(region):
                img_bytes = region.render(crop=True).tobytes()
                response = ocr_service.process(img_bytes)
                return response.text

            region.apply_custom_ocr(ocr_with_service, source_label="my-ocr-service")
        """
        # If replace is True, remove existing OCR elements in this region
        if replace:
            logger.info(
                f"Region {self.bbox}: Removing existing OCR elements before applying custom OCR."
            )

            removed_count = 0

            # Helper to remove a single element safely
            def _safe_remove(elem):
                nonlocal removed_count
                success = False
                if hasattr(elem, "page") and hasattr(elem.page, "_element_mgr"):
                    etype = getattr(elem, "object_type", "word")
                    if etype == "word":
                        etype_key = "words"
                    elif etype == "char":
                        etype_key = "chars"
                    else:
                        etype_key = etype + "s" if not etype.endswith("s") else etype
                    try:
                        success = elem.page._element_mgr.remove_element(elem, etype_key)
                    except Exception:
                        success = False
                if success:
                    removed_count += 1

            # Remove ALL OCR elements overlapping this region
            # Remove elements with source=="ocr" (built-in OCR) or matching the source_label (previous custom OCR)
            for word in list(self.page._element_mgr.words):
                word_source = getattr(word, "source", "")
                # Match built-in OCR behavior: remove elements with source "ocr" exactly
                # Also remove elements with the same source_label to avoid duplicates
                if (word_source == "ocr" or word_source == source_label) and self.intersects(word):
                    _safe_remove(word)

            # Also remove char dicts if needed (matching built-in OCR)
            for char in list(self.page._element_mgr.chars):
                # char can be dict or TextElement; normalize
                char_src = (
                    char.get("source") if isinstance(char, dict) else getattr(char, "source", None)
                )
                if char_src == "ocr" or char_src == source_label:
                    # Rough bbox for dicts
                    if isinstance(char, dict):
                        cx0, ctop, cx1, cbottom = (
                            char.get("x0", 0),
                            char.get("top", 0),
                            char.get("x1", 0),
                            char.get("bottom", 0),
                        )
                    else:
                        cx0, ctop, cx1, cbottom = char.x0, char.top, char.x1, char.bottom
                    # Quick overlap check
                    if not (
                        cx1 < self.x0 or cx0 > self.x1 or cbottom < self.top or ctop > self.bottom
                    ):
                        _safe_remove(char)

            if removed_count > 0:
                logger.info(f"Region {self.bbox}: Removed {removed_count} existing OCR elements.")

        # Call the custom OCR function
        try:
            logger.debug(f"Region {self.bbox}: Calling custom OCR function...")
            ocr_text = ocr_function(self)

            if ocr_text is not None and not isinstance(ocr_text, str):
                logger.warning(
                    f"Custom OCR function returned non-string type ({type(ocr_text)}). "
                    f"Converting to string."
                )
                ocr_text = str(ocr_text)

        except Exception as e:
            logger.error(
                f"Error calling custom OCR function for region {self.bbox}: {e}", exc_info=True
            )
            return self

        # Create text element if we got text
        if ocr_text is not None:
            # Use the to_text_element method to create the element
            text_element = self.to_text_element(
                text_content=ocr_text,
                source_label=source_label,
                confidence=confidence,
                add_to_page=add_to_page,
            )

            logger.info(
                f"Region {self.bbox}: Created text element with {len(ocr_text)} chars"
                f"{' and added to page' if add_to_page else ''}"
            )
        else:
            logger.debug(f"Region {self.bbox}: Custom OCR function returned None (no text found)")

        return self

    def get_section_between(
        self,
        start_element=None,
        end_element=None,
        include_boundaries="both",
        orientation="vertical",
    ):
        """
        Get a section between two elements within this region.

        Args:
            start_element: Element marking the start of the section
            end_element: Element marking the end of the section
            include_boundaries: How to include boundary elements: 'start', 'end', 'both', or 'none'
            orientation: 'vertical' (default) or 'horizontal' - determines section direction

        Returns:
            Region representing the section
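
        Example:
            # Illustrative sketch; assumes two text elements inside this region
            start = region.find(text="Introduction")
            end = region.find(text="Methods")
            section = region.get_section_between(start, end, include_boundaries="start")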
        """
        # Get elements only within this region first
        elements = self.get_elements()

        # With no contained elements, return an empty region at this region's top-left
        if not elements:
            logger.warning(
                f"get_section_between called on region {self.bbox} with no contained elements."
            )
            # Return an empty region at the start of the parent region
            return Region(self.page, (self.x0, self.top, self.x0, self.top))

        # Sort elements in reading order
        elements.sort(key=lambda e: (e.top, e.x0))

        # Find start index
        start_idx = 0
        if start_element:
            try:
                start_idx = elements.index(start_element)
            except ValueError:
                # Start element not in region, use first element
                logger.debug("Start element not found in region, using first element.")
                start_element = elements[0]  # Use the actual first element
                start_idx = 0
        else:
            start_element = elements[0]  # Default start is first element

        # Find end index
        end_idx = len(elements) - 1
        if end_element:
            try:
                end_idx = elements.index(end_element)
            except ValueError:
                # End element not in region, use last element
                logger.debug("End element not found in region, using last element.")
                end_element = elements[-1]  # Use the actual last element
                end_idx = len(elements) - 1
        else:
            end_element = elements[-1]  # Default end is last element

        # Validate orientation parameter
        if orientation not in ["vertical", "horizontal"]:
            raise ValueError(f"orientation must be 'vertical' or 'horizontal', got '{orientation}'")

        # Use centralized section utilities
        from natural_pdf.utils.sections import calculate_section_bounds, validate_section_bounds

        # Calculate section boundaries
        bounds = calculate_section_bounds(
            start_element=start_element,
            end_element=end_element,
            include_boundaries=include_boundaries,
            orientation=orientation,
            parent_bounds=self.bbox,
        )

        # Validate boundaries
        if not validate_section_bounds(bounds, orientation):
            # Return an empty region at the start position
            x0, top, _, _ = bounds
            return Region(self.page, (x0, top, x0, top))

        # Create new region
        section = Region(self.page, bounds)

        # Store the original boundary elements and exclusion info
        section.start_element = start_element
        section.end_element = end_element
        section._boundary_exclusions = include_boundaries

        return section

    def get_sections(
        self,
        start_elements=None,
        end_elements=None,
        include_boundaries="both",
        orientation="vertical",
    ) -> "ElementCollection[Region]":
        """
        Get sections within this region based on start/end elements.

        Args:
            start_elements: Elements or selector string that mark the start of sections
            end_elements: Elements or selector string that mark the end of sections
            include_boundaries: How to include boundary elements: 'start', 'end', 'both', or 'none'
            orientation: 'vertical' (default) or 'horizontal' - determines section direction

        Returns:
            ElementCollection of Region objects representing the extracted sections
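
        Example:
            # Illustrative sketch; one section per bold heading found in the region
            sections = region.get_sections(start_elements="text:bold")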
        """
        from natural_pdf.elements.element_collection import ElementCollection
        from natural_pdf.utils.sections import extract_sections_from_region

        # Use centralized section extraction logic
        sections = extract_sections_from_region(
            region=self,
            start_elements=start_elements,
            end_elements=end_elements,
            include_boundaries=include_boundaries,
            orientation=orientation,
        )

        return ElementCollection(sections)

    def split(self, divider, **kwargs) -> "ElementCollection[Region]":
        """
        Divide this region into sections based on the provided divider elements.

        Args:
            divider: Elements or selector string that mark section boundaries
            **kwargs: Additional parameters passed to get_sections()
                - include_boundaries: How to include boundary elements (default: 'start')
                - orientation: 'vertical' or 'horizontal' (default: 'vertical')

        Returns:
            ElementCollection of Region objects representing the sections

        Example:
            # Split a region by bold text
            sections = region.split("text:bold")

            # Split horizontally by vertical lines
            sections = region.split("line[orientation=vertical]", orientation="horizontal")
        """
        # Default to 'start' boundaries for split (include divider at start of each section)
        if "include_boundaries" not in kwargs:
            kwargs["include_boundaries"] = "start"

        sections = self.get_sections(start_elements=divider, **kwargs)

        # Add section before first divider if there's content
        if sections and hasattr(sections[0], "start_element"):
            first_divider = sections[0].start_element
            if first_divider:
                # Get all elements before the first divider
                all_elements = self.get_elements()
                if all_elements and all_elements[0] != first_divider:
                    # Create section from start to just before first divider
                    initial_section = self.get_section_between(
                        start_element=None,
                        end_element=first_divider,
                        include_boundaries="none",
                        orientation=kwargs.get("orientation", "vertical"),
                    )
                    if initial_section and initial_section.get_elements():
                        sections.insert(0, initial_section)

        return sections

    def create_cells(self):
        """
        Create cell regions for a detected table by intersecting its
        row and column regions, and add them to the page.

        Assumes child row and column regions are already present on the page.

        Returns:
            Self for method chaining.
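
        Example:
            # Illustrative sketch; assumes layout analysis has already added
            # table, table-row and table-column regions to the page
            table = page.find("region[type=table]")
            table.create_cells()
            cells = page.find_all("region[type=table-cell]")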
        """
        # Ensure this is called on a table region
        if self.region_type not in (
            "table",
            "tableofcontents",
        ):  # Allow for ToC which might have structure
            raise ValueError(
                f"create_cells should be called on a 'table' or 'tableofcontents' region, not '{self.region_type}'"
            )

        # Find rows and columns associated with this page
        # Remove the model-specific filter
        rows = self.page.find_all("region[type=table-row]")
        columns = self.page.find_all("region[type=table-column]")

        # Filter to only include those that overlap with this table region
        def is_in_table(element):
            # Use a simple overlap check (more robust than just center point)
            # Check if element's bbox overlaps with self.bbox
            return (
                hasattr(element, "bbox")
                and element.x0 < self.x1  # Ensure element has bbox
                and element.x1 > self.x0
                and element.top < self.bottom
                and element.bottom > self.top
            )

        table_rows = [r for r in rows if is_in_table(r)]
        table_columns = [c for c in columns if is_in_table(c)]

        if not table_rows or not table_columns:
            # Use page's logger if available
            logger_instance = getattr(self._page, "logger", logger)
            logger_instance.warning(
                f"Region {self.bbox}: Cannot create cells. No overlapping row or column regions found."
            )
            return self  # Return self even if no cells created

        # Sort rows and columns
        table_rows.sort(key=lambda r: r.top)
        table_columns.sort(key=lambda c: c.x0)

        # Create cells and add them to the page's element manager
        created_count = 0
        for row in table_rows:
            for column in table_columns:
                # Calculate intersection bbox for the cell
                cell_x0 = max(row.x0, column.x0)
                cell_y0 = max(row.top, column.top)
                cell_x1 = min(row.x1, column.x1)
                cell_y1 = min(row.bottom, column.bottom)

                # Only create a cell if the intersection is valid (positive width/height)
                if cell_x1 > cell_x0 and cell_y1 > cell_y0:
                    # Create cell region at the intersection
                    cell = self.page.create_region(cell_x0, cell_y0, cell_x1, cell_y1)
                    # Set metadata
                    cell.source = "derived"
                    cell.region_type = "table-cell"  # Explicitly set type
                    cell.normalized_type = "table-cell"  # And normalized type
                    # Inherit model from the parent table region
                    cell.model = self.model
                    cell.parent_region = self  # Link cell to parent table region

                    # Add the cell region to the page's element manager
                    self.page._element_mgr.add_region(cell)
                    created_count += 1

        # NOTE: created cells are registered on the page but are not tracked as
        # children of this table region.

        logger_instance = getattr(self._page, "logger", logger)
        logger_instance.info(
            f"Region {self.bbox} (Model: {self.model}): Created and added {created_count} cell regions."
        )

        return self  # Return self for chaining

    def ask(
        self,
        question: Union[str, List[str], Tuple[str, ...]],
        min_confidence: float = 0.1,
        model: str = None,
        debug: bool = False,
        **kwargs,
    ) -> Union[Dict[str, Any], List[Dict[str, Any]]]:
        """
        Ask a question about the region content using document QA.

        This method uses a document question answering model to extract answers from the region content.
        It leverages both textual content and layout information for better understanding.

        Args:
            question: The question (or list/tuple of questions) to ask about the region content
            min_confidence: Minimum confidence threshold for answers (0.0-1.0)
            model: Optional model name to use for QA (if None, uses default model)
            debug: Whether to enable debug output from the QA engine
            **kwargs: Additional parameters to pass to the QA engine

        Returns:
            Dictionary with answer details (a list of such dictionaries when
            multiple questions are provided): {
                "answer": extracted text,
                "confidence": confidence score,
                "found": whether an answer was found,
                "page_num": page number,
                "region": reference to this region,
                "source_elements": list of elements that contain the answer (if found)
            }
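
        Example:
            # Illustrative sketch; requires the optional AI dependencies
            # (pip install natural-pdf[ai])
            result = region.ask("What is the invoice total?")
            if result["found"]:
                print(result["answer"], result["confidence"])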
        """
        try:
            from natural_pdf.qa.document_qa import get_qa_engine
        except ImportError:
            logger.error(
                "Question answering requires optional dependencies. Install with `pip install natural-pdf[ai]`"
            )
            return {
                "answer": None,
                "confidence": 0.0,
                "found": False,
                "page_num": self.page.number,
                "source_elements": [],
                "region": self,
            }

        # Get or initialize QA engine with specified model
        try:
            qa_engine = get_qa_engine(model_name=model) if model else get_qa_engine()
        except Exception as e:
            logger.error(f"Failed to initialize QA engine (model: {model}): {e}", exc_info=True)
            return {
                "answer": None,
                "confidence": 0.0,
                "found": False,
                "page_num": self.page.number,
                "source_elements": [],
                "region": self,
            }

        # Ask the question using the QA engine
        try:
            return qa_engine.ask_pdf_region(
                self, question, min_confidence=min_confidence, debug=debug, **kwargs
            )
        except Exception as e:
            logger.error(f"Error during qa_engine.ask_pdf_region: {e}", exc_info=True)
            return {
                "answer": None,
                "confidence": 0.0,
                "found": False,
                "page_num": self.page.number,
                "source_elements": [],
                "region": self,
            }

    def add_child(self, child):
        """
        Add a child region to this region.

        Used for hierarchical document structure when using models like Docling
        that understand document hierarchy.

        Args:
            child: Region object to add as a child

        Returns:
            Self for method chaining
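
        Example:
            # Illustrative sketch; attaches a caption region to its figure region
            figure_region.add_child(caption_region)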
        """
        self.child_regions.append(child)
        child.parent_region = self
        return self

    def get_children(self, selector=None):
        """
        Get immediate child regions, optionally filtered by selector.

        Args:
            selector: Optional selector to filter children

        Returns:
            List of child regions matching the selector
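
        Example:
            # Illustrative sketch; assumes children were added by a hierarchical
            # model such as Docling ('section-header' is an assumed type label)
            headers = region.get_children("region[type=section-header]")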
        """
        if selector is None:
            return self.child_regions

        # Use existing selector parser to filter
        try:
            selector_obj = parse_selector(selector)
            filter_func = selector_to_filter_func(selector_obj)  # Removed region=self
            matched = [child for child in self.child_regions if filter_func(child)]
            logger.debug(
                f"get_children: found {len(matched)} of {len(self.child_regions)} children matching '{selector}'"
            )
            return matched
        except Exception as e:
            logger.error(f"Error applying selector in get_children: {e}", exc_info=True)
            return []  # Return empty list on error

    def get_descendants(self, selector=None):
        """
        Get all descendant regions (children, grandchildren, etc.), optionally filtered by selector.

        Args:
            selector: Optional selector to filter descendants

        Returns:
            List of descendant regions matching the selector
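
        Example:
            # Illustrative sketch; collects every region in the hierarchy, then
            # filters to an assumed 'text' type label
            all_regions = region.get_descendants()
            text_regions = region.get_descendants("region[type=text]")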
        """
        all_descendants = []
        queue = list(self.child_regions)  # Start with direct children

        while queue:
            current = queue.pop(0)
            all_descendants.append(current)
            # Add current's children to the queue for processing
            if hasattr(current, "child_regions"):
                queue.extend(current.child_regions)

        logger.debug(f"get_descendants: found {len(all_descendants)} total descendants")

        # Filter by selector if provided
        if selector is not None:
            try:
                selector_obj = parse_selector(selector)
                filter_func = selector_to_filter_func(selector_obj)  # Removed region=self
                matched = [desc for desc in all_descendants if filter_func(desc)]
                logger.debug(f"get_descendants: filtered to {len(matched)} matching '{selector}'")
                return matched
            except Exception as e:
                logger.error(f"Error applying selector in get_descendants: {e}", exc_info=True)
                return []  # Return empty list on error

        return all_descendants

    def __add__(
        self, other: Union["Element", "Region", "ElementCollection"]
    ) -> "ElementCollection":
        """Add regions/elements together to create an ElementCollection.

        This allows intuitive combination of regions using the + operator:
        ```python
        complainant = section.find("text:contains(Complainant)").right(until='text')
        dob = section.find("text:contains(DOB)").right(until='text')
        combined = complainant + dob  # Creates ElementCollection with both regions
        ```

        Args:
            other: Another Region, Element or ElementCollection to combine

        Returns:
            ElementCollection containing all elements
        """
        from natural_pdf.elements.base import Element
        from natural_pdf.elements.element_collection import ElementCollection

        # Create a list starting with self
        elements = [self]

        # Add the other element(s)
        if isinstance(other, (Element, Region)):
            elements.append(other)
        elif isinstance(other, ElementCollection):
            elements.extend(other)
        elif hasattr(other, "__iter__") and not isinstance(other, (str, bytes)):
            # Handle other iterables but exclude strings
            elements.extend(other)
        else:
            raise TypeError(f"Cannot add Region with {type(other)}")

        return ElementCollection(elements)

    def __radd__(
        self, other: Union["Element", "Region", "ElementCollection"]
    ) -> "ElementCollection":
        """Right-hand addition to support ElementCollection + Region."""
        if other == 0:
            # This handles sum() which starts with 0
            from natural_pdf.elements.element_collection import ElementCollection

            return ElementCollection([self])
        return self.__add__(other)

    def __repr__(self) -> str:
        """String representation of the region."""
        poly_info = " (Polygon)" if self.has_polygon else ""
        name_info = f" name='{self.name}'" if self.name else ""
        type_info = f" type='{self.region_type}'" if self.region_type else ""
        source_info = f" source='{self.source}'" if self.source else ""

        # Add checkbox state if this is a checkbox
        checkbox_info = ""
        if self.region_type == "checkbox" and hasattr(self, "is_checked"):
            state = "checked" if self.is_checked else "unchecked"
            checkbox_info = f" [{state}]"

        return f"<Region{name_info}{type_info}{source_info}{checkbox_info} bbox={self.bbox}{poly_info}>"

    def update_text(
        self,
        transform: Callable[[Any], Optional[str]],
        *,
        selector: str = "text",
        apply_exclusions: bool = False,
    ) -> "Region":
        """Apply *transform* to every text element matched by *selector* inside this region.

        The heavy lifting is delegated to :py:meth:`TextMixin.update_text`; this
        override simply ensures the search is scoped to the region.
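
        Example:
            # Illustrative sketch; transform receives each matched text element
            region.update_text(lambda el: el.text.upper() if el.text else None)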
        """

        return TextMixin.update_text(
            self, transform, selector=selector, apply_exclusions=apply_exclusions
        )

    # --- Classification Mixin Implementation --- #
    def _get_classification_manager(self) -> "ClassificationManager":
        if (
            not hasattr(self, "page")
            or not hasattr(self.page, "pdf")
            or not hasattr(self.page.pdf, "get_manager")
        ):
            raise AttributeError(
                "ClassificationManager cannot be accessed: Parent Page, PDF, or get_manager method missing."
            )
        try:
            # Use the PDF's manager registry accessor via page
            return self.page.pdf.get_manager("classification")
        except (ValueError, RuntimeError, AttributeError) as e:
            # Wrap potential errors from get_manager for clarity
            raise AttributeError(
                f"Failed to get ClassificationManager from PDF via Page: {e}"
            ) from e

    def _get_classification_content(
        self, model_type: str, **kwargs
    ) -> Union[str, "Image"]:  # Use "Image" for lazy import
        if model_type == "text":
            text_content = self.extract_text(layout=False)  # Simple join for classification
            if not text_content or text_content.isspace():
                raise ValueError("Cannot classify region with 'text' model: No text content found.")
            return text_content
        elif model_type == "vision":
            # Get resolution from manager/kwargs if possible, else default
            # We access manager via the method to ensure it's available
            manager = self._get_classification_manager()
            default_resolution = 150  # Manager doesn't store default res, set here
            # Note: classify() passes resolution via **kwargs if user specifies
            resolution = kwargs.get("resolution", default_resolution)

            img = self.render(
                resolution=resolution,
                crop=True,  # Just the region content
            )
            if img is None:
                raise ValueError(
                    "Cannot classify region with 'vision' model: Failed to render image."
                )
            return img
        else:
            raise ValueError(f"Unsupported model_type for classification: {model_type}")

    def _get_metadata_storage(self) -> Dict[str, Any]:
        # Ensure metadata exists
        if not hasattr(self, "metadata") or self.metadata is None:
            self.metadata = {}
        return self.metadata

    # --- End Classification Mixin Implementation --- #

    # --- NEW METHOD: analyze_text_table_structure ---
    def analyze_text_table_structure(
        self,
        snap_tolerance: int = 10,
        join_tolerance: int = 3,
        min_words_vertical: int = 3,
        min_words_horizontal: int = 1,
        intersection_tolerance: int = 3,
        expand_bbox: Optional[Dict[str, int]] = None,
        **kwargs,
    ) -> Optional[Dict]:
        """
        Analyzes the text elements within the region (or slightly expanded area)
        to find potential table structure (lines, cells) using text alignment logic
        adapted from pdfplumber.

        Args:
            snap_tolerance: Tolerance for snapping parallel lines.
            join_tolerance: Tolerance for joining collinear lines.
            min_words_vertical: Minimum words needed to define a vertical line.
            min_words_horizontal: Minimum words needed to define a horizontal line.
            intersection_tolerance: Tolerance for detecting line intersections.
            expand_bbox: Optional dictionary to expand the search area slightly beyond
                         the region's exact bounds (e.g., {'left': 5, 'right': 5}).
            **kwargs: Additional keyword arguments passed to
                      find_text_based_tables (e.g., specific x/y tolerances).

        Returns:
            A dictionary containing 'horizontal_edges', 'vertical_edges', 'cells' (list of dicts),
            and 'intersections', or None if pdfplumber is unavailable or an error occurs.
        """

        # Determine the search region (expand if requested)
        search_region = self
        if expand_bbox and isinstance(expand_bbox, dict):
            try:
                search_region = self.expand(**expand_bbox)
                logger.debug(
                    f"Expanded search region for text table analysis to: {search_region.bbox}"
                )
            except Exception as e:
                logger.warning(f"Could not expand region bbox: {e}. Using original region.")
                search_region = self

        # Find text elements within the search region
        text_elements = search_region.find_all(
            "text", apply_exclusions=False
        )  # Use unfiltered text
        if not text_elements:
            logger.info(f"Region {self.bbox}: No text elements found for text table analysis.")
            return {"horizontal_edges": [], "vertical_edges": [], "cells": [], "intersections": {}}

        # Extract bounding boxes
        bboxes = [element.bbox for element in text_elements if hasattr(element, "bbox")]
        if not bboxes:
            logger.info(f"Region {self.bbox}: No bboxes extracted from text elements.")
            return {"horizontal_edges": [], "vertical_edges": [], "cells": [], "intersections": {}}

        # Call the utility function
        try:
            analysis_results = find_text_based_tables(
                bboxes=bboxes,
                snap_tolerance=snap_tolerance,
                join_tolerance=join_tolerance,
                min_words_vertical=min_words_vertical,
                min_words_horizontal=min_words_horizontal,
                intersection_tolerance=intersection_tolerance,
                **kwargs,  # Pass through any extra specific tolerance args
            )
            # Store results in the region's analyses cache
            self.analyses["text_table_structure"] = analysis_results
            return analysis_results
        except ImportError:
            logger.error("pdfplumber library is required for 'text' table analysis but not found.")
            return None
        except Exception as e:
            logger.error(f"Error during text-based table analysis: {e}", exc_info=True)
            return None

    # --- END NEW METHOD ---

    # --- NEW METHOD: get_text_table_cells ---
    def get_text_table_cells(
        self,
        snap_tolerance: int = 10,
        join_tolerance: int = 3,
        min_words_vertical: int = 3,
        min_words_horizontal: int = 1,
        intersection_tolerance: int = 3,
        expand_bbox: Optional[Dict[str, int]] = None,
        **kwargs,
    ) -> "ElementCollection[Region]":
        """
        Analyzes text alignment to find table cells and returns them as
        temporary Region objects without adding them to the page.

        Args:
            snap_tolerance: Tolerance for snapping parallel lines.
            join_tolerance: Tolerance for joining collinear lines.
            min_words_vertical: Minimum words needed to define a vertical line.
            min_words_horizontal: Minimum words needed to define a horizontal line.
            intersection_tolerance: Tolerance for detecting line intersections.
            expand_bbox: Optional dictionary to expand the search area slightly beyond
                         the region's exact bounds (e.g., {'left': 5, 'right': 5}).
            **kwargs: Additional keyword arguments passed to
                      find_text_based_tables (e.g., specific x/y tolerances).

        Returns:
            An ElementCollection containing temporary Region objects for each detected cell,
            or an empty ElementCollection if no cells are found or an error occurs.
        """
        from natural_pdf.elements.element_collection import ElementCollection

        # 1. Perform the analysis (or use cached results)
        if "text_table_structure" in self.analyses:
            analysis_results = self.analyses["text_table_structure"]
            logger.debug("get_text_table_cells: Using cached analysis results.")
        else:
            analysis_results = self.analyze_text_table_structure(
                snap_tolerance=snap_tolerance,
                join_tolerance=join_tolerance,
                min_words_vertical=min_words_vertical,
                min_words_horizontal=min_words_horizontal,
                intersection_tolerance=intersection_tolerance,
                expand_bbox=expand_bbox,
                **kwargs,
            )

        # 2. Check if analysis was successful and cells were found
        if analysis_results is None or not analysis_results.get("cells"):
            logger.info(f"Region {self.bbox}: No cells found by text table analysis.")
            return ElementCollection([])  # Return empty collection

        # 3. Create temporary Region objects for each cell dictionary
        cell_regions = []
        for cell_data in analysis_results["cells"]:
            try:
                # Use page.region to create the region object
                # It expects left, top, right, bottom keys
                cell_region = self.page.region(**cell_data)

                # Set metadata on the temporary region
                cell_region.region_type = "table-cell"
                cell_region.normalized_type = "table-cell"
                cell_region.model = "pdfplumber-text"
                cell_region.source = "volatile"  # Indicate it's not managed/persistent
                cell_region.parent_region = self  # Link back to the region it came from

                cell_regions.append(cell_region)
            except Exception as e:
                logger.warning(f"Could not create Region object for cell data {cell_data}: {e}")

        # 4. Return the list wrapped in an ElementCollection
        logger.debug(f"get_text_table_cells: Created {len(cell_regions)} temporary cell regions.")
        return ElementCollection(cell_regions)

    # --- END NEW METHOD ---

    def to_text_element(
        self,
        text_content: Optional[Union[str, Callable[["Region"], Optional[str]]]] = None,
        source_label: str = "derived_from_region",
        object_type: str = "word",  # Or "char", controls how it's categorized
        default_font_size: float = 10.0,
        default_font_name: str = "RegionContent",
        confidence: Optional[float] = None,  # Allow overriding confidence
        add_to_page: bool = False,  # NEW: Option to add to page
    ) -> "TextElement":
        """
        Creates a new TextElement object based on this region's geometry.

        The text for the new TextElement can be provided directly,
        generated by a callback function, or left as None.

        Args:
            text_content:
                - If a string, this will be the text of the new TextElement.
                - If a callable, it will be called with this region instance
                  and its return value (a string or None) will be the text.
                - If None (default), the TextElement's text will be None.
            source_label: The 'source' attribute for the new TextElement.
            object_type: The 'object_type' for the TextElement's data dict
                         (e.g., "word", "char").
            default_font_size: Placeholder font size if text is generated.
            default_font_name: Placeholder font name if text is generated.
            confidence: Confidence score for the text. If text_content is None,
                        defaults to 0.0. If text is provided/generated, defaults to 1.0
                        unless specified.
            add_to_page: If True, the created TextElement will be added to the
                         region's parent page. (Default: False)

        Returns:
            A new TextElement instance.

        Raises:
            ValueError: If the region does not have a valid 'page' attribute.
        """
        actual_text: Optional[str] = None
        if isinstance(text_content, str):
            actual_text = text_content
        elif callable(text_content):
            try:
                actual_text = text_content(self)
            except Exception as e:
                logger.error(
                    f"Error executing text_content callback for region {self.bbox}: {e}",
                    exc_info=True,
                )
                actual_text = None  # Ensure actual_text is None on error

        final_confidence = confidence
        if final_confidence is None:
            final_confidence = 1.0 if actual_text is not None and actual_text.strip() else 0.0

        if not hasattr(self, "page") or self.page is None:
            raise ValueError("Region must have a valid 'page' attribute to create a TextElement.")

        # Create character dictionaries for the text
        char_dicts = []
        if actual_text:
            # Create a single character dict that spans the entire region
            # This is a simplified approach - OCR engines typically create one per character
            char_dict = {
                "text": actual_text,
                "x0": self.x0,
                "top": self.top,
                "x1": self.x1,
                "bottom": self.bottom,
                "width": self.width,
                "height": self.height,
                "object_type": "char",
                "page_number": self.page.page_number,
                "fontname": default_font_name,
                "size": default_font_size,
                "upright": True,
                "direction": 1,
                "adv": self.width,
                "source": source_label,
                "confidence": final_confidence,
                "stroking_color": (0, 0, 0),
                "non_stroking_color": (0, 0, 0),
            }
            char_dicts.append(char_dict)

        elem_data = {
            "text": actual_text,
            "x0": self.x0,
            "top": self.top,
            "x1": self.x1,
            "bottom": self.bottom,
            "width": self.width,
            "height": self.height,
            "object_type": object_type,
            "page_number": self.page.page_number,
            "stroking_color": getattr(self, "stroking_color", (0, 0, 0)),
            "non_stroking_color": getattr(self, "non_stroking_color", (0, 0, 0)),
            "fontname": default_font_name,
            "size": default_font_size,
            "upright": True,
            "direction": 1,
            "adv": self.width,
            "source": source_label,
            "confidence": final_confidence,
            "_char_dicts": char_dicts,
        }
        text_element = TextElement(elem_data, self.page)

        if add_to_page:
            if hasattr(self.page, "_element_mgr") and self.page._element_mgr is not None:
                add_as_type = (
                    "words"
                    if object_type == "word"
                    else "chars" if object_type == "char" else object_type
                )
                # REMOVED try-except block around add_element
                self.page._element_mgr.add_element(text_element, element_type=add_as_type)
                logger.debug(
                    f"TextElement created from region {self.bbox} and added to page {self.page.page_number} as {add_as_type}."
                )
                # Also add character dictionaries to the chars collection
                if char_dicts and object_type == "word":
                    for char_dict in char_dicts:
                        self.page._element_mgr.add_element(char_dict, element_type="chars")
            else:
                page_num_str = (
                    str(self.page.page_number) if hasattr(self.page, "page_number") else "N/A"
                )
                logger.warning(
                    f"Cannot add TextElement to page: Page {page_num_str} for region {self.bbox} is missing '_element_mgr'."
                )

        return text_element

    # ------------------------------------------------------------------
    # Unified analysis storage (maps to metadata["analysis"])
    # ------------------------------------------------------------------

    @property
    def analyses(self) -> Dict[str, Any]:
        if not hasattr(self, "metadata") or self.metadata is None:
            self.metadata = {}
        return self.metadata.setdefault("analysis", {})

    @analyses.setter
    def analyses(self, value: Dict[str, Any]):
        if not hasattr(self, "metadata") or self.metadata is None:
            self.metadata = {}
        self.metadata["analysis"] = value

    # ------------------------------------------------------------------
    # New helper: build table from pre-computed table_cell regions
    # ------------------------------------------------------------------

    def _extract_table_from_cells(
        self, cell_regions: List["Region"], content_filter=None, apply_exclusions=True
    ) -> List[List[Optional[str]]]:
        """Construct a table (list-of-lists) from table_cell regions.

        This assumes each cell Region has metadata.row_index / col_index as written by
        detect_table_structure_from_lines().  If these keys are missing we will
        fall back to sorting by geometry.

        Args:
            cell_regions: List of table cell Region objects to extract text from
            content_filter: Optional content filter to apply to cell text extraction
        """
        if not cell_regions:
            return []

        # Attempt to use explicit indices first
        all_row_idxs = []
        all_col_idxs = []
        for cell in cell_regions:
            try:
                r_idx = int(cell.metadata.get("row_index"))
                c_idx = int(cell.metadata.get("col_index"))
                all_row_idxs.append(r_idx)
                all_col_idxs.append(c_idx)
            except Exception:
                # Not all cells have indices – clear the lists so we switch to geometric sorting
                all_row_idxs = []
                all_col_idxs = []
                break

        if all_row_idxs and all_col_idxs:
            num_rows = max(all_row_idxs) + 1
            num_cols = max(all_col_idxs) + 1

            # Initialise blank grid
            table_grid: List[List[Optional[str]]] = [[None] * num_cols for _ in range(num_rows)]

            for cell in cell_regions:
                try:
                    r_idx = int(cell.metadata.get("row_index"))
                    c_idx = int(cell.metadata.get("col_index"))
                    text_val = cell.extract_text(
                        layout=False,
                        apply_exclusions=apply_exclusions,
                        content_filter=content_filter,
                    ).strip()
                    table_grid[r_idx][c_idx] = text_val if text_val else None
                except Exception as _err:
                    # Skip problematic cell
                    continue

            return table_grid

        # ------------------------------------------------------------------
        # Fallback: derive order purely from geometry if indices are absent
        # ------------------------------------------------------------------
        # Sort unique centers to define ordering
        try:
            import numpy as np
        except ImportError:
            logger.warning("NumPy required for geometric cell ordering; returning empty result.")
            return []

        # Build arrays of centers
        centers = np.array([[(c.x0 + c.x1) / 2.0, (c.top + c.bottom) / 2.0] for c in cell_regions])
        xs = centers[:, 0]
        ys = centers[:, 1]

        # Cluster unique row Y positions and column X positions with a tolerance
        def _cluster(vals, tol=1.0):
            sorted_vals = np.sort(vals)
            groups = [[sorted_vals[0]]]
            for v in sorted_vals[1:]:
                if abs(v - groups[-1][-1]) <= tol:
                    groups[-1].append(v)
                else:
                    groups.append([v])
            return [np.mean(g) for g in groups]

        row_centers = _cluster(ys)
        col_centers = _cluster(xs)

        num_rows = len(row_centers)
        num_cols = len(col_centers)

        table_grid: List[List[Optional[str]]] = [[None] * num_cols for _ in range(num_rows)]

        # Assign each cell to nearest row & col center
        for cell, (cx, cy) in zip(cell_regions, centers):
            row_idx = int(np.argmin([abs(cy - rc) for rc in row_centers]))
            col_idx = int(np.argmin([abs(cx - cc) for cc in col_centers]))

            text_val = cell.extract_text(
                layout=False, apply_exclusions=apply_exclusions, content_filter=content_filter
            ).strip()
            table_grid[row_idx][col_idx] = text_val if text_val else None

        return table_grid

    def _apply_rtl_processing_to_text(self, text: str) -> str:
        """
        Apply RTL (Right-to-Left) text processing to a string.

        This converts visual order text (as stored in PDFs) to logical order
        for proper display of Arabic, Hebrew, and other RTL scripts.

        Args:
            text: Input text string in visual order

        Returns:
            Text string in logical order
        """
        if not text or not text.strip():
            return text

        # Quick check for RTL characters - if none found, return as-is
        import unicodedata

        def _contains_rtl(s):
            return any(unicodedata.bidirectional(ch) in ("R", "AL", "AN") for ch in s)

        if not _contains_rtl(text):
            return text

        try:
            from bidi.algorithm import get_display  # type: ignore

            from natural_pdf.utils.bidi_mirror import mirror_brackets

            # Apply BiDi algorithm to convert from visual to logical order
            # Process line by line to handle mixed content properly
            processed_lines = []
            for line in text.split("\n"):
                if line.strip():
                    # Determine base direction for this line
                    base_dir = "R" if _contains_rtl(line) else "L"
                    logical_line = get_display(line, base_dir=base_dir)
                    # Apply bracket mirroring for correct logical order
                    processed_lines.append(mirror_brackets(logical_line))
                else:
                    processed_lines.append(line)

            return "\n".join(processed_lines)

        except Exception:
            # If bidi library is not available or fails, return original text
            return text

    def _apply_content_filter_to_text(self, text: str, content_filter) -> str:
        """
        Apply content filter to a text string.

        Args:
            text: Input text string
            content_filter: Content filter (regex, callable, or list of regexes)

        Returns:
            Filtered text string
        """
        if not text or content_filter is None:
            return text

        import re

        if isinstance(content_filter, str):
            # Single regex pattern - remove matching parts
            try:
                return re.sub(content_filter, "", text)
            except re.error:
                return text  # Invalid regex, return original

        elif isinstance(content_filter, list):
            # List of regex patterns - remove parts matching ANY pattern
            try:
                result = text
                for pattern in content_filter:
                    result = re.sub(pattern, "", result)
                return result
            except re.error:
                return text  # Invalid regex, return original

        elif callable(content_filter):
            # Callable filter - apply to individual characters
            try:
                filtered_chars = []
                for char in text:
                    if content_filter(char):
                        filtered_chars.append(char)
                return "".join(filtered_chars)
            except Exception:
                return text  # Function error, return original

        return text

    # ------------------------------------------------------------------
    # Interactive Viewer Support
    # ------------------------------------------------------------------

    def viewer(
        self,
        *,
        resolution: int = 150,
        include_chars: bool = False,
        include_attributes: Optional[List[str]] = None,
    ) -> Optional["InteractiveViewerWidget"]:
        """Create an interactive ipywidget viewer for **this specific region**.

        The method renders the region to an image (cropped to the region bounds) and
        overlays all elements that intersect the region (optionally excluding noisy
        character-level elements).  The resulting widget offers the same zoom / pan
        experience as :py:meth:`Page.viewer` but scoped to the region.

        Parameters
        ----------
        resolution : int, default 150
            Rendering resolution (DPI).  This should match the value used by the
            page-level viewer so element scaling is accurate.
        include_chars : bool, default False
            Whether to include individual *char* elements in the overlay.  These
            are often too dense for a meaningful visualisation so are skipped by
            default.
        include_attributes : list[str], optional
            Additional element attributes to expose in the info panel (on top of
            the default set used by the page viewer).

        Returns
        -------
        InteractiveViewerWidget | None
            The widget instance, or ``None`` if *ipywidgets* is not installed or
            an error occurred during creation.
        """

        # ------------------------------------------------------------------
        # Dependency / environment checks
        # ------------------------------------------------------------------
        if not _IPYWIDGETS_AVAILABLE or InteractiveViewerWidget is None:
            logger.error(
                "Interactive viewer requires 'ipywidgets'. "
                'Please install with: pip install "ipywidgets>=7.0.0,<10.0.0"'
            )
            return None

        try:
            # ------------------------------------------------------------------
            # Render region image (cropped) and encode as data URI
            # ------------------------------------------------------------------
            import base64
            from io import BytesIO

            # Use unified render() with crop=True to obtain just the region
            img = self.render(resolution=resolution, crop=True)
            if img is None:
                logger.error(f"Failed to render image for region {self.bbox} viewer.")
                return None

            buf = BytesIO()
            img.save(buf, format="PNG")
            img_str = base64.b64encode(buf.getvalue()).decode()
            image_uri = f"data:image/png;base64,{img_str}"

            # ------------------------------------------------------------------
            # Prepare element overlay data (coordinates relative to region)
            # ------------------------------------------------------------------
            scale = resolution / 72.0  # Same convention as page viewer

            # Gather elements intersecting the region
            region_elements = self.get_elements(apply_exclusions=False)

            # Optionally filter out chars
            if not include_chars:
                region_elements = [
                    el for el in region_elements if str(getattr(el, "type", "")).lower() != "char"
                ]

            default_attrs = [
                "text",
                "fontname",
                "size",
                "bold",
                "italic",
                "color",
                "linewidth",
                "is_horizontal",
                "is_vertical",
                "source",
                "confidence",
                "label",
                "model",
                "upright",
                "direction",
            ]

            if include_attributes:
                default_attrs.extend([a for a in include_attributes if a not in default_attrs])

            elements_json: List[dict] = []
            for idx, el in enumerate(region_elements):
                try:
                    # Calculate coordinates relative to region bbox and apply scale
                    x0 = (el.x0 - self.x0) * scale
                    y0 = (el.top - self.top) * scale
                    x1 = (el.x1 - self.x0) * scale
                    y1 = (el.bottom - self.top) * scale

                    elem_dict = {
                        "id": idx,
                        "type": getattr(el, "type", "unknown"),
                        "x0": round(x0, 2),
                        "y0": round(y0, 2),
                        "x1": round(x1, 2),
                        "y1": round(y1, 2),
                        "width": round(x1 - x0, 2),
                        "height": round(y1 - y0, 2),
                    }

                    # Add requested / default attributes
                    for attr_name in default_attrs:
                        if hasattr(el, attr_name):
                            val = getattr(el, attr_name)
                            # Ensure JSON serialisable
                            if not isinstance(val, (str, int, float, bool, list, dict, type(None))):
                                val = str(val)
                            elem_dict[attr_name] = val
                    elements_json.append(elem_dict)
                except Exception as e:
                    logger.warning(f"Error preparing element {idx} for region viewer: {e}")

            viewer_data = {"page_image": image_uri, "elements": elements_json}

            # ------------------------------------------------------------------
            # Instantiate the widget directly using the prepared data
            # ------------------------------------------------------------------
            return InteractiveViewerWidget(pdf_data=viewer_data)

        except Exception as e:
            logger.error(f"Error creating viewer for region {self.bbox}: {e}", exc_info=True)
            return None

    def within(self):
        """Context manager that constrains directional operations to this region.

        When used as a context manager, all directional navigation operations
        (above, below, left, right) will be constrained to the bounds of this region.

        Returns:
            RegionContext: A context manager that yields this region

        Examples:
            ```python
            # Create a column region
            left_col = page.region(right=page.width/2)

            # All directional operations are constrained to left_col
            with left_col.within() as col:
                header = col.find("text[size>14]")
                content = header.below(until="text[size>14]")
                # content will only include elements within left_col

            # Operations outside the context are not constrained
            full_page_below = header.below()  # Searches full page
            ```
        """
        return RegionContext(self)
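
The table helpers defined above (analyze_text_table_structure, get_text_table_cells, to_text_element) are designed to compose. The following is a minimal sketch, not part of the library source; the region bounds are illustrative assumptions:

```python
# Assumed bounds for a text-based table on the page
table_area = page.region(left=40, top=120, right=560, bottom=420)

# Detect cells from text alignment; the resulting Regions are temporary
# ("volatile") and are not added to the page.
cells = table_area.get_text_table_cells(snap_tolerance=8)
for cell in cells:
    print(cell.bbox, cell.extract_text())

# Wrap the whole region in a synthetic TextElement, e.g. as an OCR stand-in
word = table_area.to_text_element(
    text_content=lambda r: r.extract_text(),
    add_to_page=False,
)
```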
Attributes
natural_pdf.Region.bbox property

Get the bounding box as (x0, top, x1, bottom).

natural_pdf.Region.bottom property

Get the bottom coordinate.

natural_pdf.Region.endpoint property

The element where this region stopped (if created with 'until' parameter).

natural_pdf.Region.has_polygon property

Check if this region has polygon coordinates.

natural_pdf.Region.height property

Get the height of the region.

natural_pdf.Region.origin property

The element/region that created this region (if it was created via directional method).

natural_pdf.Region.page property

Get the parent page.

natural_pdf.Region.polygon property

Get polygon coordinates if available, otherwise return rectangle corners.

natural_pdf.Region.top property

Get the top coordinate.

natural_pdf.Region.type property

Element type.

natural_pdf.Region.width property

Get the width of the region.

natural_pdf.Region.x0 property

Get the left coordinate.

natural_pdf.Region.x1 property

Get the right coordinate.

Functions
natural_pdf.Region.__add__(other)

Add regions/elements together to create an ElementCollection.

This allows intuitive combination of regions using the + operator:

complainant = section.find("text:contains(Complainant)").right(until='text')
dob = section.find("text:contains(DOB)").right(until='text')
combined = complainant + dob  # Creates ElementCollection with both regions

Parameters:

    other (Union[Element, Region, ElementCollection], required):
        Another Region, Element or ElementCollection to combine

Returns:

    ElementCollection: ElementCollection containing all elements

Source code in natural_pdf/elements/region.py
def __add__(
    self, other: Union["Element", "Region", "ElementCollection"]
) -> "ElementCollection":
    """Add regions/elements together to create an ElementCollection.

    This allows intuitive combination of regions using the + operator:
    ```python
    complainant = section.find("text:contains(Complainant)").right(until='text')
    dob = section.find("text:contains(DOB)").right(until='text')
    combined = complainant + dob  # Creates ElementCollection with both regions
    ```

    Args:
        other: Another Region, Element or ElementCollection to combine

    Returns:
        ElementCollection containing all elements
    """
    from natural_pdf.elements.base import Element
    from natural_pdf.elements.element_collection import ElementCollection

    # Create a list starting with self
    elements = [self]

    # Add the other element(s)
    if isinstance(other, (Element, Region)):
        elements.append(other)
    elif isinstance(other, ElementCollection):
        elements.extend(other)
    elif hasattr(other, "__iter__") and not isinstance(other, (str, bytes)):
        # Handle other iterables but exclude strings
        elements.extend(other)
    else:
        raise TypeError(f"Cannot add Region with {type(other)}")

    return ElementCollection(elements)
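
A minimal usage sketch (the file path and selector strings are illustrative, not taken from the library):

```python
import natural_pdf as npdf

pdf = npdf.PDF("document.pdf")  # illustrative path
page = pdf.pages[0]

# Combine two derived regions into one ElementCollection
name = page.find("text:contains(Name)").right(until="text")
date = page.find("text:contains(Date)").right(until="text")
combined = name + date

# ElementCollection is iterable, so the combined regions can be walked directly
for region in combined:
    print(region.bbox)
```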
natural_pdf.Region.__init__(page, bbox, polygon=None, parent=None, label=None)

Initialize a region.

Creates a Region object that represents a rectangular or polygonal area on a page. Regions are used for spatial navigation, content extraction, and analysis operations.

Parameters:

    page (Page, required):
        Parent Page object that contains this region and provides access to
        document elements and analysis capabilities.

    bbox (Tuple[float, float, float, float], required):
        Bounding box coordinates as (x0, top, x1, bottom) tuple in PDF
        coordinate system (points, with origin at bottom-left).

    polygon (List[Tuple[float, float]], default: None):
        Optional list of coordinate points [(x1,y1), (x2,y2), ...] for
        non-rectangular regions. If provided, the region will use
        polygon-based intersection calculations instead of simple rectangle
        overlap.

    parent (default: None):
        Optional parent region for hierarchical document structure. Useful
        for maintaining tree-like relationships between regions.

    label (Optional[str], default: None):
        Optional descriptive label for the region, useful for debugging and
        identification in complex workflows.
Example
pdf = npdf.PDF("document.pdf")
page = pdf.pages[0]

# Rectangular region
header = Region(page, (0, 0, page.width, 100), label="header")

# Polygonal region (from layout detection)
table_polygon = [(50, 100), (300, 100), (300, 400), (50, 400)]
table_region = Region(page, (50, 100, 300, 400),
                    polygon=table_polygon, label="table")
Note

Regions are typically created through page methods like page.region() or spatial navigation methods like element.below(). Direct instantiation is used mainly for advanced workflows or layout analysis integration.

Source code in natural_pdf/elements/region.py
def __init__(
    self,
    page: "Page",
    bbox: Tuple[float, float, float, float],
    polygon: List[Tuple[float, float]] = None,
    parent=None,
    label: Optional[str] = None,
):
    """Initialize a region.

    Creates a Region object that represents a rectangular or polygonal area on a page.
    Regions are used for spatial navigation, content extraction, and analysis operations.

    Args:
        page: Parent Page object that contains this region and provides access
            to document elements and analysis capabilities.
        bbox: Bounding box coordinates as (x0, top, x1, bottom) tuple in PDF
            coordinate system (points, with origin at bottom-left).
        polygon: Optional list of coordinate points [(x1,y1), (x2,y2), ...] for
            non-rectangular regions. If provided, the region will use polygon-based
            intersection calculations instead of simple rectangle overlap.
        parent: Optional parent region for hierarchical document structure.
            Useful for maintaining tree-like relationships between regions.
        label: Optional descriptive label for the region, useful for debugging
            and identification in complex workflows.

    Example:
        ```python
        pdf = npdf.PDF("document.pdf")
        page = pdf.pages[0]

        # Rectangular region
        header = Region(page, (0, 0, page.width, 100), label="header")

        # Polygonal region (from layout detection)
        table_polygon = [(50, 100), (300, 100), (300, 400), (50, 400)]
        table_region = Region(page, (50, 100, 300, 400),
                            polygon=table_polygon, label="table")
        ```

    Note:
        Regions are typically created through page methods like page.region() or
        spatial navigation methods like element.below(). Direct instantiation is
        used mainly for advanced workflows or layout analysis integration.
    """
    self._page = page
    self._bbox = bbox
    self._polygon = polygon

    self.metadata: Dict[str, Any] = {}
    # Analysis results live under self.metadata['analysis'] via property

    # Standard attributes for all elements
    self.object_type = "region"  # For selector compatibility

    # Layout detection attributes
    self.region_type = None
    self.normalized_type = None
    self.confidence = None
    self.model = None

    # Region management attributes
    self.name = None
    self.label = label
    self.source = None  # Will be set by creation methods

    # Hierarchy support for nested document structure
    self.parent_region = parent
    self.child_regions = []
    self.text_content = None  # Direct text content (e.g., from Docling)
    self.associated_text_elements = []  # Native text elements that overlap with this region
natural_pdf.Region.__radd__(other)

Right-hand addition to support ElementCollection + Region.

Source code in natural_pdf/elements/region.py
def __radd__(
    self, other: Union["Element", "Region", "ElementCollection"]
) -> "ElementCollection":
    """Right-hand addition to support ElementCollection + Region."""
    if other == 0:
        # This handles sum() which starts with 0
        from natural_pdf.elements.element_collection import ElementCollection

        return ElementCollection([self])
    return self.__add__(other)
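
Because 0 + region is routed through __radd__, Python's built-in sum() works on an iterable of regions. A small sketch, assuming r1, r2 and r3 are existing Region objects:

```python
regions = [r1, r2, r3]     # assumed Region objects
collection = sum(regions)  # ElementCollection containing all three
```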
natural_pdf.Region.__repr__()

String representation of the region.

Source code in natural_pdf/elements/region.py
def __repr__(self) -> str:
    """String representation of the region."""
    poly_info = " (Polygon)" if self.has_polygon else ""
    name_info = f" name='{self.name}'" if self.name else ""
    type_info = f" type='{self.region_type}'" if self.region_type else ""
    source_info = f" source='{self.source}'" if self.source else ""

    # Add checkbox state if this is a checkbox
    checkbox_info = ""
    if self.region_type == "checkbox" and hasattr(self, "is_checked"):
        state = "checked" if self.is_checked else "unchecked"
        checkbox_info = f" [{state}]"

    return f"<Region{name_info}{type_info}{source_info}{checkbox_info} bbox={self.bbox}{poly_info}>"
natural_pdf.Region.above(height=None, width='full', include_source=False, until=None, include_endpoint=True, offset=None, **kwargs)

Select region above this region.

Parameters:

    height (Optional[float], default: None):
        Height of the region above, in points

    width (str, default: 'full'):
        Width mode - "full" for full page width or "element" for element width

    include_source (bool, default: False):
        Whether to include this region in the result (default: False)

    until (Optional[str], default: None):
        Optional selector string to specify an upper boundary element

    include_endpoint (bool, default: True):
        Whether to include the boundary element in the region (default: True)

    offset (Optional[float], default: None):
        Pixel offset when excluding source/endpoint (default: None, uses
        natural_pdf.options.layout.directional_offset)

    **kwargs:
        Additional parameters

Returns:

    Region: Region object representing the area above

Source code in natural_pdf/elements/region.py
def above(
    self,
    height: Optional[float] = None,
    width: str = "full",
    include_source: bool = False,
    until: Optional[str] = None,
    include_endpoint: bool = True,
    offset: Optional[float] = None,
    **kwargs,
) -> "Region":
    """
    Select region above this region.

    Args:
        height: Height of the region above, in points
        width: Width mode - "full" for full page width or "element" for element width
        include_source: Whether to include this region in the result (default: False)
        until: Optional selector string to specify an upper boundary element
        include_endpoint: Whether to include the boundary element in the region (default: True)
        offset: Pixel offset when excluding source/endpoint (default: None, uses natural_pdf.options.layout.directional_offset)
        **kwargs: Additional parameters

    Returns:
        Region object representing the area above
    """
    # Use global default if offset not provided
    if offset is None:
        import natural_pdf

        offset = natural_pdf.options.layout.directional_offset

    return self._direction(
        direction="above",
        size=height,
        cross_size=width,
        include_source=include_source,
        until=until,
        include_endpoint=include_endpoint,
        offset=offset,
        **kwargs,
    )
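
A short sketch of the two main calling patterns (region bounds and the selector string are illustrative):

```python
# Assumed bounds for a row of interest
row = page.region(left=0, top=400, right=page.width, bottom=420)

strip = row.above(height=50)                 # fixed 50-point strip above the row
to_title = row.above(until="text[size>14]")  # grow upward until a large heading
```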
natural_pdf.Region.add_child(child)

Add a child region to this region.

Used for hierarchical document structure when using models like Docling that understand document hierarchy.

Parameters:

    child (required):
        Region object to add as a child

Returns:

    Self for method chaining

Source code in natural_pdf/elements/region.py
def add_child(self, child):
    """
    Add a child region to this region.

    Used for hierarchical document structure when using models like Docling
    that understand document hierarchy.

    Args:
        child: Region object to add as a child

    Returns:
        Self for method chaining
    """
    self.child_regions.append(child)
    child.parent_region = self
    return self
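
A minimal sketch of building a two-level hierarchy (the region bounds are assumptions):

```python
section = page.region(left=0, top=100, right=page.width, bottom=500)
paragraph = page.region(left=0, top=120, right=page.width, bottom=200)

section.add_child(paragraph)  # returns section, so calls can be chained
assert paragraph.parent_region is section
assert paragraph in section.child_regions
```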
natural_pdf.Region.analyze_text_table_structure(snap_tolerance=10, join_tolerance=3, min_words_vertical=3, min_words_horizontal=1, intersection_tolerance=3, expand_bbox=None, **kwargs)

Analyzes the text elements within the region (or slightly expanded area) to find potential table structure (lines, cells) using text alignment logic adapted from pdfplumber.

Parameters:

    snap_tolerance (int, default: 10):
        Tolerance for snapping parallel lines.

    join_tolerance (int, default: 3):
        Tolerance for joining collinear lines.

    min_words_vertical (int, default: 3):
        Minimum words needed to define a vertical line.

    min_words_horizontal (int, default: 1):
        Minimum words needed to define a horizontal line.

    intersection_tolerance (int, default: 3):
        Tolerance for detecting line intersections.

    expand_bbox (Optional[Dict[str, int]], default: None):
        Optional dictionary to expand the search area slightly beyond the
        region's exact bounds (e.g., {'left': 5, 'right': 5}).

    **kwargs:
        Additional keyword arguments passed to find_text_based_tables
        (e.g., specific x/y tolerances).

Returns:

    Optional[Dict]: A dictionary containing 'horizontal_edges',
    'vertical_edges', 'cells' (list of dicts), and 'intersections', or None
    if pdfplumber is unavailable or an error occurs.

Source code in natural_pdf/elements/region.py
def analyze_text_table_structure(
    self,
    snap_tolerance: int = 10,
    join_tolerance: int = 3,
    min_words_vertical: int = 3,
    min_words_horizontal: int = 1,
    intersection_tolerance: int = 3,
    expand_bbox: Optional[Dict[str, int]] = None,
    **kwargs,
) -> Optional[Dict]:
    """
    Analyzes the text elements within the region (or slightly expanded area)
    to find potential table structure (lines, cells) using text alignment logic
    adapted from pdfplumber.

    Args:
        snap_tolerance: Tolerance for snapping parallel lines.
        join_tolerance: Tolerance for joining collinear lines.
        min_words_vertical: Minimum words needed to define a vertical line.
        min_words_horizontal: Minimum words needed to define a horizontal line.
        intersection_tolerance: Tolerance for detecting line intersections.
        expand_bbox: Optional dictionary to expand the search area slightly beyond
                     the region's exact bounds (e.g., {'left': 5, 'right': 5}).
        **kwargs: Additional keyword arguments passed to
                  find_text_based_tables (e.g., specific x/y tolerances).

    Returns:
        A dictionary containing 'horizontal_edges', 'vertical_edges', 'cells' (list of dicts),
        and 'intersections', or None if pdfplumber is unavailable or an error occurs.
    """

    # Determine the search region (expand if requested)
    search_region = self
    if expand_bbox and isinstance(expand_bbox, dict):
        try:
            search_region = self.expand(**expand_bbox)
            logger.debug(
                f"Expanded search region for text table analysis to: {search_region.bbox}"
            )
        except Exception as e:
            logger.warning(f"Could not expand region bbox: {e}. Using original region.")
            search_region = self

    # Find text elements within the search region
    text_elements = search_region.find_all(
        "text", apply_exclusions=False
    )  # Use unfiltered text
    if not text_elements:
        logger.info(f"Region {self.bbox}: No text elements found for text table analysis.")
        return {"horizontal_edges": [], "vertical_edges": [], "cells": [], "intersections": {}}

    # Extract bounding boxes
    bboxes = [element.bbox for element in text_elements if hasattr(element, "bbox")]
    if not bboxes:
        logger.info(f"Region {self.bbox}: No bboxes extracted from text elements.")
        return {"horizontal_edges": [], "vertical_edges": [], "cells": [], "intersections": {}}

    # Call the utility function
    try:
        analysis_results = find_text_based_tables(
            bboxes=bboxes,
            snap_tolerance=snap_tolerance,
            join_tolerance=join_tolerance,
            min_words_vertical=min_words_vertical,
            min_words_horizontal=min_words_horizontal,
            intersection_tolerance=intersection_tolerance,
            **kwargs,  # Pass through any extra specific tolerance args
        )
        # Store results in the region's analyses cache
        self.analyses["text_table_structure"] = analysis_results
        return analysis_results
    except ImportError:
        logger.error("pdfplumber library is required for 'text' table analysis but not found.")
        return None
    except Exception as e:
        logger.error(f"Error during text-based table analysis: {e}", exc_info=True)
        return None
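
A sketch of running the analysis and reading the cached result (the region bounds are illustrative):

```python
table_area = page.region(left=50, top=100, right=550, bottom=400)

structure = table_area.analyze_text_table_structure(snap_tolerance=8)
if structure is not None:
    print(f"{len(structure['cells'])} candidate cells")

# The same dict is cached on the region for later reuse
cached = table_area.analyses.get("text_table_structure")
```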
natural_pdf.Region.apply_custom_ocr(ocr_function, source_label='custom-ocr', replace=True, confidence=None, add_to_page=True)

Apply a custom OCR function to this region and create text elements from the results.

This is useful when you want to use a custom OCR method (e.g., an LLM API, specialized OCR service, or any custom logic) instead of the built-in OCR engines.

Parameters:

    ocr_function (Callable[[Region], Optional[str]], required):
        A callable that takes a Region and returns the OCR'd text (or None).
        The function receives this region as its argument and should return
        the extracted text as a string, or None if no text was found.

    source_label (str, default: 'custom-ocr'):
        Label to identify the source of these text elements (default:
        "custom-ocr"). This will be set as the 'source' attribute on created
        elements.

    replace (bool, default: True):
        If True (default), removes existing OCR elements in this region
        before adding new ones. If False, adds new OCR elements alongside
        existing ones.

    confidence (Optional[float], default: None):
        Optional confidence score for the OCR result (0.0-1.0). If None,
        defaults to 1.0 if text is returned, 0.0 if None is returned.

    add_to_page (bool, default: True):
        If True (default), adds the created text element to the page. If
        False, creates the element but doesn't add it to the page.

Returns:

    Region: Self for method chaining.

Example

# Using with an LLM
def ocr_with_llm(region):
    image = region.render(resolution=300, crop=True)
    # Call your LLM API here
    return llm_client.ocr(image)

region.apply_custom_ocr(ocr_with_llm)

# Using with a custom OCR service
def ocr_with_service(region):
    img_bytes = region.render(crop=True).tobytes()
    response = ocr_service.process(img_bytes)
    return response.text

region.apply_custom_ocr(ocr_with_service, source_label="my-ocr-service")

Source code in natural_pdf/elements/region.py
def apply_custom_ocr(
    self,
    ocr_function: Callable[["Region"], Optional[str]],
    source_label: str = "custom-ocr",
    replace: bool = True,
    confidence: Optional[float] = None,
    add_to_page: bool = True,
) -> "Region":
    """
    Apply a custom OCR function to this region and create text elements from the results.

    This is useful when you want to use a custom OCR method (e.g., an LLM API,
    specialized OCR service, or any custom logic) instead of the built-in OCR engines.

    Args:
        ocr_function: A callable that takes a Region and returns the OCR'd text (or None).
                      The function receives this region as its argument and should return
                      the extracted text as a string, or None if no text was found.
        source_label: Label to identify the source of these text elements (default: "custom-ocr").
                      This will be set as the 'source' attribute on created elements.
        replace: If True (default), removes existing OCR elements in this region before
                 adding new ones. If False, adds new OCR elements alongside existing ones.
        confidence: Optional confidence score for the OCR result (0.0-1.0).
                    If None, defaults to 1.0 if text is returned, 0.0 if None is returned.
        add_to_page: If True (default), adds the created text element to the page.
                     If False, creates the element but doesn't add it to the page.

    Returns:
        Self for method chaining.

    Example:
        # Using with an LLM
        def ocr_with_llm(region):
            image = region.render(resolution=300, crop=True)
            # Call your LLM API here
            return llm_client.ocr(image)

        region.apply_custom_ocr(ocr_with_llm)

        # Using with a custom OCR service
        def ocr_with_service(region):
            img_bytes = region.render(crop=True).tobytes()
            response = ocr_service.process(img_bytes)
            return response.text

        region.apply_custom_ocr(ocr_with_service, source_label="my-ocr-service")
    """
    # If replace is True, remove existing OCR elements in this region
    if replace:
        logger.info(
            f"Region {self.bbox}: Removing existing OCR elements before applying custom OCR."
        )

        removed_count = 0

        # Helper to remove a single element safely
        def _safe_remove(elem):
            nonlocal removed_count
            success = False
            if hasattr(elem, "page") and hasattr(elem.page, "_element_mgr"):
                etype = getattr(elem, "object_type", "word")
                if etype == "word":
                    etype_key = "words"
                elif etype == "char":
                    etype_key = "chars"
                else:
                    etype_key = etype + "s" if not etype.endswith("s") else etype
                try:
                    success = elem.page._element_mgr.remove_element(elem, etype_key)
                except Exception:
                    success = False
            if success:
                removed_count += 1

        # Remove ALL OCR elements overlapping this region
        # Remove elements with source=="ocr" (built-in OCR) or matching the source_label (previous custom OCR)
        for word in list(self.page._element_mgr.words):
            word_source = getattr(word, "source", "")
            # Match built-in OCR behavior: remove elements with source "ocr" exactly
            # Also remove elements with the same source_label to avoid duplicates
            if (word_source == "ocr" or word_source == source_label) and self.intersects(word):
                _safe_remove(word)

        # Also remove char dicts if needed (matching built-in OCR)
        for char in list(self.page._element_mgr.chars):
            # char can be dict or TextElement; normalize
            char_src = (
                char.get("source") if isinstance(char, dict) else getattr(char, "source", None)
            )
            if char_src == "ocr" or char_src == source_label:
                # Rough bbox for dicts
                if isinstance(char, dict):
                    cx0, ctop, cx1, cbottom = (
                        char.get("x0", 0),
                        char.get("top", 0),
                        char.get("x1", 0),
                        char.get("bottom", 0),
                    )
                else:
                    cx0, ctop, cx1, cbottom = char.x0, char.top, char.x1, char.bottom
                # Quick overlap check
                if not (
                    cx1 < self.x0 or cx0 > self.x1 or cbottom < self.top or ctop > self.bottom
                ):
                    _safe_remove(char)

        if removed_count > 0:
            logger.info(f"Region {self.bbox}: Removed {removed_count} existing OCR elements.")

    # Call the custom OCR function
    try:
        logger.debug(f"Region {self.bbox}: Calling custom OCR function...")
        ocr_text = ocr_function(self)

        if ocr_text is not None and not isinstance(ocr_text, str):
            logger.warning(
                f"Custom OCR function returned non-string type ({type(ocr_text)}). "
                f"Converting to string."
            )
            ocr_text = str(ocr_text)

    except Exception as e:
        logger.error(
            f"Error calling custom OCR function for region {self.bbox}: {e}", exc_info=True
        )
        return self

    # Create text element if we got text
    if ocr_text is not None:
        # Use the to_text_element method to create the element
        text_element = self.to_text_element(
            text_content=ocr_text,
            source_label=source_label,
            confidence=confidence,
            add_to_page=add_to_page,
        )

        logger.info(
            f"Region {self.bbox}: Created text element with {len(ocr_text)} chars"
            f"{' and added to page' if add_to_page else ''}"
        )
    else:
        logger.debug(f"Region {self.bbox}: Custom OCR function returned None (no text found)")

    return self
natural_pdf.Region.apply_ocr(replace=True, **ocr_params)

Apply OCR to this region, creating text elements from the recognized results. Returns self for chaining.

This method supports two modes:

1. **Built-in OCR engines** (default) – identical to the previous behaviour. Pass typical parameters like `engine='easyocr'` or `languages=['en']` and the method will route the request through `OCRManager`.
2. **Custom OCR function** – pass a callable under the keyword `function` (or `ocr_function`). The callable receives this Region instance and should return the extracted text (`str`) or `None`. Internally the call is delegated to `apply_custom_ocr`, so the same logic (replacement, element creation, etc.) is reused.

Examples:

    def llm_ocr(region):
        image = region.render(resolution=300, crop=True)
        return my_llm_client.ocr(image)

    region.apply_ocr(function=llm_ocr)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `replace` | | Whether to remove existing OCR elements first. | `True` |
| `**ocr_params` | | Parameters for the built-in OCR manager, or the special `function`/`ocr_function` keyword to trigger custom mode. | `{}` |
Returns:

Self – for chaining.
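For the built-in engine path, a minimal usage sketch (assuming the `easyocr` engine is installed; the coordinates are illustrative):

```python
# OCR the top half of a scanned page, then read the recognized text
region = page.region(0, 0, page.width, page.height / 2)
region.apply_ocr(engine="easyocr", languages=["en"], min_confidence=0.5)

print(region.extract_text())
```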
Source code in natural_pdf/elements/region.py
def apply_ocr(self, replace=True, **ocr_params) -> "Region":
    """
    Apply OCR to this region, creating text elements from the recognized results.

    This method supports two modes:
    1. **Built-in OCR Engines** (default) – identical to previous behaviour. Pass typical
       parameters like ``engine='easyocr'`` or ``languages=['en']`` and the method will
       route the request through :class:`OCRManager`.
    2. **Custom OCR Function** – pass a *callable* under the keyword ``function`` (or
       ``ocr_function``). The callable will receive *this* Region instance and should
       return the extracted text (``str``) or ``None``.  Internally the call is
       delegated to :pymeth:`apply_custom_ocr` so the same logic (replacement, element
       creation, etc.) is re-used.

    Examples
    --------
    ```python
    def llm_ocr(region):
        image = region.render(resolution=300, crop=True)
        return my_llm_client.ocr(image)
    region.apply_ocr(function=llm_ocr)
    ```

    Args:
        replace: Whether to remove existing OCR elements first (default ``True``).
        **ocr_params: Parameters for the built-in OCR manager *or* the special
                      ``function``/``ocr_function`` keyword to trigger custom mode.

    Returns
    -------
        Self – for chaining.
    """
    # --- Custom OCR function path --------------------------------------------------
    custom_func = ocr_params.pop("function", None) or ocr_params.pop("ocr_function", None)
    if callable(custom_func):
        # Delegate to the specialised helper while preserving key kwargs
        return self.apply_custom_ocr(
            ocr_function=custom_func,
            source_label=ocr_params.pop("source_label", "custom-ocr"),
            replace=replace,
            confidence=ocr_params.pop("confidence", None),
            add_to_page=ocr_params.pop("add_to_page", True),
        )

    # --- Original built-in OCR engine path (unchanged except docstring) ------------
    # Ensure OCRManager is available
    if not hasattr(self.page._parent, "_ocr_manager") or self.page._parent._ocr_manager is None:
        logger.error("OCRManager not available on parent PDF. Cannot apply OCR to region.")
        return self

    # If replace is True, find and remove existing OCR elements in this region
    if replace:
        logger.info(
            f"Region {self.bbox}: Removing existing OCR elements before applying new OCR."
        )

        # --- Robust removal: iterate through all OCR elements on the page and
        #     remove those that overlap this region. This avoids reliance on
        #     identity‐based look-ups that can break if the ElementManager
        #     rebuilt its internal lists.

        removed_count = 0

        # Helper to remove a single element safely
        def _safe_remove(elem):
            nonlocal removed_count
            success = False
            if hasattr(elem, "page") and hasattr(elem.page, "_element_mgr"):
                etype = getattr(elem, "object_type", "word")
                if etype == "word":
                    etype_key = "words"
                elif etype == "char":
                    etype_key = "chars"
                else:
                    etype_key = etype + "s" if not etype.endswith("s") else etype
                try:
                    success = elem.page._element_mgr.remove_element(elem, etype_key)
                except Exception:
                    success = False
            if success:
                removed_count += 1

        # Remove OCR WORD elements overlapping region
        for word in list(self.page._element_mgr.words):
            if getattr(word, "source", None) == "ocr" and self.intersects(word):
                _safe_remove(word)

        # Remove OCR CHAR dicts overlapping region
        for char in list(self.page._element_mgr.chars):
            # char can be dict or TextElement; normalise
            char_src = (
                char.get("source") if isinstance(char, dict) else getattr(char, "source", None)
            )
            if char_src == "ocr":
                # Rough bbox for dicts
                if isinstance(char, dict):
                    cx0, ctop, cx1, cbottom = (
                        char.get("x0", 0),
                        char.get("top", 0),
                        char.get("x1", 0),
                        char.get("bottom", 0),
                    )
                else:
                    cx0, ctop, cx1, cbottom = char.x0, char.top, char.x1, char.bottom
                # Quick overlap check
                if not (
                    cx1 < self.x0 or cx0 > self.x1 or cbottom < self.top or ctop > self.bottom
                ):
                    _safe_remove(char)

        logger.info(
            f"Region {self.bbox}: Removed {removed_count} existing OCR elements (words & chars) before re-applying OCR."
        )

    ocr_mgr = self.page._parent._ocr_manager

    # Determine rendering resolution from parameters
    final_resolution = ocr_params.get("resolution")
    if final_resolution is None and hasattr(self.page, "_parent") and self.page._parent:
        final_resolution = getattr(self.page._parent, "_config", {}).get("resolution", 150)
    elif final_resolution is None:
        final_resolution = 150
    logger.debug(
        f"Region {self.bbox}: Applying OCR with resolution {final_resolution} DPI and params: {ocr_params}"
    )

    # Render the page region to an image using the determined resolution
    try:
        # Use render() for clean image without highlights, with cropping
        region_image = self.render(resolution=final_resolution, crop=True)
        if not region_image:
            logger.error("Failed to render region to image for OCR.")
            return self
        logger.debug(f"Region rendered to image size: {region_image.size}")
    except Exception as e:
        logger.error(f"Error rendering region to image for OCR: {e}", exc_info=True)
        return self

    # Prepare args for the OCR Manager
    manager_args = {
        "images": region_image,
        "engine": ocr_params.get("engine"),
        "languages": ocr_params.get("languages"),
        "min_confidence": ocr_params.get("min_confidence"),
        "device": ocr_params.get("device"),
        "options": ocr_params.get("options"),
        "detect_only": ocr_params.get("detect_only"),
    }
    manager_args = {k: v for k, v in manager_args.items() if v is not None}

    # Run OCR on this region's image using the manager
    results = ocr_mgr.apply_ocr(**manager_args)
    if not isinstance(results, list):
        logger.error(
            f"OCRManager returned unexpected type for single region image: {type(results)}"
        )
        return self
    logger.debug(f"Region OCR processing returned {len(results)} results.")

    # Convert results to TextElements
    scale_x = self.width / region_image.width if region_image.width > 0 else 1.0
    scale_y = self.height / region_image.height if region_image.height > 0 else 1.0
    logger.debug(f"Region OCR scaling factors (PDF/Img): x={scale_x:.2f}, y={scale_y:.2f}")
    created_elements = []
    for result in results:
        try:
            img_x0, img_top, img_x1, img_bottom = map(float, result["bbox"])
            pdf_height = (img_bottom - img_top) * scale_y
            page_x0 = self.x0 + (img_x0 * scale_x)
            page_top = self.top + (img_top * scale_y)
            page_x1 = self.x0 + (img_x1 * scale_x)
            page_bottom = self.top + (img_bottom * scale_y)
            raw_conf = result.get("confidence")
            # Convert confidence to float unless it is None/invalid
            try:
                confidence_val = float(raw_conf) if raw_conf is not None else None
            except (TypeError, ValueError):
                confidence_val = None

            text_val = result.get("text")  # May legitimately be None in detect_only mode

            element_data = {
                "text": text_val,
                "x0": page_x0,
                "top": page_top,
                "x1": page_x1,
                "bottom": page_bottom,
                "width": page_x1 - page_x0,
                "height": page_bottom - page_top,
                "object_type": "word",
                "source": "ocr",
                "confidence": confidence_val,
                "fontname": "OCR",
                "size": round(pdf_height) if pdf_height > 0 else 10.0,
                "page_number": self.page.number,
                "bold": False,
                "italic": False,
                "upright": True,
                "doctop": page_top + self.page._page.initial_doctop,
            }
            ocr_char_dict = element_data.copy()
            ocr_char_dict["object_type"] = "char"
            ocr_char_dict.setdefault("adv", ocr_char_dict.get("width", 0))
            element_data["_char_dicts"] = [ocr_char_dict]
            from natural_pdf.elements.text import TextElement

            elem = TextElement(element_data, self.page)
            created_elements.append(elem)
            self.page._element_mgr.add_element(elem, element_type="words")
            self.page._element_mgr.add_element(ocr_char_dict, element_type="chars")
        except Exception as e:
            logger.error(
                f"Failed to convert region OCR result to element: {result}. Error: {e}",
                exc_info=True,
            )
    logger.info(f"Region {self.bbox}: Added {len(created_elements)} elements from OCR.")
    return self
natural_pdf.Region.ask(question, min_confidence=0.1, model=None, debug=False, **kwargs)

Ask a question about the region content using document QA.

This method uses a document question answering model to extract answers from the region content. It leverages both textual content and layout information for better understanding.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `question` | `Union[str, List[str], Tuple[str, ...]]` | The question to ask about the region content | *required* |
| `min_confidence` | `float` | Minimum confidence threshold for answers (0.0-1.0) | `0.1` |
| `debug` | `bool` | Whether to enable debug output during QA processing | `False` |
| `model` | `str` | Optional model name to use for QA (if None, uses default model) | `None` |
| `**kwargs` | | Additional parameters to pass to the QA engine | `{}` |

Returns:

`Union[Dict[str, Any], List[Dict[str, Any]]]` – dictionary with answer details (or a list of such dictionaries when multiple questions are asked):

- `answer`: extracted text
- `confidence`: confidence score
- `found`: whether an answer was found
- `page_num`: page number
- `region`: reference to this region
- `source_elements`: list of elements that contain the answer (if found)
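A minimal usage sketch (the question string is illustrative):

```python
result = region.ask("What is the total amount due?", min_confidence=0.2)

if result["found"]:
    print(result["answer"], result["confidence"])
```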

Source code in natural_pdf/elements/region.py
def ask(
    self,
    question: Union[str, List[str], Tuple[str, ...]],
    min_confidence: float = 0.1,
    model: str = None,
    debug: bool = False,
    **kwargs,
) -> Union[Dict[str, Any], List[Dict[str, Any]]]:
    """
    Ask a question about the region content using document QA.

    This method uses a document question answering model to extract answers from the region content.
    It leverages both textual content and layout information for better understanding.

    Args:
        question: The question to ask about the region content
        min_confidence: Minimum confidence threshold for answers (0.0-1.0)
        model: Optional model name to use for QA (if None, uses default model)
        debug: Whether to enable debug output during QA processing
        **kwargs: Additional parameters to pass to the QA engine

    Returns:
        Dictionary with answer details: {
            "answer": extracted text,
            "confidence": confidence score,
            "found": whether an answer was found,
            "page_num": page number,
            "region": reference to this region,
            "source_elements": list of elements that contain the answer (if found)
        }
    """
    try:
        from natural_pdf.qa.document_qa import get_qa_engine
    except ImportError:
        logger.error(
            "Question answering requires optional dependencies. Install with `pip install natural-pdf[ai]`"
        )
        return {
            "answer": None,
            "confidence": 0.0,
            "found": False,
            "page_num": self.page.number,
            "source_elements": [],
            "region": self,
        }

    # Get or initialize QA engine with specified model
    try:
        qa_engine = get_qa_engine(model_name=model) if model else get_qa_engine()
    except Exception as e:
        logger.error(f"Failed to initialize QA engine (model: {model}): {e}", exc_info=True)
        return {
            "answer": None,
            "confidence": 0.0,
            "found": False,
            "page_num": self.page.number,
            "source_elements": [],
            "region": self,
        }

    # Ask the question using the QA engine
    try:
        return qa_engine.ask_pdf_region(
            self, question, min_confidence=min_confidence, debug=debug, **kwargs
        )
    except Exception as e:
        logger.error(f"Error during qa_engine.ask_pdf_region: {e}", exc_info=True)
        return {
            "answer": None,
            "confidence": 0.0,
            "found": False,
            "page_num": self.page.number,
            "source_elements": [],
            "region": self,
        }
natural_pdf.Region.attr(name)

Get an attribute value from this region.

This method provides a consistent interface for attribute access that works on both individual regions/elements and collections. When called on a single region, it simply returns the attribute value. When called on collections, it extracts the attribute from all items.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `name` | `str` | The attribute name to retrieve (e.g., 'text', 'width', 'height') | *required* |

Returns:

`Any` – the attribute value, or `None` if the attribute doesn't exist

Examples:

    # On a single region
    region = page.find('text:contains("Title")').expand(10)
    width = region.attr('width')  # Same as region.width

    # Consistent API across elements and regions
    obj = page.find('*:contains("Title")')  # Could be element or region
    text = obj.attr('text')  # Works for both

Source code in natural_pdf/elements/region.py
def attr(self, name: str) -> Any:
    """
    Get an attribute value from this region.

    This method provides a consistent interface for attribute access that works
    on both individual regions/elements and collections. When called on a single
    region, it simply returns the attribute value. When called on collections,
    it extracts the attribute from all items.

    Args:
        name: The attribute name to retrieve (e.g., 'text', 'width', 'height')

    Returns:
        The attribute value, or None if the attribute doesn't exist

    Examples:
        # On a single region
        region = page.find('text:contains("Title")').expand(10)
        width = region.attr('width')  # Same as region.width

        # Consistent API across elements and regions
        obj = page.find('*:contains("Title")')  # Could be element or region
        text = obj.attr('text')  # Works for both
    """
    return getattr(self, name, None)
natural_pdf.Region.below(height=None, width='full', include_source=False, until=None, include_endpoint=True, offset=None, **kwargs)

Select region below this region.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `height` | `Optional[float]` | Height of the region below, in points | `None` |
| `width` | `str` | Width mode – "full" for full page width or "element" for element width | `'full'` |
| `include_source` | `bool` | Whether to include this region in the result | `False` |
| `until` | `Optional[str]` | Optional selector string to specify a lower boundary element | `None` |
| `include_endpoint` | `bool` | Whether to include the boundary element in the region | `True` |
| `offset` | `Optional[float]` | Pixel offset when excluding source/endpoint; if None, uses `natural_pdf.options.layout.directional_offset` | `None` |
| `**kwargs` | | Additional parameters | `{}` |

Returns:

`Region` – region object representing the area below
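A short sketch of both modes (the selectors are illustrative):

```python
region = page.find('text:contains("Summary")').expand(5)

# Fixed-height strip below the region
strip = region.below(height=150)

# Or extend down until a boundary element, excluding the boundary itself
section = region.below(until='text:contains("Total")', include_endpoint=False)
print(section.extract_text())
```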

Source code in natural_pdf/elements/region.py
def below(
    self,
    height: Optional[float] = None,
    width: str = "full",
    include_source: bool = False,
    until: Optional[str] = None,
    include_endpoint: bool = True,
    offset: Optional[float] = None,
    **kwargs,
) -> "Region":
    """
    Select region below this region.

    Args:
        height: Height of the region below, in points
        width: Width mode - "full" for full page width or "element" for element width
        include_source: Whether to include this region in the result (default: False)
        until: Optional selector string to specify a lower boundary element
        include_endpoint: Whether to include the boundary element in the region (default: True)
        offset: Pixel offset when excluding source/endpoint (default: None, uses natural_pdf.options.layout.directional_offset)
        **kwargs: Additional parameters

    Returns:
        Region object representing the area below
    """
    # Use global default if offset not provided
    if offset is None:
        import natural_pdf

        offset = natural_pdf.options.layout.directional_offset

    return self._direction(
        direction="below",
        size=height,
        cross_size=width,
        include_source=include_source,
        until=until,
        include_endpoint=include_endpoint,
        offset=offset,
        **kwargs,
    )
natural_pdf.Region.clip(obj=None, left=None, top=None, right=None, bottom=None)

Clip this region to specific bounds, either from another object with bbox or explicit coordinates.

The clipped region will be constrained to not exceed the specified boundaries. You can provide either an object with bounding box properties, specific coordinates, or both. When both are provided, explicit coordinates take precedence.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `obj` | `Optional[Any]` | Optional object with bbox properties (Region, Element, TextElement, etc.) | `None` |
| `left` | `Optional[float]` | Optional left boundary (x0) to clip to | `None` |
| `top` | `Optional[float]` | Optional top boundary to clip to | `None` |
| `right` | `Optional[float]` | Optional right boundary (x1) to clip to | `None` |
| `bottom` | `Optional[float]` | Optional bottom boundary to clip to | `None` |

Returns:

`Region` – new Region with bounds clipped to the specified constraints

Examples:

    # Clip to another region's bounds
    clipped = region.clip(container_region)

    # Clip to any element's bounds
    clipped = region.clip(text_element)

    # Clip to specific coordinates
    clipped = region.clip(left=100, right=400)

    # Mix object bounds with specific overrides
    clipped = region.clip(obj=container, bottom=page.height/2)

Source code in natural_pdf/elements/region.py
def clip(
    self,
    obj: Optional[Any] = None,
    left: Optional[float] = None,
    top: Optional[float] = None,
    right: Optional[float] = None,
    bottom: Optional[float] = None,
) -> "Region":
    """
    Clip this region to specific bounds, either from another object with bbox or explicit coordinates.

    The clipped region will be constrained to not exceed the specified boundaries.
    You can provide either an object with bounding box properties, specific coordinates, or both.
    When both are provided, explicit coordinates take precedence.

    Args:
        obj: Optional object with bbox properties (Region, Element, TextElement, etc.)
        left: Optional left boundary (x0) to clip to
        top: Optional top boundary to clip to
        right: Optional right boundary (x1) to clip to
        bottom: Optional bottom boundary to clip to

    Returns:
        New Region with bounds clipped to the specified constraints

    Examples:
        # Clip to another region's bounds
        clipped = region.clip(container_region)

        # Clip to any element's bounds
        clipped = region.clip(text_element)

        # Clip to specific coordinates
        clipped = region.clip(left=100, right=400)

        # Mix object bounds with specific overrides
        clipped = region.clip(obj=container, bottom=page.height/2)
    """
    from natural_pdf.elements.base import extract_bbox

    # Start with current region bounds
    clip_x0 = self.x0
    clip_top = self.top
    clip_x1 = self.x1
    clip_bottom = self.bottom

    # Apply object constraints if provided
    if obj is not None:
        obj_bbox = extract_bbox(obj)
        if obj_bbox is not None:
            obj_x0, obj_top, obj_x1, obj_bottom = obj_bbox
            # Constrain to the intersection with the provided object
            clip_x0 = max(clip_x0, obj_x0)
            clip_top = max(clip_top, obj_top)
            clip_x1 = min(clip_x1, obj_x1)
            clip_bottom = min(clip_bottom, obj_bottom)
        else:
            logger.warning(
                f"Region {self.bbox}: Cannot extract bbox from clipping object {type(obj)}. "
                "Object must have bbox property or x0/top/x1/bottom attributes."
            )

    # Apply explicit coordinate constraints (these take precedence)
    if left is not None:
        clip_x0 = max(clip_x0, left)
    if top is not None:
        clip_top = max(clip_top, top)
    if right is not None:
        clip_x1 = min(clip_x1, right)
    if bottom is not None:
        clip_bottom = min(clip_bottom, bottom)

    # Ensure valid coordinates
    if clip_x1 <= clip_x0 or clip_bottom <= clip_top:
        logger.warning(
            f"Region {self.bbox}: Clipping resulted in invalid dimensions "
            f"({clip_x0}, {clip_top}, {clip_x1}, {clip_bottom}). Returning minimal region."
        )
        # Return a minimal region at the clip area's top-left
        return Region(self.page, (clip_x0, clip_top, clip_x0, clip_top))

    # Create the clipped region
    clipped_region = Region(self.page, (clip_x0, clip_top, clip_x1, clip_bottom))

    # Copy relevant metadata
    clipped_region.region_type = self.region_type
    clipped_region.normalized_type = self.normalized_type
    clipped_region.confidence = self.confidence
    clipped_region.model = self.model
    clipped_region.name = self.name
    clipped_region.label = self.label
    clipped_region.source = "clipped"  # Indicate this is a derived region
    clipped_region.parent_region = self

    logger.debug(
        f"Region {self.bbox}: Clipped to {clipped_region.bbox} "
        f"(constraints: obj={type(obj).__name__ if obj else None}, "
        f"left={left}, top={top}, right={right}, bottom={bottom})"
    )
    return clipped_region
natural_pdf.Region.contains(element)

Check if this region completely contains an element.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `element` | `Element` | Element to check | *required* |

Returns:

`bool` – True if the element is completely contained within the region, False otherwise
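A minimal sketch (the selector is illustrative):

```python
element = page.find('text:contains("Total")')
if region.contains(element):
    print("The element sits fully inside this region")
```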

Source code in natural_pdf/elements/region.py
def contains(self, element: "Element") -> bool:
    """
    Check if this region completely contains an element.

    Args:
        element: Element to check

    Returns:
        True if the element is completely contained within the region, False otherwise
    """
    # Check if element is on the same page
    if not hasattr(element, "page") or element.page != self._page:
        return False

    # Ensure element has necessary attributes
    if not all(hasattr(element, attr) for attr in ["x0", "x1", "top", "bottom"]):
        return False  # Cannot determine position

    # For rectangular regions, check if element's bbox is fully inside region's bbox
    if not self.has_polygon:
        return (
            self.x0 <= element.x0
            and element.x1 <= self.x1
            and self.top <= element.top
            and element.bottom <= self.bottom
        )

    # For polygon regions, check if all corners of the element are inside the polygon
    element_corners = [
        (element.x0, element.top),  # top-left
        (element.x1, element.top),  # top-right
        (element.x1, element.bottom),  # bottom-right
        (element.x0, element.bottom),  # bottom-left
    ]

    return all(self.is_point_inside(x, y) for x, y in element_corners)
natural_pdf.Region.create_cells()

Create cell regions for a detected table by intersecting its row and column regions, and add them to the page.

Assumes child row and column regions are already present on the page.

Returns:

Self for method chaining.
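A short sketch, assuming a layout-analysis step has already added the table, `table-row`, and `table-column` regions to the page:

```python
table = page.find("region[type=table]")
table.create_cells()  # intersect overlapping rows and columns into cells

cells = page.find_all("region[type=table-cell]")
```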

Source code in natural_pdf/elements/region.py
def create_cells(self):
    """
    Create cell regions for a detected table by intersecting its
    row and column regions, and add them to the page.

    Assumes child row and column regions are already present on the page.

    Returns:
        Self for method chaining.
    """
    # Ensure this is called on a table region
    if self.region_type not in (
        "table",
        "tableofcontents",
    ):  # Allow for ToC which might have structure
        raise ValueError(
            f"create_cells should be called on a 'table' or 'tableofcontents' region, not '{self.region_type}'"
        )

    # Find rows and columns associated with this page
    # Remove the model-specific filter
    rows = self.page.find_all("region[type=table-row]")
    columns = self.page.find_all("region[type=table-column]")

    # Filter to only include those that overlap with this table region
    def is_in_table(element):
        # Use a simple overlap check (more robust than just center point)
        # Check if element's bbox overlaps with self.bbox
        return (
            hasattr(element, "bbox")
            and element.x0 < self.x1  # Ensure element has bbox
            and element.x1 > self.x0
            and element.top < self.bottom
            and element.bottom > self.top
        )

    table_rows = [r for r in rows if is_in_table(r)]
    table_columns = [c for c in columns if is_in_table(c)]

    if not table_rows or not table_columns:
        # Use page's logger if available
        logger_instance = getattr(self._page, "logger", logger)
        logger_instance.warning(
            f"Region {self.bbox}: Cannot create cells. No overlapping row or column regions found."
        )
        return self  # Return self even if no cells created

    # Sort rows and columns
    table_rows.sort(key=lambda r: r.top)
    table_columns.sort(key=lambda c: c.x0)

    # Create cells and add them to the page's element manager
    created_count = 0
    for row in table_rows:
        for column in table_columns:
            # Calculate intersection bbox for the cell
            cell_x0 = max(row.x0, column.x0)
            cell_y0 = max(row.top, column.top)
            cell_x1 = min(row.x1, column.x1)
            cell_y1 = min(row.bottom, column.bottom)

            # Only create a cell if the intersection is valid (positive width/height)
            if cell_x1 > cell_x0 and cell_y1 > cell_y0:
                # Create cell region at the intersection
                cell = self.page.create_region(cell_x0, cell_y0, cell_x1, cell_y1)
                # Set metadata
                cell.source = "derived"
                cell.region_type = "table-cell"  # Explicitly set type
                cell.normalized_type = "table-cell"  # And normalized type
                # Inherit model from the parent table region
                cell.model = self.model
                cell.parent_region = self  # Link cell to parent table region

                # Add the cell region to the page's element manager
                self.page._element_mgr.add_region(cell)
                created_count += 1

    # Optional: Add created cells to the table region's children
    # self.child_regions.extend(cells_created_in_this_call) # Needs list management

    logger_instance = getattr(self._page, "logger", logger)
    logger_instance.info(
        f"Region {self.bbox} (Model: {self.model}): Created and added {created_count} cell regions."
    )

    return self  # Return self for chaining
natural_pdf.Region.exclude()

Exclude this region from text extraction and other operations.

This excludes everything within the region's bounds.

Source code in natural_pdf/elements/region.py
def exclude(self):
    """
    Exclude this region from text extraction and other operations.

    This excludes everything within the region's bounds.
    """
    self.page.add_exclusion(self, method="region")
natural_pdf.Region.extract_table(method=None, table_settings=None, use_ocr=False, ocr_config=None, text_options=None, cell_extraction_func=None, show_progress=False, content_filter=None, apply_exclusions=True, verticals=None, horizontals=None)

Extract a table from this region.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `method` | `Optional[str]` | Method to use: 'tatr', 'pdfplumber', 'text', 'stream', 'lattice', or None (auto-detect). 'stream' is an alias for 'pdfplumber' with text-based strategies (equivalent to setting `vertical_strategy` and `horizontal_strategy` to 'text'). 'lattice' is an alias for 'pdfplumber' with line-based strategies (equivalent to setting both to 'lines'). | `None` |
| `table_settings` | `Optional[dict]` | Settings for pdfplumber table extraction (used with 'pdfplumber', 'stream', or 'lattice' methods). | `None` |
| `use_ocr` | `bool` | Whether to use OCR for text extraction (currently only applicable with the 'tatr' method). | `False` |
| `ocr_config` | `Optional[dict]` | OCR configuration parameters. | `None` |
| `text_options` | `Optional[Dict]` | Options for the 'text' method, corresponding to arguments of `analyze_text_table_structure` (e.g., `snap_tolerance`, `expand_bbox`). | `None` |
| `cell_extraction_func` | `Optional[Callable[[Region], Optional[str]]]` | Optional callable that takes a cell Region object and returns its string content. Overrides default text extraction for the 'text' method. | `None` |
| `show_progress` | `bool` | If True, display a progress bar during cell text extraction for the 'text' method. | `False` |
| `content_filter` | `Optional[Union[str, Callable[[str], bool], List[str]]]` | Optional content filter applied during cell text extraction: a regex pattern string (characters matching the pattern are EXCLUDED), a callable that takes text and returns True to KEEP the character, or a list of regex patterns (characters matching ANY pattern are EXCLUDED). Works with all extraction methods. | `None` |
| `apply_exclusions` | `bool` | Whether to apply exclusion regions during text extraction. When True, text within excluded regions (e.g., headers/footers) is not extracted. | `True` |
| `verticals` | `Optional[List]` | Explicit vertical lines for table extraction. When provided, automatically sets `vertical_strategy='explicit'` and `explicit_vertical_lines`. | `None` |
| `horizontals` | `Optional[List]` | Explicit horizontal lines for table extraction. When provided, automatically sets `horizontal_strategy='explicit'` and `explicit_horizontal_lines`. | `None` |

Returns:

`TableResult` – table data as a list of rows, where each row is a list of cell values (str or None).
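A short sketch of the two pdfplumber aliases; `TableResult` is treated here as an iterable of rows, per the Returns description:

```python
table_region = page.find("region[type=table]")

# Ruled tables: 'lattice' maps to line-based strategies
rows = table_region.extract_table(method="lattice")

# Whitespace-aligned tables: 'stream' maps to text-based strategies
rows = table_region.extract_table(method="stream")

for row in rows:
    print(row)
```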

Source code in natural_pdf/elements/region.py
def extract_table(
    self,
    method: Optional[str] = None,  # Make method optional
    table_settings: Optional[dict] = None,  # Use Optional
    use_ocr: bool = False,
    ocr_config: Optional[dict] = None,  # Use Optional
    text_options: Optional[Dict] = None,
    cell_extraction_func: Optional[Callable[["Region"], Optional[str]]] = None,
    # --- NEW: Add tqdm control option --- #
    show_progress: bool = False,  # Controls progress bar for text method
    content_filter: Optional[
        Union[str, Callable[[str], bool], List[str]]
    ] = None,  # NEW: Content filtering
    apply_exclusions: bool = True,  # Whether to apply exclusion regions during extraction
    verticals: Optional[List] = None,  # Explicit vertical lines
    horizontals: Optional[List] = None,  # Explicit horizontal lines
) -> TableResult:  # Return type allows Optional[str] for cells
    """
    Extract a table from this region.

    Args:
        method: Method to use: 'tatr', 'pdfplumber', 'text', 'stream', 'lattice', or None (auto-detect).
                'stream' is an alias for 'pdfplumber' with text-based strategies (equivalent to
                setting `vertical_strategy` and `horizontal_strategy` to 'text').
                'lattice' is an alias for 'pdfplumber' with line-based strategies (equivalent to
                setting `vertical_strategy` and `horizontal_strategy` to 'lines').
        table_settings: Settings for pdfplumber table extraction (used with 'pdfplumber', 'stream', or 'lattice' methods).
        use_ocr: Whether to use OCR for text extraction (currently only applicable with 'tatr' method).
        ocr_config: OCR configuration parameters.
        text_options: Dictionary of options for the 'text' method, corresponding to arguments
                      of analyze_text_table_structure (e.g., snap_tolerance, expand_bbox).
        cell_extraction_func: Optional callable function that takes a cell Region object
                              and returns its string content. Overrides default text extraction
                              for the 'text' method.
        show_progress: If True, display a progress bar during cell text extraction for the 'text' method.
        content_filter: Optional content filter to apply during cell text extraction. Can be:
            - A regex pattern string (characters matching the pattern are EXCLUDED)
            - A callable that takes text and returns True to KEEP the character
            - A list of regex patterns (characters matching ANY pattern are EXCLUDED)
            Works with all extraction methods by filtering cell content.
        apply_exclusions: Whether to apply exclusion regions during text extraction (default: True).
            When True, text within excluded regions (e.g., headers/footers) will not be extracted.
        verticals: Optional list of explicit vertical lines for table extraction. When provided,
                   automatically sets vertical_strategy='explicit' and explicit_vertical_lines.
        horizontals: Optional list of explicit horizontal lines for table extraction. When provided,
                     automatically sets horizontal_strategy='explicit' and explicit_horizontal_lines.

    Returns:
        Table data as a list of rows, where each row is a list of cell values (str or None).
    """
    # Default settings if none provided
    if table_settings is None:
        table_settings = {}
    if text_options is None:
        text_options = {}  # Initialize empty dict

    # Handle explicit vertical and horizontal lines
    if verticals is not None:
        table_settings["vertical_strategy"] = "explicit"
        table_settings["explicit_vertical_lines"] = verticals
    if horizontals is not None:
        table_settings["horizontal_strategy"] = "explicit"
        table_settings["explicit_horizontal_lines"] = horizontals

    # Auto-detect method if not specified
    if method is None:
        # If this is a TATR-detected region, use TATR method
        if hasattr(self, "model") and self.model == "tatr" and self.region_type == "table":
            effective_method = "tatr"
        else:
            # Try lattice first, then fall back to stream if no meaningful results
            logger.debug(f"Region {self.bbox}: Auto-detecting table extraction method...")

            # --- NEW: Prefer already-created table_cell regions if they exist --- #
            try:
                cell_regions_in_table = [
                    c
                    for c in self.page.find_all(
                        "region[type=table_cell]", apply_exclusions=False
                    )
                    if self.intersects(c)
                ]
            except Exception as _cells_err:
                cell_regions_in_table = []  # Fallback silently

            if cell_regions_in_table:
                logger.debug(
                    f"Region {self.bbox}: Found {len(cell_regions_in_table)} pre-computed table_cell regions – using 'cells' method."
                )
                return TableResult(
                    self._extract_table_from_cells(
                        cell_regions_in_table,
                        content_filter=content_filter,
                        apply_exclusions=apply_exclusions,
                    )
                )

            # --------------------------------------------------------------- #

            try:
                logger.debug(f"Region {self.bbox}: Trying 'lattice' method first...")
                lattice_result = self.extract_table(
                    "lattice", table_settings=table_settings.copy()
                )

                # Check if lattice found meaningful content
                if (
                    lattice_result
                    and len(lattice_result) > 0
                    and any(
                        any(cell and cell.strip() for cell in row if cell)
                        for row in lattice_result
                    )
                ):
                    logger.debug(
                        f"Region {self.bbox}: 'lattice' method found table with {len(lattice_result)} rows"
                    )
                    return lattice_result
                else:
                    logger.debug(
                        f"Region {self.bbox}: 'lattice' method found no meaningful content"
                    )
            except Exception as e:
                logger.debug(f"Region {self.bbox}: 'lattice' method failed: {e}")

            # Fall back to stream
            logger.debug(f"Region {self.bbox}: Falling back to 'stream' method...")
            return self.extract_table("stream", table_settings=table_settings.copy())
    else:
        effective_method = method

    # Handle method aliases for pdfplumber
    if effective_method == "stream":
        logger.debug("Using 'stream' method alias for 'pdfplumber' with text-based strategies.")
        effective_method = "pdfplumber"
        # Set default text strategies if not already provided by the user
        table_settings.setdefault("vertical_strategy", "text")
        table_settings.setdefault("horizontal_strategy", "text")
    elif effective_method == "lattice":
        logger.debug(
            "Using 'lattice' method alias for 'pdfplumber' with line-based strategies."
        )
        effective_method = "pdfplumber"
        # Set default line strategies if not already provided by the user
        table_settings.setdefault("vertical_strategy", "lines")
        table_settings.setdefault("horizontal_strategy", "lines")

    # -------------------------------------------------------------
    # Auto-inject tolerances when text-based strategies are requested.
    # This must happen AFTER alias handling (so strategies are final)
    # and BEFORE we delegate to _extract_table_* helpers.
    # -------------------------------------------------------------
    if "text" in (
        table_settings.get("vertical_strategy"),
        table_settings.get("horizontal_strategy"),
    ):
        page_cfg = getattr(self.page, "_config", {})
        # Ensure text_* tolerances passed to pdfplumber
        if "text_x_tolerance" not in table_settings and "x_tolerance" not in table_settings:
            if page_cfg.get("x_tolerance") is not None:
                table_settings["text_x_tolerance"] = page_cfg["x_tolerance"]
        if "text_y_tolerance" not in table_settings and "y_tolerance" not in table_settings:
            if page_cfg.get("y_tolerance") is not None:
                table_settings["text_y_tolerance"] = page_cfg["y_tolerance"]

        # Snap / join tolerances (~ line spacing)
        if "snap_tolerance" not in table_settings and "snap_x_tolerance" not in table_settings:
            snap = max(1, round((page_cfg.get("y_tolerance", 1)) * 0.9))
            table_settings["snap_tolerance"] = snap
        if "join_tolerance" not in table_settings and "join_x_tolerance" not in table_settings:
            table_settings["join_tolerance"] = table_settings["snap_tolerance"]

    logger.debug(f"Region {self.bbox}: Extracting table using method '{effective_method}'")

    # For stream method with text-based edge detection and explicit vertical lines,
    # adjust guides to ensure they fall within text bounds for proper intersection
    if (
        effective_method == "pdfplumber"
        and table_settings.get("horizontal_strategy") == "text"
        and table_settings.get("vertical_strategy") == "explicit"
        and "explicit_vertical_lines" in table_settings
    ):

        text_elements = self.find_all("text", apply_exclusions=apply_exclusions)
        if text_elements:
            text_bounds = text_elements.merge().bbox
            text_left = text_bounds[0]
            text_right = text_bounds[2]

            # Adjust vertical guides to fall within text bounds
            original_verticals = table_settings["explicit_vertical_lines"]
            adjusted_verticals = []

            for v in original_verticals:
                if v < text_left:
                    # Guide is left of text bounds, clip to text start
                    adjusted_verticals.append(text_left)
                    logger.debug(
                        f"Region {self.bbox}: Adjusted left guide from {v:.1f} to {text_left:.1f}"
                    )
                elif v > text_right:
                    # Guide is right of text bounds, clip to text end
                    adjusted_verticals.append(text_right)
                    logger.debug(
                        f"Region {self.bbox}: Adjusted right guide from {v:.1f} to {text_right:.1f}"
                    )
                else:
                    # Guide is within text bounds, keep as is
                    adjusted_verticals.append(v)

            # Update table settings with adjusted guides
            table_settings["explicit_vertical_lines"] = adjusted_verticals
            logger.debug(
                f"Region {self.bbox}: Adjusted {len(original_verticals)} guides for stream extraction. "
                f"Text bounds: {text_left:.1f}-{text_right:.1f}"
            )

    # Use the selected method
    if effective_method == "tatr":
        table_rows = self._extract_table_tatr(
            use_ocr=use_ocr,
            ocr_config=ocr_config,
            content_filter=content_filter,
            apply_exclusions=apply_exclusions,
        )
    elif effective_method == "text":
        current_text_options = text_options.copy()
        current_text_options["cell_extraction_func"] = cell_extraction_func
        current_text_options["show_progress"] = show_progress
        current_text_options["content_filter"] = content_filter
        current_text_options["apply_exclusions"] = apply_exclusions
        table_rows = self._extract_table_text(**current_text_options)
    elif effective_method == "pdfplumber":
        table_rows = self._extract_table_plumber(
            table_settings, content_filter=content_filter, apply_exclusions=apply_exclusions
        )
    else:
        raise ValueError(
            f"Unknown table extraction method: '{method}'. Choose from 'tatr', 'pdfplumber', 'text', 'stream', 'lattice'."
        )

    return TableResult(table_rows)
natural_pdf.Region.extract_tables(method=None, table_settings=None)

Extract all tables from this region using pdfplumber-based methods.

Note: Only 'pdfplumber', 'stream', and 'lattice' methods are supported for extract_tables. 'tatr' and 'text' methods are designed for single table extraction only.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `method` | `Optional[str]` | Method to use: 'pdfplumber', 'stream', 'lattice', or None (auto-detect). 'stream' uses text-based strategies, 'lattice' uses line-based strategies. | `None` |
| `table_settings` | `Optional[dict]` | Settings for pdfplumber table extraction. | `None` |

Returns:

`List[List[List[str]]]` – list of tables, where each table is a list of rows, and each row is a list of cell values.
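A minimal sketch:

```python
tables = region.extract_tables(method="lattice")
for i, table in enumerate(tables):
    print(f"Table {i}: {len(table)} rows")
```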

Source code in natural_pdf/elements/region.py
def extract_tables(
    self,
    method: Optional[str] = None,
    table_settings: Optional[dict] = None,
) -> List[List[List[str]]]:
    """
    Extract all tables from this region using pdfplumber-based methods.

    Note: Only 'pdfplumber', 'stream', and 'lattice' methods are supported for extract_tables.
    'tatr' and 'text' methods are designed for single table extraction only.

    Args:
        method: Method to use: 'pdfplumber', 'stream', 'lattice', or None (auto-detect).
                'stream' uses text-based strategies, 'lattice' uses line-based strategies.
        table_settings: Settings for pdfplumber table extraction.

    Returns:
        List of tables, where each table is a list of rows, and each row is a list of cell values.
    """
    if table_settings is None:
        table_settings = {}

    # Auto-detect method if not specified (try lattice first, then stream)
    if method is None:
        logger.debug(f"Region {self.bbox}: Auto-detecting tables extraction method...")

        # Try lattice first
        try:
            lattice_settings = table_settings.copy()
            lattice_settings.setdefault("vertical_strategy", "lines")
            lattice_settings.setdefault("horizontal_strategy", "lines")

            logger.debug(f"Region {self.bbox}: Trying 'lattice' method first for tables...")
            lattice_result = self._extract_tables_plumber(lattice_settings)

            # Check if lattice found meaningful tables
            if (
                lattice_result
                and len(lattice_result) > 0
                and any(
                    any(
                        any(cell and cell.strip() for cell in row if cell)
                        for row in table
                        if table
                    )
                    for table in lattice_result
                )
            ):
                logger.debug(
                    f"Region {self.bbox}: 'lattice' method found {len(lattice_result)} tables"
                )
                return lattice_result
            else:
                logger.debug(f"Region {self.bbox}: 'lattice' method found no meaningful tables")

        except Exception as e:
            logger.debug(f"Region {self.bbox}: 'lattice' method failed: {e}")

        # Fall back to stream
        logger.debug(f"Region {self.bbox}: Falling back to 'stream' method for tables...")
        stream_settings = table_settings.copy()
        stream_settings.setdefault("vertical_strategy", "text")
        stream_settings.setdefault("horizontal_strategy", "text")

        return self._extract_tables_plumber(stream_settings)

    effective_method = method

    # Handle method aliases
    if effective_method == "stream":
        logger.debug("Using 'stream' method alias for 'pdfplumber' with text-based strategies.")
        effective_method = "pdfplumber"
        table_settings.setdefault("vertical_strategy", "text")
        table_settings.setdefault("horizontal_strategy", "text")
    elif effective_method == "lattice":
        logger.debug(
            "Using 'lattice' method alias for 'pdfplumber' with line-based strategies."
        )
        effective_method = "pdfplumber"
        table_settings.setdefault("vertical_strategy", "lines")
        table_settings.setdefault("horizontal_strategy", "lines")

    # Use the selected method
    if effective_method == "pdfplumber":
        return self._extract_tables_plumber(table_settings)
    else:
        raise ValueError(
            f"Unknown tables extraction method: '{method}'. Choose from 'pdfplumber', 'stream', 'lattice'."
        )
natural_pdf.Region.extract_text(granularity='chars', apply_exclusions=True, debug=False, *, overlap='center', newlines=True, content_filter=None, **kwargs)

Extract text from this region, respecting page exclusions and using pdfplumber's layout engine (chars_to_textmap).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `granularity` | `str` | Level of text extraction: 'chars' for character-by-character extraction (current behavior) or 'words' for word-level extraction with configurable overlap. | `'chars'` |
| `apply_exclusions` | `bool` | Whether to apply exclusion regions defined on the parent page. | `True` |
| `debug` | `bool` | Enable verbose debugging output for filtering steps. | `False` |
| `overlap` | `str` | How to determine if words overlap with the region (only used when granularity='words'): 'center' (word center point must be inside), 'full' (word must be fully inside the region), or 'partial' (any overlap includes the word). | `'center'` |
| `newlines` | `Union[bool, str]` | Newline handling: if False, newlines are replaced with spaces; if a string, they are replaced with that string; if True, they are kept. | `True` |
| `content_filter` | | Optional content filter to exclude specific text patterns: a regex pattern string (characters matching the pattern are EXCLUDED), a callable that takes text and returns True to KEEP the character, or a list of regex patterns (characters matching ANY pattern are EXCLUDED). | `None` |
| `**kwargs` | | Additional layout parameters passed directly to pdfplumber's `chars_to_textmap` function (e.g., `layout`, `x_density`, `y_density`). See Page.extract_text docstring for more. | `{}` |

Returns:

`str` – extracted text, potentially with layout-based spacing.
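A minimal sketch of the two granularities:

```python
# Character-level (default): uses pdfplumber's layout engine
text = region.extract_text()

# Word-level: join words whose center point falls inside the region
words_text = region.extract_text(granularity="words", overlap="center")
```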

Source code in natural_pdf/elements/region.py
def extract_text(
    self,
    granularity: str = "chars",
    apply_exclusions: bool = True,
    debug: bool = False,
    *,
    overlap: str = "center",
    newlines: Union[bool, str] = True,
    content_filter=None,
    **kwargs,
) -> str:
    """
    Extract text from this region, respecting page exclusions and using pdfplumber's
    layout engine (chars_to_textmap).

    Args:
        granularity: Level of text extraction - 'chars' (default) or 'words'.
            - 'chars': Character-by-character extraction (current behavior)
            - 'words': Word-level extraction with configurable overlap
        apply_exclusions: Whether to apply exclusion regions defined on the parent page.
        debug: Enable verbose debugging output for filtering steps.
        overlap: How to determine if words overlap with the region (only used when granularity='words'):
            - 'center': Word center point must be inside (default)
            - 'full': Word must be fully inside the region
            - 'partial': Any overlap includes the word
        newlines: Controls newline handling - True keeps newlines, False replaces
            them with spaces, and a string replaces them with that string.
        content_filter: Optional content filter to exclude specific text patterns. Can be:
            - A regex pattern string (characters matching the pattern are EXCLUDED)
            - A callable that takes text and returns True to KEEP the character
            - A list of regex patterns (characters matching ANY pattern are EXCLUDED)
        **kwargs: Additional layout parameters passed directly to pdfplumber's
                  `chars_to_textmap` function (e.g., layout, x_density, y_density).
                  See Page.extract_text docstring for more.

    Returns:
        Extracted text as string, potentially with layout-based spacing.
    """
    # Validate granularity parameter
    if granularity not in ("chars", "words"):
        raise ValueError(f"granularity must be 'chars' or 'words', got '{granularity}'")

    # Allow 'debug_exclusions' for backward compatibility
    debug = kwargs.get("debug", debug or kwargs.get("debug_exclusions", False))
    logger.debug(
        f"Region {self.bbox}: extract_text called with granularity='{granularity}', overlap='{overlap}', kwargs: {kwargs}"
    )

    # Handle word-level extraction
    if granularity == "words":
        # Use find_all to get words with proper overlap and exclusion handling
        word_elements = self.find_all(
            "text", overlap=overlap, apply_exclusions=apply_exclusions
        )

        # Join the text from all matching words
        text_parts = []
        for word in word_elements:
            word_text = word.extract_text()
            if word_text:  # Skip empty strings
                text_parts.append(word_text)

        result = " ".join(text_parts)

        # Apply newlines processing if requested
        if newlines is False:
            result = result.replace("\n", " ").replace("\r", " ")
        elif isinstance(newlines, str):
            result = result.replace("\n", newlines).replace("\r", newlines)

        return result

    # Original character-level extraction logic follows...
    # 1. Get Word Elements potentially within this region (initial broad phase)
    # Optimization: Could use spatial query if page elements were indexed
    page_words = self.page.words  # Get all words from the page

    # 2. Gather all character dicts from words potentially in region
    # We filter precisely in filter_chars_spatially
    all_char_dicts = []
    for word in page_words:
        # Quick bbox check to avoid processing words clearly outside
        if get_bbox_overlap(self.bbox, word.bbox) is not None:
            all_char_dicts.extend(getattr(word, "_char_dicts", []))

    if not all_char_dicts:
        logger.debug(f"Region {self.bbox}: No character dicts found overlapping region bbox.")
        return ""

    # 3. Get Relevant Exclusions (overlapping this region)
    apply_exclusions_flag = kwargs.get("apply_exclusions", apply_exclusions)
    exclusion_regions = []
    if apply_exclusions_flag:
        # Always call _get_exclusion_regions to get both page and PDF level exclusions
        all_page_exclusions = self._page._get_exclusion_regions(
            include_callable=True, debug=debug
        )
        overlapping_exclusions = []
        for excl in all_page_exclusions:
            if get_bbox_overlap(self.bbox, excl.bbox) is not None:
                overlapping_exclusions.append(excl)
        exclusion_regions = overlapping_exclusions
        if debug:
            logger.debug(
                f"Region {self.bbox}: Found {len(all_page_exclusions)} total exclusions, "
                f"{len(exclusion_regions)} overlapping this region."
            )
    elif debug:
        logger.debug(f"Region {self.bbox}: Not applying exclusions (apply_exclusions=False).")

    # Add boundary element exclusions if this is a section with boundary settings
    if hasattr(self, "_boundary_exclusions") and self._boundary_exclusions != "both":
        boundary_exclusions = []

        if self._boundary_exclusions == "none":
            # Exclude both start and end elements
            if hasattr(self, "start_element") and self.start_element:
                boundary_exclusions.append(self.start_element)
            if hasattr(self, "end_element") and self.end_element:
                boundary_exclusions.append(self.end_element)
        elif self._boundary_exclusions == "start":
            # Exclude only end element
            if hasattr(self, "end_element") and self.end_element:
                boundary_exclusions.append(self.end_element)
        elif self._boundary_exclusions == "end":
            # Exclude only start element
            if hasattr(self, "start_element") and self.start_element:
                boundary_exclusions.append(self.start_element)

        # Add boundary elements as exclusion regions
        for elem in boundary_exclusions:
            if hasattr(elem, "bbox"):
                exclusion_regions.append(elem)
                if debug:
                    logger.debug(
                        f"Adding boundary exclusion: {elem.extract_text().strip()} at {elem.bbox}"
                    )

    # 4. Spatially Filter Characters using Utility
    # Pass self as the target_region for precise polygon checks etc.
    filtered_chars = filter_chars_spatially(
        char_dicts=all_char_dicts,
        exclusion_regions=exclusion_regions,
        target_region=self,  # Pass self!
        debug=debug,
    )

    # 5. Generate Text Layout using Utility
    # Add content_filter to kwargs if provided
    final_kwargs = kwargs.copy()
    if content_filter is not None:
        final_kwargs["content_filter"] = content_filter

    result = generate_text_layout(
        char_dicts=filtered_chars,
        layout_context_bbox=self.bbox,  # Use region's bbox for context
        user_kwargs=final_kwargs,  # Pass kwargs including content_filter
    )

    # Flexible newline handling (same logic as TextElement)
    if isinstance(newlines, bool):
        if newlines is False:
            replacement = " "
        else:
            replacement = None
    else:
        replacement = str(newlines)

    if replacement is not None:
        result = result.replace("\n", replacement).replace("\r", replacement)

    logger.debug(f"Region {self.bbox}: extract_text finished, result length: {len(result)}.")
    return result
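A short sketch of the extraction options documented above; region is assumed to be any existing Region instance:

# Character-level extraction (default), keeping newlines
text = region.extract_text()

# Word-level extraction: only words whose center point falls inside the region
words_text = region.extract_text(granularity="words", overlap="center")

# Exclude digits via a regex content filter (matching characters are dropped)
no_digits = region.extract_text(content_filter=r"\d")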
natural_pdf.Region.find(selector=None, *, text=None, overlap='full', apply_exclusions=True, regex=False, case=True, **kwargs)
find(*, text: str, overlap: str = 'full', apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> Optional[Element]
find(selector: str, *, overlap: str = 'full', apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> Optional[Element]

Find the first element in this region matching the selector OR text content.

Provide EITHER selector OR text, but not both.

Parameters:

- selector (Optional[str], default None): CSS-like selector string.
- text (Optional[str], default None): Text content to search for (equivalent to 'text:contains(...)').
- overlap (str, default 'full'): How to determine if elements overlap with the region: 'full' (fully inside), 'partial' (any overlap), or 'center' (center point inside).
- apply_exclusions (bool, default True): Whether to exclude elements in exclusion regions.
- regex (bool, default False): Whether to use regex for text search (selector or text).
- case (bool, default True): Whether to do case-sensitive text search (selector or text).
- **kwargs: Additional parameters for element filtering.

Returns:

- Optional[Element]: First matching element or None.

Source code in natural_pdf/elements/region.py
def find(
    self,
    selector: Optional[str] = None,  # Now optional
    *,
    text: Optional[str] = None,  # New text parameter
    overlap: str = "full",  # How elements overlap with the region
    apply_exclusions: bool = True,
    regex: bool = False,
    case: bool = True,
    **kwargs,
) -> Optional["Element"]:
    """
    Find the first element in this region matching the selector OR text content.

    Provide EITHER `selector` OR `text`, but not both.

    Args:
        selector: CSS-like selector string.
        text: Text content to search for (equivalent to 'text:contains(...)').
        overlap: How to determine if elements overlap with the region: 'full' (fully inside),
                 'partial' (any overlap), or 'center' (center point inside).
                 (default: "full")
        apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
        regex: Whether to use regex for text search (`selector` or `text`) (default: False).
        case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
        **kwargs: Additional parameters for element filtering.

    Returns:
        First matching element or None.
    """
    # Delegate validation and selector construction to find_all
    elements = self.find_all(
        selector=selector,
        text=text,
        overlap=overlap,
        apply_exclusions=apply_exclusions,
        regex=regex,
        case=case,
        **kwargs,
    )
    return elements.first if elements else None
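A brief sketch of both lookup styles; the selector and search text are illustrative:

# First bold text element fully inside the region
heading = region.find("text:bold")

# Same lookup by literal text, counting any overlap with the region
label = region.find(text="Total", overlap="partial")
if label is not None:
    print(label.extract_text())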
natural_pdf.Region.find_all(selector=None, *, text=None, overlap='full', apply_exclusions=True, regex=False, case=True, **kwargs)
find_all(*, text: str, overlap: str = 'full', apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> ElementCollection
find_all(selector: str, *, overlap: str = 'full', apply_exclusions: bool = True, regex: bool = False, case: bool = True, **kwargs) -> ElementCollection

Find all elements in this region matching the selector OR text content.

Provide EITHER selector OR text, but not both.

Parameters:

- selector (Optional[str], default None): CSS-like selector string.
- text (Optional[str], default None): Text content to search for (equivalent to 'text:contains(...)').
- overlap (str, default 'full'): How to determine if elements overlap with the region: 'full' (fully inside), 'partial' (any overlap), or 'center' (center point inside).
- apply_exclusions (bool, default True): Whether to exclude elements in exclusion regions.
- regex (bool, default False): Whether to use regex for text search (selector or text).
- case (bool, default True): Whether to do case-sensitive text search (selector or text).
- **kwargs: Additional parameters for element filtering.

Returns:

- ElementCollection: ElementCollection with matching elements.

Source code in natural_pdf/elements/region.py
def find_all(
    self,
    selector: Optional[str] = None,  # Now optional
    *,
    text: Optional[str] = None,  # New text parameter
    overlap: str = "full",  # How elements overlap with the region
    apply_exclusions: bool = True,
    regex: bool = False,
    case: bool = True,
    **kwargs,
) -> "ElementCollection":
    """
    Find all elements in this region matching the selector OR text content.

    Provide EITHER `selector` OR `text`, but not both.

    Args:
        selector: CSS-like selector string.
        text: Text content to search for (equivalent to 'text:contains(...)').
        overlap: How to determine if elements overlap with the region: 'full' (fully inside),
                 'partial' (any overlap), or 'center' (center point inside).
                 (default: "full")
        apply_exclusions: Whether to exclude elements in exclusion regions (default: True).
        regex: Whether to use regex for text search (`selector` or `text`) (default: False).
        case: Whether to do case-sensitive text search (`selector` or `text`) (default: True).
        **kwargs: Additional parameters for element filtering.

    Returns:
        ElementCollection with matching elements.
    """
    from natural_pdf.elements.element_collection import ElementCollection

    if selector is not None and text is not None:
        raise ValueError("Provide either 'selector' or 'text', not both.")
    if selector is None and text is None:
        raise ValueError("Provide either 'selector' or 'text'.")

    # Validate overlap parameter
    if overlap not in ["full", "partial", "center"]:
        raise ValueError(
            f"Invalid overlap value: {overlap}. Must be 'full', 'partial', or 'center'"
        )

    # Construct selector if 'text' is provided
    effective_selector = ""
    if text is not None:
        escaped_text = text.replace('"', '\\"').replace("'", "\\'")
        effective_selector = f'text:contains("{escaped_text}")'
        logger.debug(
            f"Using text shortcut: find_all(text='{text}') -> find_all('{effective_selector}')"
        )
    elif selector is not None:
        effective_selector = selector
    else:
        raise ValueError("Internal error: No selector or text provided.")

    # Normal case: Region is on a single page
    try:
        # Parse the final selector string
        selector_obj = parse_selector(effective_selector)

        # Get all potentially relevant elements from the page
        # Let the page handle its exclusion logic if needed
        potential_elements = self.page.find_all(
            selector=effective_selector,
            apply_exclusions=apply_exclusions,
            regex=regex,
            case=case,
            **kwargs,
        )

        # Filter these elements based on the specified containment method
        region_bbox = self.bbox
        matching_elements = []

        if overlap == "full":  # Fully inside (strict)
            matching_elements = [
                el
                for el in potential_elements
                if el.x0 >= region_bbox[0]
                and el.top >= region_bbox[1]
                and el.x1 <= region_bbox[2]
                and el.bottom <= region_bbox[3]
            ]
        elif overlap == "partial":  # Any overlap
            matching_elements = [el for el in potential_elements if self.intersects(el)]
        elif overlap == "center":  # Center point inside
            matching_elements = [
                el for el in potential_elements if self.is_element_center_inside(el)
            ]

        return ElementCollection(matching_elements)

    except Exception as e:
        logger.error(f"Error during find_all in region: {e}", exc_info=True)
        return ElementCollection([])
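A sketch of the three overlap modes (the selectors are illustrative):

# Elements fully inside the region (the default, overlap='full')
inside = region.find_all("text")

# Anything that touches the region at all
touching = region.find_all("text", overlap="partial")

# Case-insensitive literal-text search
totals = region.find_all(text="total", case=False)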
natural_pdf.Region.get_children(selector=None)

Get immediate child regions, optionally filtered by selector.

Parameters:

- selector (default None): Optional selector to filter children.

Returns:

- List of child regions matching the selector.

Source code in natural_pdf/elements/region.py
def get_children(self, selector=None):
    """
    Get immediate child regions, optionally filtered by selector.

    Args:
        selector: Optional selector to filter children

    Returns:
        List of child regions matching the selector
    """
    import logging

    logger = logging.getLogger("natural_pdf.elements.region")

    if selector is None:
        return self.child_regions

    # Use existing selector parser to filter
    try:
        selector_obj = parse_selector(selector)
        filter_func = selector_to_filter_func(selector_obj)  # Removed region=self
        matched = [child for child in self.child_regions if filter_func(child)]
        logger.debug(
            f"get_children: found {len(matched)} of {len(self.child_regions)} children matching '{selector}'"
        )
        return matched
    except Exception as e:
        logger.error(f"Error applying selector in get_children: {e}", exc_info=True)
        return []  # Return empty list on error
natural_pdf.Region.get_descendants(selector=None)

Get all descendant regions (children, grandchildren, etc.), optionally filtered by selector.

Parameters:

- selector (default None): Optional selector to filter descendants.

Returns:

- List of descendant regions matching the selector.

Source code in natural_pdf/elements/region.py
def get_descendants(self, selector=None):
    """
    Get all descendant regions (children, grandchildren, etc.), optionally filtered by selector.

    Args:
        selector: Optional selector to filter descendants

    Returns:
        List of descendant regions matching the selector
    """
    import logging

    logger = logging.getLogger("natural_pdf.elements.region")

    all_descendants = []
    queue = list(self.child_regions)  # Start with direct children

    while queue:
        current = queue.pop(0)
        all_descendants.append(current)
        # Add current's children to the queue for processing
        if hasattr(current, "child_regions"):
            queue.extend(current.child_regions)

    logger.debug(f"get_descendants: found {len(all_descendants)} total descendants")

    # Filter by selector if provided
    if selector is not None:
        try:
            selector_obj = parse_selector(selector)
            filter_func = selector_to_filter_func(selector_obj)  # Removed region=self
            matched = [desc for desc in all_descendants if filter_func(desc)]
            logger.debug(f"get_descendants: filtered to {len(matched)} matching '{selector}'")
            return matched
        except Exception as e:
            logger.error(f"Error applying selector in get_descendants: {e}", exc_info=True)
            return []  # Return empty list on error

    return all_descendants
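A sketch of walking the region hierarchy; it assumes this region has child regions (e.g., produced by layout analysis), and the attribute selectors shown are illustrative:

# Immediate children only, filtered by a selector
tables = region.get_children("region[type=table]")

# Children, grandchildren, and deeper levels
cells = region.get_descendants("region[type=table-cell]")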
natural_pdf.Region.get_elements(selector=None, apply_exclusions=True, **kwargs)

Get all elements within this region.

Parameters:

- selector (Optional[str], default None): Optional selector to filter elements.
- apply_exclusions (default True): Whether to apply exclusion regions.
- **kwargs: Additional parameters for element filtering.

Returns:

- List[Element]: List of elements in the region.

Source code in natural_pdf/elements/region.py
def get_elements(
    self, selector: Optional[str] = None, apply_exclusions=True, **kwargs
) -> List["Element"]:
    """
    Get all elements within this region.

    Args:
        selector: Optional selector to filter elements
        apply_exclusions: Whether to apply exclusion regions
        **kwargs: Additional parameters for element filtering

    Returns:
        List of elements in the region
    """
    if selector:
        # Find elements on the page matching the selector
        page_elements = self.page.find_all(
            selector, apply_exclusions=apply_exclusions, **kwargs
        )
        # Filter those elements to only include ones within this region
        elements = [e for e in page_elements if self._is_element_in_region(e)]
    else:
        # Get all elements from the page
        page_elements = self.page.get_elements(apply_exclusions=apply_exclusions)
        # Filter to elements in this region
        elements = [e for e in page_elements if self._is_element_in_region(e)]

    # Apply boundary exclusions if this is a section with boundary settings
    if hasattr(self, "_boundary_exclusions") and self._boundary_exclusions != "both":
        excluded_ids = set()

        if self._boundary_exclusions == "none":
            # Exclude both start and end elements
            if hasattr(self, "start_element") and self.start_element:
                excluded_ids.add(id(self.start_element))
            if hasattr(self, "end_element") and self.end_element:
                excluded_ids.add(id(self.end_element))
        elif self._boundary_exclusions == "start":
            # Exclude only end element
            if hasattr(self, "end_element") and self.end_element:
                excluded_ids.add(id(self.end_element))
        elif self._boundary_exclusions == "end":
            # Exclude only start element
            if hasattr(self, "start_element") and self.start_element:
                excluded_ids.add(id(self.start_element))

        if excluded_ids:
            elements = [e for e in elements if id(e) not in excluded_ids]

    return elements
natural_pdf.Region.get_section_between(start_element=None, end_element=None, include_boundaries='both', orientation='vertical')

Get a section between two elements within this region.

Parameters:

- start_element (default None): Element marking the start of the section.
- end_element (default None): Element marking the end of the section.
- include_boundaries (default 'both'): How to include boundary elements: 'start', 'end', 'both', or 'none'.
- orientation (default 'vertical'): 'vertical' (default) or 'horizontal'; determines section direction.

Returns:

- Region representing the section.

Source code in natural_pdf/elements/region.py
def get_section_between(
    self,
    start_element=None,
    end_element=None,
    include_boundaries="both",
    orientation="vertical",
):
    """
    Get a section between two elements within this region.

    Args:
        start_element: Element marking the start of the section
        end_element: Element marking the end of the section
        include_boundaries: How to include boundary elements: 'start', 'end', 'both', or 'none'
        orientation: 'vertical' (default) or 'horizontal' - determines section direction

    Returns:
        Region representing the section
    """
    # Get elements only within this region first
    elements = self.get_elements()

    # If no elements, return self or empty region?
    if not elements:
        logger.warning(
            f"get_section_between called on region {self.bbox} with no contained elements."
        )
        # Return an empty region at the start of the parent region
        return Region(self.page, (self.x0, self.top, self.x0, self.top))

    # Sort elements in reading order
    elements.sort(key=lambda e: (e.top, e.x0))

    # Find start index
    start_idx = 0
    if start_element:
        try:
            start_idx = elements.index(start_element)
        except ValueError:
            # Start element not in region, use first element
            logger.debug("Start element not found in region, using first element.")
            start_element = elements[0]  # Use the actual first element
            start_idx = 0
    else:
        start_element = elements[0]  # Default start is first element

    # Find end index
    end_idx = len(elements) - 1
    if end_element:
        try:
            end_idx = elements.index(end_element)
        except ValueError:
            # End element not in region, use last element
            logger.debug("End element not found in region, using last element.")
            end_element = elements[-1]  # Use the actual last element
            end_idx = len(elements) - 1
    else:
        end_element = elements[-1]  # Default end is last element

    # Validate orientation parameter
    if orientation not in ["vertical", "horizontal"]:
        raise ValueError(f"orientation must be 'vertical' or 'horizontal', got '{orientation}'")

    # Use centralized section utilities
    from natural_pdf.utils.sections import calculate_section_bounds, validate_section_bounds

    # Calculate section boundaries
    bounds = calculate_section_bounds(
        start_element=start_element,
        end_element=end_element,
        include_boundaries=include_boundaries,
        orientation=orientation,
        parent_bounds=self.bbox,
    )

    # Validate boundaries
    if not validate_section_bounds(bounds, orientation):
        # Return an empty region at the start position
        x0, top, _, _ = bounds
        return Region(self.page, (x0, top, x0, top))

    # Create new region
    section = Region(self.page, bounds)

    # Store the original boundary elements and exclusion info
    section.start_element = start_element
    section.end_element = end_element
    section._boundary_exclusions = include_boundaries

    return section
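A minimal sketch, assuming the region contains at least two bold headings to act as boundaries:

headings = region.find_all("text:bold")
section = region.get_section_between(
    start_element=headings[0],
    end_element=headings[1],
    include_boundaries="start",  # keep the start heading, drop the end one
)
print(section.extract_text())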
natural_pdf.Region.get_sections(start_elements=None, end_elements=None, include_boundaries='both', orientation='vertical')

Get sections within this region based on start/end elements.

Parameters:

- start_elements (default None): Elements or selector string that mark the start of sections.
- end_elements (default None): Elements or selector string that mark the end of sections.
- include_boundaries (default 'both'): How to include boundary elements: 'start', 'end', 'both', or 'none'.
- orientation (default 'vertical'): 'vertical' (default) or 'horizontal'; determines section direction.

Returns:

- ElementCollection[Region]: List of Region objects representing the extracted sections.

Source code in natural_pdf/elements/region.py
def get_sections(
    self,
    start_elements=None,
    end_elements=None,
    include_boundaries="both",
    orientation="vertical",
) -> "ElementCollection[Region]":
    """
    Get sections within this region based on start/end elements.

    Args:
        start_elements: Elements or selector string that mark the start of sections
        end_elements: Elements or selector string that mark the end of sections
        include_boundaries: How to include boundary elements: 'start', 'end', 'both', or 'none'
        orientation: 'vertical' (default) or 'horizontal' - determines section direction

    Returns:
        List of Region objects representing the extracted sections
    """
    from natural_pdf.elements.element_collection import ElementCollection
    from natural_pdf.utils.sections import extract_sections_from_region

    # Use centralized section extraction logic
    sections = extract_sections_from_region(
        region=self,
        start_elements=start_elements,
        end_elements=end_elements,
        include_boundaries=include_boundaries,
        orientation=orientation,
    )

    return ElementCollection(sections)
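A sketch that slices the region into a section starting at each bold heading (the selector is illustrative):

sections = region.get_sections(
    start_elements="text:bold",
    include_boundaries="start",
)
for section in sections:
    print(section.extract_text()[:60])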
natural_pdf.Region.get_text_table_cells(snap_tolerance=10, join_tolerance=3, min_words_vertical=3, min_words_horizontal=1, intersection_tolerance=3, expand_bbox=None, **kwargs)

Analyzes text alignment to find table cells and returns them as temporary Region objects without adding them to the page.

Parameters:

- snap_tolerance (int, default 10): Tolerance for snapping parallel lines.
- join_tolerance (int, default 3): Tolerance for joining collinear lines.
- min_words_vertical (int, default 3): Minimum words needed to define a vertical line.
- min_words_horizontal (int, default 1): Minimum words needed to define a horizontal line.
- intersection_tolerance (int, default 3): Tolerance for detecting line intersections.
- expand_bbox (Optional[Dict[str, int]], default None): Optional dictionary to expand the search area slightly beyond the region's exact bounds (e.g., {'left': 5, 'right': 5}).
- **kwargs: Additional keyword arguments passed to find_text_based_tables (e.g., specific x/y tolerances).

Returns:

- ElementCollection[Region]: An ElementCollection containing temporary Region objects for each detected cell, or an empty ElementCollection if no cells are found or an error occurs.

Source code in natural_pdf/elements/region.py
def get_text_table_cells(
    self,
    snap_tolerance: int = 10,
    join_tolerance: int = 3,
    min_words_vertical: int = 3,
    min_words_horizontal: int = 1,
    intersection_tolerance: int = 3,
    expand_bbox: Optional[Dict[str, int]] = None,
    **kwargs,
) -> "ElementCollection[Region]":
    """
    Analyzes text alignment to find table cells and returns them as
    temporary Region objects without adding them to the page.

    Args:
        snap_tolerance: Tolerance for snapping parallel lines.
        join_tolerance: Tolerance for joining collinear lines.
        min_words_vertical: Minimum words needed to define a vertical line.
        min_words_horizontal: Minimum words needed to define a horizontal line.
        intersection_tolerance: Tolerance for detecting line intersections.
        expand_bbox: Optional dictionary to expand the search area slightly beyond
                     the region's exact bounds (e.g., {'left': 5, 'right': 5}).
        **kwargs: Additional keyword arguments passed to
                  find_text_based_tables (e.g., specific x/y tolerances).

    Returns:
        An ElementCollection containing temporary Region objects for each detected cell,
        or an empty ElementCollection if no cells are found or an error occurs.
    """
    from natural_pdf.elements.element_collection import ElementCollection

    # 1. Perform the analysis (or use cached results)
    if "text_table_structure" in self.analyses:
        analysis_results = self.analyses["text_table_structure"]
        logger.debug("get_text_table_cells: Using cached analysis results.")
    else:
        analysis_results = self.analyze_text_table_structure(
            snap_tolerance=snap_tolerance,
            join_tolerance=join_tolerance,
            min_words_vertical=min_words_vertical,
            min_words_horizontal=min_words_horizontal,
            intersection_tolerance=intersection_tolerance,
            expand_bbox=expand_bbox,
            **kwargs,
        )

    # 2. Check if analysis was successful and cells were found
    if analysis_results is None or not analysis_results.get("cells"):
        logger.info(f"Region {self.bbox}: No cells found by text table analysis.")
        return ElementCollection([])  # Return empty collection

    # 3. Create temporary Region objects for each cell dictionary
    cell_regions = []
    for cell_data in analysis_results["cells"]:
        try:
            # Use page.region to create the region object
            # It expects left, top, right, bottom keys
            cell_region = self.page.region(**cell_data)

            # Set metadata on the temporary region
            cell_region.region_type = "table-cell"
            cell_region.normalized_type = "table-cell"
            cell_region.model = "pdfplumber-text"
            cell_region.source = "volatile"  # Indicate it's not managed/persistent
            cell_region.parent_region = self  # Link back to the region it came from

            cell_regions.append(cell_region)
        except Exception as e:
            logger.warning(f"Could not create Region object for cell data {cell_data}: {e}")

    # 4. Return the list wrapped in an ElementCollection
    logger.debug(f"get_text_table_cells: Created {len(cell_regions)} temporary cell regions.")
    return ElementCollection(cell_regions)
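A sketch of cell detection on a region believed to hold an unruled table; the tolerance override is illustrative:

cells = region.get_text_table_cells(min_words_vertical=2)
for cell in cells:
    print(cell.bbox, cell.extract_text())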
natural_pdf.Region.highlight(label=None, color=None, use_color_cycling=False, annotate=None, existing='append')

Highlight this region on the page.

Parameters:

- label (Optional[str], default None): Optional label for the highlight.
- color (Optional[Union[Tuple, str]], default None): Color tuple/string for the highlight, or None to use automatic color.
- use_color_cycling (bool, default False): Force color cycling even with no label.
- annotate (Optional[List[str]], default None): List of attribute names to display on the highlight (e.g., ['confidence', 'type']).
- existing (str, default 'append'): How to handle existing highlights ('append' or 'replace').

Returns:

- Region: Self for method chaining.

Source code in natural_pdf/elements/region.py
def highlight(
    self,
    label: Optional[str] = None,
    color: Optional[Union[Tuple, str]] = None,
    use_color_cycling: bool = False,
    annotate: Optional[List[str]] = None,
    existing: str = "append",
) -> "Region":
    """
    Highlight this region on the page.

    Args:
        label: Optional label for the highlight
        color: Color tuple/string for the highlight, or None to use automatic color
        use_color_cycling: Force color cycling even with no label (default: False)
        annotate: List of attribute names to display on the highlight (e.g., ['confidence', 'type'])
        existing: How to handle existing highlights ('append' or 'replace').

    Returns:
        Self for method chaining
    """
    # Access the highlighter service correctly
    highlighter = self.page._highlighter

    # Prepare common arguments
    highlight_args = {
        "page_index": self.page.index,
        "color": color,
        "label": label,
        "use_color_cycling": use_color_cycling,
        "element": self,  # Pass the region itself so attributes can be accessed
        "annotate": annotate,
        "existing": existing,
    }

    # Call the appropriate service method
    if self.has_polygon:
        highlight_args["polygon"] = self.polygon
        highlighter.add_polygon(**highlight_args)
    else:
        highlight_args["bbox"] = self.bbox
        highlighter.add(**highlight_args)

    return self
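Since highlight() returns the region itself, it chains with save(); the label and output path below are hypothetical:

# Highlight with a label-keyed automatic color, then save the annotated page
region.highlight(label="total row").save("highlighted_page.png")

# Show attribute values on the highlight, replacing earlier highlights
region.highlight(annotate=["confidence", "type"], existing="replace")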
natural_pdf.Region.intersects(element)

Check if this region intersects with an element (any overlap).

Parameters:

- element (Element, required): Element to check.

Returns:

- bool: True if the element overlaps with the region at all, False otherwise.

Source code in natural_pdf/elements/region.py
def intersects(self, element: "Element") -> bool:
    """
    Check if this region intersects with an element (any overlap).

    Args:
        element: Element to check

    Returns:
        True if the element overlaps with the region at all, False otherwise
    """
    # Check if element is on the same page
    if not hasattr(element, "page") or element.page != self._page:
        return False

    # Ensure element has necessary attributes
    if not all(hasattr(element, attr) for attr in ["x0", "x1", "top", "bottom"]):
        return False  # Cannot determine position

    # For rectangular regions, check for bbox overlap
    if not self.has_polygon:
        return (
            self.x0 < element.x1
            and self.x1 > element.x0
            and self.top < element.bottom
            and self.bottom > element.top
        )

    # For polygon regions, check if any corner of the element is inside the polygon
    element_corners = [
        (element.x0, element.top),  # top-left
        (element.x1, element.top),  # top-right
        (element.x1, element.bottom),  # bottom-right
        (element.x0, element.bottom),  # bottom-left
    ]

    # First check if any element corner is inside the polygon
    if any(self.is_point_inside(x, y) for x, y in element_corners):
        return True

    # Also check if any polygon corner is inside the element's rectangle
    for x, y in self.polygon:
        if element.x0 <= x <= element.x1 and element.top <= y <= element.bottom:
            return True

    # Also check if any polygon edge intersects with any rectangle edge
    # This is a simplification - for complex cases, we'd need a full polygon-rectangle
    # intersection algorithm

    # For now, return True if bounding boxes overlap (approximation for polygon-rectangle case)
    return (
        self.x0 < element.x1
        and self.x1 > element.x0
        and self.top < element.bottom
        and self.bottom > element.top
    )
natural_pdf.Region.is_element_center_inside(element)

Check if the center point of an element's bounding box is inside this region.

Parameters:

- element (Element, required): Element to check.

Returns:

- bool: True if the element's center point is inside the region, False otherwise.

Source code in natural_pdf/elements/region.py
def is_element_center_inside(self, element: "Element") -> bool:
    """
    Check if the center point of an element's bounding box is inside this region.

    Args:
        element: Element to check

    Returns:
        True if the element's center point is inside the region, False otherwise.
    """
    # Check if element is on the same page
    if not hasattr(element, "page") or element.page != self._page:
        return False

    # Ensure element has necessary attributes
    if not all(hasattr(element, attr) for attr in ["x0", "x1", "top", "bottom"]):
        logger.warning(
            f"Element {element} lacks bounding box attributes. Cannot check center point."
        )
        return False  # Cannot determine position

    # Calculate center point
    center_x = (element.x0 + element.x1) / 2
    center_y = (element.top + element.bottom) / 2

    # Use the existing is_point_inside check
    return self.is_point_inside(center_x, center_y)
natural_pdf.Region.is_point_inside(x, y)

Check if a point is inside this region using ray casting algorithm for polygons.

Parameters:

- x (float, required): X coordinate of the point.
- y (float, required): Y coordinate of the point.

Returns:

- bool: True if the point is inside the region.

Source code in natural_pdf/elements/region.py
def is_point_inside(self, x: float, y: float) -> bool:
    """
    Check if a point is inside this region using ray casting algorithm for polygons.

    Args:
        x: X coordinate of the point
        y: Y coordinate of the point

    Returns:
        bool: True if the point is inside the region
    """
    if not self.has_polygon:
        return (self.x0 <= x <= self.x1) and (self.top <= y <= self.bottom)

    # Ray casting algorithm
    inside = False
    j = len(self.polygon) - 1

    for i in range(len(self.polygon)):
        if ((self.polygon[i][1] > y) != (self.polygon[j][1] > y)) and (
            x
            < (self.polygon[j][0] - self.polygon[i][0])
            * (y - self.polygon[i][1])
            / (self.polygon[j][1] - self.polygon[i][1])
            + self.polygon[i][0]
        ):
            inside = not inside
        j = i

    return inside
natural_pdf.Region.left(width=None, height='element', include_source=False, until=None, include_endpoint=True, offset=None, **kwargs)

Select region to the left of this region.

Parameters:

- width (Optional[float], default None): Width of the region to the left, in points.
- height (str, default 'element'): Height mode - "full" for full page height or "element" for element height.
- include_source (bool, default False): Whether to include this region in the result.
- until (Optional[str], default None): Optional selector string to specify a left boundary element.
- include_endpoint (bool, default True): Whether to include the boundary element in the region.
- offset (Optional[float], default None): Pixel offset when excluding source/endpoint (None uses natural_pdf.options.layout.directional_offset).
- **kwargs: Additional parameters.

Returns:

- Region: Region object representing the area to the left.

Source code in natural_pdf/elements/region.py
def left(
    self,
    width: Optional[float] = None,
    height: str = "element",
    include_source: bool = False,
    until: Optional[str] = None,
    include_endpoint: bool = True,
    offset: Optional[float] = None,
    **kwargs,
) -> "Region":
    """
    Select region to the left of this region.

    Args:
        width: Width of the region to the left, in points
        height: Height mode - "full" for full page height or "element" for element height
        include_source: Whether to include this region in the result (default: False)
        until: Optional selector string to specify a left boundary element
        include_endpoint: Whether to include the boundary element in the region (default: True)
        offset: Pixel offset when excluding source/endpoint (default: None, uses natural_pdf.options.layout.directional_offset)
        **kwargs: Additional parameters

    Returns:
        Region object representing the area to the left
    """
    # Use global default if offset not provided
    if offset is None:
        import natural_pdf

        offset = natural_pdf.options.layout.directional_offset

    return self._direction(
        direction="left",
        size=width,
        cross_size=height,
        include_source=include_source,
        until=until,
        include_endpoint=include_endpoint,
        offset=offset,
        **kwargs,
    )
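A sketch of leftward selection (right(), documented below, is symmetrical); the width and selector are illustrative:

# 50-point strip immediately to the left, matching this region's height
strip = region.left(width=50)

# Everything to the left until a vertical line, spanning the full page height
gutter = region.left(height="full", until="line[orientation=vertical]")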
natural_pdf.Region.region(left=None, top=None, right=None, bottom=None, width=None, height=None, relative=False)

Create a sub-region within this region using the same API as Page.region().

By default, coordinates are absolute (relative to the page), matching Page.region(). Set relative=True to use coordinates relative to this region's top-left corner.

Parameters:

- left (float, default None): Left x-coordinate (absolute by default, or relative to region if relative=True).
- top (float, default None): Top y-coordinate (absolute by default, or relative to region if relative=True).
- right (float, default None): Right x-coordinate (absolute by default, or relative to region if relative=True).
- bottom (float, default None): Bottom y-coordinate (absolute by default, or relative to region if relative=True).
- width (Union[str, float, None], default None): Width definition (same as Page.region()).
- height (Optional[float], default None): Height of the region (same as Page.region()).
- relative (bool, default False): If True, coordinates are relative to this region's top-left (0,0); if False (default), coordinates are absolute page coordinates.

Returns:

- Region: Region object for the specified coordinates, clipped to this region's bounds.

Examples:

- Absolute coordinates (default), same as page.region(): sub = region.region(left=100, top=200, width=50, height=30)
- Relative to the region's top-left: sub = region.region(left=10, top=10, width=50, height=30, relative=True)
- Mix relative positioning with this region's bounds: sub = region.region(left=region.x0 + 10, width=50, height=30)

Source code in natural_pdf/elements/region.py
def region(
    self,
    left: float = None,
    top: float = None,
    right: float = None,
    bottom: float = None,
    width: Union[str, float, None] = None,
    height: Optional[float] = None,
    relative: bool = False,
) -> "Region":
    """
    Create a sub-region within this region using the same API as Page.region().

    By default, coordinates are absolute (relative to the page), matching Page.region().
    Set relative=True to use coordinates relative to this region's top-left corner.

    Args:
        left: Left x-coordinate (absolute by default, or relative to region if relative=True)
        top: Top y-coordinate (absolute by default, or relative to region if relative=True)
        right: Right x-coordinate (absolute by default, or relative to region if relative=True)
        bottom: Bottom y-coordinate (absolute by default, or relative to region if relative=True)
        width: Width definition (same as Page.region())
        height: Height of the region (same as Page.region())
        relative: If True, coordinates are relative to this region's top-left (0,0).
                 If False (default), coordinates are absolute page coordinates.

    Returns:
        Region object for the specified coordinates, clipped to this region's bounds

    Examples:
        # Absolute coordinates (default) - same as page.region()
        sub = region.region(left=100, top=200, width=50, height=30)

        # Relative to region's top-left
        sub = region.region(left=10, top=10, width=50, height=30, relative=True)

        # Mix relative positioning with this region's bounds
        sub = region.region(left=region.x0 + 10, width=50, height=30)
    """
    # If relative coordinates requested, convert to absolute
    if relative:
        if left is not None:
            left = self.x0 + left
        if top is not None:
            top = self.top + top
        if right is not None:
            right = self.x0 + right
        if bottom is not None:
            bottom = self.top + bottom

        # For numeric width/height with relative coords, we need to handle the calculation
        # in the context of absolute positioning

    # Use the parent page's region method to create the region with all its logic
    new_region = self.page.region(
        left=left, top=top, right=right, bottom=bottom, width=width, height=height
    )

    # Clip the new region to this region's bounds
    return new_region.clip(self)
natural_pdf.Region.right(width=None, height='element', include_source=False, until=None, include_endpoint=True, offset=None, **kwargs)

Select region to the right of this region.

Parameters:

- width (Optional[float], default None): Width of the region to the right, in points.
- height (str, default 'element'): Height mode - "full" for full page height or "element" for element height.
- include_source (bool, default False): Whether to include this region in the result.
- until (Optional[str], default None): Optional selector string to specify a right boundary element.
- include_endpoint (bool, default True): Whether to include the boundary element in the region.
- offset (Optional[float], default None): Pixel offset when excluding source/endpoint (None uses natural_pdf.options.layout.directional_offset).
- **kwargs: Additional parameters.

Returns:

- Region: Region object representing the area to the right.

Source code in natural_pdf/elements/region.py
def right(
    self,
    width: Optional[float] = None,
    height: str = "element",
    include_source: bool = False,
    until: Optional[str] = None,
    include_endpoint: bool = True,
    offset: Optional[float] = None,
    **kwargs,
) -> "Region":
    """
    Select region to the right of this region.

    Args:
        width: Width of the region to the right, in points
        height: Height mode - "full" for full page height or "element" for element height
        include_source: Whether to include this region in the result (default: False)
        until: Optional selector string to specify a right boundary element
        include_endpoint: Whether to include the boundary element in the region (default: True)
        offset: Pixel offset when excluding source/endpoint (default: None, uses natural_pdf.options.layout.directional_offset)
        **kwargs: Additional parameters

    Returns:
        Region object representing the area to the right
    """
    # Use global default if offset not provided
    if offset is None:
        import natural_pdf

        offset = natural_pdf.options.layout.directional_offset

    return self._direction(
        direction="right",
        size=width,
        cross_size=height,
        include_source=include_source,
        until=until,
        include_endpoint=include_endpoint,
        offset=offset,
        **kwargs,
    )
natural_pdf.Region.save(filename, resolution=None, labels=True, legend_position='right')

Save the page with this region highlighted to an image file.

Parameters:

- filename (str, required): Path to save the image to.
- resolution (Optional[float], default None): Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI).
- labels (bool, default True): Whether to include a legend for labels.
- legend_position (str, default 'right'): Position of the legend.

Returns:

- Region: Self for method chaining.

Source code in natural_pdf/elements/region.py
def save(
    self,
    filename: str,
    resolution: Optional[float] = None,
    labels: bool = True,
    legend_position: str = "right",
) -> "Region":
    """
    Save the page with this region highlighted to an image file.

    Args:
        filename: Path to save the image to
        resolution: Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI)
        labels: Whether to include a legend for labels
        legend_position: Position of the legend

    Returns:
        Self for method chaining
    """
    # Apply global options as defaults
    import natural_pdf

    if resolution is None:
        if natural_pdf.options.image.resolution is not None:
            resolution = natural_pdf.options.image.resolution
        else:
            resolution = 144  # Default resolution when none specified

    # Highlight this region if not already highlighted
    self.highlight()

    # Save the highlighted image
    self._page.save_image(
        filename, resolution=resolution, labels=labels, legend_position=legend_position
    )
    return self
natural_pdf.Region.save_image(filename, resolution=None, crop=False, include_highlights=True, **kwargs)

Save an image of just this region to a file.

Parameters:

- filename (str, required): Path to save the image to.
- resolution (Optional[float], default None): Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI).
- crop (bool, default False): If True, only crop the region without highlighting its boundaries.
- include_highlights (bool, default True): Whether to include existing highlights.
- **kwargs: Additional parameters for rendering.

Returns:

- Region: Self for method chaining.

Source code in natural_pdf/elements/region.py
def save_image(
    self,
    filename: str,
    resolution: Optional[float] = None,
    crop: bool = False,
    include_highlights: bool = True,
    **kwargs,
) -> "Region":
    """
    Save an image of just this region to a file.

    Args:
        filename: Path to save the image to
        resolution: Resolution in DPI for rendering (default: uses global options, fallback to 144 DPI)
        crop: If True, only crop the region without highlighting its boundaries
        include_highlights: Whether to include existing highlights (default: True)
        **kwargs: Additional parameters for rendering

    Returns:
        Self for method chaining
    """
    # Apply global options as defaults
    import natural_pdf

    if resolution is None:
        if natural_pdf.options.image.resolution is not None:
            resolution = natural_pdf.options.image.resolution
        else:
            resolution = 144  # Default resolution when none specified

    # Use export() to save the image
    if include_highlights:
        # With highlights, use export() which includes them
        self.export(
            path=filename,
            resolution=resolution,
            crop=crop,
            **kwargs,
        )
    else:
        # Without highlights, use render() and save manually
        image = self.render(resolution=resolution, crop=crop, **kwargs)
        if image:
            image.save(filename)
        else:
            logger.error(f"Failed to render region image for saving to {filename}")

    return self
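A sketch; the file names and DPI value are hypothetical:

# Tightly cropped render of just this region at 300 DPI
region.save_image("region.png", resolution=300, crop=True)

# Keep existing highlights in the rendered image (the default)
region.save_image("region_with_highlights.png")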
natural_pdf.Region.split(divider, **kwargs)

Divide this region into sections based on the provided divider elements.

Parameters:

- divider (required): Elements or selector string that mark section boundaries.
- **kwargs: Additional parameters passed to get_sections(): include_boundaries controls how boundary elements are included (default: 'start'); orientation is 'vertical' or 'horizontal' (default: 'vertical').

Returns:

- ElementCollection[Region]: ElementCollection of Region objects representing the sections.

Example:

- Split a region by bold text: sections = region.split("text:bold")
- Split horizontally by vertical lines: sections = region.split("line[orientation=vertical]", orientation="horizontal")

Source code in natural_pdf/elements/region.py
def split(self, divider, **kwargs) -> "ElementCollection[Region]":
    """
    Divide this region into sections based on the provided divider elements.

    Args:
        divider: Elements or selector string that mark section boundaries
        **kwargs: Additional parameters passed to get_sections()
            - include_boundaries: How to include boundary elements (default: 'start')
            - orientation: 'vertical' or 'horizontal' (default: 'vertical')

    Returns:
        ElementCollection of Region objects representing the sections

    Example:
        # Split a region by bold text
        sections = region.split("text:bold")

        # Split horizontally by vertical lines
        sections = region.split("line[orientation=vertical]", orientation="horizontal")
    """
    # Default to 'start' boundaries for split (include divider at start of each section)
    if "include_boundaries" not in kwargs:
        kwargs["include_boundaries"] = "start"

    sections = self.get_sections(start_elements=divider, **kwargs)

    # Add section before first divider if there's content
    if sections and hasattr(sections[0], "start_element"):
        first_divider = sections[0].start_element
        if first_divider:
            # Get all elements before the first divider
            all_elements = self.get_elements()
            if all_elements and all_elements[0] != first_divider:
                # Create section from start to just before first divider
                initial_section = self.get_section_between(
                    start_element=None,
                    end_element=first_divider,
                    include_boundaries="none",
                    orientation=kwargs.get("orientation", "vertical"),
                )
                if initial_section and initial_section.get_elements():
                    sections.insert(0, initial_section)

    return sections
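A short usage sketch; the bold-heading selector and file path are illustrative:

```python
import natural_pdf as npdf

pdf = npdf.PDF("report.pdf")  # hypothetical input file
page = pdf.pages[0]
body = page.region(0, 0, page.width, page.height)

# Each bold heading starts a new section ('start' boundaries by default)
sections = body.split("text:bold")
for section in sections:
    print(section.bbox)  # each section is itself a Region
```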
natural_pdf.Region.to_text_element(text_content=None, source_label='derived_from_region', object_type='word', default_font_size=10.0, default_font_name='RegionContent', confidence=None, add_to_page=False)

Creates a new TextElement object based on this region's geometry.

The text for the new TextElement can be provided directly, generated by a callback function, or left as None.

Parameters:

text_content (Optional[Union[str, Callable[[Region], Optional[str]]]], default: None)
    - If a string, this will be the text of the new TextElement.
    - If a callable, it will be called with this region instance and its return value (a string or None) will be the text.
    - If None (default), the TextElement's text will be None.
source_label (str, default: 'derived_from_region')
    The 'source' attribute for the new TextElement.
object_type (str, default: 'word')
    The 'object_type' for the TextElement's data dict (e.g., "word", "char").
default_font_size (float, default: 10.0)
    Placeholder font size if text is generated.
default_font_name (str, default: 'RegionContent')
    Placeholder font name if text is generated.
confidence (Optional[float], default: None)
    Confidence score for the text. If text_content is None, defaults to 0.0. If text is provided/generated, defaults to 1.0 unless specified.
add_to_page (bool, default: False)
    If True, the created TextElement will be added to the region's parent page.

Returns:

TextElement
    A new TextElement instance.

Raises:

ValueError
    If the region does not have a valid 'page' attribute.

Source code in natural_pdf/elements/region.py, lines 3661-3794
def to_text_element(
    self,
    text_content: Optional[Union[str, Callable[["Region"], Optional[str]]]] = None,
    source_label: str = "derived_from_region",
    object_type: str = "word",  # Or "char", controls how it's categorized
    default_font_size: float = 10.0,
    default_font_name: str = "RegionContent",
    confidence: Optional[float] = None,  # Allow overriding confidence
    add_to_page: bool = False,  # NEW: Option to add to page
) -> "TextElement":
    """
    Creates a new TextElement object based on this region's geometry.

    The text for the new TextElement can be provided directly,
    generated by a callback function, or left as None.

    Args:
        text_content:
            - If a string, this will be the text of the new TextElement.
            - If a callable, it will be called with this region instance
              and its return value (a string or None) will be the text.
            - If None (default), the TextElement's text will be None.
        source_label: The 'source' attribute for the new TextElement.
        object_type: The 'object_type' for the TextElement's data dict
                     (e.g., "word", "char").
        default_font_size: Placeholder font size if text is generated.
        default_font_name: Placeholder font name if text is generated.
        confidence: Confidence score for the text. If text_content is None,
                    defaults to 0.0. If text is provided/generated, defaults to 1.0
                    unless specified.
        add_to_page: If True, the created TextElement will be added to the
                     region's parent page. (Default: False)

    Returns:
        A new TextElement instance.

    Raises:
        ValueError: If the region does not have a valid 'page' attribute.
    """
    actual_text: Optional[str] = None
    if isinstance(text_content, str):
        actual_text = text_content
    elif callable(text_content):
        try:
            actual_text = text_content(self)
        except Exception as e:
            logger.error(
                f"Error executing text_content callback for region {self.bbox}: {e}",
                exc_info=True,
            )
            actual_text = None  # Ensure actual_text is None on error

    final_confidence = confidence
    if final_confidence is None:
        final_confidence = 1.0 if actual_text is not None and actual_text.strip() else 0.0

    if not hasattr(self, "page") or self.page is None:
        raise ValueError("Region must have a valid 'page' attribute to create a TextElement.")

    # Create character dictionaries for the text
    char_dicts = []
    if actual_text:
        # Create a single character dict that spans the entire region
        # This is a simplified approach - OCR engines typically create one per character
        char_dict = {
            "text": actual_text,
            "x0": self.x0,
            "top": self.top,
            "x1": self.x1,
            "bottom": self.bottom,
            "width": self.width,
            "height": self.height,
            "object_type": "char",
            "page_number": self.page.page_number,
            "fontname": default_font_name,
            "size": default_font_size,
            "upright": True,
            "direction": 1,
            "adv": self.width,
            "source": source_label,
            "confidence": final_confidence,
            "stroking_color": (0, 0, 0),
            "non_stroking_color": (0, 0, 0),
        }
        char_dicts.append(char_dict)

    elem_data = {
        "text": actual_text,
        "x0": self.x0,
        "top": self.top,
        "x1": self.x1,
        "bottom": self.bottom,
        "width": self.width,
        "height": self.height,
        "object_type": object_type,
        "page_number": self.page.page_number,
        "stroking_color": getattr(self, "stroking_color", (0, 0, 0)),
        "non_stroking_color": getattr(self, "non_stroking_color", (0, 0, 0)),
        "fontname": default_font_name,
        "size": default_font_size,
        "upright": True,
        "direction": 1,
        "adv": self.width,
        "source": source_label,
        "confidence": final_confidence,
        "_char_dicts": char_dicts,
    }
    text_element = TextElement(elem_data, self.page)

    if add_to_page:
        if hasattr(self.page, "_element_mgr") and self.page._element_mgr is not None:
            add_as_type = (
                "words"
                if object_type == "word"
                else "chars" if object_type == "char" else object_type
            )
            # REMOVED try-except block around add_element
            self.page._element_mgr.add_element(text_element, element_type=add_as_type)
            logger.debug(
                f"TextElement created from region {self.bbox} and added to page {self.page.page_number} as {add_as_type}."
            )
            # Also add character dictionaries to the chars collection
            if char_dicts and object_type == "word":
                for char_dict in char_dicts:
                    self.page._element_mgr.add_element(char_dict, element_type="chars")
        else:
            page_num_str = (
                str(self.page.page_number) if hasattr(self.page, "page_number") else "N/A"
            )
            logger.warning(
                f"Cannot add TextElement to page: Page {page_num_str} for region {self.bbox} is missing '_element_mgr'."
            )

    return text_element
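A hedged sketch of the callable form of text_content; the callback and its placeholder text are purely illustrative:

```python
import natural_pdf as npdf

pdf = npdf.PDF("report.pdf")  # hypothetical input file
page = pdf.pages[0]
stamp_area = page.region(0, 0, page.width, 50)

def label_region(r):
    # Illustrative callback: derive placeholder text from the region's position
    return f"ZONE@{r.x0:.0f},{r.top:.0f}"

# The callback receives the region; its return value becomes the element text
text_el = stamp_area.to_text_element(text_content=label_region, add_to_page=True)
print(text_el.text)
```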
natural_pdf.Region.trim(padding=1, threshold=0.95, resolution=None, pre_shrink=0.5)

Trim visual whitespace from the edges of this region.

Similar to Python's string .strip() method, but for visual whitespace in the region image. Uses pixel analysis to detect rows/columns that are predominantly whitespace.

Parameters:

padding (int, default: 1)
    Number of pixels to keep as padding after trimming.
threshold (float, default: 0.95)
    Threshold for considering a row/column as whitespace (0.0-1.0). Higher values mean stricter whitespace detection; e.g., 0.95 means that if 95% of the pixels in a row/column are white, it is considered whitespace.
resolution (Optional[float], default: None)
    Resolution for image rendering in DPI (default: uses global options, fallback to 144 DPI).
pre_shrink (float, default: 0.5)
    Amount to shrink the region before trimming, then expand back afterwards. This helps avoid detecting box borders/slivers as content.
Returns:

Region
    New Region with visual whitespace trimmed from all edges.

Examples

# Basic trimming with 1 pixel padding and 0.5px pre-shrink
trimmed = region.trim()

# More aggressive trimming with no padding and no pre-shrink
tight = region.trim(padding=0, threshold=0.9, pre_shrink=0)

# Conservative trimming with more padding
loose = region.trim(padding=3, threshold=0.98)
Source code in natural_pdf/elements/region.py, lines 955-1123
def trim(
    self,
    padding: int = 1,
    threshold: float = 0.95,
    resolution: Optional[float] = None,
    pre_shrink: float = 0.5,
) -> "Region":
    """
    Trim visual whitespace from the edges of this region.

    Similar to Python's string .strip() method, but for visual whitespace in the region image.
    Uses pixel analysis to detect rows/columns that are predominantly whitespace.

    Args:
        padding: Number of pixels to keep as padding after trimming (default: 1)
        threshold: Threshold for considering a row/column as whitespace (0.0-1.0, default: 0.95)
                  Higher values mean more strict whitespace detection.
                  E.g., 0.95 means if 95% of pixels in a row/column are white, consider it whitespace.
        resolution: Resolution for image rendering in DPI (default: uses global options, fallback to 144 DPI)
        pre_shrink: Amount to shrink region before trimming, then expand back after (default: 0.5)
                   This helps avoid detecting box borders/slivers as content.

    Returns
    -------

    New Region with visual whitespace trimmed from all edges

    Examples
    --------

    ```python
    # Basic trimming with 1 pixel padding and 0.5px pre-shrink
    trimmed = region.trim()

    # More aggressive trimming with no padding and no pre-shrink
    tight = region.trim(padding=0, threshold=0.9, pre_shrink=0)

    # Conservative trimming with more padding
    loose = region.trim(padding=3, threshold=0.98)
    ```
    """
    # Apply global options as defaults
    import natural_pdf

    if resolution is None:
        if natural_pdf.options.image.resolution is not None:
            resolution = natural_pdf.options.image.resolution
        else:
            resolution = 144  # Default resolution when none specified

    # Pre-shrink the region to avoid box slivers
    work_region = (
        self.expand(left=-pre_shrink, right=-pre_shrink, top=-pre_shrink, bottom=-pre_shrink)
        if pre_shrink > 0
        else self
    )

    # Get the region image
    # Use render() for clean image without highlights, with cropping
    image = work_region.render(resolution=resolution, crop=True)

    if image is None:
        logger.warning(
            f"Region {self.bbox}: Could not generate image for trimming. Returning original region."
        )
        return self

    # Convert to grayscale for easier analysis
    import numpy as np

    # Convert PIL image to numpy array
    img_array = np.array(image.convert("L"))  # Convert to grayscale
    height, width = img_array.shape

    if height == 0 or width == 0:
        logger.warning(
            f"Region {self.bbox}: Image has zero dimensions. Returning original region."
        )
        return self

    # Normalize pixel values to 0-1 range (255 = white = 1.0, 0 = black = 0.0)
    normalized = img_array.astype(np.float32) / 255.0

    # Find content boundaries by analyzing row and column averages

    # Analyze rows (horizontal strips) to find top and bottom boundaries
    row_averages = np.mean(normalized, axis=1)  # Average each row
    content_rows = row_averages < threshold  # True where there's content (not whitespace)

    # Find first and last rows with content
    content_row_indices = np.where(content_rows)[0]
    if len(content_row_indices) == 0:
        # No content found, return a minimal region at the center
        logger.warning(
            f"Region {self.bbox}: No content detected during trimming. Returning center point."
        )
        center_x = (self.x0 + self.x1) / 2
        center_y = (self.top + self.bottom) / 2
        return Region(self.page, (center_x, center_y, center_x, center_y))

    top_content_row = max(0, content_row_indices[0] - padding)
    bottom_content_row = min(height - 1, content_row_indices[-1] + padding)

    # Analyze columns (vertical strips) to find left and right boundaries
    col_averages = np.mean(normalized, axis=0)  # Average each column
    content_cols = col_averages < threshold  # True where there's content

    content_col_indices = np.where(content_cols)[0]
    if len(content_col_indices) == 0:
        # No content found in columns either
        logger.warning(
            f"Region {self.bbox}: No column content detected during trimming. Returning center point."
        )
        center_x = (self.x0 + self.x1) / 2
        center_y = (self.top + self.bottom) / 2
        return Region(self.page, (center_x, center_y, center_x, center_y))

    left_content_col = max(0, content_col_indices[0] - padding)
    right_content_col = min(width - 1, content_col_indices[-1] + padding)

    # Convert trimmed pixel coordinates back to PDF coordinates
    scale_factor = resolution / 72.0  # Scale factor used in render()

    # Calculate new PDF coordinates and ensure they are Python floats
    trimmed_x0 = float(work_region.x0 + (left_content_col / scale_factor))
    trimmed_top = float(work_region.top + (top_content_row / scale_factor))
    trimmed_x1 = float(
        work_region.x0 + ((right_content_col + 1) / scale_factor)
    )  # +1 because we want inclusive right edge
    trimmed_bottom = float(
        work_region.top + ((bottom_content_row + 1) / scale_factor)
    )  # +1 because we want inclusive bottom edge

    # Ensure the trimmed region doesn't exceed the work region boundaries
    final_x0 = max(work_region.x0, trimmed_x0)
    final_top = max(work_region.top, trimmed_top)
    final_x1 = min(work_region.x1, trimmed_x1)
    final_bottom = min(work_region.bottom, trimmed_bottom)

    # Ensure valid coordinates (width > 0, height > 0)
    if final_x1 <= final_x0 or final_bottom <= final_top:
        logger.warning(
            f"Region {self.bbox}: Trimming resulted in invalid dimensions. Returning original region."
        )
        return self

    # Create the trimmed region
    trimmed_region = Region(self.page, (final_x0, final_top, final_x1, final_bottom))

    # Expand back by the pre_shrink amount to restore original positioning
    if pre_shrink > 0:
        trimmed_region = trimmed_region.expand(
            left=pre_shrink, right=pre_shrink, top=pre_shrink, bottom=pre_shrink
        )

    # Copy relevant metadata
    trimmed_region.region_type = self.region_type
    trimmed_region.normalized_type = self.normalized_type
    trimmed_region.confidence = self.confidence
    trimmed_region.model = self.model
    trimmed_region.name = self.name
    trimmed_region.label = self.label
    trimmed_region.source = "trimmed"  # Indicate this is a derived region
    trimmed_region.parent_region = self

    logger.debug(
        f"Region {self.bbox}: Trimmed to {trimmed_region.bbox} (padding={padding}, threshold={threshold}, pre_shrink={pre_shrink})"
    )
    return trimmed_region
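As a self-contained sketch (the cell coordinates and file path are illustrative):

```python
import natural_pdf as npdf

pdf = npdf.PDF("table.pdf")  # hypothetical input file
page = pdf.pages[0]

# A deliberately generous box around, say, a table cell
cell = page.region(100, 200, 250, 240)

# Shrink it to the visible ink, keeping 2 pixels of padding
tight_cell = cell.trim(padding=2)
print(cell.bbox, "->", tight_cell.bbox)
```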
natural_pdf.Region.update_text(transform, *, selector='text', apply_exclusions=False)

Apply transform to every text element matched by selector inside this region.

The heavy lifting is delegated to TextMixin.update_text; this override simply ensures the search is scoped to the region.

Source code in natural_pdf/elements/region.py, lines 3427-3442
def update_text(
    self,
    transform: Callable[[Any], Optional[str]],
    *,
    selector: str = "text",
    apply_exclusions: bool = False,
) -> "Region":
    """Apply *transform* to every text element matched by *selector* inside this region.

    The heavy lifting is delegated to :py:meth:`TextMixin.update_text`; this
    override simply ensures the search is scoped to the region.
    """

    return TextMixin.update_text(
        self, transform, selector=selector, apply_exclusions=apply_exclusions
    )
natural_pdf.Region.viewer(*, resolution=150, include_chars=False, include_attributes=None)

Create an interactive ipywidget viewer for this specific region.

The method renders the region to an image (cropped to the region bounds) and overlays all elements that intersect the region (optionally excluding noisy character-level elements). The resulting widget offers the same zoom / pan experience as Page.viewer but scoped to the region.

Parameters:

resolution (int, default: 150)
    Rendering resolution (DPI). This should match the value used by the page-level viewer so element scaling is accurate.
include_chars (bool, default: False)
    Whether to include individual char elements in the overlay. These are often too dense for a meaningful visualisation, so they are skipped by default.
include_attributes (list[str], optional)
    Additional element attributes to expose in the info panel (on top of the default set used by the page viewer).

Returns:

InteractiveViewerWidget | None
    The widget instance, or None if ipywidgets is not installed or an error occurred during creation.

Source code in natural_pdf/elements/region.py, lines 4015-4153
def viewer(
    self,
    *,
    resolution: int = 150,
    include_chars: bool = False,
    include_attributes: Optional[List[str]] = None,
) -> Optional["InteractiveViewerWidget"]:
    """Create an interactive ipywidget viewer for **this specific region**.

    The method renders the region to an image (cropped to the region bounds) and
    overlays all elements that intersect the region (optionally excluding noisy
    character-level elements).  The resulting widget offers the same zoom / pan
    experience as :py:meth:`Page.viewer` but scoped to the region.

    Parameters
    ----------
    resolution : int, default 150
        Rendering resolution (DPI).  This should match the value used by the
        page-level viewer so element scaling is accurate.
    include_chars : bool, default False
        Whether to include individual *char* elements in the overlay.  These
        are often too dense for a meaningful visualisation so are skipped by
        default.
    include_attributes : list[str], optional
        Additional element attributes to expose in the info panel (on top of
        the default set used by the page viewer).

    Returns
    -------
    InteractiveViewerWidget | None
        The widget instance, or ``None`` if *ipywidgets* is not installed or
        an error occurred during creation.
    """

    # ------------------------------------------------------------------
    # Dependency / environment checks
    # ------------------------------------------------------------------
    if not _IPYWIDGETS_AVAILABLE or InteractiveViewerWidget is None:
        logger.error(
            "Interactive viewer requires 'ipywidgets'. "
            'Please install with: pip install "ipywidgets>=7.0.0,<10.0.0"'
        )
        return None

    try:
        # ------------------------------------------------------------------
        # Render region image (cropped) and encode as data URI
        # ------------------------------------------------------------------
        import base64
        from io import BytesIO

        # Use unified render() with crop=True to obtain just the region
        img = self.render(resolution=resolution, crop=True)
        if img is None:
            logger.error(f"Failed to render image for region {self.bbox} viewer.")
            return None

        buf = BytesIO()
        img.save(buf, format="PNG")
        img_str = base64.b64encode(buf.getvalue()).decode()
        image_uri = f"data:image/png;base64,{img_str}"

        # ------------------------------------------------------------------
        # Prepare element overlay data (coordinates relative to region)
        # ------------------------------------------------------------------
        scale = resolution / 72.0  # Same convention as page viewer

        # Gather elements intersecting the region
        region_elements = self.get_elements(apply_exclusions=False)

        # Optionally filter out chars
        if not include_chars:
            region_elements = [
                el for el in region_elements if str(getattr(el, "type", "")).lower() != "char"
            ]

        default_attrs = [
            "text",
            "fontname",
            "size",
            "bold",
            "italic",
            "color",
            "linewidth",
            "is_horizontal",
            "is_vertical",
            "source",
            "confidence",
            "label",
            "model",
            "upright",
            "direction",
        ]

        if include_attributes:
            default_attrs.extend([a for a in include_attributes if a not in default_attrs])

        elements_json: List[dict] = []
        for idx, el in enumerate(region_elements):
            try:
                # Calculate coordinates relative to region bbox and apply scale
                x0 = (el.x0 - self.x0) * scale
                y0 = (el.top - self.top) * scale
                x1 = (el.x1 - self.x0) * scale
                y1 = (el.bottom - self.top) * scale

                elem_dict = {
                    "id": idx,
                    "type": getattr(el, "type", "unknown"),
                    "x0": round(x0, 2),
                    "y0": round(y0, 2),
                    "x1": round(x1, 2),
                    "y1": round(y1, 2),
                    "width": round(x1 - x0, 2),
                    "height": round(y1 - y0, 2),
                }

                # Add requested / default attributes
                for attr_name in default_attrs:
                    if hasattr(el, attr_name):
                        val = getattr(el, attr_name)
                        # Ensure JSON serialisable
                        if not isinstance(val, (str, int, float, bool, list, dict, type(None))):
                            val = str(val)
                        elem_dict[attr_name] = val
                elements_json.append(elem_dict)
            except Exception as e:
                logger.warning(f"Error preparing element {idx} for region viewer: {e}")

        viewer_data = {"page_image": image_uri, "elements": elements_json}

        # ------------------------------------------------------------------
        # Instantiate the widget directly using the prepared data
        # ------------------------------------------------------------------
        return InteractiveViewerWidget(pdf_data=viewer_data)

    except Exception as e:
        logger.error(f"Error creating viewer for region {self.bbox}: {e}", exc_info=True)
        return None
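In a notebook the returned widget displays when it is the last expression in a cell; a minimal sketch, assuming ipywidgets is installed (the file path is hypothetical):

```python
import natural_pdf as npdf

pdf = npdf.PDF("report.pdf")  # hypothetical input file
page = pdf.pages[0]
region = page.region(0, 0, page.width, page.height / 2)

# Returns None (with a logged error) if ipywidgets is not available
region.viewer(resolution=150)
```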
natural_pdf.Region.within()

Context manager that constrains directional operations to this region.

When used as a context manager, all directional navigation operations (above, below, left, right) will be constrained to the bounds of this region.

Returns:

RegionContext
    A context manager that yields this region.

Examples:

# Create a column region
left_col = page.region(right=page.width/2)

# All directional operations are constrained to left_col
with left_col.within() as col:
    header = col.find("text[size>14]")
    content = header.below(until="text[size>14]")
    # content will only include elements within left_col

# Operations outside the context are not constrained
full_page_below = header.below()  # Searches full page
Source code in natural_pdf/elements/region.py, lines 4155-4179
def within(self):
    """Context manager that constrains directional operations to this region.

    When used as a context manager, all directional navigation operations
    (above, below, left, right) will be constrained to the bounds of this region.

    Returns:
        RegionContext: A context manager that yields this region

    Examples:
        ```python
        # Create a column region
        left_col = page.region(right=page.width/2)

        # All directional operations are constrained to left_col
        with left_col.within() as col:
            header = col.find("text[size>14]")
            content = header.below(until="text[size>14]")
            # content will only include elements within left_col

        # Operations outside the context are not constrained
        full_page_below = header.below()  # Searches full page
        ```
    """
    return RegionContext(self)

Functions

natural_pdf.configure_logging(level=logging.INFO, handler=None)

Configure logging for the natural_pdf package.

Parameters:

level (default: logging.INFO)
    Logging level (e.g., logging.INFO, logging.DEBUG).
handler (default: None)
    Optional custom handler. Defaults to a StreamHandler.
Source code in natural_pdf/__init__.py, lines 18-37
def configure_logging(level=logging.INFO, handler=None):
    """Configure logging for the natural_pdf package.

    Args:
        level: Logging level (e.g., logging.INFO, logging.DEBUG)
        handler: Optional custom handler. Defaults to a StreamHandler.
    """
    # Avoid adding duplicate handlers
    if any(isinstance(h, logging.StreamHandler) for h in logger.handlers):
        return

    if handler is None:
        handler = logging.StreamHandler()
        formatter = logging.Formatter("%(name)s - %(levelname)s - %(message)s")
        handler.setFormatter(formatter)

    logger.addHandler(handler)
    logger.setLevel(level)

    logger.propagate = False
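A minimal sketch; because the function returns early once a stream handler is attached, logging should be configured once, before other natural_pdf calls:

```python
import logging

import natural_pdf as npdf

# Route natural_pdf log messages to stderr at DEBUG level
npdf.configure_logging(level=logging.DEBUG)

# A second call is a no-op: a StreamHandler is already attached
npdf.configure_logging(level=logging.INFO)
```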
natural_pdf.set_option(name, value)

Set a global Natural PDF option.

Parameters:

name (str, required)
    Option name in dot notation (e.g., 'layout.auto_multipage').
value (required)
    New value for the option.
Example

import natural_pdf as npdf
npdf.set_option('layout.auto_multipage', True)
npdf.set_option('ocr.engine', 'surya')

Source code in natural_pdf/__init__.py, lines 77-105
def set_option(name: str, value):
    """
    Set a global Natural PDF option.

    Args:
        name: Option name in dot notation (e.g., 'layout.auto_multipage')
        value: New value for the option

    Example:
        import natural_pdf as npdf
        npdf.set_option('layout.auto_multipage', True)
        npdf.set_option('ocr.engine', 'surya')
    """
    parts = name.split(".")
    obj = options

    # Navigate to the right section
    for part in parts[:-1]:
        if hasattr(obj, part):
            obj = getattr(obj, part)
        else:
            raise KeyError(f"Unknown option section: {part}")

    # Set the final value
    final_key = parts[-1]
    if hasattr(obj, final_key):
        setattr(obj, final_key, value)
    else:
        raise KeyError(f"Unknown option: {name}")
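Because set_option() walks the dotted name one section at a time, unknown sections and unknown keys both raise KeyError. A short sketch; 'image.resolution' is the option read by the rendering methods above, while 'image.no_such_key' is deliberately invalid:

```python
import natural_pdf as npdf

# image.resolution is used as the default render DPI elsewhere in this reference
npdf.set_option("image.resolution", 200)

# Unknown names fail loudly instead of silently creating new options
try:
    npdf.set_option("image.no_such_key", 1)
except KeyError as err:
    print(err)  # 'Unknown option: image.no_such_key'
```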