Feat/field to store inner elements #268

benjats07 · 2023-10-26T18:22:46Z

This PR introduces clean_pdfminer_inner_elements, which removes pdfminer elements that lie inside elements from other detection origins, such as YoloX or detectron.
The function returns the cleaned document and stores the nested elements in pdfminer_inner_text.

The best way to verify that the function works properly is to run the new test test_clean_pdfminer_inner_elements in test_unstructured_inference/inference/test_processing_elements.py.

Also, the new ingest-tests will reflect the deletion of the nested elements.

benjats07 · 2023-10-26T21:46:20Z

@christinestraub
I addressed the issue you mentioned here.
I think the code belongs here, and the main library just uses the function clean_pdfminer_inner_elements from here.

cragwolfe · 2023-10-26T22:17:57Z

.github/workflows/ci.yml

@@ -109,6 +109,7 @@ jobs:
      uses: actions/checkout@v4
      with:
        repository: 'Unstructured-IO/unstructured'
+        ref: 'benjamin/feat/clean-by-store-pdfminer-inner-elements'


just a reminder to remove this line before merging

badGarnet · 2023-10-27T15:08:41Z

test_unstructured_inference/inference/test_processing_elements.py

+        bbox=Rectangle(0, 0, 100, 100),
+        text="Table with inner elements",
+        type="Table",
+    ),


Is this element intentionally left without a source? I think that can be a valid test, but I want to make sure this is the intention.

Yeah, it was intentional: if something is a table, it should be kept.

badGarnet · 2023-10-27T15:11:05Z

unstructured_inference/inference/layout.py

+        for page in self.pages:
+            tables = [e for e in page.elements if e.type == "Table"]
+            for i, element in enumerate(page.elements):
+                if element.source == Source.PDFMINER:


Let's invert the condition and continue here to reduce the indentation of the code; it's good practice for readability in general, and especially here, where we have so many levels of indentation.
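One way to read this suggestion, as a minimal sketch: invert the source check and continue, so the table-matching logic sits one level shallower. The Element dataclass and the "pdfminer" string below are simplified stand-ins for illustration, not the library's own types, and the loop body is reduced to collecting indices.

```python
from dataclasses import dataclass
from typing import List, Optional

PDFMINER = "pdfminer"  # stand-in for Source.PDFMINER (assumption)


@dataclass
class Element:
    """Simplified, hypothetical stand-in for a layout element."""
    text: str
    type: str = "Text"
    source: Optional[str] = None


def pdfminer_candidates(elements: List[Element]) -> List[int]:
    """Return the indices of elements the cleaning step would inspect."""
    candidates = []
    for i, element in enumerate(elements):
        # Guard clause: skip anything that did not come from pdfminer.
        if element.source != PDFMINER:
            continue
        # The table-containment logic would go here, one level shallower.
        candidates.append(i)
    return candidates


elements = [
    Element("Table with inner elements", type="Table"),
    Element("cell text", source=PDFMINER),
    Element("detected title", type="Title", source="yolox"),
]
print(pdfminer_candidates(elements))
```

Only the second element comes from pdfminer, so it is the only one that reaches the part of the loop after the guard.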

badGarnet · 2023-10-27T15:11:50Z

unstructured_inference/inference/layout.py

+            for i, element in enumerate(page.elements):
+                if element.source == Source.PDFMINER:
+                    element_inside_table = [
+                        element.bbox.is_in(t.bbox, error_margin=15) for t in tables


Is the margin something we want to make a kwarg of this function? How did we arrive at this number?
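A minimal sketch of what exposing the margin as a keyword argument could look like. Box and its is_in method are hypothetical stand-ins written for this example (the real Rectangle.is_in may behave differently), and the default of 15 simply mirrors the value in the diff; where that number came from is not stated in this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical stand-in for the library's Rectangle."""
    x1: float
    y1: float
    x2: float
    y2: float

    def is_in(self, other: "Box", error_margin: float = 0) -> bool:
        # Treat `other` as if it were padded by error_margin on every side.
        return (
            self.x1 >= other.x1 - error_margin
            and self.y1 >= other.y1 - error_margin
            and self.x2 <= other.x2 + error_margin
            and self.y2 <= other.y2 + error_margin
        )


def inside_flags(box: Box, tables: List[Box], error_margin: float = 15) -> List[bool]:
    """One flag per table; the margin is a keyword argument, not a literal."""
    return [box.is_in(t, error_margin=error_margin) for t in tables]


table = Box(0, 0, 100, 100)
slightly_outside = Box(90, 90, 110, 110)
print(inside_flags(slightly_outside, [table]))                  # default margin of 15
print(inside_flags(slightly_outside, [table], error_margin=0))  # strict containment
```

With the default margin the box that overhangs the table by 10 units still counts as inside; with a margin of 0 it does not, which is the behavior a caller could tune if the margin were a parameter.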

badGarnet · 2023-10-27T15:13:15Z

unstructured_inference/inference/layout.py

+                    element_inside_table = [
+                        element.bbox.is_in(t.bbox, error_margin=15) for t in tables
+                    ]
+                    if sum(element_inside_table) == 1:


So this catches the case where an element is inside one, and only one, table? Could you please clarify this in the docstring?
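To illustrate the condition being asked about: summing a list of booleans counts the True entries, so sum(...) == 1 holds only when exactly one table contains the element. The helper name parent_index and its None return for the other cases are assumptions made for this sketch; the diff does not show what happens when the count is zero or greater than one.

```python
from typing import List, Optional


def parent_index(flags: List[bool]) -> Optional[int]:
    """Return the index of the single enclosing table, or None.

    `flags[i]` is True when the element lies inside table i. The element is
    treated as nested only when exactly one flag is True.
    """
    # True counts as 1 when summed, so this checks "exactly one True".
    if sum(flags) != 1:
        return None
    return flags.index(True)


print(parent_index([False, True, False]))  # inside the second table only
print(parent_index([False, False]))        # inside no table
print(parent_index([True, True]))          # inside two overlapping tables
```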

badGarnet · 2023-10-27T15:14:02Z

unstructured_inference/inference/layout.py

+                        parent_table_index = element_inside_table.index(True)
+                        parent_table = tables[parent_table_index]


Suggested change

-                        parent_table_index = element_inside_table.index(True)
-                        parent_table = tables[parent_table_index]
+                        parent_table = tables[element_inside_table.index(True)]

badGarnet · 2023-10-27T15:17:22Z

unstructured_inference/models/base.py

@@ -52,7 +52,7 @@ def get_model(model_name: Optional[str] = None, **kwargs) -> UnstructuredModel:
        initialize_params = {**DETECTRON2_ONNX_MODEL_TYPES[model_name], **kwargs}
    elif model_name in YOLOX_MODEL_TYPES:
        model = UnstructuredYoloXModel()
-        initialize_params = {**YOLOX_MODEL_TYPES[model_name], **kwargs}


why this change?

badGarnet · 2023-10-27T15:18:47Z

test_unstructured_inference/inference/test_processing_elements.py

+    # call the function to clean the pdfminer inner elements
+    document_with_table.clean_pdfminer_inner_elements()
+
+    assert len(document_with_table.pages[0].elements) == expected_document_lenght


Since the text already denotes where each box should be and what is outside, can we make the asserts more explicit, checking that the right pdfminer box is in the right table and that the right outside box is outside?
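A sketch of the kind of assertion this comment seems to ask for: instead of checking only the element count, check by text which box ended up nested in which table and which stayed on the page. Everything here is a simplified stand-in written for illustration: the Element dataclass, the inner field, the clean function, and the "Inside table" and "Outside table" texts are assumptions, not the PR's actual code or fixtures.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Element:
    text: str
    bbox: BBox
    type: str = "Text"
    source: Optional[str] = None
    inner: List["Element"] = field(default_factory=list)  # hypothetical field


def is_in(inner: BBox, outer: BBox, margin: float) -> bool:
    """True when `inner` fits inside `outer` padded by `margin`."""
    return (
        inner[0] >= outer[0] - margin
        and inner[1] >= outer[1] - margin
        and inner[2] <= outer[2] + margin
        and inner[3] <= outer[3] + margin
    )


def clean(elements: List[Element], margin: float = 15) -> List[Element]:
    """Move pdfminer elements nested in exactly one table into that table."""
    tables = [e for e in elements if e.type == "Table"]
    kept = []
    for element in elements:
        if element.source != "pdfminer":
            kept.append(element)
            continue
        parents = [t for t in tables if is_in(element.bbox, t.bbox, margin)]
        if len(parents) == 1:
            parents[0].inner.append(element)
        else:
            kept.append(element)
    return kept


table = Element("Table with inner elements", (0, 0, 100, 100), type="Table")
inside = Element("Inside table", (10, 10, 50, 50), source="pdfminer")
outside = Element("Outside table", (150, 150, 200, 200), source="pdfminer")

kept = clean([table, inside, outside])

# Assert on identity, not just on the number of remaining elements.
assert [e.text for e in kept] == ["Table with inner elements", "Outside table"]
assert [e.text for e in table.inner] == ["Inside table"]
```

Assertions of this shape fail with a readable diff of texts when the wrong box is removed, which a bare length check cannot show.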

qued · 2024-05-28T15:13:38Z

I don't think this is relevant any more given all that's changed in the meantime. Closing.

Benjamin Torres and others added 9 commits October 25, 2023 19:13

feat: add field to store pdfminer elements

0ca1843

fix: delete **kwargs during intialization of yolox and chipper

25f93af

feat: add type annotations

770235c

Merge branch 'main' into feat/field-to-store-inner-elements

05510e0

Changelog update

ea9e31c

Adds ref to CI.yaml

464fdfc

feat: add clean_pdfminer_inner_elements

2afc2e5

fix: added error margin to clean_pdfminer_inner_elements

4ddb200

Changelog update

eca01a2

benjats07 mentioned this pull request Oct 26, 2023

feat/clean tables Unstructured-IO/unstructured#1904

Closed

benjats07 requested review from christinestraub and qued October 26, 2023 21:46

benjats07 marked this pull request as ready for review October 26, 2023 21:51

benjats07 requested a review from badGarnet October 26, 2023 21:58

cragwolfe reviewed Oct 26, 2023

Merge branch 'main' into feat/field-to-store-inner-elements

969e73a

badGarnet reviewed Oct 27, 2023

badGarnet requested changes Oct 27, 2023

fix: Misspelled variable name

a8e247e

Co-authored-by: Yao You <theyaoyou@gmail.com>

qued closed this May 28, 2024
