Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/field to store inner elements #268

Closed
wants to merge 11 commits into from

Conversation

benjats07
Copy link
Contributor

@benjats07 benjats07 commented Oct 26, 2023

This PR introduces clean_pdfminer_inner_elements , which deletes pdfminer elements inside other detection origins such as YoloX or detectron.
This function returns the clean document and stores nested elements inside pdfminer_inner_text

The best way to check that this function is working properly is checking the new test test_clean_pdfminer_inner_elements in test_unstructured_inference/inference/test_processing_elements.py

Also, the new ingest-tests will reflect the deletion of the nested elements.

@benjats07
Copy link
Contributor Author

benjats07 commented Oct 26, 2023

@christinestraub
I addressed the issue you mentioned here
I think the code belongs here, and the main library just use the function clean_pdfminer_inner_elements from here.

@benjats07 benjats07 marked this pull request as ready for review October 26, 2023 21:51
@@ -109,6 +109,7 @@ jobs:
uses: actions/checkout@v4
with:
repository: 'Unstructured-IO/unstructured'
ref: 'benjamin/feat/clean-by-store-pdfminer-inner-elements'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a reminder to remove this line before merging

bbox=Rectangle(0, 0, 100, 100),
text="Table with inner elements",
type="Table",
),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this element left intentionally without a source? Do think that can be a valid test but want to make sure this is the intention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, was intentional, this is because if something is a table then should be kept

for page in self.pages:
tables = [e for e in page.elements if e.type == "Table"]
for i, element in enumerate(page.elements):
if element.source == Source.PDFMINER:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets do inverse condition then continue here to reduce indentation of the code; its a good practice to improve code readability in general. Especially here we have so many levels of indentation

for i, element in enumerate(page.elements):
if element.source == Source.PDFMINER:
element_inside_table = [
element.bbox.is_in(t.bbox, error_margin=15) for t in tables
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the margin something we want to make a kwarg into this function? How did we arrive with this number?

element_inside_table = [
element.bbox.is_in(t.bbox, error_margin=15) for t in tables
]
if sum(element_inside_table) == 1:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this catches when an element is inside and only inside one table? Could you please clarify this in the docstring?

Comment on lines +167 to +168
parent_table_index = element_inside_table.index(True)
parent_table = tables[parent_table_index]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
parent_table_index = element_inside_table.index(True)
parent_table = tables[parent_table_index]
parent_table = tables[element_inside_table.index(True)]

@@ -52,7 +52,7 @@ def get_model(model_name: Optional[str] = None, **kwargs) -> UnstructuredModel:
initialize_params = {**DETECTRON2_ONNX_MODEL_TYPES[model_name], **kwargs}
elif model_name in YOLOX_MODEL_TYPES:
model = UnstructuredYoloXModel()
initialize_params = {**YOLOX_MODEL_TYPES[model_name], **kwargs}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this change?

# call the function to clean the pdfminer inner elements
document_with_table.clean_pdfminer_inner_elements()

assert len(document_with_table.pages[0].elements) == expected_document_lenght
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you got the text to denote where the box should be and what is outside can we do assert more clearly to check if the right pdfminer box is in the right table? And the right outside box is outside?

Co-authored-by: Yao You <theyaoyou@gmail.com>
@qued
Copy link
Contributor

qued commented May 28, 2024

I don't think this is relevant any more given all that's changed in the meantime. Closing.

@qued qued closed this May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants