-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/field to store inner elements #268
Conversation
@christinestraub |
@@ -109,6 +109,7 @@ jobs: | |||
uses: actions/checkout@v4 | |||
with: | |||
repository: 'Unstructured-IO/unstructured' | |||
ref: 'benjamin/feat/clean-by-store-pdfminer-inner-elements' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a reminder to remove this line before merging
bbox=Rectangle(0, 0, 100, 100), | ||
text="Table with inner elements", | ||
type="Table", | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this element left intentionally without a source? Do think that can be a valid test but want to make sure this is the intention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, was intentional, this is because if something is a table then should be kept
for page in self.pages: | ||
tables = [e for e in page.elements if e.type == "Table"] | ||
for i, element in enumerate(page.elements): | ||
if element.source == Source.PDFMINER: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets do inverse condition then continue
here to reduce indentation of the code; its a good practice to improve code readability in general. Especially here we have so many levels of indentation
for i, element in enumerate(page.elements): | ||
if element.source == Source.PDFMINER: | ||
element_inside_table = [ | ||
element.bbox.is_in(t.bbox, error_margin=15) for t in tables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the margin something we want to make a kwarg into this function? How did we arrive with this number?
element_inside_table = [ | ||
element.bbox.is_in(t.bbox, error_margin=15) for t in tables | ||
] | ||
if sum(element_inside_table) == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this catches when an element is inside and only inside one table? Could you please clarify this in the docstring?
parent_table_index = element_inside_table.index(True) | ||
parent_table = tables[parent_table_index] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parent_table_index = element_inside_table.index(True) | |
parent_table = tables[parent_table_index] | |
parent_table = tables[element_inside_table.index(True)] |
@@ -52,7 +52,7 @@ def get_model(model_name: Optional[str] = None, **kwargs) -> UnstructuredModel: | |||
initialize_params = {**DETECTRON2_ONNX_MODEL_TYPES[model_name], **kwargs} | |||
elif model_name in YOLOX_MODEL_TYPES: | |||
model = UnstructuredYoloXModel() | |||
initialize_params = {**YOLOX_MODEL_TYPES[model_name], **kwargs} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this change?
# call the function to clean the pdfminer inner elements | ||
document_with_table.clean_pdfminer_inner_elements() | ||
|
||
assert len(document_with_table.pages[0].elements) == expected_document_lenght |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you got the text to denote where the box should be and what is outside can we do assert more clearly to check if the right pdfminer box is in the right table? And the right outside box is outside?
test_unstructured_inference/inference/test_processing_elements.py
Outdated
Show resolved
Hide resolved
test_unstructured_inference/inference/test_processing_elements.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Yao You <theyaoyou@gmail.com>
I don't think this is relevant any more given all that's changed in the meantime. Closing. |
This PR introduces
clean_pdfminer_inner_elements
, which deletes pdfminer elements inside other detection origins such as YoloX or detectron.This function returns the clean document and stores nested elements inside
pdfminer_inner_text
The best way to check that this function is working properly is checking the new test test_clean_pdfminer_inner_elements in test_unstructured_inference/inference/test_processing_elements.py
Also, the new ingest-tests will reflect the deletion of the nested elements.