bug: TesseractError: Estimating resolution as X #2900

Open
qued opened this issue Apr 17, 2024 · 1 comment
Labels
bug: Something isn't working
ocr: Related to optical character recognition (OCR)

qued commented Apr 17, 2024

Describe the bug
A user gets TesseractError: (-8, 'Estimating resolution as 252') when processing a particular image-based document.

To Reproduce
The failing call was an API request to process a particular image-based document.

Expected behavior
Document processed successfully.

Environment Info
Running in self-hosted open-source API.
Unstructured version 0.12.3.
Tesseract version 5.3.3

Additional context
The user was able to process the same document successfully with Tesseract version 4.1.1.
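
Since the behavior differs between Tesseract 4.1.1 and 5.3.3, it may help to confirm which binary the API process actually picks up. The following is a minimal sketch, not part of unstructured: `parse_tesseract_version` and `installed_tesseract_version` are hypothetical helper names, and the sketch assumes the `tesseract` executable is on `PATH` and that its `--version` output carries the version number on the first line.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_tesseract_version(output: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch) from the first line of `tesseract --version`."""
    first_line = output.splitlines()[0] if output else ""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", first_line)
    return tuple(int(part) for part in match.groups()) if match else None


def installed_tesseract_version() -> Optional[Tuple[int, ...]]:
    """Return the version of the tesseract binary on PATH, or None if unavailable."""
    if shutil.which("tesseract") is None:
        return None
    # Merge stderr into stdout, since the stream used for the version text
    # is not assumed here.
    result = subprocess.run(
        ["tesseract", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,
    )
    return parse_tesseract_version(result.stdout)


if __name__ == "__main__":
    print(installed_tesseract_version())
```

Returning a tuple keeps comparisons simple, for example `installed_tesseract_version() >= (5, 0, 0)` to flag the version family where the failure was reported.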

Stack trace:

  File "/home/notebook-user/unstructured/partition/pdf.py", line 213, in partition_pdf
    return partition_pdf_or_image(
  File "/home/notebook-user/unstructured/partition/pdf.py", line 298, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
  File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/unstructured/partition/pdf.py", line 494, in _partition_pdf_or_image_local
    final_document_layout = process_data_with_ocr(
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 82, in process_data_with_ocr
    merged_layouts = process_file_with_ocr(
  File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 178, in process_file_with_ocr
    raise e
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 166, in process_file_with_ocr
    merged_page_layout = supplement_page_layout_with_ocr(
  File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 202, in supplement_page_layout_with_ocr
    ocr_layout = ocr_agent.get_layout_from_image(
  File "/home/notebook-user/unstructured/partition/utils/ocr_models/tesseract_ocr.py", line 48, in get_layout_from_image
    ocr_df: pd.DataFrame = unstructured_pytesseract.image_to_data(
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 591, in image_to_data
    return {
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 593, in <lambda>
    Output.DATAFRAME: lambda: get_pandas_output(
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 568, in get_pandas_output
    return pd.read_csv(BytesIO(run_and_get_output(*args)), **kwargs)
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 347, in run_and_get_output
    run_tesseract(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 279, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
unstructured_pytesseract.pytesseract.TesseractError: (-8, 'Estimating resolution as 252')
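
One possible reading of the error tuple, offered as an assumption rather than a confirmed diagnosis: the last frame raises `TesseractError(proc.returncode, ...)`, and if `proc` is a `subprocess.Popen` object, Python reports a negative return code `-N` on POSIX when the child process was terminated by signal `N`. Under that assumption, `-8` would mean the tesseract process was killed by signal 8 (SIGFPE on Linux), and "Estimating resolution as 252" may simply be the last line tesseract wrote to stderr before it died, not the cause. The sketch below decodes such a return code; `describe_returncode` is a hypothetical helper, not part of unstructured or pytesseract.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Explain a subprocess return code in human-readable form."""
    if returncode >= 0:
        return f"exited normally with status {returncode}"
    # On POSIX, subprocess reports -N when the child was terminated by signal N.
    signum = -returncode
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown signal"
    return f"terminated by signal {signum} ({name})"


if __name__ == "__main__":
    # Return code taken from the TesseractError in the stack trace above.
    print(describe_returncode(-8))
```

If this reading holds, the message text is a red herring, and the difference between Tesseract 4.1.1 and 5.3.3 would be worth investigating as a crash in the binary on this particular image.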
qued added the bug label Apr 17, 2024

qued commented Apr 17, 2024

Slack conversation: https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1713364225537139

We've previously encountered this error in #1920 and closed the issue with #1996. The user is running a version of unstructured with the fix merged, so presumably this is the same error showing up for a different reason.

scanny added the ocr label Apr 23, 2024