doctr for scanned pdf #1614
Comments
By default, docTR isn't used for PDFs with many pages. It's possible that OCR (from the unstructured package) is being used instead. Can you tell from the command-line output? I would disable unstructured and OCR from the expert panel in the UI and try again. You can disable them via the CLI as well. pymupdf is the default loader; only if every page of the PDF is a scanned image does it fall back progressively to other backup methods, i.e. it does in order:
With the CLI, you can disable everything except DocTR and see how it goes, then narrow down which step is taking the time. If there are many pages, DocTR does take some time, but I've seen OCR from unstructured take much longer and give worse quality. Also, if you have a PDF you can share that shows the issue, I'm happy to look. If you need to keep it semi-private, you can email me at jon.mckinney@h2o.ai.
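To make the "fall back progressively, then narrow down which step is slow" idea concrete, here is a minimal sketch in plain Python. It is not h2oGPT's actual loader code: `load_with_fallback`, the loader names, and the two stub loaders are all hypothetical, and the stubs only stand in for a text-layer reader and an OCR step. It tries each enabled loader in order, records how long each one took, and stops at the first loader that returns any non-empty page text.

```python
import time
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical loader type: takes a PDF path, returns one text string per page.
Loader = Callable[[str], List[str]]


def load_with_fallback(
    path: str,
    loaders: List[Tuple[str, Loader]],
    enabled: Optional[Set[str]] = None,
) -> Tuple[Optional[str], List[str], Dict[str, float]]:
    """Try loaders in order; return (name used, pages, seconds spent per loader)."""
    timings: Dict[str, float] = {}
    for name, loader in loaders:
        # Skip loaders that were switched off, mimicking "disable everything except X".
        if enabled is not None and name not in enabled:
            continue
        start = time.perf_counter()
        try:
            pages = loader(path)
        except Exception:
            pages = []  # a failing loader counts as "no text", so we fall through
        timings[name] = time.perf_counter() - start
        # Accept the first loader that produced text on at least one page.
        if any(page.strip() for page in pages):
            return name, pages, timings
    return None, [], timings


# Stub loaders (assumptions for illustration only).
def text_layer_loader(path: str) -> List[str]:
    return ["", ""]  # fully scanned PDF: no embedded text on any page


def ocr_loader(path: str) -> List[str]:
    return ["page one text", "page two text"]


loaders = [("text_layer", text_layer_loader), ("ocr", ocr_loader)]
used, pages, timings = load_with_fallback("scan.pdf", loaders)
only_text, _, _ = load_with_fallback("scan.pdf", loaders, enabled={"text_layer"})
```

Printing `timings` after a run like this shows which step dominates. Passing a one-element `enabled` set is the sketch's equivalent of turning off every loader except one.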
Hello, thank you for responding. Actually, it's not specific to one PDF. In general, if the PDF contains a title or something similar that is not scanned, and all the subsequent pages are scanned, DocTR works and extracts text from the entire file. If the whole PDF is scanned, it keeps processing without a response. I'm showing you an example in this video. The first document contains 7 pages: the first page contains a title that is not scanned, and the next 6 pages are all scanned. The tool nevertheless extracts text from all pages (I can verify this through the document viewer ==> view database text). The second document is "image_based_pdf_sample" from the test folder in h2ogpt, which is similar to my problematic documents. As you can see, it keeps processing without responding. I shortened the video so I could share it here, but it's still processing, as shown in the screenshot. testing-doctr_6BSPCc9W.mp4
Can you do two things?
Also, can you try gpt.h2o.ai? Do you have similar issues there? This will help identify whether it is an installation issue or a computer issue.
I guess DocTR has issues with Windows VMs.
I have doctr installed. When I upload a photo, it works very well and extracts text perfectly. However, when I upload a scanned PDF, it keeps processing for a very long time without any response or error. What could be missing?
It also works with a scanned PDF in one case: for example, if the first page is not scanned (it contains a title, a sentence, etc.) and all the following pages are scanned, it extracts text from all the pages perfectly. But when the whole PDF is scanned, it keeps processing without a response.