[Feature]: If page has text, force OCR and rasterize page #1314

Open
mikejokic opened this issue May 17, 2024 · 1 comment
mikejokic commented May 17, 2024

Describe the proposed feature

My multi-page PDF contains a mix of digital pages and scanned images that can be OCR'd.

In this case, redo-ocr is the current option: it blanks out certain areas of the page from OCR while avoiding rasterizing. However, its results can be poorer than those of force-ocr on the same page. Force-ocr, on the other hand, rasterizes the page, which can increase file size. It also applies to all pages, regardless of whether it is the best option for each one (if a page has no text but only images, the normal workflow without a full-page raster is the best option).

So, can we detect whether a page has text and apply force-ocr only to that page? If a page has no text, ocrmypdf proceeds as normal and only overlays a text layer on top of the original page. If a page has text, ocrmypdf applies force-ocr and rasterizes it.

Sorry if this is already possible, but it seems that force-ocr applies to all pages when enabled. @jbarlow83 Thank you for the assist!
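As a rough illustration of the request above, here is a minimal Python sketch of a possible two-pass workaround. It is not a confirmed ocrmypdf feature. It assumes that `--pages` restricts `--force-ocr` to the listed pages and that a second pass with `--skip-text` then handles the remaining image-only pages normally; both assumptions should be checked against the ocrmypdf documentation. The helper names `format_page_ranges` and `build_commands`, the file names, and the list of pages that contain text are all hypothetical. Text detection itself is left out, and the commands are only built, not executed.

```python
from typing import Iterable, List


def format_page_ranges(pages: Iterable[int]) -> str:
    """Collapse 1-based page numbers into a compact string such as '1-3,7'."""
    ordered = sorted(set(pages))
    if not ordered:
        return ""
    ranges = []
    start = prev = ordered[0]
    for page in ordered[1:]:
        if page == prev + 1:
            # Still inside a consecutive run; extend it.
            prev = page
            continue
        ranges.append((start, prev))
        start = prev = page
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def build_commands(
    src: str, intermediate: str, dst: str, text_pages: Iterable[int]
) -> List[List[str]]:
    """Build (but do not run) the ocrmypdf command lines for the two passes.

    text_pages is a hypothetical input: the 1-based numbers of pages that
    already contain text, found by whatever detection method you prefer.
    """
    page_spec = format_page_ranges(text_pages)
    if not page_spec:
        # No page has text, so the normal workflow is enough.
        return [["ocrmypdf", src, dst]]
    return [
        # Pass 1 (assumption): rasterize and OCR only the pages with text.
        ["ocrmypdf", "--force-ocr", "--pages", page_spec, src, intermediate],
        # Pass 2 (assumption): pages that now have text are skipped,
        # and the remaining image-only pages get a normal text overlay.
        ["ocrmypdf", "--skip-text", intermediate, dst],
    ]


if __name__ == "__main__":
    for cmd in build_commands("in.pdf", "tmp.pdf", "out.pdf", [1, 2, 3, 7]):
        print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.run` once the flag behavior has been verified on a sample file.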

mikejokic (Author) commented

Bumping this, @jbarlow83. I know you must be busy, but is there a way to selectively apply force-ocr to pages that have text and proceed normally for pages that are clean?
