Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Updated May 31, 2024 · Python
AI Media and Misinformation Content Analysis Tool: Analyze text and images
Extract embedded metadata from HTML markup
A very simple news crawler with a funny name
Golang PDF library for creating and processing PDF files (pure Go)
A self-hosted search engine for documents.
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Case study using dotfurther's Open Discover Platform with the RavenDB document store to rapidly create a full-text search/eDiscovery/information governance capable demonstration application.
Get text content from any file
Translate visual novels in real time
Module for automatic summarization of text documents and HTML pages.
This GitHub repository hosts the notebooks and tools developed as part of this thesis to automate the extraction, processing, and analysis of data from the MICCAI 2023 conference, aiding in the systematic review and providing a structured foundation for further research in this crucial area.
A TYPO3 CMS extension that provides Apache Tika functionality
OCR with Tesseract and OpenCV: Extract text from images effortlessly. Preprocess with OpenCV for accuracy. Display results and save output. Easy integration for document digitization and data entry automation.
Reworked version of the https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the live alternative)
Heuristic-based boilerplate removal tool
Dataiku DSS plugin to perform optical character recognition (OCR) using the Tesseract engine.