PDF Word Extraction

This tool extracts meaningful words from a collection of PDF documents, processes them, and counts how often each one occurs. The resulting frequency data can feed text analysis and visualization tasks, such as generating word clouds or identifying common themes across the collection.

The tool builds on three Python libraries:

pypdf: for reading PDFs.
ftfy: for text cleaning.
spaCy: for natural language processing, including tokenization, lemmatization, and stop-word removal.
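
The way these three libraries fit together can be sketched roughly as follows. This is a minimal illustration, not the code in pdf_word_extraction.py: the function names (read_pdf_text, count_words, word_frequencies) and the filtering rule (keep alphabetic, non-stop-word tokens and count their lowercased lemmas) are assumptions made for the example.

```python
from collections import Counter


def read_pdf_text(path):
    """Return the text of all pages in one PDF, cleaned with ftfy."""
    # Imported lazily so count_words() below works without these packages.
    import ftfy
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Guard against pages that yield no text (e.g. scanned images).
    pages = [page.extract_text() or "" for page in reader.pages]
    return ftfy.fix_text("\n".join(pages))


def count_words(tokens):
    """Count lowercased lemmas of alphabetic, non-stop-word tokens.

    `tokens` is any iterable of objects exposing spaCy's token attributes
    `lemma_`, `is_alpha`, and `is_stop` (for example a spaCy Doc).
    """
    return Counter(
        tok.lemma_.lower()
        for tok in tokens
        if tok.is_alpha and not tok.is_stop
    )


def word_frequencies(pdf_paths, model="en_core_web_sm"):
    """Hypothetical end-to-end helper: PDF paths in, word counts out."""
    import spacy

    nlp = spacy.load(model)
    totals = Counter()
    for path in pdf_paths:
        totals.update(count_words(nlp(read_pdf_text(path))))
    return totals
```

Calling word_frequencies(["pdf/example.pdf"]).most_common(20) would then list the twenty most frequent lemmas; the path here is a placeholder. Very long documents may exceed spaCy's default nlp.max_length and need to be split into smaller pieces first.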

The tool is also customizable: you can specify words to remove from the results or to replace with other words.
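
As a rough sketch of what removal and replacement could look like once the frequencies are counted, consider the following. The function apply_word_rules and its remove and replace parameters are hypothetical names chosen for this example; the actual script may expose these options differently.

```python
from collections import Counter


def apply_word_rules(counts, remove=(), replace=None):
    """Return new counts with some words dropped and others renamed.

    remove  -- words to discard entirely (hypothetical parameter)
    replace -- mapping of word -> preferred word (hypothetical parameter)
    """
    remove = set(remove)
    replace = replace or {}
    result = Counter()
    for word, n in counts.items():
        # Replacement runs first, so removal also applies to the new word.
        word = replace.get(word, word)
        if word in remove:
            continue
        # Counts merge when a replacement maps onto an existing word.
        result[word] += n
    return result


counts = Counter({"datum": 3, "data": 5, "fig": 4, "model": 2})
cleaned = apply_word_rules(counts, remove={"fig"}, replace={"datum": "data"})
print(cleaned.most_common())  # [('data', 8), ('model', 2)]
```

Returning a new Counter instead of editing the input in place keeps the raw frequencies available, which is convenient when trying out different removal lists.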

Setup

You can install the latest versions of the required Python packages using pip:

pip install ftfy pypdf spacy
python3 -m spacy download en_core_web_sm

Alternatively, you can install all the dependencies at once with:

pip install -r requirements.txt
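
To confirm that the installation worked, a small check like the one below can help. It is only a sketch: the helper name missing_modules is made up for this example, and it assumes the spaCy model was installed as the importable package en_core_web_sm, which is what the download command above normally does.

```python
import importlib.util

# Top-level module names expected after the setup steps above.
REQUIRED = ("ftfy", "pypdf", "spacy", "en_core_web_sm")


def missing_modules(names=REQUIRED):
    """Return the names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```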
