GitHub - sujitpal/content-engineering-tutorial

Content Engineering Tutorial

Abstract

According to Computer World magazine, unstructured text data accounts for roughly 70%-80% of all data in an organization. The most common approach to leveraging a company's text resources is to make them searchable using a search engine. While that in itself is a huge step forward, much more can be done to extract further insight from the text. In this tutorial, we will look at extracting keywords and other features from the text using well-known statistical and off-the-shelf machine learning techniques, improving both content search and discovery in the process. Finally, we bring these threads together to build an ontology and a simple recommendation system. We will use Solr 7.x as our indexing platform and the NIPS papers dataset, a collection of 7000+ papers from the Neural Information Processing Systems Conference from 1987-2017, as our corpus. The tutorial is fairly code-heavy and Python-based; while knowledge of Python is not required, familiarity with a programming language is very desirable.

Getting started

Please refer to data/README.md and models/README.md to download the dataset and third-party models.

Also refer to requirements.txt to see whether you need to install additional libraries for your Python 3 installation. The code was built using Anaconda Python 3, which has many (not all) of these libraries already installed. The only one I couldn't get to work was the dedupe library, which I had to install on a separate Anaconda Python 2 installation.
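As a rough way to see which libraries still need installing, the sketch below reads requirement lines and reports the ones that are not installed in the current environment. It assumes requirements.txt uses plain `name` or `name==version` lines; the helper names `requirement_name` and `missing_requirements` are made up for this illustration and are not part of the repository.

```python
import re
from importlib import metadata  # standard library in Python 3.8+


def requirement_name(line):
    """Return the distribution name from one requirements line, or None.

    Comments, blank lines and pip options (lines starting with '-') are skipped.
    """
    line = line.split("#", 1)[0].strip()
    if not line or line.startswith("-"):
        return None
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
    return match.group(0) if match else None


def missing_requirements(lines):
    """Return the names from `lines` that are not installed here."""
    missing = []
    for line in lines:
        name = requirement_name(line)
        if name is None:
            continue
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

To use it, pass an open file, for example `missing_requirements(open("requirements.txt"))` from the project home directory, and install whatever names come back.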

Finally, the notebooks and web application both use Solr 7.x as the search backend, so you need to install that. To start Solr, navigate to the Solr home directory, and run the following command. The Solr console can be accessed from your browser at http://localhost:8983.

cd <solr_home>
bin/solr start
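Once Solr is up, you may want to confirm that it is the 7.x line the notebooks expect. The sketch below is one possible check, assuming the default port shown above and Solr's system information endpoint (`/solr/admin/info/system`); the function names are hypothetical and not part of this repository.

```python
import json
from urllib.request import urlopen

# Assumption: Solr runs on the default host and port mentioned in this README.
SOLR_BASE = "http://localhost:8983/solr"


def system_info_url(base=SOLR_BASE):
    """Build the URL of Solr's system information endpoint (JSON output)."""
    return base.rstrip("/") + "/admin/info/system?wt=json"


def solr_version(payload):
    """Extract the Solr spec version string from a system-info JSON payload."""
    return json.loads(payload)["lucene"]["solr-spec-version"]


def running_solr_version(base=SOLR_BASE, timeout=5):
    """Ask the running Solr instance for its version (needs Solr started)."""
    with urlopen(system_info_url(base), timeout=timeout) as resp:
        return solr_version(resp.read().decode("utf-8"))
```

Calling `running_solr_version()` with Solr started should return a version string; if it does not begin with `7.`, the notebooks may not behave as described.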

The codebase consists of a set of notebooks under the notebooks folder and a Flask-based web application under the webtool folder. The web application provides a front end that showcases the outputs of the various content engineering techniques applied to a search index containing the NIPS papers.

To run the notebook server, navigate to the notebooks subdirectory and run the following command. By default, the notebooks are available in your browser at http://localhost:8888/. You can also find the URL in the server logs written to the console.

cd <project_home>/notebooks
jupyter notebook

To run the web application, navigate to the webtool subdirectory, then run the following command. The web application will start listening on port 5000. To get to the application from your browser, navigate to http://localhost:5000.

cd <project_home>/webtool
python webtool.py
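With all three pieces started, a quick way to see which ones are actually accepting connections is to probe their ports. This is only a sketch: the port numbers are the defaults mentioned in this README, and the `SERVICES` table and function names are invented for the example.

```python
import socket

# Default ports taken from this README; adjust if you changed any of them.
SERVICES = {"Solr": 8983, "Jupyter": 8888, "webtool": 5000}


def is_listening(port, host="localhost", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0


def service_status(services=SERVICES):
    """Map each service name to whether its port is currently open."""
    return {name: is_listening(port) for name, port in services.items()}
```

For example, `print(service_status())` shows at a glance whether Solr, the notebook server and the web application are reachable. An open port only means something is listening there, not that it is the expected service.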

Folders and files

data
models
notebooks
scripts
webtool
.gitignore
LICENSE
README.md
requirements.txt
search-summit-2018-content-engineering-slides.pdf
search-summit-2018-content-engineering-slides.pptx
