Business Recommendation using Topic-Modelling on Yelp Data Set

Objective :

This project is based on the Yelp Data Set of user reviews for restaurants. Its business value is:
- To recommend topics a business owner can improve on in their restaurant.
- To recommend topics a new restaurant should focus on.
The pipeline works as follows:
- Parse the raw customer review data with Python's NLTK: tokenize words, apply POS tagging, keep only NN and NNS tokens (singular and plural nouns), and remove stop words.
- Create corpus documents from the tokenized words.
- Apply the LDA algorithm to the documents to produce a set of topics and the distribution of words in each.
- Predict topics for new documents using the trained model.
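
The noun-filtering step described above can be sketched in plain Python. This is a minimal illustration, not the project's processing.py: the function name `keep_nouns`, the tiny stop-word set, and the hand-tagged sample are all assumptions. In the real pipeline the (word, tag) pairs would come from NLTK's tokenizer and POS tagger.

```python
# Penn Treebank tags kept by the project: singular and plural common nouns.
NOUN_TAGS = {"NN", "NNS"}

# Tiny illustrative stop-word set (assumption); the project presumably
# relies on NLTK's English stop-word list instead.
STOP_WORDS = {"the", "a", "an", "was", "is", "and", "but", "it", "i"}


def keep_nouns(tagged_tokens, stop_words=STOP_WORDS):
    """Return lower-cased nouns from (word, tag) pairs, minus stop words."""
    kept = []
    for word, tag in tagged_tokens:
        token = word.lower()
        if tag in NOUN_TAGS and token.isalpha() and token not in stop_words:
            kept.append(token)
    return kept


# Hand-tagged sample in the (word, tag) format that a POS tagger returns.
tagged_review = [
    ("The", "DT"), ("pizza", "NN"), ("was", "VBD"), ("great", "JJ"),
    ("but", "CC"), ("the", "DT"), ("fries", "NNS"), ("were", "VBD"),
    ("cold", "JJ"), (".", "."),
]

print(keep_nouns(tagged_review))  # ['pizza', 'fries']
```

Restricting the vocabulary to nouns keeps the topics focused on things a restaurant serves or offers rather than on sentiment words.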

Dataset

Download and extract the Yelp Dataset Challenge data (yelp_dataset_challenge_academic_dataset) from the following link: Yelp Data Set
Put this data folder in the dataset folder, in the same directory as Topic_Modelling.
Install MongoDB on your system from the following link: Install MongoDB
Download the MongoDB data files from the following Google Drive link: MongoDB
Put this folder in /var/lib

Python Packages:

Python packages to be installed for running the program:

Gensim
Nltk
Pandas
Numpy
Matplotlib
Scipy
pymongo

File Information:

Relevant files:

__init__.py - To treat the current directory as package.
db_objects.py - Factory class to return database collections.
filter_review.py - Filters the reviews from the JSON file and stores them in the MongoDB REVIEW collection.
make_json_serializable.py - Creates serializable JSON objects from the MongoDB collection so the data can be written to a JSON file.
processing.py - Processes the reviews by tokenizing, removing stop words, lemmatizing and POS tagging, and stores the result in the MongoDB CORPUS collection.
settings.py - Contains database constants, connection strings, collection names and CORPUS dictionary object names.
show_db_information.py - Displays the database names currently present in MongoDB, the collections they contain, and information about the collection objects.
training.py - Creates CORPUS documents and trains the LDA model with the documents as input.
klDivergence.py - Creates the KL divergence graph used to find the optimal number of topics for LDA.

Project Setup:

We filtered the review data to the city of Pittsburgh (61,849 reviews after filtering) and trained our LDA model on this data. Preprocessing the reviews (tokenization, stop-word removal, lemmatization, POS tagging) took 7.91 hours, running 4 processes on 4 cores of a Dell machine with an Intel i5 processor.
The review data is stored in REVIEWS_COLLECTION, and the corpus document data is stored in CORPUS_COLLECTION.
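
The city filter can be sketched as follows. This is an illustration rather than the project's transform.py or filter_review.py: it assumes one JSON object per line, with `business_id`, `city` and `categories` fields on business records and `business_id` on review records, and the sample records are made up.

```python
import json


def restaurant_ids_in_city(business_lines, city="Pittsburgh"):
    """Collect IDs of businesses in `city` that list the Restaurants category."""
    ids = set()
    for line in business_lines:
        business = json.loads(line)
        categories = business.get("categories") or []
        if business.get("city") == city and "Restaurants" in categories:
            ids.add(business["business_id"])
    return ids


def filter_reviews(review_lines, business_ids):
    """Yield parsed reviews whose business_id is in `business_ids`."""
    for line in review_lines:
        review = json.loads(line)
        if review.get("business_id") in business_ids:
            yield review


# Made-up records, for illustration only.
business_lines = [
    '{"business_id": "b1", "city": "Pittsburgh", "categories": ["Restaurants", "Pizza"]}',
    '{"business_id": "b2", "city": "Phoenix", "categories": ["Restaurants"]}',
    '{"business_id": "b3", "city": "Pittsburgh", "categories": ["Dentists"]}',
]
review_lines = [
    '{"business_id": "b1", "stars": 4, "text": "Great crust."}',
    '{"business_id": "b2", "stars": 2, "text": "Slow service."}',
]

ids = restaurant_ids_in_city(business_lines)
selected = list(filter_reviews(review_lines, ids))
print(len(selected))  # 1
```

Reading the files line by line keeps memory use low, which matters for a review file of this size.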

Steps To Run:

Run transform.py to extract the review data for the Pittsburgh area into a new JSON file (review_json_file_pittsburgh_restaurant.json).
Run klDivergence.py to plot the symmetric KL divergence against the number of topics and find the optimal number of topics for LDA.
Run training.py to train the LDA model, using the optimal number of topics found by the KL divergence method in the previous step.
Run display.py to check the results for the business ID provided.
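
For reference, the symmetric KL divergence used in the steps above is the sum of the KL divergence taken in both directions. The sketch below only illustrates that measure for two discrete distributions. It does not reproduce klDivergence.py, and which distributions the script compares is not stated here. The function names and sample values are assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetric KL divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Made-up distributions, for illustration only.
p = [0.5, 0.5]
q = [0.9, 0.1]

print(symmetric_kl(p, p))  # 0.0
print(symmetric_kl(p, q) == symmetric_kl(q, p))  # True
```

Plotting this value for a range of topic counts and picking the count at the minimum is one common way to read such a graph.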

Project WorkFlow:

Results:

The program outputs the following topics:

(0, u'0.093*pizza + 0.023*sauce + 0.015*slice + 0.015*place + 0.014*salad + 0.014*tomato + 0.013*cheese + 0.012*crust + 0.012*delivery + 0.012*bread')
(1, u'0.017*meal + 0.016*restaurant + 0.015*dinner + 0.014*flavor + 0.014*dessert + 0.013*menu + 0.011*plate + 0.011*pork + 0.011*meat + 0.010*potato')
(2, u'0.031*taco + 0.023*place + 0.021*breakfast + 0.020*coffee + 0.019*egg + 0.019*chip + 0.018*brunch + 0.010*time + 0.009*potato + 0.009*day')
(3, u'0.037*place + 0.026*time + 0.022*order + 0.021*burger + 0.012*sandwich + 0.012*don + 0.012*lunch + 0.011*fry + 0.011*service + 0.010*people')
(4, u'0.029*place + 0.026*chicken + 0.021*restaurant + 0.019*soup + 0.018*spicy + 0.016*rice + 0.016*roll + 0.013*sushi + 0.013*sauce + 0.013*service')
(5, u'0.036*place + 0.030*bar + 0.024*beer + 0.022*service + 0.021*time + 0.018*night + 0.013*drink + 0.013*restaurant + 0.012*selection + 0.012*menu')
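
Each topic line above is a string of `weight*word` terms joined by ` + `. If the weights are needed as numbers, for example to draw a chart, a small parser along these lines can split them out. The helper name `parse_topic` is an assumption and is not part of the project.

```python
def parse_topic(topic_string):
    """Split a 'weight*word + weight*word' string into (word, weight) pairs."""
    pairs = []
    for term in topic_string.split(" + "):
        weight, word = term.split("*", 1)
        # Some Gensim versions wrap the word in double quotes; strip them.
        pairs.append((word.strip().strip('"'), float(weight)))
    return pairs


topic_0 = "0.093*pizza + 0.023*sauce + 0.015*slice"
parsed = parse_topic(topic_0)
print(parsed[0])  # ('pizza', 0.093)
```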

This can be summarised as the following chart:

The following table shows the average rating per topic for business ID SsGNAc9U-aKPZccnaDtFkA:

These results suggest that the restaurant needs to improve on breakfast.
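
One plausible way to arrive at such a per-topic rating is to label each review with a topic and average the star ratings per label. The sketch below assumes that approach. The topic labels and star values are made up for illustration and are not the project's results.

```python
from collections import defaultdict


def average_rating_by_topic(labelled_reviews):
    """Average the star ratings of (topic_label, stars) pairs per topic."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for topic, stars in labelled_reviews:
        totals[topic] += stars
        counts[topic] += 1
    return {topic: totals[topic] / counts[topic] for topic in totals}


# Made-up (topic, stars) pairs for a single business, for illustration only.
labelled_reviews = [
    ("Pizza", 4), ("Pizza", 5),
    ("Breakfast", 2), ("Breakfast", 3),
]

averages = average_rating_by_topic(labelled_reviews)
weakest = min(averages, key=averages.get)
print(averages)  # {'Pizza': 4.5, 'Breakfast': 2.5}
print(weakest)   # Breakfast
```

The topic with the lowest average is the natural candidate for the recommendation to the business owner.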

References:

LDA Algorithm Explanation:
