ebtelmarz/big_data_lsh_ensemble

LSH Ensemble

This is an assignment for the Big Data course at Roma Tre University.

This repo is based on the work reported in this paper: LSH Ensemble: Internet-Scale Domain Search.

Requirements

To run this project you need:

  • Python 3.6.9
  • Hadoop 3.2.1
  • Spark 3.0.0
  • pip3 installed on your machine. To install pip3, run the following commands in a shell:
sudo apt update
sudo apt install python3-pip
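
Before running the project, it can help to confirm that the tools listed above are reachable from the shell. The following is a minimal sketch and not part of the repository: the helper name check_tool is hypothetical, and it assumes that Hadoop and Spark expose the hadoop and spark-submit commands on your PATH. It only checks that each command exists, not that the installed version matches the one listed above.

```shell
# Hypothetical helper (not part of this repo): report whether a command is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Assumption: Hadoop and Spark are installed so that these commands are on PATH.
for tool in python3 pip3 hadoop spark-submit; do
    check_tool "$tool" || true
done
```

Any tool reported as missing should be installed, or its bin directory added to PATH, before moving on to the Usage section.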

Usage

To run the project locally

To start Hadoop, open a shell and run

$HADOOP_HOME/sbin/start-dfs.sh 

Download this repo or clone it by running

git clone https://github.com/ebtelmarz/big_data_lsh_ensemble.git

Move inside the downloaded directory

cd big_data_lsh_ensemble/

Execute the run.sh script by running in a shell

sh run.sh

 

To run the project on a cluster

Create and activate a virtual environment

python3 -m venv my_env
source my_env/bin/activate

Execute the run.sh script by running

sh run.sh
