olist-end-to-end-data-engineering-project

Description

Infrastructure

Pipeline

ETL stands for Extract, Transform, Load. An ETL pipeline is a set of processes used to collect data from various sources, transform it into a format that is suitable for analysis or storage, and then load it into a target destination, such as a database, data warehouse, or data lake.

In this project, we focusing on generating an orders fact table from the dataset provided by Olist in order to analyze the sales performance of the company.

Component	Usage
Minio	Minio serves as the data lake where raw data is stored before being processed.
Airflow	Airflow is used to orchestrate the ETL pipeline.
Spark & Python	PySpark is used to process the raw data and generate the orders fact table.
MySQL	MySQL is used as the data warehouse where the orders fact table is stored.

Orders Fact Table

With these tables, we can analyze the sales performance of the company. For example, we can calculate the average order value.

select year(order_date)   as year,
       month(order_date)  as month,
       avg(price)         as avg_price,
       avg(freight_value) as avg_freight_value,
       avg(price + freight_value) as avg_total_value
from orders_fact
group by year, month
order by year, month;

+------+-------+------------------+-------------------+------------------+
| year | month | avg_total_price  | avg_total_freight | avg_total_order  |
+------+-------+------------------+-------------------+------------------+
| 2016 | 9     | 44.56            | 14.57             | 59.13            |
| 2016 | 10    | 136.38           | 20.11             | 156.50           |
| 2016 | 12    | 10.9             | 8.72              | 19.62            |
| 2017 | 1     | 125.98           | 17.67             | 143.65           |
| 2017 | 2     | 126.76           | 19.98             | 146.74           |
| 2017 | 3     | 124.78           | 19.23             | 144.02           |
| 2017 | 4     | 134.10           | 19.56             | 153.66           |
| 2017 | 5     | 122.36           | 19.37             | 141.73           |
| 2017 | 6     | 120.86           | 19.52             | 140.37           |
| 2017 | 7     | 110.21           | 19.24             | 129.45           |
| 2017 | 8     | 116.90           | 19.19             | 136.09           |
| 2017 | 9     | 129.25           | 19.87             | 149.12           |
| 2017 | 10    | 124.81           | 19.75             | 144.55           |
| 2017 | 11    | 116.59           | 19.49             | 136.08           |
| 2017 | 12    | 117.93           | 18.97             | 136.90           |
| 2018 | 1     | 115.74           | 19.16             | 134.91           |
| 2018 | 2     | 110.03           | 18.60             | 128.64           |
| 2018 | 3     | 119.66           | 20.92             | 140.58           |
| 2018 | 4     | 124.97           | 20.45             | 145.42           |
| 2018 | 5     | 125.74           | 19.34             | 145.08           |
| 2018 | 6     | 122.23           | 22.26             | 144.49           |
| 2018 | 7     | 126.27           | 23.01             | 149.28           |
| 2018 | 8     | 117.92           | 20.51             | 138.43           |
| 2018 | 9     | 145.00           | 21.46             | 166.46           |
+------+-------+------------------+-------------------+------------------+

A lot of other questions can be answered using this dataset. But we need to build other facts tables to answer them. By exploring these questions, you can gain valuable insights into customer behavior, operational efficiency, product performance, and overall business health. This can help in making data-driven decisions to improve various aspects of the e-commerce operation.

Setup

1) Setup docker

docker-compose up

2) Setup Spark and MySQL connection in airflow

follow this guide

Airflow login	admin
Airflow password	admin

Connection Id	spark-conn
Connection Type	spark
Host	spark://spark
Port	7077

Connection Id	mysql-conn
Connection Type	mysql
MySQL login	admin
MySQL password	admin
Host	database
Port	3306

3) Upload dataset to datalake

use minio web interface to upload dataset into olist bucket

login	admin
password	adminadmin

4) Download jars

download the following jars and place them in the jars folder

mysql-connector-j >> used to connect to data warehouse

hadoop-aws & aws-java-sdk >> used to connect to datalake

Possible Improvements

For the moment, our orders fact table pipeline is designed to recreate tables every time the pipeline is run. This is not ideal for a production environment, where we would want to update the tables incrementally.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
dags		dags
docker		docker
images		images
jars		jars
jobs		jobs
logs		logs
.gitignore		.gitignore
README.md		README.md
compose.yaml		compose.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dags

dags

docker

docker

images

images

jars

jars

jobs

jobs

logs

logs

.gitignore

.gitignore

README.md

README.md

compose.yaml

compose.yaml

Repository files navigation

olist-end-to-end-data-engineering-project

Description

Infrastructure

Pipeline

Orders Fact Table

Setup

1) Setup docker

2) Setup Spark and MySQL connection in airflow

3) Upload dataset to datalake

4) Download jars

Possible Improvements

About

Releases

Packages

Languages

rxmi-bkd/olist-end-to-end-data-engineering-project

Folders and files

Latest commit

History

Repository files navigation

olist-end-to-end-data-engineering-project

Description

Infrastructure

Pipeline

Orders Fact Table

Setup

1) Setup docker

2) Setup Spark and MySQL connection in airflow

3) Upload dataset to datalake

4) Download jars

Possible Improvements

About

Topics

Resources

Stars

Watchers

Forks

Languages