Spark Big Data Architecture Project


Team

Problem Definition and Motivation

In Data Science, processing massive batches of information to train machine learning models is crucial. Even when the architecture is predetermined, many factors, particularly runtime-related ones, strongly influence whether response times meet client expectations. Mastering the architecture of the platform on which our big data applications run is therefore essential.


The motivation for this project is to develop an application using Spark Structured Streaming and/or Kafka to capture real-time data. Once sufficient information has been retrieved from the input stream, the goal is to train a classification or linear regression model and test its quality against data that continues to arrive in real time. The challenge involves not only capturing data in real time for training but also receiving new data on which the built model makes predictions. Reading data from disk with a fixed capture periodicity is acceptable, but applications that capture data truly in real time will be evaluated more favorably.
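
For concreteness, here is a minimal sketch of what the ingestion side could look like with Spark Structured Streaming reading from Kafka. The broker address, topic name, and tick schema are assumptions for illustration, not the project's actual code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Requires the spark-sql-kafka connector package on the classpath.
spark = SparkSession.builder.appName("StreamingIngestion").getOrCreate()

# Hypothetical schema for incoming stock ticks; adjust to the real payload.
schema = StructType([
    StructField("symbol", StringType()),
    StructField("price", DoubleType()),
    StructField("ts", TimestampType()),
])

# Subscribe to a Kafka topic (broker and topic are placeholders).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "stock-ticks")
       .load())

# Kafka delivers raw bytes; parse the JSON value into typed columns.
ticks = (raw.selectExpr("CAST(value AS STRING) AS json")
         .select(from_json(col("json"), schema).alias("tick"))
         .select("tick.*"))
```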


Additionally, the project requires comparing execution times across the local ITAM cluster, a Spark standalone setup, and cloud infrastructure (Databricks) to demonstrate, with the same application, the performance differences between platforms.
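
One simple way to keep such comparisons fair is to leave the application identical and vary only where it runs, timing the same action on each platform. A minimal sketch, where the master URL and input file are placeholders (on Databricks, the session is provided by the platform instead):

```python
import time
from pyspark.sql import SparkSession

# Placeholder master URL: "local[*]" for single-machine Spark;
# something like "spark://<cluster-host>:7077" for a standalone cluster.
spark = SparkSession.builder.master("local[*]").appName("Benchmark").getOrCreate()

df = spark.read.csv("data.csv", header=True, inferSchema=True)  # assumed input

start = time.perf_counter()
df.count()  # force the full read to execute
print(f"Elapsed: {time.perf_counter() - start:.2f} s")
```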

Implementation Goals

    1. Real-Time Data Capture: Implement a system using Apache Kafka or Spark Structured Streaming for real-time data ingestion.
    2. Model Training and Prediction: Once enough data is collected, train a machine learning model and use it for real-time predictions as new data streams in (see the sketch after this list).
    3. Performance Evaluation: Test the application in different environments, including a local cluster, a standalone Spark setup, and a cloud platform such as Databricks. Measure and compare execution times to evaluate each platform's performance.
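
As a sketch of goal 2 (the column names, file path, and choice of linear regression are assumptions): once enough records have accumulated, fit a Spark MLlib model on the static history, then score each streaming micro-batch with foreachBatch.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("TrainAndPredict").getOrCreate()

# Assumed: accumulated ticks persisted by the ingestion job.
history = spark.read.parquet("collected_ticks.parquet")

# Assemble assumed feature columns; "close" serves as the regression label.
assembler = VectorAssembler(inputCols=["open", "volume"], outputCol="features")
model = LinearRegression(labelCol="close").fit(assembler.transform(history))

def score_batch(batch_df, batch_id):
    # Apply the already-trained model to each arriving micro-batch.
    model.transform(assembler.transform(batch_df)).show(truncate=False)

# `ticks` would be the streaming DataFrame from the ingestion sketch above:
# query = ticks.writeStream.foreachBatch(score_batch).start()
```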

This project aims to harness the capabilities of modern data processing frameworks to ensure efficient, real-time data analytics, crucial for dynamic machine learning model training and prediction. It also seeks to compare the computational efficiency and scalability of various computing environments to identify the best infrastructure for specific data science tasks.

Instructions

  • To run the program, navigate to the root of the "Spark_BigData_Architecture_Project" folder and execute the following command in your console.

Execution Console Code

streamlit run Dashboard_Streamlit.py

Additional Comments

  • Ideally, the data stream would be continuous and fast. However, the YahooFinance library only returns a response every 10 seconds, so the animation is not very visually appealing (the sketch below illustrates this polling cadence).
  • The dashboard performs better during trading hours, when stock values are constantly changing.
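
To illustrate the 10-second polling cadence mentioned above, here is a minimal sketch of a Streamlit loop over Yahoo Finance data. The ticker symbol is a placeholder, and this is not the actual Dashboard_Streamlit.py:

```python
import time
import streamlit as st
import yfinance as yf

st.title("Live Stock Dashboard")
placeholder = st.empty()

# Poll Yahoo Finance roughly every 10 seconds, matching the library's cadence.
while True:
    last_close = yf.Ticker("AAPL").history(period="1d", interval="1m")["Close"].iloc[-1]
    placeholder.metric("AAPL", f"${last_close:.2f}")
    time.sleep(10)
```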

About

Final project for the course 'Architecture for Large Data Volumes', taught in the Bachelor's program in Data Science at ITAM