Teaching Materials for Distributed Statistical Computing (大数据分布式计算教学材料)
Updated Jun 11, 2024 · HTML
Apache Spark is an open-source, general-purpose distributed cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
This construct builds the components you need to quickly launch an EMR Serverless application. After submitting the EMR Serverless job, you can also launch an EMR notebook via a cluster template to inspect the output of the EMR Serverless application.
YTsaurus is a scalable and fault-tolerant open-source big data platform.
Intel® End-to-End AI Optimization Kit
One ETL tool to rule them all
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Collaborative Filtering based on Google Analytics 360 data from BigQuery.
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, along with APIs
An open protocol for secure data sharing
New generation decentralized data lake and a streaming data pipeline
🧙 Build, run, and manage data pipelines for integrating and transforming data.
Created by Matei Zaharia
Released May 26, 2014