
October 4, 2024

= Awesome Open-Source Data Engineering
:toc:
:toc-placement!:

This https://github.com/topics/awesome-list[Awesome List] aims to provide an overview of https://opensource.org/licenses[open-source] projects related to data engineering. This is a community effort: please https://github.com/gunnarmorling/awesome-opensource-data-engineering/blob/master/CONTRIBUTING.md[contribute] and send your pull requests to grow this list! For a list that includes non-OSS tools, see this amazing https://github.com/igorbarinov/awesome-data-engineering[Awesome List].

toc::[]

== Analytics

  • https://spark.apache.org/[Apache Spark] - A unified analytics engine for large-scale data processing. Includes APIs in Scala, Java, Python (known as PySpark), and R (SparkR).
  • https://beam.apache.org/[Apache Beam] - An open-source implementation of the Google Dataflow model. Provides a unified programming model for batch and streaming data processing jobs that run on any supported execution engine, including Spark, Flink, or its own DirectRunner. Offers SDKs for Java, Python, and Go.
  • https://flink.apache.org/[Apache Flink] - Stateful computations over data streams.
  • https://trino.io/[Trino (formerly known as PrestoSQL)] - Distributed SQL Query Engine for Big Data.

== Business Intelligence

== Data Lakehouse

  • https://delta.io/[Delta Lake] - Open-source storage framework that enables building a lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python.
  • https://hudi.apache.org/[Apache Hudi] - Transactional data lake platform that brings database and data warehouse capabilities to the data lake. Hudi replaces slow, old-school batch data processing with an incremental processing framework for low-latency, minute-level analytics.
  • https://iceberg.apache.org/[Apache Iceberg] - High-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.

== Change Data Capture

== Datastores

== Data Governance and Registries

== Data Virtualization

== Data Orchestration

  • https://github.com/Alluxio/alluxio[Alluxio] - Scalable, multi-tiered distributed caching for HDFS, S3, Ceph, NFS, and related filestores. Provides integrations for SQL queries into a Catalog from Spark, Hive, and Presto.

  • https://www.getdbt.com/[dbt] - Lets data analysts and engineers transform data using the practices that software engineers use to build applications.

== Formats

== Integration

== Messaging Infrastructure

== Specifications and Standards

== Stream Processing

== Testing

== Monitoring and Logging

== Versioning

== Workflow Management

== Related Resources

Only overview content, no specific tools.

=== Slide Decks, Recordings and Podcasts

=== Blog Posts and Articles

=== Collections

== License

The contents of this repository are licensed under the "Creative Commons Attribution-ShareAlike 4.0 International License".