---
visibility: public
---

# Data Engineering

**repo:** [igorbarinov/awesome-data-engineering](https://github.com/igorbarinov/awesome-data-engineering)
**category:** [[big-data|Big Data]]
**related:** [[python|Python]] · [[docker|Docker]] · [[apache-spark|Apache Spark]] · [[data-science|Data Science]] · [[streaming|Streaming]]

---

# Awesome Data Engineering

> A curated list of awesome things related to Data Engineering.

## Contents

- [Databases](#databases)
- [Data Comparison](#data-comparison)
- [Data Ingestion](#data-ingestion)
- [File System](#file-system)
- [Serialization format](#serialization-format)
- [Stream Processing](#stream-processing)
- [Batch Processing](#batch-processing)
- [Charts and Dashboards](#charts-and-dashboards)
- [Workflow](#workflow)
- [Data Lake Management](#data-lake-management)
- [ELK Elastic Logstash Kibana](#elk-elastic-logstash-kibana)
- [Docker](#docker)
- [Datasets](#datasets)
- [Realtime](#realtime)
- [Data Dumps](#data-dumps)
- [Monitoring](#monitoring)
- [Prometheus](#prometheus)
- [Profiling](#profiling)
- [Data Profiler](#data-profiler)
- [Testing](#testing)
- [Community](#community)
- [Forums](#forums)
- [Conferences](#conferences)
- [Podcasts](#podcasts)
- [Books](#books)

## Databases

- Relational
  - [RQLite](https://github.com/rqlite/rqlite) - Replicated SQLite using the Raft consensus protocol.
  - [MySQL](https://www.mysql.com/) - The world's most popular open source [database](/@harrisonqian/awesome/wiki/databases/database).
  - [TiDB](https://github.com/pingcap/tidb) - A distributed NewSQL [database](/@harrisonqian/awesome/wiki/databases/database) compatible with [MySQL](/@harrisonqian/awesome/wiki/databases/mysql) protocol.
  - [Percona XtraBackup](https://www.percona.com/software/mysql-database/percona-xtrabackup) - A free, open source, complete online backup solution for all versions of Percona Server, [MySQL](/@harrisonqian/awesome/wiki/databases/mysql)® and MariaDB®.
  - [mysql_utils](https://github.com/pinterest/mysql_utils) - Pinterest [MySQL](/@harrisonqian/awesome/wiki/databases/mysql) Management Tools.
  - [MariaDB](https://mariadb.org/) - An enhanced, drop-in replacement for [MySQL](/@harrisonqian/awesome/wiki/databases/mysql).
  - [PostgreSQL](https://www.postgresql.org/) - The world's most advanced open source [database](/@harrisonqian/awesome/wiki/databases/database).
  - [Rivestack](https://rivestack.io/) - Managed [PostgreSQL](/@harrisonqian/awesome/wiki/databases/postgresql) with pgvector for AI workloads. HNSW indexing, sub-4ms latency, and built-in SQL editor with automatic embedding generation.
  - [Amazon RDS](https://aws.amazon.com/rds/) - Makes it easy to set up, operate, and scale a relational [database](/@harrisonqian/awesome/wiki/databases/database) in the cloud.
  - [Crate.IO](https://crate.io/) - Scalable SQL [database](/@harrisonqian/awesome/wiki/databases/database) with the NOSQL goodies.
- Key-Value
  - [Redis](https://redis.io/) - An open source, BSD licensed, advanced key-value cache and store.
  - [Riak](https://docs.basho.com/riak/kv/) - A distributed [database](/@harrisonqian/awesome/wiki/databases/database) designed to deliver maximum data availability by distributing data across multiple servers.
  - [AWS DynamoDB](https://aws.amazon.com/dynamodb/) - A fast and flexible NoSQL [database](/@harrisonqian/awesome/wiki/databases/database) service for all applications that need consistent, single-digit millisecond latency at any scale.
  - [HyperDex](https://github.com/rescrv/HyperDex) - A scalable, searchable key-value store. Deprecated.
  - [SSDB](https://ssdb.io) - A high performance NoSQL [database](/@harrisonqian/awesome/wiki/databases/database) supporting many data structures, an alternative to Redis.
  - [Kyoto Tycoon](https://github.com/alticelabs/kyoto) - A lightweight network server on top of the Kyoto Cabinet key-value [database](/@harrisonqian/awesome/wiki/databases/database), built for high-performance and concurrency.
  - [IonDB](https://github.com/iondbproject/iondb) - A key-value store for microcontroller and IoT applications.
- Column
  - [Cassandra](https://cassandra.apache.org/) - The right choice when you need [scalability](/@harrisonqian/awesome/wiki/front-end-development/scalability) and high availability without compromising performance.
  - [Cassandra Calculator](https://www.ecyrd.com/cassandracalculator/) - This simple form allows you to try out different values for your Apache [Cassandra](/@harrisonqian/awesome/wiki/databases/cassandra) cluster and see what the impact is for your application.
  - [CCM](https://github.com/pcmanus/ccm) - A script to easily create and destroy an Apache [Cassandra](/@harrisonqian/awesome/wiki/databases/cassandra) cluster on localhost.
  - [ScyllaDB](https://github.com/scylladb/scylla) - NoSQL data store using the seastar framework, compatible with Apache [Cassandra](/@harrisonqian/awesome/wiki/databases/cassandra).
  - [HBase](https://hbase.apache.org/) - The [Hadoop](/@harrisonqian/awesome/wiki/big-data/hadoop) [database](/@harrisonqian/awesome/wiki/databases/database), a distributed, scalable, [big data](/@harrisonqian/awesome/wiki/big-data/big-data) store.
  - [AWS Redshift](https://aws.amazon.com/redshift/) - A fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools.
  - [FiloDB](https://github.com/filodb/FiloDB) - Distributed. Columnar. Versioned. [Streaming](/@harrisonqian/awesome/wiki/big-data/streaming). SQL.
  - [Vertica](https://www.vertica.com) - Distributed, MPP columnar [database](/@harrisonqian/awesome/wiki/databases/database) with extensive [analytics](/@harrisonqian/awesome/wiki/miscellaneous/analytics) SQL.
  - [ClickHouse](https://clickhouse.tech) - Distributed columnar DBMS for OLAP. SQL.
- Document
  - [MongoDB](https://www.mongodb.com) - An open-source, document [database](/@harrisonqian/awesome/wiki/databases/database) designed for ease of development and scaling.
  - [Percona Server for MongoDB](https://www.percona.com/software/mongo-database/percona-server-for-mongodb) - Percona Server for [MongoDB](/@harrisonqian/awesome/wiki/databases/mongodb)® is a free, enhanced, fully compatible, open source, drop-in replacement for the [MongoDB](/@harrisonqian/awesome/wiki/databases/mongodb)® Community Edition that includes enterprise-grade features and functionality.
  - [MemDB](https://github.com/rain1017/memdb) - Distributed Transactional In-Memory [Database](/@harrisonqian/awesome/wiki/databases/database) (based on MongoDB).
  - [Elasticsearch](https://www.elastic.co/) - Search & Analyze Data in Real Time.
  - [Couchbase](https://www.couchbase.com/) - The highest performing NoSQL distributed [database](/@harrisonqian/awesome/wiki/databases/database).
  - [RethinkDB](https://rethinkdb.com/) - The open-source [database](/@harrisonqian/awesome/wiki/databases/database) for the realtime web.
  - [RavenDB](https://ravendb.net/) - Fully Transactional NoSQL Document [Database](/@harrisonqian/awesome/wiki/databases/database).
- Graph
  - [ArcadeDB](https://arcadedb.com/) - Open-source multi-model [database](/@harrisonqian/awesome/wiki/databases/database) with native graph, document, key-value, and vector support. SQL, Cypher, and Gremlin query languages. Apache 2.0 license.
  - [Neo4j](https://neo4j.com/) - The world's leading graph [database](/@harrisonqian/awesome/wiki/databases/database).
  - [OrientDB](https://orientdb.com) - 2nd Generation Distributed Graph [Database](/@harrisonqian/awesome/wiki/databases/database) with the flexibility of Documents in one product with an Open Source commercial friendly license.
  - [ArangoDB](https://www.arangodb.com/) - A distributed free and open-source [database](/@harrisonqian/awesome/wiki/databases/database) with a flexible data model for documents, graphs, and key-values.
  - [Titan](https://titan.thinkaurelius.com) - A scalable graph [database](/@harrisonqian/awesome/wiki/databases/database) optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
  - [FlockDB](https://github.com/twitter-archive/flockdb) - A distributed, fault-tolerant graph [database](/@harrisonqian/awesome/wiki/databases/database) by Twitter. Deprecated.
  - [Actionbase](https://github.com/kakao/actionbase) - A [database](/@harrisonqian/awesome/wiki/databases/database) for user interactions (likes, views, follows) represented as graphs, with precomputed reads served in real-time.
- Distributed
  - [Datomic](https://www.datomic.com) - The fully transactional, cloud-ready, distributed [database](/@harrisonqian/awesome/wiki/databases/database).
  - [Apache Geode](https://geode.apache.org/) - An open source, distributed, in-memory [database](/@harrisonqian/awesome/wiki/databases/database) for scale-out applications.
  - [Gaffer](https://github.com/gchq/Gaffer) - A large-scale graph [database](/@harrisonqian/awesome/wiki/databases/database).
- Timeseries
  - [InfluxDB](https://github.com/influxdata/influxdb) - Scalable datastore for metrics, events, and real-time [analytics](/@harrisonqian/awesome/wiki/miscellaneous/analytics).
  - [OpenTSDB](https://github.com/OpenTSDB/opentsdb) - A scalable, distributed Time Series [Database](/@harrisonqian/awesome/wiki/databases/database).
  - [QuestDB](https://questdb.io/) - A relational column-oriented [database](/@harrisonqian/awesome/wiki/databases/database) designed for real-time [analytics](/@harrisonqian/awesome/wiki/miscellaneous/analytics) on time series and event data.
  - [kairosdb](https://github.com/kairosdb/kairosdb) - Fast scalable time series [database](/@harrisonqian/awesome/wiki/databases/database).
  - [Heroic](https://github.com/spotify/heroic) - A scalable time series [database](/@harrisonqian/awesome/wiki/databases/database) based on [Cassandra](/@harrisonqian/awesome/wiki/databases/cassandra) and Elasticsearch, by Spotify.
  - [Druid](https://github.com/apache/incubator-druid) - Column oriented distributed data store ideal for powering interactive applications.
  - [Riak-TS](https://basho.com/products/riak-ts/) - Riak TS is the only enterprise-grade NoSQL time series [database](/@harrisonqian/awesome/wiki/databases/database) optimized specifically for IoT and Time Series data.
  - [Akumuli](https://github.com/akumuli/Akumuli) - A numeric time-series [database](/@harrisonqian/awesome/wiki/databases/database). It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from Esperanto as "accumulate".
  - [Rhombus](https://github.com/Pardot/Rhombus) - A time-series object store for [Cassandra](/@harrisonqian/awesome/wiki/databases/cassandra) that handles all the complexity of building wide row indexes.
  - [Dalmatiner DB](https://github.com/dalmatinerdb/dalmatinerdb) - Fast distributed metrics [database](/@harrisonqian/awesome/wiki/databases/database).
  - [Blueflood](https://github.com/rackerlabs/blueflood) - A distributed system designed to ingest and process time series data.
  - [Timely](https://github.com/NationalSecurityAgency/timely) - A time series [database](/@harrisonqian/awesome/wiki/databases/database) application that provides secure access to time series data based on Accumulo and Grafana.
- Other
  - [Tarantool](https://github.com/tarantool/tarantool/) - An in-memory [database](/@harrisonqian/awesome/wiki/databases/database) and application server.
  - [GreenPlum](https://github.com/greenplum-db/gpdb) - The Greenplum [Database](/@harrisonqian/awesome/wiki/databases/database) (GPDB), an advanced, fully featured, open source data warehouse. It provides powerful and rapid [analytics](/@harrisonqian/awesome/wiki/miscellaneous/analytics) on petabyte scale data volumes.
  - [cayley](https://github.com/cayleygraph/cayley) - An open-source graph [database](/@harrisonqian/awesome/wiki/databases/database). Google.
  - [Snappydata](https://github.com/SnappyDataInc/snappydata) - OLTP + OLAP [Database](/@harrisonqian/awesome/wiki/databases/database) built on [Apache Spark](/@harrisonqian/awesome/wiki/big-data/apache-spark).
  - [TimescaleDB](https://www.timescale.com/) - Built as an extension on top of [PostgreSQL](/@harrisonqian/awesome/wiki/databases/postgresql), TimescaleDB is a time-series SQL [database](/@harrisonqian/awesome/wiki/databases/database) providing fast [analytics](/@harrisonqian/awesome/wiki/miscellaneous/analytics) and [scalability](/@harrisonqian/awesome/wiki/front-end-development/scalability), with automated data management on a proven storage engine.
  - [DuckDB](https://duckdb.org/) - A fast in-process analytical [database](/@harrisonqian/awesome/wiki/databases/database) that has zero external dependencies, runs on [Linux](/@harrisonqian/awesome/wiki/platforms/linux)/macOS/[Windows](/@harrisonqian/awesome/wiki/platforms/windows), offers a rich SQL dialect, and is free and extensible.

## Data Comparison

- [datacompy](https://github.com/capitalone/datacompy) - A [Python](/@harrisonqian/awesome/wiki/programming-languages/python) library that facilitates the comparison of two DataFrames in Pandas, Polars, Spark and more. The library goes beyond basic equality checks by providing detailed insights into discrepancies at both row and column levels.
- [dvt](https://github.com/GoogleCloudPlatform/professional-services-data-validator) - Data Validation Tool compares data from source and target tables to ensure that they match. It provides column validation, row validation, schema validation, custom query validation, and ad hoc SQL exploration.
- [koala-diff](https://github.com/godalida/koala-diff) - A high-performance [Python](/@harrisonqian/awesome/wiki/programming-languages/python) library for comparing large [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) (CSV, Parquet) locally using [Rust](/@harrisonqian/awesome/wiki/programming-languages/rust) and Polars. It features zero-copy [streaming](/@harrisonqian/awesome/wiki/big-data/streaming) to prevent OOM errors and generates interactive HTML data quality reports.
- [everyrow](https://github.com/futuresearch/everyrow-sdk) - AI-powered data operations SDK for [Python](/@harrisonqian/awesome/wiki/programming-languages/python). Semantic deduplication, fuzzy table merging, and intelligent row ranking using LLM agents.

## Data Ingestion

- [ingestr](https://github.com/bruin-data/ingestr) - CLI tool to copy data between databases with a single command. Supports 50+ sources including [PostgreSQL](/@harrisonqian/awesome/wiki/databases/postgresql), [MySQL](/@harrisonqian/awesome/wiki/databases/mysql), [MongoDB](/@harrisonqian/awesome/wiki/databases/mongodb), [Salesforce](/@harrisonqian/awesome/wiki/platforms/salesforce), Shopify to any data warehouse.
- [Kafka](https://kafka.apache.org/) - Publish-subscribe messaging rethought as a distributed commit log.
- [BottledWater](https://github.com/confluentinc/bottledwater-pg) - Change data capture from [PostgreSQL](/@harrisonqian/awesome/wiki/databases/postgresql) into Kafka. Deprecated.
- [kafkat](https://github.com/airbnb/kafkat) - Simplified command-line administration for Kafka brokers.
- [kafkacat](https://github.com/edenhill/kafkacat) - Generic command line non-JVM Apache Kafka producer and consumer.
- [pg-kafka](https://github.com/xstevens/pg_kafka) - A [PostgreSQL](/@harrisonqian/awesome/wiki/databases/postgresql) extension to produce messages to Apache Kafka.
- [librdkafka](https://github.com/edenhill/librdkafka) - The Apache Kafka C/C++ library.
- [kafka-docker](https://github.com/wurstmeister/kafka-docker) - Kafka in [Docker](/@harrisonqian/awesome/wiki/back-end-development/docker).
- [kafka-manager](https://github.com/yahoo/kafka-manager) - A tool for managing Apache Kafka.
- [kafka-node](https://github.com/SOHU-Co/kafka-node) - [Node.js](/@harrisonqian/awesome/wiki/platforms/node-js) client for Apache Kafka 0.8.
- [Secor](https://github.com/pinterest/secor) - Pinterest's Kafka to S3 distributed consumer.
- [Kafka-logger](https://github.com/uber/kafka-logger) - Kafka-winston logger for [Node.js](/@harrisonqian/awesome/wiki/platforms/node-js) from Uber.
- [Kroxylicious](https://github.com/kroxylicious/kroxylicious) - A Kafka Proxy, solving problems like encrypting your Kafka data at [rest](/@harrisonqian/awesome/wiki/miscellaneous/rest).
- [AWS Kinesis](https://aws.amazon.com/kinesis/) - A fully managed, cloud-based service for real-time data processing over large, distributed data streams.
- [RabbitMQ](https://www.rabbitmq.com/) - Robust messaging for applications.
- [dlt](https://www.dlthub.com) - A fast and simple pipeline building library for [Python](/@harrisonqian/awesome/wiki/programming-languages/python) data devs, runs in notebooks, cloud functions, airflow, etc.
- [FluentD](https://www.fluentd.org) - An open source data collector for a unified logging layer.
- [Embulk](https://www.embulk.org) - An open source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.
- [Apache Sqoop](https://sqoop.apache.org) - A tool designed for efficiently transferring bulk data between Apache [Hadoop](/@harrisonqian/awesome/wiki/big-data/hadoop) and structured datastores such as relational databases.
- [Heka](https://github.com/mozilla-services/heka) - Data Acquisition and Processing Made Easy. Deprecated.
- [Gobblin](https://github.com/apache/incubator-gobblin) - Universal data ingestion framework for [Hadoop](/@harrisonqian/awesome/wiki/big-data/hadoop) from LinkedIn.
- [Nakadi](https://nakadi.io) - An open source event messaging platform that provides a [REST](/@harrisonqian/awesome/wiki/miscellaneous/rest) API on top of Kafka-like queues.
- [Pravega](https://www.pravega.io) - Provides a new storage abstraction - a stream - for continuous and unbounded data.
- [Apache Pulsar](https://pulsar.apache.org/) - An open-source distributed pub-sub messaging system.
- [AWS Data Wrangler](https://github.com/awslabs/aws-data-wrangler) - Utility belt to handle data on AWS.
- [Airbyte](https://airbyte.io/) - Open-source data [integration](/@harrisonqian/awesome/wiki/platforms/integration) for modern data teams.
- [Artie](https://www.artie.com/) - Real-time data ingestion tool leveraging change data capture.
- [Sling](https://slingdata.io/) - CLI data [integration](/@harrisonqian/awesome/wiki/platforms/integration) tool specialized in moving data between databases, as well as storage systems.
- [Meltano](https://meltano.com/) - CLI & code-first ELT.
- [Singer SDK](https://sdk.meltano.com) - The fastest way to build custom data extractors and loaders compliant with the Singer Spec.
- [Google Sheets ETL](https://github.com/fulldecent/google-sheets-etl) - Live import all your Google Sheets to your data warehouse.
- [CsvPath Framework](https://www.csvpath.org/) - A delimited data preboarding framework that fills the gap between MFT and the data lake.
- [Estuary Flow](https://estuary.dev) - No/low-code data pipeline platform that handles both batch and real-time data ingestion.
- [db2lake](https://github.com/bahador-r/db2lake) - Lightweight [Node.js](/@harrisonqian/awesome/wiki/platforms/node-js) ETL framework for databases → data lakes/warehouses.
- [Kreuzberg](https://github.com/kreuzberg-dev/kreuzberg) - Polyglot document intelligence library with a [Rust](/@harrisonqian/awesome/wiki/programming-languages/rust) [core](/@harrisonqian/awesome/wiki/platforms/core) and bindings for [Python](/@harrisonqian/awesome/wiki/programming-languages/python), TypeScript, Go, and more. Extracts text, tables, and metadata from 62+ document formats for data pipeline ingestion.
- [DataRaven](https://dataraven.io/) - Managed cloud object storage transfers for ingestion workflows.
- [Xquik](https://xquik.com) - Real-time X (Twitter) data extraction platform with [REST](/@harrisonqian/awesome/wiki/miscellaneous/rest) API (76 endpoints), 20 bulk extraction tools, account monitoring, HMAC-signed webhooks, and MCP server for AI agent [integration](/@harrisonqian/awesome/wiki/platforms/integration).
- [Arpe.io](https://www.arpe.io/) - High-speed CLI tools for [database](/@harrisonqian/awesome/wiki/databases/database) export, import, replication and migration with parallel [streaming](/@harrisonqian/awesome/wiki/big-data/streaming) to CSV, Parquet, [JSON](/@harrisonqian/awesome/wiki/miscellaneous/json) and cloud storage, supporting [PostgreSQL](/@harrisonqian/awesome/wiki/databases/postgresql), [MySQL](/@harrisonqian/awesome/wiki/databases/mysql), Oracle, SQL Server and 80+ sources.
- [Crustdata](https://crustdata.com) - A real-time B2B data API for company and people intelligence, providing firmographics, headcount signals, job listings, web traffic, and funding events via [REST](/@harrisonqian/awesome/wiki/miscellaneous/rest) API and webhooks for data enrichment pipelines.
- [crdt-merge](https://github.com/mgillr/crdt-merge) - Conflict-free merge for DataFrames, [JSON](/@harrisonqian/awesome/wiki/miscellaneous/json), ML models & distributed agents, powered by CRDTs.

## File System

- [HDFS](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) - A distributed file system designed to run on commodity hardware.
- [Snakebite](https://github.com/spotify/snakebite) - A pure [python](/@harrisonqian/awesome/wiki/programming-languages/python) HDFS client.
- [AWS S3](https://aws.amazon.com/s3/) - Object storage built to retrieve any amount of data from anywhere.
- [smart_open](https://github.com/RaRe-Technologies/smart_open) - Utils for [streaming](/@harrisonqian/awesome/wiki/big-data/streaming) large files (S3, HDFS, gzip, bz2).
- [Alluxio](https://www.alluxio.org/) - A memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster [frameworks](/@harrisonqian/awesome/wiki/front-end-development/frameworks), such as Spark and MapReduce.
- [CEPH](https://ceph.com/) - A unified, distributed storage system designed for excellent performance, reliability, and [scalability](/@harrisonqian/awesome/wiki/front-end-development/scalability).
- [JuiceFS](https://github.com/juicedata/juicefs) - A high-performance Cloud-Native file system driven by object storage for large-scale data storage.
- [OrangeFS](https://www.orangefs.org/) - Orange File System is a branch of the Parallel Virtual File System.
- [SnackFS](https://github.com/tuplejump/snackfs-release) - A bite-sized, lightweight HDFS compatible file system built over [Cassandra](/@harrisonqian/awesome/wiki/databases/cassandra).
- [GlusterFS](https://www.gluster.org/) - Gluster Filesystem.
- [XtreemFS](https://www.xtreemfs.org/) - Fault-tolerant distributed file system for all storage needs.
- [SeaweedFS](https://github.com/chrislusf/seaweedfs) - Seaweed-FS is a simple and highly scalable distributed file system with two objectives: to store billions of files and to serve the files fast. Instead of supporting full POSIX file system semantics, Seaweed-FS chooses to implement only a key~file mapping. Similar to the word "NoSQL", you can call it "NoFS".
- [S3QL](https://github.com/s3ql/s3ql/) - A file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack.
- [LizardFS](https://lizardfs.com/) - Software Defined Storage is a distributed, parallel, scalable, fault-tolerant, Geo-Redundant and highly available file system.

## Serialization format

- [Apache Avro](https://avro.apache.org) - Apache Avro™ is a data serialization system.
- [Apache Parquet](https://parquet.apache.org) - A columnar storage format available to any project in the [Hadoop](/@harrisonqian/awesome/wiki/big-data/hadoop) ecosystem, regardless of the choice of data processing framework, data model or programming language.
- [Snappy](https://github.com/google/snappy) - A fast compressor/decompressor. Used with Parquet.
- [PigZ](https://zlib.net/pigz/) - A parallel implementation of gzip for modern multi-processor, multi-[core](/@harrisonqian/awesome/wiki/platforms/core) machines.
- [Apache ORC](https://orc.apache.org/) - The smallest, fastest columnar storage for [Hadoop](/@harrisonqian/awesome/wiki/big-data/hadoop) workloads.
- [Apache Thrift](https://thrift.apache.org) - The Apache Thrift software framework, for scalable cross-language services development.
- [ProtoBuf](https://github.com/protocolbuffers/protobuf) - Protocol Buffers - Google's data interchange format.
- [SequenceFile](https://wiki.apache.org/hadoop/SequenceFile) - A flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats.
- [Kryo](https://github.com/EsotericSoftware/kryo) - A fast and efficient object graph serialization framework for [Java](/@harrisonqian/awesome/wiki/programming-languages/java).

## Stream Processing

- [Apache Beam](https://beam.apache.org/) - A unified programming model that implements both batch and [streaming](/@harrisonqian/awesome/wiki/big-data/streaming) data processing jobs that run on many execution engines.
- [Spark Streaming](https://spark.apache.org/streaming/) - Makes it easy to build scalable fault-tolerant [streaming](/@harrisonqian/awesome/wiki/big-data/streaming) applications.
- [Apache Flink](https://flink.apache.org/) - A [streaming](/@harrisonqian/awesome/wiki/big-data/streaming) dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
- [Apache Storm](https://storm.apache.org) - A free and open source distributed realtime computation system.
- [Apache Samza](https://samza.apache.org) - A distributed stream processing framework.
- [Apache NiFi](https://nifi.apache.org/) - An easy to use, powerful, and reliable system to process and distribute data.
- [Apache Hudi](https://hudi.apache.org/) - An open source framework for managing storage for real-time processing; one of its most interesting features is Upsert.
- [CocoIndex](https://github.com/cocoindex-io/cocoindex) - An open source ETL framework to build fresh indexes for AI.
- [VoltDB](https://voltdb.com/) - An ACID-compliant RDBMS which uses a [shared nothing architecture](https://en.wikipedia.org/wiki/Shared-nothing_architecture).
- [PipelineDB](https://github.com/pipelinedb/pipelinedb) - The [Streaming](/@harrisonqian/awesome/wiki/big-data/streaming) SQL [Database](/@harrisonqian/awesome/wiki/databases/database).
- [Spring Cloud Dataflow](https://cloud.spring.io/spring-cloud-dataflow/) - [Streaming](/@harrisonqian/awesome/wiki/big-data/streaming) and tasks execution between Spring Boot [apps](/@harrisonqian/awesome/wiki/platforms/apps).
- [Bonobo](https://www.bonobo-project.org/) - A data-processing toolkit for [python](/@harrisonqian/awesome/wiki/programming-languages/python) 3.5+.
- [Robinhood's Faust](https://github.com/faust-streaming/faust) - Forever scalable event processing & in-memory durable K/V store as a library with [asyncio](/@harrisonqian/awesome/wiki/programming-languages/asyncio) & static [typing](/@harrisonqian/awesome/wiki/programming-languages/typing).
- [HStreamDB](https://github.com/hstreamdb/hstream) - The [streaming](/@harrisonqian/awesome/wiki/big-data/streaming) [database](/@harrisonqian/awesome/wiki/databases/database) built for IoT data storage and real-time processing.
- [Kuiper](https://github.com/emqx/kuiper) - Lightweight IoT data [analytics](/@harrisonqian/awesome/wiki/miscellaneous/analytics)/[streaming](/@harrisonqian/awesome/wiki/big-data/streaming) software for the edge, implemented in Go, that can run on all kinds of resource-constrained edge devices.
- [Zilla](https://github.com/aklivity/zilla) - An API gateway built for event-driven architectures and [streaming](/@harrisonqian/awesome/wiki/big-data/streaming) that supports standard protocols such as HTTP, SSE, gRPC, [MQTT](/@harrisonqian/awesome/wiki/miscellaneous/mqtt), and the native Kafka protocol.
- [SwimOS](https://github.com/swimos/swim-rust) - A framework for building real-time [streaming](/@harrisonqian/awesome/wiki/big-data/streaming) data processing applications that supports a wide range of ingestion sources.
- [Pathway](https://github.com/pathwaycom/pathway) - Performant open-source [Python](/@harrisonqian/awesome/wiki/programming-languages/python) ETL framework with [Rust](/@harrisonqian/awesome/wiki/programming-languages/rust) runtime, supporting 300+ data sources.

## Batch Processing

- [Hadoop MapReduce](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html) - A software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
- [Spark](https://spark.apache.org/) - A multi-language engine for executing data engineering, [data science](/@harrisonqian/awesome/wiki/programming-languages/data-science), and [machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning) on single-node machines or clusters.
- [Spark Packages](https://spark-packages.org) - A community index of packages for [Apache Spark](/@harrisonqian/awesome/wiki/big-data/apache-spark).
- [Deep Spark](https://github.com/Stratio/deep-spark) - Connecting [Apache Spark](/@harrisonqian/awesome/wiki/big-data/apache-spark) with different data stores. Deprecated.
- [Spark RDD API Examples](https://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html) - Examples by Zhen He.
- [Livy](https://livy.incubator.apache.org) - The [REST](/@harrisonqian/awesome/wiki/miscellaneous/rest) Spark Server.
- [Delight](https://github.com/datamechanics/delight) - A free & cross platform monitoring tool (Spark UI / Spark History Server alternative).
- [AWS EMR](https://aws.amazon.com/emr/) - A web service that makes it easy to quickly and cost-effectively process vast amounts of data.
- [Data Mechanics](https://www.datamechanics.co) - A cloud-based platform deployed on [Kubernetes](/@harrisonqian/awesome/wiki/back-end-development/kubernetes) making [Apache Spark](/@harrisonqian/awesome/wiki/big-data/apache-spark) more developer-friendly and cost-effective.
- [Tez](https://tez.apache.org/) - An application framework which allows for a complex directed-acyclic-graph of tasks for processing data.
- [Bistro](https://github.com/asavinov/bistro) - A light-weight engine for general-purpose data processing including both batch and stream [analytics](/@harrisonqian/awesome/wiki/miscellaneous/analytics). It is based on a novel data model, which represents data via _functions_ and processes data via _column operations_, as opposed to having only set operations as in conventional approaches like MapReduce or SQL.
- [Substation](https://github.com/brexhq/substation) - A cloud native data pipeline and transformation toolkit written in Go.
- [dna-claude-analysis](https://github.com/shmlkv/dna-claude-analysis) - Personal genome analysis toolkit with [Python](/@harrisonqian/awesome/wiki/programming-languages/python) scripts analyzing raw DNA data across 17 categories (health risks, ancestry, pharmacogenomics, nutrition, psychology, etc.) and generating a terminal-style single-page HTML visualization.
- Batch ML
  - [H2O](https://www.h2o.ai/) - Fast scalable [machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning) API for smarter applications.
  - [Mahout](https://mahout.apache.org/) - An environment for quickly creating scalable performant [machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning) applications.
  - [Spark MLlib](https://spark.apache.org/docs/latest/ml-guide.html) - Spark's scalable [machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning) library consisting of common [learning](/@harrisonqian/awesome/wiki/programming-languages/learning) [algorithms](/@harrisonqian/awesome/wiki/theory/algorithms) and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
- Batch Graph
  - [GraphLab Create](https://turi.com/products/create/docs/) - A [machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning) platform that enables data scientists and app developers to easily create intelligent [apps](/@harrisonqian/awesome/wiki/platforms/apps) at scale.
  - [Giraph](https://giraph.apache.org/) - An iterative graph processing system built for high [scalability](/@harrisonqian/awesome/wiki/front-end-development/scalability).
  - [Spark GraphX](https://spark.apache.org/graphx/) - [Apache Spark](/@harrisonqian/awesome/wiki/big-data/apache-spark)'s API for graphs and graph-parallel computation.
- Batch SQL
  - [Presto](https://prestodb.github.io/docs/current/index.html) - A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.
  - [Hive](https://hive.apache.org) - Data warehouse software facilitates querying and managing large [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) residing in distributed storage.
  - [Hivemall](https://github.com/apache/incubator-hivemall) - Scalable [machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning) library for Hive/[Hadoop](/@harrisonqian/awesome/wiki/big-data/hadoop).
  - [PyHive](https://github.com/dropbox/PyHive) - [Python](/@harrisonqian/awesome/wiki/programming-languages/python) interface to Hive and Presto.
  - [Drill](https://drill.apache.org/) - Schema-free SQL Query Engine for [Hadoop](/@harrisonqian/awesome/wiki/big-data/hadoop), NoSQL and Cloud Storage.

## Charts and Dashboards

- [Highcharts](https://www.highcharts.com/) - A [charting](/@harrisonqian/awesome/wiki/front-end-development/charting) library written in pure [JavaScript](/@harrisonqian/awesome/wiki/programming-languages/javascript), offering an easy way of adding interactive charts to your web site or web application.
- [ZingChart](https://www.zingchart.com/) - Fast [JavaScript](/@harrisonqian/awesome/wiki/programming-languages/javascript) charts for any data set.
- [C3.js](https://c3js.org) - D3-based reusable chart library.
- [D3.js](https://d3js.org/) - A [JavaScript](/@harrisonqian/awesome/wiki/programming-languages/javascript) library for manipulating documents based on data.
- [D3Plus](https://d3plus.org) - D3's simpler, easier to use cousin. Mostly predefined templates that you can just plug data into.
- [SmoothieCharts](https://smoothiecharts.org) - A [JavaScript](/@harrisonqian/awesome/wiki/programming-languages/javascript) [Charting](/@harrisonqian/awesome/wiki/front-end-development/charting) Library for [Streaming](/@harrisonqian/awesome/wiki/big-data/streaming) Data.
- [PyXley](https://github.com/stitchfix/pyxley) - [Python](/@harrisonqian/awesome/wiki/programming-languages/python) helpers for building dashboards using [Flask](/@harrisonqian/awesome/wiki/back-end-development/flask) and [React](/@harrisonqian/awesome/wiki/front-end-development/react).
- [Plotly](https://github.com/plotly/dash) - [Flask](/@harrisonqian/awesome/wiki/back-end-development/flask), JS, and CSS boilerplate for interactive, web-based visualization [apps](/@harrisonqian/awesome/wiki/platforms/apps) in [Python](/@harrisonqian/awesome/wiki/programming-languages/python).
- [Apache Superset](https://github.com/apache/incubator-superset) - A modern, enterprise-ready business intelligence web application.
- [Redash](https://redash.io/) - Make Your Company Data Driven. Connect to any data source, easily visualize and share your data.
- [Metabase](https://github.com/metabase/metabase) - The easy, open source way for everyone in your company to ask questions and learn from data.
- [PyQtGraph](https://www.pyqtgraph.org/) - A pure-[python](/@harrisonqian/awesome/wiki/programming-languages/python) graphics and GUI library built on PyQt4 / PySide and numpy. It is intended for use in mathematics / scientific / engineering applications.
- [Seaborn](https://seaborn.pydata.org) - A [Python](/@harrisonqian/awesome/wiki/programming-languages/python) visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
- [QueryGPT](https://github.com/MKY508/QueryGPT) - Natural language [database](/@harrisonqian/awesome/wiki/databases/database) query interface with automatic chart generation, supporting Chinese and English queries.

## Workflow

- [Bonnard](https://bonnard.dev/) - Agent-native semantic layer with governed metrics, [React](/@harrisonqian/awesome/wiki/front-end-development/react) SDK, and multi-warehouse support. Connects AI agents and dashboards to a single source of truth.
- [Bruin](https://github.com/bruin-data/bruin) - End-to-end data pipeline tool that combines ingestion, transformation (SQL + Python), and data quality in a single CLI. Connects to BigQuery, Snowflake, [PostgreSQL](/@harrisonqian/awesome/wiki/databases/postgresql), Redshift, and more. Includes VS Code extension with live previews.
- [Luigi](https://github.com/spotify/luigi) - A [Python](/@harrisonqian/awesome/wiki/programming-languages/python) module that helps you build complex pipelines of batch jobs.
- [CronQ](https://github.com/seatgeek/cronq) - An application cron-like system. [Used](https://chairnerd.seatgeek.com/building-out-the-seatgeek-data-pipeline/) with Luigi. Deprecated.
- [Cascading](https://www.cascading.org/) - [Java](/@harrisonqian/awesome/wiki/programming-languages/java) based application development platform.
- [Airflow](https://github.com/apache/airflow) - A system to programmatically author, schedule, and monitor data pipelines.
- [Azkaban](https://azkaban.github.io/) - A batch workflow job scheduler created at LinkedIn to run [Hadoop](/@harrisonqian/awesome/wiki/big-data/hadoop) jobs. Azkaban resolves the ordering through job dependencies and provides an easy-to-use web user interface to maintain and track your workflows.
- [Oozie](https://oozie.apache.org/) - A workflow scheduler system to manage Apache [Hadoop](/@harrisonqian/awesome/wiki/big-data/hadoop) jobs.
- [Pinball](https://github.com/pinterest/pinball) - DAG based workflow manager. Job flows are defined programmatically in [Python](/@harrisonqian/awesome/wiki/programming-languages/python). Supports output passing between jobs.
- [Dagster](https://github.com/dagster-io/dagster) - An open-source [Python](/@harrisonqian/awesome/wiki/programming-languages/python) library for building data applications.
- [Hamilton](https://github.com/dagworks-inc/hamilton) - A lightweight library to define data transformations as a directed-acyclic graph (DAG). If you like dbt for SQL transforms, you will like Hamilton for [Python](/@harrisonqian/awesome/wiki/programming-languages/python) processing.
- [Kedro](https://kedro.readthedocs.io/en/latest/) - A framework that makes it easy to build robust and scalable data pipelines by providing uniform project templates, data abstraction, configuration and pipeline assembly.
- [Dataform](https://dataform.co/) - An open-source framework and web based IDE to manage [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) and their dependencies. SQLX extends your existing SQL warehouse dialect to add features that support dependency management, [testing](/@harrisonqian/awesome/wiki/testing/testing), documentation and more.
- [Census](https://getcensus.com/) - A reverse-ETL tool that lets you sync data from your cloud data warehouse to SaaS applications like [Salesforce](/@harrisonqian/awesome/wiki/platforms/salesforce), Marketo, HubSpot, Zendesk, etc. No engineering favors required, just SQL.
- [dbt](https://getdbt.com/) - A command line tool that enables data analysts and engineers to transform data in their warehouses more effectively.
- [Kestra](https://github.com/kestra-io/kestra) - Scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
- [RudderStack](https://github.com/rudderlabs/rudder-server) - A warehouse-first Customer Data Platform that enables you to collect data from every application, website and SaaS platform, and then activate it in your warehouse and business tools.
- [PACE](https://github.com/getstrm/pace) - An open source framework that allows you to enforce agreements on how data should be accessed, used, and transformed, regardless of the data platform (Snowflake, BigQuery, Databricks, etc.).
- [Prefect](https://prefect.io/) - An orchestration and observability platform. With it, developers can rapidly build and scale resilient code, and triage disruptions effortlessly.
- [Multiwoven](https://github.com/Multiwoven/multiwoven) - The open-source reverse ETL, data activation platform for modern data teams.
- [SuprSend](https://www.suprsend.com/products/workflows) - Create automated workflows and logic using APIs for your notification service. Add templates, batching, preferences, and an in-app inbox with workflows to trigger notifications directly from your data warehouse.
- [Mage](https://www.mage.ai) - Open-source data pipeline tool for transforming and integrating data.
- [SQLMesh](https://sqlmesh.readthedocs.io) - An open-source data transformation framework for managing, [testing](/@harrisonqian/awesome/wiki/testing/testing), and deploying SQL and [Python](/@harrisonqian/awesome/wiki/programming-languages/python)-based data pipelines with version control, environment isolation, and automatic dependency resolution.

## Data Lake Management

- [lakeFS](https://github.com/treeverse/lakeFS) - An open source platform that delivers resilience and manageability to object-storage based data lakes.
- [Project Nessie](https://github.com/projectnessie/nessie) - A Transactional Catalog for Data Lakes with Git-like semantics. Works with Apache Iceberg tables.
- [Ilum](https://ilum.cloud/) - A modular Data Lakehouse platform that simplifies the management and monitoring of [Apache Spark](/@harrisonqian/awesome/wiki/big-data/apache-spark) clusters across [Kubernetes](/@harrisonqian/awesome/wiki/back-end-development/kubernetes) and [Hadoop](/@harrisonqian/awesome/wiki/big-data/hadoop) environments.
- [Gravitino](https://github.com/apache/gravitino) - An open-source, unified metadata management for data lakes, data warehouses, and external catalogs.
- [FlightPath Data](https://www.flightpathdata.com) - FlightPath is a gateway to a data lake's bronze layer, protecting it from invalid external data file feeds as a trusted publisher.

## ELK Elastic Logstash Kibana

- [docker-logstash](https://github.com/pblittle/docker-logstash) - A highly configurable Logstash (1.4.4) [Docker](/@harrisonqian/awesome/wiki/back-end-development/docker) image running Elasticsearch (1.7.0) and Kibana (3.1.2).
- [elasticsearch-jdbc](https://github.com/jprante/elasticsearch-jdbc) - JDBC importer for Elasticsearch.
- [ZomboDB](https://github.com/zombodb/zombodb) - [PostgreSQL](/@harrisonqian/awesome/wiki/databases/postgresql) Extension that allows creating an index backed by Elasticsearch.

## Docker

- [Gockerize](https://github.com/redbooth/gockerize) - Package golang service into minimal [Docker](/@harrisonqian/awesome/wiki/back-end-development/docker) [containers](/@harrisonqian/awesome/wiki/platforms/containers).
- [Flocker](https://github.com/ClusterHQ/flocker) - Easily manage [Docker](/@harrisonqian/awesome/wiki/back-end-development/docker) [containers](/@harrisonqian/awesome/wiki/platforms/containers) & their data.
- [Rancher](https://rancher.com/rancher-os/) - RancherOS is a 20 MB [Linux](/@harrisonqian/awesome/wiki/platforms/linux) distro that runs the entire OS as [Docker](/@harrisonqian/awesome/wiki/back-end-development/docker) [containers](/@harrisonqian/awesome/wiki/platforms/containers).
- [Kontena](https://www.kontena.io/) - Application [Containers](/@harrisonqian/awesome/wiki/platforms/containers) for Masses.
- [Weave](https://github.com/weaveworks/weave) - Weaving [Docker](/@harrisonqian/awesome/wiki/back-end-development/docker) [containers](/@harrisonqian/awesome/wiki/platforms/containers) into applications.
- [Zodiac](https://github.com/CenturyLinkLabs/zodiac) - A lightweight tool for easy deployment and rollback of dockerized applications.
- [cAdvisor](https://github.com/google/cadvisor) - Analyzes resource usage and performance characteristics of running [containers](/@harrisonqian/awesome/wiki/platforms/containers).
- [Micro S3 persistence](https://github.com/figadore/micro-s3-persistence) - [Docker](/@harrisonqian/awesome/wiki/back-end-development/docker) microservice for saving/restoring volume data to S3.
- [Rocker-compose](https://github.com/grammarly/rocker-compose) - [Docker](/@harrisonqian/awesome/wiki/back-end-development/docker) composition tool with idempotency features for deploying [apps](/@harrisonqian/awesome/wiki/platforms/apps) composed of multiple [containers](/@harrisonqian/awesome/wiki/platforms/containers). Deprecated.
- [Nomad](https://github.com/hashicorp/nomad) - A cluster manager, designed for both long-lived services and short-lived batch processing workloads.
- [ImageLayers](https://imagelayers.io/) - Visualize [Docker](/@harrisonqian/awesome/wiki/back-end-development/docker) images and the layers that compose them.

## Datasets

### Realtime

- [DexPaprika](https://api.dexpaprika.com) - Free real-time DEX data via SSE [streaming](/@harrisonqian/awesome/wiki/big-data/streaming) across 34 blockchains. 30M+ pools, 27M+ tokens, ~1 second price updates. No API key, no rate limits. [Docs](https://docs.dexpaprika.com)
- [Twitter Realtime](https://developer.twitter.com/en/docs/tweets/filter-realtime/overview) - The [Streaming](/@harrisonqian/awesome/wiki/big-data/streaming) APIs give developers low latency access to Twitter's global stream of Tweet data.
- [Eventsim](https://github.com/Interana/eventsim) - Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.
- [Reddit](https://www.reddit.com/r/datasets/comments/3mk1vg/realtime_data_is_available_including_comments/) - Real-time data is available including comments, submissions and links posted to reddit.

### Data Dumps

- [GitHub Archive](https://www.gharchive.org/) - [GitHub](/@harrisonqian/awesome/wiki/development-environment/github)'s public timeline since 2011, updated every hour.
- [Common Crawl](https://commoncrawl.org/) - Open source repository of web crawl data.
- [Wikipedia](https://dumps.wikimedia.org/enwiki/latest/) - Wikipedia's complete copy of all wikis, in the form of Wikitext source and metadata embedded in XML. A number of raw [database](/@harrisonqian/awesome/wiki/databases/database) tables in SQL form are also available.
- [FirstData](https://github.com/MLT-OSS/FirstData) - The world's most comprehensive authoritative data source knowledge base. 160+ curated sources from governments, international organizations, and research institutions with MCP [integration](/@harrisonqian/awesome/wiki/platforms/integration).

## Monitoring

### Prometheus

- [Prometheus.io](https://github.com/prometheus/prometheus) - An open-source service monitoring system and time series [database](/@harrisonqian/awesome/wiki/databases/database).
- [HAProxy Exporter](https://github.com/prometheus/haproxy_exporter) - Simple server that scrapes HAProxy stats and exports them via HTTP for [Prometheus](/@harrisonqian/awesome/wiki/miscellaneous/prometheus) consumption.
## Profiling

### Data Profiler

- [Data Profiler](https://github.com/capitalone/dataprofiler) - The DataProfiler is a [Python](/@harrisonqian/awesome/wiki/programming-languages/python) library designed to make data analysis, monitoring, and sensitive data detection easy.
- [YData Profiling](https://docs.profiling.ydata.ai/latest/) - A general-purpose open-source data profiler for high-level analysis of a dataset.
- [Desbordante](https://github.com/desbordante/desbordante-core) - An open-source data profiler specifically focused on discovery and validation of complex patterns in data.

## Testing

- [Grai](https://github.com/grai-io/grai-core/) - A data catalog tool that integrates into your CI system exposing downstream impact [testing](/@harrisonqian/awesome/wiki/testing/testing) of data changes. These tests prevent data changes which might break data pipelines or BI dashboards from making it to production.
- [DQOps](https://github.com/dqops/dqo) - An open-source data quality platform for the whole data platform lifecycle from profiling new data sources to applying full automation of data quality monitoring.
- [DataKitchen](https://datakitchen.io/) - Open Source Data Observability for end-to-end Data Journey Observability, data profiling, anomaly detection, and auto-created data quality validation tests.
- [Great Expectations](https://greatexpectations.io/) - Open Source data validation framework to manage data quality. Users can define and document “expectations”: rules about how data should look and behave.
- [Provero](https://github.com/provero-org/provero) - A vendor-neutral, declarative data quality engine. Define checks in YAML, run anywhere. Includes 16 built-in check types, SQL batch optimizer, anomaly detection, and data contracts.
- [RunSQL](https://runsql.com/) - Free online SQL playground for [MySQL](/@harrisonqian/awesome/wiki/databases/mysql), [PostgreSQL](/@harrisonqian/awesome/wiki/databases/postgresql), and SQL Server. Create [database](/@harrisonqian/awesome/wiki/databases/database) structures, run queries, and share results instantly.
- [Spark Playground](https://www.sparkplayground.com/) - Write, run, and test PySpark code on Spark Playground's online compiler. Access real-world sample [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) & solve interview questions to enhance your PySpark skills for data engineering roles.
- [daffy](https://github.com/vertti/daffy/) - Decorator-first DataFrame contracts/validation (columns/dtypes/constraints) at function boundaries. Supports Pandas/Polars/PyArrow/Modin.
- [Snowflake Emulator](https://github.com/nnnkkk7/snowflake-emulator) - A Snowflake-compatible emulator for local development and [testing](/@harrisonqian/awesome/wiki/testing/testing).
- [DataScreenIQ](https://datascreeniq.com) - Real-time data quality firewall for pipelines and APIs. Screens rows in milliseconds for schema drift, null spikes, type mismatches, and data anomalies with PASS / WARN / BLOCK decisions.
- [DataDriven](https://www.datadriven.io/) - Interview practice with SQL query execution, [Python](/@harrisonqian/awesome/wiki/programming-languages/python), and data modeling exercises.

## Community

### Forums

- [/r/dataengineering](https://www.reddit.com/r/dataengineering/) - News, [tips](/@harrisonqian/awesome/wiki/programming-languages/tips), and background on Data Engineering.
- [/r/etl](https://www.reddit.com/r/ETL/) - Subreddit focused on ETL.

### Conferences

- [Data Council](https://www.datacouncil.ai/about) - The first technical conference that bridges the gap between data scientists, data engineers and data analysts.

### Podcasts

- [Chain of Thought](https://www.chainofthought.show/) - Interviews with AI and data infrastructure leaders on building production systems.
- [Data Engineering Podcast](https://www.dataengineeringpodcast.com/) - The show about modern data infrastructure.
- [Latent Space](https://www.latent.space/podcast) - Technical deep dives on AI engineering, from model training to deployment.
- [Practical AI](https://practicalai.fm/) - Making AI practical, productive, and accessible to everyone.
- [Software Engineering Daily](https://softwareengineeringdaily.com/) - Daily interviews about technical software topics, including data infrastructure.
- [The Analytics Engineering Podcast](https://roundup.getdbt.com/s/the-analytics-engineering-podcast) - How analytics engineers build and maintain data pipelines at scale.
- [The Data Stack Show](https://datastackshow.com/) - A show where they talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

### Books

- [Snowflake Data Engineering](https://www.manning.com/books/snowflake-data-engineering) - A practical introduction to data engineering on the Snowflake cloud data platform.
- [Best Data Science Books](https://www.appliedaicourse.com/blog/data-science-books/) - This blog offers a curated list of top [data science](/@harrisonqian/awesome/wiki/programming-languages/data-science) books, categorized by topics and [learning](/@harrisonqian/awesome/wiki/programming-languages/learning) stages, to aid readers in building foundational knowledge and staying updated with industry trends.
- [Architecting an Apache Iceberg Lakehouse](https://www.manning.com/books/architecting-an-apache-iceberg-lakehouse) - A guide to designing an Apache Iceberg lakehouse from scratch.
- [Learn AI Data Engineering in a Month of Lunches](https://www.manning.com/books/learn-ai-data-engineering-in-a-month-of-lunches) - A fast, friendly guide to integrating large language models into your data workflows.