---
visibility: public
---

# Hadoop

**repo:** [youngwookim/awesome-hadoop](https://github.com/youngwookim/awesome-hadoop)

**category:** [[big-data|Big Data]]

**related:** [[apache-spark|Apache Spark]] · [[data-engineering|Data Engineering]]

---

# Awesome Hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources. Inspired by [Awesome PHP](https://github.com/ziadoz/awesome-php), [Awesome Python](https://github.com/vinta/awesome-python) and [Awesome Sysadmin](https://github.com/kahun/awesome-sysadmin).

- [Awesome Hadoop](#awesome-hadoop)
    - [Hadoop](#hadoop)
    - [YARN](#yarn)
    - [NoSQL](#nosql)
    - [SQL on Hadoop](#sql-on-hadoop)
    - [Data Management](#data-management)
    - [Workflow, Lifecycle and Governance](#workflow-lifecycle-and-governance)
    - [Data Ingestion and Integration](#data-ingestion-and-integration)
    - [DSL](#dsl)
    - [Libraries and Tools](#libraries-and-tools)
    - [Realtime Data Processing](#realtime-data-processing)
    - [Distributed Computing and Programming](#distributed-computing-and-programming)
    - [Packaging, Provisioning and Monitoring](#packaging-provisioning-and-monitoring)
    - [Monitoring](#monitoring)
    - [Search](#search)
    - [Security](#security)
    - [Benchmark](#benchmark)
    - [Machine learning and Big Data analytics](#machine-learning-and-big-data-analytics)
    - [Misc.](#misc)
- [Resources](#resources)
    - [Websites](#websites)
    - [Presentations](#presentations)
    - [Books](#books)
    - [Hadoop and Big Data Events](#hadoop-and-big-data-events)
- [Other Awesome Lists](#other-awesome-lists)

## Hadoop

* [Apache Hadoop](http://hadoop.apache.org/) - Apache Hadoop
* [Apache Hadoop Ozone](http://hadoop.apache.org/ozone/) - An Object Store for Apache Hadoop
* [Apache Tez](http://tez.apache.org/) - A Framework for YARN-based, Data Processing Applications In Hadoop
* [SpatialHadoop](http://spatialhadoop.cs.umn.edu/) - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
* [GIS Tools for Hadoop](http://esri.github.io/gis-tools-for-hadoop/) - [Big Data](/@harrisonqian/awesome/wiki/big-data/big-data) Spatial [Analytics](/@harrisonqian/awesome/wiki/miscellaneous/analytics) for the Hadoop Framework
* [Elasticsearch Hadoop](https://github.com/elastic/elasticsearch-hadoop) - Elasticsearch real-time search and [analytics](/@harrisonqian/awesome/wiki/miscellaneous/analytics) natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
* [hadoopy](https://github.com/bwhite/hadoopy) - [Python](/@harrisonqian/awesome/wiki/programming-languages/python) MapReduce library written in Cython.
* [mrjob](https://github.com/Yelp/mrjob/) - mrjob is a [Python](/@harrisonqian/awesome/wiki/programming-languages/python) 2.5+ package that helps you write and run Hadoop [Streaming](/@harrisonqian/awesome/wiki/big-data/streaming) jobs.
* [pydoop](http://pydoop.sourceforge.net/) - Pydoop is a package that provides a [Python](/@harrisonqian/awesome/wiki/programming-languages/python) API for Hadoop.
* [hdfs-du](https://github.com/twitter/hdfs-du) - HDFS-DU is an interactive visualization of the Hadoop distributed file system.
* [White Elephant](https://github.com/linkedin/white-elephant) - Hadoop log aggregator and dashboard
* [Genie](https://github.com/Netflix/genie) - Genie provides [REST](/@harrisonqian/awesome/wiki/miscellaneous/rest)-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
* [Apache Kylin](http://kylin.incubator.apache.org/) - Apache Kylin is an open source Distributed [Analytics](/@harrisonqian/awesome/wiki/miscellaneous/analytics) Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets)
* [Crunch](https://github.com/jondot/crunch) - Go-based toolkit for ETL and feature extraction on Hadoop
* [Apache Ignite](http://ignite.apache.org/) - Distributed in-memory platform

## YARN

* [Apache Slider](http://slider.incubator.apache.org/) - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
* [Apache Twill](http://twill.incubator.apache.org/) - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
* [mpich2-yarn](https://github.com/alibaba/mpich2-yarn) - Running MPICH2 on Yarn

## NoSQL

*Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable.*

* [Apache HBase](http://hbase.apache.org) - Apache [HBase](/@harrisonqian/awesome/wiki/databases/hbase)
* [Apache Phoenix](http://phoenix.apache.org/) - A SQL skin over [HBase](/@harrisonqian/awesome/wiki/databases/hbase) supporting secondary indices
* [happybase](https://github.com/wbolster/happybase) - A developer-friendly [Python](/@harrisonqian/awesome/wiki/programming-languages/python) library to interact with Apache [HBase](/@harrisonqian/awesome/wiki/databases/hbase).
* [Hannibal](https://github.com/sentric/hannibal) - Hannibal is a tool to help monitor and maintain [HBase](/@harrisonqian/awesome/wiki/databases/hbase) clusters that are configured for manual splitting.
* [Haeinsa](https://github.com/VCNC/haeinsa) - Haeinsa is a linearly scalable multi-row, multi-table transaction library for [HBase](/@harrisonqian/awesome/wiki/databases/hbase)
* [hindex](https://github.com/Huawei-Hadoop/hindex) - Secondary Index for [HBase](/@harrisonqian/awesome/wiki/databases/hbase)
* [Apache Accumulo](https://accumulo.apache.org/) - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
* [OpenTSDB](http://opentsdb.net/) - The Scalable Time Series [Database](/@harrisonqian/awesome/wiki/databases/database)
* [Apache Cassandra](http://cassandra.apache.org/)

## SQL on Hadoop

*SQL on Hadoop*

* [Apache Hive](http://hive.apache.org) - The Apache Hive data warehouse software facilitates reading, writing, and managing large [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets) residing in distributed storage using SQL
* [Apache Phoenix](http://phoenix.apache.org) - A SQL skin over [HBase](/@harrisonqian/awesome/wiki/databases/hbase) supporting secondary indices
* [Apache HAWQ (incubating)](http://hawq.incubator.apache.org/) - Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of MPP [database](/@harrisonqian/awesome/wiki/databases/database) with the [scalability](/@harrisonqian/awesome/wiki/front-end-development/scalability) and convenience of Hadoop
* [Lingual](http://www.cascading.org/projects/lingual/) - SQL interface for Cascading (MR/Tez job generator)
* [Apache Impala](https://impala.apache.org/) - Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.
* [Presto](https://prestodb.io/) - Distributed SQL Query Engine for [Big Data](/@harrisonqian/awesome/wiki/big-data/big-data). Open sourced by Facebook.
* [Apache Tajo](http://tajo.apache.org/) - Data warehouse system for Apache Hadoop
* [Apache Drill](https://drill.apache.org/) - Schema-free SQL Query Engine
* [Apache Trafodion](http://trafodion.apache.org/)

## Data Management

* [Apache Calcite](http://calcite.apache.org/) - A Dynamic Data Management Framework
* [Apache Atlas](http://atlas.incubator.apache.org/) - Metadata tagging & lineage capture supporting complex business data taxonomies
* [Apache Kudu](https://kudu.apache.org/) - Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache [HBase](/@harrisonqian/awesome/wiki/databases/hbase).
* [Confluent Schema registry for Kafka](https://github.com/confluentinc/schema-registry) - Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving Avro schemas.
* [Hortonworks Schema Registry](https://github.com/hortonworks/registry) - Schema Registry is a framework to build metadata repositories.

## Workflow, Lifecycle and Governance

* [Apache Oozie](http://oozie.apache.org) - Apache Oozie
* [Azkaban](http://azkaban.github.io/)
* [Apache Falcon](http://falcon.apache.org/) - Data management and processing platform
* [Apache NiFi](http://nifi.apache.org/) - A dataflow system
* [Apache Airflow](https://github.com/apache/incubator-airflow) - Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines
* [Luigi](http://luigi.readthedocs.org/en/latest/) - [Python](/@harrisonqian/awesome/wiki/programming-languages/python) package that helps you build complex pipelines of batch jobs

## Data Ingestion and Integration

* [Apache Flume](http://flume.apache.org) - Apache Flume
* [Suro](https://github.com/Netflix/suro) - Netflix's distributed Data Pipeline
* [Apache Sqoop](http://sqoop.apache.org) - Apache Sqoop
* [Apache Kafka](http://kafka.apache.org/) - Apache Kafka
* [Gobblin from LinkedIn](https://github.com/linkedin/gobblin) - Universal data ingestion framework for Hadoop

## DSL

* [Apache Pig](http://pig.apache.org) - Apache Pig
* [Apache DataFu](http://datafu.incubator.apache.org/) - A collection of libraries for working with large-scale data in Hadoop
* [varaha](https://github.com/thedatachef/varaha) - [Machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning) and natural language processing with Apache Pig
* [packetpig](https://github.com/packetloop/packetpig) - Open Source [Big Data](/@harrisonqian/awesome/wiki/big-data/big-data) [Security](/@harrisonqian/awesome/wiki/security/security) [Analytics](/@harrisonqian/awesome/wiki/miscellaneous/analytics)
* [akela](https://github.com/mozilla-metrics/akela) - Mozilla's utility library for Hadoop, [HBase](/@harrisonqian/awesome/wiki/databases/hbase), Pig, etc.
* [seqpig](http://seqpig.sourceforge.net/) - Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop
* [Lipstick](https://github.com/Netflix/Lipstick) - Pig workflow visualization tool. [Introducing Lipstick on A(pache) Pig](http://techblog.netflix.com/2013/06/introducing-lipstick-on-apache-pig.html)
* [PigPen](https://github.com/Netflix/PigPen) - PigPen is map-reduce for [Clojure](/@harrisonqian/awesome/wiki/programming-languages/clojure), or distributed [Clojure](/@harrisonqian/awesome/wiki/programming-languages/clojure). It compiles to Apache Pig, but you don't need to know much about Pig to use it.

## Libraries and Tools

* [Kite Software Development Kit](http://kitesdk.org/) - A set of libraries, tools, examples, and documentation
* [gohadoop](https://github.com/hortonworks/gohadoop) - Native Go clients for Apache Hadoop YARN.
* [Hue](http://gethue.com/) - A Web interface for analyzing data with Apache Hadoop.
* [Apache Zeppelin](https://zeppelin.incubator.apache.org/) - A web-based notebook that enables interactive data [analytics](/@harrisonqian/awesome/wiki/miscellaneous/analytics)
* [Apache Thrift](http://thrift.apache.org/)
* [Apache Avro](http://avro.apache.org/) - Apache Avro is a data serialization system.
* [Elephant Bird](https://github.com/twitter/elephant-bird) - Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and [HBase](/@harrisonqian/awesome/wiki/databases/hbase) code.
* [Spring for Apache Hadoop](http://projects.spring.io/spring-hadoop/)
* [hdfs](https://github.com/colinmarc/hdfs) - A native Go client for HDFS
* [Oozie Eclipse Plugin](https://marketplace.eclipse.org/content/oozie-eclipse-plugin) - A graphical editor for editing Apache Oozie workflows inside Eclipse.
* [snakebite](https://pypi.python.org/pypi/snakebite/) - A pure [Python](/@harrisonqian/awesome/wiki/programming-languages/python) HDFS client
* [Apache Parquet](https://parquet.apache.org/) - Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
* [Apache Superset (incubating)](https://superset.incubator.apache.org/) - Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application
* [Schema Registry UI](https://github.com/Landoop/schema-registry-ui) - Web tool for the Confluent Schema Registry in order to create / view / search / evolve / view history & configure Avro schemas of your Kafka cluster.

## Realtime Data Processing

* [Apache Storm](http://storm.apache.org/)
* [Apache Samza](http://samza.apache.org/)
* [Apache Spark](http://spark.apache.org/streaming/)
* [Apache Flink](https://flink.apache.org) - Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly once stream processing.
* [Apache Pulsar (incubating)](http://pulsar.incubator.apache.org/) - Apache Pulsar (incubating) is a highly scalable, low latency messaging platform running on commodity hardware. It provides simple pub-sub semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication.
* [Apache Druid (incubating)](http://druid.incubator.apache.org/) - A high-performance, column-oriented, distributed data store.

## Distributed Computing and Programming

* [Apache Spark](http://spark.apache.org/)
* [Spark Packages](http://spark-packages.org/) - A community index of packages for [Apache Spark](/@harrisonqian/awesome/wiki/big-data/apache-spark)
* [SparkHub](https://sparkhub.databricks.com/) - A community site for [Apache Spark](/@harrisonqian/awesome/wiki/big-data/apache-spark)
* [Apache Crunch](http://crunch.apache.org)
* [Cascading](http://www.cascading.org/) - Cascading is the proven application development platform for building data applications on Hadoop.
* [Apache Flink](http://flink.apache.org/) - Apache Flink is a platform for efficient, distributed, general-purpose data processing.
* [Apache Apex (incubating)](http://apex.incubator.apache.org/) - Enterprise-grade unified stream and batch processing engine.
* [Apache Livy (incubating)](https://livy.incubator.apache.org/) - Apache Livy (incubating) is a web service that exposes a [REST](/@harrisonqian/awesome/wiki/miscellaneous/rest) interface for managing long-running [Apache Spark](/@harrisonqian/awesome/wiki/big-data/apache-spark) contexts in your cluster. With Livy, new applications can be built on top of [Apache Spark](/@harrisonqian/awesome/wiki/big-data/apache-spark) that require fine-grained interaction with many Spark contexts.

## Packaging, Provisioning and Monitoring

* [Apache Bigtop](http://bigtop.apache.org/) - Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem
* [Apache Ambari](http://ambari.apache.org/) - Apache Ambari
* [Ganglia Monitoring System](http://ganglia.sourceforge.net/)
* [ankush](https://github.com/impetus-opensource/ankush) - A [big data](/@harrisonqian/awesome/wiki/big-data/big-data) cluster management tool that creates and manages clusters of different technologies.
* [Apache ZooKeeper](http://zookeeper.apache.org/) - Apache ZooKeeper
* [Apache Curator](http://curator.apache.org/) - ZooKeeper client wrapper and rich ZooKeeper framework
* [inviso](https://github.com/Netflix/inviso) - Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.
* [Logit.io](https://logit.io/) - Send logs from Hadoop to Elasticsearch for monitoring and alerting.

## Search

* [Elasticsearch](https://www.elastic.co/)
* [Apache Solr](http://lucene.apache.org/solr/) - Apache Solr is an open source search platform built upon a [Java](/@harrisonqian/awesome/wiki/programming-languages/java) library called Lucene.
* [Banana](https://github.com/LucidWorks/banana) - Kibana port for Apache Solr

## Search Engine Framework

* [Apache Nutch](http://nutch.apache.org/) - Apache Nutch is a highly extensible and scalable open source web crawler software project.

## Security

* [Apache Ranger](http://ranger.incubator.apache.org/) - Ranger is a framework to enable, monitor and manage comprehensive data [security](/@harrisonqian/awesome/wiki/security/security) across the Hadoop platform.
* [Apache Sentry](https://sentry.incubator.apache.org/) - An authorization module for Hadoop
* [Apache Knox Gateway](https://knox.apache.org/) - A [REST](/@harrisonqian/awesome/wiki/miscellaneous/rest) API Gateway for interacting with Hadoop clusters.

## Benchmark

* [Big Data Benchmark](https://amplab.cs.berkeley.edu/benchmark/)
* [HiBench](https://github.com/intel-hadoop/HiBench)
* [YCSB](https://github.com/brianfrankcooper/YCSB) - The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL [database](/@harrisonqian/awesome/wiki/databases/database) management systems.

## Machine learning and Big Data analytics

* [Apache Mahout](http://mahout.apache.org)
* [Oryx 2](https://github.com/OryxProject/oryx) - Lambda architecture on Spark, Kafka for real-time large scale [machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning)
* [MLlib](https://spark.apache.org/mllib/) - MLlib is [Apache Spark](/@harrisonqian/awesome/wiki/big-data/apache-spark)'s scalable [machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning) library.
* [R](http://www.r-project.org/) - R is a [free software](/@harrisonqian/awesome/wiki/miscellaneous/free-software) environment for statistical computing and graphics.
* [RHadoop](https://github.com/RevolutionAnalytics/RHadoop/wiki) including RHDFS, RHBase, RMR2, plyrmr
* [Apache Lens](http://lens.apache.org/)
* [Apache SINGA (incubating)](https://singa.incubator.apache.org/) - SINGA is a general distributed [deep learning](/@harrisonqian/awesome/wiki/computer-science/deep-learning) platform for training big deep [learning](/@harrisonqian/awesome/wiki/programming-languages/learning) models over large [datasets](/@harrisonqian/awesome/wiki/miscellaneous/datasets)
* [BigDL](https://bigdl-project.github.io/) - BigDL is a distributed [deep learning](/@harrisonqian/awesome/wiki/computer-science/deep-learning) library for [Apache Spark](/@harrisonqian/awesome/wiki/big-data/apache-spark); with BigDL, users can write their deep [learning](/@harrisonqian/awesome/wiki/programming-languages/learning) applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters.
* [Apache Hivemall (incubating)](http://hivemall.incubator.apache.org/) - Apache Hivemall is a scalable [machine learning](/@harrisonqian/awesome/wiki/computer-science/machine-learning) library that runs on Apache Hive, Spark and Pig.

## Misc.

* Hive Plugins
    * UDF
        * https://github.com/edwardcapriolo/hive_cassandra_udfs
        * https://github.com/livingsocial/HiveSwarm
        * https://github.com/ThinkBigAnalytics/Hive-Extensions-from-Think-Big-Analytics
        * https://github.com/twitter/elephant-bird - Twitter
        * https://github.com/lovelysystems/ls-hive
        * https://github.com/klout/brickhouse
    * Storage Handler
        * https://github.com/dvasilen/Hive-Cassandra
        * https://github.com/yc-huang/Hive-mongo
        * https://github.com/balshor/gdata-storagehandler
        * https://github.com/chimpler/hive-solr
        * https://github.com/bfemiano/accumulo-hive-storage-manager
    * Libraries and tools
        * https://github.com/forward3d/rbhive
        * https://github.com/synctree/activerecord-hive-adapter
        * https://github.com/hrp/sequel-hive-adapter
        * https://github.com/forward/node-hive
        * https://github.com/recruitcojp/WebHive
        * [shib](https://github.com/tagomoris/shib) - WebUI for query engines: Hive and Presto
        * https://github.com/dmorel/Thrift-API-HiveClient2 (Perl - HiveServer2)
        * [PyHive](https://github.com/dropbox/PyHive) - [Python](/@harrisonqian/awesome/wiki/programming-languages/python) interface to Hive and Presto
        * https://github.com/recruitcojp/OdbcHive
        * [HiveRunner](https://github.com/klarna/HiveRunner) - An Open Source unit test framework for Hadoop Hive queries based on JUnit4
        * [Beetest](https://github.com/kawaa/Beetest) - A super simple utility for [testing](/@harrisonqian/awesome/wiki/testing/testing) Apache Hive scripts locally for non-[Java](/@harrisonqian/awesome/wiki/programming-languages/java) developers.
        * [Hive_test](https://github.com/edwardcapriolo/hive_test) - Unit test framework for Hive and hive-service
* Flume Plugins
    * [Flume MongoDB Sink](https://github.com/leonlee/flume-ng-mongodb-sink)
    * [Flume RabbitMQ source and sink](https://github.com/jcustenborder/flume-ng-rabbitmq)
    * [Flume UDP Source](https://github.com/whitepages/flume-udp-source)
    * [.Net FlumeNG Clients](https://github.com/marksl/DotNetFlumeNG.Clients)

# Resources

Various resources, such as books, websites and articles.

## Websites

*Useful websites and articles*

* [Hadoop Weekly](http://www.hadoopweekly.com/)
* [The Hadoop Ecosystem Table](http://hadoopecosystemtable.github.io/)
* [Hadoop illuminated](http://hadoopilluminated.com/) - Open Source Hadoop Book
* [AWS BigData Blog](http://blogs.aws.amazon.com/bigdata/)
* [Hadoop360](http://www.hadoop360.com/)
* [How to monitor Hadoop metrics](https://www.datadoghq.com/blog/monitor-hadoop-metrics/)

## Presentations

* [Apache Hadoop In Theory And Practice](http://www.slideshare.net/AdamKawa/hadoop-intheoryandpractice)
* [Hadoop Operations at LinkedIn](http://www.slideshare.net/allenwittenauer/2013-hadoopsummitemea)
* [Hadoop Performance at LinkedIn](http://www.slideshare.net/allenwittenauer/2012-lihadoopperf)
* [Docker based Hadoop provisioning](http://www.slideshare.net/JanosMatyas/docker-based-hadoop-provisioning)

## Books

* [Hadoop: The Definitive Guide](http://www.amazon.com/gp/product/1449311520/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=1449311520&linkCode=as2&tag=matratsblo-20)
* [Hadoop Operations](http://www.amazon.com/gp/product/1449327052/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=1449327052&linkCode=as2&tag=matratsblo-20)
* [Apache Hadoop Yarn](http://www.amazon.com/dp/0321934504?tag=matratsblo-20)
* [HBase: The Definitive Guide](http://shop.oreilly.com/product/0636920014348.do)
* [Programming Pig](http://shop.oreilly.com/product/0636920018087.do)
* [Programming Hive](http://shop.oreilly.com/product/0636920023555.do)
* [Hadoop in Practice, Second Edition](http://www.manning.com/holmes2/)
* [Hadoop in Action, Second Edition](http://www.manning.com/lam2/)

## Hadoop and Big Data Events

* [ApacheCon](http://www.apachecon.com/)
* [Strata + Hadoop World](http://conferences.oreilly.com/strata)
* [DataWorks Summit](https://dataworkssummit.com/)
* [Spark Summit](https://databricks.com/sparkaisummit)

# Other Awesome Lists

Other amazingly awesome lists can be found in the [awesome-awesomeness](https://github.com/bayandin/awesome-awesomeness) and [awesome](https://github.com/sindresorhus/awesome) lists.