Hadoop Projects of Note

The beauty of an open source platform like Apache Hadoop is that it can always be altered, built upon, and expanded to make it beneficial and productive to all of its users.

As the premier, go-to software for the distributed processing of big data, Hadoop and its related projects are built through the collaborative efforts of developers all over the world. Some of these Hadoop products come from third-party software providers, but there are many other Apache-endorsed products and projects that each offer businesses unique functionality and flexibility to perform the tasks that they need from Hadoop.

Here's a look at some of the most notable Hadoop products backed by Apache:

Hadoop MapReduce:

One of the two primary components of Hadoop, MapReduce is a programming model and software framework for writing applications that rapidly process massive amounts of data in parallel on large clusters of compute nodes.
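
To make the programming model concrete, here is the classic word-count job sketched against Hadoop's org.apache.hadoop.mapreduce API; class names and input/output paths are illustrative:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combine locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```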

Hadoop Distributed File System (HDFS):

As the primary storage system of Hadoop applications, HDFS makes up the other major half of the Hadoop framework. The filesystem creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster for rapid, reliable computations.
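
To give a sense of how applications interact with HDFS, here is a minimal sketch using Hadoop's org.apache.hadoop.fs.FileSystem API; the NameNode address, file path, and replication factor are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder NameNode address

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/example.txt");

    // Write a file; HDFS splits it into blocks and replicates each block across the cluster.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello hdfs");
    }

    // Ask for three replicas of every block of this file (subject to cluster limits).
    fs.setReplication(file, (short) 3);

    FileStatus status = fs.getFileStatus(file);
    System.out.println("replication = " + status.getReplication()
        + ", block size = " + status.getBlockSize());

    fs.close();
  }
}
```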

Hive:

Hive is a widely used data warehouse system that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in a Hadoop-compatible filesystem. The system offers a mechanism for the projection of structure onto this stored data through the use of an SQL-like language, HiveQL, which also allows map/reduce programmers to insert custom mappers and reducers when it doesn't make sense to express logic using HiveQL.
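
As one hedged illustration, a Java program can submit HiveQL through Hive's JDBC driver (HiveServer2); the host, table, and queries below are invented for the example:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // HiveServer2 JDBC URL; host, port, and database are placeholders.
    String url = "jdbc:hive2://hiveserver:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement()) {

      // Project structure onto data already sitting in the filesystem.
      stmt.execute("CREATE TABLE IF NOT EXISTS page_views (user_id STRING, url STRING, ts BIGINT) "
          + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

      // An ad-hoc HiveQL query; Hive compiles it into batch jobs over the cluster.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url ORDER BY hits DESC LIMIT 10")) {
        while (rs.next()) {
          System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
        }
      }
    }
  }
}
```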

Pig:

This platform was built to analyze big data sets and consists of a high-level language for expressing data analysis programs, coupled with an infrastructure for evaluating those programs. Because Pig programs have a structure that is open to substantial parallelization, they are able to handle very large datasets. That infrastructure is made up of a compiler that produces sequences of map/reduce programs, as well as a language layer, known as Pig Latin (har har), that was created for ease of programming, optimization opportunities, and extensibility.
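
As a rough sketch of what Pig Latin looks like, the statements below are submitted through Pig's embedded PigServer Java API; the input file, schema, and aliases are invented for illustration:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLatinExample {
  public static void main(String[] args) throws Exception {
    // Run Pig in local mode for illustration; ExecType.MAPREDUCE targets a Hadoop cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Pig Latin statements; input path and schema are placeholders.
    pig.registerQuery("logs = LOAD 'access_log.txt' USING PigStorage('\\t') "
        + "AS (user:chararray, url:chararray, bytes:long);");
    pig.registerQuery("by_url = GROUP logs BY url;");
    pig.registerQuery("totals = FOREACH by_url GENERATE group AS url, SUM(logs.bytes) AS total_bytes;");

    // STORE triggers the compiler to produce and run the underlying map/reduce jobs.
    pig.store("totals", "url_totals");
    pig.shutdown();
  }
}
```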

Flume:

Created to deliver data from applications to HDFS, Flume is a distributed, reliable, and available service for the efficient collection, aggregation, and movement of large amounts of log data. It comes with a simple, flexible architecture based on streaming data flows.
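
Flume agents are typically wired together in a simple properties file; the sketch below (agent name, log path, and HDFS path are placeholders) shows one source feeding a memory channel that drains into an HDFS sink:

```
# One agent ("a1") with a source, a memory channel, and an HDFS sink.
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: follow an application log file.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write the events into HDFS.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/app-logs
a1.sinks.k1.hdfs.fileType = DataStream

# Wire the pieces together.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```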

ZooKeeper:

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and offering group services, all of which are used in some form by distributed applications.
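
A minimal sketch of the ZooKeeper Java client storing and reading back a piece of shared configuration; the ensemble addresses, znode path, and data are placeholders:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
  public static void main(String[] args) throws Exception {
    // Connect to a ZooKeeper ensemble; host list and session timeout are placeholders.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000,
        event -> System.out.println("event: " + event));

    // Store a small piece of shared configuration under a znode.
    String path = "/app-config";
    if (zk.exists(path, false) == null) {
      zk.create(path, "jdbc:mysql://db:3306/app".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Read it back, setting a watch so the client hears about future changes.
    byte[] data = zk.getData(path, true, null);
    System.out.println("config = " + new String(data));

    zk.close();
  }
}
```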

Oozie:

Managing Hadoop jobs is so much easier thanks to Oozie, a workflow/coordination system from Apache. Workflow jobs are Directed Acyclic Graphs (DAGs) of actions, while Coordinator jobs are recurrent Workflow jobs triggered by time and data availability. Oozie, a scalable, reliable, and extensible system, is integrated with the Hadoop stack and supports several Hadoop jobs right out of the box, including Java map/reduce, Streaming map/reduce, Pig, and more.
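
A hedged sketch of submitting a workflow with Oozie's Java client; the server URL, HDFS application path, and properties are placeholders, and the workflow.xml itself is assumed to already exist in HDFS:

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieExample {
  public static void main(String[] args) throws Exception {
    // Point the client at the Oozie server; URL is a placeholder.
    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

    // Job properties: where the workflow.xml lives in HDFS, plus any parameters it uses.
    Properties props = oozie.createConfiguration();
    props.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/alice/wordcount-wf");
    props.setProperty("nameNode", "hdfs://namenode:8020");
    props.setProperty("jobTracker", "resourcemanager:8032");

    // Submit and start the workflow (a DAG of actions defined in workflow.xml).
    String jobId = oozie.run(props);
    System.out.println("submitted workflow " + jobId);

    // Poll until the workflow finishes.
    while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
      Thread.sleep(10_000);
    }
    System.out.println("final status: " + oozie.getJobInfo(jobId).getStatus());
  }
}
```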

HBase:

HBase is the distributed, scalable Hadoop database that hosts very large tables atop clusters of commodity hardware.
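
A minimal sketch of the HBase Java client writing and reading a single cell; it assumes a table named "users" with a column family "info" already exists, and the row key and values are made up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath for cluster/ZooKeeper settings.
    Configuration conf = HBaseConfiguration.create();

    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write one cell: row key "user42", column family "info", qualifier "email".
      Put put = new Put(Bytes.toBytes("user42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
          Bytes.toBytes("user42@example.com"));
      table.put(put);

      // Read it back by row key.
      Result result = table.get(new Get(Bytes.toBytes("user42")));
      byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
      System.out.println("email = " + Bytes.toString(email));
    }
  }
}
```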

Mahout:

This is a library for machine learning and data mining meant to scale to "reasonably large" data sets. Its core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Hadoop using the map/reduce paradigm.
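
For a flavor of the collaborative-filtering side, here is a sketch using Mahout's in-memory Taste recommender API (the distributed counterparts of these algorithms run as map/reduce jobs over data in HDFS); the ratings file and parameters are placeholders:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutExample {
  public static void main(String[] args) throws Exception {
    // ratings.csv: lines of userID,itemID,preference
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // User-based collaborative filtering: find similar users, then the items they liked.
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 recommendations for user 1.
    List<RecommendedItem> recommendations = recommender.recommend(1L, 3);
    for (RecommendedItem item : recommendations) {
      System.out.println(item.getItemID() + " -> " + item.getValue());
    }
  }
}
```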

Chukwa:

Chukwa is an open source data collection system for monitoring large distributed systems. Built atop HDFS and the MapReduce framework for scalability and robustness, it includes a flexible, powerful toolkit for displaying, monitoring, and analyzing results so users can make the most of collected data.

Avro:

Avro is a data serialization system.
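
A small hedged sketch of what that means in practice, using Avro's generic Java API; the schema and field values are invented for the example:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroExample {
  public static void main(String[] args) throws Exception {
    // Schemas are plain JSON; this one describes a two-field "User" record.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Ada");
    user.put("age", 36);

    // Serialize to Avro's compact binary encoding.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
    encoder.flush();

    // Deserialize it again using the same schema.
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord decoded = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    System.out.println(decoded.get("name") + ", age " + decoded.get("age"));
  }
}
```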

Cassandra:

This scalable, multi-master database has no single points of failure and offers high availability without compromising performance. It features linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure, which makes it an ideal system for mission-critical data.
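
As one illustration (assuming the DataStax Java driver, a common Cassandra client), the sketch below creates a keyspace with a replication factor, writes a row, and reads it back; the contact point, keyspace, and table are placeholders:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
  public static void main(String[] args) {
    // Contact point and keyspace settings are placeholders.
    try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
         Session session = cluster.connect()) {

      // The replication strategy and factor determine how many nodes hold each row.
      session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
          + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");
      session.execute("CREATE TABLE IF NOT EXISTS demo.users "
          + "(user_id text PRIMARY KEY, email text)");

      session.execute("INSERT INTO demo.users (user_id, email) "
          + "VALUES ('user42', 'user42@example.com')");

      ResultSet rs = session.execute("SELECT email FROM demo.users WHERE user_id = 'user42'");
      Row row = rs.one();
      System.out.println("email = " + (row == null ? "not found" : row.getString("email")));
    }
  }
}
```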