Who's Using Hadoop?

This lack of down-to-Earth, nitty gritty details on how Apache Hadoop works and who uses it may make it seem more elusive than it really is, but that couldn't be further from the truth. Hadoop isn't the exclusive Web technology of the digital gods atop Mt. Olympus; in fact, many major 'Net names are using it right now, and have been for some time.

Let's take a look at how some of today's top Web companies are utilizing Apache Hadoop:

Amazon Web Services

The leader in cloud computing and Hadoop appear to be a natural fit together. AWS customers are able to use Hadoop with their Elastic Compute Cloud (EC2) and/or Simple Storage Service (S3) accounts.

AWS offers Amazon Elastic MapReduce, which automates the provisioning of Hadoop clusters, running and termination of jobs, and handling of data transfers between EC2 and S3. Plus, customers can access Apache Hive, a data warehouse infrastructure built on Hadoop for data summarization, query, and analysis. Many of the Web businesses that use Hadoop do so by going through Amazon Web Services, as products like Elastic MapReduce simplify the management of large Hadoop clusters and allow companies to immediately begin analyzing and utilizing their big data sets.

Facebook

With over 900 million users, Facebook is far and away the world's largest social network, so it obviously has a massive amount of data to store. It makes sense, then, that it would turn to Hadoop to help manage all of that information. The social network stores over 100 petabytes (PB) of data in a single HDFS filesystem in over 100 different clusters.

It turns out that Facebook is so committed to the Hadoop platform that its engineering team actually worked to solve the biggest Hadoop problem to date - the fact that the NameNode metadata service runs on a single node, meaning anytime it goes down, nothing on HDFS runs properly - with the creation of AvatarNode, which establishes a two-node NameNode with manual failover. Facebook has even open sourced AvatarNode back into the Hadoop community.

Foursquare

Location-based social network Foursquare boasts over 10 million users that check into businesses and share their locations with friends using the company's smartphone app. With so much data coming and going throughout the day, the company needed a way to leverage big data analytics that would be efficient, cost-effective, and flexible. This led Foursquare to Amazon Elastic MapReduce.

Foursquare utilizes the open source Apache Flume to send hundreds of millions of application logs into the S3 storage cloud every day, and then uses Elastic MapReduce to analyze them, gleaning insights on how new features are being used, machine learning, exploratory analysis, customer usage, and long-term trends. Accessing Hadoop through Amazon Elastic MapReduce was a logical decision for Foursquare, as it was already employing Amazon S3 and could do away with the difficult management and wasted time spent trying to manage a Hadoop cluster all on its own.

Yahoo!

When Yahoo! first launched its Hadoop production application in 2008, it was one of the platform's largest (and most notable) users. Since then, Yahoo! has continued to be one of Hadoop's biggest supporters; its Search Webmap application runs on over 10,000 core Linux clusters and produces data used in every Yahoo! search query, and over time, the company has contributed over 80 percent of code to Hadoop along with Hortonworks.

The company runs Hadoop on 42,000 servers (or 1200 racks) in four different data centers. The largest current cluster is 4000 nodes, and will increase to 10,000 when Hadoop 2.0 is released. Mostly, Yahoo! uses the platform to block email spam, and to store user data to provide personalized experiences to its visitors based on their previous visits. It also uses Hadoop to provide demographic data to advertisers that allows them to adjust their ads so that they appeal to the widest possible audience, leading to an improved ROI (in theory).

StumbleUpon

With a comprehensive collection of personalized data on every single one of its users and their online actions, StumbleUpon, by its very nature, breeds big data. Because of this, it has turned to Hadoop to make sense of all of that information, largely through the use of HBase, an open-source distributed database that runs on Hadoop.

Using HBase allows StumbleUpon to manage all of its user signals, such as every thumb-up, stumble, and share that takes place. StumbleUpon is able to store this data to make more educated decisions about what page to show next for every individual user. However, since these decisions need to be made quickly in many situations, it is imperative that it is organized safely, retrieved quickly, deleted occasionally, and refreshed often, and HBase offers a method that is cost-effective, dependable, and capable of rapidly retrieving data.