This lack of down-to-Earth, nitty-gritty detail on how Apache Hadoop works and who uses it
can make the platform seem elusive, but that impression couldn't be further from
the truth. Hadoop isn't the exclusive Web technology of the digital gods atop Mt. Olympus;
in fact, many major 'Net names are using it right now, and have been for some time.
Let’s take a look at how some of today’s top Web companies are
utilizing Apache Hadoop:
Amazon Web Services
Amazon Web Services, the leader in cloud computing, and Hadoop appear to be a natural
fit. AWS customers are able to use Hadoop with their
Elastic Compute Cloud (EC2) and/or Simple Storage Service (S3) accounts.
AWS offers Amazon Elastic
MapReduce, which automates the provisioning of Hadoop clusters, running and
termination of jobs, and handling of data transfers between EC2 and S3. Plus,
customers can access Apache Hive, a data warehouse infrastructure built on
Hadoop for data summarization, query, and analysis. Many of the Web businesses
that use Hadoop do so by going through Amazon Web Services, as products like
Elastic MapReduce simplify the management of large Hadoop clusters and allow companies
to immediately begin analyzing and utilizing their big data sets.
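To give a rough sense of what Elastic MapReduce automates, here is a minimal sketch using the boto3 Python SDK that provisions a small Hadoop cluster with Hive installed, runs a single step, and lets the cluster terminate when the step finishes. The bucket names, log path, and step arguments are hypothetical placeholders for illustration, not anything prescribed by AWS.

    import boto3

    # Minimal sketch: provision a small EMR (Hadoop) cluster, run one step,
    # and shut it down automatically. Bucket and step details are hypothetical.
    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="example-hadoop-cluster",                # hypothetical name
        LogUri="s3://example-bucket/emr-logs/",       # hypothetical bucket
        ReleaseLabel="emr-6.15.0",
        Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,     # terminate when the work is done
        },
        Steps=[{
            "Name": "example-hive-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script",
                         "--args", "-f", "s3://example-bucket/query.hql"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )

    print("Started cluster:", response["JobFlowId"])

In other words, the provisioning, job execution, and teardown that would otherwise require hands-on cluster administration collapse into a single API call, with input and output living in S3.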
Facebook
With over 900 million users, Facebook is far and away the world's largest
social network, so it obviously has a massive amount of data to store. It makes
sense, then, that it would turn to Hadoop to help manage all of that
information. The social network stores over 100 petabytes (PB) of data in a
single HDFS filesystem across more than 100 different clusters.
It turns out that Facebook is so
committed to the Hadoop platform that its engineering team worked to
solve the biggest Hadoop problem to date: the NameNode metadata
service runs on a single node, so any time it goes down, nothing on HDFS works
properly. Facebook's answer was AvatarNode, which establishes a two-node
NameNode with manual failover, and the company has even open-sourced AvatarNode back into
the Hadoop community.
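To make the single-NameNode issue concrete, the short sketch below uses the third-party Python hdfs (WebHDFS) client to perform a couple of ordinary filesystem operations. Every such call is answered by the NameNode, so if that one node is down, the requests simply fail. The host, port, and paths are hypothetical, and this illustrates HDFS access in general rather than Facebook's AvatarNode itself.

    from hdfs import InsecureClient  # third-party "hdfs" (WebHDFS) package

    # Hypothetical NameNode address; WebHDFS routes filesystem metadata
    # requests (listings, file creation, block lookups) through the NameNode.
    client = InsecureClient("http://namenode.example.com:9870", user="hdfs")

    # Both calls below depend on the NameNode. With a single NameNode and no
    # failover, a NameNode outage means these raise connection errors and
    # nothing on HDFS can be read or written properly.
    print(client.list("/user/hdfs"))  # directory listing

    with client.write("/user/hdfs/example.txt", encoding="utf-8", overwrite=True) as writer:
        writer.write("hello from a hypothetical client\n")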
Foursquare
Location-based social network Foursquare
boasts over 10 million users who check in to businesses and share their
locations with friends using the company's smartphone app. With so much data
coming and going throughout the day, the company needed a way to leverage big
data analytics that would be efficient, cost-effective, and flexible. This led
Foursquare to Amazon Elastic MapReduce.
Foursquare utilizes the open
source Apache Flume to send hundreds of millions of application logs into the
S3 storage cloud every day, and then uses Elastic MapReduce to analyze them,
gleaning insights into how new features are being used and feeding machine learning,
exploratory analysis, customer usage reporting, and long-term trend analysis. Accessing Hadoop
through Amazon Elastic MapReduce was a logical decision for Foursquare: it
was already using Amazon S3, and it could do away with the difficulty
and wasted time of managing a Hadoop cluster all on its own.
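To give a feel for the kind of log analysis a setup like this supports, here is a minimal Hadoop Streaming job in Python that counts how often each application feature appears in log lines. The log format (a tab-separated feature name in the second field) is an invented assumption rather than Foursquare's actual schema, and such a script would typically be submitted as an Elastic MapReduce streaming step reading from and writing to S3.

    #!/usr/bin/env python3
    """Hadoop Streaming sketch: count feature usage in application logs.

    Run as:  python3 feature_count.py map     (mapper phase)
             python3 feature_count.py reduce  (reducer phase)
    Assumes each log line is tab-separated with a feature name in field 2;
    that format is hypothetical.
    """
    import sys

    def mapper():
        # Emit "feature <TAB> 1" for every log line.
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                print(f"{fields[1]}\t1")

    def reducer():
        # Hadoop sorts mapper output by key, so identical features arrive together.
        current, count = None, 0
        for line in sys.stdin:
            feature, _, value = line.rstrip("\n").partition("\t")
            if feature != current:
                if current is not None:
                    print(f"{current}\t{count}")
                current, count = feature, 0
            count += int(value or 1)
        if current is not None:
            print(f"{current}\t{count}")

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()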
Yahoo!
When Yahoo! first launched its Hadoop production application in 2008, it was one
of the platform's largest (and most notable) users. Since then, Yahoo! has
continued to be one of Hadoop's biggest supporters; its Search Webmap
application runs on a Linux cluster of more than 10,000 cores and produces data used in
every Yahoo! search query, and over time, the company, along with Hortonworks, has
contributed over 80 percent of Hadoop's code.
The company runs Hadoop on 42,000
servers (or 1,200 racks) in four different data centers. The largest current
cluster is 4,000 nodes, and will increase to 10,000 when Hadoop 2.0 is released.
Mostly, Yahoo! uses the platform to block email spam, and to store user data to
provide personalized experiences to its visitors based on their previous visits.
It also uses Hadoop to provide demographic data to advertisers that allows them
to adjust their ads so that they appeal to the widest possible audience,
leading to an improved ROI (in theory).
StumbleUpon
With a comprehensive collection of
personalized data on every one of its users and their online actions,
StumbleUpon breeds big data by its very nature. Because of this, it has
turned to Hadoop to make sense of all of that information, largely through the
use of HBase, an open-source distributed database that runs on Hadoop.
Using HBase allows StumbleUpon to manage
all of its user signals, such as every thumb-up, stumble, and share that takes
place. StumbleUpon stores this data to make more educated decisions
about which page to show next for each individual user. Since these decisions
often need to be made quickly, it is imperative that the data be
organized safely, retrieved quickly, deleted occasionally, and refreshed often;
HBase gives StumbleUpon a cost-effective, dependable way to do exactly that,
with rapid data retrieval.
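As a rough sketch of how per-user signals might be written to and read back from HBase, the snippet below uses the third-party happybase Python library. The gateway host, table name, column family, and row-key layout are invented for illustration and are not StumbleUpon's actual schema.

    import happybase

    # Hypothetical HBase Thrift gateway; table and column names are made up.
    connection = happybase.Connection("hbase.example.com")
    signals = connection.table("user_signals")

    # Store one signal per row, keyed by user and timestamp so a user's
    # recent activity can be scanned quickly.
    signals.put(b"user42#20240101120000", {
        b"signal:type": b"thumb-up",
        b"signal:url": b"http://example.com/some-page",
    })

    # Read back everything for user42 to inform the next recommendation.
    for key, data in signals.scan(row_prefix=b"user42#"):
        print(key, data)

Keying rows by user and timestamp is one common HBase design choice: it keeps each user's signals physically adjacent, so the "what should this user see next" lookup becomes a single, fast prefix scan.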