Apache Hadoop In Focus
Lately, Apache Hadoop has entered the lexicon of Web professionals thanks to the big data explosion. But talking about Hadoop can be tricky, and that’s without considering all of the associated technologies and architectural mumbo jumbo.
At the most basic level, Hadoop is an open source software framework built for the distributed processing of large data sets. In traditional non-distributed architectures, scaling meant adding more CPU and storage to a single system, but in Hadoop’s distributed architecture, both the data and the processing are spread across clusters of commodity servers that work in parallel. But that’s only part of the story when it comes to Hadoop.
Thanks to the Hadoop framework, there are now a variety of products that can help companies mine and use big data — something your enterprise will likely do more of in the future. But first, you will need to know more about how Hadoop works.
The core of Apache Hadoop consists primarily of two sub-projects: Hadoop MapReduce (a parallel-processing engine) and the Hadoop Distributed File System, or HDFS (which makes it possible to scale across servers and store data on compute nodes).
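To make the MapReduce idea concrete, here is a minimal sketch of the classic word-count pattern in plain Python. It does not use Hadoop itself: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster, each line (or block of lines) could be mapped on a
    # different node; this sketch simply runs the steps one after another.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "Hadoop processes big data"]
    print(word_count(sample))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` depends only on its own key, the work can be divided among many servers, which is the property that lets a framework of this kind scale.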
While MapReduce and HDFS are unquestionably the most important Hadoop-related projects from Apache, there are others. The most notable are the query languages Hive and Pig. The SQL-like Hive acts as a data warehouse infrastructure that allows for data summarization and ad hoc querying, while Pig is a data flow language and execution framework for parallel computation. Both make it possible to process a lot of data without having to write MapReduce code.
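As a rough illustration of what a SQL-like layer saves, consider a summary that could be stated in one line, along the lines of `SELECT page, SUM(bytes_sent) FROM page_views GROUP BY page`. The sketch below is the hand-written equivalent in plain Python. The table name, column names, and sample rows are hypothetical, and the code only mimics the grouping logic; it is not Hive, Pig, or Hadoop code.

```python
from collections import defaultdict

# Hypothetical rows from an imaginary "page_views" table: (page, bytes_sent).
PAGE_VIEWS = [
    ("/home", 120),
    ("/pricing", 300),
    ("/home", 80),
]


def total_bytes_by_page(rows):
    """Hand-written equivalent of a GROUP BY page / SUM(bytes_sent) summary."""
    totals = defaultdict(int)
    for page, bytes_sent in rows:
        # Pick the grouping key (page) and fold in the value that shares it.
        totals[page] += bytes_sent
    return dict(totals)


if __name__ == "__main__":
    print(total_bytes_by_page(PAGE_VIEWS))
```

Even for a summary this small, the query states what is wanted while the code spells out how to compute it. That gap widens quickly once the work has to be split across many machines.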
To cover the varied tasks that different organizations require, there are many other Apache-endorsed products and projects for Hadoop, including Flume, ZooKeeper, Oozie, and HBase. Learn more about these projects by visiting Website Magazine at http://wsm.co/MWB6NO.
There are many Apache-created and third-party distribution and management software offerings that make Hadoop easier to deploy and run. As the framework grows in popularity, more of these systems are emerging. Cloudera, Hortonworks, and MapR, for example, each have their own distribution products, while others, including Platform Computing and Zettaset, offer software for managing Hadoop clusters (most of which is distribution agnostic).
You may have a better idea of what Hadoop is, what it’s made of and some of the software and service options that developers build around it, but what is the point of Apache Hadoop, and what can it do for your business?
If you’ve read anything on big data in the past, you’ll know that by collecting, organizing and studying it, you can “find insights, discover emerging data types and spot trends” that can be used to assess the real business value of these hefty data sets, and then turn these large volumes of information into actionable resources.
How does Hadoop help? Well, when you’re interested in looking at this data in a way that is both deep and computationally intensive (such as clustering or targeting), Hadoop offers a scalable, flexible, fault-tolerant and cost-effective way to do so. That is why it is the basis of numerous third-party application software products like IBM InfoSphere BigInsights and HStreaming.
In short, Apache Hadoop offers a framework for structuring and studying a significant amount of data that may or may not be well-organized, and can be built upon to provide a variety of products to analyze and leverage big data in different ways depending on the needs of the business.
Hadoop in the Future
At the moment, Apache Hadoop is most useful for companies with highly sophisticated IT infrastructures, far more sophisticated than that of the average relational database adopter, largely because there are very few applications that can simply be opened and run on a Hadoop cluster. And, of course, these larger businesses are more likely to have massive amounts of data on hand to be leveraged.
However, as big data becomes an increasingly common issue for all businesses, expect to see more shrink-wrapped Hadoop applications that can be quickly and easily installed and used by companies of all shapes and sizes.
Who is Using Hadoop?
There are already plenty of companies using Apache Hadoop every day to analyze and make sense of all of their data, and in most cases it has been a smashing success. If you’re only somewhat familiar with Hadoop, head over to Website Magazine online to get a quick look at five companies, including major names like Amazon and Facebook, that are using this fast-emerging technology. You can find them at