“Big data” is the gathering, manipulation, analysis, and reporting of data from one or more data sets too large to be managed by traditional means, and it has had a big problem: a single computer, even a high-end physical or virtual server with multiple CPU cores, cannot process that much data efficiently. The work is far better divided among several computers or servers operating in parallel.
Many Hands = Light Work
Think about the old saying, “many hands make light work.” It refers to the idea that if everyone contributes some effort to completing a task, it won’t be an overwhelming burden for anyone and the task will be completed much faster. It’s a nice idea, but it requires some coordination—someone needs to understand the overall task, determine how to divide the work up into logical subtasks, and dole the subtasks out according to the skills and capacity of each contributor. For more complex tasks, this coordinator also must ensure that work is done in parallel as much as possible, rather than in a linear fashion that can introduce bottlenecks and inefficiencies.
When it comes to “big data” jobs, the task of dividing up the data and the computational work among multiple computers, and keeping everything flowing smoothly, is often left to a software framework called Apache Hadoop.
Hadoop: The Elephant in the Room
Hadoop (named for a toy elephant owned by a child of one of the founding developers) is an open-source tool that lets big-data developers divide up and parcel out the data and computational tasks without concerning themselves with messy details, such as determining the capabilities of each computer or making sure the data a computer needs to work on is stored as physically close to that computer as possible, to reduce delays caused by network bottlenecks.
Hadoop implements this divide-and-conquer functionality through a programming approach called MapReduce (more about this later). Hadoop also brings a crucial element to the mix: fault tolerance. It was designed on the assumption that hardware failures are inevitable and will sometimes keep a given computer from completing its assigned subtasks. Hadoop includes functionality to handle such failures gracefully, reassigning subtasks as needed to keep the overall job flowing.
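To make the map/reduce split concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API. The class names (WordCount, TokenizerMapper, IntSumReducer) and the command-line input/output paths are illustrative placeholders, not part of any particular deployment.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each mapper reads one slice of the input and emits a (word, 1) pair per word.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: Hadoop groups the emitted pairs by word; each reducer sums the counts for one word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Hadoop runs many copies of the mapper in parallel, one per slice of the input, then groups the emitted pairs by key and hands each group to a reducer. That split is the essence of MapReduce.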
Other Hadoop components include:
- HDFS (the Hadoop Distributed File System), which distributes data across the computers in the cluster (see the sketch after this list)
- YARN (Yet Another Resource Negotiator), which manages cluster resources and schedules tasks
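As a rough illustration of the file-system piece, here is a minimal Java sketch that writes a file into HDFS through Hadoop’s FileSystem API. The NameNode address and the file path are assumptions chosen for illustration, not values from a real cluster.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; a real cluster's value normally comes from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");

    try (FileSystem fs = FileSystem.get(conf);
         FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
      // HDFS splits the file into blocks and replicates them across the cluster's nodes.
      out.write("hello, distributed file system\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}
```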
One of Hadoop’s main advantages is that it is open source, and therefore free to obtain. Any organization that uses Hadoop can modify the underlying source code to suit its own purposes, and modifications that prove especially useful can be contributed back to the main Hadoop project for everyone to use.
Spark
Hadoop does have limitations, the most important being that MapReduce is not the best approach for every type of big-data job; its batch-oriented style of processing, with intermediate results written to disk, is a poor fit for iterative and interactive workloads. This is where Spark comes in. Spark, another member of the growing list of open-source projects managed by the Apache Software Foundation, addresses these shortcomings largely by keeping data in memory across processing steps. Spark can be used on its own, but it lacks Hadoop’s distributed file system, so the two are often used together.
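As a sketch of how the two fit together, the following Java snippet uses Spark’s Dataset API to read a text file from HDFS and compute the same word count in memory. The application name and input path are placeholder assumptions.

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("spark word count")   // illustrative application name
        .getOrCreate();

    // Spark keeps intermediate results in memory between steps rather than
    // writing them back to disk, which is where much of its speed comes from.
    Dataset<String> words = spark.read()
        .textFile("hdfs:///data/example.txt")  // placeholder HDFS path
        .flatMap(
            (String line) -> Arrays.asList(line.split("\\s+")).iterator(),
            Encoders.STRING());

    // Group by the word (the single "value" column) and count occurrences.
    words.groupBy("value").count().show();

    spark.stop();
  }
}
```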
The Future of Hadoop and Spark
Big-data problems have traditionally involved large but static data sets—that is, all the data is there and isn’t changing. Big-data problems, therefore, involved asking a question and formulating the right way to slice and dice the data to come up with the answer.
Increasingly, however, big data involves streams of incoming information that must be analyzed in real time. The “internet of things” (IoT), with multiple sensors and other devices constantly feeding data to ever-growing databases, is one of the primary drivers of this evolution of big-data problems from static to dynamic. Another source of streaming data is the transactional data generated by e-commerce and related financial infrastructure. In fact, any enterprise with widely distributed activities that need to be analyzed and summarized—think Amazon, Walmart, FedEx, Bank of America—can benefit from the ability to capture trends and identify issues in real time.
Fortunately, Hadoop and Spark have grown to meet this challenge, enabling users to analyze these data streams as they arrive rather than working from a static snapshot in time. Look for Hadoop and Spark to continue to evolve to serve these needs.
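For a sense of what stream processing looks like in practice, here is a minimal sketch using Spark’s Structured Streaming API to maintain a running word count over live text data. The socket source and its host and port are assumptions chosen for simplicity; a production job would more likely read from a durable source such as Kafka.

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class StreamingWordCount {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("streaming word count")
        .getOrCreate();

    // Read lines of text as they arrive on a socket (assumed host and port).
    Dataset<Row> lines = spark.readStream()
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load();

    // Split each incoming line into words and keep a continuously updated count per word.
    Dataset<Row> wordCounts = lines
        .as(Encoders.STRING())
        .flatMap(
            (String line) -> Arrays.asList(line.split("\\s+")).iterator(),
            Encoders.STRING())
        .groupBy("value")
        .count();

    // Print the updated counts to the console as new data streams in.
    StreamingQuery query = wordCounts.writeStream()
        .outputMode("complete")
        .format("console")
        .start();

    query.awaitTermination();
  }
}
```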
At AndPlus, we know how to leverage Hadoop and Spark to address big-data challenges, and we are excited about how they can be applied to your current and future business problems. Contact us today to learn how we can help.