Welcome to part seven of our blog series based on my latest Pluralsight course: Applied Azure. Previously, we’ve discussed Azure Web Sites, Azure Worker Roles, Identity and Access with Azure Active Directory, Azure Service Bus and MongoDB, HIPAA Compliant Apps in Azure and Offloading SharePoint Customizations to Azure.
No lengthy commentary is needed to communicate the growing importance of big data technologies. Look no further than the rounds of funding that companies like Cloudera, Hortonworks and MapR have attracted in recent months. The market for Hadoop is widely expected to grow to $20 billion by 2018.
The key motivations for the growth of big data technologies include:
- The growing need to process ever increasing volumes of data. This growth in data is not limited to web scale companies alone. Businesses of all sizes are seeing this growth.
- Not all data conforms to a well-defined structure/schema, so there is a need to supplement (if not replace) the traditional data processing and analysis tools such as EDWs.
- Ability to take advantage of deep compute analytics using massively parallel, commodity-based clusters. We will see examples of deep compute analysis a little later; it is a growing area for deriving knowledge from data.
- Overall simplicity (from the standpoint of the analyst/developer authoring the query) that hides the non-trivial complexity of the underlying infrastructure.
- Price-performance benefits that come from commodity-based clusters and built-in fault tolerance.
- The ability to tap into fast paced innovation taking place within the “Hadoop” ecosystem. Consider that Map Reduce, which has been the underpinning of the Hadoop ecosystem for years, has in recent months handed job scheduling and cluster resource management over to Yarn, which opens the door to processing engines beyond Map Reduce.
Scenarios for “big data” technologies range from canonical usage in web companies involving click stream and log analysis, to business event analysis in the financial services and travel industries. But beyond these well-established scenarios, enterprises that have traditionally dealt with structured data and tools such as EDWs for data processing are finding it useful to offload some of the data processing to “big data” technologies, especially with the availability of “Hadoop as a service” offerings such as Azure HDInsight and Amazon’s Elastic MapReduce. These offerings have greatly reduced the time and effort it takes to set up a big data infrastructure.
Here is a useful wiki page that lists organizations that are using Hadoop in some shape or another.
Let us briefly review some key terms.
Apache Hadoop – A software framework for distributed processing of large amounts of data. It consists of core modules including:
- Map Reduce – A scheme for chunking the data, storing and processing the chunks and then aggregating the data. We will look at Map Reduce in detail shortly.
- Hadoop Distributed File System – A distributed file system that provides high-throughput access to application data.
- Yarn – A framework for job scheduling and cluster resource management.
In addition to the above modules, Apache Hadoop consists of a growing list of projects including:
- Hive – A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Pig – A high-level data-flow language and execution framework for parallel computation.
- Spark – A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing and graph computation.
This is an ever growing list of projects, and in many ways we are rediscovering and relearning some of the database optimization techniques we have learnt as an industry over the years.
Let us delve a bit deeper into the core concept of Map Reduce and HDFS since they are key to understanding big data technologies.
In a nutshell, Hadoop is a massively parallel data processing scheme. The idea is to break down the data to be processed into smaller chunks, then process these chunks in parallel, and finally aggregate the results of the “chunked” process.
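To make the chunk, process and aggregate idea concrete, here is a minimal single-machine sketch in Python. It is only an illustration and not how Hadoop itself is implemented: a thread pool stands in for the cluster nodes, and the function names (`split_into_chunks`, `process_chunk`, `run_job`) are made up for this post.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Break the input into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def process_chunk(chunk):
    """Stand-in for real work: count the non-empty lines in one chunk."""
    return sum(1 for line in chunk if line.strip())


def run_job(lines, chunk_size=2, workers=4):
    """Chunk the input, process the chunks in parallel, then aggregate."""
    chunks = split_into_chunks(lines, chunk_size)
    # A thread pool plays the role of the cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Aggregate the per-chunk results into a single answer.
    return sum(partial_results)


if __name__ == "__main__":
    sample = ["alpha", "", "beta", "gamma", "   ", "delta", "epsilon"]
    print(run_job(sample))
```

Each chunk is processed independently of the others, which is what makes it possible to spread the work across many machines.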
Let us begin with the Hadoop Distributed File System (HDFS) – HDFS is where the data to be processed (and eventually the results) is stored. HDFS, as we have stated earlier, is a distributed, high throughput and fault tolerant data store.
Next, let us look at the mapper and reducer blocks in the above diagram. These are collectively known as “processors” and, as the name suggests, are responsible for processing the data. The mapper processor is responsible for taking the input and transforming it into an intermediate format. The reducer processor is responsible for combining, or reducing, the data given to it.
Let us get more specific about the format of the data passed into these processors. It will help us understand the power (and simplicity) of the Hadoop framework. It turns out that mapper and reducer processors operate exclusively on <key, value> pairs. The mapper processor gets, as input, a series of key/value pairs, and is responsible for transforming them into intermediate key/value pairs. The reducer processor is responsible for reducing or combining key/value pairs that share a key. The word count example that we will discuss next will make the mapper and reducer phases very clear.
But before we do that, the important takeaway from the above discussion is that we have broken a complex data processing requirement into map and reduce phases. As we shall see shortly, the logic that goes into the mapper and reducer processors is completely up to us. Fortunately, that is the only thing we need to worry about. Everything else is handled by the Hadoop framework, including partitioning the data, managing the various processors, passing data between them and handling failures. This last point is really the reason why Hadoop has gained so much adoption: it delivers a highly complex, parallel data processing system on commodity hardware in a manner that boils down to custom developed map and reduce routines.
As promised, let us look at a concrete example that will bring out what goes inside the map and reduce phases. We will use the canonical “hello world” example of the big data world – “the word count”. This example is easy to understand and easy to set up and run. In fact, if you provision an HDInsight cluster (a Hadoop instance in Azure – more on this in a bit), the “word count” sample comes pre-provisioned and you can follow these steps to take it for a spin.
Here is the problem statement. We are trying to figure out the frequency of occurrence of words across any number of documents. For the sake of simplicity, we have only depicted three documents in the diagram below (Doc #1-3). These documents are distributed across the mapper processors in no particular order. Each document that is sent to a mapper processor (depicted as <Doc #1, Words> in the diagram below) is really a key/value pair, with the name of the document being the key and the contents making up the value.
The mapper processor, executing our custom code, is responsible for transforming this incoming key/value pair into other key/value pairs, each made of an individual word (that is part of the document) as the key and the number of occurrences of that word within the incoming document as the value. In the diagram below, the mapper processor has transformed the incoming key/value pair <Doc #1, Words> into the key/value pairs <WordB, 2> and <WordA, 1>. Based on this example, it is not hard to imagine the custom code that is required to achieve this key/value transformation: tokenize the document, iterate over the word collection, increment the occurrence frequency of each word and, finally, emit the resulting key/value pairs. This concludes the description of the map phase.
Of course we are not done yet. We still need to add up all the key/value pairs that share the same key to get the complete count. In the diagram below, mapper #1 emitted <WordB, 2> and <WordA, 1> and mapper #2 emitted <WordC, 2> and <WordA, 2>. We need to add up the key/value pairs with WordA as the key to get the complete occurrence frequency for that word. This is where the reduce phase comes in. The reducer processor is responsible for compressing or reducing the results. As shown in the diagram below, the reducer processor, executing our custom code, is responsible for reducing a collection of key/value pairs (<WordA, 2> and <WordA, 1>) into one (<WordA, 3>).
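The custom map and reduce code for word count can be sketched in a few lines of Python. This is a plain, single-process illustration and not the sample that ships with HDInsight: the function names are my own, the grouping loop in `word_count` stands in for work the Hadoop framework does for us, and the tokenizer simply splits on whitespace.

```python
from collections import Counter, defaultdict


def map_document(doc_name, contents):
    """Map phase: turn one <document name, contents> pair into <word, count> pairs."""
    counts = Counter(contents.split())  # naive whitespace tokenizer
    return list(counts.items())


def reduce_word(word, counts):
    """Reduce phase: combine every count emitted for one word into a total."""
    return word, sum(counts)


def word_count(documents):
    """Run map, group by key, then reduce, all in one process."""
    grouped = defaultdict(list)
    for doc_name, contents in documents.items():
        for word, count in map_document(doc_name, contents):
            # Grouping intermediate pairs by key is done by Hadoop in a real job.
            grouped[word].append(count)
    return dict(reduce_word(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    docs = {
        "Doc #1": "WordB WordA WordB",
        "Doc #2": "WordC WordA WordC WordA",
    }
    print(word_count(docs))
```

With the two sample documents, the mapper emits <WordB, 2> and <WordA, 1> for Doc #1 and <WordC, 2> and <WordA, 2> for Doc #2, and the reducer combines the two WordA pairs into <WordA, 3>.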
Once again, it is not hard to imagine what that custom code would look like. The bigger question, however, is this – how can we be sure that all the key/value pairs with the same key end up on the same reducer processor? Well, that is the “magic” of the Hadoop Map Reduce framework. It ensures that the key/value pairs are routed to the appropriate reducer processor. Once again, hiding the underlying complexity so we can focus on developing custom map/reduce routines is really the power of Hadoop.
Note that the description of the map and reduce phases above has been simplified for clarity; a fuller treatment is frankly out of scope for this blog post. For a more detailed treatment of the map and reduce phases, please refer to the Apache Hadoop Documentation tutorial.
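One common way to get this routing guarantee is to choose the reducer from a hash of the key alone; to my understanding, Hadoop's default partitioner works along the same lines (hash of the key modulo the number of reduce tasks). The Python sketch below illustrates the idea only. The names `choose_reducer` and `shuffle` are hypothetical, and `zlib.crc32` is used purely because it gives the same value on every run.

```python
import zlib


def choose_reducer(key, num_reducers):
    """Pick a reducer index from the key alone, so equal keys always match."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Route every <key, value> pair from every mapper to its reducer's bucket."""
    buckets = [dict() for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            bucket = buckets[choose_reducer(key, num_reducers)]
            bucket.setdefault(key, []).append(value)
    return buckets


if __name__ == "__main__":
    mapper_1 = [("WordB", 2), ("WordA", 1)]
    mapper_2 = [("WordC", 2), ("WordA", 2)]
    for index, bucket in enumerate(shuffle([mapper_1, mapper_2], num_reducers=3)):
        print(index, bucket)
```

Because the reducer index depends only on the key, both WordA pairs land in the same bucket no matter which mapper emitted them.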
So where does Azure HDInsight fit in with all this?
HDInsight makes Apache Hadoop, HDFS, the Map Reduce framework and other projects such as Pig, Hive and Oozie available as a service in Azure. This means that, as with other advanced Azure services such as Mobile Services and Service Bus, customers are not required to provision the basic IaaS building blocks in order to use this service – although setting up a Hadoop cluster on Azure IaaS is certainly a route some customers have taken. But based on our discussion thus far, it is not hard to imagine that setting up a Hadoop cluster is not a trivial task, especially as you approach highly scaled out configurations. Instead, Microsoft worked with Hortonworks to bring a fully compatible Hadoop implementation to Azure. As a result, customers can “push button” provision a Hadoop cluster based on their needs, thus greatly lowering the barrier of entry into big data. This is especially true of enterprise customers who may not have access to infrastructure skills related to big data in-house, given their traditional focus on relational and EDW technologies. Of course, it is not enough to get started. Ultimately, to succeed with big data, enterprises are going to demand fault tolerance, the flexibility to cost effectively scale up and down based on need, interoperation with existing analytics tools and so on. Fortunately, HDInsight seeks to address these valid concerns. Here is how:
- The HDInsight team is not just “porting” the latest version of the Hadoop framework to Azure; they are actively contributing to the Apache Hadoop project.
- Typical implementations of Hadoop clusters have a single head node. HDInsight provides a second head node for increased fault tolerance.
- Even though HDFS is available within HDInsight, customers can use Blob storage as the file system. This is important because Azure Blob storage is the rendezvous point for Azure solutions (IaaS disks, database backups and shared storage “files” are all hosted in Azure Blob storage). Azure Blob storage is also very cost effective. With features such as built-in geo replication and an export-import service that allows you to “Fedex” data in and out of an Azure data center, it is the most used Azure service, with over a trillion objects currently stored. Big data analysis is therefore likely to need access to data in Blob storage, which HDInsight provides by exposing Blob storage through the HDFS interface. By the way, there is another benefit to using Blob storage – you can dismantle the HDInsight cluster without losing the data. You can easily see why this is going to be cost effective.
- Since HDInsight is part of the Azure family of services, you can continue to use the tools you already use with Azure (PowerShell, Management API) to manage the HDInsight cluster.
- What about analytics and other BI tools that enterprises rely on heavily, you ask? It turns out that, via the Hive ODBC driver, Hadoop data is accessible to BI tools such as Power Query, Analysis Services and Reporting Services, or any BI tool that can work with a standard ODBC data source. This last bullet represents the intersection between big data and traditional BI data analysis that is going to be so important to enterprises. As depicted in the diagram below, enterprises can cull through a much wider range of data sources (structured, unstructured or semi-structured) using HDInsight and then use traditional BI tools to conduct an end-to-end analysis. This virtuous cycle of load, analyze and report is also described as the Acquire/Compute/Publish loop.
The importance of big data cannot be overstated. It can be used to process unstructured or semi-structured data from web clickstreams, social media, server logs, devices and sensors and more. Microsoft HDInsight is a 100% Apache Hadoop distribution, available as a service on Azure. By including features such as a second head node, Blob storage as the file system and language extensions such as C#, HDInsight is making it easier for enterprises to get started with big data.