Cognizant 20-20 Insights | April 2016

Infrastructure Considerations for Analytical Workloads

By applying Hadoop clusters to big data workloads, organizations can achieve significant performance gains, and those gains vary depending on whether the cluster runs on physical or virtual infrastructure.

Executive Summary
On the list of technology industry buzzwords, "big data" is among the most intriguing. As data volume, velocity and variety proliferate, and the search for veracity escalates, organizations across industries are placing new bets on data sources such as machine sensor data, medical images, financial information, retail sales, radio frequency identification and Web tracking data. This creates huge challenges for decision-makers, who must extract meaning and untangle trends from more input than ever before.

From a technological perspective, the so-called four V's of big data (volume, velocity, variety and veracity) make it ever more difficult to process big data on a single system. Even if one disregarded the storage constraints of a single system and used a storage area network (SAN) to store the petabytes of incoming data, processing speed would remain a huge bottleneck. Whether a single-core or multi-core processor is used, a single system takes substantially more time to process data than an array of systems working on partitions of the data in parallel.

That is not to say the processing conundrum shouldn't be confronted and overcome. Big data plays a vital role in improving organizational profitability, increasing productivity and solving scientific challenges. It also enables decision-makers to understand customer needs, wants and desires, and to see where markets are heading. One of the major technologies that helps organizations make sense of big data is the open source distributed processing framework known as Apache Hadoop.

Based on our engagement experience and intensive benchmarking, this white paper analyzes the infrastructure considerations for running analytical workloads on Hadoop clusters. The primary emphasis is to compare and contrast physical and virtual infrastructure for supporting typical business workloads from performance, cost, support and scalability perspectives. Our goal is to arm the reader with the insights necessary for assessing whether physical or virtual infrastructure would best suit your organization's requirements.

Figure 1: HDFS architecture. A client issues metadata operations to the NameNode and block read/write operations to the DataNodes, which replicate blocks across racks (Rack 1, Rack 2).

Hadoop: A Primer
Hadoop's Role
To solve many of the aforementioned big data issues, the Apache Software Foundation developed Apache Hadoop, a Java-based framework that can be used to process large amounts of data across thousands of computing nodes. It consists of two main components: HDFS1 and MapReduce2. The Hadoop Distributed File System (HDFS) is designed to run on commodity hardware, while MapReduce provides the processing framework for data distributed across thousands of nodes.

Hadoop provides performance enhancements that enable high-throughput access to application data. It also handles streaming access to file system resources, which becomes increasingly challenging as data sets grow. Many of the design considerations can be subdivided into the following categories:

• Data asset size.
• Transformational challenges.
• Decision-making.
• Analytics.
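To make the streaming access pattern described above concrete, the following minimal sketch writes and then reads a file through Hadoop's FileSystem API. It is a generic illustration rather than code from this paper; it assumes a Hadoop 2.x client on the classpath, and the NameNode URI and file path are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Minimal sketch: streaming write and read against HDFS via the FileSystem API. */
public class HdfsStreamingSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS, block size and replication normally come from core-site.xml/hdfs-site.xml;
        // the URI below is a placeholder used only for illustration.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/landing/sample.txt");   // hypothetical path

        // Write: data is streamed in large blocks; the NameNode tracks metadata while
        // DataNodes store and replicate the blocks.
        try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
            out.write("record-1,record-2,record-3\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: sequential, high-throughput streaming access rather than random I/O.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```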
HDFS shares many attributes with other distributed file systems. However, Hadoop implements numerous features that make the file system significantly more fault-tolerant than typical hardware solutions such as redundant arrays of inexpensive disks (RAID) or data replication alone. What follows is a deeper look at why Hadoop is considered a viable answer to the challenges created by big data. The HDFS components explored here are the NameNode and the DataNodes (see Figure 1).

The MapReduce framework processes large data sets across numerous computing nodes (known as data nodes), where all nodes are on the same local network and use similar hardware. Computational processing can occur on data stored either in a file system (semi-structured or unstructured) or in a database (structured), and MapReduce can take advantage of data locality. In MapReduce version 1, the components are the JobTracker and TaskTrackers, whereas in MapReduce version 2 (YARN), the components are the ResourceManager and NodeManagers (see Figure 2).

Figure 2: MapReduce v1 vs. YARN (MapReduce v2) architecture. In v1, a single JobTracker coordinates TaskTrackers colocated with DataNodes; in v2, a ResourceManager allocates containers through per-node NodeManagers, with a per-application ApplicationMaster. (YARN: Yet Another Resource Negotiator.)
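To ground the MapReduce data flow described above (map over input splits, shuffle and sort by key, reduce per key), here is a minimal word-count-style job written against the Hadoop 2.x Java API. It is a generic illustration, not code from the benchmark in this paper; class names and input/output paths are placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Minimal word-count job: map over input splits, shuffle/sort by key, reduce per key. */
public class WordCountSketch {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);          // emitted pairs are shuffled and sorted by key
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count-sketch");
        job.setJarByClass(WordCountSketch.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);    // local pre-aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. hdfs:///data/in
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. hdfs:///data/out
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In MapReduce v1 this job would be coordinated by the JobTracker; under YARN the same code is submitted to the ResourceManager, which schedules it into containers managed by NodeManagers.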
Hadoop's ability to integrate data from different sources (databases, social media, etc.), systems (network/machine/sensor logs, geospatial data, etc.) and file types (structured, unstructured and semi-structured) enables organizations to answer business questions such as:

• Do you test all of your decisions to compete in the market?
• Can new business models be created based on the data available in the organization?
• Can you drive new operational efficiencies by modernizing extract, transform and load (ETL) processes and optimizing batch processing?
• How can you harness the hidden value in data that until now has been archived, discarded or ignored?

Applications that use HDFS tend to have large data sets, ranging from gigabytes to petabytes, and HDFS has been calibrated for such data volumes. By providing substantial aggregate data bandwidth, HDFS scales to thousands of nodes per cluster. Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across hundreds of commodity servers operating in parallel, enabling businesses to run applications on thousands of nodes involving thousands of terabytes of data.

In legacy environments, traditional ETL and batch processes can take hours, days or even weeks, in a world where businesses need access to data in minutes or even seconds. Hadoop excels at high-volume batch processing; because the processing runs in parallel, batch jobs complete many times faster than they would on a single server. Likewise, when Hadoop is used as an enterprise data hub (EDH), it can ease the ETL bottleneck by establishing a single version of the truth that business users can access and transform without a dedicated infrastructure setup. This makes Hadoop one place to store all data, for as long as desired or required, and in its original fidelity, integrated with existing infrastructure and tools. Doing so provides the flexibility to run a variety of enterprise workloads, including batch processing, interactive SQL, enterprise search and advanced analytics, along with the built-in security, governance, data protection and management that enterprises require. With EDH, leading organizations are changing the way they think about data, transforming it from a cost into an asset.

For many enterprises, data streams in from all directions. The challenge is to synthesize and quantify it, converting bits and bytes into insight and foresight by applying analytical procedures to the historical data collected. Hadoop enables organizations not only to store the data they collect but also to analyze it. With Hadoop, business value can be elevated by:

• Mining social media data to determine customer sentiment.
• Evaluating Web clickstream data to improve customer segmentation.
• Proactively identifying and responding to security breaches.
• Predicting a customer's next purchase.
• Fortifying security and compliance by using server/machine logs and analyzing data sets across multiple data sources.

Figure 3: MapReduce logical data flow: input, split, map, (optional combine), shuffle and sort, reduce, output.

Understanding Hadoop Infrastructure
Hadoop can be deployed in either of two environments:

• Physical-infrastructure-based.
• Virtual-infrastructure-based.

Physical Infrastructure for Hadoop Cluster Deployment
Hadoop and its associated ecosystem components are deployed on physical machines with large amounts of local storage and memory. Machines are racked and stacked and connected with high-speed network switches.

The merits:

• Delivers the full benefit of Hadoop's performance, especially its locality-aware computation. Even when a node is too busy to accept additional work, the JobTracker can schedule work near that node and take advantage of the rack switch's bandwidth. (The sketch following this subsection shows how block replica locations can be inspected.)
• The cluster hostnames and IP addresses need only be copied into /etc/hosts on each server in the cluster to avoid DNS load.
• When writing files to HDFS, data blocks are streamed to multiple racks, so if a switch fails or a rack loses power, a copy of the data is still retained.

The demerits:

• Unless there is enough work to keep the CPUs busy, the hardware becomes a depreciating investment, particularly when servers are not used to their full potential, thereby increasing the effective cost of the entire cluster.
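As a small illustration of the locality and rack-replication points above, the sketch below asks the NameNode where each block replica of a file lives; this is the information that locality-aware scheduling relies on. It assumes a Hadoop 2.x client configured via core-site.xml on the classpath, and the default path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch: list where HDFS placed each block replica of a file. */
public class BlockPlacementSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());   // reads core-site.xml/hdfs-site.xml
        Path file = new Path(args.length > 0 ? args[0] : "/data/landing/sample.txt"); // hypothetical default

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (int i = 0; i < blocks.length; i++) {
            // Each block reports the DataNodes (and, when a topology script is configured,
            // the racks) holding a replica.
            System.out.printf("block %d offset=%d length=%d%n",
                    i, blocks[i].getOffset(), blocks[i].getLength());
            System.out.println("  hosts: " + String.join(", ", blocks[i].getHosts()));
            System.out.println("  racks: " + String.join(", ", blocks[i].getTopologyPaths()));
        }
    }
}
```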
Virtual Infrastructure for Hadoop Cluster Deployment
Virtual machines (VMs) exist only for the duration of the Hadoop cluster. In this approach, a cluster configuration containing the NameNode and JobTracker hostnames is created, usually on the same machine for a small cluster. Network rules can ensure that only authorized hosts have access to the master and slave nodes. Persistent data must be kept in an alternate file system to avoid data loss.

The merits:

• Can be cost-effective, as the organization is billed based on the duration of cluster usage; when the cluster is not needed, it can be shut down, thus saving money.
• Can scale the cluster up and down on demand.
• Some cloud service providers offer a version of Hadoop that is prepackaged and ready to use.
• The HDFS file system persists across cluster restarts (provided the NameNode's data is protected and a secondary NameNode keeps up with it, or high availability has been configured).

The demerits:

• Prepackaged Hadoop implementations may be older versions or private branches whose code is not public, which makes failures harder to handle.
• Startup can be complex, as the hostnames of the master node(s) are not known until they are allocated; configuration files need to be created on demand and then placed in the VMs.
• There is no persistent storage except through non-HDFS file systems.
• There is no locality in a virtualized Hadoop cluster; there is no easy way to determine the location of slave nodes and their position relative to one another.
• DataNodes may be colocated on the same physical server, and so lack the redundancy they appear to offer in HDFS.
• Extra tooling is often needed to restart the cluster when the machines are destroyed.

Hadoop Performance Evaluation
When it comes to Hadoop clusters, performance is critical. These clusters may run on physical infrastructure, in a virtualized environment, or both. A performance analysis of individual clusters in each environment helps determine the best alternative for achieving the required performance (see Figure 4).

Figure 4: Factors Affecting Hadoop Cluster Performance
Soft factors (performance optimization parameters): number of maps, number of reducers, combiner, custom serialization, shuffle tweaks, intermediate compression.
Hard factors (external factors): environment, number of cores, memory size, the network.

Setup Details and Experiment Results
We compared the performance of a Hadoop cluster running virtually on Amazon Web Services Elastic MapReduce (AWS EMR) and a similar hard-wired cluster running on internal physical infrastructure. See Figure 5 for the precise configurations. Figure 6 details the Hadoop distributions and component versions benchmarked on AWS EMR and on the physical machines, which ran the Hive and Pig scripts as well as Mahout K-Means clustering. Figure 7 describes the benchmark data.

Figure 5: A Tale of the Tape: Physical vs. Virtual Machines
AWS VM sizes*      vCPU x memory    No. of nodes
m1.medium          1 x 2 GB         4
m1.large           1 x 4 GB         4
m1.xlarge          4 x 16 GB        4

Physical machines  CPU x memory     No. of nodes
NameNode           4 x 4 GB         1
DataNode           4 x 4 GB         3
Client             4 x 8 GB         1
Processor: Intel Core i3-3220, 4 cores

Figure 6: Benchmarking Physical and Virtual Machines*
                 AWS EMR                      Physical machines
Distribution     Apache Hadoop                Cloudera Distribution for Hadoop 4
Hadoop version   1.0.3                        2.0.0+1518
Pig              0.11.1.1-amzn (rexported)    0.11.0+36
Hive             0.11.0.1                     0.10.0+214
Mahout           0.9                          0.7+22
* Instance details may differ with releases.3

Figure 7: Data Details
Requirement                   Generate 1B records and store them in an S3 bucket/HDFS
No. of columns                37
No. of files                  50
No. of records (each file)    20 million
File size (each file)         2.7 GB
Total data size               135 GB
Cluster size                  4 nodes (3 DataNodes/TaskTrackers)
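The wall-clock times reported below are shaped by the soft factors listed in Figure 4. Purely as a hedged illustration (these are not the settings used in this benchmark), the sketch below shows where several of those knobs are set on a MapReduce job, using Hadoop 2.x property names; the values are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

/** Sketch: where several of the "soft factors" from Figure 4 are configured on a job. */
public class TuningSketch {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();

        // Intermediate (map-output) compression reduces shuffle traffic across the network.
        // Snappy requires the native library; DefaultCodec is a safe fallback.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);

        // Shuffle tweak: in-memory sort buffer for map output (value is illustrative, not a recommendation).
        conf.setInt("mapreduce.task.io.sort.mb", 256);

        Job job = Job.getInstance(conf, "tuning-sketch");
        // The number of maps follows from the input splits (block size); reducers are set explicitly.
        job.setNumReduceTasks(8);   // illustrative value
        // A combiner pre-aggregates map output locally before the shuffle
        // (see the word-count sketch earlier), e.g. job.setCombinerClass(SumReducer.class);
        return job;
    }
}
```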
This benchmark transformed raw data into a standard format using big data tools such as Hive Query Language (HiveQL) and Pig Latin, starting at 40 million records and scaling to 1 billion records. Along with this, Mahout (the machine learning library for Hadoop) was run to perform K-Means clustering of the data, creating five clusters with a maximum of eight iterations, on m1.large (1 vCPU x 4 GB memory), m1.xlarge (4 vCPU x 15.3 GB memory) and physical machines (4 CPU x 4 GB memory). The input data was placed in HDFS for the physical machines and in AWS S3 for AWS EMR.

Consequential Graphs
Figure 8 shows how the cluster performed for the Hive transformation in both the physical and virtual environments. Both workloads took almost the same time for smaller data sets (roughly 40 to 80 million records); as data sizes increased, the physical machines gradually outperformed EMR's m1.large cluster.

Figure 8: Hive transformation, physical machines vs. AWS EMR (m1.large): time in seconds for 40 million to 1 billion records.

Figure 9, which compares physical and virtual machines using the Pig transformation, shows that the EMR cluster executing the Pig Latin script on 40 million records takes longer than the same script running on physical machines. As data sizes increase, the gap between physical and virtual infrastructure widens to the point where the physical machines execute significantly faster.

Figure 9: Pig transformation, physical machines vs. AWS EMR (m1.large): time in seconds for 40 million to 320 million records.

Figure 10 shows the time taken for all four operations on a data set containing 320 million records, covering the various Hive queries and Pig scripts compared. With the exception of the Hive transformation, the operations are faster on physical than on virtual infrastructure.

Figure 10: Physical vs. virtual machines for 320 million records: time in seconds for the Pig transformation, Hive transformation, Hive Query-2 and Hive Query-3.

Figure 11 compares the gradual increase in execution time with increasing data sizes. Here the Pig scripts execute faster on physical machines than on virtual machines.

Figure 11: Pig and Hive transformation times in seconds on m1.large vs. physical machines for 40 million to 320 million records.

Figure 12 shows the time taken by the Hive queries on physical and virtual machines for various data sizes. Again, the physical machines perform much faster than the virtual ones.

Figure 12: Hive Query-2 and Query-3 times in seconds on m1.large vs. physical machines for 40 million to 1 billion records.

Figure 13: Mahout K-Means clustering times in seconds for 1 million to 6 million records on physical machines, VM (1 vCPU x 4 GB) and VM (4 vCPU x 15 GB).

Figure 14: Characteristic Differences Between Physical and Virtual Infrastructure
Performance: Comparing physical and virtual machines of the same configuration, the physical machines deliver higher performance; with increased memory, however, a VM can perform better.
Scalability: Commissioning and decommissioning cluster nodes on physical machines is expensive compared with provisioning VMs as needed, so scalability is highly limited with physical machines.
Cost: Provisioning physical machines incurs a higher cost than virtual machines, where creating a VM can be as simple as cloning an existing VM instance and assigning it a unique identity.
Resource utilization: The processor utilization of physical machines is often below 20%, leaving the remainder available for use. Virtual machines drive CPU utilization much higher, but with a greater chance of CPU overhead leading to lower performance.
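The cost contrast in Figure 14 is easiest to reason about with a back-of-the-envelope comparison of owned hardware versus pay-per-use VMs. All figures in the sketch below are hypothetical placeholders, not AWS pricing or data from this study; the point is only that cluster utilization drives the trade-off.

```java
/** Back-of-the-envelope comparison of owned hardware vs. pay-per-use VMs.
 *  All figures are hypothetical placeholders, not benchmark or AWS pricing data. */
public class CostBreakEvenSketch {
    public static void main(String[] args) {
        double serverCapex = 4_000.0;        // one-time purchase per physical node (hypothetical)
        double serverOpexPerMonth = 150.0;   // power, cooling, support per node per month (hypothetical)
        double vmRatePerHour = 0.35;         // on-demand rate per comparable VM (hypothetical)
        int nodes = 4;
        int months = 24;                     // planning horizon

        double physicalCost = nodes * (serverCapex + serverOpexPerMonth * months);

        // If the cluster is needed only a few hours a day, the VM bill stays low;
        // if it runs around the clock, it approaches or exceeds the physical cost.
        for (int hoursPerDay : new int[] {4, 12, 24}) {
            double vmCost = nodes * vmRatePerHour * hoursPerDay * 30.0 * months;
            System.out.printf("VM at %2d h/day over %d months: $%,.0f vs. physical $%,.0f%n",
                    hoursPerDay, months, vmCost, physicalCost);
        }
    }
}
```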
Figure 13 displays the K-Means clustering performance on physical infrastructure, m1.large virtual infrastructure (1 core x 4 GB memory) and m1.xlarge virtual infrastructure (4 cores x 15 GB memory). In this test, the best performance was clocked on the m1.xlarge cluster. Hence, the performance achieved depends significantly on the memory available for the run; in this case, the ease of scaling up virtual machine memory drove the performance advantage over physical machines.

Moving Forward
In our experiment, we observed that AWS EMR clusters up to m1.large instances perform significantly slower than the equivalent cluster running in a physical environment, whereas the m1.xlarge instance, with its larger memory capacity, was faster than the physical machines.

In sum, Hadoop MapReduce jobs are I/O bound and, generally speaking, virtualization will not help organizations boost performance. Hadoop takes advantage of sequential disk I/O, for example by using larger block sizes, while virtualization works on the notion that multiple "machines" do not need full physical resources at all times. I/O-intensive data processing applications that operate on dedicated storage are therefore better left non-virtualized. For a large job, adding more TaskTrackers to the cluster will help boost computational speed, but physical machines offer no flexibility for adding or removing nodes from the cluster.

Selecting hardware that provides the best balance of performance and economy for a given workload requires testing and validation. It is important to understand your workload and the principles involved in hardware selection (e.g., blades and SANs are preferred for grid-based and processing-intensive workloads).

Based on the findings from our benchmark study, we recommend that organizations keep the following infrastructure considerations in mind (a rough encoding of these guidelines appears in the sketch at the end of this section):

• If your application depends on performance, has a longer lifecycle and its data growth is regular, a physical machine is the better option: it performs better, the deployment cost is a one-time expense, and with regular data growth there may be no need for highly scalable infrastructure.
• If your application has a balanced workload, is cost-sensitive, its data growth is exponential and it requires support, virtual machines can prove to be the safer choice, as the CPU is well utilized and memory is scalable. They are also the more cost-efficient option, with a flexible pay-per-use model, and the VM environment scales easily when DataNodes/TaskTrackers/NodeManagers must be added or removed.
• If your application depends on performance, has to be cost-efficient, and its data growth is regular and requires support, virtual machines can be a better choice.
• If your application requires high performance and its data growth is exponential, with no required support, virtual machines with higher memory are the better choice.

During the course of our investigation, we found that the commodity physical systems, although older and individually less powerful, performed significantly better in our implementation than comparable virtual machine deployments using standard hypervisors. From these results, we observe that virtual Hadoop cluster performance is significantly lower than that of a cluster running on physical machines, due to the overhead virtualization imposes on the CPU of the physical host. Any measure that offsets this virtualization overhead, such as provisioning virtual machines with larger memory, would boost performance.
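As a rough illustration only, the sketch below encodes the deployment guidelines above as a simple heuristic. It is a simplification of the recommendations in this paper, not a sizing tool; the inputs and thresholds are assumptions.

```java
/** Rough encoding of the deployment guidelines above; a heuristic sketch, not a sizing tool. */
public class DeploymentGuideline {

    enum Infrastructure { PHYSICAL, VIRTUAL, VIRTUAL_HIGH_MEMORY }

    static Infrastructure recommend(boolean performanceCritical,
                                    boolean exponentialDataGrowth,
                                    boolean costSensitive) {
        if (performanceCritical && !exponentialDataGrowth && !costSensitive) {
            // Long-lived, performance-bound workload with steady growth:
            // the one-time hardware spend pays off.
            return Infrastructure.PHYSICAL;
        }
        if (performanceCritical && exponentialDataGrowth) {
            // Needs both speed and elastic scaling: larger-memory VMs offset virtualization overhead.
            return Infrastructure.VIRTUAL_HIGH_MEMORY;
        }
        // Balanced or cost-sensitive workloads benefit from pay-per-use billing and on-demand scaling.
        return Infrastructure.VIRTUAL;
    }

    public static void main(String[] args) {
        System.out.println(recommend(true, false, false));   // PHYSICAL
        System.out.println(recommend(true, true, false));    // VIRTUAL_HIGH_MEMORY
        System.out.println(recommend(false, true, true));    // VIRTUAL
    }
}
```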
Footnotes
1 HDFS: http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
2 MapReduce: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html.
3 AWS instance details: http://aws.amazon.com/ec2/previous-generation/.

About the Authors
Apsara Radhakrishnan is an Associate on the Decision Science Team within Cognizant Analytics. She has three years of experience in big data technology, focused on ETL in the Hadoop environment, its administration and AWS analytics products. She holds a master's degree in computer applications from Visvesvaraya Technological University. Apsara can be reached at [email protected].

Harish Chauhan is Principal Consultant, Cloud Services, within Cognizant Infrastructure Services. He has over 24 years of IT experience, numerous technical publications to his credit, and has coauthored two patents in the area of virtualization, one of which was issued in January 2015. His white paper "Harnessing Hadoop" was released in 2013. His areas of specialization include distributed computing (Hadoop/big data/HPC), cloud computing (private cloud technologies), virtualization/containerization and system management/monitoring. Harish has worked in many areas, including infrastructure management, product engineering, consulting/assessment, advisory services and pre-sales. He holds a bachelor's degree in computer science and engineering. In his current role, he is responsible for capability building on emerging trends and technologies such as big data/Hadoop, cloud computing/virtualization, private clouds and mobility. He can be reached at [email protected].

About Cognizant
Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business process outsourcing services, dedicated to helping the world's leading companies build stronger businesses. Headquartered in Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfaction, technology innovation, deep industry and business process expertise, and a global, collaborative workforce that embodies the future of work. With over 100 development and delivery centers worldwide and approximately 221,700 employees as of December 31, 2015, Cognizant is a member of the NASDAQ-100, the S&P 500, the Forbes Global 2000, and the Fortune 500 and is ranked among the top performing and fastest growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.

World Headquarters: 500 Frank W. Burr Blvd., Teaneck, NJ 07666 USA. Phone: +1 201 801 0233. Fax: +1 201 801 0243. Toll Free: +1 888 937 3277. Email: [email protected].
European Headquarters: 1 Kingdom Street, Paddington Central, London W2 6BD. Phone: +44 (0) 20 7297 7600. Fax: +44 (0) 20 7121 0102. Email: [email protected].
India Operations Headquarters: #5/535, Old Mahabalipuram Road, Okkiyam Pettai, Thoraipakkam, Chennai, 600 096 India. Phone: +91 (0) 44 4209 6000. Fax: +91 (0) 44 4209 6060. Email: [email protected].

© Copyright 2016, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission from Cognizant. The information contained herein is subject to change without notice. All other trademarks mentioned herein are the property of their respective owners. TL Codex 1732