The Wotan Beowulf Cluster Project
Carson Wang, Jacob Buys, and Krist Pregracke
Advisors: Dr. Sam Chung and Sarah McQuarrie
15 October 2016

Agenda
1. Motivation
2. Problem
3. Approach
   a. Architecture and tools
4. Analysis
5. Conclusion

Motivation
● Big data analysis has become one of the primary areas of interest for research and a growing industry.
● Big data has also become an area of interest for us at the Carbondale Community High School Computer Club.
   ○ We see this project as an opportunity to provide a resource for the local scientific community and to aid future research.

Problem
● Over the years, data capacity and production have grown at an astonishing pace.
● As a result, new methods are required to process this data.
● We asked the question: "How can we use parallel computing to make data processing more efficient?"
Image: https://upload.wikimedia.org/wikipedia/commons/7/7c/Hilbert_InfoGrowth.png

Approach
● Management system
   ○ Time management
   ○ People management
   ○ Website development
● Non-proprietary computer systems
   ○ Only used free software
Photo: planning in progress

Web Development
● To accommodate our diverse group, numerous skills were developed at the same time.
● Some members of our team learned web development with graduate mentors.
● The project thus encompassed a wide range of Information Systems skills.

Computer Architecture
● Trisquel
   ○ Ubuntu derivative that removes proprietary firmware and binary "blobs"
● SSH (Secure Shell)
   ○ Allows for remote login
● NFS (Network File System)
● Apache Hadoop
   ○ Industry-standard framework for big data analysis
   ○ Uses the MapReduce model
● 18 nodes
   ○ 17 secondary nodes, 1 primary or "head" node
Images: http://www.revista.espiritolivre.org/wp-content/uploads/2013/03/26-03-2013_trisquel-logo.jpg, https://svn.apache.org/repos/asf/hadoop/logos/out_rgb/hadoop+elephant_rgb.png

Computer Nodes
● Beowulf cluster setup
   ○ "Ad hoc"
   ○ Each node is an individual computer
   ○ Easily expandable
● Nodes identical in make and model
● "Commodity hardware"
   ○ Low-end, off-the-shelf computer towers
Photos: finished cluster (top); installing software in progress (left)

This is the basic topology of a Beowulf cluster, which is the model we used for our own computer cluster. We named the cluster Wotan because Wotan, like Beowulf, is a figure from Germanic legend.
Image: https://upload.wikimedia.org/wikipedia/commons/4/40/Beowulf.png

Hadoop Implementation
● Hadoop is modular, consisting of:
   ○ Hadoop Common – the core libraries required for operation
   ○ Hadoop HDFS – the Hadoop Distributed File System, which provides a data store spread across the cluster
   ○ Hadoop YARN – manages and allocates resources for applications
   ○ Hadoop MapReduce – Hadoop's main data-processing function
● Each node's configuration had to be customized to link these modules together (see the configuration sketch after the MapReduce overview below).
● On lower-end computers, occasional machine failure left behind a zombie Hadoop process that broke the entire cluster.

Hadoop MapReduce
● MapReduce, the main data-processing function of Hadoop, splits data analysis into two parts: a Map() function and a Reduce() function.
● The Map() function filters and sorts the data into manageable categories for the cluster to process.
● The Reduce() function analyzes this data and combines similar records (as sketched below).
Image: http://ccs.miami.edu/wp-content/uploads/2016/05/Hadoop-Map-Reduce.png
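The deck itself contains no code, so the following is a minimal sketch of the Map()/Reduce() split in Java, using Hadoop's canonical word-count example rather than the project's own jobs; the class names and job name are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map(): turn raw input into sorted, manageable (key, value) pairs,
  // here (word, 1) for every word in every input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce(): combine all values that share a key,
  // here summing the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because the mapper's output is grouped by key before the reducer runs, the framework, not the application, handles distributing work across the cluster's nodes.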
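The module linking mentioned in the Hadoop Implementation slide happens in XML files under Hadoop's etc/hadoop directory, one copy per node. A minimal sketch, assuming a Hadoop 2.x install and a head node reachable under the hostname head; the hostname and port are illustrative, not taken from the Wotan cluster.

```xml
<!-- core-site.xml: point every node at the HDFS namenode on the head node. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://head:9000</value>
  </property>
</configuration>

<!-- yarn-site.xml: tell every node where the YARN resource manager runs. -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>head</value>
  </property>
</configuration>
```

Keeping these files identical on all 18 machines is exactly the chore the team later offloaded to NFS (see the Conclusion).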
Analysis
● π approximation
   ○ 3.141592...
● DNA sequencing
   ○ Shotgun sequencing
● TeraSort
Mahir Morshed, former Project Lead

π Approximation
● Used the Nilakantha method (a single-node sketch appears in the appendix)
   ○ Calculated 128 digits at a k value of 100,000
   ○ 1 node: 2 h 10 min
   ○ 16 nodes: 1 h

Genome Assembly
● Genome assembly is the process in which DNA reads, or short pieces of raw DNA data, are combined into an entire genome.
● We plan to use shotgun sequencing, in which many random fragments of the genome are read. (The individual reads themselves are classically produced by chain-termination, or Sanger, sequencing.)
● Assembly then matches the ends of reads and overlaps them to piece the DNA together (a toy sketch appears in the appendix).
● We downloaded a data store of strawberry DNA reads from the National Center for Biotechnology Information (NCBI), a project of the National Institutes of Health (NIH).

TeraSort
● Simple, standard Apache benchmark for Hadoop clusters
   ○ Runs within Hadoop
● Generates random data and then sorts it by key
   ○ The sort is the benchmarked step (the invocation appears in the appendix)
● Hadoop single-node cluster: 10 min
● Hadoop multi-node cluster: 20 s

Conclusion
● Benchmarks showed that computing time decreased consistently when multiple nodes were used.
● We experienced challenges implementing Hadoop. We simplified the process by using NFS to copy configuration files across the network, but the necessary manual configuration and troubleshooting were still time-consuming.
● We practiced web development to increase the value of the project.
● In the future, we hope to investigate genome sequencing further.

Thank You!
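Appendix: Nilakantha Sketch

The slides report timings but not code. Below is a minimal single-node Java sketch of the Nilakantha series, π = 3 + 4/(2·3·4) − 4/(4·5·6) + 4/(6·7·8) − …, summed to k = 100,000 as on the π Approximation slide; the 150-digit working precision and single-threaded loop are illustrative assumptions, not the project's implementation.

```java
import java.math.BigDecimal;
import java.math.MathContext;

public class Nilakantha {
  public static void main(String[] args) {
    // Work at more precision than we intend to report (assumed: 150 digits).
    MathContext mc = new MathContext(150);
    BigDecimal four = BigDecimal.valueOf(4);
    BigDecimal pi = BigDecimal.valueOf(3);

    // pi = 3 + 4/(2*3*4) - 4/(4*5*6) + 4/(6*7*8) - ...
    for (long k = 1; k <= 100_000; k++) {
      long n = 2 * k;  // first factor of the k-th denominator
      BigDecimal denom = BigDecimal.valueOf(n)
          .multiply(BigDecimal.valueOf(n + 1))
          .multiply(BigDecimal.valueOf(n + 2));
      BigDecimal term = four.divide(denom, mc);
      pi = (k % 2 == 1) ? pi.add(term) : pi.subtract(term);
    }
    System.out.println(pi);
  }
}
```

Because each term depends only on k, the range of k can be partitioned across nodes and the partial sums added at the end, which is what makes the computation parallelize well on the cluster.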
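Appendix: Read Overlap Sketch

A toy illustration of the end-matching described on the Genome Assembly slide: find the longest suffix of one read that matches a prefix of the next, then merge the two on that shared overlap. Real assemblers handle sequencing errors, reverse complements, and millions of reads; the reads and the minimum-overlap threshold below are invented for the example.

```java
public class ReadOverlap {
  // Longest length (>= minLen) such that a suffix of `a`
  // equals a prefix of `b`; 0 if no such overlap exists.
  static int overlap(String a, String b, int minLen) {
    int max = Math.min(a.length(), b.length());
    for (int len = max; len >= minLen; len--) {
      if (a.regionMatches(a.length() - len, b, 0, len)) {
        return len;
      }
    }
    return 0;
  }

  public static void main(String[] args) {
    String read1 = "ATTAGACCTG";
    String read2 = "CCTGCCGGAA";
    int len = overlap(read1, read2, 3);  // 4, the shared "CCTG"
    if (len > 0) {
      // Merge the two reads on their shared overlap.
      System.out.println(read1 + read2.substring(len));
      // Prints ATTAGACCTGCCGGAA
    }
  }
}
```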
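Appendix: Running TeraSort

TeraSort ships in Hadoop's examples jar, so no code needs to be written to reproduce the benchmark. A sketch of the standard steps, assuming a Hadoop 2.x directory layout; the row count and HDFS paths are illustrative.

```
# Generate random 100-byte rows into HDFS, then sort them by key.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    teragen 10000000 /benchmark/input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    terasort /benchmark/input /benchmark/output
# Optional: verify that the output is globally sorted.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    teravalidate /benchmark/output /benchmark/report
```

Only the terasort step is timed; teragen and teravalidate bracket it so runs are comparable across cluster sizes.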