CASA Symposium 2016

The Wotan Beowulf Cluster Project
Carson Wang, Jacob Buys, and Krist Pregracke
Advisors: Dr. Sam Chung and Sarah McQuarrie
15 October 2016
Agenda
1. Motivation
2. Problem
3. Approach
   a. Architecture and tools
4. Analysis
5. Conclusion
Motivation
● Big data analysis has become one of the primary interest areas for research and a growing industry.
● Big data has become an area of interest for us at Carbondale Community High School Computer Club.
  ○ We see this project as an opportunity to provide a resource for the local scientific community and aid future research.
Problem
● Over the years, data capacity and production have grown at an astonishing pace.
● As a result, new methods are required to process this data.
● We asked the question: “How can we use parallel computing to make data processing more efficient?”
https://upload.wikimedia.org/wikipedia/commons/7/7c/Hilbert_InfoGrowth.png
Approach
● Management system
  ○ Time management
  ○ People management
  ○ Website development
● Non-proprietary computer systems
  ○ Only used free software
● Planning in progress
Web Development
● To accommodate our diverse group, numerous skills were developed at the same time.
● Some members of our team learned web development with graduate mentors.
● The project thus encompassed a wide range of Information Systems skills.
Computer Architecture
● Trisquel
  ○ Ubuntu derivative that takes out proprietary firmware and binary “blobs”
● SSH
  ○ Secure Shell
  ○ Allows for remote login
● NFS
  ○ Network File System
● Apache Hadoop
  ○ Industry-standard framework for big data analysis
  ○ Uses MapReduce model
● 18 nodes
  ○ 17 secondary nodes, 1 primary or “head” node
http://www.revista.espiritolivre.org/wp-content/uploads/2013/03/26-03-2013_trisquel-logo.jpg
https://svn.apache.org/repos/asf/hadoop/logos/out_rgb/hadoop+elephant_rgb.png
Computer Nodes
● Beowulf cluster setup
  ○ “ad hoc”
  ○ Each node is an individual computer
  ○ Easily expandable
● Nodes identical in make and model
● “Commodity hardware”
  ○ Low-end off-the-shelf computer towers
Photos: finished cluster (top); installing software in progress (left)
This is the basic topology of a Beowulf cluster, which is the model we used for our own computer cluster. We named the cluster Wotan because Wotan and Beowulf are both figures from Germanic legend.
https://upload.wikimedia.org/wikipedia/commons/4/40/Beowulf.png
Hadoop Implementation
● Hadoop is modular, consisting of:
  ○ Hadoop Common - the core libraries required for the framework to function
  ○ Hadoop HDFS - the Hadoop Distributed File System, which sets up a data store across the cluster
  ○ Hadoop YARN - manages and allocates resources for applications
  ○ Hadoop MapReduce - the main function of Hadoop, data processing
● Each configuration had to be customized to link the modules together (a sketch of the relevant settings follows below).
● On our lower-end computers, an occasional machine failure left behind a zombie Hadoop process that broke the entire cluster.
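To make “linking the modules together” concrete, here is a minimal sketch of the properties involved, written against Hadoop's Java Configuration API rather than the usual XML files. The hostname wotan-head, the port, and the replication factor are hypothetical placeholders, not the cluster's actual settings.

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative sketch only: the same properties normally set in core-site.xml,
// hdfs-site.xml, yarn-site.xml, and mapred-site.xml, expressed through
// Hadoop's Java Configuration API. "wotan-head" and the values below are
// hypothetical placeholders, not the project's real settings.
public class ClusterConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();

        // Hadoop Common / HDFS: every node must agree on where the
        // NameNode (head node) lives so the distributed file system forms.
        conf.set("fs.defaultFS", "hdfs://wotan-head:9000");
        conf.set("dfs.replication", "3");   // copies of each block kept across nodes

        // YARN: point all NodeManagers at the ResourceManager on the head node.
        conf.set("yarn.resourcemanager.hostname", "wotan-head");

        // MapReduce: run jobs through YARN instead of the local runner.
        conf.set("mapreduce.framework.name", "yarn");

        return conf;
    }
}
```

In practice these values live in XML files under Hadoop's etc/hadoop configuration directory, which is why sharing that directory over NFS (see the Conclusion) helped keep all 18 nodes consistent.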
Hadoop MapReduce
● MapReduce, the main data processing function of Hadoop, works by splitting the data analysis into two parts: a Map() function and a Reduce() function.
● The Map() function filters and sorts the data into manageable categories for the cluster to process.
● The Reduce() function analyzes this data and combines similar data (see the word-count sketch below).
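The canonical way to see the two halves in action is a word count, sketched below with Hadoop's Java MapReduce API; it is an illustration of the model, not one of the jobs run on Wotan. The mapper emits a (word, 1) pair per occurrence, and the reducer sums the counts for each word.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative word-count sketch, not one of the project's actual jobs.
public class WordCountSketch {

    // Map(): split each input line into words and emit (word, 1).
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // one record per occurrence
                }
            }
        }
    }

    // Reduce(): all counts for the same word arrive together; sum them.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```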
http://ccs.miami.edu/wp-content/uploads/2016/05/Hadoop-Map-Reduce.png
Analysis
● π (pi) Approximation
  ○ 3.141592...
● DNA Sequencing
  ○ Shotgun Sequencing
● TeraSort
Mahir Morshed, former Project Lead
π Approximation
● Used the Nilakantha method (a single-node sketch follows below)
  ○ Calculated 128 digits with a k value of 100,000
  ○ 1 node: 2h 10min
  ○ 16 nodes: 1h
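For reference, a minimal single-machine sketch of the Nilakantha series is shown below; it is not the distributed job that produced the timings above. The series is pi = 3 + 4/(2·3·4) − 4/(4·5·6) + 4/(6·7·8) − …, and BigDecimal supplies the 128-digit working precision.

```java
import java.math.BigDecimal;
import java.math.MathContext;

// Single-threaded sketch of the Nilakantha series, for illustration only;
// the cluster's distributed version is not reproduced here.
// pi = 3 + 4/(2*3*4) - 4/(4*5*6) + 4/(6*7*8) - ...
public class NilakanthaPi {
    public static BigDecimal approximate(int terms, int digits) {
        MathContext mc = new MathContext(digits + 10);   // extra guard digits
        BigDecimal pi = new BigDecimal(3);
        BigDecimal four = new BigDecimal(4);
        for (int k = 1; k <= terms; k++) {
            long n = 2L * k;                              // 2, 4, 6, ...
            BigDecimal denom = BigDecimal.valueOf(n)
                    .multiply(BigDecimal.valueOf(n + 1))
                    .multiply(BigDecimal.valueOf(n + 2));
            BigDecimal term = four.divide(denom, mc);
            pi = (k % 2 == 1) ? pi.add(term) : pi.subtract(term);
        }
        return pi.round(new MathContext(digits));
    }

    public static void main(String[] args) {
        // k = 100,000 terms at 128-digit working precision, as on the slide above.
        System.out.println(approximate(100_000, 128));
    }
}
```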
Genome Assembly
● Process in which DNA reads, or short pieces of raw DNA data, are combined into an entire genome.
● We plan to use shotgun sequencing, which randomly tests pairs of reads for a match.
● Shotgun sequencing matches the ends of reads and overlaps them to assemble the DNA (see the sketch after this list); the reads themselves are produced by methods such as chain-termination (Sanger) sequencing.
● We downloaded a data store of strawberry DNA reads from the National Center for Biotechnology Information (NCBI), a project of the National Institutes of Health (NIH).
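To illustrate the overlap idea on its own, the sketch below takes two reads, finds the longest suffix of one that matches a prefix of the other, and merges them. It is a hypothetical toy example of the matching step, not the assembly pipeline applied to the NCBI strawberry data.

```java
// Toy sketch of the overlap step in assembly: find the longest suffix of
// read `a` that equals a prefix of read `b`, then merge the two reads.
// A hypothetical greedy helper, not the pipeline used on the NCBI data.
public class ReadOverlapSketch {

    // Length of the longest suffix of a that equals a prefix of b.
    static int overlapLength(String a, String b) {
        int max = Math.min(a.length(), b.length());
        for (int len = max; len > 0; len--) {
            if (a.regionMatches(a.length() - len, b, 0, len)) {
                return len;
            }
        }
        return 0;
    }

    // Merge two reads on their best suffix-prefix overlap.
    static String merge(String a, String b) {
        int len = overlapLength(a, b);
        return a + b.substring(len);
    }

    public static void main(String[] args) {
        // Two toy reads sharing the fragment "GGTAC".
        String left = "ACGTGGTAC";
        String right = "GGTACTTAG";
        System.out.println(merge(left, right));   // prints ACGTGGTACTTAG
    }
}
```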
TeraSort
● Simple, standard Apache benchmark for Hadoop clusters
  ○ Included within Hadoop
● Generates random data and then sorts it by key
  ○ Sorting is the benchmark
● Hadoop single-node cluster: 10min
● Hadoop multi-node cluster: 20s
Conclusion
● Benchmarks showed that computing time decreased consistently when multiple nodes were used.
● We experienced challenges in the implementation of Hadoop. We were able to simplify the process by using the Network File System to copy configuration files across the network, but the necessary manual configuration and troubleshooting were time-consuming.
● We practiced web development to increase the value of the project.
● In the future, we hope to further investigate genome sequencing.
Thank You!