Data Intensive Science Education
Thomas J. Hacker
Associate Professor, Computer & Information Technology
Purdue University, West Lafayette, Indiana USA
Gjesteprofessor (Visiting Professor), Department of Electrical
Engineering and Computer Science
University of Stavanger, Norway
EU-China-North America Workshop on HPC Cloud and Big Data
June 20, 2013
University of Stavanger, Norway
Introduction and Motivation
• Theory and Experiment (1800s)
• Computational Simulation
– Third leg of science
– Past 50 years or so (1950s)
• Data (21st century science)
– Fourth “leg” of science
– Researchers are flooded with data
– Tremendous quantity and multiple scales of data
– Difficult to collect, store, and manage
– How can we distill meaningful knowledge from data?
Data is the 4th Paradigm
• Producing an avalanche of high resolution
digital data
• All (or most) of the data needs to be accessible
over a long period of time
– Much of the data is not reproducible
• Example – NEES project
– Structure or sample destroyed
through testing
– Very expensive to
rebuild for more tests
Data, data every where…
• We are surrounded by data that we want,
but it is difficult to find the information
that we need
“Water, water, every where, Nor any drop to drink.”
Samuel Taylor Coleridge, The Rime of the Ancient Mariner
• Private, shared, and public data
repositories
– Files on your computer
– E-mail
– Group documents and files
– Experimental results
– Published papers
• Data are scattered across many systems and devices
– Personal computer, old diskettes in a box, several email systems
– Old computer systems
Image: The Rime of the Ancient Mariner, Plate 32: The Pilot, by Gustave Doré
Need for Data Education
• Data is the 4th paradigm of Science and Engineering
• We are losing valuable data every day
– The techniques we were taught for maintaining a “lab notebook” have not been
effectively transferred to computer-based data collection and registration systems.
– So much data is available and collected today that it is no longer possible to
keep it on paper.
Two Examples of Data Intensive
Science
• Two large-scale science and engineering projects
illustrate the problems related to data intensive
science
• National Science Foundation George E. Brown
Network for Earthquake Engineering Simulation
(NEES)
– Purdue operates the headquarters for NEEScomm, the community of NEES
research facilities
• The Compact Muon Solenoid project
– Purdue operates a Tier-2 CMS center
NSF Network for Earthquake
Engineering Simulation (NEES)
• Safer buildings and civil infrastructure are needed to reduce
damage and loss from earthquakes and tsunamis
• To facilitate research to improve seismic design of buildings and
civil infrastructure, the National Science Foundation established
NEES
• NEES Objectives:
– Develop a national, multi-user, research infrastructure to support research
and innovation in earthquake and tsunami loss reduction
– Create an educated workforce in hazard mitigation
– Conduct broader outreach and lifelong learning activities
Vision for NEES
• Facilitate access to the world's best integrated
network of state-of-the art physical simulation
facilities
• Build a cyber-enabled community that shares
ideas, data, and computational tools and models.
• Promote education and training for the next
generation of researchers and practitioners.
• Cultivate partnerships with other organizations to
disseminate research results, leverage
cyberinfrastructure, and reduce risk by transferring
results into practice.
NEES Research Facilities
• NEES has a broad set of experimental facilities
– Each type of equipment produces unique data
– Located at 14 sites across the United States
• Shake Table
• Tsunami Wave Basin
• Large-Scale Testing Facilities
• Centrifuge
• Field and Mobile Facilities
• Large-Displacement Facility
• Cyberinfrastructure
– Oregon State University
– University of Minnesota
– University of Illinois at Urbana-Champaign
– University of California, Berkeley
– University of California, Davis
– University at Buffalo
– University of California, Santa Barbara
– Cornell University
– University of California, Los Angeles
– Rensselaer Polytechnic Institute
– University of California, San Diego
– University of Nevada, Reno
– University of Texas at Austin
– Lehigh University
https://www.nees.org
Large-Scale Testing Facilities
• Lehigh University
– Reaction wall, strong floor
– Dynamic actuators
• UC-Berkeley
– Reconfigurable Reaction Wall
• University of Illinois Urbana-Champaign
– Multi-Axial Full-Scale Sub-Structured Testing & Simulation (MUST-SIM)
• University of Minnesota
– Reaction walls
– Multi-Axial Subassemblage
Testing (MAST)
Images: Univ of Minnesota
NEEShub at nees.org
Compact Muon Solenoid Project
• Another example of a “big data” project
• Two primary computational goals
– Move detector data from Large Hadron Collider at CERN to
remote sites for processing
– Examine detector data for evidence of Higgs boson
• ~15 PB/yr data
• Applications used by CMS are not inherently parallel
– Data is split up and distributed across nodes
– Embarrassingly parallel
CMS Project Overview
• CERN Large Hadron Collider Project (LHC)
– Particle accelerator and collider – largest in the
world
– 17 mile circumference tunnel
– Providing evidence to support the existence of
the Higgs boson
• Six detector experiments at the LHC
– ATLAS, CMS, LHCb, ALICE, TOTEM, LHCf
• Compact Muon Solenoid (CMS)
– Very large solenoid with a 4 Tesla magnetic field
– For comparison, Earth’s magnetic field is about 60 × 10^-6 Tesla
CMS Detector
Purdue CMS Tier-2 Center
Computing Infrastructure
• ~10,000 computing cores within the Purdue
University Community Cluster program
– Purdue recently (June 18) announced the Conte Supercomputer
– Fastest university-owned supercomputer in the United States
• 3 PB of disk storage running Hadoop
• Sharing a 100 Gb/sec network uplink to Indianapolis and
Chicago
– Ultimately connecting to Fermi National Accelerator Laboratory (Fermilab) near Chicago
• Provided 14% of all Tier-2 computing globally in 2012
Purdue CMS Tier-2 Center
• Physicists from around the world submit
computational jobs to Purdue
– Data is copied from the Tier-1 to Purdue storage on
user request
– Simulation codes also run at Purdue, with results
pushed up to Tier-1 center or other Tier-2s.
• International data sharing
– Data interoperability was designed into the project from the beginning. There is
one instrument (the CMS detector), which greatly simplifies the sharing and reuse
of data compared with a project like NEES.
Challenges involved in Big Data
• Performance at Scale
– How can we effectively match data performance with HPC
capabilities?
– How can we ensure good reliability of these systems?
• Data Curation Challenges
– What should we preserve, how should we preserve it, and how
can we ensure the long-term viability of the data?
• Disciplinary Sociology and Cyberinfrastructure
– How can we effectively promote and support the adoption and
use of new technologies?
– How can we foster the development of new disciplinary
practices focused on the long-term accessibility of data?
Performance at Scale
• Petaflop scale systems are now available for use by researchers
– Example: Purdue Conte system announced this week (Rmax 943 TF,
Rpeak 1.342 Petaflops)
– Conte was built with 580 HP ProLiant SL250 Generation 8 (Gen8) servers,
each incorporating two Intel Xeon processors and two Intel Xeon Phi
coprocessors, integrated with Mellanox 56 Gb/s FDR InfiniBand.
– Conte has 580 servers (570 at the time of testing) with 9,120 standard
cores and 68,400 Phi cores, for a total of 77,520 cores.
• Big data analytics coupled with petascale systems requires high
bandwidth storage systems
– Avoid wasteful and expensive CPU stalls
• Scaling up is along two axes:
– Large volume of data (example: CMS Project)
– Large variety and number of files (example: NEES project)
Curation Challenges
• Data production rate is tremendous
– Volume of data is growing over time
– Sensor sampling rate increasing
– High definition video
• Managing data transfer
– Time required to upload and download data is growing
– Uploads and downloads can take a long time when there are network bottlenecks
• Ensuring data integrity
– Filtering, cleaning, and calibration are often needed before data are uploaded
and curated
– The community needs to also retain the raw data in case an error is
made or in case a researcher can later distill further insights from the
data.
Curation Challenges
• File type management
– Data is stored in files through the intermediary of an application
– This means that the information in the data will be encoded into some kind of
format
– It’s difficult (if not impossible) to restrict the file formats used by
the research community
– As these applications change (or disappear) over time, the information
encoded in the data may become stranded
• Risk of stranded data
– When the file format cannot be precisely identified, then we don’t
know which application can be used as an intermediary for reading
the information encoded in the data.
– This leads to ‘stranded data’ that is useless.
Curation Challenges
• Linking computation with data and archived data
– Will need the ability to quickly search archived data –
much more detailed than what Google can deliver
• How can we quickly discover, convert, and transfer
archived data to be close to the user and to
computation? (especially HPC)
– Need to match data I/O capabilities with growth in the
number of CPU cores and core speed.
Long-term accessibility
• We have data in the NEEShub from the 1970s
– Science: “Rescue of Old Data Offers Lesson for
Particle Physicists” by Andrew Curry (Feb 2011)
– Described the need to find old, almost lost data for a
physics experiment from the 1980s
• The data will need to remain viable and
accessible for years into the future
Discipline Sociology
• Sociological factors in data curation
– Disciplinary differences in how data are archived, how archived data are valued,
and what is worth retaining
– Who determines what is worth keeping?
– What is the practice in the specific discipline?
• International standards and practices in metadata tagging,
representing numbers, and even character sets
– NEES is working with partners in Japan and China – we need to
determine how to represent their data in a common standard
framework
– Notation for numbers (“,” vs. “.” as the decimal separator, lakh vs. 100,000)
– Changing the behavior of scientists to value curation and long-term
accessibility
Managing Curation at Scale
• How can we efficiently use data curators’ time?
• NEES now has 1.8M files; what will happen in three more years?
– How can we manage 10M files with a limited curation staff?
• For NEES, we are using the OAIS (Open Archival Information System) model as a
guideline for designing a pipeline for curating NEES data
– The OAIS model is proving to be a useful framework for thinking about how to
undertake data curation
– We are developing a curation pipeline to help automate curation for the
many files in the NEES Project Warehouse
Data Analytics
• There are technologies available today that can be
used to provide solutions to these problems
– High performance computing
– Parallel file systems
– MapReduce/Hadoop
• A sustainable solution requires more than a set of
technologies
– An effective data cyberinfrastructure involves both
sociological and technological components.
• What is needed to educate and train researchers to
effectively use these new technologies?
Our approach
• Developing a joint research and education program in big data analytics
among the University of Stavanger, Purdue University, and AMD Research.
– Chunming Rong, Tomasz Wlodarczyk (Stavanger)
– Thomas Hacker, Ray Hansen, Natasha Nikolaidis (Purdue)
– Greg Rodgers (AMD Research)
– Funded by SIU: “Strategic Collaboration on Advanced Data Analysis and
Communication between Purdue University and University of Stavanger”
– Developing a semester-long joint course in HPC and Big Data Analytics,
and a short summer course (to be delivered next week)
Planned Course Objectives
• Students will learn to apply modern tools to analyze large and complex
data sets.
– Students will be able to design, construct, test, and benchmark a small
data processing cluster (based on Hadoop)
– Demonstrate knowledge of MapReduce functionality through the development
of a MapReduce program (a minimal example follows this slide)
– Understand Hadoop job tracker, task tracker, scheduling issues,
communications, and resource management.
– Construct programs based on the MapReduce paradigm for typical algorithmic
problems
– Use functional programming concepts to describe data dependencies and
analyze the complexity of MapReduce programs
Planned Course Objectives
• Algorithms
– Understand the algorithmic complexity of the worst-case, expected-case, and best-case
running time (big-O notation), and the orders of complexity (e.g., N, N^2, log N, NP-hard)
– Examine a basic algorithm and identify its algorithmic complexity order (see the
sketch after this slide)
• File Systems
– Describe the concepts of a distributed file system, how it differs from a local file
system, and the performance of distributed file systems
– Describe a parallel file system, the performance advantages possible through the
use of a parallel file system, and the reliability and fault-tolerance mechanisms
needed for parallel file systems (examples include OrangeFS and Lustre)
– Understand peak and sustained bandwidth rates
– Understand the differences among an RDBMS, a data warehouse, unstructured big
data, and keyed files
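To make the complexity objective concrete, a small illustrative sketch; the methods below are hypothetical teaching examples, not taken from the course materials.

```java
// Hypothetical examples for identifying complexity orders from loop structure.
public class ComplexityExamples {

  // O(N): a single pass over the array.
  static long sum(int[] a) {
    long total = 0;
    for (int x : a) {
      total += x;
    }
    return total;
  }

  // O(N^2): the nested loops compare every pair of elements.
  static int countDuplicatePairs(int[] a) {
    int pairs = 0;
    for (int i = 0; i < a.length; i++) {
      for (int j = i + 1; j < a.length; j++) {
        if (a[i] == a[j]) {
          pairs++;
        }
      }
    }
    return pairs;
  }

  // O(log N): binary search halves the remaining range each step
  // (assumes the array is sorted in ascending order).
  static int binarySearch(int[] a, int target) {
    int lo = 0, hi = a.length - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (a[mid] == target) return mid;
      if (a[mid] < target) lo = mid + 1;
      else hi = mid - 1;
    }
    return -1; // not found
  }

  public static void main(String[] args) {
    int[] data = {3, 1, 4, 1, 5, 9, 2, 6};
    System.out.println("sum = " + sum(data));                      // O(N)
    System.out.println("pairs = " + countDuplicatePairs(data));    // O(N^2)
    int[] sorted = {1, 1, 2, 3, 4, 5, 6, 9};
    System.out.println("index of 5 = " + binarySearch(sorted, 5)); // O(log N)
  }
}
```

Students would be asked to state the big-O order of each method and justify it from the loop structure.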
Short Course Format
• Lecture in the morning followed by lab in the
afternoon
– Labs are built on a set of desktop PCs running Hadoop
in an RHEL6 virtual machine on top of VMware
– Using pfSense (an open-source firewall) to create a secure
network connection from the instruction site to the
computers running Hadoop
– We will refine the network and lab equipment setup based on
our experience delivering the short course next week
Short Course – Day 1 Topics
• Lecture
– Introduction and motivation for the course
– History of HPC, big data, Moore's Law.
– Science domain areas, and the problems in each area that motivate the need for
data intensive computing. Where are we today, and what is the projected need?
How are these needs driven by increases in computing power?
– Definition of big data and big compute, and why we need both combined
– A mixture of trends, principles, and implementations, in historical context, that
students should understand
– Parallel application types
– Introduction to MapReduce
– Dataflow within MapReduce with plug-in
• Labs
– The hadoop command, HDFS, and Linux basics
– Hadoop basic examples from lectures
Short Course – Day 2 Topics
• Lectures
– Introduction to MapReduce, continued
– Combiners
– More complex MapReduce example (search assist)
– Hadoop Architecture
– Motivation for Hadoop
– Hadoop building blocks (name node, data node, etc.)
– Fault tolerance and failures, replication, and data aware scheduling.
– Main components (HDFS, MapReduce, modes (local, distributed, pseudo distributed), etc.)
– HDFS GUI
• Labs
– We will use combiners and multiple reducers to improve performance (see the driver
sketch after this slide). We will look at network traffic and data counters to evaluate.
– Students will evaluate the performance improvement for each optimization of the
MapReduce program.
– Advanced students will gather network and data statistics to explain why each phase
improved.
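A minimal sketch of the Day 2 combiner/multiple-reducer optimization, assuming the TokenizerMapper and IntSumReducer classes from the earlier word-count sketch are available in the same package; the class name WordCountWithCombiner and the reducer count are illustrative assumptions, not course material.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver that reuses the TokenizerMapper/IntSumReducer classes from the
// word-count sketch earlier in these slides, adding a combiner and
// several reducers (illustrative example).
public class WordCountWithCombiner {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count with combiner");
    job.setJarByClass(WordCountWithCombiner.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    // The combiner runs the reduce logic on each mapper's output before the
    // shuffle, cutting the volume of intermediate data sent over the network.
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    // Multiple reducers partition the keys and run the reduce phase in parallel.
    job.setNumReduceTasks(4);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Comparing the job’s built-in shuffle and data counters with and without the setCombinerClass call is one way to quantify the improvement in the lab.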
Short Course – Day 3 Topics
• Lectures
– Hadoop Architecture, continued
– Comparison of HDFS with other Parallel File System architectures (GoogleFS, Lustre,
OrangeFS), and how Hadoop differs from these systems
– Chaining MapReduce jobs (see the sketch after this slide)
– MapReduce algorithms: k-means or other algorithms
– Schemas for unstructured data using Hive
– Introduction to data organization. Why are we concerned about data organization? What
are the impacts of poor organization on performance and correctness?
– Data organization: levels of data organization (data structure, file level, cluster
level, data parallelization)
– How to deal with large sequential files from a performance perspective, and how
they would be represented in a parallel file system (e.g., HDFS)
• Lab
– Hive
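To illustrate the “Chaining MapReduce jobs” lecture topic, a sketch that chains a word-count job into a sort-by-frequency job, reusing the word-count classes from the earlier sketch plus Hadoop’s stock InverseMapper; the class name WordFrequencyChain and the path arguments are illustrative assumptions, not course material.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Job 1 counts words (reusing TokenizerMapper/IntSumReducer from the earlier
// sketch) and writes a SequenceFile; job 2 reads that output, swaps
// (word, count) to (count, word) with the stock InverseMapper, and lets the
// framework's sort phase order the words by frequency.
public class WordFrequencyChain {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path counts = new Path(args[1]);   // intermediate output of job 1
    Path sorted = new Path(args[2]);   // final output of job 2

    Job countJob = Job.getInstance(conf, "word count");
    countJob.setJarByClass(WordFrequencyChain.class);
    countJob.setMapperClass(WordCount.TokenizerMapper.class);
    countJob.setCombinerClass(WordCount.IntSumReducer.class);
    countJob.setReducerClass(WordCount.IntSumReducer.class);
    countJob.setOutputKeyClass(Text.class);
    countJob.setOutputValueClass(IntWritable.class);
    countJob.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileInputFormat.addInputPath(countJob, input);
    FileOutputFormat.setOutputPath(countJob, counts);
    if (!countJob.waitForCompletion(true)) {
      System.exit(1); // stop the chain if the first stage fails
    }

    Job sortJob = Job.getInstance(conf, "sort by frequency");
    sortJob.setJarByClass(WordFrequencyChain.class);
    sortJob.setInputFormatClass(SequenceFileInputFormat.class);
    sortJob.setMapperClass(InverseMapper.class); // (word, count) -> (count, word)
    sortJob.setNumReduceTasks(1);                // single, globally sorted output file
    sortJob.setOutputKeyClass(IntWritable.class);
    sortJob.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(sortJob, counts);
    FileOutputFormat.setOutputPath(sortJob, sorted);
    System.exit(sortJob.waitForCompletion(true) ? 0 : 1);
  }
}
```

The first stage writes its (word, count) pairs as a SequenceFile so the second stage can read them with their types intact; the shuffle sort in the second stage then orders the inverted (count, word) pairs by count.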
Expected Outcomes
• Provide education and training that allows researchers to think effectively
about big data and to use these technologies in their research and daily work.
• Improved data collection and management practices
by researchers
• Development of new techniques for collaboration on a
joint course across the Atlantic with a shared lab
infrastructure for lab assignments.
Conclusions
• There is a need for data intensive training and education for scientists and
engineers
– Effectively use existing technologies
– Develop new disciplinary practices for annotating and preserving valuable data
– Understand the critical need for data curation for the viability and long-term accessibility of data
• We are developing an education and research program focused on these issues
– Short course
– Semester length joint course at University of Stavanger and Purdue University
• Holding a symposium at the CloudCom conference in December
– DataCom: Symposium on High Performance and Data Intensive Computing
– Thomas Hacker, Purdue University, USA, and Tomasz Wiktor Wlodarczyk,
University of Stavanger, Norway
– DataCom is organized under CloudCom as two tracks: Big Data and HPC on Cloud