Slides - School of Electrical Engineering and Computer Science

Research Opportunities and Challenges in Big Data
BigDat2015 Lessons
Haruna Isah
Supervisors: Daniel Neagu and Paul Trundle
Artificial Intelligence Research Group
Haruna Isah is a Commonwealth Scholar Funded by the UK Government
11 June, 2015
Outline
• Introduction
• Big Data Fundamentals
• Big Data Research Challenges
• Open Problems in Big Data Research
• My PhD Work
• Conclusion
• References
• Links/Resources
Introduction
BigDat2015
• Research training: updating researchers about the most recent
developments and research challenges in big data.
– 4 Keynotes, 19 Courses
– Most slides centred around BigDat2015
• Coverage
– Foundation, Infrastructure, Management, Search and Mining,
Security and Privacy, Applications
• Talk Goals:
– To uncover ideas and open problems in big data and related areas
– To discuss my PhD work for suggestions and collaborations
– To practice time management during academic presentations
– To provide links to available BigDat2015 materials
Big Data Fundamentals
What is Big Data?
• Very large and complex data
– difficult to capture, process, and analyse using current computing
infrastructure.
• Big data is characterised by:
– Volume (data size): TB (2^40 bytes), PB (2^50 bytes), and exabytes (2^60 bytes);
– Velocity: rate at which data arrive
– Variety: different data types, representation and semantics;
– Variability: rate of change in data characteristics
– Veracity: uncertainty, inconsistency, and privacy issues;
– Value: usefulness
Big Data Fundamentals Cont’d
Big Data Examples [Vijay1]:
NASA’s Solar Dynamics Observatory (SDO): uses 4 telescopes to
gather 8 images of the Sun every 12s.
• 1PB publicly accessible online, to grow at 0.5 PB/yr.
Big Data Fundamentals Cont’d
Big Data Examples [Vijay1]:
The Large Hadron Collider: world's largest machine
• Uses 150M sensors to capture data at about 600M collisions per s.
Big Data Fundamentals Cont’d
Big Data Examples [Blockeel]:
Square Kilometre Array: the world's largest radio telescope, due to be ready in 2024.
• Expected to produce raw data at 157 TB/s for creating the biggest map of the
universe.
Big Data Fundamentals Cont’d
Big Data Examples [Vijay1]:
Social Media:
• 300M new photos/day uploaded on Facebook; 300M Instagram users
share 60M photos/day; >100 hrs/min video uploaded to YouTube.
Big Data Fundamentals Cont’d
Big Data Applications: (predictions, recommendations, etc.)
Security [Pentagon]:
• Social media is the next battleground. Big social data can serve as:
– an early-warning system of trouble brewing
– a leading indicator of imminent action by a potential troublemaker.
Product Anti-counterfeiting Intelligence [FDA]:
• Social media listening for:
– early detection of adverse events and food-borne illness
– product quality surveillance
Business Intelligence [Other8]:
• product recommendation (Amazon)
Big Data Fundamentals Cont’d
Big Data Growth [Vijay1]:
• to continue as we shift toward:
– GB networks, gigapixel cameras, data-intensive Internet of Things
(IoT), etc.
Big Data Usage [Vijay1]:
• In 2013:
– only 22% of data was considered useful,
– less than 5% of that amount was actually analysed.
• By 2020:
– more than 35% of all data could be considered useful:
• in scientific discovery or process optimization.
Big Data Fundamentals Cont’d
Big Data in Scientific Research
New paradigm of research!
• More data; fewer models! [Fox1][Other5][Other6][Other7].
Big Data Research Challenges
With all of its promise, big data presents problems for researchers:
• Challenges:
– storage: how to capture, store big data
– transfer: how to transfer, share big data
– interestingness: how to clean, search, visualise, analyse big data
– privacy: how to secure big data
Automated Management Challenge
• We now need more storage capabilities in our labs!
– Must every lab be filled with computers?
• How can we tame big data and make it less powerful/easier to control?
• The automated management challenge requires cloud discovery
[Foster1].
Big Data Research Challenges Cont’d
Cloud computing: provides a scalable computing infrastructure,
platform and software services through which large datasets can be
stored and analysed.
• Infrastructure services: servers, storage disks, web hosting, etc.
• Platform services: environments for running/building software apps
• Software services: software delivered over the web
Big Data Research Challenges Cont’d
Massive Flow Challenge
Data must move to be useful [Other2]:
• how can TB of data be transferred between labs in cities, countries?
• FTP and Secure Copy (SCP) not sufficient [Foster3].
• Big data transfer challenge requires new protocols and algorithms.
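To give a sense of scale for the transfer question above, the following minimal sketch estimates idealised transfer times for one binary terabyte (2**40 bytes). The link speeds and the default of 100% link utilisation are illustrative assumptions, not figures from the talk; real transfers are slower because of protocol overhead, latency, and contention.

```python
def transfer_time_seconds(size_bytes: float,
                          link_bits_per_s: float,
                          efficiency: float = 1.0) -> float:
    """Idealised time to move size_bytes over a link of the given speed.

    efficiency is the assumed fraction of the nominal link speed that is
    actually achieved (1.0 means a perfect, fully used link).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return size_bytes * 8 / (link_bits_per_s * efficiency)


TB = 2 ** 40  # one binary terabyte, in bytes

if __name__ == "__main__":
    # Hypothetical link speeds chosen only for comparison.
    for label, speed in [("100 Mbit/s", 100e6),
                         ("1 Gbit/s", 1e9),
                         ("10 Gbit/s", 10e9)]:
        hours = transfer_time_seconds(TB, speed) / 3600
        print(f"1 TB over {label}: {hours:.1f} h")
```

Lowering `efficiency` (for example to 0.3) gives a rough feel for how far a single-stream transfer can fall behind the nominal link speed.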
Big Data Research Challenges Cont’d
Current innovations aimed at tackling massive flow challenge include:
• GridFTP: high performance, secure, reliable protocol for bulk data
transfer optimized for high-bandwidth wide-area networks [Foster3].
• Globus: a hosted provider of high-performance, reliable, and secure
data transfer, synchronisation, and sharing [Foster2].
Privacy and Security Challenges
• The cloud allows users and organizations to rely on external providers.
– This leads to new security and privacy problems: users lose control
over their data.
Interestingness Challenge
• How can we search and extract useful information from big data?
• How can we analyse PB of complex data, coming our way at GB/s?
Searching for interesting information in big data requires discovery engines,
e.g. KBase [Other3] and MG-RAST [Other4].
Big Data Research Challenges Cont’d
KBase’s large collection of reference data includes over 23,000 plant and
microbial genomes and over 15,000 metagenomics datasets.
Big Data Research Challenges Cont’d
Analysis or knowledge discovery in big data
Big Data Research Challenges Cont’d
Analysis of big data is crucial because:
• it is often the end goal of big data
• collecting, storing, retrieving data uses analysis methods.
Several successful big data analysis use cases are compiled in [Fox3].
• Classical data mining algorithms suffer computational challenges
when processing big data:
– storage, processing time issues
• Adaptations to the algorithms have been made to tackle these challenges, e.g.
– [Blockeel] adaptations to decision trees in the context of big data
– [Suykens][Jipei1][Other9] adaptations to SVM & Kernel methods
• Other scalable data mining approaches for speeding big data
processing such as cloud, parallel and distributed computing can be
found in:
– [Chang][Talia]
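One common way around the storage issue noted above is to process the data in a single pass without holding it all in memory. The sketch below uses Welford's online algorithm for the mean and sample variance; it is an illustrative example of a streaming computation and is not one of the adaptations presented in the cited courses.

```python
from typing import Iterable, Tuple


def streaming_mean_variance(values: Iterable[float]) -> Tuple[int, float, float]:
    """Single-pass mean and sample variance (Welford's algorithm).

    Uses O(1) memory, so `values` can be a generator over a file or stream
    that is far larger than RAM.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance
```

Because the function only iterates once, it can be fed a generator expression such as `streaming_mean_variance(float(line) for line in open(path))`, where `path` is a hypothetical file with one number per line.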
Open Problems in Big Data Research
• There is a need for new metrics, techniques, and analyses for finding relevant
data, e.g. selecting a representative (diverse) subset:
– given a large set of data S, select a small representative subset S’ of S
that covers all aspects of S.
• There is a need to better integrate HPC and data analytics (machine
learning).
• There is a need to identify or develop a parallel large-scale data analytics
library, SPIDAL (Scalable Parallel Interoperable Data Analytics Library), of
similar quality to PETSc and ScaLAPACK, which have been very
influential in the success of HPC for simulations.
• There is a need to clearly separate, as independent requirements:
– the needs of data management and
– the needs of data analytics.
Each suggests different software and hardware characteristics, and confusing
them produces machines that do not support parallel machine
learning.
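The representative-subset problem in the first bullet above can be approached in many ways. As one hedged illustration, the sketch below uses greedy farthest-first traversal (a standard heuristic for the k-center problem): it repeatedly adds the point that is farthest from everything chosen so far. The choice of Euclidean distance, the starting point, and the function name are assumptions made for this example, not a method proposed in the talk.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def farthest_first(points: List[Point], k: int) -> List[int]:
    """Return indices of k points chosen by greedy farthest-first traversal."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [0]  # assumption: start from the first point
    # Distance from every point to its nearest chosen representative.
    nearest = [math.dist(p, points[0]) for p in points]
    while len(chosen) < k:
        # Pick the not-yet-chosen point that is worst covered so far.
        idx = max((i for i in range(len(points)) if i not in chosen),
                  key=nearest.__getitem__)
        chosen.append(idx)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[idx]))
    return chosen
```

Each round costs O(N) distance evaluations, so selecting k representatives takes O(kN) time, which stays practical when k is much smaller than N.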
Open Problems in Big Data Research Cont’d
• There is a need for a software model, HPC-ABDS (HPC Apache Big Data
Stack), to achieve the functionality of commodity big data and the performance
of HPC.
• There is a need for efficient parallel algorithms for streaming, and for
multiscale adaptive algorithms that reduce time complexity from O(N^2) to
O(N).
• There is a need for solutions to protect and securely process data in the
cloud.
• There is a need for more adaptations to classical algorithms in order to
tackle big data challenges.
• There is a need to promote data science as an academic discipline.
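As a toy illustration of the complexity reduction mentioned in the list above (from quadratic to linear in N), the sketch below computes the sum of squared differences over all pairs of values in two ways. It relies on a simple algebraic identity and is not a multiscale or streaming method from the courses; it only shows how restructuring a computation can remove a quadratic pairwise loop.

```python
from itertools import combinations
from typing import Sequence


def pairwise_sq_diff_naive(xs: Sequence[int]) -> int:
    """O(N^2): visit every unordered pair explicitly."""
    return sum((a - b) ** 2 for a, b in combinations(xs, 2))


def pairwise_sq_diff_linear(xs: Sequence[int]) -> int:
    """O(N): uses sum_{i<j} (x_i - x_j)^2 = N * sum(x_i^2) - (sum x_i)^2."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return n * total_sq - total * total
```

Both functions return the same value for integer input; only the second remains feasible when N reaches the millions.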
My PhD Research
Research Title: Knowledge Discovery in Drugs Anti-counterfeiting
MPhil to PhD Research [Isah]:
The problem:
• counterfeit products: deadly, financial losses
• current strategies: failed, but provide data for KD
Research Goal:
• Harness counterfeit related data, observe, measure, develop
models/algorithms/systems for predictive anti-counterfeiting
Contributions:
• A product safety framework using text mining and sentiment analysis.
• We applied the framework to feedback from cosmetic product users.
• We demonstrated the development of a custom lexicon and training data
and built a naïve Bayes classifier for sentiment prediction.
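The classification step can be sketched as follows. This is a minimal multinomial naïve Bayes classifier with Laplace smoothing, written in Python purely for illustration; the work described above used R packages. The tiny training set, the whitespace tokenisation, and the class labels are invented for the example and are not taken from [Isah].

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

Model = Tuple[Dict[str, Counter], Counter, Set[str]]


def train(docs: List[Tuple[str, str]]) -> Model:
    """Count words per class; docs is a list of (text, label) pairs."""
    word_counts: Dict[str, Counter] = defaultdict(Counter)
    class_counts: Counter = Counter()
    vocab: Set[str] = set()
    for text, label in docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict(text: str, model: Model) -> str:
    """Return the label with the highest log posterior for the text."""
    word_counts, class_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = "", -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / n_docs)  # class prior
        total = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # words never seen in training are ignored
            # Laplace (add-one) smoothed word likelihood
            score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled comments, standing in for real training data.
TOY_DATA = [
    ("love this product", "pos"),
    ("great quality love it", "pos"),
    ("caused a rash", "neg"),
    ("terrible product caused irritation", "neg"),
    ("where can i buy this", "neu"),
]
MODEL = train(TOY_DATA)
```

With this toy model, `predict("love the quality", MODEL)` scores highest for the positive class. A custom lexicon, as mentioned above, would typically shape either the tokenisation or the labelling of the training data.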
My PhD Research Cont’d
Results
We extracted user comments on the Avon, Dove and Oral B Facebook pages:
• Comments on brand advertisement posts that:
– do not offer a prize in return, i.e. Y and Z
– offer a prize in return, i.e. X
• Interesting result:
– all positively skewed
– neg:neu:pos for X, Y and Z = 1:42:175, 1:3:3 and 1:5:5 respectively.
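To make the skew easier to compare across pages, the ratios above can be converted to percentage shares. The sketch below only rescales the reported neg:neu:pos ratios; it does not use the underlying comment counts, which are not given here.

```python
from typing import Dict, Tuple

# neg:neu:pos ratios exactly as reported above
RATIOS: Dict[str, Tuple[int, int, int]] = {
    "X": (1, 42, 175),
    "Y": (1, 3, 3),
    "Z": (1, 5, 5),
}


def shares(ratio: Tuple[int, int, int]) -> Tuple[float, ...]:
    """Convert a neg:neu:pos ratio into percentage shares."""
    total = sum(ratio)
    return tuple(100 * part / total for part in ratio)


if __name__ == "__main__":
    for page, ratio in RATIOS.items():
        neg, neu, pos = shares(ratio)
        print(f"{page}: neg {neg:.1f}%, neu {neu:.1f}%, pos {pos:.1f}%")
```

On these ratios the positive share is far larger for X (the prize-offering post) than for Y and Z, where positive and neutral comments are evenly split.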
My PhD Research Cont’d
Software Packages Used:
• R: environment for statistical computing and graphics.
• tm package: framework for text mining applications in R.
• Rfacebook package: access to Facebook Graph API via R.
• streamR package: access to Twitter Streaming API via R.
• twitteR package: interface to the Twitter RESTful API.
• stringr package: string manipulation functions.
• plyr package: for split-apply-combine (SAC) procedures.
• ggplot2 package: for creating elegant and complex plots.
• sentiment package: tools for sentiment analysis in R.
• e1071 package: functions for various machine learning algorithms.
• gmodels package: tools for model fitting.
• Plus their dependencies.
My PhD Research Cont’d
Current Research
• My current work is centred around the design and development of
models and algorithms for crime intelligence in cyberspace.
– I employ a variety of graph and link mining methods to characterise
and combat cybercrime on the Web.
• The Problem:
– Vulnerability of Web search and ranking algorithms to
manipulations by commercial interests.
– Web and Social Spamming
– Community Structure or Organisational Strategy of Cybercriminals
• Proposed Solutions:
– Bipartite model for uncovering hidden ties in crime data
(undergoing peer review)
My PhD Research Cont’d
Collaborative Opportunities
• Large Scale Data Analysis with R and Hadoop
• Developing Eigenvalue and Large Matrix Solvers
• Developing Measures and Strategies for Mitigating the Manipulation of
Popular Ranking Algorithms such as PageRank and HITS.
• Models and Algorithms for Measuring and Combatting Web and Social
Spamming.
• Etc.
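To illustrate why ranking algorithms such as PageRank are open to manipulation, here is a minimal power-iteration PageRank on a toy graph, followed by the same graph with a small link farm added. The graphs, the damping factor of 0.85, and the node names are illustrative assumptions; this is a sketch of the standard algorithm, not of the mitigation measures proposed for collaboration.

```python
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]],
             damping: float = 0.85,
             iters: int = 100) -> Dict[str, float]:
    """Power-iteration PageRank; graph maps each node to the nodes it links to."""
    nodes = sorted(set(graph) | {v for targets in graph.values() for v in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


# Toy web graph: "spam" links out but nobody links to it.
HONEST = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "spam": ["a"]}
# Hypothetical link farm: three extra pages that all point to "spam".
FARMED = {**HONEST, "f1": ["spam"], "f2": ["spam"], "f3": ["spam"]}
```

Comparing `pagerank(HONEST)["spam"]` with `pagerank(FARMED)["spam"]` shows the farmed page gaining rank purely from pages created to link to it, which is the kind of manipulation that mitigation strategies would need to detect.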
Conclusions
Today, we can collect massive amounts of data at a very high rate.
• What can we do with all these data?
– In theory, a lot! In practice, it depends on today’s technology and
users’ skills!
Most classical data mining algorithms suffer computational challenges
when processing big data, e.g. speed and memory limitations.
• There is a need for more research in both the algorithmic and system
domains to tackle these challenges.
BigDat2015 areas not covered in the presentation:
• Big Data Visualisation, Big Data Challenges in Simulation-based
Science, DataSpaces, Scholarly Big Data, End-User Access to Big Data
Using Ontologies, and Big Data Stream Processing.
References
[Foster1] Ian Foster: Taming Big Data
[Foster2] Efficient and Secure Transfer, Synchronization, and Sharing of Big Data
[Foster3] GridFTP
[Fox1] Big Data Applications & Analytics
[Fox2] Big Data Open Source Software and Projects
[Fox3] National Institute of Standards and Technology (NIST) Big Data Program
[Vijay1] Big Data: Promises and Problems
[Gropp] Using MPI I/O for Big Data
[Samarati] Data Security and Privacy in the Cloud
[Jipei] Sparse Learning with Efficient Projections
[Jipei1] Sparse Learning and Low Rank Modelling
[Pitoura] Big Data: Status and Challenges (Keynote)
[Blockeel] Decision trees for big data analytics
[Suykens] Fixed-size Kernel Models for Mining Big Data
[Chang] Big Data Analytics Architectures, Algorithms, and Applications
[Talia] Scalable Data Mining on Parallel, Distributed and Cloud Computing Systems
References Cont’d
[Isah] Isah, H.; Trundle, P.; Neagu, D., "Social media analysis for product safety using
text mining and sentiment analysis," 2014 14th UK Workshop on Computational
Intelligence (UKCI), pp. 1-7, 8-10 Sept. 2014. doi: 10.1109/UKCI.2014.6930158
[Pentagon] Defense Advanced Research Projects Agency’s (DARPA’s) Social Media in
Strategic Communication (SMISC) Program
[FDA] Use of Social Media to Inform and Evaluate FDA Risk Communications
[Other1] Big Data or Right Data?
[Other2] Low Latency and Big Data
[Other3] KBase
[Other4] MG-RAST (the Metagenomics RAST)
[Other5] The Fourth Paradigm: Data-Intensive Scientific Discovery
[Other6] Do We Need More Training Data or More Complex Models?
[Other7] Why More Data and Simple Algorithms Beat Complex Analytics Models
[Other8] Recommendation Engines: The Reason Why We Love Big Data
[Other9] Sparse Matrix Algorithms and Applied Mathematics
Links/Resources
International Winter School on Big Data (BigDat2015)
• http://grammars.grlmc.com/bigdat2015/
BigDat2015 Handouts
• http://www.computing.brad.ac.uk/uobacm/doc/BigDat2015.zip
University of Bradford ACM Chapter Website
• http://uob.acm.org/
My Webpage
• http://www.computing.brad.ac.uk/~hisah/
Tensor Methods for Big Data
• Tutorial on Spectral and Tensor Methods for Guaranteed Learning
• Era of Big Data Processing: A New Approach via Tensor Networks and
Tensor Decompositions
• Tensor Networks for Big Data Analytics and Large-Scale Optimization
Problems
Thank you for your attention.
Questions?
Suggestions
Etc.