Research Opportunities and Challenges in Big Data: BigDat2015 Lessons

Haruna Isah
Supervisors: Daniel Neagu and Paul Trundle
Artificial Intelligence Research Group
Haruna Isah is a Commonwealth Scholar funded by the UK Government
11 June, 2015

Outline
• Introduction
• Big Data Fundamentals
• Big Data Research Challenges
• Open Problems in Big Data Research
• My PhD Work
• Conclusion
• References
• Links/Resources

Introduction
BigDat2015
• Research training: updating researchers on the most recent developments and research challenges in big data.
– 4 keynotes, 19 courses
– Most slides centred around BigDat2015
• Coverage
– Foundation, Infrastructure, Management, Search and Mining, Security and Privacy, Applications
• Talk goals:
– To uncover ideas and open problems in big data and related areas
– To discuss my PhD work for suggestions and collaborations
– To practise time management during academic presentations
– To provide links to available BigDat2015 materials

Big Data Fundamentals
What is Big Data?
• Very large and complex data that are difficult to capture, process, and analyse using current computing infrastructure.
• Big data is characterised by:
– Volume (data size): terabytes (2^40 bytes), petabytes (2^50 bytes), and exabytes (2^60 bytes)
– Velocity: rate at which data arrive
– Variety: different data types, representations, and semantics
– Variability: rate of change in data characteristics
– Veracity: uncertainty, inconsistency, and privacy issues
– Value: usefulness

Big Data Examples [Vijay1]:
NASA's Solar Dynamics Observatory (SDO) uses 4 telescopes to gather 8 images of the Sun every 12 s.
• 1 PB publicly accessible online, expected to grow at 0.5 PB/yr.
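As a quick check of the figures above, the following Python sketch works through the arithmetic. It assumes the binary unit definitions given under Volume, a constant SDO cadence of 8 images every 12 s, and strictly linear archive growth of 0.5 PB/yr. The function names are illustrative only and do not come from the BigDat2015 material.

```python
# Binary byte units as quoted under "Volume".
TB = 2**40  # terabyte
PB = 2**50  # petabyte
EB = 2**60  # exabyte

SECONDS_PER_DAY = 24 * 60 * 60


def sdo_images_per_day(images_per_burst=8, seconds_per_burst=12):
    """Images per day, assuming a constant cadence (an assumption)."""
    return images_per_burst * SECONDS_PER_DAY // seconds_per_burst


def projected_archive_pb(years, start_pb=1.0, growth_pb_per_year=0.5):
    """Archive size in PB after `years`, assuming strictly linear growth."""
    return start_pb + growth_pb_per_year * years


if __name__ == "__main__":
    print("TB per PB:", PB // TB)
    print("SDO images per day:", sdo_images_per_day())
    print("Archive after 10 years (PB):", projected_archive_pb(10))
```

Under these assumptions the observatory gathers 57,600 images per day, and the 1 PB archive would reach 6 PB after ten years.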
Big Data Examples [Vijay1]:
The Large Hadron Collider: the world's largest machine
• Uses 150M sensors to capture data from about 600M collisions per second.

Big Data Examples [Blockeel]:
Square Kilometre Array: the world's largest radio telescope. Ready in 2024
• Expected to produce raw data at 157 TB/s for creating the biggest map of the universe.

Big Data Examples [Vijay1]:
Social media:
• 300M new photos/day uploaded to Facebook; 300M Instagram users share 60M photos/day; more than 100 hours of video uploaded to YouTube every minute.

Big Data Applications (predictions, recommendations, etc.)
Security [Pentagon]:
• Social media is the next battleground. Big social data can serve as:
– an early-warning system of trouble brewing
– a leading indicator of imminent action by a potential troublemaker.
Product Anti-counterfeiting Intelligence [FDA]:
• Social media listening for:
– early detection of adverse events and food-borne illness
– product quality surveillance
Business Intelligence [Other8]:
• product recommendation (Amazon)

Big Data Growth [Vijay1]:
• Expected to continue as we shift toward:
– GB networks, gigapixel cameras, the data-intensive Internet of Things (IoT), etc.
Big Data Usage [Vijay1]:
• In 2013:
– only 22% of data was considered useful,
– less than 5% of that amount was actually analysed.
• By 2020:
– more than 35% of all data could be considered useful in scientific discovery or process optimization.

Big Data in Scientific Research
New paradigm of research!
• More data, fewer models! [Fox1][Other5][Other6][Other7].
Big Data Research Challenges
With all of its promise, big data presents problems for researchers:
• Challenges:
– storage: how to capture and store big data
– transfer: how to transfer and share big data
– interestingness: how to clean, search, visualise, and analyse big data
– privacy: how to secure big data

Automated Management Challenge
• We now need more storage capabilities in our labs!
– Must every lab be filled with computers?
• How can we tame big data and make it less powerful/easier to control?
• The automated management challenge requires cloud discovery [Foster1].

Cloud computing provides a scalable computing infrastructure, platform, and software services through which large datasets can be stored and analysed.
• Infrastructure services: servers, storage disks, web hosting, etc.
• Platform services: environments for running/building software apps
• Software services: software delivered over the web

Massive Flow Challenge
Data must move to be useful [Other2]:
• How can TB of data be transferred between labs in different cities and countries?
• FTP and Secure Copy (SCP) are not sufficient [Foster3].
• The big data transfer challenge requires new protocols and algorithms.

Current innovations aimed at tackling the massive flow challenge include:
• GridFTP: a high-performance, secure, reliable protocol for bulk data transfer, optimized for high-bandwidth wide-area networks [Foster3].
• Globus: a hosted provider of high-performance, reliable, and secure data transfer, synchronisation, and sharing [Foster2].

Privacy and Security Challenges
• The cloud allows users and organizations to rely on external providers.
– This leads to new security and privacy problems: users lose control over their data.
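To put the massive flow challenge described above into numbers, the sketch below estimates how long a dataset takes to cross a network link. It is a back-of-the-envelope calculation only: the link speeds are illustrative assumptions, the `efficiency` parameter is a hypothetical stand-in for protocol overhead and congestion, and nothing here models GridFTP or Globus.

```python
def transfer_time_hours(size_bytes, link_bits_per_second, efficiency=1.0):
    """Idealised transfer time in hours.

    `efficiency` (0 < efficiency <= 1) is a hypothetical fraction of the
    nominal link rate that is actually achieved end to end.
    """
    if size_bytes < 0 or link_bits_per_second <= 0:
        raise ValueError("size must be >= 0 and link rate must be > 0")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = (size_bytes * 8) / (link_bits_per_second * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    TB = 2**40  # bytes
    # Link rates below are assumptions chosen for illustration.
    for label, rate in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
        print(f"1 TB over {label}: {transfer_time_hours(TB, rate):.2f} h")
```

Even at a fully utilised 1 Gbit/s, a single terabyte needs about 2.4 hours, and any efficiency below 1 lengthens that proportionally.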
Interestingness Challenge
• How can we search and extract useful information from big data?
• How can we analyse PB of complex data arriving at GB/s?
Searching for interesting information in big data requires discovery engines, e.g. KBase [Other3] and MG-RAST [Other4].

KBase's large collection of reference data includes over 23,000 plant and microbial genomes and over 15,000 metagenomics datasets.

Analysis or knowledge discovery in big data
Analysis of big data is crucial because:
• it is often the end goal of big data
• collecting, storing, and retrieving data all use analysis methods.
Several successful big data analysis use cases are compiled in [Fox3].
• Classical data mining algorithms suffer computational challenges when processing big data:
– storage and processing time issues
• Adaptations have been made to the algorithms to tackle these challenges, e.g.
– [Blockeel]: adaptations to decision trees in the context of big data
– [Suykens][Jipei1][Other9]: adaptations to SVM and kernel methods
• Other scalable data mining approaches for speeding up big data processing, such as cloud, parallel, and distributed computing, can be found in [Chang][Talia].

Open Problems in Big Data Research
• There is a need for new metrics, techniques, and analyses for finding relevant data, e.g. selecting representatives (diversity):
– given a large set of data S, select a small representative subset S' of S that covers all aspects of S.
• There is a need to better integrate HPC and data analytics (machine learning).
• There is a need to identify/develop a parallel large-scale data analytics library, SPIDAL (Scalable Parallel Interoperable Data Analytics Library), of similar quality to PETSc and ScaLAPACK, which have been very influential in the success of HPC for simulations.
• There is a need to clearly separate, as independent requirements:
– the needs of data management and
– the needs of data analytics.
Each suggests different software and hardware characteristics, and confusing them generates machines that do not support parallel machine learning.
• There is a need for a software model, HPC-ABDS (HPC Apache Big Data Stack), to achieve the functionality of commodity big data and the performance of HPC.
• There is a need for efficient parallel algorithms for streaming and multiscale adaptive algorithms, reducing time complexity from O(N^2) to O(N).
• There is a need for solutions to protect and securely process data in the cloud.
• There is a need for more adaptations to classical algorithms in order to tackle big data challenges.
• There is a need to promote Data Science as an academic discipline.

My PhD Research
Research Title: Knowledge Discovery in Drugs Anti-counterfeiting
MPhil to PhD Research [Isah]:
The problem:
• counterfeit products: deadly, financial losses
• current strategies: failed, but provide data for knowledge discovery (KD)
Research goal:
• Harness counterfeit-related data; observe, measure, and develop models/algorithms/systems for predictive anti-counterfeiting
Contributions:
• A product safety framework using text mining and sentiment analysis.
• We applied the framework to feedback from users of cosmetics products.
• We demonstrated the development of a custom lexicon and training data, and modelled a naïve Bayes classifier for sentiment prediction.
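The naïve Bayes step mentioned in the contributions above can be illustrated with a minimal, self-contained Python sketch. This is not the implementation from [Isah]: it is a generic multinomial naïve Bayes with add-one (Laplace) smoothing, and the six training comments are invented purely for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts, "vocab": vocab}


def predict(model, text):
    """Return the class with the highest log-posterior for `text`."""
    # Words never seen in training are ignored.
    tokens = [t for t in text.lower().split() if t in model["vocab"]]
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in tokens:
            # Add-one smoothing avoids zero probabilities.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy comments, NOT data from the study.
TOY_COMMENTS = [
    ("love this product", "pos"),
    ("great smell and great price", "pos"),
    ("caused a rash", "neg"),
    ("terrible burning rash", "neg"),
    ("where can i buy it", "neu"),
    ("is it sold online", "neu"),
]

model = train_naive_bayes(TOY_COMMENTS)

if __name__ == "__main__":
    for comment in ["great product", "rash"]:
        print(comment, "->", predict(model, comment))
```

Working in log space keeps the product of many small probabilities from underflowing. A real study would also need tokenisation, a domain lexicon, and far more training data than this toy set.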
Results
We extracted user comments from the Avon, Dove, and Oral B Facebook pages:
• Comments on brand advertisement posts that:
– do not offer a prize in return, i.e. Y and Z
– offer a prize in return, i.e. X
• Interesting result:
– all positively skewed
– neg:neu:pos for X, Y, and Z = 1:42:175, 1:3:3, and 1:5:5 respectively.

Software Packages Used:
• R: environment for statistical computing and graphics.
• tm package: framework for text mining applications in R.
• Rfacebook package: access to the Facebook Graph API via R.
• streamR package: access to the Twitter Streaming API via R.
• twitteR package: interface to the Twitter RESTful API.
• stringr package: string functions.
• plyr package: for split-apply-combine (SAC) procedures.
• ggplot2 package: for creating elegant and complex plots.
• sentiment package: tools for sentiment analysis in R.
• e1071 package: various machine learning functions.
• gmodels package: for model fitting.
+ Dependencies

Current Research
• My current work is centred around the design and development of models and algorithms for crime intelligence in cyberspace.
– I employ a variety of graph and link mining methods to characterise and combat cybercrime on the Web.
• The problem:
– Vulnerability of Web search and ranking algorithms to manipulation by commercial interests
– Web and social spamming
– Community structure or organisational strategy of cybercriminals
• Proposed solutions:
– Bipartite model for uncovering hidden ties in crime data: undergoing peer review

Collaborative Opportunities
• Large-Scale Data Analysis with R and Hadoop
• Developing Eigenvalue and Large Matrix Solvers
• Developing Measures and Strategies for Mitigating the Manipulation of Popular Ranking Algorithms such as PageRank and HITS.
• Models and Algorithms for Measuring and Combatting Web and Social Spamming.
• Etc.

Conclusions
Today, we can collect massive amounts of data at a very high rate.
• What can we do with all these data?
– In theory, a lot! In practice, it is a function of today's technology and users' skills!
Most classical data mining algorithms suffer computational challenges when processing big data, e.g. speed and memory limitations.
• There is a need for more research in both the algorithmic and system domains to tackle these challenges.
BigDat2015 areas not covered in the presentation:
• Big Data Visualisation, Big Data Challenges in Simulation-based Science, DataSpaces, Scholarly Big Data, End-User Access to Big Data Using Ontologies, and Big Data Stream Processing.

References
• [Foster1] Ian Foster: Taming Big Data
• [Foster2] Efficient and Secure Transfer, Synchronization, and Sharing of Big Data
• [Foster3] GridFTP
• [Fox1] Big Data Applications & Analytics
• [Fox2] Big Data Open Source Software and Projects
• [Fox3] National Institute of Standards and Technology (NIST) Big Data Program
• [Vijay1] Big Data: Promises and Problems
• [Gropp] Using MPI I/O for Big Data
• [Samarati] Data Security and Privacy in the Cloud
• [Jipei] Sparse Learning with Efficient Projections
• [Jipei1] Sparse Learning and Low Rank Modelling
• [Pitoura] Big Data: Status and Challenges (Keynote)
• [Blockeel] Decision trees for big data analytics
• [Suykens] Fixed-size Kernel Models for Mining Big Data
• [Chang] Big Data Analytics Architectures, Algorithms, and Applications
• [Talia] Scalable Data Mining on Parallel, Distributed and Cloud Computing Systems
• [Isah] Isah, H.; Trundle, P.; Neagu, D., "Social media analysis for product safety using text mining and sentiment analysis," Computational Intelligence (UKCI), 2014 14th UK Workshop on, pp. 1-7, 8-10
Sept. 2014, doi: 10.1109/UKCI.2014.6930158
• [Pentagon] Defense Advanced Research Projects Agency's (DARPA's) Social Media in Strategic Communication (SMISC) Program
• [FDA] Use of Social Media to Inform and Evaluate FDA Risk Communications
• [Other1] Big Data or Right Data?
• [Other2] Low Latency and Big Data
• [Other3] KBase
• [Other4] MG-RAST (the Metagenomics RAST)
• [Other5] The Fourth Paradigm: Data-Intensive Scientific Discovery
• [Other6] Do We Need More Training Data or More Complex Models?
• [Other7] Why More Data and Simple Algorithms Beat Complex Analytics Models
• [Other8] Recommendation Engines: The Reason Why We Love Big Data
• [Other9] Sparse Matrix Algorithms and Applied Mathematics

Links/Resources
International Winter School on Big Data (BigDat2015)
• http://grammars.grlmc.com/bigdat2015/
BigDat2015 Handouts
• http://www.computing.brad.ac.uk/uobacm/doc/BigDat2015.zip
University of Bradford ACM Chapter Website
• http://uob.acm.org/
My Webpage
• http://www.computing.brad.ac.uk/~hisah/
Tensor Methods for Big Data
• Tutorial on Spectral and Tensor Methods for Guaranteed Learning
• Era of Big Data Processing: A New Approach via Tensor Networks and Tensor Decompositions
• Tensor Networks for Big Data Analytics and Large-Scale Optimization Problems

Thanks for your audience.
???
Suggestions, etc.