Data-Intensive Applications, Challenges, Techniques and Technologies: Big Data Current and Future Research Frontiers

Big Data
Big Data has drawn huge attention from researchers in information sciences and from policy and decision makers in governments and enterprises.
Big Data is extremely valuable for raising productivity in businesses and for evolutionary breakthroughs in scientific disciplines, and it gives us many opportunities to make great progress in many fields.
Big Data also brings many challenges, such as difficulties in data capture, data storage, data analysis, and data visualization.
Big Data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from large datasets that are diverse, complex, and of a massive scale.

Characteristics of Big Data
Volume refers to the amount of all types of data generated from different sources, which continues to expand. The benefit of gathering large amounts of data includes uncovering hidden information and patterns through data analysis.
Variety refers to the different types of data collected via sensors, smartphones, or social networks. Such data types include video, image, text, audio, and data logs, in either structured or unstructured format. Most of the data generated from mobile applications are in unstructured format.
Velocity refers to the speed of data transfer.
The contents of data constantly change because of the absorption of complementary data collections, the introduction of previously archived data or legacy collections, and streamed data arriving from multiple sources.
Value is the most important aspect of Big Data. It refers to the process of discovering huge hidden values from large datasets of various types and rapid generation.

Big Data in Commerce and Business
Wal-Mart handled 267 million transactions per day in its 6000 stores worldwide (in 2014).
Wal-Mart collaborated with Hewlett Packard to establish a data warehouse with the capability to store 4 petabytes (4000 trillion bytes), tracing every purchase record from its point-of-sale terminals.
The company takes advantage of sophisticated machine learning techniques to exploit the knowledge hidden in this huge volume of data, and has successfully improved the efficiency of its pricing strategies and advertising campaigns.
The management of its inventory and supply chains also benefits significantly from the large-scale warehouse.

4V Categorization of IBM
Extracting Business Value from the 4 V's of Big Data

Big Data Classification

Categories of Big Data I - Data Sources
Social media is the source of information generated via URL to share or exchange information and ideas in virtual communities and networks, for example collaborative projects, blogs and microblogs, Facebook, and Twitter.
Machine-generated data are information automatically generated by hardware or software, such as computers, medical devices, or other machines, without human intervention.
Sensing devices exist to measure physical quantities and change them into signals.
Transaction data, such as financial and work data, comprise an event that involves a time dimension to describe the data.
IoT represents a set of objects that are uniquely identifiable as a part of the Internet.

IoT as Big Data Source
The objects of IoT include smartphones, digital cameras, and tablets.
When these devices connect with one another over the Internet, they enable smarter processes and services that support basic, economic, environmental, and health needs.
A large number of devices connected to the Internet provides many types of services and produces huge amounts of data and information.

Categories of Big Data II - Content Format
Structured data are often managed with SQL, a programming language created for managing and querying data in an RDBMS. Structured data are easy to input, query, store, and analyze. Examples of structured data include numbers, words, and dates.
Semi-structured data are data that do not follow a conventional database system. They may be in the form of structured data that are not organized in relational database models, such as tables. Capturing semi-structured data for analysis is different from capturing a fixed file format: it requires complex rules that dynamically decide the next process after capturing the data.
Unstructured data, such as text messages, location information, videos, and social media data, are data that do not follow a specified format. The size of this type of data continues to increase through the use of smartphones, and analyzing and understanding such data has become a challenge.

Categories of Big Data III - Data Stores
Document-oriented data stores are mainly designed to store and retrieve collections of documents or information, and to support complex data forms in several standard formats such as JSON, XML, and binary forms (e.g., PDF and MS Word). A document in a document-oriented data store is similar to a record or row in a relational database, but it is more flexible and can be retrieved based on its contents (e.g., MongoDB, SimpleDB, and CouchDB).
A column-oriented database stores its content in columns instead of rows, with attribute values belonging to the same column stored contiguously.
Column-oriented storage differs from classical database systems, which store entire rows one after the other.
A graph database is designed to store and represent data using a graph model with nodes, edges, and properties related to one another through relations, for example Neo4j.
A key-value store is an alternative to a relational database system that stores and accesses data and is designed to scale to a very large size. Dynamo is a good example of a highly available key-value storage system; it is used by amazon.com in some of its services.

Categories of Big Data IV - Data Staging
Cleaning is the process of identifying incomplete and unreasonable data.
Transform is the process of transforming data into a form suitable for analysis.
Normalization is the method of structuring a database schema to minimize redundancy.

Categories of Big Data V - Data Processing
Batch: MapReduce-based systems have been adopted by many organizations in the past few years for long-running batch jobs. Such systems allow applications to scale across large clusters of machines comprising thousands of nodes.
Real time: one of the powerful real-time, process-based Big Data tools is the Simple Scalable Streaming System (S4). S4 is a distributed computing platform that allows streams of data to be processed. It is a scalable, partially fault-tolerant, general-purpose, and pluggable platform.

Transforming Big Data Analysis

Structured and Unstructured Data Transformation
In the case of structured data, the data are pre-processed before they are stored in relational databases, to meet the constraints of schema-on-write. The second step is to retrieve the data for analysis.
In the case of unstructured data, the data must first be stored in distributed databases before they are processed for analysis. Unstructured data are retrieved from distributed databases after meeting the schema-on-read constraints.
For example, HBase.

Unified Architecture
The Apache Hadoop MapReduce framework was initially designed to perform batch processing on large amounts of data.
Tools such as Hive and Pig help to execute ad-hoc queries on historical data using a query language.
Processing with MapReduce and tools such as Pig and Hive is slow because of disk reads and writes during data processing.
A newer stack containing tools such as HBase and Impala enables interactive query processing for faster data access.
Apache Storm and Kafka handle streaming data and were introduced to fulfill the need for real-time analytics.

Batch Data Processing
Batch data processing is an efficient way of processing high volumes of data, in which a group of transactions is collected over a period of time.
Data are collected, entered, and processed, and then the batch results are produced.
Hadoop is focused on batch data processing.
Batch processing requires separate programs for input, process, and output.
Examples are payroll and billing systems.

Disadvantages of Batch Processing (Apache Hadoop MapReduce)
The limitations of this model are that it is expensive and complex.
It is hard to compute consistent metrics across these stacks.
Processing streaming data is slow with MapReduce because it uses disk to store intermediate results.

Real Time Data Processing
Real-time data processing involves continual input, processing, and output of data.
Data must be processed in a small time period (or near real time).
Examples are radar systems, customer services, and bank ATMs.

Apache Spark
Apache Spark introduced a unified architecture that combines streaming, interactive, and batch processing components.
It is easy to build applications using its powerful APIs in Java, Python, and Scala.
Real Time and Batch Processing Application
We can compare real-time analytics and batch processing applications with Hadoop MapReduce and Spark.

Batch and Real Time Data Processing Solutions

MapReduce and Hadoop
MapReduce has been used by Google to build scalable applications. MapReduce, as a programming model and an implementation for processing and generating large datasets, was created at Google in 2004 by Jeffrey Dean and Sanjay Ghemawat.
MapReduce is inspired by the "map" and "reduce" functions in Lisp.
MapReduce breaks an application into several small portions of the problem, each of which can be executed on any node in a computer cluster.
The "map" stage gives sub-problems to the nodes, and the "reduce" stage combines the results from all of those sub-problems.

MapReduce Etymology
In Lisp, the map function takes a function and a set of values as parameters. That function is then applied to each of the values. For example:
    (map 'list #'length '(() (a) (a b) (a b c)))
applies the length function to each of the four items in the list. Since length returns the length of an item, the result of map is a list containing the length of each item:
    (0 1 2 3)
The reduce function is given a binary function and a set of values as parameters. It combines all the values together using the binary function. If we use the + (add) function to reduce the list (0 1 2 3):
    (reduce #'+ '(0 1 2 3))
we get 6.

MapReduce Framework for Parallel Computing
Programmers get a simple API and do not have to deal with issues of parallelization, remote execution, data distribution, load balancing, or fault tolerance.
The framework makes it easy to use thousands of processors to process huge amounts of data (e.g., terabytes and petabytes).
From a user's perspective, there are two basic operations in MapReduce: Map and Reduce.
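The two Lisp operations have direct counterparts in Python, which may make the etymology easier to follow. The sketch below only mirrors the Lisp example with the built-in map and functools.reduce. It is not MapReduce itself, and the variable names are chosen here for illustration.

```python
from functools import reduce

# Four items of length 0, 1, 2 and 3, mirroring the Lisp example
items = [[], ["a"], ["a", "b"], ["a", "b", "c"]]

# map: apply len to every item; each application is independent of the others
lengths = list(map(len, items))

# reduce: combine all values pairwise with a binary function (addition)
total = reduce(lambda x, y: x + y, lengths)

print(lengths)  # [0, 1, 2, 3]
print(total)    # 6
```

Because each call to len depends only on its own item, the map step could run on many machines at once, while the reduce step needs the mapped values before it can combine them.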
The Operations of MapReduce
Map operation: each application of the function to a value can be performed in parallel (concurrently), because there is no dependence of one upon another.
Reduce operation: it can take place only after the map is complete.

Map and Reduce Functions
The Map function reads a stream of data and parses it into intermediate (key, value) pairs.
The Reduce function is called once for each unique key that was generated by Map. It is given the key and a list of all values that were generated for that key as parameters. The keys are presented in sorted order.

An Example of Using MapReduce - I
The task is counting the number of occurrences of each word in a large collection of documents.
The user-written Map function reads the document data and parses out the words. For each word, it writes the (key, value) pair (word, 1). The word is treated as the key, and the associated value of 1 means that we saw the word once.

An Example of Using MapReduce - II
This intermediate data is then sorted by key by MapReduce, and the user's Reduce function is called for each unique key.
Since the only values are counts of 1, Reduce is called with a list containing a "1" for each occurrence of the word that was parsed from the documents. The function simply adds them up to generate a total count for that word.

Comparison of Several Big Data Cloud Platforms

NoSQL Database
NoSQL, also read as "Not Only SQL", is a current approach to large-scale, distributed data management and database design. Its name easily leads to the misunderstanding that NoSQL means "not SQL", but NoSQL does not avoid SQL:
i) Some NoSQL systems are entirely non-relational.
ii) Some NoSQL systems simply avoid selected relational functionality such as fixed table schemas and join operations.
iii) Some analytic platforms, such as SQLstream and Cloudera Impala, still use SQL in their database systems, because SQL is a more reliable and simpler query language with high performance in real-time analytics on streaming Big Data.
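Returning to the word-count example from the MapReduce slides, the Map, sort/group, and Reduce steps can be sketched in plain Python. This is a minimal single-process sketch, not the Hadoop API. The function names map_words, shuffle, reduce_counts, and word_count are hypothetical, splitting on whitespace is an assumed way of parsing out words, and the in-memory grouping stands in for the sorting and distribution that a real framework performs across nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(document: str) -> Iterator[Tuple[str, int]]:
    """User-written Map: emit the pair (word, 1) for every word seen."""
    for word in document.split():
        yield (word, 1)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Framework step (simulated): group intermediate values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """User-written Reduce: add up the 1s collected for one word."""
    return (word, sum(counts))


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    intermediate = [pair for doc in documents for pair in map_words(doc)]
    grouped = shuffle(intermediate)
    # Reduce is called once per unique key, with keys in sorted order
    return dict(reduce_counts(word, grouped[word]) for word in sorted(grouped))


docs = ["big data is big", "data is valuable"]
counts = word_count(docs)
print(counts)
```

Only map_words and reduce_counts correspond to code that a MapReduce user would write. The grouping and ordering in between is what the framework provides.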
NoSQL Database for Unstructured or Non-Relational Data
Data storage and management are separated into two independent parts, in contrast to relational databases:
i) In the storage part, which is also called key-value storage, NoSQL focuses on the scalability of data storage with high performance.
ii) In the management part, NoSQL provides a low-level access mechanism. Data management tasks can be implemented in the application layer rather than having data management logic spread across SQL or DB-specific stored procedure languages.
NoSQL systems are very flexible for data modeling, and they make it easy to update application deployments.

HBase NoSQL Database System Architecture (Apache Hadoop)
HBase is one of the most widely used NoSQL databases.

An important property of most NoSQL databases is that they are commonly schema-free. The biggest advantage of a schema-free database is that it enables applications to quickly modify the structure of data without rewriting tables. It also offers greater flexibility when the structured data are heterogeneously stored. In the data management layer, the data are enforced to be integrated and valid.

One of the most popular NoSQL databases is Apache Cassandra. Cassandra was originally Facebook's proprietary database and was released as open source in 2008.
Other NoSQL implementations and related systems include SimpleDB, Google BigTable, Apache Hadoop, MapReduce, MemcacheDB, and Voldemort.
Companies that use NoSQL include Twitter, LinkedIn, and Netflix.

Twitter as an Example
Using data published in November 2012, two of Twitter's main operations are:
Post tweet: a user can publish a new message to their followers (4.6 k requests/sec on average, over 12 k requests/sec at peak).
Home timeline: a user can view tweets recently published by the people they follow (300 k requests/sec).

Simply handling 12,000 writes per second (the peak rate for posting tweets) would be fairly easy. However, Twitter's scaling challenge is not primarily due to tweet volume but due to fan-out*: each user follows many people, and each user is followed by many people.
*Fan-out is a term borrowed from electronics, where it describes the maximum number of digital inputs that the output of a single logic gate can feed. Here it describes the number of home timelines to which each tweet must be delivered.

Two Different Approaches for Tweet Implementation
1. Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all recent tweets for each of those users, and merge them (sorted by time). In a relational database, this would be a query along the lines of:
    SELECT tweets.*, users.*
    FROM tweets
    JOIN users ON tweets.sender_id = users.id
    JOIN follows ON follows.followee_id = users.id
    WHERE follows.follower_id = current_user
(Figure: simple relational schema for implementing a Twitter home timeline.)
With this first version, Twitter's systems struggled to keep up with the load of home timeline queries, so the company switched to the other solution.

Other Solution for Tweet Implementation
2. Maintain a cache for each user's home timeline, like a mailbox of tweets for each recipient user. When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap, because its result has been computed ahead of time.
This works better than the previous solution because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, so it is preferable to do more work at write time and less at read time.
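The two approaches above can be contrasted in a small in-memory sketch. It is an illustration under stated assumptions, not Twitter's implementation: the dictionaries stand in for database tables and caches, all function names are hypothetical, and returning the newest tweet first is an assumed reading of "sorted by time".

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Tweet = Tuple[int, str, str]  # (timestamp, sender, text)

follows: Dict[str, Set[str]] = defaultdict(set)    # follower -> followees
followers: Dict[str, Set[str]] = defaultdict(set)  # followee -> followers


def follow(follower: str, followee: str) -> None:
    follows[follower].add(followee)
    followers[followee].add(follower)


# Approach 1: cheap write, expensive read (work happens at read time)
all_tweets: List[Tweet] = []


def post_tweet_v1(tweet: Tweet) -> None:
    all_tweets.append(tweet)  # one insert into the global collection


def home_timeline_v1(user: str) -> List[Tweet]:
    # Scan for tweets whose sender the user follows, then sort by time
    mine = [t for t in all_tweets if t[1] in follows[user]]
    return sorted(mine, reverse=True)


# Approach 2: expensive write, cheap read (work happens at write time)
timeline_cache: Dict[str, List[Tweet]] = defaultdict(list)


def post_tweet_v2(tweet: Tweet) -> None:
    # Fan-out: one write per follower of the sender
    for follower in followers[tweet[1]]:
        timeline_cache[follower].append(tweet)


def home_timeline_v2(user: str) -> List[Tweet]:
    return sorted(timeline_cache[user], reverse=True)


follow("alice", "bob")
follow("alice", "carol")
for tweet in [(1, "bob", "hello"), (2, "carol", "hi"), (3, "dave", "unseen")]:
    post_tweet_v1(tweet)
    post_tweet_v2(tweet)

print(home_timeline_v1("alice"))
print(home_timeline_v2("alice"))
```

Both functions return the same timeline for alice. The difference lies in where the cost falls: approach 1 examines the global collection on every read, while approach 2 pays one cache write per follower when a tweet is posted.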
(Figure: Twitter's data pipeline for delivering tweets to followers, with load parameters as of November 2012.)

Twitter Implementation
Posting a tweet now requires a lot of extra work. On average, a tweet is delivered to about 75 followers, so 4.6 k tweets per second become 345 k writes per second to the home timeline caches.
This average hides the fact that the number of followers per user varies wildly, and some users have over 30 million followers. A single tweet may therefore result in over 30 million writes to home timelines.
Doing this in a timely manner (Twitter tries to deliver tweets to followers within 5 seconds) is a significant challenge.

Hybrid Approach for Twitter Implementation
Twitter is moving to a hybrid of both approaches.
Most users' tweets continue to be fanned out to home timelines at the time when they are posted, but a small number of users with a very large number of followers are excepted from this fan-out.
Tweets from those high-follower accounts that a user follows are fetched separately and merged with that user's home timeline when the timeline is read, as in the first approach.
This hybrid approach is able to deliver consistently good performance.

Describing Performance
Once you have described the load on your system, you can investigate what happens when the load increases. You can look at it in two ways:
When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?
When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?

Answer to the Questions
Both questions require performance numbers.
In a batch-processing system such as Hadoop, we usually care about throughput: the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size.
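The hybrid approach described for Twitter can be sketched as follows. This is a simplified illustration, not Twitter's code: the names and the CELEBRITY_THRESHOLD value are hypothetical (the source gives no cut-off), and the sketch assumes that follower counts do not change between posting and reading.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Tweet = Tuple[int, str, str]  # (timestamp, sender, text)

CELEBRITY_THRESHOLD = 2  # hypothetical cut-off, tiny so the demo stays small

follows: Dict[str, Set[str]] = defaultdict(set)    # follower -> followees
followers: Dict[str, Set[str]] = defaultdict(set)  # followee -> followers
tweets_by_sender: Dict[str, List[Tweet]] = defaultdict(list)
timeline_cache: Dict[str, List[Tweet]] = defaultdict(list)


def follow(follower: str, followee: str) -> None:
    follows[follower].add(followee)
    followers[followee].add(follower)


def is_celebrity(user: str) -> bool:
    return len(followers[user]) >= CELEBRITY_THRESHOLD


def post_tweet(tweet: Tweet) -> None:
    sender = tweet[1]
    tweets_by_sender[sender].append(tweet)
    if is_celebrity(sender):
        return  # excepted from fan-out; merged in at read time instead
    for follower in followers[sender]:
        timeline_cache[follower].append(tweet)


def home_timeline(user: str) -> List[Tweet]:
    merged = list(timeline_cache[user])  # precomputed part
    for followee in follows[user]:
        if is_celebrity(followee):
            merged.extend(tweets_by_sender[followee])  # fetched at read time
    return sorted(merged, reverse=True)  # newest first (assumed ordering)


follow("alice", "star")
follow("bob", "star")
follow("alice", "carol")
post_tweet((1, "carol", "hi"))
post_tweet((2, "star", "big news"))

print(home_timeline("alice"))
print(home_timeline("bob"))
```

In this run the tweet from star is never written to any timeline cache, yet it still appears in both timelines because it is merged in when they are read.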
In online systems, the response time of a service is usually more important: the time between a client sending a request and receiving a response.

Latency and Response Time
Latency and response time are often used synonymously, but they are not the same.
The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays.
Latency is the duration that a request is waiting to be handled, during which it is latent, awaiting service.

Relational Model vs. Document Model
In the 2010s, NoSQL is the latest attempt to overthrow the relational model's dominance.
The term NoSQL is unfortunate, since it doesn't actually refer to any particular technology; it was intended simply as a catchy Twitter hashtag for a meetup on open source, distributed, non-relational databases in 2009.
Nevertheless, the term struck a nerve and quickly spread through the web startup community and beyond.
A number of interesting database systems are now associated with the #NoSQL hashtag, and it has been retroactively re-interpreted as Not Only SQL.

The Adoption of NoSQL Databases
Several driving forces lie behind the adoption of NoSQL databases:
A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput.
A widespread preference for free and open source software over commercial database products.
Specialized query operations that are not well supported by the relational model.
Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model.

A LinkedIn Profile as a JSON Document
    {
      "user_id": 251,
      "first_name": "Bill",
      "last_name": "Gates",
      "summary": "Co-chair of the Bill & Melinda Gates... Active blogger.",
      "region_id": "us:91",
      "industry_id": 131,
      "photo_url": "/p/7/000/253/05b/308dd6e.jpg",
      "positions": [
        {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
        {"job_title": "Co-founder, Chairman", "organization": "Microsoft"}
      ],
      "education": [
        {"school_name": "Harvard University", "start": 1973, "end": 1975},
        {"school_name": "Lakeside School, Seattle", "start": null, "end": null}
      ],
      "contact_info": {
        "blog": "http://thegatesnotes.com",
        "twitter": "http://twitter.com/BillGates"
      }
    }

One-to-Many Relationships Forming a Tree Structure

The company name is not just a string, but a link to a company entity.

Extension to Many-to-Many Relationships
The data within each dotted rectangle can be grouped into one document, but the references to organizations, schools, and other users need to be represented as references, and they require joins when queried.

Schema Flexibility in the Document Model
Document databases are sometimes called schemaless, but the code that reads the data usually assumes some kind of structure. There is an implicit schema, but it is not enforced by the database.
A more accurate term is schema-on-read: the structure of the data is implicit, and only interpreted when the data is read.
Schema-on-write is the traditional approach of relational databases: the schema is explicit, and the database ensures all data conforms to it.

Schema-on-Read & Schema-on-Write
Schema-on-read is similar to dynamic (run-time) type-checking in programming languages.
Schema-on-write is similar to static (compile-time) type-checking.
Document Database
In a document database, we would just start writing new documents with the new fields, and the code in the application handles the case when old documents are read. For example:
    if (user && user.name && !user.first_name) {
        // Documents written before Dec 8, 2013 don't have first_name
        user.first_name = user.name.split(" ")[0];
    }

Statically Typed Database Schema
A 'statically typed' database schema would instead perform a migration along the lines of:
    ALTER TABLE users ADD COLUMN first_name text;
    UPDATE users SET first_name = split_part(name, ' ', 1);      -- PostgreSQL
    UPDATE users SET first_name = substring_index(name, ' ', 1); -- MySQL
Schema changes have a bad reputation of being slow and requiring downtime.