
Data-Intensive Applications,
Challenges, Techniques and
Technologies:
Big Data
Current and Future Research Frontiers
Big Data
Big Data has drawn huge attention from researchers in information sciences, and from policy and decision makers in governments and enterprises.
Big Data is extremely valuable for
 raising productivity in businesses
 enabling evolutionary breakthroughs in scientific disciplines
It gives us many opportunities to make great progress in many fields.
Big Data also arises with many challenges:
 difficulties in data capture
 data storage
 data analysis
 data visualization
Big Data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from data sets that are diverse, complex, and of massive scale.
Characteristics of Big Data
 Volume refers to the amount of all types of data generated from different sources, which continues to expand.
 The benefit of gathering large amounts of data is the discovery of hidden information and patterns through data analysis.
 Variety refers to the different types of data collected via sensors, smartphones, or social networks.
 Such data types include video, image, text, audio, and data logs, in either structured or unstructured format.
 Most of the data generated from mobile applications are in unstructured format.
Characteristics of Big Data
 Velocity refers to the speed of data transfer.
 The contents of data constantly change because of the
  absorption of complementary data collections
  introduction of previously archived data or legacy collections
  streamed data arriving from multiple sources
 Value is the most important aspect of Big Data.
 It refers to the process of discovering huge hidden values from large datasets of various types that are generated rapidly.
Big Data in Commerce and Business
267 million transactions per day in Wal-Mart's 6,000 stores worldwide (in 2014)
Wal-Mart collaborated with Hewlett Packard to establish a data warehouse
With a capability to store 4 petabytes (4,000 trillion bytes), tracking every purchase record from their point-of-sale terminals
The Wal-Mart company takes advantage of sophisticated machine learning techniques
By exploiting the knowledge hidden in this huge volume of data, they have successfully improved the efficiency of their pricing strategies and advertising campaigns.
The management of their inventory and supply chains also significantly benefits from the large-scale warehouse
4V Categorization of IBM
Extracting Business Value from the 4 V's of Big Data
Big Data Classification
Categories of Big Data
I - Data Sources
Social media is the source of information generated via URLs to share or exchange information and ideas in virtual communities and networks
For example: collaborative projects, blogs and microblogs, Facebook, and Twitter.
 Machine-generated data are information automatically generated from hardware or software
such as computers, medical devices, or other machines, without human intervention
 Sensing devices exist to measure physical quantities and change them into signals
 Transaction data, such as financial and work data, comprise events that involve a time dimension describing the data
 IoT represents a set of objects that are uniquely identifiable as a part of the Internet.
IoT as a Big Data Source
The objects of IoT include smartphones, digital cameras, and tablets.
When these devices connect with one another over the Internet, they enable smarter processes and services that support basic, economic, environmental, and health needs.
 A large number of devices connected to the Internet provides many types of services and produces huge amounts of data and information
Categories of Big Data
II - Content Format
Structured data are often managed with SQL, a programming language created for managing and querying data in RDBMSs
Structured data are easy to input, query, store, and analyze.
Examples of structured data include numbers, words, and dates.
Semi-structured data are data that do not follow a conventional database system.
Semi-structured data may be in the form of structured data that are not organized in relational database models, such as tables.
Capturing semi-structured data for analysis is different from capturing data in a fixed file format.
Capturing semi-structured data requires the use of complex rules that dynamically decide the next process after capturing the data
Categories of Big Data
II - Content Format
Unstructured data, such as text messages, location information, videos, and social media data, are data that do not follow a specified format.
The size of this type of data continues to increase through the use of smartphones.
 Analyzing and understanding such data has become a challenge
Categories of Big Data
III - Data Stores
Document-oriented data stores are mainly designed to
 store and retrieve collections of documents or information
 support complex data forms in several standard formats
  such as JSON, XML, and binary forms (e.g., PDF and MS Word)
A document-oriented data store is similar to a record or row in a relational database,
but it is more flexible and can retrieve documents based on their contents
(e.g., MongoDB, SimpleDB, and CouchDB).
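As a minimal sketch of the document-store access pattern (assuming a local MongoDB instance and the pymongo driver; the database, collection, and field names are made up for illustration):
from pymongo import MongoClient

# Assumes a MongoDB server running locally; names below are illustrative.
client = MongoClient("mongodb://localhost:27017")
articles = client["demo_db"]["articles"]

# Documents are flexible JSON-like records; no fixed schema is required.
articles.insert_one({"title": "Big Data", "tags": ["nosql", "storage"], "views": 10})

# Retrieval is by content, not by a fixed row identifier.
doc = articles.find_one({"tags": "nosql"})
print(doc["title"])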
A column-oriented database stores its content in columns rather than rows, with attribute values belonging to the same column stored contiguously.
This is different from classical database systems, which store entire rows one after the other.
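A toy illustration of the difference in plain Python (the records are hypothetical): a row store keeps each record together, while a column store keeps each attribute's values contiguously, which makes scanning a single column cheap.
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "product": "book", "price": 12.0},
    {"id": 2, "product": "pen",  "price": 1.5},
]

# Column-oriented layout: values of the same attribute are stored contiguously.
columns = {
    "id":      [1, 2],
    "product": ["book", "pen"],
    "price":   [12.0, 1.5],
}

# Aggregating one attribute only touches that column.
print(sum(columns["price"]))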
Categories of Big Data
III - Data Stores
A graph database is designed to store and represent data that utilize a graph model with nodes, edges, and properties related to one another through relations
 For example: Neo4j
 Key-value stores are an alternative to relational database systems that store and access data in a way designed to scale to a very large size
Dynamo is a good example of a highly available key-value storage system
 It is used by Amazon.com in some of its services
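A minimal sketch of the key-value access pattern (plain Python standing in for a distributed store such as Dynamo; real systems add partitioning and replication behind the same get/put interface):
class KeyValueStore:
    """Toy key-value store: the whole interface is get/put by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("cart:user-42", {"items": ["book", "pen"]})
print(store.get("cart:user-42"))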
Categories of Big Data
IV - Data Staging
Cleaning is the process of identifying incomplete and unreasonable data
Transformation is the process of transforming data into a form suitable for analysis.
Normalization is the method of structuring a database schema to minimize redundancy
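A small sketch of the cleaning and transformation steps using pandas (the file name and column names are hypothetical); schema normalization itself is a database-design step and is not shown here.
import pandas as pd

# Hypothetical raw extract with missing and duplicate records.
raw = pd.read_csv("transactions_raw.csv")

# Cleaning: drop incomplete or duplicated rows.
clean = raw.dropna(subset=["customer_id", "amount"]).drop_duplicates()

# Transformation: cast types and derive fields suitable for analysis.
clean["timestamp"] = pd.to_datetime(clean["timestamp"])
clean["amount"] = clean["amount"].astype(float)
clean.to_csv("transactions_clean.csv", index=False)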
Categories of Big Data
V - Data Processing
Batch: MapReduce-based systems have been adopted by many organizations in the past few years for long-running batch jobs
 Such systems allow for the scaling of applications across large clusters of machines comprising thousands of nodes.
Real-time: One of the powerful real-time big data tools is S4 (Simple Scalable Streaming System)
 S4 is a distributed computing platform that allows processing of continuous streams of data
 S4 is a scalable, partially fault-tolerant, general-purpose, and pluggable platform
Transforming Big Data Analysis
Structured and Unstructured Data Transformation
In the case of structured data, the data are pre-processed before they are stored in relational databases to meet the constraints of schema-on-write.
The second step is to retrieve the data for analysis.
In the case of unstructured data, the data must first be stored in distributed databases before they are processed for analysis.
Unstructured data are retrieved from distributed databases after meeting the schema-on-read constraints.
 For example, HBase.
Unified Architecture
The Apache Hadoop MapReduce framework was initially designed to perform batch processing on large amounts of data
Tools such as Hive and Pig help to execute ad-hoc queries on historical data using a query language.
Processing with MapReduce and tools such as Pig and Hive is slow due to disk reads and writes during data processing.
A newer stack, which contains tools such as HBase and Impala, enables interactive query processing to access data faster
Apache Storm and Kafka handle streaming data and were introduced to fulfill the need for real-time analytics
Batch Data Processing
Batch data processing is an efficient way of processing high volumes of data, where a group of transactions is collected over a period of time.
Data are collected, entered, and processed, and then the batch results are produced
Hadoop is focused on batch data processing
Batch processing requires separate programs for input, processing, and output.
Examples are payroll and billing systems.
Disadvantages of Batch Processing
(Apache Hadoop MapReduce)
The limitations of this model are that it’s expensive and
complex.
It is hard to compute consistent metrics across these stacks.
Processing streaming data is slow with MapReduce due to the use of disk for storing intermediate results.
Real-Time Data Processing
Real-time data processing involves a continual input, processing, and output of data.
 Data must be processed in a small time period (in real time or near real time).
Examples include radar systems, customer services, and bank ATMs.
Apache Spark
Apache Spark introduced a unified architecture
It combines streaming, interactive, and batch processing components
 It is easy to build applications using powerful APIs in Java, Python, and Scala.
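A minimal PySpark sketch of the batch side of this API (the input path and the "event_date" column are illustrative assumptions, not part of any particular dataset):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-example").getOrCreate()

# Batch step: read a historical dump and aggregate it.
# The path and column name are hypothetical.
events = spark.read.json("events/*.json")
daily = events.groupBy("event_date").count()
daily.show()

spark.stop()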
Real-Time and Batch Processing Applications
We can compare real-time analytics and batch processing applications built with Hadoop MapReduce and Spark
Batch and Real-Time Data Processing Solutions
MapReduce and Hadoop
MapReduce has been used by Google to build scalable applications.
In other words:
MapReduce, as a programming model and an implementation for processing and generating large data sets, was created at Google in 2004 by Jeffrey Dean and Sanjay Ghemawat.
MapReduce was inspired by the "map" and "reduce" functions in Lisp
MapReduce breaks an application into several small subproblems
Each of them can be executed on any node in a computer cluster.
The "map" stage distributes subproblems to the nodes of the cluster
The "reduce" stage combines the results from all of those different subproblems.
MapReduce Etymology
In Lisp, the map function takes a function and a set of values as parameters.
That function is then applied to each of the values.
 For example:
(map 'length '(() (a) (ab) (abc)))
applies the length function to each of the four items in the list.
Since length returns the length of an item, the result of map is a list containing the length of each item:
(0 1 2 3)
MapReduce Etymology
The reduce function is given a binary function and a set of values as parameters.
It combines all the values together using the binary function.
 If we use the + (add) function to reduce the list (0 1 2 3):
(reduce #'+ '(0 1 2 3))
we get 6
MapReduce Framework
for Parallel Computing
Programmers get a simple API and do not have to deal with issues of
parallelization, remote execution, data distribution, load balancing, or
fault tolerance.
The framework makes it easy for one to use thousands of processors
to process huge amounts of data (e.g., terabytes and petabytes).
From a user's perspective, there are two basic operations in
MapReduce: Map and Reduce.
The Operations of MapReduce
Map Operation: Each application of the function to a value can be
performed in parallel (concurrently)
There is no dependence of one upon another.
 Reduce Operation can take place only after the map is complete.
Map and Reduce Functions
The Map function reads a stream of data and parses it into
intermediate (key, value) pairs.
The Reduce function is called once for each unique key that was
generated by Map
The Reduce function is given the key and a list of all values that were
generated for that key as a parameter.
The keys are presented in sorted order.
An Example of Using MapReduce -I
The task is counting the number of occurrences of each word in a
large collection of documents.
The user-written Map function reads the document data and parses
out the words.
For each word, it writes the (key, value) pair of (word, 1).
The word is treated as the key and the associated value of 1 means
that we saw the word once.
An Example of Using MapReduce -II
This intermediate data is then sorted by MapReduce by key
The user's Reduce function is called for each unique key.
Since the only values are counts of 1, Reduce is called with a list containing a "1" for each occurrence of the word that was parsed from the document.
The function simply adds them up to generate a total word count for that word
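A self-contained Python sketch of the same word-count logic (not the Hadoop API; the map, shuffle-by-key, and reduce steps are spelled out explicitly to mirror the description above, and the sample documents are made up):
from collections import defaultdict

def map_fn(document):
    # Emit an intermediate (word, 1) pair for every word seen.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Called once per unique key with all values emitted for that key.
    return (word, sum(counts))

documents = ["big data is big", "data is valuable"]

# Shuffle: group intermediate values by key (done by the framework in MapReduce).
grouped = defaultdict(list)
for doc in documents:
    for word, one in map_fn(doc):
        grouped[word].append(one)

# Reduce each unique key; keys are presented in sorted order.
result = [reduce_fn(word, counts) for word, counts in sorted(grouped.items())]
print(result)  # [('big', 2), ('data', 2), ('is', 2), ('valuable', 1)]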
Comparison of Several Big Data Cloud Platforms
NoSQL Database
NoSQL, read as "Not Only SQL", is a current approach for large-scale and distributed data management and database design.
 Its name easily leads to the misunderstanding that NoSQL means "no SQL".
NoSQL does not avoid SQL:
i) Some NoSQL systems are entirely non-relational
ii) Some NoSQL systems simply avoid selected relational functionality such as fixed table schemas and join operations.
iii) Some analytic platforms like the SQLstream and Cloudera Impala series still use SQL in their database systems,
because SQL is a more reliable and simpler query language with high performance for real-time analytics on streaming Big Data.
NoSQL database
for Unstructured or Non-Relational Data
Data storage and management are separated into two independent parts.
This is contrary to relational databases.
i) In the storage part, which is also called key-value storage, NoSQL focuses on the scalability of data storage with high performance.
ii) In the management part, NoSQL provides a low-level access mechanism
Data management tasks can be implemented in the application layer rather than having data management logic spread across SQL or DB-specific stored-procedure languages
NoSQL systems are very flexible for data modeling
 NoSQL systems make it easy to update application deployments
HBase NoSQL Database System Architecture
(Apache Hadoop)
HBase is one of the most famous and widely used NoSQL databases
NoSQL Database
for Unstructured or Non-Relational Data
An important property of most NoSQL databases is that they are commonly schema-free.
The biggest advantage of schema-free databases is that they enable applications to quickly modify the structure of data without rewriting tables.
 They offer greater flexibility when structured data are stored heterogeneously.
In the data management layer, the data are enforced to be integrated and valid.
NoSQL database
for Unstructured or Non-Relational Data
The most popular NoSQL database is Apache Cassandra.
 Cassandra was originally Facebook's proprietary database and was released as open source in 2008.
Other NoSQL implementations include SimpleDB, Google BigTable, Apache Hadoop, MapReduce, MemcacheDB, and Voldemort.
Companies that use NoSQL include Twitter, LinkedIn, and Netflix
Twitter as an Example
using data published in November 2012
Two of Twitter’s main operations are:
Post tweet
A user can publish a new message to their followers (4.6 k requests/sec
on average, over 12 k requests/sec at peak).
Home timeline
A user can view tweets recently published by the people they follow
(300 k requests/sec)
Twitter as an Example
Simply handling 12,000 writes per second (the peak rate for posting tweets) would be fairly easy.
However,
Twitter's scaling challenge is not primarily due to tweet volume;
the challenge is due to fan-out*
Each user follows many people, and each user is followed by many people.
*Fan-out is a term borrowed from electronic engineering, where it describes the maximum number of digital inputs that the output of a single logic gate can feed; here it describes the number of follower timelines to which each tweet must be delivered.
Two Different Approaches to Tweet Implementation
1. Posting a tweet simply inserts the new tweet into a global collection of tweets.
When a user requests their home timeline, look up all the people they follow, find all recent tweets for each of those users, and merge them (sorted by time).
In a relational database, this would be a query along the lines of:
SELECT tweets.*, users.* FROM tweets
JOIN users ON tweets.sender_id = users.id
JOIN follows ON follows.followee_id = users.id
WHERE follows.follower_id = current_user
Simple relational schema for implementing a Twitter home timeline
In this first version of Twitter, the systems struggled to keep up with the load of home timeline queries.
Therefore, the company switched to the second approach.
Other Solution for Tweet Implementation
2. Maintain a cache for each user's home timeline, like a mailbox of tweets for each recipient user
When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches.
Then the request to read the home timeline is cheap,
because
the result has been computed ahead of time
This approach works better than the previous solution because
the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads,
so it is preferable to do more work at write time and less at read time.
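A toy sketch of this fan-out-on-write approach (plain Python dictionaries standing in for the follower graph and the per-user timeline caches; at Twitter's scale these would be distributed stores):
from collections import defaultdict

followers = {"alice": ["bob", "carol"], "bob": ["carol"]}  # author -> that author's followers (illustrative)
timelines = defaultdict(list)                              # per-user home timeline cache

def post_tweet(author, text):
    # Write path: fan the tweet out to every follower's cache.
    for follower in followers.get(author, []):
        timelines[follower].insert(0, (author, text))

def home_timeline(user, limit=20):
    # Read path: the timeline is already materialized, so reads are cheap.
    return timelines[user][:limit]

post_tweet("alice", "hello, world")
print(home_timeline("carol"))  # [('alice', 'hello, world')]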
Twitter’s data pipeline for delivering tweets to followers, with load
parameters as of November 2012
Twitter Implementation
Posting a tweet requires a lot of extra work.
On average, a tweet is delivered to about 75 followers, so 4.6 k tweets per second become 345 k writes per second to the home timeline caches.
This average hides the fact that the number of followers per user varies wildly, and some users have over 30 million followers.
This means that a single tweet may result in over 30 million writes to home timelines!
 Doing this in a timely manner (Twitter tries to deliver tweets to followers within five seconds) is a significant challenge.
Hybrid Approach for Twitter Implementation
Twitter is moving to a hybrid of both approaches.
Most users' tweets continue to be fanned out to home timelines at the time when they are posted,
 but a small number of users with a very large number of followers are excepted from this fan-out
When the home timeline is read, the tweets of those excepted users whom the reader follows are fetched separately and merged with the home timeline, as in the first approach
This hybrid approach is able to deliver consistently good performance.
Describing Performance
Once you have described the load on your system, you can investigate what happens when the load increases.
You can look at it in two ways:
When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?
When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?
Answer to the Questions
Both questions require performance numbers.
In a batch-processing system such as Hadoop, we usually care about throughput:
 the number of records we can process per second,
 or the total time it takes to run a job on a dataset of a certain size.
In online systems, the response time of a service is usually more important:
 the time between a client sending a request and receiving a response.
Latency and Response Time
Latency and response time are often used synonymously, but they are not the same.
The response time is what the client sees: besides the actual time to process the request (the service time),
it includes network delays and queueing delays.
Latency is the duration that a request is waiting to be handled, during which it is latent, awaiting service.
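A small sketch of measuring response time as the client sees it (the request function is a hypothetical stand-in); percentiles are usually more informative than a single average:
import time
import statistics

def measure_response_times(request_fn, n=100):
    # Client-side response time: service time plus network and queueing delays.
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        request_fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "median_s": statistics.median(samples),
        "p95_s": samples[int(0.95 * (len(samples) - 1))],
    }

# Example usage with a stand-in request that just sleeps briefly.
print(measure_response_times(lambda: time.sleep(0.01)))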
Relational Model vs. Document Model
In the 2010s, NoSQL is the latest attempt to overthrow the relational model's dominance.
The term NoSQL is unfortunate, since it doesn't actually refer to any particular technology:
 it was intended simply as a catchy Twitter hashtag for a meetup on open source, distributed, non-relational databases in 2009.
Nevertheless, the term struck a nerve, and quickly spread through the web startup community and beyond.
A number of interesting database systems are now associated with the #NoSQL hashtag,
and it has been retroactively re-interpreted as Not Only SQL.
The Adoption of NoSQL databases
A need for greater scalability than relational databases can easily
achieve, including very large datasets or very high write throughput
A widespread preference for free and open source software over commercial database products
Specialized query operations that are not well supported by the
relational model
Frustration with the restrictiveness of relational schemas, and a
desire for a more dynamic and expressive data model
A LinkedIn Profile as a JSON Document
{ "user_id": 251,
"first_name": "Bill",
"last_name": "Gates",
"summary": "Co-chair of the Bill & Melinda Gates... Active blogger.", "region_id": "us:91",
"industry_id": 131,
"photo_url": "/p/7/000/253/05b/308dd6e.jpg",
"positions": [ {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
{"job_title": "Co-founder, Chairman", "organization": "Microsoft"}
],
"education": [
{"school_name": "Harvard University",
"start": 1973, "end": 1975}, {"school_name":
"Lakeside School, Seattle", "start": null, "end": null}
],
"contact_info": {
"blog": "http://thegatesnotes.com",
"twitter": "http://twitter.com/BillGates"
}
}
One-to-Many Relationships Forming a Tree Structure
The company name is not just a string, but a link to a company entity
Extension to Many-to-Many Relationships
 The data within each dotted rectangle can be grouped into one document,
 but the references to organizations, schools, and other users need to be represented as references, and require joins when queried.
Schema Flexibility in the Document Model
Document databases are sometimes called schemaless,
but the code that reads the data usually assumes some kind of structure: there is an implicit schema, but it is not enforced by the database
 A more accurate term is schema-on-read:
the structure of the data is implicit, and only interpreted when the data is read
 Schema-on-write is the traditional approach of relational databases:
the schema is explicit and the database ensures all data conforms to it
Schema-on-Read & Schema on-Write
Schema-on-read is similar to dynamic (run-time) type-checking in
programming languages
Schema-on-write is similar to static (compile-time) type-checking.
Document Database
In a document database, we would just start writing new documents with the new fields,
and the code in the application handles the case when old documents are read.
For example:
if (user && user.name && !user.first_name) {
  // Documents written before Dec 8, 2013 don't have first_name
  user.first_name = user.name.split(" ")[0];
}
Statically Typed Database Schema
 A "statically typed" database schema performs a migration along the lines of:
ALTER TABLE users ADD COLUMN first_name text;
UPDATE users SET first_name = split_part(name, ' ', 1);      -- PostgreSQL
UPDATE users SET first_name = substring_index(name, ' ', 1); -- MySQL
Schema changes have a bad reputation of being slow and requiring
downtime