MIS 3500 - University of Manitoba

Instructor: Bob Travica
Trendy Database Topics
2017
Big Data

The 3 big Vs:

Volume: terabytes (12 zeroes), petabytes (15 zeroes)

Variety: Social media, communications, sensors everywhere*, Internet of
Things, video feeds, GPS… Implication: various formats

Velocity: wired and wireless continuous feeds
Big Data Goals and Uses


Goals:

Integrate data on the same object across sources (Customer, Citizen, Patient...;
spatial mashups*)

Analysis: existing patterns (e.g., electrical energy consumption over time), predictive
analysis (prediction of energy needs)
Application domains:

Product & object monitoring in real time via sensors (Internet of Things- IoT)

Marketing (sentiment analysis in social media, discovering investment opportunities
– major US banks)

Energy grid management (IoT)
Big Data Uses (cont’d)

Transportation network management (queuing airplanes in air corridors in
Brazil, optimizing the cargo railroad network in Germany)

Operations/process optimization (UPS sensors in trucks, manufacturing)

Strategy making (Google, Facebook, banks; emerging strategy not planned)

Health (integration of customer data, tracking/analyzing patient vital signs &
cancer cell behavior)

Science (human genome analysis, 2TB of data/person+gene interactions)

Public safety/security (profiling outlaws)

Policy analysis (United Nations’ system for predicting social problems)
Big Data Tasks
• Querying unstructured data
Big Data Benefits & Costs
Benefits:

Comprehensive informing on business objects (customers, patients…)

Pattern discovery, predictive analysis (fraud detection)

More effective decision making (Citigroup)

Savings (e.g., UPS operations)

Strategizing for innovation (Google)
Costs:
• Direct technology costs
• Truthfulness (“veracity”) of sources & findings
• Sense-making challenges (big & “small” data)
• Legality, ethics
• Implementation & fit with organization

Machine-generated data (sensors); automatic creation and transfer *

Home appliances (security, energy consumption, heating, food,
entertainment)

Monitoring/Control (cars, athletic equipment, machinery, appliances)*

Example: Smart power grid**
Smart meter; Internet & Wi-Fi connectivity
Technologies

Hadoop (framework for file system and processing of large datasets on server
clusters)*

Machine learning – automated construction of models to fit data (instead of
hypothesis testing as with DW and Analytics)

Non-relational databases

Open source

Notable developers: Google, Facebook, Yahoo!, Microsoft, Amazon
Microsoft Azure-based Hadoop
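To illustrate the kind of processing that Hadoop distributes across a server cluster, here is a minimal single-machine sketch of the MapReduce pattern (map, shuffle, reduce) in plain Python. It does not use Hadoop itself; the function names and the sample records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (word, 1) pairs for one input record (a line of text)."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input: each string stands in for one block of a large file.
sample_records = ["big data big clusters", "data on clusters"]
counts = word_count(sample_records)
```

On a real cluster, the map and reduce steps would run in parallel on different servers, each working on its own portion of the data.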
Two keys: row key and column key
Row identifier/hash (Primary Key): 10938374 (could also be an IP address, MAC address, or UUID*)
Column key-value pairs: LastName=‘Jones’, FirstName=‘John’, …
Email={‘Home’ : ‘[email protected]’, ‘Work’ : ‘[email protected]’}
The E-mail column is a map (collection) data structure containing
multiple key-value pairs. The value can be other data structures.
Deviations from RDBS:
- column definition
- column key
- non-atomic data, no normalization
- row key is the row index, indexes on columns can be set
- no joins
* IP = Internet Protocol address; MAC = Media Access Control address; UUID = Universally Unique Identifier
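The row layout described on this slide can be approximated with nested Python dictionaries. This is only a sketch of the idea (row key, then column key, then value); it is not how any particular database stores data, and the `put`/`get` helpers and the e-mail addresses are hypothetical placeholders.

```python
# A table maps a row key to that row's columns (key-value pairs).
table = {}


def put(row_key, column, value):
    """Store one column value; rows need not share the same columns."""
    table.setdefault(row_key, {})[column] = value


def get(row_key, column, default=None):
    """Look up a value by row key, then by column key (no joins involved)."""
    return table.get(row_key, {}).get(column, default)


put(10938374, "LastName", "Jones")
put(10938374, "FirstName", "John")
# Non-atomic value: a map (collection) stored in a single column.
# The addresses below are hypothetical placeholders.
put(10938374, "Email", {"Home": "home@example.com", "Work": "work@example.com"})

work_email = get(10938374, "Email")["Work"]
```

Because the whole e-mail map lives inside one column of one row, no second table and no join are needed to retrieve it, which is the trade-off against normalization noted in the list of deviations.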
Big Data database based on Google’s BigTable

Distributed, non-relational, scalable

Tablet = range of rows

Row Key (reversed URL)   Time Stamp   Column Key = value
"com.cnn.www"            t9           anchor:cnnsi.com = "CNN" (citations)
"com.cnn.www"            t8           anchor:my.look.ca = "CNN.com"
"com.cnn.www"                         contents:html = "<html>…" (compressed webpages)

Column Key “Anchor” (Family) + URL part (Qualifier)
Column Key “Contents” + keyword in tagged content
• The table is used for searching websites related to the TV network CNN. The output consists of snippets of
relevant sites and compressed webpages that can be viewed in a historical sequence.
• 1 row key and 2 column keys. A combination of row and columnar DB.
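The data model on this slide can be read as a map from (row key, column key, time stamp) to a value. The following is a minimal in-memory sketch of that reading, not the API of BigTable or any derived product. The helper names are hypothetical, the time stamps t8 and t9 are represented as the integers 8 and 9, and the two `contents:html` versions and their time stamps are invented for illustration.

```python
from collections import defaultdict

# cells[row_key][column_key] holds a list of (timestamp, value) versions.
cells = defaultdict(lambda: defaultdict(list))


def reverse_host(host):
    """Turn 'www.cnn.com' into the reversed-URL row key 'com.cnn.www'."""
    return ".".join(reversed(host.split(".")))


def put(row_key, family, qualifier, timestamp, value):
    """Add one timestamped version under the column key 'family:qualifier'."""
    cells[row_key][f"{family}:{qualifier}"].append((timestamp, value))


def get_latest(row_key, family, qualifier):
    """Return the value with the highest timestamp, or None if absent."""
    versions = cells[row_key][f"{family}:{qualifier}"]
    if not versions:
        return None
    return max(versions, key=lambda version: version[0])[1]


def get_history(row_key, family, qualifier):
    """Return all versions ordered from oldest to newest."""
    return sorted(cells[row_key][f"{family}:{qualifier}"], key=lambda v: v[0])


row = reverse_host("www.cnn.com")
put(row, "anchor", "cnnsi.com", 9, "CNN")
put(row, "anchor", "my.look.ca", 8, "CNN.com")
# Hypothetical page versions, to show the historical sequence.
put(row, "contents", "html", 5, "<html>version B")
put(row, "contents", "html", 3, "<html>version A")

latest_page = get_latest(row, "contents", "html")
```

Keeping every version under its time stamp is what allows the webpages to be viewed in a historical sequence.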
Data Structures in Cassandra
set: {‘[email protected]’, ‘[email protected]’}
list: [‘[email protected]’, ‘[email protected]’]
map: {‘Home’ : ‘[email protected]’, ‘Work’ : ‘[email protected]’}
• Set: unordered collection, cannot contain duplicates
• List: ordered collection [1, 2, …] can contain duplicates
• Map: key-value pairs
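The three collection types behave much like Python's built-in `set`, `list`, and `dict`. The sketch below uses those Python types as an analogy only; it is not Cassandra syntax, and the addresses are hypothetical placeholders.

```python
# Set: unordered, duplicates are silently dropped.
emails_set = {"a@example.com", "b@example.com"}
emails_set.add("a@example.com")  # already present, so the set is unchanged

# List: ordered, duplicates are kept.
emails_list = ["a@example.com", "b@example.com"]
emails_list.append("a@example.com")  # stored again at the end

# Map: key-value pairs, looked up by key.
emails_map = {"Home": "a@example.com", "Work": "b@example.com"}
work = emails_map["Work"]
```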
Complex Data Structures:
JavaScript Object Notation (JSON)
These structures can be applied in table columns.
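As a small illustration of a nested JSON structure, the sketch below builds one in Python and round-trips it through the standard `json` module. The field names and values are hypothetical.

```python
import json

# A nested structure: objects inside objects, plus a list.
customer = {
    "id": 10938374,
    "name": {"last": "Jones", "first": "John"},
    "emails": {"Home": "home@example.com", "Work": "work@example.com"},
    "tags": ["retail", "loyalty"],
}

text = json.dumps(customer)   # serialize to a JSON string
restored = json.loads(text)   # parse the string back into Python objects
last_name = restored["name"]["last"]
```

A string like `text` is the kind of value that could be stored in a single table column.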
Not Only SQL*
Distributed Design
• Distributed design, replication across nodes, redundancy
• No referential integrity, no consistency guaranteed
• Goal: fast storage & processing of massive and varying data types
Cassandra Distributed Storage
[Figure: five servers with key ranges 000-200, 201-400, 401-600, 601-800, 801-1000; gossip/status messages between servers; replication = 3; example data item with key=325]
Servers are configured as (virtual) nodes.
They communicate with each other via gossip for status (every second).
A data partitioner assigns data to an initial server based on key value.
The replication parameter specifies the number of copies.
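The placement of a data item can be sketched as follows. This is a simplified illustration, not Cassandra's actual partitioner: the five key ranges are assumed to be contiguous blocks (000-200, 201-400, 401-600, 601-800, 801-1000), and placing the extra copies on the next servers around the ring is an assumption made for the example. All names are hypothetical.

```python
# Assumed contiguous key ranges, one per server (index 0..4).
KEY_RANGES = [(0, 200), (201, 400), (401, 600), (601, 800), (801, 1000)]
REPLICATION = 3  # number of copies of each data item


def primary_node(key):
    """Return the index of the server whose key range contains the key."""
    for index, (low, high) in enumerate(KEY_RANGES):
        if low <= key <= high:
            return index
    raise ValueError(f"key {key} is outside all configured ranges")


def replica_nodes(key, replication=REPLICATION):
    """Return the initial server plus the following servers on the ring."""
    first = primary_node(key)
    return [(first + offset) % len(KEY_RANGES) for offset in range(replication)]


nodes_for_325 = replica_nodes(325)
```

With these assumptions, key 325 falls in the second range, so its three copies land on the second, third, and fourth servers; a key near the end of the ring wraps around to the first servers.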
RDBS – Two Phase Commit
Coordinator (DB on Location 1): initiate transaction (reduce price of cookies by 5%)
Participants: DB Partition 2 (Store A Product tbl), DB Partition 3 (Store B Product tbl)
1. Prepare to update. All agree? Each partition replies “OK!”
2. Commit
• Compare with non-relational DB
• Orderly, data integrity and accuracy assured
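The two phases can be sketched in a few lines of Python. This is a simplified, single-process illustration under stated assumptions: there is no network, no logging, and no crash recovery, which a real two-phase commit needs. The class and function names are hypothetical, and prices are kept in integer cents to avoid rounding issues.

```python
class Participant:
    """One database partition holding the price of cookies (in cents)."""

    def __init__(self, name, price_cents, can_commit=True):
        self.name = name
        self.price_cents = price_cents
        self.can_commit = can_commit  # set to False to simulate a refusal
        self.pending = None

    def prepare(self, new_price_cents):
        """Phase 1: stage the update and vote yes (True) or no (False)."""
        if not self.can_commit:
            return False
        self.pending = new_price_cents
        return True

    def commit(self):
        """Phase 2: make the staged update permanent."""
        self.price_cents = self.pending
        self.pending = None

    def rollback(self):
        """Discard the staged update."""
        self.pending = None


def two_phase_commit(participants, percent_off):
    """Coordinator: commit everywhere only if every participant agrees."""
    votes = [
        p.prepare(p.price_cents * (100 - percent_off) // 100)
        for p in participants
    ]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


store_a = Participant("Store A Product tbl", 200)
store_b = Participant("Store B Product tbl", 200)
committed = two_phase_commit([store_a, store_b], percent_off=5)
```

If any participant votes no, nothing changes anywhere, which is how the protocol keeps the partitions consistent with each other.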
Modern Database Systems
Conclusion

Modern database systems (DBS) still rely predominantly on relational
DBS, while trying to integrate these with Big Data systems for
unstructured & multi-type data, which are based on distributed
storage & parallel processing.

Ad-hoc relationship discovery and predictive analytics are major tasks
and benefits.