MIS 3500
Instructor: Bob Travica
Trendy Database Topics
2017
Big Data
The three big Vs:
Volume: terabytes (12 zeroes), petabytes (15 zeroes)
Variety: Social media, communications, sensors everywhere*, Internet of
Things, video feeds, GPS… Implication: various formats
Velocity: wired and wireless continuous feeds
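To make the Volume scale concrete, here is a minimal sketch that converts a raw byte count into decimal (SI) units, where each step is a factor of 1,000. The function name describe_bytes and the unit list are illustrative assumptions, not part of the course material.

```python
# Decimal (SI) byte units: each step up is a factor of 1,000.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB"]

def describe_bytes(n):
    """Express n bytes in the largest listed unit that keeps the value below 1,000."""
    value = float(n)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the last unit.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

print(describe_bytes(10**12))  # one terabyte: 1 followed by 12 zeroes
print(describe_bytes(10**15))  # one petabyte: 1 followed by 15 zeroes
```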
Big Data Goals and Uses
Goals:
Integrate data on the same object across sources (Customer, Citizen, Patient...;
spatial mashups*)
Analysis: existing patterns (e.g., electrical energy consumption over time), predictive
analysis (prediction of energy needs)
Application domains:
Product & object monitoring in real time via sensors (Internet of Things- IoT)
Marketing (sentiment analysis in social media, discovering investment opportunities
– major US banks)
Energy grid management (IoT)
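The predictive-analysis goal (prediction of energy needs from consumption over time) can be illustrated with a very small sketch: fit a straight line to past readings and extrapolate one step ahead. The readings below are made up, the function name fit_line is hypothetical, and a linear trend is only one simple modeling assumption among many.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up hourly energy readings (hour index, load in MW).
hours = [0, 1, 2, 3, 4]
load_mw = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(hours, load_mw)   # existing pattern
forecast = slope * 5 + intercept              # predicted load for hour 5
print(slope, intercept, forecast)
```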
Big Data Uses (cont’d)
Transportation network management (queuing airplanes in air corridors in
Brazil, optimizing the cargo railroad network in Germany)
Operations/process optimization (UPS sensors in trucks, manufacturing)
Strategy making (Google, Facebook, banks; emerging strategy not planned)
Health (integration of customer data, tracking/analyzing patient vital signs &
cancer cell behavior)
Science (human genome analysis, 2TB of data/person+gene interactions)
Public safety/security (profiling outlaws)
Policy analysis (United Nations’ system for predicting social problems)
Big Data Tasks
• Querying unstructured data
Big Data Benefits & Costs
Benefits:
Comprehensive informing on business objects (customers, patients…)
Pattern discovery, predictive analysis (fraud detection)
More effective decision making (Citigroup)
Savings (e.g., UPS operations)
Strategizing for innovation (Google)
Costs:
Direct technology costs
Truthfulness (“veracity”) of sources & findings
Sense making challenges (big & “small” data)
Legality, ethics
Implementation & fit with organization
Machine-generated data (sensors); automatic creation and transfer *
Home appliances (security, energy consumption, heating, food,
entertainment)
Monitoring/Control (cars, athletic equipment, machinery, appliances)*
Example: Smart power grid**
Smart meter; Internet & Wi-Fi connectivity
Technologies
Hadoop (framework for file system and processing of large datasets on server
clusters)*
Machine learning – automated construction of models to fit data (instead of
hypothesis testing as with DW and Analytics)
Non-relational databases
Open source
Notable developers: Google, Facebook, Yahoo!, Microsoft, Amazon
Microsoft Azure-based Hadoop
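Hadoop's classic processing model is MapReduce (a term not used on the slide): work is split into a map step that runs on chunks of data in parallel and a reduce step that combines the results. The following is a single-machine sketch of that idea only; it does not use any Hadoop API, and the function names and sample text are illustrative assumptions.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# On a cluster, each chunk would live on (and be mapped by) a different server.
chunks = ["big data big value", "data velocity"]
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```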
Two keys: row key and column key
Row identifier/hash: Primary Key (e.g., 10938374), IP address, MAC address, UUID*
Column key-value pairs: LastName=‘Jones’, FirstName=‘John’, …, Email={‘Home’ : ‘[email protected]’, ‘Work’ : ‘[email protected]’}
The E-mail column is a map (collection) data structure containing
multiple key-value pairs. The value can be other data structures.
Deviations from RDBS:
- column definition
- column key
- non-atomic data, no normalization
- row key is the row index, indexes on columns can be set
- no joins
* IP address; MAC = Media Access Control; UUID = Universally Unique Identifier
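The two-key lookup (row key, then column key) and the non-atomic map column can be sketched with nested Python dictionaries. This is only an in-memory illustration of the data model, not a database client; the e-mail addresses, the second row, and the helper name get_cell are placeholders invented for the example.

```python
# Outer dict: row key -> row. Inner dict: column key -> value.
table = {
    10938374: {
        "LastName": "Jones",
        "FirstName": "John",
        # A map (collection) column: the value is itself a set of key-value pairs.
        "Email": {"Home": "home@example.com", "Work": "work@example.com"},
    },
    # Rows need not share the same columns (no fixed column definition).
    10938375: {"LastName": "Smith"},
}

def get_cell(table, row_key, column_key):
    """Look up a value by row key first, then by column key."""
    return table[row_key][column_key]

print(get_cell(table, 10938374, "LastName"))
print(get_cell(table, 10938374, "Email")["Work"])
```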
Big Data database based on Google’s BigTable
Distributed, non-relational, scalable
Tablet = range of rows
Row Key (reversed URL) | Time Stamp | Column Key – “Anchor” (Family) + URLpart (Qualifier) | Column Key – “Contents” + keyword in tagged content
"com.cnn.www" | t9 | anchor:cnnsi.com = "CNN" (citations) | contents:html = "<html>…" (compressed webpages)
"com.cnn.www" | t8 | anchor:my.look.ca = "CNN.com" | contents:html = "<html>…"
• The table is used for searching websites related to the TV network CNN. The output consists of snippets of
relevant sites and compressed webpages that can be viewed in historical sequence.
• 1 row key and 2 column keys. A combination of row and columnar DB.
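The addressing scheme above (row key, column key, time stamp) can be sketched as a dictionary of versioned cells. This is a toy in-memory model under stated assumptions: time stamps t8 and t9 are represented as the integers 8 and 9, the page contents are invented, and the helper names are hypothetical.

```python
def reverse_host(host):
    """Turn 'www.cnn.com' into the reversed-URL row key 'com.cnn.www'."""
    return ".".join(reversed(host.split(".")))

# (row key, column key) -> {time stamp: value}
cells = {}

def put(row_key, column_key, timestamp, value):
    """Store one version of a cell."""
    cells.setdefault((row_key, column_key), {})[timestamp] = value

def get_latest(row_key, column_key):
    """Return the value with the highest time stamp."""
    versions = cells[(row_key, column_key)]
    return versions[max(versions)]

def history(row_key, column_key):
    """Return all versions of a cell, oldest first."""
    versions = cells[(row_key, column_key)]
    return [versions[t] for t in sorted(versions)]

row = reverse_host("www.cnn.com")
put(row, "anchor:cnnsi.com", 9, "CNN")        # family "anchor", qualifier "cnnsi.com"
put(row, "anchor:my.look.ca", 8, "CNN.com")
put(row, "contents:html", 8, "<html>older page")  # invented contents
put(row, "contents:html", 9, "<html>newer page")

print(row, get_latest(row, "contents:html"), history(row, "contents:html"))
```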
Data Structures in Cassandra
set: {‘[email protected]’, ‘[email protected]’}
list: [‘[email protected]’, ‘[email protected]’]
map: {‘Home’ : ‘[email protected]’, ‘Work’ : ‘[email protected]’}
• Set: unordered collection, cannot contain duplicates
• List: ordered collection [1, 2, …], can contain duplicates
• Map: key-value pairs
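Python's built-in set, list, and dict behave much like the three collection types described above, so they make a convenient stand-in for a quick illustration. This is an analogy only, not Cassandra code, and the addresses are placeholders.

```python
# Set: unordered, duplicates are silently dropped.
email_set = {"a@example.com", "b@example.com", "a@example.com"}

# List: ordered, duplicates are kept, items are addressed by position.
email_list = ["a@example.com", "b@example.com", "a@example.com"]

# Map: key-value pairs, items are addressed by key.
email_map = {"Home": "a@example.com", "Work": "b@example.com"}

print(len(email_set))     # the duplicate was removed
print(len(email_list))    # the duplicate was kept
print(email_list[0])      # first item by position
print(email_map["Work"])  # value looked up by key
```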
Complex Data Structures:
JavaScript Object Notation (JSON)
These structures can be applied in table columns.
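As a small illustration of a complex JSON structure that could sit in a single column, the sketch below nests a map and a list inside one object and round-trips it through Python's standard json module. The field names and values are invented for the example.

```python
import json

# A nested structure: an object containing a map and a list.
customer = {
    "id": 10938374,
    "name": {"first": "John", "last": "Jones"},
    "emails": {"Home": "home@example.com", "Work": "work@example.com"},
    "orders": [101, 102, 103],
}

text = json.dumps(customer)   # serialize to a JSON string (storable in a column)
restored = json.loads(text)   # parse it back into Python objects

print(text)
print(restored["emails"]["Work"])
```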
Not Only SQL*
Distributed Design
• Distributed design, replication across nodes, redundancy
• No referential integrity, no consistency guaranteed
• Goal: fast storage & processing of massive and varying data types
Cassandra Distributed Storage
Servers (key ranges): 000-200, 201-400, 401-600, 601-800, 801-1000
Replication = 3
Gossip/status
Data: key=325
Servers are configured as (virtual) nodes.
They communicate with each other via gossip for status (every second).
A data partitioner assigns data to an initial server based on key value.
The replication parameter specifies the number of copies.
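The placement rule described above can be sketched as follows: find the server whose key range contains the key, then place the remaining copies on the next servers around the ring. This is a simplified illustration under stated assumptions: five servers with non-overlapping ranges (the third taken as 401-600), the key value used directly rather than hashed, and copies placed on consecutive servers. Real Cassandra partitioners and replication strategies are more involved, and the function names here are hypothetical.

```python
# Assumed key ranges, one per server (index 0..4), inclusive on both ends.
RANGES = [(0, 200), (201, 400), (401, 600), (601, 800), (801, 1000)]
REPLICATION = 3

def primary_node(key):
    """Return the index of the server whose range contains the key."""
    for index, (low, high) in enumerate(RANGES):
        if low <= key <= high:
            return index
    raise ValueError(f"key {key} is outside all ranges")

def replica_nodes(key, replication=REPLICATION):
    """Initial server plus the next servers on the ring, wrapping around."""
    first = primary_node(key)
    return [(first + i) % len(RANGES) for i in range(replication)]

print(replica_nodes(325))  # key 325 falls in the 201-400 range
print(replica_nodes(900))  # placement wraps around the end of the ring
```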
RDBS – Two-Phase Commit
Coordinator
DB on Location 1: Initiate transaction (reduce price of cookies by 5%)
Participants: DB Partition 2 (Store A Product tbl), Partition 3 (Store B Product tbl)
1. Prepare to update. All agree? (each participant answers OK!)
2. Commit
• Compare with non-relational DB
• Orderly, data integrity and accuracy assured
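The two phases can be sketched as a small simulation: the coordinator asks every partition to prepare, and commits only if all of them agree; otherwise every partition rolls back. This is a toy model, not a real transaction manager. The class and function names are hypothetical, and prices are kept in integer cents (an assumed starting price of 200) to avoid rounding issues.

```python
class Partition:
    """One database partition holding the cookie price in cents."""

    def __init__(self, name, price_cents, healthy=True):
        self.name = name
        self.price_cents = price_cents
        self.healthy = healthy
        self.pending = None

    def prepare(self, new_price):
        """Phase 1: stage the change and vote OK, or vote no if unable."""
        if not self.healthy:
            return False
        self.pending = new_price
        return True

    def commit(self):
        """Phase 2: make the staged change permanent."""
        self.price_cents = self.pending
        self.pending = None

    def rollback(self):
        """Discard the staged change."""
        self.pending = None

def two_phase_commit(partitions, new_price):
    """Commit on all partitions only if every one of them votes OK."""
    votes = [p.prepare(new_price) for p in partitions]
    if all(votes):
        for p in partitions:
            p.commit()
        return True
    for p in partitions:
        p.rollback()
    return False

new_price = 200 * 95 // 100  # reduce the price by 5%

stores = [Partition("Store A", 200), Partition("Store B", 200)]
ok = two_phase_commit(stores, new_price)

# If one partition cannot prepare, no partition changes its price.
broken = [Partition("Store A", 200), Partition("Store B", 200, healthy=False)]
failed = two_phase_commit(broken, new_price)
print(ok, [p.price_cents for p in stores], failed, [p.price_cents for p in broken])
```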
Modern Database Systems
Conclusion
Modern database systems (DBS) still rely predominantly on relational
DBS, while trying to integrate these with Big Data systems for
unstructured & multi-type data, which are based on distributed
storage & parallel processing.
Ad-hoc relationship discovery and predictive analytics are major tasks
and benefits.