Methodological Foundations of Biomedical Informatics (BMSC-GA 4449)
http://fenyolab.org/methods2015
Himanshu Grover
Big Data in Biology/Healthcare
• 3 Vs:
– Volume
– Velocity
– Variety (richness/complexity; structured, semi-structured, unstructured)
• Examples:
– Omics technologies (proteomics, genomics / NGS, metabolomics etc.)
– Clinical data: EMRs (patients, providers, medications, procedures, symptoms, diagnoses, financials)
• Need simple and advanced analytics – exploratory
analyses & knowledge discovery, complex
visualizations, reporting, operations etc.
Utilization: Persistence + Analytics
File System
1. Ex. ASCII, semi-structured (XML), binary
2. Overhead in parsing
3. No indexing/search/filtering
4. Too large
Vs.
Databases
1. Efficient storage and access
2. Analytics
Example: Proteomics
MS/MS Spectrum Information
{
  'experimentName' : '20586475',                              (Identifier)
  'filename' : 'GPM77711009076.mgf',                          (Identifier)
  'scan' : 3,                                                 (Identifier)
  'mz' : 584.0507,                                            (Peptide Info)
  'expPeaks' : [ { 'mz' : 792.6084, 'intensity' : 14.0},      (Peaks)
                 { 'mz' : 874.639, 'intensity' : 23.0},
                 { 'mz' : 903.1962, 'intensity' : 19.0},
                 { 'mz' : 917.9162, 'intensity' : 22.0},
                 ...
               ]
  ...
}
[Figure: Experimental Spectrum; relative intensity (0 to 100) vs. m/z (250 to 1000)]
Relational Databases
Spectrum record to be stored (Identification, Peptide Info, Peaks):
{ 'experimentName' : '20586475',
  'filename' : 'GPM77711009076.mgf',
  'scan' : 3,
  'mz' : 1050.51,
  'expPeaks' : [ { 'mz' : 792.6084, 'intensity' : 14.0},
                 { 'mz' : 874.639, 'intensity' : 23.0},
                 { 'mz' : 903.1962, 'intensity' : 19.0},
                 { 'mz' : 917.9162, 'intensity' : 22.0},
                 … ]
}
v1 (un-normalized)
id/pk | Exp Name | File Name | scan | … | Peaks
v2 (un-normalized)
id/pk | Exp Name | File Name | scan | … | Peak1 | Peak2 | …
v3 (normalized), two tables linked by a foreign key (Fk)
id/pk | Exp Name | File Name | scan | …
id/pk | Fk | mz | int
Issues: Impedance Mismatch; Difficulty running on Clusters
Spectrum Ex. cont’d
• Example of a 1-to-many relationship (one spectrum, many peaks)
• Un-normalized schema
– redundancy and wasted disk space
– non-uniformity (ex. different numbers of peaks per spectrum)
– ability to query peaks stored as a blob varies
• Normalized schema
– effective, but requires joins
• Other examples (relationship types?):
– Proteins to peptides; genes to proteins; patients to diseases
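The normalized (v3) layout can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the course's actual schema: the table names `spectrum` and `peak` and the column names are assumptions; the values come from the spectrum example above. Reading a spectrum back with its peaks requires the join mentioned in the bullet.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table/column names for the normalized (v3) layout:
# one row per spectrum, one row per peak, linked by a foreign key.
conn.executescript("""
CREATE TABLE spectrum (
    id        INTEGER PRIMARY KEY,
    exp_name  TEXT,
    file_name TEXT,
    scan      INTEGER
);
CREATE TABLE peak (
    id          INTEGER PRIMARY KEY,
    spectrum_id INTEGER REFERENCES spectrum(id),
    mz          REAL,
    intensity   REAL
);
""")

cur = conn.execute(
    "INSERT INTO spectrum (exp_name, file_name, scan) VALUES (?, ?, ?)",
    ("20586475", "GPM77711009076.mgf", 3),
)
spectrum_id = cur.lastrowid

peaks = [(792.6084, 14.0), (874.639, 23.0), (903.1962, 19.0), (917.9162, 22.0)]
conn.executemany(
    "INSERT INTO peak (spectrum_id, mz, intensity) VALUES (?, ?, ?)",
    [(spectrum_id, mz, intensity) for mz, intensity in peaks],
)

# The 1-to-many relationship is reassembled with a join at read time.
rows = conn.execute(
    """
    SELECT s.file_name, s.scan, p.mz, p.intensity
    FROM spectrum AS s
    JOIN peak AS p ON p.spectrum_id = s.id
    WHERE s.scan = ?
    ORDER BY p.mz
    """,
    (3,),
).fetchall()

for row in rows:
    print(row)
```

Each spectrum can have any number of peak rows without schema changes, at the cost of one join per read.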
Not Only SQL (NoSQL) / Non-Relational
• Key features:
– Aggregate orientation: closely related data that is accessed as a unit (an aggregate) is stored together, which leads to faster read/write operations
– Facility for rich structure
– Easier to program data access (application
development productivity)
– Application/context-specific, unlike generic relational
data model (database as an integration point)
• Representation:
– Key-value Stores (Ex. Riak, Redis, etc.)
– Column Family Stores (Ex. Cassandra, HBase etc.)
– Document-oriented Stores (Ex. MongoDB, CouchDB
etc.)
– Graph databases (Ex. Neo4J etc.)
Why MongoDB: Flexible
• Collections (≈Tables) of Documents (≈Rows)
• Documents = set of key-value pairs (Ex. Python Dict, Java
HashMap etc.)
doc = { '_id': <unique ID>,
        <key1>: <val1>,
        <key2>: <val2>,
        <key3>: { <key31>: <val31>,
                  <key32>: <val32> },
        <key4>: [ {<key41>: <val41>},
                  {<key42>: <val42>},
                  … ]
      }
• Non-uniform and dynamic
• Kinds of values:
– Simple, Ex. 'experimentName' : '20586475'
– Embedded/Hierarchical
– List (non-uniform), Ex. 'peaks' : [ {'mz' : 792.6, 'int' : 14.0}, {'mz' : 874.6, 'int' : 23.0}, … ]
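Since a document maps directly onto a Python dict, the three kinds of values and the "non-uniform and dynamic" property can be sketched without a database. The `precursor` sub-document, the `note` field, and the second spectrum are hypothetical, made up for illustration; the other keys and values follow the slide.

```python
import json

# Two documents of one (hypothetical) collection; their shapes differ on purpose.
spectrum_a = {
    "experimentName": "20586475",          # simple value
    "filename": "GPM77711009076.mgf",
    "scan": 3,
    "precursor": {"mz": 584.0507},         # embedded document (hypothetical key)
    "peaks": [                             # list of embedded documents
        {"mz": 792.6, "int": 14.0},
        {"mz": 874.6, "int": 23.0},
    ],
}
spectrum_b = {
    "experimentName": "20586475",
    "filename": "GPM77711009076.mgf",
    "scan": 4,                             # hypothetical second spectrum
    "peaks": [{"mz": 903.2, "int": 19.0}], # different number of peaks
    "note": "no precursor recorded",       # field absent from spectrum_a
}
collection = [spectrum_a, spectrum_b]

# Nested values are reached by chaining keys.
precursor_mz = spectrum_a["precursor"]["mz"]

# Non-uniform: each document carries its own number of peaks.
peak_counts = [len(doc["peaks"]) for doc in collection]

# Dynamic: a field can be added to one document at any time.
spectrum_b["rt"] = 0.0

# Documents made of dicts, lists, strings, and numbers serialize to JSON as-is.
round_trip = json.loads(json.dumps(spectrum_a))

print(precursor_mz, peak_counts, sorted(spectrum_b))
```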
Collection: SpectrumArchive
{
  '_id' : ObjectId('52c48046321ded5b32082bb5'),
  'experimentName' : '20586475',
  'filename' : 'GPM77711009076.mgf',
  'scan' : 1749,
  'mz' : 1050.51,
  'intensity' : 0.0,
  'rt' : 0.0,
  'expPeaks' : [ { 'mz' : 792.6084, 'intensity' : 14.0},
                 { 'mz' : 874.639, 'intensity' : 23.0},
                 { 'mz' : 903.1962, 'intensity' : 19.0},
                 { 'mz' : 917.9162, 'intensity' : 22.0},
                 ...
               ]
  ...
}
[Figure: Experimental Spectrum; relative intensity (0 to 100) vs. m/z (250 to 1000)]
Data Modeling: Design Choices
• Document structure (Entities/Aggregates)
– Data Access patterns => What is accessed together
must go together
• Relationships
– 1-to-few, 1-to-many, many-to-many
– Embedding (de-normalized) vs. Referencing
(normalized)
– cardinality of relationship may be unbounded and/or
quite large (for some cases)
• Document growth issue
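The embedding vs. referencing choice can be sketched with plain Python structures standing in for collections. The `_id` values, the `spectrum_id` field, and the helper function names are hypothetical; the point is only how the read path differs.

```python
# Option 1: embedding (de-normalized). Peaks live inside the spectrum document,
# so one read returns the whole aggregate.
embedded_spectrum = {
    "_id": 1,
    "filename": "GPM77711009076.mgf",
    "scan": 3,
    "expPeaks": [
        {"mz": 792.6084, "intensity": 14.0},
        {"mz": 874.639, "intensity": 23.0},
    ],
}

# Option 2: referencing (normalized). Peaks are separate documents that point
# back to their spectrum through a hypothetical 'spectrum_id' field.
spectra = {1: {"_id": 1, "filename": "GPM77711009076.mgf", "scan": 3}}
peak_docs = [
    {"_id": 10, "spectrum_id": 1, "mz": 792.6084, "intensity": 14.0},
    {"_id": 11, "spectrum_id": 1, "mz": 874.639, "intensity": 23.0},
]


def peaks_embedded(spectrum):
    """One lookup: the peaks arrive with the spectrum."""
    return spectrum["expPeaks"]


def peaks_referenced(spectrum_id, all_peaks):
    """Second lookup: the application gathers the peaks itself."""
    return [
        {"mz": p["mz"], "intensity": p["intensity"]}
        for p in all_peaks
        if p["spectrum_id"] == spectrum_id
    ]


print(peaks_embedded(embedded_spectrum))
print(peaks_referenced(1, peak_docs))
```

Embedding suits data that is read together and stays bounded in size; referencing keeps each document small when the "many" side is large or unbounded, which is where the document growth issue arises.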
Why MongoDB: Distribution
• Aggregates are a natural unit of interaction as well as distribution – no notion of joins
– Scale out storage and processing
• Two forms
– Replication
– Sharding
• Automatic data distribution
• Seamless distributed query processing and analytics
[Diagram: Application Code; MongoS; MongoD - 1, MongoD - 2, ..…, MongoD - 5]
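The idea that whole aggregates are the unit of distribution can be illustrated with a toy hash-based placement scheme. This is a deliberately simplified sketch and not how MongoDB itself assigns data to shards; the shard count of 5 mirrors the diagram, and the choice of key (filename plus scan) is an assumption.

```python
import hashlib

NUM_SHARDS = 5  # mirrors MongoD - 1 ... MongoD - 5 in the diagram


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a key to a shard number with a stable hash (toy scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def shard_key(doc):
    """Hypothetical shard key: filename plus scan number."""
    return f"{doc['filename']}:{doc['scan']}"


# Each shard holds complete documents, so no cross-shard join is needed
# to reassemble a spectrum.
shards = {i: [] for i in range(NUM_SHARDS)}
docs = [{"filename": "GPM77711009076.mgf", "scan": s} for s in range(1, 21)]
for doc in docs:
    shards[shard_for(shard_key(doc))].append(doc)


def find(filename, scan):
    """Route the lookup to the single shard that can hold the document."""
    target = shard_for(f"{filename}:{scan}")
    return [
        d for d in shards[target]
        if d["filename"] == filename and d["scan"] == scan
    ]


print({i: len(held) for i, held in shards.items()})
print(find("GPM77711009076.mgf", 3))
```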
Why MongoDB: Other features
• Powerful query language and operators
– including ability to look into nested/embedded
documents and arrays/lists
• Secondary indexing
• Performant and extensive analytics operators
over distributed databases
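These features might look roughly as follows with PyMongo. The filter and pipeline are ordinary Python dicts using MongoDB's `$elemMatch`, `$gt`, `$gte`, `$match`, `$unwind`, `$group`, `$sum`, and `$max` operators; field names follow the SpectrumArchive example, while the thresholds and the choice of index keys are assumptions. `run_examples` expects a PyMongo collection object and is not called here, since that needs a running server.

```python
# Match spectra that contain at least one peak satisfying BOTH conditions.
peak_filter = {
    "expPeaks": {
        "$elemMatch": {"mz": {"$gt": 900.0}, "intensity": {"$gte": 20.0}}
    }
}

# Dot notation reaches into embedded documents and arrays.
any_heavy_peak = {"expPeaks.mz": {"$gt": 900.0}}

# Aggregation: count peaks and find the strongest peak per file.
pipeline = [
    {"$match": {"experimentName": "20586475"}},
    {"$unwind": "$expPeaks"},
    {
        "$group": {
            "_id": "$filename",
            "nPeaks": {"$sum": 1},
            "maxIntensity": {"$max": "$expPeaks.intensity"},
        }
    },
]


def run_examples(collection):
    """Run the queries above against a PyMongo collection (assumed argument)."""
    # Secondary index on two fields, both ascending.
    collection.create_index([("experimentName", 1), ("scan", 1)])
    # The second argument is a projection: return only these fields (plus _id).
    matching = list(collection.find(peak_filter, {"filename": 1, "scan": 1}))
    summary = list(collection.aggregate(pipeline))
    return matching, summary
```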
Demo
• Basic CRUD operations:
– C: Create
– R: Read
– U: Update
– D: Delete
• From mongo shell
• From PyMongo driver
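A minimal sketch of the four operations with PyMongo, not the demo shown in class. The connection URI, the database name `proteomics`, and the updated `rt` value are assumptions; the collection name and document fields follow the SpectrumArchive example. `get_collection` needs the third-party pymongo package and a running mongod, so nothing here connects unless it is called.

```python
def get_collection(uri="mongodb://localhost:27017"):
    """Connect and return the collection (URI and database name are assumed)."""
    from pymongo import MongoClient  # third-party: pip install pymongo
    return MongoClient(uri)["proteomics"]["SpectrumArchive"]


def make_spectrum():
    """Build a fresh document shaped like the SpectrumArchive example."""
    return {
        "experimentName": "20586475",
        "filename": "GPM77711009076.mgf",
        "scan": 1749,
        "mz": 1050.51,
        "rt": 0.0,
        "expPeaks": [
            {"mz": 792.6084, "intensity": 14.0},
            {"mz": 874.639, "intensity": 23.0},
        ],
    }


def crud_demo(collection):
    """Create, read, update, and delete one document; return what happened."""
    # C: insert_one adds an '_id' if the document has none.
    inserted_id = collection.insert_one(make_spectrum()).inserted_id

    # R: look the document up by its '_id'.
    found = collection.find_one({"_id": inserted_id})

    # U: change one field in place with the $set operator.
    updated = collection.update_one(
        {"_id": inserted_id}, {"$set": {"rt": 12.5}}  # 12.5 is a made-up value
    )

    # D: remove the document again.
    deleted = collection.delete_one({"_id": inserted_id})

    return {
        "inserted_id": inserted_id,
        "read_scan": found["scan"],
        "modified": updated.modified_count,
        "deleted": deleted.deleted_count,
        "after_delete": collection.find_one({"_id": inserted_id}),
    }
```

With a server running, `crud_demo(get_collection())` would exercise all four operations and leave the collection unchanged.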
Some Limitations of MongoDB
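• No built-in atomic transactions across multiple documents or collections (as of this 2015 course; MongoDB 4.0 and later added multi-document transactions).
• Modeling large numbers of many-to-many relationships is difficult (consider graph databases)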
Some Resources
• NoSQL Distilled: broad discussion of the different categories of NoSQL databases
• MongoDB specific:
– MongoDB website
• https://mongoDB.com; https://mongodb.org
• user manual, public talks/presentations
– MongoDB University:
• https://university.mongodb.com/
• “MongoDB for Developers” course uses Python
– MongoDB: The Definitive Guide
Take home
• No one size fits all
• Choice depends on application requirements