Using Social Media for Epidemiology

Mike Chary
Nick Genes
Department of Emergency Medicine
Mount Sinai Hospital
Data Science Seminar, 2013
Chary & Genes (Sinai EM)
ToxTweet Intro
CU IGERT 2013
1. Motivation
2. Prior Work
   Surveillance
3. Our Work
   Syndromic Surveillance via Social Media
   Signs and Symptoms from YouTube
   Computational Semantics of Social Media
   Dynamics of Social Networks
Motivation
Limitations of Current Syndromic Surveillance
Lag time
Sampling bias, observer effect
Cost
Prior Work
Surveillance
Tracking Epidemics
Not Everything Is an Epidemic
Endemic infections
Psychiatric diseases
Consumption of healthcare resources
We need a deeper understanding of the meaning of unstructured
text.
Our Work
Concepts from Data Science
Supervised learning
Information retrieval
Calculating similarity
Natural language processing
Concepts from Computational Linguistics
Stemming / Lemmatization
N-grams
Vector Space model
Context-free grammar
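[Figure: pipeline diagram with stages Concept, query, SocMe, Manual Curation, Identify most informative features, Classifier, Controls (SocMe, φ; Census, ρ), Epidemiologic curves]
Classifier output: $\phi_x = \left(\frac{\text{Yes}}{\text{Not Yes}}\right)_x$
Normalization: $\frac{1}{\rho} \, \frac{\phi_{\text{target}} - \phi_{\min}}{\phi_{\max} - \phi_{\min}}$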
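One possible reading of the normalization in the pipeline is min-max scaling of the classifier score φ for the target query against the control scores, followed by division by the census figure ρ. The sketch below assumes that reading; the function name and the sample numbers are hypothetical and do not come from the talk.

```python
def normalized_prevalence(phi_target, phi_min, phi_max, rho):
    """Min-max scale phi_target against the control range, then divide by rho.

    Assumption: phi_min and phi_max are the lowest and highest control scores,
    and rho is a census-derived scaling factor.
    """
    if phi_max == phi_min:
        raise ValueError("phi_max and phi_min must differ")
    if rho == 0:
        raise ValueError("rho must be nonzero")
    return (phi_target - phi_min) / (phi_max - phi_min) / rho


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
estimate = normalized_prevalence(phi_target=3, phi_min=1, phi_max=5, rho=2)
```

With these inputs the scaled score is (3 - 1) / (5 - 1) = 0.5, and dividing by ρ = 2 gives 0.25.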
Syndromic Surveillance via Social Media
Prevalence of alcohol consumption from Twitter
[Figure: two panels, Twitter and SAMHSA 2010]
Twitter estimates agree with official sources
Chary and Genes (2012), NYAS Machine Learning Symposium
Prevalence of Salvia Usage from Tweets
Signs and Symptoms from YouTube
Estimates of Common Doses of DXM
Chary et al. (2013), PLoS ONE
Common words associated with DXM
Chary et al. (2013), PLoS ONE
Tf-idf (term frequency-inverse document frequency)
How specific is a word to a document?
$\mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) = \frac{f(t, d)}{\max\{f(w, d) \mid w \in d\}} \cdot \ln \frac{|D|}{|\{d \in D \mid t \in d\}|}$
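As a minimal sketch of the tf-idf score, the term frequency f(t, d) is divided by the count of the most frequent word in the document, and the inverse document frequency is the natural log of the corpus size over the number of documents containing the term. The toy corpus and the function name are hypothetical, not data from the study.

```python
import math
from collections import Counter


def tf_idf(term, doc, corpus):
    """tf-idf with max-normalized term frequency and natural-log idf.

    doc is a list of tokens; corpus is a list of such documents.
    """
    counts = Counter(doc)
    if term not in counts:
        return 0.0
    tf = counts[term] / max(counts.values())
    n_containing = sum(1 for d in corpus if term in d)
    if n_containing == 0:  # doc is not part of corpus; idf is undefined
        return 0.0
    return tf * math.log(len(corpus) / n_containing)


# Hypothetical three-document corpus of tokenized comments.
corpus = [
    ["dxm", "cough", "syrup", "dxm"],
    ["cough", "cold"],
    ["music", "video"],
]
score_dxm = tf_idf("dxm", corpus[0], corpus)      # 1.0 * ln(3 / 1)
score_cough = tf_idf("cough", corpus[0], corpus)  # 0.5 * ln(3 / 2)
```

Here "dxm" scores higher than "cough" in the first document because it is both more frequent there and rarer across the corpus.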
Distribution of specific words across plateaus
Chary et al. (2013), PLoS ONE
Different words in different plateaus
Chary et al. (2013), PLoS ONE
What is a term?
n-grams: contiguous sequences of n words
words or semantic units
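A short sketch of n-gram extraction, reading an n-gram as a contiguous run of n tokens. Whitespace tokenization and the function name are simplifying assumptions; a real pipeline would likely normalize and stem the text first.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "dog bites man".split()
unigrams = ngrams(tokens, 1)  # [("dog",), ("bites",), ("man",)]
bigrams = ngrams(tokens, 2)   # [("dog", "bites"), ("bites", "man")]
```

Using bigrams or longer n-grams as terms lets a multi-word phrase be counted as a single unit.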
A term can be longer than one word
Myslin et al. (2013)
Computational Semantics of Social Media
How similar are two phrases?
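[Figure: "Dogs run." and "Cats run." drawn as vectors in a space with axes dog(s), cat(s), and run(s); φ is the angle between the two vectors.]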
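One common way to quantify how similar two phrases are in a vector space model is the cosine of the angle φ between their word-count vectors. The sketch below assumes the phrases have already been stemmed to the tokens "dog", "cat", and "run"; the function name is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# "Dogs run." and "Cats run." after stemming (assumed preprocessing).
sim = cosine_similarity(["dog", "run"], ["cat", "run"])
phi_degrees = math.degrees(math.acos(sim))
```

The two phrases share one of their two words, so the cosine is about 0.5 and φ is about 60 degrees.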
Computational Semantics vs Linguistics
Sentences are more than bags of words: "Dog bites man" vs. "Man bites dog."
Word frequency overlooks emotion and cryptic or allusive language
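The word-order point can be checked directly: under a bag-of-words representation the two example sentences become indistinguishable. This is a minimal sketch with naive lowercasing and whitespace tokenization; the helper name is hypothetical.

```python
from collections import Counter


def bag_of_words(sentence):
    """Word counts for a sentence, ignoring case and word order."""
    return Counter(sentence.lower().split())


bag_a = bag_of_words("Dog bites man")
bag_b = bag_of_words("Man bites dog")
same_bag = bag_a == bag_b  # identical counts despite opposite meanings
same_order = "Dog bites man".lower().split() == "Man bites dog".lower().split()
```

The bags match while the token sequences differ, which is the information a purely frequency-based method discards.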
λ-Calculus
NYU dominates CU.
S: dominate(NYU, CU)
   NP (NYU): λP.P(NYU)
   VP: λx.dominate(x, CU)
      TV (dominate): λO.λx.(O @ λy.dominate(x, y))
      NP (CU): λP.P(CU)
Similarity of λ-expressions, not words
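The derivation for "NYU dominates CU." can be traced with ordinary Python functions standing in for λ-terms, where function application plays the role of @. This is an illustrative sketch only; representing the final logical form as a string is an assumption made for readability.

```python
def dominate(x, y):
    """Build the logical form for 'x dominates y' as a string."""
    return f"dominate({x}, {y})"


# Noun phrases are type-raised: they take a property P and apply it.
np_nyu = lambda P: P("NYU")  # λP.P(NYU)
np_cu = lambda P: P("CU")    # λP.P(CU)

# Transitive verb: λO.λx.(O @ λy.dominate(x, y))
tv_dominate = lambda O: lambda x: O(lambda y: dominate(x, y))

vp = tv_dominate(np_cu)  # reduces to λx.dominate(x, CU)
sentence = np_nyu(vp)    # reduces to dominate(NYU, CU)
```

Swapping the two noun phrases yields dominate(CU, NYU), so the representation keeps the subject and object roles that word counts alone lose.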
Dynamics of Social Networks
Motivation
Homophily: people interact with similar people
Corroboration: internal check on statements about illegal activities
Different topologies allow different dynamics
Networks discussing drugs have different topologies
Chary et al. (2013) EAPCCT
A model of social interactions
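$\tau \frac{da}{dt} = -a + M \cdot a + W \cdot u$
[Figure: network of three nodes with activities a_i, a_j, a_k; self-connections M_ii, M_jj, M_kk; pairwise connections M_ij, M_ji, M_jk, M_kj, M_ik, M_ki; outside input u enters through weights w_i and w_k.]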
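The model τ da/dt = −a + M·a + W·u can be explored numerically with forward Euler integration. The sketch below is one simple way to do that; the two-node network, the weights, the step size, and the function name are hypothetical choices, and the talk does not say how the model was actually solved.

```python
def simulate(M, W, u, a0, tau=1.0, dt=0.01, steps=2000):
    """Forward-Euler integration of tau * da/dt = -a + M.a + W.u.

    M is an n x n list of lists, W is n x m, u has length m, a0 has length n.
    """
    a = list(a0)
    n = len(a)
    for _ in range(steps):
        drive = [
            sum(M[i][j] * a[j] for j in range(n))
            + sum(W[i][k] * u[k] for k in range(len(u)))
            for i in range(n)
        ]
        a = [a[i] + (dt / tau) * (-a[i] + drive[i]) for i in range(n)]
    return a


# Hypothetical example: two mutually coupled nodes, outside input to node 0.
M = [[0.0, 0.5],
     [0.5, 0.0]]
W = [[1.0],
     [0.0]]
u = [1.0]
final = simulate(M, W, u, a0=[0.0, 0.0])
```

For this example the fixed point satisfies a = M·a + W·u, which gives a_0 = 4/3 and a_1 = 2/3, and the simulated activities settle close to those values. Changing the entries of M changes the topology and therefore how activity spreads between nodes.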
Summary
Social media provide readily accessible data for analyzing endemics,
epidemics, and behavior.
Analyzing social media provides a model environment for developing tools to
deeply analyze unstructured text in electronic medical records and libraries.
Analyzing unstructured data requires a unique combination of
linguistics, applied mathematics, and computer science.
Outlook
Generalize to multiple languages
Improve curation
Intervene