Extracting Mobile Behavioral Patterns with the Distant N-Gram Topic Model
Lingzi Hong
Feb 10th
Research Question
•
Problem: modeling activity sequences for large-scale
human routine discovery from cellphone sensor data
•
Fundamental difficulty: the basic unit of time for the
activities in question is unknown (hourly? daily?)
=> need effective modeling of multiple unknown
time durations
•
focus on Probabilistic Topic Models
•
unsupervised=>mining structure of data
•
handle uncertainty
•
extended in various ways to integrate multiple
data types=>sensor activity sequences
Contributions
•
propose the distant n-gram topic model (DNTM) for
sequence modeling
•
derive inference process using Markov Chain Monte
Carlo (MCMC) sampling
•
apply to two real large-scale datasets
•
comparative analysis with Latent Dirichlet Allocation
(LDA)
Related Work
Topic model as a useful tool
1. T. Huynh, M. Fritz, and B. Schiele. Discovery of activity patterns using topic models.
2. K. Farrahi and D. Gatica-Perez. Probabilistic mining of socio- geographic routines from mobile phone data.
3. T. Bao, H. Cao, E. Chen, J. Tian, and H. Xiong. An unsupervised approach to modeling personalized
contexts of mobile users.
4. K. Farrahi and D. Gatica-Perez. Discovering routines from large-scale human locations using probabilistic
topic models.
Topic model in terms of text
1. LDA: estimates the probability of each word given each topic and the probability of each topic given each document
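As a small illustration of the two LDA quantities named above, the sketch below combines them into the probability of a word in a document. The topic names and all numbers are hypothetical toy values, not estimates from any dataset in these slides.

```python
# Hypothetical toy values; nothing here is estimated from real data.
# phi[k][w] = p(word w | topic k)
phi = {
    "work_day": {"H": 0.30, "W": 0.60, "O": 0.10},
    "home_day": {"H": 0.80, "W": 0.05, "O": 0.15},
}
# theta[k] = p(topic k | document)
theta = {"work_day": 0.7, "home_day": 0.3}


def word_prob(word, phi, theta):
    """p(word | document) = sum over topics of p(word | topic) * p(topic | document)."""
    return sum(theta[k] * phi[k][word] for k in theta)


probs = {w: word_prob(w, phi, theta) for w in ("H", "W", "O")}
print(probs)
```

Because each row of phi and theta sums to one, the resulting word probabilities also sum to one.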
N-gram discovery
1. bigram topic model
2. topic n-gram model
Distant N-Gram Topic Model
[Figure: graphical model of the DNTM; recoverable labels: corpus, q_m, S_m]
A sequence q = (w_1, w_2, …, w_N)
w = (t, l), with t = time coordinate of a day and l = location
The distribution of w_1 given topics
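To make the notation concrete, here is a minimal sketch of how a day could be turned into labels w = (t, l) and how one sequence q of N consecutive labels could be read off. The function names, the 48-slot day, and the example locations are assumptions made for illustration; the slides do not specify this code.

```python
from typing import List, Tuple

Label = Tuple[int, str]  # w = (t, l): time slot index, location label


def day_to_labels(locations: List[str]) -> List[Label]:
    """Pair each location with the index of its time slot within the day."""
    return [(t, loc) for t, loc in enumerate(locations)]


def sequence_at(labels: List[Label], start: int, n: int) -> Tuple[Label, ...]:
    """Return q = (w_1, ..., w_N): the n consecutive labels beginning at `start`."""
    if start < 0 or start + n > len(labels):
        raise ValueError("sequence would run past the end of the day")
    return tuple(labels[start:start + n])


# Hypothetical day with 48 time slots.
day = ["H"] * 16 + ["W"] * 18 + ["O"] * 4 + ["H"] * 10
labels = day_to_labels(day)
q = sequence_at(labels, start=15, n=3)
print(q)
```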
Distant N-Gram Topic Model
•
General process:
•
1. Initialization (document topic, distribution over labels)
•
2. Sequence generation procedure (estimate parameters)
•
model parameters derived based on MCMC approach of Gibbs sampling
•
estimation of parameters:
?
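The DNTM update equations are not reproduced in these slides, so the sketch below shows collapsed Gibbs sampling for plain LDA instead, only to illustrate the general pattern of initializing topic assignments and then resampling them. The function name, hyperparameter defaults, and toy documents are assumptions.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for plain LDA (not the DNTM).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (theta, phi): per-document topic and per-topic word distributions.
    """
    rng = random.Random(seed)
    topics = list(range(num_topics))
    n_dk = [[0] * num_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    n_k = [0] * num_topics                                # total tokens per topic
    z = []                                                # topic of every token

    # Initialization: give every token a random topic and count it.
    for d, doc in enumerate(docs):
        z.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, z[d]):
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # p(z = j | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][w] + beta)
                    / (n_k[j] + vocab_size * beta)
                    for j in topics
                ]
                k = rng.choices(topics, weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    theta = [[(c + alpha) / (len(doc) + num_topics * alpha) for c in n_dk[d]]
             for d, doc in enumerate(docs)]
    phi = [[(c + beta) / (n_k[k] + vocab_size * beta) for c in n_kw[k]]
           for k in topics]
    return theta, phi


# Hypothetical toy corpus with a vocabulary of 4 word ids.
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2], [0, 1, 2, 3]]
theta, phi = lda_gibbs(docs, num_topics=2, vocab_size=4)
```

A DNTM sampler would additionally condition each later label in a sequence on the first label, which this LDA stand-in does not do.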
Distant N-Gram Topic Model
•
Code is available that helps to implement this
process
Experiments and
Results
•
Nokia Smartphone Data
•
Tricks?: days with topic
distribution => 10 most
probable days for the topic
ranked from top to bottom
Experiments and
Results
•
MIT Reality Mining Data
•
L={'H','W','O','N'}, tt=48
•
most probable days given
topics
Experiments and
Results
most probable sequence
components for topics
Evaluation
•
splitting into training and testing
A test set is a collection of unseen documents w_d. The model is described by
the topic matrix Φ and the hyperparameter α for the topic distribution of documents.
•
Log-likelihood:
The probability of unseen held-out documents given some training documents.
Higher likelihood implies a better model
•
Perplexity:
The lower the perplexity, the better the model
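The two measures are linked. Under the definition commonly used in the LDA literature (assumed here, since the slides do not give the formula), perplexity is the exponential of the negative held-out log-likelihood per token. The function name and toy inputs below are hypothetical.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-(total held-out log-likelihood) / (total held-out token count))."""
    total_tokens = sum(doc_lengths)
    if total_tokens == 0:
        raise ValueError("held-out set is empty")
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)


# Sanity check with a hypothetical model that is uniform over 4 labels:
# every token then has probability 1/4.
lengths = [48, 48]
log_liks = [n * math.log(1 / 4) for n in lengths]
ppl = perplexity(log_liks, lengths)
print(ppl)
```

For a model that is uniform over V labels the perplexity comes out as V (up to floating-point error), which gives a baseline: a useful model should score below the number of distinct labels.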
Evaluation
•
perplexity of the DNTM over
number of 20% unseen days
•
Average log-likelihood of the
DNTM versus LDA on 20%
unseen days.
Discussion
•
generalization of the model
•
The model assumes every topic has a distribution over sequences q, whose elements w are labeled with time
and location, so each w is tied to one general topic distribution. But with many
users in the sample, a workplace for A might be a leisure place for B.
•
In topic models, the topic distribution associated with a word is applied equally
to all documents. However, we cannot assume that a place has the same topic distribution of
daily activities for different people. Could we?
•
Nokia Smartphone: 2 users, each with many places, in two different cities. Few overlapping
places with mixed function. Results are given separately for user 1 and user 2.
•
MIT data: many users, but places have already been labeled, so the result is only the identification of topics.
•
A real data set will include many users and unlabeled places.
Discussion
•
How to choose N? Segmentation of sequences
according to activities or according to time? What if
the last sequence q is not complete?
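One way to explore the question about incomplete sequences is to segment a day by time into consecutive, non-overlapping sequences and make the treatment of a short final sequence an explicit choice. This is only a sketch: the function name, the `tail` options, and the example day are assumptions, and the slides do not say how the model itself segments days.

```python
def segment(labels, n, tail="keep"):
    """Split a day's labels into consecutive, non-overlapping sequences of length n.

    tail decides what happens to a final sequence shorter than n:
    "keep" leaves it short, "drop" discards it, "pad" fills it with None.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if tail not in ("keep", "drop", "pad"):
        raise ValueError("tail must be 'keep', 'drop', or 'pad'")
    chunks = [tuple(labels[i:i + n]) for i in range(0, len(labels), n)]
    if chunks and len(chunks[-1]) < n:
        if tail == "drop":
            chunks.pop()
        elif tail == "pad":
            chunks[-1] = chunks[-1] + (None,) * (n - len(chunks[-1]))
    return chunks


# Hypothetical day with 48 time slots; 48 is not a multiple of 5,
# so the last sequence is incomplete.
day = ["H"] * 16 + ["W"] * 18 + ["O"] * 4 + ["H"] * 10
kept = segment(day, 5, tail="keep")
dropped = segment(day, 5, tail="drop")
padded = segment(day, 5, tail="pad")
print(len(kept), len(dropped), padded[-1])
```

Choosing N as a divisor of the number of slots per day avoids the incomplete tail altogether, at the cost of restricting the sequence lengths that can be tried.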
Discussion
•
Could we simply cluster the sequences to
detect activity patterns?
•
48 intervals a day, each interval as a feature; the value
of the feature is the label ('H','W','O','N')
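The clustering alternative can be sketched as follows: each day is a vector of 48 categorical features, days are compared by Hamming distance, and a minimal k-modes loop groups them. The choice of k-modes, the function names, the initialization from the first k days, and the example days are all assumptions for illustration, not something proposed in the slides.

```python
from collections import Counter


def hamming(a, b):
    """Number of time slots in which two days carry different labels."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(days, k, iters=10):
    """Minimal k-modes: k-means with Hamming distance and per-slot modes.

    The first k days serve as the initial modes (a simplistic,
    deterministic choice); requires k <= len(days).
    """
    modes = [list(d) for d in days[:k]]
    assign = []
    for _ in range(iters):
        # Assignment step: each day joins the cluster with the nearest mode.
        assign = [min(range(k), key=lambda c: hamming(d, modes[c])) for d in days]
        # Update step: each mode takes the most common label in every slot.
        for c in range(k):
            members = [d for d, a in zip(days, assign) if a == c]
            if members:
                modes[c] = [Counter(col).most_common(1)[0][0] for col in zip(*members)]
    return assign, modes


# Hypothetical days, each with 48 labeled intervals.
work_day = ["H"] * 16 + ["W"] * 18 + ["O"] * 4 + ["H"] * 10
home_day = ["H"] * 48
days = [work_day, home_day, work_day[:40] + ["O"] * 8, ["H"] * 46 + ["O"] * 2]
assign, modes = k_modes(days, k=2)
print(assign)
```

Unlike a topic model, this assigns each day to exactly one cluster, so a day that mixes several routines cannot be described as a combination of patterns.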