An Information-Theoretic Approach to Individual Sequence Sanitization
Luca Bonomi, UCSD
Liyue Fan, USC
Hongxia Jin, Samsung Research
Previous Work
• Sequential Pattern Mining • [CIKM-12, CIKM-13, PhD-VLDB-13]
• Privacy Preserving Record Linkage • [CIKM-12, SIGMOD-13, Book_Chp-15]
• Time-series and Data Streams • [WWW-14, PAIS-15, TDP-16]
Outline
• Introduction
• Sequential Data
• Motivations
• Related Works
• Our framework
• Definitions
• Problem Formulation
• Solutions
• Top-Down Approach
• Bottom-Up Approach
• Evaluations
• Conclusions
Introduction
• People are generating a lot of data
• Much of these data are sequential in nature (e.g. time-series data, data streams, biomedical signals, DNA sequences, etc.)
Introduction
• Time-series data and biomedical signals (e.g. blood pressure, heart rate) can be converted into sequential data to enable knowledge discovery.
[Figure: a time series (numeric representation) converted into a pattern (symbolic representation), e.g. b a a b c c b c]
• Find the occurrences of a pattern over an input sequence within a window w
• Example: p = ab, w = 2, S = b a c b e c a b b c d b a b
3 occurrences of p in S
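The occurrence count above can be sketched in code. This is a minimal sketch, not from the talk; the deck's exact occurrence semantics may differ. Here an occurrence is taken to be an increasing index tuple that spells out p and whose span is at most w.

```python
from itertools import combinations

def count_occurrences(p, S, w):
    """Count occurrences of pattern p in sequence S, where an occurrence
    is an increasing index tuple spelling p whose span is at most w."""
    count = 0
    for idx in combinations(range(len(S)), len(p)):
        if idx[-1] - idx[0] + 1 <= w and all(S[i] == c for i, c in zip(idx, p)):
            count += 1
    return count

print(count_occurrences("ab", "abxab", 2))  # 2: the two adjacent "ab" pairs
```

The brute-force enumeration is exponential in the pattern length; it is meant only to make the window-based occurrence definition concrete.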
Introduction
• Trusted server scenario:
• Data records from multiple users are collected by a trusted server
• Privacy-preserving data are released to a service provider/third party (e.g. researchers)
[Figure: users -> TRUSTED server -> privacy-preserving data -> third party]
- Differential privacy (aggregated statistics)
- k-anonymity (anonymized database)
Introduction
[Figure: a user's sequence a b a a b c c a b c d d e a b b e e e d released to a service provider in exchange for a service]
• Users directly release their individual data to a service provider to receive services:
• Fitness band: heart rate, distance, location -> fitness evaluation and workout recommendations
• Mobile devices: GPS, trajectories -> ads and location-based services
• Credit card: transactions -> coupons, product recommendations
However, releasing these data may disclose sensitive information about the individual
Example I: Surveillance
• HyReminder (D. Wang et al.) aims to track, monitor, and remind health-care workers with respect to hygiene compliance.
- Sensors are installed in the hospital
- Workers (doctors, etc.) are equipped with RFID tags which can be tracked by the sensors
Example I: Surveillance
• By processing the doctors' signals, event patterns can be detected to infer hygiene compliance.
• The event exit-patient-room for a doctor should be followed by hand-sanitization within a short period of time.
• Some patterns may disclose sensitive information
• E.g. a doctor's exit-patient-room followed by enter-psychiatrist-office may reveal that this patient is experiencing psychiatric problems.
Ref: Utility-maximizing event stream suppression, by Wang et al., SIGMOD '13
Example II: Market Basket Analysis
• Items bought during transactions can be modeled as patterns.
• Companies look for combinations of products that frequently co-occur in transactions to make business decisions (e.g. arranging products together).
• Individual users may wish to receive targeted coupons and product recommendations.
Example II: Market Basket Analysis
• However, some patterns may disclose sensitive information about the customer:
• A female customer who buys unscented lotions and zinc/magnesium supplements first, and unscented soaps and cotton balls in extra-large bags later, is probably pregnant and very close to her delivery date (e.g. Target's incident).
Ref: How Companies Learn Your Secrets, by Duhigg, The New York Times, 2012
Example III: Assisted Daily Living/Smart Homes
• Assisted daily living for the elderly or disabled
• Data collected from sensors in smart homes enables detecting activities and behaviors, which can then be promptly supported with adequate services
• Activities are represented with patterns:
MakeTea: KitchenDoorObj, KettleObj, CupObj, TeaObj, MilkObj, SugarObj
Some of the monitored patterns may disclose sensitive information about the monitored resident
Ref: An Agent-mediated Ontology-based Approach for Composite Activity Recognition in Smart Homes, by Okeyo et al.
Our Goals
• We want to design a sanitization solution that:
1. Enables individual users to release their sequential data while protecting the information about sensitive patterns.
2. Preserves the utility of the sanitized released sequence.
Related Works
• Hiding sequential patterns by removing the pattern occurrences below a given threshold
• Suppression: remove individual symbols
• Permutation: change symbol order within the pattern
Sensitive patterns: {ab, bc}, th = 2
S = a b c c a b e e a b c
ab and bc occur 3 times in S within a window of length 2
Sanitization via suppression:
S = a b c c a b e e a b c
S' = a c c a b e e a b c
Cons: data loss
Sanitization via permutation:
S = a b c c a b e e a b c
S' = b a c c a b e e a b c
Cons: creates ghost patterns
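The permutation drawback can be shown concretely. A small illustrative sketch (the helper name is ours, and for simplicity it counts only adjacent occurrences): permuting the first ab in S hides one occurrence of ab, but creates a ghost occurrence of ba that was never in the original data.

```python
def count_adjacent(p, S):
    """Count adjacent occurrences of a length-2 pattern p in S."""
    return sum(1 for i in range(len(S) - 1) if S[i] == p[0] and S[i + 1] == p[1])

S_orig = "abccabeeabc"   # S  = a b c c a b e e a b c
S_perm = "baccabeeabc"   # S' = b a c c a b e e a b c (first "ab" permuted)

print(count_adjacent("ab", S_orig), "->", count_adjacent("ab", S_perm))  # 3 -> 2
print(count_adjacent("ba", S_orig), "->", count_adjacent("ba", S_perm))  # 0 -> 1 (ghost)
```

A miner counting ba on the sanitized sequence would report a pattern the user never produced.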
Related Works
• Privacy for sequential data
• Generalize patterns using k-anonymity and l-diversity
• Cons: treats every pattern the same way regardless of its frequency
• Release sanitized data using differential privacy
• Cons: protects single events rather than patterns; may create ghost patterns
Our Framework
• Sanitization method based on generalization via a taxonomy tree T.
• The tree T preserves partial information between the most fine-grained events and more general events.
• We consider a user-specified cost function c(a, a') which defines the utility loss in generalizing the symbol a to a'
• Example: c(Wine, Alcohol) = 0.2
Our Framework
Input:
S original sequence
S set of sensitive patterns
Output:
S’ sanitized sequence
Generalize Sensitive Patterns
• Adversary: observes the generalized patterns in S' and tries to infer the sensitive information S
• GOAL: bound the inference gain of the adversary after seeing S'
Our Framework
• An adversary may have prior knowledge about the sensitive patterns (e.g. from historic data)
• The average statistical information gain of an adversary observing the generalized patterns g corresponds to the mutual information:
I(S; g) = H(S) − H(S | g)
where I(S; g) is the mutual information between the sensitive and generalized patterns, H(S) is the original sensitive-pattern entropy, and H(S | g) is the conditional entropy
• ϵ-mutual information privacy guarantees that I(S; g) ≤ ϵ holds for the release S'
• ϵ is the privacy parameter (lower ϵ -> higher privacy)
• We can always achieve perfect privacy (i.e. ϵ = 0) by generalizing all the symbols to the maximum level of generalization (i.e. the root of the tree)
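The privacy test can be sketched numerically. This is a minimal empirical estimator we assume for illustration, not the authors' code: estimate I(S; g) from observed (sensitive, generalized) pattern pairs and compare it against ϵ.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information I(S;G) in bits from (sensitive, generalized) pairs."""
    n = len(pairs)
    ps = Counter(s for s, _ in pairs)   # marginal of sensitive patterns
    pg = Counter(g for _, g in pairs)   # marginal of generalized patterns
    pj = Counter(pairs)                 # joint counts
    return sum((c / n) * math.log2(c * n / (ps[s] * pg[g])) for (s, g), c in pj.items())

def satisfies_privacy(pairs, eps):
    """The epsilon-mutual-information privacy test: I(S;G) <= epsilon."""
    return mutual_information(pairs) <= eps

# Generalizing every sensitive pattern to the root yields perfect privacy (I = 0):
print(mutual_information([("ab", "All"), ("bc", "All"), ("ab", "All")]))  # 0.0
```

When every sensitive pattern maps to the same generalized symbol, observing g reveals nothing about S, matching the ϵ = 0 remark above.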
Our Framework
• Utility loss UL for the released sequence S': the total generalization cost c(a, h(a)) accumulated over the symbols of S
• Problem Formulation (Minimum Utility Loss Generalization):
• Given a set of user-specified sensitive patterns S, an input sequence S, and a privacy parameter ϵ, construct a sanitized sequence S' such that:
(1) S' satisfies ϵ-mutual information privacy
(2) the utility loss UL is minimized
• We want to find a generalization map h for each symbol in S to produce S'
Our Framework
• Problem Variants:
• Global: all occurrences of the same symbol are generalized in the same way
• Local: each symbol occurrence may have a different generalization
Global: generalize a with a'
S = a b a b b d a e b e
S' = a' b a' b b d a' e b e
Local: generalize a with a', a'', or even keep a
S = a b a b b d a e b e
S' = a' b a b b d a'' e b e
• Offline: the original input sequence is given all at once
• Online: the original input sequence is given one symbol at a time
We focus on the global offline version of this problem (g-MULG).
Even with this simple formulation, g-MULG is NP-hard.
Heuristic I: Top-‐down Approach
• Sanitize the input sequence by generalizing individual symbols
• The process is guided by the taxonomy tree T:
1. Start from the highest level of generalization for each symbol in the taxonomy tree T (i.e. h(a) = root for every symbol a)
2. Pick the symbol to refine that incurs the minimum utility loss
3. Refine its generalization by replacing it with the next node on the path
4. If privacy is violated, terminate; otherwise repeat from (2)
[Figure: taxonomy tree; higher levels: less utility, more privacy; lower levels: more utility, less privacy]
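The four steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the taxonomy table, cost function, and privacy test are toy stand-ins (in the real algorithm the test is the ϵ-mutual-information check).

```python
# Toy taxonomy, child -> parent (assumed for illustration)
PARENT = {"Wine": "Alcohol", "Beer": "Alcohol",
          "Cookies": "Snack", "Chips": "Snack",
          "Alcohol": "All", "Snack": "All"}

def path_to_root(a):
    """Path from a symbol up to the taxonomy root, e.g. Wine -> Alcohol -> All."""
    path = [a]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def top_down(symbols, cost, satisfies_privacy):
    # 1. start at the coarsest generalization: every symbol mapped to the root
    h = {a: path_to_root(a)[-1] for a in symbols}
    pruned = set()  # symbols whose refinement already violated privacy
    while True:
        best = None
        for a in symbols:
            if a in pruned or h[a] == a:      # pruned, or already fully refined
                continue
            path = path_to_root(a)
            nxt = path[path.index(h[a]) - 1]  # 3. next node toward the leaf
            trial = dict(h); trial[a] = nxt
            if not satisfies_privacy(trial):  # monotonicity: never retry a failure
                pruned.add(a)
                continue
            loss = sum(cost(b, g) for b, g in trial.items())
            if best is None or loss < best[0]:
                best = (loss, a, nxt)         # 2. keep the minimum-loss refinement
        if best is None:
            return h                          # 4. no admissible refinement remains
        h[best[1]] = best[2]
```

For instance, with a depth-based cost and a toy privacy test that merely forbids releasing Wine at full precision, the heuristic refines Cookies down to the leaf but stops Wine at Alcohol.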
Heuristic I: Top-‐down Approach
• Example of refinement from g' = [Wine=h(Wine), …, Cheese=h(Cheese)], where h(Wine) = … = h(Cheese) = All, to the next generalization g''
g' = [Wine=All, Beer=All, Cookies=All, Chips=All, Milk=All, Cheese=All]
g'' = [Wine=All, Beer=All, Cookies=Snack, Chips=Snack, Milk=All, Cheese=All]
Select all the symbols in the same sub-tree as the symbol Cookies
Heuristic I: Top-‐down Approach
• By generalizing symbols in subtrees we can take advantage of the privacy monotonicity property:
• Privacy monotonicity: at each refining step the privacy never increases
• Pruning: if the privacy test fails for some symbols at an early stage, then these symbols will not meet the privacy requirement in later steps of the algorithm either
• Example: if refining All to Alcohol for the symbol Wine fails privacy at g' = [Wine=All, Cookies=All, Milk=Dairy], then it will fail privacy for all future generalizations obtained from g', e.g. g'' = [Wine=All, Cookies=Snack, Milk=Dairy]
Heuristic I: Top-‐down Approach
• If a symbol refinement fails the privacy test, we do not consider it in future steps of the algorithm
• From an exponential number of symbol combinations, we need to consider only a polynomial number of generalization steps
• The overall running time is O(|Σ|³ × |S| × w × m), where Σ is the alphabet, |S| the length of the input sequence, w the window, and m the length of the patterns
Heuristic II: Bottom-‐up Approach
• Construct a generalization at the sensitive-pattern level:
• Group similar sensitive patterns
• Find the grouping that leads to the minimum utility loss
[Figure: clusters of sensitive patterns with generalized representatives p'1, …, p'5]
C1 = {p1, p2, p3, p4} is a group of sensitive patterns
p'1 is the generalized pattern for C1 in the sanitized sequence S'
Heuristic II: Bottom-‐up Approach
• We devise a hierarchical clustering-based approach:
1. Start: each sensitive pattern is a singleton cluster
2. Find the pair of clusters C1, C2 whose merge leads to the minimum utility loss
3. Merge C1 and C2 into C12 and generalize their sensitive patterns
4. If privacy is not yet achieved, repeat from (2)
Heuristic II: Bottom-‐up Approach
• For each cluster of sensitive patterns a generalized pattern is computed (i.e. a representative of the cluster)
• In principle there are many ways to compute a representative
• We construct a representative using the Least Common Ancestor information in the taxonomy tree -> incurs minimum utility loss
• Example: P1 = <Beer, Chips> and P2 = <Milk, Cookies> give P12 = <All, Snack>; further merging with P3 = <Wine, Beer> gives P13 = <All, All>
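The LCA-based representative can be sketched directly. The child-to-parent taxonomy table below is an assumption reconstructed from the slide examples (Wine/Beer under Alcohol, Cookies/Chips under Snack, Milk/Cheese under Dairy).

```python
# child -> parent, reconstructed from the slide examples
PARENT = {"Wine": "Alcohol", "Beer": "Alcohol",
          "Cookies": "Snack", "Chips": "Snack",
          "Milk": "Dairy", "Cheese": "Dairy",
          "Alcohol": "All", "Snack": "All", "Dairy": "All"}

def chain(a):
    """Ancestor chain of a symbol, from itself up to the root."""
    c = [a]
    while c[-1] in PARENT:
        c.append(PARENT[c[-1]])
    return c

def lca(symbols):
    """Least common ancestor: the deepest node shared by all ancestor chains."""
    common = set(chain(symbols[0]))
    for s in symbols[1:]:
        common &= set(chain(s))
    return max(common, key=lambda n: len(chain(n)))

def representative(patterns):
    """Generalize equal-length patterns position by position via the LCA."""
    return tuple(lca([p[i] for p in patterns]) for i in range(len(patterns[0])))

print(representative([("Beer", "Chips"), ("Milk", "Cookies")]))
# ('All', 'Snack') -- the P12 from the slide
```

Taking the LCA per position is what keeps the representative as low in the tree as possible, hence the minimum-utility-loss claim above.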
Heuristic II: Bottom-‐up Approach
• We select the pair of clusters to merge that incurs the minimum utility loss at each iteration
• All the sensitive patterns are replaced with the generalized pattern associated with their cluster
• When privacy is achieved, the original sequence is sanitized and released
• Running time: O(|𝒮|³ × |S| × |Σ| × w × hT × m³), where |𝒮| is the number of sensitive patterns, |S| the length of the input sequence, Σ the alphabet, w the window, hT the depth of T, and m the length of the patterns
Evaluations
• Datasets
• MSNBC: web browsing sequences; taxonomy tree based on the page topic
• RM (Reality Mining): user trajectories collected by mobile devices (i.e. mobile phones); taxonomy tree based on clustering cell towers
• Evaluation metrics
• Utility loss (UL): linear and exponential utility decay for symbols in the tree, plus the standard Information Loss metric
• Running time (ms)
Evaluations
• Running time increases as more patterns are specified as sensitive
• UL increases as more patterns are specified as sensitive
• UL and running time increase as the frequency of the sensitive patterns increases
• Frequent patterns are preserved: the absolute frequency of the generalized patterns increases, since several sensitive patterns may coincide in the same generalized pattern
Conclusion
• Our framework:
• Enables individual sequence sanitization while providing strong, user-specified privacy
• Minimizes the utility loss inflicted by the sanitization process
• Enables third parties to perform data-mining analytics on the released sequence (e.g. frequent patterns)
Thanks!
Questions?
Contact:
[email protected]