Detecting Data Leakage

Panagiotis Papadimitriou
Hector Garcia-Molina
Leakage Problem
[Figure: Facebook profiles (e.g. Name: Sarah, Sex: Female; Name: Mark, Sex: Male) of Jeremy, Sarah, Mark and Kathryn, the applications U1 and U2 that receive profiles, and other sources of profile data, e.g. Sarah's network]
Stanford Infolab
Outline
• Problem Description
• Guilt Models
– Pr{U1 leaked data} = 0.7
– Pr{U2 leaked data} = 0.2
• Distribution Strategies
Problem Entities
• Distributor (Facebook): dataset T, the set of all Facebook profiles
• Agents (Facebook apps U1, …, Un): datasets R1, …, Rn, where Ri is the set of profiles of the people who have added the application Ui
• Leaker: dataset S, the set of leaked profiles
Agents’ Data Requests
• Sample
– 100 profiles of Stanford people
• Explicit
– All people who added the application
(the example used so far)
– All Stanford profiles
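The difference between the two request types can be sketched in a few lines of Python. This is only an illustration: the profile dictionary, the `school` attribute and the function names are hypothetical and do not come from the talk. An explicit request returns every profile that satisfies the condition, while a sample request returns any m of them, which leaves the choice to the distributor (made at random here).

```python
import random

# Hypothetical toy dataset T; the attribute names are assumptions.
profiles = {
    "Sarah": {"school": "Stanford"},
    "Mark": {"school": "Stanford"},
    "Kathryn": {"school": "Stanford"},
    "Jeremy": {"school": "Berkeley"},
}

def explicit_request(profiles, condition):
    """Explicit request: ALL profiles that satisfy the condition."""
    return {name for name, attrs in profiles.items() if condition(attrs)}

def sample_request(profiles, condition, m, rng=random):
    """Sample request: ANY m profiles that satisfy the condition.

    The distributor is free to choose which ones; here the choice is random.
    """
    matching = sorted(explicit_request(profiles, condition))
    return set(rng.sample(matching, m))

def is_stanford(attrs):
    return attrs["school"] == "Stanford"

all_stanford = explicit_request(profiles, is_stanford)
two_stanford = sample_request(profiles, is_stanford, 2, random.Random(0))
```

The freedom in `sample_request` is what the distribution strategies later in the talk exploit.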
Guilt Models (1/3)
p: posterior probability that a leaked profile
comes from other sources
Guilty Agent: Agent who leaks at least one profile
Pr{Gi|S}: probability that agent Ui is guilty, given
the leaked set of profiles S
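One possible way to turn these definitions into a number is sketched below. The slides do not give a formula, so the sketch rests on assumptions that are labeled here and in the code: leaked profiles are treated independently, a leaked profile comes from other sources with probability p, and otherwise exactly one of the agents holding it leaked it, each holder being equally likely. The function name and the example data are hypothetical.

```python
from math import prod

def guilt_probability(agent, allocations, leaked, p):
    """Estimate Pr{G_i | S} under the assumptions stated above.

    ASSUMPTIONS (not from the slides): profiles are independent, and a
    leaked profile that did not come from other sources was leaked by
    exactly one of its holders, chosen uniformly.
    """
    r_i = allocations[agent]
    not_leaked_by_agent = []
    for profile in leaked & r_i:
        holders = sum(1 for r in allocations.values() if profile in r)
        # Probability that this agent did NOT leak this profile.
        not_leaked_by_agent.append(1.0 - (1.0 - p) / holders)
    # Guilty means the agent leaked at least one profile.
    return 1.0 - prod(not_leaked_by_agent)

# Hypothetical example: two agents sharing one profile.
allocations = {"U1": {"Jeremy", "Sarah"}, "U2": {"Sarah", "Mark"}}
leaked = {"Jeremy", "Sarah"}
g1 = guilt_probability("U1", allocations, leaked, p=0.5)
g2 = guilt_probability("U2", allocations, leaked, p=0.5)
```

In this example U1 holds both leaked profiles and is the only holder of one of them, so its estimated guilt is higher than that of U2.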
Guilt Models (2/3)
• Agents leak each of their data items independently
or
• Agents leak all their data items OR nothing
[Figure: leak scenarios labeled with the probabilities (1-p)p, (1-p)^2, p^2 and p(1-p)]
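The four probabilities on this slide match what one gets for two leaked profiles if each profile independently comes from other sources with probability p and from an agent with probability 1-p. That reading of the figure labels is an assumption, and the helper below is a hypothetical sketch that simply enumerates the origins.

```python
from itertools import product

def origin_scenarios(num_leaked, p):
    """Map each combination of origins to its probability.

    ASSUMPTION: every leaked profile independently comes from other
    sources with probability p, else from an agent with probability 1 - p.
    """
    scenarios = {}
    for origins in product(("other", "agent"), repeat=num_leaked):
        prob = 1.0
        for origin in origins:
            prob *= p if origin == "other" else (1.0 - p)
        scenarios[origins] = prob
    return scenarios

two_profiles = origin_scenarios(2, p=0.3)
for origins, prob in two_profiles.items():
    print(origins, round(prob, 4))
```

The four scenario probabilities add up to 1, which is a quick sanity check on the labels.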
Guilt Models (3/3)
[Figure: two plots of Pr{G1} and Pr{G2}, one for agents that leak their items independently and one for agents that do NOT leak independently]
The Distributor’s Objective (1/2)
[Figure: the leaked set S compared with the sets R1, …, R4 held by agents U1, …, U4]
Pr{G1|S} >> Pr{G2|S}
Pr{G1|S} >> Pr{G4|S}
The Distributor’s Objective (2/2)
• To achieve his objective, the distributor has to
distribute sets R1, …, Rn that
minimize $\sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|$, $i, j = 1, \dots, n$
• Intuition: Minimized data sharing among
agents makes leaked data reveal the guilty
agents
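The objective on this slide, the sum over agents of the overlap with all other agents divided by the size of the agent's own set, can be computed directly. The sketch below is a minimal illustration; the function name and the example sets are hypothetical, and it assumes every Ri is non-empty.

```python
def overlap_objective(allocations):
    """Sum over i of (1/|R_i|) * sum over j != i of |R_i & R_j|.

    `allocations` is a list of non-empty sets R_1, ..., R_n.
    """
    total = 0.0
    for i, r_i in enumerate(allocations):
        shared = sum(
            len(r_i & r_j) for j, r_j in enumerate(allocations) if j != i
        )
        total += shared / len(r_i)
    return total

disjoint = [{"Kathryn", "Jeremy"}, {"Sarah", "Mark"}]
overlapping = [{"Kathryn", "Jeremy"}, {"Jeremy", "Sarah"}]
print(overlap_objective(disjoint))     # no shared profiles
print(overlap_objective(overlapping))  # one shared profile
```

Disjoint sets give the smallest possible value, 0, and every shared profile raises the value.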
Distribution Strategies – Sample (1/4)
• Set T has four profiles:
– Kathryn, Jeremy, Sarah and Mark
• There are 4 agents:
– U1, U2, U3 and U4
• Each agent requests a sample of any 2 profiles
of T for a market survey
Distribution Strategies – Sample (2/4)
Minimize $\sum_{i \neq j} |R_i \cap R_j|$
[Figure: two example distributions of the profiles to agents U1, …, U4, one of them labeled "Poor"]
Distribution Strategies – Sample (3/4)
• Optimal Distribution
[Figure: distribution of the profiles to agents U1, …, U4]
• Avoid full overlaps and minimize $\sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|$
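For the small example with four profiles and four agents who each request a sample of two, the space of distributions is tiny and can be searched exhaustively. The brute-force sketch below is an illustration only, not the method of the talk, and its function names are hypothetical. It counts each pair of agents once; summing over ordered pairs i ≠ j doubles every value without changing which distributions are minimal. Because all samples have the same size here, a full overlap means two agents received identical sets.

```python
from itertools import combinations, product

T = ("Kathryn", "Jeremy", "Sarah", "Mark")
SAMPLES = [frozenset(c) for c in combinations(T, 2)]  # every possible R_i

def pairwise_overlap(distribution):
    """Total number of shared profiles, counting each agent pair once."""
    return sum(len(a & b) for a, b in combinations(distribution, 2))

def has_full_overlap(distribution):
    """True if two agents received exactly the same sample."""
    return len(set(distribution)) < len(distribution)

# One sample per agent U1..U4: 6**4 candidate distributions.
candidates = list(product(SAMPLES, repeat=4))
best = min(pairwise_overlap(d) for d in candidates)
minimizers = [d for d in candidates if pairwise_overlap(d) == best]
preferred = [d for d in minimizers if not has_full_overlap(d)]

print(best)  # 4
print(any(has_full_overlap(d) for d in minimizers))
```

Some distributions with minimal total overlap still hand identical samples to two agents, which is why minimizing the plain sum alone is not enough and full overlaps have to be excluded as well.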
Distribution Strategies – Sample (4/4)
Distribution Strategies
Sample Data Requests
• The distributor has the freedom to select the data
items to provide the agents with
• General Idea:
– Provide agents with sets of data that are as disjoint as possible
• Problem: There are cases
where the distributed data
must overlap, e.g.,
|R1|+…+|Rn| > |T|
Explicit Data Requests
• The distributor must
provide agents with the
data they request
• General Idea:
– Add fake data to the
distributed ones to minimize
overlap of distributed data
• Problem: Agents can collude
and identify fake data
• NOT COVERED in this talk
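For sample requests, the overlap problem comes down to a counting argument: fully disjoint sets exist only when the requested sample sizes add up to at most |T|. The sketch below is a hypothetical illustration of that check, not an algorithm from the talk. It hands out consecutive slices of T when disjoint sets are possible and returns None when some overlap is unavoidable.

```python
def disjoint_samples(T, sizes):
    """Return one disjoint sample per agent, or None if impossible.

    `sizes` holds the number of profiles each agent requests. Disjoint
    samples exist only if the sizes sum to at most |T|.
    """
    items = sorted(T)
    if sum(sizes) > len(items):
        return None  # the distributed data must overlap
    samples, start = [], 0
    for m in sizes:
        samples.append(set(items[start:start + m]))
        start += m
    return samples

T = {"Kathryn", "Jeremy", "Sarah", "Mark"}
two_agents = disjoint_samples(T, [2, 2])         # 4 <= |T|: disjoint sets exist
four_agents = disjoint_samples(T, [2, 2, 2, 2])  # 8 > |T|: overlap is forced
```

The second call corresponds to the four-agent example earlier in the talk, where every distribution has to share some profiles between agents.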
Conclusions
• Data Leakage
• Modeled as maximum likelihood problem
• Data distribution strategies that help identify
the guilty agents
Thank You!