Detecting Data Leakage
Panagiotis Papadimitriou, Hector Garcia-Molina
Stanford Infolab

Leakage Problem
[Figure: profiles of Kathryn, Jeremy, Sarah and Mark (e.g. "Name: Sarah, Sex: Female, …" and "Name: Mark, Sex: Male, …"); applications App. U1 and App. U2 receive profiles; a leaked profile may also come from other sources, e.g. Sarah's network.]

Outline
• Problem Description
• Guilt Models
  – Pr{U1 leaked data} = 0.7
  – Pr{U2 leaked data} = 0.2
• Distribution Strategies

Problem Entities
Entity                               Dataset
Distributor: Facebook                T: set of all Facebook profiles
Agents: Facebook apps U1, …, Un      R1, …, Rn, where Ri is the set of profiles of people who have added application Ui
Leaker                               S: set of leaked profiles

Agents' Data Requests
• Sample
  – 100 profiles of Stanford people
• Explicit
  – All people who added the application (the example used so far)
  – All Stanford profiles

Guilt Models (1/3)
• p: posterior probability that a leaked profile comes from other sources (e.g. Sarah's network)
• Guilty agent: an agent who leaks at least one profile
• Pr{Gi|S}: probability that agent Ui is guilty, given the leaked set of profiles S

Guilt Models (2/3)
• Model 1: agents leak each of their data items independently
• Model 2: agents leak all their data items OR nothing
[Figure: leak scenarios labeled with the probabilities (1-p)^2, p(1-p), (1-p)p and p^2.]

Guilt Models (3/3)
[Figure: two plots of Pr{G2} against Pr{G1}, one for the "Independently" model and one for the "NOT Independently" model.]

The Distributor's Objective (1/2)
[Figure: agents U1, U2, U3, U4 holding sets drawn from R1, R2, R3, R4, with the leaked set S.]
• Pr{G1|S} >> Pr{G2|S}
• Pr{G1|S} >> Pr{G4|S}

The Distributor's Objective (2/2)
• To achieve this objective, the distributor has to distribute sets R1, …, Rn that minimize
  $\sum_{i=1}^{n} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|$, with $i, j = 1, \dots, n$
• Intuition: when agents share as little data as possible, the leaked data reveals the guilty agents

Distribution Strategies – Sample (1/4)
• Set T has four profiles:
  – Kathryn, Jeremy, Sarah and Mark
• There are 4 agents:
  – U1, U2, U3 and U4
• Each agent requests a sample of any 2 profiles of T for a market survey

Distribution Strategies – Sample (2/4)
• Minimize $\sum_{i \neq j} |R_i \cap R_j|$
[Figure: a poor distribution and a better distribution of the profiles among U1, U2, U3 and U4.]

Distribution Strategies – Sample (3/4)
• Optimal distribution
[Figure: distribution of the profiles among U1, U2, U3 and U4.]
• Avoid full overlaps and minimize $\sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|$

Distribution Strategies – Sample (4/4)
[Figure]

Distribution Strategies

Sample Data Requests
• The distributor has the freedom to select the data items to provide to the agents
• General idea:
  – Provide agents with sets of data that are as disjoint as possible
• Problem: there are cases where the distributed data must overlap, e.g. |R1| + … + |Rn| > |T|

Explicit Data Requests
• The distributor must provide agents with the data they request
• General idea:
  – Add fake data to the distributed data to minimize the overlap of distributed data
• Problem: agents can collude and identify fake data
• NOT COVERED in this talk

Conclusions
• Data leakage
• Modeled as a maximum likelihood problem
• Data distribution strategies that help identify the guilty agents

Thank You!
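
The overlap objective from "The Distributor's Objective (2/2)" can be sketched in a few lines of Python for the four-profile, four-agent sample example. This is only an illustrative sketch: the function names `overlap_objective` and `has_full_overlap` are hypothetical, and the three allocations below are assumed examples, not the distributions drawn on the slides. "Full overlap" is taken here to mean that one agent's set is contained in another's.

```python
from itertools import permutations


def overlap_objective(allocation):
    """Sum over agents i of (1/|R_i|) * sum over j != i of |R_i & R_j|."""
    total = 0.0
    # permutations(..., 2) yields every ordered pair (i, j) with i != j.
    for i, j in permutations(allocation, 2):
        total += len(allocation[i] & allocation[j]) / len(allocation[i])
    return total


def has_full_overlap(allocation):
    """True if some agent's set is contained in another agent's set."""
    return any(allocation[i] <= allocation[j]
               for i, j in permutations(allocation, 2))


# Assumed allocations for T = {Kathryn, Jeremy, Sarah, Mark}, two profiles each.
same_pair = {u: {"Kathryn", "Jeremy"} for u in ("U1", "U2", "U3", "U4")}
two_pairs = {
    "U1": {"Kathryn", "Jeremy"}, "U2": {"Kathryn", "Jeremy"},
    "U3": {"Sarah", "Mark"}, "U4": {"Sarah", "Mark"},
}
ring = {
    "U1": {"Kathryn", "Jeremy"}, "U2": {"Jeremy", "Sarah"},
    "U3": {"Sarah", "Mark"}, "U4": {"Mark", "Kathryn"},
}

for name, alloc in [("same_pair", same_pair), ("two_pairs", two_pairs), ("ring", ring)]:
    print(name, overlap_objective(alloc), has_full_overlap(alloc))
```

Under these assumptions, giving every agent the same two profiles scores 12.0, while the other two allocations both score 4.0. Only the ring allocation also avoids full overlaps, which is the extra condition named on the "Sample (3/4)" slide.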