A Methodology for Direct and Indirect
Discrimination Prevention in Data Mining
Sara Hajian and Josep Domingo-Ferrer
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2013
Presented by Polina Rozenshtein
Outline
• Problem addressed
– Direct and indirect discrimination
• Background, definitions and measures
• Approach proposed
– Discrimination Measurement
– Data Transformation
• Algorithms and running time
• Experimental results
Problem
• Discrimination: direct or indirect.
• Direct discrimination: decisions are made based on
sensitive attributes.
• Indirect discrimination (redlining): decisions are made
based on nonsensitive attributes which are strongly
correlated with biased sensitive ones.
• Discrimination can be hidden in the decision rules mined from data
Definitions
• Dataset – collection of records
• Item – an attribute with its value, e.g., Race = black
• Item set – a collection of items 𝑋, e.g.,
{Foreign worker = Yes; City = NYC}
• Classification rule – 𝑋 → 𝐶, where 𝐶 ∈ {yes, no}, e.g.,
{Foreign worker = Yes; City = NYC} → Hire = no
Definitions
• support, supp(𝑋) – fraction of records that contain 𝑋
• confidence, conf(𝑋 → 𝐶) – how often 𝐶 appears in records that contain 𝑋:
conf(𝑋 → 𝐶) = supp(𝑋, 𝐶) / supp(𝑋) (see the sketch below)
• frequent classification rule: supp(𝑋, 𝐶) > 𝑠 and conf(𝑋 → 𝐶) > 𝑐
• negated item set: if 𝑋 = {Foreign worker = Yes},
then ¬𝑋 = {Foreign worker = No}
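These measures are easy to make concrete. Below is a minimal Python sketch (not from the paper) over a toy record format; all attribute names and values are assumptions for illustration:

```python
from typing import Dict, List

Record = Dict[str, str]  # one record = attribute -> value

def supp(records: List[Record], itemset: Dict[str, str]) -> float:
    """Fraction of records that contain every item in `itemset`."""
    hits = sum(all(r.get(a) == v for a, v in itemset.items()) for r in records)
    return hits / len(records)

def conf(records: List[Record], x: Dict[str, str], c: Dict[str, str]) -> float:
    """Confidence of the rule X -> C: supp(X, C) / supp(X)."""
    s_x = supp(records, x)
    return supp(records, {**x, **c}) / s_x if s_x > 0 else 0.0

# Toy dataset (hypothetical records, for illustration only)
db = [
    {"Foreign worker": "Yes", "City": "NYC", "Hire": "no"},
    {"Foreign worker": "Yes", "City": "NYC", "Hire": "no"},
    {"Foreign worker": "No",  "City": "NYC", "Hire": "yes"},
    {"Foreign worker": "Yes", "City": "LA",  "Hire": "yes"},
]
print(conf(db, {"Foreign worker": "Yes", "City": "NYC"}, {"Hire": "no"}))  # 1.0
```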
Classification rules
• 𝐷𝐼𝑠 – predetermined discriminatory items, e.g.,
𝐷𝐼𝑠 = {Foreign worker = Yes; Race = Black}
• 𝑋 → 𝐶 is potentially discriminatory (PD)
if 𝑋 = 𝐴, 𝐵 with 𝐴 ⊆ 𝐷𝐼𝑠, 𝐵 ⊈ 𝐷𝐼𝑠, e.g.,
{Foreign worker = Yes; City = NYC} → Hire = No
• 𝑋 → 𝐶 is potentially nondiscriminatory (PND)
if 𝑋 = 𝐷, 𝐵 with 𝐷 ⊈ 𝐷𝐼𝑠, 𝐵 ⊈ 𝐷𝐼𝑠, e.g.,
{Zip = 10451; City = NYC} → Hire = No
Direct Discrimination Measure
• extended lift (elift), with 𝐴 ⊆ 𝐷𝐼𝑠:
elift(𝐴, 𝐵 → 𝐶) = conf(𝐴, 𝐵 → 𝐶) / conf(𝐵 → 𝐶) (sketched below)
• 𝐴, 𝐵 → 𝐶 is 𝛼-protective if elift(𝐴, 𝐵 → 𝐶) < 𝛼
• 𝐴, 𝐵 → 𝐶 is 𝛼-discriminatory if elift(𝐴, 𝐵 → 𝐶) ≥ 𝛼
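A hedged sketch of the measure, reusing conf() and the toy db from the previous snippet; the threshold 1.2 is an arbitrary assumption, since the paper leaves 𝛼 as a parameter:

```python
def elift(records, a, b, c):
    """elift(A,B -> C) = conf(A,B -> C) / conf(B -> C)."""
    denom = conf(records, b, c)
    return conf(records, {**a, **b}, c) / denom if denom > 0 else float("inf")

# alpha-discriminatory check for A = {Foreign worker = Yes}, B = {City = NYC}
alpha = 1.2  # assumed threshold
a, b, c = {"Foreign worker": "Yes"}, {"City": "NYC"}, {"Hire": "no"}
print(elift(db, a, b, c) >= alpha)  # True on the toy data (1.0 / (2/3) = 1.5)
```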
Indirect Discrimination Measure
• Theorem: Let 𝑟: 𝐷, 𝐵 → 𝐶 be a PND rule;
𝛾 = conf(𝑟: 𝐷, 𝐵 → 𝐶) and 𝛿 = conf(𝐵 → 𝐶) > 0;
𝐴 ⊆ 𝐷𝐼𝑠 with conf(𝑟𝑏1: 𝐴, 𝐵 → 𝐷) ≥ 𝛽1 and conf(𝑟𝑏2: 𝐷, 𝐵 → 𝐴) ≥ 𝛽2 > 0
• 𝑓(𝑥) = (𝛽1 / 𝛽2)(𝛽2 + 𝑥 − 1)
• elb(𝑥, 𝑦) = 𝑓(𝑥) / 𝑦 if 𝑓(𝑥) > 0, and 0 otherwise
• Then, if elb(𝛾, 𝛿) ≥ 𝛼, the PD rule 𝑟′: 𝐴, 𝐵 → 𝐶 is 𝛼-discriminatory
(see the sketch below)
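A sketch of elb as a plain function; the confidences passed in below are made-up values, only to show the redlining check:

```python
def elb(gamma: float, delta: float, beta1: float, beta2: float) -> float:
    """Lower bound elb(gamma, delta) = f(gamma) / delta, with
    f(x) = (beta1 / beta2) * (beta2 + x - 1); returns 0 if f(gamma) <= 0."""
    f = (beta1 / beta2) * (beta2 + gamma - 1.0)
    return f / delta if f > 0 else 0.0

# gamma = conf(D,B -> C), delta = conf(B -> C),
# beta1 = conf(rb1: A,B -> D), beta2 = conf(rb2: D,B -> A) -- assumed values
print(elb(gamma=0.9, delta=0.4, beta1=0.8, beta2=0.9) >= 1.2)  # True: redlining
```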
Indirect Discrimination or not
• A PND rule 𝑟: 𝐷, 𝐵 → 𝐶 is a redlining rule if it could yield an
𝛼-discriminatory rule 𝑟′: 𝐴, 𝐵 → 𝐶, 𝐴 ⊆ 𝐷𝐼𝑠, via the background
knowledge rules 𝑟𝑏1: 𝐴, 𝐵 → 𝐷 and 𝑟𝑏2: 𝐷, 𝐵 → 𝐴, e.g.,
{Zip = 10451; City = NYC} → Hire = No
• A PND rule 𝑟: 𝐷, 𝐵 → 𝐶 is a nonredlining rule if it cannot yield any
𝛼-discriminatory rule 𝑟′: 𝐴, 𝐵 → 𝐶 via such rules, e.g.,
{Experience = Low; City = NYC} → Hire = No
The Approach
• Discrimination measurement:
– Find PD and PND rules
– Direct discrimination:
• Among PD rules, find the 𝛼-discriminatory ones via elift()
– Indirect discrimination:
• Among PND rules, find the redlining ones via elb() + background knowledge
• Data transformation:
– Alter the dataset to remove discriminatory biases
– With minimum impact on the data and on legitimate rules
Direct rules protection: Method 1
• 𝐴 ⊆ 𝐷𝐼𝑠
• Wish elift(𝑟′: 𝐴, 𝐵 → 𝐶) < 𝛼, i.e.,
conf(𝐴, 𝐵 → 𝐶) / conf(𝐵 → 𝐶) < 𝛼
• Decrease conf(𝐴, 𝐵 → 𝐶) = supp(𝐴, 𝐵, 𝐶) / supp(𝐴, 𝐵)
• Decrease conf(𝐴, 𝐵 → 𝐶) by increasing supp(𝐴, 𝐵):
¬𝐴, 𝐵 → ¬𝐶 ⇒ 𝐴, 𝐵 → ¬𝐶
• supp(𝐴, 𝐵, 𝐶) remains the same (see the sketch below)
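A minimal sketch of this transformation, illustrating the idea rather than the authors' exact algorithm (which also chooses the records of minimum impact first); flip_records() is a hypothetical helper, and a, b, c, db are the bindings from the earlier snippets:

```python
def flip_records(records, keep, avoid, rewrite, n):
    """Rewrite up to n records that contain all items in `keep` and
    none of the itemsets in `avoid`, applying the items in `rewrite`."""
    changed = 0
    for r in records:
        if changed == n:
            break
        has_keep = all(r.get(k) == v for k, v in keep.items())
        avoids = all(any(r.get(k) != v for k, v in s.items()) for s in avoid)
        if has_keep and avoids:
            r.update(rewrite)  # in-place data transformation
            changed += 1
    return changed

# Method 1 for DRP: (not-A, B, not-C) ==> (A, B, not-C)
# supp(A,B) grows while supp(A,B,C) is untouched, so conf(A,B -> C) drops
flip_records(db, b, [a, c], a, n=1)
```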
Direct rules protection: Method 2
• Wish elift(𝑟′: 𝐴, 𝐵 → 𝐶) < 𝛼, i.e.,
conf(𝐴, 𝐵 → 𝐶) / conf(𝐵 → 𝐶) < 𝛼
• Increase conf(𝐵 → 𝐶) = supp(𝐵, 𝐶) / supp(𝐵)
• Increase conf(𝐵 → 𝐶) by increasing supp(𝐵, 𝐶):
¬𝐴, 𝐵 → ¬𝐶 ⇒ ¬𝐴, 𝐵 → 𝐶
• supp(𝐵) remains the same (see the snippet below)
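Method 2 differs from Method 1 only in which items get rewritten: the class item instead of the sensitive ones. Reusing the hypothetical flip_records() helper with the same assumed bindings:

```python
# Method 2 for DRP: (not-A, B, not-C) ==> (not-A, B, C)
# supp(B,C) grows while supp(B) and conf(A,B -> C) stay put,
# so conf(B -> C) rises and elift falls
flip_records(db, b, [a, c], c, n=1)
```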
Direct rules generalization
• PD: {Foreign worker = Yes; City = NYC} → Hire = No
• PND: {Experience = Low; City = NYC} → Hire = No
• If conf(𝑟: 𝐷, 𝐵 → 𝐶) ≥ conf(𝑟′: 𝐴, 𝐵 → 𝐶)
and conf(𝐴, 𝐵 → 𝐷) = 1, then
the PD rule 𝑟′: 𝐴, 𝐵 → 𝐶 is an instance of the PND rule 𝑟: 𝐷, 𝐵 → 𝐶
Direct rules generalization
• PD: {Foreign worker = Yes; City = NYC} → Hire = No
• PND: {Experience = Low; City = NYC} → Hire = No
• 1) If conf(𝑟: 𝐷, 𝐵 → 𝐶) ≥ 𝑝 · conf(𝑟′: 𝐴, 𝐵 → 𝐶)
• 2) and conf(𝐴, 𝐵 → 𝐷) ≥ 𝑝, then
the PD rule 𝑟′: 𝐴, 𝐵 → 𝐶 is a 𝑝-instance of the PND rule 𝑟: 𝐷, 𝐵 → 𝐶
• Goal: transform each 𝛼-discriminatory rule into a 𝑝-instance of some
PND rule 𝑟: 𝐷, 𝐵 → 𝐶 (a test is sketched below)
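A sketch of the 𝑝-instance test, reusing conf() from the first snippet; the itemset-dict rule encoding is the same assumption as before:

```python
def is_p_instance(records, a, b, d, c, p):
    """PD rule (A,B -> C) is a p-instance of PND rule (D,B -> C) if
    conf(D,B -> C) >= p * conf(A,B -> C)   (Condition 1)
    and conf(A,B -> D) >= p                (Condition 2)."""
    cond1 = conf(records, {**d, **b}, c) >= p * conf(records, {**a, **b}, c)
    cond2 = conf(records, {**a, **b}, d) >= p
    return cond1 and cond2
```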
Direct rules generalization
• Condition 2 is satisfied, but Condition 1 is not:
– Wish conf(𝑟: 𝐷, 𝐵 → 𝐶) ≥ 𝑝 · conf(𝑟′: 𝐴, 𝐵 → 𝐶)
– Decrease conf(𝑟′: 𝐴, 𝐵 → 𝐶), preserving conf(𝐴, 𝐵 → 𝐷):
𝐴, 𝐵, ¬𝐷 → 𝐶 ⇒ 𝐴, 𝐵, ¬𝐷 → ¬𝐶
• Condition 1 is satisfied, but Condition 2 is not:
– Wish conf(𝐴, 𝐵 → 𝐷) ≥ 𝑝
– Increase conf(𝐴, 𝐵 → 𝐷), preserving conf(𝑟: 𝐷, 𝐵 → 𝐶) ≥ 𝑝 · conf(𝑟′: 𝐴, 𝐵 → 𝐶)
– Impossible
Direct rules generalization
• Use generalization when possible, to increase the number of PND rules
• Use generalization when at least Condition 2 is satisfied
• After generalization, apply the direct protection methods
• Aim for the minimum transformation of the data
Indirect Rule Protection
• The same strategy as for Direct Rule Protection:
• Wish elb(conf(𝑟: 𝐷, 𝐵 → 𝐶), conf(𝐵 → 𝐶)) < 𝛼, i.e.,
(conf(𝑟𝑏1: 𝐴, 𝐵 → 𝐷) / conf(𝑟𝑏2: 𝐷, 𝐵 → 𝐴)) ·
(conf(𝑟𝑏2: 𝐷, 𝐵 → 𝐴) + conf(𝑟: 𝐷, 𝐵 → 𝐶) − 1) / conf(𝐵 → 𝐶) < 𝛼
• Method 1: decrease conf(𝑟𝑏1: 𝐴, 𝐵 → 𝐷)
¬𝐴, 𝐵, ¬𝐷 → ¬𝐶 ⇒ 𝐴, 𝐵, ¬𝐷 → ¬𝐶
• Method 2: increase conf(𝐵 → 𝐶)
¬𝐴, 𝐵, ¬𝐷 → ¬𝐶 ⇒ ¬𝐴, 𝐵, ¬𝐷 → 𝐶
(both are sketched below)
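Both methods again fit the hypothetical flip_records() helper from the DRP sketch; the PND itemset d below is an assumed example:

```python
d = {"Zip": "10451"}  # assumed PND itemset, for illustration

# Method 1 for IRP: (not-A, B, not-D, not-C) ==> (A, B, not-D, not-C)
# inflates supp(A,B) without touching supp(A,B,D), lowering conf(A,B -> D)
flip_records(db, b, [a, d, c], a, n=1)

# Method 2 for IRP: (not-A, B, not-D, not-C) ==> (not-A, B, not-D, C)
# raises supp(B,C) with supp(B) fixed, increasing conf(B -> C)
flip_records(db, b, [a, d, c], c, n=1)
```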
Simultaneous direct and indirect
discrimination prevention
• Direct Rule Protection:
– Method 1: ¬𝐴, 𝐵 → ¬𝐶 ⇒ 𝐴, 𝐵 → ¬𝐶
– Method 2: ¬𝐴, 𝐵 → ¬𝐶 ⇒ ¬𝐴, 𝐵 → 𝐶
• Indirect Rule Protection:
– Method 1: ¬𝐴, 𝐵, ¬𝐷 → ¬𝐶 ⇒ 𝐴, 𝐵, ¬𝐷 → ¬𝐶
– Method 2: ¬𝐴, 𝐵, ¬𝐷 → ¬𝐶 ⇒ ¬𝐴, 𝐵, ¬𝐷 → 𝐶
• Lemma 1. Method 1 for DRP cannot be used for simultaneous DRP and IRP:
it might undo the protection provided by Method 1 for IRP
Simultaneous direct and indirect
discrimination prevention
• Direct Rule Protection:
– Method 1: ¬𝐴, 𝐵 → ¬𝐶 ⇒ 𝐴, 𝐵 → ¬𝐶
– Method 2: ¬𝐴, 𝐵 → ¬𝐶 ⇒ ¬𝐴, 𝐵 → 𝐶
• Indirect Rule Protection:
– Method 1: ¬𝐴, 𝐵, ¬𝐷 → ¬𝐶 ⇒ 𝐴, 𝐵, ¬𝐷 → ¬𝐶
– Method 2: ¬𝐴, 𝐵, ¬𝐷 → ¬𝐶 ⇒ ¬𝐴, 𝐵, ¬𝐷 → 𝐶
• Lemma 2. Method 2 for IRP is beneficial for Method 2 for DRP;
Method 2 for DRP is at worst neutral for Method 2 for IRP.
• Method 2 for DRP and Method 2 for IRP both increase conf(𝐵 → 𝐶).
Simultaneous direct and indirect
discrimination prevention
• Transform PD rules into PND rules (generalization) when possible
• Run Method 2 for IRP on the PND rules, and Method 2 for DRP on the
remaining PD rules
Algorithms
𝐷𝐵 – database
𝐹𝑅 – frequent classification rules
𝑀𝑅 – direct 𝛼-discriminatory rules
𝐷𝐼𝑠 – discriminatory item set
Computational Cost
• 𝑚 – number of records in 𝐷𝐵
• 𝑘 – number of rules in 𝐹𝑅
• ℎ – number of records in the subset 𝐷𝐵𝑐
• 𝑛 – number of 𝛼-discriminatory rules in 𝑀𝑅
• 𝑂(𝑚) to get 𝐷𝐵𝑐
• 𝑂(𝑘ℎ) to get impact(𝑑𝑏𝑐) for all 𝑑𝑏𝑐 ∈ 𝐷𝐵𝑐
• 𝑂(ℎ log ℎ) for sorting
• 𝑂(𝑑𝑚) for modification
• Total: 𝑂(𝑛(𝑚 + 𝑘ℎ + ℎ log ℎ + 𝑑𝑚))
Experiments
• German Credit and Adult data sets.
• Direct discrimination prevention degree (DDPD): percentage of
𝛼-discriminatory rules that are no longer 𝛼-discriminatory
• Direct discrimination protection preservation (DDPP): percentage of
𝛼-protective rules that remain 𝛼-protective
• IDPD and IDPP – the same, for redlining rules
German credit data set
• Min support 5%, min confidence 10%
• 32,340 frequent classification rules
• 22,763 background knowledge rules
• 37 redlining rules; 42 indirect and 991 direct discriminatory rules
Information loss
• Misses cost (MC): percentage of lost rules
• Ghost cost (GC): percentage of introduced rules
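A sketch of how these four utility metrics can be computed from the rule sets mined before and after the transformation; representing rules as Python sets is an assumption, and the literature varies on the GC denominator (here the transformed rule set):

```python
# All functions assume non-empty input rule sets.

def ddpd(disc_before: set, disc_after: set) -> float:
    """DDPD: share of alpha-discriminatory rules no longer discriminatory."""
    return len(disc_before - disc_after) / len(disc_before)

def ddpp(prot_before: set, prot_after: set) -> float:
    """DDPP: share of alpha-protective rules that remain protective."""
    return len(prot_before & prot_after) / len(prot_before)

def misses_cost(rules_before: set, rules_after: set) -> float:
    """MC: share of original frequent rules lost by the transformation."""
    return len(rules_before - rules_after) / len(rules_before)

def ghost_cost(rules_before: set, rules_after: set) -> float:
    """GC: share of frequent rules newly introduced by the transformation."""
    return len(rules_after - rules_before) / len(rules_after)
```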
Conclusions
• Considers frequent classification rule mining
• Defines direct and indirect discrimination
• Proposes measures of discrimination
• Proposes methods to modify the dataset to prevent discrimination
• Meaningful qualitative results