
Updating with incomplete observations
(UAI-2003)
Gert de Cooman
SYSTeMS research group
BELGIUM
http://ippserv.ugent.be/~gert
[email protected]

Marco Zaffalon
IDSIA, “Dalle Molle” Institute for Artificial Intelligence
SWITZERLAND
http://www.idsia.ch/~zaffalon
[email protected]
What are incomplete observations?
A simple example

C (class) and A (attribute) are Boolean random variables
C = 1 is the presence of a disease
A = 1 is the positive result of a medical test

Let us do diagnosis

Good point: you know that
p(C = 0, A = 0) = 0.99
p(C = 1, A = 1) = 0.01
whence p(C = 0 | A = a) is either 1 or 0, and observing A allows a sure diagnosis

Bad point: the test result can be missing
This is an incomplete, or set-valued, observation {0,1} for A
What is p(C = 0 | A is missing)?
Example ctd

Kolmogorov’s definition of conditional probability seems to say

p(C = 0 | A ∈ {0,1}) = p(C = 0) = 0.99

i.e., with high probability the patient is healthy
Is this right?
In general, it is not
Why?
Why?


Because A can be selectively reported
e.g., the medical test machine is broken:
it produces an output if and only if the test is negative (A = 0)



In this case p(C = 0 | A is missing) = p(C = 0 | A = 1) = 0
The patient is definitely ill!
Compare this with the earlier, naive application of Kolmogorov’s
updating (naive updating, for short)
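A quick numeric check of both answers, as a minimal Python sketch (the names are ours):

```python
# Joint distribution p(C, A) from the example; omitted pairs have probability 0.
p_joint = {(0, 0): 0.99, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.01}

def p_C(c):
    """Marginal p(C = c)."""
    return sum(p for (ci, _), p in p_joint.items() if ci == c)

def p_C_given_A(c, a):
    """Conditional p(C = c | A = a)."""
    p_a = sum(p for (_, ai), p in p_joint.items() if ai == a)
    return p_joint[(c, a)] / p_a

# Naive updating: conditioning on the sure event "A in {0,1}" returns the prior.
print(p_C(0))             # 0.99: the patient looks healthy

# Selective reporting: the broken machine reports only negative results,
# so a missing value reveals A = 1.
print(p_C_given_A(0, 1))  # 0.0: the patient is definitely ill
```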
Modeling it the right way

Observations-generating model

[Diagram: p(C,A), the distribution generating pairs for (C,A) → (c,a), the complete pair (not observed) → IM, the Incompleteness Mechanism → o, the actual observation about A]
o is a generic value for O, another random variable
o can be 0, 1, or * (i.e., missing value for A)
IM = p(O | C,A) should not be neglected!
The correct overall model we need is p(C,A)p(O | C,A)
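A minimal sketch of this generative chain (Python; the selective-reporting machine from the example serves as a hypothetical IM):

```python
import random

# Joint p(C, A) from the example; omitted pairs have probability 0.
p_joint = {(0, 0): 0.99, (1, 1): 0.01}

def sample_pair():
    """Draw a complete pair (c, a) from p(C, A)."""
    u, acc = random.random(), 0.0
    for pair, prob in p_joint.items():
        acc += prob
        if u <= acc:
            return pair
    return pair  # guard against floating-point round-off

def incompleteness_mechanism(c, a):
    """Hypothetical IM: the broken machine reports a result
    if and only if the test is negative (A = 0)."""
    return a if a == 0 else '*'  # '*' encodes a missing value for A

c, a = sample_pair()                 # complete pair (not observed)
o = incompleteness_mechanism(c, a)   # actual observation: 0 or '*'
print((c, a), '->', o)
```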
What about Bayesian nets (BNs)?

Asia net

[Figure: the Asia Bayesian network, with nodes (V)isit to Asia, (S)moking = y, (T)uberculosis = n, Lung (C)ancer?, Bronc(H)itis, abnorma(L) X-rays = y, (D)yspnea]

Let us predict C on the basis of the observation (L,S,T) = (y,y,n)

BN updating instructs us to use p(C | L = y,S = y,T = n) to predict C
Asia ctd

Should we really use p(C | L = y,S = y,T = n) to predict C?
(V,H,D) is missing
(L,S,T,V,H,D) = (y,y,n,*,*,*) is an incomplete observation


p(C | L = y,S = y,T = n) is just the naive updating
By using the naive updating, we are neglecting the IM!
Wrong inference in general
New problem?


Problems with naive updating have been clear since at least 1985 (Shafer)
Practical consequences were not so clear


How often does naive updating cause problems?
Perhaps it is not a problem in practice?
Grünwald & Halpern (UAI-2002)
on naive updating

Three points made strongly:

1) naive updating works ⇔ CAR holds
   i.e., neglecting the IM is correct ⇔ CAR holds
   With missing data: CAR (coarsening at random) = MAR (missing at random) =
   p(A is missing | c,a) is the same for all pairs (c,a)
2) CAR holds rather infrequently
3) the IM, p(O | C,A), can be difficult to model

2 & 3 = serious theoretical & practical problem
How should we do updating given 2 & 3?
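For concreteness, the CAR/MAR condition in point 1 is a one-line check in code; the table below (a Python sketch with our own numbers) encodes the selective-reporting machine from the earlier example and violates it:

```python
# p(A is missing | c, a) for each pair (c, a); these numbers are ours and
# encode the broken machine: positive results (a = 1) always come out missing.
p_missing = {
    (0, 0): 0.0, (0, 1): 1.0,
    (1, 0): 0.0, (1, 1): 1.0,
}

def is_mar(p_missing, tol=1e-12):
    """MAR holds iff p(A is missing | c, a) is the same for every pair (c, a)."""
    vals = list(p_missing.values())
    return all(abs(v - vals[0]) <= tol for v in vals)

print(is_mar(p_missing))  # False: missingness depends on the value of A
```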
What this paper is about

Have a conservative (i.e., robust) point of view

Assume little knowledge about the IM:
you are not allowed to assume MAR
you are not able/willing to model the IM explicitly
Deliberately worst case, as opposed to the MAR best case

Derive an updating rule for this important case

Conservative updating rule
1st step: plug ignorance into your model


Fact: the IM is unknown

The only constraint on p(O | C,A) is the trivial one, p(O ∈ {0,1,*} | C,A) = 1,
i.e., any distribution p(O | C,A) is possible

This is too conservative; to draw useful conclusions we need a little less ignorance

[Diagram: known prior distribution p(C,A) → (c,a), the complete pair (not observed) → unknown Incompleteness Mechanism (IM) → o, the actual observation about A]

Consider the set of all p(O | C,A) s.t. p(O | C,A) = p(O | A)
i.e., all the IMs which do not depend on what you want to predict
Use this set of IMs jointly with prior information p(C,A)
2nd step: derive the conservative updating

Let E = evidence = observed variables, in state e
Let R = remaining unobserved variables (except C)

Formal derivation yields:

1) All the values for R should be considered
2) In particular, updating becomes:

min_{r ∈ R} p(c | E = e, R = r) ≤ p(c | o) ≤ max_{r ∈ R} p(c | E = e, R = r)
Conservative Updating Rule (CUR)
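A brute-force sketch of the CUR on a toy joint table (Python; the numbers are made up for illustration, with C, E, R all Boolean):

```python
# Hypothetical joint p(C, E, R); illustrative numbers only.
p = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.15,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def p_c_given(c, e, r):
    """p(C = c | E = e, R = r) computed from the joint table."""
    norm = sum(p[(ci, e, r)] for ci in (0, 1))
    return p[(c, e, r)] / norm

def cur_interval(c, e, r_values=(0, 1)):
    """The CUR: min and max of p(c | e, r) over all completions r."""
    post = [p_c_given(c, e, r) for r in r_values]
    return min(post), max(post)

print(cur_interval(c=1, e=1))  # (0.4, 0.714...): an interval, not a number
```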
CUR & Bayesian nets



Evidence: (L,S,T) = (y,y,n)
What is your posterior confidence on C = y?

[Figure: the Asia network, as before, with L = y, S = y, T = n observed]

Consider all the joint values of nodes in R
Take min & max of p(C = y | L = y, S = y, T = n, v, h, d)
Posterior confidence ∈ [0.42, 0.71]

Computational note: only Markov blanket matters!
A few remarks

The CUR…



is based only on p(C,A), like the naive updating
produces lower & upper probabilities
can produce indecision
CUR & decision-making

Decisions

c′ dominates c″ (c′, c″ ∈ C) if for all r ∈ R,
p(c′ | E = e, R = r) > p(c″ | E = e, R = r)

Indecision?

It may happen that there exist r′, r″ ∈ R such that:
p(c′ | E = e, R = r′) > p(c″ | E = e, R = r′)
and
p(c′ | E = e, R = r″) < p(c″ | E = e, R = r″)
There is no evidence that you should prefer c′ to c″ or vice versa
(= keep both)
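Continuing the toy sketch above (reusing p_c_given and the same made-up joint table), dominance and indecision can be coded directly:

```python
def dominates(c1, c2, e, r_values=(0, 1)):
    """c1 dominates c2 iff its posterior is strictly larger for EVERY
    completion r of the unobserved variables."""
    return all(p_c_given(c1, e, r) > p_c_given(c2, e, r) for r in r_values)

def undominated(classes, e, r_values=(0, 1)):
    """Classes not dominated by any other; if more than one survives,
    the CUR suspends judgement between them (indecision)."""
    return [c for c in classes
            if not any(dominates(d, c, e, r_values) for d in classes if d != c)]

# With the toy table: p(C=1 | e=1, r=0) > p(C=0 | e=1, r=0), but the
# inequality flips for r = 1, so neither class dominates the other.
print(undominated([0, 1], e=1))  # [0, 1]: keep both
```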
Decision-making example


Evidence: E = (L,S,T) = (y,y,n) = e
What is your diagnosis for C?

p(C = y | E = e, H = n, D = y) > p(C = n | E = e, H = n, D = y)
p(C = y | E = e, H = n, D = n) < p(C = n | E = e, H = n, D = n)
Both C = y and C = n are plausible

Evidence: E = (L,S,T) = (y,y,y) = e
C = n dominates C = y: “cancer” is ruled out

[Figure: the Asia network, as before]
Algorithmic facts


CUR ⇒ restrict attention to the Markov blanket
State enumeration still prohibitive in some cases
e.g., naive Bayes
However: decision-making is possible in linear time,
with the provided algorithm, even on some multiply connected nets!

Dominance test based on dynamic programming

Linear in the number of children of class node C
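The algorithm itself is in the paper, not on the slides; the sketch below only illustrates the factorization idea behind a linear-time test, for the special case of a naive Bayes topology and with a data layout of our own choosing:

```python
def dominates_nb(prior, cpts, evidence, missing, c1, c2):
    """Dominance test when all attributes are children of the class C.
    Then p(c | e, r) is proportional to p(c) * prod_i p(x_i | c), so the
    worst-case posterior ratio over all completions r factorizes into one
    independent minimum per missing attribute: cost linear in the number
    of children of C.

    prior[c] = p(C = c); cpts[attr][(v, c)] = p(attr = v | C = c);
    evidence maps observed attributes to values; missing lists the rest."""
    ratio = prior[c1] / prior[c2]
    for attr, v in evidence.items():
        ratio *= cpts[attr][(v, c1)] / cpts[attr][(v, c2)]
    for attr in missing:
        values = {v for (v, _) in cpts[attr]}
        ratio *= min(cpts[attr][(v, c1)] / cpts[attr][(v, c2)] for v in values)
    return ratio > 1.0  # True iff c1 dominates c2 under every completion

# Tiny usage example with made-up tables: one observed, one missing attribute.
prior = {0: 0.5, 1: 0.5}
cpts = {
    'test':    {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7},
    'symptom': {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.05, (1, 1): 0.95},
}
print(dominates_nb(prior, cpts, {'symptom': 1}, ['test'], c1=1, c2=0))  # True
```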
On the application side

Important characteristics of the present approach


Robust approach, easy to implement
Does not require changes in pre-existing BN knowledge bases




based on p(C,A) only!
Markov blanket property ⇒ low computational complexity
If you can write down the IM explicitly, your decisions/inferences will
be contained in ours
By-product for large networks

Even when naive updating is OK, CUR can serve as a useful
preprocessing phase

Restricting attention to the Markov blanket may already produce strong enough
inferences and decisions
What we did in the paper

Theory of coherent lower previsions (imprecise probabilities)

Coherence
Equivalent to a large extent to sets of probability distributions
Weaker assumptions

CUR derived in quite a general framework
Concluding notes

There are cases when:


IM is unknown/difficult to model
MAR does not hold

Serious theoretical and practical problem

CUR applies
Robust to the unknown IM
Computationally easy decision-making with BNs

CUR works with credal nets, too
Same complexity
Future: how to make stronger inferences and decisions

Hybrid MAR/non-MAR modeling?