Deliverable 5.1.1:
An Algorithm for
Consistent
Imputation in
Mixed Data
Ton de Waal – Statistics Netherlands
March 2002
Contents
Summary
1 Introduction
2 The edits
3 The consistent imputation problem
4 Elimination of variables
5 The algorithm
6 Example
References
EUREDIT Deliverable 5.1.1 An Algorithm for Consistent Imputation in Mixed Data
Summary
Automatic edit and imputation can be subdivided into three steps. The first step consists of
solving the so-called error localisation problem, i.e. the problem of identifying the errors in the
data. To identify these errors one often adopts (a generalised version of) the so-called Fellegi-Holt
paradigm. The identified errors are set to missing. The second step is imputation of the
missing values, both the values that were originally missing as well as the values that were set
to missing in the error localisation phase. During the imputation step edit rules are often not
taken into account. As a third and final step one can then modify the imputed values such that
all edits become satisfied. In this paper we describe an algorithm for this third step that keeps
the modified values close to the imputed values while ensuring that all edits become satisfied.
This algorithm has been developed as part of the EUREDIT project.
1 Introduction
Statistical data editing is an important topic for national statistical institutes. The traditional
approach is based on editing each record, i.e. data of an individual respondent, manually. The
computer is only used interactively to help the subject-matter specialist who is cleaning the
data. This traditional approach is rather expensive and time-consuming. To make the data
editing process more efficient, part of the data is often edited and imputed automatically. The
first step of automatic edit and imputation consists of solving the so-called error localisation
problem, i.e. the problem of identifying the errors in the data. To identify these errors one often
adopts (a generalised version of) the so-called Fellegi-Holt paradigm (see Fellegi and Holt,
1976). This (generalised) paradigm says that the data of a record should be made to satisfy all
edits by changing the values of the fewest possible (weighted) number of variables.
De Waal (2002) describes an algorithm for solving the error localisation problem in mixed data,
i.e. a mix of categorical and continuous data. This error localisation algorithm only sets
erroneous data to missing. Subsequently, the missing data – both the values that were originally
missing and the values that were set to missing by the error localisation algorithm – have to be
imputed. We can do this in two steps. In the first step missing values are imputed using a
statistical model, without taking the edits into account. In the second step the imputed values
are modified so that the resulting data are consistent, i.e. so that they satisfy the edits. To
maintain the statistical characteristics as well as possible the modified values should remain as
close as possible to the imputed values.
In this paper we describe an algorithm to impute a mix of continuous and categorical data in a
consistent manner, i.e. such that all edits become satisfied. Section 2 describes the edits we
consider in this paper, and Section 3 describes the mathematical problem we are trying to solve.
Sections 4 and 5 discuss a heuristic algorithm to obtain consistently imputed data that is close
to the data that have been imputed using a statistical imputation model. Section 6 concludes the
paper with an example illustrating the proposed algorithm.
2 The edits
We denote the categorical variables by $v_i$ ($i=1,\dots,m$) and the continuous variables by $x_i$
($i=1,\dots,n$). For categorical data we denote the domain, i.e. the set of possible values, of
variable $i$ by $D_i$. We assume that every edit $E^j$ ($j=1,\dots,J$) is written in the following form: edit
$E^j$ is satisfied by a record $(v_1,\dots,v_m,x_1,\dots,x_n)$ if the following statement holds true:
\[
\text{IF } v_i \in F_i^j \text{ for } i=1,\dots,m \quad
\text{THEN } (x_1,\dots,x_n) \in \{x \mid a_{1j}x_1 + \dots + a_{nj}x_n + b_j \odot 0\}, \tag{2.1}
\]
where $\odot$ denotes either "$\geq$" or "$=$".
All edits given by (2.1) have to be satisfied simultaneously. We assume that the edits can
indeed be satisfied simultaneously.
The condition after the IF-statement, i.e. $v_i \in F_i^j$ for all $i=1,\dots,m$, is called the IF-condition of
the edit. The condition after the THEN-statement, i.e.
$(x_1,\dots,x_n) \in \{x \mid a_{1j}x_1 + \dots + a_{nj}x_n + b_j \odot 0\}$, is called the THEN-condition. If the
IF-condition does not hold true for a particular record, the edit is always satisfied, irrespective of
the values of the continuous variables.
A categorical variable $v_i$ is said to enter an edit $E^j$ given by (2.1) if $F_i^j \subseteq D_i$ and $F_i^j \neq D_i$,
i.e. if $F_i^j$ is strictly contained in the domain of variable $i$. That edit is then said to be involved
with this categorical variable. A continuous variable $x_i$ is said to enter the THEN-condition of
edit $E^j$ given by (2.1) if $a_{ij} \neq 0$. That THEN-condition is then said to be involved with this
continuous variable.
We assume that none of the values of the variables entering the edits may be missing. That is,
we assume that for each variable entering the edits a value has to be filled in. Any field for
which the value is missing is hence considered to be erroneous.
The set in the THEN-condition of (2.1) may be the empty set or the entire n-dimensional real
vector space. If the set in the THEN-condition of (2.1) is the entire n-dimensional real vector
space, then the edit is always satisfied. Such an edit may be discarded. If the set in the
THEN-condition of (2.1) is empty, then the edit is failed by any record for which the IF-condition
holds true, i.e. for any record for which $v_i \in F_i^j$ for $i=1,\dots,m$. Likewise, $F_i^j$ in (2.1) may be
the empty set or equal to $D_i$.
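The IF/THEN structure of (2.1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Edit`, the function `satisfies`, the 0-based indexing of variables, the numerical tolerance `tol` and the sample edit are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Edit:
    """One edit of form (2.1); all names here are illustrative assumptions."""
    if_sets: dict            # i -> set F_i^j; a missing i means F_i^j = D_i
    coeffs: Optional[list]   # [a_1j, ..., a_nj], or None if the THEN-set is empty
    const: float = 0.0       # b_j
    equality: bool = False   # True for "=", False for ">="


def satisfies(edit, v, x, tol=1e-9):
    """Return True if the record (v, x) satisfies the edit."""
    # If some v_i lies outside F_i^j the IF-condition is false,
    # so the edit holds whatever the continuous values are.
    if any(v[i] not in F for i, F in edit.if_sets.items()):
        return True
    if edit.coeffs is None:
        # Empty THEN-set: failed by every record meeting the IF-condition.
        return False
    value = sum(a * xi for a, xi in zip(edit.coeffs, x)) + edit.const
    return abs(value) <= tol if edit.equality else value >= -tol


# Hypothetical edit: IF v_0 in {"a"} THEN x_0 - x_1 >= 0.
example = Edit(if_sets={0: {"a"}}, coeffs=[1.0, -1.0])
```

A record whose categorical value falls outside `if_sets` passes automatically, which mirrors the remark that an edit is always satisfied when its IF-condition does not hold.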
3 The consistent imputation problem
As described in Section 1, automatic edit and imputation can be carried out in three
consecutive steps. In the error localisation phase, a solution to the error localisation
problem is first determined. The values of the variables involved in this solution are set to missing.
After the error localisation phase the missing values (both the values that were missing in the
original record and the values that have been set to missing in the error localisation phase) are
imputed. In this imputation step imputation methods can be used that preserve the statistical
properties as well as possible. During this step the edits are not necessarily taken into account;
as a result, some edits may still be failed.
Finally, the imputed values are modified slightly such that all edits become satisfied. The
non-imputed, original values are not modified. We try to ensure that the resulting, consistent,
records preserve the statistical properties of the data as well as possible by keeping the final
values as close as possible to the imputed values. This is the so-called consistent imputation
problem.
We assume that the imputed values can indeed be modified in such a way that a consistent
record results. This is, for instance, the case if the (generalised) Fellegi-Holt paradigm has been
adopted in order to find a solution to the error localisation problem.
To measure how close a record is to another record, a suitable distance function has to be
defined. In this paper we consider distance functions of the type
\[
\sum_{i=1}^{m} w_i \, a(v_i, \tilde{v}_i) + \sum_{i=1}^{n} w_{m+i} \, |x_i - \tilde{x}_i| , \tag{3.1}
\]
where the record after the imputation phase is given by $(v_1,\dots,v_m,x_1,\dots,x_n)$, the final record
by $(\tilde{v}_1,\dots,\tilde{v}_m,\tilde{x}_1,\dots,\tilde{x}_n)$, the $w_i$'s are non-negative user-specified weights, and $a(v_i,\tilde{v}_i)$ is a
non-negative matrix satisfying $a(v_i,\tilde{v}_i) = 0$ if $v_i = \tilde{v}_i$. Note that $\tilde{v}_i = v_i$ and $\tilde{x}_i = x_i$ for
variables (categorical and continuous, respectively) that have not been imputed in the
imputation step, because we only modify the imputed values.
Note that for purely continuous data (3.1) reduces to
\[
\sum_{i=1}^{n} w_i \, |x_i - \tilde{x}_i| . \tag{3.2}
\]
The consistent imputation problem can be formulated concisely as: minimise (3.1) by
modifying the imputed values so that (2.1) is satisfied for all edits j=1,…,J.
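The objective (3.1) is simple to evaluate for a given pair of records. The sketch below is illustrative only: the function name `distance`, its argument names and the sample values are assumptions. The default 0/1 choice for $a$ is the one used in the example of Section 6.

```python
def distance(v, v_new, x, x_new, w_cat, w_cont, a=None):
    """Evaluate objective (3.1) between an imputed and a modified record.

    v, v_new : categorical values before and after modification
    x, x_new : continuous values before and after modification
    w_cat    : weights w_1..w_m; w_cont: weights w_{m+1}..w_{m+n}
    a        : function a(old, new) >= 0 with a(old, old) == 0
    """
    if a is None:
        # 0/1 matrix: any change of a categorical value costs 1.
        a = lambda old, new: 0.0 if old == new else 1.0
    categorical = sum(w * a(vi, vn) for w, vi, vn in zip(w_cat, v, v_new))
    continuous = sum(w * abs(xi - xn) for w, xi, xn in zip(w_cont, x, x_new))
    return categorical + continuous


# Hypothetical record: one categorical change (weight 2) and one
# continuous shift of 2.5 (weight 0.5).
d = distance([1], [2], [10.0], [7.5], w_cat=[2], w_cont=[0.5])
print(d)  # 3.25
```

With unit weights and no continuous variables the function simply counts the categorical values that were changed.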
4 Elimination of variables
An important part of our algorithm for solving the consistent imputation problem (see also
Section 5) is elimination of variables. In this section we describe the elimination technique.
If a continuous variable is to be eliminated, we basically apply Fourier-Motzkin elimination
(see Duffin, 1974; Chvátal, 1983; Schrijver, 1986; Quere, 2000; Quere and De Waal, 2000) to
eliminate that variable from the set of edits. Some care has to be taken in order to ensure that
the IF-conditions of the resulting edits are correctly defined.
In particular, if we want to eliminate a continuous variable $x_r$ from the current set of edits, we
start by copying all edits not involving this continuous variable from the current set of edits to
the new set of edits.
Next, we consider all edits in format (2.1) involving $x_r$ pair-wise. Suppose we consider the
following pair of edits, where $\odot$ again denotes either "$\geq$" or "$=$":
\[
\text{IF } v_i \in F_i^s \text{ for } i=1,\dots,m \quad
\text{THEN } (x_1,\dots,x_n) \in \{x \mid a_{1s}x_1 + \dots + a_{ns}x_n + b_s \odot 0\} \tag{4.1}
\]
and
\[
\text{IF } v_i \in F_i^t \text{ for } i=1,\dots,m \quad
\text{THEN } (x_1,\dots,x_n) \in \{x \mid a_{1t}x_1 + \dots + a_{nt}x_n + b_t \odot 0\} . \tag{4.2}
\]
We start by checking whether the intersection of the IF-conditions is non-empty, i.e. whether
the intersections $F_i^s \cap F_i^t$ are non-empty for all $i=1,\dots,m$. If any of these intersections is
empty, we do not have to consider this pair of edits anymore. So, suppose that all intersections
are non-empty.
We now construct an implied edit. If the THEN-condition of (4.1) is an equality, we use the
equality
\[
x_r = -\frac{1}{a_{rs}} \Bigl( b_s + \sum_{i \neq r} a_{is} x_i \Bigr) \tag{4.3}
\]
to eliminate $x_r$ from the THEN-condition of (4.2). Similarly, if the THEN-condition of (4.1) is
an inequality and the THEN-condition of (4.2) is an equality, the equality in (4.2) is used to
eliminate $x_r$.
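As a small worked illustration of this substitution, take a hypothetical pair of edits with overlapping IF-conditions, $n=3$ and $r=1$. The coefficients are invented for this illustration and do not come from the paper. The first THEN-condition is an equality and the second an inequality:

```latex
\begin{align*}
\text{(4.1):}\quad & x_1 - x_2 - x_3 = 0
  \quad\Longrightarrow\quad
  x_1 = -\tfrac{1}{1}\bigl(0 - x_2 - x_3\bigr) = x_2 + x_3 , \\
\text{(4.2):}\quad & -x_1 + 2x_2 + 10 \geq 0
  \quad\Longrightarrow\quad
  -(x_2 + x_3) + 2x_2 + 10 = x_2 - x_3 + 10 \geq 0 .
\end{align*}
```

The resulting THEN-condition $x_2 - x_3 + 10 \geq 0$ no longer involves $x_1$.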
If the THEN-conditions of both (4.1) and (4.2) are inequalities, we check whether the
coefficients of $x_r$ in those inequalities have opposite signs. That is, we check whether
$a_{rs} \cdot a_{rt} < 0$.
If that is not the case, we do not consider this pair of edits anymore. If the coefficients of $x_r$ do
have opposite signs, we generate the THEN-condition
\[
(x_1,\dots,x_n) \in \{x \mid \tilde{a}_1 x_1 + \dots + \tilde{a}_n x_n + \tilde{b} \geq 0\} , \tag{4.4}
\]
where
\[
\tilde{a}_i = |a_{rs}| \, a_{it} + |a_{rt}| \, a_{is} \quad \text{for all } i=1,\dots,n \tag{4.5}
\]
and
\[
\tilde{b} = |a_{rs}| \, b_t + |a_{rt}| \, b_s . \tag{4.6}
\]
Note that $x_r$ indeed does not enter the resulting THEN-condition: because $a_{rs}$ and $a_{rt}$ have
opposite signs, $\tilde{a}_r = |a_{rs}| \, a_{rt} + |a_{rt}| \, a_{rs} = 0$.
This is the THEN-condition of a new implied edit. The IF-condition of this implied edit is
given by the intersections $F_i^s \cap F_i^t$ for all $i=1,\dots,m$.
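The combination step (4.4) to (4.6) for two inequality edits can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `combine_inequalities`, the tuple representation `(F, a, b)` of an edit, the 0-based indices and the sample coefficients are all assumptions.

```python
def combine_inequalities(edit_s, edit_t, r):
    """Implied edit obtained by eliminating x_r from two inequality edits.

    Each edit is a tuple (F, a, b): F maps every categorical index i to the
    set F_i, a is the coefficient list and b the constant of the THEN-
    condition a.x + b >= 0. Returns None if the pair yields no implied edit.
    """
    F_s, a_s, b_s = edit_s
    F_t, a_t, b_t = edit_t
    # IF-condition of the implied edit: intersections of F_i^s and F_i^t.
    F = {i: F_s[i] & F_t[i] for i in F_s}
    if any(len(common) == 0 for common in F.values()):
        return None          # the IF-conditions cannot hold together
    if a_s[r] * a_t[r] >= 0:
        return None          # coefficients of x_r lack opposite signs
    # (4.5) and (4.6): positive combination that cancels x_r.
    a = [abs(a_s[r]) * a_it + abs(a_t[r]) * a_is
         for a_is, a_it in zip(a_s, a_t)]
    b = abs(a_s[r]) * b_t + abs(a_t[r]) * b_s
    return F, a, b


# Hypothetical pair: IF v_0 in {1,2} THEN x_0 - x_1 >= 0
#                    IF v_0 in {2,3} THEN -x_0 + 10 >= 0
s = ({0: {1, 2}}, [1, -1], 0)
t = ({0: {2, 3}}, [-1, 0], 10)
print(combine_inequalities(s, t, 0))  # ({0: {2}}, [0, -1], 10)
```

The result reads: IF $v_0 \in \{2\}$ THEN $-x_1 + 10 \geq 0$, which is what one expects from $x_1 \leq x_0 \leq 10$.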
Categorical variables are only eliminated once all continuous variables involved in the edits
have been eliminated. So, once the categorical variables may be selected for elimination, the
edits in the current set of edits all have the following form:
\[
\text{IF } v_i \in F_i^s \text{ for } i=1,\dots,m \quad
\text{THEN } (x_1,\dots,x_n) \in \emptyset . \tag{4.7}
\]
To eliminate categorical variable $v_r$ from the set of edits given by (4.7), we start by copying all
edits not involving this variable to the set of implied edits.
Next, we basically apply the method of Fellegi and Holt to the IF-conditions to generate the
IF-conditions of the implied edits (see Fellegi and Holt, 1976). In the terminology of Fellegi and
Holt, field $v_r$ is selected as the generated field. We start by determining all index sets $S$ such
that
\[
\bigcup_{j \in S} F_r^j = D_r \tag{4.8}
\]
and
\[
\bigcap_{j \in S} F_i^j \neq \emptyset \quad \text{for all } i=1,\dots,r-1,r+1,\dots,m. \tag{4.9}
\]
From these index sets we select the minimal ones, i.e. the index sets $S$ that obey (4.8) and (4.9),
but none of whose proper subsets obey (4.8).
Given such a minimal index set we construct the implied edit given by
\[
\text{IF } v_r \in D_r , \; v_i \in \bigcap_{j \in S} F_i^j \text{ for } i=1,\dots,r-1,r+1,\dots,m \quad
\text{THEN } (x_1,\dots,x_n) \in \emptyset . \tag{4.10}
\]
After all categorical variables have been eliminated we are left with a set of relations without
any unknowns.
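The categorical elimination step built on (4.8) to (4.10) can be sketched as below. The sketch rests on several assumptions of its own: the function name `eliminate_categorical`, the representation of an edit as a dictionary that lists only the variables entering it, and the brute-force enumeration of index sets, which is exponential in the number of edits and only suitable for small illustrations. The sample data are the edits (6.1) to (6.4) of Section 6.

```python
from itertools import combinations


def eliminate_categorical(edits, r, domains):
    """Eliminate categorical variable r from edits of form (4.7).

    Each edit is a dict {i: F_i}; a variable missing from the dict does not
    enter the edit, i.e. F_i equals the whole domain D_i.
    """
    def F(edit, i):
        return edit.get(i, domains[i])

    implied = [e for e in edits if F(e, r) == domains[r]]    # copied unchanged
    involved = [e for e in edits if F(e, r) != domains[r]]
    minimal = []                  # minimal index sets with respect to (4.8)
    for size in range(1, len(involved) + 1):
        for S in combinations(range(len(involved)), size):
            if any(set(T) < set(S) for T in minimal):
                continue          # a proper subset already obeys (4.8)
            if set().union(*(F(involved[j], r) for j in S)) != domains[r]:
                continue          # (4.8) fails
            minimal.append(S)
            new_edit = {}
            for i in domains:
                if i == r:
                    continue
                common = set(domains[i]).intersection(
                    *(F(involved[j], i) for j in S))
                if not common:
                    break         # (4.9) fails: no implied edit for this S
                if common != domains[i]:
                    new_edit[i] = common
            else:
                implied.append(new_edit)   # implied edit (4.10)
    return implied


domains = {1: {1, 2, 3, 4}, 2: {1, 2, 3}, 3: {1, 2, 3}, 4: {1, 2}}
edits = [
    {2: {3}, 3: {1, 2}, 4: {1}},            # (6.1)
    {2: {2, 3}, 4: {2}},                    # (6.2)
    {1: {1, 2, 4}, 2: {1, 3}, 3: {2, 3}},   # (6.3)
    {1: {3}, 3: {2, 3}, 4: {1}},            # (6.4)
]
after_v1 = eliminate_categorical(edits, 1, domains)
after_v2 = eliminate_categorical(after_v1, 2, domains)
```

Under these assumptions `after_v1` holds (6.1), (6.2) and the implied edit (6.5), and `after_v2` is empty, in line with the first elimination order worked out in Section 6.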
5 The algorithm
The problem of minimising (3.1) subject to the constraint that all edits (2.1) become satisfied
can be formulated as a mixed integer programming problem (see Kartika, 2001). This mixed
integer programming problem may be solved by using standard software. Unfortunately, this
mixed integer programming problem is usually rather large, so solving it by means of standard
mixed integer programming software is likely to be rather time-consuming.
In this paper we will not make an attempt to solve the consistent imputation problem to
optimality, and restrict ourselves to describing a heuristic that is likely to give acceptable
results in practice. In any case the heuristic will lead to consistent data that satisfy all edits.
Denote the set of variables that have been imputed by S. We assume that the variables in S can
be imputed consistently. We start by filling in the original values for all variables not in S in the
set of explicit edits. This leads to a reduced set of edits involving only the imputed variables.
We eliminate these variables from the reduced set of edits by applying the elimination
technique described in Section 4. We keep track of the corresponding sets of (implicit) edits
after $i$ variables in $S$ have been eliminated ($i=0,\dots,|S|$). We denote the set of (implicit) edits after
$i$ variables in $S$ have been eliminated by $\Omega_i$. The set of edits for $i=0$, $\Omega_0$, is the set of explicit
edits.
After all $s=|S|$ variables in $S$ have been eliminated, the set $\Omega_s$ of relations not involving any
unknowns is consistent, i.e. they do not contradict each other. This follows from our
assumption that the variables in $S$ can be imputed consistently and Theorem 5.2 in De Waal
(2002). ($\Omega_s$ may be the empty set, which is consistent by definition.) Hence, according to
Theorem 5.1 in De Waal (2002), there is a value $\tilde{v}_s$ for the $s$-th variable that has been
eliminated such that $\Omega_{s-1}$ is consistent if we fill in this value. If the $s$-th variable is categorical,
we choose $\tilde{v}_s$ such that $a(v_s, \tilde{v}_s)$ is minimal. For the $(s-1)$-th variable we apply the same
approach, et cetera.
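For categorical variables, the backward pass just described can be sketched as follows. The function name `back_substitute`, the dictionary representation of edits and the tie-breaking rule (the smallest feasible value wins) are assumptions made for this illustration. The edit sets are hard-coded from the first elimination order of the example in Section 6, padded with empty sets for $v_3$ and $v_4$.

```python
def back_substitute(edit_sets, order, imputed, domains, a=None):
    """Assign final values to eliminated categorical variables, last first.

    edit_sets[k] is the set of edits after k variables were eliminated,
    order[k] is the (k+1)-th eliminated variable. Each edit is a dict
    {i: F_i} that is failed when every listed variable lies in its F_i.
    """
    if a is None:
        a = lambda old, new: 0 if old == new else 1    # 0/1 distance
    final = {}
    for k in range(len(order) - 1, -1, -1):
        var, current = order[k], edit_sets[k]

        def violated(value):
            trial = {**final, var: value}
            return any(all(trial[i] in F for i, F in e.items())
                       for e in current)

        # Non-empty whenever the variables can be imputed consistently.
        candidates = [c for c in sorted(domains[var]) if not violated(c)]
        final[var] = min(candidates, key=lambda c: a(imputed[var], c))
    return final


domains = {1: {1, 2, 3, 4}, 2: {1, 2, 3}, 3: {1, 2, 3}, 4: {1, 2}}
e61 = {2: {3}, 3: {1, 2}, 4: {1}}
e62 = {2: {2, 3}, 4: {2}}
e63 = {1: {1, 2, 4}, 2: {1, 3}, 3: {2, 3}}
e64 = {1: {3}, 3: {2, 3}, 4: {1}}
e65 = {2: {1, 3}, 3: {2, 3}, 4: {1}}
edit_sets = [[e61, e62, e63, e64], [e61, e62, e65], [], [], []]
result = back_substitute(edit_sets, [1, 2, 3, 4],
                         {1: 3, 2: 3, 3: 2, 4: 2}, domains)
print([result[i] for i in (1, 2, 3, 4)])  # [3, 1, 2, 2]
```

When several feasible values are equally close to the imputed value, the choice among them is arbitrary, and a different choice can lead to a different final record.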
We continue this process until all values of imputed categorical variables have been modified
in the above way. We are then left with a set of imputed continuous variables (if any) and a
current set of (implicit) edits involving only these variables. The final values for these
continuous variables are then found by minimising (3.2) subject to the constraint that the
current set of implicit edits is satisfied. This minimisation problem can simply be formulated as
an LP-problem, and can, e.g., be solved by means of the simplex algorithm.
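One standard way to obtain such an LP, given here as a sketch rather than as the formulation used by the author, replaces each absolute value by an auxiliary variable. The index set $C$ of imputed continuous variables, the auxiliary variables $d_i$, and the coefficients $\alpha_{ik}$, $\beta_k$ of the $K$ current implicit edits are notation assumed for this illustration:

```latex
\begin{aligned}
\min_{\tilde{x},\, d} \quad & \sum_{i \in C} w_i \, d_i \\
\text{subject to} \quad
  & d_i \geq \tilde{x}_i - x_i , \qquad d_i \geq x_i - \tilde{x}_i
      && (i \in C), \\
  & \sum_{i \in C} \alpha_{ik} \, \tilde{x}_i + \beta_k \geq 0
      \quad \text{(or } = 0 \text{)}
      && (k = 1, \dots, K).
\end{aligned}
```

Because the weights are non-negative, an optimal solution can always be chosen with $d_i = |x_i - \tilde{x}_i|$, so the optimal value equals the minimum of (3.2).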
Theorem 5.1. The heuristic described above leads to a record that satisfies all edits.
Proof. By Theorem 5.1 in De Waal (2002), the value of each imputed categorical variable can be
modified in such a way that the imputed values that have not yet been modified can later be
adapted so that all explicit edits are satisfied. After the imputed values of the categorical
variables have been modified, the same theorem implies that the remaining imputed continuous
variables can be modified such that all explicit edits become satisfied. Modified continuous
values that are as close as possible, in the sense of (3.2), to the imputed values can be found
by solving an LP-problem. $\square$
When only continuous variables have been imputed, our method solves the consistent
imputation problem to optimality. When categorical variables have been imputed, optimality of
the method is not guaranteed, because the optimal modified value is sequentially determined for
each individual categorical variable separately. Optimality of the method would only be
guaranteed if the optimal modified values were determined for all variables
simultaneously. However, as we have already mentioned, this is a very difficult problem. The
method described is “only” a heuristic. It is, however, much simpler and faster than an optimal
method.
6 Example
To illustrate the algorithm we give a simple example involving only categorical variables.
This example is taken from Kartika (2001).
Suppose we have four imputed, categorical variables with domains $D_1 = \{1,2,3,4\}$,
$D_2 = \{1,2,3\}$, $D_3 = \{1,2,3\}$ and $D_4 = \{1,2\}$, and no imputed continuous variables. Suppose
also that the reduced edit set is given by:
IF ($v_2 = 3$) AND ($v_3 \in \{1,2\}$) AND ($v_4 = 1$) THEN $\emptyset$,   (6.1)
IF ($v_2 \in \{2,3\}$) AND ($v_4 = 2$) THEN $\emptyset$,   (6.2)
IF ($v_1 \in \{1,2,4\}$) AND ($v_2 \in \{1,3\}$) AND ($v_3 \in \{2,3\}$) THEN $\emptyset$,   (6.3)
IF ($v_1 = 3$) AND ($v_3 \in \{2,3\}$) AND ($v_4 = 1$) THEN $\emptyset$.   (6.4)
Here we use the convention that if a categorical variable is not mentioned in an IF-condition,
this variable may take any value. The matrix element $a(v_i, \tilde{v}_i)$ in objective function (3.1)
equals 1 if $v_i \neq \tilde{v}_i$, and 0 otherwise. Suppose that the vector of imputed values is given by
$v^0 = (3, 3, 2, 2)$. This vector fails edit (6.2).
We apply our algorithm to obtain a consistent record. We start by selecting a variable, say $v_1$.
We eliminate this variable and obtain a set of implicit edits without $v_1$. This set of implicit
edits is given by (6.1), (6.2) and
IF ($v_2 \in \{1,3\}$) AND ($v_3 \in \{2,3\}$) AND ($v_4 = 1$) THEN $\emptyset$.   (6.5)
We again select a variable, say $v_2$, and eliminate this variable from the current set of edits. As
a result, we obtain an empty set of implicit edits. This means that we may assign arbitrary
values to $v_3$ and $v_4$. Because our aim is to keep the final record close to the imputed record,
we assign to both variables their original imputed values, i.e. 2. Now, a value has to be assigned
to $v_2$ such that (6.1), (6.2) and (6.5) become satisfied given that to both the third and the fourth
variable the value 2 has been assigned. Filling in the value 2 for both the third and fourth
variable in (6.1), (6.2) and (6.5) gives the edit
IF ($v_2 \in \{2,3\}$) THEN $\emptyset$.   (6.6)
The only possibility to satisfy (6.6) is to assign the value 1 to $v_2$. Finally, we assign a value to
$v_1$ such that (6.1) to (6.4) are satisfied given the values that have already been assigned earlier.
Filling in the values assigned to $v_2$, $v_3$ and $v_4$ in (6.1) to (6.4) gives the edit
IF ($v_1 \in \{1,2,4\}$) THEN $\emptyset$.   (6.7)
The only way to satisfy (6.7) is to assign the value 3 to $v_1$, which happens to be its original
imputed value. So, we obtain a new record $\tilde{v}^0 = (3, 1, 2, 2)$ with target value
$\sum_{i=1}^{m} a(v_i, \tilde{v}_i) = 1$.
If the variables had been eliminated in a different order, one may arrive at a different
solution with a different target value. To illustrate we now assume that we start by eliminating
$v_4$ instead of $v_1$. The set of implicit edits is then given by (6.3),
IF ($v_2 = 3$) AND ($v_3 \in \{1,2\}$) THEN $\emptyset$   (6.8)
and
IF ($v_1 = 3$) AND ($v_2 \in \{2,3\}$) AND ($v_3 \in \{2,3\}$) THEN $\emptyset$.   (6.9)
We now eliminate variable $v_1$. The set of implicit edits is given by (6.8) and
IF ($v_2 = 3$) AND ($v_3 \in \{2,3\}$) THEN $\emptyset$.   (6.10)
We eliminate $v_3$, and obtain
IF ($v_2 = 3$) THEN $\emptyset$   (6.11)
as the only implicit edit. We eliminate $v_2$ and obtain the empty set as the set of implicit edits,
which is consistent by definition.
To satisfy (6.11) we have to change the value of $v_2$. We make $v_2$ equal to the feasible value
nearest to its original imputed value, i.e. to 2. We now have to satisfy (6.8) and (6.10) given the
value assigned to $v_2$. For this we do not have to change the value of $v_3$. Next, we have to
satisfy (6.3), (6.8) and (6.9) by changing the value of $v_1$ given the values already assigned. We
make $v_1$ equal to one of the feasible values nearest to its original value, say to 4. Finally, we
have to satisfy (6.1) to (6.4) by changing the value of $v_4$ given the values already assigned. We
make $v_4$ equal to the only feasible value, i.e. to 1.
So, we obtain a new record $\tilde{v}^0 = (4, 2, 2, 1)$ with target value
$\sum_{i=1}^{m} a(v_i, \tilde{v}_i) = 3$. This solution is clearly not optimal.
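Because the example is tiny, both outcomes can be checked by brute force over all 72 possible records. The sketch below is an independent check, not part of the proposed heuristic; the helper names `consistent` and `target` are assumptions.

```python
from itertools import product

domains = {1: {1, 2, 3, 4}, 2: {1, 2, 3}, 3: {1, 2, 3}, 4: {1, 2}}
# Edits (6.1) to (6.4); a record fails an edit when every listed
# variable takes a value in the corresponding set.
edits = [
    {2: {3}, 3: {1, 2}, 4: {1}},
    {2: {2, 3}, 4: {2}},
    {1: {1, 2, 4}, 2: {1, 3}, 3: {2, 3}},
    {1: {3}, 3: {2, 3}, 4: {1}},
]
imputed = (3, 3, 2, 2)


def consistent(record):
    """True if the record fails none of the edits."""
    values = dict(zip((1, 2, 3, 4), record))
    return not any(all(values[i] in F for i, F in e.items()) for e in edits)


def target(record):
    """Objective (3.1) with unit weights and the 0/1 matrix a."""
    return sum(old != new for old, new in zip(imputed, record))


feasible = [rec for rec in product(*(sorted(domains[i]) for i in (1, 2, 3, 4)))
            if consistent(rec)]
best = min(target(rec) for rec in feasible)
optimal = [rec for rec in feasible if target(rec) == best]
print(best, optimal)  # 1 [(3, 1, 2, 2)]
```

The enumeration confirms that (3, 1, 2, 2) is the only record at distance 1, while (4, 2, 2, 1) is consistent but has target value 3.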
References
Chvátal, V., 1983, Linear Programming. W.H. Freeman and Company.
De Waal, T., 2002, An Algorithm for Error Localisation in Mixed Data. Deliverable 4.1.1 for
the EUREDIT project.
Duffin, R.J., 1974, On Fourier’s Analysis of Linear Inequality Systems. Mathematical
Programming Studies, 1, 71-95.
Fellegi, I.P. and D. Holt, 1976, A Systematic Approach to Automatic Edit and Imputation.
Journal of the American Statistical Association, 71, 17-35.
Kartika, W., 2001, Consistent Imputation of Categorical and Numerical Data. Report, Statistics
Netherlands, Voorburg.
Quere, R., 2000, Automatic Editing of Numerical Data. Report, Statistics Netherlands,
Voorburg.
Quere, R. and T. De Waal, 2000, Error Localization in Mixed Data Sets. Report, Statistics
Netherlands, Voorburg.
Schrijver, A., 1986, Theory of Linear and Integer Programming. John Wiley & Sons.