### Loss Functions for Detecting Outliers in Panel Data

```
Loss Functions for Detecting Outliers in Panel Data
Charles D. Coleman
Thomas Bryan
Jason E. Devine
U.S. Census Bureau
Prepared for the Spring 2000 meetings of the Federal-State
Cooperative Program for Population Estimates, Los Angeles, CA,
March, 2000
Panel Data
A.k.a. “longitudinal data.”
x_it:
– i indexes cross-sectional units, which retain their identities
over time. Examples: geographic areas, persons,
households, companies, autos.
– t indexes time.
– Chronological or nominal.
– Chronological time measures time elapsed between
two dates.
– Nominal time indexes different sets of estimates,
can also index true values.
Notation
• B_i is the base value for unit i.
• F_i is the “future” value for unit i.
• F_it is the future value for unit i at time t.
• B_i, F_i, F_it > 0.
• ε_i = |F_i − B_i| is the absolute difference for unit i.
• Subscripts will be dropped when not needed.
What is an Outlier?
“[An outlier is] an observation which deviates
so much from other observations as to arouse
suspicions that it was generated by a different
mechanism.”
D.M. Hawkins, Identification of Outliers,
1980, p. 1.
Meaning of an Outlier
• Either
– Indication of a problem with the data
generation process.
• Or
– A true, but unusual, statement about reality.
Loss Functions
• Motivations: The absolute differences ε_i come from unknown
distributions. Want to compare multiple
size classes on the same basis.
• L(F_i; B_i) ≡ ℓ(ε_i, B_i) is the loss function for
observation i.
• Loss functions produce rankings of
observations to be examined.
• Loss functions are empirically based, except
for one special case in nominal time.
Assumption 1
Loss is symmetric in error:
L(B + ε; B) = L(B − ε; B)
Assumption 2
Loss increases in difference:
∂ℓ/∂ε > 0
Assumption 3
Loss decreases in base value:
∂ℓ/∂B < 0
Property 1
Loss associated with given absolute
percentage difference (|ε / B|) increases in B.
Simplest Loss Function
L(F; B) = |F − B| B^q        (1a)
or
ℓ(ε, B) = ε B^q              (1b)
with
0 > q > −1.
Loss as Weighted Combination of
Absolute Difference and
Absolute Percentage Difference
L~(F; B) = |F − B|^r (|F − B| / B)^s
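• This generates a loss function with q = −s/(r + s): the combination
equals |F − B|^(r + s) B^(−s), and raising it to the power 1/(r + s)
(for r, s > 0) gives |F − B| B^q without changing the ranking.
• An infinite number of pairs (r, s) correspond to any
given q.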
Outlier Criterion
• Outlier declared whenever
L(F;B)(,B) > C
• C is “critical value.”
• C can be determined in advance, or as
function of data (e.g., quantile or multiple of
scale measure).
Loss Function Variants
• Time-Invariant Loss Function
• Signed Loss Function
• Nominal Time
Time-Invariant Loss Function
• Idea: Compare multiple dates of data on
same basis.
• Time need not be round number.
• L(F_it; B_i, t) = |F_it − B_i| B_i^(tq)
• Property 1 is satisfied as long as t < −1/q.
• Thus, useful horizon is limited.
Signed Loss Function
• Idea: Account for direction and magnitude
of loss.
S(F; B) = (F − B) B^q
• Can use asymmetric critical values and
“q”s:
– Declare outliers whenever
S+(F; B) = (F − B) B^(q+) > C+
or
S−(F; B) = (F − B) B^(q−) < C−
with C+ ≠ −C−, q+ ≠ q−.
Nominal Time
• Compare 2 sets of estimates, one set can be
actual values, Ai.
• Assumptions:
– Unbiased: E[B_i] = E[F_i] = A_i.
– Proportionate variance: Var(B_i) = Var(F_i) = σ² A_i.
• q = –1/2.
• Either set of estimates can be used for Bi, Fi.
– Exception: Ai can only be substituted for Bi.
How to Use: No Preexisting
Outlier Criteria
– Adjust by increments of 0.1 to get “good”
distribution of outliers.
• q = log(range)/25 − 1, where range is the range
of the data (Bryan, 1999).
How to Use: Preexisting Discrete
Outlier Criteria
– These pairs (ε_j, B_j) (approximately) satisfy the equation
ε B^q = C for some q and C. They are the cutoffs
between outliers and nonoutliers.
• Run regression
log ε_j = −q log B_j + K
• Then, C = e^K.
Loss Functions and GIS
• Loss functions can be used with GIS to
focus analyst’s attention on problem areas.
• Maps compare tax method county
population estimates to unconstrained
housing unit method estimates.
• q = –0.5 in loss function map.
Map 1: Absolute Differences between the Two Sets of Population Estimates
Legend (persons): 0 - 5000; 5000 - 25000; 25000 - 50000; Over 50000; No Data
Note: The tax method estimates are the base.
Map 2: Absolute Percent Differences between the Two Sets of Population Estimates
Legend (percent): 0 - 5; 5 - 10; 10 - 20; Above 20; No Data
Note: The tax method estimates are the base.
Map 3: Loss Function Values
Legend (loss): 0 - 1000; 1000 - 2000; 2000 - 4000; Above 4000; No Data
Note: The tax method estimates are the base.
Outliers Classified by Another
Variable
• Di is function of 2 successive observations.
• Ri is “reference” variable, used to classify
outliers.
• Run regression
log D_j = a − b log R_j
• Then, L(D, R) = D R^b and C = e^a (so that D R^b = e^a on the fitted line).
What to Do with Negative Data
• From Coleman and Bryan (2000):
L(F, B) = |F − B| (|F| + |B|)^q,  if B ≠ 0 or F ≠ 0,
        = 0,                      if B = F = 0.
S(F, B) = (F − B) (|F| + |B|)^q,  if B ≠ 0 or F ≠ 0,
        = 0,                      if B = F = 0.
• 0 > q > −1. Suggest q ≈ −0.5.
Summary
• Defined panel data.
• Defined outliers.
• Created several types of loss functions to
detect outliers in panel data.
• Loss functions are empirical (except for
nominal time).
• Showed several applications, including GIS.
URL for Presentation
http://chuckcoleman.home.dhs.org/fscpela.ppt
```
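
The slides give the simplest loss function as the absolute difference |F − B| scaled by the base raised to a power q between −1 and 0. They also give an outlier rule (loss above a critical value C) and a log-log regression that recovers q and C from preexisting cutoff pairs. The following is a minimal Python sketch of those three pieces, not the authors' implementation. The function names (`loss`, `signed_loss`, `flag_outliers`, `fit_q_and_c`), the sample numbers, and the choice of natural logarithms with ordinary least squares are all assumptions made for illustration.

```python
import math


def loss(f, b, q=-0.5):
    """Unsigned loss |F - B| * B**q for positive F, B and -1 < q < 0."""
    if f <= 0 or b <= 0:
        raise ValueError("base and future values must be positive")
    if not -1 < q < 0:
        raise ValueError("q must lie strictly between -1 and 0")
    return abs(f - b) * b ** q


def signed_loss(f, b, q=-0.5):
    """Signed loss (F - B) * B**q; the sign gives the direction of the change."""
    if f <= 0 or b <= 0:
        raise ValueError("base and future values must be positive")
    return (f - b) * b ** q


def flag_outliers(pairs, c, q=-0.5):
    """Return the indices of (B, F) pairs whose unsigned loss exceeds c."""
    return [i for i, (b, f) in enumerate(pairs) if loss(f, b, q) > c]


def fit_q_and_c(cutoffs):
    """Fit log(eps) = -q*log(B) + K by least squares on (B, eps) cutoff pairs.

    Returns (q, C) with C = exp(K). Natural logs are an assumption here.
    """
    xs = [math.log(b) for b, _ in cutoffs]
    ys = [math.log(eps) for _, eps in cutoffs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx              # the slope estimates -q
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Hypothetical (base, future) values, not taken from the presentation.
pairs = [(100, 130), (10_000, 10_300), (10_000, 12_000), (1_000_000, 1_003_000)]
flagged = flag_outliers(pairs, c=10.0)

# Hypothetical cutoff pairs (base, absolute difference) for the regression.
cutoffs = [(100, 50), (10_000, 500), (1_000_000, 5_000)]
q_hat, c_hat = fit_q_and_c(cutoffs)
print(flagged, q_hat, c_hat)
```

With q = −0.5, the first, second, and fourth sample pairs receive the same loss even though their absolute differences span two orders of magnitude, which is the cross-size comparability the slides give as a motivation. The signed variant can be paired with separate upper and lower critical values when increases and decreases should be treated differently.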