Robust Statistics

Nov 2015
Outliers
• Values that “lie outside” the other values
• Sample 1 = {2,3,4,5,6}
  • Mean = 4, Median = 4, 𝜎 = 1.58
• Sample 2 = {2,3,4,5,60}
  • Mean = 14.8, Median = 4, 𝜎 = 25.29
• What to do with 60: to drop or not to drop? An “out liar”?
  • Outliers can distort your analysis
  • But can you remove legitimate observations? They can be the most interesting ones in your distribution!
  • Rejecting outliers affects the distribution theory, which ought to be adjusted
• In reality:
  • Screen the data to check for transcription errors
  • It can be difficult or impossible to spot outliers in multivariate or highly structured data
Learning From Data – C. Le Lannou
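The two samples above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `summarize` is an illustrative choice, and 𝜎 is assumed to be the sample standard deviation (dividing by n − 1), which is what matches the figures on the slide.

```python
import statistics

sample_1 = [2, 3, 4, 5, 6]
sample_2 = [2, 3, 4, 5, 60]  # same as sample_1, with 6 replaced by 60


def summarize(sample):
    """Return (mean, median, sample standard deviation) of a sample."""
    return (
        statistics.mean(sample),
        statistics.median(sample),
        statistics.stdev(sample),  # sample standard deviation (n - 1)
    )


for name, sample in [("Sample 1", sample_1), ("Sample 2", sample_2)]:
    mean, median, sd = summarize(sample)
    print(f"{name}: mean={mean:.1f}, median={median}, sd={sd:.2f}")
```

A single extreme value moves the mean and the standard deviation a lot, while the median stays at 4.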
Robust Statistics
• Statistics that perform well for data drawn from a wide range of distributions, especially distributions that are not normal
• The aim is to produce statistical procedures that are stable with respect to small changes in the data or model; even large changes should not cause a complete breakdown of the procedure
• Find the structure best fitting the majority of the data
• Not affected, or less affected, by outliers
• Identify outliers for further treatment
• Help identify highly influential data points (leverage points)
• Stable inference and prediction?
Robust Statistics
• Robust measures of central tendency
  • Median
  • Truncated mean = mean of the sample after removing an equal number of points at the low and high ends of the distribution, e.g. the 25% truncated mean
• Robust measures of statistical dispersion
  • Mean absolute deviation $= E\left(\left|X_i - \mathrm{Median}_j(X_j)\right|\right)$
  • For a normal distribution, the ratio of mean absolute deviation to standard deviation $= \sqrt{2/\pi} \approx 0.8$
  • Median absolute deviation $= \mathrm{Median}_i\left(\left|X_i - \mathrm{Median}_j(X_j)\right|\right)$
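As an illustration of the measures listed above, here is a minimal standard-library sketch. The function names are hypothetical, and `truncated_mean` assumes that the given fraction is removed at each end of the sorted sample (the slide does not say whether 25% is per end or in total).

```python
import math
import statistics


def truncated_mean(data, fraction):
    """Mean after dropping `fraction` of the points at EACH end (assumption)."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    xs = sorted(data)
    k = int(len(xs) * fraction)  # number of points removed per end
    return statistics.mean(xs[k:len(xs) - k])


def mean_abs_deviation(data):
    """Mean of the absolute deviations from the median."""
    med = statistics.median(data)
    return statistics.mean([abs(x - med) for x in data])


def median_abs_deviation(data):
    """Median of the absolute deviations from the median."""
    med = statistics.median(data)
    return statistics.median([abs(x - med) for x in data])


sample = [2, 3, 4, 5, 60]
print("truncated mean (25%):", truncated_mean(sample, 0.25))
print("mean absolute deviation:", mean_abs_deviation(sample))
print("median absolute deviation:", median_abs_deviation(sample))
print("standard deviation:", round(statistics.stdev(sample), 2))

# Ratio of mean absolute deviation to standard deviation for a normal law
print("sqrt(2/pi):", round(math.sqrt(2 / math.pi), 3))
```

On this sample the median absolute deviation ignores the value 60 entirely, the mean absolute deviation is pulled up by it, and the standard deviation is pulled up the most.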
DANONE Weekly Returns
• Count: 103, Average: 0.07%, Median: -0.26%
• Annualised: Std Dev 15.7%, Mean Absolute Deviation 11.8%, Median Absolute Deviation 9.0%
• Maximum: 6.7%, Minimum: -4.4%
• [Figure: histogram of the distribution of weekly returns (x-axis from -8.00% to 10.00%) with three fitted curves: Normal Distribution, Normal Distribution without outlier +8%, and Robust Distribution (Median Absolute Deviation). Value labels on the chart: 15.7%, 14.8%, 11.3%. Annotations: “Without outlier: 8.2%”, “= 3.5 𝜎 move”.]
• Source: Yahoo Finance, from 17/9/2012 to 15/9/2014
Linear Regression
• $y = \alpha + \beta x$, residuals: $e_i = y_i - \hat{y}_i$
• Least squares method
  • Minimize the sum of squared residuals $= \sum_i e_i^2$
• Least absolute deviation
  • Minimize the sum of absolute residuals $= \sum_i |e_i|$
  • No simple formula, but it can be solved by linear programming
  • Part of most statistical packages: Matlab, R, Python, …
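To make the contrast between the two criteria concrete, here is a small standard-library sketch. The least squares fit uses the usual closed-form formulas. For least absolute deviation, the sketch does not use linear programming: it swaps in a brute-force search that relies on the fact that, for a straight-line fit, an optimal line passes through at least two data points. This is only practical for small samples. The function names and the toy data are illustrative assumptions.

```python
from itertools import combinations


def least_squares(xs, ys):
    """Closed-form least squares fit of y = alpha + beta * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    beta = sxy / sxx
    return my - beta * mx, beta


def least_absolute_deviation(xs, ys):
    """Brute-force LAD fit: try every line through two data points.

    Not the linear-programming formulation; fine for small samples only.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # vertical line, cannot be written as y = a + b x
        beta = (y2 - y1) / (x2 - x1)
        alpha = y1 - beta * x1
        cost = sum(abs(y - (alpha + beta * x)) for x, y in zip(xs, ys))
        if best is None or cost < best[0]:
            best = (cost, alpha, beta)
    if best is None:
        raise ValueError("need at least two points with distinct x values")
    return best[1], best[2]


# Toy data: y = 2x, except for one outlier at x = 5
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 40]

print("least squares (alpha, beta):", least_squares(xs, ys))
print("least absolute deviation (alpha, beta):", least_absolute_deviation(xs, ys))
```

On this toy data the least squares line is dragged towards the outlier, while the least absolute deviation line stays on the four points that follow y = 2x.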
Any Error Formula
• Define your own impact of the outliers
• [Figure: three panels showing error functions plotted against the residual e: Absolute Errors, Squared Errors, and Ln(1+e)]
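The three error formulas from the figure can be compared numerically. In this sketch the logarithmic error is assumed to be ln(1 + |e|), so that it is defined for negative residuals too; the function names, the helper `outlier_share`, and the residual values are made up for illustration.

```python
import math


def squared_error(e):
    return e ** 2


def absolute_error(e):
    return abs(e)


def log_error(e):
    # Assumption: the slide's Ln(1+e) is applied to the absolute residual
    return math.log(1 + abs(e))


def outlier_share(loss, residuals, outlier_index=-1):
    """Fraction of the total loss contributed by one residual."""
    total = sum(loss(e) for e in residuals)
    return loss(residuals[outlier_index]) / total


# Three ordinary residuals and one outlier (last value)
residuals = [0.5, -1.0, 0.2, 10.0]

for name, loss in [
    ("squared", squared_error),
    ("absolute", absolute_error),
    ("ln(1+|e|)", log_error),
]:
    share = outlier_share(loss, residuals)
    print(f"{name}: outlier share of total loss = {share:.2f}")
```

The flatter the error function grows for large residuals, the smaller the share of the total loss that comes from the outlier, and so the less the outlier drives the fit.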
Missing Values
• As part of your data cleaning process
• What to do with them?
  • Remove the observations
  • Remove the variables
  • Replace them with alternative values (external sources)
  • Include “missing” as a value in its own right
• Statistical approach:
  • Average, median, distribution
  • Linear regression, decision tree
  • Multiple sets of alternative data
⇒ Your choice will have an impact
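Two of the options above, removing the observations and replacing missing values by the average or the median, can be sketched as follows. Missing values are represented by `None`; the function names and the data are illustrative assumptions, not a prescribed method.

```python
import statistics

# Toy variable with two missing values and one large observation
values = [4.0, None, 5.0, 7.0, None, 100.0]


def drop_missing(data):
    """Option 1: remove the observations with a missing value."""
    return [x for x in data if x is not None]


def impute(data, how=statistics.median):
    """Option 2: replace missing values by a statistic of the observed ones."""
    fill = how(drop_missing(data))
    return [fill if x is None else x for x in data]


print("dropped:        ", drop_missing(values))
print("mean-imputed:   ", impute(values, how=statistics.mean))
print("median-imputed: ", impute(values, how=statistics.median))
```

The filled-in values differ a lot between the mean and the median here because of the large observation, which is one way the choice of method affects later results.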