Mathematical analysis of the ecoinvent LCI database

Mathematical analysis of the ecoinvent LCI database with the purpose of developing new validation tools for the database
Andreas Ciroth
LCA IX conference, Boston
October 1, 2009
Math. analysis of ecoinvent data
A. Ciroth, October 1, 2009
Agenda
1. Mathematical analysis of ecoinvent: Goal and scope
2. Approach and applied procedures
a) Exploratory data analyses
b) Tests, plausibility checks
3. Towards a validation suite of ecoinvent data
4. Outlook
1. Goal and scope
Project “Mathematical analysis of ecoinvent data” (2008/2009)
“Overall goal of the project is to design a suite of tools for automated data
quality assessment of data in the ecoinvent data base, as a pilot, including
identification of outliers, missing values, and erroneous or misleading
entries of various forms. In the design, both effort for applying the tools as
well as results that can be expected shall be investigated. […]
The designed suite shall later on be applied for the complete ecoinvent data
base […]
Flexibility, and required knowledge for applying the suite, shall be considered
during the design when evaluating different possible solutions. The
development will consider test cases in order to be able to give qualified
estimates on required effort and on the expected results.”
(project goal and scope)
Goal and scope
- find errors
- find structures and patterns
- work towards a validation suite for ecoinvent data
2. Approach and applied procedures
Approach: Principles
- all ecoinvent data considered (XML files from the ecoinvent.org website; system and unit processes; single-output processes only; V 2.0)
- data imported into a relational Access database
- queries conducted on the database
- database further analysed in R (statistical software)
- wherever possible, practical examples
- the different types of interesting analyses covered
- no claim to completeness (pilot!): to be extended as a living system (3rd party input, 1st experiences, …)
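The import-and-query workflow listed above can be sketched roughly as follows. This is only an illustration in Python with SQLite standing in for the Access database and R used in the project; the XML layout, table name and column names are hypothetical and much simpler than the real ecoinvent XML files.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML layout; NOT the real ecoinvent file format.
SAMPLE_XML = """
<datasets>
  <process name="process A" category="chemicals">
    <exchange flow="carbon dioxide, fossil" unit="kg" amount="1.2"/>
    <exchange flow="heat, waste" unit="MJ" amount="3.4"/>
  </process>
  <process name="process B" category="construction processes">
    <exchange flow="carbon dioxide, fossil" unit="kg" amount="0.7"/>
  </process>
</datasets>
"""


def load_exchanges(xml_text):
    """Parse the XML and store one row per exchange in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE exchange "
        "(process TEXT, category TEXT, flow TEXT, unit TEXT, amount REAL)"
    )
    root = ET.fromstring(xml_text)
    for proc in root.iter("process"):
        for exc in proc.iter("exchange"):
            conn.execute(
                "INSERT INTO exchange VALUES (?, ?, ?, ?, ?)",
                (proc.get("name"), proc.get("category"),
                 exc.get("flow"), exc.get("unit"), float(exc.get("amount"))),
            )
    return conn


conn = load_exchanges(SAMPLE_XML)
# Example query: in how many processes does each flow occur, and with what total?
rows = conn.execute(
    "SELECT flow, COUNT(*), SUM(amount) FROM exchange GROUP BY flow ORDER BY flow"
).fetchall()
```

Once the data sit in one relational table, most of the checks discussed later reduce to short queries of this kind.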
R: Output
Text:
One-sample Kolmogorov-Smirnov test
data: d_Cesium[[5]]
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided
Graphics:
[Figure: plot of fossil CO2 [kg] versus CO2-eqv. [kg]; both axes from 0 to 30]
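To make the statistic D in the Kolmogorov-Smirnov output above more concrete, here is a minimal sketch of how it can be computed. It is written in Python rather than the R used in the project, tests against a standard normal distribution purely as an assumption, uses made-up sample values, and does not compute the p-value.

```python
from statistics import NormalDist


def ks_statistic(sample, dist):
    """One-sample Kolmogorov-Smirnov statistic D against the CDF of `dist`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = dist.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


reference = NormalDist(0.0, 1.0)  # assumed reference distribution

# Made-up values lying far outside the bulk of the reference distribution.
d_far = ks_statistic([-30.2, -28.9, -31.5, -29.7, -30.8], reference)

# Made-up values spread symmetrically around zero fit much better.
d_near = ks_statistic([-1.0, -0.5, 0.0, 0.5, 1.0], reference)
```

D is the largest vertical distance between the empirical and the reference cumulative distribution function. A value of D = 1, as in the output above, arises when the whole sample lies where the reference CDF is practically 0 or practically 1.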
R: StatET Plugin
Eclipse plugin available: a more comfortable editor
(what is explorative data analysis?)
“exploratory data analysis is detective work. […]
restricting one's self to the planned analysis – failing to
accompany it with exploration – loses sight of the most
interesting results too frequently to be comfortable”
[Tukey 1977, p. 3].
Explorative. Unplanned. Keen on learning about the
data at hand. Few, if any, own assumptions about data.
EDA: Applied methods
- Boxplots
- Scatter plots
- Stem and leaf plots
- Matrix scatter plots
EDA example: Boxplot
[Figure: boxplots of transport effort in [tkm] for selected processes (axis from 0 to 100), grouped by category: agricultural means of production, agricultural production, chemicals, construction materials, construction processes]
EDA example: Boxplot
Transport effort in [tkm] for selected processes
check of the construction processes
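The numbers behind such a boxplot can be sketched as follows: quartiles per category, plus the values a boxplot would draw as outliers. This is an illustration in Python (the project used R); the 1.5 × IQR whisker rule is the common boxplot convention and is assumed here, and all transport values are made up.

```python
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Quartiles, whisker limits and outliers using the k * IQR rule."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return {"q1": q1, "median": median, "q3": q3,
            "lower": lower, "upper": upper, "outliers": outliers}


# Made-up transport efforts in tkm, grouped by process category.
transport_tkm = {
    "chemicals": [0.8, 1.1, 1.3, 0.9, 1.0, 1.2],
    "construction processes": [0.5, 0.7, 0.6, 0.8, 95.0, 0.65],
}

report = {cat: boxplot_outliers(vals) for cat, vals in transport_tkm.items()}
```

Values reported as outliers are candidates for a closer look, not errors by themselves.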
EDA example: Matrix scatter plot
[Figure: matrix scatter plots of selected impact category results for all unit processes in ecoinvent, limited to amounts smaller than twice the mean]
Panels:
1: acidification acc. to EDIP 2003
2: eutrophication acc. to CML 2001
3: eutrophication acc. to EDIP 2003
4: human health acc. to Ecoindicator 99 (I,I)
5: human health acc. to Impact2002+
6: human toxicity acc. to CML 2001
7: human toxicity acc. to EDIP 2003
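The data preparation behind such a matrix scatter plot can be sketched as below: restrict the results to amounts smaller than twice the mean, then look at each pair of impact categories. The sketch is in Python (the project used R), assumes the limit is applied per category, uses a Pearson correlation as a numerical stand-in for one scatter panel, and works on made-up values.

```python
from math import sqrt


def below_twice_mean(rows):
    """Keep rows in which every column value is below twice that column's mean."""
    n_cols = len(rows[0])
    limits = [2 * sum(r[c] for r in rows) / len(rows) for c in range(n_cols)]
    return [r for r in rows if all(r[c] < limits[c] for c in range(n_cols))]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up results per process: (acidification, eutrophication).
results = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.0), (50.0, 9.0)]

kept = below_twice_mean(results)
r = pearson([a for a, _ in kept], [e for _, e in kept])
```

Category pairs that are strongly correlated for most processes make the few processes that deviate stand out in the plot.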
Some of the applied plausibility checks
- negative flow amounts
- input / output quotient, per process (problem: units)
- carbon dioxide emissions per fuel consumption
- input and output pattern of similar processes: flows missing in one process that appear in other, similar processes; flows that are unique to a process (similar process: same category & subcategory)
…
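Two of the checks above, negative flow amounts and the flow pattern of similar processes, can be sketched as follows. This is an illustration in Python, not the project's Access queries or R scripts; the record layout is hypothetical, the data are made up, and "missing" is taken here to mean "present in all other processes of the same category and subcategory".

```python
from collections import defaultdict

# Made-up records: (process, category, subcategory, flow, amount).
EXCHANGES = [
    ("process A", "chemicals", "organics", "carbon dioxide, fossil", 1.2),
    ("process A", "chemicals", "organics", "heat, waste", 3.4),
    ("process B", "chemicals", "organics", "carbon dioxide, fossil", 0.9),
    ("process B", "chemicals", "organics", "heat, waste", -2.0),
    ("process C", "chemicals", "organics", "carbon dioxide, fossil", 1.1),
    ("process C", "chemicals", "organics", "flow X", 0.1),
]


def negative_flows(exchanges):
    """Return (process, flow, amount) for every exchange with a negative amount."""
    return [(p, f, a) for p, _, _, f, a in exchanges if a < 0]


def flow_pattern_report(exchanges):
    """Per process: flows missing relative to its group and flows unique to it."""
    groups = defaultdict(lambda: defaultdict(set))  # (cat, subcat) -> process -> flows
    for p, cat, sub, f, _ in exchanges:
        groups[(cat, sub)][p].add(f)
    report = {}
    for processes in groups.values():
        for p, flows in processes.items():
            others = [fl for q, fl in processes.items() if q != p]
            if not others:
                continue  # nothing to compare a lone process against
            in_all_others = set.intersection(*others)
            in_any_other = set.union(*others)
            report[p] = {
                "missing": sorted(in_all_others - flows),
                "unique": sorted(flows - in_any_other),
            }
    return report


negatives = negative_flows(EXCHANGES)
report = flow_pattern_report(EXCHANGES)
```

A negative amount or a missing flow is not necessarily an error; checks like these only produce candidates for review.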
Some of the further applied statistical analyses
- distribution tests
- regression analysis
- cluster analyses
- heat maps
- principal component analyses
…
Distribution test
Q-Q plot (quantile-quantile plot), CO2 emissions, left, and Cesium-137 emissions, right, for system processes
[Figure, left: Normal Q-Q plot, CO2 fossil, ecoinvent system processes; x axis: theoretical quantiles, y axis: sample quantiles]
[Figure, right: Normal Q-Q plot, Cesium-137, ecoinvent system processes; x axis: theoretical quantiles, y axis: sample quantiles]
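The points of a normal Q-Q plot like the ones above can be computed as in the following sketch. It is written in Python rather than R, uses the plotting positions (i + 0.5) / n as one common convention (an assumption, R's defaults differ slightly for small samples), and runs on made-up values.

```python
from statistics import NormalDist


def normal_qq_points(sample):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-Q plot."""
    xs = sorted(sample)
    n = len(xs)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n stay strictly inside (0, 1).
    return [(std_normal.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]


# Made-up emission values.
sample = [0.5, -2.0, 1.0, 0.0, 2.0, -1.0, -0.5]
points = normal_qq_points(sample)
```

Plotting the pairs gives the Q-Q plot: points close to a straight line are compatible with a normal distribution, while strong curvature or steps indicate a different distribution.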
Some of the further applied statistical analyses (list as above): used for patterns and structure detection
3. Towards a validation suite of ecoinvent data
Principles of a tool and of the overall application
- efficiency: Detect errors in data fast, and with the least effort possible.
Ex.: while the data set is entered into an editor, instead of after it is merged with all other data in ecoinvent.
- double checks: For important decisions, conduct more than one single analysis.
Does not hold for basic checks which provide an unambiguous result.
- “expertise”: Analyses should provide their results in a context where the results can be understood.
Use cases
- Small review: a small amount of new data as candidate for ecoinvent data
- Large review: a large amount of new data, from one source, as candidate for ecoinvent data
- Stock-take: thorough investigation of the complete data stock
- Screening: light investigation of the complete data stock, e.g. after integration with new data
- Single error detection: through user feedback, an error in a data set is discovered (and it is not certain whether this error is also relevant for other data sets).
- Data background change: a new database or format structure
- Knowledge change: refinement of data structure knowledge and of the tests and analyses
Only one part of the picture: a relation of knowledge and different analysis types
[Diagram with the elements: basic checks; method knowledge, quality guidelines, ...; further analyses and checks; explorative data analyses; (confirmatory) tests; structural and pattern knowledge]
Further aspects of a validation suite
- Different users and their roles
- Procedure for “daily” maintenance
- Procedure for larger release
- Focus on single data set vs. focus on group of data sets, different users
- Relation of analyses and “ecoinvent in-house” data pool and released data pool
…
4. Outlook (& Conclusions)
Conclusions
- Already in the pilot, based on tentative analyses, very useful insights
- Several errors and “oddities” in data detected, despite extensive expert-based quality assurance
- The project will most likely be continued with a second phase.
Conclusions, 2: elements in a to-be-built quality assurance system
- Mathematical analyses and procedures:
* basic checks
* explorative data analyses
- In-depth statistical analyses of various kinds
- Knowledge
* data pattern and data structure knowledge (as a result of tests, of other statistical analyses, and also of general expert knowledge)
Elements in a to-be-built quality assurance system, 2
- Data:
* ecoinvent internal data pool with incoming and changing data sets, probably with version numbers
* ecoinvent released data pool
- People
* users with feedback concerning released data
* data generators with new data sets for ecoinvent
* “data management unit”, responsible for ecoinvent releases and for error corrections
* data reviewers and validation experts
Conclusions, 3: Who does what?
Some examples
Outlook
- Probably: build an integrated tool using R and openLCA, in Eclipse, via the StatET plugin
- Almost certainly: build a system for automated data quality assurance and data analysis for Eclipse, for the upcoming ecoinvent release 3.
Many thanks!
Dr. Andreas Ciroth
GreenDeltaTC GmbH, Berlin,
[email protected]