Mathematical analysis of the ecoinvent LCI database with the purpose of developing new validation tools for the database
Andreas Ciroth
LCA IX conference, Boston, October 1, 2009

Math. analysis of ecoinvent data, A. Ciroth, October 1, 2009

Agenda
1. Mathematical analysis of ecoinvent: Goal and scope
2. Approach and applied procedures
   a) Exploratory data analyses
   b) Tests, plausibility checks
3. Towards a validation suite of ecoinvent data
4. Outlook

1. Goal and scope

Project "Mathematical analysis of ecoinvent data" (2008/2009)
"Overall goal of the project is to design a suite of tools for automated data quality assessment of data in the ecoinvent data base, as a pilot, including identification of outliers, missing values, and erroneous or misleading entries of various forms. In the design, both effort for applying the tools as well as results that can be expected shall be investigated. […] The designed suite shall later on be applied for the complete ecoinvent data base […] Flexibility, and required knowledge for applying the suite, shall be considered during the design when evaluating different possible solutions. The development will consider test cases in order to be able to give qualified estimates on required effort and on the expected results." (project goal and scope)

Goal and scope
- find errors
- find structures and patterns
- think towards a validation suite for ecoinvent data

2. Approach and applied procedures

Approach: Principles
- all ecoinvent data considered (XML files from the ecoinvent.org website; system and unit processes; only single-output processes; v2.0)
- data imported into a relational Access database
- queries conducted on the database
- database further analysed in R (statistical software)
- practical examples wherever possible
- the different interesting types of analyses covered
- no claim for completeness (pilot!); to be extended as a living system (3rd-party input, first experiences, …)

R: Output
Text:
    One-sample Kolmogorov-Smirnov test
    data: d_Cesium[[5]]
    D = 1, p-value < 2.2e-16
    alternative hypothesis: two-sided
Graphics:
    [Figure: plot with axes fossil CO2 [kg] and CO2-eqv. [kg], both ranging from 0 to 30]

R: StatET plugin
- Eclipse plugin available (more comfortable editor)

(What is exploratory data analysis?)
"exploratory data analysis is detective work. […] restricting one's self to the planned analysis – failing to accompany it with exploration – loses sight of the most interesting results too frequently to be comfortable" [Tukey 1977, p. 3].
Exploratory. Unplanned. Keen on learning about the data at hand. Few, if any, assumptions of one's own about the data.

EDA: Applied methods
- Boxplots
- Scatter plots
- Stem and leaf plots
- Matrix scatter plots

EDA example: Boxplot
[Figure: boxplots of transport effort in [tkm] for selected processes, axis from 0 to 100; categories: agricultural means of production, agricultural production, chemicals, construction materials, construction processes]
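
The boxplot view can also be run as an automated screen. The following is only a minimal sketch in Python (the project itself worked in R), assuming the common boxplot convention of whiskers at 1.5 times the interquartile range. The function name `boxplot_outliers`, the category names and all tkm values are hypothetical and are not taken from ecoinvent.

```python
from statistics import quantiles


def boxplot_outliers(values, whisker=1.5):
    """Return the values lying outside the boxplot whiskers (1.5 * IQR rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical transport efforts in tkm per process, grouped by category.
transport_tkm = {
    "chemicals": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1],
    "construction processes": [1.0, 1.2, 1.1, 0.9, 1.3, 1.0, 95.0],
}

# Screen every category and keep the flagged values for manual review.
flagged = {cat: boxplot_outliers(vals) for cat, vals in transport_tkm.items()}
for category, outliers in flagged.items():
    print(category, outliers)
```

A flagged value is a candidate for a closer look, in the spirit of the exploratory approach, and not by itself proof of an error.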

EDA example: Boxplot
[Figure: transport effort in [tkm] for selected processes; check of the construction processes]

EDA example: Matrix scatter plot
[Figure: matrix scatter plots of selected impact category results for all unit processes in ecoinvent, limited to amounts smaller than twice the mean]
1: acidification acc. to EDIP 2003
2: eutrophication acc. to CML 2001
3: eutrophication acc. to EDIP 2003
4: human health acc. to Ecoindicator 99 (I,I)
5: human health acc. to Impact2002+
6: human toxicity acc. to CML 2001
7: human toxicity acc. to EDIP 2003

Some of the applied plausibility checks
- negative flow amounts
- input / output quotient, per process (problem: units)
- carbon dioxide emissions per fuel consumption
- input and output pattern of similar processes: missing flows in one process that appear in other likely processes, flows that are unique for a process (similar process: category & subcategory)
…
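
Two of the checks listed above, negative flow amounts and flows missing from one process but present in similar ones, could be sketched roughly as follows. This is an illustrative Python sketch, not the project's Access queries or R scripts. The record layout `(process, category, flow, amount)`, the function names, the `min_share` threshold and all values are assumptions made up for the example; "similar" is reduced here to "same category".

```python
from collections import Counter, defaultdict

# Hypothetical exchange records: (process, category, flow, amount).
exchanges = [
    ("lorry A", "transport", "diesel", 0.030),
    ("lorry A", "transport", "carbon dioxide, fossil", 0.095),
    ("lorry B", "transport", "diesel", 0.028),
    ("lorry B", "transport", "carbon dioxide, fossil", 0.089),
    ("lorry C", "transport", "diesel", -0.031),
]


def negative_flows(rows):
    """Return (process, flow, amount) for every exchange with a negative amount."""
    return [(p, f, a) for p, _, f, a in rows if a < 0]


def missing_flows(rows, min_share=0.6):
    """Return (process, flow) pairs where a flow used by at least min_share
    of the processes in a category is absent from one of them."""
    flows_by_process = defaultdict(set)
    category_of = {}
    for process, category, flow, _ in rows:
        flows_by_process[process].add(flow)
        category_of[process] = category

    processes_by_category = defaultdict(list)
    for process, category in category_of.items():
        processes_by_category[category].append(process)

    findings = []
    for procs in processes_by_category.values():
        counts = Counter(f for p in procs for f in flows_by_process[p])
        common = {f for f, n in counts.items() if n / len(procs) >= min_share}
        for p in procs:
            for f in sorted(common - flows_by_process[p]):
                findings.append((p, f))
    return findings


print(negative_flows(exchanges))
print(missing_flows(exchanges))
```

Both functions only produce candidates for review; whether a negative amount or an absent flow is actually an error still needs a judgement in context.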

Some of the further applied statistical analyses
- distribution tests
- regression analysis
- cluster analyses
- heat maps
- principal component analyses
…

Distribution test
Q-Q plot (quantiles plot), CO2 emissions, left, and Cesium-137 emissions, right, for system processes
[Figure: two panels, "Normal Q-Q Plot, CO2 fossil, ecoinvent System Processes" and "Normal Q-Q Plot, Cesium-137, ecoinvent System Processes"; axes: Theoretical Quantiles, Sample Quantiles]

Some of the further applied statistical analyses
- distribution tests
- regression analysis
- cluster analyses
- heat maps
- principal component analyses
…
Patterns and structure detection

3. Towards a validation suite of ecoinvent data

Principles of a tool and of the overall application
- efficiency: Detect errors in data fast, and with the least effort possible. Ex.: while the data set is entered into an editor instead of after it is merged with all other data in ecoinvent.
- double checks: For important decisions, conduct more than one single analysis. Does not hold for basic checks which provide an unambiguous result.
- "expertise": Analyses should provide their results in a context where the results can be understood.

Use cases
- Small review: a small amount of new data as candidate for ecoinvent data
- Large review: a large amount of new data, from one source, as candidate for ecoinvent data
- Stock-take: thorough investigation of the complete data stock
- Screening: light investigation of the complete data stock, e.g.
  after integration with new data
- Single error detection: by user feedback, an error in a data set is discovered (and it is not clear whether this error is also relevant for other data sets).
- Data background change: a new database or format structure
- Knowledge change: refinement of data structure knowledge and of the tests and analyses

Only one part of the picture: a relation of knowledge and different analysis types
[Diagram with the elements: basic checks; method knowledge, quality guidelines, ...; further analyses and checks; exploratory data analyses; (confirmatory) tests; structural and pattern knowledge]

Further aspects of a validation suite
…
- Different users and their roles
- Procedure for "daily" maintenance
- Procedure for larger release
- Focus on single data set vs. focus on group of data sets, different users
- Relation of analyses and "ecoinvent in-house" data pool and released data pool

4. Outlook (& Conclusions)

Conclusions
- Already in the pilot, based on tentative analyses, very useful insights were gained
- Several errors and "oddities" in the data were detected, despite extensive expert-based quality assurance
- The project will most likely be continued with a second phase.

Conclusions, 2: elements in a to-be-built quality assurance system
- Mathematical analyses and procedures:
  * basic checks
  * exploratory data analyses
- In-depth statistical analyses of various kinds
- Knowledge
  * data pattern and data structure knowledge (as a result from tests, from other statistical analyses, and also from general expert knowledge)

Elements in a to-be-built quality assurance system, 2
- Data:
  * ecoinvent internal data pool with incoming and changing data sets, probably with version numbers
  * ecoinvent released data pool
- People
  * users with feedback concerning released data
  * data generators with new data sets for ecoinvent
  * "data management unit", responsible for ecoinvent releases and for error corrections
  * data reviewers and validation experts

Conclusions, 3: Who does what? Some examples

Outlook
- Probably: build an integrated tool using R and openLCA, in Eclipse, via the StatET plugin

Outlook
- Almost certainly: build a system for automated data quality assurance and data analysis for Eclipse, for the upcoming ecoinvent release 3.

Many thanks!
Dr. Andreas Ciroth
GreenDeltaTC GmbH, Berlin, [email protected]