A Generic Data Model for Statistical Indicators and
Measurement Units to Enable User-Specific
Representation Formats
Michaela Denk (1), Wilfried Grossmann (2)
(1) International Monetary Fund, Statistics Department*, e-mail: [email protected]
(2) University of Vienna, Institute for Scientific Computing, e-mail: [email protected]
Abstract
Based on a review of existing standards and guidelines as well as the current international
practice of modeling measurement units and related concepts in representation of
economic and statistical data, a generic data model for statistical indicators and
measurement units is introduced that may contribute to the further development of the
SDMX content oriented guidelines in terms of harmonizing cross-domain concepts such
as measurement units. Examples from databases of SDMX sponsor organizations are
used to demonstrate the applicability of the proposed data model to a broad range of
statistical indicators and its ability to serve as a basis for the creation of user-specific data
representation formats. This is achieved through customizable combinations of the
structural elements of the generic data model according to user requirements and
illustrates the practical relevance of the presented ideas.
Keywords: Metadata, SDMX, semantic decomposition.
1. Introduction
In data exchange the concept of indicator plays a crucial role. Proper understanding of the
indicator depends on knowledge of how the indicator was calculated and what kinds of
units are used for measurement. The SDMX Content Oriented Guidelines (COG) (2009)
can be regarded as the most prominent current effort focusing on the harmonization of
cross-domain concepts for data exchange. The guidelines recommend practices for
creating interoperable data and metadata sets using the SDMX technical standards with
the intent of generic applicability across subject-matter domains. In the Metadata
Common Vocabulary (Annex 4 of SDMX COG) five cross-domain concepts are
described for the exchange of indicators and units of measure. The (statistical) indicator itself
is defined as item 331 by “A data element that represents statistical data for a specified
time, place, and other characteristics, and is corrected for at least one dimension (usually
size) to allow for meaningful comparisons”. Four items in SDMX COG are related to
measurement unit, viz. unit of measure (item 384), adjustment (item 5; included in unit or
indicator in many statistical databases, as illustrated in Denk and Grossmann (2010)), base
period (item 19; relevant for the interpretation of index data, series in real terms, or changes
with respect to a certain period), and unit multiplier (item 382; specifies the power of 10
by which observation values were divided, usually for presentation purposes).
* The views expressed herein are those of the authors and should not be attributed to the IMF, its Executive Board, or
its management.
Of the 14 datasets investigated by Denk, Grossmann and Froeschl (2010), four do
not separate the economic indicator from the measurement unit or do not provide the unit
information at all, whereas four other datasets even split other concepts such as unit
multiplier or adjustment method from the unit. The remaining six databases separate unit of
measure from economic indicator. A broad variety of unit types is used, such as index,
count, ratio, rate, percentage, or changes. The cases with a single, mixed dimension
combine, at a minimum, information on the measured (economic) indicator, type of unit, unit of
measure, adjustment method, and frequency. Several examples (e.g. "Personal
computers" or "Youth unemployment rate, aged 15-24, men") omit the unit information
completely, assuming that it is obvious from the indicator used. On the other hand,
these indicators do give information about the underlying population to which
the measured concept refers.
Contrary to the first impression one may have (viz. that this problem is
a rather easy one that was resolved a long time ago, e.g. by the International System of
Units), the analysis showed that current international practice is very diverse and that
neither the recommendations provided in SDMX COG (2009) nor more general
measurement unit codification systems have yet been adopted by the statistical
organizations investigated. One reason may be that the units for indicators are in some
sense simple: besides monetary units, dimensionless units like percent or counts dominate
in applications. What makes the usage of such units difficult is the fact that the
measurement instruments are rather complicated (consider for example change in GDP
over years), and in many cases the computation of the indicator provides little information
about the unit used. Hence, it is not surprising that the main issue in the analyzed
databases is that the measurement unit dimension in the data models used does not
represent a "pure" unit of measure. Even the SDMX recommendations do not treat all of
these components separately.
The first step in harmonizing the structure and content of measurement units as currently
used by statistical organizations is the identification of their basic building blocks and of
relations between these building blocks. Denk and Grossmann (2010) proposed a generic
model for the semantic decomposition into four components, viz. Indicator,
Measurement, Adjustment, and Reference. The present paper further develops the
proposed model by refining it with respect to three features: (i) introduction of a "Family"
concept for indicator and unit of measurement that allows grouping of indicators and
units into families that share the phenomena they are destined to measure and, if
applicable, the derivation method they were obtained from; (ii) generalization of the
concepts of unit multiplier, adjustment, and reference which seems necessary for
covering complex indicators; (iii) inclusion of additional standard dimensions required to
define the meaning of any statistical figure, such as geographical and temporal reference
and measurement conditions.
The paper is structured as follows. Following some introductory remarks and basic
definitions, the extended generic data model is presented in section 2. Section 3 illustrates
the application of the model by means of examples, primarily from economic statistics.
The derivation of customized data representation formats based on the needs of data
consumers is described in section 4. Finally, the paper provides some concluding remarks
and an outlook on future work.
2. A Generic Metadata Model
The starting point of our development is a look at other efforts to standardize
measurement units. Two prominent examples are the International System of Units
(SI) and the Unified Code for Units of Measure (UCUM), which mainly cover
measurement in the physical sciences. These systems show a number of features that are also of
interest for the standardization of measurement in international statistics. The
most important feature is a unit typology with the fundamental distinction between base
units and units derived from base units by mathematical formulas. Another important
feature is the idea of using a prefix notation corresponding to the idea of unit multiplier in
SDMX. Besides these two major types of units, UCUM also considers in section 3.2 (§24
- §26) so-called arbitrary or procedure defined units, defined as “units whose meaning
entirely depends on the measurement procedure that are not related to any UCUM or SI
unit but completely depend on the measurement procedure”, and in section 4.2 (§34 - §42)
customary units that correspond roughly to the idea of local and traditional usage of
alternative measurement systems for quantities that can be measured by UCUM or SI
(base) units. Customary units are grouped into some common families defined according
to the corresponding standard unit (for example units for length like inch, foot or yard).
Looking at the statistical applications in the examples, we can conclude that, in the sense of
UCUM, there are two important types of units: arbitrary units, which in a strict
sense use a dimensionless unit like percent or count with proper prefixes or multipliers
(millions, thousands etc.), and monetary units, which can be interpreted as customary
units in a local setting of a fictitious or virtual universal currency unit. Moreover, for the
dimensionless unit percent many different customary units are in use, for example ratio or
rate. The indicator itself roughly corresponds to a combination of the UCUM name of the
unit and the type of quantity measured and requires additional specification of the
measurement procedure in the sense of UCUM.
The following metadata model is an attempt to put these ideas into a more formal
framework as outlined in Figure 1. It is based on the semantic decomposition of a
statistical observation into four basic components: Indicator, Measurement, Unit Family,
and Unit. Each component has a label attribute that is a simple textual descriptor that may
contain a combination of information that is included in other attributes in an
unstructured way.
Figure 1. Semantic Model for Indicators and Corresponding Measurement
2.1. Indicator
Strictly speaking, the Indicator itself is not part of the measurement but, as mentioned
above, a description of the quantity measured. It nevertheless seems important to include
the indicator in the model in order to avoid confusion between indicator and
measurement unit, as is the case in several of the analyzed examples.
The Indicator class consists of the following attributes and references to other model
components: label, concept, population unit, population restrictions, reference area,
reference period, cross-classification, type, family, derivation formula, link to
Measurement, and links to two or more Indicators a derived Indicator was derived from.
The concept attribute specifies the quantity to be measured by the indicator. It often
corresponds to the name of the attribute for which data are provided. The label typically
refers to a table header. Base and derived concepts can be discerned. A base concept can
be directly obtained from a survey or other form of data collection in some population.
Typical examples are prices of goods, income of a household or a person, or assets and
liabilities of a bank. Derived concepts are obtained from measuring different base
concepts and/or from different populations, as for example indices or growth rates.
A specification of the statistical reference population is of utmost importance for a proper
understanding of an indicator. According to the MCV a (statistical) population is defined
as “The total membership or population or "universe" of a defined class of People”.
Considering economic statistics, the statistical population may be a collection of
population units other than people, hence we define it in terms of a population unit (e.g.
person, commodity, institutional unit) and a subject-matter definition provided as
restrictions on some characteristics of the population unit (e.g. age=20-30, sector=banks).
In any case, a specification of the temporal and spatial validity (reference period and
reference area) is required as well. Additional classification criteria, for example
economic sectors in economic statistics, or age groups of persons in case of demographic
indicators, are captured in the cross-classification attribute. If such additional
specifications are given the indicator may be a vector or an array of numbers instead of a
single number. Note that such an understanding of indicators as vectors also allows the
formation of indicators for different countries or of time series of indicators.
The type of the indicator reflects the different types of concepts introduced above and we
distinguish between base indicators and derived indicators. Base indicators correspond to
quantities which are available for direct measurement in one population. In the context of
statistical applications, typical examples are population counts or other distributional
characteristics of a surveyed quantity (for example poverty measurement using the Gini
index), prices of units of goods (for example oil prices per barrel), or currency amounts.
As indicated in Figure 1, derived indicators are obtained from at least two base or
derived indicators by using a mathematical formula. Such derivations may use different
concepts in the indicators as well as different statistical populations for the indicators
involved. For example, a growth rate uses the same indicator concept for both indicators,
but different statistical populations in terms of the reference period, whereas GDP per
Capita uses two different indicator concepts for different statistical population units, viz.
GDP (population unit = institutional unit according to SNA) and domestic population
(population unit = person).
The family attribute groups indicators according to the method by which the
indicator is derived from other indicators. In the case of base indicators there are some
natural families like population counts or other distributional characteristics, unit prices or
amounts of money. For derived indicators there are families like differences, ratios, or
indices, which specify how the indicator is obtained from the base indicators.
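The attribute list of the Indicator class can be pictured as a small record type. The following is only a sketch under stated assumptions: the Python field names paraphrase the attributes named above, the link to Measurement is left untyped, and the reference area "AT" and reference period "2009" are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Indicator:
    """Sketch of the Indicator component; field names are illustrative."""
    label: str
    concept: str
    population_unit: Optional[str] = None
    population_restrictions: Dict[str, str] = field(default_factory=dict)
    reference_area: Optional[str] = None
    reference_period: Optional[str] = None
    cross_classification: Tuple[str, ...] = ()
    family: Optional[str] = None
    derivation_formula: Optional[str] = None       # only for derived indicators
    measurement: Optional[object] = None           # link to a Measurement, untyped here
    derived_from: Tuple["Indicator", ...] = ()     # source indicators of a derived one

    def __post_init__(self) -> None:
        # The text requires at least two source indicators for a derived indicator.
        if self.derived_from and len(self.derived_from) < 2:
            raise ValueError("a derived indicator needs at least two source indicators")

    @property
    def indicator_type(self) -> str:
        """'base' or 'derived', inferred from the presence of source indicators."""
        return "derived" if self.derived_from else "base"


# GDP per capita: two concepts measured on two different population units.
gdp = Indicator(label="GDP", concept="gross domestic product",
                population_unit="institutional unit",
                reference_area="AT", reference_period="2009",
                family="amount of money")
population = Indicator(label="Population", concept="domestic population",
                       population_unit="person",
                       reference_area="AT", reference_period="2009",
                       family="population count")
gdp_per_capita = Indicator(label="GDP per capita", concept="GDP per capita",
                           reference_area="AT", reference_period="2009",
                           family="ratio", derivation_formula="GDP / Population",
                           derived_from=(gdp, population))
```

Inferring the type from the source links, rather than storing it as a separate attribute, is a design choice of this sketch; the model itself lists type as an attribute in its own right.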
2.2. Measurement
As usual in models for measurement units (cf. SI-system and UCUM), the actual
measurement unit used for an indicator is specified by selecting a specific unit from a
specific measurement unit family. This is done via Unit and Scale (multiplier) references
as central components of Measurement. The scale (multiplier) is defined as a special
dimensionless unit prefix as discussed in the subsection on unit families and units below.
As previously discussed, many units used in economic statistics fall into the categories of
arbitrary or customary units in the sense of UCUM. Hence, a description of the
measurement method and of the conditions under which the measurement is done is
needed in addition to unit and scale. Measurement conditions can be described by
indicating the institution responsible for presenting the data. This is of utmost importance
in the case of international statistics, where the standard documentation proposed by SDMX
covers only the conditions of the last step but does not take into account the environment of
the original measurement.
Three methodological aspects of high relevance in official statistics are currently
included in the model, but there is potential for extensions and generalizations. Statistical
aggregation measures are essential in case of high frequency data for which specific
summary measures, such as period averages or highest and lowest values in a period (e.g.
maximum daily price of a stock index) are exchanged. The same measures are used for
temporal aggregation of lower frequency data, e.g. to aggregate monthly data to quarters.
Besides these traditional location measures also other characteristics of a distribution can
be used, for example in case of poverty indicators, the Gini coefficient or the percentage
of households having an income below 60 percent of the median household income are
important measures. In case of indicators available as time series, measures taking into
account the time series feature such as moving averages may be used. Statistical
measures referring to a period, such as period average, actually relate to the frequency or
periodicity of the data as specified in the cross-classification attribute of the Indicator.
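The role of the statistical measure in temporal aggregation can be illustrated with a short sketch. It assumes monthly observations keyed by strings of the form "YYYY-MM" and a hypothetical dictionary MEASURES that maps measure names to Python functions; neither convention is prescribed by the model.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical value domain for the statistical measure attribute.
MEASURES: Dict[str, Callable[[List[float]], float]] = {
    "period average": mean,
    "highest value": max,
    "lowest value": min,
}


def aggregate_to_quarters(monthly: Dict[str, float], measure: str) -> Dict[str, float]:
    """Aggregate {'2010-01': value, ...} to {'2010Q1': value, ...}."""
    summarise = MEASURES[measure]
    buckets: Dict[str, List[float]] = {}
    for period, value in monthly.items():
        year, month = period.split("-")
        quarter = f"{year}Q{(int(month) - 1) // 3 + 1}"
        buckets.setdefault(quarter, []).append(value)
    return {quarter: summarise(values) for quarter, values in buckets.items()}


monthly_prices = {"2010-01": 10.0, "2010-02": 20.0, "2010-03": 30.0, "2010-04": 40.0}
quarterly_average = aggregate_to_quarters(monthly_prices, "period average")
quarterly_high = aggregate_to_quarters(monthly_prices, "highest value")
```

The same data yield different quarterly figures depending on the measure chosen, which is why the measure has to be recorded as part of Measurement.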
Units of particular Unit Families such as index, change over reference period, or balance
indicator require reference information that is specified as part of Measurement. This
reference information may be a reference period, value, or statistical measure used to
define a reference value. The value domain of unit reference period includes time stamps
of different granularity and predefined values such as previous period, corresponding
period of previous year, or years since time stamp. Reference value specifies the value of
the indicator in the reference period (often 100 or 1000 for an index) or the “norm” value
for a balance indicator (often 0 for differences or 1 for ratios). The reference value may
also be defined as aggregate with respect to a reference period, for example as moving
average of months since reference period. In this case, reference value statistical measure
needs to be specified as well. The value domain is the same as for statistical measure.
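How reference period, reference value, and reference value statistical measure interact for an index can be sketched as follows. The function name to_index and its parameters are assumptions made for this illustration; the default reference value of 100 follows the text, and the sample figures are invented.

```python
from statistics import mean
from typing import Callable, Dict, Sequence


def to_index(
    series: Dict[str, float],
    reference_periods: Sequence[str],
    reference_value: float = 100.0,
    reference_measure: Callable[[Sequence[float]], float] = mean,
) -> Dict[str, float]:
    """Express a series as an index relative to one or more reference periods.

    The base is the reference measure (here the average) of the observations
    in the reference periods; the index equals reference_value at that base.
    """
    base = reference_measure([series[p] for p in reference_periods])
    return {period: value / base * reference_value for period, value in series.items()}


observations = {"2005": 80.0, "2006": 100.0, "2007": 110.0}
index_2005 = to_index(observations, ["2005"])            # single reference period
index_avg = to_index(observations, ["2005", "2006"])     # base = average of two periods
```

With a single reference period the statistical measure has no effect; it only matters once the reference value is defined as an aggregate over several periods.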
The adjustment attributes capture monetary adjustments such as price or exchange rate
adjustments (e.g. constant), which also require the specification of a reference period, as well
as more general econometric adjustments such as seasonal or working day adjustments
making use of methods like X-12. Other types of adjustment are conceivable; the list can
be extended according to specific needs.
2.3. Unit and Unit Family
The Unit Family is mainly used for grouping a set of measurement units that measure the
same kind of quantity and thus have the same meaning from a practical point of view.
Examples for unit families from physical measurement are length, time, mass or
temperature. In the present model a Unit Family is categorized with respect to two
different typologies following the ideas of UCUM. On the one hand, base and derived
type unit families (and thus, units) are distinguished. On the other hand, UCUM type
discerns standard, customary, arbitrary, prefix, and dimensionless unit families and units.
Derived units are calculated from base or derived units and in many cases the derivation
of a unit is closely related to the derivation of an indicator, for example velocity as ratio
of length of a covered distance (e.g. in meter) and time taken to cover the distance (e.g. in
seconds). However, a derived indicator may be measured in a base measurement unit.
Consider for example the difference of an indicator in two groups. The measurement unit
of the derived indicator (the difference) is the same as the unit of the two initial indicators
in most cases (for differences of ratios the unit of the derived indicator may be different).
With respect to UCUM type, UCUM and SI base and derived units as well as units
derived thereof are considered standard. For some quantities not only a standard unit
family, but also additional customary unit families exist. For instance, length can be
measured by the metric length measurement system with units like meter and kilometer
or by the US & British system with units like foot or mile. Prefix unit families can be
used in combination with any other unit family that is not dimensionless to represent the
unit multiplier. UCUM provides two prefix families. One of them represents positive and
negative powers of 10, viz. the conventional metric prefixes such as kilo- (10^3) or centi- (10^-2).
The second prefix unit family is the binary prefix family typically used in
information technology. Instead of powers of 10 it is based on powers of 2. UCUM
defines an additional unit family measuring dimensionless quantities that closely
resembles the metric prefix family. It just uses different labels for the powers of ten (e.g.
percent for 10^-2 or parts per million for 10^-6) and is used on its own.
In addition to label, type, and UCUM type, a Unit Family has a reference to its standard
unit (e.g. meter for metric length or second for time) and, in case of a derived family, a
unit derivation rule that contains links to unit families from which it is derived. The
derivation rule specifies how units of the unit family are derived from units of other
families, for example a unit of family “metric area” is derived by squaring a unit of
family “metric length”.
The Unit itself consists of its label, a reference to the Unit Family it belongs to, a
conversion formula that specifies the relation of the unit to the standard unit of its family,
a conversion reference period that is relevant for time-dependent conversion (e.g. for
currency units), a conversion method to apply the conversion formula to actual data, and,
in case of derived units, a derivation formula that follows the derivation rule specified in
the corresponding Unit Family. In many cases, the conversion formula is just a linear
transformation, most often even without the shift parameter, that shows how a unit can be
converted to the standard unit of the family. The conversion method combines the
appropriate conversion formula of the unit and the corresponding inverse conversion
formula of the target unit (method parameter) to convert the observations to the target
unit. The derivation formula can be regarded as an instance of the more general unit
derivation rule specified in the unit family. It contains references to specific units, for
example meter in the above “metric area” example.
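The interplay of conversion formula and conversion method can be sketched for the common linear case. The class below is a minimal illustration, assuming that each unit stores a factor and an optional shift relative to the standard unit of its family; time-dependent conversion (e.g. for currency units) and the derivation formula are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A unit with a linear conversion formula: standard = factor * value + shift."""
    label: str
    family: str
    factor: float = 1.0
    shift: float = 0.0

    def to_standard(self, value: float) -> float:
        """Apply the conversion formula to reach the family's standard unit."""
        return self.factor * value + self.shift

    def from_standard(self, value: float) -> float:
        """Apply the inverse conversion formula."""
        return (value - self.shift) / self.factor


def convert(value: float, source: Unit, target: Unit) -> float:
    """Conversion method: source formula followed by the target's inverse formula."""
    if source.family != target.family:
        raise ValueError("units belong to different unit families")
    return target.from_standard(source.to_standard(value))


meter = Unit("meter", "metric length")                      # standard unit of its family
kilometer = Unit("kilometer", "metric length", factor=1000.0)
kelvin = Unit("kelvin", "temperature")                      # standard unit of its family
celsius = Unit("degree Celsius", "temperature", shift=273.15)

distance_in_meters = convert(5.0, kilometer, meter)
temperature_in_kelvin = convert(0.0, celsius, kelvin)
```

Routing every conversion through the standard unit means that each unit needs only one formula, regardless of how many units the family contains.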
3. Applying the Model
In addition to the measurable quantities covered by UCUM (and SI) measurement units
and unit families, population counts, amounts of money, unit prices, and various derived
quantities such as ratios (e.g. index, share of total) or differences (e.g. balance indicators)
play a central role in official statistics. They are accounted for in the presented model as
indicator families. The units and unit families used to measure these indicators are
included in the model as customary unit families. Population count (type base) is
basically a dimensionless measurement unit family. The units are more precisely defined
by referring to the population unit that the counting operation applies to, e.g. persons or
specific commodities. Currency (base) encompasses currencies such as Euro or US
Dollar with one of the currencies arbitrarily defined as standard unit. Unit prices are
modeled as base indicator family, with subfamilies such as currency exchange rates.
However, they use units from derived unit families such as currency per metric volume,
currency per time, or currency exchange rates (=currency per currency). Ratios (e.g.
percent) and differences of ratios (e.g. percentage points) are also customary derived unit
families. They actually measure dimensionless indicators.
Based on the above ideas the unit families displayed in Figure 2 can be regarded as a
basis for enhanced code lists for these cross-domain concepts.
Figure 2. Unit Families
For base indicator families, the correspondence to unit families is one-to-one apart from
customary unit families. In many cases, derived unit families and their derivation rule
resemble the construction of a derived indicator. For example derived indicator families
such as indices, general ratios or more specific ratios like shares of total or ratio balance
indicators (a ratio of indicators with the same concept but different subpopulations
defined by different population restrictions) correspond to the dimensionless
measurement unit families index and ratio (percent). Relative changes over a reference
period (or change rates) correspond to derived indicators obtained as ratios of two
indicators differing only with respect to their temporal specification and use the ratio
(percent) unit family. In case of a ratio of different concepts for the same population the
unit family and unit can be obtained in a similar way as for physical units. Consider for
example GDP per capita calculated as a ratio of GDP measured in some currency (e.g.
Euro) and population measured in number of persons. The unit family is then derived
from currency and population count families, and unit as Euro per person. However, a
derived indicator (family) does not necessarily imply a derived unit (family). For example,
indicators calculated as differences, such as absolute changes over reference period, use
the same measurement unit as the initial indicators (unless the
initial indicators are ratios). These examples show that a calculus of units and indicators
requires careful examination of the formulas, similar to the case of physical units. A more
detailed analysis of various important economic indicators and the corresponding
derivation concept may be found in Denk, Grossmann and Froeschl (2010).
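The cases discussed in this section can be condensed into a toy rule set. This is not the unit calculus envisaged as future work, only a sketch of the examples given here: the function name derive_unit, the operation labels, and the returned label "dimensionless ratio" are assumptions made for illustration.

```python
def derive_unit(operation: str, unit_a: str, unit_b: str) -> str:
    """Return the unit label of an indicator derived from two indicators.

    Covers only the cases discussed in the text:
    - difference of two indicators in the same unit keeps that unit,
      except that a difference of percentages is in percentage points;
    - ratio of two indicators in the same unit is dimensionless;
    - ratio of two indicators in different units yields "A per B".
    """
    if operation == "difference":
        if unit_a != unit_b:
            raise ValueError("a difference requires indicators in the same unit")
        return "percentage points" if unit_a == "percent" else unit_a
    if operation == "ratio":
        if unit_a == unit_b:
            return "dimensionless ratio"
        return f"{unit_a} per {unit_b}"
    raise ValueError(f"unsupported operation: {operation}")


gdp_per_capita_unit = derive_unit("ratio", "Euro", "person")
absolute_change_unit = derive_unit("difference", "Euro", "Euro")
rate_difference_unit = derive_unit("difference", "percent", "percent")
```

Even this small rule set shows that the unit of a derived indicator depends on both the derivation formula and the units of the inputs.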
4. Customized Data Representation Formats
The generic data model presented in the previous sections is regarded as a foundation of a
common language of data exchange that allows the creation of (ideally: any; realistically:
the most relevant) data representation formats required by data consumers. Data that were
decomposed into the structural elements of the generic data model can easily be viewed
in terms of data representation formats customizable to the needs of data consumers. The
generic data model needs to have the finest possible granularity to enable as many data
representation formats as possible. For each data representation format to be supported a
mapping rule has to be defined that aligns the model components, in particular attributes
and references, to the components of the target representation format. In most cases, the
target representation format simply consists of a table title and row and column headers.
For time series, row headers often correspond to the reference area or the indicator and
column headers to the reference period. The table header usually contains the indicator
concept, population unit and restrictions, and measurement information. More
sophisticated, that is more structured, representation outputs may have separate elements
for unit and other measurement information such as adjustments.
The following examples illustrate the idea of customizing data representation
formats by means of mapping rules. The rules may simply concatenate values of model
attributes, replace attribute values by a certain text (e.g. replace Price Adjustment =
constant by constant prices), omit them under a certain condition (e.g. omit Scale if Scale
= units), and/or combine them with additional text.
• Table header = Indicator-Concept, Indicator-Population Restrictions, Measurement-Scale (omit if units), Measurement-Unit (omit if 1), Measurement-any Adjustment
    - Unemployment, female, 15-24, thousands, number of persons, seasonal adjustment
    - GDP, millions, US Dollar, constant prices
    - Commodity price, crude oil, US Dollar per Barrel
• Indicator = Indicator-Derivation Formula
  Unit = Measurement-Unit (omit if 1), Measurement-Scale (omit if units)
    - Expenditure, government, military / GDP
      Percent
• Table header = Indicator-Concept, Indicator-Population Restrictions
  Unit = Measurement-Unit (omit if 1), Indicator-Family (if subfamily of ratio), Measurement-Unit Ref. Period, Measurement-Scale (omit if units)
  Column headers = Indicator-Cross-classification, Indicator-Reference Period
    - GDP
      Percent, relative change over reference period, same period previous year
      Quarters, 2000Q1-2010Q4
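The first mapping rule above can be sketched as a small function. The record layout (plain dictionaries with keys such as "concept" and "scale") and the REPLACEMENTS table are assumptions made for this illustration; the model does not prescribe a serialization.

```python
from typing import Dict, List, Tuple

# Replacement texts for attribute values, e.g. Price Adjustment = constant.
REPLACEMENTS: Dict[Tuple[str, str], str] = {
    ("price_adjustment", "constant"): "constant prices",
}


def table_header(record: dict) -> str:
    """Build a table header by concatenating, omitting, and replacing attribute values."""
    parts: List[str] = [record["concept"]]
    parts.extend(record.get("population_restrictions", []))
    if record.get("scale", "units") != "units":      # omit Scale if Scale = units
        parts.append(record["scale"])
    if record.get("unit", "1") != "1":               # omit Unit if Unit = 1
        parts.append(record["unit"])
    for name, value in record.get("adjustments", {}).items():
        parts.append(REPLACEMENTS.get((name, value), value))
    return ", ".join(parts)


unemployment = {
    "concept": "Unemployment",
    "population_restrictions": ["female", "15-24"],
    "scale": "thousands",
    "unit": "number of persons",
    "adjustments": {"seasonal_adjustment": "seasonal adjustment"},
}
gdp_record = {
    "concept": "GDP",
    "scale": "millions",
    "unit": "US Dollar",
    "adjustments": {"price_adjustment": "constant"},
}
oil_price = {
    "concept": "Commodity price",
    "population_restrictions": ["crude oil"],
    "scale": "units",
    "unit": "US Dollar per Barrel",
}

headers = [table_header(r) for r in (unemployment, gdp_record, oil_price)]
```

Under these assumptions the three records reproduce the three example headers listed for the first rule. Other representation formats would be obtained by writing a different function over the same records.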
5. Conclusion
The motivation of this paper was the decomposition of the concepts statistical indicator
and measurement unit into their basic building blocks to enable the further development
of a generic metadata model for these concepts. The metadata model introduced can be
regarded as a foundation for the development of (enhanced) standardized value domains
and code lists for the identified structural elements in the context of SDMX. As
demonstrated, it may also serve as a basis for the creation of user-specific, customizable
data representation formats. Future work will focus on the operational aspect and hence
on the representation of derived indicators, measurement, and unit families as well as
required constraints and rules for the propagation of indicator and measurement
information. The definition of code lists including "shortcut" descriptors for mixed
concepts (as also used in the mapping rules that specify data representation formats) is
also of high priority. The long-term objective of this research is the development of a unit
(family) calculus and its inclusion in the metadata model to enable metadata driven
processing based on ideas developed by Froeschl (1997).
References
Denk M., Grossmann W., Froeschl K. A. (2010) Towards a best practice of modeling unit
of measure and related statistical metadata, in: European Conference on Quality in
Official Statistics 2010, http://q2010.stat.fi/papers/, Statistics Finland.
Denk M., Grossmann W. (2010) Semantic Decomposition of Indicators and
Corresponding Measurement Units, in: KSEM 2010, LNAI 6291, Bi & Williams
(Eds.), Springer, 603-608.
Froeschl K. A. (1997) Metadata Management in Statistical Information Processing,
Springer, Wien-New York.
The International System of Units, http://www.bipm.org/en/si/
The Unified Code for Units of Measure, http://unitsofmeasure.org/
SDMX Content Oriented Guidelines 2009, 16pp + 5 annexes, http://www.sdmx.org/