Linear regression with empirical distributions

Sónia Manuela Mendes Dias

Doctoral Thesis presented to the
Faculdade de Ciências da Universidade do Porto
Doctoral Programme in Applied Mathematics (Plano Doutoral em Matemática Aplicada)
Mathematics
2014

Supervisor
Paula Brito, Associate Professor
Faculdade de Economia, Universidade do Porto
For you, Nuno
Acknowledgements
Getting through this "trial", which, as someone always told me, was a "marathon and not a 100m sprint", would not have been possible without the support of all those who were close to me during these five years and who always believed, often more than I did, that I would manage to overcome the obstacles and "reach the finish line".

My heartfelt thanks...

To Professor Paula Brito, for her unconditional support and her constant availability to help and advise me. For giving me the strength to carry on in the moments when I lost heart, for always having a word of encouragement, for making me believe that I would overcome the difficulties and reach the end of this stage. I must also thank her for showing me what it means to be part of the "research world", for introducing me to the "Symbolic Data community", and for encouraging me to present the work we were developing at conferences. All of this was undoubtedly important in making me grow, not only scientifically but also as a person. Professora Paula, I greatly enjoyed working with you. I hope we keep collaborating.

To all those who work in Symbolic Data Analysis and who, more or less directly, contributed to the development of this work.

To the Institution and the Direction of the Escola Superior de Tecnologia e Gestão (ESTG) of the Instituto Politécnico de Viana do Castelo, where I have worked since 2001, for granting me special conditions that made it possible, in these last years, to carry out successfully and in parallel my activities as a lecturer and as a doctoral student.

To my colleagues in the Mathematics group at ESTG, who always supported me and understood my absence from the school and my lack of availability to take part in any activities beyond the strictly necessary ones.

To my friends, for their support, understanding and friendship, and for the good moments we spent together, which helped me relax and regain my strength. A special thank you to Sandra, for the endless conversations, the support and the shared experiences. It was very important to have nearby a friend who, being at a similar stage of life, understood me in a special way.

To my parents and my brother Sérgio. My deepest thanks for everything you have given me and done for me, and for everything I know you will always do. Your support during these doctoral years was fundamental. I will never be able to thank you enough for your infinite availability. Thank you for always being close. Sérgio, thank you for your professional advice, which helped improve the graphical side of this thesis.

To Nuno, because I would certainly not have come this far if you had not always been by my side. It was you who encouraged me to pursue a doctorate, who supported me and helped me reach the end. It was with your help that I achieved what I had never dreamed of: earning a doctorate and overcoming the English-language barrier. You often said you deserved to be co-supervisor of my thesis, and I have to agree you came very close. Thank you for the help, the patience, the unconditional support, the trust you always placed in me, and for believing that I would be able to complete the doctorate. Forgive my bad phases, the mood swings that rose and fell with how well or badly the doctoral work was going. It is for all of this that I dedicate this thesis to you.
Resumo
In the classical data framework, each individual is associated with a single real value or a category (microdata). However, many studies focus on sets of records aggregated according to characteristics of individuals or classes of individuals, the so-called macrodata. The classical solution for studying this kind of situation is to associate with each individual or class of individuals a measure of central tendency, for example the mean or the mode of the corresponding set of records; with this option, however, the variability inherent in the data is lost. For such situations, Symbolic Data Analysis proposes associating with each unit the distribution or the interval of values covering the individual records, thereby considering new types of variables, named symbolic variables. One such type is the histogram-valued variable, for which to each unit corresponds an empirical distribution that may be represented by a histogram or a quantile function. If, for all observations, each unit takes values on a single interval with weight equal to one, the histogram-valued variable reduces to the particular case of the interval-valued variable. In both cases the values within each interval are assumed to be uniformly distributed. It is therefore necessary to adapt the concepts and methods of classical statistics to these new types of variables.

The functional relation between histogram-valued or between interval-valued variables cannot be obtained by a simple adaptation of the classical regression model. In this work, new linear regression models for histogram and interval data are proposed. These new Distribution and Symmetric Distribution Models allow predicting distributions/intervals, represented by the respective quantile functions, from the distributions/intervals associated with the explicative variables. To determine the parameters of the models it is necessary to solve quadratic optimization problems subject to non-negativity constraints on the unknowns. The Mallows distance is used to define the minimization problems and to compute the error between the observed and predicted distributions. As in classical analysis, a goodness-of-fit measure whose values range between 0 and 1 can be deduced from the models.

The behavior of the proposed models and of the goodness-of-fit measure is illustrated with real data examples and simulation studies. These studies indicate a good performance of the proposed methods and of the respective coefficients of determination.

Keywords: Data with variability, histogram-valued variables, interval-valued variables, linear regression, Symbolic Data Analysis.
Abstract
In the classical data framework, one numerical value or one category is associated with each individual (microdata). However, the interest of many studies lies in groups of records gathered according to characteristics of the individuals or classes of individuals, leading to macrodata. The classical solution for these situations is to associate with each individual or class of individuals a central measure, e.g., the mean or the mode of the corresponding records; however, with this option the variability across the records is lost. For such situations, Symbolic Data Analysis proposes that a distribution or an interval of the individual records' values be associated with each unit, thereby considering new variable types, named symbolic variables. One such type of symbolic variable is the histogram-valued variable, where each entity under analysis is described by an empirical distribution that can be represented by a histogram or a quantile function. If, for all observations, each unit takes values on only one interval with weight equal to one, the histogram-valued variable reduces to the particular case of an interval-valued variable. In either case, a Uniform distribution is assumed within the considered intervals. Accordingly, it is necessary to adapt concepts and methods of classical statistics to these new kinds of variables.

The functional linear relation between histogram-valued or between interval-valued variables cannot be obtained by a simple adaptation of the classical regression model. In this work, new linear regression models for histogram data and interval data are proposed. These new Distribution and Symmetric Distribution (DSD) Regression Models allow predicting distributions/intervals, represented by their quantile functions, from the distributions/intervals of the explicative variables. To determine the parameters of the models it is necessary to solve quadratic optimization problems subject to non-negativity constraints on the unknowns. The Mallows distance is used to define the minimization problems and to compute the error between the predicted and observed distributions. As in classical analysis, it is possible to deduce from the models a goodness-of-fit measure whose values range between 0 and 1.

Examples on real data as well as simulation studies illustrate the behavior of the proposed models and the goodness-of-fit measure. These studies indicate a good performance of the proposed methods and of the respective coefficients of determination.

Keywords: Data with variability, histogram-valued variables, interval-valued variables, linear regression, Symbolic Data Analysis.
Contents

List of Tables

List of Figures

1 Introduction
   1.1 Motivation
   1.2 Data with variability and imprecise data
   1.3 Symbolic Data Analysis
   1.4 Linear regression between empirical distributions
   1.5 Organization of the thesis

2 Histogram and Interval data
   2.1 Symbolic variables
      2.1.1 Definition and classification
      2.1.2 Histogram and Interval-valued variables
   2.2 Arithmetics
      2.2.1 Interval Arithmetic
      2.2.2 Histogram Arithmetic
      2.2.3 Operations with Quantile Functions
         2.2.3.1 Quantile Functions defined with the same number of pieces
         2.2.3.2 Operations
         2.2.3.3 The space of quantile functions
   2.3 Distances
      2.3.1 Distances between intervals
      2.3.2 Distances between distributions
   2.4 Descriptive Statistics
      2.4.1 Symbolic-numerical-numerical category
         2.4.1.1 Univariate descriptive statistics
         2.4.1.2 Bivariate descriptive statistics
         2.4.1.3 Examples
      2.4.2 Symbolic-symbolic-symbolic category
         2.4.2.1 Descriptive measures defined from the Mallows distance
         2.4.2.2 Descriptive measures defined from the Wasserstein distance
         2.4.2.3 Examples
   2.5 Conclusion

3 State of the art
   3.1 Linear Regression Models for interval variables
      3.1.1 Descriptive linear regression models
         3.1.1.1 Linear regression models for data with variability
         3.1.1.2 Linear regression models for imprecise data
      3.1.2 Probabilistic linear regression models
         3.1.2.1 Linear regression models for data with variability
         3.1.2.2 Linear regression models for imprecise data
   3.2 Linear Regression Models for histogram variables
   3.3 Conclusion

4 Regression Models for histogram data
   4.1 The DSD Regression Model I
      4.1.1 Definition of the Model
      4.1.2 Estimation of the parameters of the DSD Model I
         4.1.2.1 Optimization problem
         4.1.2.2 Kuhn-Tucker conditions
      4.1.3 Goodness-of-fit measure
   4.2 The DSD Regression Model II
      4.2.1 Definition of the Model
      4.2.2 Estimation of the parameters of the DSD Model II
         4.2.2.1 Optimization problem
         4.2.2.2 Properties and the Goodness-of-fit measure
   4.3 Conclusion

5 Regression Models for interval data
   5.1 The DSD Regression Model I
      5.1.1 Particularization of the Model to interval-valued variables
      5.1.2 The DSD Model I is a generalization of the classical linear regression model
      5.1.3 The simple DSD Model I for interval-valued variables
   5.2 DSD Regression Model II
      5.2.1 Definition and Properties
      5.2.2 Goodness-of-fit measures
   5.3 Conclusion

6 Simulation studies
   6.1 Building symbolic simulated data tables
   6.2 Simulation study I
      6.2.1 Factorial design
      6.2.2 Results and conclusions
         6.2.2.1 Study of the behavior of the error function
         6.2.2.2 Study of the behavior of the coefficients of determination $\Omega$ and $\tilde{\Omega}$
         6.2.2.3 Comparison of the observed and predicted intervals
   6.3 Simulation study II
      6.3.1 Simulation study with interval-valued variables
         6.3.1.1 Description of the simulation study
         6.3.1.2 Results and conclusions
      6.3.2 Simulation study with histogram-valued variables
         6.3.2.1 Description of the simulation study
         6.3.2.2 Results and conclusions
         6.3.2.3 Concerning the goodness-of-fit measures versus level of linearity
         6.3.2.4 Concerning the analysis of the parameters' estimation
         6.3.2.5 Concerning symmetry/asymmetry of $\widehat{Y}(j)$
         6.3.2.6 Comparing DSD Models I and II
   6.4 Conclusion

7 Analysis of data with variability
   7.1 Prediction of the hematocrit values
      7.1.1 The histogram data
      7.1.2 The DSD Models
      7.1.3 Comparison of the DSD Models with other proposed symbolic models
      7.1.4 Interval data
   7.2 Distributions of Crimes in USA
      7.2.1 The data
      7.2.2 Three approaches to study linear relations between data with variability
      7.2.3 The prediction of the violent crimes in the state of Arkansas
      7.2.4 Predicted Quantiles
   7.3 Time of unemployment from years of employment
      7.3.1 The data
      7.3.2 Prediction with the DSD Model I
      7.3.3 Comparison of the predictions with different symbolic models
   7.4 Predicted burned area of forest fires
      7.4.1 The data
      7.4.2 Linear regression studies with macrodata
         7.4.2.1 Non symbolic approaches
         7.4.2.2 Symbolic approaches
   7.5 Conclusion

8 General conclusions and Future Work
   8.1 Conclusions of the work
   8.2 Future Work

References

Appendices

A Ordered and disjoint histograms

B Behavior of pairs of quantile functions

C Results of simulation studies
   C.1 Simulation study I with interval variables
   C.2 Simulation study II with interval variables
   C.3 Simulation study II with histogram variables
List of Tables

1.1 Symbolic data table with information of three healthcare centers (part 1).
1.2 Symbolic data table with information of three healthcare centers (part 2).
1.3 Classical data table (microdata) with the records of hematocrit and hemoglobin of each patient per day.
1.4 Symbolic data table (macrodata) when the values of hematocrit and hemoglobin are symbolic variables (Billard and Diday (2002)).
2.1 Symbolic data table.
2.2 Divergency measures between distributions (Arroyo (2008)).
2.3 Mallows and Wasserstein distances between the observations of the histogram-valued variable "Waiting time for a consult" in Table 1.2.
2.4 Symbolic data table where the variables hematocrit and hemoglobin are interval-valued variables.
2.5 Symbolic data table where the variables hematocrit and hemoglobin are histogram-valued variables.
2.6 Table of frequencies associated with the interval-valued variable "Age" in Table 1.1.
2.7 Descriptive statistics for the interval-valued variable "Age" in Table 1.1.
2.8 Table of frequencies of the histogram-valued variable "Waiting time for a consult" in Table 1.2.
2.9 Descriptive statistics for the histogram-valued variable "Waiting time for a consult" in Table 1.2.
2.10 Values of the covariance and variance for the data in Tables 2.4 and 2.5.
4.1 Symbolic data table for histogram-valued variables $-X$ and $Y$.
5.1 Symbolic data table for interval-valued variables $-X$ and $Y$.
6.1 Mean values of $\Omega$, $\tilde{\Omega}$ and $RMSE_M$ in the situations analyzed in the simulation study I, when the error function is generated considering $\tilde{e}_c(j) \in U(-5, 5)$ and $\tilde{e}_r(j) \in U(-2, 2)$.
6.2 Mean values of $\Omega$ considering different levels of linearity, when the distributions generating observations of $X$ are Uniform ($\Omega_U$) and Normal ($\Omega_N$).
6.3 Mean values of $\Omega$ considering different levels of linearity, when the distributions generating observations of $X$ are Log-Normal ($\Omega_{LogN}$) and a mixture of distributions ($\Omega_{Mix}$).
6.4 Relative efficiency of the estimation of parameters $a$ and $b$ when the distributions of $Y$ are created by DSD Model I and DSD Model II.
6.5 Evaluation measures for the situations $C_{DSDI}/P_{DSDI}$, $C_{DSDI}/P_{DSDII}$, $C_{DSDII}/P_{DSDII}$, and $C_{DSDII}/P_{DSDI}$.
7.1 Observed and predicted histograms (using different methods) of the hematocrit values shown in Table 2.5 (part 1: patients 1 to 4).
7.2 Observed and predicted histograms (using different methods) of the hematocrit values shown in Table 2.5 (part 2: patients 5 to 8).
7.3 Observed and predicted histograms (using different methods) of the hematocrit values shown in Table 2.5 (part 3: patients 9 and 10).
7.4 Comparison of the expressions of the symbolic linear regression models for the histogram-valued variables in Table 2.5.
7.5 Comparison of the Root Mean Square Error values when the Leave-One-Out method is not/is applied together with the proposed models for the histogram-valued variables in Table 2.5.
7.6 Comparison of the Root Mean Square Error values when the Leave-One-Out method is not/is applied together with the DSD Models for the interval-valued variables in Table 2.4.
7.7 Comparison of the Root Mean Square Error values when the Leave-One-Out method is not/is applied together with the DSD Models for the histogram-valued variables in Section 7.2.
7.8 Comparison of the expressions of the symbolic linear regression models that predict the number of violent crimes in USA states.
7.9 Performance of the symbolic linear regression models that predict the number of violent crimes in USA states.
7.10 Comparison between the observed and predicted $LVC$ quantiles.
7.11 Symbolic data table where the two variables, time of activity before unemployment and time of unemployment, are interval-valued variables.
7.12 Comparison of the expressions of the symbolic linear regression models for interval-valued variables in Table 7.11.
7.13 Performance of the symbolic linear regression models that predict the logarithm of the time of unemployment.
7.14 Data with information about the total burned area of forest fires and four other variables, $LNarea$, $temp$, $wind$ and $rh$, organized by month.
7.15 Comparison of the expressions of the symbolic linear regression models for interval-valued variables in Table 7.14.
7.16 Comparison of the Root Mean Square Error values when the Leave-One-Out method is not/is applied together with the proposed models for the histogram-valued variables in Table 7.14.
C.1 Results of the DSD Model I with $a = 2$, $b = 1$ and $v = -1$, when the variability in variable $X$ is low and similar.
C.2 Results of the DSD Model I with $a = 2$, $b = 1$ and $v = -1$, when the variability in variable $X$ is high and similar.
C.3 Results of the DSD Model I with $a = 2$, $b = 1$ and $v = -1$, when the variability in variable $X$ is mixed (type I).
C.4 Results of the DSD Model I with $a = 2$, $b = 1$ and $v = -1$, when the variability in variable $X$ is mixed (type II).
C.5 Results of the DSD Model II with $a = 2$, $b = 1$ and $v = [-2, 0]$, when the variability in variable $X$ is low and similar (part I).
C.6 Results of the DSD Model II with $a = 2$, $b = 1$ and $v = [-2, 0]$, when the variability in variable $X$ is low and similar (part II).
C.7 Results of the DSD Model II with $a = 2$, $b = 1$ and $v = [-2, 0]$, when the variability in variable $X$ is high and similar (part I).
C.8 Results of the DSD Model II with $a = 2$, $b = 1$ and $v = [-2, 0]$, when the variability in variable $X$ is high and similar (part II).
C.9 Results of the DSD Model II with $a = 2$, $b = 1$ and $v = [-2, 0]$, when the variability in variable $X$ is mixed (type I) (part I).
C.10 Results of the DSD Model II with $a = 2$, $b = 1$ and $v = [-2, 0]$, when the variability in variable $X$ is mixed (type I) (part II).
C.11 Results of the DSD Model II with $a = 2$, $b = 1$ and $v = [-2, 0]$, when the variability in variable $X$ is mixed (type II) (part I).
C.12 Results of the DSD Model II with $a = 2$, $b = 1$ and $v = [-2, 0]$, when the variability in variable $X$ is mixed (type II) (part II).
C.13 Results, in different conditions, of the DSD Model I with $a = 2$, $b = 1$ and $v = -1$.
C.14 Results, in different conditions, of the DSD Model I with $a = 2$, $b = 8$ and $v = 3$.
C.15 Results, in different conditions, of the DSD Model I with $a = 6$, $b = 0$ and $v = 2$.
C.16 Results, in different conditions, of the DSD Model I with $a_1 = 2$, $b_1 = 1$, $a_2 = 0.5$, $b_2 = 3$, $a_3 = 1.5$, $b_3 = 1$ and $v = -1$.
C.17 Results, in different conditions, of the DSD Model I with $a_1 = 2$, $b_1 = 1$, $a_2 = 0.5$, $b_2 = 3$, $a_3 = 1.5$, $b_3 = 1$ and $v = -1$ (continuation of Table C.16).
C.18 Results, in different conditions, of the DSD Model II with $a = 2$, $b = 1$ and $\Psi^{-1}_{Constant}(t)$ that represents the interval $I = [-2, 0]$.
C.19 Results, in different conditions, of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t)$ that represents the interval $I = [1, 5]$.
C.20 Results, in different conditions, of the DSD Model II with $a = 6$, $b = 0$ and $\Psi^{-1}_{Constant}(t)$ that represents the interval $I = [1, 3]$.
C.21 Results, in different conditions, of the DSD Model II with $a_1 = 2$, $b_1 = 1$, $a_2 = 0.5$, $b_2 = 3$, $a_3 = 1.5$, $b_3 = 1$ and $\Psi^{-1}_{Constant}(t)$ that represents the interval $I = [-2, 0]$.
C.22 Results, in different conditions, of the DSD Model II with $a_1 = 2$, $b_1 = 1$, $a_2 = 0.5$, $b_2 = 3$, $a_3 = 1.5$, $b_3 = 1$ and $\Psi^{-1}_{Constant}(t)$ that represents the interval $I = [-2, 0]$ (continuation of Table C.21).
C.23 Results, in different conditions, of the DSD Model I with $a = 2$, $b = 1$ and $v = -1$.
C.24 Results, in different conditions, of the DSD Model I with $a = 2$, $b = 8$ and $v = 3$.
C.25 Results, in different conditions, of the DSD Model I with $a = 6$, $b = 0$ and $v = 2$.
C.26 Results, in different conditions, of the DSD Model I with $a_1 = 2$, $b_1 = 1$, $a_2 = 0.5$, $b_2 = 3$, $a_3 = 1.5$, $b_3 = 1$ and $v = -1$.
C.27 Results, in different conditions, of the DSD Model I with $a_1 = 2$, $b_1 = 1$, $a_2 = 0.5$, $b_2 = 3$, $a_3 = 1.5$, $b_3 = 1$ and $v = -1$ (continuation of Table C.26).
C.28 Results, in different conditions, of the DSD Model I with $a_1 = 6$, $b_1 = 0$, $a_2 = 2$, $b_2 = 8$, $a_3 = 10$, $b_3 = 5$ and $v = 3$.
C.29 Results, in different conditions, of the DSD Model I with $a_1 = 6$, $b_1 = 0$, $a_2 = 2$, $b_2 = 8$, $a_3 = 10$, $b_3 = 5$ and $v = 3$ (continuation of Table C.28).
C.30 Results of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_U(t)$, when the histogram-valued variable $X$ has Uniform or Normal distributions.
C.31 Results of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_U(t)$, when the histogram-valued variable $X$ has Log-Normal distributions or a mixture of different distributions.
C.32 Results, in different conditions, of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_U(t)$ (continuation of Tables C.30 and C.31).
C.33 Results of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_N(t)$, when the histogram-valued variable $X$ has Uniform or Normal distributions.
C.34 Results of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_N(t)$, when the histogram-valued variable $X$ has Log-Normal distributions or a mixture of different distributions.
C.35 Results, in different conditions, of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_N(t)$ (continuation of Tables C.33 and C.34).
C.36 Results of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{LogN}(t)$, when the histogram-valued variable $X$ has Uniform or Normal distributions.
C.37 Results of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{LogN}(t)$, when the histogram-valued variable $X$ has Log-Normal distributions or a mixture of different distributions.
C.38 Results, in different conditions, of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{LogN}(t)$ (continuation of Tables C.36 and C.37).
C.39 Results of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{Mix}(t)$, when the histogram-valued variable $X$ has Uniform or Normal distributions.
C.40 Results of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{Mix}(t)$, when the histogram-valued variable $X$ has Log-Normal distributions or a mixture of different distributions.
C.41 Results, in different conditions, of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{Mix}(t)$ (continuation of Tables C.39 and C.40).
List of Figures

2.1 Representation of the histograms associated with each healthcare center for the histogram-valued variable $Y_5$ in Table 1.2.
2.2 Representation of the quantile functions $\Psi^{-1}_{Y_5(A)}$, $\Psi^{-1}_{Y_5(B)}$, $\Psi^{-1}_{Y_5(C)}$ in Table 1.2.
2.3 Representation of the quantile functions $\Psi^{-1}_{Y_2(A)}$, $\Psi^{-1}_{Y_2(B)}$, $\Psi^{-1}_{Y_2(C)}$ in Table 1.1.
2.4 Representation of the interval $I_X + I_Y$ in Example 2.4.
2.5 Representation of the intervals $I_X$, $2I_X$ and the respective symmetric intervals $-I_X$, $-2I_X$ in Example 2.4.
2.6 Representation of the histograms $H_X$, $H_Y$ and $H_X + H_Y$ in Example 2.5.
2.7 Representation of the histograms $H_X$, $H_X + 2$ and $H_X - 2$ in Example 2.5.
2.8 Representation of the histograms $H_X$, $2H_X$ and $-H_X$ in Example 2.5.
2.9 Representation of the histogram $H_X - H_X$ in Example 2.6.
2.10 Representation of the quantile functions $\Psi^{-1}_X$ and $\Psi^{-1}_Y$ in Example 2.7.
2.11 Representation of the quantile functions $\Psi^{-1}_X$, $\Psi^{-1}_Y$, $\Psi^{-1}_X + \Psi^{-1}_Y$ in Example 2.8.
2.12 Representation of the quantile functions $\Psi^{-1}_X$, $\Psi^{-1}_X + 2$, $\Psi^{-1}_X - 2$ in Example 2.8.
2.13 Representation of the functions $\Psi^{-1}_X(t)$, $2\Psi^{-1}_X(t)$, $-\Psi^{-1}_X(t)$ in Example 2.8.
2.14 Representation of the histogram $H_X$ in Example 2.8 and the respective symmetric $-H_X$.
2.15 Representation of the functions $\Psi^{-1}_X(t)$, $-\Psi^{-1}_X(t)$, $-\Psi^{-1}_X(1-t)$ in Example 2.8.
2.16 Representation of the functions $\Psi^{-1}_X(t)$, $-\Psi^{-1}_X(t)$, $-\Psi^{-1}_X(1-t)$ in Example 2.4.
2.17 Scatter plot of the interval data in Table 2.4 described in case 1.a).
2.18 Scatter plot of the interval data in Table 2.4 described in case 2.
2.19 Scatter plot of the histogram data in Table 2.5.
2.20 Histogram of the interval-valued variable "Age" in Table 1.1.
2.21 Histogram of the cumulative relative frequency and the quartiles of the interval-valued variable "Age" in Table 1.1.
2.22 Histogram of the histogram-valued variable "Waiting time for a consult" in Table 1.2.
2.23 Barycentric histogram of the histograms in Example 2.15.
2.24 Quantile function that represents the barycentric histogram of the histograms in Example 2.15.
2.25 Quantile function that represents the barycentric histogram of the histograms in Example 2.16.
2.26 Quantile function that represents the median histogram of the histograms in Example 2.16.
2.27 Quantile function that represents the barycentric interval of the intervals in Example 2.17.
2.28 Quantile function that represents the median histogram of the intervals in Example 2.17.
4.1 Scatter plots considering the histogram-valued variables $X$ and $Y$ in Table 2.5.
4.2 Scatter plots considering the histogram-valued variables $-X$ and $Y$ in Table 4.1.
4.3 Scatter plots considering the mean values of the observations of the histogram-valued variables $X$ and $Y$ in (a); $-X$ and $Y$ in (b).
5.1 Scatter plots associated with the interval-valued variables $X$ and $Y$ in Table 2.4.
5.2 Scatter plots associated with the interval-valued variables $-X$ and $Y$ in Table 5.1.
6.1 Mean values of $\Omega$ and the respective standard deviation for different error functions.
6.2 Mean values of $\tilde{\Omega}$ and the respective standard deviation for different error functions.
6.3 Mean values of $RMSE_M$ and the respective standard deviation for different error functions.
6.4 Observed and predicted intervals with low and similar variability when $\tilde{e}_c \in U(-5, 5)$ and $\tilde{e}_r \in U(-2, 2)$ ($\Omega = 0.7138$ and $\tilde{\Omega} = 0.0993$).
6.5 Observed and predicted intervals with high and similar variability when $\tilde{e}_c \in U(-5, 5)$ and $\tilde{e}_r \in U(-2, 2)$ ($\Omega = 0.9898$ and $\tilde{\Omega} = 0.0730$).
6.6 Observed and predicted intervals with mixed variability I when $\tilde{e}_c \in U(-5, 5)$ and $\tilde{e}_r \in U(-2, 2)$ ($\Omega = 0.9677$ and $\tilde{\Omega} = 0.6728$).
6.7 Observed and predicted intervals with mixed variability II when $\tilde{e}_c \in U(-5, 5)$ and $\tilde{e}_r \in U(-2, 2)$ ($\Omega = 0.9571$ and $\tilde{\Omega} = 0.7914$).
6.8 Observed and predicted intervals with low and similar variability when $\tilde{e}_c \in U(-1, 1)$ and $\tilde{e}_r \in U(-0.5, 0.5)$ ($\Omega = 0.9866$ and $\tilde{\Omega} = 0.6807$).
6.9 Observed and predicted intervals with high and similar variability when $\tilde{e}_c \in U(-1, 1)$ and $\tilde{e}_r \in U(-0.5, 0.5)$ ($\Omega = 0.9996$ and $\tilde{\Omega} = 0.6943$).
6.10 Observed and predicted intervals with high and similar variability when $\tilde{e}_c \in U(-60, 60)$ and $\tilde{e}_r \in U(-mr, mr)$ ($\Omega = 0.3663$ and $\tilde{\Omega} = 0.0018$).
6.11 Observed and predicted intervals with mixed variability II when $\tilde{e}_c \in U(-40, 40)$ and $\tilde{e}_r \in U(-mr, mr)$ ($\Omega = 0.2199$ and $\tilde{\Omega} = 0.0262$).
6.12 Representation of the $RMSE_M$ in all cases when the DSD Model I, with one explicative variable, is applied.
6.13 Boxplots of the values estimated for parameter $a$, under different conditions, when DSD Model I ($a = 2$; $b = 8$; $v = 3$) is applied to interval-valued variables and when the level I error is considered.
6.14 Boxplots of the values estimated for parameter $a$, under different conditions, when DSD Model II ($a = 2$; $b = 8$; $\Psi^{-1}_{Constant}(t) = 3 + 2(2t - 1)$) is applied to interval-valued variables and when the level I error is considered.
6.15 Boxplots of the values estimated for parameter $b$, under different conditions, when DSD Model I ($a = 2$; $b = 8$; $v = 3$) is applied to interval-valued variables and when the level I error is considered.
6.16 Boxplots of the values estimated for parameter $b$, under different conditions, when DSD Model II ($a = 2$; $b = 8$; $\Psi^{-1}_{Constant}(t) = 3 + 2(2t - 1)$) is applied to interval-valued variables and when the level I error is considered.
6.17 Boxplots of the values estimated for parameter $a$, under different conditions, when DSD Model I ($a = 2$, $b = 8$, $v = 3$) is applied to histogram-valued variables and when the level I error is considered.
6.18 Boxplots of the values estimated for parameter $b$, under different conditions, when DSD Model I ($a = 2$, $b = 8$, $v = 3$) is applied to histogram-valued variables and when the level I error is considered.
6.19 Boxplots of the values estimated for parameter $v$, under different conditions, when DSD Model I ($a = 2$, $b = 8$, $v = 3$) is applied to histogram-valued variables and when the level I error is considered.
6.20 Boxplots of the values estimated for parameter $a$, under different conditions, when DSD Model II ($a = 2$, $b = 8$, $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_N(t)$) is applied to histogram-valued variables and when the level I error is considered.
6.21 Boxplots of the values estimated for parameter $b$, under different conditions, when DSD Model II ($a = 2$, $b = 8$, $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_N(t)$) is applied to histogram-valued variables and when the level I error is considered.
6.22 Representation of $D^2_M(\Psi^{-1}_N(t), \widehat{\Psi}^{-1}_N(t))$ in different conditions, when DSD Model II ($a = 2$; $b = 8$; $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_N(t)$) is applied to histogram-valued variables and when the level I error is considered.
6.23 Boxplots that represent the "skewness" of the distributions estimated with DSD Model I.
6.24 Boxplots that represent the "skewness" of the distributions estimated with DSD Model II.
6.25 Boxplots that represent the "skewness" of the distributions estimated with the DSD Model II with parameters $a = 2$, $b = 8$, $\Psi^{-1}_{LogN(0.95,1)}$ and $a = 2$, $b = 8$, $\Psi^{-1}_{LogN(0.5,0.4)}$.
6.26 Boxplot for the estimated parameter $a$, in the four cases $C_{DSDI}/P_{DSDI}$, $C_{DSDI}/P_{DSDII}$, $C_{DSDII}/P_{DSDII}$, $C_{DSDII}/P_{DSDI}$.
6.27 Boxplot for the estimated parameter $b$, in the four cases $C_{DSDI}/P_{DSDI}$, $C_{DSDI}/P_{DSDII}$, $C_{DSDII}/P_{DSDII}$, $C_{DSDII}/P_{DSDI}$.
7.1 Comparing the predictions and error functions for observation 1 in Table 2.5.
7.2 Observed and predicted quantile functions of each observation in Table 2.5.
7.3 Scatter plots representing the observed and predicted intervals of hematocrit values in Table 2.4.
7.4 Scatter plots representing the observed and predicted intervals of hematocrit values in Table 2.4 when the Leave-One-Out method is applied.
7.5 Map of the communities of the USA.
7.6 Selected states used to define the model.
7.7 Observed and predicted quantile functions of $LVC$ considering the approaches: symbolic, classic, classic-symbolic (part 1).
7.8 Observed and predicted quantile functions of $LVC$ considering the approaches: symbolic, classic, classic-symbolic (part 2).
7.9 Observed and estimated quantile function of the variable $LVC$ in the state of Arkansas.
7.10 Predicted quantile 0.2 of $LVC$ for all states.
7.11 Predicted quantile 0.4 of $LVC$ for all states.
7.12 Predicted quantile 0.6 of $LVC$ for all states.
7.13 Predicted quantile 0.8 of $LVC$ for all states.
7.14 Scatter plot of the explicative interval-valued variable $E$ and of the response variable $LNU$, observed in (a) and predicted with the DSD Model I in (b).
7.15 Observed and predicted quantile functions considering all methods presented in Table 7.12 (part 1).
7.16 Observed and predicted quantile functions considering all methods presented in Table 7.12 (part 2).
7.17 Observed and predicted quantile functions considering all methods presented in Table 7.12 (part 3).
7.18 Observed and predicted quantile functions considering all methods presented in Table 7.12 (part 4).
7.19 Observed and predicted quantile functions considering all methods presented in Table 7.12 (part 5).
7.20 Localization of the Montesinho natural park.
7.21 Observed and predicted intervals of burned area by month in Table 7.14.
7.22 Observed and predicted intervals of the burned area in the Montesinho natural park in Table 7.14.
7.23 Observed and predicted intervals applying the DSD Model I and the DSD Model I with LOO, for the interval-valued variable $LNarea$, for each month.
B.1 Relative positions of the intervals when the respective quantile functions are parallel.
B.2 Relative positions of the intervals when the respective quantile functions intersect outside the interval [0, 1].
B.3 Relative positions of the intervals when the respective quantile functions intersect in the interval [0, 1].
1. Introduction
In this work, linear regression models that allow predicting intervals and histograms from
other intervals and histograms will be proposed. The models are developed in the context
of Symbolic Data Analysis. In this first chapter, the concepts of data with variability and symbolic variables, which generalize the classical definition of variables in Multivariate Data Analysis, will be introduced. The reasons for studying linear regression between empirical distributions using Symbolic Data Analysis rather than other approaches will also be explained. To conclude this chapter, a short description of the structure of the thesis is presented.
1.1. Motivation
The extensive and complex data that emerged in the last decades made it necessary
to extend and generalize the classical concept of data sets. Data tables where the cells
contain a single quantitative or categorical value were no longer sufficient. More complex
data tables were needed, with cells that include more accurate and complete information.
In some situations, the data needs to express the variability or imprecision of the records
associated with each observed unit. In this research we will focus on situations where
variability in data description occurs. The classical solution to analyze these data is to
reduce the collection of records associated with each individual or class of individuals to
one value, typically the mean, mode or maximum/minimum; however, with this option the
variability across the records is lost. The main goal of this study is to propose a linear regression model that allows predicting, for each observed unit, a distribution representing the
variability of its values associated with a variable, from the distribution of values associated
with other variables.
1.2. Data with variability and imprecise data
It is important to clarify the difference between data with variability and imprecise data, two types of data that are sometimes confused. When we want to study a characteristic that fluctuates over a period of time or is associated with a specific group/class of individuals, the "value" that best describes this characteristic is not a real value or a category but a set/distribution/range of values. In this case we are in the presence of data with variability. This kind of data arises in situations where the variables are, for example, temperatures or stock values that fluctuate during a day, week or month; the prices of a product across regions; the ages of the patients in various healthcare centers; or the weights or heights of the players of a football team. The variability of the data may emerge from the aggregation of single observations (Arroyo (2008)). This aggregation is named contemporary if the records are collected at the same temporal instant or the temporal instant is not relevant; it is named temporal if time is the aggregation criterion and the records are grouped over one unit of time (one day, for example) but the order in which they were observed is not pertinent. When we work with data with variability, the "best values" to represent the observations associated with each unit (individual or class of individuals) are empirical distributions or, in particular, intervals of values.
Although with a different meaning, the observations associated with imprecise data are also represented by ranges of values. Imprecise data occur when the interval associated with each unit under analysis represents the uncertainty of the corresponding record (e.g., distances or lengths measured with imprecise instruments). In this context, "the intervals are an imprecise perception of non-observable real values" (Moore (1966)).
It is important to underline that the same variable can be approached in different contexts. As an example, the variable "weight" is a classical variable if the goal is to study a data table where, for example, the individuals are football players and to each football player corresponds his exact weight. If we are interested in studying the weight not of one single player but of a football team, then, since each team is a class of individuals, it is characterized not by a real value but by a set of values, an interval or a distribution. For example, the interval [75, 80] may represent the weights of all players of a given football team. The variable weight can also be studied in other situations. For example, if we do not know the exact weight of the football players of the team, we can associate with each player the "value" that represents this imprecision. The interval [80, 82] may then mean that the weight of one football player is between 80 and 82 kg. In these examples, the element interval may represent two different types of records. In the first situation the interval describes the variability of the weight values within the football team, whereas in the second situation the interval represents the imprecision of the weight value.
In this research we will work with data with variability, not with imprecise data. Considering all values or distributions associated with each unit allows accounting for the variability of these records and hence performing more accurate studies. This perspective is in agreement with the opinion of Schweizer, who around 30 years ago advocated that "distributions are the numbers of the future". Following in his footsteps, Diday generalized the classical concept of variables in Multivariate Data Analysis and introduced Symbolic Data Analysis (Diday (1988)). It is in this context that this work is developed.
1.3. Symbolic Data Analysis
The data studied in Symbolic Data Analysis are represented in complex data tables where each cell expresses the variability of the records of each observed unit. These tables are called symbolic data tables, and their cells may contain finite sets of values/categories, intervals or distributions. In these cases the variables are named symbolic variables. The objects/units/individuals may be single individuals (first-level units) or classes of individuals (higher-level units). As in the classical case, symbolic variables may be classified as quantitative or qualitative. For quantitative symbolic variables, each unit may take a single value (single-valued variables); a finite set of values (multi-valued variables); an interval (interval-valued variables); or a probability/frequency/weight distribution (modal-valued variables). A particular type of modal-valued variable is the histogram-valued variable. In this case, the values attained by the variable for each unit are empirical distributions or, more specifically, histograms, where the values in each subinterval are assumed to be uniformly distributed. If we consider a symbolic variable where each unit is associated with only one interval of real numbers (uniformly distributed) with probability/frequency/weight equal to one, then we are in the presence of an interval-valued variable.
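To make this concrete, here is a minimal sketch in Python (the helper name quantile and the data layout are illustrative choices, not part of the thesis software): a histogram-valued observation is stored as a list of weighted subintervals, and the uniformity assumption within each subinterval makes its quantile function piecewise linear. An interval-valued observation is simply the special case of a single subinterval with weight one.

    def quantile(hist, t):
        # `hist` is a list of (lower, upper, weight) subintervals whose
        # weights sum to one; values are assumed uniformly distributed
        # within each subinterval, so the quantile function is piecewise
        # linear in t.
        assert 0.0 <= t <= 1.0
        cum = 0.0
        for low, up, w in hist:
            if t <= cum + w:
                # linear interpolation inside this subinterval
                return low + (t - cum) / w * (up - low)
            cum += w
        return hist[-1][1]

    # Waiting time for a consult at healthcare center A (Table 1.2 below):
    hist_A = [(15, 30, 0.1), (30, 45, 0.6), (45, 90, 0.3)]
    print(quantile(hist_A, 0.5))    # median waiting time: 40.0 minutes
    interval = [(25, 83, 1.0)]      # interval-valued case: "Age" at center A
    print(quantile(interval, 0.5))  # 54.0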
As an example, let us consider a symbolic data table containing information about patients (adults) attending healthcare centers during a fixed period of time. In healthcare center A, the age of patients ranged from 25 to 83 years old; in healthcare center B, it ranged from 18 to 90 years old; and in healthcare center C, from 20 to 74 years old, so that age is an interval-valued variable (see Table 1.1). Now consider another variable which records the waiting time for a consult. In this case, the information is recorded as intervals of waiting time with associated frequencies in each healthcare center. Each observation is then a histogram, and the waiting time for a consult is a histogram-valued variable (see Table 1.2). Notice that in this example the entities under analysis are the healthcare centers (higher-level units), for each of which we have aggregated information (contemporary aggregation), and not the individual patients of each center (first-level units).
Table 1.1: Symbolic data table with information of three healthcare centers (part 1).

Healthcare center | Gender Y1        | Age Y2   | Education Y3
A                 | {F, 1/2; M, 1/2} | [25, 83] | {9th grade, 1/2; Higher education, 1/2}
B                 | {F, 2/3; M, 1/3} | [18, 90] | {6th grade, 1/4; 9th grade, 1/4; 12th grade, 1/4; Higher education, 1/4}
C                 | {F, 2/5; M, 3/5} | [20, 74] | {4th grade, 1/3; 9th grade, 1/3; 12th grade, 1/3}
Table 1.2: Symbolic data table with information of three healthcare centers (part 2).

Healthcare center | Number of emergency consults Y4 | Waiting time for a consult (in minutes) Y5    | Emergency 24h Y6
A                 | {1, 2, 3}                       | {[15, 30[, 0.1; [30, 45[, 0.6; [45, 90], 0.3} | No
B                 | {0, 1, 4, 5, 10}                | {[0, 15[, 0.8; [15, 45[, 0.2}                 | Yes
C                 | {0, 1, 3, 7}                    | {[0, 15[, 0.6; [30, 60], 0.4}                 | Yes
In other situations we may have multiple records associated with each unit, resulting, for instance, from several observations performed in one day/month/year. If we want to study such a variable then, as an alternative to summarizing all values by just one value and losing the variability of the information, we may aggregate the information referring to one specific period of time (temporal aggregation). Thereby each individual (first-level unit) may be associated with an interval of values (interval-valued variable) or with a distribution (histogram-valued variable).

As an example, consider the classical data table, Table 1.3, that contains information about the levels of hematocrit and hemoglobin of a set of patients attending a healthcare center during one month. Aggregating the values associated with each patient, we build a symbolic data table, Table 1.4, where to each unit, the patient, corresponds the distribution or the interval of values that describes the variability of the values of hematocrit and hemoglobin recorded for that patient during one month.
Table 1.3: Classical data table (microdata) with the records of hematocrit and hemoglobin of each patient per day.

Patients |        Hematocrit (Y)         |        Hemoglobin (X)
         | Day 1   Day 2   ...   Day 30  | Day 1   Day 2   ...   Day 30
1        | 35.68   39.61   ...   34.54   | 12.4    12.19   ...   11.54
2        | 40.83   36.69   ...   39.45   | 12.67   13.04   ...   12.07
3        | 46.45   47.97   ...   48.68   | 12.38   13.63   ...   16.16
4        | 42.62   38.34   ...   39.89   | 14.26   13.58   ...   12.89
5        | 48.65   46.32   ...   39.19   | 14.61   13.80   ...   16.24
6        | 46.58   39.70   ...   39.12   | 13.98   14.54   ...   13.81
7        | 47.64   46.09   ...   48.25   | 14.81   15.55   ...   14.68
8        | 43.68   39.84   ...   38.40   | 13.27   13.68   ...   13.67
9        | 38.88   29.06   ...   41.64   | 10.97   11.98   ...   13.56
10       | 47.54   50.60   ...   49.82   | 15.95   15.64   ...   16.01
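The passage from the microdata in Table 1.3 to the macrodata in Table 1.4 below may be sketched as follows. This is only an illustration, assuming the daily records of one patient are available as a Python list; the two-bin split at an empirical quantile merely mimics the layout of Table 1.4, and other numbers of bins and cut points are equally possible.

    def to_interval(records):
        # interval-valued description: the range of the records
        return (min(records), max(records))

    def to_histogram(records, t=0.4):
        # two-bin histogram-valued description: split the sorted records
        # at the empirical quantile t, giving subintervals with weights
        # t and 1 - t (cf. the hemoglobin column of Table 1.4)
        s = sorted(records)
        cut = s[round(t * (len(s) - 1))]
        return [(s[0], cut, t), (cut, s[-1], 1 - t)]

    hematocrit_p1 = [35.68, 39.61, 34.54]  # ... all 30 records of patient 1
    print(to_interval(hematocrit_p1))   # (34.54, 39.61) for these three days
    print(to_histogram(hematocrit_p1))  # [(34.54, 35.68, 0.4), (35.68, 39.61, 0.6)]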
Table 1.4: Symbolic data table (macrodata) when the values of hematocrit and hemoglobin are symbolic variables (Billard and Diday (2002)).

Patients | Hematocrit (Y)  | Hemoglobin (X)
1        | [33.29; 39.61]  | {[11.54; 12.19[, 0.4; [12.19; 12.8], 0.6}
2        | [36.69; 45.12]  | {[12.07; 13.32[, 0.5; [13.32; 14.17], 0.5}
3        | [36.70; 48.68]  | {[12.38; 14.2[, 0.3; [14.2; 16.16], 0.7}
4        | [36.38; 47.41]  | {[12.38; 14.26[, 0.5; [14.26; 15.29], 0.5}
5        | [39.19; 50.86]  | {[13.58; 14.28[, 0.3; [14.28; 16.24], 0.7}
6        | [39.7; 47.24]   | {[13.81; 14.5[, 0.4; [14.5; 15.2], 0.6}
7        | [41.56; 48.81]  | {[14.34; 14.81[, 0.5; [14.81; 15.55], 0.5}
8        | [38.4; 45.22]   | {[13.27; 14.0[, 0.6; [14.0; 14.6], 0.4}
9        | [28.83; 41.98]  | {[9.92; 11.98[, 0.4; [11.98; 13.8], 0.6}
10       | [44.48; 52.53]  | {[15.37; 15.78[, 0.3; [15.78; 16.75], 0.7}

Since the eighties of the last century, Symbolic Data Analysis has achieved considerable development of new statistical techniques to analyze multi-valued data (see, for instance, Bock and Diday (2000), Billard and Diday (2003, 2006), Diday and Noirhomme-Fraiture (2008), Noirhomme-Fraiture and Brito (2011)). Recently, there has been a growing interest in the analysis of histogram-valued variables, but the bulk of the research is still developed for interval-valued variables. The methods proposed so far for the former are indeed, frequently, a generalization of their counterparts for the latter.

The concepts and methods which have been developed in the context of Symbolic Data Analysis can be classified in three categories, according to the input data, the method, and the output data (Irpino and Verde (2012, 2013)), as follows:
symbolic-numerical-numerical: Symbolic input data are transformed into standard data in order to apply classical multivariate techniques. The results are real values. For example, the symbolic mean of a set of intervals is the classical mean of their centers (see the sketch after this list).

symbolic-numerical-symbolic: Symbolic input data are analyzed with classical multivariate techniques and the results are symbolic data. Most Symbolic Data Analysis techniques belong to this category. For example, the linear regression models defined for interval-valued variables reduce the intervals to their centers and half ranges, apply classical linear regression models to these elements, and afterwards reconstruct the intervals from the estimated centers and half ranges.

symbolic-symbolic-symbolic: Symbolic input data are transformed using generalization/specialization operators and the results are symbolic data. Under this category, some concepts were only recently proposed, such as, for example, the barycentric histogram: the input data are a set of histograms, the method involves a distance between distributions (a histogram may represent an empirical distribution), and the result is a symbolic element, the barycentric histogram, which minimizes the total distance to the given set of histograms.
1.4. LINEAR REGRESSION BETWEEN EMPIRICAL DISTRIBUTIONS
7
The category symbolic-symbolic-symbolic is the most recent approach and will be the
one considered in this research. In this context it was necessary to study the behavior of
the new elements that are now intervals and histograms. In the Symbolic Data Analysis
approach and in this category, linear regression models do not predict real values from
real values but rather distributions from other distributions. All concepts and methods in
this category require new tools to work with these more complex elements, with which it
is necessary to learn how to operate. Knowing the arithmetics and distances that will be
applied to these elements and analyzing the behavior of the vector spaces whose elements
are intervals and distributions, constitutes important knowledge from which we can define
statistical concepts encompassed in the category symbolic-symbolic-symbolic.
Most concepts and methods developed in the Symbolic Data Analysis approach are
descriptive since a probabilistic assumption is not considered. The development of nondescriptive methods for Symbolic Data Analysis is still an open research topic for almost
all kinds of symbolic variables. Studies were recently published proposing probabilistic
models for interval-valued variables (see Lima Neto et al. (2011) and Brito and Duarte Silva
(2012)). However, in these works, intervals are treated as vectors whose components are
the centers and half ranges or the bounds, instead of being considered intervals as such.
In general, the difficulty to work with symbolic elements under a probabilistic context lays in
the extension of the concept of randomness.
1.4. Linear regression between empirical distributions
When this research started, several linear regression models for interval-valued variables had already been defined (Billard and Diday (2000, 2002), Lima Neto and De Carvalho
(2008, 2010)), but for histogram-valued variables only an extension of the first models
proposed by Billard and Diday for interval-valued variables had been proposed (Billard and
Diday (2002, 2006)).
The models proposed by Billard and Diday present some limitations. The main one
consists in the fact that those models are a simple adaptation of the classical models
where the parameters are predicted using the symbolic variance and symbolic covariance
definitions; moreover, the estimation of intervals does not consider the intervals as such,
but rather the bounds are estimated separately; consequently, the elements predicted by
8
CHAPTER 1. INTRODUCTION
the models may fail to build an interval. Because the first models proposed for histogramvalued variables are a generalization of the models presented for interval-valued variables,
the problems associated with this case are similar but are more difficult to solve because
distributions are much more complex elements. For interval-valued variables, Lima Neto
and De Carvalho (2008, 2010); Lima Neto et al. (2011) and, recently, also other authors
(Giordani (2011, 2014), Yang et al. (2011), Ahn et al. (2012)), proposed new approaches
to define linear regression models where the limitations of the first proposed models are
solved. However, no model was defined with the elements “intervals” completely considered
as such. During the period under which the research presented in this work was developed,
alternative methods for histogram-valued variables were proposed by Verde and Irpino
(2010); Irpino and Verde (2012, 2013). In this case the model is already defined taking
into account the entire distributions.
As histogram-valued variables are still little studied, the development of a linear regression model for histogram-valued variables that allows predicting distributions from other
distributions is the main goal of this research. The analysis of the limitations of the models
presented in the literature was the starting point to define the goals for the models to
propose. These goals consist in designing a linear regression model that:
• is flexible and truly predicts histograms or intervals from other histograms or intervals;
• solves the problem of the lack of linearity in the spaces whose elements are intervals
and histograms; this limitation imposes that a linear regression between intervals or
histograms will be direct;
• uses an adequate distance to measure the error between the observed and predicted
elements;
• can be particularized to interval-valued variables; for this particular case different
distributions within the intervals may be considered.
The difficulties associated with the definition of a linear regression between intervals or
distributions lays in the definition of linear combination between these kinds of elements.
The interval and histogram arithmetics are complex and the semi-linearity of the spaces,
whose elements are intervals or histograms, does not allow for the generalization of the
1.4. LINEAR REGRESSION BETWEEN EMPIRICAL DISTRIBUTIONS
9
classical definition of linear combination and consequently of linear relation between elements of these types. To solve this first limitation, the solution is to consider the representation of the histograms by the inverse of the cumulative distribution function, as proposed by
Irpino and Verde (2006). This representation of the histograms is named quantile function.
Considering this representation, instead of working with intervals/histograms, we work with
linear functions or piecewise linear functions. Considering that the distributions or intervals
may be represented by quantile functions, we may now define a linear regression where the
elements involved are functions. However, these functions have an important property. As
the subintervals within the histograms (in the case of the histogram-valued variables) and
the intervals (for interval-valued variables) have the upper bound greater than or equal to the
lower bound, the quantile functions used to represented them are always non-decreasing
functions.
Since we are working with functions, we might consider Functional Data Analysis methods (Ramsay and Silverman (2005)) to apply to histogram-valued variables. In fact, in
the Functional Data Analysis context the data are functions rather than individual values or
sequences of individual observations. Usually, in this context, functional data are observed
and recorded as discrete points (tj , yj ) where for the individual j, the values yj of the
variable are observed, for example, at time tj . However Functional Data Analysis considers
not those discrete points but rather a function obtained from the observed data, i.e. it
considers data that associates with each individual a curve that represents the mathematical description of discrete data points distributed over space, time, and other types
of continuum. These curves are typically adjusted by functions as a weighted sum or
linear combination of basis functions, B-spline functions or Fourier series. The functions
thus obtained and the functions that in general are considered in this context are smooth
functions.
Under the Functional Data Analysis framework, linear models may be functional because the prediction of the response variable is a function and/or the observations of the
explicative variables in the model are functions. In any case, the regression coefficients are
functions rather than real numbers as in the classical case. Three types of functional linear
regression models may be considered.
10
CHAPTER 1. INTRODUCTION
• Functional linear model for functional responses.
– The parameters and observations of the response variable are functional but the
observations of the explicative variables, xjk are multivariate as in the classical
linear regression model. In this case, the predicted function ybj is given by:
ybj (t) = β0 +
p
X
βj (t)xjk .
k=1
– The fully functional linear model considers both the observations of the response
variable yj and the explicative variables xj as functions defined in intervals TX
and TY , respectively. In this case, the functional prediction is obtained by
ybj (t) = β0 (t) +
Z
βj (s, t)xj (s)ds.
TX
• Functional linear model for scalar responses
The response variable yj is a scalar or multivariate predict from the functions xj
defined in an interval TX , i.e.,
ybj = β0 +
Z
βj (s)xj (s)ds.
TX
In this work, the goal is to predict distributions from other distributions. Consequently,
the most straightforward situation is be the case where both the observations of the response and explicative variables are functions. However, in general, these methods are not
adequate to work with quantile functions because they are not smooth functions. A possibility to allowing the methods of Functional Data Analysis to quantile functions would include
smoothing the quantile functions that are the observations, assuming in this case that we
are working with the distribution function instead of the empirical distribution function. In
this situation, when the microdata are not known, the smoothing of the functions would
have to be obtained only from the quantiles used in the quantile functions, which could be a
limitative information to represent the behavior of the variable for each observation. Another
option would be to consider a different model for each piece (linear function) that composes
the quantile functions. For this, it would be necessary that all quantile functions are rewritten
with the same number of pieces and an equal domain for each piece. However, this option
is not applicable to the whole function associated with each unit and it predicts the pieces
1.5. ORGANIZATION OF THE THESIS
11
separately, which may lead to functions that are not non-decreasing; consequently, when
the pieces are joined to build the function, we fail to obtain quantile functions.
1.5. Organization of the thesis
In addition to this chapter, the Introduction, this thesis is composed by 7 more chapters
and 3 Appendices. The remainder of the thesis is organized as follows. Chapter 2 introduces Symbolic Data Analysis, the kind of data and the variables that this approach uses,
focusing on histogram and interval-valued variables. In the first sections of this chapter the
arithmetics and distance measures more commonly applied to this kind of elements are
presented. The Mallows distance selected to use in this work and the reasons behind this
choice are explained. In later sections, the concepts and methods of descriptive statistics
for histogram/interval-valued variables are defined in detail. For a good understanding of
the definitions, the concepts and methods, some examples are presented throughout the
chapter.
The main goal of Chapter 3 is to present the state of the art related to linear regression
models proposed for the case where the observations in the data tables are intervals or
distributions. Both the methods that use the Symbolic Data Analysis approach and the
models proposed for imprecise data are presented.
The new linear regression models for histogram-valued variables and interval-valued
variables are presented in Chapters 4 and 5, respectively. The problem of defining a linear
regression model for histogram-valued variables is addressed. Two alternative models are
proposed. In Chapter 5, and since interval-valued variables constitute a particular case
of histogram-valued variables, the linear regression models for interval-valued variables
emerge as a special case. We also prove that the particularization of the proposed model
to degenerate intervals is coincident to the classical linear regression model. In addition
to the presentation of the models and respective goodness-of-fit measures, several related
properties are enunciated and proved in both chapters.
Chapter 6 is focused on simulation studies considering both kinds of symbolic variables treated in this work. We first introduce the process to generate symbolic data tables;
then the factorial design for each study is described. For interval-valued variables two
simulation studies are performed and for histogram-valued variables just one. The results
12
CHAPTER 1. INTRODUCTION
and conclusions of these studies are discussed; the tables with the records that support the
analysis are included in Appendices C.1, C.2 and C.3.
To illustrate the application of the models to real data, four examples are studied in
Chapter 7. In several situations, not only the new models proposed in this work but also
other methods proposed in the literature are applied. Thereby it is possible to compare the
models and the predictions obtained. In one of these cases a classical study is also performed with the goal of comparing the conclusions obtained using two different approaches,
the classical and the symbolic.
To finalize, the general conclusions of this work are presented and directions for future
research are pointed out.
2. Histogram and Interval data
The research presented in this work is developed under the scope of Symbolic Data
Analysis considering interval-valued variables and histogram-valued variables, for which the
observed values are not single values or categories but intervals or distributions,
respectively.
In the first part of this chapter we will define histogram-valued variables and intervalvalued variables. As the main goal is to work with interval data and histogram data in the
symbolic-symbolic-symbolic category, it is important to find the best representation for these
kind of data, study the arithmetics that we may use when we need to operate with these
elements and select the distance that can provide a good dissimilarity measure between
intervals and distributions. These three points are the main contribute that allow defining
a new approach for linear regression that allows predicting intervals or distributions from
other intervals or distributions.
In the second part of this chapter we will perform a review of the state of the art of
the descriptive statistics associated with these types of symbolic variables. Many concepts
and definitions of descriptive measures for one and more variables were proposed in the
symbolic-numerical-numerical category. Recently were proposed descriptive measures in
symbolic-symbolic-symbolic category. The results associated with the descriptive statistics
will be introduced and demonstrated with detail.
2.1. Symbolic variables
2.1.1. Definition and classification
Classical multivariate statistics studies data tables that summarize observations of “statistical units” (individuals). Each row of these tables represents one individual and each of
13
14
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
these individuals is characterized by different variables (in columns). The “values” attained
by the variables may be real values, if the variable represents the measurement of a
quantity (quantitative variables), or a category, if the variable is qualitative. However, this
representation is too strict to take into account variability which is often inherent to the data.
For instance, when analyzing a collection of observations (along time or space) rather than
a single one, then variability intrinsic to the observation set should be taken into account. In
the late 80’s, the classical concept of variables in Multivariate Data Analysis was expanded
and Symbolic Data Analysis (Diday (1988)) introduced. Symbolic Data Analysis generalizes
the classical framework by allowing each individual or class of individuals to take a finite set
of values/categories, an interval or a probability/weight/frequency distribution for a given
variable. This new kind of data table is called symbolic data table and the variables symbolic
variables. In this case, the objects may be single individuals (first-level units) or, when the
individuals are aggregated, classes of units (higher-level units).
The definition of this more general concept of variable was formally presented in
Chapter 3 of the book from Bock and Diday (2000), as follows:
Definition 2.1. A symbolic variable Y is a mapping
Y : E → B
j
7→ Y (j) = ξj
defined on a set E of statistical entities.
We have Ω = E = {1, 2, . . . , m} when the individuals are first-level units or
E = {C1 , C2 , . . . Cm } with Cj ⊆ Ω when the individuals are higher-level units (classes/con-
cepts or categories). Each unit j in E takes its “values” in B. According to the type
of realization of the symbolic variables, the set B will be: B = Y (classical variables);
6 ∅} ; B a set of intervals of values in Y ⊆ R or B a family of
B = {D : D ⊆ Y, D =
distributions on Y.
Note 2.1. Henceforth in this work, when we use the term unit, we will be referring to a firstlevel unit or to a higher-level unit according to the kind of prior aggregation of the microdata
used to build the symbolic data table.
Notation 1. For a classical variable Yk , we will denote by Yk (j) = yjk the value that the
variable k takes for the individual j. For symbolic variables we denote each variable by Yk
2.1. SYMBOLIC VARIABLES
15
and the “value” that the variable Yk takes for an unit j by Yk (j) = ξjk . When we have only
one variable Y we will denote by Y (j) = ξj , the value that the variable takes on the unit j.
The symbolic data table that organizes the records of the p symbolic variables for m
units is a matrix with m rows and p columns, as illustrated in Table 2.1.
Table 2.1: Symbolic data table.
Y1
Y2
...
Yk
...
Yp
1
ξ11
ξ12
...
ξ1k
...
ξ1p
2
..
.
ξ21
ξ22
...
ξ2k
...
ξ2p
j
..
.
ξj1
ξj2
...
ξjk
...
ξjp
m
ξm1
ξm2
...
ξmk
...
ξmp
In symbolic data tables each cell contains “symbolic values”. To each row of the table
corresponds an unit (individual or class of individuals) that contains its symbolic description,
and each column corresponds to a symbolic variable.
Definition 2.2. Consider the p symbolic variables Y1 , Y2 , . . . , Yp , and let Bk , with
k ∈ {1, 2, . . . , p} be the sets where each variable Yk takes its ”symbolic value” for each unit
j. A symbolic description of an unit j
dj = (ξj1 , . . . , ξjp ) , with j ∈ {1, 2, . . . , m} .
∈ E is given by the description vector
Similarly to the case of classical variables, symbolic variables may also be quantitative
or qualitative. Inside of these two categories, the classification of the variables is performed
according to the kind of elements in B. The classification of symbolic variables is as follows
(Bock and Diday (2000), Billard and Diday (2006)):
• Single-valued variables - when B = Y we are in the presence of a particular case
of the symbolic multi-valued variables, that are the classical variables.
– If Y ⊆ R we have a single-valued quantitative variable.
– If Y is a set of categories, the variable Y is classified as single-valued categorical variable.
16
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
• Multi-valued variables - when B = P(Y) = {D : D ⊆ Y, D =
6 ∅} is one set of
non-empty subsets of Y.
– If Y ⊆ R and the “values” of Y (j) are finite sets of real numbers we have a
multi-valued quantitative variable.
– If the “values” of Y (j) are finite sets of categorical values in Y the symbolic
variable Y is classified as multi-valued categorical variable.
• Interval-valued variable - when B is a set of intervals of values in Y ⊆ R. In this
case the “values” of Y (j) are intervals of real numbers.
• Modal-valued variable - when B is a set of distributions on Y. A particular outcome
in modal-valued variables takes the form
Y (j) = {ηji , pji ; i ∈ {1, . . . , nj }}
where for each unit j, pji is a non-negative measure (weight, probability, relative
frequency) associated with ηji that take values in Y and nj is the number of ηji taken
by Y (j); ηji may be finite or infinite in number and categorical or quantitative in value.
– If Y ⊆ R or Y is a set of categories and ηj1 ; . . . ; ηjnj
⊆ Y, Y is a modal-
valued quantitative variable or modal-valued categorical variable.
– If, for each unit j, the “values” ηji with i ∈ {1, . . . , nj } are ordered and disjoint
intervals of values in Y ⊆ R and
histogram-valued variable.
nj
P
pji = 1, the symbolic variable Y is a
i=1
Example 2.1. The symbolic variables in Tables 1.1 and 1.2 in Chapter 1, may be classified
as follows:
• Gender and Education are modal-valued categorical variables;
• Age is an interval-valued variable;
• Number of emergency consults is a multi-valued quantitative variable;
• Waiting time for a consult is a histogram-valued variable;
• Emergency consult is a single-valued categorical variable.
2.1. SYMBOLIC VARIABLES
17
A symbolic description of healthcare center A is given by the description vector
dA =
n
{F, 1/2; M, 1/2} , [25, 83] , {9th grade, 1/2; Higher education, 1/2} ,
o
{1, 2, 3} , {[15, 30[ , 0.1; [30, 45[ , 0.6; [45, 90] , 0.3} , N o .
2.1.2. Histogram and Interval-valued variables
In this work, we will be dealing with a particular type of modal-valued variables, the
histogram-valued variables. We will also deal with interval-valued variables that, under
certain conditions, are a particular case of the histogram-valued variables.
Definition 2.3. Consider a symbolic variable Y : E → B. The set of units E may be
E = Ω = {1, 2, . . . , m} when the individuals are first-level units or E = {C1 , C2 , . . . , Cn }
with Cj ⊆ Ω when the individuals are higher-level units. Consider also the quantitative
(single-value) variable Ẏ defined on Ω. If the aggregation of the observations is temporal,
to each unit j ∈ Ω corresponds the empirical distribution of the values or range of values
that Ẏ takes within a certain period of time. If the aggregation is contemporary, to each unit
j corresponds the empirical distribution or interval of values of Ẏ in Cj .
The outcome associated with an unit j may take the form:
n
Y (j) = IY (j)1 , pj1 ; IY (j)2 , pj2 ; . . . ; IY (j)nj , pjnj
o
where IY (j)i , represents the subinterval i for the unit j; pji is the weight associated with the
subinterval IY (j)i and
nj
P
pji = 1 with nj the number of subintervals for the j th unit.
i=1
Although interval-valued variables (Bock and Diday (2000), Billard and Diday (2006))
are not originally define as a particular case of histogram-valued variables when, to each
unit j, corresponds only one interval with associated weight equal to one, the histogramvalued variable is then reduce to the particular case of an interval-valued variable. In this
case we have, Y (j) = IY (j) .
In Symbolic Data Analysis, it is generally assumed that the values in the intervals or
subintervals of the histograms are uniformly distributed. When this distribution is assumed
within the intervals, the elements intervals may be defined only by two points, that may also
be the bounds or center and half range. Nonetheless, other distributions may be assumed
as well, namely for the case of interval data.
18
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
Note 2.2. In this work, we will assume the hypotheses usually considered in Symbolic Data
Analysis (Billard and Diday, 2003). As such, we assume the Uniform distribution within
the (sub)intervals that are the “observations values” of the interval-valued variables or that
define the empirical distributions of the “observations values” associated with the histogramvalued variables.
Next, we will present the possible representations of an empirical distribution. However,
first it is important to define some notations.
Notation 2. Consider a histogram-valued variable Y where to each unit j corresponds a
empirical distribution Y (j). So, let be nj the number of subintervals for the j th unit, i.e.,
the number of subintervals IY (j)i , with i ∈ {1, 2, . . . , nj } . The frequency/weight associated
with the subinterval IY (j)i is represented by pji .
The subinterval IY (j)i may be represented by the bounds or by the centers and half
ranges of the interval. So, we have:
• IY (j)i = I Y (j)i , I Y (j)i with I Y (j)i the lower bound and I Y (j)i the upper bound of the
interval;
• IY (j)i = [cY (j)i − rY (j)i , cY (j)i + rY (j)i [ with cY (j)i =
rY (j)i =
I Y (j)i −I Y (j)
2
i
I Y (j)i +I Y (j)
2
i
the center and
the half range of the interval.
For each unit j, I Y (j)i ≤ I Y (j)i and I Y (j)i ≤ I Y (j)i+1 .
When we have p histogram-valued variables, we represent by Yk (j) the distribution
associated with variable Yk , k ∈ {1, 2, . . . , p} for unit j, and by IYk (j)i the subinterval i
associated with the distribution Yk (j) for the unit j.
The empirical distribution Y (j) may be represented in different ways.
1. Histogram - Histograms are an usual representation of empirical distributions. The
outcome associated with each unit j of a histogram-valued variable, may take the
form:
HY (j) =
or,
HY (j) =
n
n
h
i
I Y (j)1 , I Y (j)1 , pj1 ; . . . ; I Y (j)nj , I Y (j)nj , pjnj
o
[cY (j)1 − rY (j)1 , cY (j)1 + rY (j)1 [ , pj1 ; . . . ;
o
. . . ; cY (j)nj − rY (j)nj , cY (j)nj + rY (j)nj , pjnj
(2.1)
(2.2)
2.1. SYMBOLIC VARIABLES
19
where I Y (j)i ≤ I Y (j)i ; I Y (j)i ≤ I Y (j)i+1 and
nj
P
pij = 1.
i=1
2. Cumulative distribution function - As usual, the empirical distribution Y (j) may be
represented by a cumulative distribution function, ΨY (j) (z). As in this situation, we
assume a Uniform distribution within subintervals, the cumulative distribution function
is given by



0





z−I Y (j)

1


pj1

I
−I
Y

 (j)1 Y (j)1
z−I Y (j)
2
ΨY (j) (z) =
pj2
p
+
j1
I

Y (j)2 −I Y (j)2



..


.






 1
if
z ≤ I Y (j)1
if
I Y (j)1 ≤ z < I Y (j)1
if
I Y (j)2 ≤ z < I Y (j)2
if
z ≥ I Y (j)nj
(2.3)
3. Quantile function (Irpino and Verde (2006)) - In the latest studies with histogramvalued variables, the “observation values” associated with each unit are represented
by the inverse of the cumulative empirical distribution function, Ψ−1
Y (j) (t) with t ∈ [0, 1],
also called quantile function. As before, the Uniform distribution is assumed for the
values within the subintervals.
Ψ−1
Y (j) (t) =
or,




I Y (j)1 +






 I Y (j) +
2
..
.










t−wjnj −1
1−wjnj −1
cY (j)2 +











 0
..
.
cY (j)nj +
i
P


pjh
I Y (j)2 − I Y (j)2
wj2 −wj1
2(t−w
I Y (j)nj − I Y (j)nj


2t

+
−
1
rY (j)1
c

Y
(j)
1
w

j1




2(t−wj1 )

h=1
in Y (j).
I Y (j)1 − I Y (j)1
t−wj1
wj2 −wj1
I Y (j)nj +
Ψ−1
Y (j) (t) =
where wji =
t
wj1
− 1 rY (j)2
jnj −1 )
1−wjnj −1
if
i=0
if
i = 1, . . . , nj
− 1 rY (j)nj
if
0 ≤ t < wj1
if
wj1 ≤ t < wj2
if
(2.4)
wjnj −1 ≤ t ≤ 1
if
0 ≤ t < wj1
if
wj1 ≤ t < wj2
if
wjnj −1 ≤ t ≤ 1
(2.5)
and nj is the number of subintervals
20
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
If any of the weights pji with i > 1 is null, the function ΨY (j) (z) will not have inverse with
domain [0, 1]. Consequently, the function Ψ−1
Y (j) (t) is not continuous and has nj − 1 pieces.
In this case it is not possible to calculate the value of Ψ−1
Y (j) (wji−1 ) but only
lim Ψ−1
Y (j) (t)
−
t→wji−1
Ψ−1
and lim
Y (j) (t).
+
t→wji−1
Note 2.3. We will use the term distribution to refer to the elements that are the observations
of a histogram-valued variable. When using the term distribution, we are referring to an
empirical distribution of a continuous variable, and it is assumed that within each subinterval
I Y (j)i , I Y (j)i the values for the variable Y for each unit j ∈ {1, . . . , m}, are uniformly
distributed. Each of these distributions may be represented by histograms, cumulative
distribution functions or quantile functions.
When nj = 1 for each unit j, to Y (j) corresponds the interval I Y (j) , I Y (j) , with
frequency pj = 1. The histogram-valued variable is then reduced to the particular case
of an interval-valued variable. In this case, the values in each interval are assumed to
be uniformly distributed. However, if the interval-valued variables are not considered a
particular case of the histogram-valued variables, the distribution within the intervals is not
necessarily Uniform.
In this work, the interval-valued variables are considered a particular case of the histogram-valued variables. Thereby, to each unit j corresponds one “ symbolic value” (uniformly
distributed range of the values) that may be represented by an interval IY (j) , a cumulative
distribution function or the quantile function Ψ−1
Y (j) , as follows:
1. Interval - This is the most used approach in Symbolic Data Analysis studies with
interval-valued variables (e.g. Bock and Diday (2000), Billard and Diday (2003)). Considering the intervals IY (j) defined from the bounds of the interval, the representation
of IY (j) may be given by
IY (j) = I Y (j) , I Y (j) .
(2.6)
It is also possible to represent the range of values Y (j) considering its center cY (j)
and half range rY (j) . In this case,
IY (j) = [cY (j) − rY (j) ; cY (j) + rY (j) ] .
(2.7)
2.1. SYMBOLIC VARIABLES
21
2. Cumulative distribution function - The range of values Y (j) may be represented by a
cumulative distribution function, ΨY (j) (z), as follows:
ΨY (j) (z) =
z − I Y (j)
I Y (j) − I Y (j)
,
z ∈ I Y (j) , I Y (j) .
(2.8)
3. Quantile Function - Y (j) may be represented by a linear function with domain [0, 1].
This function, named quantile function, is the inverse of the cumulative distribution
function and is represented by Ψ−1
Y (j) (t) with t ∈ [0, 1] . As for the case of histogram
values, the quantile function may be defined in terms of the bounds or in terms of the
center and half range of the interval:
or,
Ψ−1
Y (j) (t) = I Y (j) + I Y (j) − I Y (j) t,
Ψ−1
Y (j) (t) = cY (j) + rY (j) (2t − 1),
0≤t≤1
0 ≤ t ≤ 1.
(2.9)
(2.10)
The representation of intervals by quantile functions is also well established. In the
work of Bertoluzza et al. (1995), the authors named it “parametrization of the interval”.
In Appendix B we compare the graphical behavior of two quantile functions as relates
the two corresponding intervals.
4. Canonical decomposition - In the work of Blanco-Fernández (2009) and
Blanco-Fernández et al. (2011) about imprecise data, the interval Y (j) is represented
by a Canonical decomposition as follows:
Y (j) = cY (j) [1 ± 0] + rY (j) [0 ± 1].
(2.11)
In Examples 2.2 and 2.3 the possible representations of the “symbolic observations”
associated with histogram-valued variables and interval-valued variables, will be illustrated.
22
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
Example 2.2. Consider the histogram-valued variable Y5 , “Waiting time for a consult” in Table 1.2. The histogram values associated with the three healthcare centers are represented
in Figure 2.1.
0.9
0.8
0.6
0.6
0.4
0.3
0.2
0
15
30
45
90
0
(a) Representation of the HY5 (A) .
15
45
(b) Representation of the HY5 (B) .
0.9
0.6
0.3
0
15
30
60
(c) Representation of the HY5 (C) .
Figure 2.1: Representation of the histograms associated with each healthcare center for the histogram-valued
variable Y5 in Table 1.2.
Alternatively, these histograms may be represented by their quantile functions (see
Figure 2.2):


2t


22.5 + 0.1
− 1 × 7.5



−1
2(t−0.1)
ΨY5 (A) (t) =
37.5
+
−
1
× 7.5
0.6






− 1 × 22.5
67.5 + 2(t−0.7)
0.3


 7.5 + 2t − 1 × 7.5
0.8
Ψ−1
Y5 (B) (t) =

 30 + 2(t−0.8) − 1 × 15
0.2


 7.5 + 2t − 1 × 7.5
0.6
−1
ΨY5 (C) (t) =

 45 + 2(t−0.6) − 1 × 15
0.4
if
0 ≤ t < 0.1
if
0.1 ≤ t < 0.7
if
0.7 ≤ t ≤ 1
if
0 ≤ t < 0.8
if
0.8 ≤ t ≤ 1
if
0 ≤ t < 0.6
if
0.6 < t ≤ 1
2.1. SYMBOLIC VARIABLES
−1
Y (A)
Ψ
5
−1
Y (B)
Ψ
23
−1
Y (C)
Ψ
5
5
80
60
40
20
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
−1
−1
Figure 2.2: Representation of the quantile functions Ψ−1
Y5 (A) , ΨY5 (B) , ΨY5 (C) in Table 1.2.
For the healthcare center C, the quantile function Ψ−1
Y5 (C) (t) is an example of a non
continuous quantile function.
It is important to bear in mind that in a histogram the lower bound of each subinterval
is always less than or equal to the upper bound, I Y (j)i ≤ I Y (j)i and the upper bound
of the following subinterval is always greater or equal to the previous, I Y (j)i ≤ I Y (j)i+1 .
Consequently, the quantile function that represents the empirical distribution is always a
non-decreasing function defined in [0, 1].
When we work with histogram-valued variables, it is important to note that for different
observations, the number of subintervals in the histograms or the number of pieces in
quantile functions may be different. Additionally the subintervals in histogram values must
be ordered and disjoint; if these last conditions do not occur, it is possible to rewrite them in
the required form (Williamson (1989), Arroyo (2008) (see Appendix A)).
Example 2.3. Consider again the interval-valued variable Y2 , “Age” in Table 1.1. The observed values of this interval-valued variable Y2 for Healthcare center A, may be represented
as follows:
IY2 (A) = [25; 83]
with 0 ≤ t ≤ 1
Ψ−1
Y2 (A) = 25 + 58t
Y2 (A) = 54[1 ± 0] + 29[0 ± 1]
The quantile functions that represent the intervals of the ages associated with each healthcare center are represented in Figure 2.3.
24
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
−1
Y (A)
Ψ
80
−1
Y (B)
Ψ
2
−1
Y (C)
Ψ
2
2
60
40
20
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
−1
−1
Figure 2.3: Representation of the quantile functions Ψ−1
Y2 (A) ; ΨY2 (B) ; ΨY2 (C) in Table 1.1.
As for the case subintervals of the histograms, in each interval the lower bound is always
less than or equal to the upper bound, I Y (j) ≤ I Y (j) , consequently the quantile function that
represent an interval is always a non-decreasing function. This behavior may be observed
in Example 2.3.
2.2. Arithmetics
In classical statistics the data are real numbers and the standard arithmetic is used to
operate with them. In Symbolic Data Analysis many concepts and methods have so far
been developed under the symbolic-numerical-numerical category. In this category there is
no have the need to operate with ranges of values or distributions. As the goal in this study,
is to develop methods under the symbolic-symbolic-symbolic category, it is necessary to
work with intervals and distributions. For this reason, it is important to study the possibilities
to operate with them.
2.2.1. Interval Arithmetic
Moore (1966) and afterwards Case (1999) studied the interval arithmetic. The operations between interval elements were defined as follows:
Definition 2.4. Consider the intervals IX = [I X , I X ], IY = [I Y , I Y ]. The operations with
∈ {+, −, ×, ÷} between the intervals IX and IY produce the interval IX IY , that
is the set that contains the result of all operations a b for which a ∈ IX and b ∈ IY .
2.2. ARITHMETICS
25
When = ÷, it is assumed that 0 6∈ IY . Shortly, the operations between intervals, may be
obtained by:
I (XY ) = min I X I Y , I X I Y , I X I Y , I X I Y
I (XY ) = max I X I Y , I X I Y , I X I Y , I X I Y .
Example 2.4. Consider the intervals IX = [1, 3] and IY = [−1, 2] .
Addition:
IX + IY = [1, 3] + [−1, 2] = [0, 5] (see Figure 2.4).
IX +2 = [1, 3]+[2, 2] = [3, 5] - The addition of an interval with a real number influences
the center of the interval, but not its range.
Figure 2.4: Representation of interval IX + IY in Example 2.4.
Division:
IX ÷IX = [1, 3]÷[1, 3] =
1
3
, 3 - The division between two equal intervals is an interval
where the bounds are inverse real values.
Multiplication:
IX × IY = [1, 3] × [−1, 2] = [−3, 6] .
2 × IX = [2, 6] - If we multiply an interval by a positive number α we obtain a new
interval where the center and half range increase α times (see Figure 2.5(a)).
−2 × IX = [−6, −2] ; −1 × IX = [−3, −1] - If we multiply an interval IX by a negative
number α, we obtain a new interval where the center and half range increase |α| times
and the center of the new interval αIX is the symmetric of the center of the interval |α|IX .
In particular case when we multiply the interval by −1, we obtain one symmetric interval
relatively to the yy -axis (see Figure 2.5(b)).
26
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
(a) Intervals 2IX and −2IX .
(b) Intervals IX and −IX .
Figure 2.5: Representation of intervals IX , 2IX and respective symmetric intervals −IX , −2IX in Example 2.4.
2.2.2. Histogram Arithmetic
Colombo and Jaarsma (1980) generalized Interval Arithmetic to operate with histograms,
a subject that was further studied by Williamson (1989). This arithmetic is defined as
follows:
Definition 2.5. Consider the histogram HX with n1 subintervals,
HX = [I X1 , I X1 [, pHX1 ; [I X2 , I X2 [, pHX2 ; . . . , [I Xn1 , I Xn1 ], pHXn1
and the histogram HY with n2 subintervals,
HY = [I Y1 , I Y1 [, pHY1 ; [I Y2 , I Y2 [, pHY2 ; . . . , [I Yn2 , I Yn2 ], pHYn2 .
The operations with ∈ {+, −, ×, ÷} between the histograms HX and HY (when
= ÷ it is assumed that 0 6∈ HY ), produce the histogram HX HY with n = n1 × n2
subintervals and are defined by
n
o
I (HX HY )(j1 −1)n2 +j2 = min I HXj I HYj , I HXj1 I HYj , I HXj I HYj2 , I HXj1 I HYj2
1
n
2
2
1
I (HX HY )(j1 −1)n2 +j2 = max I HXj I HYj , I HXj1 I HYj , I HXj I HYj2 , I HXj1 I HYj2
1
2
2
1
p(HX HY )(j1 −1)n2 +j2 = pHXj1 × pHYj2 .
o
with j1 ∈ {1, . . . , n1 } and j2 ∈ {1, . . . , n2 } .
When operating with histograms with n1 and n2 subintervals, we obtain a new histogram
with n1 × n2 subintervals. This new histogram may have a larger number of subintervals,
is not necessary ordered and the subintervals can intersect each other. However, the
histograms may be organized to overcome these situations (see Appendix A).
2.2. ARITHMETICS
27
We may particularize Definition 2.5, to define operations between histograms and real
values. Consider a real number α, which may be represented by {[α, α] , 1}. The addition
between the histogram HX and the real number α results in the following histogram:
HX + α =
nh
h
h
i
o
I HX1 + α, I HX1 + α , p1 ; . . . ; I HXn + α, I HXn1 + α , pn1 .
1
In this case the histogram is still ordered, disjointed and the range of the subintervals of
the histograms HX and HX + α is the same, for all α ∈ R. If α is a positive number, the
histogram HX + α results of the translation of α units of the histogram HX to the right; if
α is a negative number, the histogram HX + α results of the translation of α units of the
histogram HX to the left.
When we multiply one histogram HX by a positive real number α, the resulting histogram is:
αHX =
nh
h
h
h
nh
i
αI HX1 , αI HX1 , p1 ; αI HX2 , αI HX2 , p2 ; . . . ; αI HXn , αI HXn1 , pn1
but if α is a negative real number, we obtain the histogram
αHX =
h
αI HXn1 , αI HXn
1
h
1
o
h
h
h
i
o
, pn1 ; . . . ; αI HX2 , αI HX2 , p2 ; αI HX1 , αI HX1 , p1 .
The result of multiplying a histogram by a real number α is a new ordered and disjoint
histogram. The histogram αHX with α negative and the histogram αHX , with α positive,
are symmetric in relation to the yy -axis.
Example 2.5. Consider the histograms
HX =
and
[1, 3[ , 0.1; [3, 5[ , 0.6; [5, 8] , 0.3
HY =
[0, 1[ , 0.8; [1, 4] , 0.2 .
These histograms will be used to exemplify some operations defined in Definition 2.5.
Addition:
The addition of the histograms HX and HY is not an ordered and disjoint histogram but
it may be rewritten as (see Figure 2.6):
HX + HY
=
[1, 2[ , 0.0267; [2, 3[ , 0.0307; [3, 4[ , 0.1907; [4, 5[ , 0.1880; [5, 6[ , 0.2480;
[6, 7[ , 0.0980; [7, 9[ , 0.1880; [9, 12] , 0.0300 .
28
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
Figure 2.6: Representation of histograms HX , HY and HX + HY in Example 2.5.
The addition of the histogram HX with the real numbers 2 and −2 are the histograms
(see Figure 2.7):
HX + 2 =
HX − 2 =
[3, 5[ , 0.1; [5, 7[ , 0.6; [7, 10] , 0.3
[−1, 1[ , 0.1; [1, 3[ , 0.6; [3, 6] , 0.3
Figure 2.7: Representation of histograms HX , HX + 2 and HX − 2 in Example 2.5.
Multiplication:
The product of the histogram HX by the real number 2 and the symmetric of the
histogram HX , −HX , are the histograms given below and represented in Figure 2.8.
2HX =
−HX =
[2, 6[ , 0.1; [6, 10[ , 0.6; [10, 16] , 0.3
[−8, −5[ , 0.3; [−5, −3[ , 0.6; [−3, −1] , 0.1
2.2. ARITHMETICS
29
Figure 2.8: Representation of histograms HX , 2HX and −HX in Example 2.5.
Considering the previous remarks about the results that we obtain when we multiply
a histogram or interval by a negative number, we may define symmetric histogram and
symmetric interval as follows:
Definition 2.6. Consider the element histogram or interval E. Using the histogram/ interval
arithmetic, if we multiply E by the real number −1, we obtain its symmetric element −E.
The elements E and −E are symmetric relatively to the yy -axis.
When these arithmetics are applied between intervals or between histograms, the result
of the operations might not be as expected. For example, the addition of the element E
with its symmetric, −E, is not null. However, the symbolic element E − E is a symmetric
symbolic element and the axis of symmetry of this symbolic element is the yy -axis.
Example 2.6. Consider the interval IX = [1, 3] and the histogram
HX =
[1, 3[ , 0.1; [3, 5[ , 0.6; [5, 8] , 0.3
already considered in Examples 2.4 and 2.5, respectively.
• Intervals - The difference between two equal intervals is not the null interval but one
interval with symmetric bounds (or null center).
IX − IX = [1, 3] − [1, 3] = [−2, 2]
30
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
• Histograms - Using the histogram arithmetic, the addition of the histogram HX with
the histogram −HX is the histogram:
HX − HX =
[−7, −5[ , 0.0120; [−5, −4[ , 0.0420; [−4, −3[ , 0.0570;
[−3, −2[ , 0.0720; [−2, 0[ , 0.3170; [0, 2[ , 0.3170; [2, 3[ , 0.0720; [3, 4[ , 0.0570;
[4, 5[ , 0.0420; [5, 7] , 0.0120 .
This histogram is a symmetric histogram relatively to the yy -axis. Its representation
may be observed in Figure 2.9.
0.3
0.2
0.1
−7
−5
−4
−3
−2
0
2
3
4
5
7
Figure 2.9: Representation of histogram HX − HX in Example 2.6.
When we work with histogram or interval data represented by histograms or intervals,
the arithmetics proposed above are not in general a good option. In fact, the representation fails to take into consideration the distribution of the values within the intervals or
subintervals of the histograms. The arithmetic is complex and we obtain results that are not
expected. For example, the mean of two equal histograms H is not the histogram H.
In more recent studies, where the goal is to work in the symbolic-symbolic-symbolic
category, the “values” observed for histogram and interval-valued variables have been
represented as quantile functions (Arroyo and Maté (2009), Irpino and Verde (2006); Verde
and Irpino (2007, 2008)). In Symbolic Data Analysis, the representation of the intervals
by quantile functions has not been used until now. However, whether the intervals are a
particular case of the histograms or not, it is always possible to represent them by quantile
functions.
2.2. ARITHMETICS
31
If we represent the empirical distribution that each unit takes on a histogram-valued
variable by a quantile function, then the operations are simplified because, as quantile
functions are piecewise functions, the adequate arithmetic for them is a function arithmetic.
For the particular case of the intervals (where the values are uniformly distributed), the
representation by a continuous linear function with domain [0,1] will be a simple representation. However, this representation raises, in general, other questions.
2.2.3. Operations with Quantile Functions
To operate with quantile functions, it is necessary to define all involved functions with an
equal number of pieces. Furthermore, the domain of each piece has to be the same for all
functions. In other words, it is necessary to rewrite all corresponding histograms with the
same number of subintervals and the weight associated with each subinterval has to be
the same in all histograms but not in all subintervals of each histogram i.e., the histograms
are not necessarily equiprobable histograms. To represent the distributions as quantile
functions or histograms in these conditions, it may be necessary to apply the procedure
defined by Irpino and Verde (2006), as described next.
2.2.3.1. Quantile Functions defined with the same number of pieces
Consider the empirical distributions Y (j) with j ∈ {1, 2, . . . , m} , each of them repre-
sented by a quantile function with nj pieces, as in Expression (2.4).
In order to compute operations between the m distributions, we need to identify a set of
uniformly dense intervals to compare.
Let W be the set of the cumulative weights of the m distributions:
W = {w10 , . . . , w1n1 ; w20 , . . . , w2n2 ; . . . ; wm0 , . . . , wmnm }
To rewrite the histograms/quantile functions, we need to sort W without repetitions. The
sorted values may be represented by
Z = {w0 , w1 , . . . , wi , . . . , wn }
where i ∈ {0, . . . , n} ; w0 = 0, wn = 1 and max{n1 , . . . , nm } ≤ n ≤
m
P
j=1
nj − 1.
The intervals [wi−1 , wi [ with i ∈ {1, . . . , n} allow identifying the uniformly dense subin-
tervals for all involved distributions. Each distribution Y (j) may be rewritten as a histogram
32
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
−1
where the subintervals have as lower bound Ψ−1
Y (j) (wi−1 ) and as upper bound ΨY (j) (wi );
so, we may rewrite each Y (j) as
Ψ−1
Y (j) (t) =
or




Ψ−1

Y (j) (w0 ) +





 Ψ−1
Y (j) (w1 ) +










HY (j) =
..
.
t
w1
−1
Y (j)
−1
Y (j)
Ψ (w1 ) − Ψ (w0 )
−1
−1
t−w1
Ψ
(w
)
−
Ψ
(w
)
2
1
Y (j)
Y (j)
w2 −w1
Ψ−1
Y (j) (wn−1 ) +
nh
t−wn−1
1−wn−1
−1
Ψ−1
Y (j) (wn ) − ΨY (j) (wn−1 )
if
0 ≤ t < w1
if
w 1 ≤ t < w2
if
wn−1 ≤ t ≤ 1
h
h
h
−1
−1
−1
Ψ−1
(0),
Ψ
(w
)
,
p
;
Ψ
(w
),
Ψ
(w
)
, p2 ; . . . ;
1
1
1
2
Y (j)
Y (j)
Y (j)
Y (j)
h
i
o
−1
. . . ; Ψ−1
Y (j) (wn−1 ), ΨY (j) (1) , pn
(2.12)
(2.13)
with pi = wi − wi−1 and i ∈ {1, . . . , n} .
Example 2.7 illustrates the process describe above.
Example 2.7. Consider the histograms HX and HY in Example 2.5. These histograms
may be represented by their quantile functions (see Figure 2.10):




1+



Ψ−1
3+
X (t) =





 5+
−1
Y
Ψ (t) =



t
0.1
×2
if
0 ≤ t < 0.1
t−0.1
0.6
×2
if
0.1 ≤ t < 0.7
t−0.7
0.3
×3
if
0.7 ≤ t ≤ 1
t
0.8

 1+
t−0.8
0.2
×3
if
0 ≤ t < 0.8
if
0.8 ≤ t ≤ 1
The histograms HX and HY may be rewritten with the same number of pieces using the
described process of Irpino and Verde (2006).
2.2. ARITHMETICS
−1
X
Ψ (t)
33
−1
Y
Ψ (t)
6
4
2
0
0.2
0.4
0.6
0.8
1
−1
Figure 2.10: Representation of the quantile functions Ψ−1
X and ΨY in Example 2.7.
The set of the cumulative weights is W = {0, 0.1, 0.7, 1; 0, 0.8, 1} . Selecting the weights
without repetition, we have Z = {0, 0.1, 0.7, 0.8, 1} . Using these weights, we can rewrite
−1
the quantile functions Ψ−1
X (t) and ΨY (t) as follows:
−1
X
Ψ (t) =
Ψ−1
Y (t) =




1+






 3+



5+






 6+











t
0.1
1
8
t
0.1
t−0.1
0.6
×2
t−0.7
0.1
t−0.8
0.2
×
+
×2
×2
1
8
t−0.1
0.6
×
6
8

7


+ t−0.7
× 18

8
0.1





 1 + t−0.8 × 3
0.2
if
0 ≤ t < 0.1
if
0.1 ≤ t < 0.7
if
0.7 ≤ t < 0.8
if
0.8 ≤ t ≤ 1
if
0 ≤ t < 0.1
if
0.1 ≤ t < 0.7
if
0.7 ≤ t < 0.8
if
0.8 ≤ t ≤ 1
It is important to avoid that the number of subintervals for each histogram becomes
“too” large (which could happen by applying the process refereed), in which case the
distributions that represent the data would be meaningless. Obviously, when we use the
quantile functions to represent intervals, the previous situations and respective problems
will not occur. To prevent the situation mentioned above and when the microdata are known,
we may consider the option of Colombo and Jaarsma (1980), that encountered similar
problems when operating with histograms and have considered advantageous to work with
34
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
equiprobable histograms (histograms with equal probability subintervals). In their study,
Colombo and Jaarsma (1980) refer that the use of equal probability subintervals offers
many advantages: the distributions are reasonably well approximated by equiprobable
histograms, the subintervals into which a distribution is subdivided are small when the
frequency is high and large when the frequency is low; operations/combinations of equal
frequency subintervals form again equal frequency subintervals.
Note 2.4. In this work, when we refer that the distributions are represented by histograms
with the same number of subintervals, we also consider that the weight associated with
each subinterval i, of all histograms, is the same (but not necessarily the same in all
subintervals of each histogram). Analogously, when we refer that the distributions are
represented by quantile functions with the same number of pieces, the domain of each
piece has to be the same for all functions (the set of cumulative weights is the same in all
functions). So, in general, we will consider that the quantile functions (histograms) have n
pieces (subintervals), for a given n.
2.2.3.2. Operations
When we have distributions or ranges of values represented by quantile functions, the
operations between the elements are the usual operations with functions. When we operate
with distributions with an equal number of pieces, the operations are simplified. Next, we
will explore some operations with quantile functions that will be important for this work.
−1
Consider the quantile functions Ψ−1
X (t) and ΨY (t), that represent two distributions,
defined according to Expression (2.4), both with fixed n subintervals. The addition of these
quantile functions leads to the function:
−1
X
−1
Y
Ψ (t) + Ψ (t) =




I X 1 + I Y1 +






 IX + IY +
2
2










..
.
I X n + I Yn +
t
w1
(2rX1 + 2rY1 )
t−w1
w2 −w1
(2rX2 + 2rY2 )
t−wn−1
1−wn−1
(2rXn + 2rYn )
if
0 ≤ t < w1
if
w 1 ≤ t < w2
if
wn−1 ≤ t ≤ 1
.
When we add two quantile functions we obtain a non-decreasing function. In this case
both the slopes and the y -intercepts of the resulting function are influenced by the two
functions.
2.2. ARITHMETICS
35
The particular case of the addition of a quantile function Ψ−1
X (t) with a real number α is
the function:
(Ψ
−1
X
+ α) (t) =




I X1 + α +






 IX + α +
2










..
.
I Xn + α +
t
w1
(2rX1 )
t−w1
w2 −w1
(2rX2 )
t−wn−1
1−wn−1
(2rXn )
if
0 ≤ t < w1
if
w1 ≤ t < w 2
if
wn−1 ≤ t ≤ 1
.
In this case, only the y -intercepts are affected by the operation. We have a translation up
when adding a real positive number α and a translation down when the real number α is
negative.
The multiplication of the quantile function Ψ−1
X (t) by a real number α leads to the
function:
−1
X
αΨ (t) =




αI X1 +







 αI X +
2











t
w1
(2αrX1 )
t−w1
w2 −w1
(2αrX2 )
if
0 ≤ t < w1
if
w1 ≤ t < w 2
if
wn−1 ≤ t ≤ 1
..
.
αI Xn +
t−wn−1
1−wn−1
(2αrXn )
.
In this case, both the slopes and the y -intercepts are affected by α. If α is positive we will
have a non-decreasing function but if α is negative we will obtain a decreasing function
that cannot be a quantile function, since quantile functions must always be non-decreasing
functions. The following example illustrates the previous situations.
Example 2.8. Consider the distribution represented by the quantile function Ψ−1
X (t) and
Ψ−1
Y (t) presented in Example 2.7 and rewritten with the same number of subintervals. The
−1
resulting function Ψ−1
X + ΨY is a non-decreasing function defined as follows:
−1
X
−1
Y
Ψ (t) + Ψ (t) =



t

1 + 0.1
× 178






 25 + t−0.1 ×
8
0.6
22
8

47


+ t−0.7
× 98

8
0.1





 7 + t−0.8 × 5
0.2
if
0 ≤ t < 0.1
if
0.1 ≤ t < 0.7
if
0.7 ≤ t < 0.8
if
0.8 ≤ t ≤ 1
−1
The representation of the functions Ψ−1
and Ψ−1
+ Ψ−1
is illustrated in
X , ΨY
X
Y
Figure 2.11.
36
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
12
−1
ΨX (t)
−1
ΨY (t)
−1
−1
ΨX (t)+ΨY (t)
9
6
3
0
0.2
0.4
0.6
0.8
1
−1
−1
−1
Figure 2.11: Representation of the quantile functions Ψ−1
X , ΨY , ΨX +ΨY in Example 2.8.
When we add the quantile function Ψ−1
X with a positive real number, e.g. 2 and with a
negative real number, e.g. −2, we obtain non-decreasing functions. In these cases, we
obtain (see Figure 2.12):


t


×2
3 + 0.1



Ψ−1
×2
5 + t−0.1
X (t) + 2 =
0.6





 7 + t−0.7 × 3
0.3


t


−1 + 0.1
×2



Ψ−1
×2
1 + t−0.1
X (t) − 2 =
0.6





 3 + t−0.7 × 3
0.3
Ψ−1(t)
8
X
Ψ−1(t)+2
X
if
0 ≤ t < 0.1
if
0.1 ≤ t < 0.7
if
0.7 ≤ t ≤ 1
if
0 ≤ t < 0.1
if
0.1 ≤ t < 0.7
if
0.7 ≤ t ≤ 1
Ψ−1(t)−2
X
6
4
2
0
0.2
0.4
0.6
0.8
1
−1
−1
Figure 2.12: Representation of the quantile functions Ψ−1
X , ΨX + 2, ΨX − 2 in Example 2.8.
2.2. ARITHMETICS
37
If we multiply the quantile function Ψ−1
X (t) by the positive real number 2, we obtain a
non-decreasing function but if we multiply the quantile function Ψ−1
X (t) by the negative real
number −1 the resulting function is not a non-decreasing function. The following functions
and representations in Figure 2.13 illustrate this situation:


t


2 + 0.1
×4



2Ψ−1
6 + t−0.1
×4
X (t) =
0.6





 10 + t−0.7 × 6
0.3




−1 +



−Ψ−1
−3 +
X (t) =





 −5 +
15
−1
X
Ψ (t)
−1
X
2Ψ (t)
t
0.1
× (−2)
if
0 ≤ t < 0.1
if
0.1 ≤ t < 0.7
if
0.7 ≤ t ≤ 1
if
0 ≤ t < 0.1
t−0.1
0.6
× (−2)
if
0.1 ≤ t < 0.7
t−0.7
0.3
× (−3)
if
0.7 ≤ t ≤ 1
−1
X
−Ψ (t)
10
5
0
−5
0.2
0.4
0.6
0.8
1
−1
−1
Figure 2.13: Representation of the functions Ψ−1
X (t), 2ΨX (t), −ΨX (t) in Example 2.8.
2.2.3.3. The space of quantile functions
Because in this work we will represent “observed values” of histogram-valued variables
and interval-valued variables by quantile functions as in Expressions (2.4) or (2.5) and as
in Expressions (2.9) or (2.10), respectively, it is important to analyze the behavior of the
space of quantile functions. Quantile functions are a particular kind of functions. Consider
the set of the functions defined from R in R, F(R, R), and the usual operations defined in
F as follows:
38
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
• Addition: (f + g)(x) = f (x) + g(x), ∀x ∈ R;
• Product of a function by a real number: (λf )(x) = λf (x), ∀x ∈ R, and λ ∈ R.
The (F, +, .) is a vector space. However, the space (E, +, .) where E([0, 1], R) is the
set of the quantile functions (piecewise of linear functions) defined from [0, 1] in R, with the
same operations of addition and product by a real number of (F, +, .), is not a subspace of
the vector space (F, +, .). As we show in Section 2.2.3.2, if we multiply a quantile function
by a positive number, we will have a non-decreasing function but if the number is negative
we will obtain a decreasing function and consequently cannot be a quantile function. In
Figure 2.13 of Example 2.8 we may observe this situation. For this reason (E, +, .) is only
a semi-vector (or semi-linear) space (Prakash and Murat (1971)).
However if we consider the distributions represented by histograms and use the histograms arithmetic proposed by Colombo and Jaarsma (1980), it is possible to obtain a
new histogram, that is the symmetric of the histogram H. The histogram −H obtained
when the histogram H is multiply by −1, is the symmetric of the histogram H. Figure 2.14
represents the histogram HX in Example 2.8 and the respective symmetric histogram.
Figure 2.14: Representation of the histogram HX in Example 2.8 and the respective symmetric −HX .
It is then possible to define the quantile function that represents the distribution of the
histogram −H. This function is not obtained by multiplying the quantile function Ψ−1 (t) by
−1. Instead, it is obtained by performing the transformations of Ψ−1 (t) in −Ψ−1 (1 − t) with
t ∈ [0, 1]. The function −Ψ−1 (1 − t) is a quantile function.
2.2. ARITHMETICS
39
Definition 2.7. Consider a histogram or interval E and the respective symmetric −E,
according to Definition 2.6. If Ψ−1 (t) is the quantile function that represents the histogram
or interval E, −Ψ−1 (1 − t) is the quantile function that represents its symmetric −E.
Note 2.5. In this study we will work with histogram-valued variables or with the particular
case of the interval-valued variables, where the “observed values” are elements of the set
E([0, 1], R), and where the function −Ψ−1 (1 − t) with t ∈ [0, 1] is the quantile function that
represents the distribution that is symmetric of the distribution represented by Ψ−1 (t).
Figure 2.15 shows that the function −Ψ−1
X (t) in Example 2.8 is different from the quantile
function −Ψ−1
X (1 − t) that corresponds to the histogram −HX . Figure 2.16 illustrates an
analogous situation for the case of the interval IX = [1, 3] in Example 2.4.
−1
6
ΨX (t)
−1
−ΨX (t)
−1
−ΨX (1−t)
3
0
−3
−6
0.2
0.4
0.6
0.8
1
−1
−1
Figure 2.15: Representation of the functions Ψ−1
X (t); −ΨX (t), −ΨX (1 − t) in Example 2.8.
In Section 2.2.3.1, the process of rewriting two or more distributions with the same
number of subintervals and same set of cumulative weights was described. An interesting
behavior is observed when we applied this process to the quantile functions Ψ−1 (t) and
−Ψ−1 (1 − t), with t ∈ [0, 1]. In this case, the distributions are defined with an equal number
of subintervals, each of which have associated weights pi and that verify the condition
pi = pn−i+1 , with i ∈ {1, 2, . . . , n} . This behavior results from construction process, when
all functions involved represent HY (j) and the respective symmetric −HY (j) . Example 2.9
illustrates this behavior that will be very relevant in this work.
40
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
5
−1
ΨX (t)
−1
−ΨX (1−t)
−1
−ΨX (t)
3
1
−1
−3
−5
0.2
0.4
0.6
0.8
1
−1
−1
Figure 2.16: Representation of the functions Ψ−1
X (t); −ΨX (t), −ΨX (1 − t) in Example 2.4.
Example 2.9. Consider the histogram HX in Example 2.5 and respective symmetric −HX ,
represented in Figure 2.14. The respective quantile functions in Figure 2.15 that represent
HX and −HX are the following:




1+



Ψ−1
3+
X (t) =





 5+
t
0.1
×2
if
0 ≤ t < 0.1
t−0.1
0.6
×2
if
0.1 ≤ t < 0.7
t−0.7
0.3
×3
if
0.7 ≤ t ≤ 1




−8 +



−Ψ−1
−5 +
X (1 − t) =





 −3 +
t
0.1
×3
if
0 ≤ t < 0.3
t−0.3
0.6
×2
if
0.3 ≤ t < 0.9
t−0.9
0.1
×2
if
0.9 ≤ t ≤ 1
The set of the cumulative weights is W = {0, 0.1, 0.7, 1; 0, 0.3, 0.9, 1} . Selecting the
weights without repetition, we have Z = {0, 0.1, 0.3, 0.7, 0.9, 1} . From this set of cumu-
lative weights we have p1 = p5 = 0.1; p2 = p4 = 0.2; and p3 = 0.4. So, the quantile
−1
functions Ψ−1
X (t) and −ΨX (1 − t) may be rewritten as follows:
2.3. DISTANCES
41
Ψ−1
X (t) =



1+








3+



t
0.1
×2
t−0.1
0.2
×
2
3
11
+ t−0.3
× 43
3
0.4






5 + t−0.7
×2


0.2




 7 + t−0.9
0.1
if
0 ≤ t < 0.1
if
0.1 ≤ t < 0.3
if
0.3 ≤ t < 0.7
if
0.7 ≤ t < 0.9
if
0.9 ≤ t ≤ 1


t

−8 + 0.1








−7 + t−0.1
×2

0.2


−Ψ−1
× 43
−5 + t−0.3
X (1 − t) =
0.4






− 113 + t−0.7
× 23


0.2




 −3 + t−0.9 × 2
0.1
if
0 ≤ t < 0.1
if
0.1 ≤ t < 0.3
if
0.3 ≤ t < 0.7
if
0.7 ≤ t < 0.9
if
0.9 ≤ t ≤ 1
To conclude, it is important to underline some conclusions about the function
−Ψ−1 (1 − t), t ∈ [0, 1]:
• As it is required for quantile functions, −Ψ−1 (1 − t) is a non-decreasing function;
• Ψ−1 (t) − Ψ−1 (1 − t) is not a null function, as expected, but is a quantile function with
null (symbolic) mean, as defined in Billard and Diday (2003);
• the functions −Ψ−1 (1 − t) and Ψ−1 (t) are linearly independent, providing that
−Ψ−1 (1 − t) 6= Ψ−1 (t);
• −Ψ−1 (1 − t) = Ψ−1 (t) only when the histogram or interval is symmetric with respect
to the yy -axis.
2.3. Distances
The goal of this work is to propose a linear regression model between variables whose
observations are empirical distributions or ranges of values. Therefore, one goal of this
study is also to select a adequate measure to calculate the difference between the observed
and predicted “values” that are now much more complex than in classical case.
42
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
In classical linear regression, to quantify the error between the observed values yj and
the predicted values ybj the difference between two real numbers, ej = yj − ybj is used.
In this case, the model to estimate the values ybj minimizes the quantity
m
P
j=1
(yj − ybj )2 .
Similarly as in classical statistics, one possible approach could be computing the difference
between two distributions represented by their respective histograms/intervals using the
histograms’/intervals’ arithmetic. In their work about forecasting time series applied to
histogram-valued variables, Arroyo and Maté (Arroyo and Maté (2009), Arroyo (2008)) also
needed to measure the error between the observed and forecasted distributions. Firstly,
they considered the possibility to measure the error between two distributions represented
by their respective histograms using the histograms’ arithmetic. However, this option turned
out to be of little use. As we have seen, it is not easy to operate with the histograms
arithmetic and some results are not as expected. For example, the difference between two
equal histograms is not zero in the same way as the difference between two equal intervals
is not zero. According to Arroyo (2008), this happens because the goal of these arithmetics
is to provide an interval that includes all results that are possible to be obtained by each
one of the possible values in each of the terms. So, it is not adequate to analyze the
similarity between distributions or intervals applying the respective arithmetics. Due to the
complexity of histogram-valued variables and interval-valued variables, the error between
the observed and predicted distributions or ranges of values requires a different approach
and dissimilarity measures are an option.
2.3.1. Distances between intervals
In the case of the intervals, several distances have been proposed and studied in works
that deal with data with variability or imprecise data (Blanco-Fernández (2009), Arroyo
(2008), Chavent and Saracco (2008), Bock and Diday (2000)). In context of the Symbolic Data Analysis, Ichino and Yaguchi (1994) proposed a dissimilarity measure between
symbolic elements, including intervals, that is defined on the basis of cartesian operations
of union and intersection between intervals. De Carvalho (1994) proposed a normalization
for this distance. However other dissimilarity measures will be considered to compare two
intervals.
2.3. DISTANCES
43
Consider the intervals A and B that can be defined by their bounds or by the centers
and half ranges: A = [a, a] = [cA − rA , cA + rA ] and B = [b, b] = [cB − rB , cB + rB ].
The Hausdorff distance (Nadler, 1978), that is a general distance used to compare sets,
may be applied to intervals if we consider intervals as compact and convex sets of real
numbers. Given two intervals A and B, the Hausdorff distance is defined as follows:
DH (A, B) = max |a − b|, |a − b|
= |cA − cB | + |rA − rB |.
(2.14)
A L2 metric used between intervals (Vitale, 1985) is defined by
ρ2 (A, B) =
h
1
2
a−b
2
+ 12 (a − b)
2
i 12
.
(2.15)
However, according to Bertoluzza et al. (1995) these distances are not a good option to
measure the dissimilarity between intervals because they only take into account distances
between the bounds of the intervals. Therefore, this author proposed a weighted average
of the distances between all combinations of the bounds of the intervals. The square of the
Bertoluzza distance between intervals A and B is defined by
2
B
D (A, B) =
Z
[0,1]
2
t(a − b) + (1 − t)(a − b) dB(t).
(2.16)
In certain conditions, the square of the Bertoluzza distance may be defined as follows:
where θB =
Z
DB2 (A, B) = (cA − cB )2 + θB (rA − rB )2
(2.17)
(2t − 1)2 dB(t) ∈ ]0, 1] . For more details see Bertoluzza et al. (1995) and
[0,1]
Trutschnig et al. (2009).
According to Expression (2.17), as θB ∈ ]0, 1] , the weight of the distance between
the centers is strictly greater than or equal to that between the half ranges. This is in
accordance to the fact that it seems natural that the distance between the centers is more
important than that between the half ranges, since the center determines the position of the
set. As the Bertoluzza distance is a L2 metric, its statistical properties make it appropriate
to be applied in situations that involve optimization problems, in particular when the Least
Squares method is used (González-Rodrı́guez et al. (2007)).
Recently, the distance previously presented in Expression (2.17) is generalized to the
Dθ distance (Blanco-Fernández (2009), Blanco-Fernández et al. (2011)) defined as follows:
Dθ2 (A, B) = (cA − cB )2 + θ(rA − rB )2
for an arbitrary θ > 0.
(2.18)
44
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
In imprecise data context, the linear regression models proposed for random intervals
used the Dθ distance to calculate the error between the observed and predicted interval
(Blanco-Fernández (2009),Blanco-Fernández et al. (2011)).
2.3.2. Distances between distributions
As the observations of the histogram-valued variables are distributions, the dissimilarity
between these elements will be measured using distances between distributions. As in
this work interval-valued variables will be considered particular case of histogram-valued
variables, the distance selected to calculate the dissimilarity between distributions will also
be applied to intervals. According to the selection of divergency measures between distributions presented by Gibbs and Su (2002) and Bock and Diday (2000), Arroyo (2008)
studied several hypotheses to quantify the error of a forecast. In Table 2.2 the analyzed
measures are presented, where f (x) and g(x) are density functions defined for x ∈ R,
F (x) and G(x) the distribution functions and F −1 (x) and G−1 (x) the respective quantile
functions.
Table 2.2: Divergency measures between distributions (Arroyo (2008)).
Divergency measures
Kullback-Leibler
Jeffrey
χ2
Hellinger
Total variation
Wasserstein
Mallows
Definitions
Z
f (x)
DKL (f, g) =
log
f (x)dx
g(x)
R
DJ (f, g) = DKL (f, g) + DKL (g, f )
Z
|f (x) − g(x)|2
Dχ2 (f, g) =
dx
g(x)
R
Z 12
p
p
DH (f, g) =
f (x) − g(x) dx
R
Z
Dvar (f, g) =
|f (x) − g(x)|dx
R
Z
DW (f, g) =
|F (x) − G(x)|dx
R
s
Z 1
2
DM (f, g) =
(F −1 (x) − G−1 (x)) dx
0
Kolmogorov
DK (f, g) = max |F (x) − G(x)|
R
2.3. DISTANCES
45
According to Arroyo, only the Wasserstein and Mallows divergency measures were
considered adequate to measure the dissimilarity between observed and forecasted distributions (for more details see Chapter 5 in Arroyo (2008)). These measures present
interesting properties for error measurement that allows them to be considered as distances: positive definiteness, symmetry and triangle inequality condition. For Arroyo and
Maté (Arroyo and Maté (2009), Arroyo (2008)), the Wasserstein and Mallows distance have
intuitive interpretations related to the Earth Mover’s Distance and are the ones that better
adjust to the concept of distance as assessed by the human eye. The measures are defined
in terms of the quantile functions and the more further apart are these functions the larger
the distance between them. This distance was also used in other works such as those
of Irpino and Verde (Irpino and Verde (2006), Verde and Irpino (2008)), where the Mallows
distance is successfully applied to cluster histogram data. These authors proved interesting
properties of these measures such as, for example, that they allow for the Huygens theorem
of decomposition of inertia for clustered data (Irpino and Verde (2006)). The same authors
used this distance in their, recently proposed, linear regression model for histogram-valued
variables (Verde and Irpino (2010); Irpino and Verde (2012, 2013)). The Wasserstein and
Mallows distances are defined as follows:
−1
Definition 2.8. Given two quantile functions Ψ−1
X (t) and ΨY (t) that represent distributions
or intervals, the Wasserstein distance is defined as:
−1
X
−1
Y
DW (Ψ (t), Ψ (t)) =
Z
1
−1
|Ψ−1
X (t) − ΨY (t)| dt
0
(2.19)
and the Mallows distance:
−1
DM (Ψ−1
X (t), ΨY (t)) =
s
Z
1
0
2
−1
Ψ−1
X (t) − ΨY (t) dt.
(2.20)
Note 2.6. In some works the distance defined in Expression (2.20) is named Wasserstein
distance. This occurs because historically, this metric was introduced several times and
in different ways (Irpino and Verde (2012)). D(f, g) =
qR
1
0
2
F −1 (x) − G−1 (x) dx is a
distance function defined between the probability distributions of two variables on a given
metric space. However, it was Mallows who introduced this metric in a statistical context,
therefore it may also be named Mallows distance; this is how we will name it henceforth in
this work.
46
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
If we consider the more general expression:
−1
X
−1
Y
D(Ψ (t), Ψ (t)) =
Z
1
−1
X
0
−1
Y
p
|Ψ (t) − Ψ (t)| dt
p1
(2.21)
we obtain for the particular cases of p = 1 and p = 2 the Wasserstein and Mallows distance,
respectively. This is similar to the case of the Minkowski metric, that is the Manhattan
distance when p = 1 and the Euclidean distance when p = 2.
Assuming the Uniform distribution across each of the subintervals of the histograms, the
expressions that define the Mallows and Wasserstein distances may be rewritten using the
centres and half ranges or only the centers of the (sub)intervals for histograms and intervals.
The simplification of the expression for the Mallows distance was proposed by Irpino and
Verde (2006); later Arroyo (2008) presented a similar simplification for the expression of the
Wasserstein distance.
Proposition 2.1. (Irpino and Verde (2006)) Consider two empirical distributions X and Y
−1
that may be represented by the quantile functions Ψ−1
X (t) and ΨY (t), both written with n
pieces and the same set of cumulative weights, and assume the Uniform distribution across
the subintervals. The square of the Mallows distance between these distributions is given
by
2
M
−1
X
−1
Y
D (Ψ (t), Ψ (t)) =
n
X
i=1
pi (cXi
1
− cYi ) + (rXi − rYi )2
3
2
where, cXi , cYi and rXi , rYi with i ∈ {1, . . . , n} are the centers and half ranges of the
subinterval i of the distributions X and Y.
In the particular case of the intervals, we have:
1
2
−1
2
DM
(Ψ−1
(rX − rY )2
X , ΨY ) = (cX − cY ) +
3
where cX , cY , rX , rY are the centers and half ranges of the intervals X and Y.
−1
Proof. Consider the quantile functions Ψ−1
X (t) and ΨY (t) as in Expression (2.5) written with
equal numbers of pieces and the same set of cumulative weights. According to Expression
(2.20) in Definition 2.8, we have
2.3. DISTANCES
2
M
−1
X
47
−1
Y
D (Ψ (t), Ψ (t)) =
Z1
0
2
2(t − wi )
2(t − wi )
=
cX i +
− 1 rX i − c Y i +
− 1 r Yi
dt =
wi − wi−1
wi − wi−1
i=1 wi−1
2
n Z wi X
2(t − wi )
=
(cXi − cYi ) +
− 1 (rYi − rXi ) dt.
wi − wi−1
i=1 wi−1
n Z
X
wi
−1
2
(Ψ−1
X (t) − ΨY (t)) dt =
Making the change of variable υ =
n Z
X
i=1
=
2
M
−1
X
we can rewrite the integral as follows:
1
0
2
pi [(cXi − cYi ) + (2υ − 1) (rYi − rXi )] dυ =
n
X
i=1
So, we have
t−wi−1
wi −wi−1
pi (cXi
−1
Y
1
2
− cYi ) + (rYi − rXi )
3
D (Ψ (t), Ψ (t)) =
2
n
X
i=1
pi (cXi
.
1
2
− cYi ) + (rXi − rYi ) .
3
2
Observing the simplify expression of the Mallows distance between intervals, we may
conclude that the application of this distance for the case of intervals is not new. The
Mallows distance when applied to intervals when the values are uniformly distributed is a
particularization of the Bertoluzza distance, as defined in Expression (2.17), for θB = 13 ,
used in the literature to measure the distance between two intervals (Bertoluzza et al.
(1995)).
Proposition 2.2. (Arroyo (2008)) Consider two empirical distributions X and Y that may
−1
be represented by the quantile functions Ψ−1
X (t) and ΨY (t), both written with n pieces
and the same set of cumulative weights, and assuming the Uniform distribution across the
subintervals. The Wasserstein distance between these distributions is given by
−1
DW (Ψ−1
X (t), ΨY (t)) =
n
X
i=1
pi |(cXi − cYi )|
where, cXi and cYi are the centers of the subintervals i, with i ∈ {1, . . . , n} .
In the particular case of the intervals, we have:
2
−1
DW
(Ψ−1
X (t), ΨY (t)) = |cX − cY |
where cX , cY are the centers of the intervals X and Y, respectively.
48
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
−1
Proof. Consider the quantile functions Ψ−1
X (t) and ΨY (t) as in Expression (2.5) written
with equal numbers of pieces.
According to Expression (2.19) in Definition 2.8, we have
−1
X
Z
−1
Y
1
−1
D (Ψ (t), Ψ (t)) =
|Ψ−1
X (t) − ΨY (t)|dt =
0
n Z wi X
2(t
−
w
)
2(t
−
w
)
i
i
cX +
=
− 1 rX i − c Y i +
− 1 rYi dt =
i
wi − wi−1
wi − wi−1
i=1 wi−1
Z
n
wi X
(cX − cY ) + 2(t − wi ) − 1 (rY − rX ) dt.
=
i
i
i i
w −w
2
W
i=1
i
wi−1
Making the change of variable υ =
n Z
X
i=1
i−1
t−wi−1
wi −wi−1
we can rewrite the integral as follows:
1
pi |(cXi − cYi ) + (2υ − 1) (rYi − rXi )| dυ.
0
• If (cXi − cYi ) + (2υ − 1) (rYi − rXi ) ≥ 0 we have
n Z
X
i=1
=
1
0
pi |(cXi − cYi ) + (2υ − 1) (rYi − rXi )| dυ =
n
X
i=1
=
n
X
i=1
1
pi [(cXi − cYi ) υ + (rYi − rXi ) υ 2 − (rYi − rXi ) υ]0 =
pi (cXi − cYi ) .
• If (cXi − cYi ) + (2υ − 1) (rYi − rXi ) < 0 we have
n Z
X
i=1
1
0
=−
pi |(cXi − cYi ) + (2υ − 1) (rYi − rXi )| dυ =
n
X
i=1
pi (cXi − cYi ) .
So,
2
W
−1
X
−1
Y
D (Ψ (t), Ψ (t)) =
n
X
i=1
pi |cXi − cYi | .
In Example 2.10 we show that the Mallows and Wasserstein distances adjust to the
concept of distance as assessed by the human eye.
2.3. DISTANCES
49
Example 2.10. Consider again the histogram-valued variable Y5 , “Waiting time for a consult” in Table 1.2. The observed values of this histogram-valued variable for the three
healthcare centers are represented in Figure 2.1 and the respective quantile functions that
define the observations of each healthcare center are given in Example 2.2.
To analyze which healthcare centers are more similar, we calculate the Mallows and
Wasserstein distances between the distributions. First, we have to rewrite the three quantile
functions with the same number of pieces. According to the process described in Section
2.2.3.1, we have:


2t

−
1
× 7.50
22.50
+

0.1





2(t−0.1)


−
1
× 6.25
36.25
+

0.5


Ψ−1
43.75 + 2(t−0.6)
− 1 × 1.25
Y5 (A) (t) =
0.1





2(t−0.7)

52.50
+
−
1
× 7.50


0.1




 75.00 + 2(t−0.8) − 1 × 15.00
0.1


2t

−
1
× 0.94
0.94
+

0.1





2(t−0.1)


6.56
+
−
1
× 4.69

0.5


Ψ−1
12.19 + 2(t−0.6)
− 1 × 0.94
Y5 (B) (t) =
0.1





2(t−0.7)

−
1
× 0.94
14.06
+


0.1




 30.00 + 2(t−0.8) − 1 × 15.00
0.1


2t

−
1
× 1.25
1.25
+

0.1





2(t−0.1)


8.75
+
−
1
× 6.25

0.5


Ψ−1
33.75 + 2(t−0.6)
− 1 × 3.75
Y5 (C) (t) =
0.1





2(t−0.7)

− 1 × 3.75

 41.25 +
0.1




 52.50 + 2(t−0.8) − 1 × 7.50
0.1
if
0 ≤ t < 0.1
if
0.1 ≤ t < 0.6
if
0.6 ≤ t < 0.7
if
0.7 ≤ t < 0.8
if
0.8 ≤ t ≤ 1
if
0 ≤ t < 0.1
if
0.1 ≤ t < 0.6
if
0.6 ≤ t < 0.7
if
0.7 ≤ t < 0.8
if
0.8 ≤ t ≤ 1
if
0 ≤ t < 0.1
if
0.1 ≤ t < 0.6
if
0.6 < t < 0.7
if
0.7 ≤ t < 0.8
if
0.8 ≤ t ≤ 1
50
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
The square of the Mallows and Wasserstein distances between the distributions that
represent the “Waiting time for a consult” in each healthcare center are presented in
Table 2.3.
Table 2.3: Mallows and Wasserstein distances between the observations of the histogram-valued variable
“Waiting time for a consult” in Table 1.2.
Mallows distance
Wasserstein distance
33.81
5.74
23.51
4.74
15.12
3.24
−1
D Ψ−1
Y5 (A) (t), ΨY5 (B) (t)
−1
D Ψ−1
Y5 (A) (t), ΨY5 (C) (t)
−1
D Ψ−1
Y5 (B) (t), ΨY5 (C) (t)
The results are according to the graphical behavior of the quantile functions in Figure
2.2 (Example 2.2). The healthcare centers B and C are the most similar and the healthcare
centers A and B are the most different.
2.4. Descriptive Statistics
It is also important, for the development of this work, to know descriptive concepts for
the case the interval-valued and histogram-valued variables. In this section, a state-ofthe-art related to descriptive statistics will be presented. The first and most considered
descriptive statistics were defined in the symbolic-numerical-numerical category (Bock and
Diday (2000), Billard and Diday (2003)). In this category, basic descriptive statistics for one,
two or more variables were proposed. Concepts such as frequency histograms, sample
mean, variance and covariance were defined for the kind of symbolic variables under
study. However, more recently, Irpino and Verde (Irpino and Verde (2006), Verde and Irpino
(2008)) proposed some descriptive concepts for histogram-valued variables in the symbolicsymbolic-symbolic category, that may be particularized to interval-valued variables.
2.4.1. Symbolic-numerical-numerical category
As in this work interval-valued variables are considered a particular case of histogramvalued variables, in this section we will describe with detail the descriptive statistics for
histogram-valued variables and the definitions for interval-valued variables will emerge as a
2.4. DESCRIPTIVE STATISTICS
51
particular case. However, it is possible, following a similar process, to deduce the descriptive statistics for interval-valued variables (Bock and Diday (2000)).
2.4.1.1. Univariate descriptive statistics
Consider the histogram-valued variable Y : E → B with B a set of distributions with
values in Y . The observation associated with each unit j, with j ∈ {1, ..., m} is represented
by
a
histogram
composed
nj
by
subintervals
with
weight
pji ,
and
i ∈ {1, . . . , nj } . The histogram associated with unit j may be represented as follows:
Y (j) = [I Y (j)1 , I Y (j)1 [, pj1 ; [I Y (j)2 , I Y (j)2 [, pj2 ; . . . , [I Y (j)nj , I Y (j)nj ], pjnj .
For the particular case of an interval-valued variable Y, only one interval is associated
with each unit j, with j ∈ {1, ..., m} . This interval is represented by Y (j) = [I Y (j) , I Y (j) ].
Most of the descriptive univariate concepts that we will present below were proposed by
Bertrand and Goupil (in Chapter 6 of the book Bock and Diday (2000)) for interval-valued
variables and by Billard and Diday (Billard and Diday (2003); Chapter 3 of the book Billard
and Diday (2006)) for histogram-valued variables.
Definition 2.9. (Arroyo (2008), Bock and Diday (2000), in Chapter 6) The empirical distribution function, F (ξ) is the distribution function of a mixture of m histograms. Each of
these histograms is decomposed in nj subintervals IY (j)i = I Y (j)i , I Y (j)i , i ∈ {1, . . . , nj }
where the values are uniformly distributed and which are associated with weight pji . The
empirical distribution function of a histogram-valued variable is then defined as a uniform
mixture of m distributions, more specifically:
nj
1 XX F (ξ) =
P Y (j)i ≤ ξ pji
m j=1 i=1
m
with
P Y (j)i ≤ ξ =




0









ξ−I Y (j)
i
I Y (j)i −I Y (j)
1
if
ξ < I Y (j)i
if
I Y (j)i ≤ ξ < I Y (j)i .
if
ξ ≥ I Y (j)i
i
52
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
So,

m
X
X
1

F (ξ) =
m j=1
pji +
X
pji
i:ξ∈IY (j)i
i:ξ≥I Y (j)i
!
ξ − I Y (j)i
I Y (j)i − I Y (j)i
"
.
In particular, for interval-valued variables, the empirical distribution function of an interval
variable is defined as the uniform mixture of m uniform distributions, i.e,
F (ξ) =

X
1 
m j:ξ∈I
Y (j)
ξ − I Y (j)
I Y (j) − I Y (j)

+ ♯ j : ξ ≥ I Y (j)  .
Hence, by taking the derivative with respect to ξ, we obtain the empirical density function.
Definition 2.10. (Billard and Diday (2003), Bock and Diday (2000), in Chapter 6) For the
histogram-valued variable Y , the empirical density function is given by
n
m
j
1 X X 1IY (j)i (ξ)
pji ,
f (ξ) =
m j=1 i=1 kIY (j)i k
ξ ∈ R.
When nj = 1, for all observations, we have pj = 1 and the empirical density function
for an interval-valued variable is defined as follows:
m
1 X 1IY (j)i (ξ)
f (ξ) =
,
m j=1 kIY (j) k
ξ ∈ R.
In previous expressions, 1IY (j)i is the indicator function of the interval IY (j)i and kYk (j, i)k
is the length of that interval.
As in classical descriptive statistics,
Z
+∞
f (ξ)dξ = 1.
−∞
For histogram-valued or interval-valued variables we may define the histogram that
allows visualizing the frequency distribution of the values. The histogram of all symbolic
observations may be constructed as follows. Let
h
i
I = min I Y (j)i : i ∈ {1, . . . , nj } , j ∈ E , max I Y (j)i : i ∈ {1, . . . , nj } , j ∈ E
be the interval which spans all observed values and let I be partitioned into r subintervals
Ig = [ζg−1 , ζg [ with g ∈ {1, . . . , r − 1} and Ir = [ζr−1 , ζr ] .
2.4. DESCRIPTIVE STATISTICS
53
To define the histogram from the symbolic observations it is necessary not only to
select the subintervals but also to calculate the observed frequency or relative frequency
associated with each subinterval. These concepts are defined below.
Definition 2.11. (Billard and Diday (2003), Bock and Diday (2000), in Chapter 6) For a
histogram-valued variable Y the observed frequency for the interval Ig with g ∈ {1, . . . , r}
is given by
f (Ig ) =
m
X
X
j=1 i:I(g)⊆IY (j)i
kIY (j)i ∩ Ig k
pji .
kIY (j)i k
Particularizing to the case of the interval-valued variable, we have:
f (Ig ) =
m
X
kY (j)
j=1
T
Ig k
.
kY (j)k
The relative frequency is then,
fr (Ig ) =
f (Ig )
.
m
Notice that each term in the expression that defines f (Ig ) , represents the portion of
the (sub)interval IY (j)i which is spanned by Ig and hence the proportion of its observed
relative frequency pji which pertains to the overall histogram interval Ig . It follows that
r
P
f (Ig ) = m.
g=1
The graphical representation of the set
n
o
Ig , fr (Ig ) , g ∈ {1, . . . , r}
is the histogram of the histogram/interval-valued variable.
In Symbolic Data Analysis, the concepts of median or quartile and mode may be obtained
by a similar process as for grouped data in classical statistics.
Definition 2.12. The interval median of the histogram/interval-valued variable is the interval Ig where the cumulative relative frequency is 0.5. Similarly, we obtain as interval first
quartile the interval Ig where the cumulative relative frequency is 0.25 and the interval third
quartile the interval Ig where the cumulative relative frequency is 0.75.
Definition 2.13. The interval mode of the histogram/interval-valued variable is the interval
Ig where the relative frequency is higher.
54
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
Let Y be the histogram-valued variable and f the empirical density function presented
in Definition 2.10. Similarly as in classical statistics, we can define the descriptive measures
as the symbolic sample mean and the symbolic sample variance from this function.
Symbolic Mean
The symbolic mean (empirical symbolic mean) for histogram-valued variables is obtained
as follows:
Y
! nj
"
Z
m
1 +∞ X X 1IY (j)i (ξ)
=
ξf (ξ)dξ =
ξ
pij dξ
m −∞ j=1 i=1 kIY (j)i k
−∞
nj Z
m
1 X X I Y (j)i
ξ
=
pij dξ
m j=1 i=1 I Y (j)i kI Y (j)i − I Y (j)i k
Z
+∞
2
n
2
m
j
1 X X I Y (j)i − I Y (j)i
pij
=
2m j=1 i=1 I Y (j)i − I Y (j)i
n
m
j
1 X X I Y (j)i + I Y (j)i
=
pij
m j=1 i=1
2
So, the symbolic sample mean,
• for histogram-valued variables (Billard and Diday (2003)) is given by:
n
Y
m
j
1 X X I Y (j)i + I Y (j)i
=
pij ;
m j=1 i=1
2
(2.22)
• for interval-valued variables (Bock and Diday (2000), in Chapter 6) is given by:
1 X I Y (j) + I Y (j)
.
=
m j=1
2
m
Y
(2.23)
Symbolic Variance 1
As two definitions of symbolic variance were proposed, we will denote the first definition
proposed by Bertrand and Goupil (Bock and Diday (2000), in Chapter 6) and Billard and
Diday (2003), respectively for interval-valued variables and histogram-valued variables, as
symbolic variance 1, (s21 ) and the definition proposed by Billard and Diday (2000, 2002) as
symbolic variance 2, (s22 ). Similarly as for the mean, the symbolic variance 1 for histogram-
2.4. DESCRIPTIVE STATISTICS
55
valued variables is defined as follows:
Z
2
1
s (Y ) =
Z
=
Z
=
with
Z
+∞
−∞
+∞
−∞
+∞
−∞
ξ−Y
2
f (ξ)dξ
Z
2
ξ f (ξ)dξ − 2Y
ξ 2 f (ξ)dξ − Y
Z
+∞
ξf (ξ)dξ + Y
−∞
2
ξ f (ξ)dξ =
−∞
=
+∞
f (ξ)dξ
−∞
n
+∞
m
Z
2
m
j
ξ 2 X X 1IY (j)i (ξ)
pji dξ
−∞ m j=1 i=1 kIY (j)i k
nj Z
m
ξ2
1 X X I Y (j)i
+∞
2
j=1 i=1
nj
m
I Y (j)
i
3
I Y (j)i − I Y (j)i
pji dξ
3
1 X X I Y (j)i − I Y (j)i
=
pji
3m j=1 i=1 I Y (j)i − I Y (j)i
nj
m
1 XX 2
=
I Y (j)i + I Y (j)i I Y (j)i + I 2Y (j)i pji .
3m j=1 i=1
So, the symbolic variance 1,
• for histogram-valued variables (Billard and Diday (2003)) is given by:
1 XX 2
2
s (Y ) =
I Y (j)i + I Y (j)i I Y (j)i + I 2Y (j)i pji − Y
3m j=1 i=1
m
nj
2
1
(2.24)
that may also be written as
2
1
s (Y ) =
1
3m
m
P
j=1
nj
P
i=1
I Y (j)i − Y
2
+ I Y (j)i − Y
I Y (j)i − Y + I Y (j)i − Y
2
pji ;
(2.25)
• for interval-valued variables (Bock and Diday (2000), in Chapter 6) is given by:
1 X 2
2
s (Y ) =
I Y (j) + I Y (j) I Y (j) + I 2Y (j) − Y
3m j=1
m
2
1
(2.26)
or, equivalently,
2
2
1 X
s (Y ) =
I Y (j) − Y + I Y (j) − Y I Y (j) − Y + I Y (j) − Y .
3m j=1
m
2
1
(2.27)
56
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
Symbolic Standard Deviation
As s21 (Y ) is the symbolic variance of the variable Y, the symbolic standard deviation, is
usually:
s1 (Y ) =
p
s21 (Y )
Concerning the formulas obtained for the symbolic mean and the symbolic variance 1, it is
important to underline that:
• As in classical statistics, the expressions of the symbolic mean and symbolic variance 1
are deduced from the density function and because of this, they are consistent with
the assumption of uniformity within each subinterval.
• The expression of the symbolic mean for interval-valued variables is the one obtained
applying the classical definition of mean to the mean values (midpoints/centers) of the
interval-valued variable Y for each observation j . Let the midpoints of each interval j
be represented by cY (j) =
I Y (j) +I Y (j)
with j ∈ {1, ..., m} , the expression that defines
2
the symbolic mean of the interval-valued variable may be given by
1 X
=
cY (j) .
m j=1
m
Y
= cY
• The symbolic mean of a histogram-valued variable Y is obtained applying the classical definition of mean to the weighted mean values, for each unit j, of the histogramvalued variable Y, i. e. Y (j) =
nj
P
i=1
I Y (j) +I Y (j)
2
pji with j ∈ {1, ..., m} . The symbolic
mean of the histogram-valued variable is then
1 X
=
Y (j).
m j=1
m
Y
• If we apply the classical definition of variance to the midpoints of the intervals, i.e. to
cY (j) , or to the weighted mean values of the distributions, i.e.
Y (j), with
j ∈ {1, ..., m} , we don’t obtain the expression of the symbolic variance 1.
• In the particular case where all observations of a symbolic variable are degenerate (real values), the symbolic variance 1 coincides with the classical definition of
variance.
2.4. DESCRIPTIVE STATISTICS
57
2.4.1.2. Bivariate descriptive statistics
Consider now the concepts of descriptive statistics when we have two histogram/intervalvalued variables (Bock and Diday (2000), Billard and Diday (2000, 2002, 2003, 2006),
Arroyo (2008)). Let Y1 and Y2 be two histogram-valued variables. The histogram-valued
variable Yk with k ∈ {1, 2} are composed by njk subintervals and to each subinterval
IYk (j)ik a weight pjkik is associated, ik ∈ {1, . . . , njk } . The histograms are defined by:
n
o
Y1 (j) = [I Y1 (j)1 , I Y1 (j)1 [, pj11 ; [I Y1 (j)2 , I Y1 (j)2 [, pj12 ; . . . , [I Y1 (j)nj1 , I Y1 (j)nj1 ], pj1nj1 ;
n
o
Y2 (j) = [I Y2 (j)1 , I Y2 (j)1 [, pj21 ; [I Y2 (j)2 , I Y2 (j)2 [, pj22 ; . . . , [I Y2 (j)nj2 , I Y2 (j)nj2 ], pj2nj2 .
For the particular case of interval-valued variables Yk with k ∈ {1, 2} only one interval
is associated with each unit j, with j ∈ {1, ..., njk }, that we represent by
Yk (j) = [I Yk (j) , I Yk (j) ].
Definition 2.14. (Arroyo (2008)) The empirical joint distribution function FY1 ×Y2 (ξ1 , ξ2 )
for the histogram-valued variables Y1 and Y2 , assuming the Uniform distribution within the
subintervals of the histograms, is given by
1
m
F (ξ1 , ξ2 ) =
1
m
=
nj1 P
nj2
m P
P
j=1 i1 =1 i2 =1
m
P
j=1
!
P {Y1 (j)i1 ≤ ξ1 , Y2 (j)i2 ≤ ξ2 } pj1i1 pj2i2
P
i1 :ξ1 ≥I Y1 (j)i
i2 :ξ2 ≥I Y2 (j)i
pj1i1 pj2i2 +
1
2
P
+
pj1i1 pj2i2
i1 ,i2 :(ξ1 ,ξ2 )∈I1 (j)i1 ×I2 (j)i2
ξ1 −I Y
1 (j)i
I Y1 (j)i −I Y
1
1
1 (j)i1
ξ2 −I Y
2 (j)i
I Y2 (j)i −I Y
2
2
2 (j)i2
"
with ik ∈ {1, . . . , njk } ; k ∈ {1, 2} and njk is the number of subintervals for the unit j of
the histogram-valued variable Yk .
For the particular case where Y1 and Y2 are interval-valued variables, the empirical joint
distribution function is
F (ξ1 , ξ2 ) =
=
1
m
1
m
m
P
j=1
!
P Y1 (j) ≤ ξ1 , Y2 (j) ≤ ξ2
gY1 ×Y2 (ξ1 , ξ2 ) +
P
j:(ξ1 ,ξ2 )∈I1 (j)×I2 (j)
ξ1 −I Y
1 (j)
I Y1 (j) −I Y
1 (j)
ξ2 −I Y
2 (j)
I Y2 (j) −I Y
2 (j)
"
where gY1 ×Y2 (ξ1 , ξ2 ) is the number of rectangles Y1 (j) × Y2 (j) for which ξ1 ≥ I Y1 (j) and
ξ2 ≥ I Y2 (j) .
58
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
Definition 2.15. (Billard and Diday (2003)) The empirical joint density function is defined
for histogram-valued variables Y1 , Y2 as follows:
! nj1 nj2
"
m
1 X X X 1IY1 (j)i1 ×IY2 (j)i2 (ξ1 , ξ2 )
f (ξ1 , ξ2 ) =
pj1i1 pj2i2
m j=1 i1 =1 i2 =1 kIY1 (j)i1 × IY2 (j)i2 k
where 1IY1 (j)i
1
×IY2 (j)i
2
is the indicator function and kIY1 (j)i1 × IY2 (j)i2 k is the area of the
rectangle I Y1 (j)i1 , I Y1 (j)i1 × I Y2 (j)i2 , I Y2 (j)i2 .
Note that, when Y1 and Y2 are interval-valued variables, i1 = i2 = 1 and in this case
f (ξ1 , ξ2 ) =
As in classical statistics, we have
Z
m
1 X 1IY1 (j) ×IY2 (j) (ξ1 , ξ2 )
.
m j=1 kIY1 (j) × IY2 (j) k
−∞
−∞
Z
−∞
f (ξ1 , ξ2 )dξ1 dξ2 = 1.
−∞
Analogously to the univariate situation, we may obtain the joint histogram for Y1 × Y2
by graphically plotting
n
o
Ig1 × Ig2 , fr (Ig1 × Ig2 ) , g1 ∈ {1, . . . , r1 } , g2 ∈ {1, . . . , r2 } .
In this histogram we have:
• Ig1 = [ζg1 −1 , ζg1 [ with g1 ∈ {1, . . . , r1 − 1} and Ir1 = [ζr1 −1 , ζr1 ] ;
• Ig2 = [ζg2 −1 , ζg2 [ with g2 ∈ {1, . . . , r2 − 1} and Ir2 = [ζr2 −1 , ζr2 ] .
The intervals Ig1 and Ig2 result of a partition of the intervals I1 and I2 in r1 and r2
subintervals, respectively. The intervals Ik , with k ∈ {1, 2} are the following:
h
n
o
i
Ik = min I Yk (j)ik : ik ∈ {1, . . . , njk } , j ∈ E , max I Yk (j)ik : ik ∈ {1, . . . , njk } , j ∈ E .
To build the joint histogram, it is necessary to calculate the relative frequency associated
with each rectangle Ig1 × Ig2 . To define this frequency, we will first define the observed
frequency.
Definition 2.16. (Billard and Diday (2003)) For the histogram-valued variables Y1 and Y2 ,
the observed frequency of the rectangle Ig1 × Ig2 with g1
g2 ∈ {1, . . . , r2 } is given by
f (Ig1 × Ig2 ) =
m
X
X
X
j=1 i1 :Ig1 ⊆IY1 (j)i i2 :Ig2 ⊆IY2 (j)i
1
2
∈ {1, . . . , r1 } and
k IY1 (j)i1 × IY2 (j)i2 ∩ (Ig1 × Ig2 ) k
pj1i1 pj2i2 .
kIY1 (j)i1 × IY2 (j)i2 k
2.4. DESCRIPTIVE STATISTICS
59
Particularizing the variables Yk to the case of interval-valued variables, we have:
m
1 X k IY1 (j)i1 × IY2 (j)i2 ∩ (Ig1 × Ig2 ) k
.
f (Ig1 × Ig2 ) =
m j=1
kIY1 (j)i1 × IY2 (j)i2 k
Each term j of the expressions in Definition 2.16, defines the proportion of the observed
(sub)rectangle IY1 (j)i1 × IY2 (j)i2 which overlaps with the rectangle Ig1 × Ig2 . The relative
frequency is, in either case,
fr (Ig1 × Ig2 ) =
f (Ig1 × Ig2 )
.
m
In addition to the joint histograms, other graphical representation may be considered for
two histogram/interval-valued variables, namely the scatter plot.
For interval-valued variables it is possible to represent a scatter plot in three ways:
1. The usual representations of each observation of the interval-valued variables Y1 and
Y2 is by rectangles or crosses.
(a) In the first case, the center of the rectangle that represents the observation j is
located in the point (cY1 (j) , cY2 (j) ) , the length of the rectangle j is 2rY1 (j) and its
height 2rY2 (j) , with j ∈ {1, . . . , m}.
(b) Alternatively, each par of intervals related with the same observation j may be
represented by a cross where the center is the point (cY1 (j) , cY2 (j) ) , the horizontal
and vertical lines represent the range of the intervals, 2rY1 (j) and 2rY2 (j) , respectively.
2. The third possibility is proposed in this work for the first time. To represent the
scatter plot of the interval-valued variables, we will represent each observation of
both variables Y1 and Y2 by a line segment; this line, that passes through the points
(I Y1 (j) , I Y2 (j) ) and (I Y1 (j) , I Y2 (j) ), is the diagonal of the rectangle described in case
1.(a).
For histogram-valued variables, we may consider an extension of the first approach for
the scatter plot of interval values. To each unit j, the histograms that represent Y1 (j) and
Y2 (j) must be defined with same number of subintervals. Each subinterval that composes
the histogram is represented by a rectangle in the x and y axis, and the respective weight is
represented in the zz -axis. Therefore the scatter plot for two histogram-valued variables is a
60
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
three-dimensional figure where each observation j of the pair (Y1 (j), Y2 (j)) is represented
by a set of non-overlapping and contiguous hyperrectangule (if the histograms don’t have
subintervals with null weight) .
The example that follows, illustrates this kinds of graphics.
Example 2.11. Consider the microdata in Table 1.3, that contains information about the
level of hematocrit and hemoglobin of a set of patients attending in a healthcare center
during one month. Aggregating the values associated with each patient in intervals or
distributions, we obtain the symbolic data tables 2.4 and 2.5, respectively.
Table 2.4: Symbolic data table where the variables hematocrit and hemoglobin are interval-valued variables.
Patients
Hematocrit (Y)
Hemoglobin (X)
1
[33.29; 39.61]
[11.54; 12.8]
2
[36.69; 45.12]
[12.07; 14.17]
3
[36.69; 48.68]
[12.38; 16.16]
4
[36.38; 47.41]
[12.38; 15.29]
5
[39.19; 50.86]
[13.58; 16.24]
6
[39.7; 47.24]
[13.81; 15.2]
7
[41.56; 48.81]
[14.34; 15.55]
8
[38.4; 45.22]
[13.27; 14.6]
9
[28.83; 41.98]
[9.92; 13.8]
10
[44.48; 52.53]
[15.37; 16.75]
Figures 2.17, 2.18 and 2.19 represents the scatter plots of the data in Tables 2.4 and
2.5. In these graphics, each distribution or interval, associated with the respective patient
is represented with a different color.
From a scatter plot we can verify if the collection of observations corresponding to a
given pair of variables displays a linear behavior. For classical variables, this means that
the observed values are aligned and also allows knowing whether the linear relation is
direct or inverse. For symbolic variables, when the rectangles (or their diagonals) or sets of
hyperrectangle are well aligned in the scatter plots, we may say that the symbolic variables
are in a linear relation. The direction of this alignment also indicates if the relation is direct
or inverse. In the next chapters, we will analyze this behavior in more detail.
2.4. DESCRIPTIVE STATISTICS
61
Table 2.5: Symbolic data table where the variables hematocrit and hemoglobin are histogram-valued variables.
Patients
Hematocrit (Y)
Hemoglobin (X)
1
{[33.29; 37.52[ , 0.6; [37.52; 39.61] , 0.4}
{[11.54; 12.19[ , 0.4; [12.19; 12.8] , 0.6}
2
{[36.69; 39.11[ , 0.3; [39.11; 45.12] , 0.7}
{[12.07; 13.32[ , 0.5; [13.32; 14.17] , 0.5}
3
{[36.69; 42.64[ , 0.5; [42.64; 48.68] , 0.5}
{[12.38; 14.2[ , 0.3; [14.2; 16.16] , 0.7}
4
{[36.38; 40.87[ , 0.4; [40.87; 47.41] , 0.6}
{[12.38; 14.26[ , 0.5; [14.26; 15.29] , 0.5}
5
{[39.19; 50.86] , 1}
{[13.58; 14.28[ , 0.3; [14.28; 16.24] , 0.7}
6
{[39.7; 44.32[ , 0.4; [44.32; 47.24] , 0.6}
{[13.81; 14.5[ , 0.4; [14.5; 15.2] , 0.6}
7
{[41.56; 46.65[ , 0.6; [46.65; 48.81] , 0.4}
{[14.34; 14.81[ , 0.5; [14.81; 15.55] , 0.5}
8
{[38.4; 42.93[ , 0.7; [42.93; 45.22] , 0.3}
{[13.27; 14.0[ , 0.6; [14.0; 14.6] , 0.4}
9
{[28.83; 35.55[ , 0.5; [35.55; 41.98] , 0.5}
{[9.92; 11.98[ , 0.4; [11.98; 13.8] , 0.6}
10
{[44.48; 52.53] , 1}
{[15.37; 15.78[ , 0.3; [15.78; 16.75] , 0.7}
Figure 2.17: Scatter Plot of the interval data in Table 2.4 described in case 1.a).
As in the univariate case, bivariate concepts are also deduced from the empirical joint
density function. Following to this approach, Billard and Diday obtain the empirical covariance for interval-valued variables (Billard and Diday (2000)) and histogram-valued variables (Billard and Diday (2002)).
62
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
50
45
40
35
30
10
11
12
13
14
15
16
Figure 2.18: Scatter Plot of the interval data in Table 2.4 described in case 2.
Figure 2.19: Scatter Plot of the histogram data in Table 2.5.
Symbolic covariance 1
For the symbolic concept of covariance, three definitions were already proposed. To distinguish
the
three
definitions
we
will
denote
the
first
proposed
definition
(Billard and Diday (2000, 2002)) by symbolic covariance 1 (cov1 ) 1 , the second proposed
definition (Billard and Diday (2003), Billard (2007)) by symbolic covariance 2 (cov2 ) and
more recently the symbolic covariance 3 (cov3 ) in Billard (2008).
1
For the sake of simplification of notation the empirical covariance between Y1 and Y2 is denoted cov(Y1 , Y2 ).
2.4. DESCRIPTIVE STATISTICS
63
Consider the histogram-valued variables Y1 and Y2 , and the empirical joint density
function f. We may define the symbolic covariance 1 for histogram-valued variables as
follows:
Z
Z
+∞
+∞
ξ2 − Y2 f (ξ1 , ξ2 )dξ1 dξ2
−∞
−∞
Z +∞ Z +∞
Z +∞ Z +∞
=
ξ1 ξ2 f (ξ1 , ξ2 )dξ1 dξ2 − Y1
ξ2 f (ξ1 , ξ2 )dξ1 dξ2 −
−∞
−∞
−∞
−∞
Z +∞ Z +∞
Z +∞ Z +∞
ξ1 f (ξ1 , ξ2 )dξ1 dξ2 + Y1 Y2
f (ξ1 , ξ2 )dξ1 dξ2 .
−Y2
cov1 (Y1 , Y2 ) =
ξ 1 − Y1
−∞
−∞
−∞
−∞
We may easily prove that,
Z
+∞
−∞
=
Z
Z
+∞
ξ1 ξ2 f (ξ1 , ξ2 )dξ1 dξ2 =
−∞
+∞
−∞
Z
+∞
−∞
nj1
! nj1 nj2
"
m
1 X X X 1IY1 (j)i1 ×IY2 (j)i2 (ξ1 , ξ2 )
ξ1 ξ2
pj1i1 pj2i2 dξ1 dξ2
m j=1 i1 =1 i2 =1 kIY1 (j)i1 × IY2 (j)i2 k
nj2
1 XXX
=
m j=1 i1 =1 i2 =1
m
Z
I Y1 (j)i
Z
I Y2 (j)i
ξ1 ξ2 pj1i1 pj2i2
dξ1 dξ2
I Y (j)
I Y (j)
−
I
−
I
I
I
Y
(j)
Y
(j)
1
2
i1
i2
1
i1
Y1 (j)i1
2
i2
Y2 (j)i2
2
2
2
2
m nj1 nj2
1 X X X pj1i1 pj2i2 I Y1 (j)i1 − I Y1 (j)i1 I Y2 (j)i2 − I Y2 (j)i2
=
4m j=1 i1 =1 i2 =1
I Y1 (j)i1 − I Y1 (j)i1 I Y2 (j)i2 − I Y2 (j)i2
1
2
m nj1 nj2
1 XXX
=
pj1i1 pj2i2 I Y1 (j)i1 + I Y1 (j)i1 I Y2 (j)i2 + I Y2 (j)i2
4m j=1 i1 =1 i2 =1
!
" nj2
!
"
m nj1
I Y1 (j)i1 + I Y1 (j)i1 X
I Y2 (j)i2 + I Y2 (j)i2
1 XX
=
pj1i1
pj2i2
m j=1 i1 =1
2
2
i2 =1
and similarly that,
Z
+∞
−∞
Z
+∞
−∞
ξk f (ξ1 , ξ2 )dξ1 dξ2 = Yk , with k ∈ {1, 2}.
From the above, it follows that,
cov1 (Y1 , Y2 ) =
Z
+∞
−∞
Z
+∞
−∞
ξ1 ξ2 f (ξ1 , ξ2 )dξ1 dξ2 − Y1 Y2 .
From Expression (2.28), we obtain that the symbolic covariance 1,
(2.28)
64
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
• for histogram-valued variables (Billard and Diday, 2002) is given by:
cov1 (Y1 , Y2 ) =
1
m
nj1
m P
P
pj1i1
j=1 i1 =1
I Y1 (j)i +I Y
1
1 (j)i1
2
nj2
P
pj2i2
i2 =1
I Y2 (j)i +I Y
2
2
2 (j)i2
− Y1 Y2 ;
(2.29)
• for interval-valued variables (Billard and Diday, 2000), is given by:
1 X
cov1 (Y1 , Y2 ) =
m j=1
m
!
I Y1 (j) + I Y1 (j)
2
"!
I Y2 (j) + I Y2 (j)
2
"
− Y1 Y2 .
(2.30)
We should note the following:
• Similarly as for variance 1 and mean, the expression for the covariance 1 is obtained
from the empirical joint density function that considers that in each (sub)interval associated with each unit j the values are uniformly distributed.
• The expression of the empirical covariance 1 for interval-valued variables is the classical definition of covariance applied to the midpoints of the intervals Y1 (j) and Y2 (j),
with j ∈ {1, . . . , m} . So,
1 X
cov1 (Y1 , Y2 ) =
cY (j) cY2 (j) − Y1 Y2 .
m j=1 1
m
• The expression of the empirical covariance function for histogram-valued variables is
the classical definition of covariance applied to the weighted mean values, for each
unit j ∈ {1, . . . , m} , of the variables Y1 and Y2 ,
1 X
cov1 (Y1 , Y2 ) =
Y 1 (j)Y 2 (j) − Y1 Y2 .
m j=1
m
• If we apply the symbolic covariance to degenerate histogram/interval-valued variables
(where to all units are associated real values), we obtain the classical definition of
covariance.
• When we consider the histogram/interval-valued variables Y1 = Y2 = Y, the expression of the covariance does not coincide with the Expression (2.24)/(2.26) of
variance 1 of the histogram/interval-valued variable Y.
A new definition of variance is obtained particularizing Expressions (2.29) and (2.30)
of covariance 1 to the case where Y1 = Y2 = Y . We will designate this new definition of
variance by symbolic variance 2, despite the fact that chronologically, for histogram-valued
variables, this definition have been the first one to be proposed.
2.4. DESCRIPTIVE STATISTICS
65
Symbolic Variance 2
This approach to symbolic variance is defined,
• for histogram-valued variables (Billard and Diday, 2002) as follows:
! nj
"2
m
X
X I Y (j)i + I Y (j)i
1
2
s22 (Y ) = cov1 (Y, Y ) =
pji − Y ;
m j=1 i=1
2
• for interval-valued variables (Billard and Diday, 2000) as:
1 X
s22 (Y ) = cov1 (Y, Y ) =
m j=1
m
!
I Y (j) + I Y (j)
2
"2
2
−Y .
(2.31)
(2.32)
Some considerations are important:
• The new definition of variance is not deduced from the density function of the symbolic
variable.
• The new definition emerges only from the particularization of the definition of symbolic
covariance 1 when the two variables are the same.
• The expression of empirical variance 2 for interval-valued variables is the classical
definition of variance applied to the midpoints of the observed intervals of Y. Therefore,
1 X 2
2
cY (j) − Y .
m j=1
m
s22 (Y ) =
This behavior is not verified by the first approach of the variance in Expression (2.26).
• If we apply the classical definition of variance to the weighted mean value, for each
unit j ∈ {1, . . . , m} , of the variable Y, i.e. to Y (j), we obtain the expression of
symbolic variance 2 for histogram-valued variables. So,
1 X 2
2
s (Y ) =
Y (j) − Y .
m j=1
m
2
2
This behavior is not verified by the first definition of variance (variance 1) in
Expression (2.24).
• When the definition of symbolic variance 2 is applied to the particular case of the symbolic variable where all observations are degenerate, i.e. real values, it corresponds
to the classical definition of variance.
66
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
Since by particularizing the concept of symbolic covariance in Expressions (2.29) and
(2.30) the definition of symbolic variance in Expressions (2.24) and (2.26) is not obtained,
Billard and Diday (2006) proposed a new definition of covariance for interval-valued variables that afterwards is generalized to histogram-valued variables.
Symbolic Covariance 2
Billard and Diday (Billard and Diday (2006), Billard (2007)) proposed a new expression
to define symbolic covariance, named in this work covariance 2, that emerges from the
similarity between the expressions of classical variance and covariance. To understand the
proposed definition, it is convenient to compare those expressions. Let Y1 and Y2 be two
classical variables where to each individual j corresponds the real value yjk with k ∈ {1, 2}.
The classical variance of variable Y1 is defined as
1 X
2
(yj1 − y1 )
=
m j=1
m
s
2
(2.33)
and the classical covariance of the variables Y1 and Y2 , as
1 X
(yj1 − y1 ) (yj2 − y2 ) .
m j=1
m
cov(Y1 , Y2 ) =
(2.34)
Consider the definition of variance 1 for interval-valued variables deduced from the density
function and that may be written as in Expression (2.27),
2
2
1 X
s (Y ) =
I Y (j) − Y + I Y (j) − Y I Y (j) − Y + I Y (j) − Y .
3m j=1
m
2
1
Let Q = I Y (j) − Y
2
+ I Y (j) − Y
2
I Y (j) − Y + I Y (j) − Y , we may write
1 X 12 2
s (Y ) =
Q
.
3m j=1
m
2
1
(2.35)
So, in analogy with the classical definition, the symbolic covariance between two intervalvalued variables is defined as follows.
Definition 2.17. (Billard and Diday (2006)) For interval-valued variables Y1 and Y2 the
empirical covariance 2 is given by
1 X
1
cov2 (Y1 , Y2 ) =
G1 G2 [Q1 Q2 ] 2
3m j=1
m
(2.36)
2.4. DESCRIPTIVE STATISTICS
67
with
Qk = I Yk (j) − Y k
for k
cYk (j) =
∈
2
2
+ I Yk (j) − Y k I Yk (j) − Y k + I Yk (j) − Y k ,


 −1 if cYk (j) ≤ Y k
Gj =

 1
if cYk (j) > Y k
{1, 2} where Y k is the symbolic sample mean of variable Yk and
I Yk (j) +I Y
2
k (j)
is the midpoint, for each unit j ∈ {1, . . . , m} , of the variable Yk .
However, the Expression (2.36) has two more factors, G1 and G2 than the classical definition of covariance, in Expression (2.34). These factors are in the expression of symbolic
covariance because otherwise the covariance between two interval-valued variables would
always be non-negative.
Let us show that Qk , k ∈ {1, 2} is always non-negative. Since
Qk = I Yk (j) − Y k
2
+ I Yk (j) − Y k
2
I Yk (j) − Y k + I Yk (j) − Y k
to prove that Qk is non-negative it is sufficient to prove that
I Yk (j) − Y k
I Yk (j) − Y k ≥ 0.
• If Y k ≤ I Yk (j) as I Yk (j) ≤ I Yk (j) we have I Yk (j) − Y k ≥ 0 and I Yk (j) − Y k ≥ 0, so
Qk ≥ 0;
• If Y k ≥ I Yk (j) as I Yk (j) ≤ I Yk (j) we have I Yk (j) − Y k ≤ 0 and I Yk (j) − Y k ≤ 0, so
Qk ≥ 0;
I Yk (j) − Y k ≤ 0. We may rewrite Qk
2
I Yk (j) − Y k + I Yk (j) − Y k
− I Yk (j) − Y k I Yk (j) − Y k . So, in this
• If I Yk (j) ≤ Y k ≤ I Yk (j) we have I Yk (j) − Y k
as Qk =
situation we have also Qk ≥ 0.
In conclusion, in all situations Qk ≥ 0 and consequently, if the expression of covariance
did not have the factors G1 and G2 it would always be non-negative.
Analyzing what happens in the classical definition of covariance, we have that for each
unit j, and for each k ∈ {1, 2}, the values (yjk − yk ) and (yjk − yk ) are positive or negative
according to whether:


 yjk − yk ≤ 0

 y −y >0
jk
k
if
yjk ≤ yk
if
yjk > yk
.
68
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
So, in Expression (2.36) that defines covariance 2, an analogous behavior is imposed to
the centers of the intervals cYk (j) , to k ∈ {1, 2}. The considered conditions allow to obtain
the definition of the factors Gk .
According to this definition of covariance, it is obvious that when Y1 = Y2 = Y, Expression (2.36) reduces to the variance 1 of Y in Expression (2.27). Also, when all observations
of the interval-valued variables are degenerate intervals, Expression (2.36) reduces to the
classical expression of variance.
The Definition 2.17 of symbolic covariance of interval-valued variables was generalized
to histogram-valued variables.
Definition 2.18. (Billard and Diday (2006)) For histogram-valued variables Y1 and Y2 the
empirical covariance 2 is given by
nj1
nj2
1 XXX
1
cov2 (Y1 , Y2 ) =
pj1i1 pj2i2 G1 G2 [Q1 Q2 ] 2
3m j=1 i1 i2
m
(2.37)
with
Qk = I Yk (j)ik − Y k
2
+ I Yk (j)ik − Y k
Gk =


 −1

 1
2
I Yk (j)ik − Y k + I Yk (j)ik − Y k ,
if
Y k (j) ≤ Y k
if
Y k (j) > Y k
∈ {1, 2} where Y k is the symbolic sample mean of the variable Yk and
n
jk I
P
Yk (j)i +I Yk (j)ik
k
Y k (j) =
pjkik is the weighted mean value, for each observation
2
for k
ik
j ∈ {1, . . . , m} , of the variable Yk .
Although Definition 2.18 is only one generalization of Definition 2.17, the behavior of
this symbolic covariance for interval-valued variables and for histogram-valued variables is
different. For histogram-valued variables, if we particularize the definition of cov2 to the case
where Y1 = Y2 , we don’t obtain the definition of variance in Expression (2.25) deduced from
the density function.
2.4. DESCRIPTIVE STATISTICS
69
Symbolic Covariance 3
From the definition of variance 1, s21 , defined by the Expression (2.27), emerges a new
definition of covariance for interval-valued variables (Billard (2008), Xu (2010)), that in this
work will be named covariance 3, cov3 . To introduce this new definition, first it is important to
show how Billard (2007) decompose the Total Sum of Squares (T SS ) for an interval-valued
variable Y. T SS may be written as T SS = m × s21 .
Let cY (j) the midpoint of the interval Y (j), Y the symbolic mean of the interval-valued
variable Y and s21 the symbolic variance defined in Expression (2.27). Expanding the
definition of T SS describe above:
ms21 =
=
=
=
=
=
m
2
2
1X
I Y (j) − Y + I Y (j) − Y I Y (j) − Y + I Y (j) − Y
3 j=1
m
1X 2
2
2
I Y (j) + I Y (j) I Y (j) + I Y (j) − 3I Y (j) Y + 3Y − 3I Y (j) Y
3 j=1
m
1X 2
2
2
I Y (j) + I Y (j) I Y (j) + I Y (j) − 3 I Y (j) + I Y (j) Y + 3Y
3 j=1
!
"2
m
I Y (j) + I Y (j)
1X 2
2
I
+ I Y (j) I Y (j) + I Y (j) − 3
− 3(I Y (j) + I Y (j) )Y +
3 j=1 Y (j)
2
!
"2
I Y (j) + I Y (j)
2
+3Y + 3
2
!
"2
m
2
I Y (j) + I Y (j)
1X 2
2
I Y (j) + I Y (j) I Y (j) + I Y (j) − 3
+ 3 cY (j) − Y
3 j=1
2
!
"2
m
m
X
1 X I Y (j) − I Y (j)
+
(cY (j) − Y )2 .
3 j=1
2
j=1
Considering
1X
W SS =
3 j=1
m
!
I Y (j) − I Y (j)
2
and
BSS =
m
X
j=1
"2
2
1 X
I Y (j) − I Y (j)
12 j=1
m
=
cY (j) − Y
2
.
(2.38)
(2.39)
The T SS for the interval-valued variable Y may be decomposed as:
T SS = W SS + BSS.
(2.40)
70
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
Each term in Expression (2.38) corresponds to the internal variation of the single unit j.
When all observations of variable Y are degenerate intervals (classical data), W SS = 0
i.e., internal variation not exist in the classical case. Furthermore, Expression (2.38) is the
variance of a uniformly distributed variable, that is in accordance with the assumption that
the values within each interval Y (j) are uniformly distributed.
An expression that allows to obtain a decomposition of the Total Sum of Products (T SP )
between two interval-valued variables Y1 and Y2 , which is similar to Expression (2.40), was
first suggested by Billard (2007) and then presented in the work of Xu (2010). In this case:
T SP = W SP + BSP
(2.41)
where W SP and BSP are a extension of the definitions of W SS and BSS presented in
Expressions (2.38) and (2.39), respectively.
1 X
I Y1 (j) − I Y1 (j) I Y2 (j) − I Y2 (j)
12 j=1
m
W SP =
and
BSP =
m
X
j=1
!
I Y1 (j) + I Y1 (j)
−Y1
2
"!
"
I Y2 (j) + I Y2 (j)
−Y2 .
2
(2.42)
(2.43)
As m × cov(Y1 , Y2 ) = T SP, from the Expressions (2.42) and (2.43) the new definition of
covariance may be deduced.
Definition 2.19. For interval-valued variables Y1 and Y2 the empirical covariance 3 is
given by
m
1 X
cov3 (Y1 , Y2 ) =
I Y1 (j) − I Y1 (j) I Y2 (j) − I Y2 (j) +
12m j=1
!
"!
"
m
I Y2 (j) + I Y2 (j)
1 X I Y1 (j) + I Y1 (j)
+
−Y1
−Y2
m j=1
2
2
m
1 X
=
2 I Y1 (j) − Y 1 I Y2 (j) − Y 2 + I Y1 (j) − Y 1 I Y2 (j) − Y 2 +
6m j=1
+ I Y1 (j) − Y 1
I Y2 (j) − Y 2 + 2 I Y1 (j) − Y 1 I Y2 (j) − Y 2
where Y 1 and Y 2 are the symbolic means of variables Y1 and Y2 , respectively.
In the literature, this definition of covariance 3 and the decomposition of the T SS that
gave rise to it are not generalized for the case of histogram-valued variables.
2.4. DESCRIPTIVE STATISTICS
71
Independently of the definitions of variance and covariance used, the empirical correlation coefficient is calculated according to the next definition.
Definition 2.20. For histogram/interval-valued variables Y1 and Y2 the empirical correlation coefficient is given by
cov(Y1 , Y2 )
r(Y1 Y2 ) = p 2 2
sY 1 s Y 2
where cov(Y1 , Y2 ) is the empirical covariance function and sY1 , sY2 the symbolic variance of
the variables Y1 and Y2 , respectively.
2.4.1.3. Examples
The following examples illustrate the concepts presented in the previous sections.
Example 2.12. Consider again the interval-valued variable Y2 , “Age” in Table 1.1, where
Y2 (A) = [25, 83], Y2 (B) = [18, 90] and Y2 (C) = [20, 74].
The observation intervals span the interval I = [18, 90] because
min I Y (j) : j ∈ {A, B, C} = 18 and max I Y (j) : j ∈ {A, B, C} = 90.
Let us build a histogram of the Age with r = 6 subintervals, each with length 12. The
intervals and respective frequencies determined according to Definition 2.11 are given in
Table 2.6. Figure 2.20 represents the histogram of the interval-valued variable “Age”.
Table 2.6: Table of frequencies associated with interval-valued variable “Age” in Table 1.1.
Ig
f (Ig )
fr (Ig )
Fr (Ig )
[18, 30[
0.438
0.15
0.15
[30, 42[
0.596
0.2
0.35
[42, 54[
0.596
0.2
0.55
[54, 66[
0.596
0.2
0.75
[66, 78[
0.522
0.17
0.92
[78, 90]
0.253
0.08
1
The interval median of the interval-valued variable “Age” is [42,54[, because the cumulative relative frequency associated with this interval is 0.55. The equation that defines the cumulative frequency polygon in interval [42,54[, when the median is included, is
72
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
y = 0.0167x − 0.35. So, the median, obtained when x = 0.5, is 50.9. Similarly, we
can compute the value of the 1st and 3rd quartiles. In Figure 2.21 we may observe the
Relative Frequency %
cumulative frequency polygon of this variable and the respective quartiles.
0.3
0.2
0.1
0
18
30
42
54
66
78
Figure 2.20: Histogram of the interval-valued variable “Age” in Table 1.1.
Cumulative relative frequency %
1
0.75
0.5
0.25
0
18
30
st
35.9
1 quartile
42
50.9
Median
rd
3
66
quartile
78
90
Figure 2.21: Histogram of the cumulative relative frequency and the quartiles of the interval-valued variable
“Age” in Table 1.1.
The symbolic descriptive statistics defined in this section, for the variable “Age” are
presented in Table 2.7.
In symbolic-numerical-numerical category, the mean Age of the patients interviewed by
the three healthcare centers is approximately 52 years old. The symbolic variance was
calculated considering the two proposed definitions and very different values are obtained.
2.4. DESCRIPTIVE STATISTICS
73
Table 2.7: Descriptive statistics for the interval-valued variable “Age” in Table 1.1.
Descriptive statistics
Y 2 = 51.67
s21 = 329.33
s22 = 10.89
Example 2.13. Consider now the histogram-valued variable Y5 , “Waiting time for a consult”
in Table 1.2. The data relative to this variable are observed histograms of the waiting
time for a consult by patients aggregated by healthcare center. Let us construct the histogram on r = 6 subintervals. Associated with each interval is a frequency or relative frequency computed according to Definition 2.11. The complete information about the subintervals and respective frequencies is given in Table 2.8 and the histogram is represented in
Figure 2.22.
Table 2.8: Table of frequencies of the histogram-valued variable “Waiting time for a consult” in Table 1.2.
Ig
f (Ig )
fr (Ig )
Fr (Ig )
[0, 15[
1.4
0.467
0.467
[15, 30[
0.2
0.067
0.533
[30, 45[
0.9
0.3
0.833
[45, 60[
0.3
0.1
0.933
[60, 75[
0.1
0.033
0.967
[75, 90]
0.1
0.033
1
According to the observed results, 46.7% of the interviewed patients of the three healthcare centers waited up to 15 minutes for a consult and as the subinterval [0,15[ has the
highest relative frequency, [0,15[ is the modal interval. Similarly to the last example, [15,30[
is the median interval and the median value is 22.5.
The symbolic descriptive statistics defined in this section, for the variable “Waiting time
for a consult” are presented in Table 2.9
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
Relative Frequency %
74
0.5
0.4
0.3
0.2
0.1
0
15
30
45
60
75
90
Figure 2.22: Histogram of the histogram-valued variable “Waiting time for a consult” in Table 1.2.
Table 2.9: Descriptive statistics for the histogram-valued variable “Waiting time for a consult” in Table 1.2.
Descriptive statistics
Y 5 = 26.50
s21 = 452.75
s22 = 189.5
Example 2.14. Consider the symbolic data tables in Example 2.11. The levels of hematocrit
and hemoglobin of a set of patients attending in a healthcare center during one month were
aggregated, for patient, in intervals or distributions. Symbolic data tables, in Tables 2.4 and
2.5 presented the aggregated records.
In Table 2.10 the values of covariance calculated using the proposed definitions may
be compared. The results illustrated the conclusions presented when the definitions of
variance and covariance were introduced. The definition of covariance 1, when applied to
two equal variables, coincides with the definition of variance 2. In the case of the intervalvalued variables, the definitions of covariance 2 and covariance 3, when applied to two
equal variables, coincide with the definition of variance 1. However, the same is not verified
in the case of the histogram-valued variables.
2.4. DESCRIPTIVE STATISTICS
75
Table 2.10: Values of the covariance and variance for the data in Tables 2.4 and 2.5.
Interval variables
Histogram variables
Histogram variables 2
cov1 (X, Y )
4.2239
4.3276
4.3276
cov2 (X, Y )
6.0066
5.993
5.2263
cov3 (X, Y )
6.0313
-
-
cov1 (X, X)
1.3139
1.3688
1.3688
cov2 (X, X)
1.7841
1.6595
1.5624
cov3 (X, X)
1.7841
-
-
cov1 (Y, Y )
13.9692
14.1394
14.1394
cov2 (Y, Y )
21.5213
19.9008
17.7531
cov3 (Y, Y )
21.5213
-
-
s21 (X)
1.7841
1.8367
1.8367
s22 (X)
1.3139
1.3688
1.3688
s21 (Y )
21.5213
21.6990
21.6990
s22 (Y
13.9692
14.1394
14.1394
)
In this example, another goal was to compare the values obtained for the covariance and
variance, considering the histogram-valued variables rewritten either with the same number
or with a different number of subintervals. Based in the obtained results of this example,
the conclusion is that the variance obtained with both definitions is always the same, but
the values for the covariance are only the same when the definition of covariance 1 is
applied. Moreover, we may conclude that the values obtained when the several definitions
of variance and covariance are applied, are obviously different.
2.4.2. Symbolic-symbolic-symbolic category
The empirical distributions associated with the observations of a histogram-valued variables may be represented by classical histograms or by quantile functions. To work with this
elements in symbolic-symbolic-symbolic category, the representation more adequate is the
quantile function. Using this representation and consider the Mallows and the Wassertein
distance, Irpino and Verde (Irpino and Verde (2006); Verde and Irpino (2008)), Arroyo and
2
The observations associated with histogram-valued variable X and Y are rewritten with the same number
of subinterval.
76
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
Maté (2009), proposed some alternative descriptive statistics for distributions.
In this section, we will consider a histogram-valued variable Y and, that for all observations j, the distributions associated with each unit j are all defined with n pieces (see
Note 2.4).
2.4.2.1. Descriptive measures defined from the Mallows distance
The first descriptive measure defined from the Mallows distance may be interpreted as
a “mean” of histograms and was proposed by Irpino and Verde (2006).
Mallows Barycentric Histogram
Considering the Mallows distance, it is possible to compute a distribution, named as
Mallows barycentric histogram or simply barycentric histogram, Yb , represented by the
quantile function Ψ−1
Yb (t) and which is the solution of the minimization problem
min
m
X
2
−1
DM
(Ψ−1
Y (j) (t), ΨYb (t)).
(2.44)
j=1
According to Proposition 2.1 we may rewrite the previous problem as
min f (cb1 , rb1 , . . . , cbn , rbn ) =
m X
n
X
j=1 i=1
pi (cY (j)i
1
2
− cYbi ) + (rY (j)i − rYbi ) .
3
2
(2.45)
The optimal solution of this problem is obtained by solving a least squares problem. So, the
optimal solution for each i ∈ {1, . . . , n} is the solution of the following equation:

m

X

∂f


(cY (j)i − cYbi ) = 0

 ∂cbi = −2pi
j=1
m

2 X
∂f



=
−
p
(rY (j)i − rYbi ) = 0
i

 ∂rbi
3 j=1
⇔



m


1 X


cY (j)i

 cbi = m
j=1


m

1 X



r =
rY (j)i

 bi
m j=1
.
The optimal solution of the minimization problem (2.45) is a histogram where the center and
half range of the subinterval i is the classical mean, respectively, of the centers and of the
half ranges i, of all observations j, i.e.,
1 X
cbi =
cY (j)i
m j=1
m
1 X
rbi =
rY (j)i .
m j=1
m
and
2.4. DESCRIPTIVE STATISTICS
77
The barycentric histogram may be represent by the histogram:
HY b =
[cb1 − rb1 , cb1 + rb1 [ , p1 ; [cb2 − rb2 , cb2 + rb2 [ , p2 , . . . , [cbn − rbn , cbn + rbn ] , pn
(2.46)
or, by the quantile function, represented by Ψ−1
Yb (t) as follows:
Ψ−1
Yb (t) =



2t

−
1
c
+
rb1
b1

w1





 cb2 + 2(t−w1 ) − 1 rb2
w2 −w1










if
0 ≤ t < w1
if
w1 ≤ t < w 2
..
.
cbn +
2(t−wn−1 )
1−wn−1
− 1 rbn
if
.
(2.47)
wn−1 ≤ t ≤ 1
The barycentric histogram have some interesting properties. Using the operations
between functions, we may computed the average of the m quantile functions that represent
m given distributions.
Definition 2.21. Consider the m quantile functions Ψ−1
Y (j) (t), j ∈ {1, . . . , m}, all defined
with n pieces. The mean quantile function Ψ−1
Y (t) is the function where each piece is the
mean of the corresponding n pieces involved. The function is then,
Ψ−1
Y (t) =

m
P

rY (j)1
cY (j)1
2t

+
−
1


m
w1
m

j=1



r
m

P

cY (j)2
Yj2
2(t−w1 )


−
1
+
m
w2 −w1
m
0 ≤ t < w1
if
w1 ≤ t < w 2
j=1












..
.
m
P
j=1
cY (j)n
m
1 X −1
That is Ψ (t) =
Ψ (t).
m j=1 Y (j)
m
−1
Y
if
+
2(t−wn−1 )
1−wn−1
−1
r
Yjn
m
if
(2.48)
wn−1 ≤ t ≤ 1
Proposition 2.3. (Verde and Irpino (2010)) The quantile function Ψ−1
Yb (t), that represents
the barycentric histogram of m histograms, is the mean quantile function of the m quantile
functions that represent each observation of the histogram-valued variable Y, i.e. Ψ−1
Y (t).
So,
−1
Ψ−1
Yb (t) = ΨY (t).
The concepts of barycentric histogram and symbolic mean for a histogram-valued variable are related as we may see in the following proposition.
78
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
Proposition 2.4. Consider the histogram-valued variable Y observed to m units. Let
Ψ−1
Y (t) the quantile function that represents the barycentric histogram of the histogramvalued variable Y and its symbolic mean Y . We have
Y =
Z
1
Ψ−1
Y (t)dt.
0
Proof. Let Y a histogram valued variable observed in m units. Consider that the m distributions represented by the quantile functions Ψ−1
Y (j) (t) with j ∈ {1, 2, . . . , m} are all
defined with n subintervals and with the same weight associated with each subinterval in
all observations. The barycentric histogram description of the m histograms is defined as
in Expression (2.48).
So,
Z
1
−1
Y
Ψ (t)dt =
0
n Z
X
i=1
wi
m X
1
wi−1 j=1
n
m
m
cY (j)i +
2(t − wi−1 )
1
rY (j)i dt
−1
wi − wi−1
m
1 XX
(cY (j)i − rY (j)i ) (wi − wi−1 ) +
m i=1 j=1
2
2
wi − wi−1
2rY (j)i
2
− wi−1 wi + wi−1
+
wi − wi−1
2
m
n
1 XX
=
cY (j)i pi = Y
m j=1 i=1
=
with pi = wi − wi−1 .
From Proposition 2.4 we may conclude the following:
Proposition 2.5. Consider the histogram-valued variable Y observed in m units. The
mean value of the barycentric histogram, Y b , is the symbolic mean of the histogram-valued
variable Y.
If Ψ−1
Yb (t) is a non-negative function for all t ∈ [0, 1] , the symbolic mean of the barycentric histogram may be interpret as the area between the quantile function that represent the
barycentric histogram and the xx-axis.
According Irpino and Verde (2006), the identification of the barycenter using the Mallows
distance allows showing that it is possible to express a measure of inertia of the data using
this same distance. The total inertia (T I), with respect to the barycentric histogram of a
2.4. DESCRIPTIVE STATISTICS
79
set of m histogram data, Yb , is given by
TI =
m
X
2
DM
(Y (j), Yb ).
j=1
These authors prove that the Mallows distance allows for the Huygens theorem of decomposition of inertia for clustered data. As in classical statistics, also to histogram-valued
variables it is possible obtain a decomposition of the total inertia as follows:
T I = W I + BI
k X
k
X
X
2
2
(Y (j), Ybh ) +
|Ch |DM
(Ybh , Yb )
DM
=
h=1 i∈Ch
(2.49)
h=1
where |Ch | is the cardinality of the cluster Ch and Ybh is its barycenter, with
h ∈ {1, . . . , k} . (For more details see Irpino and Verde (2006)).
Verde and Irpino (2008), consider the the Mallows distance between quantile functions
may be consider a extension of the Euclidean distance between real values. According
this approach, the authors proposed a new definition of symbolic variance and symbolic
covariance.
Symbolic Variance
To understand the proposed definition, it is convenient rewrite the classical expressions
of variance using the Euclidean distance.
Let Y be a classical variable where to each unit j corresponds the real value yj . The
classical variance of the variable Y is defined as s2 =
1
m
m
P
j=1
2
(yj − y) .
Using the Euclidean distance, DE , between two real values, we have
m
m X
X
DE2 (yj1 , yj2 ) =
j1 =1 j2 =1
because
m P
m
P
j1 =1 j2 =1
m
m X
X
j1 =1 j2 =1
2
2
(yj1 − yj2 ) = 2m
(yj1 − yj2 ) may be rewritten as follows:
m
X
j1 =1
(yj1 − y)
2
80
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
m
m X
X
j1 =1 j2 =1
(yj1 − yj2 )
m
m
m
m
X
X
X
X
= m
yj21 − 2
yj1
yj2 + m
yj22
j1 =1
j2 =1
!j1 =1 m
" j2 =1
m
m
X
X
X
yj21 −
yj1
yj2
= 2 m
j1 =1
j1 =1
j2 =1
! !
""
m
m
X
1 X
= 2m
yj1 yj1 −
yj
m j2 =1 2
j1 =1
m
X
= 2m
yj21 − yj1 y
2
j1 =1
m
= 2m
X
j1 =1
m
= 2m
X
j1 =1
m
= 2m
X
j1 =1
(2.50)
m
X
yj21 − 2yj1 y + y 2 + 2m
yj1 y − 2m2 y 2
j1 =1
2
(yj1 − y) + 2my
m
X
j1 =1
yj1 − 2my
m
X
yj1
j1 =1
2
(yj1 − y) .
So, we may define, in classical statistics
s2 =
m X
m
X
DE2 (yj1 , yj2 )
j1 =1 j2 =1
(2.51)
.
2m2
By analogy, consider the Mallows distance, Verde and Irpino (2008) proposed to define
symbolic variance as
s
2
=
m X
m
X
2
M
D
j1 =1 j2 =1
Ψ
−1
Y (j1 )
(t), Ψ
2m2
−1
Y (j2 )
(t)
(2.52)
.
Proceeding to a similar simplification as in the classical situation in Expression (2.50), we
may rewrite Expression (2.52) of symbolic variance as:
s2 =
=
m X
m
X
j1 =1 j2 =1
−1
2
DM
Ψ−1
(t),
Ψ
(t)
Y (j1 )
Y (j2 )
m X
m Z
X
j1 =1 j2 =1
1
0
2m2
−1
Ψ−1
Y (j1 ) (t) − ΨY (j2 ) (t)
2m2
2
2.4. DESCRIPTIVE STATISTICS
2m
m Z
X
j1 =1
=
1
0
81
Ψ
−1
Y (j1 )
(t) − Ψ
−1
Y (j1 )
2m2
(t)
2
1 X 2
−1
DM Ψ−1
(t),
Ψ
(t)
Y (j1 )
Y (j1 )
m j1 =1
m
n
1 XX
1
2
2
=
pi (cY (j1 )i − cYbi ) + (rY (j1 )i − rYbi ) .
m j1 =1 i=1
3
m
=
(2.53)
Analogously, Verde and Irpino (2008) proposed a new concept for the covariance that allow
comparing two histogram-valued variables.
Symbolic Covariance
Consider the histogram-valued variables Y1 and Y2 . The covariance between these
variables may be defined as follows:
1 X
cov(Y1 , Y2 ) =
m j=1
m
Z
1
0
−1
Ψ−1
Y1 (j) (t) − ΨY1 (j) (t)
−1
Ψ−1
(t)
−
Ψ
(t)
dt.
Y2 (j)
Y2 (j)
(2.54)
The correlation mesure for two histogram-valued variables that was presented in Definition 2.20, may also be calculate using the the definition of covariance in Expression (2.54)
and the definition of variance in Expression (2.53).
2.4.2.2. Descriptive measures defined from the Wasserstein distance
Similarly to Irpino and Verde (2006), Arroyo and Maté (Arroyo and Maté (2009), Arroyo
(2008)) define barycentric histogram but consider the Wasserstein distance instead of the
Mallows distance. In both situations, all distributions involved to build the barycentric
histogram must be represented by histograms or quantile functions with equal number
of subintervals or pieces. For the Mallows barycentric histogram, the redefinition of the
distributions to comply with the previous condition, is performed according to the process
described in Section 2.2.3.1. To determine a barycentric histogram using Wasserstein
distance, the process is not exactly the same but is very similar. The only diference is
in the construction of the set of cumulative weights. For this case, we sort Z ∪ V without
repetition, where Z is the set that includes the cumulative weights without repetition and V
the set of points in the interval [0,1] where at least two quantile functions intersect.
82
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
Wasserstein barycentric histogram
The goal is now to compute a distribution, named Wasserstein barycentric histogram or
median histogram, represented by Ym , that is the solution of the minimization problem
m
X
min
2
−1
DW
(Ψ−1
Y (j) (t), ΨYm (t))
(2.55)
j=1
where Ψ−1
Ym (t) is the quantile function that represents Ym . From Proposition 2.2 we may
rewrite the previous problem as
min f (cm1 , . . . , cmn ) =
m X
n
X
j=1 i=1
pi |cY (j)i − cYmi |.
(2.56)
The optimal solution of this problem is obtained by solving a least absolute deviations
problem. The solution are the centers cYmi that are, for each i, the median the of centers
cY (j)i , of all observations j. So,
cmi = median {cY (j)i : j ∈ {1, . . . , m}} .
To build the subintervals [cmi − rmi , cmi + rmi ] that compose the median histogram, we
have for each fixed i, cmi = cY (j)i for a given j ∈ {1, . . . , m} , where the corresponding
half range rmi will be rmi = rY (j)i , for the same j.
The median histogram may be represented by
HY m =
n
[cm1 − rm1 , cm1 + rm1 ] , p1 ; [cm2 − rm2 , cm2 + rm2 ] , p2 , . . . ,
o
[cmn − rmn , cmn + rmn ] , pn
(2.57)
or, by the quantile function Ψ−1
Ym (t),
Ψ−1
Ym (t) =



2t

−
1
c
+
rm1
m1

w1





 cm2 + 2(t−w1 ) − 1 rm3
w2 −w1










if
0 ≤ t < w1
if
w 1 ≤ t < w2
..
.
cmn +
2(t−wn−1 )
1−wn−1
− 1 rmn
if
.
(2.58)
wn−1 ≤ t ≤ 1
The barycentric histogram obtained using the Wasserstein distance resembles the median. Because of this, it is normally designated by median histogram.
The measures of location and dispersion that we presented in this section for histogramvalued variables, may also be deduced for interval-valued variables.
2.4. DESCRIPTIVE STATISTICS
83
2.4.2.3. Examples
We will now present some examples that illustrate the descriptive statistics defined in
symbolic-symbolic-symbolic category. With these examples we observe that the barycentric histogram corresponds to the ”gravity center” of a set of histograms.
Example 2.15. Consider the histograms HY1 = {[1, 2[, 0.7; [2, 3[, 0.2; [3, 4], 0.1} and
HY2 = {[11, 12[, 0.1; [12, 13[, 0.2; [13, 14], 0.7} . These histograms may be represented by
the quantile functions as follows, defined with equal number of pieces:


2t

1.07 + 0.1
− 1 × 0.07






2(t−0.1)


−
1
× 0.14
1.29
+

0.2


2(t−0.3)
Ψ−1
1.71
+
−
1
× 0.29
Y1 (t) =
0.4






2.50 + 2(t−0.7)
− 1 × 0.50


0.2




 3.50 + 2(t−0.9) − 1 × 0.50
0.1


2t

−
1
× 0.50
11.50
+

0.1





2(t−0.1)


12.50
+
−
1
× 0.50

0.2


Ψ−1
13.29 + 2(t−0.3)
− 1 × 0.29
Y2 (t) =
0.4





2(t−0.7)

− 1 × 0.14

 13.71 +
0.2




 13.93 + 2(t−0.9) − 1 × 0.07
0.1
if
0 ≤ t < 0.1
if
0.1 ≤ t < 0.3
if
0.3 ≤ t < 0.7
if
0.7 ≤ t < 0.9
if
0.9 ≤ t ≤ 1
if
0 ≤ t < 0.1
if
0.1 ≤ t < 0.3
if
0.3 ≤ t < 0.7 .
if
0.7 ≤ t < 0.9
if
0.9 ≤ t ≤ 1
For this example the barycentric histogram is represented in Figure 2.23; in Figure 2.24
it is defined by the respective quantile function:


2t

6.29 + 0.1
− 1 × 0.29






2(t−0.1)


6.89 +
− 1 × 0.32

0.2


2(t−0.6)
Ψ−1
−
1
× 0.29
7.50
+
Yb (t) =
0.1






8.11 + 2(t−0.7)
− 1 × 0.32


0.2




 8.71 + 2(t−0.9) − 1 × 0.29
0.1
if
0 ≤ t < 0.1
if
0.1 ≤ t < 0.3
if
0.3 ≤ t < 0.7 .
if
0.7 ≤ t < 0.9
if
0.9 ≤ t ≤ 1
84
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
Figure 2.23: Barycentric histogram of the histograms in Example 2.15.
Ψ−1
(t)
Y
1
Ψ−1
(t)
Y
2
Ψ−1
(t)
Y
b
15
12
9
6
3
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Figure 2.24: Quantile function that represents the barycentric histogram of the histograms in Example 2.15.
Observing these graphical representations of the barycentric histogram we may see that
the location and shape of the barycenter are the result of the averaging the features of the
distributions involved in its construction.
In this case the number of distributions considered to calculate the median histogram is
even, because of this the median histogram is not unique. In this situation we may consider
that median histogram is the barycentric histogram.
2.4. DESCRIPTIVE STATISTICS
85
Example 2.16. Consider once now the histogram-valued variable Y5 , “Waiting time for a
consult” in Table 1.2. In Example 2.10 we rewrote the three quantile functions with the
same number of pieces, according to the process described in Section 2.2.3.1. For this
example, the barycentric histogram, in Figure 2.25, is


2t

8.23 + 0.1
− 1 × 3.23






2(t−0.1)


17.19
+
−
1
× 5.73

0.5


Ψ−1
29.90 + 2(t−0.6)
− 1 × 1.98
Yb (t) =
0.1





2(t−0.7)

35.94
+
−
1
× 4.06


0.1




 52.50 + 2(t−0.8) − 1 × 12.50
0.1
if
0 ≤ t < 0.1
if
0.1 ≤ t < 0.6
if
0.6 < t < 0.7
if
0.7 ≤ t < 0.8
if
0.8 ≤ t ≤ 1
or, alternatively, may be represented by:
HYb =
Ψ−1 (A)
Y
80
5
[5.00, 11.46[ , 0.1; [11.46, 22.92[ , 0.5; ]27.92, 31.88[ , 0.1;
[31.88, 40.00[ , 0.1; [40.00, 65.00] , 0.2 .
Ψ−1 (B)
Y
5
Ψ−1 (C)
Y
5
Ψ−1
Y
b
60
40
20
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Figure 2.25: Quantile function that represents the barycentric histogram of the histograms in Example 2.16.
In this example we may observed that when we have subintervals with null weight
in histograms, the quantile function that represents the barycentric histogram will be a
discontinuous function. The point of discontinuity corresponds to one of the cumulative
values considered in set Z, built during the process of rewriting all histograms with equal
number of pieces, described in Section 2.2.3.1.
In this case, as the number of observations associated with this variable is odd, the
median histogram of the histogram-valued variable is unique and corresponds to the distribution Y5 (C). The situation is illustrated in Figure 2.26.
86
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
Ψ−1
Y
80
m
−1
ΨY
5
−1
(A)
ΨY
5
−1
ΨY
(B)
5
(C)
60
40
20
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Figure 2.26: Quantile function that represents the median histogram of the histograms in Example 2.16.
All definitions considered in symbolic-symbolic-symbolic category can also be particularized to interval-valued variable. We will illustrate one situation in next example.
Example 2.17. Consider the interval-valued variable Y2 , “Age” in Table 1.1. For this variable
we have three observations. The barycenter of this interval-valued variable is the interval
Yb = [21, 82.3] that is represented in Figure 2.27.
Ψ−1 (A)
Ψ−1 (B)
Ψ−1 (C)
2
2
2
Y
Y
Y
Ψ−1
Y
b
80
60
40
20
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Figure 2.27: Quantile function that represents the barycentric interval of the intervals in Example 2.17.
However, as to compute the median histogram we have to rewrite the distributions considering not only the cumulative weights without repetition that we obtain in all observation,
but also the point in interval [0, 1] where at least two quantile functions intersect. As such,
to calculate the median histogram for this interval-valued variable, we rewrite each interval
considering the cumulative weights 0, 19 , 12 , 1 .
2.5. CONCLUSION
87
Consider the three intervals that represent the range of the age of the patients of each
healthcare center, rewritten as quantile functions with three pieces. The median histogram,
represented in Figure 2.28, for this case is


2t


−
1
23
+
×3
1


9


2(t− 19 )
−1
40 +
− 1 × 14
ΨYm (t) =
7
18




2(t− 12 )

137

− 1 × 292
 2 +
1
2
Ψ−1
Y
80
m
Ψ−1 (A)
Ψ−1 (C)
Ψ−1 (B)
2
2
2
Y
Y
if
0≤t<
1
9
if
1
9
≤t<
1
2
if
1
2
≤t≤1
.
Y
60
40
20
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Figure 2.28: Quantile function that represents the median histogram of the intervals in Example 2.17.
2.5. Conclusion
This chapter presents the concepts that support the study developed in this work.
New kinds of variables were proposed and it is now important to develop their respective
statistical concepts and methods. In the literature, the best studied symbolic variables are
the interval-valued variables. At the present moment, the research with histogram-valued
variables is only beginning to emerge. Each observation in a histogram-valued variable
has a lot of information associated (namely the distribution of the values recorded for each
observed unit). This values the statistical study and the results, but at the same time it
makes the study much more complex.
The difficulty in working with histogram-valued variables led to the search for alternative
representations of the histogram distributions. The quantile functions (the inverse of the
cumulative functions) seem a good option since we are working with functions that avoid the
88
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
histogram arithmetics. However, the representation of the distributions by quantile functions
is not a problem-free solution. Quantile functions are a special kind of functions that are
elements of a semi-vector space because when a quantile function is multiplied by −1 we
obtain a function that cannot be a quantile function because it is not a non-decreasing
function and consequently cannot represent a histogram. Given a distribution represented
by a quantile function Ψ−1 (t), with t ∈ [0, 1], the quantile function that represents the
respective symmetric histogram is given by the quantile function −Ψ−1 (1 − t). Intervalvalued variables being a particular case of histogram-valued variables, the intervals that are
the observations of interval-valued variables may also be represented by quantile functions.
This representation is not considered in the literature, however, it has advantages because
it is more simple to work with intervals and allows taking into account the distribution within
the intervals. In Symbolic Data Analysis, in general, an Uniform distribution is assumed
within the intervals, however this consideration is too strict. Opting by the representation
of the observations of the interval-valued variables by quantile functions, other distributions
may be considered.
In classical statistics the observations of the quantitative variables are real numbers
(points) and consequently the dissimilarity between two classical variables is often measured calculating the sum of the square of the difference between the observations, i.e.
using the Euclidean distance. In the case of the interval/histogram-valued variables the
Mallows distance is a measure suggested in the literature as adequate to measure the
dissimilarity between distributions.
The concepts and methods developed so far for symbolic variables are essentially
descriptive. In the last sections of this chapter we presented and deduced the main descriptive statistics already proposed for interval/histogram-valued variables. The proposals for these concepts are divided in two categories. It was in the symbolic-numericalnumerical category that the first concepts were proposed and in some cases more than
one definition was proposed for the same concept. The results of the measures proposed
in this category are real numbers, despite working with elements that are distributions or
intervals. The variability in each observation is considered in some definitions to obtain
the final result. However, obtaining a real number as a descriptive measure when symbolic
variables are being considered seems to provide a poor outcome. As an alternative to the
concepts proposed in the symbolic-numerical-numerical category, new proposals included
2.5. CONCLUSION
89
in symbolic-symbolic-symbolic category have recently emerged. In this category the result
of a descriptive measure is an element of the same kind of the observations of the symbolic
variable. For example, the definition of “mean” for histogram-valued variables produces
a mean distribution, named barycentric histogram and not a real number as in symbolicnumerical-numerical category.
90
CHAPTER 2. HISTOGRAM AND INTERVAL DATA
3. Linear Regression Model for symbolic
data - state of the art
In this chapter we present a selection of linear regression models that have been proposed for interval and histogram data. Several proposed models for interval data are
not developed with a probabilistic assumption and just a few of them attempt to make a
statistical inference on the estimated model. Linear models for interval data have been
much studied. Contrary as to what happened with some concepts and methods, as for
example, descriptive statistics, as we see in Section 2.4.1 of the Chapter 1, it is not easy
to generalize linear regression models proposed for intervals to histogram data. The complexity of histograms may prevent a generalization of the models. For this data type only
two approaches have been presented.
3.1. Linear Regression Models for interval-valued variables
In the literature interval data have been studied following two different lines of research
and linear regression models were proposed in both cases. These approaches are associated with the kind of data that the intervals are representing:
• When intervals are used to represent data with variability, i.e. in the Symbolic Data
Analysis approach, the most usual is to represent them as vectors with two components: the lower and upper bounds or the center and half range. Under this
approach, it is also common to assume uniformity within each interval.
• When intervals are used to represent imprecise data, they are considered as nonempty
compact sets, i.e., the intervals are elements of KC (R) and models with these ele-
ments were developed considering the space KC (R), +, ·) where the operations +
91
92
CHAPTER 3. STATE OF THE ART
and · are defined according to the interval arithmetic, presented in Section 2.2.1. This
space has elements that do not have a symmetric in the space (Gil et al. (2001)).
Studies involving both approaches have been based in a linear fitting problem for a given
sample of interval data without probability assumptions on the data and that do not allow for
inference studies on the estimated model. More recently, linear regression models where a
probabilistic assumption is considered have been proposed.
3.1.1. Descriptive linear regression models
3.1.1.1. Linear regression models for data with variability
Considering interval-valued variables in the context of Symbolic Data Analysis, the most
noteworthy models that have been proposed are: the Center Method (Billard and Diday
(2000)); the MinMax Method (Billard and Diday (2002)); the Center and Range Method
(Lima Neto and De Carvalho (2008)) and the Constrained Center and Range Method
(Lima Neto and De Carvalho (2010)). Other contributions have recently emerged (Xu
(2010), Yang et al. (2011) and Giordani (2011, 2014)). With all these methods it is possible
to predict a response interval-valued variable from n explicative interval-valued variables.
Because most of the above-mentioned models do not treat the intervals as intervals, they
are in the symbolic-numerical-symbolic category. As a consequence, the adjustment of
classical linear regression models is required and therefore it is not the closeness between
the observed and predicted intervals that is quantified.
The probabilistic assumptions that involve the linear regression model theory for classical data will not be considered in the models for symbolic data presented in this section. In
almost all of the presented models, the problem that will be investigated is an optimization
problem, in which we seek to minimize a predefined criterion.
Next we present the main linear regression models already proposed. In the methods
that we describe we have considered the following notation.
Let Y be the response interval-valued variable and Xk , with k ∈ {1, . . . , p} the p
explicative interval-valued variables. To each unit j, of the interval-valued variable Y, corres-
ponds the interval IY (j) = I Y (j) , I Y (j) and to each unit j, of the interval-valued variable
Xk corresponds the interval IXk (j) = I Xk (j) , I Xk (j) . The centers and half ranges of the
interval IY (j) are defined as: cY (j) =
I Y (j) +I Y (j)
2
and rY (j) =
I Y (j) −I Y (j)
2
, likewise for IXk (j) ,
3.1. LINEAR REGRESSION MODELS FOR INTERVAL VARIABLES
93
with k ∈ {1, . . . , p} .
The predicted interval that is the range of values estimated for each observation of the
dependent variable Y is denoted IYb (j) = I Yb (j) , I Yb (j) .
Methods based in Symbolic Covariance definitions
The first linear regression model for interval-valued variables was proposed by Billard
and Diday (2000) and was named Center method (CM) by Lima Neto and De Carvalho
(2008). This method predicts the bounds of the intervals separately and therefore does not
predict intervals from other intervals loosing the variability of the data.
According to the interpretation of Lima Neto and De Carvalho (2008), the parameters
of the model are estimated by reducing the intervals to its centers and then applying a
classical linear regression model to these midpoints. Consider the set of interval data
{Y (j), X1 (j), X2 (j), . . . , Xp (j)}m
j=1 . The centers of the intervals Y (j) and the centers of
the intervals X1 (j), . . . , Xp (j) are related according to the classical linear relation:
cY (j) = b0 + b1 cX1 (j) + b2 cX2 (j) + . . . + bp cXp (j) + ec (j).
(3.1)
The prediction of the centers of the intervals cY (j) , can be obtained from Expression
(3.1), that may be written in matricial notation as:
b c = Xc b
Y
where




c
X =



1 cX1 (1)
. . . cXp (1)
1 cX1 (2)
. . . cXp (2)
..
.
..
.
..
.
..
.
1 cX1 (m) . . . cXp (m)




,



(3.2)

 cYb (1)


 cYb (2)
bc = 
Y

 ..
 .


cYb (m)










 and b = 







b0
b1
..
.
bp




.



The estimation of the parameters b0 , b1 , . . . , bp is obtained by minimizing the sum of
squares of deviations:
m
X
j=1
cY (j) − b0 − b1 cX1 (j) − . . . − bp cXp (j)
2
.
(3.3)
Using the Least Squares method, the optimal solution of the problem is obtained. This
solution is calculated by differentiating the function in Expression (3.3) with respect to the
94
CHAPTER 3. STATE OF THE ART
parameters and setting the results equal to zero. The normal equations are defined from
the centers of the intervals and the resulting system is the one obtained in the classical
situation but now applied to the centers of the intervals:
b0 m + b1
m
P
cX1 (j) + . . . + bp
j=1
b0
m
P
b0
m
P
cXp (j) =
j=1
cX1 (j) + b1
j=1
..
.
m
P
m
P
j=1
m
P
cY (j)
j=1
(cX1 (j) )2 + . . . + bp
j=1
cXp (j) + b1
m
P
m
P
cXp (j) cX1 (j) =
j=1
cX1 (j) cXp (j) + . . . + bp
j=1
m
P
cY (j) cX1 (j)
j=1
m
P
(cXp (j) )2 =
j=1
m
P
(3.4)
cY (j) cXp (j)
j=1
Considering the matrices

m
P
m
P

m
cX1 (j)
...
cXp (j)

j=1
j=1

 P
m
m
P
P
 m
cX1 (j)
(cX1 (j) )2 . . .
cX1 (j) cXp (j)


j=1
j=1
A =  j=1

..
..
..
..

.
.
.
.


 m
m
m
P
P
 P
cXp (j)
cX1 (j) cXp (j) . . .
(cXp (j) )2
j=1
j=1
j=1


m
P


cY (j)


j=1



 P

 m
cY (j) cX1 (j)




 and d =  j=1


..


.




 m


 P
cY (j) cXp (j)
j=1








.






If the matrix A is non singular, we may obtain the estimated parameters in matrix notation
by rewriting the previous system of linear equations as:
Ab = d ⇔ b = A−1 d.
(3.5)
As A = (Xc )T Xc and d = (Xc )T Yc , if Xc is a non singular matrix, the parameters
estimated from the matrices Xc and Yc are obtained as follows:
b = A−1 d ⇔ b = ((Xc )T Xc )−1 ((Xc )T Yc ).
(3.6)
In Section 2.4.1, we observed that the symbolic definitions of covariance 1 in Expression
(2.30), variance 2 in Expression (2.32) and symbolic mean in Expression (2.23) are the
classical definitions applied to cY (j) and cXk (j) , with k ∈ {1, 2, . . . , p} . Because of this, and
as equivalently to Expression (3.6), the estimated parameters for b may be obtained from
3.1. LINEAR REGRESSION MODELS FOR INTERVAL VARIABLES








b1
b2
..
.
bp


2
s (X1 )
cov(X1 , X2 ) . . . cov(X1 , Xp )





 cov(X2 , X1 )
s2 (X2 )
. . . cov(X2 , Xp )
 = 
..
..
..
..




.
.
.
.


cov(Xp , X1 ) cov(Xp , X2 ) . . .
s2 (Xp )
and
b0 = Y −
p
X
95
−1 







cov(X1 , Y )


 cov(X2 , Y )

..


.

cov(Xp , Y )
bk X k ,








(3.7)
(3.8)
k=1
the estimated values of the parameters, using the previous symbolic definitions of cov1
and s22 , are the ones obtained from the classical definitions applied to the centers of the
intervals.
In the particular case when we have only one explicative variable, we have
b1 =
cov(X, Y )
s2 (X)
and b0 = Y − b1 X.
In spite of the interpretation of Lima Neto and De Carvalho (2008), the authors of the
linear relation methods based in symbolic covariance definitions (Billard and Diday (2000,
2006), Xu (2010)), consider a relation between interval-valued variables as an adaption of
the classical linear relation between real values.
Considering the set of interval data {Y (j), X1 (j), X2 (j), . . . , Xp (j)}m
j=1 , the intervals
Y (j) and the intervals X1 (j), . . . , Xp (j) are related according to:
Y (j) = b0 + b1 X1 (j) + b2 X2 (j) + . . . + bp Xp (j) + e(j).
(3.9)
The estimation of the parameters are obtained by an adaptation of the solution obtained
by the Least Square estimation method for the classical linear model, where the symbolic
definitions of variance and covariance are used. When the definitions of variance 2 and
covariance 1 are applied, the model in Expression (3.9) is equivalent to CM in Expression
(3.1), the classical model applied to the centers of the intervals. For other definitions
of covariance and variance that were proposed, the values of the parameters may be
estimated in a similar way. In Billard and Diday (2006), the Definition 2.17 of covariance
2 and the definition of variance 1 in Expression (2.26) are used. In another approach,
Xu (2010) proposes the Symbolic Covariance method (SCM) when the parameters are
96
CHAPTER 3. STATE OF THE ART
also estimated from Expressions (3.7) and (3.8), but uses the Definition 2.19 of symbolic
covariance 3 and the definition of symbolic variance 1, in Expression (2.26). So, the process
used by Billard and Diday (2006) and Xu (2010) to estimate parameters is not deduced from
the model.
Concerning the symbolic definitions of variance 1, covariance 2 and covariance 3, the
authors consider that the internal variations within observations is not lost because the
symbolic definitions incorporate the internal variations and the variations between observations. Because of this Xu (2010) considers that his model is inherently a truly ”symbolic”
approach.
In the three methods based in symbolic covariance definitions, the intervals of the
response variable are obtained by predicting separately the bounds of the intervals from
the estimated parameters as follows:
I Yb (j) = b0 + b1 I X1 (j) + b2 I X2 (j) + . . . + bp I Xp (j) ;
I Yb (j) = b0 + b1 I X1 (j) + b2 I X2 (j) + . . . + bp I Xp (j) .
(3.10)
Obviously, the estimated values of the parameters b0 , b1 , . . . , bp may be any real numbers. However, when these values are negative, we may obtain a range of values where
the upper bound of the interval may be smaller than the respective lower bound in some
situations. In other words, an interval of real numbers is not necessarily obtained. In this
case, the appropriate solution may be selected using the two predicted values, the lower
value for the lower bound and the higher for the upper bound of the interval, such that for
each unit j,
IYb (j) = I Yb (j) , I Yb (j)
with
(
)
p
p
X
X
I Yb (j) = min b0 +
bk I Xk (j) , b0 +
bk I Xk (j) ;
k=1
k=1
(
)
p
p
X
X
I Yb (j) = max b0 +
bk I Xk (j), b0 +
bk I Xk (j) .
k=1
(3.11)
k=1
MinMax method
In 2002, Billard and Diday (2002), proposed other similar models, of which we highlight
the method that Lima Neto and De Carvalho (2008) named MinMax method. In this method,
the bounds of the intervals are also estimated separately, but in these cases, the coefficients
3.1. LINEAR REGRESSION MODELS FOR INTERVAL VARIABLES
97
of the model are estimated applying the classical linear regression model separately to the
lower and upper bounds of the intervals, and therefore the coefficients are not the same.
Consider the set of interval data {Y (j), X1 (j), X2 (j), . . . , Xp (j)}m
j=1 . In the MinMax
method, the linear relation of the upper and lower bounds of the intervals Y (j) with the
upper and lower bounds of the intervals Xk (j), with k ∈ {1, 2, . . . , p}, respectively are
classical linear relations, as follows:
I Y (j) = bL0 + bL1 I X(j)1 + bL2 I X(j)2 + . . . + bLp I X(j)p + e(j);
(3.12)
I Y (j) = bU0 + bU1 I X(j)1 + bU2 I X(j)2 + . . . + bUp I X(j)p + e(j).
Considering the interpretation of Lima Neto and De Carvalho (2008), the vectors of the
L
L
U
U
U
parameters bL
0 , b1 , . . . , bp and b0 , b1 , . . . , bp , used to predict the bounds of the interval,
are estimated independently. Their estimations are obtained by minimizing the sum of the
squares of the deviations:
m
X
j=1
!
I Y (j) − bL0 −
p
X
bLk I X(j)k
k=1
"2
+
m
X
j=1
!
I Y (j) − bU0 −
p
X
bUk I X(j)k
k=1
"2
.
(3.13)
Applying the Least Squares method, the optimal solution of the minimization problem
(3.13) is the solution of the system of linear equations obtained by differentiating the function
(3.13) in order to all parameters and setting the results equal to zero, as follows:
bL0 m + bL1
bL0
..
.
bL0
U
0
m
P
j=1
m
P
j=1
..
.
bU0
m
P
j=1
m
P
j=1
j=1
I X1 (j) + . . . + bLp
I X1 (j) + bL1
I Xp (j) + bL1
b m+b
bU0
m
P
U
1
m
P
m
P
j=1
m
P
j=1
U
p
m
P
(I X1 (j) )2 + . . . + bUp
j=1
m
P
j=1
I Y (j)
j=1
m
P
I Xp (j) I X1 (j) =
j=1
m
P
j=1
I Xp (j) =
j=1
m
P
m
P
I X1 (j) I Xp (j) + . . . + bLp
I X1 (j) + . . . + b
I Xp (j) + bU1
j=1
I Xp (j) =
(I X1 (j) )2 + . . . + bLp
j=1
I X1 (j) + bU1
m
P
m
P
(I Xp (j) )2 =
I Y (j)
m
P
j=1
I Y (j) I X1 (j)
m
P
j=1
I Y (j) I Xp (j)
(3.14)
j=1
m
P
I Xp (j) I X1 (j) =
j=1
I X1 (j) I Xp (j) + . . . + bUp
m
P
I Y (j) I X1 (j)
j=1
m
P
j=1
(I Xp (j) )2 =
m
P
j=1
I Y (j) I Xp (j)
98
CHAPTER 3. STATE OF THE ART
In matricial notation, the previous system of 2(p + 1) equations may be written by the
equation:
(3.15)
Ab = d

 AI
where b is the vector of the 2(p + 1) parameters, A = 

m
P
Op+1

Op+1 
 with
AI
m
P

m
I X1 (j)
...
I Xp (j)

j=1
j=1

 P
m
m
P
P
 m
I X1 (j)
(I X1 (j) )2
...
I Xp (j) I X1 (j)


j=1
j=1
AI =  j=1

..
..
..
..

.
.
.
.


 m
m
m
P
P
 P
I Xp (j)
I X1 (j) I Xp (j) . . .
(I Xp (j) )2
j=1
j=1
j=1
analogously for AI ; Op+1 is the null matrix of order p + 1; d =
dI =
m
P
j=1
I Y (j) ,
m
P
j=1
I Y (j) I X1 (j) , . . . ,
m
P
j=1
I Y (j) I Xp (j)
T








,






h
dI
dI
iT
with
and analogously for dI . If the matrix
A is non singular, the matrix of the estimated parameters b is given by
b = A−1 d.
Therefore, the obtained parameters that estimate the lower bound of the intervals are
different from those obtained for estimating the upper bound of the intervals. The bounds
of the intervals of the response variable are predicted as follows:
I Yb (j) = bL0 + bL1 I X1 (j) + bL2 I X2 (j) + . . . + bLp I Xp (j) ;
I Yb (j) = bU0 + bU1 I X1 (j) + bU2 I X2 (j) + . . . + bUp I Xp (j) .
(3.16)
However, similarly as to what happens with the methods based in symbolic covariance
definitions, the estimated value for the upper bound of the interval may be smaller than that
for the lower bound, in which case an interval is not obtained. This may happen if some
coefficients in the models are negative. When this situation occurs, the proposed solution
is the same as the one presented in Expression (3.11).
Center and Range method
In 2008, Lima Neto and De Carvalho (2008) proposed another linear regression model
for interval data, the Center and Range method (CRM). CRM considers that the intervals
3.1. LINEAR REGRESSION MODELS FOR INTERVAL VARIABLES
99
are defined by their centers and half ranges. This model is very similar to the previous
one but in this case it is the centers and the half ranges of the intervals that are estimated separately. The coefficients of the model are obtained by applying the classical linear regression model to the centers and half ranges. Let the set of interval data
{Y (j), X1 (j), X2 (j), . . . , Xp (j)}m
j=1 . The linear regressions between the center/half ranges
of the intervals Y (j) and the center/half ranges of the intervals Xk (j), with k ∈ {1, 2, . . . , p}
are the classical linear relations, given by:
cY (j) = bc0 + bc1 cX1 (j) + bc2 cX2 (j) + . . . + bcp cXp (j) + ec (j);
(3.17)
rY (j) = br0 + br1 rX1 (j) + br2 rX2 (j) + . . . + brp rXp (j) + er (j).
Also in this method it is assumed that the values of the center and the half range of the
intervals are independent and the values of the parameters bc0 , bc1 , . . . , bcp and br0 , br1 , . . . , brp
are estimated separately applying the Least Squares method to two classical linear regression models, one for the centers and another for the half ranges of the intervals.
The estimation of the parameters is in this case obtained by
min
m
X
j=1
!
cY (j) − bc0 −
p
X
bck cX(j)k
k=1
"2
+
m
X
j=1
!
rY (j) − br0 −
p
X
k=1
brk rX(j)k
"2
.
(3.18)
The solution of the optimization problem (3.18) is the solution of the following system of
2(p + 1) equations:
bc0 m + bc1
bc0
..
.
bc0
r
0
m
P
j=1
m
P
j=1
..
.
br0
m
P
j=1
m
P
j=1
cX1 (j) + . . . + bcp
j=1
cXp (j) + bc1
r
1
m
P
m
P
m
P
j=1
j=1
r
p
m
P
rXp (j) =
j=1
(rX1 (j) )2 + . . . + brp
j=1
m
P
cXp (j) cX1 (j) =
j=1
m
P
cY (j) cX1 (j)
j=1
cX1 (j) I Xp (j) + . . . + bcp
rX1 (j) + . . . + b
rXp (j) + br1
m
P
j=1
m
P
m
P
cY (j)
j=1
(cX1 (j) )2 + . . . + bcp
j=1
rX1 (j) + br1
m
P
cXp (j) =
j=1
cX1 (j) + bc1
b m+b
br0
m
P
m
P
(cXp (j) )2 =
j=1
m
P
m
P
cY (j) cXp (j)
j=1
rY (j)
(3.19)
j=1
m
P
I Xp (j) rX1 (j) =
j=1
rX1 (j) I Xp (j) + . . . + brp
m
P
rY (j) rX1 (j)
j=1
m
P
j=1
(rXp (j) )2 =
m
P
j=1
rY (j) rXp (j)
100
CHAPTER 3. STATE OF THE ART
The system (3.19) may be represented by the matricial equation
(3.20)
Ab = d,
where b is the vector of the parameters to be estimated. The matrices A and d are defined
as:

c
 A
with
A=

Op+1

Op+1 

Ar
m
P
m
P

m
cX1 (j)
...
cXp (j)

j=1
j=1

 P
m
m
P
P
 m
cX1 (j)
(cX1 (j) )2 . . .
cXp (j) cX1 (j)


j=1
j=1
Ac =  j=1

..
..
..
..

.
.
.
.


 m
m
m
P
P
 P
cXp (j)
cX1 (j) cXp (j) . . .
(cXp (j) )2
j=1
j=1
j=1
analogously for Ar ; Op+1 is the null matrix of order p + 1; d =
dc =
P
m
j=1
P
m
cY (j) ,
j=1
P
m
cY (j) cX1 (j) , . . . ,
j=1
cY (j) cXp (j)
T








,






h
dc dr
iT
with
and analogously for dr . If the matrix
A is non singular, the matrix of the estimated parameters b is given by
b = A−1 d.
From the following equations, it is possible to predict the values of the centers and half
ranges of the observations of the interval-valued variable Y as follows:
cYb (j) = bc0 + bc1 cX1 (j) + bc2 cX2 (j) + . . . + bcp cXp (j) ;
rYb (j) = br0 + br1 rX1 (j) + br2 rX2 (j) + . . . + brp rXp (j) .
(3.21)
From the estimated values of the cY (j) and rY (j), the bounds of the new intervals are
given by
I Yb (j) = cYb (j) − rYb (j) and I Yb (j) = cYb (j) + rYb (j) .
(3.22)
As for the previous models, also in this case it is possible to obtain values for the upper
bound of the interval smaller than that of the respective lower bound. This may happen if
there are negative coefficients in the model that estimates the half range of the intervals.
3.1. LINEAR REGRESSION MODELS FOR INTERVAL VARIABLES
101
For all the models described above the problem may arise that the ranges of estimated
values are not intervals. To solve this situation, Lima Neto and De Carvalho (2010) proposed the model named by Constrained Center and Range method.
Constrained Center and Range method
This model results from adding to the previous CRM non-negative constraints in the
linear relation between the half ranges of the intervals.
For a set of interval data {Y (j), X1 (j), X2 (j), . . . , Xp (j)}m
j=1 , in the Constrained Center
and Range method (CCRM), the center and half range of the intervals Y (j) are obtained
according to:
cY (j) = bc0 + bc1 cX1 (j) + bc2 cX2 (j) + . . . + bcp cXp (j) + ec (j)
(3.23)
rY (j) = br0 + br1 rX1 (j) + br2 rX2 (j) + . . . + brp rXp (j) + er (j)
subject to
brk ≥ 0, k ∈ {1, . . . , p}.
The minimization problem to solve in this case is:
min
m
X
j=1
!
 cY (j) −
subject to brk ≥ 0,
bck ∈ R,
p
X
bck cXk (j)
k=0
k ∈ {1, . . . , p}
"2
!
+ rY (j) −
p
X
k=0
brk rXk (j)
"2 

(3.24)
k ∈ {1, . . . , p}.
As for the previous model, the centers and half ranges of the intervals are considered
independently. The estimated values for the parameters bc0 , bc1 , . . . , bcp are obtained as in the
CM or in the CRM, from the solution of the matricial equation in Expression (3.5). To estimate the values of the parameters br0 , br1 , . . . , brp the constraints brk ≥ 0 with k ∈ {1, . . . , p}
are imposed. Lima Neto and De Carvalho (2010) use Lawson and Hanson’s algorithm
(Lawson and Hanson (1995)), adapted to the CCRM, to obtain the optimal solution for
these parameters. The objective of Lawson and Hanson’s algorithm is to identify the values
of the parameters βkr , that are incompatible with the constraints and change them to nonnegative values through a re-weighting process. In this case it is not possible to obtain
the expression that allows finding the estimated parameters βkr with k ∈ {1, . . . , p}. For
102
CHAPTER 3. STATE OF THE ART
details about the Lawson and Hanson’s algorithm applied in this method see Lima Neto
and De Carvalho (2010).
From the parameters estimated as previously described, the predicted values of the
bounds of the observations of the interval-valued variable Y are obtained as in CRM by
Expression (3.22).
In the CCRM, the authors solved the situation of obtaining negative values for the half
ranges of the intervals. As the parameters brk ≥ 0, the predicted half ranges rbY (j) are
I Y (j) ≤ b
I Y (j) is achieved.
non-negative for all j ∈ {1, . . . , m} and consequently the goal b
However, by including non-negative constraints on the linear relations between the half
range of the intervals, the authors impose that this linear relation is always direct. This
behavior seems to be in agreement with the real situations: if the half range of the intervals
“represents” the variability of the data, to predict less variability from data with higher
variability does not appear to be natural.
Lasso-IR method
Giordani (2011, 2014) proposed a new approach to linear regression for interval-valued
variables based on the Lasso technique, named Lasso-IR method, that follows the research
line of the previously described models. As in CRM and CCRM, in this new approach the
linear relation between interval-valued variables considers two classical regression models,
one for the centers and another for the half ranges. However, in this case the parameters
of the models are related. The coefficients of the two models are estimated in such a
way that the coefficients for the centers are as close as possible to the corresponding
ones for the half ranges. When applying the Lasso technique, this is achieved by fixing
a threshold expressing the maximum allowed level of diversity between the two sets of
regression coefficients. For a set of interval data {Y (j), X1 (j), X2 (j), . . . , Xp (j)}m
j=1 , in
the Lasso-IR method, the linear relation between the center and half ranges of the intervals
Y (j) and the center and half ranges of the intervals Xk (j), with k ∈ {1, 2, . . . , p} are given
by:
cY (j) = bc0 + bc1 cX1 (j) + bc2 cX2 (j) + . . . + bcp cXp (j) + ec (j);
rY (j) = br0 + br1 rX1 (j) + br2 rX2 (j) + . . . + brp rXp (j) + er (j)
= (bc0 + ba0 ) + (bc1 + ba1 )rX1 (j) + (bc2 + ba2 )rX2 (j) + . . . + (bcp + bap )rXp (j) + er (j).
(3.25)
3.1. LINEAR REGRESSION MODELS FOR INTERVAL VARIABLES
103
In matricial notation, the predicted centers and half ranges are given by:
c
= Xc b c
r
= X r b r = X r ( bc + ba )
b
Y
b
Y
(3.26)
where the matrices Xc and Yc , of order m × (p + 1), are the ones defined in Expression
(3.2); Xr and Yr are analogous but for the half ranges, bc and br are the column matrices
with p + 1 rows of the parameters for the linear regression model of the centers bck and
the half ranges brk , k ∈ {0, 1, . . . , p}, respectively. As the coefficients of the half ranges
model br are equal to those of the centers model bc up to additive coefficients, we have
br = bc + ba with ba = (ba0 , ba1 , . . . , bap )T .
To estimate the parameters of the models, the author selected the Bertoluzza distance,
in Expression (2.17), in Section 2.3.1, studied in the framework of the intervals by Trutschnig
et al. (2009). The selection of the value to θB depends on relative importance of the half
ranges with respect to the centers. θB =
1
3
is considered a reasonable choice (Giordani
(2014)).
The minimization problem in this case is:
min
m
X
j=1
p
subject to
X
k=0
p
X
k=0
!
 cY (j) −
p
X
bck cXk (j)
k=0
(bck + bak ) rXk (j) ≥ 0,
"2
!
+ θB rY (j) −
j ∈ {1, . . . , m}
p
X
k=0
"2 
(bck + bak ) rXk (j) 
|bak | ≤ t
bak , bck ∈ R,
k ∈ {0, 1, . . . , p}.
(3.27)
The constraints included in the optimization problem guarantee that each estimated
half range to Y (j) is non-negative,
m
P
j=1
(bck + bak ) rX(j) ≥ 0, ∀k ∈ {0, 1, . . . , p} and that
the additive coefficients bak are as small as possible. To solve the minimization problem, the
Least Absolute Shrinkage and Selection Operator (Lasso) are used. This method minimizes
the residual sum of squares with the constraint that the sum of the absolute values of
the regression coefficients is smaller than a threshold. The minimization problem (3.27)
requires the obtention of the optimal solution for the parameter t, the tOP T . The values of
t are in [0, tM ax ], when t = 0, ba is null and therefore br = bc . In this case it is always
104
CHAPTER 3. STATE OF THE ART
possible to obtain an optimal solution with non-negative values to the predicted half ranges.
The value tM ax is the smallest value such that two separate regression problems for the
centers and half ranges models are solved. The selection of the tOP T is performed by
cross-validation techniques, such as the k−fold cross validation. For details about the
process to obtain the optimal value for t, see Giordani (2014).
For the optimal value of t, the optimal solution of the optimization problem is obtained
by calculating separately the vectors bc and ba , keeping the remaining one fixed. The
estimation of bc is obtained by solving the least squares problem (3.28) considering the
matrix ba fixed:
min
m
X
j=1
!
 cY (j) −
p
X
bck cXk (j)
k=0
"2
!
+ θB rY (j) −
p
X
(bck + bak ) rXk (j)
k=0
"2 
that in matricial notation can be written as:

 
 2
Xc
Yc
c
− 1
 b = kh − Fbc k2 .
 1
min θB2 (Yr − Xr ba )
θB2 Xr

(3.28)
(3.29)
c
b = (FT F)−1 FT h.
So, the estimated values to bc are given by b
To calculate ba , fixing bc , it is necessary to solve the following minimization problem:
min
m
X
j=1
subject to
p
X
k=0
p
X
k=0
!
rY (j) −
p
X
(bck + bak ) rXk (j)
k=0
(bck + bak ) rXk (j) ≥ 0,
"2
j ∈ {1, . . . , m}
(3.30)
|bak | ≤ t
bak , bck ∈ R,
k ∈ {1, . . . , p}.
The optimal solution for ba is obtained by solving this least squares problem with constraints (Lawson and Hanson (1995)). More details about Lasso-IR method can be found in
Giordani (2014), where the Matlab algorithm of the method is also presented.
It is also important to underline that when tOP T = tM ax , the parameters estimated for
the linear regression of the centers and half ranges are independent, a situation that is
considered in the CCRM. However the obtained solution is not necessarily the same. In
the CCRM, the constraint is imposed to the parameters of the model of the half ranges,
3.1. LINEAR REGRESSION MODELS FOR INTERVAL VARIABLES
105
br ≥ 0, and because of this the linear relation between the half ranges is always direct.
In the Lasso-IR method, the constraint is Xr br ≥ 0, so negative regression coefficients
are admitted and the linear regression between the half ranges may be inverse (a direct
linear relation between the half ranges is not impose). The Lasso-IR method and CCRM
are related, however, the Lasso-IR method is more flexible. They both treat the linear
regression between intervals considering two linear relations, between the centers and the
half ranges, but the Lasso-IR method allows a relation between the parameters estimated
for the centers model and the parameters estimated for the half ranges model.
Other methods
Recently, a particular swarm optimization (PSO) algorithm was applied to estimate the
parameters of the following linear regression models: Center method, MinMax method and
Center and Range method (Yang et al. (2011)). This work does not propose a new linear
regression model between interval-valued variables, but introduces a new method to obtain
the optimal solution for the models that we referred to previously. The PSO algorithm was
applied to solve the optimization problem associated with the above-mentioned methods
because of some limitations of the Least Square method. For example, to determine the
optimal solution we need to determine the inverse of a matrix and the dimension of this
matrix increases with the number of variables. When this matrix is too large, the application
of the Least Square method may cause a large error in the solution. Additionally, the PSO
algorithm guarantees that the predicted values of the lower bounds will be lower than the
predicted values of the upper bounds.
Maia and De Carvalho (2008) introduced the Center and Range Least Absolute Deviation method (CRMLAD). As in CRM, the centers/half ranges of the intervals of the response
and explicative variables are related according to the model in Expression (3.17). In this
case, the regression coefficients bc0 , bc1 , . . . , bcp and br0 , br1 , . . . , brp of the two classical linear
regression models are estimated through the minimization of the sum of the absolute values
of the residuals:
p
m X
X
bck cX(j)k cY (j) − bc0 −
j=1
k=1
and
p
m X
X
brk rX(j)k .
rY (j) − br0 −
j=1
k=1
(3.31)
To find the optimal values of the parameters a simplex based algorithm developed by
Barroda and Roberts (1974) is used. The application of this algorithm has important
106
CHAPTER 3. STATE OF THE ART
implications: it ensures that the regression coefficients are estimated in a finite number of
simplex iterations and ensures the robustness of the vector of coefficients estimated in the
presence of outliers in the dependent variable. Because of this, according to the authors,
this method is more robust then the CRM when we are in the presence of outliers.
Performance of the models
Most of the models previously described does not has a goodness-of-fit measure deduced from the model. The performance assessment of the linear regression models is
based only on the following measures (Lima Neto and De Carvalho (2008, 2010)):
• The lower bound Root Mean Square Error (RM SEL ), that measures the fit between
the lower bounds of the observed and predicted intervals
RM SEL =
vm
2
uP
u
I
−
I
b
Y
(j)
Y
(j)
t j=1
m
;
(3.32)
• The upper bound Root Mean Square Error (RM SEU ), that measures the fit between
the upper bounds of the observed and predicted intervals
RM SEU =
vm
2
uP
u
I
−
I
b
Y
(j)
Y
(j)
t j=1
m
;
(3.33)
• The square of the lower bound correlation coefficient
2
L
r =
cov (IY , IYb )
s(IY )s(IYb )
2
(3.34)
;
• The square of the upper bound correlation coefficient
rL2 =
!
"2
cov IY , IYb
s(IY )s(IYb )
(3.35)
where the covariance (cov) and the standard deviation (s) are the classical definitions and
IY = I Y (1) , I Y (2) , . . . , I Y (m)
T
; IYb = I Yb (1) , I Yb (2) , . . . , I Yb (m)
T
, analogously to IY and IYb .
3.1.1.2. Linear regression models for imprecise data
The first linear regression models proposed by the line of research that works with
intervals to represent imprecise data also do not consider a probabilistic space. In this
3.1. LINEAR REGRESSION MODELS FOR INTERVAL VARIABLES
107
approach the intervals are considered elements of KC (R). The first models proposed by
Diamond (1990) and Gil et al. (2002) are linear transformations that relate intervals and that
are based in the interval arithmetic. These first proposed models only allow predicting an
interval from another interval, i.e., they are simple linear relation models.
Diamond method
Diamond (1990) proposed the first linear regression model, in the descriptive context,
when the intervals are considered elements of KC (R).
In his work, he attempts to fit an interval-valued variable Y from an interval-valued
variable X considering that all observations of these variables are non degenerate intervals
and include only positive values.
Considering the set of interval data {Y (j), X(j)}m
j=1 , with X(j), Y (j) ∈ KC (R), under
the required conditions, two relations were studied:
with a, b ∈ R,
Yb (j) = a + bX(j)
(3.36)
Yb (j) = A + bX(j)
(3.37)
with b ∈ R and A = [a, a] ∈ KC (R).
The relation in Expression (3.36) considers as the independent parameter a real number. However, because the relation is between intervals, the relation in Expression (3.37)
generalizes the previous one considering an interval as the independent parameter.
To estimate the parameters of the models a least square minimization problem must
be solved. The minimization problem uses the distance ρ2 defined in Expression (2.15),
in Section 2.3.1. Let {Y (j), X(j)}m
j=1 be the set of interval data. For the most general
situation, the optimization problem is given by:
min
m
X
2ρ22 (Y (j), A + bX(j)).
(3.38)
j=1
Under this framework, we are working in the space (KC (R), +, ·), where the operations
are defined according to the interval arithmetics. For this reason the minimization problem
(3.38) is divided in two cases:
• When b > 0, the least squares estimators of the parameters are obtained from
min
m
X
j=1
(I Y (j) − a − bI X(j) )2 + (I Y (j) − a − bI X(j) )2 .
(3.39)
108
CHAPTER 3. STATE OF THE ART
Differentiating the function in order to the parameters: b, a, a and setting the results
equal to zero, the optimal solution is obtained solving the system:
am + b
m
P
j=1
am + b
m
P
j=1
a
m
P
I X(j) −
I X(j) −
I X(j) + a
j=1
m
P
m
P
I Y (j) = 0
j=1
m
P
I Y (j) = 0
j=1
I X(j) + b
j=1
m P
j=1
P
m
2
I X(j) + I 2X(j) =
I Y (j) I X(j) + I X(j) I Y (j) .
j=1
(3.40)
• When b < 0, the least squares estimators of the parameters are obtained from
min
m
X
j=1
(I Y (j) − a − bI X(j) )2 + (I Y (j) − a − bI X(j) )2 .
(3.41)
Similarly to the previous case, the values for b, a, a are obtained solving the system of
three equations, that results of equalizing to zero the partial derivatives of the function
to minimize:
am + b
m
P
j=1
am + b
m
P
j=1
a
m
P
I X(j) −
I X(j) −
I X(j) + a
j=1
m
P
j=1
m
P
I Y (j) = 0
j=1
m
P
I Y (j) = 0
j=1
I X(j) + b
m P
j=1
P
m
2
I X(j) + I 2X(j) =
I Y (j) I X(j) + I X(j) I Y (j) .
j=1
(3.42)
In his work, the author establishes a sufficient condition for non degenerate intervals
under which it is possible to prove that the solution of the optimization problem (3.38) exists
and is unique.
If the set of data is non degenerate and positive coherent, i.e. cov(rX , rY ) ≥ 0 and
cov(cX , cY ) − cov(rX , rY ) ≥ 0 (where cov is the classic covariance of the real variables)
the solution of the minimization problem (3.38) exists; is unique and that solution is the
solution of the system (3.40). On the other hand, if the set of data is non degenerate and
negative coherent: cov(rX , rY ) ≥ 0 and cov(cX , cY ) − cov(rX , rY ) ≤ 0 the solution of the
minimization problem (3.38) exists; is unique and its solution is the solution of the system
(3.42).
3.1. LINEAR REGRESSION MODELS FOR INTERVAL VARIABLES
109
Gil method
The work of Gil et al. (2002) generalizes the model proposed by Diamond (1990), extending the Least Squares method to a more generalized metric on the space of nonempty
compact intervals. The uniqueness of the optimal solutions is identified by necessary and
sufficient conditions, and not only by sufficient conditions, as in the work of Diamond (1990).
In the Gil method, the intervals Y (j) of the interval-valued variable Y are predicted
according to Expression (3.37). However, the parameters of the model are now obtained
considering that the optimization problem is defined using the Bertoluzza distance (see
Section 2.3.1). Considering the set of interval data {Y (j), X(j)}m
j=1 , the parameters of the
model are now estimated solving the minimization problem:
1 X 2
D (Y (j), bX(j) + A).
m j=1 B
m
min
(3.43)
The optimal solutions are obtained in an algorithmic way that fits all solutions for the
general case of non degenerate interval-valued variables with the necessary and sufficient
conditions for the non-uniqueness. The few cases leading to non unique solutions are
characterized. In this model the analytic expressions for the parameters b and A are
not obtained. One algorithm for the computation of the determination coefficient is also
designed. The main problem of the Gil method is that in some situations not exist an
interval A ∈ KC (R) such that Y (j) = bX(j) + A ∈ KC (R).
3.1.2. Probabilistic linear regression models
3.1.2.1. Linear regression models for data with variability
Almost all linear regression models proposed for interval-valued variables in the context
of the Symbolic Data Analysis consider the problem from an optimization point of view
and do not take into account neither the random nature of the explicative variables nor of
the response variable. In this way, the lack of a probabilistic distribution for the response
interval-valued variable has made the use of inference techniques over the parameter
estimates impossible as, for example, hypothesis testing, residual analysis and diagnostic
measures.
The development of probabilistic methods is still an open research topic for almost all
kinds of symbolic variables. Papers proposing models when the probabilistic assumption
110
CHAPTER 3. STATE OF THE ART
is considered have been recently published (Domingues et al. (2010), Xu (2010), Brito
and Duarte Silva (2012), Lima Neto et al. (2011), Silva et al. (2011), Ahn et al. (2012)).
In the probabilistic framework two approaches emerge for linear regression models with
interval-valued variables. The models proposed by Domingues et al. (2010), Lima Neto
et al. (2011) and Silva et al. (2011) allow for statistical inference under the framework of joint
distributions for bivariate random vectors. However, as in almost all descriptive methods
proposed for data with variability, the variability within the intervals is lost when we represent
the intervals as a bivariate vector. Silva et al. (2011) proposes a regression model using
copulas and Lima Neto et al. (2011) constructs a bivariate generalized linear model based
on a bivariate exponential family of distributions. The main contribution of the work of
Domingues et al. (2010) is a prediction method for interval data that is less sensible in
the presence of interval-valued data outliers. The method is based on the symmetrical
linear regression methodology and allows assuming a heavy-tailed probabilistic distribution
for the model errors.
The model proposed by Ahn et al. (2012) does not treat the intervals as a vector
with two components. Instead, it contemples the internal variation in the interval-valued
observations. The new method uses a resampling idea to fit a linear regression model on
the interval-valued data.
Among these works we highlight the work of Lima Neto et al. (2011) and Ahn et al.
(2012). These two models have the main advantage of considering a probabilistic assumption, however, comparisons made by the authors with a selection of descriptive methods,
showed similar performances.
Bivariate symbolic regression models.
The model proposed by Lima Neto et al. (2011) is an adaptation to interval-valued
variables of the Bivariate Generalized Linear Model and uses to the bivariate exponential
family of distributions proposed by Iwasaki and Tsubaki (2005).
Let the response interval-valued variable Y be defined by a bivariate vector composed
by two one-dimensional random variables Y1 and Y2 . The authors assume that the joint
distribution of the response interval-valued variable Y = (Y1 , Y2 ) belongs to the bivariate
exponential family of distributions. It is this framework that gives to the interval-valued
variable Y its random nature.
3.1. LINEAR REGRESSION MODELS FOR INTERVAL VARIABLES
111
Let {Y (j), X1 (j), X2 (j), . . . , Xp (j)}m
j=1 be random samples of the interval-valued varia-
bles Y and Xk , with k ∈ {1, . . . , p} , with Y the response variable and Xk the n explicative variables. As in this work the interval-valued variables are defined as bivariate
quantitative vectors, for each unit j, the observation Y (j) is represented by the vector
Y (j) = (y1 (j), y2 (j)) , where y1 (j) and y2 (j) are real values that represent the lower
and upper bounds or the center and half range of the intervals for each unit j. Similarly
the explicative variables Xk are represented by the vectors Xk (j) = (x1k (j), x2k (j)) , with
k ∈ {1, . . . , p} . So, Y1 , X1k and Y2 , X2k are the quantitative classical variables that
represent, respectively, the lower and upper bounds or the center and half range of the
intervals.
As in the Bivariate Generalized Linear Models, the Bivariate Symbolic Regression Model
(BGLM) is divided in two components, the random component and the systematic component that are related by a link function.
The random component considers the bivariate random vector Y = (Y1 , Y2 ) that follows
the bivariate exponential family. The joint density probability function of these distributions
is defined by
f (y1 , y2 , θ1 , θ2 ) = exp [φ−1 (y1 θ1 + y2 θ2 − g(θ1 θ2 , ρ)) + h(y1 , y2 , ρ, φ)]
(3.44)
where θ = (θ1 , θ2 ) is the vector of canonical parameters, φ is a common dispersion
parameter and ρ is a constant correlation parameter between the two random variables.
The functions g and h are assumed to be known.
In the systematic component, the explicative variables X1k and X2k with k ∈ {1, 2, . . . , p}
are responsible for the variability of Y1 and Y2 , respectively. These components are defined
by
η1 = g1 (µ1 ) = X1 β1 and η2 = g2 (µ2 ) = X2 β2
(3.45)
where β1 and β2 are the vectors of the p + 1 parameters to be estimated; X1 and X2 are
the m × (p + 1) matrices formed by the observed values of the variables Xk1 and Xk2 ,
respectively, as follows:
112
CHAPTER 3. STATE OF THE ART






X1 = 




1
1
..
.
1
x11 (1)
...

x1p (1) 


x11 (2) . . . x1p (2) 


..
..
..


.
.
.


x11 (m) . . . x1p (m)






and X2 = 




1
1
..
.
1
x21 (1)
...

x2p (1) 


x21 (2) . . . x2p (2) 

.
..
..
..


.
.
.


x21 (m) . . . x2p (m)
The functions g1 and g2 are link functions that connect the systematic component to the
averages µ1 and µ2 of the variables Y1 and Y2 , respectively.
Lima Neto et al. (2011) also present another approach for the BGLM, assuming that the
vectors of the parameters are the same in both systematic components.
The link functions can be selected to have particular properties. For example, if we
consider the logarithmic link function, we guarantee the positiveness of the predicted values
of the half ranges. The mean of the response variable is related to a set of known explivative
interval-valued variables through a link function.
Assuming that ρ is fixed, it is possible to estimate the vectors of the parameters β1 and
β2 based on the iterative Fisher scoring method. After obtaining the estimates for β1 and
β2 , (or only β ) it is possible to compute the goodness-of-fit measure deviance, based on the
iterative Fisher scoring method and consequently, estimate the parameter dispersion. For
more details see Lima Neto et al. (2011). In this work, residual definitions and diagnostic
measures that are useful for making inference about the response distribution and identify
outliers, among others aspects, are also present. An important point to underline, is the
joint residual measure for interval-valued observations. In previous works the residuals are
defined separately for each boundary of the interval.
In conclusion, we can say that the BGLM has advantages over previous methods. As
these models take into account the random nature of the response variable, inference
techniques that can be very useful in analyzing interval-valued data in the Symbolic Data
Analysis field may be developed. Moreover, with the model proposed by Lima Neto et al.
(2011), it is possible to guarantee the positiveness for the predicted values of the half
ranges. On the other hand, as the BGLM considers the representation of the intervals
as a vector with two components (the lower and upper bounds or the center and half range
of the intervals), information about the variability of the values within the intervals is not
taken into account.
3.1. LINEAR REGRESSION MODELS FOR INTERVAL VARIABLES
113
Monte Carlo method.
The Monte Carlo method (MCM), (Ahn et al. (2012)) proposes a new approach to fit
a linear regression model to interval-valued data using a resampling idea. This model
consists of the following steps:
• generate a large number of samples by randomly selecting a single-valued point within
each observed intervals, assuming a Uniform distribution within each interval;
• fit a classical linear regression model on each single-valued sample;
• calculate the mean regression coefficients over the fitted models, and then use them
to produce a final linear regression model.
The main contributions of the proposed method are that it uses the variability of the
interval-valued data and that it allows to make inference regarding the regression coefficients, as statistical significance tests on the regression coefficients, because the sampling
distributions of the estimated coefficients are obtained via Monte Carlo simulation.
In MCM, the predicted elements are not necessarily intervals because it is possible to
obtain a prediction for the lower bound greater than the prediction for the upper bound. So,
it occurs the same problem as in the previous models: CM, MinMax method and CRM.
As we described before, for this situation the proposed solution is the one presented in
Expression (3.11).
For the authors, this approach has three main advantages; 1) It is not necessary to
apply complex methodologies to work with interval-valued data; 2) it is flexible because
we may assume a non uniform distribution within each interval and 3) the method may be
generalized and applied to another type of symbolic data, e.g. histogram-valued data.
3.1.2.2. Linear regression models for imprecise data
In this line of research, the first model that considers a probabilistic assumption is
named Simple Basic Linear Model (SBLM) and it was proposed by González-Rodrı́guez
et al. (2007). Afterwards, other more flexible methods were proposed: the Model M and
the Model MG (Blanco-Fernández (2009), Blanco-Fernández et al. (2011)). A review about
the approaches based on linear regression analysis for interval-valued data may be found
in Blanco-Fernández et al. (2013). Almost all models presented in this section, that are
114
CHAPTER 3. STATE OF THE ART
defined using interval arithmetics, are simple linear regression models, i.e. only allow predicting one response interval-valued variable from one explicative interval-valued variable.
The complexity of the interval arithmetic associated with the lack of linearity of the space
KC (R) appears to make the generalization of the models to multiple linear regression
models more difficult.
Before the description of the models we present some preliminar definitions and notation. In the context of imprecise data, random intervals or interval-valued random sets
are defined as a generalization of random variables (real-valued random variables) when
the outcomes that result of a random experience are described by a compact set instead
of a real number. In this approach, the intervals are elements of the space (KC (R), +, ·)
where the operations + and · are defined according to interval arithmetics presented in
Section 2.2.1. In this space there is no symmetric element with respect to the addition. In
many cases, no interval C such B + C = A exists. The difference A − B is not an inner
operation in KC (R). When an element C ∈ KC (R) such that B + C = A exists, it is named
Hukuhara difference between the intervals A and B, and is denoted A −H B. The interval
C only exists if rB ≤ rA where rA and rB are the half ranges of the intervals A and B,
respectively.
Let (Ω, A, P ) be a probability space and let X, Y : Ω → KC (R) be two interval-valued
random sets. For this kind of random sets, some measures were defined. The expected
value of X is defined considering the Aumann expectation of imprecise random elements.
The expected value of the interval-valued random set X is given by:
E(X) = [E(cX ) ± E(rX )]
(3.46)
where E(cX ) and E(rX ) are the expected values of the real random variables centers, cX ,
and half ranges, rX .
Other measures such as the variance and covariance of interval-valued random sets
were also defined considering the Dθ distance, in Expression (2.18) in Section 2.3.1. The
variance of X is defined as
σ 2 (X) = Dθ2 (X, E(X))
(3.47)
which is the usual Frechet variance associated with the Aumann expectation in the metric
space (KC (R), Dθ ). For more detail see Blanco-Fernández (2009) and Blanco-Fernández
et al. (2013).
3.1. LINEAR REGRESSION MODELS FOR INTERVAL VARIABLES
115
The variance of the interval-valued random set X may be expressed in terms of classical
variance of the respective random variables cX and rX by
σ 2 (X) = σ 2 (cX ) + θσ 2 (rX ).
(3.48)
Similarly, the covariance between two interval-valued random sets X and Y may be given
by the expression:
(3.49)
σ(X, Y ) = σ(cX , cY ) + θσ(rX , rY ).
m
The mean, variance and covariance of a simple random sample {X(j), Y (j)}j=1 , obtained
from (X, Y ) are defined as follows:
1
(X1 + X2 + . . . + Xm );
m
! m
"
1 X 2
2
bX =
σ
D (X, X) ;
m j=1 θ
X=
Simple Basic Linear Model
bX,Y = σ
bcX ,cY + θσ
brX ,rY .
σ
(3.50)
(3.51)
(3.52)
Let (Ω, A, P ) be a probability space and let X, Y : Ω → KC (R) be two interval-valued
random sets. Similarly to the classical statistics, the linear relation between two intervalvalued random sets should be defined as
Y = βX + Γ + εe
with β ∈ R, Γ ∈ KC (R) and εe an interval-valued random set such that E(εe|X) = 0.
However, in this case the error εe degenerates into real-valued random variables and the
model will be quite restrictive. For this reason González-Rodrı́guez et al. (2007) proposed
the following model, named Simple Basic Linear Model (SBLM), defined as follows:
Y = βX + ε
(3.53)
where β ∈ R is a single-valued regression parameter and ε is a interval-valued random
set with a fixed expected value (in the Aumann sense), E(ε|X) = Γ ∈ KC (R). In order to
avoid reducing the errors to real random variables, the independent term Γ is included in
the formalization of the possible errors.
116
CHAPTER 3. STATE OF THE ART
m
Consider the interval-valued simple data {X(j), Y (j)}j=1 obtained from the interval-
valued random sets (X, Y ) . So, for all j ∈ {1, 2, . . . , m} and the simple random sample
(X(j), Y (j)) it is verified that
(3.54)
Y (j) = bX(j) + e(j)
for a certain e(j) ∈ KC (R). For e(j) to be an interval-valued random set, the Hukuhara
difference Y (j) −H bX(j) has to exist for all j ∈ {1, 2, . . . , m}. This was not guaranteed
in the model proposed by Gil et al. (2002).
The estimation of the parameters of the model is made by solving a constrained least
squares minimization problem that uses the distance Dθ , presented in Section 2.3.1. The
minimization problem for the SBLM is defined as follows:
1 X 2
D (Y (j), bX(j) + A)
m j=1 θ
m
min
(3.55)
subject to Y (j) −H bX(j) exists, ∀j ∈ {1, 2, . . . , m}
i.e. A ∈ KC (R) .
The analytic expressions of the parameters estimates b and A may be obtained for
this model. Considering the definitions presented in Expressions (3.50), (3.51) and (3.52),
these estimates are given by (see González-Rodrı́guez et al. (2007)):




0



n
o
σ
bX,Y
b=
b
s
,
min
2
0
σ
bX



n
o


 − min sb0 , σb−X,Y
σ
b2
if
if
if
X
where
sb0 =
and


 ∞

 min
if
n
rY (j)
rX(j)
: rX(j) 6= 0
o
bX,Y ≤ 0 and σ
b−X,Y ≤ 0
σ
b−X,Y ≤ σ
bX,Y
bX,Y ≥ 0 and σ
σ
(3.56)
bX,Y ≤ σ
b−X,Y
b−X,Y ≥ 0 and σ
σ
rXj = 0 for all j ∈ {1, 2, . . . , m}
(3.57)
otherwise
A = Y −H bX.
(3.58)
In the same paper, González-Rodrı́guez et al. (2007) propose a unique approach, in
this line of research, for a extension of simple linear regression models to multiple linear
3.1. LINEAR REGRESSION MODELS FOR INTERVAL VARIABLES
117
regression. For this case, the estimation of the parameters is made using a stepwise
method. Another important development for the SBLM is the deduction of the coefficient of
determination associated with the model, presented in Blanco-Fernández (2009).
According to Blanco-Fernández (2009) the SBLM presents some limitations such as
the lack of flexibility. The model in Expression (3.53) induces the following relations, one
between the centers of the variables and one between the half ranges:
cY = βcX + εc and rY = |β|rX + εr .
(3.59)
These expressions are related because the two linear functions have the same regression coefficient (in absolute value). However, this only happens in some practical situations
in which it is reasonable to assume that the centers and the half ranges of the variables
may increase/decrease with the same absolute ratio.
Model M
In the works of Blanco-Fernández et al. (2011) and Blanco-Fernández (2009), the
main goal is to define a more flexible linear regression model between interval-valued
random sets. This new model uses the representation of the intervals by the canonical
decomposition presented in Expression (2.11) in Section 2.1.2.
Let (Ω, A, P ) be a probability space, X, Y : Ω → KC (R) two interval-valued random
sets and consider the variable X represented by the canonical decomposition:
X = cX [1 ± 0] + rX [0 ± 1]. The linear regression model named Model M is defined as
follows:
Y = αcX [1 ± 0] + βrX [0 ± 1] + γ[1 ± 0] + ε.
(3.60)
In this case ε is an interval-valued random set, such that, E(ε|X) = [−δ, δ], with δ ≥ 0
and γ is an intercept term that affects only the centers. As in the model presented by
González-Rodrı́guez et al. (2007), if the usual condition of zero mean error is considered,
the residual term is reduced to a real-valued random variable instead of an interval-valued
random set. So, the regression function associated with the model in Expression (3.60)
may be expressed as
Y = αcX [1 ± 0] + βrX [0 ± 1] + Υ = αX c + βX r + Υ
(3.61)
where α, β ∈ R, Υ = [γ − δ, γ + δ] with δ ≥ 0. Consider the interval-valued data
m
{X(j), Y (j)}j=1 obtained from the interval-valued random sets (X, Y ) . For all j ∈ {1, . . . , m},
118
CHAPTER 3. STATE OF THE ART
the simple random sample (X(j), Y (j)) verifies
Y (j) = aX c (j) + bX r (j) + e(j)
(3.62)
with e(j) ∈ KC (R). The estimation of the parameters of the Model M is done by solving a
constrained least squares minimization problem, also now defined using the generalization
of the Bertoluza distance, Dθ , in Expression (2.18). The constrained minimization problem
is defined as follows:
1 X 2
Dθ (Y (j), aX c (j) + bX r (j) + A)
m j=1
m
min
(3.63)
subject to Y (j) −H aX c (j) + bX r (j) exists, ∀j ∈ {1, 2, . . . , m}.
According to Blanco-Fernández et al. (2013) the particular definition of X r allow us to
assume with without loss of generality that b ≥ 0.
Similarly to the SBLM, it is possible to determine the analytic expression of the parameters estimates that minimize the optimization problem (3.63). Considering the definitions
presented in Expressions (3.50), (3.51), (3.52) and (3.57), the parameters estimates are
given by (see Blanco-Fernández et al. (2011)):
bX c ,Y
σ
;
2
bX
σ
c
n
n
oo
r
b = min sb0 , max 0, σbσbX2 ,Y
;
c
a=
(3.64)
(3.65)
X
c
r
A = Y −H (aX + bX ).
(3.66)
The Model M induces linear relation between the centers and half ranges of the variables. These relations are defined by:
cY = αcX + γ + εc and rY = |β|rX + εr .
(3.67)
Since α and β are in general different, the Model M is more flexible than the SBLM. In
Model M the relation between the centers and half ranges is not necessarily the same in
absolute value; the linear relations between the half ranges are always direct because the
coefficient of rX is non-negative. The non-negative behavior of the coefficient of rX is a
consequence of the canonical decomposition of the intervals.
Furthermore, as for the latter model, Model M uses the distance Dθ to deduce a
coefficient of determination (Blanco-Fernández (2009)).
3.1. LINEAR REGRESSION MODELS FOR INTERVAL VARIABLES
119
Model MG
To increase even more the flexibility of the linear regression models between intervalvalued random sets, Blanco-Fernández (2009) proposed an extension of the Model M. This
new approach, named Model MG , was proposed with the goal of inducing more versatile
relations between the centers and half ranges of the intervals. As a consequence, the
centers of the intervals Y are not only dependent of the centers of the intervals X and the
half ranges of the intervals Y are not only dependent of the half ranges of the intervals X,
unlike what happens in other models proposed.
Given two interval-valued random sets X and Y the linear relation between them is now
defined as:
Y
= αcX [1 ± 0] + βrX [0 ± 1] + γcX [0 ± 1] + δrX [1 ± 0] + ε
= αX c + βX r + γX s + δX m + ε
(3.68)
with α, β, γ, δ ∈ R and E(ε|X) ∈ KC (R). The model in Expression (3.68) induces the
following relations between the centers and half ranges:
cY = αcX + δrX + εc and rY = |β|rX + |γ||cX | + εr .
(3.69)
In analogy with the Model M, the estimation of the parameters for a simple random
sample is done by solving the optimization problem:
1 X 2
D (Y (j), aX c (j) + bX r (j) + cX s (j) + dX m (j) + A)
m j=1 θ
m
min
subject to Y (j) −H aX c (j) + bX r (j) + cX s (j) + dX m (j) exist, ∀j ∈ {1, 2, . . . , m}.
(3.70)
As for the Model M, this optimization problem ensures the Hukuhara difference and for
the same reasons the authors assume that, b, c ≥ 0. The estimation of the parameters
of the Model MG is complex and is done by numerical algorithms. In this case, it is not
possible to find the analytic expression of the paraments estimates.
120
CHAPTER 3. STATE OF THE ART
3.2. Linear Regression Models for histogram-valued variables
The first linear regression model for histogram-valued variables was a generalization
of the first model proposed for interval-valued variables by Billard and Diday (2000, 2002,
2006). Recently, an alternative model has been developed, see, e.g., Verde and Irpino
(2010); Irpino and Verde (2012, 2013). As the research with histogram-valued variables is
still in its beginnings, the proposed models do not consider the probabilistic assumption yet.
In fact, few works have considered a probabilistic approach for histogram-valued variables
(Diday and Vrac (2005), Cuvelier and Noirhomme-Fraiture (2006)).
Methods based in Symbolic Covariance definitions
Consider p explicative histogram-valued variables Xk , k = {1, 2, . . . , p}, one response
histogram-valued variable Y and for each unit j, the respective mean values X k (j) and
Y (j). For each observation j, we have:
Xk (j) = {[I Xk (j)1 , I Xk (j)1 [, pj1 ; [I Xk (j)2 , I Xk (j)2 [, pj2 ; . . . , [I Xk (j)n , I Xk (j)n ], pjnj1 };
Y (j) = {[I Yj1 , I Yj1 [, pj1 ; [I Yj2 , I Yj2 [, pj2 ; . . . , [I Yjnj , I Yjnj ], pjnj2 };
X k (j) =
nj1
X
I Xk (j) + I X
i=1
Y (j) =
k (j)
2
n j2
X
I Y (j) + I Y (j)
i=1
2
pji ;
pji .
Billard and Diday (2006) proposed a generalization of two of the methods based in
Symbolic Covariance definition proposed for interval-valued variables. Consider the set of
histogram data {Y (j), X1 (j), X2 (j), . . . , Xp (j)}m
j=1 . For each unit j, the histograms Y (j)
and the histograms X1 (j), . . . , Xp (j) are related as in classical linear regression model, as
follows:
Y (j) = b0 + b1 X1 (j) + b2 X2 (j) + . . . + bp Xp (j) + e(j).
(3.71)
In Billard and Diday (2006), the estimation of the parameters is not deduced from the
model but is an adaptation of the solution obtained by the Least Square estimation method
for the classical linear model where the symbolic definitions of variance and covariance are
applied. In classical statistics, the parameters of the model are obtained by the matricial
equation:
3.2. LINEAR REGRESSION MODELS FOR HISTOGRAM VARIABLES








b1


s2 (X1 )
cov(X1 , X2 ) . . . cov(X1 , Xp )




 cov(X2 , X1 )
b2 
s2 (X2 )
. . . cov(X2 , Xp )
 = 
.. 
..
..
..
..


. 
.
.
.
.


bp
cov(Xp , X1 ) cov(Xp , X2 ) . . .
s2 (Xp )
−1 







121
cov(X1 , Y )





 cov(X2 , Y ) 
.

..




.


cov(Xp , Y )
(3.72)
The estimation of the parameters is obtained by considering the symbolic definitions of
variance and covariance presented in Section 2.4.2. However, two definitions of variance
and covariance for histogram-valued variables were proposed. As the symbolic definitions
of covariance 1 in Expression (2.29); variance 2 in Expression (2.31) and the mean in
Expression (2.22) are the classical definitions applied to the mean values of each observation j of the histogram-valued variables X and Y , the described method coincides
with the Center method generalized to histogram-valued variables. So, in this case, the
method based in symbolic covariance 1 and variance 2 is equivalent to the classical linear
regression model applied to the mean values of the observations of the histogram-valued
variables. The mean values of the histograms Y (j) and the mean values of the histograms
X1 (j), . . . , Xp (j) are real values that are related according to the classical linear relation:
Y (j) = b0 + b1 X 1 (j) + . . . + bp X p (j) + e(j).
(3.73)
The estimation of the parameters b0 , b1 , . . . , bp is obtained by minimizing the sum of
squares of deviation:
m
X
j=1
2
Y (j) − b0 − b1 X 1 (j) − . . . − bp X p (j) .
(3.74)
Alternatively, the Definition 2.18 of covariance 2 and the definition of variance 1 in
Expression (2.24), is also applied in Expression (3.72) to find the parameters of the model
in Expression (3.71). This approach is a simple adaptation of the process used to estimate the parameters for classical variables and the linear relation, defined between the
same variables, is not equivalent to the one obtained by the Center method generalized to
histogram-valued variables (Billard and Diday (2006)). In this latter situation, the authors
stress that the internal variability within observations is considered because the definitions
of variance and covariance incorporate the internal variations and the variations between
observations.
122
CHAPTER 3. STATE OF THE ART
Billard and Diday (2006) proposed linear regression models for histogram-valued variables for which it is possible to estimate the parameters, but don’t present the process that
may be used to predict the distributions for the response variables from the explicative
variables. A possible solution for this problem is suggested below.
From the methods based in Symbolic Covariance definitions, in analogy to the case of
the intervals, it is possible to predict the bounds of the subintervals of the response variable.
To each subinterval i, with i ∈ {1, 2, . . . , n} the bounds of the intervals are estimated as
follows:
I Yb (j)i = b0 + b1 I X1 (j)i + b2 I X2 (j)i + . . . + bp I Xp (j)i ;
I Yb (j)i = b0 + b1 I X1 (j)i + b2 I X2 (j)i + . . . + bp I Xp (j)i .
(3.75)
As in the methods based on the Symbolic Covariance definition proposed for intervalvalued variables, if some of the parameters b0 , b1 , . . . , bp are negative it is possible to obtain
a range of values where the upper bound of the subintervals may be smaller than the
respective lower bound. For each case, to build the subinterval the lowest obtained value
should be used for the lower bound and the highest for the upper bound. In this way we
may obtain histograms where the subintervals are not ordered neither disjoint, but may be
rewritten as described in Appendix A. The building of the histogram using Expression (3.75)
is only possible if all observations associated with all variables (explicative and response)
are represented by histograms with the same number of subintervals. If this does not occur,
it is always possible to rewrite all histograms to fulfill this condition (see Section 2.2.3.1). It
is important to recall that when the distributions that are the observations associated with
the histogram-valued variables, are rewritten with equal number of subintervals, the values
of covariance 2 are different than those obtained when this transformation is not applied
(see Example 2.14 in Section 2.4.1.3).
The methods based in Symbolic Covariance definitions do not predict distributions from
other distributions. The authors of the methods consider histogram data but predict single
values from which histograms may be built.
Verde and Irpino Model
In the works of Verde and Irpino (2010); Irpino and Verde (2012, 2013), the authors
propose a new linear regression model for histogram-valued variables where the observations of these symbolic variables are represented by quantile functions. To measure the
3.2. LINEAR REGRESSION MODELS FOR HISTOGRAM VARIABLES
123
error between the observed and predicted distributions the Mallows distance, termed by
the authors by Wasserstein distance (see in Section 2.3.2), is used.
The linear regression model for histogram-valued variables, considering the observations represented by their respective quantile functions, cannot be a simple adaptation of
the classical model:
Ψ
−1
Y (j)
(t) = b0 +
p
X
bk Ψ−1
Xk (j) (t) + ej (t).
(3.76)
k=1
The linear combination of quantile functions (non decreasing functions) cannot be considered as a linear combination of real values, because if we multiply a quantile function by
a negative number we obtain a decreasing function, that can never be a quantile function.
In the works of Verde and Irpino (2010); Irpino and Verde (2012, 2013), the authors propose
a linear regression model that overcomes this situation.
Cuesta-Albertos et al. (1997) proved that it is possible to decompose the square of
the Mallows distance between two distributions into the sum of the squared Euclidean
distance between their means and the squared Mallows distance between the two centered
distributions.
Consider the distributions X(j) and Y (j) represented by the respective quantile func−1
tions Ψ−1
X(j) (t) and ΨY (j) (t), and the mean value of the distributions represented by X(j)
and Y (j). The quantile functions that represent the centered distributions are defined by
−1
−1
−1
c
ΨcX(j) (t) = Ψ−1
X(j) (t) − X(j) and ΨY (j) (t) = ΨY (j) (t) − Y (j). Then,
2
M
D (Ψ
−1
X(j)
(t), Ψ
−1
Y (j)
(t)) =
X(j) − Y (j)
2
2
M
+D
Ψ
c−1
X(j)
(t), Ψ
c−1
Y (j)
(t) .
(3.77)
Irpino and Romano (2007) showed that the squared Mallows distance can also be
decomposed into the sum of squared Euclidean distances between the means and the
squared Euclidean distances between the standard deviations and a residual part that may
be assumed as a shape distance between the two distributions.
Based in the latter decompositions, the authors assume that each quantile function
Ψ−1
Y (j) (t) may be expressed as a linear combination of the mean values of the histograms
−1
Xk , X k , and the centered quantile function, ΨcXk (j) (t) as follows:
Ψ
−1
Y (j)
(t) = b0 +
p
X
k=1
bk X k (j) +
p
X
k=1
−1
ak ΨcXk (j) (t) + ej (t)
(3.78)
124
CHAPTER 3. STATE OF THE ART
where for all k ∈ {1, 2, . . . , p}, bk ∈ R, ak ≥ 0 and the error function ej (t) is a function
but not necessarily a quantile function. The constraint of non negativity is imposed on the
values of the parameters ak , due to the fact that these parameters are multiplying a quantile
function.
In order to estimate the parameters, the sum of square errors function is defined, like in
the Least Squares method, but using the Mallows distance. In this case, the minimization
problem is the follows:
m Z
X
min
j=1
1
0
!
Ψ−1
Y (j) (t) − b0 −
subject to ak ≥ 0,
p
X
k=1
bk X k −
p
X
ak Ψ−1c
Xk (j) (t)
k=1
"2
dt
(3.79)
∀k ∈ {1, . . . , p}.
Consider a = (a1 , . . . , ap )T and b = (b0 , b1 , . . . , bp )T the vectors of the parameters;
−1
−1
−1
T
T
c
c
c
T
Y = (Ψ−1
Y (1) (t), . . . , ΨY (p) (t)) ; Y = (Y (1), . . . , Y (p)) ; Y = (ΨY (1) (t), . . . , ΨY (p) (t)) ;

−1
−1
−1
 ΨcX1 (1) (t) ΨcX2 (1) (t) . . . ΨcXp (1) (t)



 Ψc−1 (t) Ψc−1 (t) . . . Ψc−1 (t)
 X1 (2)
X2 (2)
Xp (2)
c
X =

..
..
..
..


.
.
.
.



−1
−1
−1
ΨcX1 (m) (t) ΨcX2 (m) (t) . . . ΨcXp (m) (t)



 X 1 (1) . . . X p (1)







 X (2) . . . X (2)
1
p


;X = 


..
..
..




.
.
.






X 1 (m) . . . X p (m)














and X+ the matrix obtained from X adding a first column with values equal to 1.
In matricial notation, the minimization problem (3.79) may be written as (for details see
Irpino and Verde (2012)):
min [Y − X+ b − Xc a]T [Y − X+ b − Xc a]
(3.80)
The vectors of the parameters a and b, may be independently estimated. The estimation of
the parameters b is achieved by solving the classic least square problem:
min [Y − X+ b]T [Y − X+ b].
(3.81)
The vectors of the parameters a is obtained applying the Non-Negative Least Square
algorithm proposed by Lawson and Hanson (1995) modified with the introduction of the
3.2. LINEAR REGRESSION MODELS FOR HISTOGRAM VARIABLES
125
product between quantile functions in the classic matrix computations (see Irpino and Verde
(2012)). In this case, the optimization problem is
[Yc − Xc a]T [Yc − Xc a]
min
subject to ak ≥ 0,
(3.82)
∀k ∈ {1, . . . , p}.
For the particular case of the simple linear regression model it is possible to define the
analytic expressions of the estimated parameters of the model (Verde and Irpino (2010)) as
follows:
b0 = Y − b1 X;
m
X
j=1
b1 =
X(j)Y (j) − mXY
m
X
j=1
a1 =
m Z
X
j=1
(3.83)
;
(3.84)
2
X (j) − mX
1
0
−1
Ψ−1
X(j) (t)ΨY (j) (t)dt − X(j)Y (j)
m Z
X
j=1
1
0
Ψ
−1
X(j)
(t)
2
dt − X(j)
2
.
(3.85)
To analyze the goodness-of-fit of the model, a measure that the authors termed by
Pseudo-R2 , that uses the Mallows distance, has been proposed (Irpino and Verde (2012,
2013)). As in classical statistics, the computation of this coefficient is based on the decomposition of the total variation of the response variable, denoted SSY , into two parts:
the part explained by the estimated regression equation, denoted SSR, and the part that
measures the unexplained variation, SSE , referred to as the residual sum of squares by
SSY = SSE + SSR.
The Pseudo-R2 is defined by:
Pseudo-R2 = min max 0; 1 −
SSE
SSY
;1 .
(3.86)
126
CHAPTER 3. STATE OF THE ART
Irpino and Verde (2012) also consider the Root Mean Square Error (RM SEM ) that
is generally used to measure the goodness of fit. Choosing the Mallows distance for
computing the dissimilarity between distributions, we may compute the RM SEM as follows:
RM SEM =
v m Z
uX 1
u
−1
2
(ΨY−1
u
b (j) (t) − ΨY (j) (t)) dt
t j=1 0
m
(3.87)
.
3.3. Conclusion
In the classical model, linear regression between quantitative variables is based on
the linear combination of real values, however, in the models described above for intervalvalued variables and histogram-valued variables the concept of linear regression is different.
Almost all linear regression models for interval-valued variables proposed in the framework
of Symbolic Data Analysis use real numbers that allow building the intervals, instead of
using intervals as such, i.e. these models are based on the difference between real values
(the upper and lower bound of the intervals or the center and half range) and do not quantify
the closeness between intervals. Therefore, the elements estimated by some of these
models may fail to build an interval. Similar problems occur in methods based in symbolic
covariance definitions generalized to histogram-valued variables.
Few models developed in the framework of Symbolic Data Analysis, descriptive or
probabilistic, used the variability inherent to the data in the process of predicting intervals
or histograms by linear regression. When variability is considered, the predicted interval or
histogram is not obtained directly from the model. The only method where both situations
are considered is the model proposed by Irpino and Verde (2012), that uses quantile
functions to represent distributions.
It is in the line of research that uses interval data to represent imprecise data, that
models with elements that are truly intervals are defined. However, in that framework,
the values in the interval do not represent variability, as in Symbolic Data Analysis but
imprecision. In this situation, the intervals are elements of KC (R), for which the distribution
of values within the intervals is not considered.
Defining models and concepts with elements that are not real values is obviously much
more complex and abstract. When defining linear regression models with data that take
the form of intervals or histograms, the first difficulty is to operate with them. Interval and
3.3. CONCLUSION
127
histogram arithmetics have been defined; however they are not easy to use. On the other
hand, in the spaces where the elements are intervals or histograms the existence of a
symmetric element is not always guaranteed i.e. these spaces are not linear. These are
possibly the main reasons that hinder the extension of the simple methods proposed for
imprecise data to multiple linear regression models.
Furthermore, if we work directly with intervals or histograms, it is necessary to select a
measure to adequately quantify the dissimilarity between such of elements.
In spite of all models proposed, there does not yet exist a model that simultaneously
solves all limitations. These limitations and the difficulty to work directly with intervals and
histograms makes linear regression a topic of Symbolic Data Analysis that requires further
research and the study of new models.
128
CHAPTER 3. STATE OF THE ART
4. Regression Models for histogram data
Our main goal in this work is to propose a linear regression model for histogram-valued
variables. More precisely, to provide a linear regression model that considers data with
variability and allows predicting distributions. In this work we propose a linear regression
model that:
• Is flexible and that truly predicts histograms or intervals from other histograms or intervals, i.e, the models should be included in the symbolic-symbolic-symbolic category;
• Solves the problem of the lack of linearity in the spaces when the elements are
intervals and histograms;
• Uses an adequate distance to measure the error between the observed and predicted
elements;
• Can be particularized to interval-valued variables. In this case different distributions
within the intervals may be considered.
4.1. The Distribution and Symmetric Distribution Regression
Model I
4.1.1. Definition of the Model
The first option to define the functional linear relation between histogram data was to
adapt the classical model to these kind of data. Consider that we want to predict the
distributions of histogram-valued variable Y from p histogram-valued variables Xk with
k ∈ {1, . . . , p}. At each unit j ∈ {1, . . . , m}, the predicted distribution Yb (j) would then be
obtained as follows:
129
130
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
Yb (j) = v + a1 X1 (j) + a2 X2 (j) + . . . + ap Xp (j).
(4.1)
As already mentioned, in this work we choose to represent the distributions by quantile
functions. However, when we multiply a quantile function by a negative number we do
not obtain a non-decreasing function. Therefore, it is necessary to impose non-negativity
constraints on the parameters of the model. As such, a functional linear relation between
the observations of the histogram-valued variables, represented by the respective quantile
functions, may be defined as follows:
−1
−1
−1
ΨY−1
b (j) (t) = v + a1 ΨX1 (j) (t) + a2 ΨX2 (j) (t) + . . . + ap ΨXp (j) (t)
(4.2)
with ak ≥ 0 and k ∈ {1, 2, . . . , p}.
The non-negativity constraints imposed on the coefficients force a direct linear relation
between the variables. Although we did not generalize the model of interval-valued variables to histogram-valued variables, by defining a model that allows predicting a quantile
function from other quantile functions, we obtain similar limitations as observed in the
models defined for interval-valued variables.
It is not possible to have negative parameters in the previous model. Nevertheless, it is
fundamental to allow for the possibility of a direct and an inverse linear relation between the
variable Y and the variables Xk , i.e. it is necessary to introduce a “tecnique” that solves
the problem of the semi-linearity of the space of the quantile functions. For this reason, our
proposal is to include in the linear regression model both the quantile functions Ψ−1
Xk (j) (t)
that represent the distributions of the histogram-valued variables Xk for each unit j, and the
quantile functions that represent the respective symmetric histograms, −Ψ−1
Xk (j) (1 − t) (see
Section 2.2.3.3). Therefore the following model is proposed:
Definition 4.1. Consider the histogram-valued variables X1 ; X2 ; . . . ; Xp . The quantile functions that represent the distribution of these histogram-valued variables for each unit j are
−1
−1
Ψ−1
X1 (j) (t), ΨX2 (j) (t), . . . , ΨXp (j) (t) and the quantile functions that represent the respective
symmetric histograms associated with each unit of the referred variables are −Ψ−1
X1 (j) (1−t),
−1
−1
−Ψ−1
X2 (j) (1 − t), . . . , −ΨXp (j) (1 − t), with t ∈ [0, 1]. Each quantile function ΨY (j) that
represent the observation j of the histogram-valued variable Y may be expressed as
follows:
−1
Ψ−1
b (j) (t) + ej (t)
Y (j) (t) = ΨY
4.1. THE DSD REGRESSION MODEL I
131
where ΨY−1
b (j) (t) is the predicted quantile function for unit j, obtained from
ΨY−1
b (j) (t) = v +
p
X
k=1
ak Ψ−1
Xk (j) (t) −
p
X
k=1
bk Ψ−1
Xk (j) (1 − t)
with t ∈ [0, 1] ; ak , bk ≥ 0, k ∈ {1, 2, . . . , p} and v ∈ R.
It should be noted that ΨY−1
b (j) (t) is always a quantile function since it is a linear combi-
nation of quantile functions where the coefficient are always non-negative real values. The
error, for each unit j , is a piecewise function, but not necessarily a quantile function, given
by
−1
ej (t) = Ψ−1
b (j) (t),
Y (j) (t) − ΨY
t ∈ [0, 1].
For each unit j, the predicted distribution Yb (j) may be represented by the quantile
function ΨY−1
b (j) . This linear regression model is named
b (j) or by the respective histogram HY
Distribution and Symmetric Distribution (DSD) Regression Model I.
To define the DSD Regression Model I, it is necessary to take into account that:
1. For none of the histogram-valued variables, all m observations present a histogram
which is symmetric in relation to the yy -axis, because in this case Ψ−1
X(j) (t) and
−Ψ−1
X(j) (1 − t) would be colinear.
2. For all observations of each variable, the histograms are assumed to be defined with
the same number n of subintervals, and to each subinterval i of all observations of
each variable is associated the same weight pi , that verifies the condition pi = pn−i+1
with i ∈ {1, ..., n} .
If the histograms do not follow the conditions referred in point 2, enumerated above,
it is necessary to apply the process described in Section 2.2.3.1. Using this process, it
is possible to rewrite all distributions associated with each histogram-valued variable Xk ,
k ∈ {1, 2, . . . , m}, the distributions that represent the respective symmetric histograms
and the distributions associated with the response variable Y, with the same number of
subintervals and weights. As we refer in Section 2.2.3.3, when we rewrite the histograms
and respective symmetric histograms with the same n number of subintervals, the condition
pi = pn−i+1 , with i ∈ {1, 2, . . . , n} is verified. To define a linear regression model, we
consider also the distributions associated with the response variable but not the distributions
132
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
that represent the respective symmetric histograms. Because of this, in some situations,
the condition pi = pn−i+1 may not occur. When this happens, we consider the symmetric of
the histograms that are the observations of the response variable Y but only with the goal
of defining the weights of the subintervals such that pi = pn−i+1 , i ∈ {1, 2, . . . , n} .
The representation of the distributions in the conditions presented above is not restrictive because it is always possible to write or rewrite the observations of the histogramvalued variables obeying to these conditions. When the histogram data are given before
the aggregation of data, we may organize all histograms that are the observations of the
variables, as equiprobable.
In this work we consider that all distributions and the distributions that represent the
respective symmetric histograms, of all variables, are defined with n subintervals and with
cumulative weights {0, w1 , . . . , wn−1 , 1} . The quantile functions that represent the distribu-
tions of Xk , the quantile functions that represent respective symmetric histogram and the
quantile function that represent the distributions of response variable Y, for a given unit j
are defined respectively, by:
Ψ−1
Xk (j) (t) =
−Ψ−1
Xk (j) (1 − t) =



2t

c
+
rXk (j)1
−
1
Xk (j)1

w1





 cXk (j)2 + 2(t−w1 ) − 1 rXk (j)2
w2 −w1










Ψ−1
Y (j) (t) =
0 ≤ t < w1
if
w 1 ≤ t < w2
..
.
cXk (j)n +
2(t−w(n−1) )
1−w(n−1)
− 1 rXk (j)n



2t

−
1
−c
rXk (j)n
Xk (j)n +

w1





 −cXk (j)n−1 + 2(t−w1 ) − 1 rXk (j)n−1
w2 −w1










if
2(t−wn−1 )
1−wn−1
− 1 rXk (j)1



2t

+
c
rY (j)1
−
1

Y
(j)
1
w1






 cY (j)2 + 2(t−w1 ) rY (j)2
w2 −w1











if
0 ≤ t < w1
if
w1 ≤ t < w 2
cY (j)n +
2(t−wn−1 )
1−wn−1
rY (j)n
if
0 ≤ t < w1
if
w1 ≤ t < w 2
if
;
(4.4)
wn−1 ≤ t ≤ 1
if
..
.
(4.3)
wn−1 ≤ t ≤ 1
if
..
.
−cXk (j)1 +
;
wn−1 ≤ t ≤ 1
.
(4.5)
4.1. THE DSD REGRESSION MODEL I
133
According to Definition 4.1, for each unit j, the quantile function Ψ−1
b (j) (t) that predicts
Y
the histogram-valued variable Y, may be obtained as follows:
ΨY−1
b (j) (t) =
where



2t

−
1
c
rYb (j)1
b (j)1 +

Y
w1





 cYb (j) + 2(t−w1 ) − 1 rYb (j)
2
2
w2 −w1










if
0 ≤ t < w1
if
w1 ≤ t < w 2
..
.
cYb (j)n +
cYb (j)i = v +
and
rYb (j)i =
2(t−wn−1 )
1−wn−1
p
X
k=1
p
X
− 1 rYb (j)n
ak cXk (j)i −
ak rXk (j)i +
k=1
p
X
if
(4.6)
wn−1 ≤ t ≤ 1
bk cXk (j)n−i+1 ;
(4.7)
k=1
p
X
bk rXk (j)n−i+1 .
(4.8)
k=1
Alternatively, the predicted observations of the histogram-valued variable Y, may be obtained by the histogram:
HYb (j) =
p
P
k=1
...;
ak I Xk (j)1 − bk I Xk (j)n + v,
p
P
k=1
ak I Xk (j)n
p
P
ak I Xk (j)1 − bk I Xk (j)n + v , p1 ; . . .
p
P
− bk I Xk (j)1 + v,
ak I Xk (j)n − bk I Xk (j)1 + v , pn
k=1
k=1
(4.9)
with ak , bk ≥ 0, k ∈ {1, 2, . . . , p} and v ∈ R.
According to the DSD Model I, for each subinterval i, the predicted center cYb (j)i and
half range rYb (j)i (or the bounds) may be described, respectively, by a linear relation of the
centers cXk (j)i and half ranges rXk (j)i (or the bounds) of the explicative histogram-valued
variables. These linear relations are defined in Expressions (4.7) and (4.8), respectively.
As the values of the parameters ak , bk and v are the same in all subintervals
i ∈ {1, 2, . . . , n}, we may consider the following classical linear relations between the mean
values (weighted mean of the centers) and between the weighted mean of the half ranges
of the observations of the variables Xk and Y. I.e, for each j we have:
Y (j) = v +
p
X
k=1
r
Y (j) =
p
X
k=1
(ak − bk ) X k (j) + ec (j)
r
(ak + bk ) X k (j) + er (j)
(4.10)
(4.11)
134
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
where
X k (j) =
n
X
cXk (j)i pji and Y (j) =
i=1
n
X
cY (j)i pji
i=1
and similarly for the half ranges
r
X k (j) =
n
X
r
rXk (j)i pji and Y (j) =
i=1
n
X
rY (j)i pji .
i=1
From Expressions (4.10) and (4.11) we may observe that the parameters that define the
linear regressions between the mean values and weighted mean of the half ranges to each
distribution j are not the same but are related. In spite of the fact that this model is defined
between distributions and the relation between the distributions may be direct or inverse,
it always induces a direct linear relation between the weighted mean of the half ranges
of the variables Xk and Y, to each observation j. The direct or inverse relation between
the histogram-valued variables is always in accordance with the linear relation between the
mean values of the observations of the histogram-valued variables. The histogram-valued
variables Xk are in direct linear relation with Y when ak > bk and the linear relation is
inverse if ak < bk .
Example 4.1. To compare the parameters estimated by the DSD Regression Model I
and the behavior of the scatter plots for histogram-valued variables (see Section 2.4.1.2)
when the relations between the variables are direct or inverse, we will consider one example of each situation. All observations of the histogram-valued variables are rewritten
as histograms with six subintervals and for all units the weight associated with the each
subinterval i is the same.
1. In the first situation we study the symbolic data in Table 2.5, where 10 units are
described by two symbolic variables: Y the response histogram-valued variable and
X the explicative histogram-valued variable.
−1
−1
DSD Model I: Ψ−1
Y (j) (t) = −1.95 + 3.56ΨX(j) (t) − 0.41ΨX(j) (1 − t).
The scatter plots of this situation are represented in Figure 4.1.
4.1. THE DSD REGRESSION MODEL I
(a) Representation 3D.
135
(b) Projection of the scatter plot in z = 0.
Figure 4.1: Scatter plots considering the histogram-valued variables X and Y in Table 2.5.
2. In the second situation, we study the symbolic data in Table 4.1, where the histograms
of the explicative histogram-valued variable are the symmetric of the histogram-valued
variable X, denoted −X.
Table 4.1: Symbolic data table for histogram-valued variables −X and Y.
Units
Y
-X
1
{[33.29; 37.52[ , 0.6; [37.52; 39.61] , 0.4}
{[−12.8; −12.19; [ , 0.6; [−12.19; −11.54] , 0.4}
2
{[36.69; 39.11[ , 0.3; [39.11; 45.12] , 0.7}
{[−14.17; −13.32[ , 0.5; [−13.32; −12.07] , 0.5}
3
{[36.69; 42.64[ , 0.5; [42.64; 48.68] , 0.5}
{[−16.16; −14.2[ , 0.7; [−14.2; −12.38] , 0.3}
4
{[36.38; 40.87[ , 0.4; [40.87; 47.41] , 0.6}
{[−15.29; −14.26[ , 0.5; [−14.26; −12.38] , 0.5}
5
{[39.19; 50.86] , 1}
{[−16.24; −14.28[ , 0.7; [−14.28; −13.58] , 0.3}
6
{[39.7; 44.32[ , 0.4; [44.32; 47.24] , 0.6}
{[−15.2; −14.5[ , 0.6; [−14.5; −13.81] , 0.4}
7
{[41.56; 46.65[ , 0.6; [46.65; 48.81] , 0.4}
{[−15.55; −14.81[ , 0.5; [−14.81; −14.34] , 0.5}
8
{[38.4; 42.93[ , 0.7; [42.93; 45.22] , 0.3}
{[−14.6; −14.0[ , 0.4; [−14.0; −13.27] , 0.6}
9
{[28.83; 35.55[ , 0.5; [35.55; 41.98] , 0.5}
{[−13.8; −11.98[ , 0.6; [−11.98; −9.92] , 0.4}
10
{[44.48; 52.53] , 1}
{[−16.75; −15.78[ , 0.7; [−15.78; −15.37] , 0.3}
−1
−1
DSD Model I: Ψ−1
Y (j) (t) = −1.95 + 0.41ΨX(j) (t) − 3.56ΨX(j) (1 − t).
The corresponding scatter plot is represented in Figure 4.2.
Comparing the expressions of the DSD Model, in both situations, we can observe that
in case 2, as expected, the values of the parameters a and b change relatively to case 1.
136
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
(b) Projection of the scatter plot in z = 0.
(a) Representation 3D.
Figure 4.2: Scatter plots considering the histogram-valued variables −X and Y in Table 4.1.
Observing the behavior of the scatter plots, it is important to underline that two orientations can be distinguished: 1) The orientation of the subintervals of each histogram, that
obviously is always direct (when the histograms are represented by quantile functions,
which are non-decreasing functions) and 2) the orientation of each subinterval i for all
units j. It is this latter orientation, and consequently the orientation of the mean values of
the histograms, that induces the direct or inverse relation between the histogram-valued
variables.
In the first situation, a > b so, according to the Expression (4.10), the classical linear relation between the mean values of the histograms that are the observations of the
histogram-valued variables, are in a direct linear relation, as illustrated in Figure 4.3(a).
This behavior means that the relation between histogram-valued variables is classified as
direct. On the other hand, we consider that the linear relation between the histogram-valued
variable Y and −X is inverse because the parameter a is lower than b. As we can observe
in Figure 4.3(b) the classical linear regression between the mean values of the histograms
Y (j) and Xk (j) is inverse.
50
50
45
45
40
40
12
13
14
15
16
(a) X(j) versus Y (j) with j ∈ {1, . . . , m}.
17
−17
−16
−15
−14
−13
−12
(b) −X(j) versus Y (j) with j ∈ {1, . . . , m}.
Figure 4.3: Scatter plots considering the mean values of the observations of the histogram-valued variables X
and Y in (a); −X and Y in (b).
4.1. THE DSD REGRESSION MODEL I
137
4.1.2. Estimation of the parameters of the DSD Model I
In classical statistics, the parameters of the linear regression model are estimated
m
P
solving the minimization problem
j=1
(yj − ybj )2 , where yj are the observed and ybj the
predicted values, respectively, with j ∈ {1, . . . , m}. To solve this problem the Least
Squares method is used.
For histogram-valued variables the parameters of the DSD Model I, in Definition 4.1, are
estimated solving a quadratic optimization problem, subject to non-negativity constraints on
the unknowns.
4.1.2.1. Optimization problem
Consider ΨY−1
b (j) (t) obtained by the DSD Model I. The quadratic optimization problem is
written as:
min
SSE =
m
X
j=1
−1
2
DM
(Ψ−1
b (j) (t))
Y (j) (t), ΨY
subject to
(4.12)
ak , bk ≥ 0,
k ∈ {1, . . . , p}
v ∈ R.
Consider the quantile function Ψ−1
Y (j) (t) in Expression (4.5), the predicted quantile function ΨY−1
b (j) (t) in Expression (4.6) and the Mallows distance defined according to
Proposition 2.1. The quadratic optimization problem (4.12) can then be rewritten as follows:
min f (a1 , b1 , . . . , ap , bp , v) =
m P
n
P
j=1 i=1
+
subject to
1
3
pi
#
rY (j)i −
cY (j)i − v −
p
P
k=1
p
P
k=1
(ak cXk (j)i − bk cXk (j)n−i+1 )
2 %
(ak rXk (j)i + bk rXk (j)n−i+1 )
gk (a1 , b1 , . . . , ap , bp , v) = −ak ≤ 0,
k ∈ {1, . . . , p}
hk (a1 , b1 , . . . , ap , bp , v) = −bk ≤ 0,
k ∈ {1, . . . , p}
v ∈ R.
2
(4.13)
138
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
The quadratic optimization problem that allows estimating the parameters of the DSD
Model I may be rewritten in matricial form as a constraint quadratic problem or, as a
constraint least squares problem 3 .
Constraint quadratic problem
The minimization problem (4.13) may be rewriter in matricial form, as follows:
min
1
f (a1 , b1 , . . . , ap , bp , v) = bT H1 b + w1T b + K
2
subject to
(4.14)
−ak , −bk ≤ 0,
k ∈ {1, . . . , p}
v∈R
where H1 = [hlq ] is the hessian matrix, a symmetric matrix of order 2p + 1, with p the
number of variables Xk ; w1 = [wl ] the vector of independent terms and b the parameters
vector are column vectors with 2p + 1 rows. We have,
b = (a1 , b1 , a2 , b2 , . . . , ap , bp , v)
T
and K is the real value given by
K=
n
m X
X
j=1 i=1
pi c
2
Y (j)i
1 2
+ rY (j)i .
3
The elements of the matrices H1 and w1 are computed from the first order partial derivatives of the function f. In these partial derivatives the subintervals of the histograms are
defined from the center and half range of the intervals.
First order partial derivatives of function f
• Partial derivative in order to ak , for a fixed k ∈ {1, . . . , p}.
From function f , in Expression (4.13), we have:
p
n
m P
P
P
∂f
=
pi 2 cY (j)i − v −
(ak cXk (j)i − bk cXk (j)n−i+1 ) (−cX (j)i ) +
∂ak
j=1 i=1
k=1
p
P
2
(ak rXk (j)i + bk rXk (j)n−i+1 ) (−rX (j)i ) .
+ 3 rY (j)i −
k
k
k=1
(4.15)
3
In practical examples of this work, the optimization problems to estimate the parameters of the DSD Model
are solved using the Matlab function quadprog if we treat the problem as a constraint quadratic problem and
the Matlab function lsqlin when we write the problem as a constraint least squares problem.
4.1. THE DSD REGRESSION MODEL I
139
Throughout this work, two different ways will be used to express the first order partial
derivatives of function f. As cYb (j)i and rYb (j)i are defined as in Expressions (4.7) and
(4.8), respectively, the
∂f
∂ak
, in Expression (4.15), may be simplified to
n
m P
P
∂f
= −2
pi cY (j)i − cYb (j)i cX (j)i +
∂ak
j=1 i=1
k
rY (j)i − rYb (j)i rX (j)i . (4.16)
1
3
k
Alternatively, from Expression (4.15), grouping the terms corresponding to the parameters ak , bk , with k ∈ {1, . . . , p} and v, we obtain:
p
n
m P
P
P
∂f
=
2pi
cXk (j)i cX (j)i + 13 rXk (j)i rX (j)i ak +
∂ak
j=1 i=1
k=1
p
m P
n
P
P
2pi
−cXk (j)n−i+1 cX (j)i + 13 rXk (j)n−i+1 rX (j)i bk +
+
k
k
k
j=1 i=1
+
n
m P
P
(4.17)
k
k=1
2pi cX (j)i v +
k
j=1 i=1
n
m P
P
j=1 i=1
pi −2cY (j)i cX (j)i − 23 rY (j)i rX (j)i .
k
k
• Partial derivative in order to bk , for a fixed k ∈ {1, . . . , p}.
Similarly to the partial derivative in order to ak of function f for a fixed k ∈ {1, . . . , p},
three equivalent expressions may be presented to the partial derivative in order to bk ,:
p
m P
n
P
P
∂f
=
pi 2 cY (j)i − v −
(ak cXk (j)i − bk cXk (j)n−i+1 ) cX (j)n−i+1 +
∂bk
j=1 i=1
k=1
p
P
2
(ak rXk (j)i + bk rXk (j)n−i+1 ) (−rX (j)n−i+1 ) ;
+ 3 rY (j)i −
k
k
k=1
(4.18)
n
m P
P
∂f
= 2
pi cY (j)i − cYb (j)i cX (j)n−i+1 −
∂bk
j=1 i=1
k
1
3
rY (j)i − rYb (j)i rX (j)n−i+1 ;
k
(4.19)
p
m P
n
P
P
∂f
=
2pi
−cXk (j)i cX (j)n−i+1 + 13 rXk (j)i rX (j)n−i+1 ak +
∂bk
j=1 i=1
k=1
p
n
m P
P
P
2pi
cXk (j)n−i+1 cX (j)n−i+1 + 13 rXk (j)n−i+1 rX (j)n−i+1 bk −
+
k
k
k
j=1 i=1
−
n
m P
P
j=1 i=1
k
k=1
2pi cX (j)n−i+1 v +
k
n
m P
P
j=1 i=1
pi 2cY (j)i cX (j)n−i+1 − 23 rY (j)i rX (j)n−i+1 .
k
k
(4.20)
140
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
• Partial derivative in order to v.
∂f
∂v
=
n
m P
P
j=1 i=1
pi −2 cY (j)i −
p
P
k=1
(ak cXk (j)i − bk cXk (j)n−i+1 ) − v
(4.21)
.
Applying the Expression (4.7), the partial derivative of f in order to v may be simplified
to
∂f
∂v
= −2
n
m P
P
j=1 i=1
pi cY (j)i − cYb (j)i .
(4.22)
Another equivalent expression may be obtained from Expression (4.21),
∂f
∂v
=
m P
n
P
2pi
j=1 i=1
+2mv −
p
P
ak cXk (j)i
k=1
n
m P
P
−
2pi (cY (j)i )
m P
n
P
j=1 i=1
2pi
p
P
bk cXk (j)n−i+1 +
k=1
(4.23)
j=1 i=1
The elements of the symmetric matrix H1 were obtain from Expressions (4.15), (4.18)
and (4.21). These elements are defined by:

m P
n

P

1

c
+
r
2p
c
r

i
X l+1 (j)i X q+1 (j)i
3 X l+1 (j)i X q+1 (j)i


2
2
2
2
j=1 i=1




n
m P
P

1


2p
c
c
+
r
r
i
X l (j)n−i+1 X q (j)n−i+1

3 X 2l (j)n−i+1 X q2 (j)n−i+1

2
2

j=1 i=1


 P
m P
n
1
hlq =
2pi −cX l (j)n−i+1 cX q+1 (j)i + 3 rX l (j)n−i+1 rX q+1 (j)i

2
2
2
2
j=1 i=1




n
m

 P P 2p c

i X q+1 (j)i


2

j=1 i=1



m
n

PP



−2pi cX q (j)n−i+1

2
if
if
if
if
if
j=1 i=1
l, q odd;
l, q ≤ 2p
l, q even;
l, q ≤ 2p
l even,q odd;
l, q ≤ 2p
.
q odd;
l = 2p + 1
q even;
l = 2p + 1
(4.24)
Also from the same expressions as the matrix H1 , the column vector of independent terms,
w1 = [wl ] is defined by:
wl =

n
m P
P

2

pi −2cY (j)i cX l+1 (j)i − 3 rY (j)i rX l+1 (j)i


2
2

j=1 i=1

 m n P P
pi 2c
Y (j)i

j=1 i=1



n
m P
P



−2pi cY (j)i
j=1 i=1
cX l+1 (j)n−i+1 − 23 rY (j)i rX l+1 (j)n−i+1
2
2
if
l is odd; l ≤ 2p
if
l is even; l ≤ 2p
if
l = 2p + 1
. (4.25)
4.1. THE DSD REGRESSION MODEL I
141
Constraint least square problem
Alternatively, the minimization problem (4.13) may be presented in matricial form, as
follows:
f (a1 , b1 , . . . , ap , bp , v) = kY − X1 bk2
min
subject to
(4.26)
−ak , −bk ≤ 0,
k ∈ {1, . . . , p}
v ∈ R.
To build the matrices X1 and Y it is necessary to define some vectors and matrices.
Consider for each interval i, the vectors of length m, of the observed centers and half
ranges relatively to the histogram-valued variable Y :
c
i
√
y = ( pi cY (1)i , . . . ,
√
pi cY (m)i )
T
r
i
y =
and
r
pi
rY (1)i , . . . ,
3
r
T
pi
rY (m)i
3
From the 2p + 1− dimensional vectors, xc(j)i and xr(j)i , defined by
xc (j)i =
and
r
x (j)i =
√
.
√
√
√
√ pi cX1 (j)i , − pi cX1 (j)n−i+1 , . . . , pi cXp (j)i , − pi cXp (j)n−i+1 , pi
r
pi
rX (j) ,
3 1 i
r
pi
rX (j)
, ...,
3 1 n−i+1
r
pi
rX (j) ,
3 p i
we may build the m × (2p + 1) matrices:
Xci = [xc (1)i xc (2)i . . . xc (m)i ]T
and
r
pi
rX (j)
, 0 ;
3 p n−i+1
Xri = [xr (1)i xr (2)i . . . xr (m)i ]T .
With the matrices defined above, we may build the following matrix X1 with 2mn rows and
(2p + 1) columns:
X1 = [Xc1 Xr1
...
Xcn Xrn ]
T
(4.27)
and the column matrix Y with 2mn rows:
Y = [y1c y1r
...
T
ync ynr ] .
(4.28)
As the goal of a optimization problem is to find its optimal solution, or equivalently, find
the values of the parameters that minimize the sum of the squares of the errors between
the observed and predicted distributions, it is important to guarantee that the optimization
problem has a global minimum, i.e. an optimal solution. Consider the optimization problem
(4.13) where:
142
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
• the functions gk (a1 , b1 , . . . , ap , bp , v) and hk (a1 , b1 , . . . , ap , bp , v) that define the nonnegative constrains are convex, so the feasible region of the optimization problem is
a convex set;
T
• H1 = 2X1 X1 , with H1 and X1 defined in Expressions (4.24) and (4.27), respectively. Since the matrix H1 is positive semi-definite, then f (a1 , b1 , . . . , ap , bp , v) is a
convex function.
The quadratic function to optimize being convex and the feasible region as well, it may then
be ensured that the optimization problem (4.13) has optimal solutions. In the cases where
the objective function is strictly convex (the matrix H1 is positive definite, which occurs
when the columns of X1 are linearly independent), we may ensure that the optimal solution
is unique (Winston (1994)).
4.1.2.2. Kuhn Tucker conditions
Consider the minimization problem that allows estimating the parameters of the DSD
Model I. The optimal solution of the quadratic optimization problem, subject to non-negativity
constraints, verifies the Kuhn Tucker conditions.
Theorem 4.1. (Winston (1994)) Consider the minimization problem (4.13) or other equivalent definitions. If b∗ = (a∗1 , b∗1 , . . . , a∗p , b∗p , v ∗ ) is an optimal solution of this problem, then
b∗ must satisfy the constraints of the optimization problem and the Kuhn Tucker conditions.
For each k ∈ {1, 2, . . . , p} we have
• Constraints: −a∗k ≤ 0 and −b∗k ≤ 0;
• Kuhn Tucker conditions - exist real multipliers λk , δk satisfying:
∂f ∗
(b ) − λk = 0;
∂ak
∂f ∗
2.
(b ) − δk = 0;
∂bk
∂f ∗
3.
(b ) = 0;
∂v
1.
4. λk a∗k = 0;
5. δk b∗k = 0;
6. λk , δk ≥ 0.
4.1. THE DSD REGRESSION MODEL I
143
Or, equivalently,
∂f ∗
(b ) ≥ 0;
∂ak
∂f ∗
2’.
(b ) ≥ 0;
∂bk
∂f ∗
3’.
(b ) = 0;
∂v
1’.
∂f ∗
(b ) = 0;
∂ak
∂f ∗
5’. b∗k
(b ) = 0.
∂bk
4’. a∗k
Theorem 4.1 gives necessary conditions for a vector (a1 , b1 , . . . , ap , bp , v) to be an
optimal solution of the optimization problem (4.13). The following theorem, Theorem 4.2,
gives sufficient conditions for the vector (a1 , b1 , . . . , ap , bp , v) to be an optimal solution of
optimization problem (4.13).
Theorem 4.2. (Winston (1994)) Consider the minimization problem (4.13).
If
f (a1 , b1 , . . . , ap , bp , v), gk (a1 , b1 , . . . , ap , bp , v) and hk (a1 , b1 , . . . , ap , bp , v) are convex functions, then any vector that satisfies the Kuhn Tucker conditions in Theorem 4.1 is an optimal
solution of the optimization problem (4.13).
Since the quadratic function to optimize is convex and the feasible region too, it may be
ensured that the vectors that verify the Kuhn Tucker conditions are optimal solutions.
From those conditions, it is possible to prove some properties associated with the
predicted distribution. Some of these are the counterparts of corresponding properties
in classical statistics, and will allow defining a measure to evaluate the goodness-of-fit of
the model.
Considering the results of Section 2.4.1.2 and the Kuhn Tucker conditions, we may prove
the following propositions:
Proposition 4.1. For each unit j, let Yb (j) be the distribution predicted by the DSD Model I
considering as parameters the optimal solution b∗ = (a∗1 , b∗1 , . . . , a∗p , b∗p , v ∗ ). The mean of
the predicted histogram-valued variable Yb is given by:
Yb =
p
X
k=1
(a∗k − b∗k ) Xk + v ∗ .
Proof. Each observation j, of the predicted histogram-valued variable Yb (j) may be represented by the quantile function as in Expression (4.6) considering as parameters the
optimal solution b∗ of the quadratic optimization problem (4.13). As such, the mean quantile
144
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
function ΨY−1
b (t) may be obtained from Definition 2.21. So, applying Proposition 2.4 we may
prove that Yb =
p
X
k=1
(a∗k − b∗k ) Xk + v ∗ .
Proposition 4.2. The mean of the predicted histogram-valued variable Yb is equal to the
mean of the observed histogram-valued variable Y .
Proof. Consider the function to minimize in Expression (4.13),
n
m X
X
f (a1 , b1 , . . . , ap , bp , v) =
j=1 i=1
+ 13
2
n
m P
P
pi
j=1 i=1
⇔
⇔
m P
n
P
j=1 i=1
p
P
k=1
pi
p
P
∗
k Xk (j)i
a c
k=1
p
P
a∗k
cXk (j)i
m
k=1
−2
−
pi  cY (j)i −
!
For the optimal solution b∗ we have
!
rY (j)i −
pi
j=1 i=1
pi
j=1 i=1
a∗k Xk − b∗k Xk + v ∗ = Y .
k=1
(ak cXk (j)i − bk cXk (j)n−i+1 ) − v
(ak rXk (j)i + bk rXk (j)n−i+1 )
k=1
"2 
∂f ∗
(b ) = 0. Consequently,
∂v
n
m P
P
m P
n
P
p
X
p
X
p
P
b∗k
p
P
∗
k Xk (j)n−i+1
b c
k=1
cXk (j)n−i+1
m
k=1
From Proposition 4.1, it follows that Yb =
p
X
k=1
+ v∗ =
+ 2mv ∗ − 2
m P
n
P
j=1 i=1
pi
"2
+

n
m P
P
pi cY (j)i = 0
j=1 i=1
cY (j)i
m
(a∗k − b∗k ) Xk + v ∗ , so Yb = Y .
Proposition 4.3. For each unit j, the quantile function of the distribution Yb (j) predicted by
the DSD Model I, may be rewritten as follows:
ΨY−1
b (j) (t) − Y =
p
X
k=1
−1
∗
a∗k Ψ−1
(t)
−
X
+
b
−Ψ
(1
−
t)
+
X
k
k .
Xk (j)
Xk (j)
k
Proof. In Proposition 4.2 we proved that
Y =
p
X
k=1
∗
k
∗
k
∗
∗
(a − b ) Xk + v ⇔ v = Y −
∗
p
X
k=1
(a∗k − b∗k ) Xk .
For the optimal solution b , for each unit j, the quantile function predicted by the linear
regression model DSD I, in Definition 4.1, is given by
ΨYb (j) (t) =
p
X
k=1
∗ −1
∗
a∗k Ψ−1
Xk (j) (t) − bk ΨXk (j) (1 − t) + v
4.1. THE DSD REGRESSION MODEL I
145
which may be rewritten as
Ψ
−1
b (j)
Y
(t) − Y =
p
X
k=1
−1
∗
a∗k Ψ−1
(t)
−
X
+
b
−Ψ
(1
−
t)
+
X
.
k
k
Xk (j)
Xk (j)
k
Proposition 4.4. For the observed and predicted distributions Y (j) and Yb (j) of the variable Y, with j ∈ {1, . . . , m}, we have
m Z
X
j=1
1
0
−1
Ψ−1
b (j) (t)
Y (j) (t) − ΨY
ΨY−1
(t)
−
Y
dt = 0.
b (j)
−1
Proof. Defining the quantile functions Ψ−1
b (j) (t) from the centers and half ranges
Y (j) (t) and ΨY
of the subintervals, according to Expressions (4.5) and (4.6), respectively, we have,
m R1 P
−1
Ψ−1
b (j) (t)
Y (j) (t) − ΨY
=
cY (j)i +
j=1 0
m P
n
Rwi h
P
j=1 i=1 wi−1
ΨY−1
(t)
−
Y
dt =
b (j)
2(t−wi−1 )
wi −wi−1
i
i−1 )
r
− 1 rY (j)i − cYb (j)i − 2(t−w
−
1
b
Y (j)i ×
wi −wi−1
h
i
i−1 )
× cYb (j)i + 2(t−w
−
−
1
r
Y
dt =
b (j)i
Y
wi −wi−1
=
n
m P
Rwi h
P
j=1 i=1 wi−1
×
=
h
cY (j)i − cYb (j)i + rY (j)i − rYb (j)i
2(t−wi−1 )
wi −wi−1
i
i−1 )
cYb (j)i − Y + rŶ (j)i 2(t−w
dt =
−
1
wi −wi−1
n
m P
Rwi
P
j=1 i=1 wi−1
cY (j)i − cYb (j)i
+ rY (j)i − rYb (j)i
cYb (j)i − Y +
cYb (j)i − Y
2(t−wi−1 )
wi −wi−1
−1
i
×
cY (j)i − cYb (j)i rYb (j)i +
−1 +
2
i−1 )
dt.
+ rY (j)i − rYb (j)i rYb (j)i 2(t−w
−
1
wi −wi−1
Solving the definite integral, after some algebra and considering wi − wi−1 = pi , we
obtain,
n
m P
P
j=1 i=1
pi
cY (j)i − cYb (j)i cYb (j)i +
1
3
rY (j)i − rYb (j)i rYb (j)i − cY (j)i − cYb (j)i Y .
(4.29)
146
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
Considering the optimal solution b∗ = a∗1 , b∗1 , . . . , a∗p , b∗p , v ∗ and applying Expressions
(4.7) and (4.8), the previous expression may be written as follows:
m P
n
P
∗
p
P
∗
k Xk (j)i
∗
k Xk (j)n−i+1
cY (j)i − cYb (j)i v +
a c
−b c
+
k=1
p
P
1
∗
∗
ak rX(j)i + bk rX(j)n−i+1 − cY (j)i − cYb (j)i Y =
+ 3 rY (j)i − rYb (j)i
k=1
p
p
n
m P
P
P
P
1
∗
∗
ak rXk (j)i +
=
pi cY (j)i − cYb (j)i
ak cXk (j)i + 3 rY (j)i − rYb (j)i
k=1
j=1 i=1
k=1
p
p
n
m P
P
P
P
1
∗
∗
pi − cY (j)i − cYb (j)i
bk cXk (j)n−i+1 + 3 rY (j)i − rYb (j)i
bk rXk (j)n−i+1 +
+
pi
j=1 i=1
j=1 i=1
+
m P
n
P
j=1 i=1
n
PP
m
=
j=1 i=1
k=1
pi cY (j)i − cYb (j)i
v∗ − Y =
a∗1 pi cY (j)i − cYb (j)i cX1 (j)i +
+a∗p pi cY (j)i − cYb (j)i cXp (j)i +
+
k=1
1
3
rY (j)i − rYb (j)i rX1 (j)i + . . . +
1
3
rY (j)i − rYb (j)i rXp (j)i +
n
m P
P
b∗1 pi cY (j)i − cYb (j)i (−cX1 (j)n−i+1 ) +
n
m P
P
pi cY (j)i − cYb (j)i
j=1 i=1
+b∗p pi cY (j)i − cYb (j)i −cXp (j)n−i+1 +
+
j=1 i=1
v∗ − Y .
1
3
1
3
rY (j)i − rYb (j)i rX1 (j)n−i+1 + . . . +
rY (j)i − rYb (j)i rXp (j)n−i+1 +
Comparing the above expression with the partial derivatives of the optimization function
(see Expression (4.16), (4.19) and (4.22)), we may write
m Z
X
j=1
1
0
−1
Ψ−1
b (j) (t)
Y (j) (t) − ΨY
=−
p
ΨY−1
(t)
−
Y
dt =
b (j)
p
1 X ∗ ∂f ∗
1 X ∗ ∂f ∗
1 ∂f ∗
ak
(b ) −
bk
(b ) −
(b ) v ∗ − Y .
2 k=1 ∂ak
2 k=1 ∂bk
2 ∂v
∂f ∗
(b ) = 0,
∂v
∂f ∗
∂f ∗
(b ) = 0 and b∗k
(b ) = 0, for all k ∈ {1, . . . , p} and b∗ = (a∗1 , b∗1 , · · · , a∗n , b∗n , v ∗ ) .
a∗k
∂ak
∂bk
So,
m Z 1
X
−1
−1
−1
ΨY (j) (t) − ΨYb (j) (t) ΨYb (j) (t) − Y dt = 0.
From the Kuhn Tucker conditions presented in Theorem 4.1, we have
j=1
0
4.1. THE DSD REGRESSION MODEL I
147
4.1.3. Goodness-of-fit measure
To complete the investigation of the linear regression model for histogram-valued variables, a goodness-of-fit measure remains to be deduced. We define this measure in a
similar way as in the classical model for real data.
Proposition 4.5. The sum of the square of the Mallows distance between each observed
distribution j, with j ∈ {1, . . . , m}, of the histogram-valued variable Y and the symbolic
mean of the histogram-valued variable Y, Y , may be decomposed as follows:
m
X
j=1
m
m
X
X
−1
−1
−1
2
2
2
DM
Ψ−1
(t),
Y
=
D
Ψ
(t),
Ψ
(t)
+
D
Ψ
(t),
Y
.
b (j)
b (j)
Y (j)
Y (j)
M
M
Y
Y
j=1
j=1
Proof. Consider each observation j of the histogram-valued variable Y, represented by its
quantile function Ψ−1
Y (j) (t) and the mean of this histogram-valued variable, Y . Then, we
have
m
X
D
2
M
j=1
=
m Z
X
j=1
=
m Z
X
j=1
+2
Ψ
1
0
1
0
−1
Y (j)
(t), Y
=
m Z
X
j=1
1
0
Ψ−1
Y (j) (t) − Y
−1
−1
Ψ−1
b (j) (t) + ΨY
b (j) (t) − Y
Y (j) (t) − ΨY
Ψ
m Z
X
j=1
−1
Y (j)
(t) − Ψ
1
0
−1
b (j)
Y
(t)
2
dt +
m Z
X
j=1
−1
Ψ−1
b (j) (t)
Y (j) (t) − ΨY
1
0
2
2
Ψ
dt =
dt =
−1
b (j)
Y
(t) − Y
Ψ−1
(t)
−
Y
dt.
b (j)
Y
2
dt+
From Proposition 4.4 we have
m Z
X
j=1
1
0
−1
Ψ−1
b (j) (t)
Y (j) (t) − ΨY
ΨY−1
(t)
−
Y
dt = 0.
b (j)
It follows that we may write
m
X
j=1
2
M
D
Ψ
−1
Y (j)
(t), Y
=
m Z
X
j=1
1
0
Ψ
−1
Y (j)
(t) − Ψ
−1
b (j)
Y
(t)
2
dt +
m Z
X
j=1
1
0
ΨY−1
b (j) (t) − Y
2
dt.
Therefore, similarly to the classical model, it is possible to define the goodness-of-fit
measure of the DSD Model I.
148
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
Definition 4.2. Consider the observed and predicted distributions of the histogram-valued
variable Y and Yb represented, respectively, by their quantile functions ΨY (j) (t) and ΨY−1
b (j) (t).
Consider also Y as the symbolic mean of the histogram-valued variable Y. The goodnessof-fit measure is given by
Ω=
m
X
j=1
m
X
2
DM
ΨY−1
(t),
Y
b (j)
2
M
D
j=1
Ψ
−1
Y (j)
(t), Y
.
In classical linear regression, the coefficient of determination R2 ranges from 0 to 1. In
this case, the goodness-of-fit measure, Ω, also verifies the condition.
Proposition 4.6. The goodness-of-fit measure Ω ranges from 0 to 1.
Proof. Consider Ω the measure defined in Definition 4.2. This measure is non-negative, so
Ω ≥ 0.
From Proposition 4.5, we have
m
P
j=1
2
2
m R1 m R P
P
1
−1
−1
−1
2
DM
Ψ−1
(t),
Y
=
Ψ
(t)
−
Ψ
(t)
dt
+
Ψ
(t)
−
Y
dt
b (j)
b (j)
Y (j)
Y (j)
0
Y
Y
j=1 0
⇔
1 =
m R1
P
j=1 0
j=1
(t)−Ψ−1 (t))
(Ψ−1
Y (j)
Y (j)
m
P
j=1
⇔
Ω = 1−
Since the term
m R1
P
j=1 0
m
P
(
(
2
DM
Ψ−1
(t),Y
Y (j)
m R1
P
j=1 0
(t)−Ψ−1 (t))
(Ψ−1
Y (j)
Y (j)
c
2
c
+
)
c
j=1
j=1 0
m
P
j=1
(t)−Ψ−1 (t))
(Ψ−1
Y (j)
Y (j)
m
P
2
dt
m R1
P
(
2
DM
Ψ−1
(t),Y
Y (j)
)
2
(ΨY−1(j) (t)−Y )
2
c
(
2
DM
Ψ−1
(t),Y
Y (j)
dt
)
dt
.
dt
is non-negative, the value of Ω is always lower
)
than or equal to 1. So, we have that 0 ≤ Ω ≤ 1.
j=1
2
DM
Ψ−1
(t),Y
Y (j)
Let us now analyze the extreme situations.
Suppose Ω = 0. In this case,
m
X
j=1
2
M
D
Ψ
−1
b (j)
Y
(t), Y
=0⇔
m Z
X
j=1
1
0
ΨY−1
b (j) (t) − Y
2
dt = 0.
−1
So, for all j ∈ {1, . . . , m} , we have ΨY−1
b (j) (t) − Y = 0 ⇔ ΨY
b (j) (t) = Y . In this case the
predicted function for all units j is a constant function.
4.2. THE DSD REGRESSION MODEL II
149
Suppose now that Ω = 1. In this case,
m
X
2
M
D
j=1
Ψ
−1
b (j)
Y
(t), Y
=
m
X
j=1
2
DM
Ψ−1
(t),
Y
.
Y (j)
From the decomposition obtained in Proposition 4.5 we have,
m
X
2
M
D
j=1
Ψ
−1
Y (j)
(t), Y
=
m
X
2
M
D
j=1
Ψ
−1
b (j)
Y
(t), Y
+
m
X
j=1
−1
2
DM
Ψ−1
(t),
Ψ
(t)
b (j)
Y (j)
Y
and so,
m
X
2
M
D
j=1
Ψ
−1
b (j)
Y
(t), Ψ
−1
Y (j)
(t) = 0.
So, for all j ∈ {1, . . . , m} ,
D
2
M
Ψ
−1
b (j)
Y
(t), Ψ
−1
Y (j)
(t) = 0 ⇔
Z
1
0
−1
ΨY−1
b (j) (t) − ΨY (j) (t)
2
−1
dt = 0 ⇒ ΨY−1
b (j) (t) = ΨY (j) (t).
In this case, for each unit j , the predicted and observed quantile functions are coincident.
In conclusion 0 ≤ Ω ≤ 1. If Ω = 0 there is no linear relation between the explicative
and response histogram-valued variables. If Ω = 1, the linear relation is perfect, so the
relation between the histogram-valued variable Y and histogram-valued variables Xk , with
k ∈ {1, . . . , p}, is exactly the relation defined by the linear regression model.
4.2. The DSD Regression Model II
As we are working in the semi-vector space of the quantile functions, where the defined
operations are the addition of the quantile functions and the product of the quantile function
by a real positive number, we may consider, in the linear regression model, a quantile
function as independent parameter instead of a real number. This new approach allows the
model to be more flexible. In the DSD Regression Model I the independent parameter being
a real number, it only influences the fit of the centers of the predicted subintervals of the
histogram. Moreover, this influence will be equal in all subintervals. Considering a quantile
function as independent parameter, will allow predicting quantile functions where the center
and half range of the subintervals of each histogram may be influenced in different ways.
For these reasons we may expect better results with this alternative model.
150
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
4.2.1. Definition of the Model
Definition 4.3. Consider the histogram-valued variables X1 , X2 , . . . , Xp . The quantile functions that represent the distributions of these histogram-valued variables, for each unit j,
−1
−1
are Ψ−1
X1 (j) (t), ΨX2 (j) (t), . . . , ΨXp (j) (t) and the quantile functions of the respective sym-
metric histograms associated with each unit of the referred variables are −Ψ−1
X1 (j) (1 − t),
−1
−1
−Ψ−1
X2 (j) (1 − t), . . . , −ΨXp (j) (1 − t), with t ∈ [0, 1]. Each quantile function ΨY (j) , may be
expressed as:
−1
Ψ−1
b (j) (t) + ej (t).
Y (j) (t) = ΨY
where Ψ−1
b (j) (t) is the predicted quantile function for unit j, obtained from
Y
−1
−1
−1
−1
ΨY−1
b (j) (t) = ΨConstant (t) + a1 ΨX1 (j) (t) − b1 ΨX1 (j) (1 − t) + a2 ΨX2 (j) (t)−
−1
−1
−b2 Ψ−1
X2 (j) (1 − t) + . . . + ap ΨXp (j) (t) − bp ΨXp (j) (1 − t).
with t ∈ [0, 1] and ak , bk ≥ 0, k ∈ {1, 2, . . . , p} . The independent parameter Ψ−1
Constant (t)
is now a quantile function, that we assume to be continuous, defined by
Ψ−1
Constant (t) =


2t


−
1
rv 1
c
+
v

w1





2(t−w1 )

−
1
+
r
+
c
+
r
rv 2

v
v1
v2
w2 −w1

















cv + rv1 + 2rv2 + rv3 +
..
.
P
2(t−w2 )
w3 −w2
n−1
c v + rv 1 +
2rvi + rvn +
i=2
− 1 rv3
2(t−wn−1 )
1−wn−1
− 1 rvn
if
0 ≤ t < w1
if
w 1 ≤ t < w2
if
w 2 ≤ t < w3
if
wn−1 ≤ t ≤ 1
(4.30)
where rvi ≥ 0, i ∈ {1, 2, . . . , n} and cv ∈ R.
The error, for each unit j , is the piecewise function given by
−1
ej (t) = Ψ−1
b (j) (t),
Y (j) (t) − ΨY
t ∈ [0, 1]
This linear regression model will be named Distribution and Symmetric Distribution
(DSD) Regression Model II.
In this case it is also assumed that the quantile functions that represent the distributions
−1
associated with the response variable, Ψ−1
Y (j) (t), the k explicative variables ΨXk (j) (t) and
4.2. THE DSD REGRESSION MODEL II
151
−Ψ−1
Xk (j) (1 − t) are defined in same conditions as those of DSD Model I, presented in
Section 4.1.1.
According to DSD Model II and the conditions mentioned above, the quantile function
that represents the distribution of the predicted histogram-valued variable Y, for a given unit
j is:
b
b
ΨY−1
b (j) (t) = Yc(j) (t) + Yr(j) (t)
with
 p
P


(ak cXk (j)1 − bk cXk (j)n ) + cv



k=1


p

P



(ak cXk (j)2 − bk cXk (j)n−1 ) + cv + rv1 + rv2



 k=1
p
P
Ybc(j) (t) =
(ak cXk (j)3 − bk cXk (j)n−2 ) + cv + rv1 + 2rv2 + rv3


k=1



..


.




p
n−1

P
P



(ak cXk (j)n − bk cXk (j)1 ) + cv + rv1 +
2rvi + rvn
k=1
(4.31)
if
0 ≤ t < w1
if
w1 ≤ t < w 2
if
w2 ≤ t < w 3
if
wn−1 ≤ t ≤ 1
if
0 ≤ t < w1
if
w 1 ≤ t < w2
if
w 2 ≤ t < w3
if
wn−1 ≤ t ≤ 1
i=2
 P
p

2t

−1
(ak rXk (j)1 + bk rXk (j)n ) + rv1


w1

k=1


P
p


2(t−w1 )


(ak rXk (j)2 + bk rXk (j)n−1 ) + rv2
−1

w2 −w1

k=1

 P
p
2(t−w
)
b
2
Yr(j) (t) =
(ak rXk (j)3 + bk rXk (j)n−2 ) + rv3
 w3 −w2 − 1

k=1



.


 ..



P
p


2(t−wn−1 )


−1
(ak rXk (j)n + bk rXk (j)1 ) + rvn
1−wn−1
k=1
.
or by the histogram HYb (j) :
HYb (j) =
where
I Yb1 =
p
X
k=1
I Yb1 , I Yb1 , p1 ; I Yb2 , I Yb2 , p2 ; . . . ; I Ybn , I Ybn , pn
ak I Xk (j)1 − bk I Xk (j)n + (cv − rv1 ) ;
I Yb1 = I Yb2 =
I Yb2 = I Yb3 =
p
X
k=1
p
X
k=1
ak I Xk (j)2 − bk I Xk (j)n−1 + (cv + rv1 ) ;
ak I Xk (j)3 − bk I Xk (j)n−2 + (cv + rv1 + 2rv2 )
(4.32)
152
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
and for i > 2,
I Ybi =
p
X
I Ybi =
p
X
k=1
k=1
!
ak I Xk (j)i − bk I Xk (j)n−i+1 + cv + rv1 +
!
ak I Xk (j)i+1 − bk I Xk (j)n−i + cv + rv1 +
i−1
X
2rvt
"
;
2rvt
"
.
t=2
i
X
t=2
The DSD Model II induces linear relations between the centers and half ranges of each
subinterval of the histogram. The prediction of the center of the subinterval i, for each unit
j, is given by:
cYb (j)i = vi +
with
p
X
k=1
ak cXk (j)i −



cv




vi =
c v + r v 1 + rv2



i−1
P


 c v + r v 1 + 2 r vt + r vi
p
X
bk cXk (j)n−i+1
(4.33)
k=1
if
i=1
if
i=2
if
i ∈ {3, . . . , n}
t=2
Analogously, for each subinterval i, the prediction of the half range of each unit j is
given by:
rYb (j)i = rvi +
p
X
ak rXk (j)i +
k=1
p
X
bk rXk (j)n−i+1 .
(4.34)
k=1
As in DSD Model I, we may define classical linear regressions between the mean values
(weighted mean of the centers) and between the weighted mean of the half ranges of the
histograms. In this case, the classical linear regressions are:
Y (j) = cv +
p
X
k=1
r
Y (j) = rv +
(ak − bk )X k (j) + ec (j)
p
X
r
(ak + bk )X k (j) + er (j)
(4.35)
(4.36)
k=1
r
r
with X k (j), Y (j), Y (j), X (j) defined as in Expressions (4.10) and (4.11), cv =
p
P
k=1
p i c vi
(symbolic mean of the histograms Ψ−1
Constant (t) with cvi defined according Expression (4.30))
and r v =
p
P
k=1
p i r vi .
4.2. THE DSD REGRESSION MODEL II
153
As in DSD Model I, we assume that the direct or inverse relation between the histogramvalued variables is always in accordance with the linear relation between the symbolic mean
of the observations of the histogram-valued variables, in Expression (4.35). The histogramvalued variables Xk are in direct linear relation with Y when ak > bk and the linear relation
is inverse if ak < bk .
Example 4.2. Consider again the data in Example 4.1. We may observe that also with the
DSD Model II, in the first case the value of a is greater than the value of b and in the second
situation these values are interchanged.
−1
−1
−1
1. DSD Model II: Ψ−1
Y (j) (t) = ΨConstant (t) + 3.11ΨX(j) (t) − 0ΨX(j) (1 − t)
−1
−1
−1
2. DSD Model II: Ψ−1
Y (j) (t) = ΨConstant (t) + 0Ψ−X(j) (t) − 3.11Ψ−X(j) (1 − t)
In both situations the Ψ−1
Constant (t) is as follows:


2t


−2.27
+
−
1
× 0.22

0.3




2(t−0.3)


−1.93
+
−
1
× 0.12

0.1





 −1.67 + 2(t−0.4) − 1 × 0.14
0.1
−1
ΨConstant (t) =



−1.37 + 2(t−0.5)
−
1
× 0.16

0.1






−1.07 + 2(t−0.6)
− 1 × 0.14

0.1





 −0.51 + 2(t−0.7) − 1 × 0.43
0.3
if
0 ≤ t < 0.3
if
0.3 ≤ t < 0.4
if
0.4 ≤ t < 0.5
if
0.5 ≤ t < 0.6
if
0.6 ≤ t < 0.7
if
0.7 ≤ t ≤ 1
.
4.2.2. Estimation of the parameters of the DSD Model II
4.2.2.1. Optimization problem
−1
Consider the quantile functions Ψ−1
b (j) (t). Using the Mallows distance deY (j) (t) and ΨY
fined according to Proposition 2.1, we define the optimization problem that allows predicting
the parameters of DSD Model II. This optimization problem is very similar to the one defined
for DSD Model I. The main difference is that now it is necessary to impose more constraints
because we must ensure that the independent parameter is a quantile function. So, in this
154
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
case, the quadratic optimization problem is defined as follows:
min
f (a1 , b1 , . . . , ap , bp , cv , rv1 , . . . , rvn ) =
m
X
j=1
−1
2
DM
(Ψ−1
b (j) (t))
Y (j) (t), ΨY
subject to
ak , bk ≥ 0,
k ∈ {1, . . . , p}
(4.37)
cv ∈ R
rvi ≥ 0,
i ∈ {1, . . . , n}.
According to DSD Model II, considering the quantile function ΨY−1
b (j) (t) defined in Expres-
sion (4.31) and the Expressions (4.33), (4.34), the function to minimize may be written as
follows:
pi (cY (j)i − cYb (j)i )2 + 13 (rY (j)i − rYb (j)i )2 ⇔
j=1 i=1
#
2
p
m
P
P
p1 cY (j)1 −
(ak cXk (j)1 − bk cXk (j)n ) − cv +
f (a1 , b1 , . . . , ap , bp , cv , rv1 , . . . , rvn ) =
j=1
k=1
2 %
p
P
+ 13 rY (j)1 −
(ak rXk (j)1 + bk rXk (j)n ) − rv1
+
k=1
#
2
p
P
+p2 cY (j)2 −
(ak cXk (j)2 − bk cXk (j)n−1 ) − cv − rv1 − rv2 +
k=1
2 %
p
P
+ 31 rY (j)2 −
(ak rXk (j)2 + bk rXk (j)n−1 ) − rv2
+
k=1
#
2
p
n
i−1
P
P
P
+ pi cY (j)i −
(ak cXk (j)i − bk cXk (j)n−i+1 ) − cv − rv1 − 2 rvt − rvi +
i=3
k=1
t=2
%
2
p
P
(ak rXk (j)i + bk rXk (j)n−i+1 ) − rvi
.
+ 13 rY (j)i −
f (a1 , b1 , . . . , ap , bp , cv , rv1 , . . . , rvn ) =
n
m P
P
k=1
(4.38)
To define the optimization problem in matricial form, it is necessary to compute the first
order partial derivatives of f. In these partial derivatives the subintervals of the histograms
are defined in terms of the center and half range of the intervals.
First order partial derivatives of the function f
• Partial derivative in order to ak , for a fixed k ∈ {1, . . . , p}.
4.2. THE DSD REGRESSION MODEL II
155
From Expression (4.38), we have,
p
m
P
P
∂f
=
2p1 cY (j)1 −
(ak cXk (j)1 − bk cXk (j)n ) − cv (−cX (j)1 ) +
∂ak
j=1
k=1
p
P
1
(ak rXk (j)1 + bk rXk (j)n ) − rv1 (−rX (j)1 ) +
+ 3 rY (j)1 −
k=1
p
m
P
P
(ak cX (j)2 − bk cX (j)n−1 ) −
+ 2p2 cY (j)2 −
k
k
k
j=1
k
k=1
−cv − rv1 − rv2 ) (−cX (j)2 ) +
p
P
1
(ak rX (j)2 + bk rX (j)n−1 ) − rv2 (−rX (j)2 ) + . . . +
+ 3 rY (j)2 −
k=1
p
m
P
P
+ 2pn cY (j)n −
(ak cX (j)n − bk cX (j)1 ) − cv − rv1 −
j=1
k=1
n−1
P
2rvt − rvn (−cX (j)n ) +
−
t=2
p
P
1
(ak rX (j)n + bk rX (j)1 ) − rvn (−rX (j)n ) .
+ 3 rY (j)n −
k
k
k
k
k
k
k
k
k
k
k=1
(4.39)
Considering the Expressions (4.33) and (4.34), the partial derivative
∂f
∂ak
may be
succinctly written as:
m P
n
P
∂f
= 2
pi cY (j)i − cYb (j)i (−cX (j)i ) +
∂ak
j=1 i=1
k
rY (j)i − rYb (j)i (−rX (j)i ) .
1
3
k
(4.40)
Alternatively, grouping the terms of the Expression (4.39), corresponding to the parameters ak , bk ,cv , rvi (with k ∈ {1, 2, . . . , p}, i ∈ {1, 2, . . . , n}) and the independent
term, the Expression (4.39) that defines
∂f
∂ak
, may also be written as:
p
n
m P
P
P
∂f
=
2pi
cXk (j)i cX (j)i + 13 rXk (j)i rX (j)i ak +
∂ak
j=1 i=1
k=1
p
n
m P
P
P
+
2pi
−cXk (j)n−i+1 cX (j)i + 13 rXk (j)n−i+1 rX (j)i bk +
j=1 i=1
k=1
n
n
m P
m
P
P
P
2
2pi cX (j)i cv +
2pi cX (j)i + 3 p1 rX (j)i rv1 +
+
j=1 i=1
j=1 i=2
n−1
m
n
P P
P
1
+
2pt cX (j)t + 3 rX (j)t +
4pi cX (j)t rvt +
k
k
k
k
k
k
t=2
m
+
P
k
k
j=1
k
i=t+1
2pn cX (j)n + 13 rX (j)n rvn +
j=1
m n
+
k
PP
j=1 i=1
k
k
−2pi cY (j)i cX (j)i + 13 rY (j)i rX (j)i .
k
k
(4.41)
156
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
The other first order partial derivatives are obtain analogously. As to
∂f
∂ak
from all first
order partial derivatives, we will present a short expression that was defined taking
into account the Expressions (4.33), (4.34) and the more detailed expressions when
the terms are grouped according to the same parameters. Through this chapter, the
two forms will be necessary.
• Partial derivative in order to bk , for a fixed k ∈ {1, . . . , p}.
n
m P
P
∂f
= 2
pi cY (j)i − cYb (j)i cX (j)n−i+1 +
∂bk
j=1 i=1
k
1
3
rY (j)i − rYb (j)i (−rX (j)n−i+1 ) ;
k
(4.42)
p
n
m P
P
P
∂f
=
2pi
−cXk (j)i cX (j)n−i+1 + 13 rXk (j)i rX (j)n−i+1 ak +
∂bk
j=1 i=1
k=1
p
m
n
PP
P
2pi
cXk (j)n−i+1 cX (j)n−i+1 + 13 rXk (j)n−i+1 rX (j)n−i+1 bk +
+
j=1 i=1
k=1
n
n
m
m
PP
P
P
2
+
−2pi cX (j)n−i+1 cv +
−2pi cX (j)n−i+1 + 3 p1 rX (j)n rv1 +
j=1 i=1
j=1 i=1
m
n−1
n
PP
P
1
2pt −cX (j)n−t+1 + 3 rX (j)n−t+1 +
−4pi cX (j)n−i+1 rvt +
+
k
k
k
k
k
k
k
k
k
t=2 j=1
+
m
P
j=1
+
k
i=t+1
2pn −cX (j)1 + 13 rX (j)1 rvn +
m P
n
P
j=1 i=1
k
k
2pi cY (j)i cX (j)n−i+1 − 13 rY (j)i rX (j)n−i+1 .
k
k
(4.43)
• Partial derivative in order to cv .
n
m P
P
∂f
= −2
pi cY (j)i − cYb (j)i ;
∂cv
j=1 i=1
p
p
n
n
m P
m P
P
P
P
P
∂f
=
2pi
ak cXk (j)i +
−2pi
bk cXk (j)n−i+1 + 2mcv +
∂cv
j=1 i=1
k=1
j=1 i=1
k=1
n
n−1
n
P
P
P
+ 2pi mrv1 +
2mpt +
4mpt rvt + 2mpn rvn −
i=2
−
n
m P
P
j=1 i=1
t=2
2pi cY (j)i .
i=t+1
(4.44)
(4.45)
4.2. THE DSD REGRESSION MODEL II
157
• Partial derivative in order to rvi .
- with i = 1
∂f
∂rv1
∂f
∂rv1
= −2
=
p
m P
P
m
P
j=1
1
3
p1 rY (j)1 − rYb (j)1 +
i=2
pi cY (j)i − cYb (j)i
;
p1 rXk (j)1 + 2pi cXk (j)i ak +
i=2
p
m P
n
n
P
P
P
2
+
p
r
−2p
c
b
+
2pi mcv +
+
1
X
(j)
i
X
(j)
k
n
n−i+1
k
k
3
j=1 k=1
i=2
i=2
n−1
n
n
P
P
P
2
2mpt +
4pi m rvt +
+ 3 mp1 + 2mpi rv1 +
i=2
t=2
i=t+1
m
n
P
P
2
+2mpn rvn −
+ +2pi cY (j)i .
pr
3 1 Y (j)1
j=1 k=1
j=1
(4.46)
n
P
2
3
n
P
(4.47)
i=2
- with i = 2
∂f
∂rv2
= −2
m
P
∂f
∂rv2
=
p
m P
P
cY (j)2 − cYb (j)2 +
p2
j=1
−4
m P
n
P
j=1 i=3
1
3
rY (j)2 − rYb (j)2
pi cY (j)i − cYb (j)i ;
1
3 Xk (j)2
−
(4.48)
n
P
2p2 cXk (j)2 + r
+ 4pi cXk (j)i ak +
i=3
p
m P
n
P
P
1
2p2 −cXk (j)n−1 + 3 rXk (j)n−1 − 4pi cXk (j)n−i+1 bk +
+
j=1 k=1
i=3
n
n
P
P
+ 2mp2 + 4mpi cv + 2mp2 + 4mpi rv1 +
i=3
i=3
n
n−1
n
P
P
P
8
+ 3 mp2 + 8mpi rv2 +
4mpt +
8mpi rt +
i=3
t=3
i=t+1
m
n
P
P
1
2p2 cY (j)i + 3 rY (j)2 + 4pi cY (j)i .
+4mpn rvn −
j=1 k=1
j=1
i=3
(4.49)
158
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
- with i = k and 2 < k < n
∂f
∂rv
k
= −2
2<k<n
m
P
∂f
∂rv
=
p
m P
P
j=1 k=1
k
2<k<n
cY (j) − cYb (j) +
pk
k
j=1
−4
n
m
P
P
j=1 i=k+1
k
1
3
rY (j) − rYb (j)
k
k
−
(4.50)
pi cY (j)i − cYb (j)i ;
1
3 Xk (j)k
2pk cXk (j) + r
k
n
P
+
4pi cXk (j)i ak +
i=k+1
2pk −cXk (j)n− +1 + 13 rXk (j)n− +1 −
j=1 k=1
n
P
−
4pi cXk (j)n−i+1 bk +
i=k+1
n
n
P
P
4mpi cv + 2mpk +
4mpi rv1 +
+ 2mpk +
i=k+1
i=k+1
k−1
n
n
P
P
P
8
+
4mpk +
8mpi rvt + 3 mpk +
8mpi rv +
t=2
i=k+1
i=k+1
n−1
n
P
P
4mpt +
8mpi rvt + 4mpn rvn −
+
i=t+1
t=k+1
m
n
P
P
1
−
2pk cY (j) + 3 rY (j) +
4pi cY (j)i .
+
p
m P
P
k
k
(4.51)
k
k
k
j=1
i=k+1
- with i = n
∂f
∂rvn
= −2
∂f
∂rvn
=
m
P
j=1
p
m P
P
j=1 k=1
+
pn
cY (j)n − cYb (j)n +
p
m P
P
2pn −cXk (j)1 + 13 rXk (j)1 bk +
+2mpn cv + 2mpn rv1 +
P
n−1
t=2
m
P
j=1
rY (j)n − rYb (j)n
;
(4.52)
2pn cXk (j)n + 13 rXk (j)n ak +
j=1 k=1
−
1
3
4mpn rvt + 83 mpn rvn −
2pn cY (j)n + 13 rY (j)n .
(4.53)
4.2. THE DSD REGRESSION MODEL II
159
Constraint quadratic problem
The optimization problem (4.37) may be rewritten in matricial form, as given below.
min
1
f (a1 , b1 , . . . , ap , bp , cv , rv1 , . . . , rvn ) = bT H2 b + w2T b + K
2
subject to
−ak , −bk ≤ 0,
(4.54)
k ∈ {1, . . . , p}
cv ∈ R
−rvi ≤ 0,
i ∈ {1, . . . , n}.
In this case the vector of the parameters is b = (a1 , b1 , . . . , ap , bp , cv , rv1 , . . . , rvn )T and H2
is the hessian matrix. The symmetric matrix H2 of order 2p + n + 1 (with p the number
of explicative variables and n the number of subintervals in each histogram) is defined
calculating the second order partial derivatives from the Expressions (4.41); (4.43); (4.45);
(4.47); (4.49); (4.51) and (4.53). The matrix H2 may be defined as follows:


 H 1 MT 


H2 = 



M N
(4.55)
where H1 is the matrix defined for DSD Model I, in Expression (4.24), M = [mlq ] is a
n × (2p − 1) matrix and N = [nlq ] is the symmetric matrix of order n. These matrices are
composed as follows:

m
n
P
P

2


+
p
r
2pi cX q+1 (j)i
1
X
(j)
q+1
1

3

2
2
j=1
i=2




m
n

P2
P


pr
− 2pi cX q (j)n−i+1

3 1 X q2 (j)n

2

i=2
j=1




n
P



2mpi



i=2


 P
m
n
P
1
mlq =
2pl 3 rX q+1 (j)l + cX q+1 (j)l +
4pi cX q+1 (j)i

2
2
2
j=1
i=l+1




m
n
P

 P 2p 1 r

− cX q (j)n−l+1 −
4pi cX q (j)n−i+1
l

3 X q2 (j)n−l+1

2
2

j=1
i=l+1



m

P

1


2p
r
+
c
n
X
(j)
X
(j)
q+1
q+1
n
n

3

2
2
j=1




m

P

1

−
c
2p
r

n
X q (j)1
3 X q2 (j)1
2
j=1
if
if
if
if
if
if
if
l = 1, q ≤ 2p;
q is odd
l = 1, q ≤ 2p;
q is even
l = 1;
q = 2p + 1
2 ≤ l ≤ n − 1,
q ≤ 2p; q is odd
2 ≤ l ≤ n − 1,
q ≤ 2p; q is even
l = n, q ≤ 2p;
q is odd
l = n;
q = 2p + 1
160
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA

n
P


2

mp
+
2mpi
q

3


i=q+1



n
P



+
4mpi
2mp

q


i=q+1






2mpq




n
P
8
nlq =
mp
+
8mpi
q
3


i=q+1



n

P


4mpq +
8mpi



i=q+1






4mpq







 8 mpq
3
if
l=q=1
if
l = 1, 2 ≤ q ≤ n − 1
if
l = 1, q = n
if
2 ≤ l ≤ n − 1, q = l
if
2 ≤ l ≤ n − 1, q > l ∨ q < l
if
2 ≤ l ≤ n − 1, q = n
if
l=q=n
In the column matrix w2 the elements of the rows 1 to 2p + 1 are the elements of
matrix w1 in Expression (4.25). So, in this model the column matrix with 2p + n + 1 rows,
w2 = [w1 s]T . The elements of s = [sl ] with n rows are defined by:

n
m
P
P



− 23 pl rY (j)l − 2pi cY (j)i


j=1
i=1


 P
m
n
P
−2pl cY (j)l + 13 rY (j)l −
4pi cY (j)i
sl =

j=1
i=l+1

 m

P



−2pl cY (j)l + 13 rY (j)l

if
l=1
if
2≤l ≤n−1 .
if
l=n
j=1
Similarly to the DSD Model I, it is also possible to rewrite the constraint quadratic
problem (4.54) as a constraint least square problem.
4.2. THE DSD REGRESSION MODEL II
161
Constraint least square problem
f (a1 , b1 , . . . , ap , bp , cv , rv1 , . . . , rvn ) = kY − X2 bk2
min
subject to
−ak , −bk ≤ 0,
(4.56)
k ∈ {1, . . . , p}.
cv ∈ R
−rvi ≤ 0,
i ∈ {1, . . . , n}.
The matrix Y is the one presented in Expression (4.28), when the constraint least
square problem for DSD Model I was defined. As before the matrix X2 may be defined
from the matrix X1 in Expression (4.27).
If p is the number of variables Xk , n the number of subintervals in each histogram and
m the number of units j observed for each variable, the matrix X2 with 2mn rows and
(2p + 1 + n) columns is defined as:
X2 =
Z = [Zc1 Zr1
...
Zic = [zlqc ]m×n with
h
X1 Z
i
(4.57)
Zcn Zrn ]T is a 2mn × n matrix where Z1c = Om×n and for i ≥ 2,

√



pi



√
zlqc =
2 pi





 0
if
1≤l≤m
∧
(q = 1 ∨ q = i)
if
1≤l≤m
∧
1<q<i
other cases
r
and the matrices Zir = [zlq
] are defined by:
zlqr =

p pi


3

 0
if
1≤l≤m
∧
q=i
.
other cases
Considering X2 as defined in Expression (4.57) and H2 as defined in Expression (4.55),
T
we then have H2 = 2X2 X2 . Therefore, the symmetric matrix H2 is positive semi-definite
and the function f (a1 , b1 , . . . , ap , bp , cv , rv1 , . . . , rvn ) is convex. Because all constraints are
non-negative constraints, the feasible region of the optimization problem is a convex set. As
the conditions of convexity for the function to optimize and the feasible region are verified,
162
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
we may ensure that the optimization problem has optimal solutions. As before, the optimal
solutions of the quadratic optimization problems verify the Kuhn Tucker conditions.
The minimization problem that allows estimating the parameters of the DSD Model II
has more constraints but they are also non-negative constraints. For this case we may
consider an extension of the Theorem 4.1.
Theorem 4.3. (Winston (1994)) Consider the minimization problem (4.37) or other equi-
valent definitions. If b∗ = a∗1 , b∗1 , . . . , a∗p , b∗p , c∗v , rv∗1 , . . . , rv∗n is an optimal solution of this
problem, then b∗ must satisfy the constrains of the optimization problem and the Kuhn
Tucker conditions. For each k ∈ {1, 2, . . . , p} and i ∈ {1, 2, . . . , n} , we have:
• Constrains: −a∗k ≤ 0; −b∗k ≤ 0 and −rv∗i ≤ 0.
• Kuhn Tucker conditions:
∂f ∗
(b ) ≥ 0;
∂ak
∂f ∗
2.
(b ) ≥ 0;
∂bk
∂f ∗
3.
(b ) = 0;
∂cv
∂f ∗
(b ) = 0;
∂ak
∂f ∗
5. b∗k
(b ) = 0;
∂bk
∂f ∗
6. rv∗i
(b ) = 0.
∂rvi
4. a∗k
1.
4.2.2.2. Properties and the Goodness-of-fit measure
Properties similar to the ones proved for DSD Model I may also be obtained when the
distributions are predicted with DSD Model II.
Proposition 4.7. For each unit j, let Yb (j) be the distribution predicted by DSD Model II
considering as parameters the optimal solution b∗ = a∗1 , b∗1 , . . . , a∗p , b∗p , c∗v , rv∗1 , . . . , rv∗n .
The symbolic mean of the predicted histogram-valued variable Yb is given by:
Yb =
p
X
k=1
∗
k
∗
k
∗
v
(a − b ) Xk + c +
n
X
i=2
∗
i v1
pr +
n−1
X
t=2
!
pt +
n
X
i=t+1
"
2pt rv∗t + pn rv∗n .
Proof. Each observation j of the predicted histogram-valued variable Yb (j), may be represented by the quantile function as in Expression (4.31) considering as parameters the
optimal solution b∗ of the quadratic optimization problem (4.54). Therefore, the mean
quantile function, Ψ−1
b (t), may be obtain from Definition 2.21. So, applying Proposition 2.4
Y
4.2. THE DSD REGRESSION MODEL II
163
and considering that pi = wi − wi−1 , i ∈ 1, 2, . . . , n, we have:
Yb =
=
R1
ΨY−1
b (t)dt =
0
p
m
Rw1 1 P
P
0
m
j=1
k=1
2t
w1
a∗k cXk(j)1 − b∗k cXk(j)n + c∗v +
P
p
∗
k Xk(j)1
∗
k Xk(j)n
∗
v1
−1
+b r
a r
+ r dt+
k=1
p m
Rw2 P
P
a∗k cXk(j)2 − b∗k cXk(j)n−1 + c∗v + rv∗1 + rv∗2 +
+ m1
j=1 k=1
w1
P
p 2(t−w1 )
∗
∗
∗
+ w2 −w1 − 1
ak rXk(j)2 + bk rXk(j)n−1 + rv2 dt+
k=1
p
m
n
i−1
Rwi 1 P
P
P
P ∗
∗
∗
∗
∗
+
a
c
c
+
c
+
r
2rvt + rv∗i +
−
b
+
X
X
k
k
v
v
k(j)
k(j)
1
m
i
n−i+1
i=3 wi−1
j=1 k=1
t=2
P
p 2(t−wi−1 )
∗
∗
∗
+ wi −wi−1 − 1
ak rXk(j)i + bk rXk(j)n−i+1 + rvi dt =
k=1
p
m
P
P
1
∗
∗
∗
ak cXk(j)1 − bk cXk(j)n + cv p1 +
= m
j=1 k=1
p m
P
P
1
∗
∗
∗
∗
∗
ak cXk(j)2 − bk cXk(j)n−1 + cv + rv1 + rv2 p2 +
+m
j=1 k=1
p n
i−1
m
P
P
P
P ∗
1
∗
∗
∗
∗
∗
+
ak cXk(j)i − bk cXk(j)n−i+1 + cv + rv1 + 2rvt + rvi pi =
m
i=3
j=1 k=1
t=2
p
n
n−1
n
P
P
P
P
=
(a∗k − b∗k ) Xk + c∗v + pi rv∗1 +
pt +
2pt rv∗t + pn rv∗n
+
k=1
i=2
t=2
i=t+1
Proposition 4.8. The mean of the predicted histogram-valued variable Yb is equal to the
mean of the observed histogram-valued variable Y .
Proof. Analogous to the proof of Proposition 4.2.
As in DSD Model I, the result of Proposition 4.9 allows deducing the coefficient of
determination, Ω, for DSD Model II.
Proposition 4.9. For the observed and predicted distributions Y (j) and Yb (j), with
j ∈ {1, . . . , m}, of the variable Y, we have
m Z
X
j=1
1
0
−1
Ψ−1
b (j) (t)
Y (j) (t) − ΨY
ΨY−1
dt = 0.
b (j) (t) − Y
−1
Proof. Let Ψ−1
b (j) (t) the quantile functions defining according to the ExpresY (j) (t) and ΨY
sions (4.5) and (4.31), respectively. Considering these definitions, in the proof of the
164
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
Proposition 4.4 (Expression (4.29)), we prove:
m R1 P
j=1 0
=
−1
Ψ−1
b (j) (t)
Y (j) (t) − ΨY
n
m P
P
pi
j=1 i=1
Applying
ΨY−1
(t)
−
Y
dt =
b (j)
cY (j)i − cYb (j)i cYb (j)i +
Expressions
(4.33)
1
3
rY (j)i − rYb (j)i rYb (j)i − cY (j)i − cYb (j)i Y
and
(4.34)
when
the
optimal
solution
b∗ = a∗1 , b∗1 , . . . , a∗p , b∗p , c∗v , rv∗1 , . . . , rv∗n is considered, the expression above may be written
as follows:
m
P
p1
j=1
cY (j)1 − cYb (j)1
1
3
p
P
∗
k Xk (j)1
a c
k=1
p
P
∗
k Xk (j)n
−b c
∗
k Xk (j)1
+c
∗
k Xk (j)n
∗
v
+
∗
v1
−b r
+r
rY (j)1 − rYb (j)1
a r
− cY (j)1 − cYb (j)1 Y +
k=1
p
m
P
P
∗
∗
∗
∗
∗
ak cXk (j)2 − bk cXk (j)n−1 + cv + rv1 + rv2 +
+ p2 cY (j)2 − cYb (j)2
j=1
k=1
p
P
1
∗
∗
∗
ak rXk (j)2 − bk rXk (j)n−1 + rv2 − cY (j)2 − cYb (j)2 Y +
+ 3 rY (j)2 − rYb (j)2
k=1
p
n
i−1
P
P
P ∗
∗
∗
∗
∗
∗
pi cY (j)i − cYb (j)i
ak cXk (j)i − bk cXk (j)n−i+1 + cv + rv1 + 2 rvt + rvi +
i=3
k=1
t=2
p
P
1
∗
∗
∗
ak rX(j)i + bk rX(j)n−i+1 + rvi − cY (j)i − cYb (j)i Y .
+ 3 rY (j)i − rYb (j)i
+
k=1
Grouping the terms corresponding to the same parameters ak , bk , cv , rvi (with
k ∈ {1, 2, . . . , p}, i ∈ {1, 2, . . . , n}) and the term Y . After some algebra the previous
expression may be also written as:
m P
n
P
j=1 i=1
a∗1 pi cY (j)i − cYb (j)i cX1 (j)i +
+a∗p pi cY (j)i − cYb (j)i cXp (j)i +
+
1
3
1
3
rY (j)i − rYb (j)i rX1 (j)i + . . . +
rY (j)i − rYb (j)i rXp (j)i +
n
m P
P
b∗1 pi cY (j)i − cYb (j)i (−cX1 (j)n−i+1 ) +
m P
n
P
m P
n
P
pi cY (j)i − cYb (j)i c∗v + 13
pi rY (j)i − rYb (j)i rv∗i +
j=1 i=1
+b∗p pi cY (j)i − cYb (j)i −cXp (j)n−i+1 +
−
j=1 i=1
PP
m
+
n
j=1 i=2
+
m P
n
P
j=1 i=3
1
3
1
3
rY (j)i − rYb (j)i rX1 (j)n−i+1 + . . . +
rY (j)i − rYb (j)i rXp (j)n−i+1 −
j=1 i=1
n
PP
pi cY (j)i − cYb (j)i rv∗1 +
pi cY (j)i − cYb (j)i rv∗i +
m
j=1 i=2
2pi cY (j)i − cYb (j)i
i−1
P
t=2
rv∗t −
m P
n
P
j=1 i=1
pi cY (j)i − cYb (j)i Y .
4.2. THE DSD REGRESSION MODEL II
165
Considering the expressions of the first order partial derivatives in order to ak and bk ,
with k ∈ {1, 2, . . . , p}, defined in Expressions (4.40) and (4.42), respectively, we have:
p
P
∗ ∂f
k ∂ak
1
2
∗
1 ∗ ∂f
2 k ∂bk
m
P
∗
1
3
n
P
(b ) − b
(b ) +
− a
p1 rY (j)1 − rYb (j)1 + pi cY (j)i − cYb (j)i rv∗1 +
k=1
j=1
i=2
m
n
∗
P
P
1
p rY (j)2 − rYb (j)2 + p2 cY (j)2 − cYb (j)2 + 2pi cY (j)i − cYb (j)i rv2 +
+
3 2
j=1
i=3
P 1
m
+... +
j=1
−
m P
n
P
j=1 i=1
3
pn rY (j)n − rYb (j)n + pn cY (j)n − cYb (j)n rv∗n −
pi cY (j)i − cYb (j)i
c∗v − Y
Considering again the expressions of the first order partial derivatives but now in order
to rvi with i ∈ {1, 2, . . . , n}, and cv , defined in Expressions (4.46), (4.48), (4.50), (4.52)
and (4.44), respectively, we obtain:
m Z
X
j=1
1
0
=
Ψ
−1
Y (j)
(t) − Ψ
p
X
−1
b (j)
Y
(t)
Ψ
−1
b (j)
Y
(t) − Y dt =
1 ∂f ∗
1 ∂f ∗
1 X ∗ ∂f ∗
1 ∂f ∗
1 ∂f ∗
− a∗k
(b ) − b∗k
(b ) −
rv i
(b ) − Y
(b ).
(b ) + c∗v
2
∂a
2
∂b
2
∂r
2
∂c
2
∂c
k
k
v
v
v
i
k=1
i=1
Considering the Kuhn Tucker conditions:
and rv∗i
n
∂f ∗
∂f ∗
∂f ∗
(b ) = 0; a∗k
(b ) = 0; b∗k
(b ) = 0;
∂cv
∂ak
∂bk
∂f ∗
(b ) = 0, we finally obtain:
∂rvi
m Z 1
X
−1
−1
−1
ΨY (j) (t) − ΨYb (j) (t) ΨYb (j) (t) − Y dt = 0.
j=1
0
As a consequence of Proposition 4.9, we may also, for this generalization of the DSD
Model I, define the goodness-of-fit measure:
Ω=
m
X
j=1
m
X
j=1
2
DM
ΨY−1
(t),
Y
b (j)
2
M
D
Ψ
−1
Y (j)
(t), Y
.
where ΨY (j) (t) and ΨY−1
b (j) (t) represent the quantile functions of the observed and predicted
distributions of the histogram-valued variable Y and Y its symbolic mean. As before,
0 ≤ Ω ≤ 1.
166
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
4.3. Conclusion
In this chapter, two versions of a new linear regression model for histogram-valued
variables were proposed. The difference between DSD Model I and DSD Model II is that
the independent parameter in the first one is a real number and in second one is a quantile
function. With this generalization we expected to obtain a more flexible model.
The main advantages of the DSD Models are the following: 1) they allow predicting
the distributions taken by one histogram-valued variable from the distributions taken by
explicative histogram-valued variables, considering that the relation may be either direct or
inverse; 2) the parameters are easily estimated solving a quadratic optimization problem or
a constrained least squares problem, subject to non-negative constraints on the unknowns;
3) from the model, the prediction of the distributions for the observations of the response
histogram-valued variable is immediate and 4) it is possible to deduce a goodness-of-fit
measure from the model. This measure is deduced similarly as in classical statistics and
appears to have a good behavior.
As limitations of the DSD Models we may enumerate the following: 1) the interpretation
of the meaning of the parameters of the models is not obvious and 2) the goodness-of-fit
measure, Ω, deduced from the models is computed with respect to the symbolic mean of
the histogram-valued response variable, that is a real value and not an average distribution,
the barycentric histogram. This option is due to the apparent impossibility of obtaining the
decomposition of the total sum of squares in the residual sum of squares plus the explained
sum of squares, when the barycentric histogram is considered.
The behavior of the models proposed, as well as the goodness-of-fit measures deduced
from them, will be tested and analyzed in Chapters 6 and 7, where a simulation study and
the application of the models to some examples are presented.
In this work we study linear regression models that provide relations between a special
type of functions, i.e. quantile functions. Therefore it is important to relate our approach
with Quantile Regression defined for classical variables. Quantile Regression uses the
Least Squares method to estimate the quantiles of the response variable, given certain
values of the explicative variables. As this method is defined for classical variables, we
may predict several quantiles for the response variable, using a different model for each
quantile. Nonetheless, all these quantiles are predicted from real values associated with
4.3. CONCLUSION
167
each explicative variable. As such, for each unit several quantiles will be obtained and with
them, if it is possible, the quantile functions associated with the response variable may be
built. However, when we use the linear regression models proposed in this work we predict
with the same model the complete quantile functions associated with the response variable,
and the elements that are the records of the explicative variables are not real values but
intervals or distributions. Although the Quantile Regression method predicts quantiles of the
response variables, the comparison between the two methods is not obvious, essentially
because they use different kinds of variables. In Section 7.2, these two approaches will be
applied to an example.
As interval-valued variables are a particular case of histogram-valued variables, in
Chapter 5, the DSD Regression Models will be particularized and studied when applied
to this kind of variables.
168
CHAPTER 4. REGRESSION MODELS FOR HISTOGRAM DATA
5. Regression Models for interval data
In Chapter 4 linear regression models for histogram-valued variables, the Distribution
and Symmetric Distribution (DSD) Regression Models, were proposed. As a histogramvalued variable may be reduced to the case of an interval-valued variable when each
unit takes values on only one interval with weight equal to one, the DSD Models may
be particularized to interval-valued variables. These models are defined for n explicative
variables and are based on the distributions considered within the intervals. In this work we
study the special case where the Uniform distribution is assumed in each observed interval.
Other distributions may be considered, but will not be studied here.
5.1. The DSD Regression Model I
5.1.1. Particularization of the Model to interval-valued variables
The DSD Linear Regression Model I for histogram-valued variables proposed in Definition 4.1 may be particularized to interval-valued variables.
Definition 5.1. Consider the interval-valued variables X1 ; X2 ; . . . ; Xp . The quantile functions that represent the range of values that these variables take for each unit j are de−1
−1
noted Ψ−1
X1 (j) (t), ΨX2 (j) (t), . . . , ΨXp (j) (t) and the quantile functions that represent the res-
pective symmetric interval associated with each unit of the referred variables are denoted
−1
−1
−Ψ−1
X1 (j) (1 − t), −ΨX2 (j) (1 − t), . . . , −ΨXp (j) (1 − t), with t ∈ [0, 1].
The quantile function of the response variable Y, Ψ−1
Y (j) , may be expressed as follows:
−1
Ψ−1
b (j) (t) + ej (t)
Y (j) (t) = ΨY
where ΨY−1
b (j) (t) is the predicted quantile function for the unit j, obtained from
Ψ
−1
b (j)
Y
(t) = v +
p
X
k=1
ak Ψ
−1
Xk (j)
(t) −
169
p
X
k=1
bk Ψ−1
Xk (j) (1 − t)
170
CHAPTER 5. REGRESSION MODELS FOR INTERVAL DATA
with t ∈ [0, 1] ; ak , bk ≥ 0, k ∈ {1, 2, . . . , p} and v ∈ R.
This linear regression model is the Distribution and Symmetric Distribution (DSD)
Regression Model I applied to interval-valued variables.
Particularizing Definition 5.1 to the situation studied in this work, where we assume
uniformity within the intervals and because
ΨXk (j) (t) = cXk (j) + (2t − 1)rXk (j)
−ΨXk (j) (1 − t) = −cXk (j) + (2t − 1)rXk (j)
the predicted quantile function ΨY−1
b (j) is defined as follows:
Ψ−1
b (j) (t) =
Y
p
X
k=1
(ak − bk ) cXk (j) + v +
p
X
k=1
(ak + bk ) rXk (j) (2t − 1)
(5.1)
with t ∈ [0, 1] ; ak , bk ≥ 0, k ∈ {1, 2, . . . , p} , and v ∈ R.
For each unit j, the predicted interval IYb (j) may be obtained from
IYb (j) =
#
p
X
k=1
ak I Xk (j) − bk I Xk (j) + v,
p
X
k=1
%
ak I Xk (j) − bk I Xk (j) + v .
(5.2)
The error, for each unit j , is a function, but not necessarily a quantile function, given by
−1
ej (t) = Ψ−1
b (j) (t),
Y (j) (t) − ΨY
t ∈ [0, 1].
By including in the model both, the distributions that the explicative interval-valued variables take for each unit j and the quantile functions that represent the respective symmetric
intervals, the linear relation between the intervals is not necessarily direct, even though
positivity constraints are imposed on the parameters.
For the DSD Model I, the center cYb (j) and the half range rYb (j) (or the bounds) pre-
dicted to the interval-valued variable Y may be described, respectively, by a classical linear
relation with the centers cXk (j) and by a classical linear relation with the half ranges rXk (j)
(or the bounds), of the explicative interval-valued variables. These linear relations are the
following:
cYb (j) =
p
X
k=1
rYb (j) =
(ak − bk ) cXk (j) + v
p
X
(ak + bk ) rXk (j)
k=1
with ak , bk ≥ 0, k ∈ {1, 2, . . . , p} and v ∈ R.
(5.3)
(5.4)
5.1. THE DSD REGRESSION MODEL I
171
From Expressions (5.3) and (5.4) we may observe that the parameters that define the
linear regressions between the centers and between the half ranges of the intervals are not
the same but are related. In spite of the fact that this model is defined between intervals and
the relation between the intervals may be direct or inverse, it always induces a direct linear
relation between the half ranges of the intervals. The direct or inverse relation between
the interval-valued variables is always in accordance with the linear relation between the
centers. The interval-valued variables Xk are in direct linear relation with Y when ak > bk
and the linear relation is inverse if ak < bk . Example 5.1 illustrates these cases.
Example 5.1. In the first situation we consider the symbolic data in Table 2.4, where 10
units are described by two symbolic variables: Y the response interval-valued variable and
X the explicative interval-valued variable.
In this case, the predicted intervals are obtained from the model as follows:
−1
−1
DSD Model I: Ψ−1
b (j) (t) = −2.5911 + 3.5298ΨX(j) (t) − 0.3152ΨX(j) (1 − t),
Y
t ∈ [0, 1].
As the value of parameter a is greater than the value of parameter b, we consider that
the relation between the interval-valued variables Y and X is direct. This behavior may
be observed in the scatter plot in Figure 5.1(a). In Figure 5.1(b) we represent the classical
linear relation between the centers of the intervals that are in accordance with the relation
between the intervals.
50
45
40
35
12
(a) Intervals X(j) versus Y (j) with j ∈ {1, . . . , m}.
13
14
15
16
(b) cX(j) versus cY (j) with j ∈ {1, . . . , m}.
Figure 5.1: Scatter plots associated with interval-valued variables X and Y in Table 2.4.
In the second situation we will study the symbolic data in Table 5.1, where the intervals
of the explicative interval-valued variable are the symmetric of the intervals of the intervalvalued variable X. We will denote this variable by −X.
172
CHAPTER 5. REGRESSION MODELS FOR INTERVAL DATA
Table 5.1: Symbolic data table for interval-valued variables −X and Y .
Units
Y
-X
1
[33.29; 39.61]
[−12.8; −11.54]
2
[36.69; 45.12]
[−14.17; −12.07; ]
3
[36.69; 48.68]
[−16.16; −12.38]
4
[36.38; 47.41]
[−15.29; −12.38]
5
[39.19; 50.86]
[−16.24; −13.58]
6
[39.7; 47.24]
[−15.2; −13.81]
7
[41.56; 48.81]
[−15.55; −14.34]
8
[38.4; 45.22]
[−14.6; −13.27]
9
[28.83; 41.98]
[−13.8; −9.92]
10
[44.48; 52.53]
[−16.75; −15.37]
−1
−1
DSD Model I: ΨY−1
b (j) (t) = −2.5911 + 0.3152ΨX(j) (t) − 3.5298ΨX(j) (1 − t),
t ∈ [0, 1].
Analyzing the expressions of the DSD Model I for both situations, we may observe that
in second situation, as expected, the values of the parameters a and b change relatively
to the first situation. In this case, we have a < b and consequently the relation between
the intervals is inverse, as illustrates the scatter plot in Figure 5.2(a). In Figure 5.2(b) we
may observe that the classical linear relation between the centers of the intervals is, in this
situation, inverse.
50
45
40
35
−16
(a) Intervals −X(j) versus Y (j) with j ∈ {1, . . . , m}.
−15
−14
−13
−12
(b) −cX(j) versus cY (j) with j ∈ {1, . . . , m}.
Figure 5.2: Scatter plots associated with interval-valued variables −X and Y in Table 5.1.
The non-negative parameters of the DSD Model I, in Definition 5.1, are determined by solving a quadratic optimization problem, subject to non-negativity constraints on the unknowns.
The distance used to quantify the dissimilarity between the predicted and the observed quantile functions is the Mallows distance (Mallows (1972)), defined in Proposition 2.1. Consider the centers $c_{Y(j)}$ and half ranges $r_{Y(j)}$ of the observed intervals $Y(j)$, and the centers $c_{\hat{Y}(j)}$ and half ranges $r_{\hat{Y}(j)}$ of the predicted intervals $\hat{Y}(j)$, defined in Expressions (5.3) and (5.4), respectively. The quadratic optimization problem that has to be solved to obtain the parameters of the model is then:
\[
\min\; \sum_{j=1}^{m} \left[ \left( c_{Y(j)} - \sum_{k=1}^{p} (a_k - b_k)\, c_{X_k(j)} - v \right)^2 + \frac{1}{3} \left( r_{Y(j)} - \sum_{k=1}^{p} (a_k + b_k)\, r_{X_k(j)} \right)^2 \right] \tag{5.5}
\]
subject to $-a_k, -b_k \leq 0$, $k \in \{1, 2, \ldots, p\}$; $v \in \mathbb{R}$.
The optimization problem (5.5) may also be rewritten in matricial form as a classical constrained quadratic optimization problem, particularizing the constrained quadratic optimization problem (4.14) (see Section 4.1.2.1) to interval-valued variables. Alternatively, the problem may also be defined as a constrained least squares problem. In this case, problem (4.26) (see Section 4.1.2.1) may be particularized to interval-valued variables as follows.
Consider the $m$-dimensional vectors of the observed centers and half ranges of the response variable $Y$:
\[
\mathbf{y}^c = \left( c_{Y(1)}, \ldots, c_{Y(m)} \right)^T \quad \text{and} \quad \mathbf{y}^r = \left( r_{Y(1)}, \ldots, r_{Y(m)} \right)^T;
\]
and the vector of the parameters of the model, with length $2p+1$,
\[
\mathbf{b} = \left( a_1, b_1, \ldots, a_p, b_p, v \right)^T.
\]
From the vectors $\mathbf{x}^c(j)$ and $\mathbf{x}^r(j)$ defined by
\[
\mathbf{x}^c(j) = \left( c_{X_1(j)}, -c_{X_1(j)}, \ldots, c_{X_p(j)}, -c_{X_p(j)}, 1 \right)
\]
and
\[
\mathbf{x}^r(j) = \left( r_{X_1(j)}, r_{X_1(j)}, \ldots, r_{X_p(j)}, r_{X_p(j)}, 0 \right);
\]
we can build the following $m \times (2p+1)$ matrices:
\[
\mathbf{X}^c = \left[ \mathbf{x}^c(1)\;\; \mathbf{x}^c(2)\;\; \ldots\;\; \mathbf{x}^c(m) \right]^T \quad \text{and} \quad \mathbf{X}^r = \left[ \mathbf{x}^r(1)\;\; \mathbf{x}^r(2)\;\; \ldots\;\; \mathbf{x}^r(m) \right]^T.
\]
The matrix $\mathbf{X}_1$ is built from the two previous matrices, $\mathbf{X}_1 = \left[ \mathbf{X}^c \;\; \mathbf{X}^r \right]^T$.
With the matrices defined above, the minimization problem (5.5) may be rewritten in matricial form as follows:
\[
\min\; \left\| \mathbf{y}^c - \mathbf{X}^c \mathbf{b} \right\|^2 + \frac{1}{3} \left\| \mathbf{y}^r - \mathbf{X}^r \mathbf{b} \right\|^2 \tag{5.6}
\]
subject to $-a_k, -b_k \leq 0$, $k \in \{1, 2, \ldots, p\}$; $v \in \mathbb{R}$.
As the parameters for the centers and half ranges may not be obtained independently, we may rewrite the optimization problem (5.5) as the following least squares problem:
\[
\min\; \left\| \begin{bmatrix} \mathbf{y}^c \\[1mm] \frac{1}{\sqrt{3}}\,\mathbf{y}^r \end{bmatrix} - \begin{bmatrix} \mathbf{X}^c \\[1mm] \frac{1}{\sqrt{3}}\,\mathbf{X}^r \end{bmatrix} \mathbf{b} \right\|^2 = \left\| \mathbf{Y} - \mathbf{X}_1 \mathbf{b} \right\|^2 \tag{5.7}
\]
subject to $-a_k, -b_k \leq 0$, $k \in \{1, 2, \ldots, p\}$; $v \in \mathbb{R}$.
Several methods may be found in the literature to solve the constrained least squares problem (5.7) and, therefore, the constrained quadratic optimization problem (5.5). As both the quadratic function to optimize and the feasible region are convex, it may be ensured that the vectors that verify the Kuhn Tucker conditions (see Section 4.1.2.2) are the vectors where the function reaches its smallest value, i.e., the optimal solutions. In cases where the objective function is strictly convex, we may ensure that the optimal solution is unique.
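As one possible route (a sketch under the assumptions above, not the thesis' own implementation), the bound-constrained least squares formulation (5.7) can be handed to a generic solver. The code below uses scipy.optimize.lsq_linear, which accepts per-coordinate bounds: non-negativity for the $a_k, b_k$ and an unbounded independent term $v$. The function name fit_dsd1 is our own.

```python
import numpy as np
from scipy.optimize import lsq_linear

def fit_dsd1(c_x, r_x, c_y, r_y):
    """Fit DSD Model I by solving the constrained least squares problem (5.7).

    c_x, r_x: (m, p) arrays with centers and half ranges of X_1, ..., X_p
    c_y, r_y: (m,) arrays with centers and half ranges of Y
    Returns the parameter vector b = (a_1, b_1, ..., a_p, b_p, v).
    """
    m, p = c_x.shape
    # Rows of X^c: (c_{X_1}, -c_{X_1}, ..., c_{X_p}, -c_{X_p}, 1)
    Xc = np.hstack([np.column_stack([c_x[:, k], -c_x[:, k]]) for k in range(p)]
                   + [np.ones((m, 1))])
    # Rows of X^r: (r_{X_1}, r_{X_1}, ..., r_{X_p}, r_{X_p}, 0)
    Xr = np.hstack([np.column_stack([r_x[:, k], r_x[:, k]]) for k in range(p)]
                   + [np.zeros((m, 1))])
    # Stack the center block and the (1/sqrt(3))-scaled half-range block, as in (5.7).
    A = np.vstack([Xc, Xr / np.sqrt(3)])
    y = np.concatenate([c_y, r_y / np.sqrt(3)])
    # a_k, b_k >= 0; the independent term v is free.
    lb = np.r_[np.zeros(2 * p), -np.inf]
    ub = np.full(2 * p + 1, np.inf)
    return lsq_linear(A, y, bounds=(lb, ub)).x
```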
As the DSD Model I uses both the quantile function $\Psi^{-1}_{X_k(j)}(t)$ and the quantile function $-\Psi^{-1}_{X_k(j)}(1-t)$, that represents the respective symmetric interval, it is important to analyze the behavior of the model in the situations where these functions are collinear. Proposition 5.1 below allows deducing the collinearity conditions.

Proposition 5.1. The quantile functions $\Psi^{-1}_{X_k(j)}(t) = c_{X_k(j)} + r_{X_k(j)}(2t-1)$ and $-\Psi^{-1}_{X_k(j)}(1-t) = -c_{X_k(j)} + r_{X_k(j)}(2t-1)$, with $0 \leq t \leq 1$, that represent the intervals $I_{X_k(j)}$ and $-I_{X_k(j)}$, for all $j \in \{1, \ldots, m\}$, respectively, are collinear if:
1. the interval $I_{X_k(j)}$ has $c_{X_k(j)} = 0$, which means that the interval is symmetric;
2. $r_{X_k(j)} = 0$, which means that the interval is reduced to a real number (degenerate interval).
Proof. The quantile functions $\Psi^{-1}_{X_k(j)}(t)$ and $-\Psi^{-1}_{X_k(j)}(1-t)$, with $0 \leq t \leq 1$, are collinear if there is a real number $\lambda \neq 0$ such that
\[
-\Psi^{-1}_{X_k(j)}(1-t) = \lambda\, \Psi^{-1}_{X_k(j)}(t), \quad \text{for } t \in [0,1].
\]
\[
-\Psi^{-1}_{X_k(j)}(1-t) = \lambda\, \Psi^{-1}_{X_k(j)}(t) \;\Leftrightarrow\; -c_{X_k(j)} + r_{X_k(j)}(2t-1) = \lambda \left( c_{X_k(j)} + r_{X_k(j)}(2t-1) \right)
\]
\[
\Rightarrow \left( c_{X_k(j)} = 0 \wedge \lambda = 1 \wedge r_{X_k(j)} \in \mathbb{R} \right) \vee \left( r_{X_k(j)} = 0 \wedge \left( \left( \lambda = -1 \wedge c_{X_k(j)} \in \mathbb{R} \right) \vee \left( c_{X_k(j)} = 0 \wedge \lambda \in \mathbb{R} \right) \right) \right).
\]
Therefore the two quantile functions are collinear when the interval $I_{X_k(j)}$ is symmetric, i.e., $I_{X_k(j)} = [-r_{X_k(j)}; r_{X_k(j)}]$, or degenerate, i.e., $I_{X_k(j)} = \left\{ c_{X_k(j)} \right\}$.
When all quantile functions $\Psi^{-1}_{X_k(j)}(t)$ and $-\Psi^{-1}_{X_k(j)}(1-t)$ are collinear, Expression (5.1) is reduced to the classical linear regression model: between the centers (in Expression (5.3)) when all intervals of the explicative interval-valued variables are degenerate, or between the half ranges (in Expression (5.4)) when all intervals of the explicative interval-valued variables are symmetric.

When, for all observations of the explicative variables, the collinearity between $\Psi^{-1}_{X_k(j)}(t)$ and $-\Psi^{-1}_{X_k(j)}(1-t)$ is verified, the optimization problem has an optimal solution which is not unique because, in this situation, the quadratic function to optimize is not strictly convex (the columns of $\mathbf{X}_1$ in the optimization problem (5.7) are linearly dependent). However, all values of the parameters where the smallest value is attained allow obtaining the same model, that in these cases is a classical model between the centers or the half ranges.
As the DSD Model I for interval-valued variables is a particular case of the model defined in Section 4.1 for histogram-valued variables, the optimal solution of the quadratic optimization problem for interval-valued variables with non-negativity constraints verifies the Kuhn Tucker conditions. It is therefore possible to prove the following decomposition (see Proposition 4.5 in Section 4.1.3):
\[
\sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{Y(j)}(t), \overline{Y} \right) = \sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{\hat{Y}(j)}(t) \right) + \sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{\hat{Y}(j)}(t), \overline{Y} \right).
\]
This decomposition allows defining the goodness-of-fit measure for the proposed model for
interval-valued variables.
Definition 5.2. Consider the observed and predicted ranges of values of the interval-valued variables $Y$ and $\hat{Y}$ represented, respectively, by their quantile functions $\Psi^{-1}_{Y(j)}(t)$ and $\Psi^{-1}_{\hat{Y}(j)}(t)$, with $t \in [0,1]$. Consider also the symbolic mean of the interval-valued variable $Y$, $\overline{Y}$. The goodness-of-fit measure is given by
\[
\Omega = \frac{\displaystyle\sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{\hat{Y}(j)}(t), \overline{Y} \right)}{\displaystyle\sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{Y(j)}(t), \overline{Y} \right)} = \frac{\displaystyle\sum_{j=1}^{m} \left( c_{\hat{Y}(j)} - \overline{Y} \right)^2 + \frac{1}{3}\, r_{\hat{Y}(j)}^2}{\displaystyle\sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)^2 + \frac{1}{3}\, r_{Y(j)}^2}.
\]
As in classical linear regression, where the coefficient of determination R2 ranges from
0 to 1, the goodness-of-fit measure, Ω, also ranges between 0 and 1.
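A minimal computational sketch of this measure follows (our own code, not from the thesis), taking the symbolic mean $\overline{Y}$ as the mean of the interval centers, consistent with the formula in Definition 5.2; the function name omega is an assumption of this sketch.

```python
import numpy as np

def omega(c_y, r_y, c_yhat, r_yhat):
    """Goodness-of-fit measure Omega of Definition 5.2.

    c_y, r_y: centers and half ranges of the observed intervals Y(j)
    c_yhat, r_yhat: centers and half ranges of the predicted intervals
    The squared Mallows distance between an interval (c, r) and the real
    number Ybar is (c - Ybar)^2 + (1/3) r^2.
    """
    y_bar = np.mean(c_y)  # symbolic mean of Y (mean of the centers)
    explained = np.sum((c_yhat - y_bar) ** 2 + r_yhat ** 2 / 3)
    total = np.sum((c_y - y_bar) ** 2 + r_y ** 2 / 3)
    return explained / total
```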
5.1.2. The DSD Model I is a generalization of the classical linear regression model

Symbolic variables, introduced in Symbolic Data Analysis, are a generalization of classical variables. Hence, the statistical concepts and methods defined for these variables should also generalize the classical ones. As we will see below, the DSD linear regression Model I defined for histogram-valued variables, and its present particularization for interval-valued variables, may be written for classical variables when their values are degenerate intervals (the upper and lower bounds are equal).
Proposition 5.2. The expression that allows predicting the values that the response variable takes in a classical linear regression model is a particular case of the one obtained by the DSD Model I for interval-valued variables, given in Expression (5.1), if we consider intervals where the upper and lower bounds are the same.
Proof. Consider the observations of the explicative classical variables $X_k$, with $k \in \{1, 2, \ldots, p\}$, and the observations of the response classical variable $Y$. For each unit $j$, the observed values of the variables $X_k$ are real numbers $u_{X_k}(j)$ that may be represented by the interval $[u_{X_k}(j), u_{X_k}(j)]$ or by the quantile function $\Psi^{-1}_{X_k(j)}(t) = u_{X_k}(j)$ (that in this case is a constant function). For each unit $j$, the predicted value of the classical variable $Y$ is the real number $\hat{y}(j)$, that may similarly be represented by an interval or a quantile function.
Expression (5.1) allows predicting the values of variable $Y$, as follows:
\[
\hat{y}(j) = v + \sum_{k=1}^{p} (a_k - b_k)\, u_{X_k}(j)
\]
with $a_k, b_k \geq 0$, $k \in \{1, 2, \ldots, p\}$ and $v \in \mathbb{R}$.

As $a_k, b_k \geq 0$, $a_k - b_k$ is a real number. If we consider $s_k = a_k - b_k$, we have the classical linear regression model
\[
\hat{y}(j) = v + \sum_{k=1}^{p} s_k\, u_{X_k}(j)
\]
with $s_k \in \mathbb{R}$ and $k \in \{1, 2, \ldots, p\}$.
As we have referred before, in a situation of degenerate intervals, the function to optimize is not strictly convex, and therefore more than one optimal solution exists. However, for all optimal parameters $a_k$ and $b_k$ we obtain the same parameter $s_k$. Since no constraint is imposed on $a_k - b_k$, we have in this case a classical linear regression model.
Also, the goodness-of-fit measure for interval-valued variables is a generalization of
the coefficient of determination R2 of classical variables. To obtain this result, it is first
necessary to prove the following proposition.
Proposition 5.3. The Mallows distance between degenerate intervals is the Euclidean
distance between two real numbers.
Proof. Consider two intervals $I_X$ and $I_Y$ with equal bounds, $I_X = [u_1, u_1]$ and $I_Y = [u_2, u_2]$, with $u_1, u_2 \in \mathbb{R}$; those intervals may be represented by the quantile functions $\Psi^{-1}_X(t) = u_1$ and $\Psi^{-1}_Y(t) = u_2$, for $0 \leq t \leq 1$.

The Mallows distance in Definition 2.8, applied to these particular intervals $I_X$ and $I_Y$, whose centers are $u_1$ and $u_2$, respectively, and which both have range 0, is:
\[
D_M^2(\Psi^{-1}_X, \Psi^{-1}_Y) = (u_1 - u_2)^2.
\]
We obtain the squared Euclidean distance between two unidimensional points.
To conclude the previous result, we then need to state the following straightforward proposition:

Proposition 5.4. The goodness-of-fit measure in Definition 5.2, particularized to degenerate intervals, is the coefficient of determination $R^2$ of the classical linear regression model.
Therefore, it may be said that the DSD Model I under uniformity is a theoretical generalization of the classical linear regression model.
5.1.3. The simple DSD Model I for interval-valued variables
Using Definition 5.1 for the special case of only one explicative variable, the quantile function $\Psi^{-1}_{\hat{Y}(j)}$ for each unit of the predicted interval-valued variable is given by
\[
\Psi^{-1}_{\hat{Y}(j)}(t) = v + (a - b)\, c_{X(j)} + (a + b)\, r_{X(j)}(2t - 1), \quad 0 \leq t \leq 1. \tag{5.8}
\]
The corresponding predicted interval $I_{\hat{Y}(j)}$ is the following:
\[
I_{\hat{Y}(j)} = \left[ a\,\underline{I}_{X(j)} - b\,\overline{I}_{X(j)} + v,\;\; a\,\overline{I}_{X(j)} - b\,\underline{I}_{X(j)} + v \right]. \tag{5.9}
\]
Proposition 5.5. Consider the intervals predicted by the DSD Model I from the interval-valued variable $X$. From this relation we may conclude:
1. The centers of the predicted intervals are in a classical linear relation with the centers of the observed intervals of the variable $X$.
2. The ratio of the half ranges of the predicted intervals $\hat{Y}(j)$ to the half ranges of $X(j)$ is constant.

Proof. From the DSD Model I we obtain the relation between the centers and between the half ranges of the intervals that are the observations of the interval-valued variables, in Expressions (5.3) and (5.4). Particularizing these expressions to one explicative variable, we obtain for each unit $j \in \{1, 2, \ldots, m\}$ the following:
\[
c_{\hat{Y}(j)} = (a - b)\, c_{X(j)} + v \quad \text{and} \quad r_{\hat{Y}(j)} = (a + b)\, r_{X(j)} \;\Leftrightarrow\; \frac{r_{\hat{Y}(j)}}{r_{X(j)}} = a + b.
\]
So, when two interval-valued variables are in perfect linear relation, the centers of the intervals are in a perfect classical linear relation, and the ratio of the half ranges of the intervals $\hat{Y}(j)$ to the half ranges of the intervals $X(j)$ is constant and equal for all units.
In this situation, when we predict one interval-valued variable from only one interval-valued variable, it is straightforward to obtain the expressions of the parameters $a$, $b$ and $v$ of the DSD Model I. To find these expressions it is necessary to solve the quadratic optimization problem with non-negativity constraints for the parameters $a$ and $b$, as described in Expression (5.5), but now considering only one explicative variable. The minimization problem is, in this case:
\[
\min\; f(a, b, v) = \sum_{j=1}^{m} \left[ \left( c_{Y(j)} - (a - b)\, c_{X(j)} - v \right)^2 + \frac{1}{3} \left( r_{Y(j)} - (a + b)\, r_{X(j)} \right)^2 \right] \tag{5.10}
\]
subject to $g_1(a, b, v) = -a \leq 0$; $g_2(a, b, v) = -b \leq 0$; $v \in \mathbb{R}$.
In this particular case, the function $f(a, b, v)$ to be optimized may be rewritten in matricial form:
\[
f(a, b, v) = \frac{1}{2}\, \mathbf{b}^T \mathbf{H}_1 \mathbf{b} + \mathbf{w}_1^T \mathbf{b} + K \tag{5.11}
\]
where the matrices and vectors involved are the following:

• $\mathbf{H}_1$ is the Hessian matrix, a symmetric matrix of order 3,
\[
\mathbf{H}_1 = \begin{bmatrix}
\sum\limits_{j=1}^{m} \left( 2c_{X(j)}^2 + \frac{2}{3} r_{X(j)}^2 \right) & \sum\limits_{j=1}^{m} \left( -2c_{X(j)}^2 + \frac{2}{3} r_{X(j)}^2 \right) & \sum\limits_{j=1}^{m} 2c_{X(j)} \\[2mm]
\sum\limits_{j=1}^{m} \left( -2c_{X(j)}^2 + \frac{2}{3} r_{X(j)}^2 \right) & \sum\limits_{j=1}^{m} \left( 2c_{X(j)}^2 + \frac{2}{3} r_{X(j)}^2 \right) & \sum\limits_{j=1}^{m} \left( -2c_{X(j)} \right) \\[2mm]
\sum\limits_{j=1}^{m} 2c_{X(j)} & \sum\limits_{j=1}^{m} \left( -2c_{X(j)} \right) & 2m
\end{bmatrix};
\]

• $\mathbf{w}_1$ is the column vector of independent terms,
\[
\mathbf{w}_1 = \begin{bmatrix}
\sum\limits_{j=1}^{m} \left( -2c_{Y(j)} c_{X(j)} - \frac{2}{3} r_{Y(j)} r_{X(j)} \right) \\[2mm]
\sum\limits_{j=1}^{m} \left( 2c_{Y(j)} c_{X(j)} - \frac{2}{3} r_{Y(j)} r_{X(j)} \right) \\[2mm]
\sum\limits_{j=1}^{m} \left( -2c_{Y(j)} \right)
\end{bmatrix};
\]

• $\mathbf{b}$ is the column vector of parameters, $\mathbf{b} = (a, b, v)^T$;

• $K$ is a real value, $K = \sum\limits_{j=1}^{m} \left( c_{Y(j)}^2 + \frac{1}{3} r_{Y(j)}^2 \right)$.
Proposition 5.6. Consider the minimization problem (5.10). When the function to minimize is strictly convex and the centers of all intervals of the explicative variable are not all the same, the optimal solution $\mathbf{b}^* = (a^*, b^*, v^*)$, i.e., the values for the parameters of the DSD Model I where the objective function reaches the minimum value, is as follows:

I. $a^* = 0$; $b^* = 0$; $v^* = \overline{Y}$ if
\[
-\sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right) \geq \sum_{j=1}^{m} \frac{1}{3} r_{X(j)} r_{Y(j)}
\quad\text{and}\quad
\sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right) \geq \sum_{j=1}^{m} \frac{1}{3} r_{X(j)} r_{Y(j)};
\]

II. $a^* = 0$;
\[
b^* = \frac{\displaystyle\sum_{j=1}^{m} \frac{1}{3} r_{X(j)} r_{Y(j)} - \sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right)}{\displaystyle\sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2 + \sum_{j=1}^{m} \frac{1}{3} r_{X(j)}^2}; \qquad v^* = \overline{Y} + b^*\, \overline{X}
\]
if
\[
\sum_{j=1}^{m} \frac{1}{3} r_{X(j)} r_{Y(j)} \geq \sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right)
\]
and
\[
\sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right) \sum_{j=1}^{m} \frac{1}{3} r_{X(j)}^2 \leq - \sum_{j=1}^{m} \frac{1}{3} r_{X(j)} r_{Y(j)} \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2;
\]

III.
\[
a^* = \frac{\displaystyle\sum_{j=1}^{m} \frac{1}{3} r_{X(j)} r_{Y(j)} + \sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right)}{\displaystyle\sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2 + \sum_{j=1}^{m} \frac{1}{3} r_{X(j)}^2}; \qquad b^* = 0; \qquad v^* = \overline{Y} - a^*\, \overline{X}
\]
if
\[
\sum_{j=1}^{m} \frac{1}{3} r_{X(j)} r_{Y(j)} \geq - \sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right)
\]
and
\[
\sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right) \sum_{j=1}^{m} \frac{1}{3} r_{X(j)}^2 \geq \sum_{j=1}^{m} \frac{1}{3} r_{X(j)} r_{Y(j)} \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2;
\]

IV.
\[
a^* = \frac{\displaystyle\sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right) \sum_{j=1}^{m} \frac{1}{3} r_{X(j)}^2 + \sum_{j=1}^{m} \frac{1}{3} r_{X(j)} r_{Y(j)} \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2}{2\, \displaystyle\sum_{j=1}^{m} \frac{1}{3} r_{X(j)}^2 \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2};
\]
\[
b^* = \frac{-\displaystyle\sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right) \sum_{j=1}^{m} \frac{1}{3} r_{X(j)}^2 + \sum_{j=1}^{m} \frac{1}{3} r_{X(j)} r_{Y(j)} \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2}{2\, \displaystyle\sum_{j=1}^{m} \frac{1}{3} r_{X(j)}^2 \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2};
\]
\[
v^* = \overline{Y} - (a^* - b^*)\, \overline{X}
\]
if
\[
- \sum_{j=1}^{m} \frac{1}{3} r_{X(j)} r_{Y(j)} \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2 \leq \sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right) \sum_{j=1}^{m} \frac{1}{3} r_{X(j)}^2
\]
and
\[
\sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right) \sum_{j=1}^{m} \frac{1}{3} r_{X(j)}^2 \leq \sum_{j=1}^{m} \frac{1}{3} r_{X(j)} r_{Y(j)} \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2.
\]
Proof. Consider the optimization problem (5.10) where:

- the functions $g_1(a, b, v)$ and $g_2(a, b, v)$ that define the non-negativity constraints are convex, so the feasible region of the optimization problem is a convex set;

- $f(a, b, v)$ is a convex function. Consider the matrix $\mathbf{X}_1$ defined in Expression (5.7), but now only for one explicative variable. In this particular case, we have:
\[
\mathbf{X}_1^T = \begin{bmatrix}
c_{X(1)} & \cdots & c_{X(m)} & \frac{1}{\sqrt{3}} r_{X(1)} & \cdots & \frac{1}{\sqrt{3}} r_{X(m)} \\[1mm]
-c_{X(1)} & \cdots & -c_{X(m)} & \frac{1}{\sqrt{3}} r_{X(1)} & \cdots & \frac{1}{\sqrt{3}} r_{X(m)} \\[1mm]
1 & \cdots & 1 & 0 & \cdots & 0
\end{bmatrix}
\]
As $\mathbf{H}_1 = 2\, \mathbf{X}_1^T \mathbf{X}_1$, the matrix $\mathbf{H}_1$ is positive semi-definite, therefore $f(a, b, v)$ is a convex function;

- the intervals of the explicative variable $X$ are not all degenerate ($\exists j \in \{1, \ldots, m\} : r_{X(j)} \neq 0$) or symmetric ($\exists j \in \{1, \ldots, m\} : c_{X(j)} \neq 0$). In this situation, the columns of $\mathbf{X}_1$ are linearly independent, so $\mathbf{H}_1$ is positive definite and consequently the function $f(a, b, v)$ is strictly convex, and therefore the optimal solution is unique.

Note that in this proof, the following simplifications were considered:
\[
\sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right) c_{X(j)} = \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2; \qquad
\sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right) c_{Y(j)} = \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)\left( c_{Y(j)} - \overline{Y} \right).
\]
As the optimization problem (5.10) verifies the conditions of Theorem 4.1 (see Section 4.1.2.2), it is possible to find the expressions of the parameters for the simple linear regression model. Consider also the expressions of the first order partial derivatives, in Section 4.1.2.1, particularized to the case of intervals and for $k = 1$:
\[
\frac{\partial f}{\partial a} = a \sum_{j=1}^{m} \left( c_{X(j)}^2 + \tfrac{1}{3} r_{X(j)}^2 \right) + b \sum_{j=1}^{m} \left( -c_{X(j)}^2 + \tfrac{1}{3} r_{X(j)}^2 \right) + v \sum_{j=1}^{m} c_{X(j)} - \sum_{j=1}^{m} \left( c_{Y(j)} c_{X(j)} + \tfrac{1}{3} r_{Y(j)} r_{X(j)} \right);
\]
\[
\frac{\partial f}{\partial b} = a \sum_{j=1}^{m} \left( -c_{X(j)}^2 + \tfrac{1}{3} r_{X(j)}^2 \right) + b \sum_{j=1}^{m} \left( c_{X(j)}^2 + \tfrac{1}{3} r_{X(j)}^2 \right) - v \sum_{j=1}^{m} c_{X(j)} + \sum_{j=1}^{m} \left( c_{Y(j)} c_{X(j)} - \tfrac{1}{3} r_{Y(j)} r_{X(j)} \right);
\]
\[
\frac{\partial f}{\partial v} = (a - b) \sum_{j=1}^{m} c_{X(j)} + m v - \sum_{j=1}^{m} c_{Y(j)}.
\]
Consider the Kuhn Tucker conditions (1) to (3) in Theorem 4.1. From condition (3) we have
\[
\frac{\partial f}{\partial v}(\mathbf{b}^*) = 0 \;\Leftrightarrow\; v^* = \overline{Y} - (a^* - b^*)\, \overline{X}. \tag{5.12}
\]
Substituting Expression (5.12) in Kuhn Tucker conditions (1) and (2),
\[
\frac{\partial f}{\partial a}(\mathbf{b}^*) = \lambda \;\Leftrightarrow\; a^* \sum_{j=1}^{m} \left[ \left( c_{X(j)} - \overline{X} \right)^2 + \tfrac{1}{3} r_{X(j)}^2 \right] + b^* \sum_{j=1}^{m} \left[ -\left( c_{X(j)} - \overline{X} \right)^2 + \tfrac{1}{3} r_{X(j)}^2 \right] - \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)\left( c_{Y(j)} - \overline{Y} \right) - \sum_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} = \lambda; \tag{5.13}
\]
\[
\frac{\partial f}{\partial b}(\mathbf{b}^*) = \delta \;\Leftrightarrow\; a^* \sum_{j=1}^{m} \left[ -\left( c_{X(j)} - \overline{X} \right)^2 + \tfrac{1}{3} r_{X(j)}^2 \right] + b^* \sum_{j=1}^{m} \left[ \left( c_{X(j)} - \overline{X} \right)^2 + \tfrac{1}{3} r_{X(j)}^2 \right] + \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)\left( c_{Y(j)} - \overline{Y} \right) - \sum_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} = \delta. \tag{5.14}
\]
From the Kuhn Tucker conditions (4): $\lambda a^* = 0$ and (5): $\delta b^* = 0$, substituting $\lambda$ and $\delta$ by Expressions (5.13) and (5.14), respectively, it is possible to build the system:
\[
\begin{cases}
(a^*)^2 \sum\limits_{j=1}^{m} \left[ \left( c_{X(j)} - \overline{X} \right)^2 + \tfrac{1}{3} r_{X(j)}^2 \right] + a^* b^* \sum\limits_{j=1}^{m} \left[ -\left( c_{X(j)} - \overline{X} \right)^2 + \tfrac{1}{3} r_{X(j)}^2 \right] \\
\qquad - a^* \sum\limits_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)\left( c_{Y(j)} - \overline{Y} \right) - a^* \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} = 0 \\[2mm]
a^* b^* \sum\limits_{j=1}^{m} \left[ -\left( c_{X(j)} - \overline{X} \right)^2 + \tfrac{1}{3} r_{X(j)}^2 \right] + (b^*)^2 \sum\limits_{j=1}^{m} \left[ \left( c_{X(j)} - \overline{X} \right)^2 + \tfrac{1}{3} r_{X(j)}^2 \right] \\
\qquad + b^* \sum\limits_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)\left( c_{Y(j)} - \overline{Y} \right) - b^* \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} = 0
\end{cases}
\]
Solving this system, four possible solutions may occur:
Case I: $a^* = 0$ and $b^* = 0$. In this case the parameters are non-negative, as required. However, the Kuhn Tucker condition (6) has to be verified, i.e., $\lambda, \delta \geq 0$. Substituting $a^* = 0$ and $b^* = 0$ in Expressions (5.13) and (5.14) we have
\[
\begin{cases}
-\sum\limits_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)\left( c_{Y(j)} - \overline{Y} \right) \geq \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} \\[2mm]
\sum\limits_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)\left( c_{Y(j)} - \overline{Y} \right) \geq \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)}
\end{cases}
\]
So, $a^* = 0$; $b^* = 0$; $v^* = \overline{Y}$ if
\[
\sum_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} = \sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right) = 0.
\]

Case II: $a^* = 0$ and
\[
b^* = \frac{\sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} - \sum\limits_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right)}{\sum\limits_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2 + \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)}^2}.
\]
By the conditions of Theorem 4.1, we have that $b^* \geq 0$ and $\lambda, \delta \geq 0$. As it is assumed that the intervals of the explicative variable $X$ are not all degenerate ($\exists j \in \{1, \ldots, m\} : r_{X(j)} \neq 0$) and the centers of all these intervals are not all the same, we may ensure that $\sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2 + \sum_{j=1}^{m} \tfrac{1}{3} r_{X(j)}^2 > 0$. For the expression that defines $b^*$ to be non-negative it is necessary that
\[
\sum_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} \geq \sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right).
\]
As $\lambda, \delta \geq 0$, substituting $a^* = 0$ and $b^*$ by the expression above in Expressions (5.13) and (5.14), we conclude that
\[
\begin{cases}
\dfrac{-\sum\limits_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right) \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)}^2 - \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} \sum\limits_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2}{\sum\limits_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2 + \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)}^2} \geq 0 \\[5mm]
\delta = 0
\end{cases}
\]
So, $a^* = 0$;
\[
b^* = \frac{\sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} - \sum\limits_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right)}{\sum\limits_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2 + \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)}^2}; \qquad v^* = \overline{Y} + b^*\, \overline{X}
\]
if
\[
\sum_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} \geq \sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right)
\quad\text{and}\quad
\sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right) \sum_{j=1}^{m} \tfrac{1}{3} r_{X(j)}^2 \leq - \sum_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2.
\]

Case III:
\[
a^* = \frac{\sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} + \sum\limits_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right)}{\sum\limits_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2 + \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)}^2} \quad\text{and}\quad b^* = 0.
\]
By the conditions of Theorem 4.1, we have that $a^* \geq 0$ and $\lambda, \delta \geq 0$. Analogously to Case II, we have that the expression that defines $a^*$ is non-negative if
\[
\sum_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} \geq - \sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right).
\]
As $\lambda, \delta \geq 0$, substituting $b^* = 0$ and $a^*$ by the expression above in Expressions (5.13) and (5.14), we conclude that
\[
\begin{cases}
\lambda = 0 \\[2mm]
\dfrac{\sum\limits_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right) \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)}^2 - \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} \sum\limits_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2}{\sum\limits_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2 + \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)}^2} \geq 0
\end{cases}
\]
So,
\[
a^* = \frac{\sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} + \sum\limits_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right)}{\sum\limits_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2 + \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)}^2}; \qquad b^* = 0; \qquad v^* = \overline{Y} - a^*\, \overline{X}
\]
if
\[
\sum_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} \geq - \sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right)
\quad\text{and}\quad
\sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right) \sum_{j=1}^{m} \tfrac{1}{3} r_{X(j)}^2 \geq \sum_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2.
\]

Case IV:
\[
a^* = \frac{\sum\limits_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right) \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)}^2 + \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} \sum\limits_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2}{2\, \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)}^2 \sum\limits_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2};
\]
\[
b^* = \frac{-\sum\limits_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right) \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)}^2 + \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} \sum\limits_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2}{2\, \sum\limits_{j=1}^{m} \tfrac{1}{3} r_{X(j)}^2 \sum\limits_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2}.
\]
To ensure the conditions $\lambda \geq 0$ and $\delta \geq 0$, the expressions that define $a^*$ and $b^*$ have to be substituted in Expressions (5.13) and (5.14). Performing these substitutions we obtain $\lambda = 0$ and $\delta = 0$.

For the reasons enumerated in the previous points, it is ensured that the denominator of the expressions is positive. As $a^*$ and $b^*$ are non-negative parameters, their expressions verify these conditions only if
\[
- \sum_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2 \leq \sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right) \sum_{j=1}^{m} \tfrac{1}{3} r_{X(j)}^2
\]
and
\[
\sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)\left( c_{X(j)} - \overline{X} \right) \sum_{j=1}^{m} \tfrac{1}{3} r_{X(j)}^2 \leq \sum_{j=1}^{m} \tfrac{1}{3} r_{X(j)} r_{Y(j)} \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2.
\]
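Proposition 5.6 can be read as a case analysis on two cross-moments of the data: the covariance term of the centers and the cross term of the half ranges. The sketch below is our own condensed restatement in code (function name and variable names are assumptions of this sketch); it presumes the strict-convexity hypothesis of the proposition, so both denominators are positive.

```python
import numpy as np

def fit_simple_dsd1(c_x, r_x, c_y, r_y):
    """Closed-form parameters of the simple DSD Model I (Proposition 5.6).

    c_x, r_x, c_y, r_y: 1-D arrays with the centers and half ranges of the
    observed intervals of X and Y (not all c_X equal, not all r_X zero).
    """
    dc_x, dc_y = c_x - c_x.mean(), c_y - c_y.mean()
    S_c = np.sum(dc_y * dc_x)        # sum (c_Y - Ybar)(c_X - Xbar)
    S_r = np.sum(r_x * r_y) / 3      # sum (1/3) r_X r_Y
    SS_c = np.sum(dc_x ** 2)         # sum (c_X - Xbar)^2
    SS_r = np.sum(r_x ** 2) / 3      # sum (1/3) r_X^2
    if S_r <= S_c and S_r <= -S_c:                  # Case I
        a, b = 0.0, 0.0
    elif S_c * SS_r <= -S_r * SS_c:                 # Case II
        a, b = 0.0, (S_r - S_c) / (SS_c + SS_r)
    elif S_c * SS_r >= S_r * SS_c:                  # Case III
        a, b = (S_r + S_c) / (SS_c + SS_r), 0.0
    else:                                           # Case IV (interior solution)
        a = (S_c * SS_r + S_r * SS_c) / (2 * SS_r * SS_c)
        b = (-S_c * SS_r + S_r * SS_c) / (2 * SS_r * SS_c)
    v = c_y.mean() - (a - b) * c_x.mean()           # Expression (5.12)
    return a, b, v
```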
5.2. DSD Regression Model II
As discussed before, the DSD Model I is a generalization of the classical linear regression model. However, the DSD Model I may itself be generalized, if we consider that the independent term may be an interval instead of a real number. This case is a particularization of the DSD Model II defined for histogram-valued variables, in Section 4.2. In this work, the DSD Model II applied to interval data also considers the Uniform distribution within the intervals, although other distributions may naturally be assumed.
5.2.1. Definition and Properties
Definition 5.3. Consider the interval-valued variables $X_1, X_2, \ldots, X_p$. The quantile functions that represent the ranges of values that these interval-valued variables take for each unit $j$ are $\Psi^{-1}_{X_1(j)}(t), \Psi^{-1}_{X_2(j)}(t), \ldots, \Psi^{-1}_{X_p(j)}(t)$, and the quantile functions that represent the respective symmetric intervals associated with each unit of the referred variables are $-\Psi^{-1}_{X_1(j)}(1-t), -\Psi^{-1}_{X_2(j)}(1-t), \ldots, -\Psi^{-1}_{X_p(j)}(1-t)$, with $t \in [0,1]$.

The quantile function of the response variable $Y$, $\Psi^{-1}_{Y(j)}$, may be expressed as follows:
\[
\Psi^{-1}_{Y(j)}(t) = \Psi^{-1}_{\hat{Y}(j)}(t) + e_j(t)
\]
where $\Psi^{-1}_{\hat{Y}(j)}(t)$ is the predicted quantile function for the unit $j$, obtained from
\[
\Psi^{-1}_{\hat{Y}(j)}(t) = \Psi^{-1}_{Constant}(t) + a_1 \Psi^{-1}_{X_1(j)}(t) - b_1 \Psi^{-1}_{X_1(j)}(1-t) + a_2 \Psi^{-1}_{X_2(j)}(t) - b_2 \Psi^{-1}_{X_2(j)}(1-t) + \ldots + a_p \Psi^{-1}_{X_p(j)}(t) - b_p \Psi^{-1}_{X_p(j)}(1-t)
\]
with $a_k, b_k \geq 0$, $k \in \{1, 2, \ldots, p\}$, and $\Psi^{-1}_{Constant}(t)$, with $t \in [0,1]$, a quantile function that represents an interval.

The error, for each unit $j$, is the linear function given by
\[
e_j(t) = \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{\hat{Y}(j)}(t), \quad 0 \leq t \leq 1.
\]
This linear regression model is the Distribution and Symmetric Distribution (DSD) Regression Model II applied to interval-valued variables.
Assuming a Uniform distribution within the intervals, the independent parameter $\Psi^{-1}_{Constant}(t)$ is a quantile function defined by
\[
\Psi^{-1}_{Constant}(t) = c_v + (2t - 1)\, r_v \tag{5.15}
\]
where $r_v \geq 0$ and $c_v \in \mathbb{R}$.

For each unit $j$, the predicted quantile function $\Psi^{-1}_{\hat{Y}(j)}$ and the predicted interval $I_{\hat{Y}(j)}$ may be obtained, respectively, from
\[
\Psi^{-1}_{\hat{Y}(j)}(t) = \sum_{k=1}^{p} (a_k - b_k)\, c_{X_k(j)} + c_v + \left( \sum_{k=1}^{p} (a_k + b_k)\, r_{X_k(j)} + r_v \right) (2t - 1) \tag{5.16}
\]
and
\[
I_{\hat{Y}(j)} = \left[ \sum_{k=1}^{p} \left( a_k \underline{I}_{X_k(j)} - b_k \overline{I}_{X_k(j)} \right) + (c_v - r_v),\;\; \sum_{k=1}^{p} \left( a_k \overline{I}_{X_k(j)} - b_k \underline{I}_{X_k(j)} \right) + (c_v + r_v) \right] \tag{5.17}
\]
with $t \in [0,1]$; $a_k, b_k, r_v \geq 0$, $k \in \{1, 2, \ldots, p\}$, and $c_v \in \mathbb{R}$.
The relations induced by the model between the centers and between the half ranges of the intervals are:
\[
c_{\hat{Y}(j)} = \sum_{k=1}^{p} (a_k - b_k)\, c_{X_k(j)} + c_v \tag{5.18}
\]
\[
r_{\hat{Y}(j)} = \sum_{k=1}^{p} (a_k + b_k)\, r_{X_k(j)} + r_v. \tag{5.19}
\]
Similarly to the other cases, to obtain the parameters of the model it is necessary to solve an optimization problem, that in this case is given by:
\[
\min\; f(a_1, b_1, \ldots, a_p, b_p, c_v, r_v) = \sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{\hat{Y}(j)}(t) \right) \tag{5.20}
\]
subject to $-a_k, -b_k \leq 0$, $k \in \{1, \ldots, p\}$; $c_v \in \mathbb{R}$; $-r_v \leq 0$;
where, considering Expressions (5.18) and (5.19), the function to minimize is the following:
\[
\begin{aligned}
f(a_1, b_1, \ldots, a_p, b_p, c_v, r_v) &= \sum_{j=1}^{m} \left[ \left( c_{Y(j)} - c_{\hat{Y}(j)} \right)^2 + \frac{1}{3} \left( r_{Y(j)} - r_{\hat{Y}(j)} \right)^2 \right] \\
&= \sum_{j=1}^{m} \left[ \left( c_{Y(j)} - c_v - \sum_{k=1}^{p} (a_k - b_k)\, c_{X_k(j)} \right)^2 + \frac{1}{3} \left( r_{Y(j)} - r_v - \sum_{k=1}^{p} (a_k + b_k)\, r_{X_k(j)} \right)^2 \right].
\end{aligned} \tag{5.21}
\]
Particularizing the optimization problems (4.54) and (4.56) in Section 4.2.2.1 to interval-valued variables, we obtain the optimization problems for this case in matricial form.

For the DSD Models applied to histogram-valued variables, we showed that the symbolic means of the observed and predicted response variables are equal (Propositions 4.2 and 4.8 in Sections 4.1.2.2 and 4.2.2.2, respectively). When we consider the particular case of interval-valued variables, these propositions are obviously also verified. However, when the DSD Model II is applied to interval-valued variables, it is possible to prove the equality between the barycentric interval of the observed response variable and the barycentric interval of the predicted response variable. To prove this result, it is first necessary to prove a preliminary result and to calculate the first order partial derivatives of the function (5.21) to optimize.
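Mirroring the constrained least squares route sketched for Model I, the objective (5.21) can also be minimized numerically; the following is a minimal sketch under that assumption (our own code and names), where the parameter vector is $(a_1, b_1, \ldots, a_p, b_p, c_v, r_v)$, with $r_v$ constrained to be non-negative and $c_v$ free.

```python
import numpy as np
from scipy.optimize import lsq_linear

def fit_dsd2(c_x, r_x, c_y, r_y):
    """Fit DSD Model II by minimizing (5.21) as a bound-constrained
    least squares problem in (a_1, b_1, ..., a_p, b_p, c_v, r_v)."""
    m, p = c_x.shape
    # Center equation (5.18): columns (c_Xk, -c_Xk, ..., 1, 0)
    Xc = np.hstack([np.column_stack([c_x[:, k], -c_x[:, k]]) for k in range(p)]
                   + [np.ones((m, 1)), np.zeros((m, 1))])
    # Half-range equation (5.19): columns (r_Xk, r_Xk, ..., 0, 1)
    Xr = np.hstack([np.column_stack([r_x[:, k], r_x[:, k]]) for k in range(p)]
                   + [np.zeros((m, 1)), np.ones((m, 1))])
    A = np.vstack([Xc, Xr / np.sqrt(3)])
    y = np.concatenate([c_y, r_y / np.sqrt(3)])
    # a_k, b_k, r_v >= 0; c_v (second-to-last coordinate) is free.
    lb = np.r_[np.zeros(2 * p), -np.inf, 0.0]
    ub = np.full(2 * p + 2, np.inf)
    return lsq_linear(A, y, bounds=(lb, ub)).x
```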
First order partial derivatives of the function f

The first order partial derivatives of $f$ in the optimization problem (5.21) are a particularization of the first order partial derivatives in Section 4.2.2.1 to the case of interval-valued variables. As such, in this case we will also present, for each derivative, a short expression defined taking into account Expressions (5.18) and (5.19), and another expression where the terms are grouped according to the parameters.
• Partial derivative in order to $a_k$, for a fixed $k \in \{1, \ldots, p\}$.
\[
\frac{\partial f}{\partial a_k} = \sum_{j=1}^{m} 2 \left( c_{Y(j)} - c_{\hat{Y}(j)} \right) (-c_{X_k(j)}) + \sum_{j=1}^{m} \frac{2}{3} \left( r_{Y(j)} - r_{\hat{Y}(j)} \right) (-r_{X_k(j)}) \tag{5.22}
\]
\[
\begin{aligned}
\frac{\partial f}{\partial a_k} &= \sum_{j=1}^{m} 2 \left( c_{Y(j)} - \sum_{k'=1}^{p} (a_{k'} - b_{k'})\, c_{X_{k'}(j)} - c_v \right) (-c_{X_k(j)}) + \sum_{j=1}^{m} \frac{2}{3} \left( r_{Y(j)} - \sum_{k'=1}^{p} (a_{k'} + b_{k'})\, r_{X_{k'}(j)} - r_v \right) (-r_{X_k(j)}) \\
&= \sum_{j=1}^{m} \sum_{k'=1}^{p} 2 a_{k'} \left( c_{X_{k'}(j)} c_{X_k(j)} + \tfrac{1}{3} r_{X_{k'}(j)} r_{X_k(j)} \right) + \sum_{j=1}^{m} \sum_{k'=1}^{p} 2 b_{k'} \left( -c_{X_{k'}(j)} c_{X_k(j)} + \tfrac{1}{3} r_{X_{k'}(j)} r_{X_k(j)} \right) \\
&\quad + \sum_{j=1}^{m} 2 c_v\, c_{X_k(j)} + \frac{2}{3} \sum_{j=1}^{m} r_v\, r_{X_k(j)} - \sum_{j=1}^{m} \left( 2 c_{Y(j)} c_{X_k(j)} + \tfrac{2}{3} r_{Y(j)} r_{X_k(j)} \right)
\end{aligned} \tag{5.23}
\]
• Partial derivative in order to $b_k$, for a fixed $k \in \{1, \ldots, p\}$.
\[
\frac{\partial f}{\partial b_k} = \sum_{j=1}^{m} 2 \left( c_{Y(j)} - c_{\hat{Y}(j)} \right) c_{X_k(j)} + \sum_{j=1}^{m} \frac{2}{3} \left( r_{Y(j)} - r_{\hat{Y}(j)} \right) (-r_{X_k(j)}) \tag{5.24}
\]
\[
\begin{aligned}
\frac{\partial f}{\partial b_k} &= \sum_{j=1}^{m} 2 \left( c_{Y(j)} - \sum_{k'=1}^{p} (a_{k'} - b_{k'})\, c_{X_{k'}(j)} - c_v \right) c_{X_k(j)} + \sum_{j=1}^{m} \frac{2}{3} \left( r_{Y(j)} - \sum_{k'=1}^{p} (a_{k'} + b_{k'})\, r_{X_{k'}(j)} - r_v \right) (-r_{X_k(j)}) \\
&= \sum_{j=1}^{m} \sum_{k'=1}^{p} 2 a_{k'} \left( -c_{X_{k'}(j)} c_{X_k(j)} + \tfrac{1}{3} r_{X_{k'}(j)} r_{X_k(j)} \right) + \sum_{j=1}^{m} \sum_{k'=1}^{p} 2 b_{k'} \left( c_{X_{k'}(j)} c_{X_k(j)} + \tfrac{1}{3} r_{X_{k'}(j)} r_{X_k(j)} \right) \\
&\quad - \sum_{j=1}^{m} 2 c_v\, c_{X_k(j)} + \frac{2}{3} \sum_{j=1}^{m} r_v\, r_{X_k(j)} + \sum_{j=1}^{m} \left( 2 c_{Y(j)} c_{X_k(j)} - \tfrac{2}{3} r_{Y(j)} r_{X_k(j)} \right)
\end{aligned} \tag{5.25}
\]
• Partial derivative in order to $c_v$.
\[
\frac{\partial f}{\partial c_v} = \sum_{j=1}^{m} -2 \left( c_{Y(j)} - c_{\hat{Y}(j)} \right) \tag{5.26}
\]
\[
\frac{\partial f}{\partial c_v} = \sum_{j=1}^{m} -2 \left( c_{Y(j)} - \sum_{k=1}^{p} (a_k - b_k)\, c_{X_k(j)} - c_v \right) = 2 \sum_{j=1}^{m} \sum_{k=1}^{p} a_k c_{X_k(j)} - 2 \sum_{j=1}^{m} \sum_{k=1}^{p} b_k c_{X_k(j)} + 2 m c_v - 2 \sum_{j=1}^{m} c_{Y(j)} \tag{5.27}
\]
• Partial derivative in order to $r_v$.
\[
\frac{\partial f}{\partial r_v} = \sum_{j=1}^{m} -\frac{2}{3} \left( r_{Y(j)} - r_{\hat{Y}(j)} \right) \tag{5.28}
\]
\[
\frac{\partial f}{\partial r_v} = \sum_{j=1}^{m} -\frac{2}{3} \left( r_{Y(j)} - \sum_{k=1}^{p} (a_k + b_k)\, r_{X_k(j)} - r_v \right) = \frac{2}{3} \sum_{j=1}^{m} \sum_{k=1}^{p} a_k r_{X_k(j)} + \frac{2}{3} \sum_{j=1}^{m} \sum_{k=1}^{p} b_k r_{X_k(j)} + \frac{2}{3} m r_v - \frac{2}{3} \sum_{j=1}^{m} r_{Y(j)} \tag{5.29}
\]
Proposition 5.7. Let $Y$ be a response interval-valued variable and $\mathbf{b}^* = (a_1^*, b_1^*, \ldots, a_p^*, b_p^*, c_v^*, r_v^*)$ the optimal solution of the optimization problem (5.20):
1. The mean values of the centers of the observed and predicted intervals of the variable $Y$ are equal;
2. If $r_v^* \neq 0$, the mean values of the half ranges of the observed and predicted intervals of the variable $Y$ are equal.
Proof. Consider the function to optimize defined in Expression (5.21), where $c_{Y(j)}$, $c_{\hat{Y}(j)}$ and $r_{Y(j)}$, $r_{\hat{Y}(j)}$ are the centers and half ranges of the observed and predicted intervals for each unit $j$.

From the Kuhn Tucker conditions (3) and (6) (see Theorem 4.3 in Section 4.2.2.1), and considering the first order partial derivatives in Expressions (5.26) and (5.28), we have:
\[
\frac{\partial f}{\partial c_v}(\mathbf{b}^*) = 0 \;\Leftrightarrow\; \sum_{j=1}^{m} \left( c_{Y(j)} - c_{\hat{Y}(j)} \right) = 0 \;\Leftrightarrow\; \frac{1}{m} \sum_{j=1}^{m} c_{Y(j)} = \frac{1}{m} \sum_{j=1}^{m} c_{\hat{Y}(j)} \;\Leftrightarrow\; \overline{c}_Y = \overline{c}_{\hat{Y}};
\]
\[
r_v^*\, \frac{\partial f}{\partial r_v}(\mathbf{b}^*) = 0 \;\Leftrightarrow\; r_v^* \sum_{j=1}^{m} \left( r_{Y(j)} - r_{\hat{Y}(j)} \right) = 0 \;\Leftrightarrow\; r_v^*\, \frac{1}{m} \sum_{j=1}^{m} r_{Y(j)} = r_v^*\, \frac{1}{m} \sum_{j=1}^{m} r_{\hat{Y}(j)} \;\Leftrightarrow\; r_v^*\, \overline{r}_Y = r_v^*\, \overline{r}_{\hat{Y}}.
\]
In conclusion,
\[
\overline{c}_Y = \overline{c}_{\hat{Y}} = \sum_{k=1}^{p} (a_k^* - b_k^*)\, \overline{c}_{X_k} + c_v^*
\]
and, if $r_v^* \neq 0$,
\[
\overline{r}_Y = \overline{r}_{\hat{Y}} = \sum_{k=1}^{p} (a_k^* + b_k^*)\, \overline{r}_{X_k} + r_v^*.
\]
Proposition 5.8. If the independent parameter of the optimal solution in the DSD Model II is a non degenerate interval ($r_v^* \neq 0$), the barycentric interval of the predicted interval-valued variable, $\Psi^{-1}_{\hat{Y}_b}(t)$, is equal to the barycentric interval of the observed interval-valued variable, $\Psi^{-1}_{Y_b}(t)$.
Proof. Considering the quantile function that represents the barycentric histogram, defined in Expression (2.47), particularized to intervals, we have:
\[
\Psi^{-1}_{Y_b}(t) = \frac{1}{m} \sum_{j=1}^{m} \left( c_{Y(j)} + r_{Y(j)}(2t - 1) \right) = \overline{c}_Y + \overline{r}_Y (2t - 1)
\]
where $\overline{c}_Y$ and $\overline{r}_Y$ are the mean values of the centers and half ranges of variable $Y$, respectively.

Consider also the predicted quantile function, for each unit $j$, defined in Expression (5.16),
\[
\Psi^{-1}_{\hat{Y}(j)}(t) = \sum_{k=1}^{p} (a_k - b_k)\, c_{X_k(j)} + c_v + \left( \sum_{k=1}^{p} (a_k + b_k)\, r_{X_k(j)} + r_v \right) (2t - 1).
\]
The barycentric interval of the predicted interval-valued variable is given by
\[
\begin{aligned}
\Psi^{-1}_{\hat{Y}_b}(t) &= \frac{1}{m} \sum_{j=1}^{m} \left[ \sum_{k=1}^{p} (a_k - b_k)\, c_{X_k(j)} + c_v + \left( \sum_{k=1}^{p} (a_k + b_k)\, r_{X_k(j)} + r_v \right) (2t - 1) \right] \\
&= \sum_{k=1}^{p} (a_k - b_k)\, \frac{1}{m} \sum_{j=1}^{m} c_{X_k(j)} + c_v + \left( \sum_{k=1}^{p} (a_k + b_k)\, \frac{1}{m} \sum_{j=1}^{m} r_{X_k(j)} + r_v \right) (2t - 1) \\
&= \sum_{k=1}^{p} (a_k - b_k)\, \overline{c}_{X_k} + c_v + \left( \sum_{k=1}^{p} (a_k + b_k)\, \overline{r}_{X_k} + r_v \right) (2t - 1) \\
&= \overline{c}_{\hat{Y}} + \overline{r}_{\hat{Y}}\, (2t - 1).
\end{aligned}
\]
Considering the optimal solution $\mathbf{b}^* = (a_1^*, b_1^*, \ldots, a_p^*, b_p^*, c_v^*, r_v^*)$, from Proposition 5.7, $\overline{c}_Y = \overline{c}_{\hat{Y}}$ and, if $r_v^* \neq 0$, $\overline{r}_Y = \overline{r}_{\hat{Y}}$.

So, if the estimated independent parameter for the DSD Model II is a non degenerate interval,
\[
\Psi^{-1}_{\hat{Y}_b}(t) = \Psi^{-1}_{Y_b}(t), \quad t \in [0,1].
\]
As a consequence of the equality between the barycentric intervals of the observed and
predicted interval-valued response variable, it is possible to prove the following results.
Proposition 5.9. Let the observed and predicted intervals $Y(j)$ and $\hat{Y}(j)$, with $j \in \{1, \ldots, m\}$, be represented by the respective quantile functions. If the estimated independent parameter is a non degenerate interval ($r_v^* \neq 0$), we have
\[
\sum_{j=1}^{m} \int_{0}^{1} \left( \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{\hat{Y}(j)}(t) \right) \left( \Psi^{-1}_{\hat{Y}(j)}(t) - \Psi^{-1}_{Y_b}(t) \right) dt = 0.
\]
Proof. Let $\Psi^{-1}_{Y(j)}(t)$ and $\Psi^{-1}_{\hat{Y}(j)}(t)$, with $t \in [0,1]$, be the quantile functions that represent the observed and predicted intervals of $Y$, for each unit $j$. Consider also the barycentric interval $\Psi^{-1}_{Y_b}(t) = \overline{c}_Y + (2t - 1)\, \overline{r}_Y$. Then,
\[
\begin{aligned}
\sum_{j=1}^{m} \int_{0}^{1} & \left( \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{\hat{Y}(j)}(t) \right) \left( \Psi^{-1}_{\hat{Y}(j)}(t) - \Psi^{-1}_{Y_b}(t) \right) dt = \\
&= \sum_{j=1}^{m} \int_{0}^{1} \left( c_{Y(j)} + r_{Y(j)}(2t-1) - c_{\hat{Y}(j)} - r_{\hat{Y}(j)}(2t-1) \right) \left( c_{\hat{Y}(j)} + r_{\hat{Y}(j)}(2t-1) - \overline{c}_Y - \overline{r}_Y(2t-1) \right) dt \\
&= \sum_{j=1}^{m} \int_{0}^{1} \left[ \left( c_{Y(j)} - c_{\hat{Y}(j)} \right) + \left( r_{Y(j)} - r_{\hat{Y}(j)} \right)(2t-1) \right] \left[ \left( c_{\hat{Y}(j)} - \overline{c}_Y \right) + \left( r_{\hat{Y}(j)} - \overline{r}_Y \right)(2t-1) \right] dt \\
&= \sum_{j=1}^{m} \int_{0}^{1} \left[ \left( c_{Y(j)} - c_{\hat{Y}(j)} \right)\left( c_{\hat{Y}(j)} - \overline{c}_Y \right) + \left( \left( c_{Y(j)} - c_{\hat{Y}(j)} \right)\left( r_{\hat{Y}(j)} - \overline{r}_Y \right) + \left( r_{Y(j)} - r_{\hat{Y}(j)} \right)\left( c_{\hat{Y}(j)} - \overline{c}_Y \right) \right)(2t-1) \right. \\
&\qquad\qquad \left. + \left( r_{Y(j)} - r_{\hat{Y}(j)} \right)\left( r_{\hat{Y}(j)} - \overline{r}_Y \right)(2t-1)^2 \right] dt.
\end{aligned}
\]
Solving the integral we obtain
\[
\sum_{j=1}^{m} \left[ \left( c_{Y(j)} - c_{\hat{Y}(j)} \right)\left( c_{\hat{Y}(j)} - \overline{c}_Y \right) + \frac{1}{3} \left( r_{Y(j)} - r_{\hat{Y}(j)} \right)\left( r_{\hat{Y}(j)} - \overline{r}_Y \right) \right] =
\]
\[
= \sum_{j=1}^{m} \left[ \left( c_{Y(j)} - c_{\hat{Y}(j)} \right) c_{\hat{Y}(j)} - \left( c_{Y(j)} - c_{\hat{Y}(j)} \right) \overline{c}_Y + \frac{1}{3} \left( r_{Y(j)} - r_{\hat{Y}(j)} \right) r_{\hat{Y}(j)} - \frac{1}{3} \left( r_{Y(j)} - r_{\hat{Y}(j)} \right) \overline{r}_Y \right].
\]
Considering the optimal solution $\mathbf{b}^* = (a_1^*, b_1^*, \ldots, a_p^*, b_p^*, c_v^*, r_v^*)$ of the minimization problem (5.20), according to Expressions (5.18) and (5.19) we have:
\[
\begin{aligned}
& \sum_{j=1}^{m} \left[ \left( c_{Y(j)} - c_{\hat{Y}(j)} \right) \left( \sum_{k=1}^{p} (a_k^* - b_k^*)\, c_{X_k(j)} + c_v^* \right) + \frac{1}{3} \left( r_{Y(j)} - r_{\hat{Y}(j)} \right) \left( \sum_{k=1}^{p} (a_k^* + b_k^*)\, r_{X_k(j)} + r_v^* \right) \right. \\
& \qquad \left. - \left( c_{Y(j)} - c_{\hat{Y}(j)} \right) \overline{c}_Y - \frac{1}{3} \left( r_{Y(j)} - r_{\hat{Y}(j)} \right) \overline{r}_Y \right] = \\
&= \sum_{j=1}^{m} \left( c_{Y(j)} - c_{\hat{Y}(j)} \right) \sum_{k=1}^{p} a_k^* c_{X_k(j)} + \frac{1}{3} \sum_{j=1}^{m} \left( r_{Y(j)} - r_{\hat{Y}(j)} \right) \sum_{k=1}^{p} a_k^* r_{X_k(j)} \\
&\quad + \sum_{j=1}^{m} \left( c_{Y(j)} - c_{\hat{Y}(j)} \right) \sum_{k=1}^{p} (-b_k^*)\, c_{X_k(j)} + \frac{1}{3} \sum_{j=1}^{m} \left( r_{Y(j)} - r_{\hat{Y}(j)} \right) \sum_{k=1}^{p} b_k^* r_{X_k(j)} \\
&\quad + \sum_{j=1}^{m} \left( c_{Y(j)} - c_{\hat{Y}(j)} \right) (c_v^* - \overline{c}_Y) + \frac{1}{3} \sum_{j=1}^{m} \left( r_{Y(j)} - r_{\hat{Y}(j)} \right) (r_v^* - \overline{r}_Y).
\end{aligned}
\]
Considering the first order partial derivatives (5.22), (5.24), (5.26) and (5.28) in the optimal point $\mathbf{b}^*$, the previous expression is equal to:
\[
- \frac{1}{2} \sum_{k=1}^{p} a_k^* \frac{\partial f}{\partial a_k}(\mathbf{b}^*) - \frac{1}{2} \sum_{k=1}^{p} b_k^* \frac{\partial f}{\partial b_k}(\mathbf{b}^*) - \frac{1}{2} (c_v^* - \overline{c}_Y) \frac{\partial f}{\partial c_v}(\mathbf{b}^*) - \frac{1}{2} r_v^* \frac{\partial f}{\partial r_v}(\mathbf{b}^*) - \frac{1}{3} \overline{r}_Y \sum_{j=1}^{m} \left( r_{Y(j)} - r_{\hat{Y}(j)} \right).
\]
Considering the Kuhn Tucker conditions of Theorem 4.3: $\dfrac{\partial f}{\partial c_v}(\mathbf{b}^*) = 0$; $a_k^* \dfrac{\partial f}{\partial a_k}(\mathbf{b}^*) = 0$; $b_k^* \dfrac{\partial f}{\partial b_k}(\mathbf{b}^*) = 0$ and $r_v^* \dfrac{\partial f}{\partial r_v}(\mathbf{b}^*) = 0$, for the optimal solution $\mathbf{b}^*$ of the optimization problem (5.20), we obtain:
\[
\sum_{j=1}^{m} \int_{0}^{1} \left( \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{\hat{Y}(j)}(t) \right) \left( \Psi^{-1}_{\hat{Y}(j)}(t) - \Psi^{-1}_{Y_b}(t) \right) dt = - \frac{1}{3} \overline{r}_Y \sum_{j=1}^{m} \left( r_{Y(j)} - r_{\hat{Y}(j)} \right).
\]
From Proposition 5.7, if $r_v^* \neq 0$ we have $\sum_{j=1}^{m} r_{Y(j)} = \sum_{j=1}^{m} r_{\hat{Y}(j)}$ and therefore
\[
\sum_{j=1}^{m} \int_{0}^{1} \left( \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{\hat{Y}(j)}(t) \right) \left( \Psi^{-1}_{\hat{Y}(j)}(t) - \Psi^{-1}_{Y_b}(t) \right) dt = - \frac{1}{3} \overline{r}_Y \sum_{j=1}^{m} \left( r_{Y(j)} - r_{\hat{Y}(j)} \right) = 0.
\]
Consequently, it is easy to prove the next proposition.
Proposition 5.10. The sum of the squared Mallows distances between each observed interval $j \in \{1, \ldots, m\}$ of the interval-valued variable $Y$ and the barycentric interval $\Psi^{-1}_{Y_b}(t)$ may be decomposed as follows:
\[
\sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{Y_b}(t) \right) = \sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{\hat{Y}(j)}(t) \right) + \sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{\hat{Y}(j)}(t), \Psi^{-1}_{Y_b}(t) \right).
\]
Proof. Consider each unit $j$ of the interval-valued variable $Y$, represented by its quantile function $\Psi^{-1}_{Y(j)}(t)$, and the barycentric interval of the variable $Y$, $\Psi^{-1}_{Y_b}(t)$. Then we have
\[
\begin{aligned}
\sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{Y_b}(t) \right) &= \sum_{j=1}^{m} \int_{0}^{1} \left( \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{Y_b}(t) \right)^2 dt \\
&= \sum_{j=1}^{m} \int_{0}^{1} \left( \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{\hat{Y}(j)}(t) \right)^2 dt + \sum_{j=1}^{m} \int_{0}^{1} \left( \Psi^{-1}_{\hat{Y}(j)}(t) - \Psi^{-1}_{Y_b}(t) \right)^2 dt \\
&\quad + 2 \sum_{j=1}^{m} \int_{0}^{1} \left( \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{\hat{Y}(j)}(t) \right) \left( \Psi^{-1}_{\hat{Y}(j)}(t) - \Psi^{-1}_{Y_b}(t) \right) dt.
\end{aligned}
\]
From Proposition 5.9 we have
\[
\sum_{j=1}^{m} \int_{0}^{1} \left( \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{\hat{Y}(j)}(t) \right) \left( \Psi^{-1}_{\hat{Y}(j)}(t) - \Psi^{-1}_{Y_b}(t) \right) dt = 0.
\]
So, it has been proved that
\[
\sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{Y_b}(t) \right) = \sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{\hat{Y}(j)}(t) \right) + \sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{\hat{Y}(j)}(t), \Psi^{-1}_{Y_b}(t) \right).
\]
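The identity underlying this decomposition can be checked numerically; the sketch below (our own illustration with hypothetical data, not from the thesis) computes the total, residual and explained sums together with the cross term of Proposition 5.9, which vanishes at an optimal DSD Model II fit with $r_v^* \neq 0$.

```python
import numpy as np

def mallows_sq(c1, r1, c2, r2):
    """Squared Mallows distance between intervals given by
    center/half range: (c1 - c2)^2 + (1/3)(r1 - r2)^2."""
    return (c1 - c2) ** 2 + (r1 - r2) ** 2 / 3

# Hypothetical observed and fitted centers/half ranges:
c_y = np.array([36.4, 40.9, 42.7, 41.9]); r_y = np.array([3.2, 4.2, 6.0, 5.5])
c_yhat = c_y + np.array([0.4, -0.5, 0.3, -0.2])
r_yhat = r_y + np.array([0.2, -0.1, -0.3, 0.2])

cb, rb = c_y.mean(), r_y.mean()  # barycentric interval of Y
total = np.sum(mallows_sq(c_y, r_y, cb, rb))
resid = np.sum(mallows_sq(c_y, r_y, c_yhat, r_yhat))
expl = np.sum(mallows_sq(c_yhat, r_yhat, cb, rb))
cross = np.sum((c_y - c_yhat) * (c_yhat - cb)
               + (r_y - r_yhat) * (r_yhat - rb) / 3)
# The algebraic identity holds for any data:
print(np.isclose(total, resid + expl + 2 * cross))  # True
# At an optimum of (5.20) with r_v* != 0, cross = 0 (Proposition 5.9),
# which yields exactly the decomposition of Proposition 5.10.
```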
5.2.2. Goodness-of-fit measures
For the DSD Models applied to histogram-valued variables, and for the DSD Model I applied to interval-valued variables, a coefficient of determination of the models, represented by $\Omega$, was deduced. For the DSD Model II it is also possible to deduce the same measure:
\[
\Omega = \frac{\displaystyle\sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{\hat{Y}(j)}(t), \overline{Y} \right)}{\displaystyle\sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{Y(j)}(t), \overline{Y} \right)} = \frac{\displaystyle\sum_{j=1}^{m} \left( c_{\hat{Y}(j)} - \overline{Y} \right)^2 + \frac{1}{3}\, r_{\hat{Y}(j)}^2}{\displaystyle\sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right)^2 + \frac{1}{3}\, r_{Y(j)}^2}
\]
where $\overline{Y}$ is the symbolic mean of the interval-valued variable $Y$.
For the DSD Model II, in the particular case of interval-valued variables, we have deduced from Proposition 5.10 a new coefficient of determination, also based on the Mallows distance. Instead of calculating the distance between each observed and predicted interval and the symbolic mean, we calculate the distance between each observed and predicted interval and the barycentric interval. It is important to underline that the calculation of a coefficient of determination relatively to the barycentric distribution is only possible in this situation.
Definition 5.4. Consider the observed and predicted distributions of the interval-valued variables $Y$ and $\hat{Y}$ represented, respectively, by their quantile functions $\Psi^{-1}_{Y(j)}(t)$ and $\Psi^{-1}_{\hat{Y}(j)}(t)$. Consider also the barycentric interval of the variable $Y$, $\Psi^{-1}_{Y_b}(t)$. The goodness-of-fit measure is given by
\[
\widetilde{\Omega} = \frac{\displaystyle\sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{\hat{Y}(j)}(t), \Psi^{-1}_{Y_b}(t) \right)}{\displaystyle\sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{Y_b}(t) \right)} = \frac{\displaystyle\sum_{j=1}^{m} \left( c_{\hat{Y}(j)} - \overline{c}_Y \right)^2 + \frac{1}{3} \left( r_{\hat{Y}(j)} - \overline{r}_Y \right)^2}{\displaystyle\sum_{j=1}^{m} \left( c_{Y(j)} - \overline{c}_Y \right)^2 + \frac{1}{3} \left( r_{Y(j)} - \overline{r}_Y \right)^2}.
\]
This measure may only be used when the optimal value estimated for the independent parameter of the model is not a degenerate interval. It is expected that the goodness-of-fit measure $\widetilde{\Omega}$ is more sensitive, because it measures the distance relatively to a quantile function and not to a real value. As other coefficients of determination, $\widetilde{\Omega}$ ranges from 0 to 1.
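A minimal sketch of this measure follows (our own code; the function name omega_tilde is an assumption), computed against the barycentric interval whose center and half range are, for intervals, the means of the observed centers and half ranges.

```python
import numpy as np

def omega_tilde(c_y, r_y, c_yhat, r_yhat):
    """Goodness-of-fit measure of Definition 5.4, relative to the
    barycentric interval (mean center, mean half range) of Y."""
    cb, rb = np.mean(c_y), np.mean(r_y)
    num = np.sum((c_yhat - cb) ** 2 + (r_yhat - rb) ** 2 / 3)
    den = np.sum((c_y - cb) ** 2 + (r_y - rb) ** 2 / 3)
    return num / den
```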
Example 5.2. Considering again the data in Example 5.1, we may also observe that the behavior of the DSD Model II is the same as that of the DSD Model I. When we consider the symmetric intervals $-X(j)$, for all units $j$, the behavior of the parameters $a$ and $b$ changes.
\[
\text{DSD Model II:}\quad \Psi^{-1}_{\hat{Y}(j)}(t) = -1.2294 + 1.3497(2t - 1) + 3.1168\,\Psi^{-1}_{X(j)}(t) - 0\,\Psi^{-1}_{X(j)}(1 - t)
\]
\[
\text{DSD Model II:}\quad \Psi^{-1}_{\hat{Y}(j)}(t) = -1.2294 + 1.3497(2t - 1) + 0\,\Psi^{-1}_{-X(j)}(t) - 3.1168\,\Psi^{-1}_{-X(j)}(1 - t)
\]
The values of the coefficients of determination for both cases are as follows:
\[
\Omega = 0.9699, \qquad \widetilde{\Omega} = 0.9552.
\]
As the coefficient $\widetilde{\Omega}$ is more sensitive, it has a lower value than the coefficient $\Omega$, as expected.
5.3. Conclusion
An interval-valued variable is a particular case of a histogram-valued variable where, for all observations, we only have one interval with weight equal to one. A classical variable is a particular case of an interval-valued variable, when to each observation corresponds a degenerate interval (an interval where the lower and upper bounds are the same). Because of this link between histogram, interval and classical variables, it was logical that the DSD Models for histogram-valued variables could be particularized to interval-valued variables, which in turn could be particularized to real values, as we have shown in this chapter.
As for histogram-valued variables, it is possible to deduce from the DSD Models a goodness-of-fit measure for interval-valued variables. This measure is computed with respect to the symbolic mean (a real value) of the response symbolic variable. However, in the case of interval-valued variables, and only for the most general case of the DSD Model II, it was also possible to deduce a goodness-of-fit measure with respect to the barycentric interval (an interval). It is possible to define this measure because, when the independent parameter is not a degenerate interval, the equality between the total sum of squares and the sum of the two other sums of squares (the explained sum and the residual sum) is verified. It is not possible to deduce the coefficient $\widetilde{\Omega}$ for the DSD Model I applied to interval-valued variables, nor for the DSD Models applied to histogram-valued variables because, in general, the barycentric distribution of the observed and of the predicted distributions of the response variables is not the same.
The main advantage of the DSD Models for interval-valued variables is that they define a linear relation between one response variable and $p$ explicative variables without decomposing the intervals into their bounds, or centers and half ranges. In fact, as these models use the quantile function to represent the intervals, they allow working with the intervals as such, considering the distributions within them. In this work we assume a Uniform distribution in all intervals that are the "values" observed for the interval-valued variables. Under these conditions, the DSD Model induces a relation between the half ranges and a relation between the centers of the intervals where the respective estimated parameters are not independent; in the case of the half ranges, this relation is always direct, similarly to what occurs in the Constrained Center and Range Method (Lima Neto and De Carvalho (2010)).
The DSD Models have the potential of taking into consideration the distribution within the intervals associated with the observations of the interval-valued variables. As such, it is possible to adapt the proposed model to interval-valued variables with other distributions. For example, the DSD Model may be developed considering a triangular distribution within the intervals. Since in most studies of Symbolic Data Analysis it is considered that the values in the observed intervals are uniformly distributed, all descriptive statistics would then also have to be redefined.
6. Simulation studies
The main goal of the simulation studies that we present in this chapter is to analyze the performance of the DSD Models under different conditions. We start by performing a simulation study considering only interval-valued variables, with the aim of analyzing and relating the behavior of the error function with the goodness-of-fit measures. From the conclusions of the first study, it was possible to delineate a second simulation study where the performance of the DSD Models with histogram/interval-valued variables is analyzed. For this, we evaluate empirically the behavior of the parameter estimation of the DSD Models when the explicative and response histogram/interval-valued variables are related with different levels of linearity, measured by the coefficient of determination $\Omega$.
6.1. Building symbolic simulated data tables

The observations of the explicative and response histogram-valued or interval-valued variables $X_k$ and $Y$ were generated in different ways.

• The observations of each histogram/interval-valued variable $X_k$ are created.
According to the concept of symbolic variables, to obtain the $m$ observations associated with a histogram/interval-valued variable $X_k$, we started by generating 5000 real values $x_j(w)$, with $w \in \{1, \ldots, 5000\}$, corresponding to each unit $j$. The process used to generate these values will be introduced below. These values were then organized in histograms or intervals, that represent the empirical distribution or the range of values for each unit, respectively.
– For interval-valued variables - we recorded the minimum and maximum of the values associated with the respective observation, i.e., for each unit $j$ the minimum and maximum values of the set $\{x_j(w), w \in \{1, \ldots, 5000\}\}$ are selected, and an interval for each unit $j$ is then built.
– For histogram-valued variables - it was considered that, in all observations, the subintervals of each histogram have the same weight (equiprobable), with frequency 0.10. This option is not restrictive, and is supported by the work of Colombo and Jaarsma (1980). If equiprobable histograms with the same weight distributions in all observations were not considered, we would obtain a large number of different weights and consequently some subintervals could have very low frequencies. It is possible to consider histograms that are not equiprobable; however, the weight in each subinterval has to be the same in all observations (see Section 2.2.3). A small sketch of this generation step is given below.
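The sketch below illustrates this construction (our own code, with assumed function names); it uses the $U(17,19)$/$U(21,23)$ bounds of the "low and similar variability" pattern described later in Section 6.2.1 purely as an example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_interval_obs(m, n_micro=5000):
    """Build m interval observations from uniform microdata: for each unit j,
    draw n_micro values x_j(w) and record [min, max]."""
    lower = rng.uniform(17, 19, size=m)   # delta_1(j)
    upper = rng.uniform(21, 23, size=m)   # delta_2(j)
    micro = rng.uniform(lower[:, None], upper[:, None], size=(m, n_micro))
    return micro.min(axis=1), micro.max(axis=1)

def make_histogram_obs(x_micro, n_bins=10):
    """Summarize one unit's microdata as an equiprobable histogram:
    subinterval bounds at the deciles, each subinterval with weight 0.10."""
    return np.quantile(x_micro, np.linspace(0, 1, n_bins + 1))
```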
• The observations of the histogram/interval-valued variable $Y$ are created.
The histograms/intervals that are the observations of the histogram/interval-valued variable $Y$ are obtained in two steps. First, we consider a perfect linear regression, without error, given by:
\[
\text{DSD Model I:}\quad \Psi^{-1}_{Y^*(j)}(t) = v + \sum_{k=1}^{p} a_k \Psi^{-1}_{X_k(j)}(t) - \sum_{k=1}^{p} b_k \Psi^{-1}_{X_k(j)}(1-t)
\]
\[
\text{DSD Model II:}\quad \Psi^{-1}_{Y^*(j)}(t) = \Psi^{-1}_{Constant}(t) + \sum_{k=1}^{p} a_k \Psi^{-1}_{X_k(j)}(t) - \sum_{k=1}^{p} b_k \Psi^{-1}_{X_k(j)}(1-t)
\]
for particular values of the parameters. The histogram-valued variables $X_k$ and $Y^*$ are in a perfect linear relation. However, this is not what we intend to simulate for a symbolic data table. As such, we disturb the perfect linear relation by introducing an error function,
\[
\Psi^{-1}_{Y(j)}(t) = \Psi^{-1}_{Y^*(j)}(t) + e_j(t).
\]
The error function is a piecewise linear function (but not necessarily a quantile function) defined by:
\[
e_j(t) = \begin{cases}
\widetilde{c}(j)_1 + \left( \dfrac{2t}{w_1} - 1 \right) \widetilde{r}(j)_1 & \text{if } 0 \leq t < w_1 \\[2mm]
\widetilde{c}(j)_1 + \widetilde{r}(j)_1 + \widetilde{r}(j)_2 + \left( \dfrac{2(t - w_1)}{w_2 - w_1} - 1 \right) \widetilde{r}(j)_2 & \text{if } w_1 \leq t < w_2 \\[2mm]
\qquad \vdots \\[1mm]
\widetilde{c}(j)_1 + \widetilde{r}(j)_1 + \displaystyle\sum_{i=2}^{n-1} 2\widetilde{r}(j)_i + \widetilde{r}(j)_n + \left( \dfrac{2(t - w_{n-1})}{1 - w_{n-1}} - 1 \right) \widetilde{r}(j)_n & \text{if } w_{n-1} \leq t \leq 1
\end{cases} \tag{6.1}
\]
In the particular case of interval-valued variables, the error function is written as
\[
e_j(t) = \widetilde{c}(j) + (2t - 1)\, \widetilde{r}(j), \quad t \in [0,1].
\]
The values of $\widetilde{c}(j)_1$ and $\widetilde{r}(j)_i$, with $i \in \{1, \ldots, n\}$, that compose the error function $e_j(t)$ are randomly selected from intervals with low or high variation, depending on whether we want the linear regression between the variables to be better or worse. However, the values of $\widetilde{r}(j)_i$ have a limitation. Each half range $r_{Y(j)_i}$ in the quantile function $\Psi^{-1}_{Y(j)}(t)$, that results from the perturbation of $\Psi^{-1}_{Y^*(j)}(t)$ by the error function $e_j(t)$, is obtained by $r_{Y(j)_i} = r_{Y^*(j)_i} + \widetilde{r}(j)_i$, for each unit $j$ and subinterval $i$. As it is not imposed that the error function is a quantile function, the values of $\widetilde{r}(j)_i$ may be negative, but cannot be lower than $-r_{Y^*(j)_i}$; otherwise, for this unit $j$ and subinterval $i$, the half range $r_{Y(j)_i}$ would be negative.
To perform the simulation study, symbolic data tables that illustrate different situations were created according to a selected factorial design. For each situation considered, 1000 data tables were generated. Thereby, the main values presented in the tables of Appendix C are the means of the obtained values and the respective standard deviation values (represented by $s$).
6.2. Simulation study I
In the first simulation study, the goal is to analyze the behavior of the error function and see if it is possible to establish a relation between the error function and the goodness-of-fit measures. In this study, the DSD Model I and the DSD Model II applied to interval-valued variables are considered.

The goodness-of-fit measures that we analyze are the following:
• Coefficient of determination, $\Omega$, the measure deduced from the DSD Models (see Definition 5.2 in Section 5.1.1, and Section 5.2.2)
\[
\Omega = \frac{\displaystyle\sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{\hat{Y}(j)}(t), \overline{Y} \right)}{\displaystyle\sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{Y(j)}(t), \overline{Y} \right)} = \frac{\displaystyle\sum_{j=1}^{m} \left( c_{\hat{Y}(j)} - \overline{c}_Y \right)^2 + \frac{1}{3}\, r_{\hat{Y}(j)}^2}{\displaystyle\sum_{j=1}^{m} \left( c_{Y(j)} - \overline{c}_Y \right)^2 + \frac{1}{3}\, r_{Y(j)}^2}; \tag{6.2}
\]
• Coefficient of determination, $\widetilde{\Omega}$, that will only be used when the predictions of the response intervals are obtained from the DSD Model II and the independent parameter estimated is not a degenerate interval (see Definition 5.4 in Section 5.2.2)
\[
\widetilde{\Omega} = \frac{\displaystyle\sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{\hat{Y}(j)}(t), \Psi^{-1}_{Y_b}(t) \right)}{\displaystyle\sum_{j=1}^{m} D_M^2\!\left( \Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{Y_b}(t) \right)} = \frac{\displaystyle\sum_{j=1}^{m} \left( c_{\hat{Y}(j)} - \overline{c}_Y \right)^2 + \frac{1}{3} \left( r_{\hat{Y}(j)} - \overline{r}_Y \right)^2}{\displaystyle\sum_{j=1}^{m} \left( c_{Y(j)} - \overline{c}_Y \right)^2 + \frac{1}{3} \left( r_{Y(j)} - \overline{r}_Y \right)^2}; \tag{6.3}
\]
• Root Mean Square Error, $RMSE_M$, proposed by Irpino and Verde (2012) and defined in Expression (3.87) in Section 3.2
\[
RMSE_M = \sqrt{ \frac{1}{m} \sum_{j=1}^{m} \int_{0}^{1} \left( \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{\hat{Y}(j)}(t) \right)^2 dt }.
\]
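For interval data, the integral in this definition reduces, per unit, to the squared Mallows distance between intervals, so the measure can be computed directly from centers and half ranges; a minimal sketch (our own function name) follows.

```python
import numpy as np

def rmse_m(c_y, r_y, c_yhat, r_yhat):
    """RMSE based on the Mallows distance: for intervals, the integral
    reduces to (c - c')^2 + (1/3)(r - r')^2 for each unit j."""
    d2 = (c_y - c_yhat) ** 2 + (r_y - r_yhat) ** 2 / 3
    return np.sqrt(d2.mean())
```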
6.2.1. Factorial design
In this study a full factorial design is employed, with the following factors:
• Sample size: m = 30; 100.
• Number of explicative interval-valued variables p = 1.
• Patterns of variability in the explicative variable $X$.
The patterns of variability in the variable take into account the size of the half ranges of the intervals that are the observations of the variable $X$, and the variance of the half ranges and centers of these intervals. Assuming that the distribution of the real values $x_j(w)$, with $w \in \{1, \ldots, 5000\}$, in the microdata is Uniform, four patterns of variability are considered:

i) Low and similar variability - when the intervals associated with variable $X$ have similar centers (around the value 20) and similar small half ranges (around the value 2), which implies a small variance of the half ranges.
$x_j(w) \sim U(\delta_1(j), \delta_2(j))$, $w \in \{1, \ldots, 5000\}$, are randomly generated considering, for each $j \in \{1, \ldots, m\}$, $\delta_1(j) \sim U(17, 19)$ and $\delta_2(j) \sim U(21, 23)$.

ii) High and similar variability - when the intervals associated with variable $X$ have similar centers (around the value 20) and similar large half ranges (around the value 15), which implies a small variance of the half ranges.
$x_j(w) \sim U(\delta_3(j), \delta_4(j))$, $w \in \{1, \ldots, 5000\}$, are randomly generated considering, for each $j \in \{1, \ldots, m\}$, $\delta_3(j) \sim U(4, 6)$ and $\delta_4(j) \sim U(34, 36)$.

iii) Mixed variability I - when the intervals associated with the variable $X$ have similar centers (around the value 20) and the half ranges are diverse (half ranges between 3 and 14), which implies a large variance of the half ranges.
$x_j(w) \sim U(\delta_5(j), \delta_6(j))$, $w \in \{1, \ldots, 5000\}$, are randomly generated considering, for each $j \in \{1, \ldots, m\}$, several options:
◦ $\delta_5(j) \sim U(11, 15)$ and $\delta_6(j) \sim U(25, 29)$;
◦ $\delta_5(j) \sim U(13, 17)$ and $\delta_6(j) \sim U(23, 27)$;
◦ $\delta_5(j) \sim U(12, 16)$ and $\delta_6(j) \sim U(24, 28)$;
◦ $\delta_5(j) \sim U(12, 16)$ and $\delta_6(j) \sim U(26, 30)$;
◦ $\delta_5(j) \sim U(14, 18)$ and $\delta_6(j) \sim U(24, 28)$;
◦ $\delta_5(j) \sim U(10, 14)$ and $\delta_6(j) \sim U(26, 30)$;
◦ $\delta_5(j) \sim U(9, 13)$ and $\delta_6(j) \sim U(29, 33)$;
◦ $\delta_5(j) \sim U(10, 14)$ and $\delta_6(j) \sim U(28, 32)$;
◦ $\delta_5(j) \sim U(6, 10)$ and $\delta_6(j) \sim U(30, 34)$;
◦ $\delta_5(j) \sim U(8, 12)$ and $\delta_6(j) \sim U(30, 34)$.

iv) Mixed variability II - when the intervals associated with the variable $X$ have diverse half ranges and centers (half ranges between 3 and 13; centers between 8 and 22), which implies a large variance of centers and half ranges.
$x_j(w) \sim U(\delta_7(j), \delta_8(j))$, $w \in \{1, \ldots, 5000\}$, are randomly generated considering, for each $j \in \{1, \ldots, m\}$, several options:
◦ $\delta_7(j) \sim U(11, 15)$ and $\delta_8(j) \sim U(25, 29)$;
◦ $\delta_7(j) \sim U(11, 15)$ and $\delta_8(j) \sim U(21, 25)$;
◦ $\delta_7(j) \sim U(7, 11)$ and $\delta_8(j) \sim U(19, 23)$;
◦ $\delta_7(j) \sim U(3, 7)$ and $\delta_8(j) \sim U(17, 21)$;
◦ $\delta_7(j) \sim U(3, 7)$ and $\delta_8(j) \sim U(13, 17)$;
◦ $\delta_7(j) \sim U(14, 18)$ and $\delta_8(j) \sim U(20, 24)$;
◦ $\delta_7(j) \sim U(4, 8)$ and $\delta_8(j) \sim U(28, 32)$;
◦ $\delta_7(j) \sim U(4, 8)$ and $\delta_8(j) \sim U(22, 26)$;
◦ $\delta_7(j) \sim U(5, 9)$ and $\delta_8(j) \sim U(15, 19)$;
◦ $\delta_7(j) \sim U(2, 6)$ and $\delta_8(j) \sim U(14, 18)$.
• Parameters of the DSD Models.
– DSD Model I: $a = 2$; $b = 1$; $v = -1$;
– DSD Model II: $a = 2$; $b = 1$; $\Psi^{-1}_{Constant}(t) = -1 + 1(2t - 1)$, with $t \in [0,1]$ ($\Psi^{-1}_{Constant}(t)$ represents the interval $[-2, 0]$).
• The error function $e_j(t) = \widetilde{c}(j) + (2t - 1)\,\widetilde{r}(j)$, with $t \in [0,1]$, is defined considering:
i) Different levels of variability for the values $\widetilde{c}(j)$. The values of $\widetilde{c}(j)$ are randomly (uniformly) generated in $U_{\widetilde{s}_c} = U(-\widetilde{s}_c, \widetilde{s}_c)$ with $\widetilde{s}_c \in \{0, 0.25, 1, 2, 5, 10, 20, 40, 60\}$. The values of $\widetilde{s}_c$ are selected according to the variability of the values within the intervals that are the observations simulated for the interval-valued variable $Y^*$, in each situation;
ii) Different levels of variability for the values $\widetilde{r}(j)$. The values of $\widetilde{r}(j)$ are randomly (uniformly) generated in $U_{\widetilde{s}_r} = U(-\widetilde{s}_r, \widetilde{s}_r)$ with $\widetilde{s}_r \in \{0, 0.5, 1, 2, 5, 10, 20, m_r\}$, where $m_r = \min_{j \in \{1,\ldots,m\}} \{r_{Y^*(j)}\}$. As, for each unit $j$, the value of $\widetilde{r}(j)$ cannot be lower than $-r_{Y^*(j)}$ (else the functions $\Psi^{-1}_{Y(j)}(t)$ would not be quantile functions), in all situations $m_r$ is the highest possible value of $\widetilde{s}_r$. A sketch of this perturbation step is given after this list.
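The sketch below (our own code and names) generates the interval observations of $Y$ in the way just described: a perfect DSD Model I relation between centers and half ranges (Expressions (5.3)-(5.4)) is disturbed by the interval error function, with the half-range perturbation clamped so that every resulting half range stays non-negative.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_y_dsd1(c_x, r_x, a=2.0, b=1.0, v=-1.0, s_c=5.0, s_r=2.0):
    """Simulate interval observations of Y: a perfect DSD Model I relation
    disturbed by the error function e_j(t) = c~(j) + (2t - 1) r~(j)."""
    c_star = (a - b) * c_x + v          # centers of Y*
    r_star = (a + b) * r_x              # half ranges of Y*
    c_err = rng.uniform(-s_c, s_c, size=c_x.shape)
    # r~(j) must stay above -r_{Y*(j)}: cap s~_r at m_r = min_j r_{Y*(j)}.
    s_r = min(s_r, r_star.min())
    r_err = rng.uniform(-s_r, s_r, size=r_x.shape)
    return c_star + c_err, r_star + r_err
```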
Note 6.1. In this work, when we mention variability within the intervals/distributions, we are referring to the size of the half range of the intervals/distributions. The variability within the intervals/distributions may be low or high if the size of the half range is small or large. When we refer to the variability in the variables, we include the variability within the intervals (high or low) and the variance of the half ranges of the intervals that are the observations of the interval-valued variables (all half ranges are similar, or a mixture of several different half ranges is considered).
Note 6.2. In each different situation studied, for each $j$, the values of $\widetilde{r}(j)$ and $\widetilde{c}(j)$ are randomly generated in a selection of intervals $U_{\widetilde{s}_r}$ and $U_{\widetilde{s}_c}$, respectively. When we state that the values of $\widetilde{c}(j)$ ($\widetilde{r}(j)$) increase/decrease, it means that the values of $\widetilde{c}(j)$ ($\widetilde{r}(j)$) take values in larger/smaller intervals $U_{\widetilde{s}_c}$ ($U_{\widetilde{s}_r}$).
6.2.2. Results and conclusions
To analyze the behavior of the error function and the impact of the values of $\widetilde{c}(j)$ and $\widetilde{r}(j)$ in the disturbance of the linear relation between interval-valued variables, we considered several possible intervals $U_{\widetilde{s}_r}$ where the values of $\widetilde{r}(j)$ are generated. With each interval $U_{\widetilde{s}_r}$, several intervals $U_{\widetilde{s}_c}$ were associated, for the generation of the values of $\widetilde{c}(j)$. The disturbance of the linear relations is measured in these situations with the coefficients of determination $\Omega$ (in both DSD Models) and $\widetilde{\Omega}$ (only in the DSD Model II). Therefore, it is important to understand: 1) how much it is necessary to disturb the model to obtain a weak/strong linear relation between intervals, i.e., what is the size of the values that compose the error function for which the coefficients of determination $\Omega$ and $\widetilde{\Omega}$ evaluate the linear relation as weak or strong; and 2) whether the pattern of variability in the variables influences the values of the coefficients of determination $\Omega$ and $\widetilde{\Omega}$.
The tables with the mean values of the coefficients of determination and the mean values of $RMSE_M$ that result from 1000 replications of each situation, according to the factorial design previously described, may be found in Appendix C.1. In Tables C.1 to C.4 we present the mean values of the goodness-of-fit measures ($\Omega$ and $RMSE_M$) when the interval-valued variable $X$ presents different patterns of variability and the interval-valued variable $Y$ is generated by the DSD Model I, considering the selection of parameters $a = 2$; $b = 1$; $v = -1$. In Tables C.5 to C.12 we present the results of the studies where the DSD Model II is applied. For this case, we consider the two coefficients $\Omega$ and $\widetilde{\Omega}$.⁴
In this simulation study, as in the following, the variability of the explicative variable may be controlled, but the variability of the response variables is influenced by the variability of the explicative variable, the parameters of the model and the disturbance. Since in this simulation study the selection of parameters for each model is the same in all situations, the pattern of variability (low and similar; high and similar; mixed) in the interval-valued variable $Y$ reflects the pattern of variability in the explicative variable $X$, that is used to generate it. However, the variability in the response variable may be almost only related with the variability in the explicative variable when the disturbance is low. High disturbances may influence the variability in the response variables.
⁴ To calculate the mean values of $\widetilde{\Omega}$, only the samples for which the values estimated for the independent parameter were non degenerate intervals were considered.
6.2.2.1. Study of the behavior of the error function
Concerning the influence of the values $\widetilde{c}(j)$ and $\widetilde{r}(j)$ that compose the error function that disturbs the model, we may observe that the linearity between the data is more affected (in absolute terms) by the values of $\widetilde{c}(j)$ than by the values of $\widetilde{r}(j)$. In all situations, for both models, when we generate the values of $\widetilde{c}(j)$ in $U(-\widetilde{s}_c, \widetilde{s}_c)$, the increase of the values of $\widetilde{r}(j)$ affects the linear relation between the variables less than when we consider the values of $\widetilde{r}(j)$ generated in $U(-\widetilde{s}_r, \widetilde{s}_r)$ with $\widetilde{s}_r = \widetilde{s}_c$ and increase the values of $\widetilde{c}(j)$. This behavior is observed when the linearity is measured with $\Omega$ and with $\widetilde{\Omega}$. Moreover, the coefficient $\widetilde{\Omega}$ decreases faster than the coefficient $\Omega$ when the values of $\widetilde{c}(j)$ and $\widetilde{r}(j)$ increase. The measure $RMSE_M$, that evaluates the "closeness" between the observed and predicted intervals, presents the expected behavior, i.e., the values of $RMSE_M$ increase when the coefficient of determination approaches zero. Figures 6.1 to 6.3 illustrate this behavior relatively to the situation of Tables C.5 and C.6, where $X$ has low and similar variability and the DSD Model II is applied.
Figure 6.1: Mean values of $\Omega$ and the respective standard deviation for different error functions.
In this simulation study, a limit for the selection of the values of $\widetilde{r}(j)$ was imposed. This limit, that prevents the analysis of the behavior of the models when the values of $\widetilde{r}(j)$ are higher than $m_r$ or smaller than $-m_r$, is defined according to the value of the lowest half range $r_{Y^*(j)}$, $j \in \{1, \ldots, m\}$. As a consequence, when we have variables whose intervals have a mixture of different half ranges (Tables C.3 and C.4; Tables C.9 to C.12), the disturbance of the perfect linear regression that we use to generate the symbolic data tables (as previously described) is almost only affected by the values of $\widetilde{c}(j)$. The values of $|\widetilde{r}(j)|$ only take higher values when the variability in the response variables is similar and higher.
Figure 6.2: Mean values of Ω̃ and the respective standard deviation for different error functions, for c̃ = 0 or c̃ ∈ U(−1, 1), U(−2, 2), U(−5, 5) and r̃ = 0 or r̃ ∈ U(−1, 1), U(−2, 2), U(−5, 5).
Figure 6.3: Mean values of RMSE_M and the respective standard deviation for different error functions, for c̃ = 0 or c̃ ∈ U(−1, 1), U(−2, 2), U(−5, 5) and r̃ = 0 or r̃ ∈ U(−1, 1), U(−2, 2), U(−5, 5).
To evaluate the influence of the values c̃(j) and r̃(j), that compose the error function e_j, on the behavior of the goodness-of-fit measures, we will analyze these measures when the perfect linear relations, built considering different patterns of variability in the variables, are disturbed with similar error functions. Table 6.1 summarizes the results presented in the tables of Appendix C.1 when the error function is generated considering c̃(j) ∈ U(−5, 5) and r̃(j) ∈ U(−2, 2).

We may observe that similar results for RMSE_M and different values for Ω and Ω̃ are obtained. Under these conditions, it is when the explicative variables have high and similar or mixed variability (which includes intervals with high and small half ranges) that the values of Ω are closer to one. In the situations where DSD Model II is applied, the values of Ω̃ are closer to zero when the variance of the half ranges of the intervals that are the observations of the explicative variables is small (variables whose pattern of variability is low and similar or high and similar).
Table 6.1: Mean values of Ω, Ω̃ and RMSE_M in the situations analyzed in simulation study I, when the error function is generated considering c̃(j) ∈ U(−5, 5) and r̃(j) ∈ U(−2, 2).

Model          Pattern of variability         n     Ω (s)             Ω̃ (s)             RMSE_M (s)
DSD Model I    Low and similar variability    30    0.6162 (0.0455)   -                 2.8525 (0.2491)
                                              100   0.5977 (0.0234)   -                 2.9353 (0.1302)
               High and similar variability   30    0.9878 (0.0020)   -                 2.8602 (0.2429)
                                              100   0.9873 (0.0011)   -                 2.9404 (0.1321)
               Mixed variability I            30    0.9604 (0.0063)   -                 2.8456 (0.2357)
                                              100   0.9637 (0.0031)   -                 2.9328 (0.1307)
               Mixed variability II           30    0.9537 (0.0077)   -                 2.8495 (0.2476)
                                              100   0.9423 (0.0048)   -                 2.9326 (0.1263)
DSD Model II   Low and similar variability    30    0.6841 (0.0413)   0.0924 (0.0558)   2.8408 (0.2495)
                                              100   0.6656 (0.0211)   0.0741 (0.0277)   2.9331 (0.1272)
               High and similar variability   30    0.9883 (0.0019)   0.0829 (0.0448)   2.8615 (0.2397)
                                              100   0.9879 (0.0010)   0.0702 (0.0250)   2.9325 (0.1255)
               Mixed variability I            30    0.9633 (0.0062)   0.6502 (0.0450)   2.8425 (0.2480)
                                              100   0.9661 (0.0028)   0.7131 (0.0189)   2.9331 (0.1246)
               Mixed variability II           30    0.9570 (0.0069)   0.7641 (0.0380)   2.8529 (0.2374)
                                              100   0.9472 (0.0045)   0.7286 (0.0258)   2.9252 (0.1289)
Moreover, Ω̃ is larger when the intervals associated with the explicative variables present a large variance of the half ranges (variables with mixed variability). For both situations where the pattern of variability in the explicative variables is mixed, the behavior of the coefficients of determination is analogous.
This latter behavior, which is also observed when the values of c̃(j) and r̃(j) are generated in other intervals, allows concluding that under analogous conditions (of variability in the explicative variables and similar disturbance of the models), the coefficient of determination Ω has a similar behavior when the DSD Model I and the DSD Model II are applied. In the cases where Ω̃ may be used, this measure presents a very different behavior depending on whether the variance of the half ranges of the intervals of the explicative variables is high or low. The values of Ω̃ are closer to the values of Ω when the intervals have a mixture of variabilities and consequently a larger variance of the half ranges. Observing Tables C.9 to C.12 we may verify that the values of Ω̃ and Ω grow further apart as the values of c̃(j) increase. Analyzing Tables C.5 to C.8, which present the results when the patterns of variability of the explicative variables are low/high and similar, we conclude that the values of Ω̃ are similar when the disturbance is similar in absolute terms. On the other hand, the values of Ω are similar when the disturbance takes into account the size of the half ranges of the intervals I_{Y∗}. For this reason, the values of c̃(j) and r̃(j) considered in Tables C.1 and C.2; Tables C.5, C.6 and C.7, C.8 are quite different.
6.2.2.2. Study of the behavior of the coefficients of determination Ω and Ω̃
The expressions of the coefficients of determination Ω and Ω̃ help us understand the behaviors reported above. These measures are deduced from the DSD Models (similarly to what happens in the classical linear regression model with the coefficient of determination R²). The measure Ω is computed using the symbolic mean of the response variable, which is a real value (a degenerate interval, i.e. an interval with half range zero) and not an average interval. Because of this, in the expression of Ω, see Expression (6.2), the numerator (and denominator) is computed by adding the squared deviations of the centers of the predicted (observed) intervals from the symbolic mean to a third of the sum of the squares of the half ranges of the predicted (observed) intervals. As the values that r̃(j) takes may not be larger than mr or lower than −mr, the disturbance of the half ranges of the intervals of Y∗ is limited. Consequently, the half ranges of Y and the estimated half ranges r_{Ŷ(j)} are in general similar. These are the reasons that explain why, when the intervals associated with the variable Y have high half ranges, the measure Ω is quite stable, even with the weight of the square of the deviations of the centers being three times larger than the weight of the square of the half ranges. Even when the observed and estimated centers of the intervals are very different, if the half ranges of the respective intervals are high we will obtain values of Ω close to one. Moreover, higher differences between the sum of the squares of r_{Y(j)} and the sum of the squares of r_{Ŷ(j)} are only observed when the linear relation is between variables with high and similar variability and the disturbance is made with error functions composed of r̃(j) that may take high values.
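The description above translates directly into a computation. Below is a minimal sketch of Ω for interval-valued data, assuming Expression (6.2) has exactly the structure just described (squared deviations of the centers from the symbolic mean of Y plus one third of the sum of the squared half ranges, with the predicted intervals in the numerator and the observed ones in the denominator); the function and argument names are illustrative only:

    import numpy as np

    def omega_interval(c_obs, r_obs, c_pred, r_pred):
        # Coefficient of determination Omega for interval-valued variables.
        # The symbolic mean of Y is a real value (a degenerate interval with
        # half range zero), so each term splits into a centers part and one
        # third of a half-ranges part.
        y_bar = np.mean(c_obs)  # symbolic mean of the response variable
        num = np.sum((c_pred - y_bar) ** 2) + np.sum(r_pred ** 2) / 3.0
        den = np.sum((c_obs - y_bar) ** 2) + np.sum(r_obs ** 2) / 3.0
        return num / den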
Considering this latter analysis, we may conclude that the disturbance of the centers must be sufficiently large to be “detected” by Ω. As we may observe in the tables of Appendix C.1, independently of the values that r̃(j) takes, the values of Ω are only lower than 0.5 when the values of s_c̃ are close to or greater than the maximum of the half ranges of the intervals $\left[\underline{I}_{Y^*(j)}, \overline{I}_{Y^*(j)}\right]$, with j ∈ {1, . . . , m}. In such cases, there are units j where
I(j) ⊄ I∗(j) and I∗(j) ⊄ I(j), i.e. the quantile functions that represent these intervals do not intersect in [0, 1] (see Appendix B). Because of this observed behavior, when Ω is used to measure the linearity between interval-valued variables, the values of c̃(j) that are generated take into account the size of the half ranges of the intervals I_{Y∗}.
From the DSD Model II, it was possible to deduce a goodness-of-fit measure with respect to the barycentric interval. However, as we observed in Section 5.2.2, it can only be applied when the interval-valued variables are predicted with the DSD Model II and the estimated independent parameter is not a degenerate interval. In Expression (6.3) of Ω̃, the numerator (and denominator) already uses the variance of the centers and of the half ranges of the predicted (observed) intervals. Analyzing the expression of Ω̃, it is easy to understand that, under the same conditions of disturbance, the behavior of this measure is similar when the variability of the variables is low and similar or high and similar (according to the factorial design previously described). This occurs because in both cases the variance of the half ranges of the intervals is analogous (see Tables C.5 to C.8). With these patterns of variability in the explicative variables, the values of the variance of the half ranges in the expression of Ω̃ will always be smaller (and even smaller when the half ranges are low) than the values of the sum of the squares of the half ranges in the expression of Ω. Therefore, the value of the variance is not sufficiently high to “overcome” the effect of the disturbance of the centers, as happens with Ω. It is important to note that when the variability is high and similar, the half ranges may be higher. When the variables have mixed variability, the variance of the half ranges is larger. Consequently, as in this case the disturbance of the half ranges is always small (due to the limitations on the perturbations of the half ranges), the disturbance of the centers must be sufficiently large to be “detected” by Ω̃.
6.2.2.3. Comparison of the observed and predicted intervals
To help evaluate the quality of the measures Ω and Ω̃, we will analyze particular simulated situations and compare the predicted and observed intervals of the response variable. In Figures 6.4 to 6.7, one of the 1000 cases is represented for each of the situations presented in Table 6.1. We may observe that, although the values of Ω̃ are closer to zero when the pattern of variability is low and similar or high and similar, the observed and predicted intervals are close. According to the value of Ω, the observed and predicted intervals are closer when the variability of the variables is high and similar.
Figure 6.4: Observed and predicted intervals with low and similar variability when c̃ ∈ U(−5, 5) and r̃ ∈ U(−2, 2) (Ω = 0.7138 and Ω̃ = 0.0993).
Figure 6.5: Observed and predicted intervals with high and similar variability when c̃ ∈ U(−5, 5) and r̃ ∈ U(−2, 2) (Ω = 0.9898 and Ω̃ = 0.0730).
If the disturbance is smaller (see Figures 6.8 and 6.9), the values of Ω and Ω̃ are closer to one and the predicted intervals are closer to the observed intervals. In Figure 6.8, observed and predicted intervals with low and similar variability are represented. The values c̃(j) and r̃(j), in the error function, are generated in U(−1, 1) and U(−0.5, 0.5), respectively. In this case, the value of Ω is closer to one than the value obtained in the situation represented in Figure 6.4. This happens because, as Ω takes into account the size of the half ranges of the intervals, to obtain similar values of this coefficient, the disturbance has to be smaller when the pattern of variability is similar and low than when it is similar and high (compare Figures 6.4 and 6.8, and Figures 6.5 and 6.9).
Figure 6.6: Observed and predicted intervals with mixed variability I when c̃ ∈ U(−5, 5) and r̃ ∈ U(−2, 2) (Ω = 0.9677 and Ω̃ = 0.6728).
Figure 6.7: Observed and predicted intervals with mixed variability II when c̃ ∈ U(−5, 5) and r̃ ∈ U(−2, 2) (Ω = 0.9571 and Ω̃ = 0.7914).
When the values of both coefficients of determination Ω and Ω̃ are closer to zero, the observed and predicted intervals are, as expected, quite different. Figures 6.10 and 6.11 illustrate situations where the linear relation between the intervals is weak.

In conclusion, based on this first simulation study, we verify that under certain conditions the coefficient of determination Ω̃ presents values close to zero while the value of Ω is close to one. When this discrepancy between the values of the coefficients of determination occurred, comparing the predicted intervals of Y with the initial observed intervals, we observed that they are not very different. This behavior happens when all intervals of X have similar half ranges and consequently all intervals of Y also have similar half ranges (but different from the half ranges of X).
Figure 6.8: Observed and predicted intervals with low and similar variability when c̃ ∈ U(−1, 1) and r̃ ∈ U(−0.5, 0.5) (Ω = 0.9866 and Ω̃ = 0.6807).
Figure 6.9: Observed and predicted intervals with high and similar variability when c̃ ∈ U(−1, 1) and r̃ ∈ U(−0.5, 0.5) (Ω = 0.9996 and Ω̃ = 0.6943).
In these situations, to evaluate the linearity between interval-valued variables, it is insufficient to consider only the coefficient of determination Ω̃. It is when the variables have high and similar variability that the discrepancy between the values of Ω and Ω̃ is larger.

When the variables have mixed variability (large variance of half ranges and similar centers, or large variance of half ranges and centers), the coefficients of determination are generally in agreement, but Ω̃ is more sensitive to the disturbances. Values of Ω and Ω̃ closer to zero (one) always mean weak (strong) linear relations between the explicative and response variables. Situations where the variables have mixed variability seem to be the most common in real applications.
Figure 6.10: Observed and predicted intervals with high and similar variability when c̃ ∈ U(−60, 60) and r̃ ∈ U(−mr, mr) (Ω = 0.3663 and Ω̃ = 0.0018).
Figure 6.11: Observed and predicted intervals with mixed variability II when c̃ ∈ U(−40, 40) and r̃ ∈ U(−mr, mr) (Ω = 0.2199 and Ω̃ = 0.0262).
It is important to underline that the DSD Models relate the intervals of the explicative variables with the intervals of the response variables. These models do not fit the centers and half ranges that compose the intervals separately, with classical linear regression models. Consequently, the coefficients of determination measure the goodness-of-fit of the intervals and not, separately, the goodness-of-fit of their centers and half ranges. When the pattern of variability in the explicative variable is high and similar, a bad fit of the centers of the intervals does not necessarily mean a bad fit of the intervals.
As Ω is the measure that may be applied to both DSD Models and both types of
variables, i.e., interval-valued and histogram-valued variables, we will consider this measure
to evaluate the level of linearity in the next simulation studies performed in this work.
6.3. Simulation study II
In the second simulation study, the goal is to evaluate the performance of the DSD
Models applied to interval/histogram-valued variables under different conditions and considering two levels of linearity, evaluated by the measure Ω. This performance will be assessed by analyzing the behavior of the parameters’ estimation.
In this section we will analyze one simulation study with interval-valued variables and another with histogram-valued variables. In each study we will consider different cases (different selections for the parameters) and, in each case, several different situations (for each selection of parameters we will vary, for example, the sample size, the patterns of variability or distributions of the explicative variables, and the error level).
For each situation, 1000 data tables were generated. For each symbolic data table studied, we computed the estimated parameters for the DSD Models and the respective goodness-of-fit measures. In addition to the goodness-of-fit measures Ω and RMSE_M, considered in simulation study I, we computed the lower and the upper bound Root Mean Square Errors,
$$RMSE_L = \sqrt{\frac{1}{m}\sum_{j=1}^{m}\left(\underline{I}_{\widehat{Y}(j)} - \underline{I}_{Y(j)}\right)^2} \qquad \text{and} \qquad RMSE_U = \sqrt{\frac{1}{m}\sum_{j=1}^{m}\left(\overline{I}_{\widehat{Y}(j)} - \overline{I}_{Y(j)}\right)^2},$$

respectively, that Lima Neto and De Carvalho (2008, 2010) used to study the performance of their linear regression models between interval-valued variables (defined in Section 3.1.1.1).
These measures were adapted to histogram-valued variables and we define them as:

$$RMSE_L = \sqrt{\frac{1}{m}\sum_{j=1}^{m}\sum_{i=1}^{n}\left(\underline{I}_{\widehat{Y}(j)i} - \underline{I}_{Y(j)i}\right)^2 p_i} \qquad \text{and} \qquad RMSE_U = \sqrt{\frac{1}{m}\sum_{j=1}^{m}\sum_{i=1}^{n}\left(\overline{I}_{\widehat{Y}(j)i} - \overline{I}_{Y(j)i}\right)^2 p_i},$$

with $\underline{I}_{Y(j)i}$, $\overline{I}_{Y(j)i}$ and $\underline{I}_{\widehat{Y}(j)i}$, $\overline{I}_{\widehat{Y}(j)i}$ the bounds of the subintervals $i \in \{1, \ldots, n\}$ of the observed and predicted histograms, for each unit $j$.
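As a sketch of how these four bound-based measures can be computed directly from the formulas above (array names and layout are illustrative: intervals are stored as vectors of bounds, histograms as m × n matrices of subinterval bounds with a common weight vector p):

    import numpy as np

    def rmse_bounds_interval(low_obs, up_obs, low_pred, up_pred):
        # RMSE_L and RMSE_U for interval-valued variables: root mean squared
        # error of the lower and of the upper bounds over the m units.
        rmse_l = np.sqrt(np.mean((low_pred - low_obs) ** 2))
        rmse_u = np.sqrt(np.mean((up_pred - up_obs) ** 2))
        return rmse_l, rmse_u

    def rmse_bounds_histogram(low_obs, up_obs, low_pred, up_pred, p):
        # Adaptation to histogram-valued variables: each bound array has
        # shape (m, n), one row per unit j and one column per subinterval i;
        # p holds the weights p_i of the n subintervals.
        rmse_l = np.sqrt(np.mean(np.sum((low_pred - low_obs) ** 2 * p, axis=1)))
        rmse_u = np.sqrt(np.mean(np.sum((up_pred - up_obs) ** 2 * p, axis=1)))
        return rmse_l, rmse_u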
As we considered 1000 replications for each situation, the results presented in the tables of Appendices C.2 and C.3, i.e., the values of the estimated parameters and the values of the goodness-of-fit measures, are the means of the obtained values together with the respective standard deviations (represented by s).
With the goal of evaluating the empirical behavior of the parameters, we calculated the mean square error (MSE)

$$MSE = \frac{1}{1000}\sum_{u=1}^{1000}\left(\theta - \theta_u^{*}\right)^2$$

with θ corresponding to the real parameter of the model in each case, and with θ*_u corresponding to the respective estimated parameter in each replication u ∈ {1, . . . , 1000}. For DSD Model II, as the independent parameter is an interval/distribution, Ψ⁻¹_Constant(t), we calculated, for each case, the mean of the squared Mallows distances between Ψ⁻¹_Constant(t) and the respective independent parameter estimates, i.e.

$$D_M^2\left(\Psi^{-1}_{Constant}(t), \Psi^{*-1}_{Constant}(t)\right) = \frac{1}{1000}\sum_{u=1}^{1000} D_M^2\left(\Psi^{-1}_{Constant}(t), \Psi^{*-1}_{Constant_u}(t)\right).$$
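Both quantities are straightforward to compute. In the following minimal sketch, the squared Mallows distance between two quantile functions is approximated on a regular grid of t values (the grid approximation and all names are illustrative, not part of the thesis):

    import numpy as np

    def mse_parameter(theta, estimates):
        # Empirical MSE of a real parameter: theta is the true value and
        # estimates the vector with the estimate from each replication u.
        return np.mean((theta - np.asarray(estimates)) ** 2)

    def mallows_sq(psi1, psi2, n_grid=1000):
        # Squared Mallows distance between two quantile functions psi1 and
        # psi2 (callables on [0, 1]), approximated by the midpoint rule.
        t = (np.arange(n_grid) + 0.5) / n_grid
        return np.mean((psi1(t) - psi2(t)) ** 2)

    def mean_mallows_sq(psi_true, psi_estimates):
        # Mean squared Mallows distance between the true independent
        # parameter and its estimate in each replication (DSD Model II).
        return np.mean([mallows_sq(psi_true, psi_u) for psi_u in psi_estimates])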
6.3.1. Simulation study with interval-valued variables
6.3.1.1. Description of the simulation study
For each case (three different selections for the parameters), we want to analyze the
performance of the DSD Models considering three different patterns of variability in the
explicative variables and two levels of linearity between the interval-valued variables. For
all situations, the observations of the interval-valued variables are generated from microdata
with Uniform distribution.
In this study a full factorial design was employed, with the following factors:
• Sample size: m = 10; 30; 100; 250.
• Number of explicative interval-valued variables: p = 1 and p = 3.
• Parameters of the DSD Models.
– DSD Model I
◦ For p = 1 :
i) a = 2; b = 1; v = −1 (a and b are close);
ii) a = 2; b = 8; v = 3 (a is lower than b);
iii) a = 6; b = 0; v = 2 (a is larger than b).
◦ For p = 3 : a1 = 2; b1 = 1; a2 = 0.5; b2 = 3; a3 = 1.5; b3 = 1; v = −1.
– DSD Model II
◦ For p = 1 :
i) a = 2; b = 1; Ψ⁻¹_Constant(t) = −1 + 1(2t − 1), t ∈ [0, 1];
ii) a = 2; b = 8; Ψ⁻¹_Constant(t) = 3 + 2(2t − 1), t ∈ [0, 1];
iii) a = 6; b = 0; Ψ⁻¹_Constant(t) = 2 + 1(2t − 1), t ∈ [0, 1].
◦ For p = 3 : a1 = 2; b1 = 1; a2 = 0.5; b2 = 3; a3 = 1.5; b3 = 1; Ψ⁻¹_Constant(t) = −1 + 1(2t − 1), t ∈ [0, 1].
• Patterns of variability in the explicative variables Xk . The real values xjk (w), with
w ∈ {1, . . . , 5000} in each interval Xk (j), are uniformly distributed.
i) Low and similar variability - when the intervals associated with variable Xk have similar centers and similar small half ranges (a generation sketch for this pattern is given after this list).
xjk (w) ∼ U(δ1 (j), δ2 (j)) are randomly generated considering for each
j ∈ {1, . . . , m} and k ∈ {1, 2, 3} :
◦ k = 1 : δ1 (j) ∼ U (17, 19) and δ2 (j) ∼ U (21, 23) (centers around 20 and
half ranges around 2);
◦ k = 2 : δ1 (j) ∼ U (13.5, 15) and δ2 (j) ∼ U (15, 16.5) (centers around 15
and half ranges around 0.5);
◦ k = 3 : δ1 (j) ∼ U(8, 10) and δ2 (j) ∼ U(10, 12) (centers around 10 and half
ranges around 1).
ii) High and similar variability - when the intervals associated with variable Xk have similar centers and similar high half ranges.
xjk (w) ∼ U(δ3 (j), δ4 (j)) are randomly generated considering for each
j ∈ {1, . . . , m} and k ∈ {1, 2, 3} :
216
CHAPTER 6. SIMULATION STUDIES
◦ k = 1 : δ3 (j) ∼ U(4, 6) and δ4 (j) ∼ U (34, 36) (centers around 20 and half
ranges around 15);
◦ k = 2 : δ3 (j) ∼ U(4, 6) and δ4 (j) ∼ U (24, 26) (centers around 15 and half
ranges around 10);
◦ k = 3 : δ3 (j) ∼ U(1, 3) and δ4 (j) ∼ U(17, 19) (centers around 10 and half
ranges around 8).
iii) Mixed variability - when the intervals associated with the variable X have diverse
half ranges and centers. In this case, the intervals of Xk have centers between
8 and 22 and half ranges between 1 and 15.
• Two error levels are considered. To establish these error levels, the conclusions of simulation study I were taken into account. For each j ∈ {1, . . . , m}, the values of c̃(j) and r̃(j) are randomly generated as follows:
i) Level I: c̃(j) ∼ 0.1 × U(−Mr, Mr) and r̃(j) ∼ 0.1 × U(−mr, mr);
ii) Level II: c̃(j) ∼ U(−Mr, Mr) and r̃(j) ∼ U(−mr, mr)
where Mr = max_{j∈{1,...,m}} {r_{Y∗(j)}} and mr = min_{j∈{1,...,m}} {r_{Y∗(j)}}.
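As referenced above, a minimal generation sketch for one explicative interval-valued variable under the low and similar pattern (k = 1) could read as follows. It assumes, as one illustration of the process of Section 6.1, that each observed interval is taken as the range of its 5000 microdata values; names and the seed are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_low_similar(m, n_micro=5000):
        # 'Low and similar' variability for k = 1: microdata of unit j are
        # uniform on (delta1(j), delta2(j)) with delta1 ~ U(17, 19) and
        # delta2 ~ U(21, 23), giving centers around 20 and half ranges
        # around 2; the observed interval is the range of the microdata.
        lower, upper = np.empty(m), np.empty(m)
        for j in range(m):
            d1, d2 = rng.uniform(17, 19), rng.uniform(21, 23)
            x = rng.uniform(d1, d2, n_micro)  # microdata of unit j
            lower[j], upper[j] = x.min(), x.max()
        return lower, upper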
6.3.1.2. Results and conclusions
The tables with the results of the study may be found in Appendix C.2. In Tables C.13 to C.15, the results obtained for the parameters estimated by the DSD Model I and the goodness-of-fit measures are shown, with p = 1, for the three selected values of a, b and v. The results obtained when the DSD Model II is applied, considering the conditions defined in the respective factorial design, are presented in Tables C.18 to C.20. In Tables C.16, C.17 and Tables C.21, C.22, similar results are presented for the considered cases where p = 3 and the DSD Models I and II are applied.
The two error levels considered in the factorial design for this study were established according to the conclusions of simulation study I regarding the measure Ω. Therefore, the models that are only slightly disturbed (error level I) present values of Ω close to one. On the other hand, when the error function applied to the model causes a high disturbance of the linear relation (error level II), the values of Ω are more distant from one and closer to zero. It is important to bear in mind that the level of the error takes into account the variability within the intervals of Y∗.
In the simulation studies, only the variability of the explicative variable is controlled. The variability of the response variable is related to the variability of the explicative variables and is influenced by the parameters of the model and by the disturbance. Comparing the three selected cases with p = 1, when the disturbance is low, for each pattern of variability of the explicative variables, the lowest variability within the intervals of Y is obtained when the intervals of Y are generated by the DSD Model I with a = 2; b = 1; v = −1, and the highest variability within the intervals is obtained for the parameters a = 2; b = 8; v = 3. Under the conditions considered in this study, the variance of the half ranges of the intervals of Y reflects the variance of the half ranges of the intervals of the explicative variable. For all patterns of variability in the explicative variables, the values of the Root Mean Square Errors increase with the variability within the intervals of Y. Figure 6.12 illustrates this behavior in all cases when DSD Model I, with one explicative variable, is applied.
Figure 6.12: Representation of the RMSE_M in all cases when the DSD Model I, with one explicative variable, is applied (mean values of RMSE_M for a = 2; b = 8; v = 3, a = 6; b = 0; v = 2 and a = 2; b = 1; v = −1, with m = 10, 30, 100, 250, under low and similar, high and similar, and mixed variability).
The mean values of Ω are consistent with the respective mean values of the measures RMSE_M, RMSE_L and RMSE_U. In general, as expected, in each situation and for the respective pattern of variability in the explicative variables, the highest values of Ω correspond to the lowest values of RMSE_M. Small values of the measures RMSE_M, RMSE_L and RMSE_U mean that the observed and predicted intervals are close, and in this case we expect values of Ω close to one. The behavior of the goodness-of-fit measures is analogous in both methods.
Hereafter, it is important to analyze the behavior of the parameters estimated taking into
account the respective values of Ω, in different situations. Analyzing the obtained results,
presented in Appendix C.2, we can see that the behavior of the parameters’ estimation is
independent of the model applied, of the number of explicative variables and of the parameters selected for the model. In each studied case (each selection of parameters of the DSD
Models), three patterns of variability in the explicative variables Xk were considered, each
of them with two levels of linearity, and different behaviors were observed.
When we consider an error function of level II, the behavior of the estimated parameters is more unstable and their means are more distant from the original values. Consequently, the values of the MSE are not close to zero. This is not surprising because other models may exist that adjust the interval data better. The behavior of the parameters is more unstable when the number of observations is low and whenever the variability in the explicative variables is high and similar (see tables in Appendix C.2).

When the linear relation between the variables is strong (error functions of level I), the estimated parameters are close to the initial parameter values, and the closeness becomes more obvious as the number of units in the sample increases. The values of the MSE associated with each parameter decrease and approach zero as the number of observations increases. The values of the MSE and the values of the standard deviation associated with the mean value of the parameters are more distant from zero when the half ranges of the intervals of Y are larger (which occurs when the variability of X is similar and high and when the values of the parameters are far apart, see Tables C.14 and C.19). The justification may be that, in this situation, the disturbance is “larger” in absolute terms.
Comparing the cases where the DSD Model I and the DSD Model II are applied with the same selection of parameters, under the corresponding conditions, the behavior of the estimated parameters and the behavior of the goodness-of-fit measures are similar (compare Tables C.13 and C.18; Tables C.14 and C.19; Tables C.15 and C.20; Tables C.16, C.17 and C.21, C.22).

The boxplots in Figures 6.13, 6.14, 6.15 and 6.16 illustrate the behavior of the parameters a and b, described above, for the situations where p = 1 and the original values of the parameters are a = 2; b = 8; v = 3 in DSD Model I (Table C.14) and a = 2; b = 8; Ψ⁻¹_Constant(t) = 3 + 2(2t − 1) in DSD Model II (Table C.19).
Figure 6.13: Boxplots of the values estimated for parameter a, under different conditions (m = 10, 30, 100, 250; low and similar, high and similar, and mixed variability), when DSD Model I (a = 2; b = 8; v = 3) is applied to interval-valued variables and the level I error is considered.
Figure 6.14: Boxplots of the values estimated for parameter a, under different conditions, when DSD Model II (a = 2; b = 8; Ψ⁻¹_Constant(t) = 3 + 2(2t − 1)) is applied to interval-valued variables and the level I error is considered.
Observing the tables in Appendix C.2, we may conclude that, even when the error function of level I is considered, the values/intervals estimated for the independent parameter may be quite different from the original ones. Consequently, the standard deviation associated with the mean value of the independent parameters and the respective values of the MSE are much higher than the values of the standard deviation and the MSE associated with the other parameters. The larger instability of the independent parameter may be explained by the fact that it is obtained from the other parameters (see Proposition 4.1 when the independent parameter is a real value, and Expressions (4.35) and (4.36) when the independent parameter is a quantile function, considering these results applied to interval-valued variables).
Figure 6.15: Boxplots of the values estimated for parameter b, under different conditions, when DSD Model I (a = 2; b = 8; v = 3) is applied to interval-valued variables and the level I error is considered.
Figure 6.16: Boxplots of the values estimated for parameter b, under different conditions, when DSD Model II (a = 2; b = 8; Ψ⁻¹_Constant(t) = 3 + 2(2t − 1)) is applied to interval-valued variables and the level I error is considered.
The behaviors observed in this study corroborate the conclusions reached in simulation study I about the measure Ω and its relation with the error levels. Moreover, with this simulation study, it was possible to confirm the empirical consistency of the estimated parameters. Estimated parameters close to the original parameters, with associated values of MSE that get closer to zero as the number of observations increases, is the expected behavior when the linear relation between the interval-valued variables is strong, i.e. when the values of Ω are close to one.
6.3.2. Simulation study with histogram-valued variables
Similarly to the simulation study for interval-valued variables, in this section we carry out a simulation study to analyze the performance of the DSD Models with histogram-valued variables. We consider, for each selection of parameters in the models, four different distributions for the explicative variables and two levels of linearity between the histogram-valued variables. Furthermore, we also want to analyze whether the coefficient of determination Ω, deduced from the DSD Models, is an adequate measure to evaluate the quality of the linear relation between histogram-valued variables.

As in the simulation study with interval-valued variables, in this study the level of disturbance of the linear relation takes into account the variability of the values in the distributions of Y∗.
6.3.2.1. Description of the simulation study
The first step is to generate the observations of the histogram-valued variables Xk, k ∈ {1, . . . , p}, and Y, where Y is the variable to be modeled from the Xk by the linear relations, according to the process described in Section 6.1. Next, the parameters are estimated by the DSD Models and the adequate goodness-of-fit measures are computed, considering symbolic simulated data tables covering the factors indicated below. For each considered situation, 1000 data tables were generated. From these results it is possible to analyze the behavior of the models and of the goodness-of-fit measures, and to draw some meaningful conclusions.
In this study a full factorial design was employed, with the following factors:
• Sample size: m = 10; 30; 100; 250.
• Parameters of the DSD Models.
– DSD Model I
Number of explicative histogram-valued variables: p = 1 and p = 3.
◦ For p = 1 :
i) a = 2; b = 1; v = −1 (a and b are close);
ii) a = 2; b = 8; v = 3 (a is lower than b);
iii) a = 6; b = 0; v = 2 (a is larger than b).
222
CHAPTER 6. SIMULATION STUDIES
◦ For p = 3 :
i) a1 = 2; b1 = 1; a2 = 0.5; b2 = 3; a3 = 1.5; b3 = 1; v = −1 (the values of
ak and bk , k ∈ {1, 2, 3} are close);
ii) a1 = 6; b1 = 0; a2 = 2; b2 = 8; a3 = 10; b3 = 5; v = 3 (the values of ak
and bk, k ∈ {1, 2, 3} are apart).
– DSD Model II
Number of explicative histogram-valued variables: p = 1.
i) a = 8; b = 2; Ψ⁻¹_Constant(t) = Ψ⁻¹_U(t) is randomly generated with Uniform distribution, U(1, 5);
ii) a = 8; b = 2; Ψ⁻¹_Constant(t) = Ψ⁻¹_N(t) is randomly generated with Normal distribution, N(3, 1);
iii) a = 8; b = 2; Ψ⁻¹_Constant(t) = Ψ⁻¹_LogN(t) is randomly generated with Log-Normal distribution, lnN(0.95, 0.4);
iv) a = 8; b = 2; Ψ⁻¹_Constant(t) = Ψ⁻¹_Mix(t) is randomly generated with a Mixture of distributions.
• Distribution of the microdata (real values x_jk(w), with w ∈ {1, . . . , 5000}) that allows generating the histograms corresponding to each observation of the variables Xk, with k ∈ {1, 2, 3} (a construction sketch is given at the end of this section).
i) Uniform distribution: x_jk(w) ∼ U(δ1(j), δ2(j)) are randomly generated considering, for each j ∈ {1, . . . , m} and k ∈ {1, 2, 3}:
◦ k = 1 : δ1(j) ∼ U(−1.5, 0.5) and δ2(j) ∼ U(2.5, 4.5);
◦ k = 2 : δ1(j) ∼ U(3, 5) and δ2(j) ∼ U(11, 13);
◦ k = 3 : δ1(j) ∼ U(−5, −3) and δ2(j) ∼ U(9, 11).
ii) Normal distribution: x_jk(w) ∼ N(δ3(j), δ4(j)) are randomly generated considering, for each j ∈ {1, . . . , m} and k ∈ {1, 2, 3}:
◦ k = 1 : δ3(j) ∼ U(1, 2) and δ4(j) ∼ U(0.5, 1.5);
◦ k = 2 : δ3(j) ∼ U(7.5, 8.5) and δ4(j) ∼ U(1.5, 2.5);
◦ k = 3 : δ3(j) ∼ U(2.5, 3.5) and δ4(j) ∼ U(3.5, 4.5).
iii) Log-Normal distribution: x_jk(w) ∼ LogN(δ5(j), δ6(j)) are randomly generated considering, for each j ∈ {1, . . . , m} and k ∈ {1, 2, 3}:
◦ k = 1 : δ5(j) ∼ U(−0.5, 0.5) and δ6(j) ∼ U(0.5, 1);
◦ k = 2 : δ5(j) ∼ U(1.5, 2.5) and δ6(j) ∼ U(0, 0.5);
◦ k = 3 : δ5(j) ∼ U(0, 1) and δ6(j) ∼ U(0.75, 1.25).
iv) Mixture of distributions: x_jk(w) are randomly generated considering, for each j ∈ {1, . . . , m} and k ∈ {1, 2, 3}:
◦ k = 1 : randomly selected from U(−0.5, 3.5); N(1.5, 1); LogN(0, 0.75); −LogN(0, 0.75); χ²(1);
◦ k = 2 : randomly selected from U(4, 12); N(8, 2); LogN(2, 0.25); −LogN(2, 0.25); χ²(8);
◦ k = 3 : randomly selected from U(−4, 10); N(3, 4); LogN(0.5, 1); −LogN(0.5, 1); χ²(3).
With the exception of the situation of the mixture of distributions, and as the goal is to study the effect of the type of distribution of the explicative variables, for each k the histogram-valued variable Xk is built from different distributions, but those distributions have similar values for the mean and standard deviation.
• Error level: in the error function ej(t), the values of c̃(j)₁ and r̃(j)ᵢ are randomly selected.
i) The values of c̃(j)₁ are randomly generated in
◦ Uc1 = 0.1 × U(−ma_H, ma_H);
◦ Uc2 = 0.5 × U(−ma_H, ma_H);
◦ Uc3 = U(−ma_H, ma_H)
with $ma_H = \frac{1}{m}\sum_{j=1}^{m}\frac{1}{2}\left(\overline{I}_{Y^{*}(j)10} - \underline{I}_{Y^{*}(j)1}\right)$.
ii) The values of r̃(j)ᵢ are randomly generated in
◦ Ur1 = 0.1 × U(−mrᵢ, mrᵢ);
◦ Ur2 = 0.5 × U(−mrᵢ, mrᵢ);
◦ Ur3 = U(−mrᵢ, mrᵢ)
with, for each i ∈ {1, . . . , n}, mrᵢ = min_{j∈{1,...,m}} {r_{Y∗(j)i}}.
The level I error corresponds to the case where the linearity is slightly disturbed (considering the variability of the values in the distributions Y∗), which implies that the values of c̃(j)₁ and r̃(j)ᵢ in the error function ej(t) are randomly generated in Uc1 = 0.1 × U(−ma_H, ma_H) and Ur1 = 0.1 × U(−mrᵢ, mrᵢ), respectively. The level II error is considered when the values of c̃(j)₁ and r̃(j)ᵢ are randomly generated in intervals with more variability, Uc3 = U(−ma_H, ma_H) and Ur3 = U(−mrᵢ, mrᵢ), respectively. As we observed in simulation study I, considering Ω as the measure to quantify the goodness-of-fit, the disturbance of a linear relation between distributions must take into account the variability of the values in the distributions of Y∗. This is the reason that led us to consider the range of the distributions of Y∗ in the disturbance of the centers.

The options considered to compose the error levels are based on a generalization of the conditions considered in the simulation study II for interval-valued variables (see Section 6.2.1). Each quantile function Ψ⁻¹_{Y∗(j)}(t) is randomly disturbed by the error function ej(t) for different values of c̃(j)₁ and r̃(j)ᵢ, i ∈ {1, . . . , n} (see Expression (6.1)). r̃(j)ᵢ cannot take values in Ur = U(−s_r̃ᵢ, s_r̃ᵢ) with s_r̃ᵢ higher than the minimum value of the half range r_{Y∗(j)i}; otherwise, for this unit j and subinterval i, the half range r_{Y(j)i} would be negative. The disturbance induced in the centers of the subintervals of the histograms is difficult to control because the building of c̃(j)ᵢ with i ∈ {2, . . . , n} is recursive and depends on the values of c̃(j)₁ and r̃(j)ᵢ.
It is important to underline that in this simulation study, similarly to what occurred for the interval-valued variables, it was only possible to control the type of distributions of the observations of the explicative histogram-valued variables. This simulation does not allow selecting the distributions of the observations of the response variable. The distribution of the response variable depends on the distribution of the explicative variables Xk, on the selection of the parameters and on the disturbance applied to the histograms Y∗(j). In the studied cases, the variability of the values in the distribution of the variable Y, when p = 1, is higher when a = 2; b = 8; v = 3 and when a = 2; b = 8; Ψ⁻¹_Constant(t) = Ψ⁻¹_N(t), for the DSD Model I and the DSD Model II, respectively.

In addition to the selection of parameters considered in the factorial design, other choices were analyzed. However, as the results were similar, we chose to present only the cases enumerated above.
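As referenced in the description of the microdata above, the following sketch builds the histogram observation of one unit from its microdata. It assumes, as one possible reading of the construction of Section 6.1, histograms with n equal-frequency subintervals, so that every weight is pᵢ = 1/n; the names are illustrative:

    import numpy as np

    def histogram_from_microdata(x, n_sub=10):
        # Splits the microdata of one unit at its empirical quantiles so
        # that each of the n_sub subintervals carries weight 1/n_sub
        # (equal-frequency construction, assumed here for illustration).
        probs = np.linspace(0.0, 1.0, n_sub + 1)
        edges = np.quantile(x, probs)  # subinterval bounds
        lower, upper = edges[:-1], edges[1:]
        weights = np.full(n_sub, 1.0 / n_sub)
        return lower, upper, weights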
6.3.2.2. Results and conclusions
In Appendix C.3, three tables are presented, each containing the results obtained when applying the DSD Model I, with p = 1, under different conditions and considering the following three selections of the parameters: a = 2; b = 1; v = −1 (Table C.23); a = 2; b = 8; v = 3 (Table C.24); and a = 6; b = 0; v = 2 (Table C.25). In Tables C.26 to C.29, similar results are presented for the cases obtained applying the DSD Model I with p = 3. The results for the four situations studied with the DSD Model II are presented in Tables C.30 to C.41.
The main goals of this study are to verify whether the goodness-of-fit measures are in accordance with the considered error levels and to analyze the performance of the DSD Models applied to histogram-valued variables, evaluating the behavior of the parameters' estimation. For p = 1, it is also our goal to analyze how the symmetry/asymmetry of the distributions of the observations of the explicative histogram-valued variable affects the symmetry/asymmetry of the distributions of the observations of the response variable. Moreover, we intended to compare the predictions obtained when DSD Model I and DSD Model II are used.
6.3.2.3. Concerning the goodness-of-fit measures versus level of linearity
To evaluate whether the linear relation between distributions is strong or weak, we used the coefficient of determination Ω, deduced from the DSD Models. However, when we designed this simulation study with the goal of analyzing the performance of the DSD Models applied to histogram-valued variables, it was necessary to define the meaning of a high or low disturbance. As Ω was the selected measure to quantify the goodness-of-fit, the levels of disturbance were defined taking into account its expected behavior. We expect the behavior of Ω to be similar to the one verified when Ω was used to evaluate the linear relations between interval-valued variables (see simulation study I).
In all studied situations (see tables in Appendix C.3), when we consider a disturbance with error level I, Ω presents values close to one, i.e. it indicates that the linear relation between the distributions is strong. When a higher disturbance (error level II) is considered, the values of Ω are, as expected, more distant from one and approach zero. So, as in the case of interval-valued variables, the disturbance takes into account the variability of the values in the distributions of Y∗. This behavior shows that, in order to obtain a similar value of Ω, the linear relations when the explicative variables have low variability need to be less disturbed than when the explicative variables have higher variability.
The tables in Appendix C.3 only record values of the goodness-of-fit measures considering two levels of variability for the error function. However, to analyze more comprehensively the level of sensitivity of Ω to different kinds of error functions, we must consider some cases where the error functions affect more the half ranges of the subintervals of the histograms and other cases where the centers are more affected. Tables 6.2 and 6.3 illustrate the results that were obtained for samples with 10 and 100 observations, for DSD Models I and II with a = 2; b = 8; v = 3 and a = 2; b = 8; Ψ⁻¹_Constant(t) = Ψ⁻¹_N(t), respectively. The values of Ω were determined considering different error functions that use three levels of variability for the values of c̃(j)₁, Uc1, Uc2, Uc3, and, for each one, three levels of variability for r̃(j)ᵢ: Ur1, Ur2, Ur3, as defined in Section 6.3.2.1.
Table 6.2: Mean values of Ω considering different levels of linearity, when the distributions generating the observations of X are Uniform (Ω_U) and Normal (Ω_N).

                                      Ω_U (s)                                               Ω_N (s)
              m     Model    r̃(j)ᵢ ∼ Ur1      r̃(j)ᵢ ∼ Ur2      r̃(j)ᵢ ∼ Ur3      r̃(j)ᵢ ∼ Ur1      r̃(j)ᵢ ∼ Ur2      r̃(j)ᵢ ∼ Ur3
c̃(j)₁ ∼ Uc1  10    DSD I    0.9924 (0.0027)  0.9798 (0.0075)  0.9434 (0.0212)  0.9801 (0.0068)  0.9628 (0.0139)  0.9130 (0.0282)
                    DSD II   0.9911 (0.0031)  0.9775 (0.0083)  0.9369 (0.0230)  0.9798 (0.0070)  0.9600 (0.0149)  0.9044 (0.0331)
              100   DSD I    0.9907 (8.4E−4)  0.9809 (0.0020)  0.9521 (0.0053)  0.9762 (0.0021)  0.9620 (0.0039)  0.9201 (0.0080)
                    DSD II   0.9892 (0.0010)  0.9771 (0.0025)  0.9407 (0.0067)  0.9758 (0.0023)  0.9583 (0.0046)  0.9065 (0.0092)
c̃(j)₁ ∼ Uc2  10    DSD I    0.8508 (0.0453)  0.8410 (0.0480)  0.8144 (0.0551)  0.6804 (0.0753)  0.6713 (0.0769)  0.6487 (0.0798)
                    DSD II   0.8300 (0.0514)  0.8235 (0.0523)  0.7960 (0.0609)  0.6741 (0.0747)  0.6693 (0.0756)  0.6398 (0.0798)
              100   DSD I    0.8163 (0.0144)  0.8102 (0.0148)  0.7901 (0.0166)  0.6285 (0.0214)  0.6241 (0.0227)  0.6061 (0.0236)
                    DSD II   0.7930 (0.0146)  0.7864 (0.0166)  0.7624 (0.0196)  0.6258 (0.0212)  0.6189 (0.0228)  0.5968 (0.0248)
c̃(j)₁ ∼ Uc3  10    DSD I    0.6028 (0.0911)  0.5919 (0.0896)  0.5856 (0.0885)  0.3661 (0.0863)  0.3567 (0.0798)  0.3563 (0.0844)
                    DSD II   0.5659 (0.0927)  0.5667 (0.0927)  0.5523 (0.0935)  0.3633 (0.0830)  0.3596 (0.0853)  0.3506 (0.0843)
              100   DSD I    0.5315 (0.0250)  0.5273 (0.0252)  0.5183 (0.0268)  0.3023 (0.0209)  0.3001 (0.0208)  0.2966 (0.0204)
                    DSD II   0.4945 (0.0265)  0.4905 (0.0267)  0.4813 (0.0273)  0.2988 (0.0204)  0.2973 (0.0218)  0.2924 (0.0206)
Based on these results, we can say that, for both models, the linearity between histogram-valued variables is more affected by disturbances of the centers of the subintervals than of the half ranges. This behavior is not surprising because the distance associated with this model is the Mallows distance, and, as we have observed, the contribution of the centers of the subintervals is three times larger than that of the half ranges (see Proposition 2.1 in Section 2.3.2). Moreover, the limitation imposed on the values of r̃(j)ᵢ prevents larger disturbances in the half ranges of the subintervals of Y(j).
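Recalling Proposition 2.1, and assuming both histograms share the same weights pᵢ, with subinterval centers c₁ᵢ, c₂ᵢ and half ranges r₁ᵢ, r₂ᵢ, the squared Mallows distance decomposes as

$$D_M^2\left(\Psi_1^{-1}, \Psi_2^{-1}\right) = \sum_{i=1}^{n} p_i\left[\left(c_{1i} - c_{2i}\right)^2 + \frac{1}{3}\left(r_{1i} - r_{2i}\right)^2\right],$$

which makes explicit the factor of three between the contribution of the centers and that of the half ranges.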
Table 6.3: Mean values of Ω considering different levels of linearity, when the distributions generating the observations of X are Log-Normal (Ω_LogN) and a mixture of distributions (Ω_Mix).

                                      Ω_LogN (s)                                            Ω_Mix (s)
              m     Model    r̃(j)ᵢ ∼ Ur1      r̃(j)ᵢ ∼ Ur2      r̃(j)ᵢ ∼ Ur3      r̃(j)ᵢ ∼ Ur1      r̃(j)ᵢ ∼ Ur2      r̃(j)ᵢ ∼ Ur3
c̃(j)₁ ∼ Uc1  10    DSD I    0.9779 (0.0074)  0.9424 (0.0209)  0.8459 (0.0460)  0.9800 (0.0066)  0.9773 (0.0077)  0.9699 (0.0104)
                    DSD II   0.9778 (0.0074)  0.9410 (0.0209)  0.8390 (0.0476)  0.9790 (0.0064)  0.9760 (0.0081)  0.9654 (0.0123)
              100   DSD I    0.9784 (0.0020)  0.9719 (0.0029)  0.9512 (0.0050)  0.9788 (0.0019)  0.9761 (0.0023)  0.9677 (0.0033)
                    DSD II   0.9779 (0.0020)  0.9701 (0.0031)  0.9469 (0.0059)  0.9781 (0.0019)  0.9740 (0.0025)  0.9654 (0.0042)
c̃(j)₁ ∼ Uc2  10    DSD I    0.6692 (0.0736)  0.6450 (0.0804)  0.6108 (0.0939)  0.6707 (0.0767)  0.6702 (0.0811)  0.6648 (0.0813)
                    DSD II   0.6576 (0.0740)  0.6524 (0.0836)  0.5996 (0.0934)  0.6658 (0.0772)  0.6700 (0.0799)  0.6621 (0.0790)
              100   DSD I    0.6483 (0.0206)  0.6464 (0.0212)  0.6361 (0.0227)  0.6505 (0.0234)  0.6501 (0.0254)  0.6461 (0.0247)
                    DSD II   0.6431 (0.0207)  0.6396 (0.0220)  0.6307 (0.0237)  0.6434 (0.0248)  0.6421 (0.0249)  0.6356 (0.0247)
c̃(j)₁ ∼ Uc3  10    DSD I    0.3439 (0.0867)  0.3416 (0.0829)  0.3305 (0.0896)  0.3520 (0.0983)  0.3533 (0.0970)  0.3568 (0.0958)
                    DSD II   0.3454 (0.0814)  0.3441 (0.0896)  0.3281 (0.0837)  0.3558 (0.0920)  0.3596 (0.0942)  0.3594 (0.0991)
              100   DSD I    0.3165 (0.0224)  0.3160 (0.0218)  0.3150 (0.0225)  0.3198 (0.0329)  0.3208 (0.0319)  0.3191 (0.0319)
                    DSD II   0.3112 (0.0220)  0.3104 (0.0212)  0.3097 (0.0224)  0.3116 (0.0305)  0.3128 (0.0301)  0.3121 (0.0314)
Comparing all situations where the considered error levels are the same, we may conclude that when the observations of the explicative variables follow a Uniform distribution the linearity is less disturbed. This behavior could be influenced by the absence of variability of the half ranges associated with each unit j.
An in-depth analysis of the values of the goodness-of-fit measures (see all tables of Appendix C.3) shows that the values of the Root Mean Square Errors decrease in the same proportion as the levels of linearity. The mean values associated with the measures RMSE_M, RMSE_L and RMSE_U increase approximately ten times when we pass from high (error level I) to moderate/low (error level II) linearity. This increase is an exact reflection of the range of variability tested in this study for the error function (ten times from level I to level II). As the Root Mean Square Errors quantify the differences between the observed and predicted distributions in “absolute terms”, these measures are not adequate to compare situations with different selections of parameters in the DSD Models. They are also not adequate when the selection of parameters is the same but the distributions of the explicative variables are different.
6.3.2.4. Concerning the analysis of the parameters’ estimation
The results obtained for DSD Model I with one (Tables C.23 to C.25) or three (Tables C.26 to C.29) explicative variables are in general similar, and as such, in this section we will analyze in detail the results obtained when p = 1.
Comparing the obtained results, we can see that the behavior of the parameters' estimation is not very different across the distributions used to generate the microdata of the explicative variables, and is similar for the three selections of the parameters. For the situations where the level I error is considered, we observed that (Tables C.23 to C.25): the mean value of the estimated parameters is close to the true parameter values; both the standard deviation associated with the mean values of the estimated parameters and the values of the MSE get closer to zero when the number of observations increases. These results confirm the empirical consistency of the estimation and are the ones expected when the linear regression models are only slightly disturbed.

In Figures 6.17, 6.18 and 6.19 we may observe the behavior described above. The figures illustrate only the situation where a = 2; b = 8; v = 3 (Table C.24), but the behavior for the other selections of parameters is similar. For the different distributions used to generate the histogram values of X, the boxes narrow around the true values of the respective parameters as the number m of observations increases. It may also be observed that it is for the Normal distribution that the diversity of the estimated values of the parameters is highest.
Figure 6.17: Boxplots of the values estimated for parameter a, under different conditions (Uniform, Normal, Log-Normal and Mixture of distributions; m = 10, 30, 100, 250), when DSD Model I (a = 2, b = 8, v = 3) is applied to histogram-valued variables and level I error is considered.
In the cases where the DSD Model II is applied, the main goal is not only to analyze the behavior of the parameters' estimation considering different distributions for the generation of the microdata of the explicative variables, but also to observe the influence of the type of distribution of the independent parameter.
Figure 6.18: Boxplots of the values estimated for parameter b, under different conditions, when DSD Model I (a = 2, b = 8, v = 3) is applied to histogram-valued variables and level I error is considered.
Figure 6.19: Boxplots of the values estimated for parameter v, under different conditions, when DSD Model I (a = 2, b = 8, v = 3) is applied to histogram-valued variables and level I error is considered.
For this reason, we consider four different situations, with the same initial parameters a and b, but with different independent parameters that are now a distribution: Uniform (Tables C.30 to C.32), Normal (Tables C.33 to C.35), Log-Normal (Tables C.36 to C.38) and a mixture of distributions (Tables C.39 to C.41). We observe a similar behavior irrespective of the kind of explicative variables (generated with different types of distributions) and of the kind of distribution of the independent parameter. The performance of the parameters a and b, analyzed when DSD Model I is applied and the level I error is considered, may be extended to DSD Model II. Figures 6.20, 6.21 and 6.22 illustrate the empirical consistency of the estimation when a = 2; b = 8 and Ψ⁻¹_Constant(t) = Ψ⁻¹_N(t) (Tables C.33 to C.35).
Figure 6.20: Boxplots of the values estimated for parameter a, under different conditions, when DSD Model II (a = 2, b = 8, Ψ⁻¹_Constant(t) = Ψ⁻¹_N(t)) is applied to histogram-valued variables and level I error is considered.
Figure 6.21: Boxplots of the values estimated for parameter b, under different conditions, when DSD Model II (a = 2, b = 8, Ψ⁻¹_Constant(t) = Ψ⁻¹_N(t)) is applied to histogram-valued variables and level I error is considered.
The behavior of the independent parameter is always more unstable than the behavior of the other parameters of the model. Observing the obtained results, it is when the variability of the distributions of the response variable is higher that the values/quantile functions estimated for the independent parameter are further apart from the original values/quantile functions. In Figures 6.19 and 6.22 we may observe that the values of the MSE and of D²_M(Ψ⁻¹_Constant(t), Ψ∗⁻¹_Constant(t)), corresponding to the independent parameter, get closer to zero as the sample size increases. It is when the distribution of the explicative variables is Normal that these values are higher.
Figure 6.22: Representation of D²_M(Ψ⁻¹_N(t), Ψ̂⁻¹_N(t)) under different conditions (Uniform, Normal, Log-Normal and Mixture of distributions; m = 10, 30, 100, 250), when DSD Model II (a = 2; b = 8; Ψ⁻¹_Constant(t) = Ψ⁻¹_N(t)) is applied to histogram-valued variables and level I error is considered.
When we consider the level II error, the mean values associated with the estimated parameters a and b are distant from the original ones, essentially when the distributions of the explicative variables are Uniform or Normal and the number of observations is low. In all situations, the estimated parameters have higher values of standard deviation and MSE than in the analogous situations where the level I error is considered (Tables C.23 to C.25 for the DSD Model I and Tables C.30 to C.41 for the DSD Model II).
6.3.2.5. Concerning the symmetry/asymmetry of Ŷ(j)
In this simulation study it was possible to analyze the symmetry/asymmetry of the predicted distributions obtained by the simple DSD Models, taking into consideration the symmetry/asymmetry of the distributions of the observations of the histogram-valued variable X and the values of the parameters of the models. When the observations of the histogram-valued variable X are symmetric histograms, represented by Ψ⁻¹_X(j)(t), the respective histogram represented by −Ψ⁻¹_X(j)(1 − t) is also symmetric; but when the histogram represented by Ψ⁻¹_X(j)(t) is asymmetric positive (negative) (Log-Normal, for example), the respective histogram represented by −Ψ⁻¹_X(j)(1 − t) is asymmetric negative (positive).

In the DSD Model I, the predicted distributions are obtained from Ψ⁻¹_Ŷ(j)(t) = v + aΨ⁻¹_X(j)(t) − bΨ⁻¹_X(j)(1 − t). Therefore, if the distribution Ψ⁻¹_X(j)(t) is symmetric, the distribution Ψ⁻¹_Ŷ(j)(t) also tends to be symmetric. If the distribution Ψ⁻¹_X(j)(t)
is asymmetric, the distribution Ψ⁻¹_Ŷ(j)(t) tends to be symmetric when the values of a and b are similar, and asymmetric negative (resp. positive) when the value of a is lower (resp. higher) than the value of b. These conclusions are illustrated in Figure 6.23, considering all predicted distributions in the simulation study with the DSD Model I for p = 1, level I error and samples with 10 observations. In the DSD Model I, the value of the independent parameter does not influence the symmetry/asymmetry of Ŷ(j).
Figure 6.23: Boxplots that represent the “skewness”⁵ of the distributions estimated with DSD Model I (cases a = 2; b = 1; v = −1, a = 6; b = 0; v = 2 and a = 2; b = 8; v = 3; explicative variables generated with Uniform, Normal, Log-Normal and Mixture distributions).
Figure 6.24: Boxplots that represent the “skewness” of the distributions estimated with DSD Model II (a = 2; b = 8; Ψ⁻¹_Constant = Ψ⁻¹_U, Ψ⁻¹_N, Ψ⁻¹_LogN and Ψ⁻¹_Mix; explicative variables generated with Uniform, Normal, Log-Normal and Mixture distributions).
⁵ The “skewness” in this context is measured by the difference between the symbolic mean and the symbolic median. A distribution is considered to be asymmetric positive (negative) when this difference is positive (negative).
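Following this definition, a minimal sketch of the “skewness” measure for a histogram given by its subinterval bounds and weights might read as follows (names are illustrative; the symbolic median is read from the quantile function at t = 0.5):

    import numpy as np

    def skewness_measure(lower, upper, weights):
        # 'Skewness' as symbolic mean minus symbolic median; positive values
        # indicate positive asymmetry, negative values negative asymmetry.
        centers = (lower + upper) / 2.0
        sym_mean = np.sum(weights * centers)
        cum = np.cumsum(weights)
        i = np.searchsorted(cum, 0.5)  # subinterval that contains t = 0.5
        frac = (0.5 - (cum[i] - weights[i])) / weights[i]
        sym_median = lower[i] + frac * (upper[i] - lower[i])
        return sym_mean - sym_median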
Comparing now the case where a = 2, b = 8 and v = 3 in Figure 6.23 with the cases in Figure 6.24, where a = 2, b = 8 and the independent parameters are quantile functions with four different distributions, we observe similar behaviors with respect to the “skewness” of the predicted response distributions. When the distributions of the explicative variables were (all or only some) asymmetric (Log-Normal and Mixture of distributions), the predicted distributions are in all cases asymmetric. However, the “skewness” of the predicted response distributions obtained with DSD Model II may be influenced by the independent parameter. It may be observed in Figure 6.25 that when we consider a more asymmetric (positive) independent parameter, the behavior of the predicted distribution is affected.
Figure 6.25: Boxplots that represent the “skewness” of the distributions estimated with DSD Model II with parameters a = 2, b = 8, Ψ⁻¹_Constant = Ψ⁻¹_LogN(0.95,0.4) and a = 2, b = 8, Ψ⁻¹_Constant = Ψ⁻¹_LogN(0.5,1).
In conclusion, when the distributions of the observations X(j) are symmetric, asymmetric positive, or asymmetric negative, it is possible, in several cases, to forecast whether the distributions of $\widehat{Y}(j)$ will be asymmetric.
6.3.2.6. Comparing DSD Models I and II
In this simulation study, the generation of a symbolic data table is not independent from the models used afterwards to predict the distributions of the response variable. Because of this limitation, two situations are considered to compare the results obtained with both models. As we concluded that the distribution of the explicative variables X has little influence on the behavior of both models, the explicative variables used for this study are generated with a Normal distribution, according to the criteria established in the factorial design. For this analysis we consider a simple linear relation (p = 1) and samples with 10 and 100 observations.
1. In the first situation, 1000 data tables are built considering the response variable Y obtained by DSD Model I with a = 2, b = 8 and v = 3, perturbed by an error function with level I. The predicted distributions for the response variable Y are obtained in two ways:

   • applying DSD Model I (situation identified by C_DSDI/P_DSDI: the distributions of Y are created with the DSD Model I and predicted with DSD Model I);

   • applying DSD Model II (situation identified by C_DSDI/P_DSDII: the distributions of Y are created with the DSD Model I and predicted with DSD Model II).

2. In the second situation, data tables are built considering the response variable Y obtained by DSD Model II (a = 2, b = 8 and $\Psi^{-1}_{Constant} = \Psi^{-1}_{Mix}$), perturbed by an error function with level I. The predicted distributions for the response variable Y are obtained in two ways:

   • applying DSD Model I (analogously identified by C_DSDII/P_DSDI);

   • applying DSD Model II (analogously identified by C_DSDII/P_DSDII).
In Tables 6.4 and 6.5 and in Figures 6.26 and 6.27, the results of these studies may be observed.

Table 6.4: Relative efficiency of the estimation of parameters a and b when the distributions of Y are created by DSD Model I and DSD Model II.

           m     created by DSD I         created by DSD II
                 (a = 2, b = 8, v = 3)    (a = 2, b = 8, $\Psi^{-1}_{Constant}=\Psi^{-1}_{Mix}$)
  er(a)    10    1.0092                   1.2939
           100   0.9923                   4.2931
  er(b)    10    0.9907                   1.2781
           100   1.0050                   4.3373
In Table 6.4 the values of the relative efficiency of the estimated parameters with DSD Model I and DSD Model II are presented, considering the two situations described above. The relative efficiency of the estimated parameter θ is defined by

$$ er(\theta) = \frac{MSE_{DSDI}(\theta)}{MSE_{DSDII}(\theta)} = \frac{\frac{1}{1000}\sum_{u=1}^{1000}\left(\theta - \theta^{*}_{u_{DSDI}}\right)^{2}}{\frac{1}{1000}\sum_{u=1}^{1000}\left(\theta - \theta^{*}_{u_{DSDII}}\right)^{2}}, $$
where $\theta^{*}_{u_{DSDI}}$ and $\theta^{*}_{u_{DSDII}}$ are the values estimated for the parameter θ with DSD Model I and DSD Model II, respectively (with θ corresponding to each real parameter). The values obtained for the relative efficiency of the estimated parameters a and b (see Table 6.4) confirm the behavior observed in the boxplots of Figures 6.26 and 6.27: the estimation is equally efficient in both cases of the first situation (C_DSDI/P_DSDI, C_DSDI/P_DSDII); in the second situation (C_DSDII/P_DSDI, C_DSDII/P_DSDII), the estimation is better when the prediction is obtained applying the DSD Model II.
[Figure 6.26: Boxplots for the estimated parameter a in the four cases C_DSDI/P_DSDI, C_DSDI/P_DSDII, C_DSDII/P_DSDII, C_DSDII/P_DSDI, for m = 10 and m = 100.]
Analyzing Figures 6.26 and 6.27 and Table 6.5, we may conclude that in the cases C_DSDI/P_DSDI and C_DSDI/P_DSDII the goodness-of-fit measures and the estimated parameters are similar. In the second situation, C_DSDII/P_DSDI and C_DSDII/P_DSDII, the estimated parameters are closer to the initial parameters when the DSD Model II is applied to predict the distribution of Y, and it is also in this case that the goodness-of-fit measures are best. This behavior is expected because the DSD Model I is a particular case of the DSD Model II, where the independent parameter is a constant function.
[Figure 6.27: Boxplots for the estimated parameter b in the four cases C_DSDI/P_DSDI, C_DSDI/P_DSDII, C_DSDII/P_DSDII, C_DSDII/P_DSDI, for m = 10 and m = 100.]
Table 6.5: Evaluation measures for the situations C_DSDI/P_DSDI, C_DSDI/P_DSDII, C_DSDII/P_DSDII, and C_DSDII/P_DSDI.

  Model used       m     C_DSDI                              C_DSDII
  in prediction          Ω (s)            RMSE_M (s)         Ω (s)            RMSE_M (s)
  P_DSDI           10    0.9801 (0.0068)  2.0608 (0.3781)    0.9784 (0.0061)  2.4601 (0.3621)
                   100   0.9762 (0.0021)  2.1268 (0.0958)    0.9743 (0.0020)  2.5380 (0.1010)
  P_DSDII          10    0.9801 (0.0068)  2.0600 (0.3782)    0.9815 (0.0061)  2.2680 (0.3943)
                   100   0.9762 (0.0021)  2.1268 (0.0952)    0.9778 (0.0020)  2.3596 (0.1084)
6.4. Conclusion

As the proposed DSD Models and the kind of variables used in this work are complex and new, a simulation study helps to analyze the behavior of the models under several different conditions.

In the performed simulation studies, three main limitations occur: it is necessary to impose conditions on the error functions that limit the selection of the values that compose them; the large number of factors involved implies that a selection of the criteria and situations under analysis had to be performed; and the process selected to build the response variables made it impossible to choose the distribution or the pattern of variability of the response variables, as they are always generated by the model used to estimate them.
This latter limitation prevents a "real" comparative study between the DSD Model I and the DSD Model II. In fact, as we present in Section 6.3.2.6, it is only possible to compare the predicted results applying DSD Model I to response variables created with DSD Model II and vice versa.
From simulation study I, it was possible to establish a relation between the error function and the coefficients of determination Ω and $\widetilde{\Omega}$. The measure Ω shows a behavior that takes into account the size of the half ranges of the intervals. Because of this, to obtain similar values of Ω when the pattern of variability is different, the error functions are built considering the sizes of the half ranges involved in the linear relations. On the other hand, since $\widetilde{\Omega}$ takes into account the variance of the half ranges of the intervals, it presents a different behavior than Ω. This difference is less noticeable when the variability of the variables is mixed and more noticeable when the variability in the variables is low/high and similar. In this latter case, the predicted and observed intervals are close whenever Ω is close to 1. The extreme sensitivity of the measure $\widetilde{\Omega}$, and the fact that it does not take into account the sizes of the half ranges involved in the linear regression, imply that this measure must be carefully interpreted.
From simulation study II, it was possible to corroborate the conclusions of simulation study I regarding the levels of linearity and their effect on the values of the coefficient of determination Ω. Moreover, we may conclude that the models behave similarly under different conditions, which allows considering that the proposed linear regression models are consistent. The behavior of DSD Model I and DSD Model II is similar whether we have interval-valued or histogram-valued variables, and seems to be independent of the kind and the number of explicative variables involved in the model. In all situations, the estimated parameters present empirical consistency, since the values of the MSE decrease as the number of observations increases.
The error measures used to analyze the dissimilarity between the original and predicted distributions and intervals are in agreement with the considered error levels. For higher error levels, the coefficient of determination Ω presents values more distant from one than when the error levels are lower. The Root Mean Square Errors, while not relative measures, also behave in accordance with what was expected: values are lower when the error level is lower and higher when the error level is higher, in agreement with the variability of the intervals or distributions of the response variable.
In the simulation studies presented in this chapter we have focused on the fit of the obtained models to the data and on the empirical consistency of the parameters' estimation. The values of the measures that evaluate the quality of the prediction (RMSE_L, RMSE_U, RMSE_M) may be somewhat optimistic due to possible overfitting. This will be assessed in future research.
7. Analysis of data with variability
In this chapter we analyze the linear relation between real data presenting intrinsic variability. Among the selected cases, we consider situations that other authors have also studied and to which they have applied their models. Apart from these cases, real data tables were selected where the interest is not in studying the information in the microdata, but in data resulting from the aggregation of the original records, and therefore data with variability.
According to the number of values associated with each higher level unit, we chose to aggregate the values in intervals or histograms: when the number of records associated with each higher level unit is small, the associated element is an interval; when a large number of records is associated with each higher level unit, we choose to form histograms. Using the selected examples, the goal of this chapter is to compare the results obtained when the following situations are considered:
• Symbolic approaches: application of the models developed in the symbolic context;
– DSD Model I and the DSD Model II;
– Models proposed by other authors.
• Classic approach: application of the classical linear regression model to the mean
values of each higher level unit;
• Classic-Symbolic approach: application of classical linear regression to all values
observed for each first level unit, and posterior aggregation of the predicted values in
intervals or empirical distributions according to the higher level units.
7.1. Prediction of the hematocrit values
7.1.1. The histogram data
The data of this first example was presented in Billard and Diday (2006) to illustrate their linear regression models for histogram-valued and interval-valued variables. For a group of patients, the values of hematocrit and hemoglobin observed over a period of time were recorded.

To build the symbolic tables, it is necessary to aggregate the values of hematocrit and hemoglobin of each patient over a period of time. In the symbolic data table, Table 2.5 in Example 2.11, the 10 units (patients) are described by two histogram-valued variables, the hematocrit (Y) and the hemoglobin (X).

The relation between the histogram-valued variables in Table 2.5 may be visualized in the scatter plot for histograms in Figure 2.19 (Example 2.11). In this plot, each of the distributions associated with the respective patient is represented by a histogram with a different color. The plot shows that there is a strong linear relation between the histogram-valued variables hematocrit and hemoglobin.
7.1.2. The DSD Models
We predicted the quantile function representing the distribution taken by the histogram-valued variable Y by the DSD Models.

DSD Model I:

$$\Psi^{-1}_{\widehat{Y}(j)}(t) = -1.953 + 3.5598\,\Psi^{-1}_{X(j)}(t) - 0.4128\,\Psi^{-1}_{X(j)}(1-t), \quad t \in [0,1];$$

DSD Model II:

$$\Psi^{-1}_{\widehat{Y}(j)}(t) = \Psi^{-1}_{Constant}(t) + 3.1101\,\Psi^{-1}_{X(j)}(t) + 0\,\Psi^{-1}_{X(j)}(1-t)$$

with t ∈ [0,1] and

$$\Psi^{-1}_{Constant}(t) =
\begin{cases}
-2.27 + \left(\frac{2t}{0.3} - 1\right) \times 0.22 & \text{if } 0 \le t < 0.3 \\
-1.93 + \left(\frac{2(t-0.3)}{0.1} - 1\right) \times 0.12 & \text{if } 0.3 \le t < 0.4 \\
-1.67 + \left(\frac{2(t-0.4)}{0.1} - 1\right) \times 0.14 & \text{if } 0.4 \le t < 0.5 \\
-1.37 + \left(\frac{2(t-0.5)}{0.1} - 1\right) \times 0.16 & \text{if } 0.5 \le t < 0.6 \\
-1.07 + \left(\frac{2(t-0.6)}{0.1} - 1\right) \times 0.14 & \text{if } 0.6 \le t < 0.7 \\
-0.51 + \left(\frac{2(t-0.7)}{0.3} - 1\right) \times 0.43 & \text{if } 0.7 \le t \le 1
\end{cases} \quad (7.1)$$
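For readers who want to reproduce predictions such as this one, the following sketch (my own code; the piece encoding and helper names are assumptions, not thesis code) evaluates the piecewise-linear quantile function of Expression (7.1) and the corresponding DSD Model II prediction:

```python
# Each piece is (t_low, t_high, center, half_range); on each piece the function
# is linear: c + (2(t - t_low)/(t_high - t_low) - 1) * r, as in Expression (7.1).
PIECES_71 = [(0.0, 0.3, -2.27, 0.22), (0.3, 0.4, -1.93, 0.12),
             (0.4, 0.5, -1.67, 0.14), (0.5, 0.6, -1.37, 0.16),
             (0.6, 0.7, -1.07, 0.14), (0.7, 1.0, -0.51, 0.43)]

def psi_constant(t, pieces=PIECES_71):
    for t0, t1, c, r in pieces:
        if t0 <= t <= t1:
            return c + (2 * (t - t0) / (t1 - t0) - 1) * r
    raise ValueError("t must lie in [0, 1]")

def psi_hat_dsd2(t, psi_x):
    # DSD Model II prediction for the hematocrit data (Section 7.1.2);
    # psi_x is the observed quantile function of hemoglobin for the patient.
    return psi_constant(t) + 3.1101 * psi_x(t)

print(psi_constant(0.0), psi_constant(0.5), psi_constant(1.0))
# -2.49, -1.53, -0.08: left bound, median and right bound of the constant term.
```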
The value of the goodness-of-fit measure is, for the DSD Model I, Ω = 0.9631 and when
the DSD Model II is applied the fit is slightly better, Ω = 0.9720. In Figure 7.2, the observed
and the predicted distributions of the histogram-valued variable Y with the DSD Models are
represented, showing the slightly better prediction obtained with the DSD Model II.
Concerning the interpretation of the parameters of the model, we may conclude from Proposition 4.1 that, for the set of patients to which the data refer, the symbolic mean of hematocrit increases by α − β = 3.1470 for each unit of increase of the symbolic mean of hemoglobin. As this value is positive, we may consider that the relation between the histogram-valued variables is direct. Considering Proposition 4.7, we may establish an analogous interpretation of the parameters for the DSD Model II.
When we predict a histogram value we always have an associated error function. In this
example, in Figure 7.1 we may observe the error function of DSD Model I and DSD Model II,
for observation 1.
The interpretation of the sign of the error function allows interpreting the behavior of the observed and predicted quantile functions. The error function is positive when the predicted quantile function is lower than the observed quantile function and negative in the opposite case; the error function is zero when the estimated and observed quantile functions are the same.
[Figure 7.1: Comparing the predictions and error functions for observation 1 in Table 2.5: (a) observed and predicted quantile functions for observation 1; (b) error function when the DSD Model I is applied; (c) error function when the DSD Model II is applied.]
7.1.3. Comparison of the DSD Models with other proposed symbolic models
For this example, we also predicted the hematocrit distributions using the linear regression models proposed by Billard and Diday (2006) (the Center Model (CM) and the Billard and Diday Model (BD)) and the Verde and Irpino Model (VI) (Verde and Irpino (2010), Irpino and Verde (2012)). For a short description of these methods see Section 3.2.

The hematocrit distributions predicted with these methods are presented in Tables 7.1, 7.2 and 7.3 and represented in Figure 7.2. The Root Mean Square Error (RMSE) values obtained are in accordance with the behavior observed in Figure 7.2: the DSD II is the method where the predicted and observed quantile functions are most similar.

The expressions of the methods and the measures RMSE_M, RMSE_L and RMSE_U, which allow comparing the methods and the quality of the predictions (see Sections 3.2 and 6.3), are presented in Tables 7.4 and 7.5.
[Figure 7.2: Observed and predicted quantile functions of each observation (patients 1 to 10) in Table 2.5; legend: Y observed, Y_DSDI, Y_DSDII, Y_CM, Y_BD and Y_VI predicted.]
Table 7.1: Observed and predicted histograms (using different methods) of the hematocrit values shown in Table 2.5 (part 1: patients 1 to 4).

  H_Y(1):       {[33.29; 35.41[, 0.3; [35.41; 36.11[, 0.1; [36.11; 36.82[, 0.1; [36.82; 37.52[, 0.1; [37.52; 38.04[, 0.1; [38.04; 39.61], 0.3}
  H_Ŷ_DSDI(1):  {[33.84; 35.70[, 0.3; [35.70; 36.32[, 0.1; [36.32; 36.73[, 0.1; [36.73; 37.13[, 0.1; [37.13; 37.56[, 0.1; [37.56; 38.85], 0.3}
  H_Ŷ_DSDII(1): {[33.40; 35.36[, 0.3; [35.36; 36.11[, 0.1; [36.11; 36.70[, 0.1; [36.70; 37.34[, 0.1; [37.34; 37.93[, 0.1; [37.93; 39.73], 0.3}
  H_Ŷ_CM(1):    {[34.33; 35.87[, 0.3; [35.87; 36.38[, 0.1; [36.38; 36.70[, 0.1; [36.70; 37.02[, 0.1; [37.02; 37.35[, 0.1; [37.35; 38.31], 0.3}
  H_Ŷ_BD(1):    {[35.12; 36.51[, 0.3; [36.51; 36.97[, 0.1; [36.97; 37.23[, 0.1; [37.23; 37.55[, 0.1; [37.55; 37.84[, 0.1; [37.84; 38.71], 0.3}
  H_Ŷ_VI(1):    {[33.79; 35.70[, 0.3; [35.70; 36.34[, 0.1; [36.34; 36.73[, 0.1; [36.73; 37.13[, 0.1; [37.13; 37.53[, 0.1; [37.53; 38.73], 0.3}
  H_Y(2):       {[36.69; 39.11[, 0.3; [39.11; 39.97[, 0.1; [39.97; 40.83[, 0.1; [40.83; 41.69[, 0.1; [41.69; 42.54[, 0.1; [42.54; 45.12], 0.3}
  H_Ŷ_DSDI(2):  {[35.16; 38.04[, 0.3; [38.04; 39.00[, 0.1; [39.00; 39.96[, 0.1; [39.96; 40.67[, 0.1; [40.67; 41.38[, 0.1; [41.38; 43.51], 0.3}
  H_Ŷ_DSDII(2): {[35.05; 37.83[, 0.3; [37.83; 38.84[, 0.1; [38.84; 39.89[, 0.1; [39.89; 40.75[, 0.1; [40.75; 41.55[, 0.1; [41.55; 43.99], 0.3}
  H_Ŷ_CM(2):    {[36.00; 38.37[, 0.3; [38.37; 39.16[, 0.1; [39.16; 39.95[, 0.1; [39.95; 40.49[, 0.1; [40.49; 41.03[, 0.1; [41.03; 42.64], 0.3}
  H_Ŷ_BD(2):    {[36.63; 38.76[, 0.3; [38.76; 39.47[, 0.1; [39.47; 40.18[, 0.1; [40.18; 40.67[, 0.1; [40.67; 41.15[, 0.1; [41.15; 42.60], 0.3}
  H_Ŷ_VI(2):    {[35.13; 38.06[, 0.3; [38.06; 39.04[, 0.1; [39.04; 40.02[, 0.1; [40.02; 40.69[, 0.1; [40.69; 41.36[, 0.1; [41.36; 43.35], 0.3}
  H_Y(3):       {[36.69; 40.26[, 0.3; [40.26; 41.45[, 0.1; [41.45; 42.64[, 0.1; [42.64; 43.85[, 0.1; [43.85; 45.06[, 0.1; [45.06; 48.68], 0.3}
  H_Ŷ_DSDI(3):  {[35.45; 42.27[, 0.3; [42.27; 43.38[, 0.1; [43.38; 44.50[, 0.1; [44.50; 45.61[, 0.1; [45.61; 46.72[, 0.1; [46.72; 50.46], 0.3}
  H_Ŷ_DSDII(3): {[36.01; 42.12[, 0.3; [42.12; 43.23[, 0.1; [43.23; 44.37[, 0.1; [44.37; 45.57[, 0.1; [45.57; 46.72[, 0.1; [46.72; 50.18], 0.3}
  H_Ŷ_CM(3):    {[36.98; 42.74[, 0.3; [42.74; 43.62[, 0.1; [43.62; 44.51[, 0.1; [44.51; 45.39[, 0.1; [45.39; 46.28[, 0.1; [46.28; 48.93], 0.3}
  H_Ŷ_BD(3):    {[37.51; 42.69[, 0.3; [42.69; 43.49[, 0.1; [43.49; 44.28[, 0.1; [44.28; 45.08[, 0.1; [45.08; 45.88[, 0.1; [45.88; 48.27], 0.3}
  H_Ŷ_VI(3):    {[35.29; 42.42[, 0.3; [42.42; 43.51[, 0.1; [43.51; 44.61[, 0.1; [44.61; 45.71[, 0.1; [45.71; 46.80[, 0.1; [46.80; 50.1], 0.3}
  H_Y(4):       {[36.38; 39.75[, 0.3; [39.75; 40.87[, 0.1; [40.87; 41.96[, 0.1; [41.96; 43.05[, 0.1; [43.05; 44.14[, 0.1; [44.14; 47.41], 0.3}
  H_Ŷ_DSDI(4):  {[35.80; 40.08[, 0.3; [40.08; 41.50[, 0.1; [41.50; 42.92[, 0.1; [42.92; 43.81[, 0.1; [43.81; 44.70[, 0.1; [44.70; 47.37], 0.3}
  H_Ŷ_DSDII(4): {[36.01; 39.97[, 0.3; [39.97; 41.38[, 0.1; [41.38; 42.82[, 0.1; [42.82; 43.79[, 0.1; [43.79; 44.70[, 0.1; [44.70; 47.47], 0.3}
  H_Ŷ_CM(4):    {[36.98; 40.55[, 0.3; [40.55; 41.74[, 0.1; [41.74; 42.93[, 0.1; [42.93; 43.58[, 0.1; [43.58; 44.23[, 0.1; [44.23; 46.18], 0.3}
  H_Ŷ_BD(4):    {[37.51; 40.72[, 0.3; [40.72; 41.97[, 0.1; [41.97; 42.86[, 0.1; [42.86; 43.45[, 0.1; [43.45; 44.03[, 0.1; [44.03; 45.79], 0.3}
  H_Ŷ_VI(4):    {[35.71; 40.13[, 0.3; [40.13; 41.61[, 0.1; [41.61; 43.08[, 0.1; [43.08; 43.89[, 0.1; [43.89; 44.69[, 0.1; [44.69; 47.12], 0.3}
Table 7.2: Observed and predicted histograms (using different methods) of the hematocrit values shown in Table 2.5 (part 2: patients 5 to 8).

  H_Y(5):       {[39.19; 42.69[, 0.3; [42.69; 43.86[, 0.1; [43.86; 45.03[, 0.1; [45.03; 46.19[, 0.1; [46.19; 47.36[, 0.1; [47.36; 50.86], 0.3}
  H_Ŷ_DSDI(5):  {[39.68; 42.52[, 0.3; [42.52; 43.64[, 0.1; [43.64; 44.75[, 0.1; [44.75; 45.86[, 0.1; [45.86; 46.97[, 0.1; [46.97; 50.25], 0.3}
  H_Ŷ_DSDII(5): {[39.74; 42.37[, 0.3; [42.37; 43.48[, 0.1; [43.48; 44.62[, 0.1; [44.62; 45.82[, 0.1; [45.82; 46.97[, 0.1; [46.97; 50.43], 0.3}
  H_Ŷ_CM(5):    {[40.78; 42.99[, 0.3; [42.99; 43.87[, 0.1; [43.87; 44.76[, 0.1; [44.76; 45.64[, 0.1; [45.64; 46.53[, 0.1; [46.53; 49.19], 0.3}
  H_Ŷ_BD(5):    {[40.92; 42.92[, 0.3; [42.92; 43.71[, 0.1; [43.71; 44.51[, 0.1; [44.51; 45.31[, 0.1; [45.31; 46.10[, 0.1; [46.10; 48.49], 0.3}
  H_Ŷ_VI(5):    {[39.80; 42.54[, 0.3; [42.54; 43.64[, 0.1; [43.64; 44.74[, 0.1; [44.74; 45.83[, 0.1; [45.83; 46.93[, 0.1; [46.93; 50.22], 0.3}
  H_Y(6):       {[39.70; 43.17[, 0.3; [43.17; 44.32[, 0.1; [44.32; 44.81[, 0.1; [44.81; 45.29[, 0.1; [45.29; 45.78[, 0.1; [45.78; 47.24], 0.3}
  H_Ŷ_DSDI(6):  {[40.93; 42.92[, 0.3; [42.92; 43.58[, 0.1; [43.58; 44.04[, 0.1; [44.04; 44.51[, 0.1; [44.51; 44.99[, 0.1; [44.99; 46.45], 0.3}
  H_Ŷ_DSDII(6): {[40.46; 42.51[, 0.3; [42.51; 43.29[, 0.1; [43.29; 43.93[, 0.1; [43.93; 44.62[, 0.1; [44.62; 45.25[, 0.1; [45.25; 47.19], 0.3}
  H_Ŷ_CM(6):    {[41.50; 43.14[, 0.3; [43.14; 43.68[, 0.1; [43.68; 44.05[, 0.1; [44.05; 44.42[, 0.1; [44.42; 44.79[, 0.1; [44.79; 45.90], 0.3}
  H_Ŷ_BD(6):    {[41.58; 43.05[, 0.3; [43.05; 43.54[, 0.1; [43.54; 43.87[, 0.1; [43.87; 44.21[, 0.1; [44.21; 44.54[, 0.1; [44.54; 45.53], 0.3}
  H_Ŷ_VI(6):    {[40.92; 42.95[, 0.3; [42.95; 43.62[, 0.1; [43.62; 44.08[, 0.1; [44.08; 44.54[, 0.1; [44.54; 44.99[, 0.1; [44.99; 46.47], 0.3}
  H_Y(7):       {[41.56; 44.11[, 0.3; [44.11; 44.95[, 0.1; [44.95; 45.80[, 0.1; [45.80; 46.65[, 0.1; [46.65; 47.19[, 0.1; [47.19; 48.81], 0.3}
  H_Ŷ_DSDI(7):  {[42.67; 43.86[, 0.3; [43.86; 44.26[, 0.1; [44.26; 44.65[, 0.1; [44.65; 45.22[, 0.1; [45.22; 45.78[, 0.1; [45.78; 47.48], 0.3}
  H_Ŷ_DSDII(7): {[42.11; 43.43[, 0.3; [43.43; 43.96[, 0.1; [43.96; 44.53[, 0.1; [44.53; 45.32[, 0.1; [45.32; 46.05[, 0.1; [46.05; 48.28], 0.3}
  H_Ŷ_CM(7):    {[43.18; 44.07[, 0.3; [44.07; 44.37[, 0.1; [44.37; 44.66[, 0.1; [44.66; 45.13[, 0.1; [45.13; 45.60[, 0.1; [45.60; 47.00], 0.3}
  H_Ŷ_BD(7):    {[43.09; 43.89[, 0.3; [43.89; 44.16[, 0.1; [44.16; 44.42[, 0.1; [44.42; 44.85[, 0.1; [44.85; 45.27[, 0.1; [45.27; 46.53], 0.3}
  H_Ŷ_VI(7):    {[42.76; 43.87[, 0.3; [43.87; 44.24[, 0.1; [44.24; 44.61[, 0.1; [44.61; 45.19[, 0.1; [45.19; 45.77[, 0.1; [45.77; 47.51], 0.3}
  H_Y(8):       {[38.4; 40.34[, 0.3; [40.34; 40.99[, 0.1; [40.99; 41.64[, 0.1; [41.64; 42.28[, 0.1; [42.28; 42.93[, 0.1; [42.93; 45.22], 0.3}
  H_Ŷ_DSDI(8):  {[39.26; 40.74[, 0.3; [40.74; 41.24[, 0.1; [41.24; 41.72[, 0.1; [41.72; 42.20[, 0.1; [42.20; 42.79[, 0.1; [42.79; 44.54], 0.3}
  H_Ŷ_DSDII(8): {[38.78; 40.36[, 0.3; [40.36; 40.98[, 0.1; [40.98; 41.63[, 0.1; [41.63; 42.34[, 0.1; [42.34; 43.08[, 0.1; [43.08; 45.33], 0.3}
  H_Ŷ_CM(8):    {[39.80; 40.95[, 0.3; [40.95; 41.33[, 0.1; [41.33; 41.72[, 0.1; [41.72; 42.10[, 0.1; [42.10; 42.58[, 0.1; [42.58; 44.00], 0.3}
  H_Ŷ_BD(8):    {[40.04; 41.08[, 0.3; [41.08; 41.43[, 0.1; [41.43; 41.77[, 0.1; [41.77; 42.12[, 0.1; [42.12; 42.55[, 0.1; [42.55; 43.83], 0.3}
  H_Ŷ_VI(8):    {[39.31; 40.74[, 0.3; [40.74; 41.22[, 0.1; [41.22; 41.70[, 0.1; [41.70; 42.17[, 0.1; [42.17; 42.76[, 0.1; [42.76; 44.52], 0.3}
Table 7.3: Observed and predicted histograms (using different methods) of the hematocrit values shown in Table 2.5 (part 3: patients 9 and 10).

  H_Y(9):        {[28.83; 32.86[, 0.3; [32.86; 34.21[, 0.1; [34.21; 35.55[, 0.1; [35.55; 36.84[, 0.1; [36.84; 38.12[, 0.1; [38.12; 41.98], 0.3}
  H_Ŷ_DSDI(9):   {[27.66; 33.54[, 0.3; [33.54; 35.50[, 0.1; [35.50; 36.70[, 0.1; [36.70; 37.91[, 0.1; [37.91; 39.20[, 0.1; [39.20; 43.08], 0.3}
  H_Ŷ_DSDII(9):  {[28.36; 33.61[, 0.3; [33.61; 35.45[, 0.1; [35.45; 36.67[, 0.1; [36.67; 37.94[, 0.1; [37.94; 39.16[, 0.1; [39.16; 42.84], 0.3}
  H_Ŷ_CM(9):     {[29.20; 34.09[, 0.3; [34.09; 35.72[, 0.1; [35.72; 36.68[, 0.1; [36.68; 37.63[, 0.1; [37.63; 38.59[, 0.1; [38.59; 41.47], 0.3}
  H_Ŷ_BD(9):     {[30.51; 34.91[, 0.3; [34.91; 36.37[, 0.1; [36.37; 37.23[, 0.1; [37.23; 38.10[, 0.1; [38.10; 38.96[, 0.1; [38.96; 41.55], 0.3}
  H_Ŷ_VI(9):     {[27.54; 33.59[, 0.3; [33.59; 35.61[, 0.1; [35.61; 36.80[, 0.1; [36.80; 37.90[, 0.1; [37.90; 39.18[, 0.1; [39.18; 42.74], 0.3}
  H_Y(10):       {[44.48; 46.90[, 0.3; [46.90; 47.70[, 0.1; [47.70; 48.51[, 0.1; [48.51; 49.31[, 0.1; [49.31; 50.12[, 0.1; [50.12; 52.53], 0.3}
  H_Ŷ_DSDI(10):  {[45.85; 47.48[, 0.3; [47.48; 48.03[, 0.1; [48.03; 48.58[, 0.1; [48.58; 49.13[, 0.1; [49.13; 49.68[, 0.1; [49.68; 51.33], 0.3}
  H_Ŷ_DSDII(10): {[45.31; 47.03[, 0.3; [47.03; 47.70[, 0.1; [47.70; 48.41[, 0.1; [48.41; 49.17[, 0.1; [49.17; 49.87[, 0.1; [49.87; 52.01], 0.3}
  H_Ŷ_CM(10):    {[46.43; 47.73[, 0.3; [47.73; 48.17[, 0.1; [48.17; 48.61[, 0.1; [48.61; 49.05[, 0.1; [49.05; 49.48[, 0.1; [49.48; 50.80], 0.3}
  H_Ŷ_BD(10):    {[46.02; 47.18[, 0.3; [47.18; 47.58[, 0.1; [47.58; 47.97[, 0.1; [47.97; 48.37[, 0.1; [48.37; 48.76[, 0.1; [48.76; 49.94], 0.3}
  H_Ŷ_VI(10):    {[45.91; 47.51[, 0.3; [47.51; 48.06[, 0.1; [48.06; 48.60[, 0.1; [48.60; 49.14[, 0.1; [49.14; 49.68[, 0.1; [49.68; 51.31], 0.3}
Table 7.4: Comparison of the expressions of the symbolic linear regression models for the histogram-valued variables in Table 2.5.

  DSD I:  $\Psi^{-1}_{\widehat{Y}(j)}(t) = -1.95 + 3.56\,\Psi^{-1}_{X(j)}(t) - 0.41\,\Psi^{-1}_{X(j)}(1-t)$
  DSD II: $\Psi^{-1}_{\widehat{Y}(j)}(t) = \Psi^{-1}_{Constant}(t) + 3.11\,\Psi^{-1}_{X(j)}(t)$, with $\Psi^{-1}_{Constant}(t)$ defined in Expression (7.1)
  CM:     $\widehat{Y}(j) = -2.16 + 3.16\,X(j)$
  BD:     $\widehat{Y}(j) = 2.28 + 2.85\,X(j)$
  VI:     $\Psi^{-1}_{\widehat{Y}(j)}(t) = -2.16 + 3.16\,\overline{X}(j) + 3.92\,\Psi^{-1}_{X^{c}(j)}(t)$
To evaluate the quality of the predictions, we applied all proposed models using the Leave-One-Out (LOO) method. This method uses a single observation from the original sample as the validation data. The Root Mean Square Error values obtained when each observation j is left out of the estimation of the relation between the variables, and the distribution associated with the unit j is subsequently predicted by the model, are recorded in Table 7.5. Using the Leave-One-Out method, the comparison between the methods is more reliable.
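The Leave-One-Out procedure itself is straightforward; a minimal sketch is given below (the callables fit_model, predict_unit and sq_error stand in for the thesis estimation and error routines, and are assumptions of mine, not actual thesis code):

```python
import math

def loo_rmse(X, Y, fit_model, predict_unit, sq_error):
    """Root mean squared held-out error over all units (sketch)."""
    total = 0.0
    for j in range(len(Y)):
        keep = [k for k in range(len(Y)) if k != j]           # drop unit j
        model = fit_model([X[k] for k in keep], [Y[k] for k in keep])
        total += sq_error(Y[j], predict_unit(model, X[j]))    # held-out error
    return math.sqrt(total / len(Y))
```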
Table 7.5: Comparison of the Root Mean Square Error values when the Leave-One-Out method is not/is applied together with the proposed models for the histogram-valued variables in Table 2.5.

            Without Leave-One-Out method      With Leave-One-Out method
  Models    RMSE_L   RMSE_U   RMSE_M          RMSE_L   RMSE_U   RMSE_M
  DSD I     0.9621   0.9496   0.8946          1.0775   1.1151   1.0365
  DSD II    0.8313   0.8190   0.7793          0.9695   0.9722   0.9225
  CM        1.0636   1.1501   1.0507          1.4388   1.7419   1.5589
  BD        1.1291   1.3480   1.2292          1.5948   1.6368   1.5665
  VI        1.0072   0.9633   0.9145          1.1237   1.1006   1.0410
7.1.4. Interval data
Table 2.4 in Example 2.11 is an alternative to the symbolic data table in Table 2.5. In this case, the sets of values of hematocrit and hemoglobin associated with each patient are organized in intervals.
Similarly to the previous situation, the prediction of the range of the values of hematocrit
for each patient j may be obtained by applying DSD Model I and DSD Model II.
DSD Model I:

The expression that allows predicting the quantile function, for each patient j, is defined by:

$$\Psi^{-1}_{\widehat{Y}(j)}(t) = -2.5911 + 3.5298\,\Psi^{-1}_{X(j)}(t) - 0.3152\,\Psi^{-1}_{X(j)}(1-t), \quad t \in [0,1].$$

In this case, the predicted interval for each unit j may also be obtained by:

$$\left[-2.5911 + 3.2146\,c_{X(j)} - 3.8450\,r_{X(j)},\; -2.5911 + 3.2146\,c_{X(j)} + 3.8450\,r_{X(j)}\right].$$

The value of the goodness-of-fit measure Ω is 0.9539.
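For illustration only (the hemoglobin interval below is hypothetical, not taken from Table 2.4): a patient with X(j) = [12; 14] has $c_{X(j)} = 13$ and $r_{X(j)} = 1$, so the predicted hematocrit interval would be

$$\left[-2.5911 + 3.2146(13) - 3.8450(1),\; -2.5911 + 3.2146(13) + 3.8450(1)\right] = \left[35.35,\; 43.04\right].$$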
DSD Model II:

The quantile function predicted for each patient j is given by:

$$\Psi^{-1}_{\widehat{Y}(j)}(t) = -1.2294 + 1.3497(2t-1) + 3.1168\,\Psi^{-1}_{X(j)}(t) + 0\,\Psi^{-1}_{X(j)}(1-t), \quad t \in [0,1]$$

or by the interval:

$$\left[-2.5791 + 3.1168\,c_{X(j)} - 3.1168\,r_{X(j)},\; 0.1203 + 3.1168\,c_{X(j)} + 3.1168\,r_{X(j)}\right].$$

The fact that the symmetric of the intervals X(j) does not influence the DSD Model II is in accordance with what was observed when we treated the data as histogram data. In this case, the independent parameter affects the bounds of the intervals in different ways, which allows for a higher flexibility of the model. Here Ω = 0.9699 and $\widetilde{\Omega}$ = 0.9552.
To evaluate the quality of the predictions obtained with the DSD Models, the Root Mean Square Error values were calculated with and without the Leave-One-Out method. These values may be compared in Table 7.6; as for the case of histogram data, the errors with and without Leave-One-Out are similar.
Table 7.6: Comparison of the Root Mean Square Error values when the Leave-One-Out method is not/is applied together with the DSD Models for the interval-valued variables in Table 2.4.

            Without Leave-One-Out method      With Leave-One-Out method
  Models    RMSE_L   RMSE_U   RMSE_M          RMSE_L   RMSE_U   RMSE_M
  DSD I     1.2640   1.6711   0.9962          1.3075   1.9036   1.1398
  DSD II    0.7577   1.3015   0.8044          0.8219   1.5022   0.9504
In Figures 7.3 and 7.4, the scatter plots of the observed and predicted (with the DSD Models) intervals of hematocrit are represented. From these graphics, we may conclude that the observed intervals of hematocrit and the predicted ones are close. When the intervals are predicted by the DSD Model II, they are a little closer to the observed ones. This slightly better behavior of the DSD Model II is observed when the predictions are obtained both with and without the Leave-One-Out method.
[Figure 7.3: Scatter plots representing the observed and predicted (with DSD I and DSD II) intervals of hematocrit values in Table 2.4.]
[Figure 7.4: Scatter plots representing the observed and predicted intervals of hematocrit values in Table 2.4 when the Leave-One-Out method is applied (DSD I and DSD II with LOO).]
7.2. Distributions of Crimes in USA
7.2.1. The data

In this example we consider a real data table (Redmond (2011)) with records related to communities in the USA. The original data combines socio-economic data from the '90 Census and crime data from 1995. For this study we selected the response variable violent crimes (total number of violent crimes per 100 000 inhabitants) and four explicative variables: LEd (percentage of people aged 25 and over with less than 9th grade education); Emp (percentage of people aged 16 and over who are employed); Div (percentage of the population who are divorced); Img (percentage of immigrants who immigrated within the last 10 years).

[Figure 7.5: Map of the communities of the USA.]

To build the symbolic data table we aggregated the information (contemporary aggregation) for each state. The units (higher units) of this study are the individual USA states, and their observations for each selected variable are the distributions of the records of the communities in the respective state. To build the initial data table we considered only the states for which the number of records for the selected variables was higher than thirty. Using this criterion, only twenty states were included (AL, CA, CT, FL, GA, IN, MA, MO, NC, NJ, NY, OH, OK, OR, PA, TN, TX, VA, WA, WI).

[Figure 7.6: Selected states used to define the model.]
Similarly to the simulation study, we consider that in all observations the subintervals of each histogram have the same weight (equiprobable), with frequency 0.20. Furthermore, as the response variable violent crimes only presents positive values and the distributions of these values are asymmetric, we will consider as the response histogram-valued variable the variable LVC, whose observations are the distributions of the logarithm of the number of violent crimes for each USA state.
7.2.2. Three approaches to study linear relations between data with variability

Symbolic approach

• DSD Models

Considering the conditions described above, the DSD Model I that allows predicting the distribution of LVC from the distributions of the explicative variables LEd, Emp, Div and Img, for each USA state j, is as follows:

$$\Psi^{-1}_{\widehat{LVC}(j)}(t) = 3.9321 + 0.0009\,\Psi^{-1}_{LEd(j)}(t) - 0.0123\,\Psi^{-1}_{Emp(j)}(1-t) + 0.2073\,\Psi^{-1}_{Div(j)}(t) - 0.0353\,\Psi^{-1}_{Div(j)}(1-t) + 0.0187\,\Psi^{-1}_{Img(j)}(t)$$

with t ∈ [0,1].

The goodness-of-fit measure associated with this model is Ω = 0.8680.

The model obtained when the DSD Model II is applied is the following:

$$\Psi^{-1}_{\widehat{LVC}(j)}(t) = \Psi^{-1}_{Constant}(t) + 0.016\,\Psi^{-1}_{LEd(j)}(t) - 0.009\,\Psi^{-1}_{Emp(j)}(1-t) + 0.155\,\Psi^{-1}_{Div(j)}(t) + 0.019\,\Psi^{-1}_{Img(j)}(t)$$

with t ∈ [0,1] and

$$\Psi^{-1}_{Constant}(t) =
\begin{cases}
3.37 + \left(\frac{2t}{0.2} - 1\right) \times 0.42 & \text{if } 0 \le t < 0.2 \\
3.83 + \left(\frac{2(t-0.2)}{0.2} - 1\right) \times 0.04 & \text{if } 0.2 \le t < 0.4 \\
3.90 + \left(\frac{2(t-0.4)}{0.2} - 1\right) \times 0.03 & \text{if } 0.4 \le t < 0.6 \\
3.93 & \text{if } 0.6 \le t \le 1
\end{cases} \quad (7.2)$$
In this case the coefficient of determination is similar to the previous one,
Ω = 0.8818.
In both models, the values of the parameters estimated for this situation allow concluding that the variables LEd, Div and Img have a direct influence on the logarithm of the number of violent crimes, while the percentage of employed people has an opposite effect.

As concerns the DSD Model I, from Proposition 4.1 we may conclude that, for the set of states to which the data refer, when the symbolic mean of the percentage of divorced population increases by 1% and the other variables remain constant, the symbolic mean of LVC increases by 0.1720. The percentage of divorced population is the variable that influences the predicted histogram-valued variable the most. This interpretation extends to the values of the parameters associated with the other explicative variables, and similarly for the DSD Model II. For example, from Proposition 4.7 we may interpret the parameter value 0.016 (the coefficient of $\Psi^{-1}_{LEd(j)}(t)$) as the increase in the symbolic mean of LVC when the percentage of people aged 25 and over with less than 9th grade education increases by 1%, assuming all other variables constant.
To evaluate the quality of the predictions, the Root Mean Square Error values with
and without the Leave-One-Out method were calculated. The results are presented
in Table 7.7.
Table 7.7: Comparison of the Root Mean Square Error values when the Leave-One-Out method is not/is applied together with the DSD Models for the histogram-valued variables in Section 7.2.

            Without Leave-One-Out method      With Leave-One-Out method
  Models    RMSE_L   RMSE_U   RMSE_M          RMSE_L   RMSE_U   RMSE_M
  DSD I     0.5571   0.4233   0.4477          0.6038   0.4900   0.5030
  DSD II    0.5164   0.3992   0.4237          0.5729   0.4672   0.4834
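A numerical sketch of RMSE_M is given below, under the assumption (based on Section 6.3) that it is the square root of the mean, over units, of the squared Mallows-type distance between observed and predicted quantile functions; the discretization and all names are my own choices, not thesis code:

```python
import numpy as np

def rmse_m(observed, predicted, n_grid=1000):
    """observed/predicted: lists of quantile functions, callable on arrays of t."""
    t = np.linspace(0.0, 1.0, n_grid)
    total = 0.0
    for psi_obs, psi_hat in zip(observed, predicted):
        diff2 = (psi_obs(t) - psi_hat(t)) ** 2   # pointwise squared gap
        total += np.trapz(diff2, t)              # approx. integral over [0, 1]
    return np.sqrt(total / len(observed))
```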
• Other symbolic methods

For this example, we also predicted the logarithm of the number of violent crimes using the linear regression models proposed by Billard and Diday (2006) and by Irpino and Verde (Irpino and Verde (2012), Verde and Irpino (2010)). In all these methods, the predicted histograms were built using the process described in Section 3.2. However,
in the methods that use the symbolic covariance definitions proposed by Billard and Diday, we obtain results that are not histograms, because in some subintervals the lower bound is greater than the upper bound. In each such case, to build the subintervals, the lowest obtained value should be used for the lower bound and the highest for the upper bound. In this way, we obtain histograms where the subintervals are neither ordered nor disjoint, but which may be rewritten according to the process in Appendix A.

In Tables 7.8 and 7.9 it is possible to compare the expressions of the several methods and their respective performance. The LVC distributions predicted by these methods are presented in Figures 7.7 and 7.8.
Table 7.8: Comparison of the expressions of the symbolic linear regression models that predict the number of violent crimes in USA states.

  DSD I:  $\Psi^{-1}_{\widehat{LVC}(j)}(t) = 3.93 + 0.001\,\Psi^{-1}_{LEd(j)}(t) - 0.01\,\Psi^{-1}_{Emp(j)}(1-t) + 0.21\,\Psi^{-1}_{Div(j)}(t) - 0.04\,\Psi^{-1}_{Div(j)}(1-t) + 0.02\,\Psi^{-1}_{Img(j)}(t)$
  DSD II: $\Psi^{-1}_{\widehat{LVC}(j)}(t) = \Psi^{-1}_{Constant}(t) + 0.02\,\Psi^{-1}_{LEd(j)}(t) - 0.01\,\Psi^{-1}_{Emp(j)}(1-t) + 0.16\,\Psi^{-1}_{Div(j)}(t) + 0.02\,\Psi^{-1}_{Img(j)}(t)$, with $\Psi^{-1}_{Constant}(t)$ in Expression (7.2)
  CM:     $\widehat{LVC}(j) = 6.01 + 0.09\,LEd(j) - 0.05\,Emp(j) + 0.11\,Div(j) + 0.01\,Img(j)$
  BD:     $\widehat{LVC}(j) = 4.40 + 0.03\,LEd(j) - 0.02\,Emp(j) + 0.11\,Div(j) + 0.02\,Img(j)$
  VI:     $\Psi^{-1}_{\widehat{LVC}(j)}(t) = 6.01 + 0.09\,\overline{LEd}(j) - 0.05\,\overline{Emp}(j) + 0.11\,\overline{Div}(j) + 0.01\,\overline{Img}(j) + 0.01\,\Psi^{-1}_{Emp^{c}(j)}(t) + 0.32\,\Psi^{-1}_{Div^{c}(j)}(t) + 0.01\,\Psi^{-1}_{Img^{c}(j)}(t)$
Table 7.9: Performance of the symbolic linear regression models that predict the number of violent crimes in USA states.

  Models    RMSE_L   RMSE_U   RMSE_M
  DSD I     0.5571   0.4233   0.4477
  DSD II    0.5164   0.3992   0.4237
  CM        0.9182   0.5617   0.6717
  BD        0.7927   0.4665   0.5801
  VI        0.5214   0.3444   0.3933
As we observed in the previous example, the linear regression models proposed by Billard and Diday are those that present the worst fits, while the behaviors of the DSD Models and of the Verde and Irpino Model are similar. Comparing the results in Table 7.9 with the RMSE obtained when the Leave-One-Out method was applied to the DSD Models (Table 7.7), we arrive at the same conclusions.

The advantage of studying a linear relation between data with variability is the possibility of predicting the distribution of the values of the response variable, instead of only one real value as in a classical study. In this example, the predicted distribution of the logarithm of the number of violent crimes for a given state is more informative about the criminality in that state than a single descriptive measure (e.g., the mean).
Classic approach

The classical alternative to study the logarithm of the number of violent crimes in each USA state would be to reduce the records of all communities of each state to a single value, for example the mean, and perform a classical linear regression study. In this case, the variability of the records would be lost and the predicted results would be less informative. Considering the mean of the records associated with the communities of each state, the classical model is the following:

$$\widehat{LVC}(j) = 6.5817 + 0.0705\,LEd(j) - 0.0503\,Emp(j) + 0.0933\,Div(j) + 0.0177\,Img(j) \quad (7.3)$$

As in the DSD Models, the prediction of the mean values of the logarithm of the number of violent crimes is directly influenced by three of the four explicative variables in the model, and Emp is again the only variable in inverse relation with LVC. It is the mean value of the percentage of divorced population that most influences the mean values of the logarithm of the number of violent crimes in the states of the USA.

The classical coefficient of determination of the model is r² = 0.75. So, the linear relation between the mean values presents a worse fit to the data than the symbolic linear relation defined by the DSD Models.
Classic-symbolic approach

In this case we apply the classical linear regression to all values observed for each first level unit, the communities, and a posteriori aggregate the predicted values in empirical distributions according to the higher level units, the states. This approach, named classic-symbolic approach, is denoted by CS.

The first step is to apply the classical linear regression model to the microdata. The obtained model is the following:

$$\widehat{LVC}(j) = 4.5875 + 0.0140\,LEd(j) + 0.035\,Emp(j) + 0.0638\,Div(j) + 0.0064\,Img(j).$$

The coefficient of determination associated with this model, r² = 0.063, shows that there is no linear relation in the microdata.

When we aggregate the predicted values of the logarithm of the number of violent crimes by state, we obtain distributions that are not close to the observed ones. With this approach the Root Mean Square Errors are RMSE_L = 1.2153, RMSE_U = 0.8649 and RMSE_M = 0.9766. Comparing these values with the errors in Table 7.9, we may observe that the symbolic approach allows for much better results than the classic-symbolic approach. This is corroborated by the behavior observed in Figures 7.7 and 7.8, where we may compare the observed distributions of each state with the respective predicted distributions obtained by all presented approaches.
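A compact sketch of this classic-symbolic pipeline (all names are mine; the microdata are assumed to be given as NumPy arrays over communities) could be:

```python
import numpy as np

def cs_predict(X, y, states, levels=(0.2, 0.4, 0.6, 0.8)):
    """Fit OLS on community microdata, then aggregate predictions per state."""
    X = np.asarray(X); y = np.asarray(y); states = np.asarray(states)
    X1 = np.column_stack([np.ones(len(y)), X])        # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)     # classical OLS fit
    y_hat = X1 @ beta                                 # community-level predictions
    return {s: np.quantile(y_hat[states == s], levels)
            for s in np.unique(states)}               # empirical distribution per state
```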
7.2.3. The prediction of the violent crimes in the state of Arkansas

Let us now consider one state that was not used to build the model, the state of Arkansas (AR). It is possible to predict the distribution of violent crimes since the distributions of the explicative variables for this state are known. The histograms predicted by the DSD Models for the state of Arkansas are:

DSD Model I:
H_VC(AR) = {[68.37, 203.53], 0.2; [203.53, 360.93], 0.2; [360.93, 652.1], 0.2; [652.1, 1153.9], 0.2; [1153.9, 2419.46], 0.2}

DSD Model II:
H_VC(AR) = {[36.7, 230.1], 0.2; [230.1, 415.7], 0.2; [415.7, 728.5], 0.2; [728.5, 1203.8], 0.2; [1203.8, 2285.4], 0.2}
[Figure 7.7: Observed and predicted quantile functions of LVC for the states AL, CA, CT, FL, GA, IN, MA, MO, NC, NJ, NY and OH, considering the approaches: symbolic (DSD I, DSD II, CM, BD, VI), classic, classic-symbolic (part 1).]
[Figure 7.8: Observed and predicted quantile functions of LVC for the states OK, OR, PA, TN, TX, VA, WA and WI, considering the approaches: symbolic (DSD I, DSD II, CM, BD, VI), classic, classic-symbolic (part 2).]
For the state of Arkansas, with the classic approach in Expression (7.3), the mean of the number of violent crimes is estimated at 633.4 per 100 000 inhabitants. With this approach the information about the behavior of the predicted variable is obviously poorer. The distributions predicted by the DSD Models show that there is a small number of communities in this state where the number of violent crimes per 100 000 inhabitants is higher than 1000. The prediction shows that about 50% of the observed communities in this state present fewer than 500 violent crimes. Also, based on both models, it may be observed that the highest numbers of crimes, above approximately 1200 violent crimes per 100 000 inhabitants, occur in only 20% of the observed communities.

Figure 7.9 illustrates the estimated and observed quantile functions of LVC for this state, considering the studied approaches.
[Figure 7.9: Observed and estimated quantile functions of the variable LVC in the state of Arkansas (LVC observed; LVC_DSDI, LVC_DSDII, LVC_CS and mean LVC predicted).]
7.2.4. Predicted Quantiles
The DSD Models predict the quantile functions associated with each unit j of the response variable from the quantile functions that are the observations of the explicative variables. As these models predict quantiles, a comparison with classical quantile regression is meaningful. However, the comparison between the two approaches is far from obvious. The first difference is that the DSD Models relate histogram-valued variables, whose observations are distributions, whereas quantile regression works with classical variables. In the DSD Models the same model (with the same coefficients) allows predicting entire quantile functions, whereas when quantile regression is used, each quantile of the response variable is predicted with a different model. In this case, it is possible to predict several quantiles and afterwards, if possible, build quantile functions.

Another process, which uses classical methods and does not apply quantile regression, may be used to predict the quantiles of the response variable. With this approach, we calculate the same quantile for all variables and units. Afterwards, we predict the quantiles of the response variable from the respective quantiles associated with the explicative variables, applying the classical linear regression model. Also in this case, it is necessary to determine one model for each quantile.

When we calculate the quantiles in this way, building the quantile functions associated with each unit j may not be possible: sometimes we obtain values for one quantile that are greater than the value of an upper quantile. This limitation is similar to what occurs in the MinMax or the Center and Range methods proposed to study the linear regression between interval-valued variables.
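A quick helper (my own, purely illustrative) makes this check explicit: a set of separately predicted quantiles can only be assembled into a quantile function when it is non-decreasing:

```python
import numpy as np

def is_valid_quantile_sequence(q):
    """q: predicted quantiles in increasing level order, e.g. [Q0.2, Q0.4, Q0.6, Q0.8]."""
    q = np.asarray(q)
    return bool(np.all(np.diff(q) >= 0))

print(is_valid_quantile_sequence([4.1, 4.8, 5.3, 6.0]))  # True: can build a quantile function
print(is_valid_quantile_sequence([4.1, 5.3, 4.8, 6.0]))  # False: crossing quantiles
```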
Although the nature of the predicted quantiles obtained by the several processes is different, we illustrate the results obtained with classical methods for this example. The quantiles are predicted with quantile regression, considering two different approaches, and by applying classical regression to the quantiles calculated for all variables.
Quantile Regression

1st Approach - Quantile Regression 1 (QR1)

In this approach we apply classic Quantile Regression to the microdata, i.e., the first order units, the communities, are considered, and we predict the quantiles 0.2, 0.4, 0.6 and 0.8 of the logarithm of the number of violent crimes in the communities of the USA. The quantile regression models are the following:

$\widehat{Q}_{0.2}(LVC(j)) = 3.5108 + 0.0206\,LEd(j) + 0.0012\,Emp(j) + 0.0797\,Div(j) + 0.0094\,Img(j);$
$\widehat{Q}_{0.4}(LVC(j)) = 3.7772 + 0.0184\,LEd(j) + 0.0082\,Emp(j) + 0.0721\,Div(j) + 0.0112\,Img(j);$
$\widehat{Q}_{0.6}(LVC(j)) = 4.8348 + 0.0133\,LEd(j) + 0.0034\,Emp(j) + 0.0735\,Div(j) + 0.0060\,Img(j);$
$\widehat{Q}_{0.8}(LVC(j)) = 5.6198 + 0.0115\,LEd(j) + 0.0033\,Emp(j) + 0.0694\,Div(j) + 0.0024\,Img(j).$
With the objective of comparing results, and since with the DSD Models the obtained quantiles concern the states and not the communities, the predicted quantiles may be aggregated a posteriori by state. Thereby, each state may be associated, for example, with the mean of the quantiles predicted for its communities.
2nd Approach - Quantile Regression 2 (QR2)

In the second approach, as Quantile Regression is a classical method, it works with classical variables. As the goal is to predict the logarithm of the number of violent crimes by state, and the records of the variables in the data table are for communities, one classical possibility consists in reducing the set of values associated with each second order unit, the states, to a unique value, for example the mean. After this transformation of the data, quantile regression is applied to the mean values of the variables, and the quantiles of the logarithm of the number of violent crimes associated with each state are predicted.

In this case, the quantiles 0.2, 0.4, 0.6 and 0.8 of the mean of the logarithm of the number of violent crimes are predicted by the models:

$\widehat{Q}_{0.2}(LVC(j)) = 6.0937 + 0.0692\,LEd(j) - 0.0534\,Emp(j) + 0.1424\,Div(j) + 0.0158\,Img(j);$
$\widehat{Q}_{0.4}(LVC(j)) = 6.2198 + 0.0790\,LEd(j) - 0.0479\,Emp(j) + 0.0933\,Div(j) + 0.0179\,Img(j);$
$\widehat{Q}_{0.6}(LVC(j)) = 6.6901 + 0.0620\,LEd(j) - 0.0510\,Emp(j) + 0.0685\,Div(j) + 0.0250\,Img(j);$
$\widehat{Q}_{0.8}(LVC(j)) = 3.8647 - 0.0014\,LEd(j) - 0.0174\,Emp(j) + 0.1385\,Div(j) + 0.0466\,Img(j).$
Classical Regression applied to quantiles (CRQ)

In this case, the first step is to calculate the same quantiles for all variables and units. For each quantile, we define a classical linear regression model from which it is possible to predict the respective quantile of the response variable. Notice that this approach assumes that the corresponding quantiles of the different variables occur together. A sketch of this process is given after the models below.

The quantiles 0.2, 0.4, 0.6 and 0.8 predicted for the logarithm of the number of violent crimes are calculated by the following classical linear regression models:

$\widehat{LVC}_{0.2}(j) = 4.8335 + 0.0194\,LEd_{0.2}(j) - 0.0304\,Emp_{0.2}(j) + 0.0915\,Div_{0.2}(j) + 0.0436\,Img_{0.2}(j);$
$\widehat{LVC}_{0.4}(j) = 6.5490 + 0.0446\,LEd_{0.4}(j) - 0.0505\,Emp_{0.4}(j) + 0.0473\,Div_{0.4}(j) + 0.0398\,Img_{0.4}(j);$
$\widehat{LVC}_{0.6}(j) = 11.4450 + 0.0288\,LEd_{0.6}(j) - 0.1085\,Emp_{0.6}(j) - 0.0001\,Div_{0.6}(j) + 0.0313\,Img_{0.6}(j);$
$\widehat{LVC}_{0.8}(j) = 9.8643 - 0.0780\,LEd_{0.8}(j) - 0.0833\,Emp_{0.8}(j) + 0.1435\,Div_{0.8}(j) - 0.0114\,Img_{0.8}(j).$
Because the types of variables used in the several methods are different, Figures 7.10 to 7.13 show the quantiles 0.2, 0.4, 0.6 and 0.8 of the logarithm of the number of violent crimes predicted for all states, using the several models presented in this section.
[Figure 7.10: Predicted quantile 0.2 of LVC for all states (observed values and predictions with DSD I, DSD II, CS, QR 1, QR 2 and CRQ).]
[Figure 7.11: Predicted quantile 0.4 of LVC for all states (observed values and predictions with DSD I, DSD II, CS, QR 1, QR 2 and CRQ).]
Table 7.10 presents the Root Mean Square Errors between the observed and predicted values of LVC for the four quantiles. These results show that, in most cases, the quantiles predicted with the DSD Models, which use the same model to predict all quantiles, are the closest to the observed quantiles.
[Figure 7.12: Predicted quantile 0.6 of LVC for all states (observed values and predictions with DSD I, DSD II, CS, QR 1, QR 2 and CRQ).]
[Figure 7.13: Predicted quantile 0.8 of LVC for all states (observed values and predictions with DSD I, DSD II, CS, QR 1, QR 2 and CRQ).]
Table 7.10: Comparison between the observed and predicted LVC quantiles.

  Approaches   RMSE_Q0.2   RMSE_Q0.4   RMSE_Q0.6   RMSE_Q0.8
  DSD I        0.3507      0.3729      0.4285      0.4053
  DSD II       0.3312      0.3550      0.4008      0.3862
  CS           0.7557      0.5270      0.5244      0.6909
  QR 1         0.4589      0.4697      0.4821      0.4739
  QR 2         0.6135      0.3040      0.4650      0.7297
  CRQ          0.8338      0.4392      0.7071      0.4355
7.3. The relation between time of unemployment and years of employment
7.3.1. The data

The 2008 Portuguese Labour Force Survey provides individual information about the people living in Portugal. The original data table that we analyzed contains, among others, demographic variables (such as gender, marital status, age, level of education, employer, ...) and geographical location (region, community, ...). In this study we are interested in analyzing whether the time of unemployment (in months) is related to the time (in years) that people have worked previously. However, we are not interested in performing this study for each individual, as it may be of greater interest to determine what happens in certain categories, such as young women who live in the North of Portugal. Since each of these categories consists of several individuals, the observed "value" is no longer a single point but an interval. So, in this case, the symbolic data table is built considering that the units (higher units) are classes of individuals obtained by crossing gender×region×age×education. There are two genders (female (F), male (M)), four regions (North (N), Center (C), Lisbon and Tagus Valley (L), South (S)), three age groups (15 to 24 (A1), 25 to 44 (A2), 45 to 64 (A3)) and three levels of education (basic education (B), secondary education (S) and graduate (G)). In total we have 2×4×3×3 = 72 possible classes (categories). The time of unemployment and the time of work before unemployment are now interval-valued variables.

Table 7.11 represents the symbolic table that results from the original data table, for the variables E (time of employment before unemployment) and U (time of unemployment). In this study, only 58 classes (units) were created, since no cases correspond to the remaining 14.

The main goal of this study is to analyze the linear relation between the interval-valued variables logarithm of the time of unemployment, LNU (LNU = ln(U + 2)), and time of activity before unemployment, E, considering as observed units (higher units) the classes of individuals previously described.
Table 7.11: Symbolic data table where the two variables, time of activity before unemployment (E) and time of unemployment (U), are interval-valued variables.

  Units     U        E       | Units     U        E       | Units     U        E
  F×C×A1×B  [3;49]   [0;4]   | F×N×A3×S  [0;123]  [23;35] | M×L×A3×B  [1;244]  [22;57]
  F×C×A1×S  [1;6]    [0;2]   | F×S×A1×B  [1;52]   [1;7]   | M×L×A3×S  [2;65]   [25;50]
  F×C×A2×B  [2;147]  [2;34]  | F×S×A1×S  [1;36]   [0;9]   | M×L×A3×G  [7;44]   [28;40]
  F×C×A2×S  [3;61]   [5;22]  | F×S×A1×G  [1;13]   [0;1]   | M×N×A1×B  [1;33]   [0;18]
  F×C×A2×G  [4;16]   [0;15]  | F×S×A2×B  [1;101]  [0;33]  | M×N×A1×S  [1;15]   [1;4]
  F×C×A3×B  [1;108]  [23;47] | F×S×A2×S  [0;96]   [0;25]  | M×N×A2×B  [1;97]   [1;35]
  F×L×A1×B  [1;18]   [1;7]   | F×S×A2×G  [1;46]   [1;27]  | M×N×A2×S  [0;21]   [1;21]
  F×L×A1×S  [1;19]   [1;11]  | F×S×A3×B  [1;265]  [8;52]  | M×N×A2×G  [2;100]  [2;14]
  F×L×A2×B  [0;156]  [3;34]  | F×S×A3×S  [3;26]   [20;37] | M×N×A3×B  [0;159]  [15;52]
  F×L×A2×S  [2;69]   [3;25]  | M×C×A1×B  [3;6]    [0;8]   | M×N×A3×S  [9;35]   [20;40]
  F×L×A2×G  [0;63]   [0;22]  | M×C×A1×S  [2;3]    [0;4]   | M×N×A3×G  [9;19]   [31;36]
  F×L×A3×B  [1;320]  [29;58] | M×C×A2×B  [2;97]   [10;28] | M×S×A1×B  [1;35]   [0;10]
  F×L×A3×S  [2;162]  [22;36] | M×C×A2×G  [7;13]   [4;10]  | M×S×A1×S  [4;63]   [1;6]
  F×L×A3×G  [8;27]   [12;32] | M×C×A3×B  [4;98]   [30;51] | M×S×A2×B  [0;157]  [4;35]
  F×N×A1×B  [1;61]   [0;9]   | M×C×A3×S  [20;38]  [25;39] | M×S×A2×S  [7;24]   [0;9]
  F×N×A1×S  [1;21]   [0;3]   | M×L×A1×B  [2;20]   [0;10]  | M×S×A2×G  [4;18]   [5;20]
  F×N×A2×B  [1;325]  [6;32]  | M×L×A1×S  [4;14]   [1;9]   | M×S×A3×B  [1;274]  [26;56]
  F×N×A2×S  [2;88]   [2;25]  | M×L×A2×B  [1;194]  [0;31]  | M×S×A3×S  [11;26]  [28;42]
  F×N×A2×G  [2;80]   [1;25]  | M×L×A2×S  [4;133]  [3;23]  |
  F×N×A3×B  [1;372]  [11;57] | M×L×A2×G  [6;65]   [4;16]  |
7.3.2. Prediction with the DSD Model I

The prediction, with the DSD Model I, of the quantile functions for the interval-valued variable LNU is, in this case:

$$\Psi^{-1}_{\widehat{LNU}(j)}(t) = 2.2277 + 0.0779\,\Psi^{-1}_{E(j)}(t) - 0.0503\,\Psi^{-1}_{E(j)}(1-t), \quad t \in [0,1].$$

Equivalently, the predicted interval for each unit j is given by

$$\left[2.2277 + 0.0276\,c_{E(j)} - 0.1282\,r_{E(j)},\; 2.2277 + 0.0276\,c_{E(j)} + 0.1282\,r_{E(j)}\right].$$
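As a worked illustration (my own arithmetic, using the class F×C×A1×B of Table 7.11, for which E(j) = [0; 4], so $c_{E(j)} = 2$ and $r_{E(j)} = 2$):

$$\widehat{LNU}(j) = \left[2.2277 + 0.0276(2) - 0.1282(2),\; 2.2277 + 0.0276(2) + 0.1282(2)\right] = \left[2.0265,\; 2.5393\right],$$

which, undoing the transformation LNU = ln(U + 2), corresponds to a predicted unemployment time of roughly [5.6; 10.7] months.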
As we interpreted in Section 5.1.1, the interval-valued variables E and LNU have a linear relation that tends to be direct, because the value estimated for the parameter α = 0.0779 is slightly higher than β = 0.0503. For the set of classes of individuals to which the data refer, when the symbolic mean of the time of activity before unemployment increases by one year, the symbolic mean of LNU (in months) increases by 0.0276. However, the relation described by the DSD Model I is not very strong: the value of the goodness-of-fit measure Ω deduced for the model is 0.7715 for these data. The scatter plot of these data may be observed in Figure 7.14(a). As we have a large number of units, a scatter plot representing the observed intervals of both variables by rectangles would be very hard to interpret; as such, each rectangle is represented by its diagonals instead.
[Figure 7.14: Scatter plots of the explicative interval-valued variable E and of the response variable LNU: (a) intervals E(j) versus observed LNU(j), j ∈ {1, ..., 58}; (b) intervals E(j) versus LNU(j) predicted with the DSD Model I.]
As we proved in Proposition 5.5, a perfect linear regression by the DSD Model I between
two interval-valued variables induces a perfect linear regression between the centers of
the intervals and also induces that the ratio (slope) of the ranges of the intervals is constant and equal for all observations. These behaviors are illustrated in the scatter plot in
Figure 7.14(b), that considers the intervals observed for the variable E and the predicted
intervals by the DSD Model I for the variable LN U.
7.3.3. Comparison of the predictions with different symbolic models
The purpose of this example is not only to illustrate the DSD Model I, but also to compare
the results with the DSD Model II and with other models that have been already proposed
(Lima Neto and De Carvalho (2010, 2008); Billard and Diday (2006, 2002, 2000)). In Tables
7.12 and 7.13 we present the obtained model expressions and the Root Mean Square Error
values.
266
CHAPTER 7. ANALYSIS OF DATA WITH VARIABILITY
Table 7.12: Comparison of the expressions of the symbolic linear regression models for interval-valued
variables in Table 7.11.
Models
DSD I
Expressions that allow predicting the intervals
Ψ−1
\
LN U (j)
−1
(t) = 2.23 + 0.08Ψ−1
E(j) (t) − 0.05ΨE(j) (1 − t)
cLN
= 2.23 + 0.03cE(j) and rLN
= 0.13rE(j)
\
\
U (j)
U (j)
Ψ−1
\
LN U (j)
DSD II
−1
−1
(t) = Ψ−1
Constant (t) + 0.06ΨE(j) (t) − 0.03ΨE(j) (1 − t)
Ψ−1
Constant (t) = 2.23 + 0.53(2t − 1) with t ∈ [0, 1]
cLN
= 2.23 + 0.03cE(j) and rLN
= 0.53 + 0.09rE(j)
\
\
U (j)
U (j)
CM
cLN
= 2.23 + 0.03cE(j)
\
U (j)
BD
\
LN
U (j) = 1.90 + 0.05E(j)
MinMax
I LN
= 1.22 + 0.02I E(j) and I LN
= 2.87 + 0.04I E(j)
\
\
U (j)
U (j)
CRM
cLN
= 2.23 + 0.03cE(j) and rLN
= 0.53 + 0.09rE(j)
\
\
U (j)
U (j)
CCRM
cLN
= 2.23 + 0.03cE(j) and rLN
= 0.53 + 0.09rE(j)
\
\
U (j)
U (j)
In this example, the CRM and CCRM are the same because in the CRM the parameters
estimated for the half ranges are all non-negative, i.e. the constrains imposed in the CCRM
to these parameters are met. We can also observe that the linear regression induced by
DSD Models for the centers of the intervals is the one obtained by the models when a
linear regression between the centers is considered. Moreover, the parameters in the linear
regression model induced for the half ranges by the DSD Model II, are the same as the
ones estimated for the classical linear regression model between half ranges, used in CRM
(and CCRM).
The values of the RM SE allow comparing the predicted and the observed intervals of
the response variable LN U. These measures are not deduced from the model, therefore
they may be used for an independent comparison of results. Observing the values of the
RM SE, we may conclude that the DSD Model II and CRM (and CCRM) have similar
results, which is not surprising because the linear regression between the centers and half
ranges are the same. The results for these measures, when the DSD Model I is considered,
are similar to the ones obtained with the methods referred above.
7.3. TIME OF UNEMPLOYMENT FROM YEARS OF EMPLOYMENT
267
Table 7.13: Performance of the symbolic linear regression models that predict the logarithm of the time of
unemployment.
RM SEL
RM SEU
RM SEM
0.5745
0.6710
0.4679
0.5866
0.6829
0.4797
0.4458
0.6541
0.4397
0.4597
0.6740
0.4539
CM
1.1622
1.3146
0.7759
BD
1.0368
1.1499
0.7255
MinMax
0.4725
0.7329
0.4621
CRM
0.4457
0.6541
0.4397
CCRM
0.4457
0.6541
0.4397
Models
DSD I
DSD I LOO
6
DSD II
DSD II LOO
6
The RM SE values when the Leave-One-Out method is applied together with DSD
Models are very similar. It shows a good behavior of the DSD Models in predictions of
intervals (see Table 7.13).
It is important to underline that the main goal of the interval-related work is not to
propose a model that provides better results than the previous models. The DSD Model
for interval-valued variables emerges from the particularization of a more general model,
the DSD Linear Regression Model for histogram-valued variables. The advantage of the
DSD Model when applied to interval-valued variables is that it allows working with variables
and takes into consideration a distribution within the intervals.
In Figure 7.15 to 7.19, the observed and predicted quantile functions are represented,
with all models presented in Table 7.12, for the 58 studied observations.
6
DSD I LOO and DSD II
method.
LOO
means that the DSD Models were applied together with the Leave-One-Out
268
CHAPTER 7. ANALYSIS OF DATA WITH VARIABILITY
Observation 1
Observation 2
4
Observation 3
3
5.5
5
3.5
4.5
2.5
4
3
3.5
2.5
2
3
2.5
2
2
1.5
1.5
1.5
1
0
0.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
0
0.2
Observation 5
Observation 4
4.5
0.4
0.6
0.8
1
0.8
1
0.8
1
Observation 6
4
5
4.5
4
3.5
4
3.5
3
3.5
3
2.5
3
2.5
2.5
2
2
2
1.5
1.5
0
1.5
0.2
0.4
0.6
0.8
1
0
0.2
Observation 7
0.4
0.6
0.8
1
0
0.2
Observation 8
0.4
0.6
Observation 9
3.5
3.5
5.5
3
3
4.5
2.5
2.5
3.5
2
2
2.5
1.5
1.5
1.5
5
4
3
2
1
0
0,2
0,4
0,6
0,8
1
0
Observation 10
0.2
0.4
0.6
0.8
1
0
Observation 11
4.5
4.5
4
4
0.2
0.4
0.6
Observation 12
6
5
3.5
3.5
4
3
3
2.5
3
2.5
2
2
2
1.5
1
1.5
0
LNU observed
1
0.2
0.4
0.6
0.8
LNUDSD I predicted
1
0
0.2
0.4
0.6
0.8
LNUDSD II predicted
LNUCM predicted
1
0
LNUBD predicted
0.2
0.4
0.6
0.8
LNUMinMax predicted
1
LNUCRM predicted
Figure 7.15: Observed and predicted quantile functions consider all methods presented in Table 7.12 (part1).
7.3. TIME OF UNEMPLOYMENT FROM YEARS OF EMPLOYMENT
Observation 13
Observation 14
5.5
Observation 15
4.5
4.5
4
4
3.5
3.5
3.5
3
3
3
2.5
2.5
2
2
1.5
1.5
5
269
4.5
4
2.5
2
1.5
0
0.2
0.4
0.6
0.8
1
0
0.2
Observation 16
0.4
0.6
0.8
1
0
0.2
Observation 17
3.5
0.4
0.6
0.8
1
0.8
1
0.8
1
Observation 18
6
4.5
5.5
4
3
5
3.5
4.5
2.5
4
2
3
3.5
2.5
3
1.5
2.5
2
2
1
1.5
1.5
0
0.2
0.4
0.6
0.8
1
0
0.2
Observation 19
0.4
0.6
0.8
1
0
0.2
Observation 20
4.5
7
4
6
3.5
5
0.4
0.6
Observation 21
5
4.5
4
3.5
3
4
3
2.5
3
2.5
2
2
2
1.5
1.5
0
1
0.2
0.4
0.6
0.8
1
0
Observation 22
1
0.2
0.4
0.6
0.8
1
0
Observation 23
4
4
3.5
3.5
3
3
2.5
2.5
0.2
0.4
0.6
Observation 24
3
2.8
2.6
2.4
2.2
2
1.8
2
2
1.5
1.5
1.6
1.4
1.2
0
LNU observed
0.2
0.4
LNU
DSD I
0.6
0.8
predicted
1
LNU
DSD II
0
0.2
0.4
0.6
0.8
1
predicted
LNU predicted
LNU
CM
BD
0
predicted
0.2
0.4
0.6
0.8
1
LNU
predicted
LNU
MinMax
CRM
predicted
Figure 7.16: Observed and predicted quantile functions consider all methods presented in Table 7.12 (part2).
270
CHAPTER 7. ANALYSIS OF DATA WITH VARIABILITY
Observation 25
Observation 26
Observation 27
5
5
4.5
4.5
4.5
4
4
4
3.5
3.5
3
3
2.5
2.5
2
2
1.5
1.5
1
1
3.5
3
2.5
2
0
0.2
0.4
0.6
0.8
1
0
1.5
1
0.2
Observation 28
0.4
0.6
0.8
1
0
0.2
Observation 29
6
4.5
5
4
4
3.5
3
3
2
2.5
1
2
0.4
0.6
0.8
1
0.8
1
0.8
1
Observation 30
3.5
3
2.5
2
0
0.2
0.4
0.6
0.8
1
0
1.5
0.2
Observation 31
0.4
0.6
0.8
1
0
0.2
Observation 32
3.2
0.6
Observation 33
5
3
0.4
3.5
4.5
2.8
3
4
2.6
3.5
2.4
2.2
2.5
3
2
2
2.5
1.8
2
1.6
1.5
1.5
1.4
0
0.2
0.4
0.6
0.8
1
0
Observation 34
0.2
0.4
0.6
0.8
1
0
Observation 35
5.5
5
5
4.5
0.2
0.4
0.6
Observation 36
3.5
3
4.5
4
4
3.5
2.5
3.5
3
3
2
2.5
2.5
1.5
2
2
0
LNU observed
0.2
0.4
0.6
0.8
LNUDSD I predicted
1
0
0.2
0.4
0.6
0.8
LNUDSD II predicted
LNUCM predicted
1
0
LNUBD predicted
0.2
0.4
0.6
0.8
LNUMinMax predicted
1
LNUCRM predicted
Figure 7.17: Observed and predicted quantile functions consider all methods presented in Table 7.12 (part3).
7.3. TIME OF UNEMPLOYMENT FROM YEARS OF EMPLOYMENT
Observation 37
Observation 38
3.5
5.5
3
4.5
271
Observation 39
5
5
4.5
4
4
2.5
3.5
2
2.5
1.5
1.5
3.5
3
3
2.5
2
2
1.5
1
0
0.2
0.4
0.6
0.8
1
0
0.2
Observation 40
0.4
0.6
0.8
1
0
0.2
Observation 41
4.5
0.6
0.8
1
0.8
1
0.8
1
Observation 42
6
5.5
5.5
5
4
5
3.5
0.4
4.5
4.5
4
4
3
3.5
3.5
3
2.5
3
2.5
2.5
2
2
2
1.5
1.5
1.5
0
0.2
0.4
0.6
0.8
1
0
0.2
Observation 43
0.4
0.6
0.8
1
0
0.2
Observation 44
5
4
4.5
0.6
Observation 45
3.5
3.5
0.4
3
4
3
2.5
3.5
2.5
3
2
2
2.5
0
1.5
1.5
2
0.2
0.4
0.6
0.8
1
0
0.2
Observation 46
5
0.4
0.6
0.8
1
0
Observation 47
4
0.2
0.4
0.6
Observation 48
5
4.5
4.5
3.5
4
3.5
4
3
3.5
3
2.5
3
2.5
2.5
2
2
2
1.5
1.5
1.5
1
0
LNU observed
0.2
0.4
0.6
0.8
1
0
LNUDSD I predicted
LNUDSD II predicted
0.2
0.4
0.6
0.8
1
0
LNUCM predicted
LNUBD predicted
0.2
0.4
0.6
0.8
1
LNUMinMax predicted
LNUCRM predicted
Figure 7.18: Observed and predicted quantile functions consider all methods presented in Table 7.12 (part4).
272
CHAPTER 7. ANALYSIS OF DATA WITH VARIABILITY
Observation 49
Observation 50
6
Observation 51
5
4.5
4.5
5
4
4
4
3.5
3.5
3
3
3
2
2.5
2.5
1
0
2
2
0.2
0.4
0.6
0.8
1
0
0.2
Observation 52
0.4
0.6
0.8
1
0
0.2
Observation 53
4
0.4
0.6
0.8
1
0.8
1
0.8
1
Observation 54
4.5
5.5
5
4
3.5
4.5
3.5
4
3
3.5
3
2.5
3
2.5
2.5
2
2
2
1.5
1.5
1.5
1
0
0.2
0.4
0.6
0.8
1
0
0.2
Observation 55
0.4
0.6
0.8
1
0
0.2
Observation 56
5.5
0.4
0.6
Observation 57
4
4
3.5
3.5
3
3
2.5
2.5
2
2
1.5
1.5
5
4.5
4
3.5
3
2.5
2
1.5
1
0
0.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
Observation 58
5
4.5
4
3.5
3
2.5
2
0
LNU observed
0.2
0.4
0.6
LNUDSD I predicted
0.8
1
LNUDSD II predicted
LNUCM predicted
LNUBD predicted
LNUMinMax predicted
LNUCRM predicted
Figure 7.19: Observed and predicted quantile functions consider all methods presented in Table 7.12 (part5).
7.4. PREDICTED BURNED AREA OF FOREST FIRES
273
7.4. Predicted burned area of forest fires, in the northeast region of Portugal
7.4.1. The data
This study considers forest fire data
from the Montesinho natural park, in the
northeast region of Portugal. The data may
be found in Cortez and Morais (2008). The
original databases used in the experiments
were built from data collected from January
2000 to December 2003 and using two
sources. The first database was collected
by the responsible for the Montesinho fire
occurrences. Several features were registered: date, spatial location, the type of
Figure 7.20: Localization of Montesinho natural park.
vegetation involved, the six components of
the FWI system and the total burned area. The second database was collected by the
Bragança Polytechnic Institute, containing several weather observations that were recorded
by a meteorological station located in the center of the Montesinho park. The details are
described in Cortez and Morais (2007).
For this study we selected the response variable area (the forest burned area (in ha))
and three explicative variables: temp (temperature in Celsius degrees); wind (wind speed
in km/h); rh (relative humidity in percentage). As in the classical study of Cortez and
Morais (2007), the response variable area was transformed with a ln(x + 1) function and
we represented it as LN area. To build the symbolic data (macrodata) we aggregated
the information by months. The units (higher units) of this study are the months and the
observations of the variables temp, wind, rh and LN area associated with each month
are organized in intervals. Alternatively we may organize the data as classical data, where
to each month corresponds the mean value of the records associated with each variable. To
build these macrodata we considered only the months and the records in which forest fires
occurred. For this reason January and November were eliminated. Table 7.14 presents the
data with the records organized in two different ways. In even columns we have symbolic
274
CHAPTER 7. ANALYSIS OF DATA WITH VARIABILITY
variables because the set of records associated with each variable were aggregated by
month. In odd columns we register the logarithm of the total burned area in each month,
LN areaT, and the mean values of the temp, wind and rh associated with each month
and as such, we have classical variables.
Table 7.14: Data with information about the total burned area of forest fire and other four variables: LN area,
temp, wind and rh organized by month.
Months
LN area
LN areaT
temp
temp
wind
wind
rh
rh
Feb
[0.56; 1.60]
4.83
[4.6; 12.4]
7.6
[0.9; 9.4]
4.6
[35; 82]
60.6
Mar
[0.51; 1.53]
5.46
[5.3; 17]
12.9
[0.9; 9.4]
4.8
[26; 70]
40.7
Apr
[0.90; 1.64]
4.38
[5.8; 13.7]
9.1
[3.1; 9.4]
6.0
[33; 64]
51.3
May
[1.54; 1.54]
3.65
[18; 18]
18
[4; 4]
4
[40; 40]
40
June
[0.50; 1.66]
4.60
[14.3; 28]
21.6
[1.8; 9.4]
4.5
[34; 79]
45.3
July
[0.27; 1.89]
6.13
[11.2; 33.3]
22.6
[0.4; 8.9]
3.6
[22; 88]
43.7
Aug
[0.08; 2.03]
7.74
[11.2; 33.3]
22.6
[0.4; 8.9]
4.1
[22; 88]
43.7
Sep
[0.25; 2.08]
8.03
[10.1; 29.6]
19.6
[0.9; 7.6]
3.6
[15; 78]
42.5
Oct
[1.05; 1.59]
4.60
[16.1; 20.2]
18.12
[2.7; 4.5]
3.7
[25; 45]
36.6
Dec
[1.05; 1.45]
4.79
[2.2; 5.1]
5.7
[4.9; 8.5]
7.3
[21; 61]
39.0
7.4.2. Linear regression studies with macrodata
7.4.2.1. Non symbolic approaches
Classic approach
Considering the values of the total burned area and the mean values of the explicative
variables associated with each month, in Table 7.14, we perform classical linear regression,
that allows predicting the value of the total burned area from the mean values of the three
weather-related variables: temp, wind and rh. So, in this case we lost the variability of the
data and the predicted results are less informative.
The classical linear regression model for these data is the following:
LN\
areaT (j) = 2.1525 + 0.1203temp(j) + 0.0571wind(j) + 0.0249rh(j).
(7.4)
7.4. PREDICTED BURNED AREA OF FOREST FIRES
275
The classical coefficient of determination of the model is r 2 = 0.2150. So, this value
shows that only 21.5% of the variance of the value of the total burned area is explained by
the variations of the weather-related variables. Although burned area of the forest seems
to be influenced by the weather factors in the model, in this case it is not the classical linear
regression that better explains the relations of these variables.
Analyzing the parameters of the model defined in Expression (7.4), it is the temperature,
the weather factor that has a larger impact on the burned forest area.
Classic-symbolic approach
Another possible alternative to analyze data with variability is to apply classical linear
regression to all values observed for each first level unit, the microdata. The obtained model
is the following:
\
LN
area(j) = 1.2364 − 0.0053temp(j) + 0.0099wind(j) − 0.0026rh(j).
(7.5)
The coefficient of determination associated with this model shows that there is no linear
relation between the variables in the microdata, r2 = 0.012.
In this approach, the first step is the prediction of the burned area. Afterwards, the
predicted values are aggregated by month obtaining, for each month, the range of hectares
of the burned area. As after the aggregation the elements are of different nature, the
behavior between the variables may be different. However, comparing the observed and
predicted intervals obtained from the classic-symbolic approach (Figure 7.21), we verify
that we don’t obtain good predictions for the intervals. In this case the Root Mean Square
Error values are RM SEL = 0.5291, RM SEU = 0.6123 and RM SEM = 0.3623.
7.4.2.2. Symbolic approaches
As the DSD Model I is a particular case of the DSD Model II, it may happen in some
cases that the constant parameter in the DSD Model II (that is a interval) is an interval with
half range equal to zero, i.e, a real number. In this situation the DSD Model I and DSD
Model II are coincident. This is precisely what occurs in this example.
276
CHAPTER 7. ANALYSIS OF DATA WITH VARIABILITY
Figure 7.21: Observed and predicted intervals of burned area by month in Table 7.14.
Considering the conditions described above, the symbolic model that allows predicting
the intervals of LN area from the intervals of the explicative variables temp, wind and rh,
for each month j, is as follows:
−1
−1
−1
ΨLN
(t) = 1.8637 + 0.0224Ψ−1
temp(j) (t) − 0.0215Ψtemp(j) (1 − t) − 0.0143Ψrh(j) (1 − t)
\
area(j)
(7.6)
with t ∈ [0, 1].
7.4. PREDICTED BURNED AREA OF FOREST FIRES
277
The goodness-of-fit measure associated with this situation is Ω = 0.9202, that shows
that this linear regression model describes well the relation between the interval-valued
variables. Therefore, if we know the forecast for the temperature, wind and relative humidity
for one month, it is possible to predict the minimum and maximum area of burned forest
area.
Comparing the models in Expressions (7.4) and (7.6) we may observe that the models
that work with elements of different nature have a different behavior. As we may observe,
the influence of the explicative variables is not always similar. The parameter associated
with the mean of the temperature influences positively the prediction of the total burned
forest area, whereas in the DSD Model the influence of the range of the temperature in the
interval of the burned forest area is almost null.
Considering a symbolic approach, several other proposed methods can also be applied.
In Table 7.15 the expressions of some selected methods (Lima Neto and De Carvalho
(2010, 2008); Billard and Diday (2006, 2002, 2000)) are presented.
Table 7.15: Comparison of the expressions of the symbolic linear regression models for interval-valued
variables in Table 7.14.
Models
Expressions that allow predicting the intervals
Ψ−1\
LN area(j)
DSD
(t) = 1.8637 + 0.0224Ψ−1
temp(j) (t)−
−1
−0.0215Ψ−1
temp(j) (1 − t) − 0.0143Ψrh(j) (1 − t)
cLN
= 1.8637 + 0.0009ctemp(j) − 0.0143crh(j)
\
area(j)
rLN
= 0.0439rtemp(j) + 0.0143rrh(j)
\
area(j)
CM
cLN
= 1.92 + 0.002ctemp(j) + 0.003cwind(j) − 0.02crh(j)
\
area(j)
BD
\
LN
area(j) = 2.25 − 0.03temp(j) − 0.06wind(j) − 0.01rh(j)
MinMax
I LN
= −0.39 + 0.01I temp(j) + 0.24I wind(j) + 0.02I rh(j)
\
area(j)
I LN
= 1.16 + 0.01I temp(j) − 0.04I wind(j) + 0.01I rh(j)
\
area(j)
CRM
cLN
= 1.92 + 0.002ctemp(j) + 0.003cwind(j) − 0.02crh(j)
\
area(j)
rLN
= 0.01 + 0.07rtemp(j) − 0.01rwind(j) + 0.01rrh(j)
\
area(j)
CCRM
cLN
= 1.92 + 0.002ctemp(j) + 0.003cwind(j) − 0.02crh(j)
\
area(j)
rLN
= 0.004 + 0.07rtemp(j) + 0.01rrh(j)
\
area(j)
278
CHAPTER 7. ANALYSIS OF DATA WITH VARIABILITY
In this case, as one of the estimated parameters of the model associated with the half
ranges in CRM is negative, the expression for the half ranges in CCRM is already different.
As such, the relations between the centers and half ranges induced by DSD Model are not
equal to the ones obtained with CRM and CCRM, as it happens in the previous example.
In Figure 7.22 it is possible to observe the prediction of the burned area (LN area)
for each month, obtained with the selected symbolic methods in Table 7.15. The months
are represented in the xx-axis and the observed and predicted intervals are represented
in the yy -axis. In these figures, the prediction of the response variable obtained with the
classic-symbolic approach is also represented.
2
1.5
1
0.5
0
February
March
April
May
June
July
(a) Burned area from February to July.
2.5
2
LNArea observed
LNArea
predicted
DSD
LNArea
1.5
CM
predicted
LNAreaBD predicted
LNArea
MinMax
LNArea
1
CRM
LNArea
CCRM
LNArea
CS
0.5
0
August
September
October
predicted
predicted
predicted
predicted
December
(b) Burned area from August to December.
Figure 7.22: Observed and predicted intervals of the burned area in Montesinho natural park in Table 7.14.
In Table 7.16 it is possible compare the measures RM SEL , RM SEU and RM SEM
for the models selected with and without the Leave-One-Out method.
7.4. PREDICTED BURNED AREA OF FOREST FIRES
279
Table 7.16: Comparison of the Root Mean Square Error values when the Leave-One-Out method is not/is
applied together with the proposed models for the histogram-valued variables in Table 7.14.
Models
Without Leave-One-Out method
With Leave-One-Out method
RM SEL
RM SEU
RM SEM
RM SEL
RM SEU
RM SEM
DSD I
0.1106
0.1222
0.1066
0.1390
0.1578
0.1380
CM
0.3076
0.2676
0.1856
0.3186
0.3597
0.2440
BD
0.1945
0.2734
0.2172
0.3381
0.2368
0.2431
MinMax
0.1481
0.0940
0.1044
0.2812
0.1435
0.1815
CRM
0.1030
0.1161
0.1038
0.1758
0.2343
0.1880
CCRM
0.1034
0.1159
0.1038
0.1656
0.2128
0.1821
As observed in Example 7.3, the results of the RM SE calculated for the CRM, CCRM
and DSD Model are again very similar. These results are also similar to the ones calculated
for the MinMax model.
The values of RM SE with and without the Leave-One-Out method are in general close.
As expected, when the Leave-One-Out method method is applied, the values are slightly
higher. The small difference between them for the DSD Model, show that the model was
not overfitting the data.
The results of the RMSE for the DSD Model with and without the Leave-One-Out
method are presented in Figure 7.23.
2.5
Observed interval
Predicted DSD I
Predicted DSD I with LOO
2
1.5
1
0.5
0
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Out
Dez
Figure 7.23: Observed and predicted intervals applying the DSD Model I and DSD Model I with LOO, for the
interval-valued variable LN area, for each month.
280
CHAPTER 7. ANALYSIS OF DATA WITH VARIABILITY
Observing this figure we can say that both predictions of the intervals are worse for May
and December. Comparing, for each month, the observed intervals of LN area, with their
predictions obtained by the DSD Model in Expression (7.6) and the DSD Model I with LOO,
we comprove the results in Table 7.16.
7.5. Conclusion
In this chapter,considering four examples, we analyzed different approaches to study
data with variability; within the symbolic approach we compared the predictions obtained
by the DSD Models with alternative models proposed by other authors. Moreover, in all
situations, to analise the quality of the predicted distributions and intervals when the DSD
Models are applied, the predicted values for each unit are obtain using the Leave-One-Out
method and the values of the RM SE were calculated.
Considering each of the three analyzed approaches, classic approach, symbolic
approach and classic-symbolic approach, some advantages and drawbacks may be listed:
1. Classic approach
• Advantages:
– Application of the classical linear regression model and the respective coefficient of determination;
– Standard interpretations of the classical models.
• Drawbacks:
– The variability of the data is lost;
– The mean value may not be a good representation of the values associated
with each higher level unit;
– The prediction of a mean value is poor and gives weak information about the
variable to predict.
2. Classic-symbolic approach:
• Advantages:
– Application of the classical linear regression model and the respective coefficient of determination;
7.5. CONCLUSION
281
– The variability of the data is not lost.
• Drawbacks:
– Does not work with symbolic data;
– The type of variables in the microdata and the macrodata are different,
consequently, the relation between the variables may be different.
3. Symbolic approach:
• Advantages:
– The model allows predicting distributions from other distributions;
– The prediction is more informative.
• Drawbacks:
– Difficulty to work with records that are empirical distributions;
– Need to develop new models and adapt the classical statistics concepts/
methods;
– The interpretation of the parameters in the linear regression models proposed in context of Symbolic Data Analysis is not evident.
When the DSD Models are applied to histogram-valued variables, we obtain good predictions for the distribution of the response variable. The predicted distributions are close
to the observed distributions in all cases studied here. In situations where the DSD Models
are applied to the particular cases of interval-valued variables, the obtained predictions are
similar to those obtained with models proposed by other authors. However, it is important to
underline that the DSD Model, when applied to interval-valued variables, has the advantage
that it allows taking into consideration a distribution within the intervals. In this work we
have only considered the Uniform distribution but others are of course possible. This
development is planned for future research.
282
CHAPTER 7. ANALYSIS OF DATA WITH VARIABILITY
8. General conclusions and Future Work
8.1. Conclusions of the work
The main goal of this work was to define a linear regression model that allows predicting
distributions from other distributions. The complexity of the elements under study, i.e.
distributions, the respective operations and measures make their manipulation difficult.
Likewise, the generalization of the statistical concepts and methods to histogram-valued
variables is not straightforward. This type of symbolic variables is the one less studied
under the Symbolic Data Analysis framework. The first linear regression models that relate
distributions are generalizations of the methods based on symbolic covariance definitions
and were proposed for interval-valued variables, a particular case of histogram-valued
variables (Billard and Diday (2002)). When the research presented in this work started,
only those methods were known.
The linear regression models developed in this work, the DSD Models, considers the
distributions represented by quantile functions, the inverse of cumulative functions. This
option, suggested by Irpino and Verde (2006), allows to overcome part of the issues raised
by working with distributions. To apply the DSD Models, the distributions that represent all
observations and the distributions that represent the symmetric of the observed histograms,
must have the same number of pieces and the domain associated with each piece (cumulative weight) has to be the same for all variables. As a consequence the quantile functions
are symmetric relatively to the weight pi , i.e., pi = pn−i+1 with i ∈ {1, 2, . . . , n} and n the
number of pieces in all quantile functions. These requirements are not a limitation, since it
is always possible to rewrite the observed distributions for the described conditions.
DSD Models allow predicting the distributions taken by one histogram-valued variable
only from the distributions taken by the explicative histogram-valued variables. The inclu283
284
CHAPTER 8. GENERAL CONCLUSIONS AND FUTURE WORK
sion in the model of distributions that represent the observations of the histogram-valued
variables and the quantile functions that represent the respective symmetric histograms,
allows obtaining a direct or inverse relation between the explicative and response variables,
despite the constraint of non-negativity on the parameters that multiply the variables, imposed by the model. The independent parameter in DSD Models may be a real number
or a distribution itself. In DSD Regression Model I the independent parameter is a real
number. In this case, it only influences the fit of the centers of the predicted subintervals of
the histogram. Moreover, this influence will be equal in all subintervals. In DSD Regression
Model II a quantile function that is considered as the independent parameter, will allow
predicting quantile functions where the center and half range of the subintervals of each
histogram may be influenced in different ways.
The DSD Models were initially defined for histogram-valued variables however, they
may be particularized for interval-valued variables. The main advantage of the DSD Models
applied to interval-valued variables is that a linear relation between one response intervalvalued variable and n explicative variables of the same type is defined, without reducing the
intervals to their bounds or centers and half ranges. In fact, this model, as it uses quantile
functions to represent the intervals, allows working with the intervals as such and considers
the distributions within intervals. This fact is not considered in all other models proposed for
this kind of variables. In this work the Uniform distribution is assumed in all intervals that
are the “values” observed of the interval-valued variables. Under these conditions, the DSD
Models induce a relation between the half ranges and a relation between the centers of the
intervals where the respective estimated parameters are not independent. In the case of
the half ranges, this relation is always direct, similarly as to what occurs in the Constrained
Center and Range Method (Lima Neto and De Carvalho (2010)).
An interval-valued variable is a particular case of a histogram-valued variable if for all
observations we only have one interval with weight equal to one. A classical variable is
a particular case of an interval-valued variable, when to all observations corresponds a
degenerate interval (an interval where the lower and upper bounds are the same). Because
of this link between histogram, interval and classical variables it was logical that the DSD
Models for histogram-valued variables was then particularized to interval-valued variables
that in turn could be particularized to real values.
8.2. FUTURE WORK
285
From the proposed models, it was possible to deduce a goodness-of-fit measure while
in the previously proposed methods, the performance of the models was only evaluated
considering Root Mean Square Error values. For histogram-valued variables, the deduced
goodness-of-fit measure is computed with respect to the symbolic mean of the response
histogram-valued variable, that is a real value and not an average distribution, the barycentric histogram. In the case of the interval-valued variables and only for the most general
case of the DSD Model II, it was also possible to deduce a goodness-of-fit measure with
respect to the barycentric interval. The general coefficient of determination considers a
“punctual” mean and not an average distribution, but in spite of this limitation the measure Ω
shows a good behavior. As expected, when we compare the representation of the predicted
and observed quantile functions for each unit we have good estimates when the value of
the goodness-of-fit measure Ω is close to one whereas the predicted and observed quantile
functions are more discrepant when the value of the goodness-of-fit measure is lower.
The main goals of this work were achieved, however many questions remain unanswered
and many symbolic statistical methods may be explored applying a similar methodology.
The extensive and complex data that emerged in the last decades made it necessary to
develop methods that treat distributions and not just real values. It was in this perspective
that this work was developed. The expectation is that these models prove to have potential
application in real situations.
8.2. Future Work
As future research several lines of work may be pursued. First and foremost, it is
important to explore the interpretation of the parameters of the DSD Linear Regression
Model. The application of the method to real situations requires an interpretation more
directly related with the distributions. So far it is only possible to interpret the relations
between distributions based in relations deduced from the model that relate the centers
of the (sub)intervals and half ranges of the (sub)intervals. Another important point is to
analyze the impact and the difference in the results when the distributions that are the
observations of the variables are defined with different weights but verifying the conditions
required for the application of the model. It is also important to define the linear regression
model in a probabilistic context, so that it is possible to use statistical inference and not only
286
CHAPTER 8. GENERAL CONCLUSIONS AND FUTURE WORK
descriptive statistics. This latter point is important not only for the linear regression studies.
For example, it will be important to define symbolic random variables. These three lines of
work are fundamental to make the model a ”linear regression model of the future.”
Particulary related to the work developed in this thesis, two studies may be performed
next as future research. In Chapter 7, we only make a small study to compare the predicted
quantiles obtained with Quantile Regression and the quantiles that result from the predicted
distributions obtained with the DSD Models. However, and although these models worked
with different type of elements, it is important to analyze with more detail the results obtained
with both approaches. Since the DSD Models are linear regression models that predict a
particular type of functions (quantile functions) from other quantile functions, whereas in
Functional Data Analysis linear regression models that relate functions were proposed, it
will be interesting to apply these models to quantile functions. However, the models of the
Functional Data Analysis cannot be applied directly to quantile functions because these
functions are not smooth functions. It will be necessary to smooth the quantile functions
that are the observations, assuming in this case that we are working with the distribution
function instead of the empirical distribution function.
The DSD Model has the potential of taking into consideration the distribution within
the intervals associated with the observations of interval-valued variables. As such, it is
possible to adapt the proposed model to interval-valued variables with other distributions.
For example, the DSD Model may be developed considering a triangular distribution within
the intervals. As in most studies of Symbolic Data Analysis it is considered that the values
with the intervals are uniformly distributed, all descriptive statistics would also have to be
redefined.
Finally, the approach used in the linear regression models proposed in this thesis may
also be applied in other models and methods in Symbolic Data Analysis based on linear
relations between interval-valued and histogram-valued variables. One such models is
logistic regression.
References
Ahn, J., M. Peng, C. Park, and Y. Jeon (2012). A resampling approach for interval-valued
data regression. Statistical Analysis and Data Mining 5(4), 336–348.
Arroyo, J. (2008).
Métodos de Predicción para Series Temporales de Intervalos e
Histogramas. Ph. D. thesis, Universidad Pontificia Comillas, Madrid, Espanha.
Arroyo, J. and C. Maté (2009). Forecasting histogram time series with k-nearest neighbours
methods. International Journal of Forecasting 25(1), 192–207.
Barroda, I. and F. Roberts (1974). Solution of an over determined system of equation in the
l1 norm. Communications of the Association for Computing Machinery 17 (6), 319–320.
Bertoluzza, C., N. C. Blanco, and A. Salas (1995). On a new class of distances between
fuzzy numbers. Mathware & Soft Computing 2(2), 71–84.
Billard, L. (2007). Selected Contributions in Data Analysis and Classification, Chapter
Dependencies and Variation Components of Symbolic Interval-Valued Data, pp. 3–12.
Springer-Verlag Berlin.
Billard, L. (2008).
Sample covariance functions for complex quantitative data.
In
M. Mizuta and J. Nakano (Eds.), Proceedings of the 2008 World Conference International
Association of Statistical Computing, Yokohama, Japan, December 2008, pp. 157–163.
Japanese Society of Computational Statistics.
Billard, L. and E. Diday (2000). Regression analysis for interval-valued data. In H. Kiers,
J.-P. Rasson, P. Groenen, and M. Schader (Eds.), Data Analysis, Classification and
Related Methods. Proceedings of the 7th Conference on the International Federation
of Classification Societies (IFCS’00), Namur, Belgium, July 2000, pp. 369–374. Springer
Berlin Heidelberg.
287
288
REFERENCES
Billard, L. and E. Diday (2002). Symbolic regression analysis. In K. Jajuga, A. Sokolowski,
and H.-H. Bock (Eds.), Classification, Clustering, and Data Analysis. Proceedings of
the 8th Conference of the International Federation of Classification Societies (IFCS’02),
Cracow, Poland, July 2002, pp. 281–288. Springer Berlin Heidelberg.
Billard, L. and E. Diday (2003). From the statistics of data to the statistics of knowledge:
Symbolic Data Analysis. Journal of the American Statistical Association 98(462), 470–
487.
Billard, L. and E. Diday (2006). Symbolic Data Analysis: Conceptual Statistics and Data
Mining. John Wiley & Sons, Ltd.
Blanco-Fernández, A. (2009).
Análisis estadı́stico de un nuevo modelo de regresión
lineal flexible para intervalos aleatórios. Ph. D. thesis, Universidad de Oviedo, Oviedo,
Espanha.
Blanco-Fernández, A., A. Colubi, and G. González-Rodrı́guez (2013). Towards Advanced
Data Analysis by Combining Soft Computing and Statistics, Volume 285, Chapter Linear
Regression Analysis for Interval-valued Data Based on Set Arithmetic: A Review, pp.
19–31. Springer-Verlag Berlin.
Blanco-Fernández, A., N. Corral, and G. González-Rodrı́guez (2011). Estimation of a
flexible simple linear model for interval data based on set arithmetic. Computational
Statistics & Data Analysis 55(9), 2568–2578.
Bock, H.-H. and E. Diday (Eds.) (2000). Analysis of Symbolic Data: Exploratory Methods
for Extracting Statistical Information from Complex Data. Springer-Verlag Berlin.
Brito, P. and A. Duarte Silva (2012). Modelling interval data with Normal and Skew-Normal
distributions. Journal of Applied Statistics 39(1), 3–20.
Case, J. (1999). Interval arithmetic and analysis. The College Mathematics Journal 30(2),
106–111.
Chavent, M. and J. Saracco (2008). On central tendency and dispersion measures for
intervals and hypercubes. Communications in Statistics - Theory and Methods 37 (9),
1471–1482.
REFERENCES
289
Colombo, A. G. and R. Jaarsma (1980). A powerful numerical method to combine random
variables. IEEE Transactions on Reliability 29(2), 126–129.
Cortez, P. and A. Morais (2007). A data mining approach to predict forest fires using
meteorological data. In J. Neves, M. F. Santos, and J. Machado (Eds.), New Trends
in Artificial Intelligence. Proceedings of the 13th Portuguese Conference on Artificial Intelligence (EPIA 2007), Guimarães, Portugal, December 2007, pp. 512–523. Associação
Portuguesa para a Inteligência Artificial.
Cortez, P. and A. Morais (2008). Forest fires data set. UCI Machine Learning Repository.
In website: http://www.ics.uci.edu/ mlearn/MLRepository.html.
Cuesta-Albertos, J., C. Matrán, and A. Tuero-Diaz (1997). Optimal transportation plans and
convergence in distribution. Journal of Multivariate Analysis 60(1), 72–83.
Cuvelier, E. and M. Noirhomme-Fraiture (2006). A probability distribution of functional
random variable with a functional data analysis application. In S. Tsumoto, C. Clifton,
N. Zhong, X. Wu, J. Liu, B. Wah, and Y.-M. Cheung (Eds.), Proceedings of the 6th IEEE,
International Conference on Data Mining - Workshops, Hong Kong, China, December
2006, pp. 247–252. The Institute of Electrical and Electronics Engineers.
De Carvalho, F. A. T. (1994). Proximity coefficients between Boolean symbolic objects.
In E. Diday, Y. Lechevallier, M. Schader, P. Bertrand, and B. Burtschy (Eds.), New
Approaches in Classification and Data Analysis. Proceedings of the 4th Conference
of the International Federation of Classification Societies (IFCS’93), Paris, France,
August/September 1993, pp. 387–394. Springer Berlin Heidelberg.
Diamond, P. (1990).
Least squares fitting of compact set-valued data.
Journal of
Mathematical Analysis and Applications 147, 531–544.
Diday, E. (1988).
The symbolic approach in clustering and related methods of data
analysis: The basic choices. In H.-H. Bock (Ed.), Classification and Related Methods
of Data Analysis. Proceedings of the 1st Conference of the International Federation
of Classification Societies (IFCS’87), Aachen, Germeny, June/July 1987, pp. 673–684.
North Holland, Amsterdam.
290
REFERENCES
Diday, E. and M. Noirhomme-Fraiture (Eds.) (2008). Symbolic Data Analysis and the
SODAS Software. John Wiley & Sons, Ltd.
Diday, E. and M. Vrac (2005). Mixture decomposition of distributions by copulas in the
symbolic data analysis framework. Discrete Applied Mathematics 147 (1), 27–41.
Domingues, M. A., R. De Souza, and F. Cysneiros (2010). A robust method for linear
regression of symbolic interval data. Pattern Recognition Letters 31(13), 1991–1996.
Gibbs, A. and F. Su (2002). On choosing and bounding probability metrics. International
Statistical Review 70(3), 419–435.
Gil, M., M. López-Garcı́a, M. Lubiano, and M. Montenegro (2001).
Regression and
correlation analyses of a linear relation between random intervals.
Sociedad de
Estadı́stica e Investigación Operativa 10(1), 183–201.
Gil, M., M. Lubiano, M. Montenegro, and M. López (2002). Least squares fitting of an affine
function and strength of association for interval-valued data. Metrika 56(2), 97–111.
Giordani, P. (2011). Linear regression analysis for interval-valued data based on the Lasso
technique. In Proceeding of the 58th World Statistical Congress of the International
Statistical Institute, Dublin, Irland, August 2011, pp. 5576–5581.
Giordani, P. (2014). Lasso-constrained regression analysis for for interval-valued data.
Advances in Data Analysis and Classification, 1–15.
González-Rodrı́guez, G., A. Blanco, N. Corral, and A. Colubi (2007).
Least squares
estimation of linear regression models for convex compact random sets. Advances in
Data Analysis and Classification 1(1), 67–81.
Ichino, M. and H. Yaguchi (1994). Generalized Minkowski metrics for mixed feature-type
data analysis. IEEE Transactions on Systems, Man and Cybernetics 24(4), 698–708.
Irpino, A. and E. Romano (2007). Optimal histogram representation of large data sets:
Fisher vs piecewise linear approximation. In M. Noirhomme-Fraiture and G. Venturini
(Eds.), Extraction et gestion des connaissances (EGC’2007), Actes des cinquièmes
journées Extraction et Gestion des Connaissances, Namur, Belgique, January 2007, pp.
99–110. Cépaduès-Éditions.
REFERENCES
291
Irpino, A. and R. Verde (2006). A new Wasserstein based distance for the hierarchical
clustering of histogram symbolic data.
In V. Batagelj, H.-H. Bock, A. Ferligoj, and
A. Ziberna (Eds.), Data Science and Classification. Proceedings of the 10th Conference
of the International Federation of Classification Societies (IFCS’06), Faculty of Social
Sciences of the University of Ljubljana in Slovenia, July 2006, pp. 185–192. Springer
Berlin Heidelberg.
Irpino, A. and R. Verde (2012).
Linear regression for numeric symbolic variables:
an ordinary least squares approach based on Wasserstein distance.
In website:
http://arxiv.org/abs/1202.1436v1 (arXiv:1202.1436v1 [stat.ME] 7Feb2012).
Irpino, A. and R. Verde (2013). A metric based approach for the least square regression
of multivariate modal symbolic data. In P. Giudici, S. Ingrassia, and M. Vichi (Eds.),
Statistical Models for Data Analysis. Proceedings of the 8th biannual meeting of the
Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society, Pavia,
Italy, September 2011, pp. 161–169. Springer International Publishing.
Iwasaki, M. and H. Tsubaki (2005). A bivariate generalized linear model with an application
to meteorological data analysis. Statistical Methodology 2(3), 175–190.
Lawson, C. and J. Hanson (1995). Solving Least Squares Problems (2nd ed.). Society for
Industrial and Applied Mathematics.
Lima Neto, E., G. Cordeiro, and F. De Carvalho (2011). Bivariate symbolic regression
models for interval-valued variables. Journal of Statistical Computation and Simulation 81(11), 1727–1744.
Lima Neto, E. and F. De Carvalho (2008).
Centre and Range method for fitting a
linear regression model to symbolic interval data.
Computational Statistics & Data
Analysis 52(3), 1500–1515.
Lima Neto, E. and F. De Carvalho (2010). Constrained linear regression models for symbolic
interval-valued variables. Computational Statistics & Data Analysis 54(2), 333–347.
Maia, A. L. S. and F. A. De Carvalho (2008). Fitting a least absolute deviation regression
model on interval-valued data.
In G. Zaverucha and A. L. Costa (Eds.), Advances
292
REFERENCES
in Artificial Intelligence. Proceedings of the 19th Brazilian Symposium on Artificial
Intelligence (SBIA 2008), Salvador, Brazil, October 2008, pp. 207–216. Springer Berlin
Heidelberg.
Mallows, C. (1972). A note on asymptotic joint normality. The Annual of Mathematical
Statistics 43(2), 508–515.
Moore, R. (1966). Interval Analysis, Volume 2. Prentice Hall Englewood Cliffs.
Nadler, S. (1978). Hyperspaces of Sets: A Text with Research Questions. Chapman & Hall
Pure and Applied Mathematics. M. Dekker.
Noirhomme-Fraiture, M. and P. Brito (2011). Far beyond the classical data models: symbolic
data analysis. Statistical Analysis and Data Mining 4(2), 157–170.
Prakash, P. and R. Murat (1971). Semilinear (Topological) Spaces and Aplications (Revised
Edition ed.). Massachsetts Institute of Technology.
Ramsay, J. and B. Silverman (2005). Functional Data Analysis (2nd ed.). Springer Science
+ Business Media, Inc.
Redmond, M. (2011).
Communities and crime unnormalized data set.
UCI Machine
Learning Repository. In website: http://www.ics.uci.edu/mlearn/MLRepository.html.
Silva, A., E. Lima Neto, and U. Anjos (2011). A regression model to interval-valued variables
based on copula approach. In Proceeding of the 58th World Statistical Congress of the
International Statistical Institute, Dublin, Irland, August 2011, pp. 6481–6486.
Trutschnig, W., G. González-Rodrı́guez, A. Colubi, and M. Gil (2009). A new family of
metrics for compact, convex (fuzzy) sets based on a generalized concept of mid and
spread. Information Sciences 179(23), 3964–3972.
Verde, R. and A. Irpino (2007). Selected Contributions in Data Analysis and Classification,
Chapter Dependencies and Variation Components of Symbolic Interval-Valued Data, pp.
123–134. Springer-Verlag Berlin.
Verde, R. and A. Irpino (2008).
Wasserstein distance.
Comparing histogram data using a Mahalanobis-
In P. Brito (Ed.), Proceedings of the COMPSTAT’2008, 18th
REFERENCES
293
International Conference on Computational Statistics, Porto, Portugal, August 2008, pp.
77–89. Physica-Verlag HD.
Verde, R. and A. Irpino (2010).
Ordinary least squares for histogram data based on
Wasserstein distance. In Y. Lechevallier and G. Saporta (Eds.), Proceedings of the
COMPSTAT’2010, 19th International Conference on Computational Statistics, Paris,
France, August 2010, pp. 581–588. Physica-Verlag HD.
Vitale, R. (1985). Lp metrics for compact, convex sets. Journal of Approximation Theory 45,
280–287.
Williamson, R. (1989). Probabilistic Arithmetic. Ph. D. thesis, University of Queensland,
Queensland, Australia.
Winston, W. (1994).
Operations Research. Applications and Algorithms (3rd ed.).
Wadsworth.
Xu, W. (2010). Symbolic Data Analysis: Interval-valued data regression. Ph. D. thesis,
University of Georgia, Georgia, USA.
Yang, C., J. Jeng, C. Chuang, and C. Tao (2011). Constructing the linear regression
models for the symbolic interval-values data using PSO algorithm. In Proceedings of the
International Conference on System Science and Engineering (ICSSE), Macao, China,
June 2011, pp. 177–181. Institute of Electrical and Electronics Engineers.
294
REFERENCES
Appendices
295
A. Algorithm to build ordered and disjoint
histograms
Consider the histogram HZ =
I Z1 , I Z1 , p1 ; I Z2 , I Z2 , p2 ; . . . ; I Zn , I Zn , pn where
the subintervals are not ordered neither disjoint. If the histogram is not disjoint we can
rewrite it as a histogram with ordered and disjoint subintervals. For this, the processe is the
follows (Williamson (1989)):
1. Order the subintervals. The specified order is given by the relation ≤H and build the
histogram HZO . The relation ≤H is defined by
I Z i , pi ≤ H I Z k , p k
if
I Zi < I Zk or I Zi = I Zk and I Zi < I Zk .
2. Construction of the disjoint histogram HZD . The next step is to build a disjoint
histogram HZD where
HZD =
nh
h
h
h
h
i
o
I ZD1 , I ZD1 , s1 ; I ZD2 , I ZD2 , s2 ; . . . ; I ZDn , I ZDn , sn
and
h
h
h h
I ZDi , I ZDi ∩ I ZDk , I ZDk = ∅ for all i 6= k.
The histogram HZD will have the property I ZDi = I ZDi+1 , for i ∈ {1, . . . , n − 1}.
The procedure used to form the histogram HZD from the ordered histogram HZO is as
follows:
297
298
APPENDIX A. ORDERED AND DISJOINT HISTOGRAMS
for (i := 0; i < n; i + +) {
I ZDi := I ZOi ; I ZDi := I ZOi+1 ; si = 0;
for (k = i; k ≥ 0; k − −) {
I ZO k > I ZD i {
if I ZOk > I ZDi and
I ZD i ≥ I ZO k
if I ZDi > I ZOk and
si + = rk
I ZOk − I Z D i
I ZO k − I Z O k
else if I ZDi ≤ I ZO and
k
si + = rk
I ZD i > I Z O k
I ZD i − I Z D i
I ZO k − I Z O k
}
}
}
B. Behavior of pairs of quantile functions
Since the observations of the interval-valued variables may be represented by intervals
or quantile functions, it is interesting to understand the relation between the behavior of two
quantile functions and the corresponding intervals.
Consider the quantile function Ψ−1
X (t) = cX + rX (2t − 1) with t ∈ [0, 1], that represents
I X , I X and the quantile function Ψ−1
Y (t) = cY + rY (2t − 1), with
t ∈ [0, 1], that represent the interval IY = I Y , I Y . The line segments that represent the
the interval IX =
−1
quantile functions Ψ−1
X (t) and ΨY (t) may present one of these three relative positions:
• parallel, when the values of rX and rY are the same;
• non-parallel and the quantile functions intersect at a point whose abscissa is a value
in [0, 1] ;
• non-parallel and the quantile function intersect at a point whose abscissa does not
belong to [0, 1] .
The possible positions of the two intervals when the respective quantile functions are
parallel, are illustrated in Figure B.1. In the cases where the quantile functions intersect
each other, the abcissa of the intersection point is given by
t=
cY − cX
1
−
2 2(rY − rX )
with rY − rX 6= 0.
From Expression (B.1) we may easily prove the following propositions:
299
(B.1)
300
APPENDIX B. BEHAVIOR OF PAIRS OF QUANTILE FUNCTIONS
IY
IY
IX
IX
0
1
(a)
(b)
IX = IY
IX = IY
0
1
(c)
(d)
IY
IX
IY
IX
0
1
(e)
(f)
Figure B.1: Relative positions of the intervals when the respective quantile function are parallel. In (a) and (b)
I Y > I X ∧ rY = rX ; in (c) and (d) I Y = I X ∧ I Y = I X ; in (e) and (f) I Y < I X ∧ rY = rX .
Proposition B.1. Consider two intervals IX = I X , I X and IY = I Y , I Y that may be
represented by the respective quantile functions Ψ−1
X (t) = cX + rX (2t − 1) and
Ψ−1
Y (t) = cY + rY (2t − 1) with t ∈ [0, 1]. Under the condition that rY − rX 6= 0, that like
−1
the quantile functions Ψ−1
X (t) and ΨY (t) are non-parallel, we have:
−1
1. Two quantile functions Ψ−1
X (t) and ΨY (t) intersect each other outside the interval
[0, 1] when they verify the condition,
(cY − cX )2 > (rY − rX )2
or, equivalently, the intervals IX and IY verify the condition,
IY > IX ∧ IY > IX ∨ IY < IX ∧ IY < IX .
These situations are illustrated in Figure B.2.
301
IY
IX
IY
IX
0
1
(a)
(b)
IY
IY
IX
IX
0
1
(c)
(d)
Figure B.2: Relative positions of the intervals when the respective quantile functions intersect outside the interval [0, 1] . In (a) and (b) I Y
>
IX ∧ IY
>
IX ∧ IY
<
IX ;
in (c) and (d) I Y > I X ∧ I Y > I X ∧ I Y > I X .
−1
2. Two quantile functions Ψ−1
X (t) and ΨY (t) intersect in the interval [0, 1] when they
verify the condition,
(cY − cX )2 ≤ (rY − rX )2 .
In this situation the intervals IX and IY verify the following condition,
IY ≤ IX ∧ IY ≥ IX ∨ IY ≥ IX ∧ IY ≤ IX .
−1
The quantile functions Ψ−1
X and ΨY intersect in t = 0 if I Y = I X and intersect in
t = 1 if I Y = I X . The possible positions of the intervals and respective quantile
functions in this situation are illustrate in Figure B.3.
Proposition B.1 states the conditions that the quantile functions must verify when they
intersect inside or outside the interval [0, 1] , and allows understanding the behavior of the
corresponding intervals in each of these situations.
In this work, we use the Mallows distance to measure the similarity between distributions
and in particular between intervals. In general, the Mallows distance is always influenced
by the centers and half ranges of the intervals. A large distance between centers will
have a more important effect in the value of the Mallows distance than a similar large
302
APPENDIX B. BEHAVIOR OF PAIRS OF QUANTILE FUNCTIONS
IX
IY
IY
IX
0
1
(a)
(b)
IX
IY
IX = IY
0
1
(c)
(d)
IX = IY
IY
IX
0
1
(e)
(f)
Figure B.3: Relative positions of the intervals when the respective quantile functions intersect in the interval
[0, 1] : in (a), (b) I Y > I X ∧ I Y < I X ; in (c), (d) I Y = I X ∧ I Y < I X ; in (e), (f) I X < I Y ∧ I Y = I X .
distance between half ranges because, assuming the Uniform distribution within intervals,
the expression of this distance is given by
1
2
−1
2
DM
(Ψ−1
(rY − rX )2 .
X , ΨY ) = (cY − cX ) +
3
However, a very large distance between half ranges may overcome the influence of the
centers distance.
Below, we present the analysis of the distance between two quantile functions when
they are parallel and non-parallel.
As we proved in Proposition 5.3 of the Section 5.1.2, if two quantile functions are parallel,
the Mallows distance between them is the squared Euclidean distance between two parallel
lines. So, in this case, the interpretation is obvious. Two parallel quantile functions are more
similar if they are closer, in the usual sense.
303
Consider now the behavior of two non-parallel quantile functions.
Proposition B.2. If the quantile functions Ψ−1
X1 (t)
=
cX1 + rX1 (2t − 1) and
Ψ−1
Y (t) = cY + rY (2t − 1) with 0 ≤ t ≤ 1, intersect for one value of t outside the interval
2
−1
4
2
[0, 1] , then DM
(Ψ−1
X1 (j) , ΨY (j) ) > 3 (rY − rX1 ) .
Proof. Consider two non-parallel quantile functions defined by the center and half-range of
−1
the intervals, Ψ−1
X1 (t) = cX1 + rX1 (2t − 1) and ΨY (t) = cY + rY (2t − 1) with 0 ≤ t ≤ 1.
The abcissa of the point of intersection is
t=
1
cY − cX1
−
, with rY − rX1 6= 0.
2 2(rY − rX1 )
−1
The Mallows distance between Ψ−1
X1 and ΨY is given by
1
−1
2
2
DM
(Ψ−1
(rY − rX1 )2 .
X1 (j) , ΨY (j) ) = (cY − cX1 ) +
3
If the quantile functions intersect outside the interval [0, 1] from Proposition B.1, we have
2
2
(cY − cX1 ) > (rY − rX1 ) .
So, in this case,
1
4
−1
2
2
(rY − rX1 )2 > (rY − rX1 )2 .
(Ψ−1
DM
X1 (j) , ΨY (j) ) = (cY − cX1 ) +
3
3
Proposition B.3. If the quantile functions Ψ−1
X2 (t)
=
cX2 + rX2 (2t − 1) and
Ψ−1
Y (t) = cY + rY (2t − 1) with 0 ≤ t ≤ 1, intersect for one value of t, inside the interval
2
−1
4
2
[0, 1] , then DM
(Ψ−1
X(j) , ΨY (j) ) < 3 (rY − rX2 ) .
Proof. Consider two non-parallel quantile functions, defined by the center and half-range of
−1
the intervals, Ψ−1
X2 (t) = cX2 + rX2 (2t − 1) and ΨY (t) = cY + rY (2t − 1) with 0 ≤ t ≤ 1.
The abcissa of the point of intersection is
t=
1
cY − cX2
, with rY − rX2 6= 0.
−
2 2(rY − rX2 )
−1
The Mallows distance between Ψ−1
X2 and ΨY is given by
1
−1
2
2
DM
(Ψ−1
(rY − rX2 )2 .
X2 (j) , ΨY (j) ) = (cY − cX2 ) +
3
304
APPENDIX B. BEHAVIOR OF PAIRS OF QUANTILE FUNCTIONS
From the Proposition B.1, if the quantile functions intersect inside the interval [0, 1] the
following condition is verified
2
2
(cY − cX2 ) < (rY − rX2 ) .
So,
1
4
−1
2
2
(rY − rX2 )2 < (rY − rX2 )2 .
DM
(Ψ−1
X2 (j) , ΨY (j) ) = (cY − cX2 ) +
3
3
For the work that was developed in the simulation studies it is important to understand
the behavior of the intervals/quantile functions associated with each unit j, when the linear
relation is disturbed, i.e, when the linear function e(j) = e
cj + rej (2t − 1) is added to the
quantile function Ψ−1
Y ∗ (j) (t) = cY ∗ (j) + rY ∗ (j) (2t − 1).
Let Ψ−1
Y ∗ (j) (t) = cY ∗ (j) + rY ∗ (j) (2t − 1) be the quantile function representing the interval
−1
I ∗ (j) before the disturbance of the linear relation and Ψ−1
Y (j) (t) = ΨY ∗ (j) (t) + e(j), i.e.
Ψ−1
cj + (rY (j) + rej )(2t − 1), the quantile function after disturbance. Four
Y (j) (t) = cY (j) + e
different situations may occur:
−1
1. If rej = 0, only the center is disturbed and the quantile functions Ψ−1
Y ∗ (j) (t) and ΨY (j) (t)
are parallel.
−1
2. If e
cj = 0, only the range is disturbed and the quantile functions Ψ−1
Y ∗ (j) (t) and ΨY (j) (t)
intersected at t = 12 .
−1
3. If e
cj =
6 0 and rej =
6 0, the quantile functions Ψ−1
Y ∗ (j) (t) and ΨY (j) (t) are non-parallel
and intersect at a point with abcissa t =
1
2
be:
(a) inside the interval [0, 1], if 0 ≤
i.e., e
c2j ≤ rej2 ;
1
2
−
e
cj
2e
rj
−
e
cj
2e
rj
, with rej 6= 0. The intersection may
≤ 1, which means that −rej ≤ e
cj ≤ rej ,
(b) outside the interval [0, 1], which implies that e
c2j > rej2 .
Let us analyze the behavior of the quantile functions $\Psi_{Y(j)}^{-1}(t)$ and $\Psi_{Y^*(j)}^{-1}(t)$ when $\tilde{c}_j \neq 0$ and $\tilde{r}_j \neq 0$. The square of the Mallows distance between these quantile functions, for each $j$, is given by
\[
D_M^2\big(\Psi_{Y(j)}^{-1}(t), \Psi_{Y^*(j)}^{-1}(t)\big) = \tilde{c}_j^2 + \frac{1}{3}\tilde{r}_j^2.
\]
Consider now that the quantile functions $\Psi_{Y_1(j)}^{-1}(t)$ and $\Psi_{Y_2(j)}^{-1}(t)$ are the result of the perturbation of $\Psi_{Y^*(j)}^{-1}(t)$ for the same value $\tilde{r}_j \neq 0$ and different values $\tilde{c}_j \neq 0$, i.e.
\[
\Psi_{Y_1(j)}^{-1}(t) = \Psi_{Y^*(j)}^{-1}(t) + e(j) = c_{Y^*(j)} + \tilde{c}_{1j} + (r_{Y^*(j)} + \tilde{r}_j)(2t-1)
\]
and
\[
\Psi_{Y_2(j)}^{-1}(t) = \Psi_{Y^*(j)}^{-1}(t) + e(j) = c_{Y^*(j)} + \tilde{c}_{2j} + (r_{Y^*(j)} + \tilde{r}_j)(2t-1).
\]
Assuming that $\tilde{c}_{1j}^2 < \tilde{r}_j^2$ and $\tilde{c}_{2j}^2 > \tilde{r}_j^2$, from Proposition B.1 the quantile function $\Psi_{Y_1(j)}^{-1}(t)$ intersects $\Psi_{Y^*(j)}^{-1}(t)$ inside the interval $[0,1]$ and $\Psi_{Y_2(j)}^{-1}(t)$ intersects $\Psi_{Y^*(j)}^{-1}(t)$ outside the interval $[0,1]$; therefore
\[
D_M^2\big(\Psi_{Y_1(j)}^{-1}(t), \Psi_{Y^*(j)}^{-1}(t)\big) < \frac{4}{3}\tilde{r}_j^2
\quad \text{and} \quad
D_M^2\big(\Psi_{Y_2(j)}^{-1}(t), \Psi_{Y^*(j)}^{-1}(t)\big) > \frac{4}{3}\tilde{r}_j^2.
\]
Analyzing these latter situations, the value of $D_M^2\big(\Psi_{Y(j)}^{-1}(t), \Psi_{Y^*(j)}^{-1}(t)\big)$ is higher when the quantile functions $\Psi_{Y(j)}^{-1}(t)$ and $\Psi_{Y^*(j)}^{-1}(t)$ intersect outside the interval $[0,1]$, and this occurs when $\tilde{c}_j^2 > \tilde{r}_j^2$, i.e., $\tilde{c}_j \notin\, \left]-\tilde{r}_j;\, \tilde{r}_j\right[$, for each unit $j$.
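This inequality can be checked numerically; the following minimal Python sketch works under the same assumptions (the disturbance amplitudes are illustrative):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def mallows_sq_disturbance(c_tilde, r_tilde):
    # Squared Mallows distance between Psi_{Y*(j)}^{-1} and the disturbed
    # Psi_{Y(j)}^{-1} = Psi_{Y*(j)}^{-1} + e(j), e(j) = c_tilde + r_tilde*(2t - 1)
    return c_tilde ** 2 + r_tilde ** 2 / 3.0

r_tilde = 1.0
bound = 4.0 / 3.0 * r_tilde ** 2

# Disturbance centers with the intersection inside [0, 1] (c_tilde^2 < r_tilde^2)
c_inside = rng.uniform(-0.99 * r_tilde, 0.99 * r_tilde, size=1000)
# ... and outside [0, 1] (c_tilde^2 > r_tilde^2)
c_outside = rng.choice([-1.0, 1.0], size=1000) \
    * rng.uniform(1.01 * r_tilde, 3.0 * r_tilde, size=1000)

assert np.all(mallows_sq_disturbance(c_inside, r_tilde) < bound)
assert np.all(mallows_sq_disturbance(c_outside, r_tilde) > bound)
\end{verbatim}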
C. Results of simulation studies

C.1. Simulation study I with interval variables
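Throughout the tables below, RMSE_M denotes (as in the main text) the root mean squared Mallows distance between observed and fitted elements; for interval variables this reduces to the center/half-range expression derived in Appendix B. A minimal sketch, assuming intervals are described by their centers and half-ranges (the function name and input convention are illustrative):
\begin{verbatim}
import numpy as np

def rmse_mallows(c_obs, r_obs, c_pred, r_pred):
    """Root mean squared Mallows distance between observed and predicted
    intervals, each described by its center c and half-range r; uses
    D_M^2 = (dc)^2 + (dr)^2 / 3, as derived in Appendix B."""
    d2 = (np.asarray(c_obs) - np.asarray(c_pred)) ** 2 \
        + (np.asarray(r_obs) - np.asarray(r_pred)) ** 2 / 3.0
    return float(np.sqrt(d2.mean()))
\end{verbatim}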
Table C.1: Results of the DSD Model I with a = 2, b = 1 and v = −1, when the variability in variable X is low and similar. (Reports Ω (s) and RMSE_M (s), for n = 30 and n = 100, for center disturbances c̃ = 0 and c̃ ∼ U(−λ, λ) with λ ∈ {1, 2, 5, 10}, crossed with range disturbances r̃ = 0 and r̃ ∼ U(−λ, λ) with λ ∈ {1, 2, mr}.)

Table C.2: Results of the DSD Model I with a = 2, b = 1 and v = −1, when the variability in variable X is high and similar. (Reports Ω (s) and RMSE_M (s), for n = 30 and n = 100, for center disturbances c̃ and range disturbances r̃ of increasing amplitude.)

Table C.3: Results of the DSD Model I with a = 2, b = 1 and v = −1, when the variability in variable X is mixed (type I). (Reports Ω (s) and RMSE_M (s), for n = 30 and n = 100, for center disturbances c̃ and range disturbances r̃ of increasing amplitude.)
Table C.4: Results of the DSD Model I with a = 2, b = 1 and v = −1, when the variability in variable X is mixed (type II). (Reports Ω (s) and RMSE_M (s), for n = 30 and n = 100, for center disturbances c̃ and range disturbances r̃ of increasing amplitude.)
Table C.5: Results of the DSD Model II with a = 2, b = 1 and v = [−2, 0], when the variability in variable X is low and similar (part I). (Reports Ω (s), Ω̃ (s) and RMSE_M (s), for n = 30 and n = 100, for center disturbances c̃ and range disturbances r̃ of increasing amplitude.)

Table C.6: Results of the DSD Model II with a = 2, b = 1 and v = [−2, 0], when the variability in variable X is low and similar (part II). (Same measures as part I, for the remaining range-disturbance amplitudes.)
Table C.7: Results of the DSD Model II with a = 2, b = 1 and v = [−2, 0], when the variability in variable X is high and similar (part I). (Reports Ω (s), Ω̃ (s) and RMSE_M (s), for n = 30 and n = 100, for center disturbances c̃ and range disturbances r̃ of increasing amplitude.)

Table C.8: Results of the DSD Model II with a = 2, b = 1 and v = [−2, 0], when the variability in variable X is high and similar (part II). (Same measures as part I, for the remaining range-disturbance amplitudes.)
C.1. SIMULATION STUDY I WITH INTERVAL VARIABLES
311
0.9678 (0.0029) 0.7236 (0.0195)
0.8729 (0.0197) 0.3407 (0.0474)
0.8826 (0.0096) 0.3983 (0.0246)
0.6346 (0.0412) 0.1185 (0.0248) 11.1934 (0.9767) 0.6360 (0.0412) 0.1172 (0.0241) 11.1603 (0.9733)
0.6547 (0.0215) 0.1461 (0.0164) 11.4198 (0.5355) 0.6540 (0.0212) 0.1459 (0.0162) 11.4370 (0.5291)
0.3114 (0.0399) 0.0327 (0.0073) 22.4016 (1.9438) 0.3129 (0.0400) 0.0333 (0.0073) 22.3858 (1.9690)
0.3246 (0.0205) 0.0424 (0.0055) 22.8558 (1.0274) 0.3241 (0.0210) 0.0424 (0.0058) 22.8908 (1.0334)
100
30
100
30
100
30
100
5.7191 (0.2644)
5.5725 (0.4974)
2.8574 (0.1329)
2.772 (0.2454)
0.8812 (0.0096) 0.3956 (0.0262)
0.8722 (0.0200) 0.3359 (0.0507)
0.9661 (0.0028) 0.7131 (0.0189)
0.9633 (0.0062) 0.6502 (0.0450)
5.7570 (0.2625)
5.5877 (0.5002)
2.9331 (0.1246)
2.8425 (0.2480)
0.8716 (0.0292)
0.9650 (0.0059) 0.6653 (0.0419)
0.9969 (0.0002) 0.9655 (0.0024)
0.8475 (0.0530)
0.6595 (0.0305)
0.6413 (0.0565)
30
0.5713 (0.0264)
0.9966 (0.0004) 0.9534 (0.0061)
0.9982 (0.0002) 0.9799 (0.0019)
0.9981 (0.0003) 0.9726 (0.0047)
RM SE M (s)
0.9987 (0.0001) 0.9849 (0.0014)
0.5556 (0.0499)
0 (0)
0 (0)
e (s)
Ω
r
e ∈ U (−2, 2)
100
1.0000 (0)
1.0000 (0)
Ω (s)
0.9985 (0.0003) 0.9797 (0.0035)
1.0000 (0)
100
RM SE M (s)
30
1.0000 (0)
30
r
e=0
e (s)
Ω
mr =
min
j∈{1,...,30}
n
rY ∗ (j)
o
0.6329 (0.0423) 0.1120 (0.0263) 11.2350 (0.9825) 0.6119 (0.0434) 0.0938 (0.0295) 11.7273 (0.9102)
0.6502 (0.0214) 0.1418 (0.0180) 11.5319 (0.5195) 0.6308 (0.0223) 0.1252 (0.0186) 12.0294 (0.4948)
0.3111 (0.0413) 0.0318 (0.0081) 22.4539 (1.9514) 0.3094 (0.0449) 0.0289 (0.0093) 22.5868 (1.9729)
0.3228 (0.0208) 0.0414 (0.0059) 22.9619 (1.0046) 0.3199 (0.0240) 0.0391 (0.0074)
30
100
30
100
rY ∗ (j)
0.8750 (0.0096) 0.3766 (0.0275)
100
max
0.8642 (0.0198) 0.3049 (0.0498)
30
j∈{1,...,30}
0.9576 (0.0030) 0.6578 (0.0219)
100
o
0.9536 (0.0065) 0.5782 (0.0458)
30
n
0.9877 (0.0010) 0.8723 (0.0107)
= 37.7110; mr =
min
0.4735 (0.0746)
n
rY ∗ (j)
o
= 11.2112;
max
n
rY ∗ (j)
23.137 (1.0687)
6.8162 (0.2351)
6.6181 (0.4439)
4.6746 (0.1595)
4.5586 (0.2942)
3.7418 (0.1691)
3.6531 (0.3213)
j∈{1,...,100}
0.8411 (0.0113) 0.3009 (0.0318)
0.8293 (0.0236) 0.2306 (0.0579)
0.9184 (0.0061) 0.4757 (0.0348)
0.9105 (0.0127) 0.3705 (0.0630)
0.9459 (0.0051) 0.5817 (0.0354)
j∈{1,...,100}
5.9289 (0.2530)
5.7904 (0.4752)
3.2971 (0.1147)
3.2115 (0.2269)
1.7468 (0.0689)
0.9404 (0.011)
3.7013 (0.1674)
100
1.7051 (0.1313)
3.6269 (0.3162)
RM SE M (s)
0.9865 (0.0021) 0.8270 (0.0261)
0.9471 (0.0052) 0.5885 (0.0371)
0.9412 (0.0106) 0.4745 (0.0733)
e (s)
Ω
r
e ∈ U (−mr, mr)
30
1.6500 (0.0742)
1.6070 (0.1456)
Ω (s)
0.9890 (0.0010) 0.8844 (0.0106)
RM SE M (s)
0.9879 (0.0022) 0.8417 (0.0274)
e (s)
Ω
r
e ∈ U (−5, 5)
100
Ω (s)
30
= 11.2112;
c
e ∈ U (−40, 40)
c
e ∈ U (−20, 20)
c
e ∈ U (−10, 10)
c
e ∈ U (−5, 5)
c
e ∈ U (−1, 1)
c
e= 0
n
o
= 40.5470.
Table C.10: Results of the DSD Model II with a = 2, b = 1 and v = [−2, 0], when is mixed (type I) the variability in variable X(part II).
c
e ∈ U (−40, 40)
c
e ∈ U (−20, 20)
c
e ∈ U (−10, 10)
c
e ∈ U (−5, 5)
c
e ∈ U (−1, 1)
c
e= 0
Ω (s)
n
Table C.9: Results of the DSD Model II with a = 2, b = 1 and v = [−2, 0], when is mixed (type I) the variability in variable X (part I).
Table C.11: Results of the DSD Model II with a = 2, b = 1 and v = [−2, 0], when the variability in variable X is mixed (type II) (part I). (Reports Ω (s), Ω̃ (s) and RMSE_M (s), for n = 30 and n = 100, for center disturbances c̃ and range disturbances r̃ of increasing amplitude.)

Table C.12: Results of the DSD Model II with a = 2, b = 1 and v = [−2, 0], when the variability in variable X is mixed (type II) (part II). (Same measures as part I, for the remaining range-disturbance amplitudes.)
C.2. Simulation study II with interval variables
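The tables in this section additionally report RMSE_L and RMSE_U. Assuming these denote the root mean squared errors of the interval lower and upper bounds, respectively (the usual convention for interval-valued regression), a minimal sketch:
\begin{verbatim}
import numpy as np

def rmse_bounds(c_obs, r_obs, c_pred, r_pred):
    """RMSE of the lower and upper bounds of intervals [c - r, c + r].
    Returns the pair (RMSE_L, RMSE_U)."""
    c_obs, r_obs = np.asarray(c_obs), np.asarray(r_obs)
    c_pred, r_pred = np.asarray(c_pred), np.asarray(r_pred)
    err_l = (c_obs - r_obs) - (c_pred - r_pred)  # lower-bound errors
    err_u = (c_obs + r_obs) - (c_pred + r_pred)  # upper-bound errors
    return (float(np.sqrt((err_l ** 2).mean())),
            float(np.sqrt((err_u ** 2).mean())))
\end{verbatim}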
Table C.13: Results, in different conditions, of the DSD Model I with a = 2, b = 1 and v = −1. (Reports, for n = 10, 30, 100, 250, at error levels I and II and under low and similar, high and similar, and mixed patterns of variability: the estimated parameters a∗ (s), b∗ (s), v∗ (s) with MSE(a), MSE(b), MSE(v), and the goodness-of-fit measures Ω (s), RMSE_M (s), RMSE_L (s), RMSE_U (s).)
Table C.14: Results, in different conditions, of the DSD Model I with a = 2, b = 8 and v = 3. (Same layout and measures as Table C.13.)
Table C.15: Results, in different conditions, of the DSD Model I with a = 6, b = 0 and v = 2. (Same layout and measures as Table C.13.)
Table C.16: Results, in different conditions, of the DSD Model I with a1 = 2, b1 = 1, a2 = 0.5, b2 = 3, a3 = 1.5, b3 = 1 and v = −1. (Reports, for n = 10, 30, 100, 250, at error levels I and II and under the same patterns of variability: the estimated parameters a∗1 (s), b∗1 (s), a∗2 (s), b∗2 (s), a∗3 (s), b∗3 (s) with their MSEs.)
Table C.17: Results, in different conditions, of the DSD Model I with a1 = 2, b1 = 1, a2 = 0.5, b2 = 3, a3 = 1.5, b3 = 1 and v = −1 (continuation of Table C.16). (Reports the estimated parameter v∗ (s) with MSE(v), and the goodness-of-fit measures Ω (s), RMSE_M (s), RMSE_L (s), RMSE_U (s).)
Table C.18: Results, in different conditions, of the DSD Model II with a = 2, b = 1 and Ψ⁻¹_Constant(t) that represents the interval I = [−2, 0]. (Reports, for n = 10, 30, 100, 250, at error levels I and II and under the same patterns of variability: the estimated parameters a∗ (s), b∗ (s) with MSE(a), MSE(b), the estimated interval I∗ with D²_M(I∗, I) (s), and the goodness-of-fit measures Ω (s), RMSE_M (s), RMSE_L (s), RMSE_U (s).)
Table C.19: Results, in different conditions, of the DSD Model II with a = 2, b = 8 and Ψ⁻¹_Constant(t) that represents the interval I = [1, 5]. (Same layout and measures as Table C.18.)
Table C.20: Results, in different conditions, of the DSD Model II with a = 6, b = 0 and Ψ⁻¹_Constant(t) that represents the interval I = [1, 3]. (Same layout and measures as Table C.18.)
Table C.21: Results, in different conditions, of the DSD Model II with a1 = 2, b1 = 1, a2 = 0.5, b2 = 3, a3 = 1.5, b3 = 1 and Ψ⁻¹_Constant(t) that represents the interval I = [−2, 0]. (Reports, for n = 10, 30, 100, 250, at error levels I and II and under the same patterns of variability: the estimated parameters a∗1 (s), b∗1 (s), a∗2 (s), b∗2 (s), a∗3 (s), b∗3 (s) with their MSEs.)
Table C.22: Results, in different conditions, of the DSD Model II with $a_1 = 2$, $b_1 = 1$, $a_2 = 0.5$, $b_2 = 3$, $a_3 = 1.5$, $b_3 = 1$ and $\Psi^{-1}_{Constant}(t)$ that represents the interval $I = [-2, 0]$ (continuation of Table C.21).

[Table body not recoverable from the source conversion. For the same conditions as in Table C.21, it reports the estimated parameter $v^*(s)$ with $MSE(v)$ and the goodness-of-fit measures $\Omega(s)$, $RMSE_M(s)$, $RMSE_L(s)$ and $RMSE_U(s)$.]
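In all of these tables, an entry of the form $x$ $(y)$ gives the mean of the parameter estimates over the simulation replications, with their standard deviation $s$ in parentheses, and each MSE column gives the mean squared error of the estimates with respect to the true parameter value. The following minimal sketch (in Python, for illustration only; the array of replication estimates and the true parameter value are hypothetical inputs, not the thesis code) shows how such summaries are conventionally computed:

import numpy as np

def summarize(estimates, true_value):
    """Mean estimate, standard deviation s, and MSE over the replications."""
    estimates = np.asarray(estimates, dtype=float)
    mean_est = float(np.mean(estimates))                    # reported as the main entry
    s = float(np.std(estimates, ddof=1))                    # reported in parentheses, (s)
    mse = float(np.mean((estimates - true_value) ** 2))     # reported in the MSE column
    return mean_est, s, mse

# Usage: summaries for hypothetical estimates of a parameter with true value 2
rng = np.random.default_rng(0)
a_hat = rng.normal(loc=2.0, scale=0.3, size=1000)  # stand-in for the replication estimates
print(summarize(a_hat, 2.0))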
C.3. Simulation study II with histogram variables

Table C.23: Results, in different conditions, of the DSD Model I with $a = 2$, $b = 1$ and $v = -1$.

[Table body not recoverable from the source conversion. For each distribution of the microdata (Uniform, Normal, LogNormal, mixture of different distributions), error level (Level I, Level II) and sample size $n = 10, 30, 100, 250$, it reports the estimated parameters $a^*(s)$, $b^*(s)$, $v^*(s)$ with the corresponding MSE, and the goodness-of-fit measures $\Omega(s)$, $RMSE_M(s)$, $RMSE_L(s)$ and $RMSE_U(s)$.]
Table C.24: Results, in different conditions, of the DSD Model I with $a = 2$, $b = 8$ and $v = 3$.

[Table body not recoverable from the source conversion; same structure as Table C.23.]
Table C.25: Results, in different conditions, of the DSD Model I with $a = 6$, $b = 0$ and $v = 2$.

[Table body not recoverable from the source conversion; same structure as Table C.23.]
Table C.26: Results, in different conditions, of the DSD Model I with $a_1 = 2$, $b_1 = 1$, $a_2 = 0.5$, $b_2 = 3$, $a_3 = 1.5$, $b_3 = 1$ and $v = -1$.

[Table body not recoverable from the source conversion. For each distribution of the microdata (Uniform, Normal, LogNormal, mixture of different distributions), error level (Level I, Level II) and sample size $n = 10, 30, 100, 250$, it reports the estimated parameters $a_i^*(s)$, $b_i^*(s)$, $i = 1, 2, 3$, with the corresponding MSE.]
Table C.27: Results, in different conditions, of the DSD Model I with $a_1 = 2$, $b_1 = 1$, $a_2 = 0.5$, $b_2 = 3$, $a_3 = 1.5$, $b_3 = 1$ and $v = -1$ (continuation of Table C.26).

[Table body not recoverable from the source conversion. For the same conditions as in Table C.26, it reports the estimated parameter $v^*(s)$ with $MSE(v)$ and the goodness-of-fit measures $\Omega(s)$, $RMSE_M(s)$, $RMSE_L(s)$ and $RMSE_U(s)$.]
Table C.28: Results, in different conditions, of the DSD Model I with $a_1 = 6$, $b_1 = 0$, $a_2 = 2$, $b_2 = 8$, $a_3 = 10$, $b_3 = 5$ and $v = 3$.

[Table body not recoverable from the source conversion; same structure as Table C.26.]
Table C.29: Results, in different conditions, of the DSD Model I with $a_1 = 6$, $b_1 = 0$, $a_2 = 2$, $b_2 = 8$, $a_3 = 10$, $b_3 = 5$ and $v = 3$ (continuation of Table C.28).

[Table body not recoverable from the source conversion; same structure as Table C.27.]
Table C.30: Results of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{U}(t)$ when the histogram-valued variable $X$ has Uniform or Normal distributions.

[Table body not recoverable from the source conversion. For each distribution of the microdata (Uniform, Normal), error level (Level I, Level II) and sample size $n = 10, 30, 100, 250$, it reports the estimated parameters $a^*(s)$, $b^*(s)$ with the corresponding MSE, the estimated histograms $\widehat{H}_U$ against $H_U$, and $D_M^2(s)$.]
Table C.31: Results of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{U}(t)$ when the histogram-valued variable $X$ has LogNormal distributions or a mixture of different distributions.

[Table body not recoverable from the source conversion; same structure as Table C.30, with LogNormal and mixture-of-different-distributions microdata.]
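The $D_M^2(s)$ columns in Tables C.30, C.31, C.33 and C.34 report squared distances between the estimated and the observed histograms. Assuming they refer to the squared Mallows distance used throughout the DSD framework (an interpretation, not confirmed within this appendix), expressed through the quantile functions $\Psi_1^{-1}$ and $\Psi_2^{-1}$ of the two distributions, it reads
$$ D_M^2\left(\Psi_1^{-1}, \Psi_2^{-1}\right) = \int_0^1 \left(\Psi_1^{-1}(t) - \Psi_2^{-1}(t)\right)^2 \, dt, $$
so that smaller values of $D_M^2$ indicate a closer match between the estimated and the observed histograms.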
Table C.32: Results, in different conditions, of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{U}(t)$ (continuation of Tables C.30 and C.31).

[Table body not recoverable from the source conversion. For the conditions of Tables C.30 and C.31, it reports the goodness-of-fit measures $\Omega(s)$, $RMSE_M(s)$, $RMSE_L(s)$ and $RMSE_U(s)$.]
Table C.33: Results of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{N}(t)$ when the histogram-valued variable $X$ has Uniform or Normal distributions.

[Table body not recoverable from the source conversion; same structure as Table C.30, with the estimated histograms $\widehat{H}_N$ against $H_N$ and $D_M^2(s)$.]
Table C.34: Results of the DSD Model II with $a = 2$, $b = 8$ and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{N}(t)$ when the histogram-valued variable $X$ has LogNormal distributions or a mixture of different distributions.

[Table body not recoverable from the source conversion; same structure as Table C.33, with LogNormal and mixture-of-different-distributions microdata.]
[Goodness-of-fit measures Ω, RMSE_M, RMSE_L and RMSE_U, each reported as mean (standard deviation s), for every combination of distribution of the microdata (Uniform, Normal, LogNormal, mixture of different distributions), error level (Level I, Level II) and sample size n ∈ {10, 30, 100, 250}; the individual entries are not reproduced here.]
Table C.35: Results, in different conditions, of the DSD Model II with a = 2, b = 8 and Ψ⁻¹_Constant(t) = Ψ⁻¹_N(t) (continuation of Tables C.33 and C.34).
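For the goodness-of-fit columns, the sketch below shows one way such measures can be computed, assuming the definitions recalled from the main text: Ω as the ratio of the Mallows dispersion of the predicted histograms around the barycentric histogram to that of the observed ones, and RMSE_M as the root mean squared Mallows distance between observed and predicted histograms (RMSE_L and RMSE_U being the analogous measures computed on the lower and upper bounds of the subintervals). All names are illustrative.

    import math

    # Same closed-form Mallows distance as in the sketch after Table C.34.
    def mallows_sq(h1, h2, w=0.1):
        return sum(w * (((((l1 + u1) - (l2 + u2)) / 2.0) ** 2)
                        + ((((u1 - l1) - (u2 - l2)) / 2.0) ** 2) / 3.0)
                   for (l1, u1), (l2, u2) in zip(h1, h2))

    # Omega: Mallows dispersion of the predictions around the barycentric
    # histogram relative to that of the observations (assumed definition).
    def omega(observed, predicted, barycentre):
        num = sum(mallows_sq(p, barycentre) for p in predicted)
        den = sum(mallows_sq(o, barycentre) for o in observed)
        return num / den

    # RMSE_M: root mean squared Mallows distance between observed and
    # predicted histograms.
    def rmse_m(observed, predicted):
        n = len(observed)
        return math.sqrt(sum(mallows_sq(o, p)
                             for o, p in zip(observed, predicted)) / n)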
[Estimated parameters a* and b*, each reported as mean (standard deviation s), together with MSE(a) and MSE(b), the estimated histograms Ĥ_LogN (ten subintervals with weight 0.1 each) and the mean (s) of D²_M(Ĥ_LogN, H_LogN), for each error level (Level I, Level II) and sample size n ∈ {10, 30, 100, 250}; the individual entries are not reproduced here.]
Table C.36: Results of the DSD Model II with a = 2, b = 8 and Ψ⁻¹_Constant(t) = Ψ⁻¹_LogN(t) when the histogram-valued variable X has Uniform or Normal distributions.
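A consistency check for the MSE columns in Tables C.36, C.37, C.39 and C.40: over m simulated samples, MSE(a) = (1/m) Σₖ (a*ₖ − a)², which decomposes as (ā* − a)² + ((m−1)/m) s², i.e. squared bias plus (essentially) the variance of the estimates, and similarly for MSE(b). For instance, a mean estimate ā* = 2.0025 with s = 0.1127 yields MSE(a) ≈ 0.0025² + 0.1127² ≈ 0.0127.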
[Estimated parameters a* and b*, each reported as mean (standard deviation s), together with MSE(a) and MSE(b), the estimated histograms Ĥ_LogN (ten subintervals with weight 0.1 each) and the mean (s) of D²_M(Ĥ_LogN, H_LogN), for each error level (Level I, Level II) and sample size n ∈ {10, 30, 100, 250}; the individual entries are not reproduced here.]
Table C.37: Results of the DSD Model II with a = 2, b = 8 and Ψ⁻¹_Constant(t) = Ψ⁻¹_LogN(t) when the histogram-valued variable X has LogNormal distributions or a mixture of different distributions.
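All histograms displayed in these tables share the same decile format: ten consecutive subintervals, each carrying weight 0.1. The sketch below illustrates how such a representation is obtained from microdata, assuming the subinterval bounds are the empirical deciles of each unit's sample; the actual simulation design is the one described in the main text, and the parameters here are illustrative only.

    import numpy as np

    # Build a decile-format histogram (ten subintervals, weight 0.1 each)
    # from a sample of microdata: the bounds are the empirical deciles.
    def decile_histogram(microdata):
        q = np.quantile(microdata, np.linspace(0.0, 1.0, 11))  # 0%,10%,...,100%
        return list(zip(q[:-1], q[1:]))

    rng = np.random.default_rng(1)
    # LogNormal microdata, mirroring one of the "Distribution of microdata"
    # rows; the parameters are illustrative.
    h = decile_histogram(rng.lognormal(mean=0.0, sigma=1.0, size=1000))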
[Goodness-of-fit measures Ω, RMSE_M, RMSE_L and RMSE_U, each reported as mean (standard deviation s), for every combination of distribution of the microdata (Uniform, Normal, LogNormal, mixture of different distributions), error level (Level I, Level II) and sample size n ∈ {10, 30, 100, 250}; the individual entries are not reproduced here.]
Table C.38: Results, in different conditions, of the DSD Model II with a = 2, b = 8 and Ψ⁻¹_Constant(t) = Ψ⁻¹_LogN(t) (continuation of Tables C.36 and C.37).
[Estimated parameters a* and b*, each reported as mean (standard deviation s), together with MSE(a) and MSE(b), the estimated histograms Ĥ_Mix (ten subintervals with weight 0.1 each) and the mean (s) of D²_M(Ĥ_Mix, H_Mix), for each error level (Level I, Level II) and sample size n ∈ {10, 30, 100, 250}; the individual entries are not reproduced here.]
Table C.39: Results of the DSD Model II with a = 2, b = 8 and Ψ⁻¹_Constant(t) = Ψ⁻¹_Mix(t) when the histogram-valued variable X has Uniform or Normal distributions.
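All tables in this section refer to the same fitted model; as a reminder of what a and b in the captions stand for, the sketch below evaluates the DSD Model II prediction on a grid of t values, assuming the form recalled from the main text, Ψ̂_Y⁻¹(t) = Ψ⁻¹_Constant(t) + a Ψ_X⁻¹(t) − b Ψ_X⁻¹(1 − t) with a, b ≥ 0. The toy quantile functions are illustrative only, not part of the simulation design.

    import numpy as np

    # DSD Model II prediction on a grid of t in [0, 1], assuming
    # psi_y_hat(t) = psi_const(t) + a * psi_x(t) - b * psi_x(1 - t),
    # with a, b >= 0 (form recalled from the main text; sketch only).
    def dsd2_predict(psi_x, psi_const, a, b, t):
        return psi_const(t) + a * psi_x(t) - b * psi_x(1.0 - t)

    t = np.linspace(0.0, 1.0, 11)        # decile grid, as in this appendix
    psi_x = lambda u: 2.0 + u            # toy non-decreasing quantile function
    psi_c = lambda u: np.zeros_like(u)   # degenerate constant term, for example
    y_hat = dsd2_predict(psi_x, psi_c, a=2.0, b=8.0, t=t)

Because the term −b Ψ_X⁻¹(1 − t) is non-decreasing in t whenever b ≥ 0, the predicted function is itself a valid quantile function.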
[Estimated parameters a* and b*, each reported as mean (standard deviation s), together with MSE(a) and MSE(b), the estimated histograms Ĥ_Mix (ten subintervals with weight 0.1 each) and the mean (s) of D²_M(Ĥ_Mix, H_Mix), for each error level (Level I, Level II) and sample size n ∈ {10, 30, 100, 250}; the individual entries are not reproduced here.]
Table C.40: Results of the DSD Model II with a = 2, b = 8 and Ψ⁻¹_Constant(t) = Ψ⁻¹_Mix(t) when the histogram-valued variable X has LogNormal distributions or a mixture of different distributions.
[Goodness-of-fit measures Ω, RMSE_M, RMSE_L and RMSE_U, each reported as mean (standard deviation s), for every combination of distribution of the microdata (Uniform, Normal, LogNormal, mixture of different distributions), error level (Level I, Level II) and sample size n ∈ {10, 30, 100, 250}; the individual entries are not reproduced here.]
Table C.41: Results, in different conditions, of the DSD Model II with a = 2, b = 8 and Ψ⁻¹_Constant(t) = Ψ⁻¹_Mix(t) (continuation of Tables C.39 and C.40).
D