Local Gaussian Correlation: A new measure of dependence
Dag Tjøstheim
Karl Ove Hufthammer∗
18th September 2012
This paper will appear in Journal of Econometrics
Abstract
It is a common view among finance analysts and econometricians that the
correlation between financial objects becomes stronger as the market is going
down, and that it approaches one when the market crashes, having the effect of
destroying the benefit of diversification. The purpose of this paper is to introduce
a local dependence measure that gives a precise mathematical description and
interpretation of such phenomena. We propose a new local dependence measure,
a local correlation function, based on approximating a bivariate density locally
by a family of bivariate Gaussian densities using local likelihood. At each point
the correlation coefficient of the approximating Gaussian distribution is taken as
the local correlation. Existence, uniqueness and limit results are established. A
number of properties of the local Gaussian correlation and its estimate are given,
along with examples from both simulated and real data. This new method of
modelling carries with it the prospect of being able to do locally for a general
density what can be done globally for the Gaussian density. In a sense it extends
Gaussian analysis from a linear to a non-linear environment.
Key words: local likelihood; dependence measures; local dependence; local correlation
∗ University of Bergen, Department of Mathematics, PO Box 7803, 5020 Bergen, Norway. E-mail: [email protected]
1 Introduction
It is a common statement among finance analysts and econometricians that the correlation
between financial objects becomes stronger as the market is going down, and that it approaches
one when the market crashes, having the effect of destroying the benefit of diversification.
Such a statement is imprecise, since it is not clear what ‘correlation’ means here. The purpose
of this paper is to introduce a local dependence measure that gives a precise mathematical
description and interpretation of such phenomena.
In finance and econometrics there have been several attempts to handle variation in local
dependence in, say, portfolio modelling (Silvapulle and Granger 2001), measuring contagion
(Rodrigues 2007; Inci et al. 2011) and description of tail dependence (Campbell et al. 2008).
Perhaps the most used solution to these problems is to employ the concept of conditional
correlation; see, for example, Forbes and Rigobon (2002), Longin and Solnik (2001) and Hong
et al. (2007). This simply amounts to computing the ordinary product-moment correlation
for certain regions of values of log difference returns. Given two random variables X1 and X2
with observed values (X1i , X2i ), i = 1, . . . , n the correlation between X1 and X2 conditionally
on being in a region A is
$$\hat\rho_c(A) = \frac{\sum_{(X_{1i},X_{2i})\in A}(X_{1i}-\hat\mu_{X_1,c})(X_{2i}-\hat\mu_{X_2,c})}{\Big(\sum_{(X_{1i},X_{2i})\in A}(X_{1i}-\hat\mu_{X_1,c})^2\Big)^{1/2}\Big(\sum_{(X_{1i},X_{2i})\in A}(X_{2i}-\hat\mu_{X_2,c})^2\Big)^{1/2}},$$
where
$$\hat\mu_{X_1,c} = \frac{1}{n_A}\sum_{(X_{1i},X_{2i})\in A}X_{1i}\qquad\text{and}\qquad \hat\mu_{X_2,c} = \frac{1}{n_A}\sum_{(X_{1i},X_{2i})\in A}X_{2i},$$
with nA being the number of pairs with (X1i , X2i ) ∈ A. For an ergodic series (X1i , X2i ), as
nA tends to infinity, ρbc (A) converges to ρc (A) = corr(X1 , X2 |(X1 , X2 ) ∈ A) almost surely.
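As a concrete illustration of this estimator, the conditional correlation over a region A can be sketched in a few lines of Python (the helper name and the tail-region example are ours, not from the paper):

```python
import numpy as np

def conditional_correlation(x1, x2, in_region):
    """Empirical conditional correlation rho_c(A): the ordinary
    product-moment correlation computed only over the pairs
    (x1[i], x2[i]) that fall in the region A."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    mask = in_region(x1, x2)            # indicator of (X1i, X2i) in A
    u, v = x1[mask], x2[mask]
    u = u - u.mean()                    # centre with the *conditional* means
    v = v - v.mean()
    return np.sum(u * v) / np.sqrt(np.sum(u**2) * np.sum(v**2))

# Bivariate Gaussian sample with global correlation 0.5; conditioning on the
# lower-tail region A = {x1 < 0, x2 < 0} pulls the estimate away from 0.5,
# which is exactly the 'bias problem' discussed below.
rng = np.random.default_rng(1)
z = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=200_000)
rho_tail = conditional_correlation(z[:, 0], z[:, 1],
                                   lambda a, b: (a < 0) & (b < 0))
print(rho_tail)  # positive, but noticeably smaller than the global 0.5
```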
There are several disadvantages of the conditional correlation concept: i) The introduction
of an indicator function to define the local region means that ρc (A) does not equal the global
correlation ρ for a pair of jointly Gaussian variables (X1 , X2 ). This is unfortunate, since in a
Gaussian distribution dependence is completely characterised by the correlation coefficient.
This difficulty is sometimes termed the bias problem for the conditional correlation, and it
has been sought corrected for in various ways; see, for example, Forbes and Rigobon (2002).
ii) The conditional correlation is defined for a region A and not for a point (x1 , x2 ). How
should A be chosen? iii) The conditional correlation applies a linear dependence measure to local regions. Is this sensible in a non-linear and non-Gaussian situation?
We believe the local Gaussian correlation that we advocate in this paper does not have
these disadvantages. The idea behind it is very simple, though its computation and estimation
are far more difficult than the corresponding computation and estimation of the conditional
correlation. We start by recalling that the ordinary correlation coefficient is fully satisfactory
for a Gaussian distribution. Now, for a general bivariate density f for the variables (X1 , X2 ),
locally in a neighbourhood of each point x = (x1 , x2 ) we fit a Gaussian bivariate density
$$\psi(v, \mu(x), \Sigma(x)) = \frac{1}{2\pi|\Sigma(x)|^{1/2}}\exp\Big[-\frac{1}{2}(v-\mu(x))^T\Sigma^{-1}(x)(v-\mu(x))\Big]$$
using local likelihood. Here v = (v1 , v2 )T is the running variable in this Gaussian distribution,
µ(x) = (µ1 (x), µ2 (x))T is the local mean vector, Σ(x) = (σij (x)) is the local covariance
matrix, and we desire that ψ(v, µ(x), Σ(x)) equals f (x) at v = x and is close to f in a
neighbourhood of x. With $\sigma_i^2(x) = \sigma_{ii}(x)$, we define the local correlation at the point x by $\rho(x) = \sigma_{12}(x)/(\sigma_1(x)\sigma_2(x))$, and in terms of the local correlation, ψ may be written in the usual fashion,
$$\psi(v, \mu_1(x), \mu_2(x), \sigma_1^2(x), \sigma_2^2(x), \rho(x)) = \frac{1}{2\pi\sigma_1(x)\sigma_2(x)\sqrt{1-\rho^2(x)}}$$
$$\times \exp\bigg\{-\frac{1}{2}\,\frac{1}{1-\rho^2(x)}\bigg[\frac{(v_1-\mu_1(x))^2}{\sigma_1^2(x)} - 2\rho(x)\frac{(v_1-\mu_1(x))(v_2-\mu_2(x))}{\sigma_1(x)\sigma_2(x)} + \frac{(v_2-\mu_2(x))^2}{\sigma_2^2(x)}\bigg]\bigg\}. \tag{1}$$
In the sequel we will use θ(x) to denote the 5-dimensional vector $[\mu_1(x), \mu_2(x), \sigma_1^2(x), \sigma_2^2(x), \rho(x)]$, or alternatively with ρ(x) replaced by $\sigma_{12}(x)$ as the fifth element. As we move to another point
x0 , we obtain a new Gaussian approximation density ψ(v, µ(x0 ), Σ(x0 )), which approximates
f in a neighbourhood of x0 . In this way f is represented by a family of Gaussian bivariate
densities as x varies, the point being that in each specific neighbourhood of a point x, the
local dependence properties are described by the local covariance/correlation matrix, and
since ψ is Gaussian, we get a complete characterisation of the dependence relationship in that
neighbourhood (modulo the precision of the approximation). All of this depends on whether
a local Gaussian representation exists and to which degree it is unique. This problem will be
discussed throughout Section 2, along with an appropriate asymptotic estimation theory in
Sections 3–4.
The ‘bias’ problem of the conditional correlation disappears trivially for the local Gaussian
correlation, since if f is Gaussian, the same Gaussian will approximate f at every point.
Note that in general the local Gaussian is not centered at the point x. Indeed, in the case of
f Gaussian, all of the local Gaussians will be identical to f and hence located at the mean µ
of f .
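The equivalence of the two parametrisations of ψ, via (µ(x), Σ(x)) or via the five parameters of (1), can be checked numerically; this is an illustrative sketch (the function names are ours):

```python
import numpy as np

def psi(v, mu1, mu2, sig1_sq, sig2_sq, rho):
    """Bivariate Gaussian density in the (mu1, mu2, sigma1^2, sigma2^2, rho)
    parametrisation of equation (1), evaluated at v = (v1, v2)."""
    v1, v2 = v
    s1, s2 = np.sqrt(sig1_sq), np.sqrt(sig2_sq)
    z1, z2 = (v1 - mu1) / s1, (v2 - mu2) / s2
    quad = (z1**2 - 2 * rho * z1 * z2 + z2**2) / (1 - rho**2)
    return np.exp(-0.5 * quad) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2))

def psi_matrix(v, mu, Sigma):
    """Same density in the mean-vector / covariance-matrix form."""
    d = np.asarray(v, float) - np.asarray(mu, float)
    quad = d @ np.linalg.solve(Sigma, d)
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

# The two forms agree once rho = sigma12 / (sigma1 * sigma2).
Sigma = np.array([[1.5, 0.6], [0.6, 0.8]])
rho = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
v, mu = (0.3, -0.4), (0.1, 0.2)
print(psi(v, 0.1, 0.2, 1.5, 0.8, rho), psi_matrix(v, mu, Sigma))
```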
A picture of the estimated local correlation of a distribution with standard Gaussian
marginals and a Gumbel copula is shown in Figure 1. The local Gaussian correlation
increases as (x1 , x2 ) increases. We will return to the pattern of Figure 1 in Sections 6.2 and
8.3. The corresponding local Gaussian correlation for the Clayton copula, sometimes used to
model financial objects, has the opposite pattern. We refer to Berentsen et al. (2012) for
more details and other copula structures.
It may be useful to compare our approach to measuring local dependence to other attempts
in the statistical literature. Bjerve and Doksum (1993) and Doksum et al. (1994) base their
approach on the relationship between correlation and linear regression. There are well-defined
procedures for defining non-linear and non-parametric regression models, and the authors
define a non-linear local correlation by extending linear relationships to non-linear ones.
A disadvantage is that the response variable Y plays a fundamentally different role from
the input variable X. This leads to asymmetry in their local correlation (which they can
heuristically correct for). An application to finance is given in Inci et al. (2011). In our
approach we treat (X, Y ), or rather (X1 , X2 ), on exactly the same basis. This in turn leads
to non-traditional non-parametric estimation problems.
In the approach by Holland and Wang (1987) and Jones (1996) the two variables are
treated on the same basis. The authors start from a local version of the odds ratio in
contingency tables. Using limit arguments, they arrive at the local dependence function
$$\gamma(x_1, x_2) = \frac{\partial^2}{\partial x_1\,\partial x_2}\log f(x_1, x_2).$$
This is a function of the ordinary correlation coefficient when the variables are jointly
Gaussian, but its range is not between −1 and 1. Jones (1996) demonstrated that γ may be
obtained by a localization argument of the product moment correlation appropriately scaled;
see also Jones (1998) and Jones and Koch (2003). In a recent contribution Delicado and
Smrekar (2009) propose a new dependence measure using principal curves.
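For a jointly Gaussian pair with standard margins, γ reduces to the constant $\rho/(1-\rho^2)$, which indeed falls outside $[-1, 1]$ for large |ρ|. A small sketch (ours, not from the paper) verifies this by finite-differencing the mixed partial of the log-density:

```python
import math

def log_f(x1, x2, rho):
    """Log-density of a standard bivariate Gaussian with correlation rho."""
    q = (x1**2 - 2 * rho * x1 * x2 + x2**2) / (1 - rho**2)
    return -math.log(2 * math.pi) - 0.5 * math.log(1 - rho**2) - 0.5 * q

def gamma_hat(x1, x2, rho, h=1e-4):
    """Mixed second partial of log f by central differences:
    gamma(x1, x2) = d^2/(dx1 dx2) log f(x1, x2)."""
    return (log_f(x1 + h, x2 + h, rho) - log_f(x1 + h, x2 - h, rho)
            - log_f(x1 - h, x2 + h, rho) + log_f(x1 - h, x2 - h, rho)) / (4 * h**2)

# For a Gaussian density gamma is constant in (x1, x2): rho / (1 - rho^2).
rho = 0.5
print(gamma_hat(0.7, -1.2, rho))   # approx. 0.5 / 0.75 = 0.666...
```

Since the Gaussian log-density is quadratic, the central difference is exact up to rounding, so the same value appears at every evaluation point.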
A formal definition of local Gaussian population parameters µ(x), Σ(x) and hence ρ(x) is
given in Section 2.1. After that, in Section 2.2, a discussion of local parameters and their
computation in the setting of (possibly nonlinear) transformations of Gaussian variables is
given. A connection to copula theory is pointed out. See also Section 5.
The local quantities µ(x) and Σ(x) must be estimated as a function of x. We use local
likelihood for that purpose. The relevant theory is outlined in Sections 3 – 4. A first stage of
an asymptotic theory with a fixed bandwidth is given in Section 3. The general case of a
decreasing bandwidth is covered in Section 4. A number of properties of the local parameters
such as range, transformation, symmetry and tail dependence are established in Sections 5–6. Matters of practical implementation, such as computing standard errors and choosing
bandwidth, are looked at in Section 7. Finally, there are some illustrations on simulated and
real data in Section 8.
The present paper contains the basic theory and properties of the local Gaussian correlation.
There are many potential applications. Applications to independence testing, including
introduction of positive and negative nonlinear dependence, are in Berentsen and Tjøstheim
(2012). A connection to copula theory and a characterization and visualization of their
dependence structure can be found in Berentsen et al. (2012). Applications to financial
contagion are given in Støve et al. (2012).
2 Existence of local parameters
In sub-section 2.1 we define population parameters θb (x) for a fixed bandwidth b completely
analogous to those defined in Hjort and Jones (1996). We then let b tend to zero to
obtain population values θ(x). In sub-section 2.2 we look at the special case of nonlinear
transformations of Gaussian variables and copulas.
2.1 A local penalty function
First note that the representation in (1) is not well-defined unless extra conditions are
imposed. Actually, as is easy to check, if f (x) is a global Gaussian N (µ, Σ), infinitely many Gaussian densities can be chosen that pass through f (x) at the point x. However, these
Gaussian densities are not really the objects we want. We need to construct a Gaussian
approximation that approximates f (x) in a neighborhood of x and such that (1) holds at x.
In the case of X ∼ N (µ, Σ) this is trivially obtained by taking one Gaussian; i.e., µ(x) = µ
and Σ(x) = Σ for all x. In fact, these relationships may be taken as definitions of the local
parameters for a Gaussian distribution, and it will be shown in Sections 3 and 4 that the
b
local likelihood estimates µ
b(x) and Σ(x)
converge towards µ and Σ.
Since our focus is on representation (and estimation) in a neighbourhood, in order to define
population parameters, we start with a fixed neighbourhood of x defined by a 2-dimensional
bandwidth b and a local penalty function. Subsequently we let b be small. Perhaps the most
natural penalty function is the one given by
$$q_0 = \int K_b(v-x)\,\big(\psi(v,\theta(x)) - f(v)\big)^2\,dv,$$
where $K_b(v-x) = (b_1b_2)^{-1}K(b_1^{-1}(v_1-x_1))\,K(b_2^{-1}(v_2-x_2))$ is a product kernel with bandwidth $b = (b_1, b_2)$. Then $\theta(x) = \theta_b(x)$ could be chosen to minimize $q_0$, so that it would satisfy
$$\int K_b(v-x)\,\frac{\partial\psi}{\partial\theta_j}(v,\theta(x))\,[f(v) - \psi(v,\theta(x))]\,dv = 0,\quad j = 1,\dots,5. \tag{2}$$
In a sense this would correspond to a local least squares approximation. But since we are
dealing with densities a criterion related to local log likelihood may be better suited. Such a
penalty function is obtained by replacing $q_0$ with
$$q = \int K_b(v-x)\,[\psi(v,\theta(x)) - \log\psi(v,\theta(x))\,f(v)]\,dv. \tag{3}$$
As is seen in Hjort and Jones (1996), this can be interpreted as a locally weighted Kullback-Leibler distance from f to ψ(·, θ(x)). Corresponding to (2), with the local score function as a weight function, we then obtain that the minimizer θ(x) of (3) should satisfy
$$\int K_b(v-x)\,\frac{\partial}{\partial\theta_j}\log\psi(v,\theta(x))\,[f(v) - \psi(v,\theta(x))]\,dv = 0,\quad j = 1,\dots,5. \tag{4}$$
We define the population value θb (x) as the minimizer of the penalty function q, assuming that there is a bandwidth b0 such that there exists a unique minimizer of (3) satisfying (4) for any b with 0 < b < b0. Here and in the following, for 2-dimensional bandwidths, $b' \le b \le b''$ is taken to mean $b'_i \le b_i \le b''_i$ for i = 1, 2. This definition of θb (x) is essentially identical to
the one used in Hjort and Jones (1996) for more general parametric families of densities.
Note that both (2) and (4) yield local solutions θb (x) such that if Kb has a compact support Sb and f1 (v) and f2 (v) coincide on {v : (v − x) ∈ Sb }, then θb (f1 , x) = θb (f2 , x).
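The defining property of the population value can be illustrated numerically: minimizing the penalty q in (3) for a known Gaussian f recovers the global parameters at any point x, which is exactly why the 'bias problem' of the conditional correlation disappears. The following is a sketch under our own choices (Gaussian kernel, grid quadrature, unconstrained reparametrisation), not the authors' algorithm:

```python
import numpy as np
from scipy.optimize import minimize

# Grid for numerical integration of the penalty q in (3).
g = np.linspace(-6.0, 6.0, 121)
V1, V2 = np.meshgrid(g, g)
dv = (g[1] - g[0]) ** 2

def log_psi(V1, V2, th):
    mu1, mu2, ls1, ls2, eta = th          # unconstrained parametrisation
    s1, s2, rho = np.exp(ls1), np.exp(ls2), np.tanh(eta)
    z1, z2 = (V1 - mu1) / s1, (V2 - mu2) / s2
    q = (z1**2 - 2 * rho * z1 * z2 + z2**2) / (1 - rho**2)
    return -np.log(2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2)) - 0.5 * q

rho_true = 0.5
f = np.exp(log_psi(V1, V2, (0.0, 0.0, 0.0, 0.0, np.arctanh(rho_true))))

x, b = (0.5, -0.5), (1.0, 1.0)            # evaluation point and fixed bandwidth
Kb = np.exp(-0.5 * (((V1 - x[0]) / b[0])**2 + ((V2 - x[1]) / b[1])**2)) \
     / (2 * np.pi * b[0] * b[1])

def q_penalty(th):
    lp = log_psi(V1, V2, th)
    # q = int K_b(v - x) [psi(v, theta) - log psi(v, theta) f(v)] dv
    return np.sum(Kb * (np.exp(lp) - lp * f)) * dv

res = minimize(q_penalty, x0=np.array([0.1, -0.1, 0.1, 0.1, 0.2]),
               method="Nelder-Mead",
               options={"xatol": 1e-7, "fatol": 1e-11, "maxiter": 5000})
rho_b = np.tanh(res.x[4])
print(rho_b)   # close to the global correlation 0.5: no 'bias problem'
```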
It is easy to find examples where (4) is satisfied with a unique θb (x). A trivial example
is when X ∼ N (µ, Σ) is Gaussian, where θb (x) = θ(x) = (µ, Σ). The next stage consists in
defining a step function of a Gaussian variable Z ∼ N (µ, Σ), where we will take µ = 0 and
Σ = I2 , the identity matrix of dimension 2. Let Ri , i = 1, . . . , k be a set of non-overlapping
regions of R2 such that R2 = ∪ki=1 Ri . Further, let ai and Ai be a corresponding set of vectors
and matrices in R² such that Ai is non-singular, and define the piecewise linear function
$$X = g_s(Z) = \sum_{i=1}^{k}(a_i + A_i Z)\,\mathbf{1}(Z \in R_i), \tag{5}$$
where $\mathbf{1}(\cdot)$ is the indicator function. Let Si be the region defined by $S_i = \{x : x = a_i + A_i z,\ z \in R_i\}$. It is assumed that (5) is one-to-one in the sense that $S_i \cap S_j = \emptyset$ for $i \ne j$ and $\cup_{i=1}^k S_i = \mathbb{R}^2$. To see that the linear step function (5) can be used to obtain a solution
of (4) let x be a point in the interior of Si and let the kernel function K have a compact
support. By choosing b small enough, then all v for which v − x is in the support of Kb ,
will be in Si. Under this restriction on b, $\theta_b(x) = \theta(x) \equiv \theta_i = (\mu_i, \Sigma_i)$, where $\mu_i = a_i$ and $\Sigma_i = A_i A_i^T$ as defined in (5). Thus, in this sense, for a fixed but small b, there exists a local Gaussian approximation ψ(x, θb) of f, with corresponding local means $\mu_{i,b}(x)$, variances $\sigma^2_{i,b}(x)$, i = 1, 2, and correlation $\rho_b(x)$. It will be seen in Sections 3 and 4 that for these points the local likelihood estimates $\hat\theta_b(x) = [\hat\mu_b(x), \hat\Sigma_b(x)]$ converge towards $\theta_b(x) = [\mu_b(x), \Sigma_b(x)]$
as the number of observations tends to infinity and the bandwidth b is held fixed, and towards
a population parameter θ(x) as n → ∞ and b → 0. (The local parameters at the boundary
of the regions can be defined by some convention; these boundaries having zero measure in
R2 ).
Note that (5) can be used to simulate observations from a pair (X1 , X2 ) having a simple
given local stepwise structure $(\mu_i, \Sigma_i)$ in the $x_1$-$x_2$ plane, with $R_i = \{z : z = A_i^{-1}(x - a_i),\ x \in S_i\}$ and with $a_i = \mu_i$, $A_i = \Sigma_i^{1/2}$.
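A minimal simulation of this kind can be sketched as follows; the two-region example (half-planes split at the first coordinate) is ours, chosen so that the one-to-one requirement is easy to verify by inverting the map:

```python
import numpy as np

# Two regions R1 = {z1 < 0}, R2 = {z1 >= 0}; both maps preserve the sign of
# the first coordinate, so S1 = {x1 < 0}, S2 = {x1 >= 0} and g is one-to-one.
A = [np.array([[1.0, 0.0], [0.0, 1.0]]),
     np.array([[1.0, 0.0], [0.8, 0.6]])]     # A2 A2^T gives local rho = 0.8
a = [np.zeros(2), np.zeros(2)]

def g(z):
    i = 0 if z[0] < 0 else 1
    return a[i] + A[i] @ z

def g_inv(x):
    i = 0 if x[0] < 0 else 1
    return np.linalg.solve(A[i], x - a[i])

rng = np.random.default_rng(7)
Z = rng.standard_normal((1000, 2))
X = np.array([g(z) for z in Z])              # simulated observations
Z_back = np.array([g_inv(x) for x in X])
print(np.max(np.abs(Z_back - Z)))            # ~0: the map is invertible

# Local parameters in the interior of S2, per the text: mu2 = a2, Sigma2 = A2 A2^T.
Sigma2 = A[1] @ A[1].T
print(Sigma2[0, 1] / np.sqrt(Sigma2[0, 0] * Sigma2[1, 1]))  # local rho = 0.8
```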
Before we proceed any further and let b → 0, let us verify that the population values
satisfying (4) are consistent with the local likelihood approach of Hjort and Jones (1996).
Their primary purpose was not the estimation of a population parameter θ(x) but rather
the estimation of f (x) via ψ(x, θ(x)) where ψ may not be Gaussian and where θ(x) may
not be a location-scale parameter. To introduce the local likelihood, let (X1 , . . . , Xn ) with
Xi = (X1i , X2i ) be iid observations from the density f and let ψ(x, θ(x)) be the parametric
family we are interested in (Gaussian in our case). One might suppose that an appropriate
local log likelihood, which should be maximised to obtain an estimate of θ(x) for a fixed x, is
$$L_0 = n^{-1}\sum_i K_b(X_i - x)\log\psi(X_i, \theta(x)),$$
where Kb is a kernel function as defined above. However, this gives nonsensical results
when used in an approximation context, as has been realised by Hjort and Jones (1996) and
others. There are several modifications of the local likelihood to handle this problem; see the
discussion in Hjort and Jones (1996), and Eguchi and Copas (1998). We follow Hjort and
Jones (1996) in using the modified local log likelihood, which in our notation, with a fixed
bandwidth b, is
$$L(X_1,\dots,X_n,\theta_b(x)) = n^{-1}\sum_i K_b(X_i - x)\log\psi(X_i,\theta_b(x)) - \int K_b(v-x)\,\psi(v,\theta_b(x))\,dv. \tag{6}$$
The last term implies that ψ(x, θb (x)) is not allowed to stray far away from f (x) as b → 0.
Indeed, using the notation $u_j(\cdot,\theta) = \partial/\partial\theta_j\,\log\psi(\cdot,\theta)$, by the law of large numbers, assuming $E\{K_b(X_i - x)\log\psi(X_i,\theta_b(x))\} < \infty$, we have almost surely
$$\frac{\partial L}{\partial\theta_j} = n^{-1}\sum_i K_b(X_i-x)\,u_j(X_i,\theta_b(x)) - \int K_b(v-x)\,u_j(v,\theta_b(x))\,\psi(v,\theta_b(x))\,dv$$
$$\to \int K_b(v-x)\,u_j(v,\theta_b(x))\,[f(v) - \psi(v,\theta_b(x))]\,dv. \tag{7}$$
Putting the expression in the first line of (7) equal to zero yields the local maximum
likelihood estimates $\hat\theta_b(x)$ of the population value θb (x), which satisfies the second line of (7), which in turn is identical to our requirement for the population value in (4). Letting b → 0 and requiring
$$\partial L/\partial\theta_j = 0 \tag{8}$$
leads to
$$u_j(x,\theta_b(x))\,[f(x) - \psi(x,\theta_b(x))] + O(b^T b) = 0, \tag{9}$$
so that, ignoring solutions that yield $u_j(x,\theta_b(x)) = 0$, (7) requires $\psi(x,\theta_b(x))$ to be close
to f (x), in fact with a difference of the order O(bT b) as b → 0. (This is exploited in the
bandwidth algorithm of Section 7.2.)
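The modified local log-likelihood (6) can be maximised directly from a sample. The sketch below is our own illustration, not the authors' Newton-Raphson code: with a Gaussian product kernel, the integral term in (6) has the closed form $\int K_b(v-x)\psi(v,\theta)\,dv = N(x;\,\mu,\ \Sigma + \mathrm{diag}(b_1^2, b_2^2))$ by the Gaussian convolution identity, and we optimise with a generic simplex method.

```python
import numpy as np
from scipy.optimize import minimize

def unpack(th):
    """Map an unconstrained 5-vector to (mu, Sigma)."""
    mu = th[:2]
    s1, s2, rho = np.exp(th[2]), np.exp(th[3]), np.tanh(th[4])
    Sigma = np.array([[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]])
    return mu, Sigma

def mvn_logpdf(pts, mu, Sigma):
    d = pts - mu
    q = np.einsum('ni,ij,nj->n', d, np.linalg.inv(Sigma), d)
    return -np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(Sigma)) - 0.5 * q

def neg_local_lik(th, X, x, b):
    mu, Sigma = unpack(th)
    w = np.exp(-0.5 * (((X[:, 0] - x[0]) / b[0])**2
                       + ((X[:, 1] - x[1]) / b[1])**2)) \
        / (2 * np.pi * b[0] * b[1])            # product Gaussian kernel K_b
    term1 = np.mean(w * mvn_logpdf(X, mu, Sigma))
    # Closed form of the integral in (6) for a Gaussian kernel:
    B = np.diag(np.asarray(b, float)**2)
    term2 = np.exp(mvn_logpdf(np.asarray([x]), mu, Sigma + B))[0]
    return -(term1 - term2)

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=5000)
res = minimize(neg_local_lik, np.zeros(5), args=(X, (0.0, 0.0), (0.8, 0.8)),
               method="Nelder-Mead", options={"maxiter": 4000})
mu_hat, Sigma_hat = unpack(res.x)
rho_hat = Sigma_hat[0, 1] / np.sqrt(Sigma_hat[0, 0] * Sigma_hat[1, 1])
print(rho_hat)   # near the global value 0.6, since f itself is Gaussian
```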
Hjort and Jones (1996) did consider local Gaussians in Sections 5.3, 5.4 and 5.7 of their
paper as one of several examples of parametric families that could be used for estimating f .
Gourieroux and Jasiak (2010) put the main emphasis on the Gaussian family, but again the
main purpose was to estimate the density, in particular in the tails. Our goal is different,
as we are interested in the parameter vector θ itself as a means of describing the local
dependence.
One of the most important motivations for introducing a local Gaussian correlation is to be
able to handle time series, such as financial returns. One example would be {Xt } = {Yt , Yt−j }
for a univariate time series {Yt }, for the purpose of modelling auto-dependence at lag j. The
local log likelihood function of equation (6) can be used for dependent data in that it can be
maximised using the same algorithm as in the iid case, and resulting in the same expression
for θbb (x) in terms of the variables Xt . When the Xt variables are dependent, however, it
cannot really be thought of as a local Gaussian log likelihood. A local likelihood can be
constructed by approximating the dependence between the Xt variables by a multivariate
Gaussian, but in general this will lead to a drastic increase in the number of local parameters,
and we prefer instead to use the marginal local Gaussian penalty function
$$Q_n(X_1,\dots,X_n,\theta_b(x)) = -nL(X_1,\dots,X_n,\theta_b(x)) \tag{10}$$
$$= -\sum_t K_b(X_t - x)\log\psi(X_t,\theta_b(x)) + n\int K_b(v-x)\,\psi(v,\theta_b(x))\,dv, \tag{11}$$
which is proportional to the local Gaussian likelihood in the independent case and could
be interpreted as an M -estimation penalty function in the more general case. Assuming
stationarity, we can still approximate the marginal distribution of {X1t , X2t } by a local
Gaussian distribution, and we can model the local Gaussian auto-correlation structure for
{Yt , Yt−j } for each j. A natural next step would be to use this as a first step in a local
autocorrelation analysis of time series, including a local spectral density function that could
describe periodicities on several scales.
Using (11), the expected value of the function $n^{-1}Q_n(x)$ for a fixed b is given by the penalty
function q defined in (3). The time series version of equation (4) is of course identical in
form to the one obtained from (7) and is again identical to the equation used by Hjort and
Jones (1996) to define the population value θb (x). In Section 3 it will be shown that the
corresponding local likelihood estimate converges towards θb (x) for a stationary ergodic series
under some additional regularity conditions.
We now return to the question as to what happens when the neighbourhood defined by
b gets smaller. This is in fact considered indirectly by Hjort and Jones (1996) on pages
1627-1630, for a possibly more general parametric family than the Gaussian. Assuming
differentiability and Taylor expanding in the integral equation (4),
$$\int K_b(v-x)\,\alpha(v)\,dv = \alpha(x) + \frac{1}{2}\sum_{i=1}^{2}\sum_{j=1}^{2}\sigma^2_{K_{i,j}}\,\frac{\partial^2\alpha(x)}{\partial x_i\,\partial x_j}\,b_i b_j + o(b^T b), \tag{12}$$
where $\alpha(v) = u(v,\theta(x))\,[f(v) - \psi(v,\theta_b(x))]$ and $\sigma^2_{K_{i,j}} = \int z_i z_j K(z_1, z_2)\,dz_1\,dz_2$, with $u^T = [u_1,\dots,u_5]$.
[u1 , . . . , u5 ]. This leads to f (x) − ψ(x, θb (x)) → 0, but also equality in derivatives as
b → 0. The details are somewhat tedious (cf. Hjort and Jones (1996)) and are not needed by us here. They will be used in a separate paper where we look at the bias of the local likelihood estimates. The point for us is that these consistency conditions guarantee that f (x) − ψ(x, θb (x)) → 0 in a smooth manner as b → 0.
If, with little loss of generality, we assume that b1 = kb2 for some constant k, then as
b → 0 in this manner, the sequence ψ(x, θb (x)) is a Cauchy sequence converging to f (x).
For two bandwidths $b_i = (b_{i1}, b_{i2})$, i = 1, 2, by the mean value theorem and the continuous differentiability of ψ with respect to $\theta_b(x) = [\theta_{1,b}(x),\dots,\theta_{5,b}(x)] = [\mu_b(x), \Sigma_b(x)]$,
$$\psi(x,\theta_{b_1}(x)) - \psi(x,\theta_{b_2}(x)) = \sum_{i=1}^{5}\frac{\partial}{\partial\theta_i}\psi(x,\xi)\,\big(\theta_{i,b_1}(x) - \theta_{i,b_2}(x)\big),$$
and θb (x) is a Cauchy sequence and has a unique limit θ(x) if the vector of partial derivatives $\partial\psi(x,\xi)/\partial\theta_i$ has linearly independent components. Here ξ is determined by the mean value theorem. Recalling that ψ is Gaussian, the linear independence condition is not a strong condition.
In the sequel the above regularity conditions will be assumed to hold, guaranteeing the existence of a unique θ(x) = (µ(x), Σ(x)), and it will be shown in Section 4 that the local likelihood estimates $\hat\theta_{b_n,n}(x)$ converge towards θ(x). Again, for the step function model of (5) the existence of a unique θ(x) is trivial, since θb (x) does not depend on b once b is small enough.
Moreover, if there exists a neighborhood of x such that f1 (x) = f2 (x) on this neighborhood,
and if the kernel function K has a compact support, then in the notation defined after
equation (4), there exists a bandwidth b0 such that for b ≤ b0 , θb (f1 , x) = θb (f2 , x), and
hence θ(f1 , x) = θ(f2 , x).
2.2 Non-linear transformations and explicit computation of
population parameters
As seen above the local parameters can be explicitly calculated for the linear step function
case in (5). A continuous one-to-one function g : z → g(z) = x mapping R² onto R² with an inverse $h = g^{-1}$ can be approximated by a sequence of one-to-one piecewise linear functions such as in (5) by letting k increase and by letting the regions Ri become smaller. If g is continuously
differentiable at z, then a representation such as (1) can easily be found by Taylor expansion
with µ0 (x) and Σ0 (x) defined explicitly in terms of g (or h), but unfortunately it is not
unique. The existence of an open set in Si within which X is identical to a Gaussian is the
crucial argument in the derivation of uniqueness for the local parameters of the piecewise
linear function. This property is lost for a general continuous function g. However as seen in
Berentsen et al. (2012) using an extension of the Rosenblatt (1952) representation, there is a
set of points for which we can prove uniqueness and actually compute the local Gaussian
parameters (µ(x), Σ(x)), and in particular the local Gaussian correlation ρ(x). This result
can be applied to an exchangeable copula C where we have been able to compute ρ(x) along
the curve defined by {x = (x1 , x2 ) : F1 (x1 ) = F2 (x2 )}, where F1 and F2 are the two marginal
cumulative distribution functions of X. If they are the same, then the curve reduces to the
diagonal x = (s, s). In such a case with FX (x) = C(F1 (x1 ), F2 (x2 )) it is shown in Berentsen
et al. (2012) that
$$\rho(s,s) = \frac{C_{11}(F_1(s),F_2(s))\,\phi(\Phi^{-1}(F_1(s)))}{\sqrt{\big[\phi(\Phi^{-1}(C_1(F_1(s),F_2(s))))\big]^2 + \big[C_{11}(F_1(s),F_2(s))\,\phi(\Phi^{-1}(F_1(s)))\big]^2}}, \tag{13}$$
where φ is the standard normal density, $C_1(u_1,u_2) = \partial C(u_1,u_2)/\partial u_1$ and $C_{11}(u_1,u_2) = \partial^2 C(u_1,u_2)/\partial u_1^2$.
This formula can be simplified for various copula classes and gives an explicit
formula for the local parameters; i.e., the local correlation along the diagonal for the density
f (x) determined by the copula in question. In Berentsen et al. (2012) this has been checked
against the estimated local Gaussian correlation (it can be estimated everywhere, not only
along the diagonal, for these copulas, an example of which is given in Figure 1). Subsequently
these local correlations can be used in a test of fit of copula models using the theoretical
formula for ρ(x) along the diagonal. So far we have not been able to compute the population
value ρ(x) for a general x. We refer to Berentsen et al. (2012) for details.
A much simpler example is given by the parabola
$$X_2 = X_1^2 + \varepsilon;\qquad \varepsilon \sim N(0, \sigma_\varepsilon^2),\quad X_1 \sim N(0, \sigma_X^2),$$
where X1 and ε are independent. This is a textbook example of a situation where there is
(strong) dependence between X2 and X1 , but where the global correlation ρ(X1 , X2 ) = 0
under very general conditions. We define the function g(Z) by
$$X_1 = \sigma_X Z_1;\qquad X_2 = \sigma_X^2 Z_1^2 + \sigma_\varepsilon Z_2,$$
where Z1 and Z2 are independent standard normal variables. It follows that a local linear approximation is given by $\frac{\partial g}{\partial z}(z)Z$ with $Z^T = [Z_1, Z_2]$ and
$$\frac{\partial g}{\partial z}(z) = \begin{bmatrix}\sigma_X & 0\\ 2\sigma_X^2 z_1 & \sigma_\varepsilon\end{bmatrix}.$$
Along the curve $x_2 = x_1^2$ the local covariance is given by
$$\Sigma(x_1, x_1^2) = \frac{\partial g}{\partial z}(z)\,\frac{\partial g}{\partial z}(z)^T = \begin{bmatrix}\sigma_X^2 & 2\sigma_X^2 x_1\\ 2\sigma_X^2 x_1 & 4\sigma_X^2 x_1^2 + \sigma_\varepsilon^2\end{bmatrix},$$
with constant local variance in the x1 direction, which is reasonable since X1 is Gaussian.
The local correlation is given by $\rho(x_1, x_1^2) = 2x_1\big/\sqrt{4x_1^2 + \sigma_\varepsilon^2/\sigma_X^2}$. As $\sigma_X \to \infty$, or $\sigma_\varepsilon \to 0$, $\rho(x_1, x_1^2)$ tends to +1 for $x_1 > 0$ and to −1 for $x_1 < 0$, whereas for $x_1 = 0$, $\rho(x_1, x_1^2) = 0$. All
of these results are intuitively reasonable, and are consistent with simulation experiments.
There is some bias that can be predicted and that decreases as the bandwidth in the local
likelihood algorithm tends to zero, but such bias problems will be studied generally in a
separate publication.
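The closed-form local correlation along the parabola can be cross-checked against a numerical Jacobian of g; this is an illustrative sketch with our own parameter values:

```python
import numpy as np

sigma_X, sigma_eps = 2.0, 0.5

def g(z):
    """The transformation X = g(Z) of the parabola example."""
    return np.array([sigma_X * z[0], sigma_X**2 * z[0]**2 + sigma_eps * z[1]])

def local_rho(x1, h=1e-6):
    """Local correlation at (x1, x1^2) from the numerical Jacobian of g."""
    z = np.array([x1 / sigma_X, 0.0])        # point with g(z) = (x1, x1^2)
    J = np.column_stack([(g(z + h * e) - g(z - h * e)) / (2 * h)
                         for e in np.eye(2)])
    Sigma = J @ J.T                          # local covariance along the curve
    return Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])

for x1 in (-2.0, -0.5, 0.0, 0.5, 2.0):
    closed_form = 2 * x1 / np.sqrt(4 * x1**2 + sigma_eps**2 / sigma_X**2)
    print(x1, local_rho(x1), closed_form)    # the two columns agree
```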
Finally, in the one-dimensional case the Rosenblatt (1952) transformation is unique everywhere, and in our setting we have $X = g(Z) = F_X^{-1}(\Phi(Z))$, where Φ is the cumulative distribution function for a standard normal variable Z. The local mean and variance at the point x can then be obtained by Taylor expansion of g(Z), such that with $z = g^{-1}(x)$, $\mu(x) = g(z) - g'(z)z$ and $\sigma^2(x) = (g'(z))^2$.
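These one-dimensional local parameters are easy to compute; the sketch below (our example, using an Exp(1) marginal and a numerical derivative) also confirms that the resulting local Gaussian reproduces the density at x, i.e. ψ(x, θ(x)) = f (x):

```python
import numpy as np
from scipy.stats import norm, expon

def g(z):
    """Rosenblatt-type transformation X = F_X^{-1}(Phi(Z)) for X ~ Exp(1)."""
    return expon.ppf(norm.cdf(z))

def local_params(x, h=1e-5):
    z = norm.ppf(expon.cdf(x))               # z = g^{-1}(x)
    dg = (g(z + h) - g(z - h)) / (2 * h)     # numerical g'(z)
    mu = g(z) - dg * z                       # local mean mu(x)
    sigma2 = dg**2                           # local variance sigma^2(x)
    return mu, sigma2

x = 1.0
mu, sigma2 = local_params(x)
# The local Gaussian matches the density at x: psi(x, theta(x)) = f(x).
print(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)), expon.pdf(x))
```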
The case of the uniform distribution is somewhat pathological in this context. If X is one-dimensional uniform, the support of X is bounded, and the bound depends on the parameter defining the distribution. On the support of f (x), ψ(x, θ(x)) ≡ constant, but then the whole representation in terms of ψ and θ(x) degenerates, with a solution µ(x) = x, σ(x) ≡ constant that is in fact picked up by the likelihood estimates. In some sense this solution is picked up by (4), since the parametrization with µ(x) = x, σ(x) ≡ constant has $u_j(x) = 0$, whereas the Rosenblatt representation is a unique solution of (4) only if $u_j(x) \ne 0$.
3 Asymptotic theory for b fixed
After having defined population values we are in a position to discuss local likelihood estimates
of these and prove consistency and asymptotic normality. We define the local likelihood
estimate θbb (x) as a solution of the first part of (7). With a sufficient number of observations
we can estimate the population values for different bandwidths, thereby getting several
measures of correlation, each measuring the dependence at the specified degree of locality.
For the extreme case of b = ∞, the local Gaussian likelihood reduces to the global Gaussian
quasi-likelihood, and it is easily seen that by maximising this, an estimated parameter vector $\hat\theta$ is obtained that estimates the vector θ composed of the global means, global variances and the global correlation of X1, X2.
A numerical algorithm based on a Newton-Raphson method has been developed and is
available from the authors on request. In the asymptotic theory we will first look at the case
of a fixed b; i.e. estimating θb (x) of (4). In the next section we let b → 0, estimating θ(x) as
defined in Section 2.
Asymptotic theory for b fixed in the iid case is covered in Hjort and Jones (1996). Our
contribution in this section is in allowing dependence in time as formulated in Theorem 1.
A basic assumption in Hjort and Jones (1996) is the existence of a unique solution θb of
(4) for every b > 0. Under this assumption they state the following result, which is used as
an instrument in density estimation:
Proposition (Hjort and Jones 1996): Under the existence and uniqueness assumption on θb (x), and letting $\hat\theta_{n,b}(x)$ be the local likelihood estimate of θb (x),
$$(nb_1b_2)^{1/2}\,\big[\hat\theta_{n,b}(x) - \theta_b(x)\big] \xrightarrow{d} N\big(0,\ J_b^{-1}M_b(J_b^{-1})^T\big),$$
where
$$J_b \doteq \int K_b(v-x)\,u(v,\theta_b(x))\,u^T(v,\theta_b(x))\,\psi(v,\theta_b(x))\,dv - \int K_b(v-x)\,\nabla u(v,\theta_b(x))\,[f(v) - \psi(v,\theta_b(x))]\,dv, \tag{14}$$
with $u(x,\theta) = \nabla\log\psi(x,\theta)$ being the column vector of score functions, and where
$$M_b \doteq \lim_{n\to\infty}\operatorname{Var}\big((nb_1b_2)^{1/2}V_n(\theta_b)\big) = b_1b_2\int K_b^2(v-x)\,u(v,\theta_b(x))\,u^T(v,\theta_b(x))\,f(v)\,dv$$
$$\quad - b_1b_2\int K_b(v-x)\,u(v,\theta_b(x))\,f(v)\,dv\int K_b(v-x)\,u^T(v,\theta_b(x))\,f(v)\,dv, \tag{15}$$
with $V_n(\theta) = \partial L(\theta)/\partial\theta$.
The use of the scaling factor (nb1 b2 )1/2 is perhaps slightly deceptive, as the limit theorem
is for b fixed. The question of existence and uniqueness of θb has been discussed in sub-section
2.1. The local likelihood algorithm works by fitting a Gaussian surface to the surface of f at
any neighbourhood of finite size b, and if f is Gaussian or stepwise Gaussian, its parameter vector θ is uniquely determined by any finite portion $(f(x),\ x \in N(x))$ of its surface, where N (x) is a neighbourhood of x.
As mentioned in sub-section 2.1 an important motivation for introducing a local Gaussian
correlation is to be able to handle time series, such as financial returns. In the following
theorem we extend the above result to the time series case using (11) as a local penalty function.
Theorem 1: Let {Xt } be a stationary bivariate time series with Xt having bivariate density
f. Assume that i) {Xt } is geometrically ergodic, ii) f has support on all of R², and is such that $E|K_b(X_t - x)\,\partial/\partial\theta_j \log\psi(X_t,\theta)|^\gamma < \infty$, j = 1, …, 5, for some γ > 2, and iii) there exists
a bandwidth b0 such that there is a unique minimizer θb (x) of (3) for every 0 < b < b0 . Then
for every ε > 0 there is a possibly x-dependent event A with $P(A^c) < \varepsilon$ such that there exists a sequence $\hat\theta_{n,b}(x)$ that minimises $Q_n$ in (11), and such that $\hat\theta_{n,b}(x)$ converges almost surely to θb (x) on A. Moreover, the central limit result in the Proposition continues to hold, with the integral expression in the first part of (15) as the leading term for Mb.
Proof: We will use the approach formalised by Klimko and Nelson (1978) and Tjøstheim
(1986) (see also Taniguchi and Kakizawa 2000, pages 96–100). This is an approach for
obtaining asymptotic properties of estimates using Taylor expansion of a general penalty
function Q. As before, we denote the minimiser of $Q_n$ by $\hat\theta = \hat\theta_{n,b}$. We omit the x-dependence in the sequel, as x is fixed. Taylor-expanding, we have in a neighbourhood $N = \{\theta : |\theta - \theta_b| < \delta\}$ of θb (noting that $Q_n(\theta)$ is twice continuously differentiable with respect to θ)
\[
\begin{aligned}
Q_n(\theta) = {} & Q_n(\theta_b) + (\theta - \theta_b)^T \frac{\partial}{\partial\theta} Q_n(\theta_b)
+ \frac{1}{2}(\theta - \theta_b)^T \frac{\partial^2}{\partial\theta\,\partial\theta^T} Q_n(\theta_b)\,(\theta - \theta_b) \\
& + \frac{1}{2}(\theta - \theta_b)^T \Big[\frac{\partial^2}{\partial\theta\,\partial\theta^T} Q_n(\theta^*)
- \frac{\partial^2}{\partial\theta\,\partial\theta^T} Q_n(\theta_b)\Big](\theta - \theta_b) \\
\doteq {} & Q_n(\theta_b) + (\theta - \theta_b)^T \frac{\partial}{\partial\theta} Q_n(\theta_b)
+ \frac{1}{2}(\theta - \theta_b)^T W_n(\theta_b)(\theta - \theta_b)
+ \frac{1}{2}(\theta - \theta_b)^T T_n(\theta^*, \theta_b)(\theta - \theta_b),
\end{aligned}
\]
where θ∗ is an intermediate point between θ and θb . It should be carefully noted that we do
our analysis for b fixed, so the true value of θ is identified with θb satisfying Equation (4).
To obtain the consistency statement (via Egorov’s theorem), we have to show (see Klimko
and Nelson (1978), Theorem 2.1, or Taniguchi and Kakizawa (2000), Theorem 3.2.23)
(i) n^{-1} ∂/∂θ Qn(θb) → 0 almost surely.

(ii) n^{-1} Wn(θb) → Wb almost surely, where Wb is a 5 × 5 positive definite matrix that can be identified
with Jb of (14).

(iii) For j, k = 1, . . . , 5, lim_{δ→0} lim sup_{n→∞} (nδ)^{-1} |Tn(θ*, θb)_{jk}| < ∞ a.s., where Tn(θ*, θb)_{jk} is
the (j, k)th element of Tn(θ*, θb).
In our case, because of the ergodic assumption on Xt,
\[
-n^{-1}\frac{\partial}{\partial\theta} Q_n(\theta_b) = \frac{\partial L}{\partial\theta}(X_1, \dots, X_n, \theta_b)
\stackrel{a.s.}{\longrightarrow} \int K_b(v - x)\frac{\partial}{\partial\theta}\log\psi(v, \theta_b)\, f(v)\,dv
- \int K_b(v - x)\frac{\partial}{\partial\theta}\log\psi(v, \theta_b)\,\psi(v, \theta_b)\,dv = 0,
\]
due to the requirement on θb in (4). Similarly, using ergodicity, it can be shown that
\[
n^{-1} W_n(\theta_b) = -\frac{\partial^2 L(X_1, \dots, X_n, \theta_b)}{\partial\theta\,\partial\theta^T}
\stackrel{a.s.}{\longrightarrow} J_b.
\]
Since θb is the global minimiser of the function ∫ Kb(v − x)[ψ(v, θ) − log(ψ(v, θ))f(v)] dv, this
means that the second-order derivative matrix of the function must be positive definite at θb,
but this derivative matrix is Jb = Wb, so that ii) above is fulfilled. Finally, the third-order
derivative of ψ(x, θ) is bounded on N, and iii) is implied by the mean-value theorem.
Theorem 2.2 of Klimko and Nelson (1978), or the last part of Theorem 3.2.23 of Taniguchi
and Kakizawa (2000), can now be used to prove asymptotic normality of θ̂n,b. In addition to
(i)–(iii) above, we need to verify the condition

(iv) n^{-1/2} ∂/∂θ Qn(θb) →d N(0, Sb),

where Sb is a positive definite matrix that can be identified with (b1b2)^{-1} Mb in (15). The
expression for Mb is positive definite since it can be expressed as the covariance matrix of the
stochastic vector variable √(b1b2) u(Xt, θb(x)) (u cannot be degenerate due to the uniqueness
condition). To prove (iv) in our case, we can use a mixing clt; see, for instance, Theorems
2.20(i) and 2.21(i) of Fan and Yao (2003). Those theorems are formulated for a scalar process,
whereas n^{-1/2} ∂/∂θ Qn(θb) is a five-dimensional quantity. However, Theorem 2.20(i) of Fan and
Yao (2003) can be applied component-wise, and by using a Cramér–Wold argument, Theorem
2.21(i) can be used. These theorems assume an α-mixing process {Xt} with a certain mixing
rate, but this is fulfilled by our assumption of geometric ergodicity (in fact a slower rate
would suffice). In addition, it is required that E|Kb(Xi − x) ∂/∂θj log ψ(Xi, θ)|^γ < ∞ for
some γ > 2, which is our condition ii) in the statement of Theorem 1.
Remark 1: The condition that f has support on all of ℝ² can be weakened, but with this
assumption we do not need to discuss what happens with θb at the boundary of the support.
A pathological situation emerges for a uniform distribution, as was seen at the end of Section
2.2.
Remark 2: The condition of geometric ergodicity is quite standard in time series analysis.
The condition ii) is fulfilled under weak conditions on f because of the presence of the kernel
function. In fact, the finiteness condition in ii) can be expressed as
\[
\int [K_b(v - x)]^\gamma \Big[\frac{\partial}{\partial\theta_j}\Big\{\frac{1}{2}\log|\Sigma(x)| - \frac{1}{2}(v - \mu(x))^T \Sigma^{-1}(x)(v - \mu(x))\Big\}\Big]^\gamma f(v)\,dv < \infty,
\]
where θj, j = 1, . . . , 5, correspond to the components of µ(x) and Σ(x). Clearly, for a fixed
x, this integral exists by choosing the tail behaviour of the kernel K to decrease sufficiently
fast compared to the tail behaviour in f . In particular, K can be chosen to have a compact
support.
The condition iii) (in a slightly stronger version) is left unverified in Hjort and Jones
(1996), but has been discussed in sub-section 2.1. It seems quite reasonable from a practical
point of view, and it holds for the step function model (5), which in turn can be taken as an
approximation for more general models.
Remark 3: The question remains how Mb and Jb should be evaluated for a given data set.
We will return to this question in Section 7.
4 Asymptotic theory as b → 0 in the global Gaussian case
Looking at the case where f is globally Gaussian may seem to be of little use, but it turns
out that this case and the existence-uniqueness arguments of Section 2 are key elements in
proving a limit theorem in the more general case. In fact, once existence and uniqueness are
established as in Section 2, the proof of the asymptotic properties is virtually identical to the
proof presented in the present section; in particular, it holds trivially for the step function
model (5). The global Gaussian case (and the stepwise Gaussian case) has the advantage
that we need not worry about uniqueness and existence, and we have θb (x) = θ(x) = θ (or
θi for x ∈ Si ). It will be shown below that the local likelihood estimator θbb,n (x) converges
towards θ(x) as n and b tend to infinity and 0, respectively, and that an asymptotic limit
theory akin to that in Theorem 1 can be established under an appropriate scaling.
The main difficulty in extending Theorem 1 to the present situation is that the matrices
Jb and Mb are no longer positive definite as b → 0. In fact, with Jb and Mb defined by the
integral expressions in (14) and (15), but with the parameter θb(x) replaced by θ(x) = θ, we
have Jb → J ≐ u(x, θ)u^T(x, θ)ψ(x, θ) and Mb → M ≐ u(x, θ)u^T(x, θ)f(x) ∫ K²(v) dv (the
last term in (15) is of smaller order as b → 0). Both of these quantities are non-negative
definite but not positive definite, the matrix u(x, θ)uT (x, θ) being of rank 1. The matrices J
and M are obtained by only retaining the first-order term in a Taylor expansion in terms of
b. To obtain positive definiteness, higher-order terms must be included, and these are picked
up using a scaling that is changed accordingly.
We illustrate this by looking at the Taylor expansion of the term
\[
I_M = b_1 b_2 \int K_b^2(v - x)\, u(v, \theta(x))\, u^T(v, \theta(x))\, f(v)\,dv
\]
corresponding to Mb, and looking at the more general (possibly non-Gaussian) case of a
parameter θ(x), whose existence was established in sub-section 2.1. The other terms are
treated in an identical manner. Substituting w_i = b_i^{-1}(v_i − x_i) and letting w = (w_1, w_2), we
have, by retaining two more terms of the Taylor expansion,
\[
I_M \sim \int K^2(w_1, w_2)\, A\, b_w b_w^T A^T\, f(x + b_1 w_1 + b_2 w_2)\,dw_1\,dw_2,
\]
where typically K is a product kernel, b_w is the 6-dimensional vector defined by
\[
b_w^T = \begin{bmatrix} 1 & b_1 w_1 & b_2 w_2 & b_1^2 w_1^2 & b_1 w_1 b_2 w_2 & b_2^2 w_2^2 \end{bmatrix},
\]
and A is the 5 × 6 matrix
\[
A = \begin{bmatrix} u & u_{x_1} & u_{x_2} & \tfrac{1}{2} u_{x_1^2} & u_{x_1 x_2} & \tfrac{1}{2} u_{x_2^2} \end{bmatrix},
\]
with u^T = [u_1, . . . , u_5], u_i = ∂ log ψ/∂θ_i, u_{x_1} = ∂u/∂x_1, u_{x_2} = ∂u/∂x_2, u_{x_1^2} = ∂²u/∂x_1², u_{x_2^2} = ∂²u/∂x_2², and u_{x_1 x_2} = ∂²u/∂x_1∂x_2.
Computing ∫ K²(w_1, w_2) b_w b_w^T dw_1 dw_2, we obtain a 6 × 6 matrix H, where (only retaining
the powers of b, omitting all constant factors like ∫ w²K(w) dw, ½ and ∫ K^i(w) dw, and, for
simplicity, taking b_1 = b_2 = b),
\[
H = \begin{bmatrix}
1 & 0 & 0 & b^2 & 0 & b^2 \\
0 & b^2 & 0 & 0 & 0 & 0 \\
0 & 0 & b^2 & 0 & 0 & 0 \\
b^2 & 0 & 0 & b^4 & 0 & b^4 \\
0 & 0 & 0 & 0 & b^4 & 0 \\
b^2 & 0 & 0 & b^4 & 0 & b^4
\end{bmatrix}.
\]
Assuming that the local Gaussian at the point x is non-degenerate, the matrix A is of rank 5,
and consequently, as H is of full rank, AHAT is of rank 5. It follows that the second-order
Taylor expansion Mb,2 of Mb is positive definite of order (b1 b2 )2 in the sense that (b1 b2 )−2 Mb,2
exists and is positive definite in the limit. Similarly, whereas the matrix J is singular,
(b1 b2 )−2 Jb,2 is non-singular and positive definite as b1 , b2 → 0. Hence, (b1 b2 )2 Jb−1 Mb Jb−1
exists and is positive definite as b1 , b2 → 0. One can get an explicit expression for the leading
term of (b1 b2 )2 Jb−1 Mb Jb−1 by symbol manipulation software, and we have used the software
package Maple to do so. The coefficient associated with the leading term is very complicated
and not useful, as it runs over several pages of computer output.
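As a quick numerical sanity check (not part of the paper), the power-of-b structure of H can be reproduced for a standard Gaussian product kernel; the grid size, bandwidths and tolerances below are illustrative choices.

```python
import numpy as np

# Numerical sanity check (not from the paper) of the power-of-b structure of
# H = ∫ K^2(w1, w2) b_w b_w^T dw1 dw2 for a standard Gaussian product kernel,
# with b1 = b2 = b and b_w^T = [1, b w1, b w2, b^2 w1^2, b^2 w1 w2, b^2 w2^2].

def H_matrix(b, n=201, lim=6.0):
    w = np.linspace(-lim, lim, n)
    W1, W2 = np.meshgrid(w, w, indexing="ij")
    K2 = (np.exp(-0.5 * (W1**2 + W2**2)) / (2 * np.pi)) ** 2  # squared kernel
    bw = np.stack([np.ones_like(W1), b * W1, b * W2,
                   b**2 * W1**2, b**2 * W1 * W2, b**2 * W2**2])
    return np.einsum("iab,jab,ab->ij", bw, bw, K2) * (w[1] - w[0]) ** 2

H1, H2 = H_matrix(0.1), H_matrix(0.2)
assert abs(H1[0, 1]) < 1e-8 and abs(H1[0, 4]) < 1e-8   # odd-moment entries vanish
assert abs(H2[0, 3] / H1[0, 3] - 4.0) < 1e-9           # this entry scales as b^2
assert abs(H2[3, 3] / H1[3, 3] - 16.0) < 1e-9          # this entry scales as b^4
```

The assertions confirm the zero pattern and the b² and b⁴ scalings displayed for H above; constant factors such as ∫ w²K²(w) dw are of course retained in the numerical values.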
It will be seen that the above matrix terms enter into the asymptotic expressions in the
following theorem. To evaluate the covariances in practice, one must use other means, as
shown in Section 7.1. Note that because of the assumed Gaussianity in Theorem 2, the
condition ii) of Theorem 1 is automatically fulfilled.
Theorem 2: Let {Xt} be a non-degenerate stationary bivariate Gaussian time series.
Assume that i) {Xt} is geometrically ergodic, ii) n → ∞ and b1, b2 → 0 such that
log n/(n(b1b2)^5) → 0, iii) the kernel function K is Lipschitz. Then for every ε > 0, there is
a possibly x-dependent event A with P(A^c) < ε such that there is a sequence θ̂n,b(x) that
minimises Q in (11) and such that θ̂n,b(x), for a fixed x, converges almost surely to the true
value θ = θ0 on A. Moreover,
\[
(n(b_1 b_2)^3)^{1/2}(\hat\theta_{n,b}(x) - \theta_0) \stackrel{d}{\to} N(0, (b_1 b_2)^2 J_b^{-1} M_b J_b^{-1}),
\]
this meaning that the left-hand side is asymptotically normal with limiting covariance
(b1b2)² Jb^{-1} Mb Jb^{-1} as b1, b2 go to zero at the given rate, or alternatively
\[
(n b_1 b_2)^{1/2} J_b M_b^{-1/2} (\hat\theta_{n,b}(x) - \theta_0) \stackrel{d}{\to} N(0, I),
\]
where I is the identity matrix of dimension 5.
Proof: The proof follows the same pattern as the proof of Theorem 1, but we introduce a
new penalty function Q′_n = (b1b2)^{-2} Qn. And instead of relying on the ergodic theorem and
clt for a fixed b in that proof, we now have to depend on an ergodic theorem and a clt where
there is smoothing involved. It turns out that we can use standard theorems of this type. For
example, we can use the non-parametric ergodic Theorem 4.1 of Masry and Tjøstheim (1995).
In the proof of that theorem ĥn(y) is replaced by n^{-1} ∑i Kb(Xi − x)u(Xi, θ), where u(Xi, θ) =
∂ log ψ(Xi, θ)/∂θ, and E(ĥn(y)) by the corresponding expectation ∫ Kb(v − x)u(v, θ)ψ(v, θ) dv.
We use the truncation argument of that proof, with s̄ > 2 replaced by γ > 2 of condition
ii) of Theorem 1 (the latter condition being trivially fulfilled by the Gaussianity of {Xt},
and by condition iv) of Theorem 3 in Section 4.1 in the non-Gaussian case). Moreover, we
use geometric ergodicity instead of the weaker mixing condition in the proof of Masry and
Tjøstheim. The conclusion of that proof is that for a compact sub-set D of ℝ²,
\[
\sup_{x\in D}\Big| n^{-1}\sum_i K_b(X_i - x)\,u(X_i, \theta) - \int K_b(v - x)\,u(v, \theta)\,\psi(v, \theta)\,dv \Big| = O\Big(\Big(\frac{\log n}{n b_1 b_2}\Big)^{1/2}\Big)
\]
almost surely. It follows that for a fixed x and with the bandwidth condition ii),
\[
n^{-1}\frac{\partial Q_n'}{\partial\theta}\Big|_{\theta=\theta_0} = \frac{1}{(b_1 b_2)^2}\Big[\int K_b(v - x)\,u(v, \theta_0)\,\psi(v, \theta_0)\,dv - n^{-1}\sum_i K_b(X_i - x)\frac{\partial\log\psi(X_i, \theta_0)}{\partial\theta}\Big] \stackrel{a.s.}{\longrightarrow} 0.
\]
Moreover,
\[
n^{-1}\frac{\partial^2 Q_n'}{\partial\theta\,\partial\theta^T}\Big|_{\theta=\theta_0} = \frac{1}{(b_1 b_2)^2}\Big[\int K_b(v - x)\,u(v, \theta_0)\,u^T(v, \theta_0)\,\psi(v, \theta_0)\,dv
+ \int K_b(v - x)\,\nabla u(v, \theta_0)\,\psi(v, \theta_0)\,dv - n^{-1}\sum_i K_b(X_i - x)\,\nabla u(X_i, \theta_0)\Big].
\]
The last two terms cancel almost surely, again using the non-parametric ergodic theorem and the bandwidth condition. The first term has the positive definite matrix J2 =
lim_{b→0} (b1b2)^{-2} Jb,2 as a limit, which means that J2 has positive eigenvalues, and that an
inverse exists. Finally, we look at
\[
(\delta n)^{-1}\Big[\frac{\partial^2 Q_n'}{\partial\theta\,\partial\theta^T}(\theta_0) - \frac{\partial^2 Q_n'}{\partial\theta\,\partial\theta^T}(\theta)\Big]
= (b_1 b_2)^{-2}(\delta n)^{-1}\Big[\frac{\partial^2 Q_n}{\partial\theta\,\partial\theta^T}(\theta_0) - \frac{\partial^2 Q_n}{\partial\theta\,\partial\theta^T}(\theta)\Big],
\]
where |θ − θ0| < δ. From the existence and continuity of the third-order derivative, element-wise |∂²Qn/∂θ∂θ^T(θ) − ∂²Qn/∂θ∂θ^T(θ0)| ≤ cδ for some constant c, and using the bandwidth
condition, it follows that element-wise
\[
\lim_{\delta\to 0}\limsup_{n\to\infty}\,(\delta n)^{-1}\Big|\frac{\partial^2 Q_n'}{\partial\theta\,\partial\theta^T}(\theta) - \frac{\partial^2 Q_n'}{\partial\theta\,\partial\theta^T}(\theta_0)\Big| < \infty,
\]
and as in the proof of Theorem 1, the almost sure part of Theorem 2 follows. We then look at
\[
((b_1 b_2)^3 n)^{1/2}\, n^{-1}\frac{\partial Q_n'}{\partial\theta}(\theta_0) = (n b_1 b_2)^{1/2}\Big[n^{-1}(b_1 b_2)^{-1}\frac{\partial Q_n}{\partial\theta}(\theta_0)\Big].
\]
The quantity in brackets has covariance matrix Mb /(b1 b2 )2 , which tends to a positive definite
matrix M2 . Asymptotic normality can then be shown using a standard non-parametric clt;
see, e.g., Masry and Tjøstheim (1995), Theorem 4.4, with Z*_{n,t} in (4.47) and (4.48) of the
proof of that theorem replaced by
\[
(b_1 b_2)^{3/2}\Big[K_b(X_i - x)\frac{\partial\log\psi(X_i, \theta)}{\partial\theta} - E\Big(K_b(X_i - x)\frac{\partial\log\psi(X_i, \theta)}{\partial\theta}\Big)\Big],
\]
and then using the ‘big block–small block’ technique of that proof. The assumptions (4.6a)
and (4.7) of Masry and Tjøstheim (1995) follow from the Gaussianity and condition i),
whereas (4.6b) and (4.6c) are trivially fulfilled from the definition of Z*_{n,t}. The claimed result
then follows from that theorem and Theorem 2.2 of Klimko and Nelson (1978).
Remark 1: The convergence rate can be compared to the results of Jones (1996), where
there is convergence of the same order. The convergence is much slower than in the standard
bivariate non-parametric case, which means that for a given number of observations, one
must use a much larger bandwidth, which again means that if one is interested in details of a
non-smooth θ(x) in the non-Gaussian case (Theorem 3), one must have rather large samples.
For a reasonably smooth density, θ(x) will typically be smooth.
Remark 2: The almost sure non-parametric convergence actually yields uniform convergence
on a compact set. We refer to Hansen (2008) for a possibly expanding set. Such results are
of interest in the construction of test functionals for independence and goodness-of-fit tests.
As seen in Berentsen et al. (2012), Berentsen and Tjøstheim (2012) and Støve et al. (2012)
we get good results for n = 500 and in some cases for n considerably smaller.
4.1 A more general limit theorem
In sub-section 2.1 existence of a unique θ(x) consistent with the local penalty function was
established, in particular for the step function model of (5), but also more generally under
the regularity conditions stated towards the end of that sub-section.
Theorem 3: Let {Xt} be a bivariate time series with bivariate density f having support on
all of ℝ². Assume that i) {Xt} is geometrically ergodic, ii) n → ∞ and b1, b2 → 0 such that
log n/(n(b1b2)^5) → 0, iii) the kernel function K is Lipschitz, iv) f is such that at each fixed
point x there exists a non-degenerate local Gaussian approximation ψ(v, θ(x)) as given in
Section 2 such that E|Kb(Xt − x) ∂/∂θj log ψ(Xt, θ(x))|^γ < ∞, j = 1, . . . , 5, for some γ > 2.
Then for every ε > 0, there is a possibly x-dependent event A with P(A^c) < ε such that
there exists a sequence θ̂n,b(x) that minimises Q in (11) and such that θ̂n,b(x), for this fixed
x, converges almost surely to the true value θ(x) = θ0(x) on A. Moreover,
\[
(n(b_1 b_2)^3)^{1/2}(\hat\theta_{n,b}(x) - \theta_0(x)) \stackrel{d}{\to} N(0, (b_1 b_2)^2 J_b^{-1} M_b J_b^{-1}),
\]
this meaning that the left-hand side is asymptotically normal with limiting covariance
obtained from (b1b2)² Jb^{-1} Mb Jb^{-1} as b1, b2 → 0 at the appropriate rate, or alternatively
\[
(n b_1 b_2)^{1/2} J_b M_b^{-1/2} (\hat\theta_{n,b}(x) - \theta_0(x)) \stackrel{d}{\to} N(0, I),
\]
where I is the identity matrix of dimension 5.
Proof: By inspecting the proof of Theorem 2, it is seen that this can be repeated step
by step for a fixed x in the present situation. Only obvious notational changes and the
additional use of condition iv) are required.
5 Linear transformations and the local Gaussian
distribution
To analyse symmetry properties we need to study the effect of linear transformations. We
assume that X has a local Gaussian representation ψ(x, θ(x)) with θ(x) = (µ(x), Σ(x)) and
we look at the transformation Y = a + AX, where a is a 2-dimensional vector and A is a
non-singular matrix. We know that under mild regularity conditions θ̂(x) converges towards
θ(x), and we will use local likelihood to establish that for θ(y) = (µ(y), Σ(y)), µ(y) = a + Aµ(x)
and Σ(y) = AΣ(x)A^T. It will be seen from the arguments at the end of this section that these
transformation results are also true for a population parameter θb(x) for a fixed bandwidth.
Let (X1, . . . , Xn) with Xi = (X1i, X2i) be observations from the density f(x) = ψ(x, θ(x)),
and let Yi = a + AXi, i = 1, . . . , n. We introduce bandwidth matrices Bx and By (see
Fan and Gijbels 1996), and we have
\[
L_X(X_1, \dots, X_n, \mu(x), \Sigma(x)) = n^{-1}\sum_{i=1}^n K_{B_x}(X_i - x)\log\psi(X_i, \theta(x)) - \int K_{B_x}(v - x)\,\psi(v, \theta(x))\,dv,
\]
where K_{Bx}(Xi − x) = |Bx^{-1}| K(Bx^{-1}(Xi − x)). Letting Bx be diagonal with bx = (bx1, bx2) along
the diagonal and K(x) = K1(x1)K2(x2) yields the familiar product kernel representation.
Analogously,
\[
L_Y(Y_1, \dots, Y_n, \mu(y), \Sigma(y)) = n^{-1}\sum_{i=1}^n K_{B_y}(Y_i - y)\log\psi(Y_i, \theta(y)) - \int K_{B_y}(w - y)\,\psi(w, \theta(y))\,dw,
\]
with θ(y) = (µ(y), Σ(y)). We will now assume µ(y) = a + Aµ(x) and Σ(y) = AΣ(x)AT and
show that this parametrization is consistent with the local likelihood procedure. With this
parametrisation it follows at once that
\[
(Y_i - \mu(y))^T \Sigma^{-1}(y)(Y_i - \mu(y)) = (X_i - \mu(x))^T \Sigma^{-1}(x)(X_i - \mu(x)), \tag{16}
\]
and hence,
\[
\log\psi(Y_i, \theta(y)) = -\log 2\pi - \log|A| - \frac{1}{2}\log|\Sigma(x)| - \frac{1}{2}(Y_i - \mu(y))^T \Sigma^{-1}(y)(Y_i - \mu(y)) = -\log|A| + \log\psi(X_i, \theta(x)). \tag{17}
\]
We need to let Bx and By tend to zero at different rates, which is not unnatural given the
possible different metric structure of two neighbourhoods N (x) and N (y) around x and y,
respectively. We let By = ABx such that By−1 = Bx−1 A−1 and |By−1 | = |Bx−1 ||A−1 |. Hence
KBy (Yi − y) = |By−1 |K(By−1 (Yi − y)) = |A−1 ||Bx−1 |K(Bx−1 A−1 (Yi − y)).
and consequently
KBy (Yi − y) log ψ(Yi , θ(y)) = |A−1 |KBx (Xi − x)[− log |A| + log ψ(Xi , θ(x))].
In the same manner, with w = a + Av and y = a + Ax,
\[
\int K_{B_y}(w - y)\,\psi(w, \theta(y))\,dw = |A^{-1}| \int K_{B_x}(v - x)\,\psi(v, \theta(x))\,dv,
\]
and hence
\[
L_Y(Y_1, \dots, Y_n, \mu(y), \Sigma(y)) = |A^{-1}|\, L_X(X_1, \dots, X_n, \mu(x), \Sigma(x)) - |A^{-1}|\log|A|\; n^{-1}\sum_{i=1}^n K_{B_x}(X_i - x).
\]
The last term and the multiplicative factor |A^{-1}| do not depend on local parameters. Hence,
as Bx and By tend to zero at their different rates, there will be a one-to-one correspondence
between local likelihood estimates. The estimates θ̂_{bx,n}(x) are associated with θ(x), and
the estimates θ̂_{by,n}(y) are associated with θ(y) = (µ(y), Σ(y)) = (a + Aµ(x), AΣ(x)A^T). Here,
θ̂_{bx,n}(x) converges to θ(x) and θ̂_{by,n}(y) to θ(y), as shown in Theorem 3.
An alternative route to this result goes via (4) for the parameter θb(x) = θ_{Bx}(x). It is
shown below that if θ_{Bx}(x) satisfies this equation, then θ_{By}(y) satisfies this equation with
µ_{By}(y) = a + Aµ_{Bx}(x) and Σ_{By}(y) = AΣ_{Bx}(x)A^T. Then if θ_{Bx}(x) → θ(x), we have
θ_{By}(y) → θ(y) with µ(y) = a + Aµ(x) and Σ(y) = AΣ(x)A^T. Indeed, for y = a + Ax and
w = a + Av, corresponding to (17),
\[
\log\psi(w, \theta_{B_y}(y)) = -\log 2\pi - \log|A| - \frac{1}{2}\log|\Sigma_{B_x}(x)| - \frac{1}{2}(w - \mu_{B_y}(y))^T \Sigma_{B_y}^{-1}(y)(w - \mu_{B_y}(y)) = -\log|A| + \log\psi(v, \theta_{B_x}(x)),
\]
and hence it is not difficult to see that there is a one-to-one correspondence between the
maxima of
\[
\int K_{B_x}(v - x)\big(\log\psi(v, \theta(x))f(v) - \psi(v, \theta(x))\big)\,dv
\]
and
\[
\int K_{B_y}(w - y)\big(\log\psi(w, \theta(y))f(w) - \psi(w, \theta(y))\big)\,dw.
\]
Moreover, K_{By}(w − y) = |A|^{-1} K_{Bx}(v − x), and reasoning as above for the likelihood it is
seen that, corresponding to (4),
\[
\int K_{B_y}(w - y)\, u_j(w, \theta_{B_y})\big[f_Y(w) - \psi(w, \theta_{B_y})\big]\,dw = |A|^{-1}\int K_{B_x}(v - x)\, u_j(v, \theta_{B_x})\big[f_X(v) - \psi(v, \theta_{B_x})\big]\,dv,
\]
from which the desired conclusion follows.
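The transformation identity (17) underlying this argument can be checked numerically; the following sketch uses made-up values of a, A, µ(x) and Σ(x) (illustrative only, not from the paper).

```python
import numpy as np

# Numeric check of identity (17) under Y = a + AX, with made-up illustrative
# values: with mu(y) = a + A mu(x) and Sigma(y) = A Sigma(x) A^T, one gets
#   log psi(y, theta(y)) = -log|det A| + log psi(x, theta(x)).

def log_psi(z, mu, Sigma):
    # log-density of a bivariate Gaussian at the point z
    d = z - mu
    return (-np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.solve(Sigma, d))

a = np.array([1.0, -2.0])
A = np.array([[2.0, 0.5], [0.3, 1.5]])        # non-singular
mu_x = np.array([0.2, 0.7])
Sigma_x = np.array([[1.0, 0.4], [0.4, 2.0]])  # positive definite

mu_y = a + A @ mu_x                           # transformed local mean
Sigma_y = A @ Sigma_x @ A.T                   # transformed local covariance

x = np.array([0.9, -0.3])
y = a + A @ x
lhs = log_psi(y, mu_y, Sigma_y)
rhs = -np.log(abs(np.linalg.det(A))) + log_psi(x, mu_x, Sigma_x)
assert np.isclose(lhs, rhs)
```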
6 Properties of local parameters
We will focus on ρ(x) = σ12(x)/√(σ11(x)σ22(x)), as we are particularly interested in ρ(x) as
a measure of local dependence. It follows from the expression in (1) that −1 ≤ ρ(x) ≤ 1,
and from the corresponding local likelihood algorithm that −1 ≤ ρ̂(x) ≤ 1. Further, if
X is Gaussian, then ρ(x) ≡ constant, but note that Berentsen et al. (2012) show that
ρ(x) ≡ constant also for a Gaussian copula with arbitrary marginals. In this section we will
look more closely at independence and symmetry properties of ρ(x), and at limit behaviour
as |x| gets large. For financial data the motivation is the often observed asymmetric patterns
in financial returns, and the fact that dependence becomes stronger as xi → −∞, i = 1, 2.
We wish to make this precise using ρ(x). We will carry through the analysis for θ(x) but
with appropriate approximations it can be carried out for θb (x).
6.1 Independence and functional dependence
For the global correlation ρ, independence implies ρ = 0, but not vice versa. For Gaussian
variables, ρ = 0 implies independence. Moreover, if X1 and X2 are linearly related (not
necessarily Gaussian) with X2 = a + bX1 , then ρ = 1 if b > 0 and ρ = −1 if b < 0, but for
X2 = α(X1 ) for some function α, the value of ρ will depend on α. For X2 = X12 , ρ = 0 under
weak assumptions.
If X1 and X2 are independent, then f(x1, x2) = f1(x1)f2(x2). As b → 0 in (7)–(9), this
forces ψ(x, θ(x)) to factor and ρ(x) ≡ 0. Alternatively, this can be proved by using arguments
related to the Rosenblatt (1952) transformation, which in this case is unique and leads to the
representation X1 = F_{X1}^{-1}(Φ(Z1)) and X2 = F_{X2}^{-1}(Φ(Z2)), where Z1 and Z2 are
independent standard normal variables, again resulting in f(x1, x2) =
ψ(x1, µ1(x1), σ1²(x1)) ψ(x2, µ2(x2), σ2²(x2)), so that ρ(x) ≡ 0.
A necessary and sufficient condition for independence is that ρ(x) ≡ 0, µi(x) ≡ µi(xi) and
σi²(x) ≡ σi²(xi), i = 1, 2. We have not found an example with ρ(x) ≡ 0 where there is dependence.
Functionals based on ρ(x) for measuring dependence and for testing of independence are
given in Berentsen and Tjøstheim (2012). An advantage of this approach is that one can
distinguish between positive and negative dependence in a nonlinear situation, which does
not seem to be the case for other commonly used non-linear dependence measures; see e.g.
Skaug and Tjøstheim (1993) and Székely and Rizzo (2009).
The other extreme is where |ρ(x)| = 1. This is connected to functional dependence in
much the same way as linear dependence is connected to |ρ| = 1. Assume that X2 = α(X1 )
with α differentiable and that X1 can be represented by a local Gaussian representation,
so that X1 = g(Z1). Then X2 = α(g(Z1)). There is only one variable involved, and any
Gaussian approximation must also collapse to one variable. This is only possible if, at the
point x, the Gaussian variables (U1(x), U2(x)) defining the bivariate density ψ(v, µ(x), Σ(x))
have U1(x) = a + bU2(x) for some scalars a and b, which implies ρ(x) = +1 or ρ(x) = −1,
depending on whether the sign of α′(x1) is positive or negative. The same reasoning can be
carried out if (x1, x2) defines a closed curve, such as α(x1) + β(x2) = c. Then ρ(x) is equal
to 1 or −1 along the curve, depending on whether the tangent at the point x has positive or
negative slope.
6.2 Symmetry and asymmetry
We will look at bivariate densities f having the following types of symmetries (we may assume
that µ = E(X) = 0, because otherwise we may just center the density at µ and discuss
symmetry around µ): i) radial symmetry, where f(−x) = f(x); ii) reflection symmetry,
where f(−x1, x2) = f(x1, x2) and/or f(x1, −x2) = f(x1, x2); iii) exchange symmetry, where
f(x1, x2) = f(x2, x1); and iv) rotation symmetry, where f(x) = γ(|x|) for some function
γ. We will examine how the corresponding local Gaussian correlation ρ(x) behaves. All of
the symmetries can be described by linear transformations, and we use the facts already
established in Section 5 that if y = g(x) = Ax, then ΣY (y) = AΣX (x)AT , µY (y) = AµX (x).
Symmetry properties of Σ(x) and µ(x) can conceivably be used to obtain more precise
estimates, and in Berentsen and Tjøstheim (2012) they are used to increase power of
independence tests.
i) Radial symmetry: If X has radial symmetry, then X and $-X = \begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix} X$ have the
same density. It follows that
\[
\Sigma(-x) = \begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix} \Sigma(x) \begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix} = \Sigma(x),
\]
so that Σ(x) and hence ρ(x) has radial symmetry. For the local mean, however, µ(−x) =
−µ(x). Examples of densities having radial symmetry are all elliptic distributions. If we
let X1 and X2 represent Yt and Yt−1 in a garch model, then −Yt , −Yt−1 has the same
garch structure, and garch densities have radial symmetry, and consequently the local
covariance and local correlation have radial symmetry. Figures 2 and 3 show estimates of
ρ(x) for a t-distribution with 4 degrees of freedom and global ρ = 0, and for a garch model,
respectively. The radial symmetry of ρ(x) emerges clearly from the estimated values in the
figures. In Berentsen et al. (2012) the radial symmetry along the diagonal x = (s, s) of the
population quantity ρ(x) is verified using (13).
ii) Reflection symmetry: Reflection symmetry around the x1 and x2 axes is described by
the matrices
\[
A_1 = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} \quad\text{and}\quad A_2 = \begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix},
\]
so that
\[
\begin{bmatrix} x_1 \\ -x_2 \end{bmatrix} = A_1 \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} -x_1 \\ x_2 \end{bmatrix} = A_2 \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.
\]
We have
\[
\Sigma(x_1, -x_2) = A_1 \Sigma(x) A_1^T = \begin{bmatrix} \sigma_{11}(x) & -\sigma_{12}(x) \\ -\sigma_{12}(x) & \sigma_{22}(x) \end{bmatrix}.
\]
We see that σii (x1 , −x2 ) = σii (x1 , x2 ), i = 1, 2, and ρ(x1 , −x2 ) = −ρ(x1 , x2 ). Similarly, if
there is reflection symmetry around the x2 axis, σii (−x1 , x2 ) = σii (x1 , x2 ), i = 1, 2, and
ρ(−x1 , x2 ) = −ρ(x1 , x2 ). Moreover, µ1 (x1 , −x2 ) = µ1 (x1 , x2 ), µ2 (x1 , −x2 ) = −µ2 (x1 , x2 ),
and µ1 (−x1 , x2 ) = −µ1 (x1 , x2 ), µ2 (−x1 , x2 ) = µ2 (x1 , x2 ). Examples of densities having
reflection symmetry are the spherical densities, for example, the t-distribution with 4 degrees
of freedom with global ρ = 0. The odd reflection symmetry of the local correlation in this
case is brought out by the estimated ρb(x) in Figure 2. A general consequence of the odd
reflection symmetry derived above is that ρ(x) is zero along the coordinate axes; that is,
ρ(x1 , 0) ≡ ρ(0, x2 ) ≡ 0. Again this is roughly confirmed by the estimates on Figure 2. In the
garch model, (−Yt , Yt−1 ) have the same distribution as (Yt , Yt−1 ), so that again the local
correlation has odd reflection symmetry (cf. Figure 3 for estimated values).
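A one-line numeric check of the reflection rule (with an illustrative Σ, not from the paper) confirms that the local variances are unchanged while the local covariance, and hence ρ, changes sign:

```python
import numpy as np

# Numeric check (illustrative Sigma values) of the reflection rule
# Sigma(x1, -x2) = A1 Sigma(x) A1^T with A1 = diag(1, -1).

A1 = np.diag([1.0, -1.0])
Sigma = np.array([[1.5, 0.6], [0.6, 0.8]])
S_ref = A1 @ Sigma @ A1.T
assert np.allclose(np.diag(S_ref), np.diag(Sigma))  # sigma_ii unchanged
assert np.isclose(S_ref[0, 1], -Sigma[0, 1])        # covariance flips sign
```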
iii) Exchange symmetry: Let $A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$. If we have exchange symmetry, Σ(x2, x1) =
AΣ(x1, x2)A^T, and exchange symmetry of f(x) implies exchange symmetry of Σ(x), and
hence of ρ(x). Many copula models have exchange symmetry. An example is given in Figure 1.
Elliptic distributions with a principal axis along x1 = x2 have exchange symmetry. On the
other hand, for a garch model, we cannot in general exchange Yt and Yt−1 and still obtain
the same bivariate distribution. This implies that the local correlation for garch models does
not in general have exchange symmetry (cf. again Figure 3). For an exchange-symmetric
density the local mean is exchange symmetric. We refer to Berentsen et al. (2012) for plots
of theoretical values of ρ(x) = ρ(s, s) along the diagonal.
iv) Rotations: The rotation matrix is given by
\[
A = \begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix}.
\]
For a given vector x, Ax is rotated counter-clockwise through an angle α. We consider an arbitrary spherical
density f. Then f is rotation symmetric in addition to being radial, reflection and exchange
symmetric. The corresponding local correlation is radial and exchange symmetric, but it is
odd reflection symmetric. We start rotation from a point x = (x1, 0) on the positive x1 axis.
Then ρ(x1, 0) = 0, so that
\[
y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = A \begin{bmatrix} x_1 \\ 0 \end{bmatrix} = \begin{bmatrix} x_1\cos\alpha \\ x_1\sin\alpha \end{bmatrix}.
\]
Further, noting that Σ(x) is diagonal, since ρ(x) = 0,
\[
\Sigma(y) = A\Sigma(x)A^T = \begin{bmatrix} \sigma_{11}(x)\cos^2\alpha + \sigma_{22}(x)\sin^2\alpha & (\sigma_{11}(x) - \sigma_{22}(x))\sin\alpha\cos\alpha \\ (\sigma_{11}(x) - \sigma_{22}(x))\sin\alpha\cos\alpha & \sigma_{11}(x)\sin^2\alpha + \sigma_{22}(x)\cos^2\alpha \end{bmatrix}.
\]
It follows that
\[
\rho^2 = \rho^2(\alpha) = \frac{(\sigma_{11}(x) - \sigma_{22}(x))^2}{\sigma_{11}^2(x) + \sigma_{22}^2(x) + \sigma_{11}(x)\sigma_{22}(x)\big(\tan^2\alpha + \frac{1}{\tan^2\alpha}\big)},
\]
which has its maximum for tan2 α = 1, i.e., α = ±π/4. This is consistent with estimated
values in Figure 2, where it is seen that ρ(α) has maxima/minima along α = π/4 and
α = −π/4, and similarly for the other quadrants. Moreover, it can be shown that ρ(α) is
positive in the first and third quadrant and negative in the second and fourth quadrant.
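The closed form for ρ²(α) can be verified numerically against a direct computation of AΣ(x)A^T; the values of σ₁₁ and σ₂₂ below are illustrative, not from the paper.

```python
import numpy as np

# Numeric verification (illustrative sigma_11, sigma_22) that rotating a
# diagonal local covariance, Sigma(y) = A Sigma(x) A^T, reproduces the closed
# form for rho^2(alpha), which is maximised at alpha = +/- pi/4.

def rho2_direct(alpha, s11, s22):
    A = np.array([[np.cos(alpha), -np.sin(alpha)],
                  [np.sin(alpha),  np.cos(alpha)]])
    S = A @ np.diag([s11, s22]) @ A.T
    return S[0, 1] ** 2 / (S[0, 0] * S[1, 1])

def rho2_formula(alpha, s11, s22):
    t2 = np.tan(alpha) ** 2
    return (s11 - s22) ** 2 / (s11**2 + s22**2 + s11 * s22 * (t2 + 1.0 / t2))

s11, s22 = 2.0, 0.5
alphas = np.linspace(0.05, np.pi / 2 - 0.05, 50)
direct = np.array([rho2_direct(a, s11, s22) for a in alphas])
assert np.allclose(direct, [rho2_formula(a, s11, s22) for a in alphas])
assert abs(alphas[np.argmax(direct)] - np.pi / 4) < 0.05  # maximum near pi/4
```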
For an elliptic distribution (again assuming µ = 0), let Σ be the global covariance of X. The
elliptic density can then be written f (x) = |Σ−1/2 |γ(xT Σ−1 x) for a function γ. If the global
standard deviations σ1 and σ2 are equal, then [x1 , x2 ]Σ−1 [x1 , x2 ]T = [x2 , x1 ]Σ−1 [x2 , x1 ]T
and f (x) and ρ(x) are exchange symmetric. This in turn implies that f (x) and ρ(x) are
reflection symmetric along the line x1 = x2 . A similar argument leads to reflection symmetry
around the line x1 = −x2 . This means that as we approach the diagonals along the density
contours, which are ellipses, these are axes of local maxima and minima (for ρ > 0 we have
local maxima along the diagonals in the first and third quadrant, and minima along the axes
in the second and fourth quadrant; for ρ < 0 the roles of the axes are reversed).
If σ1 ≠ σ2, the same reasoning can be carried through for the scaled variables Xi/σi, the
axes (x1, x1) and (x1, −x1) now corresponding to (x1, (σ2/σ1)x1) and (x1, −(σ2/σ1)x1).
6.3 Extremes for local variance
We consider the local variance and local mean in the univariate case. Let f and F be the
density and cumulative distribution function for X. Then X = g(Z) = F −1 (Φ(Z)), with
Z ∼ N (0, 1), and with Φ being the cumulative distribution function of the standard normal.
Let φ be the standard normal density, and let σ(x) be the local standard deviation of f .
Then, from Section 2.2 and with z = Φ^{-1}(F(x)),
\[
\sigma(x) = g'(z) = \frac{\phi(z)}{f(F^{-1}(\Phi(z)))} = \frac{\phi(\Phi^{-1}(F(x)))}{f(x)}.
\]
Let xα and zα be the α-quantiles defined by P (X ≥ xα ) = α and P (Z ≥ zα ) = α. Note
that σ(xα ) = φ(zα )/f (xα ). We only consider the upper tail. As a definition of ‘thick’ tail
we propose two requirements: As α becomes small, xα > zα and f (xα )/φ(zα ) → 0. A
moment of reflection will convince one that this may be a reasonable definition, albeit a slightly
unconventional one. It essentially means that any density having a heavier tail than the
Gaussian will be characterized as thick tailed, e.g. the exponential density, which is not
traditionally called thick tailed. If f is Gaussian, f (xα )/φ(zα ) ≡ 1. It follows immediately
that for a thick-tailed density, as xα → ∞ when α → 0, then σ(xα ) → ∞. In the limit, the
local Gaussian density must have infinite variance to compensate for the thickness of the
tail of f. (Note that it is not sufficient to require φ(x)/f(x) → 0 to obtain thick-tailedness.
Indeed, if f is Gaussian with σ > 1, then φ(x)/f(x) → 0, whereas f(xα)/φ(zα) ≡ 1/σ, which
does not tend to zero.) In the same manner, f(xα)/φ(zα) → ∞ (thin tail) has the local
standard deviation tending to zero.
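For the standard exponential distribution, which is thick-tailed in the above sense, the growth of σ(xα) = φ(zα)/f(xα) can be computed directly; the quantile levels below are illustrative choices, not values from the paper.

```python
import math

# Illustration (standard exponential, f(x) = exp(-x), so x_alpha = -log(alpha))
# that the local standard deviation sigma(x_alpha) = phi(z_alpha)/f(x_alpha)
# grows as the upper-tail probability alpha shrinks.

def upper_quantile_sd(z):
    """Given z = z_alpha, return (x_alpha, sigma(x_alpha)) for Exp(1)."""
    alpha = 0.5 * math.erfc(z / math.sqrt(2))       # alpha = P(Z >= z)
    x_alpha = -math.log(alpha)                      # Exp(1) upper alpha-quantile
    phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    return x_alpha, phi / alpha                     # f(x_alpha) = alpha

xs, sds = zip(*(upper_quantile_sd(z) for z in (1.0, 2.0, 3.0, 4.0)))
assert all(b > a for a, b in zip(sds, sds[1:]))     # sigma(x_alpha) increases
assert all(x > z for x, z in zip(xs, (1.0, 2.0, 3.0, 4.0)))  # x_alpha > z_alpha
```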
6.4 Local correlation and the tail index
A much used measure of dependence in the tail is the so-called tail index. This is described
in McNeil et al. (2005). The upper and lower tail index can be defined as
\[
\lambda_U = \lim_{q\to 1} P\big(F_1(X_1) > q \mid F_2(X_2) > q\big) \quad\text{and}\quad \lambda_L = \lim_{q\to 0} P\big(F_1(X_1) \le q \mid F_2(X_2) \le q\big),
\]
where F1 and F2 are the marginal cumulative distributions of X1 and X2 . A problem is
determining λU and λL in practice (see, for example, Frahm et al. 2005).
For a Gaussian distribution (cf. McNeil et al. 2005, page 211),
\[
\lambda_{L,G} = 2\lim_{u\to-\infty} P\Big(\frac{X_2 - \mu_2}{\sigma_2} \le u \,\Big|\, \frac{X_1 - \mu_1}{\sigma_1} = u\Big) = 2\lim_{u\to-\infty} \Phi\Big(u\sqrt{\frac{1-\rho}{1+\rho}}\Big),
\]
so that for ρ < 1, λL = 0, whereas for ρ = 1, λL = 1. This absence of a non-trivial asymptotic
tail dependence is used as one argument against employing the Gaussian for financial data.
Applying this to the Gaussian U(x) = (U1(x), U2(x)) ∼ N(µ(x), Σ(x)) with x = (s, s),
\[
2\lim_{u\to-\infty} P\Big(\frac{U_2(x) - \mu_2(x)}{\sigma_2(x)} \le u \,\Big|\, \frac{U_1(x) - \mu_1(x)}{\sigma_1(x)} = u\Big) = 2\lim_{u\to-\infty} \Phi\Big(u\sqrt{\frac{1-\rho(s,s)}{1+\rho(s,s)}}\Big).
\]
If U (x) gives a satisfactory approximation to F (x) in the tail as x → −∞, then the above
formula suggests that if ρ(s, s) ≤ M < 1 as s → −∞, then F has no lower tail dependence
(F Gaussian coming out as a special case). In other words, if there is positive tail dependence,
then it is necessary that ρ(s, s) → 1. We have not been able to find exact (monotonicity)
conditions under which this holds in general, but (13) can be used to actually compute the
limiting value of ρ(x) = ρ(s, s) as s → −∞ or s → ∞ for copulas with exchange symmetry.
Applied to the Clayton copula it is shown in Berentsen et al. (2012) that ρ(s, s) → 1 as
s → −∞ (ρ(s, s) → 0 as s → +∞), and plots of the population value of ρ(s, s) for the
copulas looked at in that publication indicate that it has far greater validity. We believe that
this is a theoretical characterization of dependence relationships in a market crash situation.
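The Gaussian tail-dependence expression is easy to evaluate numerically; the sketch below (with an arbitrary large negative u standing in for the limit, and illustrative values of ρ) shows λ_L ≈ 0 for any ρ < 1 and λ_L → 1 as ρ → 1.

```python
import math

# Numerical illustration of the Gaussian lower-tail expression
# lambda_L = 2 * lim_{u -> -inf} Phi(u * sqrt((1 - rho)/(1 + rho))):
# any fixed rho < 1 gives lambda_L = 0, while rho -> 1 gives lambda_L -> 1.
# (u = -40 is an arbitrary large negative value standing in for the limit.)

def lambda_L_gauss(rho, u=-40.0):
    arg = u * math.sqrt((1.0 - rho) / (1.0 + rho))
    return 2.0 * 0.5 * math.erfc(-arg / math.sqrt(2))  # 2 * Phi(arg)

assert lambda_L_gauss(0.5) < 1e-12                     # no tail dependence
assert abs(lambda_L_gauss(1.0 - 1e-12) - 1.0) < 1e-4   # rho -> 1 recovers 1
```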
Another form of correlation based on extreme values is given by Davis and Mikosch (2009).
7 Some practical considerations
7.1 Estimating the standard error of estimates
As seen in Section 4, it is difficult to obtain explicit and useful expressions for the standard
errors of θ̂b,n(x) by Taylor expansion in the general formulas (14)–(15). We have used two
alternative methods. The first one, only valid in case {Xi } consists of iid variables, is the
bootstrap, and this was used in Berentsen et al. (2012) and Berentsen and Tjøstheim (2012)
to evaluate significance of goodness-of-fit tests for copulas and in tests of independence.
However, it is quite time-consuming, since for each bootstrap realisation the local likelihood
has to be optimised numerically for a grid of points. Moreover, it is not extendable to the
dependent case, although the block bootstrap might be tried. The second alternative is to
first note that Mb in (15) can be approximated by

M̂b = b1 b2 { n⁻¹ Σ_{i=1}^n Kb²(Xi − x) u(Xi, θb(x)) uᵀ(Xi, θb(x)) − [n⁻¹ Σ_{i=1}^n Kb(Xi − x) u(Xi, θb(x))] × [n⁻¹ Σ_{i=1}^n Kb(Xi − x) uᵀ(Xi, θb(x))] },

where M̂b → Mb almost surely (for fixed b) by the strong law of large numbers. In practice,
we must replace θb(x) by the estimate θ̂b,n(x). Similarly, in (14) the term containing f(v)
can be approximated using the law of large numbers, whereas the other integral terms can be
computed analytically if K is a Gaussian kernel. If K is not Gaussian, these latter integrals
can be evaluated by Monte Carlo sampling from ψ(v, θ̂b,n(x)). In case a Gaussian kernel
is used, this method is much faster than the bootstrap, and it gives very similar results.
Another advantage of using this alternative is that it can be extended to the dependent case,
the ergodic theorem playing the role of the law of large numbers. Confidence intervals can
then be constructed in a standard fashion.
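The plug-in approximation of Mb described above can be sketched in a few lines. The sketch below uses a Gaussian product kernel, and `score` stands for the vector function u(·, θ) with the estimated θ̂b,n(x) already plugged in; the toy score used in the usage example is purely hypothetical, not the paper's actual ∂ log ψ/∂θ:

```python
import numpy as np

def gaussian_product_kernel(d, b):
    """K_b(d) = prod_j (1/b_j) K(d_j / b_j) with a standard normal K."""
    z = d / b
    return np.prod(np.exp(-0.5 * z**2) / (np.sqrt(2 * np.pi) * b), axis=-1)

def M_hat(X, x, b, score):
    """Sample version of M_b: b1*b2*{ mean of K_b^2 u u^T minus the
    outer product of the mean of K_b u with itself }."""
    w = gaussian_product_kernel(X - x, b)        # K_b(X_i - x), shape (n,)
    U = score(X)                                 # u(X_i, theta_hat), shape (n, p)
    first = ((w**2)[:, None, None] * (U[:, :, None] * U[:, None, :])).mean(axis=0)
    m1 = (w[:, None] * U).mean(axis=0)           # mean of K_b u
    return b[0] * b[1] * (first - np.outer(m1, m1))

# Toy usage with a hypothetical two-dimensional score u(v) = v - x.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2))
x = np.array([0.0, 0.0])
b = np.array([1.0, 1.0])
M = M_hat(X, x, b, score=lambda V: V - x)
```

By construction M̂b is the (scaled) sample covariance matrix of the vectors Kb(Xi − x) u(Xi, θ̂), so it is symmetric and positive semi-definite.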
7.2 Choosing the bandwidth
In his book on local likelihood, Loader (1999) discusses the issue of bandwidth choice. His
results are not directly applicable in the present situation, however. In our version of a
bandwidth algorithm, we have chosen b as a compromise between having variance of order
(nb1³b2³)⁻¹ and ‘bias’ squared of order (b1² + b2²)², where the role of the bias is played by the
distance between ψ(x, θ̂(x)) and f(x).
In practice, the true f is unknown, and in the bandwidth algorithm, we replace it by
the kernel estimate f̃ of f, for which there are a number of algorithms for choosing ‘optimal’
bandwidths. If b = bi is chosen by balancing Var(θ̂bi(x)) and |ψ(x, θ̂(x)) − f̃(x)|², then
the latter quantity is again of order b⁴, since |f̃(x) − f(x)|² is of order bK⁴ = o(b⁴), where bK
is the bandwidth used in the kernel estimation. This results from Var(f̃(x)) being of smaller
order than Var(θ̂bi(x)).
The distance between f̃ and ψ(θ̂) can be measured in many ways, and can be evaluated
locally for a fixed x or globally. In this section we are only concerned with a global bandwidth,
and to simplify further, we use the bandwidth b = (b1 , b2 ), obtained by first scaling the
observations to have identical global standard deviations equal to one. We also use the
following estimate for the mean integrated global variance:

V(θ̂bi) = n⁻¹ Σ_{j=1}^n V̂ar(θ̂bi(X1j, X2j)).
Since reliable estimates in low-density regions require larger bandwidths, and the variances
in these locations could dominate the other terms in the sum, in practice we restrict the
observations included in the above sum to observations in locations of moderate to high
density, using a cut-off point based on the estimated density. As before, we can estimate the
variance by bootstrapping if {X1j , X2j } is a sequence of iid variables, or by the method in
Section 7.1 in the dependent (and in the iid) case. Because the latter method is much faster,
it is the method we have used. We use the squared distance
D(θ̂) = n⁻¹ Σ_{j=1}^n (ψ(X1j, X2j, θ̂(X1j, X2j)) − f̃(X1j, X2j))²

for computing the distance between f̃ and ψ(θ̂). The bandwidth b = bi can now be chosen
as the minimiser of the weighted sum pV(θ̂i, b) + (1 − p)D(θ̂, b), where we choose 0 < p < 1
depending on which aspect of our estimator (low variance or low bias) we value. Note that
the density term D is not invariant to linear scaling, so we have pre-scaled our observations to
unit sample variance. The quantity V is invariant, in that linearly scaling both observations
and bandwidths leaves the sum unchanged. As a last step, we must therefore scale the
bandwidth up again with the standard deviation. Finally, note that in the special case of
f itself being Gaussian, a minimising bandwidth of the sum cannot be expected, since as
b → ∞, ψ(x, θ̂(x)) → f(x).
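Given the summaries V(b) and D(b) evaluated on a grid of candidate bandwidths, the final selection step is a simple grid minimisation. A sketch, where the V and D curves are illustrative placeholders mimicking the variance and squared-bias orders discussed above (with b1 = b2 = b after pre-scaling), not the estimator itself:

```python
import numpy as np

def select_bandwidth(bs, V, D, p=0.5):
    """Pick the bandwidth minimising the weighted sum p*V(b) + (1-p)*D(b)."""
    crit = p * np.asarray(V) + (1 - p) * np.asarray(D)
    return bs[int(np.argmin(crit))]

# Placeholder curves: variance ~ (n b^6)^(-1), squared bias ~ b^4.
n = 500
bs = np.linspace(0.5, 2.5, 41)
V = 1.0 / (n * bs**6)
D = 0.05 * bs**4
b_star = select_bandwidth(bs, V, D, p=0.5)
```

With these placeholder orders the criterion has an interior minimum, and as n grows the minimiser moves towards smaller b, consistent with the behaviour reported for the garch example below.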
We illustrate our procedure with two examples, both involving the choice of bandwidth for
ρbb (x) only. Because of the scaling, we take the two components of b to be the same. First we
look at a bivariate Gaussian distribution with µ1 = µ2 = 0, σ1 = σ2 = 1 and ρ = 0.5, with
n = 100, 200, 500 and 1500 observations, and for bandwidths from 0.5 to 2.5; see Figure 4a.
As expected, both the variance and the bias sums decrease with n. The unweighted sum
V + D does not have a minimum as a function of b, which is consistent with the reasoning
above.
In our second example, we look at a garch(1,1) model (see Bollerslev 1986), {Yt }, with
parameters µ = 0, ω = 0.1, α = 0.7 and β = 0.2, and εt ∼ N (0, 1),
Yt = µ + σt εt,  σt² = ω + α Yt−1² + β σt−1²,   (18)
for t = 1, . . . , n. We now let {Xt} = {(Yt, Yt+1)}. We see from Figure 4b that the sum of
variance and bias squared yields a minimum for b which gets smaller as n increases.
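The data for this example can be generated directly from (18); a minimal sketch, initialising the recursion at the stationary variance ω/(1 − α − β) (our choice, not specified in the paper):

```python
import numpy as np

def simulate_garch11(n, mu=0.0, omega=0.1, alpha=0.7, beta=0.2, seed=0):
    """Simulate Y_t = mu + sigma_t*eps_t with
    sigma_t^2 = omega + alpha*Y_{t-1}^2 + beta*sigma_{t-1}^2, eps_t ~ N(0,1)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    y = np.empty(n)
    sigma2 = omega / (1.0 - alpha - beta)  # start at the stationary variance
    for t in range(n):
        if t > 0:
            sigma2 = omega + alpha * y[t - 1]**2 + beta * sigma2
        y[t] = mu + np.sqrt(sigma2) * eps[t]
    return y

y = simulate_garch11(2000)
# Lagged pairs {X_t} = {(Y_t, Y_{t+1})} used for the local correlation maps.
X = np.column_stack([y[:-1], y[1:]])
```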
This approach seems useful, but clearly there is a need for evaluating alternative algorithms,
including ones for local bandwidth, and for examining the theoretical properties. One
alternative method in copula testing is given in Berentsen et al. (2012).
8 Simulated and real examples
Many more examples are displayed in Berentsen et al. (2012), Berentsen and Tjøstheim
(2012) and Støve et al. (2012).
8.1 The Gaussian distribution
We have done a number of experiments for bivariate Gaussian variables. The local correlation
ρ(x) is constant, and this is confirmed for ρb(x) in the simulations, with ρb(x) lying within
a 90% confidence interval with expected frequency. We have used a bandwidth which is
comparable to those used in the simulations for non-Gaussian examples. Increasing the
bandwidth leads to more precise results. Both the bootstrap and the procedure outlined in
Section 7.1 were used to construct confidence intervals, and the results were very similar.
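For readers wishing to reproduce such an experiment, the sketch below fits the local Gaussian approximation at a point by maximising the Hjort–Jones local likelihood n⁻¹ Σᵢ Kb(Xi − x) log ψ(Xi, θ) − ∫ Kb(v − x) ψ(v, θ) dv over θ = (µ1, µ2, σ1, σ2, ρ). The penalty integral is computed by Monte Carlo with common random numbers, and the optimiser, parametrisation and starting values are our own simplifying choices, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

def local_gaussian_rho(X, x, b, n_mc=2000, seed=0):
    """Estimate the local Gaussian correlation rho(x) by maximising a
    Monte Carlo version of the Hjort-Jones local likelihood (a sketch)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_mc, 2))            # common random numbers

    def kernel(d):                                # Gaussian product kernel K_b
        z = d / b
        return np.exp(-0.5 * (z**2).sum(axis=1)) / (2 * np.pi * b[0] * b[1])

    def neg_loglik(par):
        mu = par[:2]
        s = np.exp(par[2:4])                      # sigma_1, sigma_2 > 0
        rho = np.tanh(par[4])                     # -1 < rho < 1
        cov = np.array([[s[0]**2, rho * s[0] * s[1]],
                        [rho * s[0] * s[1], s[1]**2]])
        psi = multivariate_normal(mu, cov)
        # Draws from psi via the Cholesky map of the fixed normals Z,
        # so the Monte Carlo penalty is a smooth function of theta.
        V = mu + Z @ np.linalg.cholesky(cov).T
        penalty = kernel(V - x).mean()            # MC estimate of int K_b*psi
        return -(np.mean(kernel(X - x) * psi.logpdf(X)) - penalty)

    # Start from the global Gaussian fit.
    r0 = np.corrcoef(X.T)[0, 1]
    par0 = np.concatenate([X.mean(0), np.log(X.std(0)), [np.arctanh(r0)]])
    res = minimize(neg_loglik, par0, method="Nelder-Mead",
                   options={"maxiter": 2000, "xatol": 1e-5, "fatol": 1e-8})
    return np.tanh(res.x[4])

# For Gaussian data the local correlation is constant and equal to rho.
rng = np.random.default_rng(42)
X = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=2000)
rho_hat = local_gaussian_rho(X, x=np.array([0.0, 0.0]), b=np.array([1.0, 1.0]))
```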
8.2 A bivariate t-distribution
The symmetry pattern of the local correlation for the t-distribution with 4 degrees of freedom
has already been explained in Section 6. Qualitatively similar results were obtained by
Holland and Wang (1987) (for ν = 1), using a different measure of local dependence.
8.3 A Gumbel copula
The exchange symmetry of the Gumbel copula in Figure 1 has been explained in Section 6.
Such a plot can be used as a tool for checking the validity of a copula model, by comparing
the local correlation structure for a chosen copula model to the local correlation structure of
the data themselves. We refer to Berentsen et al. (2012) for a formal test.
8.4 A garch model
Although Yt and Yt−1 are uncorrelated, Figure 3 shows strong local dependence, with much
the same pattern as for the t-distribution, i.e. positive dependence in the first and third
quadrant, and negative dependence in the second and fourth quadrant, with increasing
dependence away from the centre, but more strongly expressed. The absence of strict
exchange symmetry was mentioned in Section 6. We use the method of Section 7.1 to
estimate standard errors. The standard errors vary from about 0.03 to 0.05 within the
(−2, +2) region, up to 0.2 at ±3. The local correlation plot is consistent with volatility plots
of a garch(1,1) model.
8.5 Stock returns data
We briefly explore the local Gaussian correlation structure for a real trivariate financial data
set. The data consists of daily 100 × log returns for three stocks, namely X1 : Boeing, X2 :
Alcoa and X3 : Caterpillar, taken from 1973-01-01 to 2007-12-31, with zero return values
removed and a garch(1,1) filtering using a conditional t-distribution applied. The garch
filtering has the effect of reducing the kurtosis and transforming the data to being closer to
independent in time.
A scatterplot matrix of the data is shown in Figure 5a, with superimposed linear regression
lines and kernel density estimates on the diagonals. Judging by the plots, the returns appear
to be linearly dependent with a positive global correlation for each pair. But examining
the local Gaussian correlation reveals much more structure, and quite dramatic differences
between the pairs; see Figure 5b, where we have plotted the diagonal values of the estimated
local Gaussian correlation, that is, ρb(x1 , x2 ) with x1 = x2 . At x1 = x2 = ±2, the standard
errors are about 0.015.
We note that for all three pairs, we have a non-constant and asymmetric structure,
implying that neither a Gaussian, nor a t-distribution, nor any other distribution displaying
symmetry, will provide an adequate fit. (The non-constant and asymmetric structure is
supported by hypothesis tests at the points x1 = x2 = ±2 at a level of 0.01 or less, using
the asymptotic distribution of ρ̂. A more appropriate tool in this situation would be a
non-parametric test of the differing structure of the curves; such tests will be treated
in a separate publication.) A beginning, in the form of a copula goodness-of-fit test,
is made in Berentsen et al. (2012).
The shape of the local Gaussian correlation pattern for Boeing against the two other stocks
also differs markedly from the Caterpillar–Alcoa pattern in direction, showing differences in
the underlying distributions. None of these results are readily apparent from the scatterplots,
or from univariate plots. Moreover, the detected differences have direct implications for
portfolios where these stocks are involved. An increase of Gaussian local correlation when
the market is going down, as shown for the pairs (X1 , X2 ) and (X1 , X3 ), would lead to a
deterioration of this diversification protection. On the other hand, a linear combination of
X2 and X3 would not experience such problems. These effects can be quantified numerically
using the estimates of the local Gaussian correlation and the local Gaussian variances. It is
possible to apply the local Gaussian correlation to measure and detect financial contagion
(Støve et al. 2012).
In principle, a joint local analysis of more than two variables can be done. One has the
prospect of doing locally what can be done globally in multivariate analysis for normal
variables, for example, portfolio theory, principal component analysis and discrimination
theory. One then has to generalise the bivariate Gaussian approximation to a general
multivariate local Gaussian approximation. This necessitates a simplification of the model to
avoid the curse of dimensionality. In fact the simplified model has a faster convergence rate
for pairwise local Gaussian correlations.
Acknowledgement
We are very grateful to two anonymous referees for a number of valuable comments and
suggestions. Their constructive criticism has led to a much improved paper.
References
Berentsen, G., Støve, B., Tjøstheim, D., and Nordbø, T. (2012), “Recognizing and
visualizing copulas: an approach using local Gaussian approximation,” Manuscript,
Department of Mathematics, University of Bergen. See
http://folk.uib.no/gbe062/local-gaussian-correlation.
Berentsen, G. and Tjøstheim, D. (2012), “Recognizing and visualizing departures from
independence in bivariate data using local Gaussian correlation,” Manuscript, Department
of Mathematics, University of Bergen. See
http://folk.uib.no/gbe062/local-gaussian-correlation.
Bjerve, S. and Doksum, K. (1993), “Correlation curves: measures of association as function
of covariate values,” Annals of Statistics, 21, 890–902.
Bollerslev, T. (1986), “Generalized autoregressive conditional heteroscedasticity,” Journal of
Econometrics, 31, 307–327.
Campbell, R., Forbes, C. S., Koedijk, K., and Kofman, P. (2008), “Increasing correlations or
just fat tails?” Journal of Empirical Finance, 15, 287–309.
Davis, R. and Mikosch, T. (2009), “The extremogram: A correlogram for extreme events,”
Bernoulli, 15, 977–1009.
Delicado, P. and Smrekar, M. (2009), “Measuring non-linear dependence for two random
variables distributed along a curve,” Statistics and Computing, 19, 255–269.
Doksum, K., Blyth, S., Bradlow, E., Meng, X., and Zhao, H. (1994), “Correlation curves as
local measures of variance explained by regression,” Journal of the American Statistical
Association, 89, 571–582.
Eguchi, S. and Copas, J. (1998), “A class of local likelihood methods and near-parametric
asymptotics,” Journal of the Royal Statistical Society Series B, 60, 709–724.
Fan, J. and Gijbels, I. (1996), Local Polynomial Modelling and Its Application, Theory and
Methodologies, Chapman and Hall, London.
Fan, J. and Yao, Q. (2003), Nonlinear Time Series. Nonparametric and Parametric Methods,
Springer, New York.
Forbes, K. J. and Rigobon, R. (2002), “No contagion, only interdependence: Measuring
stock market comovements,” The Journal of Finance, 57, 2223–2261.
Frahm, G., Junker, M., and Schmidt, R. (2005), “Estimating the tail-dependence coefficient:
Properties and pitfalls,” Insurance: Mathematics and Economics, 37, 80–100.
Gourieroux, C. and Jasiak, J. (2010), “Local likelihood density estimation and value-at-risk,”
Journal of Probability and Statistics, 2010, 26 pages.
Hansen, B. E. (2008), “Uniform convergence rates for kernel estimation with dependent
data,” Econometric Theory, 24, 726–748.
Hjort, N. and Jones, M. (1996), “Locally parametric nonparametric density estimation,”
Annals of Statistics, 24, 1619–1647.
Holland, P. and Wang, Y. (1987), “Dependence functions for continuous bivariate densities,”
Communications in Statistics, 16, 863–876.
Hong, Y., Tu, J., and Zhou, G. (2007), “Asymmetries in stock returns: Statistical tests and
economic evaluation,” Review of Financial Studies, 20, 1547–1581.
Inci, A. C., Li, H. C., and McCarthy, J. (2011), “Financial contagion: A local correlation
analysis,” Research in International Business and Finance, 25, 11–25.
Jones, M. (1996), “The local dependence function,” Biometrika, 83, 899–904.
— (1998), “Constant local dependence,” Journal of Multivariate Analysis, 64, 148–155.
Jones, M. and Koch, I. (2003), “Dependence maps: local dependence in practice,” Statistics
and Computing, 13, 241–255.
Klimko, L. A. and Nelson, P. I. (1978), “On conditional least squares estimation for
stochastic processes,” Annals of Statistics, 6, 629–642.
Loader, C. R. (1999), Local Regression and Likelihood, Statistics and Computing, Springer,
New York.
Longin, F. and Solnik, B. (2001), “Extreme correlations in international equity markets,”
The Journal of Finance, 56, 649–676.
Masry, E. and Tjøstheim, D. (1995), “Nonparametric estimation and identification of ARCH
non-linear time series: strong convergence and asymptotic normality,” Econometric
Theory, 11, 258–289.
McNeil, A. J., Frey, R., and Embrechts, P. (2005), Quantitative Risk Management: Concepts,
Techniques, Tools, Princeton University Press, Princeton.
Rodrigues, J. C. (2007), “Measuring financial contagion. A copula approach,” Journal of
Empirical Finance, 14, 401–423.
Rosenblatt, M. (1952), “Remarks on a multivariate transformation,” Annals of Mathematical
Statistics, 23, 470–472.
Silvapulle, P. and Granger, C. W. J. (2001), “Large returns, conditional correlation and
portfolio diversification. A value-at-risk approach,” Quantitative Finance, 1, 542–551.
Skaug, H. J. and Tjøstheim, D. (1993), “A nonparametric test of serial independence based
on the empirical distribution function,” Biometrika, 80, 591–602.
Støve, B., Tjøstheim, D., and Hufthammer, K. (2012), “Using local Gaussian correlation in a
nonlinear re-examination of financial contagion,” Revised manuscript.
Székely, G. J. and Rizzo, M. L. (2009), “Brownian distance covariance,” Annals of Applied
Statistics, 3, 1236–1265.
Taniguchi, M. and Kakizawa, Y. (2000), Asymptotic Theory of Statistical Inference for Time
Series, Springer-Verlag, New York.
Tjøstheim, D. (1986), “Estimation in nonlinear time series models,” Stochastic Processes and
Their Application, 21, 251–273.
Figure 1: Local Gaussian correlation for a bivariate distribution with standard Gaussian marginals and a Gumbel copula. [Figure: map of estimated local correlation values over the (x, y) plane; not reproduced.]
Figure 2: Local Gaussian correlation map based on 5000 observations from a bivariate t-distribution with ρ = 0 and ν = 4 degrees of freedom. [Figure: map of estimated local correlation values over the (x, y) plane; not reproduced.]
Figure 3: Local Gaussian correlation map based on 2000 observations from the garch(1,1) model defined in Section 7.2. The bandwidths are b1 = b2 = 1. The bulk of the observations are within (−2, +2), and we have restricted the scales to (−3, +3) for display purposes. [Figure: map of estimated local correlation values over the (x, y) plane; not reproduced.]
Figure 4: Variance, bias and log sum values for choosing bandwidths. (a) Normal distribution. (b) garch(1,1) model. [Figure: panels of Variance, Bias² and log(Sum) against bandwidths 0.5–2.5 for n = 100, 200, 500 and 1500; not reproduced.]
Figure 5: 100 × log returns for the three stocks Boeing, Caterpillar and Alcoa, based on daily data from 1973-01-01 to 2007-12-31. (a) Scatterplot matrix. (b) Line plot showing diagonal values of the local Gaussian correlation (ordinate) for the three pairs of stocks, calculated at various values of x1 = x2 = x (abscissa). [Figure not reproduced.]