
NONPARAMETRIC STOCHASTIC DISCOUNT FACTOR FOR
HIGH-DIMENSIONAL FINANCIAL RETURNS
CHEN QIU AND TAISUKE OTSU
Very Preliminary
Abstract. We develop a nonparametric method to estimate the SDF using high-dimensional financial returns. It extends the work of Kitamura and Stutzer (2002) and Ghosh et al. (2016), who conducted bound analysis for candidate SDFs. We select the SDF according to how close it is to the empirical measure, conditional on it being a valid SDF in sample. When there are more assets than observations, we show it is critical to regularize using methods such as the Lasso. This shares a similar spirit with high-dimensional nonlinear optimization with a convex loss function in Van de Geer (2008). Under an approximate sparsity assumption, we prove that our estimator enjoys consistency and asymptotic normality. Our estimator can also be applied to a broad range of scenarios in which economists are interested in moments under the risk-adjusted measure. Simulation exercises support our theoretical results.
1. Introduction
One of the most important discoveries in finance is that freedom from arbitrage ensures the existence of a positive pricing kernel, or stochastic discount factor (hereafter SDF), which appears in a moment-based Euler equation linking asset prices with their expected discounted cash flows. Moreover, if a risk-free asset exists, we can pick it as our benchmark asset, so that a risk-adjusted measure (or risk-neutral measure) arises that distorts all asset returns as if they behaved like a martingale. Hence, the study of the SDF and of the risk-adjusted measure is equivalent, up to a scale. Estimation and inference of the SDF have been central issues in both theoretical and empirical asset pricing research. Since finite-dimensional parameters defined through such moment conditions are usually only set-identified (Schennach, 2014), researchers who want to estimate the SDF have a lot of latitude to further specify the Euler equation according to their needs, usually relying on parametric or semiparametric methods. This paper explores the possibility of estimating the SDF completely nonparametrically using a possibly very large number of assets ($p \gg n$). More assets provide more information, but they also make estimation more challenging because such information contains much idiosyncratic noise. It turns out that a correct way to proceed is to rely on certain regularization methods. This connection underlies recently popular high-dimensional econometric techniques, and it is the main contribution of this paper to the burgeoning empirical asset pricing literature. To the best of our knowledge, this is the first paper that applies high-dimensional "learning" methods to the estimation of the SDF. Our method is nonparametric in the sense that we assume neither a parametric form for the SDF nor a parametric distribution for the state variables.
The idea behind this paper stems from the seminal work of Qin and Lawless (1994), Owen (2001) and Newey and Smith (2004) on empirical likelihood and generalized empirical likelihood. Earlier research in asset pricing applied such information-theoretic techniques to derive nonparametric bounds (Kitamura and Stutzer, 2002; Ghosh et al., 2015, 2016) as well as to evaluate misspecified models (Almeida and Garcia, 2012). As more data become available and the cost of computation becomes much cheaper, it is natural to extend their work to a case where econometricians can observe a very large number of asset returns ($p \gg n$) and are interested in recovering a "best" possible estimate of the SDF. This motivates us to explore combining high-dimensional methods with empirical likelihood methods to estimate the SDF in a completely nonparametric way.
If using a small number of representative assets can give us bounds for the SDF, as in Kitamura and Stutzer (2002) and Ghosh et al. (2015, 2016), why not use all asset returns at hand? If our estimated SDF can correctly price almost all available assets, it should be a good estimate of the SDF, instead of merely a bound. More specifically, we search for a candidate risk-adjusted measure closest to the empirical measure in the Kullback–Leibler divergence sense. When the number of assets $p$ is small and fixed, we get a bound for the risk-adjusted measure. When $p$ is much bigger than the sample size $n$ ($p \gg n$), we get an estimate of the risk-adjusted measure and thus of the SDF. This task seems daunting, however: first, information from individual asset returns can be very noisy, and even if we obtain an estimate it might not have good asymptotic properties; second, it could be computationally very expensive. As noted by Tsao (2004), for a given sample size $n$, empirical likelihood for a $p$-dimensional population mean fails with an asymptotically non-negligible probability whenever $p \ge n/2$. This means a direct extension of Kitamura and Stutzer (2002) is not desirable. Nevertheless, if we believe that a relatively small number of assets can imitate the behavior of the true SDF well enough, the question boils down to how to select the best imitating assets among many. This is the approximate sparsity assumption in the high-dimensional econometrics literature. In fact, approximate sparsity has been used implicitly in empirical finance. For example, the famous Fama and French three-factor model essentially says that the market return, together with portfolios that imitate the growth and size effects, is a good linear approximation of the SDF. After all, this is what "mimicking portfolios" are all about. In this respect, our methodology formalizes the idea of choosing "mimicking assets", but in a data-driven way.
Our paper is related to a large literature that uses semi/nonparametric methods to unravel the behavior of the SDF. One popular line of research relies on DSGE modeling techniques, for example, Mehra and Prescott (1985); Campbell and Cochrane (1999); Bansal and Yaron (2004); Jagannathan and Wang (1996); Piazzesi et al. (2007). Moment-based semiparametric methods, such as Ai and Chen (2003) and Chen and Pouzo (2009), have also been explored in line with those models, for example, Chen and Ludvigson (2009); Gagliardini et al. (2011); Chen et al. (2013).
This paper is also related to the literature that imposes a minimal amount of parametric specification on the SDF and lets market data tell what kind of behavior the SDF should have, for example, Alvarez and Jermann (2005); Backus et al. (2011, 2014, 2015); Schneider and Trojani (2016). These methods are "model free", without any parametric specification. This paper shares the same "model-free" spirit of recovering the SDF, but we "learn" the SDF from many asset returns.
In terms of econometrics, this paper is closely related to the literature on generalized empirical likelihood and high-dimensional methods, such as Hjort et al. (2009); Chen et al. (2009); Lahiri and Mukhopadhyay (2012); Fan and Li (2001); Fan and Peng (2004); Tang and Leng (2010); Leng and Tang (2012); Chang et al. (2015). It also relates to Lasso methods for general convex loss functions in Van de Geer (2008) and Bühlmann and Van de Geer (2011). What is different is that our paper directly addresses generalized empirical likelihood methods for the $p \gg n$ case. As a result, under our setting sparsity assumptions become primary. In addition, due to our asset pricing context, we explore a case where there is no explicit parameter in the objective function (similar to Haberman (1984)).
The rest of the paper is organized as follows. Section 2 sets up a general framework; Section 3 develops our main estimation strategy; Section 4 establishes the main asymptotic results, including consistency, an oracle rate and the asymptotic distribution of our estimators; Section 5 discusses algorithms and simulation results. We conclude in Section 6. All proofs, as well as tables and figures, are in the Appendix.
2. Setup and Notation
Consider a discrete-time economy with time index set $T = \{1, 2, \dots\}$ in which all uncertainty is driven by the state vector $X = \{X_t : t \in T\}$, $X_t : \Omega \to \mathcal{X}$, $\mathcal{X} \subseteq \mathbb{R}^d$. $X$ is defined over a probability space $(\Omega, \mathcal{F}, P)$ with the filtration $F = \{\mathcal{F}_t : t \in T\}$ generated by histories of $X$. Under the assumption of no arbitrage, there exists a strictly positive SDF $m(X_t, X_{t+1})$ such that
$$1 = E_P\left[m(X_t, X_{t+1}) R_i(X_t, X_{t+1}) \mid X_t = x\right], \quad \forall i, \qquad (2.1)$$
where $R_i(X_t, X_{t+1})$ denotes the short-term return of asset $i$ from period $t$ to $t+1$, $E_P[\cdot \mid X_t = x]$ is the conditional expectation operator under the objective measure $P$, and $m(X_t, X_{t+1})$ is the one-period SDF. Notice all variables are essentially functions of the state vector $X$, which we do not observe; but we do observe asset returns. Therefore, we hereafter suppress the dependence of $m(X_t, X_{t+1})$ and $R_i(X_t, X_{t+1})$ on $X$ and work with an unconditional version of equation (2.1):
$$1 = E_P\left[m_{t\to t+1} R_{i,t\to t+1}\right], \qquad (2.2)$$
where $m_{t\to t+1} = m(X_t, X_{t+1})$ and $R_{i,t\to t+1} = R_i(X_t, X_{t+1})$. If we further assume the existence of a risk-free asset between $t$ and $t+1$, then
$$1 = E_P\left[m_{t\to t+1} R_{f,t\to t+1}\right],$$
where $R_{f,t\to t+1}$ is the risk-free return between $t$ and $t+1$. As long as $R_{f,t\to t+1} \neq 0$, this induces a new measure, the risk-adjusted measure, which we call the $Q$ measure, such that
$$m_{t\to t+1} = \frac{dQ}{dP}\,\frac{1}{R_{f,t\to t+1}},$$
where $dQ/dP$ is the Radon–Nikodym derivative. This directly implies
$$0 = E_Q\left[R^e_{i,t\to t+1}\right], \qquad (2.3)$$
where $R^e_{i,t\to t+1} = R_{i,t\to t+1} - R_{f,t\to t+1}$ is the excess return of asset $i$ from $t$ to $t+1$. Equation (2.3) says that if we use the risk-free asset as the benchmark, all assets share the same expected return under the new $Q$ measure. If we want to estimate $m$ nonparametrically, we have to come up with some way to select a distorted measure that satisfies the moment condition (2.3). In this respect, we borrow from the theory of I-divergence (Csiszar, 1975, 1984).
We first select a measure closest (in a sense specified below) to the objective measure, subject to it satisfying the moment conditions (2.3):
$$\min_{\tilde{Q}} K(\tilde{Q}, P) \quad \text{s.t.} \quad E_{\tilde{Q}}[R^e_p] = 0, \qquad (2.4)$$
where $R^e_p$ is the vector of excess returns on $p$ different assets, and $K(\tilde{Q}, P)$ measures the closeness of a candidate measure $\tilde{Q}$ to the true data-generating measure $P$. The solution to (2.4), if one exists, gives us a bound for the candidate measure $Q$ when the dimension $p$ is low.

We then carry out an empirical analogue of (2.4) using the directly observable empirical measure $P_n$, whose properties we know very well. For $K(\tilde{Q}, P)$ we take the I-projection, or Kullback–Leibler number, of $\tilde{Q}$ relative to $P$:
$$K(\tilde{Q}, P) = \int \log\frac{d\tilde{Q}}{dP}\, d\tilde{Q}.$$
Intuitively, we can treat $K(\tilde{Q}, P)$ as measuring the distance between $\tilde{Q}$ and $P$.^1 As a measure of closeness, the Kullback–Leibler number is very sensitive to small changes in measures (Robinson, 1991). It is always nonnegative and equals $0$ only when the two measures are identical. Its applications in econometrics are well documented: it is a natural extension of EL (Owen, 2001; Qin and Lawless, 1994); it is a good alternative to GMM (Kitamura and Stutzer, 1997); and it is in line with the maximum entropy principle in machine learning.
The solution to (2.4) is the canonical Gibbs density
$$\frac{dQ^*}{dP} = \frac{e^{\lambda_0' R^e_p}}{E_P[e^{\lambda_0' R^e_p}]}, \qquad (2.5)$$
where $\lambda_0 \in \Theta^p \subseteq \mathbb{R}^p$ solves
$$\lambda_0 = \arg\min_{\lambda \in \Theta^p} E_P\big[e^{\lambda' R^e_p}\big]. \qquad (2.6)$$
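For intuition, we sketch the standard convex duality argument behind (2.5) and (2.6); this is a heuristic outline that ignores regularity conditions. Writing $q = d\tilde{Q}/dP$ and introducing multipliers $\lambda \in \mathbb{R}^p$ for the pricing constraints and $\mu$ for the normalization $\int q\, dP = 1$, the Lagrangian of (2.4) is
$$\mathcal{L}(q, \lambda, \mu) = \int q \log q\, dP + \lambda' \int R^e_p\, q\, dP + \mu\left(\int q\, dP - 1\right).$$
The pointwise first-order condition in $q$ is $\log q + 1 + \lambda' R^e_p + \mu = 0$, so $q \propto e^{-\lambda' R^e_p}$; normalizing and relabeling $-\lambda$ as $\lambda$ gives the Gibbs form (2.5). The pricing constraint $E_{\tilde{Q}}[R^e_p] = 0$ is then exactly the first-order condition of (2.6):
$$\nabla_\lambda E_P\big[e^{\lambda' R^e_p}\big]\Big|_{\lambda = \lambda_0} = E_P\big[R^e_p\, e^{\lambda_0' R^e_p}\big] = 0 \quad\Longleftrightarrow\quad E_{Q^*}\big[R^e_p\big] = 0.$$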
We now introduce some notation for the rest of the paper. Since from now on we only deal with excess returns, $R$ always stands for an excess return. An unsubscripted $R$ is always of dimension $p$. A subscripted $R_k$ stands for a vector of excess returns of dimension $k$, which, depending on the context, can be $p$, the number of assets observed by the econometrician, or $s$, the number of assets that are statistically important in determining the $Q$ measure under approximate sparsity assumptions. Correspondingly, in sample $R_k(i)$ represents the $i$-th observation of the return vector, where $i = 1, \dots, n$ and $n$ is the sample size. Sometimes we also use the notation $R_{k,j}(i)$, which means the $j$-th element of the vector $R_k(i)$. Moreover, $\lambda_{k,j}$ stands for the $j$-th element of the vector $\lambda_k$, which has dimension $k$. The norm of a vector is denoted $\|\cdot\|_q$, where $q$ can be $1$, $2$ or $\infty$, referring to the $\ell_1$, $\ell_2$ or $\ell_\infty$ norm, respectively. We also use notation conventional in empirical processes: for a function $f : \mathcal{Z} \to \mathbb{R}$, the empirical average is $E_n(f) = \frac{1}{n}\sum_{i=1}^n f(Z(i))$, where $Z(i)$ represents the $i$-th observation of the variable $Z \in \mathcal{Z}$. The theoretical mean is $E_P(f) = \frac{1}{n}\sum_{i=1}^n E_P[f(Z(i))]$. Notice $E_P(f) = E_P[f(Z(i))]$ for any $i$ if we assume identical distributions across observations.

^1 However, readers should notice that, in spite of this intuitive explanation, the Kullback–Leibler number is by no means a true "metric" between two measures.
3. Estimation
To extend the study of (2.4) to a high-dimensional case, we formally impose the following condition:

Condition 1.
(i): Econometricians observe a sample of size $n$ of identically and independently distributed (iid) excess returns on all $p$ assets in the market. Potentially $p \gg n$.
(ii): There exist possibly $s < n/2$ assets that can imitate the measure $Q$ well enough, in the following sense:
$$E_P\big[e^{\lambda_s' R_s}\big] - E_P\big[e^{\lambda_0' R}\big] = o\left(\sqrt{\frac{\log p}{n}}\right),$$
where
$$\lambda_s = \arg\min_{\lambda \in \Theta^s \subseteq \mathbb{R}^s} E_P\big[e^{\lambda' R_s}\big] + \Lambda,$$
for some $\Lambda \in \mathbb{R}$ that will be specified later in Subsection 4.2.
Remark 1. First, notice that under our assumptions the SDF may not be unique: there may exist more than one SDF that correctly prices the assets in the market. Nevertheless, based on (2.4), no matter how large the dimension of assets is, if a solution exists it has the form (2.5), with $\lambda_0$ defined through (2.6). Therefore, $\lambda_0 \in \Theta^p \subseteq \mathbb{R}^p$ should be regarded as a pseudo-parameter that correctly adjusts the weights on different assets. Second, Condition 1-(i) assumes that econometricians observe all assets in the market. This is not essential: our theory easily covers, with minor modification, the case where researchers do not observe all assets, as long as the set of asset returns they do observe is rich enough. Third, the iid assumption for returns is essential because we use independence to control the empirical process terms that appear later in the consistency proofs. Fourth, the definition of $\lambda_s$ is closely aligned with the definition of the oracle, and it signifies the best we can do. Indeed, the formula for $\lambda_s$ means that it should balance the approximation error from the first part $E_P[e^{\lambda' R_s}]$ against the additional price from $\Lambda$, which, as we will see later, is a mixture of the sparsity size, the penalty level and some other parameters. We discuss this in detail in Subsection 4.2.
Under our approximate sparsity condition (Condition 1) it is possible to select a small number of assets that are systematically important in determining the SDF, by adding a penalty term to the original definition of $\lambda$. There are several choices for the penalty term: the Lasso (Tibshirani, 1996); the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001); the minimax concave penalty (Zhang, 2010), etc. We take the Lasso as the penalty function and therefore consider the following problem:
$$\hat{\lambda} = \arg\min_{\lambda \in \Theta^p \subseteq \mathbb{R}^p} E_n\big(e^{\lambda' R}\big) + \alpha\|\lambda\|_1, \qquad (3.1)$$
where $\|\lambda\|_1 = \sum_{j=1}^p |\lambda_j|$ is the $\ell_1$ norm and $\alpha$ is a penalty level that can be chosen by the econometrician. We then estimate $dQ^*/dP$ by
$$\frac{d\hat{Q}^*}{dP} = \frac{e^{\hat{\lambda}' R}}{E_n\big[e^{\hat{\lambda}' R}\big]}.$$
Many objects of interest in financial economics can be expressed as moment conditions under the $Q$ measure. Suppose we are interested in the expectation under $Q$ of a function $D : \mathcal{D} \to \mathbb{R}$, which we denote $W = E_Q[D]$. Then we can estimate $W$ by
$$\hat{W} = \frac{\sum_{i=1}^n D(i)\, e^{\hat{\lambda}' R^e(i)}}{\sum_{i=1}^n e^{\hat{\lambda}' R^e(i)}}.$$
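To fix ideas, the following is a minimal numerical sketch of (3.1) and of the plug-in estimator $\hat{W}$; it is not the exact code used for the simulations below, and the proximal-gradient solver, step size, iteration count and function names are our own illustrative choices.

import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fit_lambda(R, alpha, step=0.1, n_iter=5000):
    """Minimize E_n[exp(lam'R)] + alpha * ||lam||_1 by proximal gradient
    descent (ISTA).  R is an (n, p) matrix of excess returns."""
    n, p = R.shape
    lam = np.zeros(p)
    for _ in range(n_iter):
        w = np.exp(R @ lam)              # exp(lam'R(i)), i = 1, ..., n
        grad = R.T @ w / n               # gradient of E_n[exp(lam'R)]
        lam = soft_threshold(lam - step * grad, step * alpha)
    return lam

def estimate_W(D, R, lam):
    """Plug-in estimator of W = E_Q[D]: average of D under the
    exponential-tilting weights exp(lam'R(i))."""
    w = np.exp(R @ lam)
    return np.sum(D * w) / np.sum(w)

Taking D to be the indicator of a Borel set $B$ evaluated at each observation reproduces the estimator $\hat{Q}(B)$ of Example 1 below.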
Now we give two examples of how our method can be used to estimate some $W$ of interest.

Example 1. The $Q$ measure itself. Given any Borel set $B \in \mathcal{B}$, let $Q(B)$ be the probability of $B$ under the $Q$ measure. Then $Q(B) = E_Q[I_B]$, where $I_B$ is the indicator function of $B$. We can estimate $Q(B)$ by
$$\hat{Q}(B) = \frac{\sum_{i=1}^n I_B(i)\, e^{\hat{\lambda}' R^e(i)}}{\sum_{i=1}^n e^{\hat{\lambda}' R^e(i)}}.$$
Example 2 (Christensen, 2016). Christensen (2016) proposed a way to use sieve methods to retrieve long-run components of the SDF, which is in essence the study of an eigenvalue–eigenfunction problem. Following his notation, we are interested in the following problem in sieve space:
$$E_P\left[b^k(X_t)\, m(X_t, X_{t+1})\, b^k(X_{t+1})'\right] c^k = \rho^k\, E_P\left[b^k(X_t)\, b^k(X_t)'\right] c^k,$$
where $X$ is the state vector as usual and $m(X_t, X_{t+1})$ is the short-term SDF. $b^k$ is a dictionary of basis functions chosen by the econometrician; $\rho^k$ is the eigenvalue and $c^k$ is the eigenvector, both of which have important economic implications in asset pricing. For example, $\rho^k$ can be regarded as the long-run yield in the market. To solve this eigenvalue–eigenfunction problem we need to estimate two moment matrices, namely $E_P[b^k(X_t) m(X_t, X_{t+1}) b^k(X_{t+1})']$ and $E_P[b^k(X_t) b^k(X_t)']$. Christensen (2016) assumed $m$ is either observable or can be estimated directly by observing the state vector $X$.

Using our methodology we can estimate $W = E_P[b^k(X_t) m(X_t, X_{t+1}) b^k(X_{t+1})']$ by a simple plug-in method, even when we do not observe the state vector. Notice:
$$E_P\left[b^k(X_t)\, m(X_t, X_{t+1})\, b^k(X_{t+1})'\right] = E_Q\left[\frac{b^k(X_t)\, b^k(X_{t+1})'}{R_{f,t\to t+1}}\right].$$
Our estimate of $W$ is
$$\hat{W} = \frac{\sum_{i=1}^n \dfrac{b^k(X_i)\, b^k(X_{i+1})'}{R_{f,i\to i+1}}\, e^{\hat{\lambda}' R^e(i)}}{\sum_{i=1}^n e^{\hat{\lambda}' R^e(i)}}.$$
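As an illustration of how the plug-in enters the downstream eigenvalue step, here is a sketch under our own simplifying assumptions: the basis functions evaluated at consecutive dates are available as arrays b_t and b_t1, and the risk-free rate Rf is constant.

import numpy as np
from scipy.linalg import eig

def long_run_components(b_t, b_t1, Rf, R, lam):
    """Sample analogue of E[b m b'] c = rho E[b b'] c, with the unobserved
    SDF m replaced by exponential-tilting weights.
    b_t, b_t1: (n, k) arrays of basis functions at dates t and t+1."""
    n = b_t.shape[0]
    w = np.exp(R @ lam)
    w = w / w.sum()                          # normalized Q weights
    G = (b_t * w[:, None]).T @ b_t1 / Rf     # estimate of E[b m b']
    B = b_t.T @ b_t / n                      # estimate of E[b b']
    return eig(G, B)                         # generalized eigenpairs (rho, c)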
4. Asymptotic properties
To show consistency of our object of interest $\hat{W}$, we first establish an oracle inequality for $\hat{\lambda}$ in the high-dimensional case where $p > n/2$; consistency and asymptotic normality of $\hat{W}$ then follow. Preliminary results that do not involve rates of convergence, as well as a benchmark low-dimensional case, can be found in Appendix A.
4.1. Oracle inequality when $p > n/2$. In this subsection we derive a convergence rate for $\hat{\lambda}$ when $p > n/2$. The main result is that our estimator behaves like the oracle who knows which assets to include to obtain the best possible estimate of the $Q$ measure. First we introduce some notation standard in the high-dimensional literature. Under Condition 1 we define the excess risk of any $\lambda \in \mathbb{R}^p$ as $\mathcal{E}(\lambda) = E_P[e^{\lambda' R_p}] - E_P[e^{\lambda_0' R_p}]$.^2 For an index set $S \subseteq \{1, 2, \dots, p\}$, its cardinality is $s = |S|$. We write $\lambda_S = (\lambda_{1,S}, \lambda_{2,S}, \dots, \lambda_{p,S})'$, where $\lambda_{j,S} = \lambda_j 1\{j \in S\}$, $1\{\cdot\}$ is the indicator function and $\lambda_j$ is the $j$-th component of the vector $\lambda$. That is, $\lambda_S$ is a vector of length $p$ whose nonzero coefficients lie only in the index set $S$. Correspondingly, $\lambda_{S^C} = (\lambda_{1,S^C}, \lambda_{2,S^C}, \dots, \lambda_{p,S^C})'$ with $\lambda_{j,S^C} = \lambda_j 1\{j \notin S\}$; that is, $\lambda_{S^C}$ has nonzero coefficients only in the complement of the index set $S$.

^2 Notice that, because of the definition of $\lambda_0$, excess risk is always nonnegative.
To get our first theorem we need three further conditions:

Condition 2 (Empirical process condition). For any fixed $\ddot{\lambda}$, denote
$$Z_M = \sup_{\|\lambda - \ddot{\lambda}\|_1 \le M} \left| v_n(\lambda) - v_n(\ddot{\lambda}) \right|,$$
for some $M > 0$ small enough, where $v_n(\lambda) = E_n[e^{\lambda' R}] - E_P[e^{\lambda' R}]$ denotes the empirical process part (see Lemma 1 in Appendix A). We say Condition 2 holds if
$$\Pr\{Z_M \le \alpha_0 M\} > 1 - \varepsilon,$$
for some small $\varepsilon$. Thus $1 - \varepsilon$ is the confidence level, and $\alpha_0$ is a number that depends on $1 - \varepsilon$.

Remark 2. We need Condition 2 because it guarantees that we only need to control the empirical process part locally, thanks to the convexity of the loss function. It says that, with large enough probability, the empirical process part $Z_M$ is proportional to $M$. This can be controlled through the choice of $\alpha_0$, which in turn depends on the confidence level $1 - \varepsilon$. Condition 2 can be verified using empirical process theory; we give a sufficient condition for it in Appendix C.
Condition 3 (Margin condition). For any $\lambda \in \Theta^p_{\mathrm{local}}$, the local neighborhood of the true $\lambda_0$ in terms of the $\ell_1$ norm, we have $\mathcal{E}(\lambda) \ge G(\|\lambda - \lambda_0\|_2)$, where $G(x) = cx^2$ for some constant $c$.

Remark 3. Condition 3 ensures that the excess risk of any $\lambda$ close to the true $\lambda_0$ can be bounded from below by a strictly convex function $G$ of their difference in $\ell_2$ norm. Notice the neighborhood is defined in terms of the $\ell_1$ norm, which is stronger than the $\ell_2$ norm used in the lower bound for the excess risk. It is a reasonably weak condition: since our loss function is the exponential function, a Taylor expansion shows that $\mathcal{E}(\lambda)$ behaves quadratically around the minimizer $\lambda_0$, that is,
$$\mathcal{E}(\lambda) \approx r(\lambda)\, \|\lambda - \lambda_0\|_2^2.$$
As long as $r(\lambda)$ has a common lower bound above zero for all $\lambda$ around $\lambda_0$, Condition 3 holds.
Condition 4 (Compatibility condition). Condition 4 is met for the set $S$ if, for all $\lambda$ satisfying $\|\lambda_{S^C}\|_1 \le 3\|\lambda_S\|_1$, we have
$$\|\lambda_S\|_1 \le \frac{\sqrt{|S|}\,\|\lambda\|_2}{\phi_S},$$
where $\phi_S$ is some constant depending on the set $S$.

Remark 4. Condition 4 is a high-level condition that strengthens the Cauchy–Schwarz inequality between the $\ell_1$ and $\ell_2$ norms of $\lambda$. Such compatibility conditions appear widely in the high-dimensional literature, for example the restricted eigenvalue condition in Bickel et al. (2009). We need this condition later in the proofs to bound inequalities that translate between the $\ell_1$ and $\ell_2$ norms.
Recall that our estimator in the high-dimensional case is $\hat{\lambda} = \arg\min_\lambda E_n(e^{\lambda' R}) + \alpha\|\lambda\|_1$. Inspired by Lemma 3 (see Appendix A), we define the oracle as follows:

Definition 1. The oracle $\lambda^*$ is defined as
$$\lambda^* = \arg\min_{\lambda : S_\lambda \in \mathcal{S}} \frac{1}{2}\mathcal{E}(\lambda) + \frac{16\alpha^2 s_\lambda}{\phi_{S_\lambda}^2 c}, \qquad (4.1)$$
where $S_\lambda = \{j : \lambda_j \neq 0\}$ and $\lambda_j$ is the $j$-th component of the vector $\lambda$. That is, $S_\lambda$ is the index set on which $\lambda$ has nonzero coefficients. Denote $s_\lambda = |S_\lambda|$. $\mathcal{S}$ stands for the class of index sets over which the oracle optimizes,^3 and $\phi_{S_\lambda}$ is the compatibility constant for the set $S_\lambda$. The minimized value of (4.1) is denoted $Q^* = \frac{1}{2}\mathcal{E}(\lambda^*) + \frac{16\alpha^2 s^*}{\phi^{*2} c}$, where $s^* = |S^*|$, $S^*$ is the index set of $\lambda^*$, and $\phi^*$ is the compatibility constant for $S^*$.

^3 For example, under Condition 1 the model is approximately sparse, with $s < n/2$ assets enough to obtain a small approximation error, so the oracle only optimizes over index sets with $s$ nonzero coefficients. If, moreover, we know exactly which the $s$ assets are, we only need to define the oracle to optimize over that exact index set.
Theorem 1. Assume:
(i): Condition 1 holds;
(ii): Condition 2 is satisfied for the oracle $\lambda^*$ defined in equation (4.1), with $M$ replaced by $M^* = \frac{Q^*}{2\alpha_0}$ and $\alpha_0$ some constant;
(iii): Condition 3 is satisfied for all $\|\lambda - \lambda^*\|_1 \le M^*$;
(iv): Condition 4 is satisfied for all $S \in \mathcal{S}$;
(v): $\alpha_0 \le \alpha/8$.
Then with probability at least $1 - \varepsilon$,
$$\frac{\mathcal{E}(\hat{\lambda})}{2} + \alpha\|\hat{\lambda} - \lambda^*\|_1 \le \mathcal{E}(\lambda^*) + \frac{32\alpha^2 s^*}{\phi^{*2} c}.$$
Corollary 1. Under Theorem 1 we can choose $\alpha \asymp \alpha_0 \asymp \sqrt{\frac{\log p}{n}}$. If $s^*\sqrt{\frac{\log p}{n}} = o(1)$ and $\mathcal{E}(\lambda^*) = o\left(\sqrt{\frac{\log p}{n}}\right)$, then:
(i): $\mathcal{E}(\hat{\lambda}) = O_p\left(\frac{s^* \log p}{n}\right)$;
(ii): $\|\hat{\lambda} - \lambda^*\|_1 = O_p\left(s^*\sqrt{\frac{\log p}{n}}\right)$;
(iii): $\hat{\lambda} \xrightarrow{p} \lambda^*$ in terms of the $\ell_1$ norm.^4

^4 With a stronger version of Condition 4, we can also show convergence in terms of the $\ell_2$ norm.
4.2. Consistency of $\hat{W}$. For a certain class of functions $D$, under stronger conditions, we can show consistency of our estimator $\hat{W}$.

Theorem 2. Assume:
(i): Corollary 1 stands;
(ii): $E_P[e^{\lambda^{*\prime} R}] < \infty$;
(iii): $|\sup D| < \infty$;
(iv): $E_P[D e^{w_k' R} RR'] < \infty$ (where the $w_k$ are points in a neighborhood of $\lambda^*$ introduced in the proof);
(v): $E_P[D e^{\lambda^{*\prime} R} R] < \infty$.
Then $\hat{W} \xrightarrow{p} W$.

Remark 5. To obtain consistency of $\hat{W}$ we indeed need slightly stronger conditions. In particular, assumption (iii) restricts the class of functions we can consistently estimate: the function $D$ must be bounded. Assumptions (ii), (iv) and (v) are stronger than the usual moment conditions in regression-style problems, but they are standard in information-theoretic asymptotics, similar to the regularity conditions made for exponentially tilted empirical likelihood estimators.

Corollary 2. $\hat{Q}(B) \xrightarrow{p} Q(B)$. That is, our estimate of the probability of any Borel set $B$ under the $Q$ measure is consistent.
4.3. Asymptotic normality. Asymptotic normality in sparse high-dimensional models is usually much harder to establish. As pointed out in Van de Geer et al. (2014), tracking the limiting distribution is challenging because convergence is usually not uniform, and methods like the bootstrap and subsampling are of limited use. Indeed, we do not attempt to establish asymptotic normality for $\hat{\lambda}$; after all, it is not of final interest. Nevertheless, we are able to give a simple asymptotic distribution result for our object of interest, $\hat{W}$, similar in spirit to the treatment effect with many controls in Belloni et al. (2011).^5 Denote $\hat{W} = \frac{M_n(D)}{M_n(1)}$ and $W = \frac{M(D)}{M(1)}$, where $M_n(D) = E_n(D e^{\hat{\lambda}' R})$, $M_n(1) = E_n(e^{\hat{\lambda}' R})$, $M(D) = E_P(D e^{\lambda_0' R})$ and $M(1) = E_P(e^{\lambda_0' R})$.

^5 However, readers should notice that asymptotic normality for $\hat{\lambda}$ can be established using recent de-sparsifying methods in Van de Geer et al. (2014).
Theorem 3. Assume:
(i): Theorem 2 stands;
(ii): $E_P\big[e^{\lambda^{*\prime} R} - e^{\lambda_0' R}\big] = o\left(\frac{\sqrt{\log p}}{n}\right)$;
(iii): $E_P\big[D^2 e^{2\lambda^{*\prime} R}\big] < \infty$;
(iv): $\frac{s^{*2} \log p}{\sqrt{n}} = o(1)$.
Then $\sqrt{n}(\hat{W} - W) \xrightarrow{d} N(0, \mathbb{V})$, where $\mathbb{V} = \mathrm{Var}(T e^{\lambda^{*\prime} R})$ and $T = \frac{D - W}{M(1)}$.
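In applications $\mathbb{V}$ must itself be estimated. A natural plug-in route, which we sketch below under our own assumption that the plug-in is valid (Theorem 3 does not formally establish this), replaces $\lambda^*$, $W$ and $M(1)$ by their sample counterparts:

import numpy as np

def confidence_interval(D, R, lam_hat, z=1.96):
    """Plug-in normal interval for W = E_Q[D] based on Theorem 3:
    V = Var(T exp(lam'R)) with T = (D - W) / M(1)."""
    n = len(D)
    w = np.exp(R @ lam_hat)
    W_hat = np.sum(D * w) / np.sum(w)
    T = (D - W_hat) / np.mean(w)      # plug-in for (D - W) / M(1)
    V_hat = np.var(T * w)             # plug-in for Var(T e^{lam'R})
    half = z * np.sqrt(V_hat / n)
    return W_hat - half, W_hat + half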
Remark 6. Indeed, we need much stronger conditions than those of Theorem 2, concerning approximate sparsity (ii), moment existence (iii), and the divergence speed of $s$ and $n$ (iv). For example, assumption (ii) in Theorem 3 strengthens how small the excess risk of $\lambda^*$ should be: to get consistency we only need $E_P[e^{\lambda^{*\prime} R} - e^{\lambda_0' R}] = o\left(\sqrt{\frac{\log p}{n}}\right)$, while assumption (ii) in Theorem 3 requires it to be $o\left(\frac{\sqrt{\log p}}{n}\right)$. In particular, if we assume exact sparsity, then $E_P[e^{\lambda^{*\prime} R} - e^{\lambda_0' R}]$ vanishes and assumption (ii) holds. All the other conditions in Theorem 3 are standard. Theorem 3 is straightforward to show once we have established Lemma 4, which is presented in Appendix A.
5. Simulation
In this section we verify some of the main theoretical results of this paper through simulation. For simplicity we only consider an exactly sparse situation where the SDF is precisely determined by a small number of core assets. This can be achieved by carefully designing our economy through moment conditions, similar to the idea in Hall and Horowitz (1996). The economy, under our parameterization, is defined by $p$ moment conditions: $E_Q[R] = 0$. This is easily shown to be equivalent to
$$E_P\big[R e^{\lambda' R}\big] = 0. \qquad (5.1)$$
Since we consider an exactly sparse case, suppose the true $\lambda_0$ can be written as $(\lambda_{p,1}, \dots, \lambda_{p,s}, 0, \dots, 0)'$, a vector of length $p$, where $\lambda_{p,i} \neq 0$ for all $i = 1, \dots, s$. This means we have $s$ active assets in the whole economy. We can further write (5.1) componentwise as
$$E_P\big(R_{p,j}\, e^{\lambda_{p,1} R_{p,1} + \dots + \lambda_{p,s} R_{p,s}}\big) = 0, \quad j = 1, \dots, p.$$
Now let the components of $R$ be mutually independent. Then the last $p - s$ moment conditions, those with respect to the sparse asset returns, are satisfied as long as $E_P(R_{p,i}) = 0$ for $i = s+1, \dots, p$. Trivially, we set
$$R_{p,j} \sim N(0, 1), \quad \forall j = s+1, \dots, p.$$
For the nonsparse part, we let
$$\begin{pmatrix} R_{p,1} \\ \vdots \\ R_{p,s} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_s \end{pmatrix},\; I_s \right).$$
That is, the nonsparse assets are mutually independent and normally distributed with identity covariance. It can then easily be shown that as long as $\mu_i + \lambda_i = 0$ for all $i = 1, \dots, s$, the moment condition (5.1) is satisfied: for $R \sim N(\mu, 1)$ we have $E_P[R e^{\lambda R}] = (\mu + \lambda) e^{\lambda\mu + \lambda^2/2}$, which is zero if and only if $\mu + \lambda = 0$. This can be arranged without any effort; for example, set $\mu_i = 1$, $\lambda_i = -1$ for all $i = 1, \dots, s$.
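A minimal sketch of this data-generating process (the seed, array layout and helper name are our own choices):

import numpy as np

def simulate_returns(n, s=4, p_sparse=100, mu=1.0, seed=0):
    """Draw n iid observations: s active assets ~ N(+/-mu, 1) and
    p_sparse inactive assets ~ N(0, 1).  The true lambda satisfies
    lambda_i = -mu_i on the active block and 0 elsewhere."""
    rng = np.random.default_rng(seed)
    means = np.concatenate([np.full(s // 2, mu),
                            np.full(s - s // 2, -mu),
                            np.zeros(p_sparse)])
    R = rng.standard_normal((n, s + p_sparse)) + means
    lam_true = -means
    return R, lam_true

One can check numerically that np.mean(R * np.exp(R @ lam_true)[:, None], axis=0) is close to zero in large samples, matching (5.1).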
We choose the best penalty level through leave-one-out cross-validation. We then consider the following easily verifiable conclusions from this paper:
(1) There is a high probability that the estimated $\hat{\lambda}$ includes the true nonsparse elements of $\lambda_0$;
(2) $\hat{\lambda} \xrightarrow{p} \lambda^*$ in terms of the $\ell_1$ norm;
(3) $E_n[D e^{\hat{\lambda}' R}] \xrightarrow{p} E_P[D e^{\lambda_0' R}]$ when $D = 1$;
(4) $\sqrt{n}\left[E_n(D e^{\hat{\lambda}' R}) - E_P(D e^{\lambda_0' R})\right] \xrightarrow{d} N(0, V)$ when $D = 1$. That is, $E_n(e^{\hat{\lambda}' R})$ is asymptotically normal.
We check the performance of our estimators under several settings. In the simplest scenario, we set the number of nonsparse (active) assets, $s$, to 4: the first two assets have expected return $1$ and the other two expected return $-1$. We set the number of sparse assets to 100. Therefore the true $\lambda_0$ is $(-1, -1, 1, 1, 0, \dots, 0)'$. We then generate a random sample of size $50$ ($n = 50$) from our specified normal distribution and use our proposed method to estimate $\lambda$ and $E_P(e^{\lambda_0' R})$. We repeat the experiment a number of times, denoted $N$; we set $N = 50, 200, 500, 1000$. We then vary the design by gradually increasing the sample size $n$ to 100 and 500 and repeating the experiment $N$ times. Finally, we check the performance of our estimator when we increase the difficulty by setting $|\mu_i| = 0.5$ and $0.1$, respectively, for each nonsparse asset. As we will see, this showcases some interesting behavior of nonlinear optimization with the Lasso. All results are reported in the tables in Appendix D.
Table 1 summarizes results for the simplest case, where the expected returns of the active assets are $\pm 1$: $R_{104,1} \sim N(1,1)$, $R_{104,2} \sim N(1,1)$, $R_{104,3} \sim N(-1,1)$, $R_{104,4} \sim N(-1,1)$, and $R_{104,j} \sim N(0,1)$ for $j = 5, \dots, 104$. The first column reports the probability that our estimator correctly includes the true nonsparse assets 1–4. The second column reports the average $\ell_1$ norm of our estimators, where the norm of the true $\lambda_0$ is 4. The third column reports the average $\ell_1$ distance of the estimator from the true $\lambda_0$, $E_N\|\hat{\lambda} - \lambda_0\|_1$, where $N$ is the number of experiments. We expect that as the sample size $n$ increases, $E_N\|\hat{\lambda} - \lambda_0\|_1$ decreases. The fourth column reports the mean squared error of our estimator of $E_P e^{\lambda_0' R}$, whose true value is $e^{-2}$ under our setting, computed over the $N$ experiments as $E_N\big(E_n e^{\hat{\lambda}' R} - E_P e^{\lambda_0' R}\big)^2$. We also expect it to drop as we increase $n$.

The structure of Tables 2 and 3 is essentially the same as that of Table 1, except that we decrease the means of the active asset returns: in Table 2 we set $R_{104,1} \sim N(0.5,1)$, $R_{104,2} \sim N(0.5,1)$, $R_{104,3} \sim N(-0.5,1)$, $R_{104,4} \sim N(-0.5,1)$, and in Table 3 we set $R_{104,1} \sim N(0.1,1)$, $R_{104,2} \sim N(0.1,1)$, $R_{104,3} \sim N(-0.1,1)$, $R_{104,4} \sim N(-0.1,1)$. Everything else is as in Table 1. We designed the experiments this way to study the performance of our estimators as it becomes increasingly hard to differentiate between sparse and nonsparse assets. Notice that in Tables 1–3, for each experiment we perform leave-one-out cross-validation to choose the best penalty level; computationally this is very costly, especially when $n$ is large.
To verify conclusion (4) above, we simulate the sampling distribution of $E_n(e^{\hat{\lambda}' R})$ under various scenarios by setting $N = 10000$. Results are reported in Figure D.1. For each plot in Figure D.1, we perform cross-validation only once and fix the penalty level. We then estimate $E_n(e^{\hat{\lambda}' R})$ for the given $n$ and $|\mu_i|$, $i = 1, 2, 3, 4$, and plot the density of our estimators over the 10000 replications. The red line is the normal distribution it should follow. For example, in the first plot of Figure D.1 we set $R_{104,1} \sim N(1,1)$, $R_{104,2} \sim N(1,1)$, $R_{104,3} \sim N(-1,1)$, $R_{104,4} \sim N(-1,1)$, and $R_{104,j} \sim N(0,1)$ for $j = 5, \dots, 104$; each sample has $n = 50$ observations. The remaining plots differ only in the expected returns of the active assets and the number of observations, as specified in each column.

As we can see from the simulations, our estimators work very well most of the time, except when the expected returns of the active assets are too close to those of the sparse assets (Table 3). In that case, when the sample size is too small ($n = 50, 100$), our estimator is also almost unable to correctly select the nonsparse assets: when sparse and nonsparse assets behave too similarly, the Lasso trivially sets $\hat{\lambda} = 0$ outright. This in turn distorts the sampling distribution of $E_n(e^{\hat{\lambda}' R})$, as we see from the last row of Figure D.1. This type of behavior deserves more research in high-dimensional methods, as it presents a situation similar to "weak identification" for the Lasso. Nevertheless, this paper does not address variable selection properties; as the last two columns of Table 3 show, our method still works in terms of consistency of $\hat{\lambda}$ and $E_n(e^{\hat{\lambda}' R})$.
6. Concluding remarks
In this paper we proposed a nonparametric estimator of the SDF for settings with many asset returns, where the number of assets is possibly much larger than the number of observations. Such a high-dimensional setting imposes challenges but also brings benefits to estimation: it provides more, but noisier, information. We showed how approximate sparsity comes to our aid in identifying and estimating the SDF in this high-dimensional setting. This is a natural extension of the work of Kitamura and Stutzer (2002) and Ghosh et al. (2015, 2016). We expanded their analysis considerably and established formal consistency and asymptotic normality results. Simulations show that our method works well, even when sparse and nonsparse assets behave very similarly. Our method is flexible enough to be applied in both empirical and theoretical work.

For future research, it would be interesting to look further at the case where sparse and nonsparse assets behave very similarly. Under this scenario, variable selection and inference might fail, which presents a challenge for current high-dimensional methods. Work is also needed for data beyond our iid assumption, exhibiting dependence.
Appendix A. Additional Results in Asymptotic Properties
This section presents some useful preliminary results that complement the main theory in the body of the paper.
A.1. Consistency of $\hat{\lambda}$ in terms of excess risk. By definition of $\hat{\lambda}$, we have, for any $\lambda \in \mathbb{R}^p$,
$$E_n\big[e^{\hat{\lambda}' R}\big] + \alpha\|\hat{\lambda}\|_1 \le E_n\big[e^{\lambda' R}\big] + \alpha\|\lambda\|_1. \qquad (A.1)$$
From this basic inequality we can derive a very useful lemma that will be used frequently.

Lemma 1. For any $\lambda \in \mathbb{R}^p$,
$$\mathcal{E}(\hat{\lambda}) + \alpha\|\hat{\lambda}\|_1 \le \left[v_n(\lambda) - v_n(\hat{\lambda})\right] + \alpha\|\lambda\|_1 + \mathcal{E}(\lambda),$$
where $\mathcal{E}(\lambda)$ is the excess risk of the vector $\lambda$ and $v_n(\lambda) = E_n[e^{\lambda' R}] - E_P[e^{\lambda' R}]$ is the random part.

Lemma 1 implies that in order to bound the excess risk we have to control the empirical process part $[v_n(\lambda) - v_n(\hat{\lambda})]$. Since our loss function is convex, it turns out we only need to control the empirical process part locally. If Condition 2 holds, the empirical process can be bounded locally by a deterministic quantity with very large probability. From this we obtain a consistency result for $\hat{\lambda}$ in terms of excess risk.
Lemma 2. Define the best $p$-dimensional approximation as $\dot{\lambda} = \arg\min_\lambda \mathcal{E}(\lambda) + \alpha\|\lambda\|_1$. Suppose: (i) Condition 2 holds for $\dot{\lambda}$; (ii) $\mathcal{E}(\dot{\lambda}) = o\left(\sqrt{\frac{\log p}{n}}\right)$; (iii) $\alpha \asymp \sqrt{\frac{\log p}{n}}$; and (iv) $\|\dot{\lambda}\|_1 = o\left(\sqrt{\frac{n}{\log p}}\right)$. Then $\hat{\lambda}$ is consistent in terms of excess risk, that is, $\mathcal{E}(\hat{\lambda}) = o_p(1)$.

Remark 7. Lemma 2 is only a preliminary result. It shows that under Conditions 1 and 2 the prediction error of $\hat{\lambda}$ is small enough, i.e., $E_P e^{\hat{\lambda}' R}$ is close enough to $E_P e^{\lambda_0' R}$. But it tells us neither the speed of convergence nor the relationship between $\hat{\lambda}$ and $\lambda_0$.
A.2. Oracle inequality when $p < n/2$. When $p < n/2$ there is no need to regularize because, according to Tsao (2004), empirical likelihood methods work well in finite samples. However, it is useful to look at how the oracle behaves in the low-dimensional case and relate it to the high-dimensional situation. Similar to the literature on series regression, we can define the best approximation when $p < n/2$ as follows:

Definition 2. The best low-dimensional approximation $\lambda_{LOW}$ is defined as $\lambda_{LOW} = \arg\min_\lambda E_P[e^{\lambda' R}]$. The corresponding estimator is $\hat{\lambda}_{LOW} = \arg\min_\lambda E_n[e^{\lambda' R}]$.

We need to bound the prediction error of $\hat{\lambda}_{LOW}$. To do this, we introduce a modified Condition 2 for the low-dimensional scenario.

Condition 5 (Empirical process condition under low dimension). For some $\delta > 0$, denote
$$Z_\delta = \sup_{\|\lambda - \lambda_{LOW}\|_2 \le \delta} \left| v_n(\lambda) - v_n(\lambda_{LOW}) \right| / (\delta\sqrt{p}).$$
We say Condition 5 holds if
$$\Pr\{Z_\delta \le \alpha_0\} > 1 - \varepsilon,$$
where $1 - \varepsilon$ is the confidence level and $\alpha_0$ is a constant that depends on $1 - \varepsilon$. That is, $Z_\delta \le \alpha_0$ happens with large probability.

With Condition 5 we obtain the following lemma, which characterizes the estimator's behavior in the low-dimensional situation.

Lemma 3. Suppose:
(i): All asset returns are iid;
(ii): Condition 3 is satisfied for $\lambda_{LOW}$, i.e., $\lambda_{LOW} \in \Theta^p_{\mathrm{local}}$; moreover, $\lambda \in \Theta^p_{\mathrm{local}}$ for all $\lambda$ such that $\|\lambda - \lambda_{LOW}\|_2 \le \delta/2$;
(iii): Condition 5 is satisfied with $\delta = G^{-1}(\epsilon_0)$, where $\epsilon_0 = 32\left(\mathcal{E}(\lambda_{LOW}) + \frac{8\alpha_0^2 p}{c}\right)$.
Then with probability at least $1 - \varepsilon$,
$$\mathcal{E}(\hat{\lambda}_{LOW}) \le 2\mathcal{E}(\lambda_{LOW}) + \frac{16\alpha_0^2 p}{c}.$$

Remark 8. The significance of Lemma 3 is twofold. First, it gives a more precise account of the prediction error of our estimator in a low-dimensional setting. In particular, the prediction error is bounded from above by two terms: $2\mathcal{E}(\lambda_{LOW})$, which can be seen as the approximation error, and $\frac{16\alpha_0^2 p}{c}$, which can be called the estimation error, the additional price we have to pay for estimating with an empirical analogue of the theoretical measure. This result is in line with many nonparametric estimation asymptotics. If the SDF can be imitated exactly by the $p$ assets, then $\mathcal{E}(\lambda_{LOW}) = 0$, there is no approximation error, and the bias is determined only by the estimation error part. In general, we can choose $\alpha_0 \asymp \frac{1}{\sqrt{n}}$, so the estimation error is of order $O_p(p/n)$. Second, it gives us an idea of how the oracle should behave in a high-dimensional case under approximate $s$-sparsity: the oracle rate should strike a balance between approximation error and estimation error, and it behaves as in a low-dimensional case.
A.3. Preliminary result for Theorem 3.

Lemma 4. Under assumptions (i) to (iv) of Theorem 3, we have
$$\sqrt{n}\left[E_n(D e^{\hat{\lambda}' R}) - E_P(D e^{\lambda_0' R})\right] = N(0, V) + o_p(1),$$
where $V = \mathrm{Var}(D e^{\lambda^{*\prime} R})$.

Based on Lemma 4, it is easy to see that when $D = 1$ we have
$$\sqrt{n}\left[\frac{1}{n}\sum_{i=1}^n e^{\hat{\lambda}' R(i)} - E_P(e^{\lambda_0' R})\right] \xrightarrow{d} N(0, H) + o_p(1),$$
where $H = \mathrm{Var}(e^{\lambda^{*\prime} R})$.
Appendix B. Proofs
B.1. Proof of Lemma 1.

Proof. This is straightforward. From the basic inequality (A.1),
$$v_n(\hat{\lambda}) + E_P\big[e^{\hat{\lambda}' R}\big] + \alpha\|\hat{\lambda}\|_1 = E_n\big(e^{\hat{\lambda}' R}\big) + \alpha\|\hat{\lambda}\|_1 \le E_n\big(e^{\lambda' R}\big) + \alpha\|\lambda\|_1 = v_n(\lambda) + E_P\big[e^{\lambda' R}\big] + \alpha\|\lambda\|_1.$$
Subtracting $E_P[e^{\lambda_0' R}]$ from both sides,
$$E_P\big[e^{\hat{\lambda}' R}\big] - E_P\big[e^{\lambda_0' R}\big] + \alpha\|\hat{\lambda}\|_1 \le \left[v_n(\lambda) - v_n(\hat{\lambda})\right] + E_P\big[e^{\lambda' R}\big] - E_P\big[e^{\lambda_0' R}\big] + \alpha\|\lambda\|_1.$$
That is,
$$\mathcal{E}(\hat{\lambda}) + \alpha\|\hat{\lambda}\|_1 \le \left[v_n(\lambda) - v_n(\hat{\lambda})\right] + \alpha\|\lambda\|_1 + \mathcal{E}(\lambda). \qquad \Box$$
B.2. Proof of Lemma 2.

Proof. Fix $\dot{\lambda}$. Because Condition 2 holds for $\dot{\lambda}$, pick the $\ell_1$ distance
$$\dot{M} = \frac{1}{\alpha_0}\left[\mathcal{E}(\dot{\lambda}) + 2\alpha\|\dot{\lambda}\|_1\right],$$
and define $Z_{\dot{M}}$ accordingly. As Condition 2 holds, we have $Z_{\dot{M}} \le \alpha_0 \dot{M}$ with probability at least $1 - \varepsilon$. Choose the penalty level $\alpha$ such that $\alpha \ge 4\alpha_0$.

We want to use Condition 2 to bound the empirical process part, but Condition 2 only applies when $\lambda$ is close enough to $\dot{\lambda}$. The following step shows that, because of the convexity of the loss function, a suitable convex combination of $\hat{\lambda}$ and $\dot{\lambda}$ is close enough to $\dot{\lambda}$ to invoke Condition 2. Consider $\tilde{\lambda} = t\hat{\lambda} + (1-t)\dot{\lambda}$ with weight
$$t = \frac{\dot{M}}{\dot{M} + \|\hat{\lambda} - \dot{\lambda}\|_1}.$$
As a result,
$$\|\tilde{\lambda} - \dot{\lambda}\|_1 = t\|\hat{\lambda} - \dot{\lambda}\|_1 = \frac{\dot{M}\,\|\hat{\lambda} - \dot{\lambda}\|_1}{\dot{M} + \|\hat{\lambda} - \dot{\lambda}\|_1} \le \dot{M}.$$
That is, the choice of $t$ drags $\tilde{\lambda}$ close enough to $\dot{\lambda}$. Now we have
$$E_n\big(e^{\tilde{\lambda}' R}\big) + \alpha\|\tilde{\lambda}\|_1 \le t\left[E_n\big(e^{\hat{\lambda}' R}\big) + \alpha\|\hat{\lambda}\|_1\right] + (1-t)\left[E_n\big(e^{\dot{\lambda}' R}\big) + \alpha\|\dot{\lambda}\|_1\right] \le E_n\big(e^{\dot{\lambda}' R}\big) + \alpha\|\dot{\lambda}\|_1. \qquad (B.1)$$
The first inequality in (B.1) comes from the convexity of the loss function and of the $\ell_1$ norm, and the second from the basic inequality (A.1). Therefore we can use Lemma 1 to get
$$\mathcal{E}(\tilde{\lambda}) + \alpha\|\tilde{\lambda}\|_1 \le \left[v_n(\dot{\lambda}) - v_n(\tilde{\lambda})\right] + \alpha\|\dot{\lambda}\|_1 + \mathcal{E}(\dot{\lambda}).$$
Since $\tilde{\lambda}$ is within $\ell_1$ distance $\dot{M}$ of $\dot{\lambda}$, Condition 2 yields
$$\mathcal{E}(\tilde{\lambda}) + \alpha\|\tilde{\lambda}\|_1 \le Z_{\dot{M}} + \alpha\|\dot{\lambda}\|_1 + \mathcal{E}(\dot{\lambda}) \le \alpha_0\dot{M} + \alpha\|\dot{\lambda}\|_1 + \mathcal{E}(\dot{\lambda}).$$
Then, by the triangle inequality, it is easy to see that
$$\mathcal{E}(\tilde{\lambda}) + \alpha\|\tilde{\lambda} - \dot{\lambda}\|_1 \le \alpha_0\dot{M} + 2\alpha\|\dot{\lambda}\|_1 + \mathcal{E}(\dot{\lambda}) = 2\alpha_0\dot{M} \le \frac{\alpha\dot{M}}{2},$$
where the equality comes from the definition of $\dot{M}$ and the last inequality uses the fact that we chose $\alpha \ge 4\alpha_0$. Since by definition $\mathcal{E}(\cdot) \ge 0$ always, we have
$$\|\tilde{\lambda} - \dot{\lambda}\|_1 \le \frac{\dot{M}}{2}.$$
But this says precisely that $\|\hat{\lambda} - \dot{\lambda}\|_1 \le \dot{M}$, considering our choice of $t$. This means that, by a proper choice of the penalty level, our estimator $\hat{\lambda}$ is within $\ell_1$ distance $\dot{M}$ of the best approximation $\dot{\lambda}$. We can therefore once again use Lemma 1 and Condition 2, now for $\hat{\lambda}$ itself, to get
$$\mathcal{E}(\hat{\lambda}) + \alpha\|\hat{\lambda}\|_1 \le \left[v_n(\dot{\lambda}) - v_n(\hat{\lambda})\right] + \alpha\|\dot{\lambda}\|_1 + \mathcal{E}(\dot{\lambda}) \le Z_{\dot{M}} + 2\alpha\|\dot{\lambda}\|_1 + \mathcal{E}(\dot{\lambda}) \le 2\alpha_0\dot{M} = 2\left[\mathcal{E}(\dot{\lambda}) + 2\alpha\|\dot{\lambda}\|_1\right].$$
Now it is easy to see that $\mathcal{E}(\hat{\lambda}) = o_p(1)$ if we impose the additional assumptions:
(i) $\mathcal{E}(\dot{\lambda}) = o\left(\sqrt{\frac{\log p}{n}}\right)$;
(ii) $\alpha \asymp \sqrt{\frac{\log p}{n}}$;
(iii) $\|\dot{\lambda}\|_1 = o\left(\sqrt{\frac{n}{\log p}}\right)$. $\Box$
B.3. Proof of Lemma 3.

Proof. First of all, we have the basic inequality
$$E_n\big(e^{\hat{\lambda}_{LOW}' R}\big) \le E_n\big(e^{\lambda_{LOW}' R}\big).$$
Now, similar to the proof of Lemma 2, consider the specific linear combination $\tilde{\lambda} = t\hat{\lambda}_{LOW} + (1-t)\lambda_{LOW}$, where
$$t = \frac{\delta}{\delta + \|\hat{\lambda}_{LOW} - \lambda_{LOW}\|_2}.$$
The convexity of the loss function ensures
$$E_n\big(e^{\tilde{\lambda}' R}\big) \le t\, E_n\big(e^{\hat{\lambda}_{LOW}' R}\big) + (1-t)\, E_n\big(e^{\lambda_{LOW}' R}\big) \le E_n\big(e^{\lambda_{LOW}' R}\big).$$
Using this inequality and Lemma 1 it is easy to get
$$\mathcal{E}(\tilde{\lambda}) \le \left[v_n(\lambda_{LOW}) - v_n(\tilde{\lambda})\right] + \mathcal{E}(\lambda_{LOW}),$$
which means we again have to bound the empirical process part. Notice that the linear combination ensures that $\tilde{\lambda}$ is close enough to $\lambda_{LOW}$ to trigger Condition 5:
$$\|\tilde{\lambda} - \lambda_{LOW}\|_2 = t\|\hat{\lambda}_{LOW} - \lambda_{LOW}\|_2 = \frac{\delta\,\|\hat{\lambda}_{LOW} - \lambda_{LOW}\|_2}{\delta + \|\hat{\lambda}_{LOW} - \lambda_{LOW}\|_2} \le \delta.$$
So we have
$$\mathcal{E}(\tilde{\lambda}) \le \alpha_0\,\delta\sqrt{p} + \mathcal{E}(\lambda_{LOW}).$$
Since $ab \le a^2 + b^2/4$ for all $a, b \in \mathbb{R}$, we have
$$\alpha_0\,\delta\sqrt{p} = \left(\frac{\sqrt{c}\,\delta}{\sqrt{32}}\right)\left(\frac{\sqrt{32}\,\alpha_0\sqrt{p}}{\sqrt{c}}\right) \le \frac{c\,\delta^2}{32} + \frac{8\alpha_0^2 p}{c} = \frac{G(\delta)}{32} + \frac{8\alpha_0^2 p}{c} = \frac{\epsilon_0}{32} + \frac{8\alpha_0^2 p}{c}.$$
As a result,
$$\mathcal{E}(\tilde{\lambda}) \le \frac{\epsilon_0}{32} + \frac{8\alpha_0^2 p}{c} + \mathcal{E}(\lambda_{LOW}) = 2\left(\mathcal{E}(\lambda_{LOW}) + \frac{8\alpha_0^2 p}{c}\right) = \frac{\epsilon_0}{16}.$$
Now, by our assumption $\tilde{\lambda} \in \Theta^p_{\mathrm{local}}$, Condition 3 implies
$$\|\tilde{\lambda} - \lambda_0\|_2 \le G^{-1}\left[\mathcal{E}(\tilde{\lambda})\right] \le G^{-1}\left(\frac{\epsilon_0}{16}\right) = \frac{1}{4}\, G^{-1}(\epsilon_0) = \frac{\delta}{4}.$$
On the other hand,
$$\|\lambda_{LOW} - \lambda_0\|_2 \le G^{-1}\left[\mathcal{E}(\lambda_{LOW})\right] \le G^{-1}\left(\frac{\epsilon_0}{32}\right) \le \frac{1}{4}\, G^{-1}(\epsilon_0) = \frac{\delta}{4},$$
where the second inequality comes from the fact that $\epsilon_0 = 32\left(\mathcal{E}(\lambda_{LOW}) + \frac{8\alpha_0^2 p}{c}\right)$. By the triangle inequality we get
$$\|\tilde{\lambda} - \lambda_{LOW}\|_2 \le \frac{\delta}{2}.$$
Now, recall that
$$\|\tilde{\lambda} - \lambda_{LOW}\|_2 = \frac{\delta\,\|\hat{\lambda}_{LOW} - \lambda_{LOW}\|_2}{\delta + \|\hat{\lambda}_{LOW} - \lambda_{LOW}\|_2} \le \frac{\delta}{2},$$
which means $\|\hat{\lambda}_{LOW} - \lambda_{LOW}\|_2 \le \delta$. That is, $\hat{\lambda}_{LOW}$ is also close enough to $\lambda_{LOW}$ that we can invoke Conditions 5 and 3 again. This gives us
$$\mathcal{E}(\hat{\lambda}_{LOW}) \le \left[v_n(\lambda_{LOW}) - v_n(\hat{\lambda}_{LOW})\right] + \mathcal{E}(\lambda_{LOW}) \le \frac{\epsilon_0}{32} + \frac{8\alpha_0^2 p}{c} + \mathcal{E}(\lambda_{LOW}) = 2\mathcal{E}(\lambda_{LOW}) + \frac{16\alpha_0^2 p}{c}. \qquad \Box$$
B.4. Proof of Theorem 1 and Corollary 1.

Proof. From Lemma 1 it is clear that
$$\mathcal{E}(\hat{\lambda}) + \alpha\|\hat{\lambda}\|_1 \le \left[v_n(\lambda^*) - v_n(\hat{\lambda})\right] + \alpha\|\lambda^*\|_1 + \mathcal{E}(\lambda^*).$$
Since Condition 2 is satisfied, define $t = \frac{M^*}{M^* + \|\hat{\lambda} - \lambda^*\|_1}$ with $M^* = \frac{Q^*}{2\alpha_0}$, and set $\tilde{\lambda} = t\hat{\lambda} + (1-t)\lambda^*$. Exactly as in the proof of Lemma 2, we easily reach
$$\mathcal{E}(\tilde{\lambda}) + \alpha\|\tilde{\lambda}\|_1 \le \alpha_0 M^* + \alpha\|\lambda^*\|_1 + \mathcal{E}(\lambda^*).$$
Now for any $\lambda$ we have $\lambda = \lambda_{S^*} + \lambda_{S^{*C}}$, and specifically for $\lambda^*$: $\lambda^*_{S^*} = \lambda^*$ and $\lambda^*_{S^{*C}} = 0$. Thus
$$\mathcal{E}(\tilde{\lambda}) + \alpha\|\tilde{\lambda}_{S^*}\|_1 + \alpha\|\tilde{\lambda}_{S^{*C}}\|_1 \le \alpha_0 M^* + \alpha\|\lambda^*\|_1 + \mathcal{E}(\lambda^*),$$
so that
$$\mathcal{E}(\tilde{\lambda}) + \alpha\|\tilde{\lambda}_{S^{*C}}\|_1 \le \alpha_0 M^* + \alpha\|\lambda^*\|_1 - \alpha\|\tilde{\lambda}_{S^*}\|_1 + \mathcal{E}(\lambda^*).$$
Using the triangle inequality, and then $\alpha_0 M^* = Q^*/2$ together with the definition of $Q^*$, we get
$$\mathcal{E}(\tilde{\lambda}) + \alpha\|\tilde{\lambda}_{S^{*C}}\|_1 \le \alpha_0 M^* + \alpha\|\tilde{\lambda}_{S^*} - \lambda^*\|_1 + \mathcal{E}(\lambda^*) \le Q^* + \alpha\|\tilde{\lambda}_{S^*} - \lambda^*\|_1. \qquad (B.2)$$
Now we can see
$$\mathcal{E}(\tilde{\lambda}) + \alpha\|\tilde{\lambda}_{S^{*C}}\|_1 + \alpha\|\tilde{\lambda}_{S^*} - \lambda^*\|_1 \le Q^* + 2\alpha\|\tilde{\lambda}_{S^*} - \lambda^*\|_1,$$
that is,
$$\mathcal{E}(\tilde{\lambda}) + \alpha\|\tilde{\lambda} - \lambda^*\|_1 \le Q^* + 2\alpha\|\tilde{\lambda}_{S^*} - \lambda^*\|_1.$$
We need to bound the right-hand side; it is useful to discuss two cases.

(1) If $2\alpha\|\tilde{\lambda}_{S^*} - \lambda^*\|_1 \ge Q^*$: notice that from (B.2) we can see $\|\tilde{\lambda}_{S^{*C}} - \lambda^*_{S^{*C}}\|_1 = \|\tilde{\lambda}_{S^{*C}}\|_1 \le 3\|\tilde{\lambda}_{S^*} - \lambda^*_{S^*}\|_1$, which means we can translate the $\ell_1$ norm of $\tilde{\lambda}_{S^*} - \lambda^*$ into an $\ell_2$ norm using Condition 4: $\|\tilde{\lambda}_{S^*} - \lambda^*\|_1 \le \frac{\sqrt{s^*}\,\|\tilde{\lambda} - \lambda^*\|_2}{\phi^*}$. As a result,
$$\mathcal{E}(\tilde{\lambda}) + \alpha\|\tilde{\lambda} - \lambda^*\|_1 \le 4\alpha\|\tilde{\lambda}_{S^*} - \lambda^*\|_1 \le \frac{4\alpha\sqrt{s^*}}{\phi^*}\|\tilde{\lambda} - \lambda^*\|_2 \le \frac{4\alpha\sqrt{s^*}}{\phi^*}\left(\|\tilde{\lambda} - \lambda_0\|_2 + \|\lambda^* - \lambda_0\|_2\right).$$
Since we assume Condition 3 holds for all $\|\lambda - \lambda^*\|_1 \le M^*$, we can use the same trick as in Lemma 3 (the inequality $ab \le a^2/2 + b^2/2$) to bound the two terms:
$$\frac{4\alpha\sqrt{s^*}}{\phi^*}\|\tilde{\lambda} - \lambda_0\|_2 \le \frac{c\,\|\tilde{\lambda} - \lambda_0\|_2^2}{2} + \frac{8\alpha^2 s^*}{\phi^{*2} c} \le \frac{1}{2}\mathcal{E}(\tilde{\lambda}) + \frac{8\alpha^2 s^*}{\phi^{*2} c},$$
and similarly
$$\frac{4\alpha\sqrt{s^*}}{\phi^*}\|\lambda^* - \lambda_0\|_2 \le \frac{1}{2}\mathcal{E}(\lambda^*) + \frac{8\alpha^2 s^*}{\phi^{*2} c}.$$
So
$$\mathcal{E}(\tilde{\lambda}) + \alpha\|\tilde{\lambda} - \lambda^*\|_1 \le \frac{1}{2}\mathcal{E}(\tilde{\lambda}) + \frac{1}{2}\mathcal{E}(\lambda^*) + \frac{16\alpha^2 s^*}{\phi^{*2} c} = \frac{1}{2}\mathcal{E}(\tilde{\lambda}) + Q^*.$$
That is,
$$\frac{\mathcal{E}(\tilde{\lambda})}{2} + \alpha\|\tilde{\lambda} - \lambda^*\|_1 \le Q^* = 2\alpha_0 M^* \le \frac{1}{2}\alpha M^*,$$
where the last inequality comes from our assumption $\alpha_0 \le \alpha/8$. It is then straightforward to see that $\|\tilde{\lambda} - \lambda^*\|_1 \le \frac{1}{2}M^*$.

(2) If $2\alpha\|\tilde{\lambda}_{S^*} - \lambda^*\|_1 < Q^*$:
$$\mathcal{E}(\tilde{\lambda}) + \alpha\|\tilde{\lambda} - \lambda^*\|_1 \le Q^* + 2\alpha\|\tilde{\lambda}_{S^*} - \lambda^*\|_1 \le 2Q^* = 4\alpha_0 M^* \le \frac{1}{2}\alpha M^*,$$
where again the last inequality comes from our assumption $\alpha_0 \le \alpha/8$. So $\|\tilde{\lambda} - \lambda^*\|_1 \le \frac{1}{2}M^*$.

In both cases we have $\|\tilde{\lambda} - \lambda^*\|_1 \le \frac{1}{2}M^*$, which, given our choice of $t$, implies $\|\hat{\lambda} - \lambda^*\|_1 \le M^*$, i.e., $\hat{\lambda}$ itself is close enough to $\lambda^*$ to invoke Condition 2. Repeating the argument above with $\hat{\lambda}$ in place of $\tilde{\lambda}$, we get either
$$\frac{\mathcal{E}(\hat{\lambda})}{2} + \alpha\|\hat{\lambda} - \lambda^*\|_1 \le Q^*$$
or
$$\frac{\mathcal{E}(\hat{\lambda})}{2} + \alpha\|\hat{\lambda} - \lambda^*\|_1 \le 2Q^*.$$
Considering both cases, we conclude
$$\frac{\mathcal{E}(\hat{\lambda})}{2} + \alpha\|\hat{\lambda} - \lambda^*\|_1 \le 2Q^* = \mathcal{E}(\lambda^*) + \frac{32\alpha^2 s^*}{\phi^{*2} c}.$$
The derivation of Corollary 1 is straightforward and is therefore omitted. $\Box$
B.5. Proof of Theorem 2 and Corollary 2.

Proof. Under Condition 1, $W = E_Q[D] = \frac{E_P(D e^{\lambda_0' R})}{E_P(e^{\lambda_0' R})}$. Recall $\hat{W} = \frac{E_n(D e^{\hat{\lambda}' R})}{E_n(e^{\hat{\lambda}' R})}$. Therefore it suffices to show $E_n(e^{\hat{\lambda}' R}) \xrightarrow{p} E_P(e^{\lambda_0' R})$ for the denominator and $E_n(D e^{\hat{\lambda}' R}) \xrightarrow{p} E_P(D e^{\lambda_0' R})$ for the numerator. Without loss of generality, we show $E_n(D e^{\hat{\lambda}' R}) \xrightarrow{p} E_P(D e^{\lambda_0' R})$ for any $D$ satisfying assumptions (iii)–(v) of Theorem 2. Notice
$$\left| E_n(D e^{\hat{\lambda}' R}) - E_P(D e^{\lambda_0' R}) \right| \le |A| + |B| + |C|,$$
where $A = E_n(D e^{\hat{\lambda}' R}) - E_n(D e^{\lambda^{*\prime} R})$, $B = E_n(D e^{\lambda^{*\prime} R}) - E_P(D e^{\lambda^{*\prime} R})$ and $C = E_P(D e^{\lambda^{*\prime} R}) - E_P(D e^{\lambda_0' R})$.

By the weak law of large numbers, $E_n(D e^{\lambda^{*\prime} R}) \xrightarrow{p} E_P(D e^{\lambda^{*\prime} R})$ under assumptions (ii) and (iii) of Theorem 2, so $B = O_p\left(\frac{1}{\sqrt{n}}\right)$. Since also
$$E_P(D e^{\lambda^{*\prime} R}) - E_P(D e^{\lambda_0' R}) \le |\sup D|\, E_P\left[e^{\lambda^{*\prime} R} - e^{\lambda_0' R}\right],$$
under assumption (iii) of Theorem 2 and the condition $\mathcal{E}(\lambda^*) = o\left(\sqrt{\frac{\log p}{n}}\right)$ from Corollary 1, we have $C = o\left(\sqrt{\frac{\log p}{n}}\right)$.

Now we have to bound $A$. Since the sample objective function is continuously differentiable, a Taylor expansion gives
$$A = (\hat{\lambda} - \lambda^*)'\, E_n(D e^{\lambda^{*\prime} R} R) + (\hat{\lambda} - \lambda^*)' \left[E_n(D e^{\bar{\lambda}' R} RR')\right] (\hat{\lambda} - \lambda^*) \le |F| + |H|, \qquad (B.3)$$
where $\bar{\lambda}$ lies on the line segment joining $\lambda^*$ and $\hat{\lambda}$, $F = (\hat{\lambda} - \lambda^*)'\, E_n(D e^{\lambda^{*\prime} R} R)$ and $H = (\hat{\lambda} - \lambda^*)' \left[E_n(D e^{\bar{\lambda}' R} RR')\right] (\hat{\lambda} - \lambda^*)$.

First of all, from Theorem 1 we know $\|\hat{\lambda} - \lambda^*\|_1 = O_p\left(s^*\sqrt{\frac{\log p}{n}}\right)$. Notice that when $n$ is sufficiently large, $\bar{\lambda}$ can always be expressed as $\bar{\lambda} = \sum_{k=1}^{b+1} a_k w_k$ for some $a_k \ge 0$, $1 \le k \le b+1$, $b$ finite, with $\sum_{k=1}^{b+1} a_k = 1$, where the $w_k$, $1 \le k \le b+1$, all lie within a neighborhood $J$ of $\lambda^*$. Then
$$E_n\big(D e^{\bar{\lambda}' R} RR'\big) \le E_n\Big(\sum_k a_k D e^{w_k' R} RR'\Big) \xrightarrow{p} \sum_k a_k E_P\big(D e^{w_k' R} RR'\big) \le \sum_k E_P\big(D e^{w_k' R} RR'\big) < \infty,$$
where the first inequality follows from the convexity of the exponential function and the convergence uses the law of large numbers under assumption (iv) of Theorem 2. As a result, $|H| = O_p\left(\frac{s^{*2}\log p}{n}\right)$.

Furthermore,
$$E_n(D e^{\lambda^{*\prime} R} R) \xrightarrow{p} E_P(D e^{\lambda^{*\prime} R} R) \le |\sup D|\, E_P(e^{\lambda^{*\prime} R} R),$$
where the convergence uses the law of large numbers for iid sequences under assumption (v) of Theorem 2 and the inequality comes from assumption (iii) of Theorem 2. Notice $\lambda^*$ is the oracle, i.e., the solution to (4.1), so its score equation ensures $E_P(e^{\lambda^{*\prime} R} R) = 0$. This means
$$E_n(D e^{\lambda^{*\prime} R} R) = O_p\left(\frac{1}{\sqrt{n}}\right).$$
From here we can conclude that $|F|$ is of order $O_p\left(\frac{s^*\sqrt{\log p}}{n}\right)$.

Based on the arguments above, if $s^*\sqrt{\frac{\log p}{n}} = o(1)$, we have
$$E_n(D e^{\hat{\lambda}' R}) - E_P(D e^{\lambda_0' R}) = o_p(1).$$
Now letting $D = 1$, we immediately see
$$E_n\big(e^{\hat{\lambda}' R}\big) \xrightarrow{p} E_P\big(e^{\lambda_0' R}\big).$$
To see Corollary 2, simply let $D = I_B$, the indicator function of the Borel set $B$, which must be bounded as it can only take the values $0$ and $1$. $\Box$
B.6. Proof of Lemma 4.

Proof. This is immediate. As we have already shown in the proof of Theorem 2, $E_n(D e^{\hat{\lambda}' R}) - E_P(D e^{\lambda_0' R})$ can be decomposed into three parts, where $A$ is $O_p\left(\frac{s^{*2}\log p}{n}\right)$ and, further, under assumption (ii) of Theorem 3, $C$ is $o\left(\frac{\sqrt{\log p}}{n}\right)$. Then
$$\sqrt{n}\left[E_n(D e^{\hat{\lambda}' R}) - E_P(D e^{\lambda_0' R})\right] = \sqrt{n}\left[E_n(D e^{\lambda^{*\prime} R}) - E_P(D e^{\lambda^{*\prime} R})\right] + o_p(1), \qquad (B.6)$$
if we assume $\frac{s^{*2}\log p}{\sqrt{n}} = o(1)$, which implies that the asymptotic distribution is determined by the first term on the right-hand side of (B.6). We only need to verify that it satisfies the Lindeberg condition, which is ensured by the iid assumption and assumption (iii) of Theorem 3. $\Box$
B.7. Proof of Theorem 3.

Proof. From the proof of Lemma 4 we can see
$$\begin{aligned}
\sqrt{n}(\hat{W} - W) &= \sqrt{n}\left(\frac{M_n(D)}{M_n(1)} - \frac{M(D)}{M(1)}\right) \\
&= \sqrt{n}\left(\frac{M_n(D)}{M(1)} - \frac{M(D)\,M_n(1)}{M^2(1)}\right) + \sqrt{n}\left(\frac{M_n(D)}{M(1)} - \frac{M(D)\,M_n(1)}{M^2(1)}\right)\left(\frac{M(1)}{M_n(1)} - 1\right) \\
&= \sqrt{n}\left(\frac{M_n(D)}{M(1)} - \frac{M(D)\,M_n(1)}{M^2(1)}\right) + o_p(1), \qquad (B.7)
\end{aligned}$$
where the last equality in (B.7) comes from the fact that $M_n(1) = M(1) + O_p\left(\frac{s^{*2}\log p}{n}\right)$. Thus the asymptotic distribution of $\hat{W}$ is dominated by the first term on the right-hand side of the last line of (B.7). Observe
$$\frac{M_n(D)}{M(1)} - \frac{M(D)\,M_n(1)}{M^2(1)} = M_n\left(\frac{D}{M(1)} - \frac{M(D)/M(1)}{M(1)}\right) = M_n\left(\frac{D - W}{M(1)}\right) = M_n(T),$$
where $T = \frac{D - W}{M(1)}$. Observe also, trivially,
$$M[T] = M\left[\frac{1}{M(1)}\left(D - \frac{M(D)}{M(1)}\right)\right] = \frac{1}{M(1)}\left\{M(D) - M(1)\,\frac{M(D)}{M(1)}\right\} = 0.$$
So
$$\sqrt{n}\left(\frac{M_n(D)}{M(1)} - \frac{M(D)\,M_n(1)}{M^2(1)}\right) = \sqrt{n}\left[M_n(T) - M(T)\right] \xrightarrow{d} N(0, \mathbb{V}),$$
where $\mathbb{V} = \mathrm{Var}(T e^{\lambda^{*\prime} R})$. $\Box$
Appendix C. Sufficient conditions

Since Conditions 2 and 5 are essentially the same, here we only give a sufficient condition for Condition 2. The main idea is to use probability inequalities to show how close the theoretical expectation and the empirical expectation can be.

C.1. Sufficient condition for Condition 2. The probability inequalities behind Condition 2 rely on the loss function being Lipschitz. Since we only need Condition 2 to hold locally around $\lambda^*$, it suffices for the loss function to be locally Lipschitz, which is the case for our loss function, the exponential function.

Recall $Z_M = \sup_{\|\lambda - \lambda^*\|_1 \le M} |v_n(\lambda) - v_n(\lambda^*)|$ for some $M > 0$. Along with Condition 1, we make the following two additional assumptions:
(i): Asset returns have bounded support, that is, $\sup R_p < \infty$, where $R_p$ represents the vector of asset returns of length $p$.
(ii): Given a fixed sample of asset returns, we normalize each asset by dividing by its standard deviation, that is,
$$\tilde{R}_{p,j}(i) = \frac{R_{p,j}(i)}{\sqrt{E_n R_{p,j}^2(i)}},$$
where $R_{p,j}(i)$ stands for the $i$-th observation of the $j$-th component of the vector of asset returns of length $p$. We denote the normalized $R_p$ by $\tilde{R}_p$; obviously $E_n \tilde{R}_{p,j}^2 = 1$. Then we have
$$E_P Z_M \le 8ML\sqrt{\frac{2\log(2p)}{n}}\, \max_{1 \le j \le p}\sqrt{E_n \tilde{R}_{p,j}^2} \le 8ML\sqrt{\frac{2\log(2p)}{n}}.$$
Moreover,
$$\Pr\left\{Z_M \le 2ML\left(4\sqrt{\frac{2\log(2p)}{n}} + \sqrt{\frac{2t}{n}}\right)\right\} \ge 1 - e^{-t},$$
where
$$L = \sup_{\|\lambda - \lambda^*\|_1 \le M} e^{\lambda' R} + 1.$$

Proof. The result is a direct application of Lemma 14.20 and Theorem 14.2 in Bühlmann and Van de Geer (2011, Chapter 14, page 502). The only task is to show that our loss function is locally Lipschitz with Lipschitz constant $L$. That is, we have to show
$$\left| e^{\lambda' R} - e^{\lambda^{*\prime} R} \right| \le L\left| \lambda' R - \lambda^{*\prime} R \right|$$
for all $R$ and all $\lambda$ such that $\|\lambda - \lambda^*\|_1 \le M$. By a Taylor expansion we can see
$$e^{\lambda' R} - e^{\lambda^{*\prime} R} = e^{\kappa}\left(\lambda' R - \lambda^{*\prime} R\right)$$
for some $\kappa \in \mathbb{R}$ between $\lambda' R$ and $\lambda^{*\prime} R$. Then, by Hölder's inequality, we have
$$\lambda' R \le \|\lambda\|_1 \sup R_p.$$
Under the assumption $\sup R_p < \infty$, we can choose the Lipschitz constant $L$ as
$$L = \sup_{\|\lambda - \lambda^*\|_1 \le M} e^{\lambda' R} + 1.$$
In practice we can take $\sup R_p = \max_{1 \le i \le n;\, 1 \le j \le p} R_{p,j}(i)$, the largest in-sample individual asset return. $\Box$
Appendix D. Simulation Results

Table 1. Specification 1: s = 4, p − s = 100; active assets' expected return |μi| = 1, i = 1, 2, 3, 4; ‖λ0‖₁ = 4.

                  (1) Prob. incl. λ0   (2) E_N ‖λ̂‖₁   (3) E_N ‖λ̂ − λ0‖₁   (4) MSE of E_n e^{λ̂'R}
n = 50,  N = 50        0.940               4.487            2.474                0.007
         N = 200       0.965               4.791            2.382                0.01
         N = 500       0.938               4.709            2.508                0.009
         N = 1000      0.957               4.710            2.410                0.009
n = 100, N = 50        0.98                4.868            2.095                0.007
         N = 200       1                   4.833            2.077                0.007
         N = 500       0.998               5.072            2.19                 0.007
         N = 1000      0.995               5.026            2.202                0.007
n = 500, N = 50        1                   5.634            2.05                 0.004
         N = 200       1                   5.046            1.78                 0.003

Table 2. Specification 2: s = 4, p − s = 100; active assets' expected return |μi| = 0.5, i = 1, 2, 3, 4; ‖λ0‖₁ = 2.

                  (1) Prob. incl. λ0   (2) E_N ‖λ̂‖₁   (3) E_N ‖λ̂ − λ0‖₁   (4) MSE of E_n e^{λ̂'R}
n = 50,  N = 50        0.48                2.377            2.317                0.061
         N = 200       0.545               2.201            2.057                0.05
         N = 500       0.628               2.281            2.057                0.053
         N = 1000      0.614               2.32             2.136                0.058
n = 100, N = 50        0.92                1.903            1.343                0.015
         N = 200       0.98                2.152            1.451                0.022
         N = 500       0.972               2.398            1.574                0.029
         N = 1000      0.96                2.357            1.535                0.027
n = 500, N = 50        1                   2.034            0.675                0.003
         N = 200       1                   2.200            0.773                0.003

Table 3. Specification 3: s = 4, p − s = 100; active assets' expected return |μi| = 0.1, i = 1, 2, 3, 4; ‖λ0‖₁ = 0.4.

                  (1) Prob. incl. λ0   (2) E_N ‖λ̂‖₁   (3) E_N ‖λ̂ − λ0‖₁   (4) MSE of E_n e^{λ̂'R}
n = 50,  N = 50        0                   0.612            0.946                0.046
         N = 200       0                   0.48             0.836                0.036
         N = 500       0                   0.464            0.819                0.032
         N = 1000      0                   0.492            0.847                0.035
n = 100, N = 50        0                   0.301            0.642                0.008
         N = 200       0                   0.323            0.669                0.009
         N = 500       0.004               0.336            0.667                0.009
         N = 1000      0.004               0.352            0.686                0.01
n = 500, N = 50        0.12                0.354            0.531                0.001
         N = 200       0.25                0.364            0.519                0.001
[Figure D.1 about here: nine density plots of $E_n e^{\hat{\lambda}' R}$, one panel for each combination of $\mu = \pm 1, \pm 0.5, \pm 0.1$ (rows) and $n = 50, 100, 500$ observations (columns).]

Figure D.1. Empirical distributions of $E_n e^{\hat{\lambda}' R}$, $N = 10000$; the red curve represents the distribution it should have.
References
Ai, C. and Chen, X. (2003). Efficient estimation of models with conditional moment restrictions
containing unknown functions. Econometrica, 71(6):1795–1843.
Almeida, C. and Garcia, R. (2012). Assessing misspecified asset pricing models with empirical
likelihood estimators. Journal of Econometrics, 170(2):519–537.
Alvarez, F. and Jermann, U. J. (2005). Using asset prices to measure the persistence of the
marginal utility of wealth. Econometrica, 73(6):1977–2016.
Backus, D., Boyarchenko, N., and Chernov, M. (2015). Term structures of asset prices and
returns. Working Paper.
Backus, D., Chernov, M., and Martin, I. (2011). Disasters implied by equity index options.
Journal of Finance, 66(6):1969–2012.
Backus, D., Chernov, M., and Zin, S. (2014). Sources of entropy in representative agent models.
Journal of Finance, 69(1):51–99.
Bansal, R. and Yaron, A. (2004). Risks for the long run: a potential resolution of asset pricing
puzzles. Journal of Finance, 59(4):1481–1509.
Belloni, A., Chernozhukov, V., and Hansen, C. (2011). Inference for high-dimensional sparse
econometric models. arXiv.org, pages 1–41.
Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and dantzig
selector. Annals of Statistics, 37(4):1705–1732.
Bühlmann, P. and Van de Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications. Springer.
Campbell, J. Y. and Cochrane, J. H. (1999). By force of habit: a consumption-based explanation
of aggregate stock market behavior. Journal of Political Economy, 107(2).
Chang, J., Chen, S. X., and Chen, X. (2015). High dimensional generalized empirical likelihood
for moment restrictions with dependent data. Journal of Econometrics, 185(1):283–304.
Chen, S. X., Peng, L., and Qin, Y. L. (2009). Effects of data dimension on empirical likelihood.
Biometrika, 96(3):711–722.
Chen, X., Favilukis, J., and Ludvigson, S. C. (2013). An estimation of economic models with
recursive preferences. Quantitative Economics, 4(1):39–83.
Chen, X. and Ludvigson, S. C. (2009). Land of addicts? An empirical investigation of habit-based
asset pricing models. Journal of Applied Econometrics, 24(7):1057–1093.
Chen, X. and Pouzo, D. (2009). Efficient estimation of semiparametric conditional moment
models with possibly nonsmooth residuals. Journal of Econometrics, 152(1):46–60.
Christensen, T. M. (2016). Nonparametric stochastic discount factor decomposition. Working
Paper, pages 1–53.
Csiszar, I. (1975). I-divergence geometry of probability distributions and minimization problems. Annals of Probability, 3(1):146–158.
Csiszar, I. (1984). Sanov property, generalized I-projection and a conditional limit theorem. The
Annals of Probability, 12(3):768–793.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle
properties. Journal of the American Statistical Association, 96(456):1348–1360.
Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of
parameters. Annals of Statistics, 32(3):928–961.
Gagliardini, P., Gourieroux, C., and Renault, E. (2011). Efficient derivative pricing by the
extended method of moments. Econometrica, 79(4):1181–1232.
Ghosh, A., Julliard, C., and Taylor, A. P. (2015). An information-based one-factor asset pricing model. Working paper.
Ghosh, A., Julliard, C., and Taylor, A. P. (2016). What is the consumption-CAPM missing?
An information-theoretic framework for the analysis of asset pricing models. The Review of
Financial Studies.
Haberman, S. J. (1984). Adjustment by minimum discriminant information. The Annals of
Statistics, 12(3):971–988.
Hall, P. and Horowitz, J. L. (1996). Bootstrap critical values for tests based on generalized-method-of-moments estimators. Econometrica: Journal of the Econometric Society, pages 891–916.
Hjort, N. L., McKeague, I. W., and Van Keilegom, I. (2009). Extending the scope of empirical
likelihood. The Annals of Statistics, pages 1079–1111.
Jagannathan, R. and Wang, Z. (1996). The conditional CAPM and the cross-section of expected returns. Journal of Finance, pages 3–53.
Kitamura, Y. and Stutzer, M. (1997). An information-theoretic alternative to generalized method
of moments estimation. Econometrica, 65(4):861–874.
Kitamura, Y. and Stutzer, M. (2002). Connections between entropic and linear projections in
asset pricing estimation. Journal of Econometrics, 107(1-2):159–174.
Lahiri, S. and Mukhopadhyay, S. (2012). A penalized empirical likelihood method in high dimensions. The Annals of Statistics, 40(5):2511–2540.
Leng, C. and Tang, C. Y. (2012). Penalized empirical likelihood and growing dimensional general
estimating equations. Biometrika, 99(3):703–716.
Mehra, R. and Prescott, E. C. (1985). The equity premium: a puzzle. Journal of monetary
Economics, 15(2):145–161.
Newey, W. K. and Smith, R. J. (2004). Higher order properties of GMM and generalized empirical
likelihood estimators. Econometrica, 72(1):219–255.
Owen, A. B. (2001). Empirical likelihood. CRC press.
Piazzesi, M., Schneider, M., and Tuzel, S. (2007). Housing, consumption and asset pricing.
Journal of Financial Economics, 83(3):531–569.
Qin, J. and Lawless, J. (1994). Empirical likelihood and general estimating equations. The
Annals of Statistics, pages 300–325.
Robinson, P. M. (1991). Consistent nonparametric entropy-based testing. The Review of Economic Studies, 58(3):437–453.
Schennach, S. M. (2014). Entropic latent variable integration via simulation. Econometrica,
82(1):345–385.
Schneider, P. and Trojani, F. (2016). (Almost) model-free recovery. Working Paper.
Tang, C. Y. and Leng, C. (2010). Penalized high-dimensional empirical likelihood. Biometrika,
97(4):905–920.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society B, 58(1):267–288.
Tsao, M. (2004). Bounds on coverage probabilities of the empirical likelihood ratio confidence
regions. Annals of Statistics, 32(3):1215–1221.
Van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal
confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–
1202.
Van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. Annals of
Statistics, 36(2):614–645.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The
Annals of statistics, pages 894–942.
Department of Economics, London School of Economics, Houghton Street, London, WC2A
2AE, UK.
E-mail address: [email protected]
Department of Economics, London School of Economics, Houghton Street, London, WC2A
2AE, UK.
E-mail address: [email protected]