Supervised learning for computer vision:
Theory and algorithms - Part I
Francis Bach (1) & Jean-Yves Audibert (2,1)
(1) INRIA - Ecole Normale Supérieure
(2) ENPC

ECCV Tutorial - Marseille, 2008
Outline
• Probabilistic model
• Local averaging algorithms
  – Link between binary classification and regression
  – k-Nearest Neighbors
  – Kernel estimate
  – Partitioning estimate
• Empirical risk minimization and variants
  – Neural networks
  – Convexification in binary classification
  – Support Vector Machines
  – Boosting
Probabilistic model
• Training data = n input-output pairs:
  (X1, Y1), . . . , (Xn, Yn)  i.i.d. from some unknown distribution P
• A new input X comes.
• Goal: predict the corresponding output Y.
• Probabilistic assumption: (X, Y) is another independent realization of P.
Some typical examples
• Computer vision
  – object recognition: X = an image, Y = +1 if the image contains the object, Y = 0 otherwise
• Textual documents
  – X = an e-mail, Y = spam vs. non-spam
• Insurance
  – X = data of a future policy holder, Y = premium
• Finance
  – X = data of a loan applicant, Y = loan rate
  – X = data of a company, Y = buy or sell
Measuring the quality of prediction (1/2)
• ℓ(y, ŷ) = the loss incurred by predicting ŷ while the true output is y
• Typical losses:
  – the p-power loss for real outputs: ℓ(y, ŷ) = |y − ŷ|^p
  – the classification loss for discrete outputs (e.g. in {0, 1}): ℓ(y, ŷ) = 1_{y ≠ ŷ}
Measuring the quality of prediction (2/2)
• A prediction function = a mapping from the input space to the output space:
  f : X ↦ f(X)
• Quality of a prediction function f:
  Risk of f = R(f) = E ℓ[Y, f(X)]
• The best prediction function (= Bayes predictor):
  f* = argmin_f R(f)
Noise in classification: R(f) = E 1_{Y ≠ f(X)} = P(Y ≠ f(X))

[Figure: class-conditional densities of the input for Y = 1 and Y = 0.
 Left: the two densities do not overlap ⇒ R(f*) = 0.
 Right: the two densities overlap ⇒ R(f*) > 0.]
Bayes predictors for typical losses
• In classification (when ℓ(y, ŷ) = 1_{y ≠ ŷ}):
  f*_cla(x) = argmax_y P(Y = y | X = x)
• In least squares regression (when ℓ(y, ŷ) = (y − ŷ)^2):
  f*_reg(x) = E(Y | X = x)
What is formally a supervised learning algorithm?
• An estimator of the unobservable f*
• An algorithm = a mapping from a training sample to a prediction function:
  f̂ : T = {(X1, Y1), . . . , (Xn, Yn)} ↦ f̂_T
• Quality of an algorithm for training samples of size n:
  E_T R(f̂_T),
  where the expectation is taken with respect to the distribution of the training sample.
Uniformly universal consistency
• An algorithm is uniformly universally consistent if we have
  lim_n sup_P { E_T R(f̂_T) − R(f*) } = 0
• Bad news: uniformly universally consistent algorithms do not exist
  [Devroye (1982); Audibert (2008)]
• Practical meaning: you will never know beforehand how much data
  is required to reach a predefined accuracy
Universal consistency [Stone (1977)]
• An algorithm is universally consistent if, for any P generating the data, we have
  E_T R(f̂_T) −→ R(f*) as n → +∞,
  in other words:
  sup_P lim_n { E_T R(f̂_T) − R(f*) } = 0
  ( ≠ uniform universal consistency: lim_n sup_P { E_T R(f̂_T) − R(f*) } = 0 )
• Good news: universally consistent algorithms do exist
• Practical meaning: for a sufficiently large amount of data, you will
  reach any desired accuracy
What should we expect from a good supervised
learning algorithm?
• its universal consistency:
  sup_P lim_n { E_T R(f̂_T) − R(f*) } = 0
• a locally uniform universal consistency:
  sup_{P ∈ 𝒫} { E_T R(f̂_T) − R(f*) } goes to 0 fast (typically in 1/n^γ),
  for 𝒫 a known class of distributions in which (we hope/know that)
  the unknown distribution P is.
Outline
• Probabilistic model
• Local averaging algorithms
  – Link between binary classification and regression
  – k-Nearest Neighbors
  – Kernel estimate
  – Partitioning estimate
• Empirical risk minimization and variants
  – Neural networks
  – Convexification in binary classification
  – Support Vector Machines
  – Boosting
Link between binary classification and regression
Y ∈ {0, 1}
f*_reg(x) = P(Y = 1 | X = x)
⇒ 1_{f*_reg(x) ≥ 1/2} = argmax_y P(Y = y | X = x) = f*_cla(x)

Theorem: for f_reg a real-valued function defined on the input space and f_cla = 1_{f_reg ≥ 1/2},
  R_cla(f_cla) − R_cla(f*_cla) ≤ 2 √( R_reg(f_reg) − R_reg(f*_reg) )

Corollary: f̂_reg universally consistent ⇒ f̂_cla = 1_{f̂_reg ≥ 1/2} universally consistent
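The corollary suggests a concrete recipe: estimate the regression function and threshold it at 1/2. A minimal Python sketch of this plug-in step (the logistic f_reg below is just an assumed stand-in for some regression estimate, not something specified by the slides):

import numpy as np

def regression_to_classifier(f_reg):
    # Turn a real-valued estimate of P(Y = 1 | X = x) into a {0, 1}-valued
    # classifier by thresholding at 1/2, as in the corollary above.
    return lambda x: (f_reg(x) >= 0.5).astype(int)

# Toy check with a made-up regression estimate (an assumption for illustration).
f_reg = lambda x: 1.0 / (1.0 + np.exp(-x))
f_cla = regression_to_classifier(f_reg)
print(f_cla(np.array([-2.0, 0.3, 1.5])))   # -> [0 1 1]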
Local averaging methods [Györfi et al. (2004)]
Context: Y ∈ R,  ℓ(y, y') = (y − y')^2
• Recall: f*(x) = E(Y | X = x) is unknown,
  but (X1, Y1), . . . , (Xn, Yn) are observed
• Implementation: for an input x, predict the average of the Yi whose Xi are close to x:
  f̂ : x ↦ Σ_{i=1}^n Wi(x) Yi,
  with Wi(x) appropriate functions of x, n, X1, . . . , Xn.
Stone’s Theorem [Stone (1977)]:
sufficient conditions for universal consistency
Assume that the weights Wi satisfy, for any distribution P:
1. ∀ε > 0,  P( |Σ_{i=1}^n Wi(X) − 1| > ε ) −→ 0 as n → +∞
2. ∀a > 0,  E[ Σ_{i=1}^n |Wi(X)| 1_{||Xi − X|| > a} ] −→ 0 as n → +∞
3. E[ Σ_{i=1}^n Wi(X)^2 ] −→ 0 as n → +∞
4. + two technical assumptions
Then f̂ : x ↦ Σ_{i=1}^n Wi(x) Yi is universally consistent.
First example: the k-Nearest Neighbors
  Wi(x) = 1/k if Xi belongs to the k nearest neighbors of x among X1, . . . , Xn, and 0 otherwise
e.g. for k = 1: if Xi is the nearest neighbor of x, then f̂_1(x) = Yi. More generally:
  f̂_k(x) = (1/k) Σ_{j=1}^k Y_{i_j}
Universal consistency [Stone (1977)]:
The kn-N.N. estimate is universally consistent iff kn → +∞ and kn/n → 0.
• The nearest neighbor (k = 1) algorithm is not universally consistent
  [Cover and Hart (1967)].
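A minimal Python sketch of f̂_k with the naive O(n) distance computation (the synthetic data at the end is an assumed example, not from the slides):

import numpy as np

def knn_estimate(x, X, Y, k):
    # f_hat_k(x) = average of the Y_i over the k training points X_i
    # closest to x in Euclidean distance; naive O(n) search.
    dists = np.linalg.norm(X - x, axis=1)
    idx = np.argsort(dists)[:k]
    return Y[idx].mean()

# Toy usage on assumed synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)
print(knn_estimate(np.array([0.2, -0.3]), X, Y, k=10))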
Using k-N.N.
• Requires full storage of the training points
• Naive implementation: O(n) per query
• Refined implementation using trees: O(log n) at test time (but
  O(n log n) for building the tree)
  ( http://www.cs.umd.edu/~mount/ANN/ )
• How to choose k? Answer: by cross-validation, i.e. take the k
  minimizing the risk estimate of f̂_k given by
  (1/n) Σ_{j=1}^p Σ_{(x,y)∈Bj} [ y − f̂_k(∪_{l≠j} Bl)(x) ]^2,
  where B1, ..., Bp is a partition of the training sample:
  {(X1, Y1), . . . , (Xn, Yn)} = B1 ⊔ ... ⊔ Bp.
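A cross-validation sketch in Python along the lines of the formula above: split the sample into folds B_1, ..., B_p, predict each held-out point with the k-N.N. rule built on the other folds, and pick the k with the smallest averaged squared error (the fold construction and the toy data are assumptions):

import numpy as np

def cv_risk_knn(X, Y, k, n_folds=5, seed=0):
    # Cross-validated squared-error estimate for the k-N.N. rule.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(Y))
    folds = np.array_split(perm, n_folds)
    errs = []
    for fold in folds:
        train = np.setdiff1d(perm, fold)          # union of the other folds
        for i in fold:
            d = np.linalg.norm(X[train] - X[i], axis=1)
            pred = Y[train][np.argsort(d)[:k]].mean()
            errs.append((Y[i] - pred) ** 2)
    return np.mean(errs)

# Pick k by minimizing the cross-validated risk on assumed toy data.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)
best_k = min(range(1, 31), key=lambda k: cv_risk_knn(X, Y, k))
print("selected k:", best_k)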
Second example: kernel estimate [Nadaraya (1964);
Watson (1964)]
X ∈ R^d,  h > 0,  K : R^d → R

  f̂(x) = Σ_{i=1}^n [ K((x − Xi)/h) / Σ_{l=1}^n K((x − Xl)/h) ] Yi

Universal consistency [Devroye and Wagner (1980); Spiegelman and Sacks (1980)]:
Let B(0, u) be the Euclidean ball in R^d of radius u > 0. If there are 0 < a ≤ A and b > 0 such that
  b 1_{B(0,a)} ≤ K(u) ≤ 1_{B(0,A)}   for all u ∈ R^d,
and if hn −→ 0 and n hn^d −→ +∞ as n → +∞, then f̂ is universally consistent.
[Figure: a kernel profile K squeezed between b 1_{B(0,a)} and 1_{B(0,A)}, with marks at −A, −a, a, A.]
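A Python sketch of the Nadaraya-Watson estimate. The default Gaussian profile below is a common practical choice but does not satisfy the compactly supported bounds of the consistency result above; it is only an assumed example, as is the toy data.

import numpy as np

def kernel_estimate(x, X, Y, h, K=lambda u: np.exp(-0.5 * u ** 2)):
    # f_hat(x) = sum_i K((x - X_i)/h) Y_i / sum_l K((x - X_l)/h),
    # with K applied here to the Euclidean norm of (x - X_i)/h.
    u = np.linalg.norm((X - x) / h, axis=1)
    w = K(u)
    if w.sum() == 0:          # no training point effectively contributes
        return 0.0
    return (w * Y).sum() / w.sum()

# Toy usage on assumed synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
Y = np.cos(4 * X[:, 0]) + 0.1 * rng.standard_normal(200)
print(kernel_estimate(np.array([0.0]), X, Y, h=0.1))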
Partitioning estimate [Tukey (1947)]
X ∈ [0, 1]^d = 𝒳1 ⊔ · · · ⊔ 𝒳p   (a partition into cells)

  f̂(x) = Σ_{i=1}^n [ 1_{Xi ∈ 𝒳_{j(x)}} / Σ_{l=1}^n 1_{Xl ∈ 𝒳_{j(x)}} ] Yi,

with j(x) such that x ∈ 𝒳_{j(x)}.

Universal consistency [Györfi (1991)]:
Let Diam(𝒳j) = sup_{x1,x2 ∈ 𝒳j} ||x1 − x2||.
If p/n −→ 0 and max_j Diam(𝒳j) −→ 0 as n → +∞, then f̂ is universally consistent.
• Meaning for a regular grid of width hn: n hn^d → +∞ and hn → 0
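A Python sketch of the partitioning estimate on the regular grid of [0, 1]^d with cell width h (the empty-cell convention of returning 0 and the toy data are assumptions):

import numpy as np

def grid_partition_estimate(x, X, Y, h):
    # Predict the average of the Y_i whose X_i fall in the same grid cell
    # as x; cells are the cubes of width h tiling [0, 1]^d.
    cell = np.floor(x / h)
    in_cell = np.all(np.floor(X / h) == cell, axis=1)
    if not in_cell.any():
        return 0.0
    return Y[in_cell].mean()

# Toy usage on assumed synthetic data in [0, 1]^2.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
Y = X[:, 0] + X[:, 1] + 0.05 * rng.standard_normal(500)
print(grid_partition_estimate(np.array([0.25, 0.75]), X, Y, h=0.1))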
Using the partitioning estimate
• Fast and simple but ...
• border effects: “Mind the gap!”
• nonobvious choice of the partition
• Variants of the partitioning estimate: decision trees with partitions
built from the training data ...
Outline
• Probabilistic model
• Local averaging algorithms
  – Link between binary classification and regression
  – k-Nearest Neighbors
  – Kernel estimate
  – Partitioning estimate
• Empirical risk minimization and variants
  – Neural networks
  – Convexification in binary classification
  – Support Vector Machines
  – Boosting
Empirical risk minimization
• R(f) = E ℓ(Y, f(X)) : unobservable
• Empirical risk: r(f) = (1/n) Σ_{i=1}^n ℓ(Yi, f(Xi)) : observable
• Law of large numbers and central limit theorem:
  r(f) −→ R(f) a.s. as n → +∞
  √n [ r(f) − R(f) ] −→ N( 0, Var ℓ[Y, f(X)] ) in distribution as n → +∞
• Goal of learning: predict as well as f* = argmin_f R(f)
• A “natural” algorithm is therefore: f̂_ERM ∈ argmin_f r(f)
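A small Python sketch of the empirical risk r(f) and of the law-of-large-numbers statement above, on an assumed toy distribution where the Bayes predictor is known:

import numpy as np

def empirical_risk(f, X, Y, loss=lambda y, yhat: (y - yhat) ** 2):
    # r(f) = (1/n) sum_i loss(Y_i, f(X_i)); squared loss by default.
    preds = np.array([f(x) for x in X])
    return np.mean(loss(Y, preds))

# Assumed toy distribution: Y = X^2 + noise, so f*(x) = x^2 and R(f*) = 0.01.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(10000, 1))
Y = X[:, 0] ** 2 + 0.1 * rng.standard_normal(10000)
f_star = lambda x: x[0] ** 2
print(empirical_risk(f_star, X, Y))   # close to the noise variance 0.01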
Natural choice does not work
  f̂_ERM ∈ argmin_f r(f)
• There is an infinity of minimizers. Most of them will not perform
  well on test data −→ overfitting

[Figure: training points (Xi, Yi) and an empirical-risk minimizer that interpolates them, illustrating overfitting.]
Why is it not working?
[Figure: the risk R and the empirical risk r plotted over the prediction function space, with f̂_ERM and f* marked.]

R(f̂_ERM) − R(f*) = R(f̂_ERM) − r(f̂_ERM) + r(f̂_ERM) − r(f*) + r(f*) − R(f*)
                 ≤ sup_f { R(f) − r(f) } + 0 + O(1/√n)

∀f, R(f) − r(f) = O(1/√n)   does not imply   sup_f { R(f) − r(f) } −→ 0 as n → +∞
f̂_ERM ∈ argmin_{f ∈ F} r(f), with F appropriately chosen
• Choice of F? Introduce f̃ ∈ argmin_{f ∈ F} R(f):
  R(f̂_ERM) − R(f*) = [ R(f̂_ERM) − R(f̃) ] + [ R(f̃) − R(f*) ]
                       (estimation error)     (approximation error)
  R(f̂_ERM) − R(f̃) = R(f̂_ERM) − r(f̂_ERM) + r(f̂_ERM) − r(f̃) + r(f̃) − R(f̃)
                   ≤ sup_{f ∈ F} { R(f) − r(f) } + 0 + O(1/√n)
• F should be small enough to ensure sup_{f ∈ F} { R(f) − r(f) } −→ 0 as n → +∞
• F should be large enough to ensure R(f̃) − R(f*) −→ 0 as n → +∞
First example: “neural networks” [Rosenblatt (1958, 1962)]
• Squashing function σ: a nondecreasing function with
  σ(x) −→ 0 as x → −∞ and σ(x) −→ 1 as x → +∞,
  e.g. σ(x) = 1_{x ≥ 0} or σ(x) = 1/(1 + e^{−x})
• (Artificial) neuron: a function defined on R^d by
  g(x) = σ( Σ_{j=1}^d aj x^(j) + a0 ) = σ(a · x̃),
  where a = (a0, . . . , ad)^T and x̃ = (1, x^(1), . . . , x^(d))^T
• Neural network with one hidden layer: a function defined on R^d by
  f(x) = Σ_{j=1}^k cj σ(aj · x̃) + c0,
  where x̃ = (1, x^(1), . . . , x^(d))^T.

[Figure: network diagram, with input x̃ = (1, x^(1), . . . , x^(d)), k hidden units computing σ(aj · x̃), and output c0 + Σ_{j=1}^k cj σ(aj · x̃).]
[Figure: the same one-hidden-layer network diagram.]

Universal consistency in the least squares setting [Lugosi and Zeger (1995); Faragó and Lugosi (1993)]:
Let (kn) be an integer sequence and (βn) a real sequence, and let
  Fn = { neural networks with one hidden layer, k ≤ kn and Σ_{j=0}^k |cj| ≤ βn }.
ERM on Fn is universally consistent if kn → +∞, βn → +∞ and
  kn βn^4 log(kn βn^2) / n −→ 0 as n → +∞.
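A minimal forward-pass sketch in Python of the one-hidden-layer class Fn (random weights and a logistic squashing function; no training procedure is implied by the slides here):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def one_hidden_layer_net(x, A, c):
    # f(x) = c_0 + sum_{j=1}^k c_j * sigma(a_j . x_tilde), x_tilde = (1, x).
    # A is (k, d+1) with rows a_j; c = (c_0, c_1, ..., c_k).
    x_tilde = np.concatenate(([1.0], x))
    hidden = sigmoid(A @ x_tilde)
    return c[0] + c[1:] @ hidden

# Toy usage with assumed random weights.
rng = np.random.default_rng(0)
d, k = 3, 5
A = rng.standard_normal((k, d + 1))
c = rng.standard_normal(k + 1)
print(one_hidden_layer_net(np.array([0.2, -1.0, 0.5]), A, c))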
Using neural networks
• In practice: use of multilayer neural nets
• Squashing function ⇒ ERM is a nonconvex optimization problem
  ⇒ any algorithm will end up in a local minimum
• With good intuitions on how to build the neural nets and good
  heuristics to perform the minimization [LeCun et al. (1998); LeCun
  (2005); Simard et al. (2003)], neural nets are great...
Convexification of empirical risk minimization in
binary classification
Y ∈ {−1, +1}
• ERM: ĝ ∈ argmin_{g ∈ G} Σ_{i=1}^n 1_{Yi ≠ g(Xi)},   with R(g) = P[Y ≠ g(X)]
  −→ highly nonconvex
• Take f a real-valued function and g : x ↦ sign[f(x)]:
  −→ f̂ ∈ argmin_{f ∈ F} Σ_{i=1}^n 1_{Yi f(Xi) ≤ 0},   F convex
  −→ f̂ ∈ argmin_{f ∈ F} Σ_{i=1}^n φ[Yi f(Xi)],        φ convex
Criterion to choose the convex function φ
• φ-risk of f: A(f) = E φ[Y f(X)]
• φ should satisfy:
  f̂ universally consistent for the φ-risk ⇒ sign(f̂) universally consistent for the classification risk
• Necessary and sufficient condition [Bartlett et al. (2006)]:
  φ is differentiable at 0 and φ'(0) < 0
Some convex functions useful for classification and
their remarkable property
Let f* be the best function for the φ-risk and f a real-valued function.
• φ(u) = (1 − u)+ = max(1 − u, 0): S.V.M. loss
  R[sign(f)] − R(g*) ≤ A(f) − A(f*)
• φ(u) = e^{−u}: AdaBoost loss
  R[sign(f)] − R(g*) ≤ √2 √( A(f) − A(f*) )
• φ(u) = log(1 + e^{−u}): logistic regression loss
  R[sign(f)] − R(g*) ≤ √2 √( A(f) − A(f*) )
• φ(u) = (1 − u)^2: least squares regression loss
  R[sign(f)] − R(g*) ≤ √( A(f) − A(f*) )
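A short Python sketch of the four surrogates, together with a finite-difference check of the Bartlett et al. condition (differentiability at 0 with φ'(0) < 0); the step size is an arbitrary assumption:

import numpy as np

# The four convex surrogates phi(u) of the slide, as functions of the margin u = y f(x).
surrogates = {
    "hinge (S.V.M.)":          lambda u: np.maximum(1.0 - u, 0.0),
    "exponential (AdaBoost)":  lambda u: np.exp(-u),
    "logistic":                lambda u: np.log(1.0 + np.exp(-u)),
    "squared":                 lambda u: (1.0 - u) ** 2,
}

# Finite-difference check of the calibration condition phi'(0) < 0.
eps = 1e-6
for name, phi in surrogates.items():
    d0 = (phi(np.array([eps]))[0] - phi(np.array([-eps]))[0]) / (2 * eps)
    print(f"{name}: phi'(0) ≈ {d0:.3f}")   # all four values are negative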
Support Vector Machines [Boser et al. (1992);
Vapnik (1995)]
C > 0,  φ(u) = (1 − u)+,  H a Reproducing Kernel Hilbert Space

  inf_{b ∈ R, h ∈ H}  C Σ_{i=1}^n φ( Yi[h(Xi) + b] ) + (1/2) ||h||_H^2        (P_C)

• Empirical φ-risk minimization on F = { x ↦ h(x) + b ; ||h||_H ≤ λ, b ∈ R }:
  inf_{f ∈ F} (1/n) Σ_{i=1}^n φ( Yi f(Xi) )                                    (Q_λ)

• (ĥ_C, b̂_C) solution of (P_C) ⇒ ĥ_C + b̂_C solution of (Q_λ) for λ = ||ĥ_C||_H
  −→ SVM: x ↦ sign( ĥ_C(x) + b̂_C ) ≈ empirical φ-risk minimization on F
Reproducing Kernel pre-Hilbert Space
• Let K : X × X → R be symmetric (i.e. K(u, v) = K(v, u)) and
  positive semi-definite (i.e. ∀J ∈ N, ∀α ∈ R^J and ∀x1, . . . , xJ,
  Σ_{1≤j,k≤J} αj αk K(xj, xk) ≥ 0)
  – K is called a (Mercer) kernel
  – Examples for X = R^d:
    ∗ linear kernel K(x, x') = ⟨x, x'⟩_{R^d}
    ∗ polynomial kernel K(x, x') = ( 1 + ⟨x, x'⟩_{R^d} )^p for p ∈ N*
    ∗ Gaussian kernel K(x, x') = e^{−||x−x'||^2/(2σ^2)} for σ > 0
• Let H0 be the linear span of the functions K(x, ·) : x' ↦ K(x, x'), equipped with
  ⟨ Σ_{1≤i≤I} αi K(xi, ·), Σ_{1≤j≤J} α'j K(x'j, ·) ⟩_{H0} = Σ_{i,j} αi α'j K(xi, x'j)
Reproducing Kernel Hilbert Space
• the closure H of H0 is a Hilbert space
• H (like H0) has the reproducing property:
  ⟨f, K(x, ·)⟩_H = f(x)   for any f ∈ H

Back to S.V.M.: training set (X1, Y1), . . . , (Xn, Yn)
Let Hn = { Σ_{i=1}^n αi K(Xi, ·) ; ∀i, αi ∈ R } ⊂ H

  S.V.M. problem = min_{b ∈ R, h ∈ Hn}  C Σ_{i=1}^n φ( Yi[h(Xi) + b] ) + (1/2) ||h||_H^2

⇒ tractable (n + 1)-dimensional optimization task
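A minimal Python sketch of this (n + 1)-dimensional problem, solved by plain subgradient descent on (α, b) with h = Σ_i αi K(Xi, ·) and a Gaussian kernel. The step size, iteration count and toy data are assumptions; this is an illustration, not the usual dual QP solver.

import numpy as np

def gaussian_kernel(X1, X2, sigma):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), computed pairwise.
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def svm_fit(X, Y, C=1.0, sigma=1.0, lr=0.01, n_iter=2000):
    # Minimize over (alpha, b):  C * sum_i phi(Y_i [h(X_i) + b]) + 0.5 * ||h||_H^2,
    # with h = sum_j alpha_j K(X_j, .), hence ||h||_H^2 = alpha^T K alpha.
    n = len(Y)
    K = gaussian_kernel(X, X, sigma)
    alpha, b = np.zeros(n), 0.0
    for _ in range(n_iter):
        margins = Y * (K @ alpha + b)
        active = margins < 1                      # points where the hinge loss is active
        grad_alpha = K @ alpha - C * K @ (Y * active)
        grad_b = -C * np.sum(Y * active)
        alpha -= lr * grad_alpha / n              # dividing by n just rescales the step
        b -= lr * grad_b / n
    return lambda x: np.sign(gaussian_kernel(x[None, :], X, sigma) @ alpha + b)[0]

# Toy usage on assumed linearly separated data.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
Y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
predict = svm_fit(X, Y)
print(predict(np.array([1.0, 1.0])), predict(np.array([-1.0, -1.0])))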
Universal consistency and using S.V.M.
• Universal consistency [Steinwart (2002)]:
  For X ∈ [0, 1]^d and σ > 0, the S.V.M. with Gaussian kernel K : (x, x') ↦ e^{−||x−x'||^2/(2σ^2)}
  and parameter C = n^{β−1} with 0 < β < 1/d is universally consistent.
• Practical choices:
  – kernel: linear, polynomial, Gaussian, ...
  – C (and the parameters of the kernel) cross-validated
• Choice of the kernel ←→ functions approximated by linear
  combinations of the functions K(x, ·) : x' ↦ K(x, x')
• Gaussian kernel with σ → +∞ = linear kernel!
Boosting methods
• Let G be a set of functions from X to {−1, +1}:
  1. G = { x ↦ sign(x^(j) − τ) ; j ∈ {1, . . . , d}, τ ∈ R }
       ∪ { x ↦ sign(−x^(j) + τ) ; j ∈ {1, . . . , d}, τ ∈ R }   (decision stumps)
  2. G = { x ↦ 1_{x ∈ A} − 1_{x ∈ A^c} ; A hyper-rectangle of R^d }
       ∪ { x ↦ 1_{x ∈ A^c} − 1_{x ∈ A} ; A hyper-rectangle of R^d }

[Figure: a one-dimensional threshold split labelled −1 / +1, and a hyper-rectangle labelled +1 inside, −1 outside.]
  3. G = { x ↦ 1_{x ∈ H} − 1_{x ∈ H^c} ; H halfspace of R^d }
  4. G = { univariate decision trees with number of leaves = d + 1 }

[Figure: a halfspace split and an axis-aligned decision-tree partition, with regions labelled +1 / −1.]

• Boosting looks for a classification function of the form
  x ↦ sign( Σ_{j=1}^m λj gj(x) )
• Question: how to choose the λj and the gj?
Boosting by L1-regularization
• Let Fλ = { Σ_{j=1}^m λj gj ; m ∈ N, λj ≥ 0, gj ∈ G, Σ_{j=1}^m λj = λ }
• φ(u) = e^{−u},   An(f) = (1/n) Σ_{i=1}^n φ[Yi f(Xi)]
• Boosting by L1-regularization:
  f̂_λ = argmin_{f ∈ Fλ} An(f)
• Universal consistency [Lugosi and Vayatis (2004)]:
  If λ = (log n)/4 and G is one of the previous choices (except choice 1),
  then sign(f̂_λ) is universally consistent.
Usual description of AdaBoost
• Initialization: wi = 1/n for i = 1, . . . , n
• Iterate: for j = 1 to J:
  – Take gj ∈ argmin_{g ∈ G} Σ_{i=1}^n wi 1_{g(Xi) ≠ Yi}, and let ej be the minimum value
  – λj = (1/2) log( (1 − ej)/ej )
  – Update the weights: for all i such that gj(Xi) ≠ Yi,  wi ← wi (1 − ej)/ej
  – Normalize the weights: wi ← wi / Σ_{i'=1}^n wi'  for i = 1, . . . , n
• Output:
  x ↦ sign( Σ_{j=1}^J λj gj(x) )
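A compact Python sketch of this loop, using decision stumps (choice 1 of the earlier base classes) as G; the stump search, the toy data and the number of rounds are assumptions made for illustration:

import numpy as np

def stump_set(X):
    # Base class G of decision stumps: x -> sign(x^(j) - tau) and
    # x -> sign(-x^(j) + tau), with thresholds at the observed values.
    stumps = []
    for j in range(X.shape[1]):
        for tau in np.unique(X[:, j]):
            stumps.append((j, tau, +1))
            stumps.append((j, tau, -1))
    return stumps

def stump_predict(stump, X):
    j, tau, s = stump
    return np.where(s * (X[:, j] - tau) >= 0, 1.0, -1.0)

def adaboost(X, Y, J=30):
    # The loop of the slide: up-weight the mistakes by (1 - e_j)/e_j,
    # weight the base classifier by (1/2) log((1 - e_j)/e_j).
    n = len(Y)
    w = np.ones(n) / n
    stumps = stump_set(X)
    chosen, lambdas = [], []
    for _ in range(J):
        errors = [np.sum(w * (stump_predict(g, X) != Y)) for g in stumps]
        j_best = int(np.argmin(errors))
        g, e = stumps[j_best], np.clip(errors[j_best], 1e-12, 1 - 1e-12)
        lam = 0.5 * np.log((1 - e) / e)
        mis = stump_predict(g, X) != Y
        w[mis] *= (1 - e) / e              # up-weight the misclassified points
        w /= w.sum()                       # normalize the weights
        chosen.append(g)
        lambdas.append(lam)
    def predict(Xnew):
        score = sum(lam * stump_predict(g, Xnew) for g, lam in zip(chosen, lambdas))
        return np.sign(score)
    return predict

# Toy usage on assumed data with a linear boundary.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
Y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
predict = adaboost(X, Y, J=30)
print("training accuracy:", np.mean(predict(X) == Y))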
AdaBoost = greedy empirical φ-risk minimization
• φ(u) = e^{−u},   An(f) = (1/n) Σ_{i=1}^n φ[Yi f(Xi)]
• f0 = 0
• For j = 1 to J:
  – (λj, gj) ∈ argmin_{λ ∈ R, g ∈ G} An(fj−1 + λ g)
  – fj = fj−1 + λj gj
• Universal consistency [Bartlett and Traskin (2007)]:
  If J = n^ν with 0 < ν < 1 and G is one of the previous choices (except
  choice 1), then AdaBoost is universally consistent.
Link between boosting methods and S.V.M.
• AdaBoost output: x ↦ sign( Σ_{j=1}^J λj gj(x) )
• S.V.M. output: x ↦ sign( Σ_{i=1}^n αi K(Xi, x) + b )
• Consider K(x, x') = Σ_{j=1}^J gj(x) gj(x'). Then
  Σ_{i=1}^n αi K(Xi, x) = Σ_{i=1}^n αi Σ_{j=1}^J gj(Xi) gj(x) = Σ_{j=1}^J λj gj(x),
  with λj = Σ_{i=1}^n αi gj(Xi).
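A quick numerical check in Python of the identity above, with a handful of arbitrary base classifiers gj (random stumps; everything here is an assumed toy setup):

import numpy as np

# Check: with K(x, x') = sum_j g_j(x) g_j(x'),
#        sum_i alpha_i K(X_i, x) = sum_j lambda_j g_j(x),  lambda_j = sum_i alpha_i g_j(X_i).
rng = np.random.default_rng(0)
n, d, J = 20, 3, 5
X = rng.standard_normal((n, d))
x = rng.standard_normal(d)
alpha = rng.standard_normal(n)
taus = rng.standard_normal(J)
g = lambda u, j: np.sign(u[..., j % d] - taus[j])       # arbitrary base classifiers

K = lambda u, v: sum(g(u, j) * g(v, j) for j in range(J))
lhs = sum(alpha[i] * K(X[i], x) for i in range(n))
lam = np.array([np.sum(alpha * g(X, j)) for j in range(J)])
rhs = np.sum(lam * np.array([g(x, j) for j in range(J)]))
print(np.allclose(lhs, rhs))                            # True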
Boosting vs S.V.M. vs Neural networks
• Boosting advantages:
  – Variable selection
  – Ability to handle a very large number of features
  – Simple tricks to reduce the computational complexity
  – S.V.M. can be run at the end on the selected features
• S.V.M. advantages:
  – Easy to use off-the-shelf
  – Consistently good results
• Neural networks advantages:
  – Work well in practice
J.-Y. Audibert. Fast learning rates in statistical inference through aggregation. 2008. To be published
in Annals of Statistics, http://www.e-publications.org/ims/submission/index.php/AOS/
user/submissionFile/1175?confirm=51fc3552.
P.L. Bartlett and M. Traskin. Adaboost is consistent. J. Mach. Learn. Res., 8:2347–2368, 2007.
P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Journal of
the American Statistical Association, 101:138–156, 2006.
B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In COLT: Proceedings
of the Workshop on Computational Learning Theory. Morgan Kaufmann Publishers, 1992.
T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information
Theory, IT-13, 1967.
L. Devroye. Any discrimination rule can have an arbitrarily bad probability of error for finite sample
size. IEEE Trans. Pattern Analysis and Machine Intelligence, 4:154–157, 1982.
L. Devroye and T. Wagner. Distribution-free consistency results in nonparametric discrimination and
regression function estimation. Annals of Statistics, 8:231–239, 1980.
A. Faragó and G. Lugosi. Strong universal consistency of neural network classifiers. IEEE
Transactions on Information Theory, 39(4):1146–1151, 1993.
L. Györfi. Nonparametric estimation II. statistically equivalent blocks and tolerance regions. In
Nonparametric Functional Estimation and Related Topics, pages 329–338, 1991.
L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric
Regression. Springer, 2004.
Y. LeCun. Lecture notes, http://www.cs.nyu.edu/~yann/2005f-G22-2565-001/diglib/lecture09-optim.djvu
(requires the DjVu reader, http://djvu.org/download/), 2005.
Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. In G. Orr and K. Muller, editors,
Neural Networks: Tricks of the Trade. Springer, 1998. http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods. Annals of Statistics,
32(1):30–55, 2004.
G. Lugosi and K. Zeger. Nonparametric estimation via empirical risk minimization. IEEE
Transactions on Information Theory, 41(3):677–687, 1995.
E. A. Nadaraya. On estimating regression. Theor. Probability Appl., 9:141–142, 1964.
F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the
brain. Psychological Review, 65:386–408, 1958.
F. Rosenblatt. Principles of Neurodynamics. Spartan Books, Washington, 1962.
P. Y. Simard, D. Steinkraus, and J. Platt. Best practices for convolutional neural networks applied
to visual document analysis. In International Conference on Document Analysis and Recognition (ICDAR),
IEEE Computer Society, pages 958–962, 2003. http://research.microsoft.com/~patrice/PDF/fugu9.pdf
C. Spiegelman and J. Sacks. Consistent window estimation in nonparametric regression. Annals of
Statistics, 8:240–246, 1980.
I. Steinwart. Support vector machines are universally consistent. Journal of Complexity, 18, 2002.
C. J. Stone. Consistent nonparametric regression (with discussion). Annals of Statistics, 5:595–645,
1977.
J. W. Tukey. Nonparametric estimation II: Statistically equivalent blocks and tolerance regions. Annals
of Mathematical Statistics, 18:529–539, 1947.
V. Vapnik. The nature of statistical learning theory. Springer-Verlag, second edition, 1995.
G. S. Watson. Smooth regression analysis. Sankhya Series A, 26:359–372, 1964.