Inference for Multiple Isotonic Regression
Mary C. Meyer
December 2, 2010
Abstract
The isotonic regression for two or more independent variables is a classic problem in data
analysis. The classical solution involves enumeration of upper sets, which is computationally prohibitive unless the sample size is small. Here it is shown that the solution may be
obtained through a single projection onto a convex polyhedral cone. The cone formulation
allows an exact test of the null hypothesis of constant versus isotonic regression function.
Categorical covariates can be included so that the model consists of several isotonic hyper-surfaces; then
the null hypothesis for the test is that the regression function depends only on the categorical covariate. Simulations show that when the constraints hold, the test tends to have
higher power than the standard parametric test that does not use the constraints, even
when the parametric assumptions are true. For other types of inference, the speed of the
algorithm allows for bootstrap confidence intervals and related inference for moderate-sized
samples. The methods are illustrated with data from an observational study concerning
diet and lifestyle factors as predictors of blood plasma micronutrients in healthy adults.
Keywords: Doubly monotone regression, order restrictions, cone projection, bootstrap.
1 Introduction
Let X be a set of points {x1, . . . , xn} in IRp. A partial ordering may be defined as xi ⪯ xj if
xi ≤ xj holds coordinate-wise. Two points xi and xj in X are comparable if either xi ⪯ xj
or xj ⪯ xi. Recall that partial orderings are reflexive, anti-symmetric, and transitive, but
differ from complete orderings in that pairs of points are not required to be comparable. A
function f : X → IR is isotonic with respect to the partial order if f(xi) ≤ f(xj) whenever
xi ⪯ xj. Consider a scatterplot of data (xi, yi), for i = 1, . . . , n, where yi = f(xi) + εi
for mean-zero errors εi with common variance σ². The isotonic least-squares regression
solution f̂ minimizes

    Σ_{i=1}^n [yi − f(xi)]²                                        (1)

over the set of functions f that are isotonic on X . We can assume without loss of generality
that the points xi are unique; otherwise y-values at the unique points can be averaged and
the regression appropriately weighted, as discussed in section 2.
Brunk (1955) formulated the max-min theorem characterizing the solution as averages
of the data over appropriate intersections of upper and lower sets. For p = 1 (or any simple
ordering), the pooled adjacent violators algorithm (PAVA) takes advantage of the greatest
convex minorant of the cumulative sum diagram. However, PAVA does not work for the
partial ordering case; Gebhardt (1970), Barlow and Brunk (1972), and Dykstra (1981) gave
search algorithms to provide the exact solution. Dykstra and Robertson (1982) pointed
out that these algorithms require “a significant amount of checking and readjustment”
so that the coding has “intricate branching logic [that is] complicated to program.” For
the multiple isotonic regression on a grid of points in IRp, they proposed iterative one-dimensional isotonic regressions, and showed that the algorithm converges to the correct
solution. They recommended the previous exact algorithms for small grids. Bacchetti
(1989) gave a similar iterative algorithm for the special case of additive effects. Best and
Chakravarti (1990) formulated the complete ordering isotonic regression as a quadratic
program and applied a primal active set algorithm. Qian and Eddy (1996) provided a
“sandwich isotonic block-class” algorithm for multiple isotonic regression on a grid. Spouge
et al. (2003) provided a maximal upper set algorithm for p = 2 that does not require a
grid. Burdakov et al. (2006) provided a fast algorithm for isotonic regression that gives a
“sufficiently accurate” solution for n in the thousands, by using blocking methods.
Tests of constant versus isotonic regression function date back to Bartholomew (1961)
and Kudô (1963), who investigated the case of p = 1. For yi = f(xi) + εi with iid
mean-zero normal εi, the null distribution of a likelihood ratio test statistic is found to be
that of a mixture of chi-squared random variables. Nomakuchi and Shi (1988) considered
independent samples of size n1 , . . . , nk from p-variate normal populations Np (θi , Λ) for
known Λ, and a partial ordering on {1, . . . , k}. They proposed a test for θ1 = · · · = θk
versus θi ≤ θj whenever i ⪯ j. The values of k in the simulations were 5 and 10; our
algorithm allows for much larger k.
The algorithm presented in this paper is based on cone projection. It is formulated
for the multiple isotonic regression, but is valid for any partial ordering. It gives an exact
solution and is reasonably fast for data sets that are not too large, and adding a categorical
covariate to the model is straightforward. The cone formulation also allows for an exact
test of the null hypothesis that f is constant against the alternative that f is isotonic, and the speed of the algorithm allows bootstrap inference for moderate-sized data sets. In the next section some necessary
background on cone projection is given and the algorithm for multiple isotonic regression
is formulated. In section 3, an exact test is given for constant versus increasing regression
function, or alternatively the null hypothesis can be that the regression function depends
only on the categorical covariate. The blood plasma micronutrients data set is analyzed in
section 4, and some discussion is given in section 5.
2 Cone Projection Algorithm
Results from Cone Projection Theory
An intuitive description of cones and cone projection is given; for more details see
chapter 3 of Silvapulle and Sen (2005) or Meyer (2010). For details about cone projection
with more constraints than dimensions, see Meyer (1999). Consider an m × n constraint
matrix A and the cone

    C = {θ ∈ IRn : Aθ ≥ 0},                                        (2)

where the inequality is element-wise. The projection of y ∈ IRn onto C is the element
θ̂ ∈ C that minimizes the Euclidean distance Σ_{i=1}^n (yi − θi)². The existence and uniqueness
of the projection is guaranteed because the cone is convex and closed. The polar cone C 0
associated with C is defined as the set of vectors in IRn whose projection onto C is the
origin. The set C 0 is also a closed, convex polyhedral cone, contained in the linear space
perpendicular to the null space of A. The following proposition is a standard result in cone
projection theory.
Proposition 1 The projection of y ∈ IRn onto C 0 is the residual of the projection of y
onto C and vice-versa, so that the projection onto either cone can be obtained from the
projection onto the other.
The constraint matrix is irreducible (Meyer 1999) if no row is a positive linear combination of other rows and no positive linear combination of rows equals zero. (Here “positive
linear combination” means a linear combination with positive coefficients.) Essentially, this
means that there are no redundancies in the constraint matrix. If A is full row-rank, then
it is irreducible. If a row is a positive linear combination of other rows, it can be removed
without changing the set C. If the origin is a positive linear combination of rows, then there
is an implicit equality constraint, and the cone C is in a linear space of smaller dimension
than n. It is shown in Meyer (1999) that if A is irreducible, then the edges of the polar
cone are the rows of −A, where an edge is defined as a vector in the cone that can not be
written as a positive linear combination of two linearly independent vectors in the cone.
The edges γ1, . . . , γm are generators of the polar cone in that

    C 0 = { ρ ∈ IRn : ρ = Σ_{j=1}^m αj γj, αj ≥ 0, j = 1, . . . , m }.

An efficient algorithm for projection uses the "faces" of the cone, which may be indexed by
subsets J of {1, . . . , m}. Given J, the face FJ is defined by

    FJ = { ρ ∈ IRn : ρ = Σ_{j∈J} αj γj ; αj > 0, j ∈ J }.

Note that the interior of C 0 is a face with J = {1, . . . , m}, and the origin is a face with
J = ∅. The faces cover the cone, and if the edges are linearly independent, the faces
partition the cone. If the edges are not linearly independent but the constraint matrix is
irreducible, then the overlap between the faces has measure zero. The following proposition
is proved in Raubertas (1986) or Meyer (2003).
Proposition 2 The projection of y ∈ IRn onto C 0 lands on a face FJ for some J. If J is
known, the solution coincides with the ordinary least-squares projection onto SJ , the linear
space spanned by the edges indexed by J.
The hinge algorithm (Meyer 2010) finds J by choosing a sequence of candidate sets Jk ,
and at each step performing a least-squares projection onto the linear space spanned by
the vectors γ j , j ∈ Jk . Edges are added or removed from the set based on the projection
of y onto SJ . At any step in the algorithm, the next edge is chosen to maximize the angle
between the current residual and the remaining edges, so that the SSE is shown to be
steadily decreasing. Although for m edges there are a large number of faces of the cone,
at each step the faces on which the projection of y would have a higher SSE than the current one are
eliminated from consideration, so that a relatively short sequence of projections is required
to obtain the exact solution.
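To make the connection to standard software concrete, the projection onto the polar cone can also be computed as a nonnegative least-squares fit on the edges γj = −aj, since C 0 consists of the nonnegative combinations of its edges. The sketch below is not the hinge algorithm; it assumes the CRAN package nnls, and the function name polar.proj is chosen only for illustration.

polar.proj <- function(y, amat) {
  ## Sketch (assumes the 'nnls' package): project y onto the polar cone whose
  ## edges are the negatives of the rows of the irreducible constraint matrix.
  edges <- -t(amat)                                  # columns are the edges gamma_j
  alpha <- nnls::nnls(edges, y)$x                    # min ||y - edges %*% alpha||^2, alpha >= 0
  rho <- drop(edges %*% alpha)                       # projection of y onto the polar cone
  active <- edges[, alpha > 1e-8, drop = FALSE]      # edges with positive coefficients
  d <- if (ncol(active) > 0) qr(active)$rank else 0  # dimension of the face F_J
  list(rho = rho, thetahat = y - rho, d = d)         # isotonic fit via Proposition 1
}

The component d is the face dimension used in the inference described in section 3.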
Isotonic Regression
The multiple isotonic regression function estimate is computed by finding its residual
via a projection onto its polar cone. Defining θ by θi = f (xi ), the set of functions isotonic
on X can be characterized using a constraint matrix A. For each i, j such that xi ⪯ xj
there is a k such that Ak,i = −1 and Ak,j = +1, and all other elements of the kth row are
zero. The number of rows m is at most the number of comparable pairs, which is at most
n choose 2. For p = 2 and a 3 × 3 rectangular grid of points where n = 9, it is easy to
see that there are 27 comparable pairs. There are many redundancies in the matrix; for
example, if xi ⪯ xj ⪯ xk, the three rows of A used to describe these could be reduced to
two because of transitivity. The row comparing xi to xk is the sum of the rows comparing
xi to xj and xj to xk, thus rows can be removed until the resulting matrix is irreducible. If
rows sum to zero, then for some i and j, we have both xi ⪯ xj and xj ⪯ xi, which implies
xj = xi , and we have stipulated that the xi values are unique. For the 3 × 3 example,
the reduced constraint matrix has 12 rows. For a complete order, A can be reduced to
n − 1 rows. For the blood-plasma micronutrient data set with n = 311 and p = 3, analyzed
in section 4, there are m = 1321 rows of the irreducible constraint matrix. The isotonic
regression estimator θ̂ is obtained by projecting the data y onto the polar cone, whose edges
are the negatives of the rows of the reduced A; by Proposition 1, this projection is y − θ̂.
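As an illustration, the reduced constraint matrix can be assembled directly from the design points by keeping only comparable pairs with no third point between them; the function name make.amat below is illustrative, and the 3 × 3 example reproduces the 12 rows noted above.

make.amat <- function(X) {
  ## Sketch: irreducible constraint matrix for isotonic regression on the rows of X
  ## (rows assumed unique); row k of A has -1 in column i and +1 in column j when
  ## x_i precedes x_j and no third point lies between them (transitive reduction).
  X <- as.matrix(X)
  n <- nrow(X)
  leq <- matrix(FALSE, n, n)                       # leq[i, j]: x_i <= x_j coordinate-wise
  for (i in 1:n) for (j in 1:n) leq[i, j] <- all(X[i, ] <= X[j, ])
  amat <- NULL
  for (i in 1:n) for (j in 1:n) {
    if (i != j && leq[i, j]) {
      between <- leq[i, ] & leq[, j] & (1:n != i) & (1:n != j)
      if (!any(between)) {                         # keep only immediate comparisons
        a <- numeric(n); a[i] <- -1; a[j] <- 1
        amat <- rbind(amat, a)
      }
    }
  }
  unname(amat)
}

A <- make.amat(expand.grid(x1 = 1:3, x2 = 1:3))    # the 3 x 3 grid of the example
nrow(A)                                            # 12 rows, as in the text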
The edges of the constraint cone C could be found using Proposition 3 of the appendix,
and it can be seen that each edge of C is an indicator for an upper set of X . A set U ⊆ X is
an upper set if for each u ∈ U, u ⪯ x implies x ∈ U. The enumeration of the upper sets,
necessary for the earlier isotonic regression algorithms, is a computational burden unless
the sample size is quite small. For example, the number M of constraint cone edges for the
3 × 3 grid is 18, but for a 4 × 4 grid, we have m = 24 edges of the polar cone and M = 68
edges of the constraint cone. In the case of more constraints than dimensions, the number
of edges of the constraint cone tends to be much larger than the number of constraints m.
By projecting the data onto the polar cone and taking the residual of this projection to
obtain the isotonic regression, the upper sets can be avoided altogether.
A set X is inseparable if for every proper subset X0 ⊂ X, at least one point in X0 is comparable
with at least one point not in X0. If X is not inseparable, it can be broken down into
smaller subsets, and isotonic regressions performed on the subsets. The following lemma
is proved in the appendix.
Lemma 1 The null space V of the constraint matrix A associated with isotonic regression
on a set X is spanned by the constant vectors if and only if the set X is inseparable.
Checking for inseparability can be accomplished using the lemma, by computing the null
space of A, although for two-dimensional predictors, it can easily be assessed visually. The
cone projection does not require inseparability of X , but for the inference outlined in the
next section, the dimension of the null space of A must be known.
Interpolation
Interpolation or prediction of the regression function f at a new value x0 can be accomplished in a
straightforward manner if there is at least one xi ∈ X where xi ⪯ x0 or x0 ⪯ xi. Let U ⊆ X be the
set of x ∈ X such that x0 ⪯ x, and let L ⊆ X be the set of x ∈ X such that x ⪯ x0.
If neither set is empty, then f̂(x0) = [max_{x∈L} f̂(x) + min_{x∈U} f̂(x)]/2.
If U is empty, then f̂(x0) = max_{x∈L} f̂(x), and if L is empty, then f̂(x0) = min_{x∈U} f̂(x).
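A direct transcription of this rule might look as follows (a sketch; interp.iso and its arguments are illustrative, with fhat the vector of fitted values at the rows of X):

interp.iso <- function(fhat, X, x0) {
  ## Sketch of the interpolation rule at a new point x0
  X <- as.matrix(X)
  lower <- apply(X, 1, function(x) all(x <= x0))   # design points preceding x0
  upper <- apply(X, 1, function(x) all(x >= x0))   # design points preceded by x0
  if (any(lower) && any(upper)) {
    (max(fhat[lower]) + min(fhat[upper])) / 2
  } else if (any(lower)) {
    max(fhat[lower])
  } else if (any(upper)) {
    min(fhat[upper])
  } else {
    NA                                             # x0 is comparable to no design point
  }
}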
Weighted Regression
If the x values are not unique, the y-values at each unique x can be averaged, and a
weighted regression performed on the smaller data set. The expression Σ_{i=1}^n wi(yi − θi)² is
minimized, where wi is the number of observations at xi. More generally, we can minimize
(y − θ)′Λ(y − θ) subject to Aθ ≥ 0 for a known positive-definite matrix Λ. If U′U is
the Cholesky decomposition of Λ, the transformations θ̃ = Uθ, ỹ = Uy, and Ã = AU⁻¹
allow the minimization of Σ_{i=1}^n (ỹi − θ̃i)² subject to Ãθ̃ ≥ 0. It is easy to see that if A is
irreducible, so is Ã.
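In code, the reduction of the weighted problem to the unweighted one is a short transformation (a sketch; the function name is illustrative and Lambda is the known positive-definite weight matrix):

to.unweighted <- function(y, amat, Lambda) {
  ## Sketch: Lambda = t(U) %*% U with U = chol(Lambda) in R
  U <- chol(Lambda)
  list(ytil = drop(U %*% y),             # transformed response
       amat.til = amat %*% solve(U))     # transformed constraint matrix
}
## After projecting ytil subject to amat.til %*% thetatil >= 0, the weighted
## fit on the original scale is recovered as thetahat = solve(U, thetatil).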
Adding Categorical Covariates
Consider a categorical covariate z with D levels, and the non-additive model E(yi) =
Σ_{d=1}^D fd(xi)I{zi = d}, where each fd is isotonic on Xd, the subset of X for
which zi = d. The fit consists of D separate isotonic surfaces. For this model, xi and xj
are comparable if zi = zj and either xi ⪯ xj or xj ⪯ xi, and hence the null space of A
contains indicator vectors for the levels of z. The inference method of section 3 permits
an exact test where the null hypothesis is that the regression function depends only on the
covariate (i.e., each fd is constant), versus the alternative that each fd is non-decreasing.
Here, the x values do not have to be unique, but the same x value may appear at most
once for each level of the covariate. There is no requirement that the design be “balanced,”
so that there may be different x values for the different levels.
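For instance, the constraint matrix for this model can be assembled by applying the construction of the previous subsection within each level of z, so that no constraints link different levels (a sketch, reusing the make.amat function sketched above):

make.amat.cat <- function(X, z) {
  ## Sketch: points are comparable only within a level of the categorical z
  X <- as.matrix(X); n <- nrow(X)
  amat <- NULL
  for (lev in unique(z)) {
    idx <- which(z == lev)
    asub <- make.amat(X[idx, , drop = FALSE])      # constraints within this level
    if (!is.null(asub)) {
      arows <- matrix(0, nrow(asub), n)
      arows[, idx] <- asub                         # embed in the full n columns
      amat <- rbind(amat, arows)
    }
  }
  amat
}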
3 Inference
For y = θ + σε with unknown σ, suppose ε is multivariate standard normal. Let C be a
convex polyhedral cone in IRn and let V be a linear subspace of IRn contained in C. The test
of H0 : θ ∈ V versus H1 : θ ∈ C has been established in several contexts. An outline of the
derivations is in the appendix; for details see Raubertas et al. (1986) or Silvapulle and
Sen (2005), chapter 3. Let ρ̂ be the projection of y onto C 0, and let ŷ0 be the projection
of y onto V. Let SSE1 = Σ_{i=1}^n ρ̂i² and SSE0 = Σ_{i=1}^n (yi − ŷ0i)². It has been shown that
under H0, the distribution of

    χ̄²01 = (SSE0 − SSE1) / σ²,

conditioned on ρ̂ ∈ FJ, is chi-squared with n − d − r degrees of freedom, where d is the
dimension of the face FJ, and r is the dimension of V. Hence the null distribution of χ̄²01 is

    Pr(χ̄²01 ≤ c) = Σ_{d=0}^{n−r} pd Pr(χ²_{n−d−r} ≤ c)

for c > 0, where χ²_k is a chi-squared random variable with k degrees of freedom and
Pr(χ²_0 ≤ c) = 1. The mixing coefficient pd is the probability under H0 that the face FJ on
which the projection ρ̂ lands has dimension d. For σ² unknown,

    B̄01 = χ̄²01 / (χ̄²01 + SSE1/σ²) = (SSE0 − SSE1) / SSE0

has a null distribution of a mixture of beta random variables, so that

    Pr(B̄01 ≤ c) = Σ_{d=0}^{n−r} pd Pr(B_{(n−d−r)/2, d/2} ≤ c)

where Bα,β is a Beta random variable with parameters α and β. The mixing coefficients
can be found through simulations to any desired precision, so the test may be considered
“exact.”
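As a sketch of how the p-value is assembled: the mixing coefficients pd can be estimated by projecting simulated standard normal vectors onto the polar cone and tabulating the dimension of the face on which each projection lands (for example with the hinge algorithm, or the polar.proj sketch of section 2); the tail probability is then a weighted sum of Beta tail probabilities. The function name below, and the assumption that the simulated face dimensions are supplied in dims, are illustrative.

mixbeta.pvalue <- function(bstat, dims, n, r = 1) {
  ## Sketch: bstat = (SSE0 - SSE1)/SSE0 is the observed statistic; dims holds the
  ## simulated face dimensions d (standard normal data projected onto the polar
  ## cone); r is the dimension of the null space V.
  pd <- tabulate(dims + 1, nbins = n - r + 1) / length(dims)   # estimate of Pr(dim = d)
  pval <- 0
  for (d in 0:(n - r)) {
    if (pd[d + 1] == 0) next
    if (d == 0) {
      ptail <- as.numeric(bstat < 1)               # rho-hat = 0, so B01 = 1
    } else if (n - d - r == 0) {
      ptail <- 0                                   # SSE1 = SSE0, so B01 = 0
    } else {
      ptail <- pbeta(bstat, (n - d - r) / 2, d / 2, lower.tail = FALSE)
    }
    pval <- pval + pd[d + 1] * ptail               # Pr(B01 > bstat | dim = d), weighted by pd
  }
  pval
}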
For the multiple isotonic regression with parametrically modeled covariates where the
cone is (2), the null space of A is the linear space contained in the cone C, so if X is inseparable, the test of H0 : f is constant versus H1 : f is non-decreasing may be accomplished
using the B̄01 statistic as described above with r = 1. If the set X is not inseparable,
then r > 1 and the projection of y onto V is not constant; rather, it is piece-wise constant
over the pieces of X that are individually inseparable. However, the test of constant versus
increasing regression function may still be accomplished using SSE0 = Σ_{i=1}^n (yi − ȳ)², with
χ̄²01 and B̄01 defined accordingly. Then

    Pr(B̄01 ≤ c) = Σ_{d=0}^{n−r} pd Pr(B_{(n−d−1)/2, d/2} ≤ c).                    (3)
Some details about the proof of this result are found in the appendix.
Simulations show how the test can be useful in common situations. Suppose a data
set has two continuous predictors, with an a priori assumption that the response is non-decreasing in both predictors. For parametric regression, linear relationships between the
response and predictors are often assumed as a default when there is no information about
the parametric form of the relationship. If interaction is a possibility, the “warped plane”
parametric model
    yi = β0 + β1 x1i + β2 x2i + β3 x1i x2i + εi                    (4)
is considered. For testing the null hypothesis that E(yi ) ≡ c, an F -test is typically used.
However, the constant versus monotone test often has higher power than a test using
a parametric alternative without constraints, even when the parametric assumptions are
correct. That is, using a non-parametric model with constraints is often preferable to a
parametric model without constraints.
For the simulations, predictors are generated in a square grid in IR2 , and responses are
yi = f (x1i , x2i ) + i , where the i are iid standard normal and f is non-decreasing in both
predictors. Table 1 contains the proportion of data sets for which the null hypothesis of
constant f is rejected, for three competing tests, four underlying regression functions, and
three sample sizes, when the target test size is α = .05. The column labeled ISO contains
the rejection proportions for the constant versus isotonic test using B̄01 . The F -test for the
significance of (4), compared to the constant function, is included under the column labeled
“unconstrained warped plane” (UWP). Constraints may be imposed on the parameters of
model (4) to force the estimated regression function to be non-decreasing in both predictors
over the unit square. The function (4) is non-decreasing on (0, 1) × (0, 1) if and only if the
four inequalities hold: β1 ≥ 0, β2 ≥ 0, β1 + β3 ≥ 0 and β2 + β3 ≥ 0. The test of constant
versus constrained warped plane (CWP) has a test statistic whose null distribution is also
that of a mixture of beta random variables; see Meyer and Wang (2010) and references
therein. The test is not typically included in data analysis software packages but code in R
can be found at www.stat.colostate.edu/~meyer/constrparam.htm. The proportions
of the simulated data sets for which the null was rejected for this test are listed in the table
under CWP.
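For reference, a single replicate of the unconstrained comparison can be generated in a few lines (a sketch of the simulation design rather than the code used for Table 1; the seed and grid spacing are illustrative):

set.seed(1)
g <- expand.grid(x1 = seq(0, 1, length.out = 10), x2 = seq(0, 1, length.out = 10))
g$y <- g$x1 * g$x2 / 3 + rnorm(nrow(g))            # true f = x1*x2/3, sigma = 1, n = 100
uwp <- lm(y ~ x1 + x2 + I(x1 * x2), data = g)      # warped-plane model (4)
anova(lm(y ~ 1, data = g), uwp)                    # F-test of constant f (the UWP column)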
 f                              n     ISO    UWP    CWP
 x1/3                          100   .177   .120   .208
 x1/3                          225   .284   .227   .347
 x1/3                          400   .407   .367   .520
 x1 x2/3                       100   .178   .091   .170
 x1 x2/3                       225   .264   .146   .269
 x1 x2/3                       400   .367   .221   .385
 4(x1 − .6)₊²                  100   .334   .203   .329
 4(x1 − .6)₊²                  225   .559   .385   .548
 4(x1 − .6)₊²                  400   .763   .595   .751
 exp(10 max(x1, x2))/40000     100   .408   .165   .331
 exp(10 max(x1, x2))/40000     225   .620   .281   .489
 exp(10 max(x1, x2))/40000     400   .801   .424   .654

Table 1: Proportion of rejections of the null hypothesis of constant regression function,
for N = 10,000 simulated data sets. Compares tests using as the alternative fit: isotonic
regression function (ISO), unconstrained warped plane (UWP), and constrained warped
plane (CWP).
All three tests are exact, so the constant function is not included in the simulations. For
f (x1 , x2 ) = x1 /3 and f (x1 , x2 ) = x1 x2 /3, the parametric assumptions are correct, so the
constrained parametric test has the highest power (to within simulation error). However,
the constrained nonparametric test has higher power than the unconstrained parametric
test for all example regression functions, showing that when constraints are valid, the
nonparametric test using the constraints may be preferable to the parametric test that
does not use the constraints. Of course, the correct parametric test using the correct
constraints gives highest power.
The second two choices of regression function do not follow the warped-plane model (4); here the
nonparametric test may have higher power than the parametric test using the constraints.
To illustrate, a data set of size n = 225 is generated on a 15 × 15 grid, using f (x1 , x2 ) =
exp(10 max(x1 , x2 ))/40000 (depicted in the first plot of Figure 1), and σ 2 = 1. This function
is chosen because it is far from the warped plane, but still increasing in both predictors.
The isotonic fit is shown in the second plot, and the fit to model (4) is shown in the third,
with the constrained parametric fit in the last plot. For the B01 test, p = .008, for the
F -test, p = .044, and using the constrained warped plane as the alternative fit, p = .012.
Figure 1: For a data set generated on a 15 × 15 grid, using the regression function depicted
in (a) and iid standard normal errors, the isotonic fit is shown in (b), the unconstrained
warped plane fit is shown in (c), and the constrained warped plane fit is shown in (d).

Standard residual plots for the unconstrained warped plane are shown in Figure 2; although
the model is incorrect, there are no alarming patterns that would prompt a change in the
model. If the quadratic terms are added to the model, they are not significant (even
in the absence of the interaction). This shows that even for reasonably large data sets,
residual plots do not always indicate that the chosen model is incorrect, and using minimal
assumptions may be a better choice than a parametric model.
Figure 2: Standard residual plots for the unconstrained warped plane fit using the data set
from Figure 1.

Bootstrap Inference
There is no known exact test for the significance of individual predictors in the presence of others, when constraints are applied in both the null and alternative hypotheses.
However, a bootstrap test can be formulated for H0 : f = f (x0 ) versus H1 : f = f (x1 ),
where the elements of x0 are a subset of those of x1 . Using the null fit, bootstrap data sets
may be simulated. If the normal errors assumption seems to be valid, model errors can be
simulated from the normal distribution with σ̂ 2 = SSE1 /d, where d is the dimension of the
face of the polar cone containing the projection of y, i.e., the dimension of the linear space
containing the residual vector. If the normal errors assumption is suspect, the residuals
from the alternative fit may be sampled from. For each bootstrap data set, the null and
alternative fits are computed, as well as the value of T = (SSE0 − SSE1 )/SSE0 ; thus the
null distribution of this test statistic may be approximated. The proportion of bootstrap
values of T that are larger than the observed T is the approximate p-value.
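In outline, the bootstrap test might be coded as below; iso.fit is a stand-in for the isotonic fitting routine (for instance, the polar.proj sketch of section 2 returns the needed thetahat and d components), and all names are illustrative.

boot.submodel <- function(y, amat0, amat1, nboot = 1000) {
  ## Sketch: bootstrap test of H0 (constraint matrix amat0) vs H1 (amat1)
  fit0 <- iso.fit(y, amat0)                        # null fit, e.g. isotonic in x1 only
  fit1 <- iso.fit(y, amat1)                        # alternative fit, isotonic in x1 and x2
  sse0 <- sum((y - fit0$thetahat)^2)
  sse1 <- sum((y - fit1$thetahat)^2)
  tobs <- (sse0 - sse1) / sse0
  sighat <- sqrt(sse1 / fit1$d)                    # sigma-hat^2 = SSE1/d, as in the text
  tboot <- replicate(nboot, {
    yb <- fit0$thetahat + rnorm(length(y), sd = sighat)   # simulate from the null fit
    ## (residuals from the alternative fit could be resampled here instead)
    b0 <- iso.fit(yb, amat0); b1 <- iso.fit(yb, amat1)
    s0 <- sum((yb - b0$thetahat)^2); s1 <- sum((yb - b1$thetahat)^2)
    (s0 - s1) / s0
  })
  mean(tboot >= tobs)                              # approximate p-value
}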
Simulations show that the bootstrap performs well as a test for sub-models. The null
hypothesis is that f is an isotonic function of x1 only, and the alternative is that f is isotonic
in both x1 and x2 . The test is compared to a standard F -test for sub-models, where the
null model is f (x1 ) = β0 + β1 x1 , and the alternative is model (4). The two predictors are
arranged in a 10 × 10 grid, and data sets simulated for four regression functions and σ 2 = 1.
Proportions of rejections of H0 are found in Table 2, for α = .05, for N = 1000 simulated
data sets for the bootstrap but N = 100, 000 for the much faster F -test. The simulations
show that using the nonparametric constrained model is likely to provide more power than
the unconstrained parametric model, even when the parametric assumptions hold.
 true f                         Boot     F
 2x1                            .050   .050
 4(x1 − .6)₊²                   .042   .047
 x1 x2                          .381   .367
 exp(10 max(x1, x2))/20000      .551   .342

Table 2: Under the column "Boot" are the proportions of rejections of the null hypothesis
that f = f(x1) versus f = f(x1, x2), both isotonic, using a bootstrap distribution of the
test statistic. Under the column "F" are the rejection proportions for the F-test where the
null model is f(x1) = β0 + β1x1 versus the unconstrained model (4) as the alternative.
4 Application
The methods are applied to data from a study of micronutrient levels in blood-plasma
(Nierenberg et al., 1989). High levels of micronutrients such as beta carotene are thought
to be cancer-preventive, and this study was concerned with how diet and lifestyle factors
influence these levels. Micronutrient blood plasma levels were observed for 311 healthy
subjects; the response variable is blood-plasma beta carotene levels (BPBC). Diet variables
were also collected, including daily intake of calories, fat, fiber, cholesterol, dietary beta
carotene, and dietary retinol. These predictors exhibit some skewness and a high degree of
multicollinearity, so a principal component analysis was performed on the log of the diet
variables. The first principal component has all positive weights, so is a measure of overall
amount consumed. The second principal component has large positive weights on fat and
cholesterol, large negative weights on fiber and dietary beta carotene, and small weights
on the other two predictors. This will be called the “junk food” predictor in our model.
Also recorded is smoking status where 1=never smoker, 2=former smoker, and 3=current
smoker. The logarithm of the BPBC is plotted against the logarithm of body mass index
(BMI) in Figure 3(a) where an overall negative trend, perhaps curved, is indicated. The
scatterplot of log(BPBC) against junk food is presented in Figure 3(b) where again a
negative trend is seen. The two predictors are plotted against each other in Figure 3(c),
where it is seen that the points form an inseparable set.
Figure 3: Relationships between variables for the beta carotene data. The plot character
and shading represent smoking status.

Ignoring the two continuous predictors, a standard one-way analysis of variance indicates
that the smoking variable is a significant predictor of log(BPBC), with p = .0022. The next
issue is to determine whether the two continuous predictors are necessary additions; that is,
do these variables explain a significantly larger amount of variation in log(BPBC), after the
variation due to smoking is accounted for? A parametric model makes strong assumptions
about the relationships between the variables. Most parametric models are constructed using some model
selection procedure, after which the usual inference based on t and F -distributions might
be suspect because of data-snooping issues. A completely flexible model uses only a-priori
assumptions about the relationships. Suppose that nutritionists agree that blood plasma
nutrient levels should be non-increasing in both BMI and junk food consumption.
To test the significance of the continuous predictors in the presence of the smoking variable using only the assumption of non-increasing relationships, the B̄01 test with categorical
covariate that was outlined in the last section is used. It is readily verified that for these
data, the null space of the constraint matrix has only three dimensions, and is spanned
by indicators for the three smoking levels. In other words, the set of ordered pairs of the
two continuous predictors for each level of smoking is inseparable. The p-value for the B̄01
test is about 5 × 10⁻¹⁰, indicating that the non-increasing relationship of these variables
explains a substantial amount of the variation in log(BPBC), even after the smoking effect
is controlled for.
Next, suppose that the researchers can assume that log(BPBC) is non-increasing in
the smoking variable as well, so that we have a three-dimensional isotonic regression. The
regression function consists of three ordered two-dimensional surfaces, one for each level
of the smoking variable, where at each (x1 , x2 ), we must have f (x1 , x2 , 3) ≤ f (x1 , x2 , 2) ≤
f (x1 , x2 , 1). It is harder to determine visually if the predictors form an inseparable set,
but using lemma 1, the null space is found to be one-dimensional and to contain only the
constant vectors. The estimated surfaces are shown in Figure 4.
Figure 4: Surfaces from the multiple isotonic regression, for each of the smoking levels.

Residuals for the fit are shown in Figure 5, plotted against each of the predictors in
(a)-(c). Because of the nature of the isotonic fit, about 10% of the residuals are exactly
zero. This makes a “flat spot” at the origin in the normal probability plot of the residuals
(d), which otherwise follow roughly a straight line. When the zero values are removed, the
residuals do not show significant deviations from normality, so the parametric bootstrap
may be used in tests of significance of the individual effects.
Figure 5: Residuals for the three-dimensional isotonic regression of log(BPBC) on log(BMI),
junk food, and smoking status, plotted against the three predictors in (a)-(c). The normal
probability plot is (d).

Testing the significance of the junk food variable in the presence of the other two requires
a bootstrap test. The null hypothesis is that the expected level of log(BPBC) depends only
on the log(BMI) and the smoking variable, and the alternative is that it depends on all
three (the isotonicity assumption holds for both null and alternative). The null fit is shown
in the first plot of Figure 6. The test statistic is T = (SSE0 − SSE1 )/SSE0 = .28, where
the SSE1 is computed from the fit shown in Figure 4. To compare this value with the
null distribution of values, bootstrap data sets are generated using the null hypothesis fit
and the estimate of the model variance using the full model. For each data set, the null
and alternative fits are computed, along with the bootstrap value of T . The histogram of
10,000 values of the simulated distribution of the test statistic, shown in the second plot of
Figure 6, has all but one of the values less than the observed value (marked with “X”), so the
null hypothesis can be confidently rejected. For the significance of the log(BMI) variable,
the null fit uses the junk food and smoking variables; here the bootstrap values are all
below the observed test statistic value. For testing the significance of the smoking variable,
about 2% of the bootstrap values are above the observed T , providing some evidence that
significantly more of the variation in log(BPBC) is explained by the three-dimensional
isotonic model, compared with the two-dimensional model without the smoking variable.
Figure 6: Testing the significance of the junk food variable in the presence of the other two
predictors.

To investigate further the effect of the smoking variable, bootstrap methods can be used
to find confidence intervals for the regression function at fixed values of the predictors. Using fˆ and σ̂ 2 , bootstrap data sets can be simulated, using normal errors. The predicted
values of log(BPBC) are computed for 10,000 bootstrap samples, at each of the three smoking levels, at the mean values of the other two predictors. Figure 7 shows the three bootstrap
distributions of predicted values on the same scale. The estimated 95% confidence intervals
for the surface at log(BMI)=3.24 and junk food=0 are: for never-smokers, (4.82,5.22); for
former smokers, (4.64,5.03), and for current smokers, (4.41,4.90).
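A sketch of the interval construction, reusing the iso.fit stand-in and the interp.iso sketch of section 2 (all names illustrative):

iso.boot.ci <- function(y, X, amat, x0, nboot = 10000, level = 0.95) {
  ## Sketch: parametric bootstrap percentile interval for f at a fixed x0
  fit <- iso.fit(y, amat)
  sighat <- sqrt(sum((y - fit$thetahat)^2) / fit$d)
  preds <- replicate(nboot, {
    yb <- fit$thetahat + rnorm(length(y), sd = sighat)
    interp.iso(iso.fit(yb, amat)$thetahat, X, x0)  # predicted value at x0
  })
  quantile(preds, c((1 - level) / 2, 1 - (1 - level) / 2))
}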
Figure 7: Bootstrap distributions of the predicted log(BPBC) for each of the smoking
variable levels, at the mean values of log(BMI) and junk food.

The smoking variable has been found to be significant with p = .02. To test whether all
three levels are significantly different, we can again use a bootstrap method. The null model
combines never-smokers with former smokers; 10,000 bootstrap data sets were simulated,
and about 11% of the samples had B01 values larger than the observed B01 . There is not
strong evidence for a difference between mean log(BPBC) for never-smokers and former
smokers, when the effects of both log(BMI) and junk food are accounted for. However,
when the null model combines former smokers with current smokers, the bootstrap test
provides a p-value of about .03, providing some evidence that current smokers have lower
average log(BPBC) compared with former smokers.
Figure 8 shows the estimated regression function for the final model, where the conclusion is that both log(BMI) and junk food are highly significant predictors of log(BPBC),
and the smoking variable is significant at α = 0.05, with a p-value of about .02. Bootstrap
confidence intervals can be obtained for any value x0 of the predictors, where x0 is
comparable with at least one observed xi.
Figure 8: Estimated regression surfaces for two smoking groups.

5 Discussion
The isotonic regression is a nonparametric function estimation method that does not require
user-defined inputs like bandwidth, penalty parameters, or knot choices. Minimal assumptions about the regression function translate to minimal opportunities for mis-specification
of the model, and often result in higher power for the hypothesis tests. This paper presents
an exact test for the null hypothesis that the regression function is constant or depends
only on a categorical covariate, against the alternative that the regression function is isotonic in p variables. The simulations in section 3 indicate that if isotonicity assumptions
are valid, using the constrained nonparametric test can provide better power than using
the unconstrained parametric test, even when the parametric assumptions are true.
Often practitioners use linear models as defaults, because of the “Occam’s razor” rule
of thumb in statistical data analysis that the simpler model should be chosen over a more
complicated model, if there is not good evidence that the more complicated model is necessary. However, in data analysis, “simpler” should mean “fewer assumptions” rather than
“more tractable math.”
Computational Speed of Algorithm
For the blood-plasma micronutrients data of the last section, when n = 311 and p = 2,
the number of polar cone edges, i.e., the number of rows of the irreducible constraint matrix,
is m = 1321. The polar cone has a huge number of faces, but the hinge algorithm finds
the correct face and computes the projection in 303 steps, taking 1.3 CPU seconds on a Mac
PowerBook with a 3.06 GHz processor, using the programming language R. The sparsity
of the matrix of cone edges allows for further efficiencies, so that in a compiled language
like Fortran or C++, the computations would likely be much faster. The computing time
increases rapidly with sample size. The simulations in section 3 involved predictors in
grids of size 10 × 10, 15 × 15, and 20 × 20. The 10,000 cone projections take about 20
minutes for the smallest grid, a few hours for the n = 225 case, and almost two days for
the largest sample size. For n = 900 and a 30 × 30 grid, a single projection takes about
half an hour. The current code for isotonic regression and the B01 test can be found at
www.stat.colostate.edu/~meyer/multipleiso.htm.
A Details
Proof of Lemma 1
Let X be a set of points in IRp , and let A be its constraint matrix for isotonic regression
before reducing, so there is a row for every comparable pair. Let V be the null space of
A (it is easy to see that the null space of the reduced constraint matrix is also V ). First,
suppose θ ∈ V and θa ≠ θb for some 1 ≤ a, b ≤ n. Let Xa be the set of x ∈ X that are
comparable to xa, and let Xb be the set of x ∈ X that are comparable to xb. If xj is in the
intersection, then there is a row of A where the jth element is −1 and the ath element is +1
(or vice-versa), as well as a row where the jth element is −1 and the bth element is +1 (or
vice-versa), so that θa = θj and θb = θj. Therefore if θa ≠ θb, Xa ∩ Xb is empty and X is
separable. Second, suppose X is inseparable and θ ∈ V. For any a ≠ b, 1 ≤ a, b ≤ n, define
Xa and Xb as above; then because of inseparability there must be an xj in the intersection,
and hence θa = θb , and only constant vectors are in V .
Edges of the constraint cone
The constraint cone (2) can be defined by a set of edges δi, i = 1, . . . , M, so that

    C = { θ ∈ IRn : θ = v + Σ_{i=1}^M ci δi, for v ∈ V and ci ≥ 0 }.

The set of edges is unique up to multiplicative factors if we stipulate that they must
be orthogonal to V. If the m × n constraint matrix A is full row-rank, then it is well
established that M = m = n − r, where r = dim(V), and the edges of the constraint cone
are the columns of A′(AA′)⁻¹. For the case of more constraints than dimensions, the
following proposition, proved in Meyer (1999), shows how to obtain the constraint cone
edges that are orthogonal to V , the null space of A. Let m0 be the dimension of the space
spanned by the rows of A, and suppose v 1 , . . . , v r span V .
Proposition 3 For J ⊂ {1, . . . , m}, if dim span{γ j , j ∈ J} = m0 − 1, then the space
orthogonal to span{v 1 , . . . , v r , γ j , j ∈ J} is a line through the origin containing the vectors
δ and -δ, say. If Aδ ≥ 0, then δ is an edge of the cone {θ : Aθ ≥ 0}. Conversely, all
edges are of this form.
If m is large compared with m0 , this exhaustive method for finding constraint cone edges
is very computationally intensive, and M can be much larger than m.
Outline of proof of (3)
Let δ 1 , . . . , δ M be the constraint cone edges. Proposition 2 of Meyer (1999) shows that
the M × n matrix ∆ whose rows are the constraint cone edges is irreducible and is the
constraint matrix for the polar cone. For each J ⊆ {1, . . . , m}, the set
    CJ = { v + Σ_{j∈J} bj γj + Σ_{i∈I(J)} ci δi : v ∈ V, bj > 0 for j ∈ J, and ci ≥ 0 for i ∈ I(J) }

is called a sector, where I(J) = {i : γj′ δi = 0 for all j ∈ J}. Proposition 5 of Meyer (1999)
can be applied to show that the sectors cover IRn . If A is full row rank, the sectors are
disjoint; if A is not full row rank but is irreducible, the sectors may overlap but the overlap
has Lebesgue measure zero. The sector CJ is itself a polyhedral convex cone, and its edges
are γj, j ∈ J and δi, i ∈ I(J). For y = v + Σ_{j∈J} bj γj + Σ_{i∈I(J)} ci δi ∈ CJ, the projection of
y onto the constraint cone C is v + Σ_{i∈I(J)} ci δi, and its residual Σ_{j∈J} bj γj is the projection of y
onto the polar cone C 0. Hence, the sector CJ contains all y ∈ IRn whose projection onto C
lands on the face FJ . For full row-rank A, the representation of y is unique; for irreducible
A the y for which the representation is not unique are in a set of Lebesgue measure zero,
and in any case, the projections onto the cones are unique.
For J ⊆ {1, . . . , m}, let SJ be the linear space spanned by γ j , j ∈ J. The sector CJ
is a polyhedral cone containing the linear space V, with edges γj, j ∈ J and δi, i ∈ I(J).
There is a polar cone CJ0 that has a constraint matrix A0J , say, and by proposition 2 of
Meyer (1999), the rows of A0J are −γ j , j ∈ J and −δ i , i ∈ I(J). Therefore, the rows of
A0J are either orthogonal to SJ or contained in SJ . Proofs of the following lemmas can be
found in Raubertas (1986) or Meyer (2003).
Lemma 2 Let Z ∼ Nn(0, I) and consider a d-dimensional linear subspace S of IRn. Let A
be an m × n matrix such that the rows of A are orthogonal to S, and let ρ̂ be the projection
of Z onto S. Then the conditional distribution of Σ_{i=1}^n ρ̂i², given AZ ≥ 0, is χ²(d).
Lemma 3 Let Z ∼ Nn(0, I) and consider a d-dimensional linear subspace S of IRn. Let A
be an m × n matrix such that the rows of A are contained in S, and let Ẑ be the projection
of Z onto S. Then the conditional distribution of Σ_{i=1}^n Ẑi², given AZ ≥ 0, is χ²(d).
By construction of the constraint matrix A for isotonic regression, the constant vectors
are a subset of V. If y ∈ CJ, then y = ȳ1 + u + ρ̂ + θ̂, where u is the projection of y onto
the (r − 1)-dimensional subspace of V that is orthogonal to 1, ρ̂ is the projection of y onto SJ
by proposition 2, and θ̂ is the projection of y onto the (n − r − d)-dimensional linear subspace
that is orthogonal to SJ and V. If E(y) = µ ∈ V, the projection of y onto C 0 is equivalent
to the projection of (y − µ)/σ onto C 0, because C 0 ⊥ V. Therefore, by lemmas 2 and 3
and the normal errors assumption, the distribution of ‖ρ̂‖², given that y ∈ CJ, is χ²(d),
where d is the dimension of SJ. If SSE0 = Σ_{i=1}^n (yi − ȳ)² = ‖y − ȳ1‖², then
(SSE0 − SSE1)/σ² is distributed as χ²(n − 1 − d) under the null hypothesis that E(y) is
constant. Then (3) follows by the independence of SSE0 − SSE1 and SSE1, and the result
that if X1 and X2 are independent χ²(d1) and χ²(d2) random variables, respectively, then
X1/(X1 + X2) is Beta(d1/2, d2/2).
References
[1] Bacchetti, P. (1989) Additive Isotonic Models. Journal of the American Statistical
Association 84(405) 289-294.
[2] Best, M.J. and Chakravarti, N. (1990) Active Set Algorithms for Isotonic Regression;
a Unifying Framework. Mathematical Programming 47 425-439.
[3] Barlow, R.E. and Brunk, H.D. (1972) The Isotonic Regression Problem and its Dual.
Journal of the American Statistical Association 67(337) 140-147.
[4] Bartholomew, D. J. (1961) A test of homogeneity of means under restricted alternatives. Journal of the Royal Statistical Society, Series B 23(2) 239-281.
[5] Brunk, H.D. (1955) Maximum likelihood estimates of monotone parameters. Annals
of Mathematical Statistics 26(4) 607-616.
[6] Burdakov, O., Sysoev, O., Grimvall, A., and Hussian, M. (2006) An O(n²) Algorithm
for Isotonic Regression. Large Scale Nonlinear Optimization. Series: Nonconvex Optimization and Its Applications, Springer-Verlag, 83 pp. 25-33.
[7] Dykstra, R.L. (1981) An isotonic regression algorithm. J. Statist. Planning and Inference 5 355-363.
[8] Dykstra, R.L. and Robertson, T. (1982) An algorithm for isotonic regression for two
or more independent variables. Annals of Statistics, 10 708-716.
[9] Gebhardt, F. (1970) An Algorithm for Monotone Regression with One or More Independent Variables. Biometrika, 57(2) 263-271.
[10] Kudô, A. (1963) A multivariate analogue of the one-sided test. Biometrika 50(3) 403-418.
[11] Meyer, M.C. (1999) An Extension of the Mixed Primal-Dual Bases Algorithm to the
Case of More Constraints than Dimensions, Journal of Statistical Planning and Inference 81, 13-31.
[12] Meyer, M.C. (2003) A Test for Linear vs. Convex Regression Function using Shape-Restricted Regression, Biometrika 90(1), 223-232.
[13] Meyer, M.C. (2010) An Algorithm for Quadratic Programming with Applications in
Statistics, Technical Report, Colorado State University.
[14] Meyer, M.C. and Wang, J. (2010) Hypothesis Tests in Constrained Parametric Regression, Technical Report, Colorado State University.
[15] Nomakuchi, K. and Shi, N.Z. (1988) A Test for a Multiple Isotonic Regression Problem,
Biometrika 75 181-184.
[16] Qian, S. and Eddy, W.F. (1996) An Algorithm for Isotonic Regression on Ordered
Rectangular Grids. Journal of Computational and Graphical Statistics 5(3) 225-235.
[17] Raubertas, R.F., Chu-In, C.L., and Nordheim, E.V. (1986) Hypothesis tests for
normal means constrained by linear inequalities. Commun. Statist. 15 2809-2833.
[18] Silvapulle, M.J. and Sen, P.K. (2005) Constrained Statistical Inference. John Wiley &
Sons, New York.
[19] Spouge, J., Wan, H., and Wilbur, W.J. (2003) Least Squares Isotonic Regression in
Two Dimensions, Journal of Optimization Theory and Applications 117(3) 585-605.