
Determining Goodness of Fit for a Binary Dependent
Variable Model with the heatmapFit Package in R
Justin Esarey
Rice University
Jericho Du
Rice University
August 6, 2014
Abstract
This paper presents the 2.0 release of the heatmapFit package in R, extending the
work of Esarey and Pierce (2012). The heatmapFit package provides a simple, reliable,
and intuitive way to assess fit quality of models where predicting the probability of
stochastic phenomena is the goal (as compared to, e.g., classification of conceptually
separable outcomes). The new release extends the package’s capability to a much
broader array of models, makes it computationally feasible to perform the assessment
in very large data sets, and allows both in- and out-of-sample fit assessment.
1 Interpretable fit statistics for binary dependent variables
It can be challenging to determine whether a statistical model is a good fit to a binary dependent variable y in a substantively interpretable way. Likelihood-based fit
statistics like the AIC (Akaike, 1998) are useful for comparing alternative models, but
not useful for assessing whether any of these models achieves a qualitatively adequate
fit; these statistics have little intuitive interpretation. The classification approach to
fit assessment, such as that embodied in the receiver operating characteristic (ROC)
curve or the percent correctly predicted (PCP), calculates how well a model can match
a model-predicted ŷ ∈ {0, 1} to the observed y ∈ {0, 1}. This is easily understood,
but applies best where perfect knowledge of the data generating process (DGP) would
imply a perfect ability to predict outcomes. Many important applications do not fit
the classification paradigm. In these circumstances, the classification approach to fit
assessment can indicate a poor fit even when the fit is perfect, or a good fit when the
fit is poor (Esarey and Pierce, 2012).
As an alternative to classification-based fit statistics, the heatmapFit package in R compares a model's predicted P̂r(y = 1) to the empirically observed frequency of y = 1 in a data set. The idea, as presented in Esarey and Pierce (2012), is to treat the fitted P̂r(y = 1) as the independent variable in a non-parametric regression, with y as the dependent variable. If the fit is good, then at any point p for which P̂r(y = 1) ≈ p we should observe a proportion ≈ p of observations where y = 1. Equivalently, a good fit implies that a loess regression should make predictions R(p) that approximate the straight line R(p) = p. The approach is broadly similar to that of the Hosmer-Lemeshow fit statistic (Hosmer and Lemeshow, 1980; see also Copas, 1983; le Cessie and van Houwelingen, 1991; Firth, Glosup and Hinkley, 1991; Hart, 1997; Azzalini, Bowman and Hardle, 1989).
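The comparison underlying the heat map can be sketched in a few lines of base R (an illustrative sketch, not the package's internal code):

```r
# Sketch: regress y non-parametrically on the model's fitted probabilities
# and compare the smoothed curve R(p) to the 45-degree line R(p) = p.
set.seed(1)
x <- runif(5000)
y <- as.numeric(runif(5000) < pnorm(2 * x - 1))       # DGP: Pr(y = 1) = pnorm(2x - 1)
mod <- glm(y ~ x, family = binomial(link = "probit")) # correctly specified model
p.hat <- predict(mod, type = "response")              # fitted Pr(y = 1)
fit <- loess(y ~ p.hat)                               # non-parametric R(p)
grid <- seq(0.2, 0.8, by = 0.1)
R.p <- predict(fit, data.frame(p.hat = grid))
round(cbind(p = grid, R.p = R.p), 2)                  # R(p) tracks p for a good fit
```

With a correctly specified model the smoothed empirical frequencies stay close to the diagonal; a misspecified model produces systematic departures.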
To distinguish misspecification from noise, the heatmapFit package uses parametric bootstrapping to approximate the sampling distribution of R(p). For each observation i, a bootstrap observation y_i^b is drawn from the binomial distribution with the model-predicted probability P̂r(y_i = 1). The bootstrapped data set is used to produce a non-parametric prediction R^b(p) with a loess regression of y_i^b on P̂r(y_i = 1). After this process is repeated many times, the R(p) from the original data set can be compared to the bootstrap distribution of R^b(p) to determine a one-tailed p-value = min(q(p), 1 − q(p)), where q(p) is the quantile of R(p) in the bootstrapped distribution of R^b(p) at every point p. Lower p-values indicate that the observed deviation from perfect fit is less attributable to sampling variation.
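The bootstrap comparison can likewise be sketched in base R (names and grid points are illustrative, not the package's internals):

```r
# Parametric bootstrap sketch: redraw y from the fitted probabilities,
# re-smooth, and locate the observed R(p) within the bootstrap distribution.
set.seed(2)
x <- runif(2000)
y <- as.numeric(runif(2000) < pnorm(2 * x - 1))
p.hat <- predict(glm(y ~ x, family = binomial(link = "probit")), type = "response")
grid <- c(0.25, 0.50, 0.75)
R.obs <- predict(loess(y ~ p.hat), data.frame(p.hat = grid))
B <- 200
R.boot <- replicate(B, {
  y.b <- rbinom(length(p.hat), 1, p.hat)              # bootstrap draw under the model
  predict(loess(y.b ~ p.hat), data.frame(p.hat = grid))
})
q <- rowMeans(R.boot <= R.obs)                        # quantile of R(p) among the R^b(p)
p.val <- pmin(q, 1 - q)                               # one-tailed p-value at each p
```

Small p-values flag points p at which the observed deviation from the diagonal would rarely arise from sampling variation alone.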
Consider a case where the DGP is a latent random utility model: y* = Xβ + ε, where y = 1 if y* > 0 and y = 0 otherwise. It is impossible to perfectly predict y without foreknowledge of the pure noise random error term ε ∼ Φ(µ = 0, σ = 1), even if we know Xβ perfectly. The appropriate estimand is Pr(y = 1), not y. Neither likelihood-based fit statistics (like AIC) nor classification-based fit statistics (like ROC) can tell us whether P̂r(y = 1) is a good predictor of Pr(y = 1), but the heatmapFit package can.

Listing 1: Code to produce a heat map plot for a correctly specified model

require(heatmapFit)
set.seed(123456)
x <- runif(20000)
y <- as.numeric( runif(20000) < pnorm(2*x - 1) )
mod <- glm( y ~ x, family=binomial(link="probit"))
pred <- predict(mod, type="response")
heatmap.fit(y, pred, reps=1000)
2 Demonstrative examples
The heatmapFit package automates the process described in the previous section and
presents the results visually and quantitatively.
2.1 A correctly specified model
To demonstrate what a well-fitting model looks like in a heatmapFit analysis, consider
the sample R code in Listing 1. This code generates a data set out of a simple random
utility model, then estimates a correctly specified probit model on the data. It then
calculates in-sample model predictions for Pr(y = 1) and uses these predictions (pred)
and the dependent variable y to construct the heatmap plot. This plot is shown in
Figure 1.
[Figure 1: Heat map plot for a correctly specified probit model. Panel "Model Predictions vs. Empirical Frequency": heat map line vs. perfect-fit line, x-axis "Model Prediction, Pr(y=1)," y-axis "Smoothed Empirical Pr(y=1)"; p-value legend at right.]

Figure 1 depicts the in-sample predictions of a very well-fitted model. The "heat map line" is the prediction R(p) of a loess model predicting y using the probit model's predicted probability pred; the bandwidth parameter of the loess model is automatically selected to minimize the AICc, a modification of the Akaike Information Criterion (Hurvich, Simonoff and Tsai, 1998). The figure indicates a heat map line that almost perfectly coincides with the perfect prediction line R(p) = p, indicating a very good fit. The histogram at the bottom of the plot indicates the distribution of the data across the range of model predictions; in this example, predictions are distributed uniformly across the space.
The color of the heat map line shows the bootstrap one-tailed p-value for R(p) at
that point. Lighter shades indicate that gaps between R(p) and p are well-within the
sampling variation for R(p) generated by bootstrap resampling; thus gaps between R(p)
and the perfect prediction line are attributable to sampling variation. Darker shades
indicate that these gaps are outside the sampling distribution created by bootstrapping,
and are therefore indicative of model misspecification. The light color of the heat map line over its entire range in Figure 1 indicates that deficiencies in fit quality for this model are attributable to sampling variation. In fact, the heatmapFit routine reports that 0% of the observed values of y have bootstrap one-tailed p-values ≤ 0.10; as a heuristic rule of thumb (and based on simulation evidence), Esarey and Pierce (2012) consider the fit to be acceptable as long as fewer than 20% of observations have p-values in this range.

Listing 2: Code to produce an ROC curve

library(ROCR)
pred.r <- prediction(pred, y)
perf <- performance(pred.r, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(10))
# area under the ROC curve
performance(pred.r, measure="auc")@y.values
By comparison, consider the assessment of the same model in the same data set
created by the ROC curve; this assessment is produced by the code in Listing 2. The
resulting ROC curve is plotted in Figure 2. Although we know that this model is a perfect fit to the data, the ROC curve is far from the perfect gamma-shaped classification
curve that we would expect from an excellent fit. Indeed, the area under the ROC curve
of ≈ 0.75 suggests only “acceptable” discrimination (Hosmer and Lemeshow, 2000, p.
162). By contrast, the heatmap curve of Figure 1 accurately indicates an excellent fit
to the data.
[Figure 2: ROC plot for a correctly specified probit model. True positive rate vs. false positive rate; area under the curve = 0.748.]

2.2 A misspecified model

What does a misspecified model look like in heatmapFit? Consider the example generated by the R code in Listing 3. This code generates data from a random utility model with a curvilinear utility index (generated by a squared term of the independent variable x), then estimates a standard probit model on the data with no attempt to model the nonlinearity. As a result, the model is substantially misspecified. This misspecification is revealed by a plot generated by heatmapFit, as shown in Figure 3.

Listing 3: Code to produce a heat map plot for a misspecified model

set.seed(13579)
x <- runif(20000)
y <- as.numeric( runif(20000) < pnorm(0.5 - 2*x^2) )
mod <- glm( y ~ x, family=binomial(link="probit"))
pred <- predict(mod, type="response")
heatmap.fit(y, pred, reps=1000)
Figure 3 shows deviations between the model’s predicted Pr(y = 1) and the loess
smoothed R(p), as measured by the distance from the dashed perfect prediction line to
the heat map line. For example, observations whose model predicted Pr(y = 1) ≈ 0.45
are underpredicted by the model; observations with this predicted probability have an
≈ 0.5 probability of being observed as y = 1 in the sample. This is shown in Figure 3 as a positive vertical distance between the heat map line and the dashed perfect prediction line.

[Figure 3: Heat map plot for a misspecified probit model. Heat map line vs. perfect-fit line; x-axis "Model Prediction, Pr(y=1)," y-axis "Smoothed Empirical Pr(y=1)"; p-value legend and histogram of predictions.]

By comparison, observations whose model predicted Pr(y = 1) ≈ 0.8 are overpredicted by the model; observations with this predicted probability have an ≈ 0.7 probability of being observed as y = 1 in the sample. This overprediction is represented in the plot as a negative vertical distance between the heat map and perfect
prediction lines. Finally, the heat map line is darkly colored throughout, indicating
that deviations from perfect prediction are much larger than we would expect due to
sampling variation. The program indicates that 91.8% of observations have one-tailed
bootstrap p-values ≤ 0.10, indicating misspecification.
Interestingly, an ROC assessment of this misspecified model is very similar to the
ROC assessment of the correctly specified model from before; we show the ROC curve
for the model in Figure 4. The area under this ROC curve is 0.736, substantively
identical to the area of 0.748 for the correct model. Thus, the ROC seems to have
limited ability to distinguish between correctly and incorrectly specified models in this scenario.

[Figure 4: ROC plot for a misspecified probit model. True positive rate vs. false positive rate; area under the curve = 0.736.]
3 New features in the 2.0 release
The heatmapFit package was originally released concurrent with the publication of the
underlying methodology in Esarey and Pierce (2012). This 2.0 release is applicable
to a much wider variety of models, improves usability in large data sets, and makes
out-of-sample fit assessment possible.
3.1 Applicability to a wider variety of models
The original heatmapFit software package could only create heat map plots for basic
generalized linear models (viz., the probit and logit). More complex binary dependent variable models, such as Bayesian hierarchical approaches or dynamic panel models, were not assessable.
The new software release allows heat map plots to be created for any model that can
produce predicted probabilities for each observation. This includes models that are
estimated outside of R in other software packages, such as Stata or SAS, as long as a
vector of the model’s predictions can be exported to a data file and then read into R.
This change opens up a considerably wider variety of applications.
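Because only a vector of predictions is needed, the import route is simple. A sketch (the file and column names are hypothetical stand-ins for output from another package):

```r
# Write and re-read a CSV of outcomes and predictions, as if exported
# from Stata or SAS; then pass the columns to heatmap.fit.
set.seed(6)
p <- runif(1000)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(y = rbinom(1000, 1, p), phat = p), tmp, row.names = FALSE)
dat <- read.csv(tmp)                                  # stands in for an external export
# heatmap.fit(dat$y, dat$phat, reps = 1000)           # fit assessment as usual
```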
As an example, we replicate a portion of the analysis of Calvo and Sagarzazu (2011),
who study the relationship between majority status in the legislature and the degree of
committee-level gatekeeping authority exercised by the largest party. Our specific aim
is to replicate Model A in Table 3 on p. 11 of the article; this is a multilevel logistic
model of legislative successes in committee when the largest party holds a majority on
the floor and the committee chairmanship. Each of the 27,635 observations is a bill-committee pairing in the lower house of the Argentine Congress between 1984 and 2007;
some bills are assigned to more than one committee and are therefore observed more
than once. The multilevel component of the model is a random intercept (by congress).
Because the lmer function in lme4 is used to estimate this model, the earlier version
of heatmapFit would not have been able to perform an assessment of model fit.
The dependent variable is “success in committee,” a binary variable which “takes the
value of 1 if the proposal receives a joint dictamen (dossier reporting the bill for further
consideration by the Chamber) or the value of 0 if it dies in committee (cajoneada)”
(Calvo and Sagarzazu, 2011, p. 8). One primary independent variable of interest is the
squared ideological distance between a bill’s lead sponsor and the median committee
member of the majority party, as determined by a principal component analysis of
cosponsored legislation (p. 9). The authors believe that, when the largest party holds
the majority in parliament and chairs the committee, ideological distance between the
bill’s lead sponsor and the median committee member from the majority party will be
negatively associated with success. Another variable of interest is the squared ideological distance between the lead sponsor and the median committee member (including
minority party members). In the context of a party holding both a floor and committee
majority, the authors believe that proximity to the overall committee median will not
be an important predictor of success.
Table 1: Replication of Table 3, Model A in Calvo and Sagarzazu (2011)

                                                     Dependent variable:
                                                     Success in Committee
Dist. from Sponsor to Cmte. Median                     0.259**   (0.105)
Dist. from Sponsor to Median of Majority on Cmte.     −0.791***  (0.090)
Control variables and random intercepts omitted.
Observations                                          27,635
Log Likelihood                                        −7,692.847
Akaike Inf. Crit.                                     15,433.690
Bayesian Inf. Crit.                                   15,631.140
Note: * p<0.1; ** p<0.05; *** p<0.01
Our replication of the model is depicted in Table 1; the table omits control variables and random intercepts in order to focus on the primary independent variables
of interest. The results, which are identical to those reported in the published paper,
match the authors’ expectations. Our present concern is whether the model provides
an acceptable in-sample fit to the data, or whether there is evidence for misspecification. The likelihood-based fit statistics (ln L, AIC, and BIC) reported in Table 1 do
not provide us with this information, as they are designed for model comparison rather
than model assessment.
To make our assessment of model fit, we apply the heatmapFit algorithm to the
model. We generate predictions from the model using the predict command, then
input these predictions and the dependent variable into heatmap.fit to generate the
plot. We depict this plot in Figure 5.
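The workflow can be sketched with simulated stand-in data, since the replication data are not bundled here (the model structure below is a simplified illustration of a random-intercept logit, not the authors' full specification; glmer is the generalized-model interface in current lme4 releases):

```r
# Random-intercept logit via lme4, then per-observation predictions
# suitable for heatmap.fit.
library(lme4)
set.seed(3)
congress <- factor(sample(1:12, 5000, replace = TRUE))
dist <- runif(5000)                                   # stand-in ideological distance
eta <- -1.5 - 0.8 * dist + rnorm(12)[congress]        # linear predictor w/ random intercepts
y <- rbinom(5000, 1, plogis(eta))
mod <- glmer(y ~ dist + (1 | congress), family = binomial)
pred <- predict(mod, type = "response")               # fitted Pr(y = 1) per observation
# heatmap.fit(y, pred, reps = 1000)                   # heat map assessment as in Listing 1
```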
[Figure 5: Heat map plot for the model in Table 1. Heat map line vs. perfect-fit line; x-axis "Model Prediction, Pr(y=1)," y-axis "Smoothed Empirical Pr(y=1)"; p-value legend and histogram of predictions.]

The heat map plot shows that this model is a good fit to the sample data. The heat map line in Figure 5 is very close to the perfect fit line, and none of the deviations are statistically distinguishable from sampling variation. The highest predicted probabilities (above 40% probability that y = 1) are the most inaccurate, but they occur only slightly more frequently in the data set than predicted. Note also that, according to the histogram at the base of the plot, these high-probability predictions are comparatively rare; most bills have < 10% chance of success in committee. The distribution is slightly skewed right, indicating that most bills are likely to fail but a small subset of bills are comparatively sure bets.
3.2 Better handling of large data sets
As we noted above, the Calvo and Sagarzazu (2011) data set contains over 25,000 observations. The previous implementation of heatmapFit produced results very slowly in
data sets of this size; larger data sets could fail to produce any result due to insufficient memory.
Our new implementation of heatmapFit provides much faster performance on even
very large data sets. Some of these refinements are internal and comparatively technical. For example, the new version of the program evaluates each bootstrap replicate of
R(p) as it is created and then discards the result rather than saving all results into a
matrix and evaluating them together; this change saves a great deal of memory.
The primary improvement for large data sets is the heatmap.compress function
within heatmap.fit, which collapses large data sets into a set of bins before creating a
loess estimate of R(p) or generating bootstrap replicates. heatmap.compress takes the
sample and sorts it into a set of 2,000 bins based on each observation’s model-predicted
Pr(y = 1); the number of bins can be changed through the init.grid parameter of
heatmap.fit. Inside of each bin, two observations are produced, one where y = 1 and
the other where y = 0, with an observation weight equal to the proportion of the sample
in the bin with the given value of y. We then use the binned data set and accompanying
observation weights to create the weighted loess model. The bootstrapping process is
also aware of the bin structure: in each bin k containing nk observations, a bootstrap
replicate is created by drawing nk samples from the binomial distribution with the
bin's probability Pr(y = 1), creating a bootstrap observation with y = 0 and another
with y = 1, and then setting each observation’s weight to nk times the proportion of
drawn samples with matching y.
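The binning idea can be sketched in base R (an illustrative sketch of the compression step, not the package's internal heatmap.compress):

```r
# Collapse each bin on the fitted probability into two weighted
# pseudo-observations, one with y = 0 and one with y = 1.
compress <- function(y, p, n.bins = 2000) {
  bin <- cut(p, seq(0, 1, length.out = n.bins + 1), include.lowest = TRUE)
  out <- do.call(rbind, lapply(split(data.frame(y, p), bin, drop = TRUE), function(d) {
    w1 <- mean(d$y)                                   # share of the bin with y = 1
    data.frame(y = c(0, 1), p = mean(d$p), w = c(1 - w1, w1) * nrow(d))
  }))
  out[out$w > 0, ]                                    # drop empty pseudo-observations
}
set.seed(4)
p <- runif(100000)
y <- rbinom(100000, 1, p)
small <- compress(y, p)                               # at most 2 * n.bins rows
# loess(y ~ p, data = small, weights = w) then runs on the compressed data
```

The loess fit on the compressed, weighted data is far cheaper than a fit on the raw observations because its size is bounded by the number of bins, not the sample size.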
The speedup gained from data compression is substantial. To illustrate this, we
repeated the analysis of Calvo and Sagarzazu (2011) with and without data compression
and timed the result; the code to accomplish this is in Listing 4. The difference in the
code is simple; the function includes a compress.obs flag that allows data compression
to be turned off. We set the number of bootstrap replicates to 100; this is far fewer
than advisable for a substantive analysis, but convenient to save time for a software
demonstration.
Listing 4: Code to assess data compression
# with data compression
system.time(heatmap.fit(ydat$salioComision, pred, reps=100))
# without data compression
system.time(heatmap.fit(ydat$salioComision, pred, reps=100,
compress.obs=F))
Listing 5: Code to assess data compression in very large data sets
set.seed(123456)
rm(list=ls())
x <- runif(500000)
y <- as.numeric( runif(500000) < pnorm(-x + 2*x^2) )
mod <- glm( y ~ x, family=binomial(link="probit"))
pred <- predict(mod, type="response")
# with data compression
system.time(heatmap.fit(y, pred, reps=100))
# without data compression
system.time(heatmap.fit(y, pred, reps=100, compress.obs=F))
With data compression, the analysis takes 2.29 seconds using a Windows system
with a Core i7 2620M processor and 6 GB of RAM. Without data compression, the
analysis takes 20.39 seconds on the same system. Thus, data compression results in an
analysis that is ≈ 10 times faster in this real-world data set.
The time savings scale with the size of the data set. For example, consider the
simulation of a data set with 500,000 observations shown in Listing 5. With data compression, this analysis takes 5.68 seconds; the data compression stage takes somewhat
longer than it did in the Calvo and Sagarzazu (2011) data set, but fitting the loess
model and bootstrap resampling takes roughly the same time because the compressed
data set is the same size. Without data compression, the analysis takes 458.86 seconds.
In this case, data compression speeds up the analysis by a factor of ≈ 80.
3.3 Out-of-sample assessment
Accurate prediction inside the sample used to fit the model is an important criterion of fit, but it does not rule out the possibility of over-fitting. Our revision of the heatmapFit package allows prediction accuracy to be assessed out-of-sample as well. Our analysis indicates that the ability to assess substantive deviations from perfect fit is more valuable than the bootstrap-based indicator for misspecification in this application; we therefore provide the option to turn off the calculation of bootstrapped p-values by setting calc.boot = FALSE.
3.3.1 Simulation study
Recall the heuristic rule that no more than 20% of observations should have bootstrap
one-tailed p-values ≤ 0.10 in order for fit quality to be considered acceptable. Our
testing of the 2.0 release of heatmapFit has revealed that this criterion is more conservative for out-of-sample fit assessment than it is for in-sample assessment. We simulate 1,000 data sets with 20,000 observations each from two data generating processes:
1. Pr(y = 1) = Φ(2x − 1)
2. Pr(y = 1) = Φ(log(x) + 2)
We then estimate the probit model Pr(y = 1) = Φ(β0 + β1 x) on each data set; this is a correctly specified model for the first DGP but misspecified for the second. We then produce an additional 1,000 out-of-sample observations from the DGP, use the fitted model to predict P̂r(y = 1) for these observations, and then use heatmapFit to assess the fit quality. Finally, we determine what proportion of the 1,000 simulated data sets exceed the 20% heuristic. The code to perform this process for the first DGP is shown in Listing 6.
Listing 6: Code to examine out-of-sample fit assessment

require(heatmapFit)
set.seed(123456)
p.sum <- c()
iter <- 1000
for(i in 1:iter){
  cat("Iteration: ", i, " of ", iter, "\n", "\n", sep="")
  x <- runif(20000)
  y <- as.numeric( runif(20000) < pnorm(2*x - 1) )
  mod <- glm( y ~ x, family=binomial(link="probit"))
  ## out-of-sample prediction
  x <- runif(1000)
  y <- as.numeric( runif(1000) < pnorm(2*x - 1) )
  pred <- predict(mod, type="response", newdata=data.frame(x))
  heat.obs <- heatmap.fit(y, pred, reps=1000, ret.obs=T)
  p.sum[i] <- sum(heat.obs[[1]] <= 0.1)
}
1 - sum(p.sum/1000 >= 0.2)/length(p.sum)
hist(p.sum/1000)
Our results indicate that 0.02% of the misspecified models in our simulation satisfy the heuristic criterion for acceptable fit, but only 52.5% of the correctly specified models do so.¹ That is, for this simulation, satisfying the fit heuristic is a good indicator of satisfactory out-of-sample fit, but failing to satisfy it is not necessarily a definitive indicator of a poor fit.
In their study of in-sample fit assessment using the same heuristic, Esarey and Pierce
(2012) find that properly specified models typically satisfy the heuristic criterion but
that misspecified models sometimes do as well. Thus, for in-sample assessment, the
heat map statistic tends to err on the side of considering some misspecified models acceptable but rarely deeming properly specified models unacceptable; it is similar to the
Hosmer-Lemeshow fit statistic in this regard. Our study of out-of-sample fit assessment
¹ These figures are not meant to be generalizable point estimates for all cases; their exact values will differ according to the many parameters of the situation.
reverses this pattern: it rejects some properly specified models, but rarely accepts misspecified ones. We believe this difference exists due to the nature of the bootstrapping
procedure: out-of-sample predictions are more variable than in-sample predictions, and
the bootstrapping process does not account for this additional variation.
The upshot is that, when performing an assessment of out-of-sample fit, the bootstrap-based p-values of the heat map line are not as informative as the substantive importance
of the distance between the estimated fit R(p) and a perfect fit. That is, the heatmapFit package should be regarded as a way of visually assessing and communicating the
average accuracy of out-of-sample predictions; use of the heuristic test for misspecification should be limited to in-sample assessments. To save computing time and avoid
producing extraneous information, we provide the option of turning off the calculation
of bootstrap-based p-values by setting calc.boot = FALSE in the heatmap.fit function. In the event that p-values are calculated using a hold-out sample, a user may
regard failing to satisfy the heuristic fit assessment criterion as only suggestive (and
not dispositive) of misspecification.
3.3.2 Applied example
We illustrate out-of-sample fit assessment on the Calvo and Sagarzazu (2011) replication by randomly selecting 1,000 observations as a hold-out sample, estimating the
model on the remaining 26,635 observations, predicting committee success in the holdout sample, and then creating a heat map plot using these out-of-sample predictions.
Consonant with the result of our simulation study above, we turned off bootstrapping
for this analysis. The result is shown in Figure 6.
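The hold-out workflow can be sketched on simulated data (the replication data are not bundled here):

```r
# Fit on a training split, predict the hold-out observations, and assess
# out-of-sample fit without bootstrap p-values (calc.boot = FALSE).
set.seed(5)
n <- 20000
dat <- data.frame(x = runif(n))
dat$y <- as.numeric(runif(n) < pnorm(2 * dat$x - 1))
hold <- sample(n, 1000)                               # 1,000-observation hold-out sample
mod <- glm(y ~ x, family = binomial(link = "probit"), data = dat[-hold, ])
pred <- predict(mod, type = "response", newdata = dat[hold, ])
# heatmap.fit(dat$y[hold], pred, calc.boot = FALSE)   # out-of-sample heat map plot
```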
As the figure shows, out-of-sample performance is nearly as good as the in-sample performance shown in Figure 5. The model-predicted probabilities of success in committee are always within a few percentage points of the observed probabilities of success in the hold-out sample. Furthermore, the worst predictions (between 15% and 30% probability of committee success, which tend to be underestimated relative to the hold-out observations) are in a section of the data set with lower data availability (as shown by the histogram at the base of the plot). We therefore conclude that this model has good out-of-sample prediction properties.

[Figure 6: Out-of-sample heat map plot for Table 3, Model A of Calvo and Sagarzazu (2011). Heat map line vs. perfect-fit line; x-axis "Model Prediction, Pr(y=1)," y-axis "Smoothed Empirical Pr(y=1)." Note: bootstrap-based p-values not calculated for this plot.]
4 Conclusion
The 2.0 release of the heatmapFit package in R provides a simple, reliable, and intuitive way to assess fit quality of models where predicting the probability of stochastic
phenomena is the goal. The new release extends the package’s capability to a much
broader array of models, makes it computationally feasible to perform the assessment
in very large data sets, and allows both in- and out-of-sample fit assessment.
References
Akaike, Hirotugu. 1998. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike. Springer, pp. 199–213.
Azzalini, A., A. W. Bowman and W. Hardle. 1989. “On the use of nonparametric
regression for model checking.” Biometrika 76:1–11.
Calvo, Ernesto and Inaki Sagarzazu. 2011. “Legislator Success in Committee: Gatekeeping Authority and the Loss of Majority Control.” American Journal of Political
Science 55(1):1–15.
Copas, J. B. 1983. “Plotting p against x.” Journal of the Royal Statistical Society,
Series C 32:25–31.
Esarey, Justin and Andrew Pierce. 2012. “Assessing Fit Quality and Testing for Misspecification in Binary-Dependent Variable Models.” Political Analysis 20(4):480–
500.
Firth, D., J. Glosup and D. V. Hinkley. 1991. “Model Checking with Nonparametric
Curves.” Biometrika 78:245–52.
Hart, Jeffrey D. 1997. Nonparametric Smoothing and Lack-of-Fit Tests. Springer.
Hosmer, David W. and Stanley Lemeshow. 1980. “A goodness-of-fit test for the multiple
logistic regression model.” Communications in Statistics A10:1043–1069.
Hosmer, David W. and Stanley Lemeshow. 2000. Applied Logistic Regression. Wiley
Interscience.
Hurvich, Clifford M., Jeffrey S. Simonoff and Chih-Ling Tsai. 1998. “Smoothing parameter selection in nonparametric regression using an improved Akaike information
criterion.” Journal of the Royal Statistical Society, Series B 60:271–293.
le Cessie, S. and J. C. van Houwelingen. 1991. “A Goodness-of-Fit Test for Binary
Regression Models, Based on Smoothing Methods.” Biometrics 47:1267–1282.