Remote Sensing of Environment 81 (2002) 253 – 261
www.elsevier.com/locate/rse
Using prior probabilities in decision-tree classification of
remotely sensed data
D.K. McIver , M.A. Friedl*
Department of Geography, Center for Remote Sensing, Boston University, 675 Commonwealth Avenue, Boston, MA 02215, USA
Received 27 November 2000; received in revised form 8 December 2001; accepted 8 December 2001
Abstract
Land cover and vegetation classification systems are generally designed for ecological or land use applications that are independent of
remote sensing considerations. As a result, the classes of interest are often poorly separable in the feature space provided by remotely sensed
data. In many cases, ancillary data sources can provide useful information to help distinguish between inseparable classes. However, methods
for including ancillary data sources, such as the use of prior probabilities in maximum likelihood classification, are often problematic in
practice. This paper presents a method for incorporating prior probabilities in remote-sensing-based land cover classification using a
supervised decision-tree classification algorithm. The method allows robust probabilities of class membership to be estimated from
nonparametric supervised classification algorithms using a technique known as boosting. By using this approach in association with Bayes’
rule, poorly separable classes can be distinguished based on ancillary information. The method does not penalize rare classes and can
incorporate incomplete or imperfect information using a confidence parameter that weights the influence of ancillary information relative to
its quality. Assessments of the methodology using both Landsat TM and AVHRR data show that it successfully improves land cover
classification results. The method is shown to be especially useful for improving discrimination between agriculture and natural vegetation in
coarse-resolution land cover maps. © 2002 Elsevier Science Inc. All rights reserved.
1. Introduction
Information from ancillary data sources has been widely
shown to aid discrimination of classes that are difficult to
classify using remotely sensed data (Brown, Loveland,
Merchant, Reed, & Ohlen, 1993; Foody & Hill, 1996;
Strahler, 1980). Such information is particularly useful in
large area land cover mapping problems that involve cover
types characterized by high levels of within-class variability (Loveland et al., 1999). One means of incorporating
ancillary information is by using prior probabilities of class
membership. Prior probabilities can improve classification
results by helping to resolve confusion among classes that
are poorly separable, and by reducing bias when the
training sample is not representative of the population
being classified.
* Corresponding author. Tel.: +1-617-353-5745.
E-mail addresses: [email protected] (D.K. McIver), [email protected]
(M.A. Friedl).
Until recently maximum likelihood classification has
been the most common method used for supervised
classification of remotely sensed data (Richards, 1993).
This methodology assumes that the probability distributions for the input classes possess a multivariate normal
form. Increasingly, nonparametric classification algorithms
are being used, which make no assumptions regarding the
distribution of the data being classified (Carpenter et al.,
1999; Foody, 1997; Friedl, Brodley, & Strahler, 1999).
The use of such techniques has been motivated by the
increased accuracy and flexibility that these algorithms
provide. In particular, nonparametric classification methods
have proven to be especially useful for continental and
global-scale land cover mapping, where the classes of
interest tend to be multimodal and possess substantial
within-class variance (Friedl et al., 2000; Gopal, Woodcock, & Strahler, 1999; Hansen, DeFries, Townshend, &
Sohlberg, 2000).
In this paper, we expand upon previous work using
nonparametric classification algorithms for remote sensing
applications (McIver & Friedl, 2001). Specifically, we
explore the use of prior probabilities with decision-tree
0034-4257/02/$ – see front matter © 2002 Elsevier Science Inc. All rights reserved.
PII: S0034-4257(02)00003-2
classification algorithms. To accomplish this objective,
the paper has four main components. First, a methodology to incorporate prior probabilities in nonparametric
classification algorithms is presented. This methodology
includes an approach for controlling the influence of
prior probabilities depending on their quality. Second,
we illustrate several issues regarding the influence of
prior probabilities and training sample properties on
classification outcomes from nonparametric classifiers.
Third, we present an assessment of the methodology
using Landsat Thematic Mapper (TM) data. This assessment uses field sites and prior probabilities derived from
terrain data for an area characterized by strong topographic control over vegetation cover. Finally, the technique
is applied to a coarse resolution land cover classification
problem using Advanced Very High Resolution Radiometer (AVHRR) data. In this last case, prior probabilities
estimated from data related to the global distribution of
agricultural land use intensity are used to resolve confusion between natural vegetation and agriculture at continental scales.
2. Methodology
2.1. Boosting
Maximum likelihood classification has been widely
used in remote sensing to estimate per-pixel conditional
probabilities of class membership. While some nonparametric classification algorithms (e.g., decision trees) can
also provide such estimates, they are typically quite
approximate (Breiman, Friedman, Olshen, & Stone,
1984; Friedman et al., 2000). The approach used here is
based on recent theory that explains the behavior of a
machine learning algorithm known as boosting using
statistical theory. This explanation provides a robust
method to estimate conditional probabilities of class membership from nonparametric classification algorithms.
Boosting is one of numerous ensemble classification
methods that are widely used in the machine learning
research community (Freund, 1995). Ensemble methods
are used in conjunction with a base classification algorithm
(e.g., a decision tree or neural network) and have been
widely shown to improve the performance of many classification algorithms. Specifically, boosting improves classification accuracies (Dietterich, 2000; Quinlan, 1996a), is
resistant to overfitting (Bauer & Kohavi, 1999; Schapire,
Freund, Bartlett, & Lee, 1998), and has been shown to be
effective with remotely sensed data (Friedl et al., 1999;
McIver & Friedl, 2001). While boosting is most commonly
associated with decision trees (Breiman, 1998; Quinlan,
1996a), it has also been applied with other supervised
classification algorithms such as artificial neural networks
(e.g., Opitz & Maclin, 1999; Quinlan, 1996b). As a result,
the method described in this work is not unique to deci-
sion trees and can be used with a variety of classification algorithms.
Boosting operates by iteratively estimating multiple
classifiers using a base algorithm while varying the training sample. At each iteration, the training sample is
modified to focus attention on examples that were difficult to classify correctly in the previous iteration. Modifications to the training sample are performed by
maintaining a weight for each training example that is
adjusted at each iteration. A weighted voting scheme is then
used to select the predicted class. Friedl et al. (1999) and
McIver and Friedl (2001) provide discussions of boosting
for remote sensing applications. For a complete description of boosting, see Freund and Schapire (1997) or
Friedman et al. (2000).
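The boosting loop just described (reweight the training sample toward previously misclassified examples, then combine the classifiers by weighted vote) can be sketched as follows. This is an illustrative AdaBoost.M1 implementation that substitutes a weighted decision stump for the C4.5 trees used in the paper; the function names and toy structure are our own.

```python
import numpy as np

def stump_fit(X, y, w):
    """Fit a weighted decision stump: threshold one feature and assign a
    class label to each side, minimizing the weighted error."""
    best, best_err = None, np.inf
    labels = np.unique(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            for ll in labels:
                for rl in labels:
                    pred = np.where(left, ll, rl)
                    err = w[pred != y].sum()
                    if err < best_err:
                        best_err, best = err, (j, t, ll, rl)
    return best, best_err

def stump_predict(stump, X):
    j, t, ll, rl = stump
    return np.where(X[:, j] <= t, ll, rl)

def adaboost_m1(X, y, n_iter=10):
    """AdaBoost.M1 (Freund & Schapire, 1997): at each iteration, fit the
    base learner to the weighted sample, down-weight correctly classified
    examples, and store the classifier with its voting weight."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(n_iter):
        stump, err = stump_fit(X, y, w)
        if err == 0:                      # perfect stump: give it one vote
            ensemble.append((stump, 1.0))
            break
        if err >= 0.5:                    # base learner too weak: stop
            break
        beta = err / (1.0 - err)
        w = np.where(stump_predict(stump, X) == y, w * beta, w)
        w /= w.sum()
        ensemble.append((stump, np.log(1.0 / beta)))
    return ensemble

def weighted_vote(ensemble, X, labels):
    """Predict by accumulating each member's voting weight for its
    predicted class and taking the argmax per example."""
    scores = np.zeros((len(X), len(labels)))
    for stump, alpha in ensemble:
        pred = stump_predict(stump, X)
        for k, lab in enumerate(labels):
            scores[pred == lab, k] += alpha
    return np.asarray(labels)[scores.argmax(axis=1)]
```

With real imagery the stump would be replaced by a full decision-tree learner such as C4.5; the reweighting and voting logic is unchanged.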
Recent theoretical work regarding boosting has shown
it to be a form of additive logistic regression (Collins,
Schapire, & Singer, 2000; Friedman et al., 2000). As a
result, conditional probabilities of class membership can
be obtained along with class predictions. From this
perspective, boosting can be described as an additive
model (Hastie & Tibshirani, 1990), where at each iteration
the base learning algorithm is fit to the classification
residuals (i.e., errors) from the previous iterations (Friedman et al., 2000). That is, the residuals from iteration n
are used to modify the training sample used at iteration
n + 1 to emphasize previously misclassified examples. As
Friedman et al. (2000) show, this interpretation allows
conditional probabilities of class membership to be computed for each pixel.
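Under this additive-model view, the ensemble's per-class weighted vote totals play the role of additive scores F_k(x), and class-membership probabilities follow from a multiclass logistic (softmax) transform. The sketch below illustrates the idea in the spirit of Friedman et al. (2000); the score values are hypothetical, and the exact scaling in their derivation differs by a constant factor.

```python
import numpy as np

def class_probabilities(scores):
    """Map per-class additive scores F_k(x) to class-membership
    probabilities with a multiclass logistic (softmax) transform.
    `scores` is an (n_pixels, n_classes) array of weighted vote totals."""
    z = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical vote totals for two pixels and three classes:
F = np.array([[2.0, 0.5, 0.1],
              [0.2, 0.2, 3.0]])
P = class_probabilities(F)
# each row sums to 1, and the largest score receives the largest probability
```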
A number of different boosting algorithms have been
developed (Friedman et al., 2000; Schapire & Freund,
1999; Schapire et al., 1998). The algorithm used in this
research, Adaboost.M1 (Freund & Schapire, 1997), is the
simplest multiclass boosting method. For this work, Adaboost.M1 was implemented using the C4.5 decision tree
(Quinlan, 1993), which is widely used in the machine
learning community. The approach used throughout this
paper is to obtain probabilities of class membership from
boosting using 10 iterations of the base algorithm (C4.5).
Previous research has shown that 10 or more iterations
produces accurate probability estimates (McIver & Friedl,
2001). Outside information, which is expected to improve
the classification, is then incorporated using Bayes’ rule in
association with confidence estimates as described in
Section 2.2.
2.2. Controlling the influence of prior probabilities
A common problem when using prior probabilities with
supervised classification algorithms is that they can bias the
posterior probability of class k given observation A (i.e.,
P(k | A)), towards the result predicted by the ancillary
information (Strahler, 1980). This bias can be particularly
problematic when there are errors or uncertainty in the
information used to prescribe the prior probabilities. A
solution proposed by Chen, Ibrahim, and Yianoutsos (1999)
(for use with Bayesian logistic regression models) provides
an approach for addressing this problem.
In this approach, the ancillary information provides a
likelihood estimate L_j for the membership of each pixel in a
given class j. L_j is then conditioned using a subjective
confidence parameter c that ranges from 0.0 to 1.0. The
conditioned estimate L_j^c can then be used as the prior
probability in Bayes' rule. As c is varied from 1 to 0, the
effect of the likelihood estimate decreases until, when c
equals 0, no information from the likelihood estimate is
incorporated. Thus, the parameter c controls the influence of
L_j on the predictions, based upon the user's confidence in L_j.
While the choice of a value for c is subjective, we show
below that a simple heuristic based on objective criteria can
be used to prescribe a value for c in many situations.
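Reading the conditioned estimate as the likelihood raised to the power c, a device common to power-prior formulations such as Chen et al.'s (the paper does not spell out the functional form, so this is our reading), the combination with the boosted probabilities can be sketched as:

```python
import numpy as np

def posterior_with_prior(p_spectral, L, c):
    """Combine boosted class-membership probabilities with ancillary
    likelihoods L_j conditioned by a confidence parameter c in [0, 1].
    With the prior taken as L_j**c (so c = 0 yields a flat, uninformative
    prior and c = 1 uses L_j at full strength), Bayes' rule gives a
    posterior proportional to p(k|A) * L_k**c, renormalized per pixel."""
    prior = L ** c                      # c = 0 -> all ones (no influence)
    post = p_spectral * prior
    return post / post.sum(axis=-1, keepdims=True)

# Hypothetical pixel: two poorly separable classes, spectral evidence
# split 50/50, ancillary information favoring class 0.
p = np.array([0.5, 0.5])
L = np.array([0.8, 0.2])
print(posterior_with_prior(p, L, 0.0))  # [0.5 0.5] -- prior ignored
print(posterior_with_prior(p, L, 1.0))  # [0.8 0.2] -- prior at full weight
```

Intermediate values of c interpolate between these extremes, which is exactly the behavior the confidence parameter is meant to provide.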
3. A simple example
In the following example, sample data were randomly
drawn from two overlapping bivariate (X, Y) normal distributions (Fig. 1). Each distribution possessed the same mean
in the Y dimension and equal variances and covariances in
both dimensions. They are differentiated only by their
means in the X dimension. Probability density functions
for each distribution exhibit considerable overlap in the X
dimension and are shown in Fig. 1a. The training data were
generated using a random sample from each distribution.
Separate test data were then created using a systematic
sample of the entire feature space.
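This experimental setup can be reproduced along the following lines; the specific means, covariance, and sample sizes are illustrative assumptions, since the paper reports only the qualitative design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two bivariate normals: same Y mean, same variances and covariances,
# separated only in their X means (exact parameters are our assumptions).
mean1, mean2 = [2.0, 0.0], [4.0, 0.0]
cov = [[1.0, 0.0], [0.0, 1.0]]

# Training data: a random sample from each distribution.
train1 = rng.multivariate_normal(mean1, cov, size=200)
train2 = rng.multivariate_normal(mean2, cov, size=200)

# Test data: a systematic sample covering the entire feature space.
xs, ys = np.meshgrid(np.linspace(-2, 8, 50), np.linspace(-4, 4, 50))
test = np.column_stack([xs.ravel(), ys.ravel()])
```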
3.1. Interaction between prior probabilities and
class separability
To demonstrate the effect of prior probabilities on
classification results, the training sample used for this
exercise contained an equal number of examples from
each class. The prior probability of Class 2, P(2), was
then varied from 0.3 to 0.8. Changes in the proportion of
predictions in the test data for each class arising as a result
of varying P(2) are shown in Fig. 1b. As we would expect,
the proportion of cases predicted to belong to Class 2
increases linearly with P(2).
Fig. 1. A two-class example illustrating the effects of prior probabilities and training sample class frequency distribution using C4.5. The probability density functions for each class in the X dimension are shown in (a). The proportion of test cases predicted to be Class 2 as a function of the prior probability for Class 2 is shown in (b). The effect of varying the training class proportions is shown in (c). For comparison, the dashed line in (c) is the least squares fit for the data shown in (b).

A key point to note is that only cases that lie in the
overlapping region of the two class distributions are affected
by the prior probabilities. Cases in separable regions
(roughly X < 2 and X > 4), on the other hand, are correctly
classified with a probability of 1.0 by C4.5, and prior
probabilities have no effect in these regions of the feature
space. This is distinct from the behavior of maximum
likelihood classification where the estimated probabilities
for all cases are affected, and low prior probability can
preclude selection of a class, even if it is spectrally distinct
(Strahler, 1980).
3.2. Training sample characteristics and prior probabilities
Fig. 1c illustrates the influence that training sample
class frequency distributions can have on classification
predictions from nonparametric algorithms. In particular,
when classes within a region of feature space are not
separable, the distribution of class predictions tends to
reflect the class frequency distribution of the training
sample (Foody, McCulloch, & Yates, 1995). This bias
occurs because nonparametric classification algorithms
such as decision trees are optimized to maximize classification accuracy based on the training data provided.
They therefore assume that the training sample captures
the variability and characteristics of the underlying population, and implicitly assume that the relative class frequency of the training sample is representative of the
population (Breiman et al., 1984, p. 9; Brown & Koplowitz, 1979; Ripley, 1996, p. 220).
Fig. 1c illustrates bias in classification predictions
arising from variation in the class frequency distribution
of the training data. To generate this plot, a series of
training data sets were created where a fixed number of
samples were randomly drawn from Class 1 while varying
the number of samples drawn from Class 2. Fig. 1c shows
that the proportion of the test data predicted to belong to
Class 2 increases as the proportion of samples belonging
to Class 2 in the training data increases. By oversampling
Class 2 relative to Class 1, the Class 2 sample density
curve is effectively shifted up (i.e., higher probability)
relative to the population probability density for Class 2
shown in Fig. 1a. As a result, the point where the two
class sample density curves intersect shifts to the left (to a
lower value of X), and a greater proportion of the feature
space is predicted to belong to Class 2. Therefore, oversampling of one or more classes in training data can lead
to bias in classification predictions. In Section 5 we show
that using prior probabilities provides a means of correcting this type of bias.
4. Evaluation using Landsat TM imagery

4.1. Data

The method described in Section 2 was assessed using field data and Landsat TM imagery. To do this, data from TM bands 1–5 and 7 were extracted from imagery acquired on September 8, 1995 for 1013 field-validated sites in the Sierra National Forest in California. These sites were compiled by the United States Forest Service (USFS) as part of a remote-sensing-based classification of the Sierra National Forest (Carpenter et al., 1999; Woodcock et al., 1994). The site data comprised 59,903 pixels, each with two class labels. The first label describes the vegetation life form. The second label provides more detailed species information using the CALVEG system (Matyas & Parker, 1980), which classifies vegetation according to dominant species associations. The class labels and number of sites for each class are presented in Table 1. For this work, the nonconifer CALVEG classes were aggregated into a single "nonconifer" class (seven classes total). This was done to reduce confusion caused by undersampled classes (noted by Carpenter et al., 1999), which complicates interpretation of the results.

The USFS classification was performed using a multistage process following the methodology described by Woodcock et al. (1994). This methodology used image segmentation to delineate vegetation stands (Shandley, Franklin, & White, 1996; Woodcock & Harward, 1992), followed by supervised classification of remotely sensed imagery to predict the life form for each stand. Terrain-based species association rules were then used to predict CALVEG labels, followed by manual interpretation and relabeling of errors.

Table 1
Summary of the field-validated training sites for the Sierra National Forest data set

Life form     Number of sites
Conifer       561
Hardwood      211
Scrub         38
Herbaceous    101
Water         50
Barren        50

CALVEG class                Number of sites
Mixed conifer: pine         122
Red fir                     116
Subalpine conifer           37
Mixed conifer: fir          121
Eastside pine               22
Lodgepole pine              41
Black oak                   49
Canyon live oak             60
Blue oak                    69
Buckeye interior live oak   33
Mixed chaparral             17
Northern mixed chaparral    2
Mid-elevation chaparral     16
High-elevation chaparral    3
Dry grass                   51
Wet meadow                  50
Water                       50
Barren                      50

4.2. Estimation of prior probabilities

Bayes' rule requires that prior probabilities be available
for all classes of interest. Thus, ancillary information that is
available for only a subset of classes is problematic.
Following the methodology suggested by Chen et al.
(1999), the approach taken here was to assign an equal prior
probability of 1/J to each of the classes for which no
information was available, where J was the total number
of classes. The prior probabilities for the other classes were
then taken to be proportional to their likelihood estimates
(derived from ancillary information). The final prior probabilities were then normalized to sum to 1.0 for each pixel.
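A minimal sketch of this prior-construction step, assuming the mixing is done by filling uninformed classes with 1/J and then renormalizing each pixel (the paper does not give the exact recipe):

```python
import numpy as np

def build_priors(likelihoods, informed):
    """Per-pixel priors when ancillary information covers only some classes.
    `likelihoods`: (n_pixels, J) array; entries for uninformed classes are
    ignored.  `informed`: boolean mask of length J.  Uninformed classes
    receive an equal prior of 1/J, informed classes are set proportional
    to their likelihoods, and each pixel's priors are normalized to 1."""
    n, J = likelihoods.shape
    prior = np.full((n, J), 1.0 / J)
    prior[:, informed] = likelihoods[:, informed]
    return prior / prior.sum(axis=1, keepdims=True)
```

For example, with four classes and ancillary likelihoods available only for the first two, the remaining classes each receive an equal, nonzero share of the prior mass.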
Prior probabilities describing associations between topographic variables (slope, aspect, elevation) and CALVEG
labels for conifer sites were estimated from the field data. To
estimate these probabilities, a boosted classification tree was
constructed using slope, aspect, and elevation data to predict
conifer species. This method is similar to the approach
followed by Franklin (1998) who demonstrated that useful
species association rules can be estimated from terrain data
using decision trees. The estimated terrain rules discriminated between species within a single life form (conifers),
but not between different life forms, and provided per-pixel
prior probabilities of class membership for each conifer
species at each pixel in the study area. Conifers were
selected because they represent the dominant tree species
in the region, comprising 55% of the training sites and 6 of
the 18 CALVEG classes.
Two steps were required to estimate the terrain rules.
First, prior probabilities for conifer species were produced
for the nonconifer training sites. This step was necessary
because prior probabilities were required for all classes at
each pixel. Note that because the training data used to
estimate the boosted decision trees did not include nonconifer sites, the estimated prior probabilities for these sites
possessed lower predictive accuracy relative to the conifer
sites. Second, estimates of prior probabilities were produced
for the conifer training sites using a jackknife cross-validation procedure (Mosteller & Tukey, 1977). In this procedure, boosted decision trees were estimated after removing
training sites from the training data, one at a time. In doing
so, independent estimates of conifer class membership were
obtained for every training site, using only terrain data. To
ensure that no classes were eliminated by the terrain rules,
the prior probabilities were constrained to be greater than 0
for every class. Following the procedure described above,
the prior probability for the nonconifer class was assigned
an initial value of 1/7, whereupon the prior probabilities for all
classes were normalized to sum to 1.0.
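The jackknife step can be sketched as a leave-one-out loop. For a runnable illustration we substitute a nearest-centroid scorer for the boosted decision trees, and we leave out single rows rather than whole multi-pixel sites; both are simplifying assumptions.

```python
import numpy as np

def nearest_centroid_probs(X_train, y_train, x, labels, eps=1e-3):
    """Stand-in for the paper's boosted trees: soft class scores from
    inverse distance to each class centroid (illustrative only; the
    scores are strictly positive, so no class is ever precluded)."""
    d = np.array([np.linalg.norm(x - X_train[y_train == k].mean(axis=0))
                  for k in labels])
    s = 1.0 / (d + eps)
    return s / s.sum()

def jackknife_priors(X, y, labels):
    """Leave-one-out: estimate each training case's prior from a model
    fit without that case, so the priors are independent of the case."""
    priors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        priors.append(nearest_centroid_probs(X[keep], y[keep], X[i], labels))
    return np.array(priors)
```

Each row of the result is a normalized, strictly positive prior vector for one held-out case, mirroring the constraint in the text that no class be eliminated by the terrain rules.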
4.3. Classification results
Before including prior probabilities, a decision-tree classification was performed using the Landsat TM data. This
classification produced a cross-validated accuracy of 59.7%.
The cross-validated accuracy for the conifer sites from the
terrain rules (i.e., excluding the nonconifer sites and independent of the TM data) was 57.5%. However, the accuracy
across the full data set was only 31.7%. These results
indicate that the terrain rules provided useful information
for classifying conifer species, but much lower predictive
power for the nonconifer class.

Fig. 2. Cross-validated classification accuracy as a function of the confidence parameter c. The horizontal dashed line is the accuracy from the spectral data alone. The vertical dashed line is the confidence corresponding to the cross-validated accuracy of the terrain rules.
To incorporate prior probabilities in the TM-based classification, a value for the confidence measure (c) was
required. Fig. 2 presents cross-validated accuracies for the
TM-based classification using prior probabilities, where the
value of c varies from 0 to 1. This figure shows that when c
is 0, the cross-validated accuracy is the same as that
achieved using only spectral data (shown by the horizontal
dashed line in Fig. 2). Conversely, because the terrain rules
were relatively poor at classifying the nonconifer sites,
setting c equal to 1 produced lower classification accuracy
(56.9%) relative to using only spectral data.
The results reported in Fig. 2 illustrate a potential pitfall
of using prior probabilities: ancillary information does not
necessarily improve classification results. At the same time,
because the terrain rules contain useful information for the
conifer sites, they can improve classification accuracy if
their influence is appropriately weighted. Specifically, Fig. 2
shows that the cross-validated classification accuracy
increased to 65.1% when c = 0.4. This result represents the
optimal case where the benefits of including information
from the terrain rules are maximized and the costs are
minimized. As a consequence, the resulting accuracy is
higher than the accuracy obtained using either source of
information alone.
Closer inspection of the classification results illustrates these patterns more clearly. Table 2 shows that the overall accuracy increases until c ≈ 0.4. As c increases beyond this threshold, accuracies for the conifer species generally continue to rise. However, because the terrain rules are only effective for the conifer classes, increasing c (and therefore the overall likelihood of the conifer species) increases the frequency with which nonconifer examples are misclassified as conifers. Thus, selecting an appropriate value for c helps to balance the useful information in the terrain rules against the errors that they introduce.

Table 2
Class accuracies as a function of the confidence parameter c

Class                 c=0.0  0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
Other                 0.927  0.927  0.912  0.893  0.866  0.814  0.779  0.720  0.652  0.620  0.582
Lodgepole pine        0.024  0.048  0.097  0.121  0.170  0.170  0.170  0.195  0.170  0.170  0.170
Eastside pine         0.090  0.136  0.181  0.181  0.318  0.318  0.318  0.318  0.318  0.318  0.363
Ponderosa pine        0.058  0.088  0.127  0.235  0.323  0.392  0.450  0.519  0.529  0.549  0.578
Red fir               0.681  0.715  0.732  0.732  0.732  0.724  0.706  0.698  0.689  0.681  0.681
Subalpine conifer     0.162  0.216  0.378  0.459  0.513  0.594  0.648  0.648  0.675  0.675  0.675
Mixed conifer: fir    0.214  0.322  0.388  0.388  0.396  0.396  0.388  0.404  0.413  0.413  0.438
Mixed conifer: pine   0.500  0.524  0.524  0.524  0.549  0.581  0.614  0.622  0.614  0.606  0.622
Overall               0.597  0.621  0.637  0.643  0.651  0.639  0.633  0.616  0.585  0.571  0.563
Determining an appropriate value for the confidence
parameter c is potentially problematic and subjective. However, Fig. 2 shows that the value of c for which the cross-validated accuracy is maximum roughly corresponds to the
cross-validated accuracy of the terrain rules for the full
training set (≈32%, shown by the vertical dashed line in
Fig. 2). This suggests that the accuracy of the terrain rules
provides a good estimate of their quality, and by extension, a
good estimate of c. Intuitively, using the accuracy of the
terrain rules (or more precisely, the proportion of correctly
classified cases) as a value for c makes sense. If the
predictive accuracy of the terrain rules is high, their influence should also be high, and vice versa. While the
classification results shown in Table 2 are influenced by
the choice of c, they do not appear to be highly sensitive to
the exact value selected. As long as the confidence parameter generally captures the quality of the ancillary information, improvement is achieved.
5. Application to coarse-resolution mapping
Confusion between natural vegetation and agriculture is
a major source of error in remote-sensing-based global
land cover maps. For example, Loveland et al. (1999)
reported that nearly 60% of the problems addressed in the
postclassification process for the International Geosphere
Biosphere Programme’s Data and Information System
(IGBP-DIS) global land cover data set arose from confusion between natural vegetation and agriculture. This
problem was also noted in the MODIS prototype classification of North American land cover based on AVHRR
data (Friedl et al., 2000). The most obvious confusion in
this regard arose because of seasonal variation in the
AVHRR NDVI signal for evergreen needle leaf land cover
at high latitudes (caused by seasonal variation in illumination geometry) that mimicked a phenological cycle (Spanner, Pierce, Running, & Peterson, 1990). Because of this,
the classification algorithm confused this signal with
agricultural (and other) land cover. This factor, in conjunction with the limited information content of the
AVHRR NDVI data (Goward, Markham, Dye, Dulaney,
& Yang, 1991) and oversampling of agriculture in the
training data, resulted in overprediction of agriculture in
the final map.
One means to address this problem is to use available
knowledge concerning the spatial distribution of agriculture
to constrain predictions from spectral information alone.
Prior probabilities provide a means of incorporating such
knowledge. Recently, Ramankutty and Foley (1998) produced global maps of agricultural intensity at 5-min spatial
resolution. Their estimates were derived from an analysis of
global seasonal land cover regions (Loveland et al., 1999)
along with national and regional crop inventory data. While
these maps are not perfect, they do provide significant
information regarding the global distribution and intensity
of agriculture.
For this analysis, agricultural intensities from Ramankutty and Foley (1998) were used to provide likelihood
estimates (prior probabilities) for the presence of agriculture
in each 1-km pixel. These estimates ranged from 0.01 to
0.95 (the original intensity estimates ranged from 0 to 1.0
but were trimmed to avoid precluding or overpredicting
agriculture) and were used in conjunction with the North
American land cover training site data set used by Friedl
et al. (2000). This data set included 455 sites that were
produced by manual interpretation of Landsat TM scenes
using the IGBP land cover classification system, which is
composed of 17 broad cover types (including two agricultural classes) (Loveland & Belward, 1997).
To incorporate prior probabilities, two steps were
required. First, the training site data were used to estimate
a boosted decision tree that produced (for each pixel)
probabilities of class membership for each IGBP land cover
class. Second, Bayes’ rule was used where prior probabilities of agricultural intensity were included as described
above. For North America, the agricultural intensity map
had a correlation of 0.93 with inventory data from the
United States Department of Agriculture (Ramankutty &
Foley, 1998). For this work, a somewhat more conservative
confidence estimate was used (c = 0.5).
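Concretely, the per-pixel combination in this section might look like the following sketch. How the trimmed intensity a is split between the two IGBP agricultural classes and the remaining classes is our assumption, and the class indices are hypothetical.

```python
import numpy as np

def agriculture_prior(intensity, n_classes, ag_classes, lo=0.01, hi=0.95):
    """Per-pixel prior vector from a mapped agricultural intensity a:
    trim a to [0.01, 0.95] so agriculture is never precluded nor made
    certain, split it across the agricultural classes, and share 1 - a
    equally among the remaining classes (the split rule is assumed)."""
    a = np.clip(intensity, lo, hi)
    ag = np.zeros(n_classes, dtype=bool)
    ag[ag_classes] = True
    prior = np.zeros(n_classes)
    prior[ag] = a / ag.sum()
    prior[~ag] = (1.0 - a) / (~ag).sum()
    return prior

def combine(p_spectral, prior, c=0.5):
    """Bayes' rule with the confidence-conditioned prior prior**c."""
    post = p_spectral * prior ** c
    return post / post.sum()
```

Here `p_spectral` would be the boosted per-pixel probabilities for the 17 IGBP classes, and c = 0.5 reflects the conservative confidence chosen for the agricultural intensity map.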
When applied to the AVHRR-based classification of
North America, the inclusion of prior probabilities resulted
in 2.6% of land pixels changing from natural vegetation to
agriculture and 6.3% changing from agriculture to natural
vegetation. These changes correct many previously noted
problems in the original classification (Friedl et al., 2000;
Muchoney, 2000). The majority of pixels whose labels
changed to agriculture from other classes were located in
the agricultural regions of the central and southern United
States, while the majority of pixels that were relabeled as
natural vegetation were located at high latitudes and in arid
regions of the western United States and Mexico.
Fig. 3 illustrates specific examples of classification
changes that occurred as a result of including prior probabilities for agriculture for three regions of North America
centered in Alaska, Northern California, and South Carolina. The left-hand column of Fig. 3 presents the original
predictions from the North American data set, while the
right-hand column presents the same areas after including
prior probabilities for agriculture (with agriculture shown in
red). Dramatic changes can be seen in the Alaska images
(top row) where a large area in the center of this region had
originally been misclassified as agriculture. When prior
probabilities are included, the majority of the agriculture
in this region changes to a mix of evergreen needle leaf
forest, shrubland, and woody savanna (one of two woodland
classes in the IGBP system).
The middle row presents a similar comparison for
California and western Nevada. In the left-hand panel,
considerable confusion between agriculture and natural
vegetation is evident throughout the Sierra Nevada Mountains and in the Great Basin of Nevada. Inclusion of prior
probabilities resolves much of this confusion. Also, agricultural area increases in the heavily agricultural region of the
Central Valley, and decreases in the suburban areas surrounding San Francisco Bay and in the desert areas of the
Great Basin.
Finally, in the two lower panels we present a region
where the area mapped as agriculture increases as a result of
including prior probabilities for agricultural intensity. Specifically, significant portions of the southeastern United
States, previously labeled as evergreen needle leaf forest,
were reclassified as agriculture. Once again, these changes
resolve known errors in the original map in which significant amounts of agriculture were misclassified as natural
vegetation in this region.
Fig. 3. Detailed views of changes arising from including prior probabilities for agriculture in an AVHRR-based IGBP classification for North America. The
images on the left are the original predictions. The images on the right show the result of including prior probabilities. The three regions are approximately
centered on (from the top): Alaska, northern California, and South Carolina.
In summary, the changes introduced by the use of prior
probabilities for agriculture appear to have substantially
improved the classification of North American land cover.
Given the significant problems posed by confusion between
agriculture and natural vegetation in coarse resolution land
cover mapping, this approach holds considerable promise
for improving land cover characterization over large areas.
6. Discussion and conclusions
In this paper, we have presented a method for including
prior probabilities in land cover classifications using nonparametric classification methods. The approach draws from
recent research regarding the machine learning algorithm
known as boosting. Results from this work demonstrate that
inclusion of prior probabilities can be effective in supervised
classification of remotely sensed data using nonparametric
algorithms. Quantitative and qualitative assessments support
the potential of this approach at both fine and coarse spatial
resolutions. An important aspect of the method is that
knowledge regarding the quality of ancillary information
can be incorporated into Bayes’ rule.
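The core operation can be sketched in a few lines of code. The sketch below is a minimal illustration, not the authors' implementation: it assumes the classifier's estimated class-membership probabilities (here, hypothetical boosting-derived values) are reweighted by the ancillary priors via Bayes' rule and renormalized, treating the classifier's implicit training priors as uniform.

```python
import numpy as np

def apply_priors(p_class, priors):
    """Reweight estimated class-membership probabilities by ancillary
    prior probabilities (Bayes' rule, assuming uniform implicit
    training priors), then renormalize so the result sums to one."""
    posterior = p_class * priors
    return posterior / posterior.sum()

# Hypothetical probabilities for three classes:
# [agriculture, forest, shrubland]
p_class = np.array([0.40, 0.35, 0.25])  # boosting-derived estimates
priors = np.array([0.10, 0.60, 0.30])   # ancillary prior probabilities

post = apply_priors(p_class, priors)
# The ancillary information shifts the most likely label from
# agriculture (index 0) to forest (index 1).
```

In the setting of this paper, `p_class` would be the per-pixel probabilities estimated from the boosted decision tree, and `priors` would be derived from the ancillary data source.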
One limitation of the proposed approach is that specification of prior probabilities is often problematic. Further, even
when useful ancillary information is available, it is often
difficult to convert into probabilities. Indeed, a common
criticism of Bayesian statistical approaches is the subjectivity
involved: Any desired result can be obtained if the appropriate prior probabilities are used (Robert, 1994). The use of
the confidence parameter c partially addresses this problem
by reducing the influence of the outside information, thereby
mitigating the effects of errors and uncertainty, contingent on
the user’s confidence in the quality of the ancillary data.
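One plausible formulation of such a confidence weighting (an illustrative sketch, not necessarily the exact form used in the method) blends the ancillary prior with a uniform, uninformative prior, so that c = 1 trusts the ancillary source fully and c = 0 ignores it:

```python
import numpy as np

def blend_prior(prior, c):
    """Weight an ancillary prior against a uniform (uninformative)
    prior: c = 1 trusts the ancillary source fully, c = 0 ignores it."""
    prior = np.asarray(prior, dtype=float)
    uniform = np.full(prior.size, 1.0 / prior.size)
    return c * prior + (1.0 - c) * uniform

# Hypothetical ancillary prior over three classes, trusted at 50%
blended = blend_prior([0.8, 0.1, 0.1], c=0.5)  # pulled halfway toward uniform
```

Errors in the ancillary data are thereby damped in proportion to the user's stated confidence, since a low c leaves the classifier's own probabilities nearly unchanged.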
An important advantage of the method described in this
paper is that it can be used to reduce bias in predictions from
nonparametric classification algorithms introduced by nonrepresentative training samples. As this and other research have illustrated (e.g., Foody et al., 1995), the creation
of training data used for supervised classification always
contains subjective elements, and training sample properties
often influence classification results in much the same way
as prior probabilities. For unsupervised approaches, the
manual labeling process can be subjective (Loveland et al.,
1999). Therefore, virtually all classifications contain elements that reflect analyst expectations. While the assumptions underlying training samples are often hidden, the
subjective elements of using prior probabilities within the
classification process are explicit. Further, prior probabilities
can be used to offset bias introduced by a nonrepresentative
training sample class frequency distribution. For example, in
Section 5 the inclusion of prior probabilities helped to correct
classification bias towards overpredicting agriculture caused
by overrepresentation of this class in the training data.
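This correction can be illustrated with the standard Bayesian adjustment for unrepresentative sampling: divide out the training-sample class frequencies and multiply by the desired prior. The class names and probability values below are hypothetical, and this sketch is only one way to express the adjustment described above.

```python
import numpy as np

def debias(p_class, train_freq, true_prior):
    """Replace the implicit training-sample prior with an external one:
    divide out the training class frequencies, multiply by the desired
    prior probabilities, and renormalize."""
    adjusted = np.asarray(p_class) / np.asarray(train_freq) * np.asarray(true_prior)
    return adjusted / adjusted.sum()

# Agriculture overrepresented in training (50%) but rarer in reality (10%)
p_class = np.array([0.55, 0.45])     # classifier output: [agriculture, forest]
train_freq = np.array([0.50, 0.50])  # class frequencies in the training data
true_prior = np.array([0.10, 0.90])  # prior from ancillary information

out = debias(p_class, train_freq, true_prior)  # agriculture probability drops
```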
Finally, the most compelling reason to use prior probabilities in the classification of remotely sensed data is the
ability to include outside knowledge from diverse sources in
the classification process. For example, considerable knowledge exists concerning the global distribution of vegetation
and land cover types as a function of ecological relationships with climate (e.g., Bailey, 1998; Box, 1981; Matthews,
1983; Prentice et al., 1992). However, given the extent of
human and natural disturbance regimes (Schimel et al.,
1997; Townshend, Justice, Gurney, & McManus, 1991),
this information alone is inadequate for global land cover
mapping purposes. Remote-sensing-based land cover observations have the potential to produce more accurate representations of actual global land cover. However, remotely
sensed data often possess limited ability to discriminate
between all vegetation and land cover types of interest
(Friedl et al., 2000; Myneni et al., 1995; Running, Loveland,
Pierce, Nemani, & Hunt, 1995). The use of prior probabilities, as described in this paper, provides a means to exploit
the strengths of both approaches while overcoming some of
their individual limitations.
Acknowledgments
This work was supported by NASA Grant NAG5-7218
and NASA contract NAS5-31369. The hard work of the
land cover team at BU is gratefully acknowledged, along
with the staff of the Global Land 1-km AVHRR project at
EROS Data Center who kindly provided the 1-km AVHRR
data. Curtis Woodcock provided the USFS data set for the
Sierra National Forest. The constructive input of two
anonymous reviewers is also gratefully acknowledged.
References
Bailey, R. G. (1998). Ecoregions: the ecosystem geography of the oceans
and continents. New York: Springer.
Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning, 36, 105 – 139.
Box, E. O. (1981). Macroclimate and plant forms: an introduction to predictive modeling. The Hague: Dr. W. Junk.
Breiman, L. (1998). Arcing classifiers. The Annals of Statistics, 26 (3),
801 – 849.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
Brown, J. F., Loveland, T. R., Merchant, J. W., Reed, B. C., & Ohlen, D. O.
(1993). Using multisource data in global land-cover characterization:
concepts, requirements, and methods. Photogrammetric Engineering
and Remote Sensing, 59 (6), 977 – 987.
Brown, T. A., & Koplowitz, J. (1979). The weighted nearest neighbor rule for class dependent sample sizes. IEEE Transactions on Information Theory, IT-25 (5), 617 – 619.
Carpenter, G. A., Gopal, S., Macomber, S., Martens, S., Woodcock, C. E.,
& Franklin, J. (1999). A neural network method for efficient vegetation
mapping. Remote Sensing of Environment, 70, 326 – 338.
Chen, M.-H., Ibrahim, J. G., & Yianoutsos, C. (1999). Prior elicitation,
variable selection, and Bayesian computation for logistic regression
models. Journal of the Royal Statistical Society B, 61, 223 – 242.
Collins, M., Schapire, R. E., & Singer, Y. (2000). Logistic regression, AdaBoost, and Bregman distances. In: Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 158 – 169).
Dietterich, T. G. (2000). An experimental comparison of three methods for
constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40 (2), 139 – 158.
Foody, G. M. (1997). Fully fuzzy supervised classification of land cover
from remotely sensed imagery with an artificial neural network. Neural
Computing and Applications, 5 (4), 238 – 247.
Foody, G. M., & Hill, R. A. (1996). Classification of tropical forest classes
from Landsat TM data. International Journal of Remote Sensing, 17
(12), 2353 – 2367.
Foody, G. M., McCulloch, M. B., & Yates, W. B. (1995). The effect of
training set size and composition on artificial neural network classification. International Journal of Remote Sensing, 16 (9), 1707 – 1723.
Franklin, J. (1998). Predicting the distribution of shrub species in Southern
California from climate and terrain-derived variables. Journal of Vegetation Science, 9 (5), 733 – 748.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121 (2), 256 – 285.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of
on-line learning and an application to boosting. Journal of Computer
and System Sciences, 55 (1), 119 – 139.
Friedl, M. A., Brodley, C. E., & Strahler, A. H. (1999). Maximizing the
land cover classification accuracies produced by decision trees at continental to global scales. IEEE Transactions on Geoscience and Remote
Sensing, 37 (2), 969 – 977.
Friedl, M. A., Muchoney, D., McIver, D., Gao, F., Hodges, J. F. C., &
Strahler, A. H. (2000). Characterization of North American land cover
from NOAA-AVHRR data using the EOS MODIS land cover classification algorithm. Geophysical Research Letters, 27 (7), 977 – 980.
Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression:
a statistical view of boosting. The Annals of Statistics, 28, 337 – 374.
Gopal, S., Woodcock, C. E., & Strahler, A. H. (1999). Fuzzy neural network classification of global land cover from a 1 degree AVHRR data
set. Remote Sensing of Environment, 67, 230 – 243.
Goward, S. N., Markham, B., Dye, D. G., Dulaney, W., & Yang, J. (1991).
Normalized difference vegetation index measurements from the Advanced Very High Resolution Radiometer. Remote Sensing of Environment, 35, 257 – 277.
Hansen, M. C., DeFries, R. S., Townshend, J. R. G., & Sohlberg, R. (2000).
Global land cover classification at the 1 km spatial resolution using a
classification tree approach. International Journal of Remote Sensing,
21 (6), 1331 – 1364.
Hastie, T. J., & Tibshirani, R. J. (1990). Generalized additive models. New
York: Chapman & Hall.
Loveland, T. R., & Belward, A. S. (1997). The IGBP-DIS global 1 km land
cover data set, discover: first results. International Journal of Remote
Sensing, 18 (15), 3289 – 3295.
Loveland, T. R., Zhu, Z., Ohlen, D. O., Brown, J. F., Reed, B. C., &
Yang, L. (1999). An analysis of the IGBP global land-cover characterization process. Photogrammetric Engineering and Remote Sensing,
65 (9), 1021 – 1031.
Matthews, E. (1983). Global vegetation and land use: new high resolution
data bases for climate studies. Journal of Climate and Applied Meteorology, 22,
474 – 487.
Matyas, W. J. & Parker, I. (1980). CALVEG mosaic of existing vegetation of
California (Technical report). San Francisco, CA: Regional Ecology
Group, US Forest Service, Region 5.
McIver, D. K., & Friedl, M. A. (2001). Estimating pixel-scale land cover
classification confidence using nonparametric machine learning methods. IEEE Transactions on Geoscience and Remote Sensing, 39,
1959 – 1968.
Mosteller, F., & Tukey, J. (1977). Data analysis and regression. Reading,
MA: Addison Wesley.
Muchoney, D. (2000). Known problems and errors in the MODIS prototype
North America land cover map. Available at: http://geography.bu.edu/
landcover/.
Myneni, R. B., Maggion, S., Iaquinta, J., Privette, J. L., Gobron, N., Pinty,
B., Kimes, D. S., Verstraete, M. M., & Williams, D. L. (1995). Optical
remote sensing of vegetation: modeling, caveats, and algorithms. Remote Sensing of Environment, 51, 169 – 188.
Opitz, D., & Maclin, R. (1999). Popular ensemble methods: an empirical
study. Journal of Artificial Intelligence Research, 11, 169 – 198.
Prentice, I. C., Cramer, W., Harrison, S. P., Leemans, R., Monserud, R. A.,
& Solomon, A. M. (1992). A global biome model based on plant
physiology and dominance, soil properties and climate. Journal of Biogeography, 19, 117 – 134.
Quinlan, J. R. (1993). C4.5: programs for machine learning. San Mateo, CA: Morgan Kaufmann.
Quinlan, J. R. (1996a). Bagging, boosting, and C4.5. In: Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96). Portland, OR: AAAI Press.
Quinlan, J. R. (1996b). Boosting first-order learning. In: Proceedings of the
7th International Workshop on Algorithmic Learning Theory (ALT’96),
Sydney, Australia. Available online at: http://www.i.kyushuu.ac.jp/~thomas/ALT96/alt96.html.
Ramankutty, N., & Foley, J. A. (1998). Characterizing patterns of global
land use: an analysis of global cropland data. Global Biogeochemical
Cycles, 12 (4), 667 – 685.
Richards, J. A. (1993). Remote sensing digital image analysis: an introduction. New York: Springer-Verlag.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge,
UK: Cambridge University Press.
Robert, C. P. (1994). The Bayesian choice: a decision-theoretic motivation.
New York: Springer-Verlag.
Running, S. R., Loveland, T. R., Pierce, L. L., Nemani, R. R., & Hunt Jr.,
E. R. (1995). A remote sensing based vegetation classification logic for
global remote land cover analysis. Remote Sensing of Environment, 51,
39 – 48.
Schapire, R. E., & Freund, Y. (1999). Improved boosting algorithms using
confidence-rated predictions. Machine Learning, 37 (3), 297 – 336.
Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the
margin: a new explanation for the effectiveness of voting methods. The
Annals of Statistics, 26 (5), 1651 – 1686.
Schimel, D. S., Emanuel, W., Rizzo, B., Smith, T., Woodward, F. I., Fisher,
H., Kittel, T. G. F., McKeown, R., Painter, T., Rosenbloom, N., Ojima,
D. S., Parton, W. J., Kicklighter, D. W., McGuire, A. D., Melillo, J. M.,
Pan, Y., Haxeltine, A., Prentice, C., Sitch, S., Hibbard, K., Nemani, R.,
Pierce, L., Running, S., Borches, J., Chaney, J., Neilson, R., & Braswell, B. H. (1997). Continental scale variability in ecosystem processes:
models, data, and the role of disturbance. Ecological Monographs, 67
(2), 251 – 271.
Shandley, J., Franklin, J., & White, T. (1996). Testing the Woodcock–Harward segmentation algorithm in an area of southern California
chaparral and woodland vegetation. International Journal of Remote
Sensing, 17 (5), 983 – 1004.
Spanner, M. A., Pierce, L. L., Running, S. R., & Peterson, D. L. (1990).
The seasonality of AVHRR data of temperate coniferous forests: relationship with leaf area index. Remote Sensing of Environment, 33,
97 – 112.
Strahler, A. H. (1980). The use of prior probabilities in maximum likelihood classification of remotely sensed data. Remote Sensing of Environment, 10, 135 – 163.
Townshend, J., Justice, C., Li, W., Gurney, C., & McManus, J. (1991).
Global land cover classification by remote sensing: present capabilities
and future possibilities. Remote Sensing of Environment, 35, 243 – 255.
Woodcock, C. E., Collins, J. B., Gopal, S., Jakabhazy, V. D., Li, X., Macomber, S., Ryherd, S., Harward, V. J., Levitan, J., Wu, Y., & Warbington, R. (1994). Mapping forest vegetation using Landsat TM
imagery and a canopy reflectance model. Remote Sensing of Environment, 50, 240 – 254.
Woodcock, C. E., & Harward, V. J. (1992). Nested-hierarchical scene models and image segmentation. International Journal of Remote Sensing,
13 (16), 3167 – 3187.