Remote Sensing of Environment 81 (2002) 253–261
www.elsevier.com/locate/rse

Using prior probabilities in decision-tree classification of remotely sensed data

D.K. McIver, M.A. Friedl*
Department of Geography, Center for Remote Sensing, Boston University, 675 Commonwealth Avenue, Boston, MA 02215, USA

Received 27 November 2000; received in revised form 8 December 2001; accepted 8 December 2001

Abstract

Land cover and vegetation classification systems are generally designed for ecological or land use applications that are independent of remote sensing considerations. As a result, the classes of interest are often poorly separable in the feature space provided by remotely sensed data. In many cases, ancillary data sources can provide useful information to help distinguish between inseparable classes. However, methods for including ancillary data sources, such as the use of prior probabilities in maximum likelihood classification, are often problematic in practice. This paper presents a method for incorporating prior probabilities in remote-sensing-based land cover classification using a supervised decision-tree classification algorithm. The method allows robust probabilities of class membership to be estimated from nonparametric supervised classification algorithms using a technique known as boosting. By using this approach in association with Bayes' rule, poorly separable classes can be distinguished based on ancillary information. The method does not penalize rare classes and can incorporate incomplete or imperfect information using a confidence parameter that weights the influence of ancillary information relative to its quality. Assessments of the methodology using both Landsat TM and AVHRR data show that it successfully improves land cover classification results. The method is shown to be especially useful for improving discrimination between agriculture and natural vegetation in coarse-resolution land cover maps. © 2002 Elsevier Science Inc.
All rights reserved.

1. Introduction

Information from ancillary data sources has been widely shown to aid discrimination of classes that are difficult to classify using remotely sensed data (Brown, Loveland, Merchant, Reed, & Ohlen, 1993; Foody & Hill, 1996; Strahler, 1980). Such information is particularly useful in large area land cover mapping problems that involve cover types characterized by high levels of within-class variability (Loveland et al., 1999). One means of incorporating ancillary information is by using prior probabilities of class membership. Prior probabilities can improve classification results by helping to resolve confusion among classes that are poorly separable, and by reducing bias when the training sample is not representative of the population being classified.

* Corresponding author. Tel.: +1-617-353-5745. E-mail addresses: [email protected] (D.K. McIver), [email protected] (M.A. Friedl).

Until recently, maximum likelihood classification has been the most common method used for supervised classification of remotely sensed data (Richards, 1993). This methodology assumes that the probability distributions for the input classes possess a multivariate normal form. Increasingly, nonparametric classification algorithms are being used, which make no assumptions regarding the distribution of the data being classified (Carpenter et al., 1999; Foody, 1997; Friedl, Brodley, & Strahler, 1999). The use of such techniques has been motivated by the increased accuracy and flexibility that these algorithms provide. In particular, nonparametric classification methods have proven to be especially useful for continental and global-scale land cover mapping, where the classes of interest tend to be multimodal and possess substantial within-class variance (Friedl et al., 2000; Gopal, Woodcock, & Strahler, 1999; Hansen, DeFries, Townshend, & Sohlberg, 2000).
In this paper, we expand upon previous work using nonparametric classification algorithms for remote sensing applications (McIver & Friedl, 2001). Specifically, we explore the use of prior probabilities with decision-tree classification algorithms. To accomplish this objective, the paper has four main components. First, a methodology to incorporate prior probabilities in nonparametric classification algorithms is presented. This methodology includes an approach for controlling the influence of prior probabilities depending on their quality. Second, we illustrate several issues regarding the influence of prior probabilities and training sample properties on classification outcomes from nonparametric classifiers. Third, we present an assessment of the methodology using Landsat Thematic Mapper (TM) data. This assessment uses field sites and prior probabilities derived from terrain data for an area characterized by strong topographic control over vegetation cover. Finally, the technique is applied to a coarse resolution land cover classification problem using Advanced Very High Resolution Radiometer (AVHRR) data. In this last case, prior probabilities estimated from data related to the global distribution of agricultural land use intensity are used to resolve confusion between natural vegetation and agriculture at continental scales.

2. Methodology

2.1. Boosting

Maximum likelihood classification has been widely used in remote sensing to estimate per-pixel conditional probabilities of class membership. While some nonparametric classification algorithms (e.g., decision trees) can also provide such estimates, they are typically quite approximate (Breiman, Friedman, Olshen, & Stone, 1984; Friedman et al., 2000).
The approach used here is based on recent statistical theory that explains the behavior of a machine learning algorithm known as boosting. This explanation provides a robust method to estimate conditional probabilities of class membership from nonparametric classification algorithms. Boosting is one of numerous ensemble classification methods that are widely used in the machine learning research community (Freund, 1995). Ensemble methods are used in conjunction with a base classification algorithm (e.g., a decision tree or neural network) and have been widely shown to improve the performance of many classification algorithms. Specifically, boosting improves classification accuracies (Dietterich, 2000; Quinlan, 1996a), is resistant to overfitting (Bauer & Kohavi, 1999; Schapire, Freund, Bartlett, & Lee, 1998), and has been shown to be effective with remotely sensed data (Friedl et al., 1999; McIver & Friedl, 2001). While boosting is most commonly associated with decision trees (Breiman, 1998; Quinlan, 1996a), it has also been applied with other supervised classification algorithms such as artificial neural networks (e.g., Opitz & Maclin, 1999; Quinlan, 1996b). As a result, the method described in this work is not unique to decision trees and can be used with a variety of classification algorithms. Boosting operates by iteratively estimating multiple classifiers using a base algorithm while varying the training sample. At each iteration, the training sample is modified to focus attention on examples that were difficult to classify correctly in the previous iteration. Modifications to the training sample are performed by maintaining a weight for each training example that is adjusted at each iteration. A weighted voting scheme is then used to select the predicted class. Friedl et al. (1999) and McIver and Friedl (2001) provide discussions of boosting for remote sensing applications.
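The reweighting step described above can be sketched in a few lines. The following is a simplified AdaBoost.M1-style update, not the authors' implementation, and the function name is ours: examples the current classifier labels correctly are scaled down by beta = error / (1 - error) and the weights are then renormalized, so hard examples carry more weight in the next iteration.

```python
def update_sample_weights(weights, correct, error):
    """One AdaBoost.M1-style reweighting step: scale down the examples
    the current classifier got right, then renormalize, so the next
    iteration concentrates on the examples that were misclassified."""
    beta = error / (1.0 - error)  # smaller weighted error -> stronger down-weighting
    reweighted = [w * (beta if ok else 1.0) for w, ok in zip(weights, correct)]
    total = sum(reweighted)
    return [w / total for w in reweighted]

# Four equally weighted training examples; the fourth was misclassified
# (weighted error 0.25), so its weight doubles from 0.25 to 0.5.
weights = update_sample_weights([0.25] * 4, [True, True, True, False], error=0.25)
```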
For a complete description of boosting, see Freund and Schapire (1997) or Friedman et al. (2000). Recent theoretical work regarding boosting has shown it to be a form of additive logistic regression (Collins, Schapire, & Singer, 2000; Friedman et al., 2000). As a result, conditional probabilities of class membership can be obtained along with class predictions. From this perspective, boosting can be described as an additive model (Hastie & Tibshirani, 1990), where at each iteration the base learning algorithm is fit to the classification residuals (i.e., errors) from the previous iterations (Friedman et al., 2000). That is, the residuals from iteration n are used to modify the training sample used at iteration n + 1 to emphasize previously misclassified examples. As Friedman et al. (2000) show, this interpretation allows conditional probabilities of class membership to be computed for each pixel. A number of different boosting algorithms have been developed (Friedman et al., 2000; Schapire & Freund, 1999; Schapire et al., 1998). The algorithm used in this research, Adaboost.M1 (Freund & Schapire, 1997), is the simplest multiclass boosting method. For this work, Adaboost.M1 was implemented using the C4.5 decision tree (Quinlan, 1993), which is widely used in the machine learning community. The approach used throughout this paper is to obtain probabilities of class membership from boosting using 10 iterations of the base algorithm (C4.5). Previous research has shown that 10 or more iterations produce accurate probability estimates (McIver & Friedl, 2001). Outside information, which is expected to improve the classification, is then incorporated using Bayes' rule in association with confidence estimates as described in Section 2.2.
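To make the weighted-voting step concrete, the sketch below combines per-iteration class predictions into normalized class scores. This is a simplified normalized vote, not the exact additive logistic link of Friedman et al. (2000), and all names are ours: each iteration votes with weight log((1 - error) / error), and the accumulated scores are normalized to act as rough estimates of class membership probability.

```python
import math

def boosted_class_probabilities(predictions, errors, classes):
    """Combine class predictions from the boosting iterations into
    normalized class scores approximating P(class | pixel).
    Each iteration votes with weight log((1 - error) / error)."""
    scores = dict.fromkeys(classes, 0.0)
    for label, err in zip(predictions, errors):
        scores[label] += math.log((1.0 - err) / err)
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

# Ten iterations of a base classifier voting over three classes;
# iterations with lower weighted error carry larger votes.
probs = boosted_class_probabilities(
    predictions=["conifer"] * 7 + ["hardwood"] * 3,
    errors=[0.2] * 7 + [0.3] * 3,
    classes=("conifer", "hardwood", "scrub"),
)
```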
2.2. Controlling the influence of prior probabilities

A common problem when using prior probabilities with supervised classification algorithms is that they can bias the posterior probability of class k given observation A (i.e., P(k | A)) towards the result predicted by the ancillary information (Strahler, 1980). This bias can be particularly problematic when there are errors or uncertainty in the information used to prescribe the prior probabilities. A solution proposed by Chen, Ibrahim, and Yianoutsos (1999) (for use with Bayesian logistic regression models) provides an approach for addressing this problem. In this approach, the ancillary information provides a likelihood estimate for the membership of each pixel in a given class j (L_j). L_j is then conditioned using a subjective confidence parameter (c) that ranges from 0.0 to 1.0. The conditioned estimate, L_j^c, can then be used as the prior probability in Bayes' rule. As c is varied from 1 to 0, the effect of the likelihood estimate decreases until, when c equals 0, no information from the likelihood estimate is incorporated. Thus, the parameter c controls the influence of L_j on the predictions, based upon the user's confidence in L_j. While the choice of a value for c is subjective, we show below that a simple heuristic based on objective criteria can be used to prescribe a value for c in many situations.

3. A simple example

In the following example, sample data were randomly drawn from two overlapping bivariate (X, Y) normal distributions (Fig. 1). Each distribution possessed the same mean in the Y dimension and equal variances and covariances in both dimensions. They are differentiated only by their means in the X dimension. Probability density functions for each distribution exhibit considerable overlap in the X dimension and are shown in Fig. 1a. The training data were generated using a random sample from each distribution.
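Before continuing with the example, the conditioning step of Section 2.2 can be sketched as follows. Each prior L_j is raised to the power c before being combined with the classifier's probabilities through Bayes' rule; function and variable names here are hypothetical, and the two-class probabilities are illustrative only.

```python
def apply_conditioned_priors(class_probs, priors, c):
    """Combine classifier probabilities with ancillary priors conditioned
    by a confidence parameter c in [0, 1]. Each prior L_j is raised to
    the power c; at c = 0 every conditioned prior equals 1, so the
    ancillary information has no effect on the result."""
    posterior = {k: class_probs[k] * (priors[k] ** c) for k in class_probs}
    total = sum(posterior.values())
    return {k: v / total for k, v in posterior.items()}

p_spectral = {"agriculture": 0.55, "grassland": 0.45}  # from the classifier
ancillary = {"agriculture": 0.10, "grassland": 0.90}   # ancillary likelihoods

no_prior = apply_conditioned_priors(p_spectral, ancillary, c=0.0)    # priors off
full_prior = apply_conditioned_priors(p_spectral, ancillary, c=1.0)  # full Bayes' rule
```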
Separate test data were then created using a systematic sample of the entire feature space.

3.1. Interaction between prior probabilities and class separability

To demonstrate the effect of prior probabilities on classification results, the training sample used for this exercise contained an equal number of examples from each class. The prior probability of Class 2, P(2), was then varied from 0.3 to 0.8. Changes in the proportion of predictions in the test data for each class arising as a result of varying P(2) are shown in Fig. 1b. As we would expect, the proportion of cases predicted to belong to Class 2 increases linearly with P(2). A key point to note is that only cases that lie in the overlapping region of the two class distributions are affected by the prior probabilities. Cases in separable regions (roughly X < 2 and X > 4), on the other hand, are correctly classified with a probability of 1.0 by C4.5, and prior probabilities have no effect in these regions of the feature space. This is distinct from the behavior of maximum likelihood classification, where the estimated probabilities for all cases are affected, and a low prior probability can preclude selection of a class even if it is spectrally distinct (Strahler, 1980).

Fig. 1. A two-class example illustrating the effects of prior probabilities and training sample class frequency distribution using C4.5. The probability density functions for each class in the X dimension are shown in (a). The proportion of test cases predicted to be Class 2 as a function of the prior probability for Class 2 is shown in (b). The effect of varying the training class proportions is shown in (c). For comparison, the dashed line in (c) is the least squares fit for the data shown in (b).

3.2. Training sample characteristics and prior probabilities

Fig. 1c illustrates the influence that training sample class frequency distributions can have on classification predictions from nonparametric algorithms. In particular, when classes within a region of feature space are not separable, the distribution of class predictions tends to reflect the class frequency distribution of the training sample (Foody, McCulloch, & Yates, 1995). This bias occurs because nonparametric classification algorithms such as decision trees are optimized to maximize classification accuracy based on the training data provided. They therefore assume that the training sample captures the variability and characteristics of the underlying population, and implicitly assume that the relative class frequency of the training sample is representative of the population (Breiman et al., 1984, p. 9; Brown & Koplowitz, 1979; Ripley, 1996, p. 220). Fig. 1c illustrates bias in classification predictions arising from variation in the class frequency distribution of the training data. To generate this plot, a series of training data sets were created where a fixed number of samples were randomly drawn from Class 1 while varying the number of samples drawn from Class 2. Fig. 1c shows that the proportion of the test data predicted to belong to Class 2 increases as the proportion of samples belonging to Class 2 in the training data increases. By oversampling Class 2 relative to Class 1, the Class 2 sample density curve is effectively shifted up (i.e., towards higher probability) relative to the population probability density for Class 2 shown in Fig. 1a. As a result, the point where the two class sample density curves intersect shifts to the left (to a lower value of X), and a greater proportion of the feature space is predicted to belong to Class 2. Therefore, oversampling of one or more classes in training data can lead to bias in classification predictions. In Section 5 we show that using prior probabilities provides a means of correcting this type of bias.
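The linear response in Fig. 1b has a simple parametric analogue. The sketch below applies Bayes' rule to two overlapping one-dimensional normals (the means and variance are hypothetical, chosen only to mimic the overlap in Fig. 1a) and reports the fraction of a systematic test sample assigned to Class 2 as its prior grows. Note that this maximum-likelihood-style rule shifts the decision boundary itself, unlike the C4.5 behavior described above.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fraction_assigned_class2(prior2, mu1=2.0, mu2=4.0, sigma=1.0):
    """Fraction of a systematic sample over X assigned to Class 2 by
    Bayes' rule with prior P(2) = prior2 (class means are hypothetical)."""
    grid = [i / 100.0 for i in range(-200, 801)]  # systematic test sample over X
    hits = sum(
        1 for x in grid
        if prior2 * normal_pdf(x, mu2, sigma) > (1.0 - prior2) * normal_pdf(x, mu1, sigma)
    )
    return hits / len(grid)

# Raising P(2) from 0.3 to 0.8 moves the decision boundary left and
# enlarges the share of the feature space labeled Class 2.
low, high = fraction_assigned_class2(0.3), fraction_assigned_class2(0.8)
```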
4. Evaluation using Landsat TM imagery

4.1. Data

The method described in Section 2 was assessed using field data and Landsat TM imagery. To do this, data from TM bands 1–5 and 7 were extracted from imagery acquired on September 8, 1995 for 1013 field-validated sites in the Sierra National Forest in California. These sites were compiled by the United States Forest Service (USFS) as part of a remote-sensing-based classification of the Sierra National Forest (Carpenter et al., 1999; Woodcock et al., 1994). The site data were comprised of 59,903 pixels, each with two class labels. The first label describes the vegetation life form. The second label provides more detailed species information using the CALVEG system (Matyas & Parker, 1980), which classifies vegetation according to dominant species associations. The class labels and number of sites for each class are presented in Table 1. For this work, the nonconifer CALVEG classes were aggregated into a single "nonconifer" class (seven classes total). This was done to reduce confusion caused by undersampled classes (noted by Carpenter et al., 1999), which complicates interpretation of the results. The USFS classification was performed using a multistage process following the methodology described by Woodcock et al. (1994). This methodology used image segmentation to delineate vegetation stands (Shandley, Franklin, & White, 1996; Woodcock & Harward, 1992), followed by supervised classification of remotely sensed imagery to predict the life form for each stand. Terrain-based species association rules were then used to predict CALVEG labels, followed by manual interpretation and relabeling of errors.

Table 1
Summary of the field-validated training sites for the Sierra National Forest data set

Life form     Number of sites
Conifer       561
Hardwood      211
Scrub         38
Herbaceous    101
Water         50
Barren        50

CALVEG class               Number of sites
Mixed conifer: pine        122
Red fir                    116
Subalpine conifer          37
Mixed conifer: fir         121
Eastside pine              22
Lodgepole pine             41
Black oak                  49
Canyon live oak            60
Blue oak                   69
Buckeye interior live oak  33
Mixed chaparral            17
Northern mixed chaparral   2
Mid-elevation chaparral    16
High-elevation chaparral   3
Dry grass                  51
Wet meadow                 50
Water                      50
Barren                     50

4.2. Estimation of prior probabilities

Bayes' rule requires that prior probabilities be available for all classes of interest. Thus, ancillary information that is available for only a subset of classes is problematic. Following the methodology suggested by Chen et al. (1999), the approach taken here was to assign an equal prior probability of 1/J to each of the classes for which no information was available, where J was the total number of classes. The prior probabilities for the other classes were then taken to be proportional to their likelihood estimates (derived from ancillary information). The final prior probabilities were then normalized to sum to 1.0 for each pixel. Prior probabilities describing associations between topographic variables (slope, aspect, elevation) and CALVEG labels for conifer sites were estimated from the field data. To estimate these probabilities, a boosted classification tree was constructed using slope, aspect, and elevation data to predict conifer species. This method is similar to the approach followed by Franklin (1998), who demonstrated that useful species association rules can be estimated from terrain data using decision trees. The estimated terrain rules discriminated between species within a single life form (conifers), but not between different life forms, and provided prior probabilities of class membership for each conifer species at each pixel in the study area. Conifers were selected because they represent the dominant tree species in the region, comprising 55% of the training sites and 6 of the 18 CALVEG classes.
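The 1/J convention described above can be sketched as follows. The function name is ours and the class names are illustrative; classes with no ancillary likelihood fall back to an equal prior of 1/J before per-pixel normalization.

```python
def build_pixel_priors(likelihoods, all_classes):
    """Assemble per-pixel priors when ancillary likelihoods cover only a
    subset of classes: classes with no ancillary information receive an
    equal prior of 1/J (J = total number of classes), the remainder stay
    proportional to their likelihood estimates, and the result is
    normalized to sum to 1."""
    J = len(all_classes)
    priors = {k: likelihoods.get(k, 1.0 / J) for k in all_classes}
    total = sum(priors.values())
    return {k: v / total for k, v in priors.items()}

# Terrain rules inform only the two conifer classes here; the
# nonconifer class falls back to the 1/3 default before normalization.
priors = build_pixel_priors(
    likelihoods={"red fir": 0.6, "lodgepole pine": 0.2},
    all_classes=["red fir", "lodgepole pine", "nonconifer"],
)
```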
Two steps were required to estimate the terrain rules. First, prior probabilities for conifer species were produced for the nonconifer training sites. This step was necessary because prior probabilities were required for all classes at each pixel. Note that because the training data used to estimate the boosted decision trees did not include nonconifer sites, the estimated prior probabilities for these sites possessed lower predictive accuracy relative to the conifer sites. Second, estimates of prior probabilities were produced for the conifer training sites using a jackknife cross-validation procedure (Mosteller & Tukey, 1977). In this procedure, boosted decision trees were estimated after removing training sites from the training data, one at a time. In doing so, independent estimates of conifer class membership were obtained for every training site, using only terrain data. To ensure that no classes were eliminated by the terrain rules, the prior probabilities were constrained to be greater than 0 for every class. Following the procedure described above, the prior probability for the nonconifer class was assigned an initial value of 1/7, whereupon the prior probabilities for all classes were normalized to sum to 1.0.

4.3. Classification results

Before including prior probabilities, a decision-tree classification was performed using the Landsat TM data. This classification produced a cross-validated accuracy of 59.7%. The cross-validated accuracy for the conifer sites from the terrain rules (i.e., excluding the nonconifer sites and independent of the TM data) was 57.5%. However, the accuracy across the full data set was only 31.7%. These results indicate that the terrain rules provided useful information for classifying conifer species, but much lower predictive power for the nonconifer class.

Fig. 2. Cross-validated classification accuracy as a function of the confidence parameter c. The horizontal dashed line is the accuracy from the spectral data alone. The vertical dashed line is the confidence corresponding to the cross-validated accuracy of the terrain rules.

To incorporate prior probabilities in the TM-based classification, a value for the confidence measure (c) was required. Fig. 2 presents cross-validated accuracies for the TM-based classification using prior probabilities, where the value of c varies from 0 to 1. This figure shows that when c is 0, the cross-validated accuracy is the same as that achieved using only spectral data (shown by the horizontal dashed line in Fig. 2). Conversely, because the terrain rules were relatively poor at classifying the nonconifer sites, setting c equal to 1 produced lower classification accuracy (56.9%) relative to using only spectral data. The results reported in Fig. 2 illustrate a potential pitfall of using prior probabilities: ancillary information does not necessarily improve classification results. At the same time, because the terrain rules contain useful information for the conifer sites, they can improve classification accuracy if their influence is appropriately weighted. Specifically, Fig. 2 shows that the cross-validated classification accuracy increased to 65.1% when c = 0.4. This result represents the optimal case, where the benefits of including information from the terrain rules are maximized and the costs are minimized. As a consequence, the resulting accuracy is higher than the accuracy obtained using either source of information alone. Closer inspection of the classification results illustrates these patterns more clearly. Table 2 shows that the accuracies for the conifer classes increase until c ≈ 0.4. As c increases beyond this threshold, accuracies for the conifer species generally continue to rise.
However, because the terrain rules are only effective for the conifer classes, increasing c (and therefore the overall likelihood of the conifer species) increases the frequency with which nonconifer examples are misclassified as conifers. Thus, selecting an appropriate value for c helps to balance the useful information in the terrain rules against the errors that they introduce.

Table 2
Class accuracies as a function of the confidence parameter c

Class                c = 0.0  0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
Other                0.927    0.927  0.912  0.893  0.866  0.814  0.779  0.720  0.652  0.620  0.582
Lodgepole pine       0.024    0.048  0.097  0.121  0.170  0.170  0.170  0.195  0.170  0.170  0.170
Eastside pine        0.090    0.136  0.181  0.181  0.318  0.318  0.318  0.318  0.318  0.318  0.363
Ponderosa pine       0.058    0.088  0.127  0.235  0.323  0.392  0.450  0.519  0.529  0.549  0.578
Red fir              0.681    0.715  0.732  0.732  0.732  0.724  0.706  0.698  0.689  0.681  0.681
Subalpine conifer    0.162    0.216  0.378  0.459  0.513  0.594  0.648  0.648  0.675  0.675  0.675
Mixed conifer: fir   0.214    0.322  0.388  0.388  0.396  0.396  0.388  0.404  0.413  0.413  0.438
Mixed conifer: pine  0.500    0.524  0.524  0.524  0.549  0.581  0.614  0.622  0.614  0.606  0.622
Overall              0.597    0.621  0.637  0.643  0.651  0.639  0.633  0.616  0.585  0.571  0.563

Determining an appropriate value for the confidence parameter c is potentially problematic and subjective. However, Fig. 2 shows that the value of c for which the cross-validated accuracy is maximum roughly corresponds to the cross-validated accuracy of the terrain rules for the full training set (≈32%, shown by the vertical dashed line in Fig. 2). This suggests that the accuracy of the terrain rules provides a good estimate of their quality and, by extension, a good estimate of c. Intuitively, using the accuracy of the terrain rules (or more precisely, the proportion of correctly classified cases) as a value for c makes sense.
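This selection logic can be sketched with a toy accuracy curve. The quadratic below is a fabricated stand-in for the curve in Fig. 2 (its coefficients are not from the paper); the point is only that scanning candidate values of c for the best cross-validated accuracy, or simply setting c to the ancillary rules' accuracy, takes a few lines.

```python
def select_confidence(candidates, accuracy_fn):
    """Keep the candidate confidence value with the highest
    cross-validated accuracy; accuracy_fn is assumed to rerun the
    classification for a given c and return its accuracy."""
    return max(candidates, key=accuracy_fn)

def toy_accuracy(c):
    """Fabricated stand-in for the accuracy curve of Fig. 2:
    it peaks between the c = 0 and c = 1 extremes."""
    return 0.597 + 0.27 * c - 0.34 * c * c

# Scan c = 0.0, 0.1, ..., 1.0; this toy curve peaks at c = 0.4.
best_c = select_confidence([i / 10.0 for i in range(11)], toy_accuracy)
```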
If the predictive accuracy of the terrain rules is high, their influence should also be high, and vice versa. While the classification results shown in Table 2 are influenced by the choice of c, they do not appear to be highly sensitive to the exact value selected. As long as the confidence parameter generally captures the quality of the ancillary information, improvement is achieved.

5. Application to coarse-resolution mapping

Confusion between natural vegetation and agriculture is a major source of error in remote-sensing-based global land cover maps. For example, Loveland et al. (1999) reported that nearly 60% of the problems addressed in the postclassification process for the International Geosphere-Biosphere Programme's Data and Information System (IGBP-DIS) global land cover data set arose from confusion between natural vegetation and agriculture. This problem was also noted in the MODIS prototype classification of North American land cover based on AVHRR data (Friedl et al., 2000). The most obvious confusion in this regard arose because of seasonal variation in the AVHRR NDVI signal for evergreen needleleaf land cover at high latitudes (caused by seasonal variation in illumination geometry) that mimicked a phenological cycle (Spanner, Pierce, Running, & Peterson, 1990). Because of this, the classification algorithm confused this signal with agricultural (and other) land cover. This factor, in conjunction with the limited information content of the AVHRR NDVI data (Goward, Markham, Dye, Dulaney, & Yang, 1991) and oversampling of agriculture in the training data, resulted in overprediction of agriculture in the final map. One means to address this problem is to use available knowledge concerning the spatial distribution of agriculture to constrain predictions from spectral information alone. Prior probabilities provide a means of incorporating such knowledge.
Recently, Ramankutty and Foley (1998) produced global maps of agricultural intensity at 5-min spatial resolution. Their estimates were derived from an analysis of global seasonal land cover regions (Loveland et al., 1999) along with national and regional crop inventory data. While these maps are not perfect, they do provide significant information regarding the global distribution and intensity of agriculture. For this analysis, agricultural intensities from Ramankutty and Foley (1998) were used to provide likelihood estimates (prior probabilities) for the presence of agriculture in each 1-km pixel. These estimates ranged from 0.01 to 0.95 (the original intensity estimates ranged from 0 to 1.0, but were trimmed to avoid precluding or overpredicting agriculture) and were used in conjunction with the North American land cover training site data set used by Friedl et al. (2000). This data set included 455 sites that were produced by manual interpretation of Landsat TM scenes using the IGBP land cover classification system, which is composed of 17 broad cover types (including two agricultural classes) (Loveland & Belward, 1997). To incorporate prior probabilities, two steps were required. First, the training site data were used to estimate a boosted decision tree that produced (for each pixel) probabilities of class membership for each IGBP land cover class. Second, Bayes' rule was applied, with prior probabilities of agricultural intensity included as described above. For North America, the agricultural intensity map had a correlation of 0.93 with inventory data from the United States Department of Agriculture (Ramankutty & Foley, 1998). For this work, a somewhat more conservative confidence estimate was used (c = 0.5).

When applied to the AVHRR-based classification of North America, the inclusion of prior probabilities resulted in 2.6% of land pixels changing from natural vegetation to agriculture and 6.3% changing from agriculture to natural vegetation. These changes correct many previously noted problems in the original classification (Friedl et al., 2000; Muchoney, 2000). The majority of pixels whose labels changed to agriculture from other classes were located in the agricultural regions of the central and southern United States, while the majority of pixels that were relabeled as natural vegetation were located at high latitudes and in arid regions of the western United States and Mexico. Fig. 3 illustrates specific examples of classification changes that occurred as a result of including prior probabilities for agriculture for three regions of North America centered in Alaska, Northern California, and South Carolina. The left-hand column of Fig. 3 presents the original predictions from the North American data set, while the right-hand column presents the same areas after including prior probabilities for agriculture (with agriculture shown in red). Dramatic changes can be seen in the Alaska images (top row), where a large area in the center of this region had originally been misclassified as agriculture. When prior probabilities are included, the majority of the agriculture in this region changes to a mix of evergreen needleleaf forest, shrubland, and woody savanna (one of two woodland classes in the IGBP system). The middle row presents a similar comparison for California and western Nevada. In the left-hand panel, considerable confusion between agriculture and natural vegetation is evident throughout the Sierra Nevada Mountains and in the Great Basin of Nevada. Inclusion of prior probabilities resolves much of this confusion.
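The trimming of intensity estimates described in Section 5 amounts to clamping each value away from the extremes, so that no pixel's prior can ever fully preclude or guarantee agriculture. A minimal sketch follows; the function name is ours, while the 0.01 and 0.95 bounds follow the text.

```python
def intensity_to_prior(intensity, floor=0.01, ceiling=0.95):
    """Clamp an agricultural-intensity estimate in [0, 1] into
    [floor, ceiling] so the resulting prior never reaches 0 or 1,
    which would otherwise preclude or force the agriculture class."""
    return min(max(intensity, floor), ceiling)

# Extreme intensities are pulled inward; intermediate values pass through.
trimmed = [intensity_to_prior(v) for v in (0.0, 0.5, 1.0)]
```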
Also, agricultural area increases in the heavily agricultural region of the Central Valley, and decreases in the suburban areas surrounding San Francisco Bay and in the desert areas of the Great Basin. Finally, in the two lower panels we present a region where the area mapped as agriculture increases as a result of including prior probabilities for agricultural intensity. Specifically, significant portions of the southeastern United States, previously labeled as evergreen needleleaf forest, were reclassified as agriculture. Once again, these changes resolve known errors in the original map, in which significant amounts of agriculture were misclassified as natural vegetation in this region.

Fig. 3. Detailed views of changes arising from including prior probabilities for agriculture in an AVHRR-based IGBP classification for North America. The images on the left are the original predictions. The images on the right show the result of including prior probabilities. The three regions are approximately centered on (from the top): Alaska, northern California, and South Carolina.

In summary, the changes introduced by the use of prior probabilities for agriculture appear to have substantially improved the classification of North American land cover. Given the significant problems posed by confusion between agriculture and natural vegetation in coarse-resolution land cover mapping, this approach holds considerable promise for improving land cover characterization over large areas.

6. Discussion and conclusions

In this paper, we have presented a method for including prior probabilities in land cover classifications using nonparametric classification methods. The approach draws from recent research regarding the machine learning algorithm known as boosting.
Results from this work demonstrate that inclusion of prior probabilities can be effective in supervised classification of remotely sensed data using nonparametric algorithms. Quantitative and qualitative assessments support the potential of this approach at both fine and coarse spatial resolutions. An important aspect of the method is that knowledge regarding the quality of ancillary information can be incorporated into Bayes’ rule. One limitation of the proposed approach is that specification of prior probabilities is often problematic. Further, even when useful ancillary information is available, it is often difficult to convert into probabilities. Indeed, a common criticism of Bayesian statistical approaches is the subjectivity involved: any desired result can be obtained if the appropriate prior probabilities are used (Robert, 1994). The use of the confidence parameter c partially addresses this problem by reducing the influence of the outside information, thereby mitigating the effects of errors and uncertainty, contingent on the user’s confidence in the quality of the ancillary data. An important advantage of the method described in this paper is that it can be used to reduce bias in predictions from nonparametric classification algorithms introduced by nonrepresentative training samples. As both this and other research have illustrated (e.g., Foody et al., 1995), the creation of training data used for supervised classification always contains subjective elements, and training sample properties often influence classification results in much the same way as prior probabilities. For unsupervised approaches, the manual labeling process can be subjective (Loveland et al., 1999). Therefore, virtually all classifications contain elements that reflect analyst expectations. While the assumptions underlying training samples are often hidden, the subjective elements of using prior probabilities within the classification process are explicit.
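The role of the confidence parameter c can be illustrated with a small sketch. The blending rule below, in which the priors are shrunk toward a uniform distribution as c decreases, is an assumption for demonstration rather than this paper's exact formulation, and the class names and probability values are invented.

```python
def apply_priors(posteriors, priors, c):
    """Reweight classifier class-membership probabilities by ancillary
    priors via Bayes' rule. The confidence parameter c in [0, 1] shrinks
    the priors toward a uniform distribution: c = 0 ignores the ancillary
    information entirely, c = 1 trusts it fully."""
    k = len(posteriors)
    blended = {cls: c * priors[cls] + (1.0 - c) / k for cls in posteriors}
    weighted = {cls: posteriors[cls] * blended[cls] for cls in posteriors}
    total = sum(weighted.values())
    return {cls: w / total for cls, w in weighted.items()}

# A pixel the classifier finds ambiguous, in a region whose ancillary
# data indicate that agriculture is rare.
post = {"agriculture": 0.55, "forest": 0.45}
prior = {"agriculture": 0.10, "forest": 0.90}
print(apply_priors(post, prior, c=0.0))  # priors ignored: 0.55 / 0.45
print(apply_priors(post, prior, c=1.0))  # forest now dominates
```

With full confidence the pixel flips from agriculture to forest; with no confidence the classifier's probabilities pass through unchanged, so errors in the ancillary data cannot dominate the result.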
Further, prior probabilities can be used to offset bias introduced by a nonrepresentative training sample class frequency distribution. For example, in Section 5 the inclusion of prior probabilities helped to correct classification bias towards overpredicting agriculture caused by overrepresentation of this class in the training data. Finally, the most compelling reason to use prior probabilities in the classification of remotely sensed data is the ability to include outside knowledge from diverse sources in the classification process. For example, considerable knowledge exists concerning the global distribution of vegetation and land cover types as a function of ecological relationships with climate (e.g., Bailey, 1998; Box, 1981; Matthews, 1983; Prentice et al., 1992). However, given the extent of human and natural disturbance regimes (Schimel et al., 1997; Townshend, Justice, Gurney, & McManus, 1991), this information alone is inadequate for global land cover mapping purposes. Remote-sensing-based land cover observations have the potential to produce more accurate representations of actual global land cover. However, remotely sensed data often possess limited ability to discriminate between all vegetation and land cover types of interest (Friedl et al., 2000; Myneni et al., 1995; Running, Loveland, Pierce, Nemani, & Hunt, 1995). The use of prior probabilities, as described in this paper, provides a means to exploit the strengths of both approaches while overcoming some of their individual limitations.

Acknowledgments

This work was supported by NASA Grant NAG5-7218 and NASA contract NAS5-31369. The hard work of the land cover team at BU is gratefully acknowledged, along with the staff of the Global Land 1-km AVHRR project at EROS Data Center, who kindly provided the 1-km AVHRR data. Curtis Woodcock provided the USFS data set for the Sierra National Forest. The constructive input of two anonymous reviewers is also gratefully acknowledged.

References

Bailey, R. G.
(1998). Ecoregions: the ecosystem geography of the oceans and continents. New York: Springer. Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning, 36, 105 – 139. Box, E. O. (1981). Macroclimate and plant forms: an introduction to predictive modeling. The Hague: Dr. W. Junk. Breiman, L. (1998). Arcing classifiers. The Annals of Statistics, 26 (3), 801 – 849. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth. Brown, J. F., Loveland, T. R., Merchant, J. W., Reed, B. C., & Ohlen, D. O. (1993). Using multisource data in global land-cover characterization: concepts, requirements, and methods. Photogrammetric Engineering and Remote Sensing, 59 (6), 977 – 987. Brown, T. A., & Koplowitz, J. (1979). The weighted nearest neighbor rule for class dependent sample sizes. IEEE Transactions on Information Theory, IT-25 (5), 617 – 619. Carpenter, G. A., Gopal, S., Macomber, S., Martens, S., Woodcock, C. E., & Franklin, J. (1999). A neural network method for efficient vegetation mapping. Remote Sensing of Environment, 70, 326 – 338. Chen, M.-H., Ibrahim, J. G., & Yiannoutsos, C. (1999). Prior elicitation, variable selection, and Bayesian computation for logistic regression models. Journal of the Royal Statistical Society B, 61, 223 – 242. Collins, M., Schapire, R. E., & Singer, Y. (2000). Logistic regression, AdaBoost, and Bregman distances. In: Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 158 – 169). Dietterich, T. G. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40 (2), 139 – 158. Foody, G. M. (1997).
Fully fuzzy supervised classification of land cover from remotely sensed imagery with an artificial neural network. Neural Computing and Applications, 5 (4), 238 – 247. Foody, G. M., & Hill, R. A. (1996). Classification of tropical forest classes from Landsat TM data. International Journal of Remote Sensing, 17 (12), 2353 – 2367. Foody, G. M., McCulloch, M. B., & Yates, W. B. (1995). The effect of training set size and composition on artificial neural network classification. International Journal of Remote Sensing, 16 (9), 1707 – 1723. Franklin, J. (1998). Predicting the distribution of shrub species in southern California from climate and terrain-derived variables. Journal of Vegetation Science, 9 (5), 733 – 748. Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121 (2), 256 – 285. Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55 (1), 119 – 139. Friedl, M. A., Brodley, C. E., & Strahler, A. H. (1999). Maximizing land cover classification accuracies produced by decision trees at continental to global scales. IEEE Transactions on Geoscience and Remote Sensing, 37 (2), 969 – 977. Friedl, M. A., Muchoney, D., McIver, D., Gao, F., Hodges, J. F. C., & Strahler, A. H. (2000). Characterization of North American land cover from NOAA-AVHRR data using the EOS MODIS land cover classification algorithm. Geophysical Research Letters, 27 (7), 977 – 980. Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28, 337 – 374. Gopal, S., Woodcock, C. E., & Strahler, A. H. (1999). Fuzzy neural network classification of global land cover from a 1 degree AVHRR data set. Remote Sensing of Environment, 67, 230 – 243. Goward, S. N., Markham, B., Dye, D. G., Dulaney, W., & Yang, J. (1991).
Normalized difference vegetation index measurements from the Advanced Very High Resolution Radiometer. Remote Sensing of Environment, 35, 257 – 277. Hansen, M. C., DeFries, R. S., Townshend, J. R. G., & Sohlberg, R. (2000). Global land cover classification at the 1 km spatial resolution using a classification tree approach. International Journal of Remote Sensing, 21 (6), 1331 – 1364. Hastie, T. J., & Tibshirani, R. J. (1990). Generalized additive models. New York: Chapman & Hall. Loveland, T. R., & Belward, A. S. (1997). The IGBP-DIS global 1 km land cover data set, DISCover: first results. International Journal of Remote Sensing, 18 (15), 3289 – 3295. Loveland, T. R., Zhu, Z., Ohlen, D. O., Brown, J. F., Reed, B. C., & Yang, L. (1999). An analysis of the IGBP global land-cover characterization process. Photogrammetric Engineering and Remote Sensing, 65 (9), 1021 – 1031. Matthews, E. (1983). Global vegetation and land use: new high resolution data bases for climate studies. Journal of Climate and Applied Meteorology, 22, 474 – 487. Matyas, W. J., & Parker, I. (1980). CALVEG: mosaic of existing vegetation of California (Technical report). San Francisco, CA: Regional Ecology Group, US Forest Service, Region 5. McIver, D. K., & Friedl, M. A. (2001). Estimating pixel-scale land cover classification confidence using nonparametric machine learning methods. IEEE Transactions on Geoscience and Remote Sensing, 39, 1959 – 1968. Mosteller, F., & Tukey, J. (1977). Data analysis and regression. Reading, MA: Addison-Wesley. Muchoney, D. (2000). Known problems and errors in the MODIS prototype North America land cover map. Available at: http://geography.bu.edu/landcover/. Myneni, R. B., Maggion, S., Iaquinta, J., Privette, J. L., Gobron, N., Pinty, B., Kimes, D. S., Verstraete, M. M., & Williams, D. L. (1995). Optical remote sensing of vegetation: modeling, caveats, and algorithms. Remote Sensing of Environment, 51, 169 – 188. Opitz, D., & Maclin, R. (1999).
Popular ensemble methods: an empirical study. Journal of Artificial Intelligence Research, 11, 169 – 198. Prentice, I. C., Cramer, W., Harrison, S. P., Leemans, R., Monserud, R. A., & Solomon, A. M. (1992). A global biome model based on plant physiology and dominance, soil properties and climate. Journal of Biogeography, 19, 117 – 134. Quinlan, J. R. (1993). C4.5: programs for machine learning. San Mateo: Morgan Kaufmann. Quinlan, J. R. (1996a). Bagging, boosting, and C4.5. In: Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96). Portland, OR: AAAI Press. Quinlan, J. R. (1996b). Boosting first-order learning. In: Proceedings of the 7th International Workshop on Algorithmic Learning Theory (ALT’96), Sydney, Australia. Available online at: http://www.i.kyushuu.ac.jp/~thomas/ALT96/alt96.html. Ramankutty, N., & Foley, J. A. (1998). Characterizing patterns of global land use: an analysis of global cropland data. Global Biogeochemical Cycles, 12 (4), 667 – 685. Richards, J. A. (1993). Remote sensing digital image analysis: an introduction. New York: Springer-Verlag. Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge, UK: Cambridge University Press. Robert, C. P. (1994). The Bayesian choice: a decision-theoretic motivation. New York: Springer-Verlag. Running, S. W., Loveland, T. R., Pierce, L. L., Nemani, R. R., & Hunt Jr., E. R. (1995). A remote sensing based vegetation classification logic for global land cover analysis. Remote Sensing of Environment, 51, 39 – 48. Schapire, R. E., & Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37 (3), 297 – 336. Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26 (5), 1651 – 1686. Schimel, D. S., Emanuel, W., Rizzo, B., Smith, T., Woodward, F. I., Fisher, H., Kittel, T. G.
F., McKeown, R., Painter, T., Rosenbloom, N., Ojima, D. S., Parton, W. J., Kicklighter, D. W., McGuire, A. D., Melillo, J. M., Pan, Y., Haxeltine, A., Prentice, C., Sitch, S., Hibbard, K., Nemani, R., Pierce, L., Running, S., Borchers, J., Chaney, J., Neilson, R., & Braswell, B. H. (1997). Continental scale variability in ecosystem processes: models, data, and the role of disturbance. Ecological Monographs, 67 (2), 251 – 271. Shandley, J., Franklin, J., & White, T. (1996). Testing the Woodcock–Harward segmentation algorithm in an area of southern California chaparral and woodland vegetation. International Journal of Remote Sensing, 17 (5), 983 – 1004. Spanner, M. A., Pierce, L. L., Running, S. R., & Peterson, D. L. (1990). The seasonality of AVHRR data of temperate coniferous forests: relationship with leaf area index. Remote Sensing of Environment, 33, 97 – 112. Strahler, A. H. (1980). The use of prior probabilities in maximum likelihood classification of remotely sensed data. Remote Sensing of Environment, 10, 135 – 163. Townshend, J., Justice, C., Li, W., Gurney, C., & McManus, J. (1991). Global land cover classification by remote sensing: present capabilities and future possibilities. Remote Sensing of Environment, 35, 243 – 255. Woodcock, C. E., Collins, J. B., Gopal, S., Jakabhazy, V. D., Li, X., Macomber, S., Ryherd, S., Harward, V. J., Levitan, J., Wu, Y., & Warbington, R. (1994). Mapping forest vegetation using Landsat TM imagery and a canopy reflectance model. Remote Sensing of Environment, 50, 240 – 254. Woodcock, C. E., & Harward, V. J. (1992). Nested-hierarchical scene models and image segmentation. International Journal of Remote Sensing, 13 (16), 3167 – 3187.