Mon. Not. R. Astron. Soc. 316, 519–528 (2000)

Feature relevance in morphological galaxy classification

D. Bazell
Eureka Scientific Inc., 6509 Evensong Mews, Columbia, MD 21044-6064, USA
E-mail: [email protected]

Accepted 2000 February 24. Received 2000 February 9; in original form 1999 November 24

ABSTRACT
We investigate the utility of a variety of features in performing morphological galaxy classification using back-propagation neural network classifiers, based on a sample of 805 galaxies classified by Naim et al. We derive a total of 22 features from each galaxy image and use these as inputs to a neural network trained using back-propagation. The morphological types are subdivided into two to seven groups, and the relevance of each of the features is examined for each grouping. We use the magnitude of the regularization parameter for each input to determine whether a feature can be eliminated. We then prune the input features of the network, typically down to four features. We examine a number of methods of assessing the performance of the network and determine which works best for our task.

Key words: methods: data analysis – galaxies: general – galaxies: structure.

1 INTRODUCTION

Historically, astronomical galaxy classification has been done by a few experts in a very time-consuming procedure of visually inspecting each object in an image. As computing resources have become more powerful and digitized data more prevalent, automated classification schemes have also multiplied. Until recently, automated schemes have emphasized star/galaxy discrimination. This is a logical first step since, in order to study morphological galaxy classification, one must first distinguish the galaxies from the stars. It also turns out to be much easier to perform than full galaxy classification. Work on star/galaxy discrimination has been done using a variety of software packages.
An example is skicat, a software package that uses a decision tree approach to classification (Weir, Djorgovski & Fayyad 1995). The APM machine (Maddox et al. 1990) has classified stars and galaxies from the SERC J plates, producing a catalogue of approximately two million galaxies. Maddox et al. (1990) defined a star/galaxy discrimination parameter C which measured the deviation between the features for the object being studied and the features for a star. Objects from the Minnesota Automatic Plate Scanning machine have been classified using a neural network approach (Odewahn et al. 1992). Bazell & Peng (1998) examined several different input feature sets for star/galaxy discrimination.

Early work on morphological classification using neural networks was done by Storrie-Lombardi et al. (1992). They used a neural network trained with back-propagation using 13 input parameters to classify galaxies into five classes: E, S0, Sa+Sb, Sc+Sd, and Irr. They reported a 64 per cent classification accuracy if the highest probability output was used to represent the class, and a 90 per cent accuracy if the first or second highest probability represented the output.

While a set of features must be chosen as a starting point for classification, it is important to consider how they can be optimized to improve classification and remove unneeded features. There are several useful approaches to optimizing feature sets, including forward selection and backward elimination. Forward selection starts with a small set of features and assesses the performance of the classifier after the addition of one feature, cycling through each candidate feature in turn. The feature that most increases performance at each step is kept and the process is repeated. Backward elimination starts with a full set of features and deletes a single feature, retrains the classifier, and assesses performance.
The feature whose presence causes the smallest increase in performance is dropped and the process repeated. In the work presented here, we use a variation of backward elimination. We do not look at each feature individually, but rather allow the neural network to assess simultaneously the relevance of all the features being used as inputs. By starting with a large set of features, we iteratively eliminate those features that have the smallest relevance.

2 DATA PREPARATION

A list of galaxies was obtained from the work of Naim et al. (1995a), consisting of 834 objects with coordinates and morphological types (T-types). Of the initial list of 834 objects, we used 805. The 29 remaining objects were rejected for reasons of quality and availability of the images. Images of 300 × 300 pixels of the 805 galaxies were downloaded using the SkyView facility by supplying a list of coordinates of the objects. The data for the images were obtained from the Digitized Sky Survey (DSS). Each image was examined individually and centred in the 300 × 300 frame. In the work presented here we made no attempt to calibrate the images. Thus, any non-linearity or saturation in the images must be handled by the classifier software. As discussed below, we did attempt to remove heuristically the effect of saturation to some extent by excluding from consideration the brightest pixels in an image when calculating certain features. The DSS images used in this work are not in units of intensity, but rather represent uncalibrated plate densities. Furthermore, the images of the galaxies are drawn from multiple POSS plates. Thus, formal background subtraction or extraction of photometric parameters would be highly questionable. Nevertheless, some minimal background subtraction was performed in order to isolate each galaxy image and to reduce noise in extracted parameters.
2.1 Background subtraction and point source removal

To perform background subtraction on the images we first created an image histogram of each object. The peak of the histogram corresponds to the most common pixel value, which in all but a few cases was contained in the background. We set the background of the image to be equal to the most common pixel value. Since this is essentially a first-order approximation to the background, this procedure undoubtedly underestimates the background to some extent. Moreover, no attempt was made to account for background variation across the image. In subsequent feature extraction steps we exclude the lowest pixels in the image before processing in order to account for residual background.

Point sources were removed using a variation of the iterative algorithm called sigma_filter developed by Varosi & Gezari (1991). In our version we search a square window for single pixels that are more than 3σ above the mean in the window, excluding the central pixel. These pixels are replaced with the mean value in the window. We iteratively reduce the size of the window until pixels from the centre of the galaxy image begin to be eliminated. Once the window size is set, the pixel removal proceeds iteratively, removing all pixels that are more than 3σ above the mean, until no more pixels change. We then check that the central region of the galaxy has not changed. If it has, we increase the size of the window and start over with the original image. We found this procedure worked well in removing most point sources and leaving galaxy features untouched. In some cases, when a stellar image was saturated, we had to remove the remainder of the star by hand.

2.2 Feature selection and extraction

There are an infinite number of features that one could use to perform classification on an image.
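A minimal sketch of the background estimate and point-source filter of Section 2.1, assuming NumPy, a float-valued image, and a fixed window size (the adaptive window-size search described above is omitted); the function names and defaults are ours, not the original sigma_filter code:

```python
import numpy as np

def subtract_background(image, bins=256):
    """Estimate the background as the most common pixel value (histogram peak)
    and subtract it -- a first-order approximation, as in Section 2.1."""
    counts, edges = np.histogram(image, bins=bins)
    peak = np.argmax(counts)
    mode = 0.5 * (edges[peak] + edges[peak + 1])   # centre of the peak bin
    return image - mode

def remove_point_sources(image, window=5, nsigma=3.0, max_iter=50):
    """Replace pixels more than nsigma above the local mean (computed in a
    square window, excluding the central pixel) with that local mean,
    iterating until no pixel changes -- a simplified sigma_filter variant."""
    img = image.copy()
    half = window // 2
    for _ in range(max_iter):
        changed = False
        for y in range(half, img.shape[0] - half):
            for x in range(half, img.shape[1] - half):
                win = img[y - half:y + half + 1, x - half:x + half + 1].copy()
                centre = win[half, half]
                win[half, half] = np.nan          # exclude the central pixel
                mean, std = np.nanmean(win), np.nanstd(win)
                if centre > mean + nsigma * std:
                    img[y, x] = mean
                    changed = True
        if not changed:
            break
    return img
```

In practice one would vectorize the window scan, but the nested loops make the replacement rule explicit.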
We chose to compile a feature set by choosing a variety of features that have been used previously by other workers in the field, as well as adding some new parameters of our own. The features we used for our galaxy morphology classification are listed in Table 1. While most of the features are self-explanatory, we discuss the derivation of several of them here.

The feature designated mq2q3 was calculated by examining the plot of I(r) versus r and dividing the range of r values into four equal segments. A linear fit was performed on the second and third quarters of the I(r) versus r plot. The ratio of the slopes of the second quarter to the third quarter was the assigned value of mq2q3 for that galaxy.

The asymmetry parameter was introduced by Abraham et al. (1995) and is defined by taking the difference between the galaxy image and the same image rotated 180° about the centre of the galaxy. The sum of the absolute value of the pixels in the difference image is divided by the sum of the pixel values in the original image to give the asymmetry parameter. This parameter tends towards zero when the image is invariant under a 180° rotation.

The feature r25/r75 is the ratio of radii containing 25 per cent of the brightness and 75 per cent of the brightness in a plot of I(r). This is the inverse of what Odewahn & Aldering (1995) refer to as the concentration index C31. The concentration indices are defined several ways in the literature – see, for example, de Vaucouleurs (1962) and Doi, Fukugita & Okamura (1993). Our definition is

    C_i = ∫₀^{a_i} I(r) dr / ∫₀¹ I(r) dr,   (1)

where a_i ranges from 0.1 to 0.9 and r is the normalized distance from the centre of the object, corrected for ellipticity. Thus, the concentration index is a measure of the fraction of total brightness falling within a certain ellipticity-corrected radius.

RBulge is determined by taking the average of the first two concentration indices as a reference value. We then find the radius where the value of the concentration index falls below 10 per cent of the averaged reference value and assign this to RBulge.

Table 1. Features selected for use in classifying galaxies.

No.    Feature name               Description
1      peak brightness            Maximum brightness level in the image
2      mean brightness            Mean brightness level in the image
3      mq2q3                      Ratio of fitted slope of I(r) versus r for the second and third quartiles
4      ellipticity                Ratio of the semimajor to semiminor axis length
5      area                       Number of pixels contained in the object
6      max(rI)                    Maximum value of the plot of rI(r) versus r
7      Asym                       Comparison between original galaxy and galaxy rotated 180°
8      r25/r75                    Ratio of the radii at which 25 and 75 per cent of light is enclosed in a plot of I(r) versus r
9      RBulge                     Radius where I(r) falls to 90 per cent of peak value
10–18  Ci                         Concentration indices for the ith annulus; i = 0, …, 8
19     isophotal displacement     Maximum displacement of the centres of five isophotal levels
20     isophotal filling factor   Area of an isophotal level relative to the area of the enclosing ellipse
21     Pmax                       Maximum value of the normalized co-occurrence matrix, c_ij
22     Entropy                    −Σ_{i,j} c_ij log c_ij

The isophotal displacement was introduced by Naim, Ratnatunga & Griffiths (1997). This feature is calculated by determining the centres of five isophotal levels, finding the distance between each pair of centres, and setting the parameter equal to the maximum value of the inter-centre distance. We divide the image of each galaxy into five isophotal levels defined on the logarithm of the image, excluding the bottom 15 per cent and top 5 per cent of the pixel values to eliminate noise and saturation effects. Each isophote is fitted with an ellipse and the coordinates of the centre of each ellipse are compared with the other centres. The maximum value is taken and assigned to the feature. The isophotal filling factor was also introduced by Naim et al. (1997).
It is the ratio of the number of pixels that are covered by the third isophotal level to the number of pixels in the enclosing ellipse. As with the isophotal displacement, we exclude the bottom 15 per cent and the top 5 per cent of pixels and divide the remaining image into five logarithmic levels. We calculate the enclosing ellipse by determining the maximum distance of an isophotal pixel from the centre of the isophote. We then find an ellipse that has the same centre and position angle as the current isophote, and find its area. The ratio of the area of the isophote to the area of the ellipse gives the value of this feature.

We also examined the use of texture features as discriminators. There are a variety of ways to measure the texture of an object. We use the co-occurrence matrix (Haralick 1979), which can be defined as follows. Define a position operator to be `one pixel right and one pixel down'. The co-occurrence matrix c_ij is defined as the number of pixels with grey level i separated by the position operator from a pixel with grey level j, divided by the number of possible point pairs. Each element of this matrix is a measure of the probability of finding two pixels in the image, with a separation as defined by the position operator, with grey levels i and j. Use of the co-occurrence matrix requires that the grey levels in the image be quantized. The maximum values of i and j correspond to the number of grey levels, thus defining the dimensions of the matrix. In our work we quantized the image into 16 grey levels, resulting in a 16 × 16 co-occurrence matrix. Once this matrix is calculated, a variety of measures can be used to summarize the matrix in a few values. Common measures that have been used (Haralick 1979) include P_max = max_{i,j}(c_ij) and the entropy, as defined in Table 1, as well as the energy, Σ_{i,j} c_ij². We found P_max and the energy to have quite similar distributions across galaxy types, so we chose P_max and the entropy as texture features.
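Several of the features described in this section reduce to a few lines of array code. The sketch below, assuming NumPy and a background-subtracted image, shows the asymmetry parameter, the concentration-index definition, and the two co-occurrence texture measures; the function names, the simple sampling of I(r), and the quantization details are our illustrative choices, not the original implementation:

```python
import numpy as np

def asymmetry(image):
    """Abraham et al. (1995) asymmetry: sum |I - rot180(I)| over sum I."""
    rotated = np.rot90(image, 2)   # 180 degree rotation about the image centre
    return np.abs(image - rotated).sum() / image.sum()

def concentration_index(light_profile, a):
    """Fraction of the integral of I(r) within normalized radius a, with
    I(r) sampled on an evenly spaced grid of r in [0, 1]."""
    k = int(round(a * (len(light_profile) - 1))) + 1
    return light_profile[:k].sum() / light_profile.sum()

def cooccurrence_features(image, levels=16):
    """P_max and entropy of the normalized co-occurrence matrix for the
    position operator 'one pixel right and one pixel down'."""
    lo, hi = image.min(), image.max()
    if hi == lo:
        q = np.zeros(image.shape, dtype=int)
    else:
        q = np.clip(((image - lo) / (hi - lo) * levels).astype(int), 0, levels - 1)
    c = np.zeros((levels, levels))
    for i, j in zip(q[:-1, :-1].ravel(), q[1:, 1:].ravel()):
        c[i, j] += 1               # count grey-level pairs under the operator
    c /= c.sum()                   # normalize by the number of point pairs
    nonzero = c[c > 0]
    return c.max(), -(nonzero * np.log(nonzero)).sum()
```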
3 NEURAL NETWORK CLASSIFICATION

Our purpose in this study was to investigate the importance or relevance of the 22 features used as input to the networks and to demonstrate the utility of automatic relevance determination for neural network classification of astronomical images. We ran a series of tests using a back-propagation neural network to classify our sample of galaxies. Each test consisted of ten cross-validation runs using a training set consisting of 644 galaxies (80 per cent) and a test set consisting of 161 galaxies (20 per cent), drawn randomly from the total data set consisting of 805 galaxies.

We used maximum likelihood optimization, which attempts to find weights in the neural network model that maximize the joint probability of the observed output. If the weight parameters that are determined during optimization of the network are designated as w, then there is some function P(y_i, w) that denotes the probability of occurrence of the output value y_i for the ith object or observation. The negative log of the joint likelihood of observing the data is then given by

    −log P(w) = −Σ_i log P(y_i, w),   (2)

which will be minimized by the same parameter set w that maximizes the joint likelihood P(w) itself. During a cross-validation run the network is optimized based on the test set data. In other words, during each training epoch the ability of the network to predict the test data set is evaluated and weights are saved for the epoch that produces the best test set evaluation.

After each classification run we determined the relevance of each of the features in determining the classification of the galaxies. There are a variety of ways one can determine feature relevance. The nevprop software we used (Goodman 1996) implements automatic relevance determination (ARD) (MacKay 1992) to determine how important a given feature is.
The basic principles behind ARD for neural networks (MacKay 1991) can be summarized as follows. The neural network attempts to minimize an objective function, or error function, which we will write as

    E(w) = β E_D + α E_W,   (3)

where w is the weight vector, E_D is the sum squared difference between the predicted output of the network and the `true' value of the output, E_W = ½ Σ_i w_i² is a regularization function serving to smooth variations in the objective function, and α and β are regularization constants. The term E_W penalizes large values of the weights by making the total error function E(w) large in the presence of large weights. This is the standard `weight decay' approach to regularizing the objective function. Typical values for α might be in the range 0.001–0.9, with α being a constant set before the network is run.

The main idea behind ARD is to apply a different regularization constant α_i to each input parameter in order to control the size of the weights from that parameter to the hidden layer. If a given input parameter is not important, the α_i associated with that input parameter will be inferred to be large, yielding small weights for that parameter. Remember that α_i is a weight decay parameter: the larger the value of α_i, the faster the associated weight decays. Thus, the inverse of α_i is a measure of the relevance of the associated input parameter, the `relative' relevance of a given parameter being given by

    relevance = Σ_i α_i⁻¹ / Σ_j α_j⁻¹,   (4)

where the index i runs over the weights for a single input parameter, and the index j runs over the weights of all input parameters. This is the way we determine the relevance of each input parameter following a neural network run. Each set of ten cross-validation runs produced ten rankings for the relevance of each input feature. These ranks were averaged and the resulting values were used to determine which input features should be eliminated.
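Given the per-input regularization constants α_i inferred by ARD, the relative relevance of equation (4) is straightforward to compute. The sketch below assumes, for simplicity, one α per input parameter rather than one per weight; the α values shown are hypothetical:

```python
import numpy as np

def relative_relevance(alphas):
    """Equation (4): relevance_i = alpha_i^-1 / sum_j alpha_j^-1.
    A large alpha means strong weight decay, hence low relevance."""
    inv = 1.0 / np.asarray(alphas, dtype=float)
    return inv / inv.sum()

# hypothetical ARD output for four input features: the last two inputs
# have large decay constants and should come out nearly irrelevant
rel = relative_relevance([0.1, 0.1, 1.0, 10.0])
```

With weights grouped per input, the numerator would instead sum α⁻¹ over that input's weights, as in the equation.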
An example of this procedure is shown in Table 2, which shows the features, relevance values and ranks of the features for two runs with six output types. Note that the feature numbers correspond to the numbers in Table 1, the relevances are given as percentages, and for the ranking lower numbers are better (1 is highest, 7 is lowest). The table shows the ranking when the network had seven input features, how the three lowest ranked features were eliminated, and the resulting ranking of the four remaining features.

Table 2. Feature relevance ranking for six output types.

          Seven features          Four features
Feature   Rel. (per cent)  Rank   Rel. (per cent)  Rank
2         24.33            2      38.90            1
3         40.99            1      22.38            3
5         20.21            3      26.91            2
12         3.23            6      –                –
13         6.42            4      13.94            4
15         3.87            5      –                –
16         1.07            7      –                –

Table 3. Grouping used for neural network classification runs.

two types     E/S0, Sa/Sb/Sc/Sd/Sm/Ir
three types   E/S0, Sa/Sb/Sc, Sd/Sm/Ir
four types    E/S0, Sa/Sb, Sc, Sd/Sm/Ir
five types    E/S0, Sa, Sb, Sc, Sd/Sm/Ir
six types     E, S0, Sa, Sb, Sc, Sd/Sm/Ir
seven types   E, S0, Sa, Sb, Sc, Sd, Sm/Ir

Each cross-validation run produced a classification of the training and testing data set, along with output weights for the network connections. The training and testing data were drawn randomly from the entire data set for each cross-validation run. The initial weight values for the network were determined pseudorandomly at the start of each run. This is important because certain initial weight values might lead the network to a local minimum, so choosing different initial weights for each run minimizes the chance of getting stuck in a local minimum. The relevance values from each of the ten cross-validation runs were averaged, and we removed the input features that had the three lowest relevances. Most of the time there were several features that had low relevance values, perhaps an order of magnitude lower than the highest relevance values. Because the T-type of each galaxy was calculated by Naim et al.
(1995a) from an average of the eyeball classifications of six individuals, the T-type essentially takes on a continuous range of values from −5 to 10. The neural network classifier needs a number of examples of each output type in order to be trained. Therefore, we broke the continuum of types into seven main types and further grouped them for some of our classification runs. The following list shows the main groupings that we used:

(1) [−5, −3.5]: E
(2) (−3.5, 0.0): S0
(3) [0.0, 2.0): Sa
(4) [2.0, 4.0): Sb
(5) [4.0, 6.0): Sc
(6) [6.0, 8.0): Sd
(7) [8.0, 10.0]: Sm/Ir

We also investigated the ability of our networks to classify the galaxies into two to seven classes. These classes were made by combining the main groupings shown above. Table 3 shows the groupings that we used.

We specified the architecture of the networks as i:h:o, where i is the number of input nodes, h is the number of hidden nodes, and o is the number of output nodes. All of our networks started with a 22:3:o architecture, where the number of output nodes varied from two to seven. The output nodes could each take on a continuous value ranging from 0 to 1. For the cases of two to seven output nodes, each node represented a different group of galaxy types. For example, with two output nodes, if the galaxy being classified is type E or S0, the first node should ideally display a 1 and the second node should display a 0. If the galaxy being classified is Sa, Sb, Sc, Sd, or Ir, the first node should be 0 and the second node should be 1. Because the value at the output node is continuous, we have to make a decision about what range of values is acceptable for a given target value. For the classifiers with two to seven output nodes, we chose a threshold range of 0.45. This means that, if the target value was 1, then any value greater than 0.55 would be accepted. This threshold value does not affect the training of the classifier, but only the interpretation of the accuracy of the results.
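The T-type binning and the 0.45 threshold rule above can be made concrete as follows; the function names are ours, and the boundary handling follows the bracket notation in the list above:

```python
def ttype_to_class(t):
    """Map a continuous T-type in [-5, 10] to one of the seven main groups."""
    if t <= -3.5:        # [-5, -3.5]
        return "E"
    if t < 0.0:          # (-3.5, 0.0)
        return "S0"
    if t < 2.0:          # [0.0, 2.0)
        return "Sa"
    if t < 4.0:          # [2.0, 4.0)
        return "Sb"
    if t < 6.0:          # [4.0, 6.0)
        return "Sc"
    if t < 8.0:          # [6.0, 8.0)
        return "Sd"
    return "Sm/Ir"       # [8.0, 10.0]

def output_accepted(node_value, target, threshold=0.45):
    """A continuous node value in [0, 1] counts as correct when it lies
    within the threshold range of its 0/1 target (e.g. > 0.55 for target 1)."""
    return abs(node_value - target) < threshold
```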
4 CLASSIFICATION RESULTS

With many features to choose from, it is interesting first to look at the correlation matrix for the feature set. When two features are strongly correlated, elimination of one of the features should not have a detrimental effect on our ability to classify objects. However, when the features have only a modest correlation, the effect of removing one of the features is not so clear. The features we used include nine concentration indices, as defined in equation (1). The correlation matrix for the nine indices shows a strong correlation between nearby indices and decreasing correlation as the separation increases. For example, the value of the correlation matrix between C3 and C4 is 0.92, between C3 and C5 is 0.75, and between C3 and C6 is 0.50. Thus, it would appear that choosing two or three well-separated indices would be adequate to represent the whole range of concentration indices. We wanted to find out what ARD would do with a variety of features, some correlated, some not. Therefore, we left all the concentration indices in our original data set and allowed ARD to sort things out.

The case with the other features was not so clear. Table 4 shows the correlation matrix for 14 of our 22 features. We chose to represent the family of concentration indices by C3 because it shows up so strongly in our final reduced feature set. The correlation matrix shows a strong correlation between mean brightness and area (0.95), between C3 and r25/r75 (−0.75), and between rImax and r25/r75 (0.64). The remaining parameters show marginal or negligible correlations.

A summary of the results of the classification runs using the six different groupings is displayed in Table 5. This table shows the features that remained after six and seven iterations of removing the least relevant features. The entries in the table are the relevance rankings of each of the features. Features 10–18 are all concentration indices.
We can see immediately the effect of the correlation between the concentration indices. After the sixth iteration, on average about four concentration indices remain as features. This drops to about 1.7 after the final iteration. Similarly, features 2 and 5 have significant correlation and often show up together in the final feature sets. Feature 13, concentration index C3, occurs most often, being present in all groupings except two output types and the final iteration of four output types. Features 2 and 3, mean brightness and mq2q3 ratio, are not present in the final set of features for 2-type and 3-type classifications, but appear frequently highly ranked in 4- to 7-type classifications. The 3-type grouping lumps together Sa, Sb and Sc Hubble types, while subsequent groupings break these out. This suggests that these features are important for distinguishing between these different Hubble types.

Table 4. Correlation matrix for features 1–9, 13, 19–22.

              1      2      3      4      5      6      7      8      9     13     19     20     21     22
1  Max      1.00   0.14   0.09  −0.12   0.04  −0.14  −0.24  −0.20   0.18   0.33   0.03   0.18   0.15   0.01
2  Mean     0.14   1.00  −0.03  −0.31   0.95   0.03  −0.20  −0.05   0.10   0.03   0.13   0.05   0.37  −0.04
3  mq2q3    0.09  −0.03   1.00  −0.12   0.01  −0.32  −0.04  −0.25  −0.06   0.13   0.14   0.11   0.23  −0.27
4  ell     −0.12  −0.31  −0.12   1.00  −0.37   0.05   0.22   0.16   0.05   0.02   0.05  −0.02  −0.49   0.16
5  area     0.04   0.95   0.01  −0.37   1.00   0.04  −0.13  −0.07  −0.02  −0.04   0.23   0.08   0.46  −0.17
6  rImax   −0.14   0.03  −0.32   0.05   0.04   1.00   0.16   0.64  −0.11  −0.58   0.01  −0.18  −0.33   0.28
7  Asym    −0.24  −0.20  −0.04   0.22  −0.13   0.16   1.00   0.13  −0.23  −0.27   0.19  −0.13  −0.19  −0.19
8  r25/r75 −0.20  −0.05  −0.25   0.16  −0.07   0.64   0.13   1.00  −0.05  −0.74  −0.10  −0.32  −0.54   0.43
9  RBulge   0.18   0.10  −0.06   0.05  −0.02  −0.11  −0.23  −0.05   1.00   0.46  −0.12  −0.09  −0.03   0.30
13 C3       0.33   0.03   0.13   0.02  −0.04  −0.58  −0.27  −0.74   0.46   1.00   0.08   0.13   0.36  −0.16
19 IsoDisp  0.03   0.13   0.14   0.05   0.23   0.01   0.19  −0.10  −0.12   0.08   1.00  −0.13   0.27   0.27
20 IsoFF    0.18   0.05   0.11  −0.02   0.08  −0.18  −0.13  −0.32  −0.09   0.13  −0.13   1.00   0.35  −0.23
21 Pmax     0.15   0.37   0.23  −0.49   0.46  −0.33  −0.19  −0.54  −0.03   0.36   0.27   0.35   1.00  −0.50
22 Entropy  0.01  −0.04  −0.27   0.16  −0.17   0.28  −0.19   0.43   0.30  −0.16   0.27  −0.23  −0.50   1.00

Table 5. Summary of results for classification into two to seven type categories (relevance rankings of the features remaining after six and seven iterations of feature elimination).

The question arises as to how to determine when to stop deleting features from the original feature set. We examined a number of approaches to this question, including looking at the change in distribution of relevance values, looking at trends in mean nodal activation, and looking for minima in error measures. Starting from our original feature set, we deleted features three at a time, choosing those features with the lowest relevance, until there were only four features left. Typically this entailed deleting features whose relevance was about an order of magnitude less than the most relevant features of the previous iteration. Fig. 1 shows this process for the case of four output types. The figure shows the relevance of each feature for the seven iterations of relevance determination/feature elimination. The narrower (outer) error bars show the range of relevance values for the ten cross-validation runs. If one of these runs produced a very small value for the relevance of a certain feature, then the error bar would extend to a low value. An example is feature five of the plot labelled 7–3–4. The wider (inner) error bars show the rms deviation from the mean calculated separately for positive and negative deviations. The symbol is the mean value of the ten cross-validation runs.
Note that the range of relevance values for a single parameter in a specific run can be two orders of magnitude. The range of values appears to be due to the network getting stuck in different local minima during its learning phase for the different cross-validation runs. It then converges to that minimum with a certain relevance for each parameter. We simply took the mean of the ten values. Using the median of the values rather than the mean would put less significance on the extremal values, but it is not immediately clear whether that would be a better choice.

Examining the change in the distribution of relevance values as the features are removed, we see no obvious change that would tell us that an appropriate number of features remained. Neither the range of relevance values nor the rms deviation from the mean shows a minimum for a certain number of features. The plots with 16, 13 and 10 input features in Fig. 1 all show mean relevance values within about an order of magnitude of each other. The dispersion appears to increase for the plots with seven and four features. However, similar plots for different numbers of output types do not show similar trends.

Figure 1. Relevance of each parameter for the indicated network architecture i:3:4. The wider (inner) error bars show the rms deviation from the mean calculated separately for positive and negative deviations. The narrower (outer) error bars show the range of relevance values for the ten cross-validation runs for each parameter. The mean of the ten values is shown by the symbol.

The mean nodal activation did not appear to be much more helpful in finding an objective stopping point for parameter deletion. When performing a classification run, the objects are ideally classified into one of several output types, but in reality each output node can have a range of values from 0 to 1. An example of this is shown in Fig. 2, where we have plotted the mean nodal activation as a function of output node number for several different network configurations of the form i:3:4. The mean nodal activation was calculated taking all objects that should be activating output node 1 and averaging the nodal values for each of the output nodes separately. This procedure was repeated for each set of objects that should be activating a given output node.

Figure 2. Mean nodal activation for objects that should be classified to node 1 (+), node 2 (*), node 3 (W) and node 4 (S) for different numbers of input features. Each curve is offset by 0.2 from the previous curve for clarity. The labels above the plots denote the number of input:hidden:output nodes.

Concentrating on the top left panel labelled 22–3–4, we see the objects that should be activating output node 1 also activate nodes 2, 3 and 4, although to a lesser extent. Similarly, objects of type 3 have their peak at node 3, but also activate the other nodes. This broadening of the output activation profile is due to a number of factors including: (a) noise, (b) calibration variations, (c) nonlinearities in the input images, as well as (d) the choice of features used to describe the object. Our main interest here was to study how (d) affects the classification of our input data set.

Looking at the mean activation for type-1 objects (+), we see that the profile is essentially the same for 22 and 19 input features, but activation of node 1 tends to decrease and activation of node 2 tends to increase as we delete more features. Activation due to type-2 objects (*) does not appear to change with the number of input features. Activation due to type-3 objects (W) shows a weak maximum at node 3 for 22, 19, 16 and 13 input features, and then peaks at node 2 for 10, 7 and 4 features. Activation due to type-4 objects (S) is strongest at node 4 for 19 input features, and clearly decreases for seven and four input features. Thus, one might say that the best overall configuration would be 19 input features, but the mean nodal activation makes a weak case for that, too. Looking at similar plots for other numbers of output nodes, we see essentially the same thing. The activation of one node might increase a little as features are deleted, but others decrease, and there is not an obvious trend that allows us to say that a given number of input features is the best.

Yet another objective determination of the appropriate number of features is to rely on some measure of the error in the cross-validation test set. In Fig. 3 we plot three different error measures for each of the six output groupings we examined. The figure shows the negative log likelihood (S), average squared error (K) and the average threshold error (A) plotted as a function of the number of features used as input to the network. The negative log likelihood is the function that is minimized to determine when to stop training the network. The average squared error is the average squared difference between the network output and the target value, summed over all objects. The average threshold error is the fraction of objects that fell outside the threshold value of 0.45. In most of the cases in Fig. 3 there is no clear minimum.

Figure 3. Error measures for classification into six output groupings as a function of the number of input features: negative log likelihood (S); average threshold error (A); average squared error (K).

Figure 4. Error measures using the reduced feature set for classification into four output groupings as a function of the number of input features: negative log likelihood (S); average threshold error (A); average squared error (K).
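For concreteness, the three error measures plotted in Fig. 3 can be computed from network outputs and 0/1 targets as in the sketch below. This reflects our reading of the definitions above, not the nevprop implementation, and the normalization of the negative log likelihood (summed rather than averaged) is an assumption:

```python
import numpy as np

def error_measures(outputs, targets, threshold=0.45, eps=1e-12):
    """outputs, targets: arrays of shape (n_objects, n_output_nodes).
    Returns (negative log likelihood, average squared error,
    average threshold error)."""
    outputs = np.clip(np.asarray(outputs, float), eps, 1 - eps)
    targets = np.asarray(targets, float)
    # negative log likelihood, treating each node as an independent 0/1 target
    nll = -np.sum(targets * np.log(outputs)
                  + (1 - targets) * np.log(1 - outputs))
    # average squared difference between output and target
    ase = np.mean((outputs - targets) ** 2)
    # fraction of node values falling outside the 0.45 threshold range
    ate = np.mean(np.abs(outputs - targets) >= threshold)
    return nll, ase, ate
```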
The plot for two output types does show a minimum in the negative log likelihood around 13 input features, while the plot for three output types shows a minimum around 19 input features and a broad minimum between 16 and four input features in both the negative log likelihood and the average threshold error. The plot for four output types shows a minimum around 19 input features, which corresponds to what was suggested by the mean nodal activation plots. Fig. 1, however, shows significant variation in relevance values for 19 input features. To explore the question of error trends further, we looked at the rms error for each output node as a function of the number of input parameters. This is the rms difference between the target node value, always zero or one, and the actual output node value. The general behaviour of the rms error as a function of the number of input features is similar to that of the other error functions and does not show any additional variations that could be used to determine the number of input features to delete. Because a number of our features were significantly correlated with one another, we decided to eliminate one member of each set of highly correlated features and use the reduced feature set as a starting point for classification and ARD. We eliminated several concentration indices, leaving C2 and C6. We also removed the area feature, since it was highly correlated (r = 0.95) with the mean value. This reduced our starting feature set to 14 features. We performed a series of ten cross-validation runs with four output types, eliminating three features at a time to end up with five input features. Fig. 4 shows a plot of the three error measures. There is a clear minimum in the negative log likelihood around 11 features and a broad minimum in the average squared error that is consistent with it. The average threshold error shows a broad minimum between 11 and eight features.
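The correlated-feature elimination described above can be sketched as follows. The helper `prune_correlated` is hypothetical, not the paper's code: it computes the feature correlation matrix and drops one member of each pair whose correlation exceeds a cut such as 0.95.

```python
import numpy as np

def prune_correlated(X, names, r_cut=0.95):
    """Drop one member of each highly correlated feature pair.
    X: array of shape (n_objects, n_features); names: feature labels.
    Returns the kept column indices and names. A sketch of the idea,
    not the procedure actually used in the paper."""
    r = np.corrcoef(X, rowvar=False)  # feature-by-feature correlations
    n = X.shape[1]
    drop = set()
    for i in range(n):
        for j in range(i + 1, n):
            # Keep the earlier feature of a correlated pair, drop the later.
            if i not in drop and j not in drop and abs(r[i, j]) > r_cut:
                drop.add(j)
    keep = [k for k in range(n) if k not in drop]
    return keep, [names[k] for k in keep]
```

Which member of a pair is kept is arbitrary here; in the paper the choice (e.g. retaining C2 and C6, or the mean value over the area) was presumably guided by which feature was more physically interpretable.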
Plots of the mean nodal activation show marginally better differentiation between nodes for 11 input features.

5 DISCUSSION

We have examined a method of feature relevance determination for a sample of 805 galaxies, starting with 22 input features and using a back-propagation neural network. We found that certain features appear to be more important for discriminating between certain types of galaxies. For example, the mean brightness and the slope parameter mq2q3 both appear when classifying the galaxies into four or more types. We also found that at least one of the concentration indices remains relevant after reduction of the initial feature set to four features. A concentration index is present for all classifiers, whichever of the two to seven output-type groupings is used. Also of interest are the features that were eliminated early in the relevance determination procedure. The first concentration index, C0, was removed from all feature sets in the first or second iteration of relevance determination. That is, regardless of how many output types the data were being classified into, C0 had low relevance. C0 suffered significant quantization error because of the precision with which it was calculated and stored, so its low relevance is probably an artefact of this precision problem. The ellipticity was eliminated in the first two iterations in five out of six cases, while the asymmetry parameter and the entropy of the co-occurrence matrix were eliminated in four out of six cases. All three of these parameters showed little correlation in a plot against T-type, so their elimination is not surprising. Forward selection, backward elimination and ARD are all methods of selecting the most important features in a feature set. The traditional approach, using forward selection or backward elimination on a single parameter at a time, can be very time-consuming for even a moderate number of parameters.
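Single-parameter backward elimination, as described above, can be sketched as follows. All names here are hypothetical: `score_fn` stands in for retraining and cross-validating the classifier on a candidate feature subset, which is precisely the step that makes the traditional approach expensive (one full retraining per candidate deletion, per round).

```python
def backward_eliminate(features, score_fn, n_keep):
    """Greedy backward elimination: repeatedly delete the single
    feature whose removal hurts the cross-validation score least.
    score_fn(subset) retrains the classifier on `subset` and returns
    a performance score (higher is better). A generic sketch, not the
    ARD procedure used in the paper."""
    current = list(features)
    while len(current) > n_keep:
        best_subset, best_score = None, float("-inf")
        for f in current:
            trial = [g for g in current if g != f]
            s = score_fn(trial)  # retrain and assess performance
            if s > best_score:
                best_subset, best_score = trial, s
        current = best_subset
    return current
```

Going from p features down to k costs on the order of p + (p-1) + ... + (k+1) retrainings, which is why ARD, evaluating all relevances in parallel from a single training run, is attractive.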
Automatic relevance determination, used here in combination with backward elimination, is faster because it determines the relevance of all the features in parallel. However, in many of the cases examined here we found no clear and definitive choice of which parameters were best suited to classifying our galaxy images. When we eliminated the strong correlations between certain parameters, we found that the various error measures gave a clearer indication of when we had an appropriate number of parameters. Another way to approach the correlation problem is with principal component analysis (PCA), which takes linear combinations of a group of parameters and produces an orthogonal output set of reduced dimensionality. We are currently exploring PCA in combination with other methods of relevance determination. The lack of a definitive answer to the question of which parameters are best for a given classification task suggests that better features can be chosen. Naim et al. (1995b) found that concentration indices are very useful in galaxy discrimination, and our work supports their conclusions. Odewahn & Aldering (1995) and Odewahn et al. (1996) use photometric data in their feature sets. We plan to supplement our data with photometric data, reduce the dimensionality of the input set with PCA, and run the relevance determination again. It should be pointed out that our use of back-propagation neural network software is only one of a variety of choices that could be made for the classification algorithm. The main thrust of our current research effort is to examine different classification methods and to explore the feature sets that are useful with each. One promising method of feature selection is called schemata racing (Moore & Lee 1994; Maron & Moore 1997).
This method of feature selection conducts a sequence of races, one per feature, during which a decision is made to select or deselect some feature. Significantly, however, the ordering in which decisions are made is not predetermined. Moore and his colleagues reported that schemata racing performs well in comparison with sequential feature selection algorithms on several artificial tasks and on tasks from four data sets. Ricci & Aha (1998) used a variant of schemata racing to select an independent set of features for each bit in a nearest-neighbour algorithm's distributed (i.e. error-correcting) output representation. They reported good results, and specifically selected schemata racing for its good computational efficiency. Thus, we believe that the benefits of schemata racing will carry over to astronomical classification tasks.

ACKNOWLEDGMENTS

We are indebted to the referee, Steve Odewahn, for pointing out some important references and clarifying several points in the paper. We acknowledge the use of NASA's SkyView facility (http://skyview.gsfc.nasa.gov) located at NASA Goddard Space Flight Center. The Digitized Sky Surveys were produced at the Space Telescope Science Institute under US Government grant NAG W-2166. The images of these surveys are based on photographic data obtained using the Oschin Schmidt Telescope on Palomar Mountain and the UK Schmidt Telescope. The plates were processed into the present compressed digital form with the permission of these institutions. This work was performed under the support of NASA AISRP grant NAG5-8166.

REFERENCES

Abraham R. G., Valdes F., Yee H. K. C., van den Bergh S., 1994, ApJ, 432, 75
Bazell D., Peng Y., 1998, ApJS, 116, 47
de Vaucouleurs G., 1962, in McVittie G., ed., Proc. IAU Symp. 15, Problems in Extragalactic Research. Macmillan, New York, p. 3
Doi M., Fukugita M., Okamura S., 1993, MNRAS, 264, 832
Goodman P. H., 1996, NevProp software, version 4.1, http://www.scs.unr.edu/nevprop
Haralick R.
M., 1979, Proc. IEEE, 67, 786
MacKay D. J. C., 1991, PhD thesis, California Institute of Technology
MacKay D. J. C., 1992, Neural Comput., 4, 448
Maddox S. J., Sutherland W. J., Efstathiou G., Loveday J., 1990, MNRAS, 243, 692
Maron O., Moore A. W., 1997, Artif. Intell. Rev., 11, 193
Moore A. W., Lee M. S., 1994, in Cohen W., Hirsh H., eds, Proc. 11th Int. Conf. on Machine Learning. Morgan Kaufmann, San Francisco, p. 190
Naim A. et al., 1995a, MNRAS, 274, 1107
Naim A., Lahav O., Sodre L. Jr, Storrie-Lombardi M. C., 1995b, MNRAS, 275, 567
Naim A., Ratnatunga K. U., Griffiths R. E., 1997, ApJ, 476, 510
Odewahn S. C., Aldering G., 1995, AJ, 110, 2009
Odewahn S. C., Stockwell E. B., Pennington R. M., Humphreys R. M., Zumach W. A., 1992, AJ, 103, 318
Odewahn S. C., Windhorst R. A., Driver S. P., Keel W. C., 1996, ApJ, 472, L13
Ricci F., Aha D. W., 1998, Proc. 10th Eur. Conf. on Machine Learning. Springer, Chemnitz, p. 280
Storrie-Lombardi M. C., Lahav O., Sodre L. Jr, Storrie-Lombardi L. J., 1992, MNRAS, 259, 8
Varosi F., Gezari D., 1991, http://idlastro.gsfc.nasa.gov/homepage.html
Weir N., Djorgovski S., Fayyad U., 1995, AJ, 109, 2401

This paper has been typeset from a TEX/LATEX file prepared by the author.