Feature relevance in morphological galaxy classification

Mon. Not. R. Astron. Soc. 316, 519–528 (2000)
D. Bazell*
Eureka Scientific Inc., 6509 Evensong Mews, Columbia, MD 21044-6064, USA
Accepted 2000 February 24. Received 2000 February 9; in original form 1999 November 24
ABSTRACT
We investigate the utility of a variety of features in performing morphological galaxy
classification using back-propagation neural network classifiers based on a sample of 805
galaxies classified by Naim et al. We derive a total of 22 features from each galaxy image
and use these as inputs to a neural network trained using back-propagation. The
morphological types are subdivided into two to seven groups, and the relevance of each
of the features is examined for each grouping. We use the magnitude of the regularization
parameter for each input to determine whether a feature can be eliminated. We then prune
the input features of the network, typically down to four features. We examine a number of
methods of assessing the performance of the network and determine which works best for
our task.
Key words: methods: data analysis – galaxies: general – galaxies: structure.
1 INTRODUCTION
Historically, astronomical galaxy classification has been done by a
few experts in a very time-consuming procedure of visually
inspecting each object in an image. As computing resources have
become more powerful and digitized data more prevalent,
automated classification schemes have also multiplied.
Until recently, automated schemes have emphasized star/galaxy
discrimination. This is a logical first step since, in order to study
morphological galaxy classification, one must first distinguish the
galaxies from the stars. It also turns out to be much easier to
perform than full galaxy classification. Work on star/galaxy
discrimination has been done using a variety of software packages.
An example is skicat, a software package that uses a decision tree
approach to classification (Weir, Djorgovski & Fayyad 1995). The
APM machine (Maddox et al. 1990) has classified stars and
galaxies from the SERC J plates, producing a catalogue of
approximately two million galaxies. Maddox et al. (1990) defined
a star/galaxy discrimination parameter C which measured the
deviation between the features for the object being studied and the
features for a star. Objects from the Minnesota Automatic Plate
Scanning machine have been classified using a neural network
approach (Odewahn et al. 1992). Bazell & Peng (1998) examined
several different input feature sets for star/galaxy discrimination.
Early work on morphological classification using neural networks was done by Storrie-Lombardi et al. (1992). They used a
neural network trained with back-propagation using 13 input
parameters to classify galaxies into five classes: E, S0, Sa+Sb,
Sc+Sd and Irr. They reported a 64 per cent classification
accuracy if the highest probability output was used to represent
* E-mail: [email protected]
q 2000 RAS
the class, and a 90 per cent accuracy if the first or second highest
probability represented the output.
While a set of features must be chosen as a starting point for
classification, it is important to consider how they can be
optimized to improve classification and remove unneeded features.
There are several useful approaches to optimizing feature sets,
including forward selection and backward elimination. Forward
selection starts with a small set of features and assesses the
performance of the classifier after the addition of one feature,
cycling through each of a set of features each time. The feature
that most increases performance at each step is kept and the
process is repeated. Backward elimination starts with a full set of
features and deletes a single feature, retrains the classifier, and
assesses performance. The feature whose presence causes the
smallest increase in performance is dropped and the process
repeated.
In the work presented here, we use a variation of backward
elimination. We do not look at each feature individually, but rather
allow the neural network to assess simultaneously the relevance of
all the features being used as inputs. By starting with a large set of
features, we iteratively eliminate those features that have the
smallest relevance.
2 DATA PREPARATION
A list of galaxies was obtained from the work of Naim et al.
(1995a) consisting of 834 objects with coordinates and morphological types (T-types). Of the initial list of 834 objects, we used
805. The 29 remaining objects were rejected for reasons of quality
and availability of the images. Images (300 × 300 pixels) of the
805 galaxies were downloaded using the skyview facility by
supplying a list of coordinates of the objects. The data for the
images were obtained from the Digitized Sky Survey (DSS). Each
image was examined individually and centred in the 300 × 300
frame.
In the work presented here we made no attempt to calibrate the
images. Thus, any non-linearity or saturation in the images must
be handled by the classifier software. As discussed below, we did
attempt to remove heuristically the effect of saturation to some
extent by excluding from consideration the brightest pixels in an
image when calculating certain features.
The DSS images used in this work are not in units of intensity,
but rather represent uncalibrated plate densities. Furthermore, the
images of the galaxies are drawn from multiple POSS plates.
Thus, formal background subtraction or extraction of photometric
parameters would be highly questionable. Nevertheless, some
minimal background subtraction was performed in order to isolate
each galaxy image and to reduce noise in extracted parameters.
2.1 Background subtraction and point source removal
To perform background subtraction on the images we first created
an image histogram of each object. The peak of the histogram
corresponds to the most common pixel value, which in all but a
few cases was contained in the background. We set the background of the image to be equal to the most common pixel value.
Since this is essentially a first-order approximation to the background, this procedure undoubtedly underestimates the background
to some extent. Moreover, no attempt was made to account for
background variation across the image. In subsequent feature
extraction steps we exclude the lowest pixels in the image before
processing in order to account for residual background.
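The histogram-peak background estimate described above can be illustrated with the following sketch (the function names and the 256-bin default are our own choices for illustration, not taken from the paper's code):

```python
import numpy as np

def estimate_background(image, nbins=256):
    """Estimate the background as the most common pixel value,
    i.e. the peak of the image histogram (a first-order approximation)."""
    counts, edges = np.histogram(image.ravel(), bins=nbins)
    peak = np.argmax(counts)
    # Use the centre of the most populated bin as the background level.
    return 0.5 * (edges[peak] + edges[peak + 1])

def subtract_background(image, nbins=256):
    """Subtract the histogram-peak background; clip negative residuals."""
    return np.clip(image - estimate_background(image, nbins), 0.0, None)
```

As in the text, this deliberately ignores background variation across the image; any residual background must be handled downstream by excluding the lowest pixels.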
Point sources were removed using a variation of the iterative
algorithm called sigma_filter developed by Varosi & Gezari
(1991). In our version we search a square window for single pixels
that are more than 3σ above the mean in the window, excluding
the central pixel. These pixels are replaced with the mean value in
the window. We iteratively reduce the size of the window until
pixels from the centre of the galaxy image begin to be eliminated.
Once the window size is set, the pixel removal proceeds
iteratively, removing all pixels that are more than 3σ, until no
more pixels change. We then check that the central region of the
galaxy has not changed. If it has, we increase the size of the
window and start over with the original image.
We found this procedure worked well in removing most point
sources and leaving galaxy features untouched. In some cases,
when a stellar image was saturated, we had to remove the
remainder of the star by hand.
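The core of the point-source removal can be sketched as follows. This is a simplified, fixed-window version of the idea behind the Varosi & Gezari (1991) sigma_filter (an IDL routine); the adaptive window-size search and the check on the galaxy centre described above are omitted here.

```python
import numpy as np

def sigma_filter(image, size=5, nsigma=3.0, max_iter=20):
    """Replace pixels more than nsigma above the mean of their square
    window (central pixel excluded) with that window mean, iterating
    until no pixels change or max_iter is reached."""
    img = image.astype(float).copy()
    h = size // 2
    for _ in range(max_iter):
        changed = 0
        padded = np.pad(img, h, mode='reflect')
        out = img.copy()
        for y in range(img.shape[0]):
            for x in range(img.shape[1]):
                win = padded[y:y + size, x:x + size].ravel()
                centre = size * h + h          # index of the central pixel
                neigh = np.delete(win, centre)  # window minus centre
                mean, std = neigh.mean(), neigh.std()
                if img[y, x] > mean + nsigma * std:
                    out[y, x] = mean
                    changed += 1
        img = out
        if changed == 0:
            break
    return img
```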
2.2 Feature selection and extraction
There are an infinite number of features that one could use to
perform classification on an image. We chose to compile a feature
set by choosing a variety of features that have been used
previously by other workers in the field, as well as adding some
new parameters of our own.
The features we used for our galaxy morphology classification
are listed in Table 1. While most of the features are self-explanatory, we discuss the derivation of several of them here.
The feature designated mq2q3 was calculated by examining the
plot of I(r) versus r and dividing the range of r values into four
equal segments. A linear fit was performed on the second and
third quarters of the I(r) versus r plot. The ratio of the slopes of the
second quarter to the third quarter was the assigned value of mq2q3
for that galaxy.
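A sketch of the mq2q3 calculation, assuming the radial profile I(r) has already been extracted (the function name and sampling are ours):

```python
import numpy as np

def mq2q3(r, intensity):
    """Ratio of the fitted I(r)-versus-r slopes in the second and third
    quarters of the radial range, as described above."""
    r = np.asarray(r, float)
    I = np.asarray(intensity, float)
    lo, hi = r.min(), r.max()
    quarter = (hi - lo) / 4.0

    def slope(a, b):
        # Linear fit to I(r) over the sub-range [a, b); return the slope.
        m = (r >= a) & (r < b)
        return np.polyfit(r[m], I[m], 1)[0]

    return slope(lo + quarter, lo + 2 * quarter) / slope(lo + 2 * quarter, lo + 3 * quarter)
```

For an exponential profile I(r) = exp(−r) the slope ratio is e, since each quarter is a scaled copy of the next.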
The asymmetry parameter was introduced by Abraham et al.
(1995) and is defined by taking the difference between the galaxy
image and the same image rotated 180° about the centre of the
galaxy. The sum of the absolute value of the pixels in the difference image is divided by the sum of the pixel values in the original
image to give the asymmetry parameter. This parameter tends
towards zero when the image is invariant under a 180° rotation.
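The asymmetry computation as just defined can be written compactly; here we rotate about the array centre, so the galaxy is assumed to be centred in the frame (as the images in this work are):

```python
import numpy as np

def asymmetry(image):
    """Abraham et al. (1995) style asymmetry: difference the image with
    its 180-degree rotation and normalize by the total flux.
    Tends to zero for images invariant under the rotation."""
    img = np.asarray(image, float)
    rotated = img[::-1, ::-1]  # 180-degree rotation about the array centre
    return np.abs(img - rotated).sum() / img.sum()
```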
The feature r25/r75 is the ratio of radii containing 25 per cent of
the brightness and 75 per cent of the brightness in a plot of I(r).
This is the inverse of what Odewahn & Aldering (1995) refer to as
the concentration index C31.
The concentration indices are defined several ways in the
literature – see, for example, de Vaucouleurs (1962) and Doi,
Fukugita & Okamura (1993). Our definition is

   C_i = ∫_0^{a_i} I(r) dr / ∫_0^1 I(r) dr,                        (1)

where a_i = 0.1–0.9 and r is the normalized distance from the
centre of the object, corrected for ellipticity. Thus, the concentration index is a measure of the fraction of total brightness falling
within a certain ellipticity-corrected radius.
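Equation (1) can be evaluated numerically from a sampled radial profile; the sketch below assumes r is already the ellipticity-corrected radius scaled to [0, 1] (the helper names are ours):

```python
import numpy as np

def _integrate(y, x):
    """Trapezoidal integral of samples y over the grid x."""
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))

def concentration_indices(r, intensity, a=None):
    """Concentration indices C_i of equation (1): the fraction of total
    brightness inside normalized radius a_i, for a_i = 0.1, ..., 0.9."""
    if a is None:
        a = np.arange(0.1, 1.0, 0.1)
    r = np.asarray(r, float)
    I = np.asarray(intensity, float)
    total = _integrate(I, r)
    return np.array([_integrate(I[r <= ai], r[r <= ai]) / total for ai in a])
```

For a flat profile the indices simply recover the a_i values themselves, a useful sanity check.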
RBulge is determined by taking the average of the first two
concentration indices as a reference value. We then find the radius
Table 1. Features selected for use in classifying galaxies.

Feature number  Feature name              Description
 1              peak brightness           Maximum brightness level in the image
 2              mean brightness           Mean brightness level in the image
 3              mq2q3                     Ratio of fitted slope of I(r) versus r for the second and third quartiles
 4              ellipticity               Ratio of the semimajor to semiminor axis length
 5              area                      Number of pixels contained in the object
 6              max(rI)                   Maximum value of the plot of rI(r) versus r
 7              Asym                      Comparison between original galaxy and galaxy rotated 180°
 8              r25/r75                   Ratio of the radii at which 25 and 75 per cent of light is enclosed in a plot of I(r) versus r
 9              RBulge                    Radius where I(r) falls to 90 per cent of peak value
10–18           Ci                        Concentration indices for the ith annulus; i = [0, 8]
19              isophotal displacement    Maximum displacement of the centres of five isophotal levels
20              isophotal filling factor  Area of an isophotal level relative to the area of the enclosing ellipse
21              Pmax                      Maximum value of the normalized co-occurrence matrix, cij
22              Entropy                   −Σi,j cij log(cij)
where the value of the concentration index falls below 10 per cent
of the averaged reference value and assign this to RBulge.
The isophotal displacement was introduced by Naim, Ratnatunga & Griffiths (1997). This feature is calculated by determining
the centres of five isophotal levels, finding the distance between
each pair of centres, and setting the parameter equal to the
maximum value of the inter-centre distance. We divide the image
of each galaxy into five isophotal levels defined on the logarithm
of the image, excluding the bottom 15 per cent and top 5 per cent
of the pixel values to eliminate noise and saturation effects. Each
isophote is fitted with an ellipse and the coordinates of the centre
of each ellipse are compared with the other centres. The maximum
value is taken and assigned to the feature.
The isophotal filling factor was also introduced by Naim et al.
(1997). It is the ratio of the number of pixels that are covered by
the third isophotal level to the number of pixels in the enclosing
ellipse. As with the isophotal displacement, we exclude the bottom
15 per cent and the top 5 per cent of pixels and divide the remaining
image into five logarithmic levels. We calculate the enclosing
ellipse by determining the maximum distance of an isophotal pixel
from the centre of the isophote. We then find an ellipse that has
the same centre and position angle as the current isophote, and
find its area. The ratio of the area of the isophote to the area of the
ellipse gives the value of this feature.
We also examined the use of texture features as discriminators.
There are a variety of ways to measure the texture of an object. We
use the co-occurrence matrix (Haralick 1979), which can be
defined as follows. Define a position operator to be `one pixel
right and one pixel down'. The co-occurrence matrix cij is defined
as the number of pixels with grey level i separated by the position
operator from a pixel with grey level j, divided by the number of
possible point pairs. Each element of this matrix is a measure of
the probability of finding two pixels in the image, with a
separation as defined by the position operator, with grey levels i
and j. Use of the co-occurrence matrix requires that the grey levels
in the image be quantized. The maximum values of i and j
correspond to the number of grey levels, thus defining the
dimensions of the matrix. In our work we quantized the image into
16 grey levels, resulting in a 16 × 16 co-occurrence matrix.
Once this matrix is calculated, a variety of measures can be
used to summarize the matrix in a few values. Common measures
that have been used (Haralick 1979) include Pmax = max(cij) and
the entropy, as defined in Table 1, as well as the energy, Σi,j cij². We
found Pmax and the energy to have quite similar distributions
across galaxy types, so we chose Pmax and the entropy as texture
features.
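The co-occurrence construction and the two texture summaries can be sketched as follows, for the position operator 'one pixel right and one pixel down' (the quantization scheme is our own simple linear binning):

```python
import numpy as np

def cooccurrence_features(image, levels=16):
    """Normalized co-occurrence matrix c_ij (Haralick 1979) for the
    position operator 'one pixel right and one pixel down', with the
    image quantized to `levels` grey levels; returns (Pmax, entropy)."""
    img = np.asarray(image, float)
    # Quantize to integer grey levels 0..levels-1.
    lo, hi = img.min(), img.max()
    q = np.minimum((levels * (img - lo) / (hi - lo + 1e-12)).astype(int),
                   levels - 1)
    a = q[:-1, :-1].ravel()   # reference pixels
    b = q[1:, 1:].ravel()     # pixels one right and one down
    c = np.zeros((levels, levels))
    np.add.at(c, (a, b), 1.0)
    c /= c.sum()              # normalize by the number of point pairs
    nz = c[c > 0]
    pmax = c.max()
    entropy = -np.sum(nz * np.log(nz))
    return pmax, entropy
```

A flat image concentrates all pairs in one cell (Pmax = 1, entropy = 0), while a noisy image spreads them out, raising the entropy.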
3 NEURAL NETWORK CLASSIFICATION
Our purpose in this study was to investigate the importance or
relevance of the 22 features used as input to the networks and to
demonstrate the utility of the automatic relevance determination to
neural network classification of astronomical images. We ran a
series of tests using a back-propagation neural network to classify
our sample of galaxies. Each test consisted of ten cross-validation
runs using a training set consisting of 644 galaxies (80 per cent)
and a test set consisting of 161 galaxies (20 per cent), drawn
randomly from the total data set consisting of 805 galaxies. We
used maximum likelihood optimization, which attempts to find
weights in the neural network model that maximize the joint
probability of the observed output. If the weight parameters that
are determined during optimization of the network are designated
as w, then there is some function P(yi, w) that denotes the
probability of occurrence of the output value yi for the ith object or
observation. The log of the joint likelihood of observing the data is
then given by
   log P(w) = Σ_i log P(y_i, w),                                   (2)
which will be maximized by the same parameter set w that will
maximize the joint likelihood P(w) itself.
During a cross-validation run the network is optimized based on
the test set data. In other words, during each training epoch the
ability of the network to predict the test data set is evaluated and
weights are saved for the epoch that produces the best test set
evaluation.
After each classification run we determined the relevance of
each of the features in determining the classification of the
galaxies. There are a variety of ways one can determine feature
relevance. The nevprop software we used (Goodman 1996)
implements automatic relevance determination (ARD) (MacKay
1992) to determine how important a given feature is. The basic
principles behind ARD for neural networks (MacKay 1991) can
be summarized as follows.
The neural network attempts to minimize an objective function,
or error function, which we will write as
   E(w) = βE_D + αE_W,                                             (3)
where w is the weight vector, E_D is the sum squared difference
between the predicted output of the network and the `true' value of
the output, E_W = ½ Σ_i w_i² is a regularization function serving to
smooth variations in the objective function, and α and β are
regularization constants. The term E_W penalizes large values of
the weights by making the total error function E(w) large in the
presence of large weights. This is the standard `weight decay'
approach to regularizing the objective function. Typical values for
α might be in the range [0.001, 0.9], with α being a constant
set before the network is run.
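Equation (3) and the decay behaviour it induces can be made concrete in a few lines (the function names, learning rate and example values are ours, for illustration only):

```python
import numpy as np

def objective(w, e_data, alpha, beta):
    """Equation (3): E(w) = beta * E_D + alpha * E_W,
    with E_W = 0.5 * sum(w_i**2) the weight-decay penalty."""
    e_w = 0.5 * np.sum(np.square(w))
    return beta * e_data + alpha * e_w

def decay_step(w, alpha, lr=0.1):
    """One gradient step on the alpha*E_W term alone: since dE_W/dw = w,
    the update w <- w - lr * alpha * w shrinks the weights at a rate set
    by alpha -- the 'weight decay' behaviour described above."""
    w = np.asarray(w, float)
    return w - lr * alpha * w
```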
The main idea behind ARD is to apply a different regularization
constant α_i to each input parameter in order to control the size of
the weights from that parameter to the hidden layer. If a given
input parameter is not important, the α_i associated with that input
parameter will be inferred to be large, yielding a small likelihood
for that parameter. Remember that α_i is a weight decay parameter:
the larger the value of α_i, the faster the associated weight decays.
Thus, the inverse of α_i is a measure of the relevance of the
associated input parameter, the `relative' relevance of a given
parameter being given by
   relevance = Σ_i α_i^{-1} / Σ_j α_j^{-1},                        (4)
where the index i runs over the weights for a single input
parameter, and the index j runs over the weights of all input
parameters. This is the way we determine the relevance of each
input parameter following a neural network run.
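Equation (4) reduces to a one-line normalization. In this sketch we assume, for simplicity, a single α per input rather than summing over the individual weights attached to each input:

```python
import numpy as np

def relative_relevance(alphas):
    """Equation (4): the relevance of each input is its inverse decay
    constant 1/alpha_i, normalized by the sum of all the inverses.
    Large alpha (fast decay) therefore maps to small relevance."""
    inv = 1.0 / np.asarray(alphas, float)
    return inv / inv.sum()
```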
Each set of ten cross-validation runs produced ten rankings for
the relevance of each input feature. These ranks were averaged
and the resulting values were used to determine which input
features should be eliminated. An example of this procedure is
shown in Table 2, which shows the features, relevance values and
ranks of the features for two runs with six output types. Note that
the feature numbers correspond to the numbers in Table 1, the
relevances are given as percentages, and for the ranking lower
Table 2. Feature relevance ranking for six output types.

                Seven features          Four features
Feature   Rel. (per cent)  Rank   Rel. (per cent)  Rank
   2           24.33         2         38.90         1
   3           40.99         1         22.38         3
   5           20.21         3         26.91         2
  12            3.23         6           –           –
  13            6.42         4         13.94         4
  15            3.87         5           –           –
  16            1.07         7           –           –
Table 3. Grouping used for neural network classification runs.

two types     E/S0, Sa/Sb/Sc/Sd/Sm/Ir
three types   E/S0, Sa/Sb/Sc, Sd/Sm/Ir
four types    E/S0, Sa/Sb, Sc, Sd/Sm/Ir
five types    E/S0, Sa, Sb, Sc, Sd/Sm/Ir
six types     E, S0, Sa, Sb, Sc, Sd/Sm/Ir
seven types   E, S0, Sa, Sb, Sc, Sd, Sm/Ir
numbers are better (1 is highest, 7 is lowest). The table shows the
ranking when the network had seven input features, how the three
lowest ranked features were eliminated, and the resulting ranking
of the four remaining features.
Each cross-validation run produced a classification of the
training and testing data set, along with output weights for the
network connections. The training and testing data were drawn
randomly from the entire data set for each cross-validation run.
The initial weight values for the network were determined pseudorandomly at the start of each run. This is important because certain
initial weight values might lead the network to a local minimum,
so choosing different initial weights for each run minimizes the
chance of getting stuck in a local minimum.
The relevance values from each of the ten cross-validation runs
were averaged, and we removed the input features that had the
three lowest relevances. Most of the time there were several
features that had low relevance values, perhaps an order of
magnitude lower than the highest relevance values.
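The pruning loop described above can be written schematically as follows. Here relevance_fn stands in for a full train-and-ARD cycle (ten cross-validation runs followed by averaging of the relevances) and is hypothetical; only the elimination logic is shown:

```python
import numpy as np

def prune_features(features, relevance_fn, n_drop=3, n_keep=4):
    """Iteratively drop the n_drop least relevant features (by their mean
    relevance over cross-validation runs) until n_keep remain.
    `relevance_fn(features)` must return one relevance per feature."""
    features = list(features)
    while len(features) > n_keep:
        rel = np.asarray(relevance_fn(features), float)
        drop = min(n_drop, len(features) - n_keep)
        order = np.argsort(rel)  # ascending: least relevant first
        keep = sorted(set(range(len(features))) - set(order[:drop]))
        features = [features[i] for i in keep]
    return features
```

Starting from 22 features and dropping three per iteration reproduces the sequence 22, 19, 16, 13, 10, 7, 4 seen in the figures.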
Because the T-type of each galaxy was calculated by Naim et al.
(1995a) from an average of the eye-ball classifications of six
individuals, the T-type essentially takes on a continuous range of
values from −5 to 10. The neural network classifier needs a
number of examples of each output type in order to be trained.
Therefore, we broke the continuum of types into seven main types
and further grouped them for some of our classification runs. The
following list shows the main groupings that we used:
(1) [−5, −3.5]: E
(2) (−3.5, 0.0): S0
(3) [0.0, 2.0): Sa
(4) [2.0, 4.0): Sb
(5) [4.0, 6.0): Sc
(6) [6.0, 8.0): Sd
(7) [8.0, 10.0]: Sm/Ir
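The binning above can be sketched as a small lookup function; the function name is ours, and the boundary handling follows the interval notation in the list (E closed on both ends, the interior bins half-open, Sm/Ir closed at 10):

```python
def t_type_to_class(t):
    """Map a (continuous) T-type to one of the seven main groupings:
    [-5, -3.5] E, (-3.5, 0) S0, [0, 2) Sa, [2, 4) Sb,
    [4, 6) Sc, [6, 8) Sd, [8, 10] Sm/Ir."""
    if not -5.0 <= t <= 10.0:
        raise ValueError(f'T-type {t} outside [-5, 10]')
    if t <= -3.5:
        return 'E'
    if t < 0.0:
        return 'S0'
    for hi, name in [(2.0, 'Sa'), (4.0, 'Sb'), (6.0, 'Sc'), (8.0, 'Sd')]:
        if t < hi:
            return name
    return 'Sm/Ir'
```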
We also investigated the ability of our networks to classify the
galaxies into two to seven classes. These classes were made by
combining the main groupings shown above. Table 3 shows the
groupings that we used.
We specified the architecture of the networks as i:h:o, where i is
the number of input nodes, h is the number of hidden nodes, and o
is the number of output nodes. All of our networks started with a
22:3:o architecture where the number of output nodes varied from
two to seven. The output nodes could each take on a continuous
value ranging from 0 to 1. For the cases of two to seven output
nodes, each node represented a different group of galaxy types.
For example, with two output nodes, if the galaxy being classified
is type E or S0, the first node should ideally display a 1 and the
second node should display a 0. If the galaxy being classified is
Sa, Sb, Sc, Sd, or Ir, the first node should be 0 and the second node
should be 1.
Because the value at the output node is continuous, we have to
make a decision about what range of values is acceptable for a
given target value. For the classifiers with two to seven output
nodes, we chose a threshold range of 0.45. This means that, if the
target value was 1, then any value greater than 0.55 would be
accepted. This threshold value does not affect the training of the
classifier, but only the interpretation of the accuracy of the
results.
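The acceptance rule can be expressed as a small helper. The function name and the array layout (one row per object, one column per output node) are our own assumptions:

```python
import numpy as np

def threshold_accuracy(outputs, targets, threshold=0.45):
    """Fraction of objects whose every output node lies within `threshold`
    of its 0/1 target, so a target of 1 accepts values above 0.55.
    This affects only how accuracy is reported, not training."""
    outputs = np.asarray(outputs, float)
    targets = np.asarray(targets, float)
    ok = np.all(np.abs(outputs - targets) < threshold, axis=1)
    return ok.mean()
```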
4 CLASSIFICATION RESULTS
With many features to choose from, it is interesting first to look at
the correlation matrix for the feature set. When two features are
strongly correlated, then elimination of one of the features should
not have a detrimental effect on our ability to classify objects.
However, when the features have only a modest correlation, then
the effect of removing one of the features is not so clear. The
features we used include nine concentration indices, as defined in
equation (1). The correlation matrix for the nine indices shows a
strong correlation between nearby indices and decreasing
correlation as the distance increases. For example, the value of
the correlation matrix between C3 and C4 is 0.92, between C3 and
C5 is 0.75, and between C3 and C6 is 0.50. Thus, it would appear
that choosing two or three well-separated indices would be
adequate to represent the whole range of concentration indices.
We wanted to find out what ARD would do with a variety of
features, some correlated, some not. Therefore, we left all the
concentration indices in our original data set and allowed ARD to
sort things out.
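The correlation matrix referred to here is the standard Pearson matrix over the feature table; a minimal sketch, assuming one row per galaxy and one column per feature:

```python
import numpy as np

def feature_correlations(X):
    """Pearson correlation matrix of a feature table X
    (rows = galaxies, columns = features)."""
    return np.corrcoef(np.asarray(X, float), rowvar=False)
```

Strongly correlated columns show entries near ±1, which is the signature used above to argue that nearby concentration indices are redundant.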
The case with the other features was not so clear. Table 4 shows
the correlation matrix for 14 of our 22 features. We chose to
represent the family of concentration indices by C3 because it
shows up so strongly in our final reduced feature set. The
correlation matrix shows a strong correlation between mean
brightness and area (0.95), between C3 and r25/r75 (−0.75), and
between rImax and r25/r75 (0.64). The remaining parameters show
marginal or negligible correlations.
A summary of the results of the classification runs using the six
different groupings is displayed in Table 5. This table shows the
features that remained after six and seven iterations of removing
the least relevant features. The entries in the table are the
relevance rankings of each of the features. Features 10–18 are all
concentration indices. We can see immediately the effect of the
correlation between the concentration indices. After the sixth
iteration, on average about four concentration indices remain as
features. This drops to about 1.7 after the final iteration. Similarly,
features 2 and 5 have significant correlation and often show up
together in the final feature sets.
Feature 13, concentration index C3, occurs most often, being
present in all groupings except two output types and the final
iteration of four output types. Features 2 and 3, mean brightness
and mq2q3 ratio, are not present in the final set of features for
2-type and 3-type classifications, but appear frequently highly
Table 4. Correlation matrix for features 1–9, 13, 19–22.

        1      2      3      4      5      6      7      8      9     13     19     20     21     22
      Max   Mean  mq2q3    ell   area  rImax   Asym r25/75 RBulge     C3 IsoDisp IsoFF   Pmax Entropy
 1   1.00   0.14   0.09  -0.12   0.04  -0.14  -0.24  -0.20   0.18   0.33   0.03   0.18   0.15   0.01
 2   0.14   1.00  -0.03  -0.31   0.95   0.03  -0.20  -0.05   0.10   0.03   0.13   0.05   0.37  -0.04
 3   0.09  -0.03   1.00  -0.12   0.01  -0.32  -0.04  -0.25  -0.06   0.13   0.14   0.11   0.23  -0.27
 4  -0.12  -0.31  -0.12   1.00  -0.37   0.05   0.22   0.16   0.05   0.02   0.05  -0.02  -0.49   0.16
 5   0.04   0.95   0.01  -0.37   1.00   0.04  -0.13  -0.07  -0.02  -0.04   0.23   0.08   0.46  -0.17
 6  -0.14   0.03  -0.32   0.05   0.04   1.00   0.16   0.64  -0.11  -0.58   0.01  -0.18  -0.33   0.28
 7  -0.24  -0.20  -0.04   0.22  -0.13   0.16   1.00   0.13  -0.23  -0.27   0.19  -0.13  -0.19  -0.19
 8  -0.20  -0.05  -0.25   0.16  -0.07   0.64   0.13   1.00  -0.05  -0.74  -0.10  -0.32  -0.54   0.43
 9   0.18   0.10  -0.06   0.05  -0.02  -0.11  -0.23  -0.05   1.00   0.46  -0.12  -0.09  -0.03   0.30
13   0.33   0.03   0.13   0.02  -0.04  -0.58  -0.27  -0.74   0.46   1.00   0.08   0.13   0.36  -0.16
19   0.03   0.13   0.14   0.05   0.23   0.01   0.19  -0.10  -0.12   0.08   1.00  -0.13   0.27   0.27
20   0.18   0.05   0.11  -0.02   0.08  -0.18  -0.13  -0.32  -0.09   0.13  -0.13   1.00   0.35  -0.23
21   0.15   0.37   0.23  -0.49   0.46  -0.33  -0.19  -0.54  -0.03   0.36   0.27   0.35   1.00  -0.50
22   0.01  -0.04  -0.27   0.16  -0.17   0.28  -0.19   0.43   0.30  -0.16   0.27  -0.23  -0.50   1.00
Table 5. Summary of results for binary classification into two to seven type categories. (For each of the six type groupings, the table lists the relevance ranking of each feature, numbered as in Table 1, that survived the sixth and seventh pruning iterations.)
ranked in 4- to 7-type classifications. The 3-type grouping lumps
together Sa, Sb and Sc Hubble types, while subsequent groupings
break these out. This suggests that these features are important for
distinguishing between these different Hubble types.
The question arises as to how to determine when to stop
deleting features from the original feature set. We examined a
number of approaches to this question, including looking at the
change in distribution of relevance values, looking at trends in
mean nodal activation, and looking for minima in error measures.
Starting from our original feature set, we deleted features three at
a time, choosing those features with the lowest relevance, until
there were only four features left. Typically this entailed deleting
features whose relevance was about an order of magnitude less
than the most relevant features of the previous iteration.
Fig. 1 shows this process for the case of four output types. The
figure shows the relevance of each feature for the seven iterations
of relevance determination/feature elimination. The narrower
(outer) error bars show the range of relevance values for the ten
cross-validation runs. If one of these runs produced a very small
value for the relevance of a certain feature, then the error bar
would extend to a low value. An example is feature five of the plot
labelled 7–3–4. The wider (inner) error bars show the rms
deviation from the mean calculated separately for positive and
negative deviations. The symbol is the mean value of the ten
cross-validation runs. Note that the range of relevance values for a
single parameter in a specific run can be two orders of magnitude.
The range of values appears to be due to the network getting stuck
in different local minima during its learning phase for the different
cross-validation runs. It then converges to that minimum with a
certain relevance for each parameter. We simply took the mean of
the ten values. Using the median of the values rather than the mean
would put less significance on the extremal values, but it is not
immediately clear whether that would be a better choice.
Examining the change in the distribution of relevance values as
the features are removed, we see no obvious change that would
tell us that an appropriate number of features remained. Neither
the range of relevance values nor the rms deviation from the mean
shows a minimum for a certain number of features. The plots with
16, 13 and 10 input features in Fig. 1 all show mean relevance
values within about an order of magnitude of each other. The
dispersion appears to increase for the plots with seven and four
features. However, similar plots for different numbers of output
types do not show similar trends.
The mean nodal activation did not appear to be much more
helpful in finding an objective stopping point for parameter
deletion. When performing a classification run, the objects are
ideally classified into one of several output types, but in reality
each output node can have a range of values from 0 to 1. An
Figure 1. Relevance of each parameter for the indicated network architecture i:3:4. The wider (inner) error bars show the rms deviation from the mean
calculated separately for positive and negative deviations. The narrower (outer) error bars show the range of relevance values for the ten cross-validation runs
for each parameter. The mean of the ten values is shown by the symbol.
example of this is shown in Fig. 2, where we have plotted the
mean nodal activation as a function of output node number for
several different network configurations of the form i:3:4. The
mean nodal activation was calculated taking all objects that should
be activating output node 1 and averaging the nodal values for
each of the output nodes separately. This procedure was repeated
for each set of objects that should be activating a given output
node. Concentrating on the top left panel labelled 22±3±4 we see
the objects that should be activating output node 1 also activate
nodes 2, 3 and 4, although to a lesser extent. Similarly, objects of
Figure 2. Mean nodal activation for objects that should be classified to node 1 (+), node 2 (*), node 3 (W) and node 4 (S) for different numbers of input
features. Each curve is offset by 0.2 from the previous curve for clarity. The labels above the plots denote the number of input:hidden:output nodes.
type 3 have their peak at node 3, but also activate the other nodes.
This broadening of the output activation profile is due to a number
of factors including: (a) noise, (b) calibration variations, (c) nonlinearities in the input images, as well as (d) the choice of features
used to describe the object. Our main interest here was to study
how (d) affects the classification of our input data set.
Looking at the mean activation for type-1 objects (+), we see
that the profile is essentially the same for 22 and 19 input features,
but activation of node 1 tends to decrease and activation of node 2
tends to increase as we delete more features. Activation due to
type-2 objects (*) does not appear to change with the number of
input features. Activation due to type-3 objects (W) shows a weak
Figure 3. Error measures for classification into six output groupings as a function of the number of input features: negative log likelihood (S); average
threshold error (A); average squared error (K).
Figure 4. Error measures using the reduced feature set for classification
into four output groupings as a function of the number of input features:
negative log likelihood (S); average threshold error (A); average squared
error (K).
maximum at node 3 for 22, 19, 16 and 13 input features, and then
peaks at node 2 for 10, 7 and 4 features. Activation due to type-4
objects (S) is strongest at node 4 for 19 input features, and clearly
decreases for seven and four input features. Thus, one might say
that the best overall configuration would be 19 input features, but
the mean nodal activation makes a weak case for that, too.
Looking at similar plots for other numbers of output nodes, we see
essentially the same thing. The activation of one node might
increase a little as features are deleted, but others decrease, and
there is not an obvious trend that allows us to say that a given
number of input features is the best.
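As a concrete illustration (not the original implementation), the mean nodal activation profiles of the kind plotted in Fig. 2 can be computed as follows; the random arrays here are stand-ins for trained-network outputs and the true morphological types:

```python
import numpy as np

# Hypothetical network outputs: one row per galaxy, one column per output node.
# 'targets' holds the true class index (0-based) for each galaxy.
rng = np.random.default_rng(0)
outputs = rng.random((805, 4))          # stand-in for trained-network activations
targets = rng.integers(0, 4, size=805)  # stand-in for the true types

# Mean activation of every output node, grouped by the true class of the input.
# Row k of 'profiles' is the mean activation profile for objects of type k+1;
# a well-separated classifier peaks on the diagonal.
n_classes = outputs.shape[1]
profiles = np.vstack([outputs[targets == k].mean(axis=0)
                      for k in range(n_classes)])
print(profiles.shape)
```

With real network outputs, each row of `profiles` traces one of the curves in Fig. 2 before the 0.2 offsets are applied.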
Yet another objective way to determine the appropriate number of features is to rely on some measure of the error on the cross-validation test set. In Fig. 3 we plot three different error measures
for each of the six output groupings we examined. The figure
shows the negative log likelihood (S), average squared error (K)
and the average threshold error (A) plotted as a function of the
number of features used as input to the network. The negative log
likelihood is the function that is minimized (i.e. the likelihood is maximized) to determine when to
stop training the network. The average squared error is the average
squared difference between the network output and the target
value, summed over all objects. The average threshold error is the
fraction of objects that fell outside the threshold value of 0.45.
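The three error measures can be sketched as follows; this is an illustrative reading of the definitions above, not the original code, and in particular the treatment of the 0.45 threshold (counting an object as an error if any output node misses its target by more than the threshold) is one plausible interpretation:

```python
import numpy as np

def negative_log_likelihood(y, t):
    # Cross-entropy for independent sigmoid output nodes; minimizing this
    # is equivalent to maximizing the likelihood used to stop training.
    eps = 1e-12  # guard against log(0)
    return -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))

def average_squared_error(y, t):
    # Squared output-target differences, summed over nodes,
    # then averaged over all objects.
    return np.mean(np.sum((y - t) ** 2, axis=1))

def average_threshold_error(y, t, threshold=0.45):
    # Fraction of objects with any output node farther than 'threshold'
    # from its 0/1 target (assumed reading of the paper's definition).
    return np.mean(np.any(np.abs(y - t) > threshold, axis=1))

# Toy example: 3 objects, 2 output nodes, 0/1 targets.
y = np.array([[0.9, 0.1], [0.2, 0.7], [0.6, 0.5]])
t = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(average_threshold_error(y, t))
```

In the toy example only the third object has an output (0.5 off target) beyond the 0.45 threshold, so one object in three counts as a threshold error.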
In most of the cases in Fig. 3 there is no clear minimum. The
plot of two output types does show a minimum in the negative log
likelihood around 13 input features, while the plot of three output
types shows a minimum around 19 input features and a broad
minimum between 16 and four input features for both negative log
likelihood and average threshold error. The plot of four output
types shows a minimum around 19 input features, which
corresponds to what was suggested by the mean nodal activation
plots. Fig. 1, however, shows significant variation in relevance
values for 19 input features.
To explore further the question of error trends, we looked at the
rms error for each output node as a function of the number of input
parameters. This is the rms difference between the target node
value, always zero or one, and the actual output node value. The
general behaviour of the rms error as a function of the number of
input features is similar to the other error functions and does not
show any additional variations that could be used to determine the
number of input features to delete.
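The per-node rms error described above is a simple column-wise calculation; a minimal sketch, with toy outputs and targets standing in for the real network values:

```python
import numpy as np

def rms_per_node(y, t):
    # rms difference between each target node value (always 0 or 1)
    # and the corresponding output node value, one number per node.
    return np.sqrt(np.mean((y - t) ** 2, axis=0))

# Toy example: 2 objects, 2 output nodes.
y = np.array([[0.9, 0.2], [0.1, 0.8]])
t = np.array([[1.0, 0.0], [0.0, 1.0]])
print(rms_per_node(y, t))
```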
Because a number of our features had significant correlations
between them, we decided to eliminate one member of each set of
highly correlated features and use the reduced feature set as a
starting point for classification and ARD. We eliminated several
concentration indices, leaving C2 and C6. We also removed the
area feature since it was highly correlated (r = 0.95) with the
mean value. This reduced our starting feature set size to 14. We
performed a series of ten cross-validation runs with four output
types, eliminating three features at a time to end up with five input
features. Fig. 4 shows a plot of the three error measures. There is a
clear minimum in the negative log likelihood around 11 features
and a broad minimum in the average squared error that is
consistent with that. The average threshold error shows a broad
minimum between 11 and eight features. Plots of the mean nodal
activation show marginally better differentiation between nodes
for 11 input features.
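The correlation-based pruning described above (dropping one member of each highly correlated pair, with the 0.95 cutoff matching the area/mean-value example) can be sketched as a greedy pass over the correlation matrix; the feature names and synthetic data here are purely illustrative:

```python
import numpy as np

def prune_correlated(X, names, cutoff=0.95):
    # Keep a feature only if its |correlation| with every
    # already-kept feature is below the cutoff.
    r = np.corrcoef(X, rowvar=False)
    keep = []
    for j in range(X.shape[1]):
        if all(abs(r[j, k]) < cutoff for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]

rng = np.random.default_rng(1)
base = rng.random((100, 3))
# Fourth column is the first plus small noise, so it is nearly
# perfectly correlated with it and should be dropped.
X = np.column_stack([base, base[:, 0] + 0.01 * rng.random(100)])
Xr, kept = prune_correlated(X, ["C2", "C6", "mean", "area"])
print(kept)
```

Applied to the real 22-feature set, a pass like this reduces the starting point for the subsequent ARD runs.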
5 DISCUSSION
We have examined a method of feature relevance determination
for a sample of 805 galaxies starting with 22 input features and
using a back-propagation neural network. We found that certain
features appear to be more important for discrimination between
certain types of galaxies. For example, the mean brightness and
the slope parameter mq2q3 both appear when classifying the
galaxies into four or more types. We also found that at least one of
the concentration indices is relevant after reduction of the initial
feature set to four features. A concentration index is present in all classifiers, for any number of output types from two to seven.
Also of interest are the features that were eliminated early in the
relevance determination procedure. The first concentration index
C0 was removed from all feature sets in the first or second
iteration of relevance determination. That is, C0 had low relevance regardless of how many output types the data were being classified into. C0 had significant quantization error because
of the precision with which it was calculated and stored, so this is
probably an artefact of the precision problem. The ellipticity was
eliminated in the first two iterations in five out of six cases, while
the asymmetry parameter and the entropy of the co-occurrence
matrix were eliminated in four out of six cases. All three of these
parameters showed little correlation in a plot against T-type, so
their elimination is not surprising.
Forward selection, backward elimination and ARD are all
methods of selecting the most important features in a feature set.
The traditional approach using forward selection or backward
elimination on a single parameter at a time can be very time-consuming for even a moderate number of parameters. Automatic
relevance determination, used here in combination with backward
elimination, is faster because it determines the relevance of all the
features in parallel. However, in many of the cases examined here we found no clear and definitive choice of which parameters were best suited for the classification of our galaxy images. When we eliminated the strong correlations between
certain parameters, we found the various error measures gave a
clearer indication of when we had an appropriate number of
parameters. Another way to approach the correlation problem is
with principal component analysis (PCA), which takes linear
combinations of a group of parameters and produces an
orthogonal output set of reduced dimensionality. We are currently
exploring PCA in combination with other methods of relevance
determination.
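PCA as described above, producing orthogonal linear combinations of the correlated inputs, can be sketched via the singular value decomposition; the data here are random stand-ins for the 14-feature reduced set:

```python
import numpy as np

def pca_reduce(X, n_components):
    # Centre each feature, then project onto the leading right singular
    # vectors; the resulting columns are mutually orthogonal.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(2)
X = rng.random((805, 14))   # stand-in for the 14 reduced features
Z = pca_reduce(X, 5)
print(Z.shape)
```

The reduced matrix `Z` could then be fed to the network in place of the raw features before rerunning the relevance determination.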
The lack of a definitive answer to the question of which are the
best parameters for a certain classification task suggests that better
features can be chosen. Naim et al. (1995b) found that
concentration indices are very useful in galaxy discrimination,
and our work supports their conclusions. Odewahn & Aldering
(1995) and Odewahn et al. (1996) use photometric data in their
feature sets. We plan on supplementing our data with photometric
data, reducing the dimensionality of the input set by PCA and
running the relevance determination again.
It should be pointed out that our use of a back-propagation
neural network software is only one of a variety of choices that
could be made for classification algorithms. The main thrust of
our current research effort is to examine different classification
methods and explore different feature sets that are useful with the
different classification algorithms. One promising method of
feature selection is called schemata racing (Moore & Lee 1994;
Maron & Moore 1997). This method of feature selection conducts
a sequence of races, one per feature, during which a decision is
made to select or deselect some feature. Significantly, however,
the ordering in which decisions are made is not predetermined.
Moore and his colleagues reported that schemata racing performs
well in comparison with sequential feature selection algorithms
for several artificial tasks and tasks from four data sets. Ricci &
Aha (1998) used a variant of schemata racing to select an independent set of features for each bit in a nearest-neighbour
algorithm's distributed (i.e. error-correcting) output representation. They reported good results, and specifically selected
schemata racing for its good computational efficiency. Thus, we
believe that schemata racing's benefits will carry over to
astronomical classification tasks.
AC K N O W L E D G M E N T S
We are indebted to the referee, Steve Odewahn, for pointing out
some important references and clarifying several points in the
paper. We acknowledge the use of NASA's SkyView facility
(http://skyview.gsfc.nasa.gov) located at NASA Goddard Space
Flight Center. The Digitized Sky Surveys were produced at the
Space Telescope Science Institute under US Government grant
NAG W-2166. The images of these surveys are based on
photographic data obtained using the Oschin Schmidt Telescope
on Palomar Mountain and the UK Schmidt Telescope. The plates
were processed into the present compressed digital form with the
permission of these institutions. This work was performed under
the support of NASA AISRP grant NAG5-8166.
REFERENCES
Abraham R. G., Valdes F., Yee H. K. C., van den Bergh S., 1994, ApJ, 432,
75
Bazell D., Peng Y., 1998, ApJS, 116, 47
de Vaucouleurs G., 1962, in McVittie G., ed., Proc. IAU Symp. 15,
Problems in Extragalactic Research. Macmillan, New York, p. 3
Doi M., Fukugita M., Okamura S., 1993, MNRAS, 264, 832
Goodman P. H., 1996, NevProp software, version 4.1,
http://www.scs.unr.edu/nevprop
Haralick R. M., 1979, Proc. IEEE, 67, 786
MacKay D. J. C., 1991, PhD thesis, California Institute of Technology
MacKay D. J. C., 1992, Neural Comput., 4, 448
Maddox S. J., Sutherland W. J., Efstathiou G., Loveday J., 1990, MNRAS,
243, 692
Maron O., Moore A. W., 1997, Artif. Intell. Rev., 11, 193
Moore A. W., Lee M. S., 1994, in Cohen W., Hirsh H., eds, Proc. 11th Int.
Conf. on Machine Learning. Morgan Kaufmann, San Francisco, p. 190
Naim A. et al., 1995a, MNRAS, 274, 1107
Naim A., Lahav O., Sodre L. Jr, Storrie-Lombardi M. C., 1995b, MNRAS,
275, 567
Naim A., Ratnatunga K. U., Griffiths R. E., 1997, ApJ, 476, 510
Odewahn S. C., Aldering G., 1995, AJ, 110, 2009
Odewahn S. C., Stockwell E. B., Pennington R. M., Humphreys R. M.,
Zumach W. A., 1992, AJ, 103, 318
Odewahn S. C., Windhorst R. A., Driver S. P., Keel W. C., 1996, ApJ, 472,
L13
Ricci F., Aha D. W., 1998, Proc. 10th Eur. Conf. on Machine Learning.
Springer, Chemnitz, p. 280
Storrie-Lombardi M. C., Lahav O., Sodre L. Jr, Storrie-Lombardi L. J.,
1992, MNRAS, 259, 8
Varosi F., Gezari D., 1991, http://idlastro.gsfc.nasa.gov/homepage.html
Weir N., Djorgovski S., Fayyad U., 1995, AJ, 109, 2401
This paper has been typeset from a TEX/LATEX file prepared by the author.