Data analysis and Ecological Classification

Gabriela Augusto, Marco Painho
Abstract
Classification is a basic tool for conceptualising knowledge in any science, and ecology is no
exception.
There are a great many classification methods, ranging from the empirical classification that we
use in everyday life to the most recent neural network methods.
To classify ecosystems in this project, the chosen methodology was cluster analysis, more
specifically hierarchical classification. This paper compares only three classification
algorithms, together with two measures of similarity, applied to the classification of European
ecosystems.
Biogeography experts evaluated the results, and we arrived at a methodology that led us to an
ecologically consistent classification.
However, because cluster analysis is only a set of heuristic procedures for finding natural
patterns in a collection of data, it can only suggest a classification. It is up to the experts to
evaluate the practically unlimited number of results that this procedure can produce.
Introduction
In Europe, a relatively small continent, there are large variations in the quality and quantity of
the biosphere, from the Arctic polar deserts to semi-arid landscapes, including high mountains,
rich plains and large river valleys. This geographical variation is due to climatic and
topographic constraints which, together with soil conditions, determine the type of vegetation
that can live in a particular place and, in turn, which other living beings can survive there.
Since ecological systems are defined by which organisms live in a particular place and by how
those organisms interrelate with the abiotic components (Rowe and Shultz), an ecological
classification must take all of those factors into account: vegetation, climate and topography.
Information on those factors was drawn from two sources: the European map of potential
vegetation, BfN (Bonn 1994), and an existing land classification map based on topographic
and climatic data for Europe, ITE (Bunce, 1995).
Ecological classification, however, is not a settled matter, because ecological systems
are multiscale concepts. Even the concept of ecosystem, central to ecology, biodiversity and
landscape management, is considered “diffuse and ambiguous” (Savard 1994), since an ecosystem
can be both a lake and the forest that contains that lake. The traditional levels of biological
organisation (biosphere, ecosystem, species, organism and organ systems) are arranged
hierarchically, according to the scale of observation.
Cluster analysis is a set of heuristic procedures for finding natural patterns in large
collections of objects: it places those objects into groups, or clusters, that are suggested by
the data rather than defined in advance. Hierarchical clusters are organised so that one cluster
may be entirely contained within another cluster, but no other overlap between clusters is allowed.
Because ecological systems are understood as hierarchical systems, the ecological classification
was performed by hierarchical cluster analysis.
Cluster Analysis
The purpose of cluster analysis is to place observations into groups or clusters suggested by the
data such that observations in a given cluster tend to be similar to each other in some sense, and
objects in different clusters tend to be dissimilar.
Cluster analysis classifies a set of observations into two or more mutually exclusive groups, not
known in advance, based on combinations of interval variables. Its purpose is to discover a
system for organizing observations into groups whose members share properties in common. It is
cognitively easier for people to predict the behaviour or properties of objects from their group
membership, since all members of a group share similar properties, than to deal with individuals
and predict their behaviour or properties from observations of other behaviours or properties.
Cluster analysis allows many choices about the nature of the algorithm for combining groups, and
each choice may result in a different grouping structure. A significant number of decisions must
therefore be made before performing a cluster analysis: how to measure the data to be classified,
how to measure the similarity between observations, and which criterion to use to decide whether
two records belong to the same group.
Data to Classify
Vegetation reflects many physical factors found at a site, such as climate, soil type, elevation,
and aspect. It also constitutes the ecosystem’s primary production and serves as habitat for
the animal community. Vegetation acts as an integrator of many of the physical and biological
attributes of an area, and a vegetation map can be used as a surrogate for ecosystems in
conservation evaluations (Specht 1975, Austin 1991). A vegetation map, therefore, provides the
foundation for our assessment of the distribution of ecological regions.
Europe probably has the most humanized landscape on Earth; as an integral part of the
landscape, vegetation has also been largely displaced, driven extinct or introduced by people.
Since the aim of this project was to find regions where ecosystem productivity and processes are
homogeneous, a potential vegetation map produced in Germany by the Institut für
Vegetationskunde, Bundesamt für Naturschutz (Bonn, 1994), provided the 650 basic records to
classify.
The variables were the climatic and topographic characteristics of each place in Europe. This
information was taken from a land classification map produced by the Institute of Terrestrial
Ecology (Bunce et al. 1995), which recognizes 64 land classes in Europe.
The data to be classified are therefore, for each location, the potential vegetation combined
with the land class, as shown in Figure 1.
Figure 1 – The table produced by concatenating the potential vegetation and the land classification in
Europe.
Measures of Similarity
The Euclidean distance is probably the most universal measure of distance in data analysis. Most
statistical programs calculate this distance automatically from the coordinate matrix (Figure 2).
The coordinate matrix is the cross-tabulation of the table shown in Figure 1, with land classes
(ITE) as column headings and potential vegetation units (BfN) as row headings; each cell of the
matrix contains the total area of co-occurrence of a single BfN potential vegetation unit and a
single ITE land class. The resulting matrix has dimensions 490x57, because not all the BfN legend
types represent real potential vegetation (lakes, glaciers, etc.) and not all the ITE classes were
present in the identity map; these were excluded from further analysis.
Figure 2 – The coordinate matrix has dimensions 490x57, because not all the BfN legend types
represent real potential vegetation (lakes, glaciers, etc.).
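The cross-tabulation described above can be sketched in a few lines of Python. This is only an
illustration, not the procedure used in the project: the unit codes, land class codes and areas
below are invented, and the function name cross_tabulate is hypothetical.

```python
from collections import defaultdict

# Hypothetical overlay records: (BfN vegetation unit, ITE land class, area in km2).
# All codes and areas are invented for illustration only.
records = [
    ("B1", "ITE01", 120.0),
    ("B1", "ITE02", 30.0),
    ("B2", "ITE02", 75.5),
    ("B1", "ITE01", 10.0),  # same pair again: areas are summed
    ("B3", "ITE03", 42.0),
]


def cross_tabulate(records):
    """Return (row_labels, column_labels, matrix) of summed co-occurrence areas."""
    totals = defaultdict(float)
    for veg, land, area in records:
        totals[(veg, land)] += area
    rows = sorted({veg for veg, _, _ in records})    # vegetation units as rows
    cols = sorted({land for _, land, _ in records})  # land classes as columns
    matrix = [[totals.get((r, c), 0.0) for c in cols] for r in rows]
    return rows, cols, matrix


rows, cols, matrix = cross_tabulate(records)
for label, row in zip(rows, matrix):
    print(label, row)
```

Each row of the resulting matrix is the profile of one vegetation unit across the land classes,
which is the form that the distance measures below operate on.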
The other similarity measure chosen for this classification is the Lance and Williams
distance, which is well recommended for biological classification:
$$D_{LW}(j,k) = \frac{\sum_i |BfN_{ij} - BfN_{ik}|}{\sum_i (BfN_{ij} + BfN_{ik})}$$
This distance was calculated in SAS®, following the instructions in Figure 3:
Figure 3 – Programming the Lance and Williams distance in SAS®
Part of the 498x498 distance matrix produced in SAS® is shown in Figure 4.
Figure 4 – Lance and Williams distance matrix.
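As a rough illustration of how the two measures differ, the sketch below computes both distances
in Python for three invented area profiles. The project itself used SAS®; the function names and
the data here are assumptions made only for this example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two area profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def lance_williams(u, v):
    """Lance and Williams distance: sum |u_i - v_i| / sum (u_i + v_i)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


def distance_matrix(profiles, metric):
    """Full symmetric matrix of pairwise distances."""
    n = len(profiles)
    return [[metric(profiles[i], profiles[j]) for j in range(n)] for i in range(n)]


# Hypothetical profiles (area per land class) for three vegetation units.
profiles = [
    [100.0, 0.0, 0.0],
    [50.0, 50.0, 0.0],
    [0.0, 0.0, 10.0],  # small unit sharing no land class with the others
]

d_euc = distance_matrix(profiles, euclidean)
d_lw = distance_matrix(profiles, lance_williams)
print(round(d_euc[0][1], 2), round(d_euc[1][2], 2))
print(d_lw[0][1], d_lw[1][2])
```

For non-negative areas the Lance and Williams distance lies between 0 and 1, and it equals 1 for
two units that share no land class. In this toy data the Euclidean distance rates the unrelated
third unit about as close to the second unit as the first unit is, while the Lance and Williams
distance separates them clearly.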
Hierarchical cluster analysis
Hierarchical clusters are organized so that one cluster may be entirely contained within another
cluster, but no other kind of overlap between clusters is allowed.
Several algorithms are available for hierarchical classification. In this paper we consider
only three. They differ in how the distance D_KL between two clusters C_K and C_L, containing
N_K and N_L individuals, is computed from the distances d(x_i, x_j) between individuals:
•Average Linkage – the distance between two clusters is the average distance between pairs of
individuals, one from each cluster, so a new individual joins the cluster to which it is closest
on average.
$$D_{KL} = \frac{1}{N_K N_L} \sum_{i \in C_K} \sum_{j \in C_L} d(x_i, x_j)$$
•Single Linkage – the distance between two clusters is the smallest distance between an
individual in one and an individual in the other, so a new individual joins the cluster that
contains its nearest neighbour.
$$D_{KL} = \min_{i \in C_K} \min_{j \in C_L} d(x_i, x_j)$$
•Complete Linkage – the distance between two clusters is the largest distance between an
individual in one and an individual in the other, so a new individual joins the cluster whose
least similar member is closest to it.
$$D_{KL} = \max_{i \in C_K} \max_{j \in C_L} d(x_i, x_j)$$
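The three criteria above can be written directly from their formulas. The following Python
sketch is illustrative only: the 4x4 distance matrix is invented, and clusters are represented
simply as lists of row indices.

```python
def average_linkage(dist, ck, cl):
    """Mean of all pairwise distances between members of ck and cl."""
    return sum(dist[i][j] for i in ck for j in cl) / (len(ck) * len(cl))


def single_linkage(dist, ck, cl):
    """Smallest pairwise distance between members of ck and cl."""
    return min(dist[i][j] for i in ck for j in cl)


def complete_linkage(dist, ck, cl):
    """Largest pairwise distance between members of ck and cl."""
    return max(dist[i][j] for i in ck for j in cl)


# Hypothetical symmetric distance matrix for four individuals.
dist = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.6],
    [0.9, 0.7, 0.0, 0.1],
    [0.8, 0.6, 0.1, 0.0],
]
ck, cl = [0, 1], [2, 3]  # two clusters given as lists of indices

print(average_linkage(dist, ck, cl))
print(single_linkage(dist, ck, cl))
print(complete_linkage(dist, ck, cl))
```

For the same pair of clusters the single linkage distance is never larger than the average
linkage distance, which in turn is never larger than the complete linkage distance.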
Figures 5 and 6 show SAS® performing the cluster classification.
Figure 5 – SAS® Proc cluster
Figure 6 – Output generated by the average linkage over the Lance and Williams matrix.
Figure 7 – The first of the 120 pages of the dendrogram resulting from average linkage over the
Lance and Williams matrix.
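To make the clustering step concrete, the sketch below implements a naive agglomerative
procedure in Python with average linkage. It is not the SAS® procedure shown in the figures, only
a minimal illustration on an invented 4x4 distance matrix; the function agglomerate and the
layout of its merge history are assumptions of this example.

```python
def average_linkage(dist, ck, cl):
    """Mean of all pairwise distances between members of ck and cl."""
    return sum(dist[i][j] for i in ck for j in cl) / (len(ck) * len(cl))


def agglomerate(dist, linkage):
    """Naive agglomerative clustering.

    Returns one tuple per merge: (cluster_a, cluster_b, height, clusters_left).
    Original individuals have ids 0..n-1; each merge creates the next free id.
    """
    clusters = {i: [i] for i in range(len(dist))}
    next_id = len(dist)
    history = []
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        # Search every pair of current clusters for the smallest linkage distance.
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                d = linkage(dist, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        history.append((a, b, d, len(clusters)))
        next_id += 1
    return history


# Hypothetical symmetric distance matrix for four individuals.
dist = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.6],
    [0.9, 0.7, 0.0, 0.1],
    [0.8, 0.6, 0.1, 0.0],
]

history = agglomerate(dist, average_linkage)
for a, b, height, left in history:
    print(f"merge {a} + {b} at height {height:.2f}, {left} cluster(s) left")
```

The merge heights recorded in the history are what a dendrogram draws on its height axis. The
pairwise search is repeated at every step, so this sketch is only practical for small examples.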
Drawing Results
A quick analysis of the SAS® outputs through SAS® Insight allowed us to draw some conclusions
about the dendrograms resulting from the different classification methodologies. Figures 8 and 9
are included here simply to illustrate this first assessment of the results. Figure 8 shows the
number of clusters along the dendrogram for average linkage over the Euclidean distance; Figure 9
shows the same for average linkage over the Lance and Williams distance.
Figure 8 – Number of clusters along the dendrogram for the average linkage over the Euclidean distance.
Figure 9 – Number of clusters along the dendrogram for the average linkage over the Lance and
Williams distance.
In Figure 8 the disaggregation of the dendrogram appears to happen explosively around height 2,
while in Figure 9 it happens continuously and consistently along the dendrogram. This
characteristic is important because we want a method that allows us to assess the classification
at several levels of disaggregation.
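The kind of curve discussed here can be derived from the merge heights of a dendrogram: cutting
at a given height leaves as many clusters as there are objects, minus the merges that happened at
or below that height. The Python sketch below uses two invented lists of merge heights, chosen
only to mimic an explosive and a gradual disaggregation; they are not the project's results.

```python
def clusters_at_height(merge_heights, n_objects, height):
    """Number of clusters left when the dendrogram is cut at `height`.

    Assumes merge heights never decrease along the tree, which holds for
    average, single and complete linkage.
    """
    merges_done = sum(1 for h in merge_heights if h <= height)
    return n_objects - merges_done


def cluster_profile(merge_heights, n_objects, levels):
    """(height, number of clusters) pairs for a list of cut heights."""
    return [(level, clusters_at_height(merge_heights, n_objects, level))
            for level in levels]


# Hypothetical merge heights for 8 objects (7 merges each).
explosive = [1.90, 1.95, 2.00, 2.00, 2.05, 2.10, 6.00]  # merges bunched near 2
gradual = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 1.00]    # merges evenly spread

print(cluster_profile(explosive, 8, [1.0, 2.0, 3.0]))
print(cluster_profile(gradual, 8, [0.3, 0.6, 0.9]))
```

In the bunched case almost every intermediate number of clusters is confined to a narrow band of
heights, whereas the evenly spread case offers usable cuts at many levels.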
The validation of the results, however, must be made by biogeography experts. This led us to
represent the classification in maps, so that they could assess it properly.
We decided to represent each classification method at the point where the dendrogram had around
50 clusters, because that is a reasonable number of mesoecosystems to expect in
Europe. Figure 10 shows how SAS® splits the dendrogram at 50 clusters and produces a file to
be exported to a Geographical Information System.
Figure 10 – SAS proc tree.
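Cutting a dendrogram at a fixed number of clusters and exporting the result can be sketched as
follows. This Python fragment is only an illustration of the idea, not of the SAS® procedure: the
merge history, the unit names and the two-column CSV layout are all assumptions of this example.

```python
import csv
import io


def cut_tree(merges, n_objects, k):
    """Replay merges in order until k clusters remain; return a label per object.

    `merges` is a list of (cluster_a, cluster_b) id pairs. Original objects have
    ids 0..n_objects-1 and each merge creates the next free id.
    """
    clusters = {i: [i] for i in range(n_objects)}
    next_id = n_objects
    for a, b in merges:
        if len(clusters) <= k:
            break
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    labels = [0] * n_objects
    # Number clusters 1..k in order of their lowest-numbered member.
    ordered = sorted(clusters, key=lambda cid: min(clusters[cid]))
    for label, cid in enumerate(ordered, start=1):
        for member in clusters[cid]:
            labels[member] = label
    return labels


# Hypothetical merge history for four vegetation units, cut at 2 clusters.
merges = [(2, 3), (0, 1), (4, 5)]
units = ["B1", "B2", "B3", "B4"]  # invented unit codes
labels = cut_tree(merges, len(units), k=2)

# Write a table that a GIS could join to the vegetation polygons by unit code.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["unit", "cluster"])
writer.writerows(zip(units, labels))
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example self-contained; in practice the same rows would
be written to a file and joined to the map by the unit code.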
Results
With 50 clusters, the map representing average linkage over the Euclidean distance (Figure 11)
has 300 polygons larger than 2000 km² and hardly any ecological consistency. Complete linkage
gives the same result with both distances.
Figure 11 - Average linkage over the Euclidean distance 50 clusters
Figure 12 represents average linkage over the Lance and Williams distance with 50 clusters.
In this map the main biomes are immediately distinguishable from each other. The boreal forests
form a cluster ending near the 60° parallel, and the tundra is also recognizable. All the
Mediterranean mountains form individual clusters, except the Pyrenees, which are joined with the
Dinaric Alps. The Ukrainian steppes form a cluster separate from the temperate forests.
Figure 12 - Average linkage over the Lance and Williams distance 50 clusters
The single linkage algorithm (Figure 13), as expected, forms one huge cluster covering almost all
of Europe and places the remaining 49 in small areas, without any ecological consistency.
Figure 13 - Single linkage over the Lance and Williams distance 50 clusters
Conclusions
Average linkage over the Lance and Williams distance seems to provide a good classification from
an ecological perspective, but two points remain open:
•It is possible that other classification algorithms provide a better classification.
•Human experts must validate the results.
The aim of nevertheless showing this result is to illustrate what can be distinguished on a
European scale, at the level of ecological regions, with the methodology developed.
References
Anderberg, M. R. 1973. Cluster analysis for applications. Academic Press.
Austin, M. P. 1991. Vegetation: data collection and analysis. In Nature conservation: cost
effective biological surveys and data analysis. Australia CSIRO, East Melbourne.
Bailey, R. G. 1996. Ecosystem Geography. Springer-Verlag New York, Inc.
Bonn, U. 1994. International project for the construction of a map of the natural vegetation of
Europe at a scale of 1:2.5 million - its concept, problems of harmonisation and application for
nature protection. Working text, Bundesamt für Naturschutz (BfN).
Bunce, R. G. H. 1995. A European land classification. Institute of Terrestrial Ecology, Merlewood.
Davis, F. W. 1994. Mapping and monitoring terrestrial biodiversity using geographic information
systems. Biodiversity and Terrestrial Ecosystems, Institute of Botany, Academia Sinica Monograph
Series 14.
Hunter, M., et al. 1988. Paleoecology and the coarse-filter approach to maintaining biological
diversity. Conserv. Biol. 2.
Hunter, M. Jr. 1991. Coping with ignorance: the coarse-filter strategy for maintaining
biodiversity. Endangered Species Act and lessons for the future. Island Press.
Ludwig, J. A., and Reynolds, J. F. 1988. Statistical Ecology: a primer on methods and
computing. John Wiley & Sons.
Noss, R. F. 1983. A regional landscape approach to maintain diversity. BioScience 33.
Kotz, S. and Johnson, N. 1981. Encyclopaedia of Statistical Sciences, vol. 2. Campbell B. Read,
Associate Editor.
Painho et al. 1996. Digital Map of European Ecological Regions (DMEER): its concept and
elaboration. Second Joint European Conference (JEC) & Exhibition on Geographical
Information, Barcelona.
Specht, R. 1974. The report and its recommendations. A national system of ecological reserves in
Australia. Australian Academy of Science 19.
http://www.orst.edu/instruct/rng341/ecosys1.htm
http://www.sas.com/software/components/stat.html