Expert System for Identification of Trees

Master’s Thesis Nr. 102
Systems Group, Department of Computer Science, ETH Zurich
Expert System for Identification of Trees
by
Simon A. Eugster
Supervised by
Prof. Donald Kossmann
April 15th to October 15th 2013
Contents
1 Introduction
1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . .
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . .
1.3 Structure of This Thesis . . . . . . . . . . . . . . . . . . . . . .
2 State of the Art
2.1 Key Systems . . . . . . . .
2.1.1 List of Images . . .
2.1.2 Dichotomous Key .
2.1.3 Diagnostic Tables .
2.2 Examples of Existing Keys
2.2.1 Books . . . . . . . .
2.2.2 Web Keys . . . . .
2.2.3 Other Keys . . . .
3
3
4
4
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
5
5
5
5
6
8
8
10
11
3 Proposed Solution
3.1 Schema . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.1 Generality of Keys . . . . . . . . . . . . . .
3.1.2 Add-Ons . . . . . . . . . . . . . . . . . . . .
3.2 Optimising Identification Keys . . . . . . . . . . .
3.2.1 Choosing Questions in Dichotomous Keys .
3.2.2 Choosing Questions in Dynamic Diagnostic
3.3 Design Choices . . . . . . . . . . . . . . . . . . . . .
3.3.1 What is a Character? . . . . . . . . . . . . .
3.3.2 Exclusiveness of Characters . . . . . . . . .
3.3.3 Numerical Values . . . . . . . . . . . . . . .
3.3.4 Geographical Distribution . . . . . . . . . .
3.4 Decisions for Key Generation . . . . . . . . . . . .
3.4.1 Does This Taxon Match? . . . . . . . . . .
3.5 Implementation Details . . . . . . . . . . . . . . . .
3.5.1 Identification Key . . . . . . . . . . . . . . .
3.5.2 Key Editor . . . . . . . . . . . . . . . . . . .
3.5.3 Mobile Identification Key . . . . . . . . . .
. . . .
. . . .
. . . .
. . . .
. . . .
Tables
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
13
13
14
14
15
15
15
16
16
17
17
18
18
18
19
19
20
21
4 Conclusion
4.1 Evaluation . . . . . . . . . . . . . .
4.1.1 Question Order Benchmark
4.1.2 Are The Goals Met? . . . .
4.2 Future Work . . . . . . . . . . . . .
4.3 Acknowledgements . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
23
23
23
23
24
24
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
1
1 Introduction
Summary This thesis examines current identification keys and their advantages and issues, and describes the design of a generic web-based identification
key with those issues addressed.
Terminology The term taxon is used for a group of items which share the
same name, like trees (Fraxinus excelsior), clouds (Cumulus), or light bulbs
(halogen lamp). The term character is used for a feature that can be used to
distinguish between taxa. The term key is sometimes used as the short form
for identification key.
1.1 Background and Motivation
This section briefly describes some aspects that can be improved in current
identification keys.
Keys are used to find out what something is, to find the name of a thing.
Why is that important at all?
For me, it is often curiousity. But there are good reasons as well; figure 1.1
mentions one. If the sky is covered with thick Altostratus clouds, I know
that it will most likely be raining for a longer time in a few hours. Unlike
so with Stratus, a similar-looking cloud, which never produces rain. Or, if I
need wood for making bows, I have to know if this tree is an European Ash
(Fraxinus excelsior) or a Beech (Fagus sylvatica): the former is great bow
wood, whereas the latter usually breaks in heavier bows.
But there are other motivations as well. A prominent one is the desire for
knowing all life on Earth; that in order to capture biodiversity and as a
starting point for halting biodiversity decline, as stated in LaSalle et al. (2009).
The authors estimate the discovery and description of all species—which again
are estimated to around 10 million—to several centuries when proceeding at
the same speed.
Traditional keys are in the form of books. Switzerland has its keys for the local
species (often in two or three languages), Germany has them (in German),
France (in French), England (in English), Mexico (in Spanish), and so on.
Perhaps any country can be named.
Already three problems arise from this form. First, if I want to identify a tree
in Mexico, I may have to learn Spanish first; the Swiss keys do not describe
the trees growing there. Second, an identification key to all trees would result
in a multi-volume and little portable book. Third, the information is printed
and therefore static; the key cannot be adjusted according to decisions made
by the user, and new taxa could only be added in a new edition.
Another point is a bit more subtle, but just as important: finding a suitable
key at the first place. Many are, as mentioned, in the form of books. Others,
especially specialised ones, can only be found in papers and have never been
printed.
All those problems can be solved with a digital identification key. It can be
translated, has much less storage issues, and is dynamic. As a bonus, digital
keys can be made freely accessible and editable—Wikipedia and its sister
projects serve as great examples there.
With more storage available, the keys may also contain more sample images.
Identification keys in books are often illustrated very sparsely.
3
Figure 1.1: Is this mushroom edible?
Identification provides an easy way for
finding out without dangerous tasting.
Oct 2013
Introduction: Structure of This Thesis
1.2 Problem Statement
This section lists the key features which the identification key written in this
thesis aims to fulfil.
The main reason for me to start this thesis was the observation that dichotomous keys in books are difficult to use as soon as a character required for the
next step cannot be determined. (This problem will be explained more indepth in section 2.1.) The previously mentioned observations led to additional
points that shall be addressed.
My vision is an open computer-based identification system, accessible to and
editable by everyone. It should be general enough to support various types
of subjects besides trees, which serve as example in this thesis.
Editability The key should be editable, in a way simple enough that also
“normal” people—i.e. not only computer experts—can add taxa and identification criteria. (Biologists, for example, usually are not computer experts.)
Many projects prove that collaboration is the only way to keep them alive;
closed-source projects with sole authors are usually attended few years only
and then forgotten. Collaboration is, at the same time, a great way to gather
information from all over the world in a central, open resource—just like
Wikipedia does it, and very sucessfully so.
Performance In terms of both speed and usability. Performance can be defined by the cost of identifying species, like the number of questions asked
or the “sum” of their difficulty, which is to be minimized. Small differences
thereof may not be noticeable, however pointless questions that e.g. don’t
even narrow down the remaining selection are. Performance can also be defined in terms of usability: The user interface has to respond quickly and avoid
useless interactions; an example of latter would be a confirmation dialog “are
you sure?” whenever a character is chosen.
Internationality To make a key usable globally, such that e.g. also Asian
tourists can identify plants growing on the Swiss mountains, it has to be translated. The same holds if editors of different countries work on the data. The
taxon characters themselves—their meaning—do not differ, no matter what
language one speaks, only their localised description does. Consequently, ideally only the character description is translated and the features are defined
only once, “for the whole world”.
Growth Every project needs to reach a “critical mass” until it starts growing
by itself. For my thesis, this affects both code and data. Enough data is
necessary so contributors do not have the feeling of contributing to an empty
project; I regard a sample key as mandatory, and its data can additionally
serve as real-life test set. The code—especially the identification part—has
to be good enough to make people want the program to identify their trees
as well, and well designed—i.e. not hacked together—so other programmers
can work on it without too big effort.
1.3 Structure of This Thesis
Chapter 2 gives an overview over the identification keys that are currently
used—key systems as well as concrete examples of them—and examines their
advantages and disadvantages. Knowledge about those is mandatory for understanding the further decisions and reasonings. Chapter 3 contains the
technical part of this thesis; it describes the identification key system I developed. It should be easy to read also if you are not a computer scientist.
Chapter 4 finally discusses the results and possible future work.
Each section starts with a short summary in italic.
Simon A. Eugster, ETH Zürich
– 4 –
Identification Key for Trees
2 State of the Art
Allow me to start this chapter with a 33 years old quote:
Identification keys and diagnostic tables are simple to use and easy to carry,
so we think they will retain their popularity until we all have our own pocket
micro-processor. —Payne and Preece, 1980
2.1 Key Systems
This section gives an overview over existing types of keys and discusses advantages and disadvantages of each.
2.1.1 List of Images
Image lists are easy to read and ideal for small data sets.
The most common kind of key encountered is the field guide. A vast number
of books on any subject can be found: plants, birds, ants, fishes, constellations, insects, and so on. Those keys need no introduction how they work;
comparing images with the object to identify is a task our brain is excellent
at. For example, consider how quickly you can identify a person you know on
the street, although there are many other faces around. This is amazing.
ARALIACEAE
AQUIFOLIACEAE
Simple
Ovate to elliptic
To 4 x 11⁄2 in (10 x 4 cm)
ARRANGEMENT Alternate
BARKPale gray, smooth, with small lenticels
FLOWERSSmall and purple
FRUITA glossy red, spherical berry about 1⁄2 in (1 cm) across, often slightly longer
than wide
DISTRIBUTION China, Taiwan, Japan
HABITAT Mountain forests
SYNONYM Ilex chinensis auct.
LEAF TYPE
LEAF SHAPE
LEAF SIZE
Bipinnate
Oval in outline
To 3 ft x 23 in (1 m x 60 cm)
ARRANGEMENT Alternate
BARKGray-brown, often spiny
FLOWERSSmall and white in rounded heads borne in large open panicles
FRUITA spherical black berry about 1⁄4 in (5 mm) across
DISTRIBUTION China, eastern Russia, Korea, Japan
HABITAT Open forests
LEAF TYPE
LEAF SHAPE
LEAF SIZE
Up to 50 ft
(15 m)
Up to 33 ft
(10 m)
(This may sound like something obvious, but consider how hard it still CHINESE
is for HOLLY
a computer to accomplish such tasks, whileas it is millions of times faster in
multiplying matrices than I am.)
ILEX PURPUREA
ARALIA ELATA
HASSKARI
(MIQUEL) SEEMANN
48
The Chinese Holly is a fast-growing, conical, evergreen tree,
sometimes multistemmed or a large shrub, flowering in spring
or summer. Its purple flowers are very unusual in the genus as
other hollies usually have white or greenish flowers. The bright
red fruits persist on the tree for a long time during winter. These,
together with the brightly colored young foliage, make it a
popular tree in regions where it can be grown, such as the
southeastern United States. Various parts of the plant are
important medicinally in China, and it is one of the
50 fundamental herbs of traditional Chinese medicine, used to
treat a variety of ailments. The name Ilex chinensis has been and
sometimes still is incorrectly used for this species.
Generating such keys is also relatively easy—which does not mean that it is
little work!—, not much more but an image or a painting and the name is
required.
So, why is this not enough? Those keys become more inefficient the larger the
data set is, it is arduous to match dozens and hundreds of pictures against
the object, and similar looking species—thinking of living creatures—make
the matching process error-prone.
SIMILAR SPECIES
The shallowly toothed leaves could be confused with other,
mainly subtropical species, but the purple flowers easily
distinguish the Chinese Holly.
The leaves of the Chinese Holly are ovate to elliptic, and up
to 4 in (10 cm) long and 2 in (5 cm) across. They emerge
bright purple-pink when young, maturing to glossy dark
green above and pale green beneath. The margins are very
shallowly toothed, not spiny, tapering to a slender point at
the tip, and with a petiole to about 1⁄2 in (1 cm) long.
JAPANESE ANGELICA TREE
49
The Japanese Angelica Tree is a small deciduous tree of
spreading habit with slightly spiny shoots. It flowers in late
summer and fall. Although it can be a tree it frequently spreads
by suckers, forming thickets. The young shoots are eaten in the
Far East, and the root bark is used medicinally. It is a popular
ornamental, grown for its fruit and fall color, and there are
several forms with variegated leaves.
SIMILAR SPECIES
The North America Devil’s Walking Stick (Aralia spinosa) is
similar but more shrubby and more spiny. Its inflorescences are
conical with a single main axis while those of A. elata have
several spreading branches from the base.
Actual size
The leaves of the Japanese Angelica Tree are very large, 3 ft (1
m) or more long and 23 in (60 cm) across. Each leaf has several
pairs of opposite pinnae, each with a single leaflet at its base. The
pinnae have up to 11 ovate leaflets up to about 41⁄2 in (12 cm)
long; they are taper-pointed at the tip with a toothed margin and
a very short stalk. They are dark green above, grayish with hairs
and sometimes spiny on the veins. They turn yellow, orange, red,
and purple in fall.
Technically speaking, searching takes O(n) time. Extended versions coarsely
groupo taxa e.g. by flower colour or other characters to speed up the scanning
process.
Actual size
2.1.2 Dichotomous Key
Dichotomous keys are binary search trees and ideal for large data sets.
An advanced method is the dichotomous key, which is a character based
binary search tree. Each node represents a character, and its subtrees are
chosen depending on whether the character matches or not. The leaves are,
for example, families or species. Their big advantage is that they can cover
much larger numbers of species since the key’s depth is only O(log2 n) in
case it is balanced. It also forces the user to look more carefully at the
examined object in order to correctly identify the required characters, and
hereby teaches to—and how to—see things invisible to the untrained user.
Dichotomous keys have disadvantages too. Many of them are merely text,
without images, which makes them look too dry or complicated especially
for occasional users. Some characters require specialist knowledge that is
not always explained along with the key, or they require special tools like
magnifying glasses or microscopes. For those reasons, dichotomous keys are
mainly used by experts.
5
Figure 2.1: The Book Of Leaves
(Coombes, 2011) contains leaf images of 600 trees. It is amazing how
quickly trees can be found merely by
looking through the 30 pages with
preview images in the “key” section.
Oct 2013
State of the Art: Key Systems
The structure of dichotomous keys leads to another problem: If a character
cannot be identified definitely, both subtrees need to be followed, requiring
the user to jump forth and back between multiple options and check which
suit better. If a character is incorrectly identified, one may have to re-start
close to the root and re-check all decisions, or it may even be impossible to
correctly identify the taxon.
Start
Branching
Leaves opposite
rectangular
bark plates
longish
bark plates
Leaves alternate
Simple leaves
Bark
Acer pseudoplatanus
Compound leaves
Leaves
Fraxinus excelsior
Fagus sylvatica Sorbus aucuparia
Figure 2.2: Simple dichotomous key for four taxa.
Finally, although the tree depth is only O(log2 n) compared to O(n) for lists,
it takes longer for small collections of, for example, 20 species, since the eye is
much faster matching them visually than identifying characters and reading
text.
The following improvements can be made to dichotomous keys:
Illustrations Adding illustrations of the characters (figure 2.3) speeds up
identification, as explained in the previous section about lists. For the same
reason—images are natural to our brain, text is not—illustrations also let the
key look much “lighter” and easier to understand.
Multichotomous key To reduce the depth of the tree, and hereby the number
of questions asked, multiple options for a character can be given at once. For
example, for species of the familiy Pinus one wants to know whether they have
2, 3, or 5 needles per “bundle” (fascicle). A strict BST (binary search tree)
would require two questions: a) Are there 2 or more needles per fascicle?
and b) Are there 3 or 5 needles?
Taxon descriptions Description of additional redundant characters for the
found taxon help in ensuring that the identification was actually correct. For
the example key shown in figure 2.2, the tree bark may be hard to classify,
or may not even show the characters yet if it is young. Providing additional
information for the Ash (Fraxinus excelsior) like “buds are black” then makes
confusion with Acer pseudoplatanus impossible.
2.1.3 Diagnostic Tables
Diagnostic tables are matrices with the taxa on one axis and the characters on the other. They provide most information, but are slowest to search
manually.
The most general form of a key is the diagnostic table. One axis of the
table contains the taxa, the other one the characters; thus they contain most
information on characters of the keys listed here. Searching diagnostic tables
by hand is perhaps nearly as fast as for a dichotomous key for small data sets.
When it grows, sorting is essential; the example table below is first sorted by
Branching and then by Leaf Type:
Branching
Leaf Type
Bud Color
Leaflets
F. excelsior
opposite
compound
black
9–15
A. p.platanus
opposite
simple
green
—
S. aucuparia
alternating
compound
grey
9–19
F. sylvatica
alternating
simple
reddish
—
If the first column’s character cannot be identified, the user has to jump
around in the table; the same problem known from dichotomous keys. This
Simon A. Eugster, ETH Zürich
– 6 –
Identification Key for Trees
State of the Art: Key Systems
Oct 2013
Figure 2.3: Identification key for domestic mites, with illustrated characters, from
Colloff and Spieksma (1992)
Figure 2.4: Diagnostic table for Vibrio species (Alsina and Blanch, 1994)
has early been worked around with software; (Payne and Preece, 1980, Section
4.5) mentions systems from the sixties. In software it is trivial to filter the
table by the given criteria.
As the table grows, it is easy to lose overview. Already the diagnostic table in
figure 2.4 is not easy to decipher as the user’s eyes have to keep track of both
horizontal and vertical position, and it has no more than 18 characters.
Yet, diagnostic tables are very easy to evaluate with software. They will
therefore also be the key system of choice for this thesis.
Identification Key for Trees
– 7 –
Simon A. Eugster, ETH Zürich
Oct 2013
State of the Art: Examples of Existing Keys
2.2 Examples of Existing Keys
This section lists some existing keys on different media and discusses advantages and disadvantages, helping to clarify the project goals.
It is amazing to see how many keys exist. The web makes it even easier to find
them, from all over the world. From general plant keys to very specific ones
for families, from mites to fungi, for children and specialists. The following
selection of keys is therefore just a tiny cut-out of interesting examples. As I
am German speaking, some of the keys are too.
2.2.1 Books
Books are well-established for centuries and still wildely used. They have to
reach a certain quality until they are printed.
Gehölzflora, Fitschen The Fitschen (Fitschen, 2002) is the standard key to
woody plants in Central Europe. This is not just by chance: It contains a large
number (more than 1700) of species, is accurate and well-structured. Three
different dichotomous keys for families are present: one regarding vegetative
characters as leaves, mainly leaves; a winter key viewing for example buds
and bark, and an inflorence key. Each family and genus then has a separate
key for species; they are sorted alphabetically and can be found easily.
On the first few dozen pages a description is given of most of the characters
used in the key. Many graphics support the descriptions as well as the species,
as shown in the scan of the Sorbus key in figure 2.5.
Figure 2.5: Extract from the Gehölzflora; key for the Sorbus family. The bold
numbers to the right lead to the next question.
Bestimmungsschlüssel zur Flora der Schweiz This key (Hess et al., 2010) is
specialised in Switzerland’s flora. Numerous clear graphics serve as additional
description of the species, as seen in the scanned figure 2.6.
For non-experts it is much harder to use than Gehölzflora as there is no entry
point with vegetative characters; the root of the dichotomous key mainly
considers the inflorescence and uses many technical terms average users are
not familiar with, so they are stuck at the very first point of the key.
Simon A. Eugster, ETH Zürich
– 8 –
Identification Key for Trees
State of the Art: Examples of Existing Keys
Oct 2013
Figure 2.6: Extract from the Bestimmungsschlüssel to Sorbus
Bäume von A–Z This book is not a key, but only describes various trees.
This is done both with text and, especially, a large number of high-quality
and expressive images showing both macroscopic and microscopic scales, as
seen in figure 2.7. They give a good visual impression of the trees’ character,
i.e. their silhouette, bark, leaves, and other typical characters.
Interestingly, it lacks an index for German names. The trees are sorted by
their Latin name, and so is the list at the end of the book, listing additionally
names in other languages like German, French, and Spanish. To look up a
tree by its German name, one has no other way but to scan this whole list,
or to look up the Latin name in a different book.
Figure 2.7: Extract from Bäume von A–Z
on Sorbus
Identification Key for Trees
– 9 –
Simon A. Eugster, ETH Zürich
Oct 2013
State of the Art: Examples of Existing Keys
2.2.2 Web Keys
Web keys benefit from links, newer ones are interactive. Countless keys can
be found, high-quality keys are more rare though.
People have started early to use the web for identification keys, some count 30
years already. A huge number can be found nowadays. A Google search for
“identification key” found 299 000 matches on April 27th 2013, and 315 000 five
months later. The number of keys does not grow as fast anymore nowadays;
Walter and Winterton (2007) measured an increase from 45 000 to 149 000 in
one single year back in March 2006.
http://gobotany.newenglandwild.org/ Perhaps the most beautiful key available today. Unfortunately it is hardly visible on search engines. All the more
was it interesting for me to find in it nearly all the points that I regarded
important for identification. The reader is at this point asked to take a look
at the web site.
The entry point of the Simple Key is well arranged, on pastel colours, and
asks the user to chose woody, aquatic, grass-like, and other plants. For each of
them an image gallery, which can be scrolled through, shows representatives,
a short text describes the group, and hints about possible misidentification.
For example, the entry about aquatic plants states: “Some land plants can
be flooded temporarily but cannot live long in water.” They even have a video
for each group.
Figure 2.8: Go Botany: Starting page for
woody plants
When the group is chosen, subgroups appear (broad-leaved and needle trees
for woody plants) in the same manner, leading then to an image gallery of all
plants in this group. Simple criteria—only around 8 in the basic view—filter
this list. Usually some species remain in the list, but they can be distinguished
visually. The species description then lists the characters, shows several images, gives a description of the species and its distribution in England and
North America.
To date, the tree list contains 573 species, and has doubled during the last
six months.
http://trees.luidp.net/de/index.php Another web page using filters, the
first one of their kind I found. The filtering system is fast: Switching groups
(Stem & Bark, Leaves, Flowers etc.) works via mouse-over, the view is compact. A click on the search button lists all matching species (unless they are
too many).
The key features a fixed number of characteristics. It is easy to get used to
them after a few identifications. However, they are sometimes not precise
enough, for example when identifying conifers, no further description of their
needles can be chosen.
To date, the tree list contains 558 species.
Figure 2.9: Luidp-Trees: Tree identification with a filter
http://offene-naturfuehrer.de/web/ This web site is unique in that it is
the only one made for public collaboration. Keys, e.g. in PDF format, can
be shared, other keys can be edited directly in a wiki. Those are platform
independant and also offered in apps for mobile systems, displayed in form of
web pages.
While the idea is great, I got the impression that editing is not as easy as it
could be—or should be in order to attract a larger number of editors—as they
are hand-written and not generated, making changes more labour-intensive.
The web keys are dichotomous and static.
Simon A. Eugster, ETH Zürich
– 10 –
Identification Key for Trees
State of the Art: Examples of Existing Keys
Oct 2013
2.2.3 Other Keys
This section shows examples of two different key concepts not further covered
here.
LeafSnap A different approach is taken by the authors of Belhumeur et al.
(2008). They built a computer vision based system to identify plants by
an image of the shape of their leaves. This is a very convenient way since
the identification process requires no work from the user except for taking
an image and looking through the best-matching species (according to the
authors, the correct taxon is among the top 10 species returned in 97 % of all
cases for 245 species).
This kind of system is obviously not suited in winter and difficult for identifying trees with small leaves (especially conifers with their narrow needles
and microscopic details). Also, it does not explicitly describe the characters
used for identifying the taxon—it contains machine-readable patterns instead
of human-readable facts. To illustrate this point, let’s take a key for architectural styles. The key would only tell me that this is a Gothic building, but
not that the pointed arch is a typical feature of this epoch. The user can look
it up, but does not need to, so it is less “educational” concerning teaching the
whys, what is special for this taxon or epoch.
The key does not need to do this, however, since its goals are quick and easy
identification. The web site mentioned in their paper (http://herbarium.
cs.columbia.edu/data.php) was never accessible while I was working on the
thesis, but they provide an iOS app on http://www.leafsnap.com. The user
has to take a picture of a leaf on white background, which is then uploaded
to their server and rated against existing entries. I could not validate their
results since the database contains trees native to US; still, the genus of the
top ranked tree used to be correct. The white background sometimes created
difficulties when the leaf was larger than an A4 paper, or when images got
rejected due to automatic brightness adjustments that could not be controlled
manually with the used iPhone.
Bird Songs Yet another approach is taken by audio “keys”, commonly found
for bird songs. Their principle resembles to image lists, but are also used for
learning the sounds by heart and hereby building a “mental” identification
system.
Identification Key for Trees
– 11 –
Simon A. Eugster, ETH Zürich
3 Proposed Solution
3.1 Schema
This section explains the structure and the reasoning behind the database
schema used for this identification key.
The “units” that describe a taxon are characters. Multiple taxa can share the
same characters; the less characters two taxa have in common, the easier it
becomes to distinguish between them.
described by
Taxon
*
Character
Characters
Taxa are only defined by the characters they show, and not by all the characters they do now have; section 3.4.1 gives more details about this choice.
Fir
Ash
Maple
Figure 3.1: Taxon and characters. Ash
resembles maple more than fir; they
have more characters in common.
It is convenient to group multiple characters in questions when they describe
different distinctions of the same thing.
* distinction of
Character
Question
?
The first benefit of doing so is the structure and order character groups
bring—their addition cleans up the mess in the—by tendency large—set of
characters, and adds some simplicity hereby.
B
A
C
Taxon X
E
Observed
The second benefit is exclusion: If the user observes character A and a taxon
shows character B, we cannot conclude anything helpful for identification;
maybe the taxon actually shows A as well, but the data has not been recorded
in the database yet. However, if A and B are distinctions of the same question,
we are (usually) on the safe side to assume that the taxon does not show
character A.
×
✓
A
D
B
C
E
F
✓
Figure 3.2: Questions and characters.
Assigning A and B to the same question
allows to conclude a mismatch.
Assigning questions into components does not directly have a semantic meaning, the focus now lies on usability. A component can be, in case of a tree, for
example the leaf, the bark, or the inflorescence. Grouping questions about
the same component helps the user focus on one thing after the other, and
skipping all questions about other components manually when e.g. only a leaf
is available falls away.
Component grouping also allows to insert contextual information about the
current component, for example a leaf or a bud, like a sketch with a legend
naming the different parts (leaf stem, blade, veins, edge, etc.). Figure 3.14
shows an example of such contextual information about the leaf.
13
Question
* part of
Component
Leaves
Bark
Inflorescence
Figure 3.3: Grouping questions into components also allows to display contextual explanations.
Oct 2013
has parent
Question
A
D
0…1
Character
While the characters are structured now, we still throw all of them at the
user at once, which is painful for them. Introducing parent characters for
questions shows the relevant ones only. For example, the question whether
the needles are bundled or single is relevant only for coniferous trees with
needle-like leaves, not for broadleaf trees.
B
E
C
Proposed Solution: Optimising Identification Keys
F
Figure 3.4: A hierarchy for questions allows to uncover relevant questions only
when necessary.
Altogether this leads to a quite simple core schema that is applicable for many
topics and not only trees. But it is amazingly powerful!
3.1.1 Generality of Keys
This section extends the schema to support different subjects and discusses
differences between topic specific keys.
The basic concept of a key does not change with the subject to identify. Be
it plants, fishes, airplanes, all of the three introduced key systems—image
list, dichotomous, diagnostic—can be used: only data differs; information
does not. The dichotomous plant key asks for the leaf type, the airplane
key asks for the engine type. This is useful when writing software—once the
functionality is there, it can be used for any subject.
One aspect does depend on the subject, though: it is the probability of the
item showing a character. Take a key for diseases and a key for light bulbs.
An incandescent light bulb will always contain a glowing filament wire. But
Lyme borreliosis may cause any of a circular rash, headaches, muscle soreness, or other symptoms, but none of them always occurs. Different rating
mechanisms are required for such subjects; an item cannot be excluded when
a character cannot be observed. Dichotomous keys are by their nature not
suited for this task. Diagnostic keys, on the other hand, are easier to adjust
accordingly.
Component
1…*
Topic
belongs to
Taxon
1
Figure 3.5: Multiple topics can be supported in the same database.
To support multiple topics in the schema, let us introduce a topic entity. Each
taxon is part of a topic, and so are components.
One could argue that assigning topics to components be redundant since they
could be deduced by looking which taxa the characters in the components are
assigned to. The assignment though represents reality—a light bulb does
not have leaves—, and it also simplifies editing a taxon since characters to
non-related topics are not shown.
3.1.2 Add-Ons
This section discusses how the g eneric identification key can be extended for
special needs of other topics.
Range
has value
for range
Taxon
With the core schema defined, one can now start adding topic specific extensions or others that are useful. Examples are (partly discussed later): difficulty for questions; taxonomic degree for plants and animals, together with
the according higher-level parent taxon; or required equipment for questions.
All of them are implemented in the code.
min,max
Range:
Blade length
Range value:
20…50 cm
Figure 3.6: Ranges with corresponding
range values
Another useful add-on are ranges for numerical values, since those cannot be
represented by characters in a good way; especially for floating-point values
used with distances and such. Ranges cannot be represented by a “has” relation directly; the value for the respective range needs to be stored as well.
Simon A. Eugster, ETH Zürich
– 14 –
Identification Key for Trees
Proposed Solution: Optimising Identification Keys
Oct 2013
3.2 Optimising Identification Keys
3.2.1 Choosing Questions in Dichotomous Keys
This section discusses aspects that can be used in order to decide which questions to ask first to keep the identification process as short as possible.
Literature concurs that dichotomous keys should be optimized, but is divided
about how this is to be done. Payne and Preece (1980) present in their paper
methods to speed up the identification process, for example by minimizing the
average path length in dichotomous keys. In this paper’s discussion, Dr Sviridov points out that not only speed, but also the probability for identifying the
taxon correctly, is of importance—and cannot be optimized simultaneously,
as for higher certainty usually more tests are required.
Dichotomous keys can be optimized in terms of maximum depth by choosing
characters that separate the remaining taxa into chunks of even size. Yet also
the opposite way can make sense: Short paths for common/frequent taxa,
long paths for rare ones.
3.2.2 Choosing Questions in Dynamic Diagnostic Tables
This section lists possible methods to “ tune” dynamic diagnostic tables, i.e.
speed up the process of identification.
Optimising dynamic diagnostic tables—which present several questions at
once—requires a different approach since the user can answer them in arbitrary order. The difficulty then is rather the large number of questions: there
may be hundreds of them, and endless lists are difficult to overview. So it
makes sense to both filter and sort the character list.
Filtering removes questions from the list because they are irrelevant, or not
yet relevant.
• Dependencies on other characters—For coniferous trees for example,
broadleaf characters like the shape of the leaf rim can be hidden. The
proposed schema supports this with the parent character of questions;
the question is only shown when its parent character is observed.
• Useless characters—If answering a question does not exclude any taxa
still in the list because they all show the same character, it does not need
to be asked. The question can, however, provide additional certainty
that the observed taxon is actually the taxon returned by the identification key by over-describing it. Otherwise, having only one matching
taxon left—and therefore no additional questions asked—does not necessarily mean that this taxon is the correct one: it could as well be that
the correct taxon is not even recorded in the data set, and the questions
answered so far happened to match another one.
• Existence—A character may not be visible because it simply currently
does not exist. Examples are leaves of deciduous trees in Winter, bark
on a young tree (as Dr Atkinson pointed out in Payne and Preece, 1980),
buds in spring, or the inflorescence during all but a few days of the year.
These questions can be skipped. In the proposed schema this is done
by hiding components.
• Equipment—Answering a question may require special equipment. For
example, recognising the shape—or even the existence—of hairs on
leaves requires a loupe, and if none is available, then those characters
need not be shown.
Sorting has the same goal as optimisation in dichotomous keys; the user should
answer as few questions as possible for an identification. The common way
of sorting questions is sorting them according to a cost function; several of
them have been proposed around the 1970s for computer programs, the first
ones by e.g. Pankhurst (1970) and Dallwitz (1974), and examined e.g. in
Gower and Payne (1975). Only recently have Reynolds et al. (2003) proposed
a different measure estimating the amount of work done by, instead of the
cost of, answering a question:
S
All taxa
Question i
Character j
Character j+1
P = 0.3
P = 0.7
Select character j
S ij
Identification Key for Trees
– 15 –
Simon A. Eugster, ETH Zürich
Figure 3.7: Remaining taxa Sij
Remaining taxa
Oct 2013
Proposed Solution: Design Choices
Wi = E(S) −
X
P (i, j) E(Sij )
j
where S denotes the set of Taxa and Sij the remaining set of taxa after
answering question i with character j (see figure 3.7), E(S) is the estimated
cost of completing identification for the given set, and P (i, j) is the probability
of answering question i with character j. For estimating E(S), the authors of
Reynolds et al. (2003) use Shannon entropy H(S).
Additional sorting criteria may include:
• Difficulty—When generating a key for inexperienced users, questions
that can be answered easily can be prioritised. This should also reduce
the probability for errors: for example, depending on the location of
a fir, it is not easy to see if the twig is haired or if it is just dirt.
Answering difficult questions often require more time as well. Difficulties
are currently supported but not actively used.
• Groups—By grouping characters concerning the same part of a tree, for
example a leaf or a bud, contextual information can be inserted, like a
sketch with the different characters explained.
3.3 Design Choices
During entering data for the identification key for trees I have gained experience about what works well and what does not. Many, if not all, resulting
decisions are generally applicable and not only for trees/plants.
3.3.1 What is a Character?
This section describes how characters should be chosen in order to maintain
simplicity—despite the plants showing great variability.
Deceptive looks Is Metasequoia a confier or not? The leaves are soft, and
they resemble mimosa leaves, not conifer needles. The question “is it a soft- or
a hardwood tree?” (softwood trees are conifers, hardwood trees are the others)
is not easy to answer if one does not know the tree already. The question is
therefore likely to be answered incorrectly for Metasequoia.
Figure
3.8:
Metasequoia
glyptostroboides. The green shoots fall
off in one piece in winter, similar to
compound leaves, and the needles are
soft.
One could now simply tag the tree both as soft- and as hardwood, and the
user could identify it on both ways. This, however, is not correct. The tree is
not a hardwood. Or, one could tag the character for this taxon with “could
mistakenly be answered with hardwood”. Which gets complicated.
Another example is the Podocarpus genus. The trees have broad leaves, but
belong to the conifers.
The better way is to ask: “Are the leaves needle-like or broadleaf-like?” With
good conscience can we say yes to both.
Describing what something looks like is easier than naming it. Otherwise,
why would we need identification keys?
Handling variability Prunus spinosa has reverse egg-shaped leaves. If the
tree is young, however, they are egg-shaped (not reverse!). Fraxinus excelsior
has a smooth, greyish stem. After 20–30 years it develops a thick bark that
breaks in longitudinal lines.
At first glance it looks like conditional characters were a good idea: If the
tree is young, then the leaves have this shape. However, what is young?
Such conditions are hard to determine, and show as much variance as the
characters themselves.
Figure 3.9: Prunus spinosa. Young individual on the left, mature one on the
right. Their leaf shape is nearly the opposite.
My solution is again the simple one: The tree can show both characters, so
both of them are set for it. If only one character is observed, the correct tree
is still found.
Simon A. Eugster, ETH Zürich
– 16 –
Identification Key for Trees
Proposed Solution: Design Choices
Oct 2013
3.3.2 Exclusiveness of Characters
This section shows why allowing at most one answer to some questions does
not make sense.
I thought several times about adding an exclusive property to questions, implented it, and finally decided to drop it.
Exclusiveness—allowing only one answer for a question—intuitively makes
sense. Leaf branching serves as example: One distinguishes between opposite
(pairs of leaves are attached at the same height on the twig) and alternate (one
left, one right, etc.). Maple leaves are opposite, beech leaves are alternate.
Tagging this question as exclusive would disallow the user to select both
answers, which would not make sense anyway.
Unfortunately, that is wrong.
leaves.
Some willows have opposite and alternate
Some characters, however, are exclusive. All Beech species have the same
kind of fruit, as do Castanea species. But also this exclusiveness is lost when
adding not only species to the key, but also genera, families, and further
higher-level degrees: Both belong to the family Fagaceae.
Consequence is that exclusiveness does not work.
It is important to note that leaving away exclusiveness does not weaken the
key; wrong answers are still counted as mismatch.
3.3.3 Numerical Values
When dealing with numerical values, ranges are required since nature, as well
as human measurement, usually shows variability.
More than once has it happened to me that I tried to identify a shrub with
leaves around 1 cm long, and ended up identifying it as a tree whose leaves
are usually over 30 cm long. The reason was that the dichotomous key only
asked for size independent characters like the leaf shape or the nervature type.
The possibility of specifying the length of the leaf would have prevented this
misidentification.
Plants typically show a great deal of variance especially for dimensions. Petitoles of Acer easily range between 5 and 20 cm on a single tree; usually
they are longer the lower on the branch they grow, to maximise light yield.
Also growth of the tree itself varies: tree rings of yews can be observed from
0.1 mm in alpin regions up to over 10 mm in gardens.
Those examples make clear that numerical values must be represented as
ranges. This is even the case for longer values, e.g., the number of leaflets on
a Fraxinus excelsior leaf ranges between 11 and 15 in general.
To keep the variability within reasonable limits, extreme values should be
ignored. For leaves one can almost always find an even smaller leaf; bonsais
for example are miniaturised in every regard. Leaving away extreme values
does not decrease the chance of identification in general; the user would very
likely not pick the ash leaf with the most leaflets, but chose an average number.
For bonsais or other extremes, they have to be (made) aware of the resulting
changes.
One of the few cases where numerical values can be replaced by characters
are conifers: If they have bundled needles, typical values are only 2, 3, 5, and
“many” (more than one wants to count).
Testing an observation for a hit is done by testing if it covers or intersects
with the range stored to the tested taxon.
Identification Key for Trees
– 17 –
Simon A. Eugster, ETH Zürich
Figure 3.10: Alternate and opposite
leaves
Oct 2013
Proposed Solution: Implementation Details
3.3.4 Geographical Distribution
This section explains why geographical distribution maps are not supported.
Many keys or books feature geographical distribution maps. I thought a
long time about the best way of implementing them. What should be the
units? Country borders change. Grids may be tgoo coarse or too fine-grained,
depending on the taxon and the landscape. Sample points are accurate but
only at a single point. And finally, the locations themselves change with
plants transported to and planted at the other end of the world.
I ended up with the conclusion that the best way of implementing geographical distribution maps was not at all. It is the wrong model: plants are
not hindered from spreading across boundaries defined by GPS coordinates:
Metasequoia originally lives in China, but grows just as well in Switzerland.
The correct model is the habitat: plants require a certain temperature range,
precipitation level, pH range, and others.
3.4 Decisions for Key Generation
The identification key is character based; a taxon can be identified by examining it accurately.
3.4.1 Does This Taxon Match?
This section explains the rules used for deciding when a taxon matches the
observed characters.
Matching is done by comparing the
observed values to the stored
data—not the other way round.
My proposed method for rating taxa counts positive, negative, and unknown
matches. A positive match is clear: if we observe a character the taxon shows
as well, the match is positive. If the taxon does not show this character, we
have either a negative match or we have not enough information.
A negative match occurs only when the observed character is not observed
for the tested taxon and the tested taxon shows a different character for the
same question. If the tested taxon does not have any character defined for
the question in question, we cannot tell if it matches the observation; either
the question does not apply (e.g. leaf shape for needle trees), or the data has
not been entered yet.
One could argue that we may know the parent character of a question, and
hereby use the question tree to invalidate more characters. In case of testing
against a fir, we know that if a leaf shape is specified, we have a negative
match; the leaf shape is a child of the broadleaf-like leaves character. My
experiments have shown that this excludes taxa far too quickly in case of a
misidentification (or of wrong data) since one single observation can affect
several—many—characters at once.
The ruleset thus stays very simple, and taxa remain described only by the
characters they show, which, in my opinion, is a clean and meaningful way.
Requiring to list all characters that do not match as well would lead to a
trememdeous amount of work when introducing new characters, and in effect
duplicate data, which should generally be avoided.
Complexity of this approach is O(c T ) for T taxa and an average of c characters
per taxon.
The most simple way for excluding a taxon is to exclude it as soon as an
observed character does not match the characters stored for it anymore. As
long as both data and observations are correct, this is the fastest way for
identification.
Allowing for a certain number or percentage of errors makes sense if chances
for misidentification are high, which is especially the case for difficult trees
like the Cupressaceae family.
Simon A. Eugster, ETH Zürich
– 18 –
Identification Key for Trees
Proposed Solution: Implementation Details
Oct 2013
3.5 Implementation Details
3.5.1 Identification Key
This section discusses languages and patterns used for implementing the identification key.
Together with the data model, the identification key builds the core part of
my thesis. What are the best suitable technologies for implementing them?
For the implementation of the identification key I first had to answer the classical question which programming language I wanted to use. The decision fell
on a web-based JavaScript application. The reasons are the following. First,
platform independence is simply not an issue anymore—browsers supporting
JavaScript and HTML5 are available on all major operating systems, including mobile ones for phones and tablets. Second, no installation is required.
Third, no server communication is necessary anymore after the key has been
loaded.
The third point is realised by the usage of a fat client—i.e. the JavaScript
client contains all logic for identification—, and by loading all required data
on initialisation.
Two important core libraries are used by the key: the data model library and
the identification library. Both of them are re-used in the editor, as explained
in the next section.
Figure 3.11: Identification key UI for
clouds. The remaining taxa scroll together with the questions, but stop early
enough so they never float out of the
viewport.
The identification key is constructed in a configurable queue of functions that
process both the taxon and question set. Taxa are processed first; their rating
is required for sorting questions afterwards.
Taxon processing is accomplished in three steps. In the first step, prefilters
exclude taxa that are not interesting to the user. For example, only species are
shown for trees, since genuses and other taxa of higher degree are merely used
for constructing a taxonomic tree; they do not have any characters assigned.
Or, the user might be interessted in species of the Cupressaceae family only;
all others are then filtered. The second step is assigning the ratings to the
set of remaining taxa, i.e. counting matches, mismatches, and unknowns
for the current observation. This information is then used in the third step
for filtering out taxa that do not match the observations and for sorting the
others.
Figure 3.12: Taxon processing queue
used for the taxon thumbnails on the
right-hand side in figure 3.11.
Question processing works in a similar way. The prefilters can again remove
questions that are not interesting; currently only one is available, hiding questions if required equipment, like a louope, is not available to the user. The
second step, rating the questions, calculates basic statistics like how many
times a character occurs in the set of remaining taxa; they are then used by
one of the cost functions, which themselves provide functions for sorting taxa
according to the costs they calculated, or by other ways. The filter, finally,
e.g. decides if questions should be hidden when they cannot help in narrowing down the set of remaining taxa, or not to allow over-describing taxa for
increased certainty.
With those two queues completed, the user interface is constructed using a
visitor pattern for both taxa and questions. Question visitors can optionally
use components for better overview, as seen in the screenshot 3.14, or disable
them for better performance—there is only one top question and not one per
component—, and use depth-first search to keep questions with parent–child
relationships together.
Identification Key for Trees
– 19 –
Simon A. Eugster, ETH Zürich
Figure 3.13: Question processing queue
(questions on the left in figure 3.11
Oct 2013
Proposed Solution: Implementation Details
Figure 3.14: Main UI; the first component Baum is hidden. The numbers give the
questions answered and available.
For the design of the user interface, my goals were to
• show relevant information only. Settings, which usually are not changed
often, are accessible in the menus. Taxon descriptions are provided when
clicking on their image; the thumbnails can be scanned quickly by our
brain for validation, text however cannot.
• use space efficiently. The key contains a possibly vast amount of information, and displaying it clearly requires a combination of space-saving
design and hiding of irreleveant information, while ideally keeping the
visual impression lightweight.
• maximise identification speed. Hand-drawn graphics explain the characters described in questions, allowing new users to soon answer questions
based on the graphics without reading the explanation—which is much
faster. The set of remaining taxa never scrolls out of sight, so the user
does not need to scroll back in order to see how many taxa are remaining. Uninteresting components, like leaves in winter, can be hidden so
the relevant questions are found quicklier.
The user interface for trees supports an additional feature, a taxonomic tree.
If items are selected, only they—or their children—will be displayed. Experienced users need this as they can often tell e.g. the familia of a taxon due to
similarities with other species, and can hereby quickly exclude all other taxa.
An example can be seen in figure 3.15.
3.5.2 Key Editor
This section descusses the two backends for the editor that are currently
available.
The editor is built of two parts, the front-end user interface where the actual
data is entered, and the back-end storing the data. For convenience, the user
interface is—like the identification key—written in JavaScript and HTML5.
As mentionend before, the identification library is re-used for the editor: it
provides life updates on the taxa most similar to the one that is currently
edited. When adding characters to the taxon, the library functions again hide
ones that do not come into question, e.g. due to parent–child relationships.
Simon A. Eugster, ETH Zürich
– 20 –
Identification Key for Trees
Proposed Solution: Implementation Details
Oct 2013
Figure 3.15: UI with the taxonomic tree; Coniferales are selected, other taxa are
hidden.
The first of the two back-ends available stores the schema directly in a MySQL
database; it is written in PHP, and the communication with the front-end
happens in Ajax/JSON. No additional information (metadata) is recorded.
The second back-end is more interesting. It is a Mediawiki extension using
Wikibase. The reader is probably familiar with Mediawiki as it is used on
Wikipedia: it is a multi-user content editing system with history support,
similar to a version control system. Wikibase now is an extension to Mediawiki
and transforms it into a key/value database. More precisely: it consists of
properties (keys) and items. Items have an ID, a name, and property/value
pairs; values may be, for example, strings, images on Wikimedia Commons,
or again items. Properties have an ID and a name.
So far, this adds support for multiple users and a history, compared to the
first back-end. Yet the most important point is: all data is translatable. The
point mentioned in the goals—structure should be language independant—is
hereby completed.
Ranges and numerical values are not implemented yet in Wikibase, therefore
the Mediawiki extension currently does not support ranges.
Wikibase is still in heavy development, and had I started my thesis only half
a year earlier, I would not have had the chance to make use of this great
tool!
3.5.3 Mobile Identification Key
Since mobile devices strongly focus on web and web applications, very little
changes are required for a mobile version.
Having an offline version is useful when working in the field. WLAN is usually not available there, and even mobile coverage may be absent in sparsely
populated areas.
As mentioned in 3.5.1, the identification key logic runs on the client side and
loads a database dump on initialisation with Ajax calls. An offline version is
therefore very simple to create—the JSON dump, i.e. the responses to said
Ajax calls, only needs to be saved in a file and can then be loaded with the
same Ajax calls as before.
Identification Key for Trees
– 21 –
Simon A. Eugster, ETH Zürich
Oct 2013
Proposed Solution: Implementation Details
The app itself is then generated with the Cordova platform1 . Cordova can
build apps for all major mobile platforms from HTML5 web applications, and
provides JavaScript libraries to access e.g. the camera of the mobile device.
Current efforts in browser development may soon make this step superfluous.
Web applications can be cached by the browser in its application cache (AppCache), and can then be used offline like a normal app on mobile devices. The
current user interface already works without changes on tablets, thus using
this mechanism should be relatively easy.
1 Cordova is available on http://cordova.apache.org.
Simon A. Eugster, ETH Zürich
– 22 –
Identification Key for Trees
4 Conclusion
4.1 Evaluation
4.1.1 Question Order Benchmark
This section shows experimental results measured for different question rating
algorithms, which are responsible for fast identification.
To compare the different ranking algorithms for questions, I wrote a benchmarking script that counts the steps required for identifying a taxon by answering the top rated questions. The algorithms compared are:
• Alphabetic: No rating at all; questions are simply sorted alphabetically
by their name.
• Most used: This algorithm takes the set of remaining taxa, i.e. the ones
matching the current observations, and counts for each character how
many times it is used by any of those taxa. The question rating is then
the sum of its characters’ counters, and higher is better.
• Entropy: The algorithm proposed by Reynolds et al. (2003) using Shannon entropy to estimate remaining cost
Except for one taxon, the entropy based algorithm always performed equally
good or better than the most-used algorithm. The results of the benchmark
are shown in figure 4.1.
All algorithms have linear complexity O(C), C being the number of characters, after the question ratings have been computed as discussed in 3.5.1. The
complexity of latter is, similar to section 3.4.1, O(c T + C), with the number of
taxa T , each having an average of c characters assigned. The overall complexity, including sorting, is O(Q log2 Q + c T + C), with Q being the total number
of questions. (All terms are independant of each other and therefore cannot
be canceled.)
4.1.2 Are The Goals Met?
This section looks at the problem statement in 1.2 and compares the original
goals to the resulted system.
I have written a fully working, generic identification key containing data with
photographs and hand-drawn graphics for trees and for cloud genera, which
is more than I hoped. Additionally, I have a prototype extension for Mediawiki/Wikibase and a prototype Android app.
Editability Works with two back-ends available; Mediawiki together with
Wikibase and the LifeWeb extension provides multi-user support and history,
and the plain PHP/MySQL backend aims for simplicity and allows compact
database backups.
Performance Questions can be sorted such that O(log2 n) is achieved. The
user interface also requires a single click for selecting characters and supports
graphics for characters, questions, and components (as in the tree key) that
give a tremendeous speed-up compared to text only.
Internationality Both user interface and data can be translated; the user
interface uses JavaScript based methods, and data is translated by using
Mediawiki’s and Wikibase’s built-in multi-language support.
23
Figure 4.1: Question rating benchmark
for 57 taxa and 193 characters, and
around 17 characters per taxon.
Light grey: 61 taxa, 205 characters,
19.6 characters per taxon.
Oct 2013
Conclusion: Acknowledgements
Growth All code is released under the Open Source license GPL, and the
identification key for trees contains data to around 60 trees, covering already
a good part of the trees native to Switzerland. Only future can show if the
project manages to grow by itself. The Wikidata.org integration will first
require some work, as discussed in the next section. As for code quality—
most code has been refactored or re-written at least once, which simplified
and improved the code.
4.2 Future Work
This section discusses tasks I regard as important for future work on this
identification key.
Ranges or numerical values are not implemented in the Mediawiki plugin
since Wikibase does not support them yet. Ranges are currently also disabled
in the user interface; a fast and comfortable way for entering numerical values spanning various magnitudes (leaf length may range from millimeters to
meters), and a meaningful cost function for rating the importance of ranges
when sorting questions, needs to be developed.
Collections containing trees at special locations, for example in botanical
gardens, need to be re-visited. I have removed them since I was not confident
of their usability, especially regarding structuring them. Using data from
Wikidata, like geographical coordinates or associated country, would extend
the possibilities.
wikidata.org is an ideal place for data and should be a long-term goal for this
identification key. The QueryEngine extension for Wikibase may be used in
future for querying data; the current way does not support real-time updates
with the amount of data present on Wikidata (around 13 million rows as of
August 2013).
Component based thumbnails for the remaining taxa for visual exclusion.
As discussed in section 2.1.1, the human brain is great at pattern matching.
Currently, the identification key user interface simply takes the first image as
thumbnail. In the example key for trees, visual matching could be supported
by taking images of, for example, single leaves of each tree, or their fruits, and
only showing those when desired, so that a specific character can be examined
on all remaining taxa.
Mobile version neds more treatment; additionally to the points discussed
in section 3.5.3—adaption to new technologies like browser based apps—a
single page for different screen resolutions is desireable. Currently there is a
separate version for mobile phones, while tablets can use the standard web
page.
Figure 4.2: Some SVG graphics from the
identification key.
Editor UI should be re-designed. The current user interface is a development
version that focuses on functionality only, and usability has been neglected a
bit. I have created a user interface design in Inkscape, but it is not implemented yet.
4.3 Acknowledgements
First I would like to thank Prof. Ottmar Holdenrieder and Andreas Rudow:
In their lectures I learned new ways to identify plants, and they shared their
experience and joy on the topic. Spepcifically, it was a lecture held by Ottmar
Holdenrieder in which we used the Fitschen (2002) for identifying uncommon
trees for our herbarium where the iea of an electronic identification key came
into my mind. Andreas Rudow also helped me with his broader view where
I sometimes focused too much on details, and with his own experience in
collecting data for trees.
Simon A. Eugster, ETH Zürich
– 24 –
Identification Key for Trees
Conclusion: Acknowledgements
Oct 2013
A big thanks goes to Carmen Maria Rovina. We together visited Ottmar’s
lecture. Carmen accompanied me during the thesis by helping with my time
plan, by insisting on me defining the next goals and prioritising them correctly,
and with her interest in the topic. She also tested the identification key
and proof-read my thesis. As she studied Environmental Systems Science,
her knowledge came in handy for topic related questions too. Carmen also
managed to locate the LeafSnap page.
Several people tested my identification key and provided valuable feedback:
Carmen Ferri, Iris Huber, Katharina Schwitter, Claude Barthels, Anja Taddei, Natalie Kaiser, Gabriela Fisch, Dr Thomas Niklaus Sieber; and Loredana
Vamanu lent me a tablet for testing. Thanks a lot!
Then I want to thank my supervisor Prof. Donald Kossmann for supervising
this “non-classical” computer science thesis and for his interest in trees, and
my Master’s Mentor Prof. Markus Püschel for his unique way of presenting
complex lecture topics, and for pointing me to the books of Edward Tufte,
especially Tufte (1997).
Thanks to my parents for my interesting childhood—for showing us plants,
for driving to forests where we could climb on trees, and for hiking—and for
their good cooking.
Finally, thanks to the Wikidata developers for their project that suits so well,
and to the developers of PhpStorm for writing a great JavaScript IDE.
Identification Key for Trees
– 25 –
Simon A. Eugster, ETH Zürich
Bibliography
Mercedes Alsina and Anicet R. Blanch. A set of keys for biochemical identification of environmental vibrio species. Journal of Applied Microbiology,
76(1):79–85, 1994. ISSN 1365-2672. URL http://dx.doi.org/10.1111/j.
1365-2672.1994.tb04419.x.
Peter Belhumeur, Daozheng Chen, Steven Feiner, David Jacobs, W. Kress,
Haibin Ling, Ida Lopez, Ravi Ramamoorthi, Sameer Sheorey, Sean White,
and Ling Zhang. Searching the world’s herbaria: A system for visual identification of plant species. pages 116–129. 2008. URL http://dx.doi.org/
10.1007/978-3-540-88693-8_9.
M. J. Colloff and F. Th. M. Spieksma. Pictorial keys for the identification of domestic mites. Clinical & Experimental Allergy, 22(9):823–830,
1992. ISSN 1365-2222. URL http://dx.doi.org/10.1111/j.1365-2222.
1992.tb02826.x.
Allen J. Coombes. The Book Of Leaves. New Holland Publishers, 2011.
M. J. Dallwitz. A flexible computer program for generating identification
keys. Systematic Biology, 23(1):50–57, 1974. doi: 10.1093/sysbio/23.1.50.
URL http://sysbio.oxfordjournals.org/content/23/1/50.abstract.
Jost Fitschen. Gehölzflora. Quelle & Meyer, 2002.
J. C. Gower and R. W. Payne. A comparison of different criteria for selecting binary tests in diagnostic keys. Biometrika, 62(3):665–672, 1975.
doi: 10.1093/biomet/62.3.665. URL http://biomet.oxfordjournals.org/
content/62/3/665.abstract.
Hans Ernst Hess, Elias Landolt, Rosmarie Hirzel, and Matthias Baltisberger.
Bestimmungsschlüssel zur Flora der Schweiz und angrenzende Gebiete.
Birkhäuser, sixth edition, 2010.
J. LaSalle, Q. Wheeler, P. Jackway, S. Winterton, D. Hobern, and D. Lovell.
Accelerating taxonomic discovery through automated character extraction.
Zootaxa, 2217:43–55, 2009. URL http://www.mapress.com/zootaxa/2009/
f/zt02217p055.pdf.
R. J. Pankhurst. A computer program for generating diagnostic keys. The
Computer Journal, 13(2):145–151, 1970. doi: 10.1093/comjnl/13.2.145.
URL http://comjnl.oxfordjournals.org/content/13/2/145.abstract.
R. W. Payne and D. A. Preece. Identification keys and diagnostic tables:
A review. Journal of the Royal Statistical Society. Series A (General),
143(3):pp. 253–292, 1980. ISSN 00359238. URL http://www.jstor.org/
stable/2982129.
Alan P. Reynolds, Jo L. Dicks, Ian N. Roberts, Jan-Jap Wesselink, Beatriz
Iglesia, Vincent Robert, Teun Boekhout, and Victor J. Rayward-Smith.
Algorithms for identification key generation and optimization with application to yeast identification. In Applications of Evolutionary Computing, volume 2611 of Lecture Notes in Computer Science, pages 107–
118. Springer Berlin Heidelberg, 2003. ISBN 978-3-540-00976-4. URL
http://dx.doi.org/10.1007/3-540-36605-9_11.
Edward R. Tufte. Visual Explanations: Images and Quantities, Evidence
and Narrative. Graphics Press USA, 1997.
David Evans Walter and Shaun Winterton.
Keys and the crisis in
taxonomy: Extinction or reinvention?*.
Annual Review of Entomology, 52(1):193–208, 2007.
doi: 10.1146/annurev.ento.51.110104.
151054. URL http://www.annualreviews.org/doi/abs/10.1146/annurev.
ento.51.110104.151054. PMID: 16913830.
26