Image Analysis and Supervised Learning in the Automated
Differentiation of White Blood Cells from Microscopic Images
by Alfred R. J. Katz
A Minor thesis submitted in partial fulfillment of the requirements
for the degree of Master of Applied Science in Information
Technology
Department of Computer Science
RMIT
Declaration
I certify that all work on this thesis was carried out between February 1999 and February
2000 and it has not been submitted for any academic award at any other college, institute,
or university. The work presented was carried out under the supervision of Dr. Vic
Ciesielski who suggested using the Waikato Environment for Knowledge Analysis. All
other work in the thesis is my own, except where acknowledged in the text.
Signed,
Alfred Rudolf Josef Katz
09 February 2000.
Abstract
Differentiation, the detection and classification of white blood cells, is an established
problem in image analysis and machine learning, though one which has not yet been
solved. This thesis aims to show that the detection and classification of white blood
cells can be fully automated.
We developed a multi-step process consisting of: the extraction of a region of interest
from a larger image around thresholded cell nuclei; the segmentation of that image into
cell and non-cell regions using Canny edge detection followed by a circle identification
algorithm; the extraction of a feature set based on cell color, size and nuclear
morphological information; and the application of a classifier.
The performance of a number of classifiers was compared using the extracted feature set,
to determine which could achieve the lowest error rate on a common data set extracted
from 206 images of white blood cells. A ZeroR classifier was used as a baseline
against which to compare the others. All of the other classifiers produced error
rates of less than 10%; the neural network had the best performance, with an error rate of
1.95%. The performance of all classifiers was generally better on those classes that
contained more instances. While these error rates are not low enough to replace a
complete white blood cell count, they may be low enough to allow automated
screening for a limited range of disorders.
Every step of the process was automated except the determination of cell boundaries in
the image segmentation stage.
Acknowledgements
This thesis could not have been completed without the encouragement and support of my
supervisor, Dr. Vic Ciesielski, my wife Karen and my employers - Vision Instruments
Limited. Special thanks go to Sue Fisher, Haematological Scientist, of Vision
Instruments Limited, for her tireless answers to my many questions on the nature of white
blood cells and their differentiation. Thanks also to Rex Smith for the loan of the camera
used to acquire the images used herein.
Chapter 1 Introduction
  1.1 Goals
  1.2 White Blood Cell Types
Chapter 2 Literature Survey
  2.1 Overview
  2.2 Cell Segmentation
  2.3 Feature Extraction
  2.4 Learning and Supervised Learning
    2.4.1 Training and Testing of Classifiers
    2.4.2 Error Rate Estimation
    2.4.3 Confusion Matrices
    2.4.4 WEKA
    2.4.5 ZeroR
    2.4.6 OneR
    2.4.7 ID3 Decision Tree
    2.4.8 C4.5 and J4.8
    2.4.9 IBk k Nearest Neighbour
    2.4.10 Naïve Bayesian Classifier
    2.4.11 Neural Networks
  2.5 White Blood Cell Classification
Chapter 3 Process Overview
  3.1 Overview including data flow
  3.2 Data acquisition
    3.2.1 Cell Choice
    3.2.2 Resolution of Acquired White Blood Cell Images
Chapter 4 Image Processing
  4.1 Data Preparation
    4.1.1 Determining Regions of Interest
      4.1.1.1 Separating Images Into Component Bands
      4.1.1.2 Thresholding
      4.1.1.3 Edge Erosion
      4.1.1.4 Platelet Erosion
      4.1.1.5 Finding Blobs
      4.1.1.6 Consolidating Adjacent Blobs
    4.1.2 Extracting small images
    4.1.3 Finding circles
      4.1.3.1 Edge detection
      4.1.3.2 “Rough Circle” determination
    4.1.4 Manual intervention
    4.1.5 Extracting Feature Set
      4.1.5.1 Features in Feature Set
Chapter 5 Supervised Learning
  5.1 Using WEKA classifiers
    5.1.1 Data Preparation for Weka
    5.1.2 ZeroR
    5.1.3 OneR
    5.1.4 IBk Instance based classifier
    5.1.5 Naïve Bayes
    5.1.6 ID3 Based Decision Table
    5.1.7 J4.8 (C4.5 Based) Decision Tree Induction
  5.2 Using SNNS
    5.2.1 Data set
    5.2.2 Network Topology
    5.2.3 Scaling the inputs
    5.2.4 Class Outputs
    5.2.5 Interpretation of Output
Chapter 6 Results
  6.1.1 Weka Results
    6.1.1.1 ZeroR Results
    6.1.1.2 OneR Results
    6.1.1.3 IBk Instance based classifier
    6.1.1.4 ID3 Based Decision Table
    6.1.1.5 J4.8 (C4.5 based) Decision Tree Induction
    6.1.1.6 Naïve Bayesian Classifier
  6.1.2 SNNS Results
  6.1.3 Overall Results
Chapter 7 Conclusion and Further Research
  7.1 Conclusion
  7.2 Further Research
Chapter 8 References
Appendix 1 Equipment and Software Used
Appendix 2 Data Acquisition Process
  Slide Preparation
  Leica Microscope
  Settings
  Camera
  AI Gotcha
  Computer
  Scanning method
Appendix 3 Findblobs.c
Appendix 4 Perl Script to Consolidate Blobs
Chapter 1 Introduction
White blood cell classification and counting is an important diagnostic technique in the
detection of many illnesses, particularly leukaemias.
Automated systems for white blood cell recognition are currently available in the market.
These generally rely on flow cytometry techniques (e.g. Sysmex XE2100 [Sys1], Abbott
Diagnostics Cell-DYN 4000 [Add1], Beckman Coulter Onyx, MXM and GenS analyzers
[Bec1]), whereby a blood sample flows through a detector and is then sent to waste. The
importance of traceability in the medical diagnostics market, however, is increasing. A
technique that can make a determination from a microscope slide (or from a set of
images) has the advantage that the data from which a diagnosis is made can be kept on
file for future quality assurance needs.
As such systems would be used as an aid to the diagnosis of life threatening diseases,
accurate classification is very important. An inaccurate diagnosis could cause the death
of a patient who is falsely diagnosed as not having a disease, or cause inappropriate,
expensive and often dangerous treatment to be applied to a patient who is falsely
diagnosed as having a disease.
While image based classifiers for white blood cells are being developed for commercial
release, their details are usually closely guarded company confidential information. The
developers are not yet making any public claims of accuracy, and no instruments based
on this technology are as yet commercially available. Recent advances in supervised
learning techniques and the consequent free availability of classification algorithms and
programs make it desirable to revisit this problem.
1.1 Goals
The goal of this research is to determine whether all steps in a fully automated system for
the classification of white blood cells from microscopic images can be realized using
image processing and supervised learning techniques. As a component of this, we will
attempt to determine which of a number of investigated classification techniques provides
the best automated classifier for the classification of white blood cells into their five
major types (Neutrophils, Lymphocytes, Monocytes, Eosinophils and Basophils), based
on a limited data set of visual images.
The hypotheses investigated in this thesis are each related to the automated detection and
classification (the haematologists’ term for this process is differentiation) of white blood
cells from color images captured from a microscope:
1. That useful regions of interest, each centered on a white blood cell, can be
automatically extracted from color microscopic images.
2. That image processing techniques can be used to automatically extract a feature
set from these regions of interest that is useful for the classification of white
blood cells.
3. That supervised classification techniques such as decision trees, k nearest
neighbor, and neural networks can be used to classify the white blood cells in
these regions of interest, and to determine their relative accuracies.
1.2 White Blood Cell Types
Human blood contains five major types of white blood cells or leukocytes. These are
neutrophils, basophils, eosinophils, lymphocytes and monocytes. These can be divided
into two major groups, distinguished by the presence or absence of granules in the
cytoplasm (cell body). There are two major types of leukocyte without granules; these
are lymphocytes and monocytes. The other three major types of leukocyte (neutrophils,
basophils and eosinophils) differ in the way their cytoplasmic granules are affected by
various stains [Col1].
Figure 1 shows images of white blood cells from the Atlas of Blood Cell Differentiation
[Cil1] as well as from among the images acquired for this thesis. The cells on the left
show distinguishable cytoplasmic granules in the Basophil, Eosinophil and Neutrophil,
which become indistinct at the lower resolutions apparent in the images acquired as a part
of this thesis (see Figure 1).
In a healthy person, the number of white cells varies between about 4,000 and 10,000 per
microlitre. The proportion of different white cells varies greatly, both between
individuals and at different times in the same individual.
Table 1 shows typical percentages of the different types of leukocytes for a cross section
of healthy individuals [Col1]. Percentages significantly outside the ranges found can be
indicative of various blood-related illnesses. For instance, a large increase in lymphocytes
(e.g. 90% of white blood cells are lymphocytes) is an indicator of chronic lymphatic
leukemia.
Cell type       Percentage normally found
Neutrophils     50-70%
Eosinophils     1-5%
Basophils       0-1%
Monocytes       2-10%
Lymphocytes     20-45%

Table 1 Typical White Cell Counts
Figure 1 Examples of complete images. For each cell type (Basophil, Eosinophil,
Lymphocyte, Monocyte, Neutrophil), the figure pairs an image from the Atlas of Blood
Cell Differentiation [Cil1] with an image captured for this thesis (shown at similar size
for comparison).
Chapter 2 Literature Survey
2.1 Overview
Computer vision is the processing of image data for use by a computer [Umb1]. It is a
form of computer imaging where a computer processes visual information directly,
examining images and acting on the result. Computer vision can be split into two
primary tasks, that of image acquisition and that of image analysis.
Image acquisition is the process of capturing an image and converting it to a format
capable of being manipulated by the computer. Today this is seen as a trivial task, with
many image acquisition systems being readily available on the commercial market,
covering a broad range of performance requirements (high and low resolution, speed,
color definition, etc). Nevertheless, there are a number of general criteria that need to be
met for images acquired for computer analysis, according to Russ [Rus2, pp7-70], to
ensure that the images acquired are as useful as possible. These include:
• Global uniformity. The same type of feature should look the same wherever it is in
the image.
• Local sensitivity. Edges should be sharply delineated and accurately located.
• The geometry of the imaging system (e.g. the magnification of a microscope) should be known.
Image analysis is the process of manipulating the acquired image to extract higher level
information for computer analysis or manipulation. The image analysis process can be
seen as having three main stages (see Figure 2): preprocessing, data reduction and feature
analysis.
Input Image → Preprocessing → Data Reduction → Feature Analysis

Figure 2 Image Analysis Process [Based on Umb1, p. 38]
Preprocessing techniques are used to make the main data reduction task simpler. These
techniques include extracting regions of interest, basic algebraic operations on images
and removing artifacts from images.
A common preprocessing operation in image analysis is the extraction of a region of
interest (ROI) which we wish to investigate more closely without the added complexity
of extraneous data from other, unwanted, parts of an image. To do this we need to
identify the region of interest and then crop, or cut away, this area from the rest of the
image.
Edge detection is often used in ROI determination. One of the seminal algorithms in this
area is the one developed by J. Canny [Can1]. It consists of the following steps:
• Smooth the image with a Gaussian filter.
• Differentiate the image.
• Mark any values in the derivative image above a certain threshold as being edges.
Marr and Hildreth [Mar1] have a similar algorithm that uses the zero crossings of the
second derivative of the smoothed image to determine the edges.
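To make these steps concrete, a minimal sketch of the smooth/differentiate/threshold
core is given below in C, assuming an 8-bit grayscale image held in memory as a
one-dimensional array. This is illustrative only, not the edge detection code used later
in this thesis: the smoothing is reduced to a single 3 x 3 Gaussian kernel, and the
non-maximum suppression and hysteresis stages of a full Canny implementation are omitted.

    #include <math.h>
    #include <stdlib.h>
    #include <string.h>

    /* Mark edge pixels in a w x h grayscale image (values 0-255).
     * A simplified sketch of the Canny steps: smooth, differentiate,
     * threshold. */
    void mark_edges(const unsigned char *in, unsigned char *out,
                    int w, int h, double threshold)
    {
        /* 3 x 3 Gaussian kernel (1 2 1 / 2 4 2 / 1 2 1), sum = 16 */
        static const int g[3][3] = { {1,2,1}, {2,4,2}, {1,2,1} };
        double *smooth = malloc(w * h * sizeof *smooth);
        int x, y, i, j;

        memset(out, 0, (size_t)w * h);   /* borders stay non-edge */

        /* Step 1: smooth the image with a (small) Gaussian filter. */
        for (y = 1; y < h - 1; y++)
            for (x = 1; x < w - 1; x++) {
                int sum = 0;
                for (j = -1; j <= 1; j++)
                    for (i = -1; i <= 1; i++)
                        sum += g[j + 1][i + 1] * in[(y + j) * w + (x + i)];
                smooth[y * w + x] = sum / 16.0;
            }

        /* Steps 2 and 3: differentiate (Sobel operators) and mark any
         * pixel whose gradient magnitude exceeds the threshold. */
        for (y = 2; y < h - 2; y++)
            for (x = 2; x < w - 2; x++) {
                double gx = smooth[(y-1)*w + x+1] + 2*smooth[y*w + x+1] + smooth[(y+1)*w + x+1]
                          - smooth[(y-1)*w + x-1] - 2*smooth[y*w + x-1] - smooth[(y+1)*w + x-1];
                double gy = smooth[(y+1)*w + x-1] + 2*smooth[(y+1)*w + x] + smooth[(y+1)*w + x+1]
                          - smooth[(y-1)*w + x-1] - 2*smooth[(y-1)*w + x] - smooth[(y-1)*w + x+1];
                if (sqrt(gx * gx + gy * gy) > threshold)
                    out[y * w + x] = 255;
            }
        free(smooth);
    }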
The second stage of the image analysis process, data reduction, involves either reducing
the data in the spatial domain, or transforming it into the frequency domain, or both
[Umb1](see Figure 3). The image information may be filtered after these processes,
further reducing the data and allowing the extraction of the features required for analysis.
Preprocessed Image → Transform → Spectral Information → Filtering → Feature Extraction
Preprocessed Image → Segmentation → Spatial Information → Filtering → Feature Extraction

Figure 3 Data Reduction [Based on Umb1, p. 39]
Feature extraction, the final step in the data reduction process, is the process of extracting
the particular subset of the information in the image that is useful for solving the
particular problem at hand. The object features of interest may include geometric
properties of binary images, color features and texture features.
Geometric features
such as edges and curves are often useful in matching images in the presence of grayscale
distortions [Ros1].
A feature vector is a representation of an image as an n-dimensional vector containing the
values associated with these object features. These values may be ordinal, nominal or
continuous numerical. Representing the important information in an image as a feature
vector has the advantage of significantly reducing the amount of data.
Images that
require many kilobytes, often megabytes, of storage can often have their features of
interest represented in a feature vector of only a few bytes or tens of bytes.
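As a purely hypothetical illustration (the field names below are invented for this
example and are not the feature set extracted later in this thesis), such a vector for
one cell image might be held in a C structure of a few dozen bytes:

    /* A hypothetical feature vector for one white blood cell image.
     * The full image may occupy megabytes; this summary occupies a
     * few dozen bytes.  Field names are illustrative only. */
    struct cell_features {
        double cell_area;      /* continuous numerical */
        double nucleus_area;   /* continuous numerical */
        double mean_red;       /* average color, per band */
        double mean_green;
        double mean_blue;
        int    nucleus_lobes;  /* ordinal */
        int    stain_type;     /* nominal, coded as an integer */
    };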
2.2 Cell Segmentation
Segmentation is part of the data reduction stage and involves the partitioning of the image
plane into meaningful parts, such that “a correspondence is known to exist between
images on the one hand and parts of the object on the other hand” [Van1, p212]. In the
context of this thesis, the prime reason for segmenting the image is to define the
boundaries of the white blood cell currently of interest, enabling features to be extracted
from the whole white blood cell without the inclusion of extraneous material. Harms
and Aus [Har1] state that “segmentation is the first, important step in image analysis”.
Russell and Norvig [Rus1, p 734] similarly see segmentation as a “a key step towards
organizing the array of image pixels into regions that would correspond to semantically
meaningful entities in the scene”.
Rosenfeld [Ros1] discusses two forms of segmentation. Pixel based image segmentation,
in which pixels are classified independently, is seen as simpler, but has a number of
drawbacks particularly in the area of local consistency. Region based segmentation,
where the goal is to split the image into distinct connected regions is seen as a better
alternative. Both forms of segmentation require some experimentation to develop a good
semantic model that can be used to split or merge regions.
Sonka, Hlavac and Boyle [Son2] nominate thresholding as the simplest segmentation
process, as it is computationally cheap and fast. They claim that thresholding is a
suitable segmentation method where objects do not touch each other and where their grey
levels are clearly distinct from background levels. Correct threshold selection is crucial
for successful segmentation. They state that this can be useful in the case of segmenting
images of microscopic blood cells where cytoplasm, nucleus and background each have
their own distinctive grey levels. This technique can have problems where the lighting
level varies from one image to another.
Van der Heijden [Van1, p215] similarly proposes an image segmentation method for
complete white blood cells, which is based upon multiple gray level thresholding. In this
method, the white blood cell image is low pass filtered with a 5 x 5 pixel averaging filter.
This method then uses a known set of conditional probabilities to segment the image into
nuclear, cytoplasmic and background material. Van der Heijden [Van1, p215] shows an
eosinophilic granulocyte segmented in this manner. This method unfortunately falls
down in the presence of erythrocytes (red blood cells) which can often have similar gray
level values to leukocyte cytoplasm.
Aus, Harms, ter Meulen and Gunzer [Aus1] propose a cell segmentation algorithm driven
by a model of cell structure. Their model assumes that a cell consists of one or more
areas of nuclear matter, surrounded by an area of cytoplasm and that an area of nuclear
matter cannot be shared between two or more cells. Maximum and minimum areas of
cytoplasm and nucleus are set depending on the magnification of the image acquisition
system. Image segmentation is then achieved in a number of steps:
• Color differences of each pixel are calculated. These are pairwise differences between
the three color signals for each pixel (red level – blue level, blue level – green level
and green level – red level).
• The regions with the largest color differences are located and marked as
belonging to nuclear material.
• Areas of cytoplasm are defined step by step, using both the cell model and
knowledge of the characteristic color differences.
• The cell boundaries are then determined using a method that “incorporates color
features, geometric operations and probability functions” [Aus1, p513].
Unfortunately the method is not explained in further detail.
• Based on the cell model, the whole cell is then either accepted or rejected as being a
leukocyte.
Based on an image of four leukocytes segmented by this method, it seems to perform
very well.
Harms and Aus [Har1] again use a model of the object of interest in their cell
segmentation algorithm. In this method, they use color difference to identify the nuclear
material in one focal plane; they then shift the focus and use absolute color (blue band
only) to identify myelin sheath material (not present in the white blood cells studied in
this thesis) in a different focal plane, and use a luminosity (grey scale) image in a third
focal plane to identify cytoplasm. The three images are then overlaid, and nuclear
subregions are combined if they are close enough (in a manner similar to that used in this thesis). This
method has the advantage of being able to work with the sharpest images possible of both
the nucleus and the cytoplasm boundary. A limitation of this technique (which is
highlighted) is that “a microscope digital focus control, with a 0.1 micron resolution is
mandatory for reproducibly measuring tissue sections of different focus settings” [Har1,
p521].
2.3 Feature Extraction
Jain [Jai2] sees feature extraction as a process of mapping original features into features
that are more useful. In the case of this thesis the original features are images of white
blood cells. Feature extraction is the extraction from a data set (such as an image) of a
vector in a multidimensional space, where each dimension represents an attribute of the
image that is believed to carry information useful in the classification of that image.
Aus, Harms, ter Meulen and Gunzer [Aus1] used a set of 33 features to perform this task.
They use such features as:
• nucleus and cytoplasm area
• average color co-ordinates in the DIN color diagram (a German standard, DIN 6173,
which can be used to express colors as x and y coordinates in a two dimensional
color diagram)
• number of pixels in the nuclear perimeter (they call this the “contour of the nucleus”)
• ratio of nuclear area to cytoplasmic area and “nuclear form factor” (they do not
explain this particular measure).
However, 17 of their 33 features were based on “texture lines”, which are generated by
gradient and contour following algorithms. The number of these texture lines, the
average of the distance between the lines and the variance of the distance between them,
were all included in the feature set. These texture-based features provide clues as to the
granularity of the cells; such clues are missing from the feature set used in this thesis.
Unfortunately, a minimum
resolution of 10-12 pixels per micron is required to resolve these features. The
equipment available for this thesis was only capable of a resolution of approximately 3
pixels per micron. The authors do not provide results as to the accuracy they achieved in
cell classification. However, they do claim 55% - 95% accuracy in detecting 11 different
types of leukemia.
Song, Abu-Mostafa, Sill and Kasdan [Son1] extract a feature set of 20 features including
cell area, median, standard deviation and 4th quantile of the color and color difference
distributions, perimeter of cell and standard deviation of the distance from the cell edge
to the mass centre. Unfortunately, no indication is given of the techniques used to define
the cell boundary and mass centre. They state that “at the preprocessing stage, the
images are segmented to set the cell interior apart from the background” [Son1, p4].
Again, no indication of the techniques used to achieve this segmentation is provided, not
even a statement as to whether this segmentation was automatic or required human
intervention. The white blood cell images shown in the paper are all fully segmented to
show just the white blood cell of interest on a white background. The team achieved an
88-89% accuracy in cell classification on fourteen different white blood cell types,
including the five basic types and nine other types (mostly immature white blood cells).
2.4 Supervised Learning
Machine learning is a subfield of the field of artificial intelligence that deals with
programs that learn from experience [Rus1]. Learning is the capability of a program to
improve its performance at its given task. In the case of a classification algorithm,
learning is the ability to improve its performance at classifying the instances presented to
it.
Any situation in which both the inputs and outputs are known is called supervised
learning, while learning with no hint at all of the correct outputs is called unsupervised
learning. Supervised learning is a process of learning by example and is only possible in
the presence of some form of feedback that can be used to quantify and hence compare the
performance of a program. Learning to classify white blood cells, given a set of training
examples classified by a human, is an example of supervised learning. In this case, the
feedback necessary to monitor performance can be obtained from the error rate, or
percentage of results that do not match those assigned by a human.
Russell and Norvig state that supervised learning can be seen as pure inductive inference,
which induces some function which approximates the reality being learned [Rus1, p529].
2.4.1 Training and Testing of Classifiers
In the context of classification, a good supervised learning algorithm is one that produces
hypotheses (classifiers) that do a good job of predicting the class of unseen examples
[Rus1, p538]. The quality of a supervised learning algorithm that produces a classifier
can be assessed by the following method:
1. Divide the set of available samples into two separate sets, the training set and the test
set.
2. Use the algorithm under test to produce a classifier using the samples in the training
set only. This is known as training the algorithm.
3. Use the resulting classifier on the samples in the test set generating a predicted class
from each. Compare the resulting predictions against the actual classes of the
examples and note the percentage of correctly classified instances in the test set.
Whenever a large number of features exist and thus a large number of hypotheses are
possible, there is a danger of using the resulting freedom to learn classifiers based on
meaningless or irregular attributes of the data set. This problem is a common one in all
kinds of learning algorithms and is known as overfitting or overtraining. Simple
techniques exist to prevent overtraining for most learning algorithms, such as decision
tree pruning [Rus1, pp542-543] for decision trees and optimal brain damage [Rus1,
pp572-573] for neural networks.
2.4.2 Error Rate Estimation
The main focus of this thesis is the comparison of different machine learning classifiers
on the data set extracted from white blood cell images. The most commonly used
measure of success or failure of a classifier is that classifier’s error rate [Wei1]. The true
error rate is the ratio of incorrect classifications to the total number of cases, measured
over an asymptotically large number of new cases that converge to the actual population.
This, according to Jain [Jai2], is usually estimated from the percentage of misclassified
test samples.
The empirical error rate can be defined as the ratio of the number of errors to the number
of cases examined.
error rate = number of errors / number of cases
Many researchers still use a resubstitution method to estimate the error rate [Jai2]. In this
method, the same data is used to develop the classifier and then to test it. This method
produces an optimistically biased error rate and makes it difficult to detect overtraining.
It is well accepted [Wei1] that holding out a number of randomly selected samples as a
test data set and not using these cases at all as part of the training process, allows these
samples to be used to obtain a good estimate of the true error rate. According to Jain
[Jai2], this method gives a pessimistically biased estimate and is therefore generally
preferred.
Unfortunately, with only 200 samples and only two instances of one of the five classes
(basophils), it is difficult to both have enough samples to train the classifiers well and
have enough data for a statistically significant test data set. Two methods are commonly
used [Wei1] to overcome this problem. The first of these is known as leaving-one-out,
where n train and test runs are performed on n samples, each time leaving out a different
test instance to be classified using the trained classifier. This gives the most accurate
estimate possible of the true error rate. The second method is k-fold cross validation
where the data is divided into k mutually exclusive test sets of approximately equal size.
The cases not allocated to each test set are used to train for that particular test. The main
advantage of cross validation is that all cases in the data set are used for testing. The
main disadvantage is that for each different data set, different classifiers may be learned.
The WEKA toolset [Wek1] uses a default value of k = 10, and this default is
used throughout this thesis. The results of this cross validation are presented as part of
WEKA’s standard output from a test.
Other authors [Alp1] have preferred a method of 5 * 2 fold cross validation, where the
data set is split into two equal sized training and test sets, five times. Each time a two
fold cross validation is performed, swapping the test and training sets to achieve the
folds. This, however, suffers from the disadvantage for small data sets that the number
of instances in each training set is significantly reduced (from 90% of the total data set to
only 50%). This will simply mean that, in most cases, the classifier will be less well
trained and the estimated error rate higher.
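A minimal sketch of the k-fold procedure, using the trivial ZeroR classifier of section
2.4.5 as the classifier under test, is given below. It assumes class labels coded as
integers 0 to 4 and instances presented in random order; WEKA's own implementation
additionally stratifies the folds so that each fold has a similar class distribution.

    #define NCLASS 5   /* the five white blood cell types */

    /* "Train" ZeroR on all instances outside the fold [lo, hi):
     * simply return the majority class of the training instances. */
    static int zeror_train(const int *cls, int n, int lo, int hi)
    {
        int count[NCLASS] = {0};
        int best = 0, i;
        for (i = 0; i < n; i++)
            if (i < lo || i >= hi)          /* exclude the test fold */
                count[cls[i]]++;
        for (i = 1; i < NCLASS; i++)
            if (count[i] > count[best])
                best = i;
        return best;
    }

    /* Estimate the empirical error rate of ZeroR by k-fold cross
     * validation.  cls holds the actual class (0..NCLASS-1) of each
     * of the n instances. */
    double cross_validate(const int *cls, int n, int k)
    {
        int errors = 0, fold, i;
        for (fold = 0; fold < k; fold++) {
            int lo = fold * n / k, hi = (fold + 1) * n / k;
            int predicted = zeror_train(cls, n, lo, hi);
            for (i = lo; i < hi; i++)       /* test on held-out fold */
                if (cls[i] != predicted)
                    errors++;
        }
        return (double)errors / n;          /* errors / cases */
    }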
2.4.3 Confusion Matrices
The confusion matrix, as output by the WEKA tools [Wek1] is a common method of
comparison of the performance of multiple supervised classifiers on the same data set. It
is often also used to compare the performance of a single classifier on multiple data sets.
The confusion matrix is basically a matrix of the actual classes of the data, shown against
the class assigned by the classifier under investigation.
  m    n    l    e    b      <- classified as
 12    0    1    0    0   |  m = monocyte
  0  138    0    0    0   |  n = neutrophil
  2    0   42    0    0   |  l = lymphocyte
  0    3    0    5    0   |  e = eosinophil
  0    0    2    0    0   |  b = basophil

Figure 4 Example of confusion matrix.
The confusion matrix provides a great deal of information about which classes are likely
to be confused, but gives no indication of which attributes cause the confusion. From the
example in Figure 4, for instance, the following conclusions can be drawn, based on
the classifier and data set from which this matrix is derived:
• Instances of class n (neutrophils) are always classified correctly.
• Instances classified as class n have a probability of 138/141 (97.8%) of actually being
instances of class n (the remainder will be class e (eosinophils)).
• Instances of class e have a probability of 5/8 (62.5%) of being classified correctly.
• Instances classified as being of class e are always actual instances of class e.
• Instances of class b (basophils) are never classified correctly and are always classified
as class l (lymphocytes).
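Such figures can be read mechanically off the matrix. The sketch below tabulates, for
each class, the fraction of actual instances classified correctly and the fraction of
assigned instances that are actually of that class, using the matrix of Figure 4:

    #include <stdio.h>

    #define NC 5   /* classes: m, n, l, e, b */

    /* Confusion matrix of Figure 4: row = actual class,
     * column = class assigned by the classifier. */
    static const int cm[NC][NC] = {
        { 12,   0,  1, 0, 0 },  /* m = monocyte   */
        {  0, 138,  0, 0, 0 },  /* n = neutrophil */
        {  2,   0, 42, 0, 0 },  /* l = lymphocyte */
        {  0,   3,  0, 5, 0 },  /* e = eosinophil */
        {  0,   0,  2, 0, 0 },  /* b = basophil   */
    };

    int main(void)
    {
        const char *label = "mnleb";
        int c, i;
        for (c = 0; c < NC; c++) {
            int row = 0, col = 0;
            for (i = 0; i < NC; i++) {
                row += cm[c][i];   /* actual instances of class c */
                col += cm[i][c];   /* instances classified as c   */
            }
            /* Fractions are printed rather than divided, since a
             * column (or row) total may be zero, as for class b. */
            printf("%c: %d/%d of actual, %d/%d of classified\n",
                   label[c], cm[c][c], row, cm[c][c], col);
        }
        return 0;
    }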
2.4.4 WEKA
WEKA [Wek1, Wit1] is the Waikato Environment for Knowledge Analysis. WEKA is a
Java based environment, allowing the same programs to run in the same way under
different operating systems. WEKA allows the application of a number of different
classification algorithms to a common data set by providing the necessary translations
from a common data format to the formats that may be required by individual algorithms.
Systems like WEKA have only recently become available. They greatly enhance the ease
of comparing classifiers by accepting the same input files and providing results in a
common output format.
2.4.5 ZeroR
ZeroR is a primitive classifier. It predicts the majority class in the training data for all
instances of test data if the class is categorical, according to the WEKA development
team [Wit1]. It does not make a very useful scheme for prediction.
Russell and Norvig [Rus1] describe the ZeroR learning algorithm “as a good ‘straw man’
learning algorithm” (p561), though they do not give it a name. They indicate that this
algorithm should give an idea of the minimal performance that any algorithm should be
able to maintain, offering as an aside that many published algorithms have performed
worse than this baseline standard.
2.4.6 OneR
Holte’s OneR [Hol1] is a simple classifier that extracts a set of rules based upon a single
attribute. It is normally used as a baseline inducer. OneR shows that it is easy to get
reasonable performance on a variety of classification problems by examining only one
attribute. It is, however, substantially inferior in error rate to C4.5 decision trees on the
data sets that Holte uses as examples.
The only parameter available under the OneR classification algorithm is the minimum
number of instances required in each class. Holte suggests 6 as a default value.
2.4.7 ID3 Decision Tree
The ID3 algorithm [Qui2] produces a decision tree by finding a good subset of attributes
for inclusion in the tree. The leaf nodes of the resulting decision tree contain the classes,
while the non-leaf nodes are decision nodes [Rus1, pp 525-562]. Each decision node is
an attribute test with each branch being a range of possible values of the attribute under
test.
ID3 searches through the attributes of the instances in the training set, to find the attribute
that best separates the given examples. ID3 then stops if the training set is classified
perfectly. Otherwise it operates recursively on the partitioned subsets which have more
than one class represented in them.
The criterion used by ID3 to determine which attribute “best” splits the training data is
information gain [Rus1, p559]. Information, a fundamental concept in communication
theory, is described by Quinlan [Qui1] as depending on its probability and capable of
being measured in bits as minus the logarithm to base2 of that probability. As an
example, if there were 8 equally probable outcomes, the information conveyed by any
one of them would be –log2 (1/8) or 3 bits. The information gain from a test X on an
attribute is defined as the difference in information before and after the application of the
test. The attribute with the largest information gain is chosen.
Quinlan [Qui1] criticizes the information gain criterion as having a strong bias in
favour of tests with more outcomes over tests with fewer outcomes. This can produce
unnecessarily wide decision trees, with many leaves containing single instances, which
can lead to overtraining. For this reason, Quinlan’s C4.5 [Qui1] (2.4.8) uses a different
criterion, the gain ratio to determine which attribute to use to split the training data. This
gain ratio is found by dividing the information gain by a value that represents the extra
information generated (Quinlan calls this the “split info”). Gain ratio is therefore seen as
the proportion of the information generated by the split which “appears helpful for
classification” [Qui1, p23].
ID3 uses a greedy search, selecting a test using the information gain criterion as a local
measure of progress, then never exploring the possibility of alternate choices. This is a
very efficient algorithm, in terms of processing time, but is “neither optimal nor
complete” [Rus1, p96]. It is possible that ID3 could be improved using a more complete
search algorithm such as A* search [Rus1, p97].
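To make the information and gain ratio calculations concrete, a sketch is given below,
assuming that the class counts have already been tallied for the parent set and for each
subset produced by a candidate test:

    #include <math.h>

    /* Information (entropy), in bits, of a set whose members fall
     * into nclass classes with the given counts.  For example, with
     * 8 equally probable outcomes each p = 1/8, giving 3 bits. */
    double info(const int *count, int nclass)
    {
        int total = 0, i;
        double bits = 0.0;
        for (i = 0; i < nclass; i++)
            total += count[i];
        for (i = 0; i < nclass; i++)
            if (count[i] > 0) {
                double p = (double)count[i] / total;
                bits -= p * log2(p);   /* minus sum of p * log2(p) */
            }
        return bits;
    }

    /* Information gain of a test that splits a parent set into nsub
     * subsets: info(parent) minus the size-weighted average of the
     * subset informations.  sub is a flattened nsub x nclass array
     * of per-subset class counts.  C4.5's gain ratio (section 2.4.8)
     * divides this gain by the "split info", which is info() applied
     * to the distribution of subset sizes themselves. */
    double gain(const int *parent, const int *sub, const int *subsize,
                int nsub, int nclass, int total)
    {
        double after = 0.0;
        int s;
        for (s = 0; s < nsub; s++)
            after += (double)subsize[s] / total
                     * info(sub + s * nclass, nclass);
        return info(parent, nclass) - after;
    }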
2.4.8 C4.5 and J4.8
J4.8 is an improved version of Quinlan’s C4.5 algorithm [Qui1], supplied with the
WEKA tool set. Russell and Norvig [Rus1, p559] see C4.5 as “an industrial strength
version of ID3”. According to Quinlan [Qui1], the main differences between the C4.5
(and hence J4.8) and ID3 algorithms are:
• C4.5 uses gain ratio by default instead of information gain to determine the best
attribute on which to split the data. The gain ratio of a test X can be expressed as:

    gain ratio(X) = gain(X) / split info(X)

This test (together with a constraint that gain(X) must be large) removes ID3’s bias in
favour of wide decision trees.
• C4.5 introduces a constraint that for any split at least two of the resulting subsets must
contain a threshold number of cases. This threshold number is user selectable. This
eliminates the type of trivial splits seen with ID3, which are often the cause of
overtraining of the classifier.
• C4.5 allows pruning of the resulting decision trees. This tends to increase the error
rates on the training data, but, importantly, decrease the error rates on unseen test
cases.
2.4.9 IBk k Nearest Neighbor
IBk is an implementation of a k-nearest-neighbor (k-NN) classifier that employs a
distance metric to determine the nearest neighbor. Given a certain space with a defined
measure, e.g. the Euclidean distance, the k-NN algorithm can be used to classify the
object under consideration.
Each training instance is regarded as a point in n-dimensional space, where n is the
number of features in the feature vector [Alb1]. When a test instance is presented, the
euclidean distance from the point represented by the test instance to each training
instance is calculated. In the simplest single nearest neighbor case (1-NN) the nearest the
classification of the nearest neighbor (shortest euclidean distance) becomes the
classification of the test instance. When more nearest neighbors (k-NN) are taken into
consideration, each of the k nearest neighbors is determined as before. Each of these
nearest neighbors is given a “vote” for its classification and the classification with the
highest number of votes is assigned to the test instance [Van1, pp175-177].
The error rates achieved by a k-NN algorithm vary with different values of k. Albert and
Aha [Alb1] suggest that a good initial estimate for the optimum number of neighbors (k)
is the square root of the number of samples in the training set.
In the WEKA implementation, if more than one neighbor is selected, the “voting power”
of each neighbor is weighted by a function of distance.
This distance function can be either 1 – distance or 1/distance.
A disadvantage of the IBk classifier is that it is computationally heavy for large data sets.
This has not been a problem with the dataset used for this thesis.
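A sketch of the k-NN decision rule with unweighted voting is given below; the feature
dimensionality and class count are illustrative, and the WEKA distance weighting
described above would simply replace the unit vote:

    #include <float.h>

    #define NFEAT  10   /* n, the feature vector dimensionality */
    #define NCLASS 5

    /* Squared Euclidean distance between two feature vectors.
     * Squaring is monotonic, so the square root can be skipped. */
    static double dist2(const double *a, const double *b)
    {
        double d = 0.0;
        int i;
        for (i = 0; i < NFEAT; i++)
            d += (a[i] - b[i]) * (a[i] - b[i]);
        return d;
    }

    /* Classify a test instance by majority vote of its k nearest
     * training instances (k <= ntrain is assumed).  For distance
     * weighted voting, replace the unit vote with 1/d or 1 - d. */
    int knn_classify(const double train[][NFEAT], const int *cls,
                     int ntrain, const double *test, int k)
    {
        double votes[NCLASS] = { 0.0 };
        char used[ntrain];   /* marks neighbours already taken */
        int i, j, c, winner = 0;

        for (i = 0; i < ntrain; i++)
            used[i] = 0;
        for (j = 0; j < k; j++) {          /* find the j-th nearest */
            int best = -1;
            double bestd = DBL_MAX;
            for (i = 0; i < ntrain; i++) {
                double d = dist2(train[i], test);
                if (!used[i] && d < bestd) {
                    bestd = d;
                    best = i;
                }
            }
            used[best] = 1;
            votes[cls[best]] += 1.0;       /* one unweighted vote */
        }
        for (c = 1; c < NCLASS; c++)
            if (votes[c] > votes[winner])
                winner = c;
        return winner;
    }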
2.4.10 Naïve Bayesian Classifier
The naïve Bayesian classifier has been used for some time as a baseline against which
more sophisticated machine learning methods could be compared [Lan1]. These days,
however, it is generally accepted as a robust and useful approach to supervised
classification in its own right. A number of researchers [Lan2] have reported remarkably
high accuracy with naïve Bayesian classifiers, particularly in natural domains, either
performing comparably or outperforming both rule induction and decision tree methods.
This classifier first calculates from the training data, for each class, a probabilistic
summary [Lan2] containing the probability that a member of the class will be seen, along
with an associated probability distribution for each attribute. For nominal attributes,
these are usually discrete distributions, while, in numeric domains, a continuous
probability distribution is used. In the WEKA toolset, the normal distribution is used.
To classify a previously unseen instance, the naïve Bayes algorithm calculates the
conditional probability that an instance is a member of a class, assuming that the
attributes are independent. The naïve Bayesian classifier assigns the instance to the class
with the highest overall probability.
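A sketch of this decision rule for numeric attributes, under the normal distribution
assumption used by WEKA, is given below. The per-class means, standard deviations and
prior probabilities are assumed to have already been estimated from the training data;
the calculation is done in log space to avoid numeric underflow:

    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define NCLASS 5
    #define NATTR  10   /* illustrative dimensionality */

    /* Normal (Gaussian) probability density at attribute value x. */
    static double normal_pdf(double x, double mean, double sd)
    {
        double z = (x - mean) / sd;
        return exp(-0.5 * z * z) / (sd * sqrt(2.0 * M_PI));
    }

    /* Naive Bayes decision rule: choose the class maximising
     *   P(class) * product over attributes a of p(x[a] | class),
     * assuming the attributes are independent, as described above. */
    int naive_bayes_classify(const double *x,
                             const double mean[NCLASS][NATTR],
                             const double sd[NCLASS][NATTR],
                             const double prior[NCLASS])
    {
        int c, a, best = 0;
        double best_logp = -HUGE_VAL;
        for (c = 0; c < NCLASS; c++) {
            double logp = log(prior[c]);
            for (a = 0; a < NATTR; a++)   /* independence assumption */
                logp += log(normal_pdf(x[a], mean[c][a], sd[c][a]));
            if (logp > best_logp) {
                best_logp = logp;
                best = c;
            }
        }
        return best;
    }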
The naïve Bayesian classifier is robust and inherently resistant to noise [Lan2].
However, it is typically limited to classes that can be separated by a single decision
hypersurface as a result of the assumption that each attribute is normally distributed, and
has difficulties in domains where correlations between attributes are present. In contrast,
the decision tree algorithms above (ID3 and C4.5) are not limited in this way.
Langley and Sage [Lan2] have proposed a modification to the naïve Bayes classifier that
overcomes some of the difficulties naïve Bayes has with correlated attributes. This
modification involves selection of a subset of all attributes, excluding those that reduce
accuracy on the training data. While they have shown results on a number of datasets
that show that the overall performance on each is either improved or not degraded, it
would be interesting to see results on more data.
2.4.11 Neural Networks
Artificial neural networks are a computational technique loosely modeled on the lowest
level processing unit of the human brain, the neuron. Their characteristics include the
ability to learn, generalize, adapt and tolerate failure of processing units [Jai1]. A neuron
(see Figure 5) is a biological cell that processes information. Neurons receive impulses
from other neurons, process these impulses according to learned behavior from their
history and transmit the result to other neurons. Artificial neurons emulate this behavior
in software. The analogy with the human brain is inaccurate for networks of artificial
neurons, however, as even the most complex artificial neural networks have many orders
of magnitude fewer neurons and interconnections than the brain.
Figure 5. Natural (left) and artificial neurons (images from [New1])
An artificial neuron (see Figure 5) emulates the natural neuron by accepting a number of
inputs (X1 … XN in the diagram) and multiplying each by a weight (W1 … WN in the
diagram). These input × weight terms are then summed to a single value X. A function
of this value becomes the output of the neuron. This function is usually either a simple
threshold function or a sigmoid (or logistic) function, though a Gaussian function is also
used in some cases [Jai1].
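The behaviour just described can be written directly; the following sketch uses the
sigmoid activation:

    #include <math.h>

    /* Output of one artificial neuron: each input X1..XN is
     * multiplied by its weight W1..WN, the products are summed
     * (together with a bias term), and the sum is passed through an
     * activation function; here the sigmoid (logistic) function
     * 1 / (1 + e^-x).  A simple threshold or a Gaussian could be
     * substituted. */
    double neuron_output(const double *x, const double *w,
                         double bias, int n)
    {
        double sum = bias;
        int i;
        for (i = 0; i < n; i++)
            sum += x[i] * w[i];
        return 1.0 / (1.0 + exp(-sum));
    }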
Van der Heijden [Van1, p183] states that neural networks are processing structures
“consisting of many interconnecting processing elements (neurons).” These artificial
neurons are connected together to form neural networks. An extremely simple example
of such a network is shown in Figure 6. In this example, a number of inputs are each
connected to each of a number of neurons in an intermediate layer. The neurons in the
intermediate layer are each connected to all the output neurons (one in this case). This is
an example of a fully connected feedforward neural network.
Figure 6. Artificial neural network (images from [New1])
Feedforward networks of this type can be trained by back propagation [Win2]. This is a
procedure that trains the network by making small adjustments to the weights of each
neuron in the direction that reduces the error at that neuron’s output.
In their survey paper, Widrow, Rumelhart and Lehr [Wid1] list many applications of
neural networks, including a number in medicine. Important amongst these is Papnet
[Ren1], which uses neural network technology to help cytotechnologists (histologists) to
identify cancerous cells in pap smears, reducing the number of false negative
classifications. They also list enhancement and classification of medical images as an
accepted use of neural networks.
Jain, Mao and Mohiuddin [Jai1] recognize pattern classification as an important
application of neural networks and even quote blood cell classification as one of their
examples of well-known pattern matching applications of neural networks. The
multilayer feedforward network with back-propagation is seen as the most popular
amongst researchers and neural network users for classification.
Winston [Win2] describes a number of limitations of back-propagation neural networks:
• Poor choice of learning rate can cause the network to either become unstable or get
stuck on a local minimum in the error surface (as well as being very slow to train).
Best choice of learning rate is problem specific.
• A network with too many trainable weights (e.g. too many hidden nodes) can be
subject to overtraining, giving good results on training data, but poor results on unseen
instances.
Winston warns that the best guide to choices of neural networks is trial and error,
although performance on similar problems can often be used as an indicator.
Choice of input and output variable representation can also be important in obtaining the
best performance from neural networks according to Smith [Smi1]. He suggests that
classes that do not have a significantly high number of samples in the data should be
consolidated rather than being represented separately. Smith also suggests using separate
outputs for each class, rather than trying to represent the classes in a single scalar
variable.
Russell and Norvig [Rus1] see the main disadvantages of neural networks as a lack of
“transparency”. Unlike decision trees and Bayesian classifiers, it is difficult to describe
the classification made by a neural network formally. This makes explanations of critical
decisions difficult with most current implementations of neural networks. It is also
(again due to lack of transparency) difficult to use prior knowledge or domain expertise
to “prime” a neural network for improved learning, though there has been some research
into this problem. This coincides with other observations [Jai1] that, while neural nets
may compare well to human performance on isolated instances of classification, humans
still outperform neural networks in the classification of instances in context (e.g. single
handwritten characters versus handwritten documents). Recently, much work is being
done in adding contextual information to neural network based classification [Son1].
2.5 White Blood Cell Classification
Song, Abu-Mostafa, Sill and Kasdan [Son1] use a nonlinear feedforward neural network
with 20 inputs, 15 sigmoid hidden units and 14 sigmoid output units to classify white
blood cells. They attempt to classify 14 different subtypes of white blood cells based on
13200 images from 220 different blood samples. These images had been provided,
already segmented. The resolution of the images is comparable to that of the images used
in this thesis.
Song and her colleagues are mainly concerned with the effect of context (defined here as
the presence or absence of similar cells) on the classification of blast cells, as the
presence of these has clinical significance in its own right. They do not provide a
classification for the other cells in their data set, providing, instead, a (perhaps simpler)
classification into blast and non-blast classes. To improve the classification of blast cells
and thus to reduce the number of false positives with their attendant cost of further
diagnostic testing, the fact that blast cells are more likely to be found in samples
containing other blast cells (i.e. leukemic samples) is used. Based on this fact, they
propose and test an algorithm that assigns a higher probability to potential blast cells
found in the same sample as other potential blast cells. This is very similar to the
decision process used by human haematologists when classifying blast cells. In this way,
the number of false positive results (classification of a non-blast cell as a blast cell) is
reduced from 50% of all results to 10% of all results, even with individual cell
classification accuracy of less than 90%. This technique is claimed to be capable of
extension to other domains where contextual information is an essential clue. Speech
recognition is cited as a likely example.
Aus, Harms, ter Meulen and Gunzer [Aus1] also try to classify blast cells. However,
they are actually more interested in determining the type of leukemia exemplified by a
particular sample than in the actual blood cell type. The statistics published in this
paper consist solely of the probability classification distribution for the 11
different leukemias that they are attempting to classify. Overall they classified
approximately 70% of the samples as belonging to the correct leukemia. Notably, while
cross validation was performed, only the results on the learning set (training set) were
published. This may indicate that substantial overtraining was required to achieve the
published results.
The texture related features that this group has developed (see 2.3 Feature Extraction,
above) have the potential to greatly improve the classification of blood cells attempted
in this thesis.
Chapter 3 Process Overview
3.1 Overview including data flow
Images acquired using AI Gotcha → nucleus found by thresholding → regions of interest
extracted using image processing tools → edges found using Canny edge detection →
circle of best fit found using findcirc on edge images → colour related features extracted
from within circles, and nucleus related features extracted with nucfeats → features
combined → classifiers built (WEKA system used to generate non-ANN classifiers with
multiple algorithms; SNNS used to generate a backprop ANN for supervised
classification; ZeroR used to generate a baseline for comparing classifiers) →
performance of classifiers compared

Figure 7 White blood cell classification system overview
3.2 Data acquisition
The data acquisition phase of this thesis involved smearing blood onto microscope slides,
cover slipping them, viewing the resultant slides and capturing images including white
blood cells from them. The steps in this process are detailed in Appendix 2.
3.2.1 Cell Choice
Two sets of cell images were obtained. The first of these sets consisted of 200 images for
training and testing of the various classifiers. Some of the images had more than one
white cell in them, resulting (after determination of regions of interest) in 205 cell images.
These images were obtained using the scanning method described in Appendix 2.
The second set of images consisted of from 24 to 50 images (see Table 2) of each of the
different cell types to be used for method development and is herein known as the
development set. These cells were not randomly sampled, as we wanted a good
representation of each type of cell, and basophils make up only about 1% of the total
population of white cells. This set of images was used in developing the image analysis
techniques (threshold values, parameters of Canny edge detection, parameters of the
circle segmentation algorithm, etc). It was also used to determine the parameters for the
classification algorithms. As this was not representative of a randomly chosen data set,
however, it was not used in the training or testing of the various classifiers developed.
Cell type       Number of Images in Development Set
Basophil        24
Eosinophil      43
Lymphocyte      45
Monocyte        38
Neutrophil      50

Table 2 Images in development set.
3.2.2 Resolution of Acquired White Blood Cell Images
A comparison between the images of leukocytes shown in the Atlas of Blood Cell
Differentiation [Cil1] and those acquired in pursuit of this thesis shows substantially
lower detail in the images acquired for this thesis. In particular, it will be noted that,
except for basophilic granulocytes (basophils), little evidence is seen of the cytoplasmic
granularity that is a key feature of granulocytes in general (basophils, eosinophils and
neutrophils). It was expected that one or more features based on this granularity could
be extracted and that such a feature or features could be an important part of any
classification system.
This lack of detail was attributed to a lack of resolution in the equipment (camera +
acquisition hardware) used to obtain the captured images.
Chapter 4 Image Processing
4.1 Data Preparation
4.1.1 Determining Regions of Interest
It was decided to determine smaller regions of interest around each of the white blood
cells in the larger captured images (some of which contain more than one white blood
cell). A method by which white blood cells could be distinguished from the rest of the
image needed to be developed.
All white blood cells contain nuclear material. This nuclear material can be recognized
visually by the fact that it is distinctly darker and purple in color. It was hypothesized
that concentrations of nuclear matter in close proximity to one another could be used to
identify these regions of interest.
4.1.1.1 Separating Images Into Component Bands
First, the captured image file was split into its three component bands (red, green and
blue; see Figure 8). The result was three grayscale (.pgm portable graymap) files, one for
each of the red, green and blue components of the image captured by the camera.
Histogram analysis was used to examine three grayscale components (corresponding to
the red, green and blue bands) of 20 images covering all five basic white blood cell types.
It was found that the green component was consistently a better discriminator between
the purple nuclear material and the rest of the image (either lighter purple to pink or red
of white cell cytoplasm, the pink of red cell material or the white of the background).
Figure 8 Raw captured image including 2 white blood cells (1 eosinophil and 1
neutrophil) and 3 single color components.
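To make this step concrete, here is a minimal sketch of the band separation and histogram
inspection using the Pillow library in Python. The file names are hypothetical; the actual
work used ppmtopgm and Paint Shop Pro (see Appendix 1).

from PIL import Image

# Load a captured colour image (file name hypothetical) and split it into
# its red, green and blue component bands, each a grayscale image.
img = Image.open("capture.ppm")
red, green, blue = img.split()

# Save each band as a portable graymap, one file per component.
red.save("capture-red.pgm")
green.save("capture-green.pgm")
blue.save("capture-blue.pgm")

# histogram() returns 256 counts for an 8 bit band; comparing these counts
# across the three bands is the kind of analysis that showed the green band
# to be the best discriminator of nuclear material.
print(green.histogram()[:128])  # counts for the darker half of the green band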
4.1.1.2 Thresholding
At first, we tried to set threshold grey levels for cytoplasm, nucleus and background, as
was proposed by Sonka, Hlavac and Boyle [Son2] and also by Van der Heijden [Van1,
p215] (see p5). This was found to be feasible for a single blood cell image, or even a
small number of them. However, we found that the varying grey levels of cytoplasm and
background (particularly in the presence of erythrocytes – red blood cells) made this
technique almost useless over a broad range of images. Erythrocytes in particular were
found to have very similar gray levels to the cytoplasm of some leukocytes (particularly
monocytes).
We did utilize thresholding to produce a binary bitmap image from the green band bitmap
for each image. A threshold value was selected to discriminate between nuclear and
non-nuclear pixels, as every white blood cell has a nucleus. We used only the green band
for this nuclear thresholding, as, in this band, the nuclear material is much darker than
either the cytoplasm (with the exception of basophilic cytoplasm) or the background.
Experimentation showed that a threshold value of 100 (on a scale of 0-255) gave an
acceptable discrimination between nuclear and non-nuclear pixels, leaving nuclear pixels
black on a white background.
Figure 9 Green component image and thresholded bitmap
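A sketch of this nuclear thresholding, continuing the Pillow example above with the
threshold value of 100 used here:

from PIL import Image

THRESHOLD = 100  # grey level separating nuclear from non-nuclear pixels

green = Image.open("capture-green.pgm")

# Pixels darker than the threshold are treated as nuclear material and
# left black; all other pixels are set to white.
binary = green.point(lambda v: 0 if v < THRESHOLD else 255)
binary.convert("1").save("capture-nucleus.pbm")  # binary bitmap output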
4.1.1.3 Edge Erosion
Due to edge effects on the captured images (the camera produced noticeable dark bands
at the very edges of the images, as seen in Figure 8 and Figure 9), there were many
non-nuclear pixels around the perimeter of the image.
Figure 10 Before and after edge erosion
An edge erosion filter, which simply set all of the pixels within 3 pixels of the edge of the
image to white, was developed and used. This removed the dark bands around the edge
of the image as seen in Figure 10.
4.1.1.4 Platelet Erosion
It was immediately noticed that some non-nuclear material (particularly platelets which
are also stained purple by the May-Grunwald Giemsa staining protocol) also came out as
black on these images. Platelets were found to be very small in comparison to blobs of
nuclear matter. A process of erosion followed by dilation was found to be successful in
removing platelets from consideration in determining regions of interest.
Each erosion step performed the following algorithm:

    for each pixel
        if any neighboring pixel is white, set the pixel to white

Each dilation step performed the following algorithm:

    for each pixel
        if any neighboring pixel is black, set the pixel to black

After some experimentation, it was found that two erosion steps followed by four dilation
steps gave good results, as seen in Figure 11. In the context of this thesis, this process
(erosion, erosion, dilation, dilation, dilation, dilation) is called platelet erosion.
Figure 11 Before and After Platelet Erosion
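As a runnable restatement of the two algorithms above, the following sketch represents
the bitmap as a set of black pixel coordinates. This is an illustration only, not the
pbmfilt utility listed in Appendix 1.

# Offsets of the 8 neighbours of a pixel.
NEIGHBOURS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
              if (dx, dy) != (0, 0)]

def erode(black):
    # A black pixel stays black only if none of its neighbours is white.
    return {(x, y) for (x, y) in black
            if all((x + dx, y + dy) in black for dx, dy in NEIGHBOURS)}

def dilate(black):
    # Any pixel neighbouring a black pixel becomes black.
    grown = set(black)
    for (x, y) in black:
        for dx, dy in NEIGHBOURS:
            grown.add((x + dx, y + dy))
    return grown

def platelet_erosion(black):
    # Two erosions remove small objects such as platelets; four dilations
    # then restore the surviving blobs of nuclear matter.
    for _ in range(2):
        black = erode(black)
    for _ in range(4):
        black = dilate(black)
    return black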
4.1.1.5 Finding Blobs
As a first step towards identifying a region of interest, an algorithm was developed to
identify blobs (continuous connected groups of black – presumably nuclear – pixels)
within a bitmap file and print out their details, including number of pixels in the blob and
centroid (arithmetic mean of the x and y position values) of the blob. The x value of the
centroid was found by summing the x values of all pixels in a blob and dividing the sum
by the number of points in the blob. The y value of the centroid was found in a similar
fashion, summing the y values of all pixels in the blob and dividing by the number of
pixels. The (x, y) location represented by this centroid was used as the centre of the blob
of nuclear material.
The centre (cx, cy) and number of points were recorded for each blob found.
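A sketch of such a blob-finding pass, as a flood fill over the set of black pixels left by
platelet erosion (illustrative only; the actual findblobs program is listed in Appendix 1):

def find_blobs(black):
    # Group 8-connected black pixels into blobs; report each blob's pixel
    # count (np) and its centroid (cx, cy), the mean x and y positions.
    unvisited = set(black)
    blobs = []
    while unvisited:
        stack = [unvisited.pop()]
        pixels = []
        while stack:
            x, y = stack.pop()
            pixels.append((x, y))
            for dx, dy in NEIGHBOURS:  # as in the erosion sketch above
                p = (x + dx, y + dy)
                if p in unvisited:
                    unvisited.remove(p)
                    stack.append(p)
        n = len(pixels)
        blobs.append({"np": n,
                      "cx": sum(x for x, _ in pixels) / n,
                      "cy": sum(y for _, y in pixels) / n})
    return blobs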
4.1.1.6 Consolidating Adjacent Blobs
Many white blood cells (particularly neutrophils and eosinophils) often have a number of
lobes which may appear to be disconnected after thresholding and the platelet erosion
process. It was therefore decided to consolidate (i.e. treat as a single blob) any two blobs
closer than what was felt to be the minimum separation of two nuclei in separate cells.
Any such consolidated blobs were then assumed to be the nucleus of a single white blood
cell. For the purposes of this thesis, a minimum separation between nuclei of 25 pixels
(approximately 8µ) was assumed. This distance was chosen after examining a number of
close pairs of cells in the development set. This choice is a tradeoff between the
probability of misclassifying a single nucleus as two nuclei (if too small a distance is
chosen) or two separate nuclei as a single nucleus (if too large a distance is chosen).
An algorithm (see Perl Script to Consolidate Blobs) was developed to determine which
blobs were close enough in proximity (i.e. D(blob1-blob2) < 25 pixels) to be considered as
part of the same cell and consolidate these into a single blob.
The centre (cx, cy) of each consolidated blob was determined according to:

    cx(consolidated_blob) = (cx(blob1) × np(blob1) + cx(blob2) × np(blob2)) / (np(blob1) + np(blob2))

    cy(consolidated_blob) = (cy(blob1) × np(blob1) + cy(blob2) × np(blob2)) / (np(blob1) + np(blob2))

where np is the number of (nuclear) pixels in a blob.
The centre of the resulting blob (cx, cy) and number of pixels (just the sum of two
individual nps) are recorded and further consolidated with any other blobs in close
enough proximity to the consolidated blob.
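The consolidation can be sketched as follows. The actual work was done by the Perl
script in the appendices; this Python outline assumes the blob dictionaries produced by
the blob-finding sketch above.

import math

MIN_SEPARATION = 25  # pixels, approximately 8 microns

def consolidate(blobs):
    # Merge any pair of blobs closer than the minimum nuclear separation,
    # replacing them with a single blob at the pixel-count-weighted
    # centroid, and repeat until no pair is close enough to merge.
    merged = True
    while merged:
        merged = False
        for i in range(len(blobs)):
            for j in range(i + 1, len(blobs)):
                a, b = blobs[i], blobs[j]
                if math.hypot(a["cx"] - b["cx"],
                              a["cy"] - b["cy"]) < MIN_SEPARATION:
                    np_total = a["np"] + b["np"]
                    blobs[i] = {"np": np_total,
                                "cx": (a["cx"] * a["np"] + b["cx"] * b["np"]) / np_total,
                                "cy": (a["cy"] * a["np"] + b["cy"] * b["np"]) / np_total}
                    del blobs[j]
                    merged = True
                    break
            if merged:
                break
    return blobs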
An unfortunate side effect of this blob consolidation is that two white cells in close
proximity might be mistaken as one cell. This is really only likely to be a problem with
cells with little cytoplasm (like some lymphocytes). The proportion of cells which occur
in such problem pairs is likely to be very low, however, as none were found in the 403
images (development set and test set) examined in the pursuit of this thesis.
The centroid of each of the (perhaps consolidated) remaining blobs of nuclear material
was found and nominated as the centre of a region of interest. Most images contained
only one such region of interest, but a number contained two or more. This method
worked well for both groups of images. The output from this phase, for each image, was:
• the number of regions of interest
• the x and y co-ordinates of the centre of each of these regions (cx, cy)
• the number of nuclear pixels in each region of interest
4.1.2 Extracting small images
A region of interest (ROI) was defined around the centre (cx, cy) of each blob. This
region of interest was defined to be a rectangle 81 by 65 pixels in size, with corners at
(cx-40, cy-32), (cx-40, cy+32), (cx+40, cy-32) and (cx+40, cy+32). The size of the ROI
was originally chosen to preserve the aspect ratio of the camera's image and to be large
enough to accommodate the largest leukocyte with some headroom. The rectangular
aspect ratio was later found to be unnecessary; a square ROI would have been simpler.
The extraction was performed using the PBMplus utility pnmcut.
This process produced a series of smaller images like the one in Figure 12.
Figure 12 Extracted image of white blood cell (Eosinophil)
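The same extraction can be sketched in Python with Pillow (pnmcut was the tool actually
used; the coordinates here are hypothetical):

from PIL import Image

def extract_roi(image_path, cx, cy):
    # Cut an 81 by 65 pixel region centred on (cx, cy). Pillow's crop box
    # is (left, upper, right, lower) with the right and lower edges
    # exclusive, hence the +41 and +33.
    return Image.open(image_path).crop((cx - 40, cy - 32, cx + 41, cy + 33))

roi = extract_roi("capture.ppm", 120, 90)
roi.save("cell-roi.ppm")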
4.1.3 Finding circles
Most white blood cells are roughly circular in shape, though some (monocytes, in
particular) may deviate significantly from the circular.
For the purposes of obtaining a good feature set, representative of the features of the cell
only, we chose to try to find the largest “roughly circular” area entirely within the cell.
This process was broken into two components: edge detection and “rough circle” fitting.
4.1.3.1 Edge detection
The two stage Canny edge detection algorithm was used to identify edge pixels within the
small image. This results in a black and white image containing white edge pixels on a
black background. ImgCanny and its associated edge synthesis utility ImgEdgeSynth
were used to extract the edge images, using the default parameters of each utility. These
utilities are both from the Img* (ImgStar) image processing library by Simon A. J.
Winder, at the University of Bath [Win1]. The command to generate the edge image
from a portable pixel map (.ppm) image was:

    imgPnmToFlt < file.ppm | imgCanny | imgEdgeSynth > file.edge.pbm
Figure 13 shows the results of this process on an image of an eosinophilic granulocyte.
Figure 13 Eosinophil and Result of Canny Edge Detection
4.1.3.2 “Rough Circle” determination
Each region of interest was thus centred on the centre of the nuclear matter within a cell.
Thus, it was known that the centre of each region of interest was within the cell.
However, for any other pixel, it was not known whether that pixel fell within the cell or
outside it.
It had been seen that the Canny edge detection process produced white pixels on a black
background (Figure 13) which showed edges within the image.
We chose to use this information to try to determine the largest “rough circle” of edge
pixels which contained the centre of the region of interest and consider only features that
could be extracted from the image inside this “rough circle”.
Before developing an algorithm to find the largest such “rough circle”, 20 cells of each
type (from the development set of cells, not used in classification) were examined
visually, overlaying circles of various sizes. From these, it was determined that the largest circle
that could be drawn within a given cell varied from a minimum radius of 13 pixels to a
maximum radius of 26 pixels. It was therefore decided to limit the circles being
examined to those whose radius fell between 12 and 28 pixels, inclusive, as this allowed
for cells which lay just outside the observed limits.
A “rough circle” was then defined as a set of points that lie in a circular band about a
centre. Two concentric circles of different radii define this band. The development set
was used to examine a number of different widths of this band. This parameter was
found to give the best results at a width of 3 pixels (i.e. nominal radius ± a “roughness
value” (rvar) of 1 pixel).
The rough circles we were searching for then needed to meet the following constraints
(as shown in Figure 14):
• Include the centre of the region of interest (centre of assumed nuclear material).
• Have a nominal radius of between 12 (rmin) and 28 (rmax) pixels
• Have points located in a band within ± 1 pixel of the nominal radius.
Figure 14 Rough circle illustration
The algorithm developed to find the largest and best “rough circle” was:

    smax = 0
    for each pixel p within rmax (28 pixels) of the centre of the ROI
        for each possible value of the nominal radius r (12 to 28 pixels)
            determine the strength s of the rough circle of radius r centred on p
            if s >= smax
                smax = s
                bestp = p
                bestr = r
        next r
    next p
In this way, we determine the centre bestp and radius bestr of the rough circle with the
highest strength (see below).
The number of points on the “rough” circle was found by examining 128 vectors (2π/128
radians apart) and determining if a point was set between r-rvar and r+rvar along this
vector. If one or more points were found to be set along a vector, the number of points
on the “rough circle” (npoints) was incremented by (only) one.
A number of alternatives for a strength discriminator function were examined:
1. Number of points on circle. It was found that simply using the number of points on
the rough circle as a measure of strength biased the results too heavily towards
finding smaller circles containing the centre of the region of interest.
2. Diameter of circle, if more than a threshold number of points is found (e.g. 64 points
out of 128 vectors). As this favours larger circles, it was found to favour extraneous
pixels not part of the actual cell.
3. A combination of 1. and 2. The best strength function found (on examining images
of the development set with the determined circle superimposed upon them) was
npoints + 2 * r.
Unfortunately, even this function only produced a circle that reasonably (in the opinion of
the person intervening) fitted the cell in approximately 80% of instances. The other 20%
of instances required human intervention to define a reasonable circle. A number of
circle and ellipse finding techniques (such as Hough transforms) are known to exist, but
these were considered beyond the scope of this thesis.
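A sketch of the whole search, using the parameters above (rmin = 12, rmax = 28,
rvar = 1, 128 vectors) and the chosen strength function npoints + 2r; edges is assumed to
be the set of white edge pixel coordinates produced by the Canny step:

import math

RMIN, RMAX, RVAR, NVECTORS = 12, 28, 1, 128

def strength(edges, px, py, r):
    # npoints: the number of the 128 directions in which at least one edge
    # pixel lies in the band r - RVAR .. r + RVAR from (px, py).
    npoints = 0
    for k in range(NVECTORS):
        theta = 2 * math.pi * k / NVECTORS
        if any((px + round(d * math.cos(theta)),
                py + round(d * math.sin(theta))) in edges
               for d in range(r - RVAR, r + RVAR + 1)):
            npoints += 1
    return npoints + 2 * r

def find_rough_circle(edges, cx, cy):
    # (cx, cy) is the centre of the region of interest.
    smax, bestp, bestr = 0, (cx, cy), RMIN
    for px in range(cx - RMAX, cx + RMAX + 1):
        for py in range(cy - RMAX, cy + RMAX + 1):
            offset = math.hypot(px - cx, py - cy)
            for r in range(RMIN, RMAX + 1):
                if offset >= r:
                    continue  # the circle must contain the ROI centre
                s = strength(edges, px, py, r)
                if s >= smax:
                    smax, bestp, bestr = s, (px, py), r
    return bestp, bestr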
4.1.4 Manual intervention
Each image was displayed with the circle fitted according to 4.1.3.2 superimposed upon
it. At this stage, a value judgement was made as to which circles were of sufficiently
good fit to allow further processing and which required adjustment of the fitted circles.
Figure 15 Fitted Circles Adjudged "Good"
Figure 15 shows that the cells adjudged as being of good fit were not necessarily
perfectly circular. All the circles, however, closely followed the circular parts of the cell
boundary and all enclosed a good representation of the cell’s features (colors, nuclear
material and cytoplasm).
Figure 16 Fitted Circles Adjudged "Poor"
As can be seen from Figure 16, the poorly fitted circles each fitted to circular artifacts
within the cell (usually the boundaries of the nucleus) or partial circles within and outside
the cell (often contributed to by nearby erythrocytes – red blood cells).
Images of the same cells before and after this human intervention are shown in Figure 17.
[Figure panels: Poor Circle; Manual Circle]
Figure 17 Examples of Manually Adjusted Circles
4.1.5 Extracting Feature Set
4.1.5.1 Features in Feature Set
The manually or automatically segmented circular cell region defined according to 4.1.3
and 4.1.4 was processed on a pixel by pixel basis and statistical information (mean,
standard deviation, maximum value and minimum value) was collected for each of 5
color bands. The color bands were the red, green and blue captured by the hardware
(camera + composite video capture card) together with the color ratios green/red and
green/blue. The number of pixels within the circle was also determined. As seen in
Table 3, these features form the basis of the feature set. While the 6 to 7 significant digits
shown cannot be statistically justified, these are the numbers that were applied to the
learning algorithms.
Table 3 Features in Feature set as used by WEKA classifiers.

No.  Feature Description                        Max. Value       Min. Value  Mean      Standard Deviation
1    Instance name (just an identifier)         not used
2    Red pixel value mean                       165.6            97.108      143.6455  14.03901
3    Red pixel value σ                          37.726           9.633       25.75482  6.595718
4    Red pixel value max                        237              139         199.0683  17.67643
5    Red pixel value min                        118              62          87.80976  9.228999
6    Green pixel value mean                     125.726          54.126      100.8887  16.93763
7    Green pixel value σ                        47.968           7.452       34.26238  7.936144
8    Green pixel value max                      210              85          171.7073  21.30249
9    Green pixel value min                      56               26          40.75122  4.784735
10   Blue pixel value mean                      171.592          124.757     153.9809  9.921814
11   Blue pixel value σ                         33.337           10.288      22.37201  4.937516
12   Blue pixel value max                       247              154         210.9366  15.35626
13   Blue pixel value min                       124              75          96.3561   10.0389
14   Green/blue pixel value mean                0.756            0.424       0.63721   0.076426
15   Green/blue pixel value σ                   0.205            0.058       0.152951  0.025912
16   Green/blue pixel value max                 1.248            0.688       0.960912  0.073456
17   Green/blue pixel value min                 0.421            0.254       0.334644  0.033412
18   Green/red pixel value mean                 0.786            0.499       0.67958   0.062289
19   Green/red pixel value σ                    0.192            0.057       0.132537  0.020325
20   Green/red pixel value max                  1.314            0.719       0.965883  0.061547
21   Green/red pixel value min                  0.478            0.322       0.39821   0.032345
22   Number of pixels in cell                   1941             437         1128.824  307.4608
23   Number of nuclear pixels                   1194             396         553.9707  144.3944
24   Number of nuclear edge pixels              324              99          164.8341  36.08693
25   Number of nuclear body pixels              966              230         389.1366  139.7825
26   Nuclear edge pixels/nuclear body pixels    0.45             0.17        0.310195  0.081445
27   Cell Class (Class variable)                b, e, l, m or n
The red, blue and green band pixel values were just the 8 bit (0-255) values captured by
the AIGotcha frame grabber. The green/blue color ratio for a given pixel was determined
by dividing the green band value by the blue band value. The green/red value was
similarly obtained by dividing the green band value by the red band value.
A thresholded image containing only the green band information was also generated,
using the same threshold value used in 4.1.1.2 Thresholding. This image was used to
determine the number of nuclear (black) pixels within the region. These were further
divided into nuclear edge pixels (determined by having one or more adjacent white
pixels) and those not on the nuclear edge (or within the body, determined by having no
adjacent white pixels). The ratio of nuclear edge to nuclear body pixels was also
calculated. It was believed that this would be a good discriminator between cells with
segmented nuclei (in particular neutrophils and eosinophils) and those with
non-segmented nuclei (lymphocytes and monocytes).
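A sketch of how these statistics could be gathered (illustrative only; the feature names
are abbreviated, and NEIGHBOURS is the neighbour list from the erosion sketch in
4.1.1.4):

import math
from statistics import mean, pstdev
from PIL import Image

def colour_features(image_path, cx, cy, r):
    # Gather mean, standard deviation, max and min for the five colour
    # bands over every pixel inside the fitted circle.
    img = Image.open(image_path)
    px, (w, h) = img.load(), img.size
    pixels = [px[x, y]
              for x in range(max(0, cx - r), min(w, cx + r + 1))
              for y in range(max(0, cy - r), min(h, cy + r + 1))
              if math.hypot(x - cx, y - cy) <= r]
    bands = {"red":   [p[0] for p in pixels],
             "green": [p[1] for p in pixels],
             "blue":  [p[2] for p in pixels],
             "green_blue": [g / b for (_, g, b) in pixels if b > 0],
             "green_red":  [g / rr for (rr, g, _) in pixels if rr > 0]}
    feats = {"cell_pixels": len(pixels)}
    for name, values in bands.items():
        feats.update({name + "_mean": mean(values), name + "_sd": pstdev(values),
                      name + "_max": max(values), name + "_min": min(values)})
    return feats

def nuclear_features(black, cx, cy, r):
    # black: set of nuclear pixel coordinates from the thresholded image.
    nuclear = {(x, y) for (x, y) in black if math.hypot(x - cx, y - cy) <= r}
    # An edge pixel has at least one white (non-nuclear) neighbour.
    edge = {(x, y) for (x, y) in nuclear
            if any((x + dx, y + dy) not in black for dx, dy in NEIGHBOURS)}
    body = nuclear - edge
    return {"nuclear_pixels": len(nuclear), "nuclear_edge_pixels": len(edge),
            "nuclear_body_pixels": len(body),
            "edge_body_ratio": len(edge) / len(body) if body else 0.0}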
Chapter 5 Supervised Learning
5.1 Using WEKA classifiers
Version Weka-3-pre-6 of WEKA, the Waikato Environment for Knowledge Analysis
[Wek1] [Wit1] was used to investigate a number of classifiers for this thesis. This is a
java-based environment allowing a number of classifiers to be tried on a common data
file. Each of the classifiers required some preliminary experimentation with parameters
to obtain the best results.
It is clearly important, when comparing the error rates of various classifiers, to ensure
that they are obtained under similar circumstances. The WEKA classifiers use 10-fold
cross validation by default. As this method has been used successfully by other
researchers in the past (see 2.4.2, page 9), 10-fold cross validation is used for every
classifier within this thesis.
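For readers unfamiliar with the technique, a sketch of how a 10-fold split operates
(WEKA does this internally; the round-robin assignment here is purely illustrative):

def ten_fold_splits(n_instances, n_folds=10):
    # Deal instance indices round-robin into 10 folds. Each fold serves
    # once as the test set while the other nine form the training set, so
    # every instance is tested exactly once.
    folds = [list(range(f, n_instances, n_folds)) for f in range(n_folds)]
    for f in range(n_folds):
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield train, folds[f]

for train, test in ten_fold_splits(205):  # 205 instances, as used here
    pass  # train the classifier on `train`, evaluate it on `test`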
5.1.1 Data Preparation for WEKA
The 25-feature dataset enumerated in 4.1.5.1 (page 31) was used to develop the WEKA
classifiers. The feature set was not normalized. A single letter categorical value was
used for the class variable, which could take the following values:
Table 4 WEKA cell categories

Classifier letter    Class (Leukocyte type)
b                    Basophil
e                    Eosinophil
l                    Lymphocyte
m                    Monocyte
n                    Neutrophil
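For illustration, the head of an ARFF data file consistent with this arrangement might
look as follows. The relation and attribute names are assumed here, not taken from the
actual all2.arff file, and most of the 25 numeric attributes are elided:

@relation white-blood-cells

@attribute instance_name string
@attribute red_mean numeric
@attribute red_sd numeric
...
@attribute nuclear_edge_body_ratio numeric
@attribute class {b, e, l, m, n}

@data
cell001, 143.6, 25.8, ..., 0.31, n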
5.1.2 ZeroR
No parameters are available for the ZeroR classifier.
5.1.3 OneR
OneR (see page 11) is a simple classifier, which produces rules based on a single
attribute. There is one parameter available which defines the minimum number of
instances that must be covered by each rule that is generated (known as the minimum
bucket size in the WEKA toolbox). The default value of 6 was used for this parameter.
5.1.4 IBk Instance based classifier
IBk is an implementation of a k-nearest-neighbor classifier (as described in 2.4.9 on page
13). The WEKA implementation of this algorithm has two settable parameters. The first
parameter is the number of neighbors whose classification affects the classification of the
test instance (from 1 to one less than the size of the data set). If more than one neighbor
is chosen, a second parameter determines the formula used for the weighting applied to
the predictions of the neighbors. The two formulae available are:
w = 1 – distance
or
w = 1/distance
Both formulae were tried in the preliminary experimentation phase, but no difference was
found in the error rate or confusion matrix. We used the default formula provided by
WEKA of w = 1/distance for all IBk results that form part of this thesis.
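A sketch of the distance-weighted vote (Euclidean distance assumed; this is not the
WEKA implementation itself):

import math

def ibk_predict(training, query, k=3):
    # training: list of (feature_vector, class_label) pairs.
    # The k nearest neighbours vote for the class of the query instance,
    # each vote weighted by w = 1/distance.
    nearest = sorted((math.dist(x, query), c) for x, c in training)[:k]
    votes = {}
    for d, c in nearest:
        votes[c] = votes.get(c, 0.0) + (1.0 / d if d > 0 else float("inf"))
    return max(votes, key=votes.get)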
We tried a number of different values of k to determine how the accuracy of the IBk
classifier varies with the number of neighbors contributing to the classification (see
6.1.1.3 on page 40).
5.1.5 Naïve Bayes
The naïve Bayesian classifier (see page 14) is a simple probabilistic classifier. In the
WEKA tool set, naïve Bayes has only one parameter, which provides the user with a
choice of models for numerical attributes. As we are dealing with biological samples and
the distribution of such samples has typically been found to be normal, the normal
distribution model was chosen.
For interest, we did try choosing the alternative (kernel density estimators) but found that
the confusion matrix was identical.
5.1.6 ID3 Based Decision Table
The WEKA implementation of the ID3 decision tree algorithm (see 2.4.7 on page 12)
produces a decision table rather than a decision tree and allows for three parameters.
The first parameter is the number of non-improving attribute subsets that are investigated
before the search terminates. This was left at the WEKA default value of 5, after initial
experimentation determined that increasing it had little effect and decreasing it tended to
increase the error rate.
The second parameter determines the number of folds performed by a wrapper function
to find the best table. This was left at the WEKA default value of leave-one-out after it
was found to make very little difference to the results obtained. The “leave one out”
setting did, though, cause an apparent degradation in the algorithm's speed.
The third parameter for WEKA’s ID3 based decision table algorithm determined its
behavior on test instances that do not match any table entry – whether they are assigned
to the majority class from the training data, or to the nearest table entry. This was found
to make no difference on the data set presented here, so was left at the WEKA default
value of assigning to the majority class.
5.1.7 J4.8 (C4.5 Based) Decision Tree Induction
J4.8 is the WEKA implementation of C4.5 (see page 13) version 8. It has a number of
user alterable parameters and options.
The options available under J4.8 are: use an unpruned decision tree, use binary splits only
and use reduced error pruning. In the preliminary work, none of these options was found
to improve the performance of the classifier. Thus they were not used.
One of the parameters available under the J4.8 is the minimum number of instances in a
leaf node of the decision tree. For this work, this value was set to 2. Initial investigations
showed that performance on both training and test data sets was about the same for
values of 4 or less, and that error rates on both increased for values over 4.
Another parameter to J4.8 is the confidence threshold for pruning. During initial
investigations, it was found that this did not have much effect if set between 10% and
50% and that performance started to degrade outside these values.
5.2 Using SNNS
5.2.1 Data set
The same feature set (see 4.1.5, page 31) as used with the WEKA classifiers was used,
after some transformations (see 5.2.3 and 5.2.4, page 36) with SNNS.
5.2.2 Network Topology
A fully connected feedforward neural network was constructed, consisting of:
• 25 input units (one for each attribute)
• 10 hidden units in a single layer
• 5 output units (one for each possible class)
The number of hidden units and their arrangement into a single layer was chosen after
evaluating the performance of different topologies during several trials on the
development set (see 3.2.1).
All neurons had a logistic activation function. Other activation functions were available
but were not evaluated.
This network was trained with a standard back-propagation learning function and updated
in topological order. The network was trained for 150 cycles. The number of training
cycles was chosen after evaluating the performance of several different numbers of
training cycles with the development set (see 3.2.1).
5.2.3 Scaling the inputs
After an initial investigation found that SNNS did not work well with our unscaled input
data set, it was decided to normalize the inputs between 0.1 and 0.9.
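The normalization used was linear, per attribute. A sketch, with the minimum and
maximum taken over the data, illustrated using the red pixel value mean from Table 3:

def scale(value, lo, hi):
    # Map the attribute's observed range [lo, hi] linearly onto [0.1, 0.9].
    return 0.1 + 0.8 * (value - lo) / (hi - lo)

# e.g. a red pixel value mean of 143.6455, whose range was 97.108 to 165.6:
print(scale(143.6455, 97.108, 165.6))  # about 0.64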
5.2.4 Class Outputs
Whereas the WEKA classifiers used one class variable that could take 5 values, for
SNNS we chose to follow Smith’s [Smi1] (see 2.4.11, page 15) suggestion and use
separate outputs for each class, rather than trying to represent the classes in a single scalar
variable. Five class variables are therefore used to represent the output of the neural
network. The outputs of the training data were set to either 0 (not a member of this class)
or 1 (a member of this class) for each of the 5 possible classes. Each training instance
was assigned to only one class.
5.2.5 Interpretation of Output
The output of the SNNS neural network was interpreted using a “winner take all”
strategy. Using this strategy, the output with the highest value is deemed to be the class
of the instance. This strategy ensures that each instance is assigned to only one class.
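Pulling 5.2.2 to 5.2.5 together, the following is a minimal back-propagation sketch of the
same arrangement in Python with numpy. This is not SNNS itself; the learning rate,
weight initialization and class ordering are assumptions.

import numpy as np

N_IN, N_HIDDEN, N_OUT = 25, 10, 5   # topology from 5.2.2
EPOCHS, RATE = 150, 0.2             # 150 training cycles; rate assumed
CLASSES = "mnleb"                   # one output unit per class (order assumed)

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.1, (N_IN, N_HIDDEN))   # input -> hidden weights
W2 = rng.normal(0.0, 0.1, (N_HIDDEN, N_OUT))  # hidden -> output weights

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y):
    # X: (n, 25) inputs scaled to [0.1, 0.9]; Y: (n, 5) rows with a 1 in
    # the position of the true class and 0 elsewhere (see 5.2.4).
    global W1, W2
    for _ in range(EPOCHS):
        for x, y in zip(X, Y):
            h = logistic(x @ W1)                 # hidden activations
            o = logistic(h @ W2)                 # output activations
            delta_o = (o - y) * o * (1 - o)      # logistic derivative a(1-a)
            delta_h = (delta_o @ W2.T) * h * (1 - h)
            W2 -= RATE * np.outer(h, delta_o)
            W1 -= RATE * np.outer(x, delta_h)

def classify(x):
    # Winner take all (5.2.5): the highest output unit names the class.
    o = logistic(logistic(x @ W1) @ W2)
    return CLASSES[int(np.argmax(o))]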
Chapter 6 Results
6.1.1 WEKA Results
Results from the classifiers run using WEKA are all displayed as a summary of
performance on both training and test data. These summaries are the standard output of
the java based WEKA release: Weka_3_pre_6 [Wit1]. They are presented in order of
accuracy, starting with the least accurate, the baseline ZeroR algorithm, and finishing
with the most accurate.
While the WEKA toolset produced results to four decimal places, six significant figures
of accuracy cannot be justified. The results have therefore been rounded to 3 significant
figures.
All results were obtained using 10-fold cross validation, on a feature set containing an
instance name, 25 features and a class variable as enumerated in Table 3.
WEKA produces a standardised table of results, containing a number of results common
to all classification algorithms and a number of results particular to each classifier. The
results used in this work include:
1. The number and percentage of correctly classified instances when the classifier is
applied to the training data.
2. The number and percentage of incorrectly classified instances when the classifier is
applied to the training data.
3. The number and percentage of correctly classified instances when the classifier is
applied to the test data using a 10 fold cross validation (see page 52).
4. The number and percentage of incorrectly classified instances when the classifier is
applied to the test data.
5. Confusion matrices as described in section 2.4.3, generally for the performance of the
classifier on both the training and test data. In these confusion matrices, the rows
indicate the actual classes of the data, while the columns indicate into which classes
the algorithm under examination has classified the data.
In the confusion matrices, the letters m, n, l, e and b are used to represent the five classes
of monocytes, neutrophils, lymphocytes, eosinophils and basophils respectively.
6.1.1.1 ZeroR Results

ZeroR predicts class value: n

=== Error on training data ===

Correctly Classified Instances    138    67.3 %
Incorrectly Classified Instances   67    32.7 %
Total Number of Instances         205

=== Confusion Matrix ===

    m    n    l    e    b   <-- classified as
    0   13    0    0    0 | m
    0  138    0    0    0 | n
    0   44    0    0    0 | l
    0    8    0    0    0 | e
    0    2    0    0    0 | b

=== cross-validation ===

Correctly Classified Instances    138    67.3 %
Incorrectly Classified Instances   67    32.7 %
Total Number of Instances         205
The ZeroR algorithm simply predicted that each instance was of class n (neutrophil), the
most numerous class. As expected, this classifier gave a very high error rate, incorrectly
classifying 32.7% of all instances in both the training evaluation and the cross validation.
In summary, all neutrophils were classified correctly, while all other cells were classified
incorrectly. As this was the baseline classifier for this thesis, it is worth noting that all of
the other classifiers outperformed ZeroR by a good margin; in fact ZeroR had
approximately four times the error rate of the next worst classifier.
The results for ZeroR show only a single confusion matrix, as it is identical for training
and test data. The numeric results for error on training and error on cross validation are
a little repetitious, but provide a good indication that the algorithm is performing as
expected.
6.1.1.2 OneR Results

attribute_25:
    < 396.5   -> n
    < 580.0   -> l
    >= 580.0  -> m

=== Error on training data ===

Correctly Classified Instances    188    91.7 %
Incorrectly Classified Instances   17     8.29 %
Total Number of Instances         205

=== Confusion Matrix ===

    m    n    l    e    b   <-- classified as
   13    0    0    0    0 | m
    0  136    2    0    0 | n
    4    1   39    0    0 | l
    0    3    5    0    0 | e
    1    0    1    0    0 | b

=== cross-validation ===

Correctly Classified Instances    185    90.2 %
Incorrectly Classified Instances   20     9.76 %
Total Number of Instances         205

=== Confusion Matrix ===

    m    n    l    e    b   <-- classified as
   11    0    1    0    1 | m
    0  135    3    0    0 | n
    4    1   39    0    0 | l
    0    3    5    0    0 | e
    1    0    1    0    0 | b
OneR selected attribute 25, the number of nuclear pixels, as the single attribute that
would give the best results (lowest number of incorrectly classified instances). OneR
classified cells with a low number of nuclear pixels as neutrophils, those with a high
number as monocytes, and those in between as lymphocytes.
Under these rules, basophils and eosinophils will always be incorrectly classified, as they
are not represented in the rule set at all. Classification of neutrophils was quite accurate
(2.2% error) but accuracy of classification of monocytes and lymphocytes was poor (15%
and 11.4% error respectively). As expected, there was little evidence of overtraining.
It is unclear how one instance of a monocyte in the test dataset might have been classified
as class b (basophil). The rule presented by WEKA with this set of results does not show
an outcome of b. However, only one rule has been displayed, from 10 cross validation
trials, each of which is likely to produce a slightly different rule. It is possible that one of
the 10 cross validations produced a rule set on the same attribute that had a class b
outcome.
6.1.1.3 IBk Instance based classifier
The IBk instance based classifier uses a k-nearest neighbor algorithm. The value of k
(the number of neighbors taken into account when predicting the class) may be specified
as a parameter to the algorithm. It was decided to investigate the performance of this
classifier against the value of k. As usual, each point on the graph represents a 10 fold
cross validation on the dataset.
As can be seen from Figure 18, for this dataset, the performance on the training data is
perfect when only one neighbor is taken into account; however, the performance on the
test data is poor. This is a classic indicator of overtraining.
[Figure: IB1 Error Results – training error and test error (0% to 10%) plotted against
the number of neighbors (1 to 20).]
Figure 18 Performance of IBk vs. Value of k.
When more than one neighbor contributes to the prediction, the error on test data and
training data are much closer to one another. The best performance on test data was
obtained using a value of 3 for k. It is reasonable to assume that the performance of this
classifier will continue to trend towards generally higher error rates as the number of
neighbors taken into account increases. This contrasts with the literature [Alb1], which
suggests that √(number of samples in the training set), ~14-15 in this case, is a good
choice for k.
The confusion matrix for the test data (marked “cross-validation”) shows IBk with three
nearest neighbors to be an excellent discriminator of neutrophils, with 100% accuracy on
the 138 instances. Lymphocytes are classified reasonably well, with an error rate of
4.8%. Eosinophils and basophils are classified poorly, with error rates of 37.5% and
100% respectively.
java weka.classifiers.IBk -K 3 -t c:/rmit/thesis/all2.arff

Options: -K 3

IB1 instance-based classifier
using 3 nearest neighbour(s) for classification

=== Error on training data ===

Correctly Classified Instances    199    97.1 %
Incorrectly Classified Instances    6     2.92 %
Total Number of Instances         205

=== Confusion Matrix ===

    m    n    l    e    b   <-- classified as
   12    0    1    0    0 | m
    0  138    0    0    0 | n
    2    0   42    0    0 | l
    0    2    0    6    0 | e
    0    0    1    0    1 | b

=== cross-validation ===

Correctly Classified Instances    197    96.1 %
Incorrectly Classified Instances    8     3.90 %
Total Number of Instances         205

=== Confusion Matrix ===

    m    n    l    e    b   <-- classified as
   12    0    1    0    0 | m
    0  138    0    0    0 | n
    2    0   42    0    0 | l
    0    3    0    5    0 | e
    0    0    2    0    0 | b
It is interesting to note that the accuracy for this data set seems to increase with the
number of instances of each class in the data set (see Figure 19). This increase in
accuracy would tend to indicate that the performance of the IBk classifier should
significantly improve on the less well-represented classes if more instances of all classes
were gathered.
[Figure: % accuracy (0 to 100) plotted against instances of category (0 to 150).]
Figure 19 IB1 Accuracy vs. Number of Instances in Class
6.1.1.4 ID3 Based Decision Table

Decision Table:
Number of training instances: 205
Number of Rules: 13
Feature set: 14,20,25,27

=== Error on training data ===

Correctly Classified Instances    200    97.6 %
Incorrectly Classified Instances    5     2.43 %
Total Number of Instances         205

=== Confusion Matrix ===

    m    n    l    e    b   <-- classified as
   13    0    0    0    0 | m
    0  138    0    0    0 | n
    1    0   43    0    0 | l
    0    2    1    5    0 | e
    0    0    1    0    1 | b

=== cross-validation ===

Correctly Classified Instances    191    93.2 %
Incorrectly Classified Instances   14     6.82 %
Total Number of Instances         205

=== Confusion Matrix ===

    m    n    l    e    b   <-- classified as
   10    0    3    0    0 | m
    0  137    0    1    0 | n
    3    1   40    0    0 | l
    0    2    2    4    0 | e
    0    1    1    0    0 | b
Like the IB1 classifier, the ID3 based decision table induction algorithm seems to
perform with an accuracy that increases with greater representation of the class in the
data set. ID3 used the following attributes to build its decision table: green/blue pixel
value mean, green/red pixel value max and number of nuclear body pixels.
[Figure: % accuracy (0 to 100) plotted against instances of category (0 to 150).]
Figure 20 ID3 Accuracy vs. Number of Instances in Class
ID3, like IBk, increased quickly in classification accuracy as the number of samples rose.
6.1.1.5 J4.8 (C4.5 based) Decision Tree Induction

java weka.classifiers.j48.J48 -t c:/rmit/thesis/all2.arff

=== Error on training data ===

Correctly Classified Instances    201    98.0 %
Incorrectly Classified Instances    4     1.95 %
Total Number of Instances         205

=== Confusion Matrix ===

    m    n    l    e    b   <-- classified as
   13    0    0    0    0 | m
    0  138    0    0    0 | n
    1    0   43    0    0 | l
    0    1    0    7    0 | e
    0    0    2    0    0 | b

=== cross-validation ===

Correctly Classified Instances    192    93.7 %
Incorrectly Classified Instances   13     6.34 %
Total Number of Instances         205

=== Confusion Matrix ===

    m    n    l    e    b   <-- classified as
    9    1    2    1    0 | m
    1  137    0    0    0 | n
    1    1   41    1    0 | l
    0    1    2    5    0 | e
    0    0    2    0    0 | b
J4.8 classifies neutrophils and lymphocytes reasonably well at error rates of 0.7% and
6.8% respectively. Performance on the rest of the data was poor with error rates of 30%
to 100%. Interestingly, J4.8’s performance on this data set was very similar to that of the
ID3 classifier on which it is based.
[Figure: % accuracy (0 to 100) plotted against instances of category (0 to 150).]
Figure 21 J4.8 Classification vs. Number of Instances in Class
Like the other classifiers before it, J4.8 improved in accuracy (see Figure 21) with the
number of instances of a class, though, from this data, not as quickly as ID3 or IBk.
Again, it would be interesting to monitor its performance as the number of instances of
the less well-represented white blood cell types (monocytes, eosinophils and, particularly,
basophils) was increased.
J4.8 produced the following decision tree:

Nuclear edge pixels/nuclear body pixels <= 0.26
|   Number of nuclear edge pixels <= 152
|   |   Green/blue pixel value mean <= 0.63: l
|   |   Green/blue pixel value mean > 0.63: n
|   Number of nuclear edge pixels > 152
|   |   Green/blue pixel value mean <= 0.578: l
|   |   Green/blue pixel value mean > 0.578: m
Nuclear edge pixels/nuclear body pixels > 0.26
|   Blue pixel value σ <= 18.23
|   |   Blue pixel value min <= 107: e
|   |   Blue pixel value min > 107: m
|   Blue pixel value σ > 18.23: n
This shows that J4.8 used the ratio of nuclear edge pixels to body pixels and the number
of nuclear edge pixels as important attributes to split on, validating their inclusion in the
data set. The rest of J4.8's information came from the color signals (particularly the blue
signal). Interestingly, the main beneficiaries of the split on the ratio of nuclear edge
pixels to body pixels were neutrophils and eosinophils, the cells that contain (humanly)
easily distinguishable multi-lobed nuclei. We would definitely recommend including this
feature in any future feature set used for white blood cell classification.
6.1.1.6 Naïve Bayesian Classifier

java weka.classifiers.NaiveBayes -t c:/rmit/thesis/all2.arff

Naive Bayes Classifier

=== Error on training data ===

Correctly Classified Instances    202    98.5 %
Incorrectly Classified Instances    3     1.46 %
Total Number of Instances         205

=== Confusion Matrix ===

    m    n    l    e    b   <-- classified as
   13    0    0    0    0 | m
    0  137    0    1    0 | n
    2    0   42    0    0 | l
    0    0    0    8    0 | e
    0    0    0    0    2 | b

=== cross-validation ===

Correctly Classified Instances    197    96.1 %
Incorrectly Classified Instances    8     3.90 %
Total Number of Instances         205

=== Confusion Matrix ===

    m    n    l    e    b   <-- classified as
   13    0    0    0    0 | m
    0  137    0    1    0 | n
    2    0   42    0    0 | l
    0    2    1    5    0 | e
    1    0    1    0    0 | b
The naïve Bayesian classifier was the best of the WEKA classification schemes, with a
total error rate of just under 4%. As shown by the confusion matrix for the cross
validation, this was the only algorithm tested that correctly classified all instances of
monocytes. This performance on monocytes was also the only case of a WEKA
classifier's performance on a class breaking the apparent rule that more instances of a
class produce a lower error rate.
Figure 22, when compared to the accuracy vs. number of instances in class graphs of the
other WEKA classifiers, shows that the performance of the naïve Bayesian classifier
improves much more quickly with the number of instances in each class, based on this
data set. This does not guarantee, however, that this faster improvement in performance
will hold for other domains or other data sets.
[Figure: % accuracy (0 to 100) plotted against instances of category (0 to 150).]
Figure 22 Naïve Bayesian Accuracy vs. Number of Instances in Class.
6.1.2 SNNS Results
Results were obtained from each run of a 10-fold cross validation on 205 instances of
white blood cell features. These results were analyzed using a winner take all (WTA)
strategy, which means that the highest valued output node is deemed to be the result,
even if another is only fractionally lower. This strategy guarantees that one and only one
output is deemed true for any given set of inputs, thereby eliminating the possibility of
unknown errors due to no outputs being active, or more than one output being active.
SNNS Classifier Results

=== cross-validation ===

Correctly Classified Instances    201    98.05 %
Incorrectly Classified Instances    4     1.95 %
Total Number of Instances         205

=== Confusion Matrix ===

    m    n    l    e    b   <-- classified as
   11    0    2    0    0 | m
    0  138    0    0    0 | n
    0    0   44    0    0 | l
    0    0    0    8    0 | e
    1    0    0    1    0 | b
A confusion matrix similar to those provided by the WEKA tool set has been manually
generated for ease of comparison; results were not obtained for the training data.
[Figure: % accuracy (0 to 100) plotted against instances of category (0 to 150).]
Figure 23 Accuracy of Neural Network vs. Number of Instances in Class
As can be seen from these results, out of the 205 patterns, only 4 were incorrectly
classified. SNNS’ analyze utility program was used to determine which four were
misclassified. It was found that both basophils in the data set were misclassified, one as a
monocyte and one as an eosinophil. Two of the monocytes were misclassified, both as
lymphocytes. Neutrophils, eosinophils and lymphocytes were all classified correctly.
The neural network was the only classifier with a perfect score on eosinophils.
Figure 24 Misclassified Basophils
Figure 24 shows the two basophils misclassified by SNNS. It is not surprising that these
have been misclassified as only these two instances existed in the entire data set. This is
almost certainly a case where greater representation would have improved the
classification.
Figure 25 Misclassified Monocytes
The reason two monocytes were misclassified (shown in Figure 25) is not quite as
apparent immediately. However when we take into account that the features are
extracted from a circle fitted around the centre of the cell’s nuclear material, it is obvious
that such a circle would not include much of the cell’s cytoplasm (lighter purple area). In
this case, it is quite conceivable that these two monocytes could be misclassified as
leukocytes (examples in Figure 26)
Figure 26 Lymphocyte Examples
6.1.3 Overall Results
[Figure: bar chart of % error (0 to 35) for each classifier: ZeroR, OneR, ID3, J4.8, IBk,
Naïve Bayes and Neural Net.]
Figure 27 Overall Performance of Classifiers
As can be seen from Figure 27, none of the rule-based classifiers, even leaving aside the
trivial case of the ZeroR classifier, had an error rate better than 6%. The classifiers that
take all the features into account (IBk, naïve Bayes and Neural Net) had error rates of
less than 4%. The rule induction classifiers ID3 and J4.8 used 4 and 5 of the available
attributes respectively to induce their decision structures, and both had very similar error
rates. Others [Mic1] have also found the performance of C4.5 and ID3 to be similar.
The neural network was the classifier that produced the best results on this data, with an
error rate of (just) less than 2%.
It should be noted that the error rates generally accepted by haematologists, from flow
cytometer based haematological white blood cell differentiation instruments, are of the
order of 1%.
Chapter 7 Conclusion and Further Research
7.1 Conclusion
The main goal of this research was to determine whether all steps in a fully automated
system for the detection and classification of white blood cells from microscopic images
could be realized using image processing and supervised learning techniques. This goal
was largely met, with the exception of a manual step required during image
segmentation.
As a component of this, we attempted to determine which of a number of investigated
classification techniques provides the best automated classifier for the classification of
white blood cells into their five major types (Neutrophils, Lymphocytes, Monocytes,
Eosinophils and Basophils), based on a limited data set of visual images. Of the
classification algorithms used, we found an artificial neural network to achieve the best
accuracy, with an error rate of 2%.
Three hypotheses were examined in the course of the investigation:
1. That useful regions of interest, each centered on a white blood cell, can be
automatically extracted from color microscopic images.
Useful regions of interest, each centered on a white blood cell, could be automatically
extracted from color microscopic images. Automatic segmentation of the images into
areas inside and outside the white blood cell was not successfully performed on a
sufficiently high proportion of the images, and human intervention was necessary in
the segmentation of 20% of cells. However, others [Aus1, Har1] appear to have
solved this problem, so it is considered that, with further work on this aspect, the
entire process is automatable.
2. That image processing techniques can be used to automatically extract a feature set
from these regions of interest that is useful for the classification of white blood cells.
Image processing techniques, such as banding and thresholding, were found to be
capable of automatically extracting a useful feature set from these regions of interest.
A feature set based on color statistics, nuclear material statistics and size was
extracted from the data set. This feature set was found to be sufficient to classify
white blood cells using a range of different algorithms, achieving up to 98% accuracy
of classification.
3. That supervised classification techniques such as decision trees, k nearest neighbor
and neural networks can be used to classify the white blood cells in these regions of
interest, and that their relative accuracies can be determined.
Supervised classification techniques including J4.8, IBk and neural networks can
be used to classify the white blood cells from the features extracted from these
regions of interest. Their accuracies have been compared on the basis of the error
rates achieved. Error rates as low as 2% have been achieved. These error rates are not
low enough to be useful as a complete replacement for a white blood count; however,
they are likely to be low enough to allow automated screening of samples for a
limited range of disorders.
The accuracy of classification using an appropriately trained neural network was
higher than that found using rule based, nearest neighbor and probabilistic classifiers.
However, the performance of the nearest neighbor and naïve Bayesian classifiers was
sufficiently close to that of the neural network that it is difficult to say whether, given
a larger or merely different dataset, they would not surpass neural networks in
accuracy.
This thesis has shown that a number of classification techniques have reasonable
performance on the features extracted from white blood cell images. The best
performance was achieved using artificial neural networks, IBk nearest neighbor and
naïve Bayesian classifiers. This may be due to the fact that these techniques are all
affected to some extent by all the attributes of the data. In contrast, the rule induction
algorithms ID3 and J4.8 (similar to C4.5), which only use the minimal set of features
required to classify the instances in the training set, did not perform nearly as well.
All of the classifiers performed reasonably well on neutrophils, the most numerous class
in the data set. None of them performed well on basophils, of which only two examples
(approximately 1%) appeared in the data set. Typically (with a single data point
exception), each classifier's performance improved as the number of instances per class
increased. However, it is prudent to consider that this might be because the
better-represented classes are serendipitously easier to train for all classifiers. We do not
feel that this is likely, though. All of the classifiers under consideration performed
substantially better than the trivial ZeroR baseline classifier.
An attempt was made to train an SNNS based neural network of the topology described
in 5.2.2 (page 35) on an unscaled and unnormalized feature set as shown in Table 3. This
attempt produced very poor results, with an error rate of 40% or more (worse, even, than
the naïve control classifier ZeroR). Further investigation showed that the network thus
produced had many internal unit activations saturated at the values 1.0 and 0.0, where the
derivative of the sigmoid function is close to 0 and training is impossible. Further, these
activations tended to flip from 0.0 to 1.0 and vice versa between successive training
epochs.
We obtained dramatically better results using normalized inputs to SNNS (an error rate of
less than 2% for normalized inputs compared to about 40% for unnormalized inputs). It
was hypothesized that the neural networks created by SNNS do not behave in the ideal
manner that Smith [Smi1] describes (see 2.4.11, page 15), but have substantial problems
coping with large input data values. As the design documentation of SNNS is not readily
available, it is difficult to determine the reason for this behavior.
The accuracy of white blood cell classification presented here compares favourably with
that achieved by other researchers (e.g. Song, Abu-Mostafa, Sill and Kasdan [Son1]
achieved just under 90% accuracy, though with 14 classes compared to the 5 we
attempted).
7.2 Further Research
It would be beneficial to further explore the following areas:
1. Other methods of determining the boundaries of cells. The method chosen here was,
perhaps, a little naïve and worked poorly in a number of instances. This is the only
step in the process where manual intervention was required. Hence, it is the only
block to the automation of the entire process from data acquisition (which could have
been performed automatically had an automated microscope stage been affordable or
at hand), through white cell segmentation, to classification. Counting the classified
instances is seen as trivial.
2. The performance of all these classifiers with a much larger data set. In particular, the
IB1, naïve Bayes and artificial neural network schemes were felt to be likely to
improve in performance with greater representation of the lower probability classes
(basophils and eosinophils). All classifiers, however, appeared to provide greater
accuracy on classes with larger numbers of instances in the training set.
3. Extension to other types of white blood cells. Blast cells, for instance, are
characteristic of certain types of Leukemia and would indicate further tests if found in
blood. Being able to automatically classify these and flag samples accordingly could
be a real boon to haematologists. This would of course require leukemic blood with
these unusual cells in evidence to be available and some image acquisition (and
manual classification by haematologists for the training set).
4. Use of other features in the blood cell. Some of these features will require higher
resolution images. In particular, the images acquired for this thesis were of too low a
resolution to show the characteristic granularity that is a key element in the
haematologist's differentiation of the granular leukocytes (neutrophils, eosinophils
and basophils) from lymphocytes and monocytes.
5. Tests of statistical significance. Tests of the statistical significance of the differences
in performance of the different classification algorithms could be conducted, given a
number of trials with different data sets.
The two areas likely to benefit most from further attention are improved cell
segmentation and the gathering of a larger data set (1. and 2. above). Use of other
features (4. above), particularly texture related features, is likely to show some
improvement.
Extension to other cell types is likely to make this research more useful to the general
population.
With these extensions, it is quite possible that the techniques presented herein could meet
and possibly surpass the 1% error rates that are generally accepted in haematological
instruments. Alternatively, it would be useful to investigate whether some mix of manual
and automated techniques could be used to provide significant improvements in economy
and throughput, while maintaining the high standards of existing manual techniques.
Chapter 8 References
[Add1] Abbott Diagnostics Website, http://www.abbott.com/products/diagnostics.htm
[Aha1] Aha, D. W., & Bankert, R. L. (1997). “Cloud classification using
error-correcting output codes”. Artificial Intelligence Applications: Natural
Resources, Agriculture and Environmental Science, 11:1, 13-28. (Technical
Report AIC-96-024).
[Alb1] Mark K. Albert and David W. Aha, “Analyses of Instance based Learning
Algorithms”, In Proceedings of the Ninth National Conference on Artificial
Intelligence (pp. 553-558). Anaheim, CA: AAAI Press.
[Alp1] Ethem Alpaydin, “Combined 5x2cv F Test for Comparing Supervised
Classification Learning Algorithms”, IDIAP Research Report 98-04, May 1998,
Dalle Molle Institute for Perceptual Artificial Intelligence, Martigny, Switzerland.
[Aus1] H.M. Aus, H.Harms, V. ter Meulen and U. Gunzer. “Statistical Evaluation of
Computer Extracted Blood Cell Features for Screening Populations to Detect
Leukemias”, Pattern Recognition Theory and Applications, Edited by P.A.
Devjver and J. Kittler, Springer-Verlag Berlin, Heidelberg 1987
[Bec1] Beckman Coulter Website, http://www.coulter.com/coulter/Hematology/
[Can1] Canny, J. , “A Computational Approach to Edge Detection”, 1986, IEEE
Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-8.
(6):679-698.
[Cil1] “Atlas of Blood Cell Differentiation” (CD-ROM) Felix Cillesen and Wim Van
Der Meer, ISBN: 0-444-50066-9, 1998 Elsevier Science B.V., Amsterdam, The
Netherlands
[Col1] “Blood”, Keith Breden Taylor and Julian B. Schorr, Colliers Encyclopaedia, Vol
4, 1978
[Fer1] Ferri, M., Lombardini, S., Pallotti, C., "Leukocyte classification by size
functions", Proc. 2nd IEEE Workshop on Applications of Computer Vision,
Sarasota, 1994 Dec. 5-7 (1994), 223-229.
[Har1] Harry Harms and Hans Magnus Aus, “Tissue Image Segmentation with
Multicolor, Multifocal Algorithms”, Pattern Recognition Theory and
Applications, Edited by P.A. Devjver and J. Kittler, Springer-Verlag Berlin,
Heidelberg 1987
[Hol1] Holte, R. C. (1993), “Very simple classification rules perform well on most
commonly used datasets”, Machine Learning 11, 63-90.
[Jai1] Anil K. Jain, Jianchang Mao and K.M. Mohiuddin, “Artificial Neural Networks:
A Tutorial”, Computer, March 1996
[Jai2] Anil K. Jain, “Advances in Statistical Pattern Recognition”, Pattern Recognition
Theory and Applications, Edited by P.A. Devjver and J. Kittler, Springer-Verlag
Berlin, Heidelberg 1987
[Koh1] Teuvo Kohonen, Jussi Hynninen, Jari Kangas, Jorma Laaksonen, Helsinki
University of Technology, “SOM-PAK: The Self-Organizing Map Program
Package, Version 3.1” (April 7, 1995).
[Lan1] Pat Langley and Stephanie Sage, “Tractable Average Case Analysis of Naïve
Bayesian Classifiers”, Institute for the Study of Learning and Expertise (ISLE),
Palo Alto, California
[Lan2] Pat Langley and Stephanie Sage, “Induction of Selective Bayesian Classifiers”,
Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence
(1994). Seattle WA, USA. Morgan Kaufmann Publishers Inc, San Mateo,
California.
[Mar1] D. Marr and E. Hildreth, “Theory of Edge detection”, Proceedings of the Royal
Society of London, Volume 207, 1980, pp187-217.
[Mic1] D. Michie, D.J.Spiegelhalter and C.C.Taylor, “Machine Learning, Neural and
Statistical Classification”, Ellis Horwood Limited, Hemel Hempstead,
Hertfordschire, Great Britain, 1994.
[New1] Author unknown, http://sunflower.singnet.com.sg/~midaz/Intronn.htm, Website
of NewWave Intelligent Business Systems, NIBS Inc.
[Qui1] J. Ross Quinlan, University of Sydney, “C4.5: Programs for Machine Learning”,
Morgan Kaufmann Publishers, San Mateo, California.
[Qui2] J. Ross Quinlan, University of Sydney, “Induction of Decision Trees”, Machine
Learning 1, 1
[Ren1] Rennie J. “Cancer Catcher: Neural Net Catches Errors that Slip Through Pap
Tests”, Scientific American, 262,5 p 84, May 1990.
[Ros1] Rosenfeld, Azriel, “Image Analysis: Problems, Progress and Prospects”, Pattern
Recognition, Vol 17, Number 1, 1984, pp3-12.
[Rus1] Stuart J. Russell and Peter Norvig, “Artificial Intelligence: A Modern Approach”,
1995, Prentice-Hall, Inc., Upper Saddle River, New Jersey.
[Rus2] John C Russ, “The Image Processing Handbook, Second Edition”, CRC Press
Inc, Boca Raton, Fl, USA, 1995, ISBN 0-8493-2516-1
[Smi1] Murray Smith, “Representing Variables”, Ch10, “Neural Networks for Statistical
Modelling”, Van Norstrand Rheinhold, New York 1993, ISBN 0-442-01310-8
[Son1] Xubo Song, Yaser Abu-Mostafa, Joseph Sill, Harvey Kasdan, "Incorporating
Contextual Information into White Blood Cell Image Recognition", Advances in
Neural Information Processing Systems, MIT Press, 1997.
[Son2] Milan Sonka, Vaclaw Hlavac and Roger Boyle, “Image processing, Analysis and
Machine Vision”, Chapman and Hall, London, 1993.
[Sys1] Sysmex Corporation Website, http://www.dokkyomed.ac.jp/dep-k/cli-path/wwwSYSMEX.html
[Umb1] Scott E. Umbaugh, “Computer Vision and Image Processing”, Prentice Hall Inc,
New Jersey, USA, 1997.
[Van1] Ferdinand van der Heijden, “Image Based Measurement Systems, Object
Recognition and Parameter Estimation”, John Wiley &Sons, West Sussex,
England, 1995.
[Wek1]“TUTORIAL. WEKA: The Waikato Environment for Knowledge Analysis.”
(August 30, 1996) Department of Computer Science, University of Waikato,
Hamilton, New Zealand. www.cs.waikato.ac.nz
[Wei1] Sholom M. Weiss and Casimir A. Kulikowski, Rutgers University, “How to
Estimate the True Performance of a Learning System”, Ch 1 of “Computer
Systems that Learn”, Morgan Kaufmann Publishers Inc, San Mateo, California.
[Wid1] Bernard Widrow, David E. Rumelhart, Michael A. Lehr, “Neural Networks:
Applications in Industry, Business and Science”, Communications of the ACM,
March 1994, Vol 37, No 3.
[Win1] Simon A. J. Winder, “Img* Image Processing Toolset Manual, Version 1.1, (Nov
1994)”, School of Mathematical Sciences, The University of Bath.
[Win2] Patrick Henry Winston, “Artificial Intelligence”, Third Edition, Addison-Wesley
Publishing Company, Reading Massachusetts.
[Wit1] I.H. Witten and E. Frank, "WEKA: Machine Learning Algorithms in Java",
University of Waikato, Hamilton, New Zealand. www.cs.waikato.ac.nz
[Zel1] A. Zell et al., "Stuttgart Neural Network Simulator, User Manual, Version 4.1"
(1995), Institute for Parallel and Distributed High Performance Systems (IPVR),
University of Stuttgart. www.informatik.uni-stuttgart.de/ipvr/bv/projekte/snns/
Appendix 1 Equipment and Software Used
Process | Equipment / Software Used | Source
Slide preparation | Abbott Cell-Dyn SMS slide maker/stainer | Abbott Diagnostics Division, http://www.abbottdiagnostics.com/i
Image acquisition | Leica ATC 2000 stereo optical microscope | Leica Microsystems GMBH, www.leica.com
Image acquisition | Bischke CCD-FT2 1/3 inch CCD camera | Unknown (made available by R. Smith)
Image acquisition | AI-Tech AI Gotcha! 2 video capture device and associated AI Gotcha software | AI-Tech International, www.aitech.com
Image experimentation and histogram evaluation | Jasc Paint Shop Pro Version 6.0 | Jasc Software, www.jasc.com
Image separation into component bands | ppmtopgm | Part of the pbmplus suite of image processing tools, http://metalab.unc.edu/pub/Linux/apps/graphics/convert/
Thresholding to find nuclear matter | pgmthresh | Part of the pbmplus suite of image processing tools, http://metalab.unc.edu/pub/Linux/apps/graphics/convert/
Edge erosion | edgerode | Created for this thesis, available from the author
Platelet erosion (erosion and dilation) | pbmfilt | Created for this thesis, available from the author
Region of interest identification (finding blobs of nuclear matter) | findblobs | Created for this thesis, available from the author
Region of interest extraction (cutting a sub-image from an image) | pbmcut | Part of the pbmplus suite of image processing tools, http://metalab.unc.edu/pub/Linux/apps/graphics/convert/
Edge detection | imgCanny | Part of the imgStar suite of image processing utilities by Simon A.J. Winder, ftp://axiom.maths.bath.ac.uk/
Cell segmentation (rough circle fitting) | findcirc | Created for this thesis, available from the author
Color feature extraction | cellstats | Created for this thesis, available from the author
Nuclear morphology feature extraction | nucstats | Created for this thesis, available from the author
ZeroR, OneR, ID3, J4.8, IBk and Naïve Bayes classifiers | WEKA (The Waikato Environment for Knowledge Analysis) | University of Waikato, Hamilton, New Zealand, www.cs.waikato.ac.nz
Artificial neural network classifier | SNNS (Stuttgart Neural Network Simulator) | University of Stuttgart, www.informatik.uni-stuttgart.de/ipvr/bv/projekte/snns/
Results analysis (spreadsheet and graphing) | Microsoft Excel | Microsoft Corporation, www.microsoft.com
Appendix 2 Data Acquisition Process
Slide Preparation
Slides were prepared using an Abbott Cell-Dyn SMS slide maker and stainer. This
AUD$100,000+ instrument was made available by the manufacturer, Vision Instruments
Ltd. (formerly Australian Biomedical Corporation). Human blood from two anonymous
donors was used.
Smears were made using the "Normal" setting on the SMS instrument, with 15 slides per
smearer wash and one slide per sample. Staining was performed according to Vision
Instruments' standard May-Grunwald Giemsa staining protocol.
Leica Microscope
The Leica ATC 2000 microscope was fitted with a standard C-type coupling for a
charge coupled device (CCD) based video camera, and a mechanism to direct the image
to either the eyepieces or the camera fitting. A magnification of 400x (10x eyepiece and
40x objective) was used.
Settings
The brightness of the microscope lamp was set to 6 (on a scale from 1 to 10 printed on
the lamp control knob of the microscope). A yellow-green filter was placed between the
lamp and the microscope stage.
Camera
The camera used to acquire the images for this project was a Bischke CCD-FT2 1/3
inch CCD camera. No adjustments to white balance or brightness were made from the
factory default settings. The output from the camera is PAL-D standard composite video
on an RCA connector.
A camera with a higher claimed resolution and S-VHS output was tried during the early
stages of the project; however, it appeared to produce an image with less detail.
AI Gotcha
Images were acquired using the AI-Tech AI Gotcha! 2 video capture device. This device
plugs into the parallel port of an IBM PC compatible computer and accepts either
S-Video or RCA composite video input. The AI Gotcha! 2 is switchable between NTSC
and PAL video modes; PAL mode was used here to match the video signal from the
camera.
In this case, the RCA composite video input of the AI Gotcha! 2 was used. Images were
captured at 512 * 384 pixels in RGB bitmap format, with eight bits per color.
Computer
An Intel Pentium II (350 MHz) computer running the Microsoft Windows 95 operating
system was used to capture the images. All processing was performed on an Intel
Celeron (400 MHz) computer under either the Linux or Windows 95 operating system.
Scanning method
The blood smear produced by the SMS instrument as used here covers a roughly
rectangular area of approximately 15 mm * 40 mm. Unfortunately, not all of this is
usable. Near the edges of the smear, many cells are mechanically damaged by the
smearing action. Towards the tail end of the smear, where the population of blood cells
is sparse, a similar form of mechanical damage often occurs. This mechanical damage
tends to distort the morphology of the cells and break the cell walls, leaving damaged
cells of irregular shape.
Towards the head of the smear, the population of cells is too dense, causing cells to
bunch up and overlap. This makes discrimination of individual cells difficult.
Haematologists avoid using cells in these areas, as they risk misidentification of cells
(and, consequently, misdiagnosis of illnesses). Instead, they use the monolayer of cells
between the two extremities of the smear, calling this the "readable area". All images
captured for this project were likewise taken from this monolayer.
[Figure 28: Blood Monolayer Scanning Pattern, indicating the readable area (blood
monolayer) of the smear]
The microscope stage was manipulated, within the blood monolayer, in the standard
"S"-shaped scanning pattern (shown in Figure 28) used by haematologists for the
complete blood count (CBC). For the two sets of 100 test cells obtained, each white cell
brought into the field of view in this way was maneuvered to the approximate centre of
the field of view and captured before scanning continued.
For a completely automated system, the co-ordinates of the scanning pattern could be
programmed into an automated microscope stage; the high consistency of the blood
monolayer between smears made by the Abbott Cell-Dyn SMS makes this practical.
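Purely as an illustration of such a programmed pattern, the short C sketch below emits
serpentine ("S"-shaped) stage co-ordinates row by row. The area origin and size, the
step per field of view, and the move_stage_to() call are all illustrative assumptions for
this sketch, not values or an interface used in this project.

/* Sketch: emit the co-ordinates of an "S" shaped (serpentine) scan of
   the readable area, one stage position per field of view. */
#include <stdio.h>

/* Placeholder for whatever positioning command a real motorised
   stage controller exposes. */
static void move_stage_to (int x_um, int y_um)
{
    printf ("move to (%d, %d)\n", x_um, y_um);
}

int main (void)
{
    int x, y;
    int going_right = 1;
    const int x0 = 0, y0 = 0;   /* top left of the readable area */
    const int width = 15000;    /* illustrative: ~15 mm usable width */
    const int height = 10000;   /* illustrative monolayer depth */
    const int step = 250;       /* illustrative field-of-view step, um */

    for (y = y0; y <= y0 + height; y += step) {
        if (going_right)
            for (x = x0; x <= x0 + width; x += step)
                move_stage_to (x, y);
        else
            for (x = x0 + width; x >= x0; x -= step)
                move_stage_to (x, y);
        going_right = !going_right;   /* reverse direction each row */
    }
    return 0;
}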
Appendix 3 Findblobs.c
/* Program Findblobs */
/* Finds blobs and their centroids and x,y extremities */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
//#include <io.h>
#include <fcntl.h>

char arr1 [1000] [1000];
int iheight, iwidth;
int px, py;
int x_acc, y_acc, npoints;
int right, left, top, bottom;
int imageisraw = 1;
FILE * ifile;
FILE * ofile;

/* Parse the header of a compressed (P4 RAWBITS) .pbm file */
int parseHeader (FILE * pfile)
{
    char tline[256];

    fgets (tline, 256, pfile);
    fprintf (stderr, "%s\n", tline);
    if (strncmp (tline, "P4", 2) == 0) {
        fprintf (stderr, "Compressed (RAWBITS) pbm file\n");
    }
    else {
        fprintf (stderr, "Wrong file format\n");
        exit (2);
    }
    do {    /* skip any comment lines, then read the dimensions */
        fgets (tline, 256, pfile);
        fprintf (stderr, "%s\n", tline);
    } while (tline [0] == '#');
    sscanf (tline, "%d %d", &iwidth, &iheight);
    fprintf (stderr, "Image size = %dW * %dH\n", iwidth, iheight);
    return (1);
}

/* Scan the array created from the .pbm file to find the next black
   pixel (and hence, the next blob) */
int find_next_blob (void)
{
    fprintf (stderr, "find next blob %d, %d\n", px, py);
    while (py < iheight) {
        while (px < iwidth) {
            if (arr1 [py][px] == '1') {
                fprintf (stderr, "found blob %d, %d\n", px, py);
                return (1);
            }
            px++;
        }
        px = 0;
        py++;
    }
    fprintf (stderr, "none found\n");
    return (0);
}

/* Exhaustively (and recursively) visit all connected black pixels,
   counting each pixel, accumulating the sums of the pixel x and y
   values and setting each pixel to white as we go. When the recursion
   finishes, the entire blob has been traversed and all the data needed
   to determine the blob's centroid has been accumulated. */
void eval_blob (int x, int y)
{
    if ((x < 0) || (y < 0) || (x >= iwidth) || (y >= iheight)) return;
    if (arr1 [y][x] == '1') {
        arr1 [y][x] = '0';
        if (x > right) right = x;
        if (x < left) left = x;
        if (y < top) top = y;
        if (y > bottom) bottom = y;
        npoints++;
        x_acc += x;     /* accumulate sum of blob pixel x positions */
        y_acc += y;     /* accumulate sum of blob pixel y positions */
        eval_blob (x, y-1);     /* visit all eight neighbours */
        eval_blob (x+1, y-1);
        eval_blob (x+1, y);
        eval_blob (x+1, y+1);
        eval_blob (x, y+1);
        eval_blob (x-1, y+1);
        eval_blob (x-1, y);
        eval_blob (x-1, y-1);
    }
}

/* Read a line from the compressed .pbm file into the array, expanding
   each packed bit to a '1' or '0' character */
void readLine (char * iline, FILE * rfile)
{
    char rawbuf [256];
    char * d;
    char c;
    int i, j;

    if (imageisraw) {
        d = iline;
        fread (rawbuf, sizeof (char), (iwidth+7)/8, rfile);
        for (i = 0; i < (iwidth+7)/8; i++) {
            c = rawbuf [i];
            for (j = 0; j < 8; j++) {
                if (c & 0x80) *d++ = '1'; else *d++ = '0';
                c <<= 1;
            }
        }
        *d = '\0';
    }
}

/* Fill the array arr1[] from the compressed .pbm file */
void fillArray (FILE * rfile)
{
    int i;
    for (i = 0; i < iheight; i++) {
        readLine (arr1[i], rfile);
    }
}

int main (void)
{
    int xmean, ymean;
    int nblobs = 0;

    ifile = stdin;
    ofile = stdout;
    parseHeader (ifile);        /* parse the file header */
    fillArray (ifile);          /* fill the array from the file */
    do {
        x_acc = y_acc = 0;      /* clear pixel position accumulators */
        px = py = 0;            /* start scan at (x=0, y=0) */
        npoints = 0;            /* clear number of points in blob */
        find_next_blob ();      /* find first pixel of next blob */
        right = left = px;      /* initialise blob extents */
        top = bottom = py;
        eval_blob (px, py);     /* evaluate extent of the blob */
        if (npoints > 0) {
            nblobs++;           /* number of blobs found */
            xmean = x_acc / npoints;   /* mean of blob on x axis */
            ymean = y_acc / npoints;   /* mean of blob on y axis */
            /* print details of this blob */
            fprintf (stdout, "blob%d: %d %d %d %d %d %d %d\n",
                nblobs, xmean, ymean, left, right, top, bottom, npoints);
        }
    } while (npoints > 0);
    return 0;
}
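One practical caveat on eval_blob() above: it recurses once per blob pixel, so an
unusually large blob could, in principle, exhaust the call stack. Below is a minimal
sketch, not part of the thesis code, of an equivalent traversal using an explicit stack;
it assumes the globals declared in findblobs.c, and eval_blob_iterative, MAXPTS and
the stack arrays are illustrative additions.

/* Sketch: iterative equivalent of eval_blob() using an explicit stack
   of pixel co-ordinates, avoiding deep recursion on large blobs. */
#define MAXPTS 100000               /* illustrative stack bound */

extern char arr1 [1000] [1000];     /* globals from findblobs.c */
extern int iwidth, iheight;
extern int npoints, x_acc, y_acc;
extern int left, right, top, bottom;

static int stack_x [MAXPTS], stack_y [MAXPTS];

void eval_blob_iterative (int x0, int y0)
{
    int sp = 0;                     /* stack pointer */
    int x, y, dx, dy;

    stack_x [sp] = x0;
    stack_y [sp] = y0;
    sp++;
    while (sp > 0) {
        sp--;
        x = stack_x [sp];
        y = stack_y [sp];
        if ((x < 0) || (y < 0) || (x >= iwidth) || (y >= iheight))
            continue;
        if (arr1 [y][x] != '1')
            continue;
        arr1 [y][x] = '0';          /* mark the pixel visited */
        if (x > right) right = x;
        if (x < left) left = x;
        if (y < top) top = y;
        if (y > bottom) bottom = y;
        npoints++;
        x_acc += x;
        y_acc += y;
        for (dy = -1; dy <= 1; dy++)        /* push all 8 neighbours; */
            for (dx = -1; dx <= 1; dx++) {  /* bounds and colour are  */
                if (dx == 0 && dy == 0)     /* checked when popped    */
                    continue;
                if (sp < MAXPTS) {          /* a real implementation  */
                    stack_x [sp] = x + dx;  /* would grow the stack   */
                    stack_y [sp] = y + dy;  /* rather than drop pixels */
                    sp++;
                }
            }
    }
}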
Appendix 4 Perl Script to Consolidate Blobs
#!/usr/bin/perl
# Consolidates blobs reported by findblobs: blobs whose centres lie
# within 25 pixels of each other are merged (weighted by their pixel
# counts) and a fixed-size 81 * 65 sub-image is cut around each
# surviving blob with pnmcut.
print ("Perl script ps1 on $ARGV[0]\n");
@lines = <STDIN>;
print ("read input\n");
$i = 0;
$blob = 0;
$oldx = 0;
$oldy = 0;
$ow = 0;
while (defined ($lines[$i]) && ($lines[$i] !~ /none/) && ($lines[$i] ne "")) {
    print ($lines[$i]);
    chomp ($lines[$i]);
    @words = split (/ /, $lines[$i]);
    if ($words[0] =~ /blob[0-9]+:/) {
        $blob++;
        print ("Blob $blob at: $words[1] $words[2]\n");
        $x = $words[1] - 40;    # offset the 81 * 65 cut so that it
        $y = $words[2] - 32;    # is centred on the blob centroid
        $w = $words[7];         # blob weight (number of pixels)
        $dx = $oldx - $x;
        $dy = $oldy - $y;
        $euclid = sqrt ($dx*$dx + $dy*$dy);
        print ("Euclidean distance = $euclid\n");
        if ($euclid < 25) {
            print ("Blobs too close, consolidating\n");
            $blob--;
            system ("rm -f $ARGV[0]b$blob.ppm");   # discard previous cut
            # merge the two blobs: centroid weighted by pixel counts
            $x = int (($x * $w + $oldx * $ow) / ($w + $ow));
            $y = int (($y * $w + $oldy * $ow) / ($w + $ow));
            $w = $w + $ow;
        }
        system ("pnmcut $x $y 81 65 <$ARGV[0].ppm >$ARGV[0]b$blob.ppm");
        # system ("xv $ARGV[0]b$blob.ppm &");      # optional visual check
        print ("file created: $ARGV[0]b$blob.ppm\n");
        $oldx = $x;
        $oldy = $y;
        $ow = $w;
    }
    $i++;
    print ("read a line\n");
}
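As used in this project, the two programs chain together: findblobs reads the
thresholded bitmap on standard input and writes one "blobN: ..." line per blob, which
the script above consumes on its standard input, while the base image name is passed
as the script's first argument (the script expects the colour image $ARGV[0].ppm in
the working directory). An invocation along the lines of
"findblobs < image.pbm | perl ps1 image" would therefore produce imageb1.ppm,
imageb2.ppm and so on; the file names here are illustrative, since the thesis does not
record the exact command line used.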