Gottfried Wilhelm Leibniz Universität Hannover
Fachgebiet Distributed Computing
Distributed Computing Security Group
Master's Thesis
in the degree program Computer Science (M. Sc.)
Classification of Medical Data
with HTM Cortical Learning Algorithms
Author:
B. Sc. Thomas Muders
First examiner:
Prof. Dr. rer. nat. M. Smith
Second examiner:
Prof. Dr.-Ing. G. von Voigt
Advisors:
M. Sc. R. Gröper, M. Sc. M. Harbach
Date:
March 25, 2011
Hannover, March 25, 2011
I hereby declare that I have written this thesis independently and have used no sources or aids other than those stated.
Thomas Muders
Zusammenfassung

Medical data analysis is an important task on which the lives of patients often depend. Nevertheless, the analysis itself is frequently repetitive and time-consuming. Analysis algorithms must have a high success rate and must also be understandable, so that their parameters can be handled properly. This thesis examines a general algorithm for a wide variety of data types: the Hierarchical Temporal Memory (HTM). This algorithm is presented by its authors as very flexible and is claimed to work with any kind of data. For this thesis, the HTM algorithm was implemented and tested with two datasets: medical images on the one hand and ECG recordings on the other. Since its authors present the algorithm only with restricted input data types and only a small number of input datasets, an adaptation was necessary. The implementation in this thesis supports large amounts of data and arbitrary input data types. The results are analyzed at the end of the thesis, and further ideas and thoughts for future development are given.
Abstract
Medical data analysis is an important task, and the lives of patients often depend on it. However, the analysis itself is often repetitive and time-consuming. Algorithms for this task need to have a very high success rate and need to be understandable, so that their parameters can be worked with. This thesis examines a general algorithm for various data types: the Hierarchical Temporal Memory learning algorithm (HTM). It is proposed to be very flexible and to work on any kind of data. The HTM algorithm was implemented and tested with two datasets: medical images on the one hand and ECG recordings on the other. The algorithm is presented by its authors with only a limited data input method and a small input size. These properties needed to be modified to support a larger amount of data as well as various data types. This is done with a custom implementation of the HTM algorithm. The results of this work are analyzed, and thoughts and ideas for future development are given at the end of this thesis.
Contents

1 Introduction
1.1 Related Work
1.2 Motivation
1.3 No Free Lunch Theorem

2 Scenario
2.1 Constraints
2.2 Goals of this Thesis

3 Intelligence and Prediction
3.1 Intelligence
3.1.1 Behaviors and Reactions
3.1.2 Plans and Thoughts
3.2 Prediction
3.3 Hierarchical Structure in Nature
3.4 Measuring Artificial Intelligence
3.4.1 Turing Test
3.4.2 Limits of the Turing Test
3.4.3 Limits of Behavior Tests in General

4 Hierarchical Temporal Memory
4.1 Neocortex
4.1.1 Regions in Neocortex Tissue
4.1.2 Neuroplasticity
4.2 Recognition
4.2.1 Nature of Patterns
4.2.2 Sequences
4.2.3 Auto Associative Memory
4.3 General Ideas and Goals
4.4 Overview of the HTM Algorithm
4.5 Spatial Patterns
4.6 Temporal Patterns
4.7 Nodes as Smallest Parts
4.8 Regions of Nodes
4.9 Hierarchy

5 Implementation
5.1 Overview
5.2 Design
5.3 HTM-Framework
5.3.1 General Data Container
5.3.2 Two Dimensional Data
5.3.3 One Dimensional Data
5.3.4 Nodes
5.3.5 Spatial Pooler
5.3.6 Temporal Pooler
5.3.7 Time Adjacency Matrix
5.3.8 Node Input
5.3.9 Node Output
5.3.10 Regions
5.3.11 Network
5.4 End User Application
5.4.1 Structure and Design
5.4.2 Workflow
5.5 Iteration Cycles
5.5.1 First Iteration
5.5.2 Second Iteration
5.6 Distributed and Parallel Computing

6 Experimental results
6.1 Classifications
6.2 Accuracy
6.2.1 Medical Scenario
6.2.2 Simple Test Scenario
6.3 Processing Speed
6.4 Influence of Parameters
6.5 Comparison

7 Conclusion and Outlook
7.1 Summary
7.2 Possible Improvements of Future Work

List of Figures
References
Chapter 1
Introduction
This thesis examines the capabilities of a new pattern recognition algorithm applied
to medical data.
In the realm of medical data analysis a vast amount of data needs to be processed.
Since lives may depend on these analyses, people have very high expectations of the
success rate of automated algorithms, i.e. algorithms that require only minimal human interaction.
Most of these algorithms have one of the following problems: either the result of the algorithm is too uncertain to be used in the medical realm, or the algorithm is too complicated to be understood by the staff. This results in a lack of trust in automated analysis and, eventually, a lot of work for the staff, since they have to analyze all the data themselves.
Despite the importance of these analyses, the tasks are often repetitive and time-consuming. Highly skilled specialists lack the time and other staff lack the skills
needed. In these cases, an algorithm could be a more reliable solution.
The first goal of medical data analysis is to build better algorithms. Since the best
algorithm is of no use if its routine application is too complex, the second goal of
medical data analysis is to develop more understandable algorithms. This is pursued
to regain the trust of medical and clinical staff in order to save lives, time and money.
The algorithm in this thesis is called the Hierarchical Temporal Memory algorithm
(HTM algorithm) and was first described in 2004. It has proven its abilities in the
fields of simple pattern recognition and motion analysis, but not yet in medical applications.
Therefore this thesis examines the capabilities of the HTM algorithm for pattern
and texture recognition in medical datasets. The scenario in this thesis consists of
photographs of cells and electrocardiography (ECG; the electrical activity of the heart recorded by skin electrodes) recordings. A more detailed
description of the scenario follows in chapter 2.
Before the HTM algorithm is depicted in detail in chapter 4, an overview of intelligence and intelligence measurement techniques is given in chapter 3.
For this thesis, the algorithm was implemented and tested with the data of the mentioned scenario. The details of the implementation process can be read in chapter
5. The source code together with demo applications (tested on Windows 7, 32 bit) can be found on the enclosed CD-ROM. The results are described in chapter 6 and the conclusion is given in
chapter 7.
Before the reasons for this thesis are explained in more detail, an overview of the
related work is given.
1.1 Related Work
Medical image analysis is done with a very large set of algorithms. The chosen algorithm often depends on the problem and the image type. Texture analysis to classify
different types of tissue, similar to this thesis, is done with a variety of so-called
texture parameters. An overview of these techniques can be found in [CBLC04].
ECG analysis and processing is also done with various algorithms. For example
an Extended Kalman Filter (EKF) is used to eliminate noise caused by the sensors and by moving patients [SS08]. The parameters for the Kalman Filter depend heavily on the
tasks it is applied to.
Similar to the theory behind the HTM algorithm are Deep Belief Networks (DBN)
with Restricted Boltzmann Machines (RBM). The Restricted Boltzmann Machine
consists of an input layer of units and a hidden layer of units. It learns by the use of
an energy function. The goal is to reach a thermal equilibrium in which the global
minimum of the energy function is found. This means that the algorithm stops learning in the end. In a Deep Belief Network the hidden layer of a Restricted Boltzmann
Machine is the input for another one [HOT06]. This way, patterns of patterns are analyzed and memorized in a way similar to how the regions and nodes of the HTM algorithm
work. However, time and the scaling of timespans across different hierarchy levels are not part of this algorithm.
The HTM algorithm itself is based on Bayesian Network Classifiers, a network of
variables connected by conditional dependencies. Such a network can be used to infer causes from effects, for example illnesses from symptoms [FGG97]. Sequences and
multiple hierarchies cannot be modeled with this approach.
The commercially available automated video surveillance software Vitamin D uses
the HTM algorithm to distinguish between the motion of people and other objects
in live video streams [Vit11]. It is a successful application and demonstrates the
capabilities of the HTM algorithm.
The commonly used algorithms highly depend on the input data or can be applied
only to special cases. Some algorithms like the deep belief networks are not easy
to understand for medical staff. These problems motivated an analysis of the HTM
algorithm.
1.2 Motivation
One reason for this thesis is the current state of algorithm use in the medical field in general. The people who should use the algorithms often hesitate to do so, since they cannot understand all the details of the algorithms and therefore do not understand the parameters they could adjust to get satisfactory results.
Another reason is the performance of the HTM algorithm in several realms and the absence of an application to medical data classification. The authors of [GJ07] used a binary
image (black and white pixels) recognition test and achieved 66 % accuracy on test
images and 99.73 % on the training images. The set of test images contained about
10 % images that were unrecognizable for humans. Some of the correctly classified
images are shown in Figure 1.1. Additionally, a video surveillance software uses the HTM algorithm to detect people in video streams with impressive results (see http://www.vitamindinc.com/ for demonstration videos).
The HTM algorithm is proposed to be a general artificial intelligence algorithm
[HB04]. General artificial intelligence is a very hard problem, and the possibility
of achieving it is still discussed widely. A theorem on the generality of problem
solving algorithms is the No Free Lunch Theorem, which is described in the next
section.
Figure 1.1: Correctly classified images shown by the inventors of the HTM algorithm in [GJ07]. The names in the top row are the names of the classes, and the images were correctly sorted by the HTM algorithm.
1.3 No Free Lunch Theorem
There is a large amount of data collected about the brain by many scientists. However, a general theory on how the brain works is missing. Therefore, almost all of
the commonly used artificial intelligence (AI) algorithms are very specialized. The
reason for this specialization is expressed by the No Free Lunch Theorem.
The No Free Lunch Theorem, formulated by D. H. Wolpert and W. G. Macready, states that an algorithm can only achieve a better recognition rate in one realm by getting worse in another realm, and possibly in all other realms [WM95]. The theorem is consistent with the many specialized algorithms which work very well on the tasks they are designed for, but fail in other realms.
If you apply, for example, a pattern recognition algorithm for grayscale images of
plastic bottles to a set of audio files, the result will be useless.
The theorem implies that a single general algorithm for many problems will always be less successful than a set of specialized algorithms.
Applied to the goal of this work, the conclusion is that the HTM algorithm needs to be adjusted to the problem it will be applied to, and that it will then be less able to achieve good results on other types of data.
Despite this theorem, which has proven correct in many cases, the HTM algorithm was still the preferred method. One reason was described earlier in section 1.2; another is the recent rise of hierarchical approaches to solving a wide variety of artificial intelligence problems [FL07, GP07, HOT06, BL07].
To understand the ideas on which the concepts of the HTM algorithm are built,
a general description of (artificial) intelligence is given after the introduction of the
scenario.
Chapter 2
Scenario
No published and peer-reviewed scientific results of the HTM algorithm on medical data analysis were known at the beginning of this thesis. Therefore, this work analyzes the capabilities of the HTM algorithm with medical data for the first time. Two
datasets were chosen to test the proposed recognition and generalization capabilities
of the algorithm. These datasets will be described in this chapter.
The first consists of images (2088 by 1550 pixels with 24 bit values) of tissue samples taken by a microscope. In these
images, multiple types of cells can be seen and the images are centered at the borders
of arteries. The goal is to classify the images into different states of an illness, based
on the type of cells found at the border of the arteries. An example of these images
can be seen in Figure 2.1.
The second dataset consists of eight-hour electrocardiography (ECG) recordings of sleeping subjects, about 5.8 million 16 bit floating point values per night per subject. ECG signals are recorded by using skin electrodes and are interpreted as the electrical activity of the heart muscles, showing the heartbeat of the subject. The goal of the analysis is the classification of the sleeping stages of the subjects. These sleeping stages are normally assessed by visual checking of the subject and with the help of electroencephalography, a recording of electrical activity at the surface of the head used to analyze the firing of neurons in the brain. An example of the ECG
data can be seen in Figure 2.2.
Figure 2.1: Sample image of the microscopic images that are one part of the test data for the HTM algorithm. In this image, the inner layer of the artery is not intact and some of the tissue is out of place. The cell types inside the different regions of this sample are hard to classify for algorithms.
Figure 2.2: Sample image of the ECG dataset and the corresponding frequency analysis (FFT, amplitude over frequency), which is the other part of the test data for the HTM algorithm.
Both are hard problems for machine learning algorithms in general, because of the
fuzzy classification and the preprocessing needed [GW08]. The constraints of the
classification problems are described in the next section.
2.1 Constraints
The cell images cannot be classified easily, since the borders of the cell-type classes are overlapping and fuzzy. Fuzzy means that some properties of cells cannot be sorted into a single distinct class. Instead, these properties gradually change from one class to another without a distinct point of change. For example, a property of a cell could be 65% healthy, 35% diseased and 30% dead at the same time. The combination of several fuzzy properties could help to distinguish the cells. Another problem is overlapping classes: some properties of the cells may occur in several classes, making it harder to classify a cell by one property alone. Again, several properties combined need to be examined to distinguish the cells.
Two examples that incorporate these classification problems:
1. Cells are dying slowly and not from one moment to the next.
2. Layers of tissue shift steadily along the arteries and do not 'jump' from one location to the next.
The microscopic images are also taken at different magnifications, and some parts of the tissue are cut off at the borders of the images, which are centered on the interesting spots at the border of the arteries. Additionally, the color used to mark
the tissue is different in some of the images. All classification parameters need to
be learned from sample images, since there are no parameters which can classify all
types of cells in all possible translations and rotations.
The ECG data is also hard to classify, because the sleep stages cannot be directly linked to the heart rate. The borders of the sleep stages are therefore very ambiguous in the ECG, because the different stages mainly affect brain activity. The sensors for the heartbeat are very sensitive to the subject's body movements, and the data is therefore noisy. The ECG data consists of very low frequencies (the heartbeat), and a single element of the data stream (a single electric current measurement) carries no relevant information on its own. A frequency analysis needs to be done in order to capture ultra-low frequencies and to eliminate the noise.
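The thesis does not spell out how this frequency analysis is implemented. As a minimal sketch (function name and window handling are illustrative, not taken from the actual implementation), the magnitude spectrum of one window of ECG samples could be computed with a direct DFT:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Magnitude spectrum of one window of ECG samples. A direct DFT is
    // used here for clarity; a real implementation would use an FFT library.
    std::vector<double> magnitudeSpectrum(const std::vector<double>& window)
    {
        const std::size_t n = window.size();
        std::vector<double> magnitude(n / 2);
        for (std::size_t k = 0; k < n / 2; ++k) {        // frequency bins
            double re = 0.0, im = 0.0;
            for (std::size_t t = 0; t < n; ++t) {        // time samples
                const double phase = 2.0 * 3.14159265358979 * k * t / n;
                re += window[t] * std::cos(phase);
                im -= window[t] * std::sin(phase);
            }
            magnitude[k] = std::sqrt(re * re + im * im); // bin amplitude
        }
        return magnitude;
    }

Since the frequency resolution of such a spectrum is the reciprocal of the window length, windows spanning many seconds are needed to resolve the ultra-low heartbeat frequencies.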
With these constraints of the scenario in mind, a definition of the goal of this thesis is given
in the next section.
2.2 Goals of this Thesis
The proposed algorithm is presented with a simple binary image example by its authors. This work analyzes the performance of the proposed algorithm with real world
medical data. To do this, the algorithm will be implemented and customized to handle 8 bit input data instead of 1 bit data and to work with a large amount of data and
patterns. The input size will be increased from 16 by 16 pixels to 128 by 128 pixels
in order to represent the real world data without losing too much detail.
The goal is to use the HTM algorithm on both scenario datasets without altering
the algorithm and with only minimal changes in the parameters of the algorithm.
The accuracy is of secondary interest in this first examination of the capabilities of
the HTM algorithm, as it is a general algorithm. Specialized algorithms tend to
have a better performance than general algorithms. Therefore, this work can only
be seen as a first analysis of a new algorithm and its conclusion will show if further
specialization and modification is useful. A successful application of the algorithm would lead to an improvement in the realm of medical data analysis.
To explain the HTM algorithm in detail and the idea behind it, the next chapter will
introduce some general concepts about intelligence and the ability to measure it.
Chapter 3
Intelligence and Prediction
To explain the ideas of the HTM algorithm, the thoughts that led to those ideas are
described and a definition of intelligence is given first.
A pragmatic conception of the term intelligence is given in [Goe93]:
“Intelligence is the ability to achieve complex goals in a complex environment.”
Using this concept, it is possible to measure intelligence by the following two components.
• Complexity of the achievable goals
• Complexity of the environments in which these goals are achieved
To measure intelligence this way, a measurement of complexity is needed. Complexity is hard to measure directly, since there is no generally agreed mathematical definition of it.
The complexity of goal sets could be measured by the number of different reachable
goals, the amount of data needed and the amount of ordered steps to achieve them.
The complexity of environments could be measured by the variability of the environment and the amount of data that needs to be filtered to get only relevant data for the
goals.
Some programs, for example weather forecast systems, are clearly not intelligent. It
is possible to argue that these programs are filtering a large amount of data and are
calculating good results with complex input data. However, in this case the goal set
of these applications is very limited and the environment is complex but unchanging and preprocessed. In the end, the analysis would lead to a low score for these types of programs.
Today, relatively smart behavior is shown by autonomous robots and general AI systems. There is still a lot of research needed until machines can be considered intelligent by the described measurement process.
To understand which steps are necessary to create intelligence, it is helpful to take a
look at evolutionary processes, which is done in the next section.
3.1 Intelligence

3.1.1 Behaviors and Reactions
The first simple lifeforms on earth had no organs working similarly to brains. Their behavior was therefore limited to reactive movements and actions based on the perceived environment. The search for food and the reproduction of the individuals were the main activities that led to survival. For example, the bacterium Escherichia coli switches its enzyme synthesis to metabolize lactose instead of glucose when there is no glucose nearby.
The lifeforms eventually got better at these tasks because the reproduction rate of
healthy and well nourished individuals is higher. In these early times, the genes were
the only source of behaviors and reactions. These early lifeforms had no memory and
therefore could not learn to adapt to changing environments.
After a long time, brains evolved that were capable of forming new memories, and the animals could adapt their behavior, since some stimuli led to success (e.g. more food) and some led to failures (e.g. injuries). The internal nerve system developed
a tendency to maximize positive stimuli, since the individuals with more positive
stimuli were more successful. Therefore these individuals learned from experienced
events and could remember basic things [SS87].
3.1.2 Plans and Thoughts
As this memory system evolved to include a long-term memory, it led to conditioning: causal relations between events in the external environment are stored and recalled. A well-known example is I. P. Pavlov's dog experiment [Pav27].
These conditional relations led to the ability to predict what will happen, which led
to even better survival strategies. With the evolution of the frontal lobe of the brain,
it became possible to imagine an action and its outcome without acting it out in the real world.
One distinct feature of mammals is the neocortex, a relatively new brain structure
that surrounds the evolutionary older brain tissue. It gave them a significant boost
in memorizing and recognizing common actions and reactions of their surroundings.
The neocortex is described in more detail in section 4.1. Monkeys are already capable of having a model of the world and show self-recognition [RRLP10]. Finally, humans evolved with relatively big brains, around 90% of whose outer layer is neocortex tissue.
In the last decade, scientists found evidence of very intelligent behavior and even self-recognition in magpies and crows [PSG08, WWC+09]. Despite birds having a different lineage than mammals, they have nevertheless developed rudimentary intelligence without a neocortex.
However, the HTM theory has its main focus on the neocortex, because of the goal
to reach human-level intelligence.
3.2 Prediction
From many observations it seems that the ability to predict events and reactions is
strongly correlated to intelligent behavior.
An individual with the capability to predict what will happen if it takes an action has
a distinct advantage. It is able to evaluate the outcome of different actions before
actually trying them. Individuals without this ability can only guess the outcome
based on experience or can just behave based on instincts which were formed through
evolution.
Intelligence tests (IQ tests) for measuring the intelligence of human beings consist mostly of questions regarding prediction: there are sequences of numbers, words or geometric forms which the subject is asked to complete with a missing
element. The test checks if the subject is able to extract a general rule out of examples and apply this rule to get the next element. Without a memory of sequences or
general rules to generate sequences this seems impossible.
Processing large sets of data is a requirement to accomplish complex goals in complex environments. A hierarchical structure fulfills this requirement. The reason for
this is given in the next section.
3.3 Hierarchical Structure in Nature
The surface of the brain is split into different areas. These areas have been analyzed
in primates and they show a hierarchical structure [FV91]. A well studied example
is the realm of visual perception. At the bottom of this hierarchy are the areas that
respond to small parts like lines, corners and circles. At the top are areas that are
active when some higher order concepts like faces are present somewhere in the
visual area.
This structure of the brain areas is mirrored in our environment, too. Almost everything is made up of smaller parts. These parts occur more commonly and are themselves made up of even smaller parts. For example, the characters in these words are built
out of a few types of lines, corners, dots, circles and crosses. These simple elements
form a greater variety of characters which then form an even greater variety of words.
This goes on until complex and unique things like a master's thesis are formed.
If this hierarchical structure is used in an artificial intelligence system, the amount of data that needs to be stored will be much lower. Complex objects can be described with references to their parts instead of describing all parts anew, so every part needs to be saved only once. A hierarchy also makes an AI system more flexible and increases its capability for generalization, prediction and recognition, which is further described in [Hof94, HB04, Min88, Vos07].
The efficiency of human intelligence can easily be tested with IQ tests. Measuring the intelligence of machines is a different and more difficult task, as will be explained in the next section.
3.4 Measuring Artificial Intelligence

3.4.1 Turing Test
In 1950, the Turing Test was proposed by Alan Turing as a good measuring tool to
check whether or not a machine is actually intelligent [Tur50]. The test needs a judge,
a person, and the machine under test. The person and the machine try to appear as humanlike as possible, and the judge has to say which of them is the machine and which is the human. All participants are isolated from each other, and the conversation between the judge and each participant takes place only through computer terminals. If the
judge cannot distinguish the machine from a human being, the machine has passed
the Turing test. This test is strongly based on behavior, since the entire procedure is
based on communication.
3.4.2 Limits of the Turing Test
An intelligent being can be intelligent without ever acting. Other beings will not
consider it intelligent, but it is impossible to say if a non-acting being is intelligent
or not.
This leads to the assumption that intelligence can exist without acting like a human being and that this cannot be tested with the Turing test. All internal thoughts and processes are hidden from the judge.
Communication between humans is based on years of learning during childhood and on experience of cultural norms. Since most machines will not learn over a long timespan like children do and are not meant to be exactly like humans (i.e. with all the downsides of being human), another measurement approach needs to be pursued. Machines do not have the same properties as humans, so their world model and experiences are very likely to differ from ours. They are also often purposely designed to differ from humans, e.g. to overcome some human weaknesses or to improve on human capabilities.
3.4.3 Limits of Behavior Tests in General
A thought experiment shows that behavior tests like the Turing Test cannot tell in all cases
whether or not someone is intelligent. It is called the Chinese Room Experiment and
was proposed by Searle in 1980 [Sea80].
The experiment is described as follows:
Imagine a room with a postbox for incoming Chinese letters and a postbox for the outgoing Chinese answers.

A person inside the room receives the Chinese letters, but cannot speak a single Chinese word. Instead, the person has a very big book with very complex instructions on how to transform the incoming Chinese letters into Chinese answers. These instructions can be followed without understanding Chinese at all.

A person outside of this room who is able to read and write Chinese can write letters and will receive meaningful answers. This person would argue that inside the room is a person who understands Chinese. That would be a wrong statement.
This room can be seen as equivalent to a computer: inside the computer (room) is a program (person) which gives answers to English letters (Chinese letters) with the help of a big database (big book) and complex rules.
This means even if a computer is able to have a conversation with any person, it is
not possible to tell with 100% certainty that this computer is intelligent.
In conclusion it seems that it is possible to do some tests with artificial intelligence
like the Turing test and IQ tests, but it is impossible to make a clear distinction
between intelligence and simulation of intelligence if one has a sufficiently large
book, like in the mentioned experiment.
Neither of the following implications can be used to measure intelligence with 100% certainty.
Behavior based test passed ⇒ Machine is intelligent
Behavior based test failed ⇒ Machine is not intelligent
In order to be truly intelligent a machine needs to approach a more general concept
than just the imitation of behavior.
The HTM algorithm approaches the goal to be intelligent by taking the advantages of
a hierarchical structure. How this algorithm works is described in the next chapter.
Chapter 4
Hierarchical Temporal Memory
The hierarchical temporal memory (HTM) approach is based on the theory by Jeff
Hawkins [HB04], which he based on the structure and properties of the neocortex in mammalian brains. To better understand the foundations of the HTM algorithm, the
properties of the neocortex are described first.
4.1 Neocortex
The neocortex is a relatively new part of the brain and in humans it is about the size
of a dinner napkin and about two millimeters thick. It has a wrinkled structure to
fit into the skull. It surrounds the evolutionary older brain tissue. The structure of
the neocortex is regular and looks similar in different regions of the brain. It consists of six different layers (see Figure 4.1), each of which is believed to have a distinct function.
4.1.1 Regions in Neocortex Tissue
It has long been known that different brain regions are dedicated to different realms. There are areas for vision, touch, hearing and even an area for face recognition.

These regions are connected to each other. Analyses of macaque monkey brains have shown a hierarchical structure of these regions [FV91]. The map they produced
can be seen in Figure 4.2.
Brain researchers found that many regions of the neocortex manage very specific tasks, such as listening to and remembering music. There are even regions in the brain used for reading written letters. However, these regions cannot be formed and defined by the DNA, because written text is far too young to have been adapted to by evolutionary processes. This is a sign of neuroplasticity, which is described in the next section.

Figure 4.1: The six layers of the neocortex (left and right). Cell staining was used to get a pseudo-color image [Bra].
4.1.2 Neuroplasticity
Neuroplasticity (also referred to as cortical re-mapping) means the ability of the brain to adapt to significant changes even during adulthood. For a long time it was believed that only children are capable of significant changes in their brain structures. There are some findings that support
the theory of neuroplasticity:
• The tissue of the neocortex looks very similar in the different regions [Mou78].
• There are blind patients who learned to see with a pressure-device attached to
their tongue [BYRTK03].
• People suffering from brain damage can regenerate by cortical re-mapping
[CC09].
• People who lose one or more senses at a young age show improvements in their remaining senses [GLL+04].

Figure 4.2: The regions and their connections which the scientists found in their studies on the brain of macaque monkeys. Image from [FV91].
These findings lead to the conclusion that there is a general cortical function, which
adapts to the input it is receiving. Therefore the HTM theory builds upon the use of
a general learning algorithm for any kind of data.
There are general properties in the way senses transmit signals to brains. The task of
recognition of these signals is described in the next section.
4.2 Recognition

4.2.1 Nature of Patterns
Every input to our sensory organs is in general of a spatial and sequential nature. Visual input is transferred to the brain by the optic nerve in the same way the spinal cord transfers touch input and the auditory nerve transfers sound input [HB04]. The cochlea, a structure in the bony labyrinth of the human ear, for example, transforms sound pressure changes into a spatial form of data. This is achieved by the vibration of different parts of the structure: the nerve cells are activated by vibration, and high-frequency tones make different parts vibrate than low frequencies do. The resulting signals are spatial (the frequency) and sequential (the sequence of tones).
4.2.2 Sequences
Many things humans learn and remember are stored as sequences. This can be experienced in the inability to recall things out of order. Every child learns the alphabet, and almost all adults know it from A to Z; we can recognize each character when we see it, even in distorted forms. Despite that knowledge and ability, it is a really hard task to recite the alphabet backward from Z to A if one has never done that before. Telling stories is another example: it is often harder to remember one single aspect of a story than the whole sequence leading up to that aspect. Even motion seems to be stored in sequences. Many things in daily life are performed in semi-rigid sequences; things like brushing one's teeth or taking a shower are always performed in similar ways. It is easy to alter this behavior by will, but if no attention is paid to these things, the brain seems to replay well-known motor commands [HB04].
4.2.3 Auto Associative Memory
In [HB04], the similarities of the neocortex with auto associative memory systems
are demonstrated:
In loud environments, not every word in a conversation is understood. The human brain can fill in missing words because it predicts what the other person will say next. This way a conversation is still possible. Another example is the blind spot of the retina: instead of experiencing a hole in the field of view, the brain makes a prediction of what is likely to be seen (even with one eye closed).
Very small parts of a spatial pattern are sufficient to trigger memories filling in the
rest of it. It could even be possible that the normal process of unconscious thought
is an auto associative process leading to a chain of memories coming into the mind,
each triggered by associations of other memories.
With these properties and abilities of the neocortex in mind, the overall idea behind
the HTM algorithm is described in the next section.
4.3 General Ideas and Goals
Since the HTM algorithm is in essence a theory working for all kinds of patterns and even motor control (due to its prediction capabilities), it would be an important cornerstone of general artificial
intelligence, if the algorithm can successfully work with unknown and diverse data
and produce good results.
Opposed to this goal is the general statement of the No Free Lunch Theorem, described in section 1.3: it is impossible to construct a machine capable of coping with many different problems without losing efficiency on all of these tasks.
Despite this, the authors of the HTM algorithm propose that their algorithm is general [HB04, GJ07]. Additionally, the HTM algorithm has shown impressive
results in cases like the binary image example in [GJ07] and the video surveillance
software Vitamin D.
It is hard to compare a general algorithm with specialized algorithms. On the one
hand, there are different goals achievable by different algorithms, as written in chapter 3. On the other hand, general artificial intelligence algorithms are trying
4. Hierarchical Temporal Memory
22
to achieve human-level intelligence, which leads to different types of problems being easy to solve, as Figure 4.3 shows.
Figure 4.3: The outer circle is the known problem space. The two inner circles represent the problems that are easy to solve for humans (EH) and machines (EM).
As can be seen in Figure 4.3, there are problems that can be solved easily by machines but not by humans, like complex mathematical equations in a short time span.
On the other hand there are problems that are easily solved by humans, but are hard
for machines, like talking about newspaper stories.
This needs to be kept in mind when comparing specialized algorithms with general
intelligence algorithms.
The goal is to construct an algorithm capable of solving many different problems.
The search space of an algorithm of this type cannot be limited in efficient ways. To handle that large amount of data, a hierarchical structure needs to be implemented. This structure needs to reflect the hierarchical structure of the environment, as described in section 3.3.
Additionally it needs to have a prediction system similar to an auto associative memory to be able to make stable predictions, even when parts of the pattern are missing
or noisy.
These inherent structures of the neocortex have corresponding structures in HTM
networks, which are explained in the next sections.
4.4 Overview of the HTM Algorithm
This section gives a general overview; a more detailed description follows in the next sections.
The HTM algorithm is designed to be used for classification problems with supervised or unsupervised learning. Another property is the prediction of future
input patterns. The algorithm learns with the help of a hierarchical network and a
continuous stream of input data. Time based and hierarchical relationships are the
most important concepts of the HTM algorithm.
An HTM network is built with several regions to resemble the hierarchy of neocortex
regions. Every region consists of a set of nodes. Unlike in artificial neural networks,
the nodes cannot be considered equivalent to neurons. They can be seen as sets
of neurons and they try to resemble the processing that occurs in neuronal groups
instead of the physiological properties of neurons. These nodes are the main units
of this learning algorithm and each node learns with the help of a spatial and a
temporal pooler. These poolers aggregate the spatial and temporal patterns a node
discovers during the learning stage. An overview can be seen in Figure 4.4.
Figure 4.4: The elements of the HTM algorithm. A network has multiple regions, a region multiple nodes, and each node has a temporal and a spatial pooler. The input data is accessed by the spatial pooler.
The lowest region (bottom level) of an HTM network processes only very small parts of the input data, like the lines, dots and circles of these characters, and serves as the sensory input stage. This region tries to form a concept of these simple forms by labeling common spatial patterns with an identifier (ID). To do this, it uses a spatial pattern storage (the spatial pooler), and when some patterns occur more often than random chance would suggest, these patterns get an ID and are recognized the next time they
occur. Since the patterns are of spatial nature, the node handles the input according
to their dimensionality.
The change of known spatial patterns over time is stored in a transition matrix
to see which patterns follow each other. These sequences will be stored and labeled
by a temporal pattern storage (temporal pooler) like the spatial patterns. The next
higher region (above the lowest region) in the network processes only these temporal
pattern labels from the lower region and tries to find common spatial patterns inside
them by using the same process again (forming spatial patterns, forming temporal
patterns). This goes up in the hierarchy until the top region nodes are forming labels
for more complex things.
Every time a pattern goes up the hierarchy, information is also flowing down. This
feedback mechanism is essential for prediction of patterns. Since a node also looks
for temporal patterns, it pushes the most likely next spatial pattern of the discovered
temporal pattern down to the lower region. This works as a biasing for the lower
region nodes. This way, the lower region nodes are anticipating sequences (temporal
patterns) and constantly try to predict what spatial pattern (inside the temporal pattern) will be perceived next. If some parts of a spatial pattern are missing, the higher
regions may still recognize a temporal pattern and through the biasing of the lower
regions, a behavior similar to an auto associative memory is shown.
4.5 Spatial Patterns
To first determine spatial patterns, an analysis of the incoming data needs to be made.
If some input values occur together often enough, the combination of these values is
stored as a spatial pattern and gets an ID.
If two known patterns follow each other, a first order time adjacency matrix is updated with that occurrence. This keeps track of how often a pattern is followed by
another pattern. This information is used later to form temporal groups of patterns
that occur together in time and therefore are likely to share the same origin.
After several patterns have been found and a certain amount of time has passed without finding any new patterns, the spatial pooler switches to a stable state and the temporal pooler calculates the temporal patterns.
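The whitepaper leaves the exact bookkeeping open; a minimal sketch of how a spatial pooler could assign IDs to sufficiently frequent input combinations is shown below (the class layout and threshold are assumptions; only the method name follows the class diagram in Figure 5.1):

    #include <map>
    #include <vector>

    // Minimal spatial pattern storage: input combinations that occur often
    // enough are promoted to known patterns and given a stable ID.
    class SpatialPooler {
    public:
        // Returns the pattern ID, or -1 while the pattern is still too rare.
        int getSpatialpatternID(const std::vector<unsigned>& input) {
            if (++occurrences[input] < minOccurrences)
                return -1;                       // not yet a known pattern
            auto it = ids.find(input);
            if (it == ids.end())                 // first time above threshold
                it = ids.emplace(input, nextID++).first;
            return it->second;
        }
    private:
        unsigned minOccurrences = 3;             // threshold, freely chosen
        int nextID = 0;
        std::map<std::vector<unsigned>, unsigned> occurrences;
        std::map<std::vector<unsigned>, int> ids;
    };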
4.6 Temporal Patterns
Sequences of spatial patterns are formed with the help of the adjacency matrix. This
matrix stores the probabilities of patterns following each other in time.
Every time two spatial patterns are identified in sequence, the matrix is updated with the pair consisting of the previous pattern ID and the current pattern ID. The value in the matrix is increased and the corresponding row is normalized, to get a probability distribution over which pattern follows which.
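A sketch of this update, with the getProb interface taken from the class diagram in Figure 5.1 (here the normalization is deferred to the lookup, which yields the same probabilities as normalizing the row after every update):

    #include <cstddef>
    #include <vector>

    // First-order time adjacency matrix: counts[i][j] records how often
    // pattern j directly followed pattern i.
    class AdjacencyMatrix {
    public:
        explicit AdjacencyMatrix(std::size_t patternCount)
            : counts(patternCount, std::vector<double>(patternCount, 0.0)) {}

        void update(int lastID, int currentID) {
            counts[lastID][currentID] += 1.0;    // record one transition
        }

        // Row-normalized entry: probability that 'to' follows 'from'.
        double getProb(int from, int to) const {
            double rowSum = 0.0;
            for (double c : counts[from]) rowSum += c;
            return rowSum > 0.0 ? counts[from][to] / rowSum : 0.0;
        }
    private:
        std::vector<std::vector<double>> counts;
    };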
The goal is to extract temporal patterns from the matrix. For this, a mathematical notion is needed to measure the quality of these temporal patterns. Equation (4.1) shows a measurement value $t_i$ calculated from $D_1, \ldots, D_N$, which are disjoint subsets of the spatial patterns in the matrix:

$$t_i = \frac{1}{n_i^2} \sum_{k \in D_i} \sum_{m \in D_i} T(k, m) \qquad (4.1)$$

Here $n_i$ is the number of elements in the partition $D_i$ and $T(k, m)$ is an entry in the adjacency matrix. If all elements in $D_i$ occur adjacent in time, the value $t_i$ is high.
A global measurement $J$ is used and needs to be optimized. Equation (4.2) shows how the value $t_i$ of each of the $N$ subsets is weighted by $n_i$ and added up. $J = 1$ would be the absolute optimum, but that would never occur in real-world scenarios.

$$J = \sum_{i=1}^{N} n_i t_i \qquad (4.2)$$
Since the search space for the best temporal grouping is very large and this is a maximization problem, a fast greedy algorithm is used. This algorithm, proposed in [GJ07], works as follows:

1. Find the most connected spatial pattern that is not yet in a temporal pattern and add it to a new temporal pattern.

2. For each spatial pattern in that group:

(a) Find the N following spatial patterns with the highest transition probability and add them to the temporal pattern. N is a parameter the user can choose.
3. Repeat step two if there were spatial patterns added to the temporal pattern.
4. Start from the beginning as long as there are spatial patterns in the matrix
which are not in a temporal pattern yet.
This pattern discovery is done in each single node of the HTM network.
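A sketch of this greedy grouping, operating directly on a row-normalized transition matrix T (the connectivity measure and the handling of exhausted candidates are assumptions, since [GJ07] leaves these details open):

    #include <algorithm>
    #include <cstddef>
    #include <set>
    #include <vector>

    // Greedy temporal grouping: T[i][j] is the probability that spatial
    // pattern j follows pattern i; topN is the user-chosen parameter N.
    std::vector<std::set<int>> formTemporalGroups(
            const std::vector<std::vector<double>>& T, std::size_t topN)
    {
        const int count = static_cast<int>(T.size());
        std::vector<bool> grouped(count, false);
        std::vector<std::set<int>> groups;

        auto connectivity = [&](int i) {         // total connection weight
            double sum = 0.0;
            for (int j = 0; j < count; ++j) sum += T[i][j] + T[j][i];
            return sum;
        };

        for (;;) {
            // Step 1: most connected pattern that is not yet grouped.
            int seed = -1;
            for (int i = 0; i < count; ++i)
                if (!grouped[i] && (seed < 0 || connectivity(i) > connectivity(seed)))
                    seed = i;
            if (seed < 0) break;                 // step 4: everything grouped

            std::set<int> group{seed};
            grouped[seed] = true;

            bool added = true;
            while (added) {                      // step 3: repeat while growing
                added = false;
                for (int i : std::vector<int>(group.begin(), group.end())) {
                    // Step 2(a): the topN most probable successors of i.
                    std::vector<int> order;
                    for (int j = 0; j < count; ++j)
                        if (!grouped[j]) order.push_back(j);
                    std::sort(order.begin(), order.end(),
                              [&](int a, int b) { return T[i][a] > T[i][b]; });
                    for (std::size_t k = 0; k < topN && k < order.size(); ++k) {
                        if (T[i][order[k]] <= 0.0) break;
                        group.insert(order[k]);
                        grouped[order[k]] = true;
                        added = true;
                    }
                }
            }
            groups.push_back(group);
        }
        return groups;
    }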
4.7 Nodes as Smallest Parts
The proposed theory does not make it necessary to simulate the neocortex on a
neuron-basis. Therefore the smallest parts are nodes which can be seen as sets of
neurons in the neocortex.
These nodes process the incoming data in the way described above with the help of
spatial and temporal poolers. Each node receives only a small chunk of input data and tries to find spatial and temporal patterns in it. The temporal patterns found
in the input data are used as the input of higher nodes in the hierarchy.
To detect patterns that are larger than these chunks, nodes are organized in bigger
structures called regions and hierarchies.
4.8 Regions of Nodes
A region handles a set of nodes. It gets a data source, which can be a sensor or the output of other regions. This data is split up into small chunks, one per node. The output of the nodes is reassembled into a single data container for the higher regions. The way the regions are connected and how the data flows through them is depicted in the next sections and in Figure 5.9.
To find more complex patterns and higher order concepts, the regions are connected
in a hierarchical structure like the regions of the mammalian brain.
4.9 Hierarchy
The regions are organized in a hierarchical system. Figures 4.6 and 4.7 show the
simplified hierarchies of this thesis for one and two dimensional data. The data
access is depicted by lines from a node to the data that is processed by that node.
Many forms of hierarchies are possible and some examples can be seen in Figures 4.5 and 5.14.
Figure 4.5: Demonstrations of the various possibilities of the HTM network in [HG06].
The goal of the hierarchy is to reflect the hierarchies in the environment. Out of the
higher regions in these hierarchies, properties and patterns can emerge which are not
present in the lower regions. This emergence reflects the emergence that occurs in
nature.
An interesting example of emergent behavior is simulated bird flocks. Each bird follows only a few rules that do not define the behavior of a whole flock, but if
these birds are simulated together, flocks emerge from these simple rules [Rey87].
Marvin Minsky states that the whole is sometimes more than the sum of its parts. Hierarchy can also give a better sense of what each discovered thing means: relationships between things become clearer when there is a hierarchy involved. For
further details see [Min88], chapter 2: Wholes and Parts.
The relationships and hierarchical structure of the HTM algorithm are reflected in
the implementation of it, which is described in the next chapter.
Figure 4.6: Simple hierarchy of regions for one dimensional data with a frequency spectrum at the bottom.

Figure 4.7: Simple hierarchy of regions for two dimensional data with a grayscale image at the bottom.
Chapter 5
Implementation
5.1 Overview
The HTM algorithm is available on the website of Numenta (http://numenta.com/), a company founded in 2005 by Jeff Hawkins, Dileep George and Donna Dubinsky, along with tutorials and examples. The downloadable framework, at the time the work on this thesis began, did not include a proper feedback mechanism, which is needed for predicting patterns and for eliminating noise in the input data. The feedback mechanisms are described as one of the essential features for learning by Jeff Hawkins [HB04].
The available algorithm also does not offer a flexible input method to use the algorithm with a variety of different input data types. The only input data were one bit
black and white images like in [Gar07].
Colored images were used in commercial software (http://www.vitamindinc.com/), but the source code is not public and therefore could not be modified for this thesis.
Since the algorithm itself is not too complicated and almost completely described in
numerous whitepapers by Numenta, a reimplementation of it was deemed both necessary and feasible.
As the main source for implementation details, the whitepaper by Numenta was used
[GJ07]. The chosen programming language is C++, because it offers the ability to
handle memory and pointers directly and because the author is already familiar with
it.
5.2 Design
The first step before implementing the algorithm is to design the overall architecture.
The software consists of three parts:
• A 2D-Framework to display two-dimensional graphics with OpenGL to render the user interface, information about the network and the learning process.
• An HTM-Framework to implement the algorithm. This framework needs to
be built in the most general way possible to be able to cope with a variety
of input data and scenarios for future applications. This means the datatype
of the input and the structure of the HTM network needs to be flexible and
modular.
• An end user application using both frameworks to let the user decide on the
structure and the parameters of the HTM network. The user can also fine
tune parameters for the network and can examine the single nodes and various
other details.
The HTM framework and the software were developed as parts of this thesis. The
2D-Framework is open source (http://sourceforge.net/projects/spgameframework/) and was used because the author was already familiar
with it.
In order to have a general data input method, the data needs to be filtered with virtual
senses. These virtual senses work in a similar way to the senses of human beings: they transform various raw signals into a more general signal, interpretable by the HTM algorithm and equivalent to the signals that are sent through nerve fibers.
These virtual senses make it possible to write the algorithm independent of the type
of input data. The only thing that needs to be specified is the dimensionality of the data,
because the spatial relationship of areas (i.e. neighborhoods) in the data needs to be
maintained.
Therefore, a virtual retina (after the organ in the human eye which detects light and is responsible for vision) for two dimensional data and a virtual cochlea (after the organ in the human ear which detects pressure changes in the air and is responsible for hearing) for one dimensional data were modeled in this thesis. Both inputs are meant to be a continuous stream of data, because time is essential for the algorithm and subsequent
patterns need to have a time relation.
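A sketch of what this common interface to the virtual senses could look like, with the method names taken from the simplified class diagram in Figure 5.1 (return types and signatures are assumptions):

    // Interface implemented by every virtual sense (names from Figure 5.1).
    class RegionData;                            // data container, see 5.3.1

    class i_sense {
    public:
        virtual ~i_sense() {}
        virtual RegionData* getSenseData() = 0;  // converted input values
        virtual RegionData* getSenseBias() = 0;  // feedback flowing back down
        virtual void covertAttention() = 0;      // shift the focus of the sense
    };

    // virtualEyeball (two dimensional data) and virtualCochlea (one
    // dimensional data) would be the two implementations used in this thesis.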
The HTM framework implementation is described in further detail in the next sections, starting from the bottom with the input data and nodes and proceeding to higher order concepts like regions.
5.3 HTM-Framework
A very important aspect of the framework is its flexibility. It should be possible to define the following parameters:

• Number of Regions: the complexity of the input needs to be matched by the hierarchy of the network.

• Input Dimensionality: the dimensionality of the input needs to be matched, especially in mixed types of hierarchies.

• Types of Input Data: in this thesis, ECG data and images are the input types. The framework needs to be able to accept various types.

• Speed of Learning: not every source delivers data all the time. The user needs to be able to slow down the algorithm in these cases.
To accomplish this flexibility, the framework was built in a modular way. Regions
are created independently at first. The class region creates its nodes and the data
containers for output to higher regions and bias output to lower regions. These data
containers are described in the following section. Afterwards, the network class,
called htmcontroller, controls the way the regions are connected.
To be flexible with the regions, all data is stored in a type called sequenceID, which
consists of an unsigned integer and a floating point value. The integer is a unique
ID and the float value is a strength indicator.
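As a sketch, the declaration could look like this (the exact definition is an assumption based on the description above):

    // Common data type exchanged between all regions of the network.
    struct sequenceID {
        unsigned int id;        // unique ID of a spatial/temporal pattern
        float        strength;  // how strongly the pattern is present
    };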
The output data of nodes is already of the type sequenceID, but the input data
needs to be converted to this data type by the virtual senses (virtualEyeball and
virtualCochlea).
A simplified version of the class diagram for the framework can be seen in Figure 5.1. The flow of data can be seen in Figure 5.9.
Figure 5.1: Simplified class diagram, showing the most important classes (Network, i_sense, Region, RegionData, Node, SpatialPooler, TemporalPooler and AdjacencyMatrix) and the methods used to build a network.
Parallel computing was kept in mind during the design but was not implemented,
since the goal was to show the capabilities of the HTM algorithm. An analysis for
future development with parallel computing is found in section 5.6 on page 53.
5.3.1 General Data Container
The class regiondata is the general data container of a single region. It stores two
vectors of pointers to instances of the type sequenceID. One vector is the output data
and the other is used for input data. It also provides methods to insert new pointers
and to get a chunk of pointers, based on the dimensionality of the data. Each region
has two of these data containers. One stores the biasing information and another one
stores the data values.
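To make the container concept concrete, the following is a minimal C++ sketch of such a data container, reusing the sequenceID sketch from above. Apart from getChunkDataFromInput, which appears in the class diagram, the member and method names are illustrative; the two dimensional chunk layout is omitted for brevity.

#include <cstddef>
#include <vector>

class regiondata {
    std::vector<sequenceID*> inputData;   // pointers to incoming values
    std::vector<sequenceID*> outputData;  // pointers to outgoing values
public:
    void addInputPointer(sequenceID* p)  { inputData.push_back(p); }
    void addOutputPointer(sequenceID* p) { outputData.push_back(p); }

    // Return the chunk of input pointers belonging to one node,
    // based on a flat one dimensional layout.
    std::vector<sequenceID*> getChunkDataFromInput(std::size_t nodeindex,
                                                   std::size_t chunksize) {
        std::size_t begin = nodeindex * chunksize;
        return std::vector<sequenceID*>(inputData.begin() + begin,
                                        inputData.begin() + begin + chunksize);
    }
};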
The next sections describe how data is transformed to be used with the HTM algorithm.
5.3.2 Two Dimensional Data
In this thesis, the two dimensional data are images. Real world images are often noisy, and the varying rotation and scaling lead to complications during processing.
The human eye can focus on different objects of different scales and the distortion the
image gets through the eye lens is helpful to resolve the scaling problem [HCB07].
This improvement and the fact that it is a more natural approach were the reasons to
simulate this property of the eye.
Since the human eyes can see more detail with the fovea centralis7 than with the
rest of the retina, a fisheye-lens-distortion filter was applied to the image to simulate
these distortions (See Figure 5.2). Two advantages of this distortion are the larger
field of view and the improved invariance to scaling.
Additionally, a median filter was applied after the distortion to eliminate noise.
Since humans perceive contrast edges amplified, a DoG8 filter was applied and the
result was used as an additional image, seen in the top right of Figure 5.2.
7 An area at the center of the retina with more cones than elsewhere on the retina. It is responsible for seeing details like the characters of this footnote.
8 Difference of the Gaussians, the difference of two blurred images, one of which is more blurry. The result is an amplification of edges in the original image.
Figure 5.2: The fisheye distortion is seen on the left side. The same image with additional filtering is shown on the right. In the top right corner is the result of an edge detection and the bottom shows the three extracted colors.
Color extraction
To get the correct color information from the image, the raw RGB channels are not useful. Since the RGB channels specify the color of a pixel through additive mixing, a white pixel would use full red, green and blue values. All gray tones are mixed from all colors at the same level, which can be seen on the left of Figure 5.4. Therefore a method to eliminate all gray tones was used:
The image was first transformed to the HSV format. The HSV format stores pixel
colors as a triplet consisting of a hue value, representing the color tone, a saturation
level and the brightness value of the pixel as seen in Figure 5.3.
Figure 5.3: HSV Cone [Pie]
The values V and S of the image were adjusted to eliminate all gray tones.
V = S · V    (5.1)
S = 1        (5.2)
If the saturation is multiplied by the value, the result will be low whenever the value is low. A low value means a darker color and therefore a less colorful light (the HSV cone gets thinner at the bottom). A low saturation means the color is more gray than colorful. If both values are combined, the colorful areas in images can be extracted. The second step to accomplish this is setting the saturation to its maximum.
This eliminates the possibility of colors being mixed with white. Imagined in three
dimensional space, the HSV cone is completely hollow after applying equations 5.1
and 5.2. The results can be seen in Figure 5.4 and in Figure 5.2 at the bottom.
Figure 5.4: Left: Original image with the corresponding color channels. Right: Altered image with the extractable RGB colors.
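As an illustration of equations 5.1 and 5.2, a minimal C++ sketch is given below, assuming the image is already converted to HSV triplets with s and v in [0, 1]; the type and function names are not from the thesis source.

#include <vector>

struct HSVPixel { float h, s, v; };

// Eliminate gray tones: darken weakly saturated pixels (5.1),
// then set the saturation to its maximum (5.2).
void eliminateGrayTones(std::vector<HSVPixel>& pixels) {
    for (HSVPixel& p : pixels) {
        p.v = p.s * p.v;  // (5.1) gray or dark pixels become dark
        p.s = 1.0f;       // (5.2) colors can no longer mix with white
    }
}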
Covert attention mechanism
The HTM algorithm forms temporal groups of patterns that occur next or near to each
other on the timeline. Because the input images in the described scenario are still
images, a mechanism to produce sequences of images is needed. This mechanism
is inherent in video streams, where the changes in the input data are automatically
related to real world changes. In still images, like the medical photographs of this
thesis, a virtual eye movement resembling a saccade9 can replace this mechanism.
The first implementation used a Harris Corner Detection algorithm [HS88] to determine interesting points and move between them. This approach was tried because eye tracking experiments have shown that humans unconsciously focus on edges and corners [Att54]. The goal was to simulate this behavior and find similar structures in different images with this approach. This approach was not successful because the movement of the virtual eyeball was too random to form useful and stable temporal patterns.
9 Small unconscious movement of the eye to interesting details.
Since the implementation in [HG06] was using a linear and simple movement, a linear scanning of the image was chosen as the second implementation. Each pixel in the image is processed by the algorithm and the structures of the image gradually move in front of the virtual eye, which increases the quality of the constructed temporal patterns. This movement is outlined in Figure 5.5.
Figure 5.5: The small rectangle (128 by 128 px) is the virtual eye that moves along the arrows. This movement results in the processing of the whole input image (big rectangle, 2088 by 1550 px).
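A minimal sketch of this linear scanning movement is given below, assuming a square virtual eye that is shifted pixel by pixel over the image; the function name and the step width of one pixel are illustrative.

#include <utility>
#include <vector>

// Produce the top-left positions of the virtual eye for a full scan.
std::vector<std::pair<int, int>> scanPositions(int width, int height,
                                               int eyeSize) {
    std::vector<std::pair<int, int>> positions;
    for (int y = 0; y + eyeSize <= height; ++y)     // move down row by row
        for (int x = 0; x + eyeSize <= width; ++x)  // sweep across the row
            positions.push_back({x, y});
    return positions;
}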
5.3.3 One Dimensional Data
The ECG data is of one dimensional nature and is analyzed with a Fast Fourier Transformation to check the frequencies of the signal. This is done because a single sample does not carry enough information. A frequency analysis of a set of sequential samples is like multiple sine waves combined and therefore carries more information than a single sample alone. The low frequency amplitudes, for example, are of special interest for ECG analysis, since the heartbeat itself occurs at a low frequency.
A virtual cochlea was designed to transform the incoming data samples to a vector
of frequency strengths.
This is accomplished with the Fast Fourier Transformation on the ECG Data. To
calculate frequencies, a single value of the ECG dataset is not useful. Thus a sliding
window is used. A sliding window is a set of sequential values from the input stream.
Sequential calculations take subsequent sets of data from the input data stream; the window 'slides' through the dataset.
It is possible to eliminate noise by using more than one sliding window of the signal and calculating the mean values. This leads to a distortion when the values at the borders of two windows do not match, because a jump is measured, which causes the leakage effect at high frequency levels. Such a jump can be seen in Figure 5.6 and the resulting distortion in Figure 5.7.
Figure 5.6: The window size (16) in this example does not match the sine signal of 1 kHz. Another window starts at 17 and this results in a jump. Image from [Kam89].
Figure 5.7: The analysis contains leakage around the peak at 1 kHz. Image
from [Kam89].
To prevent this distortion, a window function was multiplied with each window. Window functions typically fall to zero at the edges. This leads to matching values at the borders of multiple windows and prevents distortions.
A Blackman-Harris window function was used, since it provides the needed characteristics and was successfully used for ECG analysis in the past [Har78]. The function value w(n) is multiplied with each window (window size M) before the analysis.
w(n) = a0 + a1 cos(2π n/M) + a2 cos(2π 2n/M) + a3 cos(2π 3n/M)

with n = −M/2, ..., M/2 − 1 and

a0 = 0.35875, a1 = 0.48829, a2 = 0.14128, a3 = 0.01168
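Applying the window function before the FFT could look like the following C++ sketch; the function name is illustrative, while the coefficients and the index range are taken from the definition above.

#include <cmath>
#include <vector>

// Multiply one sliding window of M samples with the Blackman-Harris
// window function w(n) before the frequency analysis.
std::vector<double> applyBlackmanHarris(const std::vector<double>& window) {
    const double PI = 3.14159265358979323846;
    const double a0 = 0.35875, a1 = 0.48829, a2 = 0.14128, a3 = 0.01168;
    const int M = static_cast<int>(window.size());
    std::vector<double> out(M);
    for (int i = 0; i < M; ++i) {
        const double n = i - M / 2.0;  // n runs from -M/2 to M/2 - 1
        const double w = a0 + a1 * std::cos(2.0 * PI * n / M)
                            + a2 * std::cos(2.0 * PI * 2.0 * n / M)
                            + a3 * std::cos(2.0 * PI * 3.0 * n / M);
        out[i] = window[i] * w;
    }
    return out;
}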
The outputs of the virtual senses are vectors of sequenceIDs and serve as the input for the nodes in the lowest region of a hierarchy.
5.3.4 Nodes
The whole HTM network is built from units called nodes. These nodes implement the main features of the algorithm and are organized in regions.
The number of nodes in a single region depends on the dimensionality of the input data. If the input data has two dimensions, the number of nodes needs to be n², where n is the side length of the input data divided without remainder by the side length of the input chunk for a single node. An input chunk is a subset of the input data. In two dimensions it is a square and therefore has a side length.
For example: The input size is 256 by 256 values and the side length of the input can be divided without remainder by 8. The result is 256/8 = 32 and the number of nodes is 32² = 1024.
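In code, this computation is a one-liner; the sketch below assumes the side lengths divide evenly, as required above.

// Number of nodes for two dimensional input, e.g. nodeCount2D(256, 8) == 1024.
unsigned int nodeCount2D(unsigned int inputSide, unsigned int chunkSide) {
    unsigned int n = inputSide / chunkSide;  // 256 / 8 = 32
    return n * n;                            // 32 * 32 = 1024
}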
Node objects store only the time adjacency matrices, the patterns they currently perceive and the temporal and spatial patterns they observed earlier. All input data, output data (input for higher nodes in the hierarchy) and prediction data (bias data for lower nodes) is stored in regions and is accessible through pointers.
This leads to faster processing, since the nodes can operate on the data through pointers and the results do not need to be copied. While this makes parallel processing more difficult to implement, the processing speed increases significantly due to the use of pointers.
Each node contains a temporal and a spatial pooler which are described in the following sections.
5.3.5 Spatial Pooler
Real world data almost always has noise in it. Besides that, signals coming from a real world source almost never repeat in exactly the same way. If the raw input data were stored as a pattern for later recognition, it would be unlikely that this pattern would show up again. The other downside is the amount of space needed to store all the raw patterns.
The fact that similar input patterns correspond to similar causes and the fact that not the whole input space is used make it possible to quantize10 the incoming data. This greatly reduces the needed memory as well as the processing power and at the same time eliminates some of the noise.
Each pattern consists of an ID and a strength value (sequenceID datatype). In the
realm of images, the strength is the brightness of a pixel and in the realm of one
dimensional signal processing it is the amplitude of a particular frequency.
This strength is quantized to a chosen number of possible values, effectively reducing the input space. The number of these values is chosen by the user at program start.
After this quantization, the incoming pattern is compared to the stored patterns. For all entries of a pattern a difference is calculated:

d = (1/N) Σ_{j=1}^{N} |x[j] − y[j]|    (5.3)

where N is the length of the patterns and x and y are the patterns to compare. If this value d is below a threshold the user has chosen at program start, the patterns are considered to be the same.
If a similar pattern is found, the spatial pooler outputs the pattern ID of that pattern. If there is no similar pattern, it will create a new ID, assign it to the new pattern and output it. IDs are used by the temporal pooler and the adjacency matrix to form temporal patterns later in the process.
10 By transforming a large input variety to a lower variety. For example rounding floating point values to get integer values.
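A minimal C++ sketch of this matching step is given below, assuming already quantized patterns of equal length; the names are illustrative and only the difference measure follows equation 5.3.

#include <cmath>
#include <cstddef>
#include <vector>

struct StoredPattern { unsigned int id; std::vector<float> values; };

// Return the ID of a similar stored pattern, or store the new pattern
// under a fresh ID if none is close enough.
unsigned int matchOrCreate(std::vector<StoredPattern>& known,
                           const std::vector<float>& input,
                           float threshold, unsigned int& nextId) {
    for (const StoredPattern& p : known) {
        double d = 0.0;
        for (std::size_t j = 0; j < input.size(); ++j)
            d += std::fabs(input[j] - p.values[j]);
        d /= input.size();                 // mean absolute difference (5.3)
        if (d < threshold) return p.id;    // similar enough: reuse the ID
    }
    known.push_back({nextId, input});      // unknown: store with a new ID
    return nextId++;
}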
5.3.6 Temporal Pooler
The temporal pooler is used to find sequences in the given input. The input of the
temporal pooler is prefiltered by the spatial pooler and the adjacency matrix.
The temporal pooler looks for common sequences of already known spatial patterns. If two patterns follow each other every time, they will with high probability be grouped into a temporal pattern and a sequenceID for this pattern is generated. If the same sequence occurs a second time, the known ID of that temporal pattern is reused. This ID generation is the same process as in the spatial pooler.
The temporal pooler will wait until the adjacency matrix is stable enough. This is
the case when no more new patterns are inserted for a long timespan. If the matrix is
stable, the greedy algorithm described earlier in section 4.6 on page 25 will calculate
the temporal patterns out of groups of spatial patterns.
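One possible reading of this stability criterion is sketched below; the timespan constant is an assumption, as the thesis only states "a long timespan".

// Track how long ago the last new pattern was inserted into the matrix.
class StabilityTracker {
    unsigned long stepsSinceNewPattern = 0;
public:
    static const unsigned long STABLE_AFTER = 10000;  // assumed timespan
    void onStep(bool newPatternInserted) {
        stepsSinceNewPattern = newPatternInserted ? 0
                                                  : stepsSinceNewPattern + 1;
    }
    bool isStable() const { return stepsSinceNewPattern >= STABLE_AFTER; }
};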
5.3.7 Time Adjacency Matrix
To find adjacent patterns in time, an adjacency matrix is used. This matrix has the
known spatial pattern IDs as row and column indices. Each row represents a list of
probabilities of the follow up pattern IDs. The sum of a row is always 1 because it is
normalized with each new entry.
Figure 5.8: Typical time adjacency matrix. The row indices are the previous pattern IDs and the column indices are the follow up pattern IDs. The darker the entry, the higher the probability of the transition.
For faster access to the data and a faster normalization, the rows are organized as
linked lists.
The update function has two parameters, the old ID and the new ID. The first ID selects a list (the row) and the second ID selects the entry of this list (the column). If the entry does not exist, it will be added. After increasing the element in the list, the list will be normalized.
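A minimal C++ sketch of this update is given below; how exactly the element is increased before normalization is not specified, so a simple increment-and-renormalize scheme is assumed, and the names are illustrative.

#include <list>
#include <unordered_map>

struct Entry { unsigned int id; float prob; };

class timeAdjacencyMatrix {
    // One linked list of (follow up ID, probability) entries per row.
    std::unordered_map<unsigned int, std::list<Entry>> rows;
public:
    void update(unsigned int oldId, unsigned int newId) {
        std::list<Entry>& row = rows[oldId];  // the old ID selects the row
        float sum = 0.0f;
        bool found = false;
        for (Entry& e : row) {
            if (e.id == newId) { e.prob += 1.0f; found = true; }
            sum += e.prob;
        }
        if (!found) { row.push_back({newId, 1.0f}); sum += 1.0f; }
        for (Entry& e : row) e.prob /= sum;   // the row sums to 1 again
    }
};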
5.3.8 Node Input
The input of a node comes either from a virtual sense or from a lower region. The node does not need to know what it is connected to, because the input data always consists of a set of sequenceIDs. Additionally, each node gets a bias from higher regions. This bias is an expectation of the higher region and is equivalent to a sequenceID of the temporal pooler.
5.3.9 Node Output
A node has two outputs. One is a vector of sequenceIDs with the probability of one or more sequences occurring right now. At each time step, the currently discovered spatial patterns are compared to all known spatial patterns in the formed temporal patterns to find the best matching temporal patterns. The top temporal patterns are stored as the output signal to the higher regions in the HTM network.
The nodes on top of the hierarchy are a special case. Their output can either be used for supervised learning or as a direct classification. For supervised learning, a mapping of the temporal pattern IDs to the classes to learn needs to be done. This can be accomplished with conventional classification techniques. Using the output as a direct classification of the input data is useful for unsupervised learning, where the classes are unknown, for example the detection of abnormalities in complex systems11.
The other output of a node is the prediction of the next spatial pattern, which is derived from the current spatial pattern and the adjacency matrix of the node. This biasing sequenceID is passed to lower nodes in the hierarchy, since its ID is equivalent to a temporal pattern ID in the lower nodes.
The connections of nodes to other nodes are managed in regions, which are described in the next section.
5.3.10 Regions
The HTM network is built from different regions. These regions are organized in a
hierarchy, depending on their connection to each other. The lowest level region is
11 For examples see the Numenta website http://numenta.com/about-numenta/customers.php.
Figure 5.9: This diagram shows how the data is stored and which classes point to which data. The world feeds a sense, whose data flows through the regions and their nodes via Data In/Out and Bias In/Out containers up to the network output; the arrows distinguish data flow from pointers.
connected to the sensory input. Depending on the configuration, the highest level
region sends its output to the user (unsupervised learning, just giving outputs) or is
used to train in a supervised manner.
Information always flows up and down. The discovered temporal patterns of a region are passed to the higher regions, and the predictions based on the current spatial patterns are passed to the lower regions as a prediction for the next input.
An overview of the flow of data through a network is pictured in Figure 5.9 on page
42.
The region stores two data containers, one for biasing values and another for data values. Each of these data containers has two vectors, one for incoming data and one for outgoing data.
When regions are connected to each other, the framework manages the data pointers
of each region. The result is that each node in each region has the correct pointers
and can work on this data without the need of managing the pointers. This is done by
passing a pointer to the data container to another region which manages the subsets
of pointers (chunks) inside that data container. In the end, each node receives its
chunk of input data and the pointers to the position where it is meant to store its
output. See Listings 5.1 and 5.2 for details in the source code.
Listing 5.1: Pointer Management for Regions

connectRegions(unsigned int from, unsigned int to)
{
    region* fromRegion = getRegion(from);
    if (to == 0)
    {
        // connect to virtual sensor
        // copy data pointers to input
        fromRegion->addValuesDatacontainerToInput(
            mySensoryOrgan->getSenseData());
        // copy bias pointers to output
        fromRegion->addBiasDatacontainerToOutput(
            mySensoryOrgan->getSenseBias());
        // pointer management of nodes
        fromRegion->prepare();
    }
    else
    {
        // connect to other region
        region* toRegion = getRegion(to);
        // copy data pointers to input
        fromRegion->addValuesDatacontainerToInput(
            toRegion->getValueData());
        // copy bias pointers to output
        fromRegion->addBiasDatacontainerToOutput(
            toRegion->getBiasData());
        // pointer management of nodes
        fromRegion->prepare();
    }
}
Listing 5.2: Pointer Management for Nodes (inside region.prepare())

nodeindex = 0;
foreach (node preparingNode in myNodes)
{
    // data input
    preparingNode->setInputDataPointers(
        dataValues.getChunkDataFromInput(
            nodeindex, chunksize, dimension));
    // bias input
    preparingNode->setInputBiasPointers(
        dataBias.getChunkDataFromInput(
            nodeindex, 1, dimension));
    // bias output
    preparingNode->setOutputBiasPointers(
        dataBias.getChunkDataFromOutput(
            nodeindex, chunksize, dimension));
    // data output
    foreach (sequenceID output in
             preparingNode->getDiscoveredSequenceIDs())
    {
        dataValues.addSingleNodeOutputElement(output);
    }
    nodeindex++;
}
The main class controlling the connections of regions is the network class, described
in the next section.
5.3.11 Network
The network class htmcontroller consists of at least one region and a sensory class. There is also a built-in oscillator, which times the activation of all nodes. This oscillator can be used for realtime inference when the data is only updating slowly. In the proposed scenario this oscillator is set to the maximum frequency, since recorded data is used. This means, for example, that the recorded ECG data can be processed as fast as possible instead of being processed over eight hours in realtime.
This network needs to be controlled by the user in the end user application.
5.4 End User Application
5.4.1 Structure and Design
The end user application hides all non-relevant data and makes it possible to control the application without detailed knowledge of the algorithm. The user should be familiar with some terms in order to understand the parameters he can control.
The application provides a step by step configuration process, making it easier to get to the desired results.
The code needed to write the end user application is minimal, since all work is done in the frameworks. The code only needs to specify how the user interface should look and behave. Additionally, it needs to define how the network has to be configured and run. For an example, see Listing 5.3.
The HTM framework can be controlled with simple commands by the programmer, without detailed knowledge of the algorithm. This is possible because the parameters the user controls have known effects that can easily be explained. For example, the threshold for the similarity of spatial patterns can be seen as a value controlling how many different patterns the algorithm will form out of the input data. The options for the framework are set in a specialized class called dhtmoptions that checks the validity of the entered options.
Listing 5.3: Construction of an HTM network in the source code of the end user application

// prepare the data
Imagedata = new imagesource("train.png");
ImagedataTest = new imagesource("test.png");
// prepare the virtual sense
Eyeball = new simpleEye(true);
Eyeball.setSource(Imagedata);
// prepare the parameters
htmController = new htmController();
htmController.setOscillationFrequency(MAX_FREQ);
htmController.setDistinctMode(true);
// build the network
htmController.addSense(Eyeball, 2);    // type, dimensions
htmController.addRegion(64, 1, 2);     // nodecount, id, dimensions
htmController.addRegion(16, 2, 2);
htmController.addRegion(1, 3, 2);
htmController.addRegion(1, 4, 1);
htmController.superviseRegion(4,
    Eyeball.getSupervisor());          // id, superv.
// prepare the network
htmController.connectRegions(1,
    VIRTUAL_SENSE);                    // to, from
htmController.connectRegions(2, 1);
htmController.connectRegions(3, 2);
htmController.connectRegions(4, 3);
// check for errors
if (htmController.isReady())
{
    // initialization successful
    [...]
} else {
    // error (memory, parameters, ...)
    [...]
}
How this network is controlled by the user is described in more detail in the next
section.
5.4.2 Workflow
First the user chooses the main parameters of the algorithm. The most important parameters are:
Spatial group threshold sets the threshold for the quantization of spatial patterns. The higher this value, the fewer spatial patterns will be formed.
Input parameters is a set of values which depends on the type of data to be analyzed. In the realm of image classification this can control parameters like filtering.
Number of guesses controls how many guesses the next higher region gets as input data.
A screenshot of this step can be seen in Figure 5.10.
After these parameters are set, the user can control the structure of the hierarchical
network.
• Number of regions
• Size of regions
• Dimensionality of regions
• Input type
• Oscillator speed
• Type of learning (supervised or unsupervised)
The default parameters were chosen based on the experience gained during the implementation and on parameters in the literature. The values can be seen in Table 5.1 and they are selected as default parameters in the end user software.
Figure 5.10: Technical demonstration of the configuration screen the user sees on startup of the program (not yet designed to be easily understandable and accessible).
Figure 5.11: Technical demonstration of the configuration screen (not yet designed to be easily understandable and accessible). The user can define the hierarchical structure of the network.
Parameter                     Value
Spatial threshold             0.1
Sensor size                   128
Number of Regions             5
Top-N for greedy algorithm    2

Table 5.1: Default parameters for the HTM algorithm.
Finally the user runs the algorithm, can watch its progress and can have a closer look at nodes and the sensory data. This can be seen in Figure 5.12. The process of learning the spatial patterns is shown in Figure 5.13 on page 52. Additionally, the user can control the updates of the network12. The parameter Spatial threshold can be changed during learning to fine-tune the algorithm.
During the implementation of this application some changes were necessary which
resulted in two iteration cycles for the implementation of the HTM framework.
5.5 Iteration Cycles
Since the application written for this thesis is of experimental nature, multiple iterations were necessary. The reasons for these iterations are some missing details in
[HG06, GJ07], the need to adapt the algorithm to different types of input data and
some contradictions between [HG06, GJ07] and [HB04]. This led to the necessity to
change the implementation after unsatisfactory results were obtained.
5.5.1 First Iteration
The first implementation was using a slightly different method to form temporal
groups. It saved sequences that occurred in the input stream rather than calculating
groups by the transition probability matrix.
Since the greedy method to calculate temporal groups was forming distinct groups, some patterns which normally would be part of several temporal groups were part of only one group. To go beyond first order prediction13, these patterns need to be in several groups and the order of the patterns needs to be retained. This was not the case with the proposed greedy method.
12 i.e. pause the processing
13 Possibility to predict one future step based on one previous step.
Figure 5.12: Technical demonstration of the network display (not yet designed to be easily understandable and accessible). The user can click on nodes (black squares on the right side) to get details about them. More detailed images of the network can also be seen in Figure 5.13.
Figure 5.13: These six screenshots show the network during learning. (1.) The network in its initial state. (2.) The lowest region is learning spatial patterns (the white boxes are simplified versions of adjacency matrices). (3.) The second region gets the first patterns since the lowest region formed temporal patterns from its stable adjacency matrix (stable nodes are colored light blue). (4.) The next regions learn the patterns. Bright green dots indicate that a node has found a known pattern. Dark blue dots indicate an active biasing from a higher node. (5.) The second region is completely stable and the third and fourth regions are learning spatial patterns. (6.) The complete network is stable and can be used now.
However, this resulted in too many temporal groups, and similar groups were stored separately, which led to wrong predictions and classifications even in the lowest regions.
5.5.2 Second Iteration
The second iteration used distinct temporal groups. Each spatial group was used only
in one temporal group. The described greedy algorithm was used.
The downside of this approach is reduced temporal coherency. It is sometimes unable to exactly predict the next spatial pattern of a temporal pattern because it relies only on the adjacency matrix as the source of transition probabilities. The actual occurrence of spatial patterns would be a better source.
The results of this approach were better than the first iteration and will be discussed
in chapter 6 on page 55.
5.6 Distributed and Parallel Computing
The distribution of data was not implemented during this work. Instead, the framework was designed to allow distributed processing in future iterations.
To accomplish distributed computing, the data at the lowest region could be distributed to different machines. The output data of the higher regions needs to be
aggregated at some point. This point needs to be chosen with care, since the overhead of gathering the results from the machines could be greater than the speed gain
of the distribution.
Another idea is parallel learning with different parameters. The networks on several machines could learn the same training dataset. During inference, the results of the several networks are combined, and several methods could be applied to this collection of results. For example, the results can be weighted and the highest guess taken as the classification.
Networks could also be mixed with each other. As described in [HB04], one network
could be trained with three dimensional distributed humidity sensors while another
network is trained with three dimensional distributed air pressure sensors. At some point in the higher regions of these networks, the outputs can be combined and fed into a third network which, for example, tries to predict the weather. An example of how different senses are combined can be seen in Figure 5.14.
Figure 5.14: Combination of three senses in a hierarchy with a single top region. Shown in [HB04].
With the implementation in this thesis, tests with the described scenario were made. The results of these tests will be discussed in the next chapter.
Chapter 6
Experimental results
The algorithm was tested with the proposed datasets of the scenario (cell images and ECG recordings). Unfortunately the results were unsatisfactory, as described in detail in the following sections.
6.1 Classifications
As a first classification test, unsupervised learning with enabled prediction was used
(See iteration one in 5.5.1 on page 50). The goal of the prediction was to eliminate
noise in the input data. Since the description of the algorithm was targeted at black
and white images and the prediction process was not clearly described in [GJ07],
some errors occurred in the prediction process. It was necessary to make the assumption that the bias needs to alter the data of the next input. Unfortunately, the output of the lowest region began to oscillate after the region above it began its prediction, leading to errors in classification. Since the biasing altered the input values, different temporal patterns were found and the prediction of the next higher region was wrong in the next step, which led to another alteration of the input values. Eventually this process ended in oscillations of temporal patterns, as can be seen in Figure 6.1.
The prediction was disabled to continue with the classification (See iteration two in 5.5.2 on page 53). However, a second problem occurred: The parameters for the quantization of the input data could not be tuned to get optimal results. High parameter values led to too general patterns and the classes could not be distinguished by the algorithm (See Figure 6.2 on page 58). Low parameter values resulted in too
Figure 6.1: (N) is the original input image and is unchanging in this observed sequence. (a-k) are the outputs of the lowest region. A black square is predicted in (i) because of the black border on the right side in image (h); the higher nodes predict the square moving further along the X axis and pass down this prediction to the lower nodes. This sequence starts again after (g) and keeps oscillating.
specific patterns, which led to a huge number of discovered patterns (See Figure 6.3). On the one hand this slowed down the whole algorithm, since these patterns need to be searched through when new patterns are processed. On the other hand, as the temporal pattern IDs went up to the next region in the hierarchy, the combination of these patterns led to exponential growth (See Figure 6.3) in higher order patterns, slowing down the algorithm further, and also led to many more classes than exist in the real world data. In this case, no useful classification was possible either.
The exponential growth is unrelated to the training data. The cell images have similar structures in them. These structures should lead to a less than exponential growth once they are learned. Since this is not the case, it can be assumed that some mechanisms in the higher regions are missing or were described in insufficient detail.
After both iterations failed to correctly classify the data, a similar scenario like the
one in [GJ07] was implemented to check if there are errors in the implementation.
The scenario consisted of binary images, representing characters of the alphabet and
simple symbols like squares, triangles and lines. The result of this test is comparable
to the result in [GJ07]. The accuracy is described in the next section.
6.2 Accuracy
The results of the medical scenario are described first. Since the results were unsatisfactory, the accuracy on a much simpler scenario was measured as well.
6.2.1 Medical Scenario
The accuracy of the medical image scenario could not be measured. On the one hand, there were no exact annotations in the images. On the other hand, the results on the ECG recordings already show the problems of the HTM algorithm with complex data. Tests with the image data were made, but the same problems as with the ECG recordings occurred. Therefore only the results on the ECG recordings are described in detail.
Accuracy tests on ECG recordings show error rates between 52% and 77%. Always choosing the main sleep stage for classification would result in a lower error rate of 47%. The effect of the parameters (See Table 5.1 on page 50 for the default values)
Figure 6.2: Four screenshots of a learning network with the parameter spatial threshold set too high. (1.) The network in its initial state. (2.) The lowest region is learning spatial patterns (the white boxes are simplified versions of adjacency matrices). (3.) The nodes that are completely white with one green box inside have only found a single distinct pattern. They are finding the same temporal pattern regardless of their input data. (4.) The highest node has only found a single temporal pattern, which leads to the whole network being useless for classification.
Figure 6.3: Number of stored patterns over time for different threshold options (0.05, 0.1, 0.2 and 0.4). In one case, a horizontal line was drawn. This demonstrates the exponential growth of patterns in each next higher region. The y axis (number of patterns) is logarithmic; the x axis denotes the learning steps.
and a short description of the effects can be seen in Figures 6.6 to 6.9. The effect on the error rate is shown in Table 6.1.
The parameters always led to extreme cases: Either there were too many classes and no clear assignment could be made, or there were too few classes and assignments would lead to insufficient classifications.
However, in the simplified scenario the results can be compared to the results in
[GJ07].
6.2.2 Simple Test Scenario
The accuracy on the training dataset should always be very high. If the accuracy on the training data is low, it means that the algorithm could not find the needed number of distinct features in the data to classify them and will therefore also be unable to classify unknown input.
The test data consists of distorted versions of the training data. Between 2% and 30% of noise was added to the training data to get the test dataset.
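How such a test set could be derived is sketched below for binary images: each pixel is flipped with the given noise probability. The flipping scheme is an assumption; the thesis only states the noise percentages.

#include <cstddef>
#include <random>
#include <vector>

// Add noise to a binary image by flipping each pixel with probability
// noiseLevel (0.02 to 0.30 in the experiments).
std::vector<bool> addNoise(const std::vector<bool>& image, double noiseLevel,
                           std::mt19937& rng) {
    std::bernoulli_distribution flip(noiseLevel);
    std::vector<bool> noisy(image);
    for (std::size_t i = 0; i < noisy.size(); ++i)
        if (flip(rng)) noisy[i] = !noisy[i];
    return noisy;
}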
Parameter change         Value   Error rate
Standard parameters      -       54%
Spatial threshold        0.05    65%
Spatial threshold        0.2     77%
Sensor size              32      65%
Sensor size              64      52%
Sensor size              256     55%
Number of Regions        3       61%
Number of Regions        7       55%
Top-N for greedy         1       58%
Top-N for greedy         3       56%

Table 6.1: Different values for the parameters and their effect on the error rate.
The accuracy on binary images with distortions is given as 66% in [GJ07]. This leads to the conclusion that the algorithm implemented in this thesis is comparable to the implementation in [GJ07].
6.3 Processing Speed
The hierarchical structure of the HTM algorithm has significant influence on the processing speed. Despite the large number of patterns discovered in the training data, the processing time for a single pattern grows only linearly with the pattern count. This can be seen in Figure 6.4.
The number of patterns to check is the only property that affects the speed of processing. Several tests with the parameters were made. The most obvious parameter, the
number of regions, showed almost no effect on the relationship between the number
of patterns and the processing speed as can be seen in Figure 6.5.
Figure 6.4: Speed of a single processing step (y axis) in relation to the number of patterns (x axis). The speed (i.e. number of processing steps per timeframe) decreased linearly with the increase of patterns.
Figure 6.5: The three lines (3, 5 and 7 regions) show the same slowdown in processing speed with a growing number of patterns, despite using a different number of regions. This effect also shows up with other parameters.
Dataset              No. of Classes   Accuracy
Characters (Train)   47               100%
Characters (Test)    47               70%
Symbols (Train)      16               100%
Symbols (Test)       16               75%

Table 6.2: Results of the simple test scenario.

6.4 Influence of Parameters
Tests with various parameter settings were made to check whether the right parameters were chosen to classify the scenario data. The effects on the number of found patterns can be seen in Figures 6.6 to 6.9, and their influence on the error rate of the classification is shown in Table 6.1.
Figure 6.6: The main parameter is the threshold for the difference of spatial groups (curves for 0.05, 0.1 and 0.2). If it is low, more patterns will be found.
Figure 6.7: The parameter topN for the greedy algorithm (curves for 1, 2 and 3) controls how many spatial groups are examined in each step of the greedy algorithm. If it is a high value, fewer temporal groups will be formed and eventually the number of patterns is lower.
An interpretation of these results is given in the next section.
Figure 6.8: The sensor size (curves for 64, 128 and 256) controls how much of the input data is processed in one step. It also controls how many details of the input data are processed. A big sensor leads to more complex patterns and to a higher pattern count.
Figure 6.9: This is the region count parameter (curves for 3, 5 and 7) and it controls the structure of the network. A higher number of regions leads to more possible combinations and can discover more complex patterns.

6.5 Comparison
Reaching only 75% on the very simple task described in section 6.2.2 is not a desirable result, as there are implementations that can reach almost 100% accuracy on even harder tasks in the realm of character recognition. The popular MNIST Database1 with over 58000 handwritten digits by 500 unique writers lists algorithms reaching 99.61% accuracy [RPCL06] on the test data.
Despite the better results of these algorithms, the comparison is of limited use. The
HTM algorithm is a general algorithm and works with different types of data. It is
not expected to be better than a specialized algorithm as described in section 3.4 on
page 14.
According to the No Free Lunch Theorem and the experience with the HTM algorithm, it should be possible to prepare the medical images and to alter the algorithm to match this image task. With these steps, the algorithm would get better at classifying images, but would lose the ability to process a different kind of data like the ECG recordings.
The next chapter will discuss possible improvements and further research possibilities.
1 http://yann.lecun.com/exdb/mnist/
Chapter 7
Conclusion and Outlook
7.1 Summary
In the first chapters, this work described and analyzed intelligence in general and
the HTM algorithm in detail. The HTM algorithm tries to accomplish the predictive
abilities that seem to have a big impact on the overall intelligence of beings. The
theory is based on the neocortex and the hierarchical structure of the regions in it.
The no free lunch theorem states, and [BL07] argues, that no general learning algorithm is possible.
Hierarchical approaches to recognition and classification became popular with the
hierarchical neural network Neocognitron by Fukushima [Fuk88] and are used in
many modern recognition models and algorithms [FL07, FG01, SWP05, ZLH+ 08].
Despite the very well reasoned approach and ideas in [HB04], the implementation of
the proposed algorithm in [GJ07] did not fulfill the stated goal of this thesis. There
are two different reasons why this is the case.
One is the problem of adapting the described algorithm from [GJ07] to big input datasets like the 2088 by 1550 pixel 24 bit medical images. The prediction and noise reducing algorithms are not clearly described in [GJ07] and could therefore not be implemented correctly for the scenarios of this work.
The second reason is the inability of the implemented algorithm to handle the type of data of the scenarios. The stability of the algorithm and its generality, as stated in [HG06], is inconsistent with the findings of this work. A lot of effort is needed to prepare the input data to be easily classifiable by the HTM algorithm. The same
problems on more complex data occurred in the early use of Boltzmann machines,
which share some properties with HTM networks [J. 95].
Since the authors have published a new whitepaper on the HTM algorithm with extensive changes to the algorithm [HAD10], it is probably the latter reason why the
implementation did not achieve the stated goal. Additionally, [RRV10] also tested the
HTM algorithm on more complex data and only got around 30% accuracy. However,
they were able to alter the HTM algorithm to get above 90% accuracy. Another sign
of problems with the algorithm is the lack of comparable benchmarks on datasets
like the MNIST database, the NORB dataset1 or the Caltech-101 dataset2 .
[HG06] states that the system is robust even when a non-optimal number of nodes or regions is used. This may be true for simple training data but could not be confirmed by this work on more complex input data.
The description of the algorithm also limits its capabilities by fixing the number of output values of single nodes. The reason for this decision is not given in [GJ07].
During the work on this thesis, additional discussions about the HTM algorithm were found. Several people have read about the algorithm and state that the technique, while promising, is not new and has been tested already. They also say that the goal of a general algorithm will not be reached by the HTM algorithm3,4.
However, a good result is the processing speed of the algorithm. While the number of patterns grows exponentially, the processing time of inference grows only linearly. A plot of the processing speed can be seen in Figure 6.4 on page 61. This performance on a large hierarchical dataset can be a motivation for further research on this topic.
7.2 Possible Improvements and Future Work
The huge amount of memory needed is a hint that the algorithm is not working like the neocortex in brains, since the neocortex has a limited capacity.
1 http://www.cs.nyu.edu/~ylclab/data/norb-v1.0-small/
2 http://www.vision.caltech.edu/feifeili/Datasets.htm
3 Carnicelli, Jim, in his blog http://www.alexandria.nu/ai/blog/entry.asp?E=41 (Accessed on March 18th, 2011)
4 Discussion on Reddit about the paper by Jeff Hawkins http://www.reddit.com/r/MachineLearning/comments/a0mwd/dear_reddit_machine_learning_is_jeff_hawkings/ (Accessed on March 18th, 2011)
This problem could be solved in the future by using better spatial and temporal poolers on a global scale, since many patterns in the different nodes were actually the same.
The exponential growth of patterns in the higher regions could be solved by other quantization methods which incorporate the sequential order of the spatial patterns and form kinds of fuzzy sets of patterns.
The covert attention mechanism is very simple. It could improve the results if the movement of the virtual eye depended on what it processes instead of just being a linear movement over the whole picture. This is different from the first iteration, since the network itself would control the virtual eye.
[GJ07] describes that nodes switch to an inference state. This will not work in dynamic environments like photographs of real things, since there are always new features that could be extracted and used for better prediction. Therefore an online learning variant of the second iteration (See section 5.5.2 on page 53) could be useful in dynamic environments.
An implementation of the extensively altered HTM algorithm in [HAD10] could also
be tried with the scenarios of this work. However, it may be useful to wait for results
of benchmarks of this new algorithm.
Further research in the related work is the most promising way to improve the results.
Deep Belief Networks (also see section 1.1 on page 2) have shown good results
and share a large amount of concepts with the HTM algorithm. Combining both
conceptual approaches gives a promising field of future research.
List of Figures

1.1 Correctly classified images in HTM whitepaper . . . 4
2.1 Sample cell image . . . 7
2.2 Sample ECG Data . . . 7
4.1 Layers in the Neocortex . . . 18
4.2 Regions of the neocortex of a macaque monkey . . . 19
4.3 Problemspace for humans and machines . . . 22
4.4 Overview of the elements of the HTM Algorithm . . . 23
4.5 Mixed hierarchies . . . 27
4.6 Two dimensional hierarchy of regions . . . 28
4.7 One dimensional hierarchy of regions . . . 28
5.1 Simplified Class Diagram . . . 32
5.2 Fisheye distortion . . . 34
5.3 HSV Cone . . . 34
5.4 Virtual Eyeball: Color Extraction . . . 35
5.5 Linear scanning of the image data . . . 36
5.6 Jump between multiple sliding windows . . . 37
5.7 Leakage effect of multiple sliding windows . . . 37
5.8 Time Adjacency Matrix . . . 40
5.9 Dataflow Diagram . . . 42
5.10 Configuration screen of the end user application . . . 48
5.11 Configuration Screen for the network . . . 49
5.12 Main screen of the end user application . . . 51
5.13 Screenshots of a network during learning . . . 52
5.14 Combination of networks . . . 54
6.1 Oscillation Problem of the First Iteration . . . 56
6.2 Screenshots of a network learning with wrong parameters . . . 58
6.3 Plot of the number of stored patterns with different parameters . . . 59
6.4 Plot of the processing speed in relation to the stored patterns . . . 61
6.5 Comparison: Processing Slowdown of different parameters . . . 61
6.6 Effect of parameter threshold for spatial groups . . . 62
6.7 Effect of parameter top-N for greedy algorithm . . . 62
6.8 Effect of parameter sensor size . . . 63
6.9 Effect of parameter region count . . . 63
Bibliography

[Att54] F. Attneave. Some informational aspects of visual perception. Psychological Review, 61(3):183-193, 1954.

[BL07] Yoshua Bengio and Yann LeCun. Scaling Learning Algorithms towards AI, chapter 14, pages 1-41. MIT Press, 2007.

[Bra] BraInSitu. Layers, image from www.nibb.ac.jp.

[BYRTK03] Paul Bach-Y-Rita, Mitchell Tyler, and Kurt Kaczmarek. Seeing with the Brain. International Journal of Human-Computer Interaction, 15(2):285-295, 2003.

[CBLC04] G. Castellano, L. Bonilha, L. M. Li, and F. Cendes. Texture analysis of medical images. Clinical Radiology, 59(12):1061-1069, 2004.

[CC09] Andrew N. Clarkson and S. Tomas Carmichael. Cortical excitability and post-stroke recovery. Biochemical Society Transactions, 37(Pt 6):1412-1414, 2009.

[FG01] F. Fleuret and D. Geman. Coarse-to-Fine Face Detection. International Journal of Computer Vision, 41(1):85-107, 2001.

[FGG97] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian Network Classifiers. Machine Learning, 29(2):131-163, 1997.

[FL07] Sanja Fidler and Aleš Leonardis. Towards Scalable Representations of Object Categories: Learning a Hierarchy of Parts. 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8, 2007.

[Fuk88] K. Fukushima. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks, 1(2):119-130, 1988.

[FV91] D. J. Felleman and D. C. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1):1-47, 1991.

[Gar07] Saulius Garalevicius. Memory Prediction Framework for Pattern Recognition: Performance and Suitability of the Bayesian Model of Visual Cortex. In Artificial Intelligence, 2007.

[GJ07] Dileep George and Bobby Jaros. The HTM learning algorithms. 1:1-44, 2007.

[GLL+04] Frédéric Gougoux, Franco Lepore, Maryse Lassonde, Patrice Voss, Robert J. Zatorre, and Pascal Belin. Pitch discrimination in the early blind. Nature, 430(July):309, 2004.

[Goe93] Ben Goertzel. The Structure of Intelligence: A New Mathematical Model of Mind. Springer, 1993.

[GP07] Ben Goertzel and Cassio Pennachin. Artificial General Intelligence, volume 289. Springer, 2007.

[GW08] Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing, volume 49 of Texts in Computer Science. Prentice Hall, 2008.

[HAD10] Jeff Hawkins, Subutai Ahmad, and Donna Dubinsky. HTM Cortical Learning Algorithms. Numenta Inc, 2010.

[Har78] F. J. Harris. On the use of windows for harmonic analysis with the discrete Fourier transform. Proceedings of the IEEE, 66(1):51-83, 1978.

[HB04] Jeff Hawkins and Sandra Blakeslee. On Intelligence: How a New Understanding of the Brain will Lead to the Creation of Truly Intelligent Machines. Henry Holt, 2004.

[HCB07] Peter Hansen, Peter Corke, and Wageeh Boles. Scale Invariant Feature Matching with Wide Angle Images. QUT Digital Repository, pages 1689-1694, 2007.

[HG06] Jeff Hawkins and Dileep George. Hierarchical Temporal Memory. Numenta Inc, pages 1-4, 2006.

[Hof94] Douglas R. Hofstadter. Metamagicum: Fragen nach der Essenz von Geist und Struktur. Dt. Taschenbuch-Verl., 1994.

[HOT06] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

[HS88] Chris Harris and Mike Stephens. A combined corner and edge detector. In M. M. Matthews, editor, Alvey Vision Conference, volume 15, page 50, Manchester, UK, 1988.

[J. 95] J. H. J. Lenting. Representation issues in Boltzmann machines. Lecture Notes in Computer Science, 931:131-144, 1995.

[Kam89] K. D. Kammeyer and K. Kroschel. Digitale Signalverarbeitung: Filterung und Spektralanalyse. B.G. Teubner, Stuttgart, 1989.

[Min88] Marvin Minsky. The Society of Mind. Simon & Schuster, 1988.

[Mou78] V. B. Mountcastle. An organizing principle for cerebral function: The unit module and the distributed system, pages 21-42. MIT Press, 1978.

[Pav27] I. P. Pavlov. Conditioned Reflexes. Oxford University Press, 1927.

[Pie] Eric Pierce. Hue, image from Wikipedia (de).

[PSG08] Helmut Prior, Ariane Schwarz, and Onur Güntürkün. Mirror-Induced Behavior in the Magpie (Pica pica): Evidence of Self-Recognition. PLoS Biology, 6(8):9, 2008.

[Rey87] Craig W. Reynolds. Flocks, herds and schools: A distributed behavioral model. ACM SIGGRAPH Computer Graphics, 21(4):25-34, 1987.

[RPCL06] Marc'Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann LeCun. Efficient Learning of Sparse Representations with an Energy-Based Model. Advances in Neural Information Processing Systems, 19:1137-1144, 2006.

[RRLP10] Abigail Z. Rajala, Katharine R. Reininger, Kimberly M. Lancaster, and Luis C. Populin. Rhesus Monkeys (Macaca mulatta) Do Recognize Themselves in the Mirror: Implications for the Evolution of Self-Recognition. PLoS ONE, 5(9):8, 2010.

[RRV10] David Rozado, F. Rodriguez, and Pablo Varona. Optimizing Hierarchical Temporal Memory for Multivariable Time Series. Artificial Neural Networks ICANN, 6353:506-518, 2010.

[Sea80] J. R. Searle. Minds, brains, and programs. Behavioral and Brain Sciences, 3(03):417-457, 1980.

[SS87] D. F. Sherry and Daniel L. Schacter. The evolution of multiple memory systems. Psychological Review, 94(4):439-454, 1987.

[SS08] Omid Sayadi and Mohammad Bagher Shamsollahi. ECG denoising and compression using a modified extended Kalman filter structure. IEEE Transactions on Biomedical Engineering, 55(9):2240-2248, 2008.

[SWP05] T. Serre, L. Wolf, and T. Poggio. Object Recognition with Features Inspired by Visual Cortex. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2:994-1000, 2005.

[Tur50] Alan M. Turing. Computing machinery and intelligence. Mind, 59(236):433-460, 1950.

[Vit11] Vitamin D Video, LLC. Vitamin D surveillance software (http://www.vitamindinc.com), 2011.

[Vos07] Peter Voss. Essentials of General Intelligence: The Direct Path to Artificial General Intelligence, pages 131-157. Springer-Verlag, 2007.

[WM95] D. H. Wolpert and W. G. Macready. No Free Lunch Theorems for Search. IEEE Transactions on Evolutionary Computation, 1(1):67-82, 1995.

[WWC+09] Joanna H. Wimpenny, Alex A. S. Weir, Lisa Clayton, Christian Rutz, and Alex Kacelnik. Cognitive Processes Associated with Sequential Tool Use in New Caledonian Crows. PLoS ONE, 4(8):16, 2009.

[ZLH+08] Long Leo Zhu, Chenxi Lin, Haoda Huang, Yuanhao Chen, and A. Yuille. Unsupervised Structure Learning: Hierarchical Recursive Composition, Suspicious Coincidence and Competitive Exclusion. Science And Technology, 5303:759-773, 2008.