Cognitive Tasks and fMRI data - Systems Immunology

One Class Classification and
Neurocomputation: Some
Issues
Larry Manevitz
Department of Computer Science and Caesarea
Rothschild Institute (Neurocomputation Laboratory)
University of Haifa
Visiting Researcher
CiNet, NICT at Osaka U.
neurocomputation.wordpress.com
• Thank you for the invitation to speak
here.
• A word about myself:
– I am from Israel; in the computer science
department of the Univ. of Haifa; and
– the head of the neurocomputation
laboratory situated at the Caesarea
Rothschild Institute for Interdisciplinary
Applications of Computer Science
– Currently visiting CiNet
• I am also a mathematician – background
in mathematical logic
Goal of Talk
• To give you some familiarity with the
concepts and techniques of one-class
classification
• Especially from the viewpoint of
neurocomputation
• I will try to illustrate some of the issues through case studies
Goals of Talk
• What is One-Class Classification?
• Some techniques used:
– One Class Support Vector Machine
– Bottleneck Neural Networks
• Applications:
– Classification of Text by Subject
– Classification of Visual Cognitive Task
• Early Results
• Improvements via Feature Selection
– Subject Classification for Text
– Use of Genetic Algorithm for fMRI
• Some future ideas (Deep Learning, Coevolution)
[Figures: from Tishby's web page]
Overall Problem
[Diagram: a CLASSIFYing Agent, a SHADOWing Agent, and a Learning Agent with a Neural Network Model of the User, mediating between a Human User and the WWW through an Interface (Web-Browser)]
Why Only Positive
Information?
• Easier to find “typical” examples rather
than typical “non-examples”
• Obtained by observation, we don’t need
an “active” teacher
• Avoiding
– Interrupting the user for rating
– Artificially determining negative
information
• End user: the machine does it for you!!
Here are two examples of
studies
• Text Categorization (almost solved
problem)
• Classification of fMRI data according to
Cognitive task (Very active these
days—started around 2003)
Text Categorization
• Text categorization is defined as the task of assigning a set of documents $D = \{d_1, d_2, \ldots, d_n\}$ into a set of predefined categories or classes $C = \{c_1, c_2, \ldots, c_m\}$.
[Diagram: documents → Classifier → classes]
Description of the problem
• Let C “the corpus” be
the set of documents to
be classified.
• Let T be a subset of C
the class of
“interesting” documents
• Let E be a subset of T,
the positive examples
• The problem is to define
a function (or filter)
using only information
from E that
distinguishes T from the
complement of T.
[Venn diagram: E ⊆ T ⊆ C]
Challenge:
Given an fMRI
• Can we learn to
recognize from the
MRI data, the
cognitive task being
performed?
• Automatically?
[Image: Omer Boehm thinking thoughts; what are they?]
Machine Learning approaches
The classifier tries to
differentiate between the given
category and the other
categories
Documents belonging to the given category are called positive examples, and the remaining documents are called negative examples
Machine Learning Techniques
– Neural Networks
– Support Vector Machines
– Decision Trees
– Inductive Logic Programming
– Fuzzy Logic
– Bayesian Belief Networks
– Self Organizing Maps
– Clustering
– Hidden Markov Models
– Association Rules
Machine Learning Tools
• Neural Networks
• Support Vector Machines (SVM)
• Both perform classification by finding a
multi-dimensional separation between
the “accepted” class and others
• However, there are various techniques
and versions
Kinds of Techniques
• Unsupervised
– Technique makes no assumption about a
priori knowledge
– Useful when not much known
• Supervised
– Attach class labels to data items
– Identify (or learn about) properties that
distinguish classes
Kinds of Techniques
• Unsupervised
– Clustering
– SOMs
• Supervised
– Support Vector Machines
– Neural Networks
– Bayesian Belief Networks
– HMMs
Classification
• 0-class Labeled classification
• 1-class Labeled classification
• 2-class Labeled classification
• N-class Labeled classification
• Distinction is in the TRAINING
methods and Architectures. (In this
work we focus on the 1-class and 2-class
cases) (Note: new interesting questions
for n-class)
Classification
[figure]
Training Methods and
Architectures Differ
• 2-Class Labeling
– Support Vector Machines
– “Standard” Neural Networks
• 1-Class Labeling
– Bottleneck Neural Networks
– One Class Support Vector Machines
• 0-Class Labeling
– Clustering Methods
1-Class Training
• Appropriate when you have a representative sample of the class, but only an episodic sample of the non-class
• System Trained with Positive Examples Only
• Yet Distinguishes Positive and Negative
• Techniques
– Bottleneck Neural Network
– One Class SVM
One Class is what is Important
in some tasks!!
• Typically only have representative data
for one class at most
• The approach is scalable; filters can be
developed one by one and added to a
system.
Bottleneck Neural Network
[Diagram: Input (dim n) → fully connected → Compression (dim k) → fully connected → Output (dim n); trained as the identity function]
The Learning Classifier Design
• Feed-forward neural network with a "bottleneck"
• Three-level network with m inputs, m outputs and k neurons on the hidden level (k < m)
• The network is trained under standard back-propagation to learn the identity function
• The idea is that while the bottleneck prevents learning the full identity function on m-space, the identity on the small set of examples is in fact learnable
• The set of vectors for which the network acts as the identity function is a sort of sub-space which is similar to the trained set. (This avoids the "saturation" problem of learning from only positive examples.)
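To make the design concrete, here is a minimal sketch of such a bottleneck network in PyTorch (not the Matlab toolbox used in this work); the dimensions, epoch count, and the reconstruction-error threshold are illustrative assumptions:

```python
# Bottleneck ("autoencoder") one-class classifier sketch.
# Assumptions: m-dimensional inputs, compression to k < m units,
# membership decided by thresholding the reconstruction error.
import torch
import torch.nn as nn

m, k = 100, 40                       # input dim and bottleneck dim (illustrative)
net = nn.Sequential(
    nn.Linear(m, k), nn.Sigmoid(),   # compression layer
    nn.Linear(k, m),                 # reconstruction layer
)
opt = torch.optim.SGD(net.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

pos = torch.rand(80, m)              # positive training examples only
for epoch in range(200):             # train the identity function
    opt.zero_grad()
    loss = loss_fn(net(pos), pos)
    loss.backward()
    opt.step()

def is_member(x, threshold=0.05):
    """Accept x iff the network reproduces it well enough (threshold assumed)."""
    with torch.no_grad():
        err = loss_fn(net(x), x).item()
    return err < threshold

print(is_member(torch.rand(1, m)))
```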
SVM with Optimal Hyperplane
[figure]
SVM: Non-Separable Case
[figure]
Kernels:
• Polynomial: $K(x, x') = (\langle x, x' \rangle + c)^d$
• Sigmoid: $K(x, x') = \tanh(\kappa \langle x, x' \rangle + \Theta)$
• Gaussian: $K(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$
One-class Outlier-SVM
• Consider all data points
“close enough” to the
origin to be noise or
outliers
• An outlier is a vector that has few non-zero entries (this indicates that the document shares very few items with the chosen feature subset of the dictionary)
• Use standard two-class
SVM
One-Class SVM
• The Schölkopf method adapts the SVM methodology to the one-class classification problem (Schölkopf, Platt, Shawe-Taylor, Smola and Williamson, 1999)
• We want the ball to be as small as
possible while at the same time,
including most of the training data.
• Map the data into the feature space
• Try to use a hyper-sphere to
describe the data and put most of
the data into hyper-sphere.
One-Class SVM

$$\min_{w,\,\xi,\,\rho}\ \frac{1}{2}\|w\|^{2} + \frac{1}{\nu l}\sum_{i=1}^{l}\xi_{i} - \rho$$

subject to $(w \cdot \Phi(x_i)) \ge \rho - \xi_i,\quad i = 1, 2, \ldots, l,\quad \xi_i \ge 0$

• When ν is small, we try to put more data into the "ball". When ν is larger, we try to squeeze the size of the "ball".
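This formulation is available directly in modern toolkits; a hedged sketch using scikit-learn's OneClassSVM (the synthetic data and the choice nu = 0.1 are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(200, 10))            # positive examples only
X_test = np.vstack([rng.normal(0.0, 1.0, size=(10, 10)),  # similar to training
                    rng.normal(5.0, 1.0, size=(10, 10))]) # dissimilar

# nu bounds the fraction of training points treated as outliers:
# a small nu keeps more of the data inside the "ball".
clf = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X_train)
print(clf.predict(X_test))   # +1 = in-class, -1 = outlier
```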
Text Categorization
• Text categorization is defined as the task of assigning a set of documents $D = \{d_1, d_2, \ldots, d_n\}$ into a set of predefined categories or classes $C = \{c_1, c_2, \ldots, c_m\}$.
[Diagram: documents → Classifier → classes]
Description of the problem
• Let C “the corpus” be
the set of documents to
be classified.
• Let T be a subset of C
the class of
“interesting” documents
• Let E be a subset of T,
the positive examples
• The problem is to define
a function (or filter)
using only information
from E that
distinguishes T from the
complement of T.
[Venn diagram: E ⊆ T ⊆ C]
Data Representation: What is
a Document?
• Simplest Idea – Build a dictionary of all
words in all documents
• Use “bag of words” representation
– Vector of all words 1/0 in each dimension
– Frequency Representation of each word in
dictionary in a document
– Tf-idf: term frequency times inverse document frequency (rare words are more important)
– Normalizations
Feature Representation
• Document frequency (DF)
DF is the number of documents in which a term
occurs
• TFIDF (Term Frequency – Inverse Document Frequency):

$$\mathrm{TFIDF} = f(\mathrm{word}) \cdot \log\left(\frac{n}{N(\mathrm{word})} + 1\right)$$

where
– $f(\mathrm{word})$ = the frequency of the word in the document
– $N(\mathrm{word})$ = the number of documents the word appears in
– $n$ = the total number of all the documents
Hadamard Product
• We used the following transformation
$$H_T(e) = \begin{pmatrix} P(e_1 \mid d) \\ P(e_2 \mid d) \\ \vdots \\ P(e_m \mid d) \end{pmatrix} \circ \begin{pmatrix} P(e_1 \mid E) \\ P(e_2 \mid E) \\ \vdots \\ P(e_m \mid E) \end{pmatrix}$$
We discovered that the Hadamard Product enhances performance
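A sketch of the transformation: the componentwise (Hadamard) product of the per-document feature probabilities with the per-example-set probabilities; the probability values below are illustrative:

```python
import numpy as np

# P(e_i | d): estimated from one document; P(e_i | E): from the example set E.
p_doc = np.array([0.10, 0.00, 0.30, 0.05])    # illustrative values
p_class = np.array([0.20, 0.15, 0.25, 0.01])

h = p_doc * p_class    # Hadamard (componentwise) product
print(h)               # the enhanced document representation
```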
Feature Extraction and
Documents Encoding
• Let D be the dictionary of all
words in E, each word is
associated with its frequency in
the list
• Choose the key-words
(feature), which are the m
words that appear in the most
documents of E
• Define an m-dimensional vector consisting of the frequency of appearance of each of the key-words throughout the dictionary
• For each document in E,
associate a vector e of
dimension m
[Example: dictionary D built from E with word frequencies (ct 1614, net 1112, shr 1028, ..., income 130), and a binary document vector (1, 0, 1, ..., 1)]
Feature Selection and
Dimensionality Reduction
• Choosing the more promising words that describe the document content
• Reduce the complexity
• Improve generalization accuracy and avoid overfitting
• We measure how well a given feature or term separates the training examples according to their target classification
• The aim of feature selection is to select a minimal subset of features which still lets the system classify with high performance
Evaluation Measures
$$\mathrm{recall} = \frac{\text{number of items of the category identified}}{\text{number of category members in the test set}}$$

$$\mathrm{precision} = \frac{\text{number of items of the category identified}}{\text{total items assigned to the category}}$$

The F-measure is a weighted combination of recall and precision:

$$F_1 = \frac{2 \cdot \mathrm{recall} \cdot \mathrm{precision}}{\mathrm{recall} + \mathrm{precision}}$$
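As a sketch, these measures computed directly from counts (the counts in the usage line are illustrative):

```python
def f1(identified_in_category, category_members, assigned_to_category):
    recall = identified_in_category / category_members
    precision = identified_in_category / assigned_to_category
    return 2 * recall * precision / (recall + precision)

# e.g. 40 correctly identified, 50 true members, 60 documents assigned
print(f1(40, 50, 60))   # ~0.727
```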
Parameters?
• What features? How many?
• Parameters of Learning Technique?
– SVM (C?)
– NN (size of hidden level? Threshold of
Identity Function?)
– Experimentally determined meanwhile
(more in each example)
Reuters-21578 Text
Categorization Test Collection
Category Name   Num Train   Num Test
Earn            966         2902
Acquisitions    590         1773
Money-fx        187         563
Grain           151         456
Crude           155         465
Trade           133         401
Interest        123         370
Ship            65          195
Wheat           62          186
Corn            62          184
Neural Networks: Comparison of Hadamard and Frequency Representations

            Hadamard                 Frequency
Category    F      R      P         F      R      P
Earn        0.781  0.800  0.763     0.418  0.805  0.282
Acq         0.534  0.593  0.483     0.347  0.363  0.332
Money       0.542  0.641  0.470     0.475  0.420  0.546
Grain       0.415  0.394  0.439     0.379  0.355  0.408
Crude       0.537  0.505  0.573     0.476  0.410  0.566
Trade       0.573  0.600  0.547     0.536  0.513  0.561
Int         0.496  0.416  0.616     0.478  0.405  0.583
Ship        0.393  0.328  0.492     0.388  0.400  0.376
Wheat       0.507  0.446  0.588     0.414  0.430  0.400
Corn        0.310  0.451  0.236     0.315  0.434  0.247
Avg         0.508  0.517  0.520     0.422  0.453  0.430
Why Hadamard?
• Under independence assumptions, one can show that the Hadamard representation gives the Bayesian probability of a document being in the class, given that the feature is both in the example set and in the document.
• But discovered experimentally (and
maybe by accident)
Modified Algorithm for positive
Examples Only
• Prototype (Rocchio's) algorithm
• Nearest Neighbor
• Naïve Bayes
• Distance Based Probabilities
Comparison (Optimized Versions of One-Class Variants)
Results of many experiments (F-measure)

Category   One-class SVM    Outlier-SVM   Neural     Naïve   Nearest    Prototype
           (radial basis)   (linear)      Networks   Bayes   Neighbor
Earn       0.676            0.750         0.714      0.708   0.703      0.637
Acq        0.482            0.504         0.621      0.503   0.476      0.468
Money      0.514            0.563         0.642      0.493   0.468      0.482
Grain      0.585            0.523         0.473      0.382   0.333      0.402
Crude      0.544            0.474         0.534      0.457   0.392      0.398
Trade      0.597            0.423         0.569      0.483   0.441      0.557
Int        0.485            0.465         0.487      0.394   0.295      0.454
Ship       0.539            0.402         0.361      0.288   0.389      0.370
Wheat      0.474            0.389         0.404      0.288   0.566      0.262
Corn       0.298            0.356         0.324      0.254   0.168      0.230
Avg        0.519            0.484         0.513      0.425   0.423      0.426
Macro      0.572            0.587         0.615      0.547   0.530      0.516
Bottom Line (from ~2003)
• One Class can be done at reasonable
levels of success
• Success depended on very careful
choices of representation, parameter
choice, and feature selection
• Both One Class SVM and Bottleneck NNs were "methods of choice"
• SVM was very delicate on choices; NN more robust. SVM needed Boolean representation only; NN can handle TFIDF, etc.
Reading the Mind:
Cognitive Tasks and fMRI data:
the improvement
Omer Boehm, David Hardoon and Larry Manevitz
IBM Research Center, University of Haifa, and University College London
University of Haifa
Cooperators and Data
• Ola Friman: fMRI motor data from Linköping University (currently at Harvard Medical School)
• Rafi Malach, Sharon Gilaie-Dotan and Hagar Gelbard: fMRI visual data from the Weizmann Institute of Science
Challenge:
Given an fMRI
• Can we learn to
recognize from the
MRI data, the
cognitive task being
performed?
• Automatically?
[Image: Omer Boehm thinking thoughts; what are they?]
Our history and main results
• 2003 Larry visits Oxford and meets
ambitious student David.
Larry scoffs at idea, but agrees to work
• 2003 Mitchell's paper on two-class
• 2005 IJCAI Paper – One Class Results at
60% level; 2 class at 80%
• 2007 Omer starts to work
• 2009 Results on One Class – 90% level
– Reason for improvement: we “mined” the correct
features.
• 2014 Experiments with Deep Learning, and second-class "breeding experiments"
What was David’s Idea and Why
did I scoff?
• Idea: fMRI scans a brain while a
subject is performing a task.
• So, we have labeled data
• So, use machine learning techniques to
develop a classifier for new data.
• What could be easier?
Why Did I scoff?
• Data has huge dimensionality
(about 120,000 real values in one scan)
• Very few Data points for training
– MRIs are expensive
• Data is “poor” for Machine Learning
– Noise from scan
– Data is smeared over Space
– Data is smeared over Time
• People’s Brains are Different; both
geometrically and (maybe) functionally
• No one had published any results at that time
• Today this is “almost” a standard tool –
But almost all work is 2-class.
Automatically?
• No Knowledge of Physiology
• No Knowledge of Anatomy
• No Knowledge of Areas of Brain
Associated with Tasks
• Using only Labels for Training Machine
Basic Idea
• Use Machine Learning Tools to Learn from
EXAMPLES Automatic Identification of
fMRI data to specific cognitive classes
• Note: We are focusing on Identifying the Cognitive
Task from raw brain data; NOT finding the area of
the brain appropriate for a given task. (But see
later …)
Machine Learning Tools
• Neural Networks
• Support Vector Machines (SVM)
• Both perform classification by finding a
multi-dimensional separation between
the “accepted” class and others
• However, there are various techniques
and versions
Bottom Line (2005)
• For 2 Class Labeled Training Data, we
obtained close to 90% accuracy (using
SVM techniques).
• For 1 Class Labeled Training Data, we
had close to 60% accuracy (which is
statistically significant) using both NN
and SVM techniques
Classification
• 0-class Labeled classification
• 1-class Labeled classification
• 2-class Labeled classification
• N-class Labeled classification
• Distinction is in the TRAINING
methods and Architectures. (In this
work we focus on the 1-class and 2-class
cases)
Classification
[figure]
Training Methods and
Architectures Differ
• 2-Class Labeling
– Support Vector Machines
– “Standard” Neural Networks
• 1-Class Labeling
– Bottleneck Neural Networks
– One Class Support Vector Machines
• 0-Class Labeling
– Clustering Methods
1-Class Training
• Appropriate when you have a representative sample of the class, but only an episodic sample of the non-class
• System Trained with Positive Examples Only
• Yet Distinguishes Positive and Negative
• Techniques
– Bottleneck Neural Network
– One Class SVM
One Class is what is Important
in this task!!
• Typically only have representative data
for one class at most
• The approach is scalable; filters can be
developed one by one and added to a
system.
Bottleneck Neural Network
[Diagram: Input (dim n) → fully connected → Compression (dim k) → fully connected → Output (dim n); trained as the identity function]
Bottleneck NNs
• Use the positive data to train
compression in a NN – i.e. train for
identity with a bottleneck. Then only
similar vectors should compress and decompress; hence giving a test for
membership in the class
• One-class SVM: use the origin as the only negative example
Computational Difficulties
• Note that the NN is very large (then
about 10 Giga) and thus training is slow.
Also, need large memory to keep the
network inside.
• Fortunately, we purchased what at that
time was a large machine with 16
GigaBytes internal memory
Support Vector Machines
• Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space. [Cristianini & Shawe-Taylor 2000]
• Two-class SVM: we aim to find a separating hyper-plane which will maximise the margin between the positive and negative examples in kernel (feature) space.
• One-class SVM: we now treat the origin as the only negative sample and aim to separate the data, given relaxation parameters, from the origin. For one class, performance is less robust...
Historical (2005)
Motor Task Data: Finger Flexing (Friman)
• Two sessions of data: a single subject flexing his index finger on the right hand
• Experiment repeated over two sessions (as the data is not normalised across sessions)
• The label consists of Flexing and Not Flexing
• 12 slices with 200 time points of a 128x128 window
• Slices analyzed separately
• The time-course reference is built from performing a sequence of 10 tp rest, 10 tp active, ..., up to 200 tp
Experimental Setup Motor Task – NN and SVM
• For both methods the experiment was redone with 10 independent runs; in each, a random permutation of training and testing was chosen
• One-class NN:
– We have 80 positive training samples and 20 positive and 20 negative samples for testing
– Manually crop the non-brain background, resulting in a slightly different input/output size for each slice of about 8,300 inputs and outputs
• One-Class Support Vector Machines:
– Used with Linear and Gaussian Kernels
– Same Test-Train Protocol
• We used the OSU SVM 3.00 Toolbox (http://www.ece.osu.edu/~maj/osu_svm/) and the Neural Network toolbox for Matlab 7
NN – Compression Tuning
• A uniform compression of 60% gave the best results
• A typical network was about 8,300 inputs x about 2,500 compression x 8,300 outputs
• The network was trained with 20 epochs
Results
[figure]
N-Class Classification
[Figure: example scans for the classes Faces, Blank, Pattern, House, Object]
2-Class Classification
[Figure: example scans for House vs. Blank]
Two Class Classification
• Train a network with positive and
negative examples
• Train a SVM with positive and negative
examples
• Main idea in SVM: transform data to a higher-dimensional space where linear separation is possible. This requires choosing the transformation (the "kernel trick").
Classification
[figure]
Visual Task fMRI Data
(Courtesy of Rafi Malach, Weizmann Institute)
• There are 4 subjects: A, B, C and D, with filters applied:
– Linear trend removal
– 3D motion correction
– Temporal high pass, 4 cycles (per experiment), except for D who had 5
– Slice time correction
– Talairach normalisation (for normalizing brains)
• The data consists of 5 labels: Faces, Houses, Objects, Patterns, Blank
Two Class Classification
• Visual Task Data
• 89% Success
• Representation of Data
– An entire "brain", i.e. one time instance of the entire cortex (actually used half a brain), so a data point has dimension about 47,000
– For each event, sampled 147 time points
• Per subject, we have 17 slices of a 40x58 window (each voxel is 3x3 mm), taken over 147 time points (initially 150 time points, but we remove the first 3 as a methodology)
Some parts of data
[figure]
Experimental Set-up
• We make use of the linear kernel. For this particular work we use the SVM package Libsvm, available from http://www.csie.ntu.edu.tw/~cjlin/libsvm
• Each experiment was run 10 times with a random permutation of the training-testing split
• In each experiment we use subject A to find a global SVM penalty parameter C. We run the experiment for a range of C = 1:100 and select the C parameter which performed best
• Experiments on subjects
– For label vs. blank, we have 21 positive (label) and 63 negative (blank) labels (training 14(+) 42(-), 56 samples; testing 7(+) 21(-), 28 samples)
– The training-testing split is as with subject A
• Experiments on combined subjects
– In these experiments we combine the data from B-C-D into one set; each label is now 63 time points and the blank is 189 time points
– We use 38(+) 114(-); 152 samples for training, and 25(+) 75(-); 100 samples for testing
– We use the same C parameter as previously found per label class
Separate Individuals, 2-Class SVM, Parameters Set by A

label vs. blank   Face             Pattern          House            Object
B                 83.21% ± 7.53%   81.78% ± 5.17%   79.28% ± 5.78%   87.49% ± 4.2%
C                 86.78% ± 5.06%   92.13% ± 4.39%   91.06% ± 3.46%   89.99% ± 6.89%
D                 97.13% ± 2.82%   93.92% ± 4.77%   94.63% ± 5.39%   97.13% ± 2.82%
Combined Individuals, 2-Class SVM

Label vs. blank      Face           Pattern         House          Object
B&C&D (combined)     89.5% ± 2.5%   88.4% ± 2.83%   89.3% ± 2.9%   86% ± 2.05%
Separate Individuals, 2-Class, Label vs. Label

label vs. label   Pattern          House            Object
Face              75.77% ± 6.02%   67.69% ± 8.91%   77.3% ± 7.35%
Pattern           -                75.0% ± 7.95%    67.69% ± 8.34%
House             -                -                71.54% ± 8.73%
So Did 2-class work pretty well?
Or was the Scoffer Right or
Wrong? (2005)
• For Individuals and 2 Class; worked well
• For Cross Individuals, 2 Class where one class
was blank: worked well
• For Cross Individuals, 2 Class less good
• Eventually we got results for 2 Class for
individual to about 90% accuracy.
• This was in line with Mitchell's results
What About One-Class?
•SVM – Essentially Random Results
•NN – Similar to Finger-Flexing
Face    57%
House   57%
So Did 1-class work pretty well?
Or was the Scoffer Right or
Wrong? (in 2005)
• We showed one-class possible in
principle
• Needed to improve the 60% accuracy!
Feature Selection? (old slide)
• Can we narrow down the 120,000
features to find the important ones?
• We intend to use different techniques
on this – e.g. binary search with
relearning to focus.
• Alternatively – analyze weights to
eliminate
Concept: Feature Selection?
Since most of the data is “noise”:
• Can we narrow down the 120,000
features to find the important ones?
• Perhaps this will also help the
complementary problem: find areas of
brain associated with specific cognitive
tasks
Relearning to Find Features
• From experiments we know that we can
increase accuracy by ruling out
“irrelevant” brain areas
• So do a greedy binary search on areas, to find areas which do NOT reduce accuracy when removed
• Can we identify important features for
cognitive task? Maybe non-local?
Finding the Features
• Manual binary search on the features
• Algorithm (Wrapper Approach):
– Split brain in "half"
– Redo entire experiment once with each half
– If improvement, you don't need the other half
– Repeat
– If both halves are worse: split brain differently
– Stop when you can't do anything better
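A minimal sketch of this wrapper loop; the evaluate function, which would retrain the one-class classifier on a voxel subset and return its accuracy, is an assumed placeholder:

```python
def greedy_half_search(features, evaluate, max_iters=10):
    """Greedy binary wrapper search: keep a half iff dropping the rest does not hurt.

    `features` is a list of voxel indices; `evaluate(subset)` is assumed to
    retrain the one-class classifier on that subset and return its accuracy.
    """
    best = evaluate(features)
    for _ in range(max_iters):
        if len(features) < 2:
            break
        mid = len(features) // 2
        halves = [features[:mid], features[mid:]]
        scored = [(evaluate(h), h) for h in halves]   # redo experiment per half
        score, half = max(scored)
        if score >= best:          # improvement: discard the other half
            best, features = score, half
        else:
            break                  # both halves worse: stop (or re-split differently)
    return features, best
```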
Binary Search for Features
[figure]
Results of Manual Binary Search
[Chart "Manual Binary Search": average quality over categories (y-axis, 50%-80%) vs. iteration (x-axis, 1-7), one curve each for areas A, B and C]
Results of Manual Binary Search
[Chart "Manual Binary Search": number of features (y-axis, 0-50,000) vs. search depth (x-axis, 1-7)]
Search path (* marks the region carried into the next iteration):

Iteration  [rows, columns, height]    # features   Houses   Objects   Patterns   Blank   Avg
1          [1-17, 1-39, 1-38]         25194        58%      56%       55%        60%     57%
1          [15-33, 1-39, 1-38] *      28158        62%      55%       64%        65%     62%
1          [30-48, 1-39, 1-38]        28158        55%      52%       50%        60%     54%
2          [15-33, 1-39, 1-15]        11115        61%      63%       55%        60%     60%
2          [15-33, 1-39, 13-30] *     13338        69%      68%       72%        70%     70%
2          [15-33, 1-39, 27-38]       8892         58%      57%       60%        60%     59%
3          [15-23, 1-39, 13-30]       6318         63%      69%       68%        62%     66%
3          [20-26, 1-39, 13-30] *     4914         70%      67%       76%        79%     73%
3          [25-33, 1-39, 13-30]       6318         60%      67%       70%        75%     68%
4          [20-23, 1-39, 13-30] *     2808         74%      70%       71%        73%     72%
4          [22-25, 1-39, 13-30]       2808         65%      73%       60%        80%     71%
4          [24-26, 1-39, 13-30]       2106         70%      69%       69%        68%     69%
5          [20-21, 1-39, 13-30]       1404         67%      65%       74%        63%     67%
5          [21-22, 1-39, 13-30]       1404         60%      63%       70%        64%     64%
5          [22-23, 1-39, 13-30]       1404         65%      63%       72%        68%     67%
6          [20-23, 1-18, 13-30]       1296         67%      66%       70%        72%     69%
6          [20-23, 19-39, 13-30]      1512         67%      70%       72%        78%     72%
Too Slow, too hard, not good
enough; need to automate
• We tried a Genetic Algorithm Approach
together with the Wrapper Approach
around the Compression Neural Network
Automate Search Using
Genetic Algorithm
• Encoding technique (gene, chromosome)
• Initialization procedure (creation)
• Evaluation function (environment)
• Selection of parents (reproduction)
• Genetic operators (mutation, recombination)
• Parameter settings (practice and art)
Simple Genetic Algorithm
initialize population;
evaluate population;
while (Termination criteria not satisfied)
{
select parents for reproduction;
perform recombination and mutation;
evaluate population;
}
The GA Cycle of Reproduction
[Diagram: parents are selected from the population for reproduction; children are created, modified (recombination and mutation), evaluated, and returned to the population, while deleted members are discarded]
The Genetic Algorithm
• Gene: Binary Vector of dimension 120,000
• Crossover: Two-point crossover, randomly chosen
• Population Size: 30
• Number of Generations: 100
• Mutation Rate: .01
• Roulette Selection
• Evaluation Function: Quality of Classification
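A compact sketch of this GA over binary feature masks, using the parameters above scaled down for illustration; the fitness function, which in the real experiment wraps the classifier (the wrapper approach), is an assumed placeholder:

```python
import random

DIM, POP, GENS, MUT = 1000, 30, 100, 0.01   # the talk used DIM = 120,000

def fitness(mask):
    # Assumed placeholder: retrain the one-class classifier on the masked
    # features and return its classification quality.
    return sum(mask) / DIM

def roulette(pop, fits):
    """Roulette-wheel selection: probability proportional to fitness."""
    r = random.uniform(0, sum(fits))
    acc = 0.0
    for ind, f in zip(pop, fits):
        acc += f
        if acc >= r:
            return ind
    return pop[-1]

def two_point_crossover(a, b):
    i, j = sorted(random.sample(range(DIM), 2))
    return a[:i] + b[i:j] + a[j:]

def mutate(mask):
    return [bit ^ (random.random() < MUT) for bit in mask]

pop = [[random.randint(0, 1) for _ in range(DIM)] for _ in range(POP)]
for gen in range(GENS):
    fits = [fitness(m) for m in pop]
    pop = [mutate(two_point_crossover(roulette(pop, fits), roulette(pop, fits)))
           for _ in range(POP)]
best = max(pop, key=fitness)   # the "elite member" referred to later in the talk
```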
Computational Difficulties (old
slide)
• Computational: Need to repeat the
entire earlier experiments 30 times for
each generation.
• Then run over 100 generations
• Fortunately we purchased a machine
with 16 processors and 132 GigaBytes
internal memory.
• So these are 80,000 NIS results!
RESULTS on Same Data Set

Filter \ Category   Faces   Houses   Objects   Patterns
Faces               -       84%      84%       92%
Houses              84%     -        83%       92%
Objects             83%     91%      -         92%
Patterns            92%     85%      92%       -
Blank               91%     92%      92%       93%
Finding the areas of the brain?
Remember the secondary question?
What areas of the brain are needed to do
the task?
Expected locality.
Areas of Brain
Visually:
• We do *NOT* see local areas (contrary to expectations)
• Number of Features is Reduced by Search (to
2800 out of 120,000)
• Features do not stay the same on different
runs although the algorithm produces
features of comparable quality
Future Work
• Push the GA further.
– We did not get convergence but chose the elite
member
– Other options within GA
– More generations
– Different ways of representing data points
• Find ways to close in on the areas or to
discover what combination of areas are
important
– Build Histograms to get Brain
Maps
• Use further data sets; other cognitive tasks
• Discover how detailed a cognitive task can be
Coevolution?
• Perhaps we can artificially produce
“good” negative examples by “breeding”
them to defeat the classifier and using
a Genetic Algorithm to evolve them
simultaneously as we breed the optimal
features? A wrapper approach
Deep Learning
• Instead of selecting the features, use
Deep Learning to automatically choose
them?
One-class classification
One-class:
• Training of the system is done only using positive data from a class
• The system is supposed to be able to delineate between the trained class and "everything else"

Two-class:
• Training data is taken from two different classes
• The classifying system is supposed to tell the two classes apart from each other
Our Goals
• Be able to produce One-class filters that incrementally build up
a library of classification filters
• Use Deep learning to replace the extensive feature selection
Deep learning
• A learning technique that implements a multi-layered
network of artificial neurons
• It was shown that stacking machines called Restricted
Boltzmann Machines (RBM) is both efficient and effective
(Hinton 2006)
• Since then, the use of deep networks assembled from
RBMs (also called Deep Belief Networks (DBN) ) produced
outstanding results in a variety of Machine Learning tasks
Restricted Boltzmann Machine
• An RBM is a stochastic neural network
• These RBMs have a simple and efficient learning method
• The restriction on connections within the layers enables parallel training
[Diagram: a layer of visible units fully connected to a layer of hidden units, with no connections within a layer]
• The stacking of the RBMs into a deep network is done by training them separately and then using the output from one RBM as the input to the next one
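As a sketch of this layer-by-layer stacking, using scikit-learn's BernoulliRBM (the layer sizes, learning rate, and random data are illustrative assumptions):

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

X = np.random.rand(100, 64)   # illustrative visible data, values in [0, 1]

# Train each RBM separately; feed its hidden activations to the next one.
layer_sizes = [32, 16]
inputs, rbms = X, []
for n_hidden in layer_sizes:
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                       n_iter=20, random_state=0)
    inputs = rbm.fit_transform(inputs)   # hidden activations of this layer
    rbms.append(rbm)

print(inputs.shape)   # (100, 16): the deep representation
```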
Results
Compression Auto-encoder Results

Faces          Houses         Patterns       Objects        Blank
55.4% ± 1.7%   56.7% ± 1.4%   55.9% ± 1.2%   54.5% ± 1.4%   55.1% ± 1.3%
Alternating Auto-encoder Results

Faces          Houses         Patterns       Objects        Blank
64.3% ± 2.2%   58.2% ± 1.5%   64.1% ± 2.0%   61.7% ± 2.3%   63.5% ± 1.3%
Individual subject testing

            Faces          Houses         Patterns       Objects        Blank
Subject 1   71.1% ± 2.7%   64.7% ± 2.9%   65.0% ± 2.2%   67.0% ± 3.4%   61.1% ± 1.5%
Subject 2   65.3% ± 3.5%   58.6% ± 1.5%   71.9% ± 3.9%   64.0% ± 5.1%   61.7% ± 4.0%
Subject 3   66.8% ± 2.9%   64.0% ± 2.3%   67.9% ± 2.1%   65.1% ± 4.1%   63.0% ± 4.0%
Subject 4   68.0% ± 4.7%   60.9% ± 1.9%   63.2% ± 2.8%   66.7% ± 1.7%   58.9% ± 4.9%
Conclusions and Future work
• Conclusions
– The alternating architecture gives better results than
the compression architecture
– Different subjects create a hierarchy of
reconstruction errors
– Using individual subject testing, the model produces
good results
• Future work
– Generalize the new architecture to contain a specific network for groups of subjects instead of single subjects
– Decrease the difference in reconstruction rate between the subjects
Further Study
• These one class papers are very heavily
cited. You can look up the citations on
Research Gate or my web site.
• You might look at the one-class study for microRNA by Louise Showe and Malik Yousef (et al.)
• I am quite happy to discuss things with
you during August.