
Lecture 9: Pattern Recognition
One area of research that is closely related to (and often overlaps with) machine learning is
that of pattern recognition. The importance of pattern recognition to machine learning and
intelligence in general cannot be overstated.
Pattern recognition: the science whose goal is the theoretical and empirical understanding of how to create systems and algorithms that are able to, among other
things,
• classify objects and data into different categories;
• find patterns in data for the purpose of knowledge discovery and prediction;
• find patterns in sets of experience (feedback) data for the purpose of learning and
evolving;
• create memories of patterns which can be retrieved upon exposure to similar patterns.
Applications of Pattern Recognition:
• Machine vision: finding defects in VLSI chips, blood testing, character recognition,
robotics
• Voice recognition
• Data mining: learning the preferences of a shopper, gene analysis, discovery of trends,
predicting weather and markets
• Security: profiling, fingerprint analysis
Pattern Classification
As we have already seen, pattern classification represents one of the main problems in both
machine learning and pattern recognition. The following are the individual tasks involved in
designing a pattern classifier.
Pattern-classification design example: suppose the problem involves classifying letters
written on an electronic notepad.
• sensors: a 21 × 21 matrix of cells (pixels) which lie underneath the surface of the
notepad. The voltage state of each cell changes (from off to on) when a metallic pen
makes contact directly above the cell.
• feature generation: the cell matrix produces a 21 × 21 binary matrix that is stored
in memory. This pixel data represents the raw, generated feature.
• feature selection: for the purpose of classification, more useful features can be selected from the raw feature. For example, the raw feature may be divided into nine
7 × 7 submatrices, and a feature may be the presence or absence of a subpattern (e.g.
a horizontal segment of four ones); see the sketch after this list.
• classifier design: once the features are selected, they form what is called a feature
vector, which can then be input into a classifier, which will output a classification.
Examples of classifiers include decision trees, Boolean functions, and neural networks.
• evaluation: testing must be performed to see if the percentage of correct classifications
is sufficiently high. Testing will inevitably cause re-evaluation of all previous stages of
development.
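To make the feature-generation and feature-selection stages concrete, here is a minimal sketch (in Python, assuming NumPy) that divides a 21 × 21 binary pixel matrix into nine 7 × 7 submatrices and records, for each one, whether it contains a horizontal segment of four ones. The function name and the exact subpattern test are illustrative choices, not part of any particular notepad system.

```python
import numpy as np

def extract_features(pixels: np.ndarray) -> np.ndarray:
    """Map a 21x21 binary pixel matrix (the raw feature) to a 9-component
    feature vector: one bit per 7x7 submatrix, indicating the presence of
    a horizontal segment of four ones."""
    assert pixels.shape == (21, 21)
    features = []
    for bi in range(3):          # submatrix row index
        for bj in range(3):      # submatrix column index
            block = pixels[7*bi:7*(bi+1), 7*bj:7*(bj+1)]
            # check every length-4 horizontal window inside the block
            has_segment = any(
                block[r, c:c+4].sum() == 4
                for r in range(7) for c in range(4)
            )
            features.append(1 if has_segment else 0)
    return np.array(features)

# example: a single horizontal stroke in the top-left region
pixels = np.zeros((21, 21), dtype=int)
pixels[3, 1:6] = 1
print(extract_features(pixels))   # first component is 1, the rest 0
```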
Classification System Architecture
Artificial Neurons as Classifiers
Artificial neurons are very similar to Boolean gates (AND, OR, and inverter gates), in that they
have inputs and an output. The main difference, though, is that they can receive real-valued
inputs, output real values, and have their functionality modified through a training/learning
process.
Artificial Neuron: a mathematical construct (thought of as a linear threshold function)
which was initially intended to model the behavior of a biological neuron. A biological
neuron has the tendency to synapse upon receiving a sufficient number (i.e. a number
which exceeds its threshold) of synaptic impulses from neighboring neurons. Properties of
an artificial neuron:
• associated with each neuron n is a weight vector $\vec{w}$. Moreover, the dimension k of $\vec{w}$ represents the number of neurons or environmental impulses that feed into n.
• each neuron n also has a threshold T. If the strength of the neuron synapses feeding into n exceeds T, then n itself will synapse (i.e. output a value).
• the strength of the neuron synapses feeding into n is given by the dot product $\vec{w} \cdot \vec{x}$, where $x_i$ represents the i th input into n, which is also the output of the i th neuron feeding into n.
• a discrete neuron will output 1 when $\vec{w} \cdot \vec{x} > T$, and 0 otherwise.
• a smoothed neuron will output $f(\vec{w} \cdot \vec{x} - T)$, where f is a smoothing function, such as $f(x) = \dfrac{1}{1 + e^{-x}}$ (see the sketch following this list).
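A minimal sketch of the two neuron types just described, assuming NumPy; the function names `discrete_neuron` and `smoothed_neuron` are illustrative.

```python
import numpy as np

def discrete_neuron(w: np.ndarray, T: float, x: np.ndarray) -> int:
    """Output 1 when w . x > T, and 0 otherwise."""
    return 1 if np.dot(w, x) > T else 0

def smoothed_neuron(w: np.ndarray, T: float, x: np.ndarray) -> float:
    """Output f(w . x - T) with the logistic smoothing function
    f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) - T)))
```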
A discrete neuron with weight vector $\vec{w}$ and threshold T may be used as a type of linear
classifier. For example, given a feature vector $\vec{x}$, the vector may be classified into one of two
categories depending on whether or not $\vec{w} \cdot \vec{x} > T$. But how to choose $\vec{w}$ and T for a given
set of training vectors?
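Continuing the sketch above, a discrete neuron used this way assigns a feature vector to one of the two categories; the weights and threshold below are made up purely for illustration.

```python
w, T = np.array([1.0, -2.0]), 0.5                     # hypothetical weights and threshold
print(discrete_neuron(w, T, np.array([3.0, 0.0])))    # w . x = 3  > 0.5 -> outputs 1
print(discrete_neuron(w, T, np.array([1.0, 1.0])))    # w . x = -1 < 0.5 -> outputs 0
```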
Perceptron [Neuron] Learning Algorithm
Let W1 and W2 be two sets of feature vectors. Then W1 and W2 are said to be linearly
separable iff there exists vector $\vec{w}^*$ such that
1. $\vec{w}^* \cdot \vec{x} > 0$, for every $\vec{x} \in W_1$
2. $\vec{w}^* \cdot \vec{x} < 0$, for every $\vec{x} \in W_2$
Rosenblatt’s Perceptron Learning Algorithm: has the goal of finding a vector $\vec{w}^*$ which
linearly separates W1 and W2 (a runnable sketch is given after the listing). Note that $\vec{w}^*$ will include a component that represents the threshold T.
• for each vector $\vec{x}$ in W1 ∪ W2, add another component to $\vec{x}$ and assign it the value −1.
• initialize $\vec{w}(0)$ to a random vector; set t = 0 and correct = 0.
• while (correct < |W1 ∪ W2|)
  – get the next vector $\vec{x}$ in W1 ∪ W2 (cycling through the vectors)
  – if ($\vec{w}(t) \cdot \vec{x} > 0$ and $\vec{x} \in W_2$) then
    ∗ t = t + 1; correct = 0
    ∗ $\vec{w}(t) = \vec{w}(t-1) - \vec{x}$
  – else if ($\vec{w}(t) \cdot \vec{x} < 0$ and $\vec{x} \in W_1$) then
    ∗ t = t + 1; correct = 0
    ∗ $\vec{w}(t) = \vec{w}(t-1) + \vec{x}$
  – else correct++
• $\vec{w}^* = \vec{w}(t)$
• return $\vec{w}^*$
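A runnable sketch of the algorithm (Python with NumPy); the function name, the random seed, and the cycling order through W1 ∪ W2 are illustrative assumptions.

```python
import numpy as np

def perceptron_learn(W1, W2, seed=0):
    """Rosenblatt's perceptron learning algorithm, as described above.
    Returns an augmented weight vector w* = (w_1, ..., w_k, T) satisfying
    w* . (x, -1) > 0 for x in W1 and < 0 for x in W2 (when separable).
    Note: loops forever if the sets are not linearly separable."""
    # augment every vector with a final component equal to -1
    L = [(np.append(x, -1.0), 1) for x in W1] + \
        [(np.append(x, -1.0), 2) for x in W2]
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(len(L[0][0]))    # w(0): random initial vector
    correct, i = 0, 0
    while correct < len(L):
        x, cls = L[i % len(L)]
        i += 1
        if np.dot(w, x) > 0 and cls == 2:    # misclassified W2 vector
            w, correct = w - x, 0
        elif np.dot(w, x) < 0 and cls == 1:  # misclassified W1 vector
            w, correct = w + x, 0
        else:
            correct += 1
    return w
```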
Perceptron Learning Example: use the perceptron learning algorithm to find a neuron/line that linearly separates W1 = {(−1, 1), (−2, 3), (1, 3)} from W2 = {(3, −1), (4, 5)}.
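Using the `perceptron_learn` sketch above on these two sets (one possible run; the exact weights returned depend on the random initialization):

```python
W1 = [np.array(v, dtype=float) for v in [(-1, 1), (-2, 3), (1, 3)]]
W2 = [np.array(v, dtype=float) for v in [(3, -1), (4, 5)]]
w_star = perceptron_learn(W1, W2)
# w_star = (w1, w2, T) defines the separating line w1*x + w2*y = T
print(w_star)
print([bool(np.dot(w_star, np.append(x, -1.0)) > 0) for x in W1 + W2])
# -> [True, True, True, False, False] for a separating w_star
```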
Artificial Neural Networks as Classifiers
An artificial neural network is simply a collection of neurons in which the outputs of some
neurons serve as the inputs to other neurons, potentially forming a complex dynamical system.
They represent the most important class of nonlinear classifiers. Why neural networks?
• they represent an attempt to model a brain’s structure and functionality, in the hope that
they will exhibit the same classification/pattern-recognition ability that most brains
possess
• experiments have shown they classify well in cases where the feature vectors
– possess errors and noise
– represent spatial and/or audio data
• computation in a neural network is naturally distributed across its neurons, and is thus
well suited to parallel processing
• learning algorithms exist for neural networks (e.g. backpropagation)
Feedforward Neural Network: consists of a matrix of neurons with the following connectivity properties:
• inputs to column 1 of the matrix consist of environmental impulses (in the form of
feature vectors)
• outputs from the last column of the matrix represent the output of the network
• for i > 1, inputs into a neuron in column i of the matrix consist of outputs from column
i − 1 of the matrix
• each column is called a layer of the network. The first column represents the input
layer, while the final column is called the output layer. All other layers are called
hidden.
Feedforward neural networks represent the most popular neural-network architecture for the
purpose of classification.
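A sketch of the forward pass through such a network, assuming discrete (threshold) neurons throughout; a layer is represented here simply as a list of (weight vector, threshold) pairs, which is an illustrative choice rather than a standard data structure.

```python
import numpy as np

def layer_output(layer, x):
    """Outputs of one column of discrete neurons; each neuron is a
    (weight vector, threshold) pair and fires iff w . x > T."""
    return np.array([1.0 if np.dot(w, x) > T else 0.0 for (w, T) in layer])

def feedforward(network, x):
    """Feed the feature vector x through the layers in order:
    the outputs of column i-1 become the inputs of column i."""
    for layer in network:
        x = layer_output(layer, x)
    return x
```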
Feedforward Example 1: the “EXOR” problem. Provide a two-layer neural network
which correctly classifies the vectors in W1 = {(0, 1), (1, 0)} from W2 = {(0, 0), (1, 1)}.
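One two-layer solution, sketched with the `feedforward` helper above: the first layer computes OR and AND of the two inputs, and the output neuron fires exactly when OR is on and AND is off, which is the EXOR pattern. The particular weights and thresholds are just one choice among many.

```python
xor_net = [
    # layer 1: an OR neuron (threshold 0.5) and an AND neuron (threshold 1.5)
    [(np.array([1.0, 1.0]), 0.5), (np.array([1.0, 1.0]), 1.5)],
    # layer 2: fires iff OR - AND > 0.5, i.e. OR is 1 and AND is 0
    [(np.array([1.0, -1.0]), 0.5)],
]
for v in [(0, 1), (1, 0), (0, 0), (1, 1)]:
    print(v, feedforward(xor_net, np.array(v, dtype=float)))
# (0,1) and (1,0) -> [1.]  (class W1);  (0,0) and (1,1) -> [0.]  (class W2)
```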
Theorem: any two finite, disjoint classes of vectors W1 and W2 can be correctly classified by a three-layer
neural network.
Proof:
1. The first layer maps real vectors to vertices of a hypercube, with the vectors of W1 and W2 landing on distinct vertices.
2. The second and third layers provide a sum-of-minterms expansion for those vertices of
the hypercube which correspond to W1: the second layer contains one AND (minterm) neuron per W1 vertex, and the third layer ORs these together, so the network outputs 1 exactly on the W1 vectors. (A small sketch of this construction follows.)
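The construction rests on the fact that threshold neurons can realize AND (a single minterm) and OR. The sketch below, which reuses the `feedforward` helper, builds only the second and third layers of the theorem's network, under the assumption that the inputs are already hypercube vertices; the helper names and the chosen vertices are illustrative.

```python
def minterm_neuron(vertex):
    """Second-layer neuron that fires only on one hypercube vertex:
    weight +1 for each 1-bit, -1 for each 0-bit, threshold = (#ones) - 0.5."""
    w = np.array([1.0 if b == 1 else -1.0 for b in vertex])
    return (w, float(sum(vertex)) - 0.5)

def or_neuron(k):
    """Third-layer neuron: fires iff at least one of its k inputs is 1."""
    return (np.ones(k), 0.5)

# recognize exactly the vertices {(0,1,1), (1,0,0)} of the 3-cube
vertices = [(0, 1, 1), (1, 0, 0)]
net = [[minterm_neuron(v) for v in vertices], [or_neuron(len(vertices))]]
for v in [(0, 1, 1), (1, 0, 0), (1, 1, 1), (0, 0, 0)]:
    print(v, feedforward(net, np.array(v, dtype=float)))
# the first two vertices -> [1.], the others -> [0.]
```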
Feedforward Example 2: for the following sets W1 and W2 provide a three-layer network
which correctly classifies the sets.
Training Feedforward Neural Networks: Backpropagation
The goal of backpropagation is to iteratively adapt the weights of each neuron so as to
minimize a chosen error function.
Sum-squared error function: let $\vec{x}_1, \ldots, \vec{x}_n$ be the set of training vectors, and let $\vec{F}(\vec{x}_i)_j$
represent the j th component of the output of the neural network using $\vec{x}_i$ as input. Moreover,
let $\vec{c}(\vec{x}_i)_j$ be the desired j th component of the correct classification of $\vec{x}_i$. Then the error
associated with $\vec{x}_i$ and induced by the network is given as
$$E(i) = \sum_{j=1}^{m} \left( \vec{F}(\vec{x}_i)_j - \vec{c}(\vec{x}_i)_j \right)^2,$$
where m is the number of components of the network’s output layer. Furthermore, the
sum-squared error function for the entire training set is given as
$$E = \sum_{i=1}^{n} E(i).$$
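A direct transcription of these two formulas as a sketch; `net_output` stands in for the network's output function $\vec{F}$ and is assumed to be supplied by the caller.

```python
import numpy as np

def sum_squared_error(net_output, xs, cs):
    """E = sum_i sum_j (F(x_i)_j - c(x_i)_j)^2 over the training set,
    where cs[i] is the desired output vector for training vector xs[i]."""
    return sum(float(np.sum((net_output(x) - c) ** 2)) for x, c in zip(xs, cs))
```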
Assuming that the training set remains constant, we may think of E as a function of the
weight vectors of all the neurons of the network; training the network then involves finding
an instantiation of the weight vectors for which E is a minimum. The most common algorithm
for doing this is called gradient descent; i.e. taking the gradient of E with respect to all
the weight vectors and changing the weights in the direction of the negative gradient. Doing this
requires using the chain rule for partial differentiation, which allows the weight updates
to propagate backwards through the network.
Once the weights are updated via backpropagation, $\vec{F}(\vec{x}_i)$ must be recomputed for each
training vector $\vec{x}_i$, so that E can be recomputed, and the backpropagation process repeats
(a sketch of the full training loop is given after the stopping rules below).
Backpropagation stops when either
Backpropagation stops when either
1. ∆E, the change in E from one stage to the next, becomes smaller than some threshold,
or
2. the number of stages has exceeded some tolerance threshold, or
3. the computed error E becomes smaller than some threshold.
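A compact sketch of the whole procedure for a network with one hidden layer of smoothed (logistic) neurons: the gradient of the sum-squared error E is accumulated over the training set via the chain rule, the weights are moved along the negative gradient, and the three stopping rules above are checked. The learning rate, hidden-layer size, tolerances, and the EXOR training set at the end are illustrative assumptions, not part of the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(xs, cs, n_hidden=4, lr=0.5, max_stages=10000,
                   err_tol=1e-3, delta_tol=1e-9, seed=0):
    """Gradient descent on the sum-squared error E for a network with one
    hidden layer of smoothed neurons.  Thresholds are carried by appending
    a -1 input to each layer, as in the perceptron algorithm above."""
    rng = np.random.default_rng(seed)
    n_in, n_out = len(xs[0]), len(cs[0])
    W1 = rng.standard_normal((n_hidden, n_in + 1))   # hidden weights (incl. thresholds)
    W2 = rng.standard_normal((n_out, n_hidden + 1))  # output weights (incl. thresholds)

    def forward(x):
        h = sigmoid(W1 @ np.append(x, -1.0))
        return h, sigmoid(W2 @ np.append(h, -1.0))

    E_prev = None
    for stage in range(max_stages):                  # stopping rule 2
        # accumulate the gradient of E over the whole training set
        gW1, gW2, E = np.zeros_like(W1), np.zeros_like(W2), 0.0
        for x, c in zip(xs, cs):
            h, y = forward(x)
            E += float(np.sum((y - c) ** 2))
            # chain rule, propagating the error backwards
            dy = 2.0 * (y - c) * y * (1.0 - y)          # dE/d(net input, output layer)
            gW2 += np.outer(dy, np.append(h, -1.0))
            dh = (W2[:, :-1].T @ dy) * h * (1.0 - h)    # dE/d(net input, hidden layer)
            gW1 += np.outer(dh, np.append(x, -1.0))
        W1 -= lr * gW1                                  # step along the negative gradient
        W2 -= lr * gW2
        if E < err_tol:                                 # stopping rule 3
            break
        if E_prev is not None and abs(E_prev - E) < delta_tol:  # stopping rule 1
            break
        E_prev = E
    return W1, W2

# example: learn the EXOR classification from above (one possible run)
xs = [np.array(v, dtype=float) for v in [(0, 1), (1, 0), (0, 0), (1, 1)]]
cs = [np.array([t], dtype=float) for t in (1, 1, 0, 0)]
W1, W2 = train_backprop(xs, cs)
```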