
CSE 4705
Artificial Intelligence
Jinbo Bi
Department of Computer Science & Engineering
http://www.engr.uconn.edu/~jinbo
Machine learning (1)
Supervised learning algorithms
Topics in machine learning
Supervised learning such as classification and
regression
Unsupervised learning such as cluster analysis,
outlier/novelty detection
Dimension reduction
Semi-supervised learning
Active learning
Online learning
Common techniques
Supervised learning
Regularized least squares
Least absolute shrinkage and selection operator (LASSO)
Neural networks
Logistic regression
Decision trees
Fisher’s discriminant analysis
Support vector machines
Graphical models
Common techniques
Unsupervised learning
K-means
Gaussian mixture models
Hierarchical clustering
Graph-based clustering (e.g., Spectral clustering)
Common techniques
Dimension reduction
Principal component analysis
Independent component analysis
Canonical correlation analysis
Feature selection
Sparse modeling
Machine learning / Data mining
Data mining (sometimes called data or knowledge
discovery) is the process of analyzing data from different
perspectives and summarizing it into useful information
http://www.kdd.org/kdd2016/ ACM SIGKDD conference
The ultimate goal of machine learning is the creation and
understanding of machine intelligence
http://icml.cc/2016/ ICML conference
Heavily related to statistical learning theory
Artificial intelligence is the intelligence exhibited by machines or software. It is the study of how to create computers and computer software that are capable of intelligent behavior.
http://www.aaai.org/Conferences/AAAI/aaai16.php AAAI conference
Supervised learning: definition
Given a collection of examples (the training set).
Each example contains a set of attributes (independent variables); one of the attributes is the target (dependent variable).
Find a model that predicts the target as a function of the values of the other attributes.
Goal: previously unseen examples should be predicted as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
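As a minimal sketch of this train/test protocol (scikit-learn and its bundled Iris data are assumed here; neither is part of the slides):

```python
# Minimal sketch of the train/test protocol (assumes scikit-learn is installed;
# the Iris data and the decision tree are illustrative choices, not from the slides).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # attributes and a categorical target

# Divide the given data set into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # build the model on the training set
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))  # validate on the test set
```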
Supervised learning: classification
When the dependent variable is categorical, it is a classification problem.

Training set (labeled past records):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set (Cheat label to be predicted):

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set -> Learn Classifier -> Model; the model is then applied to the Test Set.
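As a hedged sketch of how a classifier could be learned from the training table above and applied to the test records (scikit-learn is assumed; the hand-coded feature encoding and the decision tree are illustrative choices):

```python
# Sketch: learn a classifier for "Cheat" from the training table above
# (assumes scikit-learn; categorical attributes are hand-encoded for brevity).
from sklearn.tree import DecisionTreeClassifier

refund = {"Yes": 1, "No": 0}
marital = {"Single": [1, 0, 0], "Married": [0, 1, 0], "Divorced": [0, 0, 1]}

def encode(refund_v, marital_v, income_k):
    # features: refund flag, one-hot marital status, taxable income (in K)
    return [refund[refund_v]] + marital[marital_v] + [income_k]

train = [("Yes", "Single", 125, "No"),  ("No", "Married", 100, "No"),
         ("No", "Single", 70, "No"),    ("Yes", "Married", 120, "No"),
         ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
         ("Yes", "Divorced", 220, "No"),("No", "Single", 85, "Yes"),
         ("No", "Married", 75, "No"),   ("No", "Single", 90, "Yes")]
test = [("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
        ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80)]

X = [encode(r, m, k) for r, m, k, _ in train]
y = [label for *_, label in train]

model = DecisionTreeClassifier(random_state=0).fit(X, y)    # learn the classifier
print(model.predict([encode(*row) for row in test]))        # fill in the "?" labels
```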
Classification: example
Face recognition
Goal: Predict the identity of a face image
Approach:
Align all images to derive the features
Model the class (identity) based on these features
Supervised learning: regression
When the dependent variable is continuous, it is a regression problem.

Training set (past transaction records, labeled with the observed loss):

Tid  Refund  Marital Status  Taxable Income  Loss
1    Yes     Single          125K            100
2    No      Married         100K            120
3    No      Single          70K             -200
4    Yes     Married         120K            -300
5    No      Divorced        95K             -400
6    No      Married         60K             -500
7    Yes     Divorced        220K            -190
8    No      Single          85K             300
9    No      Married         75K             -240
10   No      Single          90K             90

Test set (current data; use the model to predict):

Refund  Marital Status  Taxable Income  Loss
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set -> Learn Regressor -> Model; the model is then applied to the Test Set.

Goal: predict the possible loss from a customer.
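A similar hedged sketch for the regression case, fitting an ordinary least-squares model to predict Loss (scikit-learn is assumed; the feature encoding mirrors the classification sketch):

```python
# Sketch: learn a regressor for "Loss" from the training table above
# (assumes scikit-learn; same hand-encoding of categorical attributes as before).
from sklearn.linear_model import LinearRegression

refund = {"Yes": 1, "No": 0}
marital = {"Single": [1, 0, 0], "Married": [0, 1, 0], "Divorced": [0, 0, 1]}
encode = lambda r, m, k: [refund[r]] + marital[m] + [k]

train = [("Yes", "Single", 125, 100),  ("No", "Married", 100, 120),
         ("No", "Single", 70, -200),   ("Yes", "Married", 120, -300),
         ("No", "Divorced", 95, -400), ("No", "Married", 60, -500),
         ("Yes", "Divorced", 220, -190),("No", "Single", 85, 300),
         ("No", "Married", 75, -240),  ("No", "Single", 90, 90)]
test = [("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
        ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80)]

X = [encode(r, m, k) for r, m, k, _ in train]
y = [loss for *_, loss in train]

model = LinearRegression().fit(X, y)                        # learn the regressor
print(model.predict([encode(*row) for row in test]))        # predicted loss for each "?"
```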
Regression: example
Risk prediction for patients
Goal: Predict the likelihood that a patient will suffer a major complication after a surgical procedure
Approach:
Use patients' vital signs before and after the surgical operation (heart rate, respiratory rate, etc.).
Have expert medical professionals monitor the patients and rate the likelihood of a complication.
Learn a model that maps patient vital signs to the risk ratings.
Use this model to detect potential high-risk patients for a particular surgical procedure.
Unsupervised learning: clustering
Given a set of data points, each having a set of
attributes, and a similarity measure among them, find
clusters such that
Data points in one cluster are more similar to one
another.
Data points in separate clusters are less similar to
one another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures
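A minimal clustering sketch along these lines, grouping points by Euclidean distance with k-means (scikit-learn is assumed; the 2-D toy points are illustrative):

```python
# Sketch: k-means clustering with Euclidean distance (assumes scikit-learn;
# the 2-D toy points are illustrative only).
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],      # one group of similar points
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])     # another, dissimilar group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster labels:", km.labels_)           # points in one cluster are similar to one another
print("cluster centers:", km.cluster_centers_)
```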
Clustering: example
High-risk patient detection
Goal: Predict whether a patient will suffer a major complication after a surgical procedure
Approach:
Use patients' vital signs before and after the surgical operation (heart rate, respiratory rate, etc.).
Find patients whose symptoms are dissimilar from those of most other patients.
Practice
Decide what kind of problem each of the following scenarios describes:
A student collected a number of online documents about movies and tries to identify which movie each document discusses.
In a cognitive test, a person is asked whether he can recognize the color red on a screen. The person presses a button if he thinks he sees red, and does nothing otherwise. An EEG recording is made during the test. A researcher wants to use the EEG recordings to predict whether the red color was recognized.
A researcher observed and recorded weather conditions (temperature, wind speed, snow, etc.) for the past month, and wants to use the data to predict the next day's temperature.
Review of probability and linear algebra
Basics of probability
An experiment (random variable) is a well-defined process with observable outcomes.
The set or collection of all outcomes of an experiment is called the sample space, S.
An event E is any subset of outcomes from S.
The probability of an event is P(E) = (number of outcomes in E) / (number of outcomes in S).
Probability theory
Apples and Oranges
X: identity of the fruit (a = apple, o = orange)
Y: identity of the box (r = red, b = blue)
Prior: P(Y=r) = 40%, P(Y=b) = 60%
Likelihood:
P(X=a|Y=r) = 2/8 = 25%, P(X=o|Y=r) = 6/8 = 75%
P(X=a|Y=b) = 3/4 = 75%, P(X=o|Y=b) = 1/4 = 25%
Marginal: P(X=a) = 11/20, P(X=o) = 9/20
Posterior: P(Y=r|X=o) = 2/3, P(Y=b|X=o) = 1/3
Probability theory

Joint probability p(X, Y)
Marginal probability p(X)
Conditional probability p(Y|X)
Probability theory

Sum Rule: p(X) = Σ_Y p(X, Y)
(the marginal probability of X equals the sum of the joint probability of X and Y over Y)
Product Rule: p(X, Y) = p(Y|X) p(X)
(the joint probability of X and Y equals the product of the conditional probability of Y given X and the probability of X)
Illustration
Histograms of the joint distribution p(X, Y), the marginals p(X) and p(Y), and the conditional p(X|Y=1), for a variable Y taking the values Y=1 and Y=2.
The rules of probability

Sum Rule: p(X) = Σ_Y p(X, Y)
Product Rule: p(X, Y) = p(X|Y) p(Y)
Bayes' Rule: p(Y|X) = p(X|Y) p(Y) / p(X)
posterior ∝ likelihood × prior
Application of probability rules
Assume P(Y=r) = 40%, P(Y=b) = 60%
P(X=a|Y=r) = 2/8 = 25%
P(X=o|Y=r) = 6/8 = 75%
P(X=a|Y=b) = 3/4 = 75%
P(X=o|Y=b) = 1/4 = 25%
p(X=a) = p(X=a,Y=r) + p(X=a,Y=b)
= p(X=a|Y=r)p(Y=r) + p(X=a|Y=b)p(Y=b)
=0.25*0.4 + 0.75*0.6 = 11/20
P(X=o) = 9/20
p(Y=r|X=o) = p(Y=r,X=o)/p(X=o)
= p(X=o|Y=r)p(Y=r)/p(X=o)
= 0.75*0.4 / (9/20) = 2/3
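The same calculation written as a short script, so the sum rule, product rule, and Bayes' rule can be checked numerically (plain Python; the numbers are those given above):

```python
# Check the sum rule, product rule, and Bayes' rule on the box-and-fruit numbers above.
p_y = {"r": 0.4, "b": 0.6}                                   # prior p(Y)
p_x_given_y = {("a", "r"): 2/8, ("o", "r"): 6/8,             # likelihood p(X|Y)
               ("a", "b"): 3/4, ("o", "b"): 1/4}

# Sum rule + product rule: p(X=x) = sum_y p(X=x|Y=y) p(Y=y)
p_x = {x: sum(p_x_given_y[(x, y)] * p_y[y] for y in p_y) for x in ("a", "o")}
print(p_x)                                                   # {'a': 0.55, 'o': 0.45}

# Bayes' rule: p(Y=r|X=o) = p(X=o|Y=r) p(Y=r) / p(X=o)
print(p_x_given_y[("o", "r")] * p_y["r"] / p_x["o"])         # 0.666... = 2/3
```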
Mean and variance
The mean of a random variable X is the
average value X takes.
The variance of X is a measure of how
dispersed the values that X takes are.
The standard deviation is simply the square
root of the variance.
Simple example
X takes values in {1, 2}, with P(X=1) = 0.8 and P(X=2) = 0.2
Mean
0.8 × 1 + 0.2 × 2 = 1.2
Variance
0.8 × (1 − 1.2)² + 0.2 × (2 − 1.2)² = 0.8 × 0.04 + 0.2 × 0.64 = 0.16
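The same mean and variance, checked numerically (NumPy; the numbers are those from the example above):

```python
# Check the mean/variance arithmetic above for X in {1, 2} with P(X=1)=0.8, P(X=2)=0.2.
import numpy as np

values = np.array([1.0, 2.0])
probs = np.array([0.8, 0.2])

mean = np.sum(probs * values)                    # 1.2
var = np.sum(probs * (values - mean) ** 2)       # 0.16
print(mean, var, np.sqrt(var))                   # standard deviation = sqrt(variance)
```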
Gaussian distribution
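A brief sketch of the univariate Gaussian density, evaluated from its standard formula (NumPy; the mean and standard deviation values are illustrative):

```python
# Univariate Gaussian density N(x | mu, sigma^2) evaluated from the standard formula
# (illustrative mu and sigma values).
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

x = np.linspace(-4, 4, 9)
print(gaussian_pdf(x, mu=0.0, sigma=1.0))    # peaks at x = mu; spread is controlled by sigma
```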
Multivariate Gaussian
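A companion sketch for the multivariate case, drawing correlated 2-D samples (NumPy; the mean vector and covariance matrix are illustrative):

```python
# Sample from a 2-D Gaussian with an illustrative mean vector and covariance matrix.
import numpy as np

mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])                  # positive-definite covariance

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean, cov, size=1000)
print(samples.mean(axis=0))                   # close to the mean vector
print(np.cov(samples, rowvar=False))          # close to the covariance matrix
```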
Basics of linear algebra
Matrix multiplication
The product of two matrices: C = AB.
Special cases: the vector-vector product and the matrix-vector product.
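A quick NumPy sketch of these products (the matrices and vectors are illustrative):

```python
# Matrix-matrix, matrix-vector, and vector-vector products (illustrative arrays).
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])
x = np.array([1.0, -1.0])

C = A @ B            # matrix-matrix product, C = AB
print(C)
print(A @ x)         # matrix-vector product
print(x @ x)         # vector-vector (inner) product, a scalar
```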
Rules of matrix multiplication
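Two standard rules, associativity ((AB)C = A(BC)) and, in general, non-commutativity (AB ≠ BA), can be checked numerically (NumPy; the random matrices are illustrative):

```python
# Numerical check of two standard rules: (AB)C = A(BC), but in general AB != BA.
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))

print(np.allclose((A @ B) @ C, A @ (B @ C)))   # True: matrix multiplication is associative
print(np.allclose(A @ B, B @ A))               # False (in general): not commutative
```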
Vector norms
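A small sketch of common vector norms (NumPy; the vector is illustrative):

```python
# L1, L2, and infinity norms of an illustrative vector.
import numpy as np

v = np.array([3.0, -4.0, 0.0])
print(np.linalg.norm(v, 1))        # L1 norm: sum of absolute values = 7
print(np.linalg.norm(v, 2))        # L2 (Euclidean) norm = 5
print(np.linalg.norm(v, np.inf))   # infinity norm: largest absolute value = 4
```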
Matrix norms and trace
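And similarly for matrix norms and the trace (NumPy; the matrix is illustrative):

```python
# Frobenius norm, spectral (2-) norm, and trace of an illustrative matrix.
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(np.linalg.norm(A, 'fro'))    # Frobenius norm: sqrt of the sum of squared entries
print(np.linalg.norm(A, 2))        # spectral norm: largest singular value
print(np.trace(A))                 # trace: sum of the diagonal entries = 5
```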
A bit more on matrices
Orthogonal matrix
A square matrix Q is orthogonal if Q^T Q = Q Q^T = I, the identity matrix (1's on the diagonal, 0's elsewhere).
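A quick numerical check using a 2-D rotation matrix, a standard example of an orthogonal matrix (NumPy; the angle is illustrative):

```python
# A 2-D rotation matrix is orthogonal: Q^T Q = I.
import numpy as np

theta = np.pi / 6                               # illustrative angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(Q.T @ Q, np.eye(2)))          # True
```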
Square matrix – eigenvalue, eigenvector
(λ, x) is an eigenpair of A if and only if Ax = λx, where x ≠ 0.
λ is the eigenvalue; x is the eigenvector.
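A numerical check of the definition Ax = λx (NumPy; the matrix is illustrative):

```python
# Verify that each (eigenvalue, eigenvector) pair returned by NumPy satisfies A x = lambda x.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)             # columns of eigvecs are the eigenvectors
for lam, x in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ x, lam * x))          # True for every eigenpair
```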
Symmetric matrix
A is symmetric if A = A^T.
Eigen-decomposition of a symmetric A: A = Q Λ Q^T, with Q orthogonal and Λ = diag(λ1, ..., λn).
A in R^(n×n), symmetric, is positive semi-definite if x^T A x ≥ 0 for any x in R^n; equivalently, λi ≥ 0 for i = 1, ..., n.
A in R^(n×n), symmetric, is positive definite if x^T A x > 0 for any nonzero x in R^n; equivalently, λi > 0 for i = 1, ..., n.
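A sketch checking the eigen-decomposition and the positive semi-definiteness condition for a symmetric matrix (NumPy; the matrix is illustrative):

```python
# Eigen-decomposition A = Q Lambda Q^T of a symmetric matrix, and a PSD check.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                       # symmetric: A == A.T
lams, Q = np.linalg.eigh(A)                      # eigh is specialized for symmetric matrices
print(np.allclose(A, Q @ np.diag(lams) @ Q.T))   # True: A = Q Lambda Q^T
print(np.all(lams >= 0))                         # True: all eigenvalues >= 0, so A is PSD
```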
Singular value decomposition
A = U Σ V^T, where U and V are orthogonal and Σ is diagonal (with the singular values on its diagonal).
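A small SVD sketch (NumPy; the matrix is illustrative):

```python
# Singular value decomposition A = U Sigma V^T, with U, V orthogonal and Sigma diagonal.
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))       # True: A is recovered from its SVD
print(np.allclose(U.T @ U, np.eye(U.shape[1])))  # columns of U are orthonormal
```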
Supervised learning – practical issues
Underfitting
Overfitting
Before introducing these important concepts, let us study a simple regression algorithm: linear regression.
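As a preview, a hedged sketch fitting least-squares polynomials of several degrees to noisy data: a degree that is too low underfits, and a degree that is too high tends to overfit (NumPy; the synthetic data are illustrative):

```python
# Underfitting vs. overfitting preview: least-squares polynomial fits of several degrees
# to illustrative noisy data (degree 1 is plain linear regression).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)   # noisy training target

x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)                              # noise-free truth

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)                 # least-squares fit
    pred = np.polyval(coeffs, x_test)
    print(degree, np.mean((pred - y_test) ** 2))      # test error: high for degree 1 (underfit),
                                                      # typically rising again for degree 9 (overfit)
```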
Questions?