Computational BioMedical Informatics
SCE 5095: Special Topics Course
Instructor: Jinbo Bi, Computer Science and Engineering Dept.

Course Information
Instructor: Dr. Jinbo Bi
– Office: ITEB 233
– Phone: 860-486-1458
– Email: [email protected]
– Web: http://www.engr.uconn.edu/~jinbo/
– Time: Mon / Wed 2:00pm – 3:15pm
– Location: CAST 204
– Office hours: Mon 3:30–4:30pm
HuskyCT
– http://learn.uconn.edu
– Login with your NetID and password (illustration)

Review of last chapter
General introduction to the topics in medical informatics, and the data mining techniques involved
Review of some basics of probability and statistics
More slides on probability and linear algebra uploaded to HuskyCT
This class, we start to discuss supervised learning: classification and regression

Regression and classification
Both regression and classification problems are typically supervised learning problems
The main properties of supervised learning:
– Each training example contains the input variables and the corresponding target label
– The goal is to find a good mapping from the input variables to the target variable

Classification: Definition
Given a collection of examples (training set):
– Each example contains a set of variables (features), and the target variable class
Find a model for the class attribute as a function of the values of the other variables
Goal: previously unseen examples should be assigned a class as accurately as possible
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it

Classification Application 1
Fraud detection
– Goal: predict fraudulent cases in credit card transactions
– Label past transaction records (training set), learn a classifier, then use the model to predict current data (test set)

Training set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Classification: Application 2
Handwritten digit recognition
Goal: identify the digit of a handwritten number
– Approach: align all images to derive the features; model the class (identity) based on these features

Illustrating Classification Task
Training set (induction: a learning algorithm learns a model):
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set (deduction: apply the model):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Classification algorithms
– K-nearest-neighbor classifiers
– Naïve Bayes classifier
– Neural networks
– Linear discriminant analysis (LDA)
– Support vector machines (SVM)
– Decision trees
– Logistic regression
– Graphical models

Regression: Definition
Goal: predict the value of one or more continuous target attributes given the values of the input attributes
The difference between classification and regression lies only in the target attribute:
– Classification: discrete or categorical target
– Regression: continuous target
Greatly studied in statistics and in the neural network field
Regression application 1
Goal: predict the possible loss from a customer
– Label past transaction records (training set), learn a regressor, then use the model to predict current data (test set)

Training set:
Tid  Refund  Marital Status  Taxable Income  Loss
1    Yes     Single          125K            100
2    No      Married         100K            120
3    No      Single          70K             -200
4    Yes     Married         120K            -300
5    No      Divorced        95K             -400
6    No      Married         60K             -500
7    Yes     Divorced        220K            -190
8    No      Single          85K             300
9    No      Married         75K             -240
10   No      Single          90K             90

Test set:
Refund  Marital Status  Taxable Income  Loss
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Regression applications
Examples:
– Predicting sales amounts of a new product based on advertising expenditure
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices

Regression algorithms
– Least squares methods
– Regularized linear regression (ridge regression)
– Neural networks
– Support vector machines (SVM)
– Bayesian linear regression

Practical issues in the training
– Underfitting
– Overfitting
Before introducing these important concepts, let us study a simple regression algorithm: linear regression

Least squares
We wish to use some real-valued input variables x to predict the value of a target y
We collect training data of pairs (x_i, y_i), i = 1, …, N
Suppose we have a model f that maps each example x to a predicted value y'
Sum-of-squares function: the sum of the squared deviations between the observed target values y and the predicted values y':
$\sum_{i=1}^{N} (y_i - y_i')^2 = \sum_{i=1}^{N} (y_i - f(x_i))^2$

Least squares (cont'd)
Find a function f such that the sum of squares is minimized:
$\min_f \sum_{i=1}^{N} (y_i - f(x_i))^2$
For example, if the function is linear, $f(x) = w^T x$:
$\min_w \sum_{i=1}^{N} (y_i - w^T x_i)^2$
Least squares with a linear function of the parameters w is called "linear regression"

Linear regression
Linear regression has a closed-form solution for w:
$E(w) = \sum_{i=1}^{N} (y_i - x_i^T w)^2 = (y - Xw)^T (y - Xw)$
The minimum is attained at the zero derivative:
$\frac{\partial E(w)}{\partial w} = -2 X^T (y - Xw) = 0 \quad\Rightarrow\quad w = (X^T X)^{-1} X^T y$

Polynomial Curve Fitting
x is evenly distributed on [0, 1]
y = f(x) + random error, e.g. y = sin(2πx) + ε, ε ~ N(0, σ)
(figures: sample data, fitted polynomial curves, and the sum-of-squares error function)
(figures: polynomial fits of increasing order: 0th, 1st, 3rd, and 9th order)

Over-fitting
Root-mean-square (RMS) error on training vs. test data (figure)
Polynomial coefficients (table: the coefficients grow rapidly with the polynomial order)
Data set size, 9th-order polynomial (figures: increasing the data set size reduces over-fitting)

Regularization
Penalize large coefficient values: ridge regression
(figures: fits and RMS errors for different regularization strengths; table: regularization shrinks the polynomial coefficients)
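To make the closed-form solutions above concrete, here is a minimal MATLAB sketch (one possible version, assuming R2016b+ for implicit expansion) that fits the sin(2πx) toy data from the curve-fitting slides with a 9th-order polynomial, both by ordinary least squares and by ridge regression; the noise level and λ are illustrative choices, not values from the slides.

% Fit y = sin(2*pi*x) + noise with a 9th-order polynomial, as in the
% curve-fitting slides; sigma and lambda are illustrative choices.
N = 10; sigma = 0.3;
x = linspace(0, 1, N)';
y = sin(2*pi*x) + sigma * randn(N, 1);

% Design matrix of polynomial features [1 x x^2 ... x^9]
X = x .^ (0:9);

% Ordinary least squares: w = (X'X)^(-1) X'y (backslash is more stable;
% X'X can be ill-conditioned here, which is what regularization addresses)
w_ls = (X' * X) \ (X' * y);

% Ridge regression: penalize large coefficients with lambda * ||w||^2
lambda = 1e-4;
w_ridge = (X' * X + lambda * eye(size(X, 2))) \ (X' * y);

% Training RMS errors of both fits
rms_ls    = sqrt(mean((y - X * w_ls   ) .^ 2));
rms_ridge = sqrt(mean((y - X * w_ridge) .^ 2));

With N = 10 points and a 9th-order polynomial, the least-squares fit can interpolate the noise (over-fitting), while the ridge fit keeps the coefficients small, matching the behavior shown in the slides.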
Classification
Underfitting or overfitting can also happen in classification approaches
We will illustrate these practical issues on a classification problem
Before the illustration, we introduce a simple classification technique: the K-nearest neighbor method

K-nearest neighbor (K-NN)
K-NN is one of the simplest machine learning algorithms
K-NN is a method for classifying test examples based on the closest training examples in the feature space
An example is classified by a majority vote of its neighbors
k is a positive integer, typically small. If k = 1, the example is simply assigned to the class of its nearest neighbor.
(figures: K-NN decisions with K = 1 and K = 3)

K-NN on real problem data
– Oil data set
– K acts as a smoother; choosing K is model selection
– As N → ∞, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal error (obtained from the true conditional class distributions)

Limitation of K-NN
K-NN is a nonparametric model (no particular function is fitted)
Nonparametric models require storing and computing with the entire data set
Parametric models, once fitted, are much more efficient in terms of storage and computation

Probabilistic interpretation of K-NN
Given a data set with $N_k$ data points from class $C_k$ and $\sum_k N_k = N$, consider a region of volume V around a test point x that contains K points, $K_k$ of them from class $C_k$. We have
$p(x \mid C_k) = \frac{K_k}{N_k V}$ and correspondingly $p(x) = \frac{K}{N V}$
Since $p(C_k) = \frac{N_k}{N}$, Bayes' theorem gives
$p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)} = \frac{K_k}{K}$
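Returning to the K-NN classifier introduced above, a minimal MATLAB sketch of majority-vote K-NN with Euclidean distance follows; the function name and variable names are our own, and ties are resolved by MATLAB's mode (which returns the smallest of the most frequent labels).

% Minimal K-NN classifier (majority vote, Euclidean distance).
% Xtrain: N x d matrix, ytrain: N x 1 labels, xtest: 1 x d, k: neighbor count.
function label = knn_classify(Xtrain, ytrain, xtest, k)
    % Squared Euclidean distance from xtest to every training point
    d2 = sum((Xtrain - xtest) .^ 2, 2);
    % Indices of the k closest training examples
    [~, idx] = sort(d2);
    neighbors = ytrain(idx(1:k));
    % Majority vote: mode returns the most frequent label
    label = mode(neighbors);
end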
Underfit and Overfit (Classification)
500 circular and 500 triangular data points
Circular points: $0.5 \le \sqrt{x_1^2 + x_2^2} \le 1$
Triangular points: $\sqrt{x_1^2 + x_2^2} > 1$ or $\sqrt{x_1^2 + x_2^2} < 0.5$
(figures: the data set, and decision boundaries of models of varying complexity)

Underfitting and Overfitting
(figure: training and test error curves as model complexity grows)
Underfitting: when the model is too simple, both training and test errors are large

Overfitting due to Noise
The decision boundary is distorted by a noise point (figure)

Overfitting due to Insufficient Examples
The lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly (figure)
– An insufficient number of training records in the region causes the neural net to predict the test examples using other training records that are irrelevant to the classification task

Notes on Overfitting
Overfitting results in classifiers (a neural net, or a support vector machine) that are more complex than necessary
Training error no longer provides a good estimate of how well the classifier will perform on previously unseen records
We need new ways of estimating errors

Occam's Razor
Given two models with similar generalization errors, one should prefer the simpler model over the more complex one
For complex models, there is a greater chance that the model was fitted accidentally by errors in the data
Therefore, one should include model complexity when evaluating a model

How to Address Overfitting
Minimizing training error no longer guarantees a good model (a classifier or a regressor)
We need a better estimate of the error on the true population: the generalization error $P_{population}(f(x) \neq y)$
In practice, design a procedure that gives a better estimate of the error than the training error
In theoretical analysis, find an analytical bound on the generalization error, or use the Bayesian formula

Model Evaluation (pp. 295–304 of the data mining textbook)
Metrics for Performance Evaluation
– How to evaluate the performance of a model?
Methods for Performance Evaluation
– How to obtain reliable estimates?
Methods for Model Comparison
– How to compare the relative performance among competing models?

Metrics for Performance Evaluation
Regression:
– Sum of squares
– Sum of deviations
– Exponential function of the deviation
Classification: focus on the predictive capability of a model
– Rather than how fast it classifies or builds models, scalability, etc.

Confusion Matrix:
                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL   Class=Yes     a (TP)      b (FN)
CLASS    Class=No      c (FP)      d (TN)
a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

Metrics for Performance Evaluation (cont'd)
Most widely-used metric:
$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$

Limitation of Accuracy
Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
– Accuracy is misleading because the model does not detect any class 1 example

Cost Matrix
                       PREDICTED CLASS
C(i|j)                 Class=Yes    Class=No
ACTUAL   Class=Yes     C(Yes|Yes)   C(No|Yes)
CLASS    Class=No      C(Yes|No)    C(No|No)
C(i|j): cost of misclassifying a class j example as class i

Computing Cost of Classification
Cost matrix (rows: actual, columns: predicted):
C(i|j)   +     -
+       -1   100
-        1     0

Model M1 (rows: actual, columns: predicted):
       +     -
+     150    40
-      60   250
Accuracy = 80%, Cost = 3910

Model M2:
       +     -
+     250    45
-       5   200
Accuracy = 90%, Cost = 4255

Cost vs Accuracy
With counts a, b, c, d and N = a + b + c + d: Accuracy = (a + d) / N
Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p
Then:
Cost = p (a + d) + q (b + c)
     = p (a + d) + q (N – a – d)
     = q N – (q – p)(a + d)
     = N [q – (q – p) × Accuracy]

Cost-Sensitive Measures
$\text{Precision } (p) = \frac{a}{a + c}$
$\text{Recall } (r) = \frac{a}{a + b}$
Precision is biased towards C(Yes|Yes) and C(Yes|No)
Recall is biased towards C(Yes|Yes) and C(No|Yes)
A model that declares every record to be the positive class (b = d = 0) has high recall
A model that assigns the positive class only to the test records it is sure about (c is small) has high precision

Cost-Sensitive Measures (cont'd)
$\text{F-measure } (F) = \frac{2 r p}{r + p} = \frac{2a}{2a + b + c}$
F-measure is biased towards all cells except C(No|No)
$\text{Weighted Accuracy} = \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$
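Before turning to estimation methods, a short MATLAB sketch computing the metrics above from a 2×2 confusion matrix, using model M1's counts and the cost matrix from the "Computing Cost of Classification" slide.

% Metrics from a 2x2 confusion matrix; convention matches the slides:
% a = TP, b = FN, c = FP, d = TN (counts taken from model M1 above).
a = 150; b = 40; c = 60; d = 250;

accuracy  = (a + d) / (a + b + c + d);              % 0.80
precision = a / (a + c);
recall    = a / (a + b);
F = 2 * recall * precision / (recall + precision);  % = 2a/(2a + b + c)

% Cost with the cost matrix C(+|+) = -1, C(-|+) = 100, C(+|-) = 1, C(-|-) = 0
cost = -1 * a + 100 * b + 1 * c + 0 * d;            % 3910 for M1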
Model Evaluation
– Metrics for performance evaluation: how to evaluate the performance of a model?
– Methods for performance evaluation: how to obtain reliable estimates?
– Methods for model comparison: how to compare the relative performance among competing models?

Methods for Performance Evaluation
How to obtain a reliable estimate of performance?
Performance of a model may depend on other factors besides the learning algorithm:
– Class distribution
– Cost of misclassification
– Size of training and test sets

Learning Curve
A learning curve shows how accuracy changes with varying sample size (figure)
Requires a sampling schedule for creating the learning curve:
– Arithmetic sampling (Langley et al.)
– Geometric sampling (Provost et al.)
Effect of small sample size:
– Bias in the estimate
– Variance of the estimate

Methods of Estimation
Holdout
– Reserve 2/3 for training and 1/3 for testing
Random subsampling
– Repeated holdout
Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k – 1 partitions, test on the remaining one
– Leave-one-out: k = n
Stratified sampling
– Oversampling vs. undersampling
Bootstrap
– Sampling with replacement

Methods of Estimation (cont'd)
Holdout method
– The given data is randomly partitioned into two independent sets: a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation
– Random subsampling: a variation of holdout; repeat holdout k times, accuracy = average of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets D_1, …, D_k, each of approximately equal size
– At the i-th iteration, use D_i as the test set and the others as the training set
– Leave-one-out: k folds where k = # of tuples, for small data sets
– Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data

Methods of Estimation (cont'd)
Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
There are several bootstrap methods; a common one is the .632 bootstrap
– Suppose we are given a data set of d examples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data points that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since $(1 - 1/d)^d \approx e^{-1} = 0.368$)
– Repeat the sampling procedure k times; the overall accuracy of the model is
$acc(M) = \frac{1}{k} \sum_{i=1}^{k} \left( 0.632 \times acc(M_i)_{test\_set} + 0.368 \times acc(M_i)_{train\_set} \right)$
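A minimal MATLAB sketch of the k-fold cross-validation procedure above; train_fn and predict_fn are hypothetical function handles standing in for any learner (for instance the knn_classify sketch earlier), not names from the slides.

% Minimal k-fold cross-validation loop (no toolbox functions).
% X: N x d data, y: N x 1 labels, k: number of folds.
function acc = kfold_cv(X, y, k, train_fn, predict_fn)
    N = size(X, 1);
    fold = mod(randperm(N), k) + 1;   % random, roughly balanced fold labels
    correct = 0;
    for i = 1:k
        test  = (fold == i);
        model = train_fn(X(~test, :), y(~test));    % train on k-1 folds
        yhat  = predict_fn(model, X(test, :));      % test on the held-out fold
        correct = correct + sum(yhat == y(test));
    end
    acc = correct / N;   % accuracy pooled over all folds
end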
Model Evaluation
– Metrics for performance evaluation
– Methods for performance evaluation
– Methods for model comparison: how to compare the relative performance among competing models?

ROC (Receiver Operating Characteristic)
Developed in the 1950s for signal detection theory to analyze noisy signals
– Characterizes the trade-off between positive hits and false alarms
The ROC curve plots TPR (on the y-axis) against FPR (on the x-axis)
The performance of each classifier is represented as a point on the ROC curve
If the classifier returns a real-valued prediction, changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point

ROC Curve
                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL   Class=Yes     a (TP)      b (FN)
CLASS    Class=No      c (FP)      d (TN)
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
Example at threshold t: TP = 50, FN = 50, FP = 12, TN = 88

ROC Curve (cont'd)
Notable (TPR, FPR) points:
(0,0): declare everything to be the negative class – TP = 0, FP = 0
(1,1): declare everything to be the positive class – FN = 0, TN = 0
(1,0): ideal – FN = 0, FP = 0
Diagonal line:
– Random guessing
– Below the diagonal line: the prediction is the opposite of the true class

How to Construct an ROC Curve
Instance  P(+|A)  True Class
1         0.95    +
2         0.93    +
3         0.87    -
4         0.85    -
5         0.85    -
6         0.85    +
7         0.76    -
8         0.53    +
9         0.43    -
10        0.25    +
• Use a classifier that produces a posterior probability P(+|A) for each test instance A
• Sort the instances according to P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP / (TP + FN)
• FP rate, FPR = FP / (FP + TN)

Example: pick threshold 0.85
• p ≥ 0.85: predicted positive; p < 0.85: predicted negative
• TP = 3, FP = 3, TN = 2, FN = 2
• TPR = 3/5 = 60%, FPR = 3/5 = 60%

Sweeping the threshold over all values:
Class          +     -     +     -     -     -     +     -     +     +
Threshold ≥   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP             5     4     4     3     3     3     3     2     2     1     0
FP             5     5     4     4     3     2     1     1     0     0     0
TN             0     0     1     1     2     3     4     4     5     5     5
FN             0     1     1     2     2     2     2     3     3     4     5
TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0
(figure: the resulting ROC curve)

Using ROC for Model Comparison
(figure: ROC curves of two models M1 and M2)
No model consistently outperforms the other:
– M1 is better for small FPR
– M2 is better for large FPR
Area under the ROC curve (AUC):
– Ideal: area = 1
– Random guess: area = 0.5
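One way to reproduce the ROC construction above in MATLAB, for the ten instances from the slide; sweeping the threshold from high to low makes FPR non-decreasing, so the AUC can then be computed with the trapezoidal rule.

% Construct ROC points by sweeping a threshold over posterior scores.
% p and labels are the ten instances from the slide (+1 = positive class).
p      = [0.95 0.93 0.87 0.85 0.85 0.85 0.76 0.53 0.43 0.25]';
labels = [  1    1   -1   -1   -1    1   -1    1   -1    1 ]';

thresholds = [Inf; sort(unique(p), 'descend')];   % high to low; Inf gives (0,0)
nt  = numel(thresholds);
TPR = zeros(nt, 1);  FPR = zeros(nt, 1);
for t = 1:nt
    pred   = (p >= thresholds(t));                         % predict + at/above threshold
    TPR(t) = sum(pred & labels ==  1) / sum(labels ==  1); % TP / (TP + FN)
    FPR(t) = sum(pred & labels == -1) / sum(labels == -1); % FP / (FP + TN)
end
AUC = trapz(FPR, TPR);   % area under the curve by the trapezoidal rule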
Revisit K-Nearest Neighbor
K-NN:
– An instance-based algorithm: uses the k "closest" points (nearest neighbors) to perform classification
– k-NN classifiers are lazy learners (they do not build models explicitly)
– Classifying unknown examples is relatively expensive compared with model-learning algorithms (parametric approaches)

Nearest Neighbor Classifiers
Basic idea:
– If it walks like a duck and quacks like a duck, then it's probably a duck
(figure: compute the distance from the test record to the training records, then choose the k "nearest" records)

Requires three things:
– The set of stored examples
– A distance metric to compute the distance between examples
– The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
– Compute the distance to the other training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

Definition of Nearest Neighbor
(figures: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor)
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

1-Nearest Neighbor
(figure: Voronoi diagram induced by the training points)

Nearest Neighbor Classification
Compute the distance between two points, e.g. the Euclidean distance
$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
Determine the class from the nearest neighbor list:
– Take the majority vote of the class labels among the k nearest neighbors
– Or weigh the vote according to distance, e.g. weight factor $w = 1 / d^2$

Nearest Neighbor Classification (cont'd)
Choosing the value of k:
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from other classes

Scaling issues
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
– Example:
  height of a person may vary from 1.5m to 1.8m
  weight of a person may vary from 90lb to 300lb
  income of a person may vary from $10K to $1M

Problems with the Euclidean measure:
– High dimensional data: curse of dimensionality; one solution is to do dimension reduction first
– It can produce counter-intuitive results, e.g.
  111111111110 vs 011111111111: d = 1.4142
  100000000000 vs 000000000001: d = 1.4142
– Solution: normalize the data

Data normalization
Example-wise normalization
– Each example is normalized and mapped to the unit sphere
Feature-wise normalization
– [0,1]-normalization: normalize each feature into the unit interval
– Standard normalization: normalize each feature to have mean 0 and standard deviation 1
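A short MATLAB sketch of the three normalization schemes above; the data matrix is made up from the slide's height/weight/income example.

% Normalization of a data matrix X (N examples x d features).
X = [1.5  90  10000;
     1.8 300 1000000;
     1.6 150  50000];   % made-up height / weight / income rows

% [0,1]-normalization: map each feature (column) into the unit interval
Xmin = min(X);  Xmax = max(X);
X01  = (X - Xmin) ./ (Xmax - Xmin);

% Standard (z-score) normalization: zero mean, unit standard deviation per feature
Xstd = (X - mean(X)) ./ std(X);

% Example-wise normalization: map each example (row) onto the unit sphere
Xunit = X ./ sqrt(sum(X .^ 2, 2));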
Classification
Training data is given
– Each object is associated with a class label $Y \in \{1, 2, …, K\}$ and a feature vector of d measurements: $X = (X_1, …, X_d)$
Build a model from the training data
Unseen objects are to be classified as belonging to one of a number of predefined classes {1, 2, …, K}
Linear discriminant analysis / Fisher's linear discriminant

Two classes / Three classes
(figures: two-class and three-class data clouds with candidate projection directions u_1, u_2)

Classifiers
Classifiers are built from a training set L = (X_1, Y_1), ..., (X_n, Y_n)
A classifier C built from a learning set L is a mapping C: X → {1, 2, ..., K}
The Bayes classifier is based on the conditional densities $p(C_k \mid X)$:
$C(X) = \arg\max_k p(C_k \mid X)$
This is a maximum a posteriori rule, and $p(C_k \mid X)$ is a posterior density

The Rules of Probability
Sum rule: $p(X) = \sum_Y p(X, Y)$
Product rule: $p(X, Y) = p(X \mid Y)\, p(Y)$
Bayes' rule: $p(Y = C \mid X = \text{data}) \propto p(X \mid Y)\, p(Y)$, i.e., posterior ∝ likelihood × prior; the denominator p(X) is irrelevant to Y = C

Maximum a posteriori
$p(C_k \mid X) = p(X \mid C_k)\, p(C_k) / p(X)$
Find a class label C(X) so that $\max_k p(C_k \mid X) = \max_k p(X \mid C_k)\, p(C_k)$
Naïve Bayes assumes independence among all features (last class):
$p(X \mid C_k) = p(x_1 \mid C_k)\, p(x_2 \mid C_k) \cdots p(x_d \mid C_k)$
This is a very strong assumption

Multivariate normal density for each class
Assume multivariate Gaussian (normal) class densities $X \mid Y = k \sim N(\mu_k, \Sigma_k)$
Maximizing the posterior is equivalent to maximizing $p(X \mid C_k)\, p(C_k)$, and equivalent to maximizing the logarithm of $p(X \mid C_k)\, p(C_k)$:
$C(X) = \arg\min_k \left\{ (X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k) + \log|\Sigma_k| - 2 \log p(C_k) \right\}$

Two-class case
$C(X) = C_1$ if $p(X \mid C_1)\, p(C_1) \ge p(X \mid C_2)\, p(C_2)$; $C(X) = C_2$ otherwise
Equivalently, $C(X) = C_1$ if
$\frac{p(X \mid C_1)\, p(C_1)}{p(X \mid C_2)\, p(C_2)} \ge 1 \;\Leftrightarrow\; \frac{p(X \mid C_1)}{p(X \mid C_2)} \ge \frac{p(C_2)}{p(C_1)} \;\Leftrightarrow\; \log \frac{p(X \mid C_1)}{p(X \mid C_2)} \ge \log \frac{p(C_2)}{p(C_1)}$

Gaussian discriminant rule
For multivariate Gaussian class densities $X \mid Y = k \sim N(\mu_k, \Sigma_k)$, the classification rule is
$C(X) = \arg\min_k \left\{ (X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k) + \log|\Sigma_k| \right\}$
In general, this is a quadratic rule (quadratic discriminant analysis, or QDA)
In practice, the population mean vectors $\mu_k$ and covariance matrices $\Sigma_k$ are estimated by the corresponding sample quantities

Sample mean and variance
Class mean: $\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$
Class covariance: $\Sigma_i = \frac{1}{|C_i|} \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T$

Example
$X_1 = \begin{pmatrix}1\\0\end{pmatrix}$, $X_2 = \begin{pmatrix}2\\1\end{pmatrix}$, $X_3 = \begin{pmatrix}0\\2\end{pmatrix}$
$\mu = \frac{1}{3}(X_1 + X_2 + X_3) = \begin{pmatrix}1\\1\end{pmatrix}$
$\Sigma = \frac{1}{3}\left[ (X_1 - \mu)(X_1 - \mu)^T + (X_2 - \mu)(X_2 - \mu)^T + (X_3 - \mu)(X_3 - \mu)^T \right] = \frac{1}{3}\left[ \begin{pmatrix}0&0\\0&1\end{pmatrix} + \begin{pmatrix}1&0\\0&0\end{pmatrix} + \begin{pmatrix}1&-1\\-1&1\end{pmatrix} \right] = \frac{1}{3}\begin{pmatrix}2&-1\\-1&2\end{pmatrix}$

Two-class case
If the two classes have the same covariance matrix, $\Sigma_k = \Sigma$, the discriminant rule is linear (linear discriminant analysis, or LDA; FLDA for K = 2); the quadratic rule becomes
$X^T \Sigma^{-1} (\mu_2 - \mu_1) \ge c$, i.e., $X^T w \ge c$ where $w = \Sigma^{-1}(\mu_2 - \mu_1)$
Usually $\Sigma$ is estimated by the pooled covariance $\Sigma = \frac{1}{n}(n_1 \Sigma_1 + n_2 \Sigma_2)$
(illustration: two Gaussian clouds with means μ1 and μ2)

Two-class case (cont'd)
Maximize the signal-to-noise ratio
$\max_w \frac{w^T \Sigma_{between}\, w}{w^T \Sigma_{within}\, w}$
where the between-class separation and within-class cohesion are
$\Sigma_{between} = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T$, $\quad \Sigma_{within} = \frac{1}{n}(n_1 \Sigma_1 + n_2 \Sigma_2)$
The solution is $w = \Sigma_{within}^{-1}(\mu_2 - \mu_1)$
(illustrations: the LDA direction when the two classes overlap vs. are separated; projection onto the LDA axis μ2 – μ1 and the best threshold)

Multi-class case
Two approaches
– Apply Fisher LDA to each "one-versus-rest" class
– Second approach: find multiple directions that form a low-dimensional space; the transformation matrix W that projects the data to be most separable maximizes
$\max_W \frac{W^T S_b W}{W^T S_w W}$, written correctly as $\max_W \text{trace}\left( (W^T S_b W)(W^T S_w W)^{-1} \right)$
with between-class matrix $S_b = \frac{1}{n} \sum_{k=1}^{K} n_k (\mu_k - \mu)(\mu_k - \mu)^T$
and within-class matrix $S_w = \frac{1}{n} \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^T$

Intuition
The goal is to simultaneously maximize the between-class separation and minimize the within-class cohesion
The solution to $\max_W \text{trace}\left( (W^T S_b W)(W^T S_w W)^{-1} \right)$ is a generalized eigenvalue problem on $S_w^{-1} S_b$: the columns of W are the generalized eigenvectors solving $S_b\, g = \lambda\, S_w\, g$

Graphic view of the transformation (projection)
The training data matrix A (n × d) is multiplied by the transformation matrix W (d × (K – 1)) to give the reduced training data A·W (n × (K – 1))
For a test data point h (1 × d), the projection h·W (1 × (K – 1)) is classified in the reduced space by finding the nearest neighbor or the nearest centroid

Summary
First applied by M. Barnard at the suggestion of R. A. Fisher (1936); Fisher linear discriminant analysis (FLDA) does two things:
Dimension reduction
– Finds linear combinations of the features X = X_1, ..., X_d with large ratios of between-groups to within-groups sums of squares (the discriminant variables)
Classification
– Predicts the class of an observation X by the class whose mean vector is closest to X in terms of the discriminant variables
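A MATLAB sketch of the two-class LDA rule derived above, reusing the three-point example for class 1; class 2's points and the halfway threshold are our own illustrative choices.

% Two-class LDA: w = Sigma^(-1) (mu2 - mu1), with pooled covariance.
X1 = [1 0; 2 1; 0 2];          % class 1 (the 3-point example above)
X2 = [4 4; 5 3; 4 5; 6 4];     % class 2 (made up for illustration)
n1 = size(X1, 1);  n2 = size(X2, 1);  n = n1 + n2;

mu1 = mean(X1)';  mu2 = mean(X2)';
% Class covariances, normalized by class size as in the slides
S1 = (X1 - mu1')' * (X1 - mu1') / n1;
S2 = (X2 - mu2')' * (X2 - mu2') / n2;
% Pooled within-class covariance
Sigma = (n1 * S1 + n2 * S2) / n;

w = Sigma \ (mu2 - mu1);       % LDA projection direction
% One simple threshold: halfway between the projected class means
c = w' * (mu1 + mu2) / 2;
% Classify a new point x: class 2 if w'*x > c, otherwise class 1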
We just introduced Fisher discriminant analysis, particularly linear discriminant analysis
Now let us discuss the support vector machine (SVM)

History of SVM
SVM is inspired by statistical learning theory [3]
SVM was first introduced in 1992 [1]
SVM became popular because of its success in handwritten digit recognition [2]
SVM is now regarded as an important example of "kernel methods", arguably the hottest area in machine learning
http://www.kernel-machines.org/
[1] B.E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory 5, 144–152, Pittsburgh, 1992.
[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77–82, 1994.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 1st edition, Springer, 1996.

Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data
(figures: one possible separating hyperplane B1; another possible solution B2; other possible solutions)
Which one is better, B1 or B2? How do you define "better"?
Find the hyperplane that maximizes the margin: B1 is better than B2
(figure: hyperplanes B1 and B2 with their margins b11–b12 and b21–b22)

Support Vector Machines (cont'd)
For the separating hyperplane $w \cdot x + b = 0$ with margin hyperplanes $w \cdot x + b = 1$ and $w \cdot x + b = -1$:
$f(x) = \begin{cases} 1 & \text{if } w \cdot x + b \ge 1 \\ -1 & \text{if } w \cdot x + b \le -1 \end{cases}$
$\text{Margin} = \frac{2}{\|w\|}$

What if the problem is not linearly separable?
What if the decision boundary is not linear?
– Nonlinear support vector machines: transform the data into a higher dimensional space

Outline of the SVM lecture
– Linear classifier
– Maximum margin classifier: estimate the margin
– SVM for separable data
– SVM for non-separable data

Linear classifiers
$f(x, w, b) = \text{sign}(w \cdot x + b)$
(figures: two classes of points, denoted +1 and –1, and several candidate separating lines)
How would you classify this data?

Classifier Margin
Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point

Maximum Margin
The maximum margin linear classifier is the linear classifier with the maximum margin
This is the simplest kind of SVM (called an LSVM)
Support vectors are those data points that the margin pushes up against

Why Maximum Margin?
1. Intuitively this feels safest
2. If we've made a small error in the location of the boundary, this gives us the least chance of causing a misclassification
3. The model is immune to removal of any non-support-vector data points
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing
5. Empirically it works very, very well
Estimate the Margin
What is the distance expression for a point x to the line w·x + b = 0?
$d(x) = \frac{|x \cdot w + b|}{\|w\|} = \frac{|x \cdot w + b|}{\sqrt{\sum_{i=1}^{d} w_i^2}}$
Derivation: let y be the point on the line closest to x, so that y – x is parallel to w: $y = x + r \frac{w}{\|w\|}$. Using $y \cdot w + b = 0$, we have $x \cdot w + r\|w\| + b = 0$, hence $|r| = \frac{|b + x \cdot w|}{\|w\|}$

Estimate the Margin (cont'd)
What is the expression for the margin?
$\text{margin} = \min_{x \in D} d(x) = \min_{x \in D} \frac{|x \cdot w + b|}{\sqrt{\sum_{i=1}^{d} w_i^2}}$

Maximize the Margin
$\arg\max_{w,b} \text{margin}(w, b, D) = \arg\max_{w,b} \min_{x_i \in D} d(x_i) = \arg\max_{w,b} \min_{x_i \in D} \frac{|b + x_i \cdot w|}{\sqrt{\sum_{i=1}^{d} w_i^2}}$
subject to $\forall x_i \in D: y_i (x_i \cdot w + b) \ge 0$
This is a min-max problem
Strategy: fix the scale so that $\min_{x_i \in D} |b + x_i \cdot w| = 1$; then solve
$\arg\min_{w,b} \sum_{i=1}^{d} w_i^2$ subject to $\forall x_i \in D: y_i (x_i \cdot w + b) \ge 1$

Maximum Margin Linear Classifier
$\{w^*, b^*\} = \arg\min_{w,b} \sum_{k=1}^{d} w_k^2$
subject to
$y_1 (w \cdot x_1 + b) \ge 1, \;\; y_2 (w \cdot x_2 + b) \ge 1, \;\; \ldots, \;\; y_N (w \cdot x_N + b) \ge 1$
How do we solve it?

Learning via Quadratic Programming
QP is a well-studied class of optimization algorithms for optimizing a quadratic function of some real-valued variables subject to linear constraints
Available solvers:
– SVMlight http://svmlight.joachims.org/
– LibSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/
– Matlab optimization toolbox

Quadratic Programming
Find $\arg\max_u \; c + d^T u + \frac{u^T R u}{2}$ (quadratic criterion)
subject to n linear inequality constraints
$a_{11} u_1 + a_{12} u_2 + \ldots + a_{1m} u_m \le b_1, \;\; \ldots, \;\; a_{n1} u_1 + a_{n2} u_2 + \ldots + a_{nm} u_m \le b_n$
and e additional linear equality constraints
$a_{(n+1)1} u_1 + \ldots + a_{(n+1)m} u_m = b_{n+1}, \;\; \ldots, \;\; a_{(n+e)1} u_1 + \ldots + a_{(n+e)m} u_m = b_{n+e}$

Quadratic Programming of SVM
$\{w^*, b^*\} = \arg\min_{w,b} \sum_i w_i^2$ subject to $y_i (w \cdot x_i + b) \ge 1$ for all training data $(x_i, y_i)$
The quadratic term is $\begin{pmatrix} w \\ b \end{pmatrix}^T \begin{pmatrix} I_d & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} w \\ b \end{pmatrix}$, with the N inequality constraints
$y_1 (w \cdot x_1 + b) \ge 1, \;\; \ldots, \;\; y_N (w \cdot x_N + b) \ge 1$

Non-separable data
This is going to be a problem! What should we do?
Idea 1: find minimum $\|w\|^2$ while minimizing the number of training set errors
– Problemette: two things to minimize makes for an ill-defined optimization
Idea 1.1: minimize $\|w\|^2 + C \times (\#\text{train errors})$, with a trade-off parameter C
– Some points will violate $y_i (w \cdot x_i + b) \ge 1$; we allow errors to occur: $y_i (w \cdot x_i + b) \ge 1 - \varepsilon_i$, $\varepsilon_i \ge 0$ (hinge loss)
Idea 2.0: minimize $\|w\|^2 + C \times (\text{distance of error points to their correct place})$:
$\min \|w\|^2 + C \sum_{i=1}^{N} \varepsilon_i$ subject to $y_i (w \cdot x_i + b) \ge 1 - \varepsilon_i$, $\varepsilon_i \ge 0$

Linearly inseparable case
$\{w^*, b^*\} = \arg\min_{w,b} \sum_i w_i^2 + c \sum_{j=1}^{N} \varepsilon_j$
subject to
$y_1 (w \cdot x_1 + b) \ge 1 - \varepsilon_1, \; \varepsilon_1 \ge 0; \;\; \ldots; \;\; y_N (w \cdot x_N + b) \ge 1 - \varepsilon_N, \; \varepsilon_N \ge 0$
This balances the trade-off between the margin and the classification errors
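A quick numerical check of the distance and margin expressions derived at the start of this section, for a made-up hyperplane and point.

% Check d(x) = |w.x + b| / ||w|| for a made-up line and point
w = [3; 4];  b = -5;               % the line 3*x1 + 4*x2 - 5 = 0
x = [2; 2];
d = abs(w' * x + b) / norm(w);     % = |6 + 8 - 5| / 5 = 1.8

% With the scaling min_i |w.x_i + b| = 1, the margin between the two
% supporting hyperplanes w.x + b = +1 and w.x + b = -1 is 2/||w||
margin = 2 / norm(w);              % = 0.4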
Determining the value of c
How do we determine the appropriate value for c?
Cross-validation on the training data:
– Take possible choices for c
– For each choice, run a cross-validation procedure and calculate the error metric (chosen properly)
– Find the choice that achieves the best metric
– Use the best choice on all training data

A toy example on SVM (assignment 2)
(figure and table: ten 2-D training points x_1, …, x_10 with labels y, shown as a scatter plot)

Separable case
$\arg\min_{w,b} \sum_{i=1}^{d} w_i^2$ subject to $\forall x_i \in D: y_i (x_i \cdot w + b) \ge 1$
Matlab script:
[N,d] = size(X);
% inequality constraints: y_i * ([x_i 1] * [w; b]) >= 1
A = diag(y) * [X ones(N,1)];
Rhs = ones(N,1);
% objective: quadratic in u = [w; b]; b is not penalized
H = [eye(d) zeros(d,1); zeros(1,d) 0];
f = zeros(d+1, 1);
% quadprog solves min 0.5*u'*H*u + f'*u s.t. A*u <= b, so negate the constraints
[sol,FVAL,EXITFLAG,OUTPUT] = quadprog(H,f,-A,-Rhs);
w = sol(1:d);  b = sol(d+1);

Inseparable case
$\min_{w,b} \sum_{i=1}^{d} w_i^2 + c \sum_{i=1}^{N} \varepsilon_i$ subject to $y_i (w \cdot x_i + b) \ge 1 - \varepsilon_i, \; \varepsilon_i \ge 0$
Matlab script:
[N,d] = size(X);
% inequality constraints: y_i * ([x_i 1] * [w; b]) + eps_i >= 1
A = [diag(y) * [X ones(N,1)] eye(N)];
Rhs = ones(N,1);
% objective over u = [w; b; eps]: quadratic in w, linear in the slacks
H = [eye(d) zeros(d,1+N); zeros(1+N,d) zeros(1+N,1+N)];
f = [zeros(d+1, 1); c*ones(N,1)];
% bound constraints: slacks are nonnegative, w and b are free
Lb = [-Inf * ones(d+1,1); zeros(N,1)];
[sol,FVAL,EXITFLAG,OUTPUT] = quadprog(H,f,-A,-Rhs,[],[],Lb);
w = sol(1:d);  b = sol(d+1);
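One way to exercise the separable-case script above end to end, on made-up linearly separable data (quadprog requires the Optimization Toolbox, as in the scripts above).

% Made-up linearly separable data: two Gaussian clusters
rng(0);
X = [randn(10,2) + 3; randn(10,2) - 3];
y = [ones(10,1); -ones(10,1)];

% Run the separable-case script above
[N,d] = size(X);
A = diag(y) * [X ones(N,1)];
Rhs = ones(N,1);
H = [eye(d) zeros(d,1); zeros(1,d) 0];
f = zeros(d+1,1);
sol = quadprog(H,f,-A,-Rhs);
w = sol(1:d);  b = sol(d+1);

% Sanity checks: every training point satisfies y_i(w.x_i + b) >= 1
% (up to solver tolerance), and the margin width equals 2/||w||
margins = y .* (X * w + b);
width = 2 / norm(w);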
The next couple of slides are backup slides (not required in this class)

Support Vector Machine for Noisy Data
Interpretation of the slack variables:
– $\varepsilon_i > 1$: $y_i (w \cdot x_i + b) < 0$, i.e., a misclassification
– $0 < \varepsilon_i \le 1$: $x_i$ is correctly classified, but lies inside the margin
– $\varepsilon_i = 0$: $x_i$ is classified correctly, and lies outside the margin
$\sum_{i=1}^{k} \varepsilon_i$ is an upper bound on the number of training errors
(figure: two classes with slack variables marked)

Support Vector Machine for Noisy Data (cont'd)
$\{w^*, b^*\} = \arg\min_{w,b} \sum_i w_i^2 + c \sum_{j=1}^{N} \varepsilon_j$
subject to the inequality constraints
$y_i (w \cdot x_i + b) \ge 1 - \varepsilon_i, \;\; \varepsilon_i \ge 0, \;\; i = 1, \ldots, N$
How do we determine the appropriate value for c? Cross-validation

Lagrangian duality
General optimization problem: minimize f(w) subject to $g_i(w) \ge 0$, $i = 1, \ldots, k$
Define the Lagrangian
$L_p(w, a) = f(w) - \sum_{i=1}^{k} a_i g_i(w) = f(w) - a^T g(w)$
Lagrangian dual problem: maximize $L_D(a) = \inf_w L_p(w, a)$ subject to $a \ge 0$
Weak duality theorem: $L_D(a) \le f(w)$; the duality gap is $f(w) - L_D(a)$
Let $w^*$ be the minimum of the Lagrangian with respect to w, and let $a^*$ be the maximum of the Lagrangian dual with respect to a
If the constraints g are linear functions of w, then the duality gap is 0 and $a_i^* g_i(w^*) = 0$ for all i

Karush–Kuhn–Tucker conditions
$\frac{\partial L_p(w^*, a^*)}{\partial w} = 0$
$a_i^* g_i(w^*) = 0, \; i = 1, \ldots, k$ (complementarity condition)
$g_i(w^*) \ge 0, \; i = 1, \ldots, k$ (feasibility condition)
$a_i^* \ge 0, \; i = 1, \ldots, k$

Lagrangian of the SVM
Use the Lagrangian formulation for the optimization problem: introduce a positive Lagrange multiplier for each inequality constraint
$y_i (x_i \cdot w + b) - 1 + \varepsilon_i \ge 0$ with multiplier $a_i$, and $\varepsilon_i \ge 0$ with multiplier $\lambda_i$, for all i
This gives the following Lagrangian:
$L_p = \|w\|^2 + c \sum_i \varepsilon_i - \sum_i a_i \left\{ y_i (x_i \cdot w + b) - 1 + \varepsilon_i \right\} - \sum_i \lambda_i \varepsilon_i$

Deriving the dual
Take the derivatives of $L_p$ with respect to w, b, and $\varepsilon_i$:
$\frac{\partial L_p}{\partial w} = 2w - \sum_i a_i y_i x_i = 0 \;\Rightarrow\; w = \frac{1}{2} \sum_i a_i y_i x_i$
$\frac{\partial L_p}{\partial b} = \sum_i a_i y_i = 0$
$\frac{\partial L_p}{\partial \varepsilon_i} = c - a_i - \lambda_i = 0 \;\Rightarrow\; \lambda_i = c - a_i \ge 0$
Substituting back gives the dual function
$L_D(a) = \sum_i a_i - \frac{1}{2} \sum_{i,j} a_i a_j y_i y_j (x_i \cdot x_j), \quad 0 \le \alpha_i \le c$
Neither $\varepsilon_i$ nor its multiplier $\lambda_i$ appears in the dual function

The Dual Form of QP
Maximize $\sum_{k=1}^{R} \alpha_k - \frac{1}{2} \sum_{k=1}^{R} \sum_{l=1}^{R} \alpha_k \alpha_l Q_{kl}$ where $Q_{kl} = y_k y_l (x_k \cdot x_l)$
Subject to these constraints: $0 \le \alpha_k \le c$ for all k, and $\sum_{k=1}^{R} \alpha_k y_k = 0$
Then define $w = \frac{1}{2} \sum_{k=1}^{R} \alpha_k y_k x_k$ and classify with $f(x, w, b) = \text{sign}(w \cdot x + b)$
Data points with $\alpha_k > 0$ are the support vectors, so this sum only needs to be over the support vectors

Support Vectors
The margin hyperplanes are $w \cdot x + b = 1$ and $w \cdot x + b = -1$
Complementarity gives $\forall i: a_i \left\{ y_i (w \cdot x_i + b) - 1 + \varepsilon_i \right\} = 0$
$a_i = 0$ for non-support vectors; $a_i > 0$ for support vectors
The decision boundary is determined only by the support vectors!

Determining b
One approach: fix w and solve
$b^* = \arg\min_{b, \varepsilon} \sum_{j=1}^{N} \varepsilon_j$ subject to $y_i (w \cdot x_i + b) \ge 1 - \varepsilon_i, \; \varepsilon_i \ge 0, \; i = 1, \ldots, N$
This is a linear programming problem!
Another approach is based on the support vectors: for any i with $0 < a_i < c$, we have $\lambda_i = c - a_i > 0$, so $\varepsilon_i = 0$; then $y_i (x_i \cdot w + b) - 1 = 0$, and since $y_i^2 = 1$,
$b = y_i - x_i \cdot w$
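A sketch that solves the dual QP above with quadprog and recovers b from an unbounded support vector. To keep the code internally consistent, it uses the other common convention of minimizing $\frac{1}{2}\|w\|^2$, under which $w = \sum_k \alpha_k y_k x_k$ without the 1/2 factor shown in the slides; the data and c are made up.

% Made-up data and trade-off parameter (requires the Optimization Toolbox)
rng(1);
X = [randn(10,2) + 2; randn(10,2) - 2];
y = [ones(10,1); -ones(10,1)];
c = 10;

R = size(X,1);
Q = (y * y') .* (X * X');          % Q_kl = y_k y_l (x_k . x_l)
% Dual: maximize sum(a) - 0.5*a'*Q*a  <=>  quadprog's min 0.5*a'*Q*a - sum(a)
f = -ones(R,1);
Aeq = y';  beq = 0;                % equality constraint: sum_k alpha_k y_k = 0
lb = zeros(R,1);  ub = c*ones(R,1);% box constraints: 0 <= alpha_k <= c
alpha = quadprog(Q, f, [], [], Aeq, beq, lb, ub);

% Primal weights under the min (1/2)||w||^2 convention
w = X' * (alpha .* y);             % w = sum_k alpha_k y_k x_k
% Unbounded support vectors (0 < alpha_i < c) have eps_i = 0, so
% y_i (x_i . w + b) = 1 and b = y_i - x_i . w; average over them for stability
sv = find(alpha > 1e-6 & alpha < c - 1e-6);
b = mean(y(sv) - X(sv,:) * w);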