
PREFERENCE LEARNING: METHODS AND ALGORITHMS FOR RANKING PROBLEMS

Eyke Hüllermeier
Department of Mathematics and Computer Science
Marburg University, Germany
Poznan, 25-MAR-2014


PREFERENCES BECOME INCREASINGLY IMPORTANT

User preferences play a key role in various fields of application:
COMPUTATIONAL ADVERTISING, AUTONOMOUS AGENTS, RECOMMENDER SYSTEMS, ELECTRONIC COMMERCE, COMPUTER GAMES, ADAPTIVE USER INTERFACES, ADAPTIVE RETRIEVAL SYSTEMS.


PREFERENCES IN AI RESEARCH

- preference representation (CP nets, GAI networks, logical representations, fuzzy constraints, ...)
- reasoning with preferences (decision theory, constraint satisfaction, non-monotonic reasoning, ...)
- preference acquisition (preference elicitation, preference learning, ...)
PREFERENCES ARE UBIQUITOUS

[Figure: examples of preference information - an ordinal utility scale, free text (→ sentiment analysis), numerical information.]
PREFERENCES ARE UBIQUITOUS

[Figure: search results, items marked CLICKED ON vs. NOT CLICKED ON.]

Preferences are not necessarily expressed explicitly, but can be extracted implicitly from people's behavior! → "noisy" data
PREFERENCES ARE UBIQUITOUS

Fostered by the availability of large amounts of data, PREFERENCE LEARNING has recently emerged as a new subfield of machine learning, dealing with the learning of (predictive) preference models from observed, revealed or automatically extracted preference information.
PREFERENCE LEARNING IS AN ACTIVE FIELD

Tutorials:
- European Conf. on Machine Learning, 2010
- Int. Conf. Discovery Science, 2011
- Int. Conf. Algorithmic Decision Theory, 2011
- European Conf. on Artificial Intelligence, 2012

Special Issue on Representing, Processing, and Learning Preferences: Theoretical and Practical Challenges (2011)

J. Fürnkranz & E. Hüllermeier (eds.), Preference Learning, Springer-Verlag 2011

Special Issue on Preference Learning (forthcoming)
PREFERENCE LEARNING IS AN ACTIVE FIELD

- NIPS-01: New Methods for Preference Elicitation
- NIPS-02: Beyond Classification and Regression: Learning Rankings, Preferences, Equality Predicates, and Other Structures
- KI-03: Preference Learning: Models, Methods, Applications
- NIPS-04: Learning with Structured Outputs
- NIPS-05: Workshop on Learning to Rank
- IJCAI-05: Advances in Preference Handling
- SIGIR 07-10: Workshop on Learning to Rank for Information Retrieval
- ECML/PKDD 08-10: Workshop on Preference Learning
- NIPS-09: Workshop on Advances in Ranking
- American Institute of Mathematics Workshop in Summer 2010: The Mathematics of Ranking
- NIPS-11: Workshop on Choice Models and Preference Learning
- EURO 2009-12: Special Track on Preference Learning
- ECAI-12: Workshop on Preference Learning: Problems and Applications in AI
- Dagstuhl Seminar on Preference Learning (2014)
CONNECTIONS TO OTHER FIELDS

[Diagram: preference learning at the center, connected to structured output prediction, learning monotone models, information retrieval, recommender systems, learning with weak supervision, statistics, graph theory, classification (ordinal, multilabel, ...), economics & decision science, social choice, optimization, operations research, and multiple criteria decision making.]
MANY TYPES OF PREFERENCES

- binary vs. graded (e.g., relevance judgements vs. ratings)
- absolute vs. relative (e.g., assessing single alternatives vs. comparing pairs)
- explicit vs. implicit (e.g., direct feedback vs. click-through data)
- structured vs. unstructured (e.g., ratings on a given scale vs. free text)
- single user vs. multiple users (e.g., document keywords vs. social tagging)
- single vs. multi-dimensional
- ...

A wide spectrum of learning problems!
OBJECT RANKING [Cohen et al., 1999]

TRAINING: pairwise preferences between objects.

PREDICTION: ranking a new set of objects.
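To make the training/prediction interface concrete, here is a minimal sketch (mine, not from the slides) of object ranking with a linear utility function learned from pairwise comparisons; the names and the toy data are purely illustrative:

```python
import numpy as np

# Hypothetical sketch: learn a linear utility u(x) = w.x from pairwise
# preferences "a preferred to b", then rank a new set of objects.

def fit_utility(pairs, dim, lr=0.1, epochs=200):
    """pairs: list of (a, b) feature vectors with a preferred over b.
    Gradient ascent on a Bradley-Terry-style log-likelihood."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for a, b in pairs:
            d = a - b                           # want w.d > 0 for a > b
            p = 1.0 / (1.0 + np.exp(-(w @ d)))  # model's P(a preferred to b)
            w += lr * (1.0 - p) * d             # log-likelihood gradient
    return w

def rank(objects, w):
    """PREDICTION: order a new set of objects by decreasing utility."""
    return sorted(objects, key=lambda x: -(w @ x))

# TRAINING: toy pairwise preferences generated from a hidden utility
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])
xs = rng.random((30, 3))
pairs = [(a, b) if a @ w_true > b @ w_true else (b, a)
         for a, b in zip(xs[::2], xs[1::2])]

w = fit_utility(pairs, dim=3)
for obj in rank(list(rng.random((5, 3))), w):
    print(obj)
```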
COLLABORATIVE FILTERING [Goldberg et al., 1992]

[Figure: users × products rating matrix (users U1 ... U99 by products P1 ... P90), partially filled; entries marked "?" are ratings to be predicted.]
PREFERENCE LEARNING TASKS

                        | OBJECT RANKING | COLLABORATIVE FILTERING
------------------------+----------------+------------------------
 product description    | features       | identifier
 preference description | relative       | absolute
 predictions            | ranking        | utility degrees
 number of users/models | single         | many
PREFERENCE LEARNING TASKS

The first two columns describe the representation, the last three the type of preference information:

 task                      | context (input) | alternative (output) | training information | prediction        | ground truth
---------------------------+-----------------+----------------------+----------------------+-------------------+-------------------
 collaborative filtering   | ID              | ID                   | absolute, ordinal    | absolute, ordinal | absolute, ordinal
 dyadic prediction         | feature         | feature              | absolute, ordinal    | absolute, ordinal | absolute, ordinal
 multilabel classification | feature         | ID                   | absolute, binary     | absolute, binary  | absolute, binary
 multilabel ranking        | feature         | ID                   | absolute, binary     | ranking           | absolute, binary
 label ranking             | feature         | ID                   | relative, binary     | ranking           | ranking
 object ranking            | ---             | feature              | relative, binary     | ranking           | ranking or subset
 instance ranking          | ---             | feature              | absolute, ordinal    | ranking           | absolute, ordinal
AGENDA

1. Introduction
2. Preference learning versus elicitation
3. The Choquet integral
4. Choquistic regression
5. Summary and conclusion
MACHINE LEARNING FOR PREDICTIVE MODELING

SUPERVISED LEARNING: Algorithms and methods for discovering dependencies and regularities in a domain of interest, expressed through appropriate models, from specific observations or examples.

MODEL INDUCTION: a learning algorithm, guided by an induction principle and background knowledge, turns data/observations into a model ... used for
- prediction, classification
- adaptation, control
- systems analysis

Many other settings exist, such as online learning, semi-supervised learning, active learning, cost-sensitive learning, multi-task learning, etc.
SPECIFICATION OF A MACHINE LEARNING PROBLEM

- What kind of training data is offered to the learning algorithm?
- What type of model (prediction) is the learner supposed to produce?
- What is the nature of the ground truth, and how is a model assessed? → LOSS FUNCTION

risk ≈ average penalty caused by the model's predictions on data from the unknown data-generating process

- Other criteria, such as complexity ...
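The slide's risk formula did not survive extraction; in standard notation (an assumption consistent with the surrounding text), the risk of a model h is its expected loss under the unknown data-generating distribution P:

```latex
\mathcal{R}(h) \;=\; \mathbb{E}_{(x,y)\sim P}\!\bigl[\ell(h(x),y)\bigr]
\;=\; \int \ell(h(x),y)\;\mathrm{d}P(x,y)
```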
LABEL RANKING

... mapping instances to TOTAL ORDERS over a fixed set of alternatives/labels:

(38, 1, 2, 65K)  ↦  Die Grünen ≻ SPD ≻ CDU ≻ Die Linke
LABEL RANKING: TRAINING DATA

TRAINING

 X1   | X2 | X3 | X4  | preferences
------+----+----+-----+------------------------------
 0.34 | 0  | 10 | 174 | A ≻ B, C ≻ D
 1.45 | 0  | 32 | 277 | B ≻ C ≻ A
 1.22 | 1  | 46 | 421 | B ≻ D, A ≻ D, C ≻ D, A ≻ C
 0.74 | 1  | 25 | 165 | C ≻ A ≻ D, A ≻ B
 0.95 | 1  | 72 | 273 | B ≻ D, A ≻ D
 1.04 | 0  | 33 | 158 | D ≻ A ≻ B, C ≻ B, A ≻ C

Instances are associated with preferences between labels ... no demand for full rankings!
LABEL RANKING: PREDICTION

PREDICTION: given a new instance (0.92, 1, 81, 382), predict a ranking of the labels A, B, C, D.

The prediction assigns rank positions A → 4, B → 1, C → 3, D → 2, i.e., the ranking B ≻ D ≻ C ≻ A: a ranking of all labels.
LABEL RANKING: PREDICTION

The predicted ranking of all labels for the new instance (0.92, 1, 81, 382) is compared against the GROUND TRUTH ranking; the LOSS is measured by the KENDALL RANK CORRELATION.
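The Kendall measure is only named on the slide; as a concrete reference, here is a small self-contained Python sketch (the ground-truth ranking below is illustrative, not from the slides) computing the Kendall rank correlation between two rankings:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two rankings, given as dicts
    mapping each label to its rank position (1 = best)."""
    labels = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(labels, 2):
        # a pair is concordant if both rankings order x and y the same way
        if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(labels)
    return (concordant - discordant) / (n * (n - 1) / 2)

# predicted ranking B > D > C > A vs. a hypothetical ground truth
pred  = {"A": 4, "B": 1, "C": 3, "D": 2}
truth = {"A": 3, "B": 1, "C": 2, "D": 4}
print(kendall_tau(pred, truth))  # 1.0 = identical, -1.0 = reversed
```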
MODEL IDENTIFICATION

[Diagram: a stochastic data-generating process (the "ground truth" MODEL) produces OBSERVED DATA, from which a LEARNING ALGORITHM induces an ESTIMATED MODEL.]

- The model refers to an entire population of individuals (learning = generalizing from sample to population).
- Knowing the model allows for making good predictions on average.
- The dependency the model tries to capture is not deterministic (variability due to aggregation, "noisy" observations, etc.).
MODEL IDENTIFICATION

- Assumptions about the "ground truth" allow for deriving theoretical results (guarantees on the learner).
- Focus on predictive accuracy allows for simple empirical comparison (average prediction performance on benchmark data).
PREFERENCE ELICITATION AND MCDA

[Diagram: a DECISION ANALYST poses preference queries to the decision maker and builds a MODEL from the REVEALED PREFERENCES.]

- Single user, interactive process.
- Inconsistencies are explicitly dealt with.
- Constructive process, guiding the DM toward a coherent preference model.
- No "ground truth", no true vs. estimated model.
- A decision is not objectively correct, but coherent and well-justified.
AGENDA

1. Introduction
2. Preference learning versus elicitation
3. The Choquet integral
4. Choquistic regression
5. Summary and conclusion
CANDIDATES: AGGREGATION OF CRITERIA

 Math | CS   | Statistics | English | score
------+------+------------+---------+------
 0.80 | 0.75 | 0.60       | 0.95    | 0.7
 0.50 | 0.60 | 0.90       | 0.45    | 0.4
 0.95 | 0.50 | 0.90       | 0.50    | 0.6
 ...  | ...  | ...        | ...     | ...
 0.04 | 0.90 | 0.50       | 0.90    | 0.5

- requires a proper "weighing" of criteria according to their importance
- a weighted average does not capture any interactions between criteria
ADDITIVE & NON-ADDITIVE MEASURES

We require of a measure over sets of criteria:
- normalization
- monotonicity
- not necessarily additivity
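The formulas on these slides were lost in extraction; in standard notation, presumably what was shown, a non-additive measure (capacity) μ on the set of criteria N = {1, ..., n} satisfies exactly the requirements listed above:

```latex
\mu : 2^N \to [0,1], \qquad
\mu(\emptyset) = 0, \quad \mu(N) = 1 \quad \text{(normalization)},
\qquad
A \subseteq B \subseteq N \;\Rightarrow\; \mu(A) \le \mu(B) \quad \text{(monotonicity)}
```

Additivity, i.e., μ(A ∪ B) = μ(A) + μ(B) for disjoint A and B, is not required.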
DISCRETE CHOQUET INTEGRAL

Special cases:
- min
- max
- weighted mean (additive measure)
- ordered weighted average (OWA)
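The defining formula on this slide was also lost; in standard notation (presumably what was shown), the discrete Choquet integral of x = (x_1, ..., x_n) with respect to a capacity μ is

```latex
C_\mu(x) \;=\; \sum_{i=1}^{n} \bigl( x_{(i)} - x_{(i-1)} \bigr)\, \mu\!\bigl(A_{(i)}\bigr),
\qquad x_{(1)} \le \dots \le x_{(n)}, \quad x_{(0)} := 0,
```

where (·) is a permutation sorting the values increasingly and A_{(i)} = {(i), ..., (n)} is the set of criteria whose values are at least x_{(i)}. The special cases above correspond to particular choices of μ.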
DECISION BOUNDARY IN TWO DIMENSIONS

[Figure series: six plots showing the level set {instances with aggregated score = c} in the unit square [0,1]², one per underlying measure, illustrating the variety of decision boundaries the Choquet integral can produce.]
CHOQUET INTEGRAL AS A THRESHOLD CLASSIFIER

1. aggregate attribute values (with the Choquet integral)
2. classify positive if above the threshold and negative otherwise

→ high flexibility, possibly coming with the danger of overfitting!
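A minimal sketch of the two steps (my representation, not the slides': the capacity μ is assumed to be a dict mapping frozensets of criterion indices to values in [0, 1]):

```python
import numpy as np

def choquet(x, mu):
    """Discrete Choquet integral of the attribute vector x w.r.t. mu."""
    order = np.argsort(x)                         # sort values increasingly
    total, prev = 0.0, 0.0
    for k, i in enumerate(order):
        A = frozenset(int(j) for j in order[k:])  # criteria with value >= x[i]
        total += (x[i] - prev) * mu[A]
        prev = x[i]
    return total

def classify(x, mu, beta):
    """Step 1: aggregate with the Choquet integral; step 2: threshold."""
    return 1 if choquet(x, mu) > beta else 0

# toy capacity on two criteria with a positive interaction:
# mu({0}) + mu({1}) < mu({0,1}) = 1, so the criteria reinforce each other
mu = {frozenset(): 0.0, frozenset({0}): 0.3,
      frozenset({1}): 0.4, frozenset({0, 1}): 1.0}
print(classify(np.array([0.9, 0.2]), mu, beta=0.5))  # -> 0 (score 0.41)
```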
THE VC DIMENSION

- The Vapnik-Chervonenkis (VC) dimension is a measure of the "flexibility" (capacity) of a model class (a class of binary classifiers).
- It is used, for example, in theoretical guarantees on the generalization performance (a high VC dimension indicates a danger of overfitting): the generalization error is bounded by the empirical error plus a term that grows with the VC dimension.
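The inequality sketched on the slide is presumably the classical VC bound: with probability at least 1 − δ, for every classifier h from a class of VC dimension k, trained on n examples,

```latex
\underbrace{R(h)}_{\text{generalization error}} \;\le\;
\underbrace{\widehat{R}_n(h)}_{\text{empirical error}} \;+\;
\sqrt{\frac{k\bigl(\ln(2n/k) + 1\bigr) + \ln(4/\delta)}{n}}
```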
SHATTERING WITH LINEAR MODELS

[Figure: a small set of points in the plane; every possible 0/1 labeling of the points is separated by a line, i.e., the set is shattered by linear models.]

... one can "justify" the selection of each subset of candidates!
THE VC DIMENSION

- More generally, the VC dimension of the class of linear models on m attributes is m+1.
- Therefore, if one only has m examples, it might be impossible to uniquely identify the class of another instance, even under the assumption of a linear (utility) model.
- If the VC dimension of the model class is k, then a robust prediction cannot be precise unless one has seen at least k examples!
VC DIMENSION OF THE CHOQUET INTEGRAL

[Content lost in extraction; see Hüllermeier & Fallah Tehrani, "On the VC Dimension of the Choquet Integral", IPMU 2012, in the literature list.]
THE MÖBIUS TRANSFORM
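The slide's formulas were lost; the standard definition of the Möbius transform m of a capacity μ, together with its inverse, reads:

```latex
m(A) \;=\; \sum_{B \subseteq A} (-1)^{|A \setminus B|}\, \mu(B),
\qquad\qquad
\mu(A) \;=\; \sum_{B \subseteq A} m(B)
```

A capacity is called k-additive if m(A) = 0 for all A with |A| > k; this is the basis for the case k = 2 discussed later.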
AGENDA

1. Introduction
2. Preference learning versus elicitation
3. The Choquet integral
4. Choquistic regression
5. Summary and conclusion
RANKING BASED ON PAIRED COMPARISONS

TRAINING DATA: observed pairwise preferences between alternatives.

The goal might be to find a Choquet integral whose utility degrees tend to agree with the observed pairwise preferences!
ORDINAL CLASSIFICATION / SORTING

TRAINING DATA: alternatives assigned to ordered classes.

The goal might be to find a Choquet integral whose utility degrees tend to agree with the observed classifications!
THE BINARY CASE

TRAINING DATA: distinguishing between "good" and "bad" alternatives.
DISCRETE CHOQUET INTEGRAL

[Recap slide; content lost in extraction, see the definition given earlier.]
MODEL IDENTIFICATION

- Observed preference data could be considered as constraints on the model parameters, i.e., the underlying fuzzy measure (Möbius transform)
  → "robust" prediction = version space = set of feasible measures
- Yet, recall our basic setting: a stochastic data-generating process (the "ground truth") produces observed data, from which a learning algorithm induces an estimated model.
STOCHASTIC MODELING

LOGISTIC NOISY RESPONSE MODEL: the probability of a positive response is a logistic function of the latent utility, governed by a precision parameter γ (γ = 0 yields pure guessing, γ = ∞ a deterministic threshold rule) and a utility threshold β.

FROM LOGISTIC TO CHOQUISTIC REGRESSION

Logistic regression models the latent utility as a linear function of the attributes; choquistic regression replaces it with the Choquet integral of the (normalized) attribute values (see the model equations below).

Model induction via MAXIMUM LIKELIHOOD INFERENCE: adopt the model parameters that maximize the probability of the observed data (given the model)!
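The model equations were lost in extraction; following Fallah Tehrani et al. (2011), they presumably read:

```latex
\text{logistic:}\quad
P(y=1 \mid x) \;=\; \frac{1}{1 + \exp\!\bigl(-(w^\top x + b)\bigr)},
\qquad
\text{choquistic:}\quad
P(y=1 \mid x) \;=\; \frac{1}{1 + \exp\!\bigl(-\gamma\,(C_\mu(x) - \beta)\bigr)},
```

with γ > 0 the precision of the model, β ∈ [0,1] the utility threshold, and C_μ the Choquet integral of the normalized attribute values.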
MODEL IDENTIFICATION

- constraint-based inference singles out the subset of possible models consistent with the data (the version space) within the model space;
- likelihood inference instead assigns a likelihood to every model in the model space.
CHOQUISTIC REGRESSION: PARAMETER ESTIMATION

- ML estimation leads to a constrained optimization problem:
  - conditions on the bias and scale parameter
  - normalization and monotonicity of the non-additive measure
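A rough, hypothetical sketch of such a constrained fit (not the authors' implementation): the capacity value μ(A) of every nonempty proper subset is treated as a free parameter, normalization is fixed by μ(∅) = 0 and μ(N) = 1, and monotonicity is imposed as inequality constraints for a generic solver:

```python
import itertools
import numpy as np
from scipy.optimize import minimize

n = 3  # number of criteria in this toy example
SUBSETS = [frozenset(s) for k in range(1, n)
           for s in itertools.combinations(range(n), k)]

def capacity(theta):
    mu = {frozenset(): 0.0, frozenset(range(n)): 1.0}
    mu.update(dict(zip(SUBSETS, theta)))
    return mu

def choquet(x, mu):
    order = np.argsort(x)
    total, prev = 0.0, 0.0
    for k, i in enumerate(order):
        total += (x[i] - prev) * mu[frozenset(int(j) for j in order[k:])]
        prev = x[i]
    return total

def neg_log_lik(params, X, y):
    gamma, beta, mu = params[0], params[1], capacity(params[2:])
    nll = 0.0
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-gamma * (choquet(xi, mu) - beta)))
        nll -= np.log(p if yi == 1 else 1.0 - p)
    return nll

def monotonicity_constraints():
    # mu(A) <= mu(A + {i}) for every subset A and criterion i not in A
    cons = []
    for A in [frozenset()] + SUBSETS:
        for i in range(n):
            if i not in A:
                cons.append({"type": "ineq",
                             "fun": lambda p, A=A, B=A | {i}:
                                 capacity(p[2:])[B] - capacity(p[2:])[A]})
    return cons

# toy data, labels from a noisy monotone rule (just to exercise the fit)
rng = np.random.default_rng(1)
X = rng.random((40, n))
y = (X @ np.array([0.2, 0.3, 0.5]) + 0.1 * rng.standard_normal(40) > 0.5).astype(int)

x0 = np.concatenate(([1.0, 0.5], np.full(len(SUBSETS), 0.5)))
res = minimize(neg_log_lik, x0, args=(X, y), method="SLSQP",
               bounds=[(1e-3, None), (0.0, 1.0)] + [(0.0, 1.0)] * len(SUBSETS),
               constraints=monotonicity_constraints())
print("gamma, beta:", res.x[:2])
```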
EXPERIMENTAL EVALUATION (0/1 LOSS)

[Figure: results at 20%, 50%, 80% training data, comparing against a monotone classifier and a nonlinear classifier.]

EXPERIMENTAL EVALUATION (AUC)

[Figure: the same comparison in terms of AUC.]

INTERPRETATION OF A CHOQUISTIC MODEL

- Importance of criteria can be measured by the Shapley index: the (average) increase in μ caused by adding criterion c_i to a context A.
- Interactions between criteria can be measured by the interaction index.
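The formulas on this slide were lost in extraction; the standard definitions, presumably what was shown, are, for criteria N = {1, ..., n}:

```latex
\varphi(i) \;=\; \sum_{A \subseteq N\setminus\{i\}}
\frac{(n-|A|-1)!\;|A|!}{n!}\,\Bigl[\mu(A\cup\{i\}) - \mu(A)\Bigr]
\qquad \text{(Shapley index)}
```

```latex
I(i,j) \;=\; \sum_{A \subseteq N\setminus\{i,j\}}
\frac{(n-|A|-2)!\;|A|!}{(n-1)!}\,
\Bigl[\mu(A\cup\{i,j\}) - \mu(A\cup\{i\}) - \mu(A\cup\{j\}) + \mu(A)\Bigr]
\qquad \text{(interaction index)}
```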
THE POLYTOPE OF CAPACITIES

[Figure: the space of valid measures; since normalization and monotonicity are linear constraints, the valid capacities form a polytope.]
THE CASE k = 2
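The content of this slide was lost in extraction; for a 2-additive capacity (Möbius transform m(A) = 0 whenever |A| > 2), the Choquet integral presumably shown here simplifies to

```latex
C_\mu(x) \;=\; \sum_{i \in N} m(\{i\})\, x_i
\;+\; \sum_{\{i,j\} \subseteq N} m(\{i,j\})\, \min(x_i, x_j),
```

which reduces the number of free parameters from 2^n − 2 to n + n(n−1)/2 and makes the monotonicity constraints linear in m.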
RUNTIME COMPARISON

[Figure: runtime in seconds (0-400) vs. number of attributes (5-10) for the variants CR-orig, CR-AI, CR-AII.]

Average runtime on the SDW data as a function of the number of attributes included.
SUMMARY & CONCLUSION

- Preference learning is
  - methodologically interesting,
  - theoretically challenging,
  - and practically useful, with many potential applications;
  - interdisciplinary (connections to operations research, decision sciences, economics, social choice, recommender systems, information retrieval, ...).
- We advocate the Choquet integral as a versatile tool in the context of (supervised) machine learning, especially for learning monotone models.
- Used as a prediction function to combine several input features (criteria) into an output, the Choquet integral nicely combines monotonicity, non-linearity, and interpretability (importance, interaction).
SELECTED LITERATURE

- E. Hüllermeier, J. Fürnkranz, W. Cheng and K. Brinker. Label ranking by learning pairwise preferences. Artificial Intelligence, 172, 2008.
- W. Cheng, J. Hühn and E. Hüllermeier. Decision tree and instance-based learning for label ranking. ICML-09, Montreal, 2009.
- W. Cheng, K. Dembczynski and E. Hüllermeier. Label ranking using the Plackett-Luce model. ICML-10, Haifa, Israel, 2010.
- W.W. Cohen, R.E. Schapire and Y. Singer. Learning to order things. Journal of Artificial Intelligence Research, 10:243-270, 1999.
- D. Goldberg, D. Nichols, B.M. Oki and D. Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12):61-70, 1992.
- A. Fallah Tehrani, W. Cheng, E. Hüllermeier. Choquistic Regression: Generalizing Logistic Regression using the Choquet Integral. EUSFLAT 2011.
- A. Fallah Tehrani, W. Cheng, K. Dembczynski, E. Hüllermeier. Learning Monotone Nonlinear Models using the Choquet Integral. ECML/PKDD 2011.
- E. Hüllermeier, A. Fallah Tehrani. On the VC Dimension of the Choquet Integral. IPMU 2012.
- A. Fallah Tehrani, W. Cheng, E. Hüllermeier. Preference Learning using the Choquet Integral: The Case of Multipartite Ranking. IEEE Transactions on Fuzzy Systems, 2012.
- E. Hüllermeier, A. Fallah Tehrani. Efficient Learning of Classifiers based on the 2-additive Choquet Integral. In C. Moewes and A. Nürnberger (eds.), Computational Intelligence in Intelligent Data Analysis, Studies in Computational Intelligence, Springer, 2012.
- A. Fallah Tehrani, W. Cheng, K. Dembczynski, E. Hüllermeier. Learning Monotone Nonlinear Models using the Choquet Integral. Machine Learning, 89(1):183-211, 2012.