Multi-label Classification
Yusuke Miyao

• N. Ghamrawi, A. McCallum. Collective multi-label classification. CIKM 2005.
• S. Godbole, S. Sarawagi. Discriminative methods for multi-labeled classification. PAKDD 2004.
• G. Tsoumakas, I. Vlahavas. Random k-labelsets: An ensemble method for multilabel classification. ECML 2007.
• G. Tsoumakas, I. Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 2007.
• A. Fujino, H. Isozaki. Multi-label text categorization with model combination based on F1-score maximization. IJCNLP 2008.

Machine Learning Template Library
• Separate data structures from learning algorithms
• Allow for any combination of structures and algorithms
[Figure: data structures (classifier, Markov chain, semi-Markov, dependency tree, n-best, feature forest, multi-label) connect through a common interface (decode, diff, expectation) to learning algorithms (perceptron, 1-best MIRA, n-best MIRA, reranking, log-linear model, naïve Bayes, max-margin, EM algorithm)]

Target Problem
• Choose multiple labels from a fixed set of labels
• Ex. keyword assignment (text categorization)
[Figure: appropriate keywords (e.g. Food, Recipe) are selected for a text from a fixed keyword set: Politics, Sports, Entertainment, Drama, Life, Comedy, Food, Video, Music, Travel, Animation, Tech, Health, Book, Recipe]

Applications
• Keyword assignment (text categorization)
  – Benchmark data: Reuters-21578, OHSUMED, etc.
• Medical diagnosis
• Protein function classification
  – Benchmark data: Yeast, Genbase, etc.
• Music/scene categorization
• Non-contiguous, overlapping segmentation [McDonald et al., 2005]

Formulation
• x: object; L: label set; y ⊆ L: labels assigned to x
• y* = argmax_{y ⊆ L} f(x, y)  (a toy enumeration of this argmax is sketched at the end of this overview)
[Figure: the keyword set serves as the label set L, with the subset y = {Food, Recipe} assigned to an input x]

Popular Approaches
• Subsets as atomic labels
  – Each subset is considered as an atomic label
  – Tractable only when |L| is small
• A set of binary classifications (one-vs-all; sketched in code below)
  – Each label is assigned independently
• Label ranking
  – A ranking function is induced from multi-labeled data (BoosTexter [Schapire et al., 2000], Rank-SVM [Elisseeff et al., 2002], large-margin methods [Crammer et al., 2003])
• Probabilistic generative models [McCallum, 1999; Ueda et al., 2003; Sato et al., 2007]

Issues in Multi-Label Classification
• How to reduce training/running cost
  – The number of targets (i.e. subsets) grows exponentially with the size of the label set
• How to model correlation among labels
  – Binary classification cannot use features defined on multiple labels
• Classification vs. ranking
• Hierarchical multi-label classification (ex. MeSH terms) [Cesa-Bianchi et al., 2006; Rousu et al., 2006]
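To make the formulation concrete, the sketch below solves y* = argmax_{y ⊆ L} f(x, y) by brute-force enumeration over a toy label set. The scoring function f and the data are hypothetical stand-ins, not a real model; the point is that enumeration touches 2^|L| subsets, which is tractable only when |L| is small.

```python
# A minimal sketch of y* = argmax_{y ⊆ L} f(x, y) by brute-force enumeration.
# `f` and LABELS are toy stand-ins; note the 2^|L| blow-up in the loop.
from itertools import combinations

LABELS = ["Politics", "Sports", "Music", "Food"]  # toy label set L

def f(x, y):
    # Hypothetical score: count the labels mentioned in the text.
    return sum(1.0 for label in y if label.lower() in x.lower())

def argmax_subset(x, labels):
    best_y, best_score = frozenset(), float("-inf")
    for k in range(len(labels) + 1):        # all subset sizes 0..|L|
        for y in combinations(labels, k):   # 2^|L| subsets in total
            s = f(x, set(y))
            if s > best_score:
                best_y, best_score = frozenset(y), s
    return best_y

print(argmax_subset("A story about sports and music stars", LABELS))
# -> frozenset({'Sports', 'Music'})
```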
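The one-vs-all approach from the Popular Approaches slide is the easiest to try in practice. Below is a minimal sketch, assuming scikit-learn is available; the documents and tags are toy stand-ins. Because each label gets its own independent binary classifier, no label correlation is modeled, which is exactly the limitation the later slides address.

```python
# One-vs-all (binary relevance) sketch with scikit-learn; data is a toy stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

docs = ["election results and polls", "match highlights and scores",
        "campaign rally draws crowds", "league standings update"]
tags = [{"Politics"}, {"Sports"}, {"Politics"}, {"Sports"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)        # one binary indicator column per label
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# One independent binary classifier per label; labels are assigned separately.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
pred = clf.predict(vec.transform(["poll results and campaign coverage"]))
print(mlb.inverse_transform(pred))
```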
Collective Multi-Label Classification
• A CRF is applied to multi-label classification:
  y* = argmax_y p(y|x),  where p(y|x) = (1/Z(x)) exp( Σ_k λ_k f_k(x, y) )
• Features are defined on pairs of labels
• Notation:
  – y_i = 1 if the i-th label ∈ y
  – y_i = 0 otherwise

Accounting for Multiple Labels
• Binary model: f_b(x, y): y_i given x
• Collective Multi-Label (CML) model: f_ml(x, y): y_i and y_j
• Collective Multi-Label with Features (CMLF) model: f_mlf(x, y): y_i and y_j given x

Parameter Estimation
• Enumeration of y is intractable in general
• Two approximations (both sketched in code at the end of this section):
  – Supported combinations: consider only the label combinations that occur in the training data
  – Binary pruned inference: first apply the binary model, then consider only the labels whose probabilities exceed a threshold t
• No dynamic programming

Experiments
• Reuters-21578, Modified Apte (ModApte) split
  – 90 labels
  – Training: 9,603 docs; test: 3,299 docs
  – 8.7% of the documents have multiple labels
• OHSUMED "Heart Disease" documents
  – 40 labels assigned to 15-74 training documents
  – 16 labels assigned to 75 or more training documents

Results: Reuters-21578
• Supported combinations
                       Binary   CML     CMLF
  macro-F1             0.4380   0.4478  0.4477
  micro-F1             0.8627   0.8659  0.8635
  exact match          0.7999   0.8329  0.8316
  classification time  1.4 ms   48 ms   78 ms
• Binary pruned
                       Binary   CML     CMLF
  macro-F1             0.4388   0.4792  0.4760
  micro-F1             0.8634   0.8692  0.8701
  exact match          0.8000   0.8119  0.8162
  classification time  1.4 ms   4.6 ms  4.7 ms

Results: OHSUMED
• Supported combinations
               Binary   CML     CMLF
  macro-F1     0.6483   0.6795  0.6629
  micro-F1     0.6849   0.7003  0.6983
  exact match  0.4914   0.5925  0.6025
• Binary pruned
               Binary   CML     CMLF
  macro-F1     0.6482   0.6556  0.6658
  micro-F1     0.6849   0.6751  0.6886
  exact match  0.4918   0.5226  0.5190

Similar Methods
• H. Kazawa et al. Maximal margin labeling for multi-topic text categorization. NIPS 2004.
  – All subsets are considered as atomic labels
  – Approximation by considering only neighbor subsets (subsets that differ from the gold in a single label)
• S. Zhu et al. Multi-labelled classification using maximum entropy method. SIGIR 2005.
  – Simply enumerates all subsets and uses f_ml
  – Only evaluated with small label sets (≤ 10)

Discriminative Methods for Multi-Labeled Classification
• Cascade binary classifiers (SVM)
[Figure: one binary classifier per label (1 … |L|, each outputting +1/-1) feeds an ensemble classifier that produces the final labeling of the input text]
• Another technique: remove negative instances that are close to the decision boundary

Random k-Labelsets
• Randomly select size-k subsets from 2^L
• Train multi-class classifiers for the subsets
• Label a new instance by majority voting (the voting step is sketched in code at the end of this section)
[Figure: classifiers for size-k labelsets predict label vectors Y_1 (1,0,0,1,0,…,0,0), Y_2 (0,1,0,1,0,…,0,1), Y_3 (1,0,0,0,1,…,0,1), …, Y_m (0,0,0,1,0,…,1,1) for an input text; majority voting yields the final assignment (0,0,0,1,0,…,0,1)]

Other Approaches
• Learn a latent model to account for label correlations
  – K. Yu et al. Multi-label informed latent semantic indexing. SIGIR 2005.
  – J. Zhang et al. Learning multiple related tasks using latent independent component analysis. NIPS 2005.
  – V. Roth et al. Improved functional prediction of proteins by learning kernel combinations in multilabel settings. PMSB 2006.
• kNN-like algorithms
  – M.-L. Zhang et al. A k-nearest neighbor based algorithm for multilabel classification. IEEE Conference on Granular Computing, 2005.
  – F. Kang et al. Correlated label propagation with application to multi-label learning. CVPR 2006.
  – K. Brinker et al. Case-based multilabel ranking. IJCAI 2007.
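The sketch below illustrates inference in the spirit of the CML model under the supported-combinations approximation: a log-linear score with per-label and pairwise-label features, maximized over only the label subsets observed in training. It is a deliberate simplification; the weights are toy values, not CRF parameters estimated as in Ghamrawi and McCallum's paper.

```python
# Simplified CML-style inference with the "supported combinations" trick.
# Weights and the supported list are toy stand-ins, not learned parameters.
from itertools import combinations

supported = [frozenset(), frozenset({"Food"}),
             frozenset({"Food", "Recipe"}), frozenset({"Politics"})]

w_label = {("food", "Food"): 2.0, ("recipe", "Recipe"): 1.5,
           ("poll", "Politics"): 2.0}          # per-label features given x
w_pair = {frozenset({"Food", "Recipe"}): 1.0}  # pairwise label co-occurrence

def score(x_words, y):
    s = sum(w_label.get((w, l), 0.0) for w in x_words for l in y)
    s += sum(w_pair.get(frozenset(p), 0.0) for p in combinations(sorted(y), 2))
    return s

def predict(x_words):
    # Plain enumeration over supported subsets; no dynamic programming.
    return max(supported, key=lambda y: score(x_words, y))

# p(y|x) would be exp(score)/Z(x), with Z(x) summing over the same subsets.
print(predict(["easy", "food", "recipe"]))  # -> frozenset({'Food', 'Recipe'})
```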
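Binary pruned inference can be sketched in the same spirit: run the binary model first, keep only the labels whose probability exceeds the threshold t, and enumerate subsets of the survivors with the full model. `binary_prob` and `score` below are hypothetical stand-ins for the trained binary and pairwise models.

```python
# Sketch of binary pruned inference; model callables are toy stand-ins.
from itertools import combinations

def pruned_predict(x, labels, binary_prob, score, t=0.1):
    # Step 1: prune with the binary model's per-label probabilities.
    survivors = [l for l in labels if binary_prob(x, l) > t]
    # Step 2: enumerate only 2^|survivors| subsets with the full model.
    best, best_s = frozenset(), float("-inf")
    for k in range(len(survivors) + 1):
        for y in combinations(survivors, k):
            s = score(x, frozenset(y))
            if s > best_s:
                best, best_s = frozenset(y), s
    return best

probs = {"Food": 0.9, "Recipe": 0.4, "Politics": 0.01}
result = pruned_predict("x", ["Food", "Recipe", "Politics"],
                        binary_prob=lambda x, l: probs[l],
                        score=lambda x, y: len(y))  # toy: prefer larger subsets
print(result)  # -> frozenset({'Food', 'Recipe'}); "Politics" was pruned
```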
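Finally, a sketch of the Random k-Labelsets voting step. It assumes each random labelset already has a trained multi-class classifier wrapped as a `predict_fn`; the thresholded vote ratio follows the majority-voting idea on the slide above, and all names here are illustrative.

```python
# Sketch of RAkEL-style prediction: per-label majority voting over models
# trained on random size-k labelsets. Classifiers are abstracted as callables.
import random
from collections import Counter

def random_labelsets(labels, k, m, seed=0):
    rng = random.Random(seed)
    return [frozenset(rng.sample(labels, k)) for _ in range(m)]

def rakel_predict(x, labels, predictors, threshold=0.5):
    """predictors: (labelset, predict_fn) pairs; predict_fn(x) returns the
    subset of its own labelset that its multi-class classifier predicts."""
    votes, seen = Counter(), Counter()
    for labelset, predict_fn in predictors:
        predicted = predict_fn(x)
        for l in labelset:
            seen[l] += 1                   # models able to vote on label l
            votes[l] += l in predicted     # models that voted "yes"
    # Assign a label if more than `threshold` of its voters said yes.
    return {l for l in labels if seen[l] and votes[l] / seen[l] > threshold}
```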
Summary
• Multi-label classification is an important and interesting problem
• Major issues:
  – Label correlation
  – Computational cost
• Many methods have been proposed
  – Mostly enhancements of the fundamental methods (subsets as atomic labels, sets of binary classifications)
• No existing method solves the problem completely

Future Directions
• Algorithms for exact solutions?
• Other learning algorithms
  – Via the machine learning template library
• Structuring of label sets
  – IS-A hierarchy → hierarchical multi-label classification
  – Exclusive labels
• Modeling of label distance
  – Redesign of objective functions

Possible Applications
• Any task of keyword assignment
• Substitute for n-best/ranking
• Multi-label problems where the label set is not fixed
  – Keyword (key phrase) extraction: choose words/phrases from each document
  – Summarization by sentence extraction
• cf. D. Xin et al. Extracting redundancy-aware top-k patterns. KDD 2006.