Multi-label CRF

Multi-label Classification
Yusuke Miyao
• N. Ghamrawi, A. McCallum. Collective multi-label
classification. CIKM 2005.
• S. Godbole, S. Sarawagi. Discriminative methods for
multi-labeled classification. PAKDD 2004.
• G. Tsoumakas, I. Vlahavas. Random k-labelsets: An
ensemble method for multilabel classification. ECML
2007.
• G. Tsoumakas, I. Katakis. Multi-label classification: An
overview. Journal of Data Warehousing and Mining.
2007.
• A. Fujino, H. Isozaki. Multi-label text categorization
with model combination based on F1-score
maximization. IJCNLP 2008.
Machine Learning Template Library
• Separate data structures from learning algorithms
• Allow for any combinations of structures and algorithms
[Diagram: any data structure can be combined with any learning algorithm through a common interface]
– Data structures: Classifier, Markov chain, Semi-Markov, Dep. tree, Reranking, Feature forest, Multi-label
– Learning algorithms: Perceptron, 1-best MIRA, n-best MIRA, Log-linear model, Naïve Bayes, Max-margin, EM algorithm
– Interface: decode, diff, expectation, n-best
Target Problem
• Choose multiple labels from a fixed set of labels
• Ex. Keyword assignment (text categorization)
[Figure: a keyword set (Politics, Sports, Entertainment, Drama, Life, Comedy, Food, Video, Music, Travel, Animation, Tech, Health, Book, Recipe); appropriate keywords, e.g. Food and Recipe, are selected for a given text]
Applications
• Keyword assignment (text categorization)
– Benchmark data: Reuters-21578, OHSUMED, etc.
• Medical diagnosis
• Protein function classification
– Benchmark data: Yeast, Genbase, etc.
• Music/scene categorization
• Non-contiguous, overlapping segmentation
[McDonald et al., 2005]
Formulation
• x: object, L: label set, y⊆L: labels assigned to x
• y* = argmax_y f(x, y)
[Figure: x is a text; y ⊆ L is the subset of the keyword set L assigned to it, e.g. {Food, Recipe}]
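The formulation above can be made concrete with a brute-force sketch: enumerate every subset of L and keep the highest-scoring one. The scoring function here is a toy assumption for illustration, not any particular model from the talk.

```python
from itertools import combinations

def all_subsets(labels):
    """Enumerate all 2^|L| subsets of the label set L."""
    labels = sorted(labels)
    for k in range(len(labels) + 1):
        for combo in combinations(labels, k):
            yield frozenset(combo)

def classify(x, labels, score):
    """y* = argmax_{y ⊆ L} f(x, y); exact, but feasible only for small |L|."""
    return max(all_subsets(labels), key=lambda y: score(x, y))

# Toy f(x, y) (an assumption): reward labels whose name occurs in the
# text, with a small penalty per assigned label.
def score(x, y):
    return sum(1.0 for label in y if label.lower() in x.lower()) - 0.1 * len(y)
```

The exponential cost of `all_subsets` is exactly the issue the approximation methods below try to avoid.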
Popular Approaches
• Subsets as atomic labels
– Each subset is considered as an atomic label
– Tractable only when |L| is small
• A set of binary classifications
– One-vs-all
– Each label is independently assigned
• Label ranking
– A ranking function is induced from multi-labeled data
(BoosTexter [Schapire et al., 2000], Rank-SVM [Elisseeff et al., 2002],
large-margin [Crammer et al., 2003])
• Probabilistic generative models [McCallum 1999; Ueda et al.,
2003; Sato et al., 2007]
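The one-vs-all approach above can be sketched in a few lines: one independent classifier per label, each ignoring all other labels. The token-overlap "classifier" is a placeholder assumption; any binary learner (e.g. an SVM) could stand in.

```python
def train_one_vs_all(docs, label_sets, labels):
    """For each label, collect the tokens of its positive documents."""
    model = {l: set() for l in labels}
    for doc, ys in zip(docs, label_sets):
        for l in ys:
            model[l].update(doc.split())
    return model

def predict(model, doc):
    """Assign every label independently; label correlations are ignored."""
    tokens = set(doc.split())
    return {l for l, profile in model.items() if tokens & profile}
```

The per-label independence is what keeps this cheap, and also what the collective models below are designed to fix.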
Issues in Multi-Label Classification
• How to reduce training/running cost
– The number of targets (i.e. label subsets) grows
exponentially with the size of the label set
• How to model correlation of labels
– Binary classification cannot use features on
multiple labels
• Classification vs. Ranking
• Hierarchical multi-label classification (e.g.
MeSH terms) [Cesa-Bianchi et al., 2006; Rousu et al., 2006]
Collective Multi-Label Classification
• CRF is applied to multi-label classification
y  arg max p( y | x)
x
1
p( y | x) 
exp λ  f ( x, y ) 
Z ( x)
• Features are defined on pairs of labels
• Notation:
– yi = 1 if i-th label ∈ y
– yi = 0 otherwise
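A minimal sketch of this model, with y encoded as a 0/1 indicator tuple as in the notation above. The concrete features and weights are toy assumptions chosen only to illustrate how the partition function Z(x) is formed.

```python
import math
from itertools import product

def score(weights, feats):
    """λ · f(x, y) as a plain dot product."""
    return sum(w * f for w, f in zip(weights, feats))

def prob(x, y, weights, features):
    """p(y|x) = exp(λ·f(x, y)) / Z(x), with Z summed over all 2^|L| subsets."""
    Z = sum(math.exp(score(weights, features(x, yy)))
            for yy in product((0, 1), repeat=len(y)))
    return math.exp(score(weights, features(x, y))) / Z

# Toy features over two labels (an assumption): unigram indicators plus
# one pairwise feature y1*y2, whose positive weight rewards co-occurrence.
def features(x, y):
    return [y[0], y[1], y[0] * y[1]]

weights = [1.0, 1.0, 2.0]
```

Note that Z(x) already requires a sum over all subsets, which motivates the approximations in the parameter estimation slide.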
Accounting for Multiple Labels
• Binary model: fb(x, y) — features on each yi given x
• Collective Multi-Label (CML) model: fml(x, y) — features on label pairs (yi, yj)
• Collective Multi-Label with Features (CMLF) model: fmlf(x, y) — features on label pairs (yi, yj) given x
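The three feature templates can be sketched as follows; the paper's exact feature forms differ, so this is only a hedged illustration in which x is a sequence of tokens and y a 0/1 indicator vector.

```python
def f_binary(x, y):
    """fb: token-label features, so each y_i depends on x alone."""
    return {("b", i, tok): yi for i, yi in enumerate(y) for tok in x}

def f_cml(x, y):
    """fml: indicators over label pairs (y_i, y_j); no input features."""
    return {("ml", i, j, y[i], y[j]): 1
            for i in range(len(y)) for j in range(i + 1, len(y))}

def f_cmlf(x, y):
    """fmlf: label-pair indicators additionally conditioned on input tokens."""
    return {("mlf", i, j, y[i], y[j], tok): 1
            for i in range(len(y)) for j in range(i + 1, len(y)) for tok in x}
```

Only fml and fmlf can express correlations such as "Food and Recipe tend to co-occur", which fb by construction cannot.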
Parameter Estimation
• Enumeration of y is intractable in general
• Two approximations:
– Supported combinations:
• consider only the label combinations that occur in
training data
– Binary pruned inference:
• first apply binary model
• consider only the labels having probabilities above a
threshold t
• No dynamic programming
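The two approximations can be sketched as candidate-set generators; `binary_prob` and the threshold `t` are assumptions standing in for the separately trained binary model.

```python
from itertools import combinations

def supported_combinations(training_label_sets):
    """Restrict the argmax to label combinations observed in training."""
    return {frozenset(y) for y in training_label_sets}

def binary_pruned_candidates(labels, binary_prob, x, t=0.1):
    """Keep only labels whose binary probability exceeds t, then
    enumerate subsets of the survivors instead of all of 2^L."""
    kept = sorted(l for l in labels if binary_prob(x, l) > t)
    return [frozenset(c) for k in range(len(kept) + 1)
            for c in combinations(kept, k)]
```

Either candidate set can then be scored exhaustively with the CRF, since both are far smaller than 2^|L|.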
Experiments
• Reuters-21578 Modified Apte (ModApte) split
– 90 labels
– Training: 9,603 docs, Test: 3,299 docs
– 8.7% of the documents have multiple labels
• OHSUMED “Heart Disease” documents
– 40 labels assigned to 15-74 training documents
– 16 labels assigned to 75 or more training
documents
Results: Reuters-21578
• Supported combinations

                       Binary   CML     CMLF
  macro-F1             0.4380   0.4478  0.4477
  micro-F1             0.8627   0.8659  0.8635
  exact match          0.7999   0.8329  0.8316
  classification time  1.4 ms   48 ms   78 ms
• Binary pruned

                       Binary   CML     CMLF
  macro-F1             0.4388   0.4792  0.4760
  micro-F1             0.8634   0.8692  0.8701
  exact match          0.8000   0.8119  0.8162
  classification time  1.4 ms   4.6 ms  4.7 ms
Results: OHSUMED
• Supported combinations

               Binary   CML     CMLF
  macro-F1     0.6483   0.6795  0.6629
  micro-F1     0.6849   0.7003  0.6983
  exact match  0.4914   0.5925  0.6025
• Binary pruned

               Binary   CML     CMLF
  macro-F1     0.6482   0.6556  0.6658
  micro-F1     0.6849   0.6751  0.6886
  exact match  0.4918   0.5226  0.5190
Similar Methods
• H. Kazawa et al. Maximal margin labeling for
multi-topic text categorization. NIPS 2004.
– All subsets are considered as atomic labels
– Approximation by only considering neighbor subsets
(subsets that differ in a single label from the gold)
• S. Zhu et al. Multi-labelled classification using
maximum entropy method. SIGIR 2005.
– Simply enumerate all subsets, and use fml
– Only evaluated with small label sets (|L| ≤ 10)
Discriminative Methods for Multi-Labeled Classification
• Cascade binary classifiers (SVM)
[Figure: the input text goes to one binary classifier per label (outputs +1/−1); a second-level ensemble classifier per label then combines the input with these per-label outputs]
• Another technique: remove negative training instances
that are close to the decision boundary
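The cascade idea can be sketched as follows: level-one binary classifiers predict each label from x alone; level-two classifiers see x plus all level-one outputs, and so can exploit label correlations. The linear sign classifiers here are a placeholder assumption for the paper's SVMs.

```python
def sign_score(w, v):
    """A toy linear classifier: sign of the dot product w · v."""
    return 1 if sum(wi * vi for wi, vi in zip(w, v)) > 0 else -1

def cascade_predict(x, level1, level2):
    """x: feature vector; level1/level2: one weight vector per label.
    Level-2 vectors are longer: they also weight the level-1 outputs."""
    first = [sign_score(w, x) for w in level1]   # independent predictions
    augmented = list(x) + first                  # original features + outputs
    return [sign_score(w, augmented) for w in level2]
```

In this toy setup a level-two weight vector can, for instance, put all its weight on another label's level-one output, making one label's decision depend on another's.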
Random k-Labelsets
• Randomly select size-k label subsets (k-labelsets) from 2^L
• Train multi-class classifiers for the subsets
• Label a new instance by majority voting
[Figure: classifiers trained on size-k labelsets Y1, …, Ym each predict an indicator vector for the input text, e.g. (1,0,0,1,0,…,0,0); majority voting over these vectors yields the final label vector, e.g. (0,0,0,1,0,…,0,1)]
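A RAkEL-style sketch of the voting step, with hypothetical helper names: each member classifier covers one random k-labelset, and a label is assigned when a majority of the members that cover it predicted it.

```python
import random
from itertools import combinations

def random_k_labelsets(labels, k, m, seed=0):
    """Draw m distinct size-k subsets of the label set."""
    rng = random.Random(seed)
    return rng.sample(list(combinations(sorted(labels), k)), m)

def vote(member_predictions, labels, threshold=0.5):
    """member_predictions: (labelset, predicted label subset) pairs.
    Assign a label if more than `threshold` of its voters predicted it."""
    assigned = set()
    for label in labels:
        votes = [label in pred for ls, pred in member_predictions if label in ls]
        if votes and sum(votes) / len(votes) > threshold:
            assigned.add(label)
    return assigned
```

Because each member only ever reasons about k labels, training stays tractable while the ensemble still captures some label correlations.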
Other Approaches
• Learn a latent model to account for label correlations
– K. Yu et al. Multi-label informed latent semantic indexing. SIGIR
2005.
– J. Zhang et al. Learning multiple related tasks using latent
independent component analysis. NIPS 2005.
– V. Roth et al. Improved functional prediction of proteins by
learning kernel combinations in multilabel settings. PMSB 2006.
• kNN-like algorithms
– M.-L. Zhang et al. A k-nearest neighbor based algorithm for
multi-label classification. IEEE Conference on Granular Computing 2005.
– F. Kang et al. Correlated label propagation with application to
multi-label learning. CVPR 2006.
– K. Brinker et al. Case-based multilabel ranking. IJCAI 2007.
Summary
• Multi-label classification is an important and
interesting problem
• Major issues:
– Label correlation
– Computational cost
• A lot of methods have been proposed
– Most are enhancements of the fundamental methods
(subsets as atomic labels, sets of binary classifications)
• No existing method solves the problem
completely
Future Directions
• Algorithm for exact solution?
• Other learning algorithms
– Via machine learning template library
• Structurization of label sets
– IS-A hierarchy → hierarchical multi-label
– Exclusive labels
• Modeling of label distance
– Redesign of objective functions
Possible Applications
• Any tasks of keyword assignments
• Substitute for n-best/ranking
• Multi-label problems where label sets are not
fixed
– Keyword (key phrase) extraction
• Choose words/phrases from each document
– Summarization by sentence extraction
• cf. D. Xin et al. Extracting redundancy-aware top-k
patterns. KDD 2006.