Hierarchical Regularization Cascade for Joint Learning
Alon Zweig [email protected]
Daphna Weinshall [email protected]
School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel, 91904
Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

As the sheer volume of available benchmark datasets increases, the problem of jointly learning classifiers and transferring knowledge between classifiers becomes more and more relevant. We present a hierarchical approach which exploits information sharing among different classification tasks, in multi-task and multi-class settings. It engages a top-down iterative method, which begins by posing an optimization problem with an incentive for large-scale sharing among all classes. This incentive to share is gradually decreased, until there is no sharing and all tasks are considered separately. The method therefore exploits different levels of sharing within a given group of related tasks, without having to make hard decisions about the grouping of tasks. In order to deal with large-scale problems, with many tasks and many classes, we extend our batch approach to an online setting and provide a regret analysis of the algorithm. We tested our approach extensively on synthetic and real datasets, showing significant improvement over baseline and state-of-the-art methods.

1. Introduction

Information sharing can be a very powerful tool in various domains. Consider visual object recognition, where different categories typically share much in common: cars and trucks can be found on the road and both classes have wheels, cows and horses have four legs and can be found in the field together, etc. Accordingly, different information sharing approaches have been developed (Torralba et al., 2007; Obozinski et al., 2007; Quattoni et al., 2008; Amit et al., 2007; Duchi & Singer, 2009; Kim & Xing, 2010; Shalev-Shwartz et al., 2011; Kang et al., 2011).

In this paper we focus on the multi-task and multi-class settings, where the learning is performed jointly for many tasks or classes. Multi-task learning (Caruana, 1997) is a setting where several individual tasks are trained jointly, e.g., character recognition for different writers. Multi-class learning is the case where there is a single classification task involving several possible labels (object classes), and the goal is to assign each example a single label. Intuitively, the more data and tasks or classes there are, the more one can benefit from sharing information between them.

Recent approaches to information sharing (Obozinski et al., 2007; Quattoni et al., 2008; Amit et al., 2007; Duchi & Singer, 2009; Shalev-Shwartz et al., 2011) consider all tasks as a single group without discriminating between them. However, applying such approaches to datasets with many diverse tasks focuses the learning on information shared among all tasks, which may miss relevant information shared only within smaller groups of tasks. To address this challenge, our work takes a hierarchical approach to sharing by gradually enforcing different levels of sharing, thus scaling up to scenarios with many tasks. The basic intuition is that it is desirable to share a lot of information with a few related objects, while sharing less information with a larger set of less related objects.
For example, we may encourage modest information sharing between a wide range of recognition tasks such as all road vehicles, and separately seek more active sharing among closely related objects such as all types of trucks.

Previous work investigating the sharing of information at different levels either assumes that the structure of sharing is known, or solves the hard problem of clustering the tasks into sharing groups (Torralba et al., 2007; Kang et al., 2011; Jawanpuria & Nath, 2012; Kim & Xing, 2010; Zhao et al., 2011). The clustering approach solves a hard problem and can be used most effectively with smaller datasets, while clearly when the hierarchy is known, methods which take advantage of it should be preferred. But under those conditions where these methods are not effective, our implicit approach enables different levels of sharing without knowing the hierarchy or finding it explicitly.

More specifically, we propose a top-down iterative feature selection approach: it starts at a high level where sharing of features among all classes is induced, and then gradually decreases the incentive to share in successive levels, until there is no sharing at all and all tasks are considered separately in the last level. As a result, by decreasing the incentive to share, we achieve sharing between different subsets of tasks. The final classifier is a linear combination of diverse classifiers, where diversity is achieved by varying the regularization term. The diversity of regularization we exploit is based on two commonly used regularization functions: the $l_1$ norm (Tibshirani, 1996), which induces feature sparsity, and the $l_1/l_2$ norm analyzed in (Obozinski et al., 2007), which induces feature sparsity while favoring feature sharing between all tasks. Recently the sparse group lasso algorithm (Friedman et al., 2010) has been introduced, a linear combination of the lasso and group lasso (Yuan & Lin, 2006) algorithms, which can yield sparse solutions within a selected group of variables; in other words, it can discover smaller groups than the original group constraint ($l_1/l_2$).

The importance of a hierarchy of classes (or taxonomy) has been acknowledged in several recent recognition approaches. A supervised known hierarchy has been used to guide classifier combination (Zweig & Weinshall, 2007), learn distance matrices (Hwang et al., 2011; Verma et al., 2012), induce orthogonal transfer down the hierarchy (Zhou et al., 2011), or detect novel classes (Weinshall et al., 2008). Recent work on multi-class classification (Bengio et al., 2010; Gao & Koller, 2011; Yang & Tsang, 2011) has also tried to infer explicitly some hierarchical discriminative structure over the input space, which is more efficient and accurate than the traditional flat multi-class structure. The goal in these methods is typically to use the discovered hierarchical structure to gain efficient access to data. This goal is in a sense orthogonal (and complementary) to our aim of exploiting the implicit hierarchical structure for information sharing in both multi-task and multi-class problems.

The main contribution of this paper is to develop an implicit hierarchical regularization approach for information sharing in multi-task and multi-class learning (see Section 2).
Another important contribution is the extension to the online setting, where we are able to consider many more learning tasks simultaneously, thus benefiting from the many different levels of sharing in the data (see Section 3 for the algorithm description and regret analysis). In Section 4 we describe extensive experiments on both synthetic and seven popular real datasets. In Section 5 we briefly present the extension of our approach to a knowledge-transfer setting. The results show that our algorithm performs better than the baseline methods chosen for comparison and the state-of-the-art methods described in (Kang et al., 2011; Kim & Xing, 2010). It scales well to large data scenarios, achieving significantly better results even when compared to the case where an explicit hierarchy is known in advance (Zhao et al., 2011).

2. Hierarchical regularization cascade

We now describe our learning approach, which learns while sharing examples between tasks. We focus only on classification tasks, though our approach can be easily generalized to regression tasks.

Notations. Let k denote the number of tasks or classes. In the multi-task setting we assume that each task is binary, where $x \in \mathbb{R}^n$ is a datapoint and $y \in \{-1, 1\}$ its label. Each task comes with its own sample set $S_i = \{(x_s, y_s)\}_{s=1}^{m_i}$, where $m_i$ is the sample size and $i \in \{1, \ldots, k\}$. In the multi-class setting we assume a single sample set $S = \{(x_s, y_s)\}_{s=1}^{m}$, where $y_s \in \{1, \ldots, k\}$. Henceforth, when we refer to k classes or tasks, we shall use the term tasks to refer to both without loss of generality.

Let n denote the number of features, $W \in \mathbb{R}^{n \times k}$ the matrix of feature weights being learnt jointly for the k tasks, and $w_i$ the i'th column of W. Let $b \in \mathbb{R}^k$ denote the vector of threshold parameters, where $b_i$ is the threshold parameter corresponding to task i. $\|W\|_1$ denotes the $l_1$ norm of W, and $\|W\|_{1,2}$ denotes its $l_1/l_2$ norm: $\|W\|_{1,2} = \sum_{j=1}^{n} \|w^j\|_2$, where $w^j$ is the j'th row of matrix W and $\|w^j\|_2$ its $l_2$ norm.

The classifiers we use for the i'th task are linear classifiers of the form $f^i(x) = w_i \cdot x + b_i$. Binary task classification is obtained by taking the sign of $f^i(x)$, while multi-class classification retrieves the class with maximal value of $f^i(x)$. To simplify the presentation we henceforth omit the explicit reference to the bias term b; in this notation b is concatenated to matrix W as the last row, and each datapoint x has 1 added as its last element. Whenever the regularization of W is discussed, it is assumed that the last row of W is not affected. The classifiers now take the form $f^i(x) = w_i \cdot x$.

To measure loss we use the binary and multi-class hinge loss functions (Crammer & Singer, 2002). The multi-task loss is defined as follows:

$$L(\{S_i\}_{i=1}^{k}, W) = \sum_{i=1}^{k} \sum_{s \in S_i} \max\bigl(0,\, 1 - y_s f^i(x_s)\bigr) \qquad (1)$$

Without joint regularization this is just the sum of the losses of k individual tasks. The multi-class loss is defined as:

$$L(S, W) = \sum_{s \in S} \max\Bigl(0,\, 1 + \max_{y_s = i,\, j \neq i}\bigl(f^j(x_s) - f^i(x_s)\bigr)\Bigr) \qquad (2)$$

For brevity we will refer to both functions as L(W). Note that the choice of the hinge loss is not essential, and any other smooth convex loss function (see (Wright et al., 2009)) can be used.

We construct a hierarchy of regularization functions in order to generate a diverse set of classifiers that can be combined to achieve better classification. The construction of the regularization cascade is guided by the desire to achieve different levels of information sharing among tasks.
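For concreteness, the following is a minimal NumPy sketch of the two loss functions in Eqs. (1) and (2). The data layout (a list of per-task sample sets, a weight matrix with the bias absorbed as an extra constant feature) and all variable names are illustrative assumptions, not an implementation taken from the paper.

```python
import numpy as np

def multitask_hinge_loss(samples, W):
    """Eq. (1): sum of binary hinge losses over the k tasks.

    samples: list of k pairs (X_i, y_i); X_i has shape (m_i, n) with a constant
             1 appended as the last feature (absorbing the bias b_i),
             y_i is in {-1, +1}^{m_i}.
    W:       weight matrix of shape (n, k); column i parameterizes task i.
    """
    loss = 0.0
    for i, (X, y) in enumerate(samples):
        margins = y * (X @ W[:, i])                    # y_s * f^i(x_s)
        loss += np.maximum(0.0, 1.0 - margins).sum()
    return loss

def multiclass_hinge_loss(X, y, W):
    """Eq. (2): multi-class hinge loss of Crammer & Singer.

    X: (m, n) data matrix (again with the constant-1 feature appended),
    y: integer labels in {0, ..., k-1},
    W: (n, k) weight matrix.
    """
    scores = X @ W                                     # f^j(x_s) for every class j
    idx = np.arange(len(y))
    true_scores = scores[idx, y]                       # f^{y_s}(x_s)
    others = scores.copy()
    others[idx, y] = -np.inf                           # exclude the true class
    runner_up = others.max(axis=1)                     # max_{j != y_s} f^j(x_s)
    return np.maximum(0.0, 1.0 + runner_up - true_scores).sum()
```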
2.1. Hierarchical regularization

Specifically, at the highest level in the hierarchy we encourage classifiers to share information among all tasks by using regularization based on the $l_1/l_2$ norm. At the bottom of the hierarchy we induce sparse regularization of the classifiers with no sharing by using the $l_1$ norm. Intermediate levels capture decreasing levels of sharing (going from top to bottom), by using for regularization a linear combination of the $l_1$ and $l_1/l_2$ norms. We denote the regularization term of level l by $\psi^l$:

$$\psi^l(W) = \phi^l \bigl((1 - \lambda^l)\|W\|_{1,2} + \lambda^l \|W\|_1\bigr) \qquad (3)$$

where $\lambda^l$ is the mixing coefficient and $\phi^l$ is the regularization coefficient. The regularization coefficient of the last row of $W^l$, corresponding to the bias b, is 0.

For each individual task (column of $W^l$) we learn L classifiers, where each classifier is regularized differently; we denote by L the preset number of levels in our algorithm. Choosing the L mixing terms $\lambda^l \in [0,1]$ diversely results in inducing L different levels of sharing, with maximal sharing at $\lambda^l = 0$ and no incentive to share at $\lambda^l = 1$.

2.2. Cascade algorithm

Learning all diversely regularized classifiers jointly involves a large number of parameters, which increases multiplicatively with the number of levels L. A large number of parameters could harm the generalization properties of any algorithm which attempts to solve the optimization problem directly. We therefore developed an iterative approach, presented in Algorithm 1, where each level is optimized separately using the optimal value from higher levels in the hierarchy. Note that while each regularization term $\psi^l$ induces the sparsity of $W^l$, the output classifier $\sum_{l=1}^{L} W^l$ may not be sparse.

In each level only a single set of parameters $W^l$ is being learnt, with regularization uniquely defined by $\lambda^l$. We start by inducing maximal sharing with $\lambda^1 = 0$. As the algorithm proceeds, $\lambda^l$ monotonically increases, inducing a decreased amount of sharing between tasks as compared to previous steps. In the last level we set $\lambda^L = 1$, to induce sparse regularization with no incentive to share. Thus, starting from l = 1 up to l = L, the algorithm for the sparse group learning cascade solves

$$W^l = \arg\min_{W}\; L(W + W^{l-1}) + \psi^l(W) \qquad (4)$$

The learnt parameters are aggregated through the learning cascade, where each step l of the algorithm receives as input the parameters learnt up to that point, $W^{l-1}$. Thus the combination of input parameters learnt earlier, together with a decreased incentive to share, is intended to guide the learning to focus on more task/class specific information as compared to previous steps. Note also that this sort of parameter passing between levels works only in conjunction with the regularization; without regularization, the solution of each step is not affected by the solution from previous steps. In our experiments we set $\lambda^l = \frac{l-1}{L-1}$ for all $l \in \{1, \ldots, L\}$, while the set of parameters $\{\phi^l\}_{l=1}^{L}$ is chosen using cross-validation.

Algorithm 1 Regularization cascade
Input: L, $\{\lambda^l\}_{l=1}^{L}$, $\{\phi^l\}_{l=1}^{L}$
Output: W
1. $W^1 = \arg\min_W L(W) + \phi^1 \|W\|_{1,2}$
2. for l = 2 to L
   (a) $W = \arg\min_W L(W + W^{l-1}) + \phi^l\bigl((1-\lambda^l)\|W\|_{1,2} + \lambda^l\|W\|_1\bigr)$
   (b) $W^l = W^{l-1} + W$
3. $W = W^L$
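The cascade itself reduces to a short loop once a solver for the inner problem (4) is available. The sketch below is schematic: `solve_level` is a hypothetical placeholder for any solver of (4) (e.g., the iterative shrinkage methods discussed in Section 2.3), not an API defined in the paper.

```python
import numpy as np

def regularization_cascade(solve_level, n, k, L, phis):
    """Schematic rendering of Algorithm 1 (assumes L >= 2).

    solve_level(W_prev, lam, phi) is a placeholder assumed to return
        argmin_W  L(W + W_prev) + phi * ((1 - lam) * ||W||_{1,2} + lam * ||W||_1),
    e.g. via the iterative shrinkage methods of Section 2.3.
    phis: the L regularization coefficients phi^l (chosen by cross-validation).
    """
    lambdas = [(l - 1) / (L - 1) for l in range(1, L + 1)]   # lambda^l = (l-1)/(L-1)
    W_sum = np.zeros((n, k))                                 # accumulated parameters, W^0 = 0
    for lam, phi in zip(lambdas, phis):
        delta = solve_level(W_sum, lam, phi)   # level increment; sharing decreases as lam grows
        W_sum = W_sum + delta                  # W^l = W^{l-1} + W
    return W_sum                               # final classifier W = W^L
```

With lam = 0 the first pass coincides with step 1 of Algorithm 1, and with lam = 1 the last pass adds a purely $l_1$-regularized, task-specific correction.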
2.3. Batch optimization

At each step of the cascade we have a single unconstrained convex optimization problem, where we minimize a smooth convex loss function summed with a non-smooth regularization term (4). This type of optimization problem has been studied extensively in the optimization community in recent years (Wright et al., 2009; Beck & Teboulle, 2009). We used two popular optimization methods (Daubechies et al., 2004; Beck & Teboulle, 2009), which converge to the single global optimum, with rate of convergence $O(\frac{1}{T^2})$ for (Beck & Teboulle, 2009). Both are iterative procedures which solve at iteration t the following sub-problem:

$$\min_{\Theta^t}\; (\Theta^t - \Theta^{t-1})\nabla L(\Theta^{t-1}) + \frac{\alpha_t}{2}\|\Theta^t - \Theta^{t-1}\|_2^2 + \phi\,\psi(W^t)$$

where $\psi(W^t) = (1-\lambda)\|W^t\|_{1,2} + \lambda\|W^t\|_1$ and $\alpha_t$ is a constant factor corresponding to the step size of iteration t. This sub-problem has a closed-form solution presented in (Sprechmann et al., 2011), which yields an efficient implementation for solving a single iteration of Algorithm 1. The complexity of the algorithm is L times the complexity of solving (4).

3. Online algorithm

When the number of training examples is very large, it quickly becomes computationally prohibitive to solve (4), the main step of Algorithm 1. We therefore developed an online algorithm which solves this optimization problem by considering one example at a time: the set of parameters $W^l$ is updated each time a new mini-sample appears, containing a single example from each task.

In order to solve (4) we adopt the efficient dual-averaging method proposed by (Xiao, 2010), which is a first-order method for solving stochastic and online regularized learning problems. Specifically, we build on the work of (Yang et al., 2010), who presented a closed-form solution for the case of the sparse group lasso, needed for our specific implementation of the dual-averaging approach. The update performed at each time step by the dual-averaging method can be written as:

$$W_t = \arg\min_{W}\; \bar{U}W + \psi^l(W) + \frac{\gamma}{\sqrt{t}}\, h(W) \qquad (5)$$

where U denotes the subgradient of $L_t(W + W^{l-1})$ with respect to W, $\bar{U}$ the average subgradient up to time t, $h(W) = \frac{1}{2}\|W\|_2^2$ an auxiliary strongly convex function, and $\gamma$ a constant which determines the convergence properties of the algorithm.

Algorithm 2 describes our online algorithm. It follows from the analysis in (Xiao, 2010) that the run-time and memory complexity of the online algorithm based on update (5) are linear in the dimensionality of the parameter set, which in our setting is nk. Note that in each time step a new example from each task is processed through the entire cascade before moving on to the next example.

Algorithm 2 Online regularization cascade
Input: L, $\gamma$, $\{\phi^l\}_{l=1}^{L}$, $\{\lambda^l\}_{l=1}^{L}$
Initialization: $\hat{W}_0^l = 0$, $\bar{U}_0^l = 0$ for all $l \in \{1, \ldots, L\}$; $W_t^0 = 0$ for all t
1. for t = 1, 2, 3, ... do
   (a) for l = 1 to L
       i. Given $L_t(W + W^{l-1})$, compute a subgradient $U_t^l \in \partial L_t(W + W^{l-1})$
       ii. $\bar{U}_t^l = \frac{t-1}{t}\bar{U}_{t-1}^l + \frac{1}{t}U_t^l$
       iii. $\hat{W}_t^l = \arg\min_W \bar{U}_t^l W + \psi^l(W) + \frac{\gamma}{\sqrt{t}} h(W)$
       iv. $W_t^l = W_t^{l-1} + \hat{W}_t^l$
   (b) $W_t = W_t^L$
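Both the batch sub-problem of Section 2.3 and the dual-averaging update (5) boil down to applying, row by row, a shrinkage operator for the sparse-group-lasso penalty to a gradient-dependent matrix. The sketch below shows the standard closed form of that operator (soft-threshold the entries, then shrink each row's l2 norm), in the spirit of (Friedman et al., 2010; Sprechmann et al., 2011; Yang et al., 2010); the exact scaling of the two thresholds by the step size $\alpha_t$ or by $\gamma/\sqrt{t}$ is left out here and would have to follow the respective update equations, so treat this as an illustrative building block rather than the paper's implementation.

```python
import numpy as np

def prox_sparse_group_lasso(V, tau_l1, tau_group):
    """Row-wise shrinkage operator for  tau_l1 * ||W||_1 + tau_group * ||W||_{1,2}.

    V:         (n, k) matrix to be shrunk (e.g. a gradient step, or the scaled
               negative average subgradient in the dual-averaging update).
    tau_l1:    effective l1 threshold (assumed to absorb phi, lambda and the step size).
    tau_group: effective row-wise l2 threshold (likewise an effective quantity).
    """
    # 1) elementwise soft-thresholding (the l1 part)
    U = np.sign(V) * np.maximum(np.abs(V) - tau_l1, 0.0)
    # 2) row-wise group shrinkage (the l1/l2 part); rows are features shared across tasks
    row_norms = np.linalg.norm(U, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau_group / np.maximum(row_norms, 1e-12), 0.0)
    return scale * U
```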
3.1. Regret analysis

In online algorithms, regret measures the difference between the loss accumulated over the sequence of examples produced by the online learning algorithm and the loss obtained with a single set of parameters used for all examples and chosen optimally in hindsight. For T iterations of the algorithm, we can write the regret as

$$R_T(W^*) = \sum_{t=1}^{T}\bigl(L_t(W_t) + \psi(W_t) - L_t(W^*) - \psi(W^*)\bigr), \quad \text{where } W^* = \arg\min_{W}\sum_{t=1}^{T}\bigl(L_t(W) + \psi(W)\bigr).$$

At each time step t, Algorithm 2 chooses for each level l > 1 of the cascade the set of parameters $\hat{W}_t^l$, to be added to the set of parameters $W_t^{l-1}$ calculated in the previous level of the cascade. Thus the loss at level l at time t, $L_{t,W_t^{l-1}}(\hat{W}) = L_t(\hat{W} + W_t^{l-1})$, depends on both the example at time t and the estimate $W_t^{l-1}$ obtained in previous learning stages of the cascade. We define the following regret function, which compares the performance of the algorithm at level l to the best choice of parameters for all levels up to level l:

$$R_T^l(\hat{W}_*^l) = \sum_{t=1}^{T}\bigl(L_{t,W_t^{l-1}}(\hat{W}_t^l) + \psi^l(\hat{W}_t^l)\bigr) - \sum_{t=1}^{T}\bigl(L_{t,W_*^{l-1}}(\hat{W}_*^l) + \psi^l(\hat{W}_*^l)\bigr) \qquad (6)$$

where $\hat{W}_*^l = \arg\min_{W}\sum_{t=1}^{T}\bigl(L_{t,W_*^{l-1}}(W) + \psi^l(W)\bigr)$, $W_*^0 = 0$ and $W_*^l = \sum_{k=1}^{l}\hat{W}_*^k$.

We note that the key difference between the usual definition of regret above and the definition in (6) is that in the usual regret definition we consider the same loss function for the learnt and optimal sets of parameters. In (6), on the other hand, we consider two different loss functions, $L_{t,W_t^{l-1}}$ and $L_{t,W_*^{l-1}}$, each involving a different set of parameters derived from previous levels in the cascade. We now state the main result, which provides an upper bound on the regret (6) and thus proves convergence.

Theorem 1. Suppose there exist G and D such that $\|U_t^l\| < G$ and $h(W) < D^2$ for all t, l, and suppose that $R_T^l(\hat{W}_*^l) \geq -C\sqrt{T}$; then

$$R_T^l(\hat{W}_*^l) \leq A\sqrt{T} + B(T+1)^{\frac{3}{4}} \qquad (7)$$

where $A = \gamma D^2 + \frac{G^2}{\gamma}$, $B = \frac{4}{3}(l-1)G\sqrt{\frac{2MA}{\sigma}}$, $C = -(M-1)\bigl(\gamma D^2 + \frac{G^2}{\gamma}\bigr)$ for some constant M > 1, and $\sigma$ denotes the convexity parameter of $\psi$.

Proof. See Appendix A in the supplementary material.

4. Experiments

Comparison Methods. We compare our algorithms to three baseline methods, representing three common optimization approaches: 'NoReg', where learning is done simply by minimizing the loss function without regularization; 'L12', a common approach to multi-task learning where in addition to minimizing the loss function we also regularize for group sparseness (enforcing feature sharing) using the $l_1/l_2$ norm; and 'L1', a very common regularization approach where the loss function is regularized in order to induce sparsity using the $l_1$ norm. All methods are optimized using the same algorithms described above, where for the non-hierarchical methods we set L = 1; for 'NoReg' we set $\phi = 0$, for 'L12' we set $\lambda = 0$, and for 'L1' we set $\lambda = 1$. The parameter $\phi$ for 'L12' and 'L1' is also chosen using cross-validation.

We also use for comparison three recent approaches which exploit relatedness at multiple levels. (i) The single-stage approach of (Kang et al., 2011), which simultaneously finds the grouping of tasks and learns the tasks. (ii) The tree-guided algorithm of (Kim & Xing, 2010), which can be viewed as a two-stage approach, where the tasks are learnt after a hierarchical grouping of tasks is either discovered or provided.
We applied the tree-guided algorithm in three conditions: when the true hierarchy is known, denoted 'TGGL-Opt'; when it is discovered by agglomerative clustering (as suggested in (Kim & Xing, 2010)), denoted 'TGGL-Cluster'; or when it is randomly chosen (a random permutation of the leaves of a binary tree), denoted 'TGGL-Rand'. (iii) The method described in (Zhao et al., 2011), which is based on the work of (Kim & Xing, 2010) and extends it to a large-scale setting where a hierarchy of classes is assumed to be known.

Figure 1. Synthetic data results. The Y-axis measures the average accuracy over all tasks, where accuracy results correspond to 10 repetitions of the experiment. Left: performance as a function of sample size (X-axis). We show comparisons of both the batch and online algorithms, where 'H' denotes our hierarchical algorithm with 5 levels, and 'L11', 'L12' and 'NoReg' the different baseline methods. Right: performance as a function of the number of random features. We show a comparison to the tree-guided group lasso algorithm based on the true hierarchy ('TGGL-Opt'), a clustered hierarchy ('TGGL-Cluster') and a random hierarchy ('TGGL-Rand').

4.1. Synthetic data

In order to understand when our proposed method is likely to achieve improved performance, we tested it in a controlled manner on a synthetic dataset we had created. This synthetic dataset defines a group of tasks related in a hierarchical manner: the features correspond to nodes in a tree-like structure, and the number of tasks sharing each feature decreases with the distance of the feature node from the tree root. We tested whether our approach is able to discover (implicitly) and exploit the hidden structure thus defined. The inputs to the algorithm are the sets of examples and labels from all k tasks, $\{S_i\}_{i=1}^{k}$, without any knowledge of the underlying structure.

Classification performance: We start by showing in Fig. 1-left that our hierarchical algorithm achieves the highest accuracy, both in the batch and online settings. For the smallest sample size the 'L12' baseline achieves similar performance, while for the largest sample size the 'L1' baseline closes the gap, indicating that given enough examples, sharing of information between classes becomes less important. We also see that the online Algorithm 2 converges to the performance of the batch algorithm after seeing enough examples.

The advantage of the hierarchical method is not due simply to the fact that it employs a combination of classifiers, but rather to the fact that it clearly benefits from sharing information. Specifically, when the cascade was applied to each task separately, it achieved only 93.21% accuracy, as compared to 95.4% accuracy when applied to all 100 tasks jointly.

To evaluate our implicit approach we compared it to the Tree-Guided Group Lasso (Kim & Xing, 2010) (TGGL), where the hierarchy is assumed to be known - either provided by a supervisor (the true hierarchy), clustered at pre-processing, or randomly chosen. Fig. 1-right shows results for 100 tasks with 20 positive examples each. We challenged the discovery of the task relatedness structure by adding to the original feature representation a varying number (500-4000) of irrelevant features, where each irrelevant feature takes the value -1 or 1 at random. Our method performs much better than TGGL with a random or clustering-based hierarchy. Interestingly, this advantage is maintained even when TGGL gets to use the true hierarchy, with up to 1500 irrelevant features, possibly due to other beneficial features of the cascade.
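The paper does not spell out the exact generative procedure for the synthetic data, but the description above (features attached to nodes of a tree and shared by all tasks under that node) suggests a simple construction along the lines of the hypothetical sketch below; the binary-tree layout, the Gaussian weights and all names are assumptions introduced here for illustration only.

```python
import numpy as np

def make_tree_structured_weights(k=100, depth=5, seed=None):
    """Hypothetical generator of tree-structured task weights.

    Each node of a binary tree of the given depth contributes one feature;
    that feature is shared by all tasks (leaves) under the node, so the number
    of tasks sharing a feature decreases with the node's distance from the root.
    Returns a weight matrix of shape (number_of_nodes, k).
    """
    rng = np.random.default_rng(seed)
    rows = []
    for level in range(depth + 1):
        n_nodes = 2 ** level                       # nodes at this depth
        for node in range(n_nodes):
            row = np.zeros(k)
            lo = node * k // n_nodes               # tasks (leaves) in this node's subtree
            hi = (node + 1) * k // n_nodes
            row[lo:hi] = rng.normal(size=hi - lo)  # nonzero only for the sharing tasks
            rows.append(row)
    return np.stack(rows)                          # (2**(depth+1) - 1, k)
```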
4.2. Real data

Small Scale. (Kang et al., 2011) describe a multi-task experimental setting using two digit recognition datasets, MNIST (LeCun et al., 1998) and USPS (Hull, 1994), which are small datasets with only 10 classes/digits. For comparison with (Kang et al., 2011), we ran our method on these datasets using the same representations, fixing L = 3 for both datasets. Table 1 shows all results, demonstrating a clear advantage to our method. Our basic baseline methods 'NoReg' and 'L1' achieve similar or worse results ('NoReg': 9.5% ± 0.2 and 17% ± 0.5, 'L1': 8.8% ± 0.5 and 16% ± 0.8, for USPS and MNIST respectively), comparable to the single-task baseline approach presented in (Kang et al., 2011). Thus, the advantage of the cascade 'H' does not stem from the different optimization procedures, but rather reflects the different approach to sharing.

Table 1. Error rates on the digit datasets.
              USPS          MNIST
H             6.8% ± 0.2    13.4% ± 0.5
Kang et al.   8.4% ± 0.3    15.2% ± 0.3

Medium Scale. We tested our batch approach on four medium sized datasets: Cifar-100 (Krizhevsky & Hinton, 2009), Caltech101, Caltech256 (Griffin et al., 2007) and the MIT-Indoor Scene dataset (Quattoni & Torralba, 2009), with 100, 102, 257 and 67 categories respectively. We tested both the multi-class and multi-task settings. For the multi-task setting we consider the 1-vs-all tasks. For Cifar-100 we fixed the number of levels in the cascade at L = 5, and for the larger datasets, Caltech101/256 and Indoor-Scene, we used L = 4.

We also investigated a variety of features: for Cifar-100 we used the global Gist representation embedded in an approximation of the RBF feature space using random projections, as suggested by (Rahimi & Recht, 2007), resulting in a 768-dimensional feature vector. For Caltech101/256 we used the output of the first stage of Gehler et al.'s kernel combination approach (Gehler & Nowozin, 2009) (which achieves state-of-the-art results on Caltech101/256) as the set of features. For the MIT-Indoor Scene dataset we used Object Bank features (Li et al., 2010), which achieve state-of-the-art results on this and other datasets.

We tested the Cifar-100 dataset in the multi-task and multi-class settings. We used 410 images from each class as the training set, 30 images as a validation set, and 60 images as the test set. For the multi-task setting we considered the 100 1-vs-rest classification tasks. For each binary task, we used all images in the train set of each class as the positive set, and 5 examples from each class in the 'rest' set of classes as the negative set. The experiment was repeated 10 times using different random splits of the data. Results are shown in Table 2, showing similar performance for the cascade and the competing Tree-Guided Group Lasso method.

Table 2. Cifar-100: accuracy results. 'Baselines' denotes 'L1', 'L12' and 'NoReg', which showed similar performance.
              H              Baselines      TGGL-Cluster   TGGL-Rand
multi-class   21.93 ± 0.38   18.23 ± 0.28   -              -
1-vs-rest     79.91 ± 0.22   76.98 ± 0.17   79.97 ± 0.10   80.25 ± 0.10

In our experiments with the MIT-Indoor Scene dataset we used 20, 50 and 80 images per scene category as training sets, with 80, 50 and 20 images per category as the corresponding test sets. We repeated the experiment 10 times for random splits of the data, including the single predefined data split provided by (Quattoni & Torralba, 2009) as a benchmark. Fig. 2-left shows the classification results of the cascade, which significantly outperformed the baseline methods and the previously reported state-of-the-art result of (Li et al., 2010) on the original data split of (Quattoni & Torralba, 2009), using the exact same feature representation. We also significantly outperformed 'TGGL-Cluster' and 'TGGL-Rand'.
Figure 2. Real data results. The Y-axis measures the average accuracy over all tasks. Left: multi-class accuracy results on the MIT-Indoor Scene dataset, for 4 experimental conditions: 20, 50 and 80 images used for training respectively, and 'OrigSplit', the single predefined split of (Quattoni & Torralba, 2009). Right: multi-task 1-vs-rest results for the Caltech256 dataset, where 'LPB' denotes our implementation of the binary version of the approach presented in (Gehler & Nowozin, 2009) (see text); the X-axis varies with sample size.

With Caltech101 and Caltech256 we used the data provided by (Gehler & Nowozin, 2009) for comparisons in both their original multi-class scenario and a new multi-task scenario. We tested our approach using the exact same experimental setting of (Gehler & Nowozin, 2009), given the scripts and data provided by the authors. In the original multi-class setting addressed in (Gehler & Nowozin, 2009), our results are comparable to their state-of-the-art results, both for 30 training images (78.1%) on Caltech101 and for 50 images (50%) on Caltech256.

In the multi-task scenario we trained a single 1-vs-rest classifier for each class. In addition to our regular baseline comparisons we implemented a variant of the ν-LPB method, which was used in (Gehler & Nowozin, 2009) as the basis for their multi-class approach. Fig. 2-right shows results for the multi-task setting on Caltech256. First we note that our algorithm outperforms all other methods, including ν-LPB. We also note that on this dataset all regularization approaches exceed the 'NoReg' baseline, indicating that this data is sparse and benefits from information sharing. (Results for Caltech101 are similar and have therefore been omitted.)

Large Scale. We demonstrate the ability of our online method to scale up to large datasets with many labels and many examples per label by testing it on the ILSVRC(2010) challenge (Berg et al., 2010). This is a large scale visual object recognition challenge, with 1000 categories and 668-3047 examples per category. With so many categories the usual $l_1/l_2$ regularization is expected to be too crude, identifying only a few shared features among such a big group of diverse classes. On the other hand, we expect our hierarchical method to capture varying levels of useful information to share.

We compared our method to the hierarchical scheme of (Zhao et al., 2011), using their exact same feature representation. Rather than compute the hierarchy from the data, their method takes advantage of a known semantic WordNet hierarchy, which is used to define a hierarchical group-lasso regularization and to calculate a similarity matrix augmented into the loss function. The comparison was done on the single split of the data provided by the challenge (Berg et al., 2010).

Table 3. ILSVRC(2010): classification accuracy of the best of N decisions, Top N.
         Top 1   Top 2   Top 3   Top 4   Top 5
Alg 2    0.285   0.361   0.403   0.434   0.456
Zhao     0.221   0.302   0.366   0.411   0.435

Table 3 shows our performance as compared to that of (Zhao et al., 2011). We show accuracy rates when considering the 1-5 top classified labels. In all settings we achieve significantly better performance, using the exact same image representation and much less labeled information.

4.3. Discussion

For small and medium scale datasets we see that our cascade approach and the batch Algorithm 1 significantly outperform the baseline methods. For the large scale dataset the online Algorithm 2 significantly outperforms all other baseline methods.
It is interesting to note that even when the alternative baseline methods perform poorly, implying that the regularization functions are not beneficial on their own, combining them as we do in our hierarchical approach improves performance. This can be explained by the fact that a linear combination of classifiers is known to improve performance if the classifiers are accurate and diverse.

When comparing to the recent related work of (Kim & Xing, 2010), denoted TGGL, we see that our implicit approach performs significantly better on the synthetic and MIT-Indoor Scene datasets, while on the Cifar dataset we obtain similar results. With the synthetic dataset we saw that as the clustering of the task hierarchy becomes more challenging, TGGL with clustering degrades quickly, while the performance of our method degrades more gracefully. We expect this to be the case in many real world problems where the underlying hierarchy is not known in advance. Furthermore, we note that with the small scale digit datasets our approach significantly outperformed the reported results of (Kang et al., 2011). Finally, we note that our approach can be used in a pure online setting, while these two alternative methods cannot.

We also compared with a third method (Zhao et al., 2011) using the ILSVRC(2010) challenge (Berg et al., 2010); this is a challenging large scale visual categorization task, whose size - both the number of categories and the number of examples per category - provides challenges particularly suitable for our approach. The online algorithm makes it possible to scale up to such a big dataset, while the hierarchical sharing is important with possibly many relevant levels of sharing between the tasks. Particularly encouraging is the improvement in performance when compared to the aforementioned work, where an explicit hierarchy was provided to the algorithm.

5. Knowledge-Transfer

We now briefly outline two natural extensions of our multi-task learning algorithms that achieve knowledge-transfer to novel related tasks which arrive with too few training examples. (See Appendix D in the supplementary material for detailed algorithms and results of our knowledge-transfer methods.) The transfer of information is based on the cascade of matrices $\{W^l\}_{l=1}^{L}$ learnt during pre-training. The extension of the batch algorithm is based on dimensionality reduction of the pre-trained models. The extension of the online method maintains the same regularization structure and parameters of the online multi-task setting.

Batch Method. The method is based on projecting the data into the subspaces defined by the learnt models $\{W^l\}_{l=1}^{L}$. Consequently the new task is represented in L sub-spaces that capture the structure of the information shared between the previous k tasks. Specifically, we define each projection matrix $P^l$ by the first z columns of the orthonormal matrix $U^l$, where $\mathrm{svd}(W^l) = U^l \Sigma V^l$. At each level l we project the new data-points onto $P^l$. Thus the modeling of the new task involves $z \cdot L$ parameters, wherein the l'th level data is projected onto the unique subspace characteristic of level l. In order to pass the parameters to the next level we project back to the original feature space.

Online Method. We assume that a related set of tasks has been learnt using the online Algorithm 2. In order to initialize the online learning of a new single task or group of tasks, we concatenate the parameters of the new tasks to the previously learnt parameters, thus influencing the structure of the newly learnt task parameters.

The batch method is particularly useful when the data lies in a high dimensional feature space and the number of examples from the novel task is too small to learn effectively in such a space. The online approach is particularly useful for bootstrapping the online learning of novel classes, achieving higher performance at the early stages of the learning as compared to a non-transfer approach. Results are shown in Fig. 3.

Figure 3. Experimental results. The Y-axis corresponds to average accuracy over all tasks, and the X-axis to the sample size. KT-On denotes the online knowledge-transfer method, KT-Batch the batch knowledge-transfer method, and NoKT the no knowledge-transfer control: the multi-task batch algorithm with L = 1 and φ = 0. (a) Results with the synthetic data described above; 99 tasks were used as the pre-trained tasks and a single task as the unknown novel task. The experiment was repeated 100 times, each repetition choosing a different task as the novel task. (b) Results for the large-scale ILSVRC2010 experiment; 1000 1-vs-rest tasks were considered, each task separating a single class from the rest. 900 tasks were chosen as known and 100 as unknown.
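As an illustration of the batch knowledge-transfer step described above, the following sketch builds the per-level projection matrices $P^l$ from the SVD of the pre-trained $W^l$ and represents new-task data in the resulting $z \cdot L$-dimensional space. The subsequent learning of the new task in that space and the projection of its parameters back to the original feature space (mentioned in the text) are omitted, and all function names are hypothetical.

```python
import numpy as np

def transfer_projections(W_levels, z):
    """Build the per-level projection matrices P^l from the pre-trained cascade.

    W_levels: list of the L matrices W^l (each n x k) learnt during pre-training.
    z:        number of leading left singular vectors kept per level (z <= min(n, k)).
    Returns a list of L matrices P^l of shape (n, z).
    """
    P = []
    for W in W_levels:
        U, _, _ = np.linalg.svd(W, full_matrices=False)   # svd(W^l) = U^l Sigma V^l
        P.append(U[:, :z])                                # first z columns of U^l
    return P

def project_new_task_data(X, P):
    """Represent new-task data points in the L learnt subspaces (z*L features in total)."""
    return np.hstack([X @ P_l for P_l in P])              # shape (m, z*L)
```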
6. Summary

We presented a cascade of regularized optimization problems designed to induce implicit hierarchical sharing of information in multi-task, multi-class and knowledge-transfer settings. We described efficient batch and online learning algorithms implementing the cascade. For the online algorithm we provided a regret analysis, from which it follows that the average regret of our learning method converges. The method was tested on synthetic data and seven real datasets, showing a significant advantage over baseline methods, and similar or improved performance as compared to alternative state-of-the-art methods.

Acknowledgements

We would like to thank Bin Zhao and Prof. Eric Xing for providing the code for representing the data in our comparison to their work (Zhao et al., 2011). We would also like to thank Uri Shalit for helpful discussions and literature pointers. Research funded in part by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).

References

Amit, Y., Fink, M., Srebro, N., and Ullman, S. Uncovering shared structures in multiclass classification. ICML, 2007.
Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Bengio, S., Weston, J., and Grangier, D. Label embedding trees for large multi-class tasks. NIPS, 2010.
Berg, A., Deng, J., and Fei-Fei, L. Large scale visual recognition challenge 2010, 2010. URL http://www.image-net.org/challenges/LSVRC/2010/index.
Caruana, R. Multitask learning. Machine Learning, 1997.
Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265–292, 2002.
Daubechies, I., Defrise, M., and De Mol, C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.
Duchi, J. and Singer, Y. Boosting with structural sparsity. In ICML, pp. 297–304. ACM, 2009.
Friedman, J., Hastie, T., and Tibshirani, R. A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736, 2010.
Gao, T. and Koller, D. Discriminative learning of relaxed hierarchy for large-scale visual recognition. ICCV, 2011.
Gehler, P. and Nowozin, S. On feature combination for multiclass object classification. In ICCV, 2009.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.
Hull, J.J. A database for handwritten text recognition research. T-PAMI, IEEE, 16(5):550–554, 1994.
Hwang, S.J., Grauman, K., and Sha, F. Learning a tree of metrics with disjoint visual features. NIPS, 2011.
Jawanpuria, P. and Nath, J.S. A convex feature learning formulation for latent task structure discovery. ICML, 2012.
Kang, Z., Grauman, K., and Sha, F. Learning with whom to share in multi-task feature learning. NIPS, 2011.
Kim, S. and Xing, E.P. Tree-guided group lasso for multi-task regression with structured sparsity. ICML, 2010.
Krizhevsky, A. and Hinton, G.E. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Li, L.J., Su, H., Xing, E.P., and Fei-Fei, L. Object bank: A high-level image representation for scene classification and semantic feature sparsification. NIPS, 2010.
Obozinski, G., Taskar, B., and Jordan, M. Joint covariate selection for grouped classification. Technical Report 743, Department of Statistics, University of California, Berkeley, 2007.
Quattoni, A. and Torralba, A. Recognizing indoor scenes. CVPR, 2009.
Quattoni, A., Collins, M., and Darrell, T. Transfer learning for image classification with sparse prototype representations. In CVPR, 2008.
Rahimi, A. and Recht, B. Random features for large-scale kernel machines. NIPS, 2007.
Shalev-Shwartz, S., Wexler, Y., and Shashua, A. ShareBoost: Efficient multiclass learning with feature sharing. Proc. NIPS, 2011.
Sprechmann, P., Ramírez, I., Sapiro, G., and Eldar, Y.C. C-HiLasso: A collaborative hierarchical sparse modeling framework. Signal Processing, IEEE Transactions on, 59(9):4183–4198, 2011.
Tibshirani, R. Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, 58:267–288, 1996.
Torralba, A., Murphy, K.P., and Freeman, W.T. Sharing visual features for multiclass and multiview object detection. T-PAMI, IEEE, 29(5):854–869, 2007.
Verma, N., Mahajan, D., Sellamanickam, S., and Nair, V. Learning hierarchical similarity metrics. In CVPR. IEEE, 2012.
Weinshall, D., Hermansky, H., Zweig, A., Luo, J., Jimison, H., Ohl, F., and Pavel, M. Beyond novelty detection: Incongruent events, when general and specific classifiers disagree. In Proc. NIPS, volume 8, 2008.
Wright, S.J., Nowak, R.D., and Figueiredo, M.A.T. Sparse reconstruction by separable approximation. Signal Processing, IEEE Transactions on, 57(7):2479–2493, 2009.
Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. JMLR, 2010.
Yang, H., Xu, Z., King, I., and Lyu, M. Online learning for group lasso. ICML, 2010.
Yang, J.-B. and Tsang, I. Hierarchical maximum margin learning for multi-class classification. UAI, 2011.
Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. J. Royal Statist. Soc. B, 68(1):49–67, 2006.
Zhao, B., Fei-Fei, L., and Xing, E.P. Large-scale category structure aware image categorization. Proc. NIPS, 2011.
Zhou, D., Xiao, L., and Wu, M. Hierarchical classification via orthogonal transfer. In ICML, 2011.
Zweig, A. and Weinshall, D. Exploiting object hierarchy: Combining models from different category levels. Proc. ICCV, 2007.