WWW.IJITECH.ORG
ISSN 2321-8665
Vol.01, Issue.02, September-2013, Pages: 124-127

Efficient Optimization of Log-Additive Models with Level 1-Regularization

VENKATA SAI SRIHARSHA SAMMETA1, NAVYA YENUGANTI2, SEKHAR MUDDANA3
1Dept of Computer Science, Vasavi College of Engineering, Osmania University, Hyderabad, India.
2Dept of Computer Science, Vasavi College of Engineering, Osmania University, Hyderabad, India.
3Ph.D, Dept of Computer Science, Clemson University, Former Googler, Principal, SR Institute of Technology.
Abstract: The l-bfgs limited-memory quasi-Newton method is the algorithm of choice for optimizing the parameters of large-scale log-linear models with L2 regularization, but it cannot be used for an L1-regularized loss due to its non-differentiability whenever some parameter is zero. Efficient algorithms have been proposed for this task, but they become impractical when the number of parameters grows very large. We present an algorithm, Orthant-Wise Limited-memory Quasi-Newton (owl-qn), based on l-bfgs, that can efficiently optimize the L1-regularized log-likelihood of log-linear models with millions of parameters. In our experiments on a parse re-ranking task, our algorithm was several orders of magnitude faster than an alternative algorithm, and substantially faster than l-bfgs on the analogous L2-regularized problem. We also present a proof that owl-qn is guaranteed to converge to a globally optimal parameter vector.

Keywords: Machine Learning, Deep Learning, Clustering, Neural Networks, Convex Coding, HMM, Log-Linear Models, Regularization, Scalable Training, Parse Re-Ranking Task, L-BFGS, Globally Optimal Parameter, L1-Regularized.

I. INTRODUCTION
Log-linear models, including the special cases of Markov random fields and logistic regression, are used in a variety of forms in machine learning. The parameters of such models are typically trained to minimize an objective of the form

f(x) = ℓ(x) + r(x),   (1)

where r is a regularization term that favors "simpler" models. It is well known that the use of regularization is necessary to obtain a model that generalizes well to unseen data, particularly if the number of parameters is very high relative to the amount of training data. A choice of regularizer that has received increasing attention in recent years is the weighted L1-norm of the parameters,

r(x) = C ‖x‖1 = C Σi |xi|

for some constant C > 0. Introduced in the context of linear regression by Tibshirani (1996), where it is known as the lasso estimator, the L1 regularizer enjoys several favorable properties compared to other regularizers such as L2. It was shown experimentally to be capable of learning good models when most features are irrelevant by Ng (2004). It also typically produces sparse parameter vectors in which many of the parameters are exactly zero, which makes for models that are more interpretable and computationally manageable.

This second property of the L1 regularizer is a consequence of the fact that its first partial derivative with respect to each variable is constant as the variable moves toward zero, "pushing" the value all the way to zero if possible. (The L2 regularizer, by contrast, "pushes" a value less and less as it moves toward zero, producing parameters that are close to, but not exactly, zero.) Unfortunately, this fact about L1 also means that it is not differentiable at zero, so the objective function cannot be minimized with general-purpose gradient-based optimization algorithms such as the l-bfgs quasi-Newton method, which has been shown to be superior at training large-scale L2-regularized log-linear models by Malouf (2002) and Minka (2003).
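To make the behaviour of the penalty concrete, the following sketch evaluates the objective of Eq. (1) for a small logistic-regression model with the weighted L1 term, and prints the constant-magnitude contribution C·sign(xi) that the penalty makes to the (sub)gradient away from zero. This is purely illustrative: the toy data, the helper names, and the choice of logistic loss are assumptions of the example, not the implementation used in our experiments.

```python
import numpy as np

# Illustrative only: Eq. (1) with l(x) a logistic loss and r(x) = C * ||x||_1.
# A (features) and b (labels in {-1, +1}) are hypothetical toy data.

def logistic_loss_and_grad(x, A, b):
    """Negative log-likelihood of a logistic model and its gradient."""
    margins = b * (A @ x)
    loss = np.sum(np.log1p(np.exp(-margins)))
    grad = A.T @ (-b / (1.0 + np.exp(margins)))
    return loss, grad

def l1_objective(x, A, b, C):
    """f(x) = l(x) + C * ||x||_1, as in Eq. (1)."""
    loss, _ = logistic_loss_and_grad(x, A, b)
    return loss + C * np.sum(np.abs(x))

# Away from zero the penalty adds the constant C * sign(x_i) to each partial
# derivative, which is what "pushes" small weights all the way to zero; at
# x_i = 0 the penalty is not differentiable, so plain gradient-based methods
# such as l-bfgs cannot be applied directly.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = np.sign(rng.standard_normal(20))
x = rng.standard_normal(5)
C = 0.5
print(l1_objective(x, A, b, C))
print(C * np.sign(x))  # constant-magnitude penalty part of the (sub)gradient
```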
Several special-purpose algorithms have been designed to overcome this difficulty. Perkins and Theiler (2003) propose an algorithm called grafting, in which variables are added one at a time, each time re-optimizing the weights with respect to the current set of variables. Goodman (2004) and Kazama and Tsujii (2003) (independently) show how to express the objective as a constrained optimization problem, which they solve with a modification of generalized iterative scaling (GIS) and with a quasi-Newton algorithm for problems with bound constraints, respectively. Unfortunately, GIS is generally considered to be dominated by l-bfgs, and both of these algorithms require doubling the number of variables in the general case. Lee et al. (2006) propose the algorithm irls-lars, inspired by Newton's method, which iteratively minimizes the function's second-order Taylor expansion, subject to linear constraints. The quadratic program at each iteration is solved efficiently using the lars algorithm (lasso variant) of Efron et al. (2004). They compare the approach with the previously mentioned algorithms (except that of Kazama and Tsujii) on small- to medium-scale logistic regression problems, and show that in most cases it is much faster.

Unfortunately, irls-lars cannot be used to train very large-scale log-linear models involving millions of variables and training examples, such as are commonly encountered, for example, in natural language processing.
Although worst-case bounds are not known, under charitable assumptions the lasso variant of lars may require as many as O(mn²) operations, where m is the number of variables and n is the number of training instances. Indeed, the only test problems of Lee et al. in which another algorithm approached or surpassed irls-lars were also the largest, with thousands of variables.

II. RELATED WORK
A. Quasi-Newton Algorithms and L-BFGS: We begin our discussion of owl-qn with a description of its parent, the l-bfgs quasi-Newton algorithm for unconstrained optimization of a smooth function. Like Newton's method, quasi-Newton algorithms iteratively build a local quadratic approximation to a function and then conduct a line search in the direction of the point that minimizes the approximation. If Bk is the (perhaps approximated) Hessian matrix of a smooth function f at the point xk, and gk is the gradient of f at xk, the function is modelled locally by

Q(x) = f(xk) + (x − xk)ᵀ gk + ½ (x − xk)ᵀ Bk (x − xk).   (2)
If Bk is positive definite, the value x* that minimizes Q can be computed analytically as

x* = xk − Hk gk,

where Hk = Bk⁻¹. A quasi-Newton method then searches along the ray xk − αHk gk for α ∈ (0, ∞) to obtain the next point xk+1.
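As a quick illustration of this step, the sketch below numerically checks that xk − Hk gk minimizes the quadratic model Q of Eq. (2) for an arbitrary positive definite Bk. The toy matrices and the use of a linear solve in place of an explicit inverse are assumptions of the example, not details of the paper.

```python
import numpy as np

# Illustrative check that x* = xk - Hk @ gk minimizes Q from Eq. (2)
# when Bk is positive definite. Bk, gk and xk are arbitrary toy values.
rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
Bk = M @ M.T + n * np.eye(n)            # positive definite Hessian approximation
gk = rng.standard_normal(n)             # gradient at xk
xk = rng.standard_normal(n)

def Q(x):
    d = x - xk
    return d @ gk + 0.5 * d @ Bk @ d    # the constant f(xk) term is omitted

x_star = xk - np.linalg.solve(Bk, gk)   # xk - Hk @ gk, without forming inv(Bk)

# Q is strictly convex, so x_star should beat random perturbations of itself.
assert all(Q(x_star) <= Q(x_star + 0.1 * rng.standard_normal(n)) for _ in range(100))
print(x_star)
```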
While pure Newton's method uses the exact second-order Taylor expansion at each point, quasi-Newton algorithms approximate the Hessian using first-order information gathered from previously explored points. l-bfgs, as a limited-memory quasi-Newton algorithm, maintains only curvature information from the most recent m points. Specifically, at step k it records the displacement sk = xk − xk−1 and the change in gradient yk = gk − gk−1, discarding the corresponding vectors from iteration k − m. It uses {si} and {yi} to estimate Hk, or more precisely, to compute the search direction −Hk gk, since the full Hessian matrix (which may be unmanageably large) is never explicitly computed or inverted. The time and memory requirements of this computation are linear in the number of variables. The details of this procedure are not important for the purposes of this paper, and we refer the interested reader to Nocedal and Wright (1999).
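For readers who want to see how the search direction −Hk gk is obtained from the stored pairs, here is a sketch of the standard two-loop recursion described by Nocedal and Wright (1999). The helper name, the γ-scaled initial Hessian approximation, and the oldest-first ordering of the stored pairs are assumptions of this example rather than details taken from our implementation.

```python
import numpy as np

# Sketch of the standard l-bfgs two-loop recursion (Nocedal & Wright, 1999)
# for computing -Hk @ gk from stored displacements s_i = x_i - x_{i-1} and
# gradient changes y_i = g_i - g_{i-1}. Illustrative only.

def lbfgs_direction(g, s_list, y_list):
    """Return d ~= -Hk @ g using the last m (s, y) pairs, given oldest first."""
    q = g.copy()
    alphas = []
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    # First loop: newest to oldest.
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        alpha = rho * np.dot(s, q)
        alphas.append(alpha)
        q -= alpha * y
    # Initial Hessian approximation H0 = gamma * I (a common scaling choice).
    s, y = s_list[-1], y_list[-1]
    gamma = np.dot(s, y) / np.dot(y, y)
    r = gamma * q
    # Second loop: oldest to newest.
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * np.dot(y, r)
        r += (alpha - beta) * s
    return -r  # search direction -Hk @ g
```

In practice the (s, y) pairs are kept in a buffer of size m, which gives the limited-memory behaviour described above: both time and storage grow only linearly with the number of variables.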
III. PROPOSED ALGORITHM
In this paper, we propose a new algorithm based on l-bfgs for training large-scale log-linear models using L1 regularization, Orthant-Wise Limited-memory Quasi-Newton (owl-qn). At each iteration, our algorithm computes a search direction by approximately minimizing a quadratic function that models the objective over an orthant containing the previous point. Our experiments on a parse re-ranking task with over one million features demonstrate the ability of owl-qn to scale up to very large problems.

Notation: Let us establish some notation and a few definitions that will be used in the remainder of the paper. Suppose we have a convex function f: Rⁿ → R and a vector x ∈ Rⁿ. We will let ∂i⁺f(x) denote the right partial derivative of f at x with respect to xi:

∂i⁺f(x) = lim(α↓0) [f(x + α ei) − f(x)] / α,

where ei is the i-th standard basis vector, with the analogous left variant ∂i⁻f(x). The directional derivative of f at x in direction d is defined as

f′(x; d) = lim(α↓0) [f(x + α d) − f(x)] / α.

A vector d is referred to as a descent direction at x if f′(x; d) < 0. We will use ‖·‖ to denote the L2 norm of a vector, unless explicitly written ‖·‖1. We will also find it convenient to define a few special functions. The sign function σ takes values in {−1, 0, 1} according to whether a real value is negative, zero, or positive. The function π: Rⁿ → Rⁿ is parameterized by y ∈ Rⁿ, where

πi(x; y) = xi if σ(xi) = σ(yi), and 0 otherwise,

and can be interpreted as the projection of x onto the orthant determined by y.
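The two helper functions just defined are simple enough to state directly in code. The sketch below implements σ and π as given above and shows how a trial point can be projected onto the orthant determined by a reference vector y. In the full owl-qn update the orthant is chosen with the help of a pseudo-gradient, which is not reproduced here, and the numeric values are purely hypothetical.

```python
import numpy as np

# The sign function sigma and the orthant projection pi as defined above.
# Only the two helpers are shown; the constrained l-bfgs direction and the
# projected line search of owl-qn are not reproduced in this sketch.

def sigma(v):
    """Componentwise sign: -1, 0, or +1."""
    return np.sign(v)

def pi(x, y):
    """pi_i(x; y) = x_i if sigma(x_i) == sigma(y_i), else 0."""
    return np.where(sigma(x) == sigma(y), x, 0.0)

# Example: a trial point whose coordinates leave the orthant determined by y
# gets those coordinates clipped to zero, so the quadratic model of Eq. (2)
# is only trusted within one orthant, where the L1 term is linear.
y = np.array([0.8, -0.3, 0.0, 1.2])        # reference vector (hypothetical)
trial = np.array([0.5, 0.1, -0.2, 1.0])    # second and third coordinates leave the orthant
print(pi(trial, y))                        # -> [0.5, 0.0, 0.0, 1.0]
```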
IV. EXPERIMENTAL RESULTS
Although we are primarily interested in the efficiency of training, we first report the performance of the learned parse models. Performance is measured with the PARSEVAL metric, i.e., the F-score over labelled brackets. These results are summarized in Table 1. Both types of model did significantly better than the baseline, and may well be considered state-of-the-art. (For comparison, the model of Charniak and Johnson (2005) also achieved 91.6% F-score on the same test set.) Interestingly, the two regularizers performed almost identically.

Figure 1: L1-regularized objective value over the course of optimization with owl-qn.
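As a side note on the metric, the small sketch below computes a labelled-bracket F-score in the PARSEVAL style referred to above. The bracket tuples are hypothetical toy values chosen only to show the calculation; they are not data from our experiments.

```python
# Illustrative labelled-bracket F-score in the PARSEVAL style: precision and
# recall are computed over (label, start, end) brackets. Toy values only.

def bracket_f_score(gold, predicted):
    """Harmonic mean of labelled-bracket precision and recall."""
    gold, predicted = set(gold), set(predicted)
    matched = len(gold & predicted)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5)}
pred = {("NP", 0, 2), ("VP", 3, 5), ("S", 0, 5)}
print(round(bracket_f_score(gold, pred), 3))  # 2 of 3 brackets match -> 0.667
```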
The results of the CPU timing experiments using the same values of C are shown in Table 2. We terminated K&T after 946 iterations, when it had reached the value 7.34 × 10⁴, still 5.7% higher than the best value found by owl-qn. The difference in both runtime and number of function evaluations between K&T and owl-qn is quite dramatic. Surprisingly, owl-qn converges even faster than our implementation of l-bfgs run on the L2-regularized objective. Note also that the runtime of all algorithms is dominated by function evaluations; otherwise, the most expensive step of owl-qn is the computation of the l-bfgs direction.

A more complete picture of the training efficiency of the two models can be gained by plotting the value of the objective as a function of the number of function calls, as depicted in Figures 1 through 4. (Note the differences in scale of the x-axis.) Since its ability to learn a sparse parameter vector is a significant advantage of the L1 regularizer, we also examine how the number of non-zero weights changes during the course of optimization in Figures 5 and 6. Both algorithms start with a significant fraction of the features (5%-12%) and prune them away as the algorithm progresses, with owl-qn producing a sparse model rather more quickly. Interestingly, owl-qn interrupts this pattern with a single sharp valley at the start of the second iteration. We believe the cause of this is that the model gives a very large weight to many features on the first iteration, only to send most of them back toward zero on the second. On the third iteration some of those features receive weights of the opposite sign, and from then on the set of non-zero weights is more stable. K&T does not exhibit this anomaly.

Figure 2: L1-regularized objective value over the course of optimization with K&T.

We have additionally used owl-qn to train L1-regularized models for a variety of other very large-scale problems in NLP with up to ten million variables, with similar success. This work is described in Gao et al. (2007).
V. CONCLUSION
We have presented an algorithm, owl-qn, for efficiently training L1-regularized log-linear models with millions of variables. We tested the algorithm on a very large-scale NLP task, and found that it was substantially faster than an alternative algorithm for L1 regularization, and even somewhat faster than l-bfgs on the analogous L2-regularized problem. It would be interesting to see whether owl-qn might be useful on L1-regularized problems of other forms involving millions of variables, for example lasso regression. Another direction to explore would be to use similar methods to optimize objectives with other sorts of non-differentiability, such as the SVM primal objective.
VI. REFERENCES
[1] Benson, S. J., & Moré, J. J. A limited memory variable metric method for bound constrained minimization.
[2] Bertsekas, D. P. (1999). Nonlinear programming. Athena Scientific.
[3] Byrd, R. H., Lu, P., & Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16, 1190–1208.
[4] Charniak, E., & Johnson, M. (2005). Coarse-to-fine n-best parsing and maxent discriminative reranking. ACL.
[5] Collins, M. (2000). Discriminative reranking for natural language parsing. ICML (pp. 175–182).
[6] Darroch, J., & Ratcliff, D. (1972). Generalised iterative scaling for log-linear models. Annals of Mathematical Statistics.
[7] Efron, B., Hastie, T., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics.
[8] Gao, J., Andrew, G., Johnson, M., & Toutanova, K. (2007). A comparative study of parameter estimation methods for statistical NLP. ACL.
[9] Goodman, J. (2004). Exponential priors for maximum entropy models. ACL.
[10] Lee, S.-I., Lee, H., Abbeel, P., & Ng, A. (2006). Efficient L1 regularized logistic regression. AAAI.
[11] Kazama, J., & Tsujii, J. (2003). Evaluation and extension of maximum entropy models with inequality constraints. EMNLP.
[12] Malouf, R. (2002). A comparison of algorithms for maximum entropy parameter estimation. CoNLL.
[13] Ng, A. Y. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. ICML.
[14] Nocedal, J., & Wright, S. J. (1999). Numerical optimization. Springer.
[15] Perkins, S., & Theiler, J. (2003). Online feature selection using grafting. ICML.
Author's Profile:
Venkata Sai Sriharsha Sammeta is an undergraduate Machine Learning researcher in the Computer Science department of Vasavi College of Engineering, Osmania University. Previously, he did an internship at Oracle R&D India Private Limited, developing machine learning models for Ordering and Service Management (OSM) systems to predict future orders, which significantly improved the performance of Order Lifecycle Management (OLM). His present research areas are in the fields of Artificial Intelligence,
Machine Learning, Deep Learning, Natural Language
Processing and Recommender Systems.
Navya Yenuganti is a Computer Science student at Vasavi College of Engineering, Osmania University. Previously, she did an internship at Eidiko Systems Inc., an IBM business partner that develops mobile and cloud solutions for enterprises. There, she worked on auto-reply functionality for messages using heuristics-based approaches. Her research areas are in the fields of Convolutional Neural Networks, Machine Translation and Natural Language Processing.
Sekhar Muddana received his Ph.D. in Computer Science from Clemson University, USA, in 1994. He then worked as a software engineer in the USA for a decade, after which he moved to India to join Google India. He is currently the Principal and a Professor in Computer Science at SR Institute of Technology. He has published several papers in the field of Algorithms and Game Theory. His notable work is Albert, a computer algebra system for non-associative identities. He also holds several patents in distributed systems. His current interests are in the areas of Algorithms, Theory of Computation, Compiler Design, Large Scale Data Management, Parallel Computation and Engineering Education.