The Pennsylvania State University
The Graduate School
College of Engineering
DISTRIBUTED ALGORITHMS FOR OPTIMIZATION AND
VARIATIONAL INEQUALITY PROBLEMS: ADDRESSING
COMPETITION, UNCERTAINTY, AND MISSPECIFICATION
A Dissertation in
Industrial Engineering
by
Aswin Kannan
© 2015 Aswin Kannan
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
May 2015
The dissertation of Aswin Kannan was reviewed and approved∗ by the following:
Uday V. Shanbhag
Associate Professor of Industrial and Manufacturing Engineering
Dissertation Advisor, Chair of Committee
Terry L. Friesz
Committee Member
Harold and Inge Marcus Chaired Professor of Industrial and Manufacturing
Engineering
Paul M. Griffin
Committee Member
Professor of Industrial and Manufacturing Engineering
Minghui Zhu
Committee Member
Dorothy Quiggle Assistant Professor of Electrical Engineering
Angelia Nedić
Special Member
Associate Professor of Industrial and Enterprise Systems Engineering
University of Illinois at Urbana Champaign
Harriet B. Nembhard
Professor and Interim Department Head of Industrial and Manufacturing
Engineering
∗Signatures are on file in the Graduate School.
Abstract
This dissertation considers three sets of problems arising from optimization and
game-theoretic settings complicated by the presence of uncertainty, limited information, and problem misspecification. Broadly speaking, the research focuses
on developing gradient-based algorithms in networked and uncertain regimes and
on the asymptotics and rate statements of such schemes. Next, we provide
a short description of each part of this dissertation.
The first part of this work considers computation of equilibria associated with
Nash games that lead to monotone variational inequalities, wherein each player
solves a convex program. Distributed extensions of standard approaches for solving
such variational problems are characterized by two challenges: (1) Unless suitable
assumptions (such as strong monotonicity) are imposed on the mapping arising in
the specification of the variational inequality, iterative methods often require the
solution to a sequence of regularized problems, a naturally two-timescale process
that is harder to implement in practice; (2) Additionally, algorithm parameters
for all players (such as steplengths, regularization parameters, etc.) have to be
chosen centrally and communicated to all players; importantly, these parameters
cannot be independently chosen by a player. Motivated by these shortcomings,
we present two practically implementable distributed regularization schemes that
work on a single-timescale; specifically, each scheme requires precisely one gradient
or projection step at every iteration. Both schemes are characterized by the property that the regularization/centering parameter is updated after every iteration,
rather than when one has approximately solved the regularized problem. To aid
in distributed settings requiring limited coordination across players, the schemes
allow players to select their parameters independently and do not insist on central
prescription of such parameters. We conclude with an application of these schemes
on a networked Cournot game with nonlinear prices.
In the second portion of our work, we consider stochastic variational inequalities
under pseudomonotone settings. Referred to as pseudomonotone stochastic variational inequality problems or PSVIs, such problems emerge from product pricing,
fractional optimization problems, and subclasses of economic equilibrium problems
arising in uncertain regimes. Succinctly, we make two sets of contributions to the
study of PSVIs. In the first part, we observe that a direct application
of standard existence/uniqueness theory requires a tractable expression for the
integrals arising from the expectation, a relative rarity when faced with general
distributions. Instead, we develop integration-free sufficiency conditions for the
existence and uniqueness of solutions to PSVIs. In the second part, we
consider the solution of PSVIs via stochastic approximation (SA) schemes, motivated by the observation that almost all prior SA schemes accommodate at most
monotone SVIs. Under various forms of pseudomonotonicity, we prove that the
solution iterates produced by extragradient SA schemes converge to the solution
set in an almost sure sense. This result is further extended to mirror-prox regimes
and an analogous statement is also provided for monotone regimes, under a weak-sharpness requirement, where prior results have only shown convergence in terms of
the gap function through the use of averaging. Under strong pseudomonotonicity,
we derive the optimal initial steplength and show that the mean-squared error in
the solution iterates produced by the extragradient SA scheme converges at the
optimal rate of O(1/K). Similar rates are derived for mirror-prox generalizations and
monotone SVIs under a weak-sharpness requirement. Finally, both the asymptotics
and the empirical rates of the schemes are studied on a set of pseudomonotone and
non-monotone variational problems.
The third part of this dissertation studies networked settings where agents
attempt to solve a common problem, whose information is both misspecified and
only partly known to every agent. We consider a convex optimization problem
that requires minimizing a sum of misspecified agent-specific expectation-valued
convex functions over the intersection of a collection of agent-specific convex sets,
denoted by X1 , . . . , Xm . The agent objectives are misspecified in a parametric sense
and this misspecification may be resolved through solving a distinct stochastic
convex learning problem. We consider the simultaneous resolution of both problems
through a joint set of schemes in which agents update their decisions and their
beliefs regarding the misspecified parameter at every step. The former combines
an agent-specific averaging step and a projected stochastic gradient step while
parameter updates are carried out through a projected stochastic gradient step.
Given such a set of coupled schemes, we provide both almost sure convergence
statements as well as convergence rate statements when either Xi = X for every i
or X = ∩_{i=1}^m Xi. Notably, we prove that when X = ∩_{i=1}^m Xi and agent objectives
are strongly convex, we recover the optimal rate of stochastic approximation of
O(1/√K) in the solution iterates despite the presence of averaging (arising from
the consensus step) and learning (arising from misspecification). When strong
convexity assumptions are weakened to mere convexity but Xi = X for every i, we
show that the averaged sequence over the entire K iterations displays a modest
degradation in the convergence rate from the optimal rate in terms of function
value. When the averaging window is reduced to K/2, we recover the optimal
rate of convergence in function values. Preliminary numerics are provided for an
economic dispatch problem with misspecified cost functions.
Table of Contents

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

Chapter 1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Monotone Nash games . . . . . . . . . . . . . . . . . . . . . . . .  2
1.2 Stochastic variational inequality problems . . . . . . . . . . . . . 4
1.3 Distributed stochastic optimization under imperfect information . .  6
Chapter 2
Distributed computation of equilibria in monotone Nash games
via iterative regularization techniques . . . . . . . . . . . . . . . .  8
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Inapplicability of existing schemes . . . . . . . . . . . . . . . 10
2.1.2 Motivating applications . . . . . . . . . . . . . . . . . . . . . 12
2.1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Iterative Tikhonov regularization schemes . . . . . . . . . . . . .  15
2.2.1 Partially coordinated ITR schemes . . . . . . . . . . . . . . . . 17
2.2.2 A multistep ITR scheme . . . . . . . . . . . . . . . . . . . . .  24
2.3 Iterative proximal point schemes . . . . . . . . . . . . . . . . . . 25
2.3.1 An iterative proximal point scheme . . . . . . . . . . . . . . .  27
2.3.2 Partially coordinated modified iterative proximal point schemes . 30
2.4 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . .  35
2.4.1 Problem parameters and termination criteria . . . . . . . . . . . 36
2.4.2 Fully coordinated schemes . . . . . . . . . . . . . . . . . . . . 37
2.4.3 Partially coordinated schemes . . . . . . . . . . . . . . . . . . 38
2.4.4 Multistep schemes . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4.5 Partial information . . . . . . . . . . . . . . . . . . . . . . . 41
2.5 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.6 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Chapter 3
Pseudomonotone Stochastic Variational Inequality Problems:
Analysis and Optimal Stochastic Approximation Schemes . . . . . . . . . 45
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . .  47
3.1.1.1 Stochastic fractional programming . . . . . . . . . . . . . . . 47
3.1.1.2 Stochastic product pricing . . . . . . . . . . . . . . . . . .  48
3.1.1.3 Stochastic economic equilibria . . . . . . . . . . . . . . . .  49
3.1.2 Related work on SA . . . . . . . . . . . . . . . . . . . . . . .  50
3.1.3 Contributions and outline . . . . . . . . . . . . . . . . . . . . 51
3.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.1 Stochastic Pseudomonotone Maps . . . . . . . . . . . . . . . . .  53
3.2.2 Existence and uniqueness of solutions . . . . . . . . . . . . . . 56
3.3 Extragradient-based stochastic approximation schemes . . . . . . . . 62
3.3.1 Background and Assumptions . . . . . . . . . . . . . . . . . . .  62
3.3.2 An extragradient SA scheme . . . . . . . . . . . . . . . . . . .  63
3.3.3 Mirror-prox generalizations . . . . . . . . . . . . . . . . . . . 75
3.3.4 Rate of convergence analysis . . . . . . . . . . . . . . . . . .  80
3.4 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . .  91
3.4.1 Test Suite . . . . . . . . . . . . . . . . . . . . . . . . . . .  91
3.4.2 Algorithm parameters and termination criteria . . . . . . . . . . 93
3.4.3 Almost sure convergence behavior . . . . . . . . . . . . . . . .  94
3.4.4 Error analysis and optimal choices of γ0 . . . . . . . . . . . .  94
3.5 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.6 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

Chapter 4
Distributed Stochastic Optimization under Imperfect Information . . .  100
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . .  100
4.2 Assumptions and Algorithms . . . . . . . . . . . . . . . . . . . .  104
4.3 Almost sure convergence . . . . . . . . . . . . . . . . . . . . . . 107
4.4 Rate of convergence estimates . . . . . . . . . . . . . . . . . . . 121
4.5 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.5.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.5.2 Consensus error . . . . . . . . . . . . . . . . . . . . . . . .  132
4.5.3 Learning error . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.5.4 Exploration vs Exploitation . . . . . . . . . . . . . . . . . .  133
4.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . .  134

Chapter 5
Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  136

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
List of Figures

2.1 p-ITR (l) and p-MIPP (r): Majors . . . . . . . . . . . . . . . . . . 38
3.1 Mean-squared error vs h . . . . . . . . . . . . . . . . . . . . . .  95
4.1 Trajectory of the function values f(z^{∗,k}, θ∗) and f(z^k, θ∗) . . 134
List of Tables

2.1 ITR: Majors (J = 3, σ = 1.4) . . . . . . . . . . . . . . . . . . . . 37
2.2 ITR: Majors (J = 5) . . . . . . . . . . . . . . . . . . . . . . . .  37
2.3 IPP and MIPP: Majors (J = 5, σ = 1.3) . . . . . . . . . . . . . . .  38
2.4 p-ITR and p-MIPP: Majors (σ = 1.3) . . . . . . . . . . . . . . . . . 39
2.5 Multistep ITR: Majors (J = 5, σ = 1.4) . . . . . . . . . . . . . . . 39
2.6 r-p-ITR and r-p-MIPP: Total projection steps (σ = 1.4)
    with r fixed across agents . . . . . . . . . . . . . . . . . . . .  40
2.7 r-p-ITR and r-p-MIPP: Total projection steps (σ = 1.4)
    with firm-specific ru . . . . . . . . . . . . . . . . . . . . . . . 41
2.8 r-p-ITR and r-p-MIPP: Majors (σ = 1.3) with r fixed and
    partial information . . . . . . . . . . . . . . . . . . . . . . . . 41
2.9 ru-p-ITR and ru-p-MIPP: Majors (σ = 1.3) with r not fixed
    and partial information . . . . . . . . . . . . . . . . . . . . . . 42
3.1 SA based approaches for Stochastic Variational Inequality Problems . 51
3.2 Asymptotics of ESA . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.3 Comparison of ESA to MPSA-a and MPSA-b for frac. nonlin. problems .  96
3.4 Analytical vs Empirical Bounds for stochastic Nash-Cournot Game . .  96
3.5 Optimality error for varying choices of γ0 . . . . . . . . . . . . . 97
4.1 Deviation εk of decisions with iterations . . . . . . . . . . . . . 132
4.2 Learning parameter error with iterations . . . . . . . . . . . . .  133
4.3 Error in iterates with k and κ . . . . . . . . . . . . . . . . . .  134
Acknowledgments
It goes without saying that Dr. Uday Shanbhag has been an exemplary mentor,
one who taught me optimization and evoked my interest in the fundamental
science of variational inequalities. Besides mathematics, I learnt the basics of
computation, solver development, the use of the right scientific language, and, above all,
the nitty-gritty of LaTeX, all from Uday. His sense of mathematical completeness, rigor,
and perfection still leaves me astonished and keeps my quest going. Besides his
enlivening encouragement and motivation, his generosity to accommodate my travel
schedules and several months of remote advising leave me indebted for years to
come. On the personal front, I consider myself to be lucky and extremely fortunate
to have such a kind friend and mentor. Above all, when I return to academia, I
will try my best to be a person of Uday’s stature.
In addition to several conversations on distributed computation and for being
a part of my committee, I’d like to thank Prof. Nedić for teaching me convex
optimization, which has been truly instrumental in coming up with the basic
questions in this dissertation and my work on the development of optimization
solvers at Argonne and beyond. I’d like to thank Prof. Friesz, Prof. Zhu, and
Prof. Griffin for agreeing to be a part of my committee and for their insightful
comments and suggestions through various chapters in this dissertation and the
style of my presentations.
This thesis is completely dedicated to my aunt’s family in Illinois (Dr. Nandkumar, Dr. Radha Nandkumar, Dr. Ajit Chary, and Dr. Anita Chary), whose
unfailing support throughout has been the major reason for achieving my goals.
On the personal side, my doctoral research would not have been possible without
my cousin and dearest friend, Dr. Ajit Chary, who has made my stay in Chicago all
these years more memorable than I can even imagine. I’d like to thank my parents
(Kannan Srinivasan and Rajalakshmi Kannan) and grandparents (Raghavan and
Choodamani Raghavan) for being an integral part of my research, work, doctoral
dissertation, and much more to come throughout my life. I’d also like to thank my
aunt’s family (Kannan Rajagopalan, Jayanthi Kannan, and Aravind Kannan) in
Texas, the rest of my family in India, and Dr. Ravichandran (and family) in Chennai
for their support at several important facets over the last few years.
My dear friends Venkat, Navin, Vijit, Sriram, Dinesh, Ram, Bharath, Chandru,
Vishal, Jayash, Ismaeel, Swami, Anjan, Rajan, JK, Shankar, and Jayanand from
UIUC, Rajesh, Krishna, Prasanna, Shashi, Ashutosh, and Daniel from Argonne
National Labs, Kalpesh and Arjun from Chicago, Aswin Dhamodharan, Vijay,
Bharadwaj, Girish, and Jonathan from Penn State, and Hari from Pittsburgh
(CMU) have made my social front more joyous throughout.
Dedication
To my aunt’s family - Dr. Nandkumar, Dr. Radha Nandkumar, Dr. Ajit Chary,
and Dr. Anita Chary.
This thesis would not have been possible without my cousin Dr. Ajit Chary.
Chapter 1
Introduction
Optimization theory and algorithms are of enormous relevance in engineering,
economics, and applied sciences, amongst others. There have been tremendous
advances in the last several decades in the development of analytical and algorithmic
tools for addressing a broad class of optimization problems [1–5]. Following the
seminal work of Dantzig on linear programming [6] and the simplex method, the
field of convex optimization [7–10] has seen rapid growth over the last several
decades. Convex optimization problems arise in a breadth of settings, including
portfolio optimization, the design and operation of control systems, product design,
scheduling, and transportation management. Given a convex set X ⊆ R^n and a
convex function f : X → R, the convex optimization problem is defined as follows:

    minimize    f(x)
    subject to  x ∈ X.        (1.1)

Specifically, when f(x) is a convex quadratic function and X is a polyhedral set,
(1.1) reduces to a convex quadratic program.
In this dissertation, we consider three specific generalizations of the standard
convex optimization problem, summarized briefly in the next bullets and described
in more detail in the forthcoming subsections:
Convex Nash games: In Section 1.1, we consider the game-theoretic generalization
of the convex optimization problem in which there is a collection of selfish agents,
each solving a parameterized convex optimization problem. Our focus lies in developing implementable distributed gradient schemes under which the resulting
sequence converges to a Nash equilibrium under the caveat that the variational map
derived from the player-specific gradient maps is not necessarily strongly monotone.
Pseudomonotone stochastic variational inequality problems: In Section 1.2, we
consider the stochastic variational inequality problem in which the expected-value
map is pseudomonotone (as well as some of its variants). In this context, we both provide
integration-free sufficiency conditions and develop a stochastic extragradient scheme that is shown to possess almost sure convergence properties and
to display the optimal rate of convergence.
Distributed optimization under imperfect information: Finally, in Section 1.3, we
consider a multi-agent setting where an imperfectly specified stochastic convex
optimization problem needs to be resolved via a distributed scheme. The misspecification is corrected by a parallel learning scheme and we proceed to show that
resulting coupled scheme produces sequences that are convergent to the required
solutions in an almost sure sense. In addition, we provide a characterization of the
non-asymptotic error bounds associated with such a sequence.
1.1 Monotone Nash games
Often, multi-agent systems are complicated by competition. This competition
might arise organically from inherently selfish players [11–13] or may be an artifact
of a model constructed to develop distributed control schemes [14–16]. In a majority of these
settings, the goal lies in developing schemes that can compute Nash equilibria [17].
In either case, the resulting problem is characterized by a set of agents, denoted by
J , in which the jth agent’s problem is the following parameterized convex program:
    Ag(x−j):    minimize    fj(xj; x−j)
                subject to  xj ∈ Xj,
where fj and Xj refer to agent j’s objective and feasibility set respectively, while
x−j denotes the vector of decisions of agents other than agent j. The Nash
equilibrium [17] of this game is given by the tuple of agent decisions at which no
agent can reduce its cost by unilaterally changing its decision.
The variational inequality problem allows for developing a compact representation of the necessary and sufficient equilibrium conditions of this game. Under
assumptions of convexity on fj(·, x−j) and Xj [18, Ch. 2], the tuple x∗ satisfies
x∗j ∈ SOL(Ag(x∗−j)) for j = 1, . . . , |J| if and only if x∗ is a solution to VI(X, F), where

    F(x) ≜ (∇x1 f1(x), . . . , ∇x|J| f|J|(x))    and    X ≜ X1 × · · · × X|J|.

It may be recalled that VI(X, F) requires an x ∈ X such that

    (y − x)ᵀ F(x) ≥ 0,    ∀y ∈ X.
In convex Nash games characterized by the presence of monotone maps (referred
to as monotone Nash games), Tikhonov regularization and proximal point schemes
have proved useful [18, Ch. 12]. However, both of these schemes require increasingly
accurate solutions to a sequence of strongly monotone variational inequality problems,
a need that is not easy to address in a distributed regime; instead, we
propose a distributed (regularized/modified) projected-gradient scheme. For the
sake of completeness, we define strict monotonicity of the map as follows:

    (F(x) − F(y))ᵀ(x − y) > 0,    ∀x, y ∈ X, x ≠ y.
Given a Nash game whose equilibrium conditions are given by a merely monotone
VI(X, F), we define two schemes given an x0 ∈ X and sequences {εk}, {θk}, and
{γk} as follows:

    xk+1 := ΠX(xk − γk(F(xk) + εk xk)),               k ≥ 0,    (ITR)
    xk+1 := ΠX(xk − γk(F(xk) + θk(xk − xk−1))),       k ≥ 0.    (IPP)
Note that the ITR scheme is a modification of the standard Tikhonov regularization
scheme in that instead of computing a sequence of solutions to the (Tikhonov)
regularized problems, a single projected gradient step is taken after which the
steplength and the regularization sequence are updated. Similarly, the iterative
proximal point scheme obviates solving a sequence of proximal-point problems.
Instead, in the proposed scheme (IPP), the weighting parameter and steplength
sequence are updated after every gradient step. Our contributions in Chapter 2 are
summarized as follows:
• Using appropriate choices of steplengths and regularization sequences, we
establish the convergence of the ITR scheme by relating it to the trajectory
of the Tikhonov scheme. The convergence of (IPP) follows similarly under
the slightly stronger assumption that the mapping F(x) is strictly
monotone.
• We extend our scheme and the associated convergence theory to partially
coordinated regimes where agents use their own steplengths and regularization
sequences. Agents communicate with a central authority or with one another
only at the end of every iteration. We further generalize these
schemes to multistep regimes where agents take more than one projection
step for a fixed choice of steplength and regularization parameters.
• The proposed algorithms are tested on a Nash-Cournot game with nonlinear
prices. In addition to numerically observing convergence, we study the
behavior of these algorithms in partially coordinated settings and multistep
regimes, and compare the tradeoffs between computational complexity and
accuracy.
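To make the single-timescale character of (ITR) concrete, the sketch below applies the update to an illustrative monotone, but not strongly monotone, affine map; the map, the decay exponents of the steplength and regularization sequences, and the constraint set are choices made for this example only.

```python
import numpy as np

# Sketch of the single-timescale (ITR) update: exactly one projection step per
# iteration, with the steplength gamma_k and the regularization eps_k decayed
# jointly rather than after an inner loop. The map F(x) = A x is monotone but
# not strongly monotone (A is skew-symmetric, so x^T A x = 0), a regime where
# a plain projection scheme need not converge; all sequences are illustrative.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])

def F(x):
    return A @ x

def project(x):
    return np.clip(x, -1.0, 1.0)          # projection onto X = [-1, 1]^2

x = np.array([1.0, 1.0])
for k in range(1, 20001):
    gamma = (k + 1) ** -0.6               # steplength sequence {gamma_k}
    eps = (k + 1) ** -0.3                 # regularization sequence {eps_k}
    x = project(x - gamma * (F(x) + eps * x))

assert np.linalg.norm(x) < 1e-3           # iterates approach the solution x* = 0
```

The key point mirrored from the text: the regularization parameter changes after every single projection step, so there is no inner loop of regularized subproblems to coordinate.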
1.2 Stochastic variational inequality problems
Next, we consider the setting when the map F of the variational inequality problem
is expectation-valued and, for any x, F(x) is given componentwise by

    Fi(x) ≜ E[Fi(x, ξ)],    for i = 1, . . . , n.

Note that ξ : Ω → R^d, F : X × R^d → R^n, and (Ω, F, P) is the associated probability
space. Stochastic variational inequality problems assume relevance in modeling
stochastic fractional programming, as arising from problems that include the minimization of lift-to-drag ratios by changing aerofoil dimensions such as thickness
and chord length [19, 20], fuel economy to vehicle performance ratios [21, 22], and
the maximization of signal-to-noise ratios in wireless networks [23, 24]. Stochastic
variational inequality problems also arise in stochastic product pricing [25–27] and
stochastic economic equilibria [28].
Inspired by the seminal work by Robbins and Monro [29], there has been a surge
of recent effort in the development of stochastic approximation (SA) schemes [30–32]
that generally use one sample (or a fixed number of samples) per iteration. However,
the applicability of these schemes has traditionally been limited to strongly, strictly,
or merely monotone maps. From the standpoint of analysis, traditional statements
on existence and uniqueness of solutions to the SVI either rely on compactness of
the set X or on tractability of the expected-value map E[F(x, ξ)], which is relatively challenging when faced with general probability spaces. Motivated by these
questions, we develop integration-free existence statements and convergent
algorithms for settings that are weaker than monotonicity. Our contributions in
Chapter 3 can be summarized as follows:
• By adopting the approach in [33], we propose an alternative by considering
only the scenario-specific maps to obtain sufficiency conditions for existence
and uniqueness that obviate the need for integration. In contrast with past
work, we provide refinements of these conditions for pseudomonotone regimes.
• We propose a stochastic version of the extragradient scheme and under
some mild assumptions on the stochastic error, we establish convergence for
different variants of pseudomonotonicity. Without using averaging of iterates
and under some basic assumptions on the solution properties, we claim
convergence under both pseudomonotone and monotone settings. Based on a
stronger form of pseudomonotonicity, we show that the scheme demonstrates
the optimal rate of convergence. We further extend these statements to
prox-based generalizations of the extragradient scheme that admit general
distance functions.
• Finally, we consider a set of nonlinear test problems that are pseudomonotone, monotone, and non-monotone from different applications like fractional
programming and game theory. We study the performance of the proposed
algorithms, compare standard extragradient type schemes with prox-based
generalizations, and analyze the effect of initial parameter choices on
computational behavior.
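As a minimal illustration of the extragradient SA update studied in Chapter 3, the sketch below applies it to a one-dimensional strongly monotone (hence strongly pseudomonotone) map with additive noise; the map, the noise model, and the steplength rule γk = γ0/k are illustrative stand-ins, not the chapter's test suite.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch of a stochastic extragradient (SA) step for VI(X, F) with
# F(x) = E[F(x, xi)]. Here F(x, xi) = x + xi on X = [-5, 5], so F(x) = x is
# strongly monotone (hence strongly pseudomonotone) and x* = 0. The map, the
# noise model, and the steplength rule gamma_k = gamma0/k are illustrative.
def F_sample(x):
    return x + rng.normal(scale=0.1)

def project(x):
    return np.clip(x, -5.0, 5.0)

x, gamma0 = 4.0, 1.0
for k in range(1, 20001):
    gamma = gamma0 / k
    y = project(x - gamma * F_sample(x))   # extrapolation step at x_k
    x = project(x - gamma * F_sample(y))   # update step, map sampled at y_k

assert abs(x) < 0.05                        # iterate approaches x* = 0
```

Note the defining feature of extragradient methods relative to the plain projection scheme: two sampled evaluations of the map per iteration, with the update taken at the extrapolated point y rather than at x.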
1.3 Distributed stochastic optimization under imperfect information
In Chapter 4, we consider regimes with a collection of m agents characterized by
convex cost functions E[ϕ1(x, θ∗, ω)], . . . , E[ϕm(x, θ∗, ω)] and the associated convex
feasibility sets X1, . . . , Xm, where for i = 1, . . . , m, the function ϕi(·, θ, ω) is
convex in its first argument for θ ∈ Θ and ω ∈ Ω, and Xi is a closed and convex set. The resulting
distributed stochastic optimization problem is defined as follows:

    minimize    Σ_{i=1}^m E[ϕi(x, θ∗, ω)]
    subject to  x ∈ ∩_{i=1}^m Xi.        (1.2)
Almost all of the past work on solving such problems has relied on the assumption
that θ∗ is known to all of the agents (cf. [34–39] for application and algorithmic
contributions). In contrast with past work, every agent’s objective function is
corrupted by misspecification and this misspecification is resolved progressively
by solving a suitably defined learning problem. Note that the parameter θ∗ may
pertain to the nature of the cost functions and efficiencies (power generation), the
structure of the price function (revenue management problems), amongst others.
Our contributions in Chapter 4 can be summarized as follows:
• Asymptotics: We begin by considering the setting when X1 , . . . , Xm are not
necessarily identical and propose an algorithm where the learning parameter
θ and the agent iterates xi are updated simultaneously on the same timescale.
Under suitable assumptions, we prove the almost sure convergence of the
iterates produced by our proposed coupled scheme to the optimal solution of
the correctly specified computational problem and the true θ∗ .
• Rate statements: Under slightly stronger assumptions, we obtain O(1/k)
error bounds for the sequence of agent-specific iterates and distinguish the
impact of learning on the rate statement. Under cases of mere convexity,
we utilize iterate averaging and demonstrate that the optimal rate may be
attained.
6
• Computational behavior: We consider an economic dispatch problem as our
case study to test our proposed schemes. We compare the optimality gap and
the consensus error for varying degrees of accuracy in solving the learning problem.
We also compare the behavior of our scheme with one in which the learning
problem is solved a priori.
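The coupled decision/learning updates described above can be sketched for a single agent as follows; the quadratic objective misspecified in a scalar θ∗, the observation model, and the stepsizes are all illustrative choices, and the agent-specific averaging (consensus) step of the full multi-agent scheme is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Single-agent sketch of the coupled scheme: at every step, a projected
# stochastic gradient update of the decision x (evaluated at the current belief
# theta) runs alongside a stochastic gradient update of theta itself, on the
# same timescale. The objective 0.5*(x - theta_star)^2, the observation model,
# and the stepsizes are illustrative; the consensus step is omitted.
theta_star = 2.0
x, theta = 0.0, 0.0
for k in range(1, 20001):
    gamma = 1.0 / k
    obs = theta_star + rng.normal(scale=0.5)       # noisy observation of theta_star
    grad_x = (x - theta) + rng.normal(scale=0.1)   # stochastic gradient at (x, theta)
    x = float(np.clip(x - gamma * grad_x, -10.0, 10.0))  # projected decision step
    theta -= gamma * (theta - obs)                 # learning step for the parameter

assert abs(theta - theta_star) < 0.1               # belief resolves the misspecification
assert abs(x - theta_star) < 0.1                   # decision tracks the corrected problem
```

The point of the sketch is the simultaneity: the decision update always uses the current (still inexact) belief theta, and both errors vanish together rather than the learning problem being solved to completion first.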
Chapter 2
Distributed computation of equilibria in monotone Nash games
via iterative regularization techniques
This chapter is a synopsis of published manuscripts.¹
2.1 Introduction
Consider an N-person Nash game, denoted by G, in which the jth player solves

    Ag(x−j):    minimize    fj(xj; x−j)
                subject to  xj ∈ Kj

for j = 1, . . . , N. Let xj ∈ Kj ⊆ R^{nj}, where Kj is a closed and convex set for
j = 1, . . . , N. Furthermore, suppose x−j denotes (xi)i≠j, and n1 + · · · + nN = n. Additionally,
we assume that for j = 1, . . . , N, the function fj : R^n → R is differentiable
and convex in xj for all x−j ∈ K−j ≜ ∏i≠j Ki. A
Nash equilibrium of the noncooperative game G is given by a tuple {x∗j}_{j=1}^N, where

    x∗j ∈ SOL(Ag(x∗−j))    for j = 1, . . . , N,

and SOL(Ag) denotes the set of solutions to problem (Ag). The
convexity of the objectives and the associated strategy sets allows us to claim that
the first-order equilibrium conditions are necessary and sufficient. In fact, these
conditions can be shown to be equivalent to a scalar variational inequality VI(K, F)
(see [18, Ch. 1]), where

    F(x) ≜ (∇x1 f1(x), . . . , ∇xN fN(x))    and    K ≜ K1 × · · · × KN.

Recall that VI(K, F) is a problem requiring an x ∈ K such that

    F(x)ᵀ(y − x) ≥ 0,    ∀y ∈ K.

¹Aswin Kannan and Uday Shanbhag, “Distributed Computation of Equilibria in Monotone Nash Games via Iterative Regularization Techniques,” SIAM Journal on Optimization, 22(4) (2012), pp. 1177–1205; and Aswin Kannan and Uday Shanbhag, “Distributed Iterative Regularization Algorithms for Monotone Nash Games,” Proceedings of the IEEE Conference on Decision and Control (CDC), 2010.
We assume throughout this chapter that the mapping F : Rn → Rn is single-valued
and possesses a monotonicity property over a set K. It may be recalled that a
mapping F(x) is monotone over a set K if (F(x) − F(y))ᵀ(x − y) ≥ 0 for all
x, y ∈ K. We refer to the resulting class of Nash games as monotone Nash games.
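The monotonicity property above is easy to verify numerically for simple maps; the snippet below checks it for an illustrative affine map whose symmetric part is positive semidefinite (the matrix is chosen for this example, not drawn from the chapter).

```python
import numpy as np

rng = np.random.default_rng(2)

# Numerical sanity check of monotonicity, (F(x) - F(y))^T (x - y) >= 0, for an
# illustrative affine map F(x) = M x. M is chosen for the example: its
# symmetric part is the identity, so the quadratic form below equals
# ||x - y||^2 >= 0 even though M itself is far from symmetric.
M = np.array([[1.0, 2.0], [-2.0, 1.0]])  # symmetric part (M + M^T)/2 = I

def F(x):
    return M @ x

for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    # allow a tiny negative tolerance for floating-point roundoff
    assert (F(x) - F(y)) @ (x - y) >= -1e-9
```

The skew part of M contributes nothing to the inner product, which is why monotonicity of an affine map depends only on the symmetric part of its matrix.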
Essentially, the problem of computing a monotone Nash equilibrium reduces
to that of solving a Cartesian monotone variational inequality. A host of methods
exist for solving such problems (cf. [18]). Our interest lies in the construction
of schemes implementable over networks in which players can independently
choose algorithm parameters (such as steplengths, for instance) and operate in a
single-timescale regime where each player takes a single step. More specifically, our
goal lies in constructing schemes characterized by the following properties:
(i) Distributed schemes: Here, the accent is on developing schemes in which a given
player need not have any information regarding the utilities or constraints of its
competitors. In game-theoretic regimes, as opposed to optimization settings, player
choices cannot be globally enforced. Therefore, players should be able to independently select algorithm parameters, such as steplength or regularization sequences.
Note that this neither necessitates the presence of a central coordinator for selecting
such parameters nor imposes the burden of communicating these parameters to users across
the network. It should also be emphasized that while our schemes require that
every user be aware of competitive decisions, in most practical networked settings
arising in communication and traffic networks, users require only the aggregate utilization
of a link, rather than an entire vector of competitive decisions;²
(ii) Single-timescale schemes: Our goal lies in developing schemes implementable
on a single timescale and do not require central coordination. Such techniques
are characterized by a single loop; in contrast, hierarchical two-loop
schemes require that a lower-level loop be completed before a step is taken in the
upper loop;
(iii) Low-complexity schemes: Finally, to allow for broader utility of such
techniques across a range of applications, we are interested in low-complexity
gradient-based schemes.
2.1.1 Inapplicability of existing schemes
The reader might question the need for new methods for solving a relatively simple
class of variational inequalities (i.e., monotone problems). We begin by considering a strongly
monotone variational inequality problem VI(K, F). Recall that this implies that
there exists an η > 0 such that (F(x) − F(y))^T (x − y) ≥ η‖x − y‖² for all x, y ∈ K.
Then, a simple distributed gradient-based scheme is given by

x_j^{k+1} = Π_{K_j}( x_j^k − γ_j ∇f_j(x^k) ),  j = 1, …, N.  (2.1)
If F is Lipschitz continuous over K with constant L, then convergence follows if
γ_j = β for j = 1, …, N and β < 2η/L² [18]. However, this requires that all users
adopt this steplength, which may be difficult to mandate over large-scale networked
regimes, particularly when problem parameters change intermittently. Succinctly,
this paper is motivated by the need to allow players to independently choose γ_j
while guaranteeing overall convergence.
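To make the steplength-coordination issue concrete, scheme (2.1) can be sketched in a few lines. The strongly monotone affine map, the box strategy sets, and all numeric values below are our own illustrative assumptions, not data from the text; with a common steplength β < 2η/L², the iterates contract to the solution.

```python
import numpy as np

# Illustrative sketch of scheme (2.1); all problem data are hypothetical.
rng = np.random.default_rng(0)
N = 4                                    # players, one coordinate each
B = rng.standard_normal((N, N))
A = B @ B.T + 2.0 * np.eye(N)            # strongly monotone affine map F(x) = Ax + b
b = rng.standard_normal(N)

def F(x):
    return A @ x + b

eta = np.min(np.linalg.eigvalsh((A + A.T) / 2))  # strong monotonicity constant
L = np.linalg.norm(A, 2)                         # Lipschitz constant of F
beta = eta / L**2                                # any common beta < 2*eta/L**2 works

lo, hi = -1.0, 1.0                               # K_j = [lo, hi] for each player
x = np.zeros(N)
for k in range(50_000):
    # every player projects its own coordinate, all using the SAME steplength beta
    x = np.clip(x - beta * F(x), lo, hi)

# natural-map residual: x - Pi_K(x - F(x)) vanishes exactly at a VI solution
residual = np.linalg.norm(x - np.clip(x - F(x), lo, hi))
print(residual)
```

Note how the common β is computed from the global constants η and L; it is precisely this shared knowledge that the partially coordinated schemes of this chapter seek to avoid.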
In merely monotone regimes, a natural option is through the use of extragradient
schemes; such schemes would require agents to take two projection steps and choose
γj = β < 1/L. Again, players need to agree on a feasible steplength. In the same
²Games in which the interactions across agents arise through aggregate decisions are known as aggregative games [40].
vein, the hyperplane projection method [18, Ch. 12] does not require knowledge of
L but requires conducting a line search loop, a process that would be significantly
harder to coordinate across users. A possible avenue for alleviating the challenge is
through regularization methods, a set of techniques that address this ill-posedness
by sequentially solving a set of strongly monotone problems. One such technique is
an exact Tikhonov regularization method. Let

F^k(z) ≜ F(z) + ε_k z,  (2.2)

where z^{k+1} solves VI(K, F^k) and ε_k denotes the regularization parameter at the kth iteration.
Under suitable conditions (see [18, Ch. 12]), the sequence {z^k} converges to z^* as
ε_k → 0. Tikhonov [41, 42] introduced his eponymous regularization when
contending with the solution of ill-posed inverse problems, while its application to
monotone variational inequalities appears to have been first suggested by Browder [43]. However, in most implementations, the solution of the strongly monotone
problem VI(K, F^k) needs to be obtained either exactly or approximately. In effect, this
scheme requires solving a sequence of well-posed problems, each of which might
require a distributed iterative process in itself. This is effectively a two-timescale
method where the regularization method makes changes at a slower timescale while
solutions to the regularized problems in an exact or inexact form are found at a
faster timescale. Additionally, synchronous distributed schemes are challenging to
implement since this necessitates heterogeneous agents taking similar amounts of
time for computing exact/inexact solutions. Furthermore, we are again constrained
by the need to share problem parameters (L, η, etc.) to allow for the construction
of distributed schemes.
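The two-timescale structure just described can be sketched as follows; the skew-symmetric (merely monotone) test map, the box set K, and the halving schedule for ε_k are hypothetical choices of ours, used only to illustrate why each outer update of ε_k must wait for an inner loop to (approximately) solve VI(K, F + ε_k I).

```python
import numpy as np

# Hedged sketch of the two-timescale Tikhonov scheme; all test data are hypothetical.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])   # skew-symmetric: monotone, not strongly so
b = np.array([1.0, -1.0])
lo, hi = -2.0, 2.0
L = np.linalg.norm(A, 2)

def F(z):
    return A @ z + b

z = np.zeros(2)
eps = 1.0
for outer in range(30):                    # slow timescale: drive eps to zero
    gamma = eps / (L + eps) ** 2           # valid since F + eps*I has modulus eps
    for inner in range(5000):              # fast timescale: solve VI(K, F + eps*I)
        z_new = np.clip(z - gamma * (F(z) + eps * z), lo, hi)
        done = np.linalg.norm(z_new - z) < 1e-10
        z = z_new
        if done:
            break
    eps *= 0.5

residual = np.linalg.norm(z - np.clip(z - F(z), lo, hi))
print(residual)
```

Observe that the inner steplength γ shrinks with ε, so the inner solves become increasingly slow as ε → 0; this is the ill-posedness and synchronization burden that the single-timescale schemes below are designed to sidestep.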
Yet another shortcoming of Tikhonov methods is the need to obtain solutions
of an increasingly ill-posed problem. An alternative was proposed by Martinet [44]
by way of a proximal-point method (cf. [18, 45]). Given a scalar θ > 0, the mapping
F^k is redefined as

F^k(z) ≜ F(z) + θ(z − z^k),  (2.3)

where z^{k+1} solves VI(K, F^k). In 1976, Rockafellar [46] examined this scheme in
the context of maximal monotone operators [18] and discussed the convergence
of inexact variants as well as settings where θ may be updated across major
iterations. Finite termination of such schemes when a “weak sharpness” property
is imposed was examined by Ferris [47] while inexact versions were examined
by Burachik et al. [48]. More recently, proximal-point schemes have been used
effectively in a communications setting by Scutari et al. [49]. In this regime, the
authors effectively employ a best-response scheme by taking advantage of the
available analytical expressions for such a response. Finally, two references of note
are the recent papers by Nesterov [50] and Nemirovski [51] that provide iteration
bounds to reach a prescribed error level. Under the assumption of compact sets
and Lipschitzian monotone operators, the authors show that a variation of the
standard proximal-point scheme leads to convergent algorithms that require O(1/ε)
iterations for solutions with accuracy ε. Importantly, while the schemes have a
two-level structure, the lower level subproblem can be shown to be solvable in a
finite number of iterations.
Finally, Newton methods may also be employed for computing solutions of such
variational problems (see [18, Ch. 11] and [52]) and have been extremely successful
for the solution of variational inequalities and complementarity problems. The
PATH solver [53, 54] represents one of the few commercial solvers that can contend
with this class of problems. Interestingly, PATH implementations have incorporated
proximal-point techniques for regularizing the linear complementarity problems that
arise. The attraction of globalized Newton algorithms, such as PATH, lies in their
favorable global and local convergence properties. However, these methods require
solutions to a sequence of problems and distributed analogues remain difficult to
construct.
2.1.2 Motivating applications
Our work is motivated by a host of networked engineering-economic systems in
which players compete for shared resources while trying to maximize their respective
utility functions. These problems have been gaining increasing visibility in a host
of regimes. Yet, much of this work has been carried out in regimes where mappings
are strongly or strictly monotone and stringent requirements are imposed on user-specific algorithm parameters (such as steplengths). For instance, in communication
networks, motivated by models presented in a centralized optimization setting by
Kelly [55, 56], Srikant [57] amongst others, there has been a significant effort to
examine game-theoretic generalizations where users were assumed to be selfish. Led
principally by Başar and his collaborators, this line of research focused on developing
algorithms for distributed schemes for routing and flow control [12, 13, 58, 59] and
optical networks [60–62]. Notably, strong monotonicity assumptions are employed
in [12, 13, 58–60, 63, 64]. As a consequence, constant-steplength schemes are
employed in these regimes and require global specification of steplengths
across users; in [13], for instance, a central operator must specify user
steplengths. In fact, much of the preceding work in solving the associated
distributed control problem [65, 66] requires consistent steplengths across users. In
summary, many of the proposed schemes have largely focused on strongly or strictly
monotone Nash games and have often imposed strong requirements of consistency
of steplength choices across players.
2.1.3 Contributions
The overarching goal of the current work is the development of distributed,
limited-coordination, low-complexity, single-timescale schemes for computing Nash equilibria
when the games are characterized by monotone operators. Importantly, the schemes
should necessitate minimal specification of algorithm parameters across players, with
no reliance on problem parameters.
Motivated by these challenges, we consider a framework in which players make
gradient updates, allowing for constructing easily implementable single timescale
schemes. The basis for our set of schemes lies in the observation that the parameter
k in Tikhonov techniques and the centering parameter z k−1 in proximal methods is
updated once a good approximation to the subproblem solution is available. Instead,
we pursue an alternative where we update these parameters after every iteration,
rather than when a good solution is available. We refer to these extensions of
standard regularization/proximal methods as iterative regularization techniques, a
term borrowed from Konnov [67]. Such schemes are motivated by several advantages:
first, players can select algorithm parameters independently, with no reliance on problem
parameters; second, the schemes are far easier to implement since the complexity
of the scheme is restricted to solving a single projection problem at each point
in time; third, online implementations that require coordination in a networked
setting are far easier to manage, making them significantly attractive over schemes
that require solving complex subproblems prior to updating their decisions; and
fourth, these schemes are easily distributed providing an avenue for solving truly
large-scale networked game-theoretic problems. We provide a unified framework for
stating two different single-timescale schemes, namely the iterative Tikhonov scheme
and the iterative proximal scheme. Note that iterative regularization techniques
have been studied extensively in the realm of optimization problems. For instance,
Konnov [67] (also see [68]) provides an iterative Tikhonov scheme for VIs. Our
result on iterative Tikhonov regularization is original and is inspired by a result of Polyak [69]
for unconstrained optimization. The main contributions of this paper are the following:
(a) Iterative Tikhonov regularization (ITR) schemes Our first scheme
extends a standard Tikhonov regularization method to the single-timescale regime
by requiring that the regularization parameter be updated at every iteration. Under
suitable assumptions on the steplength and regularization parameter sequences,
the resulting sequence is shown to converge to a solution of the Nash game. In this
scheme, the jth player may independently choose steplength sequences {γ_j^k}
and regularization sequences {ε_j^k}, as specified by the update (denoted by p-ITR)

z_j^{k+1} = Π_{K_j}( z_j^k − γ_j^k ( F_j(z^k) + ε_j^k z_j^k ) ),  j = 1, …, N.  (p-ITR)
Convergence of this scheme can be shown to hold if a suitable coordination requirement across steplengths is met. In fact, this coordination requirement can be seen
to be readily satisfied without requiring that players communicate their choices.
Finally, we examine a multistep generalization of this scheme, denoted by r-ITR,
where a fixed, but finite, number of projection steps are taken between updates
of the regularization parameter. Importantly, all of these schemes require mere
monotonicity of the map and do not assume any compactness requirement on the
strategy sets.
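As a concrete (and entirely hypothetical) illustration of the (p-ITR) update, the sketch below applies it to a two-player game with a skew-symmetric, merely monotone affine map; the per-player integers n_j and the exponents α, β follow the feasible sequences established later in Lemma 4, and each player updates with its own γ_j^k and ε_j^k.

```python
import numpy as np

# A minimal sketch (ours, not the authors' code) of the (p-ITR) update on a
# two-player game with a merely monotone (skew-symmetric) affine map.  The
# integers n_j, the exponents alpha and beta, and the box sets are hypothetical.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])   # z^T A z = 0: monotone, not strongly so
b = np.array([1.0, -1.0])
lo, hi = -2.0, 2.0
alpha, beta = 0.25, 0.35                  # 1/2 < alpha + beta < 1 and beta > alpha
n = np.array([1.0, 4.0])                  # each player picks its own integer n_j

def F(z):
    return A @ z + b

z = np.zeros(2)
for k in range(1, 50_000):
    gamma = (k + n) ** -beta              # player-specific steplengths
    eps = (k + n) ** -alpha               # player-specific regularization
    z = np.clip(z - gamma * (F(z) + eps * z), lo, hi)

residual = np.linalg.norm(z - np.clip(z - F(z), lo, hi))
print(residual)
```

No player communicates its steplength or regularization choice to the other; convergence (slowly, since the regularization is driven to zero) follows from the coordination-free conditions of (A2).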
(b) Iterative proximal-point method An alternative to Tikhonov regularization methods is through the use of proximal-point methods. In the context
of strictly monotone Nash games, we introduce a single-timescale version of the
proximal-point method, termed as the iterative proximal-point or IPP method, in
which the centering parameter is updated at every iteration. We extend this to
a modified variant (referred to as MIPP) wherein the weighting parameter in the
prox-term is updated after every iteration. Analogous to ITR schemes, we extend
this to a partially coordinated version, referred to as p-IPP, where the jth player may
independently choose steplength sequences {γ_j^k} and regularization sequences
{θ_j^k}, as specified by

z_j^{k+1} = Π_{K_j}( z_j^k − γ_j^k ( F_j(z^k) + θ_j^k (z_j^k − z_j^{k−1}) ) ),  j = 1, …, N.  (p-IPP)
Finally, our scheme provides complete flexibility in parameter choice but requires
that players employ the prescribed framework. In game-theoretic regimes, players
are generally assumed to be selfish; consequently, if players were fully rational, they
would take best response steps. We obviate this challenge by working in a “bounded
rationality” setting where players rule out strategies of high complexity [70]. This
notion has its roots in the influential work by Simon [71]. In particular, his thesis
suggested that if reasoning and computation were considered to be costly, then
agents may not invest in these resources for marginal benefits. In the same vein,
we assume that solving a convex program may be computationally burdensome
for the agents. Instead, we assume that agents compute low-overhead gradient
steps associated with a regularized problem. Our work provides mathematical
substantiation for the claim that if players independently operate in this regime,
convergence of the overall scheme ensues.
This paper is divided into five sections. In section 2.2, we extend Tikhonov
regularization methods to the distributed single-timescale regime in the context
of monotone Nash games and examine both partially coordinated and multistep
generalizations. Section 2.3 presents a single-timescale iterative proximal point
method in partially coordinated regimes. The performance of the schemes is
examined on a set of networked Nash-Cournot games with nonlinear price functions
in section 2.4. The paper concludes with a short summary of our contributions in
section 2.5.
2.2 Iterative Tikhonov regularization schemes
In standard Tikhonov regularization schemes, one constructs a sequence of exact
(or inexact) solutions to well-posed regularized problems and the regularization
parameter is driven to zero at a slower timescale. In contrast, we consider a
class of iterative Tikhonov schemes in which the steplength and regularization
parameter are changed at the same rates. We proceed to show that this scheme
is indeed convergent. The convergence statement for general monotone mappings
also appears in [67] but does so without a proof and employs slightly different
assumptions. Inspired by a result for optimization problems stated in [69], we
construct a generalization to partially coordinated and multistep settings. The
next Lemma from [69] is employed in developing our convergence theory.
Lemma 1. Let u_{k+1} ≤ q_k u_k + α_k with α_k ≥ 0, ∑_{k=0}^∞ (1 − q_k) = ∞, and α_k/(1 − q_k) → 0 as k → ∞.
Suppose either (a) 0 ≤ q_k < 1 for all k, or (b) there exists a K̄ such that 0 < q_k < 1
for k ≥ K̄ and q_k < ∞ for all k ≤ K̄. Then lim_{k→∞} u_k ≤ 0, and if u_k > 0 for all k, then
lim_{k→∞} u_k = 0.

Proof. Under hypothesis (a), this result is proved in [69]. Under (b), we may
consider a shifted process starting from K̄ and the required result follows from (a).
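A quick numerical illustration of Lemma 1 (our own example, not from the text): with q_k = 1 − k^{−0.6} and α_k = k^{−1.4}, we have ∑(1 − q_k) = ∞ and α_k/(1 − q_k) = k^{−0.8} → 0, so the recursion, taken with equality, drives u_k to zero.

```python
# Numerical illustration of Lemma 1 on hypothetical sequences q_k and alpha_k.
u = 5.0
for k in range(1, 200_000):
    q = 1.0 - k ** -0.6       # 0 <= q_k < 1, and sum(1 - q_k) diverges
    a = k ** -1.4             # alpha_k / (1 - q_k) = k**-0.8 -> 0
    u = q * u + a             # recursion of Lemma 1, taken with equality
print(u)
```

After the initial condition is forgotten, u_k tracks the quasi-static level α_k/(1 − q_k) = k^{−0.8}, which is exactly the mechanism exploited in the convergence proofs below.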
Throughout this section, we make the following assumption on the mapping F .
Assumption 1. (A1) Suppose the mapping F corresponding to the game G is
monotone and Lipschitz continuous with constant L over a closed and convex set
K. Furthermore, suppose SOL(K, F ) is nonempty and bounded.
Our proof of convergence relies on relating the iterates of the proposed ITR
scheme to those of the original Tikhonov scheme. With a slight abuse of notation,
we let ε^k be a vector, ε^k = (ε_1^k, …, ε_N^k). Let z^* refer to the solution of
VI(K, F) and let y^k denote the solution of the regularized problem VI(K, F +
diag(ε^k)I), where the identity map, denoted by I, is defined as I(x) = x; note that
the jth component of this regularized mapping is given by F_j(x) + ε_j^k x_j. The following lemma
provides a bound between consecutive iterates of the standard Tikhonov scheme.
Note that (i) is reproduced from recent work by the second author [31].
Lemma 2. Suppose (A1) holds. Furthermore, consider the standard exact Tikhonov
scheme. Then, the following hold:

(i) The sequence {y^k} is bounded as per

‖y^k‖ ≤ (ε_max^k / ε_min^k) ‖x^*‖,  (2.4)

where x^* is a solution of VI(K, F). Furthermore, ‖y^k‖ ≤ M, where M ≜ c‖x^*‖
and ε_max^k / ε_min^k ≤ c holds for all k.

(ii) Furthermore, we have that ‖y^k − y^{k−1}‖ ≤ M √N (ε_max^{k−1} − ε_min^k) / ε_min^k.
The remainder of this section is organized as follows. In Section 2.2.1, we
present a partially coordinated generalization of ITR where players may choose
their steplengths and regularization sequences independently. In Section 2.2.2, we
introduce a multistep generalization of the ITR scheme where a fixed number of
projection steps are interspersed between updates of the regularization parameter.
2.2.1 Partially coordinated ITR schemes
An overarching goal of our research lies in developing truly distributed schemes that
do not require players to communicate choices of algorithm parameters. This has an
important outcome in that changes in problem parameters have no impact on the
structure of update rules associated with each agent. The ITR scheme presented
in the earlier section required that every player employs the same steplength
and regularization sequence. However, in this subsection, we consider a partially
coordinated algorithm where the jth player independently chooses a steplength
sequence {γ_j^k} and regularization sequence {ε_j^k}, leading to an update given by

z_j^{k+1} = Π_{K_j}( z_j^k − γ_j^k ( F_j(z^k) + ε_j^k z_j^k ) ),  j = 1, …, N.  (2.5)
Assumption 2. (A2) Define

γ_max^k ≜ max_j γ_j^k,  γ_min^k ≜ min_j γ_j^k,  ε_max^k ≜ max_j ε_j^k,  ε_min^k ≜ min_j ε_j^k.

The steplength and regularization sequences employed in (2.5), denoted by {γ_j^k} and
{ε_j^k} respectively, satisfy the following for j = 1, …, N:

(A2.1) ∑_{k=1}^∞ γ_j^k ε_j^k = ∞;

(A2.2) lim_{k→∞} (γ_max^k)² / (γ_min^k ε_min^k) = 0;

(A2.3) ∑_{k=1}^∞ (γ_j^k)² < ∞;

(A2.4) ∑_{k=1}^∞ (ε_j^k γ_j^k)² < ∞;

(A2.5) lim_{k→∞} (ε_max^{k−1} − ε_min^k) / ( γ_min^k (ε_min^k)² ) = 0;

(A2.6) lim_{k→∞} (γ_max^k − γ_min^k) / ( γ_min^k ε_min^k ) = 0;

(A2.7) lim_{k→∞} ε_j^k = 0,  ∀ j = 1, …, N.
Before proceeding, we comment on the choice of sequences. (A2.1) and (A2.4)
suggest that the product of the regularization and steplength sequences has to be
driven to zero but not too rapidly. In fact, these assumptions emphasize that these
sequences cannot be independently chosen. While (A2.3) articulates the square
summability requirements for steplength sequences, (A2.2) allows for claiming that
γ_min^k / ε_min^k → 0 and γ_max^k / ε_max^k → 0. In effect, the steplength needs to be driven to
zero faster than the regularization parameter. Since (A2.7) guarantees that for any
j, we have that ε_j^k → 0, from (A2.2) we can directly claim that γ_j^k → 0 as well.
Finally, (A2.5) and (A2.6) ensure that the disparity in steplength and regularization
sequences is driven to zero sufficiently fast, so as to avoid a buildup of error.
Our first proposition relates the iterate z^{k+1} to the iterate y^k associated with the
Tikhonov regularization scheme.

Proposition 1. Consider a Nash game G and suppose (A1) and (A2) hold. Furthermore, suppose {z^k} denotes the sequence constructed by (2.5). Then

‖z^{k+1} − y^k‖ ≤ q_k ‖z^k − y^{k−1}‖ + q_k M √N (ε_max^{k−1} − ε_min^k) / ε_min^k,

where q_k² ≜ (1 − γ_min^k ε_min^k)² + (γ_max^k)² L² + 2(γ_max^k − γ_min^k) L + 2(γ_max^k)² ε_max^k L.
Proof. Our proof relies on showing that the sequence constructed by the inexact ITR
scheme converges to that associated with the exact Tikhonov scheme. Throughout,
we denote the latter by {y^k}. Then, by the triangle inequality, we have that
‖z^{k+1} − z^*‖ ≤ ‖z^{k+1} − y^k‖ + ‖y^k − z^*‖. We proceed to examine ‖z^{k+1} − y^k‖². By
employing the definition of the iterates and leveraging the non-expansivity of the
projection, we obtain the following:

‖z^{k+1} − y^k‖² = ∑_{j=1}^N ‖z_j^{k+1} − y_j^k‖²
= ∑_{j=1}^N ‖Π_{K_j}(z_j^k − γ_j^k F_j(z^k) − γ_j^k ε_j^k z_j^k) − Π_{K_j}(y_j^k − γ_j^k F_j(y^k) − γ_j^k ε_j^k y_j^k)‖²
≤ ∑_{j=1}^N ‖(z_j^k − γ_j^k F_j(z^k) − γ_j^k ε_j^k z_j^k) − (y_j^k − γ_j^k F_j(y^k) − γ_j^k ε_j^k y_j^k)‖².

Next, we expand the terms on the right as shown below:

∑_{j=1}^N ‖(z_j^k − γ_j^k F_j(z^k) − γ_j^k ε_j^k z_j^k) − (y_j^k − γ_j^k F_j(y^k) − γ_j^k ε_j^k y_j^k)‖²
= ∑_{j=1}^N [ (1 − γ_j^k ε_j^k)² ‖z_j^k − y_j^k‖² + (γ_j^k)² ‖F_j(z^k) − F_j(y^k)‖² ]
  − ∑_{j=1}^N 2γ_j^k (1 − γ_j^k ε_j^k) (z_j^k − y_j^k)^T (F_j(z^k) − F_j(y^k))
≤ (1 − γ_min^k ε_min^k)² ‖z^k − y^k‖² + (γ_max^k)² ‖F(z^k) − F(y^k)‖²
  − ∑_{j=1}^N 2γ_j^k (1 − γ_j^k ε_j^k) (z_j^k − y_j^k)^T (F_j(z^k) − F_j(y^k)),

where the final sum is referred to as term 1. Of these, the first two terms are bounded by
((1 − γ_min^k ε_min^k)² + (γ_max^k)² L²) ‖z^k − y^k‖², and it remains to examine term 1.
Through expansion and by adding and subtracting terms, this expression can be rewritten as

term 1 = − ∑_{j=1}^N 2γ_j^k (z_j^k − y_j^k)^T (F_j(z^k) − F_j(y^k)) + ∑_{j=1}^N 2(γ_j^k)² ε_j^k (z_j^k − y_j^k)^T (F_j(z^k) − F_j(y^k))
≤ − ∑_{j=1}^N 2γ_j^k (z_j^k − y_j^k)^T (F_j(z^k) − F_j(y^k)) + 2(γ_max^k)² ε_max^k L ‖z^k − y^k‖²,

where the remaining sum is referred to as term 2. By adding and subtracting terms, term 2 can be bounded as

term 2 = − ∑_{j=1}^N 2γ_max^k (z_j^k − y_j^k)^T (F_j(z^k) − F_j(y^k)) − ∑_{j=1}^N 2(γ_j^k − γ_max^k) (z_j^k − y_j^k)^T (F_j(z^k) − F_j(y^k))
≤ − ∑_{j=1}^N 2(γ_j^k − γ_max^k) (z_j^k − y_j^k)^T (F_j(z^k) − F_j(y^k)),

where the inequality is a consequence of invoking the monotonicity of F. We now
simplify the final expression on the right by using Hölder's inequality and the
Lipschitz continuity of the mapping:

− ∑_{j=1}^N 2(γ_j^k − γ_max^k) (z_j^k − y_j^k)^T (F_j(z^k) − F_j(y^k))
≤ 2(γ_max^k − γ_min^k) | ∑_{j=1}^N (z_j^k − y_j^k)^T (F_j(z^k) − F_j(y^k)) |
≤ 2(γ_max^k − γ_min^k) ‖z^k − y^k‖ ‖F(z^k) − F(y^k)‖
≤ 2(γ_max^k − γ_min^k) L ‖z^k − y^k‖².

As a consequence, we have that

‖z^{k+1} − y^k‖² ≤ ( (1 − γ_min^k ε_min^k)² + (γ_max^k)² L² + 2(γ_max^k − γ_min^k) L + 2(γ_max^k)² ε_max^k L ) ‖z^k − y^k‖² = q_k² ‖z^k − y^k‖²,

where q_k² ≜ (1 − γ_min^k ε_min^k)² + (γ_max^k)² L² + 2(γ_max^k − γ_min^k) L + 2(γ_max^k)² ε_max^k L.
As a consequence,

‖z^{k+1} − y^k‖ ≤ q_k ‖z^k − y^{k−1}‖ + q_k ‖y^k − y^{k−1}‖ ≤ q_k ‖z^k − y^{k−1}‖ + q_k M √N (ε_max^{k−1} − ε_min^k) / ε_min^k,

where the first inequality follows from the triangle inequality and the second
inequality is a result of Lemma 2.
Using this result, we may now show the convergence of the p-ITR scheme.
Theorem 3 (Convergence of (p-ITR)). Consider the Nash game G and suppose
(A1) and (A2) hold. Furthermore, suppose {z k } denotes the sequence constructed
by (2.5). Then {z k } → z ∗ as k → ∞.
Proof. This requires the use of Lemma 1, which can be invoked if there exists a K
such that q_k < 1 for k ≥ K,

∑_{k=0}^∞ (1 − q_k) = ∞   and   lim_{k→∞} (q_k / (1 − q_k)) M √N (ε_max^{k−1} − ε_min^k) / ε_min^k = 0.

(i) First, we see that

q_k² = (1 − γ_min^k ε_min^k)² + (γ_max^k)² L² + 2(γ_max^k − γ_min^k) L + 2(γ_max^k)² ε_max^k L
= 1 − γ_min^k ε_min^k ( 2 − (γ_max^k)² L² / (γ_min^k ε_min^k) − 2(γ_max^k − γ_min^k) L / (γ_min^k ε_min^k) − γ_min^k ε_min^k − 2(γ_max^k)² ε_max^k L / (γ_min^k ε_min^k) ).

Since {γ^k} and {ε^k} are non-increasing sequences converging to zero, there exists a K_1
such that γ_min^k ε_min^k ≤ 1/4 for k ≥ K_1. By (A2.2), we have that lim_{k→∞} (γ_max^k)² L² / (γ_min^k ε_min^k) = 0,
implying that for k ≥ K_2, (γ_max^k)² L² / (γ_min^k ε_min^k) < 1/4. Furthermore, by (A2.6), there exists
a K_3 such that 2(γ_max^k − γ_min^k) L / (γ_min^k ε_min^k) ≤ 1/4 for k ≥ K_3. Since lim_{k→∞} (γ_max^k)² / (γ_min^k ε_min^k) = 0 and
lim_{k→∞} ε_max^k = 0, this implies that lim_{k→∞} 2(γ_max^k)² ε_max^k L / (γ_min^k ε_min^k) = 0 and there exists a K_4
such that 2(γ_max^k)² ε_max^k L / (γ_min^k ε_min^k) < 1/4 for k ≥ K_4. It follows that for k ≥ max(K_1, K_2, K_3, K_4),
we have that q_k² ≤ 1 − γ_min^k ε_min^k < 1, implying that q_k < 1 for sufficiently large k.

(ii) It is easily seen that

∑_{k=0}^∞ (1 − q_k) = ∑_{k=0}^∞ (1 − q_k²) / (1 + q_k)
= ∑_{k=0}^∞ γ_min^k ε_min^k ( 2 − (γ_max^k)² L² / (γ_min^k ε_min^k) − 2(γ_max^k − γ_min^k) L / (γ_min^k ε_min^k) − γ_min^k ε_min^k − 2(γ_max^k)² ε_max^k L / (γ_min^k ε_min^k) ) / (1 + q_k)
> (1/2) ∑_{k=0}^∞ γ_min^k ε_min^k ( 2 − (γ_max^k)² L² / (γ_min^k ε_min^k) − 2(γ_max^k − γ_min^k) L / (γ_min^k ε_min^k) − γ_min^k ε_min^k − 2(γ_max^k)² ε_max^k L / (γ_min^k ε_min^k) )
= (1/2) ∑_{k=0}^∞ ( 2γ_min^k ε_min^k − (γ_max^k)² L² − 2(γ_max^k − γ_min^k) L − (γ_min^k ε_min^k)² − 2(γ_max^k)² ε_max^k L )
= ∞,

where the first inequality follows from q_k < 1 and the final equality follows from
∑_{k=0}^∞ γ_min^k ε_min^k = ∞, ∑_{k=0}^∞ (γ_max^k − γ_min^k) < ∞, and the square summability of γ_max^k
and γ_min^k.

(iii) It suffices to show that

lim_{k→∞} (q_k / (1 − q_k)) M √N (ε_max^{k−1} − ε_min^k) / ε_min^k = 0.

Using q_k / (1 − q_k) = q_k (1 + q_k) / (1 − q_k²), it can be noted that

(q_k / (1 − q_k)) (ε_max^{k−1} − ε_min^k) / ε_min^k = ( q_k (1 + q_k) / Term 3 ) × Term 4,

where

Term 3 ≜ 2 − (γ_max^k)² L² / (γ_min^k ε_min^k) − 2(γ_max^k − γ_min^k) L / (γ_min^k ε_min^k) − γ_min^k ε_min^k − 2(γ_max^k)² ε_max^k L / (γ_min^k ε_min^k),
Term 4 ≜ (ε_max^{k−1} − ε_min^k) / ( γ_min^k (ε_min^k)² ).

Since γ_j^k, ε_j^k → 0 for all j, it follows that q_k → 1, whence q_k(1 + q_k) → 2. From assumption (A2), it can be
seen that Term 3 tends to 2 as k → ∞ (by an argument similar to the first claim). By
assumption (A2.5), Term 4 tends to zero as k → ∞.
We conclude this subsection with a lemma showing that such a feasible
choice of steplength and regularization sequences does indeed exist.

Lemma 4 (Feasible choice of sequences). Suppose the jth player employs a
steplength sequence given by γ_j^k = (k + n_j)^{−β}, where n_j is a positive integer for
j = 1, …, N, and let the regularization sequence be given by ε_j^k = (k + n_j)^{−α}, where
1/2 < α + β < 1 and β > α. Then {ε_j^k} and {γ_j^k} for j = 1, …, N satisfy (A2).
Proof. We begin by noting that (A2.1), (A2.3) and (A2.4) hold immediately upon
noting that

ε_max^k = (k + n_min)^{−α},  γ_max^k = (k + n_min)^{−β},  ε_min^k = (k + n_max)^{−α},  γ_min^k = (k + n_max)^{−β}.

(A2.2) holds by noting that

lim_{k→∞} (k + n_min)^{−2β} / (k + n_max)^{−(α+β)} = lim_{k→∞} (k + n_max)^{β+α} / (k + n_min)^{β+α} · lim_{k→∞} 1 / (k + n_min)^{β−α} = 1 × 0 = 0.

To prove (A2.5) (the shift from k − 1 to k in the numerator is asymptotically immaterial), consider

lim_{k→∞} (ε_max^k − ε_min^k) / ( γ_min^k (ε_min^k)² )
= lim_{k→∞} ( (k + n_min)^{−α} − (k + n_max)^{−α} ) / ( (k + n_max)^{−β} (k + n_max)^{−2α} )
= lim_{k→∞} ( (k + n_max) / (k + n_min) )^α · lim_{k→∞} ( 1 − ( (k + n_min)/(k + n_max) )^α ) / (k + n_max)^{−α−β}
= lim_{t→∞} ( 1 − (1 + (n_min − n_max)/t)^α ) / t^{−α−β},

where the first limit equals 1 and t = k + n_max. It is easy to see that the above expression is
of the form 0/0, and thus L'Hôpital's rule can be applied, allowing for the following
reduction:

lim_{t→∞} ( 1 − (1 + (n_min − n_max)/t)^α ) / t^{−α−β}
= lim_{t→∞} ( −α (1 + (n_min − n_max)/t)^{α−1} (n_max − n_min) / (−α − β) ) · lim_{t→∞} 1 / t^{1−α−β} = 0,

where the last equality is a consequence of α + β < 1. Finally, the expression in
(A2.6) may be written as follows:

(γ_max^k − γ_min^k) / (γ_min^k ε_min^k) = ( (k + n_min)^{−β} − (k + n_max)^{−β} ) / (k + n_max)^{−(α+β)}
= ( (k + n_min)^{−β} − (k + n_max)^{−β} ) / ( k^{−(α+β)} (1 + n_max/k)^{−(α+β)} )
≈ k^{−β} ( β(n_max − n_min)/k + O(1/k²) ) / ( k^{−(α+β)} (1 + n_max/k)^{−(α+β)} )
= ( β(n_max − n_min)/k + O(1/k²) ) / ( k^{−α} (1 + n_max/k)^{−(α+β)} )
= O(1/k^{1−α}).

It follows that lim_{k→∞} (γ_max^k − γ_min^k) / (γ_min^k ε_min^k) = 0, completing the proof.
Remark: We considered the development of distributed algorithms whose parameters could be independently chosen by players while still guaranteeing convergence.
If each player chooses a positive integer of his choice, then this integer defines
his steplength and regularization sequences. The previous result guarantees that
this set of choices satisfies (A2), and convergence follows. Importantly, players need
not communicate their choices, nor update them if the underlying problem
changes.
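The decay requirements of Lemma 4 are easy to check numerically. In the sketch below, the integers n_j, the exponents α = 0.25 and β = 0.35 (so that 1/2 < α + β < 1 and β > α), and the sample points are our own hypothetical choices; the three ratios appearing in (A2.2), (A2.5), and (A2.6) should all decay toward zero.

```python
import numpy as np

# Numerical sanity check (ours, not from the text) of the feasible sequences
# of Lemma 4: gamma_j^k = (k + n_j)**-beta, eps_j^k = (k + n_j)**-alpha.
alpha, beta = 0.25, 0.35
n = np.array([1.0, 3.0, 7.0])            # hypothetical per-player integers
n_min, n_max = n.min(), n.max()

def ratios(k):
    g_max, g_min = (k + n_min) ** -beta, (k + n_max) ** -beta
    e_max, e_min = (k + n_min) ** -alpha, (k + n_max) ** -alpha
    e_max_prev = (k - 1 + n_min) ** -alpha
    r_a22 = g_max**2 / (g_min * e_min)                 # ratio in (A2.2)
    r_a25 = (e_max_prev - e_min) / (g_min * e_min**2)  # ratio in (A2.5)
    r_a26 = (g_max - g_min) / (g_min * e_min)          # ratio in (A2.6)
    return np.array([r_a22, r_a25, r_a26])

early, late = ratios(10.0), ratios(10.0**7)
print(early, late)
```

Note that the (A2.2) ratio decays only like k^{α−β}, so very large k is needed before it becomes small; this slow decay is the price of fully uncoordinated parameter choices.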
2.2.2 A multistep ITR scheme
This subsection focuses on a scheme that represents a multistep generalization of
the standard ITR scheme. The standard ITR scheme requires that a single gradient
step is taken when the regularization parameter is updated. In contrast, a Tikhonov
scheme requires that, for a given ε^k, a solution of the coupled fixed-point problem

z_j^k = Π_{K_j}( z_j^k − γ^k ( F_j(z_j^k; z_{−j}^k) + ε^k z_j^k ) ),  j = 1, …, N,  (2.6)

be computed.
In fact, for a fixed regularization parameter, an infinite number of synchronized
gradient steps leads to a solution to the Tikhonov subproblem. As opposed to
solving this regularized problem, we consider a multistep ITR scheme that employs
a fixed number of projection steps with a fixed regularization parameter and
steplength. The basic notion is that the iterates generated by this scheme are closer
to the Tikhonov iterates, and the r-step ITR scheme can be mathematically stated
as follows:

z_j^{k,ℓ+1} = Π_{K_j}( z_j^{k,ℓ} − γ^k ( F_j(z^{k,ℓ}) + ε^k z_j^{k,ℓ} ) ),  ℓ = 0, …, r − 1;  j = 1, …, N,  (2.7)

z_j^{k+1} = z_j^{k,r},  j = 1, …, N.  (2.8)
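A hedged sketch of the r-step scheme (2.7)-(2.8) on hypothetical data: r projection steps are taken at a fixed pair (γ^k, ε^k) before the regularization parameter is updated, so the iterate hews more closely to the Tikhonov trajectory than a single-step update would.

```python
import numpy as np

# Illustrative sketch of the r-step ITR scheme; all problem data are hypothetical.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])   # merely monotone (skew-symmetric) map
b = np.array([1.0, -1.0])
lo, hi = -2.0, 2.0
alpha, beta, r = 0.25, 0.35, 5

def F(z):
    return A @ z + b

z = np.zeros(2)
for k in range(1, 20_000):
    gamma, eps = k ** -beta, k ** -alpha  # held fixed during the inner loop
    for _ in range(r):                    # r projection steps as in (2.7)
        z = np.clip(z - gamma * (F(z) + eps * z), lo, hi)
    # z now plays the role of z^{k+1} = z^{k,r} in (2.8)

residual = np.linalg.norm(z - np.clip(z - F(z), lo, hi))
print(residual)
```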
The convergence of this scheme can now be proved.
Proposition 2. Consider the Nash game G and suppose assumptions (A1) and
(A2) hold. Then z^k → z^* as k → ∞, where z^k is obtained via the multistep ITR
scheme (2.7)-(2.8) and z^* is a solution of VI(K, F).
Proof. By the triangle inequality, we have that

‖z^{k+1} − z^*‖ ≤ ‖z^{k+1} − y^k‖ + ‖y^k − z^*‖.

Of these, the first term on the right-hand side can be expressed as

‖z^{k+1} − y^k‖ ≤ ‖z^{k+1} − z^{k,r−1}‖ + … + ‖z^{k,1} − y^k‖.

Along the lines of Theorem 3, it can be deduced that

‖z^{k,ℓ+1} − y^k‖ ≤ q_k ‖z^{k,ℓ} − y^k‖,  implying that  ‖z^{k+1} − y^k‖ ≤ q̄_k ‖z^k − y^k‖,

where q̄_k = (q_k)^r and q_k = √( 1 − 2γ^k ε^k + (γ^k)² (L + ε^k)² ).

Convergence of the multistep ITR scheme follows if the hypotheses of Lemma 1
hold. First, we immediately have that q̄_k ≤ q_k < 1 for sufficiently large k, since
q_k < 1 for k ≥ K. It suffices to show that

∑_{k=1}^∞ (1 − q̄_k) = ∞   and   lim_{k→∞} (q̄_k / (1 − q̄_k)) M (ε^{k−1} − ε^k) / ε^k = 0.

Next, from Theorem 3, we have that ∑_{k=1}^∞ (1 − q̄_k) ≥ ∑_{k=1}^∞ (1 − q_k) = ∞. Furthermore,

(q̄_k / (1 − q̄_k)) M (ε^{k−1} − ε^k) / ε^k = ( q_k^{r−1} / (1 + q_k + … + q_k^{r−1}) ) × ( (q_k / (1 − q_k)) M (ε^{k−1} − ε^k) / ε^k ),

where we refer to the first factor as Term 1 and the second as Term 2. As k → ∞, it is clear that q_k → 1 and Term 1 → 1/r. It follows from Theorem 3
that Term 2 → 0. Therefore the second claim follows.
2.3 Iterative proximal point schemes
In this section, we consider an alternate technique for alleviating the absence of
strong monotonicity. This method uses a proximal term of the form θ(z − z^{k−1}),
instead of ε_k z, in modifying the map. Consequently, when such a method is applied
to a variational inequality VI(K, F), a sequence of iterates is constructed, each of
which requires the solution of a modified problem VI(K, F^k), where

F^k(z) ≜ F(z) + θ(z − z^{k−1}).

Note that for any γ > 0, the solution of each subproblem, given by z^k = SOL(K, F^k),
is given by the solution of the fixed-point problem

z^k = Π_K( z^k − γ ( F(z^k) + θ(z^k − z^{k−1}) ) ).
Under assumptions of convexity of the set K and monotonicity of F on K, the
convergence of the standard proximal-point algorithm has been established in [18,72].
An iterative proximal-point method of the form presented in this paper has been
employed within PATH implementations for more than a decade [53, 54]. In fact,
related avenues were also considered by Billups [73]. Remarkably, such proximal
strategies were seen to markedly improve the robustness of diverse algorithmic
schemes for solving complementarity problems.
The traditional proximal-point scheme suffers from a key drawback: it is essentially a two-timescale scheme requiring the solution of a variational inequality at
every iteration. In the spirit of the iterative Tikhonov regularization scheme, we
present a single-timescale iterative proximal-point (IPP) method in which the centering
term is updated from θ(z − z^{k−1}) to θ(z − z^k) after taking a single projection
step. In a game-theoretic generalization of this scheme, a projection step using
the deviation between the kth and (k−1)th iterates yields the (k+1)th iterate, and is
formally stated as

z_j^{k+1} = Π_{K_j}( z_j^k − γ^k ( F_j(z^k) + θ(z_j^k − z_j^{k−1}) ) ),  for j = 1, …, N.  (2.9)

An alternate view of this scheme is embodied by the following update rule:

z_j^{k+1} = Π_{K_j}( (1 − γ^k θ) z_j^k + γ^k θ z_j^{k−1} − γ^k F_j(z^k) ),  for j = 1, …, N.  (2.10)
In contrast with standard projection schemes of the form

z_j^{k+1} = Π_{K_j}( z_j^k − γ^k F_j(z^k) ),  for j = 1, …, N,  (2.11)

the proximal-point method uses a convex combination of z_j^k and z_j^{k−1}, given by
(1 − γ^k θ) z_j^k + γ^k θ z_j^{k−1}, instead of z_j^k. In Section 2.3.1, we present the convergence
theory for the IPP method. However, a challenge associated with this method is
that the parameter θ and the steplength γ^k need to be consistent across players.
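A minimal sketch of the IPP update (2.9), using a steplength sequence satisfying the summability conditions invoked in the convergence theory below; the strictly monotone affine map, the constant θ, and the steplength exponent are hypothetical choices of ours.

```python
import numpy as np

# Hedged sketch of the IPP update (2.9); all problem data are hypothetical.
A = np.array([[0.1, 1.0], [-1.0, 0.1]])   # z^T A z = 0.1*||z||^2: strictly monotone
b = np.array([1.0, -1.0])
lo, hi = -2.0, 2.0
theta = 0.5                               # common proximal weight across players

def F(z):
    return A @ z + b

z_prev = np.zeros(2)
z = np.zeros(2)
for k in range(1, 200_000):
    gamma = (k + 10.0) ** -0.6            # sum(gamma) = inf, sum(gamma^2) < inf
    z_new = np.clip(z - gamma * (F(z) + theta * (z - z_prev)), lo, hi)
    z_prev, z = z, z_new                  # centering term re-anchored every step

residual = np.linalg.norm(z - np.clip(z - F(z), lo, hi))
print(residual)
```

In contrast with the two-timescale proximal-point method, only one projection is performed per iteration; the centering point z^{k−1} simply trails the iterate by one step.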
We relax this requirement in Section 2.3.2, where a modified iterative proximal-point
scheme is examined:

z_j^{k+1} = Π_{K_j}( z_j^k − γ_j^k ( F_j(z^k) + θ_j^k (z_j^k − z_j^{k−1}) ) ),  for j = 1, …, N.  (2.12)

Here, γ_j^k and θ_j^k are chosen independently by every player from a family of choices.
Finally, IPP schemes and their generalizations rely on strict monotonicity of the
mapping and compactness of the set, as specified by the following assumption.
Assumption 3. (A3) Consider a Nash game given by G. Suppose the mapping F
corresponding to G is strictly monotone and Lipschitz continuous with constant L
over a compact convex set $K$. Moreover, $\max_{z\in K}\|z\| \leq C$ and $\max_{z\in K}\|F(z)\| \leq \beta$.
Note that compactness of $K$ and strict monotonicity of the mapping guarantee
that the solution set SOL$(K, F)$ is a singleton. Finally, our convergence theory
relies on the following result from [69].
Lemma 5. Let $u_{k+1} \leq (1+v_k)u_k + p_k$ with $u_k, v_k, p_k \geq 0$, $\sum_{k=0}^{\infty} v_k < \infty$, and $\sum_{k=0}^{\infty} p_k < \infty$. Then $\lim_{k\to\infty} u_k = \bar{u} \geq 0$.
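Lemma 5 can also be illustrated numerically: running the recursion at equality with summable sequences $v_k$ and $p_k$ (hypothetical choices below) produces a bounded, convergent sequence.

```python
import numpy as np

K = 200000
v = 1.0 / np.arange(1, K + 1) ** 2      # summable: sum of v_k equals pi^2 / 6
p = 1.0 / np.arange(1, K + 1) ** 1.5    # summable
u = np.empty(K + 1)
u[0] = 5.0
for k in range(K):
    u[k + 1] = (1.0 + v[k]) * u[k] + p[k]   # recursion run at equality
tail_variation = np.abs(np.diff(u[-1000:])).sum()   # near zero once u settles
```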
2.3.1 An iterative proximal point scheme
We begin by proving the global convergence of the IPP scheme under assumption
(A3).
Theorem 6. Consider the Nash game $G$ and suppose assumption (A3) holds. Let $\{z^k\}$ denote the sequence of iterates defined by the iterative proximal scheme (2.9). Let $\sum_{k=1}^{\infty} \gamma^k = \infty$ and $\sum_{k=1}^{\infty} (\gamma^k)^2 < \infty$. Then $\lim_{k\to\infty} z^k = z^*$.
Proof. We begin by expanding $\|z^{k+1} - z^*\|^2$ and by using the non-expansivity
property of the projection:
$$\begin{aligned} \|z^{k+1} - z^*\|^2 &= \|\Pi_K(z^k - \gamma^k(F(z^k) + \theta(z^k - z^{k-1}))) - \Pi_K(z^* - \gamma^k F(z^*))\|^2 \\ &\leq \|(z^k - z^*) - \gamma^k(F(z^k) - F(z^*)) - \gamma^k\theta(z^k - z^{k-1})\|^2. \end{aligned}$$
Expanding the right-hand side,
$$\begin{aligned} \|z^{k+1} - z^*\|^2 &\leq \|z^k - z^*\|^2 + (\gamma^k)^2\|F(z^k) - F(z^*)\|^2 + (\gamma^k\theta)^2\|z^k - z^{k-1}\|^2 \\ &\quad - 2\gamma^k (z^k - z^*)^T(F(z^k) - F(z^*)) \\ &\quad - 2\gamma^k\theta\, (z^k - z^{k-1})^T\!\left( (z^k - z^*) - \gamma^k(F(z^k) - F(z^*)) \right). \end{aligned} \tag{2.13}$$
Using the Lipschitz continuity and monotonicity of $F$, we have
$$\begin{aligned} \|z^{k+1} - z^*\|^2 &\leq (1 + (\gamma^k)^2 L^2)\|z^k - z^*\|^2 + (\gamma^k\theta)^2\|z^k - z^{k-1}\|^2 \\ &\quad \underbrace{-\, 2\gamma^k\theta\, (z^k - z^{k-1})^T\!\left( (z^k - z^*) - \gamma^k(F(z^k) - F(z^*)) \right)}_{\text{Term 1}}. \end{aligned}$$
Term 1 can be bounded from above by the use of the Cauchy-Schwarz inequality,
the boundedness of the iterates, namely $\|z^k - z^*\| \leq C$, and the Lipschitz continuity
of $F$, as shown next:
$$\begin{aligned} \|z^{k+1} - z^*\|^2 &\leq (1 + (\gamma^k)^2 L^2)\|z^k - z^*\|^2 + (\gamma^k\theta)^2\|z^k - z^{k-1}\|^2 \\ &\quad + 2\gamma^k\theta\,\|z^k - z^{k-1}\|\left( \|z^k - z^*\| + \gamma^k\|F(z^k) - F(z^*)\| \right) \\ &\leq (1 + (\gamma^k)^2 L^2)\|z^k - z^*\|^2 + (\gamma^k\theta)^2\|z^k - z^{k-1}\|^2 \\ &\quad + 2\gamma^k\theta C\|z^k - z^{k-1}\| + 2(\gamma^k)^2\theta L C\|z^k - z^{k-1}\|. \end{aligned}$$
Next, we derive a bound on $\|z^k - z^{k-1}\|$ by leveraging the non-expansivity of the
Euclidean projector:
$$\begin{aligned} \|z^k - z^{k-1}\| &= \|\Pi_K(z^{k-1} - \gamma^{k-1}(F(z^{k-1}) + \theta(z^{k-1} - z^{k-2}))) - \Pi_K(z^{k-1})\| \\ &\leq \|(z^{k-1} - \gamma^{k-1}(F(z^{k-1}) + \theta(z^{k-1} - z^{k-2}))) - z^{k-1}\| \\ &= \gamma^{k-1}\|F(z^{k-1}) + \theta(z^{k-1} - z^{k-2})\|. \end{aligned}$$
It follows from the boundedness of $K$ and the continuity of $F$ that there
exists a $\beta > 0$ such that $\|F(z)\| \leq \beta$ for all $z \in K$, implying that $\|z^k - z^{k-1}\| \leq \gamma^{k-1}(\beta + \theta C)$. The bound on $\|z^k - z^{k-1}\|$, together with the knowledge that $\gamma^k \leq \gamma^{k-1}$,
allows us to derive an upper bound on $\|z^{k+1} - z^*\|^2$:
$$\begin{aligned} \|z^{k+1} - z^*\|^2 &\leq (1 + (\gamma^k)^2 L^2)\|z^k - z^*\|^2 + (\gamma^k\theta)^2(\gamma^{k-1})^2(\beta + \theta C)^2 + 2\gamma^k\gamma^{k-1}\theta C(1 + \gamma^k L)(\beta + \theta C) \\ &\leq (1 + \underbrace{(\gamma^k)^2 L^2}_{\triangleq\, v_k})\|z^k - z^*\|^2 + \underbrace{(\gamma^{k-1}\theta)^2(\gamma^{k-1})^2(\beta + \theta C)^2 + 2(\gamma^{k-1})^2\theta C(1 + \gamma^{k-1} L)(\beta + \theta C)}_{\triangleq\, p_k}. \end{aligned}$$
The above relation can be compactly represented as the recursion
$$ u_{k+1} \leq (1 + v_k)u_k + p_k, \quad \text{where } \sum_{k=1}^{\infty} v_k = L^2\sum_{k=1}^{\infty}(\gamma^k)^2 < \infty \ \text{ and } \ \sum_{k=1}^{\infty} p_k < \infty, $$
the latter being a consequence of the square summability of $\{\gamma^k\}$. It follows from Lemma 5
that $u_k \to \bar{u} \geq 0$. It remains to show that $\bar{u} = 0$.
Recall from (2.13) that $\|z^{k+1} - z^*\|^2$ is bounded as per the following expression:
$$\|z^{k+1} - z^*\|^2 \leq (1 + v_k)\|z^k - z^*\|^2 + p_k - 2\gamma^k (z^k - z^*)^T(F(z^k) - F(z^*)).$$
Suppose $\bar{u} > 0$, so that $\|z^k - z^*\|^2 \to \bar{u} > 0$. It follows that the sequence $\mu_k \triangleq (z^k - z^*)^T(F(z^k) - F(z^*))$ satisfies $\mu_k \geq \mu' > 0$ for all $k$; this is a consequence of the strict monotonicity
of $F$, whereby $(F(z^k) - F(z^*))^T(z^k - z^*) \to 0$ only if $z^k \to z^*$. Then, by summing over all $k$, we obtain
$$\begin{aligned} \lim_{k\to\infty}\|z^{k+1} - z^*\|^2 &\leq \|z^1 - z^*\|^2 + \sum_{k=1}^{\infty}(\gamma^k)^2 L^2\|z^k - z^*\|^2 - 2\sum_{k=1}^{\infty}\gamma^k\mu_k + \sum_{k=1}^{\infty} p_k \\ &\leq \|z^1 - z^*\|^2 + \sum_{k=1}^{\infty}(\gamma^k)^2 L^2\|z^k - z^*\|^2 - 2\mu'\sum_{k=1}^{\infty}\gamma^k + \sum_{k=1}^{\infty} p_k \leq -\infty, \end{aligned}$$
where the last inequality follows from observing that $\sum_{k=0}^{\infty}(\gamma^k)^2 < \infty$, $\sum_{k=0}^{\infty}\gamma^k = \infty$,
and $\|z^k - z^*\| \leq C$. But this is a contradiction, implying that along some
subsequence, we have that $\mu_k \to 0$ and $\liminf_{k\to\infty}\|z^k - z^*\|^2 = 0$. But we know that
the sequence $\{\|z^k - z^*\|^2\}$ converges. Therefore, $\lim_{k\to\infty} z^k = z^*$. $\square$
Remark on IPP methods. It is worth considering how the results for the IPP
scheme compare with the ITR method as well as with more classical proximal
point methods. Clearly, ITR schemes can be applied to monotone problems with
unbounded strategy sets, while applying the IPP method requires both strict monotonicity and boundedness. General proximal-point schemes, however, require neither
strict monotonicity nor compactness. We conjecture that a possible strengthening
of our results may be possible via averaging techniques and will consider this in
future work.
2.3.2 Partially coordinated modified iterative proximal point schemes
In this section, we consider a modified IPP scheme in which agents use different
stepsizes and proximal parameter sequences at any iteration, as specified by the
following update rule:
$$ z_j^{k+1} = \Pi_{K_j}\!\left[ z_j^k - \gamma_j^k\left( F_j(z^k) + \theta_j^k(z_j^k - z_j^{k-1}) \right) \right], \quad \text{for } j = 1, \ldots, N, \tag{2.14} $$
where $\theta_j^k = c_j/\gamma_j^k$ and $c_j \in (0,1)$. It follows that this scheme can be
equivalently stated as
$$\begin{aligned} z_j^{k+1} &= \Pi_{K_j}\!\left[ z_j^k - \gamma_j^k F_j(z^k) - c_j(z_j^k - z_j^{k-1}) \right] \\ &= \Pi_{K_j}\!\left[ (1 - c_j)z_j^k + c_j z_j^{k-1} - \gamma_j^k F_j(z^k) \right]. \end{aligned}$$
Essentially, a projection step is carried out using a convex combination of $z_j^k$ and $z_j^{k-1}$.
Note that even the standard iterative proximal-point scheme uses such a combination;
however, in that setting the combination is of the form $(1-\gamma_j^k\theta_j^k)z_j^k + \gamma_j^k\theta_j^k z_j^{k-1}$,
and as one proceeds, $\gamma_j^k \to 0$ and one places less and less emphasis on the past. In
this setting, we use a fixed convex combination, specified by the parameter $c_j \in (0,1)$.
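The algebraic equivalence of the two forms of the step when $\theta_j^k = c_j/\gamma_j^k$ can be checked in a few lines; the vectors and parameter values below are hypothetical, and the projection is taken as the identity so only the arguments are compared.

```python
import numpy as np

# Raw MIPP step versus its convex-combination form, with theta_j = c_j / gamma_j.
rng = np.random.default_rng(0)
zj, zj_prev, Fj_val = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)
gamma_j, c_j = 0.05, 0.5
theta_j = c_j / gamma_j

raw_form = zj - gamma_j * (Fj_val + theta_j * (zj - zj_prev))
convex_form = (1.0 - c_j) * zj + c_j * zj_prev - gamma_j * Fj_val
```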
Prior to proving our main convergence statement, we prove an intermediate result
that is required for claiming convergence.
Lemma 7. Suppose $u_k \leq c\,u_{k-1} + \gamma^{k-1}\beta$, where $c \in (0,1)$, $\{\gamma^k\}$ is a decreasing
positive sequence with $\sum_{k=0}^{\infty}(\gamma^k)^2 < \infty$, and $0 \leq u_k \leq \bar{u} < \infty$ for all $k$. Then we
have that $\sum_{k=1}^{\infty}\gamma^k u_k < \infty$.
Proof. By assumption, we have that $u_j \leq c\,u_{j-1} + \gamma^{j-1}\beta$ for $j \geq 1$. Multiplying
this expression by $\gamma^{j-1}$ and summing over $j$, we obtain
$$\sum_{j=1}^{k}\gamma^{j-1}u_j \leq c\sum_{j=0}^{k-1}\gamma^j u_j + \beta\sum_{j=0}^{k-1}(\gamma^j)^2.$$
But $\gamma^j \leq \gamma^{j-1}$ for all $j$, implying that
$$\sum_{j=1}^{k}\gamma^j u_j \leq \sum_{j=1}^{k}\gamma^{j-1}u_j \leq c\sum_{j=0}^{k-1}\gamma^j u_j + \beta\sum_{j=0}^{k-1}(\gamma^j)^2$$
$$\implies (1-c)\sum_{j=1}^{k-1}\gamma^j u_j + \gamma^k u_k \leq c\gamma^0 u_0 + \beta\sum_{j=0}^{k-1}(\gamma^j)^2.$$
By taking limits, it follows that
$$(1-c)\lim_{k\to\infty}\sum_{j=1}^{k-1}\gamma^j u_j + \underbrace{\lim_{k\to\infty}\gamma^k u_k}_{=0} \leq c\gamma^0 u_0 + \beta\lim_{k\to\infty}\sum_{j=0}^{k-1}(\gamma^j)^2.$$
Consequently, $\lim_{k\to\infty}\sum_{j=1}^{k-1}\gamma^j u_j < \infty$, since $\sum_{j=0}^{\infty}(\gamma^j)^2 < \infty$, $\lim_{k\to\infty}\gamma^k = 0$, and
$u_k$ is uniformly bounded by $\bar{u}$. $\square$
Proposition 3. Consider a Nash game $G$ and suppose assumption (A3) holds. Let
$\{z_j^k\}$ denote the sequence of iterates defined by the partially coordinated MIPP scheme
given by (2.14). Moreover, let $\gamma_j^k\theta_j^k = c_j$ for $j = 1, \ldots, N$ and for all $k \geq 0$, where
$c_j \in (0,1)$ for $j = 1, \ldots, N$. Furthermore, let the following hold:
$$\sum_{k=1}^{\infty}\gamma_{\max}^k = \infty, \qquad \sum_{k=1}^{\infty}(\gamma_{\max}^k)^2 < \infty, \qquad \text{and} \qquad \sum_{k=1}^{\infty}(\gamma_{\max}^k - \gamma_{\min}^k) < \infty.$$
Then $\lim_{k\to\infty} z_j^k = z_j^*$ for $j = 1, \ldots, N$.
Proof. We begin by observing that $\|z_j^k - z_j^{k-1}\|$ can now be bounded as follows:
$$\begin{aligned} \|z_j^k - z_j^{k-1}\| &\leq \|\Pi_{K_j}(z_j^{k-1} - \gamma_j^{k-1}(F_j(z^{k-1}) + \theta_j^{k-1}(z_j^{k-1} - z_j^{k-2}))) - \Pi_{K_j}(z_j^{k-1})\| \\ &\leq \gamma_j^{k-1}\|F_j(z^{k-1}) + \theta_j^{k-1}(z_j^{k-1} - z_j^{k-2})\| \\ &\leq \gamma_j^{k-1}\|F_j(z^{k-1})\| + \gamma_j^{k-1}\theta_j^{k-1}\|z_j^{k-1} - z_j^{k-2}\| = \gamma_j^{k-1}\|F_j(z^{k-1})\| + c_j\|z_j^{k-1} - z_j^{k-2}\|. \end{aligned}$$
It follows that
$$\|z^k - z^{k-1}\| \leq \gamma_{\max}^{k-1}\|F(z^{k-1})\| + c_{\max}\|z^{k-1} - z^{k-2}\|.$$
This relation is of the form $u_{k+1} \leq q_k u_k + \alpha_k$, where $q_k = \max_j \gamma_j^{k-1}\theta_j^{k-1} = c_{\max}$,
$\alpha_k = \gamma_{\max}^{k-1}\beta$, and $\|F(z^k)\| \leq \beta$. Furthermore, the boundedness of $K$ allows us
to claim that $\{u_k\}$ is a bounded sequence. Therefore, we have that
$$\sum_{k=1}^{\infty}(1 - q_k) = \infty \qquad \text{and} \qquad \lim_{k\to\infty}\frac{\alpha_k}{1-q_k} = \lim_{k\to\infty}\frac{\alpha_k}{1-c_{\max}} = 0.$$
Therefore, from Lemma 1,
$$\lim_{k\to\infty}\|z^k - z^{k-1}\| = 0.$$
Let $u_t^k = z_t^k - z_t^{k-1}$ and $u^k = \sum_t (u_t^k)^2$, where $t$ indexes an element of the vector $z^k$.
It is clear that $u^k$ and $u_t^k$ converge to zero. Therefore, for every $\delta > 0$ there exists a $K$ such that for $k_1 \geq K$
and $k_2 > k_1$, we have $|z_t^{k_2} - z_t^{k_1}| \leq \delta$. This implies that $\{z_t^k\}$ is a Cauchy sequence and is therefore convergent, and hence $z^k$
converges. Since $z^*$ is bounded and fixed, it is clear that $\|z^k - z^*\|$ converges to some
$\bar{u} \geq 0$. It suffices to show that $\bar{u} = 0$.
We proceed by contradiction and assume that $\bar{u} > 0$. By the definition of the iterates
and leveraging properties of the projection operator, we have
$$\begin{aligned} \|z^{k+1} - z^*\|^2 &= \sum_{j=1}^{N}\|z_j^{k+1} - z_j^*\|^2 \\ &\leq \sum_{j=1}^{N}\|(z_j^k - \gamma_j^k(F_j(z^k) + \theta_j^k(z_j^k - z_j^{k-1}))) - (z_j^* - \gamma_j^k F_j(z^*))\|^2 \\ &= \sum_{j=1}^{N}\|(1-\gamma_j^k\theta_j^k)(z_j^k - z_j^*) + \gamma_j^k\theta_j^k(z_j^{k-1} - z_j^*) - \gamma_j^k(F_j(z^k) - F_j(z^*))\|^2. \end{aligned}$$
By expanding the expression on the right, we have
$$\begin{aligned} &\underbrace{\sum_{j=1}^{N}(1-\gamma_j^k\theta_j^k)^2\|z_j^k - z_j^*\|^2}_{\text{Term 1}} + \underbrace{\sum_{j=1}^{N}(\gamma_j^k)^2\|F_j(z^k) - F_j(z^*)\|^2}_{\text{Term 2}} + \underbrace{\sum_{j=1}^{N}(\gamma_j^k\theta_j^k)^2\|z_j^{k-1} - z_j^*\|^2}_{\text{Term 3}} \\ &+ \underbrace{2\sum_{j=1}^{N}\gamma_j^k\theta_j^k(1-\gamma_j^k\theta_j^k)(z_j^k - z_j^*)^T(z_j^{k-1} - z_j^*)}_{\text{Term 4}} - \underbrace{2\sum_{j=1}^{N}(\gamma_j^k)^2\theta_j^k(F_j(z^k) - F_j(z^*))^T(z_j^{k-1} - z_j^*)}_{\text{Term 5}} \\ &- 2\sum_{j=1}^{N}\gamma_j^k(1-\gamma_j^k\theta_j^k)(z_j^k - z_j^*)^T(F_j(z^k) - F_j(z^*)). \end{aligned}$$
By recalling that $\gamma_j^k\theta_j^k = c_j$ and by expressing $\gamma_j^k$ as $(\gamma_j^k - \gamma_{\max}^k) + \gamma_{\max}^k$, we
may combine Terms 1, 2, and 3 and Terms 4 and 5 as in the proof of the previous
proposition, yielding the following:
$$\begin{aligned} \sum_{j=1}^{N}\|z_j^{k+1} - z_j^*\|^2 &\leq \sum_{j=1}^{N}\left( (1-c_j)\|z_j^k - z_j^*\| + c_j\|z_j^{k-1} - z_j^*\| \right)^2 - \underbrace{2\gamma_{\max}^k\sum_{j=1}^{N}(z_j^k - z_j^*)^T(F_j(z^k) - F_j(z^*))}_{\text{Term 6}} \\ &\quad + \underbrace{2\sum_{j=1}^{N}\gamma_j^k c_j\|F_j(z^k) - F_j(z^*)\|\,\|z_j^k - z_j^{k-1}\| + \sum_{j=1}^{N}(\gamma_j^k)^2\|F_j(z^k) - F_j(z^*)\|^2}_{\triangleq\, d_k} \\ &\quad + \underbrace{2\sum_{j=1}^{N}(\gamma_{\max}^k - \gamma_j^k)\|z_j^k - z_j^*\|\,\|F_j(z^k) - F_j(z^*)\|}_{\triangleq\, e_k}. \end{aligned}$$
Consequently,
$$\sum_{j=1}^{N}\|z_j^{k+1} - z_j^*\|^2 - \sum_{j=1}^{N}\left( (1-c_j)\|z_j^k - z_j^*\| + c_j\|z_j^{k-1} - z_j^*\| \right)^2 \leq -2\gamma_{\max}^k(z^k - z^*)^T(F(z^k) - F(z^*)) + d_k + e_k.$$
Rearranging the left-hand side and summing over $k = 1, \ldots, K$, we obtain the following:
$$\begin{aligned} &\sum_{j=1}^{N}c_j(1-c_j)\sum_{k=1}^{K}\left( \|z_j^{k-1} - z_j^*\| - \|z_j^k - z_j^*\| \right)^2 - \sum_{j=1}^{N}c_j\left( \|z_j^0 - z_j^*\|^2 - \|z_j^1 - z_j^*\|^2 \right) \\ &\quad + \sum_{j=1}^{N}\left( c_j\|z_j^K - z_j^*\|^2 + \|z_j^{K+1} - z_j^*\|^2 \right) \leq \sum_{k=1}^{K}\Big( \underbrace{-2\gamma_{\max}^k(z^k - z^*)^T(F(z^k) - F(z^*))}_{-\gamma_{\max}^k\mu_k} + d_k + e_k \Big). \end{aligned} \tag{2.15}$$
Since $0 < c_j < 1$ for $j = 1, \ldots, N$, it can be observed that
$$\begin{aligned} &2\sum_{j=1}^{N}\gamma_j^k c_j\|F_j(z^k) - F_j(z^*)\|\,\|z_j^k - z_j^{k-1}\| + \sum_{j=1}^{N}(\gamma_j^k)^2\|F_j(z^k) - F_j(z^*)\|^2 \\ &\leq 2\beta c_{\max}\sum_{j=1}^{N}\gamma_j^k\|z_j^k - z_j^{k-1}\| + (\gamma_{\max}^k)^2\sum_{j=1}^{N}\|F_j(z^k) - F_j(z^*)\|^2 \\ &\leq 2\beta c_{\max}\sum_{j=1}^{N}\gamma_j^k\|z_j^k - z_j^{k-1}\| + (\gamma_{\max}^k)^2\,2\beta^2 < \infty, \end{aligned}$$
where the last inequality follows from Lemma 7, implying that $\sum_{k=1}^{\infty} d_k < \infty$.
We may also conclude that $\sum_{k=1}^{\infty} e_k < \infty$ by noting that
$$\sum_{k=1}^{\infty} e_k = \sum_{k=1}^{\infty} 2\sum_{j=1}^{N}(\gamma_{\max}^k - \gamma_j^k)\|z_j^k - z_j^*\|\,\|F_j(z^k) - F_j(z^*)\| \leq 8\beta C N\sum_{k=1}^{\infty}(\gamma_{\max}^k - \gamma_{\min}^k) < \infty,$$
where the last inequality follows by assumption. Taking limits in (2.15), we have
$$\begin{aligned} &\sum_{k=1}^{\infty}\sum_{j=1}^{N}c_j(1-c_j)\left( \|z_j^k - z_j^*\| - \|z_j^{k-1} - z_j^*\| \right)^2 - \sum_{j=1}^{N}c_j\left( \|z_j^0 - z_j^*\|^2 - \|z_j^1 - z_j^*\|^2 \right) \\ &\quad + \lim_{k\to\infty}\sum_{j=1}^{N}\|z_j^{k+1} - z_j^*\|^2 + \lim_{k\to\infty}\sum_{j=1}^{N}c_j\|z_j^k - z_j^*\|^2 \leq \sum_{k=1}^{\infty}\left( d_k + e_k - \gamma_{\max}^k\mu_k \right). \end{aligned}$$
Since $\bar{u} > 0$, it follows that $\mu_k = 2(z^k - z^*)^T(F(z^k) - F(z^*)) \geq \mu' > 0$ for all $k$. This is a consequence of the strict monotonicity
of $F$, which ensures that $(F(z^k) - F(z^*))^T(z^k - z^*) > 0$ for $z^k \neq z^*$ with $z^k, z^* \in K$.
This further implies that $\sum_k \gamma_{\max}^k\mu_k = \infty$, which, together with the boundedness of $z^k$ and the summability of $d_k$ and $e_k$,
leads to the conclusion that
$$\sum_{k=1}^{\infty}\sum_{j=1}^{N}c_j(1-c_j)\left( \|z_j^k - z_j^*\| - \|z_j^{k-1} - z_j^*\| \right)^2 \leq -\infty.$$
But this contradicts the nonnegativity of the left-hand side. Therefore,
along some subsequence, we have that $\mu_k \to 0$ and $\liminf_{k\to\infty}\|z^k - z^*\|^2 = 0$; that is, a
subsequence of $\{z^k\}$ converges to $z^*$. But we know that the entire sequence $\{z^k\}$
converges, and therefore it follows that $\lim_{k\to\infty} z^k = z^*$. $\square$
Remark: The final result ensures that the $j$th agent, for $j = 1, \ldots, N$, can
independently select the parameter $c_j \in (0,1)$, the steplength sequence $\gamma_j^k = (k+n_j)^{-\alpha}$, and the
proximal parameter sequence $\theta_j^k = c_j/\gamma_j^k$. Notably, these choices are independent
of problem parameters and do not require coordination across players.
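The parameter choices named in the remark can be generated locally by each agent; the offset $n_j$ and the constants below are hypothetical illustrations.

```python
import numpy as np

def agent_parameters(n_j, alpha, c_j, num_iters):
    """Per-agent sequences gamma_j^k = (k + n_j)^(-alpha), theta_j^k = c_j / gamma_j^k."""
    k = np.arange(1, num_iters + 1)
    gamma = (k + n_j) ** (-alpha)     # decreasing; square summable when alpha > 1/2
    theta = c_j / gamma               # keeps gamma_j^k * theta_j^k = c_j at every k
    return gamma, theta

gamma, theta = agent_parameters(n_j=5, alpha=0.51, c_j=0.5, num_iters=10000)
```

No cross-agent information is needed: each agent only requires its own $n_j$, $\alpha$, and $c_j$.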
2.4 Numerical results
In this section, we examine the behavior of the proposed schemes on a networked
Nash-Cournot game. The game is played amongst a set of firms, denoted by $\mathcal{J}$,
that compete over a network of nodes, denoted by $\mathcal{N}$, where $|\mathcal{J}| = J$ and $|\mathcal{N}| = N$.
Production and sales of firm $j \in \mathcal{J}$ at node $i$ are denoted by $y_{ij}$ and $s_{ij}$, respectively.
Furthermore, the cost of production faced by firm $j$ at node $i$ is assumed to be
linear and is denoted by $C_{ij}$. For simplicity, we assume that transportation
costs are zero and that generation is bounded by physical capacity constraints. Finally,
the total sales by a firm across all nodes are required to equal the total generation
across all nodes. The $j$th firm's parameterized optimization problem is represented by
$$\begin{aligned} \text{maximize} \quad & \sum_{i\in\mathcal{N}}\left( p_i(S_i)s_{ij} - C_{ij}(y_{ij}) \right) \\ \text{subject to} \quad & y_{ij} \leq \mathrm{cap}_{ij}, \quad \forall i \in \mathcal{N} \\ & \sum_{i\in\mathcal{N}} y_{ij} = \sum_{i\in\mathcal{N}} s_{ij} \\ & y_{ij}, s_{ij} \geq 0, \quad \forall i \in \mathcal{N} \end{aligned} \Bigg\}\; \triangleq\; K_j,$$
where the nodal price function at node $i$, denoted by $p_i(S_i)$, is defined as $p_i(S_i) \triangleq a_i - b_i S_i^{\sigma}$, with $S_i \triangleq \sum_{j\in\mathcal{J}} s_{ij}$, $a_i, b_i$ positive scalars, and $\sigma \geq 1$. The convexity of each
player's objective follows from noting that the function $p_i(S_i)s_{ij}$ is concave in $s_{ij}$:
for all $i$ and $j$, we have $-\nabla^2_{s_{ij}}\left( p_i(S_i)s_{ij} \right) = 2b_i\sigma S_i^{\sigma-1} + b_i\sigma(\sigma-1)s_{ij}S_i^{\sigma-2} \geq 0$, since $S_i$ and $s_{ij}$ are nonnegative and $\sigma \geq 1$. The convexity of the
objectives allows for constructing sufficient equilibrium conditions, succinctly stated
as a variational inequality problem. If $x_{\bullet j} \triangleq \{x_{1j}, \ldots, x_{Nj}\}$, then these conditions
are given by VI$(K, F)$, where
$$K \triangleq \prod_{j=1}^{J} K_j, \qquad F(s,y) \triangleq \begin{pmatrix} F_1(s, y_1) \\ \vdots \\ F_N(s, y_J) \end{pmatrix}, \qquad F_j \triangleq \begin{pmatrix} F_j^s(s) \\ F_j^y(y_{\bullet j}) \end{pmatrix},$$
$$F_j^s = \begin{pmatrix} b_1 S_1^{\sigma} + b_1\sigma s_{1j}S_1^{\sigma-1} - a_1 \\ \vdots \\ b_N S_N^{\sigma} + b_N\sigma s_{Nj}S_N^{\sigma-1} - a_N \end{pmatrix}, \quad \text{and} \quad F_j^y = \begin{pmatrix} C'_{1j}(y_{1j}) \\ \vdots \\ C'_{Nj}(y_{Nj}) \end{pmatrix} \quad \text{for } j = 1, \ldots, J.$$
Monotonicity of the mapping follows from noting that a symmetric permutation of
the Jacobian is a block-diagonal matrix with the $i$th block denoted by $B_i$, defined
as
$$B_i = \begin{pmatrix} A_i & 0 \\ 0 & 0 \end{pmatrix}, \qquad \text{where } A_i \triangleq b_i\sigma S_i^{\sigma-1}ee^T + b_i\sigma(\sigma-1)S_i^{\sigma-2}\underbrace{\begin{pmatrix} \frac{S_i}{\sigma-1} + s_{i1} & \cdots & s_{i1} \\ \vdots & \ddots & \vdots \\ s_{iJ} & \cdots & \frac{S_i}{\sigma-1} + s_{iJ} \end{pmatrix}}_{\triangleq\, H_i}.$$
It suffices to show that for all $i$, the matrix $A_i$ is positive semidefinite. If $S_{i,-j} \triangleq \sum_{k\neq j} s_{ik}$, then positive semidefiniteness of $A_i$ is implied by the diagonal dominance
of $\frac{1}{2}(H_i + H_i^T)$. This is seen to immediately hold when $1 < \sigma \leq 3$ and $J \leq \frac{3\sigma-1}{\sigma-1}$, as
the following expression reveals:
$$\frac{1}{2}(H_i + H_i^T) = \begin{pmatrix} \frac{\sigma}{\sigma-1}s_{i1} + \frac{S_{i,-1}}{\sigma-1} & \frac{1}{2}(s_{i1} + s_{i2}) & \cdots & \frac{1}{2}(s_{i1} + s_{iJ}) \\ \frac{1}{2}(s_{i2} + s_{i1}) & \frac{\sigma}{\sigma-1}s_{i2} + \frac{S_{i,-2}}{\sigma-1} & \cdots & \frac{1}{2}(s_{i2} + s_{iJ}) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{2}(s_{iJ} + s_{i1}) & \cdots & \cdots & \frac{\sigma}{\sigma-1}s_{iJ} + \frac{S_{i,-J}}{\sigma-1} \end{pmatrix}.$$
Note that the solvability of the associated variational inequality is a consequence
of the compactness of K and the single-valuedness of the mapping. While the variational inequality problem VI(K, F ) cannot be decomposed nodally, a distributed
gradient scheme may be proposed.
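To make the distributed structure concrete, the following sketch runs a projected-gradient loop on the sales variables alone, with hypothetical uniform parameters; the generation variables and the sales-balance constraint are dropped, so each $K_j$ reduces to a box and the Euclidean projection is a componentwise clip. This illustrates the structure of the scheme, not the experimental setup of Section 2.4.1.

```python
import numpy as np

# Hypothetical uniform problem data (not the randomized instances used below).
N_nodes, J_firms, sigma = 3, 3, 1.4
a = np.full(N_nodes, 400.0)                  # price intercepts a_i
b = np.full(N_nodes, 0.02)                   # price slopes b_i
cap = np.full((N_nodes, J_firms), 250.0)     # sales bounded by capacity

def F_sales(s):
    """Components b_i S_i^sigma + b_i sigma s_ij S_i^(sigma-1) - a_i of F_j^s."""
    S = s.sum(axis=1)                        # aggregate sales per node
    return (b * S**sigma)[:, None] \
        + b[:, None] * sigma * s * (S**(sigma - 1.0))[:, None] - a[:, None]

s = np.full((N_nodes, J_firms), 10.0)
for k in range(1, 3001):
    gamma = 5.0 / k**0.51
    s = np.clip(s - gamma * F_sales(s), 0.0, cap)   # each firm projects its own column
residual = np.linalg.norm(s - np.clip(s - F_sales(s), 0.0, cap))
```

In this simplified instance the price intercepts are high enough that every firm's capacity constraint binds, so the iterates settle at the caps and the natural-map residual vanishes.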
2.4.1 Problem parameters and termination criteria
In our test problem set, the intercepts ai are maintained at 400 across all the nodes
while the slopes bi are drawn from a truncated normal distribution specified as
bi = max(N (0.02, 0.005), 0) where N (µ, σ 2 ) denotes a normal distribution with mean
µ and variance σ 2 . Similarly, the generation capacities and costs are nonnegative
Table 2.1. ITR: Majors (J = 3, σ = 1.4), varying α in ε^k = k^(−α)

Nodes | α = 0.3 | 0.35 | 0.4  | 0.45
3     | 5912    | 1711 | 676  | 330
4     | 12204   | 3184 | 1164 | 533
5     | 9604    | 2592 | 972  | 454
6     | 23145   | 5509 | 1878 | 814
7     | 43030   | 9373 | 2989 | 1229

Table 2.2. ITR: Majors (J = 5), varying σ

Nodes | σ = 1.2 | 1.3  | 1.4  | 1.5
4     | 743     | 617  | 8918 | 16293
5     | 832     | 694  | 9870 | 16537
6     | 1369    | 1142 | 8452 | 33614
and are also drawn from truncated normal distributions $\max(N(250,50),0)$ and
$\max(N(2,1),0.6)$, respectively. The set of test problems comprises a collection
of forty-five problems in which the number of nodes varies from 3 to 7 (unless stated
otherwise) in steps of 1 and the number of firms varies from 3 to 7 (unless stated otherwise) in steps of 2.
The exponent in the price functions (namely $\sigma$) is also varied from 1.2 to 1.4 in
steps of 0.1. In all of our tests, the termination criterion is $\|F_K^{\mathrm{nat}}(z)\| \leq \delta$,
where $F_K^{\mathrm{nat}}(z) \triangleq z - \Pi_K(z - F(z))$ and $\delta = $ 1e-1.
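The termination residual can be computed in a few lines; the one-dimensional map and set below are illustrative stand-ins.

```python
import numpy as np

def natural_map_residual(z, F, project):
    """||F_K^nat(z)|| where F_K^nat(z) = z - Pi_K(z - F(z)); it vanishes exactly
    at solutions of VI(K, F)."""
    return np.linalg.norm(z - project(z - F(z)))

# Sanity check on F(z) = z - 1 over K = [0, 2], whose unique solution is z* = 1.
F = lambda z: z - 1.0
project = lambda z: np.clip(z, 0.0, 2.0)
res_at_solution = natural_map_residual(np.array([1.0]), F, project)
res_elsewhere = natural_map_residual(np.array([0.0]), F, project)
```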
2.4.2 Fully coordinated schemes
In this subsection, we examine fully coordinated ITR and IPP schemes. For the ITR scheme, the initial
regularization level is $\epsilon^0 = $ 1e-2, while $\alpha$ and $\beta$ are set to 0.48
and 0.51, respectively. The values of $\theta$ and $\alpha$ are taken to be
0.2 and 0.51, respectively, for the IPP scheme. We maintain $c$ at 0.5 and $\alpha$ at 0.51
in the MIPP scheme. The value of the initial steplength $\gamma^0$ is taken to be 5.
The iterative Tikhonov regularization (ITR) techniques were tested on the
prescribed set of test problems. Table 2.1 compares the performance of the scheme
when $\epsilon^k$, given by $k^{-\alpha}$, is driven to zero at different rates. It is seen that maintaining
$\alpha$ close to $\tfrac{1}{2}$ ensures that the scheme performs significantly better than when $\alpha$ is
kept small. As Table 2.2 reveals, increasing the nonlinearity of the problem by
raising $\sigma$ leads to a significant growth in effort.
Iterative proximal point methods enjoy several key differences from analogous
Tikhonov schemes. The absence of a regularization sequence appears to allow for
faster convergence, and they perform on par with the best performance of Tikhonov
schemes, namely when $\epsilon^k = \epsilon^0/k^{\alpha}$. Notably, smaller values of $\theta$ allow for significantly
faster convergence. The IPP and MIPP schemes are examined for varying θ and c
respectively on a five firm problem and the results are reported in Table 2.3. It is
seen that the performance of both schemes deteriorates with increasing values
Table 2.3. IPP and MIPP: Majors (J = 5, σ = 1.3)

      | IPP: varying θ        | MIPP: varying c
Nodes | 0.4  0.6  0.8  1.0    | 0.1  0.3  0.5  0.7  0.9
3     | 67   88   112  135    | 34   54   97   230  1519
4     | 79   87   112  136    | 45   68   110  242  1653
5     | 66   86   109  136    | 40   63   104  242  1643
6     | 77   88   109  133    | 44   70   126  305  2098
7     | 82   87   115  136    | 47   73   125  300  2040
Figure 2.1. p-ITR (l) and p-MIPP (r): Majors
of the proximal terms. We also find that the performance of the MIPP scheme
deteriorates faster than its IPP counterpart.
2.4.3 Partially coordinated schemes
Next, we consider partially coordinated schemes where each player chooses his own
parameter sequence, as prescribed earlier. In both partially coordinated
schemes, $\gamma_j^k$ is defined as $\gamma_j^k = \gamma^0/(n_0 + n_j + k)^{\beta}$, where $-n_{\max} \leq n_j \leq n_{\max}$. The
parameters $n_0$ and $\gamma^0$ are set to 1200 and 20, respectively, across all agents.
Moreover, it is assumed that $n_j = (j - \bar{J})\kappa$, where $\bar{J} = \frac{J+1}{2}$ and $\kappa$ represents
the extent of coordination. In effect, when $\kappa = 0$, it follows that $n_{\max} = 0$,
and all the players employ the same steplength sequence. In a similar fashion,
$\epsilon_j^k = \epsilon^0/(n_0 + n_j + k)^{\beta}$ for p-ITR schemes and $c_j = \hat{c} + \bar{c}(j - \bar{J})\kappa$ for p-MIPP
schemes ($\hat{c}$ and $\bar{c}$ are constants). Finally, it is assumed that $\hat{c} = 0.5$ and $\bar{c} = $ 5e-5.
We set the initial parameters $\gamma^0 = 20$ and $\epsilon^0 = 0.01$. Note that, as stated earlier,
we allow for partial coordination in $\gamma$ and $\epsilon$ in the case of p-ITR and in $\gamma$ and $c$ in
the case of p-MIPP. It is observed that as $\kappa$ grows, partially coordinated schemes
do not deteriorate significantly in performance. However, as seen in Table 2.4, we
note that while some lack of coordination is tolerated, beyond a threshold, the
impact can be quite significant (as in κ = 350). The performance profiles [74] for
the overall test problem set are shown in Figure 2.1. Again, these suggest that for
exceedingly high values of κ, significant degradation in performance may be seen in
both types of schemes.
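The steplength families described above can be generated per agent as follows; the helper name is illustrative, and the defaults mirror the stated values $\gamma^0 = 20$ and $n_0 = 1200$.

```python
import numpy as np

def coordinated_steplengths(J, kappa, gamma0=20.0, n0=1200, beta=0.51, K=1000):
    """gamma_j^k = gamma0 / (n0 + n_j + k)^beta with n_j = (j - Jbar) * kappa."""
    Jbar = (J + 1) / 2.0
    n = (np.arange(1, J + 1) - Jbar) * kappa        # agent-specific offsets
    k = np.arange(1, K + 1)
    return gamma0 / (n0 + n[:, None] + k[None, :]) ** beta   # shape (J, K)

g_coord = coordinated_steplengths(J=5, kappa=0.0)       # full coordination
g_uncoord = coordinated_steplengths(J=5, kappa=350.0)   # weak coordination
spread = np.max(g_uncoord, axis=0) - np.min(g_uncoord, axis=0)
```

With $\kappa = 0$ all agents share one sequence; larger $\kappa$ widens the per-iteration spread across agents, which is the quantity the convergence theory requires to be summable.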
Table 2.4. p-ITR and p-MIPP: Majors (σ = 1.3)

      | p-ITR: varying κ      | p-MIPP: varying κ
J | N | 0    100  200  350    | 0     100   200   350
5 | 3 | 45   42   49   59     | 63    59    72    86
5 | 4 | 47   50   54   61     | 64    70    77    87
5 | 5 | 46   51   57   65     | 61    71    79    91
5 | 6 | 55   56   66   150    | 63    65    75    93
5 | 7 | 195  226  283  412    | 67    53    68    87
7 | 3 | 48   50   55   70     | 213   237   329   526
7 | 4 | 52   55   63   191    | 522   539   591   735
7 | 5 | 46   44   58   224    | 465   485   539   689
7 | 6 | 202  222  301  518    | 891   905   946   1061
7 | 7 | 738  735  780  918    | 1083  1094  1129  1224
Table 2.5. Multistep ITR: Majors (J = 5, σ = 1.4), varying number of steps r

Nodes | r = 1 | 2    | 3    | 4    | 5
3     | 9418  | 9394 | 9386 | 9382 | 9380
4     | 8919  | 8897 | 8890 | 8887 | 8885
5     | 9871  | 9843 | 9834 | 9830 | 9827
6     | 8453  | 8421 | 8411 | 8405 | 8402
7     | 9911  | 9876 | 9864 | 9859 | 9855
2.4.4 Multistep schemes
In this subsection, we discuss multistep Tikhonov and proximal schemes where
agents at any outer iteration take multiple projection steps prior to an update of the
regularization parameter but exchange information at the end of every projection
step. More importantly, it is to be observed that as the number of inner projection
steps increases, the schemes come to resemble exact Tikhonov and proximal schemes,
respectively.
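The multistep structure can be sketched as follows; the affine map, box set, and iteration counts are hypothetical stand-ins, used only to show where the r inner projection steps sit relative to the regularization update.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
b = np.array([3.0, 3.0])
F = lambda z: A @ z - b                     # z* = (1, 1) solves A z = b
project = lambda z: np.clip(z, 0.0, 5.0)

def multistep_itr(r, outer_iters=3000, alpha=0.25, beta=0.51):
    """r-ITR sketch: r projection steps per update of the regularization parameter."""
    z = np.zeros(2)
    for k in range(1, outer_iters + 1):
        eps, gamma = k ** (-alpha), k ** (-beta)
        for _ in range(r):                  # inner projection steps, eps held fixed
            z = project(z - gamma * (F(z) + eps * z))
    return z

z1, z3 = multistep_itr(r=1), multistep_itr(r=3)
```

Note that the total projection effort scales with r while the regularization schedule advances only once per outer iteration.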
We begin by considering the impact of taking multiple projection steps between
updates of the Tikhonov regularization parameter, as specified by the multistep ITR
scheme (referred to as r-ITR when taking r projection steps between updates). If
agents exchange information while making projection steps for a fixed regularization
parameter, then this resembles a bounded-complexity Tikhonov regularization
scheme. If, however, agents do not exchange information, then as the number
of steps grows, the scheme bears increasing relevance to a bounded-complexity
best-response scheme, where players attempt to get close to optimal solutions,
given adversarial decisions. We begin by considering the r-ITR scheme wherein
we assumed that the steplength and regularization were specified by $k^{-\beta}$ and $k^{-\alpha}$,
respectively, where $\beta = 0.51$ and $\alpha = 0.25$. Table 2.5 reports the number of
outer iterations (steps) taken with increasing number of fixed inner projection steps,
denoted by r. Note that the overall complexity of the r-ITR scheme would be
obtained by multiplying the number of outer iterations by r. It emerges that ITR
with r = 1 often performs better (in terms of major iterations) than when there
are multiple inner projection steps. Furthermore, while communication overhead
often reduces slightly when r grows up to a certain level, the overall effort grows
significantly. In fact, the overall effort can be seen to grow by nearly a factor of r.
Next, the performance of the ITR and IPP schemes was tested for two
different problem sizes at different levels of partial coordination with five agents,
while varying the number of inner projection steps. The performance of r-p-ITR
and r-p-IPP with increasing r (r fixed across all agents) is shown in Table 2.6.
Remarkably, while Tikhonov schemes display a significant dropoff in performance,
proximal-point schemes tend to show far less degradation, suggesting that the
latter are far more robust. A practical extension of such schemes arises when agents
choose an r from a uniform distribution U (2ru , 3ru ). The initial steplengths and
parameters were chosen as stated earlier. Note that the numbers indicated in the
tables denote the total projection steps taken (sum over agents, inner iterations and
outer iterations). It can be seen from Table 2.7 that the number of outer iterations
decreases (divide by the respective ru ) with increasing ru (as the schemes get closer
to their Tikhonov and proximal-point counterparts). However, it can be easily
observed that the overall complexity grows rapidly with increasing values of ru .
Table 2.6. r-p-ITR and r-p-IPP: Total projection steps (σ = 1.4) with r fixed across agents

      |              r-p-ITR                     |           r-p-IPP
N | κ   | r=1    r=2    r=3     r=4     r=5      | r=1   r=2   r=3   r=4   r=5
4 | 0   | 38425  76790  115170  153540  191900   | 1510  1550  1635  1720  1825
4 | 100 | 38650  77250  115845  154440  193050   | 1700  1710  1770  1860  1950
4 | 200 | 38910  77760  116610  155460  194325   | 1855  1860  1905  1980  2075
5 | 0   | 43145  86220  129300  172380  215450   | 4440  4360  4485  4680  4900
5 | 100 | 43405  86740  130065  173400  216750   | 4750  4670  4800  4980  5200
5 | 200 | 43680  87290  130890  174500  218125   | 5040  4960  5085  5260  5475

Note that r-p-IPP may be analyzed in a fashion similar to its Tikhonov counterpart.
Table 2.7. r-p-ITR and r-p-IPP: Total projection steps (σ = 1.4) with firm-specific r_u

      |   r-p-ITR: varying κ           | r-p-IPP: varying κ
N | κ   | r_u=1   r_u=2   r_u=3        | r_u=1  r_u=2  r_u=3
4 | 0   | 107492  168872  307040       | 1470   1892   2000
4 | 100 | 100412  200746  308840       | 1469   2418   2120
4 | 200 | 108836  178779  310880       | 1750   2277   2200
5 | 0   | 112073  224068  310212       | 3900   6084   5508
5 | 100 | 112736  225394  312084       | 4160   4706   5832
5 | 200 | 113438  218100  331512       | 6435   6575   5852
2.4.5 Partial information
Finally, we consider schemes where agents make multiple projection steps before
getting access to other agents' iterates. In a standard ITR scheme, at iteration $k$,
agent $j$, for a fixed $z_{-j}^k$, takes $r$ projection steps employing an update rule based
on the parameters $\epsilon_j^k$ and $\gamma_j^k$. In the case of IPP schemes, agent $j$ takes $r$ steps
with a steplength $\gamma_j^k$. This is similar (but not identical) to a best-response scheme,
where agents at every major iteration solve their optimization problem while keeping
the other agents' iterates fixed. The performance of these schemes is examined
for two different problem sizes for differing levels of partial coordination with five
agents. Table 2.8 shows the performance of ITR and IPP with increasing r, with r
being fixed across all agents. Table 2.9 shows the same when agents choose r from
a uniform distribution U (2ru , 3ru ). The initial steplengths and parameters were
chosen as stated earlier. It is seen that with the loss in information, performance
deteriorates rapidly and convergence is highly affected. In fact, the proximal-point
scheme which was robust when r was fixed across agents, now shows a marked
degradation in performance.
Table 2.8. r-p-ITR and r-p-IPP: Majors (σ = 1.3) with r fixed and partial information

      |    r-p-ITR: varying κ          |   r-p-IPP: varying κ
N | κ   | r=1  r=2  r=3  r=4   r=5     | r=1  r=2  r=3  r=4   r=5
4 | 0   | 235  250  315  5660  23575   | 235  260  330  3820  21500
4 | 100 | 250  260  315  5860  23725   | 250  270  330  4020  21650
4 | 200 | 270  290  345  6420  24175   | 270  290  345  4700  22075
5 | 0   | 230  240  345  5860  23825   | 225  250  345  4060  21750
5 | 100 | 255  270  345  6060  23975   | 250  270  345  4260  21875
5 | 200 | 285  300  375  6580  24425   | 275  300  360  4940  22300
Table 2.9. r_u-p-ITR and r_u-p-IPP: Majors (σ = 1.3) with r not fixed and partial information

      |        r_u-p-ITR              |        r_u-p-IPP
N | κ   | r_u=1  r_u=2   r_u=3        | r_u=1  r_u=2   r_u=3
4 | 0   | 420    27482   142760       | 434    25194   136800
4 | 100 | 364    27586   142840       | 377    25298   136920
4 | 200 | 377    27950   143160       | 390    25688   137200
5 | 0   | 338    27742   143720       | 351    25480   137720
5 | 100 | 338    27846   143800       | 351    25558   137800
5 | 200 | 351    28210   144040       | 351    25948   138080
2.5 Concluding remarks
We considered the development of single-timescale distributed techniques for the
solution of monotone Nash games over continuous strategy sets and concluded with
an examination of the performance of the schemes on a networked Nash-Cournot
game with nonlinear prices. Motivated by the natively two-timescale nature of
most standard regularization and proximal point schemes and the need to select
algorithm parameters centrally, we developed two classes of regularization methods
that are characterized by a single-timescale structure and allow players to select
algorithm parameters with relatively little oversight. Importantly, the user-specific
algorithm parameters do not rely on Lipschitz or monotonicity constants.
(1) Iterative Tikhonov regularization schemes: In regimes where the mappings are
merely monotone and the strategy sets are possibly unbounded, we presented an
iterative Tikhonov regularization scheme in which every player takes a single
gradient step, after which the regularization parameter is updated. Under suitable
rate requirements on the regularization and steplength sequences, we presented
convergence theory in partially coordinated settings that allow for independent
choices of steplength and regularization sequences. Finally, we also examined
multistep variants where agents take multiple projection steps before updating
regularization parameters.
(2) Iterative proximal point schemes: Under a compactness assumption on the
strategy sets and strict monotonicity of the maps, we presented an iterative proximal point
scheme where the centering parameter is updated at every iteration and steplength
choices are made independently. A modified proximal-point scheme was also presented
in which the weight on the proximal term may be chosen independently and updated
at every iteration.
2.6 Appendix
Proof of Lemma 2:
Proof. (i) Recall that
$$(x - x^*)^T F(x^*) \geq 0 \quad \text{for all } x \in K. \tag{2.16}$$
Since $y^k$ solves VI$(K, F + \mathrm{diag}(\epsilon^k))$ for any $k$, it follows that
$$(y - y^k)^T\!\left( F(y^k) + \mathrm{diag}(\epsilon^k)y^k \right) \geq 0 \quad \text{for all } y \in K \text{ and } k \geq 0. \tag{2.17}$$
By choosing $x = y^k$ in (2.16) and $y = x^*$ in (2.17), we obtain, for all $k \geq 0$,
$$(y^k - x^*)^T F(x^*) \geq 0 \quad \text{and} \quad (x^* - y^k)^T\!\left( F(y^k) + \mathrm{diag}(\epsilon^k)y^k \right) \geq 0.$$
Since $F$ is monotone, we obtain $(y^k - x^*)^T(F(x^*) - F(y^k)) \leq 0$, implying $(x^* - y^k)^T\mathrm{diag}(\epsilon^k)y^k \geq 0$. Through a rearrangement,
$$(x^*)^T\mathrm{diag}(\epsilon^k)y^k \geq (y^k)^T\mathrm{diag}(\epsilon^k)y^k \geq \epsilon_{\min}^k\|y^k\|^2,$$
where $\epsilon_{\min}^k = \min_{1\leq j\leq N}\epsilon_j^k$. By using the Cauchy-Schwarz inequality, we see that
$$\epsilon_{\max}^k\|x^*\|\,\|y^k\| \geq (x^*)^T\mathrm{diag}(\epsilon^k)y^k,$$
where $\epsilon_{\max}^k = \max_{1\leq i\leq N}\epsilon_i^k$. Combining the preceding two inequalities, we obtain
$$\|y^k\| \leq \frac{\epsilon_{\max}^k}{\epsilon_{\min}^k}\|x^*\|. \tag{2.18}$$
If $\epsilon_{\max}^k/\epsilon_{\min}^k \leq c$ and $c$ is finite, then the sequence $\{y^k\}$ is bounded. Consider any
accumulation point $\bar{y}$ of $\{y^k\}$ and let $k \to \infty$ in (2.17) over a corresponding
convergent subsequence of $\{y^k\}$. Then, by leveraging the continuity of $F$ and the
fact that $\epsilon_i^k \to 0$ as $k \to \infty$, it follows that $(y - \bar{y})^T F(\bar{y}) \geq 0$ for all $y \in K$. Thus,
every accumulation point $\bar{y}$ of $\{y^k\}$ is a solution to VI$(K, F)$.
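The behavior established in part (i) can be observed numerically: for the merely monotone (singular) affine map below, each regularized problem VI$(K, F + \epsilon I)$ is strongly monotone, and its solution, computed here by a simple projection method, remains bounded and approaches a solution of VI$(K, F)$ as $\epsilon$ decreases. The instance is a hypothetical stand-in, with a scalar regularizer $\epsilon$ in place of the diagonal matrix $\mathrm{diag}(\epsilon^k)$.

```python
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])     # PSD and singular: merely monotone
F = lambda z: A @ z
project = lambda z: np.clip(z, -10.0, 10.0)  # Pi_K for the box K = [-10, 10]^2

def tikhonov_point(eps, iters=20000, step=0.2):
    """Approximate the solution y of VI(K, F + eps*I) by projection iterations."""
    z = np.array([5.0, -3.0])
    for _ in range(iters):
        z = project(z - step * (F(z) + eps * z))
    return z

ys = [tikhonov_point(eps) for eps in (1.0, 0.1, 0.01)]   # trajectory as eps -> 0
```

Here every regularized solution is the least-norm solution $y = 0$ of VI$(K, F)$, so the trajectory is trivially bounded, consistent with the bound (2.18).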
(ii) From the properties of variational inequalities, we have that
$$(y^{k-1} - y^k)^T\!\left( F(y^k) + \mathrm{diag}(\epsilon^k)y^k \right) \geq 0 \quad \text{and} \quad (y^k - y^{k-1})^T\!\left( F(y^{k-1}) + \mathrm{diag}(\epsilon^{k-1})y^{k-1} \right) \geq 0.$$
Through the addition of the two inequalities, we obtain
$$\begin{aligned} 0 &\leq (y^{k-1} - y^k)^T(F(y^k) - F(y^{k-1})) + (y^{k-1} - y^k)^T(\mathrm{diag}(\epsilon^k)y^k - \mathrm{diag}(\epsilon^{k-1})y^{k-1}) \\ &= (y^{k-1} - y^k)^T(F(y^k) - F(y^{k-1})) \\ &\quad + (y^{k-1} - y^k)^T(\mathrm{diag}(\epsilon^k)y^k - \mathrm{diag}(\epsilon^k)y^{k-1} + \mathrm{diag}(\epsilon^k)y^{k-1} - \mathrm{diag}(\epsilon^{k-1})y^{k-1}). \end{aligned}$$
By rearranging the above expression, we obtain
$$\begin{aligned} (y^{k-1} - y^k)^T(\mathrm{diag}(\epsilon^k)y^{k-1} - \mathrm{diag}(\epsilon^{k-1})y^{k-1}) &\geq (y^{k-1} - y^k)^T(F(y^{k-1}) - F(y^k)) \\ &\quad + (y^k - y^{k-1})^T(\mathrm{diag}(\epsilon^k)y^k - \mathrm{diag}(\epsilon^k)y^{k-1}) \\ &\geq \epsilon_{\min}^k\|y^k - y^{k-1}\|^2, \end{aligned} \tag{2.19}$$
where the second inequality follows from the monotonicity of the mapping $F$.
However, through an application of the Cauchy-Schwarz inequality, the expression
$(y^{k-1} - y^k)^T(\mathrm{diag}(\epsilon^k)y^{k-1} - \mathrm{diag}(\epsilon^{k-1})y^{k-1})$ is bounded from above as shown by
$$(y^{k-1} - y^k)^T(\mathrm{diag}(\epsilon^k)y^{k-1} - \mathrm{diag}(\epsilon^{k-1})y^{k-1}) \leq \|y^{k-1} - y^k\|\,\|y^{k-1}\|\,\|\mathrm{diag}(\epsilon^{k-1}) - \mathrm{diag}(\epsilon^k)\|.$$
By using (2.19), it follows that
$$\|y^k - y^{k-1}\| \leq \frac{\|y^{k-1}\|\,\|\mathrm{diag}(\epsilon^{k-1}) - \mathrm{diag}(\epsilon^k)\|}{\epsilon_{\min}^k} \leq \frac{M\|\mathrm{diag}(\epsilon^{k-1}) - \mathrm{diag}(\epsilon^k)\|}{\epsilon_{\min}^k} \leq \frac{M\sqrt{N}(\epsilon_{\max}^{k-1} - \epsilon_{\min}^k)}{\epsilon_{\min}^k}, \tag{2.20}$$
where $\|y^k\| \leq M$ for all $k$ from (i). $\square$
Chapter 3
Pseudomonotone Stochastic Variational Inequality Problems: Analysis and Optimal Stochastic Approximation Schemes
3.1 Introduction
Several applications arising in engineering, science, finance, and economics lead to a
broad range of optimization and equilibrium problems, and under suitable convexity
assumptions, the equilibrium conditions of such problems may be compactly stated
as a variational inequality problem [18, 67, Ch. 1]. Recall that given a set X ⊆ Rn
and a map F : Rn → Rn , the variational inequality problem, denoted by VI(X, F ),
requires an x∗ ∈ X such that:
(x − x∗ )T F (x∗ ) ≥ 0, ∀x ∈ X.
In a multitude of settings, either the evaluation of the map $F(x)$ is corrupted
by error or its components are expectation-valued, a consequence of the original
model being a stochastic optimization or equilibrium problem. Consequently,
$F_i(x) \triangleq \mathbb{E}[F_i(x,\xi)]$ for $i = 1, \ldots, n$. Note that $\xi: \Omega \to \mathbb{R}^d$, $F: X \times \mathbb{R}^d \to \mathbb{R}^n$, and
$(\Omega, \mathcal{F}, \mathbb{P})$ is the associated probability space. As a result, the stochastic variational
problem requires finding a vector x∗ ∈ X such that
(x − x∗ )T IE[F (x∗ , ξ(ω))] ≥ 0, ∀x ∈ X.
(3.1)
Throughout this paper we use F (x; ω) to refer to F (x, ξ(ω)).
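As an illustrative aside (not part of the original development), the defining inequality can be made concrete on a small affine instance. The sketch below uses hypothetical data: F(x) = Ax + b with A positive definite (hence strongly monotone) and X the nonnegative orthant, solved by the basic projection iteration x ← Π_X(x − γF(x)) and spot-checked against the variational inequality.

```python
import numpy as np

# Hypothetical illustration: VI(X, F) with F(x) = A x + b, A positive definite
# (so F is strongly monotone), and X = R^2_+, whose Euclidean projection is a
# componentwise max with zero.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([-2.0, 1.0])
F = lambda x: A @ x + b
proj = lambda x: np.maximum(x, 0.0)           # projection onto X = R^n_+

x = np.zeros(2)
gamma = 0.1                                   # small fixed steplength
for _ in range(5000):
    x = proj(x - gamma * F(x))                # basic projection iteration

# The limit satisfies (y - x*)^T F(x*) >= 0 for all y in X; spot-check it.
rng = np.random.default_rng(0)
ys = rng.uniform(0.0, 5.0, size=(1000, 2))
assert all((y - x) @ F(x) >= -1e-8 for y in ys)
```

For this instance the complementarity conditions give x* = (0.5, 0), with F(x*) = (0, 1.5), so the inequality reduces to 1.5·y₂ ≥ 0 on the orthant.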
Sufficiency conditions for the solvability of variational inequality problems often
depend either on the compactness of the associated set or the coercivity property
of the map [18, Chap. 2]. In a stochastic setting with unbounded sets, existence
statements require access to the integrals arising from the expectations IE[F (x; ω)]
to derive a suitable coercivity property. Recent work proposes an alternative by
obviating integration and developing integration-free sufficiency conditions for the
existence of solutions to stochastic Nash games [33], and more recently stochastic
quasi-variational and complementarity problems [75]. Yet these conditions can be refined to address pseudomonotone maps, and this paper provides a set of existence and uniqueness results along precisely such directions.
One avenue towards solving deterministic monotone variational inequality problems is through projection-based methods [18, Ch. 12]. The simplest of such
methods necessitates a strong monotonicity requirement on the mapping F (x)
while convergence under weaker conditions is made possible through Tikhonov
regularization and proximal point schemes [41, 46, 47]. A key challenge in such
schemes is the need for solving a sequence of exact or increasingly accurate problems. In recent work, iterative counterparts have been developed in which a single
projection step is taken after which the steplength and regularization/centering
parameters are updated [76]. In the context of pseudomonotonicity, extragradient
schemes [18, Ch. 12], first suggested by [77], have been crucial in the solution of VIs
with pseudomonotone and Lipschitzian maps. More recently [78] considered a prox
generalization of the extragradient scheme in deterministic regimes. Linesearch
and hyperplane based schemes present some enhancements to projection methods
that improve convergence rates and bounds under some milder settings [18, Ch. 12].
The stochastic counterpart of the variational inequality has received relatively less
attention. Early efforts focused on the use of sample average approximation (SAA)
techniques and developed consistency statements of the resulting estimators [79].
In fact, such techniques have been applied towards the computation of stochastic
VIs [80]. More recently, such avenues have been utilized to develop confidence
regions with suitable central limit results [81, 82]. An alternative approach inspired
by the seminal work by Robbins and Monro [29] is that of stochastic approximation [83–85]. In this context, there has been a surge of recent effort in the
development of stochastic approximation (SA) schemes for monotone stochastic
variational inequality problems [30–32]. However, to the best of our knowledge, the
applicability of these schemes is limited to strongly, strictly, or merely monotone
maps.
3.1.1 Motivation
We motivate our study of pseudomonotone stochastic variational inequality problems
by considering three sets of problems.
3.1.1.1
Stochastic fractional programming
Design of engineered systems often requires the optimization of ratios. For instance, aircraft design problems require maximization of lift-to-drag ratios by changing aerofoil dimensions such as thickness and chord length [19, 20], while in automotive engineering, fuel economy to vehicle performance ratios [21, 22] require optimization.
Similarly, in wireless networks, one is concerned with the maximization of the
corresponding signal-to-noise metric [23,24]. All of the above problems fall under the
umbrella of “fractional programming” and we consider the stochastic generalization
of this problem:
"
min
x∈X
#
f (x; ξ(ω))
h(x) , E
,
g(x; ξ(ω))
(3.2)
where f, g : Rn × Rd → R and ξ : Ω → Rd . While h(x) cannot be guaranteed to
be pseudoconvex in general, in automotive problems [21], f (x; ξ(ω)) corresponds
to the uncertain time taken to accelerate from 0 to v max miles per hour while
g(x; ξ(ω)) denotes the uncertain fuel economy. The design space x corresponds to
engine design specifications such as gear ratios and transmission switching levels.
Consequently, the equilibrium conditions are given by a pseudomonotone stochastic
variational inequality problem. We extend Lemma 2.1 from [86] (proof omitted)
that provides conditions for pseudoconvexity of h(x) under some basic assumptions.
Note that h(x) is pseudoconvex if and only if −h(x) is pseudoconcave.
Lemma 8 (Stochastic pseudoconvex function). If (i) f : X × R^d → R is nonnegative and convex (concave) in an a.s. sense; (ii) g : X × R^d → R is a strictly positive concave (convex) function in an a.s. sense; and (iii) f(•; ω), g(•; ω) ∈ C¹ in an a.s. sense, then the function h : X → R, given by h(x) = f(x)/g(x), is strongly pseudoconvex (pseudoconcave).
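Lemma 8 can be probed numerically. The sketch below uses an illustrative scalar instance of our own choosing (a nonnegative convex f and a strictly positive concave g on [−1, 1], consistent with (i)-(iii)) and verifies the first-order pseudoconvexity implication h'(y)(x − y) ≥ 0 ⇒ h(x) ≥ h(y) on sampled pairs.

```python
import numpy as np

# Hypothetical scalar instance of Lemma 8 on X = [-1, 1]:
f = lambda x: (x - 2.0) ** 2 + 1.0     # nonnegative and convex
g = lambda x: 5.0 - 0.5 * x ** 2       # strictly positive and concave on X
h = lambda x: f(x) / g(x)              # candidate pseudoconvex ratio

def dh(x, eps=1e-6):
    # Central-difference approximation of h'(x).
    return (h(x + eps) - h(x - eps)) / (2 * eps)

# Pseudoconvexity check: dh(y) * (x - y) >= 0  ==>  h(x) >= h(y).
rng = np.random.default_rng(1)
for x, y in rng.uniform(-1.0, 1.0, size=(2000, 2)):
    if dh(y) * (x - y) >= 0.0:
        assert h(x) >= h(y) - 1e-9
```

On this interval h happens to be strictly decreasing, so the implication holds trivially; the same template can be reused on instances where h is nonmonotone.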
3.1.1.2
Stochastic product pricing
Consider an oligopolistic market with a set of substitutable goods in which firms
compete in prices. In such Bertrand markets [25, 87], the quantity sold by a
particular firm is contingent on the prices set by the firms (and possibly other
product attributes) and this firm-specific demand is captured by the Generalized
Extreme Value (GEV) model [25–27]. The multinomial logit is a commonly used
GEV model that possesses some tractability and finds application in revenue
management problems in product pricing. We begin by defining the product pricing
problem for firm j:
max_{p_j ∈ P_j} f_j(p), where f_j(p) ≜ E[p_j ζ_j(p; ω)],
p_j denotes the price set by firm j and ζ_j(p; ω) denotes the demand for product j, defined as
ζ_j(p; ω) ≜ e^{−α_j(ω) p_j} / (c + Σ_{i=1}^N e^{−α_i(ω) p_i}),
where αj (ω) represent positive parameters for j = 1, . . . , N . The resulting random
revenue function has been shown to be pseudoconcave (see [88]) and its expectation
is pseudoconcave (as we show later). The relevance of this observation can be traced to the knowledge that under a pseudoconcavity assumption, given p*_{−j}, p*_j is a stationary point of this problem if and only if p*_j is a global maximizer of this problem. Consequently, any solution to the collection of pseudomonotone
variational inequality problems is a Nash-Bertrand equilibrium. Note that in
Cournot or quantity games, suitably specified inverse demand functions also lead
to pseudoconcave revenue functions [89, Th. 3.4].
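The pseudoconcavity of the logit revenue in a firm's own price can be probed numerically along that coordinate. The sketch below fixes a single hypothetical realization of α(ω) and a rival price (all parameter values are illustrative), and checks the defining implication on a grid.

```python
import numpy as np

# Hypothetical two-product instance of firm j's revenue under multinomial
# logit demand, with the rival's price held fixed at a single noise realization.
a = np.array([0.8, 1.2])      # realized sensitivities alpha_i(omega)
c = 1.0                       # outside-option constant
p_rival = 2.0                 # fixed rival price

def revenue(pj):
    # r(p_j) = p_j * exp(-a_j p_j) / (c + sum_i exp(-a_i p_i))
    return pj * np.exp(-a[0] * pj) / (c + np.exp(-a[0] * pj) + np.exp(-a[1] * p_rival))

def drev(pj, eps=1e-6):
    # Central-difference approximation of r'(p_j).
    return (revenue(pj + eps) - revenue(pj - eps)) / (2 * eps)

# Pseudoconcavity check: drev(y)(x - y) <= 0  ==>  revenue(x) <= revenue(y).
grid = np.linspace(0.1, 10.0, 200)
for y in grid:
    for x in grid:
        if drev(y) * (x - y) <= 0.0:
            assert revenue(x) <= revenue(y) + 1e-7
```

For this parameterization the revenue is unimodal in p_j (its derivative changes sign exactly once), which is precisely the single-variable form of pseudoconcavity being exercised.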
3.1.1.3
Stochastic economic equilibria
We consider a competitive exchange economy in which there is a collection of n goods
with an associated price vector p ∈ Rn++ and a positive budget w. The consumption
vector F (p, w; ω) specifies the uncertain consumption level and the consumption
has to satisfy budget constraints in an expected-value sense, as specified by
E[pT F (p, w; ω)] ≤ w.
This demand function is assumed to be homogeneous with degree zero or F (λp, λw; ω) =
F (p, w; ω) for any positive λ. An additional condition satisfied by F (p, w; ω) is the
(expected) Weak Weak Axiom of revealed preference (EWWA), which requires that
for all pairs (p_1, w_1) and (p_2, w_2),
p_2^T E[F(p_1, w_1; ω)] ≤ w_2 =⇒ p_1^T E[F(p_2, w_2; ω)] ≥ w_1.
This axiom is an expected-value variant of WWA which itself represents a weakening
of the Weak Axiom of revealed preference [90]. Before proceeding, we provide
some intuition for this axiom. At prices p2 and budget w2 , an agent chooses a
bundle F (p2 , w2 ). If at the same prices, she can also afford F (p1 , w1 ), then we
have that pT2 F (p1 , w1 ) ≤ w2 . Consequently, the consumer believes that the bundle
F (p2 , w2 ) is at least as good as F (p1 , w1 ). If at p1 and w1 , the bundle F (p2 , w2 ) is
cheaper than the chosen bundle F (p1 , w1 ), it follows that she can afford a bundle
b such that b contains more of each commodity than F (p2 , w2 ). It may then be
concluded that the agent prefers b to F (p2 , w2 ). But F (p2 , w2 ) is at least as good
as F (p1 , w1 ), implying that the bundle b is preferable to F (p1 , w1 ). But b is cheaper
than F (p1 , w1 ), contradicting the choice of F (p1 , w1 ) and one can conclude that
F (p2 , w2 ) cannot be cheaper than F (p1 , w1 ) or pT1 F (p2 , w2 ) ≥ w1 . If w1 = w2 , we
have the following:
pT2 F (p1 , w) ≤ w =⇒ pT1 F (p2 , w) ≥ w.
By the budget identity, pT1 F (p1 , w) = pT2 F (p2 , w) = w, implying that
(p2 − p1 )T F (p1 , w) ≤ 0 =⇒ (p2 − p1 )T F (p2 , w) ≤ 0.
It follows that F (•, w) is a pseudomonotone map in (•) for any positive w. We now
present how one may model the notion of equilibrium in a general consumption
sector with a finite set of agents, denoted by A. An agent a ∈ A is characterized
by an endowment ea ∈ Rn++ and a demand function Fa (p, pT ea ), implying that the
consumption behavior of agent a is given by ϕ_a(p) = p^T F_a(p, p^T e_a). The aggregate demand is given by the function F(p), defined as F(p) = Σ_{a∈A} F_a(p, p^T e_a). Note that F(p) is homogeneous in p with degree zero and, by the individual budget identities, we have Walras' law: for all p, p^T F(p) = p^T e = p^T (Σ_a e_a), where e denotes the sector-wide initial endowment. The demand function F(p) satisfies the WWA if
p_2^T F(p_1) ≤ p_2^T e =⇒ p_1^T F(p_2) ≥ p_1^T e.
The WWA can be presented in a more convenient fashion if Z(p) = F (p) − e.
While the consumption sector is captured by Z(p), the set Y represents the set
of technology available and y ∈ Y represents either input (negative) or output
(positive) based on sign. The set Y satisfies two requirements: (i) free disposal or
that goods may be arbitrarily wasted without using further inputs or −R+
n ⊆ Y ; and
(ii) production requires some inputs (no free lunch) or Y ∩ Rn+ = {0}. Consequently,
an equilibrium of the economy (Y, Z) is given by a p ∈ P such that
(a) Z(p) ∈ Y and (b) pT y ≤ 0, for all y ∈ Y.
Condition (a) implies that demand is met at equilibrium while (b) implies that
firms maximize profits by choosing plans y = Z(p). In fact, by [28, Th. 1], p is
an equilibrium of the economy (Y, Z) if and only if p is a solution of VI(Q, Z),
where Q = P ∩ Y° and Y° ≜ {d : d^T y ≥ 0, ∀y ∈ Y}. But Z(p) = E[Z(p; ξ(ω))]
is a pseudomonotone map, leading to a pseudomonotone stochastic variational
inequality problem.
3.1.2 Related work on SA
Stochastic approximation (SA) schemes originate from the seminal paper by Robbins
and Monro [29] and have proved useful in solving a host of problems in control
theory, operations research, and economics (cf. [83–85]). Via averaging techniques
proposed by Polyak [91] and Polyak and Juditsky [92], asymptotically optimal
Ref.      Applicability                    SA Algorithm              Avg.   Metric               Rate       a.s.
[30]      Strongly monotone                Proj. based               N      Iterates             --         Y
[31]      Monotone, Lipschitz              Regularization            N      Iterates             --         Y
[100]     Monotone, non-Lipschitz          Reg. and smoothing        N      Iterates             --         Y
[101]     Monotone, non-Lipschitz          Mirror-prox               Y      Gap fn.              O(1/√K)    N
[103]     Strongly monotone, non-Lip.      Proj. based, self-tuned   N      MSE (soln. iter.)    O(1/K)     Y
[102]     Monotone, non-Lip.               Extragradient             Y      Gap function         O(1/√K)    Y
Sec. 3.3  Strong/strict/pseudo-plus        Extragr., mir.-prox       N      MSE (soln. iter.)    O(1/K)     Y
Sec. 3.4  Strong pseudo/mon.+weak-sharp    Extragr., mir.-prox       N      MSE (soln. iter.)    O(1/K)     Y
Table 3.1. SA-based approaches for stochastic variational inequality problems
rates in function values can be derived (see related [93, 94] and prior work [95] on
averaging). In the last decade, there has been a surge of interest in the development
of techniques for stochastic convex optimization with a focus on optimal constant
steplengths [32], composite problems [96, 97] and nonconvexity [98, 99]. However,
in the context of stochastic variational inequality problems, much of the prior
work has been in the context of monotone operators. Almost sure convergence
of the solution iterates was first proven by Jiang and Xu [30], while regularized
schemes for addressing merely monotone but Lipschitzian maps were subsequently
developed [31]. In [100], Yousefian et al. weakened the Lipschitzian requirement by
developing an SA scheme that combined local smoothing and iterative regularization.
From a rate standpoint, there has been relatively less work in the context of SVIs. A particularly influential paper by Tauvel et al. [101] proved that the mirror-prox stochastic approximation scheme admits the optimal rate of O(1/√K) in terms of the gap function when employing averaging over monotone SVIs. In related work [102], Yousefian et al. develop extragradient-based robust smoothing schemes for monotone SVIs in non-Lipschitzian regimes and show the optimality of the rate in terms of the gap function. For the purposes of clarity, some important previous computational results related to the stochastic variational inequality literature and algorithmic settings are summarized in Table 3.1.
3.1.3 Contributions and outline
This paper considers the analysis and solution of stochastic variational inequality
problems and makes the following contributions:
(i) Analysis: Traditional solvability statements for VIs over unbounded sets often
require utilizing properties of the map (such as coercivity) that may be challenging
to ascertain when maps are expectation-valued. We present a set of verifiable
conditions for claiming the solvability of pseudomonotone SVIs. In addition, refined
statements are provided for variants of PSVIs such as pseudomonotone stochastic
complementarity problems. Finally, we provide analogous conditions for claiming
the uniqueness of solutions. Both sets of conditions are novel in that they do not
require integration and are distribution independent.
(ii) Almost-sure convergence: We consider a stochastic extragradient method
and under standard assumptions on steplength sequences, we show that the generated sequence of iterates converges to a solution in an almost-sure sense. We refine
these statements to variants of pseudomonotonicity and extend the convergence
statement to the mirror-prox regime as well as specialize it for monotone regimes
under weak-sharpness assumptions. To the best of our knowledge, there appears to
be no previous a.s. convergence theory for pseudomonotone SVIs.
(iii) Rate analysis: Under slightly stronger settings of pseudomonotonicity, we
prove that the extragradient scheme displays the optimal rate for strongly pseudomonotone maps. Additionally, a similar statement is proved for the mirror-prox
generalization as well as for problems characterized by the weak-sharpness property.
In particular, we emphasize that our work derives rate estimates for the iterates
without resorting to averaging, in contrast with available statements for monotone
SVIs that provide rate statements in terms of the gap function. Notably, in all three
cases, we further refine the bound by optimizing initial steplength. Again, this
appears to be amongst the first rate statements in the regime of pseudomonotone
problems.
(iv) Numerical Results: We consider a test suite that consists of monotone,
pseudomonotone, and nonmonotone problem instances. The numerical performance
on our test set is studied for different choices of algorithmic parameters and insights
on practical behavior are obtained and compared. In particular, we examine the
empirical benefits of optimizing the initial steplength.
The rest of the paper is organized into four sections. In Section 3.2, we
derive integration-free statements for the existence and uniqueness of solutions to
pseudomonotone stochastic variational inequality problems and their variants. In
Section 3.3, we present the convergence theory and derive rate estimates under
different assumptions on the map. Finally, in Section 3.4, numerical performance
of the schemes is investigated and the paper concludes in Section 3.5.
3.2 Analysis
Standard sufficiency conditions for the solvability of variational inequality problems
rely on compactness [18, Chap. 2] of the associated set X and coercivity properties
of the map [18, Chap. 2], amongst other conditions. Our interest lies in regimes
where the maps are expectation-valued and the set X is unbounded. In such
instances, a direct utilization of the coercivity properties is challenging since it
requires a tractable analytical form for the expectation. Obtaining closed form
expressions necessitates integration over probability spaces and requires knowledge
of the underlying distribution. An alternative to integration was recently proposed
in [33] and is reliant on imposing coercivity properties of the sampled map F (x; ω)
in an almost sure sense to obtain existence statements for stochastic Nash games. In
subsequent work, this avenue was extended to stochastic complementarity problems
and quasi-variational inequalities [75]. Unfortunately, a direct application of these
statements tends to be somewhat restrictive, thereby motivating a refinement of
these statements to pseudomonotone regimes. We begin with an introduction to
stochastic pseudomonotone maps in Section 3.2.1 and apply these findings towards
developing sufficiency conditions for existence and uniqueness in Section 3.2.2.
3.2.1 Stochastic Pseudomonotone Maps
We begin our study by recalling the definitions of pseudomonotone maps and their
variants.
Definition 1 (Monotonicity and Pseudomonotonicity). Consider a continuous
mapping F : X ⊆ Rn → Rn . Then, the following hold:
(i) F is monotone on X if for all x, y ∈ X, (F (x) − F (y))T (x − y) ≥ 0.
(ii) F is hypomonotone on X if there exists σh > 0, such that, for all x, y ∈ X,
(F (x) − F (y))T (x − y) ≥ −σh kx − yk2 .
(iii) F is pseudomonotone on X, if for all x, y ∈ X, (x − y)T F (y) ≥ 0 =⇒
(x − y)T F (x) ≥ 0.
(iv) F is pseudomonotone-plus on X, if it is pseudo-monotone on X and for all
vectors x and y in X, F (y)T (x−y) ≥ 0, F (x)T (x−y) = 0 =⇒ F (x) = F (y).
(v) F is strictly pseudomonotone on X if, for all x, y ∈ X with x ≠ y, (x − y)^T F(y) ≥ 0 =⇒ (x − y)^T F(x) > 0.
(vi) F is strongly pseudomonotone on X if there exists σ > 0 such that, for all x, y ∈ X with x ≠ y, (x − y)^T F(y) ≥ 0 =⇒ (x − y)^T F(x) ≥ σ‖x − y‖².
It is relatively easy to show that (i) =⇒ (ii), (i) =⇒ (iii), and (vi) =⇒ (v)
=⇒ (iv) =⇒ (iii). Properties (i) and (ii) may be found in [18, Ch. 2], while the
remainder may be found in [104, 105]. Over a general measure space, evaluating
pseudomonotonicity properties can be challenging in that this again requires access
to closed form expressions for the expectation. Accordingly we impose properties on
the sampled map F (x; ω) in a probabilistic sense, in an effort to make statements
over the expected-value map. Note that in the remainder of this section, µ(U)
denotes the probability of event U.
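The gap between monotonicity and pseudomonotonicity in Definition 1 can be seen on a toy scalar map (an illustrative choice of ours, not from the text): F(x) = 1/(1 + x²) is strictly positive on R, hence pseudomonotone, yet decreasing for x > 0 and therefore not monotone.

```python
import numpy as np

# Illustrative map on X = R: F(x) = 1/(1 + x^2). Since F > 0 everywhere,
# F(y)(x - y) >= 0 forces x >= y, whence F(x)(x - y) >= 0 as well, so F is
# pseudomonotone; but F decreases on x > 0, so it is not monotone.
F = lambda x: 1.0 / (1.0 + x ** 2)

rng = np.random.default_rng(2)
pairs = rng.uniform(-5.0, 5.0, size=(5000, 2))

# Pseudomonotonicity holds on every sampled pair ...
for x, y in pairs:
    if F(y) * (x - y) >= 0.0:
        assert F(x) * (x - y) >= 0.0

# ... while monotonicity, (F(x) - F(y))(x - y) >= 0, fails, e.g. at (1, 2):
x, y = 1.0, 2.0
assert (F(x) - F(y)) * (x - y) < 0.0   # F(1) = 0.5 > F(2) = 0.2 but x < y
```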
Lemma 9. Suppose F (x; ω) is pseudomonotone on X in an a.s. sense. Then the
following hold:
(i) F (x) is pseudomonotone on X.
(ii) If F (x; ω) is pseudomonotone-plus on X in an a.s. sense, then F (x) is
pseudomonotone-plus on X.
(iii) If F (x; ω) is strongly pseudomonotone on X for all ω ∈ U ⊂ Ω and µ(U) > 0,
then F (x) is strongly pseudomonotone on X.
(iv) If F (x; ω) is strictly pseudomonotone on X for all ω ∈ U ⊂ Ω and µ(U) > 0,
then F (x) is strictly pseudomonotone on X.
Proof. (i) By assumption, F (x; ω) is pseudomonotone on X in an a.s. sense. Then
we have
[(y − x)T F (x; ω) ≥ 0] =⇒ [(y − x)T F (y; ω) ≥ 0], for a.e. ω ∈ Ω, x, y ∈ X.
(3.3)
Taking expectations on both sides of (3.3), we have that
IE[(y − x)T F (x; ω)] ≥ 0 =⇒ IE[(y − x)T F (y; ω)] ≥ 0, ∀x, y ∈ X.
54
The result follows by noting that F(x) = E[F(x; ω)] and F(y) = E[F(y; ω)].
(ii) From the plus property of F (x; ω), we have that
[(y − x)T F (x; ω) ≥ 0 and (y − x)T F (y; ω) = 0]
=⇒ F (x; ω) = F (y; ω), for a.e. ω ∈ Ω, x, y ∈ X.
(3.4)
Taking expectations on both sides, the claim holds based on the following:
[(y − x)T F (x) ≥ 0 and (y − x)T F (y) = 0] =⇒ F (x) = F (y), ∀x, y ∈ X.
(iii) Using the definition of strong pseudomonotonicity, for all x, y ∈ X with x ≠ y, F(x; ω)^T (y − x) ≥ 0 implies that F(y; ω)^T (y − x) ≥ σ‖y − x‖² for every ω ∈ U. Consequently, we have that
F(y)^T (y − x) = E[F(y; ω)]^T (y − x) = E[(F(y; ω))^T (y − x)]
  = ∫_{ω∈U} (F(y; ω))^T (y − x) dµ(ω) ≥ σ‖x − y‖² ∫_{ω∈U} dµ(ω) = µ(U)σ‖x − y‖².
(iv) Omitted.
Naturally, even if F(x) is pseudomonotone, F(x; ω) need not be pseudomonotone for every ω ∈ Ω. We now define two properties associated with
the solution set of VI(X, F ) [18, Ch. 6].
Definition 2 (Weak sharpness and acute-angle property). Consider VI(X, F )
and let X ∗ denote its set of solutions. Let F : X ⊆ Rn → Rn be a continuous
mapping. Then the following hold:
(i) The problem possesses the weak-sharp property if there exists an α > 0 such that
(x − x*)^T F(x*) ≥ α min_{x̄∈X*} ‖x − x̄‖,   ∀x ∈ X, ∀x* ∈ X*.
(ii) The acute-angle relation holds if for any vector x ∈ X \ X* and a solution x* ∈ X* to the VI(X, F),
(x − x*)^T F(x) > 0.                                                                  (3.5)
It should be emphasized that for these properties to hold, the mapping need not be monotone. Note that the acute-angle property is implied by strict or strong pseudomonotonicity. It can be observed that this property is related to a solution x* but does not extend to any two general points in the set (see [78, 106]).
3.2.2 Existence and uniqueness of solutions
Standard existence results [18, Chap. 2] rely on X being compact or the coercivity
of the map over X. When F is an expectation-valued map and X is an unbounded
set, existence requires ascertaining the coercivity properties on F (x) = IE[F (x; ω)];
however such properties are not easily verifiable when considering general measure
spaces. We therefore resort to ascertaining whether solvability of the stochastic
variational problem (3.1) can be deduced from assuming that suitable properties
of the mapping F (x; ω) hold in an almost sure sense. Throughout this paper, we
assume that X is a closed and convex set in Rn and F (x; ω) is a continuous map in
x for every ω ∈ Ω. We extend existence theory respectively from Propositions 2.2.3
and 2.3.16, and Theorem 2.3.5 in [18] to a stochastic regime as follows. We begin
by presenting a sufficiency condition for the existence that obviates integration and
does not necessitate any form of monotonicity.
Proposition 4 (Solvability of pseudomonotone SVI(X, F )). Suppose F (x; ω)
is pseudomonotone over X in an a.s. sense and F (x; ω)T (x − xref ) ≥ −u(ω) a.s.
where u(ω) is a nonnegative integrable function. Suppose one of the following holds:
(i) There exists a vector x^ref ∈ X such that the set L_<(ω) is bounded a.s., where
L_<(ω) ≜ {x ∈ X : F(x; ω)^T (x − x^ref) < 0}.
(ii) There exists a bounded open set C and a vector x^ref ∈ X ∩ C such that a.s.:
F(x; ω)^T (x − x^ref) ≥ 0, for all x ∈ X ∩ bd C.
Then SVI(X, F) is solvable.
Proof. Let us assume that (i) holds. Consider the set L_< defined as follows:
L_< ≜ {x ∈ X : E[F(x; ω)]^T (x − x^ref) < 0}.
It suffices to prove that L_< is bounded given that (i) holds. Note that if L_< is bounded, then solvability of SVI(X, F) follows from Proposition 2.2.3 in [18]. From (i), L_<(ω) is bounded in an almost sure sense. To arrive at a contradiction, let us assume that L_< is unbounded. Then there exists an unbounded sequence {x_k}, k ∈ K, such that {x_k} ⊂ L_< and
lim_{K∋k→∞} ‖x_k‖ = ∞.
It follows that lim_{K∋k→∞} E[F(x_k; ω)^T (x_k − x^ref)] < 0. As a result, by noting that F(x; ω)^T (x − x^ref) ≥ −u(ω) a.s., where u(ω) is a nonnegative integrable function, we may now use Fatou's lemma to obtain the following:
E[ lim_{K∋k→∞} F(x_k; ω)^T (x_k − x^ref) ] ≤ lim_{K∋k→∞} E[F(x_k; ω)^T (x_k − x^ref)] < 0.
Therefore, for a set U with positive measure, we have that
lim_{K∋k→∞} F(x_k; ω)^T (x_k − x^ref) < 0,   ∀ω ∈ U.
But this contradicts the a.s. boundedness of L_<(ω) and consequently L_< is bounded. From Theorem 2.3.3 in [18], the solvability of SVI(X, F) follows.
Suppose (ii) holds, implying that F(x; ω)^T (x − x^ref) ≥ 0 for all x ∈ X ∩ bd C. Taking expectations, we obtain that
E[F(x; ω)]^T (x − x^ref) ≥ 0, for all x ∈ X ∩ bd C.
The above condition implies that SVI(X, F) has a solution from Theorem 2.2.3 in [18].
We may now provide a result that relies on leveraging monotonicity properties and utilizing the interiority properties of X. Again, we develop almost sure conditions that obviate the need for integration. Before proceeding further, we define the recession cone X_∞ and dual cone X° corresponding to X as follows:
X_∞ ≜ {y | ∀x ∈ X, ∀λ ≥ 0 : x + λy ∈ X},   X° ≜ {y | y^T x ≥ 0, ∀x ∈ X}.
Proposition 5. Let X ⊆ R^n be closed and convex. Suppose F(x; ω) is pseudomonotone and ‖F(x; ω)‖ ≤ B on X a.s. Then X* is nonempty and bounded if
X_∞ ∩ (−F(X; ω))° = {0},   a.s.                                                       (3.6)
Proof. The pseudomonotonicity of F(x) follows from the pseudomonotonicity of F(x; ω) by invoking Lemma 9. The boundedness of F(x) follows from the boundedness of F(x; ω) by leveraging Jensen's inequality and the convexity of the norm:
‖E[F(x; ω)]‖ ≤ E[‖F(x; ω)‖] ≤ B.
From the proof of Prop. 2.2.3 presented in [18] for deterministic variational inequalities, it hence suffices to show that
X_∞ ∩ (−F(X))° = {0}.
We proceed to prove the result via a contradiction argument. By hypothesis, X_∞ ∩ (−F(X; ω))° = {0} a.s., and suppose there exists a d ≠ 0 with d ∈ X_∞ ∩ (−F(X))°. This implies that d ∈ X_∞ and F(x)^T d ≤ 0 for all x ∈ X. Hence
∫_Ω d^T F(x; ω) dµ(ω) ≤ 0.
Let U = {ω | d^T F(x; ω) < 0}. Then the above integral can be decomposed as follows:
∫_U d^T F(x; ω) dµ(ω) + ∫_{Ω\U} d^T F(x; ω) dµ(ω) ≤ 0.                                (3.7)
Clearly, U has to have positive measure to ensure that (3.7) holds. Furthermore, for every ω ∈ U, we have that F(x; ω)^T d ≤ 0 and 0 ≠ d ∈ X_∞. But this contradicts the hypothesis (3.6), which requires that the only such vector d be the zero vector in an a.s. sense. Our result follows.
We now provide a related result that allows for claiming the convexity, compactness, and nonemptiness of the solution set.
Proposition 6. Let X ⊆ Rn be closed and convex and let F (x; ω) be pseudomonotone on X in an almost sure sense. Then, the following hold:
(i) X ∗ is convex.
(ii) Suppose there exists a vector xref ∈ X such that F (xref ; ω) ∈ int((X∞ )◦ ) in
an almost sure sense. Then X ∗ is nonempty, convex, and compact.
Proof. (i) From Lemma 9, it follows that F(x) is pseudomonotone if F(x; ω) is
pseudomonotone in an almost sure sense. The first claim follows from Theorem 2.3.5
in [18] where it is shown that SOL(X, F ) is convex when F (x) is a pseudomonotone
map.
(ii) To prove the second claim, we revisit our hypothesis which requires existence of
an xref ∈ X such that
F (xref ; ω) ∈ int((X∞ )◦ ),
a.s.
Taking expectations, we obtain that F (xref ) ∈ int((X∞ )◦ ), and the result follows
from [18, Th. 2.3.5].
When X is a cone, VI(X, F ) is equivalent to CP(X, F ), a problem that requires
an x such that:
X ∋ x ⊥ F(x) ∈ X°.
We extend Theorems 2.4.7, 2.4.4, and Corollary 2.4.5 from [18] to stochastic regimes
that mandate some feasibility properties on the stochastic map F (x; ω).
Theorem 10 (Solvability of pseudomonotone SCP(X, F)). Let X be a closed convex cone in R^n with int(X°) ≠ ∅. Suppose F(x; ω) is pseudomonotone on X almost surely and one of the following holds:
(i) The dual cone X° has a nonempty interior and the following expression holds:
X_∞ ∩ (−F(X; ω))° = {0},   a.s.                                                       (3.8)
(ii) There exists an x^ref ∈ X such that F(x^ref; ω) ∈ int(X°) in a.s. fashion.
(iii) The map F(x; ω) = M^ω x + q^ω and the dual cone X° has a nonempty interior. Additionally, the following holds a.s.:
[d ∈ X, (M^ω)^T d ∈ −X°, (q^ω)^T d ≤ 0] =⇒ d = 0.                                     (3.9)
Then SCP(X, F) has a nonempty compact solution set.
Proof. (i) It suffices to show that (3.8) implies
X_∞ ∩ (−F(X))° = {0},                                                                 (3.10)
since the latter expression implies that X* is nonempty from [18, Th. 2.4.4] under the additional requirement that X° has a nonempty interior. This can be shown akin to the proof of Prop. 5.
(ii) By taking expectations, we have that F(x^ref) ∈ int(X°) and x^ref ∈ X. Consequently, CP(X, F) is strictly feasible and CP(X, F) is solvable from [18, Th. 2.4.4].
(iii) Consider a d as defined by the left-hand side of (3.9). It follows that d ∈ X, (M^ω)^T d ∈ −X°, and (q^ω)^T d ≤ 0 in an a.s. sense. By taking expectations, we have M^T d ∈ −X° and q^T d ≤ 0, where M = E[M^ω] and q = E[q^ω]. It follows that
[d ∈ X, M^T d ∈ −X°, q^T d ≤ 0] =⇒ d = 0.                                             (3.11)
Since X° has a nonempty interior by hypothesis, CP(X, F) is solvable from [18, Cor. 2.4.5].
We now turn our attention to providing sufficiency conditions for the uniqueness
of solutions to variational inequality problems. We initiate our discussion with a
sufficiency condition for ensuring that variational inequality problems with strongly/strictly pseudomonotone maps admit unique solutions. While this condition has
been previously studied in [107], we present it for the sake of completeness.
Theorem 11 (Uniqueness of solutions to strongly/strictly pseudomonotone VI(X, F )). Consider a variational inequality problem VI(X, F ) where X is a
closed and convex set in Rn . Suppose VI(X, F ) is solvable and one of the following
holds:
(i) Suppose F is strictly pseudomonotone over X.
(ii) Suppose F is strongly pseudomonotone over X.
Then VI(X, F ) admits a unique solution.
Proof. (i) Suppose F is strictly pseudomonotone over X and assume that VI(X, F )
admits at least two solutions x1 and x2 . Then, we have that
F (x1 )T (x2 − x1 ) ≥ 0 =⇒ F (x2 )T (x2 − x1 ) > 0.
(3.12)
But x2 solves VI(X, F ), implying that F (x2 )T (x − x2 ) ≥ 0 for all x ∈ X. In
particular, this holds for x = x1 . But this contradicts (3.12) and it follows that
VI(X, F ) admits a unique solution.
(ii) If F is strongly pseudomonotone over X, it is strictly pseudomonotone over X
and the result follows from (i).
We leverage this condition in developing a simple integration-free statement for
the uniqueness of solutions to pseudomonotone SVIs.
Proposition 7 (Uniqueness of solutions to strongly/strictly pseudomonotone SVI(X, F )). Let X be a closed and convex set in Rn and let F : X × Ω → Rn
be a pseudomonotone map in x for every ω ∈ Ω. Suppose SVI(X, F) is solvable and one of the following statements holds:
(a) F (x; ω) is strictly pseudomonotone on X for ω ∈ U where U ⊆ Ω and
µ(U ) > 0.
(b) F (x; ω) is strongly pseudomonotone on X for ω ∈ U where U ⊆ Ω and
µ(U ) > 0.
Then SVI(X, F ) admits a unique solution.
Proof. The proof follows from showing that (a) and (b) respectively imply that
F is strictly and strongly pseudomonotone over X by leveraging Lemma 9. The
required result follows from Theorem 11.
Remark: The above uniqueness result only requires that strong/strict pseudomonotonicity hold for F (x; ω) for a set of positive measure and not necessarily in an
almost sure sense.
3.3 Extragradient-based stochastic approximation schemes
3.3.1 Background and Assumptions
Given an x0 ∈ X in a traditional SA scheme and a steplength sequence {γk }, a
sequence {xk } is constructed via the following update rule:
xk+1 := ΠX (xk − γk (F (xk ) + wk )),
k ≥ 0,
(SA)
where w_k is defined as follows:
w_k ≜ F(x_k; ω_k) − E[F(x_k; ω_k)] = F(x_k; ω_k) − F(x_k).                             (3.13)
We consider a stochastic extragradient scheme akin to that presented in [101]. Given an x_0 ∈ X and a steplength sequence {γ_k}, this scheme comprises two steps for k ≥ 0:
x_{k+1/2} := Π_X(x_k − γ_k (F(x_k) + w_k)),
x_{k+1}   := Π_X(x_k − γ_k (F(x_{k+1/2}) + w_{k+1/2})),                                (ESA)
where w_{k+1/2} ≜ F(x_{k+1/2}; ω_{k+1/2}) − F(x_{k+1/2}). For any iteration k, the history F_k is given by F_k = {x_0, ω_0, ω_{1/2}, ω_1, . . . , ω_k} and F_{k+1/2} = F_k ∪ {ω_{k+1/2}}. We make the following assumptions on the conditional first and second moments:
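The two-step update above can be sketched compactly on a synthetic instance. The data, the Gaussian noise model, and the steplength rule below are hypothetical choices made only for illustration, but they are consistent with the moment and steplength assumptions that follow.

```python
import numpy as np

# Sketch of the extragradient scheme (ESA) on F(x) = E[F(x; omega)] = A x + b,
# a hypothetical strongly monotone instance with X = R^2_+ and additive
# zero-mean sampling noise (so the conditional-moment assumptions hold).
rng = np.random.default_rng(3)
A = np.array([[5.0, 1.0], [1.0, 4.0]])
b = np.array([-3.0, 2.0])
proj = lambda x: np.maximum(x, 0.0)

def F_sample(x):
    # F(x; omega) = F(x) + noise, noise ~ N(0, 0.1^2 I)
    return A @ x + b + 0.1 * rng.standard_normal(2)

x = np.array([5.0, 5.0])
for k in range(20000):
    gamma = 1.0 / (k + 10)                      # square-summable, non-summable
    x_half = proj(x - gamma * F_sample(x))      # extragradient half-step
    x = proj(x - gamma * F_sample(x_half))      # update using the half-step map

# For this instance complementarity gives x* = (0.6, 0); x ends up close to it.
```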
the following assumptions on the conditional first and second moments:
Assumption 4. At an iteration k, the following hold:
(A1) The conditional means E[w_k | F_k] and E[w_{k+1/2} | F_{k+1/2}] are zero;
(A2) The conditional second moments are bounded in an a.s. sense: E[‖w_k‖² | F_k] ≤ ν² and E[‖w_{k+1/2}‖² | F_{k+1/2}] ≤ ν².
Next, we provide assumptions on steplength sequences consistent with most SA
schemes.
Assumption 5 (Steplength sequences). The sequence $\{\gamma_k\}$ is a diminishing sequence satisfying the following:
(A3) The steplength sequence is square-summable: $\sum_{k=0}^{\infty} \gamma_k^2 < \infty$.
(A4) The steplength sequence is non-summable: $\sum_{k=0}^{\infty} \gamma_k = \infty$.
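For instance, the harmonic choice $\gamma_k = \gamma_0/(k+1)$ satisfies both (A3) and (A4); a quick numerical sketch (illustrative only):

```python
def partial_sums(N, gamma0=1.0):
    """Partial sums for gamma_k = gamma0/(k+1): the sum of squares converges
    (square-summable, A3) while the plain sum diverges (non-summable, A4)."""
    s, s2 = 0.0, 0.0
    for k in range(N):
        g = gamma0 / (k + 1)
        s += g
        s2 += g * g
    return s, s2

for N in (10**2, 10**4, 10**6):
    s, s2 = partial_sums(N)
    print(N, round(s, 2), round(s2, 4))
# The squared sum approaches pi^2/6 ~ 1.6449 while the plain sum grows like log N.
```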
We impose a further requirement on the map given by the following:
Assumption 6 (A5). The map F (x) satisfies the following for every x, y ∈ X:
kF (x) − F (y)k ≤ Lkx − yk + B,
(3.14)
where L, B ≥ 0.
Since we consider a stochastic setting, almost sure convergence is proved via
the following super-martingale convergence results from [69, lemma 10, page 49]
and [69, lemma 11, page 50] respectively.
Lemma 12. Let $V_k$ be a sequence of nonnegative random variables adapted to a $\sigma$-algebra $\mathcal{F}_k$ and such that
\[ \mathbb{E}[V_{k+1} \mid \mathcal{F}_k] \le (1 - u_k)V_k + \psi_k, \quad \forall k \ge 0, \text{ a.s.}, \]
where $0 \le u_k \le 1$, $\psi_k \ge 0$, $\sum_{k=0}^{\infty} u_k = \infty$, $\sum_{k=0}^{\infty} \psi_k < \infty$, and $\lim_{k\to\infty} \psi_k/u_k = 0$. Then $V_k \to 0$ in an a.s. sense.
Lemma 13. Let $V_k$, $u_k$, $\delta_k$, and $\psi_k$ be nonnegative random variables adapted to a $\sigma$-algebra $\mathcal{F}_k$. If a.s. $\sum_{k=0}^{\infty} u_k < \infty$, $\sum_{k=0}^{\infty} \psi_k < \infty$, and
\[ \mathbb{E}[V_{k+1} \mid \mathcal{F}_k] \le (1 + u_k)V_k - \delta_k + \psi_k, \quad \forall k \ge 0, \]
then $V_k$ is convergent in an a.s. sense and $\sum_{k=0}^{\infty} \delta_k < \infty$ in an a.s. sense.
3.3.2 An extragradient SA scheme
In this subsection, we prove that the iterates generated by the (ESA) scheme
converge to the solution set of the original problem in an almost sure sense by
leveraging ideas drawn from the proof of the deterministic version presented in [18,
Lemma 12.1.10]. A challenge in this scheme arises from the two independent stochastic errors introduced by the two sub-steps at every iteration, and from the lack of a direct expression for $x_{k+1}$ in terms of $F(x_k)$, unlike the standard projection scheme. We begin by relating any two successive iterates in Lemma 14.
Lemma 14. Consider SVI(X, F) where F is a continuous map for all $x \in X$. Suppose $x^*$ is any solution of SVI(X, F). Let the iterates be generated by the extragradient scheme (ESA) and let $u_k = 2\gamma_k F(x_{k+1/2})^T(x_{k+1/2} - x^*)$. Suppose $\beta$ and $c$ are constants such that
\[ \left(1 - \frac{1}{1+\beta} - \frac{1}{1+c}\right) \ge 0. \]
Then the following holds for any iterate k:
\[ \|x_{k+1}-x^*\|^2 \le \|x_k-x^*\|^2 - \left(1 - 2\gamma_k^2(1+\beta)L^2\right)\|x_k-x_{k+1/2}\|^2 + 2\gamma_k^2(1+\beta)B^2 - u_k + (1+c)\gamma_k^2\|w_{k+1/2}-w_k\|^2 + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2}. \tag{3.15} \]
Proof. Let $y_k = x_k - \gamma_k(F(x_{k+1/2}) + w_{k+1/2})$. Then,
\[ \|x_{k+1}-x^*\|^2 = \|\Pi_X(y_k)-x^*\|^2 = \|y_k-x^*\|^2 + \|\Pi_X(y_k)-y_k\|^2 + 2(\Pi_X(y_k)-y_k)^T(y_k-x^*). \]
Note that
\begin{align*}
2\|y_k-\Pi_X(y_k)\|^2 + 2(\Pi_X(y_k)-y_k)^T(y_k-x^*) &= 2\|y_k-\Pi_X(y_k)\|^2 + 2(\Pi_X(y_k)-y_k)^T(y_k-\Pi_X(y_k)+\Pi_X(y_k)-x^*) \\
&= 2\|y_k-\Pi_X(y_k)\|^2 - 2\|y_k-\Pi_X(y_k)\|^2 + 2(\Pi_X(y_k)-y_k)^T(\Pi_X(y_k)-x^*) \\
&= 2(\Pi_X(y_k)-y_k)^T(\Pi_X(y_k)-x^*) \le 0,
\end{align*}
where the last inequality follows from the projection property. As a consequence, we have that
\[ \|y_k-\Pi_X(y_k)\|^2 + 2(\Pi_X(y_k)-y_k)^T(y_k-x^*) \le -\|y_k-\Pi_X(y_k)\|^2. \tag{3.16} \]
It follows from (3.16) that
\[ \|x_{k+1}-x^*\|^2 = \|y_k-x^*\|^2 + \|\Pi_X(y_k)-y_k\|^2 + 2(\Pi_X(y_k)-y_k)^T(y_k-x^*) \le \|y_k-x^*\|^2 - \|y_k-\Pi_X(y_k)\|^2. \]
By expanding $\|y_k-x^*\|^2 - \|y_k-\Pi_X(y_k)\|^2$,
\begin{align*}
\|x_{k+1}-x^*\|^2 &\le \|x_k-\gamma_k(F(x_{k+1/2})+w_{k+1/2})-x^*\|^2 - \|x_k-\gamma_k(F(x_{k+1/2})+w_{k+1/2})-x_{k+1}\|^2 \\
&= \|x_k-x^*\|^2 + \gamma_k^2\|F(x_{k+1/2})+w_{k+1/2}\|^2 - 2\gamma_k(x_k-x^*)^T(F(x_{k+1/2})+w_{k+1/2}) \\
&\quad - \|x_{k+1}-x_k\|^2 - \gamma_k^2\|F(x_{k+1/2})+w_{k+1/2}\|^2 + 2\gamma_k(x_k-x_{k+1})^T(F(x_{k+1/2})+w_{k+1/2}) \\
&= \|x_k-x^*\|^2 - \|x_k-x_{k+1}\|^2 + 2\gamma_k(x^*-x_{k+1})^T(F(x_{k+1/2})+w_{k+1/2}) \\
&= \|x_k-x^*\|^2 - \|x_k-x_{k+1}\|^2 + 2\gamma_k(x_{k+1/2}-x_{k+1})^T(F(x_{k+1/2})+w_{k+1/2}) \\
&\quad + 2\gamma_k(x^*-x_{k+1/2})^T F(x_{k+1/2}) + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2}.
\end{align*}
Through the addition and subtraction of $x_{k+1/2}$, we obtain the following:
\begin{align}
&\|x_k-x^*\|^2 - \|x_k-x_{k+1}\|^2 + 2\gamma_k(x_{k+1/2}-x_{k+1})^T(F(x_{k+1/2})+w_{k+1/2}) + 2\gamma_k(x^*-x_{k+1/2})^T F(x_{k+1/2}) + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2} \notag\\
&\le \|x_k-x^*\|^2 - \|x_k-x_{k+1/2}\|^2 - \|x_{k+1/2}-x_{k+1}\|^2 + 2\gamma_k(x^*-x_{k+1/2})^T F(x_{k+1/2}) \notag\\
&\quad + 2(x_{k+1}-x_{k+1/2})^T\left(x_k-\gamma_k(F(x_{k+1/2})+w_{k+1/2})-x_{k+1/2}\right) + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2}. \tag{3.17}
\end{align}
By the variational characterization of the projection, for any $x \in \mathbb{R}^n$ we have that $(y-\Pi_X(x))^T(\Pi_X(x)-x) \ge 0$ for all $y \in X$. Consequently, we have the following set of relationships, where the last inequality follows from the projection property:
\begin{align*}
&(x_{k+1}-x_{k+1/2})^T\left(x_k-\gamma_k(F(x_{k+1/2})+w_{k+1/2})-x_{k+1/2}\right) \\
&= (x_{k+1}-x_{k+1/2})^T\left(x_k-\gamma_k(F(x_k)+w_k)-x_{k+1/2}\right) + \gamma_k(x_{k+1}-x_{k+1/2})^T(F(x_k)-F(x_{k+1/2})) - \gamma_k(x_{k+1}-x_{k+1/2})^T(w_{k+1/2}-w_k) \\
&= \left(x_{k+1}-\Pi_X(x_k-\gamma_k(F(x_k)+w_k))\right)^T\left(x_k-\gamma_k(F(x_k)+w_k)-\Pi_X(x_k-\gamma_k(F(x_k)+w_k))\right) \\
&\quad + \gamma_k(x_{k+1}-x_{k+1/2})^T(F(x_k)-F(x_{k+1/2})) - \gamma_k(x_{k+1}-x_{k+1/2})^T(w_{k+1/2}-w_k) \\
&\le \gamma_k(x_{k+1}-x_{k+1/2})^T(F(x_k)-F(x_{k+1/2})) - \gamma_k(x_{k+1}-x_{k+1/2})^T(w_{k+1/2}-w_k).
\end{align*}
Therefore, from (3.17),
\begin{align}
\|x_{k+1}-x^*\|^2 &\le \|x_k-x^*\|^2 - \|x_k-x_{k+1/2}\|^2 - \|x_{k+1/2}-x_{k+1}\|^2 - u_k + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2} \notag\\
&\quad + 2\gamma_k(x_{k+1}-x_{k+1/2})^T(F(x_k)-F(x_{k+1/2})) + 2\gamma_k(x_{k+1}-x_{k+1/2})^T(w_k-w_{k+1/2}). \tag{3.19}
\end{align}
(3.19)
We may now bound $2\gamma_k(x_{k+1}-x_{k+1/2})^T(F(x_k)-F(x_{k+1/2}))$ by leveraging the identity $2ab \le a^2 + b^2$ and subsequently applying (A5), leading to the following sequence of inequalities:
\begin{align*}
\|x_{k+1}-x^*\|^2 &\le \|x_k-x^*\|^2 - \|x_k-x_{k+1/2}\|^2 - \|x_{k+1/2}-x_{k+1}\|^2 - u_k \\
&\quad + \frac{1}{1+\beta}\|x_{k+1}-x_{k+1/2}\|^2 + (1+\beta)\gamma_k^2\|F(x_k)-F(x_{k+1/2})\|^2 \\
&\quad - 2\gamma_k(x_{k+1}-x_{k+1/2})^T(w_{k+1/2}-w_k) + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2} \\
&\le \|x_k-x^*\|^2 - \left(1-\frac{1}{1+\beta}\right)\|x_{k+1}-x_{k+1/2}\|^2 - \|x_k-x_{k+1/2}\|^2 - u_k \\
&\quad + 2\gamma_k^2(1+\beta)L^2\|x_k-x_{k+1/2}\|^2 + 2\gamma_k^2(1+\beta)B^2 - 2\gamma_k(x_{k+1}-x_{k+1/2})^T(w_{k+1/2}-w_k) \\
&\quad + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2}.
\end{align*}
By completing squares in the penultimate term of the previous expression, we have the following:
\begin{align}
\|x_{k+1}-x^*\|^2 &\le \|x_k-x^*\|^2 - \left(1-\frac{1}{1+\beta}-\frac{1}{1+c}\right)\|x_{k+1}-x_{k+1/2}\|^2 - \|x_k-x_{k+1/2}\|^2 - u_k \notag\\
&\quad + 2\gamma_k^2(1+\beta)L^2\|x_k-x_{k+1/2}\|^2 + 2\gamma_k^2(1+\beta)B^2 + (1+c)\gamma_k^2\|w_{k+1/2}-w_k\|^2 + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2} \notag\\
&\le \|x_k-x^*\|^2 - \left(1-\frac{1}{1+\beta}-\frac{1}{1+c}\right)\|x_{k+1}-x_{k+1/2}\|^2 - \left(1-2\gamma_k^2(1+\beta)L^2\right)\|x_k-x_{k+1/2}\|^2 \notag\\
&\quad - u_k + 2\gamma_k^2(1+\beta)B^2 + (1+c)\gamma_k^2\|w_{k+1/2}-w_k\|^2 + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2} \notag\\
&\le \|x_k-x^*\|^2 - u_k - \left(1-2\gamma_k^2(1+\beta)L^2\right)\|x_k-x_{k+1/2}\|^2 + 2\gamma_k^2(1+\beta)B^2 \notag\\
&\quad + (1+c)\gamma_k^2\|w_{k+1/2}-w_k\|^2 + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2}, \tag{3.20}
\end{align}
where $\beta$ and $c$ are chosen such that $\left(1 - 1/(1+\beta) - 1/(1+c)\right) \ge 0$.
Note that in the above proof, $\beta, c \ge 1$ ensures that $\left(1 - 1/(1+\beta) - 1/(1+c)\right) \ge 0$.
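The projection property invoked repeatedly in the proof, $(y - \Pi_X(x))^T(\Pi_X(x) - x) \ge 0$ for all $y \in X$, can be verified numerically; a minimal sketch for a box set (an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)

def proj_box(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto X = [lo, hi]^n."""
    return np.clip(x, lo, hi)

# Variational characterization of the projection:
# (y - Pi_X(x))^T (Pi_X(x) - x) >= 0 for every y in X.
for _ in range(1000):
    x = rng.uniform(-5, 5, size=3)      # arbitrary point in R^n
    y = rng.uniform(-1, 1, size=3)      # arbitrary feasible point in X
    px = proj_box(x)
    assert (y - px) @ (px - x) >= -1e-12
print("projection inequality verified on random samples")
```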
While the above Lemma relates iterates for general mappings F , we use assumptions
on monotonicity to develop our convergence theory. We begin our analysis with
pseudomonotone-plus problems and prove the a.s. convergence of the iterates
generated by the extragradient scheme.
Proposition 8 (a.s. convergence of ESA). Let assumptions (A1)–(A5) hold. Consider SVI(X, F) where F is a continuous pseudomonotone-plus map over X and let $\gamma_0 \le \frac{1}{L}\sqrt{\frac{1}{2(1+\beta)}}$, where $\left(1 - 1/(1+\beta) - 1/(1+c)\right) \ge 0$. Then the extragradient scheme (ESA) generates a sequence $\{x_k\}$ such that $x_k \to x^*$ as $k \to \infty$ in an a.s. sense, where $x^* \in X^*$, the solution set of SVI(X, F).
Proof. Suppose $x^*$ is any solution of SVI(X, F). We use the expression in (3.20) and proceed further. Taking expectations conditioned on $\mathcal{F}_{k+1/2}$, noting that $\mathbb{E}[(x^*-x_{k+1/2})^T w_{k+1/2} \mid \mathcal{F}_{k+1/2}] = 0$ and that $1 - 2\gamma_k^2(1+\beta)L^2 \ge 0$ whenever $\gamma_k \le \frac{1}{L}\sqrt{\frac{1}{2(1+\beta)}}$, we obtain the following:
\begin{align*}
\mathbb{E}[\|x_{k+1}-x^*\|^2 \mid \mathcal{F}_{k+1/2}] &\le \|x_k-x^*\|^2 - \left(1-2\gamma_k^2(1+\beta)L^2\right)\|x_k-x_{k+1/2}\|^2 + 2\gamma_k^2(1+\beta)B^2 - u_k \\
&\quad + (1+c)\gamma_k^2\,\mathbb{E}[\|w_{k+1/2}-w_k\|^2 \mid \mathcal{F}_{k+1/2}] \\
&\le \|x_k-x^*\|^2 - \left(1-2(1+\beta)\gamma_k^2 L^2\right)\|x_k-x_{k+1/2}\|^2 + 2(1+c)\gamma_k^2\nu^2 + 2\gamma_k^2(1+\beta)B^2 - u_k.
\end{align*}
The remainder of the proof requires an application of the super-martingale convergence result (Lemma 13). Let $\psi_k$ and $\delta_k$ be given by
\[ \psi_k = \gamma_k^2\left(2(1+c)\nu^2 + 2(1+\beta)B^2\right), \qquad \delta_k = \left(1-2(1+\beta)\gamma_k^2 L^2\right)\|x_k-x_{k+1/2}\|^2 + u_k. \]
Since $x^*$ is a solution, $F(x^*)^T(x_{k+1/2}-x^*) \ge 0$; by the pseudomonotonicity of F, it follows that $u_k = 2\gamma_k F(x_{k+1/2})^T(x_{k+1/2}-x^*) \ge 0$. Choosing $\gamma_0$ such that $\gamma_0 \le \frac{1}{L}\sqrt{\frac{1}{2(1+\beta)}}$, it can be seen that $\delta_k \ge 0$. Using assumption (A3), it is observed that $\sum_k \psi_k < \infty$. Invoking Lemma 13, we have that $\{\|x_k-x^*\|^2\}$ is a convergent sequence in an a.s. sense and $\sum_k \delta_k < \infty$. From (A3), $\gamma_k \to 0$, so that $\left(1-2(1+\beta)\gamma_k^2 L^2\right)$ is eventually bounded away from zero; consequently, $\sum_k \|x_k-x_{k+1/2}\|^2 < \infty$. Hence, $\|x_k-x_{k+1/2}\| \to 0$ and $\|x_{k+1}-x_{k+1/2}\| \to 0$ as $k \to \infty$ in an a.s. sense. Furthermore, $\{x_k\}$ is a convergent sequence in an a.s. sense with a limit point $\hat{x}$, where $\hat{x}$ is not necessarily an element of $X^*$. Furthermore, from $\sum_k \delta_k < \infty$ and the non-summability of $\gamma_k$, we have that a.s.
\[ \sum_k u_k = \sum_k 2\gamma_k F(x_{k+1/2})^T(x_{k+1/2}-x^*) < \infty \implies \liminf_{k\to\infty} F(x_{k+1/2})^T(x_{k+1/2}-x^*) = 0. \tag{3.21} \]
But $\{x_{k+1/2}\}$ is a convergent sequence; hence, if along some subsequence $F(x_{k+1/2})^T(x_{k+1/2}-x^*) \to 0$ a.s., then we have that
\[ \lim_{k\to\infty} F(x_{k+1/2})^T(x_{k+1/2}-x^*) = 0, \quad \text{a.s.} \tag{3.22} \]
Recall that for any $\hat{x} \in X$, we have that $F(x^*)^T(\hat{x}-x^*) \ge 0$. By the pseudomonotonicity of F over X,
\[ F(x^*)^T(\hat{x}-x^*) \ge 0 \implies F(\hat{x})^T(\hat{x}-x^*) \ge 0. \tag{3.23} \]
Since $x_{k+1/2} \to \hat{x}$ as $k \to \infty$ in an a.s. sense, from (3.22) we obtain
\[ F(\hat{x})^T(\hat{x}-x^*) = 0. \tag{3.24} \]
Invoking the pseudomonotone-plus property of F together with (3.23) and (3.24), we have
\[ \Big[ F(x^*)^T(\hat{x}-x^*) \ge 0 \ \text{ and } \ F(\hat{x})^T(\hat{x}-x^*) = 0 \Big] \implies F(\hat{x}) = F(x^*). \tag{3.25} \]
Therefore from (3.25), the following holds:
\[ \forall x \in X, \quad F(\hat{x})^T(x-\hat{x}) = F(x^*)^T(x-\hat{x}) = F(x^*)^T(x-x^*) + F(x^*)^T(x^*-\hat{x}) \ge F(x^*)^T(x^*-\hat{x}) = F(\hat{x})^T(x^*-\hat{x}) = 0, \]
where the last equality follows from (3.24). It follows that $\hat{x}$ is a solution to SVI(X, F) and $\{x_k\}$ converges to some point in $X^*$ in an a.s. sense.
It is worth noting that while we impose a bound on $\gamma_0$, the above result can also be proved without such a requirement by considering a shifted variant of the super-martingale convergence result. Notably, the bound $\gamma_0 \le \frac{1}{L\sqrt{2(1+\beta)}}$ is akin to the steplength bound in extragradient methods for deterministic variational inequality problems, which is given by $\gamma < \frac{1}{L}$. Next, we extend the convergence theory to accommodate variants of pseudomonotonicity as well as problems satisfying the acute angle property.
Proposition 9 (a.s. convergence under variants of pseudomonotonicity). Consider the SVI(X, F). Let assumptions (A1)–(A5) hold. Consider a sequence of iterates $\{x_k\}$ generated by the extragradient scheme where $\gamma_0 \le \frac{1}{L}\sqrt{\frac{1}{2(1+\beta)}}$, and $\beta$ and $c$ are chosen such that $\left(1 - 1/(1+\beta) - 1/(1+c)\right) \ge 0$. Suppose one of the following statements holds:
(i) F satisfies the acute angle relation at X ∗ given by (3.5).
(ii) F is strictly pseudomonotone on X.
(iii) F is strongly pseudomonotone on X.
(iv) F is pseudomonotone on X and is given by the gradient of E[f (x, ω)].
Then, as k → ∞, xk → x∗ ∈ X ∗ in an almost sure sense.
Proof. We begin from (3.22) in the proof of Prop. 8 and, instead of the pseudomonotone-plus property, we impose the properties given by (i)–(iv).
(i) From (3.5), $u_k = 2\gamma_k F(x_{k+1/2})^T(x_{k+1/2}-x^*) > 0$. Invoking Lemma 13, $\sum_k u_k < \infty$ and $\{x_k\}$ is a convergent sequence. But $\sum_k \gamma_k = \infty$, implying that
\[ \liminf_{k\to\infty} F(x_{k+1/2})^T(x_{k+1/2}-x^*) = 0. \]
But $F(x_{k+1/2})^T(x_{k+1/2}-x^*) > 0$ for every $k$, implying that along some subsequence, $\{x_{k+1/2}\} \to x^*$. Since $\{x_{k+1/2}\}$ is a convergent sequence, it follows that
\[ \lim_{k\to\infty} x_{k+1/2} = x^* \quad \text{a.s.} \]
(ii) Since F is strictly pseudomonotone, F satisfies the acute angle relation at X ∗
and the result follows.
(iii) Since F is strongly pseudomonotone, F is strictly pseudomonotone and the
result follows.
(iv) From the proof of Prop. 8, we have that $\{x_{k+1/2}\}$ is a convergent sequence in an a.s. sense with limit point $\hat{x}$, which is not necessarily in $X^*$. By the pseudoconvexity of f(x), for any $x_1, x_2 \in X$,
\[ \nabla f(x_1)^T(x_2-x_1) \ge 0 \implies f(x_2) \ge f(x_1). \]
Applying this to $\hat{x}$ and $x^*$, we have the following:
\[ \nabla f(x^*)^T(\hat{x}-x^*) \ge 0 \implies f(\hat{x}) \ge f(x^*). \tag{3.26} \]
But from (3.24), we have that $\nabla f(\hat{x})^T(x^*-\hat{x}) = 0$, implying that $f(\hat{x}) \le f(x^*)$. It follows that $f(\hat{x}) = f(x^*)$. Consequently, $\hat{x}$ is a global solution of the corresponding optimization problem and a solution of SVI(X, F).
Next, we consider some instances where the mapping is monotone and VI(X, F )
satisfies a weak sharpness property. Prior to providing our main convergence
statement, we provide an intermediate lemma.
Lemma 15. Consider the SVI(X, F). Let assumptions (A1)–(A4) hold. Further suppose $\|F(x)-F(y)\| \le L\|x-y\|$ for every $x, y \in X$ and $\|F(x)\| \le C$ for every $x \in X$. Suppose the weak sharpness property holds for the mapping F and the solution set $X^*$ with parameter $\alpha$. Suppose F is monotone over X. Let $\beta$ and $c$ be chosen such that $1 - 1/(1+\beta) - 1/(1+c) \ge 0$. Then the following holds for every k:
\[ \mathbb{E}[\mathrm{dist}^2(x_{k+1},X^*) \mid \mathcal{F}_{k+1/2}] \le \left(1+\frac{L^2\gamma_k^2}{\bar{c}^2}\right)\mathrm{dist}^2(x_k,X^*) - \left(1-\gamma_k^2(1+\beta)L^2-2\bar{c}^2\right)\|x_k-x_{k+1/2}\|^2 - 2\gamma_k\alpha\,\mathrm{dist}(x_k,X^*) + \gamma_k^2\left(2(1+c)\nu^2+\frac{C^2}{4\bar{c}^2}\right). \tag{3.27} \]
Proof. Recalling (3.19), we derive an upper bound by first completing squares and then invoking the Lipschitz continuity of the map:
\begin{align}
\|x_{k+1}-x^*\|^2 &\le \|x_k-x^*\|^2 - \|x_k-x_{k+1/2}\|^2 - \|x_{k+1/2}-x_{k+1}\|^2 - u_k + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2} \notag\\
&\quad + 2\gamma_k(x_{k+1}-x_{k+1/2})^T(F(x_k)-F(x_{k+1/2})) + 2\gamma_k(x_{k+1}-x_{k+1/2})^T(w_k-w_{k+1/2}) \notag\\
&\le \|x_k-x^*\|^2 - \|x_k-x_{k+1/2}\|^2 - \|x_{k+1/2}-x_{k+1}\|^2 - u_k \notag\\
&\quad + \frac{1}{1+\beta}\|x_{k+1}-x_{k+1/2}\|^2 + (1+\beta)\gamma_k^2\|F(x_k)-F(x_{k+1/2})\|^2 \notag\\
&\quad + \frac{1}{1+c}\|x_{k+1}-x_{k+1/2}\|^2 + (1+c)\gamma_k^2\|w_{k+1/2}-w_k\|^2 + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2} \notag\\
&\le \|x_k-x^*\|^2 - \left(1-\frac{1}{1+\beta}-\frac{1}{1+c}\right)\|x_{k+1}-x_{k+1/2}\|^2 - \left(1-\gamma_k^2(1+\beta)L^2\right)\|x_k-x_{k+1/2}\|^2 - u_k \notag\\
&\quad + (1+c)\gamma_k^2\|w_{k+1/2}-w_k\|^2 + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2} \notag\\
&\le \|x_k-x^*\|^2 - u_k - \left(1-\gamma_k^2(1+\beta)L^2\right)\|x_k-x_{k+1/2}\|^2 \notag\\
&\quad + 2(1+c)\gamma_k^2\left(\|w_{k+1/2}\|^2+\|w_k\|^2\right) + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2}, \tag{3.28}
\end{align}
where the first coefficient satisfies $\left(1-\frac{1}{1+\beta}-\frac{1}{1+c}\right) \ge 0$ and
\begin{align}
u_k &= 2\gamma_k F(x_{k+1/2})^T(x_{k+1/2}-x^*) = 2\gamma_k F(x_{k+1/2})^T(x_k-x^*) + 2\gamma_k F(x_{k+1/2})^T(x_{k+1/2}-x_k) \notag\\
&= 2\gamma_k F(x_k)^T(x_k-x^*) + 2\gamma_k(F(x_{k+1/2})-F(x_k))^T(x_k-x^*) + 2\gamma_k F(x_{k+1/2})^T(x_{k+1/2}-x_k). \tag{3.29}
\end{align}
Suppose $t_k$ is defined as
\[ t_k = -2\gamma_k(F(x_{k+1/2})-F(x_k))^T(x_k-x^*) - 2\gamma_k F(x_{k+1/2})^T(x_{k+1/2}-x_k) + 2(1+c)\gamma_k^2\left(\|w_k\|^2+\|w_{k+1/2}\|^2\right), \tag{3.30} \]
implying that $-u_k + 2(1+c)\gamma_k^2\left(\|w_k\|^2+\|w_{k+1/2}\|^2\right) = t_k - 2\gamma_k F(x_k)^T(x_k-x^*)$.
Therefore,
\[ \|x_{k+1}-x^*\|^2 \le \|x_k-x^*\|^2 + t_k - \left(1-\gamma_k^2(1+\beta)L^2\right)\|x_k-x_{k+1/2}\|^2 - 2\gamma_k F(x_k)^T(x_k-x^*) + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2}. \tag{3.31} \]
Taking expectations conditioned on $\mathcal{F}_{k+1/2}$ and noting that $\mathbb{E}[2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2} \mid \mathcal{F}_{k+1/2}] = 0$, we obtain the following inequality:
\begin{align}
\mathbb{E}[\|x_{k+1}-x^*\|^2 \mid \mathcal{F}_{k+1/2}] &\le \|x_k-x^*\|^2 - \left(1-\gamma_k^2(1+\beta)L^2\right)\|x_k-x_{k+1/2}\|^2 - 2\gamma_k F(x_k)^T(x_k-x^*) + \mathbb{E}[t_k \mid \mathcal{F}_{k+1/2}] \notag\\
&= \|x_k-x^*\|^2 - \left(1-\gamma_k^2(1+\beta)L^2\right)\|x_k-x_{k+1/2}\|^2 - 2\gamma_k F(x^*)^T(x_k-x^*) \notag\\
&\quad - 2\gamma_k(F(x_k)-F(x^*))^T(x_k-x^*) + \mathbb{E}[t_k \mid \mathcal{F}_{k+1/2}] \notag\\
&\le \|x_k-x^*\|^2 - \left(1-\gamma_k^2(1+\beta)L^2\right)\|x_k-x_{k+1/2}\|^2 - 2\gamma_k\alpha\,\mathrm{dist}(x_k,X^*) + \mathbb{E}[t_k \mid \mathcal{F}_{k+1/2}], \tag{3.32}
\end{align}
where the last inequality is a consequence of invoking the weak-sharpness property and the monotonicity of F.
We now derive a bound for the last term in this inequality by utilizing (3.30):
\begin{align}
\mathbb{E}[t_k \mid \mathcal{F}_{k+1/2}] &\le 2\gamma_k L\|x_{k+1/2}-x_k\|\|x_k-x^*\| + 2\gamma_k\|F(x_{k+1/2})\|\|x_{k+1/2}-x_k\| + 2(1+c)\gamma_k^2\nu^2 \notag\\
&\le \bar{c}^2\|x_{k+1/2}-x_k\|^2 + \frac{\gamma_k^2 L^2}{\bar{c}^2}\|x_k-x^*\|^2 + C\gamma_k\|x_{k+1/2}-x_k\| + 2(1+c)\gamma_k^2\nu^2, \tag{3.33}
\end{align}
where the first inequality follows by noting that $\mathbb{E}\left[\|w_{k+1/2}\|^2+\|w_k\|^2 \mid \mathcal{F}_{k+1/2}\right] \le 2\nu^2$, and the second inequality is a consequence of noting the following when $\bar{c} > 0$:
\[ 2\gamma_k L\|x_k-x^*\|\|x_{k+1/2}-x_k\| \le \bar{c}^2\|x_{k+1/2}-x_k\|^2 + \frac{\gamma_k^2 L^2}{\bar{c}^2}\|x_k-x^*\|^2. \]
As a result, (3.33) can be bounded as follows:
\begin{align}
&\bar{c}^2\|x_{k+1/2}-x_k\|^2 + \frac{\gamma_k^2 L^2}{\bar{c}^2}\|x_k-x^*\|^2 + C\gamma_k\|x_{k+1/2}-x_k\| + 2(1+c)\gamma_k^2\nu^2 \notag\\
&\le \bar{c}^2\|x_{k+1/2}-x_k\|^2 + \frac{L^2\gamma_k^2}{\bar{c}^2}\|x_k-x^*\|^2 + \bar{c}^2\|x_{k+1/2}-x_k\|^2 + \frac{C^2\gamma_k^2}{4\bar{c}^2} + 2(1+c)\gamma_k^2\nu^2 \notag\\
&= 2\bar{c}^2\|x_{k+1/2}-x_k\|^2 + \frac{L^2\gamma_k^2}{\bar{c}^2}\|x_k-x^*\|^2 + \gamma_k^2\left(2(1+c)\nu^2+\frac{C^2}{4\bar{c}^2}\right). \tag{3.34}
\end{align}
Consequently, we obtain the following expression:
\[ \mathbb{E}[\|x_{k+1}-x^*\|^2 \mid \mathcal{F}_{k+1/2}] \le \left(1+\frac{L^2\gamma_k^2}{\bar{c}^2}\right)\|x_k-x^*\|^2 - \left(1-\gamma_k^2(1+\beta)L^2-2\bar{c}^2\right)\|x_k-x_{k+1/2}\|^2 - 2\gamma_k\alpha\,\mathrm{dist}(x_k,X^*) + \gamma_k^2\left(2(1+c)\nu^2+\frac{C^2}{4\bar{c}^2}\right). \tag{3.35} \]
By noting that $\mathbb{E}[\|x_{k+1}-x^*\|^2 \mid \mathcal{F}_{k+1/2}] \ge \mathbb{E}[\mathrm{dist}^2(x_{k+1},X^*) \mid \mathcal{F}_{k+1/2}]$, we obtain that
\[ \mathbb{E}[\mathrm{dist}^2(x_{k+1},X^*) \mid \mathcal{F}_{k+1/2}] \le \left(1+\frac{L^2\gamma_k^2}{\bar{c}^2}\right)\|x_k-x^*\|^2 - \left(1-\gamma_k^2(1+\beta)L^2-2\bar{c}^2\right)\|x_k-x_{k+1/2}\|^2 - 2\gamma_k\alpha\,\mathrm{dist}(x_k,X^*) + \gamma_k^2\left(2(1+c)\nu^2+\frac{C^2}{4\bar{c}^2}\right). \tag{3.36} \]
By minimizing the right-hand side in $x^*$ over $X^*$, we obtain the required relationship:
\[ \mathbb{E}[\mathrm{dist}^2(x_{k+1},X^*) \mid \mathcal{F}_{k+1/2}] \le \left(1+\frac{L^2\gamma_k^2}{\bar{c}^2}\right)\mathrm{dist}^2(x_k,X^*) - \left(1-\gamma_k^2(1+\beta)L^2-2\bar{c}^2\right)\|x_k-x_{k+1/2}\|^2 - 2\gamma_k\alpha\,\mathrm{dist}(x_k,X^*) + \gamma_k^2\left(2(1+c)\nu^2+\frac{C^2}{4\bar{c}^2}\right). \tag{3.37} \]
We now utilize this lemma to prove that the sequence of iterates produced by the extragradient scheme converges under a weak-sharpness assumption, together with the additional requirements that the map is Lipschitz continuous and monotone.
Proposition 10 (a.s. convergence under monotonicity and weak-sharpness). Consider the SVI(X, F). Let assumptions (A1)–(A4) hold. Further suppose $\|F(x)-F(y)\| \le L\|x-y\|$ for every $x, y \in X$ and $\|F(x)\| \le C$ for every $x \in X$. Suppose the weak sharpness property holds for the mapping F and the solution set $X^*$ with parameter $\alpha$. Let $\gamma_0 \le \sqrt{\frac{1-2\bar{c}^2}{(1+\beta)L^2}}$ for some $0 < \bar{c} < \frac{1}{\sqrt{2}}$, and suppose F is monotone over X. Let $\beta$ and $c$ be chosen such that $1 - 1/(1+\beta) - 1/(1+c) \ge 0$. Then the sequence $\{x_k\}$ generated by (ESA) converges to an $x^* \in X^*$ in an almost sure sense.
Proof. We begin by noting that when $\gamma_k$ is a non-increasing sequence with $\gamma_0 \le \sqrt{\frac{1-2\bar{c}^2}{(1+\beta)L^2}}$ and $\bar{c} < \frac{1}{\sqrt{2}}$, the recursion (3.37) can be simplified as
\[ \mathbb{E}[\mathrm{dist}^2(x_{k+1},X^*) \mid \mathcal{F}_{k+1/2}] \le \left(1+\frac{L^2\gamma_k^2}{\bar{c}^2}\right)\mathrm{dist}^2(x_k,X^*) - 2\gamma_k\alpha\,\mathrm{dist}(x_k,X^*) + \gamma_k^2\left(2(1+c)\nu^2+\frac{C^2}{4\bar{c}^2}\right). \tag{3.38} \]
Since the hypotheses of Lemma 13 hold from the square summability of $\gamma_k$, it follows that $\mathrm{dist}(x_k,X^*) \to \bar{d} \ge 0$ in an a.s. sense (implying that $\{x_k\}$ is a convergent sequence) and $\sum_k 2\gamma_k\alpha\,\mathrm{dist}(x_k,X^*) < \infty$ holds in an a.s. sense. But $\sum_k \gamma_k = \infty$, implying that the following holds a.s.:
\[ \liminf_{k\to\infty} \mathrm{dist}(x_k,X^*) = 0. \]
In other words, a subsequence of $\{x_k\}$ converges to a point in $X^*$ in an a.s. sense. But $\{x_k\}$ is a convergent sequence, leading to the conclusion that all subsequences of $\{x_k\}$ converge to $X^*$ in an a.s. sense. The required result follows.
Remark: A natural question emerges as to the relevance of this result, given that monotone maps have been studied in the past (cf. [31, 101]). First, we present statements showing that the original sequence converges almost surely to a point in the solution set, rather than showing that the averaged counterpart converges to the solution set. Second, in contrast with [31], we do not resort to regularization in deriving almost-sure convergence statements. Third, we proceed to show that under the prescribed assumptions, we obtain the optimal rate of convergence in the solution estimators, rather than the gap function. Finally, we do not require that X be bounded for claiming the a.s. convergence of solution iterates in this regime.
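As an illustration of the regime covered by Proposition 10, the following sketch runs (ESA) on an assumed toy problem: $X = [0,1]^2$ with the constant map $F(x) = (1,1)$, which is monotone, Lipschitz (with L = 0), bounded, and weakly sharp with $X^* = \{0\}$ (an illustrative example, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)

def proj_box(x):
    """Projection onto X = [0, 1]^2."""
    return np.clip(x, 0.0, 1.0)

def F(x):
    # Constant map F(x) = (1, 1): since x >= 0 on X, F(0)^T (x - 0)
    # = x_1 + x_2 >= ||x||, so the weak sharpness property holds with alpha = 1.
    return np.ones_like(x)

def esa(x0, gamma0=0.5, iters=5000, noise=0.2):
    """ESA iteration with diminishing steps gamma_k = gamma0/k."""
    x = np.array(x0, dtype=float)
    for k in range(1, iters + 1):
        g = gamma0 / k
        x_half = proj_box(x - g * (F(x) + noise * rng.standard_normal(2)))
        x = proj_box(x - g * (F(x_half) + noise * rng.standard_normal(2)))
    return x

x_final = esa([1.0, 1.0])
print("dist to X* =", np.linalg.norm(x_final))  # approaches 0
```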
3.3.3 Mirror-prox generalizations
Given a point and a set, the standard projection operation computes the closest feasible point, with the Euclidean norm as the distance function. A generalization of this operation [32, 51] utilizes a class of distance functions that includes the Euclidean norm as a special case. Given a distance-generating function s(x), the prox function V(x, z) takes the form
\[ V(x,z) \triangleq s(z) - \left[ s(x) + \nabla s(x)^T(z-x) \right], \tag{3.39} \]
where s(x) is assumed to be a strongly convex function in x. The resulting prox subproblem is given by the following:
\[ P_x(r) = \operatorname*{arg\,min}_{z \in X} \left\{ r^T z + V(x,z) \right\}. \tag{3.40} \]
It is noted that for $s(x) = \frac{1}{2}\|x\|^2$ and $r = \gamma F(x)$, (3.40) reduces to the standard gradient projection step. Recent work [78] proposes prox generalizations of the extragradient scheme for pseudomonotone deterministic variational problems. Stochastic variants of these prox schemes have been suggested in [101]; however, those settings obtain error bounds under a monotonicity assumption. We consider a prox-based generalization of the extragradient scheme for stochastic variational problems (referred to as mirror-prox in [101]). Our contribution lies in showing that the sequence of iterates converges a.s. to the solution set in the pseudomonotone settings considered earlier, under appropriate choices of steplengths. Formally, the mirror-prox stochastic approximation scheme is defined as follows:
\[ x_{k+1/2} := P_{x_k}\left(\gamma_k F(x_k;\omega_k)\right), \qquad x_{k+1} := P_{x_k}\left(\gamma_k F(x_{k+1/2};\omega_{k+1/2})\right), \quad k \ge 0. \tag{MPSA} \]
From the strong convexity of s(x), it is easy to ascertain that there exists a positive scalar $\theta$ such that
\[ V(x,z) \ge \frac{\theta}{2}\|x-z\|^2. \tag{3.41} \]
Under the assumption that $\nabla s(x)$ is Lipschitz continuous with a finite nonnegative constant $L_V$, the following holds [108, Lemma 1.2.3]:
\[ V(x,z) \le \frac{L_V}{2}\|x-z\|^2. \tag{3.42} \]
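As a concrete instance of (3.40)–(3.41), the following sketch uses the negative-entropy distance-generating function $s(x) = \sum_i x_i \log x_i$ over the probability simplex, for which V(x, z) is the Kullback–Leibler divergence and the prox step has a closed-form multiplicative update (an illustrative choice; by Pinsker's inequality, (3.41) then holds with $\theta = 1$):

```python
import numpy as np

def entropy_prox(x, r):
    """Prox step P_x(r) = argmin_z { r^T z + V(x, z) } over the simplex,
    with s(x) = sum_i x_i log x_i, so V(x, z) is the KL divergence.
    Closed form: z_i proportional to x_i * exp(-r_i)."""
    z = x * np.exp(-r - np.max(-r))  # max-shift for numerical stability
    return z / z.sum()

def objective(x, r, z):
    """Prox objective r^T z + V(x, z) for points in the simplex."""
    return r @ z + np.sum(z * np.log(z / x))

x = np.array([0.5, 0.3, 0.2])
r = np.array([0.4, -0.1, 0.25])
z_star = entropy_prox(x, r)

# z_star should beat random feasible candidates on the prox objective.
rng = np.random.default_rng(3)
cands = np.clip(rng.dirichlet(np.ones(3), size=2000), 1e-12, None)
cands /= cands.sum(axis=1, keepdims=True)
best = min(objective(x, r, z) for z in cands)
ok = objective(x, r, z_star) <= best + 1e-9
print(ok)
```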
We use the following result from [78].
Lemma 16. Let X be a convex set in $\mathbb{R}^n$ and let $p : X \to \mathbb{R}$ be a differentiable convex function. Assume that $u^*$ is an optimal solution of $\min\{p(u) + V(\tilde{x},u) : u \in X\}$. Then the following holds:
\[ p(u^*) + V(u^*,u) + V(\tilde{x},u^*) \le p(u) + V(\tilde{x},u), \quad \text{for all } u \in X. \]
The next lemma relates prox functions over successive iterates and is the analog
of Lemma 14 from standard extragradient schemes.
Lemma 17. Consider SVI(X, F) where F is a continuous map for all $x \in X$. Suppose $x^*$ is any solution of SVI(X, F). Consider the iterates generated by (MPSA) and suppose assumption (A5) holds. Suppose $\beta$ and $c$ are such that $\theta/2 - 1/(1+\beta) - 1/(1+c) \ge 0$. Then the following holds for any iterate k:
\[ \gamma_k w_{k+1/2}^T(x_{k+1/2}-x^*) + u_k + V(x_{k+1},x^*) \le V(x_k,x^*) - \left(\frac{\theta}{2}-2\gamma_k^2(1+\beta)L^2\right)\|x_{k+1/2}-x_k\|^2 + \gamma_k^2(1+c)\|w_{k+1/2}-w_k\|^2 + 2\gamma_k^2(1+\beta)B^2, \]
where $u_k = 2\gamma_k F(x_{k+1/2})^T(x_{k+1/2}-x^*)$ and $\theta > 0$ is the scalar in (3.41).
Proof. From the definition of the iterates, with $x_{k+1/2}$ being the solution to the first prox subproblem in (MPSA), Lemma 16 yields
\[ \gamma_k F(x_k;\omega_k)^T x_{k+1/2} + V(x_{k+1/2},x) + V(x_k,x_{k+1/2}) \le \gamma_k F(x_k;\omega_k)^T x + V(x_k,x), \tag{3.43} \]
implying that
\[ \gamma_k F(x_k;\omega_k)^T(x_{k+1/2}-x) + V(x_{k+1/2},x) + V(x_k,x_{k+1/2}) \le V(x_k,x). \]
Setting $x = x_{k+1}$, we have
\[ \gamma_k F(x_k;\omega_k)^T(x_{k+1/2}-x_{k+1}) + V(x_k,x_{k+1/2}) + V(x_{k+1/2},x_{k+1}) \le V(x_k,x_{k+1}). \tag{3.44} \]
Similarly, with xk+1 being the solution to the second prox subproblem (MPSA),
γk F (xk+1/2 , ωk+1/2 )T (xk+1 − x) + V (xk , xk+1 ) + V (xk+1 , x) ≤ V (xk , x).
(3.45)
Using the expression for V (xk , xk+1 ) from (3.44), we have
V (xk , x) ≥ γk F (xk+1/2 ; ωk+1/2 )T (xk+1 − x) + γk (F (xk ; ωk ))T (xk+1/2 − xk+1 )
+ V (xk , xk+1/2 ) + V (xk+1/2 , xk+1 ) + V (xk+1 , x).
Adding and subtracting xk+1/2 from the first term on the right hand side, we have
V (xk , x) ≥ γk F (xk+1/2 ; ωk+1/2 )T (xk+1/2 − x)
+ γk F (xk+1/2 ; ωk+1/2 )T (xk+1 − xk+1/2 )
+ γk F (xk ; ωk )T (xk+1/2 − xk+1 ) + V (xk , xk+1/2 )
+ V (xk+1/2 , xk+1 ) + V (xk+1 , x).
(3.46)
Setting $x = x^*$ and using $F(x_{k+1/2};\omega_{k+1/2}) = F(x_{k+1/2}) + w_{k+1/2}$, (3.46) can be rewritten as follows:
\[ \gamma_k w_{k+1/2}^T(x_{k+1/2}-x^*) + u_k + \gamma_k\left(F(x_{k+1/2};\omega_{k+1/2})-F(x_k;\omega_k)\right)^T(x_{k+1}-x_{k+1/2}) + V(x_k,x_{k+1/2}) + V(x_{k+1/2},x_{k+1}) + V(x_{k+1},x^*) \le V(x_k,x^*), \]
where $u_k = 2\gamma_k F(x_{k+1/2})^T(x_{k+1/2}-x^*)$. Expanding and rearranging, we have
\[ \gamma_k w_{k+1/2}^T(x_{k+1/2}-x^*) + u_k + V(x_k,x_{k+1/2}) + V(x_{k+1/2},x_{k+1}) + V(x_{k+1},x^*) \le V(x_k,x^*) - \gamma_k(F(x_{k+1/2})-F(x_k))^T(x_{k+1}-x_{k+1/2}) - \gamma_k(w_{k+1/2}-w_k)^T(x_{k+1}-x_{k+1/2}). \]
Using the Cauchy-Schwarz inequality, we have
\[ \gamma_k w_{k+1/2}^T(x_{k+1/2}-x^*) + u_k + V(x_k,x_{k+1/2}) + V(x_{k+1/2},x_{k+1}) + V(x_{k+1},x^*) \le V(x_k,x^*) + \gamma_k\|F(x_{k+1/2})-F(x_k)\|\|x_{k+1}-x_{k+1/2}\| + \gamma_k\|w_{k+1/2}-w_k\|\|x_{k+1}-x_{k+1/2}\|. \tag{3.47} \]
Applying (A5), $\|F(x_{k+1/2})-F(x_k)\|^2 \le 2L^2\|x_{k+1/2}-x_k\|^2 + 2B^2$, and completing squares for the last two terms on the right in (3.47), we have
\begin{align*}
&\gamma_k w_{k+1/2}^T(x_{k+1/2}-x^*) + u_k + V(x_k,x_{k+1/2}) + V(x_{k+1/2},x_{k+1}) + V(x_{k+1},x^*) \\
&\le V(x_k,x^*) + \frac{1}{1+\beta}\|x_{k+1}-x_{k+1/2}\|^2 + \frac{1}{1+c}\|x_{k+1}-x_{k+1/2}\|^2 \\
&\quad + 2\gamma_k^2(1+\beta)\left(L^2\|x_{k+1/2}-x_k\|^2 + B^2\right) + \gamma_k^2(1+c)\|w_{k+1/2}-w_k\|^2.
\end{align*}
Using the strong convexity relations $V(x_{k+1/2},x_{k+1}) \ge \frac{\theta}{2}\|x_{k+1/2}-x_{k+1}\|^2$ and $V(x_k,x_{k+1/2}) \ge \frac{\theta}{2}\|x_{k+1/2}-x_k\|^2$, we have
\begin{align*}
&\gamma_k w_{k+1/2}^T(x_{k+1/2}-x^*) + u_k + V(x_{k+1},x^*) \\
&\le V(x_k,x^*) - \left(\frac{\theta}{2}-\frac{1}{1+\beta}-\frac{1}{1+c}\right)\|x_{k+1/2}-x_{k+1}\|^2 - \left(\frac{\theta}{2}-2\gamma_k^2(1+\beta)L^2\right)\|x_{k+1/2}-x_k\|^2 \\
&\quad + \gamma_k^2(1+c)\|w_{k+1/2}-w_k\|^2 + 2\gamma_k^2(1+\beta)B^2.
\end{align*}
Choosing $\beta$ and $c$ such that $\left(\theta/2 - 1/(1+\beta) - 1/(1+c)\right) \ge 0$, we obtain
\[ \gamma_k w_{k+1/2}^T(x_{k+1/2}-x^*) + u_k + V(x_{k+1},x^*) \le V(x_k,x^*) - \left(\frac{\theta}{2}-2\gamma_k^2(1+\beta)L^2\right)\|x_{k+1/2}-x_k\|^2 + \gamma_k^2(1+c)\|w_{k+1/2}-w_k\|^2 + 2\gamma_k^2(1+\beta)B^2, \tag{3.48} \]
which completes the proof.
We now proceed to use (3.48) to prove the almost sure convergence of the
sequence produced by the (MPSA) scheme.
Proposition 11 (a.s. convergence of MPSA scheme). Consider the SVI(X, F). Let assumptions (A1)–(A5) hold. Let $\beta$ and $c$ be such that $\theta/2 - 1/(1+\beta) - 1/(1+c) \ge 0$. Let F be a pseudomonotone-plus map on X. Let $\{x_k\}$ denote a sequence of iterates generated by (MPSA) and let $X^*$ denote the set of solutions to the SVI(X, F). Suppose $\gamma_0 \le \frac{\sqrt{\theta}}{2L\sqrt{1+\beta}}$. Then, $x_k \to x^* \in X^*$ in an almost sure sense.
Proof. From the assumption on the steplength, we have $\theta \ge 4\gamma_k^2 L^2(1+\beta)$. Considering (3.48) and taking conditional expectations with respect to $\mathcal{F}_{k+1/2}$, we have
\begin{align*}
u_k + \mathbb{E}[V(x_{k+1},x^*) \mid \mathcal{F}_{k+1/2}] &\le V(x_k,x^*) + 2\gamma_k^2(1+\beta)B^2 - \mathbb{E}\left[\gamma_k w_{k+1/2}^T(x_{k+1/2}-x^*) \mid \mathcal{F}_{k+1/2}\right] \\
&\quad + \gamma_k^2(1+c)\,\mathbb{E}\left[\|w_{k+1/2}\|^2+\|w_k\|^2-2w_k^T w_{k+1/2} \mid \mathcal{F}_{k+1/2}\right].
\end{align*}
Choosing $\beta = c = 1$ and invoking assumptions (A1)–(A2), we have
\[ u_k + \mathbb{E}[V(x_{k+1},x^*) \mid \mathcal{F}_{k+1/2}] \le V(x_k,x^*) + 4\gamma_k^2\left(\nu^2+B^2\right). \]
We now consider the term $u_k$ in the descent inequality of Lemma 17. Along the lines of Proposition 8, it is easy to claim that $\liminf_{k\to\infty} F(x_{k+1/2})^T(x_{k+1/2}-x^*) = 0$. Moreover,
\[ \lim_{k\to\infty} V(x_k,x_{k+1/2}) = 0, \quad \text{and} \quad \lim_{k\to\infty} V(x_{k+1},x^*) = \lim_{k\to\infty} V(x_k,x^*). \]
Noting that $V(x_k,x^*)$ and $V(x_k,x_{k+1/2})$ are convergent, we have
\[ \lim_{k\to\infty} x_{k+1/2} = \lim_{k\to\infty} x_k, \quad \text{and} \quad \lim_{k\to\infty} \|x_{k+1}-x^*\|^2 = \lim_{k\to\infty} \|x_k-x^*\|^2. \]
Convergence of $x_k \to x^* \in X^*$ in an a.s. sense follows by invoking an argument similar to that of Proposition 8.
Finally, we note that all the convergence theory proved for pseudomonotone variants of the stochastic extragradient scheme can be extended in a similar fashion to prox-based schemes with general distance functions. The next corollary formalizes this extension and is provided without proof.
Corollary 1 (a.s. convergence of MPSA scheme under variants of pseudomonotonicity). Consider the SVI(X, F). Let assumptions (A1)–(A5) hold. Consider a sequence of iterates $\{x_k\}$ generated by the MPSA scheme where $\gamma_0$ is chosen to be sufficiently small. Suppose one of the following statements holds:
(i) F satisfies the acute angle relation at X ∗ given by (3.5).
(ii) F is strictly pseudomonotone on X.
(iii) F is strongly pseudomonotone on X.
(iv) F is pseudomonotone on X and is given by the gradient of E[f (x, ω)].
Then, as k → ∞, xk → x∗ ∈ X ∗ in an almost sure sense.
3.3.4 Rate of convergence analysis
In the prior section, we proved the a.s. convergence of iterates generated by the
ESA and the MPSA scheme. In this section, we consider the development of error
bounds. Rate statements for the gap function have been provided in the context of
monotone stochastic variational inequality problems by Tauvel et al. [101]. In this
section, we provide rate statements under strong pseudomonotonicity, and under mere monotonicity with a weak sharpness requirement. In the remainder of this section, we assume that the steplength sequence is given by
\[ \gamma_k = \frac{\gamma_0}{k}, \tag{3.49} \]
where $\gamma_0$ is a finite scalar. It is easy to observe that this choice satisfies assumptions (A3)–(A4).
Proposition 12 (Rate statements under strong pseudomonotonicity). Suppose assumptions (A1)–(A5) hold and let L = 0 in (A5). Let F be strongly pseudomonotone on X and let the sequence of iterates $\{x_k\}$ be generated by (ESA). Then the following holds for K > 0:
\[ \mathbb{E}\left[\|x_K-x^*\|^2\right] \le \frac{1}{K}\max\left(\frac{46(B^2+\nu^2)}{\sigma^2},\ \|x_0-x^*\|^2\right), \]
where $x^*$ is a solution to the SVI(X, F).
Proof. We begin by noting that (A5) together with L = 0 implies the boundedness of F(x) with constant B/2. We consider (3.20) as earlier:
\[ \|x_{k+1}-x^*\|^2 \le \|x_k-x^*\|^2 - u_k + 2\gamma_k^2(1+\beta)B^2 + (1+c)\gamma_k^2\|w_{k+1/2}-w_k\|^2 + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2}. \tag{3.50} \]
Since $x^*$ is a solution of VI(X, F), we have that $F(x^*)^T(x_{k+1/2}-x^*) \ge 0$. By recalling the definition of strong pseudomonotonicity,
\[ -u_k = -2\gamma_k F(x_{k+1/2})^T(x_{k+1/2}-x^*) \le -2\gamma_k\sigma\|x_{k+1/2}-x^*\|^2 \le \gamma_k\sigma\left(2\|x_{k+1/2}-x_k\|^2 - \|x_k-x^*\|^2\right), \tag{3.51} \]
where the last inequality follows from $\|x_k-x^*\|^2 \le 2\|x_{k+1/2}-x^*\|^2 + 2\|x_{k+1/2}-x_k\|^2$. Using (3.51) in (3.50),
\[ \|x_{k+1}-x^*\|^2 \le (1-\sigma\gamma_k)\|x_k-x^*\|^2 + (1+c)\gamma_k^2\|w_{k+1/2}-w_k\|^2 + 2\gamma_k\sigma\|x_k-x_{k+1/2}\|^2 + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2} + 2\gamma_k^2(1+\beta)B^2. \tag{3.52} \]
Using the non-expansivity of the projection, noting that the conditional first moment is zero, and taking expectations, we have the following:
\[ \mathbb{E}\left[\|x_{k+1/2}-x_k\|^2\right] \le \mathbb{E}\left[\|x_k-\gamma_k(F(x_k)+w_k)-x_k\|^2\right] = \gamma_k^2\,\mathbb{E}\left[\|F(x_k)+w_k\|^2\right] = \gamma_k^2\,\mathbb{E}\left[\|F(x_k)\|^2+\|w_k\|^2\right] + \gamma_k^2\,\mathbb{E}\left[\mathbb{E}\left[2w_k^T F(x_k) \mid \mathcal{F}_k\right]\right] \le \gamma_k^2(B^2+\nu^2), \tag{3.53} \]
where the last inequality follows from observing that $\|F(x)\|^2 \le B^2/4 \le B^2$; the weaker bound significantly simplifies the algebra. Applying expectations on both sides of (3.52),
\[ \mathbb{E}\left[\|x_{k+1}-x^*\|^2\right] \le \underbrace{(1-\sigma\gamma_k)\,\mathbb{E}\left[\|x_k-x^*\|^2\right]}_{\text{Term A}} + \underbrace{\gamma_k^2\left(2\sigma\gamma_0(B^2+\nu^2)\right)}_{\text{Term B}} + \underbrace{\mathbb{E}\left[(1+c)\gamma_k^2\|w_{k+1/2}-w_k\|^2 + 2\gamma_k(x^*-x_{k+1/2})^T w_{k+1/2}\right] + 2\gamma_k^2(1+\beta)B^2}_{\text{Term C}}, \]
where the expression for Term B follows from $\gamma_k \le \gamma_0$ and (3.53). We proceed to study Term C:
\begin{align*}
\text{Term C} &= (1+c)\gamma_k^2\,\mathbb{E}\left[\|w_{k+1/2}-w_k\|^2\right] + 2\gamma_k\,\mathbb{E}\Big[\underbrace{\mathbb{E}\left[(x^*-x_{k+1/2})^T w_{k+1/2} \mid \mathcal{F}_{k+1/2}\right]}_{=0}\Big] + 2\gamma_k^2(1+\beta)B^2 \\
&\le \gamma_k^2\left(2(1+c)\nu^2 + 2(1+\beta)B^2\right).
\end{align*}
We proceed with a further analysis by setting $\beta = c = 1$, a choice that helps significantly from an algebraic standpoint. Therefore,
\[ \mathbb{E}\left[\|x_{k+1}-x^*\|^2\right] \le (1-q_k)\,\mathbb{E}\left[\|x_k-x^*\|^2\right] + t_k, \]
where $q_k = \frac{\sigma\gamma_0}{k}$, $t_k = \frac{\gamma_0^2\left(M_\nu+M_B+2\sigma\gamma_0(B^2+\nu^2)\right)}{k^2}$, $M_B = 4B^2$, and $M_\nu = 4\nu^2$. By assuming that $\sigma\gamma_0 > 1$ and by invoking Lemma 19,¹ we obtain that
\[ \mathbb{E}\left[\|x_k-x^*\|^2\right] \le \frac{M(\gamma_0)}{k}, \quad \text{where} \quad M(\gamma_0) \triangleq \max\left(\frac{\gamma_0^2\left(M_\nu+M_B+2\sigma\gamma_0(B^2+\nu^2)\right)}{\sigma\gamma_0-1},\ \|x_0-x^*\|^2\right). \]
¹ This relatively simple result is presented in [79, chap. 5, eq. 292] without a proof.
One may minimize $M(\gamma_0)$ in $\gamma_0$ by minimizing the first expression inside the max. Writing $t_0(\gamma_0) \triangleq M_\nu+M_B+2\sigma\gamma_0(B^2+\nu^2)$, we have
\[ \frac{d}{d\gamma_0}\left[\frac{\gamma_0^2\,t_0(\gamma_0)}{\sigma\gamma_0-1}\right] = \frac{2\gamma_0 t_0(\gamma_0)+\gamma_0^2 t_0'(\gamma_0)}{\sigma\gamma_0-1} - \frac{\gamma_0^2\,t_0(\gamma_0)\,\sigma}{(\sigma\gamma_0-1)^2} = 0. \]
Since $\sigma\gamma_0 > 1$, it follows that
\[ 2t_0(\gamma_0) + \gamma_0 t_0'(\gamma_0) - \frac{\gamma_0\,t_0(\gamma_0)\,\sigma}{\sigma\gamma_0-1} = 0. \]
This quadratic may be expressed as follows:
\[ 0 = 2(M_\nu+M_B) + 6\sigma\gamma_0(B^2+\nu^2) - \frac{\gamma_0\sigma\left(M_\nu+M_B+2\sigma\gamma_0(B^2+\nu^2)\right)}{\sigma\gamma_0-1}. \]
Through a simplification, we obtain the following:
\begin{align*}
0 &= 2\sigma\gamma_0(M_\nu+M_B) + 6\sigma^2\gamma_0^2(B^2+\nu^2) - \left(2(M_\nu+M_B)+6\sigma\gamma_0(B^2+\nu^2)\right) - \gamma_0\sigma\left(M_\nu+M_B+2\sigma\gamma_0(B^2+\nu^2)\right) \\
&= \sigma\gamma_0(M_\nu+M_B) + 4\sigma^2\gamma_0^2(B^2+\nu^2) - \left(2(M_\nu+M_B)+6\sigma\gamma_0(B^2+\nu^2)\right) \\
&= 4\sigma^2\gamma_0^2(B^2+\nu^2) + \sigma\gamma_0\left(M_\nu+M_B-6(B^2+\nu^2)\right) - 2(M_\nu+M_B).
\end{align*}
Through a rearrangement of terms, we obtain that
\[ \gamma_0^2 - \frac{\left(6(B^2+\nu^2)-M_\nu-M_B\right)\gamma_0}{4\sigma(B^2+\nu^2)} = \frac{M_\nu+M_B}{2\sigma^2(B^2+\nu^2)}. \]
By recalling that $M_B = 4B^2$ and $M_\nu = 4\nu^2$, we may simplify this equation as follows:
\[ \gamma_0^2 - \frac{\gamma_0}{2\sigma} = \frac{2}{\sigma^2} \implies \left(\gamma_0-\frac{1}{4\sigma}\right)^2 = \frac{33}{16\sigma^2} \implies \gamma_0 = \frac{\pm\sqrt{33}+1}{4\sigma}. \]
Considering only the positive root, we have that $\frac{3}{2\sigma} < \frac{\sqrt{33}+1}{4\sigma} < \frac{7}{4\sigma}$. On simplification,
\[ \mathbb{E}\left[\|x_k-x^*\|^2\right] \le \frac{1}{k}\max\left(\frac{46(B^2+\nu^2)}{\sigma^2},\ \|x_0-x^*\|^2\right). \]
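The positive root above can be checked directly; a minimal numerical sketch:

```python
import math

sigma = 2.0  # any positive value; the root scales as 1/sigma
gamma0 = (math.sqrt(33) + 1) / (4 * sigma)

# The optimizing gamma0 solves gamma0^2 - gamma0/(2*sigma) = 2/sigma^2.
lhs = gamma0**2 - gamma0 / (2 * sigma)
print(abs(lhs - 2 / sigma**2) < 1e-12)             # root check
print(3 / (2 * sigma) < gamma0 < 7 / (4 * sigma))  # bracketing used for the constant 46
```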
Remark: This result is notable from several standpoints. First, in contrast with rate statements for settings that lack strong monotonicity, we provide a rate statement in terms of the solution iterates, rather than in terms of the gap function. Second, our rate statement is optimal from a rate standpoint, albeit with a poorer constant, in part due to the use of $B^2$ instead of $B^2/4$. Notably, in strongly convex optimization problems (cf. [79]), it can be seen that based on the optimal initial steplength, we have that $\mathbb{E}\left[\|x_K-x^*\|^2\right] \le \frac{1}{K}\max\left(\frac{2(B^2+\nu^2)}{\sigma^2},\ \|x_0-x^*\|^2\right)$.
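The O(1/K) statement can also be probed empirically; below is a minimal sketch assuming the strongly monotone scalar map F(x) = σx on X = [−1, 1] (an illustrative instance; on a compact set, (A5) holds with L = 0 and a suitable B):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, gamma0, noise = 1.0, 2.0, 0.5   # sigma * gamma0 > 1, as the proof requires

def esa_error(iters):
    """Squared error |x_K - x*|^2 of (ESA) for F(x) = sigma*x on X = [-1, 1]
    with gamma_k = gamma0/k; the unique solution is x* = 0."""
    x = 1.0
    for k in range(1, iters + 1):
        g = gamma0 / k
        xh = float(np.clip(x - g * (sigma * x + noise * rng.standard_normal()), -1, 1))
        x = float(np.clip(x - g * (sigma * xh + noise * rng.standard_normal()), -1, 1))
    return x * x

# Average over replications to estimate E[|x_K - x*|^2] at two horizons.
mse = {K: np.mean([esa_error(K) for _ in range(200)]) for K in (200, 800)}
for K, v in mse.items():
    print(K, v, K * v)   # K * mse stays roughly constant, consistent with O(1/K)
```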
Next, we generalize the above rate result to prox functions with general distance
functions.
Corollary 2 (Rate statements for MPSA). Suppose assumptions (A1)–(A5)
hold and let L = 0 in (A5). Let F (x) be strongly pseudomonotone on X and let
the sequence of iterates {xk } be generated by (MPSA). Then, the following holds
for K > 0:
$$\mathbb{E}[\|x_K - x^*\|^2] \;\le\; \frac{2}{\theta K}\max\left(\frac{(12L_V^2+\theta^2)^2(5\theta^2+12L_V^2)(B^2+\nu^2)}{32\sigma^2\theta^2},\; \|x_0-x^*\|^2\right),$$
where x∗ is a solution to the SVI(X, F).
Proof. Noting that L = 0, we begin by considering (3.48) as earlier:
$$\gamma_k (w_{k+1/2})^T(x_{k+1/2} - x^*) + u_k + V(x_{k+1}, x^*) \;\le\; V(x_k, x^*) + \gamma_k^2(1+c)\|w_{k+1/2} - w_k\|^2 + 2\gamma_k^2(1+\beta)B^2.$$
Using $u_k \ge -\gamma_k\sigma\left(2\|x_{k+1/2}-x_k\|^2 - \|x_k-x^*\|^2\right)$ from (3.51) and by utilizing the Lipschitzian relationship $V(x_k,x^*) \le L_V^2\|x_k-x^*\|^2$ from (3.42), we have
$$V(x_{k+1}, x^*) \;\le\; \left(1-\frac{\sigma\gamma_k}{L_V^2}\right)V(x_k, x^*) + \gamma_k^2(1+c)\|w_{k+1/2}-w_k\|^2 + 2\gamma_k^2(1+\beta)B^2 + 2\sigma\gamma_k\|x_{k+1/2}-x_k\|^2 - \gamma_k w_{k+1/2}^T(x_{k+1/2}-x^*). \qquad (3.54)$$
Applying conditional expectations on both sides and leveraging the boundedness of the conditional second moments, we obtain the following:
\begin{align*}
\mathbb{E}[V(x_{k+1}, x^*) \mid \mathcal{F}_{k+1/2}] &\le \left(1-\frac{\sigma\gamma_k}{L_V^2}\right)\mathbb{E}[V(x_k, x^*) \mid \mathcal{F}_{k+1/2}] - \gamma_k \underbrace{\mathbb{E}[(w_{k+1/2})^T(x_{k+1/2}-x^*) \mid \mathcal{F}_{k+1/2}]}_{=0} \\
&\quad + \gamma_k^2(1+c)\Big(\mathbb{E}[\|w_{k+1/2}\|^2 + \|w_k\|^2 \mid \mathcal{F}_{k+1/2}] + \underbrace{\mathbb{E}[2 w_k^T w_{k+1/2} \mid \mathcal{F}_{k+1/2}]}_{=0}\Big) \\
&\quad + 2\sigma\gamma_k\,\mathbb{E}[\|x_{k+1/2}-x_k\|^2 \mid \mathcal{F}_{k+1/2}] + 2\gamma_k^2(1+\beta)B^2.
\end{align*}
Taking expectations once again, we obtain the following:
$$\mathbb{E}[V(x_{k+1}, x^*)] \;\le\; \left(1-\frac{\sigma\gamma_k}{L_V^2}\right)\mathbb{E}[V(x_k, x^*)] + \gamma_k^2\left(2(1+\beta)B^2 + 2(1+c)\nu^2\right) + 2\sigma\gamma_k\,\mathbb{E}[\|x_{k+1/2}-x_k\|^2].$$
By using the definition of the iterates from (MPSA),
$$V(x_k, x_{k+1/2}) + \gamma_k F(x_k;\omega_k)^T x_{k+1/2} \;\le\; V(x_{k+1/2}, x_{k+1/2}) + \gamma_k F(x_k;\omega_k)^T x_k \;=\; \gamma_k F(x_k;\omega_k)^T x_k. \qquad (3.55)$$
Using (3.41) and (3.55), we have
$$\|x_{k+1/2}-x_k\|^2 \;\le\; \frac{2}{\theta}V(x_k, x_{k+1/2}) \;\le\; \frac{2}{\theta}\gamma_k F(x_k;\omega_k)^T(x_k - x_{k+1/2}) \;\le\; \frac{\gamma_k^2}{\bar\beta\theta^2}\|F(x_k;\omega_k)\|^2 + \bar\beta\|x_k - x_{k+1/2}\|^2,$$
where the last expression follows from completion of squares with $1 > \bar\beta > 0$. Therefore
$$\|x_{k+1/2}-x_k\|^2 \;\le\; \frac{\gamma_k^2}{\bar\beta(1-\bar\beta)\theta^2}\|F(x_k;\omega_k)\|^2.$$
Noting that β̄ = 1/2 minimizes the expression on the right, we have
$$2\sigma\gamma_k\,\mathbb{E}[\|x_{k+1/2}-x_k\|^2] \;\le\; \frac{2\sigma\gamma_k\,\gamma_k^2}{\bar\beta(1-\bar\beta)\theta^2}\,\mathbb{E}[\|F(x_k;\omega_k)\|^2] \;\le\; \gamma_k^2\,\frac{2\sigma\gamma_0(B^2+\nu^2)}{\bar\beta(1-\bar\beta)\theta^2} \;\le\; \gamma_k^2\left(\frac{8\sigma\gamma_0(B^2+\nu^2)}{\theta^2}\right).$$
As in the prior proof, we utilize B² as the bound rather than B²/4. Setting β = c = 1 and γk = γ0/k, the prior expression can be simplified as follows:
$$\mathbb{E}[V(x_{k+1}, x^*)] \;\le\; \left(1-\frac{\sigma\gamma_0}{k L_V^2}\right)\mathbb{E}[V(x_k, x^*)] + \frac{\gamma_0^2 t_0}{k^2}, \qquad t_0 = M_B + M_\nu + \frac{8\sigma\gamma_0(B^2+\nu^2)}{\theta^2},$$
where $M_B = 4B^2$ and $M_\nu = 4\nu^2$. Along the lines of Proposition 12 and (3.41),
$$\mathbb{E}[\|x_k - x^*\|^2] \;\le\; \frac{2}{\theta}\,\mathbb{E}[V(x_k, x^*)] \;\le\; \frac{2}{\theta}\,\frac{M(\gamma_0)}{k}, \quad\text{where}\quad M(\gamma_0) = \max\left(\frac{\gamma_0^2 L_V^2 t_0}{\sigma\gamma_0 - L_V^2},\; \|x_0-x^*\|^2\right). \qquad (3.56)$$
Minimizing the first expression in the max function in γ0, we have
$$\frac{2 t_0(\gamma_0) + \gamma_0 t_0'(\gamma_0)}{\sigma\gamma_0 - L_V^2} - \frac{\gamma_0\sigma\, t_0(\gamma_0)}{(\sigma\gamma_0 - L_V^2)^2} \;=\; 0.$$
Through a simplification, we obtain that
$$\sigma\gamma_0\left(\theta^2(M_B+M_\nu) + 8\sigma\gamma_0(B^2+\nu^2)\right) + 8\sigma\gamma_0(B^2+\nu^2)(\sigma\gamma_0 - L_V^2) = 2\left(\theta^2(M_B+M_\nu) + 8\sigma\gamma_0(B^2+\nu^2)\right)L_V^2,$$
$$\Longrightarrow\quad \gamma_0^2 - \frac{\gamma_0\left(24(B^2+\nu^2)L_V^2 - \theta^2(M_B+M_\nu)\right)}{16\sigma(B^2+\nu^2)} \;=\; \frac{\theta^2(M_B+M_\nu)L_V^2}{8\sigma^2(B^2+\nu^2)}.$$
By recalling that $M_B + M_\nu = 4(B^2+\nu^2)$, we obtain the following:
$$\gamma_0^2 - 2\gamma_0\left(\frac{6L_V^2-\theta^2}{8\sigma}\right) = \frac{\theta^2 L_V^2}{2\sigma^2}, \qquad \gamma_0^*(L_V,\sigma,\theta) = \frac{6L_V^2-\theta^2}{8\sigma} \pm \frac{1}{\sigma}\sqrt{\frac{5\theta^2 L_V^2}{16} + \frac{9L_V^4}{16} + \frac{\theta^4}{64}}.$$
Considering only the positive root for γ0∗, we obtain the following bounds:
$$\gamma_0^*(L_V,\sigma,\theta) \;\ge\; \frac{3L_V^2}{2\sigma} \quad\text{and}\quad \gamma_0^*(L_V,\sigma,\theta) \;\le\; \frac{6L_V^2-\theta^2}{8\sigma} + \frac{1}{\sigma}\sqrt{\frac{6\theta^2 L_V^2}{16} + \frac{9L_V^4}{16} + \frac{\theta^4}{64}} \;\le\; \frac{3L_V^2}{2\sigma} + \frac{\theta^2}{8\sigma}. \qquad (3.57)$$
This leads to the required bound:
$$\mathbb{E}[\|x_K - x^*\|^2] \;\le\; \frac{2}{\theta K}\max\left(\frac{(12L_V^2+\theta^2)^2(5\theta^2+12L_V^2)(B^2+\nu^2)}{32\sigma^2\theta^2},\; \|x_0-x^*\|^2\right). \qquad (3.59)$$
Remark: When the Euclidean norm is used as the distance function, it can
be observed that θ = 2 and LV = 1 and the MPSA scheme reduces to the
standard extragradient scheme. While (3.59) can be directly used to obtain the
rate estimate, we use (3.57) to obtain a much tighter estimate. Therefore, we have $\gamma_0^*(1,\sigma,2) = \frac{1+\sqrt{33}}{4\sigma}$. Applying this value of γ0∗ in (3.56) and following the rest of the algebra from Proposition 12, we obtain the same upper bound as with the case of EGA:
$$\mathbb{E}[\|x_K - x^*\|^2] \;\le\; \frac{1}{K}\max\left(\frac{46(B^2+\nu^2)}{\sigma^2},\; \|x_0-x^*\|^2\right).$$
However, if we directly substitute LV = 1 and θ = 2 in (3.59), we obtain the slightly weaker bound given by
$$\mathbb{E}[\|x_K - x^*\|^2] \;\le\; \frac{1}{K}\max\left(\frac{64(B^2+\nu^2)}{\sigma^2},\; \|x_0-x^*\|^2\right).$$
We conclude this section with a rate analysis on the solution iterates under mere
monotonicity of the map but under an additional requirement of weak-sharpness. We
observe that the specification of the optimal steplength requires globally minimizing
a product of convex functions over a Cartesian product of convex sets. While there
are settings where this product is indeed convex, in this instance, the product is a
nonconvex function. Yet, we observe that the global minimizer can be tractably
obtained by solving two convex programs. The following simple lemma provides
the necessary support for this result.
Lemma 18. Consider the following problem:
$$\min\left\{h(\gamma_0)g(z) \mid \gamma_0 \in \Gamma_0,\; z \in Z\right\},$$
where h and g are positive functions over Γ0 and Z, respectively. If γ̄0 and z̄ denote global minimizers of h(γ0) and g(z) over Γ0 and Z, respectively, then for any global minimizer (γ0∗, z∗) of the product, the following holds:
$$h(\gamma_0^*)g(z^*) \;=\; \min_{\gamma_0\in\Gamma_0,\, z\in Z} h(\gamma_0)g(z) \;=\; h(\bar\gamma_0)g(\bar z).$$
Proof. The proof has two steps. First, we note that
$$\min_{\gamma_0\in\Gamma_0,\, z\in Z} h(\gamma_0)g(z) \;\ge\; h(\bar\gamma_0)g(\bar z),$$
implying that at any global minimizer (γ0∗, z∗),
$$h(\gamma_0^*)g(z^*) \;\ge\; h(\bar\gamma_0)g(\bar z). \qquad (3.60)$$
Second, since (γ̄0, z̄) ∈ Γ0 × Z, the optimal value h(γ0∗)g(z∗) is no larger than the value associated with any feasible solution, so that
$$h(\gamma_0^*)g(z^*) \;=\; \min_{\gamma_0\in\Gamma_0,\, z\in Z} h(\gamma_0)g(z) \;\le\; h(\bar\gamma_0)g(\bar z). \qquad (3.61)$$
By combining (3.60) and (3.61), the result follows.
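A small numerical illustration of Lemma 18, with hypothetical choices of h, g, Γ0, and Z (not taken from the text): the product of the values at the separate minimizers matches a brute-force grid search of the joint, possibly nonconvex, product problem.

```python
import math

# Illustration of Lemma 18 (hypothetical h, g): for positive h over
# Gamma_0 and positive g over Z, the product h(gamma0)*g(z) is globally
# minimized at the pair of separate minimizers.
def h(gamma0):           # positive and convex on Gamma_0 = [2, 6]
    return gamma0**2 / (gamma0 - 1)

def g(z):                # positive and convex on Z = [0.5, 3]
    return 2 / z + z

# Separate minimizers: h attains its minimum at gamma0 = 2, g at z = sqrt(2).
gamma_bar, z_bar = 2.0, math.sqrt(2.0)

# Brute-force the joint minimum over a fine grid.
grid_g = [2 + 4 * i / 400 for i in range(401)]
grid_z = [0.5 + 2.5 * j / 400 for j in range(401)]
joint_min = min(h(u) * g(v) for u in grid_g for v in grid_z)
assert abs(joint_min - h(gamma_bar) * g(z_bar)) < 1e-3
print("joint grid minimum matches product of separate minimizers")
```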
Next, under a monotonicity and weak-sharpness requirement, the ESA scheme is
shown to display the optimal rate of convergence in solution iterates. Additionally,
we prescribe the optimal initial steplength that minimizes the mean-squared error
by deriving the global minimizer of a nonconvex function in closed-form.
Proposition 13 (Rate analysis under monotonicity and weak sharpness).
Consider the SVI(X, F ). Suppose assumptions (A1)–(A4) hold and let γk be defined
as per (3.49). Let constants β and c be chosen such that 1 − (1/(1 + β)) − (1/
(1 + c)) ≥ 0. Let F (x) be a monotone map over the set X. Additionally, suppose
the map F (x) is Lipschitz continuous and bounded over X with constants L and
C. Let the mapping F (x) and solution set X ∗ possess the weak-sharpness property
with constant α and let X be compact such that kxk ≤ U for all x ∈ X. Suppose
xk is generated by (ESA). Then, the following holds for K > 0:
$$\mathbb{E}[\mathrm{dist}^2(x_K, X^*)] \;\le\; \frac{1}{K}\max\left(\frac{2U^2}{\alpha^2}\left(4(\nu+\sqrt{2}LU)^2 + C^2 + 16L^2U^2\right),\; \mathrm{dist}^2(x_0, X^*)\right).$$
Proof. Following along the lines of Proposition 15, we take expectations over (3.37):
\begin{align*}
\mathbb{E}[\mathrm{dist}^2(x_{k+1}, X^*)] &\le \left(1 + \frac{L^2\gamma_k^2}{\bar c^2}\right)\mathbb{E}[\mathrm{dist}^2(x_k, X^*)] - \left(1 - 2\bar c^2 - \gamma_k^2(1+\beta)L^2\right)\mathbb{E}[\|x_k - x_{k+1/2}\|^2] \\
&\quad - 2\gamma_k\alpha\,\mathbb{E}[\mathrm{dist}(x_k, X^*)] + \gamma_k^2\left(2(1+c)\nu^2 + \frac{C^2}{4\bar c^2}\right). \qquad (3.62)
\end{align*}
Since dist(xk, X∗) ≤ 2U, it follows that $-2\gamma_k\alpha\,\mathbb{E}[\mathrm{dist}(x_k, X^*)] \le -\frac{\gamma_k\alpha}{U}\,\mathbb{E}[\mathrm{dist}^2(x_k, X^*)]$. Furthermore, since $\bar c \le \sqrt{1/2}$, $\|x_{k+1/2} - x_k\| \le 2U$, and $\mathbb{E}[\mathrm{dist}^2(x_k, X^*)] \le 4U^2$, the right-hand side of (3.62) is bounded as follows:
\begin{align*}
\mathbb{E}[\mathrm{dist}^2(x_{k+1}, X^*)] &\le \left(1 - \frac{\alpha\gamma_k}{U}\right)\mathbb{E}[\mathrm{dist}^2(x_k, X^*)] - \left(1 - 2\bar c^2\right)\mathbb{E}[\|x_k - x_{k+1/2}\|^2] \\
&\quad + \gamma_k^2\left((1+\beta)L^2\,\mathbb{E}[\|x_k - x_{k+1/2}\|^2] + 2(1+c)\nu^2 + \frac{C^2 + 16L^2U^2}{4\bar c^2}\right) \\
&\le \left(1 - \frac{\alpha\gamma_k}{U}\right)\mathbb{E}[\mathrm{dist}^2(x_k, X^*)] + \gamma_k^2\left(M_\nu(c) + M_{LC}(\bar c) + M_L(\beta)\right) \\
&= (1 - q_k)\,\mathbb{E}[\mathrm{dist}^2(x_k, X^*)] + t_k,
\end{align*}
where
$$M_\nu(c) \triangleq 2(1+c)\nu^2, \qquad M_{LC}(\bar c) \triangleq \frac{C^2 + 16L^2U^2}{4\bar c^2}, \qquad M_L(\beta) \triangleq 4(1+\beta)L^2U^2,$$
$$q_k = \frac{q_0\gamma_0}{k}, \quad t_k = \frac{t_0\gamma_0^2}{k^2}, \quad q_0 = \frac{\alpha}{U}, \quad\text{and}\quad t_0 = M_L(\beta) + M_{LC}(\bar c) + M_\nu(c). \qquad (3.63)$$
Through the application of Lemma 19 (as in Proposition 12), we obtain the following bound on the mean-squared error for every positive integer K:
$$\mathbb{E}[\mathrm{dist}^2(x_K, X^*)] \;\le\; \frac{1}{K}\max\left(h(\gamma_0)g(c,\beta,\bar c),\; \mathrm{dist}^2(x_0, X^*)\right),$$
$$\text{where } h(\gamma_0) \triangleq \frac{\gamma_0^2 U}{\alpha\gamma_0 - U} \quad\text{and}\quad g(c,\beta,\bar c) \triangleq M_\nu(c) + M_{LC}(\bar c) + M_L(\beta).$$
Suppose Γ0 and Z are convex sets defined as
$$\Gamma_0 \triangleq \left\{\gamma_0 : \gamma_0 \ge \frac{U}{\alpha}\right\} \quad\text{and}\quad Z \triangleq \left\{(c,\beta,\bar c) : 1 - 1/(1+\beta) - 1/(1+c) \ge 0,\; \bar c \le \sqrt{1/2}\right\}.$$
Moreover, since h(γ0)g(z) is a product of two convex functions, both positive over their prescribed sets, a global minimizer of this product can be tractably obtained by invoking Lemma 18 as (γ̄0, z̄), where γ̄0 and z̄ denote global minimizers of the convex functions h(γ0) and g(z) over the convex sets Γ0 and Z, respectively.
We now consider the problem of minimizing h(γ0) over Γ0 and g(z) over Z. It can be seen that h(γ0) is a convex function that attains its globally minimal value when
$$h'(\gamma_0) = \frac{2\gamma_0 U}{\alpha\gamma_0 - U} - \frac{\gamma_0^2 U\alpha}{(\alpha\gamma_0 - U)^2} = \frac{\gamma_0^2 U\alpha - 2\gamma_0 U^2}{(\alpha\gamma_0 - U)^2} = 0, \quad\text{i.e., when } \gamma_0^* = \frac{2U}{\alpha}.$$
Note that γ0∗ = 2U/α is feasible with respect to Γ0. Furthermore, g(c, β, c̄) is a separable function given by Mν(c) + ML(β) + MLC(c̄). We begin by noting that MLC(c̄) is a decreasing function in c̄ and assumes its minimal value at the right end-point of the feasibility set $[0, \sqrt{1/2}]$, implying that $\bar c^* = \sqrt{1/2}$. Similarly, we note that Mν(c) and ML(β) are increasing functions of c and β, respectively, over the set {(β, c) : 1 − 1/(1+c) − 1/(1+β) ≥ 0}. Consequently, the optimal value is attained at the left end-point of the feasibility set, at which 1/(1+c) + 1/(1+β) = 1, implying β = 1/c.
Therefore, the problem reduces to
$$\min_{\beta}\left(\frac{2\nu^2}{\beta} + 4\beta L^2U^2\right) \;\Longrightarrow\; \beta^* = \frac{\nu}{\sqrt{2}LU} \quad\text{and}\quad c^* = \frac{\sqrt{2}LU}{\nu}.$$
This implies that
\begin{align*}
M_\nu(c^*) + M_L(\beta^*) + M_{LC}(\bar c^*) &= 2(1+c^*)\nu^2 + 4(1+\beta^*)L^2U^2 + \frac{C^2 + 16L^2U^2}{4(\bar c^*)^2} \\
&= 2\nu^2 + 2\sqrt{2}\nu LU + 4L^2U^2 + 2\sqrt{2}LU\nu + \frac{C^2 + 16L^2U^2}{2} \\
&= 2(\nu + \sqrt{2}LU)^2 + \frac{C^2 + 16L^2U^2}{2}.
\end{align*}
90
The optimal rate of convergence of mean-squared error reduces to the following:
E[dist2 (xK , X ∗ )] ≤
√
2U 2 1
max( 2 4(ν + 2LU )2 + C 2 + 16L2 U 2 , dist2 (x0 , X ∗ )).
K
α
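As a sanity check on the closed-form minimizers derived above, the following sketch (with illustrative values of ν, L, U, and C, which are assumptions for this example) compares g at (c∗, β∗, c̄∗) against randomly sampled feasible points of Z:

```python
import random, math

# Check (with hypothetical constants) that c* = sqrt(2)LU/nu, beta* = 1/c*,
# cbar* = sqrt(1/2) minimize g(c, beta, cbar) = 2(1+c)nu^2 + 4(1+beta)L^2U^2
# + (C^2 + 16L^2U^2)/(4 cbar^2) over the feasibility set Z.
nu, L, U, C = 0.7, 1.3, 2.0, 0.9

def g(c, beta, cbar):
    return 2*(1+c)*nu**2 + 4*(1+beta)*L**2*U**2 + (C**2 + 16*L**2*U**2)/(4*cbar**2)

g_star = 2*(nu + math.sqrt(2)*L*U)**2 + (C**2 + 16*L**2*U**2)/2
beta_star = nu / (math.sqrt(2)*L*U)
assert abs(g(1/beta_star, beta_star, math.sqrt(0.5)) - g_star) < 1e-10

random.seed(0)
for _ in range(10000):
    c, beta = random.uniform(0, 20), random.uniform(0, 20)
    cbar = random.uniform(1e-3, math.sqrt(0.5))
    if 1 - 1/(1+beta) - 1/(1+c) >= 0:          # feasibility in Z
        assert g(c, beta, cbar) >= g_star - 1e-9
print("closed-form minimizer verified on random feasible points")
```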
Remark: In past research, the optimality of the rate of convergence has been
proved for monotone SVIs but in terms of the gap function. Our result shows that
under a suitable weak-sharpness property, rate optimality also holds in terms of
the solution iterates. Notably, we further refine the statement by optimizing the
initial steplength by (globally) minimizing a nonconvex function.
3.4 Numerical Results
In this section, we examine the performance of the presented schemes on a suite
of four test problems described in Section 3.4.1 while the algorithm parameters
are defined in Section 3.4.2. In Section 3.4.3, we compare the performance of the
ESA scheme with the MPSA schemes over the suite of test problems. Finally in
Section 3.4.4, we compare the empirical rates with the theoretically predicted rates
and quantify the benefits of optimal initial steplength.
3.4.1 Test Suite
The first two sets of test problems are stochastic fractional convex quadratic and nonlinear programs, both of which lead to pseudomonotone stochastic variational inequality problems. The third set of test problems comprises stochastic variational inequality problems that represent the (sufficient) equilibrium conditions of a stochastic Nash-Cournot game. In particular, the players maximize pseudoconcave expectation-valued functions, but the resulting stochastic variational inequality problem is not necessarily pseudomonotone; however, some choices of parameters lead to pseudomonotone SVIs. Our fourth test problem is Watson's complementarity problem [109], which is not necessarily monotone.
(i) Fractional Convex Quadratic Problems: The concept of maximizing or minimizing ratios is of significant relevance in an engineering setting. We consider stochastic fractional convex problems of the form:
$$\min_{x\in X}\; \mathbb{E}\left[\frac{f(x;\omega)}{g(x)}\right], \qquad (3.64)$$
where $\mathbb{E}[f(x;\omega)]$ and g(x) are strictly positive convex quadratic and linear functions, respectively, defined as $f(x;\omega) \triangleq 0.5x^T(\theta UU^T + \lambda V(\omega))x + 0.5((c + \bar c(\omega))^Tx + 4n)^2$ and $g(x) \triangleq r^Tx + t + 4n$. We note that V(ω) and c̄(ω) are randomly generated from standard normal and uniform distributions, U and c are deterministic constants generated once from the standard normal distribution, while r and t are generated once from uniform distributions. We note that θ = 0.025 and $\lambda = \epsilon\|\theta UU^T\|_F/\|V(\omega)\|_F$, where $\|\cdot\|_F$ denotes the Frobenius norm and ε = 0.025. The set X is defined as $X \triangleq \{x \mid Ax \le v,\; 0 \le x \le 4\}$, where $A \in \mathbb{R}^{m\times n}$ and $v \in \mathbb{R}^{m\times 1}$ are generated once from standard normal and uniform distributions, respectively. Note that $m = \lceil n/10\rceil$ is a variable-dependent integer. It is easily seen that the resulting SVI is pseudomonotone.
(ii) Fractional Convex Nonlinear Problems: We consider a nonlinear variant of (i) with the same parameters and numerator but an exponential denominator $g(x) = 10^4(\lambda - e^{(r^Tx + t + 4n)/2000})$, where $\lambda = e^{(8n+2)/2000}$. Again, it can be observed that this problem leads to a pseudomonotone SVI.
(iii) Nash-Cournot Games: Next, we consider a Nash-Cournot game with n selfish players, all of which sell the same commodity [76, 89] at a price given by a function of the aggregate sales as per the Cournot specification [110, 111]. Specifically, the ith agent solves the problem $\max_{x_i\in X_i} f_i(x) = \mathbb{E}[p(\bar x;\omega)x_i]$, where $p(\bar x;\omega) = (a - b_\omega \bar x)^\kappa$, $\bar x = \sum_{i=1}^n x_i$, $\kappa \in (0,1)$, and $X_i = \{x_i \mid Ax \le v,\; 0 \le x_i \le 3n\}$. We note that $a = 100\lceil n/3\rceil$, while $b_\omega$ is generated from a uniform distribution with mean 1 and standard deviation ε, where ε = 0.025. We note that A and v are also generated randomly as stated earlier. The resulting game is a shared-constraint Nash game, and a (variational) equilibrium of this game [112] is given by a variational inequality problem. Note that agent payoffs are pseudoconcave [89, Theorem 3.4] and the (sufficient) equilibrium conditions are given by a variational inequality VI(X, F), where $F(x) = \left(\nabla_{x_i} f_i(x)\right)_{i=1}^n$. Note that this variational problem is not necessarily pseudomonotone.
(iv) Watson's Problem: Finally, we consider a stochastic variant of the ten-variable non-monotone linear complementarity problem, first proposed by Watson [109]: $0 \le x \perp \mathbb{E}\left[(M + \epsilon M^\omega)x + q + \epsilon q^\omega\right] \ge 0$, where $M^\omega$ and $q^\omega$ are randomly generated matrices and vectors (from the standard normal distribution), respectively, and ε = 0.025 refers to the level of noise. We omit the definition of the ten-dimensional matrix M, which can be found in [109, Example 3]. Note that $q = e_i$ and we consider ten different instances, each corresponding to a coordinate direction $e_i$.
3.4.2 Algorithm parameters and termination criteria
We conduct two sets of tests: the first of these pertains to the a.s. convergence behavior, while the second set compares the empirical rate estimates with the theoretically prescribed levels. All of the numerics were generated with Matlab R2012a on a Linux OS with a 2.39 GHz processor and 16 GB of memory. For the first two test problem sets, x0 = 2e and γ0 is 1 and 2.5, respectively, while for the second two test problem sets, x0 = 0 and γ0 is 2.5 and 0.6, respectively.
(i) a.s. convergence: Here, n was varied from 10 to 19 for the first three test problems, while ten different instances of q were generated as stated earlier for Watson's problem, leading to a total of 40 test instances. Recalling that x is a solution of VI(X, F) if and only if $F_X^{\mathrm{nat}}(x) = x - \Pi_X(x - F(x)) = 0$, a.s. convergence can be empirically verified based on the value of $\psi(x_k) = \|F_X^{\mathrm{nat}}(x_k)\|$. Note that our problem choices allow for evaluating the expectation, which is generally not possible.
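The termination measure above is straightforward to compute when the projection is available in closed form. The sketch below (pure Python; the box set and affine map are illustrative, matching the shape of the mapping used later in the rate experiments) evaluates the natural-map residual:

```python
import math

# Residual of the natural map for a box-constrained VI: x solves VI(X, F)
# iff psi(x) = || x - Pi_X(x - F(x)) || = 0.  For X = {lo <= x <= hi}, the
# projection is a componentwise clip.  F here is an illustrative affine map
# of the form b (I + e e^T) x - a e.
def clip(v, lo, hi):
    return [min(max(vi, lo), hi) for vi in v]

def nat_map_residual(x, F, lo, hi):
    proj = clip([xi - Fi for xi, Fi in zip(x, F(x))], lo, hi)
    return math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, proj)))

n, a = 4, 1.0
b = a / n
F = lambda x: [b * (xi + sum(x)) - a for xi in x]   # b (I + e e^T) x - a e

assert nat_map_residual([0.0] * n, F, 0.0, 1.0) == 2.0   # x = 0 is not a solution
assert nat_map_residual([0.8] * n, F, 0.0, 1.0) < 1e-12  # x* = 0.8 e solves VI
print("natural-map residual check passed")
```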
(ii) Rate statements: When evaluating the rate estimates, we consider a modified Nash-Cournot game. The price was made affine (κ = 1) and the linear constraints were dropped. We generated ten different problem instances for n ranging from 10 to 19 and set $a = 0.1\lceil n/10\rceil$ and b = a/n. Note that $b_\omega$ was generated from a normal distribution with mean b and standard deviation ε, where ε = 0.025b. The associated set and mapping are defined as $F(x) = b(I + ee^T)x - ae$ and $X = \{x \mid 0 \le x_i \le 1,\; i = 1,\dots,n\}$, where e and I denote the vector of ones and the identity matrix, respectively. Next, we estimate the parameters needed in computing the theoretical rates. Since $\nabla F(x) = b(I + ee^T)$, the map F(x) is strongly monotone (and hence strongly pseudomonotone) with constant σ = b. Further, we have that $\|F(x)\| = \|b(I + ee^T)x - ae\| = a\left\|\tfrac{1}{n}(I + ee^T)x - e\right\| \le a\sqrt{n}$, where the last inequality follows from b = a/n and 0 ≤ x ≤ e. This implies that $B = 2a\sqrt{n}$. If x∗ denotes the unique solution of VI(X, F), then the empirical error $\psi_e(x_K)$ and theoretical error $\psi_b(x_K)$ are defined as follows (see Proposition 12):
error ψb (xK ) are defined as follows (see Proposition 12):
N
1 X
M
≥ IE[kxK − x∗ k2 ],
ψe (xK ) =
kxjK − x∗ k2 , ψb (xK ) =
N j=1
K
(3.65)
γ02
M=
(Mν + MB ),
σγ0 − 1
(3.66)
where ψe (xK ) is a result of averaging over N sample paths. Since kF (x; ω) −
F (x)k2 = |b−bω |2 k(I+eeT )xk2 ≤ n(n+1)2 2 = ν 2 , then MB = 4a2 n (2σγ0 + (1 + β))
and Mν = 2 ((1 + c) + σγ0 ) ν 2 . Our final bounds are a consequence of setting
c = β = 1.
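A minimal sketch of the (ESA) scheme on this modified Nash-Cournot instance may be written as follows. The update shown is the standard extragradient step with two independent samples per iteration; the seed, run length, and starting point are illustrative assumptions, and the interior point x∗ = (n/(n+1))e solves F(x) = 0:

```python
import math, random

# Stochastic extragradient (ESA) sketch on the modified Nash-Cournot test
# problem: F(x) = b_w (I + e e^T) x - a e over X = [0, 1]^n, with
# b_w ~ N(b, eps^2) and gamma_k = gamma_0 / k.
random.seed(1)
n = 10
a = 0.1 * math.ceil(n / 10)
b = a / n
eps = 0.025 * b
gamma0 = (1 + math.sqrt(33)) / (4 * b)          # sigma = b (strong monotonicity)

def F_sample(x):
    b_w = random.gauss(b, eps)
    s = sum(x)
    return [b_w * (xi + s) - a for xi in x]      # b_w (I + e e^T) x - a e

def proj(x):                                     # projection onto [0, 1]^n
    return [min(max(xi, 0.0), 1.0) for xi in x]

x = [0.0] * n
for k in range(1, 20001):
    gk = gamma0 / k
    x_half = proj([xi - gk * Fi for xi, Fi in zip(x, F_sample(x))])
    x = proj([xi - gk * Fi for xi, Fi in zip(x, F_sample(x_half))])

x_star = [n / (n + 1.0)] * n                     # interior solution of F(x) = 0
err = sum((xi - si) ** 2 for xi, si in zip(x, x_star))
print("squared error after 20000 iterations:", err)
```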
3.4.3 Almost sure convergence behavior
In this subsection, we compare the a.s. convergence behavior of the extragradient and mirror-prox schemes under two different distance metrics. Table 3.2 displays ψ(xK) generated from the ESA scheme for an increasing number of major iterations for the forty problems of interest. We observe that in the fractional quadratic and nonlinear problems, the ESA scheme performs relatively well, barring two instances. Notably, much of the progress is made in the first 1000 iterations.

Next, we compare the stochastic extragradient scheme with two prox-based generalizations that employ two distance functions proposed by Nemirovski [51], given by $s_a(x) = \sum_{i=1}^n (x_i + \delta)\log(x_i + \delta)$ and $s_b(x) = \log(n)\sum_{i=1}^n x_i^{(1 + 1/\log(n))}$. The variants of MPSA, referred to as MPSA-a and MPSA-b respectively, are studied, and the results are compared with the ESA scheme in Table 3.3 for the ten nonlinear fractional problems in the test set for a progressively increasing number of major iterations. It is observed that the ESA scheme sometimes (but not always) performs better than MPSA-a from an error standpoint, but each step of MPSA-a (and MPSA-b) tends to require more effort, as captured by the CPU time.
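For concreteness, the Bregman (prox) distance induced by the first of these distance-generating functions can be sketched as follows (the value of δ here is an assumed small constant for illustration):

```python
import math

# Bregman distance induced by s_a(x) = sum_i (x_i + delta) log(x_i + delta):
# V(x, y) = s_a(y) - s_a(x) - <grad s_a(x), y - x>.
delta = 1e-3

def s_a(x):
    return sum((xi + delta) * math.log(xi + delta) for xi in x)

def grad_s_a(x):
    return [math.log(xi + delta) + 1.0 for xi in x]

def bregman(x, y):
    grad = grad_s_a(x)
    return s_a(y) - s_a(x) - sum(g * (yi - xi) for g, yi, xi in zip(grad, y, x))

x, y = [0.2, 0.5, 0.3], [0.4, 0.1, 0.5]
assert bregman(x, y) >= 0.0          # nonnegativity (s_a is convex)
assert abs(bregman(x, x)) < 1e-12    # V(x, x) = 0
print("Bregman distance V(x, y) =", bregman(x, y))
```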
3.4.4 Error analysis and optimal choices of γ0
While the previous results focused on asymptotics, we now compare the empirical rates with the theoretically predicted rates, as captured by (3.65)–(3.66). In obtaining the empirical results, the initial steplength γ0 was set to $(1+\sqrt{33})/(4\sigma)$ and fifteen different sample paths of ESA were generated to compute $\psi_e$ in (3.65). Given
Table 3.2. Asymptotics of ESA: error ψ(xK)

Problem        n    K = 1       K = 100     K = 1000    K = 10000   K = 15000
Frac. Quad.   10   6.017e+00   5.710e-01   4.690e-02   1.142e-03   7.951e-04
              11   6.343e+00   2.885e+00   1.962e+00   1.094e+00   9.714e-01
              12   6.685e+00   9.425e-01   2.827e-01   1.437e-01   1.288e-01
              13   7.211e+00   2.875e+00   1.728e+00   1.285e+00   1.243e+00
              14   7.483e+00   1.683e+00   2.052e-01   8.321e-02   7.035e-02
              15   7.473e+00   3.206e-01   1.441e-01   3.735e-02   2.959e-02
              16   7.982e+00   3.327e-01   6.572e-02   1.444e-02   1.014e-02
              17   8.007e+00   1.957e+00   1.018e+00   5.486e-01   5.095e-01
              18   8.250e+00   3.060e+00   1.725e+00   8.999e-01   8.028e-01
              19   8.718e+00   4.116e+00   2.763e+00   2.018e+00   1.913e+00
Frac. Nonlin. 10   5.345e+00   1.110e-01   2.754e-02   4.163e-03   2.955e-03
              11   3.392e+00   2.135e-01   1.153e-01   7.187e-02   6.691e-02
              12   3.460e+00   1.467e-01   9.712e-02   7.672e-02   7.450e-02
              13   5.855e+00   2.053e-01   9.461e-02   9.901e-02   1.014e-01
              14   5.755e+00   2.608e-01   1.122e-01   5.791e-02   5.186e-02
              15   7.145e+00   9.162e-04   9.433e-03   1.261e-02   1.288e-02
              16   5.962e+00   1.609e-01   6.847e-02   2.947e-02   2.540e-02
              17   6.015e+00   2.540e-01   1.040e-01   5.281e-02   4.820e-02
              18   4.836e+00   2.491e-01   1.582e-01   1.053e-01   9.833e-02
              19   3.602e+00   1.686e-01   1.030e-01   8.609e-02   8.360e-02
Nash game     10   1.581e+01   8.734e-01   4.624e-01   2.466e-01   1.634e-01
              11   9.886e+00   4.306e-01   2.418e-01   1.419e-01   1.300e-01
              12   1.119e+01   9.558e-01   4.195e-01   2.052e-01   1.813e-01
              13   7.551e+00   2.358e-01   1.388e-01   8.199e-02   7.478e-02
              14   2.092e+01   9.826e-01   4.876e-01   2.628e-01   2.385e-01
              15   2.165e+01   1.221e+00   5.613e-01   2.701e-01   2.377e-01
              16   2.449e+01   1.530e+00   8.509e-01   5.082e-01   4.685e-01
              17   8.641e+00   1.361e+00   8.349e-01   5.037e-01   4.673e-01
              18   9.485e+00   1.256e+00   7.358e-01   4.301e-01   3.913e-01
              19   1.827e+01   1.457e+00   8.848e-01   5.571e-01   5.162e-01
Watson-CP     10   9.695e-01   2.873e-01   2.329e-01   2.397e-01   2.477e-01
              11   9.592e-01   1.463e-01   1.277e-01   1.275e-01   1.298e-01
              12   9.798e-01   1.636e-01   1.468e-01   1.491e-01   1.385e-01
              13   9.695e-01   1.873e-01   1.960e-01   1.595e-01   1.573e-01
              14   9.381e-01   2.667e-01   2.194e-01   2.125e-01   2.173e-01
              15   9.381e-01   2.353e-01   1.357e-01   1.261e-01   1.255e-01
              16   9.798e-01   2.259e-01   1.399e-01   1.386e-01   1.400e-01
              17   9.381e-01   1.886e-01   1.668e-01   1.147e-01   9.349e-02
              18   9.274e-01   2.174e-01   1.676e-01   8.001e-02   7.541e-02
              19   9.381e-01   1.714e-01   1.636e-01   6.827e-02   6.508e-02
that the expectation may be evaluated, we may solve the original problem to obtain an estimate of x∗. Table 3.4 compares the analytical bounds with the empirical results for the given set of problems for an increasing number of iterations. For the (monotone) problems considered, the theoretical bound is shown to be valid but relatively weak.
We now investigate the benefit of utilizing the optimal γ0, denoted by γ0∗. Here, we consider the same set of problems as in the previous section and report the behavior of the proposed extragradient scheme in Table 3.5 for six different choices of γ0, ranging from 0.001γ0∗ to 100γ0∗ in factors of 10. It can be seen that γ0∗ performs either the best (or close to

[Figure 3.1. Mean-squared error ψ(xK) vs. the scaling h of γ0∗, shown for n = 5, 7, 9, 11, 13.]
Table 3.3. Comparison of ESA to MPSA-a and MPSA-b for frac. nonlin. problems

                  Projection (ESA)        Prox-A (MPSA-a)         Prox-B (MPSA-b)
 n   Iter. K    ψ(xK)       Time (s)    ψ(xK)       Time (s)    ψ(xK)       Time (s)
10    1000     2.754e-02   1.411e+01   1.352e-01   2.662e+01   1.624e-01   2.229e+01
      5000     7.774e-03   6.861e+01   1.117e-01   1.301e+02   1.132e-01   1.124e+02
     10000     4.163e-03   1.372e+02   1.050e-01   2.666e+02   9.762e-02   2.688e+02
     15000     2.955e-03   2.057e+02   1.019e-01   3.969e+02   8.953e-02   4.551e+02
11    1000     1.153e-01   1.373e+01   1.786e-01   3.250e+01   2.059e-01   2.249e+01
      5000     8.196e-02   6.880e+01   1.343e-01   1.632e+02   1.715e-01   1.189e+02
     10000     7.187e-02   1.380e+02   1.207e-01   3.241e+02   1.594e-01   2.568e+02
     15000     6.691e-02   2.072e+02   1.138e-01   4.869e+02   1.528e-01   3.964e+02
12    1000     9.712e-02   1.355e+01   1.764e-01   3.892e+01   1.186e-01   2.732e+01
      5000     8.126e-02   6.738e+01   1.362e-01   2.089e+02   1.064e-01   1.467e+02
     10000     7.672e-02   1.353e+02   1.235e-01   4.153e+02   1.029e-01   3.014e+02
     15000     7.450e-02   2.035e+02   1.170e-01   6.242e+02   1.016e-01   4.536e+02
13    1000     9.461e-02   1.345e+01   1.157e-01   3.166e+01   2.288e-01   2.966e+01
      5000     9.441e-02   6.839e+01   8.663e-02   1.625e+02   1.911e-01   1.610e+02
     10000     9.901e-02   1.374e+02   8.025e-02   3.256e+02   1.741e-01   3.427e+02
     15000     1.014e-01   2.064e+02   7.705e-02   4.906e+02   1.657e-01   5.234e+02
14    1000     1.122e-01   1.354e+01   1.674e-01   3.289e+01   3.037e-01   2.857e+01
      5000     7.006e-02   6.854e+01   1.362e-01   1.663e+02   2.440e-01   1.649e+02
     10000     5.791e-02   1.374e+02   1.266e-01   3.370e+02   2.225e-01   3.316e+02
     15000     5.186e-02   2.064e+02   1.218e-01   5.090e+02   2.108e-01   4.904e+02
15    1000     9.433e-03   1.388e+01   3.508e-02   4.528e+01   2.277e-02   2.464e+01
      5000     1.202e-02   6.975e+01   1.956e-02   2.235e+02   6.038e-03   2.315e+02
     10000     1.261e-02   1.402e+02   1.673e-02   4.332e+02   9.711e-03   5.551e+02
     15000     1.288e-02   2.111e+02   1.578e-02   6.314e+02   1.107e-02   8.836e+02
16    1000     6.847e-02   1.398e+01   1.666e-01   3.538e+01   2.992e-01   3.798e+01
      5000     3.799e-02   7.019e+01   1.328e-01   1.906e+02   2.307e-01   2.126e+02
     10000     2.947e-02   1.407e+02   1.225e-01   3.916e+02   2.074e-01   4.161e+02
     15000     2.540e-02   2.114e+02   1.169e-01   5.900e+02   1.950e-01   6.761e+02
17    1000     1.040e-01   1.377e+01   2.205e-01   3.691e+01   3.813e-01   4.409e+01
      5000     6.352e-02   6.948e+01   1.707e-01   1.672e+02   3.003e-01   2.645e+02
     10000     5.281e-02   1.397e+02   1.560e-01   3.393e+02   2.748e-01   5.665e+02
     15000     4.820e-02   2.101e+02   1.486e-01   5.160e+02   2.614e-01   8.671e+02
18    1000     1.582e-01   1.400e+01   2.156e-01   3.526e+01   3.921e-01   3.596e+01
      5000     1.190e-01   7.051e+01   1.653e-01   1.842e+02   3.308e-01   2.102e+02
     10000     1.053e-01   1.418e+02   1.500e-01   3.687e+02   3.119e-01   4.374e+02
     15000     9.833e-02   2.132e+02   1.422e-01   5.533e+02   3.021e-01   6.592e+02
19    1000     1.030e-01   1.734e+01   1.677e-01   4.998e+01   3.652e-01   5.098e+01
      5000     9.060e-02   8.658e+01   1.314e-01   2.990e+02   2.745e-01   2.951e+02
     10000     8.609e-02   1.732e+02   1.222e-01   5.893e+02   2.507e-01   6.243e+02
     15000     8.360e-02   2.603e+02   1.179e-01   8.753e+02   2.398e-01   9.625e+02
Table 3.4. Analytical vs Empirical Bounds for the stochastic Nash-Cournot Game

        K = 1                K = 100              K = 1000             K = 10000            K = 150000
 n   ψe(xK)    ψb(xK)    ψe(xK)    ψb(xK)    ψe(xK)    ψb(xK)    ψe(xK)    ψb(xK)    ψe(xK)    ψb(xK)
 5   3.45e+0   2.30e+04  1.02e-04  2.30e+02  4.54e-05  2.30e+01  2.54e-05  2.30e+00  2.24e-05  1.53e-01
 6   4.38e+0   3.97e+04  3.22e-04  3.97e+02  5.51e-05  3.97e+01  3.77e-05  3.97e+00  3.37e-03  2.65e-01
 7   5.32e+0   6.31e+04  3.32e-04  6.31e+02  9.33e-05  6.31e+01  5.18e-05  6.31e+00  5.82e-05  4.20e-01
 8   6.27e+0   9.42e+04  2.21e-03  9.42e+02  2.21e-03  9.42e+01  2.21e-03  9.42e+00  2.21e-03  6.28e-01
 9   7.23e+0   1.34e+05  5.53e-04  1.34e+03  1.39e-04  1.34e+02  1.24e-04  1.34e+01  1.05e-04  8.94e-01
10   8.19e+0   1.84e+05  5.20e-03  1.84e+03  5.20e-03  1.84e+02  5.20e-03  1.84e+01  5.20e-03  1.22e+0
the best) of all schemes. In fact, in some instances, a poorly chosen steplength leads to a significant drop-off in performance. It should also be noted that much smaller steplengths have drastically poorer performance, while much larger steplengths lead to marginally worse behavior, as observed in Figure 3.1, in which h denotes the scaling of γ0∗.
Table 3.5. Optimality error ψe for varying choices of γ0

 n   Iter. K    0.001γ0∗    0.01γ0∗     0.1γ0∗      γ0∗         10γ0∗       100γ0∗
 5     25      3.137e+00   1.350e+00   3.005e-02   4.659e-03   3.455e+00   3.455e+00
     1000      2.910e+00   6.354e-01   1.971e-06   2.995e-05   2.087e-04   3.400e+00
     5000      2.816e+00   4.579e-01   1.426e-05   2.380e-05   4.496e-05   3.153e-04
    15000      2.754e+00   3.660e-01   1.748e-05   1.716e-05   3.084e-05   9.780e-05
 6     25      3.915e+00   1.474e+00   2.204e-02   1.640e-02   4.382e+00   4.382e+00
     1000      3.586e+00   6.109e-01   2.198e-05   5.318e-05   5.529e-05   4.382e+00
     5000      3.452e+00   4.164e-01   3.395e-05   4.154e-05   8.673e-05   6.600e-04
    15000      3.363e+00   3.204e-01   3.703e-05   3.993e-05   5.452e-05   9.221e-05
 7     25      4.681e+00   1.543e+00   1.533e-02   5.604e-02   5.324e+00   5.324e+00
     1000      4.233e+00   5.632e-01   4.832e-05   7.128e-05   5.895e-04   5.324e+00
     5000      4.053e+00   3.631e-01   5.886e-05   6.410e-05   8.245e-05   6.742e-04
    15000      3.934e+00   2.689e-01   5.898e-05   5.559e-05   1.010e-04   3.414e-04
 8     25      5.429e+00   1.571e+00   1.085e-02   2.442e-01   6.275e+00   6.275e+00
     1000      4.849e+00   5.048e-01   7.415e-05   5.830e-05   3.041e-04   6.275e+00
     5000      4.617e+00   3.075e-01   8.173e-05   1.009e-04   1.331e-04   1.037e-03
    15000      4.465e+00   2.190e-01   8.573e-05   9.641e-05   1.080e-04   2.083e-04
 9     25      6.158e+00   1.565e+00   9.104e-03   7.168e-01   7.234e+00   7.234e+00
     1000      5.431e+00   4.420e-01   9.887e-05   1.230e-04   5.293e-04   7.234e+00
     5000      5.143e+00   2.543e-01   1.138e-04   1.253e-04   1.527e-04   1.395e-03
    15000      4.955e+00   1.741e-01   1.108e-04   1.156e-04   1.433e-04   4.675e-04
10     25      6.867e+00   1.541e+00   9.024e-03   2.046e+00   8.197e+00   8.197e+00
     1000      5.981e+00   3.825e-01   1.553e-04   1.463e-04   1.358e-03   8.197e+00
     5000      5.633e+00   2.078e-01   1.503e-04   1.090e-04   2.185e-04   3.610e-03
    15000      5.406e+00   1.366e-01   1.346e-04   1.434e-04   2.113e-04   4.360e-04
11     25      7.627e+00   1.532e+00   9.592e-03   4.331e+00   9.243e+00   9.243e+00
     1000      6.564e+00   3.423e-01   5.963e-06   7.952e-05   7.706e-04   9.243e+00
     5000      6.150e+00   1.782e-01   1.323e-06   8.573e-06   1.459e-04   2.372e-03
    15000      5.883e+00   1.143e-01   8.031e-07   4.067e-06   1.976e-05   3.245e-04
12     25      8.303e+00   1.440e+00   1.570e-03   7.169e+00   1.022e+01   1.022e+01
     1000      7.058e+00   2.615e-01   2.639e-03   2.643e-03   2.643e-03   1.022e+01
     5000      6.577e+00   1.199e-01   2.643e-03   2.643e-03   2.643e-03   2.026e-03
    15000      6.267e+00   6.882e-02   2.643e-03   2.643e-03   2.643e-03   2.891e-04
13     25      8.959e+00   1.412e+00   5.605e-03   1.082e+01   1.121e+01   1.121e+01
     1000      7.520e+00   2.432e-01   3.015e-05   3.031e-05   3.031e-05   1.121e+01
     5000      6.969e+00   1.127e-01   3.030e-05   3.031e-05   3.031e-05   2.552e-03
    15000      6.617e+00   6.651e-02   3.031e-05   3.031e-05   3.031e-05   4.815e-04
14     25      9.596e+00   1.378e+00   1.080e-02   1.220e+01   1.220e+01   9.768e+00
     1000      7.957e+00   2.255e-01   9.320e-04   9.316e-04   9.316e-04   1.220e+01
     5000      7.336e+00   1.058e-01   9.316e-04   9.316e-04   9.316e-04   9.879e-03
    15000      6.940e+00   6.410e-02   9.316e-04   9.316e-04   9.316e-04   1.486e-03
3.5 Concluding remarks
Variational inequality problems represent a useful tool for modeling a range of
phenomena arising in engineering, economics, and the applied sciences. As the
role of uncertainty grows, there has been a growing interest in the stochastic
variational inequality problem. However, much of the past research, particularly its algorithmic aspects, has focused on monotone stochastic variational inequality problems. In this paper, we consider pseudomonotone generalizations and address both the analysis and solution of such problems. In the first part of the paper,
faced by the lack of access to a closed-form expression of the expectation, we
develop integration-free sufficiency statements for the existence (uniqueness) of
solutions to (strict/strong) pseudomonotone SVIs and their special cases, including
pseudomonotone stochastic complementarity problems. In the second part of the
paper, we consider the development of stochastic approximation schemes, the
majority of which have been provided for monotone problems. In this context,
we provide amongst the first results for claiming a.s. convergence of the solution
iterates to the solution set produced by a (stochastic) extragradient scheme as
well as mirror-prox generalizations. We also show that similar statements can be provided for monotone SVIs under a weak-sharpness requirement; notably, much of the prior research for monotone SVIs uses averaging techniques in showing that the gap function converges in an expected-value sense. Under stronger
assumptions on the map, we show that both extragradient and mirror-prox schemes
display the optimal rate of convergence in terms of solution iterates, rather than
in terms of the gap function. Importantly, we further refine the rate statement by
deriving the optimal initial steplength. Notably, we see a modest degradation of the
rate from strongly monotone SVIs to strongly pseudomonotone SVIs. Preliminary
numerics suggest that the schemes perform well on a breadth of pseudomonotone
and non-monotone problems. Furthermore, empirical observations suggest that
significant benefit may accrue in terms of mean-squared error from employing the
optimal initial steplength.
Our work has made an initial step towards understanding how stochastic approximation schemes can be extended to regimes where pseudomonotonicity, rather than
monotonicity, of the map holds. Yet, we believe much remains to be understood
regarding how stochastic approximation schemes can be extended/modified to
contend with far weaker requirements on the map.
3.6 Appendix
Lemma 19. Consider the following recursion:
$$a_{k+1} \;\le\; \left(1 - \frac{2c\theta}{k}\right)a_k + \frac{1}{2}\frac{\theta^2 M^2}{k^2},$$
where θ, M are positive constants and (1 − 2cθ) < 0. Then for k ≥ 1, we have that
$$2a_k \;\le\; \frac{\max\left(\theta^2 M^2(2c\theta - 1)^{-1},\, 2a_1\right)}{k}.$$
Proof. Consider k = 1. Then the following holds: $a_2 \le (1 - 2c\theta)a_1 + \frac{1}{2}\theta^2 M^2$. If (2cθ − 1) > 0, we may rearrange the inequalities as follows:
$$(2c\theta - 1)a_1 \le -a_2 + \frac{1}{2}\theta^2 M^2 \quad\text{or}\quad 2a_1 \le \frac{\theta^2 M^2}{2c\theta - 1} \;\Longrightarrow\; 2a_1 \le \max\left((2c\theta - 1)^{-1}\theta^2 M^2,\, 2a_1\right).$$
Suppose this holds for k, implying that $2a_k \le \frac{\max(\theta^2 M^2(2c\theta-1)^{-1},\, 2a_1)}{k}$. We proceed to show that this holds for k + 1:
\begin{align*}
a_{k+1} &\le \left(1 - \frac{2c\theta}{k}\right)\frac{u}{2k} + \frac{1}{2}\frac{\theta^2 M^2}{k^2} \;\le\; \left(1 - \frac{2c\theta}{k}\right)\frac{u}{2k} + \frac{(2c\theta - 1)}{k^2}\frac{u}{2} \\
&= \frac{u}{2k} - \frac{2c\theta}{k}\frac{u}{2k} + \frac{(2c\theta - 1)}{k}\frac{u}{2k} \;=\; \frac{u}{2k} - \frac{1}{k}\frac{u}{2k} \\
&\le \frac{u}{2(k+1)}, \quad\text{where } u \triangleq \max\left(\theta^2 M^2(2c\theta - 1)^{-1},\, 2a_1\right).
\end{align*}
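The recursion in Lemma 19 can also be exercised numerically. The following sketch (with illustrative constants, chosen so that the coefficient is nonnegative after the first step) iterates the recursion with equality and confirms the claimed O(1/k) bound:

```python
# Numerical check of Lemma 19 (illustrative constants): iterating
# a_{k+1} = (1 - 2*c*theta/k) a_k + 0.5 * theta^2 * M^2 / k^2 with equality
# stays below the claimed bound
# 2*a_k <= max(theta^2 M^2 / (2*c*theta - 1), 2*a_1) / k.
theta, M, c = 1.0, 1.0, 0.75          # 2*c*theta = 1.5 > 1
a = a1 = 0.0
u = max(theta**2 * M**2 / (2 * c * theta - 1), 2 * a1)

for k in range(1, 10001):
    assert 2 * a <= u / k + 1e-9, (k, a)
    a = (1 - 2 * c * theta / k) * a + 0.5 * theta**2 * M**2 / k**2
print("Lemma 19 bound verified over 10000 iterations")
```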
Chapter 4 | Distributed Stochastic Optimization under Imperfect Information

This chapter represents a version of a working paper.[1]
4.1 Introduction
Distributed algorithms have grown enormously in relevance for addressing a broad class of problems in consensus and optimization across a broad set of networked applications in signal processing [39], communication networks [113, 114], and power systems [115], amongst others. A crucial assumption in any such framework is the availability of precisely specified objective functions and constraint sets. In practice, however, in many engineered and economic systems, agent-specific functions may be misspecified from a parametric standpoint, while agents may have access to observations that can aid in resolving this misspecification. Yet, almost all of the efforts in distributed algorithms for consensus and optimization [34, 35, 116, 117] sidestep the question of misspecification in the agent-specific problems, motivating the present work.
Decentralized and distributed approaches to decision-making and optimization date back to the seminal work of Tsitsiklis [5], where a breadth of conditions for guaranteeing convergence in asynchronous settings complicated by partial coordination, delayed communication, and the presence of noise were presented. In subsequent
¹Aswin Kannan, Angelia Nedić, and Uday V. Shanbhag, “Distributed Stochastic Optimization Under Imperfect Information”, Working Paper.
work [118], the authors examine the behavior of general distributed gradient-based
algorithms and relate their computational complexity with centralized approaches
for unconstrained stochastic regimes with finite communication delays. In related
work on parallel computing [119], the authors develop iterative schemes and derive
rate estimates for distributing computational load amongst multiple processors.
Recent extensions [120] analyze the rate of convergence in these settings by choosing optimal agent communication weights based on the maximum eigenvalues of the resulting communication matrix, an avenue that guarantees faster convergence at the expense of presolving a less tractable problem. In [121–123], alternative approaches for computing the optimal weights by solving smooth optimization problems are examined.
While consensus-based extensions to optimization with linear constraints were considered in [124], convergent algorithms for problems with general agent-specific convex constraints were first proposed in [34]. In [117] and [125], the authors consider problems with common (global) inequality and equality constraints in addition to agent-specific sets and propose primal-dual projection schemes. A more general case, in which agents have only partial information with respect to shared and nonsmooth constraints, is studied in [126]. Recent work [127] compares and obtains rate estimates for Newton- and gradient-based schemes for distributed quadratic minimization, a form of weighted least-squares problem, on networks with time-varying topology. Finally, in recent work [128], the authors consider an asynchronous gossip protocol, while stochastic extensions were considered in [35, 36, 116], where convergent distributed schemes were proposed and non-asymptotic error bounds were obtained.
Much of the prior work has concentrated on correctly specified regimes, where agents have perfect information regarding their functions. In practice, however, agents may not completely know the expressions for their cost functions, or there may be noise or errors associated with the operational data. For instance, in power market regimes, cost coefficients are ascertained from historical fuel and operating costs, while portfolio optimization relies on knowledge of the covariance matrix associated with the assets of interest.
In this work, we focus on the setting where agents do not have complete information regarding their objectives $f_i$, and this misspecification may be resolved by the solution of a convex learning problem, instances of which may arise as regression problems. We consider a networked multi-agent setting with time-varying undirected connectivity graphs, where the graph at time $t$ is denoted by $\mathcal{G}^t = \{\mathcal{N}, \mathcal{E}^t\}$, $\mathcal{N} \triangleq \{1, \ldots, m\}$ denotes the set of nodes, and $\mathcal{E}^t$ represents the set of edges at time $t$. Each node is assumed to represent a single agent and the problem of interest is
$$\min_{x} \; \sum_{i=1}^m \mathbb{E}[\varphi_i(x, \theta^*, \omega)] \quad \text{subject to} \quad x \in \bigcap_{i=1}^m X_i, \qquad (4.1)$$
where $\theta^* \in \mathbb{R}^p$ represents the (misspecified) vector of parameters, $\mathbb{E}[\varphi_i(x, \theta^*, \omega)]$ denotes the local cost function of agent $i$, and the expectation is taken with respect to the random variable $\omega$, defined as $\omega : \Omega \to \mathbb{R}^d$, where $(\Omega, \mathcal{F}, \mathbb{P})$ denotes the associated probability space. The cost function is specified through a function $\varphi_i : \mathbb{R}^n \times \mathbb{R}^p \times \mathbb{R}^d \to \mathbb{R}$, which is assumed to be convex and continuously differentiable in $x$ for all $\theta \in \mathbb{R}^p$ and $\omega$. Furthermore, agent $i$'s feasible region is given by a closed and convex set $X_i \subseteq \mathbb{R}^n$. We consider the solution of this problem under two restrictions:
(i) Local information: Agent $i$ has access only to $f_i$ and the set $X_i$, and can communicate at time $t$ with its local neighbors in the graph $\mathcal{G}^t$;
(ii) Misspecification: The functions are misspecified, in that the value $\theta^*$ is not known to the agents; consequently, the agents have an inaccurate specification of their own objectives.
It is assumed that the true parameter $\theta^*$ is a solution to a distinct convex optimization problem, accessible to every agent:
$$\min_{\theta \in \Theta} \; \mathbb{E}[g(\theta, \chi)], \qquad (4.2)$$
where Θ ⊆ Rp is a closed and convex set, χ : Ωθ → Rr is a random variable with
the associated probability space given by (Ωθ , Fθ , Pθ ), while g : Rp × Rr → R is
a strongly convex and continuously differentiable function in θ for every χ. Note
that this could equivalently be viewed as m learning problems where agent i’s
objective is parameterized by θ∗ . More generally, such techniques can be employed
for learning the parametrization of a prescribed function. Our problem of interest
can be viewed as the joint solution of the following two problems:
$$x^* \in \operatorname*{argmin}_{x}\ \Big\{\sum_{i=1}^m \mathbb{E}[\varphi_i(x, \theta^*, \omega)] \ \Big|\ x \in \bigcap_{i=1}^m X_i\Big\} \qquad (4.3)$$
$$\theta^* \in \operatorname*{argmin}_{\theta}\ \big\{\mathbb{E}[g(\theta, \chi)] \ \big|\ \theta \in \Theta\big\}. \qquad (4.4)$$
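As a simple illustration of the learning problem (4.4), suppose $g(\theta, \chi) = \|\theta - \chi\|^2$ with $\mathbb{E}[\chi] = \theta^*$, so that $h$ is strongly convex with unique minimizer $\theta^*$. A projected stochastic gradient sketch then recovers $\theta^*$; all numerical choices below (the noise distribution, steplengths, and the box $\Theta = [-2, 2]^d$) are hypothetical assumptions made for the example.

```python
import random

def sgd_learn_theta(theta_star, steps=20000, seed=0):
    """Stochastic gradient sketch for the learning problem with the
    illustrative choice g(theta, chi) = ||theta - chi||^2, where
    chi = theta_star + uniform noise, so E[chi] = theta_star."""
    rng = random.Random(seed)
    d = len(theta_star)
    theta = [0.0] * d
    for k in range(1, steps + 1):
        gamma = 1.0 / (2.0 * k)                      # gamma_k = gamma_0 / k
        chi = [t + rng.uniform(-0.5, 0.5) for t in theta_star]
        grad = [2.0 * (theta[i] - chi[i]) for i in range(d)]
        # projected stochastic gradient step, projecting onto the box [-2, 2]^d
        theta = [min(2.0, max(-2.0, theta[i] - gamma * grad[i])) for i in range(d)]
    return theta
```

With this steplength the iterate coincides with the running sample mean of the observations $\chi^k$, so the estimate converges to $\theta^*$ almost surely.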
Traditionally, such problems are approached sequentially: (i) an accurate approximation of $\theta^*$ is first obtained; and (ii) standard computational schemes are then applied. However, this avenue is inadvisable when the learning problems are stochastic and accurate solutions require significant effort. If the learning process is terminated prematurely, the resulting solution $\hat\theta$ may differ significantly from $\theta^*$. Consequently, such avenues can, at best, provide approximate solutions.
Inspired by recent work on learning and optimization in a centralized regime [129],
we consider the development of schemes for distributed stochastic optimization.
Traditional consensus-based distributed optimization schemes [34] combine local
averaging with a projected gradient step. We introduce an additional layer of
a learning step to aid in resolving the misspecification and develop both global
convergence and rate statements for such a set of schemes. Our distinct contributions
can be summarized as follows:
• Asymptotics: We consider settings where the sets $X_1, \ldots, X_m$ are either identical across agents or agent-specific. We propose an algorithm in which the learning parameter and the agent iterates are updated simultaneously. Under suitable assumptions on the random errors and Lipschitzian requirements on the agent objectives, we prove the almost sure convergence of the iterates produced by our proposed scheme to the optimal solution of the correctly specified problem.
• Rate statements: Under slightly stronger assumptions, we obtain non-asymptotic error bounds in a mean-squared sense. Under strong convexity, we establish an $\mathcal{O}(1/\sqrt{k})$ rate estimate on the iterates when the agent sets are general and not identical, and we clearly separate the effects of learning, consensus, and stochasticity on the constant in the rate result. Under mere convexity, we deploy iterate averaging and, by using a slightly different averaging window than usual, recover the standard error estimate on the function values when the agent sets are identical.
• Computational study: We use an economic dispatch problem as a case study to test our proposed schemes. We compare the optimality gap and the error from consensus for varying levels of effort in solving the learning problem, and we compare the behavior of our scheme with one where the learning problem is solved a priori. Numerical tests support the theoretical findings from both a convergence and a rate standpoint.
The remainder of this paper is organized into five sections. In Section 4.2, we
introduce our assumptions and state our algorithm. Global convergence and rate
statements are provided in Sections 4.3 and 4.4. In particular, our rate statements
quantify the degradation in the rate that emerges from learning and consensus.
Finally, we provide some preliminary numerics in Section 4.5 using an economic
dispatch problem and conclude in Section 4.6.
4.2 Assumptions and Algorithms
In this section, we begin by presenting a distributed framework for the joint solution of the optimization and learning problems [129]. To this end, for all $i \in \mathcal{N}$, we let
$$f_i(x, \theta) \triangleq \mathbb{E}[\varphi_i(x, \theta, \omega)] \quad \text{for all } x \text{ and } \theta \in \Theta, \qquad \text{and} \qquad h(\theta) \triangleq \mathbb{E}[g(\theta, \chi)] \quad \text{for all } \theta \in \Theta.$$
The algorithm has the following form. Suppose $v_i^k$, the average constructed by the $i$th agent at epoch $k$, is defined by
$$v_i^k := \sum_{j=1}^m a_i^{j,k} x_j^k \quad \text{for all } i = 1, \ldots, m \text{ and } k \ge 0, \qquad (4.5)$$
where $x_i^k$ denotes agent $i$'s estimate of the solution at epoch $k$, the $a_i^{j,k}$ are nonnegative scalars, and $\sum_{j=1}^m a_i^{j,k} = 1$. These weights are related to the underlying connectivity graph $\mathcal{G}^k$ at time $k$. The $i$th agent starts with random points $x_i^0 \in X_i$ and $\theta_i^0 \in \Theta$, and the agent updates at epoch $k \ge 0$ are given by
$$x_i^{k+1} := \Pi_{X_i}\left(v_i^k - \alpha_k\left(\nabla_x f_i(v_i^k, \theta_i^k) + w_i^k\right)\right) \qquad (4.6)$$
$$\theta_i^{k+1} := \Pi_{\Theta}\left(\theta_i^k - \gamma_k\left(\nabla h(\theta_i^k) + \beta_i^k\right)\right), \qquad (4.7)$$
where $w_i^k \triangleq \nabla_x \varphi_i(v_i^k, \theta_i^k, \omega_i^k) - \nabla_x f_i(v_i^k, \theta_i^k)$, $\beta_i^k \triangleq \nabla_\theta g(\theta_i^k, \chi_i^k) - \nabla h(\theta_i^k)$, $\nabla_x f_i(x, \theta) = \mathbb{E}[\nabla_x \varphi_i(x, \theta, \omega)]$, and $\nabla h(\theta_i^k) = \mathbb{E}[\nabla_\theta g(\theta_i^k, \chi)]$. At epoch $k$, agent $i$ averages over its neighbors' decisions using the weights in (4.5) and employs this average, denoted by $v_i^k$, to update its decision via (4.6). Furthermore, agent $i$ makes a subsequent update in its belief regarding $\theta^*$ through a similar (stochastic) gradient update, given by (4.7). Note that $a_i^{j,k}$ denotes the weight assigned by agent $i$ to the iterate of agent $j$ at time $k$, where $a_i^{j,k}$ is based on the connectivity graph $\mathcal{G}^k$. Letting $E_i^k$ denote the set of neighbors of agent $i$:
$$E_i^k = \{j \in \mathcal{N} \mid \{j, i\} \in \mathcal{E}^k\} \cup \{i\},$$
the weights $a_i^{j,k}$ are compliant with the neighbor structure:
$$a_i^{j,k} \begin{cases} = 0, & j \notin E_i^k, \\ > 0, & j \in E_i^k. \end{cases}$$
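A minimal sketch of one run of the scheme (4.5)–(4.7) may help fix ideas. It uses an illustrative instance with quadratic costs $f_i(x, \theta) = (x - \theta c_i)^2$, a common box constraint $X_i = [0, 10]$, uniform weights $a_i^{j,k} = 1/m$, and a noise-free learning problem $h(\theta) = (\theta - \theta^*)^2$; all constants are assumptions made for the example, not values from the text.

```python
def run_distributed(steps=5000):
    """Sketch of (4.5)-(4.7) on an illustrative scalar instance."""
    c = [1.0, 2.0, 3.0, 4.0]     # hypothetical cost slopes, one per agent
    m = len(c)
    theta_star = 0.5             # true parameter; here x* = theta* * mean(c) = 1.25
    x = [5.0] * m                # agent decision estimates
    theta = [0.0] * m            # agent beliefs about theta*
    for k in range(steps):
        alpha = 0.5 / (k + 1) ** 0.75   # alpha_k: nonsummable, alpha_k^2 summable
        gamma = 1.0 / (k + 1)           # gamma_k
        v = sum(x) / m                  # averaging step (4.5) with uniform weights
        x = [min(10.0, max(0.0, v - alpha * 2.0 * (v - theta[i] * c[i])))
             for i in range(m)]         # projected gradient step (4.6)
        theta = [min(2.0, max(-2.0, t - gamma * 2.0 * (t - theta_star)))
                 for t in theta]        # learning step (4.7), exact gradient
    return x, theta
```

On this instance the agents' iterates approach the optimum $x^* = 1.25$ of the correctly specified problem while the learned parameters converge to $\theta^* = 0.5$.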
Throughout the paper we assume that each of the graphs $\mathcal{G}^t$ is connected. Further, we define $\mathcal{F}_0 = \{(x_i^0, \theta_i^0),\, i \in \mathcal{N}\}$ and $\mathcal{F}_k \triangleq \mathcal{F}_0 \cup \{\omega^0, \chi^0, \ldots, \omega^{k-1}, \chi^{k-1}\}$ for all $k \ge 1$. We make the following assumptions on the first and second moments.
Assumption 7 (Conditional first and second moments).
(a) The conditional means $\mathbb{E}[w_i^k \mid \mathcal{F}_k]$ and $\mathbb{E}[\beta_i^k \mid \mathcal{F}_k]$ are zero for every $k$ and every $i \in \mathcal{N}$.
(b) The conditional second moments $\mathbb{E}[\|w_i^k\|^2 \mid \mathcal{F}_k]$ and $\mathbb{E}[\|\beta_i^k\|^2 \mid \mathcal{F}_k]$ are bounded by $\nu^2$ and $\nu_\theta^2$, respectively, for every $k$ and every $i \in \mathcal{N}$. Furthermore, $\mathbb{E}[\|x_i^0\|^2]$ and $\mathbb{E}[\|\theta_i^0\|^2]$ are bounded for all $i \in \mathcal{N}$.
We also make the following assumptions on the agent-specific objectives and the learning metric.
Assumption 8 (Agent objectives). The functions $f_i(x, \theta)$ are convex in $x$ for every $\theta$, for all $i \in \mathcal{N}$. Furthermore, for every $i \in \mathcal{N}$, the gradients $\nabla_x f_i(x, \theta)$ are uniformly Lipschitz continuous in $\theta$ for all $x$, i.e., $\|\nabla_x f_i(x, \theta) - \nabla_x f_i(x, \hat\theta)\| \le L_\theta \|\theta - \hat\theta\|$ for all $\theta$ and $\hat\theta$.
Assumption 9 (Learning metric). The function $h$ is strongly convex over the set $\Theta$ with strong convexity constant $\kappa > 0$, and its gradients are Lipschitz continuous with constant $R_\theta$: $\|\nabla h(\theta_a) - \nabla h(\theta_b)\| \le R_\theta \|\theta_a - \theta_b\|$ for all $\theta_a, \theta_b \in \Theta$.
The weight matrix satisfies the following assumption.
Assumption 10 (Doubly stochastic nature of weight matrix). The matrix $A(k)$ (whose $ij$th entry is denoted by $a_i^{j,k}$) is doubly stochastic for every $k$, i.e., $\sum_{i=1}^m a_i^{j,k} = 1$ for every $j$ and $\sum_{j=1}^m a_i^{j,k} = 1$ for every $i$.
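One standard way to satisfy Assumption 10 on an undirected graph is the well-known Metropolis weighting rule (a common choice, not one prescribed by the text); the resulting matrix is symmetric with nonnegative entries and unit row sums, hence doubly stochastic.

```python
def metropolis_weights(edges, m):
    """Metropolis weights on an undirected graph with nodes 0..m-1:
    a[i][j] = 1/(1 + max(d_i, d_j)) for neighbors {i, j}, and
    a[i][i] = 1 - sum of the off-diagonal entries in row i."""
    nbrs = {i: set() for i in range(m)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    deg = {i: len(nbrs[i]) for i in range(m)}
    A = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in nbrs[i]:
            A[i][j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        A[i][i] = 1.0 - sum(A[i][j] for j in nbrs[i])
    return A
```

For example, on the 4-cycle every node has degree 2, so each off-diagonal neighbor weight is $1/3$ and each self-weight is $1/3$.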
Throughout, we impose the following requirements on our steplength sequences.
Assumption 11 (Steplength sequences).
(a) The steplength sequences $\{\alpha_k\}$ and $\{\gamma_k\}$ satisfy the following for some $\tau$ such that $0 < \tau < 1$:
$$\sum_{k=0}^\infty \alpha_k = \infty, \quad \sum_{k=0}^\infty \gamma_k = \infty, \quad \sum_{k=0}^\infty \alpha_k^{2-\tau} < \infty, \quad \sum_{k=0}^\infty \gamma_k^2 < \infty.$$
(b) Furthermore, we have the following:
$$\lim_{k \to \infty} \frac{\alpha_k^{2-\tau}}{\gamma_k} = 0.$$
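For the common polynomial choices $\alpha_k = (k+1)^{-a}$ and $\gamma_k = (k+1)^{-b}$, Assumption 11 reduces, via the integral test for $p$-series, to explicit conditions on the exponents. The following sketch encodes that check; the reduction is a standard observation, not one stated in the text.

```python
def polynomial_steps_ok(a, b, tau):
    """Check Assumption 11 for alpha_k = (k+1)^(-a), gamma_k = (k+1)^(-b),
    0 < tau < 1, using: sum k^(-p) diverges iff p <= 1, converges iff p > 1."""
    assert 0 < tau < 1
    nonsummable = a <= 1 and b <= 1                # sum alpha_k = sum gamma_k = inf
    summable = a * (2 - tau) > 1 and 2 * b > 1     # sum alpha_k^(2-tau), gamma_k^2 < inf
    ratio_vanishes = a * (2 - tau) > b             # alpha_k^(2-tau) / gamma_k -> 0
    return nonsummable and summable and ratio_vanishes
```

For instance, $a = b = 1$ with $\tau = 1/2$ satisfies all of the conditions.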
We will use as blanket assumptions that the parameter set $\Theta$ and the sets $X_i$ are closed and convex. Occasionally, we will impose a boundedness assumption on these sets.
Assumption 12.
(a) For every $i \in \mathcal{N}$, the set $X_i$ is bounded, and $\max_{x_i \in X_i} \|x_i\| \le B$.
(b) The set $\Theta$ is bounded, and $\max_{\theta \in \Theta} \|\theta\| \le W$.
(c) For $i = 1, \ldots, m$, the gradient map satisfies $\|\mathbb{E}[\nabla_x \varphi_i(x, \theta^*, \omega)]\| \le S$ in an a.s. sense.
For cases when $X = \cap_{i=1}^m X_i$, we assume that the interior of $X$ is nonempty [34].
Assumption 13. There is a vector $\bar{x} \in \operatorname{int}(X)$, i.e., there exists a scalar $\delta > 0$ such that $\{z \mid \|z - \bar{x}\| \le \delta\} \subset X$.
Since we consider a stochastic setting, we leverage the following supermartingale convergence theorems from [69, Lemma 10, page 49] and [69, Lemma 11, page 50], respectively.
Lemma 20. Let $V_k$ be a sequence of nonnegative random variables adapted to a $\sigma$-algebra $\mathcal{F}_k$ and such that, a.s.,
$$\mathbb{E}[V_{k+1} \mid \mathcal{F}_k] \le (1 - u_k)V_k + \psi_k \quad \forall k \ge 0,$$
where $0 \le u_k \le 1$, $\psi_k \ge 0$, $\sum_{k=0}^\infty u_k = \infty$, $\sum_{k=0}^\infty \psi_k < \infty$, and $\lim_{k\to\infty} \frac{\psi_k}{u_k} = 0$. Then, a.s., $V_k \to 0$.
Lemma 21. Let $V_k$, $u_k$, $\delta_k$, and $\psi_k$ be nonnegative random variables adapted to a $\sigma$-algebra $\mathcal{F}_k$. If, a.s., $\sum_{k=0}^\infty u_k < \infty$, $\sum_{k=0}^\infty \psi_k < \infty$, and
$$\mathbb{E}[V_{k+1} \mid \mathcal{F}_k] \le (1 + u_k)V_k - \delta_k + \psi_k \quad \forall k \ge 0,$$
then, a.s., $V_k$ is convergent and $\sum_{k=0}^\infty \delta_k < \infty$.
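The mechanism behind Lemma 20 can be seen in a purely deterministic special case (an illustration, not part of the cited result): with a nonsummable $u_k$ and a summable $\psi_k$ satisfying $\psi_k/u_k \to 0$, the recursion drives $V_k$ to zero.

```python
def lemma20_demo(v0=10.0, steps=100000):
    """Deterministic illustration of Lemma 20: iterate
    V_{k+1} = (1 - u_k) V_k + psi_k with u_k = (k+1)^(-1/2) (nonsummable)
    and psi_k = (k+1)^(-2) (summable, psi_k / u_k -> 0)."""
    v = v0
    for k in range(steps):
        u = min(1.0, (k + 1) ** -0.5)
        psi = (k + 1) ** -2.0
        v = (1 - u) * v + psi
    return v
```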
4.3 Almost sure convergence
Under the assumption that the problem in (4.1)–(4.2) has a solution, we prove that the iterates generated by algorithm (4.6)–(4.7) converge to a solution $(x^*, \theta^*)$. Throughout the rest of the paper, we will simply write $\nabla f_i(x, \theta)$ instead of $\nabla_x f_i(x, \theta)$. While we make no additional assumptions on the structure of the objective, we consider two different possibilities for the definition of the constraint sets, as specified in [34]:
• Each agent has complete access to the common set of constraints, $X_i = X$ for all $i$. Such settings are prevalent in smaller networks, when the number of constraints is comparatively small, or when there is a centralized operator. In these cases, only mild requirements on the communication across agents are needed.
• In larger networks, by contrast, agents have complete access only to localized information. Transportation and power networks are classic examples, where information on the capacities of flow and transmission lines is known only to agents operating within a given neighborhood. In such cases, the sets take the more general form $X = \cap_{i=1}^m X_i$, and stronger assumptions on communication are required.
We begin by relating consecutive iterates of the algorithm.
Lemma 22. Let Assumptions 7, 8, and 12(c) hold. Let the iterates be generated according to (4.6)–(4.7). Then the following holds for every $k$:
$$
\begin{aligned}
\sum_{i=1}^m \mathbb{E}\big[\|x_i^{k+1} - x^*\|^2 \mid \mathcal{F}_k\big]
&\le \sum_{i=1}^m \big(1 + \alpha_k^{2-\tau} L_\theta^2\big)\|x_i^k - x^*\|^2 - \sum_{i=1}^m \mathbb{E}[\|\phi_i^k\|^2 \mid \mathcal{F}_k] + m\alpha_k^2(2S^2 + \nu^2) \\
&\quad - 2\alpha_k \sum_{i=1}^m \big(f_i(v_i^k, \theta^*) - f_i(x^*, \theta^*)\big) + \big(\alpha_k^\tau + 2\alpha_k^2 L_\theta^2\big)\sum_{i=1}^m \|\theta_i^k - \theta^*\|^2. \qquad (4.8)
\end{aligned}
$$
Proof. We use the projection relation for a closed convex set $X$, according to which $\|\Pi_X(x) - y\|^2 \le \|x - y\|^2 - \|\Pi_X(x) - x\|^2$ for all $y \in X$. Therefore, for all $i$ and any solution pair $(x^*, \theta^*)$, we have
$$
\begin{aligned}
\|x_i^{k+1} - x^*\|^2 &= \big\|\Pi_{X_i}\big(v_i^k - \alpha_k(\nabla f_i(v_i^k, \theta_i^k) + w_i^k)\big) - x^*\big\|^2 \\
&\le \big\|v_i^k - \alpha_k(\nabla f_i(v_i^k, \theta_i^k) + w_i^k) - x^*\big\|^2 - \big\|x_i^{k+1} - \big(v_i^k - \alpha_k(\nabla f_i(v_i^k, \theta_i^k) + w_i^k)\big)\big\|^2 \qquad (4.9) \\
&\le \|v_i^k - x^*\|^2 + T_A^k + T_B^k - \|\phi_i^k\|^2, \qquad (4.10)
\end{aligned}
$$
where
$$
\begin{aligned}
T_A^k &\triangleq \alpha_k^2 \|\nabla f_i(v_i^k, \theta_i^k) + w_i^k\|^2, \qquad (4.11) \\
T_B^k &\triangleq -2\alpha_k (v_i^k - x^*)^T \big(\nabla f_i(v_i^k, \theta_i^k) + w_i^k\big), \qquad (4.12) \\
\phi_i^k &\triangleq x_i^{k+1} - \big(v_i^k - \alpha_k(\nabla f_i(v_i^k, \theta_i^k) + w_i^k)\big). \qquad (4.13)
\end{aligned}
$$
Expanding $T_A^k$, we obtain that
$$T_A^k = \alpha_k^2 \|\nabla f_i(v_i^k, \theta_i^k) + w_i^k\|^2 = \alpha_k^2 \|\nabla f_i(v_i^k, \theta_i^k)\|^2 + \alpha_k^2 \|w_i^k\|^2 + 2\alpha_k^2 (w_i^k)^T \nabla f_i(v_i^k, \theta_i^k). \qquad (4.14)$$
Applying conditional expectations on both sides of (4.14), we have that
$$
\begin{aligned}
\mathbb{E}\big[T_A^k \mid \mathcal{F}_k\big] &= \alpha_k^2 \big\|\nabla f_i(v_i^k, \theta^*) + \nabla f_i(v_i^k, \theta_i^k) - \nabla f_i(v_i^k, \theta^*)\big\|^2
+ \alpha_k^2 \underbrace{\mathbb{E}[\|w_i^k\|^2 \mid \mathcal{F}_k]}_{\le\, \nu^2}
+ 2\alpha_k^2 \underbrace{\mathbb{E}\big[(w_i^k)^T \nabla f_i(v_i^k, \theta_i^k) \mid \mathcal{F}_k\big]}_{=\, 0} \\
&\le \alpha_k^2 \big(2S^2 + 2L_\theta^2 \|\theta_i^k - \theta^*\|^2 + \nu^2\big). \qquad (4.15)
\end{aligned}
$$
Consequently, we may derive a bound on $T_B^k$:
$$
\begin{aligned}
T_B^k &= -2\alpha_k (v_i^k - x^*)^T \big(\nabla f_i(v_i^k, \theta_i^k) - \nabla f_i(v_i^k, \theta^*)\big) - 2\alpha_k (v_i^k - x^*)^T \nabla f_i(v_i^k, \theta^*) - 2\alpha_k (v_i^k - x^*)^T w_i^k \\
&\le 2\alpha_k \|v_i^k - x^*\|\,\big\|\nabla f_i(v_i^k, \theta_i^k) - \nabla f_i(v_i^k, \theta^*)\big\| - 2\alpha_k (v_i^k - x^*)^T \nabla f_i(v_i^k, \theta^*) - 2\alpha_k (v_i^k - x^*)^T w_i^k \\
&\le 2\alpha_k L_\theta \|v_i^k - x^*\|\,\|\theta_i^k - \theta^*\| - 2\alpha_k (v_i^k - x^*)^T w_i^k - 2\alpha_k (v_i^k - x^*)^T \nabla f_i(v_i^k, \theta^*) \\
&\le \alpha_k^{2-\tau} L_\theta^2 \|v_i^k - x^*\|^2 + \alpha_k^\tau \|\theta_i^k - \theta^*\|^2 - 2\alpha_k (v_i^k - x^*)^T w_i^k - 2\alpha_k \big(f_i(v_i^k, \theta^*) - f_i(x^*, \theta^*)\big), \qquad (4.16)
\end{aligned}
$$
where the inequalities result from invoking Lipschitz continuity, completion of squares, and the convexity of $f_i$. By taking conditional expectations with respect to $\mathcal{F}_k$, we obtain the following:
$$\mathbb{E}\big[T_B^k \mid \mathcal{F}_k\big] \le \alpha_k^{2-\tau} L_\theta^2 \|v_i^k - x^*\|^2 + \alpha_k^\tau \|\theta_i^k - \theta^*\|^2 - 2\alpha_k \big(f_i(v_i^k, \theta^*) - f_i(x^*, \theta^*)\big).$$
Summing (4.10) over $i = 1, \ldots, m$, we have the following:
$$
\begin{aligned}
\sum_{i=1}^m \mathbb{E}\big[\|x_i^{k+1} - x^*\|^2 \mid \mathcal{F}_k\big]
&\le \sum_{i=1}^m \|v_i^k - x^*\|^2 + m\alpha_k^2(2S^2 + \nu^2) + \alpha_k^{2-\tau} L_\theta^2 \sum_{i=1}^m \|v_i^k - x^*\|^2 \\
&\quad + \big(\alpha_k^\tau + 2\alpha_k^2 L_\theta^2\big)\sum_{i=1}^m \|\theta_i^k - \theta^*\|^2 - \sum_{i=1}^m \mathbb{E}[\|\phi_i^k\|^2 \mid \mathcal{F}_k] - 2\alpha_k \sum_{i=1}^m \big(f_i(v_i^k, \theta^*) - f_i(x^*, \theta^*)\big).
\end{aligned}
$$
Noting that $\sum_{j=1}^m a_i^{j,k} = 1$ for all $i$ and $k$, by the convexity of the squared norm and the double stochasticity of the weights, we have the following:
$$
\sum_{i=1}^m \|v_i^k - x^*\|^2 = \sum_{i=1}^m \Big\|\sum_{j=1}^m a_i^{j,k}\big(x_j^k - x^*\big)\Big\|^2 \le \sum_{i=1}^m \sum_{j=1}^m a_i^{j,k} \|x_j^k - x^*\|^2 = \sum_{j=1}^m \Big(\sum_{i=1}^m a_i^{j,k}\Big)\|x_j^k - x^*\|^2 = \sum_{j=1}^m \|x_j^k - x^*\|^2. \qquad (4.17)
$$
This leads to the following bound, completing the proof:
$$
\begin{aligned}
\sum_{i=1}^m \mathbb{E}\big[\|x_i^{k+1} - x^*\|^2 \mid \mathcal{F}_k\big]
&\le \sum_{i=1}^m \big(1 + \alpha_k^{2-\tau} L_\theta^2\big)\|x_i^k - x^*\|^2 - \sum_{i=1}^m \mathbb{E}\big[\|\phi_i^k\|^2 \mid \mathcal{F}_k\big] \\
&\quad - 2\alpha_k \sum_{i=1}^m \big(f_i(v_i^k, \theta^*) - f_i(x^*, \theta^*)\big) + \big(\alpha_k^\tau + 2\alpha_k^2 L_\theta^2\big)\sum_{i=1}^m \|\theta_i^k - \theta^*\|^2 + m\alpha_k^2(2S^2 + \nu^2).
\end{aligned}
$$
Following the line of analysis in [34], we define $\Phi(k, s) = A(s)A(s+1)\cdots A(k)$ for $k > s$, for which we have $\big|[\Phi(k, s)]_i^j - \frac{1}{m}\big| \le C\lambda^{k-s}$ for all $i$ and $j$, where $C > 0$ and $\lambda \in (0, 1)$. We also state and use Lemma 8 from [34] in developing our convergence theory.
Lemma 23. Suppose $D = S + 2WL_\theta$. Let Assumptions 8, 10, and 12(b,c) hold. Let the iterates $x_i^k$ be generated according to (4.6)–(4.7) and let $y^k = \frac{1}{m}\sum_{i=1}^m x_i^k$. Then the following holds for every $i$, $k$:
$$
\|x_i^k - y^k\| \le mC\lambda^{k-1} \sum_{j=1}^m \|x_j^0\| + mCD \sum_{r=0}^{k-2} \lambda^{k-r}\alpha_r + C \sum_{r=0}^{k-2} \lambda^{k-r} \sum_{j=1}^m \|\phi_j^r\| + 2\alpha_{k-1} D + \|\phi_i^{k-1}\| + \frac{1}{m}\sum_{j=1}^m \|\phi_j^{k-1}\|, \quad \text{a.s.} \qquad (4.18)
$$
Note that the bound $D = S + 2WL_\theta$ is obtained by observing that
$$\|\nabla f_i(x, \theta)\| \le \|\nabla f_i(x, \theta^*)\| + \|\nabla f_i(x, \theta) - \nabla f_i(x, \theta^*)\| \le S + 2WL_\theta = D, \qquad (4.19)$$
a consequence of invoking the triangle inequality together with Assumptions 8 and 12(b,c).
We first establish the convergence theory for the case where all agents have access to the complete set of constraints.
Proposition 14 (a.s. convergence when $X_i = X$ for every $i$). Let Assumptions 7–11 and 12(b,c) hold and let $X = X_i$ for every $i$. Consider the iterates generated by (4.6)–(4.7). Then for $i = 1, \ldots, m$, $x_i^k \to x^*$ as $k \to \infty$ in an almost sure sense.
Proof. By the nonexpansivity of the projection operator, utilizing the strong monotonicity and Lipschitz continuity of $\nabla h(\theta)$, and recalling the relation $\theta^* = \Pi_\Theta\big(\theta^* - \gamma_k \nabla h(\theta^*)\big)$, we obtain the following:
$$
\begin{aligned}
\|\theta_i^{k+1} - \theta^*\|^2 &\le \big\|\theta_i^k - \gamma_k\big(\nabla h(\theta_i^k) + \beta_i^k\big) - \theta^* + \gamma_k \nabla h(\theta^*)\big\|^2 \\
&= \|\theta_i^k - \theta^*\|^2 + \gamma_k^2 \|\nabla h(\theta_i^k) - \nabla h(\theta^*)\|^2 + \gamma_k^2 \|\beta_i^k\|^2 \\
&\quad - 2\gamma_k \big(\nabla h(\theta_i^k) - \nabla h(\theta^*)\big)^T (\theta_i^k - \theta^*) - 2\gamma_k (\beta_i^k)^T \big(\theta_i^k - \theta^* - \gamma_k(\nabla h(\theta_i^k) - \nabla h(\theta^*))\big) \\
&\le \big(1 - 2\gamma_k \kappa + \gamma_k^2 R_\theta^2\big)\|\theta_i^k - \theta^*\|^2 + \gamma_k^2 \|\beta_i^k\|^2 - 2\gamma_k (\beta_i^k)^T \big(\theta_i^k - \theta^* - \gamma_k(\nabla h(\theta_i^k) - \nabla h(\theta^*))\big).
\end{aligned}
$$
Taking conditional expectations, we may claim that
$$\mathbb{E}\big[\|\theta_i^{k+1} - \theta^*\|^2 \mid \mathcal{F}_k\big] \le \big(1 - 2\gamma_k \kappa + \gamma_k^2 R_\theta^2\big)\|\theta_i^k - \theta^*\|^2 + \gamma_k^2 \nu_\theta^2,$$
since $\mathbb{E}\big[2\gamma_k (\beta_i^k)^T \big(\theta_i^k - \theta^* - \gamma_k(\nabla h(\theta_i^k) - \nabla h(\theta^*))\big) \mid \mathcal{F}_k\big] = 0$. By summing over $i$, we obtain
$$\sum_{i=1}^m \mathbb{E}\big[\|\theta_i^{k+1} - \theta^*\|^2 \mid \mathcal{F}_k\big] \le \sum_{i=1}^m \Big[\big(1 + \gamma_k^2 R_\theta^2 - 2\gamma_k \kappa\big)\|\theta_i^k - \theta^*\|^2 + \gamma_k^2 \nu_\theta^2\Big]. \qquad (4.20)$$
Defining $\psi_k \triangleq \frac{\alpha_k^\tau}{2\kappa\gamma_k}$ and adding $\psi_k$ times (4.20) to (4.8), we obtain the following:
$$
\begin{aligned}
\sum_{i=1}^m \mathbb{E}\big[\|x_i^{k+1} - x^*\|^2 + \psi_k \|\theta_i^{k+1} - \theta^*\|^2 \mid \mathcal{F}_k\big]
&\le \sum_{i=1}^m \big(1 + \alpha_k^{2-\tau} L_\theta^2\big)\|x_i^k - x^*\|^2 - \sum_{i=1}^m \mathbb{E}[\|\phi_i^k\|^2 \mid \mathcal{F}_k] \\
&\quad - 2\alpha_k \sum_{i=1}^m \big(f_i(v_i^k, \theta^*) - f_i(x^*, \theta^*)\big) + m\alpha_k^2(2S^2 + \nu^2) + \psi_k \gamma_k^2 \nu_\theta^2 m \\
&\quad + \Big(\alpha_k^\tau + 2\alpha_k^2 L_\theta^2 + \psi_k\big(1 + \gamma_k^2 R_\theta^2 - 2\gamma_k \kappa\big)\Big)\sum_{i=1}^m \|\theta_i^k - \theta^*\|^2 \\
&\le (1 + \rho_k)\sum_{i=1}^m \Big(\|x_i^k - x^*\|^2 + \psi_k \|\theta_i^k - \theta^*\|^2\Big) - \sum_{i=1}^m \mathbb{E}[\|\phi_i^k\|^2 \mid \mathcal{F}_k] \\
&\quad - 2\alpha_k \sum_{i=1}^m \big(f_i(v_i^k, \theta^*) - f_i(x^*, \theta^*)\big) + m\alpha_k^2(2S^2 + \nu^2) + \psi_k \gamma_k^2 \nu_\theta^2 m,
\end{aligned}
$$
where $\rho_k \triangleq \max\Big(\alpha_k^{2-\tau} L_\theta^2,\; \gamma_k^2 R_\theta^2 - 2\kappa\gamma_k + \frac{\alpha_k^\tau + 2\alpha_k^2 L_\theta^2}{\psi_k}\Big)$. Denoting $T_i^k \triangleq \|x_i^k - x^*\|^2 + \psi_k \|\theta_i^k - \theta^*\|^2$, we have
$$
\sum_{i=1}^m \mathbb{E}[T_i^{k+1} \mid \mathcal{F}_k] \le \sum_{i=1}^m (1 + \rho_k) T_i^k + \underbrace{\mathbb{E}\Big[-2\alpha_k \sum_{i=1}^m \big(f_i(v_i^k, \theta^*) - f_i(x^*, \theta^*)\big) \,\Big|\, \mathcal{F}_k\Big]}_{r_k} - \underbrace{\mathbb{E}\Big[\sum_{i=1}^m \|\phi_i^k\|^2 \,\Big|\, \mathcal{F}_k\Big]}_{s_k} + \underbrace{\psi_k \gamma_k^2 \nu_\theta^2 m + \alpha_k^2 m(2S^2 + \nu^2)}_{u_k},
$$
where $\psi_k \triangleq \frac{\alpha_k^\tau}{2\kappa\gamma_k}$, as defined earlier. Letting $y^k = \frac{1}{m}\sum_{i=1}^m x_i^k$, by convexity, the Cauchy–Schwarz inequality, and the boundedness of $\nabla f_i(v_i^k, \theta^*)$, we have the following sequence of inequalities:
$$
\begin{aligned}
r_k &= -2\alpha_k \sum_{i=1}^m \big(f_i(v_i^k, \theta^*) - f_i(x^*, \theta^*)\big)
= -2\alpha_k \sum_{i=1}^m \Big[\big(f_i(y^k, \theta^*) - f_i(x^*, \theta^*)\big) + \big(f_i(v_i^k, \theta^*) - f_i(y^k, \theta^*)\big)\Big] \\
&\le -2\alpha_k \big(f(y^k, \theta^*) - f(x^*, \theta^*)\big) + 2\alpha_k \sum_{i=1}^m \|\nabla f_i(v_i^k, \theta^*)\|\,\|v_i^k - y^k\| \\
&\le -2\alpha_k \big(f(y^k, \theta^*) - f(x^*, \theta^*)\big) + 2\alpha_k S \sum_{i=1}^m \|v_i^k - y^k\|,
\end{aligned}
$$
where $f = \sum_{i=1}^m f_i$. We obtain the following recursion from the nonnegativity of $s_k$:
$$
\sum_{i=1}^m \mathbb{E}[T_i^{k+1} \mid \mathcal{F}_k] \le (1 + \rho_k)\sum_{i=1}^m T_i^k - 2\alpha_k \big(f(y^k, \theta^*) - f(x^*, \theta^*)\big) + u_k + 2\alpha_k S\, \mathbb{E}\Big[\sum_{i=1}^m \|x_i^k - y^k\| \,\Big|\, \mathcal{F}_k\Big],
$$
where we employ (4.17) to deduce that $\sum_{i=1}^m \|v_i^k - y^k\| \le \sum_{i=1}^m \|x_i^k - y^k\|$. We now proceed to show that $\big\{\sum_{i=1}^m T_i^k\big\}$ converges to zero in an almost sure sense as $k \to \infty$ by employing the supermartingale convergence theorem. Our first step requires showing the summability of $\rho_k$, $u_k$, and $\alpha_k \sum_{i=1}^m \|x_i^k - y^k\|$:
Summability of $\rho_k$: From $\psi_k = \frac{\alpha_k^\tau}{2\kappa\gamma_k}$, as stated earlier, it is easy to observe that
$$\gamma_k^2 R_\theta^2 - 2\kappa\gamma_k + \frac{\alpha_k^\tau + 2\alpha_k^2 L_\theta^2}{\psi_k} = \gamma_k^2 R_\theta^2 + \underbrace{\frac{4\kappa L_\theta^2\, \alpha_k^{2-\tau}}{\gamma_k}}_{\eta_k}\,\gamma_k^2.$$
From Assumption 11(b), $\lim_{k\to\infty} \eta_k = 0$, and from Assumption 11(a), $\sum_k \gamma_k^2 < \infty$. This further implies that
$$\sum_k \Big(\gamma_k^2 R_\theta^2 - 2\kappa\gamma_k + \frac{\alpha_k^\tau + 2\alpha_k^2 L_\theta^2}{\psi_k}\Big) = \sum_k \big(\gamma_k^2 R_\theta^2 + \eta_k \gamma_k^2\big) < \infty. \qquad (4.21)$$
From $\rho_k \triangleq \max\Big(\alpha_k^{2-\tau} L_\theta^2,\; \gamma_k^2 R_\theta^2 - 2\kappa\gamma_k + \frac{\alpha_k^\tau + 2\alpha_k^2 L_\theta^2}{\psi_k}\Big)$ and by invoking Assumption 11(a) to claim the summability of $\alpha_k^{2-\tau}$, it follows that $\sum_k \rho_k < \infty$.

Summability of $\alpha_k \mathbb{E}\big[\sum_{i=1}^m \|x_i^k - y^k\| \mid \mathcal{F}_k\big]$: We may bound $\|x_i^{k+1} - v_i^k\|^2$ as follows by using the projection property $\|\Pi_X(y) - x\|^2 \le \|y - x\|^2 - \|\Pi_X(y) - y\|^2$:
$$
\begin{aligned}
\|x_i^{k+1} - v_i^k\|^2 &\le \big\|\big(v_i^k - \alpha_k(\nabla f_i(v_i^k, \theta_i^k) + w_i^k)\big) - v_i^k\big\|^2 - \big\|x_i^{k+1} - \big(v_i^k - \alpha_k(\nabla f_i(v_i^k, \theta_i^k) + w_i^k)\big)\big\|^2 \\
&= \|\alpha_k(\nabla f_i(v_i^k, \theta_i^k) + w_i^k)\|^2 - \|\phi_i^k\|^2.
\end{aligned}
$$
This implies that
$$\|\phi_i^k\|^2 \le \|\alpha_k(\nabla f_i(v_i^k, \theta_i^k) + w_i^k)\|^2 - \|x_i^{k+1} - v_i^k\|^2 \le \alpha_k^2 \|\nabla f_i(v_i^k, \theta_i^k) + w_i^k\|^2,$$
by the nonnegativity of $\|x_i^{k+1} - v_i^k\|^2$. We now proceed to obtain a bound for $\mathbb{E}[\|\phi_i^r\|^2 \mid \mathcal{F}_k]$, for $r \le k$:
$$
\begin{aligned}
\mathbb{E}[\|\phi_i^r\|^2 \mid \mathcal{F}_k] &= \mathbb{E}\big[\|\phi_i^r\|^2 \mid \mathcal{F}_0 \cup \{w_i^1, \chi_i^1, \ldots, w_i^k, \chi_i^k\}\big]
= \mathbb{E}\big[\|\phi_i^r\|^2 \mid \mathcal{F}_0 \cup \{w_i^1, \chi_i^1, \ldots, w_i^r, \chi_i^r\}\big]
= \mathbb{E}[\|\phi_i^r\|^2 \mid \mathcal{F}_r] \\
&\le \mathbb{E}\big[\|\alpha_r(\nabla f_i(v_i^r, \theta_i^r) + w_i^r)\|^2 \mid \mathcal{F}_r\big] \\
&\le \alpha_r^2\Big(\mathbb{E}[\|\nabla f_i(v_i^r, \theta_i^r)\|^2 \mid \mathcal{F}_r] + \mathbb{E}[\|w_i^r\|^2 \mid \mathcal{F}_r] + 2\,\mathbb{E}\big[\nabla f_i(v_i^r, \theta_i^r)^T w_i^r \mid \mathcal{F}_r\big]\Big) \\
&\le \alpha_r^2 \big(D^2 + \nu^2\big), \qquad (4.22)
\end{aligned}
$$
where the inequalities follow from Assumption 12(c), (4.19), and Assumption 7. By Jensen's inequality and by utilizing (4.22), we obtain
$$\mathbb{E}[\|\phi_i^r\| \mid \mathcal{F}_k] \le \sqrt{\mathbb{E}[\|\phi_i^r\|^2 \mid \mathcal{F}_k]} \le \sqrt{\mathbb{E}\big[\|\alpha_r(\nabla f_i(v_i^r, \theta_i^r) + w_i^r)\|^2 \mid \mathcal{F}_r\big]} \le \alpha_r \sqrt{D^2 + \nu^2}. \qquad (4.23)$$
By taking conditional expectations in (4.18) of Lemma 23 and invoking Assumption 12(a), we have the following:
$$
\begin{aligned}
\mathbb{E}[\|x_i^k - y^k\| \mid \mathcal{F}_k]
&\le mC\lambda^{k-1} \sum_{j=1}^m \|x_j^0\| + mCD \sum_{r=0}^{k-2} \lambda^{k-r}\alpha_r + C \sum_{r=0}^{k-2} \lambda^{k-r} \sum_{j=1}^m \mathbb{E}[\|\phi_j^r\| \mid \mathcal{F}_k] \\
&\quad + 2\alpha_{k-1} D + \mathbb{E}[\|\phi_i^{k-1}\| \mid \mathcal{F}_k] + \frac{1}{m}\sum_{j=1}^m \mathbb{E}[\|\phi_j^{k-1}\| \mid \mathcal{F}_k] \qquad (4.24) \\
&\le m^2 BC\lambda^{k-1} + mC\big(D + \sqrt{D^2 + \nu^2}\big)\sum_{r=0}^{k-2} \lambda^{k-r}\alpha_r + 2\alpha_{k-1}\big(D + \sqrt{D^2 + \nu^2}\big), \qquad (4.25)
\end{aligned}
$$
where the second inequality invokes (4.23) and $\|x_j^0\| \le B$. Summability of $\sum_{k=0}^\infty \alpha_k \mathbb{E}\big[\sum_{i=1}^m \|x_i^k - y^k\| \mid \mathcal{F}_k\big]$ is then a consequence of the following observations: $\sum_k \alpha_k \lambda^{k-1} < \infty$, since $\{\alpha_k\}$ is bounded and $\lambda \in (0, 1)$;
$$\sum_k \alpha_k \sum_{r=0}^{k-2} \lambda^{k-r}\alpha_r \le \frac{1}{2}\sum_k \sum_{r=0}^{k-2} \lambda^{k-r}\big(\alpha_k^2 + \alpha_r^2\big) \le \frac{1}{1-\lambda}\sum_k \alpha_k^2 < \infty,$$
by invoking $2\alpha_k\alpha_r \le \alpha_k^2 + \alpha_r^2$; and $\sum_k \alpha_k\alpha_{k-1} \le \frac{1}{2}\sum_k \big(\alpha_k^2 + \alpha_{k-1}^2\big) < \infty$.
Summability of $u_k$: The summability of $u_k$ follows immediately from the summability of $\alpha_k^2$.
By employing Lemma 21, $\sum_{i=1}^m T_i^k$ converges in an almost sure sense to a nonnegative random variable:
$$\lim_{k\to\infty} \sum_{i=1}^m T_i^k = \lim_{k\to\infty} \sum_{i=1}^m \Big(\|x_i^k - x^*\|^2 + \psi_k \|\theta_i^k - \theta^*\|^2\Big) = r \ge 0.$$
By applying Lemma 20 to (4.20), for $i \in \mathcal{N}$, $\{\theta_i^k\}$ is a convergent sequence and $\theta_i^k \to \theta^*$ as $k \to \infty$ in an a.s. sense. By recalling that $\{T_i^k\}$ is a convergent sequence, we may then conclude that $\{x_i^k\}$ is a convergent sequence. Since $y^k$ is defined as $\frac{1}{m}\sum_{i=1}^m x_i^k$, $\{y^k\}$ is a convergent sequence. Furthermore, $\sum_{k=0}^\infty 2\alpha_k \big(f(y^k, \theta^*) - f(x^*, \theta^*)\big) < \infty$. From the nonnegativity of $f(y^k, \theta^*) - f(x^*, \theta^*)$ and the non-summability of $\alpha_k$, it follows that $\liminf_{k\to\infty} f(y^k, \theta^*) = f(x^*, \theta^*)$ in an a.s. sense. Since $f$ is convex and therefore continuous, some subsequence of $\{y^k\}$ converges to a point $x^* \in X^*$; since $\{y^k\}$ is convergent, the entire sequence must converge to $x^*$. Moreover, $\sum_k \alpha_k \sum_{i=1}^m \|x_i^k - y^k\| < \infty$ while $\sum_k \alpha_k = \infty$, implying that $\liminf_{k\to\infty} \sum_{i=1}^m \|x_i^k - y^k\| = 0$ in an a.s. sense. If $\sum_{i=1}^m \|x_i^k - y^k\| \to 0$ along some subsequence, then, by the convergence of $\{x_i^k\}$ and $\{y^k\}$, the entire sequence $\big\{\sum_{i=1}^m \|x_i^k - y^k\|\big\} \to 0$ in an a.s. sense as $k \to \infty$. But $y^k \to x^* \in X^*$ as $k \to \infty$, implying that $\{x_i^k\} \to x^*$ for $i = 1, \ldots, m$ as $k \to \infty$ in an a.s. sense.
We now examine the case where the set $X$ is not common to all agents and is instead given by $X \triangleq \cap_{i=1}^m X_i$. The connectivity matrix $A$ requires stronger properties, and we establish, using a slightly different analysis, that our algorithm generates iterates that converge to an optimal solution of (4.1). We use the following lemma from [34] to develop our convergence theory:
Lemma 24. Let $X_i \subseteq \mathbb{R}^n$, $i = 1, \ldots, m$, be nonempty closed convex sets and let there exist a vector $\bar{x} \in X = \cap_{i=1}^m X_i$ such that $\operatorname{dist}(\bar{x}, \operatorname{bd}(X)) \ge \delta > 0$. Let $x^i \in X_i$, $i = 1, \ldots, m$, be arbitrary vectors and define their average as $\hat{x} = \frac{1}{m}\sum_{i=1}^m x^i$. Let $s = \frac{\epsilon}{\epsilon + \delta}\bar{x} + \frac{\delta}{\epsilon + \delta}\hat{x}$, where $\epsilon = \sum_{j=1}^m \operatorname{dist}(\hat{x}, X_j)$.
• The vector $s$ belongs to the intersection set $X$.
• We have the following relation:
$$\|\hat{x} - s\| \le \frac{1}{\delta m}\Big(\sum_{j=1}^m \|x^j - \bar{x}\|\Big)\Big(\sum_{j=1}^m \operatorname{dist}(\hat{x}, X_j)\Big).$$
As a particular consequence, we have
$$\operatorname{dist}(\hat{x}, X) \le \frac{1}{\delta m}\Big(\sum_{j=1}^m \|x^j - \bar{x}\|\Big)\Big(\sum_{j=1}^m \operatorname{dist}(\hat{x}, X_j)\Big).$$
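The construction in Lemma 24 can be made concrete. The sketch below uses axis-aligned boxes as a stand-in for the general closed convex sets $X_i$ (an assumption made purely for the example) and computes the feasible vector $s$:

```python
import math

def box_dist(p, lo, hi):
    """Euclidean distance from point p to the box [lo_1,hi_1] x ... x [lo_n,hi_n]."""
    q = [min(hi[i], max(lo[i], p[i])) for i in range(len(p))]
    return math.dist(p, q)

def feasible_point(points, boxes, xbar, delta):
    """Lemma 24 construction: s = (eps * xbar + delta * xhat) / (eps + delta),
    where xhat is the average of the points and eps = sum_j dist(xhat, X_j)."""
    m, n = len(points), len(points[0])
    xhat = [sum(p[i] for p in points) / m for i in range(n)]
    eps = sum(box_dist(xhat, lo, hi) for lo, hi in boxes)
    s = [(eps * xbar[i] + delta * xhat[i]) / (eps + delta) for i in range(n)]
    return s, xhat, eps
```

For instance, with $X_1 = [0,2]^2$, $X_2 = [1,3]^2$, interior point $\bar{x} = (1.5, 1.5)$ with $\delta = 0.5$, and points $(0, 2) \in X_1$, $(1, 3) \in X_2$, the average $\hat{x} = (0.5, 2.5)$ lies outside both boxes, yet the resulting $s = (7/6, 11/6)$ lies in $X_1 \cap X_2 = [1,2]^2$.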
Proposition 15 (a.s. convergence when $X \triangleq \cap_{i=1}^m X_i$). Let Assumptions 7–9, 11(a), 12, and 13 hold. Furthermore, let $a_i^{j,k} = \frac{1}{m}$ for every $i$, $j$, $k$, and let $X \triangleq \cap_i X_i$. Consider (4.1) and let the iterates be generated according to (4.6)–(4.7). Let there exist a vector $\bar{x} \in X$ such that $\operatorname{dist}(\bar{x}, \operatorname{bd}(X)) \ge \delta > 0$. Then $x_i^k \to x^*$ as $k \to \infty$ in an almost sure sense for $i = 1, \ldots, m$.
Proof. From the previously obtained expressions in (4.20) and using Lemma 22, we have the following:
$$
\begin{aligned}
\sum_{i=1}^m \underbrace{\mathbb{E}\big[\|x_i^{k+1} - x^*\|^2 + \|\theta_i^{k+1} - \theta^*\|^2 \mid \mathcal{F}_k\big]}_{\mathbb{E}[T_i^{k+1} \mid \mathcal{F}_k]}
&\le \sum_{i=1}^m (1 + \rho_k)\underbrace{\big(\|x_i^k - x^*\|^2 + \|\theta_i^k - \theta^*\|^2\big)}_{T_i^k}
- 2\alpha_k \sum_{i=1}^m \big(f_i(s^k, \theta^*) - f_i(x^*, \theta^*)\big) \\
&\quad + \sum_{i=1}^m \big(\gamma_k^2 \nu_\theta^2 + \alpha_k^2 M\big) - \sum_{i=1}^m \mathbb{E}\big[\|\phi_i^k\|^2 \mid \mathcal{F}_k\big]
\underbrace{- 2\alpha_k \sum_{i=1}^m \mathbb{E}\big[f_i(v_i^k, \theta^*) - f_i(s^k, \theta^*) \mid \mathcal{F}_k\big]}_{p_k}, \qquad (4.26)
\end{aligned}
$$
where $\rho_k = \max\big(\alpha_k^{2-\tau} L_\theta^2,\; \gamma_k^2 R_\theta^2 - 2\kappa\gamma_k + \alpha_k^\tau\big)$ and $M$ is a suitably defined positive constant. Furthermore, we define $s^k$ as follows:
$$s^k \triangleq \frac{\epsilon_k}{\delta + \epsilon_k}\bar{x} + \frac{\delta}{\epsilon_k + \delta}\hat{x}^k, \quad \text{where } \hat{x}^k \triangleq \frac{1}{m}\sum_{i=1}^m x_i^k,$$
$\epsilon_k \triangleq \sum_{i=1}^m \operatorname{dist}(\hat{x}^k, X_i)$, and $\delta$ is the scalar defined in Assumption 13. Consequently, we have the following relationship:
$$\|s^k - \hat{x}^k\| = \Big\|\frac{\epsilon_k}{\delta + \epsilon_k}\bar{x} + \frac{\delta}{\delta + \epsilon_k}\hat{x}^k - \hat{x}^k\Big\| = \frac{\epsilon_k}{\epsilon_k + \delta}\big\|\bar{x} - \hat{x}^k\big\| \le \frac{\epsilon_k}{\delta}\|\bar{x} - \hat{x}^k\| \le \frac{\sum_{i=1}^m \operatorname{dist}(\hat{x}^k, X_i)}{m\delta}\sum_{i=1}^m \|\bar{x} - x_i^k\|,$$
where the last inequality is based on the definition of $\epsilon_k$. Since the weights are uniform, we have that $v_i^k = \hat{x}^k$ for every $i$ and for all $k$, implying that
$$\sum_{i=1}^m \|v_i^k - s^k\| \le \frac{\sum_{i=1}^m \|x_i^k - \bar{x}\|}{m\delta}\sum_{i=1}^m \operatorname{dist}(\hat{x}^k, X_i) \le \frac{2B}{\delta}\sum_{i=1}^m \operatorname{dist}(\hat{x}^k, X_i)$$
for all $k$. Noting that $x_i^{k+1}$ is a point in $X_i$ and that $v^k = v_i^k = \hat{x}^k$ for all $i$, $k$, and leveraging the fact that $\operatorname{dist}(\hat{x}^k, X_i) \le \|\hat{x}^k - x_i^{k+1}\| = \|v^k - x_i^{k+1}\|$, we obtain the following:
$$\sum_{i=1}^m \|v^k - s^k\| \le \frac{2B}{\delta}\sum_{i=1}^m \operatorname{dist}(\hat{x}^k, X_i) \le \frac{2B}{\delta}\sum_{i=1}^m \|v^k - x_i^{k+1}\|. \qquad (4.27)$$
The term on the right can be bounded as follows:
$$
\begin{aligned}
\frac{2B}{\delta}\sum_{i=1}^m \|v^k - x_i^{k+1}\|
&= \frac{2B}{\delta}\sum_{i=1}^m \big\|v^k - \alpha_k\big(\nabla f_i(v^k, \theta_i^k) + w_i^k\big) - x_i^{k+1} + \alpha_k\big(\nabla f_i(v^k, \theta_i^k) + w_i^k\big)\big\| \qquad (4.28) \\
&\le \frac{2B}{\delta}\sum_{i=1}^m \Big(\big\|x_i^{k+1} - \big(v^k - \alpha_k(\nabla f_i(v^k, \theta_i^k) + w_i^k)\big)\big\| + \big\|\alpha_k\big(\nabla f_i(v^k, \theta_i^k) + w_i^k\big)\big\|\Big) \\
&= \frac{2B}{\delta}\sum_{i=1}^m \Big(\|\phi_i^k\| + \big\|\alpha_k\big(\nabla f_i(v^k, \theta_i^k) + w_i^k\big)\big\|\Big). \qquad (4.29)
\end{aligned}
$$
As a consequence, we may claim from (4.27) and (4.29) that
$$\sum_{i=1}^m \|s^k - v^k\| \le \frac{2B}{\delta}\sum_{i=1}^m \Big(\big\|\alpha_k\big(\nabla f_i(v^k, \theta_i^k) + w_i^k\big)\big\| + \|\phi_i^k\|\Big). \qquad (4.30)$$
By leveraging convexity, the Cauchy–Schwarz inequality, and (4.30), we have the following:
$$
\begin{aligned}
\mathbb{E}\Big[-2\alpha_k \sum_{i=1}^m \big(f_i(v^k, \theta^*) - f_i(s^k, \theta^*)\big) \,\Big|\, \mathcal{F}_k\Big]
&\le \mathbb{E}\Big[2\alpha_k \sum_{i=1}^m \|\nabla f_i(v^k, \theta^*)\|\,\|v^k - s^k\| \,\Big|\, \mathcal{F}_k\Big]
\le 2\alpha_k m D\, \mathbb{E}\big[\|v^k - s^k\| \mid \mathcal{F}_k\big] \\
&\le \frac{4BmD\alpha_k}{\delta}\, \mathbb{E}\Big[\sum_{i=1}^m \Big(\big\|\alpha_k\big(\nabla f_i(v^k, \theta_i^k) + w_i^k\big)\big\| + \|\phi_i^k\|\Big) \,\Big|\, \mathcal{F}_k\Big].
\end{aligned}
$$
Using (4.29) and (4.23), we have that
$$
\begin{aligned}
\mathbb{E}\Big[-2\alpha_k \sum_{i=1}^m \big(f_i(v^k, \theta^*) - f_i(s^k, \theta^*)\big) \,\Big|\, \mathcal{F}_k\Big]
&\le \frac{4Bm^2 D\sqrt{D^2 + \nu^2}\,\alpha_k^2}{\delta} + \mathbb{E}\Big[\sum_{i=1}^m \frac{4BmD\alpha_k}{\delta}\|\phi_i^k\| \,\Big|\, \mathcal{F}_k\Big] \\
&\le \frac{4Bm^2 D\sqrt{D^2 + \nu^2}\,\alpha_k^2}{\delta} + \frac{8B^2 m^3 D^2 \alpha_k^2}{\delta^2} + \sum_{i=1}^m \frac{1}{2}\,\mathbb{E}\big[\|\phi_i^k\|^2 \mid \mathcal{F}_k\big], \qquad (4.31)
\end{aligned}
$$
where the last inequality invokes $\frac{4BmD\alpha_k}{\delta}\|\phi_i^k\| \le \frac{8B^2 m^2 D^2 \alpha_k^2}{\delta^2} + \frac{1}{2}\|\phi_i^k\|^2$. Note that $\mathbb{E}\big[\|\nabla f_i(v^k, \theta_i^k) + w_i^k\| \mid \mathcal{F}_k\big] \le \sqrt{D^2 + \nu^2}$ follows from (4.23). Invoking (4.31), the inequality (4.26) can be rewritten as follows:
$$
\sum_{i=1}^m \mathbb{E}\big[T_i^{k+1} \mid \mathcal{F}_k\big] \le (1 + \rho_k)\sum_{i=1}^m T_i^k
\underbrace{- 2\alpha_k \sum_{i=1}^m \big(f_i(s^k, \theta^*) - f_i(x^*, \theta^*)\big)}_{-r_k}
+ \underbrace{m\gamma_k^2 \nu_\theta^2 + \alpha_k^2 M + \frac{8B^2 m^3 D^2}{\delta^2}\alpha_k^2 + \frac{4Bm^2 D\sqrt{D^2 + \nu^2}\,\alpha_k^2}{\delta}}_{u_k}
- \frac{1}{2}\sum_{i=1}^m \mathbb{E}\big[\|\phi_i^k\|^2 \mid \mathcal{F}_k\big].
$$
From Assumption 11(a), $\sum_k u_k < \infty$. The nonnegativity of $r_k$ can be seen to follow by recalling that $s^k$ is feasible by invoking Lemma 24:
$$s^k \in X \quad \text{and} \quad \sum_{i=1}^m f_i(s^k, \theta^*) = f(s^k, \theta^*) \implies r_k = 2\alpha_k\big(f(s^k, \theta^*) - f(x^*, \theta^*)\big) \ge 0.$$
The summability of $\rho_k$ follows from (4.21). We may now employ the supermartingale convergence lemma, given by Lemma 21, allowing us to conclude that $\{T_i^k\}$ is a convergent sequence in an a.s. sense, and that the following hold in an a.s. sense by invoking the non-summability of $\alpha_k$:
$$
\liminf_{k\to\infty} f(s^k, \theta^*) = f(x^*, \theta^*), \qquad
\lim_{k\to\infty} \big(\|x_i^k - x^*\|^2 + \|\theta_i^k - \theta^*\|^2\big) = r_i, \text{ where } r_i \ge 0, \qquad
\lim_{k\to\infty} \|\phi_i^k\| = 0. \qquad (4.32)
$$
By applying Lemma 20 to (4.20), for $i = 1, \ldots, m$, $\{\theta_i^k\}$ is a convergent sequence in an a.s. sense and $\theta_i^k \to \theta^*$ as $k \to \infty$ in an a.s. sense. Consequently, $\{x_i^k\}$ is a convergent sequence in an a.s. sense for $i = 1, \ldots, m$. It follows that the sequence $\{v^k\}$, $v^k = \frac{1}{m}\sum_{i=1}^m x_i^k$, is also convergent. Applying limits on both sides of (4.30), we have that
$$\lim_{k\to\infty} \sum_{i=1}^m \|s^k - v^k\| \le \frac{2B}{\delta}\Big(mD\lim_{k\to\infty}\alpha_k + \sum_{i=1}^m \lim_{k\to\infty}\|\phi_i^k\|\Big) = 0.$$
Given this limit, $\{s^k\}$ is a convergent sequence in an a.s. sense, and since a subsequence of $\{s^k\}$ converges to $x^*$ in an a.s. sense by (4.32), the entire sequence $\{s^k\} \to x^*$ in an a.s. sense as $k \to \infty$. Furthermore, recall that $\|x_i^{k+1} - v^k\|$ may be bounded as follows by the projection inequality:
$$\|x_i^{k+1} - v^k\| \le \|\phi_i^k\| + \alpha_k \|\nabla f_i(v^k, \theta_i^k) + w_i^k\| \quad \text{a.s.}$$
Note that the boundedness of $\|\nabla f_i(v^k, \theta_i^k) + w_i^k\|$ follows from (4.23). Applying limits on both sides, we obtain the following:
$$\lim_{k\to\infty} \|x_i^{k+1} - v^k\| \le \lim_{k\to\infty} \|\phi_i^k\| + \lim_{k\to\infty} \alpha_k \|\nabla f_i(v^k, \theta_i^k) + w_i^k\| = 0, \quad \text{almost surely.}$$
Noting that $\{x_i^k\}$ is a convergent sequence, as stated earlier,
$$\lim_{k\to\infty} x_i^k = \lim_{k\to\infty} v^k = \lim_{k\to\infty} s^k, \qquad i = 1, \ldots, m.$$
By invoking $s^k \to x^*$, we have that $x_i^k \to x^*$ for $i = 1, \ldots, m$.
4.4 Rate of convergence estimates

In this section, we derive a non-asymptotic rate of convergence for the sequence of iterates and quantify the impact on this rate arising from the misspecification of $\theta^*$. Our first statement provides a rate estimate under the assumption that $X = \cap_{i=1}^m X_i$ and $\alpha_k$ is a diminishing steplength sequence.
Proposition 16 (Rate of convergence when $X = \cap_{i=1}^m X_i$). Let Assumptions 7–9 and 11–13 hold. Additionally, let $a_i^{j,k} = \frac{1}{m}$ for all $i,j,k$ and let $f_i(x;\theta)$ be strongly convex in $x$ for every $\theta$, with parameter $\sigma_i$. Additionally, suppose $h(\theta)$ is strongly convex in $\theta$ with parameter $\kappa$. Consider (4.1) with $X = \cap_{i=1}^m X_i$ and let the iterates be generated according to (4.7) with steplengths $\gamma_k = \frac{\gamma_0}{k}$ and $\alpha_k = \frac{\alpha_0}{k}$. Then, for all $k \ge 0$ and $i = 1,\dots,m$,
\[
IE\big[\|x_i^k - x^*\|^2\big] \le \frac{\max\Big\{\dfrac{\alpha_0^2(M_{\rm learn}+M_{\rm cons})}{\sigma_{\min}\alpha_0 - 1},\ \sum_{i=1}^m IE\big[\|x_i^0 - x^*\|^2\big]\Big\}}{k},
\]
where $M_{\rm learn}$ and $M_{\rm cons}$ are suitably defined positive scalars and $\sigma_{\min} = \min_i \sigma_i$.
Proof. Recall that $\|x_i^{k+1}-x^*\|^2 \le \|v_i^k-x^*\|^2 + T_A^k + T_B^k - \|\phi_i^k\|^2$, where $T_A^k$, $T_B^k$, and $\phi_i^k$ are defined by (4.11), (4.12), and (4.13). From (4.16), $T_B^k$ may be bounded as follows:
\[
\begin{aligned}
T_B^k &= -2\alpha_k(v_i^k-x^*)^T\nabla f_i(v_i^k,\theta^*) - 2\alpha_k(v_i^k-x^*)^T\big(\nabla f_i(v_i^k,\theta_i^k)-\nabla f_i(v_i^k,\theta^*)\big) - 2\alpha_k(v_i^k-x^*)^T w_i^k \\
&\le -2\alpha_k\big(f_i(v_i^k)-f_i(x^*)\big) - \alpha_k\sigma_i\|v_i^k-x^*\|^2 + 2\alpha_k\|v_i^k-x^*\|\,\|\nabla f_i(v_i^k,\theta_i^k)-\nabla f_i(v_i^k,\theta^*)\| - 2\alpha_k(v_i^k-x^*)^T w_i^k,
\end{aligned}
\]
where the first inequality follows from the strong convexity relation $f_i(x^*,\theta^*) \ge f_i(v_i^k,\theta^*) + (x^*-v_i^k)^T\nabla f_i(v_i^k,\theta^*) + \frac{\sigma_i}{2}\|v_i^k-x^*\|^2$ and the Cauchy–Schwarz inequality. Taking expectations,
\[
IE[T_B^k] \le IE\big[-\alpha_k\sigma_i\|v_i^k-x^*\|^2 - 2\alpha_k\big(f_i(v_i^k,\theta^*)-f_i(x^*,\theta^*)\big)\big] - \underbrace{IE\big[IE\big[2\alpha_k(v_i^k-x^*)^T w_i^k \mid \mathcal{F}^k\big]\big]}_{=0} + 2\alpha_k IE\big[\|v_i^k-x^*\|\,\|\nabla f_i(v_i^k,\theta_i^k)-\nabla f_i(v_i^k,\theta^*)\|\big].
\]
Using (4.15) and taking expectations $IE[IE[T_A^k \mid \mathcal{F}^k]]$, we have that
\[
\begin{aligned}
IE[T_A^k] &= IE\big[\alpha_k^2\|\nabla f_i(v_i^k,\theta_i^k)+w_i^k\|^2\big]
= \alpha_k^2\Big( IE\big[\|\nabla f_i(v_i^k,\theta_i^k)\|^2\big] + IE\big[IE[\|w_i^k\|^2 \mid \mathcal{F}^k]\big] + \underbrace{IE\big[IE[2(w_i^k)^T\nabla f_i(v_i^k,\theta_i^k)\mid\mathcal{F}^k]\big]}_{=0}\Big) \\
&\le \alpha_k^2\big(2S^2 + 2L_\theta^2\, IE[\|\theta_i^k-\theta^*\|^2] + \nu^2\big).
\end{aligned}
\]
This leads to the following inequality:
\[
\begin{aligned}
IE[\|x_i^{k+1}-x^*\|^2] &\le IE[\|v_i^k-x^*\|^2] + IE[T_A^k] + IE[T_B^k] - IE[\|\phi_i^k\|^2] \\
&\le IE[\|v_i^k-x^*\|^2] + \alpha_k^2\big(2S^2+2L_\theta^2\, IE[\|\theta_i^k-\theta^*\|^2]+\nu^2\big) - IE[\|\phi_i^k\|^2] \\
&\qquad - IE\big[\alpha_k\sigma_i\|v_i^k-x^*\|^2 + 2\alpha_k(f_i(v_i^k)-f_i(x^*))\big] + 2\alpha_k IE\big[\|v_i^k-x^*\|\,\|\nabla f_i(v_i^k,\theta_i^k)-\nabla f_i(v_i^k,\theta^*)\|\big] \\
&\le (1-\alpha_k\sigma_i)IE[\|v_i^k-x^*\|^2] + 2(\alpha_k L_\theta)^2 IE[\|\theta_i^k-\theta^*\|^2] + \alpha_k^2(2S^2+\nu^2) \\
&\qquad + 2\alpha_k L_\theta\, IE\big[\|v_i^k-x^*\|\,\|\theta_i^k-\theta^*\|\big] - IE[\|\phi_i^k\|^2] - 2\alpha_k IE\big[f_i(v_i^k)-f_i(x^*)\big],
\end{aligned}
\]
where the last inequality is a consequence of leveraging the Lipschitz continuity of $\nabla f_i(v_i^k,\cdot)$ in $\theta$. Summing over $i$,
\[
\begin{aligned}
IE\Big[\sum_{i=1}^m \|x_i^{k+1}-x^*\|^2\Big] &\le q_{x,k}\, IE\Big[\sum_{i=1}^m \|v_i^k-x^*\|^2\Big] + \sum_{i=1}^m \alpha_k^2(2S^2+\nu^2) + q_{\theta,k}\, IE\Big[\sum_{i=1}^m \|\theta_i^k-\theta^*\|^2\Big] \\
&\qquad + \sum_{i=1}^m \gamma_k^2\nu_\theta^2 - \sum_{i=1}^m IE[\|\phi_i^k\|^2] - 2\sum_{i=1}^m \alpha_k\, IE\big[f_i(v_i^k)-f_i(x^*)\big],
\end{aligned}
\]
where
\[
q_{x,k} \triangleq \max_i\,(1-\alpha_k\sigma_i) = (1-\alpha_k\sigma_{\min}) \quad \text{and} \quad q_{\theta,k} \triangleq 2\alpha_k^2 L_\theta^2 + \max_{i\in\{1,\dots,m\}} \frac{\alpha_k L_\theta^2}{\sigma_i}.
\]
Restating (4.20) and using Assumption 12(b), we have
\[
\begin{aligned}
IE\big[\|\theta_i^{k+1}-\theta^*\|^2\big] &= IE\big[IE[\|\theta_i^{k+1}-\theta^*\|^2 \mid \mathcal{F}^k]\big]
\le IE\big[(1+\gamma_k^2 L_\theta^2 - \gamma_k\kappa)\|\theta_i^k-\theta^*\|^2\big] + \gamma_k^2\nu_\theta^2 \\
&\le IE\big[(1-\gamma_k\kappa)\|\theta_i^k-\theta^*\|^2\big] + \gamma_k^2\big(\nu_\theta^2 + 4W^2L_\theta^2\big).
\end{aligned}
\tag{4.33}
\]
Using $\gamma_k = \frac{\gamma_0}{k}$, we have from [79, Chap. 5, Eq. 292]
\[
IE\big[\|\theta_i^k-\theta^*\|^2\big] \le \frac{\max\Big\{\dfrac{\gamma_0^2 B_\theta^2}{\kappa\gamma_0-1},\ IE[\|\theta^0-\theta^*\|^2]\Big\}}{k} = \frac{M_\theta(\gamma_0)}{k},
\]
where $B_\theta^2 = \nu_\theta^2 + 4L_\theta^2 W^2$. Substituting for $IE[\|\theta_i^k-\theta^*\|^2]$ and using $\sum_{i=1}^m \|v_i^k-x^*\|^2 \le \sum_{i=1}^m\sum_{j=1}^m a_j^{i,k}\|x_j^k-x^*\|^2 = \sum_{j=1}^m \|x_j^k-x^*\|^2$ by invoking $a_j^{i,k} = 1/m$,
\[
IE\Big[\sum_{i=1}^m \|x_i^{k+1}-x^*\|^2\Big] \le q_{x,k}\,IE\Big[\sum_{i=1}^m \|x_i^k-x^*\|^2\Big] + \frac{M_\theta(\gamma_0)m\, q_{\theta,k}}{k} + m\alpha_k^2(2S^2+\nu^2) - \sum_{i=1}^m \Big( IE\big[\|\phi_i^k\|^2\big] + 2\alpha_k\, IE\big[f_i(v_i^k,\theta^*)-f_i(x^*,\theta^*)\big]\Big).
\]
From the definition of the weights, $v_i^k = v^k$ for $i=1,\dots,m$, implying that $\sum_{i=1}^m f_i(v_i^k;\theta^*) = f(v^k;\theta^*)$, which leads to the following inequality:
\[
IE\Big[\sum_{i=1}^m \|x_i^{k+1}-x^*\|^2\Big] \le q_{x,k}\,IE\Big[\sum_{i=1}^m \|x_i^k-x^*\|^2\Big] + \frac{M_\theta(\gamma_0)m\, q_{\theta,k}}{k} + m\alpha_k^2(2S^2+\nu^2) - \sum_{i=1}^m IE\big[\|\phi_i^k\|^2\big] - 2\alpha_k\, IE\big[f(v^k,\theta^*)-f(x^*,\theta^*)\big].
\]
We may rewrite this inequality as
\[
\begin{aligned}
IE\Big[\sum_{i=1}^m \|x_i^{k+1}-x^*\|^2\Big] &\le q_{x,k}\,IE\Big[\sum_{i=1}^m \|x_i^k-x^*\|^2\Big] + \frac{M_\theta(\gamma_0)m\, q_{\theta,k}}{k} + m\alpha_k^2(2S^2+\nu^2) \\
&\qquad - \sum_{i=1}^m IE\big[\|\phi_i^k\|^2\big] - 2\alpha_k\, IE\big[f(y^k;\theta^*) - f(x^*;\theta^*) - f(y^k;\theta^*) + f(v^k;\theta^*)\big],
\end{aligned}
\]
where the last step follows from adding and subtracting $f(y^k)$. We define
\[
y^k = \frac{\varepsilon_k}{\delta+\varepsilon_k}\,\bar{x} + \frac{\delta}{\delta+\varepsilon_k}\, v^k \quad \text{and} \quad \varepsilon_k = \sum_{i=1}^m \mathrm{dist}(v^k, X_i),
\]
where $\delta$ is a scalar defined in Assumption 13. As shown in (4.30), we have the following bound on $\|y^k-v^k\|$:
\[
\|y^k - v^k\| \le \frac{2B}{\delta}\sum_{i=1}^m \mathrm{dist}(v^k, X_i) \le \frac{2B}{\delta}\sum_{i=1}^m \Big(\|\phi_i^k\| + \alpha_k\|\nabla f_i(v^k,\theta_i^k)+w_i^k\|\Big).
\tag{4.34}
\]
The last inequality is based on the definition of $\varepsilon_k$. Since $y^k \in X$ from [34, Lemma 2], $f(y^k)-f(x^*) \ge 0$ for all $k$. Hence,
\[
\begin{aligned}
IE\Big[\sum_{i=1}^m \|x_i^{k+1}-x^*\|^2\Big] &\le q_{x,k}\,IE\Big[\sum_{i=1}^m \|x_i^k-x^*\|^2\Big] + \frac{M_\theta(\gamma_0)m\, q_{\theta,k}}{k} + m\alpha_k^2(2S^2+\nu^2) - \sum_{i=1}^m IE\big[\|\phi_i^k\|^2\big] \\
&\qquad + \frac{4Bm}{\delta}\,\alpha_k \sum_{i=1}^m IE\Big[\|\phi_i^k\| + \alpha_k\|\nabla f_i(v^k,\theta_i^k)+w_i^k\|\Big].
\end{aligned}
\]
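The feasibility of $y^k$ rests on the interior-point construction of [34]: if $\bar{x}$ is a point whose $\delta$-ball lies inside $X$ and $\varepsilon \ge \mathrm{dist}(v,X)$, then the convex combination $\frac{\varepsilon}{\delta+\varepsilon}\bar{x} + \frac{\delta}{\delta+\varepsilon}v$ lies in $X$. A minimal numerical sketch of this construction (NumPy; the box constraint set and all values are illustrative, not drawn from the dissertation's experiments):

```python
import numpy as np

def feasible_point(v, x_bar, delta, eps):
    """Interior-point construction behind y^k ([34]): mix a possibly infeasible
    iterate v with a point x_bar whose delta-ball lies inside X."""
    lam = eps / (delta + eps)              # weight on the interior point
    return lam * x_bar + (1.0 - lam) * v   # (eps/(delta+eps)) x_bar + (delta/(delta+eps)) v

# Illustrative set: X = [0,1]^2 with interior point x_bar = (0.5, 0.5), delta = 0.5.
x_bar, delta = np.array([0.5, 0.5]), 0.5
v = np.array([1.3, 0.7])                                # infeasible iterate
eps = np.linalg.norm(v - np.clip(v, 0.0, 1.0)) + 0.1    # any eps >= dist(v, X) works
y = feasible_point(v, x_bar, delta, eps)
print(y, bool(np.all((y >= 0.0) & (y <= 1.0))))         # y lies in X
```

Note that $\|y-v\| = \frac{\varepsilon}{\delta+\varepsilon}\|\bar{x}-v\|$, which is the mechanism behind the bound (4.34).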
This implies that
\[
\begin{aligned}
IE\Big[\sum_{i=1}^m \|x_i^{k+1}-x^*\|^2\Big] &\le q_{x,k}\,IE\Big[\sum_{i=1}^m \|x_i^k-x^*\|^2\Big] + \frac{M_\theta(\gamma_0)m\, q_{\theta,k}}{k} + m\alpha_k^2(2S^2+\nu^2) \\
&\qquad - \sum_{i=1}^m \frac{1}{2}\Big(IE\big[\|\phi_i^k\|^2\big] - \big(IE[\|\phi_i^k\|]\big)^2\Big) + \frac{64B^2m^3\alpha_k^2}{\delta^2} + \frac{4Bm}{\delta}\,\alpha_k^2\sum_{i=1}^m IE\big[\|\nabla f_i(v^k,\theta_i^k)+w_i^k\|\big] \\
&\le q_{x,k}\,IE\Big[\sum_{i=1}^m \|x_i^k-x^*\|^2\Big] + \frac{M_\theta(\gamma_0)m\, q_{\theta,k}}{k} + m\alpha_k^2(2S^2+\nu^2) \\
&\qquad - \frac{1}{2}\sum_{i=1}^m IE\big[\|\phi_i^k\|^2\big] + \frac{64B^2m^3\alpha_k^2}{\delta^2} + \frac{4Bm}{\delta}\,\alpha_k^2\sum_{i=1}^m IE\big[\|\nabla f_i(v^k,\theta_i^k)+w_i^k\|\big],
\end{aligned}
\]
where the first inequality follows from completing squares for $\frac{4\alpha_k Bm}{\delta}IE[\|\phi_i^k\|]$ and the second inequality is a consequence of utilizing Jensen's inequality (i.e., $(IE[X])^2 \le IE[X^2]$) and of invoking Assumption 12(b). Finally, from (4.23), by noting that $IE\big[\sum_{i=1}^m \|\nabla f_i(v^k,\theta_i^k)+w_i^k\|\big] \le m\sqrt{D^2+\nu^2}$, we have that
\[
\begin{aligned}
IE\Big[\sum_{i=1}^m \|x_i^{k+1}-x^*\|^2\Big] &\le q_{x,k}\sum_{i=1}^m IE\big[\|x_i^k-x^*\|^2\big] + \frac{M_\theta(\gamma_0)m\, q_{\theta,k}}{k} + m\alpha_k^2(2S^2+\nu^2) \\
&\qquad - \frac{1}{2}\sum_{i=1}^m IE\big[\|\phi_i^k\|^2\big] + \frac{64B^2m^3\alpha_k^2}{\delta^2} + \frac{4Bm^2\alpha_k^2\sqrt{D^2+\nu^2}}{\delta}.
\end{aligned}
\]
But $q_{\theta,k} = \alpha_k L_\theta^2\big(2\alpha_0 + \frac{1}{\sigma_{\min}}\big)$, implying that the inequality may be rewritten as follows:
\[
\begin{aligned}
IE\Big[\sum_{i=1}^m \|x_i^{k+1}-x^*\|^2\Big] &\le q_{x,k}\,IE\Big[\sum_{i=1}^m \|x_i^k-x^*\|^2\Big] + \frac{M_\theta(\gamma_0)m\alpha_0 L_\theta^2\big(2\alpha_0+\frac{1}{\sigma_{\min}}\big)}{k^2} + \frac{m\alpha_0^2(2S^2+\nu^2)}{k^2} + \frac{64B^2m^3\alpha_0^2}{\delta^2 k^2} + \frac{4Bm^2\sqrt{D^2+\nu^2}\,\alpha_0^2}{\delta k^2} \\
&= q_{x,k}\,IE\Big[\sum_{i=1}^m \|x_i^k-x^*\|^2\Big] + (M_{\rm learn}+M_{\rm cons})\,\alpha_k^2,
\end{aligned}
\]
where
\[
M_{\rm learn} = M_\theta(\gamma_0)\,m L_\theta^2\Big(2\alpha_0 + \frac{1}{\alpha_0\sigma_{\min}}\Big) \quad \text{and} \quad
M_{\rm cons} = m(2S^2+\nu^2) + \frac{64B^2m^3}{\delta^2} + \frac{4Bm^2\sqrt{D^2+\nu^2}}{\delta}.
\]
It then follows from [79, Chap. 5, Eq. 292] that
\[
IE\big[\|x_i^k-x^*\|^2\big] \le \frac{\max\Big\{\dfrac{\alpha_0^2(M_{\rm learn}+M_{\rm cons})}{\sigma_{\min}\alpha_0-1},\ \sum_{i=1}^m IE\big[\|x_i^0-x^*\|^2\big]\Big\}}{k} \qquad \forall i,k.
\]
Remark: Notably, the presence of both learning and consensus does not lead to any deterioration of the optimal rate of convergence of solution iterates observed in stochastic approximation for strongly convex problems, which is given by $O(1/\sqrt{K})$. Naturally, the constant factor does see a worsening in this regime.

Next, we show that when $X_i = X$ for all $i$ and the functions $f_i$ are merely convex in $x$, an averaging scheme produces a minor degradation of the rate, of the order of $\sqrt{\ln(K)}$. This statement employs a constant steplength in the computational step in deriving the bound.
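The contraction underlying the rate statement above ([79, Chap. 5, Eq. 292]) follows the template $e_{k+1} \le (1 - c\alpha_0/k)\,e_k + M\alpha_0^2/k^2$ with $c\alpha_0 > 1$, which yields $e_k \le \max\{\alpha_0^2 M/(c\alpha_0-1),\, e_0\}/k$. A small numerical sanity check of this template (all constants are hypothetical, chosen only for illustration):

```python
# Sanity-check the O(1/k) envelope for the recursion behind Proposition 16:
#   e_{k+1} <= (1 - c*alpha0/k) e_k + M*alpha0^2/k^2,   c*alpha0 > 1,
# which yields e_k <= max{alpha0^2*M/(c*alpha0 - 1), e_0}/k.
c, alpha0, M, e0 = 1.0, 1.5, 5.0, 10.0        # hypothetical constants
bound_const = max(alpha0**2 * M / (c * alpha0 - 1.0), e0)

e, ok = e0, True
for k in range(1, 100001):
    ok = ok and (e <= bound_const / k + 1e-12)  # e_k stays under the 1/k envelope
    e = max(0.0, 1.0 - c * alpha0 / k) * e + M * alpha0**2 / k**2

print(ok)
```

The clamp at zero reflects that the error sequence being bounded is nonnegative; the envelope constant is exactly the $\max\{\cdot,\cdot\}/k$ bound of the proposition.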
Proposition 17 (Rate of convergence when $X_i = X$ for all $i$). Let Assumptions 7 to 12 hold. Let $f_i(x)$ be convex, let $h(\theta)$ be a strongly convex function with parameter $\kappa > 0$, and let $X_i = X$ across all agents. Consider (4.1) and let the iterates be generated according to (4.7), in which $\gamma_k = \frac{\gamma_0}{k}$ and $\alpha_k := \alpha$ for all $k$. Then the following holds for all $K$:
\[
IE\big[f(\bar{y}^K,\theta^*) - f(x^*,\theta^*)\big] \le \sqrt{\frac{T_0(K)T_1(K)}{4K}} + \frac{T_2}{2K},
\]
where $\bar{y}^K = \frac{1}{Km}\sum_{k=0}^{K-1}\sum_{i=1}^m x_i^k$, $T_0(K) \le b_0 + O(\ln(K))$, $T_1(K) \le b_1 + O(\ln(K))$, and $b_0$, $b_1$, and $T_2$ are suitably defined positive scalars.
Proof. As obtained earlier, $\|x_i^{k+1}-x^*\|^2$ may be bounded as follows:
\[
\|x_i^{k+1}-x^*\|^2 \le \|v_i^k-x^*\|^2 + T_A^k + T_B^k - \|\phi_i^k\|^2,
\]
where
\[
\begin{aligned}
T_A^k &= \alpha_k^2\big(\|\nabla f_i(v_i^k,\theta_i^k)\|^2 + \|w_i^k\|^2 + 2(w_i^k)^T\nabla f_i(v_i^k,\theta_i^k)\big), \\
T_B^k &\le -2\alpha_k\big(f_i(v_i^k,\theta^*)-f_i(x^*,\theta^*)\big) - 2\alpha_k(v_i^k-x^*)^T w_i^k + 2\alpha_k\|v_i^k-x^*\|\,\|\nabla f_i(v_i^k,\theta_i^k)-\nabla f_i(v_i^k,\theta^*)\|, \\
\phi_i^k &= x_i^{k+1} - v_i^k - \alpha_k\big(\nabla f_i(v_i^k,\theta_i^k)+w_i^k\big).
\end{aligned}
\]
We may now bound $T_B^k$ as follows:
\[
\begin{aligned}
T_B^k &\le -2\alpha_k\big(f_i(y^k,\theta^*)-f_i(x^*,\theta^*)\big) - 2\alpha_k\big(f_i(v_i^k,\theta^*)-f_i(y^k,\theta^*)\big) - 2\alpha_k(v_i^k-x^*)^T w_i^k \\
&\qquad + 2\alpha_k\|v_i^k-x^*\|\,\|\nabla f_i(v_i^k,\theta_i^k)-\nabla f_i(v_i^k,\theta^*)\| \\
&\le -2\alpha_k\big(f_i(y^k,\theta^*)-f_i(x^*,\theta^*)\big) - 2\alpha_k(v_i^k-x^*)^T w_i^k + 2\alpha_k\big|f_i(v_i^k,\theta^*)-f_i(y^k,\theta^*)\big| \\
&\qquad + 2\alpha_k\|v_i^k-x^*\|\,\|\nabla f_i(v_i^k,\theta_i^k)-\nabla f_i(v_i^k,\theta^*)\| \\
&\le -2\alpha_k\big(f_i(y^k,\theta^*)-f_i(x^*,\theta^*)\big) - 2\alpha_k(v_i^k-x^*)^T w_i^k + 2\alpha_k\big|f_i(v_i^k,\theta^*)-f_i(y^k,\theta^*)\big| + 2\alpha_k L_\theta\|v_i^k-x^*\|\,\|\theta_i^k-\theta^*\| \\
&\le -2\alpha_k\big(f_i(y^k,\theta^*)-f_i(x^*,\theta^*)\big) - 2\alpha_k(v_i^k-x^*)^T w_i^k + 2\alpha_k\big|f_i(v_i^k,\theta^*)-f_i(y^k,\theta^*)\big| + \frac{\alpha_k^2 L_\theta^2}{c^2}\|v_i^k-x^*\|^2 + c^2\|\theta_i^k-\theta^*\|^2,
\end{aligned}
\]
where $y^k = \frac{1}{m}\sum_{i=1}^m x_i^k$ and $c > 0$ is a suitable scalar. Applying expectations as earlier, we have
\[
\begin{aligned}
IE\Big[\sum_{i=1}^m \|x_i^{k+1}-x^*\|^2\Big] &\le (1+q_{x,k})\,IE\Big[\sum_{i=1}^m \|x_i^k-x^*\|^2\Big] + \sum_{i=1}^m \alpha_k^2(2S^2+\nu^2) + q_{\theta,k}\,IE\Big[\sum_{i=1}^m \|\theta_i^k-\theta^*\|^2\Big] + \sum_{i=1}^m \gamma_k^2\nu_\theta^2 \\
&\qquad - 2\sum_{i=1}^m \alpha_k\, IE\big[f_i(y^k,\theta^*)-f_i(x^*,\theta^*)\big] - 2\sum_{i=1}^m \alpha_k\, IE\big[f_i(v_i^k,\theta^*)-f_i(y^k,\theta^*)\big],
\end{aligned}
\]
where $q_{x,k} = \frac{\alpha_k^2 L_\theta^2}{c^2}$ and $q_{\theta,k} = 2\alpha_k^2 L_\theta^2 + c^2$. We now proceed to derive a bound for the term $IE\big[-\sum_{i=1}^m \alpha_k\big(f_i(v_i^k,\theta^*)-f_i(y^k,\theta^*)\big)\big]$:
"
IE −
m
X
#
αk (fi (vik , θ∗ ) − fi (y k , θ∗ ))
i=1
"
≤ IE −
m
X
#
k
∗ T
αk ∇fi (y , θ )
(vik
i=1
127
k
−y )
≤ IE
"m
X
#
αk Skvik
k
−y k
i=1
= IE
"m
X
#
αk Skvik − y k k ≤

m
m
X
X
IE  αk S
aj,k kxk
i
i=1
=
m
X
i=1
j

− y k k
j=1
i
h
αk SIE kxki − y k k ,
(4.35)
i=1
where the first inequality follows from the Cauchy–Schwarz inequality and the second inequality is a consequence of invoking Jensen's inequality and the convexity of the norm. By denoting $\sum_{i=1}^m IE[\|x_i^k-x^*\|^2]$ by $a^k$, we have
\[
\begin{aligned}
a^{k+1} + 2\sum_{i=1}^m \alpha_k\, IE\big[f_i(y^k,\theta^*)-f_i(x^*,\theta^*)\big]
&\le (1+q_{x,k})a^k + \sum_{i=1}^m \alpha_k^2(2S^2+\nu^2) + q_{\theta,k}\,IE\Big[\sum_{i=1}^m \|\theta_i^k-\theta^*\|^2\Big] \\
&\qquad + \sum_{i=1}^m \gamma_k^2\nu_\theta^2 + \sum_{i=1}^m \alpha_k S\,IE\big[\|x_i^k-y^k\|\big].
\end{aligned}
\]
Using the bound derived in (4.35) and setting $\alpha_k = \alpha$, we have the following recursion:
\[
a^{k+1} - a^k + \sum_{i=1}^m \alpha\, IE\big[f_i(y^k,\theta^*)-f_i(x^*,\theta^*)\big] \le q_{x,k}a^k + m\alpha^2(2S^2+\nu^2) + m\gamma_k^2\nu_\theta^2 + \sum_{i=1}^m IE\big[q_{\theta,k}\|\theta_i^k-\theta^*\|^2 + \alpha S\|x_i^k-y^k\|\big].
\tag{4.36}
\]
The bound obtained from (4.25) can be modified to reflect the constant steplength $\alpha$:
\[
\|x_i^k - y^k\| \le m^2 BC\lambda^{k-1} + mC\big(D+\sqrt{D^2+\nu^2}\big)\lambda^k + \frac{mC\big(D+\sqrt{D^2+\nu^2}\big)\alpha}{1-\lambda} + 2\alpha\big(D+\sqrt{D^2+\nu^2}\big).
\]
Summing (4.36) from $0$ to $K-1$, dividing by $\sum_{k=0}^{K-1}\alpha = K\alpha$, and incorporating the above, we have:
\[
\begin{aligned}
\frac{\sum_{k=0}^{K-1}\sum_{i=1}^m IE\big[f_i(y^k,\theta^*)-f_i(x^*,\theta^*)\big]}{K}
&\le \frac{a^0 - a^K + \sum_{k=1}^{K-1}\frac{(\alpha L_\theta)^2}{c^2}a^k + \ell_1(\theta^k,K)}{K\alpha} + \alpha m(2S^2+\nu^2) + \frac{S\sum_{k=0}^{K-1}\sum_{i=1}^m IE[\|x_i^k-y^k\|]}{K} \qquad (4.37) \\
&\le \frac{a^0 - a^K + \sum_{k=1}^{K-1}\frac{(\alpha L_\theta)^2}{c^2}a^k + \ell_1(\theta^k,K)}{K\alpha} + \alpha m(2S^2+\nu^2) + \frac{Sm\big(m^2BC + mC(D+\sqrt{D^2+\nu^2})\big)}{K(1-\lambda)} \\
&\qquad + Sm\Big(\frac{mC(D+\sqrt{D^2+\nu^2})\alpha}{1-\lambda} + 2\alpha\big(D+\sqrt{D^2+\nu^2}\big)\Big) \\
&\le \frac{8mB^2 + \ell_1(\theta^k,K)}{K\alpha} + \alpha\ell_2 + \frac{\ell_3}{K}, \qquad (4.38)
\end{aligned}
\]
where
\[
\begin{aligned}
\ell_1(\theta^k,K) &= \sum_{k=0}^{K}\Big[m\gamma_k^2\nu_\theta^2 + q_{\theta,k}\,IE\Big[\sum_{i=1}^m \|\theta_i^k-\theta^*\|^2\Big]\Big], \\
\ell_2 &= 2mS^2 + \nu^2 + \frac{Sm^2C\big(D+\sqrt{D^2+\nu^2}\big)}{1-\lambda} + 2\alpha Sm\big(D+\sqrt{D^2+\nu^2}\big) + \frac{4mB^2L_\theta^2}{c^2}, \\
\ell_3 &= \frac{Sm^2\big(mBC + C(D+\sqrt{D^2+\nu^2})\big)}{1-\lambda}.
\end{aligned}
\]
Furthermore, $q_{\theta,k} = 2\alpha^2L_\theta^2 + c^2$, $a^0 = \sum_{i=1}^m IE[\|x_i^0-x^*\|^2] \le \sum_{i=1}^m (2B)^2 = 4mB^2$, and $a^k \le 4mB^2$ for all $k$. Using the expression for $IE[\|\theta_i^k-\theta^*\|^2]$ in Proposition 16 for $\gamma_k = \frac{\gamma_0}{k}$, we have:
\[
IE\big[\|\theta_i^k-\theta^*\|^2\big] \le \frac{\max\Big\{\dfrac{\gamma_0^2B_\theta^2}{\kappa\gamma_0-1},\ IE[\|\theta^0-\theta^*\|^2]\Big\}}{k} = \frac{M_\theta(\gamma_0)}{k},
\]
where $B_\theta^2 = \nu_\theta^2 + 4L_\theta^2W^2$. It follows that $\ell_1(\theta^k,K)$ can be bounded as follows:
\[
\begin{aligned}
\ell_1(\theta^k,K) &\le m\nu_\theta^2\gamma_0^2\Big(1+\sum_{k=1}^{K-1}\frac{1}{k^2}\Big) + \sum_{k=1}^{K-1}\big(2\alpha^2L_\theta^2+c^2\big)\frac{M_\theta(\gamma_0)}{k} \\
&\le 3m\nu_\theta^2\gamma_0^2 + M_\theta(\gamma_0)\big(2\alpha^2L_\theta^2+c^2\big)\ln(1+(K-1))
= 3m\nu_\theta^2\gamma_0^2 + M_\theta(\gamma_0)\big(2\alpha^2L_\theta^2+c^2\big)\ln(K),
\end{aligned}
\]
where the expressions follow from the summability of $\sum_k \frac{1}{k^2} \le \frac{\pi^2}{6} < 2$. Consequently, the main recursion (4.38) yields the following:
\[
\frac{2\sum_{k=0}^{K-1}\sum_{i=1}^m IE\big[f_i(y^k,\theta^*)-f_i(x^*,\theta^*)\big]}{K} \le \frac{T_0(K)}{K\alpha} + \alpha T_1(K) + \frac{T_2}{K},
\]
where
\[
\begin{aligned}
T_0(K) &= 8mB^2 + 3m\nu_\theta^2\gamma_0^2 + M_\theta(\gamma_0)c^2\ln(K), \\
T_1(K) &= 2mS^2 + \nu^2 + \frac{Sm^2C\big(D+\sqrt{D^2+\nu^2}\big)}{1-\lambda} + 2\alpha Sm\big(D+\sqrt{D^2+\nu^2}\big) + \frac{4mB^2L_\theta^2}{c^2} + 2M_\theta(\gamma_0)L_\theta^2\ln(K), \\
T_2 &= \frac{Sm^2\big(mBC + C(D+\sqrt{D^2+\nu^2})\big)}{1-\lambda}.
\end{aligned}
\]
Let $\bar{y}^K$ denote the average of the sequence $\{y^k\}$, defined as $\bar{y}^K \triangleq \sum_{k=0}^{K-1}\frac{y^k}{K}$. Then, from the convexity of $f_i$, we have
\[
f_i(\bar{y}^K,\theta^*) = f_i\Big(\sum_{k=0}^{K-1}\frac{y^k}{K},\ \theta^*\Big) \le \frac{1}{K}\sum_{k=0}^{K-1} f_i(y^k,\theta^*).
\]
It follows that the error for the averaged sequence is given by the following:
\[
IE\big[f(\bar{y}^K,\theta^*)-f(x^*,\theta^*)\big] \le \frac{1}{2}\Big(\frac{T_0(K)}{K\alpha} + \alpha T_1(K) + \frac{T_2}{K}\Big).
\tag{4.39}
\]
Minimizing the expression on the right of (4.39), we have $\alpha^* = \sqrt{T_0(K)/(KT_1(K))}$, leading to the following bound:
\[
IE\big[f(\bar{y}^K,\theta^*)-f(x^*,\theta^*)\big] \le \sqrt{\frac{T_0(K)T_1(K)}{4K}} + \frac{T_2}{2K}.
\]
Noting that $T_0(K)$ and $T_1(K)$ are each $O(\ln(K))$, it can be ascertained that the expression on the right is $O\Big(\frac{\ln(K)}{\sqrt{K}} + \frac{T_2}{2K}\Big)$.
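The choice $\alpha^*$ simply balances the two $\alpha$-dependent terms on the right of (4.39); a quick numerical check that $\alpha^* = \sqrt{T_0/(KT_1)}$ minimizes a bound of this shape (the constants below are hypothetical placeholders, not values from the analysis):

```python
import math

# Right-hand side of a bound of the form (1/2)(T0/(K*a) + a*T1 + T2/K), as in (4.39);
# T0, T1, T2, and K are hypothetical placeholder constants.
T0, T1, T2, K = 50.0, 3.0, 7.0, 10000

def rhs(a):
    return 0.5 * (T0 / (K * a) + a * T1 + T2 / K)

a_star = math.sqrt(T0 / (K * T1))   # minimizer of the alpha-dependent part
print(rhs(a_star))
```

Any perturbation of the steplength away from $\alpha^*$ (e.g., doubling or halving it) increases the bound, which is the balancing argument used in the proof.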
We conclude this section with a corollary in which the averaging window is changed to obtain the optimal rate of $O(1/\sqrt{K})$.

Corollary 3 (Rate of convergence when $X_i = X$ for all $i$). Let Assumptions 7 to 12 hold. Let $f_i(x)$ be convex, let $h(\theta)$ be a strongly convex function with parameter $\kappa > 0$, and let $X_i = X$ across all agents. Consider (4.1) and let the iterates be generated according to (4.7), in which $\gamma_k = \frac{\gamma_0}{k}$ and $\alpha_k := \alpha$ for all $k$. Then the following holds for all $K$:
\[
IE\big[f(\bar{y}^K_{K/2},\theta^*) - f(x^*,\theta^*)\big] \le \sqrt{\frac{T_0(K)T_1(K)}{4K}} + \frac{T_2}{2K},
\]
where $\bar{y}^K_{K/2} = \frac{2}{mK}\sum_{k=\lceil K/2\rceil}^{K-1}\sum_{i=1}^m x_i^k$ and $T_0(K)$, $T_1(K)$, and $T_2$ are suitably defined positive scalars.
4.5 Numerical Results

We consider a networked economic dispatch problem with nodes and firms respectively denoted by $\mathcal{N}$ and $\mathcal{F} \triangleq \{1,\dots,F\}$. The $i$th firm is assumed to host generation facilities at multiple nodes of the network, denoted by $\mathcal{N}_i \subset \mathcal{N}$. The firm-specific generation cost takes the form $c_i(x_i,\theta^*) = \sum_{j\in\mathcal{N}_i} r_{ij}^* x_{ij} + \frac{1}{2}b_i^*\bar{x}_i^2$, where $\bar{x}_i = \sum_{j\in\mathcal{N}_i} x_{ij}$ and $\theta^* = (r^*,b^*)$. Every agent has access to a collection of noise-corrupted samples given by $\{c_i^\ell(x_i^\ell,\theta^*) + \omega_i^\ell\}$, and the associated learning problem is given by the following:
\[
\min_{(r,b)\in\Theta} \ \sum_i \sum_\ell \Big\| \sum_{j\in\mathcal{N}_i} r_{ij} x_{ij}^\ell + \frac{1}{2}b_i\big(\bar{x}_i^\ell\big)^2 - c_i^\ell \Big\|^2.
\]
The economic dispatch problem necessitates scheduling generation in a least-cost fashion while meeting nodal demand requirements, as specified by $\sum_{i\in\mathcal{F}}\sum_{j\in\mathcal{N}_i} x_{ij} \ge d$:
\[
\min_{x\in X} \ \sum_{i\in\mathcal{F}} IE\big[h_i(x;\theta,\xi)\big],
\]
where $X = \big\{x \mid 0 \le x \le u,\ \sum_{i\in\mathcal{F}}\sum_{j\in\mathcal{N}_i} x_{ij} \ge d\big\}$ and $IE[h_i(x;\theta^*,\xi)] = c_i(x,\theta^*)$.
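The scheme (4.7) couples three updates per iteration: consensus averaging, a projected gradient step in $x$ under the current belief $\theta_i^k$, and a learning step in $\theta$. A minimal one-dimensional sketch of this coupling (NumPy; the quadratic costs, noise levels, and steplengths here are illustrative stand-ins, not the dispatch instance of this section):

```python
import numpy as np

rng = np.random.default_rng(0)
m, theta_star = 3, 2.0                     # agents; true (misspecified) parameter
c = np.array([1.0, 2.0, 3.0])              # f_i(x; theta) = (theta/2)(x - c_i)^2
x = np.array([0.0, 5.0, 9.0])              # agent decisions, X = [0, 10]
theta = np.ones(m)                         # agent beliefs about theta_star

for k in range(1, 5001):
    alpha, gamma = 0.5 / k, 1.0 / k        # diminishing steplengths
    v = np.full(m, x.mean())               # consensus averaging (complete graph, a_ij = 1/m)
    grad = theta * (v - c)                 # gradient of f_i at v under current belief
    x = np.clip(v - alpha * grad, 0.0, 10.0)                  # projected gradient step
    obs = theta_star + 0.1 * rng.standard_normal(m)           # noisy observations of theta_star
    theta = np.clip(theta - gamma * (theta - obs), 0.5, 5.0)  # projected learning step

print(x, theta)   # decisions near x* = mean(c) = 2; beliefs near theta_star = 2
```

With these quadratics, the aggregate minimizer is the mean of the $c_i$ irrespective of $\theta$, so the example isolates the interplay of the three updates rather than the effect of misspecification on the optimum.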
4.5.1 Data

We consider a ten-node network with five firms, each housing a coal generator at one node and an oil generator at another. We assumed that the network is
Table 4.1. Deviation ε_k of decisions with iterations

Iterations k | prob = 2 | prob = 4 | prob = 7 | prob = 10
    1        | 2.20e+02 | 1.17e+02 | 4.44e+02 | 3.34e+02
    2        | 2.09e+07 | 2.09e+07 | 2.09e+07 | 2.09e+07
    3        | 2.63e-16 | 5.36e-17 | 2.52e-16 | 2.35e-17
   10        | 6.87e+03 | 6.87e+03 | 6.87e+03 | 6.87e+03
  100        | 4.84e+01 | 4.84e+01 | 4.84e+01 | 4.84e+01
 1000        | 2.84e-29 | 5.86e-29 | 3.67e-29 | 5.76e-29
10000        | 1.19e-29 | 2.45e-30 | 2.42e-30 | 1.59e-30
15000        | 1.05e-29 | 3.55e-30 | 3.72e-30 | 6.73e-30
completely connected and that the doubly stochastic property of A holds. The demand d is taken to be 10 units, and ξ is assumed to be normally distributed with distribution N(0, 1). It is also assumed that the noise associated with the learning metric is normally distributed with distribution N(0, 1). The samples associated with costs are generated with Gaussian noise N(1, 0.25) for each i ∈ F, and the values of c_i^ℓ are generated from noisy quadratic and nonlinear functions. A total of ten different problem instances were generated, and our results focus on three different insights. We used diminishing steplength sequences γ_k = γ_0/k, where γ_0 = 5 × 10², and α_k = α_0/k, with α_0 = 5 × 10².
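The learning problem above is a linear least-squares fit in $(r_i, b_i)$, since the cost is linear in those parameters. A sketch of how one firm's noisy cost samples determine the fit (synthetic values, assuming only the quadratic cost form stated above):

```python
import numpy as np

rng = np.random.default_rng(1)
r_true, b_true = np.array([2.0, 3.0]), 0.5      # hypothetical (r*, b*) for one firm
X = rng.uniform(0.0, 1.0, size=(200, 2))        # sampled generation levels x_ij
xbar = X.sum(axis=1)                            # aggregate output of the firm
cost = X @ r_true + 0.5 * b_true * xbar**2 + 0.05 * rng.standard_normal(200)

# Least-squares estimate of (r, b): regress cost on [x_i1, x_i2, xbar^2/2].
A = np.column_stack([X, 0.5 * xbar**2])
est, *_ = np.linalg.lstsq(A, cost, rcond=None)
print(est)   # approximately [2, 3, 0.5]
```

In the distributed scheme this regression is not solved in one shot; instead, projected gradient steps on the same residual drive the belief $\theta_i^k$ toward $\theta^*$.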
4.5.2 Consensus error

We consider four problem instances and present the difference in agents' beliefs in the vector z using the metric ε_k = Σ_{i∈F} ‖z_1^k − z_i^k‖. Table 4.1 examines ε_k with increasing k. Consensus in z_i is attained across all agents in less than 1000 iterations. In fact, within 1000 iterations, the metric ε_k drops below floating-point precision. Intuitively, the overall optimality error is dominated by the learning and stochasticity terms in the subsequent iterations.
4.5.3 Learning error

Next, we analyze the behavior of the iterates r_i^k and b_i^k with progressively increasing iterations K = k × κ, for different values of the total projection iterations in the θ space, for problem instance 10; the results are reported in Table 4.2. We note that κ denotes the number of inner learning steps taken in the θ space for every
Table 4.2. Learning parameter error with iterations

Σ_{i∈F} (‖r_i^k − r_i*‖² + ‖b_i^k − b_i*‖²)
Iterations k | κ = 1    | κ = 11   | κ = 21   | κ = 31   | κ = 41
    1        | 2.62e+03 | 2.62e+03 | 2.62e+03 | 2.62e+03 | 2.62e+03
   50        | 3.50e+05 | 3.50e+05 | 3.50e+05 | 3.50e+05 | 3.50e+05
  100        | 7.16e+04 | 7.16e+04 | 7.16e+04 | 7.16e+04 | 7.16e+04
 1000        | 5.77e+00 | 2.92e-03 | 2.95e-03 | 2.95e-03 | 2.95e-03
 3000        | 9.34e-01 | 2.95e-03 | 2.95e-03 | 2.95e-03 | 2.95e-03
 5000        | 4.04e-01 | 2.95e-03 | 2.95e-03 | 2.95e-03 | 2.95e-03
10000        | 1.28e-01 | 2.95e-03 | 2.95e-03 | 2.95e-03 | 2.95e-03
15000        | 6.42e-02 | 2.95e-03 | 2.95e-03 | 2.95e-03 | 2.95e-03
consensus iteration k (in the z space). From the actual values of r* and b*, obtained using traditional mathematical software, it is noticed that the iterates get sufficiently close to optimality in approximately 15000 iterations in the θ space.
4.5.4 Exploration vs. Exploitation

In our example, steps in the computation space are more expensive than those in the learning space. We consider the impact of deeper exploration in the learning space (exploration) for every step made in the computation space (exploitation). Specifically, for each update of z^k, we performed both single and multiple updates of θ, the number of which is denoted by κ, and the optimality error ‖z_1^k − z*‖ is reported in Table 4.3. We used the iterates of firm 1 in our tables; the iterates of the other firms follow a similar pattern. Since we have a handle on the exact functional expressions and values in this case, we computed z* by solving the original quadratic program with no uncertainty. It can be noted that, for a fixed effort corresponding to computing θ*, κ = 1 yields good solutions. As an example, choosing the total number of max-based operations K = 1100 (the number of projection steps taken for θ*), for {κ, k} = {1, 1000}, {κ, k} = {11, 100}, and {κ, k} = {21, 50}, the respective corresponding values of ‖z_1^k − z*‖ were 3.65 × 10⁻¹, 8.20 × 10⁰, and 7.27 × 10⁰. While using a higher κ yields better performance in the space of z^k, it comes at the cost of significantly more steps in the space of θ^k. Notably, there is not a significantly better performance with κ > 1, and for κ ≥ 11 the performance is almost identical. Subsequently, we compare our scheme with a standard projection scheme in which θ* is obtained a priori by solving the regression problem. Denoting z^k and z*,k to be the respective iterates, the trajectory with increasing iterations k
Table 4.3. Error in iterates with k and κ

‖z_1^k − z*‖
Iterations k | κ = 1    | κ = 11   | κ = 21   | κ = 31   | κ = 41
    1        | 1.01e+01 | 1.01e+01 | 1.01e+01 | 1.01e+01 | 1.01e+01
   50        | 7.27e+00 | 7.27e+00 | 7.27e+00 | 7.27e+00 | 7.27e+00
  100        | 8.20e+00 | 8.20e+00 | 8.20e+00 | 8.20e+00 | 8.20e+00
  250        | 1.58e+01 | 1.58e+01 | 1.58e+01 | 1.58e+01 | 1.58e+01
  500        | 8.47e-01 | 4.13e+00 | 4.15e+00 | 4.15e+00 | 4.15e+00
 1000        | 3.65e-01 | 2.00e-03 | 2.00e-03 | 2.00e-03 | 2.00e-03
 2000        | 1.79e-01 | 2.00e-03 | 2.00e-03 | 2.00e-03 | 2.00e-03
 3000        | 1.22e-01 | 2.00e-03 | 2.00e-03 | 2.00e-03 | 2.00e-03
 5000        | 7.61e-02 | 2.00e-03 | 2.00e-03 | 2.00e-03 | 2.00e-03
10000        | 4.14e-02 | 2.00e-03 | 2.00e-03 | 2.00e-03 | 2.00e-03
15000        | 2.94e-02 | 2.00e-03 | 2.00e-03 | 2.00e-03 | 2.00e-03
is shown in Figure 4.1, and the iterates are seen to converge asymptotically.

Figure 4.1. Trajectory of the function values f(z*,k, θ*) and f(z^k, θ*).
4.6 Concluding Remarks

Traditionally, optimization algorithms have been developed under the premise of exact information regarding functions and constraints. As systems grow in complexity, a priori knowledge of cost functions and efficiencies is difficult to guarantee. One avenue lies in using observational information to learn these functions while optimizing the overall system. We consider precisely such a question in a networked multi-agent regime in which an agent does not have access to the decisions of the entire collective. Generally, in such regimes, distributed optimization can be carried out by combining a local averaging step with a projected gradient step. We overlay a learning step in which agents update their beliefs regarding the misspecified parameter, and we examine the associated schemes in this regime. Such schemes are shown to display desirable a.s. global convergence, and the overall rate of convergence stays at the optimal O(1/√k) rate in terms of iterate convergence. Importantly, we quantify the contribution that learning and its interaction with consensus introduce to the overall rate. Preliminary numerics on an economic dispatch problem suggest promise.
Chapter 5 |
Future Work

We consider several possible directions for future research:

Distributed algorithms for misspecified Nash games and Cartesian variational inequality problems: While this document focused on resolving misspecification in optimization problems via distributed algorithms, a natural extension to Nash games and Cartesian variational inequality problems can be articulated. In recent work, distributed schemes have been posed for a subclass of games, called aggregative Nash games, that lead to strongly monotone variational inequality problems. We intend to examine two generalizations. First, can such models be extended to misspecified game-theoretic regimes? Such directions draw motivation from past efforts in congestion modeling and sensor networks [3], structural design [130, 131], and communication networks [4], where distributed algorithms have found utility. Of particular interest is the need to simultaneously learn payoff functions and equilibrium decisions. Second, rather than strongly monotone maps, can such schemes be extended to extragradient regimes in an effort to accommodate a weaker class of games? Much of the research has been restricted to monotone Nash games, an assumption that precludes more general settings.
Evolving connectivity graphs: A crucial assumption employed in the last chapter is the need for fixed connectivity graphs. Based on prior research in [132], we intend to extend this work to regimes where the connectivity graphs evolve and their union over a fixed interval is required to be connected. We expect that this will adversely affect the convergence rate, and we intend to quantify the cost of such a weakening of the connectivity requirement.
Bibliography
[1] Metzler, C., B. Hobbs, and J.-S. Pang (2003) “Nash-Cournot Equilibria
in Power Markets on a Linearized DC Network with Arbitrage: Formulations
and Properties,” Networks and Spatial Theory, 3(2), pp. 123–150.
[2] Dafermos, S. C. and S. C. McKelvey “Partitionable variational inequalities with applications to network and economic equilibria,” Journal
of Optimization Theory and Applications, 73(2).
[3] Scutari, G. and D. Palomar (2010) “MIMO Cognitive Radio: A Game
Theoretical Approach,” IEEE Transactions on Signal Processing, 58(2),
pp. 761–780.
[4] Yin, H., U. V. Shanbhag, and P. G. Mehta (2009) “Nash Equilibrium
Problems with Shared-Constraints,” Proceedings of the IEEE Conference on Decision and Control (CDC), pp. 4649–4654.
[5] Tsitsiklis, J. N. (1984) Problems in Decentralized Decision Making
and Computation, Ph.D. thesis, Department of Electrical Engineering and
Computer Science, Massachusetts Institute of Technology.
[6] Dantzig, G. B. (1955) “Linear programming under uncertainty,” Management Science, 1, pp. 197–206.
[7] Bertsekas, D. P., A. Nedić, and A. E. Ozdaglar (2003) Convex
analysis and optimization, Athena Scientific, Belmont, MA.
[8] Rockafellar, R. T. and R. J.-B. Wets (1975) “Stochastic convex
programming: Kuhn-Tucker conditions,” Journal of Mathematical Economics, 2(3), pp. 349–370.
[9] Rockafellar, R. T. (1972) Convex Analysis: 1st Edition, Princeton
University Press, Princeton, NJ.
[10] Beale, E. M. L. (1955) “On minimizing a convex function subject to linear
inequalities,” Journal of the Royal Statistical Society, Ser. B., 17, pp.
173–184; discussion, 194–203, (Symposium on linear programming.).
[11] Altman, E., T. Boulogne, R. El-Azouzi, T. Jiménez, and L. Wynter
(2006) “A survey on networking games in telecommunications,” Comput.
Oper. Res., 33(2), pp. 286–311.
[12] Alpcan, T. and T. Başar (2002) “A game-theoretic framework for congestion control in general topology networks,” in Proceedings of the 41st
IEEE Conference on Decision and Control, vol. 2, pp. 1218–1224.
[13] ——— (2003) “Distributed algorithms for Nash equilibria of flow control
games,” in Advances in Dynamic Games, vol. 7 of Annals of the International Society of Dynamic Games, Birkhäuser Boston, pp. 473–498.
URL http://citeseer.ifi.unizh.ch/alpcan03distributed.html
[14] Marden, J., G. Arslan, and J. Shamma (2009) “Cooperative Control
and Potential Games,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, 39(6), pp. 1393–1407.
[15] Li, N. and J. Marden (2013) “Designing Games for Distributed Optimization,” IEEE Journal of Selected Topics in Signal Processing, 7(2),
pp. 230–242.
[16] Marden, J. and A. Wierman (2008) “Distributed welfare games with
applications to sensor coverage,” in Decision and Control, 2008. CDC
2008. 47th IEEE Conference on, pp. 1708–1713.
[17] Nash, J. F. (1950) “Equilibrium points in N-Person Games,” Proceedings
of National Academy of Science.
[18] Facchinei, F. and J.-S. Pang (2003) Finite Dimensional Variational
Inequalities and Complementarity Problems: Vols I and II, Springer-Verlag, NY, Inc.
[19] Fokin, D. (1998) “Maximization of the lift/drag ratio of airfoils with a
turbulent boundary layer,” Fluid Dynamics, 33(3), pp. 443–449.
URL http://dx.doi.org/10.1007/BF02698197
[20] Elizarov, A. (2008) “Maximizing the lift-drag ratio of wing airfoils with a
turbulent boundary layer: Exact solutions and approximations,” Doklady
Physics, 53(4), pp. 221–227.
URL http://dx.doi.org/10.1134/S1028335808040113
[21] Rousseau, A., P. Sharer, S. Pagerit, and S. Das (2005) “Trade-off
between Fuel Economy and Cost for Advanced Vehicle Configurations,” in
Proceedings of the 20th International Electric Vehicle Symposium,
Monaco.
[22] Vijayagopal, R., J. Kwon, A. Rousseau, and P. Maloney (2010)
“Maximizing Net Present Value of a Series PHEV by Optimizing Battery
Size and Vehicle Control Parameters,” in SAE Convergence Conference,
Detroit.
[23] Holtzman, J. M. (1966) “Signal-Noise Ratio Maximization Using the
Pontryagin Maximum Principle,” Bell System Technical Journal, 45(3),
pp. 473–489.
URL http://dx.doi.org/10.1002/j.1538-7305.1966.tb04218.x
[24] Duensing, G., H. Brooker, and J. Fitzsimmons (1996) “Maximizing
Signal-to-Noise Ratio in the Presence of Coil Coupling,” Journal of
Magnetic Resonance, Series B, 111(3), pp. 230–235.
URL http://www.sciencedirect.com/science/article/pii/S1064186696900886
[25] Choi, S. C., W. S. Desarbo, and P. T. Harker (1990) “Product
Positioning Under Price Competition,” Management Science, 36(2), pp.
175–199.
URL http://EconPapers.repec.org/RePEc:inm:ormnsc:v:36:y:1990:i:2:p:175-199
[26] Garrow, L. A. and F. S. Koppelman (2004) “Multinomial and nested
logit models of airline passengers’ no-show and standby behaviour,” Journal
of Revenue Pricing Management, 3(3), pp. 237 – 253.
[27] Newman, J. P. (2008) “Normalization of network generalized extreme value
models,” Transportation Research Part B: Methodological, 42(10),
pp. 958–969.
URL http://ideas.repec.org/a/eee/transb/v42y2008i10p958-969.html
[28] Brighi, L. and R. John (2000) Characterizations of Pseudomonotone
Maps and Economic Equilibrium, Materials for discussion, University
of Modena and Reggio Emilia.
URL http://books.google.com/books?id=9VTuSAAACAAJ
[29] Robbins, H. and S. Monro (1951) “A stochastic approximation method,”
Annals of Mathematical Statistics, 22, pp. 400–407.
[30] Jiang, H. and H. Xu (2008) “Stochastic Approximation Approaches to
the Stochastic Variational Inequality Problem,” IEEE Transactions on
Automatic Control, 53(6), pp. 1462–1475.
[31] Koshal, J., A. Nedic, and U. V. Shanbhag (2013) “Regularized Iterative
Stochastic Approximation Methods for Stochastic Variational Inequality
Problems,” IEEE Transactions on Automatic Control, 58(3), pp. 594–
609.
[32] Nemirovski, A., A. Juditsky, G. Lan, and A. Shapiro (2009) “Robust
Stochastic Approximation Approach to Stochastic Programming,” SIAM J.
on Optimization, 19(4), pp. 1574–1609.
URL http://dx.doi.org/10.1137/070704277
[33] Ravat, U. and U. V. Shanbhag (2011) “On the Characterization of
Solution Sets of Smooth and Nonsmooth Convex Stochastic Nash Games,”
SIAM Journal on Optimization, 21(3), pp. 1168–1199.
[34] Nedic, A., A. Ozdaglar, and P. Parrilo (2010) “Constrained Consensus and Optimization in Multi-Agent Networks,” IEEE Transactions on
Automatic Control, 55(4), pp. 922–938.
[35] Touri, B., A. Nedic, and S. Ram (2010) “Asynchronous stochastic convex
optimization over random networks: Error bounds,” in Information Theory
and Applications Workshop (ITA), 2010, pp. 1–10.
[36] Ram, S., A. Nedic, and V. Veeravalli (2010) “Asynchronous Gossip
Algorithm for Stochastic Optimization: Constant Stepsize Analysis*,” in Recent Advances in Optimization and its Applications in Engineering,
Springer Berlin Heidelberg, pp. 51–60.
URL http://dx.doi.org/10.1007/978-3-642-12598-0_5
[37] Chen, J. and A. Sayed (2012) “Diffusion Adaptation Strategies for Distributed Optimization and Learning Over Networks,” IEEE Transactions
on Signal Processing, 60(8), pp. 4289–4305.
[38] Olshevsky, A. and J. Tsitsiklis (2009) “Convergence Speed in Distributed
Consensus and Averaging,” SIAM Journal on Control and Optimization, 48(1), pp. 33–55.
URL http://dx.doi.org/10.1137/060678324
[39] Touri, B. and A. Nedic (2009) “Distributed consensus over network with
noisy links,” in 12th International Conference on Information Fusion,
pp. 146–154.
[40] Jensen, M. K. (2010) “Aggregative games and best-reply potentials,”
Econom. Theory, 43(1), pp. 45–66.
[41] Tikhonov, A. N. (1963) “On the solution of incorrectly put problems and
the regularisation method,” in Outlines Joint Sympos. Partial Differential Equations (Novosibirsk, 1963), Acad. Sci. USSR Siberian Branch,
Moscow, pp. 261–265.
[42] Tikhonov, A. N. and V. Arsénine (1976) Méthodes de resolution de
problèmes mal posés, Éditions Mir, Moscow, traduit du russe par Vladimir
Kotliar.
[43] Browder, F. E. (1966) “Existence and approximation of solutions of nonlinear variational inequalities,” Proceedings of the National Academy
of Sciences of the United States of America, 56, pp. 1080–1086.
[44] Martinet, B. (1970) “Régularisation d’inéquations variationnelles par approximations successives,” Rev. Française Informat. Recherche Opérationnelle, 4(Ser. R-3), pp. 154–158.
[45] Patriksson, M. (1999) Nonlinear programming and variational inequality problems, vol. 23 of Applied Optimization, Kluwer Academic
Publishers, Dordrecht, a unified approach.
[46] Rockafellar, R. T. (1976) “Monotone operators and the proximal point
algorithm,” SIAM J. Control Optimization, 14(5), pp. 877–898.
[47] Ferris, M. C. (1991) “Finite termination of the proximal point algorithm,”
Math. Programming, 50(3, (Ser. A)), pp. 359–366.
URL http://dx.doi.org/10.1007/BF01594944
[48] Burachik, R. S., A. N. Iusem, and B. F. Svaiter (1997) “Enlargement
of monotone operators with applications to variational inequalities,” Set-Valued Anal., 5(2), pp. 159–180.
URL http://dx.doi.org/10.1023/A:1008615624787
[49] Scutari, G., F. Facchinei, J.-S. Pang, and D. Palomar (2014) “Real
and Complex Monotone Communication Games,” IEEE Transactions on
Information Theory, 60(7), pp. 4197–4231.
[50] Nesterov, Y. (2007) “Dual extrapolation and its applications to solving
variational inequalities and related problems,” Math. Program., 109(2-3,
Ser. B), pp. 319–344.
URL http://dx.doi.org/10.1007/s10107-006-0034-z
[51] Nemirovski, A. (2004) “Prox-method with rate of convergence O(1/t) for
variational inequalities with Lipschitz continuous monotone operators and
smooth convex-concave saddle point problems,” SIAM J. Optim., 15(1),
pp. 229–251 (electronic).
URL http://dx.doi.org/10.1137/S1052623403425629
[52] Ralph, D. (1994) “Global convergence of damped Newton’s method for
nonsmooth equations via the path search,” Mathematics of Operations
Research, 19(2), pp. 352–389.
[53] Ferris, M. C. and T. S. Munson (2000) “Complementarity Problems
in GAMS and the PATH Solver,” Journal of Economic Dynamics and
Control, 24(2), pp. 165–188.
[54] ——— (1998) “Interfaces to PATH 3.0: Design, Implementation and Usage,”
Computational Optimization and Applications, 12, pp. 207–227.
[55] Kelly, F. (2001) “Mathematical modelling of the Internet,” in Mathematics unlimited—2001 and beyond, Springer, Berlin, pp. 685–702.
[56] Gibbens, R. J. and F. P. Kelly (1999) “Resource pricing and the evolution
of congestion control,” Automatica J. IFAC, 35(12), pp. 1969–1985, special
issue on control methods for communication networks.
[57] Srikant, R. (2004) The mathematics of Internet congestion control,
Systems & Control: Foundations & Applications, Birkhäuser Boston Inc.,
Boston, MA.
[58] Altman, E., T. Jimenez, T. Başar, and N. Shimkin (2000) “Competitive routing in networks with polynomial cost,” in Proceedings of IEEE
INFOCOM 2000, Nineteenth Annual Joint Conference of the IEEE
Computer and Communications Societies, vol. 3, pp. 1586–1593.
[59] Altman, E., T. Başar, T. Jimenez, and N. Shimkin (2002) “Competitive routing in networks with polynomial costs,” IEEE Transactions on
Automatic Control, 47(1), pp. 92–96.
[60] Pan, Y. and L. Pavel (2005) “OSNR optimization in optical networks:
extension for capacity constraints,” in Proceedings of the American
Control Conference, pp. 2379–2384.
[61] Pavel, L. (2006) “A noncooperative game approach to OSNR optimization
in optical networks,” IEEE Transactions on Automatic Control, 51(5),
pp. 848–852.
[62] Pan, Y. and L. Pavel (2009) “Games with coupled propagated constraints
in optical network with multi-link topologies,” Automatica, 45, pp. 871–880.
[63] Alpcan, T., T. Basar, R. Srikant, and E. Altman (2001) “CDMA
uplink power control as a noncooperative game,” in Proceedings of the
40th IEEE Conference on Decision and Control, vol. 1, pp. 197–202.
[64] Alpcan, T., T. Başar, R. Srikant, and E. Altman (2002)
“CDMA uplink power control as a noncooperative game,” Wireless Networks, 8, pp. 659–669.
[65] Kelly, F. P., A. Maulloo, and D. Tan (1998) “Rate control for communication networks: shadow prices, proportional fairness and stability,” Journal
of the Operational Research Society, 49, pp. 237–252.
[66] Low, S. H. and D. E. Lapsley (1999) “Optimization flow control, I: basic
algorithm and convergence,” IEEE/ACM Transactions on Networking,
7(6), pp. 861–874.
[67] Konnov, I. V. (2007) Equilibrium Models and Variational Inequalities, Elsevier.
[68] Golshtein, E. G. and N. V. Tretyakov (1996) Modified Lagrangians
and monotone maps in optimization, Wiley-Interscience Series in Discrete Mathematics and Optimization, John Wiley & Sons Inc., New York,
translated from the 1989 Russian original by Tretyakov, A Wiley-Interscience
Publication.
[69] Polyak, B. (1987) Introduction to Optimization, Optimization Software
Inc., New York.
[70] Papadimitriou, C. H. and M. Yannakakis (1994) On Bounded Rationality And Computational Complexity, Tech. rep., Indiana University.
[71] Simon, H. A. (1996) The sciences of the artificial (3rd ed.), MIT Press,
Cambridge, MA, USA.
[72] Allevi, E., A. Gnudi, and I. Konnov (2004) “The proximal point method
for nonmonotone variational inequalities,” Mathematical Methods of
Operations Research, 63(3), pp. 553–565.
[73] Billups, S. (1995) Algorithms for Complementarity Problems and
Generalized Equations, Ph.D. thesis, Department of Computer Science,
University of Wisconsin at Madison.
[74] Dolan, E. D. and J. J. Moré (2002) “Benchmarking optimization software
with performance profiles,” Mathematical Programming, 91(2, Ser. A),
pp. 201–213.
[75] Ravat, U. and U. V. Shanbhag (2014) “On the existence of solutions
to stochastic quasi-variational inequality and complementarity problems,”
Under revision, http://arxiv.org/abs/1306.0586.
[76] Kannan, A. and U. Shanbhag (2012) “Distributed Computation of Equilibria in Monotone Nash Games via Iterative Regularization Techniques,”
SIAM Journal on Optimization, 22(4), pp. 1177–1205.
URL http://epubs.siam.org/doi/abs/10.1137/110825352
[77] Korpelevich, G. M. (1976) “The extragradient method for finding saddle
points and other problems,” Ekonomika i Matematcheskie Metody, 12,
pp. 747–756.
[78] Dang, C. D. and G. Lan, On the Convergence Properties of Non-Euclidean Extragradient Methods for Variational Inequalities with
Generalized Monotone Operators, Tech. rep., Department of Industrial
and Systems Engineering, University of Florida.
[79] Shapiro, A., D. Dentcheva, and A. Ruszczynski (2009) Lectures
on stochastic programming: modeling and theory, The society for
industrial and applied mathematics and the mathematical programming
society, Philadelphia, USA.
URL http://www2.isye.gatech.edu/people/faculty/Alex_Shapiro/SPbook.pdf
[80] Xu, H. (2010) “Sample average approximation methods for a class of stochastic variational inequality problems,” Asia-Pac. J. Oper. Res., 27(1), pp.
103–119.
URL http://dx.doi.org/10.1142/S0217595910002569
[81] Lu, S. and A. Budhiraja (2013) “Confidence Regions for Stochastic Variational Inequalities,” Mathematics of Operations Research, 38(3), pp.
545–568.
URL http://dx.doi.org/10.1287/moor.1120.0579
[82] Lu, S. (2014) “Symmetric confidence regions and confidence intervals for
normal map formulations of stochastic variational inequalities,” To appear
in SIAM Journal on Optimization.
[83] Kushner, H. J. and G. G. Yin (2003) Stochastic Approximation and
Recursive Algorithms and Applications, Springer New York.
[84] Borkar, V. S. (2008) Stochastic Approximation: A Dynamical Systems Viewpoint, Cambridge University Press.
[85] Spall, J. C. (2005) Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley Series in Discrete
Mathematics and Optimization, Wiley.
URL http://books.google.com/books?id=f66OIvvkKnAC
[86] Chandra, S. (1972) “Strong Pseudo-Convex Programming,” Indian Journal of Pure and Applied Mathematics, 3(2), pp. 278–282.
[87] Hobbs, B. F. (1986) “Mill Pricing Versus Spatial Price Discrimination Under
Bertrand and Cournot Spatial Competition,” The Journal of Industrial
Economics, 35(2), pp. 173–191.
[88] Gallego, G. and M. Hu (2014) “Dynamic pricing of perishable assets
under competition,” Management Science, 60(5), pp. 1241–1259.
[89] Ewerhart, C. (2011) Cournot games with biconcave demand, ECON
- Working Papers 016, Department of Economics - University of Zurich.
URL http://ideas.repec.org/p/zur/econwp/016.html
[90] Kihlstrom, R., A. Mas-Colell, and H. Sonnenschein (1976) “The
Demand Theory of the Weak Axiom of Revealed Preference,” Econometrica,
44(5), pp. 971–978.
[91] Polyak, B. T. (1990) “New stochastic approximation type procedures,”
Automat. i Telemekh., 7, pp. 98–107.
[92] Polyak, B. T. and A. Juditsky (1992) “Acceleration of Stochastic Approximation by Averaging,” SIAM Journal on Control and Optimization,
30(4), pp. 838–855.
URL http://dx.doi.org/10.1137/0330046
[93] Kushner, H. J. and J. Yang (1993) “Stochastic approximation with
averaging of the iterates: optimal asymptotic rate of convergence for general
processes,” SIAM Journal on Control and Optimization, 31(4), pp.
1045–1062.
[94] ——— (1995) “Analysis of adaptive step-size SA algorithms for parameter
tracking,” IEEE Transactions on Automatic Control, 40(8), pp. 1403–
1410.
[95] Nemirovski, A. S. and D. B. Yudin (1983) Problem complexity and
method efficiency in optimization, Wiley-Interscience, translated by E. R. Dawson.
[96] Ghadimi, S. and G. Lan (2012) “Optimal Stochastic Approximation Algorithms for Strongly Convex Stochastic Composite Optimization I: A Generic
Algorithmic Framework,” SIAM Journal on Optimization, 22(4), pp.
1469–1492.
[97] ——— (2013) “Optimal Stochastic Approximation Algorithms for Strongly
Convex Stochastic Composite Optimization, II: Shrinking Procedures and
Optimal Algorithms,” SIAM Journal on Optimization, 23(4), pp. 2061–
2089.
[98] Bertsekas, D. and J. Tsitsiklis (2000) “Gradient Convergence in Gradient
methods with Errors,” SIAM Journal on Optimization, 10(3), pp. 627–
642.
URL http://dx.doi.org/10.1137/S1052623497331063
[99] Ghadimi, S. and G. Lan (2013) “Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming,” SIAM Journal on Optimization, 23(4), pp. 2341–2368.
[100] Yousefian, F., A. Nedić, and U. V. Shanbhag (2013) “A regularized
smoothing stochastic approximation (RSSA) algorithm for stochastic variational inequality problems,” in Proceedings of the Winter Simulation
Conference (WSC), pp. 933–944.
[101] Juditsky, A., A. Nemirovski, and C. Tauvel (2011) “Solving variational
inequalities with stochastic mirror-prox algorithm,” Stochastic Systems,
1(1), pp. 17–58.
URL http://dx.doi.org/10.1214/10-SSY011
[102] Yousefian, F., A. Nedić, and U. V. Shanbhag (2014) “Optimal robust
smoothing extragradient algorithms for stochastic variational inequality problems,” Proceedings of the IEEE Conference on Decision and Control (CDC).
URL http://arxiv.org/abs/1403.5591
[103] ——— (2013) “Stochastic approximation schemes for nonsmooth stochastic
multi-user optimization and Nash games,” under revision.
URL http://arxiv.org/abs/1301.1711
[104] Hu, X. and J. Wang (2006) “Solving Pseudomonotone Variational Inequalities and Pseudoconvex Optimization Problems Using the Projection Neural
Network,” IEEE Transactions on Neural Networks, 17(6), pp. 1487–
1499.
URL http://dx.doi.org/10.1109/TNN.2006.879774
[105] Karamardian, S. and S. Schaible (1990) “Seven kinds of monotone
maps,” Journal of Optimization Theory and Applications, 66(1), pp.
37–46.
[106] Konnov, I. V. (2002) “Theory and Applications of Variational
Inequalities,” Preprint ISBN 951-42-6688-9, University of Oulu, Department of Mathematical Sciences.
[107] John, R. (1998) “Variational Inequalities and Pseudomonotone Functions:
Some Characterizations,” in Generalized Convexity, Generalized Monotonicity: Recent Results, vol. 27 of Nonconvex Optimization and Its
Applications, Springer US, pp. 291–301.
URL http://dx.doi.org/10.1007/978-1-4613-3341-8_13
[108] Nesterov, Y. (2004) Introductory lectures on convex optimization
: a basic course, Applied optimization, Kluwer Academic Publ., Boston,
Dordrecht, London.
URL http://opac.inria.fr/record=b1104789
[109] Watson, L. T. (1979) “Solving the nonlinear complementarity problem by a homotopy method,” SIAM Journal on
Control and Optimization, 17(1), pp. 36–46.
[110] Allaz, B. and J.-L. Vila (1993) “Cournot competition, forward markets and
efficiency,” Journal of Economic Theory, 59(1), pp. 1–16.
[111] Hobbs, B. F. (1986) “Mill Pricing Versus Spatial Price Discrimination Under
Bertrand and Cournot Spatial Competition,” The Journal of Industrial
Economics, 35(2), pp. 173–191.
[112] Facchinei, F., A. Fischer, and V. Piccialli (2007) “On generalized Nash
games and variational inequalities,” Oper. Res. Lett., 35(2), pp. 159–164.
[113] Olshevsky, A. and J. N. Tsitsiklis (2009) “Convergence Speed in Distributed Consensus and Averaging,” SIAM J. Control Optim., 48(1), pp.
33–55.
URL http://dx.doi.org/10.1137/060678324
[114] Jafarizadeh, S. (2010) “Exact Determination of Optimal Weights for
Fastest Distribution Consensus Algorithm in Star and CCS Networks via
SDP,” CoRR, abs/1001.4278.
[115] Dominguez-Garcia, A., S. Cady, and C. Hadjicostis (2012) “Decentralized optimal dispatch of distributed energy resources,” in Decision
and Control (CDC), 2012 IEEE 51st Annual Conference on, pp.
3688–3693.
[116] Ram, S. S., A. Nedić, and V. V. Veeravalli (2009) “Asynchronous gossip
algorithms for stochastic optimization,” in Proceedings of the 48th IEEE
Conference on Decision and Control (CDC-CCC), pp. 3581–3586.
[117] Yuan, D., S. Xu, and H. Zhao (2011) “Distributed Primal-Dual Subgradient Method for Multiagent Optimization via Consensus Algorithms,”
IEEE Transactions on Systems, Man, and Cybernetics, Part B:
Cybernetics, 41(6), pp. 1715–1724.
[118] Tsitsiklis, J., D. Bertsekas, and M. Athans (1986) “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,”
IEEE Transactions on Automatic Control, 31(9), pp. 803–812.
[119] Cybenko, G. (1989) “Dynamic Load Balancing for Distributed Memory
Multiprocessors,” Journal of Parallel Distributed Computing, 7(2), pp.
279–301.
URL http://dx.doi.org/10.1016/0743-7315(89)90021-X
[120] Jafarizadeh, S. and A. Jamalipour (2010) “Weight Optimization for
Distributed Average Consensus Algorithm in Symmetric, CCS & KCS Star
Networks,” ArXiv:1001.4278v3.
[121] Mahmoud, E. C., G. Neglia, and K. Avrachenkov (2012) Distributed
Weight Selection in Consensus Protocols by Schatten Norm Minimization, Rapport de recherche RR-8078, INRIA.
URL http://hal.inria.fr/hal-00738249
[122] Necoara, I., I. Dumitrache, and J. A. K. Suykens (2010) “Fast primaldual projected linear iterations for distributed consensus in constrained convex
optimization,” in 49th IEEE Conference on Decision and Control
(CDC), pp. 1366–1371.
[123] Xu, Y. and H. Wang (2013) “Optimal Weight Determination and Consensus
Formation under Fuzzy Linguistic Environment,” Procedia Computer
Science, 17(0), pp. 482 – 489, first International Conference on Information
Technology and Quantitative Management.
URL http://www.sciencedirect.com/science/article/pii/S1877050913001956
[124] Cheng, Y. (1987) “Dual gradient method for linearly constrained, strongly
convex, separable mathematical programming problems,” Journal of Optimization Theory and Applications, 53(2), pp. 237–246.
URL http://dx.doi.org/10.1007/BF00939216
[125] Zhu, M. and S. Martinez (2012) “On Distributed Convex Optimization
Under Inequality and Equality Constraints,” IEEE Transactions on Automatic Control, 57(1), pp. 151–164.
[126] Chang, T., A. Nedić, and A. Scaglione (2014) “Distributed Constrained
Optimization by Consensus-Based Primal-Dual Perturbation Method,” IEEE
Transactions on Automatic Control, 59(6), pp. 1524–1538.
[127] Zanella, F., D. Varagnolo, A. Cenedese, G. Pillonetto, and
L. Schenato (2012) “The convergence rate of Newton-Raphson consensus
optimization for quadratic cost functions,” in 51st IEEE Conference on
Decision and Control (CDC), pp. 5098–5103.
[128] Lee, S. and A. Nedić, “Asynchronous Gossip-Based Random Projection
Algorithms Over Networks,” Under 2nd review at IEEE Transactions on
Automatic Control, April 2013.
[129] Jiang, H. and U. V. Shanbhag (2013) “On the solution of stochastic optimization problems in imperfect information regimes,” in Winter Simulation
Conference (WSC), pp. 821–832.
[130] Song, P., J.-S. Pang, and V. Kumar (2004) “A semi-implicit time-stepping
model for frictional compliant contact problems,” International Journal
for Numerical Methods in Engineering, 60, pp. 2231–2261.
[131] Tzitzouris, J. A. and J. S. Pang (2001) “A time-stepping complementarity approach for frictionless systems of rigid bodies,” SIAM Journal on
Optimization, 12, pp. 834–860.
[132] Koshal, J., A. Nedic, and U. Shanbhag (2012) “A gossip algorithm for
aggregative games on graphs,” Proceedings of the IEEE Conference on
Decision and Control (CDC), pp. 4840–4845.
Vita
Aswin Kannan
I was born and raised in Chennai, India. I hold a Bachelor's degree in Mechanical
Engineering (2008) from the College of Engineering, Guindy, Chennai, India, and a
Master's degree in Industrial Engineering (2010) from the University of Illinois at
Urbana-Champaign, Champaign, IL. From 2010 to 2012, I worked as a Scientific
Developer in the Mathematics Division at Argonne National Laboratory, Lemont, IL.
Since October 2014, I have been working as a Computational Scientist in the
analytics division of Oracle Corp. in Burlington, MA.
Besides variational inequalities, I am also interested in quadratic programming
and derivative-free optimization.