Article

A Distributed Optimization Accelerated Algorithm with Uncoordinated Time-Varying Step-Sizes in an Undirected Network

1 Database and Artificial Intelligence Laboratory, College of Computer and Information Science, Southwest University, Chongqing 400715, China
2 College of Big Data and Software, Chongqing College of Mobile Communication, Chongqing 401520, China
3 Business College, Southwest University, Chongqing 402460, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(3), 357; https://doi.org/10.3390/math10030357
Submission received: 12 December 2021 / Revised: 14 January 2022 / Accepted: 21 January 2022 / Published: 25 January 2022
(This article belongs to the Special Issue Modeling and Analysis of Complex Networks)

Abstract

In recent years, significant progress has been made in the field of distributed optimization algorithms. This study focuses on the distributed convex optimization problem over an undirected network. The target is to minimize the average of all local objective functions, each of which is known only to its agent, while each agent communicates necessary information only with its neighbors. Building on a state-of-the-art algorithm, we propose a novel distributed optimization algorithm for the case where the objective function of each agent is smooth and strongly convex. Faster convergence is attained by utilizing the Nesterov and Heavy-ball accelerated methods simultaneously, which makes the algorithm applicable to many large-scale distributed tasks. Meanwhile, the step-sizes and accelerated momentum coefficients are designed to be uncoordinated, time-varying, and nonidentical, which allows the algorithm to adapt to a wide range of application scenarios. Under some necessary assumptions and conditions, a linear convergence rate is established through rigorous theoretical analysis. Finally, numerical experiments on a real dataset demonstrate the superiority and efficacy of the novel algorithm compared to similar algorithms.

1. Introduction

In recent years, with the rapid development of artificial intelligence, big data, etc., distributed optimization problems in multi-agent systems have attracted much attention. As one of the most important fields, distributed optimization methods have gained significant interest due to their widespread applications in science and engineering, such as the transmission of information in wireless sensor networks [1,2,3], the collaboration of vehicles in formation control [4,5], speeding up the optimization process in distributed machine learning [6,7], distributed resource allocation in smart-grid networks [8,9,10], distributed control in nonlinear dynamical systems [11,12], etc. Specifically, a distributed optimization framework can avoid the establishment of long-distance communication between agents while providing better load balancing for the network. In contrast to traditional centralized optimization, agents in a multi-agent system communicate only with their neighbors, and the local objective function of each agent is known only to itself.
Literature Review: Since the DGD (distributed gradient descent) algorithm was proposed by Nedic [13] for solving distributed convex problems in multi-agent systems, great progress has been made in the distributed optimization field. In particular, distributed first-order methods have attracted many researchers' attention. Based on consensus theory [14] and gradient-descent technology, diminishing step-sizes were introduced into DGD [13], which made the algorithm converge to the exact optimal solution, but with a sublinear rate. When there were constraints on the decision variable, by utilizing the projection method, Sundhar [15] proposed a stochastic subgradient projection algorithm. Similar to [13,15], refs. [16,17,18] also employed diminishing step-sizes, and these algorithms converged to the exact solution, though only sublinearly, since diminishing step-sizes lead to a much slower convergence rate. Distributed algorithms with constant step-sizes were then developed in [19,20,21,22,23,24,25,26,27,28,29,30,31,32] to overcome this shortcoming. The algorithm EXTRA [19] (an exact first-order algorithm for decentralized consensus optimization) and its improvements [20,21,22] modified the update rule of DGD by taking the difference of two consecutive iterations of the formulas. Compared to DGD, a linear convergence rate can be verified for EXTRA, and even though the step-size was fixed to a constant, EXTRA was more stable; however, the two weight matrices in EXTRA must obey strict mixing-matrix conditions. A different type of distributed optimization algorithm, HSADO (harnessing smoothness to accelerate distributed optimization), was proposed by Qu and Li [26] for local objective functions that are convex and smooth. HSADO adopted a gradient-tracking mechanism, which replaced the gradient term in DGD with a tracking gradient that estimates the average gradient of the whole network. If the step-sizes are set to constants, HSADO can also converge to the optimal solution linearly. Based on HSADO, researchers made modifications to adapt to different scenarios, such as time-varying networks [27,28], node-varying step-sizes [29,30], and accelerated methods [31,32]. Further, researchers studied the primal-dual method in distributed optimization: by utilizing the Augmented Lagrangian function, the original problem was reformulated as a dual problem. It has been demonstrated that EXTRA is equivalent to the algorithms in [33,34] by introducing dual variables, and [27] also provided a primal-dual interpretation for HSADO. Recently, the primal-dual algorithm UG (a unification and generalization of exact distributed first-order methods) proposed in [35] unified and generalized the methods EXTRA and HSADO, while it also converged linearly.
Motivations: Among these studies, EXTRA, HSADO, and UG are the most related to our research. The algorithm UG can be regarded as a generalization and unification of DGD, EXTRA, and HSADO. However, each local objective function in the network is required to be twice continuously differentiable, which is a rigorous condition in actual scenarios. In order to obtain linear convergence, fixed constant step-sizes were frequently adopted in distributed optimization algorithms such as [19,26,35], etc. Unfortunately, many applications require uncoordinated step-sizes for different agents rather than the same constant step-size. This situation was first studied in [36], in which an augmented distributed-gradient method was proposed, but it converged sublinearly. Then, by employing uncoordinated step-sizes, Lü [30] and Jakovetic [28] established global linear convergence of their algorithms in time-varying undirected and directed networks, respectively. To endow agents with more independence, time-varying and nonidentical step-sizes were studied. A primal-dual fixed-point algorithm with nonidentical step-sizes was proposed by Li [37] for the case where the objective function of each agent is twice differentiable and nonsmooth. Xin [32] also adopted nonidentical step-sizes in a directed network. With a more relaxed step-size and network topology, a distributed primal-dual optimization method in [38] was proposed by utilizing time-varying step-sizes, which was proved to converge linearly. Until now, to the best knowledge of the authors, the widely used algorithm UG has not been studied with uncoordinated, time-varying, and nonidentical step-sizes in an undirected network. Recently, as optimization processes of large-scale tasks such as deep learning are getting slower, the convergence rate of distributed optimization algorithms needs further improvement. With the help of the Nesterov [39] and Heavy-ball [40] accelerated methods, the convergence rate of distributed optimization algorithms can be improved. In [32], a Heavy-ball distributed accelerated method with gradient-tracking technology was proposed to accelerate the well-known row-stochastic and column-stochastic algorithm [41]. In [31,42], a better convergence rate was shown by utilizing the Nesterov accelerated method. Moreover, both the Heavy-ball and Nesterov accelerated methods were introduced to improve the convergence rate in directed networks for machine learning in [43]. For the widely used algorithm UG, it remains an open challenge to study whether the simultaneous inclusion of Heavy-ball and Nesterov momentum can bring about a faster convergence rate in large-scale computing and communication tasks.
Statement of Contributions: Throughout this article, we mainly focus on the application of distributed convex optimization methods over an undirected network. We propose a novel distributed optimization algorithm with uncoordinated, time-varying, and nonidentical step-sizes and accelerated momentum terms, which has a faster linear convergence rate and can be applied to more scenarios. To summarize, our three contributions are as follows:
  • Based on the distributed optimization methods [19,26,35], we designed and discussed a faster distributed optimization accelerated algorithm, named UGNH (UG with Nesterov and Heavy-ball accelerated methods), which solves distributed convex problems over an undirected network. In particular, the Nesterov and Heavy-ball momentum terms together improve the convergence rate, as can be seen in the numerical experiments.
  • Compared to related algorithms, in our algorithm, not only the step-sizes but also the coefficients of the momentum terms (for convenience, we call them coefficients for short later) are uncoordinated, time-varying, and nonidentical, and they are locally chosen by each agent. Through convergence analysis, the step-sizes and coefficients are shown to be more flexible than in most existing methods. Meanwhile, if the local objective functions are smooth and strongly convex, we can obtain upper bounds on the step-sizes and coefficients. Under these upper bounds, the sequences generated by UGNH converge to the exact optimal solutions linearly.
  • In contrast to related algorithms, the upper bounds of the largest step-size and coefficient of UGNH are more relaxed, as they only depend on the parameters of the objective functions and the topology of the network. Meanwhile, some agents (but not all) may have zero step-sizes and coefficients.
Organization: The rest of this article is arranged as follows. In Section 2, we describe the distributed problem and provide some necessary assumptions. In Section 3, we discuss the development of relevant distributed optimization algorithms and two classical accelerated methods and then propose a new distributed accelerated algorithm. Convergence analysis is detailed in Section 4. In Section 5, numerical experiments are provided to demonstrate the superiority and efficiency of our algorithm. Finally, Section 6 concludes this article and provides some research directions for the future.
Basic Notation: Throughout the rest of this article, unless otherwise specified, all vectors are considered as column vectors, and $n$ is the number of agents in the network. The real-number set, the natural-number set, and the set of $m$-dimensional real column vectors are denoted by $\mathbb{R}$, $\mathbb{N}$, and $\mathbb{R}^m$, respectively. The subscript notations $i, j \in \{1, 2, \ldots, n\}$ represent the indices of the agents, while the superscript notation $t$ represents the index of the iteration step; e.g., $x_i^t$ represents the $i$th agent's decision variable at the $t$th iteration. $0_n \in \mathbb{R}^n$, $1_n \in \mathbb{R}^n$, and $I_n \in \mathbb{R}^{n \times n}$ denote the $n$-dimensional zero vector, the $n$-dimensional one vector, and the identity matrix, respectively. For a matrix $P$, $p_{ij}$ denotes the element at the $i$-th row and $j$-th column of $P$, while its spectral radius and spectral norm are denoted by $\rho(P)$ and $\|P\|$, respectively. Similarly, $\|x\|$ denotes the 2-norm of a vector $x$. The transposes of a vector $x$ and a matrix $P$ are denoted by $x^T$ and $P^T$, respectively. For a vector $r = (r_1, r_2, \ldots, r_n)^T$, $\mathrm{diag}(r)$ represents the diagonal matrix whose diagonal elements equal the entries of $r$. The notation $\otimes$ represents the Kronecker product. Let $\nabla f(x): \mathbb{R}^m \to \mathbb{R}^m$ denote the gradient of $f(x)$ at $x$.

2. Preliminaries

This section describes the formulation of the distributed optimization problem and some necessary basic assumptions on the network and the objective functions.

2.1. Problem Formulation

Consider an undirected network of $n$ agents, which cooperatively solve the optimization problem written in the following form over a common variable $x \in \mathbb{R}^m$:

$$\min_{x \in \mathbb{R}^m} f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x) \qquad (1)$$

Here, each agent $i$ possesses a convex local objective function $f_i: \mathbb{R}^m \to \mathbb{R}$ and exchanges local information only with its neighbors. Our main target is to design a distributed optimization algorithm whose decision variable converges linearly to the optimal solution that minimizes the average of all local objective functions. The optimal average objective value of problem (1) is defined as $f(\tilde{x}^*)$, where $\tilde{x}^* \in \mathbb{R}^m$ is the optimal decision variable. Then, the global optimal solution of (1) is denoted by $x^* \in \mathbb{R}^{nm}$, where $x^* = 1_n \otimes \tilde{x}^*$.
As a local copy of the global decision variable is saved at each agent, optimization problem (1) can be solved in a distributed way by iterating the decision variable. In this study, the network is described as $G = (V, E)$, where $V = \{1, 2, \ldots, n\}$ is the vertex set that represents the agents of the network, and $E \subseteq \{(i, j) \mid i, j \in V\}$ is the edge set. In an undirected network, an edge $(i, j) \in E$ implies that the edge $(j, i) \in E$ too; agent $i$ and agent $j$ can thus exchange information with each other. Let $N_i = \{j \mid (i, j) \in E\} \cup \{i\}$ denote the set of all neighbors of agent $i$ (including itself). Then, formulation (1) can be rewritten as follows:

$$\min_{x \in \mathbb{R}^{nm}} f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x_i) \qquad (2)$$

where $x = (x_1^T, x_2^T, \ldots, x_n^T)^T \in \mathbb{R}^{nm}$ and $x_i = x_j$ for all $i, j \in V$. Recently, it has been proved in [33] that the equality $\frac{1}{\alpha} L^{\frac{1}{2}} x = 0$ is equivalent to the consensus condition $x_i = x_j$, where $\alpha$ is the step-size and $L = I - P$ is a Laplacian matrix. Then, the primal-dual method can be introduced to solve (2) by utilizing the Augmented Lagrangian function, which is also a cornerstone of our algorithm.
Next, some necessary assumptions about the underlying graph and local objective functions are formalized, which are a common standard in related distributed optimization studies.

2.2. Assumptions

Assumption 1
([35]). The network $G = (V, E)$ is connected, undirected, and simple. In particular, there are no self-loops at any agent and no multiple links between any two agents.
Assumption 2
([19]). A non-negative symmetric doubly stochastic weight matrix $P = [p_{ij}] \in \mathbb{R}^{n \times n}$ is defined to represent network $G$. The weights of the matrix $P$ satisfy the following three conditions:
  • Non-negative: $p_{ij} > 0$ if $j \in N_i$, and $p_{ij} = 0$ otherwise
  • Symmetric: $p_{ij} = p_{ji}$
  • Doubly stochastic: $\sum_{i=1}^{n} p_{ij} = \sum_{j=1}^{n} p_{ij} = 1$
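Assumption 2 specifies the required properties of $P$ but not a construction. As an illustration only (not part of the original analysis), one common local rule that yields such a matrix on a connected undirected graph is the Metropolis–Hastings rule; a minimal sketch:

```python
import numpy as np

def metropolis_weights(adj):
    """Build a symmetric doubly stochastic P from a 0/1 adjacency matrix.

    Metropolis-Hastings rule: p_ij = 1 / (1 + max(deg_i, deg_j)) for
    neighboring agents i != j, zero for non-neighbors, and each diagonal
    entry absorbs the remainder so that every row sums to one.
    """
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j] > 0:
                P[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        P[i, i] = 1.0 - P[i].sum()
    return P
```

Because the off-diagonal entries are symmetric and every row sums to one, the resulting matrix is symmetric and doubly stochastic, as Assumption 2 requires.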
Assumption 3.
Each local objective function $f_i: \mathbb{R}^m \to \mathbb{R}$, $i \in V$, is smooth with Lipschitz constant $\psi_i$ and strongly convex with parameter $\mu_i$. Mathematically, there exist $\psi_i > 0$ and $\mu_i > 0$ such that, for any $x, y \in \mathbb{R}^m$:

$$\|\nabla f_i(x) - \nabla f_i(y)\| \leq \psi_i \|x - y\|, \qquad f_i(x) - f_i(y) \geq \nabla f_i(y)^T (x - y) + \frac{\mu_i}{2} \|x - y\|^2$$
Remark 1.
Assumption 1 ensures that each agent can directly or indirectly affect every other agent in the network. Assumption 3 is a standard assumption in the convergence analysis of distributed optimization methods. In particular, under the strong-convexity assumption on each function, there exists a unique global optimal solution to problem (1). Moreover, for the global objective function $f$, we define $\bar{\psi} = \frac{1}{n} \sum_{i=1}^{n} \psi_i$ as the global Lipschitz constant and $\bar{\mu} = \frac{1}{n} \sum_{i=1}^{n} \mu_i$ as the global strong-convexity parameter.

3. Algorithm Development

In this section, Section 3.1 describes the development of some related algorithms. Section 3.2 describes the Nesterov and Heavy-ball accelerated methods for the distributed optimization algorithm. Section 3.3 describes the proposed algorithm UGNH and the relationship between UGNH and the previous algorithms.

3.1. Related Algorithms

In this subsection, we focus on the classical algorithms DGD, EXTRA, HSADO, and UG, which are related to the proposed algorithm, and give a brief explanation of each.
In [13], Nedic and Ozdaglar proposed a standard distributed gradient descent method, DGD. The method updates the decision variable at each agent by combining its neighbors' variables with the local negative gradient direction, as follows:

$$x_i^{t+1} = \sum_{j \in N_i} p_{ij} x_j^t - \alpha^t \nabla f_i(x_i^t) \qquad (3)$$

where $\alpha^t$ is the step-size, which satisfies $\alpha^t > 0$, $\sum_{t=0}^{\infty} \alpha^t = \infty$, and $\sum_{t=0}^{\infty} (\alpha^t)^2 < \infty$, and the matrix $P$ satisfies Assumption 2. The variable $x_i^t$ stored at each agent is the local estimate of $x$ at the $t$-th iteration. It was proved that the sequences generated by DGD cannot converge to the exact optimal solution $x^*$ when employing a fixed step-size, i.e., $\alpha^t = \alpha$. By taking appropriately diminishing step-sizes, DGD can converge exactly, but the convergence rate is sublinear.
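For concreteness, a minimal sketch of the DGD update (3), with the common diminishing schedule $\alpha^t = \alpha^0/(t+1)$ (which satisfies the two summability conditions); the names and the schedule are illustrative choices, not prescribed by [13]:

```python
import numpy as np

def dgd(P, grad_fi, x0, alpha0=0.1, iters=1000):
    """Distributed gradient descent (3).

    P:       (n, n) weight matrix satisfying Assumption 2
    grad_fi: list of n callables; grad_fi[i](x) is agent i's local gradient
    x0:      (n, m) stack of initial local decision variables
    """
    x = x0.copy()
    for t in range(iters):
        alpha_t = alpha0 / (t + 1)  # diminishing step-size schedule
        grads = np.stack([g(xi) for g, xi in zip(grad_fi, x)])
        x = P @ x - alpha_t * grads  # mix with neighbors, then local descent
    return x
```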
To acquire linear convergence, Shi [19] proposed a new method EXTRA by modifying the update rule of DGD (3). There were two steps performed as follows:
$$x_i^1 = \sum_{j \in N_i} p_{ij} x_j^0 - \alpha \nabla f_i(x_i^0) \qquad (4)$$
$$x_i^{t+1} = x_i^t + \sum_{j \in N_i} p_{ij} x_j^t - \sum_{j \in N_i} \tilde{p}_{ij} x_j^{t-1} - \alpha \left( \nabla f_i(x_i^t) - \nabla f_i(x_i^{t-1}) \right) \qquad (5)$$

where the step-size $\alpha > 0$ is a constant, the matrix $P$ satisfies Assumption 2, and $\tilde{P} = \frac{I + P}{2}$ is an appropriate choice. Compared to DGD (3), an initial condition (4) and one more iteration (5) are added. Notably, even though the step-size is a constant, EXTRA can converge linearly to the exact optimal solution as long as the step-size is chosen appropriately.
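A minimal sketch of the EXTRA iterations (4)–(5) under the choice $\tilde{P} = (I + P)/2$, keeping the two consecutive iterates the update requires (names are illustrative):

```python
import numpy as np

def extra(P, grad_fi, x0, alpha, iters=1000):
    """EXTRA (4)-(5) with P_tilde = (I + P) / 2 and a constant step-size."""
    n = P.shape[0]
    P_tilde = (np.eye(n) + P) / 2
    g_prev = np.stack([g(xi) for g, xi in zip(grad_fi, x0)])
    x_prev = x0.copy()
    x = P @ x0 - alpha * g_prev  # initial step (4)
    for _ in range(iters):
        g = np.stack([gf(xi) for gf, xi in zip(grad_fi, x)])
        # main recursion (5): difference of two consecutive mixed iterates
        x_next = x + P @ x - P_tilde @ x_prev - alpha * (g - g_prev)
        x_prev, x, g_prev = x, x_next, g
    return x
```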
Based on DGD (3), Qu and Li [26] proposed a novel distributed algorithm, HSADO, by using gradient-tracking technology. An auxiliary variable $z_i^t$ is introduced to estimate the network-wide gradient average $\frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_i^t)$ at the $t$-th iteration for agent $i$. As a result, the gradient contribution $\alpha \nabla f_i(x_i^t)$ in (3) is replaced by $\alpha z_i^t$. The specific updating rules are as follows:

$$x_i^{t+1} = \sum_{j \in N_i} p_{ij} x_j^t - \alpha z_i^t \qquad (6)$$
$$z_i^{t+1} = \sum_{j \in N_i} p_{ij} z_j^t + \nabla f_i(x_i^{t+1}) - \nabla f_i(x_i^t) \qquad (7)$$

where the step-size $\alpha > 0$ is a constant and the matrix $P$ satisfies Assumption 2. Under the previous assumptions, initialized with $x_i^0 \in \mathbb{R}^m$ and $z_i^0 = \nabla f_i(x_i^0)$, a global linear convergence rate can be attained when choosing an appropriate fixed step-size.
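A minimal sketch of the gradient-tracking updates (6)–(7); the initialization $z^0 = \nabla F(x^0)$ preserves the invariant that the average of the trackers equals the average gradient (names are illustrative):

```python
import numpy as np

def hsado(P, grad_fi, x0, alpha, iters=1000):
    """HSADO (6)-(7): z_i tracks the network-average gradient."""
    x = x0.copy()
    g = np.stack([gf(xi) for gf, xi in zip(grad_fi, x)])
    z = g.copy()  # z_i^0 = grad f_i(x_i^0)
    for _ in range(iters):
        x = P @ x - alpha * z  # (6): descend along the tracked gradient
        g_new = np.stack([gf(xi) for gf, xi in zip(grad_fi, x)])
        z = P @ z + g_new - g  # (7): mix trackers, add local gradient change
        g = g_new
    return x
```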
Recently, a novel distributed optimization algorithm, UG, was proposed in [35], which uses the primal-dual method to solve the equivalent problem (2). Through tuning parameters, the algorithm subsumes the well-known algorithms EXTRA and HSADO. The updating rules are as follows:

$$x_i^{t+1} = \sum_{j \in N_i} p_{ij} x_j^t - \alpha \left( \nabla f_i(x_i^t) + z_i^t \right) \qquad (8)$$
$$z_i^{t+1} = z_i^t - \sum_{j \in N_i} l_{ij} \left( \nabla f_j(x_j^t) + z_j^t - \sum_{q \in N_j} k_{jq} x_q^t \right) \qquad (9)$$

where the step-size $\alpha > 0$ is a constant, $x_i^t$ is the primal variable, and $z_i^t$ is the dual variable; they are initialized to $x_i^0 \in \mathbb{R}^m$ and $z_i^0 = 0_m$, respectively. For more-compact notation, we define $P = [p_{ij}]$, $L = [l_{ij}]$, and $K = [k_{ij}]$. The matrix $L = I_n - P$, and the matrix $K \in \mathbb{R}^{n \times n}$ is symmetric with the property that there exists some constant $\lambda$ such that $K 1_n = \lambda 1_n$.
By analysis, when the matrix $K$ is chosen properly, the algorithm UG is equivalent to: (1) EXTRA, when $K = \frac{1}{\alpha} P$, and (2) HSADO, when $K = 0_{n \times n}$. For other cases, $K = \frac{\bar{\mu} + \bar{\psi}}{2} I_n$ and $K = \frac{\bar{\mu} + \bar{\psi}}{1 + \lambda_n} P$ ($\lambda_n$ is the smallest eigenvalue of matrix $P$) are appropriate choices of the forms $K = k I_n$ and $K = k P$, respectively. The choice of $K$ introduces no extra computational variables and no additional communication with other agents, so (8) and (9) are easy to implement. As UG unifies and generalizes the previous methods, we mainly focus on it.
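The four choices of $K$ listed above can be collected in a small helper; a sketch under the stated formulas, with $\bar{\mu}$ and $\bar{\psi}$ assumed to be precomputed and the function name illustrative:

```python
import numpy as np

def ug_matrix_K(P, alpha, mu_bar, psi_bar, mode="hsado"):
    """Choices of K in UG (8)-(9) that recover the special cases."""
    n = P.shape[0]
    if mode == "extra":   # K = P / alpha  ->  UG reduces to EXTRA
        return P / alpha
    if mode == "hsado":   # K = 0          ->  UG reduces to HSADO
        return np.zeros((n, n))
    if mode == "kI":      # K = k * I_n with k = (mu_bar + psi_bar) / 2
        return (mu_bar + psi_bar) / 2 * np.eye(n)
    lam_min = np.linalg.eigvalsh(P).min()  # smallest eigenvalue of symmetric P
    return (mu_bar + psi_bar) / (1 + lam_min) * P  # K = k * P
```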

3.2. Distributed Accelerated Methods

In this section, centralized Nesterov and Heavy-ball accelerated methods will be introduced. With them, many distributed optimization algorithms can converge faster.
For the gradient-descent algorithm, i.e., $x_i^{t+1} = x_i^t - \alpha \nabla f_i(x_i^t)$, the best achievable convergence rate is $O\left(\left(\frac{\kappa - 1}{\kappa + 1}\right)^t\right)$, where $\kappa = \frac{\bar{\psi}}{\bar{\mu}}$ denotes the condition number of the objective function. If $\bar{\psi}$ is much larger than $\bar{\mu}$ so that $\kappa$ is large, then gradient descent becomes quite slow. To accelerate gradient descent, Polyak [40] proposed a method called Heavy-ball for updating the decision variable. The specific update is as follows:

$$x_i^{t+1} = x_i^t - \alpha \nabla f_i(x_i^t) + \gamma \left( x_i^t - x_i^{t-1} \right) \qquad (10)$$

where $\gamma$ is the momentum-acceleration coefficient, and the term $\gamma (x_i^t - x_i^{t-1})$ is used to accelerate the convergence of the decision variable. It has been proved that, under an appropriate step-size $\alpha$ and coefficient $\gamma$, the momentum-accelerated method can achieve a convergence rate of $O\left(\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^t\right)$, which is obviously faster.
Inspired by conjugate gradient methods [44], historical gradient information can improve the convergence rate of distributed first-order optimization algorithms. Nesterov proposed a method called CNGD [39] (centralized Nesterov gradient descent) as follows:

$$x_i^{t+1} = y_i^t - \alpha \nabla f_i(y_i^t) \qquad (11)$$
$$y_i^{t+1} = x_i^{t+1} + \gamma \left( x_i^{t+1} - x_i^t \right) \qquad (12)$$

where $\alpha = \frac{1}{\bar{\psi}}$ and $\gamma = \frac{\sqrt{\bar{\psi}} - \sqrt{\bar{\mu}}}{\sqrt{\bar{\psi}} + \sqrt{\bar{\mu}}}$. It has been proved that CNGD achieves the best convergence rate among all centralized first-order gradient methods. Under the previous assumptions, CNGD achieves a convergence rate of $O\left(\left(1 - \sqrt{\frac{\bar{\mu}}{\bar{\psi}}}\right)^t\right)$, which is faster than the rate $O\left(\left(1 - \frac{\bar{\mu}}{\bar{\psi}}\right)^t\right)$ of centralized gradient descent.
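The two momentum updates are only a few lines each; a minimal sketch of single iterations of (10) and (11)–(12), with illustrative names:

```python
import numpy as np

def heavy_ball_step(x, x_prev, grad_f, alpha, gamma):
    """One Heavy-ball iteration (10): gradient step plus momentum term."""
    return x - alpha * grad_f(x) + gamma * (x - x_prev)

def nesterov_step(x, y, grad_f, alpha, gamma):
    """One CNGD iteration (11)-(12): descend at the look-ahead point y."""
    x_next = y - alpha * grad_f(y)          # (11)
    y_next = x_next + gamma * (x_next - x)  # (12)
    return x_next, y_next
```

The structural difference is visible at a glance: Heavy-ball adds momentum after evaluating the gradient at $x^t$, while Nesterov evaluates the gradient at the extrapolated point $y^t$.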
It is notable that the two accelerated methods have been adopted in many distributed algorithms, such as [42,43]. In this study, we devote ourselves to applying the two accelerated methods to UG.

3.3. The Proposed Algorithm

The recent studies [35,42] are the most relevant to our work. Based on these works, and considering that the Nesterov and Heavy-ball accelerated methods are very helpful for achieving faster convergence, we added both of them to UG simultaneously. Meanwhile, in order to apply to many more scenarios, the step-sizes and coefficients are designed to be uncoordinated, time-varying, and nonidentical. Combining these ideas, we propose a new distributed optimization algorithm, named UGNH, as follows:
$$x_i^{t+1} = \sum_{j=1}^{n} p_{ij} y_j^t - \alpha_i^t \left( \nabla f_i(y_i^t) + z_i^t \right) + \gamma_i^t \left( x_i^t - x_i^{t-1} \right) \qquad (13)$$
$$y_i^{t+1} = x_i^{t+1} + \gamma_i^t \left( x_i^{t+1} - x_i^t \right) \qquad (14)$$
$$z_i^{t+1} = z_i^t - \sum_{j=1}^{n} l_{ij} \left( \nabla f_j(y_j^t) + z_j^t - \sum_{q=1}^{n} k_{jq} y_q^t \right) \qquad (15)$$

where $i, j \in V$, $t \in \mathbb{N}$, and the step-sizes $\alpha_i^t > 0$ and accelerated momentum coefficients $\gamma_i^t \geq 0$ are uncoordinated, time-varying, and nonidentical, being locally chosen at each agent. At the $t$-th iteration, each agent stores three variables: the primal decision variable $x_i^t \in \mathbb{R}^m$, the temporary variable $y_i^t \in \mathbb{R}^m$, and the dual variable $z_i^t \in \mathbb{R}^m$, which start with the initial states $x_i^0 \in \mathbb{R}^m$, $y_i^0 \in \mathbb{R}^m$, and $z_i^0 = 0_m$. The update of UGNH at each agent $i$ is formally described in Algorithm 1.
Algorithm 1 The update of the algorithm UGNH at each agent i
1: Initialization: each agent starts with $x_i^0 \in \mathbb{R}^m$, $y_i^0 \in \mathbb{R}^m$, and $z_i^0 = 0_m$.
2: for $t = 0, 1, 2, \ldots$ do
3:    Update the primal decision variable $x_i$ via (13): $x_i^{t+1} = \sum_{j=1}^{n} p_{ij} y_j^t - \alpha_i^t \left( \nabla f_i(y_i^t) + z_i^t \right) + \gamma_i^t \left( x_i^t - x_i^{t-1} \right)$
4:    Update the temporary variable $y_i$ via (14): $y_i^{t+1} = x_i^{t+1} + \gamma_i^t \left( x_i^{t+1} - x_i^t \right)$
5:    Accumulate the dual correction over the neighbors $j$ and $q$: $z_{\mathrm{temp}} = \sum_{j=1}^{n} l_{ij} \left( \nabla f_j(y_j^t) + z_j^t - \sum_{q=1}^{n} k_{jq} y_q^t \right)$
6:    Update the dual variable $z_i$ via (15): $z_i^{t+1} = z_i^t - z_{\mathrm{temp}}$
7: end for
It is clear that UGNH is a primal-dual method: $\gamma_i^t (x_i^t - x_i^{t-1})$ in (13) is the Heavy-ball acceleration term, (14) is the Nesterov acceleration step, and (15) is the dual-variable iteration. It can also be easily verified that UGNH is equivalent to UG if $\alpha_i^t = \alpha$ and $\gamma_i^t = 0$; furthermore, it equals EXTRA and HSADO if the matrix $K$ is chosen properly.
Remark 2.
For the sake of compactness and brevity, let the dimension $m = 1$. The multi-dimensional case can be proved similarly.
As a result, we define $x^t = (x_1^t, x_2^t, \ldots, x_n^t)^T \in \mathbb{R}^n$, $y^t = (y_1^t, y_2^t, \ldots, y_n^t)^T \in \mathbb{R}^n$, $z^t = (z_1^t, z_2^t, \ldots, z_n^t)^T \in \mathbb{R}^n$, and $\nabla F(y^t) = (\nabla f_1(y_1^t), \nabla f_2(y_2^t), \ldots, \nabla f_n(y_n^t))^T \in \mathbb{R}^n$; other notations used later are defined as before. Then, UGNH can be compactly reformulated in matrix form as follows:

$$x^{t+1} = P y^t - \Gamma_{\alpha}^t \left( \nabla F(y^t) + z^t \right) + \Gamma_{\gamma}^t \left( x^t - x^{t-1} \right) \qquad (16)$$
$$y^{t+1} = x^{t+1} + \Gamma_{\gamma}^t \left( x^{t+1} - x^t \right) \qquad (17)$$
$$z^{t+1} = z^t - L \left( \nabla F(y^t) + z^t - K y^t \right) \qquad (18)$$

where $\alpha^t = (\alpha_1^t, \alpha_2^t, \ldots, \alpha_n^t)^T \in \mathbb{R}^n$ and $\gamma^t = (\gamma_1^t, \gamma_2^t, \ldots, \gamma_n^t)^T \in \mathbb{R}^n$ represent the step-sizes and coefficients, respectively. Furthermore, we define $\Gamma_{\alpha}^t = \mathrm{diag}(\alpha^t) \in \mathbb{R}^{n \times n}$ and $\Gamma_{\gamma}^t = \mathrm{diag}(\gamma^t) \in \mathbb{R}^{n \times n}$.
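A vectorized sketch of the matrix form (16)–(18) for $m = 1$; `step_rule` supplies the uncoordinated, time-varying $\alpha^t$ and $\gamma^t$, and all names are illustrative rather than part of the original presentation:

```python
import numpy as np

def ugnh(P, L, K, grad_F, x0, y0, step_rule, iters=1000):
    """UGNH in matrix form (16)-(18); x, y, z are length-n vectors (m = 1).

    grad_F:    callable, grad_F(y) -> stacked local gradients of F at y
    step_rule: callable, step_rule(t) -> (alpha_t, gamma_t), two length-n
               arrays of per-agent step-sizes and momentum coefficients
    """
    x_prev, x, y = x0.copy(), x0.copy(), y0.copy()
    z = np.zeros_like(x0)  # z^0 = 0
    for t in range(iters):
        alpha_t, gamma_t = step_rule(t)
        g = grad_F(y)
        x_next = P @ y - alpha_t * (g + z) + gamma_t * (x - x_prev)  # (16)
        z = z - L @ (g + z - K @ y)                                  # (18), uses y^t
        y = x_next + gamma_t * (x_next - x)                          # (17)
        x_prev, x = x, x_next
    return x

# Example step rule: agent-dependent, mildly time-varying values (illustrative).
def step_rule_example(t, n=10, alpha_max=1e-3, gamma_max=0.1):
    rng = np.random.default_rng(t)
    return alpha_max * (0.5 + 0.5 * rng.random(n)), gamma_max * rng.random(n)
```

Note that (18) is evaluated before (17) so that the dual update uses $y^t$ rather than $y^{t+1}$, matching the iteration indices above.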

4. Convergence Analysis

This section analyzes in detail the linear convergence of decision variable sequences generated by UGNH when step-sizes and coefficients are chosen properly. First, we define some notations that may frequently be used later.
$$\bar{x}^t = \frac{1}{n} 1_n^T x^t, \quad \bar{z}^t = \frac{1}{n} 1_n^T z^t, \quad J_n = \frac{1}{n} 1_n 1_n^T, \quad \tilde{\psi} = \max_{i \in V} \psi_i, \quad a = \|P - J_n\|, \quad b = \|I_n - J_n\|, \quad c = \|P - I_n\|, \quad d = \|L\| \tilde{\psi} + \|L K\|$$

Moreover, considering that the step-sizes and coefficients are uncoordinated, time-varying, and nonidentical, there are many possible numerical values, which may be difficult to handle. By employing a small trick, we only study the supremum and infimum of the step-sizes and coefficients. The specific definitions are as follows:

$$\alpha_{\max} = \sup_{t \geq 0} \max_{i \in V} \alpha_i^t, \quad \alpha_{\min} = \inf_{t \geq 0} \min_{i \in V} \alpha_i^t, \quad \tilde{\gamma} = \sup_{t \geq 0} \max_{i \in V} \gamma_i^t$$

In addition, let $\xi_\alpha = \alpha_{\max} - \alpha_{\min}$ be the difference between $\alpha_{\max}$ and $\alpha_{\min}$, and let $\Phi = \frac{\alpha_{\max}}{\alpha_{\min}}$ be the condition number of the step-sizes.
Before giving the main results, we introduce some helpful supporting lemmas for the convergence analysis.

4.1. Supporting Lemmas

Lemma 1
([26]). Under Assumption 3, the global objective function $f$ is $\bar{\psi}$-smooth and $\bar{\mu}$-strongly convex. For any $x \in \mathbb{R}$ and $0 < \alpha < \frac{2}{\bar{\psi}}$, we have:

$$\|x - \alpha \nabla f(x) - \tilde{x}^*\| \leq \zeta \|x - \tilde{x}^*\|$$

where $\zeta = \max\left( |1 - \bar{\psi}\alpha|, |1 - \bar{\mu}\alpha| \right)$.
Lemma 2
([19]). Assume that $\mathrm{null}(I_n - P) = \mathrm{span}(1_n)$ and that the matrix $P$ satisfies Assumption 2. Then, $x^*$ is the optimal solution when $x^*$ satisfies the following conditions:
  • $x^* = P x^*$ (consensus)
  • $1_n^T \nabla F(x^*) = 0$ (optimality)
Lemma 3
([32]). Assume that a matrix $P \in \mathbb{R}^{n \times n}$ and a vector $\varepsilon \in \mathbb{R}^n$ are non-negative and positive, respectively. If $P \varepsilon < \varrho \varepsilon$ with $\varrho > 0$, then $\rho(P) < \varrho$.

4.2. Main Results

In this section, the linear-convergence analysis of the proposed algorithm is carried out in detail. Similar to relevant studies, we mainly focus on the following four mathematical expressions at the $(t+1)$-th iteration: $\|x^{t+1} - 1_n \bar{x}^{t+1}\|$, $\|1_n \bar{x}^{t+1} - x^*\|$, $\|x^{t+1} - x^t\|$, and $\|z^{t+1} - z^*\|$. For convenience, let $\Xi_1^{t+1}$, $\Xi_2^{t+1}$, $\Xi_3^{t+1}$, and $\Xi_4^{t+1}$ represent the four expressions, respectively. Among them, $\Xi_1^{t+1}$ is described as the consensus violation, $\Xi_2^{t+1}$ as the optimality residual, $\Xi_3^{t+1}$ as the state difference, and $\Xi_4^{t+1}$ as the dual error.
Next, we bound the four norm expressions at the $(t+1)$-th iteration by linear combinations of their estimates at the $t$-th iteration. Subsequently, based on Assumptions 1–3, we establish a system of linear inequalities for the convergence analysis. (With a slight abuse of notation, in the proofs below the symbols $\Xi_k^t$ also denote the corresponding vectors before taking norms, e.g., $\Xi_3^t = x^t - x^{t-1}$.) In what follows, the consensus violation $\Xi_1^{t+1}$ is bounded first.
Lemma 4.
For all $t > 0$, the following inequality holds:

$$\Xi_1^{t+1} \leq \left( a + b \alpha_{\max} \tilde{\psi} \right) \Xi_1^t + b \alpha_{\max} \tilde{\psi}\, \Xi_2^t + \left( b \alpha_{\max} \tilde{\psi} + a + b \right) \tilde{\gamma}\, \Xi_3^t + b \alpha_{\max}\, \Xi_4^t \qquad (19)$$
Proof of Lemma 4.
Considering (16) and (17), and substituting $y^t = x^t + \Gamma_{\gamma}^{t-1} \Xi_3^t$, we have:

$$x^{t+1} = P x^t - \Gamma_{\alpha}^t \left( \nabla F(y^t) + z^t \right) + P \Gamma_{\gamma}^{t-1} \Xi_3^t + \Gamma_{\gamma}^t \Xi_3^t \qquad (20)$$

Note that $(I_n - J_n) x^{t+1} = \Xi_1^{t+1}$, $(I_n - J_n) P = P - J_n$, and $(P - J_n) 1_n = 0_n$. Multiplying both sides of (20) by $I_n - J_n$ yields:

$$\Xi_1^{t+1} = (P - J_n) \Xi_1^t + (I_n - J_n) \Gamma_{\gamma}^t \Xi_3^t - (I_n - J_n) \Gamma_{\alpha}^t \left( \nabla F(y^t) - \nabla F(x^*) \right) - (I_n - J_n) \Gamma_{\alpha}^t \left( z^t + \nabla F(x^*) \right) + (P - J_n) \Gamma_{\gamma}^{t-1} \Xi_3^t \qquad (21)$$

Based on the fact that $z^t - z^* = z^t + \nabla F(x^*)$ [35] and Assumption 3, taking norms on both sides of (21) gives:

$$\Xi_1^{t+1} \leq \|P - J_n\|\, \Xi_1^t + \|I_n - J_n\| \alpha_{\max} \tilde{\psi} \left( \Xi_1^t + \Xi_2^t \right) + \|I_n - J_n\| \alpha_{\max} \tilde{\psi} \tilde{\gamma}\, \Xi_3^t + \|I_n - J_n\| \alpha_{\max}\, \Xi_4^t + \|P - J_n\| \tilde{\gamma}\, \Xi_3^t + \|I_n - J_n\| \tilde{\gamma}\, \Xi_3^t \qquad (22)$$

Recalling the definitions of $a$ and $b$, we obtain:

$$\Xi_1^{t+1} \leq a\, \Xi_1^t + b \alpha_{\max} \tilde{\psi}\, \Xi_1^t + b \alpha_{\max} \tilde{\psi}\, \Xi_2^t + b \alpha_{\max} \tilde{\psi} \tilde{\gamma}\, \Xi_3^t + b \alpha_{\max}\, \Xi_4^t + (a + b) \tilde{\gamma}\, \Xi_3^t \qquad (23)$$
Rearranging the terms in (23), the result in Lemma 4 is obtained. □
Lemma 5.
For all $t > 0$, the following inequality holds:

$$\Xi_2^{t+1} \leq \left( \alpha_{\max} + \xi_\alpha \right) \tilde{\psi}\, \Xi_1^t + \left( \zeta + \xi_\alpha \tilde{\psi} \right) \Xi_2^t + \left( \alpha_{\max} \tilde{\psi} + \xi_\alpha \tilde{\psi} + 2 \right) \tilde{\gamma}\, \Xi_3^t + \xi_\alpha\, \Xi_4^t \qquad (24)$$
Proof of Lemma 5.
Multiplying both sides of (16) by $J_n$ and substituting $y^t = x^t + \Gamma_{\gamma}^{t-1} \Xi_3^t$, we have:

$$J_n x^{t+1} = J_n x^t - J_n \Gamma_{\alpha}^t \left( \nabla F(y^t) + z^t \right) + J_n \Gamma_{\gamma}^{t-1} \Xi_3^t + J_n \Gamma_{\gamma}^t \Xi_3^t \qquad (25)$$

To obtain the related terms, recalling the fact that $\bar{z}^{t+1} = \bar{z}^t = \cdots = \bar{z}^0 = 0$ (i.e., $J_n z^t = 0$) from [35], we add and subtract some useful terms in (25) as follows:

$$J_n x^{t+1} = J_n x^t - \alpha_{\max} J_n \nabla F(J_n x^t) + \alpha_{\max} J_n \left( \nabla F(J_n x^t) - \nabla F(y^t) \right) + J_n \Gamma_{\gamma}^{t-1} \Xi_3^t + J_n \Gamma_{\gamma}^t \Xi_3^t + J_n \left( \alpha_{\max} I_n - \Gamma_{\alpha}^t \right) \left( \nabla F(y^t) - \nabla F(x^*) \right) + J_n \left( \alpha_{\max} I_n - \Gamma_{\alpha}^t \right) \left( z^t + \nabla F(x^*) \right) \qquad (26)$$

By applying $J_n \nabla F(J_n x^t) = 1_n \nabla f(\bar{x}^t)$ and subtracting $x^*$ on both sides of (26), we then obtain:

$$\Xi_2^{t+1} = 1_n \left( \bar{x}^t - \tilde{x}^* - \alpha_{\max} \nabla f(\bar{x}^t) \right) + \alpha_{\max} J_n \left( \nabla F(J_n x^t) - \nabla F(y^t) \right) + J_n \Gamma_{\gamma}^{t-1} \Xi_3^t + J_n \Gamma_{\gamma}^t \Xi_3^t + J_n \left( \alpha_{\max} I_n - \Gamma_{\alpha}^t \right) \left( \nabla F(y^t) - \nabla F(x^*) \right) + J_n \left( \alpha_{\max} I_n - \Gamma_{\alpha}^t \right) \left( z^t + \nabla F(x^*) \right) \qquad (27)$$

Taking norms on both sides of (27) and using Lemma 1, we have:

$$\Xi_2^{t+1} \leq \zeta\, \Xi_2^t + \alpha_{\max} \|J_n\| \tilde{\psi} \left( \Xi_1^t + \tilde{\gamma}\, \Xi_3^t \right) + \|J_n\| \left( \alpha_{\max} - \alpha_{\min} \right) \tilde{\psi} \|y^t - x^*\| + \|J_n\| \left( \alpha_{\max} - \alpha_{\min} \right) \Xi_4^t + 2 \|J_n\| \tilde{\gamma}\, \Xi_3^t \leq \zeta\, \Xi_2^t + \alpha_{\max} \tilde{\psi}\, \Xi_1^t + \alpha_{\max} \tilde{\psi} \tilde{\gamma}\, \Xi_3^t + \xi_\alpha \tilde{\psi}\, \Xi_1^t + \xi_\alpha \tilde{\psi}\, \Xi_2^t + \xi_\alpha \tilde{\psi} \tilde{\gamma}\, \Xi_3^t + \xi_\alpha\, \Xi_4^t + 2 \tilde{\gamma}\, \Xi_3^t \qquad (28)$$
Rearranging the terms in (28), the desired results can be obtained. □
Lemma 6.
For all $t > 0$, the following inequality holds:

$$\Xi_3^{t+1} \leq \left( c + \alpha_{\max} \tilde{\psi} \right) \Xi_1^t + \alpha_{\max} \tilde{\psi}\, \Xi_2^t + \left( \alpha_{\max} \tilde{\psi} + 2 \right) \tilde{\gamma}\, \Xi_3^t + \alpha_{\max}\, \Xi_4^t \qquad (29)$$
Proof of Lemma 6.
Substituting $y^t = x^t + \Gamma_{\gamma}^{t-1} \Xi_3^t$ in (16) and then subtracting $x^t$ on both sides, we get:

$$\Xi_3^{t+1} = P \left( x^t + \Gamma_{\gamma}^{t-1} \Xi_3^t \right) - x^t - \Gamma_{\alpha}^t \left( \nabla F(y^t) + z^t \right) + \Gamma_{\gamma}^t \Xi_3^t = (P - I_n) \Xi_1^t - \Gamma_{\alpha}^t \left( \nabla F(y^t) - \nabla F(x^*) \right) - \Gamma_{\alpha}^t \Xi_4^t + \left( P \Gamma_{\gamma}^{t-1} + \Gamma_{\gamma}^t \right) \Xi_3^t \qquad (30)$$

The second equality is based on $(P - I_n) 1_n = 0_n$. Recalling the definition of $c$ and taking norms on both sides of (30), we have:

$$\Xi_3^{t+1} \leq c\, \Xi_1^t + \alpha_{\max} \tilde{\psi} \|y^t - x^*\| + \alpha_{\max}\, \Xi_4^t + 2 \tilde{\gamma}\, \Xi_3^t \leq c\, \Xi_1^t + \alpha_{\max} \tilde{\psi}\, \Xi_1^t + \alpha_{\max} \tilde{\psi}\, \Xi_2^t + \alpha_{\max} \tilde{\psi} \tilde{\gamma}\, \Xi_3^t + \alpha_{\max}\, \Xi_4^t + 2 \tilde{\gamma}\, \Xi_3^t \qquad (31)$$
Rearranging the terms in (31), the result in Lemma 6 is obtained. □
Lemma 7.
Let Assumptions 2 and 3 and Lemma 2 hold. For all $t > 0$, the following inequality holds:

$$\Xi_4^{t+1} \leq d\, \Xi_1^t + d\, \Xi_2^t + d \tilde{\gamma}\, \Xi_3^t + a\, \Xi_4^t \qquad (32)$$
Proof of Lemma 7.
Noting that $(P - J_n) 1_n = 0_n$ and adding $\nabla F(x^*)$ on both sides of (18), we have:

$$z^{t+1} + \nabla F(x^*) = z^t + \nabla F(x^*) - L \left( \nabla F(y^t) + z^t - K y^t \right) = P \left( z^t + \nabla F(x^*) \right) + L K y^t - L \left( \nabla F(y^t) - \nabla F(x^*) \right) = (P - J_n)(z^t - z^*) + L K (y^t - x^*) - L \left( \nabla F(y^t) - \nabla F(x^*) \right) \qquad (33)$$

The third equality of (33) follows from the facts in [35] and Lemma 2 that $\bar{z}^{t+1} = \bar{z}^t = \cdots = \bar{z}^0 = 0$, $J_n \nabla F(x^*) = 0_n$, and $L K 1_n = 0_n$.
Recalling the definition of $d$ and taking norms on both sides of (33), we have:

$$\Xi_4^{t+1} \leq \|P - J_n\|\, \Xi_4^t + \left( \|L\| \tilde{\psi} + \|L K\| \right) \|y^t - x^*\| = a\, \Xi_4^t + d \|y^t - x^*\| \qquad (34)$$

Substituting $y^t = x^t + \Gamma_{\gamma}^{t-1} \Xi_3^t$ in (34) and rearranging the terms yields the desired result. □
With Lemmas 4–7 above, we establish the main convergence result as follows.
Theorem 1.
Suppose that Assumptions 1–3 hold. Considering the sequences $\{x^t\}$, $\{y^t\}$, and $\{z^t\}$ generated by the proposed algorithm UGNH and combining Lemmas 4–7 into a linear-inequalities system, we have:

$$\begin{pmatrix} \Xi_1^{t+1} \\ \Xi_2^{t+1} \\ \Xi_3^{t+1} \\ \Xi_4^{t+1} \end{pmatrix} \leq H \begin{pmatrix} \Xi_1^t \\ \Xi_2^t \\ \Xi_3^t \\ \Xi_4^t \end{pmatrix} \qquad (35)$$

where the matrix $H \in \mathbb{R}^{4 \times 4}$ is given as below:

$$H = \begin{pmatrix} a + b \alpha_{\max} \tilde{\psi} & b \alpha_{\max} \tilde{\psi} & \left( b \alpha_{\max} \tilde{\psi} + a + b \right) \tilde{\gamma} & b \alpha_{\max} \\ \left( \alpha_{\max} + \xi_\alpha \right) \tilde{\psi} & \zeta + \xi_\alpha \tilde{\psi} & \left( \alpha_{\max} \tilde{\psi} + \xi_\alpha \tilde{\psi} + 2 \right) \tilde{\gamma} & \xi_\alpha \\ c + \alpha_{\max} \tilde{\psi} & \alpha_{\max} \tilde{\psi} & \left( \alpha_{\max} \tilde{\psi} + 2 \right) \tilde{\gamma} & \alpha_{\max} \\ d & d & d \tilde{\gamma} & a \end{pmatrix}$$

Suppose further that the largest step-size satisfies:

$$\alpha_{\max} < \min \left\{ \frac{(1 - a) \varepsilon_1}{b \left( \tilde{\psi} \varepsilon_1 + \tilde{\psi} \varepsilon_2 + \varepsilon_4 \right)},\ \frac{\varepsilon_3 - c \varepsilon_1}{\tilde{\psi} \varepsilon_1 + \tilde{\psi} \varepsilon_2 + \varepsilon_4},\ \frac{1}{\bar{\psi}} \right\} \qquad (36)$$

the maximum momentum coefficient satisfies:

$$\tilde{\gamma} < \min \left\{ \frac{(1 - a) \varepsilon_1 - b \alpha_{\max} \left( \tilde{\psi} \varepsilon_1 + \tilde{\psi} \varepsilon_2 + \varepsilon_4 \right)}{\left( b \alpha_{\max} \tilde{\psi} + a + b \right) \varepsilon_3},\ \frac{\bar{\mu} \alpha_{\max} \varepsilon_2 - \left( \alpha_{\max} + \xi_\alpha \right) \tilde{\psi} \varepsilon_1 - \xi_\alpha \tilde{\psi} \varepsilon_2 - \xi_\alpha \varepsilon_4}{\left( \alpha_{\max} \tilde{\psi} + \xi_\alpha \tilde{\psi} + 2 \right) \varepsilon_3},\ \frac{\varepsilon_3 - c \varepsilon_1 - \alpha_{\max} \left( \tilde{\psi} \varepsilon_1 + \tilde{\psi} \varepsilon_2 + \varepsilon_4 \right)}{\left( \alpha_{\max} \tilde{\psi} + 2 \right) \varepsilon_3},\ \frac{(1 - a) \varepsilon_4 - d \varepsilon_1 - d \varepsilon_2}{d \varepsilon_3} \right\} \qquad (37)$$

and the condition number satisfies:

$$1 \leq \Phi < \frac{\varepsilon_4 + \tilde{\psi} \varepsilon_2 + \tilde{\psi} \varepsilon_1}{\varepsilon_4 + \tilde{\psi} \varepsilon_2 + 2 \tilde{\psi} \varepsilon_1 - \bar{\mu} \varepsilon_2} \qquad (38)$$

where $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_3$, and $\varepsilon_4$ are arbitrary constants that obey the following picking rules:

$$\varepsilon_2 > 0, \quad \varepsilon_1 < \frac{\bar{\mu} \varepsilon_2}{\tilde{\psi}}, \quad \varepsilon_3 > c \varepsilon_1, \quad \varepsilon_4 > \frac{d \varepsilon_1 + d \varepsilon_2}{1 - a} \qquad (39)$$

Then, the spectral radius of the matrix $H$ is strictly less than 1, i.e., $\rho(H) < 1$, which is the desired result.
Proof of Theorem 1.
According to Lemmas 4–7, we immediately get the inequalities (35). Then, we provide some necessary conditions on the parameters $\alpha_{\max}$, $\tilde{\gamma}$, and $\Phi$ such that $\rho(H) < 1$. Based on Lemma 3, let $\varepsilon = (\varepsilon_1, \varepsilon_2, \varepsilon_3, \varepsilon_4)^T \in \mathbb{R}^4$ be a positive vector; if $H \varepsilon < \varepsilon$, then $\rho(H) < 1$. According to the definition of $H$ above, the inequality $H \varepsilon < \varepsilon$ is equivalent to the following four inequalities:

$$\left( b \alpha_{\max} \tilde{\psi} + a + b \right) \tilde{\gamma} \varepsilon_3 < \varepsilon_1 - a \varepsilon_1 - b \alpha_{\max} \tilde{\psi} \varepsilon_1 - b \alpha_{\max} \tilde{\psi} \varepsilon_2 - b \alpha_{\max} \varepsilon_4 \qquad (40)$$

$$\left( \alpha_{\max} \tilde{\psi} + \xi_\alpha \tilde{\psi} + 2 \right) \tilde{\gamma} \varepsilon_3 < \varepsilon_2 - \alpha_{\max} \tilde{\psi} \varepsilon_1 - \xi_\alpha \tilde{\psi} \varepsilon_1 - \zeta \varepsilon_2 - \xi_\alpha \tilde{\psi} \varepsilon_2 - \xi_\alpha \varepsilon_4 \qquad (41)$$

$$\left( \alpha_{\max} \tilde{\psi} + 2 \right) \tilde{\gamma} \varepsilon_3 < \varepsilon_3 - c \varepsilon_1 - \alpha_{\max} \tilde{\psi} \varepsilon_1 - \alpha_{\max} \tilde{\psi} \varepsilon_2 - \alpha_{\max} \varepsilon_4 \qquad (42)$$

$$d \tilde{\gamma} \varepsilon_3 < \varepsilon_4 - d \varepsilon_1 - d \varepsilon_2 - a \varepsilon_4 \qquad (43)$$

According to Lemma 1, if $0 < \alpha_{\max} < \frac{1}{\bar{\psi}}$, then $\zeta = 1 - \bar{\mu} \alpha_{\max}$, and (41) is equivalent to the following inequality:

$$\left( \alpha_{\max} \tilde{\psi} + \xi_\alpha \tilde{\psi} + 2 \right) \tilde{\gamma} \varepsilon_3 < \bar{\mu} \alpha_{\max} \varepsilon_2 - \alpha_{\max} \tilde{\psi} \varepsilon_1 - \xi_\alpha \tilde{\psi} \varepsilon_1 - \xi_\alpha \tilde{\psi} \varepsilon_2 - \xi_\alpha \varepsilon_4 \qquad (44)$$

To make sure that the parameter $\tilde{\gamma}$ can be chosen positive, the right-hand sides of (40) and (42)–(44) must be positive. Immediately, we get the following conditions:

$$\alpha_{\max} < \frac{(1 - a) \varepsilon_1}{b \left( \tilde{\psi} \varepsilon_1 + \tilde{\psi} \varepsilon_2 + \varepsilon_4 \right)} \qquad (45)$$

$$\xi_\alpha < \frac{\bar{\mu} \alpha_{\max} \varepsilon_2 - \alpha_{\max} \tilde{\psi} \varepsilon_1}{\varepsilon_4 + \tilde{\psi} \varepsilon_2 + \tilde{\psi} \varepsilon_1}, \quad \varepsilon_1 < \frac{\bar{\mu} \varepsilon_2}{\tilde{\psi}} \qquad (46)$$

$$\alpha_{\max} < \frac{\varepsilon_3 - c \varepsilon_1}{\tilde{\psi} \varepsilon_1 + \tilde{\psi} \varepsilon_2 + \varepsilon_4}, \quad \varepsilon_3 > c \varepsilon_1 \qquad (47)$$

$$\varepsilon_4 > \frac{d \varepsilon_1 + d \varepsilon_2}{1 - a} \qquad (48)$$

Recalling that $\xi_\alpha = \alpha_{\max} - \alpha_{\min}$ and that $\Phi = \frac{\alpha_{\max}}{\alpha_{\min}}$ is the condition number, (46) further implies that:

$$1 \leq \Phi < \frac{\varepsilon_4 + \tilde{\psi} \varepsilon_2 + \tilde{\psi} \varepsilon_1}{\varepsilon_4 + \tilde{\psi} \varepsilon_2 + 2 \tilde{\psi} \varepsilon_1 - \bar{\mu} \varepsilon_2} \qquad (49)$$

Now, we select a proper vector $\varepsilon = (\varepsilon_1, \varepsilon_2, \varepsilon_3, \varepsilon_4)^T$ such that the parameters $\alpha_{\max}$, $\tilde{\gamma}$, and $\Phi$ are available. Based on (46)–(48), an arbitrary positive constant $\varepsilon_2$ is chosen first; we then choose $\varepsilon_1$ from (46) and finally choose $\varepsilon_3$ and $\varepsilon_4$ from (47) and (48), respectively. Hence, according to (45) and (47) and the requirement $0 < \alpha_{\max} < \frac{1}{\bar{\psi}}$ in (44), the upper bound on the largest step-size $\alpha_{\max}$ shown in (36) is obtained. Furthermore, according to (46), the upper bound on the condition number $\Phi$ shown in (38) is obtained. Besides, the upper bound on the maximum coefficient $\tilde{\gamma}$ follows from (40) and (42)–(44). Above all, the proof is finished. □
Remark 3.
According to Theorem 1, a linear convergence rate of the proposed algorithm is obtained if the parameters $\alpha_{\max}$, $\tilde{\gamma}$, and $\Phi$ satisfy the conditions (36)–(38), respectively. It is noteworthy that these parameters only depend on the topology of the network and the objective functions. Although some global parameters such as $\bar{\mu}$, $\bar{\psi}$, and $\tilde{\psi}$ are needed when designing the step-sizes and coefficients, these parameters can be easily pre-calculated without much effort.
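Once candidate values are picked, the contraction claimed in Theorem 1 can also be checked numerically by assembling $H$ and computing its spectral radius; a minimal sketch (function name and argument order are illustrative):

```python
import numpy as np

def spectral_radius_H(a, b, c, d, alpha_max, xi_alpha, gamma, psi, zeta):
    """Assemble the 4x4 matrix H from Theorem 1 and return rho(H).

    a, b, c, d are the norm constants from Section 4; psi is the maximum
    Lipschitz constant psi_tilde; zeta = max(|1 - psi_bar*alpha|,
    |1 - mu_bar*alpha|) from Lemma 1. Linear convergence holds if the
    returned value is < 1.
    """
    H = np.array([
        [a + b * alpha_max * psi, b * alpha_max * psi,
         (b * alpha_max * psi + a + b) * gamma, b * alpha_max],
        [(alpha_max + xi_alpha) * psi, zeta + xi_alpha * psi,
         (alpha_max * psi + xi_alpha * psi + 2) * gamma, xi_alpha],
        [c + alpha_max * psi, alpha_max * psi,
         (alpha_max * psi + 2) * gamma, alpha_max],
        [d, d, d * gamma, a],
    ])
    return max(abs(np.linalg.eigvals(H)))
```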
Remark 4.
Uncoordinated and nonidentical step-sizes are two important characteristics often designed into related methods, and step-sizes and coefficients might also change with time in some practical scenarios. In our algorithm, the step-sizes and coefficients are designed to be uncoordinated, time-varying, and nonidentical. Furthermore, the largest step-size and coefficient are chosen according to their bounds shown in Theorem 1, which only depend on the communication network and the objective functions. Notably, there is also a bound on the condition number $\Phi$, so once the largest step-size is chosen, the smallest step-size needs to be chosen accordingly.

5. Numerical Experiments

In this section, some necessary numerical experiments on a real dataset are provided to illustrate the efficiency and superiority of our algorithm. In the experiments, we considered a binary-classification logistic-regression problem on the Wisconsin breast cancer dataset provided in the UCI Machine Learning Repository [45]. The problem can be described in the following form:

$$\min_{x \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left( \frac{1}{n_i} \sum_{j=1}^{n_i} \ln \left( 1 + \exp \left( -y_{ij} a_{ij}^T x \right) \right) + \frac{\tau}{2} \|x\|^2 \right)$$
with each local objective function f i written as follows:
$$f_i(x) = \frac{1}{n_i} \sum_{j=1}^{n_i} \ln \left( 1 + \exp \left( -y_{ij} a_{ij}^T x \right) \right) + \frac{\tau}{2} \|x\|^2$$

where $n$ is the number of agents in the network and $d$ is the dimension of the decision variable. Each agent $i$ is assumed to have an equal number of data samples $n_i$, i.e., $n_i = \frac{N}{n}$ ($N$ is the total number of data samples). $a_{ij} \in \mathbb{R}^d$ represents the feature vector of the $j$th data sample at the $i$th agent, while $y_{ij} \in \{-1, 1\}$ denotes the corresponding label. The regularization term $\frac{\tau}{2} \|x\|^2$ with parameter $\tau = 1$ was set to avoid over-fitting.
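The local gradient that each agent feeds into UGNH follows directly from $f_i$; a minimal sketch (variable names are illustrative):

```python
import numpy as np

def local_logistic_grad(A_i, y_i, x, tau=1.0):
    """Gradient of the regularized logistic loss f_i over agent i's samples.

    A_i: (n_i, d) matrix whose rows are the feature vectors a_ij
    y_i: (n_i,) labels in {-1, +1}
    """
    margins = -y_i * (A_i @ x)              # u_ij = -y_ij * a_ij^T x
    sigma = 1.0 / (1.0 + np.exp(-margins))  # d/du ln(1 + e^u) = sigma(u)
    return -(A_i.T @ (y_i * sigma)) / len(y_i) + tau * x
```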
In the experiments, we set $N = 200$ training samples, and $d = 9$ is the number of features in the real dataset. Meanwhile, we simulated a random undirected network generated by the Erdős–Rényi model with $n = 10$ nodes and edge probability $p = 0.7$. Then, we compared the proposed algorithm UGNH to the relevant algorithms EXTRA, HSADO, and UG.
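A sketch of how such a network and its weight matrix can be generated (using the `networkx` package and the `metropolis_weights` helper sketched in Section 2.2; the seed and the redraw-until-connected loop are our illustrative choices, not stated in the paper):

```python
import networkx as nx
import numpy as np

# Erdos-Renyi graph with n = 10 nodes and edge probability p = 0.7,
# redrawn until connected so that Assumption 1 holds.
seed = 0
while True:
    G = nx.erdos_renyi_graph(n=10, p=0.7, seed=seed)
    if nx.is_connected(G):
        break
    seed += 1
adj = nx.to_numpy_array(G)    # 0/1 adjacency matrix of the network
P = metropolis_weights(adj)   # doubly stochastic weights (Section 2.2 sketch)
```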
Figure 1, Figure 2, Figure 3, Figure 4 and Figure 5 show the results of our experiments, and the main conclusions are as follows:
  • Figure 1 indicates that the proposed algorithm UGNH improves the convergence rate compared to the related algorithms on the real dataset; thus, UGNH is effective and superior. From Figure 2, the sequences generated by UGNH, EXTRA, UG, and HSADO converge to the optimal solutions as expected. To avoid cluttering the figure, only one dimension of each decision variable is exhibited.
  • Figure 3 shows that UGNH, with both the Nesterov momentum and the Heavy-ball momentum, improves the convergence rate compared to the algorithm with only one momentum term or none.
  • From Figure 4, we can conclude that although the step-sizes are usually chosen very small, a larger step-size leads to a faster convergence rate as long as it is chosen under the upper bound. For the coefficient, a similar result can be observed in Figure 5. Comparing the two figures, small changes in the step-size are more influential than small changes in the coefficient.

6. Conclusions

In this study, a novel distributed optimization accelerated algorithm with uncoordinated, time-varying, and nonidentical step-sizes was proposed. It is mainly applied to handle the distributed convex optimization problem in an undirected network, where all agents collaboratively minimize the average of all local objective functions. When the largest step-size and the maximum coefficient do not exceed the estimated upper bounds provided in Theorem 1, the convergence rate of UGNH is linear under the condition that each local objective function is smooth and strongly convex. Besides, these parameters only depend on the topology of the network and the local objective functions.
It is worth noting that, to achieve a faster linear convergence rate, the Heavy-ball and Nesterov accelerated methods were simultaneously added into the algorithm, which provides a new way to accelerate the convergence of other distributed optimization algorithms. Furthermore, the experimental results verified its effective and superior performance on a real dataset. However, UGNH is not suitable for all scenarios, and there are some more in-depth areas worth studying, such as time-varying network architectures, random link failures, asynchronous communication between agents, directed networks, and so on. These problems are worthy of further study and are our future research directions.

Author Contributions

Data curation, H.Z.; writing—original draft, Y.L.; supervision, X.G.; writing—review and editing, H.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China (41271292), in part by the Key Project of Chongqing Science and Technology Bureau (cstc2019jscx-gksbX0103), in part by the Fundamental Research Funds for the Central Universities under Project (SWU2009107), in part by the Chongqing Natural Science Foundation (cstc2020jcyj-msxmX0324), in part by the Key Project of Natural Science Research of Education Department in Anhui Province of China (KJ2019A0864), and in part by the Construction of Chengdu-Chongqing Economic Circle Science and Technology Innovation Project (KJCX2020007).

Institutional Review Board Statement

Not applicable; this study did not involve humans or animals.

Informed Consent Statement

Not applicable; this study did not involve humans or animals.

Data Availability Statement

The dataset can be fetched on the website http://archive.ics.uci.edu/ml, (accessed on 11 December 2021).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, H.; Liao, X.; Wang, Z.; Huang, T.; Chen, G. Distributed parameter estimation in unreliable sensor networks via broadcast gossip algorithms. Neural Netw. 2016, 73, 1–9.
  2. Dougherty, S.; Guay, M. An extremum-seeking controller for distributed optimization over sensor networks. IEEE Trans. Autom. Control 2016, 62, 928–933.
  3. Rahmani, A.M.; Ali, S.; Yousefpoor, M.S.; Yousefpoor, E.; Naqvi, R.A.; Siddique, K.; Hosseinzadeh, M. An area coverage scheme based on fuzzy logic and shuffled frog-leaping algorithm (SFLA) in heterogeneous wireless sensor networks. Mathematics 2021, 9, 2251.
  4. Ren, W. Consensus based formation control strategies for multi-vehicle systems. In Proceedings of the 2006 American Control Conference, Philadelphia, PA, USA, 14–16 June 2006; p. 6.
  5. Yan, B.; Shi, P.; Lim, C.C.; Wu, C.; Shi, Z. Optimally distributed formation control with obstacle avoidance for mixed-order multi-agent systems under switching topologies. IET Control Theory Appl. 2018, 12, 1853–1863.
  6. Cevher, V.; Becker, S.; Schmidt, M. Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Processing Mag. 2014, 31, 32–43.
  7. Zhang, Z.; Wang, W.; Pan, G. A Distributed Quantum-Behaved Particle Swarm Optimization Using Opposition-Based Learning on Spark for Large-Scale Optimization Problem. Mathematics 2020, 8, 1860.
  8. Li, K.; Liu, Q.; Yang, S.; Cao, J.; Lu, G. Cooperative optimization of dual multiagent system for optimal resource allocation. IEEE Trans. Syst. Man Cybern. Syst. 2018, 50, 4676–4687.
  9. Jia, W.; Qin, S. Distributed Optimization Over Directed Graphs with Continuous-Time Algorithm. In Proceedings of the 2019 Chinese Control Conference (CCC), Guangzhou, China, 27–30 July 2019; pp. 1911–1916.
  10. Ahmed, E.M.; Rathinam, R.; Dayalan, S.; Fernandez, G.S.; Ali, Z.M.; Aleem, S.H.; Omar, A.I. A Comprehensive Analysis of Demand Response Pricing Strategies in a Smart Grid Environment Using Particle Swarm Optimization and the Strawberry Optimization Algorithm. Mathematics 2021, 9, 2338.
  11. Zhang, Q.; Gong, Z.; Yang, Z.; Chen, Z. Distributed convex optimization for flocking of nonlinear multi-agent systems. Int. J. Control Autom. Syst. 2019, 17, 1177–1183.
  12. Tang, X.; Li, M.; Wei, S.; Ding, B. Event-triggered Synchronous Distributed Model Predictive Control for Multi-agent Systems. Int. J. Control Autom. Syst. 2021, 19, 1273–1282.
  13. Nedic, A.; Ozdaglar, A. Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 2009, 54, 48–61.
  14. DeGroot, M.H. Reaching a consensus. J. Am. Stat. Assoc. 1974, 69, 118–121.
  15. Ram, S.S.; Nedić, A.; Veeravalli, V.V. Distributed stochastic subgradient projection algorithms for convex optimization. J. Optim. Theory Appl. 2010, 147, 516–545.
  16. Nedic, A.; Ozdaglar, A.; Parrilo, P.A. Constrained consensus and optimization in multi-agent networks. IEEE Trans. Autom. Control 2010, 55, 922–938.
  17. Duchi, J.C.; Agarwal, A.; Wainwright, M.J. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Trans. Autom. Control 2011, 57, 592–606.
  18. Jakovetić, D.; Xavier, J.; Moura, J.M. Fast distributed gradient methods. IEEE Trans. Autom. Control 2014, 59, 1131–1146.
  19. Shi, W.; Ling, Q.; Wu, G.; Yin, W. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. 2015, 25, 944–966.
  20. Shi, W.; Ling, Q.; Wu, G.; Yin, W. A proximal gradient algorithm for decentralized composite optimization. IEEE Trans. Signal Processing 2015, 63, 6013–6023.
  21. Xi, C.; Khan, U.A. DEXTRA: A fast algorithm for optimization over directed graphs. IEEE Trans. Autom. Control 2017, 62, 4980–4993.
  22. Zeng, J.; Yin, W. ExtraPush for convex smooth decentralized optimization over directed networks. arXiv 2015, arXiv:1511.02942.
  23. Yuan, K.; Ying, B.; Zhao, X.; Sayed, A.H. Exact diffusion for distributed optimization and learning—Part I: Algorithm development. IEEE Trans. Signal Processing 2018, 67, 708–723.
  24. Yuan, K.; Ying, B.; Zhao, X.; Sayed, A.H. Exact diffusion for distributed optimization and learning—Part II: Convergence analysis. IEEE Trans. Signal Processing 2018, 67, 724–739.
  25. Jakovetić, D.; Moura, J.M.; Xavier, J. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Trans. Autom. Control 2014, 60, 922–936.
  26. Qu, G.; Li, N. Harnessing smoothness to accelerate distributed optimization. IEEE Trans. Control Netw. Syst. 2017, 5, 1245–1260.
  27. Nedic, A.; Olshevsky, A.; Shi, W. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 2017, 27, 2597–2633.
  28. Jakovetic, D.; Krejic, N.; Malaspina, G. Linear Convergence Rate Analysis of a Class of Exact First-Order Distributed Methods for Time-Varying Directed Networks and Uncoordinated Step Sizes. arXiv 2020, arXiv:2007.08837.
  29. Nedić, A.; Olshevsky, A.; Shi, W.; Uribe, C.A. Geometrically convergent distributed optimization with uncoordinated step-sizes. In Proceedings of the 2017 American Control Conference (ACC), Seattle, WA, USA, 24–26 May 2017; pp. 3950–3955.
  30. Lu, Q.; Li, H.; Xia, D. Geometrical convergence rate for distributed optimization with time-varying directed graphs and uncoordinated step-sizes. Inf. Sci. 2018, 422, 516–530.
  31. Qu, G.; Li, N. Accelerated distributed Nesterov gradient descent. IEEE Trans. Autom. Control 2019, 65, 2566–2581.
  32. Xin, R.; Khan, U.A. Distributed heavy-ball: A generalization and acceleration of first-order methods with gradient tracking. IEEE Trans. Autom. Control 2019, 65, 2627–2633.
  33. Mokhtari, A.; Ribeiro, A. DSA: Decentralized double stochastic averaging gradient algorithm. J. Mach. Learn. Res. 2016, 17, 2165–2199.
  34. Nedić, A.; Ozdaglar, A. Subgradient methods for saddle-point problems. J. Optim. Theory Appl. 2009, 142, 205–228.
  35. Jakovetić, D. A unification and generalization of exact distributed first-order methods. IEEE Trans. Signal Inf. Processing Over Netw. 2018, 5, 31–46.
  36. Xu, J.; Zhu, S.; Soh, Y.C.; Xie, L. Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes. In Proceedings of the 2015 54th IEEE Conference on Decision and Control (CDC), Osaka, Japan, 15–18 December 2015; pp. 2055–2060.
  37. Li, H.; Zheng, Z.; Lü, Q.; Wang, Z.; Gao, L.; Wu, G.C.; Ji, L.; Wang, H. Primal-Dual Fixed Point Algorithms Based on Adapted Metric for Distributed Optimization. IEEE Trans. Neural Netw. Learn. Syst. 2021, 2021, 1–15.
  38. Liu, P.; Li, H.; Dai, X.; Han, Q. Distributed primal-dual optimisation method with uncoordinated time-varying step-sizes. Int. J. Syst. Sci. 2018, 49, 1256–1272.
  39. Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2003; Volume 87.
  40. Polyak, B.T. Introduction to Optimization; Optimization Software, Publications Division: New York, NY, USA, 1987.
  41. Xin, R.; Khan, U.A. A linear algorithm for optimization over directed graphs with geometric convergence. IEEE Control Syst. Lett. 2018, 2, 315–320.
  42. Cheng, H.; Li, H.; Wang, Z. On the convergence of exact distributed generalisation and acceleration algorithm for convex optimisation. Int. J. Syst. Sci. 2020, 51, 1–17.
  43. Lü, Q.; Liao, X.; Li, H.; Huang, T. A Nesterov-like gradient tracking algorithm for distributed optimization over directed networks. IEEE Trans. Syst. Man Cybern. Syst. 2020, 51, 6258–6270.
  44. Hestenes, M.R.; Stiefel, E. Methods of Conjugate Gradients for Solving Linear Systems; NBS: Washington, DC, USA, 1952; Volume 49.
  45. Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: http://archive.ics.uci.edu/ml (accessed on 11 December 2021).
Figure 1. Performance comparisons between the proposed algorithm and related algorithms.
Figure 2. One dimension of variable between the proposed algorithm and related algorithms.
Figure 3. Performance comparisons between the proposed algorithm and the method without momentum terms.
Figure 4. Performance comparisons between different step-sizes.
Figure 5. Performance comparisons between different momentum coefficients.