Article

Fractional Gradient Methods via ψ-Hilfer Derivative

by
Nelson Vieira
1,*,
M. Manuela Rodrigues
1 and
Milton Ferreira
2,3
1
CIDMA — Center for Research and Development in Mathematics and Applications, Department of Mathematics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
2
School of Technology and Management, Polytechnic of Leiria, 2411-901 Leiria, Portugal
3
CIDMA—Center for Research and Development in Mathematics and Applications, University of Aveiro, 3810-193 Aveiro, Portugal
*
Author to whom correspondence should be addressed.
Fractal Fract. 2023, 7(3), 275; https://doi.org/10.3390/fractalfract7030275
Submission received: 30 January 2023 / Revised: 9 March 2023 / Accepted: 14 March 2023 / Published: 21 March 2023
(This article belongs to the Special Issue New Trends on Generalized Fractional Calculus)

Abstract

Motivated by the increase in practical applications of fractional calculus, we study the classical gradient method under the perspective of the ψ-Hilfer derivative. This allows us to cover, within a single framework, several definitions of fractional derivatives found in the literature. The convergence of the ψ-Hilfer continuous fractional gradient method was studied both for strongly and non-strongly convex cases. Using a series representation of the target function, we developed an algorithm for the ψ-Hilfer fractional-order gradient method. The numerical method obtained by truncating the higher-order terms was tested and analyzed using benchmark functions. With variable-order differentiation and step-size optimization, the ψ-Hilfer fractional gradient method showed better results in terms of speed and accuracy. Our results generalize previous works in the literature.

1. Introduction

The gradient descent method is a classical convex optimization method. It is widely used in many areas of computer science, such as image processing [1,2], machine learning [3,4,5], and control systems [6]. Its use on a large scale is essentially due to its intuitive structure, ease of implementation, and accuracy. In recent years, there has been an increase in interest in the application of fractional calculus techniques to develop and implement fractional gradient methods (FGM). The first works dealing with such methods are [2,7], which address problems in the fields of signal processing and adaptive learning. The design of fractional least mean squares algorithms is another example of the application of FGM [8,9,10]. Recently, some applications of FGM have focused on artificial intelligence subjects such as machine learning, deep learning, and neural networks (see [11,12,13] and references therein).
Replacing the integer-order derivative with a fractional derivative in the gradient method can improve its convergence, because long-term information can be included. However, there are some convergence issues in the numerical implementation of the FGM, because the real extreme value of the target function is not always the same as the fractional extreme value.
In [14], the authors propose a new FGM to overcome this problem, considering an iterative update of the lower limit of integration in the fractional derivative to shorten the memory characteristic presented in the fractional derivative and truncating the higher order terms of the series expansion associated with the target function. Afterward, Wei et al. [15] designed another method involving variable fractional order to solve the convergence problem.
In the field of fractional calculus, several definitions of fractional integrals and derivatives, differing in their kernels, can be found. This diversity allows certain problems to be tackled with specific fractional operators. To establish a general operator, a fractional integral with respect to a function ψ was proposed in [16,17], where the kernel depends on a function ψ with specific properties. To incorporate as many fractional derivative definitions as possible into a single formulation, the concept of a fractional derivative of a function with respect to a function ψ was introduced. In 2017, Almeida [18] proposed the ψ-Caputo fractional derivative and studied its main properties. A similar approach can be used to define the ψ-Riemann–Liouville fractional derivative. In 2018, Sousa and Oliveira [19] unified both definitions using Hilfer's concept and introduced the ψ-Hilfer fractional derivative. This approach offers the flexibility of choosing the differentiation type, as Hilfer's definition interpolates smoothly between fractional derivatives of Caputo and Riemann–Liouville types. Additionally, by choosing the function ψ, we obtain well-known fractional derivatives, such as Caputo, Riemann–Liouville, Hadamard, Katugampola, Chen, Jumarie, Prabhakar, Erdélyi–Kober, and Weyl, among others (see Section 5 in [19]).
The aim of this work is to propose an FGM based on the ψ-Hilfer fractional derivative. Using this type of general derivative allows us to deal with several fractional derivatives in the literature at the same time. It also allows us to study cases where the target function is a composition of functions. In the first part of the paper, we show some auxiliary results concerning the chain rule and solutions of some fractional differential equations, which are used to study the convergence of the continuous ψ-fractional gradient method for strongly and non-strongly convex target functions. In the second part, we introduce and implement numerical algorithms for the ψ-Hilfer FGM in the one- and two-dimensional cases, generalizing the ideas presented in [14,15]. The proposed algorithms were tested using benchmark functions. The numerical results show better performance than the classical gradient method in terms of accuracy and number of iterations.
In summary, this paper is organized as follows: in Section 2, we recall some basic concepts about the ψ -Hilfer derivative and the two-parameter Mittag–Leffler function. We present some auxiliary results in Section 3, which are then used to analyze the continuous gradient method for strongly and non-strongly convex target functions in Section 4. In the last section of the paper, we design and implement numerical algorithms for the ψ -Hilfer FGM by replacing the lower limit of the fractional integral with the last iterate and by using the variable order of differentiation with the optimization of the step size in each iteration. The convergence, accuracy, and speed of the algorithms are analyzed using different examples.

2. General Fractional Derivatives and Special Functions

In this section, we recall some concepts related to fractional integrals and derivatives of a function with respect to another function ψ (for more details, see [16,18,19]).
Definition 1.
(cf. [19], Def. 4) Let (a, b) be a finite or infinite interval of the real line R and let α > 0. In addition, let ψ be an increasing and positive monotone function on (a, b]. The left Riemann–Liouville fractional integral of a function f with respect to another function ψ on [a, b] is defined by
$$ I_{a^+}^{\alpha;\psi} f(t) = \frac{1}{\Gamma(\alpha)} \int_a^t \psi'(w)\,\bigl(\psi(t)-\psi(w)\bigr)^{\alpha-1} f(w)\,dw, \quad t > a. $$ (1)
Now, we introduce the definition of the so-called ψ -Hilfer fractional derivative of a function f with respect to another function.
Definition 2.
(cf. [19], Def. 7) Let α > 0, m = [α] + 1, and let I = [a, b] be a finite or infinite interval of the real line, and let f, ψ ∈ C^m([a, b]) be two functions such that ψ is a positive monotone increasing function with ψ'(t) ≠ 0 for all t ∈ I. The ψ-Hilfer left fractional derivative ${}^{H}D_{a^+}^{\alpha,\mu;\psi}$ of order α and type μ ∈ [0, 1] is defined by
$$ {}^{H}D_{a^+}^{\alpha,\mu;\psi} f(t) = I_{a^+}^{\mu(m-\alpha);\psi}\Bigl(\frac{1}{\psi'(t)}\frac{d}{dt}\Bigr)^{m} I_{a^+}^{(1-\mu)(m-\alpha);\psi} f(t). $$ (2)
We observe that when μ = 0, we recover the left ψ-Riemann–Liouville fractional derivative (see Definition 5 in [19]), and when μ = 1, we obtain the left ψ-Caputo fractional derivative (see Definition 6 in [19]). The following list shows some fractional derivatives that are encompassed by Definition 2 for specific choices of the function ψ and of the parameter μ:
  • Riemann–Liouville: ψ(t) = t, I = R⁺, and μ = 0;
  • Caputo: ψ(t) = t, I = R⁺, and μ = 1;
  • Katugampola: ψ(t) = t^ρ, with ρ ∈ R⁺, I = R⁺, and μ = 0;
  • Caputo–Katugampola: ψ(t) = t^ρ, with ρ ∈ R⁺, I = R⁺, and μ = 1;
  • Hadamard: ψ(t) = ln t, I = ]1, +∞[, and μ = 0;
  • Caputo–Hadamard: ψ(t) = ln t, I = ]1, +∞[, and μ = 1.
For a more complete list, please refer to Section 5 in [19]. By considering partial fractional integrals and derivatives, the previous definitions can be extended to higher dimensions (see Chapter 5 in [16]). Furthermore, the ψ-Hilfer fractional derivative of an n-dimensional vector function f(t) = (f₁(t), …, f_n(t)) is defined component-wise as
$$ {}^{H}D_{a^+}^{\alpha,\mu;\psi} f(t) = \bigl({}^{H}D_{a^+}^{\alpha,\mu;\psi} f_1(t), \ldots, {}^{H}D_{a^+}^{\alpha,\mu;\psi} f_n(t)\bigr). $$ (3)
Next, we present some technical results related to previously introduced operators.
Theorem 1.
(cf. [19], Thm. 5) If f ∈ C^m([a, b]), α > 0, m = [α] + 1, and μ ∈ [0, 1], then
$$ I_{a^+}^{\alpha;\psi}\,{}^{H}D_{a^+}^{\alpha,\mu;\psi} f(t) = f(t) - \sum_{k=1}^{m}\frac{(\psi(t)-\psi(a))^{\gamma-k}}{\Gamma(\gamma-k+1)}\,f_{\psi}^{[m-k]}\,I_{a^+}^{(1-\mu)(m-\alpha);\psi} f(a), $$ (4)
where γ = α + μ(m − α) and $f_{\psi}^{[m]} f(t) = \Bigl(\frac{1}{\psi'(t)}\frac{d}{dt}\Bigr)^{m} f(t)$.
Lemma 1.
(cf. [19], Lem. 5) Given δ ∈ R with δ > m, consider the function f(t) = (ψ(t) − ψ(a))^{δ−1}. Then, for m = [α] + 1 and μ ∈ [0, 1], we have
$$ {}^{H}D_{a^+}^{\alpha,\mu;\psi} f(t) = \frac{\Gamma(\delta)}{\Gamma(\delta-\alpha)}\,\bigl(\psi(t)-\psi(a)\bigr)^{\delta-\alpha-1}. $$ (5)
Some results of the paper are given in terms of the two-parameter Mittag–Leffler function, which is defined by the following power series (see [20]):
$$ E_{\beta_1,\beta_2}(z) = \sum_{n=0}^{+\infty}\frac{z^{n}}{\Gamma(\beta_1 n + \beta_2)}, \quad \mathrm{Re}(\beta_1) > 0,\ \beta_2 \in \mathbb{C},\ z \in \mathbb{C}. $$
For z = −x, with x ∈ R⁺, 0 < β₁ < 2, and β₂ ∈ C, the two-parameter Mittag–Leffler function has the following asymptotic expansion (see Equation (4.7.36) in [20]):
$$ E_{\beta_1,\beta_2}(-x) = -\sum_{k=1}^{p}\frac{(-x)^{-k}}{\Gamma(\beta_2-\beta_1 k)} + O\bigl(x^{-1-p}\bigr), \quad p \in \mathbb{N},\ x \to +\infty. $$ (6)
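For moderate arguments, the defining power series above can be evaluated directly. The following short Python sketch (our own illustration, not part of the original paper) approximates the two-parameter Mittag–Leffler function by truncating that series; the function name and the number of retained terms are arbitrary choices, and the asymptotic expansion (6) should be preferred for large negative arguments.

```python
from math import gamma

def mittag_leffler(z, beta1, beta2, terms=100):
    """Two-parameter Mittag-Leffler function E_{beta1,beta2}(z),
    approximated by truncating its defining power series.
    Suitable for moderate |z| only."""
    return sum(z**n / gamma(beta1 * n + beta2) for n in range(terms))

# Sanity check: E_{1,1}(z) = exp(z)
print(mittag_leffler(1.0, 1.0, 1.0))  # approximately 2.71828
```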

3. Auxiliary Results

In this section, we present some auxiliary results needed for our work. These extend some results presented in [21] to the ψ-Hilfer derivative of arbitrary type μ ∈ [0, 1].
We start by presenting a representation formula for the solution of a Cauchy problem involving the ψ-Hilfer derivative. Let us consider h : R₀⁺ × Rⁿ → Rⁿ and f : R → Rⁿ. We obtain the following result.
Proposition 1.
Let α ∈ [0, 1] and μ ∈ [0, 1]. A continuous function f is a solution of the problem
$$ {}^{H}D_{a^+}^{\alpha,\mu;\psi} f(t) = h(t, f(t)), \quad t \geq a, \qquad I_{a^+}^{(1-\alpha)(1-\mu);\psi} f(a^+) = f_a, $$ (7)
if and only if f is given by
$$ f(t) = \frac{f_a}{\Gamma(\alpha+\mu(1-\alpha))}\,\bigl(\psi(t)-\psi(a)\bigr)^{\alpha+\mu(1-\alpha)-1} + \frac{1}{\Gamma(\alpha)}\int_a^t \bigl(\psi(t)-\psi(s)\bigr)^{\alpha-1}\psi'(s)\,h(s, f(s))\,ds, \quad t \geq a. $$ (8)
Proof. 
Applying $I_{a^+}^{\alpha;\psi}$ to both sides of the fractional differential equation in (7), and taking (4) with m = 1, we have
$$ I_{a^+}^{\alpha;\psi}\,{}^{H}D_{a^+}^{\alpha,\mu;\psi} f(t) = f(t) - \frac{f_a}{\Gamma(\alpha+\mu(1-\alpha))}\,\bigl(\psi(t)-\psi(a)\bigr)^{\alpha+\mu(1-\alpha)-1},$$
which is equivalent to
$$ f(t) = \frac{f_a}{\Gamma(\alpha+\mu(1-\alpha))}\,\bigl(\psi(t)-\psi(a)\bigr)^{\alpha+\mu(1-\alpha)-1} + I_{a^+}^{\alpha;\psi}\,{}^{H}D_{a^+}^{\alpha,\mu;\psi} f(t).$$
From (1) and (7), we obtain the result.    □
Now, we present some results concerning the fractional derivative of a composite function. Let us consider g : Rⁿ → R, f : R → Rⁿ, ∇ the classical gradient operator, and the ψ-Hilfer fractional derivative of an n-dimensional vector function given by (3).
Theorem 2.
Let α ∈ [0, 1], μ ∈ [0, 1], a ≥ 0, and let the function ψ be in the conditions of Definition 2. For t > a, let us define the function ζ_t by setting
$$ \zeta_t(s) = g(f(s)) - g(f(t)) - \bigl\langle \nabla g(f(t)),\, f(s) - f(t)\bigr\rangle, $$ (9)
where
$$ \nabla g(f(t)) = \Bigl(\frac{\partial g}{\partial x_1}(f(t)), \ldots, \frac{\partial g}{\partial x_n}(f(t))\Bigr).$$
The following identity holds:
$$ \Gamma(1-\alpha)\Bigl[{}^{H}D_{a^+}^{\alpha,\mu;\psi}(g\circ f)(t) - \bigl\langle \nabla g(f(t)),\,{}^{H}D_{a^+}^{\alpha,\mu;\psi} f(t)\bigr\rangle\Bigr] = \bigl(\psi(t)-\psi(a)\bigr)^{-\alpha}\bigl[g(f(t)) - \langle \nabla g(f(t)), f(t)\rangle\bigr] - \alpha\int_a^t \bigl(\psi(t)-\psi(s)\bigr)^{-\alpha-1}\psi'(s)\,\zeta_t(s)\,ds. $$ (10)
Proof. 
By the Newton–Leibniz formula, for each component of the function f, one has
$$ f_i(t) = f_i(a) + \int_a^t f_i'(s)\,ds = f_i(a) + I_{a^+}^{1;\psi}\Bigl[\frac{f_i'}{\psi'}\Bigr](t), \quad i = 1,\ldots,n.$$
From (3), for i = 1, …, n, we have
$$ {}^{H}D_{a^+}^{\alpha,\mu;\psi} f_i(t) = {}^{H}D_{a^+}^{\alpha,\mu;\psi}[f_i(a)](t) + {}^{H}D_{a^+}^{\alpha,\mu;\psi}\Bigl(I_{a^+}^{1;\psi}\Bigl[\frac{f_i'}{\psi'}\Bigr]\Bigr)(t) = f_i(a)\,I_{a^+}^{\mu(1-\alpha);\psi}\Bigl(\frac{1}{\psi'(t)}\frac{d}{dt}\Bigr)I_{a^+}^{(1-\mu)(1-\alpha);\psi}[1](t) + I_{a^+}^{\mu(1-\alpha);\psi}\Bigl(\frac{1}{\psi'(t)}\frac{d}{dt}\Bigr)I_{a^+}^{(1-\mu)(1-\alpha)+1;\psi}\Bigl[\frac{f_i'}{\psi'}\Bigr](t). $$ (11)
Taking (1) into account, the first term in (11) is
$$ f_i(a)\,I_{a^+}^{\mu(1-\alpha);\psi}\Bigl(\frac{1}{\psi'(t)}\frac{d}{dt}\Bigr)\frac{\bigl(\psi(t)-\psi(a)\bigr)^{(1-\mu)(1-\alpha)}}{\Gamma((1-\mu)(1-\alpha)+1)} = \frac{f_i(a)}{\Gamma(\mu(1-\alpha))}\int_a^t \bigl(\psi(t)-\psi(s)\bigr)^{\mu(1-\alpha)-1}\,\frac{\bigl(\psi(s)-\psi(a)\bigr)^{(1-\mu)(1-\alpha)-1}}{\Gamma((1-\mu)(1-\alpha))}\,\psi'(s)\,ds = \frac{f_i(a)}{\Gamma(1-\alpha)}\bigl(\psi(t)-\psi(a)\bigr)^{-\alpha}. $$ (12)
For the second term in (11), taking (1) and the Leibniz rule for differentiation under the integral sign into account, we obtain
$$ I_{a^+}^{\mu(1-\alpha);\psi}\Bigl(\frac{1}{\psi'(t)}\frac{d}{dt}\Bigr)I_{a^+}^{(1-\mu)(1-\alpha)+1;\psi}\Bigl[\frac{f_i'}{\psi'}\Bigr](t) = I_{a^+}^{\mu(1-\alpha);\psi}\,\frac{1}{\Gamma((1-\mu)(1-\alpha))}\int_a^t \bigl(\psi(t)-\psi(s)\bigr)^{(1-\mu)(1-\alpha)-1} f_i'(s)\,ds = I_{a^+}^{\mu(1-\alpha);\psi}\,I_{a^+}^{(1-\mu)(1-\alpha);\psi}\Bigl[\frac{f_i'}{\psi'}\Bigr](t) = \frac{1}{\Gamma(1-\alpha)}\int_a^t \bigl(\psi(t)-\psi(s)\bigr)^{-\alpha} f_i'(s)\,ds. $$ (13)
From (12) and (13), expression (11) simplifies to
$$ {}^{H}D_{a^+}^{\alpha,\mu;\psi} f_i(t) = \frac{1}{\Gamma(1-\alpha)}\Bigl[f_i(a)\,\bigl(\psi(t)-\psi(a)\bigr)^{-\alpha} + \int_a^t \bigl(\psi(t)-\psi(s)\bigr)^{-\alpha} f_i'(s)\,ds\Bigr], \quad i=1,\ldots,n,$$
and therefore,
$$ {}^{H}D_{a^+}^{\alpha,\mu;\psi} f(t) = \frac{1}{\Gamma(1-\alpha)}\Bigl[f(a)\,\bigl(\psi(t)-\psi(a)\bigr)^{-\alpha} + \int_a^t \bigl(\psi(t)-\psi(s)\bigr)^{-\alpha} f'(s)\,ds\Bigr].$$
Hence, applying the same representation to the scalar function (g∘f)(t) and using d/ds g(f(s)) = ⟨∇g(f(s)), f'(s)⟩, we can write
$$ \Gamma(1-\alpha)\Bigl[{}^{H}D_{a^+}^{\alpha,\mu;\psi}(g\circ f)(t) - \langle \nabla g(f(t)), {}^{H}D_{a^+}^{\alpha,\mu;\psi} f(t)\rangle\Bigr] = \bigl(\psi(t)-\psi(a)\bigr)^{-\alpha}\bigl[g(f(a)) - \langle \nabla g(f(t)), f(a)\rangle\bigr] + \int_a^t \bigl(\psi(t)-\psi(s)\bigr)^{-\alpha}\langle \nabla g(f(s)) - \nabla g(f(t)), f'(s)\rangle\,ds = \bigl(\psi(t)-\psi(a)\bigr)^{-\alpha}\bigl[g(f(a)) - \langle \nabla g(f(t)), f(a)\rangle\bigr] + \int_a^t \bigl(\psi(t)-\psi(s)\bigr)^{-\alpha}\,d\zeta_t(s),$$
which implies, by integrating by parts, that
$$ \Gamma(1-\alpha)\Bigl[{}^{H}D_{a^+}^{\alpha,\mu;\psi}(g\circ f)(t) - \langle \nabla g(f(t)), {}^{H}D_{a^+}^{\alpha,\mu;\psi} f(t)\rangle\Bigr] = \bigl(\psi(t)-\psi(a)\bigr)^{-\alpha}\bigl[g(f(a)) - \langle \nabla g(f(t)), f(a)\rangle\bigr] + \lim_{s\to t^-}\bigl(\psi(t)-\psi(s)\bigr)^{-\alpha}\zeta_t(s) - \bigl(\psi(t)-\psi(a)\bigr)^{-\alpha}\zeta_t(a) - \alpha\int_a^t \bigl(\psi(t)-\psi(s)\bigr)^{-\alpha-1}\psi'(s)\,\zeta_t(s)\,ds. $$ (16)
As α ∈ [0, 1], we have by L'Hôpital's rule that
$$ \lim_{s\to t^-}\bigl(\psi(t)-\psi(s)\bigr)^{-\alpha}\zeta_t(s) = \lim_{s\to t^-}\frac{\zeta_t(s)}{\bigl(\psi(t)-\psi(s)\bigr)^{\alpha}} = \lim_{s\to t^-}\frac{\zeta_t'(s)\,\bigl(\psi(t)-\psi(s)\bigr)^{1-\alpha}}{\alpha\,\psi'(s)} = 0. $$ (17)
Finally, from (17) and (9), we obtain the following from (16):
$$ \Gamma(1-\alpha)\Bigl[{}^{H}D_{a^+}^{\alpha,\mu;\psi}(g\circ f)(t) - \langle \nabla g(f(t)), {}^{H}D_{a^+}^{\alpha,\mu;\psi} f(t)\rangle\Bigr] = \bigl(\psi(t)-\psi(a)\bigr)^{-\alpha}\bigl[\zeta_t(a) + g(f(t)) - \langle \nabla g(f(t)), f(t)\rangle - \zeta_t(a)\bigr] - \alpha\int_a^t \bigl(\psi(t)-\psi(s)\bigr)^{-\alpha-1}\psi'(s)\,\zeta_t(s)\,ds,$$
which gives the desired result.    □
Corollary 1.
Let α ∈ [0, 1], μ ∈ [0, 1], a ≥ 0, and let the function ψ satisfy the conditions of Definition 2. If g : Rⁿ → R is of class C¹ and convex, i.e.,
$$ g(x) \geq g(y) + \langle \nabla g(y), x - y\rangle, \quad \text{for all } x, y \in \mathbb{R}^n, $$ (18)
then
$$ {}^{H}D_{a^+}^{\alpha,\mu;\psi}(g\circ f)(t) \leq \bigl\langle \nabla g(f(t)),\,{}^{H}D_{a^+}^{\alpha,\mu;\psi} f(t)\bigr\rangle + \frac{g(0)}{\Gamma(1-\alpha)}\bigl(\psi(t)-\psi(a)\bigr)^{-\alpha}. $$ (19)
Proof. 
From (18), it follows that, for x = 0,
$$ g(f(t)) - \langle \nabla g(f(t)), f(t)\rangle \leq g(0) - \langle \nabla g(f(t)), 0\rangle = g(0).$$
On the other hand, based on Theorem 2 and since ζ_t(s) ≥ 0 by the convexity of g, we have
$$ g(f(t)) - \langle \nabla g(f(t)), f(t)\rangle \geq \Gamma(1-\alpha)\,\bigl(\psi(t)-\psi(a)\bigr)^{\alpha}\Bigl[{}^{H}D_{a^+}^{\alpha,\mu;\psi}(g\circ f)(t) - \langle \nabla g(f(t)), {}^{H}D_{a^+}^{\alpha,\mu;\psi} f(t)\rangle\Bigr].$$
Combining the two previous expressions, we obtain the result.    □
Combining the two previous expressions, we obtain our results.    □
If we consider μ = 0, which corresponds to the ψ-Riemann–Liouville case, the previous results reduce to Proposition 3.3 and Corollary 3.4 in [21], respectively. Moreover, the corresponding results for the ψ-Caputo case (μ = 1) are presented in Proposition 3.1 of [21].
Now, we present an auxiliary result involving the two-parameter Mittag–Leffler function.
Proposition 2.
Let α ∈ [0, 1], μ ∈ [0, 1], a ≥ 0, and λ > 0. Moreover, let ψ be in the conditions of Definition 2, such that sup{ψ(t) : t ≥ a} = +∞. Then, the following limit holds:
$$ \lim_{t\to+\infty}\int_a^t \bigl(\psi(t)-\psi(s)\bigr)^{\alpha-1}\psi'(s)\,E_{\alpha,\alpha+\mu(1-\alpha)}\bigl(-\lambda(\psi(s)-\psi(a))^{\alpha}\bigr)\,ds = 0.$$
Proof. 
Taking into account Theorem 5.1 in [22] for the case of a homogeneous equation, the solution of the initial value problem
$$ {}^{H}D_{a^+}^{\alpha,\mu;\psi} u(t) = -\lambda\,u(t), \quad \lambda \in \mathbb{R}^+,\ \alpha \in [0,1],\ \mu \in [0,1],\ t > a, \qquad I_{a^+}^{(1-\alpha)(1-\mu);\psi} u(a^+) = u_a,$$
is given by
$$ u(t) = u_a\,\bigl(\psi(t)-\psi(a)\bigr)^{\alpha+\mu(1-\alpha)-1}E_{\alpha,\alpha+\mu(1-\alpha)}\bigl(-\lambda(\psi(t)-\psi(a))^{\alpha}\bigr).$$
Hence, by Proposition 1, we have
$$ E_{\alpha,\alpha+\mu(1-\alpha)}\bigl(-\lambda(\psi(t)-\psi(a))^{\alpha}\bigr) = \frac{u_a}{\Gamma(\alpha+\mu(1-\alpha))}\bigl(\psi(t)-\psi(a)\bigr)^{\alpha+\mu(1-\alpha)-1} - \frac{\lambda}{\Gamma(\alpha)}\int_a^t \bigl(\psi(t)-\psi(s)\bigr)^{\alpha-1}\psi'(s)\,E_{\alpha,\alpha+\mu(1-\alpha)}\bigl(-\lambda(\psi(s)-\psi(a))^{\alpha}\bigr)\,ds.$$
Taking the limit as t → +∞ on both sides and considering the asymptotic expansion (6), the left-hand side tends to zero and the first term on the right-hand side also tends to zero. Hence, we obtain
$$ 0 = 0 - \lim_{t\to+\infty}\frac{\lambda}{\Gamma(\alpha)}\int_a^t \bigl(\psi(t)-\psi(s)\bigr)^{\alpha-1}\psi'(s)\,E_{\alpha,\alpha+\mu(1-\alpha)}\bigl(-\lambda(\psi(s)-\psi(a))^{\alpha}\bigr)\,ds,$$
which leads to our results.    □
The case when μ = 1 , i.e., the ψ -Caputo case, was already studied in [21] and corresponds to Lemma 3.7.

4. Continuous Gradient Method via the ψ-Hilfer Derivative

Assume that we aim to determine the minimum of a function f : Rⁿ → R. To achieve this, the gradient descent method is used, starting with an initial prediction x₀ of the local minimum and producing a sequence x₀, x₁, x₂, … based on the following recurrence relation:
$$ x_{k+1} = x_k - \theta_k\,\nabla f(x_k), $$ (20)
where the step size θ_k > 0 is either constant or varies at each iteration k. The sequence (x_k)_{k≥0} generated by the gradient descent method is monotonic, i.e., f(x₀) > f(x₁) > f(x₂) > ⋯, and is expected to converge to a local minimum of f. Typically, the stopping criterion has the form ‖∇f(x)‖ ≤ ε, with ε > 0. By expressing (20) as
$$ \frac{x_{k+1}-x_k}{\theta_k} = -\nabla f(x_k), $$ (21)
we can interpret (21) as the discretization of the initial value problem
$$ y'(t) = -\nabla f(y(t)), \qquad y(0) = y_0 \in \mathbb{R}^n, $$ (22)
using the explicit Euler scheme with step size θ_k. The system (22) is known as the continuous gradient method (see [21]). Assuming that f is both strongly convex and smooth, the solutions of (20) and (22) converge to the unique stationary point at an exponential rate. In general, if a convergence result is shown for a continuous method, then various finite difference schemes can be constructed for the solution of the associated Cauchy problem.
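To make the link between (20) and (22) concrete, the following minimal Python sketch (our own illustration, not part of the original paper) runs the classical gradient descent iteration, i.e., the explicit Euler discretization of the gradient flow; the quadratic example reuses the function f₁ that appears later in Section 5.2.3, and the step size, iteration budget, and tolerance are arbitrary choices.

```python
import numpy as np

def gradient_descent(grad_f, y0, theta=0.1, n_iter=1000, tol=1e-9):
    """Classical gradient descent (20): explicit Euler discretization
    of the continuous gradient flow y'(t) = -grad f(y(t)) in (22)."""
    y = np.asarray(y0, dtype=float)
    for _ in range(n_iter):
        g = grad_f(y)
        if np.linalg.norm(g) <= tol:      # stopping criterion ||grad f(x)|| <= eps
            break
        y = y - theta * g                 # one Euler step
    return y

# Example with the convex quadratic f(x, y) = 4x^2 - 4xy + 2y^2
grad_f1 = lambda v: np.array([8*v[0] - 4*v[1], -4*v[0] + 4*v[1]])
print(gradient_descent(grad_f1, [1.5, 2.5]))  # approaches the minimizer (0, 0)
```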
Let us now consider the following ψ-fractional version of (22):
$$ {}^{H}D_{a^+}^{\alpha,\mu;\psi} z(t) = -\theta\,\nabla f(z(t)), \quad t \geq a, $$ (23)
where z : R → Rⁿ, f : Rⁿ → R, α ∈ [0, 1], μ ∈ [0, 1], and $I_{a^+}^{(1-\alpha)(1-\mu);\psi} z(a^+) = z_a \in \mathbb{R}^n$, the last expression being evaluated in the limit t → a⁺. For y* ∈ S_f = {z ∈ Rⁿ : ∇f(z) = 0}, let us define the following sum-of-squares error function:
$$ \varphi(t) = \tfrac{1}{2}\,\|z(t)-y^*\|^2, \quad t \geq a. $$ (24)

4.1. The Convex Case

Here, we investigate (23) under the assumption that f is convex but not necessarily strongly convex.
Theorem 3.
Let α ∈ [0, 1] and μ ∈ [0, 1]. Suppose that the function f : Rⁿ → R is of class C¹ and convex, i.e., f satisfies (18). For the ψ-fractional differential Equation (23), with constant step size θ, the solution z(·) converges to y* with the upper bound
$$ \|z(t)-y^*\|^2 \leq C\,\bigl(\psi(t)-\psi(a)\bigr)^{(\mu-1)(1-\alpha)}, \quad \text{for all } t \geq a.$$
Proof. 
Based on Corollary 1 applied to (24) and the fact that f is of class C¹ and convex, we have
$$ {}^{H}D_{a^+}^{\alpha,\mu;\psi}\varphi(t) \leq -\theta\,\langle z(t)-y^*, \nabla f(z(t))\rangle \leq \theta\,\bigl(f(y^*) - f(z(t))\bigr) \leq 0, \quad \text{for all } t \geq a.$$
By the properties of ψ (see Definition 2) and the monotonicity of the fractional integral, the previous expression implies
$$ I_{a^+}^{\alpha;\psi}\,{}^{H}D_{a^+}^{\alpha,\mu;\psi}\varphi(t) \leq \theta\,I_{a^+}^{\alpha;\psi}\bigl[f(y^*) - f(z(\cdot))\bigr](t) \leq 0.$$
Using the composition rule (4), with m = 1, we have
$$ \varphi(t) - \frac{\bigl(\psi(t)-\psi(a)\bigr)^{(\mu-1)(1-\alpha)}}{\Gamma(1-(1-\mu)(1-\alpha))}\,I_{a^+}^{(1-\mu)(1-\alpha);\psi}\varphi(a^+) \leq 0.$$
From (24) and setting $C = \frac{1}{\Gamma(1-(1-\mu)(1-\alpha))}\,I_{a^+}^{(1-\mu)(1-\alpha);\psi}\varphi(a^+)$, we obtain the result.    □
If we consider μ = 0 in the previous result, we recover Theorem 4.2 in [21].

4.2. The Strongly Convex Case

Here, we show that under the assumption of strong convexity of the function f, the solution of (23) admits a Mittag–Leffler convergence, which is a general type of exponential convergence to the stationary point. Recall the definition of a strongly convex function.
Definition 3.
(cf. [21]) A function f ∈ C¹ is strongly convex with parameter m_f > 0 if
$$ f(x) \geq f(y) + \langle \nabla f(y), x - y\rangle + \frac{m_f}{2}\|x-y\|^2, \quad \text{for all } x, y \in \mathbb{R}^n,$$
where ∇ stands for the gradient operator.
Theorem 4.
Let α ∈ [0, 1] and μ ∈ [0, 1]. Suppose that f is of class C² and strongly convex. Considering the ψ-fractional differential Equation (23), where the step size θ is constant, the solution z(·) converges to y* with the upper bound
$$ \|z(t)-y^*\|^2 \leq 2\varphi_a\,\bigl(\psi(t)-\psi(a)\bigr)^{(\alpha-1)(1-\mu)}\,E_{\alpha,\alpha+\mu(1-\alpha)}\bigl(-\theta m_f(\psi(t)-\psi(a))^{\alpha}\bigr), \quad t \geq a,$$
where $\varphi_a = \tfrac{1}{2}\,I_{a^+}^{(1-\mu)(1-\alpha);\psi}\bigl[\|z(\cdot)-y^*\|^2\bigr](a^+)$.
Proof. 
In Definition 3, if we consider y = z(t) and x = y*, we have, for t ≥ a and m_f > 0, that
$$ f(y^*) \geq f(z(t)) + \langle \nabla f(z(t)), y^* - z(t)\rangle + \frac{m_f}{2}\|y^*-z(t)\|^2,$$
which is equivalent to
$$ \langle \nabla f(z(t)), y^* - z(t)\rangle \leq f(y^*) - f(z(t)) - \frac{m_f}{2}\|y^*-z(t)\|^2 \leq -\frac{m_f}{2}\|y^*-z(t)\|^2,$$
where the last inequality holds because y* ∈ arg min_{x∈Rⁿ} f(x). From (23), Corollary 1, and (26), we have
$$ {}^{H}D_{a^+}^{\alpha,\mu;\psi}\varphi(t) \leq \langle z(t)-y^*, {}^{H}D_{a^+}^{\alpha,\mu;\psi} z(t)\rangle = -\theta\,\langle z(t)-y^*, \nabla f(z(t))\rangle \leq -\frac{\theta m_f}{2}\|z(t)-y^*\|^2.$$
Setting
$$ h(t) = -\frac{\theta m_f}{2}\|z(t)-y^*\|^2 - {}^{H}D_{a^+}^{\alpha,\mu;\psi}\varphi(t),$$
we have from (27) that h(t) ≥ 0 for all t ≥ a, and moreover, from (28) and recalling (24), the previous expression is equivalent to
$$ {}^{H}D_{a^+}^{\alpha,\mu;\psi}\varphi(t) = -\theta m_f\,\varphi(t) - h(t).$$
By Theorem 5.1 in [22], we have that the solution of (29) is
$$ \varphi(t) = \varphi_a\,\bigl(\psi(t)-\psi(a)\bigr)^{\alpha+\mu(1-\alpha)-1}E_{\alpha,\alpha+\mu(1-\alpha)}\bigl(-\theta m_f(\psi(t)-\psi(a))^{\alpha}\bigr)$$
$$ \qquad - \int_a^t \bigl(\psi(t)-\psi(w)\bigr)^{\alpha-1}\psi'(w)\,E_{\alpha,\alpha}\bigl(-\theta m_f(\psi(t)-\psi(w))^{\alpha}\bigr)h(w)\,dw$$
$$ \leq \varphi_a\,\bigl(\psi(t)-\psi(a)\bigr)^{\alpha+\mu(1-\alpha)-1}E_{\alpha,\alpha+\mu(1-\alpha)}\bigl(-\theta m_f(\psi(t)-\psi(a))^{\alpha}\bigr),$$
where the last inequality holds because the integrand function in (30) is positive and 0 ≤ a < t.    □

4.3. Convergence at an Exponential Rate

Theorem 4 establishes a Mittag–Leffler convergence rate for the solution of (23) towards a stationary point. Specifically, when α = 1, the exponential rate O(e^{−θ m_f ψ(t)}) of the continuous gradient method (22) is recovered for any μ ∈ [0, 1].
Theorem 5.
Let α ∈ [0, 1], μ ∈ [0, 1], and let ψ satisfy the conditions of Definition 2 with sup{ψ(t) : t ≥ a} = +∞. Let f : Rⁿ → R be a function that is C¹, convex, and Lipschitz smooth with constant L_f, that is,
$$ \|\nabla f(x) - \nabla f(y)\| \leq L_f\,\|x-y\|, \quad \text{for all } x, y \in \mathbb{R}^n.$$
If the solution z(·) of (23) converges to y* at the exponential rate O(e^{−ωψ(t)}), then y* = 0.
Proof. 
Let z(·) be a solution of (23) converging to the stationary point y* at the rate O(e^{−ωψ(t)}). Then, there exists t₁ ≥ a such that
$$ \|z(t) - y^*\| \leq e^{-\omega\,\psi(t)}, \quad \text{for all } t \geq t_1.$$
By contradiction, let us assume that y* ≠ 0. We can then set
$$ k = \theta\,\|y^*\| + 1.$$
From Formula (4.11.4b) in [20],
$$ E_{\alpha,\beta}(z) = 2E_{2\alpha,\beta}(z^2) - E_{\alpha,\beta}(-z), \quad \mathrm{Re}(\alpha) > 0,\ \beta \in \mathbb{C},$$
we can find t₂ ≥ t₁ with the property that
$$ E_{\alpha,\beta}\bigl(-L_f(\psi(t)-\psi(a))^{\alpha}\bigr) \geq k\,e^{-\omega\,\psi(t)}, \quad \text{for all } t \geq t_2.$$
By Proposition 1, z(·) is of the following form:
$$ z(t) = \frac{z_a}{\Gamma(\alpha+\mu(1-\alpha))}\bigl(\psi(t)-\psi(a)\bigr)^{\alpha+\mu(1-\alpha)-1} - \frac{\theta}{\Gamma(\alpha)}\int_a^t \bigl(\psi(t)-\psi(s)\bigr)^{\alpha-1}\psi'(s)\,\nabla f(z(s))\,ds,$$
which is equivalent to
$$ z_a\,\bigl(\psi(t)-\psi(a)\bigr)^{\alpha+\mu(1-\alpha)-1} = \Gamma(\alpha+\mu(1-\alpha))\,z(t) + \frac{\theta\,\Gamma(\alpha+\mu(1-\alpha))}{\Gamma(\alpha)}\int_a^t \bigl(\psi(t)-\psi(s)\bigr)^{\alpha-1}\psi'(s)\,\nabla f(z(s))\,ds.$$
Setting
$$ u(t) = z_a\,\bigl(\psi(t)-\psi(a)\bigr)^{\alpha+\mu(1-\alpha)-1} - \Gamma(\alpha+\mu(1-\alpha))\,y^*,$$
then, by (37), we obtain
$$ u(t) = \Gamma(\alpha+\mu(1-\alpha))\,\bigl(z(t)-y^*\bigr) + \frac{\theta\,\Gamma(\alpha+\mu(1-\alpha))}{\Gamma(\alpha)}\int_a^t \bigl(\psi(t)-\psi(s)\bigr)^{\alpha-1}\psi'(s)\,\nabla f(z(s))\,ds.$$
By assumption (32), and since ∇f(y*) = 0, we obtain
$$ \|u(t)\| \leq \Gamma(\alpha+\mu(1-\alpha))\,\|z(t)-y^*\| + \frac{\theta L_f\,\Gamma(\alpha+\mu(1-\alpha))}{\Gamma(\alpha)}\int_a^{t} \bigl(\psi(t)-\psi(s)\bigr)^{\alpha-1}\psi'(s)\,\|z(s)-y^*\|\,ds.$$
Splitting the integral at t₂, denoting Q = sup{‖z(s) − y*‖ : s ∈ [a, t₂]}, and using (33) and (36), we obtain
$$ \|u(t)\| \leq \Gamma(\alpha+\mu(1-\alpha))\,\|z(t)-y^*\| + \frac{\theta L_f Q\,\Gamma(\alpha+\mu(1-\alpha))}{\alpha\,\Gamma(\alpha)}\Bigl[\bigl(\psi(t)-\psi(a)\bigr)^{\alpha} - \bigl(\psi(t)-\psi(t_2)\bigr)^{\alpha}\Bigr] + \frac{\theta L_f\,\Gamma(\alpha+\mu(1-\alpha))}{\Gamma(\alpha)}\int_{t_2}^{t} \bigl(\psi(t)-\psi(s)\bigr)^{\alpha-1}\psi'(s)\,E_{\alpha,\alpha+\mu(1-\alpha)}\bigl(-L_f(\psi(s)-\psi(a))^{\alpha}\bigr)\,ds.$$
By the mean value theorem (applied to the function s ↦ (ψ(t) − ψ(s))^α on [a, t₂]), there exists δ ∈ ]a, t₂[ such that
$$ \bigl(\psi(t)-\psi(a)\bigr)^{\alpha} - \bigl(\psi(t)-\psi(t_2)\bigr)^{\alpha} = \alpha\,(t_2-a)\,\bigl(\psi(t)-\psi(\delta)\bigr)^{\alpha-1}\psi'(\delta),$$
which implies, since α < 1 and sup{ψ(t) : t ≥ a} = +∞, that
$$ \lim_{t\to+\infty}\Bigl[\bigl(\psi(t)-\psi(a)\bigr)^{\alpha} - \bigl(\psi(t)-\psi(t_2)\bigr)^{\alpha}\Bigr] = \alpha\,(t_2-a)\lim_{t\to+\infty}\frac{\psi'(\delta)}{\bigl(\psi(t)-\psi(\delta)\bigr)^{1-\alpha}} = 0.$$
Hence, taking the limit of u(t) when t → +∞ and using Proposition 2 for the last integral, we obtain
$$ \lim_{t\to+\infty}u(t) = 0 \quad\Longrightarrow\quad \Gamma(\alpha+\mu(1-\alpha))\,y^* = 0.$$
This implies that y * = 0 , which is a contradiction.    □

5. ψ-Hilfer Fractional Gradient Method

The aim of this section is to construct and implement a numerical method for the ψ -Hilfer FGM in one- and two-dimensional cases. For both cases, we perform numerical simulations using benchmark functions.

5.1. The One-Dimensional Case

5.1.1. Design of the Numerical Method

The gradient descent method typically takes steps proportional to the negative gradient (or an approximation of it) of the function at the current iterate; that is, x_{k+1} is updated by the following law:
$$ x_{k+1} = x_k - \theta\,f'(x_k), \quad k = 0, 1, 2, \ldots,$$
where θ > 0 is the step size or learning rate, and f'(x_k) is the first derivative of f evaluated at x = x_k. We assume that f : R → R admits a local minimum at the point x* in D_ρ(x*) = {x ∈ R : |x − x*| < ρ}, for some ρ > 0, and that f admits a Taylor series expansion centered at x₀ = a,
$$ f(x) = \sum_{p=0}^{+\infty}\frac{f^{(p)}(a)}{p!}\,(x-a)^{p}, $$ (41)
with domain of convergence D ⊆ R such that x* ∈ D. As we want to consider the fractional gradient in the ψ-Hilfer sense, our first (and natural) attempt is to consider the iterative method
$$ x_{k+1} = x_k - \theta\,{}^{H}D_{a^+}^{\alpha,\mu;\psi}(f\circ\psi)(x_k), \quad k = 0, 1, 2, \ldots, $$ (42)
where ${}^{H}D_{a^+}^{\alpha,\mu;\psi}$ is the ψ-Hilfer derivative of order α ∈ ]0,1[ and type μ ∈ [0,1] given by (2), and the function ψ is in the conditions of Definition 2. However, a simple example shows that (42) is not the correct approach. In fact, let us consider the quadratic function f(x) = (x − h)², with minimum at x* = h. For this function, we have that
$$ {}^{H}D_{a^+}^{\alpha,\mu;\psi} f(x) = \sum_{p=p_0}^{2}\frac{f^{(p)}(a)}{\Gamma(p+1-\alpha)}\,(x-a)^{p-\alpha},$$
where p₀ = 0 if μ ∈ [0,1[ and p₀ = 1 if μ = 1. As ${}^{H}D_{a^+}^{\alpha,\mu;\psi} f(h) \neq 0$, the iterative method (42) does not converge to the real minimum point. This example shows that the ψ-Hilfer FGM with a fixed lower limit of integration does not converge to the minimum point. This is due to the influence of long-time memory terms, which is an intrinsic feature of fractional derivatives. In order to address this problem, and inspired by the ideas presented in [14,15], we replace the starting point a in the fractional derivative by the previous iterate x_{k−1}, that is,
$$ x_{k+1} = x_k - \theta\,{}^{H}D_{x_{k-1}^+}^{\alpha,\mu;\psi}(f\circ\psi)(x_k), \quad k = 1, 2, \ldots, $$ (44)
where α ∈ ]0,1[, μ ∈ [0,1], and θ > 0. This eliminates the long-time memory effect during the iteration procedure. In this sense, and taking into account the series representation (41) and the differentiation rule (5), we obtain
$$ {}^{H}D_{x_{k-1}^+}^{\alpha,\mu;\psi}(f\circ\psi)(x_k) = \sum_{p=p_0}^{+\infty}\frac{f^{(p)}(\psi(x_{k-1}))}{\Gamma(p+1-\alpha)}\,\bigl(\psi(x_k)-\psi(x_{k-1})\bigr)^{p-\alpha}, $$ (45)
where p₀ = 0 if μ ∈ [0,1[ and p₀ = 1 if μ = 1. Thus, the representation formula (45) depends only on whether μ = 1 or μ ≠ 1. With this modification of the ψ-Hilfer FGM, we obtain the following convergence result.
Theorem 6.
If the algorithm (44) is convergent, with the fractional gradient given by (45), then it converges to the minimum point of f(ψ(·)).
Proof. 
Let x* be the minimum point of f(ψ(·)). We prove that the sequence (x_k)_{k∈N} converges to x* by contradiction. Assume that x_k converges to a point x̄ ≠ x* with f'(ψ(x̄)) ≠ 0. As the algorithm is convergent, we have that lim_{k→+∞}|x_k − x̄| = 0. Moreover, for any small positive ε, there exists a sufficiently large number N ∈ N, such that
$$ |\psi(x_{k-1}) - \psi(\bar{x})| < \epsilon < |\psi(x^*) - \psi(\bar{x})| $$ (46)
for any k > N. Thus,
$$ \delta = \inf_{k > N}\bigl|f^{(p_0)}(\psi(x_{k-1}))\bigr| > 0 $$ (47)
must hold. From (45), we have
$$ |x_{k+1} - x_k| = \theta\,\Bigl|{}^{H}D_{x_{k-1}^+}^{\alpha,\mu;\psi}(f\circ\psi)(x_k)\Bigr| = \theta\,\Bigl|\sum_{p=0}^{+\infty}\frac{f^{(p+p_0)}(\psi(x_{k-1}))}{\Gamma(p+p_0+1-\alpha)}\bigl(\psi(x_k)-\psi(x_{k-1})\bigr)^{p+p_0-\alpha}\Bigr| \geq \theta\,\frac{|f^{(p_0)}(\psi(x_{k-1}))|}{\Gamma(p_0+1-\alpha)}\bigl|\psi(x_k)-\psi(x_{k-1})\bigr|^{p_0-\alpha} - \theta\,\Bigl|\sum_{p=1}^{+\infty}\frac{f^{(p+p_0)}(\psi(x_{k-1}))}{\Gamma(p+p_0+1-\alpha)}\bigl(\psi(x_k)-\psi(x_{k-1})\bigr)^{p+p_0-\alpha}\Bigr|.$$
Considering
$$ C = \sup_{p \geq 1}\Bigl|\frac{f^{(p+p_0)}(\psi(x_{k-1}))}{\Gamma(p+p_0+1-\alpha)}\Bigr|,$$
we have, from the previous expression,
$$ |x_{k+1} - x_k| \geq \theta\,\frac{|f^{(p_0)}(\psi(x_{k-1}))|}{\Gamma(p_0+1-\alpha)}\bigl|\psi(x_k)-\psi(x_{k-1})\bigr|^{p_0-\alpha} - \theta\,C\sum_{p=1}^{+\infty}\bigl|\psi(x_k)-\psi(x_{k-1})\bigr|^{p+p_0-\alpha}.$$
The geometric series in the previous expression is convergent for sufficiently large k. Hence, we obtain
$$ |x_{k+1} - x_k| \geq \theta\,\frac{|f^{(p_0)}(\psi(x_{k-1}))|}{\Gamma(p_0+1-\alpha)}\bigl|\psi(x_k)-\psi(x_{k-1})\bigr|^{p_0-\alpha} - \theta\,C\,\frac{\bigl|\psi(x_k)-\psi(x_{k-1})\bigr|^{1+p_0-\alpha}}{1-\bigl|\psi(x_k)-\psi(x_{k-1})\bigr|},$$
which is equivalent to
$$ |x_{k+1} - x_k| \geq \theta\,\Bigl[\frac{|f^{(p_0)}(\psi(x_{k-1}))|}{\Gamma(p_0+1-\alpha)} - C\,\frac{\bigl|\psi(x_k)-\psi(x_{k-1})\bigr|}{1-\bigl|\psi(x_k)-\psi(x_{k-1})\bigr|}\Bigr]\bigl|\psi(x_k)-\psi(x_{k-1})\bigr|^{p_0-\alpha} \geq d\,\bigl|\psi(x_k)-\psi(x_{k-1})\bigr|^{p_0-\alpha}, $$ (48)
where
$$ d = d(\epsilon) = \theta\,\Bigl[\frac{\delta}{\Gamma(p_0+1-\alpha)} - \frac{2C\epsilon}{1-\epsilon}\Bigr]. $$ (49)
One can always find ε sufficiently small, such that
$$ \frac{\delta}{\Gamma(p_0+1-\alpha)} - \frac{2C\epsilon}{1-\epsilon} > \frac{(2\epsilon)^{\alpha}}{\theta} \quad\Longleftrightarrow\quad \frac{\delta}{\Gamma(p_0+1-\alpha)} > \frac{(2\epsilon)^{\alpha}}{\theta} + \frac{2C\epsilon}{1-\epsilon}, $$ (50)
because the function $g(\epsilon) = \frac{(2\epsilon)^{\alpha}}{\theta} + \frac{2C\epsilon}{1-\epsilon}$ is positive and increasing for α ∈ [0,1], θ ∈ ]0,1], and ε ∈ ]0,1[, and vanishes as ε → 0⁺. Hence, from (48) and taking into account (50), we obtain
$$ |x_{k+1} - x_k| > (2\epsilon)^{\alpha}\,(2\epsilon)^{p_0-\alpha} = (2\epsilon)^{p_0} = \begin{cases} 2\epsilon, & \text{if } \mu = 1,\\ 1, & \text{if } \mu \in [0,1[. \end{cases} $$ (51)
On the other hand, from the assumption (46), we have
$$ |x_{k+1} - x_k| \leq |\psi(x_{k+1}) - \psi(x_k)| \leq |\psi(x_{k+1}) - \psi(\bar{x})| + |\psi(\bar{x}) - \psi(x_k)| < 2\epsilon,$$
which contradicts (51). This completes the proof.    □
Sometimes, the function f is not smooth enough to admit a series representation of the form (41), and therefore, the implementation of (44) using the series (45) is not possible. Moreover, for a practical implementation, we need to truncate the series. In our first approach, we keep only the term of the series containing f'(ψ(x_{k−1})), as it is the most relevant one for the gradient method. Thus, the ψ-Hilfer FGM (44) simplifies to
$$ x_{k+1} = x_k - \theta\,\frac{f'(\psi(x_{k-1}))}{\Gamma(2-\alpha)}\,\bigl(\psi(x_k)-\psi(x_{k-1})\bigr)^{1-\alpha}. $$ (52)
Furthermore, in order to avoid the appearance of complex numbers, we introduce the modulus in expression (52), that is,
$$ x_{k+1} = x_k - \theta\,\frac{f'(\psi(x_{k-1}))}{\Gamma(2-\alpha)}\,\bigl|\psi(x_k)-\psi(x_{k-1})\bigr|^{1-\alpha}. $$ (53)
The pseudocode associated with (53) is presented in Algorithm 1. As (53) is independent of the parameter μ, from now on we simply call the method ψ-FGM. Following the same arguments as in the proof of Theorem 6, we obtain the following result.
Theorem 7.
If the algorithm (44) is convergent with the fractional gradient given by (53), then it converges to the minimum point of f(ψ(·)).
In the following pseudocode, we describe the implementation of the algorithm (53).
Algorithm 1: ψ -FGM with higher order truncation.
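The published pseudocode of Algorithm 1 is only available as an image, so the following Python sketch is our own rendering of the update (53); the stopping rule (distance between successive iterates below ε), the iteration budget, and the choice of two starting points x₀, x₁ are assumptions on our part.

```python
from math import gamma

def psi_fgm(df, psi, x0, x1, alpha=0.75, theta=0.1, eps=1e-9, max_iter=10000):
    """psi-FGM with higher-order truncation, update (53):
    x_{k+1} = x_k - theta * f'(psi(x_{k-1})) / Gamma(2 - alpha)
                    * |psi(x_k) - psi(x_{k-1})|**(1 - alpha).
    df is the derivative f', psi the kernel function, x0 and x1 the starting points."""
    xm, x = x0, x1
    for _ in range(max_iter):
        step = theta * df(psi(xm)) / gamma(2 - alpha) * abs(psi(x) - psi(xm))**(1 - alpha)
        xm, x = x, x - step
        if abs(x - xm) <= eps:            # stop when successive iterates are close
            break
    return x

# Example: f(x) = (x - 1)^2 composed with psi(x) = x (Caputo/Riemann-Liouville case)
df = lambda u: 2.0 * (u - 1.0)
print(psi_fgm(df, lambda x: x, x0=2.0, x1=1.9))  # converges towards x* = 1
```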
As we have seen, it is possible to construct a ψ-FGM that converges to the minimum point of a function. To improve the convergence of the proposed method, we can consider a variable order of differentiation α(x) in each iteration. Some examples of α(x) are given by (see [15]):
$$ \alpha(x) = \frac{1}{1+\beta\,J(x)}, $$ (54)
$$ \alpha(x) = \frac{2}{1+e^{\beta J(x)}}, $$ (55)
$$ \alpha(x) = \bigl|\cos\bigl(\beta\,J(x)\bigr)\bigr|, $$ (56)
$$ \alpha(x) = 1 - \frac{2}{\pi}\arctan\bigl(\beta\,J(x)\bigr), $$ (57)
$$ \alpha(x) = 1 - \tanh\bigl(\beta\,J(x)\bigr), $$ (58)
where β > 0 and the loss function to be minimized in each iteration is J(x) = (f(ψ(x)))². The square in the loss function guarantees its non-negativity. All the examples given satisfy
$$ \lim_{\beta\to 0}\alpha(x) = 1 = \lim_{x\to x^*}\alpha(x),$$
where the second limit results from the fact that J(x) → 0 as x → x*. Variable order differentiation turns the ψ-FGM into a learning method, because as x gradually approaches x*, α(x) → 1. The ψ-FGM with variable order is given by
$$ x_{k+1} = x_k - \theta\,\frac{f'(\psi(x_{k-1}))}{\Gamma(2-\alpha(x_k))}\,\bigl|\psi(x_k)-\psi(x_{k-1})\bigr|^{1-\alpha(x_k)}. $$ (59)
Theorem 6 remains valid for this variation of Algorithm 1.
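A minimal sketch of the variable-order update (59), again our own illustration, assuming the choice (58) for α(x) and the loss J(x) = (f(ψ(x)))²; the parameter values and stopping rule are arbitrary. As the iterates approach x*, α(x) tends to 1 and the update approaches the classical gradient step.

```python
from math import gamma, tanh

def psi_fgm_variable_order(f, df, psi, x0, x1, theta=0.1, beta=0.1,
                           eps=1e-9, max_iter=10000):
    """psi-FGM with variable differentiation order (59), using
    alpha(x) = 1 - tanh(beta * J(x)) from (58) with loss J(x) = f(psi(x))**2."""
    xm, x = x0, x1
    for _ in range(max_iter):
        J = f(psi(x))**2                     # non-negative loss
        alpha = 1.0 - tanh(beta * J)         # alpha -> 1 as x -> x*
        step = (theta * df(psi(xm)) / gamma(2.0 - alpha)
                * abs(psi(x) - psi(xm))**(1.0 - alpha))
        xm, x = x, x - step
        if abs(x - xm) <= eps:
            break
    return x

# Same benchmark f(x) = (x - 1)^2 with psi(x) = x
f = lambda u: (u - 1.0)**2
df = lambda u: 2.0 * (u - 1.0)
print(psi_fgm_variable_order(f, df, lambda x: x, 2.0, 1.9))  # approaches x* = 1
```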

5.1.2. Numerical Simulations

Now, we provide some examples that show the validity of the previous algorithms applied to the quadratic function f x = x 1 2 , which is one of the simplest benchmark functions. This function is a convex function with a unique global minimum at x * = 1 . For the ψ -derivative, we consider the following cases:
  • Caputo and Riemann–Liouville fractional derivatives: ψ 1 x = x , I = [ 0 , + [ , and a = 0 ;
  • Hadamard fractional derivative: ψ 2 x = ln x , I = [ 1 , + [ , and a = 1 ;
  • Katugampola fractional derivative: ψ 3 x = x 0.5 , I = [ 0 , + [ , and a = 0.5 . In this case, a cannot coincide with the lower limit of the interval I because ψ is not defined at x = 0 .
Figure 1, Figure 2 and Figure 3 show the numerical results of Algorithm 1 applied to the composite functions f(ψ₁(·)), f(ψ₂(·)), and f(ψ₃(·)), choosing different parameters. In Figure 1, we consider θ = 0.1, x₀ = 2, ε = 10⁻⁹, and different orders of differentiation α = 0.4, 0.6, 0.8, 1.0.
Analyzing the plots in Figure 1, we see that the convergence of the ψ-FGM in the non-integer case is slower, in general, than in the integer case (α = 1.0). In Figure 2, we consider x₀ = 2, α = 0.75, ε = 10⁻⁹, and different step sizes θ = 0.05, 0.1, 0.2, 0.3.
The plots in Figure 2 show that increasing the step size makes convergence faster. The optimization of the step size in each iteration would lead to optimal convergence of the method in a few iterations. In Figure 3, we consider different initial approximations x₀, and the values α = 0.75, θ = 0.1, and ε = 10⁻⁹. As expected, convergence becomes faster as x₀ approaches the minimum point.
Now, we show the numerical simulations of Algorithm 1 with a variable order of differentiation α(x). In Figure 4, we consider θ = 0.1, β = 0.1, x₀ = 2, ε = 10⁻⁹, and the variable order Functions (54)–(58). In Figure 5, we exhibit the behaviour of the algorithm for α(x) given by (58) and different values of β.
From these plots, we conclude that in the one-dimensional case, the consideration of variable order differentiation can speed up the convergence, but it is, in general, slower than the classical gradient descent method with integer derivative. A further improvement of the algorithm could be made by considering a variable step size optimized in each iteration. This idea is implemented in the next section, where we consider optimization problems in R 2 .

5.2. The Two-Dimensional Case

5.2.1. Untrained Approach

Motivated by the ideas presented in [14,15], we extend the results of Section 5.1 to the two-dimensional case. We consider a function ψ in the conditions of Definition 2 and the vector-valued function Ψ : R² → R² given by Ψ(X) = (ψ(x), ψ(y)), with X = (x, y). Moreover, let f : R² → R be a function that admits a local minimum at the point X* in D_ρ(X*) = {X ∈ R² : ‖X − X*‖ < ρ}, for some ρ > 0. We want to find a local minimum point of the function f(Ψ(X)) = f(ψ(x), ψ(y)), with X = (x, y), through the iterative method
$$ X_{k+1} = X_k - \theta\,{}^{H}\nabla_{X_{k-1}^+}^{\alpha,\mu;\psi} f(\Psi(X_k)), \quad \theta > 0. $$ (60)
We assume that f admits a Taylor series centered at the point (a₁, a₂) ∈ D_ρ(X*), given by
$$ f(x, y) = \sum_{p=0}^{+\infty}\sum_{q=0}^{+\infty}\frac{\partial^{p+q} f}{\partial x^{p}\partial y^{q}}(a_1, a_2)\,\frac{1}{p!\,q!}\,(x-a_1)^{p}(y-a_2)^{q}, $$ (61)
with a domain of convergence D ⊆ R², such that X* ∈ D. Then, the ψ-Hilfer fractional gradient in (60) is given by
$$ {}^{H}\nabla_{X_{k-1}^+}^{\alpha,\mu;\psi} f(\Psi(X_k)) = \Bigl({}^{H}\partial_{x,\,x_{k-1}^+}^{\alpha,\mu;\psi} f(\Psi(X_k)),\ {}^{H}\partial_{y,\,y_{k-1}^+}^{\alpha,\mu;\psi} f(\Psi(X_k))\Bigr), $$ (62)
where ${}^{H}\partial_{x,\,x_{k-1}^+}^{\alpha,\mu;\psi} f(\Psi(X_k))$ and ${}^{H}\partial_{y,\,y_{k-1}^+}^{\alpha,\mu;\psi} f(\Psi(X_k))$ denote the partial ψ-Hilfer derivatives of f with respect to x and y, of order α ∈ ]0,1[ and type μ ∈ [0,1], with the lower limit of integration replaced by x_{k−1} and y_{k−1}, respectively. Taking into account (61) and (5), we have the following expressions for the components of (62):
$$ {}^{H}\partial_{x,\,x_{k-1}^+}^{\alpha,\mu;\psi} f(\Psi(X_k)) = \sum_{p=p_0}^{+\infty}\sum_{q=0}^{+\infty}\frac{\partial^{p+q} f}{\partial x^{p}\partial y^{q}}(\Psi(X_{k-1}))\,\frac{\bigl(\psi(y_k)-\psi(y_{k-1})\bigr)^{q}}{q!}\,\frac{\bigl(\psi(x_k)-\psi(x_{k-1})\bigr)^{p-\alpha}}{\Gamma(p+1-\alpha)}, $$ (63)
$$ {}^{H}\partial_{y,\,y_{k-1}^+}^{\alpha,\mu;\psi} f(\Psi(X_k)) = \sum_{p=0}^{+\infty}\sum_{q=q_0}^{+\infty}\frac{\partial^{p+q} f}{\partial x^{p}\partial y^{q}}(\Psi(X_{k-1}))\,\frac{\bigl(\psi(x_k)-\psi(x_{k-1})\bigr)^{p}}{p!}\,\frac{\bigl(\psi(y_k)-\psi(y_{k-1})\bigr)^{q-\alpha}}{\Gamma(q+1-\alpha)}, $$ (64)
where
$$ p_0 = q_0 = \begin{cases} 0, & \text{if } \mu \in [0,1[,\\ 1, & \text{if } \mu = 1. \end{cases}$$
The iterative method proposed in (60) takes into account the short-memory characteristic of the fractional derivatives, and, as in the one-dimensional case, we can see from (63) and (64) that the method does not depend on the specific value of the type μ, but only on whether μ = 1 or μ ∈ [0,1[. Furthermore, due to the freedom of choice of the parameter μ (μ = 1 or μ ∈ [0,1[) and of the function ψ, we can deal with several fractional derivatives (see Section 5 in [19]). We obtain the following convergence result.
Theorem 8.
If the algorithm (60) is convergent, where the fractional gradient is given by (62)–(64), then it converges to the minimum point of f(Ψ(·)).
Proof. 
Let X* be the minimum point of f(Ψ(·)). We prove that the sequence (X_k)_{k∈N} converges to X* by contradiction. Assume that X_k converges to a point X̄ ≠ X* with ∂f/∂x(Ψ(X̄)) ≠ 0 ≠ ∂f/∂y(Ψ(X̄)). As the algorithm is convergent, we have that lim_{k→+∞}‖X_k − X̄‖ = 0. Moreover, for any small positive ε, there exists a sufficiently large number N ∈ N, such that
$$ |\psi(x_{k-1}) - \psi(\bar{x})|^2 < \frac{\epsilon^2}{2} < |\psi(x^*) - \psi(\bar{x})|^2 $$ (66)
and
$$ |\psi(y_{k-1}) - \psi(\bar{y})|^2 < \frac{\epsilon^2}{2} < |\psi(y^*) - \psi(\bar{y})|^2 $$ (67)
for any k > N. Thus,
$$ \delta_1 = \inf_{k > N}\Bigl|\frac{\partial^{p_0} f}{\partial x^{p_0}}(\Psi(X_{k-1}))\Bigr| > 0 \quad\text{and}\quad \delta_2 = \inf_{k > N}\Bigl|\frac{\partial^{q_0} f}{\partial y^{q_0}}(\Psi(X_{k-1}))\Bigr| > 0 $$ (68)
must hold. From (62)–(64) we have
X k + 1 X k 2 = θ 2 H x k 1 + α , μ ; ψ f Ψ X k 1 2 = θ 2 p = 0 + q = 0 + p + p 0 + q f x p + p 0 y q Ψ X k 1 ψ y k ψ y k 1 q q ! ψ x k ψ x k 1 p + p 0 α Γ p + p 0 + 1 α 2 + θ 2 p = 0 + q = 0 + p + q + q 0 f x p y q + q 0 Ψ X k 1 ψ x k ψ x k 1 p p ! ψ y k ψ y k 1 q + q 0 α Γ q + q 0 + 1 α 2 θ 2 p 0 f x p 0 ( Ψ X k 1 ) ψ x k ψ x k 1 p 0 α Γ p 0 + 1 α 2 θ 2 p = 1 + q = 1 + p + p 0 + q f x p y q Ψ X k 1 ψ y k ψ y k 1 q q ! ψ x k ψ x k 1 p + p 0 α Γ p + p 0 + 1 α 2 + θ 2 q 0 f y q 0 ( Ψ X k 1 ) ψ y k ψ y k 1 q 0 α Γ q 0 + 1 α 2 θ 2 p = 1 + q = 1 + p + q + q 0 f x p y q Ψ X k 1 ψ x k ψ x k 1 p p ! ψ y k ψ y k 1 q + q 0 α Γ q + q 0 + 1 α 2 .
Considering
C 1 = sup p , q N p 0 f x p 0 ( Ψ X k 1 ) Γ p 0 + 1 α q ! and C 2 = sup p , q N q 0 f y q 0 ( Ψ X k 1 ) Γ q 0 + 1 α p ! ,
from the previous expression, we obtain
X k + 1 X k 2 θ 2 p 0 f x p 0 ( Ψ X k 1 ) ψ x k ψ x k 1 p 0 α Γ p 0 + 1 α 2 θ 2 C 1 2 p = 1 + q = 1 + ψ y k ψ y k 1 q ψ x k ψ x k 1 p + p 0 α 2 + θ 2 q 0 f y q 0 ( Ψ X k 1 ) ψ y k ψ y k 1 q 0 α Γ q 0 + 1 α 2 θ 2 C 2 2 p = 1 + q = 1 + ψ x k ψ x k 1 p ψ y k ψ y k 1 q + q 0 α 2 .
The double series that appears in the previous expression are of a geometric type with positive radius less than 1. Hence, by the sum of a geometric series, we have
X k + 1 X k 2 θ 2 p 0 f x p 0 ( Ψ X k 1 ) ψ x k ψ x k 1 p 0 α Γ p 0 + 1 α 2 θ 2 C 1 2 ψ y k ψ y k 1 1 ψ y k ψ y k 1 2 ψ x k ψ x k 1 1 + p 0 α 1 ψ x k ψ x k 1 2 + θ 2 q 0 f y q 0 ( Ψ X k 1 ) ψ y k ψ y k 1 q 0 α Γ q 0 + 1 α 2 θ 2 C 2 2 ψ x k ψ x k 1 1 ψ x k ψ x k 1 2 ψ y k ψ y k 1 1 + q 0 α 1 ψ y k ψ y k 1 2 ,
which is equivalent to
X k + 1 X k 2 θ 2 p 0 f x p 0 ( Ψ X k 1 ) Γ p 0 + 1 α 2 C 1 2 ψ y k ψ y k 1 1 ψ y k ψ y k 1 2 ψ x k ψ x k 1 1 ψ x k ψ x k 1 2 × ψ x k ψ x k 1 p 0 α 2
+ θ 2 q 0 f y q 0 ( Ψ X k 1 ) Γ q 0 + 1 α 2 C 2 2 ψ x k ψ x k 1 1 ψ x k ψ x k 1 2 ψ y k ψ y k 1 1 ψ y k ψ y k 1 2 × ψ y k ψ y k 1 q 0 α 2 d 1 ψ x k ψ x k 1 p 0 α 2 + d 2 ψ y k ψ y k 1 q 0 α 2 ,
where
$$ d_1 = d_1(\epsilon) = \theta^2\Bigl[\Bigl(\frac{\delta_1}{\Gamma(p_0+1-\alpha)}\Bigr)^2 - C_1^2\Bigl(\frac{\epsilon}{1-\epsilon}\Bigr)^4\Bigr]$$
and
$$ d_2 = d_2(\epsilon) = \theta^2\Bigl[\Bigl(\frac{\delta_2}{\Gamma(q_0+1-\alpha)}\Bigr)^2 - C_2^2\Bigl(\frac{\epsilon}{1-\epsilon}\Bigr)^4\Bigr].$$
One can always find ε sufficiently small, such that
$$ \Bigl(\frac{\delta_1}{\Gamma(p_0+1-\alpha)}\Bigr)^2 - C_1^2\Bigl(\frac{\epsilon}{1-\epsilon}\Bigr)^4 > \frac{\epsilon^{2\alpha}}{\theta^2} \quad\Longleftrightarrow\quad \Bigl(\frac{\delta_1}{\Gamma(p_0+1-\alpha)}\Bigr)^2 > \frac{\epsilon^{2\alpha}}{\theta^2} + C_1^2\Bigl(\frac{\epsilon}{1-\epsilon}\Bigr)^4 $$ (70)
and
$$ \Bigl(\frac{\delta_2}{\Gamma(q_0+1-\alpha)}\Bigr)^2 - C_2^2\Bigl(\frac{\epsilon}{1-\epsilon}\Bigr)^4 > \frac{\epsilon^{2\alpha}}{\theta^2} \quad\Longleftrightarrow\quad \Bigl(\frac{\delta_2}{\Gamma(q_0+1-\alpha)}\Bigr)^2 > \frac{\epsilon^{2\alpha}}{\theta^2} + C_2^2\Bigl(\frac{\epsilon}{1-\epsilon}\Bigr)^4, $$ (71)
because the functions
$$ g_1(\epsilon) = \frac{\epsilon^{2\alpha}}{\theta^2} + C_1^2\Bigl(\frac{\epsilon}{1-\epsilon}\Bigr)^4 \quad\text{and}\quad g_2(\epsilon) = \frac{\epsilon^{2\alpha}}{\theta^2} + C_2^2\Bigl(\frac{\epsilon}{1-\epsilon}\Bigr)^4$$
are positive and increasing for α ∈ [0,1], θ ∈ ]0,1], and ε ∈ ]0,1[. Hence, from (69) and taking into account (70) and (71), we obtain
$$ \|X_{k+1}-X_k\|^2 > \epsilon^{2\alpha}\,\epsilon^{2p_0-2\alpha} + \epsilon^{2\alpha}\,\epsilon^{2q_0-2\alpha} = \epsilon^{2p_0} + \epsilon^{2q_0} = \begin{cases} 2\epsilon^2, & \text{if } \mu = 1,\\ 2, & \text{if } \mu \in [0,1[. \end{cases} $$ (72)
On the other hand, from the assumptions (66) and (67), we have
$$ \|X_{k+1}-X_k\|^2 \leq \|\Psi(X_{k+1})-\Psi(X_k)\|^2 = |\psi(x_{k+1})-\psi(x_k)|^2 + |\psi(y_{k+1})-\psi(y_k)|^2 \leq |\psi(x_{k+1})-\psi(\bar{x})|^2 + |\psi(\bar{x})-\psi(x_k)|^2 + |\psi(y_{k+1})-\psi(\bar{y})|^2 + |\psi(\bar{y})-\psi(y_k)|^2 < \frac{\epsilon^2}{2} + \frac{\epsilon^2}{2} + \frac{\epsilon^2}{2} + \frac{\epsilon^2}{2} = 2\epsilon^2,$$
which contradicts (72). This completes the proof.    □
which contradicts (72). This completes the proof.    □
Despite the convergence of (60), it is important to point out that sometimes the function f(Ψ(·)) is not smooth enough. In these cases, the algorithm involving the double series (63) and (64) cannot be implemented. Moreover, in the same way as was done in the one-dimensional case, assuming that f is at least of class C¹, we only keep the following terms of (63) and (64):
$$ \frac{\partial f}{\partial x}(\Psi(X_{k-1}))\,\frac{\bigl(\psi(x_k)-\psi(x_{k-1})\bigr)^{1-\alpha}}{\Gamma(2-\alpha)} \quad (\text{term } p = 1,\ q = 0 \text{ in (63)}), \qquad \frac{\partial f}{\partial y}(\Psi(X_{k-1}))\,\frac{\bigl(\psi(y_k)-\psi(y_{k-1})\bigr)^{1-\alpha}}{\Gamma(2-\alpha)} \quad (\text{term } p = 0,\ q = 1 \text{ in (64)}).$$
Thus, the higher order terms are eliminated and we have the following update of (62):
$$ {}^{H}\nabla_{X_{k-1}^+}^{\alpha,\mu;\psi} f(\Psi(X_k)) = \Bigl(\frac{\partial f}{\partial x}(\Psi(X_{k-1}))\,\frac{\bigl(\psi(x_k)-\psi(x_{k-1})\bigr)^{1-\alpha}}{\Gamma(2-\alpha)},\ \frac{\partial f}{\partial y}(\Psi(X_{k-1}))\,\frac{\bigl(\psi(y_k)-\psi(y_{k-1})\bigr)^{1-\alpha}}{\Gamma(2-\alpha)}\Bigr).$$
To avoid the appearance of complex numbers, we also consider
$$ {}^{H}\nabla_{X_{k-1}^+}^{\alpha,\mu;\psi} f(\Psi(X_k)) = \Bigl(\frac{\partial f}{\partial x}(\Psi(X_{k-1}))\,\frac{\bigl|\psi(x_k)-\psi(x_{k-1})\bigr|^{1-\alpha}}{\Gamma(2-\alpha)},\ \frac{\partial f}{\partial y}(\Psi(X_{k-1}))\,\frac{\bigl|\psi(y_k)-\psi(y_{k-1})\bigr|^{1-\alpha}}{\Gamma(2-\alpha)}\Bigr). $$ (73)
The implementation of (73) is presented in Algorithm 2. In a similar way as in Theorem 8, we can state the following result.
Theorem 9.
If the algorithm (60) is convergent, where the fractional gradient is given by (73), then it converges to the minimum point of f(Ψ(·)).
Algorithm 2: 2D ψ -FGM with higher order truncation.
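As with Algorithm 1, the published pseudocode is an image; the sketch below is our own Python rendering of the two-dimensional update (60) with the truncated fractional gradient (73), applied component-wise. The stopping rule, iteration budget, and the two starting points are our own choices.

```python
from math import gamma

def psi_fgm_2d(grad_f, psi, X0, X1, alpha=0.8, theta=0.1, eps=1e-9, max_iter=10000):
    """2D psi-FGM with higher-order truncation: update (60) with the
    fractional gradient (73). grad_f returns (f_x, f_y) evaluated at
    Psi(X_{k-1}); psi is applied component-wise."""
    (xm, ym), (x, y) = X0, X1
    for _ in range(max_iter):
        gx, gy = grad_f(psi(xm), psi(ym))            # partials at Psi(X_{k-1})
        dx = gx / gamma(2 - alpha) * abs(psi(x) - psi(xm))**(1 - alpha)
        dy = gy / gamma(2 - alpha) * abs(psi(y) - psi(ym))**(1 - alpha)
        xm, ym, x, y = x, y, x - theta * dx, y - theta * dy
        if ((x - xm)**2 + (y - ym)**2)**0.5 <= eps:  # successive iterates close
            break
    return x, y

# Quadratic benchmark f1(x, y) = 4x^2 - 4xy + 2y^2 with psi(x) = x
grad_f1 = lambda u, v: (8*u - 4*v, -4*u + 4*v)
print(psi_fgm_2d(grad_f1, lambda t: t, X0=(0.75, 1.25), X1=(0.7, 1.2)))
# approaches the minimizer (0, 0)
```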

5.2.2. The Trained Approach

In this section, we refine Algorithm 2 in two ways that, in each iteration, train the algorithm to find a more accurate X_k. First, we consider a variable step size θ_k > 0 that is updated in each iteration by minimizing the function
$$ \phi(\theta_k) = f\Bigl(\Psi\bigl(X_k - \theta_k\,{}^{H}\nabla_{X_{k-1}^+}^{\alpha,\mu;\psi} f(\Psi(X_k))\bigr)\Bigr).$$
In the second refinement, we adjust the order of differentiation α with X_k. More precisely, if f is a non-negative function with a unique minimum point X*, we can consider any of the functions (54)–(58). In our approach, we only consider the following variable fractional order:
$$ \alpha(X) = 1 - \frac{2}{\pi}\arctan\bigl(\beta\,J(X)\bigr), $$ (74)
where the loss function is J(X) = f(Ψ(X)). From (74), we infer that when J(X) → 0, one has α(X) → 1, and when J(X) → +∞, one has α(X) → 0. As a consequence of the previous refinements, we have the following iterative method:
$$ X_{k+1} = X_k - \theta_k\,{}^{H}\nabla_{X_{k-1}^+}^{\alpha(X_k),\mu;\psi} f(\Psi(X_k)), $$ (75)
where the fractional gradient is given by
$$ {}^{H}\nabla_{X_{k-1}^+}^{\alpha(X_k),\mu;\psi} f(\Psi(X_k)) = \Bigl(\frac{\partial f}{\partial x}(\Psi(X_{k-1}))\,\frac{\bigl|\psi(x_k)-\psi(x_{k-1})\bigr|^{1-\alpha(X_k)}}{\Gamma(2-\alpha(X_k))},\ \frac{\partial f}{\partial y}(\Psi(X_{k-1}))\,\frac{\bigl|\psi(y_k)-\psi(y_{k-1})\bigr|^{1-\alpha(X_k)}}{\Gamma(2-\alpha(X_k))}\Bigr). $$ (76)
The implementation of (76) is presented in Algorithm 3. Likewise, with a variable fractional order α X , the following theorem follows.
Theorem 10.
If the algorithm (75) is convergent, where the fractional gradient is given by (76), then it converges to the minimum point of f(Ψ(·)).
The proof of this result follows the same reasoning of the proof of Theorem 8, and therefore, is omitted.
Algorithm 3: 2D ψ -FGM with variable fractional order and optimized step size.
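The following sketch is our own rendering of Algorithm 3: the variable order (74) combined with a step size θ_k chosen in each iteration by minimizing ϕ(θ_k). Since the paper's pseudocode is an image and the one-dimensional minimization routine is not reproduced here, we use a simple golden-section search on a bounded interval as a stand-in; the search interval, tolerances, and starting points are arbitrary assumptions.

```python
from math import gamma, atan, pi

def golden_section_min(phi, lo=0.0, hi=2.0, tol=1e-6):
    """Minimize a one-dimensional function phi on [lo, hi] by golden-section
    search (our stand-in for the step-size optimization in Algorithm 3)."""
    invphi = (5**0.5 - 1) / 2
    a, b = lo, hi
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2

def psi_fgm_2d_trained(f, grad_f, psi, X0, X1, beta=0.1, eps=1e-9, max_iter=10000):
    """2D psi-FGM with variable fractional order (74) and step size theta_k
    optimized in each iteration, following the update (75)-(76)."""
    (xm, ym), (x, y) = X0, X1
    for _ in range(max_iter):
        J = f(psi(x), psi(y))                           # loss J(X) = f(Psi(X))
        alpha = 1.0 - (2.0 / pi) * atan(beta * J)       # variable order (74)
        gx, gy = grad_f(psi(xm), psi(ym))
        dx = gx / gamma(2 - alpha) * abs(psi(x) - psi(xm))**(1 - alpha)
        dy = gy / gamma(2 - alpha) * abs(psi(y) - psi(ym))**(1 - alpha)
        phi = lambda t: f(psi(x - t * dx), psi(y - t * dy))   # phi(theta_k)
        theta = golden_section_min(phi)
        xm, ym, x, y = x, y, x - theta * dx, y - theta * dy
        if ((x - xm)**2 + (y - ym)**2)**0.5 <= eps:
            break
    return x, y

# Matyas function with psi(x) = x
f2 = lambda u, v: 0.26 * (u*u + v*v) - 0.48 * u * v
grad_f2 = lambda u, v: (0.52*u - 0.48*v, 0.52*v - 0.48*u)
print(psi_fgm_2d_trained(f2, grad_f2, lambda t: t, (1.50, 2.50), (1.45, 2.45)))
# approaches the minimizer (0, 0)
```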

5.2.3. Numerical Simulations

In this section, we implement Algorithms 2 and 3 to find the local minimum point of the function f(Ψ(·)) for particular choices of f and ψ. For the function f, we consider the following cases:
  • f₁(x, y) = 4x² − 4xy + 2y², with minimum point at (0, 0);
  • Matyas function: f₂(x, y) = 0.26(x² + y²) − 0.48xy, with minimum point at (0, 0);
  • Wayburn and Seader No. 1 function: f₃(x, y) = (x⁶ + y⁴ − 17)² + (2x + y − 4)², with minimum point at (1, 2).
The function f 1 is a classic convex quadratic function in R 2 and can be considered an academic example for implementing our algorithms. The choice of functions f 2 and f 3 is due to the fact that they are benchmark functions used to test and evaluate several characteristics of optimization algorithms, such as convergence rate, precision, robustness, and general performance. More precisely, the Matyas function has a plate shape and the Wayburn and Seader No. 1 function has a valley shape, which implies slow convergence to the minimum point of the corresponding function. For the functions ψ in the vector function Ψ , we consider the choices
  • ψ₁(x) = x, with x ∈ I = [0, +∞[;
  • ψ₂(x) = x², with x ∈ I = [0, +∞[;
  • ψ₃(x) = ln x, with x ∈ I = [1, +∞[.
For the numerical simulations, we consider some combinations of the functions f_i, i = 1, 2, 3, and ψ_j, j = 1, 2, 3, and compare the results in the following scenarios:
  • Algorithm 3 with α(X) = 1, which corresponds to the classical 2D gradient descent method with optimized step size;
  • Algorithm 2 with α = 0.8 and step size θ = 0.1;
  • Algorithm 3 with α(X) = 1 − (2/π)arctan(βJ(X)), with β = 0.1.
Figure 6 shows the target functions f₁(Ψ₁(X)) = 4x² − 4xy + 2y² and f₁(Ψ₂(X)) = 4x⁴ − 4x²y² + 2y⁴, both with a local minimum point at X* = (0, 0). Figure 7 and Figure 8 show the X_k iterates in the corresponding contour plots of the functions. The plots on the right show a magnification close to the minimum point. The results of the numerical simulations are summarized in Table 1. The stopping criterion used was ε = 10⁻⁹.
In Table 1, we present the information concerning the X_k iterates of the implemented algorithms. When we consider Ψ₁, we reach the global minimum point in the three cases; however, it is clear that Algorithm 2 gives the worst results in terms of speed, whereas the classical case and Algorithm 3 have similar results. If we consider Ψ₂ and restrict our analysis to the two fastest algorithms, we conclude that, in this case, Algorithm 3 provides a more accurate approximation in fewer iterations. We point out that the objective function f₁(Ψ₂(·)) is flatter near the minimum point than the objective function f₁(Ψ₁(·)), which leads to an optimization problem that is more challenging from the numerical point of view.
In the next set of figures and tables, we consider the Matyas function to test our algorithms. More precisely, we consider the functions f₂(Ψ₁(X)) = 0.26(x² + y²) − 0.48xy, with local minimum at the point X* = (0, 0), and f₂(Ψ₃(X)) = 0.26(ln²x + ln²y) − 0.48 ln x ln y, with local minimum at the point X* = (1, 1).
Figure 9 shows both functions f₂(Ψ₁(·)) and f₂(Ψ₃(·)), which have a plate-like shape. Considering the same stopping criterion, we obtain the following results.
From the analysis of Table 2, we see that the three methods converge in the case of f₂(Ψ₁); however, Algorithm 2 is the worst in terms of the number of iterations. The classical case and Algorithm 3 have similar results in terms of precision, but Algorithm 3 performs better in terms of the number of iterations. In the case of f₂(Ψ₃), Algorithm 2 diverges and the other two methods converge. Algorithm 3 required about half the iterations of the classical gradient method.
Table 2. Information about the k iterations associated with Figure 10 and Figure 11.
f₂(Ψ₁(X)):
  • Classical gradient: α = 1.0, θ optimized, k = 43, X₀ = (1.50, 2.50), X_k = (8.9458 × 10⁻⁹, 8.8320 × 10⁻⁹), ‖X_k − X*‖ = 1.2571 × 10⁻⁸.
  • Algorithm 2: α = 0.8, θ = 1.0, k = 94777, X₀ = (0.75, 1.25), X_k = (1.7395 × 10⁻⁶, 1.7395 × 10⁻⁶), ‖X_k − X*‖ = 2.4600 × 10⁻⁶.
  • Algorithm 3: α variable, θ optimized, k = 27, X₀ = (1.50, 2.50), X_k = (1.4056 × 10⁻⁸, 1.3811 × 10⁻⁸), ‖X_k − X*‖ = 1.9727 × 10⁻⁸.
f₂(Ψ₃(X)):
  • Classical gradient: α = 1.0, θ optimized, k = 51, X₀ = (1.50, 2.50), X_k = (1, 1), ‖X_k − X*‖ = 0.5 × 10⁻¹⁶.
  • Algorithm 2: α = 0.8, θ = 0.1, divergence, X₀ = (0.75, 1.25).
  • Algorithm 3: α variable, θ optimized, k = 29, X₀ = (1.50, 2.50), X_k = (1, 1), ‖X_k − X*‖ = 0.5 × 10⁻⁶.
In the final set of figures and tables, the function f₃ is composed with Ψ₁ and Ψ₂. Taking into account the results obtained previously for the Matyas function, where it is clear that Algorithm 2 leads to worse results in terms of speed and accuracy, we only implemented the classical gradient method and Algorithm 3. The following figure shows the graphical representation of f₃(Ψ₁(X)) = (x⁶ + y⁴ − 17)² + (2x + y − 4)², with local minimum at the point X* = (1, 2), and of f₃(Ψ₂(X)) = (x¹² + y⁸ − 17)² + (2x² + y² − 4)², with a local maximum at X* = (0, 0) (that is, a local minimum of the function −f₃(Ψ₂(X))).
The plots in Figure 12 show that both functions are valley-shaped. Figure 13 and Figure 14 and Table 3 show the obtained numerical results.
In this last case, we see that the classic gradient method and Algorithm 3 provide very good approximations. Algorithm 3 performs better in terms of the number of iterations. For instance, in the case of f 3 ( Ψ 1 ) , the number of iterations decreased around 97% in comparison with the classical gradient method.

6. Conclusions

In this work, we studied the classical gradient method from the perspective of the ψ-Hilfer fractional derivative. In the first part of the article, we considered the continuous gradient method and performed the convergence analysis for the strongly and non-strongly convex cases. The identification of Lyapunov-type functions, together with the auxiliary results proved, allowed us to establish the convergence of the generated trajectories in the ψ-Hilfer setting.
In the second part of the paper, we first showed that the ψ-Hilfer FGM, with the ψ-Hilfer gradient given as a power series, can converge to a point different from the extreme point. To overcome this problem, we proposed an algorithm with a variable lower limit of integration, reducing the influence of long-time memory terms. By truncating the higher-order terms, we obtained the ψ-FGM, which is easy to implement in practice. Furthermore, we optimized the step size in each iteration and considered a variable order of differentiation to increase the precision and speed of the method. These two tunable parameters improved the performance of the method in terms of speed of convergence.
Our numerical simulations showed that the proposed FGM achieved approximations with equal or better precision, in far fewer iterations than the classical gradient method with optimized step size. We emphasize that in our 2D numerical simulations, the Matyas function and the Wayburn and Seader No. 1 function are well-known benchmark functions used to test optimization methods. These functions have plate and valley shapes, respectively, representing an extra challenge for the numerical simulations.
In future works, it would be interesting to further develop this theory to see its application in the field of convolutional neural networks.

Author Contributions

Conceptualization, N.V., M.M.R. and M.F.; investigation, N.V., M.M.R. and M.F.; writing—original draft, N.V., M.M.R. and M.F.; writing—review and editing, N.V., M.M.R. and M.F. All authors have read and agreed to the published version of the manuscript.

Funding

The work of the authors was supported by Portuguese funds through the CIDMA (Center for Research and Development in Mathematics and Applications) and FCT (Fundação para a Ciência e a Tecnologia), within projects UIDB/04106/2020 and UIDP/04106/2020. N. Vieira was also supported by FCT via the 2018 FCT program of Stimulus of Scientific Employment—Individual Support (Ref: CEECIND/01131/2018).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lin, J.Y.; Liao, C.W. New IIR filter-based adaptive algorithm in active noise control applications: Commutation error-introduced LMS algorithm and associated convergence assessment by a deterministic approach. Automatica 2008, 44, 2916–2922.
  2. Pu, Y.F.; Zhou, J.L.; Zhang, Y.; Zhang, N.; Huang, G.; Siarry, P. Fractional extreme value adaptive training method: Fractional steepest descent approach. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 653–662.
  3. Ge, Z.W.; Ding, F.; Xu, L.; Alsaedi, A.; Hayat, T. Gradient-based iterative identification method for multivariate equation-error autoregressive moving average systems using the decomposition technique. J. Frankl. Inst. 2019, 356, 1658–1676.
  4. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
  5. Wong, C.C.; Chen, C.C. A hybrid clustering and gradient descent approach for fuzzy modeling. IEEE Trans. Syst. Man Cybern. Part B Cybern. 1999, 29, 686–693.
  6. Ren, Z.G.; Xu, C.; Zhou, Z.C.; Wu, Z.Z.; Chen, T.H. Boundary stabilization of a class of reaction-advection-diffusion systems via a gradient-based optimization approach. J. Frankl. Inst. 2019, 356, 173–195.
  7. Tan, Y.; He, Z.; Tian, B. A novel generalization of modified LMS algorithm to fractional order. IEEE Signal Process. Lett. 2015, 22, 1244–1248.
  8. Cheng, S.S.; Wei, Y.H.; Chen, Y.Q.; Li, Y.; Wang, Y. An innovative fractional order LMS based on variable initial value and gradient order. Signal Process. 2017, 133, 260–269.
  9. Raja, M.A.Z.; Chaudhary, N.I. Two-stage fractional least mean square identification algorithm for parameter estimation of CARMA systems. Signal Process. 2015, 107, 327–339.
  10. Shah, S.M.; Samar, R.; Khan, N.M.; Raja, M.A.Z. Design of fractional-order variants of complex LMS and NLMS algorithms for adaptive channel equalization. Nonlinear Dyn. 2017, 88, 839–858.
  11. Viera-Martin, E.; Gómez-Aguilar, J.F.; Solís-Pérez, J.E.; Hernández-Pérez, J.A.; Escobar-Jiménez, R.F. Artificial neural networks: A practical review of applications involving fractional calculus. Eur. Phys. J. Spec. Top. 2022, 231, 2059–2095.
  12. Sheng, D.; Wei, Y.; Chen, Y.; Wang, Y. Convolutional neural networks with fractional order gradient method. Neurocomputing 2020, 408, 42–50.
  13. Wang, Y.L.; Jahanshahi, H.; Bekiros, S.; Bezzina, F.; Chu, Y.M.; Aly, A.A. Deep recurrent neural networks with finite-time terminal sliding mode control for a chaotic fractional-order financial system with market confidence. Chaos Solitons Fractals 2021, 146, 110881.
  14. Chen, Y.Q.; Gao, Q.; Wei, Y.H.; Wang, Y. Study on fractional order gradient methods. Appl. Math. Comput. 2017, 314, 310–321.
  15. Wei, Y.; Kang, Y.; Yin, W.; Wang, Y. Generalization of the gradient method with fractional order gradient direction. J. Frankl. Inst. 2020, 357, 2514–2532.
  16. Samko, S.G.; Kilbas, A.A.; Marichev, O.I. Fractional Integrals and Derivatives: Theory and Applications; Gordon and Breach: New York, NY, USA, 1993.
  17. Kilbas, A.A.; Srivastava, H.M.; Trujillo, J.J. Theory and Applications of Fractional Differential Equations; North-Holland Mathematics Studies; Elsevier: Amsterdam, The Netherlands, 2006; Volume 204.
  18. Almeida, R. A Caputo fractional derivative of a function with respect to another function. Commun. Nonlinear Sci. Numer. Simul. 2017, 44, 460–481.
  19. Sousa, J.V.C.; Oliveira, E.C. On the ψ-Hilfer derivative. Commun. Nonlinear Sci. Numer. Simul. 2018, 60, 72–91.
  20. Gorenflo, R.; Kilbas, A.A.; Mainardi, F.; Rogosin, S.V. Mittag-Leffler Functions, Related Topics and Applications; Springer: Berlin/Heidelberg, Germany, 2020.
  21. Hai, P.V.; Rosenfeld, J.A. The gradient descent method from the perspective of fractional calculus. Math. Meth. Appl. Sci. 2021, 44, 5520–5547.
  22. Kucche, K.D.; Mali, A.D.; Sousa, J.V.C. On the nonlinear ψ-Hilfer fractional differential equations. Comput. Appl. Math. 2019, 38, 73.
Figure 1. Algorithm 1 for different orders of differentiation, α.
Figure 2. Algorithm 1 for different step sizes, θ.
Figure 3. Algorithm 1 for different initial approximations, x₀.
Figure 4. Algorithm 1 for different variable orders of differentiation.
Figure 5. Algorithm 1 with α(x) = 1 − tanh(βJ(x)) and different values of β.
Figure 6. Graphical representations of f₁(Ψ₁(X)) and f₁(Ψ₂(X)).
Figure 7. Iterates for f₁(Ψ₁(X)).
Figure 8. Iterates for f₁(Ψ₂(X)).
Figure 9. Graphical representations of f₂(Ψ₁(X)) and f₂(Ψ₃(X)).
Figure 10. Iterates for f₂(Ψ₁(X)).
Figure 11. Iterates for f₂(Ψ₃(X)).
Figure 12. Graphical representations of f₃(Ψ₁(X)) and f₃(Ψ₂(X)).
Figure 13. Iterates for f₃(Ψ₁(X)).
Figure 14. Iterates for f₃(Ψ₂(X)).
Table 1. Information about the k iterations associated with Figure 7 and Figure 8.
f₁(Ψ₁(X)):
  • Classical gradient: α = 1.0, θ optimized, k = 49, X₀ = (1.50, 2.50), X_k = (7.8188 × 10⁻¹¹, 1.3031 × 10⁻¹⁰), ‖X_k − X*‖ = 1.5197 × 10⁻¹⁰.
  • Algorithm 2: α = 0.8, θ = 0.1, k = 2480, X₀ = (0.75, 1.25), X_k = (3.3860 × 10⁻⁸, 5.3902 × 10⁻⁸), ‖X_k − X*‖ = 6.3655 × 10⁻⁸.
  • Algorithm 3: α variable, θ optimized, k = 50, X₀ = (1.50, 2.50), X_k = (9.3483 × 10⁻¹¹, 1.4213 × 10⁻¹⁰), ‖X_k − X*‖ = 1.7012 × 10⁻¹⁰.
f₁(Ψ₂(X)):
  • Classical gradient: α = 1.0, θ optimized, k = 77, X₀ = (1.50, 2.50), X_k = (3.3370 × 10⁻⁴, 5.5617 × 10⁻⁴), ‖X_k − X*‖ = 6.4860 × 10⁻⁴.
  • Algorithm 2: α = 0.8, θ = 0.1, divergence, X₀ = (0.75, 1.25).
  • Algorithm 3: α variable, θ optimized, k = 73, X₀ = (1.50, 2.50), X_k = (3.9399 × 10⁻⁴, 5.4845 × 10⁻⁴), ‖X_k − X*‖ = 6.7530 × 10⁻⁴.
Table 3. Information about the k iterations associated with Figure 13 and Figure 14.
f₃(Ψ₁(X)):
  • Classical gradient: α = 1.0, θ optimized, k = 2923, X₀ = (1.50, 2.50), X_k = (1, 2), ‖X_k − X*‖ = 0.5 × 10⁻¹⁶.
  • Algorithm 3: α variable, θ optimized, k = 71, X₀ = (1.50, 2.50), X_k = (1, 2), ‖X_k − X*‖ = 1.378364 × 10⁻¹⁰.
f₃(Ψ₂(X)):
  • Classical gradient: α = 1.0, θ optimized, k = 51, X₀ = (0.50, 0.25), X_k = (1.6446 × 10⁻¹², 1.3283 × 10⁻¹¹), ‖X_k − X*‖ = 1.3385 × 10⁻¹¹.
  • Algorithm 3: α variable, θ optimized, k = 39, X₀ = (0.50, 0.25), X_k = (1.0755 × 10⁻¹², 2.1842 × 10⁻¹¹), ‖X_k − X*‖ = 2.1868 × 10⁻¹¹.