Proceeding Paper

Optimization of Graded Arrays of Resonators for Energy Harvesting in Sensors as a Markov Decision Process Solved via Reinforcement Learning †

Luca Rosafalco *, Jacopo Maria De Ponti, Luca Iorio, Raffaele Ardito and Alberto Corigliano
Dipartimento di Ingegneria Civile ed Ambientale, Politecnico di Milano, Piazza L. Da Vinci 32, 20133 Milano, Italy
* Author to whom correspondence should be addressed.
Presented at the 9th International Electronic Conference on Sensors and Applications, 1–15 November 2022; Available online: https://ecsa-9.sciforum.net/.
Eng. Proc. 2022, 27(1), 18; https://doi.org/10.3390/ecsa-9-13216
Published: 1 November 2022

Abstract

The design optimization of the grading of a resonator array for energy harvesting in sensors is described. Attention is paid to setting the resonator heights, possibly removing resonators whenever convenient. Instead of employing time-consuming heuristic approaches, which require the physical understanding of the problem to be verified and the parameters ruling the design to be tuned, the optimization task is treated as a Markov decision process, in which states describe specific system configurations and actions represent the modifications to the current design. The physics-based understanding of the problem is exploited to constrain the set of possible modifications to the mechanical system. Finite element simulations are used to evaluate the effects of the actions and to inform the reinforcement learning agent. The proximal policy optimization algorithm is employed to solve the Markov decision problem. The procedure is shown to automatically produce configurations that enhance the performance of the mechanical system. The proposed framework is generalizable to a large class of problems involving the design optimization of sensors.

1. Introduction

An elastic waveguide with a graded array of resonant bars was proposed for energy harvesting in [1,2], with possible applications in microsystems. This metamaterial structure features a spatial variation of its mechanical properties that allows propagating waves to be manipulated. Specifically, the grading both enhances the wavefield amplitude in the resonator endowed with the harvester, typically realized through a piezoelectric material, and increases the interaction time between the waves and the resonators. Our aim is to improve the energy harvesting capacity by tuning the lengths of the resonator bars. With a similar goal, refs. [3,4] compared different grading laws.
The optimization of a mechanical system can be automated by relying on gradient-based methods, genetic algorithms [5], or particle swarm optimization [6]. However, as discussed in [7], the first family of approaches is negatively affected by the nonlinear dependence of the optimization objective on the design parameters; the second suffers from a high computational cost; and the third requires constraining some parameters of the optimization algorithm without any clear indication of how to do so.
As done in [8], we propose to treat the optimization task as a Markov decision process (MDP), in which states describe specific configurations and actions represent the modifications to the current design. The solution of the MDP relies on reinforcement learning (RL) and, in particular, on the Proximal Policy Optimization (PPO) algorithm [9]. Finite Element (FE) simulations of wave propagation are exploited to provide information to the RL agent. In [10], experimental data were used with the same goal.
Another aspect of interest is the description adopted for the possible system configurations. Indeed, the physical understanding of the problem has suggested setting the resonator lengths, and possibly modifying the number of resonators, through a few interpolation points and B-spline interpolation, similarly to what was done in [11] for structural shape optimization.
The proposed procedure will be shown to lead to suboptimal configurations that enhance the performance of the mechanical system with respect to previously proposed configurations. The interest of the approach lies in its possible application to a large class of optimization problems involved in the design of sensors.
The remainder of the paper is arranged as follows. The proposed methodology is detailed in Section 2, while the results relevant to the optimization of a rainbow-based metamaterial for energy harvesting are reported in Section 3. Final considerations are collected in Section 4.

2. Methodology

The metamaterial optimization is organized as a sequence of $T$ actions $A_t$, with $t = 1, \dots, T$, taken by an agent, each producing a modification of the system state $S_t$. The performance of the obtained configuration is measured by the reward $R_t$, here defined as the time average of the elastic energy of the bar endowed with the harvester. This quantity is strictly related to the energy obtained by exploiting a piezoelectric material to convert mechanical energy into electrical energy. States and rewards define the environment in which the agent plays. Given that the probability of getting into a state $S_t$ depends only on $S_{t-1}$ and on $A_{t-1}$, an MDP was used to formalize the sequential decision process. Considering a certain state $S_t$, the optimization problem coincides with the maximization of the expected return $G_t$, defined as
$$G_t = R_{t+1} + R_{t+2} + \dots + R_T.$$
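To make the MDP formulation concrete, the following Python sketch outlines a minimal environment exposing states, actions, and rewards as described above; the class and method names, as well as the finite element solver passed in as a callable, are illustrative assumptions rather than the implementation actually used here.

```python
# Minimal sketch of the MDP environment (illustrative names; the FE solver is a placeholder).
import numpy as np

class ResonatorArrayEnv:
    """State: coordinates of the interpolation points defining the grading.
    Action: increments of those coordinates. Reward: time-averaged elastic
    energy of the harvester bar, evaluated by a finite element simulation."""

    def __init__(self, initial_state, horizon, fe_solver):
        self.initial_state = np.asarray(initial_state, dtype=float)
        self.horizon = horizon        # number of actions T per episode
        self.fe_solver = fe_solver    # callable: state -> reward R_t
        self.reset()

    def reset(self):
        self.state = self.initial_state.copy()
        self.t = 0
        return self.state

    def step(self, action):
        # Markov property: S_t depends only on S_{t-1} and A_{t-1}
        self.state = self.state + np.asarray(action, dtype=float)
        self.t += 1
        reward = self.fe_solver(self.state)   # time-averaged elastic energy
        done = self.t >= self.horizon
        return self.state, reward, done

def undiscounted_return(rewards):
    # G_t = R_{t+1} + R_{t+2} + ... + R_T
    return float(np.sum(rewards))
```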
The agent's actions are guided by a policy $\pi$, here treated as a stochastic entity associating a Probability Density Function (PDF) over the set of possible actions with a given state of the system. Stochasticity is required to allow the exploration of the state space. To understand whether a policy $\pi$ is preferable to a second policy $\pi'$, value functions $v_\pi(s)$ are used, where $s$ is a possible realization of the random variable $S_t$ describing the state at time $t$. Value functions are defined as
$$v_\pi(s) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right],$$
where $\mathbb{E}_\pi$ denotes the expected value of $G_t$ starting from $S_t = s$ and using $\pi$ to guide the following actions. Two other quantities, namely the action–value function $q_\pi(s, a)$ and the advantage function $d_\pi(s, a)$, are similarly defined as
$$q_\pi(s, a) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s,\, A_t = a \right],$$
$$d_\pi(s, a) = q_\pi(s, a) - v_\pi(s).$$
The notion of $d_\pi(s, a)$ is exploited by PPO, a policy gradient algorithm. This family of RL approaches explicitly looks for the best policy $\pi^*$ by exploiting a (large) number of agent–environment interactions. The outcome of the procedure is typically a suboptimal policy. However, approximating $\pi^*$ does not preclude enhancing the system performance with respect to already known configurations.
Before presenting PPO, we discuss the description adopted for the states. The possibility of representing the state through a vector collecting the resonator lengths was discarded because modifying the resonator lengths one by one produces reward alterations too small to inform the RL agent. A more convenient option is to employ a limited number $N_s$ of continuous variables, constraining the state and action spaces through the enforcement of smoothly graded patterns of the resonator lengths. This strategy is motivated by the problem insight gained in previous works [1,2]. Specifically, the coordinates of a few points were employed as state variables, while the envelope of the resonator array was obtained by interpolating these points through cubic B-splines. Figure 1 exemplifies the adopted state representation. Actions coincide with modifying the coordinates of the light blue stars, as further specified in Section 3.
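As a concrete illustration of this state description, the sketch below builds the envelope of the resonator lengths by cubic spline interpolation of a few control points using SciPy; the numerical values of the control points are placeholders, not the coordinates actually used in this work.

```python
# Sketch: from a few control points to a smoothly graded set of resonator lengths.
import numpy as np
from scipy.interpolate import make_interp_spline

# x positions and heights (z) of the interpolation points (placeholder values)
x_ctrl = np.array([0.0625, 0.0697, 0.0720, 0.0745])    # m
z_ctrl = np.array([3.0e-4, 6.5e-4, 5.028e-4, 4.0e-4])  # m

# Cubic spline envelope passing through the control points
envelope = make_interp_spline(x_ctrl, z_ctrl, k=3)

# Resonator positions along the waveguide and the resulting lengths
x_res = np.linspace(x_ctrl[0], x_ctrl[-1], 25)
lengths = envelope(x_res)

# Rule adopted in Section 3: drop resonators shorter than l_h / 20
l_h = 5.028e-4
keep = lengths >= l_h / 20.0
x_res, lengths = x_res[keep], lengths[keep]
print(f"{keep.sum()} resonators kept out of 25")
```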
Handling continuous state and action spaces forces one to approximate $v_\pi(s)$ and $q_\pi(s, a)$ by parametric functions
$$v_\pi(s) \approx v(s, \boldsymbol{\theta}_v),$$
$$q_\pi(s, a) \approx q(s, a, \boldsymbol{\theta}_q),$$
whose tunable weights are collected in $\boldsymbol{\theta}_v \in \mathbb{R}^{N_{\theta_v}}$ and in $\boldsymbol{\theta}_q \in \mathbb{R}^{N_{\theta_q}}$, respectively. A similar treatment was applied to the advantage function, $d_\pi(s, a) \approx d(s, a, \boldsymbol{\theta}_v)$.
By associating the PDF of a Gaussian distribution with the policy, a tunable parametric function was exploited to map the state to the statistical moments of the PDF, namely the mean $\mu(s, \boldsymbol{\theta}_p)$ and the standard deviation $\sigma(s, \boldsymbol{\theta}_p)$. The tuning of the weights of both the advantage function and the function outputting the policy moments is carried out through PPO. In particular, two fully connected Neural Networks (NNs), featuring 32 neurons in each layer, were employed to model $d(s, a, \boldsymbol{\theta}_v)$ and the function with outputs $\mu(s, \boldsymbol{\theta}_p)$ and $\sigma(s, \boldsymbol{\theta}_p)$. Thanks to NN differentiability, $\boldsymbol{\theta}_p$ is updated to maximize the objective function of PPO
$$L(\boldsymbol{\theta}_p) = \hat{\mathbb{E}}_e\!\left[ \min\!\left( \frac{\pi(a \mid s, \boldsymbol{\theta}_p)}{\pi_{\mathrm{old}}(a \mid s, \boldsymbol{\theta}_p^{\mathrm{old}})}\, \hat{d}_e(s, a, \boldsymbol{\theta}_v),\; \mathrm{clip}\!\left( \frac{\pi(a \mid s, \boldsymbol{\theta}_p)}{\pi_{\mathrm{old}}(a \mid s, \boldsymbol{\theta}_p^{\mathrm{old}})},\, 1 - \epsilon,\, 1 + \epsilon \right) \hat{d}_e(s, a, \boldsymbol{\theta}_v) \right) \right],$$
via Adam [12], where $\epsilon = 0.2$; $\hat{\mathbb{E}}_e$ and $\hat{d}_e$ are computed over $N_e$ episodes; and an episode is a complete sequence of agent–environment interactions $t = 1, \dots, T$.
Specifically, indicating by $y(\boldsymbol{\theta}_p)$ the ratio between $\pi(a \mid s, \boldsymbol{\theta}_p)$ and $\pi_{\mathrm{old}}(a \mid s, \boldsymbol{\theta}_p^{\mathrm{old}})$, the “min” and “clip” operations allow one to define the following piecewise quantity
$$\begin{cases} y(\boldsymbol{\theta}_p)\, \hat{d}_e(s, a, \boldsymbol{\theta}_v) & \text{for } \hat{d}_e(s, a, \boldsymbol{\theta}_v) > 0 \text{ and } y(\boldsymbol{\theta}_p) < 1 + \epsilon, \text{ or } \hat{d}_e(s, a, \boldsymbol{\theta}_v) < 0 \text{ and } y(\boldsymbol{\theta}_p) > 1 - \epsilon, \\ (1 + \epsilon)\, \hat{d}_e(s, a, \boldsymbol{\theta}_v) & \text{for } \hat{d}_e(s, a, \boldsymbol{\theta}_v) > 0 \text{ and } y(\boldsymbol{\theta}_p) > 1 + \epsilon, \\ (1 - \epsilon)\, \hat{d}_e(s, a, \boldsymbol{\theta}_v) & \text{for } \hat{d}_e(s, a, \boldsymbol{\theta}_v) < 0 \text{ and } y(\boldsymbol{\theta}_p) < 1 - \epsilon, \end{cases}$$
whose expected value is the objective of PPO. The update of $d(s, a, \boldsymbol{\theta}_v)$ is conducted separately every $N_e$ episodes, according to the actor–critic scheme of the PPO algorithm [13]. Additional details on PPO can be found in [9].
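For readers unfamiliar with the clipped surrogate, the following PyTorch sketch shows a Gaussian policy with two 32-neuron hidden layers and the corresponding clipped objective; it is a generic PPO fragment consistent with the description above, not the code used for the present optimization, and the advantage estimates are assumed to come from the separately trained critic network.

```python
# Sketch of the Gaussian policy and the PPO clipped objective (generic PPO fragment).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu_head = nn.Linear(hidden, action_dim)      # mean mu(s, theta_p)
        self.sigma_head = nn.Linear(hidden, action_dim)   # std sigma(s, theta_p)

    def forward(self, state):
        h = self.body(state)
        mu = self.mu_head(h)
        sigma = F.softplus(self.sigma_head(h)) + 1e-5     # keep the std positive
        return mu, sigma

def ppo_clip_objective(policy, states, actions, advantages, old_log_probs, eps=0.2):
    """Clipped surrogate L(theta_p), to be maximized via Adam."""
    mu, sigma = policy(states)
    dist = torch.distributions.Normal(mu, sigma)
    log_probs = dist.log_prob(actions).sum(dim=-1)
    ratio = torch.exp(log_probs - old_log_probs)          # y(theta_p)
    unclipped = ratio * advantages                        # y * d_hat
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()           # empirical mean over episodes
```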

3. Results

To compute the reward related to a certain state, wave propagation is simulated through FEs for $T = 1.25 \times 10^{-5}$ s with a time step of $3 \times 10^{-9}$ s. The waveguide was discretized using 376 Euler–Bernoulli beam elements, while a mass–spring schematization was employed for the resonating bars. The lengths of the finite elements were set to $0.0344 \times 10^{-3}$ m in between the resonators and to $0.344 \times 10^{-3}$ m elsewhere. The mesh refinement was required to capture the effects of the resonator interactions. Two absorbing layers, one at the beginning of the waveguide and the other at the end, were placed to avoid reflections, as suggested in [14]. The employed material is aluminum, with density $\rho = 2710\ \mathrm{kg/m^3}$ and Young's modulus $E = 70\ \mathrm{GPa}$. Concerning the cross-sectional area and moment of inertia, those of the waveguide are $B_w = 3 \times 10^{-6}\ \mathrm{m^2}$ and $I_w = 2.5 \times 10^{-13}\ \mathrm{m^4}$, while the relevant moment of inertia $I_r$ of the resonating bars is equal to $0.4909 \times 10^{-13}\ \mathrm{m^4}$. An initial number of 25 resonators with spacing close to $\lambda_w/11$ was considered, where $\lambda_w = 1.8 \times 10^{-3}$ m is the wavelength of the flexural wave traveling on the elastic beam without resonators.
The excitation generating the propagating wave is reported in Figure 2. It mimics the one experimentally adopted in [2]. The frequency content of the excitation matches the first bending frequency $\omega_h = 17.67\ \mathrm{MHz}$ of the resonator endowed with the harvester. The four points depicted as light blue markers in Figure 1 were employed to define the arrangement of the resonating bars. Specifically, the number $N_s$ of continuous variables was set to 4. They coincide with the $z$ coordinates of the first and fourth points and with the $x$ and $z$ coordinates of the second point. The third point, placed at the tip of the bar equipped with the harvester, is fixed. The ordering of the agent actions was set, too; see Table 1.
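As a consistency check of the quoted value of $\omega_h$, the first bending frequency of the harvester bar can be recomputed with the classical clamped–free beam formula from the data reported above; the circular cross-section of 1 mm diameter is an assumption introduced here only to obtain a cross-sectional area compatible with $I_r$, and the check recovers the quoted numerical value when $\omega_h$ is read as an angular frequency.

```python
# Back-of-the-envelope check of the harvester-bar first bending frequency.
# The 1 mm circular cross-section is an assumption consistent with the reported I_r.
import math

E = 70e9           # Pa, Young's modulus of aluminum
rho = 2710.0       # kg/m^3, density of aluminum
I_r = 0.4909e-13   # m^4, moment of inertia of the resonating bars
d = 1.0e-3         # m, assumed diameter (pi * d**4 / 64 = I_r)
A = math.pi * d**2 / 4.0
l_h = 5.028e-4     # m, harvester bar length

# First bending mode of a clamped-free beam: omega = (beta1*L)^2 * sqrt(E*I / (rho*A*L^4))
omega_h = 1.875**2 * math.sqrt(E * I_r / (rho * A * l_h**4))
print(f"omega_h ≈ {omega_h:.3e} rad/s")   # ≈ 1.77e7, consistent with the quoted value
```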
Except for the way in which the state space was constrained, no other physical knowledge of the system was exploited. As the starting state, the $z$ coordinates of all the points were set equal to the length of the harvester bar, $l_h = 5.028 \times 10^{-4}$ m. The range of variation of the point coordinates is also reported in Table 1. The value $l_{\max} = 9.156 \times 10^{-4}$ m allows a 10% attenuation of the forced response of a bar with length $l_{\max}$ and moment of inertia $I_r$ excited by an oscillating force with frequency equal to $\omega_h$. If the interpolation yields bars with lengths smaller than $l_h/20$, they are removed from the system, thereby enabling the number of resonators to change.
The outcomes of the optimization process are evaluated in terms of the reward $R_T$ of the configuration reached at the end of the episode. This value is divided by the reward $R_T^H$ of the waveguide featuring just one resonator. The aim is to assess the performance improvement with respect to the configuration featuring a linear grading, reported in Figure 3b and originally proposed in [1] on the basis of physical considerations.
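To make the performance metric explicit, the following sketch computes the time-averaged elastic energy of the harvester bar from a displacement history and normalizes it by the harvester-only value; the lumped stiffness and the displacement arrays are placeholders for quantities produced by the FE simulations, and the quadratic energy expression assumes the mass–spring schematization described above.

```python
# Sketch: reward as the time-averaged elastic energy of the harvester bar,
# normalized by the harvester-only configuration (placeholder inputs).
import numpy as np

def average_elastic_energy(u_rel, k_eq):
    """Time average of 0.5 * k * u^2 for the lumped spring of the harvester bar."""
    return float(np.mean(0.5 * k_eq * np.asarray(u_rel) ** 2))

def normalized_reward(u_rel, u_rel_harvester_only, k_eq):
    R_T = average_elastic_energy(u_rel, k_eq)                    # optimized configuration
    R_T_H = average_elastic_energy(u_rel_harvester_only, k_eq)   # single-resonator waveguide
    return R_T / R_T_H   # values around 3.5-3.7 correspond to Figure 3b-d
```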
Two resonator arrangements were found by the RL agent. The best discovered configuration, depicted in Figure 3c, was generated after roughly 5000 agent–environment interactions, well before the total number of interactions, here set to $N_I = 100{,}000$, ran out. Instead, the converged RL policy configuration shown in Figure 3d was produced by the quasi-deterministic policy obtained at the end of the agent training. This policy is a suboptimal solution of the MDP. The two arrangements outperform the linear grading rule by ≈4.7% and ≈1.0%, respectively.
The suboptimality of the converged RL policy and the better performance of the other discovered configuration should not be seen as undermining the value of the method. Indeed, the obtained configurations are close in terms of $R_T/R_T^H$, and they confirm the physical intuition about the problem. The reported best configuration was discovered within the first policy updates; a closer approximation of the optimal policy could have been obtained, but only at the cost of a huge increase in computational time [13]. On the contrary, the small number of agent–environment interactions needed to discover the configuration in Figure 3c promises a successful application of this RL- and MDP-based optimization approach to other sensor design problems, possibly involving more complex and time-demanding simulations, even in the realm of multiphysics.
Moreover, it is worth remembering that these configurations were obtained without exploiting the physical understanding of the problem, such as the notion that an initial linearly ascending grading both increases the interaction time between the waves and the resonators and enhances the wavefield amplitude in the resonator endowed with the harvester. Conversely, greater insight into surface wave propagation in rainbow-based structures could be gained by explaining the reasons behind the improved performance of the discovered configurations. For example, the concave curvature appearing at the beginning of the grading deserves deeper comprehension. Moreover, the best performance was obtained when the number of resonators was reduced to 23. These and other aspects are currently under investigation.

4. Conclusions

In this work, the grading optimization of a resonator array for energy harvesting, with possible applications in sensor design, was performed by exploiting an innovative reinforcement learning approach. Using a few points and interpolation functions to describe the space of the possible system states, the proximal policy optimization algorithm led to two resonator configurations, both improving the performance with respect to a reference linear grading rule. The optimization outcome confirmed the physical understanding of the problem already available, while promising to shed light on more subtle mechanical aspects. The procedure can be generalized to the optimization of other sensor systems.

Author Contributions

Conceptualization, L.R., J.M.D.P., R.A. and A.C.; methodology, formal analysis, and investigation, L.R. and J.M.D.P.; software, validation, resources, and visualization, L.R., J.M.D.P. and L.I.; writing—original draft preparation, L.R.; writing—review and editing, J.M.D.P., L.I., R.A. and A.C.; supervision, project administration, and funding acquisition, R.A. and A.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the H2020 FET-Proactive project Metamaterial-Enabled Vibration Energy Harvesting (MetaVEH) under Grant Agreement No. 952039.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. De Ponti, J.M.; Colombi, A.; Ardito, R.; Braghin, F.; Corigliano, A.; Craster, R.V. Graded elastic metasurface for enhanced energy harvesting. New J. Phys. 2020, 22, 013013. [Google Scholar] [CrossRef]
  2. De Ponti, J.M.; Colombi, A.; Riva, E.; Ardito, R.; Braghin, F.; Corigliano, A.; Craster, R.V. Experimental investigation of amplification, via a mechanical delay–line, in a rainbow–based metamaterial for energy harvesting. Appl. Phys. Lett. 2020, 117, 143902. [Google Scholar] [CrossRef]
  3. Alshaqaq, M.; Erturk, A. Graded multifunctional piezoelectric metastructures for wideband vibration attenuation and energy harvesting. Smart Mater. Struct. 2020, 30, 1–11. [Google Scholar] [CrossRef]
  4. Zhao, B.; Thomsen, H.R.; De Ponti, J.M.; Riva, E.; Van Damme, B.; Bergamini, A.; Chatzi, E.; Colombi, A. A graded metamaterial for broadband and high-capability piezoelectric energy harvesting. Energy Convers. Manag. 2022, 269, 116056. [Google Scholar] [CrossRef]
  5. Jenkins, W. Towards structural optimization via the genetic algorithm. Comput. Struct. 1991, 40, 1321–1327. [Google Scholar] [CrossRef]
  6. Perez, R.; Behdinan, K. Particle swarm approach for structural design optimization. Comput. Struct. 2007, 85, 1579–1588. [Google Scholar] [CrossRef]
  7. Viquerat, J.; Rabault, J.; Kuhnle, A.; Ghraieb, H.; Larcher, A.; Hachem, E. Direct shape optimization through deep reinforcement learning. J. Comput. Phys. 2021, 428, 110080. [Google Scholar] [CrossRef]
  8. Ororbia, M.E.; Warn, G.P. Design Synthesis Through a Markov Decision Process and Reinforcement Learning Framework. J. Comput. Inf. Sci. Eng. 2021, 22, 021002. [Google Scholar] [CrossRef]
  9. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  10. Fan, D.; Yang, L.; Wang, Z.; Triantafyllou, M.S.; Karniadakis, G.E. Reinforcement learning for bluff body active flow control in experiments and simulations. Proc. Natl. Acad. Sci. USA 2020, 117, 26091–26098. [Google Scholar] [CrossRef] [PubMed]
  11. Papadrakakis, M.; Lagaros, N.D.; Tsompanakis, Y. Structural optimization using evolution strategies and neural networks. Comput. Methods Appl. Mech. Eng. 1998, 156, 309–333. [Google Scholar] [CrossRef]
  12. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2015, arXiv:1412.6980. [Google Scholar]
  13. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  14. Rajagopal, P.; Drozdz, M.; Skelton, E.A.; Lowe, M.J.; Craster, R.V. On the use of absorbing layers to simulate the propagation of elastic waves in unbounded isotropic media using commercially available Finite Element packages. NDT E Int. 2012, 51, 30–40. [Google Scholar] [CrossRef]
Figure 1. Use of interpolation points (light blue markers) to define the envelope curve (dotted line) setting the resonator lengths. The resonator endowed with the harvester is plotted with an orange line. The circles recall the lumped mass-spring description adopted for the resonators.
Figure 2. Load applied to the rainbow-based metamaterial.
Figure 3. Optimized and reference configurations of the bar arrangement together with the relevant reward $R_T/R_T^H$. (a) Harvester-only configuration, $R_T/R_T^H = 1.000$; (b) reference optimized configuration, $R_T/R_T^H = 3.504$; (c) best RL-discovered configuration, $R_T/R_T^H = 3.669$; (d) converged RL policy configuration, $R_T/R_T^H = 3.537$.
Table 1. Description and ordering of the agent actions.
Action Ordering | What Is Modified | Variable Value at the Starting State | Range of Possible Values
1 | 1st point $z$ coordinate | $5.028 \times 10^{-4}$ m | $[0,\ 9.156 \times 10^{-4}]$ m
2 | 4th point $z$ coordinate | $5.028 \times 10^{-4}$ m | $[0,\ 9.156 \times 10^{-4}]$ m
3 | 2nd point $x$ coordinate | $0.0697$ m | $[0.0682,\ 0.0711]$ m
4 | 2nd point $z$ coordinate | $5.028 \times 10^{-4}$ m | $[0,\ 9.156 \times 10^{-4}]$ m
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
