Article

A Stability Training Method of Legged Robots Based on Training Platforms and Reinforcement Learning with Its Simulation and Experiment

Humanoid & Gorilla Robot and Its Intelligent Motion Control Laboratory, School of Mechatronics Engineering, Harbin Institute of Technology, Harbin 150001, China
*
Author to whom correspondence should be addressed.
Micromachines 2022, 13(9), 1436; https://doi.org/10.3390/mi13091436
Submission received: 23 June 2022 / Revised: 24 July 2022 / Accepted: 28 July 2022 / Published: 31 August 2022
(This article belongs to the Special Issue New Advances in Biomimetic Robots)

Abstract

This paper continues the proposed idea of stability training for legged robots with any number of legs and any size on a motion platform and introduces the concept of a learning-based controller, the global self-stabilizer, to obtain a self-stabilization capability in robots. The overall structure of the global self-stabilizer is divided into three modules: action selection, adjustment calculation and joint motion mapping, with corresponding learning algorithms proposed for each module. Taking the human-sized biped robot, GoRoBoT-II, as an example, simulations and experiments in three kinds of motions were performed to validate the feasibility of the proposed idea. A well-designed training platform was used to perform composite random amplitude-limited disturbances, such as the sagittal and lateral tilt perturbations (±25°) and impact perturbations (0.47 times the robot gravity). The results show that the proposed global self-stabilizer converges after training and can dynamically combine actions according to the system state. Compared with the controllers used to generate the training data, the trained global self-stabilizer increases the success rate of stability verification simulations and experiments by more than 20% and 15%, respectively.

1. Introduction

Compared with fixed-base industrial robots, mobile robots have broader application prospects because of their mobility and operational capacity. In particular, legged robots have mechanisms similar to those of animals and thus adapt to complex terrain better than robots with other locomotion modes, such as wheeled and tracked robots. However, the practicality of legged robots is still lower than that of wheeled and tracked robots because of the difficulties of balance control and perturbation recovery.
The early studies mainly focused on the balance control of walking motions. Based on the Zero Moment Point (ZMP) force reflection control proposed by Vukobratovic [1], a variety of balance control methods such as body posture control [2], ZMP damping control [3] and landing point adjustment control [4] were proposed and deployed successively on ASIMO, Petman and other robots. Since then, researchers have started to consider the influence of perturbations and proposed corresponding control methods according to different types of perturbations. Successful results have been obtained for tilt ground [5,6,7], uneven ground [8], external force impact [9,10,11] and other perturbations.
The above balance controllers generate a planned response for a specific perturbation and then calculate the control outputs that enable the robot to track a determined trajectory by solving the dynamical model (or simplified model) of that robot. Thus, these controllers can be collectively defined as model-based balance controllers. Such controllers have achieved many successful results in structured environments such as laboratories, but their application is limited in unstructured, complex environments where the robot may be subject to multiple, mutually compounding, and unpredictable perturbations.
Consequently, more and more studies have paid attention to obtaining the self-stabilization capability of legged robots by using learning-based methods which are able to obtain the optimal mapping from the system state to the joint adjustment. The relevant literature on learning-based balance control methods is summarized in Table 1.
As shown in Table 1, some studies did not consider any perturbation, and the rest of them applied one kind of perturbation—generally a specific perturbation in a single direction. Moreover, the dimensions of the state/action space defined in existing studies are relatively low, which means that the learning process is carried out locally in the whole state space. In addition, the scope of state migration is relatively small when the applied perturbation is simple. Therefore, even though successful results can be obtained in the laboratory, state confusion is prone to occur in local state spaces determined by only partial state variables when the existing controllers face complex perturbations in reality, which leads to the failure in maintaining balance. In addition, the learning algorithms are applied without considering the curse of dimensionality when the number of state/action variables increases.
To address the above problems, the authors in [24] have proposed the idea of robot stability training—that is, to simulate composite perturbations by the random amplitude-limited motion of a six-degrees-of-freedom (DOF) training platform on which the robot is trained, and to obtain the self-stabilization capability by reinforcement learning with feature selection. A stability training simulation [25] of a bipedal robot was performed under randomly varying ground tilt perturbation, which preliminarily verified the feasibility of this idea. Relevant studies in medicine and biology also corroborate the practicability of stability training—for example, studies on movement disorder syndrome [26], stroke rehabilitation [27] and mice anatomy [28] have shown that training an organism with a moving platform can enhance or rebuild its balance.
To distinguish it from balance controllers learned under a single disturbance, the robot self-stabilizer trained on a 6-DOF motion platform is called the global self-stabilizer, where "global" means that the training process traverses all kinds of environmental disturbances through the random amplitude-limited motion of the training platform. A robot self-stabilizer trained under such conditions acquires robustness to any environmental perturbation and, after sufficient training, can keep the robot stable under any perturbation within its driving capability.
Figure 1 compares the differences between the general balance controller of a legged robot and the global self-stabilizer in this study.
Figure 1a shows that a general robot balance controller uses a specific balance control law for different perturbations; the balance control of the robot is coupled with a specific motion, which means it is not universal.
As shown in Figure 1b, the global self-stabilizer is separated from the motion controller. The former finds the optimal joint increments according to the internal state/action map; the latter only needs to generate the reference motion according to the given motion parameters and is not affected by the global self-stabilizer. Thus, the two tasks (motion and balance) are independent: only the target motion and the driving capability of the robot are considered in the motion controller. In other words, the self-stabilization capability obtained is not limited to specific motions, and the global self-stabilizer, acting like a cerebellum, can be applied to any motion under any perturbation after being sufficiently trained.
In this paper, the stability training system of a legged robot with multiple legs is established and a general hierarchical structure of the global self-stabilizer is designed. The task of the proposed global self-stabilizer will be divided into three subtasks: action selection, adjustment calculation and joint motion mapping. Each subtask will be learned in different state spaces.
This paper is organized as follows: Section 2 describes the model of the training system and defines the state space of system variables and actions. Section 3 presents the three modules of the global self-stabilizer and their corresponding learning algorithms. Section 4 describes the simulated and experimental environments for stability training of a biped robot (GoRoBoT-II) and the balance controllers for generating training data. Section 5 and Section 6 present the simulation and experimental training processes and results, respectively. This paper is concluded in Section 7.

2. A General Model for Stability Training of Legged Robots

The basic idea of the legged robot stability training proposed by the authors is shown in Figure 2. During the training period, the robot stands on a training platform that performs a 6-DOF random amplitude-limited motion to simulate perturbations in the real world. The joint motion is generated by model-based balance controllers. The global self-stabilizer learns from the state transition data to obtain the optimal state/action mapping through reinforcement learning. After training, the converged global self-stabilizer can be used in uncertain environments to keep the robot stable.

2.1. Environmental Disturbance Simulation Method Based on the Motion Platform

A dedicated 6-DOF serial-parallel mechanism motion platform [24,29] was designed in the authors’ laboratory for generating composite perturbations during stability training. Its mechanism sketch is shown in Figure 3a. The reference frames ΣOB-xByBzB and ΣOP-xPyPzP are fixed to the ground and the platform, respectively. The motion of the moving platform can be represented by the displacements xP, yP, zP and 3-2-1 Euler angles θP1, θP2 and θP3 of the frame ΣOP with respect to frame ΣOB in Figure 3b. The pose vector can be expressed as XP = [xP, yP, zP, θP1, θP2, θP3]^T. Point C represents the center of mass (CoM) of the trained robot.
Two forms of perturbations, ground tilt perturbations and inertial force/moment perturbations, can be generated by the above platform. If the training platform performs a random amplitude-limited motion, the generated tilt perturbation angle β, inertial force perturbation FP and inertial moment perturbation MP will also be randomly distributed within a certain range, thus enabling a comprehensive simulation of perturbations in the real world.
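As an illustration only, the following Python sketch shows one simple way to generate such a random amplitude-limited swing-angle trajectory: at every control step a random angular acceleration is drawn within the acceleration limit, and the integrated velocity and position are clipped to their limits. The function name and sampling scheme are assumptions for illustration (the default limits follow the 2-DOF experimental platform described in Section 4.2), not the planning method of [24,29].

```python
import numpy as np

def random_amplitude_limited_motion(steps, dt=0.005,
                                    pos_lim=np.deg2rad(20),
                                    vel_lim=np.deg2rad(40),
                                    acc_lim=np.deg2rad(60)):
    """Generate one random amplitude-limited swing-angle trajectory.

    At every control step a random angular acceleration is drawn within the
    acceleration limit; velocity and position are then integrated and clipped
    to their limits, yielding a bounded but unpredictable platform motion.
    """
    theta, omega = 0.0, 0.0
    trajectory = np.empty(steps)
    for k in range(steps):
        alpha = np.random.uniform(-acc_lim, acc_lim)          # random acceleration
        omega = np.clip(omega + alpha * dt, -vel_lim, vel_lim)
        theta = np.clip(theta + omega * dt, -pos_lim, pos_lim)
        trajectory[k] = theta
    return trajectory

# Example: 20 s of platform pitch motion at a 5 ms control period
theta_P1 = random_amplitude_limited_motion(steps=4000)
```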

2.2. Model of the Training System

As shown in Figure 4, legged robots of any mechanical configurations and any size standing on the training platform can all be equated to a multi-branch chain rigid-body system with n1 (n1 ≥ 1) stance legs and n2 swing legs (n2 ≥ 0) if the motion in the air is not considered.
The reference frame ΣOS-xSySzS is established at the center of the theoretical support zone, and the motion of frame ΣOS with respect to frame ΣOP can represent the change in the contact state of the robot’s feet. In this study, situations in which the robot is completely in the air or the support foot slides on the training platform are not considered. Thus, only the 2-DOF flip motion of the theoretical support zone is analyzed, with the flip angles θS1 and θS2, respectively. Each swing leg can be viewed as an open chain mechanism with its root located at the torso. The swing leg reference frame ΣOFj-xFjyFjzFj is located at the center of the bottom surface of the jth swing foot (j = 1, 2…n2). The motion of the swing leg can be represented by the pose vector XFj—the pose of frame ΣOFj with respect to the torso frame ΣOT-xTyTzT.
To establish the system variable set for the above model, the variables that can be measured or estimated in this system are summarized in Table 2.
For robots with any number of legs and any configuration, the system variable set can be constructed according to Table 2. In addition, the state variables corresponding to each action will be selected from the system variable set in the subsequent stability training.

2.3. Action Set of Legged Robot

The action set, which stores the action variables and their adjustment equations, is the domain over which action selection is performed. An action is considered an active adjustment performed by the robot. Therefore, after excluding the system variables in the last two rows of Table 2, which cannot be actively adjusted, six types of actions are obtained: single-joint action, torso action, swing foot action, CoM action, inertial force/moment action and ZMP action (corresponding to the first six rows of Table 2, respectively).
In the stability training, the robot needs to accomplish three tasks simultaneously, i.e., tracking motion samples, resisting environmental (training platform) perturbations and avoiding joint limits. In the following, the six types of actions listed will be assigned to the three tasks mentioned above, and then the equation for action adjustment will be designed for each action. The parameters for each action are explained in Table 3.
(1)
Single-joint action. When the robot’s joint reaches its position limit, velocity limit or acceleration limit, the motion of the robot will be affected, so joint limit avoidance is required.
The angular accelerations $\ddot{\theta}_k$ (k = 1, 2, …, NJ) of the NJ joints of the robot are taken as the action variables of the single-joint action, so that the motion curves obtained by integrating the accelerations are smoother than those obtained by directly adjusting the position or velocity. The adjustment is calculated according to Equation (1).
$$\Delta\ddot{\theta}_{Xi} = L(\theta_{Xi}, K_{11}, \varepsilon_{11}) + L(\dot{\theta}_{Xi}, K_{12}, \varepsilon_{12}) + L(\ddot{\theta}_{Xi}, K_{13}, \varepsilon_{13}), \quad X = L, R;\ i = 1, 2, \ldots, 6 \qquad (1)$$
The compensation equation for the joint angular limit is calculated according to Equation (2). The compensation equations for joint velocity and acceleration are similar and will not be listed specifically.
$$L(\theta_{Xi}, K_{11}, \varepsilon_{11}) = \begin{cases} K_{11}\,(\theta_{Xi} - \theta_{Xi}^{\max} + \varepsilon_{11}), & \theta_{Xi} > \theta_{Xi}^{\max} - \varepsilon_{11} \\ 0, & \text{otherwise} \\ K_{11}\,(\theta_{Xi}^{\min} + \varepsilon_{11} - \theta_{Xi}), & \theta_{Xi} < \theta_{Xi}^{\min} + \varepsilon_{11} \end{cases} \qquad (2)$$
(2)
Torso action. This kind of action is used to bring the robot stance leg back to the preset motion sample after other adjustments. The action variable is chosen as X ¨ T , and its adjustment is calculated using the PD control law shown in the following equation.
$$\Delta\ddot{X}_{T} = \left[\Delta\ddot{x}_{T}\ \Delta\ddot{y}_{T}\ \Delta\ddot{z}_{T}\ \Delta\ddot{\theta}_{T1}\ \Delta\ddot{\theta}_{T2}\ \Delta\ddot{\theta}_{T3}\right]^{T} = K_{21}\left(X_{T}^{d} - X_{T}\right) + K_{22}\left(\dot{X}_{T}^{d} - \dot{X}_{T}\right) + \ddot{X}_{T}^{d} \qquad (3)$$
(3)
Swing foot action. Similar to the torso action, this action applies to the n2 swing feet in the general model. The action variables are chosen as $\ddot{X}_{Fj}$ (j = 1, 2, …, n2), and the adjustment is calculated using the PD control law shown in the following equation.
$$\Delta\ddot{X}_{Fj} = \left[\Delta\ddot{x}_{Fj}\ \Delta\ddot{y}_{Fj}\ \Delta\ddot{z}_{Fj}\ \Delta\ddot{\theta}_{Fj1}\ \Delta\ddot{\theta}_{Fj2}\ \Delta\ddot{\theta}_{Fj3}\right]^{T} = K_{31}\left(X_{Fj}^{d} - X_{Fj}\right) + K_{32}\left(\dot{X}_{Fj}^{d} - \dot{X}_{Fj}\right) + \ddot{X}_{Fj}^{d} \qquad (4)$$
(4)
CoM action. This type of action will directly adjust the robot CoM to keep balance on the moving platform. The action variable is chosen as the linear acceleration of the CoM. To keep the CoM above the stance legs, the adjustment is calculated according to the estimated position of the moving platform.
$$\left[\Delta\ddot{x}_{C}\ \Delta\ddot{y}_{C}\ \Delta\ddot{z}_{C}\right]^{T} = K_{41}\left(P_{S} + [0\ \ 0\ \ l_{C}]^{T} - P_{C}\right) + K_{42}\left(\dot{P}_{S} + \omega_{P}\times[0\ \ 0\ \ l_{C}]^{T} - \dot{P}_{C}\right) - \ddot{P}_{C} \qquad (5)$$
(5)
Inertial force/moment action. The inertial forces and moments influenced by the motion of the limbs are taken as a class of actions to cope with the perturbations. The action variables are chosen as the inertial force F and the inertial moment M at the CoM. The kinetic energy attenuation method proposed by the authors of [11] is used here to keep the robot balanced. The adjustment is calculated as follows:
$$\begin{aligned} \left[\Delta F_{X}\ \Delta F_{Y}\ \Delta F_{Z}\right]^{T} &= F - F^{\mathrm{last}} = K_{51}\, m_{C} v_{C} - F^{\mathrm{last}} \\ \left[\Delta M_{X}\ \Delta M_{Y}\ \Delta M_{Z}\right]^{T} &= M - M^{\mathrm{last}} = K_{52}\, L_{C} - M^{\mathrm{last}} \end{aligned} \qquad (6)$$
(6)
ZMP action. As a common control strategy in robot balance control, changing the ZMP position within the support zone through limb motion can be used as a class of action in response to perturbations. Therefore, the action variables are chosen as xZMP and yZMP. Using the pose balance control method based on the CP point proposed by the authors of [8], the ZMP adjustment is calculated with the following equation:
$$\left[\Delta x_{\mathrm{ZMP}}\ \Delta y_{\mathrm{ZMP}}\right]^{T} = (1 + K_{6})\,P_{\mathrm{CP}} - K_{6}\,P_{0} - P_{\mathrm{ZMP}} \qquad (7)$$
The action set of legged robots can be written as:
$$Q = \left\{\Delta\ddot{\theta}_{1},\ \Delta\ddot{\theta}_{2}, \ldots, \Delta\ddot{\theta}_{N_J},\ \Delta\ddot{X}_{T},\ \Delta\ddot{X}_{F1},\ \Delta\ddot{X}_{F2}, \ldots, \Delta\ddot{X}_{Fn_2},\ \Delta\ddot{P}_{C},\ \Delta F,\ \Delta M,\ \Delta x_{\mathrm{ZMP}},\ \Delta y_{\mathrm{ZMP}}\right\} \qquad (8)$$
Although only one equation is given for the adjustment of each action in Q, different adjustments can be obtained by adjusting the 12 free parameters (K11, K12, K13, K21, etc.). The determination methods and specific values of these parameters will be illustrated in Section 4 with simulation examples.

3. The Global Self-Stabilizer

3.1. Preprocessing and Structure of the Global Self-Stabilizer

Dimensionality reduction and discretization are required to enable the learning process to converge, because the system space designed in Section 2.2 is a high-dimensional continuous space. The system variable set listed in Table 2 is denoted as X = {xi | i = 1, 2, …, N}, and the action set in Section 2.3 is denoted as Q = {qj | j = 1, 2, …, m}. The global self-stabilizer in this study establishes the mapping from X to Q.
The RAFS feature selection method proposed by the authors in [30] is used to reduce the dimensionality of the system space and obtain the state set $S_j = \{s_{jk} \mid k = 1, 2, \ldots, N_{Sj};\ s_{jk} \in X\}$ corresponding to each action qj. This is followed by the autonomic abstraction calculation of the state space based on the Gaussian basis functions proposed by the authors in [25]. The continuous state space corresponding to Sj is then discretized into different Gaussian basis functions according to the maximum affiliation principle. The full set of Gaussian basis functions corresponding to Sj can be expressed as Ψj = {ψjk = <μjk, Σjk> | k = 1, 2, …, NBj}, where μjk and Σjk are the center vector and covariance matrix of the basis function ψjk, respectively.
With x = [x1, x2xN] T denoting the vector in the system space and sj = [sj1, sj2sjNSj] T denoting the vector in the state space of action qj, the mapping of X to Sj after feature selection can be expressed as:
$$s_{j} = W_{j}\, x \qquad (9)$$
where Wj is the NSj × N selection matrix obtained from the feature selection calculation.
The affiliation of the reduced-dimensional state vector sj to the basis function ψjk can be expressed as:
$$f(s_{j}, \psi_{jk}) = e^{-0.5\,(\mu_{jk} - s_{j})^{T}\Sigma_{jk}^{-1}(\mu_{jk} - s_{j})} \qquad (10)$$
The NSj-dimensional continuous state space corresponding to Sj can thus be transformed into a discrete space with NBj values. To facilitate the subsequent learning calculations of the global self-stabilizer, a normalized affiliation function is also defined.
$$\hat{f}(s_{j}, \psi_{jk}) = f(s_{j}, \psi_{jk}) \Big/ \sum_{i=1}^{N_{Bj}} f(s_{j}, \psi_{ji}) \qquad (11)$$
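As a minimal sketch of Equations (10) and (11), the affiliation of a reduced state vector to a set of Gaussian basis functions and its normalization can be computed as follows; the function names and the (mu, sigma) data layout are assumptions for illustration.

```python
import numpy as np

def affiliation(s, mu, sigma):
    """Gaussian basis affiliation of Equation (10): f(s, psi) for a basis
    function with center vector mu and covariance matrix sigma."""
    d = mu - s
    return float(np.exp(-0.5 * d @ np.linalg.solve(sigma, d)))

def normalized_affiliations(s, basis):
    """Normalized affiliations of Equation (11). `basis` is a list of
    (mu, sigma) pairs, one per Gaussian basis function of an action."""
    f = np.array([affiliation(s, mu, sigma) for mu, sigma in basis])
    return f / f.sum()
```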
The legged robot’s actions need to be executed by the joint motion, so the global self-stabilizer also needs to establish the mapping from Q to the joint angular acceleration increment vector Δ θ ¨ = θ 1 , θ 1 , , θ N J T . Because the action variables in Q are all acceleration or force/moment, we can linearize the kinematic or dynamical equations of the system:
$$q_{j} = b_{j} \cdot \Delta\ddot{\theta}, \quad j = 1, 2, \ldots, m \qquad (12)$$
where bj is the NJ-dimensional joint motion mapping vector, which represents the projection of the action adjustment qj in the robot joint space. Combining the joint motion mapping vectors into the mapping matrix B = [b1, b2, …, bm]^T, Equation (12) can then be written as:
$$q = B\,\Delta\ddot{\theta} \qquad (13)$$
In general, the number of actions m is greater than the robot's DOF NJ, so the matrix B is not square and cannot be inverted directly. Therefore, the $N_J \times m$ action selection matrix A is constructed, which has exactly one element equal to 1 in each row and all remaining elements equal to 0. Introducing the action selection matrix A into Equation (13) gives:
$$\Delta\ddot{\theta} = (AB)^{-1} A\, q \qquad (14)$$
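The following sketch illustrates Equation (14): the selection matrix A picks NJ of the m actions, and the joint angular acceleration increments follow from a single linear solve. The helper names are hypothetical, and the selected combination is assumed to keep AB invertible.

```python
import numpy as np

def selection_matrix(selected, m):
    """Build the N_J x m action selection matrix A with exactly one 1 per row,
    each row picking one action index from the action set Q."""
    A = np.zeros((len(selected), m))
    for row, j in enumerate(selected):
        A[row, j] = 1.0
    return A

def joint_increments(A, B, q):
    """Equation (14): solve (A B) dtheta = A q for the joint angular
    acceleration increments."""
    return np.linalg.solve(A @ B, A @ q)
```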
According to Equation (14), the global self-stabilizer is divided into three modules in this study: action selection module, adjustment calculation module and joint motion mapping module, which are used to generate A, q and B, respectively. The specific structure of the global self-stabilizer is shown in Figure 5.

3.2. Action Selection Module

The action selection module selects NJ actions in the action set and generates the action selection matrix A. Two main considerations are made when selecting the combination of actions: the value of the actions for the robot stability at the current state, and the influence of the actions on each other when combined.
The action value function is defined as VA(x). The mutual influence between actions is reflected in the singularity of AB in Equation (14). For the action variables qi and qj, the mutual influence $c_{ij}$ is quantified by the relative projection of the joint mapping vectors bi and bj (defined in Section 3.4):
$$c_{ij} = \frac{\left|b_{i}^{T} b_{j}\right|}{\left\|b_{i}\right\|\left\|b_{j}\right\|} \qquad (15)$$
For any action selection matrix A, the action selection evaluation function can be defined as follows:
$$E_{A}(A, x) = V_{A}(x)\,A^{T}\mathbf{1} - \omega_{C}\left(A^{T}\mathbf{1}\right)^{T} C \left(A^{T}\mathbf{1}\right) \qquad (16)$$
where 1 is an all-ones vector, and ωC is the weight balancing the action values against the mutual influence. Action selection is achieved by solving the optimization model shown in Equation (17).
$$A = \arg\max_{A}\ E_{A}(A, x) \quad \text{s.t.}\ \operatorname{rank}(A) = N_{J} \qquad (17)$$
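The paper does not prescribe a specific solver for the optimization in Equation (17); as one possible approximation, the sketch below scores candidate actions by their value minus the weighted mutual influence of Equation (15) with the actions already chosen, in the spirit of Equation (16), and selects them greedily. The weight value and the greedy strategy are assumptions for illustration.

```python
import numpy as np

def mutual_influence(B):
    """Influence coefficients c_ij of Equation (15), computed from the rows of
    the joint mapping matrix B; the diagonal is zeroed so an action does not
    penalize itself during selection."""
    norms = np.linalg.norm(B, axis=1)
    C = np.abs(B @ B.T) / np.outer(norms, norms)
    np.fill_diagonal(C, 0.0)
    return C

def greedy_action_selection(V, C, n_joints, w_c=0.1):
    """Greedy approximation of Equation (17): repeatedly add the unselected
    action that maximizes its value minus the accumulated mutual influence
    with the already selected actions (the trade-off of Equation (16))."""
    selected = []
    for _ in range(n_joints):
        scores = [(V[j] - w_c * sum(C[j, k] for k in selected), j)
                  for j in range(len(V)) if j not in selected]
        selected.append(max(scores)[1])
    return selected
```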

3.3. Adjustment Calculation Module

There may be different formulas (or different parameters) for calculating the adjustment of the same action because the training data of the global self-stabilizer may have multiple sources (model-based controllers, motion capture data, etc.). Therefore, the task of the adjustment calculation module is to select the most valuable adjustment calculation formula for each action, then calculate and output action adjustment q.
Assuming that the jth action has nj (j = 1, 2, …, m) different adjustment formulas, the value functions of these formulas Vjk(x) (j = 1, 2, …, m; k = 1, 2, …, nj) are obtained by learning, and the formula with the largest Vjk(x) is selected to calculate qj.
The action value function VAj(x) can be determined by Vjk(x):
$$V_{Aj}(x) = \max_{k = 1, 2, \ldots, n_{j}} V_{jk}(x) \qquad (18)$$
Both the action selection module and the adjustment calculation module need to determine the value function Vjk(x) through learning, which is introduced below. The state transition of the training data at each moment can be extracted as a quintuple <x, I, $\Delta\ddot{\theta}$, r, x′>, where I is the activation flag matrix of the adjustment calculation formulas, in which the element Ijk takes the value 1 when the adjustment qj is calculated by the kth formula; x′ is the system variable vector at the next moment; and r is the immediate reward, which considers the stability of the robot and the difference between the actual motion and the reference motion. The reward function r is defined according to Equation (19).
$$r = \begin{cases} \dfrac{1}{N_J}\displaystyle\sum_{i=1}^{N_{J}}\left(1 - \dfrac{\left|\theta_{i}^{d} - \theta_{i}\right|}{\theta_{i}^{\max} - \theta_{i}^{\min}}\right), & \text{stable} \\ -100, & \text{unstable} \end{cases} \qquad (19)$$
where θid is the joint angle of the ith joint in the motion sample; θimax and θimin are the positive and negative limit positions of the ith joint, respectively.
In this paper, ZMP is not the only criterion for determining stability. When ZMP is within the support zone, the robot is considered to be stable; when ZMP exceeds the support zone, the robot will start to flip along the boundary of the support zone. The robot is still considered to have the possibility of recovery when the flip angle is less than 45°; only after the flip angle exceeds 45° is the robot considered to be in an irrecoverable unstable state.
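A direct transcription of the reward in Equation (19) might look as follows; the magnitude and sign of the penalty for the unstable case follow the reconstructed equation above and should be treated as assumptions.

```python
def immediate_reward(theta, theta_d, theta_min, theta_max, stable):
    """Immediate reward of Equation (19): average joint tracking quality while
    the robot is stable, and a large negative penalty once the robot is
    considered irrecoverably unstable (flip angle beyond 45 degrees)."""
    if not stable:
        return -100.0
    n = len(theta)
    return sum(1.0 - abs(td - t) / (tmax - tmin)
               for t, td, tmin, tmax in zip(theta, theta_d, theta_min, theta_max)) / n
```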
For each training datum, the value function Q(ψijk) is updated by Q-learning.
$$Q(\psi_{ijk}) \leftarrow Q(\psi_{ijk}) + I_{jk}\,\alpha\left[r_{jk}\,f(x, \psi_{ijk}) + \gamma \sum_{h=1}^{N_{Bjk}} f(x', \psi_{hjk})\,Q(\psi_{hjk}) - Q(\psi_{ijk})\right] \qquad (20)$$
where rjk is the reward function after assigning the immediate reward r to the kth adjustment calculation formula for the jth action, calculated according to Equation (21).
$$r_{jk} = r\,\frac{I_{jk}\left|b_{j}\,\Delta\ddot{\theta}\right| / \left\|b_{j}\right\|}{\displaystyle\sum_{j=1}^{m}\sum_{k=1}^{n_{j}} I_{jk}\left|b_{j}\,\Delta\ddot{\theta}\right| / \left\|b_{j}\right\|} \qquad (21)$$
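Combining Equations (20) and (21), one learning step can be sketched as follows. The learning rate, discount factor and array layout are illustrative assumptions, and updating only the basis function with the maximum affiliation to the current state is an assumption based on the maximum affiliation principle of Section 3.1.

```python
import numpy as np

def assign_rewards(r, I, B, dtheta):
    """Equation (21): split the immediate reward r over the active adjustment
    formulas in proportion to each action's projection onto the executed
    joint increment. I is the m x max(n_j) activation flag matrix."""
    proj = np.abs(B @ dtheta) / np.linalg.norm(B, axis=1)   # |b_j . dtheta| / ||b_j||
    weights = I * proj[:, None]
    return r * weights / weights.sum()

def q_update(Q, f_x, f_x_next, i, r_jk, alpha=0.1, gamma=0.95):
    """Equation (20) for one active formula: TD update of the value attached
    to basis function i, with the next-state value taken as the affiliation-
    weighted sum over all basis functions of that formula.

    Q        : value vector over the basis functions of this formula
    f_x      : affiliations of the current state x to those basis functions
    f_x_next : affiliations of the next state x'
    i        : index of the basis function with the maximum affiliation to x
    r_jk     : reward assigned to this formula by Equation (21)
    """
    target = r_jk * f_x[i] + gamma * float(f_x_next @ Q)
    Q[i] += alpha * (target - Q[i])
    return Q
```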

3.4. Joint Motion Mapping Module

The task of this module is to provide the joint mapping matrix B based on the feedback of the system variables x. A radial basis function (RBF) network is used to learn the mapping $(x, \Delta\ddot{\theta}) \to q$ as an approximation of the system motion equations, because training data in the form <x, B> are difficult to obtain directly, whereas training data in the form <x, q, $\Delta\ddot{\theta}$> can always be extracted from the robot state transitions. The locally linearized mapping matrix B is then obtained by differentiating this RBF network.
To reduce complexity, this network is split into sub-networks, each with a single action variable qi. In Equation (12), the effect of $\Delta\ddot{\theta}$ on bi is ignored, so the mapping vector bi is considered a function of x only. After local linearization, qi can be calculated by the following equation.
$$q_{i} = q_{i0} + b_{i0}^{T}\left(\Delta\ddot{\theta} - \Delta\ddot{\theta}_{0}\right) \qquad (22)$$
where qi0, bi0 and $\Delta\ddot{\theta}_0$ are the mean values of qi, bi and $\Delta\ddot{\theta}$ in the neighborhood of the local linearization, respectively.
The RBF network structure is shown in Figure 6, where Bi and vi are the weight matrix and bias vector connecting the input layer to the hidden layer, respectively; uij (i = 1, 2, …, m; j = 1, 2, …, Ni) is the linear activation function; and the connection weights from the hidden layer to the output layer are the affiliation functions fij (defined in Equation (11)). The output equation of the network is shown in Equation (23), where fi = [fi1, fi2, …, fiNi]^T.
$$q_{i} = f_{i}^{T}\left(B_{i}\,\Delta\ddot{\theta} + v_{i}\right) \qquad (23)$$
To reduce the number of basis functions, the basis functions of the above RBF network are defined only over the space spanned by x. This modified RBF network is equivalent to linear (first-order) interpolation in the multi-dimensional space, which improves the fitting accuracy.
Differentiating Equation (23), the equation for the mapping vector bi extracted from the RBF network is:
$$b_{i} = \frac{\partial q_{i}}{\partial\,\Delta\ddot{\theta}} = B_{i}^{T} f_{i} \qquad (24)$$
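A compact sketch of the sub-network forward pass in Equation (23) and of the mapping vector extraction in Equation (24) is given below; the shapes (Bi of size Ni × NJ, vi and fi of length Ni) are inferred from the equations.

```python
import numpy as np

def rbf_output(f_i, B_i, v_i, dtheta):
    """Equation (23): output of the sub-network for action q_i, linear in the
    joint increments and weighted by the basis affiliations f_i."""
    return float(f_i @ (B_i @ dtheta + v_i))

def mapping_vector(f_i, B_i):
    """Equation (24): joint mapping vector b_i, the derivative of the network
    output with respect to the joint increment vector."""
    return B_i.T @ f_i
```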
The training of the designed RBF network is divided into two steps: (1) determination of the center and boundary of the basis function; (2) local training inside the basis function.
The center and boundary of each basis function are determined by the state space autonomic abstraction calculation based on Gaussian basis functions [25] (feature selection is also required). For each RBF sub-network, the basis function set after the autonomic abstraction calculation can be expressed as ΨBi = {ψBij | j = 1, 2, …, Ni}.
For the jth basis function of action qi, the following error function can be defined:
$$e_{ij} = \frac{1}{2}\sum_{k=1}^{N_{qi}} f\!\left(s_{i}^{(k)}, \psi_{Bij}\right)\left(q_{i}^{(k)} - b_{Rij}\,\Delta\ddot{\theta}^{(k)} - v_{ij}\right)^{2} \qquad (25)$$
The superscript (k) denotes the kth training sample, bRij is the jth row of the weight matrix Bi, and vij is the jth element of the bias vector vi. For simplicity, $f(s_i^{(k)}, \psi_{Bij})$ is abbreviated as $f_{ij}^{(k)}$. To minimize eij, the following equations need to be solved:
$$\frac{\partial e_{ij}}{\partial\left[b_{Rij}\ v_{ij}\right]} = \sum_{k=1}^{N_{qi}} f_{ij}^{(k)}\left(b_{Rij}\,\Delta\ddot{\theta}^{(k)} + v_{ij} - q_{i}^{(k)}\right)\left[\Delta\ddot{\theta}^{(k)T}\ \ 1\right] = 0 \qquad (26)$$
Solving Equation (26), the solution shown in Equation (27) can be obtained.
$$\left[b_{Rij}\ v_{ij}\right]^{T} = \left(\hat{U} U^{T}\right)^{-1}\hat{U}\, q_{Li} \qquad (27)$$
where U, $\hat{U}$ and qLi are defined as follows:
$$U = \begin{bmatrix} \Delta\ddot{\theta}^{(1)} & \Delta\ddot{\theta}^{(2)} & \cdots & \Delta\ddot{\theta}^{(N_{qi})} \\ 1 & 1 & \cdots & 1 \end{bmatrix} \qquad (28)$$
$$\hat{U} = \begin{bmatrix} f_{ij}^{(1)}\Delta\ddot{\theta}^{(1)} & f_{ij}^{(2)}\Delta\ddot{\theta}^{(2)} & \cdots & f_{ij}^{(N_{qi})}\Delta\ddot{\theta}^{(N_{qi})} \\ f_{ij}^{(1)} & f_{ij}^{(2)} & \cdots & f_{ij}^{(N_{qi})} \end{bmatrix} \qquad (29)$$
$$q_{Li} = \left[q_{i}^{(1)}\ q_{i}^{(2)}\ \cdots\ q_{i}^{(N_{qi})}\right]^{T} \qquad (30)$$
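The local training step of Equations (25)–(30) is an affiliation-weighted least-squares problem that can be solved with one linear solve per basis function, as the following sketch shows; the argument names and stacked-matrix layout are illustrative.

```python
import numpy as np

def fit_basis_weights(f_ij, dthetas, q_i):
    """Affiliation-weighted least squares of Equation (27) for one basis
    function psi_Bij of action q_i.

    f_ij    : (N,) affiliations of the N training samples to psi_Bij
    dthetas : (N, N_J) joint increment vectors of the samples
    q_i     : (N,) observed adjustments of the action
    Returns (b_Rij, v_ij): one row of the weight matrix B_i and its bias.
    """
    N = len(q_i)
    U = np.vstack([dthetas.T, np.ones((1, N))])       # Equation (28)
    U_hat = U * f_ij                                   # Equation (29): column k scaled by f_ij[k]
    sol = np.linalg.solve(U_hat @ U.T, U_hat @ q_i)    # Equation (27)
    return sol[:-1], sol[-1]
```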

4. Stability Training System of Biped Robots

Taking the biped robot GoRoBoT-II as an example, the simulated and experimental stability training environments are established to validate the effectiveness of the proposed idea, and the balance controllers for generating the training data are designed.

4.1. Simulation Environment

The biped robot used in this study is the bipedal part of the GoRoBoT-II robot designed in the authors' laboratory. Its main mechanism parameters are shown in Table 4. The seven-bar multi-rigid-body model of the biped robot is shown in Figure 7. The reference frames and variables are defined according to the model in Section 2.2. In addition, the joint angles of the left and right legs are denoted as θLi and θRi (i = 1, 2, …, 6), respectively.

4.2. Experiment Environment

The experimental system for stability training is shown in Figure 8, which includes the upper computer, motion platform, biped robot and protection device. The upper computer is a PC with a Windows operating system. The protection device is composed of a wire rope, a fixed pulley and a pull ring. When the robot is stable, the wire rope stays slack and does not affect the robot; when the robot is unstable, the experimenter pulls the protection rope tightly to prevent the robot from falling down.
The training platform in the above system is a 2-DOF motion platform. The mechanism diagram and its main parameters are given in Figure 9. The motion platform can oscillate around the x-axis and y-axis, denoted by θP1 and θP2, respectively. The limits of oscillation amplitude, speed and acceleration are ±20°, ±40°/s and ±60°/s2, respectively, which meet the requirements for stability training.
Each joint of the robot is driven by a Maxon RE35 DC servo motor. The transmission system consists of a synchronous belt drive (first stage) and a harmonic gear drive (second stage).
The motion control commands of the robot are generated by the upper computer, and the DC servo motors of each joint are position servos controlled by IPM100 controllers. In addition to the photoelectric encoders on the DC servo motors, the robot is equipped with a gyroscope (mounted on the torso) and force sensors (mounted under the soles of the feet) to measure the acceleration and velocity of the torso as well as the contact forces, respectively.

4.3. Balance Controllers for Stability Training Data Generation

The model-based balance controllers used for training data generation can be obtained by combining actions in the action set Q. The stance leg can follow the motion sample input when the action variable $\ddot{X}_T$ is adjusted according to Equation (3); similarly, when $\ddot{X}_F$ is adjusted according to Equation (4), the swing leg can follow the motion sample input. Thus, if the robot's action vector is chosen to be $[\ddot{X}_T^T\ \ \ddot{X}_F^T]^T$, the robot's motion is completely determined by the motion sample input.
By replacing some elements of the above action vector with the variables of three types of actions—CoM action, inertial force/moment action and ZMP action—the balance adjustments can be achieved based on the input sample motions. A variety of legged robot balance controllers with different action combinations can be obtained. The three types of controllers are described in detail below.
(1)
CoM adjustment balance controller. This controller maintains the robot's balance by keeping the robot's CoM above its support zone. The action variable that must be selected is $\ddot{P}_C$, and the action variables to be replaced can be $\ddot{x}_T$, $\ddot{y}_T$ or $\ddot{z}_T$ in $\ddot{X}_T$, and $\ddot{x}_F$, $\ddot{y}_F$ or $\ddot{z}_F$ in $\ddot{X}_F$. The former corresponds to adjusting the robot's CoM by translational motion of the torso, and the latter by the swing foot.
(2)
Energy attenuation balance controller. This controller dissipates the system energy by making the inertial force and moment do negative work, thus achieving stabilization. The action variables to be selected are F and M. The action variables that can be replaced are the torso acceleration $\ddot{X}_T$ or the swing leg acceleration $\ddot{X}_F$, which correspond to the two ways of changing the inertial force and moment: by stance leg adjustment or by swing leg adjustment, respectively.
(3)
ZMP adjustment balance controller. This controller keeps the robot's CP point in the center of the support zone by adjusting the ZMP. Therefore, the action variables that must be selected are xZMP and yZMP, and the replaced action variables can be $\ddot{x}_T$ or $\ddot{y}_T$ in $\ddot{X}_T$, and $\ddot{x}_F$ or $\ddot{y}_F$ in $\ddot{X}_F$, which is equivalent to adjusting the ZMP position by torso swing or swing leg kick.
Table 5 summarizes the six balance controllers. During the stability training, the action selection matrix A is determined by the corresponding controller in Table 5; the joint mapping matrix B is calculated according to the kinematics and dynamics of the robot; the action adjustment vector q within each control cycle is calculated from the corresponding adjustment calculation formulas (Equations (1)–(7)) according to the current state x; and the control output $\Delta\ddot{\theta}$ is solved by Equation (14). Furthermore, the state transition information <x, I, $\Delta\ddot{\theta}$, r, x′> generated by the above balance controllers is recorded to form the training data, which are used for learning the three modules of the global self-stabilizer.
When the position, velocity or acceleration of a joint enters its limit neighborhood (determined by εJ1, εJ2 and εJ3), the joint limit will be avoided by the single-joint action, which is achieved by selecting one of the single-joint actions that has the largest influence coefficient (defined by Equation (15)).

5. Simulation Results

Here, the stability training data of the single-leg stance, double-leg stance and stepping motions are generated within the simulation environment established in Section 4.1, using the model-based balance controllers of Section 4.3, to train the global self-stabilizer. Stability verification simulations of the trained global self-stabilizer are then performed under the same conditions.

5.1. Stability Training in Simulation

In the stability training simulation, the motion platform applies two kinds of perturbations. The first is a time-varying ground tilt perturbation produced by the amplitude-limited random motion of the swing angles θP1 and θP2 (see Figure 7); the second is an impact perturbation produced by a sudden change of angular velocity superimposed on the first.
Three different sets of the control parameters in the adjustment calculations (Equations (1)–(7)) are designed, corresponding to different response speeds. The specific values, obtained from simulations conducted before training, are given in Table 6. A superscript indicates the parameter level used by an action variable, such as xZMP(1).
Three reference motions were used for the stability training simulation, i.e., single-leg stance, double-leg stance and stepping. The stepping motion has random landing points, and the motion samples were obtained by the planning method proposed in [31]. One hundred simulations were performed in Adams for each level of each balance controller under each perturbation condition, and 4000 system variable transition data were extracted from each simulation. The duration of each simulation was 20 s, and the control period was 5 ms. For the controllers TCi, TEi and TBi (i = 1, 2, 3), a total of 1.2 × 10^6 transition data without impact and 8 × 10^5 with impact were obtained; for the controllers FCi, FEi and FBi (i = 1, 2, 3), a total of 8 × 10^5 transition data without impact and 4 × 10^5 with impact were obtained. The maximum simulation success rates among all model-based controllers are shown in Table 8.
As a preparation for Q-learning and RBF network learning, feature selection and autonomic abstraction calculations were performed first, and the results are shown in Table 7. The value functions of the actions with different parameters share the same feature selection results, but the state space autonomic abstraction calculation is performed with different basis function distributions so that different parameters obtain different numbers of basis functions.
A total of 198 state variables were selected for the 40 functions in the above table, an average of about five state variables per function drawn from the 113 system variables, which shows that the RAFS feature selection method effectively reduces the state space dimensionality of the learning problem.
The 30 most-selected system variables are shown in Figure 10. The most-selected variables are the joint angles of the stance leg, followed by the position of the CoM and the flip angle. Overall, the system variables related to the robot CoM, platform swing angle, resultant force/moment and ZMP are all present in the top 30 most-selected variables. All of these are important variables or equilibrium criteria in biped robot balance control, which indicates that the RAFS feature selection method successfully selected significant system variables.
The 40 functions in Table 7 were learned separately after the feature selection and state space autonomic abstraction calculations described above. The Q-learning of the action values was trained in batches of 10,000 training data, with the incremental threshold of the value function for iterative convergence set to 10^−5. The RBF network for joint motion mapping was trained according to Equation (27), and the optimal solution was obtained after one iteration over all training data.

5.2. Stability Verification Simulation of the Trained Global Self-Stabilizer

To verify the effectiveness of the trained global self-stabilizer, five hundred stability verification simulations were performed on the motion platform for each of the three robot motions, under the same simulation conditions and parameters as the training data generation. The success rates of the above verification simulations are presented in Table 8 and are compared with the highest success rate of the model-based balance controllers.
From the above table, it can be seen that the trained global self-stabilizer obtains stronger stability than the model-based balance controllers, with increases ranging from around 10% to 33%. The global self-stabilizer nearly doubles the success rate when the impact perturbations are applied.
The verification simulation results of the single-leg stance and the stepping will be analyzed next, because the double-leg stance is less challenging than others.
The ZMP curves from two single-leg stance simulations are given in Figure 11. Figure 11a shows that the trained global self-stabilizer regulated the ZMP to the center of the support zone when no impact perturbation was applied. Figure 11b shows that the ZMP exceeded the support zone boundary by at most 87.7 mm after the impact, and that the global self-stabilizer reduced the ZMP's oscillation amplitude and finally recovered the flat-foot contact of the robot.
The joint angles in the same simulations are shown in Figure 12, wherein the joint limits are marked with horizontal lines. The moments of impact and restoration of equilibrium are also marked with vertical lines in Figure 12b. Figure 12a shows that the knee joints approach the joint limit between 15 s and 17 s, and the global self-stabilizer distributes the motion of the knee joints to the ankle joints.
Screenshots of the single-stance stability verification simulation with impact using the virtual prototype in Adams are shown in Figure 13.
The action-switching process of the global self-stabilizer is given in Figure 14. The switching of $\ddot{x}_C$, $\ddot{y}_C$, xZMP and yZMP without impact is shown in Figure 14a,b. When the sagittal impact is applied, the global self-stabilizer uses the FX and MY actions to replace $\ddot{x}_F$ and $\ddot{\theta}_{T1}$, respectively (for the lateral impact it uses FY and MX to replace $\ddot{y}_F$ and $\ddot{\theta}_{T2}$, respectively). The switching process is shown in Figure 14c,d.
In the case of single-leg stance, the global self-stabilizer dynamically mixed the TC and TB controllers when no impact was present; with impact, it combined the four types of controllers TC, TB, FE and TE. The controller parameters were also adjusted according to the system state. The switching rules are implicitly contained in the value functions obtained from the training process, and the result is equivalent to exploring different combinations of actions or adjustment parameters in different regions of the system space. Therefore, the global self-stabilizer achieves stronger stability than the original controllers used to generate the training data.
The simulation data of single-leg stance with impact were sampled using a Gaussian function (standard deviation 5 mm). The probabilities of the distribution of the simulation success rate with respect to the ZMP position and the support surface flip angle are shown in Figure 15. From Figure 15a, it can be seen that the robot is basically guaranteed to be stable when the ZMP is within the support zone. The area circled by the contour with an 80% success rate is about 1.8 times the size of the support zone, indicating that the global self-stabilizer makes it possible for the robot to recover its balance even when the ZMP is out of the support zone. Figure 15b shows that the robot has 100% stability when the flip angle θS1 is less than 6° and θS2 is less than 5°; the success rate of recovering balance gradually decreases as the flip angle rises. Figure 15 also shows that the robot has stronger robustness to resist sagittal disturbances than lateral disturbances in single-leg stance.
The joint angles in one stepping simulation are shown in Figure 16, which shows that the joint angle trajectories are periodic. In addition, there are irregular fluctuations due to the changing ground tilt perturbation imposed by the moving platform.
The action switching in the sagittal and lateral planes is given in Figure 17. The switching processes in the two planes are similar: the CoM action is dominant during the double-leg stance period, the ZMP action is dominant during the single-leg stance period, and the inertial force action is used before and after the swing foot hits the ground.

6. Experiment Results

In this section, stability training experiments are conducted first, followed by stability verification experiments using the trained global self-stabilizer to show its effect.

6.1. Stability Training Experiment

The global self-stabilizer obtained from the simulation was transplanted to the robot to reduce the wear on mechanical parts caused by frequent training experiments. The parameters that need to be transplanted include the parameter set Ψ of the basis functions, the value function V and the connection weight matrix Hi of the RBF network.
The stability training experiments of three motions were performed using the bipedal part of the GoRoBoT-II robot. The total number of experiments and the number of successes for each motion under different perturbation conditions are given in Table 9.
The transplanted global self-stabilizer was trained using the obtained experimental data; the procedure and the parameters to be learned are similar to those of the simulation training in Section 5.1.

6.2. Stability Verification Experiment

For each of the three motions of double-leg stance, single-leg stance and stepping, twenty stability verification experiments were conducted, with disturbances generated according to the parameters in the fourth, second and first rows of Table 9, respectively. The corresponding success rates are 75%, 60% and 55%. Compared with the model-based balance controllers used during training data generation, these success rates represent improvements of 16.7%, 26.7% and 25.4%, respectively.
The distributions of the experimental data in the platform phase space are shown in Figure 18, where the unstable points indicate that the robot met the unstable condition in Equation (19) within 3 s, while the stable points indicate that the robot did not fall over within 3 s. The phase space of the motion platform was divided into the stable region, the unstable region and the transition region. It can be seen that the stable region of all three motions is larger than the size of the unstable region and the transition region, indicating that the trained global self-stabilizer gained the ability to resist external perturbations.
The ZMP curves that were obtained in three random experiments for each motion are given in Figure 19, which shows that the trained global stabilizer can restore balance even if the ZMP is out of the support zone. In addition, the corresponding experiment screenshots are shown in Figure 20.
In summary, the trained global self-stabilizer obtained the self-stabilization capability to cope with the random amplitude-limited perturbations under different motions. In addition, the stabilization capability was stronger than that of the model-based balance controllers after the training process, which indicates that the global self-stabilizer extracted and generated the control strategy that was most beneficial to maintain the robot’s balance based on the training data, and obtained a better state/action mapping.

7. Conclusions

A general model of a stability training system based on a training platform was designed for legged robots with an arbitrary number of legs and an arbitrary configuration. The application of the proposed idea was presented from three perspectives: system variable determination, action set construction and model-based controller design for training data generation. A global self-stabilizer capable of learning from different sources of training data in a high-dimensional continuous system space was proposed to address the stability training problem of legged robots. The overall task of keeping the robot stable was broken down into three modules: action selection, adjustment calculation and joint motion mapping, in which the action selection and adjustment calculation modules use the Q-learning algorithm and the joint motion mapping module uses a modified RBF network.
Stability training simulations and experiments of the global self-stabilizer were conducted by taking the bipedal robot GoRoBoT-II as an example (it should also be noted that the application of the proposed training method is not limited by the size of the robot). The training data generated by 18 controllers were used to train the global self-stabilizer.
Stability verification simulations and experiments were conducted for the trained global self-stabilizer, and the following conclusions can be obtained:
  • Simulation verification showed that the success rates of the trained global self-stabilizer, in three kinds of motion, under different disturbances, were higher than that of the model-based balance controller, with an improvement of at least 9.4%.
  • Experiment verification showed that the trained global self-stabilizer could keep the robot balanced under the random amplitude-limited tilt perturbation. The success rates of the stability verification experiments could reach 75%, 60% and 55%, respectively, which were higher than the success rates obtained using the model-based balance controller during the training data generation (58.3%, 33.3% and 29.6%, respectively).
  • The trained global self-stabilizer obtained different action combinations from the training data, and also continuously switched parameters according to the system state. This indicates that the designed global self-stabilizer was able to explore better state–action mapping from the training data and had the ability to learn and evolve continuously.
In summary, the proposed global self-stabilizer was able to accomplish the stability training task under compound perturbations and explore better action combinations from multiple different sources of training data. In the next step, we will put the trained global self-stabilizer into a real, unknown environment for further experiments.

Author Contributions

Conceptualization, W.W.; methodology, W.W.; software, L.G.; validation, L.G. and X.Z.; writing—original draft preparation, X.Z.; writing—review and editing, W.W.; supervision, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China, grant number 2018YFB1304502, and the Major Program of the National Natural Science Foundation of China, grant number 61936004.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Borovac, B.; Vukobratovic, M.; Surla, D. An Approach to Biped Control Synthesis. Robotica 1989, 7, 231–241. [Google Scholar] [CrossRef]
  2. Yokoi, K.; Kanehiro, F.; Kaneko, K.; Kajita, S.; Fujiwara, K.; Hirukawa, H. Experimental study of humanoid robot HRP-1S. Int. J. Robot Res. 2004, 23, 351–362. [Google Scholar] [CrossRef]
  3. Hirukawa, H.; Kanehiro, F.; Kajita, S.; Fujiwara, K.; Yokoi, K.; Kaneko, K.; Harada, K. Experimental evaluation of the dynamic simulation of biped walking of humanoid robots. In Proceedings of the 20th IEEE International Conference on Robotics and Automation (ICRA), Taipei, Taiwan, 14–19 September 2003; IEEE: Piscataway, NJ, USA, 2003; pp. 1640–1645. [Google Scholar]
  4. Okada, K.; Ogura, T.; Haneda, A.; Inaba, M. Autonomous 3D walking system for a humanoid robot based on visual step recognition and 3D foot step planner. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Barcelona, Spain, 18–22 April 2005; IEEE: Piscataway, NJ, USA, 2005; pp. 623–628. [Google Scholar]
  5. Kim, J.W.; Tran, T.T.; Dang, C.V.; Kang, B. Motion and Walking Stabilization of Humanoids Using Sensory Reflex Control. Int. J. Adv. Robot Syst. 2016, 13, 77. [Google Scholar] [CrossRef]
  6. Kaewlek, N.; Maneewarn, T. Inclined Plane Walking Compensation for a Humanoid Robot. In Proceedings of the International Conference on Control, Automation and Systems (ICCAS 2010), Gyeonggi do, Korea, 27–30 October 2010; IEEE: Piscataway, NJ, USA, 2005; pp. 1403–1407. [Google Scholar]
  7. Yang, S.P.; Chen, H.; Fu, Z.; Zhang, W. Force-feedback based Whole-body Stabilizer for Position-Controlled Humanoid Robots. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 7432–7439. [Google Scholar]
  8. Seo, K.; Kim, J.; Roh, K. Towards Natural Bipedal Walking: Virtual Gravity Compensation and Capture Point Control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 4019–4026. [Google Scholar]
  9. Elhasairi, A.; Pechev, A. Humanoid robot balance control using the spherical inverted pendulum mode. Front. Robot AI 2015, 2, 21. [Google Scholar] [CrossRef]
  10. Alcaraz-Jimenez, J.J.; Herrero-Perez, D.; Martinez-Barbera, H. Robust feedback control of ZMP-based gait for the humanoid robot Nao. Int. J. Robot Res. 2013, 32, 1074–1088. [Google Scholar] [CrossRef]
  11. Gao, L.Y.; Wu, W.G. Kinetic Energy Attenuation Method for Posture Balance Control of Humanoid Biped Robot under Impact Disturbance. In Proceedings of the 44th Annual Conference of the IEEE Industrial-Electronics-Society (IECON), Washington, DC, USA, 20–23 October 2018; pp. 2564–2569. [Google Scholar]
  12. Henaff, P.; Scesa, V.; Ben Ouezdou, F.; Bruneau, O. Real time implementation of CTRNN and BPTT algorithm to learn on-line biped robot balance: Experiments on the standing posture. Control Eng. Pract. 2011, 19, 89–99. [Google Scholar] [CrossRef]
  13. Shieh, M.Y.; Chang, K.H.; Chuang, C.Y.; Lia, Y.S. Development and implementation of an artificial neural network based controller for gait balance of a biped robot. In Proceedings of the 33rd Annual Conference of the IEEE-Industrial-Electronics-Society, Taipei, Taiwan, 5–8 November 2007; p. 2778. [Google Scholar]
  14. Zhou, C.J.; Meng, Q.C. Dynamic balance of a biped robot using fuzzy reinforcement learning agents. Fuzzy Sets Syst. 2003, 134, 169–187. [Google Scholar] [CrossRef]
  15. Ferreira, J.P.; Crisostomo, M.M.; Coimbra, A.P. SVR Versus Neural-Fuzzy Network Controllers for the Sagittal Balance of a Biped Robot. IEEE Trans. Neural Netw. 2009, 20, 1885–1897. [Google Scholar] [CrossRef] [PubMed]
  16. Li, Z.J.; Ge, Q.B.; Ye, W.J.; Yuan, P.J. Dynamic Balance Optimization and Control of Quadruped Robot Systems With Flexible Joints. IEEE Trans. Syst. Man Cybern. Syst. 2016, 46, 1338–1351. [Google Scholar] [CrossRef]
  17. Hwang, K.S.; Li, J.S.; Jiang, W.C.; Wang, W.H. Gait Balance of Biped Robot based on Reinforcement Learning. In Proceedings of the SICE Annual Conference, Nagoya University, Nagoya, Japan, 14–17 September 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 435–439. [Google Scholar]
  18. Hengst, B.; Lange, M.; White, B. Learning ankle-tilt and foot-placement control for flat-footed bipedal balancing and walking. In Proceedings of the 2011 11th IEEE-RAS International Conference on Humanoid Robots, Bled, Slovenia, 26–28 October 2011; pp. 288–293. [Google Scholar]
  19. Lin, J.L.; Hwang, K.S. Balancing and Reconstruction of Segmented Postures for Humanoid Robots in Imitation of Motion. IEEE Access 2017, 5, 17534–17542. [Google Scholar] [CrossRef]
  20. Hwang, K.S.; Jiang, W.C.; Chen, Y.J.; Shi, H.B. Motion Segmentation and Balancing for a Biped Robot’s Imitation Learning. IEEE Trans. Ind. Inform. 2017, 13, 1099–1108. [Google Scholar] [CrossRef]
  21. Liu, C.J.; Lonsberry, A.G.; Nandor, M.J.; Audu, M.L.; Lonsberry, A.J.; Quinn, R.D. Implementation of Deep Deterministic Policy Gradients for Controlling Dynamic Bipedal Walking. Biomimetics 2019, 4, 28. [Google Scholar] [CrossRef] [PubMed]
  22. Valle, C.M.C.O.; Tanscheit, R.; Mendoza, L.A.F. Computed-Torque Control of a Simulated Bipedal Robot with Locomotion by Reinforcement Learning. In Proceedings of the 2016 IEEE Latin American Conference on Computational Intelligence (La-Cci), Cartagena, Colombia, 2–4 November 2016. [Google Scholar]
  23. Li, Z.Y.; Cheng, X.X.; Peng, X.B.; Abbeel, P.; Levine, S.; Berseth, G.; Sreenath, K. Reinforcement Learning for Robust Parameterized Locomotion Control of Bipedal Robots. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 2811–2817. [Google Scholar]
  24. Wu, W.G.; Du, W.Q. Research of 6-DOF Serial-Parallel Mechanism Platform for Stability Training of Legged-Walking Robot. J. Harbin Inst. Technol. (New Ser.) 2014, 2, 75–82. [Google Scholar] [CrossRef]
  25. Wu, W.G.; Gao, L.Y. Posture self-stabilizer of a biped robot based on training platform and reinforcement learning. Robot Auton. Syst. 2017, 98, 42–55. [Google Scholar] [CrossRef]
  26. Jelsma, D.; Ferguson, G.D.; Smits-Engelsman, B.C.M.; Geuze, R.H. Short-term motor learning of dynamic balance control in children with probable Developmental Coordination Disorder. Res. Dev. Disabil. 2015, 38, 213–222. [Google Scholar] [CrossRef] [PubMed]
  27. Maciaszek, J.; Borawska, S.; Wojcikiewicz, J. Influence of Posturographic Platform Biofeedback Training on the Dynamic Balance of Adult Stroke Patients. J. Stroke Cerebrovasc. Dis. 2014, 23, 1269–1274. [Google Scholar] [CrossRef] [PubMed]
  28. DiFeo, G.; Curlik, D.M.; Shors, T.J. The motirod: A novel physical skill task that enhances motivation to learn and thereby increases neurogenesis especially in the female hippocampus. Brain Res. 2015, 1621, 187–196. [Google Scholar] [CrossRef] [PubMed]
  29. Wu, W.G.; Gao, L.Y. Modular combined motion platform used for stability training and amplitude limiting random motion planning and control method. CN Patent CN110275551A, 7 December 2021. [Google Scholar]
  30. Gao, L.Y.; Wu, W.G. Relevance assignation feature selection method based on mutual information for machine learning. Knowl.-Based Syst. 2020, 209, 106439. [Google Scholar] [CrossRef]
  31. Hou, Y.Y. Research on Flexible Drive Unit and Its Application in Humanoid Biped Robot. Ph.D. Dissertation, Harbin Institute of Technology, Harbin, China, 2014. [Google Scholar]
Figure 1. Comparison of a general balance controller and the global self-stabilizer for legged robots. (a) General balance controller; (b) global self-stabilizer.
Figure 2. The basic idea of legged robot stability training and its application [25].
Figure 3. The mechanism of the training platform and its motion. (a) The 6-DOF serial–parallel mechanism of the training platform; (b) spatial motion of the training platform.
Figure 4. The general model of the stability training system.
Figure 5. Structure of the global self-stabilizer.
Figure 6. Structure of the RBF network.
Figure 7. Multi-rigid-body model of the biped robot.
Figure 8. Biped robot stability training experiment system.
Figure 9. Mechanism diagram of the 2-DOF motion platform.
Figure 10. The 30 most frequently selected system variables.
Figure 11. ZMP curves in two single-leg stance simulations. (a) Without impact; (b) with impact.
Figure 12. Pitch angles from two single-leg stance simulations. (a) Without impact; (b) with impact.
Figure 13. Screenshot of the single-leg stance stability verification simulation with impact.
Figure 14. Action switching of the global self-stabilizer during single-leg stance. (a) Action switching on the x-axis without impact; (b) action switching on the y-axis without impact; (c) action switching on the x-axis with impact; (d) action switching on the y-axis with impact.
Figure 15. Success rate contour maps of the single-leg stance simulation with impact. (a) Success rate with respect to ZMP; (b) success rate with respect to flip angle.
Figure 16. Joint angles in random stepping. (a) Roll joint angle; (b) pitch joint angle.
Figure 17. Action switching in random stepping. (a) Action switching in the sagittal plane; (b) action switching in the lateral plane.
Figure 18. Data distributions in the motion platform phase space. (a) Double-leg stance; (b) single-leg stance; (c) stepping.
Figure 19. ZMP curves in three experiments. (a) Double-leg stance; (b) single-leg stance; (c) stepping.
Figure 20. Screenshots of three experiments. (a) Double-leg stance; (b) single-leg stance; (c) stepping.
Table 1. Summary of learning-based balance control methods.
Scholar | Algorithm | State Space | Action Space | Disturbance
Scesa et al. [12] | CTRNN | 6-d 1 continuous space | 3-d continuous space | Sagittal/lateral push
Shieh et al. [13] | FNN | 10-d continuous space | 1-d continuous space | Tilt/rugged ground
Zhou et al. [14] | Fuzzy reinforcement learning | Two 2-d continuous spaces | 1-d continuous space | None
Joao et al. [15] | SVM + FNN | 2-d continuous space | 1-d continuous space | None
Li et al. [16] | Fuzzy control + optimal control | Two 3-d continuous spaces | 3-d continuous space | None
Hwang et al. [17] | Q-learning | 82 discrete states | 24 discrete actions | Seesaw
Hengst et al. [18] | Q-learning | 4-d continuous space | 9 discrete actions | None
Hwang et al. [19,20] | Q-learning + reconstruction of segmented postures | 8 discrete states | 25 discrete actions | None
Liu et al. [21] | DDPG | 4-d continuous space | 2-d continuous space | Sagittal impact
Valle et al. [22] | Approximate Q-learning | 66 discrete states | 8 discrete actions | None
Li et al. [23] | PPO | 20-d continuous space | 10-d continuous space | Load of 15% of total mass
1 Short for 6-dimensional.
Table 2. System variables of a general model for stability training of legged robots.
Category of System Variables | Definition of Variable | Symbolic Representation
Joint motion | Angle, angular velocity and acceleration of joints | θ_k, θ̇_k, θ̈_k, k = 1, 2, …, N_J
Torso motion | Pose, velocity and acceleration of torso | X_T = [x_T, y_T, z_T, θ_T1, θ_T2, θ_T3]^T, Ẋ_T, Ẍ_T
jth swing foot motion | Pose, velocity and acceleration of the jth foot | X_Fj = [x_Fj, y_Fj, z_Fj, θ_Fj1, θ_Fj2, θ_Fj3]^T, Ẋ_Fj, Ẍ_Fj
CoM motion | Position, velocity and acceleration of the CoM | P_C = [x_C, y_C, z_C]^T, Ṗ_C, P̈_C
ZMP position | ZMP position in Σ_OB | P_ZMP = [x_ZMP, y_ZMP, 0]^T
Inertial force and moment | Resultant force and moment at the CoM | F = [F_X, F_Y, F_Z]^T, M = [M_X, M_Y, M_Z]^T
Support zone flip motion | Flip angle, angular velocity and angular acceleration of Σ_OS with respect to Σ_OP | θ_S1, θ̇_S1, θ̈_S1, θ_S2, θ̇_S2, θ̈_S2
Moving platform motion | Pose, velocity and acceleration of Σ_OP | X_P, Ẋ_P, Ẍ_P
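For implementation purposes, the system variables of Table 2 can be gathered into a single state container that the learning modules consume. The following Python sketch is a minimal illustration under that assumption; the class name, field names and flattening order are ours and do not appear in the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RobotState:
    """Hypothetical container for the Table 2 system variables (names are illustrative)."""
    joint_pos: np.ndarray        # theta_k,   shape (N_J,)
    joint_vel: np.ndarray        # dtheta_k,  shape (N_J,)
    joint_acc: np.ndarray        # ddtheta_k, shape (N_J,)
    torso_pose: np.ndarray       # X_T = [x_T, y_T, z_T, roll, pitch, yaw]
    torso_vel: np.ndarray        # Xdot_T
    swing_foot_pose: np.ndarray  # X_Fj, one row per swing foot
    com_pos: np.ndarray          # P_C = [x_C, y_C, z_C]
    com_vel: np.ndarray          # Pdot_C
    zmp: np.ndarray              # P_ZMP = [x_ZMP, y_ZMP, 0]
    inertial_force: np.ndarray   # F = [F_X, F_Y, F_Z]
    inertial_moment: np.ndarray  # M = [M_X, M_Y, M_Z]
    flip_angles: np.ndarray      # [theta_S1, theta_S2] of the support zone
    platform_pose: np.ndarray    # X_P of the moving platform

    def as_vector(self) -> np.ndarray:
        """Flatten all variables into one feature vector for feature selection / learning."""
        return np.concatenate([
            self.joint_pos, self.joint_vel, self.joint_acc,
            self.torso_pose, self.torso_vel,
            self.swing_foot_pose.ravel(),
            self.com_pos, self.com_vel, self.zmp,
            self.inertial_force, self.inertial_moment,
            self.flip_angles, self.platform_pose,
        ])
```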
Table 3. Parameter table for the action set.
Action | Parameter | Meaning
Single-joint action | K11, K12, K13 | The compensation coefficients when the joint position, velocity and acceleration are close to the limit
Single-joint action | ε11, ε12, ε13 | The width of the neighborhood where the joint position, velocity and acceleration start to avoid the limit
Single-joint action | L(·) | The compensation function for avoiding the joint limit
Torso action | X_Td, Ẋ_Td | The torso target pose and velocity vectors
Torso action | K21, K22 | The proportional and derivative coefficients for the torso adjustment
Swing foot action | X_Fjd, Ẋ_Fjd | The swing foot target pose and velocity vectors
Swing foot action | K31, K32 | The proportional and derivative coefficients for the swing foot adjustment
CoM action | P_S, P_C | The positions of the stance foot coordinate system origin O_S and the robot CoM
CoM action | l_C | The distance from the robot CoM to O_S
CoM action | K41, K42 | The proportional and derivative coefficients of the CoM adjustment
Inertial force/moment action | F_last, M_last | The resultant inertial force and moment at the CoM in the last control cycle
Inertial force/moment action | m_C | The total mass of the robot
Inertial force/moment action | L_C | The angular momentum about the CoM
Inertial force/moment action | K51, K52 | The adjustment coefficients for the inertial force and moment
ZMP action | x_ZMP, y_ZMP | Position of the ZMP point along the x and y axes within Σ_OS
ZMP action | P_CP | The CP point position in the support zone
ZMP action | P_0 | The position of the center point of the stance foot
ZMP action | K6 | The coefficient for the ZMP adjustment
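The "proportional and derivative coefficients" in Table 3 suggest PD-style adjustment laws, for example a torso adjustment of the form Ẍ_T,adj = K21(X_Td − X_T) + K22(Ẋ_Td − Ẋ_T). The sketch below illustrates this reading only; the exact control law, function name and arguments are our assumptions and not the paper's published formulation.

```python
import numpy as np

def torso_adjustment(X_T, Xdot_T, X_Td, Xdot_Td, K21, K22):
    """Hypothetical PD-style torso adjustment built from the Table 3 parameters.

    Returns a commanded torso acceleration adjustment. The form used in the
    paper may differ; this is an assumption based on the description of K21
    and K22 as proportional and derivative coefficients.
    """
    X_T, Xdot_T = np.asarray(X_T), np.asarray(Xdot_T)
    X_Td, Xdot_Td = np.asarray(X_Td), np.asarray(Xdot_Td)
    return K21 * (X_Td - X_T) + K22 * (Xdot_Td - Xdot_T)

# Example call with the Level 2 torso gains listed in Table 6: (K21, K22) = (20, 100).
ddX_T = torso_adjustment(
    X_T=[0.0, 0.02, 0.78, 0.05, 0.0, 0.0],
    Xdot_T=[0.0, 0.1, 0.0, 0.0, 0.0, 0.0],
    X_Td=[0.0, 0.0, 0.80, 0.0, 0.0, 0.0],
    Xdot_Td=[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    K21=20.0, K22=100.0,
)
```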
Table 4. Main parameters of the biped robot GoRoBoT-II.
Parameter | Length (mm) | Parameter | Length (mm) | Parameter | Mass (kg)
Torso length l0 | 300 | Hip width lh | 125 | Torso mass m0 | 12.5
Thigh length l1 | 220 | Forefoot length lf1 | 120 | Thigh mass m1 | 6
Calf length l2 | 189 | Hindfoot length lf2 | 60 | Calf mass m2 | 2.5
Ankle height l3 | 104 | Foot width lfw | 90 | Foot mass m3 | 0.25
Table 5. Model-based balance controllers for global self-stabilizer training data generation.
Balance Controller | Behavior | Symbol | Activated Action Variables
CoM motion | Torso translation | TC | ΔP̈_C, Δθ̈_T1, Δθ̈_T2, Δθ̈_T3, ΔẌ_F
CoM motion | jth swing foot kick | FC | ΔẌ_T, ΔP̈_C, Δθ̈_F1, Δθ̈_F2, Δθ̈_F3
Energy attenuation | Torso motion | TE | ΔF, ΔM, ΔẌ_F
Energy attenuation | jth swing foot motion | FE | ΔẌ_T, ΔF, ΔM
CP balance control | Torso translation | TB | Δx_ZMP, Δy_ZMP, Δz̈_T, Δθ̈_T1, Δθ̈_T2, Δθ̈_T3, ΔẌ_F
CP balance control | jth swing foot kick | FB | ΔẌ_T, Δx_ZMP, Δy_ZMP, Δz̈_F, Δθ̈_F1, Δθ̈_F2, Δθ̈_F3
Table 6. Different levels of parameters for adjustment.
Action | Parameter | Level 1 | Level 2 | Level 3
Single-joint action | (K11, K12, K13) | (5, 3, 1) | (4, 2, 1) | (3, 1, 1)
Single-joint action | (ε1, ε2, ε3) | (4°, 6°/s, 12°/s²) | (7°, 10°/s, 20°/s²) | (10°, 15°/s, 30°/s²)
Torso action | (K21, K22) | (14.1, 100) | (20, 100) | (28.2, 100)
Swing foot action | (K31, K32) | (14.1, 100) | (20, 100) | (28.2, 100)
CoM action | (K41, K42) | (14.1, 100) | (20, 100) | (28.2, 100)
Inertial force/moment action | (K51, K52) | (1, 0.5) | (1.5, 0.8) | (2, 1)
ZMP action | K6 | 0.8 | 1.2 | 1.6
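In an implementation, the three adjustment levels of Table 6 lend themselves to a simple lookup that the action-selection module can index at run time. The dictionary below merely transcribes the table; the data structure and key names are illustrative assumptions, not part of the paper's code.

```python
# Adjustment-parameter levels transcribed from Table 6 (structure and names are illustrative).
ADJUSTMENT_LEVELS = {
    "single_joint": {   # (K11, K12, K13) and (eps1, eps2, eps3) in (deg, deg/s, deg/s^2)
        1: {"K": (5, 3, 1), "eps": (4, 6, 12)},
        2: {"K": (4, 2, 1), "eps": (7, 10, 20)},
        3: {"K": (3, 1, 1), "eps": (10, 15, 30)},
    },
    "torso":      {1: (14.1, 100), 2: (20, 100), 3: (28.2, 100)},
    "swing_foot": {1: (14.1, 100), 2: (20, 100), 3: (28.2, 100)},
    "com":        {1: (14.1, 100), 2: (20, 100), 3: (28.2, 100)},
    "inertial":   {1: (1, 0.5), 2: (1.5, 0.8), 3: (2, 1)},
    "zmp":        {1: 0.8, 2: 1.2, 3: 1.6},
}

def get_gains(action: str, level: int):
    """Return the gain setting for a given action type and adjustment level."""
    return ADJUSTMENT_LEVELS[action][level]

K21, K22 = get_gains("torso", 2)   # -> (20, 100)
```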
Table 7. Results of key feature selection and autonomic abstraction calculation of the state space.
Action variables:
Variable | Selected State Variables | Number of Basis Functions (Level 1) | (Level 2) | (Level 3)
Δθ̈_Ri (i = 1, 2, …, 6) | θ_Ri, Δθ̇_Ri, Δθ̈_Ri | 52 * | 70 * | 64 *
Δθ̈_Li (i = 1, 2, …, 6) | θ_Li, Δθ̇_Li, Δθ̈_Li | 55 * | 67 * | 59 *
Δẍ_T | x_T, ẋ_T, x_ZMP, θ_R5, θ_R4 | 1872 | 1935 | 1763
Δÿ_T | y_T, ẏ_T, y_ZMP, θ_R6 | 1119 | 1328 | 1266
Δz̈_T | z_T, ż_T, θ_R4, θ̇_R4 | 1203 | 1298 | 1255
Δθ̈_T1 | θ_P1, θ̇_P1, θ_T1, θ̇_T1 | 1353 | 1499 | 1296
Δθ̈_T2 | θ_P2, θ̇_P2, θ_T2, θ̇_T2 | 1277 | 1394 | 1206
Δθ̈_T3 | θ_T3, θ̇_T3, θ_R1 | 206 | 301 | 255
Δẍ_F | x_F, ẋ_F, M_Y, F_X, θ_S2, θ̇_S2 | 2452 | 2571 | 2368
Δÿ_F | y_F, ẏ_F, M_X, F_Y, θ_S1, θ̇_S1 | 2280 | 2246 | 2116
Δz̈_F | z_F, ż_F, θ_L4, θ̇_L4 | 1368 | 1420 | 1297
Δθ̈_F1 | θ_F1, θ̇_F1, θ_L6, θ̇_L6 | 1385 | 1538 | 1226
Δθ̈_F2 | θ_F2, θ̇_F2, θ_L5, θ̇_L5 | 1235 | 1496 | 1126
Δθ̈_F3 | θ_F3, θ̇_F3, θ_L1 | 341 | 391 | 335
Δẍ_C | x_C, ẋ_C, x_ZMP, θ_S2, θ̇_S2, θ_P2, θ̇_P2, M_Y, F_X, θ_R5, θ̇_R5 | 11,359 | 12,670 | 12,370
Δÿ_C | y_C, ẏ_C, y_ZMP, θ_S1, θ̇_S1, θ_P1, θ̇_P1, M_X, F_Y, θ_R6, θ̇_R6 | 12,697 | 13,019 | 11,268
Δz̈_C | z_C, ż_C, θ_R4, θ̇_R4, θ_L4, θ̇_L4 | 2332 | 2569 | 2571
ΔF_X | F_X, x_ZMP, θ_S2, θ̇_S2, θ_P2, θ̇_P2, x_C, ẋ_C | 5002 | 5233 | 5493
ΔF_Y | F_Y, y_ZMP, θ_S1, θ̇_S1, θ_P1, θ̇_P1, y_C, ẏ_C | 5540 | 5981 | 6127
ΔF_Z | θ_S2, θ̇_S2, θ_S1, θ̇_S1, z_C, ż_C | 3627 | 3826 | 3695
ΔM_X | y_ZMP, θ_S1, θ̇_S1, θ_P1, θ̇_P1, y_C, ẏ_C | 3890 | 3452 | 3321
ΔM_Y | x_ZMP, θ_S2, θ̇_S2, θ_P2, θ̇_P2, x_C, ẋ_C | 4023 | 3926 | 3751
ΔM_Z | θ_S2, θ̇_S2, θ_S1, θ̇_S1 | 1231 | 1396 | 1117
Δx_ZMP | x_ZMP, θ_P1, θ̇_P1, θ_P2, θ̇_P2, θ_S1, θ_S2, x_C, ẋ_C, y_C, ẏ_C | 8695 | 9007 | 9861
Δy_ZMP | y_ZMP, θ_P1, θ̇_P1, θ_P2, θ̇_P2, θ_S1, θ_S2, x_C, ẋ_C, y_C, ẏ_C | 8824 | 8937 | 9331
Joint mapping:
Variable | Selected State Variables | Number of Basis Functions
ΔF_X | x_C, z_C, θ_R1, θ_R2, θ_R3, θ_R4, θ_R5, θ_R6 | 6892
ΔF_Y | y_C, z_C, θ_R1, θ_R2, θ_R3, θ_R4, θ_R5, θ_R6 | 7101
ΔF_Z | x_C, y_C, z_C, θ_R1, θ_R2, θ_R3, θ_R4, θ_R5, θ_R6 | 7840
Δx_ZMP | x_C, z_C, θ_R1, θ_R2, θ_R3, θ_R4, θ_R5, θ_R6, θ_L2, θ_L3, F_X, F_Z, M_Y | 24,427
Δy_ZMP | y_C, z_C, θ_R1, θ_R2, θ_R3, θ_R4, θ_R5, θ_R6, θ_L2, θ_L3, F_Y, F_Z, M_X | 22,246
* Denotes the average number of basis functions.
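Table 7 lists how many basis functions the state-space abstraction produced for each action variable. For orientation, a generic radial-basis-function approximator over the selected state variables could be evaluated as sketched below; the Gaussian basis form, the function name and the array shapes are assumptions for illustration and may differ from the RBF network of Figure 6.

```python
import numpy as np

def rbf_value(s, centers, widths, weights):
    """Evaluate a generic RBF approximator: sum_i w_i * exp(-||s - c_i||^2 / (2 * sigma_i^2)).

    s       : selected state variables for one action variable, shape (d,)
    centers : basis-function centers, shape (n_basis, d)  (e.g., n_basis ~ 1872 for the
              Level 1 abstraction of the x_T variables in Table 7)
    widths  : basis-function widths sigma_i, shape (n_basis,)
    weights : trained output weights w_i, shape (n_basis,)
    """
    s = np.asarray(s)
    sq_dist = np.sum((centers - s) ** 2, axis=1)           # ||s - c_i||^2 for every basis function
    activations = np.exp(-sq_dist / (2.0 * widths ** 2))   # Gaussian basis activations
    return float(weights @ activations)

# Toy usage: 3 selected state variables, 5 basis functions (sizes are arbitrary).
rng = np.random.default_rng(0)
q = rbf_value(s=[0.1, -0.2, 0.05],
              centers=rng.normal(size=(5, 3)),
              widths=np.full(5, 0.5),
              weights=rng.normal(size=5))
```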
Table 8. Comparison of the simulation success rates of the trained global self-stabilizer and model-based balance controllers.
Motion | Global Self-Stabilizer (without impact) | Model-Based Controllers (without impact) | Global Self-Stabilizer (with impact) | Model-Based Controllers (with impact)
Double-leg stance | 97.4% | 88% (max) | 85.7% | 47% (max)
Single-leg stance | 94.2% | 75% (max) | 80.2% | 47% (max)
Stepping | 76.6% | 44% (max) | - | -
Table 9. Parameters and results of the stability training experiments.
Platform Motion Parameters (Angle, Angular Velocity, Angular Acceleration) | Double-Leg Stance (Success/Overall) | Single-Leg Stance (Success/Overall) | Stepping (Success/Overall)
±7°, ±10°/s, ±20°/s² | 11/12 | 17/24 | 16/54
±14°, ±15°/s, ±30°/s² | 17/24 | 10/30 | -
±20°, ±20°/s, ±40°/s² | 7/12 | - | -
±20°, ±25°/s, ±60°/s² | 2/6 | - | -