In practical applications of reinforcement learning, if an invalid action can be predicted in the current state, an action mask can be used to prune it and thereby accelerate learning [33,34,35]. However, this method sets the probabilities of certain actions directly to zero in particular situations, so too many hard rules place too many limitations on exploration. It is difficult to design a perfect set of rules and parameters for a greenhouse control problem with complexly coupled factors, but some fuzzy conditions can be summarized from experience to judge whether a particular facility is currently suitable for use. Our method designs a network of masking rules that can be manually initialized from partial empirical rules and whose parameters can be updated from feedback by a policy gradient algorithm during reinforcement learning; we refer to this as the Soft Action Mask (SAM).
2.2.1. Overview
The overall algorithmic architecture of the proposed method is shown in
Figure 2. The agent includes the Actor and the Critic, and the Actor comprises two networks: a fully connected feedforward neural network, which is randomly initialized, and a soft-action-mask network (Section 2.2.2), which is manually initialized according to the rules. The output of the SAM network is a bias layer that offsets the logits layer of the original neural network before the softmax activation function, suppressing the sampling of low-benefit actions and yielding the final policy $\pi_{\theta}(a \mid s)$, i.e., the probability of each action, where $\theta$ denotes the parameters of the complete Actor network.
The form of each variable in Figure 2 is shown in (6), where the environmental state variables are the same as in Equation (2). The indoor and outdoor temperature-difference terms, which capture the trend and intensity of temperature change, denote the difference between the temperature sampled at the current moment and at the previous moment, and an additional term records the number of time steps for which the current state of the facilities has lasted. The output of the Actor is the set of actions available to the RL agent policy, and the final selected action is sampled from the policy distribution. After the action decoupling module, the selected action is converted into the facilities' control inputs. The agent continuously selects actions according to the current Actor policy and observes the state of the environment, collecting reward information from historical feedback into the experience replay pool of the PPO Update Loss module in Figure 2. After several rounds, the agent's parameters are updated based on the PPO algorithm [36].
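As a minimal illustrative sketch of the Actor pipeline in Figure 2 (a discrete action space is assumed; the layer sizes, dimensions, and the name `sam_bias` are assumptions for illustration, not the paper's exact network):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Fully connected policy network whose logits are offset by a SAM bias layer."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),            # logits layer
        )

    def forward(self, state: torch.Tensor, sam_bias: torch.Tensor):
        logits = self.body(state)                     # raw action preferences
        # the SAM bias lowers the logits of low-benefit actions before softmax
        return torch.distributions.Categorical(logits=logits + sam_bias)

# usage: sample the final action from the biased policy distribution
actor = Actor(state_dim=8, action_dim=4)
dist = actor(torch.randn(1, 8), torch.tensor([[0.0, -3.0, 0.0, 0.0]]))
action = dist.sample()                                # final selected action
```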
The PPO algorithm is based on the Actor–Critic framework [37] and invokes the concept of importance sampling to enhance efficiency. The loss function for policy optimization based on the advantage function is shown in (7), where $\hat{A}_t$ is the estimated advantage. By constraining the ratio of the new to the old action probabilities in the update, the loss function becomes (8) and (9), where $r_t(\theta)$ is the ratio of the probability distribution of the updated policy $\pi_{\theta}$ to that of the old policy $\pi_{\theta_{\text{old}}}$, and $\epsilon$ is a small positive constant close to zero. The final loss function in (10) additionally considers the value-error and entropy terms:
When updating the parameters, the Actor estimates the advantage function by interactively collecting sample data over K time steps, as in (11), where $\lambda$ is the discount factor of the advantage, $\gamma$ is the reward discount, $\delta_t$ is the TD error, $r_t$ is the immediate reward at step $t$, and $V$ is the state value calculated by the Critic network. The return $\hat{R}_t$ is the actual long-term cumulative benefit, usually estimated as in (12):
The PPO algorithm flow pseudo-code is shown in Algorithm 1. The symbols and parameters involved are explained in
Table 2.
Algorithm 1 PPO Algorithm
Input: total number of iterations N, max steps per round K, replay buffer pool of history experience, policy before update $\pi_{\theta_{\text{old}}}$, update epoch Z
1: for $i = 1, 2, \ldots, N$ do
2:   for $t = 1, 2, \ldots, K$ do
3:     Get observation state s from environment
4:     Choose action a by policy $\pi_{\theta_{\text{old}}}$
5:     Run a in environment
6:     Observe reward r and new state s′
7:     Add experience information {s, a, r, s′} into the replay buffer
8:   end for
9:   Calculate advantages $\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_K$ with V based on (11)
10:  Optimize loss functions (8)–(10) and (12) with respect to $\theta$ for Z epochs
11:  Update $\pi_{\theta_{\text{old}}}$ to $\pi_{\theta}$
12: end for
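The following is a hedged sketch of the update step in Algorithm 1, using standard generalized advantage estimation and the clipped surrogate loss; the tensor names and hyper-parameter values are illustrative assumptions rather than the paper's exact settings:

```python
import torch

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Advantage estimates over K collected steps, in the spirit of Eq. (11)."""
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values            # estimated long-term benefit, cf. Eq. (12)
    return advantages, returns

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Clipped surrogate objective with value-error and entropy terms, cf. Eqs. (8)-(10)."""
    ratio = torch.exp(new_logp - old_logp)                    # new/old probability ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    return policy_loss + vf_coef * value_loss - ent_coef * entropy.mean()
```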
2.2.2. Soft Action-Mask Net
In the study [34], it was mentioned that the use of action masks allows the pruning of invalid actions, and it was demonstrated that this process can be regarded as a state-dependent differentiable function [35], which is in line with the assumptions of the policy gradient algorithm. In the greenhouse, there are also several invalid actions that can be predicted. Significantly, exploration in a real greenhouse requires avoiding the execution of dangerous actions in particular states in order to ensure the safety of the crops and the facilities.
To distinguish it from the SAM proposed later, this type of action mask for dangerous actions in greenhouses is referred to as a Hard Action Mask (HAM). Take the fan action as an example: according to human experience, it is not advisable to turn on the fan at low temperatures. Assuming that the current state corresponds to a low temperature, the logits output of the Actor and the policy network output, transformed into action sampling probabilities after the softmax activation, are as follows:
The HAM sets the logit corresponding to this action to an extremely small value, Min, so that the probability of selecting it becomes zero:
HAM ensures a safe process by setting the sampling probability of unsafe or invalid actions entirely to zero. However, exploration restricted to the safe region does not guarantee good gains for all action trajectories. Furthermore, in addition to dangerous actions, low benefits in the early stages of learning are also undesirable, and a good initialization can effectively shorten the learning period. In realistic scenarios, instructive human knowledge can be provided beforehand. To combine knowledge with reinforcement learning, ref. [30] proposed initializing neural networks in the form of decision trees to implement a warm start. However, the human experience summarized in greenhouses is usually ambiguous and hardly covers the complete state domain. In addition, the multi-factor coupling relationship results in complex conditional rules, which makes the approach of [30] difficult to apply: the initialization of the decision tree becomes too complicated, and the over-regulated exploration restricts the boosting capabilities of the agent. We propose a method that transforms several soft masking rules into network form and incorporates partial experience to guide the agent's exploration process, finally achieving a better-initialized reinforcement learning agent for greenhouse control. The networked form is shown in
Figure 3.
By manually initializing the network weights and comparators, the judgment process of the combined first-order rules is transformed through the network into a bias layer that reduces the probability of sampling improper actions; this is referred to as SAM. Unlike the direct masking of HAM, SAM adds a bias layer to the output of the logits layer of the agent's original policy network. The offset base value is initialized to a relatively small value, avoiding the full suppression of policy gradient backpropagation found in HAM. Secondly, the rules set by manual experience are sometimes vague and one-sided. The networked SAM allows the weights and offsets of the rule nodes to be dynamically trained through the gradient, allowing the agent to explore better policies and rule parameters through the reward mechanism.
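To make the contrast concrete, the following small sketch (with illustrative logits, a hypothetical fan-action index, and arbitrary bias magnitudes) shows how HAM and SAM modify the same logits before the softmax:

```python
import torch

logits = torch.tensor([1.2, 0.4, -0.3])        # raw Actor logits, action 1 = "turn on fan"

# HAM: hard mask, the probability of the fan action becomes exactly zero
ham_logits = logits.clone()
ham_logits[1] = -1e9                            # the "Min" value
ham_policy = torch.softmax(ham_logits, dim=0)   # approx. [0.82, 0.00, 0.18]

# SAM: soft mask, a moderate negative bias only suppresses the fan action,
# and the bias itself remains trainable by the policy gradient
sam_bias = torch.tensor([0.0, -2.0, 0.0], requires_grad=True)
sam_policy = torch.softmax(logits + sam_bias, dim=0)  # fan probability reduced, not zeroed
```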
The network initialization requires manual pre-establishment of the masking rules, and the rules build the network in the form of weights and comparators. The initialization process is given in Algorithm 2. Assume a total of $k$ rules consisting of $n$ first-order discriminant conditions, as in Figure 3; each judgment node produces an output determined by its initialization weight and bias parameters, and the degree of uncertainty of each discriminant condition is taken from 0 to 1. Equations (15)–(21) comprise the calculation process from the input to the bias layer, where the path matrix is initialized by the discriminative paths composed of the $n$ different conditions connected in series according to the $k$ rules, and the final output is the bias layer produced by SAM.
Algorithm 2 Initialization of the SAM net
Input: expert knowledge rule collection, number of rules k, count of all conditions n
1: initialize the collection of paths as empty
2: initialize the collection of weights as empty
3: initialize the collection of comparators as empty
4: initialize the collection of masks as empty
5: for each rule $i = 1, 2, \ldots, k$ do
6:   initialize an empty path for rule i
7:   read the conditions contained in rule i
8:   for each condition $j = 1, 2, \ldots, n$ do
9:     if condition j appears in rule i then
10:      add condition j to the path of rule i
11:      set the weight of the corresponding judgment node
12:      initialize the comparator of the judgment node by (17)
13:      append the weight and comparator to their collections
14:    end if
15:  end for
16:  add the path of rule i to the collection of paths
17:  initialize the mask bias of rule i by (20)–(21)
18:  append the mask bias to the collection of masks
19: end for
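A compact sketch of the path-matrix part of Algorithm 2, assuming each rule is supplied as a list of condition indices (this data layout is an assumption for illustration):

```python
import torch

def init_sam_paths(rules, n_conditions):
    """Build a k x n path matrix: entry (i, j) = 1 if rule i uses condition j.

    Each row selects which first-order judgment nodes are chained in series
    to form one masking rule, mirroring steps 5-15 of Algorithm 2.
    """
    k = len(rules)
    paths = torch.zeros(k, n_conditions)
    for i, rule in enumerate(rules):
        for j in rule:                # condition j participates in rule i
            paths[i, j] = 1.0
    return paths

# fan rule: conditions 0 ("indoor < 30 C") and 1 ("outdoor < 28 C") in series
print(init_sam_paths([[0, 1]], n_conditions=2))  # tensor([[1., 1.]])
```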
Still considering the fan action, suppose the rule is as follows: when the indoor temperature is below 30 degrees Celsius and the outdoor temperature is below 28 degrees Celsius, try not to turn on the fan. The state contains the indoor and outdoor temperatures, and one element of the action space corresponds to turning on the fan; the corresponding weights, comparators, and mask base value are initialized accordingly. When the condition is not met, the offset is 0; otherwise, the larger the absolute value of the offset, the stronger the inhibition of the action, with less inhibition near the critical values of the judgment conditions.
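The following sketch illustrates how such a rule could produce a soft bias for the fan action; the sigmoid-shaped comparators, the base value, and the sharpness are assumptions for illustration and do not reproduce Equations (15)–(21) exactly:

```python
import torch

def fan_rule_bias(t_in, t_out, base=-3.0, sharpness=1.0):
    """Soft bias for the fan action: suppress it when indoor < 30 C AND outdoor < 28 C.

    Each first-order condition becomes a soft comparator (sigmoid), so the
    suppression fades out near the critical temperatures instead of switching
    abruptly, and base/sharpness remain trainable by the policy gradient.
    """
    cond_in = torch.sigmoid(sharpness * (30.0 - t_in))    # close to 1 when well below 30 C
    cond_out = torch.sigmoid(sharpness * (28.0 - t_out))  # close to 1 when well below 28 C
    return base * cond_in * cond_out                       # 0 when the conditions are not met

# far below the thresholds: strong suppression of the fan logit
print(fan_rule_bias(torch.tensor(20.0), torch.tensor(18.0)))  # approx. -3.0
# near the critical values: mild suppression
print(fan_rule_bias(torch.tensor(29.5), torch.tensor(27.5)))  # approx. -1.2
```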
2.2.3. Region Reward
The reward function consists of two main components: a control-interval error penalty and a fan loss penalty, as shown in Equations (22) and (23), respectively, each multiplied by its own weight; the target vector indicates the target, in which each element is the target value for the corresponding environmental factor:
The fan loss penalty is an indirect measure of fan energy consumption based on the cumulative hours of fan operation, where T is the data sampling period, and each element value of the error penalty is determined based on the current state and the upper and lower suitability limits, as in (24):
Depending on the crop's growth requirements and tolerance, three levels of condition regions are distinguished: the first is an excellent environmental condition for growth, the second is a sub-optimal condition that can be tolerated for a long period, and the third is an extreme condition that can only be tolerated for a short period. If the environment has difficulty staying in the excellent region because precise control equipment is lacking, then operating in the sub-optimal region is acceptable, while the extreme region should be avoided at all costs. The setting of the penalty term is therefore differentiated by the current state, as shown in Equation (25):
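As an illustrative sketch of a region-differentiated penalty for a single factor such as temperature (the region boundaries and penalty weights below are hypothetical and not the paper's Equations (22)–(25)):

```python
def region_penalty(temp, lo=18.0, hi=30.0, margin=4.0):
    """Piecewise penalty: free in the excellent region, mild in the tolerable
    band around it, and sharply increasing in the extreme region."""
    if lo <= temp <= hi:                         # excellent condition for growth
        return 0.0
    dist = (lo - temp) if temp < lo else (temp - hi)
    if dist <= margin:                           # sub-optimal but tolerable region
        return 0.1 * dist
    return 0.1 * margin + 1.0 * (dist - margin)  # extreme region: heavy penalty

# e.g. 31 C is lightly penalized, 38 C is strongly penalized
print(region_penalty(31.0), region_penalty(38.0))  # 0.1  4.4
```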
The complete RL with SAM algorithm flow is shown in Algorithm 3.
Algorithm 3 RL with Soft Action-Mask
Input: dimension of the state observation, dimension of the action, expert knowledge rules, total number of iterations N, max steps per round K, replay buffer pool of history experience, update epoch Z
1: randomly initialize the actor and critic parameters in the Agent
2: initialize the mask net in the Agent with the expert knowledge rules by Algorithm 2
3: for $i = 1, 2, \ldots, N$ do
4:   for $t = 1, 2, \ldots, K$ do
5:     get observation state s from the environment
6:     get the logits layer by forwarding the actor net
7:     get the bias by forwarding the soft mask net by (15)–(21)
8:     obtain the policy by applying softmax to the sum of the logits and the bias
9:     choose action a by the policy
10:    run a in the environment
11:    observe the new state s′ and calculate reward r by (22)–(25)
12:    add experience information {s, a, r, s′} into the replay buffer
13:   end for
14:   update the policy for Z epochs by Algorithm 1
15: end for