Article

Reinforcement Learning-Aided Channel Estimator in Time-Varying MIMO Systems

Tae-Kyoung Kim 1 and Moonsik Min 2,*
1 Department of Electronic Engineering, Gachon University, Seongnam 13120, Republic of Korea
2 School of Electronic and Electrical Engineering, Kyungpook National University, Daegu 41566, Republic of Korea
* Author to whom correspondence should be addressed.
Sensors 2023, 23(12), 5689; https://doi.org/10.3390/s23125689
Submission received: 17 April 2023 / Revised: 8 June 2023 / Accepted: 16 June 2023 / Published: 18 June 2023
(This article belongs to the Special Issue MIMO Technologies in Sensors and Wireless Communication Applications)

Abstract: This paper proposes a reinforcement learning-aided channel estimator for time-varying multi-input multi-output systems. The basic concept of the proposed channel estimator is the selection of detected data symbols in data-aided channel estimation. To perform this selection successfully, we first formulate an optimization problem that minimizes the data-aided channel estimation error. In time-varying channels, however, the optimal solution is difficult to derive because of its computational complexity and the time-varying nature of the channel. To address these difficulties, we consider a sequential selection of the detected symbols and a refinement of the selected symbols. A Markov decision process is formulated for the sequential selection, and a reinforcement learning algorithm that efficiently computes the optimal policy is proposed together with a state element refinement. Simulation results demonstrate that the proposed channel estimator outperforms conventional channel estimators by efficiently capturing the variation of the channels.

1. Introduction

The multi-input multi-output (MIMO) system is a key technology in modern communication and can significantly improve channel capacity and communication reliability by using multiple antennas [1,2,3,4,5,6,7]. Spatial multiplexing and diversity gain are representative schemes for this improvement [1,2]. Notably, the channel capacity increases linearly with the minimum of the numbers of transmit and receive antennas. However, this increase is based on the unrealistic assumption of perfect channel state information (PCSI) at both the transmitter and receiver.
Many studies have proposed methods to improve the channel estimation accuracy with limited time and frequency resources [8,9,10,11,12,13,14,15]. A representative method is pilot-aided channel estimation, which exploits the information shared between a transmitter and receiver. Linear minimum-mean square-error (LMMSE) channel estimation is a well-known pilot-aided method, owing to its simple structure [8]. However, LMMSE channel estimation exhibits unsatisfactory performance with a limited number of pilots. Thus, many pilots are required to satisfy the performance requirement, which decreases the spectral efficiency.
To overcome this problem, data-aided channel estimation has been investigated, in which the detected data symbols are exploited as additional pilot symbols [16,17,18,19,20,21,22,23,24,25,26]. However, the detected data symbols may contain errors that degrade the accuracy of the channel estimation. An iterative turbo equalizer can overcome this degradation by iteratively refining the detections based on the maximum a posteriori (MAP) probability [16,17,18,19,20,21,22]. However, such an iterative turbo equalizer incurs considerable complexity and latency at the receiver.
As a non-iterative approach, the reinforcement learning (RL)-aided channel estimator was introduced in [27,28,29,30,31,32,33]. The basic concept of this approach is the sequential selection of detected data symbols to minimize the channel estimation error. A Markov decision process (MDP) was defined to solve this sequential selection, and the corresponding optimal policy was derived in a closed-form expression in [31]. In [32], a low-complexity algorithm was investigated by introducing sub-blocks and finite backup samples, which significantly reduced the computational complexity and latency without performance loss. Recently, a general framework for RL-aided channel estimation based on Monte Carlo tree search was studied in [33]. However, the RL-aided channel estimators in [31,32,33] were originally designed for time-invariant channels; thus, they perform poorly in time-varying channels.
In this paper, we propose an RL-aided channel estimator for time-varying MIMO channels. To achieve this, we first introduce an optimization problem for an RL-aided channel estimator in time-varying channels. We then formulate an MDP to solve the optimization problem, and propose an RL algorithm for the MDP that considers the time-varying nature of the channel. The main contributions of this paper are as follows:
  • We propose an RL-aided channel estimator for time-varying channels modeled using a first-order Gaussian-Markov process. First, we define the optimization problem in time-varying channels to select the detected data symbols and minimize the estimation error between the estimated and current channels. This optimization problem differs from those in [31,32,33], where the selection of the detected data symbols is fixed because the channel does not change with the time slot index.
  • We propose an RL algorithm for the optimization problem that captures the time-varying nature of the channel. Because the optimization problem minimizes the estimation error between the estimated and current channels, we adjust the weights of the data symbols to improve the estimation accuracy of the current channel. Using this adjustment, we derive the optimal policy as a closed-form solution. Note that the proposed optimal policy differs from those in [31,32,33] because the influence of the soft-decision symbols in the virtual state on future rewards gradually diminishes as the time slot index increases.
  • We propose a further performance improvement scheme that refines the state elements. This is necessary because previously selected data symbols degrade the estimation accuracy of the current channel. To improve the estimation accuracy, we refine the previously selected data symbols to reflect the channel variation. In addition, we remove selected data symbols that are too old by introducing a sliding window, because their effective noise variance for estimating the current channel is large. Through simulations, we demonstrate the effectiveness of the proposed channel estimator compared with conventional channel estimators in time-varying channels.
The remainder of this paper is organized as follows. In Section 2, we introduce the system model, the optimization problem, and the MDP. The proposed channel estimator, which determines the optimal policy for time-varying channels, is described in Section 3. We propose a further performance improvement scheme in Section 4. In Section 5, we present simulation results to demonstrate the effectiveness of the proposed channel estimator. Finally, we provide our conclusions in Section 6.

2. Preliminaries

This section describes the system model of a data-aided channel estimator for time-varying MIMO channels. We present the considered channel estimation and data detection schemes based on the model and introduce an optimization problem for data-aided channel estimation.

2.1. System Model

We consider MIMO systems in which a transmitter with $N_t$ transmit antennas communicates with a receiver with $N_r$ receive antennas (Figure 1). The information bits are first encoded and mapped to symbols from the constellation set $\mathcal{X}$. The transmitted symbol vector at time slot $n$, denoted by $\mathbf{x}[n] \in \mathcal{X}^{N_t}$, is then sent over a wireless channel. We model the wireless channel using a first-order Gaussian-Markov process as a time-varying channel model [34,35,36,37,38], where the $(t,r)$-th component of the channel matrix $\mathbf{H}[n] \in \mathbb{C}^{N_t \times N_r}$, between the $t$-th transmit and $r$-th receive antennas, follows a Rayleigh fading $\mathcal{CN}(0,1)$ distribution. The temporal correlation parameter of the wireless channel, denoted by $\epsilon \in [0,1]$, increases with velocity. Based on this model, the channel matrix $\mathbf{H}[n]$ at time slot $n$ is given by
$$\mathbf{H}[n] = \sqrt{1-\epsilon^2}\,\mathbf{H}[n-1] + \epsilon\,\boldsymbol{\Delta}[n], \qquad (1)$$
where each entry of $\boldsymbol{\Delta}[n]$ follows a $\mathcal{CN}(0,1)$ distribution.
When the transmitter sends the symbol $\mathbf{x}[n]$ to the receiver over the wireless channel $\mathbf{H}[n]$, the received symbol $\mathbf{z}[n]$ is given by
$$\mathbf{z}[n] = \mathbf{H}^H[n]\,\mathbf{x}[n] + \mathbf{n}[n], \qquad (2)$$
where $(\cdot)^H$ denotes the conjugate transpose and $\mathbf{n}[n]$ is the additive white Gaussian noise (AWGN) at time slot $n$, with distribution $\mathcal{CN}(\mathbf{0}_{N_r}, \sigma_n^2 \mathbf{I}_{N_r})$, where $\mathbf{0}_m$ and $\mathbf{I}_m$ denote the $m \times 1$ zero vector and the $m \times m$ identity matrix, respectively.
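For concreteness, the following minimal NumPy sketch (ours, not from the paper) simulates the channel evolution in (1) and the received signal in (2); the dimensions and the 4-QAM alphabet mirror the simulation setup in Section 5, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def crandn(*shape):
    # circularly symmetric complex Gaussian CN(0, 1) samples
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

Nt, Nr, eps, sigma2 = 2, 4, 0.01, 0.1
qam = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)  # unit-power 4-QAM

H = crandn(Nt, Nr)                                       # H[0]: Rayleigh CN(0, 1) entries
for n in range(1, 100):
    H = np.sqrt(1 - eps**2) * H + eps * crandn(Nt, Nr)   # first-order Gauss-Markov update, (1)
    x = rng.choice(qam, size=Nt)                         # transmitted symbol vector x[n]
    z = H.conj().T @ x + np.sqrt(sigma2) * crandn(Nr)    # received symbol z[n], (2)
```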
Each frame consists of one pilot block and $M_d$ data blocks (Figure 1). The pilot block contains $N_p$ symbols, whereas each data block contains $N_d$ symbols. $\mathcal{M}_p = \{1, \ldots, N_p\}$ is defined as the pilot index set, and $\mathcal{M}_d = \{(d-1)N_d + 1, \ldots, dN_d\}$ is defined as the index set of the $d$-th data block. We consider data-aided channel estimation, in which the receiver obtains the initial channel estimates using the pilot symbols and then improves their accuracy by exploiting data symbols.
We adopt the LMMSE method as the basic channel estimation method because it has a simple structure and provides reasonable performance. Based on the LMMSE method, the $r$-th row $\hat{\mathbf{h}}_r$ of the initial channel estimate $\hat{\mathbf{H}}$ can be obtained as
$$\hat{\mathbf{h}}_r = \left( \mathbf{X}^p (\mathbf{X}^p)^H + \sigma_n^2 \mathbf{I}_{N_t} \right)^{-1} \mathbf{X}^p (\mathbf{z}_r^p)^H, \qquad (3)$$
where $(\cdot)^{-1}$ is the inverse operation, and $\mathbf{X}^p = [\mathbf{x}[1], \ldots, \mathbf{x}[N_p]]$ and $\mathbf{z}_r^p = [z_r[1], \ldots, z_r[N_p]]$ are the pilot symbols and the corresponding received symbols in the pilot block, respectively.
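For illustration, a minimal NumPy sketch of (3) follows (ours, with illustrative names); it solves the regularized linear system directly instead of forming the matrix inverse.

```python
import numpy as np

def lmmse_row(X, z_r, sigma2):
    """LMMSE estimate of the r-th row of the channel as in (3).
    X: (Nt, N) symbol matrix whose columns are symbol vectors;
    z_r: (N,) received samples at receive antenna r; returns an (Nt,) estimate."""
    Nt = X.shape[0]
    A = X @ X.conj().T + sigma2 * np.eye(Nt)      # X X^H + sigma_n^2 I
    return np.linalg.solve(A, X @ z_r.conj())     # equivalent to (.)^{-1} X z_r^H

# Stacking the per-antenna rows reassembles the initial estimate:
# hatH = np.stack([lmmse_row(Xp, Zp[:, r], sigma2) for r in range(Nr)], axis=1)
```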
The conventional channel estimator performs data detection at the receiver using the initial channel estimate $\hat{\mathbf{H}}$. Because the MAP rule guarantees optimal performance, we adopt it for data detection; it is given by
$$\hat{\mathbf{x}}[n] = \operatorname*{argmax}_{\mathbf{x}_k \in \mathcal{X}^{N_t}} \theta_k[n], \qquad (4)$$
where $\mathbf{x}_k \in \mathcal{X}^{N_t}$ and $k$ belongs to the index set of the symbol vector candidates $\mathcal{K} = \{1, \ldots, |\mathcal{X}|^{N_t}\}$, with $|\cdot|$ denoting the cardinality of a set. $\theta_k[n]$ denotes the a posteriori probability (APP), which is given by
$$\theta_k[n] = \frac{ P\big(\mathbf{z}[n] \,\big|\, \mathbf{x}[n] = \mathbf{x}_k\big)\, P\big(\mathbf{x}[n] = \mathbf{x}_k\big) }{ \sum_{j \in \mathcal{K}} P\big(\mathbf{z}[n] \,\big|\, \mathbf{x}[n] = \mathbf{x}_j\big)\, P\big(\mathbf{x}[n] = \mathbf{x}_j\big) }, \qquad (5)$$
where $P(\cdot)$ is the probability of an event. The likelihood probability in (5) is calculated by assuming the AWGN channel as
$$P\big(\mathbf{z}[n] \,\big|\, \mathbf{x}[n] = \mathbf{x}_k\big) = \left( \frac{1}{\pi \sigma_n^2} \right)^{N_r} \exp\!\left( -\frac{ \big\| \mathbf{z}[n] - \hat{\mathbf{H}}^H \mathbf{x}_k \big\|^2 }{ \sigma_n^2 } \right), \qquad (6)$$
where $\|\cdot\|$ denotes the norm operation. The a priori probabilities in (5) are assumed to be equal over the candidate transmitted symbols, i.e., $P(\mathbf{x}[n] = \mathbf{x}_k) = 1/|\mathcal{X}|^{N_t}$.
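A sketch of the detection step (4)-(6) is given below (ours; the candidate enumeration and all names are assumptions). The likelihoods are computed in the log domain and normalized by a softmax, which is numerically equivalent to (5) with uniform priors.

```python
import itertools
import numpy as np

def map_detect(z, hatH, cands, sigma2):
    """MAP detection (4) with APPs (5)-(6).
    z: (Nr,) received symbol; hatH: (Nt, Nr) channel estimate;
    cands: (K, Nt) all candidate vectors x_k in X^{Nt}; returns (x_hat, theta)."""
    loglik = np.array([-np.linalg.norm(z - hatH.conj().T @ xk) ** 2 / sigma2
                       for xk in cands])          # AWGN log-likelihoods, (6)
    loglik -= loglik.max()                        # stabilize the exponentials
    theta = np.exp(loglik)
    theta /= theta.sum()                          # APPs theta_k[n], uniform priors
    return cands[np.argmax(theta)], theta

qam = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
cands = np.array(list(itertools.product(qam, repeat=2)))   # X^{Nt} for Nt = 2
```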

2.2. Problem

In a time-varying channel, the estimation accuracy of the initial channel estimate $\hat{\mathbf{H}}$ gradually decreases as the time slot index $n$ increases. This degradation results in poor detection performance at the receiver. Because a detected data symbol may be erroneous owing to the channel variation, incorrect use of detected data symbols severely degrades performance. To overcome this degradation, we consider a data-aided channel estimator that selects the detected data symbols used for data-aided channel estimation.
For the selection, we define an action $a \in \mathcal{A} = \{0, 1\}$, where the detected data symbol is used in channel estimation when $a = 1$; otherwise, it is not used. When we define $\mathbf{a} \in \{0,1\}^{N_d}$ as a vector of actions, the considered data-aided channel estimate can be obtained using this vector as
$$\hat{\mathbf{h}}_r(\mathbf{a}) = \left( \hat{\mathbf{X}}(\mathbf{a})\, \hat{\mathbf{X}}^H(\mathbf{a}) + \sigma_n^2 \mathbf{I}_{N_t} \right)^{-1} \hat{\mathbf{X}}(\mathbf{a})\, \mathbf{z}_r^H(\mathbf{a}), \qquad (7)$$
where $\mathbf{z}_r(\mathbf{a}) = \big[ \mathbf{z}_r^p, z_r[e_1(\mathbf{a})], \ldots, z_r[e_{\|\mathbf{a}\|_0}(\mathbf{a})] \big]$ and $\hat{\mathbf{X}}(\mathbf{a}) = \big[ \mathbf{X}^p, \hat{\mathbf{x}}[e_1(\mathbf{a})], \ldots, \hat{\mathbf{x}}[e_{\|\mathbf{a}\|_0}(\mathbf{a})] \big]$. Here, $e_i(\mathbf{a})$ denotes the time slot index of the $i$-th nonzero element of $\mathbf{a}$, and $\|\mathbf{a}\|_0$ is the number of nonzero elements. We then define the optimization problem as
$$\mathbf{a}^{\star} = \operatorname*{argmin}_{\mathbf{a}} \; \mathbb{E}\big\{ \big\| \hat{\mathbf{H}}(\mathbf{a}) - \mathbf{H}[n] \big\|^2 \big\}, \qquad (8)$$
where $\mathbb{E}(\cdot)$ is the expectation of a random variable and $\hat{\mathbf{H}}(\mathbf{a}) = [\hat{\mathbf{h}}_1(\mathbf{a}), \ldots, \hat{\mathbf{h}}_{N_r}(\mathbf{a})]$ collects the row estimates in (7).
Compared with previous studies [31,32,33], the optimization problem in (8) considers the selection that minimizes the MSE between the estimated channel and $\mathbf{H}[n]$. Because the channel varies with the time slot index $n$, the best action $\mathbf{a}^{\star}$ may also vary with $n$; that is, the best action at the previous time slot index may be invalid at the next one. In addition, the optimization problem is difficult to solve because the number of candidate actions increases exponentially with the data symbol length, so an exhaustive search over the action candidates is not feasible in practical applications. To resolve these difficulties, we introduce a sequential selection of the detected data symbols and a refinement of the selected data.
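The action-dependent estimate in (7) can be sketched as follows (ours; it reuses lmmse_row from the earlier sketch, and the array layout is an assumption).

```python
import numpy as np

def lmmse_with_actions(Xp, Zp, Xhat_d, Z_d, a, sigma2):
    """Data-aided estimate hat_H(a) in (7).
    Xp: (Nt, Np) pilots; Zp: (Np, Nr) received pilot symbols;
    Xhat_d: (Nt, Nd) detected data symbols; Z_d: (Nd, Nr) received data symbols;
    a: (Nd,) binary action vector."""
    sel = np.flatnonzero(a)                              # e_1(a), ..., e_{||a||_0}(a)
    Xa = np.concatenate([Xp, Xhat_d[:, sel]], axis=1)    # hat{X}(a)
    Za = np.concatenate([Zp, Z_d[sel, :]], axis=0)       # stacks z_r(a) over all r
    return np.stack([lmmse_row(Xa, Za[:, r], sigma2)
                     for r in range(Za.shape[1])], axis=1)
```

Even in this toy layout, solving (8) by enumeration would require evaluating all $2^{N_d}$ action vectors (e.g., $2^{128}$ per data block for the setup in Section 5), which motivates the sequential formulation that follows.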

2.3. Markov Decision Process

We formulate an MDP that solves the optimization problem in (8). To achieve this, we define the state $S_n$, transition function $T^{(a,j)}(S_n)$, action set $\mathcal{A}$, and reward $R(S_n, S_{n+1})$ [39]. Subsequently, the Q-value function $Q(S_n, a)$ and the optimal policy $\pi^{\star}(S_n)$ are presented. The basic definitions for the MDP are adopted from [31,32,33]; however, the RL solution for the MDP differs from those in previous studies, as explained in the next section.
The state set $\mathcal{S}_n$ is defined as
$$\mathcal{S}_n = \Big\{ \big( \mathbf{X}_n, \hat{\mathbf{X}}_n, \mathcal{C} \big) \,\Big|\, \mathbf{X}_n = \big[ \mathbf{x}[1], \ldots, \mathbf{x}[N_p], \mathbf{x}_{k_{\mathcal{C}(1)}}, \ldots, \mathbf{x}_{k_{\mathcal{C}(|\mathcal{C}|)}} \big],\; \hat{\mathbf{X}}_n = \big[ \mathbf{x}[1], \ldots, \mathbf{x}[N_p], \hat{\mathbf{x}}[\mathcal{C}(1)], \ldots, \hat{\mathbf{x}}[\mathcal{C}(|\mathcal{C}|)] \big],\; \mathcal{C} \subseteq \{1, \ldots, n-1\} \Big\}, \qquad (9)$$
where $\mathcal{C}$ is the set of time slot indices whose symbols are used in channel estimation, and $\mathcal{C}(i)$ is its $i$-th smallest element. $k_n \in \mathcal{K} = \{1, \ldots, |\mathcal{X}|^{N_t}\}$ is the transmitted symbol index at time slot $n$. Based on this expression, we can obtain the proposed channel estimate using the state $S_n \in \mathcal{S}_n$ as
$$\hat{\mathbf{h}}_r(S_n) = \left( \hat{\mathbf{X}}_n \hat{\mathbf{X}}_n^H + \sigma_n^2 \mathbf{I}_{N_t} \right)^{-1} \hat{\mathbf{X}}_n\, \mathbf{z}_r^H(S_n), \qquad (10)$$
where $\mathbf{z}_r(S_n) = \big[ \mathbf{z}_r^p, z_r[\mathcal{C}(1)], \ldots, z_r[\mathcal{C}(|\mathcal{C}|)] \big]$. Note that $\mathcal{S}_n$ is the set of all states and $S_n \in \mathcal{S}_n$ is a state.
The action set is defined as $\mathcal{A} = \{0, 1\}$. As explained in the previous subsection, the detected data symbol is used in the proposed channel estimation when $a = 1$; otherwise, it is not used. The transition function $T^{(a,j)}(S_n)$ from state $S_n \in \mathcal{S}_n$ is defined as
$$T^{(a,j)}(S_n) = P\Big( U_{n+1}^{(a,j)}(S_n) \,\Big|\, S_n, a \Big) = \begin{cases} \mathbb{I}\big( \mathbf{x}[n] = \mathbf{x}_j \big), & j \in \mathcal{J}_a,\ a = 1, \\ 1, & j \in \mathcal{J}_a,\ a = 0, \end{cases} \qquad (11)$$
where $\mathbb{I}(\cdot)$ equals one when the event is true and zero otherwise, $\mathcal{J}_0 = \{0\}$, and $\mathcal{J}_1 = \{1, \ldots, K\}$. $U_{n+1}^{(a,j)}(S_n) \in \mathcal{U}_{n+1}^{(a,j)}(S_n)$ is a possible candidate for the next state from state $S_n$ and is defined as
$$U_{n+1}^{(a,j)}(S_n) = \begin{cases} \big( [\mathbf{X}_n, \mathbf{x}_j],\ [\hat{\mathbf{X}}_n, \hat{\mathbf{x}}[n]],\ \mathcal{C} \cup \{n\} \big), & j \in \mathcal{J}_a,\ a = 1, \\ \big( \mathbf{X}_n,\ \hat{\mathbf{X}}_n,\ \mathcal{C} \big), & j \in \mathcal{J}_a,\ a = 0. \end{cases} \qquad (12)$$
The reward $R(S_n, S_{n+1})$ is defined as the difference between the MSEs at the current state $S_n \in \mathcal{S}_n$ and the next state $S_{n+1} \in \mathcal{S}_{n+1}$, which is given by
$$R(S_n, S_{n+1}) = \mathbb{E}\big\| \hat{\mathbf{h}}_r(S_n) - \mathbf{h}_r[n] \big\|^2 - \mathbb{E}\big\| \hat{\mathbf{h}}_r(S_{n+1}) - \mathbf{h}_r[n+1] \big\|^2 = \operatorname{Tr}\big( \mathbf{B}(S_n) \big) - \operatorname{Tr}\big( \mathbf{B}(S_{n+1}) \big) = \operatorname{Tr}\big( \mathbf{B}(S_n) - \mathbf{B}(S_{n+1}) \big), \qquad (13)$$
where $\operatorname{Tr}(\cdot)$ denotes the trace operation and $\mathbf{B}(S_n) = \mathbb{E}\big[ \big( \hat{\mathbf{h}}_r(S_n) - \mathbf{h}_r[n] \big) \big( \hat{\mathbf{h}}_r(S_n) - \mathbf{h}_r[n] \big)^H \big]$ is the error covariance. Unlike in [31,32,33], the error covariance is defined between the estimated channel $\hat{\mathbf{h}}_r(S_n)$ and $\mathbf{h}_r[n]$ at time slot index $n$.
The Q-value function $Q(S_n, a)$ is the expected sum of rewards, which is given by
$$Q(S_n, a) = \sum_{j \in \mathcal{J}_a} T^{(a,j)}(S_n) \Big[ R\big( S_n, U_{n+1}^{(a,j)}(S_n) \big) + \gamma\, V\big( U_{n+1}^{(a,j)}(S_n) \big) \Big], \qquad (14)$$
where $V\big( U_{n+1}^{(a,j)}(S_n) \big)$ is the optimal sum of future rewards after $U_{n+1}^{(a,j)}(S_n)$, and $\gamma$ is a discounting factor whose value is set to one because the proposed channel estimator also considers the effect of future rewards at the ending state [31].
The optimal policy maximizes the Q-value function, which is expressed as
$$\pi^{\star}(S_n) = \operatorname*{argmax}_{a \in \mathcal{A}} Q(S_n, a). \qquad (15)$$
Solving the optimization problem in (15) is highly difficult because the transition probability $T^{(a,j)}(S_n)$ is unknown and the number of candidate states increases exponentially with the data length. An effective method for this problem is reinforcement learning. Therefore, the proposed channel estimator also adopts a reinforcement learning algorithm, but in contrast to [31,32,33], the effect of the time-varying channel is also considered.
A deep reinforcement learning (DRL) approach is a promising solution for dealing with the dimension explosion of the states by leveraging deep neural networks. To apply the DRL approach to our MDP, an agent must interact with an environment to obtain an action-value function for a given action and state. However, neither the states nor the rewards of our MDP are observable at the receiver. This means that the agent cannot acquire training samples, each of which consists of a state (or a state transition) and the corresponding reward. Consequently, the DRL approach and other data-driven approaches are not directly applicable to solving our MDP.

3. Proposed Optimal Policy

This section describes the proposed optimal policy. The basic concept of the derivation is similar to that in [31,32,33]. However, a direct extension is difficult in time-varying channels because capturing the channel variation using previously selected data symbols is difficult. To address this, we approximate the first-order Gaussian-Markov process and propose a computationally efficient algorithm.
We employ the approximation in [31,32,33] for the transition function, which is given by
$$\hat{T}^{(a,j)}(S_n) = \begin{cases} \theta_j[n], & j \in \mathcal{J}_a,\ a = 1, \\ 1, & j \in \mathcal{J}_a,\ a = 0, \end{cases} \qquad (16)$$
where $\hat{T}^{(a,j)}(S_n) \to T^{(a,j)}(S_n)$ as $\theta_j[n] \to 1$.
The main difficulty in analyzing the time-varying channel model is handling the innovation term $\boldsymbol{\Delta}[n]$. To resolve this difficulty, we approximate the first-order Gaussian-Markov process in (1) as follows:
$$\mathbf{H}[n] \approx \sqrt{1-\epsilon^2}\,\mathbf{H}[n-1], \qquad (17)$$
where the innovation term $\epsilon\,\boldsymbol{\Delta}[n]$ is neglected relative to $\mathbf{H}[n-1]$. This approximation is often adopted because it provides analytical tractability [36,37,38]. Using this approximation, the received symbol $\mathbf{z}[n+m]$ for $m \geq 1$ can be expressed in terms of $\mathbf{H}[n]$ as follows:
$$\mathbf{z}[n+m] = \mathbf{H}^H[n+m]\,\mathbf{x}[n+m] + \mathbf{n}[n+m] \approx \mathbf{H}^H[n] \left( \sqrt{1-\epsilon^2} \right)^{m} \mathbf{x}[n+m] + \mathbf{n}[n+m]. \qquad (18)$$
From approximation (18), the virtual state in [31] that mimics the optimal behavior from state $U_{n+1}^{(a,j)}(S_n)$ can be obtained as follows:
$$\tilde{U}_m^{(a,j)}(S_n) = \big( \mathbf{X}_m^{(a,j)},\ \hat{\mathbf{X}}_m^{(a)},\ \mathcal{C}_m^{(a)} \big), \qquad (19)$$
where
$$\mathbf{X}_m^{(a,j)} = \begin{cases} \big[ \mathbf{X}_n,\ \mathbf{x}_j,\ \sqrt{1-\epsilon^2}\,\tilde{\mathbf{x}}[n+1],\ \ldots,\ \big(\sqrt{1-\epsilon^2}\big)^{m-n-1}\,\tilde{\mathbf{x}}[m-1] \big], & a = 1, \\ \big[ \mathbf{X}_n,\ \sqrt{1-\epsilon^2}\,\tilde{\mathbf{x}}[n+1],\ \ldots,\ \big(\sqrt{1-\epsilon^2}\big)^{m-n-1}\,\tilde{\mathbf{x}}[m-1] \big], & a = 0, \end{cases}$$
$$\hat{\mathbf{X}}_m^{(a)} = \begin{cases} \big[ \hat{\mathbf{X}}_n,\ \hat{\mathbf{x}}[n],\ \sqrt{1-\epsilon^2}\,\tilde{\mathbf{x}}[n+1],\ \ldots,\ \big(\sqrt{1-\epsilon^2}\big)^{m-n-1}\,\tilde{\mathbf{x}}[m-1] \big], & a = 1, \\ \big[ \hat{\mathbf{X}}_n,\ \sqrt{1-\epsilon^2}\,\tilde{\mathbf{x}}[n+1],\ \ldots,\ \big(\sqrt{1-\epsilon^2}\big)^{m-n-1}\,\tilde{\mathbf{x}}[m-1] \big], & a = 0, \end{cases}$$
$$\mathcal{C}_m^{(a)} = \begin{cases} \mathcal{C} \cup \{n+1, \ldots, m\}, & a = 1, \\ \mathcal{C} \cup \{n+2, \ldots, m\}, & a = 0. \end{cases}$$
The soft-decision symbol $\tilde{\mathbf{x}}[m]$ for $m \geq n+1$ is defined as
$$\tilde{\mathbf{x}}[m] = \sum_{k=1}^{K} \theta_k[m]\, \mathbf{x}_k. \qquad (20)$$
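A short sketch of (20), together with the scaling it receives inside the virtual state (19), is given below (ours; theta and cands follow the detector sketch in Section 2).

```python
import numpy as np

def soft_symbol(theta, cands):
    """Soft-decision symbol x_tilde[m] in (20): the APP-weighted candidate average.
    theta: (K,) APPs; cands: (K, Nt) candidate vectors; returns an (Nt,) vector."""
    return theta @ cands

def virtual_symbol(theta, cands, eps, lag):
    """Soft symbol as it enters the virtual state (19), scaled by
    (sqrt(1 - eps^2))^lag with lag = m - n - 1; its influence on the
    estimate of H[n] decays as the lag grows."""
    return np.sqrt(1 - eps**2) ** lag * soft_symbol(theta, cands)
```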
In (19), because $0 \leq \epsilon \leq 1$, the effect of the soft-decision symbol $\tilde{\mathbf{x}}[n+m]$ on estimating $\mathbf{H}[n]$ diminishes as $m$ increases. Based on the virtual state, the state-action diagram of the proposed channel estimator is shown in Figure 2. In this figure, the number of state transitions at state $S_n$ is one for $a = 0$ and $K$ for $a = 1$. However, after $n+2$, the state transition is simplified to one because the virtual state mimics the behavior of state $U_{n+1}^{(a,j)}(S_n)$.
Using the definition of the virtual state in (19), we can compute the future reward $V\big( U_{n+1}^{(a,j)}(S_n) \big)$ as
$$V\big( U_{n+1}^{(a,j)}(S_n) \big) \approx R\Big( U_{n+1}^{(a,j)}(S_n),\ \tilde{U}_{n+2}^{(a,j)}(S_n) \Big) + \sum_{m=n+2}^{M_d N_d} R\Big( \tilde{U}_m^{(a,j)}(S_n),\ \tilde{U}_{m+1}^{(a,j)}(S_n) \Big). \qquad (21)$$
By applying (13) to the future reward, it telescopes to
$$V\big( U_{n+1}^{(a,j)}(S_n) \big) = \operatorname{Tr}\Big( \mathbf{B}\big( U_{n+1}^{(a,j)}(S_n) \big) - \mathbf{B}\big( \tilde{U}_{n+2}^{(a,j)}(S_n) \big) + \sum_{m=n+2}^{M_d N_d} \Big[ \mathbf{B}\big( \tilde{U}_m^{(a,j)}(S_n) \big) - \mathbf{B}\big( \tilde{U}_{m+1}^{(a,j)}(S_n) \big) \Big] \Big) = \operatorname{Tr}\Big( \mathbf{B}\big( U_{n+1}^{(a,j)}(S_n) \big) - \mathbf{B}\big( \tilde{U}_{M_d N_d + 1}^{(a,j)}(S_n) \big) \Big). \qquad (22)$$
Using the approximations (16) and (22), the Q-value function in (14) is obtained as follows:
$$Q(S_n, a) = \sum_{j \in \mathcal{J}_a} \hat{T}^{(a,j)}(S_n)\, \operatorname{Tr}\Big( \mathbf{B}(S_n) - \mathbf{B}\big( \tilde{U}_{M_d N_d + 1}^{(a,j)}(S_n) \big) \Big). \qquad (23)$$
The error covariance matrix $\mathbf{B}\big( \tilde{U}_m^{(a,j)} \big)$ can be computed as
$$\begin{aligned} \mathbf{B}\big( \tilde{U}_m^{(a,j)} \big) &= \mathbb{E}\Big\{ \big( \hat{\mathbf{h}}_r\big( \tilde{U}_m^{(a,j)} \big) - \mathbf{h}_r[n] \big) \big( \hat{\mathbf{h}}_r\big( \tilde{U}_m^{(a,j)} \big) - \mathbf{h}_r[n] \big)^H \Big\} \\ &= \mathbb{E}\big\{ \hat{\mathbf{h}}_r\big( \tilde{U}_m^{(a,j)} \big)\, \hat{\mathbf{h}}_r^H\big( \tilde{U}_m^{(a,j)} \big) \big\} - \mathbb{E}\big\{ \mathbf{h}_r[n]\, \hat{\mathbf{h}}_r^H\big( \tilde{U}_m^{(a,j)} \big) \big\} - \mathbb{E}\big\{ \hat{\mathbf{h}}_r\big( \tilde{U}_m^{(a,j)} \big)\, \mathbf{h}_r^H[n] \big\} + \mathbb{E}\big\{ \mathbf{h}_r[n]\, \mathbf{h}_r^H[n] \big\} \\ &\overset{(a)}{=} \mathbf{Q}_m^{(a)} \hat{\mathbf{X}}_m^{(a,j)} \Big( \big( \mathbf{X}_m^{(a,j)} \big)^H \mathbf{X}_m^{(a,j)} + \sigma_n^2 \mathbf{I}_{|\mathcal{C}_m^{(a)}|} \Big) \big( \hat{\mathbf{X}}_m^{(a,j)} \big)^H \mathbf{Q}_m^{(a)} - \mathbf{X}_m^{(a,j)} \big( \hat{\mathbf{X}}_m^{(a,j)} \big)^H \mathbf{Q}_m^{(a)} - \mathbf{Q}_m^{(a)} \hat{\mathbf{X}}_m^{(a,j)} \big( \mathbf{X}_m^{(a,j)} \big)^H + \mathbf{I}_{N_t} \\ &= \Big( \mathbf{I}_{N_t} - \mathbf{Q}_m^{(a)} \hat{\mathbf{X}}_m^{(a,j)} \big( \mathbf{X}_m^{(a,j)} \big)^H \Big) \Big( \mathbf{I}_{N_t} - \mathbf{Q}_m^{(a)} \hat{\mathbf{X}}_m^{(a,j)} \big( \mathbf{X}_m^{(a,j)} \big)^H \Big)^H + \sigma_n^2 \mathbf{Q}_m^{(a)} - \sigma_n^4 \big( \mathbf{Q}_m^{(a)} \big)^2 \\ &\overset{(b)}{=} \mathbf{Q}_m^{(a)} \mathbf{D}_m^{(a,j)} \big( \mathbf{D}_m^{(a,j)} \big)^H \mathbf{Q}_m^{(a)} + \sigma_n^2 \mathbf{Q}_m^{(a)} - \sigma_n^4 \big( \mathbf{Q}_m^{(a)} \big)^2, \end{aligned} \qquad (24)$$
where the distribution of $\mathbf{z}_r^H\big( \tilde{U}_m^{(a,j)}(S_n) \big)$ is given by $\mathcal{CN}\Big( \mathbf{0}_{|\mathcal{C}_m^{(a)}|},\ \big( \mathbf{X}_m^{(a,j)} \big)^H \mathbf{X}_m^{(a,j)} + \sigma_n^2 \mathbf{I}_{|\mathcal{C}_m^{(a)}|} \Big)$ and $\mathbf{Q}_m^{(a)} = \Big( \hat{\mathbf{X}}_n \hat{\mathbf{X}}_n^H + \sum_{k=n+1}^{m-1} (1-\epsilon^2)^{k-n}\, \tilde{\mathbf{x}}[k]\, \tilde{\mathbf{x}}^H[k] + \sigma_n^2 \mathbf{I}_{N_t} \Big)^{-1}$ is applied in $(a)$. $\mathbf{D}_m^{(a,j)} = \big( \mathbf{Q}_m^{(a)} \big)^{-1} - \hat{\mathbf{X}}_m^{(a,j)} \big( \mathbf{X}_m^{(a,j)} \big)^H = \hat{\mathbf{X}}_m^{(a,j)} \big( \hat{\mathbf{X}}_m^{(a,j)} - \mathbf{X}_m^{(a,j)} \big)^H + \sigma_n^2 \mathbf{I}_{N_t}$ is used in $(b)$.
By applying (24) to the Q-value function, the optimal policy at $S_n$ is computed as
$$\pi^{\star}(S_n) = \operatorname*{argmax}_{a \in \{0,1\}} Q(S_n, a) = \mathbb{I}\big( Q(S_n, 1) - Q(S_n, 0) \geq 0 \big) = \mathbb{I}\Big( \operatorname{Tr}\Big( \mathbf{B}\big( \tilde{U}_{M_d N_d + 1}^{(0,0)}(S_n) \big) - \sum_{j=1}^{K} \theta_j[n]\, \mathbf{B}\big( \tilde{U}_{M_d N_d + 1}^{(1,j)}(S_n) \big) \Big) \geq 0 \Big), \qquad (25)$$
where $\mathbf{B}\big( \tilde{U}_{M_d N_d + 1}^{(a,j)}(S_n) \big) = \sigma_n^2 \mathbf{Q}^{(a)} - \sigma_n^4 \big( \mathbf{Q}^{(a)} \big)^2 + \mathbf{Q}^{(a)} \mathbf{D}^{(a,j)} \big( \mathbf{D}^{(a,j)} \big)^H \mathbf{Q}^{(a)}$. $\mathbf{Q}^{(a)} = \mathbf{Q}_{M_d N_d + 1}^{(a)}$ and $\mathbf{D}^{(a,j)} = \mathbf{D}_{M_d N_d + 1}^{(a,j)}$ are given by
$$\mathbf{Q}^{(a)} = \Big( \hat{\mathbf{X}}_{M_d N_d + 1}^{(a)} \big( \hat{\mathbf{X}}_{M_d N_d + 1}^{(a)} \big)^H + \sigma_n^2 \mathbf{I}_{N_t} \Big)^{-1} \overset{(a)}{=} \begin{cases} \Big( \hat{\mathbf{X}}_n \hat{\mathbf{X}}_n^H + \sum_{m=n+1}^{M_d N_d} (1-\epsilon^2)^{m-n}\, \tilde{\mathbf{x}}[m]\, \tilde{\mathbf{x}}^H[m] + \sigma_n^2 \mathbf{I}_{N_t} \Big)^{-1}, & a = 0, \\ \Big( \big( \mathbf{Q}^{(0)} \big)^{-1} + \hat{\mathbf{x}}[n]\, \hat{\mathbf{x}}^H[n] \Big)^{-1}, & a = 1, \end{cases}$$
$$\mathbf{D}^{(a,j)} = \hat{\mathbf{X}}_{M_d N_d + 1}^{(a)} \Big( \hat{\mathbf{X}}_{M_d N_d + 1}^{(a)} - \mathbf{X}_{M_d N_d + 1}^{(a,j)} \Big)^H + \sigma_n^2 \mathbf{I}_{N_t} \overset{(b)}{=} \begin{cases} \hat{\mathbf{X}}_n \big( \hat{\mathbf{X}}_n - \mathbf{X}_n \big)^H + \sigma_n^2 \mathbf{I}_{N_t}, & j \in \mathcal{J}_a,\ a = 0, \\ \mathbf{D}^{(0,0)} + \hat{\mathbf{x}}[n] \big( \hat{\mathbf{x}}[n] - \mathbf{x}_j \big)^H, & j \in \mathcal{J}_a,\ a = 1. \end{cases}$$
Similar to [31], $\mathbf{Q}^{(1)}$ and $\mathbf{Q}^{(0)}$ satisfy $\mathbf{Q}^{(1)} = \mathbf{Q}^{(0)} - \dfrac{ \mathbf{Q}^{(0)} \hat{\mathbf{x}}[n]\, \hat{\mathbf{x}}^H[n]\, \mathbf{Q}^{(0)} }{ 1 + \hat{\mathbf{x}}^H[n]\, \mathbf{Q}^{(0)} \hat{\mathbf{x}}[n] }$. In addition, $\mathbf{D}^{(1,j)}$ and $\mathbf{D}^{(0,0)}$ satisfy $\sum_{j=1}^{K} \theta_j[n]\, \mathbf{D}^{(1,j)} \big( \mathbf{D}^{(1,j)} \big)^H = \big( \mathbf{D}^{(0,0)} + \hat{\mathbf{d}}_n \big) \big( \mathbf{D}^{(0,0)} + \hat{\mathbf{d}}_n \big)^H + \delta_n\, \hat{\mathbf{x}}[n]\, \hat{\mathbf{x}}^H[n]$, where $\hat{\mathbf{d}}_n = \hat{\mathbf{x}}[n] \big( \hat{\mathbf{x}}[n] - \tilde{\mathbf{x}}[n] \big)^H$ and $\delta_n = \sum_{j=1}^{K} \theta_j[n] \big\| \hat{\mathbf{x}}[n] - \mathbf{x}_j \big\|^2 - \big\| \hat{\mathbf{x}}[n] - \tilde{\mathbf{x}}[n] \big\|^2$.
Finally, similar to [32], by applying the results in (23) and (24) to (25), we obtain the proposed optimal policy in closed form as
$$\pi^{\star}(S_n) = \mathbb{I}\left( \frac{ \sigma_n^2 (1 + \alpha_n) + \sigma_n^4 \|\mathbf{a}_n\|^2 + \|\mathbf{d}_n\|^2 }{ 2 \sigma_n^4 \beta_n + \gamma_n + \| c_n \mathbf{b}_n + \mathbf{d}_n \|^2 } \geq 1 \right). \qquad (26)$$
When we define $\mathbf{Q} = \mathbf{Q}^{(0)}$ and $\mathbf{D} = \mathbf{D}^{(0,0)}$, the vectors are computed as $\mathbf{a}_n = \mathbf{Q} \hat{\mathbf{x}}[n] / \sqrt{1 + \alpha_n}$, $\mathbf{b}_n = \mathbf{D}^H \mathbf{a}_n$, and $\mathbf{d}_n = \mathbf{D}^H \mathbf{Q} \mathbf{a}_n / \|\mathbf{a}_n\|^2$, with the scalar $c_n = \| \hat{\mathbf{x}}[n] - \tilde{\mathbf{x}}[n] \| / \sqrt{1 + \alpha_n}$. In addition, the constants are computed as $\alpha_n = \hat{\mathbf{x}}^H[n]\, \mathbf{Q}\, \hat{\mathbf{x}}[n]$, $\beta_n = \mathbf{a}_n^H \mathbf{Q} \mathbf{a}_n / \|\mathbf{a}_n\|^2$, and $\gamma_n = \delta_n / (1 + \alpha_n)$. Note that the expression of the optimal policy in (26) is similar to that in [32]; however, the vectors and constants in the optimal policy differ from those in [32] because the temporal correlation $\epsilon$ is incorporated in $\mathbf{Q}$ and $\mathbf{D}$. When $\epsilon = 0$, the optimal policy in (26) is equivalent to that in [32].
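To make the selection rule concrete, the following sketch (ours; all names are illustrative) evaluates the policy through the trace comparison in (25) with the error covariance (24), rather than through the algebraically rearranged scalar form (26); the soft-symbol contributions to $\mathbf{Q}^{(0)}$ are assumed to be folded into the supplied matrices.

```python
import numpy as np

def B_of(Q, D, sigma2):
    """Error covariance (24): sigma^2 Q - sigma^4 Q^2 + Q D D^H Q."""
    return sigma2 * Q - sigma2**2 * (Q @ Q) + Q @ D @ D.conj().T @ Q

def policy(Q0, D0, x_hat, theta, cands, sigma2):
    """Return 1 if the detected symbol x_hat should be selected (a* = 1)."""
    q = Q0 @ x_hat                                  # Sherman-Morrison update of Q
    Q1 = Q0 - np.outer(q, q.conj()) / (1 + x_hat.conj() @ q)
    B0 = B_of(Q0, D0, sigma2)                       # covariance for a = 0
    B1 = np.zeros_like(B0)
    for th, xj in zip(theta, cands):                # APP-weighted covariances, a = 1
        D1j = D0 + np.outer(x_hat, (x_hat - xj).conj())
        B1 += th * B_of(Q1, D1j, sigma2)
    return int(np.trace(B0 - B1).real >= 0)         # selection rule (25)
```

The rank-1 identities quoted above are exactly what allow this per-candidate matrix evaluation to be collapsed into the scalar test (26).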

4. Further Performance Improvement

In this section, we propose a practical method to improve the estimation accuracy of the proposed channel estimator. The proposed method refines state elements to capture the time-varying nature of the channel.

4.1. State Element Refinement

Elements $\mathbf{X}_n$ and $\hat{\mathbf{X}}_n$ in state $S_n$ are updated when a detected data symbol is selected based on the optimal policy. However, these elements gradually lose their effectiveness in estimating $\mathbf{H}[n]$ as the time slot index $n$ increases. To address this, we first represent the received symbol for $m \geq 1$ in terms of $\mathbf{H}[n]$ as
$$\mathbf{z}[n-m] = \mathbf{H}^H[n-m]\,\mathbf{x}[n-m] + \mathbf{n}[n-m] \approx \mathbf{H}^H[n] \left( \sqrt{1-\epsilon^2} \right)^{-m} \mathbf{x}[n-m] + \mathbf{n}[n-m]. \qquad (27)$$
Using (27), we refine the elements $\mathbf{X}_n$ and $\hat{\mathbf{X}}_n$ in the state as the time slot index increases, which is given by
$$\mathbf{X}_n \leftarrow \left( \sqrt{1-\epsilon^2} \right)^{-1} \mathbf{X}_n, \qquad \hat{\mathbf{X}}_n \leftarrow \left( \sqrt{1-\epsilon^2} \right)^{-1} \hat{\mathbf{X}}_n. \qquad (28)$$
Regardless of the above refinement, the previously selected data symbols lose their effectiveness as the time slot index increases, particularly for large data lengths. This is because the term $\boldsymbol{\Delta}[n]$ in (1) becomes dominant, increasing the uncertainty in estimating the channel. To overcome this, we remove selected data symbols that are too old from the state by introducing a window size $N_w$. In other words, we maintain the size of the set of time slot indices as $|\mathcal{C}| = N_w$. Thus, when the optimal action is one at time slot index $n$, $n$ is included in $\mathcal{C}$, whereas the oldest index $\mathcal{C}(1)$ is removed from the set, which can be expressed as
$$\mathbf{X}_n \leftarrow \mathbf{X}_n \setminus \mathbf{X}_n[\mathcal{C}(1)], \qquad \hat{\mathbf{X}}_n \leftarrow \hat{\mathbf{X}}_n \setminus \hat{\mathbf{X}}_n[\mathcal{C}(1)], \qquad \mathcal{C} \leftarrow \mathcal{C} \setminus \mathcal{C}(1). \qquad (29)$$
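A sketch of the refinement (28) and the sliding-window pruning (29) follows (ours; the column layout [pilots | selected data] is an assumption).

```python
import numpy as np

def refine_state(X, Xhat, eps):
    """Rescale the stored symbols by (sqrt(1 - eps^2))^{-1} so that, under the
    approximation (27), they stay matched to the current channel H[n], as in (28)."""
    scale = 1.0 / np.sqrt(1 - eps**2)
    return scale * X, scale * Xhat

def prune_state(X, Xhat, C, Np, Nw):
    """Drop the oldest selected symbol (time index C(1)) while |C| exceeds Nw,
    as in (29). Columns 0..Np-1 of X and Xhat hold the pilots."""
    while len(C) > Nw:
        X = np.delete(X, Np, axis=1)       # first data column after the pilots
        Xhat = np.delete(Xhat, Np, axis=1)
        C = C[1:]
    return X, Xhat, C
```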

4.2. Algorithm

Using the proposed optimal policy and the performance improvement strategy, the proposed channel estimator is summarized in Algorithm 1. The receiver obtains the initial channel estimate during the pilot transmission. Subsequently, during the data transmission, the receiver sequentially selects data symbols based on the optimal policy. When the optimal action is $a^{\star} = 1$, the state $S_n$ is updated using the most-probable state transition [31], and the state element refinement is performed based on this condition. After each data block ends, the channel estimate is updated using the state $S_n$.
Algorithm 1: Proposed channel estimator
1 Obtain the initial channel estimate $\hat{\mathbf{H}} = [\hat{\mathbf{h}}_1, \ldots, \hat{\mathbf{h}}_{N_r}]$ from (3).
2 Initialize the state $S_1 = (\mathbf{X}^p, \mathbf{X}^p, \emptyset)$.
3 for $d = 1$ to $M_d$ do
4   for $n \in \mathcal{M}_d$ do
5     Compute the optimal policy $a^{\star} = \pi^{\star}(S_n)$ from (26).
6     Set the optimal values $j^{\star} = 0$ for $a^{\star} = 0$ and $\mathbf{x}_{j^{\star}} = \hat{\mathbf{x}}[n]$ for $a^{\star} = 1$.
7     Update $S_{n+1} \leftarrow U_{n+1}^{(a^{\star}, j^{\star})}(S_n)$ from (12).
8     if $a^{\star} = 1$ and $N_w < |\mathcal{C}|$ then
9       Remove the oldest state elements in $S_{n+1}$:
10        $\mathbf{X}_{n+1} \leftarrow \mathbf{X}_{n+1} \setminus \mathbf{X}_{n+1}[\mathcal{C}(1)]$,
11        $\hat{\mathbf{X}}_{n+1} \leftarrow \hat{\mathbf{X}}_{n+1} \setminus \hat{\mathbf{X}}_{n+1}[\mathcal{C}(1)]$,
12        $\mathcal{C} \leftarrow \mathcal{C} \setminus \mathcal{C}(1)$.
13    end
14    Refine the state elements as $\mathbf{X}_{n+1} \leftarrow (\sqrt{1-\epsilon^2})^{-1} \mathbf{X}_{n+1}$ and $\hat{\mathbf{X}}_{n+1} \leftarrow (\sqrt{1-\epsilon^2})^{-1} \hat{\mathbf{X}}_{n+1}$.
15  end
16  Update the channel estimate $\hat{\mathbf{H}} = [\hat{\mathbf{h}}_1(S_n), \ldots, \hat{\mathbf{h}}_{N_r}(S_n)]$ from (10).
17 end
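The skeleton below (ours) strings the earlier sketches together for a single data block, mirroring the structure of Algorithm 1. Note that Q^(0) and D^(0,0) are built here without the future soft-symbol terms of the exact recursion, which is a simplifying assumption rather than the paper's computation.

```python
import numpy as np

def run_block(Xp, Zp, Z_data, cands, sigma2, eps, Nw):
    """One data block of Algorithm 1. Xp: (Nt, Np) pilots; Zp: (Np, Nr);
    Z_data: (Nd, Nr) received data symbols; cands: (K, Nt) candidates."""
    Nt, Np = Xp.shape
    Nr = Zp.shape[1]
    X, Xhat, C = Xp.copy(), Xp.copy(), []               # S_1 = (X^p, X^p, empty set)
    hatH = np.stack([lmmse_row(Xp, Zp[:, r], sigma2)    # initial estimate, (3)
                     for r in range(Nr)], axis=1)
    for n in range(Z_data.shape[0]):
        x_hat, theta = map_detect(Z_data[n], hatH, cands, sigma2)
        Q0 = np.linalg.inv(Xhat @ Xhat.conj().T + sigma2 * np.eye(Nt))
        D0 = Xhat @ (Xhat - X).conj().T + sigma2 * np.eye(Nt)
        if policy(Q0, D0, x_hat, theta, cands, sigma2):  # a* = 1: select symbol
            X = np.column_stack([X, x_hat])              # most-probable transition
            Xhat = np.column_stack([Xhat, x_hat])
            C.append(n)
            X, Xhat, C = prune_state(X, Xhat, C, Np, Nw) # sliding window, (29)
        X, Xhat = refine_state(X, Xhat, eps)             # per-slot rescaling, (28)
    Zs = np.concatenate([Zp, Z_data[C, :]], axis=0)      # received symbols z(S_n)
    return np.stack([lmmse_row(Xhat, Zs[:, r], sigma2)   # block-end update, (10)
                     for r in range(Nr)], axis=1)
```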
In Figure 3, we show a block diagram of the proposed channel estimator, which consists of the LMMSE channel estimator, optimal policy calculator, and state element refinement. The LMMSE channel estimator obtains the initial estimate at pilot transmission and updates the estimate at data transmission using state S n . The optimal policy calculator obtains the optimal action of (26) from the channel estimates and APP from the data detector. The state elements are then refined based on the obtained optimal action, and the refined state is used to estimate the channel and optimal policy for the next step.
Application to other data detection methods: The proposed RL-aided channel estimator can be universally applied to any other soft-output data detection method. The proposed estimator relies on the availability of APPs, which are directly obtained from the MAP data detection method. When other soft-output data detectors are used, the proposed RL-aided channel estimator can instead utilize the APPs computed from the log-likelihood ratios.
Complexity analysis: Complexity is analyzed in terms of real multiplications to provide an implementation perspective. Figure 3 shows the hardware structure of the proposed RL-aided channel estimator, which consists of the LMMSE channel estimator, state element refinement, and optimal policy calculator. Because the exact complexity can vary depending on the implementation details, the complexity order ( O ( · ) ) of each component is analyzed.
The complexity order of the LMMSE channel estimator in (7) is $O\big( (N_p + |\mathcal{C}(\mathbf{a})|)(N_t^2 + N_t N_r) \big)$, where $\mathcal{C}(\mathbf{a})$ is the set of selected data symbol vectors. The complexity order of the state element refinement in Section 4.1 is $O\big( 4(N_p + |\mathcal{C}(\mathbf{a})|) \big)$. The complexity of the optimal policy in (25) is primarily determined by the computation of $\mathbf{Q}^{(a)}$; consequently, its complexity order is $O\big( 2 N_t^2 T_d^2 \big)$. It is important to note that, among the components of the proposed channel estimator, the optimal policy calculator has the highest complexity because it is executed at every data symbol index $n$, whereas the other components are executed once per data block index $d$.

5. Simulation Results

This section demonstrates the effectiveness of the proposed channel estimator through simulations. The numbers of transmit and receive antennas were $N_t = 2$ and $N_r = 4$, respectively. The transmission frame consisted of one pilot block with $N_p = 8$ symbols and $M_d = 20$ data blocks with $N_d = 128$ symbols each. 4-quadrature amplitude modulation (QAM) symbol mapping was used. We adopted a turbo channel code with a rate of $1/2$ and 16 cyclic redundancy check bits. For the proposed channel estimator, the window size was set to $N_w = 2 N_d$. The signal-to-noise ratio (SNR) was defined as $E_b/N_0 = 1/(\log_2 |\mathcal{X}|\, \sigma_n^2)$ under the power constraint $\mathbb{E}\{ \| \mathbf{x}[n] \|^2 \} = 1$. The proposed channel estimator was compared with the following methods:
  • PCSI: This ideal method assumes that a perfect initial channel estimate is available at the receiver. Because the channel changes during the data transmission, this estimate is no longer optimal in time-varying channels.
  • Pilot: This method is the conventional pilot-aided channel estimator in (3).
  • Soft: This method is a data-aided channel estimator in which all soft-decision symbols in (20) are used as additional pilot symbols.
  • Conv-RL [31]: This method is a data-aided channel estimator in which the detected data symbol is selected using the RL approach developed for time-invariant channels.
The performance of these methods was compared with that of the proposed channel estimator in terms of the block-error rate (BLER) and normalized MSE (NMSE). In addition, we considered the time-invariant channel ($\epsilon = 0$) and time-varying channels with $\epsilon = 0.005$ and $\epsilon = 0.01$. Note that the channel varies more severely when $\epsilon = 0.01$ than when $\epsilon = 0.005$.
Figure 4 shows the BLERs of the proposed and other channel estimators in the time-invariant channel, i.e., $\epsilon = 0$. The conventional pilot-aided channel estimator exhibited poor performance because the number of pilots was small. The data-aided channel estimators overcame this performance degradation. In particular, the RL-based channel estimator [31] showed outstanding performance compared with the other channel estimators. The BLER of the proposed channel estimator was slightly worse than that of [31] because of the reduced window size $N_w = 2 N_d$.
In Figure 5, the proposed channel estimator is compared with the other channel estimators in time-varying channels. The proposed channel estimator achieved a larger BLER improvement over the conventional pilot-aided channel estimator, and this improvement was more prominent at $\epsilon = 0.01$ than at $\epsilon = 0.005$. This is because the proposed channel estimator can efficiently capture channel variations by selecting and refining the detected data symbols. Recall that in the time-invariant channel, the proposed channel estimator had a slightly higher BLER than the RL-based channel estimator, primarily due to the reduced window size. In time-varying channels, however, this reduction in window size actually improved the BLER by effectively leveraging the most recent data symbols; consequently, the proposed channel estimator achieved a lower BLER than the RL-based channel estimator (see Figure 5). In Figure 6, we show the BLER of the proposed channel estimator for different window sizes $N_w$ in time-varying channels with $\epsilon = 0.01$. The BLER of the proposed channel estimator gradually degraded as $N_w$ increased. This is because an old selected data symbol is undesirable as an additional pilot symbol in fast fading channels; therefore, only the use of the most recently selected data symbols can improve the performance.
To further investigate the effect of the window size, we examined the NMSE of the proposed channel estimator for different window sizes $N_w$ at $\epsilon = 0.005$ and $E_b/N_0 = 2$ dB (Figure 7). We observed that the NMSE improved up to the second data block but degraded as the data block index increased further. This is because old selected data symbols are ineffective for estimating the current channel. Thus, when we discard the old data symbols, the estimation accuracy is further improved (Figure 7).

6. Conclusions

A data-aided channel estimator was proposed for time-varying channels, which involves selecting the detected data symbols. To facilitate efficient selection of the detected data symbols, an optimization problem was first formulated to minimize the channel estimation error. Subsequently, the MDP for this optimization problem was formulated, and its optimal policy was derived using an RL algorithm. In the derivation, approximations of the transition probability and of the first-order Gaussian-Markov process were utilized. To improve the estimation accuracy, a state element refinement was introduced to capture the time-varying nature of the channel by incorporating a window size. Simulation results demonstrated that the proposed channel estimator provides performance similar to that of the conventional RL-based channel estimator in time-invariant channels ($\epsilon = 0$), while showing improved performance over it in time-varying channels ($\epsilon = 0.005$ and $\epsilon = 0.01$).
An interesting direction for further research involves optimizing the frame structure in terms of spectral efficiency. In this study, the frame structure comprises one pilot block and $M_d$ data blocks, and the proposed RL-aided channel estimator is applied to the data blocks to capture the time-varying nature of the channel. In fast fading channels, however, it can be challenging for the proposed channel estimator to accurately track the channel variations. In such cases, reducing the value of $M_d$ in the frame structure can potentially improve the performance, but this reduction also degrades the spectral efficiency. To find an appropriate value of $M_d$ in time-varying channels, an optimization problem that maximizes the spectral efficiency while maintaining an acceptable performance level is a suitable criterion. One approach is to first derive the performance of the RL-aided channel estimator and then obtain the solution to the optimization problem using the derived performance.

Author Contributions

Conceptualization, M.M.; Methodology, T.-K.K.; Software, T.-K.K.; Validation, M.M.; Formal analysis, T.-K.K.; Investigation, T.-K.K.; Resources, T.-K.K.; Data curation, T.-K.K.; Writing—original draft, T.-K.K.; Writing—review & editing, M.M.; Visualization, M.M.; Supervision, M.M.; Project administration, M.M.; Funding acquisition, M.M. All authors have read and agreed to the published version of the manuscript.

Funding

The work of Tae-Kyoung Kim was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1F1A1063273). The work of Moonsik Min was supported in part by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MSIT) (No. 2023R1A2C1004034), and in part by the BK21 FOUR Project funded by the Ministry of Education, Korea (4199990113966).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Goldsmith, A.; Jafar, S.A.; Jindal, N.; Vishwanath, S. Capacity Limits of MIMO Channels. IEEE J. Sel. Commun. 2003, 21, 684–702. [Google Scholar] [CrossRef] [Green Version]
  2. Zheng, L.; Tse, D.N.C. Diversity and Multiplexing: A Fundamental Tradeoff in Multiple-Antenna Channels. IEEE Trans. Inf. Theory 2003, 49, 1073–1096. [Google Scholar] [CrossRef] [Green Version]
  3. Paulraj, A.J.; Gore, D.A.; Nabar, R.U.; Bolcskei, H. An Overview of MIMO Communications-a Key to Gigabit Wireless. Proc. IEEE 2004, 92, 198–218. [Google Scholar] [CrossRef] [Green Version]
  4. Sanayei, S.; Nosratinia, A. Antenna Selection in MIMO Systems. IEEE Commun. Mag. 2004, 42, 68–73. [Google Scholar] [CrossRef]
  5. Larsson, E.G.; Edfors, O.; Tufvesson, F.; Marzetta, T.L. Massive MIMO for Next Generation Wireless Systems. IEEE Commun. Mag. 2014, 52, 186–195. [Google Scholar] [CrossRef] [Green Version]
  6. Zheng, K.; Zhao, L.; Mei, J.; Shao, B.; Xiang, W.; Hanzo, L. Survey of Large-Scale MIMO Systems. IEEE Commun. Surv. Tutor. 2015, 17, 1738–1760. [Google Scholar] [CrossRef] [Green Version]
  7. Yang, S.; Hanzo, L. Fifty Years of MIMO Detection: The Road to Large-Scale MIMOs. IEEE Commun. Surv. Tutor. 2015, 17, 1941–1988. [Google Scholar] [CrossRef] [Green Version]
  8. Morelli, M.; Mengali, U. A Comparison of Pilot-Aided Channel Estimation Methods for OFDM System. IEEE Trans. Signal Process. 2001, 49, 3065–3073. [Google Scholar] [CrossRef]
  9. Coleri, S.; Ergen, M.; Puri, A.; Bahai, A. Channel Estimation Techniques Based on Pilot Arrangement in OFDM Systems. IEEE Trans. Broadcast. 2002, 48, 223–229. [Google Scholar] [CrossRef] [Green Version]
  10. Mostofi, Y.; Cox, D.C. ICI Mitigation for Pilot-Aided OFDM Mobile Systems. IEEE Trans. Wirel. Commun. 2005, 4, 765–774. [Google Scholar] [CrossRef]
  11. Biguesh, M.; Gershman, A.B. Training-based MIMO Channel Estimation: A Study of Estimator Tradeoffs and Optimal Training Signals. IEEE Trans. Signal Process. 2006, 54, 884–893. [Google Scholar] [CrossRef]
  12. Ozdemir, M.K.; Arslan, H. Channel Estimation for Wireless OFDM Systems. IEEE Commun. Surv. Tutor. 2007, 9, 18–48. [Google Scholar] [CrossRef]
  13. Soltani, M.; Pourahmadi, V.; Mirzaei, A.; Sheikhzadeh, H. Deep Learning-based Channel Estimation. IEEE Commun. Lett. 2019, 23, 652–655. [Google Scholar] [CrossRef] [Green Version]
  14. Le, H.A.; Van Chien, T.; Nguyen, T.H.; Choo, H.; Nguyen, V.D. Machine Learning-Based 5G-and-Beyond Channel Estimation for MIMO-OFDM Communication Systems. Sensors 2021, 21, 4861. [Google Scholar] [CrossRef]
  15. Yuan, J.; Ngo, H.Q.; Matthaiou, M. Machine Learning-Based Channel Prediction in Massive MIMO with Channel Aging. IEEE Trans. Wirel. Commun. 2020, 19, 2960–2973. [Google Scholar] [CrossRef]
  16. Valenti, M.C.; Woerner, B.D. Iterative Channel Estimation and Decoding of Pilot Symbol Assisted Turbo Codes over Flat-Fading Channels. IEEE J. Sel. Commun. 2001, 19, 1697–1705. [Google Scholar] [CrossRef]
  17. Dowler, A.; Nix, A.; McGeehan, J. Data-derived Iterative Channel Estimation with Channel Tracking for a Mobile Fourth Generation Wide Area OFDM System. In Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM), San Francisco, CA, USA, 1–5 December 2003. [Google Scholar]
  18. Cozzo, C.; Hughes, B.L. Joint Channel Estimation and Data Detection in Space-Time Communications. IEEE Trans. Commun. 2003, 51, 1266–1270. [Google Scholar] [CrossRef]
  19. Song, S.; Singer, A.C.; Sung, K.M. Soft Input Channel Estimation for Turbo Equalization. IEEE Trans. Signal Process. 2004, 52, 2885–2894. [Google Scholar] [CrossRef]
  20. Nicoli, M.; Ferrara, S.; Spagnolini, U. Soft-Iterative Channel Estimation: Methods and Performance Analysis. IEEE Trans. Signal Process. 2007, 55, 2993–3006. [Google Scholar] [CrossRef]
  21. Zhao, M.; Shi, Z.; Reed, M.C. Iterative Turbo Channel Estimation for OFDM System over Rapid Dispersive Fading Channel. IEEE Trans. Wirel. Commun. 2008, 7, 3174–3184. [Google Scholar] [CrossRef]
  22. Guo, Q.; Ping, L.; Huang, D. A Low-Complexity Iterative Channel Estimation and Detection Technique for Doubly Selective Channels. IEEE Trans. Wirel. Commun. 2009, 8, 4340–4349. [Google Scholar]
  23. Ma, J.; Ping, L. Data-Aided Channel Estimation in Large Antenna Systems. IEEE Trans. Signal Process. 2014, 62, 3111–3124. [Google Scholar]
  24. Wen, C.K.; Wang, C.J.; Jin, S.; Wong, K.K.; Ting, P. Bayes-Optimal Joint Channel-and-Data Estimation for Massive MIMO with Low-Precision ADCs. IEEE Trans. Signal Process. 2015, 64, 2541–2556. [Google Scholar] [CrossRef] [Green Version]
  25. Park, S.; Shim, B.; Choi, J.W. Iterative Channel Estimation Using Virtual Pilot Signals for MIMO-OFDM Systems. IEEE Trans. Signal Process. 2015, 63, 3032–3045. [Google Scholar] [CrossRef]
  26. Huang, C.; Liu, L.; Yuen, C.; Sun, S. Iterative Channel Estimation Using LSE and Sparse Message Passing for mmWave MIMO Systems. IEEE Trans. Signal Process. 2018, 67, 245–259. [Google Scholar] [CrossRef] [Green Version]
  27. Li, X.; Wang, Q.; Yang, H.; Ma, X. Data-Aided MIMO Channel Estimation by Clustering and Reinforcement-Learning. In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC), Austin, TX, USA, 10–13 April 2022. [Google Scholar]
  28. Naeem, M.; De Pietro, G.; Coronato, A. Application of Reinforcement Learning and Deep Learning in Multiple-Input and Multiple-Output (MIMO) Systems. Sensors 2022, 22, 309. [Google Scholar] [CrossRef]
  29. Oh, M.S.; Hosseinalipour, S.; Kim, T.; Brinton, C.G.; Love, D.J. Channel Estimation via Successive Denoising in MIMO OFDM Systems: A Reinforcement Learning Approach. In Proceedings of the IEEE International Conference on Communications (ICC), Montreal, QC, Canada, 14–23 June 2021. [Google Scholar]
  30. Chu, M.; Liu, A.; Lau, V.K.N.; Jiang, C.; Yang, T. Deep Reinforcement Learning based End-to-End Multi-User Channel Prediction and Beamforming. IEEE Trans. Wirel. Commun. 2022, 21, 10271–10285. [Google Scholar] [CrossRef]
  31. Jeon, Y.S.; Li, J.; Tavangaran, N.; Poor, H.V. Data-Aided Channel Estimator for MIMO Systems via Reinforcement Learning. In Proceedings of the IEEE International Conference on Communications (ICC), Prayagraj, India, 27–29 November 2020. [Google Scholar]
  32. Kim, T.K.; Min, M. A Low-Complexity Algorithm for Reinforcement Learning-Based Channel Estimator for MIMO Systems. Sensors 2022, 21, 4379. [Google Scholar] [CrossRef]
  33. Kim, T.K.; Jeon, Y.S.; Li, J.; Tavangaran, N.; Poor, H.V. Semi-Data-Aided Channel Estimation for MIMO Systems via Reinforcement Learning. IEEE Trans. Wirel. Commun. 2022; early access. [Google Scholar] [CrossRef]
  34. Dong, M.; Tong, L.; Sadler, B.M. Optimal Insertion of Pilot Symbols for Transmissions over Time-Varying Flat Fading Channels. IEEE Trans. Signal Process. 2004, 52, 1403–1418. [Google Scholar] [CrossRef] [Green Version]
  35. Kim, T.K.; Jeon, Y.S.; Min, M. Training Length Adaptation for Reinforcement Learning-Based Detection in Time-Varying Massive MIMO Systems With One-Bit ADCs. IEEE Trans. Veh. Technol. 2021, 70, 6999–7011. [Google Scholar] [CrossRef]
  36. Li, C.C.; Lin, Y.P. Predictive Coding of Bit Loading for Time Correlated MIMO Channels with A Decision Feedback Receiver. IEEE Trans. Signal Process. 2015, 63, 3376–3386. [Google Scholar] [CrossRef]
  37. Kim, H.; Yu, H.; Lee, Y. Limited Feedback for Multicell Zero-Forcing Coordinated Beamforming in Time-Varying Channels. IEEE Trans. Veh. Technol. 2015, 64, 2349–2359. [Google Scholar] [CrossRef]
  38. Mirza, J.; Dmochowski, P.A.; Smith, P.J.; Shafi, M. A Differential Codebook with Adaptive Scaling for Limited Feedback MU MISO Systems. IEEE Wirel. Commun. Lett. 2014, 3, 2–5. [Google Scholar] [CrossRef]
  39. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; The MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
Figure 1. Considered system model and frame structure in time-varying channels.
Figure 2. State-action diagram of the proposed channel estimator. After time slot index $n+2$, the virtual state is applied such that the state transition is simplified.
Figure 3. Proposed channel estimator using the further performance improvement strategy.
Figure 4. BLER for different channel estimators in time-invariant channels.
Figure 5. BLER for different channel estimators in time-varying channels ($\epsilon = 0.005$ and $\epsilon = 0.01$).
Figure 6. BLER of the proposed channel estimator for different window sizes $N_w$ in time-varying channels with $\epsilon = 0.01$.
Figure 7. NMSE of the proposed channel estimator for different window sizes $N_w$ in time-varying channels with $\epsilon = 0.005$.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
