Article

Reinforcement Learning-Aided Channel Estimator in Time-Varying MIMO Systems

Tae-Kyoung Kim 1 and Moonsik Min 2,*
1 Department of Electronic Engineering, Gachon University, Seongnam 13120, Republic of Korea
2 School of Electronic and Electrical Engineering, Kyungpook National University, Daegu 41566, Republic of Korea
* Author to whom correspondence should be addressed.
Sensors 2023, 23(12), 5689; https://doi.org/10.3390/s23125689
Submission received: 17 April 2023 / Revised: 8 June 2023 / Accepted: 16 June 2023 / Published: 18 June 2023
(This article belongs to the Special Issue MIMO Technologies in Sensors and Wireless Communication Applications)

Abstract: This paper proposes a reinforcement learning-aided channel estimator for time-varying multi-input multi-output systems. The basic concept of the proposed channel estimator is the selection of detected data symbols in data-aided channel estimation. To perform this selection successfully, we first formulate an optimization problem that minimizes the data-aided channel estimation error. In time-varying channels, however, the optimal solution is difficult to derive because of its computational complexity and the time-varying nature of the channel. To address these difficulties, we consider a sequential selection of the detected symbols and a refinement of the selected symbols. A Markov decision process is formulated for the sequential selection, and a reinforcement learning algorithm that efficiently computes the optimal policy is proposed together with a state element refinement. Simulation results demonstrate that the proposed channel estimator outperforms conventional channel estimators by efficiently capturing the variation of the channels.

1. Introduction

The multi-input multi-output (MIMO) system is a key technology in modern communication and can significantly improve channel capacity and communication reliability by using multiple antennas [1,2,3,4,5,6,7]. Spatial multiplexing and diversity gain are representative schemes for this improvement [1,2]. Notably, the channel capacity increases linearly with the minimum of the numbers of transmit and receive antennas. However, this increase is based on the unrealistic assumption of perfect channel state information (PCSI) at both the transmitter and receiver.
Many studies have proposed methods to improve the channel estimation accuracy with limited time and frequency resources [8,9,10,11,12,13,14,15]. A representative method is pilot-aided channel estimation, which exploits the information shared between a transmitter and receiver. Linear minimum-mean square-error (LMMSE) channel estimation is a well-known pilot-aided method, owing to its simple structure [8]. However, LMMSE channel estimation exhibits unsatisfactory performance with a limited number of pilots. Thus, many pilots are required to satisfy the performance requirement, which decreases the spectral efficiency.
To overcome this problem, data-aided channel estimation has been investigated, in which the detected data symbols are exploited as additional pilot symbols [16,17,18,19,20,21,22,23,24,25,26]. However, the detected data symbols may contain errors that degrade the accuracy of the channel estimation. An iterative turbo equalizer can overcome this degradation by iteratively refining the detections based on the maximum a posteriori (MAP) probability [16,17,18,19,20,21,22]. However, such an iterative turbo equalizer incurs considerable complexity and latency at the receiver.
As a non-iterative approach, the reinforcement learning (RL)-aided channel estimator was introduced in [27,28,29,30,31,32,33]. The basic concept of this approach is the sequential selection of detected data symbols to minimize the channel estimation error. A Markov decision process (MDP) was defined to solve this sequential selection, and the corresponding optimal policy was derived in a closed-form expression in [31]. In [32], a low-complexity algorithm was investigated by introducing sub-blocks and finite backup samples, which significantly reduced the computational complexity and latency without performance loss. Recently, a general framework for RL-aided channel estimation based on Monte Carlo tree search was studied in [33]. However, the RL-aided channel estimators in [31,32,33] were originally designed for time-invariant channels; thus, they perform poorly in time-varying channels.
In this paper, we propose an RL-aided channel estimator for time-varying MIMO channels. To achieve this, we first introduce an optimization problem for an RL-aided channel estimator in time-varying channels. We then formulate an MDP to solve the optimization problem, and propose an RL algorithm for the MDP that considers the time-varying nature of the channel. The main contributions of this paper are as follows:
  • We propose an RL-aided channel estimator for time-varying channels modeled using a first-order Gaussian-Markov process. First, we define the optimization problem in time-varying channels to select the detected data symbols and minimize the estimation error between the estimated and current channels. This optimization problem differs from those in [31,32,33], where the selection of the detected data symbols is fixed because the channel does not change with the time slot index.
  • We propose an RL algorithm for the optimization problem that captures the time-varying nature of the channel. Because the optimization problem minimizes the estimation error between the estimated and current channels, we adjust the weights of the data symbols to improve the estimation accuracy of the current channel. Using this adjustment, we derive the optimal policy as a closed-form solution. Note that the proposed optimal policy differs from those in [31,32,33] because the influence of the soft-decision symbols in the virtual state on future rewards gradually diminishes as the time slot index increases.
  • We propose a further performance improvement scheme that refines the state elements. This is necessary because previously selected data symbols degrade the estimation accuracy of the current channel. To improve the estimation accuracy, we refine the previously selected data symbols to reflect the channel variation. In addition, we remove selected data symbols that are too old by introducing a sliding window, because their effective noise variance for estimating the current channel is large. Through simulations, we demonstrate the effectiveness of the proposed channel estimator compared with conventional channel estimators in time-varying channels.
The remainder of this paper is organized as follows. In Section 2, we introduce the system model, the optimization problem, and the MDP. The proposed channel estimator, which determines the optimal policy for time-varying channels, is described in Section 3. We propose a further performance improvement scheme in Section 4. In Section 5, we present simulation results to demonstrate the effectiveness of the proposed channel estimator. Finally, we provide our conclusions in Section 6.

2. Preliminaries

This section describes the system model of a data-aided channel estimator for time-varying MIMO channels. We present the considered channel estimation and data detection schemes based on the model and introduce an optimization problem for data-aided channel estimation.

2.1. System Model

We consider MIMO systems in which a transmitter with $N_t$ transmit antennas communicates with a receiver with $N_r$ receive antennas (Figure 1). The information bits are first encoded and mapped to symbols from the constellation set $\mathcal{X}$. The transmitted symbol vector at time slot $n$, denoted by $\mathbf{x}[n] \in \mathcal{X}^{N_t}$, is then sent over a wireless channel. We model the wireless channel using a first-order Gaussian-Markov process as a time-varying channel model [34,35,36,37,38], where the $(t,r)$-th component of the channel matrix $\mathbf{H}[n] \in \mathbb{C}^{N_t \times N_r}$, between the $t$-th transmit and $r$-th receive antennas, follows a Rayleigh fading $\mathcal{CN}(0,1)$ distribution. The temporal correlation parameter of the wireless channel, denoted by $\epsilon \in [0,1]$, increases with velocity. Based on this model, the channel matrix $\mathbf{H}[n]$ at time slot $n$ is given by
$$\mathbf{H}[n] = \sqrt{1-\epsilon^2}\,\mathbf{H}[n-1] + \epsilon\,\boldsymbol{\Delta}[n], \qquad (1)$$
where each entry of $\boldsymbol{\Delta}[n]$ follows a $\mathcal{CN}(0,1)$ distribution.
When the transmitter sends the symbol $\mathbf{x}[n]$ to the receiver over the wireless channel $\mathbf{H}[n]$, the received symbol $\mathbf{z}[n]$ is given by
$$\mathbf{z}[n] = \mathbf{H}^H[n]\,\mathbf{x}[n] + \mathbf{n}[n], \qquad (2)$$
where $(\cdot)^H$ denotes the conjugate transpose and $\mathbf{n}[n]$ is the additive white Gaussian noise (AWGN) at time slot $n$, with distribution $\mathcal{CN}(\mathbf{0}_{N_r}, \sigma_n^2 \mathbf{I}_{N_r})$, where $\mathbf{0}_m$ and $\mathbf{I}_m$ denote the $m \times 1$ zero vector and the $m \times m$ identity matrix, respectively.
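For concreteness, the following minimal NumPy sketch (ours, not from the paper) simulates the channel evolution in (1) and the received signal in (2); the dimensions and the 4-QAM alphabet mirror the simulation setup in Section 5, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def crandn(*shape):
    # circularly symmetric complex Gaussian CN(0, 1) samples
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

Nt, Nr, eps, sigma2 = 2, 4, 0.01, 0.1
qam = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)  # unit-power 4-QAM

H = crandn(Nt, Nr)                                       # H[0]: Rayleigh CN(0, 1) entries
for n in range(1, 100):
    H = np.sqrt(1 - eps**2) * H + eps * crandn(Nt, Nr)   # first-order Gauss-Markov update, (1)
    x = rng.choice(qam, size=Nt)                         # transmitted symbol vector x[n]
    z = H.conj().T @ x + np.sqrt(sigma2) * crandn(Nr)    # received symbol z[n], (2)
```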
Each frame consists of one pilot block and $M_d$ data blocks (Figure 1). The pilot block contains $N_p$ symbols, whereas each data block contains $N_d$ symbols. $\mathcal{M}_p = \{1, \ldots, N_p\}$ is defined as the pilot index set, and $\mathcal{M}_d = \{(d-1)N_d + 1, \ldots, dN_d\}$ is defined as the index set of the $d$-th data block. We consider data-aided channel estimation, in which the receiver obtains the initial channel estimates using the pilot symbols and then improves their accuracy by exploiting data symbols.
We adopt the LMMSE method as the basic channel estimation method because it has a simple structure and provides reasonable performance. Based on the LMMSE method, the $r$-th row $\hat{\mathbf{h}}_r$ of the initial channel estimate $\hat{\mathbf{H}}$ can be obtained as
$$\hat{\mathbf{h}}_r = \left( \mathbf{X}^p (\mathbf{X}^p)^H + \sigma_n^2 \mathbf{I}_{N_t} \right)^{-1} \mathbf{X}^p (\mathbf{z}_r^p)^H, \qquad (3)$$
where $(\cdot)^{-1}$ is the inverse operation, and $\mathbf{X}^p = [\mathbf{x}[1], \ldots, \mathbf{x}[N_p]]$ and $\mathbf{z}_r^p = [z_r[1], \ldots, z_r[N_p]]$ are the pilot symbols and the corresponding received symbols in the pilot block, respectively.
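For illustration, a minimal NumPy sketch of (3) follows (ours, with illustrative names); it solves the regularized linear system directly instead of forming the matrix inverse.

```python
import numpy as np

def lmmse_row(X, z_r, sigma2):
    """LMMSE estimate of the r-th row of the channel as in (3).
    X: (Nt, N) symbol matrix whose columns are symbol vectors;
    z_r: (N,) received samples at receive antenna r; returns an (Nt,) estimate."""
    Nt = X.shape[0]
    A = X @ X.conj().T + sigma2 * np.eye(Nt)      # X X^H + sigma_n^2 I
    return np.linalg.solve(A, X @ z_r.conj())     # equivalent to (.)^{-1} X z_r^H

# Stacking the per-antenna rows reassembles the initial estimate:
# hatH = np.stack([lmmse_row(Xp, Zp[:, r], sigma2) for r in range(Nr)], axis=1)
```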
The conventional channel estimator performs data detection at the receiver using the initial channel estimate $\hat{\mathbf{H}}$. Because the MAP rule guarantees optimal performance, we adopt it for data detection; it is given by
$$\hat{\mathbf{x}}[n] = \operatorname*{argmax}_{\mathbf{x}_k \in \mathcal{X}^{N_t}} \theta_k[n], \qquad (4)$$
where $\mathbf{x}_k \in \mathcal{X}^{N_t}$ and $k$ belongs to the index set of the symbol vector candidates $\mathcal{K} = \{1, \ldots, |\mathcal{X}|^{N_t}\}$, with $|\cdot|$ denoting the cardinality of a set. $\theta_k[n]$ denotes the a posteriori probability (APP), which is given by
$$\theta_k[n] = \frac{ P\big(\mathbf{z}[n] \,\big|\, \mathbf{x}[n] = \mathbf{x}_k\big)\, P\big(\mathbf{x}[n] = \mathbf{x}_k\big) }{ \sum_{j \in \mathcal{K}} P\big(\mathbf{z}[n] \,\big|\, \mathbf{x}[n] = \mathbf{x}_j\big)\, P\big(\mathbf{x}[n] = \mathbf{x}_j\big) }, \qquad (5)$$
where $P(\cdot)$ is the probability of an event. The likelihood probability in (5) is calculated by assuming the AWGN channel as
$$P\big(\mathbf{z}[n] \,\big|\, \mathbf{x}[n] = \mathbf{x}_k\big) = \left( \frac{1}{\pi \sigma_n^2} \right)^{N_r} \exp\!\left( -\frac{ \big\| \mathbf{z}[n] - \hat{\mathbf{H}}^H \mathbf{x}_k \big\|^2 }{ \sigma_n^2 } \right), \qquad (6)$$
where $\|\cdot\|$ denotes the norm operation. The a priori probabilities in (5) are assumed to be equal over the candidate transmitted symbols, i.e., $P(\mathbf{x}[n] = \mathbf{x}_k) = 1/|\mathcal{X}|^{N_t}$.
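A sketch of the detection step (4)-(6) is given below (ours; the candidate enumeration and all names are assumptions). The likelihoods are computed in the log domain and normalized by a softmax, which is numerically equivalent to (5) with uniform priors.

```python
import itertools
import numpy as np

def map_detect(z, hatH, cands, sigma2):
    """MAP detection (4) with APPs (5)-(6).
    z: (Nr,) received symbol; hatH: (Nt, Nr) channel estimate;
    cands: (K, Nt) all candidate vectors x_k in X^{Nt}; returns (x_hat, theta)."""
    loglik = np.array([-np.linalg.norm(z - hatH.conj().T @ xk) ** 2 / sigma2
                       for xk in cands])          # AWGN log-likelihoods, (6)
    loglik -= loglik.max()                        # stabilize the exponentials
    theta = np.exp(loglik)
    theta /= theta.sum()                          # APPs theta_k[n], uniform priors
    return cands[np.argmax(theta)], theta

qam = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
cands = np.array(list(itertools.product(qam, repeat=2)))   # X^{Nt} for Nt = 2
```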

2.2. Problem

In a time-varying channel, the estimation accuracy of the initial channel estimate $\hat{\mathbf{H}}$ gradually decreases as the time slot index $n$ increases. This degradation results in poor detection performance at the receiver. Because a detected data symbol may be erroneous owing to the channel variation, incorrect use of detected data symbols severely degrades performance. To overcome this degradation, we consider a data-aided channel estimator that selects the detected data symbols used for data-aided channel estimation.
For the selection, we define an action $a \in \mathcal{A} = \{0, 1\}$, where the detected data symbol is used in channel estimation when $a = 1$; otherwise, it is not used. When we define $\mathbf{a} \in \{0,1\}^{N_d}$ as a vector of actions, the considered data-aided channel estimate can be obtained using this vector as
$$\hat{\mathbf{h}}_r(\mathbf{a}) = \left( \hat{\mathbf{X}}(\mathbf{a})\, \hat{\mathbf{X}}^H(\mathbf{a}) + \sigma_n^2 \mathbf{I}_{N_t} \right)^{-1} \hat{\mathbf{X}}(\mathbf{a})\, \mathbf{z}_r^H(\mathbf{a}), \qquad (7)$$
where $\mathbf{z}_r(\mathbf{a}) = \big[ \mathbf{z}_r^p, z_r[e_1(\mathbf{a})], \ldots, z_r[e_{\|\mathbf{a}\|_0}(\mathbf{a})] \big]$ and $\hat{\mathbf{X}}(\mathbf{a}) = \big[ \mathbf{X}^p, \hat{\mathbf{x}}[e_1(\mathbf{a})], \ldots, \hat{\mathbf{x}}[e_{\|\mathbf{a}\|_0}(\mathbf{a})] \big]$. Here, $e_i(\mathbf{a})$ denotes the time slot index of the $i$-th nonzero element of $\mathbf{a}$, and $\|\mathbf{a}\|_0$ is the number of nonzero elements. We then define the optimization problem as
$$\mathbf{a}^{\star} = \operatorname*{argmin}_{\mathbf{a}} \; \mathbb{E}\big\{ \big\| \hat{\mathbf{H}}(\mathbf{a}) - \mathbf{H}[n] \big\|^2 \big\}, \qquad (8)$$
where $\mathbb{E}(\cdot)$ is the expectation of a random variable and $\hat{\mathbf{H}}(\mathbf{a}) = [\hat{\mathbf{h}}_1(\mathbf{a}), \ldots, \hat{\mathbf{h}}_{N_r}(\mathbf{a})]$ collects the row estimates in (7).
Compared with previous studies [31,32,33], the optimization problem in (8) considers the selection that minimizes the MSE between the estimated channel and $\mathbf{H}[n]$. Because the channel varies with the time slot index $n$, the best action $\mathbf{a}^{\star}$ may also vary with $n$; that is, the best action at the previous time slot index may be invalid at the next one. In addition, the optimization problem is difficult to solve because the number of candidate actions increases exponentially with the data symbol length, so an exhaustive search over the action candidates is not feasible in practical applications. To resolve these difficulties, we introduce a sequential selection of the detected data symbols and a refinement of the selected data.
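The action-dependent estimate in (7) can be sketched as follows (ours; it reuses lmmse_row from the earlier sketch, and the array layout is an assumption).

```python
import numpy as np

def lmmse_with_actions(Xp, Zp, Xhat_d, Z_d, a, sigma2):
    """Data-aided estimate hat_H(a) in (7).
    Xp: (Nt, Np) pilots; Zp: (Np, Nr) received pilot symbols;
    Xhat_d: (Nt, Nd) detected data symbols; Z_d: (Nd, Nr) received data symbols;
    a: (Nd,) binary action vector."""
    sel = np.flatnonzero(a)                              # e_1(a), ..., e_{||a||_0}(a)
    Xa = np.concatenate([Xp, Xhat_d[:, sel]], axis=1)    # hat{X}(a)
    Za = np.concatenate([Zp, Z_d[sel, :]], axis=0)       # stacks z_r(a) over all r
    return np.stack([lmmse_row(Xa, Za[:, r], sigma2)
                     for r in range(Za.shape[1])], axis=1)
```

Even in this toy layout, solving (8) by enumeration would require evaluating all $2^{N_d}$ action vectors (e.g., $2^{128}$ per data block for the setup in Section 5), which motivates the sequential formulation that follows.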

2.3. Markov Decision Process

We formulate an MDP that solves the optimization problem in (8). To achieve this, we define the state $S_n$, transition function $T^{(a,j)}(S_n)$, action set $\mathcal{A}$, and reward $R(S_n, S_{n+1})$ [39]. Subsequently, the Q-value function $Q(S_n, a)$ and the optimal policy $\pi^{\star}(S_n)$ are presented. The basic definitions for the MDP are adopted from [31,32,33]; however, the RL solution for the MDP differs from those in previous studies, as explained in the next section.
The state set $\mathcal{S}_n$ is defined as
$$\mathcal{S}_n = \Big\{ \big( \mathbf{X}_n, \hat{\mathbf{X}}_n, \mathcal{C} \big) \,\Big|\, \mathbf{X}_n = \big[ \mathbf{x}[1], \ldots, \mathbf{x}[N_p], \mathbf{x}_{k_{\mathcal{C}(1)}}, \ldots, \mathbf{x}_{k_{\mathcal{C}(|\mathcal{C}|)}} \big],\; \hat{\mathbf{X}}_n = \big[ \mathbf{x}[1], \ldots, \mathbf{x}[N_p], \hat{\mathbf{x}}[\mathcal{C}(1)], \ldots, \hat{\mathbf{x}}[\mathcal{C}(|\mathcal{C}|)] \big],\; \mathcal{C} \subseteq \{1, \ldots, n-1\} \Big\}, \qquad (9)$$
where $\mathcal{C}$ is the set of time slot indices whose symbols are used in channel estimation, and $\mathcal{C}(i)$ is its $i$-th smallest element. $k_n \in \mathcal{K} = \{1, \ldots, |\mathcal{X}|^{N_t}\}$ is the transmitted symbol index at time slot $n$. Based on this expression, we can obtain the proposed channel estimate using the state $S_n \in \mathcal{S}_n$ as
$$\hat{\mathbf{h}}_r(S_n) = \left( \hat{\mathbf{X}}_n \hat{\mathbf{X}}_n^H + \sigma_n^2 \mathbf{I}_{N_t} \right)^{-1} \hat{\mathbf{X}}_n\, \mathbf{z}_r^H(S_n), \qquad (10)$$
where $\mathbf{z}_r(S_n) = \big[ \mathbf{z}_r^p, z_r[\mathcal{C}(1)], \ldots, z_r[\mathcal{C}(|\mathcal{C}|)] \big]$. Note that $\mathcal{S}_n$ is the set of all states and $S_n \in \mathcal{S}_n$ is a state.
The action set is defined as $\mathcal{A} = \{0, 1\}$. As explained in the previous subsection, the detected data symbol is used in the proposed channel estimation when $a = 1$; otherwise, it is not used. The transition function $T^{(a,j)}(S_n)$ from state $S_n \in \mathcal{S}_n$ is defined as
$$T^{(a,j)}(S_n) = P\Big( U_{n+1}^{(a,j)}(S_n) \,\Big|\, S_n, a \Big) = \begin{cases} \mathbb{I}\big( \mathbf{x}[n] = \mathbf{x}_j \big), & j \in \mathcal{J}_a,\ a = 1, \\ 1, & j \in \mathcal{J}_a,\ a = 0, \end{cases} \qquad (11)$$
where $\mathbb{I}(\cdot)$ equals one when the event is true and zero otherwise, $\mathcal{J}_0 = \{0\}$, and $\mathcal{J}_1 = \{1, \ldots, K\}$. $U_{n+1}^{(a,j)}(S_n) \in \mathcal{U}_{n+1}^{(a,j)}(S_n)$ is a possible candidate for the next state from state $S_n$ and is defined as
$$U_{n+1}^{(a,j)}(S_n) = \begin{cases} \big( [\mathbf{X}_n, \mathbf{x}_j],\ [\hat{\mathbf{X}}_n, \hat{\mathbf{x}}[n]],\ \mathcal{C} \cup \{n\} \big), & j \in \mathcal{J}_a,\ a = 1, \\ \big( \mathbf{X}_n,\ \hat{\mathbf{X}}_n,\ \mathcal{C} \big), & j \in \mathcal{J}_a,\ a = 0. \end{cases} \qquad (12)$$
The reward $R(S_n, S_{n+1})$ is defined as the difference between the MSEs at the current state $S_n \in \mathcal{S}_n$ and the next state $S_{n+1} \in \mathcal{S}_{n+1}$, which is given by
$$R(S_n, S_{n+1}) = \mathbb{E}\big\| \hat{\mathbf{h}}_r(S_n) - \mathbf{h}_r[n] \big\|^2 - \mathbb{E}\big\| \hat{\mathbf{h}}_r(S_{n+1}) - \mathbf{h}_r[n+1] \big\|^2 = \operatorname{Tr}\big( \mathbf{B}(S_n) \big) - \operatorname{Tr}\big( \mathbf{B}(S_{n+1}) \big) = \operatorname{Tr}\big( \mathbf{B}(S_n) - \mathbf{B}(S_{n+1}) \big), \qquad (13)$$
where $\operatorname{Tr}(\cdot)$ denotes the trace operation and $\mathbf{B}(S_n) = \mathbb{E}\big[ \big( \hat{\mathbf{h}}_r(S_n) - \mathbf{h}_r[n] \big) \big( \hat{\mathbf{h}}_r(S_n) - \mathbf{h}_r[n] \big)^H \big]$ is the error covariance. Unlike in [31,32,33], the error covariance is defined between the estimated channel $\hat{\mathbf{h}}_r(S_n)$ and $\mathbf{h}_r[n]$ at time slot index $n$.
The Q-value function $Q(S_n, a)$ is the expected sum of rewards, which is given by
$$Q(S_n, a) = \sum_{j \in \mathcal{J}_a} T^{(a,j)}(S_n) \Big[ R\big( S_n, U_{n+1}^{(a,j)}(S_n) \big) + \gamma\, V\big( U_{n+1}^{(a,j)}(S_n) \big) \Big], \qquad (14)$$
where $V\big( U_{n+1}^{(a,j)}(S_n) \big)$ is the optimal sum of future rewards after $U_{n+1}^{(a,j)}(S_n)$, and $\gamma$ is a discounting factor whose value is set to one because the proposed channel estimator also considers the effect of future rewards at the ending state [31].
The optimal policy maximizes the Q-value function, which is expressed as
$$\pi^{\star}(S_n) = \operatorname*{argmax}_{a \in \mathcal{A}} Q(S_n, a). \qquad (15)$$
Solving the optimization problem in (15) is highly difficult because the transition probability $T^{(a,j)}(S_n)$ is unknown and the number of candidate states increases exponentially with the data length. An effective method for this problem is reinforcement learning. Therefore, the proposed channel estimator also adopts a reinforcement learning algorithm, but in contrast to [31,32,33], the effect of the time-varying channel is also considered.
A deep reinforcement learning (DRL) approach is a promising solution for dealing with the dimension explosion of the states by leveraging deep neural networks. To apply the DRL approach to our MDP, an agent must interact with an environment to obtain an action-value function for a given action and state. However, neither the states nor the rewards of our MDP are observable at the receiver. This means that the agent cannot acquire training samples, each of which consists of a state (or a state transition) and the corresponding reward. Consequently, the DRL approach and other data-driven approaches are not directly applicable to solving our MDP.

3. Proposed Optimal Policy

This section describes the proposed optimal policy. The basic concept of the derivation is similar to that in [31,32,33]. However, a direct extension is difficult in time-varying channels because capturing the channel variation using previously selected data symbols is difficult. To address this, we approximate the first-order Gaussian-Markov process and propose a computationally efficient algorithm.
We employ the approximation in [31,32,33] for the transition function, which is given by
$$\hat{T}^{(a,j)}(S_n) = \begin{cases} \theta_j[n], & j \in \mathcal{J}_a,\ a = 1, \\ 1, & j \in \mathcal{J}_a,\ a = 0, \end{cases} \qquad (16)$$
where $\hat{T}^{(a,j)}(S_n) \to T^{(a,j)}(S_n)$ as $\theta_j[n] \to 1$.
The main difficulty in analyzing the time-varying channel model is handling the innovation term $\boldsymbol{\Delta}[n]$. To resolve this difficulty, we approximate the first-order Gaussian-Markov process in (1) as follows:
$$\mathbf{H}[n] \approx \sqrt{1-\epsilon^2}\,\mathbf{H}[n-1], \qquad (17)$$
where the innovation term $\epsilon\,\boldsymbol{\Delta}[n]$ is neglected relative to $\mathbf{H}[n-1]$. This approximation is often adopted because it provides analytical tractability [36,37,38]. Using this approximation, the received symbol $\mathbf{z}[n+m]$ for $m \geq 1$ can be expressed in terms of $\mathbf{H}[n]$ as follows:
$$\mathbf{z}[n+m] = \mathbf{H}^H[n+m]\,\mathbf{x}[n+m] + \mathbf{n}[n+m] \approx \mathbf{H}^H[n] \left( \sqrt{1-\epsilon^2} \right)^{m} \mathbf{x}[n+m] + \mathbf{n}[n+m]. \qquad (18)$$
From approximation (18), the virtual state in [31] that mimics the optimal behavior from state $U_{n+1}^{(a,j)}(S_n)$ can be obtained as follows:
$$\tilde{U}_m^{(a,j)}(S_n) = \big( \mathbf{X}_m^{(a,j)},\ \hat{\mathbf{X}}_m^{(a)},\ \mathcal{C}_m^{(a)} \big), \qquad (19)$$
where
$$\mathbf{X}_m^{(a,j)} = \begin{cases} \big[ \mathbf{X}_n,\ \mathbf{x}_j,\ \sqrt{1-\epsilon^2}\,\tilde{\mathbf{x}}[n+1],\ \ldots,\ \big(\sqrt{1-\epsilon^2}\big)^{m-n-1}\,\tilde{\mathbf{x}}[m-1] \big], & a = 1, \\ \big[ \mathbf{X}_n,\ \sqrt{1-\epsilon^2}\,\tilde{\mathbf{x}}[n+1],\ \ldots,\ \big(\sqrt{1-\epsilon^2}\big)^{m-n-1}\,\tilde{\mathbf{x}}[m-1] \big], & a = 0, \end{cases}$$
$$\hat{\mathbf{X}}_m^{(a)} = \begin{cases} \big[ \hat{\mathbf{X}}_n,\ \hat{\mathbf{x}}[n],\ \sqrt{1-\epsilon^2}\,\tilde{\mathbf{x}}[n+1],\ \ldots,\ \big(\sqrt{1-\epsilon^2}\big)^{m-n-1}\,\tilde{\mathbf{x}}[m-1] \big], & a = 1, \\ \big[ \hat{\mathbf{X}}_n,\ \sqrt{1-\epsilon^2}\,\tilde{\mathbf{x}}[n+1],\ \ldots,\ \big(\sqrt{1-\epsilon^2}\big)^{m-n-1}\,\tilde{\mathbf{x}}[m-1] \big], & a = 0, \end{cases}$$
$$\mathcal{C}_m^{(a)} = \begin{cases} \mathcal{C} \cup \{n+1, \ldots, m\}, & a = 1, \\ \mathcal{C} \cup \{n+2, \ldots, m\}, & a = 0. \end{cases}$$
The soft-decision symbol $\tilde{\mathbf{x}}[m]$ for $m \geq n+1$ is defined as
$$\tilde{\mathbf{x}}[m] = \sum_{k=1}^{K} \theta_k[m]\, \mathbf{x}_k. \qquad (20)$$
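A short sketch of (20), together with the scaling it receives inside the virtual state (19), is given below (ours; theta and cands follow the detector sketch in Section 2).

```python
import numpy as np

def soft_symbol(theta, cands):
    """Soft-decision symbol x_tilde[m] in (20): the APP-weighted candidate average.
    theta: (K,) APPs; cands: (K, Nt) candidate vectors; returns an (Nt,) vector."""
    return theta @ cands

def virtual_symbol(theta, cands, eps, lag):
    """Soft symbol as it enters the virtual state (19), scaled by
    (sqrt(1 - eps^2))^lag with lag = m - n - 1; its influence on the
    estimate of H[n] decays as the lag grows."""
    return np.sqrt(1 - eps**2) ** lag * soft_symbol(theta, cands)
```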
In (19), because $0 \leq \epsilon \leq 1$, the effect of the soft-decision symbol $\tilde{\mathbf{x}}[n+m]$ on estimating $\mathbf{H}[n]$ diminishes as $m$ increases. Based on the virtual state, the state-action diagram of the proposed channel estimator is shown in Figure 2. In this figure, the number of state transitions at state $S_n$ is one for $a = 0$ and $K$ for $a = 1$. However, after $n+2$, the state transition is simplified to one because the virtual state mimics the behavior of state $U_{n+1}^{(a,j)}(S_n)$.
Using the definition of the virtual state in (19), we can compute the future reward $V\big( U_{n+1}^{(a,j)}(S_n) \big)$ as
$$V\big( U_{n+1}^{(a,j)}(S_n) \big) \approx R\Big( U_{n+1}^{(a,j)}(S_n),\ \tilde{U}_{n+2}^{(a,j)}(S_n) \Big) + \sum_{m=n+2}^{M_d N_d} R\Big( \tilde{U}_m^{(a,j)}(S_n),\ \tilde{U}_{m+1}^{(a,j)}(S_n) \Big). \qquad (21)$$
By applying (13) to the future reward, it telescopes to
$$V\big( U_{n+1}^{(a,j)}(S_n) \big) = \operatorname{Tr}\Big( \mathbf{B}\big( U_{n+1}^{(a,j)}(S_n) \big) - \mathbf{B}\big( \tilde{U}_{n+2}^{(a,j)}(S_n) \big) + \sum_{m=n+2}^{M_d N_d} \Big[ \mathbf{B}\big( \tilde{U}_m^{(a,j)}(S_n) \big) - \mathbf{B}\big( \tilde{U}_{m+1}^{(a,j)}(S_n) \big) \Big] \Big) = \operatorname{Tr}\Big( \mathbf{B}\big( U_{n+1}^{(a,j)}(S_n) \big) - \mathbf{B}\big( \tilde{U}_{M_d N_d + 1}^{(a,j)}(S_n) \big) \Big). \qquad (22)$$
Using the approximations (16) and (22), the Q-value function in (14) is obtained as follows:
$$Q(S_n, a) = \sum_{j \in \mathcal{J}_a} \hat{T}^{(a,j)}(S_n)\, \operatorname{Tr}\Big( \mathbf{B}(S_n) - \mathbf{B}\big( \tilde{U}_{M_d N_d + 1}^{(a,j)}(S_n) \big) \Big). \qquad (23)$$
The error covariance matrix $\mathbf{B}\big( \tilde{U}_m^{(a,j)} \big)$ can be computed as
$$\begin{aligned} \mathbf{B}\big( \tilde{U}_m^{(a,j)} \big) &= \mathbb{E}\Big\{ \big( \hat{\mathbf{h}}_r\big( \tilde{U}_m^{(a,j)} \big) - \mathbf{h}_r[n] \big) \big( \hat{\mathbf{h}}_r\big( \tilde{U}_m^{(a,j)} \big) - \mathbf{h}_r[n] \big)^H \Big\} \\ &= \mathbb{E}\big\{ \hat{\mathbf{h}}_r\big( \tilde{U}_m^{(a,j)} \big)\, \hat{\mathbf{h}}_r^H\big( \tilde{U}_m^{(a,j)} \big) \big\} - \mathbb{E}\big\{ \mathbf{h}_r[n]\, \hat{\mathbf{h}}_r^H\big( \tilde{U}_m^{(a,j)} \big) \big\} - \mathbb{E}\big\{ \hat{\mathbf{h}}_r\big( \tilde{U}_m^{(a,j)} \big)\, \mathbf{h}_r^H[n] \big\} + \mathbb{E}\big\{ \mathbf{h}_r[n]\, \mathbf{h}_r^H[n] \big\} \\ &\overset{(a)}{=} \mathbf{Q}_m^{(a)} \hat{\mathbf{X}}_m^{(a,j)} \Big( \big( \mathbf{X}_m^{(a,j)} \big)^H \mathbf{X}_m^{(a,j)} + \sigma_n^2 \mathbf{I}_{|\mathcal{C}_m^{(a)}|} \Big) \big( \hat{\mathbf{X}}_m^{(a,j)} \big)^H \mathbf{Q}_m^{(a)} - \mathbf{X}_m^{(a,j)} \big( \hat{\mathbf{X}}_m^{(a,j)} \big)^H \mathbf{Q}_m^{(a)} - \mathbf{Q}_m^{(a)} \hat{\mathbf{X}}_m^{(a,j)} \big( \mathbf{X}_m^{(a,j)} \big)^H + \mathbf{I}_{N_t} \\ &= \Big( \mathbf{I}_{N_t} - \mathbf{Q}_m^{(a)} \hat{\mathbf{X}}_m^{(a,j)} \big( \mathbf{X}_m^{(a,j)} \big)^H \Big) \Big( \mathbf{I}_{N_t} - \mathbf{Q}_m^{(a)} \hat{\mathbf{X}}_m^{(a,j)} \big( \mathbf{X}_m^{(a,j)} \big)^H \Big)^H + \sigma_n^2 \mathbf{Q}_m^{(a)} - \sigma_n^4 \big( \mathbf{Q}_m^{(a)} \big)^2 \\ &\overset{(b)}{=} \mathbf{Q}_m^{(a)} \mathbf{D}_m^{(a,j)} \big( \mathbf{D}_m^{(a,j)} \big)^H \mathbf{Q}_m^{(a)} + \sigma_n^2 \mathbf{Q}_m^{(a)} - \sigma_n^4 \big( \mathbf{Q}_m^{(a)} \big)^2, \end{aligned} \qquad (24)$$
where the distribution of $\mathbf{z}_r^H\big( \tilde{U}_m^{(a,j)}(S_n) \big)$ is given by $\mathcal{CN}\Big( \mathbf{0}_{|\mathcal{C}_m^{(a)}|},\ \big( \mathbf{X}_m^{(a,j)} \big)^H \mathbf{X}_m^{(a,j)} + \sigma_n^2 \mathbf{I}_{|\mathcal{C}_m^{(a)}|} \Big)$ and $\mathbf{Q}_m^{(a)} = \Big( \hat{\mathbf{X}}_n \hat{\mathbf{X}}_n^H + \sum_{k=n+1}^{m-1} (1-\epsilon^2)^{k-n}\, \tilde{\mathbf{x}}[k]\, \tilde{\mathbf{x}}^H[k] + \sigma_n^2 \mathbf{I}_{N_t} \Big)^{-1}$ is applied in $(a)$. $\mathbf{D}_m^{(a,j)} = \big( \mathbf{Q}_m^{(a)} \big)^{-1} - \hat{\mathbf{X}}_m^{(a,j)} \big( \mathbf{X}_m^{(a,j)} \big)^H = \hat{\mathbf{X}}_m^{(a,j)} \big( \hat{\mathbf{X}}_m^{(a,j)} - \mathbf{X}_m^{(a,j)} \big)^H + \sigma_n^2 \mathbf{I}_{N_t}$ is used in $(b)$.
By applying (24) to the Q-value function, the optimal policy at $S_n$ is computed as
$$\pi^{\star}(S_n) = \operatorname*{argmax}_{a \in \{0,1\}} Q(S_n, a) = \mathbb{I}\big( Q(S_n, 1) - Q(S_n, 0) \geq 0 \big) = \mathbb{I}\Big( \operatorname{Tr}\Big( \mathbf{B}\big( \tilde{U}_{M_d N_d + 1}^{(0,0)}(S_n) \big) - \sum_{j=1}^{K} \theta_j[n]\, \mathbf{B}\big( \tilde{U}_{M_d N_d + 1}^{(1,j)}(S_n) \big) \Big) \geq 0 \Big), \qquad (25)$$
where $\mathbf{B}\big( \tilde{U}_{M_d N_d + 1}^{(a,j)}(S_n) \big) = \sigma_n^2 \mathbf{Q}^{(a)} - \sigma_n^4 \big( \mathbf{Q}^{(a)} \big)^2 + \mathbf{Q}^{(a)} \mathbf{D}^{(a,j)} \big( \mathbf{D}^{(a,j)} \big)^H \mathbf{Q}^{(a)}$. $\mathbf{Q}^{(a)} = \mathbf{Q}_{M_d N_d + 1}^{(a)}$ and $\mathbf{D}^{(a,j)} = \mathbf{D}_{M_d N_d + 1}^{(a,j)}$ are given by
$$\mathbf{Q}^{(a)} = \Big( \hat{\mathbf{X}}_{M_d N_d + 1}^{(a)} \big( \hat{\mathbf{X}}_{M_d N_d + 1}^{(a)} \big)^H + \sigma_n^2 \mathbf{I}_{N_t} \Big)^{-1} \overset{(a)}{=} \begin{cases} \Big( \hat{\mathbf{X}}_n \hat{\mathbf{X}}_n^H + \sum_{m=n+1}^{M_d N_d} (1-\epsilon^2)^{m-n}\, \tilde{\mathbf{x}}[m]\, \tilde{\mathbf{x}}^H[m] + \sigma_n^2 \mathbf{I}_{N_t} \Big)^{-1}, & a = 0, \\ \Big( \big( \mathbf{Q}^{(0)} \big)^{-1} + \hat{\mathbf{x}}[n]\, \hat{\mathbf{x}}^H[n] \Big)^{-1}, & a = 1, \end{cases}$$
$$\mathbf{D}^{(a,j)} = \hat{\mathbf{X}}_{M_d N_d + 1}^{(a)} \Big( \hat{\mathbf{X}}_{M_d N_d + 1}^{(a)} - \mathbf{X}_{M_d N_d + 1}^{(a,j)} \Big)^H + \sigma_n^2 \mathbf{I}_{N_t} \overset{(b)}{=} \begin{cases} \hat{\mathbf{X}}_n \big( \hat{\mathbf{X}}_n - \mathbf{X}_n \big)^H + \sigma_n^2 \mathbf{I}_{N_t}, & j \in \mathcal{J}_a,\ a = 0, \\ \mathbf{D}^{(0,0)} + \hat{\mathbf{x}}[n] \big( \hat{\mathbf{x}}[n] - \mathbf{x}_j \big)^H, & j \in \mathcal{J}_a,\ a = 1. \end{cases}$$
Similar to [31], $\mathbf{Q}^{(1)}$ and $\mathbf{Q}^{(0)}$ satisfy $\mathbf{Q}^{(1)} = \mathbf{Q}^{(0)} - \dfrac{ \mathbf{Q}^{(0)} \hat{\mathbf{x}}[n]\, \hat{\mathbf{x}}^H[n]\, \mathbf{Q}^{(0)} }{ 1 + \hat{\mathbf{x}}^H[n]\, \mathbf{Q}^{(0)} \hat{\mathbf{x}}[n] }$. In addition, $\mathbf{D}^{(1,j)}$ and $\mathbf{D}^{(0,0)}$ satisfy $\sum_{j=1}^{K} \theta_j[n]\, \mathbf{D}^{(1,j)} \big( \mathbf{D}^{(1,j)} \big)^H = \big( \mathbf{D}^{(0,0)} + \hat{\mathbf{d}}_n \big) \big( \mathbf{D}^{(0,0)} + \hat{\mathbf{d}}_n \big)^H + \delta_n\, \hat{\mathbf{x}}[n]\, \hat{\mathbf{x}}^H[n]$, where $\hat{\mathbf{d}}_n = \hat{\mathbf{x}}[n] \big( \hat{\mathbf{x}}[n] - \tilde{\mathbf{x}}[n] \big)^H$ and $\delta_n = \sum_{j=1}^{K} \theta_j[n] \big\| \hat{\mathbf{x}}[n] - \mathbf{x}_j \big\|^2 - \big\| \hat{\mathbf{x}}[n] - \tilde{\mathbf{x}}[n] \big\|^2$.
Finally, similar to [32], by applying the results in (23) and (24) to (25), we obtain the proposed optimal policy in closed form as
$$\pi^{\star}(S_n) = \mathbb{I}\left( \frac{ \sigma_n^2 (1 + \alpha_n) + \sigma_n^4 \|\mathbf{a}_n\|^2 + \|\mathbf{d}_n\|^2 }{ 2 \sigma_n^4 \beta_n + \gamma_n + \| c_n \mathbf{b}_n + \mathbf{d}_n \|^2 } \geq 1 \right). \qquad (26)$$
When we define $\mathbf{Q} = \mathbf{Q}^{(0)}$ and $\mathbf{D} = \mathbf{D}^{(0,0)}$, the vectors are computed as $\mathbf{a}_n = \mathbf{Q} \hat{\mathbf{x}}[n] / \sqrt{1 + \alpha_n}$, $\mathbf{b}_n = \mathbf{D}^H \mathbf{a}_n$, and $\mathbf{d}_n = \mathbf{D}^H \mathbf{Q} \mathbf{a}_n / \|\mathbf{a}_n\|^2$, with the scalar $c_n = \| \hat{\mathbf{x}}[n] - \tilde{\mathbf{x}}[n] \| / \sqrt{1 + \alpha_n}$. In addition, the constants are computed as $\alpha_n = \hat{\mathbf{x}}^H[n]\, \mathbf{Q}\, \hat{\mathbf{x}}[n]$, $\beta_n = \mathbf{a}_n^H \mathbf{Q} \mathbf{a}_n / \|\mathbf{a}_n\|^2$, and $\gamma_n = \delta_n / (1 + \alpha_n)$. Note that the expression of the optimal policy in (26) is similar to that in [32]; however, the vectors and constants in the optimal policy differ from those in [32] because the temporal correlation $\epsilon$ is incorporated in $\mathbf{Q}$ and $\mathbf{D}$. When $\epsilon = 0$, the optimal policy in (26) is equivalent to that in [32].
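To make the selection rule concrete, the following sketch (ours; all names are illustrative) evaluates the policy through the trace comparison in (25) with the error covariance (24), rather than through the algebraically rearranged scalar form (26); the soft-symbol contributions to $\mathbf{Q}^{(0)}$ are assumed to be folded into the supplied matrices.

```python
import numpy as np

def B_of(Q, D, sigma2):
    """Error covariance (24): sigma^2 Q - sigma^4 Q^2 + Q D D^H Q."""
    return sigma2 * Q - sigma2**2 * (Q @ Q) + Q @ D @ D.conj().T @ Q

def policy(Q0, D0, x_hat, theta, cands, sigma2):
    """Return 1 if the detected symbol x_hat should be selected (a* = 1)."""
    q = Q0 @ x_hat                                  # Sherman-Morrison update of Q
    Q1 = Q0 - np.outer(q, q.conj()) / (1 + x_hat.conj() @ q)
    B0 = B_of(Q0, D0, sigma2)                       # covariance for a = 0
    B1 = np.zeros_like(B0)
    for th, xj in zip(theta, cands):                # APP-weighted covariances, a = 1
        D1j = D0 + np.outer(x_hat, (x_hat - xj).conj())
        B1 += th * B_of(Q1, D1j, sigma2)
    return int(np.trace(B0 - B1).real >= 0)         # selection rule (25)
```

The rank-1 identities quoted above are exactly what allow this per-candidate matrix evaluation to be collapsed into the scalar test (26).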

4. Further Performance Improvement

In this section, we propose a practical method to improve the estimation accuracy of the proposed channel estimator. The proposed method refines state elements to capture the time-varying nature of the channel.

4.1. State Element Refinement

Elements $\mathbf{X}_n$ and $\hat{\mathbf{X}}_n$ in state $S_n$ are updated when a detected data symbol is selected based on the optimal policy. However, these elements gradually lose their effectiveness in estimating $\mathbf{H}[n]$ as the time slot index $n$ increases. To address this, we first represent the received symbol for $m \geq 1$ in terms of $\mathbf{H}[n]$ as
$$\mathbf{z}[n-m] = \mathbf{H}^H[n-m]\,\mathbf{x}[n-m] + \mathbf{n}[n-m] \approx \mathbf{H}^H[n] \left( \sqrt{1-\epsilon^2} \right)^{-m} \mathbf{x}[n-m] + \mathbf{n}[n-m]. \qquad (27)$$
Using (27), we refine the elements $\mathbf{X}_n$ and $\hat{\mathbf{X}}_n$ in the state as the time slot index increases, which is given by
$$\mathbf{X}_n \leftarrow \left( \sqrt{1-\epsilon^2} \right)^{-1} \mathbf{X}_n, \qquad \hat{\mathbf{X}}_n \leftarrow \left( \sqrt{1-\epsilon^2} \right)^{-1} \hat{\mathbf{X}}_n. \qquad (28)$$
Regardless of the above refinement, the previously selected data symbols lose their effectiveness as the time slot index increases, particularly for large data lengths. This is because the term $\boldsymbol{\Delta}[n]$ in (1) becomes dominant, increasing the uncertainty in estimating the channel. To overcome this, we remove selected data symbols that are too old from the state by introducing a window size $N_w$. In other words, we maintain the size of the set of time slot indices as $|\mathcal{C}| = N_w$. Thus, when the optimal action is one at time slot index $n$, $n$ is included in $\mathcal{C}$, whereas the oldest index $\mathcal{C}(1)$ is removed from the set, which can be expressed as
$$\mathbf{X}_n \leftarrow \mathbf{X}_n \setminus \mathbf{X}_n[\mathcal{C}(1)], \qquad \hat{\mathbf{X}}_n \leftarrow \hat{\mathbf{X}}_n \setminus \hat{\mathbf{X}}_n[\mathcal{C}(1)], \qquad \mathcal{C} \leftarrow \mathcal{C} \setminus \mathcal{C}(1). \qquad (29)$$
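A sketch of the refinement (28) and the sliding-window pruning (29) follows (ours; the column layout [pilots | selected data] is an assumption).

```python
import numpy as np

def refine_state(X, Xhat, eps):
    """Rescale the stored symbols by (sqrt(1 - eps^2))^{-1} so that, under the
    approximation (27), they stay matched to the current channel H[n], as in (28)."""
    scale = 1.0 / np.sqrt(1 - eps**2)
    return scale * X, scale * Xhat

def prune_state(X, Xhat, C, Np, Nw):
    """Drop the oldest selected symbol (time index C(1)) while |C| exceeds Nw,
    as in (29). Columns 0..Np-1 of X and Xhat hold the pilots."""
    while len(C) > Nw:
        X = np.delete(X, Np, axis=1)       # first data column after the pilots
        Xhat = np.delete(Xhat, Np, axis=1)
        C = C[1:]
    return X, Xhat, C
```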

4.2. Algorithm

Using the proposed optimal policy and the performance improvement strategy, the proposed channel estimator is summarized in Algorithm 1. The receiver obtains the initial channel estimate during the pilot transmission. Subsequently, during the data transmission, the receiver sequentially selects data symbols based on the optimal policy. When the optimal action is $a^{\star} = 1$, the state $S_n$ is updated using the most-probable state transition [31], and the state element refinement is performed based on this condition. After each data block ends, the channel estimate is updated using the state $S_n$.
Algorithm 1: Proposed channel estimator
1 Obtain the initial channel estimate $\hat{\mathbf{H}} = [\hat{\mathbf{h}}_1, \ldots, \hat{\mathbf{h}}_{N_r}]$ from (3).
2 Initialize the state $S_1 = (\mathbf{X}^p, \mathbf{X}^p, \emptyset)$.
3 for $d = 1$ to $M_d$ do
4   for $n \in \mathcal{M}_d$ do
5     Compute the optimal policy $a^{\star} = \pi^{\star}(S_n)$ from (26).
6     Set the optimal values $j^{\star} = 0$ for $a^{\star} = 0$ and $\mathbf{x}_{j^{\star}} = \hat{\mathbf{x}}[n]$ for $a^{\star} = 1$.
7     Update $S_{n+1} \leftarrow U_{n+1}^{(a^{\star}, j^{\star})}(S_n)$ from (12).
8     if $a^{\star} = 1$ and $N_w < |\mathcal{C}|$ then
9       Remove the oldest state elements in $S_{n+1}$:
10        $\mathbf{X}_{n+1} \leftarrow \mathbf{X}_{n+1} \setminus \mathbf{X}_{n+1}[\mathcal{C}(1)]$,
11        $\hat{\mathbf{X}}_{n+1} \leftarrow \hat{\mathbf{X}}_{n+1} \setminus \hat{\mathbf{X}}_{n+1}[\mathcal{C}(1)]$,
12        $\mathcal{C} \leftarrow \mathcal{C} \setminus \mathcal{C}(1)$.
13    end
14    Refine the state elements as $\mathbf{X}_{n+1} \leftarrow (\sqrt{1-\epsilon^2})^{-1} \mathbf{X}_{n+1}$ and $\hat{\mathbf{X}}_{n+1} \leftarrow (\sqrt{1-\epsilon^2})^{-1} \hat{\mathbf{X}}_{n+1}$.
15  end
16  Update the channel estimate $\hat{\mathbf{H}} = [\hat{\mathbf{h}}_1(S_n), \ldots, \hat{\mathbf{h}}_{N_r}(S_n)]$ from (10).
17 end
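The skeleton below (ours) strings the earlier sketches together for a single data block, mirroring the structure of Algorithm 1. Note that Q^(0) and D^(0,0) are built here without the future soft-symbol terms of the exact recursion, which is a simplifying assumption rather than the paper's computation.

```python
import numpy as np

def run_block(Xp, Zp, Z_data, cands, sigma2, eps, Nw):
    """One data block of Algorithm 1. Xp: (Nt, Np) pilots; Zp: (Np, Nr);
    Z_data: (Nd, Nr) received data symbols; cands: (K, Nt) candidates."""
    Nt, Np = Xp.shape
    Nr = Zp.shape[1]
    X, Xhat, C = Xp.copy(), Xp.copy(), []               # S_1 = (X^p, X^p, empty set)
    hatH = np.stack([lmmse_row(Xp, Zp[:, r], sigma2)    # initial estimate, (3)
                     for r in range(Nr)], axis=1)
    for n in range(Z_data.shape[0]):
        x_hat, theta = map_detect(Z_data[n], hatH, cands, sigma2)
        Q0 = np.linalg.inv(Xhat @ Xhat.conj().T + sigma2 * np.eye(Nt))
        D0 = Xhat @ (Xhat - X).conj().T + sigma2 * np.eye(Nt)
        if policy(Q0, D0, x_hat, theta, cands, sigma2):  # a* = 1: select symbol
            X = np.column_stack([X, x_hat])              # most-probable transition
            Xhat = np.column_stack([Xhat, x_hat])
            C.append(n)
            X, Xhat, C = prune_state(X, Xhat, C, Np, Nw) # sliding window, (29)
        X, Xhat = refine_state(X, Xhat, eps)             # per-slot rescaling, (28)
    Zs = np.concatenate([Zp, Z_data[C, :]], axis=0)      # received symbols z(S_n)
    return np.stack([lmmse_row(Xhat, Zs[:, r], sigma2)   # block-end update, (10)
                     for r in range(Nr)], axis=1)
```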
In Figure 3, we show a block diagram of the proposed channel estimator, which consists of the LMMSE channel estimator, optimal policy calculator, and state element refinement. The LMMSE channel estimator obtains the initial estimate at pilot transmission and updates the estimate at data transmission using state S n . The optimal policy calculator obtains the optimal action of (26) from the channel estimates and APP from the data detector. The state elements are then refined based on the obtained optimal action, and the refined state is used to estimate the channel and optimal policy for the next step.
Application to other data detection methods: The proposed RL-aided channel estimator can be universally applied to any other soft-output data detection method. The proposed estimator relies on the availability of APPs, which are directly obtained from the MAP data detection method. When other soft-output data detectors are used, the proposed RL-aided channel estimator can instead utilize the APPs computed from the log-likelihood ratios.
Complexity analysis: Complexity is analyzed in terms of real multiplications to provide an implementation perspective. Figure 3 shows the hardware structure of the proposed RL-aided channel estimator, which consists of the LMMSE channel estimator, state element refinement, and optimal policy calculator. Because the exact complexity can vary depending on the implementation details, the complexity order ( O ( · ) ) of each component is analyzed.
The complexity order of the LMMSE channel estimator in (7) is $O\big( (N_p + |\mathcal{C}(\mathbf{a})|)(N_t^2 + N_t N_r) \big)$, where $\mathcal{C}(\mathbf{a})$ is the set of selected data symbol vectors. The complexity order of the state element refinement in Section 4.1 is $O\big( 4(N_p + |\mathcal{C}(\mathbf{a})|) \big)$. The complexity of the optimal policy in (25) is primarily determined by the computation of $\mathbf{Q}^{(a)}$; consequently, its complexity order is $O\big( 2 N_t^2 T_d^2 \big)$. It is important to note that, among the components of the proposed channel estimator, the optimal policy calculator has the highest complexity because it is executed at every data symbol index $n$, whereas the other components are executed once per data block index $d$.

5. Simulation Results

This section demonstrates the effectiveness of the proposed channel estimator through simulations. The numbers of transmit and receive antennas were $N_t = 2$ and $N_r = 4$, respectively. The transmission frame consisted of one pilot block with $N_p = 8$ symbols and $M_d = 20$ data blocks with $N_d = 128$ symbols each. 4-quadrature amplitude modulation (QAM) symbol mapping was used. We adopted a turbo channel code with a rate of $1/2$ and 16 cyclic redundancy check bits. For the proposed channel estimator, the window size was set to $N_w = 2 N_d$. The signal-to-noise ratio (SNR) was defined as $E_b/N_0 = 1/(\log_2 |\mathcal{X}|\, \sigma_n^2)$ under the power constraint $\mathbb{E}\{ \| \mathbf{x}[n] \|^2 \} = 1$. The proposed channel estimator was compared with the following methods:
  • PCSI: This ideal method assumes that a perfect initial channel estimate is available at the receiver. Because the channel changes during the data transmission, this estimate is no longer optimal in time-varying channels.
  • Pilot: This method is the conventional pilot-aided channel estimator in (3).
  • Soft: This method is a data-aided channel estimator in which all soft-decision symbols in (20) are used as additional pilot symbols.
  • Conv-RL [31]: This method is a data-aided channel estimator in which the detected data symbol is selected using the RL approach developed for time-invariant channels.
The performance of these methods was compared with that of the proposed channel estimator in terms of the block-error rate (BLER) and normalized MSE (NMSE). In addition, we considered the time-invariant channel ($\epsilon = 0$) and time-varying channels with $\epsilon = 0.005$ and $\epsilon = 0.01$. Note that the channel varies more severely when $\epsilon = 0.01$ than when $\epsilon = 0.005$.
Figure 4 shows the BLERs of the proposed and other channel estimators in the time-invariant channel, i.e., $\epsilon = 0$. The conventional pilot-aided channel estimator exhibited poor performance because the number of pilots was small. The data-aided channel estimators overcame this performance degradation. In particular, the RL-based channel estimator [31] showed outstanding performance compared with the other channel estimators. The BLER of the proposed channel estimator was slightly worse than that of [31] because of the reduced window size $N_w = 2 N_d$.
In Figure 5, the proposed channel estimator is compared with the other channel estimators in time-varying channels. The proposed channel estimator achieved a larger BLER improvement over the conventional pilot-aided channel estimator, and this improvement was more prominent at $\epsilon = 0.01$ than at $\epsilon = 0.005$. This is because the proposed channel estimator can efficiently capture channel variations by selecting and refining the detected data symbols. Recall that in the time-invariant channel, the proposed channel estimator had a slightly higher BLER than the RL-based channel estimator, primarily due to the reduced window size. In time-varying channels, however, this reduction in window size actually improved the BLER by effectively leveraging the most recent data symbols; consequently, the proposed channel estimator achieved a lower BLER than the RL-based channel estimator (see Figure 5). In Figure 6, we show the BLER of the proposed channel estimator for different window sizes $N_w$ in time-varying channels with $\epsilon = 0.01$. The BLER of the proposed channel estimator gradually degraded as $N_w$ increased. This is because an old selected data symbol is undesirable as an additional pilot symbol in fast fading channels; therefore, only the use of the most recently selected data symbols can improve the performance.
To further investigate the effect of the window size, we examined the NMSE of the proposed channel estimator for different window sizes $N_w$ at $\epsilon = 0.005$ and $E_b/N_0 = 2$ dB (Figure 7). We observed that the NMSE improved up to the second data block but degraded as the data block index increased further. This is because old selected data symbols are ineffective for estimating the current channel. Thus, when we discard the old data symbols, the estimation accuracy is further improved (Figure 7).

6. Conclusions

A data-aided channel estimator was proposed for time-varying channels, which involves selecting the detected data symbols. To facilitate efficient selection of the detected data symbols, an optimization problem was first formulated to minimize the channel estimation error. Subsequently, the MDP for this optimization problem was formulated, and its optimal policy was derived using an RL algorithm. In the derivation, approximations of the transition probability and of the first-order Gaussian-Markov process were utilized. To improve the estimation accuracy, a state element refinement was introduced to capture the time-varying nature of the channel by incorporating a window size. Simulation results demonstrated that the proposed channel estimator provides performance similar to that of the conventional RL-based channel estimator in time-invariant channels ($\epsilon = 0$), while showing improved performance over it in time-varying channels ($\epsilon = 0.005$ and $\epsilon = 0.01$).
An interesting direction for further research involves optimizing the frame structure in terms of spectral efficiency. In this study, the frame structure comprises one pilot block and $M_d$ data blocks, and the proposed RL-aided channel estimator is applied to the data blocks to capture the time-varying nature of the channel. In fast fading channels, however, it can be challenging for the proposed channel estimator to accurately track the channel variations. In such cases, reducing the value of $M_d$ in the frame structure can potentially improve the performance, but this reduction also degrades the spectral efficiency. To find an appropriate value of $M_d$ in time-varying channels, an optimization problem that maximizes the spectral efficiency while maintaining an acceptable performance level is a suitable criterion. One approach is to first derive the performance of the RL-aided channel estimator and then obtain the solution to the optimization problem using the derived performance.

Author Contributions

Conceptualization, M.M.; Methodology, T.-K.K.; Software, T.-K.K.; Validation, M.M.; Formal analysis, T.-K.K.; Investigation, T.-K.K.; Resources, T.-K.K.; Data curation, T.-K.K.; Writing—original draft, T.-K.K.; Writing—review & editing, M.M.; Visualization, M.M.; Supervision, M.M.; Project administration, M.M.; Funding acquisition, M.M. All authors have read and agreed to the published version of the manuscript.

Funding

The work of Tae-Kyoung Kim was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1F1A1063273). The work of Moonsik Min was supported in part by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MSIT) (No. 2023R1A2C1004034), and in part by the BK21 FOUR Project funded by the Ministry of Education, Korea (4199990113966).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Goldsmith, A.; Jafar, S.A.; Jindal, N.; Vishwanath, S. Capacity Limits of MIMO Channels. IEEE J. Sel. Commun. 2003, 21, 684–702. [Google Scholar] [CrossRef] [Green Version]
  2. Zheng, L.; Tse, D.N.C. Diversity and Multiplexing: A Fundamental Tradeoff in Multiple-Antenna Channels. IEEE Trans. Inf. Theory 2003, 49, 1073–1096. [Google Scholar] [CrossRef] [Green Version]
  3. Paulraj, A.J.; Gore, D.A.; Nabar, R.U.; Bolcskei, H. An Overview of MIMO Communications-a Key to Gigabit Wireless. Proc. IEEE 2004, 92, 198–218. [Google Scholar] [CrossRef] [Green Version]
  4. Sanayei, S.; Nosratinia, A. Antenna Selection in MIMO Systems. IEEE Commun. Mag. 2004, 42, 68–73. [Google Scholar] [CrossRef]
  5. Larsson, E.G.; Edfors, O.; Tufvesson, F.; Marzetta, T.L. Massive MIMO for Next Generation Wireless Systems. IEEE Commun. Mag. 2014, 52, 186–195. [Google Scholar] [CrossRef] [Green Version]
  6. Zheng, K.; Zhao, L.; Mei, J.; Shao, B.; Xiang, W.; Hanzo, L. Survey of Large-Scale MIMO Systems. IEEE Commun. Surv. Tutor. 2015, 17, 1738–1760. [Google Scholar] [CrossRef] [Green Version]
  7. Yang, S.; Hanzo, L. Fifty Years of MIMO Detection: The Road to Large-Scale MIMOs. IEEE Commun. Surv. Tutor. 2015, 17, 1941–1988. [Google Scholar] [CrossRef] [Green Version]
  8. Morelli, M.; Mengali, U. A Comparison of Pilot-Aided Channel Estimation Methods for OFDM System. IEEE Trans. Signal Process. 2001, 49, 3065–3073. [Google Scholar] [CrossRef]
  9. Coleri, S.; Ergen, M.; Puri, A.; Bahai, A. Channel Estimation Techniques Based on Pilot Arrangement in OFDM Systems. IEEE Trans. Broadcast. 2002, 48, 223–229. [Google Scholar] [CrossRef] [Green Version]
  10. Mostofi, Y.; Cox, D.C. ICI Mitigation for Pilot-Aided OFDM Mobile Systems. IEEE Trans. Wirel. Commun. 2005, 4, 765–774. [Google Scholar] [CrossRef]
  11. Biguesh, M.; Gershman, A.B. Training-based MIMO Channel Estimation: A Study of Estimator Tradeoffs and Optimal Training Signals. IEEE Trans. Signal Process. 2006, 54, 884–893. [Google Scholar] [CrossRef]
  12. Ozdemir, M.K.; Arslan, H. Channel Estimation for Wireless OFDM Systems. IEEE Commun. Surv. Tutor. 2007, 9, 18–48. [Google Scholar] [CrossRef]
  13. Soltani, M.; Pourahmadi, V.; Mirzaei, A.; Sheikhzadeh, H. Deep Learning-based Channel Estimation. IEEE Commun. Lett. 2019, 23, 652–655. [Google Scholar] [CrossRef] [Green Version]
  14. Le, H.A.; Van Chien, T.; Nguyen, T.H.; Choo, H.; Nguyen, V.D. Machine Learning-Based 5G-and-Beyond Channel Estimation for MIMO-OFDM Communication Systems. Sensors 2021, 21, 4861. [Google Scholar] [CrossRef]
  15. Yuan, J.; Ngo, H.Q.; Matthaiou, M. Machine Learning-Based Channel Prediction in Massive MIMO with Channel Aging. IEEE Trans. Wirel. Commun. 2020, 19, 2960–2973. [Google Scholar] [CrossRef]
  16. Valenti, M.C.; Woerner, B.D. Iterative Channel Estimation and Decoding of Pilot Symbol Assisted Turbo Codes over Flat-Fading Channels. IEEE J. Sel. Commun. 2001, 19, 1697–1705. [Google Scholar] [CrossRef]
  17. Dowler, A.; Nix, A.; McGeehan, J. Data-derived Iterative Channel Estimation with Channel Tracking for a Mobile Fourth Generation Wide Area OFDM System. In Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM), San Francisco, CA, USA, 1–5 December 2003. [Google Scholar]
  18. Cozzo, C.; Hughes, B.L. Joint Channel Estimation and Data Detection in Space-Time Communications. IEEE Trans. Commun. 2003, 51, 1266–1270. [Google Scholar] [CrossRef]
  19. Song, S.; Singer, A.C.; Sung, K.M. Soft Input Channel Estimation for Turbo Equalization. IEEE Trans. Signal Process. 2004, 52, 2885–2894. [Google Scholar] [CrossRef]
  20. Nicoli, M.; Ferrara, S.; Spagnolini, U. Soft-Iterative Channel Estimation: Methods and Performance Analysis. IEEE Trans. Signal Process. 2007, 55, 2993–3006. [Google Scholar] [CrossRef]
  21. Zhao, M.; Shi, Z.; Reed, M.C. Iterative Turbo Channel Estimation for OFDM System over Rapid Dispersive Fading Channel. IEEE Trans. Wirel. Commun. 2008, 7, 3174–3184. [Google Scholar] [CrossRef]
  22. Guo, Q.; Ping, L.; Huang, D. A Low-Complexity Iterative Channel Estimation and Detection Technique for Doubly Selective Channels. IEEE Trans. Wirel. Commun. 2009, 8, 4340–4349. [Google Scholar]
  23. Ma, J.; Ping, L. Data-Aided Channel Estimation in Large Antenna Systems. IEEE Trans. Signal Process. 2014, 62, 3111–3124. [Google Scholar]
  24. Wen, C.K.; Wang, C.J.; Jin, S.; Wong, K.K.; Ting, P. Bayes-Optimal Joint Channel-and-Data Estimation for Massive MIMO with Low-Precision ADCs. IEEE Trans. Signal Process. 2015, 64, 2541–2556. [Google Scholar] [CrossRef] [Green Version]
  25. Park, S.; Shim, B.; Choi, J.W. Iterative Channel Estimation Using Virtual Pilot Signals for MIMO-OFDM Systems. IEEE Trans. Signal Process. 2015, 63, 3032–3045. [Google Scholar] [CrossRef]
  26. Huang, C.; Liu, L.; Yuen, C.; Sun, S. Iterative Channel Estimation Using LSE and Sparse Message Passing for mmWave MIMO Systems. IEEE Trans. Signal Process. 2018, 67, 245–259. [Google Scholar] [CrossRef] [Green Version]
  27. Li, X.; Wang, Q.; Yang, H.; Ma, X. Data-Aided MIMO Channel Estimation by Clustering and Reinforcement-Learning. In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC), Austin, TX, USA, 10–13 April 2022. [Google Scholar]
  28. Naeem, M.; De Pietro, G.; Coronato, A. Application of Reinforcement Learning and Deep Learning in Multiple-Input and Multiple-Output (MIMO) Systems. Sensors 2022, 22, 309. [Google Scholar] [CrossRef]
  29. Oh, M.S.; Hosseinalipour, S.; Kim, T.; Brinton, C.G.; Love, D.J. Channel Estimation via Successive Denoising in MIMO OFDM Systems: A Reinforcement Learning Approach. In Proceedings of the IEEE International Conference on Communications (ICC), Montreal, QC, Canada, 14–23 June 2021. [Google Scholar]
  30. Chu, M.; Liu, A.; Lau, V.K.N.; Jiang, C.; Yang, T. Deep Reinforcement Learning based End-to-End Multi-User Channel Prediction and Beamforming. IEEE Trans. Wirel. Commun. 2022, 21, 10271–10285. [Google Scholar] [CrossRef]
  31. Jeon, Y.S.; Li, J.; Tavangaran, N.; Poor, H.V. Data-Aided Channel Estimator for MIMO Systems via Reinforcement Learning. In Proceedings of the IEEE International Conference on Communications (ICC), Prayagraj, India, 27–29 November 2020. [Google Scholar]
  32. Kim, T.K.; Min, M. A Low-Complexity Algorithm for Reinforcement Learning-Based Channel Estimator for MIMO Systems. Sensors 2022, 21, 4379. [Google Scholar] [CrossRef]
  33. Kim, T.K.; Jeon, Y.S.; Li, J.; Tavangaran, N.; Poor, H.V. Semi-Data-Aided Channel Estimation for MIMO Systems via Reinforcement Learning. IEEE Trans. Wirel. Commun. 2022; early access. [Google Scholar] [CrossRef]
  34. Dong, M.; Tong, L.; Sadler, B.M. Optimal Insertion of Pilot Symbols for Transmissions over Time-Varying Flat Fading Channels. IEEE Trans. Signal Process. 2004, 52, 1403–1418. [Google Scholar] [CrossRef] [Green Version]
  35. Kim, T.K.; Jeon, Y.S.; Min, M. Training Length Adaptation for Reinforcement Learning-Based Detection in Time-Varying Massive MIMO Systems With One-Bit ADCs. IEEE Trans. Veh. Technol. 2021, 70, 6999–7011. [Google Scholar] [CrossRef]
  36. Li, C.C.; Lin, Y.P. Predictive Coding of Bit Loading for Time Correlated MIMO Channels with A Decision Feedback Receiver. IEEE Trans. Signal Process. 2015, 63, 3376–3386. [Google Scholar] [CrossRef]
  37. Kim, H.; Yu, H.; Lee, Y. Limited Feedback for Multicell Zero-Forcing Coordinated Beamforming in Time-Varying Channels. IEEE Trans. Veh. Technol. 2015, 64, 2349–2359. [Google Scholar] [CrossRef]
  38. Mirza, J.; Dmochowski, P.A.; Smith, P.J.; Shafi, M. A Differential Codebook with Adaptive Scaling for Limited Feedback MU MISO Systems. IEEE Wirel. Commun. Lett. 2014, 3, 2–5. [Google Scholar] [CrossRef]
  39. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; The MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
Figure 1. Considered system model and frame structure in time-varying channels.
Figure 2. State-action diagram of the proposed channel estimator. After time slot index $n+2$, the virtual state is applied such that the state transition is simplified.
Figure 3. Proposed channel estimator using the further performance improvement strategy.
Figure 4. BLER for different channel estimators in time-invariant channels.
Figure 5. BLER for different channel estimators in time-varying channels ($\epsilon = 0.005$ and $\epsilon = 0.01$).
Figure 6. BLER of the proposed channel estimator for different window sizes $N_w$ in time-varying channels with $\epsilon = 0.01$.
Figure 7. NMSE of the proposed channel estimator for different window sizes $N_w$ in time-varying channels with $\epsilon = 0.005$.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
