1. Introduction
In cognitive radio networks (CRNs), secondary users (SUs) may access a potentially large number of frequency bands or channels that are not occupied by primary users (PUs) at a given time and place. The coexistence of PUs and SUs accessing the same part of the spectrum therefore becomes one of the key challenges [1]. Ideally, SUs should sense all channels before deciding which channel to access according to an access strategy. In practice, however, because of the wideband spectrum and hardware constraints, it is difficult for SUs to sense the entire operating spectrum band (up to 300 GHz) in a given period of time. Although compressive sensing has been adopted as a wideband spectrum sensing technology for CRNs to address this problem [2,3], little research has been done to implement feasible wideband spectrum sensing, as it is especially difficult to perform compressive sensing when prior knowledge of the primary signals is lacking. In fact, spectrum statistical information as a priori knowledge may not always be obtainable in a decentralized cognitive radio network. Hence, blind sub-Nyquist wideband sensing is still an open issue in the field of compressive sensing for CRNs [4]. Some efforts [5] have been made to solve this problem; for example, cognitive compressive sensing has been formulated as a restless multi-armed bandit (rMAB) problem, which makes compressive sensing adaptive and cognitive.
In this paper, we investigate the blind spectrum selection problem for classical narrowband spectrum sensing technology considering the handoff cost. In a decentralized CRN, prior knowledge of spectrum statistical information may not be acquirable; in this context, many scholars have developed the multi-armed bandit (MAB) framework for opportunistic spectrum access (OSA) in CRNs [6,7,8,9]. Anandkumar et al. [8] proposed a distributed algorithm named the ρ^{PRE} policy based on the ${\mathsf{\epsilon}}_{n}$-greedy policy [10]. Gai et al. [9] proposed an SL(K) subroutine and then established the prioritized access policy (DLP) and the fair access policy (DLF) based on SL(K) and a preallocation order. Chen et al. [11] then proposed the kthUCB1 policy, which combines the ${\mathsf{\epsilon}}_{n}$-greedy and UCB1 policies, and evaluated its performance for both real-time and best-effort applications. All of these policies achieve logarithmic regret in the long run.
Due to the time-varying nature of the spectrum in CRNs, SUs in the MAB framework are required to perform proactive spectrum handoffs when the spectrum band is occupied by PUs, which results in a handoff delay consisting of RF reconfiguration or negotiation between transceivers. In the above work [9,10,11], the handoff delay is not taken into consideration, i.e., the spectrum handoff is assumed to be costless. In this paper, by including a fixed handoff delay, SUs have to choose between staying on the foregoing spectrum with low availability and handing off to a spectrum with higher availability while tolerating the handoff delay. We formulate this problem and investigate the performance of the above policies, i.e., ρ^{PRE}, SL(K), and kthUCB1. To the best of our knowledge, the influence of the handoff delay on these policies in the MAB framework has not been investigated yet.
The rest of this paper is organized as follows: Section 2 describes the system model, which is similar to that of related works [9,12] except that the handoff delay is included as a handoff cost. In Section 3, we formulate the problem and present the three policies. In Section 4, we examine the proposed scheme through simulation. Finally, the paper concludes with a summary in Section 5.
2. System Model
The channel model of a cognitive radio network with $C,C\ge 2$ independent and orthogonal channels that are licensed to a primary network following a synchronous slot structure is illustrated in Figure 1. We model the channel availability ${W}_{i}$ as an i.i.d. Bernoulli process with mean value ${\mathsf{\beta}}_{i}\in \mathrm{B}$: ${W}_{i}\sim B({\mathsf{\beta}}_{i})$, in which ${W}_{i}(t)$ denotes the “free” (denoted by 1) or “busy” (denoted by 0) state of channel $i$ at time $t$. An SU can access a free slot, but it incurs a handoff delay if it accesses a channel that it did not access in the previous slot.
The cognitive radio network comprises $M$ secondary users. They access the channels in a decentralized way, i.e., there is no centralized entity that collects channel availability and channel state information (CSI) and then dispatches channels to SUs; in a slotted cognitive radio network, such a centralized entity would heavily impact network performance. Nevertheless, we assume each SU has a preallocated rank that is dispatched by the network when the network forms or when the SU joins the network. For simplicity, we assume the priority of $S{U}_{j}$ is ranked by $j$, i.e., the priority of $S{U}_{p}$ is higher than that of $S{U}_{q}$ if $p<q$ for either type of application. This rank is assigned prior to the learning and transmission processes and is not changed afterwards.
The SUs behave in a proactive way to access the channels: they record past channel access histories and then use them to predict future spectrum availability following a given policy. In addition, because of the inclusion of the handoff delay, SUs have to choose between staying on the foregoing spectrum with relatively low availability and handing off to a spectrum with higher availability while tolerating the handoff delay. For simplicity, we denote the proportion of the handoff delay to the entire slot as the fixed handoff cost $H$.
The cognitive radio frame structure is shown in Figure 2. At the beginning of the frame, an SU chooses a channel to sense. Once the sensing result indicates that the channel is idle, the SU transmits a pilot to the receiver to probe the CSI. The CSI is fed back through a dedicated error-free feedback channel without delay. The length of the data transmission is scalable and is adapted by the transceiver according to the handoff delay or the data length in this scheme. At the end of the frame, the receiver acknowledges each transmission: ${Z}_{i,j}(k)=0$ if a collision occurs; otherwise, ${Z}_{i,j}(k)=1$.
3. Problem Formulation and Policies
Blind spectrum selection in decentralized cognitive radio networks can be formulated as a decentralized MAB problem for multiple distributed SUs [8,9,12,13,14]; in this paper, the terms “channel” and “arm” are used interchangeably. Denote ${\mathsf{\pi}}_{j}$ as the decentralized policy for SU $j$ and $\mathsf{\pi}=\{{\mathsf{\pi}}_{j},1\le j\le M\}$ as the set of homogeneous policies of all users. Arm $i$ yields reward ${X}_{i}(t)$ at slot $t$ according to its distribution, whose expectation is ${\mathsf{\theta}}_{i}$, ${\mathsf{\theta}}_{i}\in \mathrm{\Theta}$. Thus, the sum of the actual rewards obtained by all users after $T$ slots following policy $\mathsf{\pi}$ is
$${S}^{\mathsf{\pi}}(T)=\sum _{t=1}^{T}\sum _{j=1}^{M}\sum _{i=1}^{C}{X}_{i}(t)\,{\mathbb{I}}_{i,j}(t),$$
where ${\mathbb{I}}_{i,j}(t)$ is defined to be 1 if user $j$ is the only one to play arm $i$ at slot $t$; otherwise, it is 0.
In the ideal scenario where the availability statistics $\mathrm{\Theta}$ are known, the SUs are orthogonally allocated to the ${\mathcal{O}}_{M}^{*}$ channels, where ${\mathcal{O}}_{M}^{*}$ is the set of $M$ arms with the $M$ largest expected rewards. Then, the expected reward after $T$ slots is
$${S}^{*}(T)=T\sum _{i\in {\mathcal{O}}_{M}^{*}}{\mathsf{\theta}}_{i}.$$
Then, we can define the performance of policy $\mathsf{\pi}$ as the regret ${R}_{M}^{\mathsf{\pi}}(\mathrm{\Theta};T)$:
$${R}_{M}^{\mathsf{\pi}}(\mathrm{\Theta};T)={S}^{*}(T)-\mathbb{E}[{S}^{\mathsf{\pi}}(T)],$$
where $\mathbb{E}[\cdot]$ is the expectation operator.
We call a policy $\mathsf{\pi}$ uniformly good if, for every configuration $\mathrm{\Theta}$, the regret satisfies
$${R}_{M}^{\mathsf{\pi}}(\mathrm{\Theta};T)=o({T}^{\mathsf{\alpha}})\text{ for every }\mathsf{\alpha}>0.$$
Such policies do not allow the total regret to increase rapidly for any $\mathrm{\Theta}$.
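To make the regret notion concrete, here is a small Python sketch (function names and the numeric example are ours, not the paper's) that evaluates the genie-aided expected reward and the resulting regret for a given achieved reward:

```python
import numpy as np

def ideal_reward(theta, M, T):
    """Expected reward of the genie-aided allocation: the M arms with the
    largest expected rewards are used orthogonally for all T slots."""
    return T * np.sort(theta)[-M:].sum()

def regret(theta, M, T, achieved):
    """Regret = ideal expected reward minus the reward a policy achieved
    (`achieved` stands in for the empirical mean of the policy's total)."""
    return ideal_reward(theta, M, T) - achieved

theta = np.array([0.5, 0.2, 0.8, 0.6, 0.9])
# Genie reward over 1000 slots with M = 2 users: 1000 * (0.9 + 0.8) = 1700,
# so a policy that earned a total reward of 1500 incurs a regret of 200.
r = regret(theta, M=2, T=1000, achieved=1500.0)
```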
This problem has been widely studied, and several representative policies that are uniformly good have been proposed: the distributed ρ^{PRE} policy [8], the SL(K) policy [9], and the kthUCB1 policy [11]. The distributed ρ^{PRE} policy is based on the ${\mathsf{\epsilon}}_{n}$-greedy policy, which prescribes playing the arm with the highest average reward with probability $1-{\mathsf{\epsilon}}_{n}$ and a random arm with probability ${\mathsf{\epsilon}}_{n}$, where ${\mathsf{\epsilon}}_{n}$ decreases as the experiment proceeds. However, one parameter of the policy requires prior evaluation of the arm reward means. To avoid this problem, the SL(K) policy was proposed based on the classical UCB1 policy for the MAB problem. Though it guarantees logarithmic regret in the long run, it leads to a larger leading constant in the logarithmic order. The kthUCB1 policy makes a good tradeoff between both policies.
Algorithm 1: ρ^{PRE} policy for the user with rank $K$.
//Define: ${n}_{i}(t)$: the number of times arm $i$ has been played after $t$ slots. ${\widehat{\mathsf{\theta}}}_{i}(t)$: sample mean availabilities after $t$ slots. ${\mathsf{\epsilon}}_{t}:=\mathrm{min}[\frac{\mathsf{\beta}}{t},1]$, where the decay rate $\mathsf{\beta}$ is evaluated in advance according to the arm reward means.
//Init: play each arm once
For $t=1$ to $C$
 Play arm $i=t$ and let ${n}_{i}(t)=1$, ${\widehat{\mathsf{\theta}}}_{i}(t)={X}_{i}(t)$
EndFor
//Main loop
For $t=C+1$ to $T$
 Step 1: With probability $1-{\mathsf{\epsilon}}_{t}$, play the arm with the $K$th highest index value in $\{{\widehat{\mathsf{\theta}}}_{i}(t)\}$; with probability ${\mathsf{\epsilon}}_{t}$, play a channel uniformly at random
 Step 2: Update ${n}_{i}(t)$, ${\widehat{\mathsf{\theta}}}_{i}(t)$, and ${\mathsf{\epsilon}}_{t}$
EndFor
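The loop above can be sketched in Python as follows (an illustrative single-user implementation of ours, not the authors' code; the reward is taken to be the channel state ${W}_{i}(t)$, collisions are ignored, and the seed is arbitrary):

```python
import numpy as np

def rho_pre(W, K, beta_decay=400.0, seed=0):
    """Sketch of the rho^PRE (epsilon_t-greedy) policy for a user of rank K.

    W: (T, C) 0/1 matrix of channel states; playing arm i at slot t yields
    reward W[t, i]. After playing each arm once, the user plays the arm
    with the K-th highest sample mean with probability 1 - eps_t and a
    uniformly random arm with probability eps_t, eps_t = min(beta/t, 1)."""
    rng = np.random.default_rng(seed)
    T, C = W.shape
    n = np.zeros(C)                      # play counts n_i(t)
    s = np.zeros(C)                      # accumulated rewards
    choices = np.empty(T, dtype=int)
    for t in range(T):
        if t < C:                        # init: play each arm once
            arm = t
        elif rng.random() < min(beta_decay / (t + 1), 1.0):
            arm = int(rng.integers(C))   # explore uniformly at random
        else:                            # exploit: K-th highest sample mean
            arm = int(np.argsort(s / n)[-K])
        n[arm] += 1
        s[arm] += W[t, arm]
        choices[t] = arm
    return choices
```

Once ${\mathsf{\epsilon}}_{t}$ has decayed, the choices of a rank-$K$ user concentrate on the arm with the $K$th best sample mean.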

Algorithm 2: SL(K) policy for the user with rank $K$.
//Define: ${n}_{i}(t)$: the number of times arm $i$ has been played after $t$ slots. ${\widehat{\mathsf{\theta}}}_{i}(t)$: sample mean availabilities after $t$ slots.
//Init: play each arm once
For $t=1$ to $C$
 Play arm $i=t$ and let ${n}_{i}(t)=1$, ${\widehat{\mathsf{\theta}}}_{i}(t)={X}_{i}(t)$
EndFor
//Main loop
For $t=C+1$ to $T$
 Step 1: Select the set ${\mathcal{O}}_{K}$ containing the arms with the $K$ highest index values
 Step 2: Play the arm with the minimal index value in ${\mathcal{O}}_{K}$
 Step 3: Update ${n}_{i}(t)$ and ${\widehat{\mathsf{\theta}}}_{i}(t)$
EndFor

Algorithm 3: kthUCB1 policy for the user with rank $K$.
//Define: ${n}_{i}(t)$: the number of times arm $i$ has been played after $t$ slots. ${\widehat{\mathsf{\theta}}}_{i}(t)$: sample mean availabilities after $t$ slots. ${\mathsf{\epsilon}}_{t}:=\mathrm{min}[\frac{\mathsf{\beta}}{t},1]$, where the decay rate $\mathsf{\beta}$ is evaluated in advance according to the arm reward means.
//Init: play each arm once
For $t=1$ to $C$
 Play arm $i=t$ and let ${n}_{i}(t)=1$, ${\widehat{\mathsf{\theta}}}_{i}(t)={X}_{i}(t)$
EndFor
//Main loop
For $t=C+1$ to $T$
 Step 1: Select the set ${\mathcal{O}}_{K}$ containing the arms with the $K$ highest index values
 Step 2: With probability $1-{\mathsf{\epsilon}}_{t}$, play the arm with the minimal index value in ${\mathcal{O}}_{K}$; with probability ${\mathsf{\epsilon}}_{t}$, play an arm uniformly at random in ${\mathcal{O}}_{K}$
 Step 3: Update ${n}_{i}(t)$, ${\widehat{\mathsf{\theta}}}_{i}(t)$, and ${\mathsf{\epsilon}}_{t}$
EndFor
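The combination of the two previous policies can be sketched as follows (again our own single-user illustration assuming the standard UCB1 index ${\widehat{\mathsf{\theta}}}_{i}(t)+\sqrt{2\mathrm{ln}\,t/{n}_{i}(t)}$; collisions are ignored and the seed is arbitrary):

```python
import numpy as np

def kth_ucb1(W, K, beta_decay=50.0, seed=0):
    """Sketch of the kthUCB1 policy for a user of rank K.

    Each slot, the K arms with the highest UCB indices form O_K; with
    probability 1 - eps_t the user plays the minimal-index arm in O_K,
    and with probability eps_t a uniformly random arm in O_K."""
    rng = np.random.default_rng(seed)
    T, C = W.shape
    n = np.zeros(C)                      # play counts n_i(t)
    s = np.zeros(C)                      # accumulated rewards
    choices = np.empty(T, dtype=int)
    for t in range(T):
        if t < C:                        # init: play each arm once
            arm = t
        else:
            idx = s / n + np.sqrt(2.0 * np.log(t) / n)
            top_k = np.argsort(idx)[-K:]  # O_K, sorted ascending by index
            if rng.random() < min(beta_decay / (t + 1), 1.0):
                arm = int(rng.choice(top_k))  # explore within O_K
            else:
                arm = int(top_k[0])           # minimal index within O_K
        n[arm] += 1
        s[arm] += W[t, arm]
        choices[t] = arm
    return choices
```

Restricting the random exploration to ${\mathcal{O}}_{K}$ is what allows the smaller decay rate $\mathsf{\beta}$ compared with ρ^{PRE}.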

The above policies were derived and investigated in a scenario where there is no expense when the player switches from one arm to another. However, in CRNs, the handoff cost should be taken into consideration, as illustrated in Section 2. Therefore, let
$${h}^{\mathsf{\pi}}(t)=H\sum _{j=1}^{M}\sum _{i=1}^{C}{\mathbb{J}}_{i,j}(t)$$
be the sum of the handoff costs of all users at slot $t$, where ${\mathbb{J}}_{i,j}(t)$ is the indicator that user $j$ switches to arm $i$ from another arm. Then, define the handoff regret as
$$H{R}^{\mathsf{\pi}}(\mathrm{\Theta};T)=\mathbb{E}\left[\sum _{t=1}^{T}{h}^{\mathsf{\pi}}(t)\right].$$
We define the total regret as
$$T{R}^{\mathsf{\pi}}(\mathrm{\Theta};T)={R}_{M}^{\mathsf{\pi}}(\mathrm{\Theta};T)+H{R}^{\mathsf{\pi}}(\mathrm{\Theta};T).$$
Due to the inclusion of the handoff delay $H$, $H{R}^{\mathsf{\pi}}(\mathrm{\Theta};T)$ and ${R}_{M}^{\mathsf{\pi}}(\mathrm{\Theta};T)$ are correlated, so it is difficult to carry out a theoretical analysis of the total regret in the distributed multi-user case, although the authors of [15] considered this problem in a single-user case. In the next section, we examine the above policies by simulation and discuss their performance.
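To make the handoff accounting concrete, here is a short sketch (our own helper, not from the paper) that charges the fixed cost $H$ whenever a user's chosen channel differs from the previous slot's choice:

```python
import numpy as np

def handoff_cost(choices, H):
    """Total handoff cost of one user over a horizon: a fixed cost H is
    charged at every slot whose chosen channel differs from the channel
    chosen in the previous slot."""
    c = np.asarray(choices)
    return H * int(np.count_nonzero(c[1:] != c[:-1]))

# The sequence below contains 3 handoffs (1->2, 2->3, 3->1), so the
# total cost is 3 * H.
cost = handoff_cost([1, 1, 2, 2, 3, 1], H=0.1)
```

The handoff regret is then the expectation of this quantity summed over all users.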
4. Simulation Results and Analysis
In this section, we present simulation results for the scheme proposed in this work. Simulations are carried out using Matlab, and we assume $C=9$ channels with channel availabilities $\mathrm{B}$ = [0.5, 0.2, 0.8, 0.6, 0.9, 0.3, 0.4, 0.1, 0.7] and $M=3$ SUs. The policy parameter configuration is the decay rate $\mathsf{\beta}=400$ for the ρ^{PRE} policy and $\mathsf{\beta}=50$ for the kthUCB1 policy, which is an optimal configuration according to the authors of [11]; the SL(K) policy is parameter-free. The time scope is $T=5\times {10}^{4}$ slots. Every experiment is repeated 50 times.
As the regret of one user already takes collisions into consideration, the total regret of a CRN is simply the sum over all users in that CRN. Therefore, we present the regret and actions of one user in the CRN, taking the SU with rank $K=2$ as a representative. Figure 3 shows the regrets and actions of these policies under a fixed handoff cost $H=0.1$.
From Figure 3, we can see that all three policies achieve logarithmic regret and that the actions converge to the third arm, which is the arm with the second-best channel availability. The regret of the kthUCB1 policy (Figure 3b) is smaller than that of the ρ^{PRE} policy (Figure 3a). This is due to two factors. Firstly, the decay rate $\mathsf{\beta}$ in kthUCB1 can be smaller than in ρ^{PRE} without making the policy diverge. Secondly, the kthUCB1 policy can distinguish the order-optimal arm from the other arms more precisely than the ρ^{PRE} policy, as can be seen by comparing the action histograms in Figure 3a,b, in which the number of times arm 9 is selected by the kthUCB1 policy is smaller than that of the ρ^{PRE} policy. The regret of the SL(K) policy in Figure 3c is the largest, which means it has the largest leading constant in the logarithmic order.
We also investigated the regret of the three policies with varying handoff cost $H$, as shown in Figure 4. As the handoff cost is meaningless when its value is larger than 0.5, an $H$ between 0 and 0.5 is chosen. From Figure 4, we see that the regret increases as $H$ increases. Moreover, the growth rates of the three policies are all small when $H<0.3$, which shows that these policies perform well. However, they become considerably large when $H>0.3$. Intuitively, this is because a large $H$ causes the arms to become indistinguishable.
5. Conclusions
In this work, we studied blind spectrum selection in decentralized cognitive radio networks within the MAB framework, considering the handoff delay as a fixed handoff cost. We formulated this problem and investigated three representative policies, examining the uniform goodness of these policies in our scenario. Through simulation, we further showed that, despite the inclusion of the fixed handoff cost, ρ^{PRE}, SL(K), and kthUCB1 achieve the same asymptotic performance as they do without the handoff cost. By comparing these three policies, we found that the kthUCB1 policy has the best overall performance.