1. Introduction
A hyperspectral image (HSI) can be represented as a three-dimensional data cube containing both spectral and spatial information, which characterizes the radiation properties, spatial distribution and geometric characteristics of ground objects [1,2]. Compared with panchromatic, RGB and multispectral images, which have only a few broad bands, an HSI usually has hundreds of spectral bands. The rich spectral information of an HSI can be used to discriminate subtle differences between similar ground objects, which makes HSI suitable for many applications, such as target recognition, mineral detection and precision agriculture [1,2,3]. Due to the scattering of the ground surface and the low spatial resolution of hyperspectral sensors, an observed HSI pixel is often a mixture of multiple ground materials [4,5,6]. This is the so-called “mixed pixel”. The presence of mixed pixels seriously affects the application of HSIs. To address this problem, hyperspectral unmixing (HU) techniques have been developed [4,5,6,7,8]. HU aims to decompose a mixed spectrum into a collection of pure spectra (endmembers) while also providing the corresponding fractions (abundances). In terms of the spectral mixture mechanism, HU algorithms can be roughly categorized into linear and nonlinear ones [4,5]. Although the nonlinear mixing assumption generally represents real scenes better, the (more simplified) linear mixing assumption has proved to work very satisfactorily in many practical cases and, owing to its mathematical tractability, has attracted significant attention from the scientific community. For these reasons, the linear mixture model is adopted in the present paper, in which a measured spectrum is represented as a linear combination of several endmembers.
Nonnegative matrix factorization (NMF) is a widely used linear HU method [9,10,11,12,13,14,15,16,17,18,19,20]. In this framework, HU is regarded as a blind source separation problem that decomposes an observed HSI matrix into the product of a pure-pixel matrix (endmember matrix) and the corresponding proportion matrix (abundance matrix). To respect the physical constraints, nonnegativity constraints on the endmembers and abundances and an abundance sum-to-one constraint (ASC) are imposed. The NMF algorithm is intuitive and interpretable. However, due to the large number of unknown variables, the solution space of the NMF model is too large. To restrict the solution space, many NMF variants have been proposed by adding constraints on the abundances or endmembers [10,11,12,13,14,15,16]. Miao et al. incorporated a volume constraint on the endmembers into the NMF formulation and proposed a minimum volume constrained NMF (MVCNMF) model [10], which can perform unsupervised endmember extraction from highly mixed image data without the pure-pixel assumption. Jia et al. introduced two constraints into the NMF [11], i.e., piecewise smoothness of the spectral data and sparseness of the abundance fractions. Similarly, two constraints on the abundances (i.e., an abundance separation constraint and an abundance smoothness constraint) were added to the NMF [12]. Qian et al. imposed an ${l}_{1/2}$-norm-based sparsity constraint on the abundances and proposed an ${l}_{1/2}$-NMF unmixing model [13]. Lu et al. considered the manifold structure of HSI and incorporated manifold regularization into the ${l}_{1/2}$-NMF [14]. Wang et al. added an endmember dissimilarity constraint to the NMF [15].
Although the aforementioned NMF methods improve the classical NMF unmixing model to a certain extent, they ignore the effect of noise. As the objective function of NMF is a least-squares loss, NMF is sensitive to noise, and the corresponding unmixing results are often inaccurate and unstable. To suppress the effect of noise and improve the robustness of the model, many robust NMF methods have been proposed [17,18,19,20]. He et al. proposed a sparsity-regularized robust NMF that adds a sparse matrix to the linear mixture model to account for sparse noise [17]. Du et al. introduced the robust correntropy-induced metric (CIM) and proposed a CIM-based NMF (CIM-NMF) model, which can effectively deal with non-Gaussian noise [18]. Wang et al. proposed a robust correntropy-based NMF model (CENMF) [19], which contains a correntropy-based loss function and an ${l}_{1}$-norm sparsity constraint on the abundances. Based on Huber’s M-estimator, Huang et al. constructed ${l}_{2,1}$-norm and ${l}_{1,2}$-norm based loss functions to obtain a new robust NMF model [20,21]. From the viewpoint of maximum likelihood estimation (MLE), defining an ${l}_{2,1}$-norm (${l}_{1,2}$-norm) based loss function implicitly assumes that the column-wise (row-wise) approximation residuals follow a Laplacian (Gaussian) distribution. However, in practice this assumption may not hold well, especially when the HSI contains complex mixed noise, such as impulse noise, stripes, deadlines, and other types of noise [22,23].
Inspired by robust regression theory [23,24], we design the approximation residual as an MLE-like estimator and propose a robust MLE-based ${l}_{1/2}$-NMF model (MLE-NMF) for HU. It replaces the least-squares loss in the original NMF with a robust MLE-based loss, which is a function (associated with the distribution of the approximation residuals) of the approximation residuals [24]. The proposed MLE-NMF can be converted into a weighted ${l}_{1/2}$-NMF model and solved by a reweighted multiplicative update algorithm [9,13]. By choosing an appropriate weight function, MLE-NMF automatically assigns small weights to bands with large residuals, which effectively reduces the effect of noisy bands and improves the unmixing accuracy. Experimental results on simulated and real hyperspectral data sets show the superiority of MLE-NMF over existing NMF methods.
The rest of the paper is organized as follows. Section 2 introduces the NMF and ${l}_{1/2}$-NMF models. Section 3 describes the proposed MLE-NMF method. Experimental results and analysis are provided in Section 4. Section 5 discusses the effect of the parameters in the algorithm. Finally, Section 6 concludes the paper.
2. NMF Unmixing Model
Under the linear spectral mixing mechanism, an observed spectrum $\mathit{h}\in {R}^{M\times 1}$ can be represented linearly by the endmembers ${\mathit{z}}_{1},\cdots ,{\mathit{z}}_{P}$ [4,10,11,12,13]:

$\mathit{h}=\mathit{Z}\mathit{s}+\mathit{\epsilon},$ (1)

where $\mathit{Z}=\left[{\mathit{z}}_{1},\cdots ,{\mathit{z}}_{P}\right]\in {R}^{M\times P}$ represents the endmember matrix, $\mathit{s}\in {R}^{P\times 1}$ is the coefficient (abundance) vector, and $\mathit{\epsilon}$ is the residual. Applying the above linear mixing model (1) to all hyperspectral pixels ${\mathit{h}}_{1},\cdots ,{\mathit{h}}_{N}$, the following matrix representation is obtained:

$\mathit{H}=\mathit{Z}\mathit{S}+\mathit{E},$ (2)

where $\mathit{H}=\left[{\mathit{h}}_{1},\cdots ,{\mathit{h}}_{N}\right]\in {R}^{M\times N}$ and $\mathit{S}=\left[{\mathit{s}}_{1},\cdots ,{\mathit{s}}_{N}\right]\in {R}^{P\times N}$ are the nonnegative hyperspectral data matrix and abundance matrix, respectively, and $\mathit{E}\in {R}^{M\times N}$ is the residual matrix.
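As a quick illustration of the linear mixing model (2), a synthetic data matrix obeying the nonnegativity and sum-to-one constraints can be generated as follows; all sizes, distributions and the noise level are arbitrary choices for illustration, not values from the paper:

```python
import numpy as np

# Illustrative sizes: M bands, N pixels, P endmembers (our own choices).
rng = np.random.default_rng(0)
M, N, P = 50, 200, 3

Z = rng.uniform(0.0, 1.0, size=(M, P))    # nonnegative endmember matrix
S = rng.dirichlet(np.ones(P), size=N).T   # abundances: each column sums to one (ASC)
E = 0.01 * rng.standard_normal((M, N))    # small residual/noise term

H = Z @ S + E                             # linear mixing model, Eq. (2)

# Each column of S satisfies the sum-to-one constraint.
col_sums = S.sum(axis=0)
```

Each column of `H` is then (up to noise) a convex combination of the endmember spectra.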
In Equation (2), to make the decomposition as accurate as possible, the residual should be minimized. An NMF unmixing model is then obtained by considering the nonnegativity of the endmember and abundance matrices:

$\underset{\mathit{Z}\ge 0,\mathit{S}\ge 0}{\mathrm{min}}{\Vert \mathit{H}-\mathit{Z}\mathit{S}\Vert}_{F}^{2},$ (3)

where ${\Vert \cdot \Vert}_{F}$ denotes the Frobenius norm, and $\mathit{Z}\ge 0$ means that each element of $\mathit{Z}$ is nonnegative. As each column of the abundance matrix $\mathit{S}$ records the proportions of the endmembers in representing a pixel, the columns of $\mathit{S}$ (each corresponding to a pixel) should satisfy the sum-to-one constraint, i.e., ${\sum}_{p=1}^{P}{S}_{pn}=1,\ n=1,\cdots ,N$.
The above NMF model (3) can be solved by the multiplicative update algorithm [9,13]. However, its solution space is very large [13]. To restrict the solution space, an ${l}_{1/2}$ constraint can be added to the abundance matrix $\mathit{S}$, yielding the ${l}_{1/2}$-NMF model [13]:

$\underset{\mathit{Z}\ge 0,\mathit{S}\ge 0}{\mathrm{min}}{\Vert \mathit{H}-\mathit{Z}\mathit{S}\Vert}_{F}^{2}+\lambda {\Vert \mathit{S}\Vert}_{1/2},$ (4)

where $\lambda$ is a regularization parameter and ${\Vert \mathit{S}\Vert}_{1/2}={\sum}_{p,n}{S}_{pn}^{1/2}$ is the ${l}_{1/2}$ regularizer [13]. As proved in Refs. [13,25], the ${l}_{1/2}$ regularizer is a good choice for enforcing sparsity in hyperspectral unmixing because the sparsity of the ${l}_{q}$ ($1/2\le q<1$) solution increases as $q$ decreases, whereas the sparsity of the ${l}_{q}$ ($0<q\le 1/2$) solution changes little with respect to $q$. Meanwhile, the sparsity induced by ${l}_{1/2}$ also enforces the volume of the simplex to be minimized [13].
3. MLE-NMF Unmixing Model
In the NMF model (3) or (4), the objective function ${\Vert \mathit{H}-\mathit{Z}\mathit{S}\Vert}_{F}^{2}$ is the least-squares (LS) loss, which is sensitive to noise. Here, we employ a new robust MLE-based loss to replace the LS objective function and propose an MLE-based NMF (MLE-NMF) model for HU.
Firstly, the matrix-norm form is transformed into a vector-norm form:

${\Vert \mathit{H}-\mathit{Z}\mathit{S}\Vert}_{F}^{2}={\sum}_{i=1}^{M}{\Vert {\mathit{H}}^{i}-{\left(\mathit{Z}\mathit{S}\right)}^{i}\Vert}_{2}^{2},$ (5)

where ${\mathit{H}}^{i}$ is the $i$th row of matrix $\mathit{H}$.
We can regard the least-squares objective function as a sum of approximation residuals and then construct an MLE-like robust estimator to approximate the minimum of the objective function. Denoting the approximation residual of the $i$th band as ${e}_{i}={\Vert {\mathit{H}}^{i}-{\left(\mathit{Z}\mathit{S}\right)}^{i}\Vert}_{2}$ and defining the residual vector $\mathit{e}={\left[{e}_{1},\dots ,{e}_{M}\right]}^{T}$, the above Formula (5) can be rewritten as:

${\Vert \mathit{H}-\mathit{Z}\mathit{S}\Vert}_{F}^{2}={\sum}_{i=1}^{M}{e}_{i}^{2}.$ (6)
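The band-wise residual decomposition of Equations (5) and (6) can be checked numerically; the matrices here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, P = 20, 50, 3
H = rng.uniform(size=(M, N))
Z = rng.uniform(size=(M, P))
S = rng.uniform(size=(P, N))

R = H - Z @ S
e = np.linalg.norm(R, axis=1)   # residual vector e = [e_1, ..., e_M]^T, one entry per band

# Sum of squared band residuals equals the squared Frobenius norm (Eqs. (5)-(6)).
sum_sq = np.sum(e ** 2)
frob_sq = np.linalg.norm(R, "fro") ** 2
```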
Assume that ${e}_{1},\dots ,{e}_{M}$ are independent and identically distributed (i.i.d.) random variables that follow the same probability distribution function ${g}_{\theta}\left({e}_{i}\right)$, where $\theta$ is the distribution parameter. The likelihood function can be expressed as:

${L}_{\theta}\left(\mathit{e}\right)={\prod}_{i=1}^{M}{g}_{\theta}\left({e}_{i}\right).$ (7)

According to the principle of MLE, maximizing the likelihood is equivalent to minimizing the following objective function:

$\underset{\mathit{e}}{\mathrm{min}}{\sum}_{i=1}^{M}{\phi}_{\theta}\left({e}_{i}\right),$ (8)

where ${\phi}_{\theta}\left({e}_{i}\right)=-\mathrm{ln}\,{g}_{\theta}\left({e}_{i}\right)$. If we replace the objective function ${\Vert \mathit{H}-\mathit{Z}\mathit{S}\Vert}_{F}^{2}$ in Equation (4) by the loss in Equation (8), we get the following optimization problem:

$\underset{\mathit{Z}\ge 0,\mathit{S}\ge 0}{\mathrm{min}}{\sum}_{i=1}^{M}{\phi}_{\theta}\left({\Vert {\mathit{H}}^{i}-{\left(\mathit{Z}\mathit{S}\right)}^{i}\Vert}_{2}\right)+\lambda {\Vert \mathit{S}\Vert}_{1/2}.$ (9)
In fact, the aim is to construct a loss function that replaces the least-squares function so as to reduce the impact of noise. To construct the loss function, we analyze its Taylor expansion. Assume that ${g}_{\theta}$ is symmetric, and ${g}_{\theta}\left({e}_{i}\right)<{g}_{\theta}\left({e}_{j}\right)$ if $\left|{e}_{i}\right|>\left|{e}_{j}\right|$. We can infer that: (1) ${g}_{\theta}\left(0\right)$ is the global maximum of ${g}_{\theta}$ and ${\phi}_{\theta}\left(0\right)$ is the global minimum of ${\phi}_{\theta}$; (2) ${\phi}_{\theta}\left(-{e}_{i}\right)={\phi}_{\theta}\left({e}_{i}\right)$; (3) ${\phi}_{\theta}\left({e}_{i}\right)>{\phi}_{\theta}\left({e}_{j}\right)$ if $\left|{e}_{i}\right|>\left|{e}_{j}\right|$. For simplicity, we assume ${\phi}_{\theta}\left(0\right)=0$. Define ${D}_{\theta}\left(\mathit{e}\right)={\sum}_{i=1}^{M}{\phi}_{\theta}\left({e}_{i}\right)$. According to the Taylor expansion around ${\mathit{e}}_{0}$, ${D}_{\theta}\left(\mathit{e}\right)$ can be approximated as [24]:

${\tilde{D}}_{\theta}\left(\mathit{e}\right)={D}_{\theta}\left({\mathit{e}}_{0}\right)+{\left(\mathit{e}-{\mathit{e}}_{0}\right)}^{T}{D}_{\theta}^{\prime}\left({\mathit{e}}_{0}\right)+\frac{1}{2}{\left(\mathit{e}-{\mathit{e}}_{0}\right)}^{T}\mathit{W}\left(\mathit{e}-{\mathit{e}}_{0}\right),$ (10)
where ${D}_{\theta}^{\prime}\left({\mathit{e}}_{0}\right)$ is the first-order derivative of ${D}_{\theta}\left(\mathit{e}\right)$ at ${\mathit{e}}_{0}$, and $\mathit{W}$ is the Hessian matrix. The mixed partial derivatives satisfy $\frac{{\partial}^{2}{D}_{\theta}}{\partial {e}_{i}\partial {e}_{j}}=0$ ($i\ne j$) because the error residuals ${e}_{i}$ and ${e}_{j}$ are assumed i.i.d., and hence $\mathit{W}$ is a diagonal matrix. Taking the derivative of ${\tilde{D}}_{\theta}\left(\mathit{e}\right)$ with respect to $\mathit{e}$ gives:

${\tilde{D}}_{\theta}^{\prime}\left(\mathit{e}\right)={D}_{\theta}^{\prime}\left({\mathit{e}}_{0}\right)+\mathit{W}\left(\mathit{e}-{\mathit{e}}_{0}\right).$ (11)
As ${\phi}_{\theta}\left(0\right)=0$ is the global minimum of ${\phi}_{\theta}$, the minimum of ${D}_{\theta}\left(\mathit{e}\right)$ is ${D}_{\theta}\left(\mathbf{0}\right)$. ${\tilde{D}}_{\theta}\left(\mathit{e}\right)$ should also reach its minimum at $\mathit{e}=\mathbf{0}$, since it is an approximation of ${D}_{\theta}\left(\mathit{e}\right)$, so ${\tilde{D}}_{\theta}^{\prime}\left(\mathbf{0}\right)=\mathbf{0}$, and from Equation (11) we can derive the following formulas:

${D}_{\theta}^{\prime}\left({\mathit{e}}_{0}\right)=\mathit{W}{\mathit{e}}_{0},$ (12)

${W}_{i,i}=\frac{{\phi}_{\theta}^{\prime}\left({e}_{0,i}\right)}{{e}_{0,i}},$ (13)

where ${W}_{i,i}$ is the $i$th diagonal element of $\mathit{W}$ and ${e}_{0,i}$ is the $i$th element of ${\mathit{e}}_{0}$. Denoting ${w}_{i}={W}_{i,i}$ and evaluating at the current residual, Equation (13) can be written as:

${w}_{i}=\frac{{\phi}_{\theta}^{\prime}\left({e}_{i}\right)}{{e}_{i}}.$ (14)
As ${\phi}_{\theta}\left(x\right)$ is a nonlinear and nonconvex function, it is difficult to solve the model (9) directly. Inspired by the above Formula (14), we get:

${\phi}_{\theta}\left({e}_{i}\right)\approx \frac{1}{2}{w}_{i}{e}_{i}^{2},$ (15)

and then the model (9) can be expressed (absorbing the constant factor $1/2$ into $\lambda$) as a weighted NMF model:

$\underset{\mathit{Z}\ge 0,\mathit{S}\ge 0}{\mathrm{min}}{\sum}_{i=1}^{M}{w}_{i}{\Vert {\mathit{H}}^{i}-{\left(\mathit{Z}\mathit{S}\right)}^{i}\Vert}_{2}^{2}+\lambda {\Vert \mathit{S}\Vert}_{1/2}.$ (16)

The objective function of model (16) can be rewritten as:

${\sum}_{i=1}^{M}{w}_{i}{\Vert {\mathit{H}}^{i}-{\left(\mathit{Z}\mathit{S}\right)}^{i}\Vert}_{2}^{2}={\Vert \sqrt{\mathit{W}}\left(\mathit{H}-\mathit{Z}\mathit{S}\right)\Vert}_{F}^{2}={\Vert \tilde{\mathit{H}}-\tilde{\mathit{Z}}\mathit{S}\Vert}_{F}^{2},$ (17)

where $\tilde{\mathit{H}}=\sqrt{\mathit{W}}\mathit{H}$, $\tilde{\mathit{Z}}=\sqrt{\mathit{W}}\mathit{Z}$. Then, the model (16) can be expressed as:

$\underset{\tilde{\mathit{Z}}\ge 0,\mathit{S}\ge 0}{\mathrm{min}}{\Vert \tilde{\mathit{H}}-\tilde{\mathit{Z}}\mathit{S}\Vert}_{F}^{2}+\lambda {\Vert \mathit{S}\Vert}_{1/2}.$ (18)
It is easy to see that model (18) is also an ${l}_{1/2}$-NMF problem and can be solved by the following multiplicative update rules [9,13]:

$\tilde{\mathit{Z}}\leftarrow \tilde{\mathit{Z}}\odot \left(\tilde{\mathit{H}}{\mathit{S}}^{T}\right)\oslash \left(\tilde{\mathit{Z}}\mathit{S}{\mathit{S}}^{T}\right),$ (19)

$\mathit{S}\leftarrow \mathit{S}\odot \left({\tilde{\mathit{Z}}}^{T}\tilde{\mathit{H}}\right)\oslash \left({\tilde{\mathit{Z}}}^{T}\tilde{\mathit{Z}}\mathit{S}+\frac{\lambda}{2}{\mathit{S}}^{-\frac{1}{2}}\right),$ (20)

where $\odot$ and $\oslash$ denote element-wise multiplication and division, and ${\mathit{S}}^{-\frac{1}{2}}$ is taken element-wise. The final endmember matrix is recovered as $\mathit{Z}={\mathit{W}}^{-\frac{1}{2}}\tilde{\mathit{Z}}$.
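The equivalence between the band-weighted loss and the Frobenius loss on the weighted matrices $\tilde{\mathit{H}}=\sqrt{\mathit{W}}\mathit{H}$, $\tilde{\mathit{Z}}=\sqrt{\mathit{W}}\mathit{Z}$, as well as the recovery of the endmembers by undoing the per-band scaling, can be verified numerically (random data; the weights are arbitrary positive values for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, P = 15, 40, 3
H = rng.uniform(size=(M, N))
Z = rng.uniform(size=(M, P))
S = rng.uniform(size=(P, N))
w = rng.uniform(0.1, 1.0, size=M)          # one positive weight per band

# Left-hand side: sum of weighted squared band residuals.
weighted_loss = np.sum(w * np.linalg.norm(H - Z @ S, axis=1) ** 2)

# Right-hand side: Frobenius loss on the row-scaled (weighted) matrices.
sqrtW = np.diag(np.sqrt(w))
H_t, Z_t = sqrtW @ H, sqrtW @ Z
frob_loss = np.linalg.norm(H_t - Z_t @ S, "fro") ** 2

# Undoing the scaling recovers the original endmember matrix.
Z_back = np.diag(1.0 / np.sqrt(w)) @ Z_t
```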
In the model (18), a key factor is the weight. In this paper, the weight function is set as the logistic function [23,24,26]:

${w}_{\theta}\left({e}_{i}\right)=\frac{\mathrm{exp}\left(\gamma \tau -\gamma {e}_{i}^{2}\right)}{1+\mathrm{exp}\left(\gamma \tau -\gamma {e}_{i}^{2}\right)},$ (21)

where $\gamma ,\tau$ are positive scalars. Parameter $\gamma$ controls the rate at which the weight decreases from 1 to 0, and $\tau$ controls the location of the demarcation point [24]. Clearly, the value of the weight function decreases rapidly as the residual ${e}_{i}$ increases.
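A logistic weight with the stated properties (decreasing from 1 toward 0, with $\gamma$ controlling the rate and $\tau$ the demarcation point) can be written as $w\left(e\right)=1/\left(1+\mathrm{exp}\left(\gamma \left({e}^{2}-\tau \right)\right)\right)$; this concrete form is our reading of the weight, and the parameter values below are illustrative only:

```python
import numpy as np

def mle_weight(e, gamma, tau):
    """Logistic band weight: close to 1 for small residuals, near 0 for large ones."""
    return 1.0 / (1.0 + np.exp(gamma * (e ** 2 - tau)))

gamma, tau = 2.0, 1.0
e = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
w = mle_weight(e, gamma, tau)
```

At the demarcation point ($e^{2}=\tau$) the weight equals $1/2$, and beyond it the weight drops quickly toward zero.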
The MLE weight function in Equation (21) can approximate the weights of commonly used loss functions, such as the ${l}_{2,1}$, maximum correntropy and Huber weights. When $\gamma =2$ and $\tau \to 0$, the MLE weight function becomes:

${w}_{\theta}\left({e}_{i}\right)=\frac{1}{1+\mathrm{exp}\left(2{e}_{i}^{2}\right)},$ (22)

which is close to the ${l}_{2,1}$ weight $\frac{1}{1+{e}_{i}^{2}}$. The corresponding weights are shown as red and blue lines in Figure 1a.
When $\gamma =\frac{1}{{\sigma}^{2}}$ and $\tau \to 0$, the MLE weight function becomes:

${w}_{\theta}\left({e}_{i}\right)=\frac{1}{1+\mathrm{exp}\left({e}_{i}^{2}/{\sigma}^{2}\right)},$ (23)

which is close to the weight of the maximum correntropy criterion, $\mathrm{exp}\left(-\frac{{e}_{i}^{2}}{{\sigma}^{2}}\right)$ ($\sigma$ is a parameter). The corresponding weights are shown in Figure 1b.
By choosing appropriate parameters, the MLE weight can also approximate the Huber weight:

${w}_{H}\left({e}_{i}\right)=\left\{\begin{array}{ll}1, & \left|{e}_{i}\right|\le \delta \\ \delta /\left|{e}_{i}\right|, & \left|{e}_{i}\right|>\delta \end{array}\right.$ (24)

($\delta$ is a parameter), as shown in Figure 1c.
Based on Equations (14) and (21), the objective function of MLE can be obtained as:

${\phi}_{\theta}\left({e}_{i}\right)=-\frac{1}{2\gamma}\left(\mathrm{ln}\left(1+\mathrm{exp}\left(\gamma \tau -\gamma {e}_{i}^{2}\right)\right)-\mathrm{ln}\left(1+\mathrm{exp}\left(\gamma \tau \right)\right)\right).$ (25)

From Equations (8) and (25), we can see that the probability distribution function ${g}_{\theta}\left({e}_{i}\right)$ has the form:

${g}_{\theta}\left({e}_{i}\right)\propto \mathrm{exp}\left(-{\phi}_{\theta}\left({e}_{i}\right)\right).$ (26)

If $\tau =0$ and $\gamma \to 0$, the probability distribution function ${g}_{\theta}\left({e}_{i}\right)$ is actually a Gaussian distribution:

${g}_{\theta}\left({e}_{i}\right)\propto \mathrm{exp}\left(-\frac{{e}_{i}^{2}}{4}\right).$ (27)

In this case, the weight defined in Equation (21) is ${w}_{i}=1/2$, which is the LS case.
In Figure 2a, we compare the MLE objective function with the LS loss function. The MLE objective function is controlled by the parameters $\gamma ,\tau$, and is truncated to a constant for large residuals (e.g., $\left|{e}_{i}\right|>2$). As a constant has no effect on the optimization model, the negative effect of noise (points with large residuals) is automatically diminished. In contrast, the LS loss function is global and increases quadratically with the residual. In the presence of heavy noise, the objective function of the LS model is dominated by the points with heavy noise. Figure 2b shows the influence functions [22,27] of MLE and LS. The influence function of a loss $\phi \left(e\right)$ is defined as $\psi \left(e\right)=\partial \phi \left(e\right)/\partial e$, which measures the robustness of the loss function as the error residual increases. For residuals ${e}_{i}>0$, the influence function of MLE first increases, then decreases, and finally approaches zero. This means that larger errors eventually have no effect on the MLE-based model. However, the influence function of LS grows linearly, so the LS loss function is seriously affected by noise. In the presence of noise, MLE is clearly more robust than LS.
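The contrast between the two influence functions can be reproduced numerically. Here we use the relation $\psi \left(e\right)={w}\left(e\right)\cdot e$, which follows from $\psi =\phi^{\prime}$ together with the weight definition; the logistic weight form is our reading of the weight function, and the parameter values are illustrative:

```python
import numpy as np

gamma, tau = 2.0, 1.0
e = np.linspace(0.0, 5.0, 501)

# MLE influence: psi(e) = w(e) * e with the logistic weight; rises, then decays to ~0.
w = 1.0 / (1.0 + np.exp(gamma * (e ** 2 - tau)))
psi_mle = w * e

# LS influence: psi(e) = 2e, growing without bound.
psi_ls = 2.0 * e
```

The MLE influence peaks at a moderate residual and then vanishes, while the LS influence keeps growing, which is the robustness gap described above.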
The procedure of the proposed MLE-NMF is shown in Algorithm 1.

Algorithm 1 MLE-NMF.
Input: hyperspectral matrix $\mathit{H}$, parameters $\gamma ,\tau$.
Initialization: endmember ${\mathit{Z}}_{0}$ and abundance ${\mathit{S}}_{0}$.
Output: estimated endmember and abundance matrices.
1. Initialize ${\mathit{Z}}^{\left(0\right)}={\mathit{Z}}_{0},{\mathit{S}}^{\left(0\right)}={\mathit{S}}_{0},v=1,\mathit{W}=\mathit{I}$.
2. Run the following steps until convergence:
(b) Calculate the weight of each entry;
(c) Compute the weighted matrices;
(d) Update the endmember matrix and the weighted abundance matrix;
(e) $v=v+1$.
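As an illustration, the loop of Algorithm 1 can be sketched in Python. The logistic weight form and the multiplicative updates reflect our reading of the weighted ${l}_{1/2}$-NMF derivation; all sizes, parameter values and numerical safeguards (`eps`, clipping) are our own choices:

```python
import numpy as np

def mle_nmf(H, P, lam=0.01, c=1.0, xi=0.8, iters=100, eps=1e-9):
    """Sketch of the MLE-NMF iteration: reweight bands, then update factors."""
    rng = np.random.default_rng(0)
    M, N = H.shape
    Z = rng.uniform(0.1, 1.0, (M, P))
    S = rng.uniform(0.1, 1.0, (P, N))
    for _ in range(iters):
        # Band residuals and their logistic weights (tau: percentile rule, gamma = c/tau).
        e2 = np.sum((H - Z @ S) ** 2, axis=1)
        tau = np.percentile(e2, 100 * xi)
        gamma = c / max(tau, eps)
        w = 1.0 / (1.0 + np.exp(np.clip(gamma * (e2 - tau), -50, 50)))
        # Weighted matrices.
        sw = np.sqrt(w)[:, None]
        Ht, Zt = sw * H, sw * Z
        # Multiplicative updates on the weighted l_{1/2}-NMF problem.
        Zt = Zt * (Ht @ S.T) / (Zt @ S @ S.T + eps)
        S = S * (Zt.T @ Ht) / (Zt.T @ Zt @ S
                               + 0.5 * lam * np.maximum(S, eps) ** -0.5 + eps)
        # Undo the band scaling to recover the endmembers.
        Z = Zt / np.maximum(sw, eps)
    return Z, S

# Tiny demo on clean synthetic data (sizes are arbitrary).
rng = np.random.default_rng(4)
Z_true = rng.uniform(0.1, 1.0, (30, 3))
S_true = rng.dirichlet(np.ones(3), size=80).T
H = Z_true @ S_true
Z_est, S_est = mle_nmf(H, P=3)
err = np.linalg.norm(H - Z_est @ S_est, "fro") / np.linalg.norm(H, "fro")
```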
Remark. The current method assumes that different bands are independent, so that an MLE solution can be deduced. The band-independence assumption is only used in the derivation of the MLE estimator. By means of this assumption, we finally obtain a weighted NMF model in which the weight function can be used to reduce the effect of noisy bands. Although hyperspectral bands are not independent of each other in practice, the final weighted NMF model (i.e., MLE-NMF) can still alleviate the negative effects of noise.
5. Discussion
As described in Section 4.2, $\tau$ is the $\left(100\xi \right)$th percentile of the residual vector $\tilde{\mathit{e}}={\left[{e}_{1}^{2},\dots ,{e}_{M}^{2}\right]}^{T}$, and $\gamma =c/\tau$, with $c\in \left(0,10\right]$ and $\xi \in \left(0,1\right]$. By tuning the parameters $c$ and $\xi$, the MLE objective function in Equation (25) can be truncated, as shown in Figure 9. Parameters $c$ and $\xi$ control the decreasing rate and the location of the truncation point, respectively. The larger the value of $c$, the greater the degree of truncation; the smaller the value of $\xi$, the earlier the truncation point occurs. As shown in Figure 9, when the noise or residual is large, it is better to choose a larger $c$ and a smaller $\xi$, which truncates the weight of larger residuals to a constant (see the red dotted line).
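The parameter rule described above ($\tau$ as the $\left(100\xi \right)$th percentile of the squared band residuals, $\gamma =c/\tau$) can be sketched as follows; the stand-in residuals are random and only for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
e2 = rng.exponential(scale=1.0, size=200)   # stand-in squared band residuals e_i^2

xi, c = 0.4, 10.0
tau = np.percentile(e2, 100 * xi)           # tau: (100*xi)-th percentile of e_i^2
gamma = c / tau                             # gamma = c / tau

# Larger xi moves the truncation point to larger residuals.
tau_hi = np.percentile(e2, 80)
frac_inliers = np.mean(e2 <= tau)           # roughly xi of the bands fall below tau
```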
We take the Urban data set as an example to show the effect of the parameters $c$ and $\xi$. Figure 10 shows the SAD results of MLE-NMF on the Urban data with 210 bands. The results in Figure 10a are obtained by fixing $\xi =0.4$ and varying $c$ in the set $\left\{0.1, 0.5, 1, 2, 5, 10\right\}$. When $\xi$ is fixed, larger $c$ values correspond to better unmixing results. As shown in Figure 9, $c$ affects the degree of truncation. If a large $c$ is chosen, the weight of large errors can be truncated to a constant (e.g., the objective function values are constant for errors larger than 1.5, shown as the red solid line in Figure 9). As their objective function values are constant, these errors have no influence on the model. For the Urban data with all 210 bands, MLE-NMF with a larger $c$ can effectively alleviate the effect of noisy bands. By fixing $c=10$ and varying $\xi$ in the set $\left\{0.1, 0.2, 0.4, 0.6, 0.8, 1\right\}$, Figure 10b shows the SAD of MLE-NMF versus the parameter $\xi$. It is better to set the parameter $\xi$ in the interval [0.4, 0.8] when $c$ is fixed. Parameter $\xi$ determines the ratio of inliers. As the data contain noisy bands, the value of $\xi$ should be less than 1.
When the known noisy bands of the Urban data are removed, the experimental results on the Urban data with 162 bands are obtained by fixing $\xi =0.8$ and $c=1$, respectively. The results are shown in Figure 11. From Figure 11a, we can see that the proposed MLE-NMF is not sensitive to the parameter $c$, because different $c$ values generate similar results for small errors in the case of low-noise or noise-free data, as shown in Figure 9. From Figure 11b, the best result is achieved at $\xi =1$, which means that the data points are almost all inliers.
The above analysis recommends setting the parameter $\xi$ in the interval [0.4, 0.8]. For data with heavy noise, $\xi$ can be set to a small value, such as $\xi =0.4$. Parameter $c$ should be chosen in the interval [1, 10]. For data with heavy noise, one can set $c=10$; otherwise, a moderate value of $c=1$ is recommended.