We provide the framework for our proposed continuous semi-supervised nonnegative matrix factorization method (CSSNMF).
3.1. Formulation
We consider having a document-term matrix $X \in \mathbb{R}^{n \times m}_{\geq 0}$ [6] for $n$ documents with their associated word frequencies in the $m$ columns—a “bag of words” where each document is represented only by the frequencies of its words. Each document has a corresponding value in $\mathbb{R}$ so that we can associate with $X$ the vector $Y \in \mathbb{R}^{n}$. Put another way: each document is represented as a row of $X$, call it $x \in \mathbb{R}^{m}_{\geq 0}$, which stores the frequencies of each of the $m$ words within the corpus; then, to each such $x$ there is an observation $y \in \mathbb{R}$ (and over $n$ documents, this generates $Y$). We choose $r \in \mathbb{N}$ and $\lambda \geq 0$ as hyper-parameters, where $r$ denotes the number of topics and $\lambda$ is the weight put on the regression error.
In the real data that we look at, each row of $X$ will represent the reviews written for one university instructor, with frequencies of the words in the columns. For each instructor in the dataset, the mean value of their respective student ratings will be a single component of $Y$. From a predictive standpoint, we would like to predict the mean rating of an instructor based only on the words in their reviews, i.e., take a vector $x$ of the word frequencies and make a prediction $\hat{y}$ of their mean rating (the hat indicates a prediction). We want the prediction $\hat{y}$ to be as close to the true mean rating $y$ as possible. The topic modelling aspect of this is that instead of using the full $x$ vector of dimension $m$, we approximate $x$ as a nonnegative linear combination of $r$ topic vectors (interpretable vectors of word frequencies). We effectively compress $x$ to a vector $w$ of dimension $r$, and we model the rating as a linear combination of the components of $w$.
Given $X$, $Y$, $r$, and $\lambda$, we define a penalty function that combines topic modelling with a linear regression based on the topic representations. The intuition with the weighting is that as the weight $\lambda$ increases, topic modelling is still done, but more and more emphasis is put on producing an accurate regression on $Y$. We define
$$F_{\lambda}(W, H, \theta) = \| X - WH \|_F^2 + \lambda \, \| \hat{Y} - Y \|_2^2, \tag{2}$$
where $\hat{Y}$ is given by
$$\hat{Y} = [\mathbf{1} \,|\, W]\, \theta .$$
The matrix $[\mathbf{1} \,|\, W]$ with its column of 1s allows for an intercept: given a topic representation $w$, we predict a value
$$\hat{y} = \theta_0 + \sum_{k=1}^{r} \theta_k w_k .$$
We also impose a normalization constraint, that each row of $H$ sums to 1, so that the topics have unit length in $\ell^1$.
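For concreteness, the following is a minimal NumPy sketch of the penalty $F_{\lambda}$ as defined above; the function name `cssnmf_objective` and the argument layout are illustrative rather than taken from our released implementation.

```python
import numpy as np

def cssnmf_objective(X, Y, W, H, theta, lam):
    """Penalty combining NMF reconstruction error with a weighted regression error.

    X : (n, m) nonnegative document-term matrix
    Y : (n,) response vector
    W : (n, r) nonnegative topic weights; H : (r, m) nonnegative topics
    theta : (r + 1,) regression coefficients (theta[0] is the intercept)
    lam : nonnegative regression weight (lambda)
    """
    nmf_error = np.linalg.norm(X - W @ H, ord="fro") ** 2
    W_tilde = np.hstack([np.ones((W.shape[0], 1)), W])  # column of 1s for the intercept
    Y_hat = W_tilde @ theta
    regression_error = np.linalg.norm(Y_hat - Y) ** 2
    return nmf_error + lam * regression_error
```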
Remark 1 (Sum of Topic Representations). If $H$ is normalized so its rows sum to 1, then it is also the case that $\sum_{k=1}^{r} w_k = \sum_{j=1}^{m} x_j$ whenever $x = wH$, by noting that $x_j = \sum_{k=1}^{r} w_k H_{kj}$ and summing over $j$.
When $\lambda = 0$, $Y$ has no effect upon $W$ and $H$, and we first perform regular NMF over $W$ and $H$ and, as a final step, we choose $\theta$ to minimize the regression error. In other words, if $\lambda = 0$, we do NMF first and then find the best $\theta$ given the already determined weights for each document. It seems intuitive, however, that the regression could be improved if $H$ and $W$ both were being influenced by the regression to $Y$, which is what our method aims to do when $\lambda > 0$. From a practical perspective, if $\lambda \to \infty$, then the regression error becomes dominant and we may expect the topics as found in $H$ to be less meaningful. In Section 3.2, we state some theoretical properties of our method as it is being trained.
Once $H$ and $\theta$ are known, we can make predictions for the response variable corresponding to a document. This amounts to finding the best nonnegative topic encoding $w$ for the document and using that encoding in the linear model—see Section 3.3.
Remark 2 (Uniqueness). Using our established notation, we remark that if $X = WH$ and $\hat{Y} = [\mathbf{1} \,|\, W]\,\theta$, then $X = W'H'$ and $\hat{Y} = [\mathbf{1} \,|\, W']\,\theta'$, where $W' = WS$, $H' = S^{-1}H$, and $\theta' = (\theta_0, (S^{-1}\tilde{\theta})^T)^T$ with $\tilde{\theta} = (\theta_1, \dots, \theta_r)^T$, for any invertible $S$ with $S$ and $S^{-1}$ both having all entries nonnegative. Thus, an optimum, if it exists, can only be unique up to such matrix multiplications.
3.2. Theoretical Results
We present two important behaviors of CSSNMF with regard to increasing $\lambda$ and its effect upon predicting the response variable.
Proposition 1 (Regression Error with Nonzero $\lambda$). For $\lambda > 0$, let $(W_\lambda, H_\lambda, \theta_\lambda)$ be a unique (as per Remark 2) global minimum to Equations (6)–(9), with predictions $\hat{Y}_\lambda = [\mathbf{1} \,|\, W_\lambda]\,\theta_\lambda$. Then $\|\hat{Y}_\lambda - Y\|_2^2 \leq \|\hat{Y}_0 - Y\|_2^2$.
Theorem 1 (Weakly Decreasing Regression Error). Let $0 \leq \lambda_1 < \lambda_2$ be given, where $(W_{\lambda_i}, H_{\lambda_i}, \theta_{\lambda_i})$ are the unique (as per Remark 2) global minimizers of Equations (6)–(9) for $\lambda = \lambda_i$, $i = 1, 2$. Then
$$\|\hat{Y}_{\lambda_2} - Y\|_2^2 \leq \|\hat{Y}_{\lambda_1} - Y\|_2^2 .$$
Remark 3. Proposition 1 and Theorem 1 are based on obtaining a global minimum. In practice, we may only find a local minimum.
Proposition 1 and Theorem 1 are statements pertaining to training the model. Assuming we have the optimal solutions, Proposition 1 tells us that the regression error with $\lambda > 0$ is no worse than the regression error with $\lambda = 0$ and could, in fact, be better. Thus, the intuition that topics should be selected while paying attention to the regression error is sound in practice. Then, Theorem 1 says that the regression error is weakly monotonically decreasing as $\lambda$ increases. In practical application, we find the error strictly monotonically decreases.
Before proceeding to algorithmic procedures, we prove Proposition 1 and Theorem 1.
Proof of Proposition 1. By definition of the minimizers,
$$\|X - W_\lambda H_\lambda\|_F^2 + \lambda \|\hat{Y}_\lambda - Y\|_2^2 \leq \|X - W_0 H_0\|_F^2 + \lambda \|\hat{Y}_0 - Y\|_2^2$$
and
$$\|X - W_0 H_0\|_F^2 \leq \|X - W_\lambda H_\lambda\|_F^2 .$$
The first inequality comes from how $(W_\lambda, H_\lambda, \theta_\lambda)$ are defined by Equation (6). The final inequality comes from how $(W_0, H_0)$ are defined as minimizers in Equation (7). Combining the two,
$$\lambda \|\hat{Y}_\lambda - Y\|_2^2 \leq \lambda \|\hat{Y}_0 - Y\|_2^2 .$$
Since we first assumed $\lambda > 0$, we obtain $\|\hat{Y}_\lambda - Y\|_2^2 \leq \|\hat{Y}_0 - Y\|_2^2$ upon dividing by $\lambda$. Finally, if $\|X - W_0 H_0\|_F^2 = \|X - W_\lambda H_\lambda\|_F^2$, then there is equality with $\|\hat{Y}_\lambda - Y\|_2^2 = \|\hat{Y}_0 - Y\|_2^2$. □
Proof of Theorem 1. Note that if $\lambda_1 = 0$, then Proposition 1 already applies, so we assume $\lambda_1 > 0$. We have that
$$\|X - W_{\lambda_1} H_{\lambda_1}\|_F^2 + \lambda_1 \|\hat{Y}_{\lambda_1} - Y\|_2^2 \leq \|X - W_{\lambda_2} H_{\lambda_2}\|_F^2 + \lambda_1 \|\hat{Y}_{\lambda_2} - Y\|_2^2 \tag{10}$$
$$\|X - W_{\lambda_2} H_{\lambda_2}\|_F^2 + \lambda_2 \|\hat{Y}_{\lambda_2} - Y\|_2^2 \leq \|X - W_{\lambda_1} H_{\lambda_1}\|_F^2 + \lambda_2 \|\hat{Y}_{\lambda_1} - Y\|_2^2 \tag{11}$$
Adding Equations (10) and (11) together,
$$(\lambda_2 - \lambda_1) \, \|\hat{Y}_{\lambda_2} - Y\|_2^2 \leq (\lambda_2 - \lambda_1) \, \|\hat{Y}_{\lambda_1} - Y\|_2^2 ,$$
which, upon dividing by $\lambda_2 - \lambda_1 > 0$, directly gives
$$\|\hat{Y}_{\lambda_2} - Y\|_2^2 \leq \|\hat{Y}_{\lambda_1} - Y\|_2^2 . \qquad \square$$
3.3. Algorithm
Our minimization approach is iterative and based on the alternating nonnegative least squares [16] approach. Due to the coupling of NMF and regression errors, other approaches such as multiplicative or additive updates [17] are less natural. Each iteration consists of: (1) holding $H$ and $\theta$ fixed while optimizing each row of $W$ separately (nonnegative least squares); (2) holding $W$ and $\theta$ fixed while optimizing each column of $H$ separately (nonnegative least squares); and finally (3) holding $W$ and $H$ fixed while optimizing over $\theta$. The error is nonincreasing between iterations and from one optimization to the next. We now derive and justify this approach (Algorithm 1) in increasing complexity of cases.
Algorithm 1: Overall CSSNMF algorithm.
| Input: A matrix $X \in \mathbb{R}^{n \times m}_{\geq 0}$,
|   a vector $Y \in \mathbb{R}^{n}$,
|   a positive integer $r$,
|   a scalar $\lambda \geq 0$,
|   a relative error tolerance $\epsilon$, and
|   a maximum number of iterations $N$.
| Output: Minimizers of Equations (6)–(9): nonnegative matrix $W \in \mathbb{R}^{n \times r}_{\geq 0}$,
|   nonnegative matrix $H \in \mathbb{R}^{r \times m}_{\geq 0}$, and
|   vector $\theta \in \mathbb{R}^{r+1}$
1 | $E \leftarrow \infty$, $\delta \leftarrow \infty$
2 | Elementwise, $W \sim \mathcal{U}(0,1)$, $H \sim \mathcal{U}(0,1)$, $\theta \sim \mathcal{U}(0,1)$
3 | $i \leftarrow 0$
4 | while $\delta > \epsilon$ and $i < N$ do
5 |   Update $W$ as per Algorithm 2
6 |   Update $H$ as per Algorithm 3
7 |   Update $\theta$ as per Algorithm 4
8 |   Normalize $W$, $H$, and $\theta$ as per Algorithm 5
9 |   $E' \leftarrow F_{\lambda}(W, H, \theta)$
10 |  if $E < \infty$ then
11 |    $\delta \leftarrow |E' - E| / E$
12 |  end if
13 |  $E \leftarrow E'$
14 |  $i \leftarrow i + 1$
15 | end while
16 | return $W$, $H$, and $\theta$
Algorithm 2: Updating W.
| Input: Matrices $X$ and $H$, a vector $Y$, a vector $\theta$, and a scalar $\lambda \geq 0$.
| Output: A new value for $W$.
1 | $A \leftarrow [H^T ;\; \sqrt{\lambda}\,\tilde{\theta}^T]$, where $\tilde{\theta} = (\theta_1, \dots, \theta_r)^T$
2 | for $i = 1, \dots, n$ do
3 |   $W_{i,:} \leftarrow \operatorname{arg\,min}_{w \geq 0} \| A w - b_i \|_2^2$ by nonnegative least squares, with $b_i = [x_i^T ;\; \sqrt{\lambda}(y_i - \theta_0)]$
4 | end for
5 | return $W$
Algorithm 3: Updating H.
| Input: Matrices $X$ and $W$.
| Output: A new value for $H$.
1 | for $j = 1, \dots, m$ do
2 |   $H_{:,j} \leftarrow \operatorname{arg\,min}_{h \geq 0} \| W h - X_{:,j} \|_2^2$ by nonnegative least squares
3 | end for
4 | return $H$
Algorithm 4: Updating $\theta$.
| Input: A vector $Y \in \mathbb{R}^{n}$, and
|   a matrix $W \in \mathbb{R}^{n \times r}_{\geq 0}$
| Output: A new value for $\theta$.
1 | $\tilde{W} \leftarrow [\mathbf{1} \,|\, W]$
2 | $\theta \leftarrow \tilde{W}^{+} Y$
3 | return $\theta$
Algorithm 5: Normalization process.
| Input: A matrix $W \in \mathbb{R}^{n \times r}_{\geq 0}$,
|   a matrix $H \in \mathbb{R}^{r \times m}_{\geq 0}$, and
|   a vector $\theta \in \mathbb{R}^{r+1}$
| Output: New values for $W$, $H$, and $\theta$.
1 | $s \leftarrow$ a vector of row sums of $H$.
2 | $D \leftarrow \operatorname{diag}(s)$
3 | $H \leftarrow D^{-1} H$.
4 | $W \leftarrow W D$.
5 | $(\theta_1, \dots, \theta_r)^T \leftarrow D^{-1} (\theta_1, \dots, \theta_r)^T$, leaving $\theta_0$ unchanged.
6 | return $W$, $H$, and $\theta$.
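A minimal NumPy sketch of this normalization step follows; the function name `normalize` is illustrative. Note that the rescaling leaves both the product $WH$ and the predictions $\hat{Y}$ unchanged, in line with Remark 2.

```python
import numpy as np

def normalize(W, H, theta):
    """Rescale so each row of H sums to 1, preserving W @ H and the predictions."""
    s = H.sum(axis=1)            # row sums of H (the diagonal of D)
    H = H / s[:, None]           # H <- D^{-1} H
    W = W * s[None, :]           # W <- W D
    theta = theta.copy()
    theta[1:] = theta[1:] / s    # rescale slopes; the intercept theta[0] is unchanged
    return W, H, theta
```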
If $W$ and $H$ are given and only $\theta$ can vary, then Equation (2) is minimized when $\|\tilde{W}\theta - Y\|_2^2$ is minimized, where $\tilde{W} = [\mathbf{1} \,|\, W]$. If $\tilde{W}$ is overdetermined with full rank, this happens when the error, $\tilde{W}\theta - Y$, is orthogonal to the column span of $\tilde{W}$, or that
$$\theta = (\tilde{W}^T \tilde{W})^{-1} \tilde{W}^T Y .$$
When $\tilde{W}$ is underdetermined or does not have full rank, we require $\|\tilde{W}\theta - Y\|_2$ to be minimized with $\|\theta\|_2$ minimized (for uniqueness), whereby
$$\theta = \tilde{W}^{+} Y .$$
See Algorithm 4. Thus, when $\tilde{W}$ does not have full rank, the solution is $\theta = \tilde{W}^{+} Y$, where $\tilde{W}^{+}$ is the pseudoinverse of $\tilde{W}$. See [18] for a discussion of pseudoinverses. In applications, we only expect to see overdetermined systems because the number of topics $r$ should be less than the number of documents $n$.
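In code, this update is a single unconstrained least squares solve. A minimal NumPy sketch follows (the function name `update_theta` is illustrative); `np.linalg.lstsq` returns the minimum-norm solution in the rank-deficient case, matching the pseudoinverse discussion above.

```python
import numpy as np

def update_theta(W, Y):
    """Solve min_theta ||[1 | W] theta - Y||_2; minimum-norm if rank deficient."""
    W_tilde = np.hstack([np.ones((W.shape[0], 1)), W])  # prepend the intercept column
    theta, *_ = np.linalg.lstsq(W_tilde, Y, rcond=None)
    return theta
```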
If $W$ and $\theta$ are given and only $H$ can change, then minimizing Equation (2) requires minimizing $\|X - WH\|_F^2$. We can expand this error term out in the columns of $H$:
$$\|X - WH\|_F^2 = \sum_{j=1}^{m} \|X_{:,j} - W H_{:,j}\|_2^2 .$$
Because columnwise the terms of the sum are independent, we can minimize each column $H_{:,j}$ of $H$ separately to minimize the sum, i.e.,
$$H_{:,j} = \operatorname{arg\,min}_{h \geq 0} \|W h - X_{:,j}\|_2^2 ,$$
as given in Algorithm 3.
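A minimal sketch of this columnwise update using SciPy's nonnegative least squares routine follows; the function name `update_H` is illustrative.

```python
import numpy as np
from scipy.optimize import nnls

def update_H(X, W, H):
    """Minimize ||X - W H||_F^2 over H >= 0, one column at a time."""
    for j in range(X.shape[1]):
        H[:, j], _ = nnls(W, X[:, j])  # each column is an independent NNLS problem
    return H
```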
When $H$ and $\theta$ are fixed, then Equation (2) can be written out as
$$\sum_{i=1}^{n} \left( \|x_i - w_i H\|_2^2 + \lambda \left( \theta_0 + w_i \tilde{\theta} - y_i \right)^2 \right), \tag{15}$$
where $w_i$ denotes row $i$ of $W$ and $\tilde{\theta} = (\theta_1, \dots, \theta_r)^T$. Defining matrices
$$A = \begin{bmatrix} H^T \\ \sqrt{\lambda}\, \tilde{\theta}^T \end{bmatrix}, \qquad b_i = \begin{bmatrix} x_i^T \\ \sqrt{\lambda}\,(y_i - \theta_0) \end{bmatrix},$$
we can rewrite Equation (15) as
$$\sum_{i=1}^{n} \|A w_i^T - b_i\|_2^2 ,$$
which can be minimized through
$$w_i^T = \operatorname{arg\,min}_{w \geq 0} \|A w - b_i\|_2^2 , \qquad i = 1, \dots, n .$$
This is precisely Algorithm 2.
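Correspondingly, a minimal sketch of the rowwise $W$ update, stacking the NMF and regression terms into the matrix $A$ and vectors $b_i$ as above (the function name `update_W` is illustrative):

```python
import numpy as np
from scipy.optimize import nnls

def update_W(X, Y, W, H, theta, lam):
    """Minimize the coupled objective over W >= 0, one row at a time."""
    sqrt_lam = np.sqrt(lam)
    A = np.vstack([H.T, sqrt_lam * theta[1:][None, :]])  # (m + 1, r) stacked system
    for i in range(X.shape[0]):
        b = np.concatenate([X[i, :], [sqrt_lam * (Y[i] - theta[0])]])
        W[i, :], _ = nnls(A, b)  # nonnegative least squares for row i
    return W
```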
For our optimizations and linear algebra, we used NumPy [19] and SciPy [20]. The nonnegative least squares routine employs an active set method to solve the least squares minimization problem with inequality constraints [21]. The active set method amounts to gradually building up the set of active constraints (those that, if not enforced, would be violated or met with equality in the unconstrained problem, i.e., a regression variable with a nonnegativity constraint coming out less than or equal to 0) and then, once the active set is identified, optimizing over the passive set (all variables not in the active set) with equality constraints imposed on the active set [22]. This method can also be parallelized [23].
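As a small illustration of this behavior with SciPy's `nnls`: when a coefficient of the unconstrained least squares solution is negative, its nonnegativity constraint becomes active and the coefficient is pinned at 0.

```python
import numpy as np
from scipy.optimize import nnls

# Fit b ~ c0 + c1 * t at t = 1, 2, 3; the data lie exactly on the line b = 4 - t.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([3.0, 2.0, 1.0])

unconstrained, *_ = np.linalg.lstsq(A, b, rcond=None)
constrained, _ = nnls(A, b)
print(unconstrained)  # approx. [4., -1.]: the slope is negative
print(constrained)    # approx. [2., 0.]: the slope's constraint is active, pinned at 0
```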
In addition to the steps outlined within these algorithms, we employed two additional modifications: (1) we defined a small positive threshold, and any entries in $H$ less than this threshold were replaced by it (otherwise, on some occasions, the $W$ update step would fail); and (2) the minimizations at times yielded worse objective errors than already obtained and, when this happened, we did not update to the worse value.
As noted with other NMF routines, we might not reach a global minimizer [24]. In practice, the minimization should be run repeatedly with different random initializations to find a better local minimum.
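Putting the pieces together, the following is a minimal driver sketch of Algorithm 1 with random restarts. It assumes the `cssnmf_objective`, `update_W`, `update_H`, `update_theta`, and `normalize` sketches from earlier are in scope; the function name `fit_cssnmf` and its default parameter values are illustrative, and for brevity it omits the two safeguards described above (thresholding small entries of $H$ and rejecting updates that worsen the objective).

```python
import numpy as np

def fit_cssnmf(X, Y, r, lam, tol=1e-6, max_iter=500, n_restarts=5, seed=0):
    """Alternating nonnegative least squares with random restarts; keeps the best run."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    best = None
    for _ in range(n_restarts):
        # Elementwise uniform random initialization, as in Algorithm 1
        W, H, theta = rng.random((n, r)), rng.random((r, m)), rng.random(r + 1)
        err = np.inf
        for _ in range(max_iter):
            W = update_W(X, Y, W, H, theta, lam)        # Algorithm 2
            H = update_H(X, W, H)                       # Algorithm 3
            theta = update_theta(W, Y)                  # Algorithm 4
            W, H, theta = normalize(W, H, theta)        # Algorithm 5
            new_err = cssnmf_objective(X, Y, W, H, theta, lam)
            # Stop once the relative change in the objective falls below tol
            if np.isfinite(err) and abs(new_err - err) / err < tol:
                err = new_err
                break
            err = new_err
        if best is None or err < best[0]:
            best = (err, W, H, theta)
    return best[1], best[2], best[3]
```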
From an application standpoint, we wish to run the model on documents it has not been trained on. Algorithm 6 stipulates how a prediction takes place. We first find the best nonnegative decomposition of the document, a vector in $\mathbb{R}^{m}_{\geq 0}$, into the topic basis, projecting to $r$ dimensions. With the representation in topic-coordinates, we then use the linear model.
Algorithm 6: Prediction process.
| Input: A matrix $H \in \mathbb{R}^{r \times m}_{\geq 0}$,
|   a vector $\theta \in \mathbb{R}^{r+1}$,
|   and a vector $x \in \mathbb{R}^{m}_{\geq 0}$.
| Output: Model prediction for response variable, $\hat{y}$.
1 | Compute $w = \operatorname{arg\,min}_{v \geq 0} \| H^T v - x^T \|_2$.
2 | Compute $\hat{y} = \theta_0 + \sum_{k=1}^{r} \theta_k w_k$.
3 | return $\hat{y}$
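A minimal sketch of this prediction process (the function name `predict` is illustrative):

```python
import numpy as np
from scipy.optimize import nnls

def predict(H, theta, x):
    """Predict the response for a new document x (length-m word frequency vector)."""
    w, _ = nnls(H.T, x)               # best nonnegative topic encoding of x
    return theta[0] + theta[1:] @ w   # linear model with intercept
```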
An implementation of our algorithm can be found on our
BitBucket repository, accessed on 21 February 2023.