1. Introduction
The use of side information in a supervised machine learning (ML) setting, proposed in [1], was referred to as privileged information, and was further developed in [2], who extended the paradigm to unsupervised learning. As illustrated in [3], privileged information refers to data which belong neither to the input space nor to the output space but are still beneficial for learning. In general, side information may, for example, be knowledge given by a domain expert.
In our work, side information comes separately from the training and test data, but unlike the approaches described above, it can be used by the ML algorithm not only during training but also during testing. In particular, we concentrate on supervised neural networks (NNs) and do not constrain the side information to be fed into the input layer of the network, which constitutes a nontrivial extension of the paradigm. To distinguish between the two approaches, we refer to the additional information in the previous approaches, which is visible only during the training phase, as privileged information, and to the information in our approach, which is visible during both the training and test phases, simply as side information. In our model, side information is fed into the NN from a knowledge base, which inspects the current input data and searches its memory for relevant side information. Two examples of the potential use of side information in ML are the use of azimuth angles as side information for images of 3D chairs [4], and the use of dictionaries as side information for extracting medical concepts from social media [5].
Here, we present an architecture and formalism for side information, and also a proof of concept of our model on a couple of datasets to demonstrate how it may be useful in the sense of improving the performance of an NN. We note that side information is not necessarily useful in all cases, and, in fact, it could be malicious, causing the performance of an NN to decrease.
Our proposal for incorporating side information is relatively simple when side information is available, and, within the proposed formalism, it is straightforward to measure its contribution to learning. It is also general in terms of the choice of neural network architecture, and able to cater for different formats of side information.
To reiterate, the main research question we address here is: when side information comes separately from the training and test data, what is the effect on the ML algorithm of making use of such side information, not only during training but also during testing? The rest of the paper is organised as follows: In Section 2, we review some related work, apart from the privileged information already mentioned above. In Section 3, we present a neural network architecture that includes side information, while in Section 4 we present our formalism for side information based on the difference between the loss of the NN without and with side information, where side information is deemed to be useful if the loss decreases. In Section 5, we present the materials and methods used for implementing the proof of concept for side information based on two datasets, while in Section 6 we present the results and discuss them. Finally, in Section 7, we present our concluding remarks.
2. Related Work
We now briefly mention some recent related research. In [6], a language model is augmented with a very large external memory, and uses a nearest-neighbour retrieval model to search for similar text in this memory to enhance the training data. The retrieved information can be used during the training or fine-tuning of neural language models. This work focuses on enhancing language models, and, as such, does not address the general problem of utilising side information.
In a broad sense, our proposal fits into the informed ML paradigm with prior information, where prior knowledge is independent from the input data, has a formal representation, and is integrated into the ML algorithm [7]. Moreover, human knowledge may be input into an ML algorithm in different formats, through examples, which may be quantitative or qualitative [8]. In addition, there are different ways of incorporating the knowledge into an ML algorithm, say an NN, for example, by modifying the input, the loss function, and/or the network architecture [9]. Another trend is that of incorporating physical knowledge into deep learning [10], and more specifically encoding differential equations in NNs, leading to physics-informed NNs [11].
Our method of using probabilistic facts, described in Section 3, is a straightforward mechanism for extracting the information that is needed from the knowledge base in order to integrate it into the neural network architecture in a flexible manner. Moreover, our proposal allows feeding side information into any layer of the NN, not just into the input layer, which is, to the best of our knowledge, novel.
Somewhat related is also neuro-symbolic artificial intelligence (AI), where neural and symbolic reasoning components are tightly integrated via neuro-symbolic rules; see [12] for a comprehensive survey of the field, and [13], which looks at combining ML with semantic web components. While NNs are able to learn nonlinear relationships from complex data, symbolic systems can provide a logical reasoning capability and a structured representation of expert knowledge [14]. Thus, in a broad sense, the symbolic component may provide an interface for feeding side information into an NN. In addition, a neuro-symbolic framework was introduced in [15] that supports querying, learning, and reasoning about data and knowledge.
We point out that, as such, our formalism for incorporating side information into an NN does not attempt to couple symbolic knowledge reasoning with an NN, as does neuro-symbolic AI [12,15]. The knowledge base, defined in Section 3, acts as a repository, which we find convenient to describe in terms of a collection of facts, providing a link, albeit weak, to neuro-symbolic AI. For example, in Section 5, in the context of the well-known MNIST dataset [16], we use expert knowledge on whether an image of a digit is peculiar as side information for the dataset, and inject this information directly into the NN, noting that we do not insist that side information exist for all data items.
3. A General Method for Incorporating Side Information
To simplify the model under discussion, let us assume a generic neural network (NN) with input and output layers and some hidden layers in between. We use the following terminology: ${x}_{t}$ is the input to the NN at time step $t$, ${y}_{t}$ is the true target value (or simply the target value), ${\widehat{y}}_{t}$ is the estimated target value (or the NN's output value), and ${s}_{t}$ is the side information, defined more precisely below.
The side information comes from a knowledge base, $\mathcal{KB}$, containing both facts (probabilistic statements) and rules (if–then statements). To start off, herein, we will restrict ourselves to side information in the form of facts represented as triples, $(p,z,c)$, with $p$ being the probability that the instance $z$ is a member of class $c$, and where a class may be viewed as a collection of facts. In other words, the triple $(p,z,c)$ can be written as $p=P(z \mid c)$, which is the likelihood of $z$ given $c$. We note that facts may not necessarily be atomic units, and that side information such as ${s}_{t}$ is generally viewed as a conjunction of one or more facts.
Let us examine some examples of side information facts, demonstrating that the knowledge base $\mathcal{KB}$ can, in principle, contain very general information:
 (i)
$(p, headache, symptom)$ is the probability of a headache being a $symptom$. Here, $headache$ is a textual instance and the likelihood is $P(headache \mid symptom)$.
 (ii)
$(p, image_{i}, green)$ is the probability that the background of $image_{i}$ is $green$; in this case, the probability could denote the proportion of the background of the image. Here, $image_{i}$ is an image instance and the likelihood is $P(image_{i} \mid green)$.
 (iii)
$(p, ecg_{i}, irregular)$ is the probability that $ecg_{i}$ is $irregular$. Here, $ecg_{i}$ is a segment of an ECG time series instance and the likelihood is $P(ecg_{i} \mid irregular)$.
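To make the $(p,z,c)$ representation concrete, here is a minimal Python sketch of a knowledge base of probabilistic facts; the `Fact` and `KnowledgeBase` names and the query interface are our own illustrative assumptions, not part of the formalism:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """A probabilistic fact (p, z, c), where p = P(z | c)."""
    p: float   # probability that instance z is a member of class c
    z: str     # instance identifier (a text token, image id, ECG segment id, ...)
    c: str     # class name

class KnowledgeBase:
    """A toy repository of facts, queryable by instance with optional class constraints."""
    def __init__(self, facts):
        self.facts = list(facts)

    def query(self, z, constraints=None):
        # Return all facts about instance z; if a set of classes is given as
        # constraints C, restrict the search to those classes (KB_C).
        return [f for f in self.facts
                if f.z == z and (not constraints or f.c in constraints)]

kb = KnowledgeBase([
    Fact(0.9, "headache", "symptom"),
    Fact(0.4, "image_1", "green"),
    Fact(0.7, "ecg_1", "irregular"),
])
```

For instance, `kb.query("image_1", {"green"})` returns the single fact $(0.4, image_1, green)$, while an empty constraint set leaves the knowledge base unrestricted.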
The architecture of the neural network incorporating side information is shown in Figure 1. We assume that the NN operates in discrete time steps, and that at the current time step, $t$, instance ${x}_{t}$ is fed into the network and ${\widehat{y}}_{t}$ is its output. As stated above, we will concentrate on a knowledge base of facts, and detail what preprocessing may occur to incorporate facts into the network as side information. It is important that the preprocessing step at time $t$ can only access the side information and the input presented to the network at that time, i.e., it cannot look into the past or the future. It is, of course, sensible to assume that one cannot access inputs that will be presented to the network in the future, at times greater than $t$; regarding the past, at times less than $t$, the reasoning is that this information is already encoded in the network. Moreover, the learner may not be allowed to look into the past due to privacy considerations. On a high level, this is similar to continual learning (CL) frameworks, where a CL model is updated at each time step using only the new data and the old model, without access to data from previous tasks due to speed, security, and privacy constraints [17].
As noted in the caption of Figure 1, side information may be connected to the rest of the network in several ways through additional nodes. The decision of whether, for example, to connect the side information to the first or second hidden layer and/or also to the output node can be made on the basis of experimental results. We emphasise that, in our model, side information is available in both the training and test phases.
Let ${x}_{t}$ be the input instance to the network at time step $t$ and ${s}_{t}=\mathcal{KB}({x}_{t},\mathcal{C})$ represent the side information at time $t$, where $\mathcal{C}$ is a set of constraints, which may be empty. To clarify how side information is incorporated into an NN, we give, in Algorithm 1, high-level pseudocode corresponding to the architecture for side information, as shown in Figure 1.
Algorithm 1 Training or testing with side information
 1: ${x}_{t}$ is the input to the network at time step $t$ during the train or test phase
 2: $\lambda \leftarrow$ layer of the network that the side information is fed into
 3: $\mathcal{C} \leftarrow$ the set of constraints on the knowledge base
 4: for time step $t=1,2,\dots ,n$ do
 5:   ${x}_{t}$ is fed as input to the network
 6:   ${s}_{t} \leftarrow \mathcal{KB}({x}_{t},\mathcal{C})$
 7:   an encoding of ${s}_{t}$ is fed into the network at layer $\lambda$
 8:   ${\widehat{y}}_{t} \leftarrow$ output of the network
 9: end for
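Algorithm 1 can be sketched in Python with NumPy as follows; the network dimensions, the ReLU/softmax choices, and the stub `kb_query` returning a fixed vector of fact probabilities are all illustrative assumptions (injection is shown at the first hidden layer, i.e., $\lambda = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Illustrative dimensions: input 4, hidden layers 8 and 6, m = 2 side values, 3 classes.
d, h1, h2, m, n_classes = 4, 8, 6, 2, 3
W1 = rng.normal(size=(h1, d));  b1 = np.zeros(h1)
Ws = rng.normal(size=(h1, m))              # extra weights for the side information
W2 = rng.normal(size=(h2, h1)); b2 = np.zeros(h2)
W3 = rng.normal(size=(n_classes, h2)); b3 = np.zeros(n_classes)

def kb_query(x_t, constraints=None):
    # Stand-in for KB_C(x_t): return the m probabilities p_tj of the matched facts.
    return np.array([0.9, 0.1])

def forward(x_t):
    s_t = kb_query(x_t)                    # step 6: s_t <- KB(x_t, C)
    a1 = relu(W1 @ x_t + b1 + Ws @ s_t)    # step 7: encoding of s_t enters at layer lambda = 1
    a2 = relu(W2 @ a1 + b2)
    return softmax(W3 @ a2 + b3)           # step 8: y_hat_t, the network's output

y_hat = forward(rng.normal(size=d))
```

Injecting at a different layer $\lambda$ simply moves the `Ws @ s_t` term to that layer's pre-activation.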
We stipulate that $\mathcal{KB}({x}_{t},\mathcal{C})={\mathcal{KB}}_{\mathcal{C}}({x}_{t})$, i.e., we can either first compute the result of applying $\mathcal{KB}$ to ${x}_{t}$, resulting in $\mathcal{KB}({x}_{t})$, and then constrain $\mathcal{KB}({x}_{t})$ to $\mathcal{C}$, or, alternatively, we can first constrain $\mathcal{KB}$ to $\mathcal{C}$, resulting in ${\mathcal{KB}}_{\mathcal{C}}$, and then apply ${\mathcal{KB}}_{\mathcal{C}}$ to ${x}_{t}$. For this reason, from now on, we prefer to write ${\mathcal{KB}}_{\mathcal{C}}({x}_{t})$ instead of $\mathcal{KB}({x}_{t},\mathcal{C})$, noting that when $\mathcal{C}$ is empty, i.e., $\mathcal{C}=\emptyset$, then ${\mathcal{KB}}_{\mathcal{C}}={\mathcal{KB}}_{\emptyset}=\mathcal{KB}$. We leave the internal working of ${\mathcal{KB}}_{\mathcal{C}}({x}_{t})$ unspecified, although we note that it is important that the computational cost of ${\mathcal{KB}}_{\mathcal{C}}({x}_{t})$ is low.
The set of constraints $\mathcal{C}$ may be application-dependent, for example, on whether ${x}_{t}$ is text- or image-based. Moreover, for a particular dataset, we may constrain knowledge-base queries to a small number of one or more classes. To give a more concrete example, if ${x}_{t}=image_{i}$, i.e., the input to the network is an image, and we have constrained the knowledge base to the class $green$, then its output will be of the form ${s}_{t}=(p, image_{i}, green)$. In this case, the knowledge base matches the input image precisely and returns in ${s}_{t}$ its probability of being $green$. In the more general case, the knowledge base will only match part of the input.
More specifically, we let ${s}_{t}=\{{s}_{t1},{s}_{t2},\dots ,{s}_{tm}\}$ be the result of applying the knowledge base ${\mathcal{KB}}_{\mathcal{C}}$ to the input ${x}_{t}$, where each ${s}_{tj}$, for $j=1,2,\dots ,m$, encodes a fact $({p}_{tj},{z}_{tj},{c}_{tj})$, so that the probability ${p}_{tj}=P({z}_{tj} \mid {c}_{tj})$, for $j=1,2,\dots ,m$, will be fed to the network as side information. From a practical perspective, it is effective to set a nonnegative threshold $\theta$ such that ${s}_{tj}=({p}_{tj},{z}_{tj},{c}_{tj})$ is included in ${s}_{t}$ only when ${p}_{tj}>\theta$.
We also assume, for convenience, that the number of facts returned by ${s}_{t}={\mathcal{KB}}_{\mathcal{C}}({x}_{t})$ that are fed into the network as side information is fixed at a maximum, say $m$, and that they all pass the threshold, i.e., ${p}_{tj}>\theta$. Now, if there are more than $m$ facts that pass the threshold, then we only consider the top $m$ facts with the highest probabilities, where ties are broken randomly. On the other hand, if there are fewer than $m$ facts that pass the threshold, say $k<m$, then we consider the values for the $m-k$ facts that did not pass the threshold as missing. A simple solution for dealing with the missing values is to consider their imputation [18]; in particular, we suggest using constant imputation [19], replacing the missing values with a constant such as the mean or the median for the class, which will not adversely affect the output of the network. We observe that if $k=0$ for all inputs ${x}_{t}$ in the test data, but not necessarily in the training data, then side information is reduced to privileged information.
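The thresholding, top-$m$ selection, and constant imputation just described can be sketched as follows; the function name and the tuple representation of facts are illustrative assumptions:

```python
import random

def select_side_info(facts, m, theta, fill_value):
    """Keep facts (p, z, c) with p > theta, take the top m by probability with
    random tie-breaking, and impute the m - k missing slots with a constant
    (e.g., the mean or median probability for the class)."""
    passed = [f for f in facts if f[0] > theta]
    random.shuffle(passed)                          # random tie-breaking
    passed.sort(key=lambda f: f[0], reverse=True)   # stable sort keeps shuffled tie order
    top = passed[:m]
    probs = [p for p, _, _ in top]
    probs += [fill_value] * (m - len(top))          # constant imputation
    return probs

facts = [(0.9, "z1", "c1"), (0.4, "z2", "c1"), (0.2, "z3", "c1")]
print(select_side_info(facts, m=3, theta=0.3, fill_value=0.5))  # [0.9, 0.4, 0.5]
```

Here $k = 2$ facts pass the threshold $\theta = 0.3$, so the remaining $m - k = 1$ slot is filled with the constant $0.5$.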
Now, assuming that all the side information is fed directly into the first hidden layer, as is the input data, then it can be shown that using side information is equivalent to extending the input data by m values through a preprocessing step, and then training and testing the network on the extended input. When side information is fed into other layers of the network this equivalence breaks down.
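This equivalence for the first hidden layer can be checked numerically: adding side information to the first hidden layer's pre-activation through extra weights is the same as extending the input by the $m$ side values and applying the concatenated weight matrix. A small NumPy sketch, with arbitrarily chosen dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, h = 5, 2, 4                      # input size, number of side values, hidden units
x = rng.normal(size=d)                 # input instance x_t
s = rng.normal(size=m)                 # side information values for x_t
W = rng.normal(size=(h, d))            # input-to-hidden weights
Ws = rng.normal(size=(h, m))           # side-information-to-hidden weights
b = rng.normal(size=h)

# Side information fed directly into the first hidden layer via the extra weights Ws:
pre_side = W @ x + Ws @ s + b

# Equivalent formulation: extend the input by the m side values and apply the
# concatenated weight matrix [W | Ws] to the extended input:
pre_extended = np.hstack([W, Ws]) @ np.concatenate([x, s]) + b

assert np.allclose(pre_side, pre_extended)
```

When the side information enters a deeper layer, no such input extension reproduces it, since the earlier layers' activations no longer depend on $s$.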
4. A Formalism for Side Information
Our formalism for assessing the benefits of side information is based on quantifying the difference between the loss of the neural network without and with side information; see (1) below. Additionally, we provide a detailed analysis for two of the most common loss functions, squared loss and cross-entropy loss [20]; see (2) and (3), respectively, below.
Moreover, in (5) we present the notion of side information being individually useful at time step $t$: the smallest nonnegative $\delta$ such that the probability that the difference between the loss at $t$ when side information is absent and the loss when it is present is at least a given nonnegative margin, $\epsilon$, is greater than or equal to $1-\delta$. When side information is individually useful for all time steps, it is said to be globally useful. Furthermore, in (6) we modify (5) to allow for the concept of side information being useful on average.
To clarify the notation, we assume that ${x}_{t}$ is the input to the NN at time t, ${y}_{t}$ is the target value given input ${x}_{t}$, ${\widehat{y}}_{t}$ is the estimated target value given ${x}_{t}$, i.e., the output of the NN when no side information is available for ${x}_{t}$, and ${\widehat{y}}_{{s}_{t}}$ is the estimated target value given ${x}_{t}$ and side information ${s}_{t}$, i.e., the output of the NN when side information ${s}_{t}$ is present for ${x}_{t}$.
To formalise the contribution of the side information, let $\mathcal{L}({y}_{t},{\widehat{y}}_{t})$ be the NN loss on input ${x}_{t}$ and $\mathcal{L}({y}_{t},{\widehat{y}}_{{s}_{t}})$ be the NN loss on input ${x}_{t}$ with side information ${s}_{t}$. The difference between the loss from ${\widehat{y}}_{t}$ and the loss from ${\widehat{y}}_{{s}_{t}}$, where $0\le {y}_{t}\le 1$ and $0<{\widehat{y}}_{t},{\widehat{y}}_{{s}_{t}}<1$, is given by
$\mathcal{L}({y}_{t},{\widehat{y}}_{t})-\mathcal{L}({y}_{t},{\widehat{y}}_{{s}_{t}})$. (1)
Now, assuming squared loss for a regression problem, we obtain
$\mathcal{L}({y}_{t},{\widehat{y}}_{t})-\mathcal{L}({y}_{t},{\widehat{y}}_{{s}_{t}})=({y}_{t}-{\widehat{y}}_{t})^{2}-({y}_{t}-{\widehat{y}}_{{s}_{t}})^{2}=({\widehat{y}}_{t}-{\widehat{y}}_{{s}_{t}})({\widehat{y}}_{t}+{\widehat{y}}_{{s}_{t}}-2{y}_{t})$, (2)
which has a positive solution when ${\widehat{y}}_{t}<{\widehat{y}}_{{s}_{t}}<2{y}_{t}-{\widehat{y}}_{t}$, i.e., both factors in the last equation are negative, or when ${\widehat{y}}_{t}>{\widehat{y}}_{{s}_{t}}>2{y}_{t}-{\widehat{y}}_{t}$, i.e., both factors are positive.
If we restrict ourselves to binary classifiers, then squared loss reduces to the well-known Brier score [21]. Inspecting (2) for this case, when ${y}_{t}=1$ we have the constraint ${\widehat{y}}_{t}<{\widehat{y}}_{{s}_{t}}$, ensuring that both factors are negative. Correspondingly, when ${y}_{t}=0$, we have the constraint ${\widehat{y}}_{t}>{\widehat{y}}_{{s}_{t}}$, ensuring that both factors are positive; note that in this case, ${\widehat{y}}_{{s}_{t}}<{\widehat{y}}_{t}$ is intuitive, since the classifier returns the probability of ${y}_{t}$ being 1, i.e., estimating the chance of a positive outcome.
On the other hand, assuming cross-entropy loss for a classification problem, we obtain
$\mathcal{L}({y}_{t},{\widehat{y}}_{t})-\mathcal{L}({y}_{t},{\widehat{y}}_{{s}_{t}})=\log {\widehat{y}}_{{s}_{t}}-\log {\widehat{y}}_{t}$, (3)
which has a positive solution when ${\widehat{y}}_{t}<{\widehat{y}}_{{s}_{t}}$.
In this case, if we restrict ourselves to binary classifiers, then cross-entropy loss reduces to the well-known logarithmic score [21]. Here, we need to note that a positive solution always exists, since the binary cross-entropy loss is given by
$\mathcal{L}({y}_{t},{\widehat{y}}_{t})=-{y}_{t}\log {\widehat{y}}_{t}-(1-{y}_{t})\log (1-{\widehat{y}}_{t})$, (4)
when the estimated target value is ${\widehat{y}}_{t}$, and similarly with ${\widehat{y}}_{t}$ replaced by ${\widehat{y}}_{{s}_{t}}$ when the estimated target value is ${\widehat{y}}_{{s}_{t}}$.
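The sign conditions above for the squared-loss and cross-entropy differences can be verified numerically; the values below are illustrative:

```python
import math

def sq_loss(y, y_hat):
    # squared loss (the Brier score for a binary classifier)
    return (y - y_hat) ** 2

def ce_loss(y, y_hat):
    # binary cross-entropy (the logarithmic score)
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

y, y_hat, y_hat_s = 1.0, 0.6, 0.8   # side information moves the estimate towards y_t = 1

# With y_t = 1, the condition y_hat_t < y_hat_s_t makes both loss differences positive:
assert sq_loss(y, y_hat) - sq_loss(y, y_hat_s) > 0
assert ce_loss(y, y_hat) - ce_loss(y, y_hat_s) > 0

# The squared-loss difference agrees with its factorised form
# (y_hat_t - y_hat_s_t)(y_hat_t + y_hat_s_t - 2 y_t):
lhs = sq_loss(y, y_hat) - sq_loss(y, y_hat_s)
rhs = (y_hat - y_hat_s) * (y_hat + y_hat_s - 2 * y)
assert abs(lhs - rhs) < 1e-12

# With y_t = 0 the direction reverses: useful side information lowers the estimate.
assert sq_loss(0.0, 0.3) - sq_loss(0.0, 0.1) > 0
assert ce_loss(0.0, 0.3) - ce_loss(0.0, 0.1) > 0
```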
Now, for a given margin $\epsilon \ge 0$, we would like the probability, $1-\delta$, with $0\le \delta \le 1$, that the difference between $\mathcal{L}({y}_{t},{\widehat{y}}_{t})$ and $\mathcal{L}({y}_{t},{\widehat{y}}_{{s}_{t}})$ is at least $\epsilon$ to be maximised, i.e., $\delta$ to be minimised; that is, we wish to evaluate
$\min \{\delta : P(\mathcal{L}({y}_{t},{\widehat{y}}_{t})-\mathcal{L}({y}_{t},{\widehat{y}}_{{s}_{t}})\ge \epsilon) \ge 1-\delta\}$. (5)
The probability in (5) can be estimated by parallel or repeated applications of the train and test cycle for the NN without and with side information, where the losses $\mathcal{L}({y}_{t},{\widehat{y}}_{t})$ and $\mathcal{L}({y}_{t},{\widehat{y}}_{{s}_{t}})$ are recorded for every individual run in order to minimise over $\delta$ for a given $\epsilon$. Each time the NN is trained, its parameters are initialised randomly and the input is presented to the network in a random order, as was done in [22]; other methods of randomly initialising the network are also possible [23].
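Estimating the probability in (5) from repeated runs amounts to counting, for a given margin $\epsilon$, the fraction of runs in which the loss difference reaches $\epsilon$; a minimal sketch, with hypothetical per-run losses:

```python
def estimate_delta(losses_without, losses_with, eps):
    """Smallest delta such that, empirically,
    P(L(y_t, y_hat_t) - L(y_t, y_hat_s_t) >= eps) >= 1 - delta."""
    hits = sum(1 for lw, ls in zip(losses_without, losses_with) if lw - ls >= eps)
    return 1.0 - hits / len(losses_without)

# Hypothetical cross-entropy losses recorded over 5 train/test runs,
# without and with side information:
without = [0.42, 0.40, 0.45, 0.41, 0.44]
with_si = [0.35, 0.38, 0.36, 0.42, 0.37]

delta = estimate_delta(without, with_si, eps=0.02)
assert abs(delta - 0.2) < 1e-9   # 4 of the 5 runs beat the margin
```

The same function estimates (6) for usefulness on average, if each entry of the two lists is the average loss of one run rather than the loss at a single time step.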
When (5) is satisfied, we say that side information is individually useful for $t$, where $t\in \{1,2,\dots ,n\}$. In the case when side information is individually useful for all $t\in \{1,2,\dots ,n\}$, we say that side information is (strictly) globally useful. We define side information to be useful on average if the following modification of (5) is satisfied:
$\min \{\delta : P(\frac{1}{n}\sum_{t=1}^{n}(\mathcal{L}({y}_{t},{\widehat{y}}_{t})-\mathcal{L}({y}_{t},{\widehat{y}}_{{s}_{t}}))\ge \epsilon) \ge 1-\delta\}$, (6)
noting that we can estimate (6) in a similar way to estimating (5), with the difference that, in order to minimise over $\delta$ for a given $\epsilon$, the average loss is calculated for each run of the NN.
In the case where the greater-than-or-equal operator (≥) in the inequality $\mathcal{L}({y}_{t},{\widehat{y}}_{t})-\mathcal{L}({y}_{t},{\widehat{y}}_{{s}_{t}})\ge \epsilon$, contained in (5), is substituted with equality to zero ($=0$), or is within a small margin thereof (i.e., $\epsilon$ is close to 0), we may say that side information is neutral. Moreover, we note that there may be situations when side information is malicious, which can be captured by swapping $\mathcal{L}({y}_{t},{\widehat{y}}_{t})$ and $\mathcal{L}({y}_{t},{\widehat{y}}_{{s}_{t}})$ in (5). We will not deal with these cases herein and leave them for further investigation. (Similar modifications can be made to (6) for side information to be neutral or malicious on average.)
We note that malicious side information may be adversarial, i.e., intentionally targeting the model, or simply noisy, which has a similar effect but is not necessarily deliberate. In both cases, the presence of side information will cause the loss function to increase. Here, we do not touch upon methods for countering adversarial attacks, which is an ongoing endeavour [24].
5. Materials and Methods
As a proof of concept of our model of side information, we consider two datasets. The first is the Modified National Institute of Standards and Technology (MNIST) dataset [16]. It is a database of handwritten digits, stored as $28\times 28$ pixel images, consisting of a training set of 60,000 instances and a test set of 10,000 instances. These are the standard training and test set sizes for MNIST, which result in about 85.7% of the data being used for training and about 14.3% reserved for testing. This is a classification problem with 10 labels, each depicting a digit identity between 0 and 9.
Our second experiment is based on the House Price dataset, which is adapted from Zillow's home value prediction Kaggle competition dataset [25], and contains 10 input features as well as a label variable [26]. The dataset consists of 1460 instances. The main task is to predict whether the house price is above or below the median value, according to the value of the label variable being 1 or 0, respectively; it is therefore a binary classification problem.
For both experiments, we utilised a fully connected feedforward neural network (FNN) containing two hidden layers, as well as a softmax output layer; see also Figure 1 and Algorithm 1 regarding the modification of the neural network to handle side information. Our experiments mainly aim at evaluating whether or not the utilisation of side information is beneficial to the performance of the learning algorithm, i.e., whether it is useful. This is achieved by evaluating the difference between the respective NN losses without and with side information. The experiments demonstrate the effectiveness of our approach in improving learning performance, which is evidenced by a reduction in the loss (i.e., an improvement in the accuracy) of the classifiers, for each of the two experiments, during the test phase.
In both experiments, all the reported classification accuracy values and usefulness estimates (according to (5) and (6)) are averaged over 10 separate runs of the NN. For every experiment, we also report the statistical significance of the differences between the corresponding classification accuracy values without and with side information. To this effect, we perform a paired t-test to assess the difference in the accuracy levels obtained, where the null hypothesis states that the corresponding accuracy levels are not significantly different.
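The paired t-test on per-run accuracies can be sketched using only the standard library; the accuracy values below are hypothetical, and with SciPy available, `scipy.stats.ttest_rel` would return the p-value directly:

```python
import math
from statistics import mean, stdev

def paired_t_statistic(acc_a, acc_b):
    """t statistic of a paired t-test on per-run accuracies; the null
    hypothesis is that the mean of the paired differences is zero."""
    diffs = [b - a for a, b in zip(acc_a, acc_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical test accuracies over 10 runs, without and with side information:
acc_without = [0.952, 0.949, 0.955, 0.951, 0.950, 0.953, 0.948, 0.954, 0.951, 0.952]
acc_with    = [0.958, 0.955, 0.960, 0.957, 0.956, 0.959, 0.954, 0.961, 0.957, 0.958]

t = paired_t_statistic(acc_without, acc_with)
# |t| is compared against the critical value of the t distribution with
# n - 1 = 9 degrees of freedom (about 2.262 at the 5% level).
```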
For each of the 10 repetitions, the parameters of the NN are initialised randomly. In addition, the order in which the input data are presented to the network is randomly shuffled during each run. Also, missing side information values are replaced with the average value of the side information for the respective class label. The number of epochs used is 10, with a batch size of 250. The loss adopted in both experiments is the cross-entropy loss.
5.1. MNIST Dataset
For this dataset, we add side information in the form of a single attribute, giving information about how peculiar the handwriting of the corresponding handwritten digit looks. The spectrum of values is between 0, which refers to a very odd-looking digit with respect to its ground-truth identity, and 1, which denotes a very standard-looking digit; we also made use of three intermediate values, $0.25$, $0.5$ and $0.75$. This side information was coded manually for $20\%$ of the training and test data, and was fed into the network through the knowledge base right after the first hidden layer. In accordance with the adopted FNN architecture, we believe that adding side information after the first hidden layer best highlights the utilisation of the proposed concept of side information: if the side information were added prior to that (i.e., before the first layer), its impact would be equivalent to extending the input, whereas adding it after the second hidden layer (although technically possible) may not enable the side information to sufficiently impact the learning procedure. According to the $(p,z,c)$ representation, the side information here can be generated as $(p, image, not\text{-}peculiar)$.
As mentioned earlier, an FNN with two hidden layers is utilised. The first hidden layer consists of 350 hidden units, whereas the second hidden layer contains 50 hidden units (these choices are reasonable given the size and structure of the dataset), followed by a softmax activation. Similar to numerous previous works in ML, we focus on comparisons based on an FNN with a certain capacity (two hidden layers in our case, rather than five or six, for instance) in order to concentrate on assessing the algorithm rather than the power of a deep discriminative model.
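Under the stated architecture (350 and 50 hidden units, softmax output), the forward pass with the peculiarity score injected right after the first hidden layer might look as follows in NumPy; the initialisation scale and the actual weight values are illustrative assumptions, and training code is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Layer sizes from the text: 784 inputs, hidden layers of 350 and 50 units, 10 outputs.
W1 = rng.normal(scale=0.05, size=(350, 784)); b1 = np.zeros(350)
W2 = rng.normal(scale=0.05, size=(50, 350));  b2 = np.zeros(50)
Ws = rng.normal(scale=0.05, size=(50, 1))     # side information enters after hidden layer 1
W3 = rng.normal(scale=0.05, size=(10, 50));   b3 = np.zeros(10)

def forward(x, peculiarity):
    # peculiarity in {0, 0.25, 0.5, 0.75, 1}, or a class mean when missing
    h1 = relu(W1 @ x + b1)
    h2 = relu(W2 @ h1 + b2 + Ws @ np.array([peculiarity]))
    return softmax(W3 @ h2 + b3)

probs = forward(rng.random(784), peculiarity=0.75)
```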
5.2. House Price Dataset
The FNN utilised in this experiment consists of two hidden layers, followed by a softmax activation. The first hidden layer consists of 12 hidden units, while the second consists of 4 hidden units (these choices are reasonable given the size and structure of the dataset). A portion of $20\%$ of the data is reserved for the test set.
Out of the 10 features, we experimented with one feature at a time being kept aside. While the other 9 features are presented to the input layer of the NN in the normal manner, the aforementioned 10th feature is used to generate side information, which is fed into the NN through the knowledge base right after the first hidden layer. The reasoning behind selecting this location for adding the side information is similar to that given for MNIST. According to the $(p,z,c)$ representation, here, the side information can be generated as $(p, house\text{-}data, feature\text{-}value)$, where the feature in question varies according to the one left out.