An Intelligent Approach Using Machine Learning Techniques to Predict Flow in People

Pegalajar, M. C.; Ruiz, L. G. B.; Pérez-Moreiras, E.; Boada-Grau, J.; Serrano-Fernandez, M. J.

doi:10.3390/bdcc7020067


by M. C. Pegalajar ¹,*, L. G. B. Ruiz ²,*, E. Pérez-Moreiras ³, J. Boada-Grau ⁴ and M. J. Serrano-Fernandez ⁴

¹ Department of Computer Science and Artificial Intelligence, University of Granada, 18014 Granada, Spain

² Department of Software Engineering, University of Granada, 18014 Granada, Spain

³ RH Asesores Improving S.L., Instituto para el Desarrollo del Talento, 28046 Madrid, Spain

⁴ Department of Psychology, Rovira i Virgili University, 43002 Tarragona, Spain

* Authors to whom correspondence should be addressed.

Big Data Cogn. Comput. 2023, 7(2), 67; https://doi.org/10.3390/bdcc7020067

Submission received: 9 March 2023 / Revised: 31 March 2023 / Accepted: 3 April 2023 / Published: 4 April 2023


Abstract

The goal of this study is to estimate the state of consciousness known as Flow, which is associated with an optimal experience and can indicate a person’s efficiency in both personal and professional settings. To predict Flow, we employ artificial intelligence techniques using a set of variables not directly connected with its construct. We analyse a significant amount of data from psychological tests that measure various personality traits. Data mining techniques support conclusions drawn from the psychological study. We apply linear regression, regression tree, random forest, support vector machine, and artificial neural networks. The results show that the multi-layer perceptron network is the best estimator, with an MSE of 0.007122 and an accuracy of 88.58%. Our approach offers a novel perspective on the relationship between personality and the state of consciousness known as Flow.

Keywords: machine learning; artificial neural networks; flow; psychology; data mining

1. Introduction

Although psychology has taken longer than other fields to adopt machine learning models for the analysis of experimental results, an increasing number of research works have shown the effectiveness of these models and their benefits in complementing traditional statistical techniques [1,2,3,4,5,6,7,8]. As a result, machine learning methods are becoming a valuable tool in differential statistics, supporting the development of more accurate forecasting models through the generalisation and evaluation of new data [9,10].

In [1,2,3], several authors explored the potential of machine learning models in the analysis of psychological experimental data. In [1], the authors presented a procedure that aimed to combine exploratory and predictive modelling to construct new psychometric questionnaires with psychological and neuroscientific theoretical grounding [11]. They employed exploratory data analysis to examine the dimensional structure of the questionnaires and utilised artificial neural networks (ANNs) to predict the psychopathological diagnosis of clinical subjects. Similarly, Orrú et al. [2] suggested the application of machine learning models to analyse a subset of items from the Toronto Alexithymia Scale, which was useful in accurately distinguishing patients with fibromyalgia from healthy controls. In [3], the authors claimed that complementing the analytical workflow of psychological experiments with machine-learning-based analysis could maximise accuracy and minimise replicability issues. Additionally, machine learning analysis is model-agnostic and mainly focuses on prediction rather than inference.

Gonzalez et al. [5] provided a review of four statistical approaches for item selection. The first approach presented was item response theory, which is used to build static short-forms. The second approach was computerised adaptive testing, followed by a genetic algorithm and regression trees. The authors discussed the theoretical strengths and weaknesses of these four statistical solutions and considered the overlap between the two areas, psychometrics and machine learning. Their results suggest that machine learning models such as logistic regression or random forest can have comparable classification performance to the psychometric methods using estimated item response theory scores. Therefore, machine learning models can be a viable alternative for classification when psychometric methods are not feasible.

Traditionally, psychology has had a special interest in studying which factors improve the living standards of people [12,13,14,15]. Among these factors is “Flow”, which has been widely researched by many authors in recent years [16].

In psychology, the term “Flow” is defined as the mental state in which a person performing certain tasks is fully immersed in a feeling of energised focus and full involvement [17,18]. Initially, the studies were conducted using musicians, basketball players, surgeons, teachers, and businesspeople as these sorts of people seem to perform extremely motivating activities or, in other words, activities that they love to do [19]. Later, these works were applied to people of different ages, cultures [20], and jobs, finding that this optimal experience was also present in them.

Thus, this research has the potential to provide significant benefits to both individuals and organisations. By using AI techniques to estimate the state of Flow, we can offer a more objective way to quantify how efficient a person may be at both personal and professional levels. This information can be useful for individuals seeking to improve their performance or make better decisions about their career paths. Moreover, organisations can use this information to optimise work environments and tasks, leading to increased productivity and job satisfaction among employees. By incorporating personality variables that are not directly related to Flow, our approach also offers a novel perspective on the relationship between personality and optimal experience, providing new insights into human psychology.

Related Work

Csikszentmihalyi [17,18] is the author who has studied this optimal experience or Flow state the most. He claimed that the Flow experience was characterised by a mental state in which a person is totally engrossed in some task with full involvement, focus, and enjoyment in the process of the activity. In other words, Flow would be the complete absorption in what a person does, which even results in a transformation of the person’s sense of time. The Flow state is also linked to activities which involve all of a person’s skills and abilities. According to the author, the Flow state occurs because the person’s full attention is on the task at hand: there is no more attention to be allocated. The concept of Flow has gradually evolved through interviews and surveys conducted with individuals who described their emotional experiences while engaging in certain activities, resulting in emotions [21] akin to happiness that they may not have previously experienced.

In addition to the Flow state, we can find the concept of the autotelic personality. People with this kind of personality possess some qualities such as curiosity, persistence, low levels of self-centeredness, and a high rate of performing activities for intrinsic reasons only. For this reason, these people are more likely to enter the Flow state than others.

As previously mentioned, the Flow state has been shown to have great importance in sports, music, and employment. Particularly, the latter is receiving extensive attention due to its connection with quality, job satisfaction, and productivity [22,23,24,25,26]. Investigations have even detected factors that influence well-being and job satisfaction among employees [27]. In addition, a positive relationship was found between emotional intelligence and psychological Flow, which turned out to prevent work stress [28]. Indeed, Hirao and Kobayashi [29] believe that unemployed people with an autotelic personality have a better quality of life than those without this kind of character.

In this work, several surveys were used to measure up to 26 factors related to various characteristics of individuals, e.g., personal, emotional intelligence, job satisfaction, Flow, etc. Our goal is to investigate whether it is possible to estimate the Flow variable using unrelated variables by implementing several machine learning models to predict and classify the Flow variable, using the rest of the items from different questionnaires.

We emphasise the importance of the Flow state and of increasing Flow among employees, as it can improve the work environment and job performance [30]. This could help companies enhance their productivity, organisational quality, and other important factors. Accordingly, it is important to study whether it is possible to predict the Flow score of an individual from other personality characteristics or emotional intelligence, among other factors, i.e., data that companies frequently possess with respect to their workers. To that end, given a set of different variables evaluated in diverse people, we implement machine learning models to predict and classify the Flow variable. After conducting the experiments, we draw conclusions about which model is best and what the pros and cons of each model are.

To this end, the present manuscript is structured as follows: Section 2 introduces our methodology, participants, and procedures in addition to the machine learning methods employed. Section 3 describes the experimentation conducted. Section 4 includes the results obtained from our experiments. Finally, Section 5 summarises the main conclusions achieved.

2. Methodology

This study aimed to utilise artificial intelligence techniques to model the Flow variable [31], a significant factor in the field of psychology. To accomplish this objective, data mining methods were necessary to interpret the vast amount of data gathered from various psychological tests. These tests were designed to measure the participants’ different characteristics.

This task was divided into two different parts. The first was to examine all the variables involved in the problem, while the second was to determine which parameters influence the Flow variable. To achieve this, a predictive correlation study was conducted.

As we have just mentioned, the first step was to analyse the data collected to obtain statistical relationships among the variables of our problem. At this point, our endeavours were focused on linking the Flow variable with other kinds of psychological or personal variables.

Two different approaches were used to forecast the Flow variable for a particular person, and each approach used a different method. The first option was to use regression models since the psychological variables can be treated as continuous variables, and thus each variable can be used in a specific range of applications. The second approach was to employ supervised algorithms in order to classify labelled samples for the Flow variable.

2.1. Method

Participants and Procedures

In our study, we conducted surveys with a total of 856 participants. It is worth noting that we intentionally chose Spanish speakers from around the world to avoid focusing on a specific demographic sample that could potentially bias our study. By selecting participants from a diverse range of backgrounds and geographic locations, we aimed to increase the generalizability of our findings to broader populations. This approach allowed us to obtain a more representative sample. There were 36 variables to be analysed, one of which was the variable to be predicted: Flow.

Each person has a set of 11 variables that are personal and business-oriented with respect to their current employment situation, family, etc. In addition to this, there are 25 other psychologically related variables obtained from the applied surveys: (1) Flow [32]; (2) self-efficacy [33]; (3) self-esteem [34]; (4) extraversion; (5) emotional stability; (6) responsibility; (7) kindness; (8) open-mindedness to experience; (9) flourishing; engagement, [34] which is split into four variables: (10) vigour, (11) dedication, (12) absorption, and (13) total engagement; (14) satisfaction with life [35,36,37]; emotional intelligence [38] also divided into four items: (15) perception; (16) comprehension; (17) regulation; (18) total emotional intelligence; and finally, personal and organisational quality [39], which involves (19) emotional vitality, (20) organisational stress, (21) emotional stress, (22) physical stress, (23) abandonment, (24) health risk and (25) total.

To facilitate understanding for readers, we have summarised each variable and its corresponding test in Table 1.

All variables might be related to the variables to be predicted, and this was the main issue to solve: to analyse the different relationships among the variables.

These data have a wide variety of information. Furthermore, since they were filled in online, some of them might present several mistakes. Some people may not have answered properly, or there may be null values in the questionnaires. Therefore, we needed to pre-process these flawed values due to the potential influence they could have on our models.

First, we analysed the data provided by the participants to subsequently create social groups and to study links among attributes. The personal information variables of the participants were the following: gender, age, marital and employment status, educational background, business unit, seniority in the company and position, number of dependents, experience, professional group, sector, number of personnel, annual turnover, scope, and recent 12-month situation of their company.

According to the data, the gender distribution was 40% men and 60% women, which is not a significant imbalance. The age of the participants ranged from 16 to 76 years old, with an average age of 44.34 and a standard deviation of 11.29. The majority of participants were married and employed and had at least a Bachelor’s, Master’s, or PhD degree. Only a few participants had a primary school certificate or non-certificated studies.

After examining all the previous variables, we focused on the psychological variable named Flow. Flow is a state of consciousness which appears when people experience optimal situations; as we mentioned before, it is usually associated with specific tasks in which one needs to involve their attention capacity and other skills. These experiences were defined by scientific research as a state in which a person is utterly absorbed in an assignment [41]. With the goal of measuring this feature, Likert’s proposal, commonly known as the EBF 9 scale, was adopted. This scale is an abbreviated version comprising nine items for each dimension proposed in the Flow Theory. It draws on a study conducted by Godoy-Izquierdo [42] in which the authors established a range between 9 and 45 to evaluate this state.

To ensure accurate results, a three-step procedure was employed. Firstly, the data were split into training and validation sets with an 80:20 ratio. Secondly, the models and their parameters were established using the training set and a grid search procedure with cross-validation. Lastly, the trained models were validated, and the results were extracted. Figure 1 provides an illustration of this process.
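The three steps can be sketched in plain Python. This is only an illustration under stated assumptions, not the code used in the study: the helper names (`train_validation_split`, `k_fold_indices`, `grid_search`), the five folds, the fixed seeds, the toy one-feature k-NN model, and the synthetic data are all hypothetical.

```python
import random

def train_validation_split(n_samples, train_fraction=0.8, seed=0):
    """Step 1: shuffle the sample indices and split them (80:20 by default)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * n_samples))
    return indices[:cut], indices[cut:]

def k_fold_indices(indices, n_folds=5):
    """Yield (fit, held_out) index lists, one pair per cross-validation fold."""
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, held_out in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, held_out

def knn_predict(x_fit, y_fit, x_new, k):
    """Toy 1-D k-NN regressor: mean target of the k closest fitted points."""
    nearest = sorted(range(len(x_fit)), key=lambda i: abs(x_fit[i] - x_new))[:k]
    return sum(y_fit[i] for i in nearest) / len(nearest)

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def grid_search(x, y, train_idx, candidate_ks, n_folds=5):
    """Step 2: pick the k with the lowest mean cross-validated MSE."""
    best_k, best_score = None, float("inf")
    for k in candidate_ks:
        fold_scores = []
        for fit, held_out in k_fold_indices(train_idx, n_folds):
            x_fit = [x[i] for i in fit]
            y_fit = [y[i] for i in fit]
            preds = [knn_predict(x_fit, y_fit, x[j], k) for j in held_out]
            fold_scores.append(mse([y[j] for j in held_out], preds))
        score = sum(fold_scores) / len(fold_scores)
        if score < best_score:
            best_k, best_score = k, score
    return best_k, best_score

# Synthetic data standing in for the questionnaire scores (hypothetical).
rng = random.Random(42)
x = [rng.random() for _ in range(200)]
y = [0.5 * xi + rng.gauss(0, 0.05) for xi in x]

train_idx, val_idx = train_validation_split(len(x))            # step 1
best_k, cv_mse = grid_search(x, y, train_idx, range(1, 21))     # step 2

# Step 3: refit on the whole training set and score the held-out 20%.
x_train = [x[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
val_preds = [knn_predict(x_train, y_train, x[j], best_k) for j in val_idx]
val_mse = mse([y[j] for j in val_idx], val_preds)
```

The validation indices are never seen by the grid search, so `val_mse` is an estimate of the error on unseen data rather than a by-product of parameter tuning.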

Note that all variables were standardised so that they could be applied to distance-based models and, in this way, they were not influenced by the variables whose scales were uneven. Moreover, having all variables on the same scale provides a better understanding of the outcomes and a better classification and prediction.
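As a minimal sketch of this step, the snippet below standardises one variable to zero mean and unit variance. The function names and the sample values are hypothetical, and the choice to estimate the mean and standard deviation on the training data only is an assumption rather than something stated in the text.

```python
from statistics import mean, pstdev

def fit_standardiser(column):
    """Return the mean and population standard deviation of a training column."""
    mu = mean(column)
    sigma = pstdev(column)
    # Guard against constant columns, which would cause a division by zero.
    return mu, (sigma if sigma > 0 else 1.0)

def standardise(column, mu, sigma):
    """Apply z = (value - mean) / standard deviation to every entry."""
    return [(value - mu) / sigma for value in column]

# Hypothetical raw ages; any other variable is treated the same way.
age_train = [25, 35, 45, 55]
age_mu, age_sigma = fit_standardiser(age_train)
age_z = standardise(age_train, age_mu, age_sigma)

# New samples reuse the statistics estimated on the training data.
age_new_z = standardise([40], age_mu, age_sigma)
```

After this transformation a distance-based model such as k-NN weighs every variable on a comparable scale, instead of being dominated by the variable with the largest numeric range.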

2.2. Measures

The main metric we used in this study to evaluate the performance of our models was the mean squared error (MSE). Nonetheless, we also used the explained variance and the R2 score to support our statements.

Together with the root mean squared error (RMSE), the MSE [43] is one of the most commonly used metrics for evaluating regression models, and both have been adopted in a large number of regression problems. They rely on the error made by each estimated sample \hat{y}_{i} compared with its actual counterpart y_{i}. For a dataset with n samples, the MSE is calculated as follows:

MSE = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}

(1)

The explained variance (EV) score [42] indicates the proportion of the variance of the target that is accounted for by the model. The best possible result is 1, and the score is computed according to the following equation:

EV = 1 - \frac{Var\{y - \hat{y}\}}{Var\{y\}}

(2)

The last measure employed was the R2 metric [43], which is also known as the coefficient of determination and is used to measure predictive success. This metric also has an upper limit of 1, and it can be negative because a model can be arbitrarily worse than a constant prediction. A constant model, which predicts the expected value of y regardless of the input vector, would obtain an R2 score of 0. The formula to compute this metric is defined below:

R^{2} = 1 - \frac{\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}

(3)

where \bar{y} = \frac{1}{n} \sum_{i = 1}^{n} y_{i}

(4)

and \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2} = \sum_{i = 1}^{n} \epsilon_{i}^{2}

(5)
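Equations (1) to (3) translate directly into a few lines of Python. The sketch below is an illustration with made-up values, and the function names are hypothetical. The example predictions carry a constant offset of 0.1, which shows why EV and R2 can rank models differently: the offset leaves the variance of the residuals at (numerically almost) zero, so EV stays close to 1, whereas R2 is penalised by the squared errors and comes out at approximately 0.8.

```python
from statistics import mean, pvariance

def mse(y_true, y_pred):
    """Equation (1): mean of the squared errors."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def explained_variance(y_true, y_pred):
    """Equation (2): 1 - Var{y - y_hat} / Var{y}."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - pvariance(residuals) / pvariance(y_true)

def r2_score(y_true, y_pred):
    """Equation (3): 1 - (sum of squared errors) / (total sum of squares)."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions with a constant bias of +0.1.
y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.3, 0.5, 0.7, 0.9]

error = mse(y_true, y_pred)
ev = explained_variance(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```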

These were the metrics adopted in our first approach. The second approach requires classification metrics to quantify the quality of our results. The first is accuracy, which is defined as the number of correct predictions divided by the total number of samples [44]:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

(6)

where TP and TN stand for true positives and true negatives, respectively, and FP and FN are their false counterparts. Recall [45] is the number of correct positives divided by the number of all samples that should have been identified as positive:

Recall = \frac{TP}{TP + FN}

(7)

Precision is the number of correct positives divided by the number of positive results obtained by the model [45]:

Precision = \frac{TP}{TP + FP}

(8)

The last measure is the Hamming loss [46], which is the fraction of labels that are incorrectly predicted:

L_{H} (y, \hat{y}) = \frac{1}{n} \sum_{i = 1}^{n} 1_{y_{i} \neq \hat{y}_{i}}

(9)
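The classification metrics can be sketched for a multi-class setting as follows. The labels are invented for illustration, the function names are hypothetical, and recall and precision are computed one class at a time by treating that class as the positive one (a one-vs-rest reading of Equations (7) and (8), which is an assumption about how they were applied).

```python
def accuracy(y_true, y_pred):
    """Share of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def hamming_loss(y_true, y_pred):
    """Share of samples whose predicted label differs from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_precision(y_true, y_pred, positive):
    """One-vs-rest recall and precision for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Invented labels in the range 0..4 (very low .. very high).
y_true = [2, 2, 3, 3, 3, 4, 4, 1]
y_pred = [2, 3, 3, 3, 2, 4, 4, 1]

acc = accuracy(y_true, y_pred)
loss = hamming_loss(y_true, y_pred)
recall_3, precision_3 = recall_precision(y_true, y_pred, positive=3)
```

With one label per sample, as in this sketch, the Hamming loss is simply the complement of the accuracy; the two only diverge in multi-label problems.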

2.3. Machine Learning Techniques

This section is intended to provide a brief introduction to the machine learning methods applied in this work. Specifically, we used linear regression, k-nearest neighbours, support vector machine, trees, random forest and artificial neural networks. While these models are typically used to solve classification problems [47], in this work, we utilized their regression adaptations, as the problem we wanted to solve required the analysis of a wide range of possible values for the variables.

Our proposed method involved testing several machine learning algorithms for predicting the Flow variable. Each algorithm has its own strengths and weaknesses and may be more suitable for different types of data. By testing multiple algorithms, we can determine which one is the most effective for our specific dataset. We trained and tested each algorithm using a cross-validation technique, ensuring that our results were robust and not dependent on any particular training or testing set. We then compared the performance of each algorithm using the various evaluation metrics previously described. This allowed us to identify the most effective algorithm for predicting the Flow variable and to gain insights into which factors may be driving its predictions.

2.3.1. Linear Regression

Linear regression (LR) is an essential model used in statistics to identify the relationship between variables and to analyse their dependency [48].

Here, this model was used as a baseline against which the rest of the regression models were compared. In doing so, we obtained the reference error that the other models needed to improve upon. Since this is a simple model, it did not have any parameters to analyse or tune. Briefly, LR is represented by a line that minimises the error between the training points and the line, which is subsequently employed to predict.

In the field of psychology, traditional approaches to modelling the relationship between psychological variables often rely on linear regression analysis. As such, we included LR to ensure that our proposed method could be compared to traditionally oriented methods.

2.3.2. Nearest Neighbours

In this work, we made use of the popular k-nearest neighbours (k-NN) algorithm, which is based on the relative distances between data points to compute the expected value [49]. This algorithm has a parameter k, set by the user, that fixes the number of neighbours considered; in the regression setting, the prediction is the mean of the target values of those k neighbours.

This algorithm is widely used in many problems due to its ease of understanding and low computational cost. Although k-NN is mainly adopted in classification problems, this technique can be extended to regression problems. It is based on the sliced inverse regression technique; in this way, local regression using neighbourhoods is carried out.
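A minimal k-NN regressor can be written in a few lines. The sketch below assumes Euclidean distance and an unweighted mean of the neighbours' targets, neither of which is specified in the text; the function name and the sample points are hypothetical.

```python
import math

def knn_regress(train_x, train_y, query, k):
    """Predict the mean target of the k training points closest to `query`."""
    # Pair every training point's Euclidean distance to the query with its target.
    by_distance = sorted(
        (math.dist(row, query), target) for row, target in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)

# Hypothetical standardised inputs (two variables) and Flow scores.
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [0.2, 0.4, 0.6, 1.0]

# The distant point (5, 5) is ignored when k = 3.
prediction = knn_regress(train_x, train_y, query=(0.1, 0.1), k=3)
```

Because the prediction depends only on distances, the variables should be on a common scale before the neighbours are computed.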

2.3.3. Support Vector Machine

The third model we applied is the regression version of the well-known support vector machine. This model is designed to solve regression problems by finding the hyperplane that best fits the data [50], maximising the margin between the hyperplane and the data points. It works by transforming the data into a higher-dimensional space in which a linear hyperplane can be used to separate the data into different regions. The goal is to find the hyperplane that maximises the margin between the hyperplane and the data points while minimising the error.

However, the training process takes a long time in this technique due to the high number of calculations needed. On the basis of SVM models, the support vector regression (SVR) model has benefits in high dimensionality spaces as it is characterised by the optimization-margin algorithm to model different data points.

2.3.4. Tree-Based Models

Another two state-of-the-art methods were used in this study: the decision tree (DT), or regression tree (RT), and its ensemble variant, random forest (RF) [51]. In addition to being widely used in many fields, the tree is one of the most descriptive models as it can provide a wealth of information on the data and the relationships between input and output. Its tree-based structure provides a hierarchical representation of the knowledge of the data, which can be used both to predict and to describe how the data behave.

In this case, there were two important parameters to be tested: the depth of the tree and the minimum number of samples required to split an internal node. Additionally, an extra parameter was required for the RF model. RF comprises a set of DTs which combine their outputs to refine the expected value. As a result, the additional parameter for RF was the number of trees used in the ensemble.

2.3.5. Artificial Neural Networks

An artificial neural network (ANN) is a bio-inspired system that builds a computing system by means of nodes, or neurons, and connections, or weights, among those nodes. By combining different values of the weights, an ANN is capable of modelling dependencies between inputs and outputs and predicting future values.

The simplest model is known as a multi-layer perceptron (MLP) network [52] and has a feedforward architecture. MLPs commonly consist of three main layers: input, hidden, and output layers. Firstly, all the model weights are randomly initialised. They are then updated in order to correct the error made by the model. Consequently, several iterations are needed so that the weights can converge towards an optimal solution. In our problem, several numbers of layers and neurons were tested to obtain the best network.
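The forward pass of such a network can be sketched as follows. This is only an illustration: it assumes one hidden layer with a logistic activation and a linear output unit, the weights are arbitrary hypothetical values rather than trained ones, and the iterative weight updates described above are not shown.

```python
import math

def logistic(z):
    """Logistic (sigmoid) activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Feedforward pass: inputs -> hidden layer (logistic) -> linear output."""
    hidden = [
        logistic(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias

# A network with two inputs and a single hidden neuron (hypothetical weights).
prediction = mlp_forward(
    inputs=[0.0, 0.0],
    hidden_weights=[[0.5, -0.3]],  # one row of weights per hidden neuron
    hidden_biases=[0.0],
    output_weights=[1.0],
    output_bias=0.1,
)
```

With all inputs at zero, the hidden neuron outputs logistic(0) = 0.5, so the prediction is 1.0 * 0.5 + 0.1 = 0.6.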

3. Experiments

The data were extracted from the doctoral thesis of Pérez-Moreiras [53]. First and foremost, we verified that our participants were in the range we highlighted above. Considering Table 2, we observed that our data fulfilled this requirement. Most were grouped within the higher values, between 30 and 36. This is a good score for this psychological aspect. It is important to note that a higher value indicates better performance for this variable and in our case, very few data points presented a low value of this psychological component.

Once this was achieved, the next point to study was the correlation between variables. To do so, our first step was to use RF to provide some knowledge about these correlations, as the nature of this model allowed us to extract this information. In order to estimate how strongly each predictor is related to the dependent variable, we analysed how many times each variable was chosen as the root node. Notice that this node is considered the most important one for our model as it is the node with the greatest capacity for splitting and predicting the data.

The variable we analysed was the Flow variable. The remaining variables were used as inputs for our models. All experiments were performed using the RF model, with 1000 trees per classifier to ensure the validity of the results.
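The root-node count can be illustrated with a simplified stand-in for a random forest. The sketch below is not the implementation used in the study: since only the root matters for this count, it grows just the first split of each tree, on a bootstrap sample and a random subset of the variables, and tallies which variable wins. The function names, the squared-error split criterion, the 100 trees (instead of 1000, to keep the example fast), the subset size, and the synthetic data are all assumptions.

```python
import random
from collections import Counter

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_root_feature(rows, targets, candidate_features):
    """Feature whose best single-threshold split minimises the squared error."""
    best_feature, best_error = None, float("inf")
    for f in candidate_features:
        for threshold in sorted({row[f] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            if not left or not right:
                continue
            error = sse(left) + sse(right)
            if error < best_error:
                best_feature, best_error = f, error
    return best_feature

def root_node_counts(rows, targets, n_trees=100, max_features=2, seed=0):
    """Count how often each feature is selected for the root split."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    counts = Counter()
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]            # bootstrap sample
        features = rng.sample(range(n_features), max_features)   # random subset
        root = best_root_feature(
            [rows[i] for i in sample], [targets[i] for i in sample], features
        )
        if root is not None:
            counts[root] += 1
    return counts

# Synthetic data: the target depends strongly on feature 0, weakly on
# feature 1, and not at all on feature 2.
data_rng = random.Random(1)
rows = [[data_rng.random() for _ in range(3)] for _ in range(40)]
targets = [1.0 * r[0] + 0.5 * r[1] + data_rng.gauss(0, 0.05) for r in rows]

counts = root_node_counts(rows, targets, n_trees=100)
```

In this toy setting the irrelevant feature should be chosen as root far less often than the dominant one, which is the pattern used to rank the variables.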

As can be seen in Figure 2, there were three most relevant variables in this problem: V16 (comprehension), V2 (self-efficacy), and V5 (emotional stability). Therefore, we can conclude that these three variables are related to the Flow state, as one needs to have good emotional stability and be quite efficient in order to enter this state. An aspect worthy of mention is that variables such as personal character and business-related variables barely influenced our problem. For this reason, we did not make use of these variables in our models as inputs and we skipped them as our first solution. In this way, it was easier for the machine learning methods to adjust their results as potential noise variables were removed.

Thus far, our efforts had been focused on analysing the problem. Since our variables had a wide range of possible values, the best way to proceed with predicting them was by using regression models. Therefore, this was our first approach to solving the problem.

Each model was tested with different parameters. In the case of kNN, the number of neighbours was varied from 1 to 20. Several kernels were examined in the case of SVR, namely RBF, polynomial, sigmoid, and linear; the regularisation parameter took values in \{0.001, 0.01, 0.1, 0.5, 1, 10\}. The criterion to measure the quality of a split of RT was tested using the MSE, the Friedman version, and the mean absolute error. The RF experiments focused on the number of estimators, which took values in \{10, 50, 100, 500, 1000\}. For the MLP, several hidden layer configurations were tested, from 1 to 5 layers, each with 2 to 100 neurons; the activation functions were the identity, logistic, hyperbolic tangent, and rectified linear unit functions. The optimization methods tested were stochastic gradient descent (SGD), a quasi-Newton optimizer (L-BFGS), and the Adam optimizer.

Our second approach was to employ multiclass classifiers to categorise different ranges of our variable. In doing so, we transformed our regression problem into a multi-class classification problem. There were five intervals or classes. The first one, from 0 to 0.2, was labelled 0 (whose meaning is very low); the second, from 0.2 to 0.4, was labelled 1 (low); the third, from 0.4 to 0.6, was labelled 2 (medium); the fourth, from 0.6 to 0.8, was labelled 3 (high); and the last one, from 0.8 to 1, was labelled 4 (very high). In this way, our psychological variables would have a logical meaning as they measured the intensity of the variables in a specific individual. The greater the value of a variable, the higher its strength in that field.
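The mapping from a score to one of the five classes can be sketched as below. Two points are assumptions rather than details given in the text: a boundary value such as 0.2 is assigned to the upper interval, and the raw Flow score (range 9 to 45) is brought into the interval from 0 to 1 by min-max rescaling. The function names are hypothetical.

```python
LABELS = ["very low", "low", "medium", "high", "very high"]

def rescale_flow(raw_score, lowest=9, highest=45):
    """Min-max rescaling of a raw Flow score to [0, 1] (assumed preprocessing)."""
    return (raw_score - lowest) / (highest - lowest)

def flow_class(score, n_classes=5):
    """Map a score in [0, 1] to a class index 0..4 using equal-width intervals."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    # Boundary values go to the upper interval; 1.0 is kept in the top class.
    return min(int(score * n_classes), n_classes - 1)

# A raw score of 36 rescales to 0.75, which falls in class 3 ("high").
label_index = flow_class(rescale_flow(36))
label_name = LABELS[label_index]
```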

4. Results

The outcomes of the first approach, which used regression techniques to predict the Flow variable, are presented in Table 3. The table shows that the MLP model performed best, followed by RF in both training and testing, although the RF model performed slightly better in terms of EV, with a difference of eight units to the fourth decimal place. Another remarkable point in this table is the low error achieved by the LR model: despite being the simplest model, it predicts better than RT.

In order to not hamper the readability of this paper, we will not show all the experiments conducted and will instead endeavour to explain the most important ones. Therefore, we will skip the intermediate tests in order to obtain the best parameters, and we will only list the optimal values attained. For instance, the grid search for the MLP’s parameters chose a single hidden layer with only one neuron and the logistic activation function. It was an unexpected discovery that MLPs obtained better results in training as the number of neurons increased; however, the test error also increased significantly. The notable aspect of this finding was that the threshold in which this event occurs was a single neuron. Considering these results, we can observe a small difference between the R2 metrics for the MLP in the first place and the RF in the second and vice versa for the EV; however, this was true not only for this model but also in comparison with the rest of the models. This dissimilarity was approximately eight-hundredths.

There is a clear difference among the models tested in this study. The best-performing techniques were MLP and RF, with the MLP method achieving the best results in the test stage, indicating the ability of our models to generalise. Notice that ANN training is a computationally costly task and that ANNs can overfit easily. For this reason, MLP may have had the lowest MSE in training but not in testing for the EV metric. Nonetheless, it achieved a similar error to the tree ensemble. Although the error in most models is comparatively low, it can be seen that virtually all models obtained a slightly lower error than LR. Having noted this, we obtained another interesting and unexpected finding because the same cannot be said of kNN, SVR, or RT, which achieved worse results despite their higher computational cost and complexity.

In summary, the MLP model was shown to be the most accurate model for predicting the Flow variable. Nevertheless, it is also recommended to use RF due to its low computational cost and high interpretability as it is based on trees.

In our continued efforts to address the problem, we proposed a second approach in which we aimed to classify the different ranges of our variable using a supervised algorithm. As linear regression is a regression technique rather than a classifier, it was removed from these experiments. Therefore, we used SVM, kNN, DT, RF, and ANN as classifiers in this problem. Similar to our previous experiments, our models were tested with several parameters, and the best ones were as follows: linear kernel and a regularisation parameter of 10 for SVM; 17 neighbours for kNN; Gini criterion in the case of DT; entropy criterion and 1000 estimators for RF; and five layers with 200 neurons, using the Adam optimizer, for MLP.

The results of these models can be seen in Table 4. The resultant metrics are quite impressive. In the training stage, SVM is the second-best technique, according to three out of four metrics, accuracy, recall, and precision, and it obtained a poor Hamming loss. This tells us that if SVM fails, then the difference between the misclassification and its actual value is greater than other models, such as kNN. However, kNN was one of the worst methods for all the metrics in the test stage. This is because kNN is a technique based on distances, and hence it is not rare that it achieved a good score in training by trying to minimise distances. The Hamming loss, in this case, indicates how poor the classification from the model was. Therefore, according to the rest of the measures, kNN does not classify properly either in training or testing, but its approximation was better than DT in some statistics.

Another aspect of these results concerns the SVM model. Since it achieved the second-best precision in training, we can suppose that SVM classified minority classes better than most of the other models, although this is not reflected in the test stage.

Finally, we can observe that MLP achieved the best scores for all metrics in both the test and training stages. MLP’s high accuracy indicates that a greater proportion of its predictions were correct. In addition to this, MLP obtained the best Hamming loss during testing, which means that its incorrect classifications were closer to the correct class than those of the other models. MLP also achieved the best recall score, indicating that it correctly identified a larger share of the samples that actually belong to each class. A similar trend was observed with SVM, which had the second-best recall and precision but obtained a worse accuracy, as it seems to have focused on a specific fraction of the data. In conclusion, MLP outperformed the other models overall, although some models that specialised in a portion of the data achieved interesting scores in the training stage.
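
To make the four measures concrete, the sketch below computes them from scratch for a toy label vector. It assumes macro-averaged recall and precision (the averaging scheme is not stated here) and the single-label form of the Hamming loss, that is, the fraction of samples that receive a wrong label. The function name and the toy data are illustrative only.

```python
def summarise_metrics(y_true, y_pred):
    """Return accuracy, macro recall, macro precision and Hamming loss."""
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    recalls, precisions = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        actual = sum(t == c for t in y_true)     # samples truly in class c
        predicted = sum(p == c for p in y_pred)  # samples assigned to class c
        recalls.append(tp / actual if actual else 0.0)
        precisions.append(tp / predicted if predicted else 0.0)
    return {
        "accuracy": correct / n,
        "recall": sum(recalls) / len(labels),
        "precision": sum(precisions) / len(labels),
        "hamming_loss": (n - correct) / n,
    }


# Toy data (hypothetical): two samples are pushed into the next class up.
y_true = [0, 1, 2, 2, 3, 3, 4, 4]
y_pred = [0, 1, 2, 3, 3, 4, 4, 4]
report = summarise_metrics(y_true, y_pred)
```

For this toy example the accuracy is 0.75, the macro recall is 0.80, and the macro precision is about 0.83. Under the single-label definition used here, the Hamming loss is the complement of the accuracy.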

After obtaining the best model for predicting the Flow variable, the next step was to analyse the specific results of the chosen model which, in this case, was MLP. To do so, we computed the confusion matrix, which is illustrated in Figure 3. This figure shows the points at which the classifier was not able to predict properly. In most cases, the classes with the most misclassified points are classes 2 and 3, and sometimes class 4; seldom do we find class 0 or 1 misplaced. This is a direct result of the oversampling method that was applied. Indeed, had this technique not been applied, those minority classes would likely have been misclassified.

At first glance, these errors seem to be caused by an overlap between adjacent classes, as the predicted label usually lies one unit from the true label. For example, the failures for label 2 consist of one sample assigned to class 1 and 13 samples assigned to class 3. This is why the Hamming loss was low for this model; this is noteworthy, as we transformed the resultant values of the Flow variable into different ranges to obtain classes.
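
The adjacency of the errors can be quantified directly from a confusion matrix. The following sketch is one possible way to do so: it returns the fraction of misclassified samples whose predicted class is exactly one step from the true class. The matrix is hypothetical. Only the off-diagonal entries of the row for label 2 (one sample sent to class 1 and 13 sent to class 3) are taken from the text; every other number is invented for illustration.

```python
def adjacent_error_share(cm):
    """Fraction of errors that land in a class adjacent to the true one.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    errors = 0
    adjacent = 0
    for i, row in enumerate(cm):
        for j, count in enumerate(row):
            if i != j:
                errors += count
                if abs(i - j) == 1:
                    adjacent += count
    return adjacent / errors if errors else 0.0


# Hypothetical 5-class confusion matrix; only the errors in row 2
# (1 -> class 1, 13 -> class 3) come from the text.
cm = [
    [40, 0, 0, 0, 0],
    [1, 38, 1, 0, 0],
    [0, 1, 30, 13, 0],
    [0, 0, 2, 36, 6],
    [0, 0, 0, 4, 36],
]
share = adjacent_error_share(cm)
```

A value close to 1 indicates that nearly all failures are confusions between neighbouring ranges, which is the pattern described above.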

Figure 4 shows the relationship between the predicted labels and the true labels. The abscissa shows the associated label, the ordinate shows the normalised value of the Flow variable, and the colours and horizontal lines delimit the range of each label. As mentioned previously, the points misclassified by the MLP in Figure 4 lie close to the class boundaries. For instance, the large yellow dot in class 2 represents 13 points that are very close to the boundary between class 2 and class 3 and were classified as label 3. Similarly, the small yellow point indicates that it is near class 2 and was incorrectly classified with that label. The situation is similar for the purple circles of label 3. Of the eight points that were misclassified, five lie on the limit between classes 3 and 4; they are represented by the highest circle. The second circle, which is slightly smaller, represents three erroneous predictions. It is not located exactly on the class boundary but at the second-closest position, as is the case for the last point, which represents one instance classified as class 2.

These two figures, Figure 3 and Figure 4, represent the typical scenario encountered while training models to predict the Flow variable. It is rare for misclassified points to fall in the middle of the defined interval. For this reason, the intelligent models achieved high scores but not perfect marks: the errors were due to instances that are extremely close to one another.
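
The boundary effect can be checked numerically by measuring how far each misclassified sample lies from the nearest class limit. The sketch below assumes five equal-width classes on a normalised Flow scale in [0, 1]; the actual cut points are not given in this section, so the edges, the tolerance, and the sample values are all hypothetical.

```python
from bisect import bisect_right

# Assumption: equal-width class limits on the normalised Flow scale.
EDGES = [0.2, 0.4, 0.6, 0.8]


def to_class(value, edges=EDGES):
    """Map a normalised Flow value to its class index (0..len(edges))."""
    return bisect_right(edges, value)


def boundary_distance(value, edges=EDGES):
    """Distance from a value to the nearest class boundary."""
    return min(abs(value - e) for e in edges)


def near_boundary_share(values, y_true, y_pred, tol=0.02, edges=EDGES):
    """Share of misclassified samples lying within tol of a boundary."""
    missed = [v for v, t, p in zip(values, y_true, y_pred) if t != p]
    if not missed:
        return 0.0
    return sum(boundary_distance(v, edges) <= tol for v in missed) / len(missed)


# Hypothetical samples: three errors, two of them right next to a boundary.
values = [0.10, 0.39, 0.41, 0.59, 0.70, 0.50]
y_true = [to_class(v) for v in values]
y_pred = [0, 2, 2, 3, 3, 3]
share = near_boundary_share(values, y_true, y_pred)
```

In this toy example two of the three errors fall within the tolerance, so the share is 2/3. A share close to 1 on real predictions would support the observation that errors concentrate at the class limits.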

While our results demonstrate that the MLP model outperformed other models for predicting the Flow variable in our dataset, it is important to consider the theoretical implications of our approach. Our oversampling method, which increased the representation of minority classes in the dataset, allowed our models to better capture the subtle difference in the data and avoid bias towards the majority classes. Furthermore, the MLP’s ability to identify complex relationships between input features and target variables is a key strength of neural networks and is likely a major factor contributing to its superior performance. It is worth noting, however, that the proposed method is not without limitations. Oversampling can lead to overfitting if not applied properly, and ANNs can be prone to overfitting as well. Therefore, it is important to continue refining and validating our approach to ensure its effectiveness and generalisability.
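
As an illustration of the balancing step, the following sketch implements plain random oversampling, in which minority-class samples are duplicated at random until every class matches the largest one. This is a stand-in chosen for simplicity: the specific oversampling method used in the study is not described in this section, and the function name and toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples until all classes are the same size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for features in rows + extra:
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


# Toy training split in which class 0 is under-represented.
X_train = [[0.10], [0.50], [0.55], [0.60], [0.90], [0.95]]
y_train = [0, 2, 2, 2, 4, 4]
X_bal, y_bal = random_oversample(X_train, y_train)
counts = Counter(y_bal)
```

Oversampling is normally applied to the training split only, so that duplicated samples do not appear in the test set and inflate the scores; this is one way of limiting the overfitting risk mentioned above.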

5. Conclusions

In this study, we proposed a machine-learning-based method to analyse a set of psychological tests in order to determine several characteristics in people. We used several surveys to measure up to 26 factors related to multiple psychological features. In particular, we focused on the Flow variable, and we developed models to estimate it from variables that are not directly related to it. We implemented linear regression, k-nearest neighbours, regression tree, random forest, support vector machine, and artificial neural networks.

In our study, MLP demonstrated the best performance in predicting the Flow variable, with an MSE of 0.007122, an R² of 0.604601, and an EV of 0.607038, outperforming RF on MSE and R² in both the learning and test phases. Despite its simplicity, LR also showed a low prediction error. In terms of classification, MLP was the most accurate model, with an accuracy of 88.58% and similar values for recall and precision.

On the other hand, we observed that the misclassified samples were only those located at the very limit of their classes. This suggests that our solution works as a new tool for estimating Flow in individuals with high accuracy and precision, using variables measured by other tests that are not directly correlated with the Flow measure.

In future work, it would be interesting to investigate the potential application of advanced mathematical tools [54,55] in order to improve real-world decision-making problems such as the one presented in this study.

Author Contributions

All authors have contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Acknowledgments

We acknowledge financial support from the Energetic Intelligence Chair and RH Asesores Improving S.L.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

AI	Artificial Intelligence
ANN	Artificial Neural Network
DT	Decision Tree
EV	Explained Variance
kNN	k-Nearest Neighbour
LBFGS	Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm
LR	Linear Regression
MLP	Multi-Layer Perceptron
MSE	Mean Squared Error
RF	Random Forest
RMSE	Root Mean Squared Error
RT	Regression Tree
SGD	Stochastic Gradient Descent
SVR	Support Vector Regression
V1	Flow Variable

References

1. Dolce, P.; Marocco, D.; Maldonato, M.N.; Sperandeo, R. Toward a machine learning predictive-oriented approach to complement explanatory modeling. An application for evaluating psychopathological traits based on affective neurosciences and phenomenology. Front. Psychol. 2020, 11, 446.
2. Orrù, G.; Gemignani, A.; Ciacchini, R.; Bazzichi, L.; Conversano, C. Machine learning increases diagnosticity in psychometric evaluation of alexithymia in fibromyalgia. Front. Med. 2020, 6, 319.
3. Orrù, G.; Monaro, M.; Conversano, C.; Gemignani, A.; Sartori, G. Machine learning in psychometrics and psychological research. Front. Psychol. 2020, 10, 2970.
4. Gonzalez, O. Psychometric and machine learning approaches to reduce the length of scales. Multivar. Behav. Res. 2020, 56, 903–919.
5. Gonzalez, O. Psychometric and machine learning approaches for diagnostic assessment and tests of individual classification. Psychol. Methods 2021, 26, 236.
6. Di Nuovo, A.G.; Catania, V.; Di Nuovo, S.; Buono, S. Psychology with soft computing: An integrated approach and its applications. Appl. Soft Comput. 2008, 8, 829–837.
7. Khayamim, A.; Mirzazadeh, A.; Naderi, B. Portfolio rebalancing with respect to market psychology in a fuzzy environment: A case study in Tehran Stock Exchange. Appl. Soft Comput. 2018, 64, 244–259.
8. Hosseinalipour, A.; Gharehchopogh, F.S.; Masdari, M.; Khademi, A. A novel binary farmland fertility algorithm for feature selection in analysis of the text psychology. Appl. Intell. 2021, 51, 4824–4859.
9. Adjerid, I.; Kelley, K. Big data in psychology: A framework for research advancement. Am. Psychol. 2018, 73, 899.
10. Kuzma, M.; Andrejková, G. Predicting user’s preferences using neural networks and psychology models. Appl. Intell. 2016, 44, 526–538.
11. Jamone, L.; Ugur, E.; Cangelosi, A.; Fadiga, L.; Bernardino, A.; Piater, J.; Santos-Victor, J. Affordances in psychology, neuroscience, and robotics: A survey. IEEE Trans. Cogn. Dev. Syst. 2018, 10, 4–25.
12. Sánchez, Y.; Coma, T.; Aguelo, A.; Cerezo, E. Applying a psychotherapeutic theory to the modeling of affective intelligent agents. IEEE Trans. Cogn. Dev. Syst. 2020, 12, 285–299.
13. Grześ, M.; Hoey, J.; Khan, S.S.; Mihailidis, A.; Czarnuch, S.; Jackson, D.; Monk, A. Relational approach to knowledge engineering for POMDP-based assistance systems as a translation of a psychological model. Int. J. Approx. Reason. 2014, 55, 36–58.
14. Huang, H.; Shen, H.; Meng, Z.; Chang, H.; He, H. Community-based influence maximization for viral marketing. Appl. Intell. 2019, 49, 2137–2150.
15. Xue, D.; Wu, L.; Hong, Z.; Guo, S.; Gao, L.; Wu, Z.; Zhong, X.; Sun, J. Deep learning-based personality recognition from text posts of online social networks. Appl. Intell. 2018, 48, 4232–4246.
16. Rissler, R.; Nadj, M.; Li, M.X.; Loewe, N.; Knierim, M.T.; Maedche, A. To be or not to be in flow at work: Physiological classification of flow using machine learning. IEEE Trans. Affect. Comput. 2020, 14, 463–474.
17. Csikszentmihalyi, M. Fluir (Flow): Una Psicología de la Felicidad; Editorial Kairós: Barcelona, Spain, 2010.
18. Csikszentmihalyi, M. Flow: The psychology of optimal experience. Acad. Manag. Rev. 1991, 16, 636–640.
19. Roy, F. Development of a Work Climate Questionnaire. Master’s Thesis, University of Montreal, Montreal, QC, Canada, 1989.
20. Sheetal, A.; Savani, K. A machine learning model of cultural change: Role of prosociality, political attitudes, and Protestant work ethic. Am. Psychol. 2021, 76, 997.
21. Lara-Álvarez, C.; Mitre-Hernandez, H.; Flores, J.J.; Pérez-Espinosa, H. Induction of emotional states in educational video games through a fuzzy control system. IEEE Trans. Affect. Comput. 2021, 12, 66–77.
22. Nader, M.; Bernate, S.P.P.; Santa-Bárbara, E.S. Predicción de la satisfacción y el bienestar en el trabajo: Hacia un modelo de organización saludable en Colombia. Estud. Gerenc. 2014, 30, 31–39.
23. van Oortmerssen, L.A.; Caniëls, M.C.J.; van Assen, M.F. Coping with work stressors and paving the way for flow: Challenge and hindrance demands, humor, and cynicism. J. Happiness Stud. 2020, 21, 2257–2277.
24. Peifer, C.; Syrek, C.; Ostwald, V.; Schuh, E.; Antoni, C.H. Thieves of flow: How unfinished tasks at work are related to flow experience and wellbeing. J. Happiness Stud. 2019, 21, 1641–1660.
25. Bawa, P. Learning in the age of SARS-CoV-2: A quantitative study of learners’ performance in the age of emergency remote teaching. Comput. Educ. Open 2020, 1, 100016.
26. Soriano, A.; Kozusznik, M.W.; Peiró, J.M.; Demerouti, E. Employees’ work patterns–office type fit and the dynamic relationship between flow and performance. Appl. Psychol. 2021, 70, 759–787.
27. Güngör, Z.; Serhadlıoğlu, G.; Kesen, S.E. A fuzzy AHP approach to personnel selection problem. Appl. Soft Comput. 2009, 9, 641–646.
28. Thelwall, M. TensiStrength: Stress and relaxation magnitude detection for social media texts. Inf. Process. Manag. 2017, 53, 106–121.
29. Karthik, L.; Kumar, G.; Keswani, T.; Bhattacharyya, A.; Chandar, S.S.; Rao, K.B. Protease inhibitors from marine actinobacteria as a potential source for antimalarial compound. PLoS ONE 2014, 9, e90972.
30. De Mauro, A.; Greco, M.; Grimaldi, M.; Ritala, P. Human resources for big data professions: A systematic classification of job roles and required skill sets. Inf. Process. Manag. 2018, 54, 807–817.
31. Seligman, M.; Csikszentmihalyi, M. Flow and the Foundations of Positive Psychology; Springer: Berlin/Heidelberg, Germany, 2014; pp. 279–298.
32. Fernández Macías, M.Á.; Godoy Izquierdo, D.; Jaenes Sánchez, J.C.; Bohórquez Gómez-Millán, M.R.; Vélez Toral, M. Flow y rendimiento en corredores de maratón. Revista de Psicología del Deporte 2015, 24, 9–19. Available online: http://hdl.handle.net/11441/57930 (accessed on 8 April 2021).
33. Schwarzer, R.; Baessler, J. Evaluación de la autoeficacia: Adaptación española de la escala de autoeficacia general. Ansiedad Y Estrés 1996, 2, 1–8.
34. Schaufeli, W.B.; Bakker, A.B.; Salanova, M. The measurement of work engagement with a short questionnaire: A cross-national study. Educ. Psychol. Meas. 2006, 66, 701–716.
35. Diener, E.; Emmons, R.A.; Larsen, R.J.; Griffin, S. The satisfaction with life scale. J. Personal. Assess. 1985, 49, 71–75.
36. Atienza, F.L.; Pons, D.; Balaguer, I.; García-Merita, M. Propiedades psicométricas de la escala de satisfacción con la vida en adolescentes. Psicothema 2000, 12, 314–319.
37. Pons, D.; Atienza, F.L.; Balaguer, I.; García-Merita, M. Propiedades psicométricas de la escala de satisfacción con la vida en personas de tercera edad. Rev. Iberoam. Diagnóstico Evaluación Psicológica 2002, 13, 71–82.
38. Fernández-Berrocal, P.; Extremera, N.; Ramos, N. Validity and reliability of the Spanish modified version of the trait meta-mood scale. Psychol. Rep. 2004, 94, 751–755.
39. Bradley, R.T.; McCraty, R.; Atkinson, M.; Arguelles, L.; Rees, R.A.; Tomasino, D. Reducing test anxiety and improving test performance in America’s schools. In Results from the TestEdge® National Demonstration Study; Institute of HeartMath: Boulder Creek, CA, USA, 2007; Available online: http://www.issuelab.org/resources/3089/3089.pdf (accessed on 8 April 2021).
40. Vigil-Colet, A.; Morales-Vives, F.; Camps, E.; Tous, J.; Lorenzo-Seva, U. Development and validation of the overall personality assessment scale (OPERAS). Psicothema 2013, 25, 100–106.
41. O’Grady, K.E. Measures of explained variance: Cautions and limitations. Psychol. Bull. 1982, 92, 766–777.
42. Nagelkerke, N.J. A note on a general definition of the coefficient of determination. Biometrika 1991, 78, 691–692. Available online: http://pdfs.semanticscholar.org/1970/6b6e9ba4050a20f2980bea1de35d23882b51.pdf (accessed on 8 April 2021).
43. Wang, Z.; Bovik, A.C. Mean squared error: Love it or leave it? A new look at signal fidelity measures. IEEE Signal Process. Mag. 2009, 26, 98–117.
44. Sokolova, M.; Japkowicz, N.; Szpakowicz, S. Beyond Accuracy, F-Score and ROC: A Family of Discriminant Measures for Performance Evaluation; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1015–1021.
45. Goutte, C.; Gaussier, E. A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation; Springer: Berlin/Heidelberg, Germany, 2005; pp. 345–359.
46. Norouzi, M.; Fleet, D.J.; Salakhutdinov, R.R. Hamming Distance Metric Learning. Adv. Neural Inf. Process. Syst. 2012, 25, 1061–1069.
47. Heazlewood, I.; Walsh, J.; Climstein, M.; Kettunen, J.; Adams, K.; DeBeliso, M. A comparison of classification accuracy for gender using neural networks multilayer perceptron (MLP), radial basis function (RBF) procedures compared to discriminant function analysis and logistic regression based on nine sports psychological constructs to measure motivations to participate in masters sports competing at the 2009 world masters games. In Proceedings of the 10th International Symposium on Computer Science in Sports (ISCSS); Springer: Cham, Switzerland, 2016; pp. 93–101.
48. Brown, S.H. Multiple linear regression analysis: A matrix approach with MATLAB. Ala. J. Math. 2009, 34, 1–3. Available online: http://ajmonline.org/2009/brown.pdf (accessed on 8 April 2021).
49. Hastie, T.; Tibshirani, R. Discriminant adaptive nearest neighbor classification and regression. Adv. Neural Inf. Process. Syst. 1996, 8, 409–415.
50. Drucker, H.; Burges, C.J.; Kaufman, L.; Smola, A.J.; Vapnik, V. Support vector regression machines. Adv. Neural Inf. Process. Syst. 1997, 9, 155–161.
51. Nagy, G.I.; Barta, G.; Kazi, S.; Borbély, G.; Simon, G. GEFCom2014: Probabilistic solar and wind power forecasting using a generalized additive tree ensemble approach. Int. J. Forecast. 2016, 32, 1087–1093.
52. Gardner, M.W.; Dorling, S.R. Artificial neural networks (the multilayer perceptron)—A review of applications in the atmospheric sciences. Atmos. Environ. 1998, 32, 2627–2636.
53. Pérez-Moreiras, E. La Inteligencia y el Coaching Energéticos: Una Aproximación Basada en la Evidencia Desde la Psicología para el Desarrollo de la Inteligencia Energética (Energetic Intelligence), el Fluir (Flow) y el Florecer (Flourishing). Ph.D. Thesis, Universitat Rovira i Virgili, Catalonia, Spain, 2020. Available online: http://hdl.handle.net/10803/670608 (accessed on 8 April 2021).
54. Mohammad, M.M.S.; Abdullah, S.; Al-Shomrani, M.M. Some linear Diophantine fuzzy similarity measures and their application in decision making problem. IEEE Access 2022, 10, 29859–29877.
55. Yahya, M.; Abdullah, S.; Almagrabi, A.O.; Botmart, T. Analysis of S-box based on image encryption application using complex fuzzy credibility Frank aggregation operators. IEEE Access 2022, 10, 88858–88871.

Figure 1. Flow chart of the experimental process.

Figure 2. Importance of the variables with respect to Flow using the random forest method.

Figure 3. Example of a confusion matrix for a multi-layer perceptron in test.

Figure 4. Misclassified points, according to their label and their normalised value obtained from the raw data.

Table 1. Summary of the conducted tests and the related variables they measure.

Variable	Test Name
Self-efficacy	Generalised self-efficacy (EAG) [33]
Engagement	Engagement (UWES 9) [34]
Flourishing	Flourishing scale [35]
Self-esteem	Self-esteem scale [34]
Satisfaction with life	Satisfaction with life (SWLS) [35,36,37]
Flow	Flow scale (EBF, Flow 9) [32]
Emotional intelligence	Spanish Modified Trait Meta-Mood Scale-24 (TMMS-24) [38]
Personality	OPERAS [40]
Personal and organisational quality	VCPO R4 [39]

Table 2. Statistics of the Flow variable.

Metric	Value
Min	9
Max	45
Mean	32.61
SD	4.91

Table 3. Results of the variable Flow.

Model	Train MSE	Train R²	Train EV	Test MSE	Test R²	Test EV
LR	0.01082	0.42208	0.42567	0.00965	0.53205	0.53519
kNN	0.00969	0.44436	0.45144	0.01159	0.43792	0.44138
SVR	0.00727	0.58710	0.59062	0.00976	0.52666	0.53061
RT	0.01576	0.10924	0.11832	0.02054	0.00434	0.02807
RF	0.00745	0.57794	0.58186	0.00817	0.60382	0.60781
MLP	0.00721	0.61488	0.61491	0.00712	0.60460	0.60703
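
The comparison discussed in the text can be reproduced directly from the test columns of Table 3. The short sketch below copies those values and reports which model ranks first on each metric and which models obtain a lower test MSE than LR. The helper name is illustrative only.

```python
# Test-stage values copied from Table 3: (MSE, R2, EV) per model.
TEST_RESULTS = {
    "LR": (0.00965, 0.53205, 0.53519),
    "kNN": (0.01159, 0.43792, 0.44138),
    "SVR": (0.00976, 0.52666, 0.53061),
    "RT": (0.02054, 0.00434, 0.02807),
    "RF": (0.00817, 0.60382, 0.60781),
    "MLP": (0.00712, 0.60460, 0.60703),
}


def best_model(results, metric_index, lower_is_better):
    """Return the name of the model that ranks first on one metric."""
    pick = min if lower_is_better else max
    return pick(results, key=lambda name: results[name][metric_index])


best = {
    "MSE": best_model(TEST_RESULTS, 0, lower_is_better=True),
    "R2": best_model(TEST_RESULTS, 1, lower_is_better=False),
    "EV": best_model(TEST_RESULTS, 2, lower_is_better=False),
}

# Models whose test MSE is lower than that of linear regression.
lr_mse = TEST_RESULTS["LR"][0]
beats_lr = sorted(name for name, row in TEST_RESULTS.items() if row[0] < lr_mse)
```

With these values, MLP ranks first on MSE and R², RF ranks first on EV by a small margin, and only MLP and RF obtain a lower test MSE than LR.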

Table 4. Classification results for the Flow variable.

Model	Train Accuracy	Train Recall	Train Precision	Train H. Loss	Test Accuracy	Test Recall	Test Precision	Test H. Loss
SVM	0.723684	0.651678	0.729184	0.276316	0.634884	0.499130	0.536051	0.365116
kNN	0.175660	0.084199	0.118585	0.074260	0.587209	0.377096	0.544471	0.412791
DT	0.154102	0.127478	0.146377	0.093898	0.540698	0.492626	0.593977	0.459302
RF	0.681355	0.506213	0.623433	0.318645	0.639535	0.456988	0.524155	0.360465
MLP	0.975382	0.975473	0.975550	0.024618	0.885808	0.884328	0.883165	0.114192
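
When each sample carries a single label, the Hamming loss computed as the fraction of wrongly labelled samples equals one minus the accuracy. The sketch below applies this identity as a simple consistency check on the accuracy and H. Loss columns of Table 4. It assumes that single-label definition, which the text does not state explicitly, and the tolerance is an arbitrary choice that allows for rounding.

```python
# (accuracy, Hamming loss) pairs copied from Table 4.
TABLE_4 = {
    ("SVM", "train"): (0.723684, 0.276316),
    ("kNN", "train"): (0.175660, 0.074260),
    ("DT", "train"): (0.154102, 0.093898),
    ("RF", "train"): (0.681355, 0.318645),
    ("MLP", "train"): (0.975382, 0.024618),
    ("SVM", "test"): (0.634884, 0.365116),
    ("kNN", "test"): (0.587209, 0.412791),
    ("DT", "test"): (0.540698, 0.459302),
    ("RF", "test"): (0.639535, 0.360465),
    ("MLP", "test"): (0.885808, 0.114192),
}


def inconsistent_rows(table, tol=1e-5):
    """List the rows where accuracy + Hamming loss differs from 1."""
    return sorted(
        key for key, (acc, loss) in table.items() if abs(acc + loss - 1.0) > tol
    )


flagged = inconsistent_rows(TABLE_4)
```

With the values as printed, the identity holds for every test row and for the SVM, RF, and MLP training rows, whereas the kNN and DT training rows do not satisfy it. This section does not indicate whether those two rows were computed differently or contain a transcription issue, so they are best read with caution.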

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
