1. Introduction
Communication technologies have changed the way the average person lives in many respects. As the Internet becomes an integral part of everyday life for more and more people, accurately identifying the demographic characteristics of Internet users has become paramount, for reasons related both to user security and to the best use of Internet services. Profiling unknown users by identifying certain inherent or acquired characteristics, such as their age and educational level, is essential for various applications, including personalised content delivery, targeted advertising, and customisation of the user experience. In this context, the use of keystroke dynamics as a means of extracting valuable demographic information has garnered considerable attention. Keystroke dynamics, a branch of behavioural biometrics, focuses on analysing the unique typing patterns exhibited by individuals [1]. These rhythms and patterns are idiosyncratic [2], in the same way as an individual’s handwriting or signature, due to the similar underlying neurophysiological mechanisms. By studying different typing patterns and their correlations with demographic characteristics, keystroke dynamics provides a novel approach to demographic profiling.
Bulgarian is a Slavic language spoken by about 9 million people, mostly in Bulgaria and neighbouring countries. It is the official language of Bulgaria and has historical importance in the Balkan region. Bulgarian has distinct linguistic features, such as the Cyrillic alphabet used for writing [3], that may influence typing behaviour. However, the application of keystroke dynamics in the Bulgarian linguistic context remains relatively unexplored. Investigating the feasibility and effectiveness of keystroke dynamics within the Bulgarian-speaking population is crucial for the development of accurate and reliable demographic profiling techniques tailored to this population and, more broadly, to the Slavic language group, which numbers approximately 400 million people, mostly in Eastern Europe and Northern Asia.
Traditional methods of demographic profiling, such as surveys and questionnaires, often suffer from limitations such as subjectivity, response bias, and reliance on user self-reporting. Some more modern methods, such as recognising characteristics from facial photographs or studying the text the user has written, require specific multimedia files or access to personal data. In contrast, keystroke dynamics leverages data derived from how users type and not what they type, so no access to their personal information is required. Data recording does not require sophisticated hardware either, as a simple physical or virtual keyboard is enough. Furthermore, the method is non-intrusive, and data can be recorded continuously without interfering with the users’ ongoing work.
Creating the profile of an unknown user has several applications. First, unsuspecting users may be alerted to the possibility of falling victim to malicious users hiding their characteristics. Second, sites, forums, products, and services that match a user’s profile can be recommended, saving them valuable time searching the Internet. Third, authentication can be enhanced, as the identification will be enriched with additional information. Fourth, recording data from Bulgarian typists facilitates a more comprehensive understanding of the typing behaviours exhibited by Bulgarian speakers, taking into account the specific linguistic characteristics of the language.
This paper endeavours to identify specific characteristics of unidentified Internet users by analysing their typing patterns. The primary objectives are twofold: firstly, to fortify security measures, safeguarding genuine users from potentially fraudulent activities; and secondly, to optimise the utilisation of Internet services. In this way, the work aims to enhance user authentication, to protect unsuspecting individuals, and to improve the efficiency and user experience of various online services.
The rest of the paper is structured as follows. First, a summary of the existing literature relevant to the topic under consideration is provided. Then, the methodology followed is described and its details are discussed. In Section 4, the results for each of the four machine learning models used are presented. Finally, the paper concludes by presenting possible future directions for this research.
2. Background
The idea of keystroke dynamics dates back to the late 1800s. It grew out of a long-held belief that Morse code senders could identify each other by speed and rate of transmission; telegraphers recognised one another through what they called the “sender’s punch”. In the 1980s, the U.S. National Science Foundation (NSF) conducted research which determined that each person has his or her own typing style. This is captured by the NSF’s keystroke recognition method, which analyses and processes the way a person types on a keyboard [4].
The examination of how keyboard use can serve as a recognisable hallmark began as early as the mid-1970s. It was first highlighted in Spillane’s research [5], which introduced the idea of identifying users by the way they type. An important contribution was also made by Forsen et al. [6], who analysed keystroke dynamics as one of the biometric characteristics that can be used to verify the identity of a user requesting access to a system. One of the first studies on this topic was conducted by Gaines et al. [7]. They had a group of seven secretaries type the same three paragraphs twice over a period of four months. A total of 300 to 400 words were required both during the writing phase and for each comparison. Time delays between successive keystrokes were measured, and the analysis was based on a limited number of digraphs (two consecutive letters). Although the results were very encouraging (a false acceptance rate, FAR, of 0% and a false rejection rate, FRR, of 4%), the sample size was too small and the volume of data required was too large.
Another study was conducted by Umphress and Williams in 1985 [8]. In this work, the time delay between consecutive key presses was also used to authenticate the user. Approximately 1400 key presses were needed to generate a profile for each user, and another 300 characters were needed each time authentication was performed. The FAR achieved was 6%, but the volume of data required was clearly very large. A similar study was conducted by Leggett and Williams [9] with data obtained from 17 computer programmers. The system developed showed an FAR of 5% and an FRR of 5.5%. However, a major drawback of this method is again the large amount of data needed: in total, each programmer had to write over 1000 words.
Canales et al. [10] attempted to create an authentication system for students who are examined online. For this purpose, they used data from keystroke dynamics and stylometry. Specifically, known keystroke dynamics features and 82 stylometric features were extracted, which were character-based, word-based, and syntactic. Data were collected from the recording of 40 students, and a K-NN was used as a classifier. FAR and FRR were chosen as metrics, and experimental results showed that authentication was more successful when only keystroke dynamics features were used.
In their study, Zhong et al. [11] contributed a new distance metric in the research field of user authentication through keystroke dynamics, which was a combination of Manhattan distance and Mahalanobis distance, attempting to exploit the advantages of the two metrics and eliminate their disadvantages. To test the performance of their system they used the CMU keystroke dynamics benchmark dataset, and showed that the new metric they proposed outperformed other distance metrics.
In another study by Monrose and Rubin [12], the aim was user recognition, and each volunteer was therefore asked to enter a specific sentence as well as a sentence of their choice. The success rates were less satisfactory when the texts were unfamiliar, but when the text to be copied by the user was specified, they reached 90.7%.
Ayotte et al. [13] attempted to address the problem of requiring a lot of data to achieve a high success rate for user authentication through keystroke dynamics. For this purpose, they introduced the instance-based tail area density (ITAD) metric, a new graph comparison algorithm, to significantly reduce the number of keystrokes required for user authentication. The classifier they used was random forest, and in addition to the very good results they achieved, they showed that the most commonly used keystroke dynamics features, namely keystroke durations and digram latencies, are the most effective.
In a different field, Acien et al. [14] presented a comprehensive exploration of long short-term memory (LSTM) networks for keystroke biometric authentication on a large scale in free-text scenarios. Their research assessed the performance of LSTMs trained with a moderate number of keystrokes per user. They considered various machine learning models, training sample sizes, keystroke sequence lengths, and databases based on different device types, such as physical and virtual keyboards. Their methodology achieved an equal error rate (EER) of 2.2% and 9.2% for physical and virtual keyboards, respectively. In fact, they showed that it can also be used in an authentication system involving many users, since the error rates increased only slightly even when there was data from 100,000 people.
Sahu et al. [15] dealt with the problem encountered in some systems where multiple users are involved and one user connects to another user’s account. To solve this problem, they resorted to keystroke dynamics, and the algorithm they proposed involved techniques of data preprocessing, dimensionality reduction, data clustering, data embedding, and data localization and could be used directly on the typing data. To test the performance of their algorithm they used two available datasets. Ultimately, the rates of correct user identification they achieved were high.
In another study on user age estimation, Tsimperidis et al. [16] exploited a dataset containing 387 logs and extracted 700 keystroke dynamics features from them. The features extracted included keystroke durations and digram latencies. Using five different classifiers, experiments were conducted with different feature sets. The results of the experiments led to the development of a system that could identify the age group of an unknown user, among four different options, with an accuracy of about 90%.
Buriro et al. [17] investigated the possibility of estimating, among other things, the age of a user who types a PIN/password between 4 and 16 digits in length on mobile devices. Their data were collected from 150 volunteers on a specific device, and three classes were defined. They extracted temporal keystroke dynamics features and used several classifiers. The best results came from random forest, which had an accuracy of 87.9%.
In another study, Ulinskas et al. [18] utilised an existing keystroke dynamics dataset derived from a recording of 53 individuals typing the same password. The purpose of the study was to identify user fatigue. From these data, features relating to keystroke durations and digram latencies were extracted. Using six different classifiers, it was observed that the best results came from the latencies in the “up-up” graph, managing to identify fatigue with 91% accuracy.
In a different field, Pentel [19] collected keystroke data from various Web applications from 2011 to 2018. The log data were linked to the age and gender of the users and in some cases to other available information. A total of 2.3 million keystrokes from 7119 data logs were analysed, which came from approximately 1000 individuals and covered six different age groups. Binary and multi-class classification was applied using supervised machine learning methods. The results of the binary classification showed that performance was at the general baseline level, with the best F-score exceeding 0.92 and the lowest being 0.82. Through discriminative feature analysis, it was discovered that there was some overlap with features extracted from previous text mining studies.
Yan and Yan [20] developed a methodology to categorise blog writers according to their gender. To achieve this, they utilised a collection of 75,000 blog entries and used various word features such as frequency of occurrence, blog background colour, font type and style, as well as other features such as punctuation marks and emoticons. Their methodology achieved an F-measure equal to 0.68.
In another study, Jones et al. [21] collected data from user profiles and search keywords from Yahoo.com. They then created a model using a classifier based on SVM. This model achieved a very satisfactory accuracy of 83.8% in classifying the gender of users. It was also able to predict the age of the users with an accuracy of 63.9%.
Tsimperidis et al. [22] proposed a method to identify some characteristics of an unknown user through keystroke dynamics. They collected data from 110 volunteers during their daily device usage, trained five machine learning models using selected features, and used them to recognise the age group, handwriting, and education level of unknown users. The experimental results showed that this method can recognise the age group with 87.6% accuracy, handwriting with 97.0% accuracy, and education level with 84.3% accuracy for an unknown user.
In a different field, Pentel [23] focused on the analysis of unintended user activities in human–computer interactions. While user interfaces are usually designed to react only to intentional commands, users often perform unintentional activities that produce many cues for the user and can be used to plan the appropriate response by the system. Specifically, the goal of this research was to predict the age and gender of users through the analysis of data generated from mouse and keyboard devices. These data were collected from six different systems from 2011 to 2017 and include information from 1519 individuals. The machine learning models were able to predict both the age and gender of the user with very high accuracy. In particular, the F-score and accuracy metrics were above 0.9.
In their study, Cascone et al. [24] used keystroke dynamics on touch devices to classify demographic information, such as the user’s age and gender. The authors sought to investigate whether the process of touch typing, which includes information about the pressure applied to the keys, can be used to detect user demographic information. To achieve this, the researchers analysed the data collected during the touch typing process using various machine learning algorithms. Among the findings of the research, it emerged that younger people tend to type faster but with more errors, while older people tend to type slower but with fewer errors. This may suggest possible correlations between keystroke dynamics and user age characteristics.
Raul et al. [25] examined the use of keystroke dynamics as a biometric authentication method. The researchers analysed the data collected to evaluate the effectiveness of various authentication methods based on keystroke dynamics. They also studied various algorithms based on statistics and machine learning to analyse their strengths and weaknesses. The research found that there is a need to extend the keystroke dynamics dataset to include all the key features.
One of the most prominent issues in classification studies is user classification based on their age, probably because age is one of the personal characteristics that people choose not to declare or often misrepresent in order to avoid being noticed in case they commit malicious actions.
The study of Schler et al. [26] sought to determine the age of the author of a blog. The researchers collected their data from 71,493 blogs, which they classified according to the age of the author. For several of them, no age information was available, and for some of the classes they created, there were not enough data, resulting in three classes: the 10s (age group 13–17), the 20s (age group 23–27), and what they called the 30s (age group 33–46). As features for the classification, they used the frequency of occurrence of some words. The multi-class real Winnow algorithm was used for classification, in which a vector with as many dimensions as the chosen set of parameters was defined for each class. The final results showed that the age group of blog creators could be correctly predicted with 73% accuracy.
The study by Rao et al. [27] aimed to identify the characteristics of Twitter users, especially their age group, gender, regional origin, and political orientation. They proposed an approach to automatically discover a number of user attributes by examining their status messages, the social network structure, and the communication behaviour of the users. SVM was chosen as the classifier, and users were divided into people over 30 and under 30. The researchers tested the system and attained a classification accuracy of about 74%.
Keystroke dynamics can be used to identify the under-18 age group, thus offering an effective way to create a model to protect children from online threats. By implementing a restrictive firewall, an environment that is more suitable for this particular user group can be created [28]. It can also be exploited in e-commerce by creating product recommendation services that are tailored to the age and gender of the users. Furthermore, the ability to identify the age and gender of the user through keystroke dynamics can allow the creation of a system where content or advertisements are presented efficiently and targeted to the appropriate consumers, taking into account their preferences and characteristics [29].
Educational systems vary greatly between countries. International data on education should therefore be based on a classification that proposes, for all countries of the world, correct criteria for the distribution of educational programs at levels that can be considered comparable.
The educational level of an individual is an important characteristic in various surveys that have been carried out over the years.
While fixed-text keystroke dynamics biometrics are often used during the login process to provide an authentication, free-text biometric keystroke systems allow continuous authentication of a user during the entire session for increased security [30]. Furthermore, other studies [31,32] have exploited these additional user characteristics, such as age and gender, to improve the performance of the user authentication model.
Lin et al. [33] proposed an authentication system based on the analysis of keystroke dynamics features. These include the time a key is held down (known as keystroke duration) and the time between the release of a key and the pressing of the next one (known as the up-down digram latency). The purpose is to detect unauthorised users, even when they know the genuine password for an account. After collecting the data and extracting the necessary features, a convolutional neural network is applied, which achieves 99% accuracy in detecting legitimate users.
3. Methodology
Datasets on keystroke dynamics are difficult to find on the Internet, and most of those available come from recording users typing fixed text. In contrast, publishing free-text logging data carries the risk of leaking personal data, so such datasets are rarely made available. The methodology of this research consists of three consecutive phases. In the first phase, free-text data were collected from Bulgarian-speaking volunteers who agreed to participate in the process. In the second phase, appropriate keystroke dynamics features were extracted. Finally, in the third phase, machine learning algorithms, namely naïve Bayes, SVM, multilayer perceptron, and random forest, were used to classify users according to their age and educational level.
3.1. Data Collection
For the needs of the research, keylogger software was developed and installed on the volunteers’ computers. To ensure that sensitive and personal data of the volunteers, such as passwords or credit card numbers, would not be leaked, the researchers committed themselves by signing a consent form not to disclose the data to third parties and to the exclusive use of these data in this research. Furthermore, volunteers were given the option to activate the keylogger only when they wished to do so, in order to choose which data would be recorded.
For data collection, hundreds of individuals were approached, and ultimately, several dozen participated, generating a number of logfiles, each of which contained data from approximately 3500 keystrokes. Each participant could type at any time of the day and in any application, in order to capture as much of the participants’ daily typing as possible. No specific time frame was imposed, which adds versatility and reliability to the dataset, and therefore no specific keylogging sessions were defined.
Each keystroke action performed by the volunteers was recorded in the logs, which were in the following format:
Each line represents a record of the volunteer’s action. The first field represents the virtual key code of the key that the volunteer pressed or released. The second field, enclosed by the symbol “#”, indicates the date on which the action took place. The third field corresponds to the exact time when the action took place, expressed as an integer number representing milliseconds from the beginning of the day. The fourth field describes the type of action, with the word “dn” indicating the pressing of a key and “up” referring to the release of a key.
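The four-field record described above can be illustrated with a short parsing sketch. The description fixes only the order and meaning of the fields, so the comma separator, the date notation, the sample records, and the names `KeyEvent` and `parse_line` are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyEvent:
    key_code: int   # virtual key code of the key
    date: str       # date string as found between the '#' symbols
    time_ms: int    # milliseconds since the beginning of the day
    action: str     # "dn" for a key press, "up" for a key release


def parse_line(line: str, sep: str = ",") -> KeyEvent:
    """Parse one log record. The field separator is an assumption."""
    key_code, date, time_ms, action = (part.strip() for part in line.split(sep))
    if not (len(date) >= 2 and date.startswith("#") and date.endswith("#")):
        raise ValueError(f"date field not enclosed in '#': {date!r}")
    if action not in ("dn", "up"):
        raise ValueError(f"unknown action: {action!r}")
    return KeyEvent(int(key_code), date.strip("#"), int(time_ms), action)


# Hypothetical records, not taken from the dataset.
sample = [
    "65,#2022-04-01#,36000000,dn",
    "65,#2022-04-01#,36000095,up",
]
events = [parse_line(line) for line in sample]
```

In this hypothetical pair, the key with virtual key code 65 is pressed and then released 95 ms later; that difference is the kind of quantity used later as a keystroke duration.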
The recording of the volunteers whose native language is Bulgarian was carried out between 29 March 2022 and 16 May 2022. During this period 46 logfiles were collected.
Table 1 shows the demographic data of the Bulgarian-speaking volunteers studied in this research.
Admittedly, the dataset created is relatively small. However, to the best of the authors’ knowledge, no dataset with data from Bulgarian-speaking users, the target group of this research, is available on the Internet. Creating such a dataset is also difficult, because individuals are reluctant to take part in a process in which their typing is recorded. For these reasons, this dataset is considered the best that can be used.
An inspection of the logfiles obtained from the Bulgarian-speaking volunteers suggested that the users could be classified by age. This required a way to group the recorded ages of the users. Two classes were created: users up to 45 years old and users over 45 years old. The first class consists of 21 logfiles, while the second class consists of 25.
The data of the Bulgarian-speaking users also make classification by educational level both interesting and feasible. When registering for the project, users indicated their educational level according to the ISCED scale. The ISCED (International Standard Classification of Education) classification was developed by UNESCO in the mid-1970s and was first revised in 1997. A further revision of ISCED took place between 2009 and 2011 with extensive global consultations with countries, regional experts, and international organisations. Finally, ISCED 2011 was adopted by the UNESCO General Conference in November 2011 [34]. Users who declared ISCED2, ISCED3, or ISCED5 are considered non-university level, while users with ISCED6 or ISCED7–8 have university education and above.
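The two binary labellings described in this subsection (age up to 45 versus over 45, and non-university versus university education) can be written down as a minimal sketch. The function names, the class label strings, and the ASCII spelling `"ISCED7-8"` are hypothetical and chosen only for illustration.

```python
# ISCED levels grouped as in the text; the string spellings are assumptions.
NON_UNIVERSITY_LEVELS = {"ISCED2", "ISCED3", "ISCED5"}
UNIVERSITY_LEVELS = {"ISCED6", "ISCED7-8"}


def education_class(isced_level: str) -> str:
    """Map a declared ISCED level to one of the two education classes."""
    if isced_level in UNIVERSITY_LEVELS:
        return "university"
    if isced_level in NON_UNIVERSITY_LEVELS:
        return "non_university"
    raise ValueError(f"ISCED level not covered by the grouping: {isced_level!r}")


def age_class(age: int, threshold: int = 45) -> str:
    """Map an age in years to one of the two age classes (up to 45, over 45)."""
    return "up_to_45" if age <= threshold else "over_45"


labels = [(age_class(a), education_class(e))
          for a, e in [(45, "ISCED5"), (46, "ISCED6"), (30, "ISCED7-8")]]
```

A user aged exactly 45 falls in the first age class, consistent with the wording "up to 45 years old".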
3.2. Feature Extraction
Keystroke dynamics offers many different features, which fall into two major categories: temporal and non-temporal. Temporal features are the most common and include keystroke durations and digram latencies. In this research, keystroke durations were examined. The feature extraction software was written in Python and calculates the average duration of keystrokes for each user. Its output consists of files in a format readable by the WEKA 3.8.6 software.
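The extraction step can be sketched as follows. This is not the authors' software: it assumes that the average duration is computed per virtual key code, that a duration is the time between a "dn" event and the next "up" event of the same key, and that the output is a WEKA ARFF file with one numeric attribute per key. The function names, attribute names, and sample events are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def keystroke_durations(events):
    """Map key code -> list of hold times in ms.

    `events` is an iterable of (key_code, time_ms, action) tuples in
    chronological order, with action "dn" (press) or "up" (release).
    """
    pressed_at = {}                 # key code -> time of the pending press
    durations = defaultdict(list)
    for key_code, time_ms, action in events:
        if action == "dn":
            # Keep the first press; auto-repeat may send further "dn" events.
            pressed_at.setdefault(key_code, time_ms)
        elif action == "up" and key_code in pressed_at:
            held = time_ms - pressed_at.pop(key_code)
            if held >= 0:           # skip pairs that straddle midnight
                durations[key_code].append(held)
    return dict(durations)


def mean_durations(events):
    """Average hold time per key code."""
    return {key: mean(values) for key, values in keystroke_durations(events).items()}


def to_arff(rows, key_codes, classes, relation="keystrokes"):
    """Render (feature dict, label) rows as ARFF text; '?' marks a missing value."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute dur_{key} numeric" for key in key_codes]
    lines += [f"@attribute class {{{','.join(classes)}}}", "", "@data"]
    for features, label in rows:
        values = [f"{features[key]:.2f}" if key in features else "?" for key in key_codes]
        lines.append(",".join(values + [label]))
    return "\n".join(lines) + "\n"


# Hypothetical events: key 65 held for 95 ms and 105 ms, key 66 for 80 ms.
events = [
    (65, 1000, "dn"), (65, 1095, "up"),
    (66, 1200, "dn"), (66, 1280, "up"),
    (65, 1400, "dn"), (65, 1505, "up"),
]
features = mean_durations(events)
arff_text = to_arff([(features, "over_45")], [65, 66], ["up_to_45", "over_45"])
```

Each logfile would contribute one data row, so a dataset of 46 logfiles yields 46 rows under this layout.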
3.3. Classifier Selection and Model Evaluation
After a thorough evaluation, the four models that emerged with excellent accuracy and low time complexity were naïve Bayes, SVM (support vector machine), MLP (multilayer perceptron), and random forest. The model validation stage seeks to assure the correct operation and application of the models. There are a variety of techniques that can be used to verify the reliability of a model, and several of these were applied to validate the four models.
When evaluating machine learning algorithms, several metrics are used to compare results between different approaches. One of the most common is accuracy, which is the percentage of correctly classified instances in relation to the total number of instances. Other metrics provide further insight into the performance of the algorithms.
In particular, the time to build the model is critical, as it reflects the time required for training. In addition, two further metrics were used: the F-measure (F1), which is the harmonic mean of precision and recall [35] and is a safe measure even for unbalanced datasets, and the area under the receiver operating characteristic (ROC) curve (AUC) [36]. These additional metrics deepen the evaluation of models and help in the proper selection of machine learning methods.
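For a binary problem, the metrics named above can be computed as in the following sketch. It is a plain-Python illustration rather than the WEKA implementation used in the study (WEKA may, for example, report class-weighted averages), and the function names and toy labels are hypothetical. The AUC is computed through its rank interpretation: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example, not data from the study.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
metrics = binary_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

In the toy example, four of six instances are classified correctly, and precision and recall are both 2/3, so F1 is also 2/3; eight of the nine positive-negative score pairs are ordered correctly, giving an AUC of 8/9.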
5. Conclusions
It is part of everyday life for people to communicate over the Internet, and usually via text messaging. One of the major threats with this way of communication is those users who hide their personal characteristics, such as age and gender, and aim to deceive unsuspecting users. Due to the nature of online communication, hiding such information is an easy task. In order to protect unsuspecting users, various methods have been proposed to reveal some of the characteristics of anonymous users. To solve this problem, in this paper, a method based on keystroke dynamics is proposed.
For the purpose of this research, 46 logfiles containing data from keyboard usage were collected. After extracting the appropriate keystroke dynamics features, the performance of four different classifiers, namely naïve Bayes, support vector machine, multilayer perceptron, and random forest, was examined. The experimental results showed that it is possible to create fairly reliable systems that can identify the age and educational level of an unknown Internet user with an accuracy of 93.47% and 86.95%, respectively.
The ability to identify the age and educational level of a typing user is valuable in many areas, such as digital forensics, targeted advertising, and behavioural biometrics for user profiling. However, it is important to note that the development of such a system must comply with the current legal framework, as the analysis of unauthorised keystrokes may violate privacy, with potential implications for human security.
The contributions of the paper are therefore two-fold: first, the creation of a free-text keystroke dynamics dataset, which is not often found on the Internet, due to the risk of leaking volunteers’ personal data, and second, the novelty of using keystroke dynamics to detect certain features of unknown users.
In future research, keystrokes could be recorded from more users and from more devices, such as tablets and smartphones. Extending the dataset in this way would make any findings more valid, and would also allow more keystroke dynamics features to be examined and more classifiers to be tested. With data from such a future study, a complete profile of the user could be created, so that the user can be easily identified each time they try to enter a system and unauthorised access attempts can be rejected.