Article

Enhancing Web Application Security: Advanced Biometric Voice Verification for Two-Factor Authentication

by Kamil Adam Kamiński 1,2,*, Andrzej Piotr Dobrowolski 3, Zbigniew Piotrowski 3 and Przemysław Ścibiorek 4
1 Institute of Optoelectronics, Military University of Technology, 2 Kaliski Street, 00-908 Warsaw, Poland
2 BITRES Sp. z o.o., 9/2 Chałubiński Street, 02-004 Warsaw, Poland
3 Faculty of Electronics, Military University of Technology, 2 Kaliski Street, 00-908 Warsaw, Poland
4 POL Cyber Command, 2 Buka Street, 05-119 Legionowo, Poland
* Author to whom correspondence should be addressed.
Electronics 2023, 12(18), 3791; https://doi.org/10.3390/electronics12183791
Submission received: 1 August 2023 / Revised: 1 September 2023 / Accepted: 3 September 2023 / Published: 7 September 2023
(This article belongs to the Special Issue Biometric Recognition: Latest Advances and Prospects)

Abstract

This paper presents a voice biometrics system implemented in a web application as part of two-factor authentication (2FA) for user login. The web-based application, via a client interface, runs registration, preprocessing, feature extraction and normalization, classification, and speaker verification procedures based on a modified Gaussian mixture model (GMM) algorithm adapted to the application requirements. The article describes in detail the internal modules of this Automatic Speaker Recognition (ASR) system. A comparison with competing ASR systems, tested under the same conditions on the commercial NIST 2002 SRE voice dataset, is also presented. In addition, the article examines the influence of cepstral mean and variance normalization over a sliding window (WCMVN), which proves particularly relevant for voice recordings made on varying acoustic tracks. It also presents the selection of a reference model representing the alternative hypothesis in the decision-making system, which translates into a significant increase in the effectiveness of speaker verification. The final experiment is a test of the performance achieved in a varying acoustic environment during remote voice login to a web portal by a test group, together with a final adjustment of the decision threshold.

1. Introduction

The two-factor login method described in this article, based on a voice biometrics system, is implemented in the novel PicWATermark system for user verification and authentication. Verification in PicWATermark relies, among other things, on voice biometrics, but also on a marking method that embeds a watermark as an identifier in the digital material [1,2]. In addition, the module will have built-in copyright protection mechanisms for the creators of the digital material, e.g., to identify the digital images taken.
The voice biometrics system that is the subject of this article provides convenient functionality for the PicWATermark user, enhancing not only login convenience but also security. In addition, during the SARS-CoV-2 pandemic, it allows the PicWATermark system to be used efficiently without the need to remove a protective mask or gloves, which would be necessary when using facial biometrics or a fingerprint as the second login factor. It is also important to remember that the human voice, as a biometric, does not require additional passwords to be remembered or other sensitive data to be entered, and every user carries it with them at all times. This represents a significant advantage over additional passwords, which are becoming increasingly difficult to remember due to the complexity required.
The implementation of the voice biometrics module is shown in Figure 1. Voice authentication is one of the three possible two-factor authentication (2FA) methods in the PicWATermark system (alongside photo and PIN authorization). The voice authentication module in the system exists as a Docker container that communicates with the user authentication module via a REST API. The BIO module, which is part of the voice authorization module, consists of two submodules. The first is the learning submodule, which is used to create the user's voice model. The testing submodule, on the other hand, is used in biometric verification mode, during which the characteristic features of the voice are compared with the previously created model of the user's voice.
User login to the PicWATermark system using the voice authorization module must be preceded by the activation of two-factor authorization in the user account. During this operation, the user is asked to provide a 25 s voice sample, through which a voice model will be created. Once the model has been successfully generated, the user can log into the system using their voice. To do this, in addition to the standard login data (username and password), the user must record a 5 s voice sample, which is then analyzed by the testing submodule of the voice biometrics module (BIO). After successful verification of the login and password and a voice sample, the user is granted access to the PicWATermark system.
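As an illustration of this login flow, the snippet below sketches the client side of the second authentication step: sending the username, password, and a short voice sample to the server for verification. The endpoint path, field names, and response format are hypothetical placeholders, since the actual PicWATermark REST API is not documented in this article.

```python
# Minimal client-side sketch of the voice 2FA step described above.
# NOTE: the endpoint and JSON fields are hypothetical, not the real PicWATermark API.
import base64
import requests

def voice_login(base_url: str, username: str, password: str, wav_path: str) -> bool:
    """Send credentials plus a ~5 s voice sample; return True if access is granted."""
    with open(wav_path, "rb") as f:
        sample_b64 = base64.b64encode(f.read()).decode("ascii")
    resp = requests.post(
        f"{base_url}/api/2fa/voice-verify",        # hypothetical endpoint
        json={"username": username,
              "password": password,
              "voice_sample": sample_b64},
        timeout=30,
    )
    resp.raise_for_status()
    return bool(resp.json().get("verified", False))  # hypothetical response field
```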
The aforementioned architecture of the voice biometrics system has been further elaborated upon in Section 3. Secure login constitutes a daily challenge faced by nearly every computer or smartphone user in the contemporary world. Therefore, in response to this issue, the authors have attempted to create a voice biometric system and subsequently integrate it into a two-factor authentication system.

2. Related Works

In this section, an analysis of the latest scientific publications in the areas of 2FA and voice biometrics is presented.
In the first instance, the authors focus on presenting the most commonly used components of 2FA. As research demonstrates, the use of a second authentication factor has become widespread [3,4,5]. Only 21% of users rely on single-factor authentication, while as many as 72% use a second authentication factor to enhance account security.
In Table 1, the authors have compiled popular authentication factors constituting 2FA [6,7,8,9,10,11,12]. They have subjected them to comparison based on the following parameters:
  • Universality—every individual should possess the considered factor;
  • Uniqueness—the factor should ensure a high degree of differentiation among individuals;
  • Collectability—the factor should be measurable through practical means;
  • Performance—determines the potential for achieving accuracy, speed, and reliability;
  • Acceptability—society should not have reservations about the use of technology employed by the specific factor;
  • Spoofing—indicates the level of difficulty in intercepting and falsifying a sample of data from the respective factor.
Subsequently, the authors conducted a review of currently employed voice biometrics methods. The vast majority of authors use the cepstral method to create a unique vector of discriminant features, as well as mel-scale summation filters to create so-called mel-frequency cepstrum coefficients (MFCCs) [13,14,15,16,17,18,19,20,21,22,23]. Additional distinctive features are also used, such as T-phase features [3] or features based on Linear Predictive Coding (LPC) [13,16,19,20,21].
Another essential element of the architecture of speaker recognition systems is an effective classifier. A popular and effective classification method is the Gaussian mixture model with Universal Background Model (GMM-UBM), or its various modifications, using a high number of Gaussian distributions per voice model [13,14,15]. In recent years, there has also been a trend towards the use of Deep Neural Networks (DNN) [17,18,24], including Convolutional Neural Networks (CNN) [19], as well as so-called Long Short-Term Memory networks (LSTM) [16,20,21]. Other architectures include the Time Delay Neural Network (TDNN) [21,22,23,24,25] and the Sequence-to-Sequence Attentional Siamese Neural Network (Seq2Seq-ASNN) [26].
The authors of this paper use mel-cepstral features and weighted cepstral features during the generation of discriminant features, as well as a GMM classifier with a few Gaussian distributions, providing memory-saving processing, which is important for the implementation of the voice biometric system. For a detailed description of the various processing steps, see Section 3. It should be noted that cross-comparing speaker recognition systems is not straightforward, because a number of different commercial voice datasets are used to test the effectiveness of Automatic Speaker Recognition (ASR) systems. Nevertheless, it can be concluded that, depending on the speech processing methods and the voice dataset used, the above-mentioned authors achieve an equal error rate (EER) of between 16.09% and 0.73%.
The voice biometrics system described in this paper performs at a very good level relative to its competitors, taking additionally into account the fact that the experiments presented were carried out under real-world conditions using a cloud implementation of the ASR system. The individual test results are presented in Section 4 of the article.
Furthermore, it should be noted that most of the voice biometric systems described in the literature lack practical aspects and attempts at their real-world implementation. The authors of this article have taken on the challenge of developing and implementing a practical voice biometrics system that, considering Table 1, is excellently suited for use in 2FA systems. The utilization of voice biometrics for this purpose represents an innovative idea and a significant contribution to existing research, especially in comparison to commonly used additional security measures such as passwords, facial recognition, or fingerprint scans. Moreover, this approach is highly secure, as demonstrated in Section 4.

3. Methods

The structure of the ASR system described in this article is shown in Figure 2. The remainder of this section describes the individual modules of the ASR system in more detail, starting with speech acquisition and signal pre-processing and the associated proprietary methods for selecting the processed speech frames. The next element of the presented voice biometrics system is the process of extracting, selecting, and normalizing the unique characteristics of the speaker's voice; this process involves cepstral analysis of the speech signal as well as a genetic algorithm. This is followed by a presentation of the classifier, which uses Gaussian mixture models, a memory-efficient classification method rich in individual information. The final stage of processing is the decision-making process and the normalization of the final speaker verification result.

3.1. Signal Acquisition

The first stage of the ASR system is the acquisition of voice signals. This concerns both the signals subject to verification and the signals that serve as training material for the creation of voice models. During acquisition, the quality of the acoustic track used, as well as the conditions in which the voice recordings were made, are important considerations. In the presented implementation of the ASR system, the stability of the Internet connection is also important, due to the need to upload the recorded speech fragment to the PicWATermark server.

3.2. Signal Pre-Processing

During the pre-processing of the speech signal, several operations are carried out to prepare the signal for feature extraction, thereby minimizing the impact of the recording device on system performance.
The first of the processes implemented is the clipping of silences, which are a typical part of almost every vocal utterance. This operation makes it possible to reduce the number of signal frames processed, which increases the speed of voice recognition and, above all, the efficiency of correct speaker verification, as only the frames relevant to speaker recognition are analyzed. In the presented implementation of the system, silence clipping is implemented twice. The first “rough” clipping of silence is carried out by the front-end using Voice Activity Detection. Thanks to this approach, a selection of signal frames containing speech is already made during recording, which saves both time and the transfer of data sent by the user to the PicWATermark server. Further “fine” selection of ASR-relevant frames is carried out in further signal pre-processing.
Another operation performed on the processed speech signal, already taking place on the PicWATermark server, is its normalization, where two actions are performed, i.e., removal of the mean value and scaling. The removal of the mean value from the digital speech signal is due to imperfections in the acquisition process. A speech signal in physical terms is nothing more than variations in sound pressure, so its average value can be assumed to be zero. In the practical implementation of digital speech signal processing, this value is almost always non-zero. This is due to the processing of speech fragments of finite length. The second action performed on the signal is scaling, which compensates for the mismatch between the speech signal and the range of the transducer. This allows quietly recorded parts of speech to be amplified. In the case of an ASR system designed for speaker verification independent of speech content, it is not necessary to preserve the energy relations occurring between the individual signal fragments. Therefore, in the present system, scaling is implemented relative to the maximum value of the signal to avoid distortions and make maximum use of the available number representation.
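A minimal sketch of this normalization step, assuming a signal already loaded as a floating-point array, is shown below: the (nominally zero) mean introduced during acquisition is removed and the signal is scaled relative to its maximum absolute value.

```python
import numpy as np

def normalize_signal(x: np.ndarray) -> np.ndarray:
    """Remove the DC offset and scale the signal relative to its maximum value."""
    x = x.astype(np.float64)
    x = x - np.mean(x)                    # mean removal (finite-length recording offset)
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x    # scaling to make full use of the number range
```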
The next stage is signal filtering, known as pre-emphasis. It aims to compensate for the phenomenon found in the speech signal of lower amplitudes of higher frequency components relative to lower frequency components [27]. The aforementioned filtering is of greatest importance for frequencies above 3 kHz [28], so in the presented implementation of the ASR system, which processes signals with a maximum frequency of 4 kHz, its importance is minor.
The next step in the pre-processing of the speech signal is filtering in the frequency domain to reduce the components of the sound that are inaudible to humans. For this purpose, a high-pass filter with the amplitude characteristics shown in Figure 3 was used. The filter parameters were selected through an optimization process.
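The sketch below illustrates both filtering steps. Only the filter type and order (a 22nd-order Chebyshev type II high-pass filter, Figure 3) come from the article; the pre-emphasis coefficient, stopband attenuation, and edge frequency are illustrative assumptions, since the optimized parameter values are not listed here.

```python
import numpy as np
from scipy.signal import cheby2, sosfiltfilt

def pre_emphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n - 1] (alpha assumed)."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def highpass_cheby2(x: np.ndarray, fs: int = 8000,
                    f_edge: float = 100.0, atten_db: float = 40.0) -> np.ndarray:
    """22nd-order Chebyshev type II high-pass filter (edge frequency and attenuation assumed)."""
    sos = cheby2(22, atten_db, f_edge, btype="highpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)
```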
The speech signal processed in this way is then divided into short (quasi-stationary) fragments called frames. Each processed frame is analyzed separately and used in the creation of a separate distinctive feature vector.
Associated with the segmentation process is the windowing operation, i.e., the multiplication of the signal by an assumed time window whose width determines the length of the frame. The time window is moved along the time axis with a specific increment. To minimize the phenomenon of so-called spectral leakage, i.e., strong artifacts in the processed signal, the ASR system presented here uses a Hamming window with low sidelobe levels.
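A sketch of segmentation and windowing is given below. The Hamming window follows the description above, while the frame length and hop are illustrative assumptions (the article does not quote the optimized values).

```python
import numpy as np

def frame_signal(x: np.ndarray, fs: int = 8000,
                 frame_ms: float = 20.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split the signal into overlapping frames and apply a Hamming window."""
    frame_len = int(round(fs * frame_ms / 1000))
    hop_len = int(round(fs * hop_ms / 1000))
    n_frames = 1 + max(0, (len(x) - frame_len) // hop_len)
    window = np.hamming(frame_len)
    frames = np.stack([x[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * window                 # suppresses spectral leakage at frame edges
```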
The final element of speech signal pre-processing is the selection of signal frames relevant to the ASR system. The aforementioned silence clipping function is the first “coarse” element to eliminate silence for longer parts of speech. In the presented implementation of the ASR system, three additional mechanisms are used to allow more accurate selection of signal frames.
The first is implemented to extract only the voiced parts of speech, which carry information about the laryngeal tone. Voiced frames exhibit regularly spaced maxima in the frequency domain, which cannot be said of the unvoiced fragments of processed speech, which resemble a noise signal. The autocorrelation function allows the voicing of the signal frame under consideration to be determined. The highest value of the autocorrelation function is obtained for zero lag; however, this value is related to the energy of the signal, which is why the second maximum of the autocorrelation function is considered when looking for voiced frames and compared with an empirically determined voicing threshold [29].
Another criterion used to select representative signal frames is the re-detection of frames containing only the speech signal, this time with the elimination of shorter fragments of silence. The assumed minimum length of the removed silence frame and the applied offset are of the same size as those used in the extraction of individual features in this ASR system. The process of selecting a threshold value for this criterion was also subject to optimization [29].
The final stage of frame selection is carried out by checking the noise level of the frames. This is made possible by determining the fundamental frequency using two independent methods: autocorrelation (F0ac) and cepstral (F0c). These two methods of determining F0 differ in their resistance to signal noise. Proper use of this property makes it possible to determine which signal frames do not meet the accepted quality criterion (1) [30]. According to the literature, the autocorrelation method of determining the fundamental frequency is considered more accurate than the cepstral method; however, it is less robust to noise in the signal under consideration. Therefore, the smaller the difference between the fundamental frequencies determined by these two methods, the less noisy the frame under consideration can be assumed to be [29].
$$\left| F_{0c} - F_{0ac} \right| \leq p_f \cdot \min\left( F_{0c},\, F_{0ac} \right) \tag{1}$$
where $p_f$ represents the optimized threshold value [29].
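A minimal sketch of criterion (1) is shown below, assuming one windowed frame at a time: F0 is estimated independently with an autocorrelation-based and a cepstrum-based method, and the frame is kept only if the two estimates agree within the threshold. The F0 search range and the value of the threshold are illustrative assumptions.

```python
import numpy as np

def f0_autocorr(frame: np.ndarray, fs: int, fmin: float = 60.0, fmax: float = 400.0) -> float:
    """F0 from the second maximum of the autocorrelation function (the first is at lag 0)."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

def f0_cepstrum(frame: np.ndarray, fs: int, fmin: float = 60.0, fmax: float = 400.0) -> float:
    """F0 from the dominant peak of the real cepstrum in the expected quefrency range."""
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-12
    cep = np.fft.irfft(np.log(spectrum))
    lo, hi = int(fs / fmax), int(fs / fmin)
    quefrency = lo + int(np.argmax(cep[lo:hi]))
    return fs / quefrency

def frame_is_clean(frame: np.ndarray, fs: int = 8000, p_f: float = 0.1) -> bool:
    """Criterion (1): accept the frame if the two F0 estimates are consistent."""
    f0c, f0ac = f0_cepstrum(frame, fs), f0_autocorr(frame, fs)
    return abs(f0c - f0ac) <= p_f * min(f0c, f0ac)
```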

3.3. Extraction of Distinctive Features

Another essential module of two-factor login using a voice biometrics system is the generation of the speakers' personal characteristics. This stage is particularly important because errors and shortcomings at this point reduce the discriminatory capacity of the speakers' voices and cannot be compensated for in the later stages of the system's operation. The main objective of the parameterization is to transform the temporal input waveform so as to obtain as small a number of descriptors as possible containing the most relevant information about the speaker's voice, while minimizing their sensitivity to signal variation that is irrelevant to this system, i.e., dependent on the content of the speech or on the parameters of the acoustic track used during acquisition.
Due to the redundancy of the signal in the time domain, it is much more efficient from the point of view of this voice biometrics system to analyze it further in the frequency domain. One of the reasons for this approach is inspired by the functioning of the sense of hearing, which, in the course of evolution, has been adapted to interpret the amplitude-frequency envelope of the speech signal appropriately [31].
The frequency form of the speech signal is the initial element in the subsequent parameterization. In the presented voice biometrics system, two types of descriptors requiring further mathematical transformations of the amplitude spectrum have been used, namely weighted cepstral features and mel-cepstral features.
For the generation of weighted cepstral features, the next process is the logarithmization of the amplitude spectrum, whereby the multiplicative relationship between the slowly varying component and the amplitudes of the individual excitation-derived pulses is converted into an additive relationship. By subjecting such a signal to an inverse Fourier transform, the slowly varying waveforms associated with the transmittance of the vocal tract are placed close to zero on the cepstral time axis, called pseudo-time, while the pulses associated with the laryngeal excitation appear at approximately the period of the laryngeal signal and repeat every period. The final step in the generation of weighted cepstral features is to multiply the resulting signal in the pseudo-time domain by a summation filter bank that takes into account not only the maximum amplitudes of the bands in the cepstrum but also the values surrounding them, which also carry individual information about the speaker's voice.
During mel-cepstral feature generation, the Mel-Frequency Cepstrum Coefficients (MFCC) method was used [32]. It works by multiplying the amplitude Fourier spectrum by a mel filter bank, which mimics the human auditory organ and its non-linear sensitivity to stimuli from different frequency ranges. Figure 4 illustrates this filter bank for 30 filters and a maximum signal frequency of 4 kHz.
The processed signal is then logarithmized, similar to the weighted cepstral features. The final step in mel-cepstral feature generation is subjecting the signal to a cosine transformation for feature decorrelation.
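The sketch below shows the mel-cepstral branch using librosa rather than the authors' own implementation. The 30 mel filters and the 4 kHz upper limit follow Figure 4; the number of retained coefficients is an illustrative assumption.

```python
import librosa

def mfcc_features(wav_path: str, n_coeffs: int = 13):
    """Return an (n_coeffs x n_frames) MFCC matrix for one recording."""
    y, sr = librosa.load(wav_path, sr=8000)          # 8 kS/s, as used in the experiments
    return librosa.feature.mfcc(y=y, sr=sr,
                                n_mfcc=n_coeffs,      # assumed number of coefficients
                                n_mels=30,            # 30 mel filters (Figure 4)
                                fmax=4000)            # 4 kHz maximum signal frequency
```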

3.4. Selection of Distinctive Features

Worldwide research shows that using the maximum set of features does not always produce the best results [33,34,35,36]. Feature selection often offers the possibility of obtaining higher or the same classification accuracy for a reduced feature vector, which in turn translates into reduced computation time.
When assessing feature quality, some features may be in the form of measurement noise, degrading the ability to recognize a given pattern, while others may be highly correlated with each other, resulting in the dominance of these features over the others and usually adversely affecting the quality of the classification.
An important element is the choice of feature selection method. A wide variety of selection methods, ranging from fast ranking methods to time-consuming methods incorporating complex classifiers, are available in the literature. The best-known quality measures and feature selection methods include the Fisher coefficient, t-statistics, cross-correlation, sequential forward selection, genetic algorithms, and linear discriminant analysis.
In the system presented here, the authors used a genetic algorithm to select the most representative features of the speaker's voice. This method takes into account the synergy of the features and makes it possible to obtain an optimal set of them; however, it is time-consuming. The working principle of the implemented feature selector using a genetic algorithm is shown in Figure 5. A detailed description of the genetic feature selector created by the authors can be found in [34].
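The sketch below shows a generic wrapper-style genetic selector in the spirit of Figure 5; it is not the authors' selector from [34]. The population size, mutation rate, selection scheme, and the caller-supplied fitness function (e.g., a cross-validated classification score on the masked feature set) are all illustrative assumptions.

```python
import numpy as np

def ga_select(fitness, n_features: int, pop_size: int = 40,
              n_gen: int = 50, p_mut: float = 0.02, seed: int = 0) -> np.ndarray:
    """Evolve binary feature masks; `fitness(mask)` returns a score to maximize."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_features))
    for _ in range(n_gen):
        scores = np.array([fitness(mask) for mask in pop])
        parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]   # truncation selection
        cuts = rng.integers(1, n_features, size=pop_size // 2)     # one-point crossover
        children = np.array([np.concatenate((parents[i][:c],
                                             parents[(i + 1) % len(parents)][c:]))
                             for i, c in enumerate(cuts)])
        flips = rng.random(children.shape) < p_mut                 # bit-flip mutation
        children = np.where(flips, 1 - children, children)
        pop = np.vstack((parents, children))
    best = max(pop, key=fitness)
    return best.astype(bool)                                       # final feature mask
```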

3.5. Normalization of Distinctive Features

In the presented voice biometrics system, the normalization of distinctive features is realized using the mean value of a given feature and its standard deviation in the processed signal frame (cepstral mean and variance normalization over a sliding window, WCMVN) [28]. Normalization is performed according to the following formula:
$$\hat{x}_t(i) = \frac{x_t(i) - \mu_t(i)}{\delta_t(i)} \tag{2}$$
where $x_t(i)$ and $\hat{x}_t(i)$ are the i-th components of the feature vector in the considered frame before and after normalization, respectively, while $\mu_t(i)$ and $\delta_t(i)$ are, respectively, the mean value and standard deviation at time t computed over the feature vectors included in the normalization window [28].
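A sketch of WCMVN as in Equation (2) is given below; the sliding-window length (in frames) is an illustrative assumption.

```python
import numpy as np

def wcmvn(features: np.ndarray, win: int = 301, eps: float = 1e-8) -> np.ndarray:
    """Normalize each frame with the mean/std of a window centred on it.

    `features` has shape (n_frames, n_features)."""
    half = win // 2
    out = np.empty_like(features, dtype=np.float64)
    for t in range(len(features)):
        seg = features[max(0, t - half): t + half + 1]
        out[t] = (features[t] - seg.mean(axis=0)) / (seg.std(axis=0) + eps)
    return out
```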

3.6. Creation of Voice Models

Classification in the presented voice biometrics system is implemented based on a linear combination of Gaussian distributions. By making appropriate use of the distinctive features that constitute the learning data for the classifier, it is possible to create memory-efficient voice models, rich in individual information, called Gaussian mixture models (GMMs).
During the operation of the classifier, the models iteratively adjust their parameters, i.e., expectation values, covariance matrices, and distribution weights, to the learning data according to the Expectation Maximization (EM) algorithm.
The operation of the EM algorithm involves the iterative repetition of two steps. The first is the estimation of the a posteriori probability of the current model for the observations in the considered learning data. The second step is maximization, which determines the parameters of the new model [37] by maximizing the aforementioned likelihood function. Each subsequent step uses the quantities calculated in the previous step. The model learning process ends when the increase in the likelihood function becomes negligible or when the maximum number of iterations is reached.
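A sketch of voice-model training via EM is shown below, using scikit-learn's GaussianMixture in place of the authors' own implementation. The number of components and the diagonal covariance type are illustrative assumptions; the article states only that a small number of Gaussian distributions is used.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_voice_model(features: np.ndarray, n_components: int = 16) -> GaussianMixture:
    """Fit a GMM (weights, means, covariances) to a speaker's feature vectors with EM."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",   # assumed; keeps the model compact
                          max_iter=200,             # upper bound on EM iterations
                          reg_covar=1e-4,
                          random_state=0)
    gmm.fit(features)                               # alternates E and M steps until convergence
    return gmm
```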
During the speaker identification process, a decision is made as to which of the speakers, represented by the voice models λk (for k = 1, …, N, where N is the number of voices in the considered dataset), the recognized fragment of the voice signal, represented by the set of personal feature vectors X, most likely belongs to. The discrimination function then takes the form [37]:
$$g_k(X) = p(X \mid \lambda_k) \tag{3}$$
The selection of the most likely voice model is carried out by ranking according to the criterion [37].
$$k^* = \arg\max_k\, g_k(X) \tag{4}$$
In order to convert the multiplicative relationship between consecutive observations into an additive one, in a practical implementation the logarithm of the likelihood function (log-likelihood) is determined, and the criterion takes the form of Equation (5) [37]:
$$k^* = \underset{1 \le k \le N}{\arg\max} \sum_{t=1}^{T} \log p(x_t \mid \lambda_k) \tag{5}$$
where the probability p(xt|λk) is the weighted sum of the Gaussian distributions for a single observation t.
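Equation (5) then reduces to scoring the test feature vectors against every enrolled model and picking the highest total log-likelihood, as in the short sketch below (assuming models trained as above).

```python
import numpy as np

def identify_speaker(test_features: np.ndarray, models: dict) -> str:
    """`models` maps speaker IDs to fitted GaussianMixture objects."""
    # score_samples returns log p(x_t | λ_k) per frame; summing over t gives Equation (5)
    scores = {spk: gmm.score_samples(test_features).sum() for spk, gmm in models.items()}
    return max(scores, key=scores.get)
```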
Also associated with the classification process using GMM is the Universal Background Model (UBM), which is created using learning data from different classes [38,39].
The model has two main uses. The first is to use it as initiating data in the process of creating models of specific speakers. With this approach, the model can be trained in fewer iterations as it does not start with strongly outlying initial data. The GMM-UBM algorithm has been more extensively described and tested in earlier studies by the authors [39,40].
The universal voice model can also be used in a decision-making system to determine the alternative hypothesis proving that the signal comes from another speaker in the population [41,42].

3.7. Decision-Making System

In the decision-making system of the speaker verification system, two hypotheses are considered regarding the probability of the unique features of a recognized utterance under a given statistical model of the speaker. The hypotheses can be formulated as follows:
  • H0 (null hypothesis)—the voice signal X comes from speaker k,
  • H1 (alternative hypothesis)—the voice signal X comes from another speaker ~k from the population.
Deciding whether a voice signal X comes from speaker k or from another speaker ~k depends on the ratio of the null and alternative hypothesis likelihoods and on its comparison with the detection threshold θ. If we assume that the null hypothesis is represented by the $\lambda_{hyp}$ model and the alternative hypothesis by the $\lambda_{\overline{hyp}}$ model, this relationship can be described by Equation (6):
$$\Lambda(X) = \frac{p(X \mid \lambda_{hyp})}{p(X \mid \lambda_{\overline{hyp}})} > \theta \tag{6}$$
The above equation is called the likelihood ratio test (LRT) or Neyman–Pearson test [28,42]. The likelihood ratio (6) is also often given in the logarithmic form of Equation (7):
$$\Lambda(X) = \log \frac{p(X \mid \lambda_{hyp})}{p(X \mid \lambda_{\overline{hyp}})} = \log p(X \mid \lambda_{hyp}) - \log p(X \mid \lambda_{\overline{hyp}}) > \theta \tag{7}$$
The result obtained from Equation (5) can be taken as the null hypothesis term, since the GMM classifier created by the authors looks for the maximum value of the sum of the logarithms of the probability densities describing the occurrence of the feature vector $x_t$ in the speaker model $\lambda_k$. Accordingly, this relationship can be presented as follows:
$$\log p(X \mid \lambda_{hyp}) = \underset{1 \le k \le N}{\max} \sum_{t=1}^{T} \log p(x_t \mid \lambda_k) \tag{8}$$
where N is the number of all voices in the dataset and T is the number of personal feature vectors extracted from the recognized speech signal. The value of the log-likelihood of the alternative hypothesis $\log p(X \mid \lambda_{\overline{hyp}})$ in the system presented here is determined directly from the log-likelihood obtained with the universal voice model (UBM) described in Section 4.2:
$$\log p(X \mid \lambda_{\overline{hyp}}) = \log p(X \mid \lambda_{UBM}) \tag{9}$$
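The verification decision of Equations (7)-(9) can therefore be sketched as the log-likelihood ratio between the claimed speaker's GMM and the UBM, compared against the threshold θ. Averaging (rather than summing) the per-frame log-likelihoods, so that the score does not depend on utterance length, is an implementation assumption here.

```python
def verify_speaker(test_features, speaker_gmm, ubm_gmm, threshold: float = 0.0):
    """Return (accept, score) for the claimed identity using a GMM-UBM likelihood ratio."""
    llr = (speaker_gmm.score_samples(test_features).mean()      # log p(X | λ_hyp)
           - ubm_gmm.score_samples(test_features).mean())       # log p(X | λ_UBM)
    return bool(llr > threshold), float(llr)
```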

3.8. Normalizing the Outcome of the Verification

The final processing step in this voice biometrics system is the normalization of the user verification result. For this purpose, the authors used the C-normalization (combined normalization) method—a combination of Z-normalization (zero normalization) and T-normalization (test normalization), assuming that the results of Z and T normalization are independent random variables. During its implementation, the results were subject to transformation according to the formula [43]:
$$C = \frac{T + Z}{2} \sim N\!\left( \frac{\mu_Z + \mu_T}{2},\; \frac{\delta_Z^2 + \delta_T^2}{4} \right) \tag{10}$$
where µZ and µT are, respectively, the means resulting from the Z- and T-normalizations, and δZ and δT are the standard deviations of these normalizations. The first component of formula (10) (T-normalization) is implemented at test time (online): the test recording is scored against the declared speaker model and against a group of other cohort models, and the speaker under consideration is assigned the mean and variance of these scores. In the case of the other component of Equation (10) (Z-normalization), the model is scored against utterances of which the modeled speaker is not the author, and the speaker under consideration is assigned the mean and variance of these results.
In addition, an important element of this normalization is the proper selection of the models included in the cohort involved in determining the components of Equation (10).
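A sketch of the score normalization in Equation (10) is given below: Z-norm statistics come from impostor utterances scored against the claimed speaker's model (offline), T-norm statistics come from the test utterance scored against a cohort of other models (online), and the two normalized scores are averaged. Cohort selection itself is not shown.

```python
import numpy as np

def c_norm(raw_score: float,
           znorm_scores: np.ndarray,   # claimed model vs. impostor utterances (offline)
           tnorm_scores: np.ndarray    # test utterance vs. cohort models (online)
           ) -> float:
    """Combined (C) normalization: average of the Z- and T-normalized scores."""
    z = (raw_score - znorm_scores.mean()) / (znorm_scores.std() + 1e-8)
    t = (raw_score - tnorm_scores.mean()) / (tnorm_scores.std() + 1e-8)
    return (z + t) / 2.0
```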

4. Results

This section contains experimental results illustrating the optimization process of selected elements of the voice biometrics system that have not been investigated in the authors' previous research [44,45]. The first part of the section presents the impact of feature normalization on the effectiveness of voice biometrics in a multisession voice dataset, the second part presents a proposal to improve the performance of the decision-making system, and the last part provides the results of the optimization of the adopted decision threshold.

4.1. Impact of the Normalization of Distinctive Features on the Effectiveness of the ASR System

The first experiment assessed the speaker verification results obtained using a multisession voice dataset consisting of recordings of 50 speakers recorded on 10 independent acoustic tracks at a sampling rate of 8 kS/s [46]. This allows the performance of voice biometrics to be tested in a varied acoustic environment, making the presented results more realistic (red markings in Figure 6). An attempt was also made to additionally normalize the signal's distinctive features using the mean value of a feature and its standard deviation in the processed signal frame (WCMVN) [28] (blue markings in Figure 6). The results also cover two options for testing the ASR system. The first allows the system to be tested using the same acoustic tracks that were used to create the voice models (indicated by the circles in Figure 6). The second, more demanding option requires that acoustic tracks different from those used when creating the voice models are used when testing the ASR system (indicated by the triangles in Figure 6).
As the results in Figure 6 show, it is appropriate to use additional normalization of distinctive features, which directly increases the effectiveness of speaker verification. This increase is particularly noticeable for the more difficult testing option, in which different acoustic tracks are used for creating the voice models and for testing the ASR system. This is an extremely demanding scenario, but one that may occasionally occur in real-life use of an ASR system.

4.2. Optimization of the Adopted Alternative Hypothesis in the Decision-Making System

In subsequent experiments, a series of studies was carried out using the commercial NIST 2002 SRE voice dataset, which contains recordings of 330 voices (191 female, 139 male) [47]. The recordings were resampled to a sampling rate of 8 kS/s, similar to telephony conditions. The aim of the research was to test the impact of varying the selection of the alternative hypothesis $\lambda_{\overline{hyp}}$ of the likelihood ratio used to make a decision during speaker verification.
The first approach determined $\lambda_{\overline{hyp}}$ as the mean of the log-likelihoods of the M population models that are not also the declared model (Figure 7). A test of this solution is shown later in this section.
The other option involved the determination of an alternative hypothesis using a universal voice model (UBM), for which M speakers’ learning data was used. Figure 8 shows a diagram of the proposed solution. A test of this solution is shown in Figure 9 (blue lines).
For experiments requiring speaker identity (user/intruder), half of the voice models and all test signals were used. The results presented in Figure 9 illustrate the impact of the number of voices included in the reference model (necessary to determine the alternative hypothesis) on the quality of the classification. The options adopted mirror the solutions presented in Figure 7 and Figure 8, but are enriched by varying the selection of the nearest voice models.
The first option involves selecting the voices from the dataset closest to the speech fragment to be verified (online), which are then used to create a reference model $\lambda_{\overline{hyp}}$. The second (offline) option, on the other hand, involves selecting the voices closest to the speaker model created during learning and then creating the reference model $\lambda_{\overline{hyp}}$ from them. In addition, results for a UBM created from randomly selected voices from the dataset (green line) are also presented.
As can be seen from the above experiments, the best of the proposed options for the reference model is to use a universal voice model (UBM) created from the voice models most similar to the test (online) recording. The number of voices used for $\lambda_{\overline{hyp}}$, chosen by the authors, is 8, which is primarily due to the lowest EER value obtained.
Figure 10 shows the receiver operating characteristic (ROC) curve and the detection error trade-off (DET) curve for the most favorable option. In addition, the exact values of the area under the curve (AUC) and the equal error rate (EER) are presented.

4.3. Optimization of the Decision Threshold

For the final verification of the voice biometrics module implemented in the PicWATermark system, a series of tests was carried out, and a group of speakers was invited to perform them. The tests were conducted in three groups. The first involved 21 testers, who carried out the process of training and testing the voice biometrics module. Each speaker created their voice model on the first of the two available audio tracks. The first audio track consisted of a Trust GXT 232 MANTIS microphone with a USB connector and a Dell Latitude 5285 tablet with an integrated sound card. The other track, in turn, consisted of a Trust MC-1300 microphone with a 3.5 mm jack connector and a Lenovo Y510p notebook with an integrated sound card. Next, each speaker performed a login test for their own account on both audio tracks. In further illustrations, the more demanding variant of testing the system was used, i.e., logging in to the account from a different device than the one used to create the user's voice model. In addition, a series of simulated intrusion attempts was made on both acoustic tracks, with each tester logging into the account of the next person on the user list. This approach yields, for a dataset of n voices, n login attempts by the rightful users and n attempts by potential intruders.
Subsequent voice biometrics tests in the PicWATermark system were performed in two rounds (10 and 24 participants). These experiments were already taking place during the COVID-19 outbreak, forcing them to be conducted entirely remotely. After creating their voice model, each tester was asked to attempt to log in to their own account and to the account of the next speaker on the list. It should be noted at this point that each speaker logged in from a completely independent acoustic track, which makes the voice verification results obtained significantly more realistic.
A total of 55 speakers took part in the experiments across the three rounds, allowing the results of 110 logins to be collected (55 to their own accounts and 55 to another user's account as an intruder). Figure 11 illustrates the likelihood ratio results obtained when trying to log in to one's own account (green circles) and to another user's account (red circles). The experiments were carried out with a fixed decision threshold of "0" (blue line), resulting in an accuracy (ACC) of 91.82% over the 110 logins to the system.
A selection of decision threshold values (Figure 12) was also made, which resulted in a significantly higher classification accuracy of 93.64% for decision thresholds “4” (black point) and “9” (purple point).
The final evidence of the correct operation of the implemented speaker verification system is provided by the receiver operating characteristic (ROC) and detection error trade-off (DET) curves shown in Figure 13. In addition, the area under the curve (AUC) and the equal error rate (EER) were determined, which quantify the quality of the decision-making system. The results presented in Figure 13 are based on a sample of 110 speaker logins in the PicWATermark system.
Given these results and the fact that they were obtained on numerous independent acoustic tracks, it seems reasonable to adjust the decision threshold in further implementations of the PicWATermark system. As the choice of an appropriate decision threshold represents a trade-off between the convenience of logging into one's own account and the danger of intrusion by an intruder, the authors adopted the threshold at level "9" in further implementations of the system. This approach increases the achievable ACC value and increases the security of the system.
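For reference, the sketch below shows one way the EER and AUC values reported in this section can be computed from genuine (own-account) and impostor (other-account) login scores, using scikit-learn's ROC utilities; the score arrays are placeholders.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def eer_and_auc(genuine_scores: np.ndarray, impostor_scores: np.ndarray):
    """Return (EER, AUC) computed from the two score populations."""
    labels = np.concatenate([np.ones_like(genuine_scores), np.zeros_like(impostor_scores)])
    scores = np.concatenate([genuine_scores, impostor_scores])
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = int(np.argmin(np.abs(fpr - fnr)))        # operating point where FAR ≈ FRR
    return (fpr[idx] + fnr[idx]) / 2, auc(fpr, tpr)
```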

4.4. Comparison with Other Speaker Verification Methods

To test the validity of using the Gaussian mixture model method for speaker classification, the authors compared the results obtained with other available methods. The classification method based on Gaussian mixture models is not one of the latest trends in the field of voice biometrics; nevertheless, in situations where only a small data representation is available for a given class, it performs excellently. In order to confirm this thesis, the authors conducted independent tests using the commercial voice dataset NIST 2002 SRE [47] described in Section 4.2. The same test conditions were used, i.e., the voice recordings were divided into a 25 s training fragment and a 5 s test fragment. The experiment used all 330 available voices, which were resampled to a sampling rate of 8 kS/s. The authors compared the ASR system presented in this paper, using the authors' feature set and optimized GMM classifier, with two other available systems whose implementations could be downloaded and tested independently [48,49]. Independent testing of other speaker classification systems was necessary because results described by other authors are often difficult to compare, owing to the different voice datasets used to test the systems and the different ways in which the experiments were conducted, including the different lengths of recordings used to train and test the systems.
The first of the alternative speaker verification methods compared is the I-vector method [48], which aims to model the overall variability of the training data and compress the information into a low-dimensional vector. The classifier used was pre-trained on the LibriSpeech dataset of approximately 10,000 h of speech corpora [50]. The details of the operation of this method are beyond the scope of this article and are presented in [51].
The second speaker verification method that was compared is the YAMNET (Yet Another Mixture of Experts Network) method, which is a neural network model developed by Google for classifying various audio sounds [49]. It has been trained on a large base of different categories of sounds, including animal noises, music, and environmental sounds, among others. For the purposes of this article, a transfer learning technique was used to apply the YAMNET network to speaker verification. Implementation details of the YAMNET network are presented in the article [52,53].
Table 2 presents a summary of the EER values obtained for the three speaker verification methods, i.e., the optimized authors' GMM, I-vector, and YAMNET.

5. Conclusions

The web-based implementation of the voice biometrics system presented in this article has undergone extensive optimization and operational testing. The high speaker verification results obtained demonstrate that voice biometrics can be successfully used as a 2FA component. Tests were performed using commercial and in-house voice datasets. For example, in the light of research based on the NIST 2002 SRE dataset [13], the results achieved by the authors are at least satisfactory, achieving a 6.4% lower EER.
However, the most valuable experiment was the testing of the implemented voice biometrics system by speakers logging in remotely to their accounts from independent acoustic tracks. The tests resulted in an EER value of 7.27% when sampling 110 system logins by 55 testers. This is the closest to real-world use of the voice biometrics module in the PicWATermark system.
In addition, the authors conducted independent tests of other voice biometric system implementations based on the I-vector method and YAMNET. The tests were conducted under the same system testing conditions. The ASR system presented in the paper achieved an EER 1.22% lower than the I-vector classifier [48] and 1.52% lower than the YAMNET classifier [49]. This demonstrates the superior ability of the presented ASR system to classify speakers compared with the other systems. The relatively low EER obtained compared with other methods is due, among other things, to well-chosen discriminative features, appropriately optimized GMMs, and the small data representations for the given classes, for which GMMs generalize excellently.
Furthermore, it is important to note that the practical implementation of the present voice biometrics system in the context of 2FA, as presented in the article, opens new avenues for the utilization of this type of biometrics in security systems. The objective set forth by the authors, which aimed at creating an effective voice biometrics system and implementing it in a practical 2FA setting, has been successfully achieved.
Certainly, the developed voice biometrics system needs to be further improved to address the current threats facing voice biometrics, such as impersonation attempts that use deepfake technology, which, together with the development of artificial intelligence, could pose a major challenge to voice biometrics in the future. Nevertheless, this work is a significant contribution to the practical application of voice biometrics in 2FA systems, highlighting the difficulties and challenges faced by the authors of similar systems, including the system's resistance to the diversity of voice recording devices and the scalability of a system architecture that allows for efficient handling of large user bases.

Author Contributions

Conceptualization, K.A.K., A.P.D. and Z.P.; methodology, A.P.D. and K.A.K.; software, K.A.K. and P.Ś.; validation, A.P.D.; formal analysis, K.A.K. and A.P.D.; investigation, K.A.K. and A.P.D.; resources, K.A.K.; data curation, K.A.K. and P.Ś.; writing—original draft preparation, K.A.K.; writing—review and editing, A.P.D. and Z.P.; visualization, K.A.K.; supervision, A.P.D. and Z.P.; project administration, Z.P.; funding acquisition, Z.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research work was supported and funded by The National Centre for Research and Development, grant no. CyberSecIdent/381319/II/NCBR/2018 for the implementation and financing of a project implemented on behalf of state defense and security as part of the program “Cybersecurity and e-Identity”, as well as by the Military University of Technology under research project no. UGB/22-864/2023 on “Methods of watermark embedding and extraction and methods of aggregation and spectral analysis with the use of neural networks”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Piotrowski, Z.; Lenarczyk, P.P. Blind Image Counterwatermarking—Hidden Data Filter. Multimed Tools Appl. 2017, 76, 10119–10131. [Google Scholar] [CrossRef]
  2. Kaczmarek, P.; Piotrowski, Z. Designing a mobile application on the example of a system for digital photos watermarking. In Proceedings of the Radioelectronic Systems Conference 2019, Jachranka, Poland, 20–21 November 2019; SPIE: Bellingham, WA, USA, 2020; Volume 11442, pp. 272–279. [Google Scholar] [CrossRef]
  3. Hossain, M.N.; Zaman, S.F.U.; Khan, T.Z.; Katha, S.A.; Anwar, M.T.; Hossain, M.I. Implementing Biometric or Graphical Password Authentication in a Universal Three-Factor Authentication System. In Proceedings of the 2022 4th International Conference on Computer Communication and the Internet, ICCCI, Chiba, Japan, 1–3 July 2022; pp. 72–77. [Google Scholar] [CrossRef]
  4. Two-Factor Authentication (2FA) Security Adoption Surges-|ChannelE2E. Available online: https://www.channele2e.com/news/two-factor-authentication-2fa-adoption-surges (accessed on 1 September 2023).
  5. The 2021 State of the Auth Report: 2FA Climbs, While Password Managers and Biometrics Trend|Duo Security. Available online: https://duo.com/blog/the-2021-state-of-the-auth-report-2fa-climbs-password-managers-biometrics-trend (accessed on 1 September 2023).
  6. Nogia, Y.; Singh, S.; Tyagi, V. Multifactor Authentication Schemes for Multiserver Based Wireless Application: A Review. In Proceedings of the ICSCCC 2023-3rd International Conference on Secure Cyber Computing and Communications, Jalandhar, India, 26–28 May 2023; pp. 196–201. [Google Scholar] [CrossRef]
  7. Fujii, H.; Tsuruoka, Y. SV-2FA: Two-Factor User Authentication with SMS and Voiceprint Challenge Response. In Proceedings of the 2013 8th International Conference for Internet Technology and Secured Transactions, ICITST 2013, London, UK, 9–12 December 2013; pp. 283–287. [Google Scholar] [CrossRef]
  8. The ‘123’ of Biometric Technology|Semantic Scholar. Available online: https://www.semanticscholar.org/paper/The-%E2%80%98-123-%E2%80%99-of-Biometric-Technology-Yau-Yun/b2f539d1face23a018b8e2824a898a8fee3ac77c (accessed on 1 September 2023).
  9. Mairaj, M.; Khan, M.S.A.; Agha, D.E.S.; Qazi, F. Review on Three-Factor Authorization Based on Different IoT Devices. In Proceedings of the 2023 Global Conference on Wireless and Optical Technologies, GCWOT 2023, Malaga, Spain, 24–27 January 2023. [Google Scholar] [CrossRef]
  10. Ometov, A.; Bezzateev, S.; Mäkitalo, N.; Andreev, S.; Mikkonen, T.; Koucheryavy, Y. Multi-Factor Authentication: A Survey. Cryptography 2018, 2, 1. [Google Scholar] [CrossRef]
  11. Alomar, N.; Alsaleh, M.; Alarifi, A. Social Authentication Applications, Attacks, Defense Strategies and Future Research Directions: A Systematic Review. IEEE Commun. Surv. Tutor. 2017, 19, 1080–1111. [Google Scholar] [CrossRef]
  12. Bezzateev, S.; Fomicheva, S. Soft Multi-Factor Authentication. In Proceedings of the Wave Electronics and its Application in Information and Telecommunication Systems, WECONF-Conference Proceedings, St. Petersburg, Russia, 1–5 June 2020. [Google Scholar] [CrossRef]
  13. Gandhi, A.; Patil, H.A. Feature Extraction from Temporal Phase for Speaker Recognition. In Proceedings of the 2018 International Conference on Signal Processing and Communications (SPCOM), Bangalore, India, 16–19 July 2018; pp. 382–386. [Google Scholar] [CrossRef]
  14. Dustor, A. Speaker Verification with TIMIT Corpus-Some Remarks on Classical Methods. In Proceedings of the Signal Processing-Algorithms, Architectures, Arrangements, and Applications Conference Proceedings, SPA 2020, Poznan, Poland, 23–25 September 2020; pp. 174–179. [Google Scholar] [CrossRef]
  15. Kang, W.H.; Kim, N.S. Adversarially Learned Total Variability Embedding for Speaker Recognition with Random Digit Strings. Sensors 2019, 19, 4709. [Google Scholar] [CrossRef] [PubMed]
  16. Xu, Q.; Wang, M.; Xu, C.; Xu, L. Speaker Recognition Based on Long Short-Term Memory Networks. In Proceedings of the 2020 IEEE 5th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 23–25 October 2020; pp. 318–322. [Google Scholar] [CrossRef]
  17. Hu, Z.; Fu, Y.; Xu, X.; Zhang, H. I-Vector and DNN Hybrid Method for Short Utterance Speaker Recognition. In Proceedings of the 2020 IEEE International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), Chongqing, China, 6–8 November 2020; pp. 67–71. [Google Scholar] [CrossRef]
  18. Lin, W.; Mak, M.-M.; Li, N.; Su, D.; Yu, D. Multi-Level Deep Neural Network Adaptation for Speaker Verification Using MMD and Consistency Regularization. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6839–6843. [Google Scholar] [CrossRef]
  19. Jagiasi, R.; Ghosalkar, S.; Kulal, P.; Bharambe, A. CNN Based Speaker Recognition in Language and Text-Independent Small Scale System. In Proceedings of the 2019 Third International conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Palladam, India, 12–14 December 2019; pp. 176–179. [Google Scholar] [CrossRef]
  20. Devi, K.J.; Thongam, K. Automatic Speaker Recognition from Speech Signal Using Bidirectional Long-Short-Term Memory Recurrent Neural Network. Comput. Intell. 2023, 39, 170–193. [Google Scholar] [CrossRef]
  21. Moumin, A.A.; Kumar, S.S. Automatic Speaker Recognition Using Deep Neural Network Classifiers. In Proceedings of the 2021 2nd International Conference on Computation, Automation and Knowledge Management (ICCAKM), Dubai, United Arab Emirates, 19–21 January 2021; pp. 282–286. [Google Scholar] [CrossRef]
  22. Hong, Q.-B.; Wu, C.-H.; Wang, H.-M.; Huang, C.-L. Statistics Pooling Time Delay Neural Network Based on X-Vector for Speaker Verification. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6849–6853. [Google Scholar] [CrossRef]
  23. Wang, S.; Yang, Y.; Wu, Z.; Qian, Y.; Yu, K. Data Augmentation Using Deep Generative Models for Embedding Based Speaker Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2598–2609. [Google Scholar] [CrossRef]
  24. Bykov, M.M.; Kovtun, V.V.; Kobylyanska, I.M.; Wójcik, W.; Smailova, S. Improvement of the Learning Process of the Automated Speaker Recognition System for Critical Use with HMM-DNN Component. In Proceedings of the Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2019, Wilga, Poland, 25 May–2 June 2019; SPIE: Bellingham, WA, USA, 2019; Volume 11176, pp. 588–597. [Google Scholar] [CrossRef]
  25. Zhang, C.; Yu, M.; Weng, C.; Yu, D. Towards Robust Speaker Verification with Target Speaker Enhancement. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6693–6697. [Google Scholar] [CrossRef]
  26. Zhang, Y.; Yu, M.; Li, N.; Yu, C.; Cui, J.; Yu, D. Seq2Seq Attentional Siamese Neural Networks for Text-Dependent Speaker Verification. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6131–6135. [Google Scholar] [CrossRef]
  27. Madisetti, V.; Williams, D.B. Digital Signal Processing Handbook; CRC Press, LLC: Boca Raton, FL, USA, 1999. [Google Scholar]
  28. Makowski, R. Automatyczne Rozpoznawanie Mowy-Wybrane Zagadnienia; Oficyna Wydawnicza Politechniki Wrocławskiej: Wrocław, Poland, 2011; ISBN 978-83-7493-615-6. [Google Scholar]
  29. Kamiński, K. System Automatycznego Rozpoznawania Mówcy Oparty na Analizie Cepstralnej Sygnału Mowy i Modelach Mieszanin Gaussowskich. Ph.D. Thesis, Military University of Technology, Warsaw, Poland, 2018. [Google Scholar]
  30. Ciota, Z. Metody Przetwarzanie Sygnałów Akustycznych w Komputerowej Analizie Mowy; EXIT: Warsaw, Poland, 2010; ISBN 978-83-7837-531-9. [Google Scholar]
  31. Pawłowski, Z. Foniatryczna Diagnostyka Wykonawstwa Emisji Głosu Śpiewaczego i Mówionego; Impuls Press: Cracow, Poland, 2005; ISBN 978-83-7850-295-1. [Google Scholar]
  32. Davis, S.B.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. ASSP 1980, 28, 357–366. [Google Scholar] [CrossRef]
  33. Harrag, A.; Saigaa, D.; Boukharouba, K.; Drif, M. GA-based feature subset selection Application to Arabic speaker recognition system. In Proceedings of the 2011 11th International Conference on Hybrid Intelligent Systems (HIS), Malacca, Malaysia, 5–8 December 2011; pp. 383–387. [Google Scholar] [CrossRef]
  34. Kamiński, K.; Dobrowolski, A.; Majda, E. Selekcja cech osobniczych sygnału mowy z wykorzystaniem algorytmów genetycznych. Inżynieria Bezpieczeństwa Obiektów Antropog. 2019, 1–2, 8–16. [Google Scholar] [CrossRef]
  35. Osowski, S. Metody i Narzedzia Eksploracji Danych; BTC: Warsaw, Poland, 2013; ISBN 978-83-60233-92-4. [Google Scholar]
  36. Zamalloa, M.; Bordel, G.; Rodriguez, L.J.; Penagarikano, M. Feature Selection Based on Genetic Algorithms for Speaker Recognition. In Proceedings of the 2006 IEEE Odyssey—The Speaker and Language Recognition Workshop, San Juan, PR, USA, 28–30 June 2006; pp. 1–8. [Google Scholar] [CrossRef]
  37. Tran, D.; Tu, L.; Wagner, M. Fuzzy Gaussian mixture models for speaker recognition. In Proceedings of the International Conference on Spoken Language Processing ICSLP 1998, Sydney, Australia, 30 November–4 December 1998; p. 798. [Google Scholar]
  38. Janicki, A.; Staroszczyk, T. Klasyfikacja mówców oparta na modelowaniu GMM-UBM dla mowy o różnej jakości. Prz. Telekomun. —Wiadomości Telekomun. 2011, 84, 1469–1474. [Google Scholar]
  39. Kamiński, K.; Dobrowolski, A.P.; Majda, E. Evaluation of functionality speaker recognition system for downgraded voice signal quality. Prz. Elektrotechniczny 2014, 90, 164–167. [Google Scholar] [CrossRef]
  40. Kaminski, K.; Majda, E.; Dobrowolski, A.P. Automatic Speaker Recognition Using a Unique Personal Feature Vector and Gaussian Mixture Models. In Proceedings of the 2013 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Poznan, Poland, 26–28 September 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 220–225. [Google Scholar]
  41. Reynolds, D.A.; Quatieri, T.F.; Dunn, R.B. Speaker Verification Using Adapted Gaussian Mixture Models. Digit. Signal Process. 2000, 10, 19–41. [Google Scholar] [CrossRef]
  42. Kamiński, K.; Dobrowolski, A.P.; Majda, E. Voice identification in the open set of speakers. Prz. Elektrotechniczny 2015, 91, 206–210. [Google Scholar] [CrossRef]
  43. Büyük, O.; Arslan, M.L. Model selection and score normalization for text-dependent single utterance speaker verification. Turk. J. Electr. Eng. Comput. Sci. 2012, 20, 1277–1295. [Google Scholar] [CrossRef]
  44. Kamiński, K.A.; Dobrowolski, A.P. Automatic Speaker Recognition System Based on Gaussian Mixture Models, Cepstral Analysis, and Genetic Selection of Distinctive Features. Sensors 2022, 22, 9370. [Google Scholar] [CrossRef] [PubMed]
  45. Dobrowolski, A.P.; Majda, E. Application of homomorphic methods of speech signal processing in speakers recognition system. Prz. Elektrotechniczny 2012, 88, 12–16. [Google Scholar]
  46. Kamiński, K.; Dobrowolski, A.P.; Majda, E.; Posiadała, D. Optimization of the automatic speaker recognition system for different acoustic paths. Prz. Elektrotechniczny 2015, 91, 89–92. [Google Scholar] [CrossRef]
  47. Martin, A.; Przybocki, M. 2002 NIST Speaker Recognition Evaluation LDC2004S04; Linguistic Data Consortium: Philadelphia, PA, USA, 2004. [Google Scholar] [CrossRef]
  48. Pretrained Speaker Recognition System-MATLAB SpeakerRecognition. Available online: https://www.mathworks.com/help/audio/ref/speakerrecognition.html (accessed on 3 July 2023).
  49. YAMNet Neural Network-MATLAB Yamnet. Available online: https://www.mathworks.com/help/audio/ref/yamnet.html (accessed on 17 July 2023).
  50. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR Corpus Based on Public Domain Audio Books. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing-Proceedings 2015, South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar] [CrossRef]
  51. Matějka, P.; Glembek, O.; Castaldo, F.; Alam, M.J.; Plchot, O.; Kenny, P.; Burget, L.; Černocky, J. Full-Covariance UBM and Heavy-Tailed PLDA in i-Vector Speaker Verification. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing-Proceedings, Prague, Czech Republic, 22–27 May 2011; pp. 4828–4831. [Google Scholar] [CrossRef]
  52. Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An Ontology and Human-Labeled Dataset for Audio Events. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing-Proceedings 2017, New Orleans, LA, USA, 5–9 March 2017; pp. 776–780. [Google Scholar] [CrossRef]
  53. Hershey, S.; Chaudhuri, S.; Ellis, D.P.W.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN Architectures for Large-Scale Audio Classification. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing-Proceedings 2017, New Orleans, LA, USA, 5–9 March 2017; pp. 131–135. [Google Scholar] [CrossRef]
Figure 1. Implementation diagram for the user authentication module in the PicWATermark system.
Figure 2. Operating diagram of the voice user verification module.
Figure 3. Amplitude characteristic of a 22nd-order Chebyshev type II filter [29].
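For orientation only, the sketch below (not taken from the paper) designs a 22nd-order Chebyshev type II low-pass filter with SciPy and computes its amplitude characteristic; the 8 kHz sampling rate, 3.4 kHz band edge, and 60 dB stopband attenuation are illustrative assumptions, not the authors' exact specification.

```python
# Hypothetical sketch (not the authors' exact design): a 22nd-order Chebyshev
# type II low-pass filter; sampling rate, band edge, and stopband attenuation
# are assumed values chosen only to illustrate the filter type from Figure 3.
import numpy as np
from scipy import signal

fs = 8000        # assumed sampling rate, Hz
order = 22       # filter order, as in Figure 3
rs = 60          # assumed stopband attenuation, dB
edge = 3400      # assumed band edge, Hz

# Design as second-order sections for numerical stability.
sos = signal.cheby2(order, rs, edge, btype="low", fs=fs, output="sos")

# Amplitude characteristic comparable in spirit to Figure 3.
w, h = signal.sosfreqz(sos, worN=2048, fs=fs)
magnitude_db = 20 * np.log10(np.maximum(np.abs(h), 1e-12))
```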
Figure 4. Distribution of 30 mel filters in the frequency range up to 4 kHz [29].
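A mel filter bank of the kind shown in Figure 4 can be generated with standard tools; the sketch below uses librosa purely for illustration, and the FFT length and default normalization are assumptions rather than the paper's settings.

```python
# Illustration only: 30 triangular mel filters covering 0-4 kHz, as in
# Figure 4; n_fft and the default normalization are assumed parameters.
import librosa

mel_fb = librosa.filters.mel(sr=8000, n_fft=512, n_mels=30, fmin=0.0, fmax=4000.0)
print(mel_fb.shape)  # (30, 257): one row of filter weights per mel band
```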
Figure 5. Flowchart of feature selection using a genetic algorithm [34].
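As a rough companion to the flowchart in Figure 5, the sketch below implements a generic genetic algorithm that searches for a binary feature mask; the population size, operators, and the Gaussian naive Bayes fitness classifier are illustrative assumptions, not the configuration used in [34].

```python
# Minimal, generic GA-based feature selection sketch (illustrative only).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Cross-validated accuracy of a simple classifier on the masked features."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(GaussianNB(), X[:, mask.astype(bool)], y, cv=3).mean()

def ga_select(X, y, pop_size=20, generations=30, p_mut=0.05):
    n_feat = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_feat))  # binary feature masks
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        # Tournament selection of parents.
        parents = pop[[max(rng.choice(pop_size, 2), key=lambda i: scores[i])
                       for _ in range(pop_size)]]
        # One-point crossover.
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            cut = rng.integers(1, n_feat)
            children[i, cut:], children[i + 1, cut:] = \
                parents[i + 1, cut:].copy(), parents[i, cut:].copy()
        # Bit-flip mutation.
        flip = rng.random(children.shape) < p_mut
        pop = np.where(flip, 1 - children, children)
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[scores.argmax()].astype(bool)  # best feature mask found
```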
Figure 6. Results of speaker verification in the multisession voice dataset.
Figure 7. Diagram for determining the credibility quotient for the alternative hypothesis, derived from the average of the logarithms of credibility from M population models.
Figure 8. Diagram for determining the credibility quotient for the alternative hypothesis, derived from the logarithm of the credibility of the universal voice model.
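Figures 7 and 8 correspond to two ways of scoring the alternative hypothesis in the credibility quotient (i.e., log-likelihood ratio) test: averaging frame log-likelihoods over M population models, or using a single universal voice model. The sketch below mimics both options with scikit-learn GMMs; the number of components and the diagonal covariances are assumptions, not the paper's model configuration.

```python
# Illustrative GMM scoring, assuming 64-component diagonal-covariance models.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features, n_components=64):
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          reg_covar=1e-4, max_iter=200, random_state=0)
    return gmm.fit(features)

def llr_population(test_feats, speaker_gmm, population_gmms):
    """Alternative hypothesis as the average log-likelihood over M population
    models (cf. Figure 7). score() returns the mean log-likelihood per frame."""
    target = speaker_gmm.score(test_feats)
    background = np.mean([g.score(test_feats) for g in population_gmms])
    return target - background

def llr_ubm(test_feats, speaker_gmm, ubm_gmm):
    """Alternative hypothesis as a single universal voice model (cf. Figure 8)."""
    return speaker_gmm.score(test_feats) - ubm_gmm.score(test_feats)
```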
Figure 9. Performance results of the ASR system depending on the reference model used, obtained from the NIST 2002 SRE voice dataset.
Figure 10. Speaker verification results for the most favorable alternative hypothesis selection option in the NIST 2002 SRE voice dataset, shown using the ROC curve and DET curve.
Figure 11. Summary illustration of the speaker verification results obtained by the group of speakers testing the voice biometrics module.
Figure 12. Optimization of the decision threshold of a speaker recognition system based on 110 verifications.
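In the spirit of the threshold optimization shown in Figure 12, the sketch below sweeps candidate thresholds over genuine and impostor scores and keeps the one where the false acceptance and false rejection rates are closest; the balancing criterion is an assumption, since a deployed system may weight the two error types differently.

```python
# Illustrative decision-threshold sweep over verification scores.
import numpy as np

def tune_threshold(genuine_scores, impostor_scores):
    candidates = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_thr, best_gap = None, np.inf
    for thr in candidates:
        frr = np.mean(genuine_scores < thr)    # false rejection rate
        far = np.mean(impostor_scores >= thr)  # false acceptance rate
        gap = abs(far - frr)
        if gap < best_gap:
            best_thr, best_gap = thr, gap
    return best_thr
```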
Figure 13. Evaluation of a voice biometrics classifier based on the ROC curve and DET curve in a dataset of 110 speaker logins in the PicWATermark system.
Table 1. Comparison of individual factors for 2FA: H—high; M—medium; L—low; n/a—unavailable [8,10].

| Factor | Universality | Uniqueness | Collectability | Performance | Acceptability | Spoofing |
|---|---|---|---|---|---|---|
| Password | n/a | L | H | H | H | H |
| Token | n/a | M | H | H | H | H |
| Voice | M | L | M | L | H | H |
| Facial | H | L | M | L | H | M |
| Ocular-based | H | H | M | M | L | H |
| Fingerprint | M | H | M | H | M | H |
| Hand geometry | M | M | M | M | M | M |
| Location | n/a | L | M | H | M | H |
| Vein | M | M | M | M | M | M |
| Thermal image | H | H | L | M | H | H |
| Behavior | H | H | L | L | L | L |
| Beam-forming | n/a | M | L | L | L | H |
| OCS 1 | n/a | L | L | L | L | M |
| ECG 2 | L | H | L | M | M | L |
| EEG 3 | L | H | L | M | L | L |
| DNA | H | H | L | H | L | L |
1—Occupant Classification Systems (OCS); 2—Electrocardiographic (ECG) Recognition; 3—Electroencephalographic (EEG) Recognition.
Table 2. Results of speaker verification method comparison using the NIST 2002 SRE dataset, which consists of 330 speakers.

| Name of the Speaker Verification Method | Optimized Custom GMM | I-Vector | YAMNet |
|---|---|---|---|
| Number of features | 23 | 60 | 64 * |
| EER | 9.69% | 10.91% | 11.21% |

* 64 is the number of mel bands in the mel spectrogram.
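The EER values reported in Table 2 can be estimated from trial scores and ground-truth labels in the usual way; the sketch below is a generic computation via the ROC curve and is not the authors' evaluation code.

```python
# Generic equal error rate (EER) estimate from verification trial scores.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 for target trials, 0 for impostor trials; scores: higher = more target-like."""
    far, tpr, _ = roc_curve(labels, scores)  # far = false acceptance (positive) rate
    frr = 1.0 - tpr                          # false rejection rate
    idx = np.argmin(np.abs(far - frr))       # operating point where FAR is closest to FRR
    return (far[idx] + frr[idx]) / 2.0
```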