Article

CNN-Based Identification of Parkinson’s Disease from Continuous Speech in Noisy Environments

by
Paul Faragó
1,*,
Sebastian-Aurelian Ștefănigă
2,
Claudia-Georgiana Cordoș
1,
Laura-Ioana Mihăilă
1,
Sorin Hintea
1,
Ana-Sorina Peștean
3,
Michel Beyer
4,5,
Lăcrămioara Perju-Dumbravă
3 and
Robert Radu Ileșan
3,4
1
Bases of Electronics Department, Faculty of Electronics, Telecommunications and Information Technology, Technical University of Cluj-Napoca, 400114 Cluj-Napoca, Romania
2
Department of Computer Science, Faculty of Mathematics and Computer Science, West University of Timisoara, 300223 Timisoara, Romania
3
Department of Neurology and Pediatric Neurology, Faculty of Medicine, University of Medicine and Pharmacy “Iuliu Hatieganu” Cluj-Napoca, 400012 Cluj-Napoca, Romania
4
Clinic of Oral and Cranio-Maxillofacial Surgery, University Hospital Basel, CH-4031 Basel, Switzerland
5
Medical Additive Manufacturing Research Group (Swiss MAM), Department of Biomedical Engineering, University of Basel, CH-4123 Allschwil, Switzerland
*
Author to whom correspondence should be addressed.
Bioengineering 2023, 10(5), 531; https://doi.org/10.3390/bioengineering10050531
Submission received: 13 March 2023 / Revised: 21 April 2023 / Accepted: 24 April 2023 / Published: 26 April 2023
(This article belongs to the Special Issue Artificial Intelligence in Biomedical Diagnosis and Prognosis)

Abstract:
Parkinson’s disease is a progressive neurodegenerative disorder caused by dopaminergic neuron degeneration. Parkinsonian speech impairment is one of the earliest presentations of the disease and, along with tremor, is suitable for pre-diagnosis. It is defined by hypokinetic dysarthria and encompasses respiratory, phonatory, articulatory, and prosodic manifestations. This article targets the artificial-intelligence-based identification of Parkinson’s disease from continuous speech recorded in a noisy environment. The novelty of this work is twofold. First, the proposed assessment workflow performed speech analysis on samples of continuous speech. Second, we analyzed and quantified the applicability of the Wiener filter for speech denoising in the context of Parkinsonian speech identification. We argue that the Parkinsonian features of loudness, intonation, phonation, prosody, and articulation are contained in the speech, speech energy, and Mel spectrograms. Thus, the proposed workflow follows a feature-based speech assessment to determine the feature variation ranges, followed by speech classification using convolutional neural networks. We report the best classification accuracies of 96% on speech energy spectrograms, 93% on speech spectrograms, and 92% on Mel spectrograms. We conclude that the Wiener filter improves both the feature-based analysis and the convolutional-neural-network-based classification performances.

1. Introduction

Parkinson’s disease (PD) is a progressive neurodegenerative disorder (a pathology in which brain cells stop working or die) caused by dopaminergic neuron degeneration in the pars compacta of the substantia nigra of the ventral midbrain [1,2]. Furthermore, the presence of alpha-synuclein-containing Lewy bodies in the substantia nigra is a clear neuropathological expression of PD [2].
The clinical presentation of patients with PD includes, among others, motor symptoms (e.g., tremor, bradykinesia, and rigidity), which can be seen as the final stage of a cascade that starts with the above-mentioned loss of dopaminergic neurons in the substantia nigra, inducing reduced facilitation of voluntary movements and advancing to severe motor and non-motor symptoms. The latter, non-motor symptoms (e.g., pain, fatigue, low blood pressure, restless legs, bladder and bowel problems, skin and sweating issues, sleep, eating, swallowing and saliva control problems, eye problems, foot care, dental health, mental health issues, mild memory and thinking problems, anxiety, dementia, depression, hallucinations and delusions, and speech and communication issues), have gained increasing attention in recent decades [3]. PD thus shows a high diversity in clinical appearance, and new studies show that some of these symptoms (e.g., anxiety, depression, and anhedonia) could be related to serotonergic neurotransmission (non-dopaminergic systems), affecting up to 50% of patients, with a clear impact on quality of life [4,5,6,7,8].
The global prevalence of PD increased from 2.5 million in 1990 to 6.1 million in 2016 [9], corresponding to a 21.7% increase in the age-standardized prevalence rate [10,11]. One million people have PD in the US alone, and the number is expected to reach 1.2 million by 2030 [12].
Based on the previously analyzed literature, we can argue that PD is highly challenging to diagnose and treat due to its myriad of clinical appearances. In this study, we focused on one of them, speech impairment, with the aim of supporting research in this field and clinicians in their quest for precision medicine.
Parkinsonian speech impairment is defined by hypokinetic dysarthria, a motor disorder which affects the magnitude and velocity of the articulatory movements and causes inter-articulator timing disturbances during speech production [13,14]. Hypokinetic dysarthria accounts for respiratory, phonatory, articulatory, and prosodic manifestations [15]. As such, Parkinsonian speech is characterized by voice blocking, reduced voice intensity, mono-pitch/mono-loudness oration, tremor phonation (changes in the energy and fundamental frequency), breathy/hoarse voice, and hypotonic phonation, as well as reduced stress and incorrect articulation [13,16,17,18,19,20]. Speaking tasks reported in the literature for the assessment of Parkinsonian speech are classified into sustained vowel phonation, diadochokinetic tasks (repetition of fast syllables, usually with occlusive consonants), and continuous speech (reading and/or monologue/free speech) [21,22]. We extend this classification with the addition of two further speech tasks, as identified in the literature: isolated words and short sentences.
Up to 89% of patients with PD experience, among other symptoms, speech difficulties such as dysarthria (difficulty speaking due to brain damage; a neuromuscular speech disorder) [23]. Unfortunately, a clinical diagnosis of PD often materializes long after substantial neurophysiological damage has occurred, as symptoms intensify over time. Altered speech is directly correlated with disability and poor outcomes, resulting in reduced quality of life [7,8]. As speech impairment can be one of the first signs of PD [24], timely identification is paramount for early intervention.

1.1. Related Work—Features Extraction

Feature classes for the objective assessment of hypokinetic phonation and articulatory impairment in PD are presented in Table 1, categorized by the speaking task.
Voice blocking is assessed using phonetic and phonologic speech features: pause count, pause duration, speech rate, etc. [25,26], from continuous speech.
Reduced speech loudness/intensity and mono-pitch and mono-loudness oration are assessed from prosody [27] based on pitch, i.e., fundamental frequency (f0) and speech intensity (I)/energy (E), respectively [28], taken in standard deviation.
Tremor phonation (and voice quality) is assessed on sustained vowels [13], isolated words, or short sentences [29,30,31], in terms of speech prosody: intensity/energy variation, fundamental frequency variation, and harmonic-to-noise ratio (HNR) [32].
Articulatory impairment is assessed by means of formant analysis, usually on sustained vowel phonation [13,33,34] and isolated words [31].
As illustrated, most of the literature references handle sustained vowel phonation and diadochokinetic speech tasks, along with isolated word and short sentence utterings. There are very few references to Parkinsonian speech assessment and identification in continuous speech.
Khan et al. argue in [35] that the assessment and identification of PD on continuous speech leads to better results by using Mel-frequency cepstral coefficients (MFCCs). Indeed, MFCCs were employed, in addition to prosody, noise, formant, and cepstral analysis, for running speech assessment by Orozco et al. in [36]. As another example, Laganas et al. also employed MFCCs, besides pitch, pitch onset, and pitch offset, for running speech assessment in PD [28].
Further on, Parkinsonian speech can be assessed using time-domain features, e.g., (short-term) energy and zero crossing rate, to evaluate voice activity [37]. On the other hand, Parkinsonian speech can be assessed using frequency-domain features, e.g., skewness and kurtosis [37], as well as MFCCs and the derivatives of MFCCs to evaluate spectrum shape [38].
The features reported in the literature for Parkinsonian speech assessment are listed in Table 2, categorized by the feature classes.

1.2. Related Work—Classifiers

Regarding Parkinsonian speech identification, several classifiers have been reported in the literature: Multilayer Perceptron (MLP), Extreme Gradient Boosting (XGBoost), K-Nearest Neighbor (KNN), Random Forest (RF) [39], support vector machines (SVMs), and artificial neural networks (ANNs)/convolutional neural networks (CNNs) [40]. SVMs and CNNs are the most widely employed: SVMs are preferred for vowel and syllable classification, whereas CNNs are preferred for sequences of text.
For exemplification, an SVM model with a hybrid CS-PSO parameter optimization method was used by Kaya in [41] and achieved a 97.4% accuracy on the classification of voice measurements.
An SVM was also employed by Yaman et al. in [42], along with KNN, for the automatic detection of PD from vowels. In this study, a statistical pooling method was applied to increase the size of the dataset. The reported accuracy was 91.25% in the case of the SVM and 91.23% in the case of the KNN.
Appakaya et al. employed the fine Gaussian SVM in [43] for the classification of Mel-frequency cepstral coefficients (MFCCs) extracted from three isolated words clustered into nine groups depending on the vowel content and achieved accuracy values that were between 60% and 90%. The study analyzed both fixed-width and pitch synchronous speech segmentation.
Hoq et al. proposed two hybrid models which integrate the Principal Component Analysis (PCA) and the deep neural network (DNN) of a Sparse Autoencoder (SAE) into an SVM in [39] and achieved an accuracy of 89.4% and 94.4%, respectively, for the detection of Parkinsonian speech based on the patient’s vocal features.
As an alternative to SVMs, which perform Parkinsonian speech identification based on features sets, CNNs perform Parkinsonian speech identification by solving an image classification problem.
For exemplification, Suhas et al. employed CNNs to perform spectrogram-based classification of dysarthria into three classes, amyotrophic lateral sclerosis (ALS), Parkinson’s disease (PD), and healthy controls (HC), and reported accuracy values above 80% [44].
Vaiciukynas et al. employed CNNs for Parkinsonian speech detection from a four-word sentence, achieving the best accuracy, i.e., 85.9% (equal error rate of 14.1%) [38]. In their work, the CNN was applied to classify the spectrograms of nine feature maps, including speech spectrograms; Mel frequency spectral coefficients—with the first and second derivative; Mel frequency cepstral coefficients; and linear predictive coding coefficients.
Gómez-Vilda et al. proposed a Random Least Squares Feed-Forward Network (RLSFN), namely an ANN classifier with stochastic and least-square learning methods for weight adaptation, in [13] for PD detection from sustained vowel recordings, with an accuracy over 99.4%. PD detection was performed based on the speech articulation neuro-mechanics, i.e., absolute kinematic velocity of the jaw-tongue system assessed in [13] by signal energy and formants.

1.3. Present Study

This article targets AI-based speech assessment for the identification of Parkinsonian speech. In previous work, we considered speech assessment in the framework of a decision support system for PD pre-diagnosis [45]. In the present study, we went further and focused on Parkinsonian speech classification from running speech, with the aim of facilitating the development of decision support systems for pre-diagnosis in neuroscience.
The literature review shows an abundance of reports on PD identification from short speech segments, i.e., vowels, syllables, and short words/sentences, mostly recorded in a laboratory environment. On the other hand, sample recordings in ambient conditions and PD identification from continuous speech are pursued less in the literature. Moreover, none of the reviewed solutions attempts to solve this problem by using CNN [46,47,48]. As such, the speech assessment workflow proposed in this article is aimed towards the assessment of continuous speech acquired in a noisy environment.
Our work is based on the premise that PD is identifiable from speech through loudness, intonation, phonation, prosody, and articulation. For this purpose, in our study, we performed an extensive investigation into phonological features, prosody features, time-domain features, frequency-domain features, and LPC analysis for formant extraction. Furthermore, we argue that the Parkinsonian traits identified with the feature-based speech analysis are contained in the speech, speech energy, and Mel spectrograms. Thus, we consider the spectrograms to be excellent candidates for CNN-based classification.
The novelty of this work is twofold. First, speech assessment was performed on samples of continuous speech, rather than utterings of sustained vowels, syllables, isolated words, or short sentences, as previously reported in the literature.
Second, we recorded the speech samples in a clinic, in the examination room—an inherently noisy environment, with no prior measures taken for soundproofing and noise reduction. On the one hand, this allowed us to investigate the presence of Parkinsonian speech attributes in the noisy signal. On the other hand, we were able to analyze and quantify the applicability of an optimal filter—the Wiener filter [49,50,51], in our work—for speech denoising in the context of Parkinsonian speech identification.
It should be noted that the speech samples used for the Parkinsonian speech assessment and CNN training were recorded from Romanian speaking patients and healthy controls (HCs) from our targeted study group. The dataset was constructed following a research protocol we devised ourselves, in contrast to publicly available third-party speech databases where we have no control over the acquisition and processing protocol.

2. Materials and Methods

Our methodology for AI-based Parkinsonian speech identification follows speech acquisition, speech processing, an investigation on feature extraction and feature assessment, and finally CNN-based spectrogram classification.

2.1. Speech Acquisition Protocol

The protocol adopted for speech acquisition and assessment is depicted in the workflow in Figure 1.
Speech acquisition was performed indoors, in a clinical environment, in the examination room of the Neurology Department. No special measures were taken for soundproofing or noise reduction in the examination room.
The study group consisted of twenty-seven subjects: sixteen PD patients and eleven healthy controls (HCs). The PD group included ten males and six females. The HC group included six males and five females. The healthy controls did not have any previously diagnosed neurodegenerative disorder or logopedic condition.
The subjects were provided with an A4 printout with the date of evaluation and a 31-word text sequence in the Romanian language, which they were asked to read out. The evaluator recorded the subjects’ speech at a 44.1 kHz sampling frequency, using the sound recorder of an Android smartphone, and downloaded the recording onto a laptop for speech processing and assessment.
Speech assessment was performed in this study in terms of phonology, prosody, time-domain, frequency-domain, and LPC analyses for formant extraction, as well as CNN-based classification of the speech, speech energy, and Mel spectrograms.

2.2. Proposed Workflow for Speech Processing and Assessment

Speech processing and assessment was performed in the MATLAB environment following the block diagram from Figure 2, which accounts for speech sample importation, speech processing, feature extraction, and assessment.
Considering that the speech acquisition was performed in the clinic, which is an inherently noisy environment, a noise suppression stage implemented in this work with the Wiener filter was envisioned in the speech processing and assessment workflow. To investigate the effects of noise suppression on the speech assessment outcome, the same assessment procedure was applied to both original and filtered signals for comparison.
As indicated in Figure 2, a voice activity detector (VAD) was employed to discriminate speech from silence and pauses and, thus, to identify the speech segments. An energy-based VAD implementation was considered in this work. The VAD implementation assumes speech signal segmentation with 20 ms non-overlapping rectangular windows and the extraction of the signal energy (enrg) in each segment. The energy comparison threshold was set empirically to 1/10 of the maximum signal energy. Accordingly, speech activity is characterized by a larger signal energy in contrast to silence [52]. The evaluation of the Parkinsonian speech attributes is then performed on the extracted speech segments.
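For illustration, a minimal Python/NumPy sketch of such an energy-based VAD is given below; although the processing in this work was implemented in MATLAB, the sketch follows the same parameters (20 ms non-overlapping rectangular windows and an empirical threshold of 1/10 of the maximum energy), and the function and variable names are illustrative assumptions.

```python
import numpy as np

def energy_vad(signal, fs, win_ms=20, threshold_ratio=0.1):
    """Energy-based voice activity detection on non-overlapping rectangular windows.

    Returns a boolean speech/silence decision and the signal energy per window.
    """
    win = int(fs * win_ms / 1000)               # 20 ms window length in samples
    n_win = len(signal) // win
    frames = signal[:n_win * win].reshape(n_win, win)
    enrg = np.mean(frames ** 2, axis=1)         # short-term signal energy per window
    threshold = threshold_ratio * enrg.max()    # empirical threshold: 1/10 of the maximum energy
    return enrg > threshold, enrg
```

Consecutive windows flagged as speech then form the speech segments on which the subsequent assessment is performed.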
The Parkinsonian speech assessment features targeted in this work are listed in Table 3. The phonological feature extraction procedure is straightforward, following voice activity detection, and relies basically on counting the utterings and pauses. Prosody, time domain, frequency domain, formant analyses, and spectrogram classification, on the other hand, only target the active segments of speech. For this purpose, we considered extracting the segments of speech from the speech samples.
For each of the extracted speech segments, we generated the speech spectrogram, speech energy spectrogram, and Mel spectrogram. The spectrograms were then applied for CNN-based classification.
Finally, feature extraction was performed on each of the extracted speech segments. For this purpose, we considered segmentation with 20 ms rectangular windows and 50% overlap [37], followed by specific prosody, time-domain, frequency domain, and formant extraction techniques.

2.2.1. Mathematical Formula of the Wiener Filter

Adaptive linear filtering is based on the theory of minimum least square error filters and is applied in a variety of domains, e.g., linear prediction, echo cancellation, system identification, channel equalization, etc.
In adaptive filters, the aim of parameter adaptation is to minimize the estimation error, e(t), between the desired signal, s(t), and the filtered signal, ŝ(t):
$$ e(t) = s(t) - \hat{s}(t). \tag{1} $$
In this paper, the Wiener filter is implemented on the FIR filter topology in Figure 3. Adaptivity means that the filter parameters are recalculated automatically to account for the statistical characteristics of the input signal and noise during the filtering process [49,50,51].
Our choice of the FIR topology is motivated by its stability, as well as the ease of computing the filter weights.

Time-Domain Equations

The filter output is given by the following convolution sum:
$$ \hat{s}(n) = \sum_{k=0}^{N-1} w_k \, y(n-k), \tag{2} $$
Alternatively, it is expressed using vector notation:
$$ \hat{s}(n) = w^{T} y, \tag{3} $$
where w = [w_k], k = 0, …, N − 1, is the coefficient vector, and y is the input vector to the FIR filter. The estimation error (1) is then expressed in discrete time as
$$ e(n) = s(n) - \hat{s}(n) = s(n) - w^{T} y. \tag{4} $$
The Wiener filter operates towards minimizing the mean square error (MSE); thus, we have the following:
$$ E[e^2(n)] = E[(s(n) - w^{T} y)^2] = E[s^2(n)] - 2 w^{T} E[y \, s(n)] + w^{T} E[y \, y^{T}] w, \tag{5} $$
where E[.] is the expectation operator. Then, one can identify that
$$ r_{ss}(0) = E[s^2(n)] \tag{6} $$
is the variance of the desired signal under the assumption that the mean of s is 0. Under the additional assumption that the input signal, y, and the desired responses are jointly stationary [51], one will further identify that
$$ r_{ys} = E[y \, s(n)] \tag{7} $$
is the cross-correlation vector between the input and the desired signals, and
$$ R_{yy} = E[y \, y^{T}] \tag{8} $$
is the input signal autocorrelation matrix. The MSE is then rewritten as follows:
$$ E[e^2(n)] = r_{ss}(0) - 2 w^{T} r_{ys} + w^{T} R_{yy} w. \tag{9} $$
Under the Wiener theory, the filter optimization criterion is the least mean square error (LMSE) [51]. The MSE given in (9) is a second-order function in w, which has a single minimum that is determined by
$$ \nabla_{w} E[e^2(n)] = -2 r_{ys} + 2 R_{yy} w = 0, \tag{10} $$
which resolves to the Wiener coefficient vector, w, which satisfies the LMSE criterion:
$$ w = R_{yy}^{-1} r_{ys}. \tag{11} $$
In the case of additive noise, n, namely
$$ y(n) = s(n) + n(n), \tag{12} $$
and assuming that the signal and noise are uncorrelated, we obtain the following:
$$ r_{sn} = 0, \tag{13} $$
whereas the noisy and noise-free signal are correlated:
$$ r_{ss} = r_{sy}. \tag{14} $$
Then, it follows that [49]
$$ R_{yy} = R_{ss} + R_{nn}. \tag{15} $$
Substituting (14) and (15) in (11) yields the following:
$$ w = (R_{ss} + R_{nn})^{-1} r_{ss}, \tag{16} $$
which defines the optimal linear filter for additive noise suppression [49].
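For illustration, the time-domain solution (11) can be computed directly in Python/NumPy when estimates of the correlation quantities are available; the following sketch, with illustrative signal names and filter length, assumes access to the noisy input y and a clean reference s from which the correlations are estimated.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def fir_wiener_weights(y, s, N=32):
    """Solve w = Ryy^{-1} rys (Equation (11)) for an N-tap FIR Wiener filter."""
    L = len(y)
    # Biased autocorrelation of the noisy input for lags 0..N-1 (first column of Ryy)
    ryy = np.array([np.dot(y[:L - k], y[k:]) for k in range(N)]) / L
    # Cross-correlation vector between the input and the desired signal
    rys = np.array([np.dot(y[:L - k], s[k:]) for k in range(N)]) / L
    # Ryy is Toeplitz, so the Wiener-Hopf system can be solved efficiently
    return solve_toeplitz(ryy, rys)
```

In the additive-noise case of (16), the same system is solved with Rss + Rnn in place of Ryy and rss in place of rys, with the correlation estimates taken from the speech and silence segments.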

Frequency-Domain Equations

In the frequency domain, the Wiener filter output Ŝ(f) is expressed as follows:
$$ \hat{S}(f) = Y(f) \cdot W(f), \tag{17} $$
which defines the error signal E(f) as follows:
$$ E(f) = S(f) - \hat{S}(f) = S(f) - Y(f) \cdot W(f). \tag{18} $$
The MSE is then expressed as follows:
$$ E[|E(f)|^2] = E\left[ \left( S(f) - Y(f) \cdot W(f) \right) \left( S(f) - Y(f) \cdot W(f) \right)^{*} \right], \tag{19} $$
where E[.] is the expectation operator, and * is the complex-conjugated product. Then, one can identify the following:
$$ P_{YY}(f) = E[Y(f) \cdot Y^{*}(f)], \tag{20} $$
as the power spectrum of Y(f), and
$$ P_{SY}(f) = E[S(f) \cdot Y^{*}(f)], \tag{21} $$
as the cross-power spectrum of Y(f) and S(f) [49].
The derivation of the Wiener coefficients under the LMSE criterion requires us to equate the MSE derivative to 0:
$$ \frac{\partial E[|E(f)|^2]}{\partial W(f)} = -2 \cdot P_{SY}(f) + 2 \cdot W(f) \cdot P_{YY}(f) = 0. \tag{22} $$
The transfer function of the Wiener filter is then expressed as follows:
$$ W(f) = \frac{P_{SY}(f)}{P_{YY}(f)}. \tag{23} $$
In the case of additive noise, the filter input signal is expressed in the frequency domain:
$$ Y(f) = S(f) + N(f), \tag{24} $$
where N(f) is the noise spectrum. Under the assumption that the signal and noise are uncorrelated, whereas the noisy signal and noise-free signal are correlated, as were the assumptions for the time-domain analysis, the Wiener filter is rewritten as follows:
$$ W(f) = \frac{P_{ss}(f)}{P_{ss}(f) + P_{nn}(f)}, \tag{25} $$
where Pss(f) and Pnn(f) are the signal and noise power spectra, respectively [49]. Dividing both the numerator and the denominator by Pnn(f) yields the following:
$$ W(f) = \frac{\zeta(f)}{\zeta(f) + 1}, \tag{26} $$
where ζ(f) is the signal-to-noise ratio defined in terms of power spectra [49,50]. The MATLAB implementation of the Wiener filter, employed in our work, follows the mathematical formula derived in (26).
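Although our implementation relies on MATLAB, an equivalent Python/SciPy sketch of the frequency-domain suppression rule (26), applied frame by frame via the short-time Fourier transform, is given below; the noise power spectrum is assumed to be estimated from silence frames flagged by the VAD, and the parameter values and names are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_denoise(y, fs, noise_psd, nperseg=512):
    """Apply the Wiener gain W(f) = SNR(f) / (SNR(f) + 1) of Equation (26) per STFT frame."""
    _, _, Y = stft(y, fs=fs, nperseg=nperseg)
    # A-priori SNR estimate per frequency bin (spectral subtraction, floored to stay positive)
    signal_psd = np.maximum(np.abs(Y) ** 2 - noise_psd[:, None], 1e-12)
    snr = signal_psd / noise_psd[:, None]
    W = snr / (snr + 1.0)                        # Wiener suppression gain
    _, s_hat = istft(W * Y, fs=fs, nperseg=nperseg)
    return s_hat
```

Here, noise_psd is a vector with one power value per frequency bin, e.g., the average of |Y(f)|² over the silence frames identified by the VAD.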

Wiener Filter Performance Metrics

An objective evaluation of the Wiener filter noise suppression performance was performed in this work by using the signal-to-noise ratio (SNR) and signal-to-noise ratio improvement (SNRI) as speech enhancement measures, and the mean square error (MSE) as a signal fidelity measure [52,53,54]. Each is defined as follows.
The SNR is estimated in dB according to the definition of the global SNR as the logarithm of the signal (Psignal) and noise (Pnoise) power ratio:
$$ \mathrm{SNR}\,[\mathrm{dB}] = 10 \cdot \log_{10}\!\left( \frac{P_{signal}}{P_{noise}} \right), \tag{27} $$
where the noise power, Pnoise, is determined from the silence segments and the signal power, Psignal, is determined from the speech activity segments, as discriminated by the voice activity detector [52]. Note that, although Psignal contains the power of both speech and noise, the SNR estimated with (27) is relevant for evaluating the noise suppression performance of the Wiener filter. Large SNR values imply that the speech magnitude is considerably larger than the noise, whereas small SNR values imply that the noise magnitude is rather large in comparison to the speech magnitude.
The SNR is expressed for both original and filtered signals. Then, we estimate the SNRI as follows:
$$ \mathrm{SNRI}\,[\mathrm{dB}] = \mathrm{SNR}_{filtered}\,[\mathrm{dB}] - \mathrm{SNR}_{original}\,[\mathrm{dB}], \tag{28} $$
indicating the improvement of the speech sample.
Finally, the MSE is computed according to the following:
$$ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( s_i - \hat{s}_i \right)^2. \tag{29} $$
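A compact Python/NumPy sketch of these three measures is given below; the boolean masks marking the speech and silence samples are assumed to come from the VAD, and the function names are illustrative.

```python
import numpy as np

def snr_db(x, speech_mask):
    """Global SNR in dB (Equation (27)), with powers taken from the VAD-labelled segments."""
    p_signal = np.mean(x[speech_mask] ** 2)
    p_noise = np.mean(x[~speech_mask] ** 2)
    return 10.0 * np.log10(p_signal / p_noise)

def enhancement_measures(original, filtered, mask_original, mask_filtered):
    """SNR, SNRI (Equation (28)) and MSE (Equation (29)) for one speech sample."""
    snr_o = snr_db(original, mask_original)
    snr_f = snr_db(filtered, mask_filtered)
    snri = snr_f - snr_o                          # positive when filtering improves the SNR
    mse = np.mean((original - filtered) ** 2)     # signal fidelity measure
    return snr_o, snr_f, snri, mse
```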

2.2.2. Feature Extraction for Parkinsonian Speech Assessment

The feature extraction stages applied for phonological, prosody, time-domain, frequency-domain, and LPC analyses, sequentially, are described as follows.

Phonological Analysis

A phonological analysis of the speech signal, aiming at the identification of Parkinsonian speech phonology, was performed in this work in terms of the number of utterings (n_utterings), number of pauses (n_pauses), speech rate (r_speech), and pause duration (t_pause).
Phonological feature extraction is straightforward, following voice activity detection, and is described as follows (a code sketch is given after the list):
  • The uttering count corresponds to the number of detected voice activities,
  • The pause count corresponds to the number of detected pauses,
  • The speech rate, expressed in words/minute, is determined as the number of utterings divided by the complete speech duration (in minutes),
  • The pause time, expressed in seconds, is determined as the total duration of the pause segments (note that the initial and final pauses were eliminated prior to assessment).
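A Python sketch of this procedure, operating on the per-window VAD decisions from the earlier sketch, is given below; it assumes 20 ms analysis windows, and the speech rate is expressed as the number of utterings per minute.

```python
import numpy as np

def phonological_features(is_speech, win_ms=20):
    """Uttering count, pause count, speech rate and pause duration from per-window VAD decisions."""
    # Eliminate the initial and final pauses prior to the assessment
    active = np.flatnonzero(is_speech)
    mask = is_speech[active[0]:active[-1] + 1]
    changes = np.diff(mask.astype(int))              # +1: pause -> uttering, -1: uttering -> pause
    n_utterings = 1 + int(np.sum(changes == 1))      # the trimmed mask starts with an uttering
    n_pauses = int(np.sum(changes == -1))
    total_min = len(mask) * win_ms / 1000.0 / 60.0   # assessed speech duration in minutes
    t_pause = int(np.sum(~mask)) * win_ms / 1000.0   # total pause duration in seconds
    r_speech = n_utterings / total_min               # utterings per minute
    return n_utterings, n_pauses, r_speech, t_pause
```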

Prosody Analysis

The speech prosody assessment was performed in this work in terms of the mean and standard deviation of the signal intensity (I) and fundamental frequency (f0).
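The exact intensity and pitch estimators are not detailed here; a minimal Python sketch, assuming frame-wise RMS intensity in dB and f0 estimation with the YIN algorithm from the librosa library, is given below for illustration only.

```python
import numpy as np
import librosa

def prosody_features(x, fs, win_ms=20, fmin=60.0, fmax=400.0):
    """Mean and standard deviation of speech intensity and fundamental frequency."""
    frame = int(fs * win_ms / 1000)
    n_win = len(x) // frame
    frames = x[:n_win * frame].reshape(n_win, frame)
    intensity = 20 * np.log10(np.sqrt(np.mean(frames ** 2, axis=1)) + 1e-12)   # frame intensity in dB
    f0 = librosa.yin(x.astype(float), fmin=fmin, fmax=fmax, sr=fs)             # pitch track in Hz
    f0 = f0[(f0 > fmin) & (f0 < fmax)]                                         # keep plausible voiced frames
    return (intensity.mean(), intensity.std()), (f0.mean(), f0.std())
```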

Time-Domain Analysis

We performed a time-domain speech analysis targeting the assessment of signal intensity and periodicity, i.e., zero-crossing-based features [55].
The time-domain features targeted in this work and considered relevant for the assessment of speech intensity are the mean absolute value (mav), energy (enrg), and root mean square (rms), which are defined as follows:
$$ \mathrm{mav}_k = \frac{1}{n} \sum_{i=1}^{n} \left| sig_i \right|, \quad k = \overline{1, n_w}, \tag{30} $$
$$ \mathrm{enrg}_k = \frac{1}{n} \sum_{i=1}^{n} sig_i^2, \quad k = \overline{1, n_w}, \tag{31} $$
$$ \mathrm{rms}_k = \sqrt{\frac{1}{n} \sum_{i=1}^{n} sig_i^2}, \quad k = \overline{1, n_w}, \tag{32} $$
where k is the segment index, n is the segment length (in samples), and nw is the total number of segments [56].
The time-domain features targeted in our work and considered relevant for speech periodicity are the zero-crossing rate (ZC) and slope sign changes (SSCs), which are defined as follows:
$$ ZC_k = \sum_{i=2}^{n} \left( \mathrm{sgn}\left( sig_{i-1} \cdot sig_i \right) = -1 \right), \quad k = \overline{1, n_w}, \tag{33} $$
$$ SSC_k = \sum_{i=3}^{n} \left( \mathrm{sgn}\left( \left( sig_{i-1} - sig_{i-2} \right) \cdot \left( sig_i - sig_{i-1} \right) \right) = -1 \right), \quad k = \overline{1, n_w}, \tag{34} $$
where k is the segment index, n is the segment length (in samples), and nw is the total number of segments [56].
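A per-frame Python/NumPy sketch of the time-domain features (30)-(34) is given below; the frame is assumed to be one 20 ms segment of the speech signal, and the names are illustrative.

```python
import numpy as np

def time_domain_features(frame):
    """mav, enrg, rms, ZC and SSC of one analysis frame (Equations (30)-(34))."""
    mav = np.mean(np.abs(frame))
    enrg = np.mean(frame ** 2)
    rms = np.sqrt(enrg)
    zc = int(np.sum(frame[:-1] * frame[1:] < 0))    # sign changes between consecutive samples
    d = np.diff(frame)
    ssc = int(np.sum(d[:-1] * d[1:] < 0))           # sign changes between consecutive slopes
    return mav, enrg, rms, zc, ssc
```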

Frequency-Domain Analysis

We performed a frequency-domain speech analysis targeting the assessment of the power spectrum components and power spectrum shape [37]. The power spectrum (P) was generated for each 20 ms signal frame, and the frequency-domain features were extracted as follows.
The frequency-domain features targeted in this work for the assessment of the power spectrum components are the frequency of the maximum spectral component (maxf) and the weighted average of the frequency components (waf), defined as follows:
$$ \mathrm{maxf}_k = \left\{ f \mid P_k(f) = \max\left( P_k \right) \right\}, \quad k = \overline{1, n_w}, \tag{35} $$
$$ \mathrm{waf}_k = \frac{\sum_{i=1}^{n} P_k^2(f_i) \cdot f_i}{\sum_{i=1}^{n} P_k^2(f_i)}, \quad k = \overline{1, n_w}, \tag{36} $$
where k is the segment index, n is the segment length (in samples), and nw is the total number of segments. Note that, while the pitch is also a relevant power spectrum component assessment feature [25,26,37], it was previously addressed in a prosody assessment.
The frequency-domain features targeted in this work for the assessment of the power spectrum shape are skewness and kurtosis [57].
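For illustration, a Python sketch of the frequency-domain features is given below, with skewness and kurtosis computed on the frame power spectrum; the spectral estimator (a simple FFT periodogram) and the function names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def frequency_domain_features(frame, fs):
    """maxf, waf (Equations (35)-(36)), and the skewness/kurtosis of the power spectrum."""
    P = np.abs(np.fft.rfft(frame)) ** 2             # power spectrum of the 20 ms frame
    f = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    maxf = f[np.argmax(P)]                          # frequency of the maximum spectral component
    waf = np.sum(P ** 2 * f) / np.sum(P ** 2)       # weighted average of the frequency components
    return maxf, waf, skew(P), kurtosis(P)
```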

LPC Analysis

The formants are estimated by means of the linear predictive coding (LPC) analysis. The first three formants (f1, f2, and f3) were considered for assessment in this work.
The LPC analysis was preceded by a down-sampling of the speech signal from 44.1 kHz to 16 kHz and segmentation with a 2 ms rectangular window with 50% overlap. A finer resolution was required, in comparison to the time-domain and frequency-domain analyses, to catch the vowels within the utterings and perform the formant analysis accordingly.
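A Python sketch of LPC-based formant estimation on one voiced frame is given below, assuming down-sampling to 16 kHz, the librosa LPC routine, and root solving of the prediction polynomial; the LPC order and the frame handling are illustrative assumptions rather than the exact configuration used in this work.

```python
import numpy as np
import librosa
from scipy.signal import resample_poly

def formants_lpc(frame, fs=44100, fs_target=16000, order=12, n_formants=3):
    """Estimate the first formants of a voiced frame from the roots of the LPC polynomial."""
    x = resample_poly(frame, fs_target, fs)            # 44.1 kHz -> 16 kHz down-sampling
    a = librosa.lpc(x.astype(float), order=order)      # linear prediction coefficients
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                  # keep one root per complex-conjugate pair
    freqs = np.sort(np.angle(roots) * fs_target / (2 * np.pi))   # pole angles converted to Hz
    return freqs[:n_formants]                          # f1, f2, f3 estimates
```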

2.2.3. CNN-Based Spectrogram Classification

In this paper, convolutional neural networks (CNNs) were trained to classify speech into the PD and HC classes. CNNs belong to the deep learning subdomain of AI, which has achieved immense success in recent years. These neural networks are deep because their architecture is more complex and consists of several convolutional layers, and model performance improves as the dataset grows [46]. With a CNN, feature extraction from images is performed automatically, with no need for human intervention. The convolutional layers therefore recognize characteristic patterns in the images applied to the input of the model, based on convolution operations, and the extracted features are recombined in the final layers of the architecture to achieve the classification. Thus, the CNN improves on the structure and performance of traditional artificial neural networks, and its architecture is suitable for recognizing patterns, i.e., features, in the structure of 2D images [47]. In practice, CNNs have achieved very good results in medical image analysis, image segmentation, and visual recognition [48].
CNN-based classification for the discrimination of Parkinsonian speech is performed in our work on spectrograms. The spectrogram is a three-dimensional plot of the signal amplitude vs. time and frequency [58] and can be employed for CNN-based classification [59]. Our motivation for spectrogram employment resides in the fact that it contains a visual representation of the Parkinsonian speech characterization features defined in Section 2.2.2. As such, we expect that the CNN-based classification of the speech spectrograms captures the feature-based Parkinsonian speech assessment.
The CNN-based spectrogram classification workflow is illustrated in Figure 4. First, spectrograms of the speech sequences extracted from the VAD were generated. The spectrograms were saved as jpeg images and were applied to the CNN for speech classification.
The MobileNet model is built on separable convolutions, and all layers are followed by ReLU activation functions, with the exception of the final fully connected layer. The hyperparameter settings are listed in Table 4. The CNN structure is then given in Table 5.
Three types of spectrograms were used for CNN training: speech spectrograms, speech energy spectrograms, and Mel spectrograms.
The speech spectrogram provides a visual representation of the speech power spectrum variation in time. As such, the speech spectrogram can be used to assess the time-frequency amplitude distribution [58].
The speech energy spectrogram further provides a visual representation of the distribution of spectral energy into short-term spectra over segments of speech. As such, the speech energy spectrogram tracks acoustic–phonetic changes [60].
Alternatively, the Mel spectrogram was derived as the short-term power spectrum mapped onto a non-linear Mel frequency scale; it provides a visual representation of human hearing perception and is used to explore phonetic variation and change [61].
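For illustration, a Python sketch that renders the speech and Mel spectrograms of one VAD speech segment as images for the CNN is given below; the image size, colormap, and the use of the librosa and matplotlib libraries are assumptions, and the speech energy spectrogram variant is generated analogously.

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

def save_spectrograms(segment, fs, stem):
    """Save the speech and Mel spectrograms of one speech segment as jpeg images."""
    speech_spec = np.abs(librosa.stft(segment)) ** 2
    mel_spec = librosa.feature.melspectrogram(y=segment, sr=fs)
    for name, spec in [("speech", speech_spec), ("mel", mel_spec)]:
        plt.figure(figsize=(2.24, 2.24), dpi=100)       # e.g. 224x224 px, the MobileNet input size
        plt.axis("off")
        plt.imshow(librosa.power_to_db(spec), origin="lower", aspect="auto", cmap="magma")
        plt.savefig(f"{stem}_{name}.jpg", bbox_inches="tight", pad_inches=0)
        plt.close()
```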
In our study, we used the MobileNet CNN architecture model. The MobileNet model performed feature extraction based on 28 layers of convolution, which are grouped into modules, offering a fast computation time [62], with the aim of maximizing accuracy and reducing the cost of computation [63]. MobileNet uses depth-wise separable convolutions to reduce the number of parameters and size of the model and tracks the balance between compression and precision.
The CNN model was trained in Google Colab, using Python. Our choice of the Colab programming environment was motivated by the free Graphics Processing Unit (GPU) services that allow the construction and automatic training of neural networks by performing parallel tasks on large datasets. Network training was performed with a learning rate of 0.005, i.e., the step size applied to the weight updates during training. This is the most important parameter in the network training process, as it regulates performance by controlling the rate at which the algorithm learns the parameter values. Moreover, we set the batch_size parameter to 128 to use less memory during training and to speed up the training procedure. The number of epochs used for the complete training cycles of the networks is variable and was chosen between 100 and 200.
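A Keras sketch of this training setup, under the stated hyperparameters (batch size of 128, learning rate of 0.005, 100-200 epochs), is given below; the optimizer, the classification head, and the dataset folder layout are illustrative assumptions rather than the exact configuration used in this work.

```python
from tensorflow import keras

IMG_SIZE = (224, 224)

# Spectrogram images organised in class sub-folders, e.g. spectrograms/PD and spectrograms/HC
train_ds = keras.utils.image_dataset_from_directory(
    "spectrograms", validation_split=0.2, subset="training", seed=42,
    image_size=IMG_SIZE, batch_size=128, label_mode="binary")
val_ds = keras.utils.image_dataset_from_directory(
    "spectrograms", validation_split=0.2, subset="validation", seed=42,
    image_size=IMG_SIZE, batch_size=128, label_mode="binary")

# MobileNet backbone with a sigmoid head for the binary PD vs. HC decision
base = keras.applications.MobileNet(include_top=False, pooling="avg",
                                    input_shape=IMG_SIZE + (3,), weights="imagenet")
model = keras.Sequential([
    keras.layers.Rescaling(1.0 / 127.5, offset=-1.0),   # MobileNet expects inputs in [-1, 1]
    base,
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.005),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=100)
```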

3. Results

3.1. Wiener Filter Performance Evaluation

The statistics of the estimated speech enhancement and fidelity measures are listed in Table 6. The complete record of the speech enhancement and fidelity measures, which were computed for every subject in the study group, is listed in Appendix A Table A1.

3.2. Feature Extraction for Parkinsonian Speech Assessment

The results of the feature extraction stages applied for phonology, prosody, time-domain, frequency-domain, and formant analyses are described as follows.

3.2.1. Phonological Analysis

The phonological speech parameters assessed in this work are expressed in terms of uttering count, pause count, speech rate, and pause duration.
The first stage in phonology assessment assumes the discrimination of utterings from pauses. The energy-based VAD described in Section 2.2 is employed for this purpose. The results of the voice activity detection procedure are depicted in Figure 5 for a PD patient. The original speech sample with the corresponding signal energy is plotted in Figure 5a, and the filtered speech sample with the corresponding signal energy is plotted in Figure 5b.
The comparison threshold, plotted with orange on the energy plot, is set empirically to 1/10 of the maximum signal energy. Utterings are then identified for signal energy levels above the comparison threshold, as plotted with orange on the speech sample.
As illustrated in Figure 5, noise in the original signal leads to different energy values in contrast to the filtered signal. The identification of utterings and pauses thus leads to different results on the two signals. Consequently, the phonological parameters estimated from the VAD are also different for the original and filtered signal.
The same voice activity detection procedure is depicted for an HC in Figure 6. The original signal with the corresponding signal energy are plotted in Figure 6a. The filtered signal with the corresponding signal energy is plotted in Figure 6b.
The uttering count and the pause count were determined directly from the voice activity detection results. The VAD further enables the assessment of the speech rate and pause duration on the entire speech sample. Statistics of the extracted phonological parameters, namely n_utterings, n_pauses, r_speech, and t_pause, are listed in Table 7 for both original and filtered speech samples. The complete record of the phonological features, which were computed for every subject in the study group, is given in Appendix A Table A2.

3.2.2. Prosody Analysis

The prosody features are evaluated in this work in terms of speech intensity (I) and pitch, i.e., fundamental frequency (f0). The prosody features computed on the speech sample of a PD patient are plotted in Figure 7, with Figure 7a illustrating the features estimated from the original signal, and Figure 7b from the filtered signal.
The prosody features computed on the speech sample of an HC are plotted in Figure 8, with Figure 8a illustrating the features estimated from the original signal and Figure 8b from the filtered signal.
We estimated the mean (µ) and standard deviation (σ) of the prosody speech parameters. The statistics of the extracted speech prosody, in mean and standard deviation, are listed in Table 8 for both the original and filtered speech samples. Note that the fundamental frequency metrics are assessed separately for the male and female subjects. The complete record of the prosody features, computed for every subject in the study group, is listed in Appendix A Table A3.

3.2.3. Time-Domain Analysis

The time-domain features determined in this work are the intensity-based features, i.e., MAV, E and RMS; and the periodicity-based features, i.e., ZC and SSC.
The time-domain intensity-based features estimated from the speech sample of a PD patient are plotted in Figure 9: those for the original signal are shown in Figure 9a, and those from the filtered signal are in Figure 9b.
The time-domain intensity-based features estimated from the speech sample of an HC are plotted in Figure 10: those for the original signal are shown in Figure 10a, and those for the filtered signal are in Figure 10b.
The statistics for the time-domain intensity-based features, in mean value and standard deviation, are listed in Table 9 for both the original and filtered speech samples. The complete record of the intensity-based time-domain features, computed for every subject in the study group, is listed in Appendix A Table A4.
The time-domain periodicity-based features estimated from the speech sample of a PD patient are plotted in Figure 11: those for the original signal are shown in Figure 11a, and those for the filtered signal are in Figure 11b.
The time-domain periodicity-based features estimated from the speech sample of an HC are plotted in Figure 12: those for the original signal are shown in Figure 12a, and those for the filtered signal are in Figure 12b.
The statistics for the time-domain periodicity-based features, in mean value and standard deviation, are listed in Table 10 for both the original and filtered speech samples. The complete record of the periodicity-based time-domain features, computed for every subject in the study group, is listed in Appendix A Table A5.

3.2.4. Frequency-Domain Analysis

The frequency-domain features determined in this work for the power spectrum assessment are MAXf and WAF. The frequency-domain features which assess the power spectrum shape are expressed in terms of skewness and kurtosis.
The frequency-domain features estimated from the speech sample of a PD patient are plotted in Figure 13: those for the original signal are shown in Figure 13a, and those for the filtered signal are in Figure 13b.
The frequency-domain features estimated from the speech sample of an HC are plotted in Figure 14, for the original signal in Figure 14a and the filtered signal in Figure 14b.
The statistics of the frequency-domain features, in mean value and standard deviation, are listed in Table 11 for both the original and filtered speech samples. The complete record of the frequency-domain features, computed for every subject in the study group, is listed in Appendix A Table A6 for the mean value and Table A7 for the standard deviation.

3.2.5. LPC Analysis

An LPC analysis was performed in this work, with the aim of formant extraction. The first three formants extracted for a PD patient are plotted alongside the speech sample in Figure 15: those for the original signal are shown in Figure 15a, and those for the filtered signal are in Figure 15b.
The first three formants extracted for an HC are plotted alongside the speech sample in Figure 16; those for the original signal are shown in Figure 16a, and those for the filtered signal are in Figure 16b.
The statistics of the first three formants, in mean value and standard deviation, are listed in Table 12 for both the original and filtered speech samples. The complete record of the formants, which were computed for every subject in the study group, is listed in Appendix A Table A8 for the mean value and Table A9 for the standard deviation.

3.3. CNN-Based Spectrogram Classification

The speech spectrogram of the sequence corresponding to the uttering of the word “Românie” in Romanian language, consisting of four vowels—two individual vowels and one vowel group—is plotted alongside the waveform of the uttering in Figure 17: that for a PD patient is shown in Figure 17a, and that for an HC is in Figure 17b.
The speech energy spectrogram corresponding to the uttering of the same word is plotted in Figure 18, alongside the waveform of the uttering: that for a PD patient is shown in Figure 18a, and that for an HC is in Figure 18b.
The Mel spectrogram of the sequence corresponding to the uttering of the same word is plotted in Figure 19, alongside the waveform of the uttering: that for a PD patient is shown in Figure 19a, and that for an HC is in Figure 19b.
The dataset for the CNN consists of the spectrograms for the speech sequences extracted from the speech samples of the 27 subjects: 16 patients diagnosed with PD and 11 healthy controls. Accordingly, the dataset for the original speech samples consists of 318 utterings: 215 for PD patients and 103 for HCs. The dataset for the filtered speech samples consists of 289 utterings: 194 for PD patients and 95 for HCs. The dataset was divided into a training dataset, accounting for 80% (of which 20% was used for validation), and a test dataset, accounting for 20%.
The classification performance was evaluated according to accuracy (acc) and loss [64,65]. Accuracy is defined as
$$ \mathrm{acc} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{37} $$
with the parameters accounting for true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs). The TP and TN metrics count the correct classifications, whereas the FP and FN metrics count the incorrect classifications. Accordingly, the accuracy indicates the probability of accurately identifying the samples in either of the two classes. Loss, on the other hand, is an indicator of the deviation between the predicted values and the real labels. Binary cross entropy is a commonly used loss function in binary classification problems. It measures the difference between the predicted probabilities and the true labels for each data point. Moreover, binary cross entropy has a probabilistic interpretation: it can be viewed as the negative log likelihood of the true label under the predicted probability distribution. In other words, the lower the loss, the higher the likelihood that the model’s predictions are correct. Overall, binary cross entropy is a good choice for binary classification tasks because it is easy to compute, has a probabilistic interpretation, and can be optimized efficiently by using gradient-based methods.
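For completeness, a NumPy sketch of the two metrics, with illustrative function names, is given below.

```python
import numpy as np

def accuracy(tp, tn, fp, fn):
    """Classification accuracy as defined in Equation (37)."""
    return (tp + tn) / (tp + tn + fp + fn)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross entropy between the true labels (0/1) and the predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```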
The CNN performance metrics obtained after network training, in terms of accuracy, FP, FN, and loss, are listed in Table 13. As illustrated, the best results were obtained on speech energy spectrograms, with an accuracy of 96% and a loss of only 0.12. Speech spectrograms and Mel spectrograms led to lower accuracy values.
A closer inspection of the speech phonological parameters, which are given in Appendix A Table A2, points out that patients PD 1, PD 4, PD 5, PD 11, and PD 13 exhibit feature values in the HC range, contradicting the guidelines prescribed by Boschi et al. [25]. Contrariwise, the healthy controls HC 4, HC 5, and HC 8 exhibit feature values in the PD range.
Thus, in the second CNN training attempt, we eliminated the speech spectrograms of the subjects with feature values outside the variation range prescribed by the statistics reported in Table 6 and Table 7. In this case, the CNN dataset for the original speech samples is reduced to 241 utterings: 181 for PD patients and 60 for HCs. The dataset for the filtered speech samples is reduced to 222 utterings: 166 for PD patients and 56 for HCs. The classification accuracy, however, is improved, becoming 93%, in the case of the filtered signal, with a loss of only 0.1. The dataset distribution for CNN training and validation is the same.
The classification accuracy achieved in this work is listed in comparison to values reported in the literature in Table 14 and Table 15. Table 14 points out that the classification accuracy depends primarily on the speech task. Sustained vowel phonation and diadochokinetic tasks involve phonetic segment durations on the order of seconds. In extremis, [13] reported on sustained vowel phonation with a duration of 2 s. Thus, feature extraction provides a good feature resolution, and consequently, there are sufficient numeric data available for assessment and classification. This makes vowels and diadochokinetic tasks appropriate for classification using supervised learning architectures such as KNN, SVM, or RF. Contrariwise, phonetic segments in continuous speech samples are limited to 100–300 ms [66]. In such cases, the feature resolution is rather small; thus, neural network architectures are more suitable for classification.
Sustained vowel phonation and diadochokinetic tasks reach large classification accuracy values. Specifically, the highest classification accuracies were achieved for sustained vowel phonation in [13,39]. Table 14 points out that we were able to report comparable accuracy values. On the other hand, there is only a small number of solutions in the literature which report on Parkinsonian speech identification from continuous speech, and which also reach lower classification accuracies [38,43,44]. From this point of view, the classification accuracy reported in our work is larger than the accuracy reported in the literature for a similar task.
Furthermore, the speech samples classified in our study were recorded in-clinic, an inherently noisy environment, in contrast to a soundproofed laboratory environment, as was the case in the related work.
With respect to the aim of our study, which targeted the CNN-based identification of PD from continuous speech, we compared our results to others obtained using deep learning models. As illustrated in Table 15, the classification accuracy we achieved in our study using CNNs is higher than the accuracy reported in [38,44]. On the other hand, the larger accuracy reported in [13] was achieved on sustained vowel phonation, in contrast to running speech, which was the case in our work.

4. Discussion

4.1. Speech Enhancement and Fidelity Measures

The SNR values indicate a clear improvement of the speech samples with Wiener filtering. As a quantitative measure of the signal improvement, the SNRI indicates that Wiener filtering improved the speech signal by an average of 4 dB for both PD patients and HCs. The MSE, on the order of 10⁻⁴, indicates that there are no severe deviations between the original and filtered speech signals. It is thus sensible to assume that relevant information for the characterization of Parkinsonian speech was not lost with filtering.

4.2. Feature Extraction for Parkinsonian Speech Assessment

4.2.1. Phonology Analysis

The phonological features extracted from the speech samples confirm previous results reported by Boschi et al. as relevant [25]. Accordingly, our results illustrate that Parkinsonian speech exhibits an increased pause count in comparison to HCs, which is consistent with hypokinetic phonation and voice blocking [18]. The total pause duration, attributable to inappropriate silence [18], is also larger for PD patients.
Furthermore, the uttering count and speech rate (estimated in our study as the number of utterings per minute) exhibit larger values for PD patients. This result is attributable to the dysfluent nature of speech in PD [18,33].
With respect to filtering, although the specific feature values were changed, the feature relationships hold for both original and filtered speech samples.

4.2.2. Prosody Analysis

Our results on prosody assessment exhibit smaller values for speech intensity, in both mean and standard deviation, for PD in comparison to HC. While the smaller mean reveals reduced voice intensity and speech loudness, the smaller standard deviation reveals the mono-loudness attribute of Parkinsonian speech.
The standard deviation of the fundamental frequency, reported in the literature as an indicator for intonation-related impairment [27,31], reveals a smaller value in the case of Parkinsonian speech.
The effect of Wiener filtering on the prosody features of speech consists of changes in the intensity mean and standard deviation values, because of noise suppression. The differences in the fundamental frequency are insignificant. Nevertheless, the relationship between the prosody features holds for both original and filtered speech samples.

4.2.3. Time-Domain Analysis

The time-domain analysis of the speech samples illustrates that the intensity-based features are smaller for Parkinsonian speech in comparison to HC, in both mean and standard deviation. This relationship is consistent with the attributes of Parkinsonian speech [28]. Indeed, smaller mean values are an indicator of reduced voice intensity and speech loudness. Smaller standard deviation values are an indicator for mono-loudness speech and reduced intensity modulation. These relationships hold for both original and filtered speech samples; the difference in feature values is, however, more pronounced for the filtered signal.
The periodicity-based features exhibit a smaller zero-crossing rate value for Parkinsonian speech in comparison to HC, in both mean value and standard deviation. This result is consistent with the mono-pitch attribute of Parkinsonian speech [28]. Slope sign changes, on the other hand, exhibit a larger value for Parkinsonian speech, in both mean value and standard deviation. These relationships hold for both original and filtered speech samples; yet again, the difference in feature values is more pronounced for the filtered signal.

4.2.4. Frequency-Domain Analysis

A frequency-domain analysis was performed in this work to assess the spectral content by means of the maximum component frequency and the weighted average of the frequency components. Further on, the spectrum shape was assessed by means of skewness and kurtosis.
Our assessment results show that both power spectrum component features are lower for Parkinsonian speech in comparison to HC. The lower maximum component frequency of Parkinsonian speech originates from breathy voice [60] and indicates that breath is the dominant speech component in the presence of reduced voice intensity. The lower weighted average of the frequency components, on the other hand, provides a numeric estimate which captures phonation, expressivity, modulation, and articulation difficulties [28,31]. These relationships hold for both original and filtered speech sequences.
Spectrum shape assessment exhibits a similar skewness value for PD and HC, whereas kurtosis exhibits larger values for Parkinsonian speech. The difference in kurtosis, however, is small, and we cannot base the discrimination of Parkinsonian speech on this feature. Wiener filtering does not change the spectrum shape feature values.

4.2.5. LPC Analysis

A formant analysis addresses the assessment of incorrect articulation as a characteristic of Parkinsonian speech [18,31,33]. Indeed, f1 is produced by jaw movement, whereas f2 is produced by tongue movement [67]. In this work, we performed formant extraction by means of an LPC analysis.
Our assessment results show that the standard deviation of the formants is smaller for Parkinsonian speech in comparison to HC. Considering that we performed the assessment on samples of continuous speech, this result is attributable to the imprecise articulation of consonants [18] and is consistent with hypokinetic speech.
These relationships hold for both original and filtered speech samples; moreover, filtering does not change the formant frequencies significantly.

4.3. CNN-Based Spectrogram Classification

Three types of spectrograms were employed in this work for CNN-based speech classification: speech spectrograms, speech energy spectrograms, and Mel spectrograms. We argue that several features of Parkinsonian speech, identified with prosody, time-domain, frequency-domain, and LPC analyses, are contained in these spectrograms. This was our motivation for spectrogram employment in the CNN-based classification of Parkinsonian speech.
The speech spectrogram, as a representation of the speech intensity in the time-frequency coordinate system [58], visualizes reduced voice intensity and speech loudness in PD. Furthermore, the speech spectrogram visualizes relatively constant spectral maxima vs. time in PD. As discussed for the feature assessment, these attributes are consistent with Parkinsonian softness of voice, reduced speech modulation, articulation, and expressivity [18,27,28,31]. Furthermore, the speech spectrograms provide a better visualization of breathy voice [60].
Reduced speech loudness of the PD patient in contrast to the HC is also visible in the speech energy and Mel spectrograms. The speech energy spectrogram further visualizes acoustic–phonetic changes [60], which are more abrupt in the case of the PD patient.
Both speech and Mel spectrograms visualize that the energy content in the case of Parkinsonian speech is confined to smaller frequencies in contrast to HCs. However, this is more pronounced on the Mel spectrogram, which highlights a spectral peak that stays constant vs. time. This is consistent with the mono-pitch attribute of Parkinsonian speech [28].
Feature-based speech assessment points out that certain patients exhibit phonological feature values in the HC range, whereas certain healthy controls exhibit feature values in the PD range. This observation is extrapolated to the spectrogram analysis. As such, we attempted to eliminate from the dataset all speech spectrograms generated for subjects with phonological feature values outside the specified variation ranges. The classification accuracy on speech spectrograms was improved from 78% with 0.3 loss to 85% with 0.8 loss for the unfiltered signals and from 86% with 0.4 loss to 95% with 0.1 loss on the filtered signals. The classification accuracy on speech energy spectrograms was improved from 80% with 0.3 loss to 87% with 0.4 loss for the unfiltered signals and from 84% with 0.6 loss to 96% with 0.1 loss on the filtered signals. The classification accuracy on Mel spectrograms was improved from 58% with 0.5 loss to 87% with 0.7 loss for the unfiltered signals and from 70% with 0.3 loss to 92% with 0.5 loss on the filtered signals. As illustrated, our approach led to the improvement of classification accuracy.
The highest accuracy improvement, achieved on Mel spectrograms, is motivated by the fact that Mel spectrograms visualize speech perception [60]. Thus, it can be inferred that speech samples assessed as healthy by the feature-based analysis are also perceived as healthy.
Regarding noise suppression, the 4 dB SNR improvement achieved with the Wiener optimal filter on the speech samples produces an improvement in the CNN-based classification accuracy of 8–12%. Indeed, as a result of noise suppression, the spectrograms only contain relevant speech information.
The best CNN-based PD classification accuracy was achieved for the speech energy spectrograms, both before and after data set reduction and regardless of filtering. This result is explained by the fact that the speech energy spectrogram captures acoustic–phonetic changes on segments of speech [60] for which PD is identifiable [31].
Regarding our choice for the MobileNet model, it is mainly based on our previous study in [64], wherein we investigated the MobileNet, EfficientNet and Xception models for image classification in the discrimination of PD. Since we obtained the best classification accuracy with the MobileNet, it was our straightforward choice for the present study.

4.4. Limitations

In this paper, we analyzed phonological, prosody, time-domain, and frequency-domain features, and we performed LPC analysis for formant extraction. The reported features quantify the Parkinsonian traits of continuous speech, confirming the particularities of PD vs. HC in terms of loudness, intonation, phonation, prosody, and articulation.
Given the continuous nature of the speech task, the voiced segments are considerably shorter than in sustained vowel phonation and diadochokinetic tasks. Specifically, we can only isolate vowels lasting 100–200 ms, compared with roughly 2 s for sustained vowel phonation [13]. A limitation of our work is therefore that we cannot attribute the feature standard deviations measured on voiced segments to phonatory tremor. While the standard deviations of pitch, energy, and formants on vowel phonation and diadochokinetic tasks are reported to be larger for PD than for HC [13,25], we obtained larger values for HC, which we attribute to voice modulation, expressivity, and articulation throughout the continuous speech (see the sketch below).
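To make this limitation concrete, the sketch below pools the voiced frames of a continuous-speech sample and reports pitch statistics using librosa’s pYIN tracker; the file name and frame settings are assumptions, and the pooled standard deviation mixes intonation with any phonatory tremor, which is exactly the confound discussed above.

```python
import numpy as np
import librosa

# Load a continuous-speech sample (hypothetical file name).
y, sr = librosa.load("speech_sample.wav", sr=None, mono=True)

# pYIN fundamental-frequency track with a per-frame voicing decision.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"),
    sr=sr, frame_length=1024)

# On continuous speech, voiced stretches last only ~100-200 ms, so the
# statistics below reflect intonation and articulation as much as phonation.
f0_voiced = f0[voiced_flag]
print("mean f0: %.1f Hz" % np.nanmean(f0_voiced))
print("std  f0: %.1f Hz" % np.nanstd(f0_voiced))
```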
With regard to speech recording in a noisy environment, we confirmed that the optimal Wiener filter is applicable for noise suppression while preserving the Parkinsonian speech attributes. The limitation of Wiener filtering in the presented application arises when the recording contains background talk, hospital traffic, etc., which the filter interprets as voice activity rather than noise and therefore does not suppress.

5. Conclusions

In this paper, we discussed the AI-based identification of Parkinsonian speech. The novelty of this work is twofold. First, we performed Parkinsonian speech assessment on samples of continuous speech. Second, we recorded the speech samples in the clinic, in an inherently noisy environment, and were thus able to analyze and quantify the Wiener filter’s applicability to speech denoising for the identification of Parkinsonian speech. We concluded that the Wiener filter improves both the feature-based analysis and the CNN-based classification performance.
The proposed methodology for the AI-based identification of Parkinsonian speech comprises speech acquisition, processing, feature extraction, feature assessment, and, finally, CNN-based classification of spectrograms generated from the speech samples. Our aim was to assess the loudness, intonation, phonation, prosody, and articulation of speech by means of phonological, prosody, time-domain, frequency-domain, and LPC features, respectively. We argue that the Parkinsonian traits identified by the feature-based speech analysis are contained in the spectrograms. The best classification accuracies we achieved were 96% on speech energy, 93% on speech, and 92% on Mel spectrograms.
The assessment results reported in this paper confirm results previously reported in the literature. Their strength is that they were achieved on samples of continuous speech rather than on short speech segments, e.g., sustained vowels, short syllables/words, or short sentences. Furthermore, the speech samples used for the Parkinsonian speech assessment and CNN training were acquired from the patients and healthy controls of our targeted study group, following a research protocol we devised ourselves, rather than from publicly available third-party speech databases over whose acquisition and processing protocols we have no control.
The results reported in this paper can serve as guidelines for a running-speech assessment methodology in PD and could lay the foundation for new applications that assess the quality of spoken communication.
Our future research is oriented towards the development of an autonomous AI-based decision support system for PD pre-diagnosis. We aim to integrate the methodology proposed in this study with our previously reported solutions for tremor [45], gait [64,68], and written communication assessment [45], in correlation with Parkinson’s disease rating scales, cognitive evaluation, and the resulting socioeconomic impact.

Author Contributions

Conceptualization, P.F. and R.R.I.; data curation, P.F. and R.R.I.; formal analysis, S.-A.Ș. and R.R.I.; investigation, R.R.I.; methodology, P.F., S.-A.Ș. and R.R.I.; project administration, P.F. and R.R.I.; resources, A.-S.P. and R.R.I.; software, P.F., S.-A.Ș., C.-G.C., L.-I.M. and M.B.; supervision, L.P.-D. and R.R.I.; validation, P.F., S.-A.Ș., C.-G.C., L.-I.M., S.H. and R.R.I.; visualization, P.F. and S.-A.Ș.; writing—original draft, P.F., C.-G.C., L.-I.M. and R.R.I.; writing—review and editing, P.F., S.-A.Ș., S.H., A.-S.P., M.B., L.P.-D. and R.R.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of the University of Medicine and Pharmacy “Iuliu Hatieganu” Cluj-Napoca, Romania (Protocol Code 86; date of approval: 1 February 2018).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

We have chosen not to make the data publicly available, in accordance with the protocol statement.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1. Wiener Filter Performance Evaluation

Table A1. Wiener filter speech enhancement and fidelity measures.
ID | SNR (dB), Original Signal | SNR (dB), Filtered Signal | SNRI | MSE
PD 1 | 43.1 | 43.2 | 0.1 | 2.29 × 10−4
PD 2 | 46.3 | 50 | 3.7 | 2.27 × 10−4
PD 3 | 44.2 | 48.3 | 4.1 | 7.32 × 10−5
PD 4 | 43.5 | 43.7 | 0.2 | 2.58 × 10−4
PD 5 | 44.9 | 50.4 | 5.5 | 1.5 × 10−4
PD 6 | 42.4 | 47.8 | 5.4 | 3.81 × 10−4
PD 7 | 36.4 | 42.6 | 6.2 | 8.42 × 10−4
PD 8 | 31.8 | 34.8 | 3 | 3.68 × 10−4
PD 9 | 46 | 49.3 | 3.3 | 2.47 × 10−4
PD 10 | 81 | 81.1 | 0.1 | 1.19 × 10−4
PD 11 | 58.2 | 62.9 | 4.7 | 8.94 × 10−4
PD 12 | 44.2 | 50.9 | 6.7 | 2.28 × 10−4
PD 13 | 9.7 | 11.4 | 1.7 | 2.81 × 10−4
PD 14 | 16.1 | 24.7 | 8.6 | 6.5 × 10−4
PD 15 | 26.3 | 33.3 | 7 | 1.94 × 10−4
PD 16 | 15 | 20.9 | 5.9 | 6.93 × 10−4
Statistics | 39.3 ± 17.4 | 43.5 ± 16.5 | 4.1 ± 2.6 | 2.8 × 10−4 ± 2.2 × 10−4
HC 1 | 24 | 31.3 | 7.3 | 3.55 × 10−4
HC 2 | 38.1 | 44.3 | 6.2 | 4.67 × 10−4
HC 3 | 33.7 | 41.8 | 8.1 | 3.47 × 10−4
HC 4 | 27.2 | 28.6 | 1.4 | 3.81 × 10−4
HC 5 | 42.5 | 46.8 | 4.3 | 1.68 × 10−4
HC 6 | 32.2 | 39 | 6.8 | 5.8 × 10−4
HC 7 | 33.5 | 35.4 | 1.9 | 7.49 × 10−4
HC 8 | 44.1 | 49.5 | 5.4 | 3.35 × 10−4
HC 9 | 28.2 | 32.3 | 4.1 | 1.2 × 10−3
HC 10 | 26.3 | 28.4 | 2.1 | 3.79 × 10−4
HC 11 | 51.7 | 55 | 3.3 | 6.67 × 10−4
Statistics | 34.7 ± 8.6 | 39.3 ± 8.9 | 4.6 ± 2.3 | 5.1 × 10−4 ± 2.8 × 10−4

Appendix A.2. Feature Extraction for Parkinsonian Speech Assessment

Appendix A.2.1. Phonological Analysis

Table A2. Phonological parameters.
ID | Original Signal: nuttering, npause, rspeech, tpause | Filtered Signal: nuttering, npause, rspeech, tpause
PD 1131247.4357625.61.8
PD 2191849.77.5141336.67
PD 3171649.18.4141341.17.7
PD 46535.21.55429.41.4
PD 57624.64.27624.64
PD 6201942.710.7171636.310.1
PD 7131234.112.2131232.111.5
PD 8109345.79824.95.2
PD 9111043.83.510939.83.2
PD 10141336.77.1141336.75.8
PD 117638.22.8768.22.7
PD 1210932.266.698296.2
PD 138728.53.698323.2
PD 14121133.431413395.7
PD 15191848.27.5181745.66.1
PD 1636355213.4343349.111.5
Statistics13.9 ± 7.412.9 ± 7.439.4 ± 8.38.3 ± 7.912.6 ± 6.911.6 ± 6.933.1 ± 9.85.8 ± 3.2
HC 16519.85.28724.25
HC 24316.21.34316.21
HC 35421.62.16525.12
HC 4131234.34.4121134.34.2
HC 51211501.910935.71.7
HC 6121144.75.68730.75
HC 798288.998288.6
HC 8181751.56.21716495.8
HC 99829.63.47623.73.3
HC 108730.93.57627.43.1
HC 117620.87.67620.87.3
Statistics9.4 ± 4.18.4 ± 4.131.6 ± 12.34.6 ± 2.48.6 ± 3.57.6 ± 3.528.6 ± 8.84.3 ± 2.4

Appendix A.2.2. Prosody Analysis

Table A3. Prosody parameters, in mean and standard deviation.
ID | Original Signal: µ(I), σ(I), µ(f0), σ(f0) | Filtered Signal: µ(I), σ(I), µ(f0), σ(f0)
PD 10.0750.091113.256.60.0760.093121.1272.18
PD 20.1370.159232.858.20.1350.157234.0257.3
PD 30.1110.122152.436.40.1110.122155.2737.83
PD 40.1050.118138.622.50.1060.119138.6223.8
PD 50.0520.073140.366.10.0690.097150.8669.46
PD 60.0970.122146.629.90.0970.121147.8531.11
PD 70.1430.156127.3470.1440.157128.6344.6
PD 80.0770.093163.678.70.0810.096161.3162.25
PD 90.1190.141120.336.40.120.143120.9836.67
PD 100.0740.091103.160.20.0720.09102.8759.83
PD 110.050.07227.557.20.070.093232.9853.89
PD 120.020.0313657.90.030.041148.4769.2
PD 130.0250.035196.382.60.0330.045206.2579.66
PD 140.040.06135.364.20.060.081160.6478.94
PD 150.020.027184105.20.0240.036190.16102.28
PD 160.020.025202.293.50.0240.034211.9886.7
Statistics0.07 ± 0.040.09 ± 0.05157.5 ± 39.859.5 ± 22.70.07 ± 0.040.09 ± 0.04163 ± 40.460.4 ± 21.7
Male statistics0.08 ± 0.040.1 ± 0.04138.8 ± 33.949.8 ± 15.20.08 ± 0.030.1 ± 0.03145.3 ± 35.454 ± 19
Female statistics0.07 ± 0.050.08 ± 0.06188.6 ± 18.875.8 ± 24.90.07 ± 0.050.08 ± 0.05193.2 ± 30.571 ± 23.1
HC 10.0770.09155.0465.70.0770.088155.5963.36
HC 20.1130.113243.637.30.1440.115245.0835.01
HC 30.1020.112235.234.50.10.11237.2332.35
HC 40.0950.107172.638.40.0970.111178.6846.01
HC 50.120.134180.847.20.120.134180.8944.81
HC 60.0750.096128.946.10.0750.098131.1264.72
HC 70.070.0920398.30.0760.098203.3145.85
HC 80.080.104156.9450.0810.106158.4599.16
HC 90.10.1131.8500.0960.103133.5649.14
HC 100.080.1160.164.30.080.099161.8965.58
HC 110.10.12115253.90.0990.121152.154.05
Statistics0.09 ± 0.020.1 ± 0.01174.5 ± 38.248.7 ± 23.20.09 ± 0.020.1 ± 0.01176.38.254.5 ± 18.5
Male statistics0.08 ± 0.010.1 ± 0.01150.9 ± 17.144.1 ± 22.10.08 ± 0.010.1 ± 0.01153.2 ± 18.164.7 ± 18.9
Female statistics0.09 ± 0.020.01 ± 0.01202.9 ± 3854.2 ± 25.80.1 ± 0.030.1 ± 0.01203.7 ± 38.842.4 ± 8.9

Appendix A.2.3. Time-Domain Analysis

Table A4. Time-domain intensity-based features, in mean and standard deviation.
ID | Original Signal: µ(mav), σ(mav), µ(enrg), σ(enrg), µ(rms), σ(rms) | Filtered Signal: µ(mav), σ(mav), µ(enrg), σ(enrg), µ(rms), σ(rms)
PD 128170.20.2352134230.30.34330
PD 233250.20.3392943340.40.65039
PD 322120.10.1261528170.20.13420
PD 433230.30.3422944310.50.55539
PD 552520.81.2646370691.42.28684
PD 635290.30.4413446591.74.957116
PD 724140.10.1281631190.20.23722
PD 841290.40.5503553390.70.86548
PD 929200.20.2362538270.30.44734
PD 1021130.10.1271727180.20.23423
PD 1175511.2796193751.82.210583
PD 1233220.20.2392543300.40.45134
PD 1333240.20.4392940330.40.74839
PD 1450430.60.9615173601.31.88871
PD 1525230.20.4292632310.30.63734
PD 1626230.10.4312632260.20.43830
Statistics36 ± 1327 ± 130.3 ± 0.30.5 ± 0.443 ± 1532 ± 1547 ± 1838 ± 180.7 ± 0.60.4 ± 0.156 ± 2148 ± 28
Male statistics39 ± 1531 ± 160.4 ± 0.30.5 ± 0.447 ± 1736 ± 1852 ± 2244 ± 220.9 ± 0.70.6 ± 0.263 ± 2457 ± 32
Female statistics3 ± 0.723 ± 60.2 ± 0.10.4 ± 0.136 ± 0.927 ± 738 ± 930 ± 80.4 ± 0.20.5 ± 0.345 ± 1235 ± 10
HC 139290.30.5473452390.60.9630.045
HC 249280.50.5593465390.80.8780.045
HC 342310.40.6493655430.71.1660.048
HC 441290.40.5493452390.60.9630.046
HC 526180.20.2322235250.30.3430.029
HC 645360.50.7584560480.91.3760.061
HC 7624911.4786482661.82.61030.086
HC 839300.40.5493850660.72.6640.086
HC 97347121.2915898382.10.81210.047
HC 1037280.30.4463548380.60.8610.047
HC 1159450.81.1745578621.51.9980.075
Statistics47 ± 1334 ± 100.5 ± 0.30.7 ± 0.457 ± 1741 ± 1361 ± 1846 ± 131 ± 0.61.3 ± 0.876 ± 2356 ± 19
Male statistics46 ± 1433 ± 70.5 ± 0.30.6 ± 0.357 ± 1741 ± 960 ± 1945 ± 110.9 ± 0.61.2 ± 0.775 ± 2355 ± 16
Female statistics48 ± 1434 ± 130.6 ± 0.30.8 ± 0.558 ± 1942 ± 1763 ± 1947 ± 171 ± 0.61.3 ± 0.978 ± 2457 ± 23
Table A5. Time-domain periodicity-based features, in mean and standard deviation.
ID | Original Signal: µ(ZC), σ(ZC), µ(SSC), σ(SSC) | Filtered Signal: µ(ZC), σ(ZC), µ(SSC), σ(SSC)
PD 122.35716.859182.54474.58422.64118.677196.72782.65
PD 229.30753.375140.798140.37431.87259.298140.584140.668
PD 324.32521.144121.92888.09226.71229.402118.47887.19
PD 428.65929.329181.151126.52629.18130.003183.965128.598
PD 566.08181.9290.121173.970.194102.646279.219191.997
PD 630.00552.14204.257121.60830.94754.477198.394120.823
PD 723.84540.067148.127109.04724.74941.65142.158107.563
PD 834.63243.872144.893104.4135.11644.324150.085107.657
PD 919.44224.564180.573113.78820.09425.974178.236114.469
PD 1029.11639.246195.915108.02729.61441.167203.457110.171
PD 1121.37431.819129.823123.65522.5633.815132.5122.207
PD 1216.49428.891174.101112.07917.19731.203169.326110.657
PD 1316.49433.584174.10188.32630.75439.367134.91493.509
PD 1437.43155.588217.195128.35246.01370.774223.118145.956
PD 1515.7518.996169.841131.92417.22923.741171.263132.151
PD 1634.65418.996206.441131.92437.33857.353199.109116.7
Statistics28.1 ± 12.636.7 ± 18177.7 ± 41.9117.9 ± 24.230.8 ± 13.444.2 ± 22174 ± 41.8118.9 ± 23.3
Male statistics29.5 ± 14.240.1 ± 18.9189.8 ± 43.3120.4 ± 24.631.5 ± 15.845.5 ± 25.2189.3 ± 41.6122.8 ± 24
Female statistics25.9 ± 8.531.7 ± 14.5159.7 ± 30114.2 ± 23.529.8 ± 7.142.2 ± 14.4152.4 ± 28.8113 ± 21.1
HC 145.74776.733174.158123.92147.01378.947172.795125.428
HC 238.70842.538118.25576.93738.44641.955117.14576.903
HC 330.32638.703154.18993.20130.00340.942143.29196.501
HC 440.38762.869187.823100.33141.83166.345182.848109.791
HC 53238.80290.08961.73334.47543.33699.18965.526
HC 624.87227.76692.93152.63526.42131.123101.40257.502
HC 739.45940.397124.72355.80241.16642.695132.30558.475
HC 827.96828.1484.77364.63229.54442.69592.79358.475
HC 942.37958.031184.803122.55542.95536.452184.22254.428
HC 1035.39634.74393.19750.38836.46936.45296.98554.428
HC 1142.98877.642195.069133.85443.70178.549197.261135.084
Statistics36.4 ± 6.847.9 ± 18136.4 ± 43.885.1 ± 31.237.4 ± 6.749.5 ± 17.1138.2 ± 39.981.1 ± 30.3
Male statistics36.1 ± 8.348 ± 20.6136.6 ± 50.785.7 ± 34.137.4 ± 848.7 ± 19.3138.5 ± 45.776.7 ± 32.1
Female statistics36.7 ± 5.347.6 ± 16.9136.5 ± 39.984.3 ± 31.337.6 ± 5.449.5 ± 16.3137.8 ± 37.186.5 ± 30.7

Appendix A.2.4. Frequency-Domain Analysis

Table A6. Frequency-domain features, in mean value.
ID | Original Signal: µ(maxf), µ(waf), µ(skw), µ(kur) | Filtered Signal: µ(maxf), µ(waf), µ(skw), µ(kur)
PD 1209.8221224.84549.976358117.5921190.2661209.555910.33465125.1718
PD 2360.1689416.2511.92247159.6414.8776444.998111.83087157.4186
PD 3370.1667366.06549.967189117.7195375.2765383.31799.904568116.8925
PD 4250.2299306.068510.05182119.875267.008305.619510.13883121.6912
PD 5302.4605370.19439.733266113.4279300.2343374.769710.01356118.778
PD 6230.2874260.180212.05196161.9582230.490524268.868712.06773162.2545
PD 7220.8475253.26310.69389130.4106223.762915250.191710.76735132.0026
PD 8345.9367411.269810.22188123.6549355.175689417.425410.17484122.818
PD 9182.4607208.410610.10069119.5384184.348562210.211810.139120.2532
PD 10305.2498319.77049.385109106.0929296.426479327.80419.545114109.7087
PD 11249.8765277.712713.11219185.5506262.681159289.920913.00803183.6191
PD 12146.2178154.758311.97501158.6683144.605475157.424811.97299158.6024
PD 13315.6509354.709110.87591136.1331365.054945395.414410.86759136.3212
PD 14298.2712336.075910.39996125.5157430.044276463.834610.22208122.7967
PD 15219.5424235.043311.66939152.9737225.228311233.979811.6668153.0658
PD 16433.1558464.455811.40796149.6136447.3508.731811.26647146.6707
Statistics277.5 ± 76.9309.9 ± 85.210.8 ± 1136.1 ± 22.5294.5 ± 94327.6 ± 103.310.9 ± 1136.7 ± 21
Male statistics239.6 ± 53.9271.1 ± 63.110.7 ± 1.2133.9 ± 25.9253 ± 77285.8 ± 85.410.8 ± 1.1135.5 ± 24.1
Female statistics340.8 ± 70.9374.6 ± 78.911 ± 0.8139.9 ± 16.9363.8 ± 76.1397.3 ± 91.611 ± 0.8138.9 ± 16.5
HC 1623.3607657.58769.915721118.046647.812359682.74599.895411117.3672
HC 2477.2542522.721511.13494142.5209451.206897516.003811.10386141.5863
HC 3301.9284349.390911.5549149.9868288.343558345.216611.67933152.7823
HC 4343.8331397.115510.73022132.7825370.254314428.895710.85112135.3927
HC 5473.8431510.755610.06492121.1023506.415344553.08559.929213118.6239
HC 6340.6038380.70029.451434107.9514358.076225400.72249.440863107.6968
HC 7641.3078678.639310.06799120.9231642.514345691.839110.03913120.6899
HC 8448.7052474.808110.08178119.5018475.053763496.799710.04414119.2715
HC 9367.6768403.45549.790699114.6094376.106195410.7189.844548115.3877
HC 10571.8169627.16419.956054118.8349588.329839623.4869.8651116.949
HC 11439.9353501.909810.13185121.2875444.933078498.935410.14959121.573
Statistics457.3 ± 115.9500.4 ± 114.510.9 ± 1124.3 ± 12.4468.1 ± 199.2513.5 ± 155.510.3 ± 0.7124.3 ± 13.3
Male statistics449.3 ± 122.4490.1 ± 122.610.9 ± 1.3118.6 ± 8.1469.3 ± 124507.3 ± 199.410 ± 0.5118.7 ± 9.1
Female statistics466.9 ± 121512.7 ± 116.611 ± 0.8131.2 ± 14466.7 ± 127.5521 ± 124.110.6 ± 0.8131 ± 15.3
Table A7. Frequency-domain features, in standard deviation.
ID | Original Signal: σ(maxf), σ(waf), σ(skw), σ(kur) | Filtered Signal: σ(maxf), σ(waf), σ(skw), σ(kur)
PD 1140.8132110.25582.63390355.30804123.2352109.52772.74398758.12588
PD 2785.5251789.53842.99430463.92157989.0715887.14572.97883563.58607
PD 3258.7677197.01512.82517757.78512298.5884288.79142.90455358.90765
PD 4286.0091282.49532.69111855.13494317.0508263.98872.73182855.89014
PD 5528.8487544.71612.81869756.79514536.2376547.25572.78148256.80749
PD 6607.7036544.6872.76068560.40223623.1119592.81852.74837560.312
PD 7442.4261451.39872.44027453.07628459.0798459.0092.42467652.861
PD 8437.354463.70742.91098359.8237456.3991482.68612.9360459.88946
PD 989.7363170.236312.50547752.4335988.2419477.23812.53847653.54082
PD 10584.4737458.15562.72920456.43168588.0268518.45182.8079758.18234
PD 11253.676264.57552.71339758.55941305.6971313.76412.80387360.36714
PD 1262.3406366.328172.50330656.1398160.5476198.170672.46356155.52048
PD 13317.5697296.66032.72808858.53734544.5609494.10212.7678458.32652
PD 14781.5125698.9092.62136154.917061089.867990.52382.74344756.2561
PD 15199.1553255.01082.66228658.29154255.5855279.87062.65954758.0698
PD 161051.378926.18613.18114167.420141012.5411020.3643.23548168.03222
Statistics426.7 ± 280.8401.2 ± 255.52.7 ± 0.257.8 ± 3.8484.2 ± 321.64634 ± 297.82.8 ± 0.258.4 ± 3.7
Male statistics377.8 ± 260.3349.2 ± 236.32.6 ± 0.855.9 ± 17419.1 ± 323.1380.4 ± 304.72.7 ± 0.956.8 ± 18.1
Female statistics508.3 ± 338488 ± 3032.9 ± 0.261 ± 3.8592.8 ± 333575.5 ± 309.82.9 ± 0.261.1 ± 3.9
HC 11410.0131295.1243.04759760.424381434.0051370.8383.01059460.23597
HC 2600.6572546.81542.94892962.02396508.5495533.50492.91661.50816
HC 3317.4417359.88612.66483758.45661252.831320.54992.64788957.81285
HC 4709.2429659.42372.74150956.89116725.7896736.65742.76502857.35001
HC 5741.7255707.32633.07499861.17489841.5044806.44673.1806462.78911
HC 6476.0353457.60132.81618156.1085531.2828506.7662.81449755.93618
HC 7915.6143873.0833.06264861.8054964.6048903.36653.12456962.63882
HC 8433.6057445.42682.55898852.90653538.7827523.28712.67725654.6878
HC 9469.309410.90732.85024759.04572398.2593419.07792.80138658.13629
HC 10731.1745751.78043.0000360.84957764.2004758.56253.03960261.60967
HC 11790.7127816.422.87302857.9772821.1247796.35622.87102258.41921
Statistics690.5 ± 298.5665.8 ± 271.22.9 ± 0.258.9 ± 2.8707.4 ± 320.9697.8 ± 289.159.2 ± 0.259.2 ± 2.7
Male statistics704.9 ± 368.6670 ± 334.72.8 ± 0.257.7 ± 3732 ± 369.6719.2 ± 346.42.9 ± 0.1458 ± 2.6
Female statistics673.2 ± 288.6660.7 ± 209.22.9 ± 0.260.3 ± 1.9677.7 ± 291672 ± 239.72.9 ± 0.260.6 ± 2.4

Appendix A.2.5. LPC Analysis

Table A8. First three formants (f1, f2, and f3), in mean value.
ID | Original Signal: µ(f1), µ(f2), µ(f3) | Filtered Signal: µ(f1), µ(f2), µ(f3)
PD 1146.3977356.2246927.963195.99813215.8823664.4091
PD 2140.5878305.1284846.6479143.1003309.3127853.4781
PD 3116.3885238.901723.2804117.1406241.6413728.9491
PD 4116.3885238.901723.2804112.4424251.6562725.0138
PD 5106.2606244.0261718.9596107.1858246.2775720.2278
PD 6117.6647295.6761806.0591128.2465321.9942865.6273
PD 7126.7025318.2057858.2249124.9557246.3748721.9826
PD 8125.9183249.031725.5743118.7285290.2411808.252
PD 9118.2062286.9403799.757892.24254199.7526636.9858
PD 1090.98081195.1003631.759167.9252397.43671007.766
PD 11168.3419398.13651008.629128.3763346.5467900.6316
PD 12127.7762344.6129898.1218112.3823232.116714.1573
PD 13110.4784225.7373701.0682106.2781235.6601703.9548
PD 14102.7345230.3552686.8945139.465323.6074864.4784
PD 15138.2548319.142857.9386105.7259234.3006699.3675
PD 16101.5309227.0754685.9352128.2465321.9942865.6273
Statistics122.2 ± 19.4279.6 ± 57.1787.5 ± 104.5119.9 ± 19274.3 ± 54.2776.5 ± 100.2
Male statistics122.1 ± 23.1290.8 ± 67.5806 ± 124117.6 ± 21.7280.2 ± 63.8784.1 ± 118.8
Female statistics122.2 ± 15.5260.8 ± 40.9756.7 ± 75.6123.8 ± 15264.6 ± 40.8763.7 ± 74.5
HC 1119.4577258.2862747.4789119.796259.1702748.1364
HC 2126.8611243.3643731.9308125.3441241.6419726.2462
HC 3127.005279.1861794.3075127.6849281.9603800.3555
HC 4118.1437270.415773.6646120.3063276.3826780.0706
HC 5141.4042302.962837.551137.2403293.8991823.2496
HC 6139.2445317.8754863.2708135.1576308.9066846.237
HC 7112.6445209.2341670.1639109.3551204.5748661.7582
HC 8137.7004275.2643781.1636135.5184269.6477769.8928
HC 9100.6592212.7324669.956498.78291210.2192667.7949
HC 10116.4745231.9436720.1556115.1922230.5635717.091
HC 11119.2252253.3964731.0832117.1745249.7392725.2645
Statistics123.5 ± 12.4259.5 ± 34.4756.4 ± 61.6122 ± 11.9257 ± 33.4751.5 ± 59.4
Male statistics121.9 ± 14.5261 ± 36.6759.3 ± 65120.8 ± 13.725,901 ± 34.9754.9 ± 60.4
Female statistics125.5 ± 10.7257.6 ± 35.6753 ± 64.5123.4 ± 10.6254.4 ± 35.3747.4 ± 64.9
Table A9. First three formants (f1, f2, and f3), in standard deviation.
ID | Original Signal: σ(f1), σ(f2), σ(f3) | Filtered Signal: σ(f1), σ(f2), σ(f3)
PD 1105.5148191.726310.0546119.8245233.1591400.3893
PD 2119.8571217.726389.5605120.1219215.9192385.9008
PD 3130.1879230.3062397.0764129.7914230.2576396.8608
PD 4122.6321231.428394.7242122.7717232.5018395.1004
PD 5116.6938228.0452406.5741117.4288229.7301407.7685
PD 6104.8515225.3515407.9145105.2166224.507406.5868
PD 7104.7348213.5573383.3348105.0007212.0539379.9155
PD 8132.904230.5034400.8092133.1788230.791402.9661
PD 9109.764223.279404.8753108.5854222.2566401.8178
PD 10125.4507234.265382.4851125.3875236.2768384.0149
PD 1191.01827153.9159291.002990.96241155.512290.7811
PD 1293.42048202.0746365.965792.79807200.2818362.8614
PD 13131.167230.5023395.655130.7768231.1815396.792
PD 14131.167230.5023395.655123.4698240.062412.6428
PD 15111.0153210.9409389.2035110.2592209.8486386.3314
PD 16122.5947242.9469415.3833124.49242.7555413.6597
Statistics115.8 ± 13.4218.6 ± 21.6383.1 ± 34.5116.3 ± 12.9221.7 ± 21.1389 ± 29.4
Male statistics110.5 ± 35.6213.4 ± 68.6374.3 ± 119.4111.1 ± 35.6218.6 ± 70.1384.2 ± 120.8
Female statistics124.6 ± 8.4227.2 ± 11.3397.9 ± 9.6124.8 ± 8.5226.8 ± 11.9397 ± 10.5
HC 1125.2216234.0009408.6084125.0871234.0544408.5006
HC 2141.3759229.5698376.7384140.8267230.6014378.0453
HC 3121.4583222.5808394.6395120.9949222.8789393.4422
HC 4121.4583222.5808394.6395120.9949222.8789393.4422
HC 5123.4126216.2284379.077123.3718219.2568386.5237
HC 6116.3071210.4272368.4274116.2192214.4403379.4195
HC 7144.1913230.2897362.6656143.1769232.56363.4399
HC 8131.9069223.022397.9898132.8698225.197403.2771
HC 9125.6425230.4328402.9594124.1504230.823402.7296
HC 10135.9895232.2704391.454134.6502232.8903391.5255
HC 11123.2632229.5243415.2614122.8076229.6399416.1476
Statistics128.2 ± 8.9225.5 ± 7.3390.2 ± 16.6127.7 ± 8.8226.8 ± 6.4392.4 ± 15.2
Male statistics126.1 ± 7.1225.5 ± 8.8394 ± 13.9125.7 ± 7226.7 ± 7.4396.5 ± 10.5
Female statistics130.7 ± 11.1225.6 ± 6.1385.7 ± 20.1130.2 ± 10.8227 ± 5.7387.5 ± 19.5

References

  1. Triarhou, L.C. Dopamine and Parkinson’s Disease. In Madame Curie Bioscience Database; Landes Bioscience: Austin, TX, USA, 2013. [Google Scholar]
  2. Tysnes, O.B.; Storstein, A. Epidemiology of Parkinson’s disease. J. Neural Transm. 2017, 124, 901–905. [Google Scholar] [CrossRef] [PubMed]
  3. Garcia-Ruiz, P.J.; Chaudhuri, K.R.; Martinez-Martin, P. Non-motor symptoms of Parkinson’s disease: A review from the past. J. Neurol. Sci. 2014, 338, 30–33. [Google Scholar] [CrossRef]
  4. Gallagher, D.A.; Schrag, A. Psychosis, apathy, depression and anxiety in Parkinson’s disease. Neurobiol. Dis. 2012, 46, 581–589. [Google Scholar] [CrossRef]
  5. Duncan, G.W.; Khoo, T.K.; Yarnall, A.J.; O’Brien, J.T.; Coleman, S.Y.; Brooks, D.J.; Barker, R.A.; Burn, D.J. Health-related quality of life in early Parkinson’s disease: The impact of nonmotor symptoms. Mov. Disord. Off. J. Mov. Disord. Soc. 2014, 29, 195–202. [Google Scholar] [CrossRef] [PubMed]
  6. Bugalho, P.; Lampreia, T.; Miguel, R.; Mendonça, M.D.; Caetano, A.; Barbosa, R. Non-Motor symptoms in Portuguese Parkinson’s Disease patients: Correlation and impact on Quality of Life and Activities of Daily Living. Sci. Rep. 2016, 6, 32267. [Google Scholar] [CrossRef] [PubMed]
  7. Miller, N.; Noble, E.; Jones, D.; Burn, D. Life with communication changes in Parkinson’s disease. Age Ageing 2006, 35, 235–239. [Google Scholar] [CrossRef]
  8. Miller, N.; Allcock, L.; Jones, D.; Noble, E.; Hildreth, A.J.; Burn, D.J. Prevalence and pattern of perceived intelligibility changes in Parkinson’s disease. J. Neurol. Neurosurg. Psychiatry 2007, 78, 1188–1190. [Google Scholar] [CrossRef]
  9. Ray Dorsey, E. Global, regional, and national burden of Parkinson’s disease, 1990–2016: A systematic analysis for the Global Burden of Disease Study. Lancet Neurol. 2016, 17, 939–953. [Google Scholar] [CrossRef]
  10. Yang, W.; Hamilton, J.L.; Kopil, C.; Beck, J.C.; Tanner, C.M.; Albin, R.L.; Dorsey, E.R.; Dahodwala, N.; Cintina, I.; Hogan, P.; et al. Current and projected future economic burden of Parkinson’s disease in the U.S. NPJ Parkinsons Dis. 2020, 6, 15. [Google Scholar] [CrossRef]
  11. Tinelli, M.; Kanavos, P.; Grimaccia, F. The Value of Early Diagnosis and Treatment in Parkinson’s Disease. A Literature Review of the Potential Clinical and Socioeconomic Impact of Targeting Unmet Needs in Parkinson’s Disease; London School of Economics and Political Science: London, UK, 2016. [Google Scholar]
  12. Marras, C.; Beck, J.C.; Bower, J.H.; Roberts, E.; Ritz, B.; Ross, G.W.; Tanner, C.M. Prevalence of Parkinson’s disease across North America. NPJ Park. Dis. 2018, 4, 21. [Google Scholar] [CrossRef]
  13. Pedro, G.-V.; Jiri, M.; Ferrández José, M.; Daniel, P.-A.; Andrés, G.-R.; Victoria, R.-B.; Zoltan, G.; Zdenek, S.; Ilona, E.; Milena, K.; et al. Parkinson Disease Detection from Speech Articulation Neuromechanics. Front. Neuroinformatics 2017, 11, 56. [Google Scholar] [CrossRef]
  14. Yunusova, Y.; Weismer, G.G.; Westbury, J.R.; Lindstrom, M.J. Articulatory movements during vowels in speakers with dysarthria and healthy controls. J. Speech Lang. Hear. Res. 2008, 51, 596–611. [Google Scholar] [CrossRef] [PubMed]
  15. Lowit, A.; Marchetti, A.; Corson, S.; Kuschmann, A. Rhythmic performance in hypokinetic dysarthria: Relationship between reading, spontaneous speech and diadochokinetic tasks. J. Commun. Disord. 2018, 72, 26–39. [Google Scholar] [CrossRef] [PubMed]
  16. Tsanas, A.; Little, M.A.; McSharry, P.E.; Ramig, L.O. Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson’s disease symptom severity. J. R. Soc. Interface 2011, 8, 842–855. [Google Scholar] [CrossRef] [PubMed]
  17. Galaz, Z.; Mekyska, J.; Mzourek, Z.; Smekal, Z.; Rektorova, I.; Eliasova, I.; Kostalova, M.; Mrackova, M.; Berankova, D. Prosodic analysis of neutral, stress-modified and rhymed speech in patients with Parkinson’s disease. Comput. Methods Programs Biomed. 2016, 127, 301–317. [Google Scholar] [CrossRef]
  18. Tykalova, T.; Rusz, J.; Klempir, J.; Cmejla, R.; Ruzicka, E. Distinct patterns of imprecise consonant articulation among Parkinson’s disease, progressive supranuclear palsy and multiple system atrophy. Brain Lang. 2017, 165, 1–9. [Google Scholar] [CrossRef]
  19. Brabenec, L.; Mekyska, J.; Galaz, Z.; Rektorova, I. Speech disorders in Parkinson’s disease: Early diagnostics and effects of medication and brain stimulation. Neural Transm. 2017, 124, 303–334. [Google Scholar] [CrossRef]
  20. Villa-Canas, T.; Orozco-Arroyave, J.; Vargas-Bonilla, J.; Arias-Londono, J. Modulation spectra for automatic detection of Parkinson’s disease. In Proceedings of the Image Signal Processing and Artificial Vision (STSIVA) 2014 XIX Symposium, Armenia-Quindio, Armenia, Colombia, 17–19 September 2014; pp. 1–5. [Google Scholar]
  21. Jeancolas, L.; Benali, H.; Benkelfat, B.-E.; Mangone, G.; Corvol, J.-C.; Vidailhet, M.; Lehericy, S.; Petrovska-Delacrétaz, D. Automatic detection of early stages of Parkinson’s disease through acoustic voice analysis with mel-frequency cepstral coefficients. In Proceedings of the 3rd International Conference on Advanced Technologies for Signal and Image Processing (ATSIP 2017), Fez, Morocco, 22–24 May 2017; pp. 1–4. [Google Scholar]
  22. Suhas, B.N.; Patel, D.; Rao, N.; Belur, Y.; Reddy, P.; Atchayaram, N.; Yadav, R.; Gope, D.; Ghosh, P.K. Comparison of Speech Tasks and Recording Devices for Voice Based Automatic Classification of Healthy Subjects and Patients with Amyotrophic Lateral Sclerosis. Proc. Interspeech 2019, 2019, 4564–4568. [Google Scholar]
  23. Dashtipour, K.; Tafreshi, A.; Lee, J.; Crawley, B. Speech disorders in Parkinson’s disease: Pathophysiology, medical management and surgical approaches. Neurodegener. Dis. Manag. 2018, 8, 337–348. [Google Scholar] [CrossRef]
  24. Maskeliūnas, R.; Damaševičius, R.; Kulikajevas, A.; Padervinskis, E.; Pribuišis, K.; Uloza, V. A Hybrid U-Lossian Deep Learning Network for Screening and Evaluating Parkinson’s Disease. Appl. Sci. 2022, 12, 11601. [Google Scholar] [CrossRef]
  25. Veronica, B.; Eleonora, C.; Monica, C.; Cristiano, C.; Andrea, M.; Cappa Stefano, F. Connected Speech in Neurodegenerative Language Disorders: A Review. Front. Psychol. 2017, 8, 269. [Google Scholar] [CrossRef]
  26. Al-Hameed, S.; Benaissa, M.; Christensen, H.; Mirheidari, B.; Blackburn, D.; Reuber, M. A new diagnostic approach for the identification of patients with neurodegenerative cognitive complaints. PLoS ONE 2019, 14, e0217388. [Google Scholar] [CrossRef] [PubMed]
  27. Skodda, S.; Gronheit, W.; Schlegel, U. Intonation and speech rate in parkinson’s disease: General and dynamic aspects and responsiveness to levodopa admission. J. Voice 2011, 25, 199–205. [Google Scholar] [CrossRef] [PubMed]
  28. Laganas, C.; Iakovakis, D.; Hadjidimitriou, S.; Charisis, V.; Dias, S.B.; Bostantzopoulou, S.; Katsarou, Z.; Klingelhoefer, L.; Reichmann, H.; Trivedi, D.; et al. Parkinson’s Disease Detection Based on Running Speech Data from Phone Calls. IEEE Trans. Bio-Med. Eng. 2022, 69, 1573–1584. [Google Scholar] [CrossRef] [PubMed]
  29. Harel, B.T.; Cannizzaro, M.S.; Cohen, H.; Reilly, N.; Snyder, P.J. Acoustic characteristics of Parkinsonian speech: A potential biomarker of early disease progression and treatment. J. Neurolinguist. 2004, 17, 439–453. [Google Scholar] [CrossRef]
  30. Rusz, J.; Cmejla, R.; Ruzickova, H.; Ruzicka, E. Quantitative acoustic measurements for characterization of speech and voice disorders in early untreated parkinson’s disease. J. Acoust. Soc. Am. 2011, 129, 350–367. [Google Scholar] [CrossRef]
  31. Orozco-Arroyave, J.R.; Hönig, F.; Arias-Londoño, J.D.; Vargas-Bonilla, J.F.; Skodda, S.; Rusz, J.; Nöth, E. Voiced/unvoiced transitions in speech as a potential bio-marker to detect Parkinson’s disease. Proc. Interspeech 2015, 2015, 95–99. [Google Scholar] [CrossRef]
  32. Mekyska, J.; Janousova, E.; Gómez, P.; Smekal, Z.; Rektorova, I.; Eliasova, I.; Kostalova, M.; Mrackova, M.; Alonso-Hernandez, J.B.; Faundez-Zanuy, M.; et al. Robust and complex approach of pathological speech signal analysis. Neurocomputing 2015, 167, 94–111. [Google Scholar] [CrossRef]
  33. Skodda, S.; Visser, W.; Schlegel, U. Vowel articulation in Parkinson’s disease. J. Voice 2011, 25, 467–472. [Google Scholar] [CrossRef]
  34. Rusz, J.; Cmejla, R.; Tykalova, T.; Ruzickova, H.; Klempir, J.; Majerova, V.; Picmausova, J.; Roth, J.; Ruzicka, E. Imprecise vowel articulation as a potential early marker of Parkinson’s disease: Effect of speaking task. J. Acoust. Soc. Am. 2013, 134, 2171–2181. [Google Scholar] [CrossRef]
  35. Khan, T. Running-Speech MFCC Are Better Markers of Parkinsonian Speech Deficits Than Vowel Phonation and Diadochokinetic. Available online: http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-24645 (accessed on 21 April 2023).
  36. Orozco-Arroyave, J.R.; Hönig, F.; Arias-Londoño, J.D.; Vargas-Bonilla, J.F.; Daqrouq, K.; Skodda, S.; Rusz, J.; Nöth, E. Automatic detection of Parkinson’s disease in running speech spoken in three different languages. J. Acoust. Soc. Am. 2016, 139, 481–500. [Google Scholar] [CrossRef] [PubMed]
  37. Amato, F.; Borzì, L.; Olmo, G.; Orozco-Arroyave, J.R. An algorithm for Parkinson’s disease speech classification based on isolated words analysis. Health Inf. Sci. Syst. 2021, 9, 32. [Google Scholar] [CrossRef] [PubMed]
  38. Vaiciukynas, E.; Gelzinis, A.; Verikas, A.; Bacauskiene, M. Parkinson’s Disease Detection from Speech Using Convolutional Neural Networks. In Smart Objects and Technologies for Social Good. GOODTECHS 2017; Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering; Guidi, B., Ricci, L., Calafate, C., Gaggi, O., Marquez-Barja, J., Eds.; Springer: Cham, Switzerland, 2018; Volume 233. [Google Scholar] [CrossRef]
  39. Hoq, M.; Uddin, M.N.; Park, S.B. Vocal Feature Extraction-Based Artificial Intelligent Model for Parkinson’s Disease Detection. Diagnostics 2021, 11, 1076. [Google Scholar]
  40. Mei, J.; Desrosiers, C.; Frasnelli, J. Machine Learning for the Diagnosis of Parkinson’s Disease: A Review of Literature. Front Aging Neurosci. 2021, 13, 633752. [Google Scholar] [CrossRef] [PubMed]
  41. Kaya, D. Optimization of SVM Parameters with Hybrid CS-PSO Algorithms for Parkinson’s Disease in LabVIEW Environment. Parkinsons. Dis. 2019, 2019, 2513053. [Google Scholar] [CrossRef] [PubMed]
  42. Yaman, O.; Ertam, F.; Tuncer, T. Automated Parkinson’s Disease Recognition Based on Statistical Pooling Method Using Acoustic Features; Elsevier: Amsterdam, The Netherlands, 2020; Volume 135. [Google Scholar]
  43. Appakaya, S.B.; Sankar, R. Parkinson’s Disease Classification using Pitch Synchronous Speech Segments and Fine Gaussian Kernels based SVM. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. 2020, 2020, 236–239. [Google Scholar] [CrossRef]
  44. Suhas, B.; Mallela, J.; Illa, A.; Yamini, B.; Atchayaram, N.; Yadav, R.; Gope, D.; Ghosh, P.K. Speech task based automatic classification of ALS and Parkinson’s Disease and their severity using log Mel spectrograms. In Proceedings of the 2020 International Conference on Signal Processing and Communications (SPCOM), Bangalore, India, 24 July 2020; pp. 1–5. [Google Scholar] [CrossRef]
  45. Faragó, P.; Popescu, A.-S.; Perju-Dumbravă, L.; Ileşan, R.R. Wearables as Part of Decision Support System in Parkinson’s Disease Prediagnosis: A Case Study. In Proceedings of the 2022 E-Health and Bioengineering Conference (EHB), Iasi, Romania, 17–18 November 2022; pp. 1–4. [Google Scholar] [CrossRef]
  46. Sarker, I.H. Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput. Sci. 2021, 2, 420. [Google Scholar] [CrossRef]
  47. Wu, J. Introduction to convolutional neural networks. Natl. Key Lab Nov. Softw. Technol. 2017, 5, 495. [Google Scholar]
  48. Fira, M.; Costin, H.-N.; Goraș, L. A Study on Dictionary Selection in Compressive Sensing for ECG Signals Compression and Classification. Biosensors 2022, 12, 146. [Google Scholar] [CrossRef]
  49. Vaseghi, S.V. Multimedia Signal Processing Theory and Applications in Speech, Music and Communications; John Wiley& Sons, Ltd: Hoboken, NJ, USA, 2007; ISBN 978-0-470-06201-2. [Google Scholar]
  50. Smith, S.W. The Scientist and Engineer’s Guide to Digital Signal Processing. Available online: https://www.dspguide.com/ (accessed on 21 April 2023).
  51. Lascu, M.; Lascu, D. Electrocardiogram compression and optimal ECG filtering algorithms. WSEAS Trans. Comput. 2008, 7, 155–164. [Google Scholar]
  52. Vondrasek, M.; Pollak, P. Methods for Speech SNR estimation: Evaluation Tool and Analysis of VAD Dependency. Radioengineering 2005, 14, 6–11. [Google Scholar]
  53. Strake, M.; Defraene, B.; Fluyt, K.; Tirry, W.; Fingscheidt, T. Speech enhancement by LSTM-based noise suppression followed by CNN-based speech restoration. EURASIP J. Adv. Signal Process. 2020, 2020, 49. [Google Scholar] [CrossRef]
  54. Ke, Y.; Li, A.; Zheng, C.; Peng, R.; Li, X. Low-complexity artificial noise suppression methods for deep learning-based speech enhancement algorithms. J. Audio Speech Music Proc. 2021, 2021, 17. [Google Scholar] [CrossRef]
  55. Alías, F.; Socoró, J.C.; Sevillano, X. A Review of Physical and Perceptual Feature Extraction Techniques for Speech, Music and Environmental Sounds. Appl. Sci. 2016, 6, 143. [Google Scholar] [CrossRef]
  56. Faragó, P.; Grama, L.; Farago, M.-A.; Hintea, S. A Novel Wearable Foot and Ankle Monitoring System for the Assessment of Gait Biomechanics. Appl. Sci. 2021, 11, 268. [Google Scholar] [CrossRef]
  57. Vaiciukynas, E.; Verikas, A.; Gelzinis, A.; Bacauskiene, M. Detecting Parkinson’s disease from sustained phonation and speech signals. PLoS ONE 2017, 12, e0185613. [Google Scholar] [CrossRef]
  58. Bryson, D.J.; Nakamura, H.; Hahn, M.E. High energy spectrogram with integrated prior knowledge for EMG-based locomotion classification. Med. Eng. Phys. 2015, 37, 518–524. [Google Scholar] [CrossRef]
  59. Cordo, C.; Mihailă, L.; Faragó, P.; Hintea, S. ECG signal classification using Convolutional Neural Networks for Biometric Identification. In Proceedings of the 2021 44th International Conference on Telecommunications and Signal Processing (TSP), Brno, Czech Republic, 26–28 June 2021; pp. 167–170. [Google Scholar] [CrossRef]
  60. Dumpala, S.H.; Alluri, K.N.R.K.R. An Algorithm for Detection of Breath Sounds in Spontaneous Speech with Application to Speaker Recognition. In Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science; Karpov, A., Potapova, R., Mporas, I., Eds.; Springer: Cham, Switzerland, 2017; Volume 10458. [Google Scholar] [CrossRef]
  61. Pantelis, D.P.; Hadjipantelis, Z.; Coleman, J.S.; Aston, J.A.D. The statistical analysis of acoustic phonetic data: Exploring differences between spoken Romance languages. Appl. Statist. 2018, 67, 1103–1145. [Google Scholar]
  62. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  63. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  64. Ileșan, R.R.; Cordoș, C.-G.; Mihăilă, L.-I.; Fleșar, R.; Popescu, A.-S.; Perju-Dumbravă, L.; Faragó, P. Proof of Concept in Artificial-Intelligence-Based Wearable Gait Monitoring for Parkinson’s Disease Management Optimization. Biosensors 2022, 12, 189. [Google Scholar] [CrossRef]
  65. Fira, M.; Costin, H.-N.; Goraș, L. On the Classification of ECG and EEG Signals with Various Degrees of Dimensionality Reduction. Biosensors 2021, 11, 161. [Google Scholar] [CrossRef] [PubMed]
  66. Kent, R.D.; Forner, L.L. Speech segment duration in sentence recitations by children and adults. J. Phon. 1980, 8, 157–168. [Google Scholar] [CrossRef]
  67. Carmona-Duarte, C.; Plamondon, R.; Gómez-Vilda, P.; Ferrer, M.A.; Alonso, J.B.; Londral, A.R.M. Application of the lognormal model to the vocal tract movement to detect neurological diseases in voice. In Proceedings of the International Conference on Innovation in Medicine and Healthcare, Tenerife, Spain, 15–17 June 2016; Springer: Cham, Switzerland, 2016; pp. 25–35. [Google Scholar]
  68. Mihăilă, L.-I.; Cordoş, C.-G.; Ileşan, R.R.; Faragó, P.; Hintea, S. CNN-based Identification of Parkinsonian Gait using Ground Reaction Forces. In Proceedings of the 2022 45th International Conference on Telecommunications and Signal Processing (TSP), Prague, Czech Republic, 13–15 July 2022; pp. 318–321. [Google Scholar] [CrossRef]
Figure 1. Speech acquisition and assessment protocol in the study of AI-based Parkinsonian speech identification.
Figure 2. Proposed speech processing and assessment workflow, aiming for the identification of Parkinsonian speech following feature-based assessment and CNN-based classification.
Figure 3. Block diagram of the Wiener filter implemented on the FIR filter topology.
Figure 4. Workflow of the CNN-based classification of spectrograms aiming for the identification of Parkinsonian speech.
Figure 5. The voice activity detection procedure illustrated for a PD patient: (a) original signal and (b) filtered signal. The top figure plots the signal (blue) and the detected voice activity (orange). The bottom figure plots the signal energy (blue) and the comparison threshold (orange).
Figure 6. The voice activity detection procedure illustrated for an HC: (a) original signal and (b) filtered signal. The top figure plots the signal (blue) and the detected voice activity (orange). The bottom figure plots the signal energy (blue) and the comparison threshold (orange).
Figure 7. The prosody features extracted for a PD patient: (a) original signal and (b) filtered signal. The top figure plots the speech sample, the middle figure plots the signal intensity, and the bottom figure plots the pitch.
Figure 8. The prosody features extracted for an HC: (a) original signal and (b) filtered signal. The top figure plots the speech sample, the middle figure plots the signal intensity, and the bottom figure plots the pitch.
Figure 9. The time-domain intensity-based features extracted for a PD patient: (a) original signal and (b) filtered signal. The top figure plots the speech sample, followed by the mean absolute value, signal energy, and root mean square.
Figure 10. The time-domain intensity-based features extracted for an HC: (a) original signal and (b) filtered signal. The top figure plots the speech sample, followed by the mean absolute value, signal energy, and root mean square.
Figure 11. The time-domain periodicity-based features extracted for a PD patient: (a) original signal and (b) filtered signal. The top figure plots the speech sample, followed by the zero-crossing rate and slope sign changes.
Figure 12. The time-domain periodicity-based features extracted for an HC: (a) original signal and (b) filtered signal. The top figure plots the speech sample, followed by the zero-crossing rate and slope sign changes.
Figure 13. The frequency-domain features extracted for a PD patient: (a) original signal and (b) filtered signal. The top figure plots the speech sample, followed by the frequency of the maximum spectral component, weighted average of the frequency components, skewness, and kurtosis.
Figure 14. The frequency-domain features extracted for an HC: (a) original signal and (b) filtered signal. The top figure plots the speech sample, followed by the frequency of the maximum spectral component, weighted average of the frequency components, skewness, and kurtosis.
Figure 15. The speech sample (top) and the first three formants (f1, f2, and f3) extracted for a PD patient: (a) original signal and (b) filtered signal.
Figure 16. The speech sample (top) and the first three formants (f1, f2, and f3) extracted for an HC: (a) original signal and (b) filtered signal.
Figure 17. The speech sample (top) and the speech spectrogram (bottom) for the uttering of the word “Românie” by (a) a PD patient and (b) an HC.
Figure 18. The speech energy (top) and the speech energy spectrogram (bottom) for the uttering of the word “Românie” by (a) a PD patient and (b) an HC.
Figure 19. The speech sample (top) and the Mel spectrogram (bottom) for the uttering of the word “Românie” by (a) a PD patient and (b) an HC.
Table 1. Feature classes, categorized by the speaking task, for the objective assessment and identification of hypokinetic dysarthria manifestations.
Hypokinetic Dysarthria Manifestation | Sustained Vowel Phonation | Diadochokinetic Task | Isolated Words | Short Sentences | Continuous Speech
Voice blocking | n.a. | n.a. | n.a. | Phonology | Phonology
Mono-pitch oration | n.a. | n.a. | n.a. | n.a. | MFCCs
Mono-loudness oration | n.a. | n.a. | n.a. | n.a. | MFCCs
Tremor phonation | Prosody | Prosody | Prosody | Prosody | MFCCs
Voice quality | Time domain, Frequency domain | Time domain, Frequency domain | Time domain, Frequency domain | Time domain, Frequency domain | MFCCs
Impaired articulation | Formants | Formants | Formants | n.a. | MFCCs
n.a.—not available/not reported. MFCCs—Mel-frequency cepstral coefficients.
Table 2. Parkinsonian speech assessment features, categorized by the feature classes.
Feature Class | Features | Reference
Phonology | Speech and silence statistics: speech rate, number of pauses, pause duration, phonemic errors, phonation time, locution time, filled pauses, false starts | [25,26]
Prosody | Pitch | [27,28]
Prosody | σ(f0), σ(I) | [13,25,26,27,29,30,31]
Prosody | HNR | [26,32]
Prosody | Shimmer, jitter | [26]
Time domain | Energy | [37]
Time domain | Zero-crossing rate | [37]
Frequency domain | Filter bank energy coefficient, spectral sub-band centroid | [26]
Frequency domain | Skewness, kurtosis | [37]
Formants | f1, f2, f3 | [13,31,33,34,36]
MFCC | MFCC | [26,35,38]
MFCC | Derivatives of the MFCC | [38]
Table 3. Parkinsonian speech assessment features targeted in this work.
Feature Set | Features
Phonology | Uttering count (nuttering), number of pauses (npause), speech rate (rspeech), pause duration (tpause)
Prosody | Intensity (I), fundamental frequency (f0)
Time domain | Mean absolute value (mav), energy (enrg), root mean square (rms), zero-crossing rate (ZC), slope sign changes (SSC)
Frequency domain | Frequency of the maximum spectral component (maxf), weighted average of the spectral components (waf), skewness, kurtosis
Formants | f1, f2, f3
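To make the time-domain entries of Table 3 concrete, the following NumPy sketch computes mav, enrg, rms, ZC, and SSC on a single analysis frame; the framing and the synthetic test signal are illustrative assumptions rather than the settings used in this work.

```python
import numpy as np

def time_domain_features(frame: np.ndarray) -> dict:
    """Time-domain features of Table 3 for one analysis frame."""
    mav = np.mean(np.abs(frame))             # mean absolute value
    enrg = np.sum(frame ** 2)                # signal energy
    rms = np.sqrt(np.mean(frame ** 2))       # root mean square
    sign = np.signbit(frame).astype(int)
    zc = int(np.sum(np.abs(np.diff(sign))))          # zero crossings
    slope_sign = np.signbit(np.diff(frame)).astype(int)
    ssc = int(np.sum(np.abs(np.diff(slope_sign))))   # slope sign changes
    return {"mav": mav, "enrg": enrg, "rms": rms, "ZC": zc, "SSC": ssc}

# Example: a 25 ms frame of a 150 Hz tone sampled at 16 kHz.
fs = 16000
t = np.arange(int(0.025 * fs)) / fs
print(time_domain_features(0.1 * np.sin(2 * np.pi * 150 * t)))
```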
Table 4. The CNN hyperparameter settings.
Hyperparameter | Value
Learning rate | 0.005
Loss function | BinaryCrossentropy
Activation function | ReLU
Batch normalization | active
Epochs | 100
Data augmentation | RandomContrast (factor = 0.3), RandomFlip (mode = “horizontal”), RandomRotation (factor = 0.18)
Table 5. The CNN structure.
Type/Stride | Filter Shape | Input Size
Conv/s2 | 3 × 3 × 3 × 32 | 224 × 224 × 3
Conv dw/s1 | 3 × 3 × 32 dw | 112 × 112 × 32
Conv/s1 | 1 × 1 × 32 × 64 | 112 × 112 × 32
Conv dw/s2 | 3 × 3 × 64 dw | 112 × 112 × 64
Conv/s1 | 1 × 1 × 64 × 128 | 56 × 56 × 64
Conv dw/s1 | 3 × 3 × 128 dw | 56 × 56 × 128
Conv/s1 | 1 × 1 × 128 × 128 | 56 × 56 × 128
Conv dw/s2 | 3 × 3 × 128 dw | 56 × 56 × 128
Conv/s1 | 1 × 1 × 128 × 256 | 28 × 28 × 128
Conv dw/s1 | 3 × 3 × 256 dw | 28 × 28 × 256
Conv/s1 | 1 × 1 × 256 × 256 | 28 × 28 × 256
Conv dw/s2 | 3 × 3 × 256 dw | 28 × 28 × 256
Conv/s1 | 1 × 1 × 256 × 512 | 14 × 14 × 256
Conv dw/s1 | 3 × 3 × 512 dw | 14 × 14 × 512
Conv/s1 | 1 × 1 × 512 × 512 | 14 × 14 × 512
Conv dw/s2 | 3 × 3 × 512 dw | 14 × 14 × 512
Conv/s1 | 1 × 1 × 512 × 1024 | 7 × 7 × 512
Conv dw/s2 | 3 × 3 × 1024 dw | 7 × 7 × 1024
Conv/s1 | 1 × 1 × 1024 × 1024 | 7 × 7 × 1024
Avg Pool/s1 | Pool 7 × 7 | 7 × 7 × 1024
FC/s1 | 1024 × 1000 | 1 × 1 × 1024
Softmax/s1 | Classifier | 1 × 1 × 1000
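As a companion to Tables 4 and 5, the following is a minimal TensorFlow/Keras sketch of a MobileNet-based binary classifier configured with the listed hyperparameters; the optimizer choice (Adam), the single sigmoid output unit, and the dataset objects are assumptions, so this is a configuration sketch rather than the exact training script used in this work.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Data augmentation settings from Table 4.
augmentation = keras.Sequential([
    layers.RandomContrast(factor=0.3),
    layers.RandomFlip(mode="horizontal"),
    layers.RandomRotation(factor=0.18),
])

# MobileNet backbone (Table 5) with a sigmoid head for the PD-vs-HC decision.
backbone = keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, weights=None, pooling="avg")

inputs = keras.Input(shape=(224, 224, 3))
x = augmentation(inputs)
x = backbone(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

# Hyperparameters from Table 4: learning rate 0.005, binary cross-entropy.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.005),
              loss=keras.losses.BinaryCrossentropy(),
              metrics=["accuracy"])

# model.fit(train_ds, validation_data=val_ds, epochs=100)  # 100 epochs (Table 4)
```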
Table 6. Statistics of the Wiener filter speech enhancement and fidelity measures.
Feature | PD (Original) | HC (Original) | PD (Filtered) | HC (Filtered)
SNR | 39.3 ± 17.4 | 34.7 ± 8.6 | 43.5 ± 16.5 | 39.3 ± 8.9
SNRI | – | – | 4.1 ± 2.6 | 4.6 ± 2.3
MSE | – | – | (2.8 ± 2.2) × 10−4 | (5.1 ± 2.8) × 10−4
Table 7. Statistics of the phonological parameters.
Feature | PD (Original) | HC (Original) | PD (Filtered) | HC (Filtered)
nuttering | 13.9 ± 7.4 | 9.4 ± 4.1 | 12.6 ± 6.9 | 8.6 ± 3.5
npause | 12.9 ± 7.4 | 8.4 ± 4.1 | 11.6 ± 6.9 | 7.6 ± 3.5
rspeech | 39.4 ± 8.3 | 31.6 ± 12.3 | 33.1 ± 9.8 | 28.6 ± 8.8
tpause | 8.3 ± 7.9 | 4.6 ± 2.4 | 5.8 ± 3.2 | 4.3 ± 2.4
Table 8. Statistics of the speech prosody parameters, in mean and standard deviation.
Feature | PD (Original) | HC (Original) | PD (Filtered) | HC (Filtered)
µ(I) | 72.8 ± 42.4 | 92 ± 16.5 | 78.2 ± 38.5 | 95 ± 21.3
σ(I) | 88.3 ± 45.4 | 106 ± 13.3 | 95.3 ± 40.7 | 107.5 ± 12.7
µ(f0) | 157.5 ± 39.8 | 174.5 ± 38.2 | 163.3 ± 40.4 | 176.2 ± 38.2
σ(f0) | 59.5 ± 22.7 | 48.7 ± 23.2 | 60.4 ± 21.7 | 54.5 ± 18.5
µ(f0) male | 138.8 ± 33.9 | 150.9 ± 17 | 145.3 ± 35.4 | 153.2 ± 18.1
σ(f0) male | 49.8 ± 15.2 | 44.1 ± 22.1 | 54 ± 19 | 64.7 ± 18.9
µ(f0) female | 188.6 ± 28.8 | 202.9 ± 38 | 193.2 ± 30.5 | 203.7 ± 38.8
σ(f0) female | 75.8 ± 24.9 | 54.2 ± 25.8 | 71 ± 23.1 | 42.4 ± 8.8
Table 9. Statistics of the time-domain intensity-based features, in mean and standard deviation.
Feature | PD (Original) | HC (Original) | PD (Filtered) | HC (Filtered)
µ(mav) | 36 ± 13 | 47 ± 13 | 47 ± 18 | 61 ± 18
σ(mav) | 27 ± 13 | 34 ± 10 | 38 ± 18 | 46 ± 13
µ(enrg) | 0.3 ± 0.3 | 0.5 ± 0.3 | 0.7 ± 0.6 | 1 ± 0.6
σ(enrg) | 0.5 ± 0.4 | 0.7 ± 0.4 | 0.4 ± 0.1 | 1.3 ± 0.8
µ(rms) | 43 ± 15 | 57 ± 17 | 56 ± 21 | 76 ± 23
σ(rms) | 32 ± 15 | 41 ± 13 | 48 ± 28 | 56 ± 19
Table 10. Statistics of time-domain periodicity-based features, in mean and standard deviation.
Feature | PD (Original) | HC (Original) | PD (Filtered) | HC (Filtered)
µ(ZC) | 28.1 ± 12.6 | 36.4 ± 6.8 | 30.8 ± 13.4 | 37.4 ± 6.7
σ(ZC) | 36.7 ± 18 | 47.9 ± 18 | 44.2 ± 22 | 49.5 ± 17.1
µ(SSC) | 177.7 ± 41.9 | 136.4 ± 43.8 | 174 ± 41.8 | 138.2 ± 39.9
σ(SSC) | 117.9 ± 24.2 | 85.1 ± 31.2 | 118.9 ± 23.3 | 81.1 ± 30.3
Table 11. Statistics of frequency-domain features, in mean and standard deviation.
Feature | PD (Original) | HC (Original) | PD (Filtered) | HC (Filtered)
µ(maxf) | 277.5 ± 76.9 | 457.3 ± 115.9 | 294.5 ± 94 | 468.1 ± 199.2
σ(maxf) | 426.7 ± 280.8 | 690.5 ± 298.5 | 484.2 ± 321.6 | 707.4 ± 320.9
µ(waf) | 309.9 ± 85.2 | 391.7 ± 261.5 | 327.6 ± 103.3 | 513.5 ± 155.5
σ(waf) | 401.2 ± 255.5 | 665.8 ± 271.2 | 463.4 ± 297.8 | 697.8 ± 289.1
µ(skw) | 10.8 ± 1 | 10.9 ± 1 | 10.9 ± 1 | 10.3 ± 0.7
σ(skw) | 2.7 ± 0.2 | 2.9 ± 0.2 | 2.8 ± 0.2 | 2.9 ± 0.2
µ(kur) | 136.1 ± 22.5 | 124.3 ± 12.4 | 136.7 ± 21 | 124.3 ± 13.3
σ(kur) | 57.8 ± 3.8 | 58.9 ± 2.8 | 58.4 ± 3.7 | 59.2 ± 2.7
Table 12. Statistics of first three formants (f1, f2, and f3), in mean and standard deviation.
Feature | PD (Original) | HC (Original) | PD (Filtered) | HC (Filtered)
µ(f1) | 122.2 ± 19.4 | 123.5 ± 12.4 | 119.9 ± 19 | 122 ± 11.9
σ(f1) | 115.8 ± 13.4 | 128.2 ± 8.9 | 116.3 ± 12.9 | 127.7 ± 8.8
µ(f2) | 279.6 ± 57.1 | 259.5 ± 34.4 | 274.3 ± 54.2 | 257 ± 33.4
σ(f2) | 218.6 ± 21.6 | 225.5 ± 7.3 | 221.7 ± 21.1 | 226.8 ± 6.4
µ(f3) | 787.5 ± 104.5 | 756.4 ± 61.6 | 776.5 ± 100.2 | 751.5 ± 59.4
σ(f3) | 383.1 ± 34.5 | 390.2 ± 16.6 | 389 ± 29.4 | 392.4 ± 15.2
Table 13. Performance metrics for CNN-based Parkinsonian speech identification.
Feature | Accuracy (Original) | FP | FN | Loss | Accuracy (Filtered) | FP | FN | Loss
Speech spectrograms (all patients) | 78% | 6 | 8 | 0.3 | 86% | 3 | 5 | 0.4
Speech spectrograms (reduced dataset) | 85% | 5 | 2 | 0.8 | 93% | 3 | 0 | 0.1
Speech energy spectrograms | 80% | 4 | 8 | 0.3 | 84% | 5 | 5 | 0.6
Speech energy spectrograms (reduced dataset) | 87% | 2 | 4 | 0.4 | 96% | 2 | 0 | 0.1
Mel spectrograms | 58% | 12 | 14 | 0.5 | 70% | 7 | 10 | 0.3
Mel spectrograms (reduced dataset) | 87% | 0 | 6 | 0.7 | 92% | 2 | 2 | 0.5
Table 14. Comparison of the classification accuracy reported in this work vs. the literature, based on the speech task.
Reference | Speaking Task | Feature | Accuracy
This work | Continuous speech | Speech/speech energy/Mel spectrogram | 93%/96%/92%
[41] | n.a. | 22 speech attributes | 97.4%
[42] | Vowels | 19 acoustic features | 91.25%/91.23%
[43] | Isolated words | MFCC | 60% … 90%
[39] | Sustained vowel a | 6 vocal feature sets | 89.4%/94.4%
[44] | Sustained phonation, diadochokinetic task, continuous speech | SPEC and MFCC features | >80%
[38] | Short sentence segments | Spectrograms | 85.9%
[13] | Sustained vowels | Energy, formants | 99.4%
[31] | Continuous speech | Energy | 91% … 98%
[28] | Continuous speech | 282 features | 83% … 93%
n.a.—not available/not reported.
Table 15. Comparison of the deep learning-based classification accuracy reported in this work vs. the literature.
Reference | Classifier | Accuracy
This work | CNN | 93%/96%/92%
[44] | CNN | >80%
[38] | CNN | 85.9%
[13] | NN | 99.4%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
