Article

Theory and Data-Driven Competence Evaluation with Multimodal Machine Learning—A Chinese Competence Evaluation Multimodal Dataset

Business School, Sichuan University, Chengdu 610065, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(13), 7761; https://doi.org/10.3390/app13137761
Submission received: 25 April 2023 / Revised: 3 June 2023 / Accepted: 5 June 2023 / Published: 30 June 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

In social interactions, people who are perceived as competent tend to win more opportunities and perform better in both the personal and professional aspects of their lives. However, the process of evaluating competence is still poorly understood. To fill this gap, we developed a two-step empirical study to propose a competence evaluation framework and a predictor of individual competence based on multimodal data using machine learning and computer vision methods. In study 1, from a knowledge-driven perspective, we first proposed a competence evaluation framework composed of 4 inner traits (skill, expression efficiency, intelligence, and capability) and 6 outer traits (age, eye gaze variation, glasses, length-to-width ratio, vocal energy, and vocal variation). Then, eXtreme Gradient Boosting (XGBoost) and SHapley Additive exPlanations (SHAP) were utilized to predict and interpret individual competence, respectively. The results indicate that 8 traits (4 inner and 4 outer; in descending order: vocal energy, age, length-to-width ratio, glasses, expression efficiency, capability, intelligence, and skill) contribute positively to competence evaluation, while 2 outer traits (vocal variation and eye gaze variation) contribute negatively. In study 2, from a data-driven perspective, we accurately predicted competence with a cutting-edge multimodal machine learning algorithm, low-rank multimodal fusion (LMF), which exploits the intra- and intermodal interactions among all the visual, vocal, and textual features of an individual’s competence behavior. The results indicate that vocal and visual features contribute most to competence evaluation. In addition, we provide a Chinese Competence Evaluation Multimodal Dataset (CH-CMD) for individual competence analysis. This paper provides a systematic competence framework with empirical consolidation and an effective multimodal machine learning method for competence evaluation, offering novel insights into the study of individual affective traits, quality, personality, etc.

1. Introduction

Competence, a quality that distinguishes an individual from ordinary people, can be explained by a combination of traits, such as being skillful, efficient, knowledgeable, and intelligent [1,2,3]. Some researchers have defined competence as an underlying characteristic of an individual that is causally related to effective or superior job performance [4,5,6]. Others have related competence to a series of personal traits (describing the qualities of competent people, such as confidence, intelligence, and efficiency) and behaviors (describing the actions of competent people, such as analytical thinking, interpersonal understanding, and information seeking) [7,8,9]. Meanwhile, the social role of individual competence has been widely emphasized since it represents the advanced knowledge and skills required for communication and collaboration, information management, learning and problem solving, and meaningful participation, as well as attitudes related to the strategic use of these skills in intercultural, critical, creative, responsible, and autonomous manners [10]. For instance, in politics, highly competent candidates receive more support [11]. In marketing, competent salespersons earn more customer trust, thus leading to more extensive sales [12]. In an organization, a CEO who even “looks competent” is more likely to be selected and receive more salary compensation [13].
Given the importance of competence, researchers have conducted much research on competence frameworks [1,14,15,16]. Bogdan Wojciszke (1993) proposed that competence is related to three factors: “intelligence, willpower, and courage” (1993, p. 329) [14]. S. T. Fiske et al. (2007) built a classical competence framework including “perceived ability, intelligence, skill, creativity, and efficacy” (2007, p. 77) [15]. In subsequent studies, new dimensions for competence evaluation, such as independence, education, and effectiveness, have continued to be added [16,17,18]. Researchers have offered valuable insights into the concept of competence, and among the scales they developed, four main dimensions are used as measures of competence: skill, efficiency, intelligence, and capability [1,14,19]. However, previous research has mainly measured competence through individuals’ self-perceptions or others’ judgments collected via surveys, and hard evidence on competence-related individual traits is quite limited. Moreover, existing studies have focused disproportionately on a series of inner traits (e.g., intelligence and capability) related to competence. However, individual competence can actually be evaluated based on both inner and outer traits. For example, among inner traits, skill, intelligence, capability, and efficiency have been highlighted in previous studies and are thus correlated with competence evaluation [14,16,19,20,21,22,23,24]. Among outer traits, facial and vocal cues play an important role in competence assessment [25,26,27,28].
Given the above, not only inner competence traits (i.e., skill, intelligence, capability, and efficiency) but also outer traits convey rich social information about competence evaluation. Therefore, there is a chance to evaluate perceived competence directly through easily observed individual traits. To fill these gaps, we developed a two-step empirical study that leverages machine learning and computer vision methods and proposed a framework and a predictor of competence evaluation based on multimodal data. This study aims to address the following questions:
(1)
What are the multimodal variables (i.e., outer and inner traits) that affect individual competence?
(2)
How do large-granular multimodal competence variables and small-granular multimodal competence features contribute to competence prediction?
(3)
How can cutting-edge deep learning models be utilized to make highly accurate predictions of competence by utilizing visual, vocal, and textual information?
To answer these questions, we conducted two studies from knowledge-driven and data-driven perspectives. In the knowledge-driven study (study 1), we first extracted 10 competence-related traits (both outer and inner traits) from the previous literature. Second, we obtained 4473 GB of 2573 video files that capture individuals’ competence behaviors. Third, two groups of experts annotated the perceived competence of the individuals in the videos. Fourth, we employed machine learning methods, eXtreme Gradient Boosting (XGBoost) and SHapley Additive exPlanations (SHAP), to predict and understand individual competence. Fifth, to comprehensively understand competence and validate the machine learning results, ordinary least squares (OLS) regression [29] was used to provide econometric evidence of the impact of the 10 traits on competence evaluation. In the data-driven study (study 2), we accurately predicted competence with a cutting-edge multimodal machine learning algorithm, low-rank multimodal fusion (LMF), which exploits the intra- and intermodal interactions among all the visual, vocal, and textual features composing an individual’s competence behavior. In total, 43 dimensions of visual features were extracted by OpenFace [30], 24 dimensions of vocal features were extracted by Librosa [31], and 768 dimensions of textual features were extracted by BERT [32].
In study 1, the results of XGBoost prediction indicate that the 10 traits contribute to competence evaluation, with a mean squared error (MSE) of 7.335 × 10⁻⁶, which is highly accurate, and the SHAP results indicate that 8 traits (vocal energy, age, length-to-width ratio, glasses, expression efficiency, capability, intelligence, and skill) contribute positively to competence evaluation, while vocal variation and eye gaze variation contribute negatively. In study 2, the LMF model with 3 modalities (visual, vocal, and textual) achieves high accuracy, with a value of 87.20%, highlighting the importance of information on individuals’ visual, vocal, and textual competence behaviors. In addition, the results indicate that the visual modality contributes the most to individual perceived competence, followed by the vocal modality and then the textual modality. It seems that “how an individual looks” and “how he speaks” are more important than “what he says.”
This paper offers the following four theoretical contributions: (1) This research extends prior research by integrating a more comprehensive competence evaluation framework with 10 traits, including not only 4 inner traits but also 6 outer traits that can be easily observed, thus helping to clarify the conceptual ambiguity surrounding competence [33]. (2) This study is among the first to directly and empirically predict competence with both 10 knowledge-driven traits and 835 data-driven features extracted from videos, bringing new insights into the evaluation of competence and generating more convincing results. While previous studies used small samples and evaluated competence based on human perception [1,16,18], the use of data-driven empirical evidence in this study provides a deeper understanding of an individual’s competence evaluation. (3) Methodologically, an advanced deep learning algorithm, LMF, is employed for the analysis of psychology-based issues. Both intra- and intermodal interactions among an individual’s visual, vocal, and textual information are mapped, thus leading to a more precise prediction of individual competence, which offers insights into machine learning analysis in psychological research [34,35,36]. (4) We provide the Chinese Competence Evaluation Multimodal Dataset (CH-CMD) for individual competence evaluation to the public. This dataset, comprising rich information on visual, vocal, and textual information, along with annotated competence scores, could encourage further studies of the relationship between inner and outer traits and individual competence. The framework of this paper is shown in Figure 1.

2. Theoretical Background and Hypothesis Development

2.1. Competence Framework

The word “competence”, an important concept in the management strategy literature since the 1990s, is “shrouded in theoretical confusion” [37]. There are three main ways to establish a coherent definition of competence: relating competence with a series of behaviors (e.g., learning style, work feedback, collaboration) [38,39,40], treating competence as a key dimensional stereotype according to social judgment theory [9,15], and associating competence with superior work performance [8].
Although the concept of competence has remained ambiguous [33], academic endeavors have focused on building competence frameworks and applying them to a wide range of fields, such as personnel selection and performance evaluation. For instance, Bogdan Wojciszke (1993) proposed that competence is related to three factors: “intelligence, willpower, and courage” (1993, p. 329) [14]. Fiske (2007) built a classical competence framework including “perceived ability, intelligence, skill, creativity, and efficacy” (2007, p. 77) [15]. In subsequent studies, adjustments have continued to be made on this basis, and a series of new factors, such as independence, education, and effectiveness, have been added [16,17,18]. Table 1 summarizes the research on competence evaluation. Following previous work, we selected the four most commonly used dimensions to understand the nature of competence: skill, expression efficiency, intelligence, and capability (Table 1).

2.2. Competence-Related Traits

We integrated a more comprehensive competence evaluation framework that includes four inner traits and six outer traits of competence. In this subsection, the previous literature is reviewed to provide a theoretical background for the framework we propose.

2.2.1. Inner Traits

The inner traits used in this study include four traits extracted from prior studies: skill, expression efficiency, intelligence, and capability.
(a) Skill. Skills refer to particular ways to perform tasks well. Skill is a comprehensive concept that includes professional skills, learning skills, communication skills, and comprehension skills [16,19,20]. People tend to associate skill with competence since it is an indicator of expertise and high performance [41]. For instance, salespersons with analytical skills earn higher sales [42], adaptive selling skills enable salespersons to better position a firm’s products over those of its competitors [43], and interpersonal mentalizing skills help salespersons reduce burnout, thus improving their performance [44]. Thus, skill is anticipated to be positively correlated with competence.
(b) Expression efficiency. Efficiency refers to carrying out work processes efficiently [19,20]. Here, efficiency is mainly manifested by expression efficiency, such as organizing ideas into points clearly and using rhetoric (e.g., metaphors and exemplification) to convey information vividly [45]. Researchers have proven that the use of metaphors enhances the idea quality, as determined by listeners [46]; the use of exemplification helps listeners understand complex topics since it increases vividness and authenticity [47,48]. Expression efficiency is a critical component of high service quality [49,50] and client trust [51]. Thus, expression efficiency is anticipated to be positively correlated with competence.
(c) Intelligence. Intelligence is a measure of the extent to which an individual masters knowledge about a product. As individual-level intellectual capital, intelligence increases salespersons’ information-based behavior and provides a basis for identifying competence cues [12,52]. Higher performance in terms of selling activities is obtained by salespersons who are able to display product knowledge intelligently [53,54]. Therefore, intelligence is anticipated to be positively correlated with competence.
(d) Capability. Capability means that individuals possess the ability to perform a certain task [55], for instance, being capable of satisfying the needs of customers or having decision-making control over selling issues such as offering discounts and free shipping. Prior research has indicated that, in addition to skills, staff members also need empowerment to complete work tasks [56]. Salespersons who are capable of determining their sales tasks and courses of action have higher self-efficiency and sales performance [57]. Researchers have also found that by introducing capability into the competence framework, the robustness of the framework improves [24]. Given the above, capability is anticipated to be positively correlated with competence.

2.2.2. Outer Traits

The outer traits included six easily observed physical traits: age, eye gaze variation, glasses, length-to-width ratio, vocal energy, and vocal variation (Table 2).
(e) Age. Age represents the estimated age of an individual based on his appearance (rather than his exact age). People hold a series of positive stereotypes about older staff [58], since age is often associated with working experience [59]. For this reason, age has been recognized as a dimension in which individual skills and capabilities may systematically grow over time [60]. For instance, older staff often master higher physical [61] and organizational social skills [59], and they are perceived to be more professional than younger staff [62]. Therefore, age is anticipated to be positively correlated with competence.
(f) Glasses. Glasses refer to whether an individual is wearing glasses. In general, people believe that glasses are worn by those who are more intelligent, knowledgeable, and professional; thus, wearing glasses is an indicator of competence [63,64,65]. For instance, wearing glasses increases the perceived competence of candidates in political elections, thus leading to electoral success, and this effect is especially stronger in situations where intelligence is desirable [66]. For these reasons, wearing glasses is anticipated to be positively correlated with competence.
(g) Eye-gaze variation. Eye-gaze variation is a measure of the extent to which an individual shifts his gaze during a speech. Previous studies have validated the negative impact of gaze variation on individuals’ social perceptions, such as attractiveness [67], dominance [68], and confidence [69]. In video-mediated communication, gaze is defined as directly looking at a camera; thus, eye gaze variation means that an individual fails to maintain constant eye contact with the camera. However, even in online communication, eye gaze is very important. For instance, in clinical situations, physicians with a constant gaze in videoconferencing teletherapy are perceived to be more reliable [70,71]. In online education situations, instructors with limited gaze variation are deemed to be more confident and have higher educational performance [72]. Therefore, gaze variation is anticipated to be negatively correlated with competence.
(h) Length-to-width ratio. The length-to-width ratio relates to the shape of one’s face, and it is calculated by dividing the face length (measured from the top of the eyelids to the upper lip) by the width. Previous studies have examined whether the length-to-width ratio is positively related to an individual’s persuasiveness [73,74]. For women, a higher length-to-width ratio is associated with social capability, emotional capability, and attractiveness [75]. For men, a higher length-to-width ratio indicates less aggressiveness and is more appealing [76,77]. Therefore, the length-to-width ratio is anticipated to be positively correlated with competence.
(i) Vocal energy. Vocal energy is the energy of a human voice and is usually quantified by voice amplitude. A wealth of research has proven that voice energy is positively related to self-confidence [78,79,80] and competence [81]. It also provides information associated with an individual’s perceived dominance (Burgoon, Johnson, and Koch, 1998 [82]) and increases persuasiveness (Van Zant and Berger, 2020 [83]). Therefore, vocal energy is anticipated to be positively correlated with competence.
(j) Vocal variation. Vocal variation refers to shifts in one’s voice amplitude during speech. Vocal variation also carries rich, socially relevant information [84]. For instance, speakers who lack confidence and background knowledge about a topic tend to shift their voice during speech [85]. Previous studies have also suggested that larger vocal variations have a negative effect on speakers’ perceptual performance [86]. Therefore, while vocal energy is considered a positive indicator of competence, vocal variation appears to be considered the opposite.
Table 2. Competence-related variables.
Inner Traits | Definition | References
Skill | A comprehensive concept including the ability to adapt, learn, communicate, comprehend, and question. | [41,43,44]
Expression efficiency | The extent to which an individual exploits oral expression skills when introducing products (i.e., organizing ideas clearly and using metaphors, examples, and other rhetoric). | [45,46,47,48,49,50,51]
Intelligence | The extent to which a salesperson is knowledgeable about the technical features and capabilities of products. | [12,52,53,54]
Capability | The extent to which an individual has decision-making control over selling issues. | [87,88,89,90,91]
Outer Traits | Definition | References
Age | The estimated age of an individual based on his appearance. | [58,59,60,61,62]
Glasses | Whether an individual wears glasses. | [63,64,65,66]
Eye gaze variation | Whether an individual makes direct eye contact with a camera. | [69,70,71,72,92]
Length-to-width ratio | An individual’s face length (measured from the top of the eyelid to the upper lip) divided by the width. | [73,74,75,76,77]
Vocal energy | The perceived power of a voice, quantified by amplitude. | [78,79,80,82,83]
Vocal variation | Shifts in a speaker’s voice amplitude during a speech. | [84,85,86,93]

3. Study 1. Knowledge-Driven Competence

3.1. Data

Online livestream videos provide abundant material for comprehensively understanding individual competence, including influencers’ inner and outer traits, which can be obtained through visual, vocal, and textual cues detected from their facial traits, voice, and speech transcripts. In this section, videos used for competence evaluation were randomly selected from an online livestream platform to extract both inner and outer competence-related traits (see Table 2). Four steps were taken to eliminate the heterogeneity of the videos. (1) The number of videos in each product category was roughly balanced to avoid product-type interference. (2) Videos of the top or bottom influencers were filtered out since these videos may exert disproportionately large or small effects on the prediction outcome. (3) The data were cleaned by filtering out videos (a) with no influencer or more than one influencer, (b) with no sound, and (c) with a moving camera. (4) All the transcripts transferred from the audio files were proofread to guarantee the validity of our empirical results.
After these steps were completed, 4473 GB of video files were obtained. The videos were evenly distributed in terms of product categories, including beauty products, baby products, sporting goods products, and furnishings.

3.2. Variables Extraction

3.2.1. Inner Traits Extraction

The inner traits (skill, expression efficiency, intelligence, and capability) were obtained by computing the text similarity between an influencer’s transcript and an example transcript collection. First, we invited 10 experts in sociology and psychology to create an example transcript collection that includes 100 sentences displaying the inner traits of skill, expression efficiency, intelligence, and capability. Next, the transcripts were cleaned, and text segmentation was conducted with the Jieba module in Python. Then, bidirectional encoder representations from transformers (BERT)-based word embeddings were used to transfer the transcripts into 768-dimensional sequence embeddings [32]. Finally, the cosine text similarity between the influencers’ transcripts and the example collection was calculated to measure the extent to which an influencer exhibits these inner traits (see Appendix A).
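The following is a minimal sketch of this scoring idea, not the authors’ exact pipeline: a transcript and the expert example collection are embedded with a Chinese BERT model and compared by cosine similarity. The model name, mean pooling, and max-similarity aggregation are illustrative assumptions.

```python
# Hedged sketch: BERT embeddings + cosine similarity for one inner trait.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed model name
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def embed(sentence):
    """Return a 768-dimensional mean-pooled BERT embedding for one sentence."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)                # (768,)

def trait_score(transcript, examples):
    """Score a transcript against the expert example collection for one trait."""
    t = embed(transcript)
    sims = [torch.cosine_similarity(t, embed(e), dim=0).item() for e in examples]
    return max(sims)   # assumption: the closest example defines the trait score

# Hypothetical usage with two example sentences for "expression efficiency".
examples = ["这款面料非常柔软，就像婴儿的皮肤一样。", "这条裙子的颜色像牛油果，显得皮肤很白。"]
print(trait_score("我们的面料摸起来特别舒服。", examples))
```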

3.2.2. Outer Traits Extraction

Among the outer traits, age, glasses, and length-to-width ratio were obtained using Face++ https://www.faceplusplus.com.cn/ (accessed on 17 June 2022), a platform that provides a face detection application programming interface [94]. This tool places 106 facial landmarks on the human face in the video and marks the outlines of the face, eyes, eyebrows, lips, and nose, enabling the precise detection of facial traits. For instance, facial length can be measured by calculating the distance between the landmark on the top of the eyelid and that on the upper lip. Facial width is measured in a similar way, so the length-to-width ratio can be computed. Glasses can be detected by checking whether the landmarks around the eyes are obstructed. Age can be estimated through the skin detection API of Face++.
Gaze information was obtained using OpenFace 2.0 https://openface.io/ (accessed on 17 June 2022), an interactive application for facial behavior analysis. OpenFace 2.0 applies the CLNF framework, a general deformable shape registration approach, to detect the locations of the eyeball and the pupil. CLNF gives the pupil location in 3D camera coordinates and computes the pupil’s intersection with the eyeball sphere. In this way, the gaze vector can be obtained by calculating the vector from the 3D eyeball center to the pupil location. Then, eye-gaze variation was measured by the coefficient of variation of an individual’s gaze movements from left to right and from up to down.
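As a hedged sketch of this measure (not the authors’ exact script), the per-frame gaze angles exported by OpenFace 2.0 can be summarized by a coefficient of variation. The column names gaze_angle_x and gaze_angle_y follow typical OpenFace CSV output and should be treated as assumptions to adapt to the actual export.

```python
# Sketch: eye-gaze variation as the average coefficient of variation of
# horizontal and vertical gaze angles from an OpenFace 2.0 output CSV.
import pandas as pd

def gaze_variation(openface_csv):
    df = pd.read_csv(openface_csv)
    df.columns = [c.strip() for c in df.columns]   # OpenFace pads column names with spaces
    cvs = []
    for col in ("gaze_angle_x", "gaze_angle_y"):   # assumed column names
        series = df[col]
        if abs(series.mean()) > 1e-8:
            cvs.append(series.std() / abs(series.mean()))
    return sum(cvs) / len(cvs) if cvs else float("nan")

# Hypothetical usage:
# print(gaze_variation("influencer_0001.csv"))
```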
Vocal energy and vocal variation were extracted from audio files using the Librosa speech toolkit at 22,050 Hz. Here, all the audio files were segmented from the livestream videos by calling the moviepy module in Python. Vocal energy measures the amplitude of the influencer’s voice, and vocal variation measures the standard deviation of the amplitude. The variable summary of our research is shown in Table 3.
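A minimal sketch of these two measures follows: the audio track is separated with moviepy, loaded with Librosa at 22,050 Hz, and summarized by the amplitude mean (vocal energy) and standard deviation (vocal variation). The file names and the mean-amplitude aggregation are assumptions.

```python
# Sketch: vocal energy and vocal variation from a livestream clip.
import librosa
import numpy as np
from moviepy.editor import VideoFileClip

# Separate the audio track from a (hypothetical) livestream clip.
VideoFileClip("livestream_clip.mp4").audio.write_audiofile("livestream_clip.mp3")

y, sr = librosa.load("livestream_clip.mp3", sr=22050)
amplitude = np.abs(y)
vocal_energy = float(amplitude.mean())      # average amplitude
vocal_variation = float(amplitude.std())    # amplitude spread during speech
print(vocal_energy, vocal_variation)
```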

3.3. Competence Factors with a Machine Learning Method

3.3.1. Model

In this section, our empirical research consists of three steps. First, to consolidate the framework of 10 competence-related traits and generate an accurate competence evaluation, a machine learning algorithm, XGBoost, was used to build the prediction model. Then, to make the results interpretable, SHAP was applied to rank the importance of the 10 traits. Finally, to further validate the results of the machine learning model, OLS regression was performed on the 10 traits and individual competence.
(1)
XGBoost
In this section, we employed the XGBoost model to study whether our framework can be used to predict individual competence. XGBoost, a scalable machine learning system for tree boosting, has been widely proven to yield state-of-the-art results in many prediction tasks [95]. XGBoost is essentially a gradient tree boosting algorithm implemented as a scalable end-to-end tree learning system. The core idea of XGBoost is to continuously conduct feature splitting to train new regression tree functions $f(x)$ and use them to fit the residual of the prediction in the previous round, as shown in Equations (1)–(4).
\hat{y}_i^{(0)} = 0 \quad (1)
\hat{y}_i^{(1)} = f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i) \quad (2)
\hat{y}_i^{(2)} = f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i) \quad (3)
\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \quad (4)
Here, $\hat{y}_i^{(t)}$ is the prediction at training round $t$, and $f_t(x_i)$ is the new tree generated in round $t$. In every round, the new function $f_t(x_i)$ should be the one that minimizes the objective in Equation (5).
\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{constant} \quad (5)
Here, $l$ denotes the loss function, and $\Omega(f_t)$ is the regularizer that penalizes the complexity of the model. Finally, the prediction $\hat{y}_i$ is the summation of the predicted values of all trees, as shown in Equation (6).
\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i) \quad (6)
By explicitly adding a regularizer to the model, XGBoost excels at avoiding overfitting and thus improving the generalizability of the model. In addition, a tree learning algorithm and justified weighted quantile sketch procedure are applied; thus, this model also excels at handling sparse data and achieving efficient calculation.
In our study, we trained and tested the model with video clips from the CH-CMD, which contains 2374 samples in total. We randomly selected 80% of the video clips as training data and the remaining video clips as test data. We set the learning rate to 0.01 and the number of trees to 100. Finally, in the task of predicting perceived individual competence, a mean square error (MSE) of 7.335 × 10⁻⁶ and a root mean square error (RMSE) of 0.002708 were obtained using this model. These results indicate that our competence evaluation framework contains valid information about perceived competence.
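The prediction step can be sketched as follows, using the stated settings (80/20 split, learning rate 0.01, 100 trees). The feature matrix and labels below are random placeholders standing in for the 10 inner/outer traits and the annotated competence scores.

```python
# Sketch: XGBoost regression of perceived competence on the 10 framework traits.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.random((2374, 10))   # placeholder for the 10 competence-related traits
y = rng.random(2374)         # placeholder for annotated competence scores

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBRegressor(n_estimators=100, learning_rate=0.01)
model.fit(X_train, y_train)

pred = model.predict(X_test)
mse = mean_squared_error(y_test, pred)
print("MSE:", mse, "RMSE:", np.sqrt(mse))
```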
(2)
SHAP
In this step, SHAP, a machine learning interpretability approach, was applied to rank the importance of the 10 inner and outer individual traits to competence evaluation. Based on cooperative game theory, SHAP uses Shapley values to measure the importance of each variable. It is assumed that, given the input of a set of variable values (the $j$-th feature of the $i$-th sample is $x_{ij}$) and the baseline of the model $Y_{base}$, the machine learning model outputs a particular prediction $Y_i$. Since different feature values have different impacts, $f(x_{ij})$ denotes the specific contribution of $x_{ij}$ to the prediction $Y_i$. Here, $f(x_{ij})$ is the SHAP value of $x_{ij}$ and follows Equation (7).
Y_i = Y_{base} + f(x_{i1}) + f(x_{i2}) + \cdots + f(x_{ij}) \quad (7)
Here, $x_{ij}$ represents the 10 traits in the framework, $f$ is the XGBoost model built in the last step, and $Y_i$ (the output of the model) reflects the evaluation of the competence of each sample. SHAP exhibits two apparent advantages that meet our research requirements: (1) It presents not only the impact extent but also the impact direction of each variable: when the SHAP value is positive ($f(x_{ij}) > 0$), this specific variable of the $i$-th sample contributes positively to the prediction of $Y_i$, and the reverse also holds true. (2) SHAP can exhibit individual-level importance and better visualize the contribution of each trait.
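Continuing from the XGBoost sketch above, the interpretation step can be illustrated with the shap package’s TreeExplainer, which implements the additive decomposition in Equation (7) for tree models; `model` and `X_test` are assumed to come from that sketch.

```python
# Sketch: global and per-sample SHAP importance for the 10 framework traits.
import shap

trait_names = ["skill", "expression efficiency", "intelligence", "capability",
               "age", "eye gaze variation", "glasses", "length-to-width ratio",
               "vocal energy", "vocal variation"]

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)   # one SHAP value per trait per sample

# Mean absolute SHAP value per trait (global ranking, as in Figure 2)
shap.summary_plot(shap_values, X_test, feature_names=trait_names, plot_type="bar")
# Per-sample density scatter plot (individual-level importance, as in Figure 3)
shap.summary_plot(shap_values, X_test, feature_names=trait_names)
```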

3.3.2. Results

Figure 2 displays the mean SHAP value of the 10 traits, which are sorted from high to low according to the mean SHAP value. The mean SHAP values reflect the significance of the traits in the competence evaluation.
In addition, Figure 3 shows the density scatter plots of the aggregate SHAP values across all input video samples to visualize individual-level trait importance. In this panel, traits are sorted in the same order as in Figure 2 and are denoted by lines consisting of scattered dots. Every pink or blue dot denotes the SHAP value of an input video sample. Pink dots indicate high trait values, while blue dots indicate low trait values.
From Figure 2 and Figure 3, vocal energy and vocal variation are the two most impactful traits. While vocal energy has a positive impact on competence evaluation, vocal variation has a negative impact. These results are consistent with the theoretical analysis in Section 2.2. Eye-gaze variation is also important in the competence evaluation framework. However, its impact is contextual; that is, while it contributes to the competence evaluation of some individuals, it may hinder that of others (see Section 3.5). Facial traits, such as age, glasses, and the length-to-width ratio, also play important roles in the model. A higher estimated age could either increase or decrease the predicted competence. The length-to-width ratio contributes positively to competence evaluation. Wearing glasses impacts competence positively, while not wearing glasses impacts it negatively.
Finally, expression efficiency, capability, intelligence, and skill have smaller impacts on competence evaluation, and their impact is not monotonic. They exhibit similar patterns in the scatter plot, indicating that a larger value has either a positive or negative impact on the predicted outcome.

3.4. Competence Factors with an Econometric Method

To validate the results, we also used OLS regression to investigate the impact of individual traits on perceived competence (Table 4). Table 4 shows the regression results of the OLS model with the CH-CMD. In general, the results are consistent with those of the machine learning approach. Therefore, the credibility of the XGBoost model we built is validated, and our proposed competence evaluation framework is adequate for the prediction task.
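The validation step can be sketched with statsmodels, regressing the annotated competence score on the 10 traits. The DataFrame below is a random placeholder standing in for the CH-CMD traits and annotations; column names are hypothetical.

```python
# Sketch: OLS regression of annotated competence on the 10 framework traits.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
traits_df = pd.DataFrame(
    rng.random((2374, 11)),
    columns=["skill", "expression_efficiency", "intelligence", "capability",
             "age", "eye_gaze_variation", "glasses", "length_to_width_ratio",
             "vocal_energy", "vocal_variation", "competence"],
)

X = sm.add_constant(traits_df.drop(columns=["competence"]))
ols_model = sm.OLS(traits_df["competence"], X).fit()
print(ols_model.summary())   # coefficients, t-statistics, and R-squared as in Table 4
```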

3.5. Heterogeneity Analysis

In the previous subsections, we presented a ranking of the contribution of individual trait variables to the prediction of competence. Moreover, individual heterogeneity exists in the SHAP results (Figure 4). Figure 4 shows the SHAP values of two examples to demonstrate this heterogeneity.
For instance, age (with a value of 0.35), length-to-width ratio (with a value of 0.16), vocal energy (with a value of 0.12), and vocal variation (with a value of 0.08) contribute positively to individual 1’s competence but contribute negatively to individual 2’s competence. In addition, even though eye gaze variation contributes negatively to the competence of both individuals 1 and 2, the degree is different. Eye gaze variation negatively impacts individual 2’s competence by 0.24, while it negatively impacts individual 1’s competence by only 0.09. Therefore, the influence of these traits is contextual.

3.6. Discussion

In Section 3, we conducted study 1, an experiment to explain and predict perceived individual competence from a knowledge-driven perspective. Building on previous competence frameworks, we developed an enhanced competence evaluation framework (including 4 inner traits and 6 outer traits) to explain individual competence. To validate this framework, we conducted empirical research following two pathways: (1) First, we utilized a machine learning model. We applied the XGBoost model to predict individual perceived competence with the abovementioned variables and the SHAP method to identify the impact direction and feature importance of these variables. (2) Second, we used a classical econometric model. We conducted an OLS regression analysis to explore the general relationships between competence and these variables. Finally, an MSE of 7.335 × 10⁻⁶ and an RMSE of 0.002708 were obtained using the XGBoost model, and the results of these two pathways are consistent, indicating that the proposed framework is adequate for the competence prediction task and consolidates prior competence theory.
First, the appearance traits of individuals, such as age, glasses, and length-to-width ratio, are positively related to perceived competence. From the SHAP results, it is noted that these variables contribute greatly to the prediction results. Second, eye gaze variation has a negative effect on the evaluation of individual competence. However, this impact is contextual, according to the SHAP results. Third, voice traits have the greatest contribution to the evaluation of individual competence. Between them, vocal energy has a positive effect, while vocal variation has a negative effect.
Meanwhile, expression efficiency, capability, intelligence, and skill correlate positively to individual competence in general. However, these traits contribute the least to the prediction of competence. This may indicate that inner traits are not critical to the embodiment of individual competence in a fast-paced livestream context.

4. Study 2. Knowledge-Driven and Data-Driven Competence

4.1. Dataset Construction

Extracting individual competence-related features from massive unstructured data in videos is a continuous challenge. In this section, we introduce the CH-CMD to capture and structure individual competence with deep learning at a smaller granularity. In the following subsections, we explain the details of the data acquisition and crawl system, dataset split, final statistics, preprocessing steps, and annotation.

4.1.1. Data Acquisition

All the videos used to extract features in the three modalities in study 2 were randomly crawled from a livestream platform. We built a new crawl system based on existing crawlers that allowed us to filter out videos with more than one influencer on the screen (to ensure monologue videos) and videos with a moving camera (to ensure that all influencers’ attention was focused on the camera). Then, the original videos (often with hours of content) were segmented into video clips that lasted for only a few seconds and contained only one sentence. A total of 1973 videos were crawled using this process.

4.1.2. Dataset Split

The Natural Split CH-CMD was crawled without any guidance and comprises a representative sample of the natural distribution of competence polarity on the selected platform. However, it is not suitable for deep learning. In this subsection, we first discuss the statistics of the Natural Split CH-CMD and then discuss the process of guided crawling to acquire videos with more polarized competence to balance the dataset.
(1)
Natural split dataset
In this paper, the 1973 video samples were crawled from the selected livestream platform to capture the distribution of competence (Figure 5). Figure 5 demonstrates that the majority of the individuals in the video clips are neutrally competent (962), followed by weakly competent (356), weakly incompetent (319), incompetent (228), and competent (108). Without any guidance, the Natural Split dataset contains fewer polarized videos and reflects the original distribution of the livestream platform. However, for deep learning, the distribution of this unprocessed data is not ideal since there are too many videos with neutrally competent individuals.
(2)
Guided crawl
To compensate for the lack of polarized competence and balance the distribution of samples, we implemented stratified sampling in the crawl system: (1) First, we crawled videos of medium length to observe a greater fluctuation of competence; (2) Second, we analyzed the videos while crawling with data from three modalities (instead of analyzing them after crawling); and (3) Third, we filtered out videos of the top and bottom influencers since they generate too much or too little influence, respectively. After three rounds of filtering, controllable crawling by stratified sampling can be realized. In addition, we added manually annotated polarized videos from the Persuasive Opinion Multimedia (POM) dataset [96].
A total of 2374 videos were obtained for the Guided Crawl CH-CMD, which has a more balanced distribution compared to that of the Natural Split CH-CMD (see Figure 5) and facilitates the prediction of individual competence with deep learning approaches.

4.1.3. Preprocessing

Some video restrictions are set to filter out inappropriate samples:
(1) Using face detection technology, we chose only videos with one influencer on the screen to ensure that the videos contained only monologues. (2) Videos with moving cameras were filtered out to ensure that influencers’ attention was focused on the camera.
With the high-quality transcripts we obtained by automatic speech recognition [34], we tokenized the final pool of videos into sentences using punctuation markers and randomly sampled a sentence from the transcript of each livestream video. Thus, we obtained video clips containing only one sentence that lasted for approximately 8 s.
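A minimal sketch of this sentence-level sampling step follows: a Chinese transcript is split on sentence-ending punctuation and one sentence is sampled per video. The punctuation set is an assumption.

```python
# Sketch: tokenize a transcript into sentences and randomly sample one.
import random
import re

def sample_sentence(transcript):
    sentences = [s for s in re.split(r"[。！？!?]", transcript) if s.strip()]
    return random.choice(sentences)

print(sample_sentence("这款面料非常柔软。价格也很实惠！今天下单还有赠品。"))
```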

4.1.4. Final Statistics

The final pool consists of 2374 livestream video clips in total. These videos were crawled from different commodity types, including beauty products, baby products, sporting goods products, and furnishings. We limited the number of IDs for each commodity type. These steps ensured the diversity of the videos.
The diversity of videos leads to the following generalized advantages: (1) The diversity of commodity types enables models to achieve generalizability across various commodity types and marginalize the definition of the dataset domain; (2) the diversity of commodity types leads to the diversity of individuals, allowing the training model to be implemented across different individuals and enabling generalization; (3) the diversity of commodity types also leads to the diversity of record settings, allowing the training model to be generalized through microphones and cameras with different intrinsic parameters.

4.1.5. Annotation

The videos chosen for the dataset were then annotated for competence by expert annotators on a Chinese crowdsourcing platform, and measures were taken to ensure the reliability of the annotation results. In this subsection, we first introduce the selection of annotators, review the definition of competence, and describe the annotation interface.
(1)
Annotator selection
The annotation of the CH-CMD was conducted by expert annotators from three sources: academia, industry, and crowdsourcing. Since we aimed to learn about individuals’ intuitively perceived competence, experts in academia were exclusively from business schools in universities (instead of psychology or medical schools). Experts in the industry included livestream and platform salespeople.
In addition, measures were taken to guarantee annotation quality. Expert annotators were trained before they conducted the annotation task. For instance, to avoid implicit prejudice and subjectivity, they received the following notification: “Please watch the video and annotate the competence of the speaker. Please note that you will only evaluate the individual competence of the speaker, not whether you agree with them”. Only the annotators with an acceptance rate greater than 98% were allowed to perform this task.
(2)
Competence annotation
Previous studies have provided a variety of definitions of competence and used different terms as a measure of competence. Table 1 presents a brief summary of the measurement of competence in previous studies. Among them, we chose 4 pairs of opposite items, “unintelligent/intelligent”, “unskillful/skillful”, “incapable/capable”, and “inefficient/efficient”, and averaged them to measure individual competence. Since these items have been widely used in published work, the reliability and validity of our annotation can be assured (see Section 2.2). The annotators were also required to utilize a 7-point competence scale, ranging from 1 to 7 (e.g., “unintelligent/intelligent”; 1 = “unintelligent” and 7 = “intelligent”).
(3)
Annotation user interface
The interface viewed by annotators in the process of annotation is shown in Figure 6. Before annotating, the annotators viewed a five-minute training video that provided instructions on how to use the annotation system. Then, a short livestream video was played. The annotators were allowed to rewatch the video at any time but were not allowed to perform annotations before the video had completely played at least once. In addition, at any moment during this process, the annotators were allowed to review the training video.

4.1.6. Extracted Features

The extracted features in each modality are as follows (for all experiments in Section 4.2, we extracted the same features):
(1)
Visual
Visual features are traits that can be captured visually. To extract visual features, we cut frames from the video clips at 30 Hz. First, with the multitask cascaded convolutional neural network (MTCNN) face detection algorithm [97], we obtained aligned faces. In addition, we applied the facial action coding system (FACS) to influencers’ faces to detect facial action units and better understand their facial expressions. Finally, we obtained 35 facial action units and 8-dimensional gazes using the OpenFace2.0 toolkit. Forty-three-dimensional visual features were extracted, as shown in Figure 7.
(2)
Vocal
Vocal features are the acoustic features of an influencer’s voice. Vocal features were obtained from the MP3 audio files that were separated from the video clips by calling the moviepy module in Python. Then, we used the Librosa speech toolkit to extract voice features from these audio files at 22,050 Hz. Finally, frame-level voice features, including 20-dimensional mel-frequency cepstral coefficients (MFCCs), RMS, zero crossing rate, spectral rolloff, and spectral centroid, were extracted, yielding 24-dimensional vocal features, as shown in Figure 8 (see the sketch after this list).
(3)
Textual
Textual features are the characteristics of individuals’ language content when they introduce products. All text was obtained by transferring individuals’ voices into text with the ASR technique of the Alibaba Cloud. Since all the videos were obtained from a livestream platform, the transcripts were in Chinese. We used two tokens to label the beginning and end of each text. Then, we generated word vectors by using pretrained Chinese BERT-based word embeddings. Finally, each word was represented as a 768-dimensional word vector, as shown in Figure 9 and Table 5.
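The following sketch illustrates the frame-level vocal features listed in item (2) above: 20 MFCCs plus RMS, zero-crossing rate, spectral rolloff, and spectral centroid, stacked into a 24-dimensional vector. Averaging over frames and the file name are assumptions, not the authors’ exact settings.

```python
# Sketch: 24-dimensional vocal feature vector with Librosa at 22,050 Hz.
import librosa
import numpy as np

y, sr = librosa.load("livestream_clip.mp3", sr=22050)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)        # (20, n_frames)
rms = librosa.feature.rms(y=y)                            # (1, n_frames)
zcr = librosa.feature.zero_crossing_rate(y)               # (1, n_frames)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)    # (1, n_frames)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # (1, n_frames)

# Average each feature over frames and concatenate into one 24-d vector.
vocal_features = np.concatenate([m.mean(axis=1) for m in (mfcc, rms, zcr, rolloff, centroid)])
print(vocal_features.shape)   # (24,)
```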

4.2. Model Specifications

In the preceding sections, we utilized visual, vocal, and textual features to predict competence scores. However, one problem remained unsolved (see Figure 10). If the prediction outcomes generated by models with a single modality are inconsistent with each other, how can we integrate all the information to obtain a comprehensive competence score? In other words, how can the model map not only the intramodality information but also the cross-modality interaction information to obtain a prediction output?
To address this challenge, we applied a cutting-edge deep learning algorithm, LMF, for individual competence evaluation at the feature level [98]. This algorithm outperforms other deep learning algorithms in three ways: (a) LMF is characterized by using tensor representation, enriching the intermodal information in the model and leading to a more convincing prediction. (b) LMF is an efficient light computation method characterized by using low-rank tensors instead of explicitly creating a high-dimensional tensor. Therefore, LMF requires relatively low computing power, making it suitable for application and generalization in various scenarios. (c) LMF has also been proven to achieve high performance in prior studies, with relatively high accuracy and low MAE [98].
Given the above advantages of LMF, we chose it for competence prediction. We divided the 2374 video samples into an 80% training set, a 10% validation set, and a 10% test set for the LMF model. The operating principle of this method is to transform the input representations into a high-dimensional tensor $Z$ and then map it back to a lower-dimensional output vector space. Here, $\{Z_m\}_{m=1}^{M}$ denotes the unimodal vectors of the $M$ different modalities, and $h$ denotes the multimodal output representation. However, in the following steps, LMF does not explicitly create the tensor $Z$. Instead, three steps are taken to simplify the calculation:
First, the weight factor $W$ is decomposed. Generally, to map the tensor $Z$ back to a lower-dimensional output vector space, it needs to be multiplied by a weight factor $W$, which is naturally a tensor of order $(M+1)$ and denotes the weight of a linear layer. It can be partitioned into $\tilde{W}_k$ ($\tilde{W}_k \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_M}$, $k = 1, 2, \ldots, d_h$). Moreover, every $\tilde{W}_k$ can be decomposed as in Equation (8).
\tilde{W}_k = \sum_{i=1}^{R} \bigotimes_{m=1}^{M} w_{m,k}^{(i)}, \quad w_{m,k}^{(i)} \in \mathbb{R}^{d_m} \quad (8)
Here, the minimal $R$ that makes the decomposition valid is referred to as the rank of the tensor. Thus, the weight factor $W$ is expressed as in Equation (9).
W = \sum_{i=1}^{r} \bigotimes_{m=1}^{M} w_m^{(i)} \quad (9)
Second, the tensor $Z$ must be decomposed. Now, $h$ can be calculated as shown in Equation (10).
h = \left( \sum_{i=1}^{r} \bigotimes_{m=1}^{M} w_m^{(i)} \right) \cdot Z \quad (10)
We continue the optimization by decomposing Z in Equation (11).
h = \left( \sum_{i=1}^{r} \bigotimes_{m=1}^{M} w_m^{(i)} \right) \cdot Z = \sum_{i=1}^{r} \left( \bigotimes_{m=1}^{M} w_m^{(i)} \cdot Z \right) = \sum_{i=1}^{r} \left( \bigotimes_{m=1}^{M} w_m^{(i)} \cdot \bigotimes_{m=1}^{M} Z_m \right) \quad (11)
Third, h is reconstructed. In practice, a slightly different form is generally used to reconstruct h in Equation (12).
h = \bigwedge_{m=1}^{M} \left( \sum_{i=1}^{r} w_m^{(i)} \cdot Z_m \right) = \sum_{i=1}^{r} \left\{ \bigwedge_{m=1}^{M} \left[ w_m^{(1)}, w_m^{(2)}, \ldots, w_m^{(r)} \right] \cdot \tilde{Z}_m \right\}_i \quad (12)
Here, $\bigwedge_{m=1}^{M}$ denotes the elementwise product over a sequence of tensors. Thus, $h$ can be computed directly from the input representations $Z_m$ without explicitly creating the tensor $Z$, allowing the LMF model to be applied to any number of modalities and improving computational efficiency.
Using the LMF algorithm, our prediction task with inputs of individuals’ features in three modalities can be explained as shown in Figure 11. First, we passed the unimodal inputs $x_a$, $x_v$, and $x_l$ into three subembedding networks $f_a$, $f_v$, and $f_l$, respectively, to obtain the unimodal representations $z_1$, $z_2$, and $z_3$. Here, every $z_i$ was appended with a 1. Second, with the method of low-rank multimodal fusion with modality-specific factors, we obtained the multimodal output representation $h$. This method makes multimodal fusion efficient without compromising the model’s performance. Finally, the multimodal representation $h$ was used to generate an evaluation of individual competence.
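The fusion step of Equations (8)–(12) can be sketched compactly in PyTorch: each unimodal representation (with a 1 appended) is projected by rank-many modality-specific factors, the projections are multiplied elementwise across modalities, and the rank dimension is collapsed by learned fusion weights. This is a simplified illustration under stated assumptions (the subembedding networks are omitted, and the hidden size, rank, and regression head are illustrative), not the authors’ exact implementation.

```python
# Sketch: low-rank multimodal fusion with modality-specific factors.
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    def __init__(self, dims=(43, 24, 768), out_dim=32, rank=4):
        super().__init__()
        # One factor per modality: (rank, input_dim + 1, out_dim)
        self.factors = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d + 1, out_dim) * 0.1) for d in dims]
        )
        self.fusion_weights = nn.Parameter(torch.randn(rank))
        self.fusion_bias = nn.Parameter(torch.zeros(out_dim))
        self.head = nn.Linear(out_dim, 1)   # maps h to a competence score (assumed head)

    def forward(self, modalities):
        fused = None
        for z, factor in zip(modalities, self.factors):
            z1 = torch.cat([torch.ones(z.shape[0], 1, device=z.device), z], dim=1)  # append 1
            proj = torch.einsum("bd,rdo->rbo", z1, factor)     # per-rank projection
            fused = proj if fused is None else fused * proj    # elementwise product across modalities
        h = torch.einsum("r,rbo->bo", self.fusion_weights, fused) + self.fusion_bias
        return self.head(h)

# Hypothetical usage with the feature sizes reported in this paper
# (visual 43-d, vocal 24-d, textual 768-d after pooling BERT word vectors).
model = LowRankFusion()
x_v, x_a, x_l = torch.randn(8, 43), torch.randn(8, 24), torch.randn(8, 768)
print(model([x_v, x_a, x_l]).shape)   # torch.Size([8, 1])
```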

4.3. Results

An ablation study was conducted by comparing the performance of 7 models (3 models with unimodal data, 3 models with bimodal data, and 1 full fusion model with trimodal data) that combine data in different modalities (textual, vocal, and visual). The results are shown in Table 6. For the no-fusion models with unimodal information, the best performance, with a prediction accuracy of 0.8328, was obtained for the visual modality (see line 3 in Table 6). For the partial fusion models with bimodal information, the model with textual and visual modalities performed better than the other two models (see lines 4 to 6 in Table 6). For the full fusion model, the accuracy reached 0.8723, higher than the accuracies obtained using the no-fusion and partial fusion models (see line 8 in Table 6). These results indicate that the LMF algorithm is adequate for the prediction task, validating the necessity of multimodal analysis. Figure 12 visualizes the model evaluation in terms of accuracy (a), Corr (b), F1-score (c), and MAE (d) values. According to Figure 12, the LMF model with visual, vocal, and textual modalities performed better than the other bimodal and unimodal models, highlighting the necessity of multimodal analysis.
In addition, we conducted a benchmark experiment with the same dataset using the early fusion long short-term memory recurrent neural network (LSTM-RNN), another classical deep learning model [99]. The results showed that the LMF model displayed better performance than the LSTM model in terms of multimodal analysis, with higher accuracy, F1-score, and Corr values and lower MAE values (see lines 7 and 8 in Table 6).
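For reference, the evaluation metrics reported in Table 6 and Figure 12 can be computed as sketched below: MAE and Pearson correlation on the raw scores, plus accuracy and F1 after binarizing predictions and labels at a neutral threshold. The threshold of 0 is an assumption about how the competence scores are centered.

```python
# Sketch: accuracy, F1-score, correlation, and MAE for competence predictions.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error

def evaluate(y_true, y_pred, threshold=0.0):
    corr, _ = pearsonr(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "Corr": corr,
        "Acc": accuracy_score(y_true > threshold, y_pred > threshold),
        "F1": f1_score(y_true > threshold, y_pred > threshold),
    }

print(evaluate(np.array([0.2, -0.4, 1.1, 0.6]), np.array([0.3, -0.1, 0.9, 0.4])))
```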

4.4. Discussion

In study 2, we explored individual competence from livestream videos at a smaller granularity from a data-driven perspective. With the crawled livestream videos from the livestream platform, we first constructed the CH-CMD and sent the livestream videos to a crowdsourcing platform for competence annotation. Then, we extracted features based on three modalities: visual, vocal, and textual. Finally, we utilized an efficient deep learning algorithm, LMF, to build our competence prediction model with the input data from the three modalities.
Our research confirmed the necessity of multimodal analysis in the livestream context, with the trimodal model (accuracy reaching 87.20%) outperforming the unimodal models by over 7% and the bimodal models by approximately 3% in accuracy. Most prior studies examined individual competence with intramodality information, but in the real world, modalities interact with each other, and this interaction may influence individuals’ perceptions. For instance, visual information can influence the impact of vocal behaviors [100], and an individual’s verbal statements may affect his visual perception [101]. Therefore, it is essential to make predictions with full cross-modal information and employ multimodal analysis in our research.
In addition, the results of study 2 provided valuable insights into the impact of individual features in the visual, vocal, and textual modalities on perceived competence. First, the results of unimodal models indicate that models with visual features perform better than models with vocal or textual features (see lines 1 to 3 in Table 6). Second, the bimodal model with the visual modality performed better (see lines 4 to 6 in Table 6). These results indicate that the visual modality contributes the most to perceived individual competence, followed by the vocal modality, and then the textual modality. It seems that “how an individual looks” and “how he speaks” are more important than “what he says”. This is because the transcripts used by influencers in livestreams generally have similar patterns and wording [102], so competence is mainly demonstrated through visual and vocal features rather than textual features. This result is consistent with study 1: according to the SHAP results, influencers’ facial traits (such as eye gaze variation, length-to-width ratio, and age) and vocal traits (such as vocal energy and vocal variation) rank among the top traits, while traits extracted from text (such as intelligence, capability, expression efficiency, and skill) rank among the bottom traits. All of these results can be explained by “visual and demeanor biases”, which means that audiences often rely more on visual and vocal cues than textual cues to make judgments about an individual [100,103].

5. Conclusions

5.1. Discussion

In this article, we introduce two studies on individual competence from knowledge-driven and data-driven perspectives. In the knowledge-driven study (study 1), based on the XGBoost and SHAP results, both outer and inner traits contribute to competence evaluation. Moreover, the outer traits (age, eye gaze variation, glasses, length-to-width ratio, vocal energy, and vocal variation) demonstrate even greater significance than the inner traits (skill, expression efficiency, intelligence, and capability) in competence prediction. This result was verified in the data-driven study (study 2), which indicates that the visual and vocal modalities are more important than the textual modality. In the no-fusion LMF, the models with visual or vocal features achieve better performance than the model with textual features, and in the partial-fusion LMF, the models that include the visual modality outperform the model without it. Both studies convey a message: “what you say” is less important than “how you look” and “how you speak” [100,103].

5.2. Theoretical Contributions

This research makes four contributions to the theory.
First, this study extends the literature by integrating a more comprehensive competence evaluation framework. By including easily observed outer and inner traits, this framework provides a more convenient way to understand competence evaluation and helps to address the ambiguity in previous competence studies.
Next, this research empirically validates the extant literature and successfully predicts individual competence. While previous studies mainly mapped competence using surveys [1,14,19], this research measures competence with 10 knowledge-driven traits and 835 data-driven features extracted from videos, realizing more convincing prediction outcomes and reducing subjectivity [37].
This research also employs a cutting-edge deep learning algorithm, LMF, in competence evaluation. The full-fusion LMF model outperforms other models, demonstrating the necessity and legitimacy of multimodal analysis. The results also provide information about which modality is more important for competence evaluation.
Finally, this research provides academia with the CH-CMD, which contains rich multimodal information for individual competence evaluation. This dataset is among the first to be used to explore individual competence through livestream videos, encouraging further research on individual affective traits, quality, personality, etc.

5.3. Managerial Contribution

The competence evaluation framework in the research can be adopted by both individuals and organizations. First, at the individual level, influencers in livestreams, online teachers, and even candidates in remote interviews can apply this framework to reexamine and improve their competence in front of a camera. Simple adjustments can be made to one’s behaviors according to the framework (such as wearing glasses and adjusting one’s voice energy).
Second, at the organizational level, this framework is useful. For instance, online stores could select influencers with high competence scores before formally signing contracts with them instead of choosing inadequate influencers who may have poor performance on a livestream. Online education platforms could apply this framework to send lectures and courses given by the most competent teachers to the home page or recommend them to users. Personnel departments could apply this framework to self-introduction videos from job seekers and decide on the list of candidates who can enter the next round of interviews.
Third, technically, this research adopts cutting-edge machine learning algorithms to analyze individual traits. With the light computation LMF method, the model can conduct competence evaluations efficiently without consuming excessive computing power and time [98]. Therefore, this technique can be easily applied and generalized by businesses.
Finally, the CH-CMD generated in this research is publicly available, encouraging the industry to analyze individual competence and explore its business applications. Moreover, this dataset helps fill the gap left by the scarcity of datasets containing information on Chinese influencers’ competence traits.

6. Limitations

Three limitations of this study call for further research.
First, our research mainly explores the correlation between individual competence and the 10 traits (4 inner traits and 6 outer traits); it does not establish causality between them. Whether these traits indeed cause an increase or decrease in competence evaluations, and how they influence competence (a mechanistic question), require further research.
Second, we trained and tested our competence evaluation model on livestream platforms, since livestream videos allow us to study individual competence across three modalities. However, due to difficulties in data acquisition, we did not validate the model on other video platforms or in other business settings; further research should examine its application in different contexts, such as online education, online interviews, and remote diagnosis.
Third, this research focused on providing a Chinese competence evaluation multimodal dataset and on training and testing the models with Chinese livestream videos. Whether these models generalize to other ethnic groups and cultural backgrounds remains an open issue. Future research should therefore include more diverse samples from different ethnic groups and cultural backgrounds in the dataset and apply more advanced algorithms that support cross-language analysis.

Author Contributions

Conceptualization, C.L.; Methodology, T.X. and P.D.; Software, P.D.; Validation, C.L.; Investigation, T.X.; Data curation, P.D.; Writing—original draft, T.X.; Writing—review & editing, T.X.; Supervision, C.L.; Project administration, C.L.; Funding acquisition, C.L. The Corresponding author of this paper is C.L. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge the support from the National Natural Science Foundation of China (Grant: 71925003; Grant: 72172099; Grant: 72102238), Sichuan University (Grant: SKSYL2019-01, Grant: 2022CX22; Grant: skbsh2023-59; Grant: JCXK2236; Grant: 20502044C3002), and the China Postdoctoral Science Foundation (Grant: 2021M702319).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1. Expert Selection

To extract the 4 inner traits, we invited 10 experts to create an example transcript collection comprising sentences that display the inner traits of skill, expression efficiency, capability, and intelligence. Several steps were taken to ensure the high quality of the example transcript collection: (1) The 10 experts involved in this task were exclusively from sociology and psychology (rather than marketing), since we aimed to obtain a more general transcript collection reflecting individual competence in various settings. (2) We gave the experts only minimal training, to limit the potential bias that training may cause, and simply told them, "Please write down the sentences that you consider to reflect one's skill/expression efficiency/capability/intelligence". (3) After the task, all the sentences they provided were reviewed by different experts, and only sentences with an acceptance rate of over 98% were included in the example collection.
Finally, a total of 100 sentences reflecting the 4 inner traits (25 sentences per trait) were selected for the example collection. For instance, a sentence displaying high expression efficiency is characterized by the use of rhetoric to convey information vividly, such as "Our fabric is super comfortable, just like the skin of a baby" or "The color of this skirt, which is quite similar to that of avocado, makes you look very fair!".

Appendix A.2. Transcript Preprocessing

Since all the transcripts were in Chinese, we applied Jieba, a Python Chinese word segmentation module, to conduct text segmentation. Jieba is a widely used tool in natural language processing, and it supports three types of text segmentation modes, including accurate mode, full mode, and search engine mode. Here, we selected the accurate mode since it cuts a sentence into the most accurate segmentations and is thus most suitable for text analysis in our research.
Furthermore, we conducted basic data cleaning by deleting meaningless punctuation such as "…", spaces between strings, and strings without Chinese characters. However, we did not remove stop words from the transcripts, so that the original oral style of the influencers and of the example transcript collection could be retained for subsequent analysis.
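A minimal sketch of this preprocessing step is shown below, assuming the `jieba` package; the exact cleaning rules (here, a regular expression that keeps only tokens containing Chinese characters) are illustrative rather than the authors' exact implementation.

```python
# Illustrative sketch of the preprocessing described above: accurate-mode
# Jieba segmentation plus removal of punctuation and non-Chinese strings.
import re
import jieba

def preprocess(sentence):
    tokens = jieba.cut(sentence, cut_all=False)  # cut_all=False is accurate mode
    # Keep only tokens that contain at least one Chinese character.
    return [t.strip() for t in tokens if re.search(r"[\u4e00-\u9fff]", t)]

print(preprocess("这款裙子的颜色和牛油果很像，显得皮肤很白！"))
```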

Appendix A.3. Sentence Embeddings

In this step, BERT, a pre-trained language representation model, was applied as a tool for text representation [32]. BERT extracts features, namely word and sentence embedding vectors, that serve as high-quality inputs to downstream NLP models. Earlier models such as Word2Vec and fastText represented words as uniquely indexed values (one-hot encoding) or as neural word embeddings, in which vocabulary words were mapped to fixed-length feature vectors, so each word had a fixed representation regardless of the context in which it appeared. BERT outperforms them by producing word representations that are dynamically informed by the surrounding words. Finally, BERT transformed each sentence in the transcript into a 768-dimensional sentence embedding, $v_s$.
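The following sketch shows one way to obtain such sentence embeddings, assuming the Hugging Face `transformers` library and the `bert-base-chinese` checkpoint; the checkpoint choice and the mean-pooling strategy are our assumptions, since the appendix only specifies that BERT was used.

```python
# Sketch: turn a transcript sentence into a 768-dimensional embedding.
# Assumptions: Hugging Face `transformers`, the `bert-base-chinese` checkpoint,
# and mean pooling over token states (one common choice, not necessarily the
# paper's exact pooling).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

def sentence_embedding(sentence):
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)            # (768,)

v_s = sentence_embedding("我们的面料非常舒服，就像婴儿的皮肤一样。")
print(v_s.shape)  # torch.Size([768])
```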

Appendix A.4. Cosine Similarity

We used cosine similarity to compute the text similarity score between the influencers' transcripts and the example collection provided by the experts [104]. In the previous step, we processed the sentences in each influencer's transcript and in the example collection into sentence embeddings. Here, an influencer's sentence embedding is denoted by $v_{s_1}$ and an example-collection sentence embedding by $v_{s_2}$. To measure the semantic similarity between them, we computed the cosine similarity between $v_{s_1}$ and $v_{s_2}$:

$$\mathrm{Sim}(v_{s_1}, v_{s_2}) = \frac{v_{s_1} \cdot v_{s_2}}{\lVert v_{s_1} \rVert_2 \, \lVert v_{s_2} \rVert_2}$$

where $\mathrm{Sim}(v_{s_1}, v_{s_2})$ denotes the text similarity score. In this way, the semantic similarity between an influencer's transcript and the example collection was measured, and the influencer's inner traits could be inferred. For instance, if an influencer's transcript is highly similar to the sentences in the example collection characterized by intelligence, this influencer is considered to display the inner trait of intelligence in his or her speech.
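A minimal sketch of this scoring step is given below. It reuses the hypothetical `sentence_embedding` function from the previous sketch; matching an influencer sentence against a single example sentence is illustrative, since the appendix does not specify how scores are aggregated across the collection.

```python
# Sketch of the cosine-similarity score defined above, applied to two sentence
# embeddings produced as in Appendix A.3 (variable names are ours).
import torch
import torch.nn.functional as F

def similarity(v_s1, v_s2):
    # Sim(v_s1, v_s2) = (v_s1 . v_s2) / (||v_s1||_2 * ||v_s2||_2)
    return F.cosine_similarity(v_s1.unsqueeze(0), v_s2.unsqueeze(0)).item()

score = similarity(
    sentence_embedding("这款面料超级舒服"),
    sentence_embedding("我们的面料非常舒服，就像婴儿的皮肤一样。"),
)
print(round(score, 3))
```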

References

  1. Aaker, J.; Vohs, K.D.; Mogilner, C. Nonprofits Are Seen as Warm and For-Profits as Competent: Firm Stereotypes Matter. J. Consum. Res. 2010, 37, 224–237. [Google Scholar] [CrossRef]
  2. Judd, C.M.; James-Hawkins, L.; Yzerbyt, V.; Kashima, Y. Fundamental dimensions of social judgment: Understanding the relations between judgments of competence and warmth. J. Pers. Soc. Psychol. 2005, 89, 899–913. [Google Scholar] [CrossRef] [PubMed]
  3. Grandey, A.A.; Fisk, G.M.; Mattila, A.S.; Jansen, K.J.; Sideman, L.A. Is “service with a smile” enough? Authenticity of positive displays during service encounters. Organ. Behav. Hum. Decis. Process. 2005, 96, 38–55. [Google Scholar] [CrossRef]
  4. French, J.A.; Williamson, P.D.; Thadani, V.M.; Darcey, T.M.; Mattson, R.H.; Spencer, S.S.; Spencer, D.D. Characteristics of medial temporal lobe epilepsy: I. Results of history and physical examination. Ann. Neurol. 1993, 34, 774–780. [Google Scholar] [CrossRef] [PubMed]
  5. Hartle, F. How to Re-Engineer Your Performance Management Process; Kogan Page: New York, NY, USA, 1995. [Google Scholar]
  6. Spencer, L.; Spencer, S. Competence at Work: Models for Superior Performance; Wiley: New York, NY, USA, 1993. [Google Scholar]
  7. Mansfield, B. Competence-based Qualifications: A Response. J. Eur. Ind. Train. 1993, 17, 19–22. [Google Scholar] [CrossRef]
  8. McClelland, D.C. Identifying Competencies with Behavioral-Event Interviews. Psychol. Sci. 1998, 9, 331–339. [Google Scholar] [CrossRef]
  9. Wojciszke, B. Multiple meanings of behavior: Construing actions in terms of competence or morality. J. Personal. Soc. Psychol. 1994, 67, 222–232. [Google Scholar] [CrossRef]
  10. Hatlevik, O.E.; Christophersen, K.-A. Digital competence at the beginning of upper secondary school: Identifying factors explaining digital inclusion. Comput. Educ. 2013, 63, 240–247. [Google Scholar] [CrossRef]
  11. Sussman, A.B.; Petkova, K.; Todorov, A. Competence ratings in US predict presidential election outcomes in Bulgaria. J. Exp. Soc. Psychol. 2013, 49, 771–775. [Google Scholar] [CrossRef]
  12. Mariadoss, B.J.; Milewicz, C.; Lee, S.; Sahaym, A. Salesperson competitive intelligence and performance: The role of product knowledge and sales force automation usage. Ind. Mark. Manag. 2014, 43, 136–145. [Google Scholar] [CrossRef]
  13. Graham, J.R.; Harvey, C.R.; Puri, M. A Corporate Beauty Contest. Manag. Sci. 2017, 63, 3044–3056. [Google Scholar] [CrossRef]
  14. Wojciszke, B.; Brycz, H.; Borkenau, P. Effects of Information Content and Evaluative Extremity on Positivity and Negativity Biases. J. Personal. Soc. Psychol. 1993, 64, 327–335. [Google Scholar] [CrossRef]
  15. Fiske, S.T.; Cuddy, A.J.; Glick, P. Universal dimensions of social cognition: Warmth and competence. Trends Cogn. Sci. 2007, 11, 77–83. [Google Scholar] [CrossRef] [PubMed]
  16. Lebowitz, M.S.; Ahn, W.K.; Oltman, K. Sometimes more competent, but always less warm: Perceptions of biologically oriented mental-health clinicians. Int. J. Soc. Psychiatry 2015, 61, 668–676. [Google Scholar] [CrossRef] [PubMed]
  17. van de Ven, N.; Meijs, M.H.; Vingerhoets, A. What emotional tears convey: Tearful individuals are seen as warmer, but also as less competent. Br. J. Soc. Psychol. 2017, 56, 146–160. [Google Scholar] [CrossRef] [PubMed]
  18. Awale, A.; Chan, C.S.; Ho, G.T.S. The influence of perceived warmth and competence on realistic threat and willingness for intergroup contact. Eur. J. Soc. Psychol. 2019, 49, 857–870. [Google Scholar] [CrossRef]
  19. Fiske, S.T. A Model of (Often Mixed) Stereotype Content: Competence and Warmth Respectively Follow From Perceived Status and Competition. J. Personal. Soc. Psychol. 2002, 82, 878–902. [Google Scholar] [CrossRef]
  20. Johnston, R. The determinants of service quality: Satisfiers and dissatisfiers. Int. J. Serv. Ind. Manag. 1995, 6, 53–71. [Google Scholar] [CrossRef]
  21. Wu, Y.-C.; Tsai, C.-S.; Hsiung, H.-W.; Chen, K.-Y. Linkage between frontline employee service competence scale and customer perceptions of service quality. J. Serv. Mark. 2015, 29, 224–234. [Google Scholar] [CrossRef]
  22. Leung, F.F.; Kim, S.; Tse, C.H. Highlighting Effort Versus Talent in Service Employee Performance: Customer Attributions and Responses. J. Mark. 2020, 84, 106–121. [Google Scholar] [CrossRef]
  23. Brown, C.M.; Troy, N.S.; Jobson, K.R.; Link, J.K. Contextual and personal determinants of preferring success attributed to natural talent or striving. J. Exp. Soc. Psychol. 2018, 78, 134–147. [Google Scholar] [CrossRef]
  24. Lester, S. Professional standards, competence and capability. High. Educ. Ski. Work-Based Learn. 2014, 4, 31–43. [Google Scholar] [CrossRef]
  25. Walker, M.; Wanke, M. Caring or daring? Exploring the impact of facial masculinity/femininity and gender category information on first impressions. PLoS ONE 2017, 12, e0181306. [Google Scholar] [CrossRef]
  26. Wen, F.; Qiao, Y.; Zuo, B.; Ye, H.; Ding, Y.; Wang, Q.; Ma, S. Dominance or Integration? Influence of Sexual Dimorphism and Clothing Color on Judgments of Male and Female Targets’ Attractiveness, Warmth, and Competence. Arch. Sex. Behav. 2022, 51, 2823–2836. [Google Scholar] [CrossRef]
  27. Klofstad, C.A.; Anderson, R.C.; Nowicki, S. Perceptions of Competence, Strength, and Age Influence Voters to Select Leaders with Lower-Pitched Voices. PLoS ONE 2015, 10, e0133779. [Google Scholar] [CrossRef]
  28. Tigue, C.C.; Borak, D.J.; O’Connor, J.J.M.; Schandl, C.; Feinberg, D.R. Voice pitch influences voting behavior. Evol. Hum. Behav. 2012, 33, 210–216. [Google Scholar] [CrossRef]
  29. Crivelli, C.; Carrera, P.; Fernández-Dols, J.-M. Are smiles a sign of happiness? Spontaneous expressions of judo winners. Evol. Hum. Behav. 2015, 36, 52–58. [Google Scholar] [CrossRef]
  30. Baltrusaitis, T.; Zadeh, A.; Lim, Y.C.; Morency, L.P. OpenFace 2.0: Facial Behavior Analysis Toolkit. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 59–66. [Google Scholar]
  31. Babu, P.A.; Nagaraju, V.S.; Vallabhuni, R.R. Speech Emotion Recognition System With Librosa. In Proceedings of the 2021 10th IEEE International Conference on Communication Systems and Network Technologies (CSNT), Bhopal, India, 18–19 June 2021; pp. 421–424. [Google Scholar]
  32. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar] [CrossRef]
  33. van Klink, M.R.D.; Boon, J. Competencies: The triumph of a fuzzy concept. Int. J. Hum. Resour. Dev. Manag. 2003, 3, 125–137. [Google Scholar] [CrossRef]
  34. Neethu, M.S.; Rajasree, R. Sentiment analysis in twitter using machine learning techniques. In Proceedings of the 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT), Tiruchengode, India, 4–6 July 2013; pp. 1–5. [Google Scholar]
  35. Elhai, J.D.; Montag, C. The compatibility of theoretical frameworks with machine learning analyses in psychological research. Curr. Opin. Psychol. 2020, 36, 83–88. [Google Scholar] [CrossRef] [PubMed]
  36. Domínguez-Jiménez, J.A.; Campo-Landines, K.C.; Martínez-Santos, J.C.; Delahoz, E.J.; Contreras-Ortiz, S.H. A machine learning model for emotion recognition from physiological signals. Biomed. Signal Process. Control 2020, 55, 101646. [Google Scholar] [CrossRef]
  37. Le Deist, F.D.; Winterton, J. What Is Competence? Hum. Resour. Dev. Int. 2005, 8, 27–46. [Google Scholar] [CrossRef]
  38. Rimm-Kaufman, S.E.; Early, D.M.; Cox, M.J.; Saluja, G.; Pianta, R.C.; Bradley, R.H.; Payne, C. Early behavioral attributes and teachers’ sensitivity as predictors of competent behavior in the kindergarten classroom. J. Appl. Dev. Psychol. 2002, 23, 451–470. [Google Scholar] [CrossRef]
  39. Burgoyne, J. Creating the Managerial Portfolio: Building On Competency Approaches To Management Development. Manag. Educ. Dev. 1989, 20, 56–61. [Google Scholar] [CrossRef]
  40. Dooley, K.E.; Lindner, J.R.; Dooley, L.M.; Alagaraja, M. Behaviorally anchored competencies: Evaluation tool for training via distance. Hum. Resour. Dev. Int. 2004, 7, 315–332. [Google Scholar] [CrossRef]
  41. Rentz, J.O.; Shepherd, C.D.; Tashchian, A.; Dabholkar, P.A.; Ladd, R.T. A Measure of Selling Skill: Scale Development and Validation. J. Pers. Sell. Sales Manag. 2002, 22, 13–21. [Google Scholar]
  42. Peesker, K.M.; Kerr, P.D.; Bolander, W.; Ryals, L.J.; Lister, J.A.; Dover, H.F. Hiring for sales success: The emerging importance of salesperson analytical skills. J. Bus. Res. 2022, 144, 17–30. [Google Scholar] [CrossRef]
  43. Hughes, D.E.; Le Bon, J.; Rapp, A. Gaining and leveraging customer-based competitive intelligence: The pivotal role of social capital and salesperson adaptive selling skills. J. Acad. Mark. Sci. 2012, 41, 91–110. [Google Scholar] [CrossRef]
  44. Gabler, C.B.; Vieira, V.A.; Senra, K.B.; Agnihotri, R. Measuring and testing the impact of interpersonal mentalizing skills on retail sales performance. J. Pers. Sell. Sales Manag. 2019, 39, 222–237. [Google Scholar] [CrossRef]
  45. Kipnis, D.; Schmidt, S.M.; Wilkinson, I. Intraorganizational influence tactics: Explorations in getting one’s way. J. Appl. Psychol. 1980, 65, 440–452. [Google Scholar] [CrossRef]
  46. Elmore, K.C.; Luna-Lucero, M. Light Bulbs or Seeds? How Metaphors for Ideas Influence Judgments About Genius. Soc. Psychol. Personal. Sci. 2016, 8, 200–208. [Google Scholar] [CrossRef]
  47. Brosius, H.-B.; Bathelt, A. The Utility of Exemplars in Persuasive Communications. Commun. Res. 1994, 21, 48–78. [Google Scholar] [CrossRef]
  48. Thibodeau, P.H.; Matlock, T.; Flusberg, S.J. The role of metaphor in communication and thought. Lang. Linguist. Compass 2019, 13, e12327. [Google Scholar] [CrossRef]
  49. Headley, D.E.; Choi, B. Achieving Service Quality Through Gap Analysis and a Basic Statistical Approach. J. Serv. Mark. 1992, 6, 5–14. [Google Scholar] [CrossRef]
  50. Morgan, R.M.; Hunt, S.D. The Commitment-Trust Theory of Relationship Marketing. J. Mark. 1994, 58, 20–38. [Google Scholar] [CrossRef]
  51. Moorman, C.; Deshpande, R.; Zaltman, G. Factors Affecting Trust in Market-Research Relationships. J. Mark. 1993, 57, 81–101. [Google Scholar] [CrossRef]
  52. Weitz, B.A.; Sujan, H.; Sujan, M. Knowledge, Motivation, and Adaptive Behavior: A Framework for Improving Selling Effectiveness. J. Mark. 1986, 50, 174–191. [Google Scholar] [CrossRef]
  53. Cravens, D.W.; Ingram, T.N.; LaForge, R.W.; Young, C.E. Behavior-Based and Outcome-Based Salesforce Control Systems. J. Mark. 1993, 57, 47–59. [Google Scholar] [CrossRef]
  54. Rapp, A.; Agnihotri, R.; Baker, T.L. Conceptualizing salesperson competitive intelligence: An individual-level perspective. J. Pers. Sell. Sales Manag. 2011, 31, 141–155. [Google Scholar] [CrossRef]
  55. Anand, P.; Hunter, G.; Carter, I.; Dowding, K.; Guala, F.; Van Hees, M. The Development of Capability Indicators. J. Hum. Dev. Capab. 2009, 10, 125–152. [Google Scholar] [CrossRef]
  56. Karasek, R.A. Job Demands, Job Decision Latitude, and Mental Strain: Implications for Job Redesign. Adm. Sci. Q. 1979, 24, 285–308. [Google Scholar] [CrossRef]
  57. Wang, G.; Netemyer, R.G. The effects of job autonomy, customer demandingness, and trait competitiveness on salesperson learning, self-efficacy, and performance. J. Acad. Mark. Sci. 2002, 30, 217–228. [Google Scholar] [CrossRef]
  58. Oldmeadow, J.; Fiske, S.T. System-justifying ideologies moderate status = competence stereotypes: Roles for belief in a just world and social dominance orientation. Eur. J. Soc. Psychol. 2007, 37, 1135–1148. [Google Scholar] [CrossRef]
  59. Sturman, M. Searching for the Inverted U-Shaped Relationship Between Time and Performance: Meta-Analyses of the Experience/Performance, Tenure/Performance, and Age/Performance Relationships. J. Manag. 2003, 29, 609–640. [Google Scholar] [CrossRef]
  60. Avolio, B.J.; Waldman, D.A.; McDaniel, M.A. Age and Work Performance in Nonmanagerial Jobs: The Effects of Experience and Occupational Type. Acad. Manag. J. 1990, 33, 407–422. [Google Scholar] [CrossRef]
  61. Tyagi, P.K.; Wotruba, T.R. Do gender and age really matter in direct selling? An exploratory investigation. J. Mark. Manag. 1998, 8, 22–33. [Google Scholar]
  62. Waldman, D.A.; Avolio, B.J. A meta-analysis of age differences in job performance. J. Appl. Psychol. 1986, 71, 33–38. [Google Scholar] [CrossRef]
  63. Harris, M.B. Sex Differences in Stereotypes of Spectacles1. J. Appl. Soc. Psychol. 1991, 21, 1659–1680. [Google Scholar] [CrossRef]
  64. Terry, R.L.; Krantz, J.H. Dimensions of Trait Attributions Associated with Eyeglasses, Men’s Facial Hair, and Women’s Hair Length1. J. Appl. Soc. Psychol. 1993, 23, 1757–1769. [Google Scholar] [CrossRef]
  65. Leder, H.; Forster, M.; Gerger, G. The Glasses Stereotype Revisited. Swiss J. Psychol. 2011, 70, 211–222. [Google Scholar] [CrossRef]
  66. Fleischmann, A.; Lammers, J.; Stoker, J.I.; Garretsen, H. You Can Leave Your Glasses on. Soc. Psychol. 2019, 50, 38–52. [Google Scholar] [CrossRef]
  67. Mason, M.F.; Tatkow, E.P.; Macrae, C.N. The Look of Love: Gaze Shifts and Person Perception. Psychol. Sci. 2005, 16, 236–239. [Google Scholar] [CrossRef] [PubMed]
  68. Kraut, R.E.; Poe, D.B. Behavioral roots of person perception: The deception judgments of customs inspectors and laymen. J. Personal. Soc. Psychol. 1980, 39, 784–798. [Google Scholar] [CrossRef]
  69. Mori, Y.; Pell, M.D. The Look of (Un)confidence: Visual Markers for Inferring Speaker Confidence in Speech. Front. Commun. 2019, 4, 63. [Google Scholar] [CrossRef]
  70. Grondin, F.; Lomanowska, A.M.; Poire, V.; Jackson, P.L. Clients in Simulated Teletherapy via Videoconference Compensate for Altered Eye Contact When Evaluating Therapist Empathy. J. Clin. Med. 2022, 11, 3461. [Google Scholar] [CrossRef]
  71. Jongerius, C.; Twisk, J.W.R.; Romijn, J.A.; Callemein, T.; Goedeme, T.; Smets, E.M.A.; Hillen, M.A. The Influence of Face Gaze by Physicians on Patient Trust: An Observational Study. J. Gen. Intern. Med. 2022, 37, 1408–1414. [Google Scholar] [CrossRef] [PubMed]
  72. Beege, M.; Schneider, S.; Nebel, S.; Rey, G.D. Look into my eyes! Exploring the effect of addressing in educational videos. Learn. Instr. 2017, 49, 113–120. [Google Scholar] [CrossRef]
  73. Haselhuhn, M.P.; Wong, E.M. Bad to the bone: Facial structure predicts unethical behaviour. Proc. Biol. Sci. 2012, 279, 571–576. [Google Scholar] [CrossRef]
  74. Stirrat, M.; Perrett, D.I. Valid facial cues to cooperation and trust: Male facial width and trustworthiness. Psychol. Sci. 2010, 21, 349–354. [Google Scholar] [CrossRef]
  75. Durkee, P.K.; Ayers, J.D. Is facial width-to-height ratio reliably associated with social inferences? Evol. Hum. Behav. 2021, 42, 583–592. [Google Scholar] [CrossRef]
  76. Geniole, S.N.; McCormick, C.M. Facing our ancestors: Judgements of aggression are consistent and related to the facial width-to-height ratio in men irrespective of beards. Evol. Hum. Behav. 2015, 36, 279–285. [Google Scholar] [CrossRef]
  77. Lefevre, C.E.; Lewis, G.J. Perceiving Aggression from Facial Structure: Further Evidence for A Positive Association with Facial Width–To–Height Ratio and Masculinity, but Not for Moderation by Self–Reported Dominance. Eur. J. Personal. 2014, 28, 530–537. [Google Scholar] [CrossRef]
  78. Guyer, J.J.; Brinol, P.; Vaughan-Johnston, T.I.; Fabrigar, L.R.; Moreno, L.; Petty, R.E. Paralinguistic Features Communicated through Voice can Affect Appraisals of Confidence and Evaluative Judgments. J. Nonverbal Behav. 2021, 45, 479–504. [Google Scholar] [CrossRef]
  79. Jiang, X.; Pell, M.D. The sound of confidence and doubt. Speech Commun. 2017, 88, 106–126. [Google Scholar] [CrossRef]
  80. Kimble, C.E.; Seidel, S.D. Vocal signs of confidence. J. Nonverbal Behav. 1991, 15, 99–105. [Google Scholar] [CrossRef]
  81. Ray, G.B. Vocally cued personality prototypes: An implicit personality theory approach. Commun. Monogr. 1986, 53, 266–276. [Google Scholar] [CrossRef]
  82. Burgoon, J.K.; Johnson, M.L.; Koch, P.T. The nature and measurement of interpersonal dominance. Commun. Monogr. 1998, 65, 308–335. [Google Scholar] [CrossRef]
  83. Van Zant, A.B.; Berger, J. How the voice persuades. J. Personal. Soc. Psychol. 2020, 118, 661–682. [Google Scholar] [CrossRef]
  84. Fishbein, A.R.; Prior, N.H.; Brown, J.A.; Ball, G.F.; Dooling, R.J. Discrimination of natural acoustic variation in vocal signals. Sci. Rep. 2021, 11, 916. [Google Scholar] [CrossRef] [PubMed]
  85. Smith, V.L.; Clark, H.H. On the Course of Answering Questions. J. Mem. Lang. 1993, 32, 25–38. [Google Scholar] [CrossRef]
  86. Mullennix, J.W.; Bihon, T.; Bricklemyer, J.; Gaston, J.; Keener, J.M. Effects of Variation in Emotional Tone of Voice on Speech Perception. Lang. Speech 2002, 45, 255–283. [Google Scholar] [CrossRef] [PubMed]
  87. Homburg, C.; Jensen, O.; Hahn, A. How to Organize Pricing? Vertical Delegation and Horizontal Dispersion of Pricing Authority. J. Mark. 2012, 76, 49–69. [Google Scholar] [CrossRef]
  88. Utych, S.M. Speaking style and candidate evaluations. Politics Groups Identities 2021, 9, 589–607. [Google Scholar] [CrossRef]
  89. Haleta, L.L. Student perceptions of teachers’ use of language: The effects of powerful and powerless language on impression formation and uncertainty. Commun. Educ. 1996, 45, 16–28. [Google Scholar] [CrossRef]
  90. Conley, J.M.; O’Barr, W.M.; Lind, E.A. The Power of Language: Presentational Style in the Courtroom. Duke Law J. 1979, 1978, 1375–1399. [Google Scholar] [CrossRef]
  91. Bradac, J.J.; Mulac, A. A molecular view of powerful and powerless speech styles: Attributional consequences of specific language features and communicator intentions. Commun. Monogr. 1984, 51, 307–319. [Google Scholar] [CrossRef]
  92. Hsu, C.-F.; Wang, Y.-S.; Lei, C.-L.; Chen, K.-T. Look at Me! Correcting Eye Gaze in Live Video Communication. ACM Trans. Multimed. Comput. Commun. Appl. 2019, 15, 1–21. [Google Scholar] [CrossRef]
  93. Buller, D.B.; Burgoon, J.K. The Effects of Vocalics and Nonverbal Sensitivity on Compliance: A Replication and Extension. Hum. Commun. Res. 1986, 13, 126–144. [Google Scholar] [CrossRef]
  94. Salgado-Montejo, A.; Tapia Leon, I.; Elliot, A.J.; Salgado, C.J.; Spence, C. Smiles over Frowns: When Curved Lines Influence Product Preference. Psychol. Mark. 2015, 32, 771–781. [Google Scholar] [CrossRef]
  95. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016. [Google Scholar] [CrossRef]
  96. Park, S.; Shim, H.S.; Chatterjee, M.; Sagae, K.; Morency, L.-P. Computational Analysis of Persuasiveness in Social Multimedia. In Proceedings of the 16th International Conference on Multimodal Interaction, Istanbul, Turkey, 12–16 November 2014; pp. 50–57. [Google Scholar]
  97. Zhang, Y.; Yang, Q. A Survey on Multi-Task Learning. IEEE Trans. Knowl. Data Eng. 2022, 34, 5586–5609. [Google Scholar] [CrossRef]
  98. Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.; Morency, L.P. Efficient low-rank multimodal fusion with modality-specific factors. arXiv 2018, arXiv:1806.00064. [Google Scholar]
  99. Liang, P.P.; Zadeh, A.; Morency, L.-P. Multimodal Local-Global Ranking Fusion for Emotion Recognition. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, Association for Computing Machinery, Boulder, CO, USA, 16–20 October 2018; pp. 472–476. [Google Scholar]
  100. De Waele, A.; Claeys, A.-S.; Cauberghe, V.; Fannes, G. Spokespersons’ Nonverbal Behavior in Times of Crisis: The Relative Importance of Visual and Vocal Cues. J. Nonverbal Behav. 2018, 42, 441–460. [Google Scholar] [CrossRef]
  101. Anlló, H.A.-O.; Watanabe, K.; Sackur, J.; de Gardelle, V. Effects of false statements on visual perception hinge on social suggestibility. J. Exp. Psychol. Hum. Percept. Perform. 2022, 48, 889–900. [Google Scholar] [CrossRef] [PubMed]
  102. Recktenwald, D. Toward a transcription and analysis of live streaming on Twitch. J. Pragmat. 2017, 115, 68–81. [Google Scholar] [CrossRef]
  103. Burgoon, J.K.; Blair, J.P.; Strom, R.E. Cognitive Biases and Nonverbal Cue Availability in Detecting Deception. Hum. Commun. Res. 2008, 34, 572–599. [Google Scholar] [CrossRef]
  104. Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; Specia, L. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, BC, Canada, 3–4 August 2017. [Google Scholar]
Figure 1. Knowledge-driven study 1 and data-driven study 2.
Figure 2. The mean SHAP values.
Figure 3. Density scatter plots of the SHAP values.
Figure 4. SHAP values of two samples.
Figure 5. Competence distribution.
Figure 6. Competence annotation interface.
Figure 7. Extracted features from the visual modality.
Figure 8. Mel spectrogram of the vocal modality.
Figure 9. Word cloud of the textual modality.
Figure 10. Multimodal prediction task with three modalities.
Figure 11. Overview of the LMF algorithm.
Figure 12. Accuracy (a), Corr (b), F1-score (c), and MAE (d) evaluation.
Table 1. Competence evaluation dimensions.

Source | Measurement
Awale (2019) [18] | Incompetent vs. Competent; Unconfident vs. Confident; Incapable vs. Capable; Inefficient vs. Efficient; Unintelligent vs. Intelligent; Unskillful vs. Skillful
Bogdan Wojciszke (1993) [14] | Unintelligent vs. Intelligent; Timid vs. Courageous; Lack of Will-power vs. Will-power
Aaker (2010) [1] | Incompetent vs. Competent; Ineffective vs. Effective; Inefficient vs. Efficient
Lebowitz (2015) [16] | Not Confident vs. Confident; Incompetent vs. Competent; Unintelligent vs. Intelligent; Incapable vs. Capable; Not Independent vs. Independent; Not Competitive vs. Competitive; Unskilled vs. Skilled; Uneducated vs. Educated
van de Ven (2017) [17] | Incompetent vs. Competent; Insecure vs. Self-assured; Incapable vs. Capable; Clumsy vs. Skilled
Fiske (2002) [19] | Incompetent vs. Competent; Unconfident vs. Confident; Not Independent vs. Independent; Uncompetitive vs. Competitive; Unintelligent vs. Intelligent
Our Measurement (synthesized from all sources above) | 1. Unskillful vs. Skillful; 2. Inefficient vs. Efficient; 3. Unintelligent vs. Intelligent; 4. Incapable vs. Capable
Table 3. Variable summary.

Variable | Description | Mean | SD | Min | Max | API/Algorithm
Dependent variable
Competence | Numeric, the value from the perceived competence evaluation | 40.664 | 9.751 | 10 | 80 | /
Independent variables
Glasses | Binary, whether an individual wears glasses | 0.011 | 0.104 | 0 | 1 | Face++
Eye gaze variation | Numeric, coefficient of variation of an individual's gaze from left to right and up to down | 0.056 | 0.055 | 0 | 0.402 | OpenFace 2.0
Length-to-width ratio | Numeric, the length of an individual's face (measured from the top of the eyelid to the upper lip) divided by its width | 0.705 | 0.03 | 0.596 | 0.897 | Face++
Vocal energy | Numeric, the amplitude of an individual's voice | 0.022 | 0.023 | 0.002 | 0.219 | Librosa
Vocal variation | Numeric, the amplitude variation in an individual's voice | 0.011 | 0.009 | 0.001 | 0.102 | Librosa
Expression efficiency | Numeric, the extent to which an individual exploits his oral expression skills | 0.042 | 0.2 | 0 | 1 | Text similarity
Skill | Numeric, the extent to which an individual demonstrates his sales skills | 0.053 | 0.705 | 0.03 | 0.596 | Text similarity
Intelligence | Numeric, the extent to which an individual demonstrates knowledge about a product's technical features and capabilities | 0.121 | 0.326 | 0 | 1 | Text similarity
Capability | Numeric, the extent to which an individual uses a powerful style of speech | 0.075 | 0.263 | 0 | 1 | Text similarity
Age | Numeric, an individual's perceived age | 25.454 | 4.354 | 13 | 47 | Face++
Control variable
Gender | Binary, female influencers are labeled as 0 and male influencers as 1 | 0.096 | 0.276 | 0 | 1 | Face++
Note: OpenFace 2.0 is a framework to implement a modern facial behavior analysis algorithm. Librosa is a Python package for audio analysis. Face++ is a platform offering computer vision technologies such as face detection, face search, face landmarks, and face masks.
Table 4. OLS model regression results.

Variable | (1) | (2) | (3)
Skill | 0.0795 | | 0.0987
 | (0.0757) | | (0.0741)
Expression efficiency | 0.207 ** | | 0.229 ***
 | (0.0851) | | (0.0834)
Intelligence | 0.0644 | | 0.0386
 | (0.052) | | (0.0511)
Capability | 0.139 ** | | 0.110 *
 | (0.0645) | | (0.0633)
Age | | 0.0195 *** | 0.0191 ***
 | | (0.0039) | (0.0039)
Glasses | | 1.153 *** | 1.158 ***
 | | (0.162) | (0.162)
Eye gaze variation | | −0.316 | −0.353
 | | (0.305) | (0.304)
Length-to-width ratio | | 1.489 *** | 1.447 **
 | | (0.565) | (0.565)
Vocal energy | | 7.095 *** | 7.001 ***
 | | (1.18) | (1.181)
Vocal variation | | −6.427 ** | −5.997 *
 | | (3.113) | (3.116)
Constant | 4.035 *** | 2.436 *** | 2.447 ***
 | (0.0196) | (0.414) | (0.414)
Observations | 3300 | 3300 | 3300
R-squared | 0.004 | 0.044 | 0.048
Standard errors in parentheses; *** p < 0.01, ** p < 0.05, * p < 0.1.
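As a rough illustration of the three specifications in Table 4, the sketch below fits them with `statsmodels`. Here `df` is a hypothetical DataFrame whose columns correspond to the variables in Table 3 (snake_case naming is ours); the gender control listed in Table 3 is not reported in Table 4 and is therefore omitted from the formulas.

```python
# Sketch of the three OLS specifications in Table 4: inner traits only (1),
# outer traits only (2), and the full model (3). `df` is hypothetical.
import statsmodels.formula.api as smf

inner = "skill + expression_efficiency + intelligence + capability"
outer = ("age + glasses + eye_gaze_variation + length_to_width_ratio + "
         "vocal_energy + vocal_variation")

m1 = smf.ols(f"competence ~ {inner}", data=df).fit()             # column (1)
m2 = smf.ols(f"competence ~ {outer}", data=df).fit()             # column (2)
m3 = smf.ols(f"competence ~ {inner} + {outer}", data=df).fit()   # column (3)
print(m3.summary())
```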
Table 5. Extracted features.

Modality | Toolkit | Dims | Features
Visual | OpenFace | 43 | Eye features (8 dims); facial action units (35 dims)
Vocal | Librosa | 24 | MFCCs (20 dims); RMS; zero crossing rate; spectral rolloff; spectral centroid
Textual | BERT | 768 | BERT-base word embeddings
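The vocal features in Table 5 can be reproduced roughly with Librosa as sketched below; averaging frame-level values over a clip, the default frame parameters, and the placeholder file name "clip.wav" are our assumptions rather than the paper's exact settings.

```python
# Rough sketch of extracting the 24 vocal dims listed in Table 5 with Librosa.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=None)  # placeholder file name

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)   # 20 dims
rms = librosa.feature.rms(y=y).mean()                             # vocal energy
zcr = librosa.feature.zero_crossing_rate(y).mean()
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr).mean()
centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()

vocal_features = np.concatenate([mfcc, [rms, zcr, rolloff, centroid]])
print(vocal_features.shape)  # (24,)
```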
Table 6. Modality ablation results.

Method | Deep Learning Model | Modalities Used | (a) Accuracy | (b) F1-Score | (c) MAE | (d) Corr
(1) No fusion + unimodal data | LMF | Textual | 0.8055 | 0.8057 | 0.7353 | 0.644
(2) No fusion + unimodal data | LMF | Vocal | 0.8207 | 0.8211 | 0.7015 | 0.6951
(3) No fusion + unimodal data | LMF | Visual | 0.8328 | 0.832 | 0.6701 | 0.6893
(4) Partial fusion + bimodal data | LMF | two of the three modalities | 0.8419 | 0.8418 | 0.724 | 0.6471
(5) Partial fusion + bimodal data | LMF | two of the three modalities | 0.845 | 0.8453 | 0.6581 | 0.6983
(6) Partial fusion + bimodal data | LMF | two of the three modalities | 0.8419 | 0.7293 | 0.724 | 0.6471
(7) Full fusion + trimodal data | EF_LSTM | Textual + Vocal + Visual | 0.8297 | 0.8299 | 0.7415 | 0.6304
(8) Full fusion + trimodal data | LMF | Textual + Vocal + Visual | 0.8723 | 0.8717 | 0.6347 | 0.7188
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
