1. Introduction
The upper limbs are an important part of the human body. Able-bodied upper limbs can complete tasks such as grasping, carrying, and kneading according to human intentions [1]. Due to the loss of hand function, patients with upper limb disabilities have difficulty completing daily actions such as drinking, dressing, and putting on glasses, which causes inconvenience in life, study, and work and seriously affects their quality of life. Some hand functions can be restored by wearing cosmetic or myoelectric prostheses, but most of these cannot follow human intentions and cannot meet the needs of daily life. Therefore, the study of assistive manipulators for people with disabilities has always been one of the important topics in robotics [2,3]. However, the key to realizing the assistive function of the manipulator is whether it can be coordinated with the user's action intention to help patients with upper limb disabilities smoothly perform actions such as drinking and dressing and, finally, achieve the reconstruction of hand function.
The control of prosthetic hands depends on the recognition of upper limb action intention, and the collection and analysis of upper limb movement information is the first step in recognizing it. Generally, the information sources containing the upper limb action intention are divided into two categories: bioelectrical information, such as EMG and EEG signals, and general physical information, such as posture, visual, and force signals [4]. Myoelectric control is currently the most widespread method for controlling disability-assisted prostheses; it controls the device by collecting the electrical impulses generated by muscles through sensors attached to the limb [5]. Song et al. recognized seven common human lower limb movement patterns in daily life based on a multilayer perceptron (MLP) and a long short-term memory (LSTM) network and achieved good recognition results [6]. Chai et al. proposed a closed-loop model based on sEMG signals that is composed of a long short-term memory (LSTM) network and a discrete-time zeroing neural network (ZNN); experiments showed that the model has high accuracy in recognizing the intention of simple joint motions [7]. However, myoelectric control also has certain limitations. Some patients have a high degree of upper limb amputation and few remaining limb muscles, resulting in few sources of EMG signals. Moreover, the EMG signal is susceptible to interference from non-ideal factors such as electrode shift, individual differences, and muscle fatigue, which can seriously degrade the performance of an EMG-controlled prosthetic hand.
Research on brain–computer interfaces has mainly focused on decoding brain neural activity during human thought [8,9]. This activity mainly includes EEG signals generated by physical movement, motor imagery, and sensory perception, among which motor imagery EEG signals have been applied to upper limb action intention recognition. He and Sun et al. asked subjects to imagine left-hand and right-hand movements and used different feature extraction methods to classify the two types of EEG signals to judge the upper limb action intention; the experiments all achieved a certain accuracy [10,11]. However, the low signal-to-noise ratio of EEG signals and the non-obvious relationship between EEG signals and action intention still make it challenging to extract effective information and interpret the signals [12,13]. It can be seen that bioelectrical signals usually express the upper limb action intention more directly, but obtaining stable bioelectrical signals and decoding them accurately remains challenging.
In recent years, owing to the remarkable success of deep learning in the field of computer vision [14], research on action intention recognition and manipulator control based on computer vision has gradually emerged. Ghazaei et al. applied visual signals to prosthetic hand control, training a convolutional neural network (CNN) on more than 500 images of graspable objects, which realized basic object grasping and movement functions. However, relying only on visual signals, it is difficult to continuously follow human intention to complete coherent and complex actions such as drinking and dressing [15]. Shi et al. used a convolutional neural network (CNN) to classify images of daily objects according to different grasping patterns and applied this vision-based pattern recognition method to dexterous prosthetic hand control. The prosthetic hand performed well in the task of "reaching out and grasping"; compared with the traditional EMG control method, the control effect of the prosthetic hand based on computer vision was significantly improved [16]. Visual signals are usually obtained by computers capturing, interpreting, and processing visually perceptible objects. When the manipulator moves to the position of the target object, the control system can obtain key information such as the category, shape, and purpose of the target object from the visual signal; this provides a basis for judging when the manipulator should open in the next action. However, visual signals cannot determine the motion state of the upper limbs, so it is difficult to accurately identify upper limb intention by relying on visual signals alone.
Inertial sensors are widely used in applications such as portable mobile devices, rehabilitation monitoring, and motion recognition due to their low power consumption, low cost, small size, and high accuracy [17,18,19,20]. Cui et al. placed inertial sensors on the upper limbs and proposed an arm motion recognition method based on a sub-motion characteristic matrix and a dynamic time warping (DTW) algorithm, with a recognition rate of 99.4% [21]. Xuan et al. fixed inertial sensors on the foot, outer calf, and outer thigh to collect acceleration signals during lower limb movements and to recognize lower limb motion intentions. The experiments achieved a 97% recognition rate in five steady-state modes (walking on flat ground, going upstairs, going downstairs, going uphill, and going downhill) and achieved smooth and stable control of a lower limb prosthesis [22]. The angular velocity and attitude angle of a limb can be obtained by wearing an inertial sensor on it. However, upper limb movements are often more complex and contain richer action intentions, so upper limb action intentions cannot be recognized based only on the motion information obtained by inertial sensors. Nevertheless, the inertial sensor data represent the motion state and posture of the upper limb, which exactly complement the visual information. Therefore, this paper proposes a new method of upper limb intention recognition based on the fusion of posture information and visual information and applies it to the field of prosthetic hand control.
2. Materials and Methods
2.1. Data Platform and Acquisition
We used a data glove, model WISEGLOVE7F+, produced by Beijing Xintian Vision Technology Co., Ltd., Beijing, China, to collect upper limb posture angle data. The data glove consists of three inertial measurement units, each of which includes a three-axis accelerometer and a three-axis gyroscope. The sensors collect data at a frequency of 50 Hz, with an accuracy of 0.2 degrees. Image data were collected by a miniature camera, model HD810, produced by Shenzhen Weidafei Technology Co. The micro camera has a 2-megapixel resolution (1920 × 1080) and a focal length of 20–60 mm. The installation positions of the two types of sensors are shown in Figure 1. The three inertial measurement units of the data glove are worn on the upper arm, forearm, and opisthenar, and the micro camera is fixed at the finger sleeve of the middle finger of the data glove. The glove collects the posture angle data during the movement of the upper limb, and the data are directly transmitted to the computer through the serial port. All data analysis is done on the computer. Through the analysis of the posture angle, a control signal is sent to the camera at the appropriate time so that the miniature camera captures an image of the possible target object in front of the finger.
2.2. Sliding Window Method
The upper limb movement data collected by the sensors form a time series, and it is not accurate to judge the upper limb state only from the data of the current sampling point; hence, this paper uses the sliding window method to extract data. Because intention recognition is closely tied to the control of the prosthetic hand, the real-time requirement is high, and the delay caused by the window length needs to be minimized. If the sliding window is too long, it causes serious delay; if it is too short, it is detrimental to the accuracy of the data. Hence, this paper processed the data with window lengths of 5, 10, 15, and 20, respectively. The results show that a window length of 10 has the best effect, so a forward sliding window of length N = 10 is adopted in this paper, where the window length refers to the number of sampling points. Each time the sliding window slides back one point, the first datum in the window is removed, and the value of the sliding window is used as the result of the last sampling point in the window. The calculation is as follows:

y_i = (1/N) Σ_{j = i − N + 1}^{i} x_j

where i represents the serial number of the current sampling point, x_{i − N + 1}, …, x_i are the data in the sliding window, and y_i is the datum extracted by the sliding window.
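The forward sliding window above amounts to a moving average assigned to the last point in each window. A minimal Python sketch (the function and variable names are our own illustration, not the paper's code):

```python
def sliding_window_average(samples, window=10):
    """Smooth a 1-D time series with a forward sliding window.

    The window value is assigned to the last sampling point in the
    window, so the first (window - 1) points produce no output.
    """
    results = []
    for i in range(window - 1, len(samples)):
        window_data = samples[i - window + 1 : i + 1]  # x_{i-N+1} .. x_i
        results.append(sum(window_data) / window)      # y_i
    return results

# Example: a constant signal is unchanged by the moving average.
smoothed = sliding_window_average([2.0] * 15, window=10)
```

Each new sample shifts the window forward by one point, dropping the oldest datum, exactly as described in the text.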
2.3. Upper Limb Kinematics Modeling
From a mechanical point of view, the upper limb structure is similar to a linkage mechanism. The three parts of the upper arm, forearm, and opisthenar can be regarded as the components of the mechanism, while the shoulder joint, elbow joint, and wrist joint are the motion pairs connecting various motion components. Therefore, this paper simplifies the upper limb into a linkage model, which can be easily analyzed and described by mathematical theory.
After determining a simplified model of the upper limb, it is necessary to establish the corresponding D-H coordinate systems, which are rectangular coordinate systems built according to the D-H convention [23], so that positions can be calculated along the kinematic chain. The upper limb spatial position is a relative position relationship, and the upper limb kinematic model is established to calculate the position of the end of the upper limb relative to the body. Hence, we establish the reference coordinate system at the chest, level with the shoulders. Based on the above criteria, the complete mathematical model of the upper limb established in this paper is shown in Figure 2. The five points O0, O1, O2, O3, and O4 represent the positions of the chest, shoulder, elbow, wrist, and opisthenar, respectively. {S0}, {S1}, {S2}, {S3}, and {S4} denote the base coordinate system at the chest and the shoulder, elbow, wrist, and opisthenar coordinate systems, respectively. L1, L2, L3, and L4 denote the dimensions of each component, that is, the half shoulder width, upper arm length, forearm length, and opisthenar length.
The chest and shoulder are not equipped with inertial sensors, so {S0} and {S1} are fixed coordinate systems. The elbow coordinate system is parallel to the upper arm inertial measurement unit coordinate system, so the change in the attitude angle of the upper arm inertial measurement unit coordinate system is the change in the attitude angle of the elbow coordinate system. Similarly, the changes in the forearm and opisthenar inertial measurement unit coordinate systems represent the changes in the posture angles of the wrist coordinate system and the opisthenar coordinate system, respectively.
To solve the transformation relationship between the opisthenar coordinate system and the upper limb base coordinate system, take the transformation between the shoulder coordinate system and the base coordinate system as an example. The position of the shoulder point O1 in the {S0} coordinate system is solved as

^0P_1 = ^0T_1 · ^1P_1

where ^0P_1 is the vector coordinate of O1 with respect to the {S0} coordinate system, ^1P_1 is its coordinate in {S1}, and ^0T_1 is the homogeneous transformation matrix from {S1} to {S0}, which represents the translation and rotation relationship between the two coordinate systems.
The homogeneous transformation matrix has the form

^(i−1)T_i = [ R_i  p_i ]
            [ 0    1  ]

where R_i describes the rotation from coordinate system {S_i} to coordinate system {S_(i−1)} and is determined by the rotation angles between the coordinate systems. Since the shoulder coordinate system is fixed, its R_i is constant; for the elbow, wrist, and opisthenar coordinate systems, R_i is determined by the attitude angles collected by the inertial sensors. The vector p_i describes the translation between the coordinate systems, i.e., the length of the corresponding link.
From {S0} to {S4}, there are four coordinate system transformations. According to the forward kinematics theory of mechanical arms, the forward kinematics equation of the upper limb can be obtained by computing the four homogeneous transformation matrices and multiplying them successively:

^0P_4 = ^0T_1 · ^1T_2 · ^2T_3 · ^3T_4 · ^4P_4

where ^0P_4 denotes the vector coordinate of the opisthenar relative to the base coordinate system {S0}, that is, the position of the end of the upper limb relative to the human chest.
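The successive multiplication of homogeneous transforms can be sketched in pure Python. This is our own illustration rather than the paper's implementation: for brevity it assumes a planar chain whose joints all rotate about the z-axis, with hypothetical link lengths.

```python
import math

def rot_z(theta):
    """4x4 homogeneous transform: rotation about z by theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0, 0],
            [s,  c, 0, 0],
            [0,  0, 1, 0],
            [0,  0, 0, 1]]

def trans_x(length):
    """4x4 homogeneous transform: translation along x by a link length."""
    return [[1, 0, 0, length],
            [0, 1, 0, 0],
            [0, 0, 1, 0],
            [0, 0, 0, 1]]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def end_position(link_lengths, joint_angles):
    """Multiply the per-joint transforms successively (0T1 * 1T2 * ...)
    and return the end point expressed in the base frame."""
    t = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for length, angle in zip(link_lengths, joint_angles):
        t = mat_mul(t, mat_mul(rot_z(angle), trans_x(length)))
    # The end point is the origin of the last frame: the translation column.
    return (t[0][3], t[1][3], t[2][3])

# Two links of 0.3 m with both joints at 0 rad: the end lies 0.6 m along x.
pos = end_position([0.3, 0.3], [0.0, 0.0])
```

In the paper's full model, each R_i would be built from the three attitude angles reported by the corresponding inertial measurement unit rather than a single z-rotation.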
2.4. Upper Limb Position Prediction Based on Multilayer Perceptron
During continuous movement of the upper limb, the end position changes constantly, and we only need to focus on the positions where the manipulator may open or close. Usually, we need to control the opening and closing of the manipulator when reaching in front of the torso to grasp or put down objects and when putting on or taking off objects near the body. Therefore, this paper divides the upper limb end positions into three categories, torso front, upper body nearby, and the initial position, as shown in Figure 3. Among them, torso front refers to the space in the direction of the face, far from the torso; upper body nearby refers to the shoulders and the area of the body above them; the initial position is the area the hand can reach when the arm hangs naturally.
In this paper, a multilayer perceptron is used to classify the upper limb end position, which is described as follows:

y = W_2 · f(W_1 x + b_1) + b_2

where x is a set of input sample data, W_1 and W_2 are the weight matrices acting between the input layer and the hidden layer and between the hidden layer and the output layer, respectively, b_1 and b_2 are bias terms, and f is the rectified linear unit used as the activation function in this model:

f(z) = max(0, z)
The upper limb end positions were calculated in Section 2.3. The position data are a group of three-dimensional coordinates, so the number of input neurons of the multilayer perceptron is set to 3. The model has 2 hidden layers, with 8 and 6 neurons, respectively. The number of neurons in the output layer corresponds to the number of categories and is set to 3.
Figure 4 shows the structure of the multilayer perceptron model. Finally, the model output is normalized using the Softmax function:

p_k = exp(z_k) / Σ_{j=1}^{3} exp(z_j)

where p_k is the probability that the input sample is classified into the k-th class, and z_k is the output value of the corresponding output-layer neuron. The cross-entropy function is chosen as the loss function, and the multilayer perceptron model is trained by minimizing it using the stochastic gradient descent algorithm.
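The forward pass and Softmax normalization above can be illustrated with a toy network in plain Python. The weights and layer sizes below are made up for demonstration; the paper's trained model uses two hidden layers of 8 and 6 neurons.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def affine(w, x, b):
    """Compute W x + b for a weight matrix W (one row per output neuron)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def softmax(z):
    m = max(z)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def mlp_forward(x, w1, b1, w2, b2):
    """y = W2 * f(W1 x + b1) + b2, normalized with Softmax."""
    hidden = relu(affine(w1, x, b1))
    return softmax(affine(w2, hidden, b2))

# Toy 3-input / 2-hidden / 3-output network with made-up weights.
w1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b1 = [0.0, 0.1]
w2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b2 = [0.0, 0.0, 0.0]
probs = mlp_forward([0.2, -0.1, 0.4], w1, b1, w2, b2)
```

The Softmax output forms a probability distribution over the three position classes, which is what the cross-entropy loss is computed against during training.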
2.5. Object Detection Based on YOLOv5
The implementation of the target detection function in our system is important, and the target information captured by the camera fixed to the data glove is one of the important types of feedback for controlling the movement of the manipulator. We focus on upper limb intention recognition for people with upper limb disabilities and work on applying this approach to prosthetic systems. The selected target detection and recognition algorithm should meet the requirements of real-time operation and high robustness in order to improve the user experience in practical applications. Traditional target detection and recognition methods are often executed in multiple steps: region selection, feature extraction, and classification [24]. Region selection localizes the target, and this step often traverses the entire image using sliding windows of multiple sizes. This is a reliable but inefficient method whose time complexity is too high to meet our real-time requirements. Additionally, commonly used handcrafted feature extraction methods, such as HOG [25], are not very robust to changes in target diversity. Moreover, two-stage neural-network-based methods [26] also have difficulty meeting the real-time requirements, so an end-to-end, one-stage deep learning scheme is considered necessary.
The current state-of-the-art one-stage solution is considered to be the YOLO series [27]. The YOLO models are characterized by both high accuracy and fast recognition speed. As the latest version of this series, YOLOv5 has four submodels: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, which differ in their depth and width settings. Among them, YOLOv5l is the recommended standard version. Although the class-average accuracy (mAP@0.5) of the YOLOv5m model is 1.4% lower than that of YOLOv5l, its single-image inference speed is much faster; it is therefore the model whose balance of detection speed and accuracy best meets our needs.
We use the YOLOv5m model to process key frames captured by the camera rather than processing the entire video stream continuously. This approach is sufficient for intent recognition while improving resource efficiency for deployment on wearable devices. The capture of key frames is controlled by gating signals sent by the MCU, which is responsible for controlling the camera and communicating with the computer. The motion state of the upper limb is determined by analyzing the sensor data captured by the inertial measurement units (described in Section 2.6). When the upper limb changes from motion to rest, the MCU is commanded to send a low-level signal to the camera, which triggers it to capture the current image. This image is then fed to the YOLOv5m model for processing.
2.6. Information Fusion Decision-Making Method
Firstly, the upper limb action curves of a healthy person are used as the research object to analyze the upper limb action intention. Figure 5 shows the angular velocity curves of a healthy person performing the drinking action and the glasses-wearing action, respectively. The angular velocity values were obtained by differentiating the posture angle data. The angular velocity fluctuates greatly when the upper limb moves and is close to 0 when the upper limb stops moving and is at rest. In Figure 5a, the two marked segments are the periods when the human hand picks up the cup and puts it down. In Figure 5b, the two marked segments are the periods when the human hand picks up the glasses and puts them on. During these four periods, the upper limb is in a static state. For stable control, the hand usually grasps an object only after the upper limb is at rest. Therefore, the dynamic or static state of the upper limb is one of the important bases for recognizing the intention of the upper limb movement.
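Differentiating the posture angles and thresholding the result gives a simple static/dynamic detector. The sketch below is our own illustration: the 1.0 deg/s threshold is an assumed value, not one reported in the paper, and the 0.02 s step matches the 50 Hz sampling rate.

```python
def angular_velocity(angles, dt=0.02):
    """Differentiate posture angles (degrees) sampled at 50 Hz (dt = 0.02 s)."""
    return [(a2 - a1) / dt for a1, a2 in zip(angles, angles[1:])]

def is_static(velocities, threshold=1.0):
    """Treat the limb as static when all angular velocities (deg/s) stay
    below a small threshold; 1.0 deg/s is an assumed value for illustration."""
    return all(abs(v) < threshold for v in velocities)

# A nearly flat angle trace yields near-zero angular velocity -> static.
static = is_static(angular_velocity([30.0, 30.0, 30.01, 30.0]))
```

In practice the check would run over the sliding-window-smoothed angles of all three inertial measurement units rather than a single trace.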
The actions that human upper limbs perform on common objects can be broken down into two stages: grasping and the subsequent process. The type of target usually determines the intention of the upper limb action [28]. In this paper, the target objects are divided into two categories according to their uses: wearable and non-wearable objects. Wearable objects can be worn on or removed from the human body. When the target object is wearable and the hand is positioned close to the body, the manipulator needs to open or close to put on or take off the object. When the target object is non-wearable and the hand is positioned close to the body, the current process is only a transition phase of the whole upper limb action, and the manipulator does not need to operate. When the hand is positioned in front of the torso and the upper limb is stationary, it is generally necessary to grasp or put down an object, at which time the manipulator should execute an open or close operation. According to the above description, this paper uses the upper limb motion state, the target object type, and the upper limb end position to jointly decide the intention of the upper limb action; the specific logic is shown in Figure 6.
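This three-way fusion rule can be written as a small decision function. The function and category names below are our own illustration of the logic, not the paper's implementation:

```python
def decide_action(limb_static, end_position, object_type):
    """Fuse motion state, end position, and object type into a command.

    limb_static: True when the angular velocity is near zero.
    end_position: 'torso_front', 'upper_body', or 'initial'.
    object_type: 'wearable' or 'non_wearable'.
    Returns 'open_or_close' when the prosthetic hand should act,
    otherwise 'hold' (no operation).
    """
    if not limb_static:
        return 'hold'                      # act only when the limb is at rest
    if end_position == 'torso_front':
        return 'open_or_close'             # grasp or put down an object
    if end_position == 'upper_body':
        # Put on / take off applies only to wearable objects; a non-wearable
        # object near the body is just a transition phase of the action.
        return 'open_or_close' if object_type == 'wearable' else 'hold'
    return 'hold'                          # initial position: no operation

cmd = decide_action(True, 'upper_body', 'wearable')
```

Encoding the rules this way makes each branch of Figure 6 individually testable.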
Figure 7 shows a complete description of the prosthetic hand control system based on upper limb action intention recognition. The data glove is worn on the human upper limb to collect attitude angle data during movement. The pre-processed attitude angle data are differentiated to obtain angular velocity values for judging the state of the upper limb and also serve as the input of the upper limb kinematics model to calculate the position of the upper limb end. The end position is classified into three categories by the multilayer perceptron model. The miniature camera captures the target object image, which is classified into one of two categories after its specific type is detected by the YOLOv5 model. Then, the action intention of the upper limb is determined by the combination of the upper limb motion state, the target object type, and the upper limb end position to achieve manipulator control.
3. Results and Discussion
3.1. Dataset
Two datasets are used in this paper: a coordinate dataset used to train a multilayer perceptron to classify upper limb end positions and a picture dataset used to train the YOLOv5 model.
Twenty volunteers were invited to participate in the experiment, including ten females and ten males, ranging in age from 18 to 40 years old. All of them were healthy individuals who could accurately control their body movements. Each volunteer wore the data glove on the right (dominant) upper limb and performed the actions of grasping or putting down objects in front of the torso and putting on or taking off objects near the upper body; each action was repeated 20 times. When the hand opens or closes, the upper limb remains stationary for a moment. At this time, the attitude angles of the upper limb are transmitted to the computer and converted into position information through the upper limb kinematics model, that is, the three-dimensional coordinates of the end of the upper limb in the chest coordinate system. The final coordinate dataset consisted of 22,500 data points and was divided into training and test sets at a 9:1 ratio.
A 2-megapixel camera was used to collect 3000 pictures of each item under different lighting conditions and angles as the image dataset. XML files recording the category and location information were generated by annotating the images with the LabelImg tool (an open-source image annotation tool), and the images were divided into training and test sets at an 8:2 ratio.
3.2. Experimental Equipment
The experimental equipment is shown in Figure 8 and includes four parts: a data glove, a miniature camera, an MCU (microcontroller unit), and a mechanical prosthetic hand. The data glove collects the posture angles of the upper arm, forearm, and opisthenar during upper limb movement; the miniature camera is fixed at the finger of the manipulator and connected to the MCU, which triggers the camera to capture images with a low-level signal. The mechanical prosthetic hand is also connected to the MCU. After the computer analyzes the data, it sends instructions to the MCU to control the opening or closing of the manipulator.
3.3. Upper Limb Model Validation
In this section, experiments are designed to verify the correctness of the upper limb kinematics model. The experimental method is to preset four fixed trajectories, then let the upper limb end move smoothly along the preset trajectories, and, finally, compare the actual motion trajectories with the preset trajectories to complete the verification. The four preset trajectories and the experimental process are shown in Figure 9.
Each trajectory is described separately below:
- (1) Horizontal trajectory: The end of the upper limb draws a horizontal line. In the Y-Z plane, the end of the upper limb takes the shoulder as the origin and moves 25 cm in the direction of increasing Y-axis coordinate values.
- (2) Vertical trajectory: The end of the upper limb draws a vertical line. In the Y-Z plane, the end of the upper limb takes the shoulder as the origin and moves 25 cm in the direction of increasing Z-axis coordinate values.
- (3) 45° oblique trajectory: The end of the upper limb draws a 45° oblique line. In the Y-Z plane, the end of the upper limb takes the shoulder as the origin and moves 25 cm in the 45° direction.
- (4) Half-circle trajectory: The end of the upper limb draws a half circle in the Y-Z plane with the shoulder as the center and a radius of 65 cm; the start point is (0 cm, −65 cm), and the end point is (0 cm, 65 cm).
We invited ten volunteers to perform each of the above four trajectories three times; the comparison between the preset trajectories and the trajectories solved by the method in this paper is shown in Figure 10.
From Figure 10, it can be observed that the solved results of the three experiments for each trajectory differ slightly from the preset values, but the trend of the solved trajectory is basically consistent with the preset trajectory, which confirms the correctness of the upper limb model and the position solution method in this paper. In terms of accuracy, this paper places more emphasis on the accuracy of the solution at key position points. The error analysis of the start and end points of the above four trajectories is used to verify the accuracy of the position solution, and the error formula is:

E = sqrt((x_s − x_p)² + (y_s − y_p)² + (z_s − z_p)²)

where (x_s, y_s, z_s) are the coordinates of the solved position and (x_p, y_p, z_p) are the coordinates of the preset trajectory. The formula is essentially the linear distance between the solved position and the preset position, used as the representative value of the error. The error values of the four trajectories performed by each of the ten volunteers were averaged, and the results are shown in Table 1.
As can be observed from Table 1, the error values of the experimental solutions are all within 7 cm, which is within an acceptable range; this shows that the accuracy of the upper limb position solution is good.
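The error metric above is the straight-line (Euclidean) distance between the solved and preset positions; a one-function sketch (our own naming):

```python
import math

def position_error(solved, preset):
    """Linear (Euclidean) distance between solved and preset 3-D positions."""
    return math.sqrt(sum((s - p) ** 2 for s, p in zip(solved, preset)))

# A 3-4-5 right triangle gives a distance of exactly 5.
err = position_error((3.0, 4.0, 0.0), (0.0, 0.0, 0.0))
```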
3.4. Multilayer Perceptron Model Analysis
In the training stage of the multilayer perceptron model, the learning rate was set to 0.005 and the learning rate decay rate to 0.99. The batch size was set to 200, and a total of 100 epochs were trained. L2 regularization with a coefficient of 0.1 was used to reduce overfitting. To further reduce the risk of overfitting, a dropout layer was added after each of the two hidden layers, randomly deactivating neurons with a probability of 0.2. The trained model was evaluated on the pre-divided coordinate test set, and the prediction results are presented as a confusion matrix in Figure 11. Three evaluation indicators, namely accuracy, F1 score (macro), and the kappa coefficient, are used to evaluate the multi-class classification performance. The indicator scores are shown in Table 2, verifying that the model has good multi-class classification performance.
3.5. YOLOv5 Model Training Analysis
mAP (mean average precision) is the mean of the AP (average precision) over all categories and is the main evaluation indicator for target detection algorithms. The higher the mAP value, the better the detection effect of the target detection model on the dataset. It is calculated as:

mAP = (1/K) Σ_{k=1}^{K} AP_k,   AP = Σ_{n=1}^{N} P(n) ΔR(n)

where K is the number of categories in the detection task, N is the total number of samples in the test set, P(n) is the precision when n samples are detected, and ΔR(n) is the change in recall when the number of detected samples goes from n − 1 to n.
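The AP/mAP computation can be sketched directly from the formula; the precision/recall values below are toy numbers for illustration, not results from the paper.

```python
def average_precision(precisions, recalls):
    """AP as the sum of precision times the recall increment,
    matching the formula above; recalls must be non-decreasing."""
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)   # P(n) * delta R(n)
        prev_recall = r
    return ap

def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Toy precision/recall points for one class: AP = 1.0*0.5 + 0.5*0.5 = 0.75.
ap = average_precision([1.0, 0.5], [0.5, 1.0])
```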
In this paper, the YOLOv5 learning platform was built in a Linux environment. The test hardware was an NVIDIA GeForce RTX 3080 GPU with 16 GB of memory; the CUDA version was 11.3, and the cuDNN version was 8.2.1. The YOLOv5m model was adopted, with a batch size of 30, trained for 100 epochs. Finally, the mAP of the model reached 0.9601, indicating that the detection effect of the YOLOv5m model is good.
3.6. Upper Limb Action Intention Recognition Method Validation
In this section, experiments are designed to verify the feasibility of the upper limb action intention recognition method. We invited 10 volunteers, including 5 men and 5 women, aged 18 to 40 years. The experiment included seven daily-life actions, and each volunteer performed each action 20 times. The volunteers wore the data glove and held the manipulator to simulate the situation of patients with upper limb disabilities wearing prosthetic hands.
The experimental process is as follows: First, the user moves the manipulator toward the target object, and the miniature camera on the manipulator collects an image of the target object. After the image is collected successfully, the upper limb action intention recognition method on the PC is triggered, and the PC sends control instructions to the manipulator through the MCU. In each experiment, the upper limb end position recognition and target object recognition results were recorded on the PC.
As shown in Figure 12, rows 1 to 7 correspond to the action flows of drinking water, combing hair, answering the phone, putting on a hat, putting on glasses, taking off glasses, and moving a cup, respectively. The first column corresponds to the action preparation state, and the second column corresponds to the pictures collected by the micro camera on the hand of the manipulator.
The criterion for success is that the position of the end of the upper limb and the type of the target object are both identified correctly during the experiment and the experimental process is completed in full. The experimental results are shown in Table 3.
Each action was performed 200 times. The recognition accuracy of the upper limb end position reached 100%. Due to the uncertainty of the hand position during image acquisition, the field of view of the camera installed on the hand may contain only a small part of the target object, resulting in occasional target object recognition failures; the success rate of target object detection was 92.4%. Consequently, the overall success rate of the upper limb action intention recognition experiment was 92.4%, which verifies the feasibility and generality of the proposed method.
4. Conclusions
In this paper, we proposed an upper limb action intention recognition method based on the fusion of posture information and visual information to solve the problem of action intention recognition from a new perspective. We collected attitude angle data during upper limb movement to determine the motion state of the upper limb. We used forward kinematics theory to build an easy-to-analyze mathematical model of the human upper limb to obtain the end position of the upper limb and designed experiments to verify the correctness of the model. We then used a multilayer perceptron model to classify the end positions into three categories, with a classification accuracy of 95.78%. In addition, we mounted a miniature camera on the hand to obtain image information of the target object and used the YOLOv5 model to classify the object; the trained YOLOv5 model had good recognition performance, with an mAP of 0.9601. According to the purpose of the target object, the objects recognized by the YOLOv5 model were further classified into wearable and non-wearable objects. Finally, the operation of the mechanical prosthetic hand was jointly decided by the upper limb motion state, the upper limb end position, and the target object type. The intention recognition method was applied to the control of a mechanical prosthetic hand, and experiments on seven actions were completed, including drinking water, combing hair, answering the phone, putting on a hat, putting on glasses, taking off glasses, and moving a cup. The success rate of the experiments reached 92.4%, which shows that the intention recognition method proposed in this paper is feasible and performs well.
Although the method designed in this paper has good results for upper limb action intention recognition and mechanical prosthetic hand control, there are still some problems that need further improvement and analysis. In this paper, the micro camera is installed on the finger of the mechanical prosthetic hand to obtain the image of the target object, and the position of the prosthetic hand, when grasping the object, has higher requirements to ensure that the camera has a good field of vision. There is a burden on the user to ensure that the object is in the center of the camera’s field of view. Therefore, future research will consider the use of a global camera that can ensure that the target object is in the camera’s field of view in the first-person perspective and can obtain more abundant information about the environment, the state of the prosthetic hand, the state of the upper limb, and so on.