Abstract
To improve the effect of e-sports training, this paper combines intelligent gesture recognition technology to construct an e-sports training system and judges the training effect of players by recognizing their gestures. The paper reviews commonly used feature extraction algorithms and proposes an improved SLC-Harris feature extraction algorithm, whose feasibility is verified by experimental results on the EuRoC data set. In addition, the KLT optical flow algorithm is used to track the extracted feature points, and the pure visual pose is calculated through epipolar geometry, triangulation, and PnP algorithms. The experimental results show that the proposed e-sports training system based on intelligent gesture recognition is effective to a certain degree.
1. Introduction
E-sports has become a sports competition because it is closely tied to social progress, the development of science and technology, and the spiritual and cultural needs of the people. Although countless people enjoy this high-tech intellectual sport, public opinion has, intentionally or unintentionally, instilled a harmful image of e-sports. Some media have reported extensively on students who became addicted to games and wasted their youth and studies, so that e-sports came to be denounced as “electronic heroin.” This pressure from public opinion threatens the survival of e-sports and makes it difficult for enterprises to enter the market legitimately. Moreover, athletes can only be called “players,” and their treatment cannot be compared with that of ordinary athletes, while most fans can only engage in e-sports in private. Facing such pressure, the government finds it difficult to guide and supervise the industry with confidence and sometimes simply resorts to prohibition. The ban on television broadcasting of e-sports competitions is a major obstacle to the normal development of the e-sports industry and reflects the current social discrimination against it.
Generally speaking, the development of e-sports is still in its infancy [1], which is manifested in many aspects: public recognition is insufficient, there are few large-scale events, there is no professional-scale operation, and there is little research in this area [2]. On college campuses in particular, although students have more self-managed time than before, schools do not pay enough attention to e-sports, and there is no relatively formal organization and management of participants, which has led to a considerable waste of human resources [3].
To follow the trend of e-sports development, vigorously develop the e-sports business, improve the overall level of e-sports, and enable e-sports activities to develop well in colleges and universities, the primary task at present is to deepen the understanding of the characteristics of students who participate in e-sports activities [4]. In particular, analyzing the current situation, development trend, and significance of participation in e-sports in colleges and universities is important for discovering the problems in the development of campus e-sports and putting forward reasonable suggestions [5].
As an emerging sport, e-sports is played mainly by the younger generation, and its participants are becoming increasingly young. E-sports can exercise people’s thinking ability, resistance to psychological pressure, unity and cooperation, hand-eye coordination, and so on. Participation can also give the younger generation an awareness of abiding by rules [6], cultivate a competitive spirit that is fair and open, never admits defeat, and strives to be stronger, and have a positive impact on participants’ lives. Many colleges and universities have successively opened e-sports majors. Although e-sports is popular around the world, research and guiding theories on how to cultivate e-sports talent are rare [7].
Different scholars have different views on the attributes and characteristics of e-sports. Literature [8] proposed that “e-sports include three basic characteristics: one is electronics, the other is competitive sports, and the third is a confrontation between people. At the same time, e-sports sports are divided into virtualized e-sports sports and fictionalized sports.” Literature [9] pointed out that “the most fundamental characteristics of video games that distinguish them from other artificial games are: virtual environment, absence of the body and artificial intelligence,” emphasizing the main position of electronic communication technology in e-sports. The scholar Yang Fang believes that “e-sports should return to the essence of games, and games to competitive sports are based on the evolution trend of play-game-competitive sports” and, based on the development process of traditional competitive sports, puts forward a plan for the development of e-sports. Jia Peng and Yao Jiaxin believe that e-sports has the following major characteristics: the diversity of functional structure requirements, the full expansion of self-awareness, the complexity of sports information pattern recognition, the agility of information processing, and the accuracy of intuitive thinking and decision-making. Their analysis clarifies the various attributes of e-sports from many aspects [10].
The discussion on the attributes of e-sports is still going on. Based on the current research, it can be determined that the two essential attributes of e-sports are electronic interaction and confrontational competition. Without electronic interaction, e-sports becomes a traditional competitive sport; without confrontational competition, it becomes an ordinary video game, so the two attributes are interdependent and indispensable. With the development of electronic interaction technology, various forms of e-sports have emerged [11].
Event services are mainly engaged in e-sports referees, coaches, club operation and management, game commentary, data and tactical analysis, and so on. Practitioners need to have data analysis capabilities, management capabilities, and commentary capabilities. The production and broadcast of the event include content production and external dissemination, mainly involving the design of live content and promotion plans, venue layout, equipment debugging, video data collection, postprocessing, background data analysis, and so on. The practitioners should have journalism, communication, broadcasting, TV technology, and other related professional abilities [12].
Since the e-sports industry is an emerging industry, most employees are not from e-sports majors and have not received a complete and systematic theoretical education in e-sports, and nearly 90% of the employees believe that the e-sports industry needs prejob training [13]. Judging from the current state of the industry, working for game manufacturers is undoubtedly the most attractive option, but it is difficult for game manufacturers to absorb more human resources without major business adjustments. Therefore, the need to train practitioners in the support organizations around e-sports events becomes more obvious [14]. For example, training content production capabilities (reporters, screenwriters, copywriters, and anchors) requires a professional background in journalism and communication; training event support capabilities (coaches, data analysts, nutritionists, and brokers) requires a professional background in sports and information technology; and training public relations and marketing capabilities (products, business, brand marketing, and media) requires a professional background in marketing and management [15].
E-sports self-media is still media: practitioners must be able to report news or dig deep into a vertical field, such as specializing in video commentary of games, game clearance strategies, or sharing game skills. After all, hot spots can bring traffic. We-media is a personalized form of media with social attributes; it communicates with users and has its own distinct character orientation [16]. To run a self-media outlet, one should also have strong analytical skills and be able to interpret a topic or special event from a unique or professional perspective.
1.1. Current E-Sports Professional Ability Training Pathways
Most training institutions in society position themselves as training professional players but basically lack training resources. They do not have coaches, data analysts, or club managers, and it is difficult for their trainees to find a suitable position in the e-sports circle. Rather than cultivating professional skills, such institutions aim to make money from e-sports hot spots and have neither the intention nor the ability to contribute to the development of the e-sports industry [17].
At present, the main e-sports talents are cultivated by e-sports companies and e-sports clubs. The club mainly trains professional players, coaches, and data analysts in order to achieve better results in the league. Game companies train referees, game developers, commentators, and other related talents to ensure the healthy development of the e-sports industry [18]. An analysis of the revenue structure of the e-sports industry can help us see the e-sports industry more transparently. The truly profitable institutions are still game manufacturers, which continuously create market value through development and operation. In the context of the continuous development and popularization of the video game industry, competition has become a starting point for expanding influence and creating new commercial value. The comprehensive development of competitive value is inseparable from the promotion of surrounding formats, and new jobs such as video, live broadcast, and commentary emerge in an endless stream [19].
This paper combines the intelligent gesture recognition technology to construct an e-sports training system and uses the player’s gesture recognition to judge the player’s training effect to improve the e-sports training effect.
2. Intelligent Gesture Recognition
2.1. Gesture Intelligent Positioning
The structural framework of the gesture autonomous localization algorithm is shown in Figure 1.

Monocular visual-inertial odometry uses only the camera in the front end for motion estimation. The algorithm first extracts features from the images collected by the camera, then tracks the feature points with the optical flow method, and finally applies PnP (Perspective-n-Point) to the tracked feature points for motion estimation. Mismatched point pairs are then eliminated with random sample consensus (RANSAC), and the pose is refined by nonlinear optimization. The front-end process is shown in Figure 2.

2.2. SLC-Harris Feature Extraction
The feature is the digital expression of the object in the image, and the image can be quantitatively analyzed by extracting the feature. Commonly used feature extraction methods mainly include SIFT algorithm, SURF algorithm, and ORB algorithm.
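The traditional Harris algorithm calculates the angular responsivity (corner response) $R$ from the matrix $M$, which is the weighted sum over all pixels in the window $W$ of the squared gradients and the gradient products:
$$M=\sum_{(x,y)\in W} w(x,y)\begin{bmatrix} I_x^2 & I_xI_y\\ I_xI_y & I_y^2\end{bmatrix},\qquad R=\det(M)-k\,\bigl(\operatorname{trace}(M)\bigr)^2. \tag{1}$$
Among them, there are
$$\det(M)=\lambda_1\lambda_2,\qquad \operatorname{trace}(M)=\lambda_1+\lambda_2. \tag{2}$$
In formula (1), $k$ is a constant ranging from 0.04 to 0.06, and both $\lambda_1$ and $\lambda_2$ in formula (2) represent the eigenvalues of $M$.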
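As a minimal sketch of the traditional response in formula (1), the function below sums the squared and multiplied gradients over one small patch and evaluates the determinant-minus-trace expression. The function name `harris_response`, the uniform window weights, and the central-difference gradients are assumptions made for illustration; they are not taken from the paper.

```python
def harris_response(patch, k=0.04):
    """Harris response of one grayscale patch (a list of rows).

    Gradients use central differences on the interior pixels and the
    window weights are uniform (both are illustrative assumptions).
    """
    sxx = syy = sxy = 0.0
    height, width = len(patch), len(patch[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            ix = (patch[y][x + 1] - patch[y][x - 1]) / 2.0
            iy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            sxx += ix * ix          # sum of squared x-gradients
            syy += iy * iy          # sum of squared y-gradients
            sxy += ix * iy          # sum of gradient products
    det = sxx * syy - sxy * sxy     # equals lambda1 * lambda2
    trace = sxx + syy               # equals lambda1 + lambda2
    return det - k * trace * trace


flat = [[1.0] * 5 for _ in range(5)]
edge = [[1.0 if x >= 2 else 0.0 for x in range(5)] for _ in range(5)]
corner = [[1.0 if (x >= 2 and y >= 2) else 0.0 for x in range(5)]
          for y in range(5)]
print(harris_response(flat), harris_response(edge), harris_response(corner))
```

With these synthetic patches the response is zero on the flat patch, negative on the straight edge, and positive on the corner, which is the behaviour the eigenvalue interpretation predicts.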
For a grayscale image, the value of any point (x, y) in the integral image ii (x, y) refers to the sum of all grayscale values from the upper left corner of the image to the area where this point is located, as shown in Figure 3.

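The sum of the pixels in a rectangular window with top-left corner $(x_1, y_1)$ and bottom-right corner $(x_2, y_2)$ is calculated from four values of the integral image:
$$\sum_{x=x_1}^{x_2}\sum_{y=y_1}^{y_2} I(x,y) = ii(x_2,y_2) - ii(x_1-1,y_2) - ii(x_2,y_1-1) + ii(x_1-1,y_1-1).$$
The most complex calculation in the Harris algorithm is that of the angular responsivity. In the original method, the integration windows of neighboring pixels overlap, so the same gradient terms are summed repeatedly, resulting in high computational complexity. To avoid this, integral images of the gradient products $I_x^2$, $I_y^2$, and $I_xI_y$ are used to speed up the calculation of the angular responsivity: each windowed sum in the response is read from its integral image with the same four-lookup formula.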
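The integral-image idea described above can be sketched as follows. The helper names `integral_image` and `window_sum` are hypothetical, and the table is padded with one extra row and column of zeros so that windows touching the image border need no special case; this padding is an implementation choice, not something stated in the paper.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) table; entry [y+1][x+1] holds the sum of
    img over all pixels with row <= y and column <= x."""
    height, width = len(img), len(img[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def window_sum(ii, x0, y0, x1, y1):
    """Sum of the pixels with x0 <= x <= x1 and y0 <= y <= y1,
    obtained from four lookups regardless of the window size."""
    return ii[y1 + 1][x1 + 1] - ii[y0][x1 + 1] - ii[y1 + 1][x0] + ii[y0][x0]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
table = integral_image(image)
print(window_sum(table, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

Applying the same two functions to images of $I_x^2$, $I_y^2$, and $I_xI_y$ would give the windowed sums needed by the corner response without re-adding overlapping pixels.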
Efficient nonmaximum suppression (E-NMS) is used to efficiently extract unique feature locations for each corner region, and the region thresholds are compared using image patches instead of pixels. The principle is shown in Figure 4.
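The patch-wise comparison can be illustrated with a deliberately simplified block-wise suppression: the response map is divided into square patches and at most one location, the patch maximum, is kept if it exceeds a threshold. This is only a sketch of the general idea under that assumption and is not the E-NMS procedure of the paper; the name `block_nms` and its parameters are hypothetical.

```python
def block_nms(response, block, threshold):
    """Keep at most one (x, y) location per block x block patch:
    the patch maximum, provided it is above the threshold."""
    height, width = len(response), len(response[0])
    corners = []
    for by in range(0, height, block):
        for bx in range(0, width, block):
            best, best_pos = threshold, None
            for y in range(by, min(by + block, height)):
                for x in range(bx, min(bx + block, width)):
                    if response[y][x] > best:
                        best, best_pos = response[y][x], (x, y)
            if best_pos is not None:
                corners.append(best_pos)
    return corners


response_map = [[0, 1, 0, 0],
                [5, 2, 0, 0],
                [0, 0, 9, 8],
                [0, 0, 7, 3]]
print(block_nms(response_map, 2, 0.5))  # [(0, 1), (2, 2)]
```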

2.3. KLT Optical Flow Tracking
After the key points are extracted, the optical flow method tracks them by establishing an error model and minimizing the photometric error. This method needs neither descriptors nor feature point matching, which greatly reduces the amount of computation.
The basic idea of LK optical flow is to assume that the optical flow in the local neighborhood of a pixel is invariant, and based on this assumption, construct a least-squares problem about the optical flow of the neighborhood pixels.
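First, it is assumed that the light intensity of a pixel is constant across frames. Accordingly, for the pixel located at $(x, y)$ at time $t$ that moves to $(x + dx, y + dy)$ at time $t + dt$, we have
$$I(x+dx,\,y+dy,\,t+dt) = I(x,\,y,\,t). \tag{6}$$
Then, according to the second basic assumption of LK optical flow, namely that the displacement of pixels between adjacent images is small, the first-order Taylor expansion of the left side of formula (6) is
$$I(x+dx,\,y+dy,\,t+dt) \approx I(x,y,t) + \frac{\partial I}{\partial x}dx + \frac{\partial I}{\partial y}dy + \frac{\partial I}{\partial t}dt. \tag{7}$$
Combining the above formulas and dividing both sides by $dt$, we get
$$\frac{\partial I}{\partial x}\frac{dx}{dt} + \frac{\partial I}{\partial y}\frac{dy}{dt} = -\frac{\partial I}{\partial t}, \tag{8}$$
where $dx/dt$ represents the motion speed of the pixel on the x-axis, $dy/dt$ represents the motion speed of the pixel on the y-axis, and the two speeds are recorded as $u$ and $v$, respectively. At the same time, $\partial I/\partial x$ represents the gradient of the image in the x-axis direction at the pixel, $\partial I/\partial y$ represents the gradient in the y-axis direction, and $\partial I/\partial t$ represents the derivative of the image with respect to time; they are denoted as $I_x$, $I_y$, and $I_t$, respectively. Therefore, formula (8) can be written in matrix form as follows:
$$\begin{bmatrix} I_x & I_y\end{bmatrix}\begin{bmatrix}u\\ v\end{bmatrix} = -I_t. \tag{9}$$
Finally, according to the third basic assumption of the LK optical flow method, namely that adjacent pixels in the same image plane have similar motion, a window of size $w \times w$ is defined. Because all pixels in the window share the same motion, $w^2$ formulas can be listed to form an overdetermined system, and the motion parameters of the center point are obtained by the least squares method:
$$A\begin{bmatrix}u\\ v\end{bmatrix} = -b,\qquad A=\begin{bmatrix}[I_x,\ I_y]_1\\ \vdots\\ [I_x,\ I_y]_{w^2}\end{bmatrix},\qquad b=\begin{bmatrix}I_{t,1}\\ \vdots\\ I_{t,w^2}\end{bmatrix},\qquad \begin{bmatrix}u\\ v\end{bmatrix}^{*} = -\left(A^TA\right)^{-1}A^Tb.$$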
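The least-squares step for one window can be sketched in a few lines. The function below takes the spatial and temporal derivatives of the pixels in the window as three flat lists and solves the 2 x 2 normal equations in closed form; the name `lk_flow` and the list-based interface are assumptions made for illustration.

```python
def lk_flow(ix, iy, it):
    """Least-squares optical flow (u, v) for one window.

    ix, iy, it: per-pixel derivatives I_x, I_y, I_t inside the window.
    Solves (A^T A) [u, v]^T = -A^T b with rows of A equal to [I_x, I_y]
    and entries of b equal to I_t.
    """
    a11 = sum(gx * gx for gx in ix)
    a12 = sum(gx * gy for gx, gy in zip(ix, iy))
    a22 = sum(gy * gy for gy in iy)
    b1 = -sum(gx * gt for gx, gt in zip(ix, it))
    b2 = -sum(gy * gt for gy, gt in zip(iy, it))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        # A flat or purely one-directional texture cannot constrain both u and v.
        raise ValueError("window gradients are degenerate")
    u = (a22 * b1 - a12 * b2) / det
    v = (a11 * b2 - a12 * b1) / det
    return u, v


# Synthetic window whose temporal derivative is consistent with (u, v) = (0.5, -0.25).
ix = [1.0, 0.0, 1.0, 2.0]
iy = [0.0, 1.0, 1.0, 1.0]
it = [-(gx * 0.5 + gy * -0.25) for gx, gy in zip(ix, iy)]
print(lk_flow(ix, iy, it))
```

The degenerate-determinant check reflects why corner-like features are preferred for tracking: only windows with gradients in two directions give a well-conditioned system.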
Each image frame is downsampled layer by layer to build a multilevel pyramid, where $L$ denotes the $L$th layer image.
The algorithm works through the Gaussian pyramid from the top layer down to the bottom layer, and the pixel values near the edge of the image are calculated based on the following formulas:
The camera motion pose is estimated using structure from motion (SFM) in the vision front end. For a monocular camera, the camera pose can be estimated from the geometric relationship between a point in real space and its projections on the imaging planes of two camera positions. As shown in Figure 5, $P$ is any point in three-dimensional space with coordinates $[X, Y, Z]^T$, and $O_1$ and $O_2$ are the optical centers of the two camera positions. $p_1$ and $p_2$ are the projections of $P$ on the imaging planes $I_1$ and $I_2$, respectively. From the pixel positions of matched point pairs such as $p_1$ and $p_2$, the essential matrix $E$ and the fundamental matrix $F$ can be obtained.

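According to the camera imaging model, let $K$ be the camera intrinsic matrix, and let $R$ and $t$ be the rotation matrix and translation vector from plane $I_1$ to plane $I_2$. The following formula can be obtained:
$$s_1 p_1 = K P,\qquad s_2 p_2 = K(RP + t). \tag{13}$$
After homogeneous coordinate transformation and normalization between 2D and 3D, we get
$$x_1 = K^{-1}p_1,\qquad x_2 = K^{-1}p_2,\qquad x_2 \simeq R x_1 + t, \tag{14}$$
where $x_1$ and $x_2$ represent the coordinates of the pixels $p_1$ and $p_2$ on the normalized plane, respectively, and $\simeq$ denotes equality up to scale. The algorithm combines formulas (13) and (14) and left-multiplies by $t^{\wedge}$ and then by $x_2^T$ to obtain the essential matrix $E$ and the fundamental matrix $F$, which can be sorted out as follows:
$$x_2^T\, t^{\wedge} R\, x_1 = 0,\qquad E = t^{\wedge}R,\qquad F = K^{-T} E K^{-1},\qquad x_2^T E\, x_1 = p_2^T F\, p_1 = 0,$$
where $t^{\wedge}$ represents the antisymmetric matrix of $t$.
When eight or more point pairs such as $p_1$ and $p_2$ are available, the eight-point method can be used to construct a linear system from the simplified formula and solve for $E$, from which $R$ and $t$ are recovered; among the candidate decompositions, the valid one places the points in front of both cameras.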
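To make the epipolar constraint concrete, the sketch below builds the essential matrix from a known relative pose and checks that a correctly matched pair of normalized points gives a residual close to zero, while an arbitrary wrong match does not. It assumes the convention that a point moves between the two camera frames as P2 = R P1 + t; the helper names and the numeric pose are illustrative assumptions, not values from the paper.

```python
import math


def skew(t):
    """Antisymmetric matrix of a 3-vector, so that skew(t) @ v = t x v."""
    x, y, z = t
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def essential(R, t):
    """Essential matrix E = t^ R for the convention P2 = R P1 + t."""
    return matmul(skew(t), R)


def epipolar_residual(E, x1, x2):
    """Value of x2^T E x1; zero for a perfect match."""
    ex1 = matvec(E, x1)
    return sum(x2[i] * ex1[i] for i in range(3))


def normalize(P):
    """Project a 3D point onto the normalized image plane (Z = 1)."""
    return [P[0] / P[2], P[1] / P[2], 1.0]


yaw = math.radians(10.0)
R = [[math.cos(yaw), -math.sin(yaw), 0.0],
     [math.sin(yaw), math.cos(yaw), 0.0],
     [0.0, 0.0, 1.0]]
t = [0.3, -0.1, 0.05]
P1 = [0.5, 0.2, 4.0]
P2 = [r + ti for r, ti in zip(matvec(R, P1), t)]

E = essential(R, t)
print(epipolar_residual(E, normalize(P1), normalize(P2)))   # close to zero
print(epipolar_residual(E, normalize(P1), [0.9, 0.9, 1.0]))  # clearly nonzero
```

In a RANSAC loop, a residual of this kind (or a distance derived from it) is what separates inlier matches from mismatched point pairs.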
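When the monocular camera recovers the pose through the epipolar geometric relationship, the obtained translation is a normalized value that carries no metric scale. To obtain the depth of the feature points, triangulation needs to be introduced. Let $x_1$ and $x_2$ be the normalized coordinates of a matched point pair, let $R$ and $t$ be the relative pose between the two views, and let $s_1$ and $s_2$ represent the depths of the two feature points; we can get
$$s_2 x_2 = s_1 R x_1 + t. \tag{17}$$
The feature point depth values $s_1$ and $s_2$ can be obtained by left-multiplying formula (17) by $x_2^{\wedge}$ or $(Rx_1)^{\wedge}$, respectively, as follows:
$$s_1\, x_2^{\wedge} R x_1 + x_2^{\wedge} t = 0,\qquad s_2\,(Rx_1)^{\wedge} x_2 = (Rx_1)^{\wedge} t.$$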
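A small numerical sketch of this triangulation step is given below, assuming the relation s2 x2 = s1 R x1 + t between the normalized coordinates x1 and x2 of a matched pair. The first depth is found by eliminating s2 with a cross product and solving the resulting overdetermined equation in the least-squares sense; the second depth is then obtained by projecting the reconstructed point onto x2, which is a simplification chosen here rather than a step stated in the paper. The function name `triangulate_depths` is hypothetical.

```python
def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def matvec(m, v):
    return [dot(row, v) for row in m]


def triangulate_depths(R, t, x1, x2):
    """Depths (s1, s2) such that s2 * x2 ~= s1 * R * x1 + t."""
    rx1 = matvec(R, x1)
    a = cross(x2, rx1)      # x2^ R x1
    b = cross(x2, t)        # x2^ t
    denom = dot(a, a)
    if denom < 1e-18:
        # No parallax: the two rays are parallel and depth is unobservable.
        raise ValueError("insufficient parallax for triangulation")
    s1 = -dot(a, b) / denom                       # from s1 * a + b = 0
    p2 = [s1 * r + ti for r, ti in zip(rx1, t)]   # point in the second frame
    s2 = dot(p2, x2) / dot(x2, x2)
    return s1, s2


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [1.0, 0.0, 0.0]
x1 = [0.125, 0.05, 1.0]   # normalized projection of P1 = (0.5, 0.2, 4)
x2 = [0.375, 0.05, 1.0]   # normalized projection of P2 = (1.5, 0.2, 4)
print(triangulate_depths(identity, t, x1, x2))  # both depths close to 4
```

Because the translation from epipolar geometry is only known up to scale, the depths returned by such a routine share that unknown scale until it is resolved by the inertial measurements.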
When the positions of multiple points in space are known, the camera pose can be estimated by the PnP algorithm. Common PnP algorithms include P3P, DLT, and BA optimization. Among them, the P3P algorithm is the most common method. The algorithm needs to know at least three points and their projection points on the camera imaging plane. Then, the camera pose can be estimated by solving the relationship between point pairs according to the similar triangle principle and the cosine theorem. A schematic diagram of the P3P relationship is shown in Figure 6.

The coordinate system convention is as follows: the world coordinate system is represented by $(\cdot)^w$, and $(\cdot)^b$ and $(\cdot)^c$ represent the IMU coordinate system and the camera coordinate system, respectively. The relationship between the coordinate systems is shown in Figure 7. $(\cdot)^v$ represents the visual reference frame in the sliding window, which is independent of the IMU measurement and can be any frame in the visual structure. $(\cdot)^w_b$ represents the transformation from the IMU coordinate system to the world coordinate system; $b_k$ represents the IMU frame of the $k$th image; $(\cdot)^v_c$ represents the transformation from the camera coordinate system to the visual reference system; and $c_k$ represents the camera frame of the $k$th image. $\hat{(\cdot)}$ represents a measured value of the sensor or an estimated value of a parameter; $s$ represents the latest scale parameter of the sliding window; and a rotation can be represented by the rotation matrix $R$ or the quaternion $q$. $g^w$ represents the gravity vector in the world coordinate system, and $g^v$ represents the gravity vector in the visual reference coordinate system.

2.4. IMU Preintegration
The sampling frequency of the camera used in this paper is 20 Hz, and the sampling frequency of the IMU is 200 Hz, so the IMU rate is much higher than the image rate. To avoid repeatedly reintegrating the high-rate IMU data every time the optimized state of a visual frame changes, a preintegration technique is applied to all IMU samples between two image key frames. In this way, the inertial measurements between adjacent image key frames are aggregated into a single relative motion constraint. The principle of preintegration is shown in Figure 8.

In Figure 8, from top to bottom are the time scale line, the number of image frames generated, the number of image key frames generated, the number of IMU samples, and the IMU preintegration value.
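The measurement error of the system is mainly affected by the bias random walk $b$ and white noise $\eta$, and other errors such as Markov processes are ignored. Then, the measurement model of the gyroscope and accelerometer in the IMU can be expressed by the following formula:
$$\hat{\omega} = \omega + b_g + \eta_g,\qquad \hat{a} = a + b_a + R^b_w\, g^w + \eta_a,$$
where $\hat{\omega}$, $\hat{a}$ and $\omega$, $a$ represent the measured values and real values of angular velocity and acceleration, respectively; $b_g$, $b_a$ and $\eta_g$, $\eta_a$ represent the random walk bias and the measurement white noise of angular velocity and acceleration, respectively; $g^w$ is the gravity vector in the world coordinate system; and $R^b_w$ is the rotation matrix from the world coordinate system to the IMU coordinate system.
White noise obeys a zero-mean Gaussian distribution, that is, $\eta_g \sim N(0, \sigma_g^2)$ and $\eta_a \sim N(0, \sigma_a^2)$. The derivative of the random walk bias also obeys a Gaussian distribution, that is, $\dot{b}_g = \eta_{b_g} \sim N(0, \sigma_{b_g}^2)$ and $\dot{b}_a = \eta_{b_a} \sim N(0, \sigma_{b_a}^2)$.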
The differential kinematic formulas for (representing the position, velocity, and rotation expressed in quaternions, respectively) versus time can be written as follows:where represents quaternion multiplication.
Through the above derivative relationship, the position, velocity, and rotation at time k + 1 can be obtained from the position, velocity, and rotation at time k and by integrating the measured values of the IMU over time . The continuous integration formula for PVQ is as follows:
where and represent the acceleration and angular velocity measured in the IMU coordinate system, respectively. represents the time difference from the kth frame to the k + 1 frame. represents the rotation matrix from the world coordinate system to the IMU coordinate system. Because the measured and belong to the IMU coordinate system, in order to transform the IMU measured value to the world coordinate system, the rotation matrix needs to be left-multiplied. means quaternion right multiplication; means antisymmetric matrix in quaternion multiplication means the imaginary part value of quaternion). We assume that the quaternion is ; then we have
By observing the continuous integral formula of PVQ, it can be seen that the current state is recursively obtained from the state of the previous time, and the estimated value is constantly changing. This will cause the IMU measurements to be repropagated, causing the velocity and rotation to be reintegrated after each nonlinear optimization iteration, resulting in a higher computational cost. Therefore, the optimization variables are separated from the IMU preintegration terms of the two key frames, and the rotation matrix of the world coordinate system to the IMU coordinate system can be obtained by simultaneously left-multiplying the left and right sides of the continuous integration formula of PVQ:
The image frames and of two consecutive moments are given, and the linear acceleration and angular velocity are preintegrated in the local coordinate system to obtainwhere represent the relative pose, velocity, and rotation constraints, respectively, and are also the relative motion of to . It can be seen that they are only related to and in and , and they have nothing to do with the initial position and velocity of coordinate system .
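The aggregation of IMU samples between two image frames can be sketched as below. The three accumulated terms are called `alpha`, `beta`, and `gamma` here (relative position, velocity, and rotation), which is an assumed naming, and a simple Euler scheme with constant biases is used for clarity; the paper's exact discretization is not recoverable from the text. Note that neither gravity nor the initial position and velocity enter the computation, which is the property the paragraph above points out.

```python
import math


def quat_mul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


def quat_rotate(q, v):
    """Rotate the 3-vector v by the unit quaternion q."""
    conj = (q[0], -q[1], -q[2], -q[3])
    _, x, y, z = quat_mul(quat_mul(q, (0.0, v[0], v[1], v[2])), conj)
    return (x, y, z)


def delta_quat(omega, dt):
    """Unit quaternion for a rotation at angular rate omega lasting dt."""
    rate = math.sqrt(sum(c * c for c in omega))
    if rate * dt < 1e-12:
        return (1.0, 0.0, 0.0, 0.0)
    s = math.sin(rate * dt / 2.0) / rate
    return (math.cos(rate * dt / 2.0), omega[0] * s, omega[1] * s, omega[2] * s)


def preintegrate(samples, dt, ba=(0.0, 0.0, 0.0), bg=(0.0, 0.0, 0.0)):
    """Accumulate (alpha, beta, gamma) over IMU samples [(acc, gyro), ...],
    all expressed in the IMU frame of the first image."""
    alpha = [0.0, 0.0, 0.0]
    beta = [0.0, 0.0, 0.0]
    gamma = (1.0, 0.0, 0.0, 0.0)
    for acc, gyro in samples:
        a = quat_rotate(gamma, [acc[i] - ba[i] for i in range(3)])
        for i in range(3):
            alpha[i] += beta[i] * dt + 0.5 * a[i] * dt * dt
            beta[i] += a[i] * dt
        gamma = quat_mul(gamma, delta_quat([gyro[i] - bg[i] for i in range(3)], dt))
    return alpha, beta, gamma


# Ten 200 Hz IMU samples span one 20 Hz image interval (0.05 s).
samples = [((1.0, 0.0, 0.0), (0.0, 0.0, 0.0))] * 10
print(preintegrate(samples, 0.005))
```

For the constant 1 m/s² acceleration in the usage example, the accumulated terms match the closed-form values 0.5·a·T² = 0.00125 and a·T = 0.05 with T = 0.05 s.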
Therefore, the preintegration formula is rediscussed, in terms of ; it is related to and of the IMU; and and are also variables that need to be optimized. When the bias change is small, are adjusted according to their first-order approximations to the bias.where and are the block matrices in and and are the block matrices in .
There are errors in the integral values of the IMU at different times, and the errors at time t are mainly related to the measured values of , , and at time t. The following definition is given to represent the error vector:
The derivation is based on the derivative of the error term kinetic formula. First, two concepts are introduced: true and nominal, where true represents the real measurement value containing noise and nominal represents the theoretical value without noise, and represents the measurement error; there are
Among them, there are:
Combining the above formulas, we can obtain
The derivation of is as follows, and according to the formula in the literature, it can be known that
In this paper, according to the noise model and bias, we can get
In summary, the derivative of the IMU measurement error term at time t can be as follows:
We set . The above formula can be simplified to
According to the definition of the derivative, the prediction formula of the mean is as follows:
According to the error value at the current moment, the mean and covariance at the next moment can be predicted. The prediction formula for covariance is as follows:where represents the initial value of the iteration and its value is zero and Q represents the diagonal covariance matrix of the noise term as follows:
According to formula (35), the iterative formula of the error term Jacobian can be obtained as follows:where the iterative initial value of the Jacobian matrix is I.
2.5. Sliding Window Initialization
When the camera extrinsic parameter matrix is known, the pose obtained by the initialization of the monocular camera is transformed from the visual coordinate system to the IMU coordinate system to obtain the following formula:where s is the translation factor obtained by visual initialization, which has no real information.
The pure visual initialization method lacks absolute scale information. Therefore, the value estimated by the visual SFM is aligned with the IMU after preintegration to estimate the true scale. Visual-inertial alignment initialization is mainly to solve the following problems, including the initialization of gyroscope bias, the initialization of velocity, gravitational acceleration, and scale.
The first is to initialize the gyroscope. The gyroscope bias can be obtained from two consecutive key frames with known orientations, considering two consecutive frames and in the sliding window, and and represent the rotations obtained from the pure visual sliding window optimization, respectively. Linearize the IMU preintegration term for the gyroscope bias and minimize the following function:
Among them, there are:
In formula (42), represents all the frames in the window, and the product of the two quaternions indicates that the camera rotates from the kth frame to the k + 1th frame, and the gyroscope rotates from the k + 1th frame to the kth frame, and the optimized objective function is
The algorithm takes into the formula and multiplies the inverse moment ordering of the relative constraints obtained from the preintegration to the left on both sides of formula (40) and obtains by Cholesky decomposition (multiplying the transpose of on both sides of the formula):
In this way, the initial calibration value of the gyroscope bias can be estimated, and then the IMU preintegration terms are corrected with the new gyroscope bias.
The second is the initialization of velocity, gravitational acceleration, and scale. The initialized state vector is as follows:
where the state vector represents the speed of the visual coordinate system of the kth frame image, represents the gravity vector in the visual coordinate system, and s is the estimated scale parameter. To sum up, the dimension of is 3(n + 1) + 3 + 1. The constraint relationship between the scale parameter and the speed of the visual SFM is as follows:where are all obtained from visual SFM. and are mutually inverse matrices. The following linear least squares problem is constructed to complete the initialization of velocity, gravitational acceleration, and scale:
2.6. Monocular Visual Inertial Coupling Nonlinear Optimization
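When coupling the visual constraints and the IMU constraints, the inertial sensor data are introduced first, and the IMU constraints on the state are added to the optimized state vector. Then, nonlinear optimization is performed within a sliding window, and the full state vector of the sliding window is as follows:
$$\mathcal{X} = [x_0, x_1, \ldots, x_n, x^b_c, \lambda_0, \lambda_1, \ldots, \lambda_m],\qquad x_k = [p^w_{b_k}, v^w_{b_k}, q^w_{b_k}, b_a, b_g],\qquad x^b_c = [p^b_c, q^b_c],$$
where $x_k$ represents the state of the IMU when the $k$th image is captured. There are $n + 1$ states in the sliding window, and each state contains the position, velocity, and rotation in the world coordinate system and the IMU biases in the IMU coordinate system. $\lambda_m$ represents the inverse depth of the $m$th 3D point, and $x^b_c$ represents the extrinsic parameter from the camera to the IMU. The objective function is
$$\min_{\mathcal{X}} \Big\{ \|r_p - H_p\mathcal{X}\|^2 + \sum_{k\in\mathcal{B}} \big\|r_{\mathcal{B}}(\hat{z}^{b_k}_{b_{k+1}}, \mathcal{X})\big\|^2_{P} + \sum_{(l,j)\in\mathcal{C}} \rho\Big(\big\|r_{\mathcal{C}}(\hat{z}^{c_j}_{l}, \mathcal{X})\big\|^2_{P}\Big) \Big\},$$
where $\rho(\cdot)$ is the Huber norm, which is commonly defined as follows:
$$\rho(s) = \begin{cases} s, & s \le 1,\\ 2\sqrt{s} - 1, & s > 1.\end{cases}$$
In the objective function (formula (52)), $\|\cdot\|_P$ represents the Mahalanobis distance weighted by the covariance matrix $P$, and $r_p$, $r_{\mathcal{B}}$, and $r_{\mathcal{C}}$ represent the marginalized prior information, the IMU measurement residual, and the visual reprojection error, respectively. $\mathcal{B}$ is the set of all IMU measurement frames, and $\mathcal{C}$ is the set of visual features in the sliding window.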
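For illustration, a commonly used form of the Huber cost applied to a squared residual norm is sketched below. The threshold parameter `delta` and the function name `huber` are assumptions; the paper's own definition is not recoverable from the text, so this should be read as the standard textbook form rather than the authors' exact formula.

```python
import math


def huber(s, delta=1.0):
    """Huber cost of a squared residual norm s >= 0.

    Small residuals are kept quadratic (the cost equals s); large ones grow
    only like sqrt(s), which limits the influence of outlier features.
    """
    if s <= delta * delta:
        return s
    return 2.0 * delta * math.sqrt(s) - delta * delta


for squared_error in (0.25, 1.0, 4.0, 100.0):
    print(squared_error, huber(squared_error))
```

The two branches meet at s = delta², so the cost is continuous; a visual residual of squared norm 100 contributes 19 instead of 100, which is why mismatched features disturb the sliding-window estimate less.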
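According to the Gauss–Newton method, the minimum of the objective function can be calculated incrementally: the residual is linearized around the current estimate, and the increment $\delta\mathcal{X}$ is chosen to minimize
$$\min_{\delta\mathcal{X}} \big\|r(\mathcal{X}) + J\,\delta\mathcal{X}\big\|^2_{P},$$
where $J$ is the Jacobian matrix of the error term with respect to all state vectors $\mathcal{X}$.
The algorithm differentiates the above formula with respect to $\delta\mathcal{X}$ and sets the derivative to 0, so that the formula for the increment can be calculated as follows:
$$J^T P^{-1} J\,\delta\mathcal{X} = -J^T P^{-1}\, r(\mathcal{X}).$$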
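The incremental scheme can be illustrated on a deliberately tiny problem: fitting the single parameter a of the model y = exp(a x) with Gauss-Newton. The model, the data, and the function name `gauss_newton_exp` are assumptions chosen only to keep the example short; with one parameter, the normal equation reduces to a scalar division, and unit covariance is assumed so that no weighting appears.

```python
import math


def gauss_newton_exp(xs, ys, a0=0.0, iterations=20):
    """Estimate a in y = exp(a * x) by Gauss-Newton iterations."""
    a = a0
    for _ in range(iterations):
        jtj = 0.0   # J^T J accumulated over all residuals
        jtr = 0.0   # J^T r accumulated over all residuals
        for x, y in zip(xs, ys):
            prediction = math.exp(a * x)
            r = prediction - y        # residual at the current estimate
            j = x * prediction        # derivative of the residual w.r.t. a
            jtj += j * j
            jtr += j * r
        if jtj == 0.0:
            break                     # no information to update a
        a += -jtr / jtj               # solve (J^T J) delta = -J^T r
    return a


xs = [0.0, 0.5, 1.0, 1.5]
ys = [math.exp(0.5 * x) for x in xs]
print(gauss_newton_exp(xs, ys))  # converges towards 0.5
```

In the sliding-window problem, the same update is solved for a large state vector, with each residual block weighted by the inverse of its covariance.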
The overall incremental equation of the objective function is as follows:where represents the covariance of the IMU preintegrated noise term and represents the visually observed noise covariance. When the noise covariance of the IMU is larger, the inverse of , that is, its information matrix is smaller, which means that the IMU observations are not as reliable as the visual observations. Formula (50) can be simplified towhere , and represent the Hessian matrix. Using the perturbation method to calculate, we can getwhere represents the small disturbance of the state vector instead of the incremental represents the disturbance of the state vector.
The continuous preintegration formula is derived in the IMU preintegration, and the IMU residuals are as follows:
According to the above formula, it can be seen that the optimization variables of the IMU residual mainly include the position, rotation, speed, and inertia bias at the and times:
To calculate the Jacobian matrix, perturbation is added to each optimization variable to obtain
Taking the partial derivatives for the above disturbance variables, respectively, we can get
The visual residual is the reprojection error, that is, the difference between the estimated and observed positions of a feature point in the normalized camera coordinate system. The small camera used in this paper follows the fisheye model with a large viewing angle, so the projection onto the unit sphere needs to be considered, as shown in Figure 9.

Through the unit spherical projection model illustrated in the figure above, the value of the visual residual is decomposed into two directions. The final visual residual model looks like this:where and represent the estimated and observed coordinates of the landmark point 1 in the j-th frame image under the normalized camera coordinate system, respectively. The optimization variables of the visual residual are as follows:where represents the inverse depth value when the landmark point 1 is first observed by the j-th image. The inverse depth is used as the optimization variable because the inverse depth satisfies the Gaussian distribution, and it can reduce the parameter variables in the actual optimization process. According to the above formula, by adding disturbance to each optimization variable, the following Jacobian is obtained:
3. E-Sports Training System Based on Intelligent Gesture Recognition
This paper uses the sensors in a data glove to calibrate the movement of the finger joints. As shown in Figure 10, it mainly considers the distal phalanx (TDP) and the proximal phalanx (TPP) of the thumb, together with the changes at the middle phalanges (MP) and proximal phalanges (PP) of the remaining four fingers.

This paper uses the algorithms described in Section 2 to construct the e-sports training system, and the overall framework of the system is shown in Figure 11.

The proposed system is simulated on the MATLAB platform, and both its gesture recognition effect and its application effect in e-sports training are evaluated. The results are shown in Tables 1 and 2.
The above results indicate that the e-sports training system based on intelligent gesture recognition proposed in this paper is effective to a certain degree.
4. Conclusion
As an emerging sport, e-sports is played mainly by the younger generation, and its participants are becoming increasingly young. E-sports can exercise people’s thinking ability, resistance to psychological pressure, unity and cooperation, hand-eye coordination, and so on. Moreover, participation can give the younger generation an awareness of abiding by rules, cultivate a competitive spirit that is fair and open, never admits defeat, and strives to be stronger, and have a positive impact on participants’ lives. This paper combines intelligent gesture recognition technology to construct an e-sports training system and judges the training effect of players through gesture recognition. The research shows that the e-sports training system based on intelligent gesture recognition proposed in this paper is effective to a certain degree.
Data Availability
The labeled data set used to support the findings of this study is available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
Acknowledgments
This study was sponsored by Shandong Sport University.