Abstract
Research on singing assessment has so far focused on analyzing singer characteristics rather than on building automated evaluation approaches. Although many methods exist for evaluating the accuracy of a singer's performance, traditional methods mostly rely on quantitative analysis, and their results show obvious fuzziness and relativity. This paper therefore proposes a method for evaluating the accuracy of vocal art expression in singers' voices based on a sensor spatial localization algorithm. An adaptive cascade retrieval control approach is used to locate and mine singers' performances according to their voice characteristics, and the sensor spatial localization algorithm, applied to a massive vocal music resource library, locates the vocal sound sources in the mining results with high precision. The singers' vocal works are received through a vocal music preprocessing platform, which sends, receives, and controls musical compositions. An adaptive frequency-shift filter and the wavelet transform method are then used to filter noise out of the music. Finally, the accuracy of the singer's vocal artistic performance is evaluated through music beat detection, note boundary accuracy analysis, and voiced-segment accuracy detection. The experimental results show that the proposed method can reduce sound source localization error and can effectively filter out noise in the music signal.
1. Introduction
The main value of vocal music as a performing art is that it provides society with aesthetic objects that can be appreciated to meet people's spiritual needs. When people take part in vocal music appreciation, they participate as aesthetic subjects of flesh and soul, and their internal aesthetic feelings arise spontaneously in the "collision" between aesthetic subject and object [1]. Aesthetic feeling is a complex thinking process that refers mainly to the emotional attitude and emotional experience produced; it includes rational judgment and analysis as well as psychological processes such as auditory perception, association, and imagination. When this understanding matures from sensibility to rationality and is measured and expressed according to certain standards, it constitutes the aesthetic evaluation of song singing. Aesthetic evaluation alone is only a kind of cognitive identification and judgment: it can support theoretical evaluation, but that remains a subjective exposition of perceived skill quality. To meet the operational requirements of vocal music competitions, this qualitative analysis must be transformed into quantitative analysis so that the "application value" of vocal music aesthetic evaluation can be realized in this field [2].
There are few methods for evaluating the accuracy of vocal music artistic expression; several common ones are analyzed below. The first is a vocal music evaluation method based on feature comparison. This method extracts and tests characteristic parameters such as sound intensity during the evaluation process; the pitch, breath, and related source sound are then evaluated with a feature matching method, and a scoring algorithm gives an objective score based on similarity [3]. Building a multimedia objective evaluation system for vocal music on this basis improves the intuitiveness and interactivity of vocal music teaching, and simulation results show that the algorithm's complexity is low and its scores are consistent with people's subjective impressions. The second technique uses objective assessment software to evaluate the sound of a singing performance. It first investigates the universal norms of subjective timbre evaluation in vocal singing, then examines how these criteria can be quantified and encoded so that they can be entered into an intelligent external validation system. Test findings suggest that with this approach, intelligent assessment of singing performance can satisfy subjective evaluation standards, which also promotes the accuracy and usability of intelligent vocal music assessment systems [4]. The third is a vocal music performance evaluation system based on a neural network. The system's algorithm flow is developed from the technical principles of computer neural networks, and the Fourier transform and its extensions are used to identify the performance requirements of vocal music. The system's major modules are created following the system architecture and data processing flow, and the essential design code is provided.
Researchers have taken piano performance as an example and selected players of different proficiency levels to test the accuracy of the system's evaluation; the test results show that the system can reflect the real level of the performers and is conducive to vocal music teaching [5]. Another study examines the impact of a microcontroller-based development approach on RFID performance: a system was designed to build and test RFID tags and antennas, and practical read-range experiments were used to investigate tag transmitter performance. The findings revealed that the method functioned consistently and was suited to RFID applications requiring the location of on-site parts and equipment to be identified [6]. A further article integrates visual information with intelligent sensors to realize real-time localization of the singer together with facial emotion recognition; emotion recognition based on vocal and central nervous adjustment allows the singer to operate the system effectively, making it a practical choice. That article builds the system's functional framework, transmits data over a wireless sensor network, models human emotion, and investigates the procedure of vocal and central nervous adjustment; the experimental results show that the strategy has a positive effect [7]. Finally, one study describes a method for measuring singing ability that does not rely on melodic scoring data, which requires a different strategy from current programs, particularly those used for singing training. Previous singing assessment research centered on examining the qualities of the singing voice rather than building an automated evaluation approach. Pitch interval precision and vibrato are used in that study as acoustic parameters independent of the singer's or melody's characteristics; a two-class classification test with 600 song sequences yielded an overall classification rate of 83.5% [8].
The above methods have achieved some results in practical application, but most of them are quantitative analysis methods whose results show obvious fuzziness and relativity. They therefore lack clear comparability between singers, and the accuracy of a singer's vocal artistic performance cannot be compared on the basis of the score alone. This paper therefore proposes an accuracy evaluation method for the vocal music artistic expression of the singer's voice based on a sensor spatial localization algorithm.
The main contributions of this paper are as follows:
(i) The accuracy of the singer's vocal artistic performance is evaluated through music beat detection, note boundary accuracy analysis, and voiced-segment accuracy detection.
(ii) The distribution characteristics of useful vocal signals and noise at different frequency coefficient scales are analyzed, and a threshold method is used to zero the high-frequency wavelet coefficients.
(iii) Preprocessing of vocal music works for localization and noise reduction is completed: interference signals are filtered out, the influence of interference factors on the analysis of vocal works is reduced, and effective support is provided for analyzing the accuracy of singers' vocal artistic performance.
(iv) The singers' vocal music is received through a vocal music processing platform, and a sensor spatial localization algorithm is used to evaluate the accuracy of vocal art expression in singers' voices.
The rest of this paper is arranged as follows: Section 2 reviews related work, Section 3 gives an overview of the singer's voice characteristics, Section 4 describes the preprocessing of the singer's vocal works, Section 5 analyzes the accuracy of the singer's vocal performance, and Section 6 presents the experimental verification. Finally, Section 7 concludes the paper.
2. Related Work
This paper provides a signal-propagation-model-based approach for obtaining precise item position information; to increase the accuracy of range estimates between RFID tags and readers, a multilateration approach serves as the kernel of the algorithm [9]. The authors present a technique for extending the use of current RFID technology on building sites to track the precise location of tagged goods; experiments showed that combining RFID and GPS technologies can track building materials by widely distributing low-cost tags and using GPS-enabled RFID readers [10]. The authors of another study present a novel technique that relies only on the reading times of a few passive RFID tags, without additional sensors, signal strength measurements, or a vision system; their experiments show that the system offers a flexible and cost-effective way of serving mobile robots indoors, allowing simultaneous location and orientation estimates [11]. Other writers emphasize the importance of indoor location information for improving facility use and maintenance; they review 21 research projects that applied RFID-based indoor location sensing (ILS), covering algorithm design, devices and networks, study evidence, and performance assessment, and describe the use of the proximity approach and its supporting rationales, concluding that no single solution meets the needs of broad RFID-based ILS deployment [12]. One study presents a piecewise linear approximation method that uses a straight line to predict the current, still-uncompressed data point under a defined error limit until a new data point violates the limit, at which point a new straight line is drawn from that data point to approximate subsequent arrivals [13]. Another study introduced the EAQ approach, which first converts the original time series into a multiversion array (MVA) under a specified time series description; an MVA prefix can reconstruct an approximation of the original time series with a particular error, and as the prefix grows the error decreases, so time series approximations with varying errors can be realized [14]. Scholars have also developed data compression techniques based on the temporal correlation of time series, such as the discrete Fourier transform and the discrete wavelet transform; however, these are computationally expensive and thus ill-suited to wireless sensor networks (WSNs) [15].
3. Overview of Singer’s Voice Characteristics
3.1. Singing Range
Range refers to the span from the lowest to the highest pitch that a human voice or musical instrument can reach. In a broad sense, it refers to the total span of the tone series, also known as the total range; in a narrow sense, it refers to the range of an individual voice or instrument. The range of the human voice is the part of the whole range that the human voice can cover [7, 16–18]. Each person's range is determined by the innate size, length, and thickness of the vocal cords and by whether systematic vocal training has subsequently been undertaken. The ultimate vocal range of the human body is often very wide, but it is difficult to have complete control over every pitch level within it. In practice, singing pays more attention to the singing range: within it, the singer's voice should be loud, full, and freely controlled. The width of a singer's singing range reflects singing ability to a certain extent; the singing range of professional singers can generally reach two octaves or more [19–22].
3.2. Singer’s Voice Characteristics and Song Matching Degree
It is necessary to consider not only the singer's voice characteristics but also those of the song, such as its demands on singing ability and the timbre best suited to interpret its emotion. Therefore, when analyzing voice characteristics, the degree of matching with the song must also be considered [23–25]. Different songs differ in singing difficulty: some require a wide range and many treble passages, while others require a narrow range and are catchy, suitable for public singing. The song model should therefore include the song's range requirements, which can be obtained from the song's simplified score information. In addition, different songs convey different emotions and need to be interpreted with an appropriate timbre; generally, the original singer's timbre is the most suitable, so a timbre representation of the original singer should also be included in the song model. Intuitively, if a singer's timbre is highly similar to the original singer's, it is also suitable to recommend that song to the singer [26–29].
4. Preprocessing of Singer’s Vocal Works
4.1. Positioning and Mining of Vocal Music Resources
After the above analysis of the singer's voice characteristics, a high-precision location mining method for massive vocal music resources based on adaptive cascade retrieval control is proposed. Suppose $U$ is a playback subset of the massive vocal music resource library and $w$ is its constraint vector. A directed graph model [1] $G = (V, E)$ is used to represent the distribution feature points of the massive vocal music resource library in the cloud computing environment, where each edge $(u, v) \in E$ is a pair of nodes, and a fitting model of the vocal music resource flow is built over this graph.
Here, $\hat{\theta}$ represents the model parameter estimation result and $\Omega$ the distribution space of the vocal music resource flow.
The average load-balance value best suited to massive vocal resource allocation and scheduling is found through cloud computing from $N_q$, the number of vocal resource access requests; $N_p$, the actual number of vocal files played; and $N_a$, the number of accesses. According to the support degree of the data items, the priority list for vocal music resource playback is constructed, and the autocorrelation features of the vocal music resources are mined.
In the mining result, $\varphi$ denotes the component of the cross-correlation feature function of a candidate data item and $\kappa$ the high-order cumulant coefficient of the vocal resource distribution. Using the extracted autocorrelation features as pheromone guides for high-precision location mining of vocal music resources, the candidate itemsets and frequent itemsets of vocal music resources are mined through adaptive cascade retrieval control, improving the ability of location mining and accuracy analysis of vocal music.
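To make the retrieval control concrete, here is a schematic sketch of ranking vocal resources for the playback priority list by support degree; the play-log format, the `min_support` cutoff, and the function name are illustrative assumptions rather than the paper's actual implementation.

```python
# Schematic sketch: build a playback priority list from a play log by support
# degree. The log format and the support cutoff are illustrative assumptions.
from collections import Counter

def priority_list(play_log, min_support=0.01):
    # play_log: iterable of played vocal-resource IDs, one entry per playback
    counts = Counter(play_log)
    total = sum(counts.values())
    support = {rid: c / total for rid, c in counts.items()}
    # keep items whose support reaches the cutoff, ranked by descending support
    return sorted((r for r, s in support.items() if s >= min_support),
                  key=lambda r: support[r], reverse=True)

print(priority_list(["a", "b", "a", "c", "a", "b"]))  # ['a', 'b', 'c']
```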
4.2. Vocal Sound Source Localization Based on Sensing Spatial Localization Algorithm
The sensing spatial localization algorithm realizes sound source localization mainly through a sound sensor array. A sound sensor array [2] is an array of several sound sensors arranged in a certain geometric structure. It has strong spatial selectivity and can acquire the sound source signal without moving the sensors; at the same time, it supports adaptive detection, localization, and tracking of sound sources within a certain range, which makes it widely used in many fields. The spatial positioning diagram of the sound source is shown in Figure 1.

According to Figure 1, four receivers are fixed in the spatial coordinate system of the sound source, on the three coordinate axes and at the origin, at positions A, B, C, and O with coordinates (a, 0, 0), (0, b, 0), (0, 0, c), and (0, 0, 0), respectively. After the sound source emits sound waves, the four receivers receive the signal one after another. Since the precise arrival time of the signal cannot be measured, time differences are used to calculate the source position. Suppose the sound waves propagate along the surface of the medium at speed $v$, and let $t_A$, $t_B$, and $t_C$ be the time differences between the signals received by receivers A, B, and C, respectively, and the signal received by receiver O. The source position $P = (x, y, z)$ then satisfies $|PA| - |PO| = v t_A$, $|PB| - |PO| = v t_B$, and $|PC| - |PO| = v t_C$. Writing $r = \sqrt{x^2 + y^2 + z^2}$ for the source-to-origin distance, the vocal sound source localization result is obtained from
$$x = \frac{a^2 - v^2 t_A^2 - 2 r v t_A}{2a}, \qquad y = \frac{b^2 - v^2 t_B^2 - 2 r v t_B}{2b}, \qquad z = \frac{c^2 - v^2 t_C^2 - 2 r v t_C}{2c}.$$
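As an illustration of these localization equations, the following is a minimal sketch that solves them by fixed-point iteration on the source-to-origin distance $r$; the receiver spacing, propagation speed, and test source position are illustrative assumptions.

```python
# Minimal sketch of the TDOA localization above. Receivers at A=(a,0,0),
# B=(0,b,0), C=(0,0,c), O=origin; t_a, t_b, t_c are arrival-time differences
# relative to receiver O; v is the propagation speed.
import numpy as np

def locate_source(a, b, c, v, t_a, t_b, t_c, iters=100):
    """Solve the localization equations by fixed-point iteration on r = |PO|."""
    r = max(a, b, c)  # initial guess for the source-to-origin distance
    for _ in range(iters):
        x = (a**2 - (v * t_a)**2 - 2 * r * v * t_a) / (2 * a)
        y = (b**2 - (v * t_b)**2 - 2 * r * v * t_b) / (2 * b)
        z = (c**2 - (v * t_c)**2 - 2 * r * v * t_c) / (2 * c)
        r = np.sqrt(x**2 + y**2 + z**2)
    return x, y, z

# Synthetic check: place a source, simulate the time differences, recover it.
src = np.array([0.7, 0.4, 0.9])
a = b = c = 1.0
v = 343.0  # speed of sound in air, m/s
r0 = np.linalg.norm(src)
t_a = (np.linalg.norm(src - [a, 0, 0]) - r0) / v
t_b = (np.linalg.norm(src - [0, b, 0]) - r0) / v
t_c = (np.linalg.norm(src - [0, 0, c]) - r0) / v
print(locate_source(a, b, c, v, t_a, t_b, t_c))  # ≈ (0.7, 0.4, 0.9)
```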
4.3. Vocal Works Preprocessing Platform
Based on completing the function of vocal music sound source localization, a vocal work-preprocessing platform is established, which includes a sound-sending module, a sound-receiving and processing module, and a controller module. The structure of the preprocessing platform for vocal works is shown in Figure 2.

4.3.1. Sound Sending Module
The sound transmission module consists of a buzzer [3] and a transistor. Its working principle is relatively simple, with few components and convenient control. The buzzer is a common sounding device in electronic circuits and comes in two types, active and passive. The module designed in this paper uses a passive buzzer, driven by a transistor that amplifies the drive current to generate the audio signal.
4.3.2. Sound Receiving and Processing Module
A microphone receives the audio signal; after simple amplification and filtering, a monostable circuit built around a 555 timer shapes it into a clean square wave. The circuit diagram of the sound receiving and processing module is shown in Figure 3.

4.3.3. Controller Module
The controller module adopts TI's MSP430 series microcontroller, model MSP430F149, a family with a special emphasis on ultralow power consumption that is suitable for long-running, battery-powered applications. The MSP430F149 has Flash memory, which adds flexibility, and uses a 16-bit reduced instruction set computer (RISC) architecture with a 125 ns instruction cycle; most instructions complete in one cycle. It also has a powerful interrupt system and two internal timers: the 16-bit Timer_A with 3 capture/compare registers and the 16-bit Timer_B with 7 capture/compare registers, which can be used to capture sound signals and record time differences. The designed vocal work preprocessing platform realizes the reception, transmission, and control processing of the singer's vocal works, providing a basis for the analysis of vocal accuracy.
4.4. Vocal Noise Reduction Processing
4.4.1. Wavelet Transform-Based Vocal Noise Reduction Algorithm
Suppose $f(t) \in L^2(\mathbb{R})$ is a square-integrable vocal music signal and $\psi(t)$ is the mother wavelet, with $\int_{-\infty}^{+\infty} \psi(t)\,dt = 0$ [4]. The Fourier transform $\hat{\psi}(\omega)$ of $\psi(t)$ needs to meet the admissibility condition
$$C_{\psi} = \int_{-\infty}^{+\infty} \frac{|\hat{\psi}(\omega)|^{2}}{|\omega|}\, d\omega < \infty.$$
A wavelet sequence [5] is obtained by scaling and translating $\psi(t)$:
$$\psi_{a,b}(t) = |a|^{-1/2}\, \psi\!\left(\frac{t - b}{a}\right), \qquad a, b \in \mathbb{R},\ a \neq 0.$$
Here, $a$ and $b$ respectively represent the scale (expansion/contraction) factor and the translation factor.
For analog signals, the continuous wavelet transform [6] can be written in the frequency domain as
$$W_f(a, b) = \frac{\sqrt{|a|}}{2\pi} \int_{-\infty}^{+\infty} \hat{f}(\omega)\, \overline{\hat{\psi}(a\omega)}\, e^{\,jb\omega}\, d\omega.$$
Here, $\hat{f}(\omega)$ represents the frequency-domain characteristics of the analog signal, $\psi$ the fundamental (mother) wavelet, and $\hat{\psi}(\omega)$ its frequency-domain function.
In the process of noise reduction of vocal signals, since the signal is digital, the wavelet transform must be applied in discrete form [7]:
$$f(t) = \sum_{k} c_{J,k}\, \phi_{J,k}(t) + \sum_{j=1}^{J} \sum_{k} d_{j,k}\, \psi_{j,k}(t).$$
Here, $c_{J,k}$ represents the average (approximation) information of the vocal signal and $d_{j,k}$ the detail information.
The steps of vocal signal noise reduction based on the wavelet transform are as follows; a code sketch follows the list.
Step 1: collect the vocal signal, truncate it, and extract the most effective portion.
Step 2: process the original vocal signal so that it meets the requirements of wavelet transform processing.
Step 3: decompose the vocal signal at multiple scales using the wavelet transform to obtain a series of signal components at different frequencies, and thereby the corresponding wavelet coefficients.
Step 4: analyze the distribution characteristics of the useful vocal signal and the noise across the frequency coefficient scales. Since the noise corresponds to the high-frequency wavelet coefficients, this paper uses a threshold method to zero the high-frequency coefficients and retains the low-frequency coefficients.
Step 5: reconstruct the signal from the retained wavelet coefficients to recover the vocal signal with the noise removed.
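As an illustration of Steps 3–5, here is a minimal sketch using the PyWavelets library (an assumed dependency); the wavelet choice 'db4', the four decomposition levels, and the universal-threshold rule are common defaults assumed here, and soft thresholding stands in where the text zeroes the high-frequency coefficients outright.

```python
# Minimal wavelet-denoising sketch with PyWavelets; parameters are illustrative.
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet="db4", level=4):
    coeffs = pywt.wavedec(signal, wavelet, level=level)   # Step 3: multiscale decomposition
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745        # noise estimate from finest scale
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))     # universal threshold
    # Step 4: keep the approximation (low-frequency) coefficients, threshold the details
    denoised = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)                # Step 5: reconstruction

# Toy usage: a noisy 440 Hz tone
fs = 8000
t = np.arange(fs) / fs
noisy = np.sin(2 * np.pi * 440 * t) + 0.3 * np.random.randn(fs)
clean = wavelet_denoise(noisy)
```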
Based on the above, the noise reduction principle of vocal music signal based on wavelet transform is shown in Figure 4.

4.4.2. Optimization of Interference Signal Filtering
Through the wavelet transform, noise reduction of the vocal signal is realized and the distribution characteristics of the vocal signal and noise at different frequency coefficient scales are obtained. Although the interference of noise with the vocal accuracy analysis is reduced to a certain extent, some interference signals remain, so this article filters them out further. An adaptive frequency-shift filter [8] is used to exploit the time-frequency overlap of the continuous music signal in the vocal work and to separate and filter out the interference it contains. Since the continuous vocal signal is a full-band signal, the interference is filtered out according to its cyclic frequency characteristics to obtain an interference-free, full-band continuous vocal signal. The filtering steps are shown in Figure 5.

As shown in Figure 5, the cyclic frequencies of the interference signal can be divided into nonconjugate and conjugate ones, denoted by $\alpha$ and $\beta$, respectively; the code rate of the interfering signal is denoted by $R_b$, and the estimated carrier frequency of the interfering signal by $\hat{f}_c$. In the interference filtering process, a reference signal $s_r(n)$ is introduced and its cyclic frequency is set to match that of the interference. The reference signal is used to reconstruct the interference signal $\hat{d}(n)$, which is then filtered out through an FIR filter to obtain an interference-free, full-band continuous vocal signal [9].
To obtain a better interference filtering effect, an objective function is set with the goal of minimizing the noise interference coefficient:
$$\min_{\mathbf{h}}\; J(\mathbf{h}) = E\!\left[\,\big|\hat{d}(n) - \mathbf{h}^{H}\mathbf{x}(n)\big|^{2}\,\right],$$
where $\hat{d}(n)$ represents the reconstructed interference signal, $\mathbf{h}^{H}$ the conjugate transpose of the FIR filter tap vector, and $\mathbf{x}(n)$ the received signal vector.
For the objective function [10] given by (9), the corresponding constraints are established.
According to the principle of least squared error, the minimum of the noise interference coefficient is computed by the iteration
$$\mathbf{h}(n+1) = \mathbf{h}(n) + \mu\, e^{*}(n)\, \mathbf{x}(n), \qquad e(n) = \hat{d}(n) - \mathbf{h}^{H}(n)\,\mathbf{x}(n).$$
Here, $\mu$ represents the iterative convergence factor.
The minimum noise interference coefficient and the reconstructed interference signal obtained above are substituted into the adaptive frequency-shift filter, and the flow shown in Figure 5 is executed again on the full-band continuous music signal to filter out the interference and obtain a clean music signal.
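The cyclic-frequency reconstruction itself is beyond a short sketch, but the final least-squares FIR cancellation step can be illustrated as follows; the sketch assumes a reference signal already correlated with the interference, and the tap count is an arbitrary choice.

```python
# Minimal sketch: cancel interference with a least-squares FIR filter, given a
# reference signal correlated with the interference (an assumption here).
import numpy as np

def fir_cancel(observed, reference, taps=32):
    # Columns are delayed copies of the reference signal (np.roll wraps, so the
    # first `taps` samples are unreliable and excluded from the fit below).
    R = np.column_stack([np.roll(reference, k) for k in range(taps)])
    # Least-squares tap weights h minimizing ||observed - R h||^2
    h, *_ = np.linalg.lstsq(R[taps:], observed[taps:], rcond=None)
    return observed - R @ h  # subtract the reconstructed interference
```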
Through the above process, the preprocessing of vocal music work positioning and noise reduction is completed, the interference signal is filtered, the influence of interference factors on vocal music work analysis is reduced, and effective support is provided for the analysis of the accuracy of vocal music artistic performance of singers.
5. Analysis of the Accuracy of the Singer’s Voice Vocal Performance
5.1. Music Beat Detection
A real-time beat detection algorithm must be a simple method capable of handling a complex problem: it needs to efficiently extract the moments at which signal energy rises while suppressing the influence of other, nonbeat signals.
First, consider the human ear's perception of music beats. Under normal circumstances, the number of music beats per minute (bpm) is between 45 and 180; in other words, the beat frequency is between 0.75 Hz and 3 Hz and the beat period is between 0.33 s and 1.33 s. Moreover, bpm is a feature that manifests over a longer time span: the human ear generally needs about 2 seconds to sense the tempo of music. It is therefore reasonable, and sufficient, for the refresh cycle of the detected bpm value to be about 2 seconds. Allowing one bpm value to be produced every 2 seconds is one of the characteristics of the extremely low-complexity, real-time music beat detection algorithm described in this article.
Second, the sampling rate required to compute bpm must be considered. As mentioned above, the beat frequency lies between 0.75 Hz and 3 Hz, so intuitively the sampling rate, and with it the amount of data to process, can be reduced. However, attention must be paid to the lower limit of the sampling rate; otherwise bpm values will be missed. At the same time, the high-frequency part of the signal also carries tempo information, and conventional downsampling loses much of that high-frequency information.
The music beat value bpm is the number of music beats per minute, related to the music beat period $T(\mathrm{bpm})$ (in seconds) by
$$\mathrm{bpm} = \frac{60}{T(\mathrm{bpm})}.$$
When the sampling frequency is $f_s$ and the beat period spans $N$ samples, the expression for the music beat is
$$\mathrm{bpm} = \frac{60 f_s}{N}.$$
Here, $T(\mathrm{bpm}) = N / f_s$ represents the music beat period.
To ensure that every integer bpm value within the common range of 45–180 bpm is distinguishable, a certain sampling rate must be guaranteed: the beat periods corresponding to two adjacent bpm values must differ by at least one sample period, that is,
$$T(\mathrm{bpm}) - T(\mathrm{bpm}+1) = \frac{60}{\mathrm{bpm}\,(\mathrm{bpm}+1)} \geq \frac{1}{f_s}.$$
The condition is tightest at the upper end of the range, bpm = 180, which requires $f_s \geq 180 \times 181 / 60 = 543$ Hz.
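A quick numerical check of the bound just derived (the values simply restate the 45–180 bpm range from the text):

```python
# Distinguishability of adjacent bpm values: the gap between their beat periods,
# 60/(bpm*(bpm+1)) seconds, must be at least one sample period 1/fs.
def min_sample_rate(bpm_max=180):
    return bpm_max * (bpm_max + 1) / 60.0  # Hz

print(min_sample_rate())  # 543.0 -> a beat-envelope sample rate above ~543 Hz suffices
```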
5.2. Analysis of the Accuracy of the Note Boundary
Rhythm and melody are the two most basic elements of music. Exact note division and pitch extraction are prerequisites for evaluating the accuracy of the singer's vocal art performance, and both can be addressed with the DTW algorithm. Because the feature distributions at the upper and lower note boundaries differ considerably, different distance measures are used when computing similarity. The similarity matrix in this article is computed as follows:
(1) The sound signal is divided into frames of length 32 ms with a frame shift of 16 ms to obtain the singing sequence, and the music score is divided into frames with the same step length to obtain the score sequence.
(2) Pitch extraction is performed on each frame, unvoiced/voiced discrimination is applied, and unvoiced frames are marked.
(3) The pitch of each frame is quantized to the pitch of the corresponding note, and the start frame of each note is marked at the same time.
(4) The distance metric between a singing frame $x_i$ and a score frame $y_j$ is defined piecewise:
$$d(x_i, y_j) = \begin{cases} d_1, & x_i \text{ is an unvoiced frame and } y_j \text{ is a note start frame}, \\ d_2, & x_i \text{ is an unvoiced frame and } y_j \text{ is not a note start frame}, \\ \lvert p(x_i) - p(y_j) \rvert, & \text{otherwise}, \end{cases}$$
where $p(\cdot)$ denotes the quantized pitch of a frame and $d_1$ and $d_2$ are fixed penalties. The first case covers an unvoiced frame aligned to a note start frame; the second, an unvoiced frame aligned elsewhere; the third, all other conditions.
(5) The distance metric above is used to construct the similarity matrix, and a dynamic programming search finds the optimal path (a code sketch of this search follows). The note start positions on the optimal path correspond to the note start times in the audio signal, and likewise for the end times.
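A minimal sketch of the dynamic programming search in step (5); the `dist` callback stands in for the piecewise metric above, and the backtracking details are illustrative.

```python
import numpy as np

def dtw_align(sing, score, dist):
    """DTW: fill the cumulative-cost matrix, then backtrack the optimal path."""
    n, m = len(sing), len(score)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(sing[i - 1], score[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (n, m) to recover the optimal alignment path
    path, i, j = [], n, m
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        if i == 1:
            j -= 1
        elif j == 1:
            i -= 1
        else:
            step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
    path.append((0, 0))
    return path[::-1]

# Toy usage with a plain pitch-difference metric
print(dtw_align([60, 60, 62, 64], [60, 62, 64], dist=lambda a, b: abs(a - b)))
```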
Vocal segmentation focuses on the accuracy of note boundaries. The similarity matrix described above can detect the beginning and ending boundaries of finals and of voiced initials. However, the starting position of a sung note is determined by the boundary of the initial, and silent or unvoiced portions of the initial carry no pitch information, so the pitch alignment model cannot detect the initial boundary. In addition, the pitch alignment model uses a frame length of 32 ms, so the note boundary accuracy cannot be finer than 32 ms. For these reasons, this paper designs an initial-boundary correction algorithm that can accurately detect the starting boundary of initials (especially unvoiced initials).
The algorithm frames the signal in 4 ms units, extracts the short-term energy and zero-crossing rate, and detects silence using an algorithm based on heuristic rules. The starting points of the given notes are distributed approximately around the starting positions of the finals, and the duration of initials is relatively stable. Therefore, silence can be searched for in a segment before each final and the note boundary adjusted to that point; if no silence is found in the interval, the boundary is adjusted to the end of the search interval.
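A minimal sketch of the 4 ms framing with short-term energy and zero-crossing rate; the silence thresholds are illustrative heuristics, not the paper's tuned rules.

```python
import numpy as np

def energy_zcr(signal, fs, frame_ms=4):
    hop = int(fs * frame_ms / 1000)                      # 4 ms frames
    frames = signal[: len(signal) // hop * hop].reshape(-1, hop)
    energy = (frames ** 2).sum(axis=1)                   # short-term energy
    # approximate zero-crossing rate per frame
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).sum(axis=1) / hop
    return energy, zcr

def silent_frames(energy, zcr, z_max=0.1):
    # heuristic: low energy relative to the clip mean and few zero crossings
    return (energy < 0.05 * energy.mean()) & (zcr < z_max)
```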
5.3. Accuracy Detection of Voiced Segments
Generally, the proportion of vocal harmonic components above 4 kHz in music is relatively small, so the original music signal is downsampled to 8 kHz, which also reduces the amount of computation in subsequent processing. Because the audio signal is only short-term stationary, it must be framed and windowed: this paper uses a Hamming window with 320 samples per frame and applies the short-time Fourier transform [11] to obtain the time-frequency representation of the signal.
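A minimal sketch of this preprocessing using SciPy (an assumed dependency); the hop size of 160 samples is an illustrative choice, as the paper specifies only the 320-sample frame length.

```python
import numpy as np
from scipy import signal as sps

def stft_frames(x, fs_in, fs_out=8000, frame_len=320, hop=160):
    x8k = sps.resample_poly(x, fs_out, fs_in)            # anti-aliased downsampling
    f, t, Z = sps.stft(x8k, fs=fs_out, window="hamming",
                       nperseg=frame_len, noverlap=frame_len - hop)
    return f, t, np.abs(Z)                               # short-time amplitude spectrum
```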
Music is composed of singing voice and accompaniment, both made up of notes with definite time values. Each note has a relatively stable spectral characteristic [12], reflected in a series of spectral segments with small within-segment differences and large between-segment differences; this paper therefore uses the metric distance (DIS) algorithm to segment notes. The DIS distance is a distance measure that integrates the means and variances of data segments and characterizes the gap between audio segments. With the two data windows before and after the current position both taken as five frames long, the DIS distance can be written as
$$\mathrm{DIS} = \frac{\lVert \mu_1 - \mu_2 \rVert^2}{\operatorname{tr}(C_1) + \operatorname{tr}(C_2)}.$$
Here, $\mu_1$ and $\mu_2$ respectively represent the mean vectors of the audio features in the two windows, and $\operatorname{tr}(C_1)$ and $\operatorname{tr}(C_2)$ the traces of their covariance matrices. When the difference between the feature means of the two segments is large and the within-segment feature variance is small, the DIS value is large, indicating a greater distance between the two audio segments.
The characteristic parameters of the voiced segments are estimated from the short-time amplitude spectrum, and the DIS distance as a function of the frame index $k$ is computed by sliding the data windows frame by frame:
$$D(k) = \frac{\lVert \mu_1(k) - \mu_2(k) \rVert^2}{\operatorname{tr}\!\big(C_1(k)\big) + \operatorname{tr}\!\big(C_2(k)\big)}.$$
Here, $\mu_1(k)$ and $\mu_2(k)$ respectively represent the mean vectors of the audio features in the windows before and after frame $k$, and $\operatorname{tr}(C_1(k))$ and $\operatorname{tr}(C_2(k))$ the traces of the corresponding covariance matrices.
All local maximum points of $D(k)$ are found, the threshold $T$ is set to their average value, and maxima smaller than $T$ are deleted. In addition, the duration of a quarter note in fast-paced music is about 0.5 s; considering that eighth and sixteenth notes last 1/2 and 1/4 as long as quarter notes, this article assumes that a note lasts no less than 100 ms, and otherwise removes the corresponding maximum point. The remaining maxima are the note cut points, from which the accuracy of the voiced segments can be judged [13, 14].
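A minimal sketch of the sliding-window DIS segmentation; the DIS form follows the reconstruction above, and the frame hop of 20 ms matches the STFT sketch earlier in this section (both are assumptions).

```python
import numpy as np

def dis_curve(features, win=5):
    # features: (n_frames, n_dims) short-time amplitude-spectrum features
    dis = np.zeros(len(features))
    for k in range(win, len(features) - win):
        pre, post = features[k - win:k], features[k:k + win]
        gap = np.sum((pre.mean(axis=0) - post.mean(axis=0)) ** 2)
        spread = np.trace(np.cov(pre.T)) + np.trace(np.cov(post.T))
        dis[k] = gap / (spread + 1e-12)
    return dis

def note_cuts(dis, frame_ms=20, min_note_ms=100):
    min_gap = int(np.ceil(min_note_ms / frame_ms))       # minimum note length in frames
    peaks = [k for k in range(1, len(dis) - 1)
             if dis[k] > dis[k - 1] and dis[k] >= dis[k + 1]]
    if not peaks:
        return []
    thresh = np.mean([dis[p] for p in peaks])            # threshold T = mean of the maxima
    cuts = []
    for k in (p for p in peaks if dis[p] >= thresh):
        if not cuts or k - cuts[-1] >= min_gap:          # enforce the 100 ms minimum
            cuts.append(k)
    return cuts
```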
5.4. Evaluation of the Accuracy of the Singer’s Vocal Art Performance
To evaluate the accuracy of the singer's vocal art performance more reasonably, this paper proposes a sound-level completion quality evaluation method based on the lower bound of the Wilson interval. The Wilson confidence interval is often used to address the accuracy problem of small samples, so that the values in the interval carry the same confidence regardless of whether the number of samples is large or small [15].
Given singer $u$'s singing sequence and a confidence level, let $n$ be the total number of times pitch level $i$ is to be sung, and count the number of times it is sung on pitch. Based on the lower bound of the Wilson interval, the sound-level completion quality is
$$Q(u, i) = \frac{\hat{p} + \dfrac{z^2}{2n} - z \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}.$$
Here, $\hat{p}$ represents the proportion of singer $u$'s on-pitch singing at pitch level $i$, and $z$ represents the statistic constant corresponding to the confidence level.
At the 90% confidence level, $z$ is 1.643; substituting this value into formula (18) yields the specific data shown in Table 1. The evaluation results for the accuracy of the singer's vocal performance based on the lower bound of the Wilson interval are highly reliable.
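A minimal sketch of the Wilson lower bound in formula (18); the sample counts in the usage line are invented for illustration.

```python
import math

def wilson_lower(on_pitch, n, z=1.643):
    """Lower bound of the Wilson interval for an on-pitch proportion on_pitch/n."""
    if n == 0:
        return 0.0
    p = on_pitch / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)

print(wilson_lower(17, 20))  # ≈ 0.678: completion quality for 17/20 on-pitch notes
```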
6. Experimental Verification
To verify the effectiveness of the proposed method for analyzing the accuracy of the singer's vocal performance based on the sensor spatial localization algorithm, experiments were carried out on a machine with an Intel processor, 4 GB of memory, and the 32-bit Windows 7 operating system. The simulation experiments were conducted with SPSS 19.0.
6.1. Vocal Music Corpus
The music data comes from the MIR-1K data set, which contains 1,000 sung clips at a 16 kHz sampling rate with separable accompaniment, together with fundamental-frequency annotations of the singing voice at 10 ms intervals. It consists of score fragments sung by 7 male singers and 21 female singers. This experiment randomly selects 500 clips from the MIR-1K data set as the training set and uses the remaining 500 clips as the main-melody extraction test set.
Nine music clips, ranging in length from 20 s to 60 s, were chosen arbitrarily from the training and test sets. The 9 clips contain 411 syllables (411 finals and 330 initials); the average clip length is about 25 s, the total number of notes is 993, and the boundaries and pitches of the initials and finals were manually marked. Table 2 gives the singing range statistics of the 9 music clips.
From the data in Table 2, it can be seen that among the 9 experimental subjects, singer 1 has strong pitch control between A2 and G5; singer 2 has the highest baseline singing ability, with a sound-level completion quality of 0.85 and a wide singing range reaching [C3, F5]; singer 7 has the lowest baseline singing ability and a narrow vocal range; and singer 8's singing ability is average.
6.2. Analysis of Experimental Results
To verify the effectiveness of the method of this article, the vocal evaluation method based on feature comparison and the vocal performance evaluation method based on an objective evaluation software algorithm are used as comparison methods. The comparison results are analyzed below.
6.2.1. The Error of Sound Source Localization Test Results
In analyzing the accuracy of the singer's vocal performance, the accuracy of the sound source localization result has a certain impact on the analysis result. Therefore, taking this index as the comparison index, the application effects of the three methods are analyzed; the results are shown in Table 3.
It can be seen from the data in Table 3 that the standard deviations of the sound source localization errors measured by the method of this article, the evaluation method based on feature comparison, and the evaluation method based on the objective evaluation software algorithm are all less than 1, but the standard deviation of this paper's method is the smallest, indicating that its sound source localization results are the most accurate. Only a few individual measurements show large errors, which are mainly due to external interference and other factors.
6.2.2. Noise Reduction Performance of Music Signal
To analyze the music signal noise reduction performance of the different methods in the analysis of the accuracy of the singer's vocal performance, a piece of electronic music signal is used as the test object. The signal after adding a certain amount of noise is shown in Figure 6(a), and the signals after noise reduction with the different methods are shown in Figures 6(b)–6(d).

Figure 6: (a) music signal after adding noise; (b)–(d) the signal after noise reduction by the different methods.
Comparing the noise reduction effects on the electronic music signal in Figure 6, it can be seen that the traditional methods denoise the electronic music signal poorly: the noise is not removed cleanly, the denoised signal is seriously distorted, and some useful signal components are mistakenly removed as noise. After the method in this paper is applied, the noise in the electronic music signal is effectively filtered out and an ideal noise reduction result is obtained, reflecting the advantage of this method in the noise reduction process.

6.2.3. Time-Consuming Analysis of the Accuracy of the Singer’s Vocal Art Performance
The three approaches are compared using the time consumed by the accuracy analysis of the singer's vocal performance as the comparative index, and the outcomes are shown in Table 4.
According to the data in Table 4, when this method is used to analyze the accuracy of six vocal clips, the time consumed is significantly lower than for the evaluation method based on feature comparison and the evaluation method based on the objective evaluation software algorithm. This method performs better because, in the voiced-segment accuracy detection stage, the original music signal is downsampled to 8 kHz, which reduces the amount of subsequent computation and hence the time consumed by the accuracy evaluation. The time consumption of the analysis of the singer's vocal art performance is shown in Figure 8.

7. Conclusion
Singing is a way for people to express their feelings and their perception of the world around them; its content and sentiment are conveyed through voice and bodily expression. This is intimately connected to a singer's emotional motivation and physical coordination, and whether the singer's physical exertion is appropriate directly affects voice quality and the presentation of vocal mentality. Tests and other activities in vocal competitions require quantitative expression, so to address the problems in the quantitative scoring of singing art and to keep vocal aesthetic evaluation operable once it has been translated into quantitative analysis, unified objective standards that conform to the rules of singing art must be developed and made simple to master and apply. Based on the sensor spatial localization algorithm, this paper proposes a technique for measuring the accuracy of a singer's voice performance in vocal music. The experimental results show that the method reduces the error of the sound source localization results, achieves better noise reduction in music signals, and consumes less time in analyzing the accuracy of the singer's vocal performance, indicating that it has a good application effect.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare no conflicts of interest.
Acknowledgments
This work was supported by the Youth Fund Project of Humanities and Social Sciences Research of the Ministry of Education in 2021, “A Study of Chinese Classical Poetry and Songs from the Perspective of Language Musicology (1912–2012)” (project approval number: 21YJC760077).