Abstract

Computer music creation boasts broad application prospects. It generally relies on artificial intelligence (AI) and machine learning (ML) to generate a music score that matches the original mono-symbol score model, or to memorize and recognize the rhythms and beats of the music. However, there are very few music melody synthesis models based on artificial neural networks (ANNs), and some ANN-based models cannot adapt to the transposition invariance of the original rhythm training set. To overcome this defect, this paper develops an automatic synthesis technology for music teaching melodies based on the recurrent neural network (RNN). Firstly, a strategy was proposed to extract acoustic features from music melodies. Next, a sequence-to-sequence model was adopted to synthesize general music melodies. After that, an RNN was established to synthesize music melody together with singing melody, so as to find suitable singing segments for the music melody in the teaching scenario. The RNN can synthesize music melody with a short delay solely based on static acoustic features, eliminating the need for dynamic features. The validity of the proposed model was verified through experiments.

1. Introduction

With the rapid development of modern computer science, many researchers have shifted their focus to computer-based algorithmic composition and automatic music melody generation systems. The research results on music melody synthesis and music modeling methods are being applied to various fields. Research on computer music creation aims to quantify and combine the emotional tendencies of music with the aid of computers and mathematical algorithms. The specific tasks include aided composition, sound simulation and storage, and music analysis and creation [1, 2]. Computer music creation generally relies on artificial intelligence (AI) and machine learning (ML) to generate a music score that matches the original mono-symbol score model, or to memorize and recognize the rhythms and beats of the music. Despite its broad application prospects, AI-based composition that does not require extensive rules of music knowledge is still in the theoretical stage [3, 4].

Speech processing has been widely applied in composition and songwriting, record production, and entertainment. Unlike simple speech synthesis, music melody synthesis has two additional processing steps: tone detection and transformation [5, 6]. Wenner et al. [7] preprocessed the musical melody synthesis corpus through automatic note segmentation and voiced/unvoiced sound recognition, constructed a high-quality music melody synthesis system, and proposed a music melody adjustment algorithm, which functions as an adaptive filter capable of detecting musical note cycles.

AI has already been adopted to realize algorithmic composition and automatic music generation [8–12]. Bilbao et al. [13] introduced a bidirectional long short-term memory (LSTM) neural network into a mixed music generation system and thus realized training on multi-voice music datasets. Their approach provides effective chord progressions while ensuring melody time and transposition invariance.

Electronic synthetic tones bring a rich new sound experience to music of various styles and themes. Electronic musical instruments differ from traditional acoustic instruments in sound rendering principle and acoustic features [14–19]. Miranda et al. [20] expounded the computer-aided means to realize the acoustic features, voice editing, and modulation of electronic sound melodies and provided a valuable reference for applying electronic sound melodies in modern music creation. However, computer-based accompaniment has a rigid chord structure, which cannot easily adapt to diverse music styles. To solve this problem, Taigman et al. [21] put forward an adaptive automatic accompaniment algorithm, including chord series extraction and automatic accompaniment figure acquisition, and created a suitable accompaniment figure database based on chord sequences in light of the features of music melodies and emotions.

The existing studies at home and abroad mostly concentrate on the methods, melodic forms, and tone synergy of computer music creation [22–26]. However, there are very few music melody synthesis models based on artificial neural networks (ANNs), and some ANN-based models cannot adapt to the transposition invariance of the original rhythm training set. To overcome this defect, this paper attempts to develop an automatic synthesis technology for music teaching melodies based on the recurrent neural network (RNN).

The remainder of this paper is organized as follows. Section 2 extracts the acoustic features from music melody. Section 3 applies the sequence-to-sequence model to synthesize general music melodies. Section 4 establishes an RNN to synthesize music melody together with singing melody, aiming to find suitable singing segments for the music melody in the teaching scenario. Finally, experiments are carried out to verify the effectiveness of our model.

2. Acoustic Feature Extraction

The automatic synthesis of music melody aims to obtain a melody that is beautiful and pleasant to human ears. To describe the differences in the auditory sensitivity of human ears to music melodies of different frequencies, the linear frequency μ of each music melody was converted to the mel scale frequency μMR:
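The exact conversion constants are not reproduced in the text above; a widely used form of the mel scale mapping (assumed here for illustration), with μ in Hz, is
$$\mu_{MR} = 2595 \log_{10}\left(1 + \frac{\mu}{700}\right).$$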

On the mel scale μMR, equal differences in μMR between two music melodies correspond roughly to equal differences in the tones perceived by human ears.

For the above reason, the mel scale was adopted to extract acoustic features in our music melody synthesis system. Since the music melody signal in the high-frequency band is relatively weak, it was compensated through pre-emphasis. Let β be the pre-emphasis factor. Then, the pre-emphasis of music melody a(τ) can be described by
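Although the original equation is not shown here, pre-emphasis is conventionally implemented as a first-order high-pass filter; a typical form, with β usually set between 0.95 and 0.97, is
$$a'(\tau) = a(\tau) - \beta\, a(\tau - 1).$$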

To prevent spectrum leakage and enhance the continuity at the left and right ends of each signal frame, it is necessary to perform framing and windowing of the music melody signal, which has short-time stationarity. Let CH(m), m = 0, 1, …, M − 1, be the framed music melody signal and CK(m) be the Hamming window function. Then, the windowed signal can be obtained by multiplying CH(m) with CK(m), where the Hamming window CK(m) can be described by
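For reference, the standard Hamming window and the corresponding windowing operation take the following form (the conventional coefficients are assumed, since the original equations are not reproduced):
$$CH'(m) = CH(m)\, CK(m), \qquad CK(m) = 0.54 - 0.46\cos\left(\frac{2\pi m}{M - 1}\right), \quad m = 0, 1, \ldots, M - 1.$$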

It is difficult to extract the features from time-domain music melody signal. The general practice is to convert the signal to the frequency domain through short-time Fourier transform (STFT) before further analysis. Let CH(m) be the input signal of STFT and M be the number of Fourier points. Then, the fast Fourier transform of the M points of the windowed framed time-domain music melody signal CH(m) can be expressed as

In the frequency domain, the absolute value of the spectrum of the music melody signal can be described as

The human ear can only detect the frequency components in a certain range. Therefore, the human auditory system can be treated as a filter bank that only allows some frequency signals to pass through. This paper simulates the human auditory system with a mel filter bank. Let μn, μmax, and μmin be the central, upper, and lower frequencies of the filter bank, respectively. Then, the transfer function of the filter bank can be described by
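A common definition of the n-th triangular filter, with center frequencies μ(n−1), μn, and μ(n+1) spaced uniformly on the mel scale between μmin and μmax, is assumed here for illustration:
$$H_n(l) = \begin{cases} 0, & l < \mu_{n-1} \\ \dfrac{l - \mu_{n-1}}{\mu_n - \mu_{n-1}}, & \mu_{n-1} \le l \le \mu_n \\ \dfrac{\mu_{n+1} - l}{\mu_{n+1} - \mu_n}, & \mu_n < l \le \mu_{n+1} \\ 0, & l > \mu_{n+1} \end{cases} \qquad n = 1, 2, \ldots, N.$$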

Let N be the number of triangular filters. Then, the signal on passing through the mel filter bank QFn(l) can be expressed as

The mel spectrum can be extracted by taking the logarithm of the outputs of the mel filter bank.
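To make the whole pipeline of this section concrete (pre-emphasis, framing, Hamming windowing, FFT, mel filtering, and log compression), the following Python sketch uses the standard mel filter bank provided by librosa; the sampling rate, frame length, hop size, and number of filters are illustrative assumptions rather than the settings used in the paper.

```python
import numpy as np
import librosa

def log_mel_spectrum(a, sr=22050, beta=0.97, frame_len=1024, hop=256, n_mels=40):
    """Extract a log-mel spectrogram following the steps in Section 2."""
    # Pre-emphasis: compensate the weaker high-frequency band.
    a = np.append(a[0], a[1:] - beta * a[:-1])
    # Framing and Hamming windowing to exploit short-time stationarity.
    frames = librosa.util.frame(a, frame_length=frame_len, hop_length=hop)
    frames = frames * np.hamming(frame_len)[:, None]
    # Short-time Fourier transform and magnitude spectrum.
    spectrum = np.abs(np.fft.rfft(frames, n=frame_len, axis=0))
    # Triangular mel filter bank simulating the human auditory system.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=frame_len, n_mels=n_mels)
    mel_energy = mel_fb @ spectrum
    # Log compression yields the mel spectrum used as the acoustic feature.
    return np.log(mel_energy + 1e-10)
```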

3. Sequence-to-Sequence Model-Based Music Melody Synthesis

Both speech synthesis based on statistical parameters and music melody synthesis based on neural networks face the following defects: the complexity of model construction and the dependence of front-end text processing on strong linguistic knowledge. Unlike these approaches, sequence-to-sequence speech synthesis can directly transform phonetic notations into waveforms and significantly simplify the front-end module. Since a standard neural network cannot directly process input and output sequences of variable length, a sequence-to-sequence model is needed to handle the case where the input sequence and the output sequence differ in length.

This paper constructs a sequence-to-sequence model of music melody to realize the automatic synthesis of melody sequences. Figure 1 shows the structure of the music melody synthesis system based on the sequence-to-sequence model. In the proposed model, the music melody recognition module receives a music melody sequence and outputs a singing melody sequence. The synthesis module receives the target melody sequence and outputs an audio sequence.

The RNN model consists of an encoder and a decoder, using an activation function Γ. The hidden state at the current moment τ depends on the input at the current moment and the hidden state at the previous moment τ − 1:
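In conventional RNN notation (assumed here, since the original equation is not reproduced), this dependence can be written as
$$r_\tau = \Gamma\left(\omega_a a_\tau + \omega_r r_{\tau-1} + \gamma\right),$$
where a_τ is the input at moment τ, r_{τ−1} is the previous hidden state, and ω_a, ω_r, and γ are learnable weights and a bias.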

Let rψa be the hidden layer state of the neural network at moment ψa and s(·) be the nonlinear transform. Then, the middle vector of the encoder can be obtained through the nonlinear transform of the hidden layer states:

The middle vector is equivalent to the final encoded state of the hidden layer:

The next output bi of the decoder can be generated based on the middle vector and the historical outputs b1, b2, …, bi−1. The decoder is often adopted to predict the next acoustic feature in music melody composition. It is necessary to determine the middle vector and the existing acoustic features:

Let ri−1 be the state of a hidden layer node in the RNN of the decoder; bi−1 be the output of that node at the moment i − 1; and s(·) be the nonlinear transform. Then, formula (12) can be simplified as
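A minimal sketch of the encoder–decoder relations above, following the standard sequence-to-sequence formulation (the paper's exact equations are not shown), is
$$p = s(r_1, r_2, \ldots, r_{\psi_a}) \quad \text{or simply} \quad p = r_{\psi_a}, \qquad b_i = s(r_{i-1}, b_{i-1}, p),$$
where p is the middle vector, r denotes hidden layer states, and b_i is the i-th output.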

The sequence-to-sequence model is prone to a potential problem: every element of the input sequence has the same influence on each element of the output sequence. This problem can be solved by the attention mechanism. Figure 2 shows the structure of the attention-based model.

The attention mechanism assigns different weights to different parts of the input sequence. Let hj be the hidden layer output of the encoder at the moment j; rτ−1 be the hidden layer output of the decoder at the moment τ − 1; and e be the alignment model. Then, the matching degree DOτj between the locations of input layer nodes and output layer nodes can be calculated with e, a nonlinear function that compares and computes the matching degree between hj and rτ−1. The greater DOτj is, the more necessary it is to emphasize the input sequence at the current moment during the decoding of the music melody signal. Let U, V, and W be weight matrices. Then, the dot-product, general weighting, concatenation, and perceptron forms of the alignment model e can be, respectively, described by
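For reference, these four alignment functions are commonly written as follows (a standard formulation is assumed, since the original equations are not reproduced):
$$e_{dot}(h_j, r_{\tau-1}) = r_{\tau-1}^{\top} h_j, \qquad e_{general}(h_j, r_{\tau-1}) = r_{\tau-1}^{\top} U h_j,$$
$$e_{concat}(h_j, r_{\tau-1}) = V^{\top} \tanh\left(W\left[r_{\tau-1}; h_j\right]\right), \qquad e_{perceptron}(h_j, r_{\tau-1}) = V^{\top} \tanh\left(U h_j + W r_{\tau-1}\right).$$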

DOτj can be normalized by

The middle vector pτ can be obtained through weighted summation:
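Assuming the conventional attention formulation, the normalization and the weighted summation take the form
$$\alpha_{\tau j} = \frac{\exp(DO_{\tau j})}{\sum_{k} \exp(DO_{\tau k})}, \qquad p_\tau = \sum_{j} \alpha_{\tau j}\, h_j,$$
where the sums run over all encoder moments and h_j is the encoder hidden output at moment j.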

Let s() be the nonlinear transform. Then, the next hidden layer state can be calculated by

To realize frame-level feature mapping from the input to the output, the proposed model needs to transform the contextual features on the phoneme level and the frame level. This section puts forward a length prediction model for music melody, which supports time-supervised labeling. Suppose a phoneme sequence is given for a rhythm in a music melody sequence of length M. Let ε be the model parameters of the RNN, and let Lm be the number of rhythm states of the m-th phoneme in the rhythm state sequence of the melody. Then, the length prediction of music melody states can be regarded as the forecast of the length of the state allocation sequence. The prediction goal is to maximize the likelihood in the following formula:

Let om,l(δm,l) be the probability density function of the length model of the music melody; δm,l be the duration of the l-th state of the m-th phoneme; Ψ be the total length constraint; nm,l be the rhythm state length predicted by the network model; and φm,l be the variance obtained from the individual length of each phoneme in the music melody database. Solving the maximum likelihood of formula (19), the rhythm state length of each melody can be obtained by

Let ψS be the phoneme length specified in the music score. Then, σ can be calculated by
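Although the original formulas are not reproduced, the constrained maximum-likelihood solution for state durations used in statistical parametric synthesis, written with the variables defined above, is assumed to take the standard form
$$\delta_{m,l} = n_{m,l} + \sigma\, \varphi_{m,l}, \qquad \sigma = \frac{\psi_S - \sum_{l=1}^{L_m} n_{m,l}}{\sum_{l=1}^{L_m} \varphi_{m,l}},$$
i.e., the deviation of the score-specified phoneme length ψS from the sum of the predicted state lengths is distributed across the states in proportion to their variances, so that the resulting durations satisfy the length constraint.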

4. RNN-Based Melody Synthesis

To find the suitable singing segments for the music melody in the teaching scenario, the RNN-based statistical music melody synthesis algorithm needs to realize the following goal in the synthesis phase: identifying the most probable acoustic feature sequence ũ from the singing melody sequence k with given linguistic features and a series of trained context-dependent music melodies Φ. Then, we have the following, where ŵ is the rhythm state of a melody:

Let λŵi be the mean vector under state ŵi and ∑ŵi be the corresponding covariance matrix, with the mean vector and covariance matrix under the given state of the singing melody sentence defined analogously. If the output probability of the neural network obeys a single Gaussian distribution, then formula (25) can be rewritten as

From the statistical features of the output probability (Figure 3), it can be learned that λŵi is a piecewise-constant (jump) sequence, because the rhythm states of a melody are discrete and independent. The music melody signal reconstructed from λŵi therefore has discontinuous boundaries between rhythm states. To solve this problem, this paper introduces an observation vector u, which covers the static acoustic feature and its first- and second-order derivatives with respect to time:
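A conventional construction of the observation vector with dynamic features (assumed here for illustration) is
$$u_\tau = \left[c_\tau^{\top},\ \Delta c_\tau^{\top},\ \Delta^2 c_\tau^{\top}\right]^{\top}, \qquad \Delta c_\tau = \frac{c_{\tau+1} - c_{\tau-1}}{2}, \qquad \Delta^2 c_\tau = c_{\tau+1} - 2c_\tau + c_{\tau-1},$$
where c_τ is the static acoustic feature at frame τ.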

Let SC be the sparse coefficient matrix. Then, the relationship between the observation vector sequence and the acoustic eigenvector sequence can be described by

Combining formulas (24) and (26):

The maximization of the output probability is equivalent to finding the maximum of θ:

Taking the partial derivative with respect to θ in formula (28):

Setting formula (29) equal to 0, a linear equation about θ can be obtained:
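In the standard maximum-likelihood parameter generation framework (assumed here, since the original equation is not reproduced), this linear equation takes the closed form
$$SC^{\top} \Sigma^{-1} SC\, \theta = SC^{\top} \Sigma^{-1} \lambda,$$
where λ and Σ denote the mean vectors and covariance matrices stacked over the state sequence; solving this (typically band-diagonal) system yields the smooth static feature trajectory θ.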

Unlike other deep neural networks, the RNN combines the output of the input layer and the output of the hidden layer at the previous moment into the input of the hidden layer. Therefore, the network can capture the dynamic law of sequential music melodies from the periodic connections between hidden layer nodes. Let ωga, ωbg, and ωgg be the weight matrices of the input-hidden, hidden-output, and hidden-hidden connections, respectively; γg and γb be the bias vectors of the hidden layer and output layer, respectively; G(·) be the activation function between hidden layers; and {aτ}ψτ=1, {hτ}ψτ=1, and {bτ}ψτ=1 be the input music melody features, hidden layer sequence, and output features, respectively. Then, the hidden layer sequence can be expressed as

Besides, bτ can be given by
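Using the weight matrices defined above, the conventional forward equations of such an RNN (a standard form is assumed) are
$$h_\tau = G\left(\omega_{ga} a_\tau + \omega_{gg} h_{\tau-1} + \gamma_g\right), \qquad b_\tau = \omega_{bg} h_\tau + \gamma_b.$$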

For the traditional RNN, the vanishing gradient problem might occur during network training, owing to the use of the backpropagation algorithm. To prevent this problem, the LSTM, a recurrent network designed for time series, was adopted to synthesize the teaching audios including both singing melody and music melody. As shown in Figure 4, an LSTM unit contains an input gate IGτ, a forget gate FGτ, and an output gate OGτ, as well as a memory cell MCτ. Let hτ and aτ be the hidden layer output and input signal of the network at the moment τ, respectively; ωI and ωXH be the weight matrices of input layer nodes and hidden layer nodes, respectively; CPH and γ be the weight and bias, respectively; and ⊙ be the Hadamard (element-wise) product. Then, the operations of the input gate IGτ, the forget gate FGτ, the memory cell MCτ, and the output gate OGτ can be, respectively, expressed as

Then, the hidden layer output hτ can be calculated by
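The standard LSTM gate equations, written with the symbols defined above (the conventional parameterization is assumed, since the original equations are not reproduced), are
$$IG_\tau = \sigma\left(\omega_I^{(i)} a_\tau + \omega_{XH}^{(i)} h_{\tau-1} + \gamma^{(i)}\right), \qquad FG_\tau = \sigma\left(\omega_I^{(f)} a_\tau + \omega_{XH}^{(f)} h_{\tau-1} + \gamma^{(f)}\right),$$
$$MC_\tau = FG_\tau \odot MC_{\tau-1} + IG_\tau \odot \tanh\left(\omega_I^{(c)} a_\tau + \omega_{XH}^{(c)} h_{\tau-1} + \gamma^{(c)}\right),$$
$$OG_\tau = \sigma\left(\omega_I^{(o)} a_\tau + \omega_{XH}^{(o)} h_{\tau-1} + \gamma^{(o)}\right), \qquad h_\tau = OG_\tau \odot \tanh(MC_\tau),$$
where σ(·) is the logistic sigmoid.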

To reduce the delay of the singing melody relative to the music melody, the LSTM was adopted to build the acoustic model, and recurrent output layers were configured to further smooth the acoustic features between adjacent frames. Figure 5 shows the framework of the low-delay synthesis model of singing melody relative to music melody. Let ωbb be the weight matrix of the recurrent connections of the output layer extended from the traditional RNN. Then, we have

Traditionally, dynamic features are needed to smooth the parameter trajectories; with recurrent output layers, smooth parameter trajectories can be obtained directly by smoothing the static acoustic parameters. The LSTM-based recurrent output layers receive the activation of the hidden layer and the output bτ−1 at the moment τ − 1, process them with the activation function and the input gate operation, and save some of the information to the state of the memory cell:

The state MCτ of the memory cell at time τ can be obtained through forget gate operation and scrapping some useless information:

Finally, the network output bτ at time τ can be obtained through output gate operation of MCτ:
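To make the architecture concrete, the following PyTorch sketch stacks an LSTM acoustic model with an additional LSTM output layer that smooths adjacent frames, in the spirit of Figure 5; the layer sizes and feature dimensions are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LowDelayMelodySynthesizer(nn.Module):
    """LSTM acoustic model with a recurrent (LSTM) output layer."""

    def __init__(self, in_dim=60, hidden_dim=256, out_dim=87):
        super().__init__()
        # Hidden LSTM layers model the time-series dependence of the melody.
        self.encoder = nn.LSTM(in_dim, hidden_dim, num_layers=2, batch_first=True)
        # Recurrent output layer smooths static acoustic features across frames,
        # removing the need for explicit dynamic (delta) features.
        self.output_rnn = nn.LSTM(hidden_dim, out_dim, num_layers=1, batch_first=True)

    def forward(self, linguistic_feats):
        # linguistic_feats: (batch, frames, in_dim) frame-level input features.
        hidden, _ = self.encoder(linguistic_feats)
        acoustic, _ = self.output_rnn(hidden)
        return acoustic  # (batch, frames, out_dim) static acoustic features


# Usage sketch: one 200-frame utterance with the assumed feature dimensions.
model = LowDelayMelodySynthesizer()
feats = torch.randn(1, 200, 60)
pred = model(feats)
print(pred.shape)  # torch.Size([1, 200, 87])
```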

5. Experiments and Result Analysis

To compare the convergence of different music melody synthesis models, Figure 6 shows the trend of the loss function value on the verification set of four models: the deep convolutional neural network (DCNN) based on static acoustic features, the LSTM, the LSTM-RNN, and our model. The value of the loss function is the difference between the model output and the actual value. Extended from AlexNet, the DCNN boasts deep layers and numerous parameters and has been widely applied to signal recognition and image processing. As an RNN, the LSTM is suitable for processing and predicting important events with relatively long intervals and delays in a time series. The network has been adopted in many fields, such as language learning and translation, robotic control, image analysis, document summarization, speech and image recognition, handwriting recognition, chatbot control, disease prediction, click-through rate and stock prediction, and music synthesis.

The loss of each model continued to decrease with the growing number of iterations and eventually converged. After convergence, the DCNN had the greatest loss, the LSTM and LSTM-RNN had similar losses, and the proposed low-delay LSTM-LSTM realized the smallest loss.

To compare the music melody generated by our model with the original music melody, forty segments of music melodies were randomly selected from a test set containing 577 segments. The music melodies synthesized by the different networks were objectively measured by four metrics: band aperiodicity (BAP) distortion, root mean square error of the fundamental frequency (F0 RMSE), LE, and mel-cepstral distortion (MCD). The results in Table 1 show that MCD had the greatest influence on the synthetic music melodies. Overall, our model, which further smooths the acoustic features between adjacent frames with recurrent output layers, outperformed the other networks, as evidenced by the small gaps in the four metrics relative to the original melodies. Therefore, our model is highly robust in finding the suitable singing segments for teaching.
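For clarity, two of the objective metrics above can be computed as in the following Python sketch; it assumes frame-aligned features and is not the exact evaluation script used in the paper.

```python
import numpy as np

def f0_rmse_cents(f0_ref, f0_syn, eps=1e-8):
    """Root-mean-square F0 error over frames voiced in both signals, in cents."""
    voiced = (f0_ref > 0) & (f0_syn > 0)
    diff = 1200.0 * np.log2((f0_syn[voiced] + eps) / (f0_ref[voiced] + eps))
    return np.sqrt(np.mean(diff ** 2))

def mel_cepstral_distortion(mc_ref, mc_syn):
    """Mel-cepstral distortion in dB, excluding the 0th (energy) coefficient."""
    diff = mc_ref[:, 1:] - mc_syn[:, 1:]
    return np.mean(10.0 / np.log(10.0) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))
```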

Table 2 presents the errors of different model configurations in predicting phoneme length. It can be seen that the optimal configuration is our model with four layers, whose RMSE was 5.19 and whose sum of RMSE and cross entropy was 4.18. Besides, our model had a better phoneme synthesis effect than the DCNN and LSTM, a sign of its superiority in modeling the music melody sequence. Based on RMSE + cross entropy, the least mean square (LMS) of the synthetic melody can be predicted, which effectively reduces the RMSE of phoneme length prediction. However, the predicted phoneme length might deviate from the actual length of the music melody, so it is important to apply a constraint on the phoneme length of the two melodies. Table 3 shows the errors in the phoneme length before and after applying the constraint.

As shown in Table 3, before the constraint was applied, the predicted phoneme length was inconsistent with the given value. After applying the constraint, the mean error between the predicted value and the given value dropped. This proves the reasonability of introducing the constraint on the phoneme level.

Table 4 presents the prediction errors of acoustic features of different model configurations on the test set. As shown in Tables 3 and 4, our model achieved a lower prediction error of the acoustic features of the test set than LSTM and DNN, during the synthesis of singing and music melodies. This means that our model can establish a good time series dependence and thus achieve an ideal synthesis effect.

For the proposed low-delay synthesis model of singing melody relative to music melody, it is important to evaluate the influence of the decoding consistency between singing melody and music melody on the modeling accuracy of acoustic parameters in the synthetic melody. For this purpose, a contrastive experiment was designed to compare the melody generated from natural music melody and that generated from score notes, under different lengths of historical access points (HAPs).

Table 5 shows the F0 values under different HAP lengths. As shown in Table 5, the F0 RMSE and F0 Pearson correlation did not change with the utilization rate of historical frames and remained independent of the type of source melody (natural melody or score notes). In addition, the HAP length had a limited influence on F0 Pearson correlation and LE.

Next, the mean length of the sliding window was set to 5, 10, 15, and 20 frames in turn for the end processing module. After verification and optimization, it was found that the window of 15 frames led to the best experimental results. Table 6 presents the prediction errors of different models in fundamental frequency and spectrum. Taking the melody generated from score notes as the reference, the melody synthesized by our model had a lower F0 RMSE and a higher F0 Pearson correlation than those obtained by the DCNN and LSTM; that is, our model can find a singing melody with better tonal consistency with the music melody.

6. Conclusions

Based on the RNN algorithm, this paper probes deep into the automatic synthesis of music teaching melodies. After extracting the acoustic features from music melodies, the authors established a sequence-to-sequence model for synthesizing general music melodies. To find the suitable singing segments for a given music melody in the teaching scenario, an RNN was set up to synthesize music melody together with singing melody. After that, the convergence of different network models was compared through experiments, which verifies the feasibility of our model. In addition, the results of different models were compared before and after adding the singing melody, and the difference between the melody generated by our model and the original music melody was quantified accurately. Furthermore, the prediction error of phoneme duration for each model configuration, with and without the time constraint, was obtained through experiments. The relevant results confirm the superiority of our model over the DCNN and LSTM in modeling the music melody sequence.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.