Abstract

This paper presents a method to perform chord classification from recorded audio. The signal harmonics are obtained by using the Fast Fourier Transform, and timbral information is suppressed by spectral whitening. Multiple fundamental frequency estimation is then performed on the whitened data by summing harmonics attenuated by a weighting function. This paper proposes a method that performs feature selection by thresholding the uncertainty of all frequency bins. Bins whose uncertainty falls below the threshold are removed from the signal in the frequency domain. This allows a reduction of 95.53% of the signal characteristics, and the remaining 4.47% of frequency bins are used as enhanced inputs to the classifier. An Artificial Neural Network was utilized to classify four types of chords: major, minor, major 7th, and minor 7th. These, played over the twelve musical notes, give a total of 48 different chords. Two reference methods (based on Hidden Markov Models) were compared with the method proposed in this paper using the same database for the evaluation test. In most of the performed tests, the proposed method achieved a reasonably high performance, with an accuracy of 93%.

1. Introduction

A chord, by definition, is a harmonic set of two or more musical notes that are heard as if they were sounding simultaneously [1]. A musical note refers to the pitch class set {C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, B}, and the interval between adjacent notes is known as a half-note or semitone interval. Thus, chords can be seen as musical features, and they are the principal harmonic content that describes a musical piece [2, 3].

A chord has a basic construction known as a triad, which includes notes identified as the fundamental (the root), a third, and a fifth [4]. The root can be any note chosen from the pitch class set; it is used as the first note to construct the chord and gives the chord its name. The third has the function of making the chord minor or major: for a minor chord, the third is located at 3 half-note intervals from the root, whereas a major chord has the third placed at 4 half-note intervals from the root. The perfect fifth, which completes the triad, is located at 7 half-note intervals from the root. If a note is added to the triad at 11 half-note intervals from the root, the chord becomes a major 7th chord (for a minor 7th chord, the added note lies at 10 half-note intervals). For instance, a C major chord (C) is composed of the root note C, the major third E, and the perfect fifth G; the C major 7th chord (Cmaj7) is composed of the same triad plus the 7th note B.

Chord arrangements, melody, and lyrics can be grouped in written summaries known as lead sheets [5]. All kinds of musicians, from professionals to amateurs, make use of these sheets because they provide additional information about when and how to play the chords or other arrangements of a melody.

Writing lead sheets of chords by hand is a task known as chord transcription. It can only be performed by an expert; however, it is a time-consuming and expensive process. In engineering, the automation of chord transcription has been considered a high-level task and has applications such as key detection [6, 7], cover song identification [8], and audio-to-lyrics alignment [9].

Chord transcription requires recognizing or estimating the chord from an audio file by applying some signal preprocessing. The most common method for chord recognition is based on templates [10, 11], where a template is a vector of numbers. This method suggests that only the chord definition is necessary to achieve recognition. The simplest chord template has a binary structure: the notes that belong to the chord have unit amplitude, and the remaining ones have null amplitude. This template is described by a 12-dimensional vector; each number in the vector represents a semitone of the chromatic scale or pitch class set. As an illustration, the C major chord template is [1 0 0 0 1 0 0 1 0 0 0 0], where the ones mark the pitch classes C, E, and G. The 12-dimensional vectors obtained from an audio frame are known as chroma vectors, and they were proposed by Fujishima [12] for chord recognition using templates. In his work, chroma vectors are obtained from the Discrete Fourier Transform (DFT) of the input signal. Fujishima's method (Pitch Class Profile, PCP) is based on an intensity map of the Simple Auditory Model (SAM) of Leman [13]. This allows the chroma vector to be formed by the energy of the twelve semitones of the chromatic scale. In order to perform chord recognition, two matching methods were tested: Nearest Neighbors [14] (Euclidean distance between template vectors and chroma vectors) and the Weighted Sum Method (dot product between chroma vectors and templates).
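To make the template idea concrete, the following sketch builds binary templates for the four chord types used later in this paper and matches a chroma vector against them. This is illustrative code of our own (the helper names and the norm-based score are assumptions), not the implementation of [12]:

```python
import numpy as np

# Binary chord templates over the 12 pitch classes (C, C#, ..., B).
def make_template(root, intervals):
    t = np.zeros(12)
    t[(root + np.array(intervals)) % 12] = 1.0
    return t

CHORD_INTERVALS = {
    "maj":  [0, 4, 7],       # root, major third, perfect fifth
    "min":  [0, 3, 7],       # root, minor third, perfect fifth
    "maj7": [0, 4, 7, 11],   # major triad plus major seventh
    "min7": [0, 3, 7, 10],   # minor triad plus minor seventh
}

templates = {(root, name): make_template(root, iv)
             for root in range(12) for name, iv in CHORD_INTERVALS.items()}

def weighted_sum_match(chroma):
    # Dot product between chroma and each template (Weighted Sum Method);
    # dividing by the template norm avoids a bias toward 4-note chords.
    return max(templates,
               key=lambda c: chroma @ templates[c] / np.linalg.norm(templates[c]))

def nearest_neighbor_match(chroma):
    # Euclidean distance between chroma and each template (Nearest Neighbors).
    return min(templates, key=lambda c: np.linalg.norm(chroma - templates[c]))

# A synthetic chroma vector dominated by C, E, G is labeled as C major.
chroma = np.zeros(12)
chroma[[0, 4, 7]] = [1.0, 0.8, 0.9]
print(weighted_sum_match(chroma))   # (0, 'maj')
```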

Lee [11] applied the Harmonic Product Spectrum (HPS) [15] to propose the Enhanced Pitch Class Profile (EPCP). In his study, chord recognition is performed by maximizing the correlation between chord templates and chroma vectors.

Template matching models have poor recognition performance on real-life songs because chords change over time, so a chroma vector may contain semitones from two different chords. Therefore, statistical models became popular methods for chord recognition [16–18]. Hidden Markov Models (HMMs) [19, 20] are probabilistic models in which a sequence of observed variables, assumed to be independent of each other, is generated by an underlying sequence of hidden variables related to the observations.

Barbancho et al. [21] proposed a method using an HMM to perform transcription of guitar chords. The chord types used in their study are major, minor, major 7th, and minor 7th for each root of the pitch class set, a total of 48 chord types. All of them can be played in many different forms; that is, several finger positions can be used to play the same chord. In their work, 330 different forms for the 48 chord types are proposed (for details see the reference); in this case, every single form is a hidden state. Feature extraction is achieved by the algorithm presented by Klapuri [22], and a model that constrains the transitions between consecutive forms is proposed. Additionally, a cost function that measures the physical difficulty of moving from one chord form to another is developed. Their method was evaluated using recordings from three musical instruments: an acoustic guitar, an electric guitar, and a Spanish guitar.

Ryynänen and Klapuri [23] proposed a method using an HMM to perform melody transcription and classification of the bass line and chords in polyphonic music. In this case, fundamental frequencies ($f_0$'s) are found using the estimator in [22]; after that, they are passed through a PCP algorithm in order to enhance them. An HMM of 24 states (12 states for major chords and 12 states for minor ones) is defined, and the optimal state sequence is found using the Viterbi algorithm [24]. The method does not detect silent segments; however, it provides a chord label for each analyzed frame.

Most of the aforementioned methods achieve low accuracies; the most recent one, the method of Barbancho et al. [21], achieves high accuracy by combining probabilistic models. However, the use of an HMM together with several probabilistic models makes that method somewhat complex.

In this paper, we propose a method based on Artificial Neural Networks (ANNs) to classify chords from recorded audio. This method classifies chords from any octave for a six-string standard guitar. The chord types are major, minor, major 7th, and minor 7th, that is, the same chord variants used by Barbancho et al. [21]. First, time signals are converted to the frequency domain, and timbral information is suppressed by spectral whitening. For feature selection, we propose an algorithm that measures the uncertainty of the frequency bins. This reduces the dimensionality of the input signal and enhances the relevant components to improve the accuracy of the classifier. Finally, the extracted information is sent to an ANN to be classified. Our method avoids the calculation of transition probabilities and combinations of probabilistic models; nevertheless, the accuracy achieved in this study is superior to that of most of the cited methods.

The rest of this paper is organized as follows. In Section 2, fundamental concepts related to this study are presented. Section 3 details the theoretical aspects of the proposed method. Section 4 presents experimental results that validate the proposed method, and Section 5 includes our conclusions and directions for future work.

2. General Concepts

For clarity purposes, this section presents two important concepts widely used in Digital Signal Processing (DSP). These concepts are the Fourier Transform and spectral whitening.

2.1. Transformation to Frequency Domain

The human hearing system is capable of performing a transformation from the time domain to the frequency domain. There is evidence that humans are more sensitive to magnitude than to phase information [25]; as a consequence, humans can perceive harmonic information. This is the main idea behind the classification of guitar audio signals in this work. Therefore, a frequency domain representation of the original signal has to be calculated.

The time to frequency domain transformation is obtained by applying the Fast Fourier Transform (FFT) to the input signal $x(n)$ and is represented by

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1. \tag{1}$$

Equation (1) describes the transformation of $x(n)$ over all time. However, this is not convenient because songs, or signals in general, are not stationary. For this reason, a window function, $w(n)$, is applied to the time signal as

$$x_i(n) = x(n + iH)\, w(n), \tag{2}$$

where $w(n)$, for this study, is the Hamming window function according to

$$w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \tag{3}$$

where $0 \le n \le N-1$, $i$ is the frame index, $H$ is the hop size, and $N$ is the number of samples in the analysis frame. A study about the use of different window types can be found in Harris [26]. Equations (2) and (3) divide the signal into frames, allowing the analysis of the signal in the frequency domain by

$$X_i(k) = \sum_{n=0}^{N-1} x_i(n)\, e^{-j 2\pi k n / N}. \tag{4}$$

For this work, the window functions overlap by 50% so that the entire signal is analyzed, yielding a set of frames (for notational simplicity, $x_i$ denotes the $i$th frame). Those frames can be concatenated to construct a matrix, and the FFT is then computed for every column. The result is a representation in the frequency domain as in Figure 1, known as a spectrogram [27]. This is the format in which the signals are presented to the classifier for training.
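As a concrete illustration of (2)–(4), the following sketch (our own, with the frame length N = 4096 used in Section 4 assumed) builds the matrix of Hamming-windowed frames with 50% overlap and computes one FFT per column:

```python
import numpy as np

def spectrogram(x, N=4096):
    hop = N // 2                          # 50% overlap
    w = np.hamming(N)                     # Hamming window, eq. (3)
    n_frames = (len(x) - N) // hop + 1
    frames = np.stack([x[i * hop : i * hop + N] * w
                       for i in range(n_frames)], axis=1)   # eq. (2)
    return np.fft.fft(frames, axis=0)     # eq. (4), one FFT per column

fs = 44100
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 196.0 * t)         # one second of a test tone (G3)
X = spectrogram(x)
print(X.shape)                            # (4096, number_of_frames)
```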

2.2. Spectral Whitening

This process allows obtaining a uniform spectrum of the input signal, and it is achieved by scaling the frequency bins of the FFT. There exist different methods to perform spectral whitening [28–31].

Inverse filtering [22] is the whitening method used in our experiments; it is described next.

First, the original windowed signal is zero-padded to twice its length as

$$x_{zp}(n) = \begin{cases} x_i(n), & 0 \le n \le N-1, \\ 0, & N \le n \le 2N-1, \end{cases} \tag{5}$$

and its FFT, represented by $X_{zp}(k)$, is calculated. The resulting frequency spectrum has an improved amplitude estimation because of the zero-padding. Next, a filter bank is applied to $X_{zp}(k)$; the central frequencies of this bank are given by

$$c_b = 229 \times \left(10^{(b+1)/21.4} - 1\right), \tag{6}$$

where $b$ is the band index. In this case, each filter in the bank has a triangular response $H_b(k)$; in fact, this bank tries to simulate the inner ear basilar membrane. The band-pass frequencies for each filter range from $c_{b-1}$ to $c_{b+1}$. Because there is no relevant information at frequencies higher than 7000 Hz, the maximum value of $b$ was chosen so that the central frequencies do not exceed this limit.

Subsequently, the standard deviations are calculated as

$$\sigma_b = \left(\frac{1}{K} \sum_{k} H_b(k)\, |X_{zp}(k)|^2 \right)^{1/2}, \tag{7}$$

where the uppercase $K$ is the length of the FFT series.

Later on, the compression coefficients for the central frequencies are calculated as $\gamma_b = \sigma_b^{\nu - 1}$, where $\nu$ is the amount of spectral whitening applied to the signal. The coefficients $\gamma_b$ are those that belong to the frequency bin of the "peak" of each triangular response; observe Figure 2. The coefficients $\gamma(k)$ for the remaining frequency bins are obtained by performing a linear interpolation between the central frequency coefficients $\gamma_b$.

Finally, the white spectrum is obtained by a pointwise multiplication of the compression coefficients with the zero-padded spectrum as

$$Y(k) = \gamma(k)\, X_{zp}(k). \tag{8}$$
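The whitening chain of (5)–(8) can be sketched as follows. The band-edge convention, the band count, and ν = 0.33 follow our reading of Klapuri [22], so this is an illustrative approximation rather than the exact implementation:

```python
import numpy as np

def whiten(X, fs, nu=0.33, n_bands=30):
    # X: FFT of the zero-padded frame; fs: sampling frequency.
    K = len(X)
    freqs = np.arange(K) * fs / K
    # Central frequencies of the filter bank, eq. (6).
    cb = 229.0 * (10.0 ** ((np.arange(n_bands + 2) + 1) / 21.4) - 1.0)
    gamma_b, centers = [], []
    for b in range(1, n_bands + 1):
        lo, c, hi = cb[b - 1], cb[b], cb[b + 1]
        # Triangular response H_b(k) spanning [c_{b-1}, c_{b+1}].
        H = np.clip(np.minimum((freqs - lo) / (c - lo),
                               (hi - freqs) / (hi - c)), 0.0, None)
        # Per-band deviation, eq. (7); the epsilon guards empty bands.
        sigma = np.sqrt((H * np.abs(X) ** 2).sum() / K + 1e-12)
        gamma_b.append(sigma ** (nu - 1.0))   # compression coefficient
        centers.append(c)
    # Linear interpolation of the coefficients over all bins, then eq. (8).
    gamma = np.interp(freqs, centers, gamma_b)
    return gamma * X
```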

3. Proposed Method

Our proposed method is described in the block diagram shown in Figure 3. The method begins by defining the columns of a matrix $\mathbf{A}$ as

$$\mathbf{A} = [\,\mathbf{x}_1\ \ \mathbf{x}_2\ \ \cdots\ \ \mathbf{x}_Q\,], \tag{9}$$

where a single column vector $\mathbf{x}_i$ represents the $i$th Hamming-windowed audio sample. These columns are zero-padded to twice their length as

$$\mathbf{A}_{zp} = \begin{bmatrix} \mathbf{A} \\ \mathbf{0} \end{bmatrix}, \tag{10}$$

where $\mathbf{0}$ is a zero matrix of the same size as $\mathbf{A}$. Then, (10) indicates an augmented matrix.

After that, the signal spectrum for every column of $\mathbf{A}_{zp}$ is calculated by applying the FFT, and then these columns are passed through a spectral whitening step; the output matrix is represented as $\mathbf{Y}$. Furthermore, by taking advantage of the symmetry of the FFT, only the first half of the frequency spectrum of $\mathbf{Y}$ is taken in order to perform the analysis.

A multiple fundamental frequency estimation algorithm and a weighting function are applied to the whitened audio signals. These algorithms enhance the fundamental frequencies by adding their harmonics, attenuated by the weighting function. The output matrix of this step is denoted as $\hat{\mathbf{Y}}$.

The training set includes all data in a matrix of 4096 frequency bins by $Q$ audio samples, where each row, or frequency bin, is an input to the classifier. The number of inputs can be reduced from 4096 to 183 (a $183 \times Q$ matrix) by applying a method based on the uncertainty of the frequency bins, thus enhancing the pertinent information to perform the classification. Finally, the enhanced data are used to train the classifier and then to validate its performance.

3.1. Multiple Fundamental Frequency Estimation

The fundamental frequencies of the semitones in the guitar are defined by

$$f_m = f_{\min} \cdot 2^{m/12}, \tag{11}$$

where $m = 0, 1, 2, \ldots$ and $f_{\min}$ is the minimum frequency to be considered; for example, in a standard six-string guitar, the lowest note is E, having a frequency of 82 Hz.

Signal theory establishes that the harmonic partials (or just harmonics) of a fundamental frequency $f_0$ are defined by

$$f_h = h \cdot f_0, \tag{12}$$

where $h = 1, 2, \ldots, H$. In this study, $H$ represents the number of harmonics to be considered. As an illustration, for a fundamental frequency $f_0 = 82$ Hz of an E note, the harmonics for $h = 2, 3, 4$ are 164, 246, and 328 Hz.
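Equations (11) and (12), together with the acceptance tolerance described below, are simple to compute. The following sketch is our own illustration (the helper names are assumptions):

```python
import numpy as np

f_min = 82.0                                       # low E of the guitar
semitones = f_min * 2.0 ** (np.arange(49) / 12.0)  # eq. (11), four octaves

def harmonics(f0, H):
    return f0 * np.arange(1, H + 1)                # eq. (12)

def nearest_semitone(f):
    # A frequency is accepted when it lies within roughly half a
    # semitone (about 3%) of a semitone-grid frequency.
    idx = np.argmin(np.abs(semitones - f))
    return semitones[idx], abs(f - semitones[idx]) / semitones[idx] < 0.03

print(harmonics(82.0, 4))       # [ 82. 164. 246. 328.]
print(nearest_semitone(196.5))  # close to G (about 196 Hz) -> accepted
```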

In this work, if a frequency is located within roughly half a semitone (about 3%) of one of the semitone frequencies, then this frequency is considered to be correct. This approach was proposed in [22].

In the $i$th frame under analysis, fundamental frequencies can be raised if their harmonics are added to the fundamentals [22] by applying

$$\hat{Y}_i(k) = \sum_{h=1}^{H} Y_i(hk); \tag{13}$$

then, all the harmonics and their fundamental frequency, described in (13), are removed from the frequency spectrum. When the resulting signal is analyzed again with the described method, a different fundamental frequency will be raised.

A common issue with (13) arises when two or more fundamentals share the same harmonic. For instance, the fundamental frequency of 65.5 Hz of a C note has a harmonic located at 196.5 Hz. When the Euclidean distances [32] between the analyzed frequency and the semitone frequencies are computed, the minimum distance, or nearest frequency, corresponds to the G note (about 196 Hz). This implies that if those two notes are present in the same analysis frame, then this harmonic of C will be summed and eliminated together with the harmonics of the G note, because the 196.5 Hz harmonic is located within the tolerance of the frequency of a G note.

There are some methods that deal with this problem. In [33], a technique that makes use of a moving average filter is proposed. In that work, the fundamental frequency keeps its original amplitude while a moving average filter modifies the amplitude of its harmonics; then, only part of their amplitude is removed from the original frequency spectrum.

In [22], a weighting function that modifies the amplitude of the harmonics is proposed, together with an algorithm to find multiple fundamental frequencies. The weighting function is given by

$$g(\tau, h) = \frac{f_s/\tau + \alpha}{h\, f_s/\tau + \beta}, \tag{14}$$

where $f_s/\tau$ is the fundamental frequency under analysis (bounded below by the low-limit frequency, e.g., 82 Hz), $\tau$ is the fundamental period, $f_s$ is the sampling frequency, and $h$ is the harmonic index. The parameters $\alpha$ and $\beta$ are used to optimize the function and minimize the amplitude estimation error (see [22] for details). In the work [22], the $f_0$ analyzed in a whitened signal is used to find its harmonics with

$$s(\tau) = \sum_{h=1}^{H} g(\tau, h)\, \max_{k \in \kappa_{\tau,h}} Y_R(k), \tag{15}$$

where $\kappa_{\tau,h}$ is a range of frequency bins in the vicinity of the $h$th harmonic of the $f_0$ analyzed. The set $\kappa_{\tau,h}$ indicates that the signal spectrum is divided into analysis blocks to find the fundamental frequencies. Thus, $s(\tau)$ becomes a linear function of the magnitude spectrum $Y_R(k)$. Then, a residual spectrum $Y_R(k)$ is initialized to $Y(k)$, and a fundamental period $\hat{\tau}$ is estimated using (15). The harmonics of $\hat{\tau}$ are found in $Y_R(k)$, and then they are added to a vector $Y_D(k)$ in their corresponding positions of the spectrum. The new residual spectrum is calculated as

$$Y_R(k) \leftarrow \max\bigl(0,\; Y_R(k) - d\, Y_D(k)\bigr), \tag{16}$$

where $d$ is the amount of subtraction. This process iteratively computes a different fundamental frequency using the methodology described above. The algorithm finishes when there are no more harmonics in $Y_R(k)$ to be analyzed. Equation (15) was adapted to keep the notation of our work; refer to [22] for further analysis.

In this study, we propose a modification of Klapuri's algorithm in an attempt to achieve a better estimate of the multiple fundamental frequencies. Using (14) and the $i$th whitened signal $Y_i(k)$, the multiple fundamental frequencies can be found by using

$$\hat{Y}_i(k) = Y_i(k) + \sum_{h=2}^{H} g(f_k, h)\, Y_i(hk), \tag{17}$$

where $f_k$ is the frequency of the $k$th bin and $hk \le K/2$ for $h = 2, \ldots, H$. Equation (17) analyzes every frequency bin and its harmonics in the signal spectrum. This equation adds to the $k$th frequency bin all of its harmonics in the entire spectrum. Besides, the weighting function performs an estimation of the harmonic amplitude that must be added to the $k$th frequency bin. Observe that the weighting function does not modify the original amplitudes of the harmonics themselves.
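The proposed enhancement of (17) can be sketched as follows. The frequency-domain form of the weighting function is our interpretation of (14), and α = 52 and β = 320 are the values reported in Section 4:

```python
import numpy as np

def enhance(Y, fs, H=5, alpha=52.0, beta=320.0):
    # Y: whitened half-spectrum of one frame; fs: sampling frequency.
    K2 = len(Y)                               # half-spectrum length
    freqs = np.arange(K2) * fs / (2 * K2)     # frequency of each bin
    Yhat = Y.copy()
    for k in range(1, K2):
        f0 = freqs[k]                         # treat bin k as a candidate f0
        for h in range(2, H + 1):
            if h * k >= K2:
                break                         # harmonic falls outside the spectrum
            g = (f0 + alpha) / (h * f0 + beta)   # weighting, eq. (14)
            Yhat[k] += g * Y[h * k]              # add attenuated harmonic, eq. (17)
    return Yhat
```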

Finally, when all frequency bins have been analyzed, the resulting signal has all its fundamental frequencies at high amplitude. This helps the classifier achieve an accurate performance.

3.2. Feature Selection

The classifier operates on frequency information: its inputs are the frequency bins that come from the FFT. However, not all frequency bins carry relevant information. Therefore, a method to remove unnecessary data and enhance the relevant data has to be applied. This results in a reduction of the number of inputs to the classifier.

We propose a method based on the uncertainty of the frequency bins, which discriminates all those bins that are not relevant to the classifier in order to improve its performance.

In Wei [34], it is stated that, similarly to the entropy, the variance can be considered a measure of uncertainty of a random variable if and only if the distribution has one central tendency. The histograms of all frequency bins of the 48 chord types were calculated to verify whether the distributions could be approximated by a distribution with a single central tendency. For simplicity, Figure 4 shows the distribution of one frequency bin of a major and a minor chord, respectively; it can be seen that each distribution fits a Gaussian. The same behavior was observed in the samples of the other chords. This indicates that the variance can be used in this study as an uncertainty measure of the frequency bins.

In order to perform the feature selection using the uncertainty of the frequency bins, first consider a matrix $\mathbf{B}$ defined by

$$\mathbf{B} = [\,\mathbf{b}_1\ \ \mathbf{b}_2\ \ \cdots\ \ \mathbf{b}_P\,]^{T}, \tag{18}$$

where $\mathbf{b}_j$ is a vector formed by the magnitudes of the $j$th frequency bin of all $Q$ audio samples. The variance of each $\mathbf{b}_j$ can be computed with

$$\sigma_j^2 = \frac{1}{Q} \sum_{q=1}^{Q} \bigl(b_{j,q} - \mu_j\bigr)^2, \tag{19}$$

where

$$\mu_j = \frac{1}{Q} \sum_{q=1}^{Q} b_{j,q}. \tag{20}$$

If $\sigma_j^2 < \theta\, \sigma_{\max}^2$, with $\theta$ a threshold and

$$\sigma_{\max}^2 = \max_{j}\ \sigma_j^2, \tag{21}$$

then the input for that particular frequency bin is quasi-constant; consequently, this frequency bin can be eliminated from all audio samples. This can be achieved if we consider a vector $\mathbf{v}$ formed with the indexes of $\mathbf{B}$ that are defined by

$$\mathbf{v} = \bigl\{\, j \mid \sigma_j^2 \ge \theta\, \sigma_{\max}^2 \,\bigr\}. \tag{22}$$

Once feature selection has been performed, the remaining frequency bins form the input to the classifier.
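In matrix form, the selection rule of (18)–(22) reduces to a few lines. The sketch below assumes B is the training matrix with one row per frequency bin and one column per audio sample, and uses the threshold θ = 0.01326 reported in Section 4:

```python
import numpy as np

def select_features(B, theta=0.01326):
    # Per-bin variance across all audio samples, eqs. (19)-(20).
    variances = B.var(axis=1)
    # Keep bins whose variance reaches theta times the maximum, eqs. (21)-(22).
    keep = variances >= theta * variances.max()
    return np.flatnonzero(keep)   # the index vector v of retained bins

# Usage: compute v once on the training matrix, then apply the same
# indexes to every signal that is classified later.
# v = select_features(B_train)
# B_reduced = B_train[v, :]
```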

3.3. Classifier

Classification is an important part of chord transcription. In order to perform a good classification, relevant data must be extracted from the original information; then, a classification algorithm is able to label the chords. Artificial Neural Networks (ANNs) [35] can be considered "massively parallel computing systems consisting of an extremely large number of simple processors with many interconnections," according to Jain et al. [36]. ANNs have been used in chord recognition as a preprocessing method or as a classification method. Gagnon et al. [37] proposed a method with an ANN to preclassify the number of strings plucked in a chord. Humphrey and Bello [38] used labeled data to train a convolutional neural network. In this study, an Artificial Neural Network was used to perform the classification; Figure 5 shows the configuration of the ANN used in this work. The ANN was trained using the backpropagation algorithm [39].
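For illustration, a comparable classifier can be set up with an off-the-shelf multilayer perceptron. The hidden-layer size and training hyperparameters below are assumptions, since the paper specifies its topology only in Figure 5; the input and output sizes follow Section 4:

```python
from sklearn.neural_network import MLPClassifier

# Feedforward network trained with backpropagation: 183 inputs (the
# selected frequency bins) and 48 chord classes at the output.
clf = MLPClassifier(hidden_layer_sizes=(100,),   # assumed topology
                    activation="logistic",
                    solver="sgd",                # gradient-descent training
                    learning_rate_init=0.01,
                    max_iter=1000)

# X: (n_frames, 183) matrix of selected frequency bins per analysis frame;
# y: integer chord labels in {0, ..., 47}.
# clf.fit(X_train, y_train)
# accuracy = clf.score(X_test, y_test)
```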

4. Experimental Results

Computer simulations were performed to quantitatively evaluate the proposed method. The performance of two state-of-the-art reference methods [21, 23] was compared with that of the present work.

Databases for training and testing containing four chord types (major, minor, major 7th, and minor 7th), with different versions of the same chord, were considered. Electric and acoustic guitar recordings were used to construct the training data set: a total of 25 minutes were recorded from an electric guitar and 30 minutes from an acoustic guitar. The recordings include sets of chords played consecutively, as well as some parts of songs. The database used for evaluation was provided by Barbancho et al. [21]. This database has 14 recordings: 11 recordings from two different Spanish guitars played by two different guitar players, 2 recordings from an electric guitar, and 1 recording from an acoustic guitar, making a total duration of 21 minutes and 50 seconds. The sampling frequency is 44100 Hz for all audio recordings.

The training data set was divided into frames of 93 ms, leading to an FFT of 4096 frequency bins. In the spectral whitening, the signal was zero-padded to twice its length before applying the frequency domain transform, so an FFT of 8192 points was obtained. For the spectral whitening, the parameter $K$ takes the original length of the FFT, but the length of the whitened signals remains at 4096 frequency bins. For the multiple fundamental frequency estimation, the $\alpha$ and $\beta$ parameters are constant and set to 52 and 320, respectively, as in Klapuri [22], while the $d$ parameter was adjusted to improve performance; an optimum value of 0.99 was found. This parameter differs from the value in [22] because, in our method, the signal is modified in every cycle as the index in (15) increases; on the other hand, Klapuri [22] modifies the signal only after the index reaches its highest value.

These processes were applied to all audio samples to build a training data set. In this case, the data set is a matrix of 4096 rows (frequency bins) by one column per audio sample. In (21), the maximum variance over all frequency bins of the audio samples is computed, and (22) proposes a threshold to remove all those frequency bins that remain quasi-constant. For instance, suppose that a threshold $\theta$ is set and the variances of some frequency bins (shown in Table 1) are evaluated; only those above the threshold are taken as inputs to the classifier, as shown in Table 2.

Performance tests were made to find the optimal value of $\theta$: the parameter was varied, the ANN was trained and evaluated, and the process was repeated until the best result was obtained. The optimal value was found to be $\theta = 0.01326$. This allows a 95.53% reduction of the total number of frequency bins while keeping the relevant information; we therefore concluded that this value removes as many frequency bins as possible without losing information required for a correct classification. Figure 6 shows part of the training data set; only frequency bins in a limited range and 3000 audio samples are depicted. Figure 7 shows the same data set as Figure 6 after the feature selection algorithm was applied. It can be observed that the algorithm retains sufficient information to train the classifier.

An ANN was used as the classification method, with 183 inputs and 48 outputs. The performance metric applied was the ratio of the number of correctly classified chords to the total number of analyzed frames.

The validation test had the same structure as the one presented in Figure 3. First, the audio data were loaded. Second, the frequency domain transformation and the spectral whitening were applied to the signal. Finally, the multiple fundamental frequency estimation algorithm was used. At this point, the signal has 4096 frequency bins. To reduce this number, only the bins that satisfy (22) are taken from the signal and then passed to the classifier.

The results of the proposed method, VTH (Variance Threshold), were compared with two state-of-the-art methods; the best results are shown in Table 3. Specifically, 48 chord types with different variants of the same chord were evaluated. For the reference method proposed by Barbancho et al. [21], denoted by PM, experiments with different algorithms were performed; PM includes all the models described next. The PHY model describes the probability of the physical transition between chords; these probabilities are computed by measuring the cost of moving the fingers from one position to another. The MUS model is based on musical transition probabilities, that is, the probabilities of switching between chords, estimated from the first eight albums of The Beatles. The PC model is equal to their proposed method but without the transition probabilities; instead, uniform transition probabilities are used. All models were tested separately, and an accuracy of at most 86% was achieved. The best result, an accuracy of 95%, was obtained by combining all models.

For the reference method proposed by Ryynänen and Klapuri [23], the evaluation results were taken from [21]; in this case, three tests were performed. First, the MM test (only major and minor chords) was carried out; of the three tests, this was the one with the highest accuracy (91%). Second, the MMC test was executed, in which all chords were taken into account but 7th major/minor chords labeled as plain major/minor were counted as correct; that is, a Cmaj7 labeled as C was considered correct. Finally, the CC test was set with all 48 possibilities; that is, 7th major/minor chords labeled as major/minor were incorrect. This results in an accuracy of 70%.

The method proposed in this paper achieves an accuracy of 93% in the evaluation test; this classification performance was validated with a 95% confidence interval. The results are competitive with the two reference methods. Even though Barbancho et al. [21] reach 95% accuracy, this is only achieved when the PHY, MUS, and PC algorithms are combined. Besides, the HMM needs the calculation of transition probabilities between the states of the model (48 chord types). This makes their method more complex than the one presented in this work. This paper focuses only on chord recognition, so the comparison with [21] does not take the finger configuration into consideration.

5. Conclusions

A method to classify chords of an audio signal is proposed in this work. It is based on a frequency domain transformation, where harmonics are the key to finding the fundamental frequencies that compose the input signal. It was found that all frequency bins remaining after feature selection were in the range from 40 Hz to 800 Hz. This means that the relevant information for the classifier is located at the low-frequency end.

The chords considered were major, minor, major 7th, and minor 7th. Two state-of-the-art methods, which used the same chords, were taken for comparison with our study. All computer simulations were performed using the same database. The reference method of Ryynänen and Klapuri [23] had its best performance when only 24 chord types were considered. Our method outperforms the method of Ryynänen and Klapuri by 2%, even though, in our work, 48 chord types were classified. The reference method of Barbancho et al. [21] had an accuracy of 95%; however, they performed a signal analysis to propose two statistical models and a third one that does not consider probability transitions between states. Their best performance is achieved with all models working together; when they are tested separately, the performance is at most 86%. Also, their classification method is based on a Hidden Markov Model that needs interconnected states.

The method presented in this work avoids designing statistical models and the interconnected states of an HMM. The Artificial Neural Network as a classification method works with high precision when the data presented to it have been processed with an appropriate algorithm. The proposed feature selection method achieves high accuracy because the data presented to the classifier contain the pertinent information for training.

The sampling frequency of 44100 Hz and the window length of 4096 samples result in a frequency resolution of approximately 10.8 Hz. With this frequency resolution it is not possible to distinguish the low frequencies of the guitar, for example, an E at 82 Hz from an F at 87 Hz. However, the original signal has six sources (strings), where three of them are octaves of the other three (except for 7th chords). Then, because the proposed method for multiple fundamental frequency estimation adds the harmonics for every single $k$th bin, the higher octaves can be raised. For example, for an E of 82 Hz, the octave at 164 Hz will also be raised; this octave, together with the other fundamentals, yields a correct classification of the chord. In the case of an F, the fundamental at 87 Hz cannot be distinguished from the 82 Hz frequency. Nevertheless, the octave at 174 Hz will be clearly raised; so, together with the other fundamental frequencies of F, the ANN performs a correct classification.
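The bin-sharing argument above can be verified with a few lines; the note frequencies below are the standard equal-tempered values:

```python
import numpy as np

# With fs = 44100 Hz and N = 4096, adjacent low notes E2 and F2 share an
# FFT bin, while their octaves E3 and F3 fall in different bins.
fs, N = 44100, 4096
bin_width = fs / N                        # about 10.77 Hz
notes = {"E2": 82.41, "F2": 87.31, "E3": 164.81, "F3": 174.61}
for name, f in notes.items():
    print(name, round(f / bin_width))     # E2 -> 8, F2 -> 8, E3 -> 15, F3 -> 16
```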

Due to its simplicity, the present work can be applied to chord recognition on devices such as a Field-Programmable Gate Array (FPGA) or a microcontroller. This study leaves the source separation of each string of the guitar for future work. Once a played chord is known, we can make some assumptions about where the hand playing the chord is; thus, we can apply blind source separation methods to obtain the audio of each guitar string. Besides, with the information of the separated strings, the classifier can be extended to a wider set of chord families, because the classification can be performed on a single string instead of the mixture of six strings. This can lead to the complete transcription of guitar chords and the identification of the strings being played.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research has been supported by the "National Council on Science and Technology" of Mexico (CONACYT) under Grant no. 429450/265881 and by the Universidad de Guanajuato through DAIP. The authors would like to thank A. M. Barbancho et al. for providing the database used for comparison.