Abstract

Audio processing has become an inseparable part of modern applications in domains ranging from healthcare to speech-controlled devices, and deep learning plays a vital role in automated audio segmentation. In this article, we discuss audio segmentation based on deep learning. Audio segmentation divides a digital audio signal into a sequence of segments or frames and then classifies these into various classes such as speech, music, or noise. Segmentation thus plays an important role in audio signal processing, and the most important aspect is securing a large amount of high-quality data for training a deep learning network. In this study, various application areas, citation records, year-wise publication counts, and source-wise analyses are computed using the Scopus and Web of Science (WoS) databases. The analysis presented in this paper supports and establishes the significance of deep learning techniques in audio segmentation.

1. Introduction

The fundamental goal of audio segmentation is to divide an audio signal into small segments so that the entities within it may be easily identified. Each segment contains audio data from a certain acoustic category, such as speech, animal voices, music, human activity sounds, environmental sounds, and so on [1]. The level of abstraction in audio class analysis varies depending on the deployment; for example, radio broadcast audio segmentation has focused on detecting speech, silence, and other noise disturbances [2]. The general concept and process of audio segmentation are given in Figure 1. The audio stream is fed into an audio segmentation architecture, an open architecture that can take many different forms [3]. Traditionally, to produce meaningful, high-quality segments, the audio signal is passed through several stages of processing, as shown in Figure 1. The output stream, which carries an accompanying series of segment-level labels, is then transmitted to a routing switch, where each audio type is routed to the appropriate form of post-processing [4]. In broadcast transmissions, speech parts are routed to automatic speech recognizers for linguistic or speaker-role processing, while music parts are routed to a sound effect collection library [5]. For each audio sequence, segmentation is performed by computing multiple features [6]. These features are calculated on an audio segment, a frame, or a set of samples that is a subset of the audio segment.
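As a minimal illustration of this frame-level view, the following Python sketch splits a signal into fixed-length overlapping frames; the frame and hop lengths are illustrative choices, not values prescribed by the works cited above.

```python
import numpy as np

def frame_signal(signal, frame_length=1024, hop_length=512):
    """Split a 1-D audio signal into overlapping fixed-length frames."""
    n_frames = 1 + max(0, (len(signal) - frame_length) // hop_length)
    return np.stack([
        signal[i * hop_length : i * hop_length + frame_length]
        for i in range(n_frames)
    ])  # shape: (n_frames, frame_length)

# Example: one second of a synthetic 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
print(frame_signal(tone).shape)  # (30, 1024) with the defaults above
```

Features for segmentation would then be computed per frame, as described above.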

In recent years, audio segmentation and deep learning have received widespread research attention. Several countries and researchers have successfully applied audio segmentation techniques with different deep learning algorithms in various fields like speech recognition, music analysis, and noise removal [7]. A literature review was carried out by analyzing articles and conference papers published from 2005 to 2021 using the VOSviewer software. One hundred seventy documents were downloaded in .CSV format from the Scopus database using the two keywords, “Audio Segmentation” and “Deep Learning.”

1.1. Audio Data Analysis

A sound is represented as an audio signal whose parameters include frequency, bandwidth, decibel level, and so on. A typical audio signal can be represented as a function of amplitude over time [8]. Several digital devices help record audio and then represent these sounds in a computer-readable way. Some instances of these formats are:
(i) mp3 (MPEG-1 Audio Layer 3) format
(ii) wav (Waveform Audio File) format
(iii) WMA (Windows Media Audio) format
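To make this representation concrete, the sketch below (assuming the third-party librosa library is installed) decodes a file in a common format into an array of amplitude samples over time; "audio.wav" is a placeholder path.

```python
import librosa

# Decode a wav/mp3/... file into amplitude samples plus a sampling rate.
# "audio.wav" is a placeholder; sr=None keeps the file's native rate.
y, sr = librosa.load("audio.wav", sr=None, mono=True)
print(f"{len(y)} samples at {sr} Hz -> {len(y) / sr:.2f} s of audio")
```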

A typical audio data processing procedure involves extracting the acoustic features relevant to the task at hand, followed by decision-making techniques, including detection and classification. As a result, audio data analysis is used to analyze and comprehend audio signals captured by digital equipment, with various applications in healthcare, production, and enterprise [9]. Among these applications are customer intelligence analysis from user service calls, social-media content analysis, medical aids, patient-care systems, and public safety.
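As a toy example of such a decision-making step, the sketch below labels frames as active or silent using a simple short-time energy threshold; the threshold and the synthetic data are illustrative only.

```python
import numpy as np

def detect_activity(frames, threshold=0.01):
    """Label each frame as active (1) or silent (0) by short-time energy."""
    energy = np.mean(frames ** 2, axis=1)    # average power per frame
    return (energy > threshold).astype(int)  # simple threshold decision

# Synthetic example: 20 near-silent frames followed by 20 louder frames
rng = np.random.default_rng(0)
quiet = 0.001 * rng.standard_normal((20, 1024))
loud = 0.5 * rng.standard_normal((20, 1024))
print(detect_activity(np.vstack([quiet, loud])))  # twenty 0s, then twenty 1s
```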

1.2. Related Work

In the task of audio segmentation, several authors have devised segmentation approaches based on neural network classification systems. One example of a feed-forward network is a multilayer perceptron trained using genetic algorithms to achieve multiclass audio segmentation. A large amount of data is needed to train deep neural networks for reliable predictions [10]. Some studies have used data augmentation approaches to expand the quantity of data and overcome this problem. To tackle the data shortage problem, Raza used two approaches to enlarge the dataset.

The authors suggested that the noise injection method effectively compensates for data shortage. In this approach, audio data are augmented to prevent overfitting by deliberately injecting random noise into the audio signal and by applying transformations that slightly deform the pitch and tempo [11]. When data augmentation is performed, the quality of the source data has a vital influence: high-quality data mean a clear signal without any other type of noise in the audio signal. However, noise is unavoidable during recordings, and every sound recording has a different length [12]. For effective analysis, noise removal is very important, and normalizing and generalizing the raw dataset are also required.
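A minimal sketch of these two augmentation ideas, assuming librosa for the pitch and tempo transformations, is given below; the noise level, pitch step, and stretch rate are illustrative values rather than those used in [11].

```python
import numpy as np
import librosa

def add_noise(y, noise_level=0.005):
    """Noise injection: add random Gaussian noise to the signal."""
    return y + noise_level * np.random.randn(len(y))

def deform_pitch_and_tempo(y, sr, n_steps=1.0, rate=1.1):
    """Slightly deform pitch (in semitones) and tempo (stretch factor)."""
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    return librosa.effects.time_stretch(shifted, rate=rate)

# "clip.wav" is a placeholder path for a training recording
y, sr = librosa.load("clip.wav", sr=None)
augmented = [add_noise(y), deform_pitch_and_tempo(y, sr)]
```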

A study showed an improvement in performance by performing denoising in the preprocessing step [13]. Other authors mentioned the importance and effect of data generalization [14]. It is important to extract appropriate features for each label so that the data can be classified according to class. There are several methods for extracting such features, including MFCCs, spectrograms, and representations learned by a deep neural network.
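For instance, the two classical feature types mentioned above can be computed with librosa as follows; the sampling rate, FFT size, and coefficient count are illustrative choices, and "clip.wav" is a placeholder path.

```python
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=16000)

# MFCCs: one 13-dimensional vector per analysis frame
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Log-magnitude spectrogram from the short-time Fourier transform
log_spec = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=1024)))

print(mfcc.shape, log_spec.shape)  # (13, n_frames) and (513, n_frames)
```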

In Figure 2, we present the source-wise analysis of audio segmentation and deep learning research trends using the Web of Science database [15]. The experiment was conducted on data collected from the Web of Science database using the two keywords, “Audio Segmentation” and “Deep Learning,” over the period 1999–2021. Seventy-five publications selected from the Web of Science Core Collection are shown in Figure 2, which represents the sources or fields where audio segmentation is used with deep learning. In Engineering Electrical Electronic, audio segmentation has 36 publication records, the maximum in the Web of Science database. The second highest count is 22 documents in Computer Science Artificial Intelligence. Audio segmentation also has 16 documents in Acoustics and 14 in Computer Science Information Systems [16].

In Figure 3, we show that research work related to audio segmentation started in earnest in 2005. For about a decade, growth in this type of research was very slow, but post-2016 there was a sharp rise in this area [17]. An exponential increase in audio segmentation-related research can be seen since 2017.

This indicates that the field of audio segmentation and deep learning is progressively becoming an attractive, steady research area, as the citation count has grown over the previous five years [18]. The largest number of publications in the last fifteen years (2005 to 2021) was recorded in 2019. The current year, 2021, has also witnessed considerable confidence among researchers regarding the application of these techniques, which is why most recent publications relate to this field [19]. As per the trend analysis, there is very high potential for research in this domain, as shown by the cumulatively rising pattern of research on audio segmentation methods [20].

2.1. Keywords Related to Audio Segmentation and Deep Learning

There are six different clusters of keywords in the co-occurrence network, which was created using the VOSviewer software. In this network visualization, each cluster has a different colour [21]. The analysis considers keywords that appeared in at least three of the collected documents. Of 1310 keywords, only 119 met this threshold and are represented in the co-occurrence network visualization, composing the critical areas of audio segmentation and deep learning, as shown in Figure 4.

The colours red, blue, and green in Figure 4 represent co-occurrence within the related keywords. Shades of purple, orange, and so on show that the co-occurrence is a combination of two or more domains.

2.2. Year-Wise Publications and Research Trends

Figure 5 shows the publications from 2005 to 2021, illustrating the gradual increase in the number of publications on audio segmentation based on deep learning techniques [22].

Figure 5 shows the distribution of documents published by year related to the application of audio segmentation with deep learning techniques [23]. Research in this domain grew gradually from 2005 to 2012. Furthermore, since 2016, the number of publications dealing with audio segmentation methods has increased continuously. In this sense, it is evident that many documents were published during the last two years (2019 and 2020), with 36 and 47 documents, respectively. Overall, 73.52% (125) of the total publications on audio segmentation with deep learning were published in the last five years (2015–2020).

2.3. Country-Wise Research Trends for Audio Segmentation

Many researchers globally implement audio segmentation methods, and the documents published in the last 20 years come from various countries [24]. Neural networks are also used with audio segmentation, where the sum of squared errors is used to evaluate efficiency [25]. Researchers have worked globally on implementing various audio segmentation applications, as shown in Figure 6.

Table 1 shows the top 20 countries, with the number of research documents related to audio segmentation and their citations. Figure 6 shows the density visualization for a global research analysis of audio segmentation researchers based on the Scopus database [26]. As can be seen, the United States and China have strong links compared to other countries [27], so their clustering is the highest among all countries [28].

As described in Table 1, the maximum number of publications came from the United States [29]. These data were extracted from the Scopus database with a threshold of a minimum of three documents [30]. The United States, China, and India are the top three countries where research on audio segmentation is highest, in terms of both documents and citations, which clearly shows how the related research is correlated [31].

In this study, India has 11 documents and 26 citations, indicating that Indian authors are actively involved in research in the audio segmentation field [32]. Thus, most researchers are from the United States, China, and India, while considerable research potential lies in countries like Canada and the United Kingdom [33].

2.4. Prominent Researchers for Audio Segmentation

The publications retrieved from the Scopus database using the two search keywords, “Audio Segmentation” and “Deep Learning,” have been cited several times, as described in Table 2. By applying a filter of a minimum of 10 citations per document, we obtained 24 publications [34]. Table 2 presents the citations for the 24 publications identified using the VOSviewer software package.

Figure 7 shows the author-wise analysis of audio segmentation research based on the Scopus database. The highest author citation count in this research area is ninety-four: from Figure 7, we can see that Zhang S. (2018) is cited the most. Huang H. (2020) ranks second by number of citations, with sixty-three.

3. Application Areas of Audio Segmentation

Audio segmentation is often utilized in various applications, like Automatic Speech Recognition [35], Automatic Language Identification, and Automatic Emotion Recognition systems [36]. The audio signal is segmented into a sequence of frames and classified into several classes like music [37], speech [38], noise [39], and so on. In this approach, noise is filtered out of the sound signal because audio recordings exhibit significant variations, such as signal-to-noise ratio [40], audio encoding [41], bandwidth [42], language [43], speaking style [44], gender [45], and sound pitch [46], all of which pose challenges.
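As a sketch of this frame classification step, the example below trains a small feed-forward network on synthetic stand-ins for per-frame feature vectors; in practice, the features and labels would come from real annotated recordings rather than random numbers.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-ins for 13-dimensional per-frame features (e.g., MFCCs)
rng = np.random.default_rng(42)
X = rng.standard_normal((300, 13))
y = rng.integers(0, 3, size=300)  # 0 = speech, 1 = music, 2 = noise

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X, y)

# Each incoming frame is assigned one of the class labels
print(clf.predict(rng.standard_normal((5, 13))))
```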

Segmentation provides an effective method for splitting multimedia data into digital segments by extracting diverse aspects of the multimedia data [47]. This segmentation yields useful information, such as the division of the signal by speaker identity, as well as automatic indexing and retrieval of all instances of a certain speaker [48]. By collecting all segments produced by the same speaker, acoustic models for automatic online speech recognition can be adapted to improve overall system performance [49]. Typically, a certain set of properties is extracted from each audio frame [50]. Features are used in two ways: as the extracted value itself or as its changes over time [51]. From the changes over time, it is feasible to calculate statistical features such as mean and variance [52].
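These two uses of features can be sketched as follows: per-coefficient statistics (mean, variance) summarize the extracted values, while first-order deltas capture their changes over time. The matrix below is a synthetic stand-in for the kind of MFCC matrix extracted earlier.

```python
import numpy as np
import librosa

# Synthetic stand-in for a (n_coefficients, n_frames) MFCC matrix
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((13, 100))

# Statistics over time: one mean and one variance per coefficient
feat_mean = mfcc.mean(axis=1)
feat_var = mfcc.var(axis=1)

# Changes over time: first-order delta (local temporal derivative)
deltas = librosa.feature.delta(mfcc)

segment_vector = np.concatenate([feat_mean, feat_var, deltas.mean(axis=1)])
print(segment_vector.shape)  # (39,)
```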

Audio segmentation is used to analyze and understand audio recordings, with several applications in healthcare, production, and enterprise [53]. Among these applications are customer intelligence analysis from user service calls, social-media content analysis [54], medical aids, patient-care systems, and public safety [55]. In healthcare, with the assistance of audio segmentation, a real-time cardiac arrest detection system can monitor for and detect upcoming heart-related diseases [56]. Such a system categorizes heart sound recordings into normal and abnormal heart sounds according to the perceived health risk; it can monitor many people at a time and supply fast and effective warnings to doctors for further treatment [57].

The experiments conducted here are highly relevant and contribute to predicting future research trends in this domain. The various analyses, such as source-wise, author-wise, country-wise, and citation-wise, give us knowledge of the work conducted in this domain and of possible areas for further research [58]. The author-wise analysis gives information about the various authors working in this domain; similarly, the source-wise information helps us understand the various sources where work relevant to this field can be found.

4. Conclusions

As the applications of speech technologies progress, the significance of audio segmentation techniques is also increasing. The huge surge in the number of research articles on deep learning-based audio segmentation indicates the paramount importance of these techniques. This paper highlights the applications and source-wise research significance of audio segmentation. The analysis presented in this paper also exhibits the clear and strong relationship between audio segmentation and deep learning techniques. This work can be further extended to include domain-specific and contextual analyses of audio segmentation techniques.

Data Availability

The data used to support the findings of this study are included in the article. Should further data or information be required, these are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors thank Thapar University, Punjab, for the technical assistance. The authors appreciate the support from Mettu University, Ethiopia.