Robots can use echo signals for simultaneous localization and mapping (SLAM) services in unknown environments where its own camera is not available. In current acoustic SLAM solutions, the time of arrival (TOA) in the room impulse response (RIR) needs to be associated with the corresponding reflected wall, which leads to an echo labelling problem (ELP). The position of the wall can be derived from the TOA associated with the wall, but most of the current solutions ignore the effect of the cumulative error in the robot’s moving state measurement on the wall position estimation. In addition, the estimated room map contains only the shape information of the room and lacks position information such as the positions of doors and windows. To address the above problems, this paper proposes a graph optimization-based acoustic SLAM edge computing system offering centimeter-level mapping services with reflector recognition capability. In this paper, a robot equipped with a sound source and a four-channel microphone array travels around the room, and it can collect the room impulse response at different positions of the room and extract the RIR cepstrum feature from the room impulse response. The ELP is solved by using the RIR cepstrum to identify reflectors with different absorption coefficients. Then, the similarity of the RIR cepstrum vectors is used for closed-loop detection. Finally, this paper proposes a method to eliminate the cumulative error of robot movement by fusing IMU data and acoustic echo data using graph-optimized edge computation. The experiments show that the acoustic SLAM system in this paper can accurately estimate the trajectory of the robot and the position of doors, windows, and so on in the room map. The average self-localization error of the robot is 2.84 cm, and the mapping error is 4.86 cm, which meet the requirement of centimeter-level map service.

1. Introduction

With the arrival of intelligent society, mobile robots have been widely used in people’s life and work, which greatly facilitates people’s life and work. In the unknown indoor space, robots realize positioning and navigation services need to know the surrounding environment map and their own position in the map. However, the robot does not know the indoor map information. Simultaneous Localization and Mapping (SLAM) is the service of detecting and sensing the map (contour) of the surrounding environment for a moving subject in an unknown environment, relying only on the mounted sensors, while determining its own position in the map [1]. After decades of continuous development, SLAM technology services have been widely used in mobile robotics [2], virtual/augmented reality [3], autonomous driving [4], and so on. The current mainstream SLAM techniques are classified based on the differences in the sensors used and can be divided into LIDAR SLAM techniques and visual SLAM techniques. Sensors such as LIDAR and cameras have the advantage of accuracy and high resolution but they also have disadvantages: LIDAR is a very expensive sensor and poses health and safety issues in operation [5]. Cameras, although cost getting lower, require high processing power in low-light environments as well as low signal-to-noise ratios [6]. In addition, the above systems are computationally complex and usually use cloud-based processing, which is costly and involves privacy and security. In contrast, acoustic sensors as the standard for mobile robots can be used in low-light and dark environments [7]. Map reconstruction work can be achieved using the robot’s own arithmetic power, which not only has significant cost advantages but also offloads privacy-aware services to MEC (mobile edge computing), avoiding the leakage of private information such as indoor images. Therefore, researchers have begun to explore the implementation of acoustic SLAM.

In indoor environments, the propagation of acoustic signals is obscured and reflected by buildings resulting in multipath effects, which to some extent reflect information about the room arrangement and geometry and can be used to estimate environmental maps. Researchers have started to estimate room shapes from the room impulse response (RIR) of acoustics. Labelling the first-order echoes in the RIR to the walls that generate them is the key for room shape estimation. Since the RIR itself does not specify which reflections come from which walls, there is an echo labelling problem (ELP) [8]. Literatures [9, 10] solved the ELP using the properties of the Euclidean distance matrix (EDM), but the algorithm must traverse the TOA combinations of all echoes, which has a high computational complexity. References [8, 11, 12] improved the computational complexity of their work based on EDM using graph theory, subspace filtering, and greedy iteration, respectively, but the overall computational complexity is still large. There is also a method of solving ELP using elliptic constraints. Literatures [13, 14] solved the series of reflective points of walls based on an elliptic constraint model and finally used the Hough transform for the estimation of each reflective wall. This method needs to arrange many anchor nodes to obtain enough data, which is complex to implement and extremely computationally intensive. Literature [15] proposed an algorithm to reduce the computational complexity based on elliptic constraints for iterative echo marking. The above method uses a stationary distributed microphone array, which requires the sound source and microphone array to be arranged in the room in advance and is not applicable to the practical application scenario of SLAM.

In addition to the static deployment of sources and microphones scheme in the above work, there is another scheme that embeds acoustic sensors on a mobile robot, which is more in line with the practical needs of SLAM and is more relevant for research. Whether the robot is equipped with an acoustic source can be classified as active acoustic SLAM and passive acoustic SLAM. References [16, 17] used robots equipped with multichannel microphones for their own localization as well as localization of acoustic sources by sensing the ambient sound sources around them. However, suitable sound sources that can be detected are not always available in the actual environment. Another option is to use a robot equipped with both sound sources and microphones for simultaneous localization and mapping. In literatures [1820], a mobile robot with a juxtaposition of an acoustic source and a single-channel microphone was used to collect first-order echoes. Due to the weak spatial perception of the single-channel microphone, the robot moved at least three times for a reflective wall estimation, and the movement error may affect the accuracy of the reflective wall estimation. Multichannel microphones have better spatial perception capability than single microphones and can obtain more information about the interior geometry for the same number of measurements. Literature [21] achieved room shape estimation using a robot equipped with an acoustic source and a four-channel microphone by estimating the location of the first-order image source by clustering, but the robot needs to collect a large amount of RIR data at each movement. All the above acoustic SLAM schemes can achieve the estimation of room shape but ignore the effect of cumulative errors in the measurement of the robot’s moving state, that is, on the reconstruction of the room contour. The room map has only room shape information and lacks the location information of doors and windows because it cannot distinguish between different reflective materials.

In this paper, a robot active sounding scheme equipped with both sound sources and microphone arrays is used for simultaneous localization and mapping. To address the ELP problem mentioned above and the cumulative error of robot movement measurement affecting the accuracy of map building, this paper proposes a graph-optimized acoustic SLAM edge computing system based on graph optimization that can identify room detail information. The main contributions of this paper are as follows:(1)A robot system prototype is designed and implemented that can be used for acoustic SLAM(2)A first-order echo labelling algorithm based on RIR cepstrum is proposed, which solves ELP by distinguishing different reflective materials(3)A graph optimization-based method is proposed for correcting the pose estimated by the trapezoidal constraint

The sections of this paper are organized as follows. Section 1 introduces the background and current research status of acoustic SLAM, and Section 2 describes the problem setup and the architecture of the graph optimization-based acoustic SLAM system. In Section 3, a prototype of our designed robotic system for acoustic SLAM is presented. Section 4 introduces the related acoustic SLAM methods. Section 5 shows the simulations and experiments, and Section 5 concludes.

2. Problem Description and System Structure

In an indoor environment, the sound signal received by a microphone consists of the direct sound from the source and the reflected sound that is reflected by the walls. In the image source model [22, 23], the reflected sound from the actual sound source was replaced by the direct sound from the image source, as shown in Figure 1(a). For a first-order echo described by the unit normal and an arbitrary wall point and the kth reflected wall, the first-order image source of the real sound source s with microphone is calculated according to . The sound propagation process was described in terms of the room impulse response (RIR), which consists of a series of Dirac pulses :where is the noise, is the amplitude of the received pulse, and its magnitude depends on the absorption coefficient of the wall and the distance from the image source to the microphone [18]. is the arrival time of the corresponding pulse, which is proportional to the distance from the image source to the microphone . The room impulse response can be represented as a dataset . ELP finds the data associated with the first-order image source from the dataset . The was defined as the label corresponding to data . The data association process for a first-order image source can be represented by the function :

The distance from the first-order image source to the microphone can be known from the marked first-order echo data. For a single-channel microphone, as shown in Figure 1(a), the position of the image source is on a circle with the position of microphone as the center and as the radius, and the exact position of the image source on the circle cannot be determined due to the lack of spatial information. For multichannel microphones, as shown in Figure 1(b), the position of the image source is at the intersection point between circles . In 2D space, it is known from the TOA localization algorithm [24] that the uniqueness of the image source s location can be guaranteed when the number of microphones is greater than 3. The midpoint of the line connecting the real source s and the image source is the location of the reflecting wall.

In this paper, an omnidirectional sound source was defined, and an omnidirectional 4-channel microphone array are installed on the robot, and the location of the microphone array in relation to the sound source is shown in Figure 2(a). The robot travels around the room in a circle, and for each step the robot takes, the sound source generates a pulse while the microphone array records an echo. The room was defined as a 2D polygon for the sake of descriptive simplicity, and the approach in this paper can be easily extended to 3D.

The robot can estimate the position of each wall relative to itself in the room using echo information, as shown in Figure 2(b). Based on the above description, the difficulty of first-order image source position estimation is solving the echo labelling problem (ELP) [8]. In this paper, taking advantage of the strong spatial perception of multichannel microphones, an RIR cepstrum feature that can distinguish reflective walls with different absorption coefficients was proposed, and based on this feature, this paper proposes a solution for first-order echo labelling based on the RIR cepstrum.

The robot travels around the room in a circle and uses the echo information from different locations to estimate the distance from the wall to itself for the shape estimation of the indoor room, as shown in Figure 2(b). Since there is a cumulative error in the robot position estimated by IMU data, this can lead to inaccurate estimation of the wall position. To solve this problem, a graph optimization-based acoustic SLAM method was proposed, which uses graph optimization to fuse the robot pose estimated by the sound echo signal with the pose estimated by IMU to eliminate the cumulative error, and the system block diagram of this method is shown in Figure 3. Referring to other graph optimization structures [2527], the graph optimization system in this paper is also mainly divided into two parts: front end and back end. The front end establishes the graph vertices and the positional constraint relations between graph vertices based on the IMU sensor and acoustic sensor data, and the back end optimizes the positional graph based on the closed-loop constraints added by the closed-loop detection and the constraint relations between graph vertices to finally obtain globally consistent robot trajectories and indoor room maps.

3. Acoustic SLAM Method

According to the system framework of acoustic SLAM, the acoustic SLAM method can be divided into two parts: echo labelling and pose correction. Echo labelling extracts the first-order echo signal from the echo signal and then estimates the position of the first-order image sound source based on the first-order echo signal. The pose correction is to eliminate the accumulated errors during the robot movement globally using graph optimization methods.

3.1. Echo Labelling Based on the RIR Cepstrum Feature

The ELP problem is often solved using the Euclidean distance matrix approach, which requires traversing all possible combinations of echoes with high complexity. In this section, an echo labelling method based on RIR cepstrum is proposed, which can achieve fast and accurate echo labelling.

3.1.1. RIR Cepstrum

Multichannel microphones are more spatially aware than single-channel microphones, and a spatial cepstrum feature is proposed in literature [28], which can represent the relative position of the sound source in the room. Inspired by this, the room impulse response cepstrum feature was proposed, which can be used for first-order echo labelling and loopback detection.

Suppose the robot is equipped with M omnidirectional microphones and an omnidirectional sound source . As shown in Figure 4, the robot moves N steps from to , and each time it moves, the robot’s sound source generates a pulse, while the microphone acquires the room impulse response at the current position. The robot can acquire M room impulse responses at . According to equation (1), consists of a series of pulses, and the average energy feature of the kth pulse is extracted:where is the TOA value of the kth pulse in and 2L + 1 is the width of the rectangular window.

The time delay feature of the kth pulse in iswhere c is the speed of sound propagation in the air.

Log operations are performed on the above two features separately to obtain the log energy vector and the log time delay vector :

The robot is obtained from ; the matrix of the average amplitude logarithm of the impulse response about the room and the matrix of the logarithm of the arrival distance can be obtained as follows:

As in the method for extracting the spatial cepstrum [29], we also use PCA instead of DFT or DCT. can be obtained from .

Since is a symmetric matrix, the eigenvalue decomposition can be expressed as follows:where is the eigenvector matrix, is the diagonal matrix, and the diagonal elements are the eigenvalues, in descending order.

After PCA dimensionality reduction, the data can be expressed as

Similarly, for the matrix , PCA dimensionality reduction is performed.

The principal component components are selected from and , and they are the components with the largest eigenvalues and , which form the room impulse response cepstrum .where is the matrix, is defined as the room impulse response cepstrum, is the amplitude cepstrum, and is the distance cepstrum.

The amplitude cepstrum corresponds to the average amplitude of the pulses observed by the microphone array. When the pulses observed by the microphone array are consistent, i.e., the first-order echoes come from the same reflecting wall, the magnitude of the amplitude cepstrum is inversely proportional to the distance of the robot from the reflecting wall. Similarly, when the pulses observed by the microphone array are consistent, the magnitude of the distance cepstrum is proportional to the distance of the robot from the reflecting wall. As shown in Figure 5(a), in a 6 m ∗ 6 m rectangular room, the robot moves from to , and to ensure the consistent observation of the microphone matrix, the reflection coefficients of walls w1–w4 to 0.8, 0, 0, and 0 were defined. At this time, the robot can only receive the echo from wall w1. We select the second pulse in the room impulse response and extract the RIR cepstrum according to the above method and represent the RIR cepstrum in a two-dimensional Cartesian coordinate system, as shown in Figure 5(b). We can observe that when the pulses observed by the microphone matrix are consistent, the cepstrum of the room impulse response of and approximates a linear relationship. When the value of the RIR cepstrum of the microphone array pulse combination is in the vicinity of the straight line corresponding to the reflective wall and then combined with the size of the microphone array, it can be determined whether the microphone array is observed consistently.

3.1.2. Echo Labelling Based on the RIR Cepstrum Feature

Based on the feature that the one-dimensional component of the RIR cepstrum approximately satisfies a linear relationship with the two-dimensional component when the RIR cepstrum is observed consistently, a method for first-order echo labelling was proposed. According to the image source model, it is known that the 2nd and 3rd pulses received by the microphone must be the first-order echoes reflected from the wall. The robot follows the trajectory in Figure 4 from moving to . Since the robot moves against the wall and the spacing between the microphones is small, it can be guaranteed that the 2nd and 3rd pulses observed by the microphones are mostly from the same virtual source, i.e., the observations are consistent. Assuming that the acoustic reflection coefficients of each of the four walls of the room in Figure 4 are different, the robot takes the 2nd and 3rd pulses from . The second and third pulses received by the microphone are taken to find the cepstrum of the room impulse response; the matrix of the logarithm of the room impulse response amplitude at this time and the matrix of the logarithm of the arrival distance are as follows:

Following the method in the previous section, PCA operations are performed on and , to obtain the feature matrices and , respectively. The RIR cepstrum after PCA dimensionality reduction is

According to the characteristics of the RIR cepstrum feature, can be fitted as a straight line corresponding to the reflecting wall with different reflection coefficients.

When the kth pulse observed by the microphone matrix is greater than 3, it is not possible to determine whether the kth pulse observed at this time is a first-order echo. The first-order echo candidates from wall can be obtained from the TOA values of the first and second echoes and the relationship between the microphone positions. These candidates can be combined to obtain a new combination of room pulses, which corresponds to the RIR cepstrum as follows:

If the new combination is a first-order echo from wall , the corresponding room impulse response cepstrum should be near the straight line , and the distance from to the straight line satisfies the following equation:where Δ is the Euclidean distance threshold. Following the above method, the first-order echoes of different walls can be distinguished, and thus, the location of the image source can be estimated.

3.2. Pose Correction Based on Graph Optimization

The pose correction method consists of three parts: closed-loop detection, pose estimation, and pose correction. In this paper, the pose at different locations of the robot is used as nodes of the graph. The constraint relationship between graph nodes is established using closed-loop detection and pose estimation. Finally, the graph optimization method is used to correct the robot’s pose.

3.2.1. Closed-Loop Detection

Closed-loop detection determines whether the robot has reached the previous position, and it is extremely important for back-end optimization. After the above echo labelling method, the robot at the location can obtain the RIR cepstrum consistent with the k-sided wall observation, and these cepstrums can be combined into a 2k-dimensional RIR cepstrum vector as follows:where is the RIR cepstrum from wall .

Similarly, the robot can obtain RIR cepstrum vector at as follows:

The vectors and can be used to express the Euclidean distance between them to express the similarity of their spaces. When and are in the same position or close to each other, the Euclidean distance between the two vectors should satisfy the following formula:where is the Euclidean distance threshold.

3.2.2. Trapezoidal Constraint-Based Pose Estimation

In indoor space, the actual position of the robot and the position movement of the image source satisfy the isosceles trapezoidal constraint [18, 29], as shown in Figure 6. Based on this constraint, a robot positional estimation method was proposed.

In world coordinates, let the robot’s pose at be and the robot’s pose at be . The change in pose of the robot moving from to iswhere is expressed in polar coordinates as

In equation (21), is the displacement variable, is the angle of the displacement direction to the X-axis of the world coordinate system, and is the rotation angle of the robot coordinate system.

The coordinates of the image source at and are expressed in polar coordinates in the robot coordinate system, respectively, as follows:

The robot rotation angle is related only to the angle of the polar coordinates of the image source.

The estimated value of the robot rotation angle is

Robots from move to , and the length of the image acoustic source polar coordinates changes as follows:where .

Let , the following vector function can be obtained:

The estimated value of the acoustic sensor has a random error ζ, and the variance of the error is .

Weighted error function () is as follows:where and is the unit matrix.

The objective optimization function is obtained by minimizing the weighted error function .

The Levenberg–Marquardt algorithm is used to solve equation (29).

Linearize the error function ε(s) by linearly expanding d(s) through a first-order Taylor series:where is the Jacobi matrix, which is expressed as follows:

Equation (30) is brought into equation (28) to solve for the extreme value of the weighted error function ε(s). According to the Levenberg–Marquardt method:where .

The above equation allows to find the step size for each iteration. The value estimated by the IMU is used as the initial value for the iterative calculation.where is the number of iterations.

According to the above method, it is possible to use the acoustic signal to accurately estimate the pose change between different positions of the robot.

3.2.3. Pose Correction Based on Graph Optimization

According to the above method, we can construct the graph. Every time the robot moves a certain distance or rotates a certain arc, a vertex is added to the graph, and the constraint relationship between the vertices is established according to the pose estimation algorithm. The structure of the graph is shown in Figure 7.

Let be a vector of parameters, where describes the pose of node . The robot moves from the pose node xi to the pose node , is the pose transformation estimated by the IMU and is the pose transformation observed by the acoustic sensor. Let be the error function from to , which is the difference between the robot’s predicted observation and the actual observation .

The dashed box in Figure 7 shows the constraint relationship between node to node .

Let be the set of constraint pairs of nodes in the graph and the set of nodes in the graph of trajectory points of the robot. The goal of the maximum likelihood method is to find the configuration of nodes that minimizes the negative log likelihood F (x) of all observations.where is the measurement information matrix of the error function .

The objective optimization function is

Using the first-order Taylor expansion error function .

The error function is only related to and . Its Jacobi matrix is

The nonlinear optimization algorithm is used to iteratively solve for the minimum value of F (x). The step size of each iteration can be solved using equation (32).where .

The optimal robot trajectory point is obtained by iterative calculation.where is the initial trajectory point of the robot estimated by the IMU and is the number of iterations.

The robot travels around the room once, constructs the vertices and edges of the graph according to the method in this paper, and solves equation (36) using a nonlinear optimization algorithm. The optimal robot movement trajectory is obtained by optimizing the length of the edges of the graph to minimize F (x).

4. System Implementation and Experimental Verification

This section introduces our self-designed robot prototype and then experimentally verifies the performance of the acoustic SLAM method in this paper.

4.1. System Implementation

Our robot is based on the Turtlebot3 Waffle Pi robot, a small, low-cost, fully programmable, ROS-based mobile robot, as shown in Figure 8(b). Turtlebot3 consists mainly of Raspberry Pi3 and OpenCR control board (with IMU sensor inside). In this system, OpenCR is responsible for collecting the built-in IMU sensor data and sound data as well as driving the robot to move, Raspberry Pi 3 is responsible for processing and calculating the data, and Raspberry Pi 3 is connected to OpenCR via USB 2.0. The system architecture connection diagram of the robot is shown in Figure 8(a). Since the sound source is close to the microphone array, which may result in larger direct waves and affect the reception of other reflected waves, a sound insulation panel was designed between the microphone array and the sound source to isolate the direct sound. In addition, the sound insulation panel can also isolate the reflected waves from the upper and lower walls. The physical diagram of the robot is shown in Figure 8(c).

The robot uses an active acoustic scheme for self-localization and room contour estimation, and to prevent the sound signals emitted during the robot’s work from affecting people’s life and work, this paper uses sound pulse signals in the pseudoultrasonic band (16k–24k), which has a wavelength between 1.4 cm and 2.1 cm and can be guaranteed to be received by a small microphone array, and the sound signals in this band are insensitive to the human ear but can be picked up by the robot’s acoustic sensors. Sound source emits a sound signal of 16 kHz–20 kHz chirp pulse signal, which can avoid the leakage of sensitive information such as indoor human voice and ensure the security of privacy, but most speakers do not support the sound signal of the band, the need for speaker selection. Sound generation equipment was used, that is, Huawei Sound is used as the sound source. Huawei Sound is a 360-degree omnidirectional speaker with a frequency response range of 55 Hz–40 kHz and supports 3.5 mm wired audio input.

The process of acquiring the sound pulse signal from the robot itself is the process of converting the sound signal from a mechanical wave to a digital signal. The sound signal is first converted into a voltage signal through a microphone, and since this voltage signal is usually small, it needs to be amplified by an amplifier; then, the amplified signal is passed through a 15k–24k bandpass filter to filter out the noise signal outside the pseudoultrasonic band. Finally, the filtered signal is converted to a digital signal by an ADC.

Based on the acquisition process of the sound signal, the microphone array signal acquisition board for the robot was designed. The microphone in the acquisition board is a 130F22 omnidirectional microphone from PCB, which has a frequency response range of 10 Hz–20 kHz, and the microphone is an SMB interface that can be plugged into the acquisition board standing up. The acquisition board is a four-channel microphone array, and the microphones are distributed at equal intervals on a circle with a radius of 0.14 m. The physical diagram of the acquisition board is shown in Figure 9.

4.2. Experimental Verification

The experimental part is divided into echo labelling experiments, pose correction and room shape estimation experiments, and real room experiments.

4.2.1. Echo Labelling Algorithm Performance Simulation

For the echo labelling experiments, echo labelling simulations were conducted in three different shapes of rooms: square, rectangular, and pipeline. The shape of the room is schematically shown in Figure 10, and w1, w2, w3, and w4 were used to denote the four walls of the room, and their corresponding reflection coefficients are . The radius of the microphone array of the robot is 0.2 m, and a location in the room is randomly selected, the RIR of each microphone under that location is simulated using the image source method, and the echoes are labelled using the method in Section 3. To verify the robustness of the algorithm to noise, the Gaussian white noise was added to the propagation distance of the signal as follows:where d is the true propagation distance and is the additive noise, and its standard deviation σ varies from 0.01 to 0.05 in steps of 0.005. Similar to [15], the F1-score was used to evaluate the goodness of the echo markers.

Literature [15] solved ELP by means of elliptical iterations, and for comparison, experiments in a square room with reflection coefficients for each wall of were performed. Randomly 300 points were chosen to conduct echo labelling experiments with the method of this paper and compare with the method of literature [15]. The experimental results are shown in Figure 11(b), where method 1 is the above method and method 2 is the method of this paper.

Then, the F1-score of different numbers was simulated of microphone array robotic echo markers in each of the three rooms in Figure 10, where the reflection coefficient of the room is . The experimental results are shown in Figure 12. The experiments show that the F1-score can be maintained above 95% in all three rooms under low noise conditions (σ ≤ 0.03 m).

4.2.2. Simulation of Pose Correction and Room Shape Estimation

Simulation experiments were conducted on robot self-localization and room position estimation in a room with wooden door and glass windows of size 7 m ∗ 7 m. Among them, the sound reflection coefficient of wooden door is 0.7, the sound reflection coefficient of glass window is 0.8, and the sound reflection coefficient of wall is 0.9. The robot’s microphone array is a four-channel microphone array with a radius of 0.2 m. The robot travels around the room along the wall, and every 0.4 m, the robot actively emits sound and simulates the RIR of the current position (σ = 0.05 m).

The blue diamond line in Figure 13(b) shows the trajectory of the robot without pose correction, and the green line is the closed-loop result detected by the method in this paper. Figure 13(c) shows the trajectory after the pose correction based on the graph optimization, and the optimized path trajectory is basically consistent with the real trajectory. Figure 13(d) shows the location of the reflector estimated based on the optimized path, where the green “×” is the estimated location of the wall, the red “×” is the location of the glass, and the pink “×” is the location of the door.

The position error was used to measure the robot’s self-positioning error and mapping error, where is the X-axis coordinate error and is the Y-axis coordinate error. Figure 14(a) shows the average self-localization error and mapping error statistics of the robot traveling some of the position points according to the route in Figure 13(a). The self-positioning error of the robot is less than 3.18 cm with 60% probability, and the average mapping error is less than 4.86 cm with 58% probability.

The comparison experiments between the method in this paper and the method based on KEF filtering was done, in which the robot drove around the room randomly once in the above room and simulated 100 Monte Carlo experiments, respectively. The average self-localization error and mapping error of the experiments are shown in Table 1, where method 1 is the method without path optimization, method 2 is the method of path optimization by KEF filtering, and method 3 is the method of this paper.

4.2.3. Real Room Experiments

To verify the stability of our own designed robot and the practical performance of the method in this paper, the experiments were conducted in a real room of 4.3 m × 5.5 m × 3 m (room dimensions were obtained using total station measurements). The total station was placed at the doorway and was used to measure the actual position of the robot as well as the actual position of the walls. The positions of the total station and the robot in the room are shown in Figure 15(a). Due to the height limitation of the robot, the robot can only measure the reflected wall under the red line in the right figure in Figure 15(a). The robot travels around the room close to the wall, and every time it moves, the robot actively vocalizes once (moving distance is less than 0.5 m) and records the RIR of the current position. Every time the robot moves during the experiment, the real position of the robot is measured with the total station and recorded. The RIRs obtained by the four microphones are shown in Figure 15(b) (the ambient temperature of the experiment is 30 degrees, and the corresponding sound speed is 349.75 m/s). Red marker points are first-order echoes from the wall, and red marker points of the same shape are first-order echoes from the same wall.

Since the sound insulation panels were added between the speaker and the microphone receiver board and the sound propagation direction of the speaker is 360 degrees in the horizontal direction, there is no first-order reflection echo from the upper and lower walls. The final experimental results are shown in 2D, as shown in Figure 15(c). The overall average self-positioning error of the robot is 2.84 cm, and the average mapping error is 4.86 cm.

5. Conclusion

This study introduces a graph optimization-based acoustic SLAM edge computing system and a method that provide new ideas for the solution of the acoustic SLAM problem. Based on the solution in this study, the robot can use acoustic signals to achieve self-localization and centimeter-level room map construction services containing door and window information. The current method in this paper has better performance in an empty room. In the future, acoustic SLAM research will be conducted in more complex indoor spaces.

Data Availability

The simulation and experimental data used to support the findings of this study have not been made available because this paper is funded by the Guangxi Science and Technology Plan Project (No. AD18281044). The grant is still in the research phase, and all research data are currently restricted to disclose within the project team.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


This research work was funded by the Guangxi Science and Technology Plan Project (Nos. AD18281044 and AD18281020), the Guangxi Keypoint Research and Invention Program (No. AB18221011), the Dean Project of Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education (Nos. CRKL190104 and CRKL200107), and the Innovation Project of Guangxi Graduate Education (No. 2020YCXS024).