Abstract

In this paper, we apply a customized AI module to an OTDR device and, in combination with an optical power monitoring module, realize an AI-assisted optical network fault location mechanism for high-density interconnection scenarios in data centers. The mechanism makes full use of the data collected from optical links. Based on these link data, the AI module predicts which links may fail, and the optical power module then monitors those target links, so that faulty links can be located and handled quickly. In our tests, introducing the AI model improved the average fault detection efficiency of a link by 98.41%.

1. Introduction

As data centers grow larger and their topologies become more complex, a data center failure becomes a disaster that can cause the loss of huge amounts of data and the interruption of large computations. At the same time, as the number of devices and links increases rapidly, failures in data center optical networks become more frequent and the number of alarms grows, which makes faults harder to locate and slower to rectify. Locating a fault quickly and accurately among a large number of alarming devices has proven to be a thorny problem [1].

As reported by the Federal Communications Commission (FCC), more than one-third of service disruptions are caused by fiber-cable problems [2]. Automatic monitoring and diagnosis of optical fiber links are therefore very beneficial. Introducing machine learning (ML) in data centers not only revolutionizes the traditional, mainly manual, approach to fiber-optic network fault management [3] but also helps optical network operators plan and schedule their maintenance activities more efficiently [4], thereby saving CAPEX/OPEX and reducing the mean time to repair (MTTR) by quickly discovering and pinpointing link faults. This enables operators to meet service level agreements (SLAs) more easily and to improve customer satisfaction by reducing downtime and improving network quality. In 2018, Rafique et al. [5, 6] proposed an optical layer fault detection architecture based on machine learning and defined four types of optical layer faults. They suggested collecting optical power monitoring data through the southbound interface of the SDON, analyzing the data with an ANN algorithm, and uploading the analysis results through the northbound interface. In the same year, Huawei put forward an optical service fault prediction scheme combining artificial intelligence and big data technology, which mainly takes the bit error rate (BER) and optical power as input to predict optical service faults, and cooperated with operators on an initial verification in a live OTN network. The reported prediction accuracy was 85%, which not only improves the robustness of the network but also reduces the cost of network inspection. Chen et al. [7] proposed a DNN-based optical transmission link fault detection scheme in which an unsupervised clustering module and a supervised DNN module are integrated to analyze the internal relationship between optical power and the alarm log in order to detect link faults.
However, the above work only realizes the fault prediction and does not consider the problem of fault location.

The optical time-domain reflectometer (OTDR) is the most common tool for quality evaluation and fault location of optical fibers [8]. At present, the commonly used data center fault monitoring scheme combines optical switch polling with optical power monitoring. However, with high-density interconnection of optical networks in data centers, fault detection in this way still consumes a lot of time, which hinders troubleshooting and fault resolution. In [4], the authors propose an OTDR optimization scheme based on LSTM, in which an LSTM model predicts possible faults from OTDR detection results. However, this method requires the OTDR to probe link conditions continuously, and the existing data center operation and maintenance data cannot be fully utilized. In this paper, based on the model realized in [9], we design and implement an AI-assisted judgment and failure location platform. Using the operational data collected from the optical network links, the AI module predicts possible link failures; according to the prediction result, the platform sends instructions to the optical switch and monitors the optical power of the links that may fail. Once the optical power falls below the threshold, the OTDR is enabled for link detection. In our tests, the average fault detection efficiency of a link increased by 98.41%.

This paper is organized as follows: Section 2 describes the system architecture and the equipment. Section 3 introduces the AI model that is used in our system. Practical application and performance analysis of the platform are discussed in Section 4. Conclusions are drawn in Section 5.

2. The Architecture of the Platform

The architecture of our test setup is shown in Figure 1. The AI-assisted monitoring platform collects data from the optical links in real time. These data are used by the AI model to predict the status of each optical link. According to the prediction result, the platform issues instructions to the optical switch array, which cycles through the links predicted to fail until the next instruction arrives. At the same time, equipment A monitors the power of the selected link and starts the OTDR to probe the link when the power is lower than the threshold. This workflow is shown in Figure 2.
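The workflow can be summarized as a short control loop. The following is a minimal sketch under stated assumptions, not the platform's actual implementation: the function name `monitor_predicted_links`, the callback `read_power_dbm`, and the threshold value are all hypothetical, and the switch, power monitor, and OTDR are replaced by stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical threshold; the paper does not state the value it uses.
POWER_THRESHOLD_DBM = -30.0


@dataclass
class MonitorResult:
    checked: List[int] = field(default_factory=list)       # links switched to and measured
    otdr_started: List[int] = field(default_factory=list)  # links whose power fell below threshold


def monitor_predicted_links(
    predicted_failures: List[int],
    read_power_dbm: Callable[[int], float],
    threshold_dbm: float = POWER_THRESHOLD_DBM,
) -> MonitorResult:
    """Visit each predicted-failure link in turn and flag those needing an OTDR scan."""
    result = MonitorResult()
    for link in predicted_failures:
        # Stand-in for: instruct the optical switch array to select this link.
        result.checked.append(link)
        # Stand-in for: equipment A measures optical power on the selected link.
        if read_power_dbm(link) < threshold_dbm:
            # Stand-in for: start the OTDR to locate the fault on this link.
            result.otdr_started.append(link)
    return result


# Illustrative readings only; link numbers and values are made up for the example.
readings: Dict[int, float] = {3: -54.457, 7: -7.979, 12: -8.1}
outcome = monitor_predicted_links([3, 7, 12], readings.__getitem__)
print(outcome.otdr_started)  # [3]
```

Only the links named by the prediction are visited, which is where the time saving over polling every link would come from.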

Figure 3 is the architecture of equipment A.

In Figure 3, the laser emits a 1650 nm pulse driven by the pulse generator. The pulse enters the optical link through the circulator. Uplink light from the optical link enters the WDM filter module through the circulator. The uplink light and the 1650 nm backscattered light then enter module B, which is used for OTDR data acquisition and processing, and module C, which is used to calculate optical power. The calculation result is sent to the AI-assisted monitoring platform.

3. AI Model Used in the Platform

This section gives a theoretical introduction to the failure prediction model and presents its results. Section 3.1 covers the LSTM model for each feature, and Section 3.2 shows the classification result of the SVM model.

3.1. LSTM Model

A typical LSTM neural network, with a memory cell, an input gate, a forget gate, and an output gate, is shown in Figure 4. The memory cell takes as input the output of the LSTM neural network from the previous iteration. The input gate receives a new input point from outside and processes the newly arriving data. The forget gate decides when to forget earlier results, which selects the optimal time lag for the input sequence. The output gate takes all the calculated results and generates the output of the LSTM cell. Compared with traditional RNNs, LSTM mitigates the vanishing and exploding gradient problems while learning faster.
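To make the role of the gates concrete, the sketch below implements one time step of the standard LSTM update for a single unit with scalar input and state. It is an illustration only: the weight names and values are hypothetical and untrained, and it is not the model used in the platform.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(z: float) -> float:
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x: float, h_prev: float, c_prev: float,
              p: Dict[str, float]) -> Tuple[float, float]:
    """One time step of a single-unit LSTM with scalar input and state."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old cell state, add new content
    h = o * math.tanh(c)     # expose a gated view of the cell state
    return h, c


def run_sequence(xs: List[float], p: Dict[str, float]) -> List[float]:
    """Feed a sequence through the cell, starting from zero state."""
    h, c = 0.0, 0.0
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        outputs.append(h)
    return outputs


# Hypothetical, untrained weights chosen only so the example runs.
params = {name: 0.5 for name in (
    "w_i", "u_i", "b_i", "w_f", "u_f", "b_f",
    "w_o", "u_o", "b_o", "w_g", "u_g", "b_g")}
hidden = run_sequence([0.1, 0.2, 0.3], params)
```

The additive form of the cell-state update (`f * c_prev + i * g`) is what lets gradients flow across many time steps more easily than in a plain RNN.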

We chose six features for training the LSTM model: laser bias current, input optical power, output optical power, OSNR, temperature in the model, and detection point temperature. Results for four of these features are shown below; the remaining results can be found in [9]. In each figure, the left image shows the training and validation loss of the LSTM model, and the right image compares the test data with the LSTM model’s predictions.

Figure 5 shows the results of using LSTM for laser bias current prediction. The validation loss is less than 0.001, and the model predicts the laser bias current with high accuracy.

Figure 6 shows the results of using LSTM for input optical power prediction. The validation loss is less than 0.005, and the model predicts the input optical power with high accuracy.

Figure 7 shows the results of using LSTM for output optical power prediction. A few predicted values are less accurate, but overall the results are accurate.

Figure 8 shows the results of using LSTM for OSNR prediction. The predicted OSNR values are somewhat lower than the actual data; this will be optimized in follow-up work.

3.2. SVM Model

The SVM is essentially a binary classification algorithm that selects the support vectors from the training data and uses them to establish a decision function [10, 11]. In practice, when the data are not linearly separable, the kernel function of the SVM maps the data from a low-dimensional space to a high-dimensional space, where the two classes of points become linearly separable, as shown in Figure 9.
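The effect of such a mapping can be seen in a toy example. The sketch below is illustrative and uses made-up one-dimensional points, not the paper's data: the inner points and outer points cannot be split by a single threshold on the line, but after the hypothetical feature map φ(x) = (x, x²) a threshold on the second coordinate separates them. The kernel `kernel(x, y)` returns the same value as the inner product φ(x)·φ(y) without building the mapped vectors.

```python
from typing import List, Tuple


def feature_map(x: float) -> Tuple[float, float]:
    """Explicit map from 1-D to 2-D: phi(x) = (x, x^2)."""
    return (x, x * x)


def kernel(x: float, y: float) -> float:
    """Kernel equal to phi(x) . phi(y), computed without the explicit map."""
    return x * y + (x * x) * (y * y)


def separable_by_threshold(values: List[float], labels: List[int]) -> bool:
    """True if one threshold puts all of class 0 on one side and class 1 on the other."""
    zeros = [v for v, lab in zip(values, labels) if lab == 0]
    ones = [v for v, lab in zip(values, labels) if lab == 1]
    return max(zeros) < min(ones) or max(ones) < min(zeros)


# Made-up data: class 0 sits near the origin, class 1 on both sides of it.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, 0, 0, 0, 1, 1]

in_1d = separable_by_threshold(points, labels)
in_2d = separable_by_threshold([feature_map(x)[1] for x in points], labels)
print(in_1d, in_2d)  # False True
```

An SVM with a kernel works with values like `kernel(x, y)` directly, which is why the higher-dimensional space never has to be constructed explicitly.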

The trained SVM model is used to classify the optical network status data predicted by the LSTM model and to judge whether they belong to the failure state. We compare the classification results of the SVM model with the true results and calculate the accuracy according to (1); the resulting accuracy is 90.63%. The failure accuracy calculated according to (2) is 99.38%, which means the AI module can predict almost all failures. Here TP represents true positive, TN represents true negative, FP represents false positive, and FN represents false negative.
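Expressions (1) and (2) are not reproduced in this text. For orientation, the first line below is the conventional definition of accuracy in terms of the four counts. The second line is only an assumed form of the failure accuracy: it supposes that the failure state is the negative class, so that TN counts failures correctly predicted as failures and FP counts failures predicted as normal. Neither line is confirmed as the authors' exact formula.

```latex
\begin{align}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Failure accuracy} &= \frac{TN}{TN + FP} \qquad \text{(assumed form)}
\end{align}
```

Under this reading, a failure accuracy close to 100% means that few actual failures are classified as normal, while the lower overall accuracy reflects normal links flagged as faulty.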

As shown in Figure 10, to facilitate the presentation of the results, we divided the SVM classification results into ten groups and counted the accuracy, TN, FN, and the corresponding true number of network failures for each. The fluctuation in accuracy between groups is related to the distribution of the results.

From Figure 10, the number of TN is very close to the actual number of failures, which means that the failure prediction accuracy is very high. The FN results show that some fault-free links are predicted to be faulty; we compensate for this deficiency by monitoring the optical power of the links predicted to be faulty.

4. Result Analysis

This section will show the performance of the platform in practical application.

Figure 11 shows the details of link channel 3 in normal condition when the optical switch array polls the links in turn without the AI module. “Optical power” shows the current power of channel 3, which is −7.979 dBm. “Distance” represents the length of the optical link. “OTDR” is set to “manual,” which means the parameters shown in the figure are the result of manually starting the OTDR probe. Figure 12 shows the logs of optical switch polling.

When the AI module predicts a link failure, it sends an instruction to the optical switch array and records prediction logs in the platform. Figure 13 is a screenshot of the recorded prediction logs. Figure 14 shows the logs of the optical switch array changing links according to the AI prediction result.

Figure 15 shows the monitoring results of optical power when the link failure predicted by AI occurs. The OTDR mode is set to auto, which means that when the optical power is abnormal, OTDR detection is automatically started. The optical power of link channel 3 currently detected is −54.457 dBm. The value of “distance” is 9852.35, which means there is a breakpoint at 9852.35 m. The curve of the OTDR detection is shown in Figure 16.

Figure 16 shows a dramatic change in the curve near 10,000 m, which is the position of the breakpoint.
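One simple way to find such a change programmatically is to look for the largest drop between consecutive samples of the trace. The sketch below is a simplified illustration on a synthetic trace, not the event analysis performed by the OTDR in the platform; the function name, the sampling interval, the slope, and the noise-floor level are all hypothetical, and a real trace would need smoothing and handling of reflection peaks.

```python
from typing import List, Tuple


def locate_largest_drop(distance_m: List[float],
                        level_db: List[float]) -> Tuple[float, float]:
    """Return (distance of the last sample before the largest drop, size of the drop in dB)."""
    if len(distance_m) != len(level_db) or len(level_db) < 2:
        raise ValueError("need two equally long series with at least two samples")
    # Positive values mean the level falls from one sample to the next.
    drops = [level_db[k] - level_db[k + 1] for k in range(len(level_db) - 1)]
    k_max = max(range(len(drops)), key=drops.__getitem__)
    return distance_m[k_max], drops[k_max]


# Synthetic trace: a gentle linear decay up to 10 km, then an assumed noise floor.
distances = [500.0 * k for k in range(25)]  # 0 m to 12,000 m
levels = [-0.0002 * d if d < 10000.0 else -40.0 for d in distances]

position, drop = locate_largest_drop(distances, levels)
```

With this synthetic data the largest drop lies between the samples at 9,500 m and 10,000 m, so the resolution of the estimate is limited by the sampling interval.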

Figure 17 compares the time consumption of the conventional polling detection method and the AI-based detection method when a random fault occurs among 1024 links. The efficiency improvement is calculated according to (3). The introduction of the AI model increases the average failure detection efficiency for failed links by 98.41%. Here t1 represents the time taken to discover failed links without the AI model, and t2 represents the time taken with the AI model.
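Expression (3) is not reproduced in this text. A natural form for an improvement defined from t1 and t2 is the relative time saving (t1 − t2)/t1, and the sketch below assumes that form. The polling model, the dwell time per link, and the shortlist size are hypothetical values chosen for illustration; they are not the paper's measurements and are not meant to reproduce the 98.41% figure.

```python
def efficiency_gain_percent(t1: float, t2: float) -> float:
    """Relative time saving in percent, assuming the form (t1 - t2) / t1."""
    if t1 <= 0:
        raise ValueError("t1 must be positive")
    return (t1 - t2) / t1 * 100.0


def expected_polling_time(num_links: int, dwell_s: float) -> float:
    """Mean time to reach one uniformly random faulty link when polling links in order."""
    return dwell_s * (num_links + 1) / 2.0


# Hypothetical setup: 1 s per link, and the AI shortlist holds 20 links
# that are assumed to include the faulty one.
t1 = expected_polling_time(1024, dwell_s=1.0)  # poll all links
t2 = expected_polling_time(20, dwell_s=1.0)    # poll only the shortlist
gain = efficiency_gain_percent(t1, t2)
```

The example shows why the gain grows as the shortlist shrinks relative to the total number of links: the time saved depends on the ratio of links polled, not on the dwell time itself.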

From the efficiency curve, we can see that the introduction of an AI model greatly reduces the time of failure detection and improves the efficiency of the equipment.

5. Conclusion

In this paper, we design an AI-assisted optical link failure prediction and failure location platform built on an AI module and test its performance. Optical power monitoring compensates for the shortcoming of the AI model, which may classify a normal state as a failure state. At the same time, the introduction of the AI model increases the average failure detection efficiency for a failed link by 98.41%. This greatly improves the efficiency of failure detection and location in data centers.

Data Availability

The data used to support the findings of this study are restricted by the China Mobile Communications Corporation in order to protect patient privacy or endangered species. Data are available from WeiJi at [email protected] for researchers who meet the criteria for access to confidential data.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the National Key Research and Development Program of China (grant no. 2022YFB2802403), the National Natural Science Foundation (grant no. 62220106002), the Natural Science Foundation of Shandong Province (grant no. ZR2021MF018), and the Open Fund of IPOC 2021 (grant no. BUPT).