Abstract

This paper presents a hardware processor for a 100 Gbps wireless data link layer. A serial Reed-Solomon decoder would require a clock of 12.5 GHz to fulfill the timing constraints of the transmission. Receiving a single Ethernet frame on a 100 Gbps physical layer may be faster than accessing DDR3 memory. Processing such fast streams on a state-of-the-art FPGA (field programmable gate array) requires a dedicated approach. Thus, the paper presents a lightweight RS FEC engine, frame fragmentation, aggregation, and a protocol with selective fragment retransmission. The implemented FPGA demonstrator achieves nearly 120 Gbps and tolerates high bit error rates (BER). Moreover, the redundancy added to the frames is adapted to the channel BER by a dedicated link adaptation algorithm. Finally, ASIC synthesis results are presented, including detailed statistics of consumed energy per bit.

1. Introduction

The fastest wireless technology available, based on wireless LAN 802.11ac (5 GHz) and 802.11ad (60 GHz), achieves data rates of “only” 7 Gbps [1]. Our goal is to achieve wireless transmission at 100 Gbps, a major breakthrough: ten times faster than any other wireless communication. This paper is related to the End2End100 project and cooperates with other projects of the DFG priority program SPP1655 [2]. This group of projects investigates a complete wireless 100 Gbps system at terahertz frequencies (~240 GHz). Figure 1 depicts the demonstrator setup as investigated in the DFG. This paper describes research on the data link layer part of the demonstrator. The baseband and PHY layer are investigated within the Real100G.COM activity [3]. More information on the dedicated PHY layer and baseband processing can be found in [4–6].

Within the last three years, a few new approaches for a 100 Gbps physical layer (PHY) have been proposed. Table 1 summarizes selected transmission experiments performed in the terahertz band at the PHY layer.

From the data link layer point of view, the research presented in [7] is especially interesting. The authors consider a hybrid-ARQ approach for nanonetworks operating in the 300 GHz band with OOK modulation. The simulation uses Hamming (15, 11) channel coding with automatic repeat request (ARQ). This uncomplicated solution is chosen because of the limited power available in the targeted application and cannot be considered a general-purpose solution for 100 Gbps radio transceivers. Our paper proposes significantly more powerful error correction techniques, but the consumed energy per processed bit is also approx. 10^4 times higher (compared to the results estimated in [7]).

To perform 100 Gbps transmissions, more than a fast PHY and baseband is required. We focus on the overall transmission efficiency and reduce the overhead induced by protocols. As a result of our work, an FPGA based data link layer processor is presented. The implementation uses link adaptation methods and processes ~117 Gbps of user data.

2. Challenges

This section discusses the major challenges of implementing a data link layer for 100 Gbps wireless communication. A state-of-the-art Virtex7 FPGA is considered as the validation platform and IHP 130 nm technology as the final ASIC implementation.

2.1. Ultra-Short Processing Time

To achieve 100 Gbps data transmission, a frame of 1500 bytes must be processed within 120 ns. Therefore, an FPGA with a 200 MHz clock must process 63 bytes of the frame in each clock cycle. A single Viterbi decoder implemented in a Virtex7 FPGA requires approximately 60000 ns to process such a frame [8], but as mentioned before, the complete processing must be finished in an approx. 500 times shorter period. Moreover, transmitting the shortest Ethernet frame (64 bytes) takes only ~5 ns. Thus, RAM memory with latency ≪5 ns is required. Additionally, at a 100 Gbps data rate, the transceivers need 12.5 GB of memory to store the data transmitted over the last second. State-of-the-art computers and FPGA kits have a few GB of DDR3 or DDR4 memory available. However, the access time of such memory is too long: the estimated access time of DDR3 memory is around 45 ns [9], whereas ≪5 ns is required.
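The quoted budgets follow directly from the line rate; the minimal Python sketch below (constants and names are ours, not from the implementation) reproduces them:

```python
# Sanity check of the timing budgets quoted above.
LINE_RATE = 100e9            # 100 Gbps line rate
FPGA_CLOCK = 200e6           # 200 MHz FPGA clock

def frame_time_ns(frame_bytes: int) -> float:
    """Time on the wire for one frame at the 100 Gbps line rate."""
    return frame_bytes * 8 / LINE_RATE * 1e9

print(frame_time_ns(1500))            # 120.0 ns for a 1500-byte frame
print(frame_time_ns(64))              # ~5.1 ns for the shortest frame
print(LINE_RATE / FPGA_CLOCK / 8)     # 62.5 bytes to process per clock cycle
print(LINE_RATE / 8 / 1e9)            # 12.5 GB buffered per second of traffic
```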

2.2. Bit Errors and Forward Error Correction (FEC)

Today, various FPGA implementations support 100 Gbps in wired networks, for example, Ethernet 802.3ba over optical fiber running on an Altera FPGA platform [17]. In theory, these high-speed implementations might be used, with some adaptations, as the data link layer of 100 Gbps wireless systems. However, they would work inefficiently, because the data link layer of wireless systems must cope with unpredictable bit error rates (BER), leading to more complex solutions [18]. The BER in wireless communication can vary by several orders of magnitude; for example, the high-speed wireless RF frontends presented in [10] achieve BERs in the range of 1 × 10−10 to 4 × 10−3. Therefore, the FEC has to be adapted to the channel conditions. This increases the average code rate and compensates for the unpredictable BER on wireless links.

An additional difference between optical and wireless communication is duplex switching. Optical communication can use two separate fibers for uplink and downlink transmissions. Wireless transceivers are limited in this aspect: in most cases, half-duplex communication takes place. A radio can be in receive or transmit mode, but usually never in both states at the same time. Furthermore, some of the communication time is wasted on switching the PHY between receiving (RX) and transmitting (TX) modes. Due to these factors, the data link layer for wireless 100 G communication has to be considered as new research; otherwise, high efficiency cannot be achieved.

2.3. Forward Error Correction Complexity

Forward error correction (FEC) reduces the effective BER on the data link layer, which significantly improves the robustness and efficiency of the system. Unfortunately, the FEC gain is very expensive in terms of processing effort. In [19] we present the logic area consumed by Reed-Solomon, Viterbi, and LDPC decoders. The 1/2-rate Viterbi decoder with 5-bit soft coding presented in [20] would require a logic area of approx. 23 Virtex7 FPGAs to deal with the targeted 100 G stream.

3. Work Details

This section explains the optimizations required for the implementation of the FPGA data link layer processor (Figure 2). One of the main objectives is the comparison of algorithms according to computational complexity, hardware resources, and processing latency.

3.1. Frame Fragmentation

The frame error rate depends mainly on the BER and the frame length. For a longer frame, the probability that at least one bit is altered during transmission is higher due to channel impairments; fewer bits in a frame mean fewer opportunities for bit errors to occur. Thus, shorter frames are preferred in a noisy channel. This observation leads to the frame-fragmentation concept: long frames can be split into several shorter frames [22] (Figure 3). This operation improves the frame error rate and the data goodput (Figure 4). This is especially important for wireless 100 Gbps implementations, where the frame length has to be maximally extended to achieve high transmission goodput and to reduce the idle time of the RF frontend.

If the retransmission process (ARQ) is taken into consideration, then the probability of successfully delivering a payload encapsulated into smaller frames is higher than the probability of delivering the same payload encapsulated into longer frames. The probability can be calculated by the following equation:

P_n = 1 − (1 − (1 − BER)^L)^n,

where P_n is the probability of successful frame delivery after n transmissions, L is the frame length in bits, and BER is the bit error rate.

Figure 5 compares the probability of successful payload delivery for 1 MB and 4 MB frames as a function of transmission time. The shorter frame achieves a significantly higher probability of successful reception.
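For illustration, the following minimal Python sketch evaluates the equation above for a 4 MB payload sent either as four 1 MB frames or as a single 4 MB frame (the BER value is illustrative; the exact settings behind Figure 5 are not reproduced here):

```python
def p_delivery(frame_bits: int, ber: float, n_tx: int) -> float:
    """P_n = 1 - (1 - (1 - BER)^L)^n for a single frame of L bits."""
    p_frame_ok = (1.0 - ber) ** frame_bits
    return 1.0 - (1.0 - p_frame_ok) ** n_tx

BER = 1e-7
MB = 8 * 2**20                               # one megabyte in bits
for n_tx in (1, 2, 4, 8):
    p_1mb = p_delivery(MB, BER, n_tx) ** 4   # payload as 4 x 1 MB frames
    p_4mb = p_delivery(4 * MB, BER, n_tx)    # payload as one 4 MB frame
    print(n_tx, round(p_1mb, 4), round(p_4mb, 4))
```

As soon as retransmissions are allowed (n ≥ 2), the fragmented payload overtakes the monolithic frame, which is the effect visible in Figure 5.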

3.2. Frame Aggregation and Selective Fragment Retransmission

Frame fragmentation improves the transmission goodput by decreasing the frame length in a noisy environment. This reduces the frame error rate, but the process also has negative aspects. An increased number of frames requires more preambles generated at the PHY level. Each frame is extended by a PHY preamble to set the correct RF gain, synchronize the center frequency, and recover the data clock on the receiver side. During this time, no user data is transmitted, so the preamble reduces the goodput. Therefore, the number of transmitted preambles and headers should be reduced. The only way to do this is to extend the frame length as much as possible. This reduces the number of transmitted preambles and headers, but very long frames are not preferred in a noisy RF environment due to the increased frame error rate. This causes an impasse, but it is possible to reduce the number of transmitted preambles as well as the logical frame size using frame aggregation and selective fragment retransmission [23] (Figure 6).

The most important aspects of an aggregated frame are its resistance to bit errors and the reduced preamble overhead. The fragments of the frame share a common preamble and header, but the CRC fields are separate. The CRCs are calculated for each fragment independently, which allows detecting and retransmitting the defective parts individually (Figure 7). Additionally, the fragment length can be adjusted on the fly according to the channel BER.
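A minimal sketch of this frame layout is given below; the field names, the CRC-32 choice, and the 4-byte CRC width are illustrative assumptions, not the implemented frame format:

```python
import zlib
from dataclasses import dataclass

@dataclass
class AggregatedFrame:
    header: bytes                  # shared header (one PHY preamble per frame)
    fragments: list[bytes]

    def serialize(self) -> bytes:
        out = bytearray(self.header)
        for frag in self.fragments:
            out += frag
            out += zlib.crc32(frag).to_bytes(4, "big")   # CRC per fragment
        return bytes(out)

def faulty_fragments(fragments_with_crc: list[tuple[bytes, int]]) -> list[int]:
    """Indices of fragments whose CRC failed -> selective retransmission."""
    return [i for i, (frag, crc) in enumerate(fragments_with_crc)
            if zlib.crc32(frag) != crc]
```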

3.3. Estimation of Hardware Resources Required for a 100 Gbps FEC Decoder

Tables 2 and 3 compare some common convolutional and RS coders implemented in FPGA technology. In the compared group of algorithms, RS usually consumes fewer hardware resources than Viterbi, which means that RS obtains a higher throughput per single logic cell (Table 3). This comparison shows the advantages of RS decoders, but such an analysis is not fully correct. Both algorithms are compared in their typical configurations: the code rate of the Viterbi decoder is 1/2, while the code rate of the RS (255, 239) is 239/255 ≈ 0.94. Thus, the algorithms have different error correction capabilities, and such a study is not reliable. The RS algorithm requires a long decoding pipeline in most implementations, and shortening RS (255, 239) codes to obtain a 1/2 code rate is usually not possible for FPGA IP cores [24–27]. Additionally, changing the code rates may significantly influence the decoding efficiency, and such modified algorithms may be ineffective in terms of the obtained coding gain and the consumed hardware resources. Moreover, the decoding performance depends on the error type at the decoder input, and the two solutions prefer different channel types. Thus, a comparison of the hardware resources is difficult and can lead to wrong conclusions. Therefore, in this paragraph, a different benchmark is used. The goal is to estimate whether a 100 Gbps FEC engine based on typical FEC codes fits into a single high-end FPGA (e.g., Virtex7), in other words, whether an FPGA implementation of a 100 Gbps FEC engine is achievable. For this reason, Table 3 gives an approximation of the average throughput per single LUT of a half-rate soft decision Viterbi decoder in comparison to a hard decision RS (255, 239) decoder. The normalized decoding throughput of the RS is 25 times higher than that of the Viterbi. When the Viterbi solution is scaled to a 100 Gbps system, the overhead is so high that the decoder does not fit into the targeted xc7vx690tffg1761-2 FPGA. For this reason, the soft decision Viterbi decoder cannot be considered for an FPGA implementation of the 100 Gbps FEC engine. The error correction performance of the RS is limited but allows communication over channels with single errors.
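To make the scaling argument concrete, the following back-of-the-envelope sketch uses placeholder per-core figures (only the 25× normalized-throughput ratio is taken from the text, and only the xc7vx690t LUT count is a device fact):

```python
import math

TARGET_GBPS = 100.0
VIRTEX7_LUTS = 433_200                 # LUT count of the xc7vx690t device

def total_luts(core_gbps: float, core_luts: int) -> int:
    """LUTs needed to reach 100 Gbps with parallel decoder cores."""
    return math.ceil(TARGET_GBPS / core_gbps) * core_luts

rs_luts = total_luts(2.0, 4_000)          # assumed ~2 Gbps RS(255,239) core
vit_luts = total_luts(2.0 / 25, 4_000)    # same area, 25x lower throughput
print(rs_luts / VIRTEX7_LUTS)             # RS array: fraction of one FPGA
print(vit_luts / VIRTEX7_LUTS)            # Viterbi array: several FPGAs
```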

Table 3 compares LDPC [20] and RS decoders in a similar way. The RS (255, 239) requires fewer hardware resources than the soft decision decodable LDPC (1536, 1152) to achieve the same decoding throughput. Thus, RS coding is one of the assumed solutions for building a FEC engine for the targeted 100 Gbps application.

3.4. Error Correction Performance of Selected FEC Codes

The previous paragraph clearly shows that soft decodable FEC methods require very high computation power. Moreover, there is no possibility to realize an ADC that supports multibit quantization at the targeted data rate. Thus, the error correction performance of hard decision decodable Viterbi, LDPC, and BCH decoders is compared with RS. First, a complete Matlab model of the targeted ASIC/FPGA is implemented, and a simulation against single errors is performed. In the presented case, the selected RS (255, 223), BCH (2047, 1816), and HD-LDPC (64800, 57600) decoders achieve more or less comparable results (Figure 8). The Viterbi decodable convolutional code (with code rate 1/2) obtains poor error correction performance; thus, this code should not be considered for the targeted ASIC/FPGA implementation. The LDPC code corrects an ~15% higher BER than the RS. The BCH decoder provides the best results and corrects an ~30% higher BER than the RS. If a channel with burst errors is considered (Figure 9), then the RS decoder achieves significantly higher error correction performance than the other algorithms. It is important to emphasize that the LDPC and Viterbi decoders use hard-decoded data input, which is an untypical way of using these codes; in most applications, Viterbi and LDPC decoders work with soft decision bit values.
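The gap between the single-error and burst-error results can be reproduced qualitatively with a simple Monte Carlo sketch; the channel model below is our assumption, and only the RS (255, 223) correction capability of t = 16 symbols follows from the code definition:

```python
import random

N_SYM, T = 255, 16                     # RS(255, 223) corrects 16 symbols
N_BITS = N_SYM * 8

def fails_random(n_bit_errors: int) -> bool:
    """n random bit errors usually hit close to n distinct symbols."""
    hit_symbols = {random.randrange(N_BITS) // 8 for _ in range(n_bit_errors)}
    return len(hit_symbols) > T

def fails_burst(burst_len: int) -> bool:
    """A contiguous burst of the same size touches only a few symbols."""
    start = random.randrange(N_BITS - burst_len)
    hit_symbols = {(start + i) // 8 for i in range(burst_len)}
    return len(hit_symbols) > T        # a 60-bit burst spans <= 9 symbols

trials = 10_000
print(sum(fails_random(60) for _ in range(trials)) / trials)  # close to 1.0
print(sum(fails_burst(60) for _ in range(trials)) / trials)   # exactly 0.0
```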

The goodput-to-BER notation used in Figures 8 and 9 is not typical for coding papers, but for practical implementations it allows estimating the required BER at the PHY layer. For hardware-in-the-loop experiments, such notation is preferable, because the overall system performance can be estimated without additional SNR/EbN0 recalculations (e.g., the goodput of the system can be easily estimated when one of the RF frontends reported in Table 1 is considered).

Due to the relatively low decoding effort of RS codes, their remarkable performance against single errors, and their high performance against burst errors, RS codes are selected for the targeted FPGA/ASIC implementation. The performance against burst errors is required because a single ripple on the RF frontend or baseband power supply may disturb the decoding of tens of consecutive bits (the transmission of a single bit at 100 Gbps takes 10 ps). Additionally, data bits are packed into 60-bit symbols in the considered baseband [4]. If the decoding of a single symbol fails due to noise or synchronization, then a burst error of up to 60 bits is expected at the output of the PSSS baseband.

3.5. Parallel Processing Array of Interleaved RS Decoders

The presented FPGA design cannot run at a frequency higher than ~250 MHz on a Virtex7 FPGA. Thus, the throughput of a single RS entity is limited to ~2 Gbps, which means that at least 50 parallel decoders are required to achieve the targeted 100 Gbps data rate. Therefore, an interleaved calculation array of RS decoders has to be employed. We have already published a dedicated article about interleaved RS (IRS) for 100 Gbps applications [29]. In this paper we introduce the most important aspects of the implemented IRS FEC engine.

The IRS architecture has two advantages. Firstly, robustness against long burst errors is improved (Figure 10). Secondly, the throughput of the coder is multiplied, so the interleaved RS array can be easily scaled for 100 Gbps operation.

Due to the hardware properties of the Virtex7 FPGA, it is advantageous to keep the internal processing data buses 64 bits wide. The main reason is the hardware multiplexing supported by the FPGA: the multiplexers located at the input stage of the FPGA logic deserialize the input data to 64 bits in most cases (e.g., 10 G Ethernet [30] and GTH transceivers [31]). Each single RS entity calculates 8 bits of the 64-bit word (Figure 11). In the FPGA, ten such structures are employed to support the required 100 Gbps. Therefore, the FPGA calculates 10 × 64 bits in each clock cycle.
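A minimal sketch of this data distribution (the lane ordering is our assumption):

```python
def distribute(words_64bit: list[int], n_lanes: int = 8) -> list[list[int]]:
    """Byte-slice each 64-bit word across n_lanes interleaved RS lanes."""
    lanes: list[list[int]] = [[] for _ in range(n_lanes)]
    for word in words_64bit:
        for lane in range(n_lanes):
            lanes[lane].append((word >> (8 * lane)) & 0xFF)
    return lanes

# Consecutive bytes of the stream land in different RS code words, so a
# burst error on the wire is spread across independent decoders. Ten such
# 8-lane structures run in parallel: 10 x 64 bits per clock cycle.
print(distribute([0x0706050403020100])[0:2])   # -> [[0], [1]]
```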

3.6. Link Adaptation

It is possible to design an algorithm that finds a trade-off between the coding overhead and the demanded error correction performance. The algorithm analyzes the number of successfully delivered data fragments and the number of corrected errors in the fragments. When the goodput is degraded by data losses, the algorithm increases the FEC strength. It is important to define the thresholds at which the FEC modes are changed. One possible solution is to set the thresholds to the code rates of the employed codes. For example, when RS (255, 239) and RS (255, 223) are used, the thresholds can be set to 239/255 ≈ 93.7% and 223/255 ≈ 87.5%. If the percentage of successfully delivered data fragments falls below the given value, a code with a lower code rate is applied. In short, the approach finds a compromise between the RS overhead and the rate of lost data fragments. The thresholds correspond to the code rates and define the upper boundaries of the achievable goodput for the codes. Figure 12 explains the theory of operation of the algorithm, and Figure 13 shows the results of the proposed approach (the FPGA adjusts the redundancy in a range of 2–18 symbols per RS block). More details about the implemented link adaptation can be found in our articles [21, 32].
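A minimal sketch of the threshold rule, assuming a two-mode list and a fall-back to the strongest code (the real implementation tunes the redundancy in finer steps):

```python
RS_MODES = [(255, 239), (255, 223)]          # ordered by decreasing code rate

def select_mode(delivered: int, sent: int) -> tuple[int, int]:
    """Pick the weakest code whose code-rate threshold is still met."""
    success_rate = delivered / sent
    for n, k in RS_MODES:
        if success_rate >= k / n:            # 93.7% and 87.5% thresholds
            return (n, k)
    return RS_MODES[-1]                      # below all thresholds: strongest

print(select_mode(95, 100))   # -> (255, 239), light coding is enough
print(select_mode(90, 100))   # -> (255, 223), more redundancy needed
```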

3.7. FEC Decoder for Frame Headers

When fragmentation and aggregation are in use, the frame header is the most important part of the frame. Any single error in the fragmented payload reduces the goodput by 1/n, where n is the number of aggregated fragments. The situation deteriorates greatly when a bit error occurs in the frame header. The header contains information necessary for frame decoding, for example, the length and the FEC encoding method. Without this information, the frame cannot be decoded and all n payload fragments are lost. Additionally, the complexity of the frame-decoding pipeline is lower when the header is decoded immediately (without delay). For this purpose, an independent triple-modular redundancy decoder is proposed. A hardware implementation of the method decodes the header in less than 5 ns, which means that the information from the header can be used immediately and frame buffering is avoided. This is not possible when the header is encoded with an RS, BCH, or convolutional code.

The triple-modular redundancy encoder sends the header three times, and the decoder checks the CRCs of the copies. If all copies have bit errors, then majority voting is performed on the copies, and the output of the voting is checked as well. If at least one of the four variants is correct, the header is decoded successfully (Figure 14).
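The decoding rule can be sketched as follows; CRC-32 stands in for the CRC of the real frame format, which is an assumption:

```python
import zlib

def vote(a: bytes, b: bytes, c: bytes) -> bytes:
    """Bitwise majority over the three received header copies."""
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))

def decode_header(copies: list[bytes], crc: int) -> bytes | None:
    """Accept the first of the four variants whose CRC matches."""
    for variant in (*copies, vote(*copies)):
        if zlib.crc32(variant) == crc:
            return variant                 # header decoded successfully
    return None                            # all variants failed: frame lost
```

In hardware, the voting and the four CRC checks are parallel combinational logic, which is why a single clock cycle suffices.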

The code rate of the presented method is 1/3: in the considered case, the header is 20 bytes long and the encoded header adds 40 bytes of redundancy. The assumed frame length is not shorter than 64 kB, so the header overhead is lower than 0.61% of the total frame length and can be neglected.

The introduced triple-modular routines are uncomplicated, but the error correction performance is poor. In Figure 15, the algorithm is compared to the other previously mentioned FEC schemes (BCH, HD-LDPC, RS, and HD-convolutional). Convolutional coding with hard symbol representation achieves approx. 8 dB better results at the same code rate. When the gain is compared to RS (255, 223), the results are more optimistic: if the header is encoded together with the frame data, the triple redundancy is at least 2 dB better than the RS (denoted as “RS (255, 223)” in Figure 15). Moreover, the hardware implementation of the modular redundancy requires only 1 FPGA clock cycle (<5 ns) to provide the decoding results. The RS needs ~2000 ns in the considered IHP implementation [20] and ~3700 ns in the Xilinx version of the RS IP core [24]. The Viterbi decoder provided by Xilinx requires at least ~400 ns to perform the same task [8].

3.8. ACK Frame Length and ACK Encoding

If the transmitter and receiver do not exchange information about the lost data, then some fragments might never be delivered, and the user payload may lack consistency. To avoid such situations, ACK frames are sent after an arbitrarily defined number of data frames. When an ACK frame is lost, the transmitter sends an ACK-request frame after a predefined timeout. Additionally, the ACK transmission requires switching the RF frontends two times (from RX to TX and from TX to RX). During this time and during the timeout, no user data is exchanged, which has a serious impact on the goodput. To avoid the loss of ACKs and to reduce the number of timeouts, the ACKs are encoded with the strongest FEC available in the system (RS (255, 237) in the FPGA implementation). That significantly improves the ACK robustness. The overhead induced by the coding is relatively low, because an ACK frame is short (~1 kB) and is sent infrequently (not more often than once every 4 MB of user payload). The ACK frame length depends on the number of successfully received fragments in a single ARQ session (positive acknowledgment). When the fragmentation threshold is lowered, many small parts have to be sent and acknowledged. Thus, compression methods can be used to reduce the length of the ACK frames. Figure 16 compares three approaches: a basic bit map coding and two versions of sequence coding. In the state-of-the-art bit map scheme, a single value and a bit map are sent: the value defines the number of the first acknowledged fragment, and the bit map determines all following values; the bit position specifies an offset, and the bit value defines whether the fragment is acknowledged or not. The second and third methods send only a sequence of addresses of successfully received fragments. All three methods are implemented in the FPGA/ASIC, and the ACK frame can be generated and compressed on the fly by the FPGA/ASIC logic. This approach reduces the ACK processing latency. Figure 16 compares the maximal ACK frame length obtained by the proposed techniques as a function of the channel BER. The proposed method with 15-bit coding significantly reduces the ACK size (<1 kB) under all BER conditions, which additionally improves the robustness of the return channel.

The sequence coding with 16 bits writes two uint16 values to memory that define the first and the last sequence number of successfully received fragments in a consecutive subsequence. Such coding is very effective if all or almost all frame fragments are successfully decoded (transmission at low BER in Figure 16). On the other hand, encoding a single fragment within a long sequence of faulty fragments costs 32 bits: two uint16 values (at high BER in Figure 16). To overcome this problem, a modified coding is proposed. The modified solution uses the 15 lower bits for a sequence number, and the most significant bit is reserved to indicate single values. All three methods are compared in Figure 17.
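A minimal sketch of the modified 15-bit coding (the packing details are our assumptions):

```python
SINGLE = 0x8000                            # MSB set -> isolated fragment

def encode_ack(acked: list[int]) -> list[int]:
    """acked: sorted numbers (< 2**15) of successfully received fragments."""
    words, i = [], 0
    while i < len(acked):
        j = i
        while j + 1 < len(acked) and acked[j + 1] == acked[j] + 1:
            j += 1                         # extend the consecutive run
        if i == j:
            words.append(SINGLE | acked[i])        # isolated: one uint16
        else:
            words.extend((acked[i], acked[j]))     # run: first and last
        i = j + 1
    return words

print(encode_ack([0, 1, 2, 3, 9, 20, 21]))  # -> [0, 3, 32777, 20, 21]
```

An isolated fragment thus costs one uint16 instead of two, which is where the gain at high BER in Figure 17 comes from.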

4. IHP 130 nm ASIC Synthesis

After successful FPGA validation, the design was synthesized into the IHP 130 nm CMOS technology. The FPGA implemented FEC engine supports RS (255, 237), and currently we cannot support more robust coding in the FPGA. Changing the technology to the 130 nm ASIC allows enabling RS (255, 223) FEC in the ASIC design; this significantly improves the error correction performance, and the engine can also cope with burst errors at a higher BER. The RS coders consume approx. 88% of the chip resources, and the complete design requires a silicon area of 30.64 mm². The computation effort depends on the number of defective symbols in a data block. Therefore, the consumed power and the dissipated energy of the processor are a function of the channel BER (Figure 18). The synthesis tool reports a maximal operational clock of up to 210 MHz.

The synthesis and ASIC power profiling tools allow estimating the energy per bit for the intermediate codes selected by the link adaptation algorithm (Figure 19). The frame processing features (deaggregation, CRC, FSMs, ACK processing, etc.) consume a relatively small amount of energy (1.39 pJ/bit) in comparison to the FEC decoding (up to ~41 pJ/bit).

4.1. SNR at the Physical Layer and Energy Required for Forward Error Correction

The targeted baseband [4] uses parallel spread spectrum sequences (PSSS) with PAM-16 modulation (4 bits per symbol) and channel deconvolution. PAM-16 modulation is selected due to the relatively uncomplicated hardware required for PSSS processing in an analog baseband [4, 6, 33]. The PSSS and the channel deconvolution provide some processing gain, which additionally improves the transmission quality. The investigation of the PSSS and channel deconvolution is beyond the scope of this work. Thus, in this paragraph, a simplified baseband and RF frontend without the PSSS and channel deconvolution but with PAM-16 modulation are considered. As a result, the SNR required for the presented data link layer processor is estimated. Due to the aggregation and fragmentation schemes, the DLL operates with deactivated FEC at low channel BER with insignificant goodput degradation (goodput of 97.84% according to Figure 13). Improving the link quality above this point using FEC or increased transmission power does not improve the goodput significantly but may lead to energy waste. Thus, from an energy efficiency point of view, operating at this BER level is recommended to obtain reasonable goodput. At a higher channel BER, the accelerator still operates and achieves a goodput of 90.68%. However, even higher BER values are not recommended; in such a case, the FEC effort should be increased to reduce the number of retransmitted frames. The SNR values required for the two considered post-FEC BER levels are presented in Figure 20. The measurements shown in Figure 20 correspond to the goodput characteristics shown in Figure 21.

The goodput (Figure 21) decreases with decreasing SNR, even if the post-FEC BER values are kept constant. This is caused by the link adaptation approach, not by retransmissions (ARQ). The processor increases the number of redundancy bits as the SNR decreases; thus, the average amount of user data in a frame decreases, and the goodput decreases as well. The presented values correspond to the worst case without the processing gain of the PSSS and channel deconvolution.

5. Conclusion

In this article, we first defined the challenges for a 100 Gbps data link layer processor. Then, several algorithms were considered, and the simplifications required for FPGA/ASIC processing were introduced. This allowed implementing and validating the proposed system on a standard FPGA platform. The basic HARQ type-I method with link adaptation can be used instead of more complex HARQ schemes: the efficiency degradation is relatively insignificant, and the type-I method does not require cache memory, which simplifies the design and removes one of the hardware boundaries. Additionally, we demonstrated an FEC engine that works at ~117 Gbps on a single Virtex7 FPGA. Fragmentation, aggregation, and ACK compression additionally improve the resistance against bit errors. The triple-modular redundancy decoder simplifies the processing pipeline and reduces the header decoding time to a few nanoseconds. Finally, the energy consumed per bit for the design synthesized into the 130 nm IHP CMOS technology was presented.

Competing Interests

The authors declare that they have no competing interests.