Abstract
We present a software toolchain for constructing large-scale regular expression matching (REM) on FPGA. The software automates the conversion of regular expressions into compact and high-performance nondeterministic finite automata (RENFA). Each RENFA is described as an RTL regular expression matching engine (REME) in VHDL for FPGA implementation. Assuming a fixed number of fanout transitions per state, an $n$-state, $m$-bytes-per-cycle RENFA can be constructed in $O(n \times m)$ time and $O(n \times m)$ memory by our software. A large number of RENFAs are placed onto a two-dimensional staged pipeline, allowing scalability to thousands of RENFAs with linear area increase and little clock rate penalty due to scaling. On a PC with a 2 GHz Athlon 64 processor and 2 GB memory, our prototype software constructs hundreds of RENFAs used by Snort in less than 10 seconds. We also designed a benchmark generator which can produce RENFAs with configurable pattern complexity parameters, including state count, state fanin, and loopback and feedforward distances. Several regular expressions of various complexities are used to test the performance of our RENFA construction software.
1. Introduction
Regular expression matching (REM) has many applications ranging from text processing to packet filtering. In the narrow sense, each regular expression defines a regular language over the alphabet of input characters. A regular language applies three basic operators on the alphabet: concatenation (·), union (|), and Kleene closure (*), which allow the construction of complex expressions. There are other common operators that also conform to the regular language construct, such as character classes ([...]), optionality (?), and constrained repetitions ({a,}, {,b}, {a,b}). All of these operators can be realized by proper arrangements of the three basic ones.
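As an illustration of how the derived operators reduce to the three basic ones, the following minimal Python sketch rewrites optionality and constrained repetition using only concatenation, union, and closure (plus the empty string). The tuple-based syntax tree and the names `optional`, `repeat`, `cat_all`, and `matches` are assumptions made for this example; they are not part of the toolchain described in this paper.

```python
# Hypothetical tuple-based syntax tree:
#   ("chr", c), ("eps",), ("cat", a, b), ("alt", a, b), ("star", a)
EPS = ("eps",)


def cat_all(parts):
    """Concatenate a list of sub-expressions; an empty list yields epsilon."""
    result = EPS
    for p in reversed(parts):
        result = p if result == EPS else ("cat", p, result)
    return result


def optional(r):
    """r? is rewritten as (r | epsilon)."""
    return ("alt", r, EPS)


def repeat(r, lo, hi=None):
    """r{lo,hi}; hi=None means unbounded, i.e. r{lo,}."""
    required = [r] * lo
    if hi is None:
        return cat_all(required + [("star", r)])
    # (hi - lo) nested optional copies: (r(r(...)?)?)?
    tail = EPS
    for _ in range(hi - lo):
        tail = optional(r if tail == EPS else ("cat", r, tail))
    return cat_all(required + [tail])


def ends(r, s, i):
    """Set of positions at which a match of r starting at s[i] can end."""
    kind = r[0]
    if kind == "eps":
        return {i}
    if kind == "chr":
        return {i + 1} if i < len(s) and s[i] == r[1] else set()
    if kind == "cat":
        return {k for j in ends(r[1], s, i) for k in ends(r[2], s, j)}
    if kind == "alt":
        return ends(r[1], s, i) | ends(r[2], s, i)
    if kind == "star":
        seen, frontier = {i}, {i}
        while frontier:
            frontier = {k for j in frontier for k in ends(r[1], s, j)} - seen
            seen |= frontier
        return seen
    raise ValueError(kind)


def matches(r, s):
    """True if r matches the whole string s."""
    return len(s) in ends(r, s, 0)


a = ("chr", "a")
a_2_to_4 = repeat(a, 2, 4)   # a{2,4}
a_2_or_more = repeat(a, 2)   # a{2,}
```

Here `a_2_to_4` accepts two to four copies of `a` and nothing else, even though the tree contains only `cat`, `alt`, and `eps` nodes. A character class can be handled in the same way as a union of single characters.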
Improving large-scale REM performance has been a research focus in recent years [1–11]. Since regular languages can be necessarily and sufficiently accepted by finite state automata, a regular expression matching engine (REME) supporting concatenation, union, closure, repetition, and optionality can always be implemented as either a nondeterministic finite automaton (RENFA) or a deterministic finite automaton (REDFA). Figure 1 compares the architectures of the two types of automata side by side.
In an RENFA approach [2, 4, 7–10], individual regular expressions and their character matching states are processed in parallel with one another. As a result, more than one state in an RENFA can be active at any time. Optimizations such as input/output pipelining [4], common-prefix extraction [2, 4], multicharacter input [9, 10], and centralized character decoding [2, 12] can be applied to improve throughput and reduce the resource requirements of the overall design.
In an REDFA approach, several regular expressions are grouped (union'd) into a DFA by expanding different combinations of active states into additional combined states. In principle, only one combined state in an REDFA is active at any time. Various techniques [5, 6, 13, 14] are then applied to improve memory access efficiency and to reduce the total number of states, which usually suffers from quadratic to exponential explosion [11].
Due to the matching power of regular expressions and the complexity of the strings being matched, the REM process can be the slowest bottleneck of a system. Matching a regular expression of length $n$ over an alphabet of size $|\Sigma|$ can take up to $O(n^2)$ time to process each character (for RENFA) or $O(|\Sigma|^n)$ memory space to store the state transition table (for REDFA) [11]. Furthermore, to match $k$ concurrent regular expressions, the overall throughput could be $k$ times slower (for RENFA) or take more memory space (for REDFA) in the worst case.
Modern FPGAs offer large amounts of reconfigurable logic (LUTs) and on-chip memory (BRAM). We developed a compact and high-performance RENFA architecture for REM which utilizes both on-chip logic and memory resources on modern FPGAs [10]. In this study, we focus on the automatic parsing, translation, and construction of regular expression matching engines (REMEs) using our RENFA architecture for fully automated FPGA implementation. More specifically, we develop REME construction software with the following components:
(1) Automatic conversion from regular expression parse tree [15] to a uniform and modular RENFA structure.
(2) Automatic generation of RTL code in VHDL for each RENFA. The resulting circuit is spatially stacked a configurable number of times for multicharacter matching.
(3) Allocation of centralized character classification in BRAM for up to 256 REMEs using a simple heuristic.
(4) Automatic construction of up to 16 pipelines in a two-dimensional structure.
(5) A benchmark generator of regular expressions with configurable pattern complexity parameters (state count, state fanin, loopback, and feedforward distances).
The rest of this paper is organized as follows. The background and prior work of RENFA on FPGA are discussed in Section 2. An overview of our software toolchain is given in Section 3. Section 4 describes REME construction, while Section 5 covers architectural optimization. Section 6 introduces an REME benchmark generator and uses it to evaluate the performance of the REME construction and optimization software. Section 7 concludes the paper.
2. Background and Related Work
Hardware implementation of regular expression matching (REM) was first studied by Floyd and Ullman [15], where an $n$-state RENFA is translated into integrated circuits using no more than $O(n)$ circuit area. Sidhu and Prasanna [8] later proposed an algorithm to implement REM on FPGA in a similar RENFA architecture, which has been used by most other RENFA implementations on FPGAs [2, 4, 7, 9]. Yang et al. [10] adopted a different approach to translate arbitrary regular expressions into corresponding RENFAs with a more modular and uniform circuit structure.
Automatic REME construction on FPGAs was first proposed in [4] using JHDL for both regular expression parsing and circuit generation. In particular, the (J)HDL construction approach used in [4] is in contrast to the self-configuration approach of [8]. Reference [4] also considered large-scale REME construction, where the character input is broadcast globally to all states in a tree-structured pipeline. Automatic REME construction in VHDL was proposed in [2, 7]. In [2], the regular expression was first tokenized and parsed into a hierarchy of basic NFA blocks, then constructed in VHDL using a bottom-up scheme. In [7], a set of scripts was used to compile regular expressions into opcodes, to convert opcodes into NFA, and to construct the NFA circuits in VHDL.
A multicharacter decoder was proposed in [16] to improve pattern matching throughput. While the technique was claimed to be applicable to REM, only the construction of a fixed-string matching circuit was explained; the paper did not describe an automatic mechanism to translate a general pattern into a multicharacter matching circuit. An algorithm that extends any single-character matching REME temporally into a multicharacter matching REME was proposed in [9]. In contrast, the uniform structure of the RENFA in [10] allows its circuit to be stacked spatially and automatically to process multiple characters per clock cycle.
3. Overview of the Software Toolchain
The main purpose of our software toolchain is to automate the construction and optimization of large-scale RENFA circuits on FPGA. The toolchain allows us to generate the whole RTL circuit matching thousands of regular expressions in a matter of seconds using a single command. Such a toolchain helps us not only to avoid tedious and error-prone manual circuit construction, but also to generate a large-scale regular expression matching engine (REME) for implementation in a small amount of time.
Figure 2 gives an overview of the toolchain. The toolchain consists of two main parts: REME Construction and Architectural Optimization, briefly described as follows:
In practice, the two paths of REME Construction in Figure 2 are written as a single module that interleaves the two tasks for each input regular expression. Conceptually, however, they are independent of each other and can be executed in parallel. In contrast, the two tasks in Architectural Optimization, spatial stacking and pipeline marshaling, must be performed serially. The details of the REME Construction part are presented in Section 4, while those of the Architectural Optimization part are in Section 5.
In addition to the basic operators of concatenation, union (|), and Kleene closure (*) used to define a regular language, our software also handles the operators most frequently used by the Snort IDS [7], such as repetition (+), optionality (?), constrained repetition ({a,b}), and any character class ([...]). Table 1 lists the operators supported by our software. The syntax and semantics of these operators are compatible with Perl-Compatible Regular Expressions [17]. For example, the expression “” specifies any IP address followed by optional non-numerical characters.
4. Automatic REME Construction
The REME Construction is performed in three steps: (1) parse the regular expressions into tree structures, (2) use the modified McNaughton-Yamada (MMY) construction (Figure 4, Algorithm 1) to construct the RENFAs, and (3) map the RENFAs into structural VHDL suitable for FPGA implementation.

4.1. From Regular Expression to Parse Tree
The first step is to represent each regular expression as a corresponding parse tree using a standard compiler technique. This step is the same as that described in [15]. Figure 3 shows a parse-tree representation of the regular expression “\x2F(fn|s)\x3F[rn]si”, which is simplified from an actual Snort [18] pattern for the sake of illustration. In particular, a union of any number of single characters is parsed as a single character class (e.g., the [rn] in Figure 3), which can be matched very efficiently in our REM architecture [10].
The resulting parse tree always consists of three types of internal nodes (op_concat, op_union, and op_closure) and a number of leaf nodes equal to the number of individual (and possibly non-unique) character classes in the regular expression.
4.2. From Regular Expression Parse Tree to NFA
Unlike the previous work in [15] and later in [8], which use the McNaughton-Yamada (MNY) construction to convert regular expressions into RENFAs, we proposed the modified McNaughton-Yamada (MMY) construction in [10] to perform the conversion. Figure 4 gives a graphical description of the modified construction rules.
A formal definition of the construction mechanism is given in Algorithm 1. The algorithm takes the regular expression parse tree generated in the previous subsection as input. It is in general a recursive algorithm, where the subtrees of each internal node are processed recursively before the operator of the current node is handled. The only exception is the right child of an op_concat node, where for performance reasons the tail recursion is performed iteratively. This avoids excessive recursion for a long sequence of op_concat operators (which is predominantly the case in real-world patterns).
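The traversal order described above can be sketched as follows. This is a minimal Python illustration, not Algorithm 1 itself: the `Node` class and the `count_leaves` function are hypothetical names, and the traversal merely counts leaf nodes (one per character class). What it shows is the control flow: recursion on the children of op_union and op_closure and on the left child of op_concat, and a loop on the right child of op_concat.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    kind: str                       # "leaf", "op_concat", "op_union", "op_closure"
    left: Optional["Node"] = None   # only child of op_closure; unused by leaves
    right: Optional["Node"] = None  # unused by op_closure and leaves
    label: str = ""                 # character class of a leaf


def count_leaves(node):
    """Count leaf nodes; the right child of op_concat is handled iteratively."""
    total = 0
    while node is not None:
        if node.kind == "leaf":
            return total + 1
        if node.kind == "op_closure":
            return total + count_leaves(node.left)
        if node.kind == "op_union":
            return total + count_leaves(node.left) + count_leaves(node.right)
        # op_concat: recurse into the left child, then loop on the right child
        total += count_leaves(node.left)
        node = node.right
    return total


# A right-nested chain of 10000 concatenated leaves, built iteratively.
chain = Node("leaf", label="a")
for _ in range(9999):
    chain = Node("op_concat", Node("leaf", label="a"), chain)

chain_leaves = count_leaves(chain)
```

With a purely recursive traversal, a chain of this length would exceed Python's default recursion limit; with the loop on the right child, the recursion depth stays constant for such a chain.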
Two special entities are used in Algorithm 1 for the MMY construction. The first is the set of immediate previous states, which contains the source states of all fanin transitions to the part of the RENFA currently under construction. This entity corresponds to the dashed ellipses on the left of Figures 4(c) and 4(d). It allows a long sequence of transitions in the original MNY construction to be collapsed into a single transition in the MMY construction.
The second entity is the pseudo-state, which works as a placeholder for the source states of an op_closure's feedback loop before the op_closure is converted into part of the RENFA. This temporary placeholder is needed to break the circular dependence of an op_closure construction on the resulting fanout states of that very op_closure construction.
The MMY construction algorithm produces an NFA that is extremely modular and easy to map to HDL code. For example, using the modified construction algorithm, the regular expression “\x2F(fn|s)\x3F[rn]si” is converted into a modular NFA with a uniform structure (Figure 5). This conversion is arguably the most complex part of the construction process, taking roughly 350 lines of C code for the automation.
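To make the roles of the immediate previous states and the pseudo-state placeholder more concrete, the following Python sketch builds an epsilon-free NFA with one state per leaf of the parse tree. It is a simplified construction written for this illustration in the spirit of the description above; it is not Algorithm 1, and the names `Builder`, `START`, the tuple-based tree, and the patching loop over all states are assumptions of the sketch.

```python
import itertools

START = 0  # hypothetical id of the start state feeding the first real states


class Builder:
    def __init__(self):
        self.preds = {}    # state id -> set of predecessor state ids
        self.classes = {}  # state id -> set of accepted characters
        self._next_state = itertools.count(1)
        self._next_pseudo = itertools.count(-1, -1)  # placeholders are negative

    def build(self, node, prev):
        """Add the states for `node`; `prev` holds the immediate previous states.

        Returns the set of states whose outputs feed whatever follows `node`.
        """
        while node[0] == "op_concat":          # right child handled iteratively
            prev = self.build(node[1], prev)
            node = node[2]
        kind = node[0]
        if kind == "leaf":
            s = next(self._next_state)
            self.classes[s] = set(node[1])
            self.preds[s] = set(prev)
            return {s}
        if kind == "op_union":
            return self.build(node[1], prev) | self.build(node[2], prev)
        if kind == "op_closure":
            p = next(self._next_pseudo)        # stands in for the feedback sources
            out = self.build(node[1], set(prev) | {p}) - {p}
            for sources in self.preds.values():  # patch the placeholder
                if p in sources:
                    sources.discard(p)
                    sources |= out
            return set(prev) | out             # zero repetitions pass `prev` through
        raise ValueError(kind)


# Parse tree of a(b|c)*d, written as nested tuples.
tree = ("op_concat", ("leaf", "a"),
        ("op_concat",
         ("op_closure", ("op_union", ("leaf", "b"), ("leaf", "c"))),
         ("leaf", "d")))

builder = Builder()
final_states = builder.build(tree, {START})
```

For `a(b|c)*d` the sketch creates four states. The states for `b`, `c`, and `d` each receive transitions from `a`, `b`, and `c`, so the closure and the union contribute fanin rather than extra states. The patching loop scans every state for simplicity; an implementation concerned with construction time would restrict it to the states created inside the closure.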
4.3. From RENFA to VHDL
To translate the RENFA (such as that in Figure 5) into VHDL, each pair of nodes inside a lightly shaded ellipse is mapped to an entity statebit with one parameter: the number of input ports, determined by the number of “previous states” that immediately transition to the current state. Inside the entity statebit, all inputs aggregate into a single OR gate, followed by character matching via a logic AND and a state value register. The single-bit output value of the register is connected to the inputs of the immediate “next states.”
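The behavior of such state bits can be simulated in software. The sketch below is a minimal Python model, assuming a small hand-written RENFA for `a(b|c)*d` with a hypothetical start input (id 0) that is asserted only on the first character; `PREDS`, `CLASSES`, `clock`, and `match` are illustrative names, and the model is not the generated VHDL.

```python
# Hypothetical RENFA for a(b|c)*d: predecessors and character class per state.
PREDS = {1: {0}, 2: {1, 2, 3}, 3: {1, 2, 3}, 4: {1, 2, 3}}
CLASSES = {1: "a", 2: "b", 3: "c", 4: "d"}
FINAL = 4


def clock(regs, ch, start):
    """One clock cycle: OR the predecessor bits, AND with the character match,
    and latch the result as the new register value of each state bit."""
    src = dict(regs)
    src[0] = start  # the start input is modeled as pseudo-state 0
    return {
        s: int(any(src[p] for p in PREDS[s]) and ch in CLASSES[s])
        for s in PREDS
    }


def match(text):
    """Anchored match: the start input is asserted for the first character only."""
    regs = {s: 0 for s in PREDS}
    for i, ch in enumerate(text):
        regs = clock(regs, ch, start=int(i == 0))
    return regs[FINAL] == 1
```

Holding the start input high on every cycle instead would turn the anchored match into a search for the pattern anywhere in the input stream.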
The REM circuit for Figure 5 is shown in Figure 6. On FPGA devices with 4-input LUTs, a $k$-input OR followed by a 2-input AND can be efficiently implemented on a single LUT if $k \le 3$, or on a single slice of 2 LUTs if $k \le 6$. The mapping takes only about 300 lines of C code to convert any RENFA to its RTL structural VHDL description.
4.4. BRAM-Based Character Classification
Our REM architecture in [10] uses a 256-bit column of BRAM to match any character class of 8-bit characters. Each bit of the column represents the inclusion of one 8-bit character value in the character class. The value of every input character is used as a row index into the BRAM to retrieve the matching result (true or false) of that character against all character classes (one per column). Each single-bit result is routed from the BRAM to its corresponding entity statebit as the input to the AND gate. As a result, character classification for an $n$-state RENFA can be implemented on a block memory (BRAM) of no more than $256 \times n$ bits.
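The table lookup can be modeled in a few lines. In this minimal Python sketch, `build_bram` and `classify` are hypothetical names and the BRAM is modeled as a list of 256 rows, with one bit per character class in each row; the three example classes are chosen only for illustration.

```python
def build_bram(classes):
    """Return 256 rows; bit c of row b is 1 if byte value b is in class c."""
    return [[int(b in cls) for cls in classes] for b in range(256)]


def classify(bram, byte):
    """One BRAM read: the input byte indexes a row of per-class match bits."""
    return bram[byte]


# Three example character classes: line terminators, digits, and '/'.
example_classes = [
    set(b"\r\n"),
    set(range(ord("0"), ord("9") + 1)),
    {0x2F},
]
bram = build_bram(example_classes)
total_bits = len(bram) * len(example_classes)  # 256 bits per column
```

Reading row `ord("7")` returns `[0, 1, 0]`: a single access yields the match result of the input character against every class, regardless of how many characters each class contains.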
Furthermore, if two states (either within the same regular expression or across different regular expressions) match the same character class, then they can share the same BRAM column output. We use a two-phase procedure to aggregate the matching outputs of identical character classes.
(i) In phase 1, the software collects the set of unique character classes from a regular expression. Each unique character class is associated with a floating-point sorting key:
(a) if the character class appears only once in the regular expression, then the sorting key is its (only) position index within the regular expression;
(b) if the character class appears multiple times in the regular expression, then the sorting key is the average of all its position indexes within the regular expression.
(ii) In phase 2, the unique character classes are sorted according to their sorting keys and instantiated as BRAM columns. Each BRAM column is also associated with the identifier of the instantiated character class. The output of each BRAM column is then connected to the character matching inputs with the same identifier.
The two-phase procedure allows our software to use the minimum number of BRAM columns for character class matching. It also minimizes routing distance by exploiting the natural ordering (the sorting keys) of the character classes within the regular expressions. The aggregation of character classes and their distribution to the RENFA states take 250 lines of C code.
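The two phases can be sketched as follows. This minimal Python example assumes that each state is given by a label identifying its character class and that position indexes start at 0; `allocate_columns` and the example labels (taken from the states of the pattern used in Section 4.1) are illustrative and not the toolchain's actual code.

```python
from collections import defaultdict


def allocate_columns(state_classes):
    """Return (columns, wiring) for a list of per-state character class labels.

    columns: unique classes in BRAM column order.
    wiring:  for each state, the index of the column that drives it.
    """
    # Phase 1: collect unique classes; the key is the average position index.
    positions = defaultdict(list)
    for idx, cls in enumerate(state_classes):
        positions[cls].append(idx)
    keys = {cls: sum(p) / len(p) for cls, p in positions.items()}

    # Phase 2: sort by key, instantiate columns, and connect states to columns.
    columns = sorted(keys, key=keys.get)
    column_of = {cls: c for c, cls in enumerate(columns)}
    wiring = [column_of[cls] for cls in state_classes]
    return columns, wiring


# Character class of each state, in order of appearance in the expression.
states = ["/", "f", "n", "s", "?", "[rn]", "s", "i"]
columns, wiring = allocate_columns(states)
```

The class `s` appears at positions 3 and 6, so its key is 4.5 and its column is placed between those of `?` (key 4) and `[rn]` (key 5). Eight states share seven columns, and both `s` states are wired to the same column.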
5. Automated Architectural Optimizations
After constructing REMEs individually for all regular expressions, the software applies two architectural optimizations [10]. (1) The REMEs are stacked to form multicharacter matching (MCM) circuits, which trade off minimum resource usage for higher performance. (2) The MCM REMEs are grouped into clusters of 16 and marshaled onto a two-dimensional staged pipeline structure.
5.1. Circuit Stacking for Multicharacter Matching
In contrast to the NFA-level temporal extension used in [9], we adopted a circuit-level spatial stacking to construct multicharacter matching (MCM) REMEs. Figure 6 shows the basic construction concept of a 2-character matching circuit from two copies of a single-character matching circuit. An algorithm for this spatial stacking approach and the proof of correctness were given in [10]. Benefits of the spatial stacking approach include the following.
Simplicity
The time complexity to construct an $n$-state, $m$-character matching REME using spatial stacking is $O(n \times m)$ [10]. In contrast, the temporal extension of [9] has a higher time complexity.
Flexibility
The spatial stacking approach can generate an MCM REME that matches any natural number $m$ of characters per cycle, while the temporal extension approach only generates RENFAs in which $m$ is a power of 2.
In practice, the number of states $n$ is usually a few tens while the number of characters per cycle $m$ is between 2 and 8, making the spatial stacking approach hundreds of times faster than the temporal extension approach. As discussed in Section 6.2, our software can construct thousands of MCM REMEs in 10 seconds. Also, the optimal value of $m$ with respect to performance efficiency (defined in [10]) is usually not a power of 2, as is the case for the REMEs from Snort rules [10].
The program code to construct any $m$-character matching REME using spatial stacking is simple. Let $C$ be a single-character matching circuit. The program first makes $m$ copies of $C$, namely $C_1, \ldots, C_m$, each receiving one of the $m$ consecutive input characters. Then, instead of routing the state outputs back to the state inputs of the same circuit, it removes the state registers of $C_1, \ldots, C_{m-1}$ and connects the (non-registered) state outputs of $C_i$ to the state inputs of $C_{i+1}$ for $i = 1, \ldots, m-1$. Finally, it connects the (registered) state outputs of $C_m$ to the state inputs of $C_1$. The result is an $m$-character matching circuit for $C$.
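The effect of chaining the copies can be checked with a small simulation. The Python sketch below assumes a hand-written RENFA for `a(b|c)*d` and a hypothetical start input (id 0) that feeds only the first copy on the first cycle; `comb`, `stacked_clock`, and `run` are illustrative names. Each copy is modeled as the combinational OR/AND logic of the state bits, and only the outputs of the last copy are registered.

```python
# Hypothetical RENFA for a(b|c)*d: predecessors and character class per state.
PREDS = {1: {0}, 2: {1, 2, 3}, 3: {1, 2, 3}, 4: {1, 2, 3}}
CLASSES = {1: "a", 2: "b", 3: "c", 4: "d"}


def comb(src, ch):
    """Combinational logic of one copy: OR of predecessors, AND with the match."""
    return {
        s: int(any(src.get(p, 0) for p in PREDS[s]) and ch in CLASSES[s])
        for s in PREDS
    }


def stacked_clock(regs, chars, start):
    """One clock cycle of a stacked circuit consuming len(chars) characters.

    The signals ripple through one copy per character; the returned values are
    the outputs of the last copy, which are the only registered ones.
    """
    signals = dict(regs)
    signals[0] = start          # the start input reaches the first copy only
    for ch in chars:
        signals = comb(signals, ch)
    return signals


def run(text, m):
    """Feed `text` to the stacked circuit, m characters per clock cycle."""
    regs = {s: 0 for s in PREDS}
    for i in range(0, len(text), m):
        regs = stacked_clock(regs, text[i:i + m], start=int(i == 0))
    return regs
```

For any `m`, `run(text, m)` ends in the same register values as the single-character circuit `run(text, 1)`, while using fewer clock cycles. The sketch only inspects the final state; a hardware circuit would also need to observe match outputs of the intermediate copies, since a match may end in the middle of a cycle.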
In general, to construct an $(m_1 + m_2)$-character matching circuit $C_{m_1+m_2}$ from an $m_1$-character matching circuit $C_{m_1}$ and an $m_2$-character matching circuit $C_{m_2}$, we perform the following transformations on every state of $C_{m_1}$ and $C_{m_2}$:
(1) remove the state register of $C_{m_1}$; forward the AND gate output to its state output,
(2) disconnect the state output of $C_{m_1}$ from the state inputs of $C_{m_1}$, and reconnect it to the corresponding state inputs of $C_{m_2}$,
(3) disconnect the state output of $C_{m_2}$ from the state inputs of $C_{m_2}$, and reconnect it to the corresponding state inputs of $C_{m_1}$,
(4) the combined circuit receives $m_1 + m_2$ character matching signals per cycle. The first $m_1$ signals are sent to the $C_{m_1}$ part; the last $m_2$ signals are sent to the $C_{m_2}$ part.
5.2. REME Clustering for Staged Pipelining
With a straightforward implementation, the BRAM-based character classifier (Section 4.4) uses 256 bits per state. To implement thousands of REMEs with tens of thousands of states, the character classifier would require tens of megabits of BRAM and become the resource bottleneck on FPGA. A second issue in implementing a large number of REMEs on FPGA is signal routing. The character matching results from the centralized character classifier in BRAM must be distributed to all REMEs, while the pattern matching result from every REME must be collected and aggregated into the final output. The potentially long routing makes the circuit hard to scale to a large number of REMEs.
A 2D staged pipeline design was proposed in [10] to solve both problems. Figure 8 shows the basic structure of such a staged pipeline. Each stage may contain a cluster of up to 16 REMEs. The horizontal arrows between the pipelines are the signal paths of the input characters. The vertical arrows between pipeline stages are the character matching signals and the pattern matching results. A priority encoder is used at every stage and pipeline to aggregate the pattern matching results.
Marshaling REMEs into this staged pipeline structure, however, is painstaking and error-prone when done manually. This is mainly due to the buffering and distribution of the character matching signals (the thick vertical arrows in Figure 8). Additionally, different REME groupings can result in different resource usage and routing complexity and give rise to performance variation among REME clusters. To solve these problems, our software uses the following heuristic to marshal a set of REMEs with $n$ total states into $p$ pipelines.
(1) First calculate the average number of states per pipeline, that is, the total number of states divided by the number of pipelines.
(2) Add any of the REMEs into a new pipeline. Compute the compatibility between the resulting (single-REME) pipeline and each of the remaining REMEs. The compatibility between a pipeline and an REME is defined as the number of identical character classes in both divided by the number of unique character classes in the REME.
(3) Add the most compatible REME to the pipeline. Recompute the compatibility of all remaining REMEs.
(4) Repeat step 3 until the total number of states in the pipeline exceeds a threshold determined by the average from step 1 and a design constant.
(5) Go back to step 2 to work on a new pipeline until all REMEs are exhausted.
After marshaling the REMEs into different pipelines, the REMEs within each pipeline are marshaled into different stages in a similar manner. When adding an REME to a pipeline, a function is called to compare each character class in the REME with the character classes previously collected in BRAM. If an identical character class is found, then proper connections are made from the BRAM output to the inputs of the respective states.
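The greedy heuristic above can be sketched in Python as follows. Several details are assumptions of this sketch rather than properties of our software: the function name `marshal`, the choice of the first remaining REME as the seed of each pipeline, the multiplicative threshold `alpha * average`, and the tie-breaking rule (the first REME with the highest compatibility wins).

```python
def marshal(remes, num_pipelines, alpha=1.0):
    """Group REMEs into pipelines by character class compatibility.

    remes: list of (name, num_states, set_of_character_classes); every REME is
           assumed to have at least one character class.
    alpha: hypothetical design constant scaling the per-pipeline state budget.
    """
    average = sum(n for _, n, _ in remes) / num_pipelines        # step 1
    remaining = list(remes)
    pipelines = []
    while remaining:                                             # step 5
        name, states, classes = remaining.pop(0)                 # step 2: seed
        members, pool = [name], set(classes)
        while remaining and states <= alpha * average:           # step 4
            # step 3: shared classes divided by the REME's unique classes
            best = max(remaining, key=lambda r: len(pool & r[2]) / len(r[2]))
            remaining.remove(best)
            members.append(best[0])
            states += best[1]
            pool |= best[2]
        pipelines.append(members)
    return pipelines


example = [
    ("r1", 4, {"a", "b"}),
    ("r2", 4, {"x", "y"}),
    ("r3", 4, {"a", "b", "c"}),
    ("r4", 4, {"x", "z"}),
]
groups = marshal(example, num_pipelines=2, alpha=0.9)
```

In this example `r1` attracts `r3` (two of its three classes are already in the pipeline) and `r2` attracts `r4`, so REMEs that share character classes, and hence BRAM columns, end up in the same pipeline. Because each pipeline is closed only after its state count exceeds the threshold, the number of pipelines produced may differ from `num_pipelines`.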
The time and space complexities of this procedure depend on the number of distinct character classes among the states in the REMEs. In real applications, this number grows almost linearly with the number of states when the number of states is small, but quickly flattens out and grows much more slowly than the number of states once the latter is moderately large (a few hundred).
Matching outputs from all REMEs are prioritized. Currently, the software assigns higher priority to lower-indexed pipelines and stages, although the priority can be programmed in any other way with little additional complexity.
6. Experimental Results
6.1. Design of Benchmark Generator
We developed a regular expression benchmark generator to test how different types of regular expressions affect the performance of the REMEs constructed by our software. The benchmark generator produces regular expressions with different state counts and state fanins, and with variable loopback and feedforward lengths. The general structure of the generated regular expressions is described in Figure 9. (Due to our use of BRAM for character classification, every character class, no matter how simple or complicated, takes exactly 256 BRAM bits and is matched by one BRAM access. Since the complexity of character classes does not affect performance, our benchmark generator assigns arbitrary values to the character classes without loss of generality.)
State count represents the total number of states in an RENFA. It was used by most related work as the primary metric for REME complexity [2, 4, 7, 9]. We further defined state fanin as the maximum number of transitions entering any state [10], since the state machine runs at the speed of the slowest state transition. Both op_union and op_closure can increase state fanin, which is the secondary metric for REME complexity.
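Both metrics can be read directly from a predecessor-list representation of an RENFA. The following minimal Python sketch assumes a hypothetical dictionary `PREDS` that maps each state to the set of states with a transition into it (id 0 denotes the start input); the function names are illustrative.

```python
# Hypothetical RENFA for a(b|c)*d: state id -> set of predecessor ids.
PREDS = {1: {0}, 2: {1, 2, 3}, 3: {1, 2, 3}, 4: {1, 2, 3}}


def state_count(preds):
    """Primary metric: total number of states in the RENFA."""
    return len(preds)


def state_fanin(preds):
    """Secondary metric: maximum number of transitions entering any state."""
    return max(len(sources) for sources in preds.values())


n_states = state_count(PREDS)
max_fanin = state_fanin(PREDS)
```

For this example the state count is 4 and the state fanin is 3: the closure around the union makes the states for `b`, `c`, and `d` each receive three incoming transitions.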
A state transition loopback is always caused by an op_closure, while a state transition feedforward can be caused by unbalanced alternative paths within an op_union. Both properties are high-order metrics describing the routing lengths of an REME. In our experience, however, the actual routing complexity of the REME circuit on FPGA is highly subject to the optimizations done by the place and route software and may not reflect these two metrics closely.
6.2. Performance Evaluation of the Software Toolchain
The time taken to translate a set of parsed regular expressions to VHDL was roughly proportional to the product of the number of states ($n$) and the size of the multicharacter input ($m$), an observation agreeing with our analysis in Section 5.1. On a 2 GHz Athlon 64 PC, it took between 6 and 12 seconds to translate 1280 Snort REMEs (28k states) to VHDL as $m$ increased from 2 to 8. In all cases, about 30% of the time was used for file I/O. Figure 10 illustrates the construction time of various cases in more detail. (Due to the relatively large I/O overhead and the short overall runtime, there is high variance (15%) among different runs of the same construction. The construction time is also greatly affected by the complexity of the regular expressions, especially the state count and the state fanin discussed in Section 6.1.)
These results show that the software proposed in this paper is suitable for large-scale REME construction. Since it takes only a few seconds to translate a thousand regular expressions into structural VHDL, the software can be used to reconstruct a large-scale REME quickly in response to dictionary changes. Due to the large amount of logic resources used, however, the synthesis and place and route times are on the order of several tens of minutes.
6.3. Performance Evaluation of the Constructed REMEs
We first used the benchmark generator described in Section 6.1 to produce synthetic regular expressions of different numbers and complexities, then used our REME construction software to convert the synthetic regular expressions into 2-character matching REME circuits in VHDL. We synthesized the VHDL into Xilinx NGC targeting the Virtex 4 LX device family and extracted the estimated clock frequency from the timing analysis.
Figure 11 shows clock frequency and LUT usage versus the length of REMEs. Series concat1 was produced by one long sequence of concatenations. Series union2 was produced by a union of two equal-length concatenations. In each test case, 6 identical REMEs were placed into a single stage.
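A generator for patterns shaped like these two series could look like the following Python sketch. It is not our benchmark generator: the function names mirror the series names only for readability, the characters are assigned arbitrarily from the lowercase alphabet, and the sketch assumes that the length of a union2 pattern denotes its total state count, split evenly between the two branches.

```python
import re
import string


def concat1(length):
    """One long sequence of `length` concatenated single characters."""
    return "".join(string.ascii_lowercase[i % 26] for i in range(length))


def union2(length):
    """A union of two concatenations, each with length // 2 states."""
    half = length // 2
    return "(" + concat1(half) + "|" + concat1(half)[::-1] + ")"


pattern_c = concat1(8)   # 'abcdefgh'
pattern_u = union2(8)    # '(abcd|dcba)'
```

The generated strings use only literal characters, grouping, and union, so they can be sanity-checked with any PCRE-like engine, for example `re.fullmatch(pattern_u, "dcba")` in Python.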
Series union2 ran at a lower clock frequency than series concat1 due to the use of the op_union operator, which caused series union2 to have twice the (maximum) state fanin of concat1. The clock rates of both series started to decline gradually with respect to REME length at around 32 to 40 states per REME. This decline was due to the longer paths needed to access the centralized character classification signals from BRAM. This is evidenced by the fact that both concat1 and union2 ran at about the same clock rates beyond a length of 40 states, which indicates a bottleneck outside the state transitions within the logic slices of the FPGA.
In Figure 12, we analyzed the effect of the number of REMEs on the achievable clock frequency and total LUT usage. In each test case, 64 states were generated for each REME; 30 states were wrapped inside an op_closure, which was then op_unioned with a sequence of 30 other states and concatenated with the last 4 states in sequence. In the union series, the 30 states inside the op_closure were further wrapped by an op_union of $u$ operands, each $30/u$ states in length. The purpose was to see how the clock rate scaled with respect to the number of REMEs for different REME structures and complexities.
As shown in Figure 12, the clock frequency declined by 15% to 25% when the number of REMEs varied from 1 to 16. All of these 16 REMEs were put inside a single stage by our software. Since the added regular expressions were all identical, this decline was again due to longer BRAM access, caused by both longer routes and larger fanout.
Above 16 REMEs, however, the staged pipeline came into effect, keeping the clock rates slightly above 300 MHz. This shows that the staged pipeline proposed in [10] was effective in scaling up the number of REMEs in a single circuit. LUT usage increased linearly with the number of REMEs.
As expected, a larger number of op_union operands results in a slightly lower achievable clock frequency due to the higher state fanin of the REMEs.
Figure 13 examines clock frequency versus state fanin more thoroughly. In each test case, REMEs of 52 states were constructed, with 24 states put inside an op_union of $u$ operands, where $u$ varied from 1 (a single 24-state sequence) to 12 (a union of twelve 2-state sequences). For the has_loop series, there was also a loopback transition from the outputs of the 24-state op_union back to the inputs of the op_union itself. There was no such loopback for the no_loop series.
The clock frequency was found to decline sublinearly with respect to the state fanin, at a rate consistent with the findings in Section 6.2. The decline, however, was not completely smooth: because the logic gates on the FPGA device were organized as 4-input LUTs, fanins whose sizes are multiples of 4 tended to perform better than the overall trend. The loopback transition around the op_union (in the has_loop series) connected every state output of the union operator to every input state of that operator. This resulted in more complex routing and further reduced the clock frequency.
Overall, our experiments show that the REME construction algorithms proposed in [10] generated FPGA circuits with high clock frequency and high LUT efficiency for a large number of highly complex regular expressions.
7. Conclusions
We presented a software toolchain which automates the construction and optimization of regular expression matching engines (REMEs) on FPGA. The software accepts a potentially large number of regular expressions as input and generates RTL code in VHDL as output, which can be accepted directly by FPGA synthesis and implementation tools. The automated REME optimizations include centralized character classification, multicharacter matching, and staged pipelining. We also developed a benchmark generator to produce REMEs of configurable pattern complexities to evaluate the performance of the software.
On a 2 GHz Athlon 64 PC, our software generates a compact and high-performance REME circuit matching over a thousand regular expressions in just a few seconds. Extensive studies showed that the two-dimensional staged pipeline effectively localized signal routing and achieved a clock rate over 300 MHz while processing hundreds of REMEs in parallel.
Acknowledgment
This work was supported by the U.S. National Science Foundation under Grant CCR-0702784.