Abstract

We present a software toolchain for constructing large-scale regular expression matching (REM) on FPGA. The software automates the conversion of regular expressions into compact and high-performance nondeterministic finite automata (RE-NFA). Each RE-NFA is described as an RTL regular expression matching engine (REME) in VHDL for FPGA implementation. Assuming a fixed number of fan-out transitions per state, an $n$-state $m$-bytes-per-cycle RE-NFA can be constructed in $O(n \times m)$ time and $O(n \times m)$ memory by our software. A large number of RE-NFAs are placed onto a two-dimensional staged pipeline, allowing scalability to thousands of RE-NFAs with linear area increase and little clock rate penalty due to scaling. On a PC with a 2 GHz Athlon64 processor and 2 GB memory, our prototype software constructs hundreds of RE-NFAs used by Snort in less than 10 seconds. We also designed a benchmark generator which can produce RE-NFAs with configurable pattern complexity parameters, including state count, state fan-in, loop-back and feed-forward distances. Several regular expressions with various complexities are used to test the performance of our RE-NFA construction software.

1. Introduction

Regular expression matching (REM) has many applications ranging from text processing to packet filtering. In the narrow sense, each regular expression defines a regular language over the alphabet of input characters. A regular language applies three basic operators on the alphabet: concatenation (·), union (|), and Kleene closure (*), which allow the construction of complex expressions. There are other common operators that also conform to the regular language construct, such as character classes ([...]), optionality (?), and constrained repetitions ({a,}, {,b}, {a,b}). All of these operators can be realized by proper arrangements of the three basic ones.
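
To make the last claim concrete, the following Python sketch rewrites optionality and constrained repetition using only concatenation, union, and closure. The tuple-based tree encoding, the helper names, and the explicit empty-string leaf `EPSILON` are assumptions made for this illustration; they are not taken from the software described in this paper. The helper `to_pattern` prints the rewritten tree using only the three basic operators so that the result can be checked with Python's standard `re` module.

```python
import re

EPSILON = ("epsilon",)  # hypothetical leaf matching the empty string


def char(c):
    return ("char", c)


def concat(*parts):
    """Fold any number of sub-expressions into binary concatenations."""
    if not parts:
        return EPSILON
    node = parts[0]
    for part in parts[1:]:
        node = ("concat", node, part)
    return node


def union(a, b):
    return ("union", a, b)


def closure(a):
    return ("closure", a)


def optional(x):
    """x?  is rewritten as  (x | empty)."""
    return union(x, EPSILON)


def repeat(x, lo, hi=None):
    """x{lo,} becomes lo copies of x followed by x*;
    x{lo,hi} becomes lo copies of x followed by (hi - lo) copies of x?."""
    required = [x] * lo
    tail = [closure(x)] if hi is None else [optional(x)] * (hi - lo)
    return concat(*(required + tail))


def to_pattern(node):
    """Print a tree using only concatenation, '|' and '*'."""
    kind = node[0]
    if kind == "char":
        return re.escape(node[1])
    if kind == "epsilon":
        return "(?:)"
    if kind == "concat":
        return to_pattern(node[1]) + to_pattern(node[2])
    if kind == "union":
        return "(?:%s|%s)" % (to_pattern(node[1]), to_pattern(node[2]))
    if kind == "closure":
        return "(?:%s)*" % to_pattern(node[1])
    raise ValueError("unknown node kind: %r" % (kind,))
```

For instance, `to_pattern(repeat(char("a"), 2, 4))` yields a pattern equivalent to `a{2,4}` that contains no `?` or `{}` quantifiers. Note that this rewriting grows the expression linearly in the repetition bounds.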

Improving large-scale REM performance has been a research focus in recent years [1–11]. Since regular languages can be necessarily and sufficiently accepted by finite state automata, a regular expression matching engine (REME) supporting concatenation, union, closure, repetition, and optionality can always be implemented as either a non-deterministic finite automaton (RE-NFA) or a deterministic finite automaton (RE-DFA). Figure 1 compares side by side the architectures of the two types of automata.

In an RE-NFA approach [2, 4, 7–10], individual regular expressions and their character matching states are processed in parallel with one another. As a result, more than one state in an RE-NFA can be active at any time. Optimizations such as input/output pipelining [4], common-prefix extraction [2, 4], multicharacter input [9, 10], and centralized character decoding [2, 12] can be applied to improve throughput and reduce resource requirements of the overall design.

In an RE-DFA approach, several regular expressions are grouped (union'd) into a DFA by expanding different combinations of active states into additional combined states. In principle, only one combined state in an RE-DFA is active at any time. Various techniques [5, 6, 13, 14] are then applied to improve memory access efficiency and to reduce the total number of states, which usually suffers from quadratic to exponential explosion [11].
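
The state explosion can be observed on a small textbook example. The Python sketch below applies the standard subset construction to the expression family (a|b)*a(a|b){k}, whose NFA has k + 2 states, and counts the reachable combined states. The expression family and the function name are chosen here purely for illustration and do not come from the cited works.

```python
def count_dfa_states(k):
    """Count reachable DFA states for (a|b)*a(a|b){k} via subset construction.

    NFA used here: state 0 loops on 'a' and 'b'; 0 --a--> 1;
    state i --a,b--> i + 1 for 1 <= i <= k; state k + 1 is accepting.
    """
    def move(subset, ch):
        nxt = {0}  # state 0 is in every subset and loops on both characters
        for s in subset:
            if s == 0 and ch == "a":
                nxt.add(1)
            elif 1 <= s <= k:
                nxt.add(s + 1)
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    work = [start]
    while work:
        current = work.pop()
        for ch in "ab":
            combined = move(current, ch)
            if combined not in seen:
                seen.add(combined)
                work.append(combined)
    return len(seen)
```

In this family the DFA must remember which of the last k + 1 input characters were 'a', so the number of combined states doubles with every increment of k, while the NFA grows by a single state.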

Due to the matching power of regular expressions and the complexity of the strings being matched, the REM process can be the slowest bottleneck of a system. To match a regular expression of length over an alphabet of size can take up to time to process each character (for RE-NFA) or memory space to store the state transition table (for RE-DFA) [11]. Furthermore, to match concurrent regular expressions, the overall throughput could be times slower (for RE-NFA) or take more memory space (for RE-DFA) in the worst case.

Modern FPGAs offer a large amount of reconfigurable logic (LUTs) and on-chip memory (BRAM). We developed a compact and high-performance RE-NFA architecture for REM which utilizes both on-chip logic and memory resources on modern FPGAs [10]. In this study, we focus on the automatic parsing, translation, and construction of regular expression matching engines (REMEs) using our RE-NFA architecture for fully automated FPGA implementation. More specifically, we develop REME construction software with the following components:

(1) Automatic conversion from regular expression parse tree [15] to a uniform and modular RE-NFA structure.
(2) Automatic generation of RTL code in VHDL for each RE-NFA. The resulting circuit is spatially stacked a configurable number of times for multicharacter matching.
(3) Allocation of centralized character classification in BRAM for up to 256 REMEs using a simple heuristic.
(4) Automatic construction of up to 16 pipelines in a two-dimensional structure.
(5) A benchmark generator of regular expressions with configurable pattern complexity parameters (state count, state fan-in, loop-back, and feed-forward distances).

The rest of this paper is organized as follows. The background and prior work of RE-NFA on FPGA are discussed in Section 2. An overview of our software toolchain is given in Section 3. Section 4 describes REME construction, while Section 5 covers architectural optimization. Section 6 introduces an REME benchmark generator and uses it to evaluate the performance of the REME construction and optimization software. Section 7 concludes the paper.

2. Background and Prior Work

Hardware implementation of regular expression matching (REM) was first studied by Floyd and Ullman [15], where an -state RE-NFA is translated into integrated circuits using no more than circuit area. Sidhu and Prasanna [8] later proposed an algorithm to implement REM on FPGA in a similar RE-NFA architecture, which has been used by most other RE-NFA implementations on FPGAs [2, 4, 7, 9]. Yang et al. [10] adopted a different approach to translate arbitrary regular expressions to corresponding RE-NFAs with a more modular and uniform circuit structure.

Automatic REME construction on FPGAs was first proposed in [4] using JHDL for both regular expression parsing and circuit generation. In particular, the (J)HDL construction approach used in [4] is in contrast to the self-configuration approach of [8]. Reference [4] also considered large-scale REME construction, where the character input is broadcast globally to all states in a tree-structured pipeline. Automatic REME construction in VHDL was proposed in [2, 7]. In [2], the regular expression was first tokenized and parsed into a hierarchy of basic NFA blocks, then constructed in VHDL using a bottom-up scheme. In [7], a set of scripts was used to compile regular expressions into op-codes, to convert op-codes into NFA, and to construct the NFA circuits in VHDL.

A multi-character decoder was proposed in [16] to improve pattern matching throughput. While the technique was claimed to be applicable to REM, only the construction of a fixed-string matching circuit was explained. The paper, however, did not describe an automatic mechanism to translate any general pattern into a multi-character matching circuit. An algorithm that extends any single-character matching REME temporally into a multi-character matching REME was proposed in [9]. In contrast, the uniform structure of the RE-NFA in [10] allows its circuit to be stacked spatially and automatically to process multiple characters per clock cycle.

3. Overview of the Software Toolchain

The main purpose of our software toolchain is to automate the construction and optimization of large-scale RE-NFA circuits on FPGA. The toolchain allows us to generate the whole RTL circuit matching thousands of regular expressions on the order of seconds using a single command. Such a toolchain not only avoids tedious and error-prone manual circuit construction, but also produces a large-scale regular expression matching engine (REME) ready for implementation in a short amount of time.

Figure 2 gives an overview of the toolchain. The toolchain consists of two main parts: REME Construction and Architectural Optimization, briefly described as follows:

(1) REME Construction: converts each regular expression into an RE-NFA circuit and collects unique character classes in BRAM across all regular expressions.
(2) Architectural Optimization: applies spatial stacking to the individual RE-NFA circuits; marshals RE-NFAs into a 2D staged pipeline to form the final circuit.

In practice, the two paths of REME Construction in Figure 2 are written as a single module interleaving the two tasks for each input regular expression. Conceptually, however, they are independent of each other and can be executed in parallel. In contrast, the two tasks in Architectural Optimization (spatial stacking and pipeline marshaling) must be performed serially. The details of the REME Construction part are presented in Section 4, while those of the Architectural Optimization part are in Section 5.

In addition to the basic operators of concatenation, union (|), and Kleene closure (*) used to define a regular language, our software also handles the operators most frequently used by the Snort IDS [7], such as repetition (+), optionality (?), constrained repetition ({a,b}), and any character class ( ). Table 1 lists the operators supported by our software. The syntax and semantics of these operators are compatible with the Perl-Compatible Regular Expression [17]. For example, the expression “ ” specifies any IP address followed by an optional nonnumerical characters.

4. Automatic REME Construction

The REME Construction is performed in three steps: (1) parse the regular expressions into tree structures, (2) use the modified McNaughton-Yamada (MMY) construction (Figure 4, Algorithm 1) to construct the RE-NFAs, (3) map the RE-NFAs into structural VHDL suitable for FPGA implementation.

Algorithm 1: Modified McNaughton-Yamada (MMY) construction.

Notations:
  v[value]               Content value of node v.
  v[left|right|child]    Left, right, or only child of node v.
  s[next]                Set of next-state transitions of state s.
  s[char]                Set of matching characters of state s.
Macros:
  CREATE_STATE(T):
      Create a new state in the state transition table T.
  CREATE_PSEUDO():
      Create a special pseudo-state for later use.
  ADD_PSEUDO_NEXT(p, N):
      For every state s in N, add the state set p[next]
      to s[next]. Pseudo-state p is deleted afterward.
PROCEDURE RE2NFA(v, P, T)
  v    Root node of the parse (sub-)tree.
  P    Set of immediate previous states.
  N    Set of states transitioning directly outside of v (returned).
  T    The resulting state transition table.
BEGIN
  u := v;
  while u != NULL
    if u[value] = OP_CONCAT
      P := RE2NFA(u[left], P, T);
      u := u[right];
    else if u[value] = OP_UNION
      N1 := RE2NFA(u[left], P, T);
      N2 := RE2NFA(u[right], P, T);
      return N1 ∪ N2;
    else if u[value] = OP_CLOSURE
      p := CREATE_PSEUDO();
      P' := P ∪ {p};
      N := RE2NFA(u[child], P', T);
      ADD_PSEUDO_NEXT(p, N);
      return N ∪ P;
    else // u is a leaf node holding a character class
      s := CREATE_STATE(T);
      s[char] := u[value];
      foreach q in P
        // add ε-transitions
        q[next] := q[next] ∪ {s};
      end foreach
      return {s};
    end if
  end while
  // error: u[right] cannot be NULL
END

4.1. From Regular Expression to Parse Tree

The first step is to represent each regular expression as a corresponding parse tree using a standard compiler technique. This step is the same as that described in [15]. Figure 3 shows a parse-tree representation of the regular expression “\x2F(fn|s)\x3F[^\r\n]*si”, which is simplified for the sake of illustration from an actual Snort [18] pattern. In particular, a union of any number of single characters is parsed as a single character class (e.g., the [^\r\n] in Figure 3), which can be matched very efficiently in our REM architecture [10].

The resulting parse tree always consists of three types of internal nodes, op_concat, op_union, and op_closure, and a number of leaf nodes equal to the number of individual (and possibly nonunique) character classes in the regular expression.
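
A minimal Python sketch of such a parser is given below, assuming a reduced syntax with only literals, bracketed character classes (optionally negated with `^`), grouping, `|`, and `*`. The function names and the tuple encoding of nodes are hypothetical; the actual software is written in C and supports the richer operator set of Table 1. Concatenations are nested to the right here, which is one possible shape for a tree whose long concatenation chains are later walked iteratively.

```python
def parse(pattern):
    """Parse a small regex subset into op_concat / op_union / op_closure nodes.

    Leaves are ("class", frozenset_of_characters)."""
    pos = 0

    def peek():
        return pattern[pos] if pos < len(pattern) else None

    def take():
        nonlocal pos
        ch = pattern[pos]
        pos += 1
        return ch

    def parse_union():
        node = parse_concat()
        while peek() == "|":
            take()
            node = ("op_union", node, parse_concat())
        return node

    def parse_concat():
        items = []
        while peek() is not None and peek() not in "|)":
            items.append(parse_closure())
        if not items:
            raise ValueError("empty alternative")
        node = items[-1]
        for item in reversed(items[:-1]):  # nest concatenations to the right
            node = ("op_concat", item, node)
        return node

    def parse_closure():
        node = parse_atom()
        while peek() == "*":
            take()
            node = ("op_closure", node)
        return node

    def parse_atom():
        ch = take()
        if ch == "(":
            node = parse_union()
            if peek() != ")":
                raise ValueError("missing ')'")
            take()
            return node
        if ch == "[":
            negate = peek() == "^"
            if negate:
                take()
            members = set()
            while peek() != "]":
                if peek() is None:
                    raise ValueError("missing ']'")
                members.add(take())
            take()  # consume ']'
            if negate:
                members = {chr(i) for i in range(256)} - members
            return ("class", frozenset(members))
        if ch in "*|)":
            raise ValueError("unexpected %r" % ch)
        return ("class", frozenset(ch))

    tree = parse_union()
    if pos != len(pattern):
        raise ValueError("unexpected %r" % pattern[pos])
    return tree


def count_leaves(node):
    """Number of character-class leaves, i.e. of RE-NFA states to create."""
    if node[0] == "class":
        return 1
    return sum(count_leaves(child) for child in node[1:])
```

For example, `parse("a(b|c)*d")` returns a tree with four leaves, one per character class in the expression.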

4.2. From Regular Expression Parse Tree to NFA

Unlike previous work in [15] and later in [8] which use the McNaughton-Yamada (MNY) construction to convert regular expressions into RE-NFAs, we proposed the modified McNaughton-Yamada (MMY) construction in [10] to perform the conversion. Figure 4 gives a graphical description of the modified construction rules.

A formal definition of the construction mechanism is given in Algorithm 1. The algorithm takes the regular expression parse tree generated in the previous subsection as input. It is in general a recursive algorithm, where the subtrees of each internal node are processed recursively before the operator of the current node is handled. The only exception is the right child of an op_concat node, where for performance reasons the tail recursion is performed iteratively. This avoids excessive recursion for a long sequence of op_concat operators (which is predominantly the case in real-world patterns).

Two special entities are used in Algorithm 1 for the MMY construction. The first is the set of immediate previous states $P$, which contains the source states of all fan-in transitions to the part of the RE-NFA currently under construction. This entity corresponds to the dashed ellipses on the left of Figures 4(c) and 4(d). It allows a long sequence of ε-transitions in the original MNY construction to be collapsed into a single ε-transition in the MMY construction.

The second entity is the pseudostate $p$, which works as a placeholder for the source states of an op_closure's feedback loop before the op_closure is converted to be part of the RE-NFA. This temporary placeholder is needed to break the circular dependence of an op_closure construction on the resulting fan-out states of the very op_closure construction.

The MMY construction algorithm produces an NFA that is highly modular and easy to map to HDL code. For example, using the modified construction algorithm, the regular expression “\x2F(fn|s)\x3F[^\r\n]*si” is converted into a modular NFA with a uniform structure (Figure 5). This conversion is arguably the most complex part of the construction process, taking roughly 350 lines of C code for the automation.
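
The construction can be sketched in Python as follows. This is an illustrative reading of Algorithm 1 under stated assumptions, not the authors' C implementation: parse-tree nodes are tuples tagged `op_concat`, `op_union`, `op_closure`, or `class`, the `State` class and the helper names are hypothetical, and an extra start state stands in for the initial set of previous states. The pseudo-state is discarded from the returned set so that it never leaks outside its own closure.

```python
class State:
    """One RE-NFA state; chars is None for the start state and pseudo-states."""
    def __init__(self, chars=None):
        self.chars = chars
        self.next = set()


def re2nfa(node, prev, table):
    """Build states for the sub-tree `node` and return its set of exit states.

    prev:  set of immediate previous states feeding this sub-tree
    table: list collecting every created character-matching state
    """
    while True:
        kind = node[0]
        if kind == "op_concat":
            prev = re2nfa(node[1], prev, table)
            node = node[2]                      # iterate on the right child
        elif kind == "op_union":
            return re2nfa(node[1], prev, table) | re2nfa(node[2], prev, table)
        elif kind == "op_closure":
            pseudo = State()                    # placeholder for the loop-back sources
            exits = re2nfa(node[1], prev | {pseudo}, table)
            for state in exits:                 # resolve the placeholder
                state.next |= pseudo.next
            return (exits | prev) - {pseudo}    # zero repetitions are allowed
        else:                                   # leaf: one character class
            state = State(node[1])
            table.append(state)
            for source in prev:
                source.next.add(state)
            return {state}


def build(tree):
    start = State()
    table = []
    finals = re2nfa(tree, {start}, table)
    return start, finals, table


def accepts(start, finals, text):
    """Anchored software simulation, used only to check the construction."""
    active = {start}
    for ch in text:
        active = {t for s in active for t in s.next if ch in t.chars}
    return bool(active & finals)
```

For the tree of `a(b|c)*d`, this sketch creates exactly four states, one per character class, with no separate ε-states left in the result.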

4.3. From RE-NFA to VHDL

To translate the RE-NFA (like Figure 5) into VHDL, each pair of nodes inside a lightly shaded ellipse is mapped to an entity statebit with one parameter: the number of input ports, determined by the number of “previous states” that immediately transition to the current state. Inside the entity statebit, all inputs aggregate to a single OR gate, followed by a character matching via logic AND and a state value register. The single-bit output value of the register is connected to the inputs of the immediate “next states.”

The REM circuit for Figure 5 is shown in Figure 6. On FPGA devices with 4-input LUTs, a -input OR followed by a 2-input AND can be efficiently implemented on a single LUT if , or on a single slice of 2 LUTs if . The mapping takes only about 300 lines of C code to convert any RE-NFA to its RTL structural VHDL description.
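
The behavior of the statebit entities over one clock cycle can be mimicked in software. The Python sketch below is only a behavioral model under assumptions made for illustration: index 0 is a hypothetical always-active start signal (so matching may begin at any input position), and the example circuit for `ab*c` is hand-built rather than generated by the toolchain.

```python
def clock(regs, fan_in, matches):
    """Advance every statebit by one clock cycle.

    regs:    current register values; regs[0] is the always-on start signal
    fan_in:  fan_in[i] lists the previous states wired to state i's OR gate
    matches: matches[i] is the character-match input to state i's AND gate
    """
    nxt = [True]
    for i in range(1, len(regs)):
        or_gate = any(regs[j] for j in fan_in[i])
        nxt.append(or_gate and matches[i])
    return nxt


def run(text, chars, fan_in):
    """Feed one character per cycle; return the last state's output per cycle."""
    regs = [True] + [False] * (len(chars) - 1)
    outputs = []
    for ch in text:
        matches = [None] + [ch in c for c in chars[1:]]
        regs = clock(regs, fan_in, matches)
        outputs.append(regs[-1])
    return outputs


# Hand-built circuit for ab*c: state 1 matches 'a', state 2 'b', state 3 'c'.
CHARS = [None, "a", "b", "c"]
FAN_IN = [[], [0], [1, 2], [1, 2]]
```

Calling `run("xxabbc", CHARS, FAN_IN)` raises the output of the last state only in the cycle that consumes the final `c`.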

4.4. BRAM-Based Character Classification

Our REM architecture in [10] used a 256-bit column of BRAM to match any character class of 8-bit characters. Each bit of the column represents the inclusion of an 8-bit character in the character set. The value of every input character is used as a row index into the BRAM to retrieve the matching result (true/false) of that character against all character classes (one per column). Each single-bit result is routed from the BRAM to its corresponding statebit entity as an input to the AND gate. As a result, character classification for an $n$-state RE-NFA can be implemented in a block memory (BRAM) of no more than $256 \times n$ bits.
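
The lookup can be modeled in a few lines of Python. In this sketch, which is an illustration rather than the generated hardware, each of the 256 rows is stored as an integer whose bit k holds the result for column (character class) k; the function names are hypothetical.

```python
def build_bram(classes):
    """Return 256 rows; bit `col` of rows[b] is 1 if byte b is in classes[col]."""
    rows = [0] * 256
    for col, members in enumerate(classes):
        for byte in members:
            rows[byte] |= 1 << col
    return rows


def classify(rows, byte, col):
    """One BRAM read: the input byte selects the row, `col` selects the output bit."""
    return bool((rows[byte] >> col) & 1)


def bram_bits(classes):
    """Memory footprint: one 256-bit column per character class."""
    return 256 * len(classes)
```

A single read of row `rows[byte]` delivers the match results of that byte against every character class at once, regardless of how many members each class has.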

Furthermore, if two states (either within the same regular expression or across different regular expressions) match the same character class, then they can share the same BRAM column output. We use a two-phase procedure to aggregate the matching outputs of identical character classes.

(i) In phase 1, the software collects the set of unique character classes from a regular expression. Each unique character class is associated with a floating-point sorting key:
    (a) if the character class appears only once in the regular expression, then the sorting key is its (only) position index within the regular expression;
    (b) if the character class appears multiple times in the regular expression, then the sorting key is the average of all its position indexes within the regular expression.
(ii) In phase 2, the unique character classes are sorted according to their sorting keys and instantiated as BRAM columns. Each BRAM column is also associated with the identifier of the instantiated character class. The output of each BRAM column is then connected to the character matching inputs with the same identifier.

The two-phase procedure allows our software to use the minimum number of BRAM columns for character class matching. It also minimizes routing distance by exploiting the natural ordering (the sorting keys) of the character classes within the regular expressions. The aggregation of character classes and their distribution to the RE-NFA states take 250 lines of C code.
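
The two phases can be sketched in Python for a single regular expression as follows. The function name, the use of 0-based position indexes, and the representation of character classes as frozensets are assumptions of this illustration.

```python
def allocate_columns(state_classes):
    """Assign BRAM columns to the character classes of one regular expression.

    state_classes: the character class of each state, in order of appearance.
    Returns (columns, column_of_state): the unique classes in column order and,
    for every state, the index of the column it reads its match bit from.
    """
    # Phase 1: collect unique classes and their position indexes.
    positions = {}
    for index, cls in enumerate(state_classes):
        positions.setdefault(cls, []).append(index)
    # Sorting key: the only position, or the average of all positions.
    keys = {cls: sum(p) / len(p) for cls, p in positions.items()}

    # Phase 2: sort by key, instantiate columns, connect states by identifier.
    columns = sorted(keys, key=keys.get)
    column_index = {cls: col for col, cls in enumerate(columns)}
    return columns, [column_index[cls] for cls in state_classes]
```

With seven states whose classes are A, B, B, C, A, A, A, the keys are 3.75 for A, 1.5 for B, and 3.0 for C, so only three columns are allocated, in the order B, C, A.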

5. Automated Architectural Optimizations

After constructing REMEs individually for all regular expressions, the software applies two architectural optimizations [10]. (1) The REMEs are stacked to form multi-character matching (MCM) circuits which trade off minimum resource usage for higher performance. (2) The MCM REMEs are grouped into clusters of 16 and marshaled onto a two-dimensional staged pipeline structure.

5.1. Circuit Stacking for Multicharacter Matching

In contrast to the NFA-level temporal extension used in [9], we adopted a circuit-level spatial stacking to construct multi-character matching (MCM) REMEs. Figure 7 shows the basic construction concept of a 2-character matching circuit from two copies of a single-character matching circuit. An algorithm for this spatial stacking approach and the proof of correctness were given in [10]. Benefits of the spatial stacking approach include the following.

Simplicity
The time complexity to construct an -state, -character matching REME using spatial stacking is [10]. In contrast, the time complexity of temporal extension is [9].

Flexibility
The spatial stacking approach can generate an $m$-character matching REME for any natural number $m$, while the temporal extension approach only generates RE-NFAs with $m$ being a power of 2.

In practice, $n$ is usually a few tens while $m$ is between 2 and 8, making the spatial stacking approach hundreds of times faster than the temporal extension approach. As discussed in Section 6.2, our software can construct thousands of MCM REMEs in 10 seconds. Also, the optimal value of $m$ with respect to performance efficiency (defined in [10]) is usually not a power of 2. For example, the REMEs from Snort rules achieve optimal performance efficiency at [10].

The program code to construct any $m$-character matching REME using spatial stacking is simple. Let $C$ be a single-character matching circuit. The program first makes $m$ copies of $C$, $C_1, \ldots, C_m$, each receiving one of the $m$ consecutive input characters. Then, instead of routing the state outputs back to the state inputs of the same circuit, it removes the state registers of $C_1, \ldots, C_{m-1}$ and connects the (nonregistered) state outputs of $C_i$ to the state inputs of $C_{i+1}$ for $i = 1, \ldots, m-1$. Finally, it connects the (registered) state outputs of $C_m$ to the state inputs of $C_1$. The result is an $m$-character matching circuit for $C$.

In general, to construct an $(m_1 + m_2)$-character matching circuit $C_{m_1 + m_2}$, we perform the following transformations on every state of $C_{m_1}$ and $C_{m_2}$:

(1) remove the state register of $C_{m_1}$; forward the AND gate output to its state output,
(2) disconnect the state output of $C_{m_1}$ from the state inputs of $C_{m_1}$, and reconnect it to the corresponding state inputs of $C_{m_2}$,
(3) disconnect the state output of $C_{m_2}$ from the state inputs of $C_{m_2}$, and reconnect it to the corresponding state inputs of $C_{m_1}$,
(4) the combined circuit receives $m_1 + m_2$ character matching signals per cycle. The first $m_1$ signals are sent to the $C_{m_1}$ part; the last $m_2$ signals are sent to the $C_{m_2}$ part.

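
The effect of stacking can be checked with a small behavioral model in Python. In this sketch, one call to `step` plays the role of the combinational logic of one circuit copy, and `stacked_clock` chains the copies so that only the last one latches its outputs. The function names, the always-on start signal at index 0, and the hand-built example circuit are assumptions made for illustration; the model compares state registers only and does not cover match outputs of the intermediate copies.

```python
def step(signals, fan_in, chars, ch):
    """Combinational logic of one copy: OR of fan-in signals, AND with char match."""
    out = [True]  # index 0: always-on start signal
    for i in range(1, len(chars)):
        out.append(any(signals[j] for j in fan_in[i]) and ch in chars[i])
    return out


def stacked_clock(regs, fan_in, chars, chunk):
    """One clock cycle of a len(chunk)-character matching circuit.

    The unregistered outputs of each copy drive the inputs of the next copy;
    the value returned is what the registers of the last copy latch."""
    signals = regs
    for ch in chunk:
        signals = step(signals, fan_in, chars, ch)
    return signals


def run(text, fan_in, chars, m):
    """Process `text` m characters per cycle and return the final registers."""
    regs = [True] + [False] * (len(chars) - 1)
    for i in range(0, len(text), m):
        regs = stacked_clock(regs, fan_in, chars, text[i:i + m])
    return regs


# Hand-built single-character circuit for ab*c.
CHARS = [None, "a", "b", "c"]
FAN_IN = [[], [0], [1, 2], [1, 2]]
```

Running the same input with m = 1, 2, and 3 leaves the registers in the same final state, while the number of clock cycles drops from 6 to 3 to 2 for a 6-character input.
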
5.2. REME Clustering for Staged Pipelining

With a straightforward implementation, the BRAM-based character classifier (Section 4.4) uses 256 bits per state. To implement thousands of REMEs with tens of thousands of states, the character classifier would require tens of megabits of BRAM and become the resource bottleneck on FPGA. A second issue in implementing a large number of REMEs on FPGA is signal routing. The character matching results from the centralized character classifier in BRAM must be distributed to all REMEs, while the pattern matching result from every REME must be collected and aggregated to the final output. The potentially long routing makes the circuit hard to scale to a large number of REMEs.

A 2D staged pipeline design was proposed in [10] to solve both problems. Figure 8 shows the basic structure of such a staged pipeline. Each stage may contain a cluster of up to 16 REMEs. The horizontal arrows between the pipelines are the signal paths of the input characters. The vertical arrows between pipeline stages are the character matching signals and the pattern matching results. A priority encoder is used at every stage and pipeline to aggregate the pattern matching results.

Marshaling REMEs into this staged pipeline structure, however, is painstaking and error-prone when done manually. This is mainly due to the buffering and distribution of the character matching signals (the thick vertical arrows in Figure 8). Additionally, different REME groupings can result in different resource usage and routing complexity and give rise to performance variation among REME clusters. To solve these problems, our software uses the following heuristic to marshal REMEs with total states into pipelines.

(1) First calculate the average number of states per pipeline, .
(2) Add any of the REMEs into a new pipeline. Compute the compatibility between the resulting (single-REME) pipeline and each of the remaining REMEs. The compatibility between a pipeline and an REME is defined as the number of identical character classes in both divided by the number of unique character classes in the REME.
(3) Add the most compatible REME to the pipeline. Recompute the compatibility of all remaining REMEs.
(4) Repeat step 3 until the total number of states in the pipeline is greater than , where is a design constant.
(5) Go back to step 2 to work on a new pipeline until all REMEs are exhausted.

After marshaling the REMEs into different pipelines, the REMEs within each pipeline are marshaled into different stages in a similar manner. When adding an REME to a pipeline, a function is called to compare each character class in the REME with the character classes previously collected in BRAM. If an identical character class is found, then proper connections are made from the BRAM output to the inputs of the respective states.
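
A greedy grouping of this kind can be sketched in Python as follows. The sketch is a loose illustration under explicit assumptions: each REME is given as a pair (state count, set of character classes), the stopping threshold is taken to be the average number of states per pipeline scaled by a hypothetical constant `slack`, and ties in compatibility are broken by input order. None of these choices are confirmed by the text above.

```python
def marshal(remes, num_pipelines, slack=0.9):
    """Group REMEs into pipelines by character-class compatibility.

    remes: list of (num_states, set_of_character_classes), classes non-empty.
    Returns a list of pipelines, each a list of REME indexes.
    """
    total_states = sum(states for states, _ in remes)
    # Assumed stopping threshold: scaled average number of states per pipeline.
    threshold = slack * total_states / num_pipelines

    unassigned = list(range(len(remes)))
    pipelines = []
    while unassigned:
        first = unassigned.pop(0)            # seed a new pipeline with any REME
        members = [first]
        states = remes[first][0]
        classes = set(remes[first][1])

        while unassigned and states <= threshold:
            def compatibility(i):
                own = remes[i][1]
                # shared classes divided by the REME's unique classes
                return len(classes & own) / len(own)

            best = max(unassigned, key=compatibility)
            unassigned.remove(best)
            members.append(best)
            states += remes[best][0]
            classes |= remes[best][1]
        pipelines.append(members)
    return pipelines
```

With four 10-state REMEs over the classes {a, b}, {x, y}, {a, b, c}, and {x, z} and two target pipelines, the sketch pairs the first with the third and the second with the fourth, so that each pipeline shares most of its BRAM columns.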

The time complexity of this procedure is , where is the number of distinct character classes among the states in the REMEs. The space complexity is . In real applications, grows almost linearly with respect to for small , but quickly flattens out and grows much more slowly than when is moderately large (a few hundred).

Matching outputs from all REMEs are prioritized. Currently, the software assigns higher priority to lower-indexed pipelines and stages, although the priority can be programmed in any other way with little additional complexity.

6. Experimental Results

6.1. Design of Benchmark Generator

We developed a regular expression benchmark generator to test how different types of regular expressions affect the performance of the REMEs constructed by our software. The benchmark generator produced regular expressions of different state count ( ), state fan-in ( ), and variable lengths of loop-back ( ) and feed-forward ( ). A general structure of the generated regular expressions is described in Figure 9. (Due to our use of BRAM for character classification, every character class, no matter how simple or complicated it is, takes exactly 256 BRAM bits and is matched by one BRAM access. Since the complexity of character classes does not affect performance, our benchmark generator assigns arbitrary values to the character classes without loss of generality.)

State count represents the total number of states in an RE-NFA. It was used by most related work as the primary metric for REME complexity [2, 4, 7, 9]. We further defined state fan-in as the maximum number of transitions entering any state [10], since the state machine runs at the speed of the slowest state transition. Both op_union and op_closure can increase state fan-in, which is the secondary metric for REME complexity.
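
Both metrics are straightforward to compute from a state transition table. The Python sketch below assumes a hypothetical dictionary representation that maps each state name to the list of its next states, with an extra `start` entry for the initial transitions; transitions leaving `start` are counted toward fan-in here, which is a choice of this illustration.

```python
def complexity(next_states):
    """Return (state count, maximum state fan-in) of an RE-NFA.

    next_states: dict mapping each state to the list of states it feeds.
    Every character-matching state is assumed to have at least one
    incoming transition, so states are counted by their fan-in entries.
    """
    fan_in = {}
    for source, targets in next_states.items():
        for target in targets:
            fan_in[target] = fan_in.get(target, 0) + 1
    return len(fan_in), max(fan_in.values())


# Transition table of a(b|c)*d, written out by hand.
EXAMPLE = {
    "start": ["a"],
    "a": ["b", "c", "d"],
    "b": ["b", "c", "d"],
    "c": ["b", "c", "d"],
    "d": [],
}
```

For `EXAMPLE`, the state count is 4 and the maximum fan-in is 3, reached by the states inside and directly after the closure.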

A state transition loop-back is always caused by an op_closure, while a state transition feed-forward can be caused by unbalanced alternative paths within an op_union. Both properties are high-order metrics describing the routing lengths of an REME. According to our experimental experience, however, the actual routing complexity of the REME circuit on FPGA is highly subject to the optimizations done by the place and route software and may not reflect these two metrics closely.

6.2. Performance Evaluation of the Software Toolchain

The time taken to translate a set of parsed regular expressions to VHDL was roughly proportional to the product of the number of states ($n$) and the size of the multi-character input ($m$), an observation agreeing with our analysis in Section 5.1. On a 2 GHz Athlon 64 PC, it took between 6 and 12 seconds to translate 1280 Snort REMEs (about 28k states) to VHDL as $m$ increased from 2 to 8. In all cases, about 30% of the time was used for file I/O. Figure 10 illustrates the construction time of various cases in more detail. (Due to the relatively large I/O overhead and the short overall runtime, there is high variance (about 15%) among different runs of the same construction. The construction time is also greatly affected by the complexity of the regular expressions, especially the state count and the state fan-in discussed in Section 6.1.)

These results show that the software proposed in this paper is suitable for large-scale REME construction. Since it takes only a few seconds to translate a thousand regular expressions into structural VHDL, the software can be used to reconstruct a large-scale REME quickly in response to dictionary changes. Due to the large amount of logic resources used, however, the synthesis and place and route times are on the order of several tens of minutes.

6.3. Performance Evaluation of the Constructed REMEs

We first used the benchmark generator described in Section 6.1 to produce synthetic regular expressions of different numbers and complexities, then used our REME construction software to convert the synthetic regular expressions into 2-character matching REME circuits in VHDL. We synthesized the VHDL into Xilinx NGC targeting the Virtex 4 LX device family and extracted the estimated clock frequency from the timing analysis.

Figure 11 shows clock frequency and LUT usage versus length of REMEs. Series concat1 was produced by one long sequence of concatenations. Series union2 was produced by a union of two equal-length concatenations. In each test case, 6 identical REMEs were placed into a single stage.

Series union2 ran at a lower clock frequency than series concat1 due to the use of the op_union operator, which caused series union2 to have twice the (maximum) state fan-in of concat1. The clock rates of both series started to decline gradually with respect to REME length at around 32 to 40 states per REME. This decline was due to the longer paths needed to access the centralized character classification signals from BRAM. This is evidenced by the fact that both concat1 and union2 ran at about the same clock rates beyond the length of 40 states, indicating a bottleneck outside the state transitions within the logic slices of the FPGA.

In Figure 12, we analyzed the effect of the number of REMEs on achievable clock frequency and total LUT usage. In each test case, 64 states were generated for each REME; 30 states were wrapped inside an op_closure ( ), which was then op_union-ed with a sequence of 30 other states ( ) and concatenated with the last 4 states in sequence. In the -union series, , the 30 states inside the op_closure were further wrapped by an op_union of operands, each states in length. The purpose was to see how clock rate scaled with respect to number of REMEs for different REME structures and complexities.

As shown in Figure 12, clock frequency declined by 15% to 25% when the number of REMEs varied from 1 to 16. All of these 16 REMEs were put inside a single stage by our software. Since the added regular expressions were all identical, this decline was again due to longer BRAM access, caused by both longer routes and larger fan-out.

Above 16 REMEs, however, the staged pipeline came into effect, keeping the clock rates slightly above 300 MHz. This shows that the staged pipeline proposed in [10] was effective in scaling up the number of REMEs in a single circuit. LUT usage maintained a linear increase with respect to the number of REMEs.

As expected, a higher value results in a slightly lower achievable clock frequency due to the higher state fan-in of the REMEs.

Figure 13 examines clock frequency versus state fan-in more thoroughly. In each test case, REMEs of 52 states were constructed, with 24 states put inside an op_union of operands, varying from 1 (single 24-state sequence) to 12 (union of 2-state sequences). For the has_loop series, there was also a loop-back transition from the outputs of the 24-state back to the inputs of the itself. There was no such loop-back for the no_loop series.

The clock frequency was found to decline sublinearly with respect to the state fan-in, at a rate consistent with the findings in Section 6.2. The decline, however, was not completely smooth because the logic gates on the FPGA device were organized as 4-input LUTs: fan-ins whose sizes are multiples of 4 tend to perform better than the overall trend. The loop-back transition around the op_union (in the has_loop series) connected every state output of the union operator to every input state of that operator. This resulted in more complex routing and further impacted the clock frequency.

Overall, our experiments show that the REME construction algorithms proposed in [10] generated FPGA circuits with high clock frequency and high LUT efficiency for a large number of highly complex regular expressions.

7. Conclusions

We presented a software toolchain which automates the construction and optimization of regular expression matching engines (REMEs) on FPGA. The software accepts a potentially large number of regular expressions as input and generates RTL code in VHDL as output, which can be accepted directly by FPGA synthesis and implementation tools. The automated REME optimizations include centralized character classification, multi-character matching, and staged pipelining. We also developed a benchmark generator to produce REMEs of configurable pattern complexities to evaluate the performance of the software.

On a 2 GHz Athlon 64 PC, our software generates a compact and high-performance REME circuit matching over a thousand regular expressions in just a few seconds. Extensive studies showed that the two-dimensional staged pipeline effectively localized signal routing and achieved a clock rate over 300 MHz while processing hundreds of REMEs in parallel.

Acknowledgment

This work was supported by U.S. National Science Foundation under Grant CCR-0702784.