Abstract

The content-based access of CAMs makes them of great interest for lookup-based operations. However, the large number of parallel comparisons required comes at a high cost in power dissipation. In this work, we present a novel banked precomputation-based architecture for low-power, storage-demanding applications that addresses the reduction of both dynamic and leakage power consumption. Experimental results show that the proposed banked architecture reduces dynamic power consumption during the search process by up to 89%, while leakage power consumption is reduced by up to 91%.

1. Introduction

A broad range of modern applications demand large storage devices with fast search capabilities. Content-addressable memories (CAMs) have emerged as one of the favorite devices for such applications [13]. In CAMs, data are accessed based on their content rather than their physical address. This functionality has proven especially efficient in lookup-based applications such as TLBs [1], associative computing, and data compression [2]. High-speed networks such as gigabit Ethernet and ATM switches [3] also benefit from this particular structure [4].

However, CAMs pay a high hardware cost for this content-based access because the memory cell must include comparison circuitry, negatively impacting the size/speed tradeoff and the complexity of the implementation. Usually, a 9-transistor cell is required instead of the 6-transistor cell used in SRAM. Moreover, the large number of parallel comparisons performed in conventional CAMs leads to excessive power consumption, preventing the implementation of the large-scale single-chip CAMs that leading-edge applications demand.

In this paper, we present the implementation and characterization of a CAM with low-power constraints. The proposed architecture is highly scalable and provides high performance at large sizes. Moreover, it achieves great savings in both dynamic (89%) and leakage (91%) power consumption. Therefore, this architecture overcomes previous limitations of CAM implementations and makes them suitable for all applications where high-performance, low-power data search is needed.

Previous work on CAM design has focused only on reducing the dynamic power consumption of the match line [5] and enhancing the search speed [6]. Although many approaches addressing dynamic power dissipation have been reported [7–9], the resulting circuit techniques either incur substantial area overhead, suffer from poor noise immunity, or cannot be easily scaled without a negative impact on performance.

Our work overcomes these limitations with a novel and effective design of the CAM architecture and also addresses leakage energy reduction in an efficient way. The architecture presented here improves on the energy savings obtained by Lin et al. in [10], recently extended by Noda et al. [11] and Choi et al. [9]. These recent works provide a low-power implementation of the CAM based on the precomputation of an index parameter. Nevertheless, they are constrained to specific small sizes, lack scalability, and present an increased search delay.

Moreover, leakage energy consumption, which is one of the main issues in current electronic design, has not been addressed by previous approaches. Static power dissipation is becoming as important as dynamic dissipation as transistor gate sizes shrink [12]. Our work improves on the dynamic power savings of the referred approaches and also reduces the leakage energy consumption of the memory to a minimum.

As stated before, this work is based on a parameter precomputation-based architecture [10] (PB-CAM from now on); however, we are able to reduce the size of the parameter word with respect to [10], thereby decreasing the logic complexity, area, and power consumption associated with this parameter. Moreover, the energy savings obtained with the proposed banked architecture (up to 89% of the dynamic power consumption and 91% of the leakage power consumption) improve on previous implementations of similar techniques and also on the scalability of architectures like [9, 11].

Our research in this field has already shown good results in terms of area and dynamic power consumption [13, 14]. This paper expands [14], where an improved architecture was presented (with a novel hardware mechanism to reduce static power consumption and increase dynamic energy savings), with new experimental results and a deeper analysis of the consequences of applying leakage reduction techniques to CAM memories.

The paper is organized as follows. Section 2 presents the first approach, which addresses the minimization of dynamic power consumption, while the leakage power reduction technique is described in Section 3. Experimental results are presented in Section 4. Finally, some conclusions are drawn.

2. Devised Architecture: Banked Approach

The first architecture presented in this section describes a banked implementation of the mentioned PB-CAM [10], aimed at reducing dynamic power consumption during the search operation. The main idea of the PB-CAM is to store a parameter word (obtained by a formula, e.g., a one's count) and use it to restrict the comparison process to a reduced number of memory positions, saving dynamic power. However, the total power consumption of the logic can still be too high for low-power applications. Additionally, the PB-CAM architecture is based on a novel seven-transistor data memory cell instead of the common nine- or ten-transistor cell used in conventional architectures.

Our architecture employs the precomputed parameter to perform a power-aware ordering of the data. The extracted parameter allows us to classify the memory contents based on their one's count. This classification can be used to store the memory contents so efficiently that the search operation is restricted to a small portion of the memory.

Ordering the memory data according to the one's count parameter makes it possible to split the memory architecture into independent banks, where every datum in a bank has the same value for one subset of the parameter (e.g., the N least significant bits). Moreover, the logic needed for this ordering is very simple and does not introduce serious overhead in terms of delay and energy consumption [13].

Figure 1 depicts our architecture (the RAM area corresponds to the memory where the output address associated with each tag is found, completing the CAM-search engine). It can be observed that each memory word is composed of a validation bit, the data word, and part of its parameter. The parameter extractor computes the parameter of the input tag, and a bank decoder selects the proper bank using a subset of the bits of the calculated parameter. Then, the remaining parameter bits and the input data are searched in the decoded bank.
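The following behavioral sketch summarizes this search flow. It is a minimal Python model, not the circuit itself: the 4-bank split on the two least significant parameter bits is an assumed configuration, the function names are illustrative, and the validation bit is omitted for brevity.

```python
def ones_count(tag: int, width: int = 32) -> int:
    """Parameter extractor: one's count of the input tag."""
    return bin(tag & ((1 << width) - 1)).count("1")

def search(banks, tag: int, n_bank_bits: int = 2):
    """Search only the bank selected by the parameter's LSBs.

    `banks` maps a bank index to a list of (param_rest, data) words;
    only the parameter bits not used for bank decoding are stored.
    """
    param = ones_count(tag)
    bank_idx = param & ((1 << n_bank_bits) - 1)  # bank decoder: N LSBs
    param_rest = param >> n_bank_bits            # bits kept in the word
    for addr, (stored_rest, data) in enumerate(banks[bank_idx]):
        # The comparison is confined to one bank; a mismatch on the short
        # parameter field already rules out the full data comparison.
        if stored_rest == param_rest and data == tag:
            return bank_idx, addr
    return None  # failed search: the tag must be written before retrying
```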

Due to the banked implementation of the memory, operation is restricted to just one bank every cycle. One of the advantages of this banked structure is the reduction of dynamic power consumption, as the charge on the bit lines is limited to one bank (the driven line is reduced to the bit line of the accessed bank). The parameter lines show the same behavior, which also has a positive influence on memory speed. Furthermore, the complexity of the logic shared by the banks (buffers, priority encoders, and address decoders) is reduced when the banked approach is applied. This simplification saves area and power consumption and improves the delay of these devices.

3. Devised Architecture: Drowsy Approach

The banked PB-CAM presented in Section 2 has been improved to reduce the static power consumption of the architecture. In this case, the unused memory banks are put into a low-power state, and a pipeline carefully manages the operations to avoid any performance penalty. This approach is based on the design of a low-power cell and a pipelined, subbanked implementation.

3.1. Low-Power Cells (Data and Parameter)

As previously described, the memory area has been split into several banks that can be accessed independently. A dynamic voltage scaling (DVS) technique is then applied to put the unused banks into a low-power state and thus save as much energy as possible in the system. However, when the target is a memory device, the cost of recovering lost information could cancel out any power saving or, at least, represent a very significant time penalty. Therefore, the powered-down banks must be handled so that their information is not lost.

An efficient approach to achieving the low-power (drowsy) state is proposed in [15], where a DVS technique is exploited to reduce static power consumption. As is well known, both dynamic and static power consumption depend strongly on the supply voltage. The DVS technique benefits from this fact by turning down the power supply to reduce power consumption. However, reducing the supply voltage also has a negative impact on other parameters, such as speed; hence, DVS techniques seek to combine different voltages without affecting those parameters.
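As a first-order reminder (a standard approximation, not a result taken from [15]): dynamic power scales roughly as P_dyn ≈ α·C·V_DD²·f, while leakage power scales as P_leak ≈ V_DD·I_leak. Lowering the supply therefore attacks the dynamic term quadratically and the leakage term at least linearly (more in practice, since I_leak itself decreases with V_DD).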

The method proposed by Flautner et al. implements DVS to reduce the leakage power of cache cells by scaling the voltage of the cell down to a lower level that still ensures the preservation of the state of the memory cells. This voltage can be conservatively approximated as 1.5 times V_th [15], but further reductions of the scaled voltage would increase the static power savings [16].

Figures 2 and 3 show the modified memory cells (data and parameter) used to support the drowsy state with a dual power supply (not part of the cell; it is shared by all the cells in the same bank). As can be observed, the dual power supply is switched to the low level when the cell is in the drowsy state. High-V_th devices must be used as pass transistors because the voltage on the bit lines could otherwise destroy the cell contents. Before a memory position can be accessed, the power supply has to be switched back to the high level (wake-up) to restore the contents and allow the access. The careful management of these operations along the pipeline presented in the next section absorbs this extra clock cycle and avoids any performance penalty.
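The following is a minimal behavioral sketch of the per-bank drowsy control, with illustrative names and supply placeholders (the real mechanism is the dual-supply switch in Figures 2 and 3, not software):

```python
V_FULL = "full"      # nominal supply of the technology
V_DROWSY = "scaled"  # the reduced, state-preserving supply (about 1.5 V_th)

class Bank:
    """One CAM bank with its shared dual power supply."""
    def __init__(self):
        self.vdd = V_DROWSY          # idle banks rest in the drowsy state

    def wake(self):
        # Switched one cycle before the access (see the pipeline below),
        # so restoring the supply costs no extra stall.
        self.vdd = V_FULL

    def accessible(self) -> bool:
        # High-V_th pass transistors keep the contents safe while drowsy,
        # but reads/writes are only allowed at full voltage.
        return self.vdd == V_FULL

    def drowse(self):
        self.vdd = V_DROWSY
```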

As mentioned before, each bank of the CAM architecture includes the additional logic required to implement the DVS mechanism. Since the low-power state is selected for a whole bank rather than for a specific memory position, the overhead of the control logic is greatly reduced.

3.2. Pipeline

A clear way to improve the access time of a CAM is the use of a pipeline structure, which additionally provides greater scalability in the performance and density of the applications that make use of CAMs. The aforementioned DVS technique can take advantage of this pipeline to wake up the drowsy cells one clock cycle before the access. Only one of the banks needs to be on, while the rest can remain in the drowsy state, saving leakage energy.

In our approach, the devised pipeline configuration covers the three operations needed in a CAM-search engine: READ, OVERWRITE, and WRITE. READ is a read of the associated RAM memory after the tag is found in the CAM; in OVERWRITE, a write is performed in the RAM memory after the tag is found in the CAM; and WRITE writes a tag and its data into both the CAM and RAM memories.

The pipeline stages defined within those operations are EXT (parameter extraction of the input tag and selection of the working bank), SEARCH (tag comparison in the CAM), DEC (decoding of the internal address, common to both RAM and CAM), READ (only in the RAM memory), and WRITE (in both memories or only in the RAM):

(i) READ operation: EXT–SEARCH–READ_R,
(ii) OVERWRITE operation: EXT–SEARCH–WRITE_R,
(iii) WRITE operation: EXT–DEC–WRITE_CR.

However, this three-stage pipeline shows a structural and data hazard, as shown in Figure 4(a). The stages and the resources used in each stage (parameter extractor, address decoder, CAM, and RAM memories) are shown in the plot. The hazard arises in the CAM structure between a READ (or OVERWRITE) operation and a WRITE operation, because the CAM area would be accessed simultaneously by the second and third stages, respectively. This problem can be solved by including a fourth pipeline stage, splitting the WRITE operation into WRITE_C and WRITE_R and introducing a “no operation” stage, NOP (see Figure 4(b)). All CAM accesses then take place in the third stage and all RAM accesses in the fourth:

(i) READ: EXT–NOP–SEARCH–READ_R,
(ii) OVERWRITE: EXT–NOP–SEARCH–WRITE_R,
(iii) WRITE: EXT–DEC–WRITE_C–WRITE_R.

The second stage would mean a one-cycle delay in the READ and OVERWRITE operations, but we take advantage of this cycle to wake up the accessed CAM memory cells from the drowsy state. Therefore, the pipeline is not stalled and performance is not compromised. The resulting schedule is sketched below.
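The hazard-free property can be checked compactly; the sketch below encodes the stage lists above as data (the encoding itself is purely illustrative) and verifies that all CAM accesses fall in the third stage and all RAM accesses in the fourth:

```python
STAGES = {
    "READ":      ["EXT", "NOP", "SEARCH",  "READ_R"],
    "OVERWRITE": ["EXT", "NOP", "SEARCH",  "WRITE_R"],
    "WRITE":     ["EXT", "DEC", "WRITE_C", "WRITE_R"],
}

CAM_STAGES = {"SEARCH", "WRITE_C"}   # the only CAM accesses
RAM_STAGES = {"READ_R", "WRITE_R"}   # the only RAM accesses

# With every CAM access in stage 3 and every RAM access in stage 4,
# back-to-back operations can never collide on the same resource.
assert all(s[2] in CAM_STAGES and s[3] in RAM_STAGES
           for s in STAGES.values())
```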

The throughput of the pipeline is set by the slowest stage: the third stage of a READ or OVERWRITE operation (SEARCH), where the parallel access to all the words in a bank of the CAM memory is carried out.

3.3. Banks Subdivision

Once the leakage current control mechanism has been presented, the natural goal is to increase the expected energy savings. The simplest idea is to divide the memory into as many banks as possible, using more parameter bits to decode the active bank. This technique presents the same advantages as those mentioned in Section 2.

However, dividing the memory into so many banks has two very important drawbacks. The first is the unbalanced use of the banks, which increases the failed-search rate of the memory (the number of times a searched datum is not in the memory and has to be written before a new search). The 4-bank implementation in Section 2 presents a homogeneous use of the banks (each bank receiving an almost exact 25% of the input tags), but if a third parameter bit is introduced to split the CAM into 8 banks, the distribution of inputs varies from 14.48% to 10.52% for the one's count parameter (a difference of 27.4% between the most and the least used bank); the sketch below reproduces these figures. The second drawback is the complex layout that the memory would require, due to the elements shared among the banks.
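These balance figures can be reproduced analytically. Assuming uniformly distributed 32-bit input tags, the one's-count parameter follows a binomial distribution, and each bank collects the parameter values whose least significant bits match its index; the following minimal sketch computes the resulting bank usage:

```python
from math import comb

WIDTH = 32
# Probability of each one's-count value for a uniform 32-bit tag.
p_param = [comb(WIDTH, k) / 2**WIDTH for k in range(WIDTH + 1)]

def bank_usage(n_bits: int) -> list[float]:
    """Fraction of tags mapped to each of the 2**n_bits banks."""
    use = [0.0] * (1 << n_bits)
    for k, p in enumerate(p_param):
        use[k & ((1 << n_bits) - 1)] += p
    return use

print(bank_usage(2))  # 4 banks: almost exactly 25% each
print(bank_usage(3))  # 8 banks: from ~14.48% down to ~10.52%
```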

Therefore, another technique has been devised to preserve the homogeneous use of the banks and a realistic layout: the subdivision of each bank into a set of subbanks. The main idea is to combine the parameter decoding with a new ordering of the input tags, in this case using the value of some bits of the tags themselves. In this way, the tags found in the same bank are ordered into local subbanks according to some of their bits (the tags belonging to the same subbank share the same value of those bits). This mechanism obtains a very homogeneous use of the subbanks without impacting the layout.

For example, in the previous 4-bank implementation, using any two bits of the tag for the bank subdivision yields 4 subbanks per bank (corresponding to the values 00, 01, 10, and 11 of those two tag bits), as depicted in Figure 5. Unlike the 8-bank implementation, this 4-bank configuration with subbanking presents a very homogeneous use of the subbanks, with only a 0.013% maximal difference between subbanks. Moreover, the layout remains without appreciable changes when the subbanking approach is applied.
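A minimal sketch of the combined addressing follows; the choice of the two lowest tag bits for the subbank is purely illustrative (any two bits would serve, as noted above):

```python
def locate(tag: int) -> tuple[int, int]:
    """Map a 32-bit tag to its (bank, subbank) pair."""
    param = bin(tag & 0xFFFFFFFF).count("1")  # one's count parameter
    bank = param & 0b11      # two parameter LSBs select the bank
    subbank = tag & 0b11     # two tag bits select the subbank (not stored)
    return bank, subbank
```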

One of the key advantages of this subbanking technique is that any memory operation is performed only in the proper subbank of the decoded bank, while the other subbanks of that bank, as well as the other banks, remain in drowsy mode. Moreover, this technique brings dynamic power and area advantages very similar to those presented for the banked implementation. For example, the tag bits used for subbanking do not need to be stored. Also, the complexity of the common logic (address decoder and priority encoders) can be reduced by designing a single element for the subbank and sharing this design among every subbank in the architecture. Finally, the power consumption of the comparison operation is restricted to the working subbank, which increases the savings in this factor.

4. Experimental Results

Our experiments have been carried out with Spice simulations in the Cadence environment. The designed architecture has been implemented in the 0.35 μm technology from Austria MicroSystems (as the base architecture is also a 0.35 μm design), while the estimation of the leakage energy savings has been carried out using the 70 nm BPTM models [17], since leakage power cannot be studied with the technology selected for the implementation.

4.1. Banked Implementation

The banked architecture has first been evaluated in terms of the energy savings obtained from the dynamic power reduction. The simulated memory is the architecture described in Figure 1, implemented as a memory of 2048 positions with 32 bits per word, split into 4 independent banks. Our approach decreases the dynamic power consumption by 78% (18.86 fJ/bit in the banked architecture versus 86 fJ/bit in [10]).

The area improvement achieved with the proposed architecture has also been evaluated in terms of the number of transistors. When the memory size is fixed and the number of bits per word is varied, the banked implementation shows a modest area improvement over the original PB-CAM, due to the savings in parameter word length and comparison logic. Compared to a traditional implementation, the area savings are significant for architectures with more than 16 bits per word (up to 17.5% in the range considered: 8 to 128 bits per word for a 2048-word memory), thanks to the use of 7-transistor data cells instead of the 9-transistor cells of a traditional architecture.

For the implemented architecture, the area savings are 13.5% with respect to a traditional architecture and 10.7% with respect to the base PB-CAM.

Finally, the performance of the design has been analyzed to ensure the required fast response. The results show a 7.5 ns delay for the search operation, which also includes the data write into the RAM memory. Comparing these results with those reported by Lin et al. in [10] shows that the search operation in the banked architecture is 25% faster than in the original PB-CAM (10 ns).

4.2. Drowsy Implementation

Regarding energy savings, this approach presents further improvements in both dynamic and leakage energy thanks to the subbanking and drowsy techniques. The simulated architecture is a CAM with 8192 positions (note the larger implementation with respect to the baseline architecture) and 32 bits per word, implemented with 4 independent banks, 4 subbanks per bank, and the described 4-stage pipeline.

With this design, when the subbanking technique is used without the drowsy technique, the dynamic consumption is reduced to 9.06 fJ/bit (a decrease of 89% and 52% with respect to the baseline architecture and our first approach, resp.).

To apply the drowsy technique for leakage reduction, we have considered two scaled voltages: the conservative one from [15] (a scaled voltage of 1.5 V_th) and a smaller one from [16] (1.25 V_th). The BPTM simulations estimate 92% leakage energy savings for the memory cells in the low-power state at the conservative voltage, which increases to 98% at the second voltage. Given that the working subbank has to remain at full voltage, the total leakage power consumption of the simulated architecture is reduced by 86% and 91%, respectively (see the bookkeeping sketch below).
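These totals follow from simple bookkeeping, under the assumption that exactly one of the 16 subbanks (4 banks × 4 subbanks) is awake at full voltage and that only the cell array is counted:

```python
SUBBANKS = 16  # 4 banks x 4 subbanks, the simulated configuration

for label, cell_saving in (("1.5 V_th", 0.92), ("1.25 V_th", 0.98)):
    awake = 1 / SUBBANKS                      # fraction at full voltage
    drowsy = (1 - awake) * (1 - cell_saving)  # residual drowsy leakage
    total_saving = 1 - (awake + drowsy)
    # Prints roughly 86% and 92%, matching the 86% and 91% quoted
    # above up to rounding.
    print(label, f"{total_saving:.1%}")
```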

However, when the drowsy technique is applied, waking up the accessed subbank becomes a new source of dynamic power consumption. The value of this dynamic consumption depends on the scaled voltage used for the low-power state, but in the opposite direction to the leakage consumption: when the scaled voltage is reduced, the leakage power consumption decreases while the wake-up dynamic power consumption increases (the voltage swing between the low-power state and the working state is larger).

If we extrapolate the BPTM simulations to the 0.35 μm simulated architecture and the two proposed voltages, waking up a subbank means an additional dynamic power consumption of 2.77 fJ/bit for 1.5 V_th and 3.06 fJ/bit for 1.25 V_th. Table 1 summarizes the experimental results for the two voltages.

These energy savings can easily be increased by selecting an architecture with more banks or more subbanks per bank. As stated before, and as can be seen in Figure 6, an 8-bank implementation unbalances the usage of the banks and increases the failed-search rate, due to the different probabilities of each parameter value: while for parameter 0 or 32 there is only one possible input datum, there are about 601 million different input data for parameter 16.

However, using more subbanks per bank has a very weak impact on that factor, because all the input data bits have the same probability (including those used to select the subbank), as can be seen in Figure 7. For a 4-bank implementation, the difference in usage between the most and the least used bank is 0.006%. When subbanks are introduced, it can be observed that this percentage only doubles each time the number of subbanks is quadrupled, and it remains very small over a very wide range of subbank counts.

In this way, using more subbanks only slightly increases the unbalance between banks. However, the energy saved by using more subbanks is considerable, since the active part of the memory is smaller: regarding dynamic power, less comparison power is consumed during a search and less wake-up power is required, while regarding leakage power, more parts of the memory can stay in the low-power state.

The drowsy implementation with subbanking also presents area improvements, because two fewer memory cells per datum are stored, while the overhead of the DVS control has a very small impact given that it is shared by all the cells in the same subbank. Figure 8 shows the area savings when the word length is fixed (32 bits) and the memory size is varied. For our simulated architecture, we obtained an improvement of 16.1% with respect to the traditional implementation, 13.4% with respect to the base PB-CAM, and 5% compared with our banked implementation.

The throughput of the system is also improved (by 70% compared with the baseline architecture and by 60% compared with the banked PB-CAM) thanks to the pipeline. A comparison of the two approaches presented in this paper, as well as of the baseline architecture, can be found in Table 2.

5. Conclusions

Nowadays, the limiting factor in applications where CAMs play a critical role is the power consumption of these devices. The integration levels achieved by current technology processes have turned area and performance into secondary factors. Search-based applications with high-performance constraints demand efficient implementations of content-addressable memories to meet their stringent requirements. The work presented in this paper has shown efficient mechanisms to reduce dynamic and static power consumption by means of hardware modifications. These approaches do not compromise the performance and area improvements achieved by the architecture.

Acknowledgment

This work was supported by the Spanish Ministry of Science and Education under Contract TEC2006-00739.