Journal of Electrical and Computer Engineering

Volume 2015 (2015), Article ID 242086, 8 pages

http://dx.doi.org/10.1155/2015/242086

## A Bit String Content Aware Chunking Strategy for Reduced CPU Energy on Cloud Storage

^{1}College of Computer Science, South-Central University for Nationalities, Wuhan, Hubei 430074, China

^{2}School of Information Engineering, Wuhan Technology and Business University, Wuhan, Hubei 430065, China

^{3}School of Foreign Languages, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China

Received 11 May 2015; Accepted 30 July 2015

Academic Editor: Lu Liu

Copyright © 2015 Bin Zhou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

To save energy and reduce the total cost of ownership, green storage has become a top priority for data centers. Detecting and deleting redundant data is key to reducing CPU energy consumption, and a high-performance, stable chunking strategy provides the groundwork for detecting redundant data. Existing chunking algorithms greatly degrade system performance when confronted with big data and waste a lot of energy. This paper analyzes and discusses the factors affecting chunking performance and implements a new fingerprint signature calculation. Furthermore, a Bit String Content Aware Chunking Strategy (BCCS) is put forward. This strategy reduces the cost of signature computation in the chunking process, which improves system performance and cuts down the energy consumption of the cloud storage data center. The advantages of the chunking strategy are verified on the test scenarios and test data of this paper.

#### 1. Introduction

Along with the development of next-generation network computing technologies, such as Internet applications and cloud computing, the scale of data centers has grown explosively over the past 10 years. The total amount of global information doubles every 2 years. It was 1.8 ZB in 2011 and will reach 8 ZB in 2015. In the next 5 years (by 2020), the amount of data will be 50 times that of today [1].

According to the 2005 annual report of the well-known international consulting company Gartner [2], data centers in the USA consumed a total of 45 billion kWh of electricity in 2005. In its latest report on data center efficiency, the Natural Resources Defense Council (NRDC) [3] put the figure at 91 billion kWh in 2013, more than double the total electricity used by homes in New York. Annual consumption is expected to reach 140 billion kWh by 2020. The data center has rapidly become one of the most energy-hungry industries.

Where does the electrical energy go in the data center? Experimental analyses [4–6] found that the largest share of power goes, first, to the refrigeration system and the servers; second, to IT infrastructure such as storage and network equipment; and last, to air conditioning, lighting, and so forth. Among them, the storage system consumes about 14% of the electrical energy [7]. Building energy-efficient green data centers is the future direction of research and development for the industry. Many companies, including IBM, Microsoft, HP, SUN, Cisco, and Google, have launched research on green data centers.

Green storage technology aims to reduce the power consumption of data storage, carbon emissions, and construction and operation costs, and to improve the performance of storage equipment, from the standpoint of environmental protection and energy saving. Studies show that the growing volume of data contains a large amount of redundancy, and the proportion of redundant data increases over time, leading to a sharp decline in the utilization of storage space. Low utilization of storage space wastes both storage resources and a great deal of energy.

To improve the utilization of storage space, a large number of intelligent data management technologies, such as virtualization, thin provisioning, data deduplication, and hierarchical storage, have been widely applied in the cloud data center. Deduplication, now a hot research topic, serves the goals of energy conservation, better return on customers' investment, and lower emissions and operating costs: it saves a great deal of storage space, improves data read and write performance, and effectively reduces bandwidth consumption.

Research shows that a large amount of the growing data is duplicated. Therefore, the key to reducing the data in the system is to find and delete the redundant data, and the basis for detecting it is a high-performance, stable chunking strategy.

Most existing work [8–10] on variable length chunking algorithms uses signature calculation to partition data objects. Research on reducing the time cost of chunking algorithms is relatively scarce.

Detecting chunk boundaries through signature calculation is CPU intensive. Consider a 100 MB file for which the expected average chunk length is 8 KB and the minimum chunk length is 2 KB. The file is expected to be divided into 12800 chunks, and for each chunk boundary 6144 signature calculations and comparisons are required, so about $12800 \times 6144 \approx 7.9 \times 10^{7}$ signature calculations and comparisons are required for the whole file. In this heavy CPU operation, the ratio of the number of chunks to the total number of signature calculations is only 1 : 6144, a difference of more than three orders of magnitude. A lot of valuable CPU resources are consumed by invalid signature calculations and comparisons, which wastes more energy.
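The figures in this example can be reproduced with a few lines of arithmetic. The sketch below assumes binary units (1 KB = 1024 bytes) and assumes that no boundary test is performed inside the minimum chunk length, which is how the per-boundary count of 6144 arises; the variable names are illustrative only.

```python
# Worked arithmetic for the 100 MB example (all sizes in bytes).
KB = 1024
MB = 1024 * KB

file_size = 100 * MB
avg_chunk = 8 * KB   # expected average chunk length
min_chunk = 2 * KB   # minimum chunk length (assumed: no boundary test inside it)

chunk_count = file_size // avg_chunk       # expected number of chunks
checks_per_chunk = avg_chunk - min_chunk   # window positions tested per boundary
total_checks = chunk_count * checks_per_chunk

print(chunk_count, checks_per_chunk, total_checks)
```

Only one test in every 6144 yields a boundary; the remaining tests are the invalid calculations discussed above.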

The key problem of improving chunking performance lies in cutting down the consumption of CPU resources by reducing the number of invalid signature calculations and comparisons.

In this paper, we build BCCS into a prototype storage system. The primary contributions of our work are as follows. Firstly, because the total number of signature computations in the Rabin fingerprinting algorithm is very large, dividing data into chunks is CPU demanding and degrades system performance. We present a new strategy, BCCS, which reduces the overhead of generating fingerprints and converts the problem of stably partitioning file data into the problem of matching two binary bit strings. It samples one particular bit from each text byte to constitute the data fingerprint, which turns the signature computation into a binary string comparison. Only 1/8 of the original text needs to be read, and bit operations take the place of the traditional comparison. Secondly, BCCS uses bitwise operations to optimize each matching step and to exclude as many nonmatching positions as possible, obtaining the maximum jump distance and thereby speeding up the matching of binary strings. It also reduces calculation and comparison costs by exploiting the bit feature information provided by each failed match. These measures reduce the cost of signature computation in the chunking process and bring down CPU resource consumption, improving system performance.

BCCS also divides the chunking process into two steps: a preprocessing step and a partitioning step. BCCS then adopts a parallel processing mode in which preprocessing runs at a preprocessing node in step with partitioning at the front-end node. This hides the preprocessing time while the system is working. Together, these measures minimize the total time consumed by data partitioning.

The rest of this paper is organized as follows. Section 2 introduces related work on traditional chunking algorithms. Section 3 describes the BCCS algorithm in detail. In Section 4, we perform various experiments to measure the overhead of data chunking, and Section 5 draws the conclusion.

#### 2. Related Works

To solve the stability problem of chunking, Muthitacharoen et al. [11] adopted a content-based variable length chunking algorithm named CDC in the Low Bandwidth Network File System. It predetermines a piece of data (called the Marker in this paper) as the chunk boundary in the pending data objects and determines the boundary of each chunk by finding the Marker. In practice, the Rabin algorithm [12] is used to calculate the fingerprint of the data in a sliding window to find the boundary of each chunk in the data objects. But the CDC algorithm remains inadequate. Firstly, chunk sizes can vary widely in some cases, which increases the difficulty of storage management. Secondly, the granularity setting directly affects the precision and effect of deduplication. The smaller the granularity, the higher the deduplication precision, but the chunk metadata overhead increases and affects the overall performance of the system. Conversely, a large granularity reduces the metadata overhead, but the deduplication precision may fall short of the requirements. Thirdly, a chunk boundary may not be found at all, in which case hard chunking is employed.
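The boundary search described above can be illustrated with a minimal sketch. It is not the LBFS implementation: a simple polynomial rolling hash modulo a prime stands in for the true Rabin fingerprint, and the window width, mask, marker value, and size limits are all assumed values chosen for a small demonstration. The helper name `cdc_chunks` is hypothetical.

```python
import random
from typing import List

WINDOW = 16                        # sliding-window width in bytes (assumed value)
BASE = 257                         # polynomial base (assumed value)
MOD = (1 << 31) - 1                # prime modulus (assumed value)
TOP = pow(BASE, WINDOW - 1, MOD)   # weight of the byte that leaves the window


def cdc_chunks(data: bytes, mask: int = 0x3F, marker: int = 0,
               min_size: int = 32, max_size: int = 256) -> List[bytes]:
    """Cut `data` wherever the window hash matches the marker (hypothetical helper)."""
    chunks: List[bytes] = []
    start = 0   # offset where the current chunk begins
    h = 0       # hash of the last WINDOW bytes
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * TOP) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # take in the newest byte
        length = i - start + 1
        at_marker = i + 1 >= WINDOW and (h & mask) == marker
        if (length >= min_size and at_marker) or length >= max_size:
            chunks.append(data[start:i + 1])         # marker hit, or hard chunking at max_size
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                  # trailing partial chunk
    return chunks


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(4000))
chunks = cdc_chunks(data)
shifted = cdc_chunks(b"XYZ" + data)   # same content with three bytes inserted in front
shared = set(chunks) & set(shifted)
print(len(chunks), len(shifted), len(shared))
```

Because the hash depends only on the bytes inside the window, boundaries realign after the inserted prefix and later chunks are shared between the two runs. The loop also shows the cost the paper targets: two modulo reductions for every byte, whether or not a boundary is found there.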

Eshghi and Tang [13] proposed the double threshold algorithm TTTD (Two Thresholds Two Divisors) to obtain stable chunking; it constrains the chunk length to lie between a lower and an upper bound. To a certain extent, TTTD alleviates the contradiction between chunking granularity and metadata overhead. However, it does not completely solve the problem of hard chunking.

Bobbarjung et al. [14] put forward another chunking algorithm, FingerDiff, based on variable length chunking. Its main idea is to merge multiple unchanged chunks into a single SuperChunk wherever possible, reducing the amount of chunk metadata. To a certain extent, it reduces the storage space and speeds up queries.

Much research has addressed how to avoid long chunks, how to choose the chunk size that yields the best deduplication effect, and so on; nevertheless, the literature seldom mentions how to reduce the overhead of calculating fingerprints for the data in the sliding window during chunking. When the amount of data becomes large, the CPU overhead of this calculation increases greatly, and the chunking algorithm faces enormous performance pressure.

#### 3. The Bit String Content Aware Chunking Strategy

Most existing chunking strategies [15–17] use the Rabin algorithm for signature calculation, in which the number of chunks and the total number of signature calculations differ by more than three orders of magnitude. Many of these signature calculations are invalid, and each one calls for a modulo operation, which consumes a large amount of CPU resources.

Two key problems must be solved to reduce the signature calculation and save CPU resources. The first is to reduce the cost of each fingerprint calculation as far as possible; the second is to minimize the number of signature computations.

Working on binary strings, BCCS borrows the incremental fingerprint calculation of the Rabin fingerprint algorithm and converts the problem of stably chunking a file into the matching of two binary strings, reducing the overhead of generating chunk fingerprints. BCCS also borrows the leap-forward matching of the BM (Boyer-Moore) algorithm [18, 19] to obtain the maximum jump length from each match, reducing the intermediate overhead of calculation and comparison. Together, these measures bring down the cost of signature computation during chunking and improve matching efficiency.

##### 3.1. Fast Fingerprint Generation Based on Bitwise Operation

The Rabin algorithm is based on the principle that a pattern string of length $m$ is mapped into a binary bit string (its fingerprint) and each text substring of length $m$ is also mapped into a binary bit string. The algorithm then determines whether the two strings are equal by comparing their fingerprints.

The Rabin algorithm is efficient, but its modulo operations consume a large amount of CPU resources. Therefore, the key to reducing CPU overhead is to improve the algorithm so that modulo operations are minimized.
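For reference, the rolling update used in Rabin-Karp style matching can be written as below. The notation is the usual textbook one and is assumed here rather than taken from this paper: $T$ is the text, $m$ the pattern length, $d$ the radix of the alphabet, $q$ the modulus, and $t_s$ the fingerprint of the text substring that starts at offset $s$.

$$
t_{s+1} = \bigl(d\,(t_s - T[s]\,d^{\,m-1}) + T[s+m]\bigr) \bmod q
$$

Every shift of the window therefore costs at least one multiplication and one modulo reduction, and this per-position cost is repeated for every byte of the text regardless of whether a match is found.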

The basic comparison unit in the Rabin algorithm is the byte. To speed up the algorithm, the new algorithm instead selects one binary bit to represent each basic comparison unit. Fingerprint generation can then be completed with bitwise instructions and no modulo operation, which greatly reduces its CPU overhead.

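Therefore, the new algorithm selects the low order bit of each basic comparison unit to form the fingerprint. If the fingerprint of the current text substring is known, the fingerprint of the next substring, shifted by one basic unit, can be obtained with the following bitwise operations.

Let us assume that BX holds the fingerprint of the current substring and AL holds the next new basic unit: `SHR AL, 1`; `RCL BX, 1`. The first instruction moves the low order bit of AL into the carry flag; the second shifts BX left by one bit and rotates that carry bit into its lowest position.

Then, the data stored in BX is the fingerprint of the next substring.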
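The same incremental update can be sketched in a high-level language. The function names below are hypothetical, and the explicit mask to $m$ bits is an assumption added so that the fingerprint stays exactly as wide as the pattern; a fixed-width register discards the overflowing bit implicitly.

```python
def fingerprint(units: bytes) -> int:
    """Build a fingerprint from the low order bit of every basic unit (byte)."""
    fp = 0
    for u in units:
        fp = (fp << 1) | (u & 1)
    return fp


def next_fingerprint(fp: int, new_unit: int, m: int) -> int:
    """Slide an m-bit fingerprint forward by one basic unit."""
    low_bit = new_unit & 1                          # low order bit of the incoming unit
    return ((fp << 1) | low_bit) & ((1 << m) - 1)   # shift it in, keep only m bits


text = b"content aware chunking"
m = 8
fp = fingerprint(text[:m])
ok = True
for s in range(1, len(text) - m + 1):
    fp = next_fingerprint(fp, text[s + m - 1], m)   # incremental update
    ok = ok and fp == fingerprint(text[s:s + m])    # compare with a full recomputation
print(ok)
```

Each update uses one AND, one shift, and one OR, with no multiplication or modulo operation.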

##### 3.2. Implementation of BCCS

Assume that the length of the original pattern string is $m$ basic units; the $m$-bit pattern fingerprint consists of the low order bit of each basic unit. BCCS prefetches the low order bit of each text basic unit, stores it in a bit array, and then computes the text string fingerprints with the fast fingerprint generation algorithm.
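A minimal sketch of this prefetch-then-match flow follows. It scans every position one by one and does not include the jump optimization introduced next; the helper names `low_bits` and `match_positions` are hypothetical, and a Python list stands in for the packed bit array.

```python
from typing import List


def low_bits(text: bytes) -> List[int]:
    """Prefetch step: keep only the low order bit of every basic unit (byte)."""
    return [b & 1 for b in text]


def match_positions(text: bytes, pattern: bytes) -> List[int]:
    """Return the start offsets whose m-bit fingerprint equals the pattern's."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    mask = (1 << m) - 1
    target = 0
    for b in pattern:                        # fingerprint of the pattern
        target = (target << 1) | (b & 1)
    fp, hits = 0, []
    for i, bit in enumerate(low_bits(text)):
        fp = ((fp << 1) | bit) & mask        # fast fingerprint update
        if i >= m - 1 and fp == target:      # window is full and fingerprints agree
            hits.append(i - m + 1)
    return hits


text = bytes([1, 0, 1, 1, 0, 1, 1, 0])
pattern = bytes([3, 5, 2])                   # low order bits 1, 1, 0
print(match_positions(text, pattern))        # prints [2, 5]
```

In this sketch a hit only means that the sampled bits agree: the text bytes at offsets 2 and 5 differ from the pattern bytes even though their low order bits match.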

This paper considers the following two approaches to pattern matching.

*(I) BCCS Based on Pattern (BCCS-P)*. Similar to the BM algorithm, BCCS-P scans the text from left to right. As shown in Figure 1, to reduce the matching computation, the pattern and the text are first left-justified, and a substring of a given length is selected at the tail of the text aligned with the pattern; BCCS-P then searches the pattern for this substring starting from its right end, rather than matching the whole pattern at first.