Abstract

HBase, a master-slave framework, and Cassandra, a peer-to-peer (P2P) framework, are two of the most commonly used large-scale distributed NoSQL databases. They are especially well suited to cloud computing because of their flexibility, scalability, and ease of big data processing. Each storage structure adopts its own backup strategy to reduce the risk of data loss. This paper aims to realize highly efficient remote cloud data center backup using HBase and Cassandra. To verify the efficiency of the backup, we applied Thrift Java in the cloud data center to run a stress test consisting of data read/write and remote database backup on large amounts of data. Finally, to assess the remote data center backup in terms of cost-effectiveness, a cost-performance ratio was evaluated for several benchmark databases and the proposed ones. As a result, the proposed HBase approach outperforms the other databases.

1. Introduction

In recent years, cloud services [1, 2] have become part of our daily lives. Many traditional services such as telemarketing, television, and advertising are evolving into digital formats. As smart devices gain popularity, the exchange of information is no longer limited to desktop computers; information is also transferred through portable smart devices [3, 4], so that people can receive prompt and up-to-date information anytime. As a result, data of all types and forms are constantly being produced, leaving a mass of uncorrelated or unrelated information that conventional databases cannot handle in a big data environment. This has led to the emergence of nonrelational databases; notable NoSQL databases currently used by enterprises include HBase [5], Cassandra [6], and Mongo [7]. Generally, companies assess the types of their applications before deciding which database to use. To these companies, data analysis on these databases can be a matter of success or failure. Examples include mailing systems, trading records, and the number of hits on an advertisement, where data must be retrieved, cleaned, analyzed, and transformed into useful information for the user. As the types of information keep increasing, data processing becomes ever more challenging for nonrelational databases. The well-known HBase and Cassandra databases are often used as a company's internal database system, and each uses its own distributed architecture to handle data backup between different sites.

Distributed systems are often built in a single-cluster environment and include preventive measures against the single-point-of-failure problem, that is, against system crashes or data loss. However, accidents such as a power outage, a natural disaster, or human error can bring down the whole system, in which case the system must fall back on a backup held at a remote data center. Even though NoSQL databases use a distributed architecture to reduce the risk of data loss, they have neglected the importance of remote data center backup. In addition to considering node independence and providing uninterrupted service, a good database system should also support instant cross-cluster or cross-hierarchy remote backup. With this backup mechanism, data can be restored and further data corruption can be prevented. This paper implements remote data center backup using two notable NoSQL databases and performs stress tests on large-scale data, covering read, write, and remote data center backup operations. The experimental results for remote data center backup using HBase and Cassandra are used to assess their effectiveness and efficiency based on the cost-performance ratio [8].

The remainder of this paper is organized as follows. Section 2 describes large-scale databases in the data center. Section 3 presents the approach to remote data center backup. Section 4 proposes the method for implementing NoSQL database remote backup. Section 5 presents the experimental results and discussion. Finally, Section 6 draws a brief conclusion.

2. Large-Scale Database in Data Center

Database storage structures can be divided into several types; currently, the more common ones are hierarchical (IBM IMS), network (Computer Associates IDMS), relational (MySQL, Microsoft SQL Server, Informix, PostgreSQL, and Access), and object-oriented (PostgreSQL). With the rapid growth of IT in recent years, a new data storage architecture was developed to store large amounts of unstructured data. Databases of this kind are collectively called nonrelational databases, that is, NoSQL databases (Google BigTable, MongoDB, Apache HBase, and Apache Cassandra). The term NoSQL first appeared in 1998, when Carlo Strozzi used it for a lightweight, open source relational database that did not provide an SQL interface.

This paper realizes remote data center backup for the two distributed databases HBase and Cassandra. Each design achieves two of the three properties in the CAP theorem [9]: consistency (C), availability (A), and partition tolerance (P).

HBase, a distributed database, works under the master-slave framework [10], where the master node assigns information to the slave nodes to realize distributed data storage, while emphasizing consistency and partition tolerance. Regarding remote data center backup, a data center with HBase has the following advantages: it retains data consistency, supports instant reading and writing of massive amounts of information, provides access to large-scale unstructured data, can be expanded with new slave nodes, provides computing resources, and prevents single-node failure problems in the cluster.

Cassandra, a distributed database, works under the peer-to-peer (P2P) [11] framework, where each node holds identical backup information to realize distributed data storage with uninterrupted service, while emphasizing availability and partition tolerance. As for remote data center backup, a data center with Cassandra has the following advantages: each node shares equal information, cluster setup is quick and simple, the cluster can be dynamically expanded with new nodes, all nodes have equal priority, and the cluster does not have a single-node failure problem.

3. Remote Data Center Backup

3.1. Remote HBase and Cassandra Data Centers Backup

The remote HBase data center backup architecture [12] is shown in Figure 1. The master cluster and the slave cluster must each possess their own independent Zookeeper [13]. The master cluster establishes a replication code for the data center and designates the location of the replication, so as to achieve offsite or remote data center backup between different sites. The remote Cassandra data center backup architecture [14] is shown in Figure 2. Cassandra uses a peer-to-peer (P2P) framework that connects all nodes together. When information is written into data center A, a copy of the data is immediately backed up into a designated data center B, and each node can designate a permanent storage location in a rack [15], as shown in Figure 3. This paper extends the single-cluster replication mechanism to replication at the data center level. By adjusting the replication mechanism between data centers and nodes, the corresponding nodes of two independent data centers are connected and linked through the SSH protocol, and information is then distributed and written into these nodes by the master node or seed node to achieve remote data center backup.

3.2. Cross-Platform Data Transfer Using Apache Thrift

Apache Thrift [16] was developed by the Facebook team [17] and was donated to the Apache Foundation in 2007 to become one of its open source projects. Thrift was designed to solve Facebook's problem of transferring large amounts of data between various platforms and distinct programming languages, and it therefore provides cross-platform RPC protocols. Thrift supports a number of programming languages [18], such as C++, C#, Cocoa, Erlang, Haskell, Java, OCaml, Perl, PHP, Python, Ruby, and Smalltalk. With its high-performance binary communication, Thrift supports multiple forms of RPC protocol and acts as a cross-platform API. Thrift is also a transfer tool suitable for exchanging and storing large amounts of data [19]; compared with JSON and XML, its performance and capability for large-scale data transfer are clearly superior. The basic architecture of Thrift is shown in Figure 4. In Figure 4, the Input Code is the program code executed by the Client. The Service Client is the Client-side and Server-side code framework defined by the Thrift documents, and read()/write() are the code outlined in the Thrift documents to realize the actual data read and write operations. The rest are Thrift's transfer framework, protocols, and underlying I/O protocols. Using Thrift, we can conveniently define a multilanguage service system and select among different transfer protocols. The Server side includes the transfer protocol and the basic transfer framework and provides both single-thread and multithread operation modes, so that the Server and browser can interoperate concurrently.
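To make the size argument concrete, the following minimal sketch compares a fixed-width binary encoding with a JSON encoding of one record. It uses Python's standard struct module and not Thrift itself, so the byte layout is an illustrative assumption and is not Thrift's wire format. The record shape (one rowkey with five column values) follows the test data described in Section 5.7; the field names and values are hypothetical.

```python
import json
import struct

# Hypothetical record: one 64-bit rowkey and five double-precision column values.
ROWKEY = 1234567890123
VALUES = [3.141592653589793] * 5


def encode_binary(rowkey, values):
    """Pack the record as a little-endian int64 followed by five float64 values."""
    return struct.pack("<q5d", rowkey, *values)


def encode_json(rowkey, values):
    """Encode the same record as UTF-8 JSON text."""
    return json.dumps({"rowkey": rowkey, "values": values}).encode("utf-8")


def decode_binary(payload):
    """Unpack a payload produced by encode_binary."""
    rowkey, *values = struct.unpack("<q5d", payload)
    return rowkey, values


binary_payload = encode_binary(ROWKEY, VALUES)
json_payload = encode_json(ROWKEY, VALUES)

if __name__ == "__main__":
    print(len(binary_payload), "bytes in binary form")
    print(len(json_payload), "bytes in JSON form")
```

The size gap depends on the data: short numbers can make the text form smaller, so the sketch only illustrates why a binary protocol tends to be more compact for numeric records with many digits.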

4. Research Method

The following procedures first explain how to set up HBase and Cassandra data centers on CentOS 6.4 to achieve remote backup. Next, the performance of the data centers is tested for reading, writing, and remote backup of large amounts of information.

4.1. Implementation of HBase and Cassandra Data Centers

Data centers A and B are installed on the CentOS 6.4 operating system, on which the HBase and Cassandra data centers are set up. The following procedures explain how to build the Cassandra and HBase data centers and their backup mechanisms. Finally, we develop test tools; the performance tests cover reading, writing, and remote backup in the data center:
(1) CentOS's firewall is strictly controlled; to use the transfer ports, one must preset the settings as shown in Figure 5.
(2) The IT manager sets up the HBase and Cassandra data centers and examines the status of all nodes as shown in Figures 6 and 7.
(3) In the HBase system, tables with identical names must be created in both data centers. The primary data center executes the command (add_peer) [12] and backs up the information onto the secondary data center, as shown in Figures 8 and 9.
(4) The IT manager edits Cassandra's file (cassandra-topology.properties), as shown in Figure 10, and sets the names of the data centers and the storage location of the nodes (rack number).
(5) The IT manager edits Cassandra's file (cassandra.yaml), as shown in Figure 11, and changes the value of endpoint_snitch [14] to PropertyFileSnitch (data center management mode).
(6) The IT manager executes the command (create keyspace test with strategy_options = and placement_strategy = “NetworkTopologyStrategy”) in Cassandra's primary data center and then creates a table and initializes remote backup as shown in Figure 12.
(7) Finally, the IT manager tests the performance of writing, reading, and offsite data backup on large amounts of information using Thrift Java as shown in Figures 13 and 14.

As shown in Figure 15, the user connects to the database through the Server Login function, selects a file folder using Server Information, and then selects a data table. Having completed the above instructions, the user can operate the database according to the functions described and shown in Figure 15.
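Steps (4) and (5) can be sketched as follows. The snippet renders the contents of cassandra-topology.properties from a node map, using the `address=data_center:rack` line format read by PropertyFileSnitch. The IP addresses, the data center names DC1 and DC2, and the rack names are hypothetical placeholders and are not the values used in the experiments.

```python
# Hypothetical node layout: IP address -> (data center name, rack name).
NODES = {
    "192.168.1.11": ("DC1", "RAC1"),
    "192.168.1.12": ("DC1", "RAC2"),
    "192.168.2.11": ("DC2", "RAC1"),
    "192.168.2.12": ("DC2", "RAC2"),
}


def render_topology(nodes, default=("DC1", "RAC1")):
    """Return the text of a cassandra-topology.properties file.

    Each node becomes one `ip=data_center:rack` line; the `default` entry
    is used by the snitch for nodes that are not listed explicitly.
    """
    lines = [f"{ip}={dc}:{rack}" for ip, (dc, rack) in sorted(nodes.items())]
    lines.append(f"default={default[0]}:{default[1]}")
    return "\n".join(lines) + "\n"


topology_text = render_topology(NODES)

if __name__ == "__main__":
    print(topology_text, end="")
```

The matching setting in cassandra.yaml is `endpoint_snitch: PropertyFileSnitch`, as described in step (5).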

4.2. Performance Index

Equation (1) calculates the average access time (AAT) for each data size, relating the average access time measured with a specific data size to the current data size. The following three formulae evaluate the performance index (PI) [1, 2]. Equation (2) calculates the data center's overall average access time across the tests (i.e., write, read, and remote backup) from the average access time of each data size given by (1). Equation (3) calculates the data center's normalized performance index. Equation (4) calculates the data center's overall performance index; it includes a constant whose purpose is to quantify the value for observation.
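As a rough illustration of how such an index can be computed, the sketch below goes from measured totals to a scaled performance index. The exact forms are assumptions made for this example only: the per-datum time is the total time divided by the record count, the overall time is an equal-weight mean over data sizes, the normalization divides the best (smallest) time by each data center's time, and the scale constant is 100. All measurements and data center names are hypothetical.

```python
def average_access_time(total_seconds, data_size):
    """Per-datum access time: total measured time divided by the record count."""
    return total_seconds / data_size


def overall_access_time(per_size_times):
    """Equal-weight mean of the per-datum times over all tested data sizes (assumed weighting)."""
    return sum(per_size_times) / len(per_size_times)


def normalized_index(overall_times):
    """Map each time into (0, 1]; the fastest data center scores 1.0 (assumed normalization)."""
    best = min(overall_times.values())
    return {name: best / t for name, t in overall_times.items()}


def performance_index(normalized_by_test, scale=100.0):
    """Scaled equal-weight mean of the normalized indices over the tests (scale is assumed)."""
    names = next(iter(normalized_by_test.values())).keys()
    count = len(normalized_by_test)
    return {
        name: scale * sum(test[name] for test in normalized_by_test.values()) / count
        for name in names
    }


# Hypothetical totals in seconds for one test: data center -> {data size: total time}.
write_totals = {
    "center_x": {1000: 2.0, 10000: 20.0},
    "center_y": {1000: 4.0, 10000: 40.0},
}

write_overall = {
    name: overall_access_time(
        [average_access_time(t, n) for n, t in totals.items()]
    )
    for name, totals in write_totals.items()
}
write_normalized = normalized_index(write_overall)
write_pi = performance_index({"write": write_normalized})
```

With more than one test, the read and remote backup results would be added as further entries of the dictionary passed to `performance_index`.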

4.3. Total Cost of Ownership

The total cost of ownership (TCO) [1, 2] is divided into four parts: hardware costs, software costs, downtime costs, and operating expenses. The cost over a five-year period is calculated using (5), in which one subscript indexes the data center and the other denotes the period of time. We assume an annual unexpected downtime cost and monthly expenses that include machine room fees, installation and setup fees, provisional changing fees, and bandwidth costs.
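A minimal sketch of a five-year cost calculation is given below. It assumes a simple additive form (one-time hardware and software costs plus yearly downtime cost and twelve months of expenses for each year), which may differ from the exact form of (5). The class name, field names, and input values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CostInputs:
    hardware: float          # one-time hardware cost
    software: float          # one-time software or license cost
    annual_downtime: float   # estimated cost of unexpected downtime per year
    monthly_expenses: float  # recurring operating expenses per month


def total_cost_of_ownership(costs, years=5):
    """Sum one-time costs and recurring costs over the given number of years."""
    recurring_per_year = costs.annual_downtime + 12 * costs.monthly_expenses
    return costs.hardware + costs.software + years * recurring_per_year


# Hypothetical inputs in USD; only the additive structure matters here.
example = CostInputs(hardware=10000.0, software=0.0,
                     annual_downtime=1000.0, monthly_expenses=200.0)
example_tco = total_cost_of_ownership(example)
```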

4.4. Cost-Performance Ratio

This section defines the cost-performance ratio (C-P ratio) [8] of each data center based on its total cost of ownership and its performance index, as shown in (6). Equation (6) includes a constant scale factor whose purpose is to bring the C-P ratio into a fixed interval so that the differences among the data centers can be observed.
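The ratio can be sketched as below, assuming the common form in which the performance index is divided by the total cost and multiplied by a scale factor. This form, the scale value of 1000, and the two candidate data centers are assumptions for illustration and are not taken from (6) or from the experimental tables.

```python
def cost_performance_ratio(performance_index, total_cost, scale=1000.0):
    """Return scale * PI / TCO; a higher value means more performance per unit cost."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return scale * performance_index / total_cost


# Hypothetical data centers: (performance index, five-year TCO in USD).
candidates = {
    "center_x": (100.0, 25000.0),
    "center_y": (50.0, 20000.0),
}

ratios = {
    name: cost_performance_ratio(pi, tco)
    for name, (pi, tco) in candidates.items()
}
best = max(ratios, key=ratios.get)
```

Under this form, a data center can rank first either by being faster at a similar cost or by being cheaper at a similar speed, which is why the ratio is reported alongside the separate index and cost values.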

5. Experimental Results and Discussion

This section presents the remote data center backup, the stress tests, and the evaluation of the total cost of ownership and performance index for the various data centers. Finally, the effectiveness and efficiency of the data centers are assessed based on the cost-performance ratio.

5.1. Hardware and Software Specifications in Data Center

All tests were performed on an IBM X3650 Server and an IBM BladeCenter as shown in Table 1. The copyrights of the databases used in this paper are shown in Table 2; Apache HBase and Apache Cassandra are the NoSQL databases proposed in this paper, whereas Cloudera HBase, DataStax Cassandra, and Oracle MySQL are the alternative databases.

5.2. Stress Test of Data Read/Write in Data Center

Write and read tests on large amounts of information were run against the various database data centers. A total of four data sizes were tested, and the average access time of a single datum was calculated for each:
(1) Data centers A and B perform the write test on large amounts of information through Thrift Java. Five consecutive write times for the various data centers were recorded for each data size as listed in Table 3. We substitute the results from Table 3 into (1) to calculate the average time of a single datum write for each type of data center as shown in Figure 16.
(2) Data centers A and B perform the read test on large amounts of information through Thrift. Five consecutive read times for the various data centers were recorded for each data size as listed in Table 4. We substitute the results from Table 4 into (1) to calculate the average time of a single datum read for each type of data center as shown in Figure 17.
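The measurement procedure above can be sketched as a small timing harness. This is an illustration only and is not the Thrift Java tool used in the experiments: the in-memory dictionary stands in for a database, the helper names are hypothetical, and averaging the five run times before dividing by the data size is an assumption about how the recorded times are combined.

```python
import time


def time_operation(operation, records, runs=5):
    """Run `operation(records)` `runs` times and return the elapsed seconds of each run."""
    elapsed = []
    for _ in range(runs):
        start = time.perf_counter()
        operation(records)
        elapsed.append(time.perf_counter() - start)
    return elapsed


def per_datum_average(elapsed, data_size):
    """Average the run times, then divide by the number of records accessed."""
    return (sum(elapsed) / len(elapsed)) / data_size


def write_to_memory(records):
    """Stand-in for a database write: store each record in a dict."""
    store = {}
    for rowkey, columns in records:
        store[rowkey] = columns
    return store


def make_records(count):
    """Build `count` records, each a rowkey with five column values."""
    return [(i, [float(i)] * 5) for i in range(count)]


if __name__ == "__main__":
    for size in (1000, 10000):  # hypothetical data sizes
        runs = time_operation(write_to_memory, make_records(size))
        print(size, per_datum_average(runs, size))
```

Replacing `write_to_memory` with a function that issues real client calls would give the same per-datum figure for an actual data center.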

5.3. Stress Test of Remote Data Center Backup

The remote backup testing tool, Thrift Java, is mainly used to find out how long it takes data centers A and B to back up each other's data remotely, as shown in Table 5.

As a matter of fact, the tests show that the average time of a single datum access for the remote backup of Apache HBase and Apache Cassandra is only a fraction of a millisecond. Further investigation found that although the two data centers are located in different network domains, they still belong to the same campus network. The information may have passed only through the campus network and never reached the external Internet, which would explain the fast remote backup. Nonetheless, we did not set up new data centers elsewhere to conduct more detailed tests, because we believe that information exchange through the Internet would give almost the same results as the remote backup tests performed via the campus intranet. Five consecutive backup times for the various data centers were recorded for each data size as listed in Table 5. We substitute the results from Table 5 into (1) to calculate the average time of a single datum backup for each type of data center as shown in Figure 18.

5.4. Evaluation of Performance Index

This subsection evaluates the performance index. We first substitute the average execution times from Figures 16, 17, and 18 into (2) to find the normalized performance index of the data centers for each test as listed in Table 6.

Next, we substitute the numbers from Table 6 into (3) to find the average normalized performance index as listed in Table 7. Finally, we substitute the numbers from Table 7 into (4) to find the performance index of the data centers as listed in Table 8.

5.5. Evaluation of Total Cost of Ownership

The total cost of ownership (TCO) includes hardware costs, software costs, downtime costs, and operating expenses. TCO over a five-year period is calculated using (5) and is listed in Table 9. We estimate that annual unexpected downtime costs around USD 1000; the monthly expenses include machine room fees of around USD 200, installation and setup fees of around USD 200 per occurrence, provisional changing fees of around USD 10 per occurrence, and bandwidth costs.

5.6. Assessment of Cost-Performance Ratio

Equation (6) assesses the cost-performance ratio of each data center according to its total cost of ownership and performance index. Therefore, we substitute the numbers from Tables 8 and 9 into (6) to find the cost-performance ratio of each data center as listed in Table 10 and shown in Figure 19.

5.7. Discussion

Figure 19 shows that Apache HBase and Apache Cassandra obtain higher C-P ratios, whereas MySQL obtains the lowest one. MySQL adopts a two-dimensional array storage structure, and thus each row can have multiple columns. In the test data used in this paper, each rowkey has five column values, and hence MySQL needs to execute five more write operations for each data query. In contrast, Apache HBase and Apache Cassandra adopt a single key-value pattern in storage, so the five column values can be written into the database at the same time; that is, no matter how many column values each rowkey has, only one write operation is required. Figures 16 and 17 show that MySQL consumes more time than the other databases; similarly, as shown in Figure 18, MySQL consumes more time in the remote backup as well. From these results we conclude that the NoSQL databases perform better when processing massive amounts of information.

6. Conclusion

According to the experimental results, this paper has successfully realized remote data center backup with HBase and Cassandra. In addition, the cost-effectiveness evaluation using the C-P ratio was employed to assess the effectiveness and efficiency of remote data center backup, and the assessment of the various data centers was completed over a five-year period. As a result, both HBase and Cassandra yield the best C-P ratios when compared with the alternatives, and the proposed approach gives insight into the assessment of remote data center backup.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work is supported by the Ministry of Science and Technology, Taiwan, under Grant no. MOST 103-2221-E-390-011.