Special Issue: Information Analysis of High-Dimensional Data and Applications
A Multidata Connection Scheme for Big Data High-Dimension Using the Data Connection Coefficient
In the era of big data and cloud computing, sources and types of data vary, and the volume and flow of data are massive and continuous. With the widespread use of mobile devices and the Internet, huge volumes of data distributed over heterogeneous networks move back and forth across networks. To meet the demands of big data service providers and customers, efficient technologies for processing and transferring big data over networks are needed. This paper proposes a multidata connection (MDC) scheme that decreases the amount of power and time necessary for information to be communicated between the content server and the mobile users (i.e., the content consumers who move freely across different networks while using their mobile devices). The MDC scheme is an approach to data validation that requires the presentation of two or more pieces of data in a heterogeneous big data environment. The MDC transmits the difference of consecutive data sequences instead of sending the data itself to the receiver, thus increasing transmission throughput and speed.
1. Introduction
The increase in data sources, including social networks and mobile devices, has brought exponential growth in the availability and use of information [1, 2]. This trend toward larger data sets, commonly referred to as big data, entails new technologies and practices for collecting, storing, and analyzing data sets, because traditional data processing applications cannot handle them efficiently. Advanced big data technologies can uncover hidden values or additional information from large volumes of data, providing more predictive, tailored information that can lead to substantial innovation gains [3, 4].
Although an immense amount of data is generated on a regular basis via a variety of channels (e.g., business processes, machines, and human interaction with things like social media sites and mobile devices), information services reflecting specific individual and community needs and interests are scarce. This indicates that technologies that can uncover the insights and meaning of the underlying data are essential to gain value from big data.
Big data implies enormous volumes of data. For example, a single data file in scientific applications can be terabytes in size, and it can take hours, days, or even longer to traverse the network. However, volume is not the only aspect that characterizes big data. Big data is often defined along three dimensions: volume, velocity, and variety. That is, big data is high-volume, high-velocity, and high-variety information assets.
Big data has the power to radically change every aspect of our lives, including politics, economy, culture, and science. While it provides new opportunities for prosperity and innovation, there is always the question of privacy and security, particularly with customer data. Online searches, store purchases, Facebook posts, cell phone usage, and so forth are creating a flood of data that, when organized and analyzed, reveals trends and habits of individuals and society at large. This is why big data is sometimes nicknamed a “modern-day version of Big Brother.” The threats of data breach and leakage need to be understood and mitigated through the use of appropriate security controls.
Big data is being transmitted over communications networks with increasing frequency, causing the problem of machine and communication overhead. The transmission of huge data files is not the only challenge. Real-time data transmission is also becoming increasingly commonplace. For example, mobile users may request big data services using their smartphones while moving across different networks. An efficient technique that can process and move large amounts of data quickly and easily in such a setting is needed.
The MDC scheme proposed in this paper minimizes the time necessary for information to be communicated between the content server and the mobile user so that a user moving across networks can receive big data services without interruption. The MDC scheme is an approach to data validation that requires the presentation of two or more pieces of data in a heterogeneous big data environment. The MDC transmits the difference between consecutive data sequences (more precisely, between the locations of the data in the content server) rather than the data itself. This achieves low latency and reliable throughput, thus avoiding interruptions in service. In addition, the MDC delivers sufficient random bits to be used for validation checks.
The remainder of the paper is organized as follows. Section 2 describes the definition and characteristics of big data. Section 3 presents the proposed MDC scheme. In Section 4, the performance of the proposed scheme is evaluated. The final section provides concluding remarks and directions for future research.
2. Related Work
2.1. Big Data
Big data refers to the expanding volume of high-velocity, complex, and diverse types of data, both structured and unstructured. Big data comes in the form of emails, photos, monitoring devices, audio, cell phone GPS signals, and many more. With the widespread use of mobile devices and the Internet, data is routinely generated by human interaction on systems like social media, causing exponential growth in the volume of data to be analyzed. The advance of machine-to-machine (M2M) communication also accelerates the exponential growth of data volumes [7, 8].
In today’s world, data comes from everywhere, and it differs in many respects from the data in conventional database systems. Online content produced by Internet users, commonly known as user-created content (UCC) or user-generated content (UGC), is an example of unstructured data that creates problems for storing, mining, and analyzing data. With the data available on social media platforms, it is possible to learn many personal things about individuals (e.g., demographic information, preferences, purchase habits, and friends). CCTVs installed almost everywhere (e.g., on roads, in buildings, and inside residential elevators) gather huge volumes of video data every day. Along with the private sector, the public sector generates and manages large amounts of data associated with the census, health and welfare services, national pensions, and so forth.
2.2. Characteristics of Big Data
Table 1 shows the four major elements of big data: volume, velocity, variety, and complexity. Today, big data is recognized as a government and national asset in many countries, but it requires innovative forms of information processing for enhanced insight discovery, decision making, and process optimization. The size of the data sets and the velocity with which they need to be analyzed have outpaced the abilities of standard tools and methods of analysis.
Technologies such as distributed storage, parallel processing, and cloud computing allow fast processing and analysis of massive data sets. With such technologies, industries and digital marketers can gather immediate feedback from tweet updates, Facebook comments, or social media discussions and predict and analyze their business circumstances within a tolerable elapsed time. Thus, companies can provide a more appropriate and prompt response to the customers, increasing their chance of success.
Open source and free software solutions, as well as distributed, scalable, and flexible hardware infrastructures, are available for the management and analysis of large quantities of data. For example, Hadoop, R packages, and cloud computing are popular big data technologies. That is, it is possible to realize benefits from the use of big data without procuring expensive storage or a data warehouse [4, 10].
3. Multidata Connection Scheme
As the volumes and types of data in clouds increase, techniques for efficiently handling big data stored in heterogeneous devices of different networks are required. The MDC scheme proposed in this paper increases the throughput of cloud-based big data by predicting the data to be sent at later times and sending the difference of the consecutive data sequences, not the data itself, to the receiving node.
The MDC scheme transfers the difference between consecutive data members, rather than the data members themselves, in order to increase throughput and speed. Let $x[k]$ be the $k$th data member of a data set transmitted in a big data environment. Instead of sending $x[k]$ to the receiver, the MDC sends $d[k]$, the difference of two consecutive data members, as represented in (1), where $n$ denotes the number of data members in the data set transmitted in a big data environment:
$$d[k] = x[k] - x[k-1], \qquad k = 2, \ldots, n \tag{1}$$
When $d[k]$ and the previous data member $x[k-1]$ are known, $x[k]$ can be derived. Thus, the receiver is able to create $x[k] = x[k-1] + d[k]$ with the received difference $d[k]$.
As presented in Figure 1, the MDC increases throughput by predicting $x[k]$, the $k$th data member. Given that an estimate of $x[k]$ is $\hat{x}[k]$, the difference with a data estimation error, that is, $d[k] = x[k] - \hat{x}[k]$, will be transferred. The receiver predicts $\hat{x}[k]$ with the past data members and creates $x[k]$ by adding the received $d[k]$ to the predicted $\hat{x}[k]$.
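The difference-based transfer described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `mdc_encode` and `mdc_decode` are hypothetical, the data members are taken to be integers (for example, data locations in the content server), and the first member is assumed to be sent unchanged so that the receiver has a starting value.

```python
def mdc_encode(members):
    """Return the first member followed by the consecutive differences."""
    if not members:
        return []
    diffs = [members[0]]  # assumption: the first member is sent unchanged
    for prev, cur in zip(members, members[1:]):
        diffs.append(cur - prev)
    return diffs


def mdc_decode(diffs):
    """Rebuild the members by adding each received difference to the previous member."""
    members = []
    for i, d in enumerate(diffs):
        members.append(d if i == 0 else members[-1] + d)
    return members


locations = [1000, 1004, 1009, 1013, 1020]  # hypothetical data locations
sent = mdc_encode(locations)
received = mdc_decode(sent)
```

When consecutive members are close to each other, the values in `sent` after the first are small, which is where the reduction in transmitted data would come from.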
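Suppose that the derivatives of all orders of the data member $x(t)$ at an arbitrary time point $t$, that is, $x'(t), x''(t), \ldots$, are known. Using a Taylor series, $x(t + T_s)$ is expressed in (2), where $T_s$ denotes the time interval at which the server generates the data members:
$$x(t + T_s) = x(t) + T_s\,x'(t) + \frac{T_s^2}{2!}\,x''(t) + \frac{T_s^3}{3!}\,x'''(t) + \cdots \tag{2}$$
In (2), if the data member and its derivative values at time $t$ are known, the data member at the later time $t + T_s$ can be predicted. Even if only the first derivative is known, an approximate estimation of the future data member is possible, as expressed in (3):
$$x(t + T_s) \approx x(t) + T_s\,x'(t) \tag{3}$$
Let $x[k]$ be the data of $x(kT_s)$. If $t = kT_s$, then $x(kT_s \pm T_s) = x[k \pm 1]$. With $x'(kT_s) \approx (x(kT_s) - x(kT_s - T_s))/T_s$ in (3), $x[k+1] \approx x[k] + T_s\,(x[k] - x[k-1])/T_s$ is produced. This leads to
$$x[k+1] \approx 2x[k] - x[k-1] \tag{4}$$
The approximate value in (3) is further refined by increasing the number of terms on the right side. To compute the higher-order derivatives, more past data are needed. The accuracy of prediction increases as the number of previous data members increases. The prediction is carried out using (5), where $N$ is the number of past data members used and $a_1, \ldots, a_N$ are the prediction coefficients:
$$x[k] \approx a_1 x[k-1] + a_2 x[k-2] + \cdots + a_N x[k-N] \tag{5}$$
In (5), the right side denotes $\hat{x}[k]$, that is, an estimate of $x[k]$. $\hat{x}[k]$ is expressed in
$$\hat{x}[k] = a_1 x[k-1] + a_2 x[k-2] + \cdots + a_N x[k-N] \tag{6}$$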
As increases, the accuracy of prediction increases. The data member sent by the server is and the data member received by the receiver is .
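To illustrate the prediction step, the sketch below assumes the simplest predictor that follows from keeping only the first derivative in the Taylor expansion: each data member is predicted as twice the previous member minus the one before it. The function names are hypothetical, and sending the first two members unchanged is an assumption made only so that the receiver can start predicting.

```python
def predict_next(history):
    """First-order predictor (assumed form): 2 * previous - one before previous."""
    return 2 * history[-1] - history[-2]


def predictive_encode(members):
    """Send the first two members, then only the prediction errors."""
    stream = list(members[:2])
    for k in range(2, len(members)):
        stream.append(members[k] - predict_next(members[:k]))
    return stream


def predictive_decode(stream):
    """Rebuild each member by adding the received error to the local prediction."""
    members = list(stream[:2])
    for error in stream[2:]:
        members.append(predict_next(members) + error)
    return members


data = [10, 12, 14, 17, 20, 23]  # hypothetical, nearly linear sequence
errors = predictive_encode(data)
restored = predictive_decode(errors)
```

Because the receiver runs the same predictor on the members it has already rebuilt, only the prediction errors need to be transferred; for slowly varying data these errors are mostly zero or small.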
3.2. Linear Mean Square Error Estimation
When two estimation errors and of the data member of interest are related, one of them can be used to derive the other. That is, can be estimated based on the knowledge of . An estimate of is also likely to have an estimation error, making it not quite the same as the actual estimation error . The MDC finds the best estimate of the data member estimation error using the minimum mean square error (MMSE) criterion, . This is expressed in
In general, , the best estimate of the data member estimation error , is a nonlinear function of . Assuming that , can be restricted to a linear function of , as represented in
is computed using
is computed using
As shown in (13), is . Consider
In (13), and . With this, is rewritten as
Equation (14) is in line with the orthogonality principle. is orthogonal (“not related”) to when the mean square error (MSE) is the minimum.
By the orthogonality theorem, (15) computes the MMSE. Consider
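The linear MMSE estimate and the orthogonality principle described in this subsection can be checked numerically. The sketch below is illustrative only and uses assumed notation that is not taken from the original: `x` is the zero-mean observed quantity, `y` is the quantity to be estimated, and the estimate is `a * x` with `a = E[xy] / E[x^2]`. The sample data are synthetic.

```python
import random

random.seed(0)
n = 10_000
# Hypothetical zero-mean data: y depends linearly on x plus independent noise.
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [0.8 * xi + random.gauss(0.0, 0.5) for xi in x]


def mean(values):
    return sum(values) / len(values)


r_xx = mean([xi * xi for xi in x])              # sample E[x^2]
r_xy = mean([xi * yi for xi, yi in zip(x, y)])  # sample E[xy]
r_yy = mean([yi * yi for yi in y])              # sample E[y^2]

a = r_xy / r_xx                                  # coefficient minimizing the MSE
err = [yi - a * xi for xi, yi in zip(x, y)]      # estimation error y - a*x

orthogonality = mean([e * xi for e, xi in zip(err, x)])  # sample E[err * x]
mse_direct = mean([e * e for e in err])                  # sample E[err^2]
mse_formula = r_yy - a * r_xy                            # MMSE via orthogonality
```

With this choice of `a`, the sample correlation between the error and the observed data vanishes up to rounding, and the directly computed mean square error agrees with `r_yy - a * r_xy`.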
3.3. Estimation of the Data Members
If an estimate of the data member of interest, , is related with other estimates of the data member, , then can be estimated via the linear summation of . This is represented in
As expressed in (17), is computed using . Consider
takes the derivative of with respect to , as represented in
is rewritten in (20) by rearranging the differential calculus and the average. Consider
Equation (20) is rewritten in
By differentiating with regard to and equating the result to zero, equations are obtained. Constant can be obtained using the inverse matrix, as shown in
In (25), is 0 (). is rewritten in
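The estimation from several related quantities can be illustrated in the same way. The sketch below assumes the standard normal-equation form of the linear MMSE problem, with hypothetical notation: `x0` is the quantity to be estimated, `x1` and `x2` are the related zero-mean quantities, `R[i][j]` is the sample mean of `xi * xj`, and `r[i]` is the sample mean of `x0 * xi`. The coefficients are obtained by solving `R a = r`, which plays the role of the matrix inversion mentioned above. The helper `solve` and the synthetic data are assumptions for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def solve(matrix, rhs):
    """Solve matrix * a = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(n):
            if i != col:
                factor = aug[i][col] / aug[col][col]
                aug[i] = [v - factor * p for v, p in zip(aug[i], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


random.seed(1)
samples = 20_000
x1 = [random.gauss(0.0, 1.0) for _ in range(samples)]
x2 = [random.gauss(0.0, 1.0) for _ in range(samples)]
# Hypothetical target: a linear mix of x1 and x2 plus independent noise.
x0 = [0.6 * u - 0.3 * v + random.gauss(0.0, 0.2) for u, v in zip(x1, x2)]

obs = [x1, x2]
R = [[mean([p * q for p, q in zip(xi, xj)]) for xj in obs] for xi in obs]
r = [mean([t * p for t, p in zip(x0, xi)]) for xi in obs]

coeffs = solve(R, r)  # coefficients of the linear estimate
err = [t - sum(c * xi[k] for c, xi in zip(coeffs, obs)) for k, t in enumerate(x0)]
cross = [mean([e * p for e, p in zip(err, xi)]) for xi in obs]  # error vs. each input
mmse = mean([e * e for e in err])
```

Each entry of `cross` is zero up to rounding, which is the multi-variable form of the orthogonality condition, and `mmse` approaches the variance of the added noise.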
4.1. Performance Measures
In the experiments, MATLAB was used to evaluate the proposed MDC scheme with , the MMSE of data member estimations. For simplification, it was assumed that the data member has the same attribute function. With this assumption, all data members have the same size and the receiver needs the same storage space for . If an estimate of the data member is , the difference with an estimation error, , is transmitted.
The performance of the MDC was evaluated in terms of data member prediction accuracy measured using , the probability of mean squared data prediction error. To evaluate the accuracy of the received difference information, two random nodes and share the MSEs associated with the data differences. Here, the difference can be between two consecutive data members, denoted as (), and between as many as (). is calculated using
denotes the predicted location of the data member used in the MSE. denotes the number of data differences used in the MSE. denotes the size of :
In the equations above, denotes the probability of MSE of . chooses of that satisfies . Using , is produced. of is that obtains the actual data member and satisfies .
To measure the performance of the proposed MDC scheme, , , and were used.
4.2. Experiment Results
Figure 2 shows the data accuracy in terms of where is 0.5 and varies from 50,000, 100,000, 250,000, up to 500,000. Compared to the previous scheme that sends the data member itself, the MDC decreases the amount of power necessary for the data to be sent from the server to the receiver by 28%. As the number of data members increases, the throughput and storage space efficiency of the MDC increase.
Figure 3 shows of the data members that are randomly selected from the networks of different sizes. The results show that increases as the network size increases. These results were obtained by analyzing many different , , and values. In Figure 3, is proportional to .
Figure 4 presents the delays in obtaining the data at the receiver based on the probability of prediction errors. Compared to the previous scheme in which the data is sent directly by the server to the receiver, the MDC decreases the delay by 28.6%. This indicates that transferring the data difference instead of the data itself contributes to providing seamless big data services to the end users.
Figure 5 shows the service delays with regard to the number of hops between source (the content server) and destination (the user). In comparison with the previous scheme, the proposed MDC has a constant service delay irrespective of the number of network hops. The MDC reduces the amount of data to be transferred by delivering only the difference of data sequences, and thus it is less affected by different network conditions.
This paper presented the MDC scheme that improves throughput by transferring the difference of the consecutive data sequences, instead of the data itself, in heterogeneous big data environments. The MDC minimizes the data transfer time between the content server and the user so that the user who moves across different networks can receive big data services without interruption. As it sends only the difference of the consecutive data members, high throughput is achieved irrespective of the types, functionalities, and characteristics of the data.
In comparison with the previous scheme, the MDC was observed to decrease the amount of power required to communicate the data by 28%. This processing efficiency increases as the number of data members to be delivered increases, which leads to increased storage space efficiency. The MDC decreases the delay in obtaining the data at the receiving node by 28.6%. In addition, the MDC maintains a constant service delay irrespective of the number of network hops. In the future, the performance of the MDC will be examined in an environment where two or more networks interact, and its resilience against various security attacks will be studied.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
X. Artigas, J. Ascenso, M. Dalai, S. Klomp, D. Kubasov, and M. Ouaret, “The discover codec: architecture, techniques and evaluation,” in Proceedings of the 26th Picture Coding Symposium (PCS '07), pp. 1–4, November 2007.
J. Ascenso and F. Pereira, “Adaptive hash-based side information exploitation for efficient Wyner-Ziv video coding,” in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), pp. III29–III32, IEEE, San Antonio, Tex, USA, September 2007.
G. Bjontegaard, “Calculation of average PSNR differences between RD-curves,” in Proceedings of the 13th Meeting of Video Coding Experts Group (VCEG '01), Austin, Tex, USA, April 2001.