Advances in Mechanical Engineering
Volume 2013 (2013), Article ID 590234, 8 pages
Continuous Distributed Top-k Monitoring over High-Speed Rail Data Stream in Cloud Computing Environment
1State Key Laboratory of Rail Traffic Control and Safety, Beijing JiaoTong University, Beijing 100044, China
2Astronautics Standards Institute, Beijing 100071, China
3Chongqing Public Security Bureau, Chongqing 401147, China
Received 18 September 2013; Accepted 30 October 2013
Academic Editor: Fenyuan Wang
Copyright © 2013 Hanning Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In a cloud computing environment, real-time mass data about high-speed rail, collected through intensive monitoring by large-scale sensing equipment, provide strong support for the safety and maintenance of high-speed rail. In this paper, we focus on continuous distributed Top-k monitoring over multisource distributed data streams for high-speed rail monitoring. Specifically, we formalize the Top-k monitoring model of high-speed rail and propose DTMQ, a Top-k monitoring algorithm that supports arbitrary continuous and strictly monotone aggregation functions. The validity of DTMQ is demonstrated by extensive experiments.
1. Introduction
High-speed rail is the dominant mode of overland transportation in the 21st century. As the largest country in Asia, China has built the longest high-speed rail network, and more lines are under construction. Because high-speed rail combines many advanced railway technologies in an extremely complex system, its safety and capacity maintenance directly determine its service quality. As an important component of the high-speed rail information infrastructure, an intelligent environmental monitoring system based on cloud computing [1–3] can transmit real-time, time-varying, continuous, and massive data streams [4, 5] over wireless sensor networks and a high-speed network connected to the data center. Therefore, it is very important to study real-time continuous query algorithms for the massive data streams of high-speed rail.
From the application perspective, high-speed rail sensor equipment can generally be divided into two categories: real-time visibility and fault detection/prediction. Real-time visibility is the basis of many advanced applications; it can enhance railway network capacity and improve fuel efficiency and asset utilization. This application requires regular measurement of various operating parameters associated with high-speed rail or passenger flow, so as to improve the overall operation of high-speed rail [9–11]. Fault detection, on the other hand, monitors components of the high-speed rail (such as wheels and carriages) in order to avoid catastrophic events (such as derailment). Corrective action can be taken before an accident occurs when the key parts of high-speed trains, the rail, and the relevant geographic area are monitored by sensors and intelligent devices [12, 13]. As shown in Figure 1, the sensing equipment installed on the railway and on trains monitors the railway environment and train conditions in real time.
A Top-k query finds the k results with the highest values according to the ranking of a user-defined function. In monitoring high-speed rail data flows with massive data transmission, Top-k is a very important query algorithm, which returns the k most important results according to the sorting function in the potential data space. In processing a Top-k query, the system usually calculates the properties of each object, then uses a monotone sorting function to aggregate the multiple property values into the weight of the object, and finally returns the k objects with the highest weights as the query result. In addition, for high-speed rail, real-time performance is another important indicator for evaluating a Top-k query algorithm.
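The query processing described above can be illustrated with a minimal Python sketch. The object names, the property values, and the choice of `sum` as the monotone sorting function are illustrative assumptions and are not taken from the monitoring system itself.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple


def top_k(objects: Dict[str, Sequence[float]],
          score: Callable[[Sequence[float]], float],
          k: int) -> List[Tuple[str, float]]:
    """Return the k objects with the highest aggregated weight, best first."""
    # Aggregate the property values of each object into a single weight.
    weighted = [(name, score(values)) for name, values in objects.items()]
    # Keep only the k largest weights.
    return heapq.nlargest(k, weighted, key=lambda pair: pair[1])


# Hypothetical readings: three property values per monitored object.
readings = {
    "axle_17": [4, 9, 7],
    "axle_03": [2, 1, 3],
    "rail_88": [8, 6, 9],
    "rail_12": [5, 5, 5],
}

result = top_k(readings, sum, k=2)
print(result)  # [('rail_88', 23), ('axle_17', 20)]
```

Any other monotone function of the property values could be passed as `score` without changing the rest of the sketch.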
The processing architecture of the stratified high-speed rail monitoring data stream is shown in Figure 2. In distributed systems with multiple data streams, the basic method of realizing a Top-k query is to send the data streams from all nodes to a central node, which performs the Top-k calculation. However, the requirements of this approach often exceed the processing capacity of the monitoring system, and it also introduces time delay. Earlier work discussed how to realize effective online Top-k monitoring of data streams in distributed environments. In that approach, constraints on the monitored objects are established at the remote data sources. Most of the time, the global Top-k set is the same as the Top-k set on each monitoring node. Communication among the nodes is triggered only when the constraints are broken, which effectively reduces the network communication of online Top-k monitoring. However, that method needs to deliver the constraint-violation information for the entire Top-k set. A later paper presented a more efficient distributed Top-k algorithm, MR, which significantly reduces the communication cost compared with that method. Furthermore, the communication cost of MR is independent of the value of k. However, both methods support only ascending functions as the sorting function, whereas a user-specified sorting function may be arbitrary in practical applications.
This paper presents the DTMQ (Distributed Top-k Monitoring Query) algorithm, which supports distributed Top-k monitoring with arbitrary continuous and strictly monotone aggregation functions. DTMQ maintains the Top-k result set by establishing constraints on the remote data streams. Communication among nodes is needed only when the constraints are broken; it involves only some of the nodes and is independent of k. The efficiency of DTMQ has been evaluated on synthetic data following the normal distribution and the Zipf distribution. Experiments show that the transmission volume of DTMQ is at least one order of magnitude lower than that of other algorithms. The main contributions of this paper are as follows.
(1) It presents a formal model of Top-k monitoring over multiple high-speed rail data streams in the cloud computing environment.
(2) It proposes DTMQ, which supports Top-k monitoring with arbitrary continuous and strictly monotone aggregation functions.
(3) It verifies the effectiveness of DTMQ through extensive experiments.
Sections 2, 3, and 4 introduce the related work and the formal description of the problem. Section 5 explains the design and optimization strategy of the DTMQ algorithm. Section 6 compares the experimental results, and Section 7 summarizes the research.
2. Relevant Works
The best-known Top-k query algorithm is the threshold algorithm (TA), first proposed by Fagin et al. [18]. TA is efficient and is usually used with a single data source, but it is not suitable for distributed systems. On the basis of TA, researchers have developed many Top-k query processing algorithms for distributed systems. However, these algorithms all operate on “snapshots”, which means that they cannot support continuous queries over data streams.
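For readers unfamiliar with TA, the following Python sketch outlines its sorted-access and random-access phases on in-memory lists. It is a simplified illustration: it assumes that every object appears in every list and that the aggregation function is monotone, and the function name `threshold_algorithm` and the sample scores are hypothetical.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple


def threshold_algorithm(lists: List[Dict[str, float]],
                        agg: Callable[[Sequence[float]], float],
                        k: int) -> Tuple[List[Tuple[str, float]], int]:
    """Return the top-k (object, score) pairs and the number of rows read."""
    # Sorted access: each list is read in descending order of local score.
    ordered = [sorted(l.items(), key=lambda kv: kv[1], reverse=True)
               for l in lists]
    scores: Dict[str, float] = {}
    depth = len(ordered[0])
    for row in range(depth):
        frontier = []
        for column in ordered:
            obj, value = column[row]
            frontier.append(value)
            if obj not in scores:
                # Random access: fetch the object's score from every list.
                scores[obj] = agg([l[obj] for l in lists])
        # No unseen object can score higher than the aggregated frontier.
        threshold = agg(frontier)
        best = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        if len(best) == k and best[-1][1] >= threshold:
            return best, row + 1
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1]), depth


list_1 = {"a": 9, "b": 8, "c": 3, "d": 1}
list_2 = {"a": 7, "b": 9, "c": 2, "d": 4}
best, rows_read = threshold_algorithm([list_1, list_2], sum, k=1)
print(best, rows_read)  # [('b', 17)] 2
```

In this example the algorithm stops after reading two of the four rows, because the best score seen so far already reaches the threshold.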
Earlier work discussed how to realize effective online Top-k monitoring of data streams in distributed environments. Constraints on the monitored objects are established at the remote data sources. In most cases, the global Top-k set is the same as the Top-k set on each monitoring node. Communication among the nodes is triggered only when the constraints are broken. In this manner, the network communication of online Top-k monitoring is effectively reduced. However, the method needs to deliver the constraint-violation information for the entire Top-k set, which still causes significant communication cost.
Kawashima et al. [21] first proposed the sliding-window-oriented query processing method. Considering many existing Top-k query techniques, they developed a Top-k query framework that provides a compactible set for Top-k queries. This set can not only calculate Top-k query results but also support incremental maintenance. Leung et al. and Mouratidis et al. also proposed the SCSQ buffer strategy, which compresses the compactible set to effectively reduce space and time complexity [22, 23]. Wu et al. and Yang et al. [24, 25] addressed Top-k queries over incomplete data streams by generating multiple instances of the same object and applying an effective data cleaning method. In addition, Chu et al. [26] introduced the concept of Skyline analysis to obtain an accurate Top-k Skyline query algorithm and an approximate query algorithm.
Previous work has also discussed how to monitor the sensors with the highest values in sensor networks, but it did not address the properties of the aggregation functions in Top-k monitoring. Top-k monitoring on a single data stream with multiple aggregation functions has been studied in [28–31]; these studies focused on saving CPU and memory resources on each node but did not consider reducing the cost of network communication in distributed environments.
In practical applications, a user-specified sorting function may be arbitrary. To deal with arbitrary sorting functions, this paper presents an algorithm called DTMQ, which supports distributed Top-k monitoring with arbitrary continuous and strictly monotone aggregation functions. Furthermore, the communication cost of this new algorithm is proved to be independent of k.
3. Terms and Definition
A wireless sensor network (WSN) consists of spatially distributed autonomous sensors to monitor physical or environmental conditions, such as temperature, sound, and pressure, and to cooperatively pass their data through the network to a main location. Therefore, data should also be processed by sensor networks in a distributed fashion. Usual techniques operate by forwarding and concentrating the entire data in a central server, processing it as a multivariate stream. Raw sensor data is shown in Figure 3.
A distributed Top-k monitoring model is presented in this section. In high-speed rail environments with many safety monitoring nodes, the Top-k values among the data streams should be identified according to the user-defined sorting functions. The high-speed rail safety monitoring environment contains a large-scale monitoring system with sensing nodes, a central coordinating node , and remote monitoring nodes .
Definition 1. Given a set of remote monitoring nodes , and the current monitoring object set , objects to be monitored are denoted as . Let be the monitoring value of by .
Definition 2. Let be the aggregation values of in the central coordinating node , then we have , where is a user-specified sorting function.
To simplify the problem, this paper only considers the situations with a continuous and strictly ascending function where each node only needs to monitor—, and those situations will become as same as the situations with ascending . Monitoring data and aggregation data are both assumed to be real, while . It needs to be mention that not every object in can be monitored by nodes. When object cannot be monitored by node , let , which does not affect the universality of Definition 1.
Definition 3. Distributed Top-k monitoring continuously searches for a set which satisfies: and , , (), where is the set size of a given monitoring result set Top-k.
In practical systems, when changes, the systems need to determine the new Top-k result set . During this process, the systems continue to set as the current collection. Although this setting is not consistent with , the error is negligible because the required processing period is very short.
4. Problem Descriptions
Constraints are discussed in this section. A Basic Distributed Top-k Monitoring (BDTM) approach, which supports any sorting function, transmits all data streams to the central node to calculate the Top-k result set. However, the volume of data transmitted in this process is huge. To improve communication efficiency, this paper establishes constraints on the remote data streams, through which the Top-k set can be maintained. The basic idea of the constraints is to ensure that the globally optimal objects are also the optimal objects on each node. Communication is needed only when the constraints are broken.
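As a point of reference, the centralized baseline can be sketched as follows. This is an illustrative Python model and not the implementation used in the experiments: the class name `CentralCoordinator`, the default value of 0 for an object that a node has not yet reported, and the sample updates are all assumptions. The `messages` counter makes the cost visible, since every local update costs one transmission even when the result does not change.

```python
import heapq
from typing import Callable, Dict, List, Sequence


class CentralCoordinator:
    """Baseline coordinator: every node forwards every update it observes."""

    def __init__(self, n_nodes: int,
                 agg: Callable[[Sequence[float]], float], k: int) -> None:
        self.n_nodes = n_nodes
        self.agg = agg
        self.k = k
        self.values: Dict[str, List[float]] = {}  # object -> per-node values
        self.messages = 0                          # transmissions received

    def receive(self, node: int, obj: str, value: float) -> None:
        """Record one update sent by a monitoring node."""
        self.messages += 1
        # Assumption: an object not yet reported by a node counts as 0 there.
        row = self.values.setdefault(obj, [0.0] * self.n_nodes)
        row[node] = value

    def top_k(self) -> List[str]:
        """Recompute the result set from all stored values."""
        ranked = heapq.nlargest(self.k, self.values.items(),
                                key=lambda item: self.agg(item[1]))
        return [obj for obj, _ in ranked]


coordinator = CentralCoordinator(n_nodes=2, agg=sum, k=1)
updates = [(0, "a", 1.0), (0, "b", 5.0), (1, "a", 10.0),
           (1, "b", 2.0), (0, "b", 5.5)]
for node, obj, value in updates:
    coordinator.receive(node, obj, value)

print(coordinator.top_k(), coordinator.messages)  # ['a'] 5
```

The last update changes the value of `b` without changing the result set, yet it is still transmitted; avoiding such transmissions is what the constraint-based approach aims at.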
For any between 1 and , let be the adjustment factor with respect to the object in object set . This factor is used to adjust the weights of on both observing node and central coordinating node . From Definition 3, we have and .
Theorem 4. For any Top-k result set, there exist adjustment factors satisfying the two constraints below. Conversely, any set satisfying the two constraints is a Top-k result set.
Constraint 1. , there exists .
Constraint 2. , , , there exists .
Proof. We will first prove that there exists an adjustment factor satisfying the above constraints for any Top-k result set. Because is a strictly ascending function, it can be seen that which makes for and which makes () for . Then, the adjustment factor obtained in this manner satisfies the two constraints.
Next, we will prove that the set satisfying the constraints is a Top-k result set. From Constraint 2, , and , there exists . Because is a continuous and strictly ascending function; that is, for any , , we have . Therefore, from Constraint 1, and , From Definition 3, the above set is a Top-k result set.
This completes the proof.
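The following Python sketch checks the two constraints for a candidate set, assuming that they take the form suggested by the proof above: Constraint 1 requires that the adjustment factors leave every aggregation value unchanged, and Constraint 2 requires that, on every node, the adjusted value of each object in the candidate set is no smaller than the adjusted value of any object outside it. The additive form of the adjustment, the function name `satisfies_constraints`, and all sample numbers are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Set


def satisfies_constraints(values: List[Dict[str, float]],
                          delta: List[Dict[str, float]],
                          f: Callable[[Sequence[float]], float],
                          top: Set[str],
                          tol: float = 1e-9) -> bool:
    """Check the two (assumed) constraints for the candidate set `top`.

    values[j][o] is the value of object o observed on node j, and
    delta[j][o] is the adjustment factor of object o on node j.
    """
    nodes = range(len(values))
    objects = list(values[0])
    adjusted = [{o: values[j][o] + delta[j][o] for o in objects}
                for j in nodes]

    # Constraint 1 (assumed form): adjustments keep each aggregate unchanged.
    for o in objects:
        true_score = f([values[j][o] for j in nodes])
        adjusted_score = f([adjusted[j][o] for j in nodes])
        if abs(true_score - adjusted_score) > tol:
            return False

    # Constraint 2 (assumed form): on each node, every object in `top`
    # has an adjusted value at least as large as every object outside it.
    for j in nodes:
        for a in top:
            for b in objects:
                if b not in top and adjusted[j][a] < adjusted[j][b]:
                    return False
    return True


values = [{"a": 1.0, "b": 5.0}, {"a": 10.0, "b": 2.0}]
no_adjustment = [{"a": 0.0, "b": 0.0}, {"a": 0.0, "b": 0.0}]
balanced = [{"a": 5.0, "b": 0.0}, {"a": -5.0, "b": 0.0}]

ok_before = satisfies_constraints(values, no_adjustment, sum, {"a"})
ok_after = satisfies_constraints(values, balanced, sum, {"a"})
print(ok_before, ok_after)  # False True
```

Object `a` has the larger aggregate (11 against 7) but is locally smaller than `b` on the first node. Shifting 5 units of `a` from the second node to the first leaves its aggregate unchanged and makes `a` the local leader on both nodes.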
5. Distributed Top-k Monitoring
This section presents a distributed Top-k monitoring algorithm, DTMQ (Distributed Top-k Monitoring Query), with generally minimum refresh, which supports any continuous and strictly monotone aggregation function. By Theorem 4, Top-k monitoring only needs to maintain a set satisfying the two constraints.
During initialization, the initial Top-k result set can be obtained by the TA algorithm, and then the Allocation function (see Algorithm 5) sets the adjustment factors so that both constraints are satisfied.
In the monitoring process, when the constraints in monitoring node are broken, let be the set of objects breaking the constraints in the monitoring node , which is called the conflict set. Let be the down conflict set and be the up conflict set. Let be the set of objects with higher aggregation values in , that is, and , , . Literature  has proven to be a Top-k result set, and that the minimum data transmitted by must contain and the observation values of in node when reconstructing constraints. The two boundary values are
The description of DTMQ algorithm is as follows.
Algorithm 5 (DTMQ). One has the following.
Step 1. sends a constraints-rebuilding request to , which contains the conflict set and its monitoring values in node , and the two boundary values and .
Step 2. When obtains the data from node , which includes the conflict set , the monitoring values of the objects in in node , and the two boundary values ( and ), it calculates the aggregation value of the objects in . It then computes the set and applies the Allocation function to redistribute the adjustment factors. Finally, it updates the Top-k result set, which is , and transmits it to all other nodes.
Function Allocation. Input is . Output is .
(1) Calculate the aggregation value of each object in conflict set :
(2) Compute the interpolation factor of each object in conflict set , to make
(3) Compute the adjustment factor of each object in node in conflict set :
Because is a continuous and strictly ascending function, then we have that , satisfying line 2 in Allocation can be obtained by analytical or iterative methods. An iterative method only needs to calculate an approximate solution. The difference between (the value of when has been used) and is recorded by the central coordinating node. This paper assumes to be an exact solution.
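One simple iterative method is bisection, which applies to any continuous and strictly ascending function. The Python sketch below is illustrative only: the helper name `solve_ascending`, the search interval, and the example function are assumptions, and the sketch does not reproduce the formulas of the Allocation function.

```python
from typing import Callable


def solve_ascending(g: Callable[[float], float], target: float,
                    lo: float, hi: float, tol: float = 1e-9) -> float:
    """Find x in [lo, hi] with g(x) close to target.

    g must be continuous and strictly ascending on [lo, hi].
    """
    if not g(lo) <= target <= g(hi):
        raise ValueError("target is not bracketed by g(lo) and g(hi)")
    # Halve the interval until it is narrower than the tolerance.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical example: a mixed addition/multiplication aggregate in which
# both nodes are given the same adjusted value x, so that g(x) = x * x + x.
root = solve_ascending(lambda x: x * x + x, target=6.0, lo=0.0, hi=10.0)
print(round(root, 6))  # 2.0
```

Because the function is strictly ascending, the solution is unique, and the approximation error of the returned value is bounded by the tolerance `tol`.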
The validity of the DTMQ algorithm is proved next; that is, the adjustment factors generated by Allocation satisfy the constraints in Section 4. Moreover, the communication cost of DTMQ is proved to be independent of k.
Theorem 6. Adjustment factor generated by Allocation function satisfies the Constraint 1. That is, , there exists .
Proof. Since Allocation function is only involved in adjustment of the objects in , we only need to prove that aggregation values of objects in newly generated keep unchanged. Proof is complete.
Theorem 7. Adjustment factor generated by Allocation function satisfies the Constraint 2. That is, , , , there exists .
Proof. There are two different conditions because .
(1) , , because allocation does not change and adjustment factor of , as well as the constraints are not broken by or , Constraint 2 still holds.
(2) , , from the definition of , we have . Due to the line 2 and line 3 of Allocation function, and satisfies Constraint 2 because is a strictly ascending function.
(3) Let , and . From the definition of and , we have . , , from the definition of , , because is a strictly ascending function, we have , , that is, , , and satisfies Constraint 2.
(4) , , from the definition of , , because is a strictly ascending function, we have , , that is, , , and satisfies Constraint 2.
In all, , , and satisfies Constraint 2.
Theorem 8. The communication cost of DTMQ is independent of k.
Proof. DTMQ communication is needed only when the constraints are broken. In the process of rebuilding constraints, DTMQ only needs to obtain and two boundary values from each monitoring node, and transmit the new adjustment factor in to other monitoring nodes. Only contains conflict objects, while the size of is unrelated to k. Therefore, the communication cost of DTMQ is independent of k. This completes the proof.
6. Experimental Performance
To verify the effectiveness of the proposed algorithm, simulation experiments were conducted. In the experiments, DTMQ was compared with the BDTM approach described in Section 4, and performance was evaluated by communication cost.
Synthetic data were used in the experiments. Two datasets containing 50000 bytes were constructed, which followed the Normal distribution and the Zipf distribution (factor 0.8), respectively.
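Synthetic streams of this kind can be generated as in the following Python sketch. Only the Zipf factor of 0.8 and the size of 50000 come from the description above; the mean and standard deviation of the normal stream, the number of distinct Zipf values, the random seed, and the function names are assumptions chosen for illustration.

```python
import random
from collections import Counter
from typing import List


def normal_stream(n: int, mean: float, std: float,
                  rng: random.Random) -> List[float]:
    """Draw n values from a normal distribution."""
    return [rng.gauss(mean, std) for _ in range(n)]


def zipf_stream(n: int, n_values: int, s: float,
                rng: random.Random) -> List[int]:
    """Draw n ranks from a finite Zipf distribution with exponent s.

    Rank r (1 <= r <= n_values) is drawn with probability
    proportional to 1 / r**s.
    """
    ranks = list(range(1, n_values + 1))
    weights = [1.0 / rank ** s for rank in ranks]
    return rng.choices(ranks, weights=weights, k=n)


rng = random.Random(2013)  # fixed seed so that runs are repeatable
normal_data = normal_stream(50000, mean=100.0, std=15.0, rng=rng)
zipf_data = zipf_stream(50000, n_values=1000, s=0.8, rng=rng)

# The lowest rank should be the most frequent value in a Zipf stream.
most_common_rank = Counter(zipf_data).most_common(1)[0][0]
print(len(normal_data), len(zipf_data), most_common_rank)
```

A finite set of ranks is used because an exponent of 0.8 is below 1, so the corresponding distribution over an unbounded set of ranks cannot be normalized.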
The experiments only consider common addition, multiplication, and mixed operations. The following three kinds of functions were used as the aggregation function:
(1) , where denotes the number of requests received in a sliding window of 1 hour. denotes the situation with the aggregation function being multiplication.
(2) , stands for the situation with the aggregation function being a mixed operation of addition and multiplication.
(3) . represents the situation with the aggregation function being addition of mean values.
The experiments were run on an Intel i5 2.5 GHz CPU with 2 GB DDR memory and the CentOS operating system. The simulation test program was developed in Java. The experimental results are shown in Figure 4.
Figures 5 and 6 show the test results on the normally distributed dataset and the Zipf distributed dataset, respectively. As Figures 5 and 6 show, the transmission volume of DTMQ is lower than that of BDTM by at least one order of magnitude (the ordinate is logarithmic). This is because DTMQ communication among the nodes occurs only when the constraints are broken and involves only a few related nodes, whereas BDTM requires all the data to be transmitted to the central coordinating node.
This paper also verifies the performance of DTMQ under different k values. Figures 7, 8, and 9 show the experimental results for the aggregation functions , , on the normally distributed dataset. The performance on the Zipf dataset is similar and is omitted to save space. As Figures 7–9 show, the communication cost of DTMQ is independent of k; in other words, it does not increase as k becomes larger.
7. Conclusions
This paper studied how to realize general and efficient distributed Top-k monitoring that continuously monitors the Top-k values among massive data streams in high-speed rail monitoring. User-specified sorting functions in practical applications may be arbitrary. We propose a general distributed Top-k monitoring algorithm, DTMQ, which supports arbitrary continuous and strictly monotone aggregation functions. DTMQ maintains the Top-k result set based on constraints established on the remote data streams. Communication among the nodes occurs only when the constraints are broken and involves only a few related nodes. The communication cost is independent of k. The efficiency of DTMQ has been demonstrated using practical data and simulated data. Experimental results show that the transmission volume of DTMQ is lower than that of other algorithms by at least one order of magnitude. Future research will focus on developing a methodology for dividing the sliding window to conduct parallel Top-k queries and on decreasing the impact of high-speed rail uncertainty.
Conflict of Interests
The authors declare that they have no financial and personal relationships with other people or organizations that can inappropriately influence their work and there is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this paper.
Acknowledgments
This study was supported by the National Natural Science Foundation of China (Grant no.: 61272029), National Key Technology R&D Program (Grant no.: 2009BAG12A10), independent subject of State Key Laboratory of Rail Traffic Control and Safety, Beijing JiaoTong University (Contract no.: RCS2009ZT007), and it is partially supported by the MOE key Laboratory for Transportation Complex Systems Theory and Technology School of Traffic and Transportation, Beijing JiaoTong University.
References
- M. Armbrust, A. Fox, R. Griffith, et al., “Above the clouds: a Berkeley view of cloud computing,” Tech. Rep. UCB/EECS-2009-28, RAD Laboratory, Berkeley University of California, 2009.
- S. Marston, Z. Li, S. Bandyopadhyay, J. Zhang, and A. Ghalsasi, “Cloud computing—the business perspective,” Decision Support Systems, vol. 51, no. 1, pp. 176–189, 2011.
- P. Mell and T. Grance, “The NIST definition of cloud computing,” NIST Special Publication, vol. 800, no. 145, 7 pages, 2011.
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and issues in data stream systems,” in Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '02), pp. 1–16, New York, NY, USA, June 2002.
- K. Y. Qi, Y. B. Han, Z. F. Zhao, et al., “Real-time data stream processing and key techniques oriented to large-scale sensor data,” Computer Integrated Manufacturing Systems, vol. 19, no. 3, pp. 641–653, 2013.
- J. Stankovic, “Wireless sensor networks,” in Handbook of Real-Time and Embedded Systems, CRC Press, Boca Raton, Fla, USA, 2007.
- A. Arasu, B. Babcock, S. Babu, J. McAlister, and J. Widom, “Characterizing memory requirements for queries over continuous data streams,” in Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '02), pp. 221–232, New York, NY, USA, June 2002, http://infolab.stanford.edu/~arvind/papers/bm-pods02.pdf.
- C. Hartung, R. Han, C. Seielstad, and S. Holbrook, “Fire WxNet: a multi-tiered portable wireless system for monitoring weather conditions in wildland fire environments,” in Proceedings of the 4th International Conference on Mobile Systems, Applications and Services, pp. 28–41, June 2006.
- J. Chaolong, X. Weixiang, W. Futian, and W. Hanning, “Track irregularity time series analysis and trend forecasting,” Discrete Dynamics in Nature and Society, vol. 2012, Article ID 387857, 15 pages, 2012.
- H. Wang, W. Xu, F. Wang, and C. Jia, “A cloud-computing-based data placement strategy in high-speed railway,” Applied Mechanics and Materials, vol. 2012, Article ID 396387, 15 pages, 2012.
- J. Chaolong, X. Weixiang, W. Lili, and W. Hanning, “Study of railway track irregularity standard deviation time series based on data mining and linear model,” Mathematical Problems in Engineering, vol. 2013, Article ID 486738, 12 pages, 2013.
- H. Guo, W. Wang, W. Guo, X. Jiang, and H. Bubb, “Reliability analysis of pedestrian safety crossing in urban traffic environment,” Safety Science, vol. 50, no. 4, pp. 968–973, 2012.
- W. Wang, Y. Mao, J. Jin et al., “Driver's various information process and multi-ruled decision-making mechanism: a fundamental of intelligent driving shaping model,” International Journal of Computational Intelligence Systems, vol. 4, no. 3, pp. 297–305, 2011.
- N. Bruno, L. Gravano, and A. Marian, “Evaluating top-k queries over web-accessible databases,” in Proceedings of the 18th International Conference on Data Engineering (ICDE '02), pp. 369–380, March 2002.
- B. Babcock and C. Olston, “Distributed top-K monitoring,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '03), pp. 28–39, June 2003.
- B. Babcock and C. Olston, “Distributed top-k monitoring,” Tech. Rep., Computer Science Department, Stanford University, Stanford, Calif, USA, 2002, http://infolab.stanford.edu/~olston/publications/topk.html.
- B. Deng, Y. Jia, and S. Q. Yang, “Supporting efficient distributed top-k monitoring,” in Proceedings of the 7th International Conference on Web-Age Information Management (WAIM '06), Lecture Notes in Computer Science, pp. 496–507, Springer, June 2006.
- R. Fagin, A. Lotem, and M. Naor, “Optimal aggregation algorithms for middleware,” in Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '01), pp. 102–113, May 2001.
- Y. Zhang, W. Zhang, X. Lin, B. Jiang, and J. Pei, “Ranking uncertain sky: the probabilistic top-k skyline operator,” Information Systems, vol. 36, no. 5, pp. 898–915, 2011.
- C. Jin, K. Yi, L. Chen, J. X. Yu, and X. Lin, “Sliding-window top-K queries on uncertain streams,” The International Journal on Very Large Data Bases, vol. 19, no. 3, pp. 411–435, 2010.
- H. Kawashima, H. Kitagawa, and X. Li, “Complex event processing over uncertain data streams,” in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 521–526, 2010.
- C. K. S. Leung, B. Hao, and F. Jiang, “Constrained frequent itemset mining from uncertain data streams,” in Proceedings of the IEEE 26th International Conference on Data Engineering Workshops (ICDEW '10), pp. 120–127, Long Beach, Calif, USA, March 2010.
- K. Mouratidis, S. Bakiras, and D. Papadias, “Continuous monitoring of top-k queries over sliding windows,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '06), pp. 635–646, June 2006.
- M. Wu, J. Xu, X. Tang, and W. C. Lee, “Monitoring top-k query in wireless sensor networks,” in Proceedings of the 22nd International Conference on Data Engineering (ICDE '06), April 2006.
- H. C. Yang, A. Dasdan, R. L. Hsiao, and D. S. Parker, “Map-reduce-merge: simplified relational data processing on large clusters,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '07), pp. 1029–1040, June 2007.
- C. T. Chu, S. K. Kim, Y. A. Lin et al., Map-Reduce for Machine Learning on Multicore, Advances in Neural Information Processing Systems, MIT Press, Cambridge, Mass, USA, 2007.
- W. Liang, B. Chen, and J. X. Yu, “Top-k query evaluation in sensor networks under query response time constraint,” Information Sciences, vol. 181, no. 4, pp. 869–882, 2011.
- X. Han, J. Li, and D. Yang, “Supporting early pruning in top-k query processing on massive data,” Information Processing Letters, vol. 111, no. 11, pp. 524–532, 2011.
- E. Cohen, N. Grossaug, and H. Kaplan, “Processing top-k queries from samples,” Computer Networks, vol. 52, no. 14, pp. 2605–2622, 2008.
- R. Akbarinia, E. Pacitti, and P. Valduriez, “Best position algorithms for efficient top-k query processing,” Information Systems, vol. 36, no. 6, pp. 973–989, 2011.
- C. Wang, L. Y. Yuan, and J. H. You, “On the semantics of top-k ranking for objects with uncertain data,” Computers and Mathematics with Applications, vol. 62, no. 7, pp. 2812–2823, 2011.