Abstract

There is a growing awareness that the complexity of managing Big Data is one of the main challenges in the developing field of the Internet of Things (IoT). Complexity arises from several aspects of the Big Data life cycle, such as gathering data, storing them on cloud servers, and cleaning and integrating the data, a process that draws on the latest advances in ontology-related technologies, such as the Extensible Markup Language (XML) and the Resource Description Framework (RDF), as well as from the application of machine learning methods to carry out classifications, predictions, and visualizations. In this review, we present the state of the art of all the aforementioned aspects of Big Data in the context of the Internet of Things, and we discuss the most novel technologies in machine learning, deep learning, and data mining on Big Data. Finally, we point the reader to the state-of-the-art literature for further in-depth studies, and we present the major trends for the future.

1. Introduction

The fast-developing and expanding area known as the Internet of Things (IoT) [1–3] involves extending the Internet beyond such standard devices as computers, smartphones, and tablets to also include the connection of other physical devices and objects. This allows a variety of devices, sensors, etc. to be monitored and controlled, and to interact and communicate via the Internet, opening up an abundance of opportunities for brand new and revolutionary types of services and applications. As a result, we are now witnessing a technological revolution where millions of people are connecting and generating tremendous amounts of data through the increasing use of a wide variety of devices. These include smart devices and any type of wearable connected to the Internet, powering novel connected applications and solutions. The cost of technology has decreased sharply, making it possible for everybody to access the Internet and to gather data and an abundance of real-time information.

One immediate consequence of this revolutionary emergence of novel technological opportunities is the urgent need to develop and adapt other related areas to further enable the development of the IoT field. Thus, new words and expressions have started to emerge, such as Big Data [4, 5], cloud computing [6], and data science. Data science has been defined as a “concept to unify statistics, data analysis, machine learning and their related methods” to “understand and analyse actual phenomena” with data [7, 8], and there is now a strong demand for professional data scientists in a multitude of sectors [9–12].

This article provides a review of IoT-related surveys in order to highlight the opportunities and challenges, as well as the state-of-the-art technologies related to Big Data, with a particular focus on how to address the problems arising from managing the ensuing increased complexity. Since this is such a complex area, we have divided the Big Data procedure into several stages, establishing the most important points in each and highlighting to the reader the most relevant papers related to every stage; each stage is treated in a separate section. In contrast to approaches that offer more general visions, our contribution explicitly indicates the advantages of every stage in the knowledge discovery procedure. The advantage of this proposal is that it enables the reader to understand and analyse the challenges and opportunities of every particular phase.

The remainder of this article is structured as follows: first the next section discusses a set of general approaches to handle the complexity of managing Big Data in the context of the IoT as well as the future trends in the development of these approaches; then a section follows that discusses the knowledge discovering procedure in data gathered from a large number of diverse devices in the context of the IoT; finally, we provide a conclusion that summarises the article and points out major future trends.

2. The Internet of Things and Complexity Handling: Architectures for Big Data

The Internet of Things (IoT) paradigm has brought a great revolution to our society [13–15]. It is a technology that improves our world: it allows us to obtain information about the physical environment around us, and from these data valuable knowledge can be inferred about how the world works. This knowledge enables the deployment of new real-world applications and facilitates smart decisions that improve the quality of life of the citizens of our society. There are many examples of how this novel technology operates. The smart city concept is a representative use case, for whose ecosystem many applications have been developed [16–19].

An important source of complexity within the IoT paradigm comes from the great amount of data collected. In most cases, the data also need to be processed in order to be converted into useful knowledge.

In view of the recent proposals on how to handle the complexity of Big Data, there are three general approaches to carry out the ensuing very intensive data processing: (A) local processing; (B) edge computing; and (C) cloud computing. Figure 1 shows a schematic overview of these approaches, and Table 1 summarises a representative set of ways and aspects of handling the complexity arising from the IoT. Table 1 also provides references to corresponding papers, categorised under the headings of the three general approaches mentioned above. In the following subsections, brief descriptions of each of these approaches are presented, and finally their main future trends are introduced.

2.1. Local Processing

This approach basically consists of processing the data where the data is collected. In this way, no raw data need to be communicated to remote servers. Instead, only the useful and relevant information is centralised to make smart decisions [20, 21]. In addition, deploying the first-level intelligence closer to the sensors produces an increase in the overall energy efficiency and significantly reduces the communication needs of many IoT applications.
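The idea that only useful and relevant information, rather than raw data, leaves the node can be illustrated with a minimal sketch. The threshold, window, and message format below are illustrative assumptions, not a scheme taken from the cited papers.

```python
# Minimal sketch of first-level local processing: the node keeps the raw
# readings to itself and forwards only a compact summary and any alarm.
# The threshold value and message fields are illustrative assumptions.
TEMP_ALARM_C = 40.0  # hypothetical alarm threshold

def process_locally(raw_readings):
    """Reduce a window of raw temperature samples to a compact summary."""
    mean = sum(raw_readings) / len(raw_readings)
    peak = max(raw_readings)
    summary = {"mean_c": round(mean, 2), "peak_c": peak}
    # Only the event flag is reported; the raw samples never leave the node.
    summary["alarm"] = peak >= TEMP_ALARM_C
    return summary

window = [21.3, 21.5, 22.0, 41.2, 21.8]   # raw samples acquired on the node
message = process_locally(window)
# Only `message` (a few bytes) would be communicated upstream, not `window`.
```

Transmitting the summary instead of the full sample window is what reduces the communication needs and improves the energy efficiency mentioned above.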

This approach develops the concept of the ‘smart sensor,’ which was initially defined as a ‘smart transducer’ [22]. A smart sensor is a sensor with computing and communication capabilities that can process the acquired data, make decisions, store information for further use, and perform two-way communications [23]. Smart sensors are becoming integral parts of intelligent systems, and they are indispensable enablers of the IoT paradigm and the corresponding development of advanced applications. A typical example of these developed sensors is the ‘smart wearable.’ This device can acquire several biosignals, process them, show elaborated information to the user, and send the relevant information to, for example, external platforms for medical supervision [23–25]. Other important applications come from the logistics [26] and industrial fields [27]. Indeed, the new computation and communication capabilities of the IoT paradigm allow for the implementation of intelligent manufacturing systems, giving rise to the next generation of industry, the so-called ‘Industry 4.0’ [28].

In these environments, network virtualization plays a significant role in providing flexibility and better manageability to the Internet [29]. It is a way of reducing the complexity of the infrastructure, since network resources can be managed as logical services rather than physical resources. This feature enables the implementation of smart scheduling methods for network usage and the routing of dataflows from IoT applications [30].

In order to properly carry out this resource management, network performance monitoring needs to be performed in effective and efficient ways. However, this remains a challenge for network operators [31], since the active monitoring techniques used to dynamically acquire performance information can introduce overheads in the network [32]. In general, existing methods are hard to use in practice, and further research is needed in this area. Nevertheless, a promising idea to address this challenge consists in reducing the measurement effort by implementing intelligent measurement schemes based on inference techniques that work from partial direct monitoring data [33].

2.2. Edge Computing

Edge computing is a novel paradigm which has spawned great interest recently. It consists of the deployment of storage and computing capabilities at the ‘Edge’ of the Internet. The ‘Edge’ of the Internet can be defined as the portion of the network between sensors or data sources and cloud data centres [34]. The edge computing paradigm aims at deploying computing, storage, and network resources in this portion. The physical proximity of the computing platforms to where the data acquisition happens makes it easier to achieve lower end-to-end latency, high bandwidth, and low jitter to services [35].

There are several ways to implement edge computing that have in turn led to different approaches, such as Fog Computing, Mobile Edge Computing (MEC), and Cloudlet Deployment. Fog Computing consists in using the network devices such as routers, switches, and gateways as Fog Nodes to provide storage and computing resources [36]. In addition, network virtualization has significantly contributed to developing this paradigm by considering the fog devices as virtual network nodes. This trend increases the deployment flexibility of Fog Computing services and their integration with mobile devices and ‘things’ [37]. MEC is a novel paradigm based on deploying cloud computing capabilities in the base stations of the telecom operators [38]. Finally, Cloudlet Deployment consists in the same concept as Cloud Computing, but without the Wide Area Network (WAN) inconveniences. The servers are installed within the local networks where the data sources are connected. These servers are known as cloudlets [39].

Applications suited for edge computing, such as Virtual Reality and gaming applications [40], cannot tolerate high latency or its unpredictability, and low, predictable latency is something that remote cloud servers cannot deliver.

2.3. Cloud Computing

The Cloud Computing paradigm is one of the most disruptive technological innovations of the last few years. It makes a flexible amount of computing resources available to anyone under pay-per-use payment models, the so-called ‘as-a-service’ model. Currently, more and more software and hardware solutions are being redesigned for this cloud paradigm [41].

The cloud computing model favours the development of large-sized data centres where the resources are optimised through virtualization and efficient management systems. This technology gives the IoT applications the possibility to work in different environments in a very agile way using the same infrastructure [42]. In such a way, combining the cloud computing paradigm with IoT forms a new type of distributed system able to provide IoT-as-a-Service (IoTaaS) [43]. This concept allows for the integration of powerful computing resources with different types of devices such as sensors, actuators, and other embedded devices to deliver advanced services and applications based on the gathered data. A particular instance of this idea is the Sensing and Actuation Cloud where the connected IoT devices are mainly sensors and actuators [44], or the Cloud Cyber Physical Systems (CPS) composed of sensors or sensor networks [45].

There are a great variety of successful examples of this trend in many areas, where the data are analysed in the cloud through Big Data and data mining methods to infer valuable knowledge from them and deliver rich and smart services to the stakeholders. For example, the smart city concept, mentioned above, is in part made possible by a centralised cloud-based data analysis and service provision [41, 46, 47].

In addition, combinations of these three options can be designed, taking several aspects into account, such as power consumption, communication networks, and the availability of computing platforms. Dynamic solutions can easily adapt to the most favourable approach in order to better handle the complexity and meet the operational constraints.

2.4. Future Trends

Regarding the future of these three general approaches to the intensive data processing of IoT-related Big Data, there are developments on several fronts. The following is a summary of the most relevant ones.

When it comes to local processing, the efforts are directed towards the continuous improvement of smart sensor devices. We can distinguish several research lines here. One is the efforts to increase the performance of the devices while simultaneously reducing their power consumption. Another is the integration of multiple sensing modalities on the same chip. Still another is the efforts directed towards the improvement of the methods employed for the extraction of useful information from the raw data [21].

Edge computing has a promising future since it decentralises the computing power along the network and produces clear benefits when it comes to response time and reliability [34]. The research lines in this field aim at reaching a smooth engagement with the IoT ecosystem, mainly by reducing the management complexity of dispersed edge resources and developing mechanisms to maintain the security perimeter for the data and applications [49].

The cloud computing paradigm has triggered a strong growth of computing services around the world. For this reason, there is intensive ongoing research on expanding cloud services and solutions to new fields of application. These efforts seek to simplify business and make services easier for stakeholders. Along these lines, the new 5G protocol will facilitate access to services and applications in the cloud, improving the Quality of Experience [51].

3. Knowledge Discovery Procedure

Figure 2 depicts a classical procedure for discovering knowledge from the data gathered from a large number of diverse devices, giving an overview of all the stages involved in such a process. The many challenges involved in these stages are described next.

3.1. IoT Data Gathering

The gathering of data for IoT architectures involves collection from different sources like social networks, the web, various devices, software applications, humans, and, not least, various kinds of sensors. In addition to physical sensors, there are also virtual sensors, created by the combination and fusion of data from different physical sensors in the cloud [52]. When it comes to gathering data from sensors, not only are the raw sensor data collected and stored, but they are also often linked to, for example, relevant contextual information, which increases the value of the data [53]. All these different sources engender large amounts of data of various types, which of course also increases the requirements for storage capacity, although the increasingly affordable storage resources that have recently become available mitigate this problem to some extent.
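The notions of a virtual sensor and of context-enriched readings can be sketched as follows. The sensor names, the fusion rule, and the context fields are invented for illustration; they do not come from the cited papers.

```python
# Sketch of a 'virtual sensor': a reading derived by fusing two physical
# sensors (temperature and humidity), annotated with contextual
# information that increases the value of the data. The fusion rule and
# field names are illustrative assumptions.
def virtual_comfort_sensor(temp_c, humidity_pct, room):
    # Simplistic fusion rule: humidity makes warm air feel warmer.
    feels_like = temp_c + 0.01 * humidity_pct * max(temp_c - 20.0, 0.0)
    return {
        "feels_like_c": round(feels_like, 2),
        "room": room,                       # contextual information
        "sources": ["temp", "humidity"],    # provenance of the fused value
    }

reading = virtual_comfort_sensor(temp_c=26.0, humidity_pct=70.0, room="lab-2")
```

The virtual reading does not exist on any single physical device; it is computed in the cloud from the fused streams, exactly the pattern described above.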

Sensor networks are central to realising the IoT, and in order to handle large amounts of polymorphous, heterogeneous sensor data, Very Large-Scale Sensor Networks are employed using Cloud Computing [54]. Some of the main challenges regarding Very Large-Scale Sensor Networks are handling the sensor and computational resources and storing and processing the sensor data.

Table 2 provides references to papers focused on the gathering of data in the context of the IoT.

3.2. Data Cleaning and Integration

A consequence of the way information is gathered through various sources and devices within IoT is that the information varies broadly in structure and type. This leads to a need for integration, which can be defined as a set of techniques used to combine data from disparate sources into meaningful and valuable information.
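As a concrete illustration of combining data from disparate sources into a common form, consider records arriving in two hypothetical formats (the source names, field names, and unit conventions below are assumptions made for the example):

```python
# Sketch of data integration: map records from heterogeneous sources
# onto one common schema. The two source formats are invented examples.
def to_common_schema(record, source):
    if source == "weather_api":          # JSON-style keys, Celsius
        return {"device": record["station_id"],
                "temp_c": record["temperature"]}
    if source == "legacy_csv":           # positional row, Fahrenheit
        device, temp_f = record
        return {"device": device,
                "temp_c": round((float(temp_f) - 32) * 5 / 9, 1)}
    raise ValueError(f"unknown source: {source}")

unified = [
    to_common_schema({"station_id": "S1", "temperature": 21.5}, "weather_api"),
    to_common_schema(["S2", "70.7"], "legacy_csv"),
]
```

Once both records share one schema and one unit, they become meaningful and comparable, which is the essence of the integration step.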

Integration is one of the most challenging issues of Big Data, which is also associated with one of the most difficult Vs of Big Data, i.e., the variety of data. Table 3 shows a summary of papers that are focused on the problem of variety of information in Big Data.

Moreover, given the current context in which companies are organized, it is not enough to work with internal, local, and private databases. In most cases, there is also a need to draw on the World Wide Web, where many diverse databases and other data sources must interact and interoperate. This circumstance leads us to concepts such as heterogeneity and uncertainty.

Table 4 summarizes papers that deal with integration by means of a diversity of techniques and methods like XML, ontological constructs from knowledge representation, uncertainty, and data provenance.

3.3. Data Mining and Machine Learning

As more devices, sensors, etc. generate large amounts of data within the IoT, the question arises whether there are possibilities of finding hidden information in that data.

Data mining is a process that detects interesting knowledge in information repositories. It is partly based on methods derived from modern machine learning algorithms, adapted to fit Big Data, that extract hidden information from, e.g., databases, data warehouses, data streams, time series, sequences, text, the web, and the large amount of valuable data generated by the IoT. Data mining aims at creating efficient predictive and descriptive models of large amounts of data that also generalize to new data [78]. It includes methods such as clustering, classification, time series analysis, association rule mining, and outlier analysis [79]. The precise choice among the diverse data mining and machine learning techniques often depends on the characteristics of the dataset.

Clustering is a form of unsupervised learning that uses the structure available in the data to group instances based on various kinds of similarity measures. Some examples of clustering methods are hierarchical clustering and partitioning algorithms, e.g., K-Means.
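The K-Means idea mentioned above, alternately assigning points to their nearest centroid and recomputing each centroid as the cluster mean, can be sketched in a few lines. This is a toy 1-D version with fixed initial centroids for reproducibility; a real implementation would initialise randomly, work in higher dimensions, and iterate to a convergence tolerance.

```python
# Minimal 1-D K-Means sketch (stdlib only). Assignment step: each point
# joins its nearest centroid. Update step: each centroid moves to the
# mean of its cluster. The toy data and fixed seeds are illustrative.
def kmeans_1d(points, centroids, iterations=10):
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:                              # assignment step
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]   # update step
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 10.0])
# The two centroids settle near 1.0 and 8.0, separating the two groups.
```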

Classification is the process of finding models/functions describing classes that allow the prediction of class membership for new data. Some examples of classification methods are the K-Nearest Neighbour algorithm, Artificial Neural Networks, Decision Trees, Support Vector Machines, Bayesian Methods, and Rule-Based Methods.
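Of the classification methods listed, the K-Nearest Neighbour algorithm is simple enough to sketch directly: a new instance receives the majority label among its k closest training instances. The toy 2-D dataset and the choice k=3 are illustrative.

```python
# Minimal K-Nearest-Neighbour classifier sketch (stdlib only).
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label) pairs; returns the majority label
    among the k training points nearest to `query`."""
    neighbours = sorted(train,
                        key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "low"), ((0.2, 0.1), "low"), ((0.1, 0.3), "low"),
         ((5.0, 5.0), "high"), ((5.2, 4.9), "high"), ((4.8, 5.1), "high")]
label = knn_predict(train, query=(0.3, 0.2))
```

Here the model is simply the stored training data plus a distance function, which is why K-NN is often the first classifier tried on a new dataset.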

In time series analysis meaningful properties are extracted from data over time, and in association rule mining, association rules are detected based on attribute-value conditions that are found frequently in the dataset.
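The core quantity in association rule mining, the confidence of a rule A → B derived from attribute-value co-occurrence counts, can be sketched as follows. The event names and transactions are invented for illustration.

```python
# Sketch of association rule mining on transaction-style data
# (stdlib only): confidence(A -> B) = support(A and B) / support(A).
# The IoT-flavoured event names are illustrative assumptions.
def rule_confidence(transactions, a, b):
    support_a = sum(1 for t in transactions if a in t)
    support_ab = sum(1 for t in transactions if a in t and b in t)
    return support_ab / support_a if support_a else 0.0

transactions = [
    {"sensor_fault", "high_temp"},
    {"sensor_fault", "high_temp", "restart"},
    {"sensor_fault"},
    {"high_temp"},
]
conf = rule_confidence(transactions, "sensor_fault", "high_temp")
# sensor_fault appears 3 times, together with high_temp 2 times,
# so the rule sensor_fault -> high_temp has confidence 2/3.
```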

Outlier analysis detects patterns that differ significantly from the main part of the data. The methods used are based on properties such as the density distribution or the distances between the instances in the data.
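A minimal distance-from-the-bulk flavour of outlier analysis can be sketched with a z-score test on 1-D data. The sigma cut-off used here is a common rule of thumb, not a value prescribed by the surveyed papers.

```python
# Outlier detection sketch (stdlib only): flag values whose distance
# from the mean exceeds a multiple of the standard deviation. The toy
# readings and the 2-sigma threshold are illustrative assumptions.
import statistics

def zscore_outliers(values, threshold=3.0):
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 25.0]  # 25.0 is anomalous
outliers = zscore_outliers(readings, threshold=2.0)
```

Density- and distance-based methods from the literature generalise this idea to multiple dimensions and to local neighbourhood structure.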

Table 5 provides a summary of, and references to, papers focusing on machine learning and data mining in the context of Big Data.

3.4. Deep Learning

In recent years, deep learning has become an important technology for solving a wide range of machine learning tasks [85]. There are applications in natural language processing [86], signal processing [87], and video analysis that achieve significantly better results than the state-of-the-art baselines. Deep learning is also a very useful tool for processing large volumes of data [62]. Because it can efficiently process data obtained from complex sensing environments at different spatial and temporal resolutions, deep learning is a suitable tool for analysing real-world IoT data. According to Gartner’s Top 10 Strategic Technology Trends for 2017 (https://www.gartner.com/smarterwithgartner/gartners-top-10-technology-trends-2017/), deep learning and the IoT will form one of the most strategic technological two-way relationships: the IoT side produces large volumes of data that require the advanced analytics offered by the deep learning side. A wide range of deep learning architectures [88] find applications in processing data from IoT environments: convolutional networks for image analysis, recurrent networks for signal processing, autoencoders for denoising, and feed-forward networks for classification and regression. Figure 3 represents a general deep learning architecture.
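The feed-forward architecture mentioned above can be illustrated at its smallest scale: layers of weighted sums passed through nonlinear activations. The weights below are a classic hand-picked solution for XOR, shown only to make the forward pass concrete; a real framework would learn such weights from data by backpropagation.

```python
# Minimal feed-forward network sketch (stdlib only): one hidden layer
# with sigmoid activations computing XOR. The weights are a standard
# illustrative solution, not learned ones.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x1, x2):
    # Hidden layer: two units approximating OR and AND.
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)     # ~OR
    h2 = sigmoid(20 * x1 + 20 * x2 - 30)     # ~AND
    # Output unit: OR and not AND, i.e. XOR.
    return sigmoid(20 * h1 - 20 * h2 - 10)

predictions = [round(forward(a, b)) for a, b in
               [(0, 0), (0, 1), (1, 0), (1, 1)]]
# predictions reproduce the XOR truth table: [0, 1, 1, 0]
```

Deep architectures stack many such layers, which is what lets them build increasingly descriptive features from raw IoT data.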

Usually, the data are processed in dedicated frameworks such as TensorFlow (https://www.tensorflow.org/), Theano (http://deeplearning.net/software/theano/), Caffe (http://caffe.berkeleyvision.org/), H2O (https://www.h2o.ai/), and Torch (http://torch.ch/). Often, GPUs or clusters of GPU servers are used for the processing [78, 79].

These frameworks offer different execution models: they can run standalone or utilize high-performance computing based on, e.g., Hadoop or Spark clusters, which reduces computation time. The frameworks have been widely compared, and reviews can be found online (https://dzone.com/articles/8-best-deep-learning-frameworks) (https://www.exastax.com/deep-learning/a-comparison-of-deep-learning-frameworks/). It should be noted that these frameworks implement a processing model in which the data are transferred to a server performing the analysis, and in a final stage the response is returned. This model is subject to latency that may not be acceptable in applications with high-reliability requirements, such as autonomous cars [89]. Thus, if efficiency constraints require real-time data processing, a particular implementation of the algorithm is made on a local node. In its basic setting, this solution does not allow the use of information from other sources. An example of on-the-node processing has been presented in [90], where on-the-node spectral domain preprocessing is applied before the data are passed on to the deep learning framework for Human Activity Recognition.

For the IoT, deep analytics are performed on large data collections and are usually based on creating more descriptive features of the processed objects. For example, in temporal data processing for indoor location prediction [91], a Semisupervised Deep Extreme Learning Machine algorithm has been proposed that improves localisation performance. A wireless positioning method has been improved by using a Stacked Denoising Autoencoder, which raises performance by creating reliable features from a large set of noisy samples [92]. The prediction of home electricity power consumption has been analysed with a deep learning system that automatically extracts features from the captured data and optimises the electricity supply of the smart grid [93].

In Edge Computing, with the analytics performed by a deep learning cluster [94], resource consumption has been efficiently reduced [95]. Convolutional neural networks with automatically created features have proved to be a very good solution for privacy preservation [96]. Deep learning also finds many applications in the security domain; e.g., it allows the construction of a model-based attack detection architecture for the IoT for cyber-attack detection in fog-to-things computing [97].

Video analysis integrated in IoT networks is strongly supported by neural networks; e.g., deep learning-based visual food recognition allows for the construction of a system employing an edge computing-based service for accurate dietary assessment [98]. RTFace, a mechanism for denaturing video streams, is based on a Deep Neural Network for face detection [99]: it selectively blurs faces and enables privacy management for live video analytics.

3.5. Classification, Prediction, and Visualization

This section discusses the final stage in the chain of the “Procedure for Knowledge Discovery”: obtaining the final knowledge extracted from the raw data.

When employing machine learning methods for classification and prediction, it is important to use methods with a good ability to generalize. The reason is that, after any of the aforementioned techniques has been trained on the original data, we want it to make good classifications and predictions on novel data, not merely on the data used for training.
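The standard way to estimate this generalization ability is to hold some data out of training and measure accuracy only on it. The sketch below uses a deliberately trivial majority-class "model" and invented labels to show the mechanics; any real classifier would be evaluated the same way.

```python
# Sketch of held-out evaluation (stdlib only): accuracy is computed on
# samples the model never saw during training. The majority-class
# 'model' and the toy labels are illustrative assumptions.
def train_majority_classifier(train_labels):
    majority = max(set(train_labels), key=train_labels.count)
    return lambda _features: majority     # ignores input, predicts majority

labels = ["ok"] * 8 + ["fault"] * 2
train_labels, test_labels = labels[:7], labels[7:]   # simple unshuffled split
model = train_majority_classifier(train_labels)
test_accuracy = sum(model(None) == y for y in test_labels) / len(test_labels)
# Training accuracy would be perfect here (all training labels are "ok"),
# yet held-out accuracy is only 1/3: the gap is exactly what held-out
# evaluation is designed to expose.
```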

After machine learning methods have been applied, it is crucial to know how to interpret their outputs and to understand what they mean and how they improve the knowledge in each application area. To that end, visualization methods are employed. Such methods are widely used within Big Data scenarios, as they are very helpful for all types of graphical interpretation when the Volume, Variety, or Velocity of the data is challenging. In Table 6, we present a summary of, and references to, papers that deal with visualization.
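The core of any such visualization is a mapping from values to visual marks. The sketch below shows that idea in its plainest form, a textual bar chart over invented sensor counts; real Big Data dashboards use far richer graphical toolkits, but the mapping principle is the same.

```python
# Minimal visualization sketch (stdlib only): map each value to a bar of
# proportional length. The sensor names and counts are invented.
def bar_chart(counts, width=20):
    peak = max(counts.values())
    lines = []
    for label, value in counts.items():
        bar = "#" * round(width * value / peak)
        lines.append(f"{label:<8}{bar} {value}")
    return "\n".join(lines)

chart = bar_chart({"sensorA": 120, "sensorB": 60, "sensorC": 30})
print(chart)
```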

4. Conclusion

As indicated by the journal articles and the conference papers we have reviewed in this article, the complexity of Big Data is an urgent topic and the awareness of this is growing. Consequently, there is a lot of research carried out on this, and we will in all likelihood find more and more progress in this field during the next few years.

Additionally, a key issue that we want to emphasize in this study is that the aspects related to Big Data transcend the academic sphere and are therefore also reflected in industry. One observation is that more than 50% of 560 surveyed enterprises think Big Data will help them increase their operational efficiency, among other things [60]. This indicates that there are many opportunities for Big Data. However, it is also clear that there are many challenges in every phase of the knowledge discovery procedure that need to be addressed in order to achieve continued and successful progress within the field of Big Data.

As shown in Figure 1, there are three general approaches to carrying out intensive data processing in IoT architectures: (a) local processing, (b) edge computing, and (c) cloud computing. Each of these approaches has been explained in more detail above.

We also explained the knowledge discovery procedure by dividing it into several stages as shown in Figure 2. These steps are IoT Data Gathering, Data Cleaning, Integration, Machine Learning, Data Mining, Classification, Prediction, and Visualization.

We have also discussed that many research papers focus on the variety of information because this, in conjunction with integration, is one of the most challenging issues when it comes to the IoT; it is also associated with one of the most difficult Vs of Big Data, namely variety.

The trend for the future seems to be that more investigations will be carried out in areas such as (a) techniques for data integration, again the V of Variety; (b) more efficient machine learning techniques on Big Data, such as deep learning, and frameworks such as Apache Hadoop and Spark, which will probably be of crucial importance; and (c) the visualization of the data, with, e.g., dashboards and more efficient techniques for the visualization of indicators.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors acknowledge the support from the research center Internet of Things and People (IOTAP) at Malmö University in Sweden. This work was also supported by the Spanish Research Agency (AEI) and the European Regional Development Fund (ERDF) under project CloudDriver4Industry TIN2017-89266-R.