The field of Big Data is rapidly developing with a lot of ongoing research, which will likely continue to expand in the future. A crucial part of this is Knowledge Discovery from Data (KDD), also known as the Knowledge Discovery Process (KDP). This process is a very complex procedure, and for that reason it is essential to divide it into several steps (Figure 1). Some authors use five steps to describe this procedure, whereas others use only four. We use the following four-step description:(1)Generation of Data: Data is generated from a multitude of various data sources, such as sensors, social media, the web, a multitude of devices, software applications, people, and various kinds of sensors [1].(2)Collection of Data: This step involves the storage of data into various types of databases, such as MongoDB, elastic, InfluxDB, MySQL, and NoSQL suitable for Big Data technologies [27]. Often the raw sensor data are collected, but it is common to also link them to contextual information [810]. Other tasks such as cleaning, integration, and transformation of data are essential for the optimal storage in databases [1113]. Several tools and technologies suitable to semi-automatize tasks such as Extract, Transform, and Load (ETL) [14] exist, e.g., Apache NIFI [15, 16] and Pentaho [17, 18]. These are very useful since there are many tasks involved in this step of the KDD procedure.(3)Machine Learning and Data Mining: Diverse machine learning and data mining methods are applied and benchmarked, and the results are compared. It should be kept in mind that though there are many machine learning methods, not all of them are suitable to use with Big Data [19, 20].(4)Classification, Prediction, and Visualization: This step focuses on, in particular, the obtaining of visualizations that present all the classification and prediction results in a useful way. Tools such as Grafana [21] could help to interpret the data visually, but also to simplify the identification of Key Performance Indicators (KPI) [22].

This special issue received in total 19 submitted papers, and after a meticulous reviewing process the editors decided to accept eight of these for publication, which implies an acceptance rate of about 42%.

Anomaly analysis is a crucial issue since it is a significant part of many areas, such as medical health, credit card fraud, and intrusion detection (X. Xu et al.). The authors of this paper provide a complete state-of-the-art presentation of anomaly detection. High dimensionalities and mixed types of data are the focus of this study as the identification of anomalous patterns is far from trivial. The authors introduce the reader to current advances on anomaly detection, while debating the pros and cons of various detection methods.

There are areas that are well-known, though barely referenced in the literature. For example, the one presented by M. Lodeiro-Santiago et al., where the goal is to detect small boats (pateras) to help address the problem of dangerous immigration. In this paper, the authors use deep convolutional neural networks to improve detection methods based on image processing through the application of filters. Their novel approach is able to recognise the boats through patterns regardless of where they are located. The proposed approach, which works in real-time, allows the detection of boats and people for search and rescue teams in order to plan for rescue operations before an emergency happens. The proposed method includes the use of essential cryptographic protocols for the protection of the highly sensitive information managed.

A method for the evaluation of heart rate streams in patients with ischemic heart disease is presented by M. D. Peláez-Aguilera et al. The authors present an innovative linguistic approach to manage relevant linguistic descriptions (protoforms). This provides a foundation for the cardiac rehabilitation team to identify sessions with significance indicators through linguistic summaries. As it is faced in the manuscript, cardiac rehabilitation programs are crucial to significantly decrease mortality rates in high-risk patients with ischemic heart disease.

In the work presented by Z. Marszałek et al., a fully flexible sorting method designed for parallel processing is presented. The authors describe a method based on modified merge sort designed for multicore architectures. The flexibility of the method, which is implemented for a number of processors, increases the efficiency of sorting by distributing the tasks between logical cores in a flexible way. Since powerful computer resources are often not very well exploited, their main goal is to use efficient algorithms to support the proficient use of all available resources.

F. M. Pérez et al. present a theoretical framework based on a generalisation of rough sets theory. This allows the establishment of a stochastic approach to solving the problem of outliers within a specific universe of data. An algorithm based on this theoretical framework is developed to make it suitable for large data volume applications. The experiments carried out validate the proposed algorithm in comparison to various algorithms analysed in the literature.

The work proposed by P. S. Szczepaniak and A. Duraj concerns the problem of outlier detection through the application of case-based reasoning. The authors argue that while this method has been successfully applied in an extensive variety of other domains, it has never been used for outlier detection.

In the manuscript presented by Q. Gu et al., the authors propose a Hybrid Genetic Grey Wolf Algorithm in order to improve the disadvantage of Grey Wolf Optimizer when solving Large-Scale Global Optimization problems.

Finally, D. Gil et al. provide a review highlighting the fact that the complexity of managing Big Data is one of the main challenges in the developing field of the Internet of Things (IoT). The review divides the discovery of knowledge into the four general steps sketched above and evaluates the most novel technologies involved. These include IoT data gathering, data cleaning and integration, data mining and machine learning, and classification, prediction, and visualization.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors acknowledge the support of the Internet of Things and People (IOTAP) Research Center at Malmö University in Sweden. This work was also supported by the Spanish Research Agency (AEI) and the European Regional Development Fund (ERDF) under the project CloudDriver4Industry TIN2017-89266-R. This work has also been funded by the Spanish Ministry of Economy and Competitiveness (MINECO/FEDER) under the granted Project SEQUOIA-UA (management requirements and methodology for Big Data analytics) TIN2015-63502-C3-3-R.

David Gil
Magnus Johnsson
Higinio Mora
Julian Szymanski