Abstract

Big data is the term used to denote enormous sets of data that differ from classic databases in four main ways: (huge) volume, (high) velocity, (much greater) variety, and (big) value. In general, the data are stored in a distributed fashion across many computing nodes, as a result of which big data may be more susceptible to attacks by hackers. This paper presents a risk model for big data, which comprises Failure Mode and Effects Analysis (FMEA) and Grey Theory, more precisely grey relational analysis. This approach has several advantages: it provides a structured way to incorporate the impact of big data risk factors; it facilitates the assessment of risk by breaking down the overall risk to big data; and its efficient evaluation criteria can help enterprises reduce the risks associated with big data. To illustrate the applicability of our proposal in practice, a numerical example, with realistic data based on expert knowledge, was developed. The numerical example analyzes four dimensions, that is, managing identification and access, registering the device and application, managing the infrastructure, and data governance, and 20 failure modes concerning the vulnerabilities of big data. The results show that the most critical dimension of big data risk relates to data governance.

1. Introduction

In recent years, big data has rapidly developed into an important topic that has attracted great attention from industry and society in general [1]. The big data concept and its applications have emerged from the increasing volumes of external and internal data in organizations, and it differs from other databases in four aspects: volume, velocity, variety, and value. Volume refers to the amount of data, velocity refers to the speed with which data can be analyzed and processed, variety describes the different kinds and sources of data, which may be structured or unstructured, and value refers to valuable discoveries hidden in large datasets [2]. The emphasis in big data analytics is on how data is stored in a distributed fashion that allows it to be processed in parallel on many computing nodes in distributed environments across clusters of machines [3].

Given the significance that big data has for business applications and the increasing interest in various fields, some relevant works should be mentioned. Reference [4] argued that consumer analytics lies at the junction of big data and consumer behavior and highlighted the importance of interpreting the data generated from big data. Reference [5] examined the role of big data in facilitating access to financial products for economically active low-income families and microenterprises in China. Reference [6] investigated the roles of big data and business intelligence (BI) in the decision-making process. Reference [7] presented a novel active learning method based on extreme learning machines with inherent properties that make handling big data highly attractive. Reference [8] developed a selection algorithm based on evolutionary computation that uses the MapReduce paradigm to obtain subsets of features from big datasets. Reference [9] discussed the advancement of big data technology, including the generation, management, and analysis of data. Finally, [10] provided a brief overview of big data problems, including opportunities and challenges, current techniques, and technologies.

Big data processing begins with data being transmitted from different sources to storage devices and continues with the implementation of preprocessing, process mining and analysis, and decision-making [6]. Much of this processing takes place in parallel, which increases the risk of attack, and how best to guard against this is what big data management seeks to do [11].

Over the last few years, several researchers have proposed solutions for mitigating security threats. In [12], a taxonomy of events and scenarios was developed and the ranking of alternatives based on the criticality of the risk was provided by means of event tree analysis combined with fuzzy decision theory. Reference [13] developed a mathematical model to solve the problem according to the risk management paradigm and thereby provided managers with additional insights for making optimal decisions. There has also been research on the use of large network traces for mitigating security threats [14].

However, research analyzing the risks associated with big data is lacking. Moreover, from this perspective, information security measures are becoming more important due to the increasingly public nature of multiple sources. Hence, many issues related to big data applications can be addressed first by identifying the possible occurrences of failure and then by evaluating them. Consequently, this paper proposes the use of a specific Failure Mode and Effects Analysis (FMEA) method and Grey Theory, which allows for risk assessment at the crucial stages of the big data process. Both mathematical rigor, which is necessary to ensure the robustness of the model, and the judgments of those involved in the process, given the subjective characteristics of the types of assessments made, are considered in this model. This paper contributes to the literature in the following aspects. First, it offers new insights into how the different characteristics of big data are linked to risk in information security. Second, it provides a risk analysis model based on a multidimensional perspective of big data.

The first section of the paper discusses big data and information security issues. Then, the discussion that follows relates to existing methodologies for information security and background information, which are necessary for developing the proposed approach. Next, we introduce the methodology and present a real case that illustrates how the methodology validates the proposed approach. Finally, the discussion presents the limitations of the research, suggested areas for further study, and concluding remarks.

2. Background

2.1. Big Data and Methodologies for Risk Management

As mentioned before, big data has different characteristics in terms of variety, velocity, value, and volume compared to classic databases. Consequently, big data risk management is more complex and is becoming one of the greatest concerns in the area of information security. Currently, another important point is that data availability and confidentiality are two top priorities regarding big data.

Recently, several works relating to big data and security have been published. Reference [15] proposed a new type of digital signature that is specifically designed for a graph-based big data system. To ensure the security of outsourced data, [16] developed an efficient ID-based auditing protocol for cloud data integrity using ID-based cryptography. In order to solve the problem of data integrity, [17] proposed a remote data-auditing technique based on algebraic signature properties for a cloud storage system that incurs minimal computational and communication costs. Reference [18] presented a risk assessment process that includes both risk arising from the interference of unauthorized information and issues related to failures in risk-aware access control systems.

There are many methods and techniques with respect to big data risk management. Table 1 lists and briefly describes qualitative methodologies for risk analysis.

Some approaches based on quantitative methods have also been proposed. Reference [19] presented an approach to the risk management of security information, encompassing FMEA and Fuzzy Theory. Reference [20] developed an analysis model to simultaneously define the risk factors and their causal relationships based on the knowledge from observed cases and domain experts. Reference [21] proposed a new method called the Information Security Risk Analysis Method (ISRAM) based on a quantitative approach.

As can be seen, the purpose of big data security mechanisms is to provide protection against malicious parties. Hence, researchers have also identified several forms of attacks and vulnerabilities regarding big data. Reference [22] investigated key threats that target VoIP hosts. Reference [23] analyzed the impact of malicious servers on different trust and reputation models in wireless sensor networks. Reference [24] examined a cloud architecture where different services are hosted on virtualized systems on the cloud by multiple cloud customers. Also, [25] outlined a discussion of the security and privacy challenges of cloud computing.

In this context, attacks themselves are becoming more and more sophisticated. Moreover, attackers have easier access to ready-made tools that allow them to exploit platform vulnerabilities more effectively. For these reasons, security risks arise in a big data environment from high volumes of data coming from multiple sources, complex data sharing, and accessibility-related issues. Therefore, there is an increasing need to develop new techniques for big data risk analysis.

2.2. Failure Mode and Effects Analysis (FMEA)

FMEA was first proposed by NASA in 1963. The main objective of FMEA is to discover, prevent, and correct potential failure modes, failure causes, failure effects, and problem areas affecting a system [31]. According to FMEA, the risk priorities of failure modes are generally determined through the risk priority number (RPN), which assesses three factors of risk: occurrence (O), severity (S), and detection (D). The RPN is then defined by [32]

$$\mathrm{RPN} = O \times S \times D. \quad (1)$$

Based on [33, 34], the classic proposal uses a 10-point linguistic scale for evaluating the O, S, and D factors. This scale is described in Tables 2, 3, and 4 for each risk factor. The failure modes with higher RPNs, which are viewed as more important, should be corrected with higher priority than those with lower RPNs.
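For concreteness, consider a brief worked example with hypothetical scores (illustrative values only, not taken from the example in Section 4): a failure mode rated $O = 7$, $S = 5$, and $D = 6$ on the scales of Tables 2–4 yields $\mathrm{RPN} = 7 \times 5 \times 6 = 210$ and would therefore be corrected before a mode rated $O = 3$, $S = 4$, and $D = 2$, whose $\mathrm{RPN} = 24$.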

The FMEA method has been applied to many engineering areas. Reference [35] extended the application of FMEA to risk management in the construction industry using combined fuzzy FMEA and fuzzy Analytic Hierarchy Process (AHP). Reference [36] described failures of the fuel feeding system that frequently occur in the sugar and pharmaceutical industries [37]. Reference [38] proposed FMEA for electric power grids, such as solar photovoltaics. Reference [39] presented a basis for prioritizing health care problems.

According to [40], the traditional FMEA method cannot assign different weightings to the risk factors O, S, and D and therefore may not be suitable for real-world situations. For these authors, introducing Grey Theory into traditional FMEA enables engineers to allocate the relative importance of the risk factors O, S, and D based on research and their experience. In general, the major advantages of applying the grey method to FMEA are the following capabilities: assigning different weightings to each factor and not requiring any type of utility function [41].

References [32, 33] pointed out that the use of Grey Theory within the FMEA framework is practicable and can be accomplished. Reference [42] examined the ability to predict tanker equipment failure. Reference [43] proposed an approach that is expected to help service managers manage service failures. Thus, Grey Theory is one approach employed to improve the evaluation of risk.

2.3. Grey Theory

Grey Theory, introduced by [44], is a methodology that is used to solve uncertainty problems; it allows one to deal with systems that have imperfect or incomplete information or that even lack information. Grey Theory comprises grey numbers, grey relations (which this paper uses in the form of Grey Relational Analysis, GRA), and grey elements. These three essential components are used to replace classical mathematics [45].

In grey system theory, a system with information that is certain is called a white system; a system with information that is totally unknown is called a black system; a system with partially known and partially unknown information is called a grey system [46]. Reference [47] argued that grey system theory has been receiving increasing attention in the field of decision-making in recent years and has been successfully applied to many important problems featuring uncertainty, such as supplier selection [48, 49], medical diagnosis [50], work safety [40], portfolio selection [51], and the evaluation and selection of classification algorithms [52].

According to [53], a grey system is defined as a system containing uncertain information presented by a grey number and grey variables. Another important definition is that of a grey set $G$ (of a universal set $X$), which is defined by its two mappings $\overline{\mu}_G(x)$ and $\underline{\mu}_G(x)$ as follows:

$$\overline{\mu}_G(x)\colon x \rightarrow [0, 1], \qquad \underline{\mu}_G(x)\colon x \rightarrow [0, 1], \quad (2)$$

where $\underline{\mu}_G(x) \leq \overline{\mu}_G(x)$, $x \in X$, and $\overline{\mu}_G(x)$ and $\underline{\mu}_G(x)$ are the upper and lower membership functions in $G$, respectively.

A grey number is the most fundamental concept in grey system theory and can be defined as a number with uncertain information. Therefore, a white number is a real number $x \in \Re$, and a grey number, written as $\otimes G$, refers to an indeterminate real number that takes its possible values from within an interval or a discrete set of numbers. In other words, a grey number $\otimes G$ is defined as an interval with a known lower limit $\underline{a}$ and a known upper limit $\overline{a}$, that is, as $\otimes G = [\underline{a}, \overline{a}]$. Supposing there are two different grey numbers denoted by $\otimes G_1 = [\underline{a}, \overline{a}]$ and $\otimes G_2 = [\underline{b}, \overline{b}]$, the mathematical operation rules of general grey numbers are as follows:

$$\otimes G_1 + \otimes G_2 = [\underline{a} + \underline{b},\ \overline{a} + \overline{b}],$$
$$\otimes G_1 - \otimes G_2 = [\underline{a} - \overline{b},\ \overline{a} - \underline{b}],$$
$$\otimes G_1 \times \otimes G_2 = [\min(\underline{a}\underline{b}, \underline{a}\overline{b}, \overline{a}\underline{b}, \overline{a}\overline{b}),\ \max(\underline{a}\underline{b}, \underline{a}\overline{b}, \overline{a}\underline{b}, \overline{a}\overline{b})],$$
$$\otimes G_1 \div \otimes G_2 = [\underline{a}, \overline{a}] \times \left[\frac{1}{\overline{b}}, \frac{1}{\underline{b}}\right].$$

GRA is a part of Grey Theory and can be used together with various correlated indicators to evaluate and analyze the performance of complex systems [54, 55]. In fact, GRA has been successfully used in FMEA and its results have been proven to be satisfactory. Compared to other methods, GRA has competitive advantages in terms of having shown the ability to process uncertainty and to deal with multi-input systems, discrete data, and data incompleteness effectively [55]. In addition, [41] argues that results generated by the combination of Grey Theory and FMEA are more unbiased than those of traditional FMEA, and [42] claims that combining Fuzzy Theory and Grey Theory with FMEA leads to more useful and practical results.
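As a brief illustration of these rules (the intervals are arbitrary examples), let $\otimes G_1 = [2, 4]$ and $\otimes G_2 = [1, 3]$. Then $\otimes G_1 + \otimes G_2 = [3, 7]$, $\otimes G_1 - \otimes G_2 = [-1, 3]$, $\otimes G_1 \times \otimes G_2 = [2, 12]$, and $\otimes G_1 \div \otimes G_2 = [2, 4] \times [1/3, 1] = [2/3, 4]$.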

GRA is an impact evaluation model that measures the degree of similarity or difference between two sequences based on the degree of their relationship. In GRA, a global comparison between two sets of data is undertaken instead of a local comparison that measures the distance between two points [56]. Its basic principle is that if a comparability sequence translated from an alternative has a higher grey relational degree with the reference sequence, then that alternative will be the better choice. Therefore, the analytic procedure of GRA normally consists of four parts: generating the grey relational situation, defining the reference sequence, calculating the grey relational coefficient, and finally calculating the grey relational degree [55, 57]. The comparative sequence denotes the sequences that should be evaluated by GRA, and the reference sequence is the original reference that is compared with the comparative sequence. Normally, the reference sequence is defined as a vector of ones, $(1, 1, \ldots, 1)$. GRA aims to find the alternative whose comparability sequence is the closest to the reference sequence [43].

2.4. Critical Analysis

Big data comprises complex data that is massively produced and managed in geographically dispersed repositories [63]. Such complexity motivates the development of advanced management techniques and technologies for dealing with the challenges of big data. Moreover, how best to assess the security of big data is an emerging research area that has attracted abundant attention in recent years. Existing security approaches carry out checking on data processing in diverse modes. The ultimate goal of these approaches is to preserve the integrity and privacy of data and to undertake computations in single and distributed storage environments irrespective of the underlying resource margins [11].

However, as discussed in [11], traditional data security technologies are no longer pertinent to solving big data security problems completely. These technologies are unable to provide dynamic monitoring of how data and security are protected. In fact, they were developed for static datasets, but data is now changing dynamically [64]. Thus, it has become hard to implement effective privacy and security protection mechanisms that can handle large amounts of data in complex circumstances.

In a general way, FMEA has been extensively used for examining potential failures in many industries. Moreover, FMEA together with Fuzzy Theory and/or Grey Theory has been widely and successfully used in the risk management of information systems [12], equipment failure [42], and failure in services [43].

Because the modeling of complex dynamic big data requires methods that combine human knowledge and experience as well as expert judgment, this paper uses GRA to evaluate the level of uncertainty associated with assessing big data in the presence or absence of threats. It also provides a structured approach in order to incorporate the impact of risk factors for big data into a more comprehensive definition of scenarios with negative outcomes and facilitates the assessment of risk by breaking down the overall risk to big data. Finally, its efficient evaluation criteria can help enterprises reduce the risks associated with big data.

Therefore, from a security and privacy perspective, big data is different from other traditional data and requires a different approach. Many of the existing methodologies and preferred practices cannot be extended to support the big data paradigm. Big data appears to have similar risks and exposures to traditional data. However, there are several key areas where they are dramatically different.

In this context, variety and volume translate into higher risks of exposure in the event of a breach due to variability in demand, which requires a versatile management platform for storing, processing, and managing complex data. In addition, the new paradigm for big data presents data characteristics at different levels of granularity and big data projects often encompass heterogeneous components. Another point of view states that new types of data are uncovering new privacy implications, with few privacy laws or guidelines to protect that information.

3. The Proposed Model

In this paper, an approach to big data risk management using GRA has been developed to analyze the dimensions that are critical to big data, as described by [65], based on FMEA and [31, 32]. The approach proposed is presented in Figure 1.

The new big data paradigm needs to work with far more than the traditional subsets of internal data. This paradigm incorporates a large volume of unstructured information, looks for nonobvious correlations that might drive new hypotheses, and must work with data that flow into the organization in real time and that require real-time analysis and response. In this paper, we analyzed the processing characteristics of the IBM Big Data Platform for illustrative purposes, but it is important to note that all big data platforms are vulnerable to both external and internal threats. Since our analysis model, based on the probability of the occurrence of failure, covers a broad view of the architecture of big data, it is also suitable for analyzing other platforms, such as cloud computing infrastructures [66] and platforms from business scenarios [67]. Finally, our model considers the possible occurrence of failures in distributed data, and we therefore consider its implementation in a distributed way.

3.1. Expert Knowledge or Past Data regarding Previous Failures

The first step in the approach consists of expert identification or use of past data. The expert is the person who knows the enterprise systems and their vulnerability and is thus able to assess the information security risk of the organization in terms of the four dimensions [65]. One may also identify a group of experts in this step, and the analysis is accomplished by considering a composition of their judgments or the use of a dataset of past failures. The inclusion of an expert system in the model is also encouraged.

According to [68], an expert is someone with multiple skills who understands the working environment and has substantial training in and knowledge of the system being evaluated. Risk management models have widely used expert knowledge to provide value judgments that represent the expert’s perceptions and/or preferences. For instance, [69] provides evidence obtained from two unbiased and independent experts regarding the risk of release of a highly flammable gas near a processing facility. References [70, 71] explore a risk measure of underground vaults that considers the consequences of arc faults using a single expert’s a priori knowledge. Reference [19] proposes information security risk management using FMEA, Fuzzy Theory, and expert knowledge. Reference [72] analyzes the risk probability of an underwater tunnel excavation using the knowledge of four experts.

3.2. Determination and Evaluation of Potential Failure Modes (FMEA)

In a general way, this step concerns the determination of the failure modes associated with the big data dimensions (Figure 2) in terms of their vulnerabilities. Each dimension is described in Table 5.

Furthermore, these dimensions can be damaged by various associated activities. Table 6 presents failure modes relating to the vulnerability of big data for each dimension.

In fact, the failure modes are determined using the FMEA methodology and evaluated with regard to their occurrence (O), severity (S), and detection (D).

3.3. Establish Comparative Series

An information series with $n$ decision factors, such as chance of occurrence, severity of failure, or chance of lack of detection, can be expressed as follows:

$$X_i = \{x_i(1), x_i(2), \ldots, x_i(n)\}, \quad i = 1, 2, \ldots, m, \quad (3)$$

where $m$ is the number of failure modes under analysis. These comparative series can be provided by an expert or by any dataset of previous failures, based on the scales described in Tables 2–4.
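For illustration, the minimal Python sketch below (with hypothetical scores and failure-mode labels, not the case-study data of Section 4) organizes the expert's O, S, and D ratings as a comparative-series matrix:

```python
import numpy as np

# Hypothetical expert scores on the 10-point scales of Tables 2-4
# (illustrative values only, not the case-study data).
# Each row is a comparative series X_i = (O, S, D) for one failure mode.
failure_modes = ["FM1", "FM2", "FM3"]
X = np.array([
    [7, 5, 6],   # FM1: occurrence, severity, detection
    [3, 8, 4],   # FM2
    [5, 5, 9],   # FM3
], dtype=float)
print(dict(zip(failure_modes, X.tolist())))
```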

3.4. Establish the Standard Series

According to [41], the degree of relation can describe the relationship between two series; thus, an objective series called the standard series is established and expressed as $X_0 = \{x_0(1), x_0(2), \ldots, x_0(n)\}$, where $n$ is the number of risk factors (for this work, $n = 3$, i.e., occurrence, severity, and detection). According to FMEA, the smaller the score, the better; the standard series is therefore composed of the lowest value of the scale and can be denoted as $X_0 = \{1, 1, \ldots, 1\}$.

3.5. Obtain the Difference between the Comparative Series and the Standard Series

To discover the degree of the grey relationship, the difference between the scores of the decision factors and the values of the standard series must be determined and expressed by a matrix calculated by

$$\Delta = \begin{bmatrix} \Delta_{01}(1) & \Delta_{01}(2) & \cdots & \Delta_{01}(n) \\ \Delta_{02}(1) & \Delta_{02}(2) & \cdots & \Delta_{02}(n) \\ \vdots & \vdots & \ddots & \vdots \\ \Delta_{0m}(1) & \Delta_{0m}(2) & \cdots & \Delta_{0m}(n) \end{bmatrix}, \qquad \Delta_{0i}(k) = \lVert x_0(k) - x_i(k) \rVert, \quad (4)$$

where $m$ is the number of failure modes in the analysis [31].
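A minimal sketch of this step, reusing the hypothetical scores above and assuming the standard series of ones from Section 3.4, is:

```python
import numpy as np

# Hypothetical comparative series (rows: failure modes; columns: O, S, D).
X = np.array([[7, 5, 6],
              [3, 8, 4],
              [5, 5, 9]], dtype=float)

# Standard series: the best (smallest) score of the 10-point scale for each factor.
x0 = np.ones(X.shape[1])

# Difference matrix of eq. (4): absolute deviation from the standard series.
delta = np.abs(x0 - X)
print(delta)
```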

3.6. Compute the Grey Relational Coefficient

The grey relational coefficient is calculated by

$$\gamma\bigl(x_0(k), x_i(k)\bigr) = \frac{\Delta_{\min} + \zeta\,\Delta_{\max}}{\Delta_{0i}(k) + \zeta\,\Delta_{\max}}, \quad (5)$$

where $\Delta_{\min}$ and $\Delta_{\max}$ are, respectively, the smallest and largest values in the matrix $\Delta$, and $\zeta$ is the identification coefficient, normally set to 0.5 [31]. Its value only affects the relative magnitude of the risk, not the priority ranking.
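Continuing the illustration with the hypothetical difference matrix obtained above, the coefficient of (5) can be computed element by element as sketched below:

```python
import numpy as np

# Hypothetical difference matrix (as produced in the previous step).
delta = np.array([[6., 4., 5.],
                  [2., 7., 3.],
                  [4., 4., 8.]])

zeta = 0.5                      # identification coefficient, as in [31]
d_min, d_max = delta.min(), delta.max()

# Grey relational coefficient of eq. (5), computed element by element.
gamma = (d_min + zeta * d_max) / (delta + zeta * d_max)
print(gamma.round(3))
```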

3.7. Determine the Degree of Relation

Before finding the degree of relation, the relative weight of each decision factor is first decided so that it can be used in the following formulation [31]. In a general way, the degree of relation is calculated by

$$\Gamma\bigl(x_0, x_i\bigr) = \sum_{k=1}^{n} \beta_k\,\gamma\bigl(x_0(k), x_i(k)\bigr), \quad (6)$$

where $\beta_k$ is the weighting of risk factor $k$ and, as a result,

$$\sum_{k=1}^{n} \beta_k = 1. \quad (7)$$
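A corresponding sketch, assuming equal weights for the three risk factors, is shown below:

```python
import numpy as np

# Hypothetical grey relational coefficients from the previous step.
gamma = np.array([[0.600, 0.750, 0.667],
                  [1.000, 0.545, 0.857],
                  [0.750, 0.750, 0.500]])

# Equal weights for O, S, and D; by eq. (7) they must sum to 1.
beta = np.array([1/3, 1/3, 1/3])

# Degree of relation of eq. (6) for each failure mode.
degree = gamma @ beta
print(degree.round(3))
```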

3.8. Rank the Priority of Risk

This step consists of dimension ordering. Based on the degree of relation between the comparative series and the standard series, a relational series can be constructed. The greater the degree of relation, the smaller the effect of the cause [31].
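Putting Sections 3.3 to 3.8 together, a compact end-to-end sketch (again with hypothetical scores and equal weights; only the resulting priority order matters) might read:

```python
import numpy as np

def grey_fmea_ranking(scores, weights, zeta=0.5):
    """Rank failure modes: the smaller the degree of relation, the higher the risk."""
    scores = np.asarray(scores, dtype=float)
    x0 = np.ones(scores.shape[1])                             # standard series (best scores)
    delta = np.abs(x0 - scores)                               # eq. (4)
    d_min, d_max = delta.min(), delta.max()
    gamma = (d_min + zeta * d_max) / (delta + zeta * d_max)   # eq. (5)
    degree = gamma @ np.asarray(weights, dtype=float)         # eq. (6)
    return np.argsort(degree), degree                         # ascending: riskiest first

# Hypothetical O, S, D scores for three failure modes, with equal weights.
scores = [[7, 5, 6], [3, 8, 4], [5, 5, 9]]
order, degree = grey_fmea_ranking(scores, [1/3, 1/3, 1/3])
for rank, i in enumerate(order, start=1):
    print(f"priority {rank}: failure mode {i + 1} (degree of relation {degree[i]:.3f})")
```

For the hypothetical scores used here, the third failure mode obtains the smallest degree of relation and is therefore ranked first.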

4. An Illustrative Example

To demonstrate the applicability of our proposition based on FMEA and Grey Theory, an example based on a real context is presented in this section. The steps performed are the same as shown in Figure 1, explained in Section 3. Following these steps, the expert selected for this study is a senior academic with more than 20 years’ experience. She holds a Ph.D. degree in information systems (IS), has published 12 papers in this field, and also has experience as a consultant in IS to companies in the private sector.

In the following step of the proposed model, the four dimensions associated with the potential failures of big data are represented according to Figure 2 and described in Table 5. Furthermore, Table 6 presents the failure modes relating to the vulnerability of big data for each dimension. Based on these potential failures, Tables 7 and 8 show the establishment of comparative and standard series for occurrence, severity, and detection, respectively.

To proceed to a grey relational analysis of the potential failures, it is necessary to obtain the difference between the comparative series and the standard series, according to (4). Table 9 shows the result of this difference.

In order to rank the priority of risk, it is necessary to compute both the grey relational coefficient (Table 10) and the degree of relation (Table 11) using (5), (6), and (7). Recall that the greater the degree of relation, the smaller the effect of the cause. Assuming equal weights for the risk factors, Table 11 also presents the degree of grey relation for each failure mode and dimension, together with the final ranking.

From the analysis of failures using the proposed approach, we have shown that big data is mainly in need of structured policies for data governance. This result was expected because the veracity and provenance of data are fundamental to information security; otherwise, the vulnerabilities may be catastrophic or big data may have little value for the acquisition of knowledge. Data governance is also an aspect that requires more awareness because it deals with large amounts of data and directly influences operational costs.

Since the model produces a recommendation rather than a definitive solution, and since the quality of that recommendation depends on expert knowledge, it is important to test the robustness of this information and therefore to conduct a sensitivity analysis. Thus, different weightings, based on the context, may also be used for the different risk factors, as suggested by [33]. Table 12 presents a sensitivity analysis conducted in order to evaluate the performance and validity of the results of the model. As can be seen, the final ranking of risk is the same for all the different weightings tested (±10%).
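Such a check can also be reproduced programmatically; the sketch below (hypothetical scores again) perturbs each weight by ±10%, renormalizes, and verifies that the priority order does not change:

```python
import numpy as np
from itertools import product

def grey_fmea_order(scores, weights, zeta=0.5):
    # Same procedure as sketched in Section 3.8, returning only the priority order.
    scores = np.asarray(scores, dtype=float)
    delta = np.abs(np.ones(scores.shape[1]) - scores)
    gamma = (delta.min() + zeta * delta.max()) / (delta + zeta * delta.max())
    return tuple(np.argsort(gamma @ np.asarray(weights)))

# Hypothetical scores; base case of equal weights for O, S, and D.
scores = [[7, 5, 6], [3, 8, 4], [5, 5, 9]]
base = np.array([1/3, 1/3, 1/3])

orders = set()
for signs in product((-0.10, 0.0, 0.10), repeat=3):   # perturb each weight by +/-10%
    w = base * (1.0 + np.array(signs))
    orders.add(grey_fmea_order(scores, w / w.sum())) # renormalize so weights sum to 1
print("ranking stable under +/-10% weight changes:", len(orders) == 1)
```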

5. Discussion and Conclusions

The main difficulties in big data security risk analysis involve the volume of data and the variety of data connected to different databases. From the perspective of security and privacy, traditional databases have governance controls and a consolidated auditing process, while big data is at an early stage of development and hence continues to require structured analysis to address threats and vulnerabilities. Moreover, there is not yet enough research into risk analysis in the context of big data.

Thus, security is one of the most important issues for the stability and development of big data. Aiming to identify the risk factors and the uncertainty associated with the propagation of vulnerabilities, this paper proposed a systematic framework based on FMEA and Grey Theory, more precisely GRA. This framework allows risk factors and their relative weightings to be evaluated in a linguistic, as opposed to a precise, manner when assessing big data failure modes, which is in line with the uncertain nature of the context. In fact, according to [40], the traditional FMEA method cannot assign different weightings to the risk factors O, S, and D and therefore may not be suitable for real-world situations. These authors pointed out that introducing Grey Theory into the traditional FMEA method enables engineers to allocate relative importance to the O, S, and D risk factors based on research and their own experience. Another advantage of this proposal is that it requires less effort from the experts, who can make accurate judgments in linguistic terms based on their experience or on datasets relating to previous failures.

Based on the above information, the use of our proposal is justified to identify and assess big data risk in a quantitative manner. Moreover, this study comprises various security characteristics of big data using FMEA: it analyzes four dimensions, identification and access management, device and application registration, infrastructure management, and data governance, as well as 20 subdimensions that represent failure modes. Therefore, this work can be expected to serve as a guideline for managing big data failures in practice.

It is worth stating that the results presented greater awareness of data governance for ensuring appropriate controls. In this context, a challenge to the process of governing big data is to categorize, model, and map data as it is captured and stored, mainly because of the unstructured nature of the volume of information. Then, one role of data governance in the information security context is to allow for the information that contributes to reporting to be defined consistently across the organization in order to guide and structure the most important activities and to help clarify decisions. Briefly, analyzing data from the distant past to decide on a current situation does not mean that the data has higher value. From another perspective, increasing volume does not guarantee confidence in decisions, and one may use tools such as data mining and knowledge discovery, proposed in [73], to improve the decision process.

Indeed, the concept of storage management is a critical point, especially when volumes of data that exceed the storage capacity are considered [11]. In fact, the emphasis of big data analytics is on how data is stored in a distributed fashion, for example, in traditional databases or in a cloud [66]. When a cloud is used, data can be processed in parallel on many computing nodes, in distributed environments across clusters of machines [3]. In conclusion, big data security must be seen as an important and challenging issue, one capable of imposing significant limitations. For instance, the many electronic devices that communicate via networks, especially via the Internet, together with the strong emphasis on mobile trends, increase the volume, variety, and even speed of data, which can thereby be characterized as big data content. This fact adds more value to large volumes of data and allows them to support organizational activities, lending even more importance to the area of data processing, which now tends to work in a connected way that goes beyond the boundaries of companies.

This research contributes as a guide for researchers in the analysis of suitable big data risk techniques and in the development of response to the insufficiency of existing solutions. This risk model can ensure the identification of failure and attacks and help the victim decide how to react when this type of attack occurs. However, this study has limitations. For instance, it does not measure the consequences of a disaster occurring in the field of big data. This measurement could be carried out based on [74]. Future work should focus on developing a model to ensure the working of data governance and should recommend specific actions to ensure the safety of big data and to help managers choose the best safeguards to reduce risks. Further studies may also consider security-related issues in the fields of enterprise architecture, information infrastructure, and cloud-based computing.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This research was partially supported by Universidade Federal de Pernambuco and GPSID, Decision and Information Systems Research Group.