Mathematical Problems in Engineering

Volume 2018, Article ID 8058670, 6 pages

https://doi.org/10.1155/2018/8058670

## Big Data Validity Evaluation Based on MMTD

^{1}School of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210023, China

^{2}State Key Laboratory of Smart Grid Protection and Control, Nanjing 211106, China

Correspondence should be addressed to Ningning Zhou; zhounn@njupt.edu.cn

Received 7 November 2017; Revised 23 March 2018; Accepted 10 April 2018; Published 10 June 2018

Academic Editor: Ester Zumpano

Copyright © 2018 Ningning Zhou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Big data has been studied extensively in recent years. With the increase in data size, data quality becomes a priority. Evaluation of data quality is important for data management, which influences data analysis and decision making. Data validity is an important aspect of data quality evaluation. Based on 3V properties of big data, dimensions that have a major influence on data validity in a big data environment are analyzed. Each data validity dimension is analyzed qualitatively using medium logic. The measuring of medium truth degree is used to propose models to measure single and multiple dimensions of big data validity. The validity evaluation method based on medium logic is more reasonable and scientific than general methods.

#### 1. Introduction

Big data has been studied extensively in recent years and several investigations have focused on the big data phenomenon [1–7]. The top international journals *Nature* [8] and *Science* [9] devoted special issues to ‘big data’ and ‘dealing with data’ in 2008 and 2011, respectively, which stimulated widespread enthusiasm for big data research. However, there is no universal definition of big data in academia. In a literal sense, the most fundamental characteristic of big data is its large size, but big data also involves a high degree of complexity in data collection, management, and processing. The “big” of big data is mainly reflected in three aspects [10–12]: the data volume is large (Volume); the complexity of data types is high (Variety); and the data flow, especially the information flow generated on the Internet, is fast (Velocity). These 3V properties are now widely accepted as a description of big data. Some researchers also add the potentially huge value of the data (Value), extending 3V to 4V.

Although big data is valuable, it is a challenge to unlock the potential from the large amount of data [13]. High quality is a prerequisite for unlocking big data potential since only a high-quality big data environment yields implicit, accurate, and useful information that helps make correct decisions. Even state-of-the-art data analysis tools cannot extract useful information from an environment fraught with “rubbish” [14, 15]. However, it is difficult to maintain high quality because big data is varied, complicated, and dynamic. This highlights a need for the analysis and evaluation of big data quality while constructing a high-quality big data environment.

Data quality involves many dimensions, including data validity, timeliness, fuzziness, objectivity, usefulness, availability, user satisfaction, ease of use, and understandability. Data validity is particularly important in the evaluation of data quality. It is a priority due to the massive data size, increased demand for data processing, and broad variety of data types. However, few studies have addressed the evaluation of data validity [16, 17]. Wei Meng proposed to measure data validity using the update frequency [18]. The update frequency of data is indeed a dimension of data quality; however, it reflects the novelty of the data rather than its validity. Qingyun et al. proposed to evaluate data validity by formulating a constraint on the dataset [19]: data is considered valid if it falls within a range compliant with the truth. This constraint captures one dimension of data validity, but it is not comprehensive. In [20], Jie et al. proposed to devise constraints using three kinds of rules (static, transaction, and dynamic) and evaluated data validity by measuring the degree to which the rules were satisfied. However, that method focuses on restricting rules for GIS; it is too specialized to generalize, and methods for data validity evaluation vary with the application. Moreover, due to the special attributes of big data, these methods are not entirely suitable for big data. To the best of our knowledge, there is no method for qualitative and quantitative analysis of big data validity.

In this paper, first, we comprehensively analyze the dimensions that have a major influence on data validity based on the 3V properties of big data. Data validity refers to the degree to which data meets the needs of users or enterprises; it indicates whether data satisfies a user-defined condition or falls within a user-defined range. Problems of completeness, correctness, and compatibility are particularly serious in a big data environment and become the primary factors that affect data validity. Hence, big data validity is measured in this paper from the perspectives of completeness, correctness, and compatibility. Next, a qualitative analysis of each dimension of data validity is performed using medium logic. Finally, the measure of medium truth degree (MMTD) is used to propose models for measuring single and multiple dimensions of big data validity. Our model for measuring a single dimension of big data validity is based on medium logic; logical correctness ensures that the evaluation results are more reasonable and scientific.

#### 2. Overview of Medium Mathematics Systems

The medium principle was established by Wujia Zhu and Xi’an Xiao in the 1980s, who devised medium logic tools [21] to build the medium mathematics system, the cornerstone of which is the medium axiomatic set theory [22].

##### 2.1. Notations for Medium Mathematics Systems

In the medium mathematics system [21], a predicate (concept or property) is represented by P; any variable is denoted as *x*, with *x* completely possessing property P described as P(*x*). The symbol “*╕*” stands for the inverse opposite negative and is read as “opposite to”; the inverse opposite of predicate P is denoted as *╕*P, so a pair of inverse opposites is represented by P and *╕*P. The symbol “∼” denotes the fuzzy negative, which reflects the medium state of “either … or” or “both this and that” in the opposite transition process; the fuzzy negative profoundly reflects fuzziness. In addition, a truth-value degree connective describes the difference in truth between two propositions.

##### 2.2. Measuring of Medium Truth Degree

###### 2.2.1. Measuring of Individual Medium Truth Degree

According to the concept of super state [23], the numerical value area of generally applicable quantification is divided into five areas corresponding to the predicate truth scale, namely, *╕*^{+}P, *╕*P, ∼P, P, and P^{+}. In the “true” numerical value area T, *α*_{T} is the standard scale of predicate P; in the “false” numerical value area F, *α*_{F} is the standard scale of predicate *╕*P.

*╕*P*f(x)*is an arbitrary numeric function of variable

*x.*According to the numeric interval of

*f(x),*the distance ratio function

*h*

_{T}(or

*h*

_{F}) which can scale the individual truth degree is defined. Adopting the concept of distance and using length of numerical value interval to different predicate truth as norm, the distance ratio function is defined, and from this the individual truth degree function is established as follows [23].

For *f(X)* ⊆ *R* and *y* = *f(x)* ∈ *f(X)*, the distance ratio *h*_{T}(*y*) which relates to P is

*h*_{T}(*y*) = *d*(*y*, *α*_{F}) / *d*(*α*_{T}, *α*_{F})

For *f(X)* ⊆ *R* and *y* = *f(x)* ∈ *f(X)*, the distance ratio *h*_{F}(*y*) which relates to *╕*P is

*h*_{F}(*y*) = *d*(*y*, *α*_{T}) / *d*(*α*_{T}, *α*_{F})

where *d*(*a*, *b*) is the Euclidean distance.

The larger the value of *h*_{T}(*y*), the higher the individual truth degree related to P; the larger the value of *h*_{F}(*y*), the higher the individual truth degree related to *╕*P.
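As an illustration, the two distance ratio functions can be sketched in Python for the one-dimensional case, assuming a single numeric mapping with standard scales *α*_{F} < *α*_{T}; the function names and the clamping to [0, 1] (for values lying in the super-state areas beyond the standard scales) are our own illustrative choices:

```python
def h_T(y, alpha_F, alpha_T):
    """Distance ratio of y relative to predicate P: distance from the
    'false' standard scale, normalized by the interval length."""
    ratio = (y - alpha_F) / (alpha_T - alpha_F)
    return min(max(ratio, 0.0), 1.0)  # clamp into [0, 1]

def h_F(y, alpha_F, alpha_T):
    """Distance ratio of y relative to the inverse opposite predicate."""
    ratio = (alpha_T - y) / (alpha_T - alpha_F)
    return min(max(ratio, 0.0), 1.0)
```

For example, with *α*_{F} = 0 and *α*_{T} = 100, a value y = 75 gives h_T = 0.75 and h_F = 0.25; values beyond the standard scales are clamped to 1 or 0.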

###### 2.2.2. Measuring of Set Medium Truth Degree

*f*: X → *R*^{n} is the n-dimensional numerical mapping of the set X. The truth scale of a discrete set X relative to P (or *╕*P) can be measured by the additivity of the truth scale [23, 24], *H*_{T}(X) (or *H*_{F}(X)), and the average additivity of the truth scale [23, 24], *H̄*_{T}(X) (or *H̄*_{F}(X)).

When X = {*x*_{1}, *x*_{2}, …, *x*_{m}}, the additivity of the truth degree of the discrete set X which relates to P is

*H*_{T}(X) = ∑_{i=1}^{m} *h*_{T}(*f*(*x*_{i}))

The average additivity of the truth degree of the discrete set X which relates to P is

*H̄*_{T}(X) = (1/*m*) ∑_{i=1}^{m} *h*_{T}(*f*(*x*_{i}))

The additivity of the truth degree of the discrete set X which relates to *╕*P is

*H*_{F}(X) = ∑_{i=1}^{m} *h*_{F}(*f*(*x*_{i}))

The average additivity of the truth degree of the discrete set X which relates to *╕*P is

*H̄*_{F}(X) = (1/*m*) ∑_{i=1}^{m} *h*_{F}(*f*(*x*_{i}))
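Under the same illustrative assumptions (a one-dimensional mapping with standard scales *α*_{F} < *α*_{T}, and a clamped distance ratio restated here so the sketch is self-contained), the additive and average truth degrees of a discrete set can be sketched as:

```python
def h_T(y, alpha_F, alpha_T):
    """Individual truth degree of y relative to P (clamped distance ratio)."""
    return min(max((y - alpha_F) / (alpha_T - alpha_F), 0.0), 1.0)

def additive_truth_degree(xs, f, alpha_F, alpha_T):
    """H_T(X): sum of individual truth degrees h_T(f(x)) over the discrete set X."""
    return sum(h_T(f(x), alpha_F, alpha_T) for x in xs)

def average_truth_degree(xs, f, alpha_F, alpha_T):
    """Average additivity: H_T(X) divided by the size of the set."""
    return additive_truth_degree(xs, f, alpha_F, alpha_T) / len(xs)
```

With X = {0, 50, 100}, *f* the identity, *α*_{F} = 0, and *α*_{T} = 100, the additive truth degree is 1.5 and the average is 0.5; the *╕*P versions follow by swapping the roles of the two standard scales.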

#### 3. Qualitative Analysis of Big Data Validity

Data validity refers to the degree to which data meets the demands of users or enterprises. It is used to describe whether data satisfies user-defined conditions or falls within a user-defined range.

##### 3.1. Selection of Dimension for Big Data Validity Evaluation

A large amount of incompatible data is generated due to the 3V properties of big data. Furthermore, data correctness and completeness can be compromised during generation, transmission, and processing. These problems are particularly serious in a big data environment and become the primary factors that affect data validity. Hence, big data validity is measured in this paper from the perspectives of completeness, correctness, and compatibility.

##### 3.2. Dimensions of Big Data Validity

###### 3.2.1. Data Completeness

In Cihai (an encyclopedia of the Chinese language), completeness refers to the state where components or parts are maintained without being damaged. In the Collins English Dictionary and Oxford Dictionary, completeness is defined as the state including all the parts, etc., that are necessary: whole. In the 21st Century Unabridged English-Chinese Dictionary, completeness means including all parts, details, facts, etc. and with nothing missing.

A universal definition of big data completeness is lacking. In the context of a specific application, big data completeness can be defined as follows.

*Definition 1. *If data has n properties and each property has all necessary parts, it is regarded as complete. Otherwise, it is incomplete.

*Definition 2. *Completeness refers to the degree to which data is complete. It is denoted by *C1*.

Let R_{1}, R_{2}, …, R_{n} denote the n data properties and *c*(R_{i}) denote the completeness of property R_{i}. Note that *c*(R_{i}) takes different forms in different applications. For example, the completeness of a property is zero if the property value is missing for some data, and 1 otherwise. Hence, *c*(R_{i}) can be defined as

*c*(R_{i}) = 1 if the value of property R_{i} is present, and *c*(R_{i}) = 0 if it is missing.

The importance of each data property varies with the application. Let *w*_{1}, *w*_{2}, …, *w*_{n} denote the weights for the n properties in an application, where ∑_{i=1}^{n} *w*_{i} = 1.

Consider data with n properties; its completeness is computed as the weighted sum of the completeness of all its properties:

*C1* = ∑_{i=1}^{n} *w*_{i} · *c*(R_{i})
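A minimal sketch of the weighted completeness measure in Python, treating a record as a dictionary and counting a property as complete when its value is present; the record layout and function name are illustrative assumptions, not part of the original formulation:

```python
def completeness(record, weights):
    """C1: weighted sum of per-property completeness.
    A property contributes 1 if its value is present, 0 if missing."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(w * (0.0 if record.get(prop) is None else 1.0)
               for prop, w in weights.items())
```

For instance, a (hypothetical) sensor record {"name": "meter-17", "voltage": 220.0, "timestamp": None} with weights {"name": 0.2, "voltage": 0.5, "timestamp": 0.3} yields C1 = 0.7, since only the timestamp is missing.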

###### 3.2.2. Data Correctness

In Cihai, correctness refers to compliance with truth, law, convention, and standard, contrary to “wrongness”. In the Collins English Dictionary and Oxford Dictionary, correctness is defined as accurate or true, without any mistakes. In the 21st Century Unabridged English-Chinese Dictionary, correctness means accurate, compliant with truth, and having no mistakes.

Currently, there is no universal definition for data correctness in the field of big data. Whether data is correct and the degree to which data is correct are defined as follows from the perspective of the application.

*Definition 3. *Consider data with n properties. If each property is compliant with a recognized standard or truth, it is regarded as correct. Otherwise, it is incorrect.

*Definition 4. *Correctness refers to the degree to which data is correct. It is denoted by C2.

Let R_{1}, R_{2}, …, R_{n} denote the n data properties and *z*(R_{i}) denote the correctness of property R_{i}. If the value of R_{i} falls within a range compliant with the truth, the correctness of this property is 1; otherwise, it is 0. Hence, *z*(R_{i}) is defined as

*z*(R_{i}) = 1 if the value of property R_{i} lies in a range compliant with the truth, and *z*(R_{i}) = 0 otherwise.

Data correctness C2 is computed as the weighted sum over all properties:

C2 = ∑_{i=1}^{n} *w*_{i} · *z*(R_{i})

where *w*_{i} denotes the weight of each property in the application and satisfies ∑_{i=1}^{n} *w*_{i} = 1.
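Correctness can be sketched analogously to completeness, using a per-property table of truth-compliant ranges; the range table, record layout, and function name are illustrative assumptions:

```python
def correctness(record, ranges, weights):
    """C2: weighted sum of per-property correctness.
    A property contributes 1 if its value lies in its truth-compliant range."""
    total = 0.0
    for prop, (lo, hi) in ranges.items():
        value = record.get(prop)
        z = 1.0 if value is not None and lo <= value <= hi else 0.0
        total += weights[prop] * z
    return total
```

For example, a (hypothetical) record {"voltage": 220.0, "frequency": 72.0} checked against ranges voltage ∈ [200, 240] and frequency ∈ [49, 51], with weights 0.6 and 0.4, yields C2 = 0.6, since the frequency value is out of range.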

###### 3.2.3. Data Compatibility

In Cihai, compatibility refers to coexistence without causing problems. In the 21st Century Unabridged English-Chinese Dictionary, compatibility means that ideas, methods, or things can be used together. In the case of big data, data compatibility is defined as follows.

*Definition 5. *If a group of data is of the same type and describes the same object consistently, the data is regarded as compatible with one another; otherwise, it is mutually exclusive.

*Definition 6. *Compatibility refers to the degree to which a group of data is compatible with one another. It is denoted by C3 and defined as

C3 = (N − M) / N

where N denotes the total amount of data in the group and M denotes the amount of incompatible data in the group.
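The compatibility measure of Definition 6 reduces to a simple share of compatible items in the group; a sketch (the function name is our own):

```python
def compatibility(total, incompatible):
    """C3: share of mutually compatible data in a group of `total` items,
    of which `incompatible` are mutually exclusive."""
    if total <= 0:
        raise ValueError("group must contain at least one item")
    return (total - incompatible) / total
```

A group of 200 records containing 30 incompatible ones thus has C3 = 0.85.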

#### 4. Medium Truth Degree-Based Model for Measuring Big Data Validity

##### 4.1. Data Normalization

Data variety is a significant aspect of big data. In addition to traditional structured data, a large amount of nonstructured and semistructured data has been generated by advances in the Internet and the Internet of Things (IoT). Examples include website data, sensed data, audio data, image data, and signal data, as shown in Figure 1. While this enriches content, it is more challenging to store, analyze, and evaluate data. Data needs to be normalized before appropriately evaluating big data validity.