Abstract

Big data has been studied extensively in recent years. As data size increases, data quality becomes a priority. Evaluating data quality is important for data management, which in turn influences data analysis and decision making. Data validity is an important aspect of data quality evaluation. Based on the 3V properties of big data, the dimensions that have a major influence on data validity in a big data environment are analyzed. Each data validity dimension is analyzed qualitatively using medium logic. The measure of medium truth degree is then used to propose models that measure single and multiple dimensions of big data validity. The resulting validity evaluation method, grounded in medium logic, is more reasonable and scientific than general methods.

1. Introduction

Big data has been studied extensively in recent years, and several investigations have focused on the big data phenomenon [1–7]. The leading international journals Nature [8] and Science [9] devoted special issues to 'big data' in 2008 and 'dealing with data' in 2011, respectively, which stimulated enthusiasm for exploring big data. However, there is no universal definition of big data in academia. In a literal sense, the most fundamental nature of big data lies in its large size, but it also involves a high degree of complexity in data collection, management, and processing. The "big" of big data is mainly reflected in three aspects [10–12]: the data volume is large (Volume); the complexity of data types is high (Variety); and the data flow, especially the generation of information flow on the Internet, is fast (Velocity). These 3V properties are now widely accepted as a description of big data. Some researchers also add the potentially huge value of the data (Value), extending 3V to 4V.

Although big data is valuable, it is a challenge to unlock the potential from the large amount of data‎ [13]. High quality is a prerequisite for unlocking big data potential since only a high-quality big data environment yields implicit, accurate, and useful information that helps make correct decisions. Even state-of-the-art data analysis tools cannot extract useful information from an environment fraught with “rubbish” ‎[14, 15]. However, it is difficult to maintain high quality because big data is varied, complicated, and dynamic. This highlights a need for the analysis and evaluation of big data quality while constructing a high-quality big data environment.

Data quality involves many dimensions, including data validity, timeliness, fuzziness, objectivity, usefulness, availability, user satisfaction, ease of use, and understandability. Data validity is particularly important in the evaluation of data quality. It is a priority due to the massive data size, the increased demand for data processing, and the broad variety of data types. However, few studies have addressed the evaluation of data validity [16, 17]. Wei Meng proposed measuring data validity using the update frequency [18]. The update frequency of data is indeed a dimension of data quality, but it reflects the novelty of the data rather than its validity. Qingyun et al. proposed evaluating data validity by formulating a constraint on the dataset [19]; the constraint checks whether a value falls within a range compliant with the truth. This constraint captures one dimension of data validity, but it is not comprehensive. In [20], Jie et al. devised constraints using three kinds of rules (static, transaction, and dynamic) and evaluated data validity by measuring the degree to which the rules were satisfied. However, their method is tailored to restricting rules in GIS applications, so it is too specialized to generalize. Moreover, because of the special attributes of big data, these methods are not entirely suitable for a big data environment. To the best of our knowledge, there is no method for the qualitative and quantitative analysis of big data validity.

In this paper, we first comprehensively analyze the dimensions that have a major influence on data validity based on the 3V properties of big data. Data validity refers to the degree to which data meets the demands of users or enterprises; it indicates whether data satisfies user-defined conditions or falls within a user-defined range. Problems of completeness, correctness, and compatibility are particularly serious in a big data environment and become the primary factors affecting data validity. Hence, big data validity is measured in this paper from the perspectives of completeness, correctness, and compatibility. Next, a qualitative analysis of each dimension of data validity is performed using medium logic. Finally, the measure of medium truth degree (MMTD) is used to propose models that measure single and multiple dimensions of big data validity. Our model for measuring one dimension of big data validity is based on medium logic, and this logical foundation makes the evaluation results more reasonable and scientific.

2. Overview of Medium Mathematics Systems

The medium principle was established by Wujia Zhu and Xi'an Xiao in the 1980s. They devised medium logic tools [21] to build the medium mathematics system, the cornerstone of which is the medium axiomatic set theory [22].

2.1. Notations for Medium Mathematics Systems

In the medium mathematics system [21], a predicate (concept or property) is represented by P, and any variable is denoted as x, with x completely possessing property P described as P(x). The symbol "╕" stands for the inverse opposite negative and is read as "opposite to"; the inverse opposite of a predicate is denoted as ╕P, so a pair of inverse opposites is represented by P and ╕P together. The symbol "∼" denotes the fuzzy negative, which reflects the medium state of "either this or that" or "both this and that" in the transition process between opposites; the fuzzy negative profoundly reflects fuzziness. In addition, a truth-value degree connective is used to describe the difference in truth degree between two propositions.

2.2. Measuring of Medium Truth Degree
2.2.1. Measuring of Individual Medium Truth Degree

According to the concept of the super state [23], the numerical value area of generally applicable quantification is divided into five areas corresponding to the predicate truth scale, namely, ╕+P, ╕P, ∼P, P, and +P. In the "true" numerical value area T, α_T is the standard scale of predicate P; in the "false" numerical value area F, α_F is the standard scale of predicate ╕P. Let f(x) be an arbitrary numeric function of the variable x. According to the numeric interval of f(x), the distance ratio function h_T (or h_F), which scales the individual truth degree, is defined. Adopting the concept of distance and using the length of the numerical value interval between the predicate truth scales as the norm, the distance ratio functions are defined, and from them the individual truth degree functions are established as follows [23].

For f(X) ⊆ R and y = f(x) ∈ f(X), the distance ratio h_T(y) with respect to P is

h_T(y) = 1,  if y ∈ [α_T, +∞);
h_T(y) = d(y, α_F) / d(α_T, α_F),  if y ∈ (α_F, α_T);      (1)
h_T(y) = 0,  if y ∈ (−∞, α_F].

For f(X) ⊆ R and y = f(x) ∈ f(X), the distance ratio h_F(y) with respect to ╕P is

h_F(y) = 0,  if y ∈ [α_T, +∞);
h_F(y) = d(y, α_T) / d(α_T, α_F),  if y ∈ (α_F, α_T);      (2)
h_F(y) = 1,  if y ∈ (−∞, α_F],

where d(a, b) is the Euclidean distance and the standard scales satisfy α_F < α_T.

The bigger the value of h_T(y), the higher the individual truth degree related to P; the bigger the value of h_F(y), the higher the individual truth degree related to ╕P.
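To make the distance ratio concrete, the following Python sketch implements h_T and h_F under the assumption that α_F < α_T; the parameter names alpha_T and alpha_F are illustrative stand-ins for the standard scales.

```python
def h_T(y, alpha_T, alpha_F):
    """Individual truth degree of y with respect to P (sketch, assumes alpha_F < alpha_T)."""
    if y >= alpha_T:                      # "true" numerical value area T
        return 1.0
    if y <= alpha_F:                      # "false" numerical value area F
        return 0.0
    # transition area: distance ratio, with d(a, b) = |a - b| in one dimension
    return abs(y - alpha_F) / abs(alpha_T - alpha_F)


def h_F(y, alpha_T, alpha_F):
    """Individual truth degree of y with respect to the opposite predicate (sketch)."""
    if y >= alpha_T:
        return 0.0
    if y <= alpha_F:
        return 1.0
    return abs(y - alpha_T) / abs(alpha_T - alpha_F)
```

For example, with alpha_F = 0.2 and alpha_T = 0.8, h_T(0.5, 0.8, 0.2) evaluates to 0.5, i.e., the value sits exactly in the middle of the transition area.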

2.2.2. Measuring of Set Medium Truth Degree

Let f: X → R^n be the n-dimensional numerical mapping of the set X. The truth degree of a discrete set X with respect to P (or ╕P) can be scaled by the additivity of the truth degree Σh_T (or Σh_F) [23, 24] and the average additivity of the truth degree h̄_T (or h̄_F) [23, 24] of the set with respect to P (or ╕P).

When X = {x_1, x_2, …, x_k}, the additivity of the truth degree of the discrete set X with respect to P is

Σh_T(X) = Σ_{i=1}^{k} h_T(f(x_i)).      (3)

The average additivity of the truth degree of the discrete set X with respect to P is

h̄_T(X) = (1/k) Σ_{i=1}^{k} h_T(f(x_i)).      (4)

The additivity of the truth degree of the discrete set X with respect to ╕P is

Σh_F(X) = Σ_{i=1}^{k} h_F(f(x_i)).      (5)

The average additivity of the truth degree of the discrete set X with respect to ╕P is

h̄_F(X) = (1/k) Σ_{i=1}^{k} h_F(f(x_i)).      (6)
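Continuing the sketch above, the additive and average-additive truth degrees of a finite set can be obtained by summing or averaging the individual truth degrees of its elements; this is only an illustration of the definitions, with h_T taken from the previous sketch (the h_F analogues are obtained the same way).

```python
def sum_h_T(values, alpha_T, alpha_F):
    """Additivity of the truth degree of a discrete set with respect to P (sketch)."""
    return sum(h_T(y, alpha_T, alpha_F) for y in values)


def avg_h_T(values, alpha_T, alpha_F):
    """Average additivity of the truth degree of a discrete set with respect to P (sketch)."""
    return sum_h_T(values, alpha_T, alpha_F) / len(values)
```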

3. Qualitative Analysis of Big Data Validity

Data validity refers to the degree to which data meets the demands of users or enterprises. It describes whether data satisfies user-defined conditions or falls within a user-defined range.

3.1. Selection of Dimension for Big Data Validity Evaluation

A large amount of incompatible data is generated due to the 3V properties of big data. Furthermore, data correctness and completeness can be compromised during generation, transmission, and processing. These problems are particularly serious in a big data environment and become the primary factors that affect data validity. Hence, big data validity is measured in this paper from the perspectives of completeness, correctness, and compatibility.

3.2. Dimensions of Big Data Validity
3.2.1. Data Completeness

In Cihai (an encyclopedia of the Chinese language), completeness refers to the state where components or parts are maintained without being damaged. In the Collins English Dictionary and Oxford Dictionary, completeness is defined as the state including all the parts, etc., that are necessary: whole. In the 21st Century Unabridged English-Chinese Dictionary, completeness means including all parts, details, facts, etc. and with nothing missing.

A universal definition of big data completeness is lacking. In the context of a specific application, big data completeness can be defined as follows.

Definition 1. If data has n properties and each property has all necessary parts, it is regarded as complete. Otherwise, it is incomplete.

Definition 2. Completeness refers to the degree to which data is complete. It is denoted by C1.

Let R_1, R_2, …, R_n denote the n data properties and f(R_i) denote the completeness of property R_i. Note that f(R_i) takes different forms for different applications. For example, the completeness of a property can be taken as 0 if the property value is missing for some data and as 1 otherwise. Hence, f(R_i) can be defined as

f(R_i) = 1, if the value of property R_i is present;  f(R_i) = 0, if the value of property R_i is missing.      (7)

The importance of each data property varies with the application. Let w_1, w_2, …, w_n denote the weights of the n properties in an application, where

Σ_{i=1}^{n} w_i = 1,  w_i ≥ 0.      (8)

Consider data with n properties; its completeness C_1 is computed as the weighted sum of the completeness of all its properties:

C_1 = Σ_{i=1}^{n} w_i · f(R_i).      (9)
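As a hedged illustration of (7)-(9), the sketch below computes per-property completeness and the weighted sum C1; the record layout, property names, and the rule "missing means None or an empty string" are assumptions made only for this example.

```python
def property_completeness(record, prop):
    """f(R_i): 1 if the property value is present, 0 if it is missing (assumed rule)."""
    value = record.get(prop)
    return 0.0 if value is None or value == "" else 1.0


def completeness_C1(record, weights):
    """C1 as the weighted sum of per-property completeness; weights must sum to 1."""
    return sum(w * property_completeness(record, p) for p, w in weights.items())


# hypothetical record and weights
record = {"id": "A001", "name": "sensor-7", "timestamp": None}
weights = {"id": 0.5, "name": 0.3, "timestamp": 0.2}
print(completeness_C1(record, weights))  # 0.8, since only the timestamp is missing
```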

3.2.2. Data Correctness

In Cihai, correctness refers to compliance with truth, law, convention, and standard, the contrary of "wrongness". In the Collins English Dictionary and Oxford Dictionary, correctness is defined as accurate or true, without any mistakes. In the 21st Century Unabridged English-Chinese Dictionary, correctness means accurate, compliant with truth, and having no mistakes.

Currently, there is no universal definition for data correctness in the field of big data. Whether data is correct and the degree to which data is correct are defined as follows from the perspective of the application.

Definition 3. Consider data with n properties. If each property is compliant with a recognized standard or truth, it is regarded as correct. Otherwise, it is incorrect.

Definition 4. Correctness refers to the degree to which data is correct. It is denoted by C2.

Let R_1, R_2, …, R_n denote the n data properties and Z(R_i) denote the correctness of property R_i. If the value of R_i falls within a range compliant with the truth, the correctness of this property is 1; otherwise, it is 0. The correctness of a property, Z(R_i), is therefore defined as

Z(R_i) = 1, if the value of property R_i is compliant with the truth;  Z(R_i) = 0, otherwise.      (10)

Data correctness C_2 is computed as the weighted sum over all properties:

C_2 = Σ_{i=1}^{n} w_i · Z(R_i),      (11)

where w_i denotes the weight of each property in the application and satisfies (8).
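A similar sketch for (10)-(11), assuming that the "range compliant with the truth" is supplied per property as a numeric interval; the interval values and records below are hypothetical.

```python
def property_correctness(value, valid_range):
    """Z(R_i): 1 if the value lies in the range compliant with the truth, otherwise 0."""
    low, high = valid_range
    return 1.0 if value is not None and low <= value <= high else 0.0


def correctness_C2(record, valid_ranges, weights):
    """C2 as the weighted sum of per-property correctness; weights must sum to 1."""
    return sum(w * property_correctness(record.get(p), valid_ranges[p])
               for p, w in weights.items())


# hypothetical sensor record with one out-of-range reading
record = {"temperature": 21.5, "humidity": 130.0}
valid_ranges = {"temperature": (-40.0, 60.0), "humidity": (0.0, 100.0)}
weights = {"temperature": 0.6, "humidity": 0.4}
print(correctness_C2(record, valid_ranges, weights))  # 0.6
```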

3.2.3. Data Compatibility

In Cihai, compatibility refers to coexistence without causing problems. In the 21st Century Unabridged English-Chinese Dictionary, compatibility means that ideas, methods, or things can be used together. In the case of big data, data compatibility is defined as follows.

Definition 5. If a group of data is of the same type and describes the same object consistently, the data is regarded as compatible with one another; otherwise, it is mutually exclusive.

Definition 6. Compatibility C_3 refers to the degree to which a group of data is compatible with one another. Compatibility C_3 is defined as

C_3 = (N − N_inc) / N,      (12)

where N denotes the total amount of data in the group and N_inc denotes the amount of incompatible data in the group.
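Definition 6 translates directly into a ratio; the sketch below assumes the amount of incompatible data in the group has already been counted.

```python
def compatibility_C3(total_count, incompatible_count):
    """C3: proportion of a group that is mutually compatible, per Definition 6 (sketch)."""
    if total_count == 0:
        return 0.0  # convention for an empty group; not specified in the text
    return (total_count - incompatible_count) / total_count


print(compatibility_C3(1000, 50))  # 0.95
```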

4. Medium Truth Degree-Based Model for Measuring Big Data Validity

4.1. Data Normalization

Data variety is a significant aspect of big data. In addition to traditional structured data, a large amount of nonstructured and semistructured data has been generated by advances in the Internet and the Internet of Things (IoT). Examples include website data, sensed data, audio data, image data, and signal data, as shown in Figure 1. While this variety enriches content, it makes data more challenging to store, analyze, and evaluate. Data therefore needs to be normalized before big data validity can be evaluated appropriately.

Structured and nonstructured data in a big data environment have different content, forms, and structures, so they cannot be managed uniformly. Hence, a data model needs to be developed to provide a uniform description of both structured and nonstructured data.

Based on [25], a tetrahedron data model is proposed for nonstructured data. The model consists of four parts: basic property, semantic feature, bottom-layer feature, and original document. In order to process structured and nonstructured data uniformly, a new part, data type, is introduced to describe the document type. Consider an audio document as an example of nonstructured data. Its data type is audio document. Its basic property includes the document name and intuitive information such as document size and creation time. Its semantic feature is the information carried in the document. Its bottom-layer features are audio frequency and bandwidth. Structured data, by contrast, does not have a basic property, semantic feature, or bottom-layer feature; it is stored directly in the original document part. Semistructured data, such as an XML document, contains some structured content whose structure is dynamic, so it is difficult to store such data by constructing a mapping table. Fortunately, the content can be extracted to form a string, enabling it to be stored in the database like structured data.

In this manner, structured and nonstructured data can be stored in the database uniformly. For nonstructured data like an image, the content can be analyzed using a description of the image in terms of the basic property, semantic feature, and bottom-layer feature. Structured and semistructured data can be analyzed directly.
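A minimal sketch of the extended tetrahedron record described above, written as a Python dataclass; the class and field names are illustrative choices, not part of the original model.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class UnifiedRecord:
    """Uniform description of structured, semistructured, and nonstructured data (sketch)."""
    data_type: str                                             # e.g., "audio", "image", "structured"
    basic_property: dict = field(default_factory=dict)         # name, size, creation time, ...
    semantic_feature: Optional[str] = None                     # information carried by the document
    bottom_layer_feature: dict = field(default_factory=dict)   # e.g., frequency, bandwidth
    original_document: Optional[bytes] = None                  # raw content or serialized string


# an audio document described by the model (hypothetical values)
audio = UnifiedRecord(
    data_type="audio",
    basic_property={"name": "meeting.wav", "size_mb": 12.4, "created": "2018-03-01"},
    semantic_feature="recording of a weekly project meeting",
    bottom_layer_feature={"frequency_hz": 44100, "bandwidth_khz": 20},
)
```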

4.2. Determination of Logical Predicate and True-Value Range
4.2.1. Determination of Logical Predicate

In order to evaluate data completeness, correctness, and compatibility, let the predicate W denote a high degree, ╕W a low degree, and ∼W the transition between them. The correspondence between the numerical ranges and the predicates is shown in Figure 2.

4.2.2. Determination of Logical Interval

Weights need to be allocated to the completeness and correctness of data in an application. Data usefulness is not compromised as long as the major properties exist, even if a subordinate property is missing. Based on the proportions of major and subordinate properties, the values A and B are computed from the property weights, where w_i denotes a weight and m denotes the largest weight among the subordinate properties. Assume that the weights of the n properties are sorted in descending order, w_1 ≥ w_2 ≥ … ≥ w_n, with the larger weights belonging to the major properties and the smaller weights to the subordinate properties. The value of m is determined as follows: sort all weights and compute the sum of the weights, starting with the smallest, for as long as the sum is no larger than a single major-property weight; the weights included in this sum are treated as subordinate, and the largest of them is m.

4.3. Model for Measuring One Dimension of Big Data Validity

The weight of each property in each dimension of the data is first determined to obtain the correspondence between the numerical range of that dimension and the logical predicates high degree (W), low degree (╕W), and transition (∼W), as shown in Figure 2. The distance ratio function with respect to W is then selected as the model to measure completeness:

where f(C) is defined as in (9), (11), or (12). Take the completeness measuring model as an example for the analysis: f(C) in (15) is C_1 in (9), and the completeness measuring model is h_T(C_1). If the value of data completeness falls in the false range (low degree of logic truth ╕W), the measured value is 0, which means that data is missing. If it falls in the true range (high degree of logic truth W), the measured value is 1, which means that data is complete. If it falls in the transition range (medium degree of logic truth ∼W), the measured value lies between 0 and 1; the closer it is to 1, the more complete the data, and the closer it is to 0, the more data is missing.

The model for measuring data correctness or compatibility is similar to the model for completeness. The model measures data correctness when f(C) in (15) is C2 in (11) and measures data compatibility when f(C) in (15) is C3 in (12).
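The one-dimension measuring model can be sketched as the distance ratio function applied to a dimension value in [0, 1]; here A and B are assumed to be the lower and upper boundaries of the transition range (A < B), as determined in Section 4.2.2.

```python
def measure_dimension(c_value, A, B):
    """Distance-ratio model for one validity dimension (sketch, assumes A < B).

    c_value is C1, C2, or C3; values at or below A map to the low-degree area (0),
    values at or above B map to the high-degree area (1), and values in between
    are scaled linearly, mirroring the behavior described for (15).
    """
    if c_value >= B:
        return 1.0
    if c_value <= A:
        return 0.0
    return (c_value - A) / (B - A)


print(measure_dimension(0.6, A=0.4, B=0.8))  # 0.5, inside the transition range
```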

4.4. Multidimension Model for Measuring the Integrated Value of Big Data Validity

For a set of K data, completeness and correctness can be measured by the average additive truth scales h_{kT-M}(C_1) and h_{kT-M}(C_2), which are defined as

h_{kT-M}(C_1) = (1/K) Σ_{i=1}^{K} h_T(C_1(i)),  h_{kT-M}(C_2) = (1/K) Σ_{i=1}^{K} h_T(C_2(i)),      (16)

where C_1(i) and C_2(i) denote the completeness and correctness of the i-th element in the data set, as defined in (9) and (11).

For a data set in a big data application, the integrated value of data validity can be measured by the weighted sum of the metric values of the individual dimensions. Hence, an integrated multidimension model H for measuring data validity in a big data application is

H = λ_1 · h_{kT-M}(C_1) + λ_2 · h_{kT-M}(C_2) + λ_3 · h_T(C_3),      (17)

where h_{kT-M}(C_1), h_{kT-M}(C_2), and h_T(C_3) denote the measured completeness, correctness, and compatibility, respectively, and λ_1, λ_2, and λ_3 denote the weights of completeness, correctness, and compatibility for the given application. Thus, we have

λ_1 + λ_2 + λ_3 = 1.      (18)
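As a hedged end-to-end sketch of (16)-(18), the function below averages the measured completeness and correctness over a set of K records and combines them with the measured compatibility of the group; it reuses measure_dimension from the previous sketch, and the weight tuple is assumed to sum to 1.

```python
def integrated_validity(c1_values, c2_values, c3_value, dim_weights, A, B):
    """Integrated validity H of a data set (sketch).

    c1_values, c2_values: per-record completeness and correctness, as in (9) and (11);
    c3_value: compatibility of the whole group, as in (12);
    dim_weights: (w1, w2, w3) for completeness, correctness, compatibility, summing to 1.
    """
    k = len(c1_values)
    h_c1 = sum(measure_dimension(c, A, B) for c in c1_values) / k  # average additive truth degree
    h_c2 = sum(measure_dimension(c, A, B) for c in c2_values) / k
    h_c3 = measure_dimension(c3_value, A, B)
    w1, w2, w3 = dim_weights
    return w1 * h_c1 + w2 * h_c2 + w3 * h_c3


# hypothetical data set of three records
H = integrated_validity([0.9, 0.7, 1.0], [1.0, 0.8, 0.6], 0.95,
                        dim_weights=(0.4, 0.4, 0.2), A=0.4, B=0.8)
print(H)  # 0.9
```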

Compared with the tetrahedron evaluation models, the proposed model has both similarities and differences. The idea of the multidimension model for measuring data validity in a big data application in (17) is similar to that of the tetrahedron evaluation models, but the two differ in how each dimension is measured. Our model for measuring one dimension of big data validity is based on medium logic, and this logical foundation makes the evaluation results more reasonable and scientific.

5. Conclusions

Medium mathematics systems are introduced for the evaluation of big data validity, and a medium logic-based data validity evaluation method is proposed. The contributions of this paper are as follows. (1) Based on the 3V properties of big data, the dimensions that have a major influence on data validity are determined, and data completeness, correctness, and compatibility are defined. (2) A medium truth degree-based model is proposed to measure each dimension of data validity. (3) A medium truth degree-based multidimension model is proposed to measure the integrated value of data validity. In the future, other factors that influence big data quality will be studied and corresponding measurement models will be developed.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the State Key Laboratory of Smart Grid Protection and Control of China (2016, no. 10) and the National Natural Science Foundation of China no. 61170322, no. 61373065, and no. 61302157.