Abstract

Big data as a derivative of information technology facilitates the birth of data trading. The technology surrounding the business value of big data has come into focus. However, most of the current research focuses on improving the performance of big data analytics algorithms. Data pricing is still one of the main issues in data trading. Therefore, we aim to tackle the problem of evaluating the utility of data in the big data trading market and the problem of maximizing the profits of the various roles involved in data trading. To this end, we propose a Multidimensional Data Utility Evaluation (MDDUE) method through three data quality dimensions, namely, data size, availability, and completeness. Next, we propose a big data trading market model including data providers, service providers, and service users. An optimal data-pricing scheme based on a three-party Stackelberg game is proposed to maximize the participants’ profits. Finally, a machine learning model is used to verify the rationality and validity of the MDDUE. The results show that MDDUE can evaluate the utility of data more accurately than previous work. The existence and uniqueness of the Nash equilibrium are demonstrated through numerical experiments.

1. Introduction

With the rapid development of the Internet of Vehicles (IoV), the Internet of Things (IoT), and other information technologies, the amount of data generated globally every day is aggregating and exploding [1, 2]. The huge volume and diverse sources of big data also make its application in various fields increasingly widespread [3, 4]. Big data is gradually becoming a basic resource alongside land and oil. However, emerging privacy concerns have prevented data owners from sharing their datasets [5]. In addition, the collection and storage of large amounts of data lead to security issues in the IoT [6]. Specifically, data is stored and maintained independently in different departments and isolated from each other, which hinders the effective circulation of data and prevents the value of big data from being fully realized. These problems lead to isolated data islands. To treat data resources as a commodity to be shared and circulated, it is important to establish an efficient data trading market [7]. The traditional big data trading market involves three roles: data owners, data trading platforms (such as the Data Marketplace, the Big Data Exchange, and the Microsoft Azure Marketplace), and data users. Data owners collect data through IoT sensors, the Internet, etc. [8]. They consign the raw data to data trading platforms for data users to browse and purchase [9]. But the traditional model of data trading has some problems: malicious data trading platforms or data consumers may illegally cache and resell the data of data owners. This is not only an issue of copyright but also of privacy [10, 11]. High value and accuracy must also be ensured when sharing sensing data [12]. As a result, how to promote the secure development of big data trading and build a digital economy with data as a key element has become a major challenge in the big data era.
The introduction of the concept of Big Data as a Service (BDaaS) [13] has enabled big data services to become a commodity instead of raw data. In the big data service-trading market, what the service users need is not the entire raw dataset, but the data-processing results or data services based on the dataset. Instead of sending the raw data to the service providers, the data providers process the raw data into various data products and data services for the data consumers to purchase and use [14]. For example, service providers can provide users with personalized recommendations based on big data services [15–17].

Few existing studies have considered the big data trading market from an economic perspective. The pricing strategies of data products or data services are difficult to define, and various pricing strategies are still incomplete. The big data trading market has not yet developed a uniform pricing standard. The literature [18] investigates the quantification of data utility from a data science perspective and proposes an optimal pricing scheme based on a big data trading market model. The literature [19] introduces the concept of the signal-noise ratio in the electronic information fields to evaluate the data utility. Based on [18, 19], the literature [14] evaluates the data utility in terms of two dimensions: data size and data noise level. However, we cannot overlook the impact that the data class-balanced ratio has on the utility of the data. Game theory is used in many scenarios as a practical tool to solve optimal problems (e.g., IoV) [20]. We complement the data utility evaluation method by introducing a class-balanced ratio. Meanwhile, we use game theory to maximize the profits of the three roles in the big data trading market.

In the big data trading market, the lifecycle of big data is mainly divided into five stages: data collection and uploading, data analysis, data pricing, data trading, and data protection [21, 22]. We can use blockchain to ensure the fairness of data trading [23–25]. Cryptographic methods can be used to guarantee the security of data [26, 27]. Our work mainly lies in the third stage.

The main contributions of this paper are as follows. (1) We propose the Multidimensional Data Utility Evaluation (MDDUE) method, which considers data size, availability, and completeness. The MDDUE takes more data quality dimensions into account and is more accurate than the method in the literature [14]. (2) We introduce a service-based big data trading market model. In addition, a pricing scheme based on the Stackelberg game is proposed to maximize the profits of the three parties in the big data trading market. (3) We validate the rationality and applicability of the MDDUE through a machine learning model. The existence and uniqueness of the Nash equilibrium are proven using backward induction, and the numerical experiments show that the proposed pricing scheme can maximize the profits of the three parties.

The rest of the paper is organized as follows. Section 2 presents the related work about data pricing. We introduce our scheme in Section 3, including the Multidimensional Data Utility Evaluation (MDDUE) method and the optimal data-pricing scheme. Section 4 shows the experiment results and analysis. Finally, we conclude this paper in Section 5.

2. Related Work

The emergence of data trading has facilitated the effective flow of data and provided a channel to fully exploit the value of big data. Information products are different from traditional goods in that they are easy to copy, modify, and spread. These characteristics make data trading different from traditional commodity trading and require appropriate specifications to address the specific features of data products. How to establish a uniform pricing strategy is one of the challenges facing the big data trading market. There are three main challenges to data pricing: diverse data sources, the complexity of data management, and the diversity of data [21]. The authors in [21] summarised the current data-pricing strategies and pricing models in the big data market. For example, data-pricing strategies are classified into six main categories: Free Data Strategy, Usage-Based Pricing Strategy, Package Pricing Strategy, Flat Pricing Strategy, Two-Part Tariff Strategy, and Freemium Strategy. Based on the strategies above, there are two main pricing models: the economic-based pricing model and the game theory-based pricing model.

Some scholars have tried to study data pricing from other perspectives. Koutris et al. [28] proposed query-based pricing, which generates the price of any query automatically, and their pricing algorithm satisfies no-arbitrage and no-discount conditions. Shen et al. [29] proposed a tuple granularity-based pricing model for personal big data. The model can be automatically adjusted according to the attributes that affect the value of the data. Inspired by information entropy, Li et al. [30] proposed a new data-pricing method based on data information entropy and gave a pricing function based on the results of that method. They conducted extensive experiments to verify the method, which inspires further research on the pricing mechanism of big data. Cai et al. [31] proposed a new privacy-preserving data trading framework for web-browsing histories. The framework takes into account the privacy preferences of different users and compensates the users for the privacy of their data according to the degree of privacy leakage. To reduce the heavy burdens and privacy leakage of data exchange in the IoT, Cai et al. [32, 33] proposed a novel framework for range-counting trading over IoT networks by jointly considering data utility, bandwidth consumption, and privacy preservation. However, none of the above methods takes into account the optimal profits of the participants.

Data quality is one of the factors that influence the quality of machine learning models, which opens up a new way of thinking about data pricing. Stahl and Vossen [34] summarised seven metrics for evaluating data quality: accuracy, amount of data, availability, completeness, latency, response time, and timeliness. Niyato et al. [18] and Yang et al. [19] built data utility evaluation functions via data size and noise level, respectively, in other words, the amount of data and availability. Xiao et al. [14] combined both dimensions to quantify the value of data, but in some cases this method cannot evaluate the data utility accurately. It has been found that the completeness of the data, measured by the imbalance ratio [35], also has a relatively large impact on the accuracy of machine learning models. Therefore, we introduce completeness, which we call the class-balanced ratio, into the data utility evaluation function.

3. System Model

We first describe the data utility evaluation method MDDUE in detail. Then, we describe a big data trading market and formulate an optimal pricing problem based on the Stackelberg game to maximize the profits of the participants. This study uses the method of Xiao et al., and the description of the method partly reproduces their wording [14]. The symbols commonly used in this paper are shown in Table 1 below.

3.1. MDDUE: Multidimensional Data Utility Evaluation

To price data, it is necessary to evaluate the utility of unstructured big data. CNN-based machine learning algorithms are increasingly used in a wide range of applications, such as face recognition, intrusion detection, and natural language processing [36, 37]. As an important technique for data analysis, machine learning is also an effective tool for evaluating the value of data. The process of machine learning providing data services to service users is shown in Figure 1.

The quality of the raw data is very important for machine learning models. A raw dataset with n tuples can be represented as D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where x_i is the feature set of the i-th data sample and y_i is its class label. Supervised learning is widely used in classification and prediction problems [38].

Next, we introduce the three dimensions that affect data quality, namely, data size, availability, and completeness. We adopt the nonnoise ratio of the data in place of availability and the class-balanced ratio of the data in place of completeness. By changing these data quality dimensions, different data versions can be customized to meet user demand for data service quality.

Data size s: we assume that the accuracy of the machine learning model is A_i when the dataset size is s_i, where i is the index of the experimental datasets. To determine the accuracy function A(s), we vary the data size while keeping all other conditions the same. After training the machine learning model on this series of datasets, we obtain a set of experimental points (s_i, A_i). We can then apply least squares, minimizing the mean-squared error, to find an accuracy function A(s) that fits these points.
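As a concrete sketch of this fitting step, the measurement points and the saturating functional form below are illustrative assumptions (the paper fits its own measurements), and a coarse grid search stands in for a dedicated least-squares solver:

```python
import math

# Hypothetical (illustrative) measurements: model accuracy observed at
# several normalized dataset sizes, all other conditions held fixed.
points = [(0.1, 0.62), (0.2, 0.71), (0.4, 0.79),
          (0.6, 0.83), (0.8, 0.85), (1.0, 0.86)]

# Assumed saturating form a*(1 - exp(-b*s)): monotonically increasing
# with diminishing marginal efficiency.
def acc_fn(s, a, b):
    return a * (1.0 - math.exp(-b * s))

def mse(a, b):
    return sum((acc_fn(s, a, b) - y) ** 2 for s, y in points) / len(points)

# Coarse grid search minimizing the mean-squared error (a stand-in for a
# proper least-squares fit).
a_hat, b_hat = min(((a / 100.0, b / 10.0)
                    for a in range(50, 151) for b in range(1, 101)),
                   key=lambda p: mse(*p))
```

The same procedure is reused for the other two dimensions, only with the data size replaced by the nonnoise ratio or the class-balanced ratio as the varying quantity.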

Data nonnoise ratio α: in the real world, labeling requires a certain amount of expertise. For example, in medical imaging, even experts have different opinions on the labels, so there will be noisy labels in the dataset [39]. The existence of noisy labels causes significant performance degradation of machine learning models and hence reduces the availability of the dataset [40]. That is why we use the data nonnoise ratio in place of availability.

The data nonnoise ratio is the proportion of the dataset that is free of noisy data. It is denoted as α = (s − m)/s, where m is the number of samples with noisy labels and s is the data size.

We construct datasets with different data nonnoise ratios by the following method: (1) select a fraction (1 − α) of the samples from the dataset at random; (2) replace the labels of these samples with other random labels.
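A minimal sketch of this construction (the function and variable names are ours, for illustration only):

```python
import random

def inject_label_noise(labels, alpha, num_classes, seed=0):
    """Flip the labels of a (1 - alpha) fraction of randomly chosen
    samples to a different random class, yielding nonnoise ratio alpha."""
    rng = random.Random(seed)
    labels = list(labels)
    n_noisy = round((1.0 - alpha) * len(labels))
    for i in rng.sample(range(len(labels)), n_noisy):
        # The replacement label must differ from the original one.
        labels[i] = rng.choice([c for c in range(num_classes) if c != labels[i]])
    return labels

clean = [i % 10 for i in range(1000)]
noisy = inject_label_noise(clean, alpha=0.8, num_classes=10)
flipped = sum(c != n for c, n in zip(clean, noisy))  # exactly 200 labels flipped
```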

Similarly, we assume that the accuracy of the machine learning model is A_j when the data nonnoise ratio is α_j, constructed with the method above. We can likewise acquire a set of experimental points (α_j, A_j) for different data nonnoise ratios and then determine the accuracy function A(α) by minimizing the mean-squared error.

Data class-balanced ratio β: in real-world datasets, the number of samples in one class may be much smaller than in other classes due to various reasons such as sampling difficulties, which also affects the accuracy of machine learning models [35]. The data class-balanced ratio is the degree of balance in the number of samples per class in the dataset. We define the class-balanced ratio as the inverse of the imbalance ratio in [35].
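Under this definition, the class-balanced ratio can be computed as the size of the smallest class divided by the size of the largest class; a short sketch (the helper is our own, for illustration):

```python
from collections import Counter

def class_balanced_ratio(labels):
    # Inverse of the imbalance ratio: min class size / max class size,
    # so a perfectly balanced dataset has ratio 1.
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())

labels = [0] * 500 + [1] * 500 + [2] * 250 + [3] * 250
beta = class_balanced_ratio(labels)  # 250 / 500 = 0.5
```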

Other things being equal, we assume that the accuracy of the machine learning model reaches A_k when the data class-balanced ratio is β_k. We determine the accuracy function A(β) using the set of experimental points (β_k, A_k) obtained by conducting experiments at different class-balanced ratios, as above.

All other things being equal, the higher the quality of the dataset, the higher the accuracy of the model. Therefore, the accuracy of the machine learning model can, to some extent, be equated with data utility [14]. We can obtain datasets with different data utilities by changing the data size, nonnoise ratio, and class-balanced ratio. The data utility can thus be expressed as a function U(s, α, β) of the three dimensions.

It is known that machine learning models are more accurate with a larger dataset size, nonnoise ratio, and class-balanced ratio. However, the accuracy saturates once these factors are large enough; further gains then require optimizing the model itself [40], which is not the focus of our work.

We assume that the data utility function has the following properties: (1) monotonic increase: ∂U/∂s > 0, ∂U/∂α > 0, and ∂U/∂β > 0; (2) diminishing marginal efficiency: ∂²U/∂s² < 0, ∂²U/∂α² < 0, and ∂²U/∂β² < 0.

So we assume a data utility function for each of the three impact factors individually, each satisfying the two properties above, with its constants determined as fitting parameters.

The aggregated data utility evaluation function is expressed by combining the three single-dimension functions, where the remaining constants are likewise fitting parameters.
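For illustration only, a product of three saturating factors satisfies the monotonicity and diminishing-marginal-efficiency properties stated above; the exponents and scale below are assumptions, not the fitted parameter values reported in the experiments:

```python
import math

def utility(s, alpha, beta, a=1.0, b1=3.0, b2=3.0, b3=3.0):
    # Each factor is monotonically increasing in its dimension with
    # diminishing returns; the product inherits both properties.
    return (a * (1 - math.exp(-b1 * s))
              * (1 - math.exp(-b2 * alpha))
              * (1 - math.exp(-b3 * beta)))

u_low = utility(0.2, 0.7, 0.5)   # small, noisy, imbalanced dataset
u_high = utility(1.0, 1.0, 1.0)  # full-size, clean, balanced dataset
```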

3.2. Optimal Pricing Based on Stackelberg Game

In this section, we first describe a big data market model for selling big data services. Then, we formulate an optimal data-pricing scheme based on the Stackelberg game to maximize the profits of each trading participant.

We consider the big data market where a data provider provides the data and a service provider provides the data service to the service user, as shown in Figure 2.

The data provider uses different tools and technologies (e.g., IoT sensors, multitarget detection in smart IoT, social media, smart devices, and social networks) to collect data [41–43]. The data provider then processes the raw data so that it provides the data utility the service user needs and charges the service provider accordingly. The service provider buys data from the data provider and uses the dataset to train different machine learning models to provide data services to the service user. We argue that the utility of the raw data can be equated to the value of the machine learning models [44]. The service user determines the optimal demand for data utility based on its profit function so as to maximize its profits.

In the big data trading market, service users have different data utility demands for different machine learning models, so uniform pricing for all data services is unreasonable. Therefore, service providers need to set prices separately for different service users. We explain the roles of the big data market model and their profit functions below. (i) Data provider: the raw data collected by the data provider incurs storage costs, communication costs, equipment maintenance costs, etc. For ease of calculation, we assume that the data provider's cost increases linearly with the utility of the data; the cost of data processing is denoted by c per unit of data utility. This costing approach is widely used in studies of cloud computing, the IoT, and Internet services [45]. The price per unit of data utility is denoted by p_d. The profit of the data provider can then be expressed as (p_d − c)u, where u is the data utility demand of the service user. (ii) Service provider: the service provider buys data of a certain data size, nonnoise ratio, and class-balanced ratio from the data provider and evaluates the utility of the data through MDDUE. It then trains the machine learning model with the data and provides the resulting data services or products to the service user, charging a subscription price p_s per unit of data utility, so its profit is (p_s − p_d)u; a higher data utility naturally commands a higher price. (iii) Service user: we assume that the service user is rational and only subscribes to a data service if its profit is positive. Service users derive economic value from the use of data services: the higher the data utility, the higher the reward. But the reward function should exhibit diminishing marginal utility; specifically, the rate of increase in the reward decreases as the data utility increases.
So we assume a reward function for the service user with diminishing marginal utility, whose experience parameters are set by the service user.

Therefore, the profit function of the service user can be expressed as the reward from the data service minus the subscription fee paid to the service provider.

The strategies made by the three roles influence each other. The interactions of the data provider, the service provider, and the service user can be modeled as a three-stage Stackelberg game [46, 47]. In the traditional Stackelberg game, the player that makes the first decision is called the leader. After the leader, the remaining players, called the followers, make their decisions according to the leader's decision, and so on until a Nash equilibrium is reached [48].

We consider our model a variant of the Stackelberg game. As shown in Figure 3, the game in this paper can be expressed in three stages. In stage 1, the data provider sets the unit data utility price to maximize its profits. Then, the service provider sets the price of the service in stage 2. In stage 3, the service user decides the demand for data utility. The three stages of the game are as follows:

Stage 1. The data provider determines the unit price of data utility to maximize its own profit.

Stage 2. Given the optimal data price, the service provider determines the price of the data service per unit of data utility.

Stage 3. The service user, as the follower, determines the data utility demand according to the optimal service subscription price to maximize its profits.

The subgame perfect equilibria of the Stackelberg game are usually solved by backward induction. At the Nash equilibrium, each player's strategy is optimal given the other players' strategies, so no player will change its strategy. The Stackelberg game is a dynamic game with complete information; we assume that each player has complete information about the other players in the game model.
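The backward-induction logic can be illustrated numerically. The concave reward r(u) = μ ln(1 + νu) and all parameter values below are our assumptions for illustration; the closed-form equilibrium in Theorem 1 depends on the paper's exact reward function.

```python
import math

mu, nu, cost = 10.0, 1.0, 0.5  # assumed reward parameters and unit cost

def user_demand(ps):
    # Stage 3: the service user maximizes mu*ln(1 + nu*u) - ps*u.
    # First-order condition mu*nu/(1 + nu*u) = ps gives u = mu/ps - 1/nu.
    return max(mu / ps - 1.0 / nu, 0.0)

def service_provider_profit(ps, pd):
    return (ps - pd) * user_demand(ps)

def data_provider_profit(pd):
    # Stage 2 solved numerically: the service provider best-responds to pd.
    ps_star = max((pd + i * 0.01 for i in range(1, 2000)),
                  key=lambda ps: service_provider_profit(ps, pd))
    return (pd - cost) * user_demand(ps_star)

# Stage 1: the data provider picks pd anticipating both best responses.
pd_star = max((cost + i * 0.01 for i in range(1, 500)),
              key=data_provider_profit)
```

Solving the stages in reverse order, as here, is exactly the backward induction used in the proof below; the grids merely replace the closed-form first-order conditions.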

Theorem 1. Under appropriate conditions on the parameters, the Stackelberg game has a unique Nash equilibrium.

Proof. According to the profit function and the reward function of the service user, we obtain the service user's final profit function. Its second derivative with respect to the demand is negative, so the subgame of this stage has an optimal demand, which is found by setting the first derivative to zero. Given this optimal demand, the service provider sets the optimal service price per unit of data utility in response: substituting the demand into the service provider's profit function, the second partial derivative with respect to the service price is clearly negative, so this subgame also has an optimal solution, again obtained from the first-order condition. Finally, given the best responses of the other two players, the data provider sets the optimal data price. Substituting both responses into the data provider's profit function, the second derivative with respect to the data price is negative, so an optimal data price exists. The first-order condition is a cubic equation in the data price, whose three roots can be derived according to Shengjin's formulas [49]. We retain the real root, as the other two roots are imaginary, and substituting it back into the expressions for the optimal service price and the optimal demand completes the proof.

4. Experiment

The experiment is divided into two parts. First, we design experiments to prove the rationality and validity of the MDDUE; then, we demonstrate the existence and uniqueness of the Nash equilibrium of the Stackelberg game through numerical experiments.

4.1. Parameter Fitting Based on Cifar10

To verify the rationality and validity of the MDDUE, we use the public Cifar10 dataset [50] in our experiments. The Cifar10 dataset consists of 60,000 images in 10 classes, with 6,000 images per class; the machine learning task is to identify which class an image belongs to. The machine learning model in our experiment is the Wide ResNet (WRN) [51], a variant of ResNet with a dropout layer added between two convolutional layers, which increases the network width and improves the training speed.

To show that changes in the class-balanced ratio have an impact on the accuracy of the machine learning model, we constructed datasets with different class-balanced ratios based on Cifar10. However, the traditional approach to constructing an unbalanced dataset in the literature [35] would change the data size, which is itself a factor that affects the accuracy of the model. To keep the other conditions constant, we take one number of samples from each of the first five classes and a different number from each of the remaining classes as the training set, so that the total size is fixed and the ratio between the two per-class counts is the class-balanced ratio.
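A sketch of the per-class sample counts under this construction (the specific counts and the assignment of minority classes are illustrative assumptions): with n1 samples in each of the first five classes and n2 in each of the rest, the total size 5(n1 + n2) stays constant while their ratio varies.

```python
def per_class_counts(k, beta, num_classes=10):
    """Split 2*k samples per class pair so the total stays num_classes*k
    while the minority/majority ratio is (approximately) beta."""
    n2 = round(2 * k / (1 + beta))   # majority classes
    n1 = 2 * k - n2                  # minority classes
    half = num_classes // 2
    return [n1] * half + [n2] * (num_classes - half)

counts = per_class_counts(k=1000, beta=0.5)
total = sum(counts)  # total size stays 10 * k = 10000
```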

Figure 4 shows the training curves for different data class-balanced ratios with the data size and the data nonnoise ratio fixed. It shows that the smaller the class-balanced ratio, the lower the accuracy of the model, so introducing the class-balanced ratio makes the data utility evaluation function more reasonable.

For simplicity of calculation, we normalize the data size to a value in the interval 0–1. Figure 5 shows the model accuracy for different data sizes with the other two dimensions fixed. As the data size increases, the accuracy of the machine learning model increases; however, once the data size reaches a certain level, the growth rate of the model accuracy becomes smaller. Similarly, Figures 6 and 7 show the variation in model accuracy for different class-balanced ratios and different nonnoise ratios, respectively, with the other two dimensions fixed. All three fitted functions closely match the experimental results.

Figure 8 displays the actual accuracy of the model for different data sizes, nonnoise ratios, and class-balanced ratios. The average accuracy is the mean over 10 experiments.

Using the above accuracy data, we can determine the optimal parameters of the proposed data utility evaluation function on WRN-34 and Cifar10 by minimizing the mean-squared error. To verify the correctness of the function and to better visualize it, we have selected a few special cuts of the function surface, shown in Figures 9–11.

By comparing Figure 8 with Figures 9–11, it can be seen that the data utility function closely tracks the actual values. At this point, the values of the five fitting parameters are 1.1262, 1.3484, 1.3121, 1.1592, and 0.1343, respectively. We use this fitted data utility function in the following numerical experiments.

4.2. Numerical Experiments

We conduct numerical experiments to demonstrate the existence and uniqueness of the Nash equilibrium. We fix the parameters of the service user's reward function, the subscription price of the data service, and the price per unit of data utility. As shown in Figure 12, the profits of the service user increase with increasing data utility. However, when the data utility increases to a certain degree, the profits of the service user decline because of the growing subscription payment. Thus there is an optimal data utility that maximizes the profits of the service user.

Figure 13 shows that the profits of the service provider rise gradually with the increase of the subscription price. Beyond a certain point, however, the higher price reduces the data utility demand, which in turn reduces the service provider's profits. So there is an optimal pricing strategy that maximizes the service provider's profits.

We fix the cost per unit of data utility. Figure 14 shows that as the data price increases, the profits of the data provider increase. However, once the data price reaches a certain value, the profits of the data provider start to decrease, because the increase in the data price raises the subscription price, which reduces the demand for data utility. Therefore, the data provider obtains its optimal profits when the optimal price is applied.

5. Conclusions

In this paper, we first propose MDDUE to evaluate the utility of data and then advance an optimal data-pricing scheme based on the Stackelberg game. Specifically, we construct a data utility evaluation function from three data quality dimensions and are the first to introduce the class-balanced ratio into the data utility evaluation function, making it more accurate and more reasonable. We then propose an optimal data-pricing scheme based on the three-stage Stackelberg game, with which the profits of the three roles in the data trading market can be maximized. Finally, we verify the rationality and validity of MDDUE through a specific machine learning model, WRN, and a real-world dataset, Cifar10. Meanwhile, we prove the existence and uniqueness of the Nash equilibrium and demonstrate the results through numerical experiments. In future work, we will improve the universality of MDDUE and work on building fairer and more secure data trading solutions using technologies such as cryptography and blockchain.

Data Availability

The Cifar10 dataset is found at http://www.cs.utoronto.ca/~kriz/cifar.html.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This research is funded by the National Natural Science Foundation of China (62202118, 61962009), Top Technology Talent Project from Guizhou Education Department (Qianjiao Ji [2022]073), and Foundation of Guangxi Key Laboratory of Cryptography and Information Security (GCIS202118).