Abstract

To process huge amounts of data, computing resources need to be organized in clusters that can be scaled out easily. However, traditional SQL databases built on the relational data model are difficult to deploy in such clusters, which has motivated the movement named NoSQL. NoSQL databases, in turn, are limited by their own data models. In this paper, the original soft set theory is extended, and a new theory system called n-tier soft set is proposed. We systematically construct its concepts, definitions, and operations, establishing it as a novel soft set algebra. Some features of this algebra show its natural advantages as a data model that could combine the logicality of the SQL model (also known as the relational model) with the flexibility of NoSQL models. This data model provides a unified and normative logical perspective for organizing and manipulating data, combines metadata (semantics) and data to form a self-describing structure, and combines index and data to realize fast locating and correlating.

1. Introduction

1.1. Background

Since the beginning of the 21st century, with the boom of Internet applications, the total amount and complexity of digital information possessed by human beings have increased explosively at an unprecedented speed, showing many new features. Some professionals believe that we have entered the era of big data [1, 2]. Databases, as the core part of information infrastructures, play a key role in this historical change. However, relational databases, which previously dominated the market, have begun to appear inadequate for some problems of big data [3].

In order to quickly process large volume, fast flowing, and complex data in a limited time to generate value, more computing resources must be acquired. There are usually two schemes: scale up and scale out.

Scaling up means equipping a single computer with better-performing hardware, such as more and stronger CPUs and larger and faster memories and disks, without increasing the number of computers. However, the performance of the computer hardware available on the market at any given time has an upper limit, and the performance-price ratio of high-end products is usually low, which incurs high cost.

Scaling out increases the number of computers rather than the performance of a single computer: it incorporates a large number of cost-effective low-end (or mid-range) computers into a cluster to increase computing power. This not only reduces costs by comparison but also makes the cluster more resilient; that is, even if some of the computing nodes fail, the entire cluster can continue to provide services.

1.2. The Challenges of Current Solutions
1.2.1. Relational Databases

However, the relational data model [4] and RDBMSs are basically designed for single-machine environments and are not well suited to clusters [3, 5]. At the data model level, relational databases organize data based on the relational model and use tuples as the record units.

Firstly, according to the definition of the relational model and its normalization theory, a tuple is an ordered list of atomic values that cannot be nested or contain collection types (set, list, and so on), which makes it difficult to represent complex structures. There is no such restriction on the variables used in application programming, which results in an “impedance mismatch” (a metaphor for the mismatch between the data forms of the relational data model and the application programming model). At present, this problem is usually mitigated by a middle layer called ORM (object-relational mapping).

Secondly, the relational model uses normalization to reduce redundancy, avoid anomalies, and ensure the integrity of databases. In a relational database that follows the third (or higher) normal form, the data involved in a single processing unit of an application are typically scattered across different tables. To ensure the ACID requirements of a transaction (atomicity, consistency, isolation, and durability, the four basic properties of the correct execution of a database transaction) and the integrity constraints required by the normal form, a series of locks and other resources are consumed. Under high concurrency or huge data volume, this can severely affect the performance and availability of the database.

Moreover, the relational model is algebraically based on relations rather than mappings, so it cannot express an index by itself (whereas a mapping is a natural mathematical abstraction of an index). As a result, indexes are external structures, separated from the data in implementation, which not only increases the demand for storage space but also makes it difficult for data items to locate each other on their own. To correlate data in different tables, it is necessary to write complex SQL queries and use expensive Join operations. To support Join operations between tables, the related tables usually need to be placed on the same node, which is not conducive to data dispersion in a cluster and usually requires manual sharding design, making relational databases difficult to scale out.

Meanwhile, a relational database needs a rigid predefined schema: the structures and constraints of tables have to be defined in advance. In practice, the schema is very difficult to change, so it falls short of dealing with changing data sources and requirements.

1.2.2. NoSQL Databases

These problems of relational databases have motivated the development of database products called NoSQL and inspired a new round of innovation in database theory and practice [3, 6, 7]. Different NoSQL products try to solve the problems of relational databases from different angles. According to the data models they use, NoSQL products can be divided into four main types: key-value store, column family, document, and graph [7, 8]. Except for graph databases, which use a graph as the data model, the data models of the first three are based on key-value structures. Key-value store databases are composed of simple key-value pairs; column family databases organize data into two-level (or deeper) key-value mappings by row keys, column keys, and so on; and document databases organize key-values into documents with accessible internal structures that can be nested within each other.

The main reason why these databases shift the view of data from relations to key-value structures (including simple key-value, column, and document) is to deal with aggregates. Unlike tuples in relational databases, aggregates are usually designed and used by upper applications (not by databases). An aggregate organizes all the data needed in a single processing unit so that they can be accessed together, eliminating expensive and complex SQL queries and table Joins. Aggregates, as natural and independent units of data distribution, also make it easy to disperse data in a cluster. The form of an aggregate is also free, so content can easily be added or deleted. Thus, impedance mismatch can be solved without ORM intermediate layers.

Although key-value databases have partly solved some problems of relational databases, they do not have rigorous mathematical foundations and there is no connectivity between aggregates, which makes complex querying and understanding the connections among data difficult. Relational databases, on the other hand, with their rigorous and precise algebraic foundation, can use a powerful query language based on relational algebra to analyze and reason about data freely and logically when the amount of data is small and resides on a single machine. However, with big data or in a cluster, it is also difficult to extract value from the connections among data by using Join operations. Therefore, the graph database, based on graph theory, is designed to explore the connections among data conveniently. The graph model represents data as a set of nodes, node attributes, and edges, providing fast and efficient traversal of the whole graph with index-free adjacency. However, the graph model focuses on connections and networks, and it is not good at expressing entities and their attributes (mathematically, nodes in a graph have no attributes, and in implementations, simple key-value pairs are used to store attributes), so it has a specialized range of application and lacks generality [9, 10].

At present, the mainstream database models are the relational model (SQL) and the NoSQL models (key-value, column family, document, graph, etc.). The NoSQL models were proposed to solve the problems that the relational model makes the database schema too rigid to change (especially with vast amounts of data) and is difficult to distribute. However, the NoSQL models sacrifice the mathematical rigor of the relational model and its freedom of query expression.

A model that has a mathematical and logical foundation as rigorous as that of the relational model while using a key-value style data structure therefore urgently needs to be studied. Such a model would be easy to distribute and its schema easy to change. We think that such an improvement can use the “key-value pair” data structure in a distributed environment to realize a database with rigorous algebraic logic, which combines the advantages of SQL and NoSQL and has practical significance.

1.3. Our Approach

All these problems motivate us to explore a new data model that not only keeps the merits of key-value structures (data that can describe itself and can be easily located and moved in a cluster) but also has appropriate normalization and a rigorous algebraic basis like the relational model, so that a powerful, product-independent query language can be applied freely and logically. In the end, we focused on an algebraic theory called soft set. Soft set theory is a mathematical theory proposed by the Russian mathematician Molodtsov in 1999 to solve uncertainty problems. The basic idea is to provide semantically parameterized sets by using a generalized set-valued mapping [11].

A soft set is a mapping that allows fuzzy semantics for its parameters and sets for its return values. Because mappings in mathematics have a natural connection with key-value structures, and sets as return values can have internal structures that can be manipulated, we saw that a soft set could be used as a mathematical abstraction for an intricate key-value structure [10, 12–15].

Molodtsov gave the initial definition of soft set and a general operation and introduced several possible applications in [11]. Maji et al. studied the theory of soft sets in more detail [16], introduced the concepts of subset, intersection, union, and complement of soft sets, and discussed their properties (but Yang and Ali et al. pointed out that these properties were incorrect and improved them [17, 18]). Subsequently, a variety of operations and algebraic properties of soft set have been proposed and studied [18–21]. The original soft set has been extended by combining it with other uncertainty theories such as fuzzy set and rough set [22–31], and by using algebraic properties of soft set, new algebraic structures have been constructed [21, 32–36]. Çağman and Enginoğlu gave a new definition of soft set in the form of an extension of set-valued mapping, which is different from the original one. Based on that, several related operations have been proposed, a new theory system has been constructed, and a new decision-making method has been presented [37]. At present, soft set theory is widely used in parameter reduction and decision making [38], and a large number of methods for parameter reduction [39–43] and decision making [44–46] have been developed.

In the second section, we will review the soft set theory. Because previous soft set theories are not suitable to be the algebraic basis of the data model we need, we will extend the original soft set theory from the basic structure and systematically introduce a new soft set algebra called n-tier soft set, including its definitions, operations, and related concepts, which will form a complete system and provide the theoretical basis for the later data model. In the third section, we will illustrate why and how to use n-tier soft set to build a data model, define the infrastructure and modeling principles, and finally, explain its features and advantages.

2. N-Tier Soft Set Theory

2.1. The Definitions of N-Tier Soft Set

Before defining n-tier soft set, we first review the basic definition of soft set.

Definition 1. Let a nonempty set $U$ be a universal set and $P(U)$ be the power set of $U$. Let a nonempty set $E$ be a parameter set. Then, $F$ is called a soft set if and only if $F$ is a mapping from $E$ to $P(U)$:
$$F : E \to P(U).$$
This definition is slightly different from Molodtsov’s initial one [11], and it is more similar to Çağman’s definition [37]. Generally, we prefer to define a soft set directly as a special mapping rather than as an ordered pair consisting of a mapping and a parameter set.
A mapping also can be treated as a set of ordered pairs, so an equivalent definition is given.

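Definition 2. Let a nonempty set $U$ be a universal set and $P(U)$ be the power set of $U$. A nonempty set $E$ is called a parameter set, and $E \times P(U)$ is the Cartesian product of $E$ and $P(U)$. $F$ is called a soft set if and only if $F \subseteq E \times P(U)$ and each $e \in E$ appears exactly once as the first item of an ordered pair, that is,
$$\forall e \in E,\ \exists!\, X \in P(U) : (e, X) \in F.$$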

Example 1. Examples for soft set: let , and a possible soft set on to is . Let , and a possible soft set is .
The definitions above point out that mapping and set are two equivalent views of a soft set. So, the general properties and operations of sets also apply to soft sets (for example, intersection, union, and complement in the sense of general sets). However, soft sets may not be closed under these operations (for example, the union of general sets may destroy the mapping condition of soft sets). When applying these general set operations, we treat a soft set directly as a general set. In addition, in the following discussion, we will frequently apply both mapping operations and set operations to soft sets to avoid introducing too many notations. For example, there are two soft sets . is the image of (an element in ) under the mapping rule by soft set , while is the intersection of two soft sets as the sets of ordered pairs. And is the image of by the intersection (notice that the intersection of two soft sets preserves a mapping).
Those notations are concise and enable us to see an important property of soft sets clearly, that is, the ability to maintain mapping after some splitting, merging, or deformation operations.
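The two views can be illustrated with a minimal Python sketch. It assumes a binary soft set is stored as a dict from parameters to sets; the parameter names and house identifiers are made-up sample data, and the helper names `as_pairs` and `is_functional` are hypothetical. The sketch shows that the general intersection of two soft sets is still a mapping, while the general union may not be.

```python
# Illustrative sketch: a binary soft set viewed as a mapping and as a set of pairs.
# The parameter names and house identifiers are made-up sample data.

def as_pairs(soft_set):
    """View a dict-based soft set as a set of (parameter, value-set) pairs."""
    return {(e, frozenset(values)) for e, values in soft_set.items()}

def is_functional(pairs):
    """True if no parameter occurs as the first item of two different pairs."""
    firsts = [e for e, _ in pairs]
    return len(firsts) == len(set(firsts))

F = {"cheap": {"h1", "h2"}, "modern": {"h2", "h3"}}
G = {"cheap": {"h1", "h2"}, "modern": {"h3"}}

image = F["cheap"]                  # mapping view: image of a parameter
common = as_pairs(F) & as_pairs(G)  # set view: general intersection
merged = as_pairs(F) | as_pairs(G)  # set view: general union

print(is_functional(common))  # True: the intersection is still a mapping
print(is_functional(merged))  # False: "modern" now has two different images
```

Note that the intersection here keeps only the parameters on which both soft sets agree, so it is a mapping on a possibly smaller parameter set.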
Meanwhile, a soft set can be seen as a set-valued mapping, and it can also be considered as a set. This dual view provides a crucial recursive way to construct a new structure, which furnishes soft set theory with new and richer content. Next, we will introduce a new notation to represent a kind of set of soft sets and define n-tier soft set.
Firstly, we define n-tuple, n-ary Cartesian product, and some other related concepts and introduce some notations to facilitate the following discussion.

Definition 3. An n-tuple is a finite ordered list of elements, where is a positive integer. Formally,And let , , be the concatenation of and , which is noncommutative and associative, namely, .
In this paper, we use to denote the arity of . Let , denote the -th component from the right to the left in tuple . is used to denote the new tuple obtained from the tuple by removing the -th component from the right to the left.

Definition 4. Let be an n-tuple which is composed of sets, the n-ary Cartesian product is defined as follows:Using the usual notation , it also can be denoted as . The n-ary Cartesian product defined here is flat, noncommutative, and associative. Namely, that let be three sets: , . is called underlying sets of the Cartesian product , and the subset of n-ary Cartesian product is called an n-ary relation.
In particular, when , then , so the unary Cartesian product with only one set is equal to the set itself. Its elements and subsets are called unary tuple and unary relation, respectively (and are different representations of the same element). And if , then we can get from the definition directly.
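As a hedged illustration of the flat, associative, and noncommutative product described above, the following Python sketch represents tuples as Python tuples and builds products with `itertools.product`. The helper names `concat`, `flat_product`, and `product_of_products` and the sample sets are assumptions for the example only. Python keeps a unary tuple `(a,)` distinct from `a`, whereas the text identifies the two.

```python
from itertools import product

def concat(s, t):
    """Concatenate two tuples; associative but not commutative."""
    return tuple(s) + tuple(t)

def flat_product(*sets):
    """n-ary Cartesian product whose elements are flat n-tuples."""
    return set(product(*sets))

def product_of_products(X, Y):
    """Product of two sets of tuples, flattening each pair by concatenation."""
    return {concat(x, y) for x in X for y in Y}

A, B, C = {1, 2}, {"x", "y"}, {"p", "q"}

# Grouping does not matter because tuples are flattened by concatenation.
left = product_of_products(
    product_of_products(flat_product(A), flat_product(B)), flat_product(C))
right = product_of_products(
    flat_product(A), product_of_products(flat_product(B), flat_product(C)))

print(left == right == flat_product(A, B, C))   # True: associative and flat
print(flat_product(A, B) == flat_product(B, A)) # False: noncommutative
```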

Definition 5. N-tier soft set: let be an n-tuple consisting of nonempty domains.
When , we defineAmong which, refers to the power set of , that is, the set of all subsets of .
When , we defineAmong which, refers to the set of all mappings from the domain set to the codomain set .
When , any element in , which can be a mapping , is called an n-tier soft set about .
is called the underlying domains of , denoted as .
In this paper, refers to the arity of soft set , that is, . And , , and are the domain, codomain, and range of , respectively.
When , then any element in , which can be a mapping , is a binary soft set defined in Definition 1.
When , then , so degenerates into a subset of . Following the name of unary relation, we call it unary soft set, and the underlying domain of it is a unary tuple, that is, .

Example 2. is a set consisting of ternary soft set whose underlying domains are .
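The recursive shape of Definition 5 can be sketched in Python by storing an n-tier soft set as nested dicts whose innermost values are sets. This is only an illustrative encoding: the function name `is_n_tier_soft_set`, the convention that the outermost key domain comes first in `domains`, and the sample data are assumptions. The mappings in the definition are total, while the sketch reads an absent key as being mapped to the empty value.

```python
def is_n_tier_soft_set(value, domains):
    """Check that `value` is an n-tier soft set over `domains`.

    `domains` is a tuple of sets, outermost key domain first and the
    innermost value domain last (an assumed convention for this sketch).
    """
    if len(domains) == 1:
        # Unary case: a subset of the single underlying domain.
        return isinstance(value, (set, frozenset)) and set(value) <= set(domains[0])
    if not isinstance(value, dict):
        return False
    # n > 1: a mapping from the first domain to (n-1)-tier soft sets.
    return all(
        key in domains[0] and is_n_tier_soft_set(inner, domains[1:])
        for key, inner in value.items()
    )

domains = ({"college"}, {"name", "sex"}, {"Joe", "Eva", "male", "female"})
ternary = {"college": {"name": {"Joe", "Eva"}, "sex": {"male", "female"}}}
binary = ternary["college"]

print(is_n_tier_soft_set(ternary, domains))     # True
print(is_n_tier_soft_set(binary, domains[1:]))  # True
print(is_n_tier_soft_set(binary, domains))      # False
```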
Next, we will define some other important concepts related to soft set.

Definition 6. Soft empty set : let be a positive integer and be an n-tuple consisting of nonempty domains. is called a soft empty set of if and only ifAmong which,refers to a mapping , whose domain and codomain are and , respectively, and it maps , an element of , to . Sometimes, we simply denote it as follows:

Definition 7. Soft universal set : let be a positive integer and be an n-tuple consisting of nonempty domains. is called the soft universal set of if and only if

Definition 8. Soft subset : let , in which is an n-tuple consisting of nonempty domains and we call a soft subset of , denoted as , if and only if
It is important to note that in earlier soft set theory, the conditions of soft subset can be summarized as follows: [16]. If soft sets are regarded as sets of ordered pairs, that earlier definition means that every ordered pair of is also in , which can be expressed directly by the subset relation of general sets. The soft subset defined here, however, is a special kind of inclusion relation between soft sets. When the mapping values of a soft set are themselves soft sets (rather than simple sets), we compare them pairwise and recursively until the mapping values are general sets. In addition, at present we do not consider the infinite case but only n-tier soft sets over finite n-tuples of domains, so all recursive judgments are bound to terminate. How to generalize the definition to the infinite case will be discussed in future work.

Example 3. is a subset relation of two soft sets in the sense of general sets. The ordered pairs in the first set are all in the second set (and it automatically satisfies the definition of the soft subset at the same time), and is an example of soft subset, because the mapping values determined by the first soft set are subsets of the corresponding values of the second soft set, but none of the elements in the first set is in the second set.
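The difference between the two inclusion relations can be made concrete with a small Python sketch. It assumes soft sets are stored as nested dicts with sets at the innermost level and that a key absent from the larger soft set is read as mapped to the empty value. The function names and sample data are hypothetical.

```python
def is_soft_empty(value):
    """True if every innermost set under `value` is empty."""
    if isinstance(value, dict):
        return all(is_soft_empty(v) for v in value.values())
    return len(value) == 0

def soft_subset(f, g):
    """Recursive, value-wise inclusion (a sketch of the soft subset relation)."""
    if not isinstance(f, dict):
        return set(f) <= set(g)
    return all(
        soft_subset(v, g[k]) if k in g else is_soft_empty(v)
        for k, v in f.items()
    )

def plain_subset(f, g):
    """Subset in the sense of general sets of ordered pairs (binary soft sets only)."""
    def pairs(s):
        return {(k, frozenset(v)) for k, v in s.items()}
    return pairs(f) <= pairs(g)

F1, G1 = {"a": {1}}, {"a": {1}, "b": {2}}
F2, G2 = {"a": {1}, "b": {2}}, {"a": {1, 3}, "b": {2, 4}}

print(plain_subset(F1, G1), soft_subset(F1, G1))  # True True
print(plain_subset(F2, G2), soft_subset(F2, G2))  # False True
```

In the second pair, no ordered pair of `F2` occurs in `G2`, yet every value of `F2` is contained in the corresponding value of `G2`.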

Definition 9. Equality = : let in which is an n-tuple consisting of nonempty domains. We consider that is equivalent to , denoted as if and only if

Theorem 1. Let , in which is an n-tuple consisting of nonempty domains, then if and only if .

Proof 1. We proceed by induction and first prove the base case.
According to the definition, when , , thenThen, when , the proposition is true.
Next, we prove the inductive step: if when , the proposition is true, then, according to the definition, when , ; then,According to the inductive hypothesis,and thenSo, if the proposition is true when , then the proposition is also true when . So, according to the induction principle, the proposition is true for any positive integer , q.e.d.

Definition 10. Soft power set : let , in which is an n-tuple consisting of nonempty domains. Setof all soft subsets of is called the soft power set of , denoted as .
It is easy to prove the following properties of soft subset and soft power set by using similar inductive methods in Proof 1.
For any , in which is an n-tuple consisting of nonempty domains, then , , , , and for any , then . The specific proof is similar to Proof 1 and will not be repeated.

2.2. The Operations of N-Tier Soft Set

Definition 11. Soft union : let in which is an n-tuple consisting of nonempty domains and then is called the soft union of and if and only ifIn addition, let , then is called the soft arbitrary union of if and only ifin which .

Definition 12. Soft intersection : let , in which is an n-tuple consisting of nonempty domains, then is called the soft intersection of and if and only ifIn addition, let , then is called the soft arbitrary intersection of if and only ifin which .

Definition 13. Soft difference : let , in which is an n-tuple consisting of nonempty domains, then is called the soft difference set of and if and only if

Definition 14. Soft complement : let , in which is an n-tuple consisting of nonempty domains, then is called the soft complement of , denoted as .

Definition 15. Soft symmetry difference : let , in which is an n-tuple consisting of nonempty domains, then is called the soft symmetry difference of and , denoted as .
The above operations of n-tier soft set have the following properties.

Theorem 2. Let be an n-tuple consisting of nonempty domains and . and are soft universal set and soft empty set, respectively, in . So, the following properties are true:(1)Commutative law:(2)Associative law:(3)Distributive law:(4)Identity element:(5)Zero element:(6)Inverse element:(7)Complementary law:(8)Idempotent law:(9)Absorption law:(10)De Morgan law:By using a similar inductive method in Proof 1, those properties can be proved directly by definition and the specific process will not be repeated here.
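Some of these laws can be spot-checked on a small instance with the Python sketch below. It assumes that the soft operations act recursively and value-wise on nested dicts (set operations at the innermost level) and that both operands are total mappings over the same domains; this reading of Definitions 11 to 14 and all function names are assumptions made for illustration, not the formal definitions.

```python
def pointwise(op):
    """Lift a binary set operation to nested soft sets with identical keys."""
    def apply(f, g):
        if isinstance(f, dict):
            return {k: apply(f[k], g[k]) for k in f}
        return op(set(f), set(g))
    return apply

soft_union = pointwise(lambda a, b: a | b)
soft_intersection = pointwise(lambda a, b: a & b)
soft_difference = pointwise(lambda a, b: a - b)

def soft_universal(domains):
    """Soft universal set over `domains` (outermost key domain first)."""
    if len(domains) == 1:
        return set(domains[0])
    return {k: soft_universal(domains[1:]) for k in domains[0]}

def soft_empty(domains):
    """Soft empty set over `domains`."""
    if len(domains) == 1:
        return set()
    return {k: soft_empty(domains[1:]) for k in domains[0]}

def soft_complement(f, domains):
    return soft_difference(soft_universal(domains), f)

domains = ({"k1", "k2"}, {1, 2, 3})
F = {"k1": {1}, "k2": {2, 3}}
G = {"k1": {1, 2}, "k2": set()}

# De Morgan law on this instance
lhs = soft_complement(soft_union(F, G), domains)
rhs = soft_intersection(soft_complement(F, domains), soft_complement(G, domains))
print(lhs == rhs)  # True
```

Checking a few instances does not replace the inductive proof; it only shows how the laws read on concrete data.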

Definition 16. Soft range : let , in which is an n-tuple consisting of nonempty domains, and then is called the soft range of if and only if

Definition 17. Key set : let , in which is an n-tuple consisting of nonempty domains, and then is called the key set of if and only if

Definition 18. Value set : let , in which is an n-tuple consisting of nonempty domains, and then is called the value set of if and only if

Definition 19. Selection : let , in which is an n-tuple consisting of nonempty domains, and is n-ary predicate, so and are called the selection operations if and only ifPlease note that an n-ary predicate is reduced to an predicate when its variable is fixed. For example, suppose a 3-ary predicate , and when takes a fixed value , becomes a binary predicate with only two variables.

Definition 20. Domain remove : let , in which is an n-tuple consisting of nonempty domains and , and is called domain remove operation (new soft sets formed by removing the -th domain of the underlying domain of from right to left and the original mapping relation of ), if and only ifBecause of no ambiguity, we use the same token for the n-tier soft set and the n-tuple, and the reader can distinguish them from each other by context.

Definition 21. Domain rise : let , in which is an n-tuple consisting of nonempty domains and is called domain rise operation (new soft sets formed by exchanging the th and th domain of from right to left which is the underlying domain of and the underlying domain after exchanging is denoted as .), if and only ifIn particular, when is a binary soft set, the only domain rise is called the reverse of .

Definition 22. Uncurrying : let , in which is an n-tuple consisting of nonempty domains, and is called the uncurrying of if and only if
Uncurrying transforms an n-tier soft set into an (n−1)-ary mapping.

Definition 23. Currying : let be a definite positive integer, and be an n-tuple consisting of nonempty domains. and are called the currying of if and only if
Note that an n-ary function is reduced to an (n−1)-ary function when one of its variables is fixed. For example, set will be reduced to when the value of is fixed. Generally, we obtain the (n−i)-ary function which can be denoted as by taking values of the n-ary function from left to right, continuously, and then .
Currying transforms an (n−1)-ary mapping into an n-tier soft set.
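A minimal Python sketch of the two directions is given below. It assumes an n-tier soft set (n ≥ 2) is stored as nested dicts with sets at the innermost level, and that the (n−1)-ary mapping is stored as a dict keyed by (n−1)-tuples; the function names and sample data are hypothetical.

```python
def uncurry(soft_set):
    """Flatten nested key levels into a dict keyed by tuples (n >= 2 assumed)."""
    flat = {}

    def walk(node, path):
        if isinstance(node, dict):
            for key, inner in node.items():
                walk(inner, path + (key,))
        else:
            flat[path] = set(node)

    walk(soft_set, ())
    return flat

def curry(flat):
    """Rebuild nested key levels from a dict keyed by nonempty tuples."""
    nested = {}
    for keys, value in flat.items():
        node = nested
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = set(value)
    return nested

T = {
    "0001": {"name": {"Joe"}, "sex": {"male"}},
    "0002": {"name": {"Eva"}, "sex": {"female"}},
}

flat = uncurry(T)
print(flat[("0001", "name")])  # {'Joe'}
print(curry(flat) == T)        # True
```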

Definition 24. Concatenate production : let , in which is an n-tuple consisting of nonempty domains and is an m-tuple consisting of nonempty domains. If , then is called the concatenate production of and if and only ifParticularly, is a set when , and , directly denoted as , is also called the restriction of under .

Definition 25. Soft direct production : let , in which is an n-tuple consisting of nonempty domains and is an m-tuple consisting of nonempty domains. We call is the soft direct production of and if and only if

Definition 26. Soft mapping production : let be a binary tuple consisting of two nonempty domains . We call is the soft mapping production of and if and only ifAccording to the definition, soft mapping production is associative, namely, .

Definition 27. Soft relation: let be binary soft sets and is called the soft relation whose underlying domain is if and only if

Definition 28. Associated relation of a soft set: let , in which is an n-tuple consisting of nonempty domains. is called the associated relation of if and only if

Definition 29. Associated soft set of a relation: let be an n-ary relation (if R is an empty set, according to its assumption and context, can be considered as an n-ary empty relation whose underlying domain is . is called the associated soft set of a relation of if and only ifMind here, we indirectly used inductive definition of tuple. That is, any n-ary tuple could be considered as a nested binary tuple when . For an n-ary relation and a definite value , is an -ary relation consisting of -tuples (if is an empty set, it can be seen as an -ary empty relation).
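The correspondence between n-ary relations and n-tier soft sets in Definitions 28 and 29 can be sketched in Python as follows. The sketch assumes a relation is a set of flat tuples and a soft set is a nested dict with sets at the innermost level; keys whose value would be an empty relation are simply omitted, which is a simplification of the total mappings used in the definitions. The function names are hypothetical.

```python
def to_relation(soft_set):
    """Associated relation: one flat tuple per path through the soft set."""
    if not isinstance(soft_set, dict):
        return {(x,) for x in soft_set}
    return {(k,) + t for k, inner in soft_set.items() for t in to_relation(inner)}

def to_soft_set(relation, arity):
    """Associated soft set: group tuples by their first component, recursively."""
    if arity == 1:
        return {t[0] for t in relation}
    groups = {}
    for t in relation:
        groups.setdefault(t[0], set()).add(t[1:])
    return {k: to_soft_set(rest, arity - 1) for k, rest in groups.items()}

R = {
    ("0001", "name", "Joe"),
    ("0001", "sex", "male"),
    ("0002", "name", "Eva"),
}

S = to_soft_set(R, 3)
print(S == {"0001": {"name": {"Joe"}, "sex": {"male"}},
            "0002": {"name": {"Eva"}}})  # True
print(to_relation(S) == R)               # True
```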
From a mathematical point of view, the n-tier soft set defined in this section and its operations offer a wealth of content to be studied. They have nice properties: soft intersection, soft union, soft complement, and other operations satisfy all the properties of the common set operations (commutative law, associative law, etc.). However, this paper does not focus on the mathematics. Next, we will explain why and how to use n-tier soft set as a data model for databases in the era of big data.

3. N-Tier Soft Set Data Model

There is no natural expression for the existence of things or events. Only by purposeful selection, abstraction, and simplification can we transform specific aspects of irregular fields into structured and manipulable objects. A data model describes the static characteristics and dynamic behavior of a database system at an abstract level, provides a logical abstract framework for data representation and operation, and fundamentally determines how data are stored, organized, and manipulated. Therefore, the data model is the core and foundation of the database system, and every database system must be based on a certain data model. The data model also constitutes a bridge between the upper applications, the database system itself, and its underlying physical implementation, which enables them to view and use the data in a unified way.

We have already explained the problems of the relational data model and the most popular NoSQL data models in the Introduction. In the second section, n-tier soft set was defined as a nested set-valued mapping, which makes it possible to express complex key-value structures. Next, we will set up a new data model by using n-tier soft set algebra.

Just as a table is often used to represent a relational model, for ease of illustration we first introduce a plain text representation of n-tier soft set. It is similar to JSON and independent of specific programming languages, and it is called SSSN (soft set serialization notation). The basic construction rules are as follows (for demonstration only; the strict definition and parsing method are not discussed in this paper):

(1) Strings are represented with double quotation marks, numerical values with literal numbers, and Boolean values with true/false, for example,
    "hello World this is SSSN"    #String
    12345678                      #Number
    true                          #Boolean

(2) Tuples are represented with contents enclosed in parentheses and separated by commas, for example,
    ("Joe", "Male", "New York")

(3) Sets are represented with contents enclosed in braces and separated by commas, in which the elements cannot be repeated, for example,
    {"Elephant", "Monkey", "Zebra", "Panda"}

(4) Mappings are represented with contents enclosed in braces, separated by commas, and paired by a colon (several-to-one ordered pairs; the left side of the colon cannot be duplicated), for example,
    {"name":"Joe", "sex":"male", "address":"New York"}

(5) Bijective mappings are represented with contents enclosed in braces, separated by commas, and paired by a double colon (one-to-one ordered pairs; neither side of the double colon can be duplicated), for example,
    {"20181001"::"Oct-1-2018", "20181031"::"Oct-31-2018"}

(6) When colons or double colons are used for pairing, the left side can only be a string, number, Boolean, or tuple, and the right side can be any type of value defined above, for example,
    {
      ("name", "sex"): {"Joe":"male", "Eva":"Female"},
      ("name", "birthday"): {"Joe":"20001001", "Eva":"20000110"}
    }
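To make the construction rules concrete, the following Python sketch serializes Python values into SSSN-like text. It is only an illustration under several assumptions: the function `to_sssn` and the marker class `Bijection` are hypothetical names, set elements are sorted to get a stable output, and no escaping of quotation marks inside strings is attempted, since the paper does not give a strict grammar.

```python
class Bijection(dict):
    """Marker type (assumed for this sketch) for one-to-one mappings."""

KEY_TYPES = (str, int, float, bool, tuple)

def to_sssn(value):
    """Serialize a Python value following rules (1) to (6) above."""
    if isinstance(value, bool):            # check bool before int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return '"' + value + '"'
    if isinstance(value, tuple):
        return "(" + ", ".join(to_sssn(v) for v in value) + ")"
    if isinstance(value, (set, frozenset)):
        return "{" + ", ".join(sorted(to_sssn(v) for v in value)) + "}"
    if isinstance(value, dict):
        for key in value:
            if not isinstance(key, KEY_TYPES):
                raise TypeError("left side must be a string, number, Boolean, or tuple")
        rendered = [to_sssn(v) for v in value.values()]
        sep = ":"
        if isinstance(value, Bijection):
            if len(set(rendered)) != len(rendered):
                raise ValueError("right side of a bijective mapping is duplicated")
            sep = "::"
        pairs = (to_sssn(k) + sep + r for k, r in zip(value, rendered))
        return "{" + ", ".join(pairs) + "}"
    raise TypeError("unsupported value: " + repr(value))

print(to_sssn({("name", "sex"): {"Joe": "male"}}))
# {("name", "sex"):{"Joe":"male"}}
print(to_sssn(Bijection({"20181001": "Oct-1-2018"})))
# {"20181001"::"Oct-1-2018"}
```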

In the following discussion, we can see that when we express and pass n-tier soft set in this way, we can not only express and pass the semantics and data integrated by n-tier soft set but also express some important logical constraints.

3.1. Definitions of N-Tier Soft Set Data Model

Next, we will define the n-tier soft set data model.

Definition 30. Domain soft set: let $S$ be a finite set of semantic strings and $A$ be a finite set of atomic data, then the soft set $D$ is called a domain soft set if and only if it is a soft set from $S$ to $A$:
$$D : S \to P(A).$$
The elements in $D$ are called named domains. The elements in the key set of the domain soft set are called domain names. The images of the domain names under the mapping rule of the domain soft set are called the value ranges.

Example 4. Domain soft set:
    S = {"name", "sex", "birthday", "telephone", ... }
    A = {"Joe", "male", "20011231", "19990723", "female", "Eva", "19980301", "Adam", "Bob", "860702", "13320255520", ...}
    D = {
      "name": {"Joe", "Eva", "Adam", "Bob", ... },
      "sex": {"male", "female", ... },
      "birthday": {"19870220", "19990723", "19980301", ... },
      "telephone": {"860702", "13320255520", "1191101", ... },
      ...
    }
The domain soft set combines semantics and data, determines the domains involved, defines the semantic name and value range of each domain, and establishes the finite boundaries of a database system.
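A domain soft set of this kind can be checked mechanically. The Python sketch below uses a trimmed version of the data in Example 4 (the elided items are not filled in); the function name `is_domain_soft_set` is hypothetical, and the check simply requires every domain name to come from the semantic set and every value range to be a subset of the atomic data set.

```python
def is_domain_soft_set(D, S, A):
    """True if D maps names from S to subsets of A."""
    return all(name in S and set(values) <= set(A) for name, values in D.items())

# Trimmed sample based on Example 4; only explicitly listed items are used.
S = {"name", "sex", "birthday", "telephone"}
A = {"Joe", "Eva", "Adam", "Bob", "male", "female",
     "19990723", "19980301", "860702", "13320255520"}
D = {
    "name": {"Joe", "Eva", "Adam", "Bob"},
    "sex": {"male", "female"},
    "birthday": {"19990723", "19980301"},
    "telephone": {"860702", "13320255520"},
}

domain_names = set(D)      # key set of the domain soft set
value_range = D["sex"]     # image of a domain name

print(is_domain_soft_set(D, S, A))                # True
print(is_domain_soft_set({"name": {"Zoe"}}, S, A))  # False: "Zoe" is not in A
```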

Definition 31. Domain relation soft set: let be a domain soft set, is called a domain relation soft set underlying if and only if it is a soft relation of and itself, namely,The elements in are called named domain relations. The elements in the key set of domain relation soft set () are called domain relation names. The images of the domain relation names under the mapping rule of the domain relation soft set are called the values of domain relations.

Example 5. Domain relation soft set:
    R = {
      ("person", "name"): {"0001": {"Joe"}, "0002": {"Eva"}, ... },
      ("person", "sex"): {"0001": {"male"}, "0002": {"female"}, ... },
      ...
    }
By soft mapping production (Definition 26), the domain soft set forms a soft relation (Definition 27) with itself: binary tuples are formed between domain names and express the connotation of the relationship, and binary soft sets are formed between data and express the extension.

Definition 32. Database soft set: let be a finite set of semantic strings and be a domain soft set, then is called a database soft set if and only if is a mapping from to a domain relation soft set underlying , that is,The elements in are called named soft set databases. The elements in the key set of () are called database names. The images of the database names under the mapping rule of the database soft set are called the value of such database names.

Example 6. Database soft set:
B = {
 “college”: {
  (“person”, “name”): {“0001”: {“Joe”}, ...},
  (“person”, “student_id”): {“0001”: {“20130201025”}, ...},
  (“course”, “name”): {“CS001”: {“Database”}, ...},
  (“course”, “credits”): {“CS001”: {“3”}, ...},
  ...
 },
 “E−Shopping”: {
  (“person”, “name”): {“0001”: {“Joe”}, ...},
  (“person”, “customer_id”): {“0001”: {“00302001”}, ...},
  ...
 },
 ...
}

Definition 33. Aggregate soft set: let S be a finite set of semantic strings, B be a database soft set, and a database name of B be given; then T is called an aggregate soft set if and only if T is a mapping from S to the value of that database name in B. The elements in T are called soft aggregates.

Example 7. Aggregate soft set: let the database name be “college”:
T = {
 “student”: {
  (“person”, “name”): {“0001”: {“Joe”}, ...},
  (“person”, “student_id”): {“0001”: {“20130201025”}, ...},
  ...
 },
 “course”: {
  (“course”, “name”): {“CS001”: {“Database”}, ...},
  (“course”, “credits”): {“CS001”: {“3”}, ...},
  ...
 },
 ...
}
Aggregate soft sets divide a soft set database into different subsets and give each subset a name.
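The grouping in Example 7 can be sketched with ordinary Python dictionaries. This is only an illustrative analogue, not the formal definition: the helper name `make_aggregate` is hypothetical, the data are the sample values from Examples 6 and 7, and tuples stand in for domain relation names.

```python
# Database soft set B from Example 6, written as nested dictionaries.
# Domain relation names are tuples; innermost values are sets of atomic data.
B = {
    "college": {
        ("person", "name"): {"0001": {"Joe"}},
        ("person", "student_id"): {"0001": {"20130201025"}},
        ("course", "name"): {"CS001": {"Database"}},
        ("course", "credits"): {"CS001": {"3"}},
    },
}


def make_aggregate(database, groups):
    """Map each aggregate name to the part of `database` that is
    restricted to the listed domain relation names (hypothetical helper)."""
    return {
        name: {relation: database[relation] for relation in relation_names}
        for name, relation_names in groups.items()
    }


# Aggregate soft set T over the "college" database, as in Example 7.
T = make_aggregate(
    B["college"],
    {
        "student": [("person", "name"), ("person", "student_id")],
        "course": [("course", "name"), ("course", "credits")],
    },
)
```

Each aggregate here is just a named subset of the domain relations of one database, which is the point the example makes.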
Meanwhile, let S be a finite set of semantic strings, A be a finite set of atomic data items, and D be a domain soft set from S to A. By the definitions above, all the objects defined in this section, for any database name, can be represented as an n-tier soft set consisting of the semantic set S and the dataset A. So, the semantics, data, and relations can be fully expressed in an n-tier soft set system without the participation of external information. All the operations defined in Section 2 can be directly and conveniently applied to a model or an instance.
For example, for the domain relation soft set in Example 5 above, we can transform the binary function into nested soft sets by the currying operation (Definition 23) and raise the second domain (Definition 21), which gives the following:
{
 “person”: {
  “0001”: {“name”: {“Joe”}, “sex”: {“male”}, ...},
  “0002”: {“name”: {“Eva”}, “sex”: {“female”}, ...},
  ...
 }
}
The result resembles the BigTable data model proposed by Google in [47], which is a multilevel mapping with the domain “person” as row keys and the domains “name,” “sex,” etc. as column keys. We can also select some of them to form a new 4-ary soft set by selection (Definition 19), delete a certain key level to make it a 3-ary soft set by the domain remove operation (Definition 20), or form a deeper, larger n-tier soft set by product operations such as the soft direct product (Definition 25) and the concatenate product (Definition 24). It should be noted that all these operations are defined recursively only for the rigor of mathematical logic and the convenience of proof; this does not imply that they must be implemented by recursive algorithms.
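The transformation of Example 5 into the BigTable-like shape can be sketched on plain dictionaries. The function name `curry_relations` is hypothetical, and the code is only an analogue of applying currying and then raising the second domain; it does not reproduce the formal Definitions 21 and 23.

```python
def curry_relations(relations):
    """Turn {(d1, d2): {key: values}} into {d1: {key: {d2: values}}}.

    The first domain name becomes the outer key, the data keys become
    row keys, and the second domain name becomes the column key.
    """
    nested = {}
    for (first, second), mapping in relations.items():
        for key, values in mapping.items():
            row = nested.setdefault(first, {}).setdefault(key, {})
            row[second] = set(values)
    return nested


# Domain relation soft set R from Example 5 (sample data only).
R = {
    ("person", "name"): {"0001": {"Joe"}, "0002": {"Eva"}},
    ("person", "sex"): {"0001": {"male"}, "0002": {"female"}},
}

table = curry_relations(R)
```

After the call, `table["person"]["0001"]` holds both the name and the sex of the same row key, which is the row-key and column-key layout described above. The loop is iterative, in line with the remark that the recursive definitions need not be implemented recursively.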

3.2. Modeling with N-Tier Soft Set

Through the above definitions, we get the basic components needed to build the n-tier soft set data model. Next, we use an example to demonstrate the evolution from the relational model to the four popular NoSQL models, then to the n-tier soft set model, to show why and how to use the n-tier soft set data model for modeling.

In the traditional modeling process for relational databases, the initial stage is to understand the conceptual entities in the modeling domain and the relationships between them. Through discussion among domain experts, system architects, and data architects, these understandings usually end up as a so-called conceptual model, which is often represented by an ER diagram. Although the ER diagram is mostly used in the modeling process of the relational model, it can also provide a common conceptual starting point for all the other models in our discussion.

We suppose that we have designed a conceptual model, as shown in Figure 1. It shows a simplified scenario of a common E-shopping site, which contains four entities: customer, order, order item, and product, each represented by a rectangle. Ellipses connected to an entity with undirected edges represent the attributes of that entity, and the underlined ID attribute uniquely identifies a particular entity. A customer can place multiple orders, each of which contains multiple order items, and each order item relates to a particular product. The relationships between these entities are represented by diamonds with undirected edges, and the cardinalities are represented by n or 1 on the two sides of each diamond. Finally, customers can follow each other and see what their friends have bought on this shopping site. We represent this relationship by a diamond with both ends connected to the customer entity, and we use m and n on the two sides of the diamond to indicate a many-to-many relationship. In the following diagrams, we use rounded rectangles to represent data blocks, which can be atomic (if no internal structure is indicated) or composite (larger rectangles wrapping other rectangles). Red represents semantics, straight lines represent undirected relations, and arrows represent directed relations.

3.2.1. Modeling with Relational Model

First, let us see how to model the scenario with the relational model.

After obtaining a suitable conceptual model, the relational model transforms it into the structures and constraints of tables. As shown in Figure 2, the ID attributes that uniquely identify entities become the primary keys of the tables. The 1 : 1 and n : 1 relations among entities are represented by foreign keys inserted into the corresponding tables (such as customer_id in order table, or order_id in order_item table), and the m : n relationship will be implemented by adding a new relation table.

The advantages of the relational model lie in its simple and intuitive expression, its strict and elegant mathematical foundation, and the independence gained by separating the logical level from the physical level. Without any underlying implementation information, a relational database can freely express and obtain the information contained in an existing dataset with a small number of concise operations (relational algebra has been proved to be equivalent to first-order predicate calculus restricted to safe expressions). However, we can also observe several problems with the relational model. As Figure 2 shows, a table is a regular two-dimensional rectangular array. It consists of tuples that contain the same number of indivisible atomic elements, and a single header provides the semantic interpretation for all tuples. This form is simple and regular but can lead to the following problems:
(i) Flat: a tuple is a flat and restricted structure, which can only contain indivisible elements. These elements are regarded as atoms at the model level. They have no internal structure and cannot be nested, which restricts the ability to express complex objects and brings the so-called impedance mismatch problem.
(ii) Rigid: in a table, every tuple must contain the same fixed number of elements, and each element is rigidly coupled with its position, so even if there is actually no value in a position, its place must be filled with the null value.
(iii) Semantic and data separation: table heads (semantics) and table bodies (data) are separated. In the theory of the relational data model, table names and column names are defined by a metalanguage, and in a specific implementation, a relational database stores these metadata in a data dictionary kept apart from the data. This makes it necessary to process metadata separately before transferring data. The separation of semantics and data makes it difficult to transmit data over a network, whereas data formats such as XML or JSON, which combine semantics with data, can transmit the complete information at once.
(iv) Index and data separation: the relational model does not express information about how tuples are located or sorted. To find tuples containing certain values in a table, one has to scan and compare them one by one. This renders the relational model too reliant on external index structures in real use. However, indexing is not a part of the relational model. It not only consumes large storage space but also incurs maintenance costs.
(v) Data and data separation: whether in the same table or in different tables, the tuples of the relational model are separated from one another. Their connections, which need to be calculated dynamically, are implicit in the values of specific data. Conceptually, this shows that the relational model does not directly express the relations between entities. To find links between entities, it is necessary to connect tables with the Join operation, which is usually very time-consuming.

3.2.2. Modeling with NoSQL Models

These problems in the relational model have prompted the development of NoSQL data models and database products.

(1) Key-Value Store. Let us first look at the simplest of these: the key-value store. Its data model is very simple. As shown in Figure 3, the whole database can be divided into two parts: the set of keys on the left and the set of values on the right. We use arrows to indicate the corresponding one-directional access. In our case, we use the order ID as the key, and all information related to an order ID is placed in its value. The specific content of a value is determined by the upper application, and the database is only responsible for storing and retrieving it. Theoretically, a key-value store focuses only on efficient access to data, and values are opaque to the database, so users must parse them by themselves. If only a part of a value is required, the entire value has to be extracted and the unwanted content filtered out, which may be inefficient. So, the column family model and the document model add more internal structure to the values.

(2) Column Family. Logically, a column family model can be regarded as adding secondary pairs of column names and values inside the values of a key-value store model, and these secondary pairs can also be grouped into column families. As shown in Figure 4, on the left, the primary keys are also called the row keys, each of which locates a virtual row. On the right, column name strings (characters enclosed in quotation marks) serve as secondary keys that locate the values (technically, tertiary keys such as time stamps or version stamps may also be included, but skipping them does not affect our discussion). The prefixes in the column name strings divide them into different column families. The column family model can be regarded as a huge sparse two-dimensional table, which is more expressive than the key-value model. And because columns are represented by key-value pairs, they can be added and deleted freely. In our case, as in the key-value store model, we also use the order ID as the row key. However, the value has a richer structure. We store all customer information in the customer column family and all order items in the order item column family, and we merge product information into the latter (because each order item corresponds to exactly one product). Different order items are distinguished by assigning a number to the column key.

(3) Document. The document model has an even richer value structure than the column family model. As shown in Figure 5, a document database stores and retrieves documents like a file cabinet. These documents contain simple key-value pairs (similar to a key-value store), nested key-value pairs (similar to the combination of row keys and column keys in column families), lists (accessed by sequential numeric subscripts rather than keys), and other nestable contents. This makes the document model even more expressive, and a document can be easily converted into a programming object in an upper application. Like all key-value typed models, the form of documents is flexible, and the various structures in documents can be added or deleted freely. In our case, all information about customers, orders, and products is included in one document, which looks like an actual order list.

Generally speaking, all three models above use key-value pairs as the basic structure for organizing data. Different models use differently structured values, which provide different ways of aggregating information.
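The difference between the three value structures can be illustrated with one order stored three ways. This is a minimal sketch using in-memory dictionaries rather than any real database product; the order ID, column names, and helper functions are all hypothetical.

```python
import json

# Key-value store: the value is an opaque blob that only the application parses.
kv_store = {
    "o1001": json.dumps({"customer": "Joe", "items": [{"product": "tv", "qty": 1}]}),
}

# Column family: row key -> column name -> value; the prefix before the
# first ":" plays the role of the column family.
column_family = {
    "o1001": {
        "customer:name": "Joe",
        "order_item:1:product": "tv",
        "order_item:1:qty": "1",
    },
}

# Document: nested key-value pairs and lists that the database can see into.
document_store = {
    "o1001": {"customer": {"name": "Joe"}, "items": [{"product": "tv", "qty": 1}]},
}


def customer_name_kv(order_id):
    """The whole value must be fetched and parsed to read one field."""
    whole_value = json.loads(kv_store[order_id])
    return whole_value["customer"]


def columns_in_family(order_id, family):
    """Return only the columns of one family from a row."""
    row = column_family[order_id]
    return {name: value for name, value in row.items() if name.split(":", 1)[0] == family}
```

In the key-value case the application extracts the whole value before filtering it, while the column family and document forms expose enough internal structure to address a part of the value directly, for example `document_store["o1001"]["customer"]["name"]`.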

Key-value pairs are simple but essential. Keys provide semantics for the values, which decouples data from their positions and eliminates the rigidity of the system. A key-value pair is a self-described whole, and pairs no longer depend on each other in form. At the same time, keys also help locate values so that they can be accessed quickly. This allows key-value pairs to be easily dispersed across a cluster, and their contents and forms can be very free and flexible. So, we can predetermine all the required content according to the convenience of the upper application and aggregate it for fast access without the Join operation. That partly solves the problems of the relational model. However, key-value typed models also have some problems:
(i) Values can only be accessed one way, by keys, and keys cannot be retrieved by values in reverse (the arrows in the figures show the directions and granularities of access for the different models). To find specific key-value pairs by values, it is necessary to build external indexes or use external frameworks such as MapReduce for scanning.
(ii) There is no connection between key-value pairs. Discrete key-value pairs have many advantages, and they can be formed and operated on independently, but we also hope that they can maintain their logical connections (we will see how to achieve this in the subsequent discussion of the n-tier soft set model).
(iii) The form of key-value typed databases is flexible (they are known as schemaless databases), but their query and reasoning abilities are not (which is what the relational model is good at). The contents of aggregates are prepared and stored for specific needs, and aggregates designed for one application are not necessarily suitable for others, which becomes another kind of inflexibility.
(iv) Key-value typed models have no rigorous mathematical basis. A strict mathematical foundation not only makes the definition and expression of a model more rigorous but also facilitates the theoretical study of the model, the deduction of its properties and theorems (or the use of existing results), and the recognition of its logical reliability and completeness. It also makes it easier to design a concise and general query language (for example, the relational model achieves powerful logical expression with a few operations).
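Problem (i) above can be made concrete with a small sketch. It assumes a plain dictionary as a stand-in for a key-value store; the sample data and the two helper names are hypothetical.

```python
from collections import defaultdict

# A key-value store seen from the application: customer ID -> name.
store = {"0001": "Joe", "0002": "Eva", "0003": "Joe"}


def keys_for_value_by_scan(store, wanted):
    """Without an index, finding keys by value means scanning every pair."""
    return sorted(key for key, value in store.items() if value == wanted)


def build_reverse_index(store):
    """An external index: value -> set of keys. It must be rebuilt or
    updated whenever the store changes, and it takes extra space."""
    index = defaultdict(set)
    for key, value in store.items():
        index[value].add(key)
    return dict(index)


reverse_index = build_reverse_index(store)
```

The forward lookup `store["0001"]` is a single hash access, whereas the reverse question "which keys hold Joe?" needs either the full scan or the separately maintained `reverse_index`.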

(4) Graph. Graph models focus on solving the problem of lacking connections in the relational model and key-value typed models. As shown in Figure 6, the graph model consists of nodes and edges. Nodes are connected by edges, which can be directed or undirected. Nodes and edges can have attributes, which makes each of them look like a row in the column family model or a document in the document model. However, nodes are not separated but linked together by edges. The main point of graph modeling is not to express the attributes of nodes or edges but to describe the connections between nodes. In our case, in the upper part of Figure 6, the follower network can be clearly expressed and easily queried by using a graph model, which is difficult to achieve with the relational model and the other NoSQL models. Based on graph theory, the graph model has a mature mathematical foundation and a large number of ready-made results (theorems and algorithms), which enable it to deal with connections easily and solve complex problems such as finding the shortest connection path between two nodes. However, when it comes to issues that focus mainly on entities and their attributes (for example, classification or statistical reports), graph models have the same problems as the other NoSQL models. For example, in order to count the proportion of male and female users in a follower network, we still need an external index to locate the nodes from attributes, or we must count nodes by scanning the whole network.

3.2.3. Modeling with N-Tier Soft Set Model

Various models have been discussed above, as well as their problems. Now, let us take a look at how to model with the n-tier soft set model (hereafter referred to as the NTSS model).

(1) Rules for Modeling with the NTSS Model. Firstly, we introduce the rules for transforming the ER model into the NTSS model (rules for other conceptual models can be derived in the same way):
(i) Entities and attributes: as shown in Figure 7, entities and their attributes in the ER model are transformed into connections between entity domains and attribute domains in the NTSS model. An entity domain is a set that is used to uniquely identify and represent entities. If the entities in the original ER model have simple artificial primary keys (such as IDs), these are renamed (domain names in the NTSS model should be more descriptive) and converted into entity domains directly. If there are composite primary keys (composed of multiple attributes), simple artificial domains are added as entity domains.
(ii) Relations: as shown in Figure 8, relations that have no attributes in the ER model are represented by direct connections between entity domains in the NTSS model, and relations that have attributes are represented by connection domains and the attribute domains connected with them. If both sides of a relation are the same domain (a self-relation), two role domains are added to distinguish the different roles.
(iii) Connections in the NTSS model: as shown in Figure 9, any connection in the NTSS model is represented by a pair of domain relations whose names are reverse tuples (like (“customer_id,” “e-mail”) and (“e-mail,” “customer_id”)) and whose values are reverse binary soft sets. The connection is undirected, and the data on both sides of the connection can be accessed symmetrically.
(iv) Cardinality constraints: the cardinality constraint of a connection is expressed and implemented by the values of its domain relation pair. For any connection C between domain A and domain B, with domain relation pair (A, B) and (B, A), we have the following:
If it is a 1 : 1 connection, the values of (A, B) and (B, A) are both single-valued soft sets (every image is either the empty set or has exactly one element), and they are reverses of each other.
If it is a 1 : n connection (one element of A to many elements of B), the value of (A, B) is a common soft set, the value of (B, A) is a single-valued soft set, and they are reverses of each other.
If it is an n : m connection, the values of (A, B) and (B, A) are both common soft sets, and they are reverses of each other.

For example, in Figure 9, the relation between “customer_id” and “e-mail” is a 1:n relation (one customer can have multiple e-mail addresses). So, the value of domain relation (“customer_id,” “e-mail”) is a common soft set, and the value of domain relation (“e-mail,” “customer_id”) is a single-valued soft set.
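The rule that a connection is a pair of mutually reverse soft sets, and that its cardinality can be read from their values, can be sketched as follows. Soft sets are modeled as dictionaries from keys to sets; the function names and the e-mail addresses are hypothetical.

```python
def reverse_soft_set(soft_set):
    """Build the reverse binary soft set: value -> set of keys."""
    reverse = {}
    for key, values in soft_set.items():
        for value in values:
            reverse.setdefault(value, set()).add(key)
    return reverse


def is_single_valued(soft_set):
    """Every image is empty or has exactly one element."""
    return all(len(values) <= 1 for values in soft_set.values())


def cardinality(forward):
    """Classify the connection whose (A, B) value is `forward`."""
    backward = reverse_soft_set(forward)
    forward_single = is_single_valued(forward)
    backward_single = is_single_valued(backward)
    if forward_single and backward_single:
        return "1:1"
    if backward_single:
        return "1:n"  # (A, B) is common, (B, A) is single-valued
    if forward_single:
        return "n:1"
    return "n:m"


# Value of the domain relation ("customer_id", "e-mail"): one customer
# may have several addresses, but each address belongs to one customer.
customer_email = {
    "0001": {"joe@example.com", "joe@work.example"},
    "0002": {"eva@example.com"},
}
email_customer = reverse_soft_set(customer_email)
```

Here `customer_email` plays the role of the common soft set and `email_customer` the single-valued one, so the pair is classified as a 1 : n connection, matching the Figure 9 example.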

(2) Features and Advantages of the NTSS Model. The whole picture of converting the ER model in Figure 1 to the NTSS model is shown in Figure 10.
(i) Macroscopically: we can see the similarities between the NTSS model and the ER model in the upper half of the figure. The NTSS model, like conceptual models such as ER, retains the intuitive panorama of its modeling domain and constructs a network of domains, which has rich semantics and sufficient connections and is close to natural human thinking.
(ii) Microscopically: in the lower half of the figure, each connection between domains in the NTSS model is represented by a pair of named soft sets that are reverses of each other. An NTSS database is actually made up of such pairs of soft sets.
(iii) In implementation: an NTSS database is an n-tier soft set, so it can be uncurried (Definition 22) into a multivariate function, which can be implemented by key-value pairs. For example, a piece of information about customers’ names in an NTSS database “E−Shopping” can be represented as
{
 “E−Shopping”: {
  (“customer_id”, “name”): {
   “0001”: “Joe”,
   “0002”: “Eva”,
   ...
  },
  (“name”, “customer_id”): {
   “Joe”: {“0001”, “0086”, “0223”, ...},
   “Eva”: {“0002”, “0332”, “0487”, ...},
   ...
  }
 }
}
which is a 4-tier soft set and can be transformed into key-value pairs as
{
 “E−Shopping, (customer_id, name), 0001”: “Joe”,
 “E−Shopping, (customer_id, name), 0002”: “Eva”,
 ...
 “E−Shopping, (name, customer_id), Joe”: {“0001”, “0086”, “0223”, ...},
 “E−Shopping, (name, customer_id), Eva”: {“0002”, “0332”, “0487”, ...},
 ...
}
So, if we use a hash table as the underlying implementation of an NTSS database, the information contained in the keys will be implied in the storage addresses, and the values will be hashed while the logical structure of the database is maintained.
(iv) In usage: through our formal definitions, for application programmers, an NTSS database is just a function with a set of well-defined operations and uniform specifications. In fact, referring to the example above, let B be the database soft set that contains the “E−Shopping” database; in an upper programming language, the database soft set B is just a function whose return values are also functions. Given the parameter “E−Shopping,” B(“E−Shopping”) returns the value (a domain relation soft set) of the database named “E−Shopping,” which can still be regarded as a function. Given the parameter (“customer_id,” “name”), B(“E−Shopping”)(“customer_id,” “name”) returns the value of the domain relation (still a function) between “customer_id” and “name.” Given a “customer_id” such as “0001,” B(“E−Shopping”)(“customer_id,” “name”)(“0001”) returns the name of the customer. This is very natural in languages that support functional programming and naturally constitutes a concise query language.
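Both views of the same database, flat key-value pairs and a function returning functions, can be sketched with nested dictionaries. The helpers `uncurry` and `as_function` are hypothetical names, tuples are used as composite keys instead of the comma-joined strings shown above, and the database name is written with an ASCII hyphen.

```python
def uncurry(nested, prefix=()):
    """Flatten nested dictionaries into {path tuple: leaf value}."""
    flat = {}
    for key, value in nested.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            flat.update(uncurry(value, path))
        else:
            flat[path] = value
    return flat


def as_function(nested):
    """Wrap nested dictionaries as a function whose results are functions
    until a leaf value is reached."""
    def lookup(key):
        value = nested[key]
        return as_function(value) if isinstance(value, dict) else value
    return lookup


database = {
    "E-Shopping": {
        ("customer_id", "name"): {"0001": "Joe", "0002": "Eva"},
        ("name", "customer_id"): {"Joe": {"0001", "0086"}, "Eva": {"0002", "0332"}},
    },
}

flat = uncurry(database)   # key-value view, suitable for a hash table
B = as_function(database)  # function view used by application code

name_of_0001 = B("E-Shopping")(("customer_id", "name"))("0001")
ids_named_joe = B("E-Shopping")(("name", "customer_id"))("Joe")
```

Each entry of `flat` is addressed by one composite key such as `("E-Shopping", ("customer_id", "name"), "0001")`, so a single hash lookup reaches the datum, while the chained calls on `B` mirror the query style described in the text.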

Next, we will expound the advantages of the NTSS model and explain why the NTSS model is suitable for dealing with big data.

First, we show the performance advantages of the NTSS database over the relational database through a comparative experiment. We implemented a prototype database based on NTSS (using Python) and compared it with MySQL 8.0.15 on a computer with a 2.6 GHz Intel Core i7, 16 GB of 1600 MHz DDR3 memory, and a 512 GB PCI SSD. We built three experimental data tables, Customer, Product, and Buy, to express the records of customers purchasing products. Each time 10,000 records are written, the time consumed by the write is recorded; then the names of the customers who purchased 5 random products are queried, and the time consumed by the read is recorded.

MySQL write and read statements are similar to the following:
# Write
insert into customer (cust_id, cust_name, cust_sex) values ('c00001', 'Joe', 'male');
insert into product (prod_id, prod_name, prod_desc) values ('p00001', 'tv', 'just_a_television');
insert into buy (cust_id, prod_id, buy_time) values ('c00001', 'p00001', '20190101163749');
# Read
select prod_name, cust_name from customer a, buy b, product c where a.cust_id = b.cust_id and b.prod_id = c.prod_id and prod_name in ('tv', 'phone', 'pad', 'car', 'coke');

NTSS write and read statements are similar to the following:
import ntssdb as nb
nb = nb.connect(host="localhost", dbname="test", user="root", password="pw")
# Write
nb("cust_id", "cust_name", "cust_sex").put("c00001", "Joe", "male")
nb("prod_id", "prod_name", "prod_desc").put("p00001", "tv", "just a television")
nb("buy_id", "cust_id", "prod_id", "buy_time").put("b00001", "c00001", "p00001", "20190101163749")
# Read
nb("prod_name", "prod_id", "buy_id", "cust_id", "cust_name").get("tv", "phone", "pad", "car", "coke")
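The read statement names a chain of domains from “prod_name” to “cust_name.” One way to picture such a query, assuming each adjacent pair of domains is stored as a soft set, is to follow the chain hop by hop. The sketch below does not reproduce the internals or the API of the prototype; the function `follow_path` and all sample data are hypothetical.

```python
# One soft set per adjacent pair of domains along the query path.
relations = {
    ("prod_name", "prod_id"): {"tv": {"p00001"}, "phone": {"p00002"}},
    ("prod_id", "buy_id"): {"p00001": {"b00001", "b00002"}, "p00002": {"b00003"}},
    ("buy_id", "cust_id"): {"b00001": {"c00001"}, "b00002": {"c00002"}, "b00003": {"c00001"}},
    ("cust_id", "cust_name"): {"c00001": {"Joe"}, "c00002": {"Eva"}},
}


def follow_path(relations, domains, start_values):
    """Start from values in the first domain and follow each connection
    in turn; every hop is a dictionary lookup, not a table scan."""
    current = set(start_values)
    for left, right in zip(domains, domains[1:]):
        mapping = relations[(left, right)]
        reached = set()
        for value in current:
            reached |= mapping.get(value, set())
        current = reached
    return current


path = ["prod_name", "prod_id", "buy_id", "cust_id", "cust_name"]
buyers_of_tv = follow_path(relations, path, ["tv"])
buyers_of_phone = follow_path(relations, path, ["phone"])
```

Unlike the SQL statement, this sketch returns only the set of customer names reached from the given product names rather than (product, customer) pairs; it is meant to show why such a query needs no Join over whole tables when every connection is stored as a soft set.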

In the experiment, we compared five key indicators with MySQL:
(1) How the insertion time grows with the amount of data when MySQL is not indexed.
(2) How the reading time grows with the amount of data when MySQL is not indexed.
(3) How the insertion time grows with the amount of data when MySQL is indexed.
(4) How the reading time grows with the amount of data when MySQL is indexed.
(5) Space usage.

It can be seen from Figures 11–13 that when MySQL has no index, its insertion time can be regarded as constant, O(1), while its random read time is O(n) (the whole table needs to be traversed). The insertion and read times of NTSS are both constant, O(1) (the hash table is inserted into and read directly). Still, per inserted record, NTSS is about four times slower than MySQL. When reading, however, thanks to its O(1) complexity, NTSS is much faster than MySQL without an index.

When MySQL is indexed, its insertion and reading times are O(log n) in theory (because MySQL indexes are usually implemented as B+ trees), while those of NTSS are O(1). From the actual test data, we can see that the insertion and reading times of MySQL grow as the amount of data increases, while those of NTSS fluctuate within a stable range.

For space usage, NTSS is about 2.73 times as large as a nonindexed MySQL database (NTSS: 402 MB, MySQL: 147 MB) to store the same data. However, if MySQL wants to query more freely (index all columns), its index space will be about 307 MB, so it will take up 147 + 307 = 454 MB in total, which is higher than that of NTSS.

We do not compare performance with current NoSQL databases. As a prototype implemented in Python, NTSS is not comparable in performance with mature NoSQL databases that have evolved over many years. Compared with current NoSQL databases, NTSS has the advantages of query freedom and mathematical logicality. Take MongoDB as an example: as a popular database, MongoDB is widely applied in everyday applications and has extremely high performance in some queries, but it has no mathematical logicality and cannot be queried freely (it is strong at querying from key to value but weak at querying from value to key). So, obtaining the relationships between values is costly (it needs an index structure or a traversal scan). The NTSS model, however, is a model with complete mathematical logicality and can be queried freely between keys and values. From an implementation perspective, the NTSS database cannot yet compete with MongoDB because it is still at the prototype level; it is expected to approach the current mainstream NoSQL databases gradually through future improvements.

Based on the above experiments and previous discussions, we can clearly see that the NTSS model has the following advantages:
(i) Efficient performance: as we have seen in the comparison experiment, MySQL is a relational database whose data and indexes are separate, and its performance depends on the design of its indexes; write performance, read performance, and convenience of query cannot all be achieved at the same time. However, an NTSS database can be transformed into key-value pairs and implemented directly as a hash table; therefore, any datum in it can be written or read with an average time complexity of O(1).
(ii) Schemaless: the NTSS model represents entities or aggregates as interconnections between domains rather than as fixed tables. Connections in the NTSS model are logically represented by n-tier soft sets and implemented by key-value pairs at the underlying level; they are independent of each other and can be added or deleted at will without mutual influence. This solves the flat and rigid problems of the relational model. For example, if we want to split the “name” domain connected to “customer_id” into “first_name” and “last_name,” we only need to add two new connections between “customer_id” and “first_name” and between “customer_id” and “last_name” and delete the original one. This does not affect other parts of the database either logically or physically.
(iii) Semantic and data integration: the NTSS model represents semantics and data in an integrated way, which makes data easier to move and disperse. It is no longer necessary to process metadata separately.
(iv) Index and data integration: an instance of the NTSS model is a nested index structure, and each atomic datum has a unique logical access path. The data stored in a database formed by the NTSS model are a complete index system in themselves, and every domain can be used as an index key to locate data in the other domains connected to it, which solves the problem of index and data separation in the relational model. This is the key to efficient performance and sufficient connections.
(v) Sufficient connections: the atomic data in an NTSS database are no longer isolated but form a network. In the NTSS model, entity domains are connected to each other, and attribute domains are connected to entity domains. These connections are static parts of the model, and each connection is bidirectional. This solves both the lack of connections in the relational model and the key-value typed models and the one-directional access of the key-value typed models.
(vi) Rigorous mathematical foundation: based on n-tier soft set theory, the NTSS model has a rigorous formal definition, which is not available in other key-value typed models. This not only makes the NTSS model more precise in definition and expression but also facilitates more in-depth theoretical research. It enables us to infer richer properties (or to use the existing mathematical results on soft sets) and to understand its logical reliability and completeness. It also makes it convenient to design a concise and general query language and to achieve complete logical expressiveness with as few operations as the relational model.
(vii) Powerful query ability: through the rigorously defined operations, the fast access brought by index and data integration, and the sufficient connections between data, the NTSS model can be queried as freely and completely as the relational model, but in a big data environment. In the comparison experiment with MySQL, we not only wrote and read key-value pairs but also wrote the same logical structure as in the relational model and implemented the same query as the multi-table join SELECT statement.
(viii) Convenient for programming: from a programming perspective, all the structures that make up the NTSS model, including tuples, sets, and dictionaries, are built into most programming languages and can be processed natively.
(ix) Easy to model: the similarity between the NTSS model and the ER model shows that the macroscopic view of the NTSS model is close to the original form of human thinking and modeling, so modeling can be carried out intuitively.
(x) Convenient for statistical use: each domain can be used as a statistical dimension, and most of the values related to it are already sets that can be obtained directly. For these sets, counts, sums, averages, and other statistical indicators are easy to calculate.
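Advantages (ii) and (x) can be illustrated together on a dictionary-based stand-in for an NTSS database, where each connection is stored as a pair of reverse soft sets. All helper names and the sample customers below are hypothetical.

```python
def reverse(forward):
    """Reverse soft set: value -> set of keys."""
    backward = {}
    for key, values in forward.items():
        for value in values:
            backward.setdefault(value, set()).add(key)
    return backward


def add_connection(db, left, right, forward):
    """Store a connection as two mutually reverse domain relations."""
    db[(left, right)] = {key: set(values) for key, values in forward.items()}
    db[(right, left)] = reverse(forward)


def remove_connection(db, left, right):
    """Delete both directions of a connection; nothing else is touched."""
    db.pop((left, right), None)
    db.pop((right, left), None)


def count_by(db, dimension, entity):
    """Use `dimension` as a statistical dimension: count the entities per value."""
    return {value: len(keys) for value, keys in db[(dimension, entity)].items()}


db = {}
add_connection(db, "customer_id", "name", {"0001": {"Joe Smith"}, "0002": {"Eva Jones"}})
add_connection(db, "customer_id", "sex", {"0001": {"male"}, "0002": {"female"}, "0003": {"male"}})

# Schema change: split "name" into "first_name" and "last_name".
full_names = db[("customer_id", "name")]
first = {cid: {n.split()[0] for n in names} for cid, names in full_names.items()}
last = {cid: {n.split()[-1] for n in names} for cid, names in full_names.items()}
add_connection(db, "customer_id", "first_name", first)
add_connection(db, "customer_id", "last_name", last)
remove_connection(db, "customer_id", "name")

# Statistics: the reverse relation already groups customers by sex.
sex_counts = count_by(db, "sex", "customer_id")
```

The split touches only the connections involving “name,” leaving the “sex” connection unchanged, and the count per sex is read directly from the sizes of the stored sets without scanning customers.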

Using the conclusions in [3, 58], we summarize the difference between the relational model, the four NoSQL models, and the NTSS model as shown in Table 1.

Through the discussion above, we can see that the NTSS model is indeed a data model suitable for dealing with big data characterized by the 4 Vs. For Volume, an NTSS database is a discrete key-value structure and has natural support for distributed clusters. For Velocity, the underlying key-value implementation provides fast and flexible data processing. For Variety, as a schemaless model, it can be altered at will, making it easy to respond to changing requirements or different data sources. For Value, the complete logical structure between the data is preserved and can be queried freely, and storing set values also facilitates statistics and data mining. Moreover, based on the features of the NTSS model, it is possible to realize an implementation with intelligent data distribution, which can automatically adapt to the status of the cluster, intelligently divide the soft aggregates, and still maintain the semantic and logical structure between the data, without manual sharding design or aggregation design.

4. Conclusion

The n-tier soft set theory and the n-tier soft set data model have been proposed. We defined them in a strictly formal way and illustrated the process and design considerations. We explained why and how to use the n-tier soft set model for modeling and described its features and advantages.

However, a lot of details have not been covered, such as richer algebraic properties and detailed implementation aspects, which will be progressively fulfilled in the future.

Nevertheless, we believe that through this paper, we have not only expanded the frontier of soft set theory but also shed light on the promising prospect of developing a new database product based on the NTSS model to meet the challenge of big data. In the future, the database, which is currently a Python implementation for theoretical verification, will be rewritten in Scala and released as open source to improve its capabilities.

Data Availability

The data used to support the findings of this study are available upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (grant no. 72071021).