Research Article | Open Access
Lin Yue, Wanli Zuo, Lizhou Feng, Lin Guo, "OMFM: A Framework of Object Merging Based on Fuzzy Multisets", Mathematical Problems in Engineering, vol. 2014, Article ID 304537, 15 pages, 2014. https://doi.org/10.1155/2014/304537
OMFM: A Framework of Object Merging Based on Fuzzy Multisets
Information fusion is the process of merging information from multiple sources into a new set of information. Existing work on information fusion applies to various scenarios such as multiagent systems, group decision making, and multidocument summarization. This paper develops a framework for solving the object merging problem based on fuzzy multisets. The objects defined in this paper are data segments in a document fusion task; they refer to concepts in which semantically related terms with different semantic relations are embedded. The fundamental operation is the merge function, which maps data segments in multiple fuzzy multisets onto one object, called a solution. Under this framework, we define the quality measures of purity and entropy to quantify the quality of solutions, balancing the accuracy and completeness of the results. A merge function that yields solutions of this kind is called a VI-optimal merge function, and a series of its theoretical properties are studied. Finally, we investigate the proposed framework in a special application scenario (i.e., document fusion), which is related to the task of multidocument summarization, and show how the framework works with an illustrative example.
As an important research area, information fusion is the process of merging information from multiple sources into a new set of information. It has many applications, such as heterogeneous databases, multiagent systems, group decision making, and multidocument summarization. Different application scenarios call for different principles and procedures. Many classical mathematical theories of aggregation operators [1–5] have been developed for multiagent systems and group decision making systems; the information that aggregation operators try to fuse typically expresses the opinion or score of an agent. Besides this research, a fair amount of work has focused on the situation where each source is regarded as a propositional belief [6–8]. The existence of nonfactual knowledge, such as integrity constraints and inference rules, is what distinguishes these two theories. As a result, a lot of work on first-order theories has been done in the heterogeneous database area. In another type of fusion, each source presents knowledge by means of a possibility distribution; in this case, imperfections in the data such as incorrectness, uncertainty, and incompleteness must be coped with. The main challenge is how to deal with conflicting information provided by different sources.
To address the issues in the third type of fusion, a framework of object merging has recently been investigated using multiset theory, which can be used to solve the problem of multidocument summarization (MDS). Object merging is also an active research topic in many domains, with good prospects for application. The framework of multiset merging for MDS defines a merge function that maps the objects in multisets onto a single object. Its results so far cannot yet be considered a final summary [11, 12], because they are merely keywords without any relations among them, not to mention the co-text, the context of culture, the context of situation, and so forth. The essential reason is that the framework defines its quality measures with the multiplicity of an element as the measure of that element's importance. In other words, the multiplicity equals the term frequency, which is only a shallow text feature. When performing source selection in MDS, the traditional method transforms a document into a vector of words or a multiset of words, both of which are simple representations. More advanced approaches are needed that are semantically richer than using words as the source representation. In short, the problem of processing coreferent objects has not yet been deeply investigated. On the one hand, the merging of nonquantitative objects, especially objects carrying semantic information, has not been proposed. On the other hand, object merging functions and the rationality of merging still need further investigation.
Within the scope of this paper, we also focus on the problem of object merging in information fusion, and our work should be treated as an extension of the framework mentioned above. There are several differences between the two works. The basic difference concerns the definition of coreferent objects: in the earlier framework, coreferent objects are objects describing the same entity in the real world, while in this paper an object is a piece of data or information that denotes the same concept, with semantically related terms of different semantic relations embedded. In addition, this paper uses fuzzy multiset theory, in which the membership degree function and the length function describe both the uncertainty and the repeatability of natural language. When performing fusion in practical situations, the object merging process considers deep text features in the form of semantic relations such as hypernym, synonym, and antonym. Moreover, two quality measures widely used in the text mining literature (purity and entropy [14, 15]) are adopted to quantify the result of a merge function. Thus, the behavior of the merge functions defined in this paper can be characterized by the behavior of the quality measures, and with this strategy we can obtain an optimal merge result. A possible application of this work is document fusion, where a collection of textual documents is used to produce the shortest description containing all information found within the document set, but without repetition. Existing solutions to this problem normally rely on statistical or heuristic methods used in multidocument summarization [17, 18]. In this paper, object merging based on fuzzy multisets (OMFM) is a meaningful attempt in which a source set of multiple documents is denoted as a multiset and each document is denoted as a fuzzy multiset of multiple concepts.
This paper is organized as follows. In Section 2, we review the mathematical preliminaries. The general framework of objects and object merging is proposed in Section 3, and the definition of the quality measures and the construction of merge functions are introduced in Section 4. Next, Section 5 demonstrates how our framework works on a practical problem (i.e., document fusion) with an illustrative example. Finally, in Section 6, we give the conclusion and future work for the proposed framework OMFM.
In mathematics, a fuzzy set, introduced by Zadeh in 1965, is a set whose elements have degrees of membership; it extends the classical notion of a set [19, 20]. Fuzzy set theory is very useful for problems that are not easily handled by classical computing techniques. The use of membership degrees instead of real numbers to represent memberships also provides a means to measure possible uncertainty in the computational theory of languages. The notion of a multiset is a generalization of the classical notion of a set in which members are allowed to appear more than once. As a data structure, a multiset stands between strings, where a linear ordering of symbols is present, and sets, where no ordering is considered. Combined with the notion of a fuzzy set, the multiset is generalized to the fuzzy multiset, which can describe both the uncertainty and the repeatability of natural language. Consider one language modeling problem: given some sentences, identify the concepts and words that are similar or identical, and merge these objects to obtain a condensed description. This task is a challenging natural language problem with large amounts of diverse and compositional data. To solve it, we extend fuzzy multisets to produce a language model that maps data segments in multiple fuzzy multisets onto one object, where different semantic relations for one concept are treated as repeated elements with different membership degrees in the fuzzy multisets. In this section, the mathematical theories of fuzzy sets, multisets, and fuzzy multisets are briefly reviewed.
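Definition 1 (membership function). The membership function $\mu_A : U \to [0, 1]$ indicates the degree to which $u \in U$ belongs to $A$. $\mu_A(u) = 1$ indicates that element $u$ completely belongs to set $A$; that is, $\mu_A(u) \in \{0, 1\}$ corresponds to the concept of a traditional set.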
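Definition 2 (fuzzy set). The membership function $\mu_A$ over $U$ defines a fuzzy set, which is represented as $A = \{(u, \mu_A(u)) \mid u \in U\}$. A fuzzy set with $n$ elements can be denoted as $A = \{\mu_A(u_1)/u_1, \mu_A(u_2)/u_2, \ldots, \mu_A(u_n)/u_n\}$.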
According to the definition of a fuzzy set, the extent to which an object belongs to a set is no longer fixed, and the membership of each object falls in the interval $[0, 1]$.
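Definition 3 (multiset). A multiset $M$ over the universe $U$ is defined as a function $M : U \to \mathbb{N}$, where $\mathbb{N} = \{0, 1, 2, \ldots\}$. For $u \in U$, $M(u)$ denotes the multiplicity of $u$ in $M$.
The cardinality of a multiset $M$ is given by $|M| = \sum_{u \in U} M(u)$.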
A multiset also could be denoted as . The set of all multisets drawn from a universe is denoted as .
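Some basic operators and relations on multisets $A$ and $B$ over $U$ are listed below.
Inclusion: $A \subseteq B \Leftrightarrow \forall u \in U : A(u) \le B(u)$.
Equality: $A = B \Leftrightarrow \forall u \in U : A(u) = B(u)$.
Intersection: $(A \cap B)(u) = \min(A(u), B(u))$.
Union: $(A \cup B)(u) = \max(A(u), B(u))$.
Addition: $(A + B)(u) = A(u) + B(u)$.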
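As a small illustration, the multiset relations and operators can be sketched with Python's `collections.Counter`, assuming the standard element-wise definitions (minimum for intersection, maximum for union, sum for addition). The element names and the helper `includes` are hypothetical and serve only this example.

```python
from collections import Counter


def includes(a: Counter, b: Counter) -> bool:
    """Return True when multiset a is included in multiset b."""
    # Inclusion: every multiplicity in a is at most the one in b.
    return all(count <= b[u] for u, count in a.items())


# Two illustrative multisets over a small universe of terms.
A = Counter({"milk": 2, "spoilage": 1})
B = Counter({"milk": 3, "spoilage": 1, "refund": 1})

intersection = A & B             # element-wise minimum of multiplicities
union = A | B                    # element-wise maximum of multiplicities
addition = A + B                 # element-wise sum of multiplicities
cardinality_A = sum(A.values())  # sum of all multiplicities in A
```

Here `includes(A, B)` holds while `includes(B, A)` does not, because `B` contains `"milk"` more often than `A` does.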
Definition 4 (-cut set of multiset). The -cut set of a multiset is denoted as and given by .
Note that the difference between the notation and is that the former one means assigning an index to the multiset and the latter one means the -cut set of the multiset .
2.2. Fuzzy Multiset
Combined with the concept of fuzzy set, the traditional concept of multiset could be denoted as where , represents the membership degree of element recurring in multiset for the th time. According to the traditional concept of set, the membership degree of all elements in a multiset is 1. For , the traditional sense of multiset is generalized to fuzzy multiset.
Definition 5 (fuzzy multiset). A fuzzy multiset over the universe is defined as where is the degree of membership of element ().
Definition 6 (length function). The concept of multiplicity in multiset generalized to fuzzy multiset is length function, which is denoted as .
The number of occurrence or cardinality of a fuzzy multiset is given by .
There are some basic operators and relations of fuzzy multiset below: Inclusion: Equality: Addition: For , , . Intersection: Union: Note that when performing any operator for two fuzzy multisets, the length of the membership degree sequences and , will be set up to be similar. For this reason, adequate number of zeros is affixed. For , the membership degree is arranged by the decreasing order of the elements in the sequence, where and .
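The zero-padding and ordering conventions described above can be sketched as follows. This is a minimal illustration, assuming the commonly used position-wise definitions for fuzzy multisets: union and intersection take the maximum and minimum of the padded, decreasingly sorted membership sequences, and addition merges the two sequences. The dictionary representation and all function names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Assumed representation: element -> list of membership degrees.
FuzzyMultiset = Dict[str, List[float]]


def normalize(seq: Iterable[float]) -> List[float]:
    """Sort degrees in decreasing order and drop zero entries."""
    return sorted((d for d in seq if d > 0), reverse=True)


def padded(a: FuzzyMultiset, b: FuzzyMultiset, u: str) -> Tuple[List[float], List[float]]:
    """Return both sequences of u, zero-padded to a common length."""
    sa, sb = normalize(a.get(u, [])), normalize(b.get(u, []))
    n = max(len(sa), len(sb))
    return sa + [0.0] * (n - len(sa)), sb + [0.0] * (n - len(sb))


def combine(a: FuzzyMultiset, b: FuzzyMultiset,
            op: Callable[[float, float], float]) -> FuzzyMultiset:
    """Apply op position by position to the padded sequences."""
    result = {}
    for u in set(a) | set(b):
        sa, sb = padded(a, b, u)
        seq = normalize(op(x, y) for x, y in zip(sa, sb))
        if seq:
            result[u] = seq
    return result


def union(a: FuzzyMultiset, b: FuzzyMultiset) -> FuzzyMultiset:
    return combine(a, b, max)


def intersection(a: FuzzyMultiset, b: FuzzyMultiset) -> FuzzyMultiset:
    return combine(a, b, min)


def addition(a: FuzzyMultiset, b: FuzzyMultiset) -> FuzzyMultiset:
    """Merge the two membership sequences of every element."""
    result = {}
    for u in set(a) | set(b):
        seq = normalize(a.get(u, []) + b.get(u, []))
        if seq:
            result[u] = seq
    return result


def length(a: FuzzyMultiset, u: str) -> int:
    """Number of nonzero membership degrees of u."""
    return len(normalize(a.get(u, [])))


def is_included(a: FuzzyMultiset, b: FuzzyMultiset) -> bool:
    """Inclusion: every padded degree in a is at most the one in b."""
    return all(x <= y for u in a for x, y in zip(*padded(a, b, u)))


A = {"x": [0.9, 0.5], "y": [0.3]}
B = {"x": [0.7, 0.6, 0.2], "z": [1.0]}
```

With these inputs, `union(A, B)` gives `x` the sequence `[0.9, 0.6, 0.2]`, while `intersection(A, B)` keeps only `x` with `[0.7, 0.5]`, because `y` and `z` each occur in just one of the two sources.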
Example 7. A fuzzy multiset over ; the membership degree of element is ; the membership degree of element is ; the membership degree of element is ; the membership degree of element is , where , , , and .
The set of all fuzzy multisets drawn from a universe is denoted as .
Definition 8 (-cut set of fuzzy multiset). The -cut set of a fuzzy multiset is denoted as and given by Note that the difference between the notation and notation is that is preserved for the -cut set of the fuzzy multiset , while means assigning an index to the fuzzy multiset .
3. Objects and Object Merging
3.1. The General Framework
We have reviewed the most relevant definitions in the previous section. As mentioned earlier, our framework extends the earlier multiset merging framework, so we first introduce its basis. This basis involves the redefinition of coreferent objects and of the merge function in OMFM, and a brief review of the properties of preservation and the majority rule.
The bases involve the redefinitions of coreferent objects and merge function in OMFM.
Reference function is formalized to describe a concept in the real world, where symbolizes the real world. By definition, two concepts are called coreferent if they describe the same real world concept.
Definition 9 (coreferent objects). Let be a universe set of concepts. Two concepts and are coreferent if and only if .
By the definition above, two objects that describe the same real-world concept, with semantically related terms of different semantic relations embedded, are formalized axiomatically. Here, we take the context as the baseline: when a theme is described in a document, semantically related terms relating to this concept are used to extend the theme.
Definition 10 (merge function). The merge function in OMFM is represented by function .
The job of the merge function in our work is to map fuzzy multisets of objects onto a single object. Such functions are often idempotent; that is, merging a source that contains only copies of one object returns that object. This also holds in this paper, and the corresponding proof is given in the following section.
A brief review of two important properties.
Property 1 (preservation). A merge function is preservative when it only selects one of the elements of the source set; that is, the result of the merge is always an element that occurs in the source.
Property 2 (majority rule). If the multiplicity of an element is larger than half the cardinality of the source set, then this element must be selected by the merge function.
The majority rule is an important property of merge functions on multisets; it has been studied further, and a weaker version has been proved, in earlier work. So far, the majority rule has not been extended to fuzzy multisets because it does not apply in general, but the preservation rule is elaborated in this paper.
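For ordinary multisets, the two properties can be checked mechanically for a given merge function and a given source. The sketch below is illustrative only: it assumes a merge function is any callable that maps a multiset (a `Counter`) to a single element, and `mode_merge`, `max_merge`, and the two checker names are hypothetical.

```python
from collections import Counter


def mode_merge(source: Counter) -> str:
    """Select the element with the highest multiplicity."""
    return max(source, key=lambda u: source[u])


def max_merge(source: Counter) -> str:
    """Select the lexicographically largest element that occurs in the source."""
    return max(u for u in source if source[u] > 0)


def is_preserving_on(merge, source: Counter) -> bool:
    """Preservation: the merge result occurs in the source."""
    return source[merge(source)] > 0


def satisfies_majority_on(merge, source: Counter) -> bool:
    """Majority rule: an element with more than half of the
    total multiplicity must be the merge result."""
    total = sum(source.values())
    for u, multiplicity in source.items():
        if multiplicity > total / 2:
            return merge(source) == u
    return True  # no majority element, so the rule imposes nothing


source = Counter({"a": 3, "b": 1, "c": 1})
```

On this source both functions are preservative, but only `mode_merge` respects the majority rule: `"a"` holds three of the five occurrences, yet `max_merge` returns `"c"`.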
3.2. Merging of Fuzzy Multisets
Within the scope of OMFM, we focus on the case of object merging of compound a multiset and multiple fuzzy multisets with the function of the type below: where the elements of are denoted as , and the elements of are . Here, the multiset could be denoted as , where denotes the multiplicity of in .
The fundamental operator is mapping the data segments of fuzzy multisets onto one object, which is called a solution. In following sections the symbol is used to represent a random solution of a given merge function; that is, .
Example 11. Given a fuzzy multiset over and the multiset , such that Here, consider two merge functions we have mentioned in Section 2, . Then, we can obtain that Here, the fuzzy multiset is referred to as source or source fuzzy multiset; the multiplicity of different source fuzzy multiset is not considered by merge function.
The case is is not an upper bounded lattice. The normalization criterion that is needed when performing merge functions is usually omitted by fuzzy multiset theory. Therefore, we show another property below.
Property 3 (boundedness). A bounded merge function over should satisfy the following constraint:
It indicates that the merge function selects one of the elements from all the source sets. A corresponding inference is that
This inference explains that any element not belonging to any source set should not exist in the outcomes of a bounded merge function. We could easily get this natural property just from the observation, because element with membership degree should not be mixed into a solution arbitrarily. Also, it is a weaker notion of preservation. Besides, we also formulate the enforcing preservation: Then, Property 3 is equivalent to indicating
Earlier work has pointed out that keeping the weaker version of Property 2 is advantageous in the multiset setting. It takes multidocument summarization (MDS) as an example to explain that strict preservation would lead to a bad result in practice: one of the documents itself would become the summary of the entire document set. In contrast, the task of document fusion (DF) is to generate a text containing all the information in the entire document set. So a weaker version of Property 2 is also advantageous in our framework. The bounded merge function on fuzzy multisets is elaborated further in subsequent sections.
Theorem 12. The functions , , and are bounded.
Proof. We can get the proof from the case that for any
4. Optimal Merging of Fuzzy Multisets
4.1. Quality Measures
The purpose of defining quality measures is to construct merge functions that perform well for object merging over multiple fuzzy multisets. On the one hand, the behavior of the merge functions we define can be characterized by the values of the quality measures. On the other hand, adjusting a merge function can optimize the balance between the accuracy and completeness of a given solution and thus yield higher values of the quality measures. The relationship between the merge functions and the quality measures is shown in Figure 1.
Within the scope of our paper, we adopted two quality measures widely used in the text mining literature: the first one is purity , and the second one is entropy [14, 15]. Information entropy is a concept used to measure the amount of information in the information theory, which is often taken as a measure of “disorder;” that is, the higher the value of entropy, the higher the extent of disorder; information purity is a measure of correlation between a system and its environment, where a higher value of purity means that a system is more relevant to its environment. Both of the two measures fall into range interval . Basically, the maximum purity and minimum entropy of results are the goals we try to achieve. Nevertheless, when we try to analyze the effect of a merge function, we should be able to analyze the effect at fundamental level of the elements. So, some local quality measures will be introduced first.
Definition 13 (local precision). Given a multiset , the local precision of the element could be defined as such that The local precision judges the accurateness of adding the element with the membership degree into the solution. Here, is a multiset of sources. judges the proportion of fuzzy multisets where the membership degree of element is .
Example 14. Given the multiset in Example 11. The local precision of adding an element into a solution with membership degree 0.5 is calculated. When , , , , and :
Property 4 (monotonicity of local precision). Local precision is a decreasing function of the membership degree threshold. This monotonicity is a natural property. A lower membership degree threshold means that more sources are added into the solution, because a higher membership degree indicates relatively simple relations related to one concept (say, the synonym of a word), and a lower membership degree indicates less specific and more layered descriptions of one concept. As a result, we get more complete information with higher precision.
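The monotonic behavior can be illustrated with a small sketch. It rests on an assumption, not on the paper's exact formula: local precision is taken here as the share of sources, weighted by multiplicity, in which the highest membership degree of the element reaches the threshold. The representation, the name `local_precision`, and the sample data are all hypothetical.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

FuzzyMultiset = Dict[str, List[float]]
# Assumed source representation: (fuzzy multiset, multiplicity) pairs.
Sources = List[Tuple[FuzzyMultiset, int]]


def local_precision(u: str, alpha: float, sources: Sources) -> Fraction:
    """Weighted share of sources whose top degree of u is at least alpha."""
    total = sum(m for _, m in sources)
    hits = sum(
        m for fm, m in sources
        if max(fm.get(u, []), default=0.0) >= alpha
    )
    return Fraction(hits, total)


sources = [
    ({"x": [0.9, 0.5]}, 2),
    ({"x": [0.5], "y": [1.0]}, 1),
    ({"y": [0.8]}, 1),
]

thresholds = [0.1, 0.5, 0.9, 1.0]
curve = [local_precision("x", a, sources) for a in thresholds]
```

Under this assumption the values for `x` fall from 3/4 to 1/2 to 0 as the threshold rises, so the curve never increases.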
Definition 15 (purity). Purity is computed using the maximal local precision value for each element in the solution as follows: such that
Example 16. Given the multiset of Example 11 and solution , we could obtain the local precision of all elements in this solution
Note that we can also set of each element in solution a different value.
Then, we get the purity
Definition 17 (local entropy). The local entropy of each fuzzy multiset in is calculated as such that
Property 5 (monotonicity of local entropy). Local entropy is an increasing function of the membership degree threshold when , and a decreasing function of the membership degree threshold when :
Property 5 implies that the variation trend of local entropy is affected by both the fuzziness and the proportion of an element in a solution; that is, neither excessively detailed nor excessively brief information, and neither too many nor too few sources in the solution, is appropriate for enriching the information of a fusion system. The proofs of these natural properties are omitted here. This property also reflects the important connection between local precision and local entropy.
Definition 18 (total entropy). The total entropy of is calculated as such that
Example 19. Given the multiset of Example 11 and , we obtain the local precision of all elements in this solution Then, we get the total entropy
Purity and entropy can each express the quality of a solution, but their variation scales may be unequal. As mentioned above, maximum purity and minimum entropy are the goals we try to achieve. Therefore, we investigate an index in which the two measures have similar variation scales.
Definition 20 (validation index value). Given a multiset of sources , the VI-value of the validation index is calculated as such that
Next, we show the rationality of this index. Generally, a good result has a high purity value and a low entropy value. That is to say, if the discrepancy between these two values is large, the value of the validation index is large, and a good result can be identified by this index. In other words, the validation index expresses a balance between purity and entropy. Because the variation scales of the two values may differ, we introduce a constant that brings the variation scales of the purity value and the entropy value closer together. In practice, we determine the most significant singular values by selecting the best VI; the constant is an empirical value that can be obtained during simulation and adjusted through an iterative procedure. How to determine the value of this constant is not our main concern here, and we have not discussed it in depth in this paper. In future work, we will explore this problem with experimental analysis.
Note that for any solution , if and only if the local precisions of all elements in this solution differ from zero.
4.2. Optimization of Quality
The effect of a merge function can be judged by the quality measures introduced in the previous subsection. We now investigate the solutions that optimize the values of these quality measures. This type of optimization problem also appears in other research fields. A classic example in fuzzy set theory uses the transitive closure as an effective mechanism for transforming a matrix into a fuzzy equivalence relation, thereby finding approximate partitions of data sequences. Another example involves searching for an approximate minimum distance by replacing a fuzzy reciprocal relation with a transitive reciprocal relation. That is to say, the optimization mechanism is not unique. In the next step, we concentrate on the maximum quality derived from the VI-value (the maximization of purity and the minimization of entropy). The difficulty of this step is to find the solution that attains the best VI-value. Therefore, the main task here is to define and investigate a suitable merge function.
Definition 21 (VI-optimal merge function). A VI-optimal merge function over should satisfy the following constraint:
Next, some properties of VI-optimal merge functions are studied further. A notable point is that several solutions may share one maximum VI-value. Given the definition of the merge function, selecting a unique solution is an important task. Therefore, a selection criterion that picks one solution from the set of optimal solutions is needed when applying these merge functions. For the special application area of OMFM, we show the details in the illustrative examples. Another problem is that a satisfactory solution does not always exist. Hence, the notion of an invalid solution is given below.
Definition 22 (invalid solution). Assume a VI-optimal merge function and a fuzzy multiset of sources .
A multiset is defined as an invalid solution of if A solution of a VI-optimal merge function that is not invalid is called a valid solution. Note the difference between an invalid solution and a valid solution. Next, we introduce another significant theorem.
Theorem 23. Any solution that is a proper subset of the source intersection or a proper superset of the source union has that
Proof. Assume a fuzzy multiset of source .(1)A solution that satisfies Also it satisfies Owing to the case of all the elements of the solution would generate a local precision equivalent to 0, then (2)A solution that satisfies also satisfies Owing to the case of all the elements of the solution would generate a local precision equivalent to 0, then
The conclusion here is that a valid solution of VI-optimal merge function should include the intersection of the sources and should be included by the union of the source. In view of this point, we define the intersection of the sources as the lower bound and the union of the sources as the upper bound. The formalized definition is shown as where the lower bound is denoted as and the upper bound is denoted as . Hence, we shall only consider solution that satisfies in the following section.
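The bound check described above can be sketched in code. The example assumes the position-wise minimum and maximum of zero-padded, decreasingly sorted membership sequences as intersection and union, and the corresponding position-wise inclusion; the dictionary representation and all names are hypothetical.

```python
from functools import reduce
from typing import Callable, Dict, List, Tuple

FuzzyMultiset = Dict[str, List[float]]


def pad(seq: List[float], n: int) -> List[float]:
    """Sort degrees in decreasing order and zero-pad to length n."""
    return sorted(seq, reverse=True) + [0.0] * (n - len(seq))


def combine(a: FuzzyMultiset, b: FuzzyMultiset,
            op: Callable[[float, float], float]) -> FuzzyMultiset:
    """Apply op position by position to the padded sequences."""
    out = {}
    for u in set(a) | set(b):
        sa, sb = a.get(u, []), b.get(u, [])
        n = max(len(sa), len(sb))
        seq = [op(x, y) for x, y in zip(pad(sa, n), pad(sb, n))]
        seq = [d for d in seq if d > 0]
        if seq:
            out[u] = seq
    return out


def included(a: FuzzyMultiset, b: FuzzyMultiset) -> bool:
    """True when every padded degree in a is at most the one in b."""
    for u, sa in a.items():
        sb = b.get(u, [])
        n = max(len(sa), len(sb))
        if any(x > y for x, y in zip(pad(sa, n), pad(sb, n))):
            return False
    return True


def bounds(sources: List[FuzzyMultiset]) -> Tuple[FuzzyMultiset, FuzzyMultiset]:
    """Lower bound: source intersection. Upper bound: source union."""
    lower = reduce(lambda a, b: combine(a, b, min), sources)
    upper = reduce(lambda a, b: combine(a, b, max), sources)
    return lower, upper


def within_bounds(solution: FuzzyMultiset, sources: List[FuzzyMultiset]) -> bool:
    lower, upper = bounds(sources)
    return included(lower, solution) and included(solution, upper)


sources = [
    {"x": [0.9, 0.5], "y": [0.3]},
    {"x": [0.7, 0.6, 0.2], "y": [0.4]},
]
```

A candidate such as `{"x": [0.8, 0.5], "y": [0.3]}` lies between the two bounds, whereas `{"x": [0.8]}` falls below the lower bound and any candidate with an element absent from every source exceeds the upper bound.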
Theorem 24. A VI-optimal merge function is idempotent.
Proof. Assume the fuzzy multiset satisfies .
Thus, for , we have that The corresponding proof is also shown when applying the previous theorem.
Theorem 25. A VI-optimal merge function is bounded.
An important point is that VI-optimal merge functions do not satisfy the property of preservation. Nevertheless, by the theorem above, they are bounded, and boundedness was shown in the previous section to offer a weaker version of preservation. Besides the theorem of boundedness, several interesting theorems relevant to VI-optimal merge functions need to be mentioned here. One of them is the invariance of VI-optimality under scaling of the multiplicities of the sources, given below.
Theorem 26. Assume a fuzzy multiset and a merge function . A conclusion could be got that
Proof. Several facts could be got that And on the other hand,
Theorem 27. Assume a fuzzy multiset is a VI-optimal merge function and a scaling parameter . If the solution is VI-optimal with regard to the sources , then is VI-optimal in regard to the sources .
Proof. This follows as a corollary of the previous theorem.
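The intuition behind scaling invariance can be illustrated numerically. The sketch assumes, as a stand-in for the paper's own formula, that local precision is a multiplicity-weighted share of sources; any quantity built from such shares is unchanged when every source multiplicity is multiplied by the same factor. The names `local_precision` and `scale` and the sample data are hypothetical.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

FuzzyMultiset = Dict[str, List[float]]
Sources = List[Tuple[FuzzyMultiset, int]]  # (fuzzy multiset, multiplicity)


def local_precision(u: str, alpha: float, sources: Sources) -> Fraction:
    """Assumed form: weighted share of sources whose top degree
    of u is at least alpha."""
    total = sum(m for _, m in sources)
    hits = sum(
        m for fm, m in sources
        if max(fm.get(u, []), default=0.0) >= alpha
    )
    return Fraction(hits, total)


def scale(sources: Sources, k: int) -> Sources:
    """Multiply the multiplicity of every source by k."""
    return [(fm, k * m) for fm, m in sources]


sources = [
    ({"x": [0.9, 0.5]}, 2),
    ({"x": [0.5], "y": [1.0]}, 1),
    ({"y": [0.8]}, 1),
]
scaled = scale(sources, 3)

# Compare the values before and after scaling, using exact fractions.
unchanged = all(
    local_precision(u, a, sources) == local_precision(u, a, scaled)
    for u in ("x", "y")
    for a in (0.5, 0.8, 1.0)
)
```

Exact fractions are used so that the comparison is not affected by floating-point rounding.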
5. An Application: Document Fusion with Illustrative Example
5.1. Document Fusion
One possible application of this fuzzy multiset framework is document fusion, which involves merging elements with different relations embedded. Document fusion is closely related to multidocument summarization, so we introduce the latter briefly. The important difference between the two areas is that the main task of multidocument summarization is to generate the shortest description containing the most relevant information, while the focus of document fusion is to generate the shortest description containing all the information in the whole document set, excluding redundancy [15, 16]. Loosely speaking, multidocument summarization resembles the intersection of the documents and document fusion resembles their union. Unlike for multidocument summarization, there is not yet an organization like DUC (Document Understanding Conference) that provides “ideal” datasets for document fusion research, with multiple documents on the same subject and ideal results for testing. In addition, the intrinsic and extrinsic evaluations used for multidocument summarization systems may not be suitable for the fusion task. In intrinsic evaluation, humans assess the quality of the fused documents themselves, which makes the evaluation process subjective; a further difficulty is that there is so far no collection of human-written fusion results for multiple documents that could serve as a gold standard. In extrinsic evaluation, the fused document is evaluated by the completion of a specific task, which makes the evaluation process more complicated. Thus, there are no standard methods for evaluating work on the fusion task like those available for some document summarization tasks [31–33]. Given these problems, the evaluation that we have performed to date is limited.
To demonstrate our work, we selected an article cluster concerning complaints about the spoilage of dairy products of a particular brand from the “315 consumption complaint” website, and we use it to show the general fusion process and results under our framework. Although we use Chinese text for illustration, there is no fundamental difference between Chinese, English, or other languages under this framework.
This paper proposes a framework for document fusion, so we aim not only at keywords but at comprehensive information. Here, we consider only the situation of fuzzy multisets. With this extension, the membership degree can express the importance and fuzziness of an element, which makes the document representation more granular and semantically richer than the earlier multiset merging model. Assigning different weights to the same element also makes sense when semantically related terms with different semantic relations are used to identify a concept, which is semantically richer than using words alone. Under our framework, semantic methods and statistical methods can be combined and used in many domains.
5.2. Illustrative Example
The main processing step is to obtain the Extra Strong, Strong, and Medium Strong relations of every concept in each article by using HowNet. As a common-sense knowledge base, HowNet unveils interconceptual and interattribute relations of concepts. In HowNet, every concept of a word or phrase and its description form one entry, with relations such as hypernym, hyponym, synonym, antonym, meronym, and holonym (these relations are described in Table 1) presented in the DEF (concept definition), as shown in Box 1.
When processing English text, WordNet, a large lexical database of English, could be used instead of HowNet to identify these relations. Here, the textual intention structure is determined by the three relations of every concept. As indicators in linguistic segments, the three relation segments of every concept tend to indicate the theme segments. That is to say, once the three relations of every concept have been confirmed, the corresponding linguistic segments have a determinate tendency. Each concept is defined as an element in the fuzzy multisets, and the three relation segments included in each concept determine the different membership degrees of each element, as shown in Table 2.
In the following example, three semantic segments concerning different concepts , , and will be used in merge process. The corresponding fuzzy multisets obtained from the source set are shown below: where denotes there are Extra Strong semantic relations identified in concept .
As we’ve mentioned above, document fusion is to produce the shortest description containing all information found within the document set, but without repetition. The solution that we need is the solution concluding all the key concepts (, , and in this example) and these concepts are constructed by three relations (Extra Strong, Strong, and Medium Strong relations). Let us consider the solution . We get the local precision of all elements in this solution such that And we calculate the local entropy such that Then, we have got the validation index with setting the constant value :
On the same principle, we present the local precision, purity, local entropy, entropy, and VI-value of the solutions with only one semantic relation embedded in each concept. If every concept is treated as equally important, with a single relation embedded, we obtain a maximal VI-value of 0.355 (see Table 3).
As mentioned in former section, a valid solution of a VI-optimal merge function should follow the constraint below:
In practical applications, we should only consider solutions that satisfy the constraint mentioned above, so more complicated solutions can also be considered. For example, when a concept is important and needs to be described explicitly in the fusion result, more details, such as the Medium Strong relation in Table 2, should be included in the description. If the concepts are not treated as equally important with a single relation embedded, we obtain two solutions with the maximal VI-value of 0.491 (see Table 4). Generally, for any solution, we can calculate the VI-value and choose the solution with the maximal VI-value. Because a tie-breaking criterion does not always exist within the scope of this paper, an auxiliary choice criterion that helps select a solution from the set of optimal solutions is necessary. From these observations, we need to decide on the merge function according to actual requirements, or find a new solution with more semantic relations embedded.
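The selection step can be sketched as follows: collect every candidate that attains the maximal VI-value, then apply an auxiliary criterion when more than one remains. The solution labels `S1` to `S3`, their assignment to the VI-values 0.355 and 0.491, the relation counts, and the tie-break rule (prefer the candidate with more semantic relations embedded) are hypothetical choices made only for this example.

```python
from typing import Callable, Dict, List


def optimal_solutions(vi_values: Dict[str, float], tol: float = 1e-9) -> List[str]:
    """Return every solution whose VI-value equals the maximum (up to tol)."""
    best = max(vi_values.values())
    return [name for name, value in vi_values.items() if best - value <= tol]


def select_solution(vi_values: Dict[str, float],
                    tie_break: Callable[[List[str]], str]) -> str:
    """Pick the unique optimum, or delegate to tie_break on a tie."""
    candidates = optimal_solutions(vi_values)
    if len(candidates) == 1:
        return candidates[0]
    return tie_break(candidates)


# Hypothetical candidates and VI-values.
vi_values = {"S1": 0.355, "S2": 0.491, "S3": 0.491}

# Hypothetical auxiliary information: relations embedded per solution.
relations_embedded = {"S1": 3, "S2": 4, "S3": 5}

chosen = select_solution(
    vi_values,
    tie_break=lambda cands: max(cands, key=lambda s: relations_embedded[s]),
)
```

Keeping the tie-break as a separate callable reflects the observation that the choice among equally good solutions depends on the actual requirements of the application.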