Cognitive Network Science: A New Frontier
Research Article | Open Access
Alexander Mehler, Rüdiger Gleim, Regina Gaitsch, Wahed Hemati, Tolga Uslu, "From Topic Networks to Distributed Cognitive Maps: Zipfian Topic Universes in the Area of Volunteered Geographic Information", Complexity, vol. 2020, Article ID 4607025, 47 pages, 2020. https://doi.org/10.1155/2020/4607025
From Topic Networks to Distributed Cognitive Maps: Zipfian Topic Universes in the Area of Volunteered Geographic Information
Are nearby places (e.g., cities) described by related words? In this article, we transfer this research question in the field of lexical encoding of geographic information onto the level of intertextuality. To this end, we explore Volunteered Geographic Information (VGI) to model texts addressing places at the level of cities or regions with the help of so-called topic networks. This is done to examine how language encodes and networks geographic information on the aboutness level of texts. Our hypothesis is that the networked thematizations of places are similar, regardless of their distances and the underlying communities of authors. To investigate this, we introduce Multiplex Topic Networks (MTN), which we automatically derive from Linguistic Multilayer Networks (LMN) as a novel model, especially of thematic networking in text corpora. Our study shows a Zipfian organization of the thematic universe in which geographical places (especially cities) are located in online communication. We interpret this finding in the context of cognitive maps, a notion which we extend by so-called thematic maps. According to our interpretation of this finding, the organization of thematic maps as part of cognitive maps results from a tendency of authors to generate shareable content that ensures the continued existence of the underlying media. We test our hypothesis by example of special wikis and extracts of Wikipedia. In this way, we come to the conclusion that geographical places, whether close to each other or not, are located in neighboring semantic places that span similar subnetworks in the topic universe.
In this article, we explore crowd-sourced resources for automatically characterizing geographical places with the help of so-called topic networks. Our goal is to model the thematic structure of corpora of natural language texts that are about certain places seen as thematic frames. This is done in order to automatically compare the thematic structures of corpora of texts about these places, which will be represented as topic networks. In this way, we want to investigate the regularity or systematicity according to which geographical objects (i.e., cities and regions) are dealt with, especially in online communication.
Our work relates to what is described by Crooks et al.  as a novel paradigm of modeling “urban morphologies.” We not only add special wikis such as regional and city wikis as candidates to the resources listed in  but also introduce a novel method for modeling their content. This concerns local media of collaborative writing about places (cf. ), which contain everyday place descriptions  authored and networked according to the wiki principle. The corresponding wikis and the subgraphs of Wikipedia that we additionally analyze manifest Volunteered Geographic Information (VGI) [4–6] and thus relate to what is called the wikification of Geographical Information Systems (GIS) . VGI is “completing traditional authoritative geographic information” , an information source which is still “underutilized” in geography  as a source of big textual data , making natural language processing an indispensable prerequisite for its analysis. According to Hardy et al. , authoring VGI has a spatial component in the sense that people are likely to write about local content, though this holds for Wikipedia only to a minor degree . This spatial component can be accompanied by a lack of quality assurance, which makes VGI susceptible to deficiencies and turns it into a resource that is distorted to a still unknown extent . In any event, biased coverage is a characteristic of VGI resources like Wikipedia, so that the same region can be displayed very differently in its various language editions , a sort of bias which is typical of user-generated content. Nevertheless, Hahmann and Burghardt  show that more than 50% of the articles in the German Wikipedia contain georeferenced data (at least indirectly via links to other articles), so that such media can be regarded as rich resources of VGI. Moreover, Goodchild and Li  point to the fact that crowd-sourcing or, more precisely, crowd-curation , as enabled by wikis, is a means of quality assurance.
We follow this concept and assume that geographic data, as manifested linguistically in online media, are a valuable resource to investigate how communities form a common sense for addressing places of common interest. In line with Clare (, 41), we additionally assume that “[a]s people communicate more about a place, social consensus will create increased similarity between and within people’s judgments of it.” However, we also assume that the latter similarity can affect communications of different communities about different places. In this way, we assume a kind of horizontal self-similarity  of the thematic structure of online media, which is more or less independent of the underlying theme and the community. That is, our hypothesis on the theming of places is as follows.
Hypothesis 1. Thematizations of different places at a certain level of thematic abstraction tend to be similar among each other (rather than being dissimilar) (1) in the sense that they focus on similar topics and (2) the way these topics are networked and (3) with respect to the skewness of this focus, regardless of whether the underlying media are generated by different communities and whether these communities address related or unrelated places at near or distant spaces.
The intuition behind Hypothesis 1 is that thematizations of places in web-based communication appear to be thematically redundant: when people report, for example, on the cities in which they live, they may aim to emphasize the special character of these places. It seems, however, as if a thematic trend is gaining ground that ultimately makes such reports appear thematically very similar. Whether or not this intuition corresponds to a trend that can actually be observed in the field of wiki-based media is something this study is intended to clarify. From this point of view, it is obvious that Hypothesis 1 is only a starting point which in itself needs further clarification in order to be testable: similarity, for example, is a highly context-sensitive attribute  that needs further definitional specifications in order to be computable. Likewise, the concept of thematization (theme or topic), which according to Adamzik  has so far received comparatively little attention in linguistics, is not yet specified in Hypothesis 1. Thus, an appropriate elaboration and concretization of Hypothesis 1 is one of the main tasks of the present paper. To this end, we develop a generic topic network model in conjunction with a measurement procedure which will specify both the notion of similarity (which will be defined in terms of the graph similarity of topic networks) and of the thematization of places (which will be defined in terms of topic labeling and topic networking). This topic network model will allow Hypothesis 1 to be reformulated and concretized in the form of variants (i.e., Hypotheses 2–4), which will be presented in Section 3.2.7 and whose formulations presuppose the topic network model that this paper develops in the preceding sections.
The skewness that is mentioned by Hypothesis 1 reminds one of a Zipfian process, according to which a few topics dominate, while the majority of candidate topics are underrepresented or disregarded. Therefore, we speak of Zipfian thematic universes, which are spanned by the thematization of the same places in online media such as special wikis of the sort studied here. By the term topic, we refer to the notion of aboutness of texts [18, 19]. From a linguistic point of view, the terminology of Hypothesis 1 seems to be confusing when referring to places as what is given and with topic to what is said about these places. The reason is that linguistics distinguishes between what is given (theme or topic) and what is said about it (rheme, comment, or focus) in a given piece of text [18, 20–22]: a mention of a city like Vienna, for example, can be connected with certain subtopics (e.g., classical music), which characterize this place rhematically by providing new information about it. The latter distinction is meant when we relate subtopics in the role of rhemes to places in the role of topics in the linguistic sense. Thus, when talking about topics as part of a computational model, we will use the term topic (topic2), while when talking about places as topics in the linguistic sense (topic1), we will use the term theme and speak about its rhemes as its subtopics modeled by topics (topic2) as units of our model. This scenario and its relation to Hypothesis 1 are depicted in Figure 1. It shows a generalization of a hypothesis of Louwerse and Zwaan  according to which language encodes geographical information: the places p and q, which are understood as conceptual units (i.e., mental models), are described by or expressed in two discourse units (texts, dialogs, etc.) x and y. From the latter units, the topic representations α and β are derived by means of a computational model (e.g., Latent Dirichlet Allocation (LDA)  or the topic network model introduced in Section 3). 
While such derived topics are part of the computational model, the underlying discourses belong to the modeled system. We assume that the conceptual unit p (q) is structured into a system of networked rhemes or subtopics (). Ideally, the derived topic α in Figure 1 is a valid model of one of the rhemes of place p (e.g., ) and β of one of the rhemes of place q (e.g., ). If we assume now that p and q are conceptually related (e.g., similar) to each other, then the linguistic encoding hypothesis implies that this is possibly reflected by a relatedness (e.g., similarity) relation among some rhemes of these places (e.g., by the relatedness of and ). From the point of view of modeling, this relation is ideally mapped by the relatedness (e.g., similarity) of the derived topics α and β. We assume that conceptual relations between places can be parallelized by relations of physical proximity or distance between spaces that are mentally modeled by these places. If one additionally assumes that proximity in space correlates with relatedness in conceptual space (the less the distant, the more the similar, for example), one obtains a linguistic variant of Tobler’s so-called first law (see Section 2). If we look at the literature (see Section 2), we find that the approaches in this area differ in terms of the linguistic level at which they observe the linguistic encoding of platial  relations: for example, at the level of intertextually linked texts, at the level of the topics these texts are about, or at the level of lexical elements used by these and other texts to deal with the latter topics. In lexical variants of this approach, the places p and q, for which we assume that they are conceptually related, are preferably referred to or described by means of lexical items (see Figure 1) of the underlying lexis that are syntagmatically or paradigmatically associated. 
From the point of view of modeling, we then have to assume two types (as models of the words ) for which we automatically detect, for example, their (paradigmatic) closeness in semantic space (cf. [24, 25]) or the similarity of their (syntagmatic) co-occurrence statistics (cf. ).
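The skewness that Hypothesis 1 refers to can be made concrete with a small rank-frequency check. The following is only a minimal sketch, not the measurement procedure of this article: it assumes that each text has already been assigned a single topic label, and the function names (`rank_frequency`, `zipf_exponent`) as well as the toy labels and counts are hypothetical and chosen purely for illustration. The exponent is estimated by an ordinary least-squares fit in log-log space, which is a common but rough heuristic.

```python
import math
from collections import Counter


def rank_frequency(topic_labels):
    """Return topic frequencies sorted in descending order (rank 1 first)."""
    counts = Counter(topic_labels)
    return sorted(counts.values(), reverse=True)


def zipf_exponent(frequencies):
    """Estimate s in f(r) ~ C / r**s by least squares on log-log values."""
    if len(frequencies) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var  # sign flipped so that a decaying curve yields s > 0


# Toy data (invented): one topic label per text of a hypothetical city wiki.
labels = (["sports"] * 60 + ["history"] * 30 + ["music"] * 20
          + ["botany"] * 15 + ["law"] * 12)
freqs = rank_frequency(labels)
exponent = zipf_exponent(freqs)
```

In this toy case the counts follow 60/r, so the fitted exponent is close to 1; on real topic distributions the fit would at best be approximate and should be read as a descriptive indicator only.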
From this analysis, we obtain a series of reference points or means for encoding geographical information about conceptual relations (see  in Figure 1) of places. This concerns more precisely a series of possible parallelizations of such relations, which may ultimately be parallelized by relations between the spaces designated by these places (for the numbers in brackets, see Figure 1): at the level of the modeled system, this refers to thematically linked rhemes, intertextually linked discourse units (e.g., texts), and syntagmatically or paradigmatically linked words (). From a modeling point of view, we distinguish the statistical relatedness of types or of topics as candidate parallelizations (). Beyond that, we find the parallelization of the relatedness of rhemes and words on the one hand and of types and topics on the other (, ), as well as that of the relatedness of words on the one hand and of types on the other (). The parallelization of the relatedness of rhemes of the same place () by the relatedness of the rhemes of another place concerns the core of our network approach. Such relations among rhemes constitute rhematic networks or networks of rhemes on both sides of the affected places. Our main assumption is now that any such rhematic network, which manifests the thematic structure of a place, can be related as a whole to that of another place. In doing so, it is, from a modeling point of view, ideally parallelized by the structural relatedness (e.g., similarity or complementarity) of topic networks, which are derived from corpora of texts, each of which describes one of these places (). This type of parallelization affects entire networks of linguistic objects and yet offers a means of encoding the conceptual relationship of places () or the proximity of spaces, respectively. In the present paper, we explore relations of Type  in order to learn about the encoding of geographical information in natural language texts, that is, about relations of Type . 
To this end, we develop, instantiate, and empirically test a formal model of multiplex topic networks derived from so-called linguistic multilayer networks as a model of relations of Type .
From this point of view, Hypothesis 1 means that certain rhemes of places and the structure they span resemble each other, regardless of how far the quantified distances of the spaces represented by these places are and regardless of the fact that the texts in which these rhemes are described are written by different communities. To test this hypothesis, we introduce topic networks to make the networking of topics a research object according to the scenario described in Figure 1, that is, in relation to the hypothesis of linguistic encoding of geographical information. The contributions of this article are of theoretical, methodical, and empirical nature.
(1) Formal modeling: we develop a generic, extensible formalism for the representation of topic networks that cover a wide range of informational sources for spanning and weighting topic links. To this end, we introduce the notion of multiplex topic networks derived from so-called multilayer linguistic networks. In this way, we enable the same place to be represented by a family of thematic networks that offer different perspectives on the networking of its rhemes. We exemplify this model by means of two perspectives provided by so-called Text Topic Networks (TTN) and their corresponding Author Topic Networks (ATN).
(2) Procedural modeling: we develop a measurement procedure for instantiating our formal model. To this end, we introduce novel measures of the similarity of labeled graphs that are sensitive to their links and to their nodes.
(3) Experimentation: we further develop the range of baseline statistics in network theory in order to better assess the quality of our measurements. To this end, we test our model by means of a threefold classification experiment that compares a set of TTNs with each other, a set of corresponding ATNs with each other, and the former TTNs with the latter ATNs.
(4) Theory formation: we interpret our findings in the context of cognitive maps, thus building a bridge between our network-theoretical approach and approaches to the cognitive representation of geographical information. We show how to integrate the analysis of entire networks into the research about the linguistic encoding of geographical information (see Figure 1).
This paper is organized as follows: Section 2 discusses related work. Section 3 introduces our formal model of linguistic multilayer networks and the multiplex topic networks derived from them. Section 4 describes our experiments in detail, and Section 5 discusses our findings. Finally, Section 6 concludes and gives an outlook on future work.
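Contribution (2) speaks of similarity measures for labeled graphs that are sensitive to both nodes and links. The sketch below is not one of the measures introduced in this article; it is a generic stand-in, assuming that a topic network is given as a pair of a node set (topic labels) and a dictionary of positive edge weights. It combines the Jaccard overlap of node labels with a weighted Jaccard overlap of edges; the names `node_jaccard`, `weighted_edge_overlap`, `graph_similarity`, and the mixing parameter `alpha` are hypothetical.

```python
def node_jaccard(nodes_g, nodes_h):
    """Jaccard overlap of two sets of node labels."""
    union = nodes_g | nodes_h
    return len(nodes_g & nodes_h) / len(union) if union else 1.0


def weighted_edge_overlap(edges_g, edges_h):
    """Weighted Jaccard overlap: sum of minima over sum of maxima."""
    keys = set(edges_g) | set(edges_h)
    total = sum(max(edges_g.get(k, 0.0), edges_h.get(k, 0.0)) for k in keys)
    if total == 0:
        return 1.0  # two edgeless graphs are treated as identical here
    shared = sum(min(edges_g.get(k, 0.0), edges_h.get(k, 0.0)) for k in keys)
    return shared / total


def graph_similarity(g, h, alpha=0.5):
    """Convex combination of node overlap and edge overlap."""
    (nodes_g, edges_g), (nodes_h, edges_h) = g, h
    return (alpha * node_jaccard(nodes_g, nodes_h)
            + (1.0 - alpha) * weighted_edge_overlap(edges_g, edges_h))


# Toy topic networks (invented); undirected edges are keyed by frozensets.
g = ({"history", "music", "sports"},
     {frozenset({"history", "music"}): 2.0,
      frozenset({"music", "sports"}): 1.0})
h = ({"history", "music", "law"},
     {frozenset({"history", "music"}): 1.0,
      frozenset({"music", "law"}): 1.0})
similarity = graph_similarity(g, h)
```

Because the node labels enter the comparison directly, two networks with the same shape but different topics are scored as dissimilar, which is the property that distinguishes labeled from unlabeled graph comparison.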
2. Related Work
Our work is related to linguistic research on Tobler’s  first law (TFL) which says that “[…] everything is related to everything else, but near things are more related than distant things” (, p. 236). Due to its underspecification, this so-called law raised many questions about what it means to be related or distant . Accordingly, a range of approaches exist that make different proposals to interpret relatedness also in terms of semantic relatedness. In the context of information visualization, Montello et al.  test a variant of TFL called the first law of cognitive geography which says that “people believe closer things to be more similar than distant things” (, p. 317), where spatial distance is referred to for judging the similarity of information objects. This approach is contrasted with a study by Hecht and Moxley  who model relations of Wikipedia articles as a function of the probability of being linked in the web graph and find that this probability is related to the geographical distance of toponyms described in the articles. Hecht and Moxley relate their finding to the transitivity of networks by stating that the smaller the geographical distance of nodes, the higher their clustering coefficient (, 101). This work is extended by Li et al. , who calculate semantic relationships of articles instead of hyperlinks and show that TFL holds independently of the geographical domain up to a certain distance threshold. A lexical variant of TFL is mentioned by Yang et al. , according to which geographically close words tend to be clustered into the same geographical topics. This phenomenon has earlier been studied by Louwerse et al. (cf. the review in ) who reformulated Firth’s famous dictum by saying that “[…] you shall know the physical distance between locations by the lexical company they keep” (, p. 1557). This means that the distance of places correlates with syntagmatic associations between the lexical items used to describe them. 
That is, language encodes geographical information  at least regarding the distances of semantically related places. From this perspective, TFL appears to be reformulated as a candidate for a geolinguistic law that is compatible with the more general Symbol Interdependency Hypothesis (SIH) . According to SIH, linguistic information encodes perceptual information so that the former serves as a shortcut to the latter . Finally, a rather text-linguistic variant of TFL is proposed by Adams and McKenzie , which states that near places are each described by texts whose topics are more similar than in the case of texts about distant places.
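The text-linguistic variant of TFL mentioned above can be turned into a simple check: compute the geographical distance and the topic similarity for every pair of places and correlate the two. The following is a minimal sketch under stated assumptions, namely that each place comes with a coordinate pair and with a topic distribution given as a vector. All names (`haversine_km`, `cosine`, `pearson`, `distance_similarity_correlation`), the place identifiers, and the numbers are hypothetical toy values, not data from any of the cited studies.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(min(1.0, h)))


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def distance_similarity_correlation(coords, topics):
    """Correlate pairwise distances with pairwise topic similarities."""
    names = sorted(coords)
    dists, sims = [], []
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            dists.append(haversine_km(coords[p], coords[q]))
            sims.append(cosine(topics[p], topics[q]))
    return pearson(dists, sims)


# Toy input (invented): three places with coordinates and topic vectors.
coords = {"p": (0.0, 0.0), "q": (0.0, 1.0), "r": (0.0, 10.0)}
topics = {"p": [0.7, 0.2, 0.1], "q": [0.6, 0.3, 0.1], "r": [0.1, 0.2, 0.7]}
r_value = distance_similarity_correlation(coords, topics)
```

In this constructed example, the nearby places p and q have similar topic vectors, so the correlation is negative, as a TFL-style reading would predict. With only a handful of places the coefficient is of course not meaningful; a rank correlation and a significance test would be the more cautious choice on real data.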
In contrast to these approaches, we hypothesize that places, no matter how far apart, have similar topic distributions when their descriptions are transmitted by media such as city and region wikis. If we find evidence for this hypothesis, there are various candidates for explaining it: Firstly, such a finding could indicate a trivial meaning of TFL (cf. ) in relation to the topics modeled by us, implying that everything, distant or not, is highly related. Secondly, it could indicate the (in)effectiveness of distances and similarities at different scales: at the level of local, specific topics (within the scope of TFL) and at the level of global, more general topics (outside the scope of TFL). Thirdly, such a finding could indicate a hidden similarity of processes of collaboratively writing wikis about different places, even if the wikis are written by different communities (see Hypothesis 1). In order to decide between these alternatives, we need a new topic model that derives networks of thematic structures at different scales from texts in online media about the same places. This should at least include the networking of topics along relations of intertextuality and coauthorship in order to allow for revealing similarities of the underlying processes of collaborative writing. To this end, we will develop multiplex networks that integrate text- and author-driven topic networks.
So far, most approaches to thematic aspects of places use topic modeling based on Latent Dirichlet Allocation (LDA) to associate topics and texts about geographical units, where topics are represented as sets of thematically related words. An early approach in this regard is described by Mei et al.  who model spatiotemporal theme patterns to identify dominant topics in texts that are connected to places. A related approach is proposed by Qiang et al. , who aim to detect topics that are “localized” in places. This is done to ground their similarities in relations of their thematic representations—a scenario that is omnipresent in linguistically motivated work in the context of TFL (cf. Figure 1). Likewise, Adams and McKenzie  extract topic models from travel blogs to detect topics as groups of semantically related words associated to places, so that relations among places can be identified by shared topics. Another example is proposed by Bahrehdar and Purves : instead of documents written by individual authors, they analyze tagging data extracted from image descriptions in Flickr. A hybrid model of topic modeling comes from Yin et al. , in which representations of regions are used instead of documents to link topics to places. A related region-topic model that uses regions as topics to map words, sentences, and texts to distributions of regions or to ground them semantically (cf. ) is proposed by Speriosu et al. . A promising extension is developed by Gao et al.  who aim at detecting higher-level functional regions as semantically coherent areas of interest. To this end, they analyze co-occurrence relations between topics to describe many-to-many relations of locations and urban functions. Another direction is pursued by Lansley and Longley , who investigate the location- and time-based distribution of topics in Twitter, setting a number of twenty topics as a target for LDA. See also Jenkins et al.  who utilize a list of six high-level topic categories. 
One of the largest studies in this context is the one of Gao et al.  who present an integrative approach to modeling texts from a range of different media such as Wikipedia, Twitter, and Flickr to demarcate cognitive regions . All these approaches start from topic modeling to map natural language texts onto distributions of topics in order to relate the places thematized by these texts (cf. Figure 1).
A prominent precursor of topic models  is given by Latent Semantic Analysis (LSA) . Consequently, there are studies in the context of TFL based on this predecessor. Davies , for example, interprets the associations of place names computed by LSA from place descriptions as a model of the cognitive representation of the corresponding spaces (cf. ). This approach opens up a perspective for measuring biased cognitive representations of spatial systems: according to Davies, her approach provides representations of cognitive geographies that are explored by the associations of semantically close place names in accordance or not with the underlying geographical relations, that is, in accordance or not with TFL (cf. ). These and related studies produce interesting results about the localization of topics or vice versa about the thematization of places in texts. However, they mostly disregard topic networking, not to mention the networking of topics viewed from different angles. Although it is easy to derive a network approach from binary relations of topic similarity, relationships that cannot be traced back to sharing similar words are hardly mapped by topic models of the sort considered so far. By generating topic distributions per location, for example, we know nothing about the dynamics of the coauthorship of the underlying texts: in the extreme case, one observes (dis)similarities, which result from the activity of a small number of authors or even only one author—in contrast to the assumed collaboration density of online media such as Wikipedia. Therefore, it is our goal to develop a model of topic networks that simultaneously addresses the dynamics of the coauthorship of the underlying texts. A subtask will be to develop a formal model of thematic networking that is generic enough to integrate a wide range of sources of networking—at least theoretically.
While most of the approaches considered so far ignore aspects of networking, a second branch of research tends to follow the paradigm of network theory. Hu et al. , for example, measure the semantic relatedness of cities as nodes of a city network  depending on the co-occurrences of city names in news articles. This approach is related to Liu et al. , who explore co-occurrences of toponyms to induce city networks that can be used to test predictions associated with TFL. Hu et al.  further develop this approach to networking cities by reference to topics of articles in which the corresponding toponyms are observed. They use Labeled LDA  to learn to extract topics α from texts to finally determine the α-relative similarity of cities based on the co-occurrences of their names in texts about α. Another approach to city networks using Wikipedia as a data source is proposed by Salvini and Fabrikant : they link cities as a function of the number of articles “co-siting”  their Wikipedia articles. A comprehensive perspective on modeling spatial information is developed by Luo et al. , who propose a three-part network model that integrates representations of spatial, social, and semantic networks. In this conceptual model, semantics plays the role of interpreting behavior in spatial and social space and thus of bridging them. Although we share this hybridization of the network perspective on spatial information, we strive for a more concrete model that can be empirically tested.
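The basic idea behind co-occurrence-based city networks can be sketched in a few lines. This is a simplified illustration and not the procedure of any of the studies cited above: it assumes single-word toponyms, whole-document co-occurrence windows, and no toponym disambiguation. The function name `cooccurrence_network` and the example sentences are invented for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_network(documents, toponyms):
    """Count, for each pair of toponyms, the documents mentioning both."""
    edges = Counter()
    for text in documents:
        tokens = set(re.findall(r"\w+", text.lower()))
        present = sorted(t for t in toponyms if t.lower() in tokens)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1  # key is an alphabetically ordered pair
    return edges


# Toy corpus (invented sentences) and a small list of city names.
documents = [
    "Trains connect Vienna and Graz.",
    "Graz and Vienna share a festival.",
    "Linz hosts a festival.",
    "Vienna and Linz cooperate.",
]
city_edges = cooccurrence_network(documents, ["Vienna", "Graz", "Linz"])
```

The resulting edge weights can then be compared with geographical distances, which is how such networks are used to test predictions associated with TFL.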
Any such study has to face various aspects of the vagueness [44, 53] or informational uncertainty  of concepts of regions  and places  and especially of the names of such entities . According to Winter and Freksa , this includes semantic ambiguity, indeterminacy of spatial extent, or boundary vagueness , preference-oriented re-scaling of extent, and the dynamics of salience affected by various dimensions of contrast. Beyond boundary vagueness, Gao et al.  speak of the shape and location vagueness by example of cognitive regions. Furthermore, Jenkins et al.  refer to the temporal dynamics of places as evolving concepts as a source of uncertainty. From a methodological point of view, this multifaceted uncertainty has two implications: in relation to the model, which should be flexible enough to map these facets, and in relation to the object itself, which could complicate its modeling by unsystematically distorting it.
In accordance with Hu’s study , we assume that the thematic perspective complements the spatial and temporal perspective of the study of places. A rheme can be understood as the “content” of a geographical region that expands its dimensionality . This content may be further specified in terms of affordances, functions, or shared conceptual representations associated by members of a community with the corresponding place so that different places can be related by being associated with similar content. This thematic perspective will be at the core of our article. To this end, we follow the approach of Jenkins et al. , according to which places are connected with meanings generated by collaborators of crowd-sourcing media such as Wikipedia: their collaboration creates what Jenkins et al. call platial themes, namely, themes that are characteristic for certain places. As shared meanings, these platial themes ultimately create a “collective sense of place,” as it is perceived by the corresponding community. In this context, Jenkins et al.  propose to study politics, business, education, recreation, sports, and entertainment as six high-level topics of places. However, by reference to the Dewey Decimal Classification (DDC), we will instead deal with more than six hundred hierarchically organized topics, each of which is manifested by a range of Wikipedia articles. In any event, we have to consider that thematic aspects may distort the conceptualization and perception of spatial objects . A central question then concerns the regularity or systematicity of this distortion in the sense of asking to what extent thematic representations of different places show similar aspects of being biased. This question will be at the core of this article.
3. Multiplex Topic Networks: A Novel Approach to Topic Modeling
In order to study relations of thematic preference in VGI as a manifestation of distributed cognition, we introduce Topic Networks (TNs) as an alternative to Topic Models (TMs) [23, 58, 59]. TMs are based on the idea that texts manifest probabilistic distributions of topics which are represented as probability distributions over the lexical constituents of these texts, where these distributions may be affected by style, the underlying genre, or any other (syntactic, semantic, or pragmatic) criterion of text production [60–62]. Despite its success, this model is unsuitable for modeling TNs as manifestations of distributed cognitive maps because of the following problems:
(P1) Corpus specificity: the corpus specificity of TMs impairs comparability and transferability to ever new corpora, since the topic distributions are learned from the input corpora whose topics are to be modeled. This approach apparently cannot use a transferable topic model as a basis for representing the topics of a large number of different corpora.
(P2) Topic labeling: the corpus-specific derivation of topic labels from the input corpora makes it difficult to compare their topic distributions. As reviewed by Herzog et al. , external resources can be used for this task. However, there are hardly any such resources for all possible topic combinations, unless one wants to explore an overarching system such as Wikidata, whose size would make such a project considerably more difficult. The labeling problem can be addressed using, for example, Labeled LDA , an approach that leads us into the area of supervised classification, which is also followed here.
(P3) Scalability: instead of dealing with corpora of equally large texts, online communication often leads to sparse, tiny texts that sometimes consist of a single sentence, a single phrase, or a single word. Regardless of the size of the text, we need a procedure that determines its topic distributions so that texts of different sizes can be compared using topic models of comparable size. Even if small texts are postprocessed (after topic modeling) in such a way that their topic distributions are derived from their lexical constituents, such an approach would nevertheless mean excluding text snippets from the training process.
(P4) Rare topics: one reason to prefer training by means of corpora as large as Wikipedia is to allow for detecting topics even if they form a kind of thematic hapax legomenon in the corpora to be analyzed. If we try to identify rare topics directly from these corpora, we will probably not detect them, since by definition these corpora do not provide enough information to identify such topics. In any event, the rarity of evidence about a topic should not be an impediment to identifying its occurrences even at the level of single sentences.
(P5) Methodical closedness: instead of deriving all distributions of all dependent and independent variables as part of the same topic model, one possibly wants to include different information sources that are computed by different methods based on diverse computational paradigms (e.g., ontological approaches to measuring sentence similarities, approaches to word embeddings based on neural networks, and topic models). In order to enable this, we look for a methodologically open topic model that allows such different resources to be easily integrated.
In a nutshell, we are looking for an approach that (i) allows thematic comparisons of previously unforeseen text corpora using an underlying reference corpus, (ii) offers a generic solution to the problem of topic labeling, (iii) is highly scalable and can therefore map even the smallest text snippets to topic distributions, (iv) simultaneously takes rare topics into account, and (v) is methodologically open and expandable. Such a topic network model is now developed in two steps: in Section 3.1, we introduce the underlying formal apparatus. This is done by deriving multiplex topic networks from linguistic multilayer networks. Section 3.2 describes a method by which this model is instantiated as a prerequisite for its empirical testing.
3.1. From Linguistic Multilayer Networks to Multiplex Topic Networks
In this section, we introduce multiplex topic networks. This is a type of network that is based on the idea of deriving the networking of topics of textual units by evaluating evidence from different sources of information such as text vocabulary, higher-level text components, distributed authorship or readership, genre, register, or medium. Since these sources of evidence can be explored in different compositions, this can lead to different perspectives on the salience and networking of the topics addressed by the same texts. Topic networks are multiplex precisely in this respect: the different evidence-providing perspectives may lead to different topic networks that allow comparisons to be made through which differences in the linguistic, social, or otherwise contextual embedding of thematizations become visible. This concept of a multiplex topic network is now being generically formalized.
To introduce multiplex topic networks, we start with defining linguistic multilayer networks (Definition 1) whose layeredness allows for distinguishing several (non)linguistic information sources of topic networking. We refer to supervised topic classifiers trained by means of large reference corpora to tackle the challenges P1, P2, P3, and P4. Based thereon, we introduce so-called text topic networks (Definition 3), which evaluate intra- and intertextual relations for the purpose of topic networking. Then, we introduce two-level topic networks (Definition 4) and exemplify them by author (Definition 5) and word topic networks (Definition 6), which explore relations of (co)authorship and lexical relatedness, respectively, as sources of topic networking. These notions are generalized to arrive at n-level topic networks (Definition 7) which are based on informational sources of topic networking (cf. challenge P5). Finally, multiplex topic networks are defined as families of n-level topic networks (Definition 8) representing the networking of the same set of topics from different informational perspectives and thus allowing for mapping the thematic dynamics, for example, of descriptions of the same place.
Definition 1. Let be a corpus of texts and . A Linguistic Multilayer Network (LMN) is a tuple (Mehler speaks of multilevel graphs; see Boccaletti et al. for a comprehensive overview of related notions whose formalism is used here; see Stella et al. for an example of a multiplex network of lexical systems) of two sets of directed graphs such that the set of kernel layers consists of a pivotal text layer and several derivative layers, that is, a coauthoring layer, a language-systematic word layer, and possibly several layers modeling the networking of constituents of the pivotal texts:
(1) The pivotal text layer , also called text network, is spanned by texts of the corpus such that is manifesting intratextual (as in the case of reflexive arcs) or intertextual relations
(2) The author layer , also called agent network, is spanned by the network of agents (co)authoring the texts in and their social relations
(3) The lexicon layer , also called word network, is spanned by the language-systematic lexical signs (i.e., lexemes and related units) used by agents of as part of their agent lexica to author the texts in
(4) For , is called a constituent layer modeling the networking of (e.g., lexical, phrasal, and sentential) constituents of texts such that maps intratextual (e.g., anaphoric) or intertextual (e.g., sentence similarity) relations
(5) For , is called a contextual layer modeling the networking of units (e.g., media, genres, and registers) of the contextual embedding of texts such that maps, for example, relations of the switching, merging, or embedding [67, 68] of these contextual units
(6) For each , , , , is called a margin layer where , , , and .
For , and are vertex weighting functions, and are arc weighting functions, and are vertex labeling functions, and arc labeling functions. We say that the linguistic multilayer network is spanned over the text corpus X and layered into l layers.
Example 1. To illustrate our definitions, we construct a minimized example. Suppose a corpus of four texts , each containing three lexemes , , , and (for reasons of simplicity, we exemplify texts as bag-of-words), that is, , , and . Further, we assume four authors such that and coauthored and , while and coauthored and ; that is, and . Further, we assume that the texts are linked by some intertextual coherence relations (e.g., by a rhetorical relation, an argument relation, or some hyperlinks) as are the texts so that . Note that additional arcs of the layers will be generated according to the subsequent definitions. For simplicity reasons, we assume all weighting functions to be limited to the set of vertex/arc weights. Since we assume no additional constituent layer, we get . Thus, any linguistic multilayer network based on this setting is layered into three layers.
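The setting of Example 1 can be mirrored by a small data structure. The following is only a minimal sketch: the class `Layer`, all identifiers (`x1`, `r1`, `a`, and so on), and the concrete memberships are placeholders chosen for illustration and are not the values of the example itself; all weights are fixed to 1.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    """A directed, arc-weighted graph serving as one layer of an LMN."""
    name: str
    vertices: set = field(default_factory=set)
    arcs: dict = field(default_factory=dict)  # (source, target) -> weight

    def add_arc(self, source, target, weight=1.0):
        # Adding an arc also registers its end points as vertices.
        self.vertices.update((source, target))
        self.arcs[(source, target)] = weight


# Kernel layers (all identifiers below are hypothetical placeholders).
text_layer = Layer("text", {"x1", "x2", "x3", "x4"})
text_layer.add_arc("x1", "x2")  # an intertextual link, e.g. a hyperlink

author_layer = Layer("author", {"r1", "r2", "r3", "r4"})
author_layer.add_arc("r1", "r2")  # r1 and r2 coauthor at least one text
author_layer.add_arc("r2", "r1")

lexicon_layer = Layer("lexicon", {"a", "b", "c", "d"})

# Margin layers connect vertices of different kernel layers.
authorship = {("r1", "x1"): 1.0, ("r2", "x1"): 1.0,
              ("r1", "x2"): 1.0, ("r2", "x2"): 1.0}
containment = {("x1", "a"): 1.0, ("x1", "b"): 1.0, ("x1", "c"): 1.0,
               ("x2", "b"): 1.0, ("x2", "c"): 1.0, ("x2", "d"): 1.0}

kernel_layers = [text_layer, author_layer, lexicon_layer]
```

Keeping the margin layers as plain dictionaries keyed by vertex pairs makes it easy to look up which agents authored a text or which words a text contains when topic links are inferred later.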
Throughout this paper, we use the following simplifying notation: for any graph of order , arc set of size and vertex labeling function λ, and any vertex , we write . Thus, for any two graphs with vertex labeling functions and , for which , , we can write . Further, for any function , for which , we use the following alternative notations:Finally, for any function , Z being any set, we introduce the following notation based on square brackets:To leave no room for ambiguity, we assume that expressions of the sort are replaced from left to right into expressions of the sort . Henceforth, a structure such as will be called information link. Based on Definition 1, we start now with introducing text topic networks using the following auxiliary notion.
Definition 2. Let be a directed Generalized Tree (GT) according to Mehler [69, 70] representing a hierarchical topic structure, henceforth called Reference Classification System (RCS), that is spanned by kernel arcs which are possibly superimposed by upward, downward, lateral, sequential, external, or reflexive arcs. (See Figure 2 for an example of a GT. This notion is required since we may decide to use, for example, the category system of Wikipedia as an RCS, which spans a GT.) That is, vertices represent topics, while kernel arcs represent subordination relations according to which u is a thematic specialization of t. Let further θ denote a hierarchical text classifier taking values in that has been trained, validated, and tested by means of a reference corpus . Let now be a LMN spanned over the text corpus X and layered into l layers. We call the structure a Definitional Setting for defining topic networks.
Example 2. Given the LMN of Example 1, the Dewey Decimal Classification (see Section 3.2), and the topic classifier θ of , which uses the DDC as its Reference Classification System , a definitional setting is exemplified by . More specifically, by we will denote three topic labels of the third level of the DDC so that . Note that by using the DDC as a reference classification, the generalized tree of Definition 2 is reduced to a tree (see Section 3.2 for more details).
Definition 3. Given a definitional setting according to Definition 2, a Text Topic Network (TTN) is a vertex- and arc-weighted simple directed graph with vertex set V and arc set which is said to be derived from and inferred from by means of the optional classifier and the monotonically increasing functions if and only if and : where is a vertex weighting function, an arc weighting function, an injective vertex labeling function, , and κ an injective arc labeling function. is called a one-layer topic network that is generated by the generating layer .
Formulas (6) and (7) require that the weighting values for nodes and arcs are greater than 0: otherwise, the candidate vertices and arcs do not exist in the TTN. is a classifier mapping pairs of topics and texts x onto real numbers indicating the extent to which x is a “prototypical” instance of t (obviously, the textual arguments of the functions θ and θ← are not restricted to elements of X.)
Example 3. Given Example 2, we assume that and , , so that . In our example, we disregard θ←. Further, we assume that the functions are identity functions. Thus, and . Now, we can generate a topic link between and by exploring the intertextual relation : To this end, we assume thatso that . By analogy to this case, we link topic by means of a reflexive link so that . Note that these simplifications are made for simplicity’s sake only: Section 3.2 will elaborate a realistic weighting scenario. However, the function of the latter illustration is to show that by the intertextual linkage of both texts, we get evidence about the linkage of the topics instantiated by these texts. TTNs always operate according to this premise: they network topics as a function of the networking of an underlying set of texts. Figure 3 gives a schematic depiction of this scenario, which is varied subsequently to illustrate the other types of topic networks developed in this paper.
A concrete example of a TTN that is derived from the articles of the so-called Dresden wiki (see Section 4.1) is depicted in Figure 4. It shows the highest weighted topics addressed by these articles and their (undirected) links. The TTN has been computed by means of the procedural model of Section 3.2. Evidently, the topic Transportation; ground transportation is most prominent in this wiki followed by the topic Central Europe; Germany. Most topics belong to the areas transportation (red), geography and history (turquoise), and architecture (gray) (for the color code, see Appendix). More examples of TTNs can be found in Figures 5–7.
Arguments of the sort can be used to quantify evidence about text x as an instance of topic : the more the evidence of this sort, the higher possibly the impact of x in formula (6) and the higher possibly the final weight of . The adverb possibly refers to what is licensed by the parameters . Arguments of the sort , where , can be used to quantify evidence that text x is intertextually linked to text y: the more the evidence of this sort, the higher possibly the weight of the link from x to y and the higher possibly the influence of this link onto the weight of the link from topic to topic in formula (7) (in cases in which there is no explicit information about intertextual links, one can use functions of aggregated word embeddings of the lexical constituents of texts to calculate their intertextual similarity). In this and related definitions, we do not fully specify the functions to leave enough space for different instances of topic networks.
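Since the weighting functions are deliberately left open, any code can only show one possible instantiation. The sketch below assumes that topic weights are sums of classifier values over texts, that arc weights are sums of products of classifier values and text-link weights, and that the monotonically increasing functions are identities; these choices, the function name `text_topic_network`, and the toy identifiers are assumptions made for illustration.

```python
from collections import defaultdict


def text_topic_network(theta, text_arcs):
    """Derive topic weights and topic arcs from texts and their links.

    theta:     {text: {topic: membership value in [0, 1]}}
    text_arcs: {(text_x, text_y): weight of the intra- or intertextual link}
    """
    vertex_weight = defaultdict(float)
    arc_weight = defaultdict(float)

    # A topic gains weight from every text that instantiates it.
    for topics in theta.values():
        for topic, value in topics.items():
            vertex_weight[topic] += value

    # A topic arc gains weight from every text link whose end points
    # instantiate the two topics.
    for (x, y), link in text_arcs.items():
        for t, value_x in theta.get(x, {}).items():
            for u, value_y in theta.get(y, {}).items():
                arc_weight[(t, u)] += value_x * value_y * link

    # Only strictly positive weights yield vertices and arcs.
    vertices = {t: w for t, w in vertex_weight.items() if w > 0}
    arcs = {a: w for a, w in arc_weight.items() if w > 0}
    return vertices, arcs


# Hypothetical input: two texts, one topic each, one link and one reflexive arc.
theta = {"x1": {"t1": 1.0}, "x2": {"t2": 1.0}}
text_arcs = {("x1", "x2"): 1.0, ("x1", "x1"): 1.0}
vertices, arcs = text_topic_network(theta, text_arcs)
```

In this toy input the intertextual link between the two texts produces a topic arc, and the reflexive text arc produces a reflexive topic arc, which reflects the premise that topics are networked as a function of the networking of the underlying texts.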
Definition 3 relies on the pivotal text layer for deriving topic networks. To integrate further layers into the process of inferring topic networks, we introduce the following generalized schema.
Definition 4. Given a definitional setting according to Definition 2, an -Topic Network, , is a vertex- and arc-weighted simple directed graph which is said to be derived from and inferred from and the elements of by means of the optional classifiers and monotonically increasing functions iff and : where . is a vertex weighting function, an arc weighting function, an injective vertex labeling function, , and κ an injective arc labeling function. For , we say that is a two-level topic network that is generated by the generating layers and . If , then formula (10) changes to formula (6) and formula (11) to formula (7). By omitting any optional classifier , expressions of the sort change to . ϑ is treated analogously.
To understand formula (10), look at Figure 8: among other things, formula (10) collects the triangle spanned by , x, and a, supposing that the two-level topic network is based on text and authorship links. Obviously, Definition 4 generalizes Definition 3. Now, it should be clear why we speak of the text network of an LMN as its pivotal level: it is the reference layer of any additional layer that is integrated into a two-level topic network according to Definition 4. This role is maintained below when we generalize this definition to capture n layers, . With the help of Definition 4, we can immediately derive so-called author topic networks.
Definition 5. An Author Topic Network (ATN) is a directed graph according to Definition 4 such that .
The relational arguments of this definition can be motivated as follows, assuming that they are instantiated appropriately:
(1) can be used to represent evidence that text x is about topic possibly in relation to other topics of .
(2) can be used to represent evidence that text x is a prototypical instance of topic possibly in relation to other texts in .
(3) can be used to represent the extent to which agent r tends to write about topic possibly in relation to other topics of .
(4) represents evidence that agent r is a prototypical author writing about topic possibly in relation to other agents in .
(5) For , can be calculated to represent evidence about text x to be intertextually linked to text y (e.g., in the sense of linking contributions of different authors). Otherwise, if , can be used to quantify evidence about x being intratextually structured.
(6) can be used to quantify evidence about the role of agent r as an author of text x possibly in relation to other texts authored by r. Typically, is a function of the number of edit actions performed by r on x.
(7) can be used to quantify evidence about the role of agent r as a prototypical author of text x possibly in relation to other authors of x. In the simplest case, is symmetric making obsolete.
(8) represents evidence that agent r is a coauthor of or interacting with s. For instantiating , the literature knows a wide range of alternatives [74, 75] (which mostly concern symmetric measures of coauthorship). Note that we do not require that .
Example 4. Starting from Example 3 to exemplify arcs between topics in author topic networks, we can now additionally explore the evidence, that text and are both coauthored by the agents . That is, we can assume a coauthorship link ( is the arc set of the author layer in Definition 1) of weight . Let us now assume the following simplification of the function δ in Definition 4, for which we assume that it simply multiplies and adds up its argument values in the following way:In our example, we get , , , , , and . Since there is no other interlinked pair of texts (see Example 1), instantiating the topics , we get as the weight of this topic link in the corresponding ATN. By this simplified example of an ATN, we get the information that the link of topic to topic is additionally supported by the coauthorship of agents : this information extends the evidence about the topic link as provided by the underlying TTN of Example 3. Likewise, the reflexive link of topic is augmented by 1 compared to the underlying TTN, while there is no other topic link to be considered in this example of an ATN. By analogy to Figure 3, Figure 9 gives a schematic depiction of this scenario. Note that in our example, the weight of the link between authors (cf. ) is a function of their coauthorship: this is only one alternative to weight the social relatedness of both agents, actually one that can be measured by exploring (special) wikis. However, any other social relatedness might be explored to weight the interaction of agents.
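The idea of a function that multiplies the evidence along one path and adds up the resulting summands can be sketched as follows. This is an illustration under stated assumptions, not the weighting used in the example: the function name `author_topic_arcs`, the choice of arguments, and all identifiers and weights are hypothetical.

```python
def author_topic_arcs(theta, text_arcs, authorship, coauthorship):
    """Weight topic arcs by text links that are backed by coauthorship.

    theta:        {text: {topic: membership value}}
    text_arcs:    {(text_x, text_y): link weight}
    authorship:   {(agent, text): weight of the agent's contribution}
    coauthorship: {(agent_r, agent_s): weight of their social link}
    """
    arcs = {}
    for (x, y), link in text_arcs.items():
        authors_x = [(r, w) for (r, txt), w in authorship.items() if txt == x]
        authors_y = [(s, w) for (s, txt), w in authorship.items() if txt == y]
        for r, w_rx in authors_x:
            for s, w_sy in authors_y:
                social = coauthorship.get((r, s), 0.0)
                for t, v_x in theta.get(x, {}).items():
                    for u, v_y in theta.get(y, {}).items():
                        # Multiply the evidence along one path, add up paths.
                        summand = v_x * v_y * link * w_rx * w_sy * social
                        if summand > 0:
                            arcs[(t, u)] = arcs.get((t, u), 0.0) + summand
    return arcs


# Hypothetical toy input with unit weights.
theta = {"x1": {"t1": 1.0}, "x2": {"t2": 1.0}}
text_arcs = {("x1", "x2"): 1.0}
authorship = {("r1", "x1"): 1.0, ("r2", "x1"): 1.0,
              ("r1", "x2"): 1.0, ("r2", "x2"): 1.0}
coauthorship = {("r1", "r2"): 1.0, ("r2", "r1"): 1.0}
arcs = author_topic_arcs(theta, text_arcs, authorship, coauthorship)
```

Because agent pairs without a social link contribute a summand of zero, a text link only yields a topic arc here if it is supported by at least one pair of interacting authors.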
By comparing a text topic network with an author topic network derived from the same LMN , we can learn how the topics of are manifested in the texts of corpus X in the form of a concomitance or a disparity of intertextual and coauthorship-based networking. Consider, for example, two vertices such that ; let further and denote the minimum and maximum that the vertex weighting functions of both graphs can assume. Then, we can distinguish four extremal cases:
(1) Cases of the sort described by formula (14) provide information on prominent topics that tend to be addressed by many texts which are coauthored by many authors.
(2) Situations like those of formula (15) probably apply to the majority of the topics in , which are hardly or even not at all addressed by texts in due to the narrow thematic focus of these texts.
(3) Cases like those of formula (16) suggest a Zipfian topic effect, according to which a prominent topic is addressed by a small group of agents or even by a single author.
(4) Finally, situations of the sort described by formula (17) refer to rarely manifested topics addressed by a few but highly coauthored texts. In conjunction with many cases of the sort described by formula (16), situations of this kind indicate a Zipfian coauthoring effect, according to which many authors write only a few texts, while many texts are written by a few authors without encountering many (relevant) coauthors.
Formulas (14)–(17) compare the node weighting functions of a TTN with those of a related ATN. The same can be done regarding their arc weighting functions. That is, for two arcs and , for which , we distinguish again four cases ( and now denote the minimum and maximum the arc weighting functions of both graphs can assume):
(1) In the case of formula (18), topic is intertextually linked more strongly to topic and authors of its text instances tend to cooperate with those of instances of topic likewise to a greater extent.
(2) In the case of formula (19), topic is intertextually less strongly linked to topic and the few authors of its textual instances tend to cooperate with authors of instances of topic likewise to a lesser extent.
(3) In the case of formula (20), topic is intertextually more strongly connected with topic , while authors of its text instances tend to cooperate with those of instances of topic to a lesser extent, if at all.
(4) Finally, in the case of formula (21), topic is intertextually less strongly linked to topic , while the numerous authors of its text instances tend to cooperate with those of instances of topic to a much greater extent.
Our central question regarding the relationship between TTNs and ATNs derived from the same LMN is whether these networks are similar or not. If they are similar, we expect that cases of the sort described by formulas (14), (15), (18), and (19) predominate so that cases matched by formula (14) are parallelized by those considered by formula (18) and where cases according to formula (15) are concurrent to those described by formula (19). An opposite situation would be that two topic nodes in the TTN are highly weighted but weakly linked, while they are weakly weighted but strongly linked in the corresponding ATN. In this case, a few or even only a single author is responsible for the thematic focus of the TTN. Note that this scenario is again reminiscent of a Zipfian effect regarding the relation of TTNs and ATNs.
By characterizing TTNs in relation to ATNs along these and related scenarios, we want to investigate laws of the interdependence of both types of networks, which may consist, for example, in the simultaneity of dense or sparse intertextuality-based networking on the one hand and dense or sparse coauthorship-based networking on the other. We may expect, for example, that the more related two topics are, the more likely the authors of their textual instances are to cooperate. However, little is known about such scenarios in the area of VGI, especially with regard to Hypothesis 1. Thus, we address this gap at least by introducing a novel theoretical model which may help to fill it.
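The four-way distinction between TTN and ATN weights can be made operational with a simple threshold. The sketch below is a simplification: the cases in the text are extremal (weights near the minimum or maximum), whereas the code splits the weight range at its midpoint; the function names and the example weights are hypothetical.

```python
def extremal_case(w_ttn, w_atn, minimum=0.0, maximum=1.0):
    """Map a pair of TTN/ATN weights to one of the four cases."""
    midpoint = (minimum + maximum) / 2  # assumption: midpoint as threshold
    high_ttn = w_ttn >= midpoint
    high_atn = w_atn >= midpoint
    if high_ttn and high_atn:
        return "high in both"
    if not high_ttn and not high_atn:
        return "low in both"
    if high_ttn:
        return "high in TTN, low in ATN"
    return "low in TTN, high in ATN"


def concordance(ttn_weights, atn_weights, minimum=0.0, maximum=1.0):
    """Share of common topics (or arcs) on which both networks agree."""
    shared = ttn_weights.keys() & atn_weights.keys()
    if not shared:
        return 0.0
    agreeing = sum(
        extremal_case(ttn_weights[t], atn_weights[t], minimum, maximum)
        in ("high in both", "low in both")
        for t in shared
    )
    return agreeing / len(shared)


# Hypothetical normalized vertex weights of three topics.
ttn_weights = {"t1": 0.9, "t2": 0.1, "t3": 0.8}
atn_weights = {"t1": 0.7, "t2": 0.2, "t3": 0.1}
share = concordance(ttn_weights, atn_weights)
```

The same two functions apply to arc weights if the dictionaries are keyed by topic pairs instead of topics. A share close to 1 would correspond to the situation in which the concordant cases predominate.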
Figure 5 exemplifies two ATNs in relation to a corresponding TTN (T1) which were computed using the apparatus of Section 3.2 to instantiate the formal model of this section. The upper right ATN (A1) is computed by globally weighting coauthorship activities based on Wikipedia (as explained in Section 3.2.3); the ATN (A2) below is calculated by weighting of these activities relative to the city wiki itself. Figure 5 shows that the topic with DDC number 720 (Architecture) is weighted higher in A1 than in T1. This is all the more pronounced in A2, where 720 becomes the most prominent topic and consequently displaces the top subject from T1, that is, topic 380 (Commerce, communications & transportation). That is, although topic 380 is most frequently addressed in this wiki’s texts, topic 720 not only is almost as salient but also attracts many more activities among its interacting coauthors. Similar observations concern the switch of the roles of the topics 910 (Geography & travel) and 940 (History of Europe) from T1 to A1 and A2.
Regardless of the answer to this and related questions, we will also ask whether the shape of an ATN can be predicted if one knows the shape of the corresponding TTN and vice versa. To answer this question, we will consider LMNs of different text genres: of city wikis and regional wikis on the one hand and extracts of encyclopedic wikis on the other. We expect that LMNs spanned over corpora of the same genre exhibit a pattern of collaboration- and intertextuality-based networking that makes TTNs and ATNs derived from them mutually recognizable or predictable, whereas for LMNs generated from corpora of different genres this does not apply.
For reasons of formal variety, we now consider an alternative to author topic networks, namely, so-called word topic networks, which in turn are derived from Definition 4.
Definition 6. A Word Topic Network (WTN) is a directed graph according to Definition 4 such that .
This definition departs by five new relational arguments from Definition 5, which, if instantiated appropriately, can be motivated as follows:
(1) quantifies evidence about the role of word a as a lexical constituent of text x possibly in relation to all other texts in which a occurs. Typically, is implemented by a global term weighting function or by a neural network-based feature selection function.
(2) quantifies evidence about the role of the word a as a lexical constituent of the text x possibly in relation to other lexical constituents of x. Typically, is a local term weighting function, such as normalized term frequency, or a topic model-based function.
(3) represents evidence about the word a to be associated with the topic possibly in relation to all other topics of .
(4) calculates evidence about the extent to which the topic is prototypically labeled by the word a, possibly in relation to all other words in .
(5) quantifies evidence about the extent to which the word a associates the word b. Typically, is computed by means of word embeddings.
Based on this list, we better understand what topic networks offer in contrast to TMs. This concerns the flexibility with which we can include informational resources computed by different methods (e.g., based on neural networks, topic models, and LSA) to generate topic networks (cf. challenge P5). Different relational arguments can be quantified using different methods, which in turn can belong to a wide range of computational paradigms. Table 2 gives an account of the generality of our approach by hinting at candidate procedures for computing the different relations of Figure 8.
Example 5. Starting from Example 3 to exemplify arcs between topics in word topic networks, we have to additionally explore evidence regarding the lexical relatedness of the vocabularies of the texts and . In Example 1, we assumed that the intersection of both texts (represented as bags-of-words) is given by the set . By analogy to Example 4, we assume now the following simplification of the function δ of Definition 4:In this scenario, we have to instantiate Definition 4 as follows: , , , , , and for one summand and—everything else being constant— and for a second summand (for (), we do not assume a lexical relatedness w.r.t. the words of text ()). Note that under this regime, we assume that relatedness of lexical constituents only concerns shared usages of identical words—of course, this is a simplifying example. By analogy to the setting of Example 4, we have thus to conclude that as the weight of the topic link from to in the corresponding WTN. For texts , we may alternatively assume that lexical relatedness does not only concern shared lexical items but also relatedness that is measured, for example, by means of a terminological ontology  or by means of word embeddings . In this way, we may additionally arrive at a topic link between and . In order to allow for a comparison of a WTN with its corresponding TTN, a more realistic weighting scheme is needed that also reflects above and below average lexical relatednesses of the lexical constituents of interlinked texts—in Section 3.2, we elaborate such a model regarding ATNs in relation to TTNs. Figure 10 gives a schematic depiction of the scenario of WTNs as elaborated so far.
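The simplifying regime in which lexical relatedness is reduced to shared words can be sketched as follows. The code assumes one common variant of normalized term frequency (frequency divided by the maximum frequency in the text) as local term weight and contributes one summand per shared word; the function names, default values, and the toy bags-of-words are hypothetical.

```python
from collections import Counter


def normalized_tf(tokens):
    """Local term weight: frequency divided by the maximum frequency."""
    if not tokens:
        return {}
    counts = Counter(tokens)
    top = max(counts.values())
    return {word: n / top for word, n in counts.items()}


def word_topic_arc(tokens_x, tokens_y, theta_x_t=1.0, theta_y_u=1.0, link=1.0):
    """Weight of the arc t -> u contributed by the linked texts x and y."""
    tf_x, tf_y = normalized_tf(tokens_x), normalized_tf(tokens_y)
    shared = tf_x.keys() & tf_y.keys()
    # One summand per shared word; related but different words are ignored.
    return sum(tf_x[a] * tf_y[a] * theta_x_t * theta_y_u * link for a in shared)


# Two hypothetical bags-of-words sharing the words "b" and "c".
weight = word_topic_arc(["a", "b", "c"], ["b", "c", "d"])
```

Replacing the set intersection by a similarity lookup, for example over word embeddings or a terminological ontology, would extend the sketch to related but non-identical words.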
It is worth emphasizing that instead of the (language-systematic) lexicon layer , we may use a constituent layer , to infer a two-level topic network. For example, we can use the layer spanned by the sentences of the pivotal texts to obtain a sort of sentence topic network. In this case, may quantify evidence about the extent to which the sentence a entails the sentence b or the extent to which the sentence a is similar to the sentence b, etc., while may quantify evidence about the extent to which the sentence a is thematically central for the text x, etc. In sentence topic networks, topic linkage is a function of sentence linkage: prominent topics emerge from being addressed by many sentences, while prominent topic links arise from the relatedness of many underlying sentences. Another example of inferring two-level topic networks is to link topics as a function of places mentioned (by means of toponyms) within the texts of the underlying corpus X, where geospatial relations of these places can be explored to infer concurrent topic relations: if place p is mentioned in text x about topic and place q in text y about topic , where the platial relation relates p and q, this information can be used to link the topic nodes in the corresponding topic network. As a result, we obtain networks manifesting the networking of topics as a function of parallelized geographical relations.
Obviously, any other relationship (e.g., entailment among sentences, sentiment polarities shared by linked texts, and co-reference relations) can be investigated to induce such two-level networks. And even more, we can think of n-level networks in which several such relationships are explored at once to generate topic links. We can ask, for example, which locations are linked by which geospatial relations while being addressed in which sentences about which topics where these sentences are related by which sentiment relations. Another example is to ask which authors prefer to write about which topics while tending to use which vocabulary: the higher the number of authors who use the same words more often to write about the same topic, and the higher the number of such words, the higher the weight of that topic. In this case, topic weighting is a function of frequently observed pairs of linguistic (here: lexical) means and authors. On the other hand, the higher the degree of coauthorship of two authors contributing to different topics and the higher the degree of association of the words used by these authors to write about these topics, the higher the weight of the link between the topics. This concept of a topic network induced by the text, the coauthorship, and the lexicon layer of an LMN is addressed by the following generalization, which provides a generation scheme for topic networks:
Definition 7. Given a definitional setting according to Definition 2, an -Topic Network, for whichis a vertex- and arc-weighted simple directed graphwhich is said to be derived from and inferred from and the elements of by means of the optional classifiers and monotonically increasing functions iff and : is a vertex weighting function, an arc weighting function, an injective vertex labeling function, , and κ an injective arc labeling function. For , we say that is an m-level, , topic network generated by the generating layers and the elements of . If , formula (26) changes to formula (6) and formula (27) to formula (7). By omitting the optional classifier , expressions of the sort change to . θ and are treated analogously. In order to derive an undirected m-level topic network from , we define andand where are monotonically increasing functions.
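The closing step of Definition 7, deriving an undirected network from a directed one, can be illustrated by merging each pair of opposite arcs into a single edge. The sketch assumes that the weights of both directions are combined by addition, which is one monotonically increasing choice among many; the function name `to_undirected` and the `combine` parameter are hypothetical.

```python
def to_undirected(arcs, combine=lambda forward, backward: forward + backward):
    """Merge the arcs (t, u) and (u, t) into one undirected edge {t, u}.

    `combine` should be symmetric, since the order in which the two
    directions are encountered is not fixed.
    """
    edges = {}
    for (t, u), weight in arcs.items():
        key = frozenset((t, u))
        if key in edges:
            continue  # the opposite direction has already been merged
        if t == u:
            edges[key] = weight  # keep reflexive arcs unchanged
        else:
            edges[key] = combine(weight, arcs.get((u, t), 0.0))
    return edges


# Hypothetical directed topic arcs.
directed = {("t1", "t2"): 1.0, ("t2", "t1"): 0.5, ("t1", "t1"): 2.0}
undirected = to_undirected(directed)
```

Using `frozenset` keys makes the edge independent of direction, so a lookup succeeds regardless of the order in which two topics are given.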
Evidently, Definition 7 is a generalization of Definition 3 by considering higher numbers of generating layers. A schematic depiction of the scenario addressed by this definition is shown in Figure 11 by example of a 3-level topic network that explores evidence about topic linking starting from the text, the author, and the lexicon layer of Definition 1. Likewise, Figure 12 depicts an n-level topic network, , in which additional resources are explored beyond the word, author, and text level. Figure 8 illustrates more formally the inference process underlying Definition 7, and in particular of the arguments used. It illustrates the inference of an arc that connects two topics by exploring the links of the text, author, and lexicon layers of an underlying LMN. In this example, the blue and black arcs are evaluated to determine the weights of red arcs connecting the focal topic nodes. Blue arcs are used to orientate inferred arcs. We will not develop this apparatus further, nor will we empirically examine -layer topic networks for . Rather, the apparatus developed so far serves to demonstrate the generality, flexibility, and extensibility of our formal model.
In the above, we explained that one of the reasons for introducing a flexible and extensible formalism of topic networks is to compare topic networks derived from different layers (e.g., from the text layer on the one hand and the author layer on the other). In order to systematize this approach, we finally introduce the concept of a multiplex topic network, which is derived from the same or from different linguistic multilayer networks:
Definition 8. Given a definitional setting according to Definition 2, a Multiplex Topic Network (MTN) is a k-layer network such that each , , is an -Topic Network derived from according to Definition 7 and for each , , , , is called a margin layer fulfilling the following requirements: , , , and .
See Figure 13 for a schematic depiction of the comparison of two MTNs. Note that because of Definition 7, it does not necessarily hold that , but it always holds that . In this respect, we depart from , which instead require more strongly that . In the case of topic networks, this would be too restrictive, as different topic networks derived from the same definitional setting can focus on different subsets of topics, while ignoring the rest of the topics in the co-domain of θ. (A way to extend Definition 8 is to include the RCS of Definition 2 as an additional layer. This would allow for directly relating its constituent topic networks with the hierarchical classification system .)
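The point that layers of an MTN may cover different subsets of topics can be illustrated with a few lines of code. The sketch assumes, as is usual for multiplex networks, that margin arcs connect only those vertices of two layers that carry the same topic label; the function names and the toy layers are hypothetical.

```python
def margin_arcs(layer_i, layer_j):
    """Connect vertices of two layers that carry the same topic label."""
    return {(topic, topic) for topic in layer_i.keys() & layer_j.keys()}


def is_valid_mtn(layers, topics):
    """Every layer may use any subset of the classifier's topic set."""
    return all(set(layer) <= set(topics) for layer in layers)


# Hypothetical topic set and vertex weights of two layers.
topics = {"t1", "t2", "t3"}
ttn = {"t1": 2.0, "t2": 1.0}
atn = {"t1": 1.0, "t3": 0.5}

links = margin_arcs(ttn, atn)
valid = is_valid_mtn([ttn, atn], topics)
```

Here the two layers share only one topic, so a comparison of the layers has to decide how to treat topics that occur in one layer but not in the other, for example by assigning them a weight of zero.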
In this paper, we quantify similarities of the different layers of MTNs to shed light on Hypothesis 1. More specifically, we generate an LMN for each corpus of a set of different text corpora in order to derive a separate two-layer MTN for each of these LMNs, each consisting of a TTN and an associated ATN. Then, among other things, we conduct a triadic classification experiment: firstly with respect to the subset of all TTNs derived from our corpus, secondly with respect to the subset of all corresponding ATNs, and thirdly with respect to the subset of all TTNs in relation to the subset of the corresponding ATNs. In the next section, we explain the measurement procedure for carrying out this triadic classification experiment.
3.2. A Procedural Model of Topic Network Analysis
In order to instantiate topic networks as manifestations of the rhematic networking of places, we employ the procedure depicted in Figure 14. It combines nine modules for the induction, comparison, and classification of topic networks.
3.2.1. Module 1: Natural Language Processing
All subsequent modules presuppose natural language processing of the input text corpora. To this end, we utilize the NLP tool chain of TextImager to carry out tokenization, sentence splitting, part-of-speech tagging, lemmatization, morphological tagging, named entity recognition, dependency parsing, and automatic disambiguation, the latter by means of fastSense. For more details on these submodules, see [86, 87]. As a result of Module 1, the topic classification can be fed with texts whose lexical components are disambiguated at the sense level. As a sense model, we use the disambiguation pages of Wikipedia, currently the largest available model of lexical ambiguity.
3.2.2. Module 2: Topic Classification
According to Definition 2, the derivation of TNs from LMNs requires the specification of a Reference Classification System (RCS). For this purpose, we utilize the Dewey Decimal Classification (DDC), a system that is well established in the area of (digital) libraries. As a result, the generalized tree from Definition 2 degenerates into an ordinary tree since the DDC has no arcs superimposing its kernel hierarchy (see Figure 15 for a subtree of the DDC). As a classifier θ, which addresses the DDC, we use text2ddc, a topic classifier based on neural networks, which has been trained for a variety of languages (see https://textimager.hucompute.org/DDC/). Starting from the output of Module 1 (NLP), we use text2ddc to map each input text x to the distribution of the 5 top-ranked DDC classes that best match the content of x as predicted by text2ddc. Since text2ddc reflects the three-level topic hierarchy of the DDC, this classifier can output a subset of 98 classes of the 2nd DDC level (two classes of this level are unspecified) and a subset of 641 classes of the 3rd DDC level for each input text. (We did not have training data for all 3rd level classes, some of which are unspecified. See the appendix for details.) Thus, each topic network of each input corpus is represented on two levels of increasing thematic resolution. Note that text2ddc classifies input texts of any size (from single words to entire texts in order to meet challenge P3) and works as a multilabel classifier for processing thematically ambiguous input texts. By using an RCS, text2ddc meets challenge P2 simply by referring to the labels of the topic classes of the DDC. Furthermore, since text2ddc is trained with the help of a reference corpus, it can detect topics even if they occur only once in a text (this is needed to meet challenge P4), and it guarantees comparability for different input corpora (challenge P1).
text2ddc is based on fastText whose time complexity is , where “k is the number of classes and h the dimension of the text representation” (2, ) (making this classifier competitive compared to TMs).
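The relation between the three DDC levels mentioned above can be illustrated with a minimal sketch. It assumes that classes are written as three-digit strings, so that the 1st level is given by the first digit and the 2nd level by the first two digits. The function names `ddc_levels` and `group_by_second_level` are hypothetical helpers introduced only for illustration; they are not part of text2ddc.

```python
def ddc_levels(code):
    """Return the (1st, 2nd, 3rd) level DDC classes of a three-digit code.

    Assumption: classes are written as three-digit strings such as "943".
    """
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"expected a three-digit DDC code, got {code!r}")
    return code[0] + "00", code[:2] + "0", code


def group_by_second_level(codes):
    """Group 3rd-level DDC codes under their 2nd-level parent class."""
    groups = {}
    for code in codes:
        second_level = ddc_levels(code)[1]
        groups.setdefault(second_level, []).append(code)
    return groups


if __name__ == "__main__":
    print(ddc_levels("943"))
    print(group_by_second_level(["943", "944", "551"]))
```

Grouping 3rd-level classes under their 2nd-level parents in this way shows how the same text can be located at two levels of thematic resolution.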
Figures 4–7 show examples of TTNs and ATNs generated by means of text2ddc by addressing the second level of the DDC. Each of these topic networks was generated for a subset of articles of the German Wikipedia that are at most 2 clicks away from the respective start article x (for the statistics of the corpora underlying these topic networks, see Section 4.1). Formally speaking, let G = (V, A) be a directed graph and v ∈ V; the nth orbit induced by v is the subgraph of G that is induced by the subset of vertices whose geodetic distance from v is at most n. We compute the first orbit and the second orbit of a set of Wikipedia articles (so that G denotes Wikipedia’s web graph). This is done to obtain a basis for comparison for the evaluation of topic networks derived from special wikis. Since Wikipedia is probably more strongly regulated than these special wikis, we expect higher disparities between networks of different groups (Wikipedia vs. special wiki) and smaller differences for networks of the same group.
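The orbit construction described above can be sketched as a breadth-first search that stops at depth n. This is a minimal illustration, not the implementation used in the study. It assumes that the web graph is given as a dictionary mapping each article to the articles it links to and that distance is measured along outgoing links. The names `orbit_vertices` and `induced_subgraph` and the toy graph are hypothetical.

```python
from collections import deque


def orbit_vertices(arcs, start, n):
    """Return all vertices whose directed distance from `start` is at most n."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if dist[v] == n:
            continue  # do not expand beyond the nth orbit
        for w in arcs.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return set(dist)


def induced_subgraph(arcs, vertices):
    """Keep only the arcs whose source and target both lie in `vertices`."""
    return {
        v: sorted(w for w in arcs.get(v, ()) if w in vertices)
        for v in sorted(vertices)
    }


if __name__ == "__main__":
    # Toy link structure (hypothetical data).
    web = {
        "Berlin": ["Germany", "Spree"],
        "Germany": ["Europe"],
        "Europe": ["Earth"],
        "Spree": ["Berlin"],
    }
    first = orbit_vertices(web, "Berlin", 1)
    print(induced_subgraph(web, first))
```

In the toy graph, the first orbit of "Berlin" contains "Berlin", "Germany", and "Spree"; the second orbit additionally contains "Europe".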
3.2.3. Module 3: Network Induction
Network induction is done according to the formal model of Section 3.1. It starts with inducing an LMN for each input corpus X. That is, for each corpus X, we generate a text network and an agent network according to Definition 1:
(1) In this paper, X always denotes the set of texts (web documents) of a corresponding wiki W so that the text layer of the LMN , in which is an agent network defined below, can be used to represent the web graph of this wiki. Thus, for any two texts that are linked in W, we generate an arc , where and . Further, for .
(2) The author layer of the LMN corresponding to (see Definition 1) is generated as follows: is the set of all registered authors or TCP/IP addresses of anonymous users working on texts in X so that maps to this name or IP address, respectively. Let be the sum of all additions made by the author to any revision of the edit history of the text x; we use to approximate the more difficult to measure concept of authorship as introduced by Brandes et al. . Then, we define: . Further, is the set of all arcs between users , for which there is at least one text x to which both contribute so that . Then, we define (cf. ):
Finally, . Obviously, is symmetric.
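The construction of the author layer can be illustrated by a minimal sketch that links two users whenever both have contributed to at least one common text. The sketch makes several assumptions: contributions are given as a dictionary from (author, text) pairs to the amount of added material, a contribution counts if this amount is positive, and each link simply records the set of shared texts. The weighting function defined in the paper is not reproduced here. The name `coauthor_edges` and the sample data are hypothetical.

```python
from itertools import combinations


def coauthor_edges(additions):
    """Link every pair of authors who both contributed to some text.

    `additions` maps (author, text) to the amount of material added.
    Assumption: a contribution counts if this amount is positive.
    Returns a dict from unordered author pairs to the set of shared texts.
    """
    authors_by_text = {}
    for (author, text), amount in additions.items():
        if amount > 0:
            authors_by_text.setdefault(text, set()).add(author)

    edges = {}
    for text, authors in authors_by_text.items():
        for a, b in combinations(sorted(authors), 2):
            # frozenset makes the pair unordered, so the relation is symmetric
            edges.setdefault(frozenset((a, b)), set()).add(text)
    return edges


if __name__ == "__main__":
    # Hypothetical edit data; anonymous users are identified by IP address.
    sample = {
        ("alice", "Berlin"): 120,
        ("bob", "Berlin"): 30,
        ("bob", "Spree"): 10,
        ("10.0.0.7", "Spree"): 5,
        ("alice", "Spree"): 0,
    }
    for pair, texts in coauthor_edges(sample).items():
        print(sorted(pair), sorted(texts))
```

Representing each pair as an unordered set mirrors the symmetry noted above; a weighted variant would replace the set of shared texts with a number computed from the authors' additions.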
Now, given the definitional setting , where are instantiated in terms of Section 3.2.2, we induce a TTN according to Definition 3 by means of appropriately defined monotonically increasing functions . To this end, we utilize the set of the membership values of text to the topics in , where the parameter denotes a lower bound of an acceptable degree of aboutness. We set . Further, by we denote the mean value of the set of selected topic membership values and by we denote the largest value of the arbitrary set . Finally, we select a number and define , thereby instantiating the parameters of formulas (6) and (7) of Definition 3:
According to formula (35), iff is one of the highest membership values of x to the topics in , supposed that . Otherwise, . In this paper, we experiment with . The higher the value of , the more sensitive the generation of to the thematic ambiguity of the underlying texts. However, since θ creates a membership value for each pair of texts and topics, we use as a lower bound of aboutness (in the sense of addressing a topic known by θ) so that irrelevant classifications do not affect .
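The selection rule described above, which keeps a membership value only if it is among the highest values of a text and does not fall below the lower bound of aboutness, can be sketched as follows. This is an illustrative sketch under stated assumptions: the classifier output is given as a dictionary from topic labels to membership values, and the parameter values used in the example are hypothetical, since the concrete settings are not reproduced here. The name `select_memberships` is likewise hypothetical.

```python
def select_memberships(scores, n, lower_bound):
    """Keep at most the n highest membership values that reach lower_bound.

    `scores` maps topic labels to the membership values of one text.
    All other topics are dropped, which corresponds to a weight of zero.
    Ties are broken by topic label to keep the result deterministic.
    """
    eligible = [(t, s) for t, s in scores.items() if s >= lower_bound]
    eligible.sort(key=lambda pair: (-pair[1], pair[0]))
    return dict(eligible[:n])


if __name__ == "__main__":
    # Hypothetical classifier output for a single text.
    scores = {"940": 0.5, "910": 0.3, "330": 0.15, "790": 0.04, "200": 0.01}
    print(select_memberships(scores, n=1, lower_bound=0.05))
    print(select_memberships(scores, n=5, lower_bound=0.05))
```

With a larger number of admitted topics, more of a text's secondary topics enter the network, while the lower bound keeps out classifications that the classifier assigns only marginal weight.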
Regarding the ATN corresponding to the TTN , we have to define monotonically increasing functions . To this end, we use several auxiliary functions:
(i) By , we denote the mean activity per author per Wikipedia article.
(ii) By , we denote the average number of active authors per Wikipedia article.
The corresponding estimators are found in Table 4. Now, consider the set of all active authors of the text x and the set of all texts that potentially contribute to and thus to the weight of the vertex :