Complexity

Volume 2017 (2017), Article ID 9212538, 7 pages

https://doi.org/10.1155/2017/9212538

## A Language as a Self-Organized Critical System

School of Applied Mathematics, Oles Honchar Dnipropetrovsk National University, Gagarina Av. 72, Dnipropetrovsk 49010, Ukraine

Correspondence should be addressed to Vasilii A. Gromov; stroller@rambler.ru

Received 2 May 2017; Revised 3 September 2017; Accepted 31 October 2017; Published 19 November 2017

Academic Editor: Gerard Olivar

Copyright © 2017 Vasilii A. Gromov and Anastasia M. Migrina. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

A natural language (represented by texts generated by native speakers) is considered as a complex system, and the type of complex system to which natural languages belong is ascertained. Namely, the authors hypothesize that a language is a self-organized critical system and that the texts of a language are “avalanches” flowing down its word cooccurrence graph. The respective statistical characteristics for distributions of the number of words in texts of the English and Russian languages are calculated; the samples were constructed on the basis of corpora of literary texts and of a set of social media messages (as a substitute for oral speech). The analysis found that the number of words in the texts obeys a power-law distribution.

#### 1. Introduction

Since natural languages gradually came to be regarded as complex systems, the means of studying linguistic processes has shifted from descriptive approaches to formal analysis aiming to construct a mathematical model for the operation and the development of language; this change is both a cause and an effect of “the linguistic turn” (the term is due to [1]).

Within the framework of this approach, it is possible to pursue two main avenues of inquiry. The first one is aimed at constructing a theory of natural language grammars [2]. The second one is associated with the analysis of statistical characteristics of language: since the seminal studies by Zipf [3], primary emphasis has been placed on distributions for separate words in written language [4–8] and for various graphs reflecting language features [9]. The principal result of these studies is a class of distributions describing natural language features [10]; the class comprises power-law (heavy-tail) distributions with various values of exponents: power laws manifest themselves in word frequencies in a language [3], in syntactic networks [5–7, 11], in the frequency of letter sequences in vocabularies [8], and so forth.

Meanwhile, a shift in twentieth-century philosophy (akin to the Copernican revolution in astronomy) suggests considering a language as an integral whole: “a man is just a place where a language speaks itself” [12]. Therefore, since a language is above all a unified communicative tool for conveying meanings, and a text (either a literary work or a tweet) is usually a meaningful, complete message, the text becomes the basic unit for such analysis.

The authors hypothesize that a natural language is a self-organized critical system (SOC) [13, 14] and the texts of a language are “avalanches” (as those are defined by Bak) flowing down the word cooccurrence graph of the respective language; large avalanches correspond to literary works, while smaller ones are associated with messages from social media. It is worth noting that a self-organized critical system conventionally features [13, 14]

(1) a space of elements able to be in two states, active and passive, along with a set of rules to describe how a change of state of an element affects states of the other ones,

(2) the “avalanches” in the space that are chain reactions of elements’ state changes triggered by changes of other elements,

(3) a power-law distribution governing avalanche sizes.

For a language system, a semantic space plays the role of the space at issue, and the rules reduce to the syntactic and semantic rules of the respective language. In the present study the space was formalized as a cooccurrence graph; the authors are aware that there is no one-to-one correspondence between the semantic space and the vocabulary, but they assume that the latter approximates the former. Vertices of the graph correspond to words, and an edge is present if and only if the words associated with its incident vertices occur together in the same text of the sample involved at least once. As indicated above, an avalanche, in this context, is a text of a language, and the hypothesis that the sizes of avalanches obey a power-law distribution forms the subject of the present study. One should mention that real-world SOC systems (unfolding over time) usually exhibit long periods of slow evolution alternating with short periods of fast evolution when the system space changes drastically; a similar phenomenon is reported to take place for evolving language systems.
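The cooccurrence graph defined in this paragraph can be sketched in a few lines. The following is only an illustration under stated assumptions, not the authors' implementation: the function name `cooccurrence_graph`, the lowercase whitespace tokenization, and the toy texts are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(texts):
    """Return an adjacency map: word -> set of words sharing at least one text with it."""
    graph = defaultdict(set)
    for text in texts:
        # Assumption: naive lowercase whitespace tokenization stands in for
        # whatever tokenizer a real study would use.
        words = set(text.lower().split())
        for w in words:
            graph[w]  # make sure every word appears as a vertex
        # An edge joins every pair of distinct words found in the same text.
        for u, v in combinations(sorted(words), 2):
            graph[u].add(v)
            graph[v].add(u)
    return dict(graph)


# Toy sample of three "texts" (hypothetical data).
texts = ["the cat sat", "the dog sat", "a bird"]
graph = cooccurrence_graph(texts)
```

Storing neighbors in sets means that a pair of words occurring together in several texts still yields a single edge, which matches the "once or more" condition in the definition.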

Another point of interest here is the emergence of a gigantic volume of information reflecting, in essence, spoken language (parole as opposed to langue in Saussurean terminology [15]; (second) Orality versus Literacy [16]), that is, texts posted by users in social networks of every sort and kind (Facebook, Reddit, and so on); this makes it possible to explore this domain of human communication. In this context, the primary problem is to compare statistical characteristics calculated from a corpus of literary texts, on the one hand, and from a set of texts written by social network users, on the other. If these characteristics appear to be statistically equal, this may support the idea of the unity of language as a complex system; if so, written and spoken language are merely different projections of the internal dynamics of this complex system (synchronic unity of language). In addition, the characteristic at issue calculated for different time periods (in this case one is restricted to the analysis of written language) can be compared in order to verify the unity of the linguistic system unfolding in time (diachronic unity of language). The present work is focused on the study (and comparison) of statistical characteristics of sets of texts for the English and Russian languages, considered in their synchronic and diachronic aspects.

The first paper to be cited among recent studies of distributions (mainly power-law distributions) observed for oral and written language is a brilliant review [17]. Here, the object of study is a particular text, and its basic unit is a word or a sentence; distributions of characteristics of these objects form the subjects of the overwhelming majority of papers in this line of investigation [18]. For example, Font-Clos et al. [19] examine how the statistical properties of word occurrences depend on text length; the authors reveal that the distribution obeys a power law and investigate its relationship with Zipf’s and Heaps’ (Herdan’s) laws (the latter states that the vocabulary grows as a power function of the text’s length) [20].

Another object of study that generates power-law distributions is a representation of language semantic and syntactic relations using various discrete structures: semantic nets [17], global syntactic dependency trees [21, 22], cooccurrence graphs [18], and others. Both conventional methods aimed at exploring these structures as complex networks [23] and random walks on these structures result in power-law distributions [17, 21, 22, 24, 25]. Here, too, the basic unit is a word or a sentence.

The subject of the present paper is language as a whole; texts (semantic “avalanches”) are considered as its basic units. The rest of the paper is organized as follows. The next section outlines methods used to estimate distribution parameters; the third provides results for both English and Russian languages. The fourth section discusses results; finally, the last section presents conclusions.

#### 2. Methods

The choice of the languages is determined, apart from the availability of voluminous corpora of texts (for written and spoken language) for them, by their qualitative difference in grammatical structure: Russian is an inflected language, while in English inflections are rather rare; Russian is characterized by flexible word order in a sentence, whereas word order in English is strict, and rare exceptions are constrained by stringent rules [26, 27].

We used two different approaches to test the statistical hypothesis in question. The first one, which utilizes the concept of data collapse, is considered in greater detail in the monograph by Pruessner [14]; the second one (based on the Kolmogorov-Smirnov (KS) criterion) is proposed in [28]. Both methods not only estimate the exponents but also cut off the smallest elements of the sample, which usually do not fit a power law. The first method also cuts off the largest nonfitting elements. It is worth noting that this phenomenon (the sharp distinction between the largest [smallest] elements and all others) seems to be a salient characteristic of real-world data following power-law distributions [13, 14].

The first approach we dwell on briefly is that using the concept of data collapse [14]. It assumes that the distribution generating the data has the following probability density function:

$$f(s) = a\, s^{-\tau}\, \mathcal{G}\!\left(\frac{s}{s_c}\right), \qquad s_c = b L^{D}, \tag{1}$$

with $s$, which can be a continuous or discrete random variable, metric factors $a$ and $b$, characteristic dimension of a system $L$, scaling function $\mathcal{G}$, and scaling exponents $\tau$ and $D$ as well. The distribution follows the modified power law within the interval bounded by the lower and upper cutoffs $s_0$ and $s_c$. The scaling function $\mathcal{G}$ (that distinguishes the distribution from a canonical power law) fits many real-world systems obeying heavy-tail distributions [14]. The quantity $s_c$ determines the characteristic upper cutoff.

The raw data is taken to be previously binned, that is, grouped together and averaged over the observations belonging to the same group; we used the exponential binning for its suitability for this kind of data [14].
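Exponential binning, as mentioned above, can be sketched as follows. This is a minimal illustration under assumptions: the bin boundaries are taken as powers of a hypothetical `ratio` parameter, the density is normalized by the continuous bin width, and the function name `exponential_binning` is not from the source.

```python
import math


def exponential_binning(sizes, ratio=2.0):
    """Estimate a density on bins [ratio**k, ratio**(k + 1)), k = 0, 1, ...

    Returns a list of (bin_center, density) pairs for the non-empty bins,
    where bin_center is the geometric mean of the bin boundaries.
    """
    counts = {}
    n = 0
    for s in sizes:
        if s < 1:
            continue  # sizes below the first bin are ignored
        k = 0
        # Find the bin index by comparison rather than by a floating-point
        # logarithm, which can misplace values lying exactly on a boundary.
        while ratio ** (k + 1) <= s:
            k += 1
        counts[k] = counts.get(k, 0) + 1
        n += 1
    binned = []
    for k in sorted(counts):
        lo, hi = ratio ** k, ratio ** (k + 1)
        center = math.sqrt(lo * hi)
        density = counts[k] / (n * (hi - lo))  # share of data per unit of size
        binned.append((center, density))
    return binned


# Toy sizes: one value in [1, 2), two in [2, 4), four in [4, 8).
binned = exponential_binning([1, 2, 3, 4, 5, 6, 7])
```

Because the bin width grows with the size, sparse observations in the tail are pooled into wide bins, which reduces the noise of the density estimate for large sizes.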

If the null hypothesis (that the sample under study is generated from the distribution equation (1)) is true, then $\hat{f}(s)\, s^{\tau}$ plotted against $x$, where $x = s/(b L^{D})$, gives the same function for various values of the characteristic dimension $L$, where $\hat{f}(s)$ is the empirical probability density function of the size $s$, and $\tau$ is the true value of the power exponent. The phenomenon is given the title of data collapse. Thus, for given data generated from the power-law distribution, the curves $\hat{f}(s)\, s^{\tau}$ (as functions of $x$) plotted for various $L$ are superimposed on each other.

Therefore, the respective goodness-of-fit test for power-law distribution involves the following steps:

(1) binning of the raw data;

(2) plotting $\hat{f}(s)\, s^{\tilde{\tau}}$ against $s$ for various $L$ using the “apparent exponent” $\tilde{\tau}$ (a rough estimate of $\tau$); such a plot comprises a nonhorizontal straight line and a characteristic nonlinear curve, whose extremum is called a landmark ($s^{*}$ represents its coordinate);

(3) the refinement of the value of $\tau$ with the employment of the least squares method applied to the landmarks.

As a result, the plots merge into a single horizontal straight line for the section of the domain of definition for which power-law holds true (between the lower and upper cutoffs).
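The rescaling behind data collapse can be illustrated with a small sketch. It assumes the simple-scaling form described in [14], in which a density is a power law with exponent `tau` multiplied by a scaling function of the size divided by `L**D`; the names `collapse` and `toy_density`, the exponential cutoff, and the parameter values are hypothetical and chosen only so that the expected result is known in advance.

```python
import math


def collapse(binned, L, tau, D):
    """Rescale (size, density) pairs to (size / L**D, density * size**tau)."""
    return [(s / L ** D, f * s ** tau) for s, f in binned]


def toy_density(s, L, tau=1.5, D=1.0):
    """Hypothetical unnormalized density: power law with cutoff at L**D."""
    return s ** (-tau) * math.exp(-s / L ** D)


# Evaluate the toy density for two system sizes at the same rescaled
# arguments x = s / L**D, then rescale both curves with the true exponents.
xs = [0.1, 0.5, 1.0, 2.0]
curves = {}
for L in (100.0, 400.0):
    binned = [(x * L, toy_density(x * L, L)) for x in xs]
    curves[L] = collapse(binned, L, tau=1.5, D=1.0)
```

With the correct exponents the two rescaled curves coincide (here both trace the cutoff function itself); with a wrong `tau` or `D` they separate, which is what makes the collapse usable as a check of the hypothesized form.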

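The second approach [28] is applicable to power-law distribution without scaling function:

$$f(s) = C\, s^{-\tau}, \qquad s \ge s_0, \tag{2}$$

with normalization constant $C = 1/\zeta(\tau, s_0)$, where $\zeta(\tau, s_0) = \sum_{k=0}^{\infty} (k + s_0)^{-\tau}$ is the generalized (Hurwitz) zeta function. The method implies that all practicable values of the lower cutoff $s_0$ are considered; for each $s_0$ the estimate $\hat{\tau}$ of the power exponent (with the maximum likelihood principle in mind) is calculated from

$$\frac{\zeta'(\hat{\tau}, s_0)}{\zeta(\hat{\tau}, s_0)} = -\frac{1}{n} \sum_{i=1}^{n} \ln s_i, \tag{3}$$

where the prime denotes differentiation with respect to the first argument and $s_1, \ldots, s_n$ are the observations not smaller than $s_0$; see [28]; for each $s_0$ the Kolmogorov-Smirnov statistic $D_{\mathrm{KS}} = \max_{s \ge s_0} \lvert F(s) - \hat{F}(s) \rvert$ (where $F$ is the cumulative distribution function (CDF) with estimated value of $\tau$ and $\hat{F}$ is the empirical CDF) is calculated. The eventual estimate of $s_0$ minimizes $D_{\mathrm{KS}}$. For real-world data, the function $D_{\mathrm{KS}}(s_0)$ possesses, usually, several local minima; it is often reasonable not to choose a global minimum but the local minimum closest to the lower boundary of the domain of definition, provided a value of the statistic at it does not differ significantly from that at a global minimum.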
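The KS-based procedure described above can be sketched as follows. This is a hedged illustration, not the code used in the study or in [28]: it assumes a discrete power law on integer sizes normalized by the Hurwitz zeta function, approximates that function by direct summation with a tail correction, maximizes the log-likelihood numerically, and simply takes the global minimum of the KS statistic over a hypothetical list of candidate cutoffs. All function names and the synthetic sample are assumptions.

```python
import math
from collections import Counter


def hurwitz_zeta(tau, q, terms=200):
    """Approximate sum over k >= 0 of (q + k)**(-tau) for tau > 1."""
    head = sum((q + k) ** (-tau) for k in range(terms))
    a = q + terms
    # Euler-Maclaurin estimate of the remaining tail of the series.
    return head + a ** (1.0 - tau) / (tau - 1.0) + 0.5 * a ** (-tau)


def fit_tau(tail, s0, lo=1.01, hi=6.0, iters=60):
    """Maximum-likelihood exponent on s >= s0 via golden-section search."""
    n = len(tail)
    sum_log = sum(math.log(s) for s in tail)

    def loglik(tau):
        return -n * math.log(hurwitz_zeta(tau, s0)) - tau * sum_log

    g = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c, d = hi - g * (hi - lo), lo + g * (hi - lo)
        if loglik(c) > loglik(d):
            hi = d  # the maximum lies to the left of d
        else:
            lo = c  # the maximum lies to the right of c
    return 0.5 * (lo + hi)


def ks_statistic(tail, s0, tau):
    """Largest gap between the empirical and the fitted CDF on s >= s0."""
    n = len(tail)
    norm = hurwitz_zeta(tau, s0)
    counts = Counter(tail)
    seen, worst = 0, 0.0
    for v in sorted(counts):
        # Just below v the empirical CDF is still at its previous level.
        below = 1.0 - hurwitz_zeta(tau, v) / norm
        worst = max(worst, abs(below - seen / n))
        seen += counts[v]
        at = 1.0 - hurwitz_zeta(tau, v + 1) / norm
        worst = max(worst, abs(at - seen / n))
    return worst


def select_cutoff(sizes, candidates):
    """Return (s0, tau, ks) with the smallest KS statistic among candidates."""
    best = None
    for s0 in candidates:
        tail = [s for s in sizes if s >= s0]
        if len(tail) < 2:
            continue
        tau = fit_tau(tail, s0)
        ks = ks_statistic(tail, s0, tau)
        if best is None or ks < best[2]:
            best = (s0, tau, ks)
    return best


# Synthetic sample (hypothetical): counts roughly proportional to s**(-2.5).
synthetic = [s for s in range(1, 60) for _ in range(round(10000 * s ** -2.5))]
tau_hat = fit_tau(synthetic, 1)
s0_best, tau_best, ks_best = select_cutoff(synthetic, candidates=range(1, 6))
```

The sketch always returns the global minimum; choosing instead the local minimum closest to the lower boundary, as the text recommends for real-world data, would require comparing the statistic at neighboring candidates and is left out for brevity.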

#### 3. Power-Law Distributions for the English and Russian Languages

To test the null hypothesis in question (that the number of words in the texts obeys a power-law distribution; equation (1) is used to verify data collapse, while the method based on the KS statistic employs equations (2) and (3)), two samples were generated for each language: one on the basis of a corpus of literary texts and one from a set of Reddit messages (or messages from its Russian counterpart, Pikabu). The resulting sample sizes for the English language are 9820 (literary works), 5016 (Reddit), and 14836 (joint sample); for the Russian language they are 12683 (literary works), 6005 (Pikabu), and 18688 (joint sample). For the method based on the concept of data collapse, the size of the vocabulary used to generate the texts serves as the characteristic dimension of the system, $L$. To obtain samples for various $L$, one resamples down the initial sample by randomly deleting a given fraction of the words from the complete vocabulary and then from all the texts used. This generates new samples corresponding to a proportionally smaller characteristic dimension.
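The vocabulary resampling described above might look as follows in code. This is a sketch under assumptions: texts are taken to be already tokenized into lists of words, the retained share of the vocabulary is a free parameter (`keep_fraction`, a hypothetical name), and texts that lose all their words are dropped from the resulting sample.

```python
import random


def resample_texts(texts, keep_fraction, seed=0):
    """Delete a random share of the vocabulary and recount text lengths.

    texts: list of token lists. Returns (reduced vocabulary size, list of
    the lengths of the texts that still contain at least one word).
    """
    rng = random.Random(seed)  # fixed seed for reproducibility of the sketch
    vocab = sorted({w for t in texts for w in t})
    kept = set(rng.sample(vocab, int(round(keep_fraction * len(vocab)))))
    lengths = [sum(1 for w in t if w in kept) for t in texts]
    return len(kept), [n for n in lengths if n > 0]


# Toy tokenized sample (hypothetical data): five distinct words in total.
texts = [["a", "b", "c", "a"], ["b", "d"], ["e"]]
full_size, full_lengths = resample_texts(texts, 1.0)
reduced_size, reduced_lengths = resample_texts(texts, 0.6)
```

Repeating the call with several values of `keep_fraction` yields one sample of text lengths per vocabulary size, which is the input a data-collapse check needs.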

Figure 1 plots, on a double-logarithmic scale, the number of words in a text against the rank of the text in the sample. Figures 1(a), 1(b), and 1(c) correspond to the joint sample, to the sample constructed on the basis of literary works, and to poetry works for the English language, respectively; Figures 1(d), 1(e), and 1(f) exhibit the same dependence for the Russian language, respectively. The dashed straight line in each subfigure corresponds to a power-law distribution with the exponent estimated using data collapse.
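The points of such a rank-size plot can be computed as in the following sketch. It is illustrative only: the function names are hypothetical, and the least-squares slope is included merely as a quick visual aid, not as the estimator used in the paper (which relies on data collapse and on the KS-based method). For a density with exponent τ, the rank-size line has slope −1/(τ − 1), which follows from the tail probability scaling as the size to the power −(τ − 1).

```python
import math


def rank_size_points(sizes):
    """Return (log10 rank, log10 size) pairs, largest text first."""
    ordered = sorted((s for s in sizes if s > 0), reverse=True)
    return [(math.log10(r), math.log10(s)) for r, s in enumerate(ordered, start=1)]


def loglog_slope(points):
    """Ordinary least-squares slope of log size against log rank."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return sxy / sxx


# Toy sizes (hypothetical data) that follow size = 1000 / rank exactly.
sizes = [1000.0 / r for r in range(1, 51)]
points = rank_size_points(sizes)
slope = loglog_slope(points)
```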