Mathematical Problems in Engineering

Volume 2015, Article ID 626408, 14 pages

http://dx.doi.org/10.1155/2015/626408

## New Statistical Randomness Tests Based on Length of Runs

^{1}Institute of Applied Mathematics, Middle East Technical University, 06800 Ankara, Turkey

^{2}Mathematics Department, Atılım University, 06836 Ankara, Turkey

Received 27 September 2014; Accepted 17 March 2015

Academic Editor: Anna Vila

Copyright © 2015 Ali Doğanaksoy et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Random sequences and random numbers constitute a necessary part of cryptography. Many cryptographic protocols depend on random values. Randomness is measured by statistical tests and hence the security evaluation of a cryptographic algorithm depends heavily on statistical randomness tests. In this work we focus on the statistical distributions of runs of lengths one, two, and three. Using these distributions we state three new statistical randomness tests. The new tests use exact distributions and, therefore, exact values of probabilities are needed. Probabilities associated with runs of lengths one, two, and three are stated. The corresponding probabilities are divided into five subintervals of equal probability. Accordingly, three new statistical tests are defined and pseudocodes for these tests are given. The new statistical tests are designed to detect deviations in the number of runs of various lengths from that of a random sequence. Together with some other statistical tests, we analyse our tests’ results on outputs of well-known encryption algorithms and on binary expansions of $e$, $\pi$, and $\sqrt{2}$. Experimental results show the performance and sensitivity of our tests.

#### 1. Introduction

Random numbers and random sequences are extensively used in many areas such as *game theory, numerical analysis, quantum mechanics,* and *cryptography*. In cryptography, the need for random sequences emerges in many different applications such as *challenge and response authentication systems, generation of digital signatures,* and *zero-knowledge protocols*. Among those, the most important application is key generation, which depends heavily on random values. Use of weak random values in key generation can cause a leakage in the system, and hence an adversary can gain the ability to break the whole cryptosystem. Therefore, randomness testing is an essential part of the security evaluation of a cryptographic algorithm.

Random sequences and random numbers can be generated by true random sources, such as atmospheric noise and radioactive decay. However, using these sources in an algorithm is impractical. Transmitting and storing large numbers of random bits is challenging, since reproducing the outputs of these sources is nearly impossible. Therefore, sequences and numbers used as keys in cryptographic algorithms such as block ciphers and synchronous stream ciphers should be pseudorandom, that is, *random looking* sequences of a specific length which are produced by deterministic processes [1]. Since proving the randomness of these generators mathematically is nearly impossible, we use statistical randomness tests for this purpose. Using statistical tests we try to detect the weaknesses that a generator could have.

Moreover, outputs of encryption algorithms should be indistinguishable from random mappings; that is, they should be random looking. This is another place where pseudorandom sequences play an important role. Also, deciding the number of rounds of a block cipher, which is an essential part of its design, is closely associated with the concept of being random looking. Therefore, the security of the system depends heavily on the production or testing of pseudorandom sequences. For these reasons, statistical randomness tests are considered an important part of evaluating the security of cryptographic algorithms.

Statistical tests are designed to test the null hypothesis $H_0$, which states that the sequence is randomly generated. Testing a binary sequence means that its degree of randomness is evaluated by a statistical test. The conclusion that the sequence is random or not is probabilistic; in other words, the hypothesis is either *accepted* or *rejected*. A statistical test considers a random variable whose distribution function is known. Depending on the distribution, a real number between 0 and 1, called the $p$-value, is calculated. If the $p$-value of the sequence is evaluated as one, we say that the sequence is completely random. On the other hand, the sequence is completely nonrandom if the $p$-value is determined as zero. If the $p$-value exceeds a predefined real number $\alpha$, then $H_0$ is accepted; otherwise, it is rejected.

Usually the result of one statistical test is not enough to decide the randomness of a sequence. Therefore, it is better to use a collection of statistical tests, called a statistical test suite, to measure different behaviours of the sequence under consideration. These suites should be well designed to give trustworthy results and should not be blindly populated.

In the literature, there exist various statistical test packages. Among those, the most important ones are the tests given in Knuth’s book [2], the test suite presented by Rukhin [3], DIEHARD [4], CRYPT-X [5], TestU01 [6], and the test suite published by NIST [7]. There are also works focusing on individual statistical tests, such as the universal statistical test stated by Maurer [8], a test based on the diffusion characteristic of a block cipher [9], and the topological binary test defined by Alcover et al. [10].

In this work, we propose three new statistical randomness tests which depend on the famous postulates of Golomb. These tests are named the runs of length one, runs of length two, and runs of length three tests. The rest of the paper is organized as follows. In Section 2, we explain Golomb’s randomness postulates and discuss the run tests given in the literature. In Section 3, we give proofs of our fundamental theorems. Also, in order to calculate the probabilities needed, we state corollaries and algorithms for each theorem. In Section 4, we state the new run tests and give their pseudocodes. In Section 5, we apply the new tests to the binary expansions of $e$, $\pi$, and $\sqrt{2}$, which are obtained from the NIST package [7], and to outputs of the five Advanced Encryption Standard competition finalists. In the last part of the implementation we generate some nonrandom data sets to emphasize the sensitivity of our tests. Finally, in Section 6, we summarize our results and state topics for further research.

#### 2. Preliminaries

##### 2.1. Golomb’s Randomness Postulates

Deciding the pseudorandomness of a sequence is a difficult task. The basis for this task is provided by Golomb’s postulates. These postulates are one of the most important attempts to establish necessary properties for a finite (or periodic) pseudorandom sequence to be random looking. Sequences satisfying the following three properties are called *pseudonoise sequences* [11].

Let $S = s_1, s_2, s_3, \ldots$ be an infinite binary sequence periodic with period $n$ (or a finite sequence of length $n$). A run is defined as an uninterrupted maximal sequence of identical bits. Runs of 0’s are called *gaps*; runs of 1’s are called *blocks*. R1, R2, and R3 are Golomb’s randomness postulates, which are given as follows.

(R1) In a period of $S$, the number of 1’s should differ from the number of 0’s by at most 1. In other words, the sequence should be *balanced*.

(R2) In a period of $S$, at least half of the total number of runs of 0’s or 1’s should have length one, at least one-fourth should have length 2, at least one-eighth should have length 3, and the like. Moreover, for each of these lengths, there should be (almost) equally many gaps and blocks.

(R3) The autocorrelation function $C(t)$ should be two-valued. That is, for some integer $K$ and for all $t$,

$$C(t) = \sum_{i=1}^{n} (-1)^{s_i + s_{i+t}} = \begin{cases} n, & t = 0, \\ K, & 1 \leq t \leq n-1. \end{cases}$$

The first postulate states that, in an $n$-bit sequence, the difference between the number of ones and the number of zeros should be 1 or 0. In other words, the number of ones in a sequence, that is, the weight of the sequence, should be approximately $n/2$. The frequency test, which measures the difference between the numbers of ones and zeros in an $n$-bit sequence, is defined to check the first postulate of Golomb. Balancedness is a fundamental feature of an algorithm’s output. Therefore, the frequency test is used as an initial step in almost all test suites. If an algorithm fails the frequency test, then other tests are not even applied.

The second postulate of Golomb concerns the number of runs in sequences. Tests that deal with the number of runs are called run tests; like the frequency test, they are included in many test suites. Since calculating the expected number of runs of a specified length in a random sequence is a difficult task (especially when the specified length becomes large), most test suites consider only the total number of runs and do not consider the number of runs of different lengths.
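The quantities behind the first two postulates can be computed directly from a bit string. The following is a minimal sketch, assuming a finite (non-cyclic) sequence given as a list of 0/1 integers; the function names and the 15-bit sample are illustrative choices, not taken from the paper.

```python
from collections import Counter
from itertools import groupby


def run_lengths(bits):
    """Return one (bit, length) pair per maximal run of identical bits."""
    return [(b, len(list(g))) for b, g in groupby(bits)]


def is_balanced(bits):
    """R1: the numbers of 1's and 0's differ by at most 1."""
    ones = sum(bits)
    zeros = len(bits) - ones
    return abs(ones - zeros) <= 1


def run_length_counts(bits):
    """Count runs by length, separately for gaps (0's) and blocks (1's)."""
    gaps, blocks = Counter(), Counter()
    for b, length in run_lengths(bits):
        (blocks if b == 1 else gaps)[length] += 1
    return gaps, blocks


# Hypothetical 15-bit sample input, treated as a finite sequence.
example = [int(c) for c in "000100110101111"]
runs = run_lengths(example)
balanced = is_balanced(example)
gaps, blocks = run_length_counts(example)
```

For this sample, `runs` has eight entries, four of which have length one, which is the kind of proportion the second postulate asks for.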

Lastly, the third postulate gives information about the amount of similarity between the sequence and a shifted version of it. If $S$ is a random looking sequence, the autocorrelation should be constant; that is, the correlation between the bits $s_i$ and $s_{i+t}$ should give no information about the sequence for $t \neq 0$. In this paper, we mainly focus on the first and second postulates, and the last one is not a matter of concern.
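As an illustration of the third postulate, the sketch below evaluates a periodic autocorrelation for every shift of one period. It assumes the common definition $C(t) = \sum_{i} (-1)^{s_i + s_{i+t}}$ with indices taken modulo the period; the function name and the 15-bit input (one period of a maximal-length LFSR sequence) are illustrative assumptions.

```python
def autocorrelation(bits, t):
    """Periodic autocorrelation: +1 for each agreeing pair, -1 otherwise."""
    n = len(bits)
    return sum(1 if bits[i] == bits[(i + t) % n] else -1 for i in range(n))


# One period of the LFSR sequence with recurrence s[i+4] = s[i+1] XOR s[i].
example = [int(c) for c in "000100110101111"]
values = [autocorrelation(example, t) for t in range(len(example))]
two_valued = len(set(values)) == 2
```

Here `values[0]` equals the period and every other shift gives the same constant, so the function is two-valued in the sense of R3.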

These postulates are theoretical, but difficult to check. Inspired by these postulates, we define new statistical randomness tests which are practical. In order to give the definitions, we calculate the exact probabilities. Before explaining these tests, first we give the mathematical background in order to compute the probabilities that we use in the following Section 3.

##### 2.2. Run Test

Run tests depend on Golomb’s second postulate and investigate the number of runs in a sequence and their distribution. Run tests take place in most of the test suites. In almost all of these suites, the run tests consider only the total number of runs in a sequence. The most important of these are the suites given in [2, 4, 6, 7].

Knuth [2] and DIEHARD [4] test suites define the run test on random numbers. They define runs as *runs up* and *runs down* in a sequence. To illustrate their definition, consider a sequence of length 10, $1\;2\;9\;8\;5\;3\;6\;7\;0\;4$. Runs are indicated by putting a vertical line between $x_j$ and $x_{j+1}$ when $x_j > x_{j+1}$. Hence, the runs of the sequence can be seen as $|\,1\;2\;9\,|\,8\,|\,5\,|\,3\;6\;7\,|\,0\;4\,|$. In other words, the run test examines the lengths of monotone subsequences. TestU01 [6] defines *run and gap tests* for testing the randomness of a long binary stream of length $n$. This test collects runs of 1’s and 0’s until the total number of runs reaches a prescribed count. Then, for each length $j$, the number of runs of 1’s and 0’s of length $j$ in this collection is counted and recorded, and a $\chi^2$ test is applied to these counts. A *longest run of 1’s test* is also defined for the collection of strings of length $L$ which are obtained from the original long binary string of length $n$.
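The "runs up" decomposition described above can be sketched in a few lines. The function name is hypothetical, and the ten digits are the illustrative sequence from Knuth's discussion of the run test, used here as assumed sample input.

```python
def ascending_runs(xs):
    """Split xs into maximal ascending runs.

    A cut is placed between x_j and x_{j+1} whenever x_j > x_{j+1}.
    """
    if not xs:
        return []
    runs, current = [], [xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        if prev > cur:
            runs.append(current)
            current = [cur]
        else:
            current.append(cur)
    runs.append(current)
    return runs


digits = [1, 2, 9, 8, 5, 3, 6, 7, 0, 4]
runs_up = ascending_runs(digits)
run_up_lengths = [len(run) for run in runs_up]
```

A run test of this kind then compares the observed run lengths in `run_up_lengths` with their expected distribution.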

The NIST [7] test suite initially consisted of 16 and later of 15 statistical tests. After its first publication, some revisions were made. In 2004, it was discovered that the settings of the discrete Fourier transform test and the Lempel-Ziv test were wrong [12]; a new test, which can be used instead of the Lempel-Ziv test, was defined in [13], and a correction of the overlapping template matching test was stated in 2007 [14].

In the suite, 2 of the 15 tests are variations of run tests. They are called the run test and the longest run of ones in a block test. The first one deals with the total number of runs in a sequence. It calculates the total number of runs and determines whether it is consistent with the expected number of runs, which is supposed to be close to $n/2$ in a sequence of length $n$. The second one determines whether the longest run of ones in the sequence is consistent with the length of the longest run of ones that is expected in a random sequence. In the NIST test suite the reference distribution for the run tests is a $\chi^2$ distribution.
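For comparison with the exact-distribution approach of this paper, the total-runs test can be sketched as follows. This is a minimal sketch based on the description in NIST SP 800-22, not code from the paper or from the NIST package; the function name is hypothetical. The $p$-value is obtained with the complementary error function, which is equivalent to using a $\chi^2$ distribution with one degree of freedom for the squared statistic.

```python
import math


def nist_runs_test(bits):
    """Approximate p-value of the total-runs test for a list of 0/1 bits."""
    n = len(bits)
    pi = sum(bits) / n  # proportion of ones
    # Prerequisite frequency check: the test is not applicable if it fails.
    if abs(pi - 0.5) >= 2 / math.sqrt(n):
        return 0.0
    denominator = 2 * math.sqrt(2 * n) * pi * (1 - pi)
    if denominator == 0:
        return 0.0
    # Total number of runs = number of bit changes + 1.
    v_obs = 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    return math.erfc(abs(v_obs - 2 * n * pi * (1 - pi)) / denominator)


sample = [int(c) for c in "1001101011"]
p_value = nist_runs_test(sample)
```

Because the reference distribution is asymptotic, such a $p$-value is only meaningful for long sequences; the 10-bit input above merely exercises the arithmetic.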

In its test suite, NIST assumed that the sequence length $n$ is of order $10^3$ to $10^7$. For this reason, asymptotic reference distributions were derived and used for the tests. However, asymptotic reference distributions are misleading for smaller values of $n$; as stated in [7], “the asymptotic reference distributions would be inappropriate and would need to be replaced by exact distributions that would commonly be difficult to compute”. In other words, asymptotic reference distributions can lead to errors in testing short sequences such as outputs of block ciphers or hash functions. In 1999, to overcome this problem, Soto and Bassham [15] proposed to concatenate short sequences. This method was used for testing the randomness of the Advanced Encryption Standard candidates. Another method has been proposed by Sulak et al. [16], in which the distribution functions used in the NIST test suite are replaced by exact distributions and a similar method is used for producing the $p$-values.

In this paper we use the method stated in [16]; thus we need the exact probabilities and the exact distributions of the test statistics. Finding the number of sequences having a specified number of runs of length $k$ is a hard problem. We find this number using combinatorial formulas. After that we calculate the desired probabilities by dividing the calculated number by the total number of sequences of length $n$. Calculating the exact probabilities of the number of runs of length $k$ in a sequence enables us to define the new run tests. We calculate the probabilities for the number of runs of lengths one, two, and three, and we give the detailed information in the following section. However, as the length grows, the calculations become more complex and the time required for them grows exponentially. Therefore tests involving the number of runs of larger lengths are impractical for statistical test suites.

#### 3. Computation of Probabilities

In this section, we give the theorems to find the number of sequences with specified properties and hence state the exact probabilities. The probabilities depend on the number of existing shorter runs. That is, the probabilities for the number of runs of length two depend on both the total number of runs and the number of runs of length one; similarly, the number of runs of length three depends on the total number of runs and the numbers of runs of lengths one and two, and so on. Since they depend on other variables, these probabilities are not directly used in the tests. Therefore, after stating each theorem we give the corollaries and the algorithms to find the exact probabilities which are needed for describing the tests.

In the calculations of probabilities we frequently use the following combinatorial formulas.

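*Fact 1* (number of nonnegative integer solutions of linear equation [17])*.* The number of nonnegative integer solutions of $x_1 + x_2 + \cdots + x_r = n$, $x_i \geq 0$, is $\binom{n+r-1}{r-1}$.

*Fact 2*. The number of positive integer solutions of $x_1 + x_2 + \cdots + x_r = n$, $x_i \geq 1$, is $\binom{n-1}{r-1}$.

*Proof. *With the substitution $y_i = x_i - 1$ we get

$$y_1 + y_2 + \cdots + y_r = n - r, \quad y_i \geq 0.$$

From Fact 1 it follows that the number of solutions is

$$\binom{(n-r)+r-1}{r-1} = \binom{n-1}{r-1}.$$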
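Both facts are standard stars-and-bars counts and can be checked by exhaustive enumeration for small parameters. The sketch below is only a sanity check; the helper name and the values $n = 7$, $r = 3$ are arbitrary illustrative choices.

```python
from itertools import product
from math import comb


def count_solutions(n, r, lower):
    """Brute-force count of integer solutions of x_1 + ... + x_r = n, x_i >= lower."""
    return sum(
        1 for xs in product(range(lower, n + 1), repeat=r) if sum(xs) == n
    )


n, r = 7, 3
nonnegative = count_solutions(n, r, 0)  # Fact 1 predicts comb(n + r - 1, r - 1)
positive = count_solutions(n, r, 1)     # Fact 2 predicts comb(n - 1, r - 1)
fact1_holds = nonnegative == comb(n + r - 1, r - 1)
fact2_holds = positive == comb(n - 1, r - 1)
```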

##### 3.1. Number of Runs

In the rest of the paper we denote the total number of runs and the numbers of runs of lengths one, two, and three by $R$, $R_1$, $R_2$, and $R_3$, and we write $r$, $r_1$, $r_2$, and $r_3$, respectively, for samples of these variables. We denote the probability that a randomly chosen binary sequence has $r$ runs by $\Pr(R = r)$. In the same way, $\Pr(R_i = r_i)$ is the probability that a randomly chosen binary sequence has $r_i$ runs of length $i$. We also use subscripts to differentiate the blocks of a long sequence or the outputs of block ciphers and hash functions. Lastly, when several sequences are tested, the numbers of runs of lengths one, two, and three are collected into three sets, one per run length, whose $i$th element is the number of runs of that length in the $i$th sequence.

Moreover, in order to illustrate the runs of a sequence we use the equation $x_1 + x_2 + \cdots + x_r = n$ for a sequence of length $n$ having $r$ runs, where $x_i \geq 1$ represents the number of bits in the $i$th run. An important property of this illustration is that it gives no information about the content of the $x_i$’s; that is, $x_i$ can be a run of 0’s or 1’s. Thus, each positive integer solution of the equation corresponds to two sequences: one starts with 1 and the other starts with 0. Hence, the number of sequences of length $n$ having exactly $r$ runs is $2\binom{n-1}{r-1}$ by Fact 2.

*Example 1. *Let $S$ be a binary sequence of length 32 and having 15 runs. Then, its run lengths satisfy

$$x_1 + x_2 + \cdots + x_{15} = 32, \quad x_i \geq 1.$$

Probabilities are calculated in a similar way as in [16]. The main difference is that, in the previous approach, sequences are viewed in a circular form. Probabilities depend on weight of the sequence and parity of number of runs. We calculate the probabilities with the above notation, which is not based on circular form, and they depend on the number of runs and number of shorter runs.

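Theorem 2. *Let $S$ be a binary sequence of length $n$ having a total of $r$ runs; then*

$$\Pr(R = r) = \frac{\binom{n-1}{r-1}}{2^{n-1}}.$$

*Proof. *We can illustrate the sequence of length $n$, having $r$ runs, as follows:

$$x_1 + x_2 + \cdots + x_r = n, \quad x_i \geq 1.$$

From Fact 2 the number of all binary sequences of length $n$, having a total of $r$ runs, is $2\binom{n-1}{r-1}$. Since there are $2^n$ sequences, the probability that a randomly chosen sequence has exactly $r$ runs is

$$\Pr(R = r) = \frac{2\binom{n-1}{r-1}}{2^{n}} = \frac{\binom{n-1}{r-1}}{2^{n-1}}.$$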
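The counting argument gives $\Pr(R = r) = \binom{n-1}{r-1}/2^{n-1}$, since $2\binom{n-1}{r-1}$ of the $2^n$ sequences have exactly $r$ runs. The following sketch compares this closed form with an exhaustive count for a small length; the function names are hypothetical and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product
from math import comb


def count_runs(bits):
    """Total number of runs = number of adjacent bit changes + 1."""
    return 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)


def prob_total_runs(n, r):
    """Closed form: Pr(R = r) = C(n-1, r-1) / 2^(n-1), for 1 <= r <= n."""
    return Fraction(comb(n - 1, r - 1), 2 ** (n - 1))


def brute_force_prob(n, r):
    """Exhaustive check over all 2^n sequences (small n only)."""
    hits = sum(1 for bits in product((0, 1), repeat=n) if count_runs(bits) == r)
    return Fraction(hits, 2 ** n)


n = 10
matches = all(prob_total_runs(n, r) == brute_force_prob(n, r) for r in range(1, n + 1))
p_32_15 = prob_total_runs(32, 15)  # length 32 with 15 runs
```

`float(p_32_15)` gives the probability for a 32-bit sequence with 15 runs as a decimal.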

##### 3.2. Number of Runs of Length One

In this section, the probabilities for an $n$-bit sequence having $r_1$ runs of length one are given using a combinatorial approach. We use the illustration defined in Section 3.1 to compute the number of sequences having a total of $r$ runs, $r_1$ of which are of length one, and hence we calculate the probabilities. Then we state the first new run test, based on the idea of Golomb’s second postulate, in the next section.

Theorem 3. *The probability of a randomly chosen binary sequence of length $n$ having a total of $r$ runs, $r_1$ of which are runs of length one, is*

$$\Pr(R = r, R_1 = r_1) = \frac{\binom{r}{r_1}\binom{n-r-1}{r-r_1-1}}{2^{n-1}}.$$

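*Proof. *As in the proof of Theorem 2, we illustrate the sequence as follows:

$$x_1 + x_2 + \cdots + x_r = n, \quad x_i \geq 1. \tag{9}$$

Let us first assume that the last $r_1$ runs are the runs of length one and the rest are of length at least two. That is,

$$x_1 + x_2 + \cdots + x_{r-r_1} + \underbrace{1 + \cdots + 1}_{r_1} = n, \quad x_i \geq 2. \tag{10}$$

Notice that, here, $x_i \geq 2$, so we use the change of variable $y_i = x_i - 2$ for $i = 1, \ldots, r - r_1$. Consider

$$y_1 + y_2 + \cdots + y_{r-r_1} = n - 2r + r_1, \quad y_i \geq 0. \tag{11}$$

The number of sequences satisfying the conditions stated above is equal to the number of nonnegative solutions of (11). Consequently, by Fact 1, the number of desired solutions is

$$\binom{(n-2r+r_1)+(r-r_1)-1}{r-r_1-1} = \binom{n-r-1}{r-r_1-1}.$$

Selecting which $r_1$ of the $r$ runs have length one gives a factor of $\binom{r}{r_1}$. Since each positive integer solution of (9) corresponds to two sequences (one starts with 1; the other starts with 0), a factor of 2 also appears. Therefore, the number of all binary sequences of length $n$ having a total of $r$ runs, $r_1$ of which are of length one, is equal to $2\binom{r}{r_1}\binom{n-r-1}{r-r_1-1}$. Hence the probability that a randomly chosen sequence has exactly $r$ runs, $r_1$ of which are of length one, is

$$\Pr(R = r, R_1 = r_1) = \frac{2\binom{r}{r_1}\binom{n-r-1}{r-r_1-1}}{2^{n}} = \frac{\binom{r}{r_1}\binom{n-r-1}{r-r_1-1}}{2^{n-1}}.$$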
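The joint count $2\binom{r}{r_1}\binom{n-r-1}{r-r_1-1}$ obtained from this argument can be checked against exhaustive enumeration for a small length. The sketch below uses hypothetical function names. One labeled assumption: the degenerate case $r = r_1 = n$ (the two alternating sequences) is handled explicitly, because `math.comb` does not accept the negative arguments that the closed form produces there.

```python
from collections import Counter
from itertools import groupby, product
from math import comb


def count_joint(n, r, r1):
    """Number of n-bit sequences with r runs, r1 of which have length one."""
    if not (1 <= r <= n and 0 <= r1 <= r):
        return 0
    if r1 == r or r == n:
        # All runs have length one only for the two alternating sequences.
        return 2 if r1 == r == n else 0
    return 2 * comb(r, r1) * comb(n - r - 1, r - r1 - 1)


def brute_force_joint(n):
    """Tally (total runs, runs of length one) over all 2^n sequences."""
    tally = Counter()
    for bits in product((0, 1), repeat=n):
        lengths = [len(list(g)) for _, g in groupby(bits)]
        tally[(len(lengths), lengths.count(1))] += 1
    return tally


n = 10
tally = brute_force_joint(n)
agrees = all(
    count_joint(n, r, r1) == tally[(r, r1)]
    for r in range(1, n + 1)
    for r1 in range(0, r + 1)
)
```

Dividing `count_joint(n, r, r1)` by `2 ** n` gives the joint probability for the chosen $n$, $r$, and $r_1$.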

The number of sequences having $r$ runs, $r_1$ of which are of length one, can be found using the formula above. Our aim is to compute the total number of sequences of length $n$ having $r_1$ runs of length one without depending on the total number of runs. In order to compute these probabilities we use Corollary 4.

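Corollary 4. *Let $N_{r_1}$ denote the number of sequences of length $n$ with exactly $r_1$ runs of length one. Then,*

$$N_{r_1} = \sum_{r} 2\binom{r}{r_1}\binom{n-r-1}{r-r_1-1},$$

*where the sum runs over all possible total numbers of runs $r$. Since the number of all sequences of length $n$ is $2^n$, probabilities follow immediately:*

$$\Pr(R_1 = r_1) = \frac{N_{r_1}}{2^{n}}.$$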

Moreover, using Algorithm 1 we calculate the probabilities for a sequence of length $n$ having $r_1$ runs of length one, so that we can investigate the number of runs of length one independently of the total number of runs.
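The marginal distribution of the number of runs of length one can be obtained by summing the joint counts over all possible total numbers of runs. The following is a minimal sketch of such a computation; it is not the paper's Algorithm 1, the function names are hypothetical, and the alternating-sequence case $r = r_1 = n$ is treated separately as an explicit assumption about the degenerate term.

```python
from fractions import Fraction
from math import comb


def count_joint(n, r, r1):
    """Number of n-bit sequences with r runs, r1 of which have length one."""
    if not (1 <= r <= n and 0 <= r1 <= r):
        return 0
    if r1 == r or r == n:
        # Only the two alternating sequences consist solely of length-one runs.
        return 2 if r1 == r == n else 0
    return 2 * comb(r, r1) * comb(n - r - 1, r - r1 - 1)


def length_one_distribution(n):
    """Exact Pr(R_1 = r1) for r1 = 0..n, summed over the total number of runs."""
    total = 2 ** n
    return [
        Fraction(sum(count_joint(n, r, r1) for r in range(1, n + 1)), total)
        for r1 in range(n + 1)
    ]


dist = length_one_distribution(32)
```

Entry `dist[k]` is the exact probability that a random 32-bit sequence has `k` runs of length one; `float(dist[k])` converts it to a decimal for use in a test.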