Four Years Remaining

What are pseudorandom numbers and what are they used for?

Posted by Swen 25.08.2009 No Comments

There seems to be a wide split of opinions between theoreticians and practitioners. Namely, some theoreticians define pseudorandomness in terms of complexity theory. As a result, the corresponding constructions are inherently slow and thus rejected by practitioners. This short blog post tries to address some common misconceptions.

A random number generator by xkcd

Formally, a pseudorandom generator is a deterministic function $f:\mathcal{S}\to\mathcal{R}$ which takes a (relatively short) seed $s$ and converts it into an element of $\mathcal{R}$ . In most cases, we as computer scientists are interested in functions $f:\{0,1\}^n\to\{0,1\}^\ell$ . However, there are other reasonable domains, too. For instance, when performing Monte-Carlo integration in the region $[0,1]$ , we would be interested in the functions of type $f:\{0,1\}^n\to[0,1]^\ell$ .

It is important to note that any function of type $f:\mathcal{S}\to\mathcal{R}$ is a pseudorandom generator. However, not all of those functions are equally good for all purposes. There are two main properties that are used to discriminate between various pseudorandom generators. First, we can talk about the efficiency of a pseudorandom generator, i.e., how fast we can compute $f(s)$. Second, we can talk about the statistical properties of the output $f(s)$.
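To make the point concrete, here is a minimal Python sketch of two functions that both fit the definition. The names `constant_generator` and `lcg_generator` are made up for this illustration, and the multiplier and increment in the second one are one commonly quoted choice for a 32-bit linear congruential generator, not anything prescribed by the definition.

```python
from typing import List


def constant_generator(seed: int, length: int) -> List[float]:
    """A valid but useless generator: it ignores the seed entirely."""
    return [0.5] * length


def lcg_generator(seed: int, length: int) -> List[float]:
    """Expand a seed into `length` numbers in [0, 1) with a linear congruential step.

    The constants are an illustrative assumption (a widely quoted 32-bit choice).
    """
    modulus = 2 ** 32
    state = seed % modulus
    output = []
    for _ in range(length):
        state = (1664525 * state + 1013904223) % modulus
        output.append(state / modulus)
    return output
```

Both functions are deterministic and fast. They differ only in the statistical properties of their output, which is exactly the second property above.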

Before going into the details, let us establish why anyone should use pseudorandom generators at all. There is an interesting controversy in computer science, as well as in statistics. Many algorithms assume that it is possible to draw uniformly distributed numbers. For example, Monte-Carlo integration is based on the law of large numbers:

$\Pr\left[\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^N g(x_i)=\int_0^1 g(x)dx\right] = 1$

whenever $x_1,\ldots,x_N$ are drawn independently and uniformly from the range $[0,1]$. Similarly, one can provide theoretical bounds for the randomized version of quicksort, provided that we can draw elements uniformly from a set $\{1,\ldots,K\}$. However, computers are mostly made of deterministic parts, and it turns out to be really difficult to automatically collect uniformly distributed bits. Designing a device that solves this problem is far from trivial.
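The Monte-Carlo estimate from the law of large numbers can be sketched in a few lines of Python. This is only an illustration: the function name `monte_carlo_integrate` is hypothetical, and the points $x_i$ come from Python's built-in seeded generator rather than from truly uniform samples.

```python
import random
from typing import Callable


def monte_carlo_integrate(g: Callable[[float], float], n: int, seed: int) -> float:
    """Estimate the integral of g over [0, 1] as the average of g at n sample points."""
    rng = random.Random(seed)  # deterministic: the same seed gives the same points
    total = 0.0
    for _ in range(n):
        total += g(rng.random())  # rng.random() returns a float in [0.0, 1.0)
    return total / n


# Integrate g(x) = x^2; the exact value of the integral is 1/3.
estimate = monte_carlo_integrate(lambda x: x * x, 100_000, seed=2009)
print(estimate)
```

Note that the whole computation is a deterministic function of the seed, so any guarantee about the result rests on the quality of the generator.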

The first official solution to the problem of obtaining random numbers was proposed in 1927, when a student of Karl Pearson published a table of random numbers. Later such tables were built by the RAND Corporation. That is, the function $f$ was explicitly specified through a big table. Of course, such a table is useful only if it can be used as a "reliable" source of random numbers. In particular, the value of

$\frac{1}{N}\sum_{i=1}^N g(x_i)$

should be as close to $\int_0^1 g(x)dx$ as possible. Since there are infinitely many functions $g$, we cannot actually check this condition for all of them. Instead, statisticians performed a series of tests on the table to verify that the sequence $x_1,x_2,\ldots$ looks as close to random as possible. If we extend this concept properly, we get the modern quantification of pseudorandomness.
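As an illustration of what one such test might look like, the following Python sketch computes a chi-square frequency statistic for a sequence of decimal digits. It is a textbook frequency test chosen for this example, not necessarily one of the tests that were historically applied to the tables, and the function name is made up.

```python
from collections import Counter
from typing import Sequence


def chi_square_digit_statistic(digits: Sequence[int]) -> float:
    """Chi-square statistic comparing digit counts with the uniform expectation.

    With 10 possible digits the statistic has 9 degrees of freedom; values
    above roughly 16.9 are suspicious at the 5% significance level.
    """
    expected = len(digits) / 10
    counts = Counter(digits)
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))
```

A perfectly balanced sequence gives a statistic of 0, while a sequence that repeats a single digit gives a very large value. Passing such a test says nothing about other patterns, which is why a whole battery of tests is needed.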

Formally, a function $f:\mathcal{S}\to\mathcal{R}$ is a $(t,\varepsilon)$ -secure pseudorandom generator if for any $t$ -time algorithm $A$ :

$|\Pr [r\gets\mathcal{R}: A(r)=1]-\Pr[s\gets\mathcal{S}: A(f(s))=1]|\leq \varepsilon$

where the probabilities are taken over the uniform choices of $s$ and $r$ . In more intuitive terms, if you replace the randomness $r$ with $f(s)$ and your algorithm runs in time $t$ , then the switch generates no easily observable discrepancies.
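The definition can be explored empirically. The sketch below estimates the left-hand side of the inequality for one deliberately weak generator and one very cheap algorithm $A$. Everything here is an assumption made for illustration: `low_bit_lcg` outputs the lowest bit of a 32-bit linear congruential generator, `alternates` plays the role of $A$, and the operating system's randomness (via `secrets`) stands in for the uniform choices of $r$ and $s$.

```python
import secrets
from typing import Callable, List

Bits = List[int]


def low_bit_lcg(seed: int, length: int) -> Bits:
    """Weak generator: the lowest bit of each state of a 32-bit LCG.

    With an odd multiplier and an odd increment this bit flips at every step.
    """
    modulus = 2 ** 32
    state = seed % modulus
    bits = []
    for _ in range(length):
        state = (1664525 * state + 1013904223) % modulus
        bits.append(state & 1)
    return bits


def alternates(bits: Bits) -> int:
    """The algorithm A: output 1 if every pair of neighbouring bits differs."""
    return int(all(a != b for a, b in zip(bits, bits[1:])))


def estimate_advantage(generator: Callable[[int, int], Bits],
                       distinguisher: Callable[[Bits], int],
                       length: int, trials: int) -> float:
    """Empirical estimate of |Pr[A(r)=1] - Pr[A(f(s))=1]| over `trials` samples."""
    hits_random = 0
    hits_pseudo = 0
    for _ in range(trials):
        r = [secrets.randbits(1) for _ in range(length)]  # stand-in for uniform r
        s = secrets.randbits(32)                          # stand-in for uniform s
        hits_random += distinguisher(r)
        hits_pseudo += distinguisher(generator(s, length))
    return abs(hits_random - hits_pseudo) / trials


print(estimate_advantage(low_bit_lcg, alternates, length=16, trials=2000))
```

For this generator the estimate comes out very close to 1: a uniformly random 16-bit string alternates with probability $2^{-15}$, whereas the generator's output always does. Such a generator is therefore not $(t,\varepsilon)$-secure for any useful $\varepsilon$, even for very small $t$.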

As an illustrative example, consider a probabilistic combinatorial search algorithm $B$ that runs in time $t_1$ and outputs a solution that can be verified in time $t_2$. In this case, the use of a $(t_1+t_2,\varepsilon)$-secure pseudorandom generator within $B$ instead of pure randomness would decrease the probability of $B$ finding a solution by at most $\varepsilon$. Indeed, otherwise we could construct a $(t_1+t_2)$-time algorithm $C$ that outputs 1 if $B$ finds a solution and 0 otherwise, and use it to distinguish the pseudorandom generator from true randomness with advantage greater than $\varepsilon$. This would contradict the assumption that we have a true $(t_1+t_2,\varepsilon)$-secure pseudorandom generator. A similar argument can be made for the Monte-Carlo integration algorithm. Note that the parameters $t$ and $\varepsilon$ can be arbitrary real numbers. For a combinatorial search algorithm that takes 3 weeks of CPU time, you might use a $(3\ \text{week}, 0.01)$-secure pseudorandom generator.
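The reduction in the search example can be written out step by step. The symbols $p_{\text{true}}$ and $p_{\text{prg}}$ are notation introduced here for convenience, and the numbers at the end are purely illustrative. Let

$p_{\text{true}}=\Pr[r\gets\mathcal{R}: C(r)=1],\qquad p_{\text{prg}}=\Pr[s\gets\mathcal{S}: C(f(s))=1]$

where $C(x)$ runs $B$ with $x$ as its randomness, verifies the output, and returns 1 exactly when the solution is valid. Thus $p_{\text{true}}$ and $p_{\text{prg}}$ are the success probabilities of $B$ with true randomness and with the generator, respectively. Since $C$ runs in time $t_1+t_2$, the definition of a $(t_1+t_2,\varepsilon)$-secure pseudorandom generator gives

$|p_{\text{true}}-p_{\text{prg}}|\leq\varepsilon\quad\Longrightarrow\quad p_{\text{prg}}\geq p_{\text{true}}-\varepsilon$

For instance, if $B$ succeeds with probability $p_{\text{true}}=0.30$ on true randomness and $\varepsilon=0.01$, then it still succeeds with probability at least $0.29$ when run on the generator's output.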

There are essentially two reasonable complaints about pseudorandom generators. First, since obtaining random bits is hard, we have not solved the problem completely, as we must still get the seed from somewhere. This is indeed a valid problem. For instance, the standard rand() function in C is known to fail the NIST statistical test suite, and thus you might actually observe inconsistencies when using rand() directly or when generating a seed for a more complex function with rand(). This does not mean that rand() is not a pseudorandom generator, but rather that its quality might be too low for certain applications. As a complete solution, you would like to get a function $f:\{1\}\to\mathcal{R}$, i.e., one that needs no random seed at all. For some tasks, such as Monte-Carlo integration of certain functions, the corresponding solution is known (see multidimensional integration and sparse grids).

However, we still do not know how to do it in general, i.e., how to convert randomized algorithms into deterministic ones without significant losses in their performance. The corresponding area of research is known as derandomization. Second, one might argue that we could prove directly that replacing $r$ with $f(s)$ does not deteriorate the performance of the algorithm much. However, this is not an easy task and it is usually not a viable option for mere mortals.

To summarize, whenever you use a certain function $f$ to extend a random seed in your implementation, you actually have to believe that it does not affect performance, i.e., that $f$ is a pseudorandom generator with appropriate $(t,\varepsilon)$ parameter values. Whether you should use $f(s)=1$, rand() in C, or something more elaborate depends on the application.

Tags: Computer science, Derandomization, Statistics, Theory
Captain Statistics

Posted by Konstantin 11.08.2009 7 Comments

Inspired by Swen

Tags: Comic, Fun, Geekness, Statistics