05 May 2018

Pseudorandom in chaos

Randomized data is commonly used in fuzz testing, game engine, UUID, pipelines, etc.

Yet due to massive historically reasons, many C programmers tends to misuse the random functionalities.

rand(3) and srand(3)

Those two first appeared in Open Group Issue 1, it may a good starting point to explain why those two are commonly used among other pseudorandom functions.

As you examine the rand(3) implementation specification, you’ll find some caveats:

Range:

The rand_r() function shall compute a sequence of pseudo-random integers in the range [0, {RAND_MAX}]. (The value of the {RAND_MAX} macro shall be at least 32767.)

When dates back to early era, CPU is 16-bits addressable, people can only generate pseudorandom within 16-bits range, still a historically issue, thus we have to stay compatible.

Though standard specify the RAND_MAX at least sized 16-bits, since in nowadays, most mainstream CPUs are 32-bits, the backing OS standard library is fairy likely to define RAND_MAX as 32-bits maximum signed integer. Wish may satisfy general purpose.

You should definitely check the RAND_MAX macro manually if the range matters. Protability still issue if you want to write implementation-independent code.

Period

The rand() function shall compute a sequence of pseudo-random integers in the range [0, {RAND_MAX}] [XSI] with a period of at least 2^32.

This seems isn’t an issue if your RAND_MAX is sized 32-bits or less, o.w. there may sequence loop.

Why period matters? Rationale quoted from References

Compare this with the Mersenne Twister algorithm, which is 2^19937-1. You typically want the period of a random number generator to exceed the amount of numbers expected to be generated, because that’s the point where the sequence repeats.

Randomness quality

Like select(2), rand(3) is another example of poor library implementation(at least once a while), old rand(3) implementation generate pseudorandom numbers which is predictable or not fully pseudorandomized.

(Nowaday rand(3) implementation have good quality over pseudorandomness, won’t be a big issue though)

Statistical distribution

Still implementation-dependent, theoretically, distribution of general-purpose pseudorandom generator should conform uniform distribution. Yet the specification claim nothing upon the distribution.

FYI, quoted from references:

Even when rand claims a uniform distribution, it may not be a terribly good distribution, which greatly affects any attempts to adjust the range.

If the distribution is a critical factor, you may happen to write your own pseudorandom generator. In most cases, rand(3) satisfy your daily-use needs.

A common misuse of rand(3) is use it without prior call srand(3), quoted from specification:

The srand() function uses the argument as a seed for a new sequence of pseudo-random numbers to be returned by subsequent calls to rand(). If srand() is then called with the same seed value, the sequence of pseudo-random numbers shall be repeated. If rand() is called before any calls to srand() are made, the same sequence shall be generated as when srand() is first called with a seed value of 1.

E.g. the srand() have initial seed 1 if no prior srand(3) call is made, in such case, you have no randomness at all.

So you inevitably need to seed the generator, yet another common misuse of srand(3) is use

#include <stdio.h>
#include <time.h>

srand((unsigned int) time(NULL));

do the seeding, this turns to be a bad practice. To understand why, you may wrapup a test program like the following:

#include <time.h>

int main(void)
{
    time_t a;
    return 0;
}

And do the macro-expansion(replace with your favorite C compiler)

$ gcc -E test.c | grep time_t

You would see the following output

typedef long int __time_t;

# 1 “/usr/include/bits/types/time_t.h” 1 3 4

typedef __time_t time_t;

See, the parameter of srand(3) is typed unsigned int, yet the return value of time(2) is time_t, which is typedef of long int.

There may precision loss over higher bits(if long sized 64-bits), though won’t be a big deal, the higher 32-bits usually invariant in considerably timespan.

The issue is the time(2) itself, it return time scaled in seconds, when you bump two generators within one second, you’re fairy likely get the same seed.

In other words, the time(2) is coarse-grained over subsecond-level seeding.

There’re many ways to get a better seed

1. Incomplete time hash

unsigned int time_seed(void)
{
    time_t now = time(NULL);
    unsigned char *p = (unsigned char *) &now;
    unsigned int seed = 0;

    for (size_t i = 0; i < sizeof(now); i++)
        seed = seed * (UCHAR_MAX + 2U) + p[i];

    return seed;
}

This solution give by Julienne Walker, yet it turns out incomplete in subsecond-level.

Still produce the same value within a second, though it’s hashed.

2. Predictable entropy

int seed = (int) time(NULL) % getpid() + getppid();

This solution give by Massimo Ricotti in his Astronomy class. (awe)

Still subsecond-level precision, yet it works pretty fine if you bear this in mind, especially in multithreading environment.

If you scrutinize this in deep, you’ll find the pid may restricted by system configuration.

Taking Linux as an example, the default maximum pid can be found at /proc/sys/kernel/pid_max

/proc/sys/kernel/pid_max (since Linux 2.5.34)

This file specifies the value at which PIDs wrap around (i.e., the value in

this file is one greater than the maximum PID). PIDs greater than this value

are not allocated; thus, the value in this file also acts as a system-wide

limit on the total number of processes and threads. The default value for

this file, 32768, results in the same range of PIDs as on earlier kernels. On

32-bit platforms, 32768 is the maximum value for pid_max. On 64-bit systems,

pid_max can be set to any value up to 2^22 (PID_MAX_LIMIT, approximately 4

million).

In 32-bits system, the result cannnot exceed 2 * 32768, which sharply restricted the seed range.

FreeBSD define its maximum pid(99999 by initial) as macro, seems untunable if compiled.

P.S. most UNIX-like systems restricted their default pid range in ten-thousand level(stay backward compatible?), FYI.

And since the monotonic time is linearly increased, the pid may very likely to follow the same pattern.

Thus the seed is predictable in some ways.

3. Too much randomness

#include <time.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/time.h>

void junk_srandom(void)
{
    struct timeval tv;
    unsigned long junk; /* XXX left uninitialized on purpose */
 
    gettimeofday(&tv, NULL);
    srandom((getpid() << 16) ^ tv.tv_sec ^ tv.tv_usec ^ junk);
}

This solution give us sub-microsecond level precision seed, sounds great!

Yet it leaves undefined behaviour of using uninitialized data. Many compilers will ignored the entire translate unit(in this case, the entire expression) by default or by optimization flags.

Compile above code and objdump it

x86-64 gcc 7.3 -O0 flag

junk_srandom:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 32
        lea     rax, [rbp-32]
        mov     esi, 0
        mov     rdi, rax
        call    gettimeofday
        call    getpid
        sal     eax, 16
        mov     edx, eax
        mov     rax, QWORD PTR [rbp-32]     # tv.tv_sec
        xor     edx, eax
        mov     rax, QWORD PTR [rbp-24]     # tv.tv_usec
        xor     eax, edx
        mov     edx, eax
        mov     rax, QWORD PTR [rbp-8]      # junk
        xor     eax, edx
        mov     edi, eax
        call    srandom
        nop
        leave
        ret

x86-64 gcc 7.3 -O1 flag

junk_srandom:
        sub     rsp, 24
        mov     esi, 0
        mov     rdi, rsp
        call    gettimeofday
        call    getpid
        mov     rdi, QWORD PTR [rsp+8]      # tv.tv_usec
        xor     edi, DWORD PTR [rsp]        # tv.tv_sec
        sal     eax, 16
        xor     edi, eax
        call    srandom
        add     rsp, 24
        ret

You’ll notice entire junk variable optimized out(more specifically, the GCC threat it as zero), since its behaviour is undefined. You’d better to avoid to use this solution, it’s extremely dangerous.

Yet with minimal remedy, we can bring it back to life.

#include <time.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>

void loselose_srand(void)
{
    struct timeval tv;
    unsigned long junk = 0xdeadbeef;    /* XXX  initialized for prevent from optimized out */
    uintptr_t p = (uintptr_t) &junk;
    //uintptr_t p = (uintptr_t) (const char[]){};

    gettimeofday(&tv, NULL);
    /*
     * Higher bits of an address may have massive 1s  which may pollute the entire seed
     *  cases likely happen in 64-bits  which padding the higher bits
     */
    srand(((getpid() + getppid()) << 16) ^ tv.tv_sec ^ tv.tv_usec ^ (p & 0xffff00));
}

This benifits from the ASLR technique, which sightly deviated from original junk_srandom().

If ASLR not available in your system, you may replace it with clock(3), which is processor time.

Nanosecond level seeding seems overkill, pasted here in case you need it.

#include <time.h>

struct timespec ts;
clock_gettime(CLOCK_REALTIME, &ts);

random(3) and srandom(3)

This pair sourced POSIX, fixed several flaws mentioned above.

As the specification states,

The random() function shall use a non-linear additive feedback random-number generator employing a default state array size of 31 long integers to return successive pseudo-random numbers in the range from 0 to 2^31-1. The period of this random-number generator is approximately 16 x (2^31-1). The size of the state array determines the period of the random-number generator. Increasing the state array size shall increase the period.

With 256 bytes of state information, the period of the random-number generator shall be greater than 2^69.

The range and period is enough for general-purpose. You still need a better seed for srandom(3) though, and since its argument typed unsigned int, the merely time(NULL) still provides subsecond level seeding.

Conclusion

Random number is a complex topic, maybe it’s unusual for you to use random number, yet once you got any chance, do not misuse the basic random functionalities, even you don’t need sounding randomness. When the randomness is critical(say, you’re devloping a cryptographic hash algorithm), do NOT pursue rand(3) or srand(3), they’re terrible. Instead, develop one from scratch, or using state-of-the-art PRNG like TT800 or the Mersenne Twister.

References

EXP33-C. Do not read uninitialized memory
MSC30-C. Do not use the rand() function for generating pseudorandom numbers
MSC32-C. Properly seed pseudorandom number generators
Using rand() (C/C++): Advice for the C standard library’s rand() function.
Random numbers and cryptography
More randomness or less
Compiler Explorer

一个程序员的辩白