# Good post-processing methods for PRNG outputs

Discussion in 'Scientific Statistics Math' started by Mok-Kong Shen, Oct 6, 2011.

1. ### Mok-Kong Shen (Guest)

Pseudo-random number generators commonly have statistical anomalies.
I am interested in learning good post-processing methods that could
turn their outputs into something more favourable for practical use,
and I would hence appreciate hints from experts to good references,
as well as ideas for novel approaches.

I have done some experiments on PRNGs based on permutation polynomials
mod 2^n having maximal period length (cf. [1]). The following ad hoc
and simple scheme seems to be fairly promising (based on results from
Maurer's Universal Test computed for windows of 8 consecutive bits at
any arbitrary positions of the 32-bit words of the outputs):

1. Use a master PRNG to generate a number of PRNGs forming a pool of
size 2^k.

2. Pseudo-randomly activate the PRNGs in the pool and xor a pair of
consecutive outputs from the pool, with bits of one of the members
of the pair circularly rotated by a pseudo-random amount obtained
from the other member.

With rand(i) denoting an output from the i-th PRNG in the pool,
the C code may be given as follows, assuming n = 32 on a 32-bit computer:

r1 = rand(next);
// the 1st half of r1 commonly has less bias than the 2nd half
rot = r1 >> 27;  // the leading 5 bits of r1
r2 = rand(next);
// rot == 0 is treated separately: a shift by 32 bits is undefined in C
if (rot == 0) r = r1 ^ r2;
else r = r1 ^ ((r2 << (32 - rot)) | (r2 >> rot));

where r denotes the result of post-processing.
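
For concreteness, the whole step can also be sketched as a self-contained C fragment. The pool members here are toy LCGs standing in for the permutation polynomials; the names `rand_slot` and `seed_pool` and the LCG parameters are mine, purely for illustration:

```c
#include <stdint.h>

#define POOL_SIZE 2   /* pool of 2^k generators; k = 1 here */

/* one 32-bit state per pool slot */
static uint32_t state[POOL_SIZE];

static void seed_pool(uint32_t s0, uint32_t s1) {
    state[0] = s0;
    state[1] = s1;
}

/* Toy stand-in for the i-th pool PRNG: a full-period LCG mod 2^32.
   The real scheme uses permutation polynomials mod 2^32 instead. */
static uint32_t rand_slot(int i) {
    state[i] = state[i] * 2891336453u + 12345u;
    return state[i];
}

/* One post-processing step: xor two pool outputs, the second one
   rotated right by the leading 5 bits of the first. */
static uint32_t postprocess(int i, int j) {
    uint32_t r1 = rand_slot(i);
    uint32_t rot = r1 >> 27;   /* rotation amount, 0..31 */
    uint32_t r2 = rand_slot(j);
    if (rot == 0)              /* a shift by 32 would be undefined in C */
        return r1 ^ r2;
    return r1 ^ ((r2 << (32 - rot)) | (r2 >> rot));
}
```

With a larger POOL_SIZE and i, j picked pseudo-randomly per step, this reproduces the scheme above.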

I found experimentally that having the bit rotation in the above is
favourable, i.e. better than straightforward xoring of r1 and r2.

M. K. Shen

Mok-Kong Shen, Oct 6, 2011

2. ### Maaartin G (Guest)

Does it mean that the PRNG at position "next" gets stepped,
or do you just recall its value here?
In case this "next" equals the initial one, you most probably get
some bias.
This is not how rotation gets implemented; the test is superfluous.

I saw somewhere something like this:

int improve(int x) {
    int index = acc >> SHIFT;
    int y = array[index];
    array[index] = x;
    acc = y;
    return y;
}

It uses an array "array" of length 2**(WORD_LENGTH-SHIFT) and an
accumulator "acc". The usage is simple: generate a sequence x0,
x1, ... and let the numbers go through the function. The nice thing
about it is that it only permutes the values (which is at the same
time a disadvantage).

The important thing is that by using "acc" you avoid using one value
twice, and thus also the danger of introducing a bias.
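
Made self-contained in C (my reconstruction; the SHIFT value, table size, and seeding are assumptions), the idea looks roughly like this. It resembles the Bays-Durham shuffle given in Knuth, Vol. 2:

```c
#include <stdint.h>

#define WORD_LENGTH 32
#define SHIFT 27                          /* assumed: 2^(32-27) = 32 slots */
#define TABLE_SIZE (1u << (WORD_LENGTH - SHIFT))

static uint32_t table[TABLE_SIZE];        /* pre-filled with PRNG outputs */
static uint32_t acc;                      /* the previous output */

/* Swap the fresh input x into a slot chosen by the previous output and
   emit the value that was waiting there. The stream is only permuted,
   never altered, and acc ensures no value is consumed twice. */
static uint32_t improve(uint32_t x) {
    uint32_t index = acc >> SHIFT;        /* top bits of previous output */
    uint32_t y = table[index];
    table[index] = x;
    acc = y;
    return y;
}
```

Usage: fill the table (and acc) with initial PRNG outputs, then pass every further output x0, x1, ... through improve().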

The sad thing is that by using indirection you make it vulnerable to
timing attacks. I wonder if it could be of any use, given that good
non-secure PRNGs like KISS and secure PRNGs like Salsa20/8 are
also quite fast.

Note that my above code may be wrong...
That's understandable: polynomial generators only propagate information
from lower to upper bits, so the least significant bits are of terrible
quality.

Maaartin G, Oct 6, 2011

3. ### Mok-Kong Shen (Guest)

On 06.10.2011 23:35, Maaartin G wrote:
"next" obtains an initial value at program initialization time.
It gets (in general) a different (new) value in the statement above.
This is an issue of probability. If you cast a die twice, you may also
get a double right away, may you not? Of course I am assuming that the
value of "next" as computed above is sufficiently (pseudo-)random
(because the PRNGs employed may be "assumed" to be not "too" bad), so
that all PRNGs in the pool get a fair chance of being activated.
Do you mean that it's wrong, or do you mean that it is inefficient?
Please kindly explain your point in some detail. (Note that, if the
rotation is e.g. by 1 bit, then the original rightmost bit of the word
gets to the leftmost position, and similarly for 2 bits, etc.)

I admit that I put in the test as a precaution, owing to some uncertainty
I had (which I haven't resolved by looking into the proper manuals --
it's somewhat difficult for me to find access to them), because I
remember from long ago having read in a hardware manual of a certain
(now outdated) machine that an assembler command to shift n bits (for
a word size of n bits) was "undefined", and I am not sure whether this
is also the case for hardware generally in use nowadays. If someone
could provide a definite answer to this, I should be very grateful.

M. K. Shen

Mok-Kong Shen, Oct 7, 2011
4. ### Mok-Kong Shen (Guest)

On 06.10.2011 23:35, Maaartin G wrote:
In my previous reply less than an hour ago I wrote that I put in the
test out of precaution, because I was not entirely sure whether it is
needed or not but wanted to be safe anyway. I am lucky enough to be
able to claim clearly already now that the test is in fact "not"
superfluous. For, according to the C standard, if the right operand of
the shift operator is negative, or is equal to or greater than 32 (for
our case of 32-bit computer words), then the behaviour of the shift
statement is "undefined". So your critique above was unjustified.

M. K. Shen

Mok-Kong Shen, Oct 7, 2011
5. ### Maaartin G (Guest)

This is a different case... casting a die twice returns two
independent results. What you do seems to be something like this:

Cast 4 dice, write the results down, select two of them at random, add
the results modulo 6. Now even with perfect dice you've introduced a
clear bias in favor of an even result, haven't you?
But a polynomial PRNG is too bad when looking at the LSB. Your
rotations surely help, but with probability 1/32 the rotation
amount equals 0, and with probability 2**-k the same PRNG gets
selected twice, so there's a bias of 2**(-k-32).

I'm afraid you're right. I haven't seen the C standard for ages. Maybe
something like

(r2 << ((32 - rot) & 31)) | (r2 >> rot)

is the way to go. I'd suggest to write only

rotateRight(r2, rot)

in the algorithm description and define the function elsewhere. The
exact definition doesn't matter much; everybody understands what
rotateRight means. If you fit the pattern the compiler knows, it can
use the rotate machine instruction (assuming the architecture
provides one).
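
A branch-free definition along these lines might look as follows (my sketch; the masking makes every rotation count well-defined, and gcc and clang typically recognize the pattern and emit a single rotate instruction):

```c
#include <stdint.h>

/* Rotate x right by r bits; well-defined for every r, including 0 and
   multiples of 32, because both shift counts are masked into 0..31. */
static inline uint32_t rotateRight(uint32_t x, unsigned r) {
    r &= 31;
    return (x >> r) | (x << ((32 - r) & 31));
}
```

The original line then becomes simply `r = r1 ^ rotateRight(r2, rot);` with no special case for rot == 0.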

Maaartin G, Oct 7, 2011
6. ### Mok-Kong Shen (Guest)

On 07.10.2011 13:09, Mok-Kong Shen wrote:
I want to add that one goal of having a pool of PRNGs is that the bias
inherent in the individual PRNGs will tend to be "averaged" out (owing
both to their "diversity" and to the (pseudo-)random ordering of their
activation), which is found to be indeed the case according to my
experimental results. However, this averaging effect is found to be
not yet good enough unless the size of the pool is chosen to be
sufficiently large, say, 512. (This boundary number certainly depends
on the type of PRNGs used, but applies to the type I used.)
Through the xoring (with bit rotation) operation it is found that
the scheme is already quite good for a pool size of 2 (actually to
my own surprise). Please note, as I have hopefully made clear in
my OP, that I am not attempting to generate any perfectly random
sequences in the theoretical sense (a task not strictly possible
in reality, IMHO), but to obtain sequences of practically
acceptable quality, and that in a fairly simple manner and at not
too high a computational cost.

That the introduction of the pool could be a factor rendering the
cryptanalyst's work quite hard seems to be intuitively clear. Note
further the use of xoring in this connection.

M. K. Shen

Mok-Kong Shen, Oct 7, 2011
7. ### Mok-Kong Shen (Guest)

On 07.10.2011 15:03, Maaartin G wrote:
Let's look at that in a different way. Forget at first the xoring
and consider only the effect of the pool. What would you say to that?
But my experimental results showed fairly clearly that the bias
of the result of post-processing is practically negligible. Anyway,
that's beyond the "sensitivity" of Maurer's Test (for L=8 and
rho=0.01). Of course, you could question whether that one single test is
sufficient and whether an entire battery of tests wouldn't be required.
Perhaps I should stress that I am pursuing only some "practical" goals,
and further that the final outputs thus obtained would not be
used "directly" as such (e.g. in simple xoring with other user-given
sequences) by me, but would be used/involved in certain more complex
operations, such that even a certain small remaining amount of bias
could be well tolerable for the intended applications as a whole.
In what sense do you think that it is better? Are you sure that it
is really more efficient (considering all what a compiler normally
does)?

M. K. Shen

Mok-Kong Shen, Oct 7, 2011
8. ### Mok-Kong Shen (Guest)

On 07.10.2011 15:03, Maaartin G wrote:
I am perhaps strongly confused at the moment. But wouldn't a proper
question to ask "in the context of my OP" anyway be:

Would the result be a non-uniform distribution in [0,5]?

If so, I believe the answer is NO and hence there is no "bias".

M. K. Shen

Mok-Kong Shen, Oct 7, 2011
9. ### Maaartin G (Guest)

I see that you ain't looking for a perfect PRNG, but by using
something introducing some bias when there are easy ways to avoid it,
you're heading to a bad solution -- in the sense that there's a better
PRNG running at the same speed, having about the same complexity
(whatever may the definition be) and producing better results.

Using the pool surely helps, the open question is if it's worth it.

I have to correct myself: I meant 1/32 * 2**-k, i.e., 2**(-k-5),
which is much worse.
That's like saying that the patient is clearly healthy after examining
his hand and ignoring the fact he has no head.
Whatever. It makes no sense to run a test when you know there's a
bias. Of course, I may have misunderstood your algorithm.
So you're going to post-process your post-processed sequence? When
does it stop?
The "if" corresponds to a conditional jump, which is quite an
expensive operation. Moreover, the compiler should at least recognize
that "& 31" for the rotation amount is a no-op and remove it. Ideally
you fit the pattern the compiler knows and it collapses to a single
instruction.

Assuming you'd never select a single result twice, the resulting
distribution would be obviously uniform.

In case you happen to select a single result twice, the result is
always even.

To summarize, the probability of the result being even is 3/4*1/2 + 1/4*1
= 0.625. Verified by a simple Java program (which can be trivially
rewritten in C):

final Random random = new SecureRandom();
final int total = (int) 1e6;
int even = 0;
for (int n = 0; n < total; ++n) {
    final int[] a = new int[4];
    for (int i = 0; i < a.length; ++i) a[i] = random.nextInt(6) + 1;
    final int index0 = random.nextInt(a.length);
    final int index1 = random.nextInt(a.length);
    final int sum = a[index0] + a[index1];
    final boolean isEven = (sum & 1) == 0;
    if (isEven) ++even;
}
final double ratio = (double) even / total;

Maaartin G, Oct 7, 2011
10. ### Mok-Kong Shen (Guest)

On 07.10.2011 19:59, Maaartin G wrote:
Which is the good solution? I know I neglected a part of your first
post, which I failed to quickly grasp. Was the good solution there?
If yes, please say so. I would then attempt to study it, but maybe
I'll need some of your help to understand it properly.
Well, as said, with a pool size of 2 the scheme sketched already
passes Maurer's test. In which sense is it not worth it -- in runtime
efficiency or in implementation effort?
I don't understand your way of thinking here. I don't see the analogy
at all. Could you help by indicating some one-to-one correspondences
that you have in mind? On the other hand, since you used an analogy
about medicine, I'll take an analogy from that too. Nowadays there are
quite some western doctors practising acupuncture, which, unless I
gravely err, has nowhere near a scientific basis comparable to the
western methods. But what really counts is whether the patient gets
helped, isn't it? Ok, I have only applied Maurer's test. But "suppose"
the scheme passes the numerous other known tests as well; what
would anyone who simply needs to "use" some pseudo-random numbers
for some practical applications desire "in addition"? ("Theoretical
foundations" or what?)
See my argumentation above.
Long ago I did some assembler programming. I couldn't argue well in
detail now due to fading knowledge, but personally I consider that you
are evidently nit-picking here. That single statement is entirely
insignificant compared to the code for the proper part of the PRNG
itself (in my case the evaluation of the polynomials).
Oh, I now see what you mean. But are you considering only a single
specific (type of) event, or are you doing statistics and considering
the "whole" of what can occur in a long run, i.e. the ensemble?

To be explicit, a perfect die gives a uniform distribution. Taking
two dice and adding them and taking modulo means adding the two
distributions (of our discrete random variables). There is even a
theorem in the theory of random variables (I am fairly sure, though
offhand I couldn't cite a reference) saying that it suffices for
one of the two distributions to be uniform to give rise to a resulting
distribution that is uniform. Doesn't that refute your claim?

M. K. Shen

Mok-Kong Shen, Oct 7, 2011
11. ### Mok-Kong Shen (Guest)

[skip]
I can at least, anyway, give you right now a very simple detailed proof
for the special case that both dice are perfect (which is the case you
have been arguing), as follows:

To get the modulo sum 0, there are six corresponding possible events,
each having a probability of (1/6)*(1/6) = 1/36 of occurring, namely:

Die 1: 0, 1, 2, 3, 4, 5
Die 2: 0, 5, 4, 3, 2, 1

So the modulo sum 0 itself has a probability of 1/6 of occurring.

To get the modulo sum 1, there are six corresponding possible events,
each having a probability of (1/6)*(1/6) = 1/36 of occurring, namely:

Die 1: 0, 1, 2, 3, 4, 5
Die 2: 1, 0, 5, 4, 3, 2

So the modulo sum 1 itself has a probability of 1/6 of occurring.

And so on, and so on.

Is that ok for you?
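
The case analysis can also be checked by exhaustive enumeration; a small C sketch, counting all 36 equally likely pairs of two independent fair dice with faces 0..5:

```c
/* Count how often each value of (d1 + d2) mod 6 occurs over all 36
   equally likely outcomes of two independent fair dice. */
static void count_mod_sums(int count[6]) {
    for (int s = 0; s < 6; ++s) count[s] = 0;
    for (int d1 = 0; d1 < 6; ++d1)
        for (int d2 = 0; d2 < 6; ++d2)
            ++count[(d1 + d2) % 6];   /* each pair has probability 1/36 */
}
```

Every residue is hit exactly 6 times, i.e. with probability 6/36 = 1/6: uniform, as claimed. Note that the enumeration ranges over independent pairs; restricting it to the diagonal d1 == d2 (the same die selected twice) would make the sum always even.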

M. K. Shen

Mok-Kong Shen, Oct 7, 2011
12. ### Maaartin G (Guest)

Maybe. My first code snippet should do some trivial post-processing
without introducing any bias. Check if I got it right.

It only permutes the inputs, but it can be simply modified to perform
some computation.
The important thing is to avoid using the same slot twice. An easier
way is to use two pools.
I meant runtime efficiency here. Such post-processing may improve a
PRNG, but doing something different may be better.
You performed a single test. Such a test could have shown a weakness
of the PRNG, and it didn't; that's fine. But by no means did it
indicate that the PRNG is not completely broken.
AFAIK, for non-crypto PRNGs not much more is required. Some minimum
period guarantee maybe, which is usually simple to obtain.

And obviously speed. In order to make anybody use it, you need to be
better in some respect than any existing PRNG.
Again, I may have misunderstood your algorithm, but in case I haven't,
there's a bias, so you don't need to test it before you fix it.
Is it?

What degree are your polynomials? How many of them get evaluated per
output byte?

Do you know the speed of other PRNGs (KISS, Salsa20)?
The whole experiment is described by this sentence: "Cast 4 dice,
write the results down, select two of them at random, add the results
modulo 6."

So I cast, get 2, 4, 3, 1, select first and second, compute (2+4) mod
6 = 0.
I cast 5, 6, 1, 3, select third and third again, compute (1+1) mod 6 =
2.
And so on.
No, it doesn't. The two random variables must be independent, and this
condition is violated when you select a slot twice.

You're doing something different from my experiment. Maybe now it's
clear.

Maybe you should copy the entire program here, so I can tell you if I
understood you correctly.

Maaartin G, Oct 8, 2011
13. ### Mok-Kong Shen (Guest)

On 08.10.2011 02:16, Maaartin G wrote:
There were some well-known algorithms to permute PRNG outputs in Knuth's
book. I assume you are familiar with them. If so, what is your
evaluation of them in comparison to what you suggested?
I don't think so. By analogy: any sufficiently long output sequence
of a perfect die must have repetitions. That doesn't imply that the
sequence has bias, does it? Besides, any pool is of finite size; how is
it possible to avoid using the same slot twice (excepting of course the
generation of a new PRNG for it with the master PRNG, once that slot
is accessed)? Note that two pools are also finite and hence of
no help -- did you notice that?
That I have done only one test, and that that may be insufficient, was,
I hope, sufficiently clearly expressed in what I wrote in my OP. Thus
I was asking for good references (to better methods) and novel ideas.
In fact, to quote myself, the scheme merely "seems to be fairly
promising".

Well, I assume you are the same person maartin (differing only in the
capitalization of the 1st character) with whom I once engaged in a bet
in another forum, though the process of negotiating the challenge
was not carried to completion. There you claimed you had a short
program that almost instantly broke my scheme (which, I remember --
it's quite a time back and I would have to do some searching to get
the details -- also involved xoring to render the cryptanalysis hard).
At the end of the discussions there you didn't accept my challenge
offer but promised to publish your code sometime later. How is that
code now? Could you publish it now for examination by all, or did
you later in fact discover essential errors in your program? I'd like
to call your attention (since you raised the security issue of PRNG
outputs) to this, because in the present scheme xoring plays IMHO
a "non-trivial" role, in a way similar to that case from way back.

Oh, any "real" random or pseudo-random sequence can't be "perfectly"
random and must thus have some (mathematically "non-zero") bias,
just as no real human being can be perfect (excepting Jesus in the
Christian religion). So, according to you, we could just as well
discard all the known statistical tests, since they could find no
chance of any "sensible" application, right?? Which random or
pseudo-random sequence that you could personally get could you claim
to be "really" and "definitely" without "any" bias? There do exist
some PRNGs with very nice supporting math theories of their
characteristics, but, if I don't gravely err, they all yet depend on
some theorem that is not yet 100% "strictly" proved. So, if you really
want rigor, you would be back to square 0, I am afraid.
The code of the polynomials I used, or the code of any other known
PRNG, when compiled, is certainly much more involved in terms of
machine instructions and hence also consumes more CPU time in
comparison to any speed-up (if any) of the single statement you
nit-picked. That was my point. Could you understand?
I know they are, anyway as claimed, very fast. But that does not
preclude the development of slower ones, if these could be advantageous
in certain (different) situations/applications. (The same situation
exists with e.g. means of transport of different speeds, etc., no?)
But we are arguing here about your perfect-dice case. Could you first
"acknowledge" or "deny" that you erred with your claim concerning the
perfect-dice case? (I have some vague impression that you haven't yet
fully captured my point in countering your argumentation there.)
Which program? A program to illustrate/support that proof in my previous
post?

M. K. Shen

Mok-Kong Shen, Oct 8, 2011
14. ### MrD (Guest)

Really? How do you figure that a real random sequence must have a bias?
Like, I appreciate that any finite sequence probably exhibits bias,
whether its origin was a true random source or not. And I also
appreciate that even an infinite sequence from a truly random source
*might* exhibit bias. But "must"?

Incidentally: why have you put "real" and "perfectly" and "non-zero" in
quotes? The implication is that you want these terms to be given some
meaning other than their normal meaning. What meaning did you intend
them to have?

The point, as I understand it, was that your scheme of permutations was
actually *introducing* bias. That is, given a hypothetical PRNG with
zero bias, your scheme would transform it into a stream with a bias.
This waffle about perfection completely misses that point.
The only source of true random events is (so I'm told) the collapse of
the quantum wavefunction, and events that depend on it. Why do you
believe that such events are subject to bias?

MrD, Oct 8, 2011
15. ### Mok-Kong Shen (Guest)

On 08.10.2011 11:34, MrD wrote:
The argument is the same I already gave: nothing in the real world,
and in particular things made by humans, can be "perfect" in
the "absolute" sense (one could even say that's true by definition,
i.e. a tautology in a sense). It's certainly up to you to discover
a case that shows the contrary.
True (physical) randomness must be captured by apparatus that
have some intrinsic errors, even though, with the advancement of
technology, these could be made almost arbitrarily negligibly small.
Note that even the fundamental constants of physics, which scientists
make big efforts to determine extremely accurately, are nonetheless
given with margins of error, in the same way as results of experiments
done in schools, though very very very much smaller than these.
No, just to put stress on them.
As I reported in my OP, I experimentally found that the operations I
employed in the scheme resulted in essential improvements, anyway
according to Maurer's test (dependent certainly on the "sensitivity"
of that test).
Even if that "were" perfectly random, the chain of practical devices
that turn it into bits for our use cannot be free of instrumental
errors, as I argued above. On the other hand, how do you "know" that
it's really "random"? (Quantum theory is not yet "entirely" settled,
IMHO.) Of course, you could "define" that to be perfectly random, in
ways analogous to what is done in many fields of math, where e.g. a
space is "defined" to have this and that characteristics, period.
That is, by axioms. But that's a different kind of thing than we would
want to have in the present context, right?

Anyway, let's wait for Maaartin's follow-up. (BTW, if I don't err,
you were in the same forum in which I made a challenge offer to him
at that time. So perhaps we could eventually join efforts in examining
his code that was claimed to have instantly broken my scheme way back.)

M. K. Shen

Mok-Kong Shen, Oct 8, 2011
16. ### Maaartin G (Guest)

....a lot of things

In order to prevent the communication from getting longer and longer, I
suggest the following. At the very beginning of this thread you posted
a code snippet containing the function "rand(int next)". When you post
this single function too, I'll see whether my claim concerning your
post-processing introducing a bias holds or not.

Concerning your cipher: I know I should post my solver one day, and
maybe I will. But when I want people to have a look at it I need to
comment and polish it first, which is not much fun...

Maaartin G, Oct 8, 2011
17. ### Mok-Kong Shen (Guest)

On 08.10.2011 17:00, Maaartin G wrote:
Whether there is bias or not is what the statistical test programs
or packages are designed to determine. Examining the code is IMHO not
the right way for that purpose. Since my code is still under
development, I am sorry to have to decline that "currently". I am sure
you can certainly understand that, since you yourself, as you wrote in
the quote from your last post below, are not yet capable of releasing
your code of two years ago (my memory may be wrong, but at least over
one year). Actually I intended to post a larger package including the
PRNG code to the internet this summer (it is to be a new version of my
PEARL2, the first version of which you have surely seen), but I am
still waiting for some input from a group of people from whom I expect
to receive a couple of non-trivial suggestions for improvements (or
maybe very hard critiques). That is, however, unfortunately subject to
a delay due to their having work of higher priority of their own to do.

Perhaps we could do a sort of bargain, since we both have code to
release. If you could manage to release your long-awaited code, I'll,
independently of the above-said expected potential improvements, release
an (eventually rather "premature") version of my package within 2 months
after the date of your release. Would you consider that to be a
sufficiently fair matter?

M. K. Shen

Mok-Kong Shen, Oct 8, 2011
18. ### Mok-Kong Shen (Guest)

On 08.10.2011 20:32, Mok-Kong Shen wrote:
[snip]

(1) The PRNGs I used in the pool are, as I wrote in my OP, all based
on permutation polynomials mod 2^n. An implementation can be
seen in the code of the first version of my PEARL2, previously
posted to the other forum.

(2) Such PRNGs have of course, like many other types of PRNGs,
quite some bias, which is why I wrote my OP to report on my
(limited) experience with my concept of having a pool of
PRNGs that are pseudo-randomly activated (I had first
posted that idea to an internet forum over ten years ago),
with the expectation not only to have that concept discussed
but also (this is actually the major part of my expectation)
to learn some novel and possibly much superior approaches.
Whether a pool, with or without the additional measure of
xoring and bit rotation, can "effectively" reduce the bias
inherent in the original outputs of the PRNGs in the pool
is a question, I believe, not much dependent on the specific
type of PRNGs there, but a question of "principle", and
should be amenable IMHO to examination/study as such, i.e.
to being treated in a rather "general" way. To mention something
related in this connection: Knuth's Vol. 2 contains a couple
of algorithms to scramble the ordering of outputs from a PRNG.
This is a well-known post-processing, from which outputs of all
types of PRNGs are (hopefully) to profit. Of course, it
could well be that one type of PRNG profits from it somewhat
more than others. But that difference, I think, is not as
essential to study as the question of whether the scrambling
as such is capable of resulting in improvements in general
at all. Hence I don't believe an examination of the
implementation details of any specific type of PRNG could
contribute to the issue I raised in my OP in any essential way.

M. K. Shen

Mok-Kong Shen, Oct 8, 2011
19. ### Maaartin G (Guest)

That's like saying that using a spreadsheet is the right way to
compute a sum of numbers. Sometimes it is, but sometimes you don't
need it. And sometimes it's unusable, such as for 1+2+...+1000000000,
which is trivial to compute without it.
The only thing I need to know is how "next" gets used. Is it something
like

int a, b, c, ...;
int[] pool;

int rand(int next) {
    int x = pool[next];
    int y = a + b*x + c*x*x + ...;
    pool[next] = y;
    return y;
}
or what?

Maaartin G, Oct 8, 2011
20. ### Mok-Kong Shen (Guest)

On 08.10.2011 23:14, Maaartin G wrote:
Consider the case of LCPRNGs. They are all of one type, so the code is
the same. But, for different values of the parameters, some are more
or less useful and some are notoriously bad (cf. e.g. Knuth). That is,
in terms of bias, you could have no idea of the difference by
examining the code as such.
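
A concrete illustration of this point is RANDU (a = 65539, c = 0, m = 2^31), the notoriously bad LCG discussed by Knuth: nothing in the code shape betrays that 65539 = 2^16 + 3 forces every triple of consecutive outputs onto the relation x_{n+2} = 6*x_{n+1} - 9*x_n (mod 2^31), i.e. onto a small number of planes in 3-space. A minimal sketch:

```c
#include <stdint.h>

/* RANDU: looks exactly like any other multiplicative LCG... */
static uint32_t randu_next(uint32_t x) {
    return (65539u * x) & 0x7fffffffu;   /* multiply by 65539 mod 2^31 */
}

/* ...yet its parameters satisfy a^2 = 6a - 9 (mod 2^31), so consecutive
   triples obey x_{n+2} = 6*x_{n+1} - 9*x_n (mod 2^31). */
static int randu_triple_relation_holds(uint32_t x0) {
    uint32_t x1 = randu_next(x0);
    uint32_t x2 = randu_next(x1);
    return x2 == ((6u * x1 - 9u * x0) & 0x7fffffffu);
}
```

A "good" LCG with different parameters has byte-for-byte the same code shape, which is exactly why inspecting the code alone tells you nothing about the bias.
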
You have assumed fairly rightly what I have done. For, say, 2nd-degree
polynomials and pool size 128, I would have something more like the
following concretely (deviations from my current, possibly yet to be
reworked, coding are unessential in the present context; I need
"unsigned int" due to mod 2^32):

unsigned int pool[128][3];
unsigned int seed[128];

unsigned int rand(int i)
{ unsigned int x = seed[i], y;
  seed[i] = y = pool[i][0] + pool[i][1]*x + pool[i][2]*x*x;
  return(y);
}

As said, I rather doubt that this code could help you to know anything
about the bias of the output of the PRNG with index i in the pool.

M. K. Shen

Mok-Kong Shen, Oct 8, 2011