PHP code to generate samples from a normal distribution?

Discussion in 'Probability' started by Ray Koopman, Nov 4, 2010.

  1. Ray Koopman

    Ray Koopman Guest

    The normal distribution does not have a min or max. It goes to
    infinity in both directions. Do you want to sample from a normal
    with the specified mean and s.d., but retain only values that are
    within the range you specify? (That would usually be referred to
    as sampling from a truncated normal distribution.)
     
    Ray Koopman, Nov 4, 2010
    #1
    1. Advertisements

  2. Ray Koopman

    A.Reader Guest

    Apologies if this isn't the best ng for this question
    (redirection appreciated, if not)

    I need to generate random samples from a normal distribution
    given min, max, mean, and sd. Can someone point me to the
    source of a function that will do that?

    It can be written in just about any general-purpose computer
    language. Or even a semi-raw formula would be fine, if it's
    parenthesised well enough that I can tell which operators have
    precedence.

    Many thanks in advance for any help! (Just in case: no, I'm not
    asking for homework help, I got my undergrad degree ages ago; I'm
    simply hopeless at anything arithmetic)
     
    A.Reader, Nov 4, 2010
    #2
    1. Advertisements

  3. Ray Koopman

    A.Reader Guest

    Yes, that's what I meant --discarding all values not within a
    range. I suppose I must have been focusing on my purpose, which
    is to sample from possible human ages (i.e., 0..122).

    Thanks very much for responding!
     
    A.Reader, Nov 4, 2010
    #3
  4. Ray Koopman

    Ray Koopman Guest

    This is probably the easiest way to do it:

    1. Generate two independent Uniform(0,1) random values, u & v.
    2. Let x = mean + sd*sqrt(-2*ln(u))*cos(PI*v).
    3. If min < x < max then accept x; otherwise return to step 1.

    'mean' & 'sd' are the mean & s.d. of the normal distribution before
    removing values that are < min or > max. The mean of the truncated
    distribution will be between 'min' and 'max', and the s.d. of the
    truncated distribution will be less than 'sd'. It's possible to do
    things the other way around, where you specify the mean & s.d. that
    you want the truncated distribution to have, and then solve for the
    values of 'mean' & 'sd' to use in the code, but it would be very
    messy.

    I don't know about using the normal to model the age distribution.
    Maybe someone who does know about such things will comment.
     
    Ray Koopman, Nov 4, 2010
    #4
  5. Ray Koopman

    Henry Guest

    A variety of age distributions can be seen at
    http://www.google.com/images?q=age+pyramid

    Although there is clearly a decline at very old age, the pattern for
    young and middle age varies from place to place and time to time,
    without any conventional distribution being useful. A normal
    certainly would not do it, especially for the young.
     
    Henry, Nov 5, 2010
    #5
  6. Ray Koopman

    Herman Rubin Guest

    The Box-Muller method should use random "reals" instead
    of a highly discrete scale like 0 to 99. The above distribution
    would have jumps in the cdf of size .0001, and especially bad tails.

    There are several better, quicker, and more accurate methods
    of generating normal random variables. Johnson and I have a
    paper comparing our acceptance-replacement algorithm with
    Marsaglia's reverse ziggurat, which is also a good method.

    If the probability of the admissible range is high, I definitely
    recommend a good method of generating unrestricted normal random
    variables and rejecting those out of range. If it is not, I
    would recommend using acceptance-rejection or acceptance-replacement
    based on the concave logarithm of density approach.

    Other distributions can be simulated. Nothing is
    normally distributed in nature, although some (not age)
    may be close.
    Numerical analysis is an art; I suggest you consult
    someone with the necessary artistry. The other parts
    of your system likewise are going to need good
    modeling and good mathematics.
     
    Herman Rubin, Nov 5, 2010
    #6
  7. Ray Koopman

    A.Reader Guest

    Thanks very much! I'll have a go with it. I don't think doing
    post-hoc truncation will be a problem.

    Just to make sure I understand the operator precedence, would
    this be a correct re-writing?

    ran1 = ran_int( 0, 99 ) / 100
    ran2 = ran_int( 0, 99 ) / 100
    dp = mean + (( sd * sqrt((-2) * natlog( ran1 )) * cos( pi *
    ran2))
    Sorry, I was misleadingly terse in my nano-explanation: it's for
    simulating a system. The population is expected to be bimodal in
    age, and my aim is to sample accordingly for various modal
    combinations and see what that does to later interactions within
    subsystems.
     
    A.Reader, Nov 5, 2010
    #7
  8. Ray Koopman

    Ray Koopman Guest

    ...............^
    The precedence is right, but you have an extra (. And as Herman has
    already pointed out, the uniform random values must be continuous
    reals, not discrete integers. If you want the result to be an integer
    then round at the end. Also, the value whose log you take must be in
    (0,1]; that is, greater than 0, and less than or equal to 1.
    It sounds like exact normality isn't crucial to what you're doing.
     
    Ray Koopman, Nov 6, 2010
    #8
  9. Ray Koopman

    A.Reader Guest

    This is where my dyscalculia gets me into trouble. I'd take it
    kindly if you or Herman would explain in what way my thinking is
    off about this:

    Herman notes that creating reals by my intended scheme of taking
    pseudo-random integers 0-99 (or 0-100) and dividing by 99 (or
    100) will result in a "randomness deficit" (I'm sure there's a
    more-accepted term) of 0.0001.

    To me, that doesn't seem like a problem for the case, as here,
    where the end result must be constrained into the integer
    sub-range (e.g.) 10-100.

    No matter how perfectly destributed the raw result of the
    calculation, ultimately it must reduce to one of those few
    integers.

    Or is the formula chaotic in some way? If it's not, I'd think
    that such a small deficit would, at most, produce the occasional
    (e.g.) 44 instead of 43 or 45, i.e. a skew invisible to the naked
    eye

    What am I not understanding? (If you can explain it in simple
    enough terms, that is)

    And as a practical matter, would it help (since I only have
    integer rands in php, I think) if I were to take integer values
    from the full 2^32 range and then divide by 2^32?
    Very true. I was actually trying to get by with merely
    averaging the sum of several integer rand calls, fiddling with
    their ranges to control central tendency, but I couldn't persuade
    myself that it was producing a sufficiently natural-looking
    result. So if I can get something that "looks right" to the
    naked eye, I'm sure that'll do everything necessary.

    Many thanks for your continued help with this!
     
    A.Reader, Nov 10, 2010
    #9
  10. Ray Koopman

    Ray Koopman Guest

    Most "continuous real" uniform(0,1) generators actually work just
    as you describe -- they generate a uniform integer 1...n, where n
    is very large, then float it and divide by n or n+1. For generating
    normals using the algorithm described earlier, n determines the
    maximum possible magnitude that the generated standard normal can
    have, which is sqrt(2 natlog(n)). Also, if n is too small (e.g., 100)
    then the distribution will have a "spike" at 0. 10^4 should be OK.
     
    Ray Koopman, Nov 11, 2010
    #10
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.