Bell curve distribution

Discussion in 'Scientific Statistics Math' started by Mekkala, Mar 2, 2004.

  1. Mekkala

    Mekkala Guest

    I'd like to ask a favor of you guys. I need an algorithm that will
    analyze any set of data and put it into a normal bell-curve
    distribution, by frequency groups. Or at the very least, I need tips to
    figure out such an algorithm (I'm not entirely mathematically ignorant,
    but I'm not exactly highly educated either).

    In case what I'm looking for is not quite clear, let me explain. I need
    to be able to place a set of data into a bell curve and then break the
    bell curve up into a set number of frequency groups, so that I have the
    value range for each group. Here's a sample set:

    Group 1 -- data points: 9; value range: 133-183
    Group 2 -- data points: 23; value range: 184-388
    Group 3 -- data points: 36; value range: 389-603
    Group 4 -- data points: 45; value range: 604-822
    Group 5 -- data points: 50; value range: 823-1,105
    Group 6 -- data points: 45; value range: 1,106-1,590
    Group 7 -- data points: 36; value range: 1,591-2,052
    Group 8 -- data points: 23; value range: 2,053-2,818
    Group 9 -- data points: 8; value range: 2,819-3,807

    I don't know how good that sample set is -- that's just an example of
    what my boss is looking for (that's the sample set he gave me). But, in
    essence, what I'm looking for is an algorithm that will divide any data
    set into a distribution similar to that example, and it needs to fall
    into a normal bell curve as accurately as possible. Any help you guys
    can give me?
     
    Mekkala, Mar 2, 2004
    #1

  2. If I follow what your boss wants, I'd say it can't necessarily be done.
    One can make a graph like he wants (it's called a histogram), but if the
    data are from something significantly unlike a normal distribution, the
    histogram probably won't look like a normal distribution. If what he wants
    is a way to choose the bin intervals for the histogram so that it looks as
    much like a normal distribution as possible, I'd say A) that's cheating and
    B) I don't know of such an algorithm (other than putting everything in one
    bin, in which case all distributions look alike :) ). There may be such
    an algorithm, but I have not run across it.

    There are two rules of thumb I know of to choose the bin interval.
    One is Sturges' Rule and the other is Scott's Rule. I read a paper
    claiming Scott's Rule is better. Unfortunately, I don't have the
    reference handy.
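
    The two rules of thumb named above can be sketched in a few lines of Python. This is only an illustration using the usual textbook forms (Sturges: k = ceil(log2 n) + 1 bins; Scott: bin width h = 3.49 * s * n^(-1/3), with s the sample standard deviation). The function names are made up for this example.

```python
import math
import statistics

def sturges_bin_count(n):
    """Sturges' rule: number of bins for a sample of size n."""
    return math.ceil(math.log2(n)) + 1

def scott_bin_width(data):
    """Scott's rule: bin width from the sample standard deviation."""
    n = len(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    return 3.49 * s * n ** (-1 / 3)

# The sample set in post #1 has 275 data points in total.
print(sturges_bin_count(275))
```

    Sturges' rule fixes the number of bins and Scott's rule fixes their width, so the two will not generally agree on the same data.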

    Regards,
    Russell
     
    Russell Martin, Mar 2, 2004
    #2

  3. Rich Ulrich

    Rich Ulrich Guest

    The recommended number of bins increases slowly as the
    sample N gets much larger.

    It is good advice to use one of the binning rules of thumb.
    There are lots of data sets that *will* be symmetrical
    with tapered tails, like the boss wants.

    If the boss has proper experience, the boss will be pleased,
    too, to see the other data sets reveal their outliers, or
    their bi-modality, or whatever.
     
    Rich Ulrich, Mar 2, 2004
    #3
  4. Mekkala

    Mekkala Guest

    The data we're working with ought, statistically, to fall into a normal
    bell curve (each data point is an oil well, and the value is the
    mechanical difficulty of drilling the well). If it doesn't, there's
    something wrong with the data. The problem, though, is that sometimes
    we're working with small data sets and there aren't enough data points
    to make up a clear picture of a normal distribution.

    My boss would like to look very clean, precise, and professional, and to
    that end he'd like all his graphs to look just precisely like a perfect
    bell curve. While I agree that that's cheating, he's also my boss and
    he doesn't think it's cheating -- so I'm trying to find a way to do
    that. If there is no way, then I'll settle for a way to choose bin
    intervals that accurately represent the data, and if he doesn't like it,
    I'll tell him to go hire a damn mathematician and see what they tell him
    when he asks for that :p

    (Note that I've added the intervals to the data set above so you can see
    better what he wants done to the data)
    That might be what I'm looking for. Can you point me to a place where I
    can find Sturges' Rule and/or Scott's Rule?
     
    Mekkala, Mar 2, 2004
    #4
  5. Like Rich wrote, a lot of times what you have will be close to normal
    looking, and probably more so if you actually have normal data. ;-)
    But a small number of points will be more prone to sampling errors that
    make a sample from even a perfectly normal distribution (yes, Prof. Rubin,
    I know there's no such thing :) ) look not too normal.
    Can I be there to watch? :)
    I can send you C programs that implement binning using both, if you'd like.
    You can google on "Sturges' Rule" and "Scott's Rule" and get quite a few
    hits. To get the most symmetric looking histograms you will probably have
    to fool around adjusting the center point around which you're binning in
    addition to using some rule of thumb to choose bin width or number of bins,
    and I don't know of any automatic way to do that except choosing the
    sample mean or median, but again that doesn't guarantee it will work in
    all cases.

    Regards,
    Russell
     
    Russell Martin, Mar 2, 2004
    #5
  6. Jos Jansen

    Jos Jansen Guest

    This is quite easy:

    1) estimate 3rd, 11th, 24th, 41st, 59th, 76th, 89th, 97th percentiles
    of the distribution (numbers to be refined if so desired)
    2) use obtained estimates as group boundaries for making the frequency
    table of the sample
    3) observe that the frequency table so obtained will always be nicely
    bell shaped over group numbers
    4) observe, moreover, that the resulting frequency table is useless,
    for whichever purpose.
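
    The four steps above can be sketched as follows. The nearest-rank percentile estimate and the function names are assumptions made for this illustration; the post does not say which percentile estimator to use.

```python
import bisect

# Percentiles listed in step 1 of the post.
PERCENTILES = [3, 11, 24, 41, 59, 76, 89, 97]

def percentile_boundaries(data, percentiles=PERCENTILES):
    """Nearest-rank estimate of each percentile, used as a group boundary."""
    xs = sorted(data)
    n = len(xs)
    bounds = []
    for p in percentiles:
        rank = max(-(-p * n // 100), 1)  # ceil(p * n / 100), at least 1
        bounds.append(xs[rank - 1])
    return bounds

def frequency_table(data, boundaries):
    """Count the points per group; a value equal to a boundary goes to the lower group."""
    counts = [0] * (len(boundaries) + 1)
    for x in data:
        counts[bisect.bisect_left(boundaries, x)] += 1
    return counts

# Perfectly uniform data still comes out "bell shaped" over group numbers.
uniform = list(range(1, 101))
print(frequency_table(uniform, percentile_boundaries(uniform)))
# [3, 8, 13, 17, 18, 17, 13, 8, 3]
```

    The uniform input illustrates step 4: the shape of the table comes from the chosen percentiles, not from the data.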

    Jos Jansen
     
    Jos Jansen, Mar 2, 2004
    #6
  7. Eric Bohlman

    Eric Bohlman Guest

    I get the feeling that what your boss actually wants is to have the data
    summarized as a smooth curve rather than a "blocky" histogram. If that's
    the case, then what you're looking for is kernel density estimation.
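
    For readers who have not met kernel density estimation, a minimal sketch with a Gaussian kernel might look like this. The bandwidth is left as a plain parameter here, and its value in the usage lines is an arbitrary assumption; choosing it well is a topic of its own.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump per data point, centred on that point.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

# Evaluate the smooth curve on a grid instead of drawing histogram bars.
density = gaussian_kde([133, 400, 820, 1100, 1600, 2800], bandwidth=400)
curve = [(x, density(x)) for x in range(0, 4001, 500)]
```

    The six values in the usage lines are made-up numbers in the range of the sample set, not real data.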
     
    Eric Bohlman, Mar 2, 2004
    #7
  8. What do you do if the data don't come from a normal bell curve?

    You could google on stanines. The US Air Force invented stanines in World
    War II to force non-normal data to be normal -- you scale the data so that
    it's normal. Unfortunately, that means that the scaled data don't represent
    anything meaningful. If all you really care about are rankings (with an
    implied slop factor), then it doesn't matter.
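
    As a rough illustration of stanine scoring: the standard stanine bands hold 4, 7, 12, 17, 20, 17, 12, 7 and 4 percent of the ranked data. The sketch below assigns a stanine from each point's rank alone, which is why the scaled values only preserve ordering. The mid-rank percentile and the function name are assumptions for this example, and ties are broken by input order.

```python
import bisect

# Cumulative percentages at the upper edge of stanines 1 through 8.
STANINE_CUM_PCT = [4, 11, 23, 40, 60, 77, 89, 96]

def stanines(data):
    """Return the stanine (1 to 9) of each data point, based on rank only."""
    n = len(data)
    order = sorted(range(n), key=lambda i: data[i])
    result = [0] * n
    for rank, i in enumerate(order):
        pct = 100 * (rank + 0.5) / n  # mid-rank percentile of this point
        result[i] = bisect.bisect_right(STANINE_CUM_PCT, pct) + 1
    return result

scores = stanines(list(range(100)))
print([scores.count(k) for k in range(1, 10)])
# [4, 7, 12, 17, 20, 17, 12, 7, 4]
```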

    One last bit of free financial advice. Economic data are non-normal. If
    your boss is fitting economic data this way, you should consider
    diversifying your financial portfolio away from this company. That includes
    your job, which is likely to be your largest financial asset. Of course, if
    this is quality control data, then it's the standard approach, and some
    companies have been very successful with it.

    Jon Miller
     
    Jonathan Miller, Mar 2, 2004
    #8
  9. Mekkala

    Mekkala Guest

    No, he's trying to prove to his superiors that the data do indeed fall
    in a normal bell curve distribution. He's right, they do, in general --
    but that doesn't mean every data set will look just exactly like a bell
    curve, and so far I haven't been able to convince him otherwise; so now
    I'm stuck with the crappy job of trying to arbitrarily do something with
    data that's not meant to be done. That's why I'm asking here if there
    is any algorithm that might be able to do that... I'm starting to think
    there's probably not, though.
     
    Mekkala, Mar 2, 2004
    #9
  10. Mekkala

    Mekkala Guest

    On Tue 02 Mar 2004 04:07:47p, "Jonathan Miller"
    kicked back with a beer, ruminated at length,
    fell asleep, woke up, lit up a joint, then fell asleep again after
    thoughtfully blurting out:
    It's quality control data. And stanines may be just what I'm looking
    for, so I'll go check that. Thanks.

    *wanders off, grumbling about unreasonable bosses*
     
    Mekkala, Mar 2, 2004
    #10
  11. Mekkala

    Mekkala Guest

    Ah. Yes, that would do it. Do you know a function that will give me
    the percentiles to use when dividing up a bell curve like that, for a
    given number of sections (bins)?
    Yeah, that's how I would do it if I had the percentiles.
    Also true. My boss doesn't care, and he pays me as long as I do what he
    says, so I don't care either, at least in this case.
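
    On the question above about a function that gives the percentiles for a given number of bins: one possible sketch slices the standard normal curve at equally spaced z-scores and reads the cumulative percentages off the normal CDF. The bin width in standard deviations (0.5 here, which happens to be the stanine convention) and the function name are assumptions for illustration, not something given in the thread.

```python
from statistics import NormalDist

def normal_bin_percentiles(n_bins, width_sd=0.5):
    """Cumulative percentiles at the n_bins - 1 boundaries between bins."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Boundaries are symmetric about 0 and width_sd apart in z-score units.
    z_cuts = [(j - n_bins / 2) * width_sd for j in range(1, n_bins)]
    return [100 * std_normal.cdf(z) for z in z_cuts]

print([round(p, 1) for p in normal_bin_percentiles(9)])
```

    For nine bins this comes out at roughly 4, 11, 23, 40, 60, 77, 89 and 96, close to the percentiles suggested in post #6. A different width_sd gives fatter or thinner tail groups.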
     
    Mekkala, Mar 2, 2004
    #11
  12. There are ways of TESTING whether data is from a bell curve. Some ways:

    1) Ozturk's Algorithm. (Not too many know about this, perhaps.)
    2) Kolmogorov-Smirnov statistic (this gets you a nice graph, but it's
    not of a bell (pdf). It's of the CDF.)
    3) A test based on the mean and median of the data.
    4) A test based on the coefficient of skewness of the data.
    5) A test based on (Geary's) kurtosis.
    6) Chi-Squared statistic.
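
    Two of the quantities in the list above are easy to compute by hand. The sketch below gives the sample skewness and the Kolmogorov-Smirnov distance between the data and a normal distribution fitted to it. It deliberately stops short of a full test: no critical values or p-values are computed, and with the mean and standard deviation estimated from the same data the standard K-S tables do not apply as is. The function names are made up for this example.

```python
from statistics import NormalDist, mean, pstdev

def sample_skewness(data):
    """Third standardized moment; near 0 for symmetric data."""
    m, s, n = mean(data), pstdev(data), len(data)
    return sum(((x - m) / s) ** 3 for x in data) / n

def ks_statistic_vs_normal(data):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), pstdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d
```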

    --
    Try http://csf.colorado.edu/pkt/pktauthors/Vienneau.Robert/Bukharin.html
    To solve Linear Programs: .../LPSolver.html
    A game: .../Keynes.html
    Whether strength of body or of mind, or wisdom, or virtue, are found
    in proportion to the power or wealth of a man is a question fit
    perhaps to be discussed by slaves in the hearing of their masters,
    but highly unbecoming to reasonable and free men in search of the
    truth. -- Rousseau
     
    Robert Vienneau, Mar 2, 2004
    #12
  13. Eric Bohlman

    Eric Bohlman Guest

    In that case, just use Photoshop or the like to draw a nice-looking graph
    with the appearance that your boss wants. Since the outcomes are all pre-
    specified, you don't need to involve the data at all.
     
    Eric Bohlman, Mar 3, 2004
    #13
  14. David Petry

    David Petry Guest

    It's been said that there are three kinds of lies: lies, damned lies,
    and statistics. You're getting perilously close to statistics here.

    Here's a way to do what you want. Say you have N data points, and
    you want to put them in M groups. The bell curve is essentially a
    binomial distribution. These values can be computed using Pascal's
    triangle, or using the formula C(m,n) = m!/( n! * (m-n)! ).

    So compute the M values C(M-1, k) for k = 0..M-1. Multiply each
    of those values by N/2^(M-1), and do a little fiddling to get integers.
    That gives you a list of M numbers c_1, c_2, ... c_M, such that the
    sum of those numbers equals N, and the numbers are very close to
    being the needed values for a bell curve.

    Then sort the data and take the first c_1 data points for your first
    group, the next c_2 points for your second group, etc. That does it!
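
    The recipe above can be sketched as follows. The "little fiddling to get integers" is done here with a largest-remainder rounding, which is only one possible choice and an assumption of this example, as are the function names.

```python
import math

def binomial_group_sizes(n_points, n_groups):
    """Group sizes proportional to C(M-1, k), rounded so they sum to n_points."""
    m = n_groups - 1
    exact = [math.comb(m, k) * n_points / 2 ** m for k in range(n_groups)]
    sizes = [math.floor(e) for e in exact]
    leftover = n_points - sum(sizes)
    # Hand the leftover points to the groups with the largest fractional parts.
    by_remainder = sorted(
        range(n_groups), key=lambda k: exact[k] - sizes[k], reverse=True
    )
    for k in by_remainder[:leftover]:
        sizes[k] += 1
    return sizes

def split_into_groups(data, sizes):
    """Sort the data and cut it into consecutive groups of the given sizes."""
    xs = sorted(data)
    groups, start = [], 0
    for size in sizes:
        groups.append(xs[start:start + size])
        start += size
    return groups

print(binomial_group_sizes(256, 9))
# [1, 8, 28, 56, 70, 56, 28, 8, 1]
```

    The value range of each group is then just the minimum and maximum of that group's slice. Note that with 9 groups the binomial tails are much thinner than in the sample set from post #1.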
     
    David Petry, Mar 3, 2004
    #14
  15. Bill

    Bill Guest

    Robert is right. The way you have set things up you could come close to
    proving that your boss is wrong - for example if the data is very skewed one
    way or has two obvious humps - but you really can't prove he is right. If it
    sorta looks bell shaped who is to say it is a normal distribution. You really
    need a normality test.

    Bill
     
    Bill, Mar 3, 2004
    #15
  16. A question: In a group like sci.math, why do you feel the need to
    proclaim yourself as an atheist in order to ask a math question?

    Which is your priority: To get an answer to the question, or to post
    your commercial for your irrational belief system?

    Jack
     
    Jack Crenshaw, Mar 4, 2004
    #16
  17. mensanator

    mensanator Guest

    Still more questions:

    How is it that a signature is off-topic?

    Is a person supposed to alter their signature based on the group
    being posted to?

    How is atheism an irrational belief system? I thought irrational
    meant belief without evidence, so theists are irrational.

    Why do you top-post?
     
    mensanator, Mar 4, 2004
    #17
  18. Glen

    Glen Guest

    It's a pretty straightforward (and mercifully short) .sig.
    It isn't racist, sexist, advocating illegal activity or otherwise
    broadly objectionable.

    Posts about .sigs are off-topic. Attacking his belief system as
    irrational is definitely off-topic.

    He did nothing wrong. You did.

    Glen
     
    Glen, Mar 5, 2004
    #18
  19. Um .. because atheism has nothing to do with bell curves?
    Um ... yes, perhaps. If (a) the message is offensive and hateful,
    as yours is, and (b) you want an answer to your question.
    Are you claiming that you have scientific evidence that God does not
    exist? That would be important news. By all means, please share it.

    If, OTOH, you do _NOT_, you are proclaiming a belief without evidence.

    Most of us have such beliefs. They're called "world views." Not many
    of us feel the need to wear them on our electronic T-shirts, as you do.
    The fact that you do so labels you as someone who feels strongly
    about a view with no supporting evidence. Ergo, irrational.
    Um .. because I choose to?

    Jack
     
    Jack Crenshaw, Mar 5, 2004
    #19
  20. It is to me.

    If he had signed his msg

    Mekkala, Homo-hater #2148,
    Mekkala, Skinhead #2148,

    or
    Mekkala, Jew-killer #2148,

    I don't think you would have been so magnanimous.

    Would you?

    Jack
     
    Jack Crenshaw, Mar 5, 2004
    #20
