How many data points to sample?

Discussion in 'Probability' started by Prof Wonmug, Sep 19, 2010.

  1. Prof Wonmug

    Prof Wonmug Guest

    I am not quite sure how to articulate my question. I am trying to
    estimate how the senses of a word are distributed. For example, the
    word "dissent" has 3 main senses:

    1. The dissent of a judge as on the Supreme Court.
    2. A difference of opinion.
    3. A protest.

    I have a database containing thousands of examples of the word in
    context. I will have researchers examine each example and assign it
    to one or more of these senses. The examples can be ambiguous. If it
    is not clear which sense was intended, the researcher will assign it
    to each sense that it could be.

    I have done some testing where 10, 50, 100, and 200 samples were
    examined. In many cases, the distributions after just 10 samples were
    quite close to those after 50, 100, or even 200. In other cases, the
    distributions varied considerably and did not "settle down" until 50
    or 100 samples or more.

    Since I am paying the researchers by the hour, I would like to be able
    to stop after 10 or 20 samples instead of 100 or 200 if the accuracy
    of the results is not likely to vary much from the additional work.

    I would also like to be able to claim some measure of the variability.

    Is there a way to measure the variability in the distributions after
    each sample?

    Suppose a word has 4 senses. After 10 samples, the distribution is

    Sense    1   2   3   4
    Count    5   2   3   0

    and after 100 it is

    Sense    1   2   3   4
    Count   47  31  21   1

    Is there a way to measure something like a confidence interval that
    these represent the actual distributions? Can I calculate the odds
    that each of these tallies is within +/- 10% of the actual relative
    frequencies?

    As I said, I may be expressing the problem all wrong. I've only had
    very elementary exposure to statistics. But I hope someone can figure
    out what I meant to say.

    Prof Wonmug, Sep 19, 2010

  2. Ray Koopman

    Ray Koopman Guest

    In both of your frequency distributions, the sum of the frequencies
    equals the number of raters. How are you handling cases where a rater
    assigns multiple meanings?

    Also, in the "dissent" example, I see 1 as just a special case of 2.
    Are you distinguishing such nested meanings from those which are not?
    Ray Koopman, Sep 19, 2010

  3. Prof Wonmug

    Prof Wonmug Guest

    Yes, sorry. I should have provided an example where a multi-colored
    ball was drawn.

    I think this is equivalent to a bag of colored balls where we are to
    estimate the relative numbers of each color by drawing a random ball
    without replacement. For the "dissent" case, we would have three
    colors (1, 2, & 3) with some multi-colored balls.

    If the balls drawn are:

    Ball   Color(s)   Tallies
                      1  2  3
      1    3          0  0  1
      2    1          1  0  1
      3    1 2        2  1  1
      4    2          2  2  1
      5    2          2  3  1
      6    3          2  3  2
      7    1 2        3  4  2
      8    2          3  5  2
      9    1 2        4  6  2
     10    2          4  7  2

    I drew 10 balls, but I have 13 tallies. I would calculate the relative
    frequencies as 4/13 = 0.3077, 7/13 = 0.5385, & 2/13 = 0.1538.

    Is that part correct?

    Whether senses are nested is a linguistic question and outside the
    scope of the math. A linguist will decide what the senses are. None
    are nested. For "dissent", the linguist identified three senses. Our
    job is to assign each ball to one or more of those.
    Prof Wonmug, Sep 19, 2010
  4. Ray Koopman

    Ray Koopman Guest

    The denominator should probably be 10 rather than 13. But the bigger
    question is what the numerators should be. If you are interested in
    only the proportions of balls that have each color on them, regardless
    of what other colors they may or may not have, then the tallies you
    show will suffice. But if you are interested in more complicated
    questions, such as "what proportions of balls have red or blue but
    not yellow" then you will need to keep track of all 2^k - 1 possible
    patterns, where k is the number of different colors.
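
    As a minimal sketch of what tracking whole patterns could look like,
    the following records each of the ten example balls from earlier in
    the thread as a set of colors and counts the patterns rather than the
    individual colors. The helper name `proportion` is made up for this
    illustration.

```python
from collections import Counter

# Each drawn ball is recorded as the set of colors on it
# (the ten example balls listed earlier in the thread).
balls = [{3}, {1}, {1, 2}, {2}, {2}, {3}, {1, 2}, {2}, {1, 2}, {2}]

# Count whole patterns rather than individual colors.
pattern_counts = Counter(frozenset(b) for b in balls)

k = 3                      # number of distinct colors
max_patterns = 2**k - 1    # every non-empty subset of the k colors


def proportion(predicate):
    """Share of balls whose color pattern satisfies the predicate."""
    hits = sum(n for pattern, n in pattern_counts.items() if predicate(pattern))
    return hits / len(balls)


# "Color 1 or color 2, but not color 3"
p = proportion(lambda s: (1 in s or 2 in s) and 3 not in s)
print(max_patterns, p)
```

    Per-color tallies alone cannot answer a question like this one,
    because they do not record which colors appeared together.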

    How you decide when to stop sampling will depend on the set of
    questions you want to ask, the sizes of the errors you can tolerate,
    and the degree of confidence you want that the errors are tolerable.
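
    One possible way to turn "tolerable error" and "degree of confidence"
    into a stopping check for a single sense is a binomial confidence
    interval. The sketch below uses the Wilson score interval, which is
    not mentioned in this thread and is swapped in here as an assumption.
    The names `wilson_interval` and `can_stop` and the 0.10 tolerance are
    illustrative only, and the examples are treated as independent random
    draws.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(hits, n, confidence=0.90):
    """Wilson score interval for the proportion hits/n (an assumed method)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def can_stop(hits, n, max_half_width, confidence=0.90):
    """True once the interval is no wider than +/- max_half_width."""
    lo, hi = wilson_interval(hits, n, confidence)
    return (hi - lo) / 2 <= max_half_width


# Sense 1 in the earlier example: 5 of 10 samples, then 47 of 100.
print(wilson_interval(5, 10), can_stop(5, 10, 0.10))
print(wilson_interval(47, 100), can_stop(47, 100, 0.10))
```

    The confidence level is a plain parameter, so different settings can
    be tried to see how quickly the interval narrows.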
    Ray Koopman, Sep 20, 2010
  5. Prof Wonmug

    Prof Wonmug Guest

    If we change the denominator to 10, then we have to change the
    numerator as well, or the percentages will exceed 100.

    4/10 + 7/10 + 2/10 = 13/10

    The proportion of balls that show each color is, I think, all I care
    about; hence, the 13 in the denominator. I don't think I care about
    the colors a ball is NOT, unless there is something that I don't
    understand.

    As for when to stop sampling, I think we'll need to get back to word
    senses. All I care about is getting a reasonable estimate of the
    relative incidence of each word sense. There are many subjective
    aspects to the numbers, so I do not need anything like 99% confidence.
    I don't know if I need 95%, 90%, 80%, or even lower. I was hoping for
    a formula with the confidence level as a parameter so I could try a
    few settings and see how it goes.

    I can live with estimated tallies that are within +/- 30% of the
    actual tallies -- or the tallies that I would get if I sampled
    thousands of examples. I could probably live with +/- 50% or more,
    because I can always go back later and do more sampling.

    I am paying people by the hour to do this research, so I'd like to be
    able to stop sampling as soon as possible -- even at the expense of
    some accuracy.

    A word may have 1-n senses, where n could be as large as 50, but will
    rarely exceed 8-10. I simply want to know how often a sense occurs in
    a large sample. By occurs, I mean is either the only sense or one of
    multiple senses.
    Prof Wonmug, Sep 20, 2010
  6. Ray Koopman

    Ray Koopman Guest

    Percentages are required to sum to 100 only when the events to
    which they refer are mutually exclusive and jointly exhaustive.
    In this case the colors are jointly exhaustive but they are not
    mutually exclusive, so the percentages don't need to sum to 100.

    No, you should divide by 10. 40% of the balls had color 1 on them,
    70% had color 2, 20% had color 3.

    To estimate how often a sense occurs, you divide by (the equivalent
    of) 10, not 13.
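
    The difference between the two denominators can be sketched with the
    ten example balls from earlier in the thread. The variable names are
    illustrative only.

```python
# The ten example balls, each recorded as the set of colors on it.
balls = [{3}, {1}, {1, 2}, {2}, {2}, {3}, {1, 2}, {2}, {1, 2}, {2}]

n_balls = len(balls)                                       # 10
counts = {c: sum(c in b for b in balls) for c in (1, 2, 3)}
n_tallies = sum(counts.values())                           # 13

# Share of balls showing each color: need not sum to 1,
# because the colors are not mutually exclusive.
share_of_balls = {c: counts[c] / n_balls for c in counts}

# Share of all tallies: sums to 1, but is not "how often a color occurs".
share_of_tallies = {c: counts[c] / n_tallies for c in counts}

print(share_of_balls)
print(share_of_tallies)
```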

    I think what's complicating things is that you've mentioned both
    absolute and relative incidence. Absolute incidence is more basic:
    if you know it then you can get relative incidence, but you can't
    go the other way. So even if you care about only relative incidence,
    you still need to know the absolute incidence because it shows up
    in the formulas for standard errors. The relative incidence of things
    is harder to estimate accurately when their absolute incidences are
    low than when they are high. Also, I suspect that no matter whether
    you compare absolute incidences by looking at their ratio or their
    difference, you're going to need a count of the number of joint
    occurrences in order to estimate the error in the comparison.
    Ray Koopman, Sep 21, 2010
  7. Prof Wonmug

    Prof Wonmug Guest

    OK, thanks.
    Prof Wonmug, Sep 22, 2010
  8. Ray Koopman

    Ray Koopman Guest

    For any given word, and any two senses x and y, let
    a = # of people who assign both x and y to the word,
    b = # of people who assign x but not y,
    c = # of people who assign y but not x.

    Then the sample relative incidence of x to y for that word is
    r = (a+b)/(a+c), and large-sample confidence limits for the true r
    are r/s and r*s, where s = exp( z * sqrt( (b+c)/((a+b)(a+c)) ) ),
    and z is the value in a standard normal distribution that has
    (100 - confidence level)/2 % of the distribution above it
    (e.g., for 95% confidence use z = 1.96; for 90% confidence use 1.645).

    Note that the total # of people who rated the word does not appear in
    the formulas. Also, part of the lore in this field is that if you add
    1 to the observed counts a, b, and c then the resulting estimates of
    r and its confidence limits will generally be better -- although they
    will be a little biased, that will be more than made up for by a
    reduction in their standard errors.
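
    The formulas above can be sketched as a small function. The function
    name and the `add_one` flag are made up for this illustration, and
    the counts in the usage line come from colors 1 and 2 of the ten
    example balls earlier in the thread (a = 3, b = 1, c = 4), which is
    far too few for a large-sample formula to be trusted.

```python
from math import exp, sqrt
from statistics import NormalDist


def relative_incidence_ci(a, b, c, confidence=0.95, add_one=False):
    """Return (r, lower, upper) for the relative incidence of x to y.

    a = # assigning both x and y, b = # assigning x but not y,
    c = # assigning y but not x.
    """
    if add_one:
        # The "add 1 to each observed count" adjustment described above.
        a, b, c = a + 1, b + 1, c + 1
    # z has (100 - confidence level)/2 % of the standard normal above it.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    r = (a + b) / (a + c)
    s = exp(z * sqrt((b + c) / ((a + b) * (a + c))))
    return r, r / s, r * s


print(relative_incidence_ci(3, 1, 4))
print(relative_incidence_ci(3, 1, 4, add_one=True))
```

    Without the adjustment the function divides by zero when a + b or
    a + c is zero, so a sense that has not been observed yet needs
    separate handling.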
    Ray Koopman, Sep 25, 2010
