# How many data points to sample?

Discussion in 'Probability' started by Prof Wonmug, Sep 19, 2010.

1. ### Prof Wonmug

I am not quite sure how to articulate my question. I am trying to
estimate how the senses of a word are distributed. For example, the
word "dissent" has 3 main senses:

1. The dissent of a judge as on the Supreme Court.
2. A difference of opinion.
3. A protest.

I have a database containing thousands of examples of the word in context. I
will have researchers examine each example and assign it to one or
more of these senses. The examples can be ambiguous. If it is not
clear which sense was intended, the researcher will assign it to each
sense that it could be.

I have done some testing where 10, 50, 100, and 200 samples were
examined. In many cases, the distributions after just 10 samples were
quite close to those after 50, 100, or even 200. In other cases, the
distributions varied considerably and did not "settle down" until 50
or 100 samples or more.

Since I am paying the researchers by the hour, I would like to be able to
stop after 10 or 20 samples instead of 100 or 200 if the accuracy of
the results is not likely to vary much from the additional work.

I would also like to be able to claim some measure of the variability.

Is there a way to measure the variability in the distributions after
each sample?

Suppose a word has 4 senses. After 10 samples, the distribution is

| Sense | 1 | 2 | 3 | 4 |
|-------|---|---|---|---|
| Count | 5 | 2 | 3 | 0 |

and after 100 it is

| Sense | 1  | 2  | 3  | 4 |
|-------|----|----|----|---|
| Count | 47 | 31 | 21 | 1 |

Is there a way to measure something like a confidence interval that
these represent the actual distributions? Can I calculate the odds
that each of these tallies is within +/- 10% of the actual relative
frequency?

As I said, I may be expressing the problem all wrong. I've only had
very elementary exposure to statistics. But I hope someone can figure
out what I meant to say.

Thanks

Prof Wonmug, Sep 19, 2010

2. ### Ray Koopman

In both of your frequency distributions, the sum of the frequencies
equals the number of raters. How are you handling cases where a rater
assigns multiple meanings?

Also, in the "dissent" example, I see 1 as just a special case of 2.
Are you distinguishing such nested meanings from those which are not?

Ray Koopman, Sep 19, 2010

3. ### Prof Wonmug

Yes, sorry. I should have provided an example where a multi-colored
ball was drawn.

I think this is equivalent to a bag of colored balls where we are to
estimate the relative numbers of each color by drawing a random ball
without replacement. For the "dissent" case, we would have three
colors (1, 2, & 3) with some multi-colored balls.

If the balls drawn are:

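| Ball | Color(s) | Running tally: color 1 | color 2 | color 3 |
|------|----------|------------------------|---------|---------|
| 1    | 3        | 0                      | 0       | 1       |
| 2    | 1        | 1                      | 0       | 1       |
| 3    | 1, 2     | 2                      | 1       | 1       |
| 4    | 2        | 2                      | 2       | 1       |
| 5    | 2        | 2                      | 3       | 1       |
| 6    | 3        | 2                      | 3       | 2       |
| 7    | 1, 2     | 3                      | 4       | 2       |
| 8    | 2        | 3                      | 5       | 2       |
| 9    | 1, 2     | 4                      | 6       | 2       |
| 10   | 2        | 4                      | 7       | 2       |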

I drew 10 balls, but I have 13 tallies. I would calculate the relative
frequencies as 4/13 = 0.3077, 7/13 = 0.5385, & 2/13 = 0.1538.

Is that part correct?

Whether senses are nested is a linguistic question and outside the scope
of the math. A
linguist will decide what the senses are. None are nested. For
"dissent", the linguist identified three senses. Our job is to assign
each ball to one or more of those.

Prof Wonmug, Sep 19, 2010

4. ### Ray Koopman

The denominator should probably be 10 rather than 13. But the bigger
question is what the numerators should be. If you are interested in
only the proportions of balls that have each color on them, regardless
of what other colors they may or may not have, then the tallies you
show will suffice. But if you are interested in more complicated
questions, such as "what proportions of balls have red or blue but
not yellow" then you will need to keep track of all 2^k - 1 possible
patterns, where k is the number of different colors.
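
The bookkeeping for such compound questions can be sketched as follows. This is only an illustration in Python, assuming each ball is recorded as the set of colors on it; the `draws` list restates the ten balls from the example in this thread, and the helper name `proportion` is made up for the sketch.

```python
from collections import Counter

# The colors on each of the 10 balls drawn in the example, in draw order.
draws = [{3}, {1}, {1, 2}, {2}, {2}, {3}, {1, 2}, {2}, {1, 2}, {2}]

# Count each distinct combination of colors that actually occurred
# (at most 2**k - 1 non-empty patterns for k colors).
patterns = Counter(frozenset(d) for d in draws)

def proportion(predicate):
    """Fraction of balls whose color pattern satisfies `predicate`."""
    hits = sum(n for pattern, n in patterns.items() if predicate(pattern))
    return hits / sum(patterns.values())

# Compound question: balls that have color 1 or color 2, but not color 3.
p_1_or_2_not_3 = proportion(lambda pat: (1 in pat or 2 in pat) and 3 not in pat)
print(p_1_or_2_not_3)
```

Per-color tallies alone cannot answer a question like this one, because they do not record which colors appeared together on the same ball.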

How you decide when to stop sampling will depend on the set of
questions you want to ask, the sizes of the errors you can tolerate,
and the degree of confidence you want that the errors are tolerable.

Ray Koopman, Sep 20, 2010

5. ### Prof Wonmug

If we change the denominator to 10, then we have to change the
numerator as well, or the percentages will exceed 100.

4/10 + 7/10 + 2/10 = 13/10

The proportion of balls that have each color on them is, I think, all I
care about; hence, the 13 in the denominator. I don't think I care about
the colors a ball is NOT, unless there is something that I don't
understand.

To speak coherently about when to stop sampling, I think we'll need to
get back to word
senses. All I care about is getting a reasonable estimate of the
relative incidence of each word sense. There are many subjective
aspects to the numbers, so I do not need anything like 99% confidence.
I don't know if I need 95%, 90%, 80%, or even lower. I was hoping for
a formula with the confidence level as a parameter so I could try a
few settings and see how it goes.

I can live with estimated tallies that are within +/- 30% of the
actual tallies -- or the tallies that I would get if I sampled
thousands of examples. I could probably live with +/- 50% or more,
because I can always go back later and do more sampling.

I am paying people by the hour to do this research, so I'd like to be
able to stop sampling as soon as possible -- even at the expense of
some accuracy.

A word may have anywhere from 1 to n senses, where n could be as large as
50 but will rarely exceed 8-10. I simply want to know how often a sense
occurs in a large sample. By "occurs", I mean that it is either the only
sense assigned or one of several.

Prof Wonmug, Sep 20, 2010

6. ### Ray Koopman

Percentages are required to sum to 100 only when the events to
which they refer are mutually exclusive and jointly exhaustive.
In this case the colors are jointly exhaustive but they are not
mutually exclusive, so the percentages don't need to sum to 100.

No, you should divide by 10. 40% of the balls had color 1 on them.
To estimate that, you divide by (the equivalent of) 10, not 13.
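
As a small worked illustration of dividing by the number of balls, here is a Python sketch using the final tallies from the ten-ball example (4, 7, and 2); the variable names are chosen only for this sketch.

```python
# Final tallies for colors 1, 2, 3 after the 10 balls in the example.
tallies = {1: 4, 2: 7, 3: 2}
n_balls = 10

# Absolute incidence: the share of balls that carry each color.
incidence = {color: count / n_balls for color, count in tallies.items()}

# Multi-colored balls are counted once per color they carry, so these
# shares are allowed to sum to more than 1.
total_share = sum(incidence.values())

print(incidence)
print(total_share)
```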

I think what's complicating things is that you've mentioned both
absolute and relative incidence. Absolute incidence is more basic:
if you know it then you can get relative incidence, but you can't
go the other way. So even if you care about only relative incidence,
you still need to know the absolute incidence because it shows up
in the formulas for standard errors. The relative incidence of things
is harder to estimate accurately when their absolute incidences are
low than when they are high. Also, I suspect that no matter whether
you compare absolute incidences by looking at their ratio or their
difference, you're going to need a count of the number of joint
occurrences in order to estimate the error in the comparison.
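
For a single absolute incidence, one common way to attach an interval with the confidence level as a parameter is the Wilson score interval. It is not something proposed in this thread; it is offered here as a hedged sketch, assuming the sampled examples are independent draws and that each tally counts the examples carrying that sense out of `n`. The tallies are the 100-sample figures from the first post, used purely for illustration.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(count, n, confidence=0.95):
    """Wilson score interval for the proportion count/n."""
    # z has (1 - confidence)/2 of the standard normal distribution above it.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = count / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

tallies = [47, 31, 21, 1]   # counts per sense, from the first post
n = 100                     # number of examples examined

for sense, count in enumerate(tallies, start=1):
    low, high = wilson_interval(count, n, confidence=0.90)
    print(f"sense {sense}: {count / n:.2f}  90% interval ({low:.3f}, {high:.3f})")
```

Running it with different `confidence` values shows how the interval for a rare sense stays wide relative to its estimate, while the intervals for the common senses tighten as `n` grows.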

Ray Koopman, Sep 21, 2010

7. ### Prof Wonmug

OK, thanks.

Prof Wonmug, Sep 22, 2010

8. ### Ray Koopman

For any given word, and any two senses x and y, let
a = # of people who assign both x and y to the word,
b = # of people who assign x but not y,
c = # of people who assign y but not x.

Then the sample relative incidence of x to y for that word is
r = (a+b)/(a+c), and large-sample confidence limits for the true r
are r/s and r*s, where s = exp( z * sqrt( (b+c)/((a+b)(a+c)) ) ),
and z is the value in a standard normal distribution that has
(100 - confidence level)/2 % of the distribution above it
(e.g., for 95% confidence use z = 1.96; for 90% confidence use 1.645).

Note that the total # of people who rated the word does not appear in
the formulas. Also, part of the lore in this field is that if you add
1 to the observed counts a, b, and c then the resulting estimates of
r and its confidence limits will generally be better -- although they
will be a little biased, that will be more than made up for by a
reduction in their standard errors.
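
The formulas above translate directly into a short Python sketch. The function name and the `add_one` flag are made up for illustration, and the sample counts come from senses 1 and 2 in the ten-ball example earlier in the thread (3 balls with both colors, 1 with color 1 only, 4 with color 2 only).

```python
from math import exp, sqrt
from statistics import NormalDist

def relative_incidence_ci(a, b, c, confidence=0.95, add_one=False):
    """Return (r, lower, upper) for the relative incidence of x to y.

    a = count assigned both x and y, b = x but not y, c = y but not x.
    """
    if add_one:
        # The "add 1 to each observed count" adjustment mentioned above.
        a, b, c = a + 1, b + 1, c + 1
    if a + b == 0 or a + c == 0:
        raise ValueError("each sense must occur at least once")
    # z has (1 - confidence)/2 of the standard normal distribution above it.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    r = (a + b) / (a + c)
    s = exp(z * sqrt((b + c) / ((a + b) * (a + c))))
    return r, r / s, r * s

r, lower, upper = relative_incidence_ci(3, 1, 4, confidence=0.95)
print(f"r = {r:.3f}, 95% limits ({lower:.3f}, {upper:.3f})")

r1, lower1, upper1 = relative_incidence_ci(3, 1, 4, confidence=0.95, add_one=True)
print(f"with add-one: r = {r1:.3f}, 95% limits ({lower1:.3f}, {upper1:.3f})")
```

With only ten examples the limits are wide, which is the practical signal for whether more sampling is worth paying for.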

Ray Koopman, Sep 25, 2010