Group Means / Extremely Large Sample...

Discussion in 'Scientific Statistics Math' started by tripp, Nov 12, 2010.

  1. tripp

    tripp Guest

    Hello Folks,

    I'm working with an extremely large satellite image dataset that
    contains millions of records. The sample can be partitioned into 3
    groups (< 300 meters, >= 300 to < 3000 meters, and >= 3000 meters),
    with each group containing an extremely large number of samples. Are
    ANOVAs appropriate for extremely large datasets? Is there a better
    way to assess similarity within big datasets? I welcome your advice.
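    (For the mechanics, a one-way ANOVA on millions of records is no
    problem computationally; the sketch below uses simulated data in
    place of the real satellite measurements, and all the sizes, means,
    and bin labels are made up.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical stand-ins for the three elevation bins (< 300 m,
# 300-3000 m, >= 3000 m), each with a very large N and a deliberately
# tiny difference in group means.
low  = rng.normal(loc=10.00, scale=5.0, size=1_000_000)
mid  = rng.normal(loc=10.02, scale=5.0, size=1_000_000)
high = rng.normal(loc=10.04, scale=5.0, size=1_000_000)

# One-way ANOVA runs fine at this scale; the hard part is
# interpretation, not computation.
f_stat, p_value = stats.f_oneway(low, mid, high)
print(f"F = {f_stat:.1f}, p = {p_value:.3g}")
```

    Even with mean differences of only 0.02 against a standard deviation
    of 5, the test comes back significant at this N, which is exactly the
    interpretive problem the replies below get into.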

    tripp, Nov 12, 2010

  2. Bruce Weaver

    Bruce Weaver Guest

    With very large samples, very small differences between groups that
    are not practically important will be statistically significant. This
    is probably what you're getting at. So you should probably forget
    about significance tests here, and instead focus on how large group
    differences have to be in order to be practically important. This is
    not a statistical question--to answer it, you need substantive
    knowledge of the research field.
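    (To put a number on "practically important": an effect-size measure
    such as Cohen's d stays on an interpretable scale no matter how large
    N gets. A minimal sketch with simulated data, where the group means
    and sizes are invented for illustration:)

```python
import numpy as np

rng = np.random.default_rng(0)

# Two huge groups whose means differ by a practically negligible 0.01.
a = rng.normal(0.00, 1.0, 2_000_000)
b = rng.normal(0.01, 1.0, 2_000_000)

# Cohen's d: standardized mean difference, a scale for practical size
# that does not inflate with N the way a p-value shrinks with it.
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (b.mean() - a.mean()) / pooled_sd
print(f"d = {d:.4f}")  # around 0.01, far below the ~0.2 "small" benchmark
```

    A t-test on these two groups would be highly significant, yet d tells
    you the difference is trivial; how large d must be to matter is the
    substantive, not statistical, question.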

    Bruce Weaver, Nov 12, 2010

  3. Rich Ulrich

    Rich Ulrich Guest

    And if you want tests, ANOVAs are as appropriate or
    useful as anything else, and, moreover, there isn't any other
    testing that doesn't face the same challenges. If you want to insist
    on a "very large effect", you could pre-determine that you would
    insist on a p-value of 0.001 or even smaller. However, with
    Very Large Ns (VLNs), there is a secondary, statistical problem
    hiding behind what Bruce says about looking for "practical"
    importance.
    A regression is unbiased - or an ANOVA is a fair test, since
    the two procedures are mathematically the same - when
    the suitable, "relevant factors" are all included. We ignore
    hundreds of small-but-potentially-relevant factors "all the
    time" -- because they are small enough (or, so we figure) to
    have no effect on the outcome. For VLN, that will not be
    true, since the effects that the ANOVA procedure *might*
    detect have become even smaller. One adjustment for
    this is to conduct tests with an error term that reflects the
    size of biases: Some Between term (preferably with a
    large number of d.f.) can be used as the error to test
    against, rather than the Within term.
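    (One common way to get an error term of that sort is to aggregate to
    a coarser unit, say per-image means, and run the ANOVA on those, so
    the error reflects image-to-image variation rather than pixel noise.
    A sketch with simulated data; the unit structure, counts, and effect
    sizes are all invented:)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_images_per_group = 200   # plenty of d.f. for the error term
pixels_per_image = 5_000

groups = []
for shift in (0.00, 0.01, 0.02):     # three hypothetical elevation bins
    # Per-image random effects stand in for the small unmodeled biases.
    image_effects = rng.normal(shift, 0.05, n_images_per_group)
    image_means = [rng.normal(mu, 1.0, pixels_per_image).mean()
                   for mu in image_effects]
    groups.append(image_means)

# ANOVA on the unit means: the denominator now reflects between-image
# variation, not just the (tiny, at VLN) within-image error.
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

    The test is now conservative with respect to those small image-level
    biases, which is the point of swapping in a Between term as the error.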

    Other thoughts:
    1) I don't know what uses you are making of satellite images.
    Are there 1000s of hypotheses? - The "multi-test problem"
    is another big challenge for data-mining.

    2) In the old days, a big N meant that it would take a long
    time to run a single analysis. For simple hypotheses, one
    intelligent time-saver was to pick a random subset of the
    original data to develop models and tests, or even to do the
    original testing. The whole set could be used for validation.
    That's still possible, but computer speed has reduced the need.
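    (The subset-then-validate workflow is a couple of lines; the dataset
    and sizes here are stand-ins, not the poster's records:)

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for the full dataset.
full = rng.normal(0.0, 1.0, 5_000_000)

# Develop models and tests on a random subset...
idx = rng.choice(full.size, size=50_000, replace=False)
subset = full[idx]

# ...keeping the full set in reserve for validation.
print(f"subset mean {subset.mean():.4f} vs full mean {full.mean():.4f}")
```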

    One proper approach to some datamining comes back to
    sampling for a different reason: Crossvalidation is the method
    used to validate after too many tests have been run.
    Further, instead of random sampling, it can be strategic
    to use various, specific subsamples. Showing that the test
    will validate in multiple, distinctly defined samples is another
    way to try to control for those tiny, unknown sources of bias.
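    (A bare-bones version of that idea: fit the same model on several
    subsamples and check that the estimate holds up across them. The
    simulated regression below is illustrative only; strategic
    subsamples, by region, sensor, or date, would replace the random
    split:)

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 100_000)
y = 0.5 * x + rng.normal(0, 1, 100_000)

# Plain 5-fold split of the indices.
order = rng.permutation(x.size)
folds = np.array_split(order, 5)

slopes = []
for k in range(5):
    # Fit on everything except fold k.
    train = np.concatenate([f for i, f in enumerate(folds) if i != k])
    slope = np.polyfit(x[train], y[train], 1)[0]
    slopes.append(slope)

print([round(s, 3) for s in slopes])  # a stable slope across folds
```

    If the estimate wanders across distinctly defined subsamples, that is
    a warning about exactly the tiny, unknown biases mentioned above.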

    3) Another advantage of large Ns is that they contain
    examples of data with tiny chances of existence. Is a rare
    value an outlier to be labeled and discarded, or is it the
    gold that you were looking for?
    Rich Ulrich, Nov 12, 2010
