abnormal variation during bootstrap

Discussion in 'Scientific Statistics Math' started by Jinsong Zhao, Nov 27, 2010.

  1. Jinsong Zhao

    Jinsong Zhao Guest

    Hi there,

    I draw a subsample of size m from a sample of size n (n > 50),
    with or without replacement. Then I draw a resample of size 100 from
    that subsample and compute the 5% quantile. I repeat this procedure
    10000 times to get a 95% confidence interval for the 5% quantile.
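
    For readers who want to reproduce the procedure, here is a minimal
    sketch in Python (standard library only). The function names, the
    lognormal test data and the choice to draw the inner resample of 100
    with replacement are assumptions made for illustration; the post does
    not specify them. The quantile rule is the linear-interpolation one
    that R's quantile() uses by default.

```python
import math
import random


def quantile_type7(values, p):
    """Sample quantile by linear interpolation (R's default rule, type 7)."""
    xs = sorted(values)
    h = (len(xs) - 1) * p          # fractional 0-based position
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])


def nested_bootstrap_q05(sample, m, n_rep=10000, inner_size=100,
                         replace=False, rng=None):
    """Return n_rep replicates of the 5% quantile for subsample size m."""
    rng = rng or random.Random()
    reps = []
    for _ in range(n_rep):
        # Step 1: subsample of size m from the original sample.
        if replace:
            sub = rng.choices(sample, k=m)
        else:
            sub = rng.sample(sample, m)
        # Step 2 (assumed with replacement): 100 draws from the subsample.
        inner = rng.choices(sub, k=inner_size)
        reps.append(quantile_type7(inner, 0.05))
    return reps


def percentile_interval(reps, level=0.95):
    """Percentile interval taken from the bootstrap replicates."""
    alpha = (1 - level) / 2
    return quantile_type7(reps, alpha), quantile_type7(reps, 1 - alpha)


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.lognormvariate(0, 1) for _ in range(60)]  # made-up data
    for m in (5, 10, 20, 40, 60):
        reps = nested_bootstrap_q05(data, m, n_rep=2000, rng=rng)
        print(m, percentile_interval(reps))
```

    Running the loop over every m from 2 to n and plotting the interval
    endpoints against m gives the kind of figure described in the post.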

    With increasing m (from 2 to n), the variation of the 5% quantile
    decreases. When plotting the 5% quantile obtained for m from 2 to n in
    a figure, I always see several points with higher values than their
    neighbors at m of about 20.

    I would like to know whether this variation is reasonable, or what
    causes it.

    I would really appreciate any suggestions and comments.

    Thanks in advance.

    Best regards,
    Jinsong
     
    Jinsong Zhao, Nov 27, 2010
    #1

  2. Rich Ulrich

    Rich Ulrich Guest

    If I follow correctly, you are doing something that I have never
    heard of, and which seems like a poor idea now that you mention it.
    But the multiple uses of the word "sample" do leave me wondering
    if I read it wrong.

    For a set of numbers of size n (let me take it, for a first case,
    as n=60) you draw a sample of size m (let me take it, for this time,
    as m=10). And then, from this m=10, you draw a bootstrap
    sample of size 100 in order to compute the 5% quantile.

    a) I have never considered drawing a bootstrap sample that was
    larger than what I was sampling from. - I just don't imagine how it
    would be worthwhile, ignoring the problems that could arise.
    b) There are about 4 definitions of Percentile that are in use; how
    you adapt that to defining a 5% quantile when your data consists
    of (an average of) 10 ties for every value is problematic.

    My guess would be that it has to do with 5% times 20 = 100%.
    And how you have chosen to define the quantile.
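
    One way to look at this guess, as a sketch rather than a diagnosis:
    with 100 resampled values and the linear-interpolation rule that R's
    quantile() uses by default, the 5% quantile sits at ordered position
    1 + 99 * 0.05 = 5.95, between the 5th and 6th smallest values. Each of
    the m subsample values appears about 100/m times in the resample, so
    near m = 20 the smallest value fills roughly five slots and the
    quantile can tip either way. The function name below is hypothetical,
    and the calculation assumes the m subsample values are distinct.

```python
from math import comb


def prob_min_at_least(m, k=6, draws=100):
    """P(one given value is drawn at least k times in `draws` draws
    with replacement from m equally likely distinct values)."""
    p = 1.0 / m
    below = sum(comb(draws, j) * p**j * (1 - p)**(draws - j)
                for j in range(k))
    return 1.0 - below


# Chance that the subsample minimum occupies ranks 1-6 of the resample,
# which makes the interpolated 5% quantile equal to that minimum.
for m in (5, 10, 15, 20, 25, 30, 40):
    print(f"m={m:2d}  expected copies={100 / m:5.1f}  "
          f"P(quantile == subsample min)={prob_min_at_least(m):.3f}")
```

    The probability falls from near 1 for small m to near 0 for large m,
    and the changeover happens around m = 20.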

    The efficient way to define a CI on the 5% quantile uses ranks
    of the original sample. Or of a subsample, which will give a larger
    range. If you are just practicing to see how bootstrapping behaves,
    or if you want to know what the CI actually is, you should find the
    true, minimal CI from the ranks.
     
    Rich Ulrich, Nov 29, 2010
    #2

  3. Jinsong Zhao

    Jinsong Zhao Guest

    Sorry for my poor English.
    You are right; that is the procedure I described in my previous post.
    It's not my idea; I just followed a paper published in
    "Environmental Toxicology and Chemistry" (DOI:
    10.1002/etc.5620190233). In the field of ecological risk assessment,
    bootstrapping with a similar procedure seems to be popular.
    I don't know the 4 definitions you mentioned here. I just use the
    function quantile() in R (http://www.r-project.org) to compute the 5%
    quantile. The implementation of quantile() in R is based on the paper
    by Hyndman and Fan (1996) (http://www.jstor.org/stable/2684934). For
    data consisting of (an average of) 10 ties for every value, the 5% or
    95% quantile is the same.
    Maybe you are right; I also saw a few jumps at m of about 40.
    I really appreciate your suggestions; however, I don't think I entirely
    understand what you said. If possible, would you please give me a more
    detailed explanation, or point me to some references?

    Thanks again.

    Best regards,
    Jinsong
     
    Jinsong Zhao, Nov 29, 2010
    #3
  4. Rich Ulrich

    Rich Ulrich Guest

    - I should have added a comment before - I am not an expert
    at bootstrap, at all. But I've browsed a couple of books. I can
    say that I was discouraged by the number of ways that the user
    could go wrong. My only use of bootstrapping has been an
    indirect one, where certain programs have used bootstrap
    evaluation as part of their ordinary algorithm.


    jz>
    ru > >
    I don't have a jstor subscription. I do see from the intro-page that
    the paper by Hyndman and Fan seems to be an excellent one. They
    are exploring multiple definitions of quantiles, to find which ones
    work best for which purposes. All (apparently) are rank-based
    procedures, and so that must be what is used in R. H&F start out
    by defining precise quantiles as being described in terms of the
    weighted proportions of the ranked values observed above and
    below a point.


    jz (continued) >
    ? I don't follow that statement. The low end matches the high end?

    That would be consistent.
    Okay, I don't approve of using bootstrap to find quantiles, if
    the procedure is only rearranging numbers and counting the
    extremes. Your method seems limited to that. The alternative
    would be something that had to build in a parametric step, say,
    by assuming "normality" for subsamples, and estimating "normal
    deviations" from those. You don't seem to be describing anything
    with that.

    The *whole* of the information that is incorporated in those
    bootstraps already exists in the rank-order scores for points;
    you do not "gain information" by re-sampling the same points.
    If you have a sample of 10, there's a really good chance that your
    5-percentile point is lower than any value observed. You cannot
    change that by drawing 10 copies of each of the 10, any more
    than a political survey could interview 10 people and use every
    response 100 times in order to have a "sample" of 1000.

    As I mentioned, the highly efficient way to set a 95% (or some
    other) CI on a percentile is to use the ranks directly, and call on
    the beta distribution. I think Conover, "Applied Nonparametric
    Statistics" covers the method.
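
    A rough sketch of the rank-based idea, using binomial sums (which are
    equivalent to the beta-distribution form): for a continuous sample of
    size n, the number of observations at or below the true p-quantile is
    Binomial(n, p), so the coverage of an interval between two order
    statistics can be computed exactly. The function names and the
    brute-force search for the narrowest rank pair are illustrative
    assumptions, not necessarily the procedure in Conover's book.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); returns 0 when k < 0."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(0, k + 1))


def coverage(r, s, n, p):
    """P(X_(r) <= q_p < X_(s)) for 1-based ranks r < s, no ties assumed."""
    return binom_cdf(s - 1, n, p) - binom_cdf(r - 1, n, p)


def rank_interval(n, p=0.05, level=0.95):
    """Narrowest rank pair (r, s) whose coverage reaches `level`,
    returned as (r, s, coverage), or None if no pair reaches it."""
    best = None
    for r in range(1, n):
        for s in range(r + 1, n + 1):
            c = coverage(r, s, n, p)
            if c >= level:
                if best is None or (s - r) < (best[1] - best[0]):
                    best = (r, s, c)
                break  # a larger s only widens the interval for this r
    return best


for n in (20, 60, 100):
    print(n, rank_interval(n))
```

    For small n the search returns None: no pair of observed values
    brackets the 5% quantile with 95% confidence, because the quantile
    too often lies below the smallest observation.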

    According to my hand-calculator figuring --
    If you have a sample of 58, the 5-percentile-point is something
    a bit *smaller* than its smallest observed value; if you have
    59, the smallest value may be taken as a 95% lower-limit for
    the estimate of the 5-percentile point.
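
    One reading of these hand-calculator figures, sketched in Python: the
    chance that all n independent observations lie above the true 5%
    quantile is 0.95^n, and the smallest value works as a one-sided 95%
    lower limit once that chance drops to 5% or less. The function name
    is hypothetical; n = 10 is included to illustrate the earlier remark
    about a sample of 10.

```python
import math


def prob_all_above(n, p=0.05):
    """P(every one of n independent observations exceeds the p-quantile)."""
    return (1 - p) ** n


for n in (10, 58, 59):
    print(n, round(prob_all_above(n), 4))

# Smallest n for which the sample minimum falls at or below the 5%
# quantile with probability of at least 95%.
n_min = math.ceil(math.log(0.05) / math.log(1 - 0.05))
print(n_min)
```

    This gives n_min = 59, in line with the figures above: at n = 58 the
    probability is still slightly above 5%, and at n = 59 it is below.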

    I do not see how anyone can pretend to generate useful
    estimates of the 95% CI for the 5-percentile value, when starting
    with samples or subsamples that are tiny compared to 60.


    As far as the practice in ecology goes ... If they are doing
    what you say they are doing, I hope that they describe it
    with great caution and many warnings. It probably works
    okay for large enough N and m, but it never will beat the
    information from the rank-orders, because that is all that it
    has to work with. - If they "like it" for the results that they
    get with small Ns, I hope that they recognize that what they have
    is *not* a decent CI on the small percentile.
     
    Rich Ulrich, Nov 29, 2010
    #4
  5. Jinsong Zhao

    Jinsong Zhao Guest

    Thank you very much for your kind help. I need some time to digest your
    suggestions.

    Best regards,
    Jinsong
     
    Jinsong Zhao, Nov 30, 2010
    #5
