Reef Fish Statistics for Dummies: Applied Simple Regression

Discussion in 'Scientific Statistics Math' started by Reef Fish, Oct 4, 2006.

  1. Reef Fish

    Reef Fish Guest

    For those who have completed the first portion of any First Course in
    Statistics to arrive at the "Simple Regression" topic, this lecture is
    self-contained. It is elementary, but it contains some material used
    in my advanced Graduate Courses in Data Analysis, in Math. Sciences
    and Statistics Departments, because it is easy for the mathematically
    inclined to neglect the APPLIED aspects of Statistics in general, and
    the necessary methodology in Simple Regression in particular.

    1. What is a "Simple Regression"?

    Most regression problems are simple, in the English sense of the
    word simple. In Statistics, the term "simple" in "simple regression"
    is synonymous with "regression with ONE independent variable X"
    in the model:

    Y(i) = bo + b1 X(i) + e(i),        (1)

    where e(i) is the random error of Y(i), independent and identically
    distributed as a Normal random variable with mean 0 and variance
    sigma^2, denoted as ~ i.i.d. N(0, sigma^2).
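
    As a minimal sketch of how the model above is fitted, the closed-form
    least-squares estimates of bo and b1, and the residuals that stand in
    for the unobservable e(i), could be computed as follows. The function
    name fit_simple_regression and the toy data are illustrative
    assumptions, not part of the original lecture.

```python
def fit_simple_regression(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1, residuals)."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residuals are the observable stand-ins for the errors e(i).
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    return b0, b1, residuals


# Hypothetical data lying exactly on y = 1 + 2x, so every residual is zero.
b0, b1, residuals = fit_simple_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1, residuals)
```

    With real data the residuals are not zero, and they are the only
    quantities on which the error assumptions can later be checked.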

    2. Model Assumptions in (1)

    bo and b1 are unknown parameters to be estimated from DATA
    (X(i), Y(i)), i = 1, 2, ..., n, ASSUMED to have the error structure
    of e(i).

    Because of the model assumption, before one does ANY analysis
    or statistical inference, one must first "validate" the model
    assumptions, because if the assumptions are wrong, then the
    estimation and inference theory and methodology would not be valid.

    Q. What STATISTICAL assumptions can we examine or
    validate before we proceed with our application?

    A. NOTHING. Nothing at all!

    This is already the place where some of the graduate students
    stumble in applying simple regression. They realize that each
    Y(i) in the model comes from a Normal distribution with mean
    bo + b1 X(i), and variance sigma^2, and they have both graphic
    and analytic tests (SPSS, SAS, S, R, Maple, etc) for normality,
    so they test the data values of Y for normality.

    That is BLUNDER Number 1. While each Y comes from a
    normal distribution, the dependent variable Y is a mixture of n
    different normal distributions, and there is no reason why the
    mixture Y should resemble data from a normal distribution at all.
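
    The point can be illustrated with a small simulation. The sketch
    below assumes a dichotomous X, made-up parameter values (bo = 0,
    b1 = 10, sigma = 1) and an arbitrary seed, and uses excess kurtosis
    only as a crude numerical indicator of non-normality. It is not a
    formal normality test.

```python
import random


def excess_kurtosis(values):
    """Sample excess kurtosis; close to 0 for normally distributed data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0


random.seed(2006)  # fixed seed so the sketch is reproducible
n = 2000
b0, b1, sigma = 0.0, 10.0, 1.0           # assumed parameter values
x = [i % 2 for i in range(n)]             # dichotomous X: 0, 1, 0, 1, ...
e = [random.gauss(0.0, sigma) for _ in range(n)]
y = [b0 + b1 * xi + ei for xi, ei in zip(x, e)]

# Fit the line by least squares and form the residuals.
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

kurt_y = excess_kurtosis(y)               # far below 0: pooled Y is bimodal
kurt_resid = excess_kurtosis(residuals)   # near 0: residuals look normal
print(round(kurt_y, 2), round(kurt_resid, 2))
```

    Every Y(i) here is normal given its X(i), yet the pooled Y values are
    strongly bimodal, while the residuals behave like normal data.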

    Not only graduate students, but authors of statistics textbooks
    sometimes make the same error. I caught the authors of the
    textbook "Statistics for Business and Economics", Boston:
    Allyn and Bacon, (1980), making that error in Figure 11-2,
    on page 278, "Normal Distribution of the Population of Y
    Regressed on X", suggesting by a sketch of a SINGLE normal
    distribution that the observed Y in the aggregate should follow
    a normal distribution.

    The authors of that book were: Heitman, W.R. and Mueller, F.W.

    Q. What about the distribution of X?

    A. There is NO assumption about the distribution of X. They
    can come from any distribution, and they can be fixed
    constants or any given values.

    It is BLUNDER Number 2 for anyone to examine the probability
    distribution of X, or any "outlier" of X because they think X
    should behave like a normal distribution, or any distribution.

    The ONLY statistical ASSUMPTION behind a simple regression
    model is that the ERRORS are ~ i.i.d. N(0, sigma^2), and there
    is NOTHING you can do until you have tried some fit and have
    observed the errors, in the form of "residuals" (left over from an
    exact fit to a straight line).

    The only thing one CAN, and SHOULD do, is to examine the
    DATA for typographical and other non-statistical errors, verify
    that they are indeed errors, and correct them before doing any
    regression fit.
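
    One possible way to screen a column of data for gross anomalies
    before any fit is sketched below. The median/MAD rule and the
    threshold k = 10 are assumptions chosen for illustration, not a
    method taken from this lecture, and anything flagged still has to be
    verified by hand against the source of the data.

```python
import statistics


def flag_gross_anomalies(values, k=10.0):
    """Return indices of values more than k MADs away from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate case: flag anything that differs from the median.
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, v in enumerate(values) if abs(v - med) / mad > k]


# Hypothetical column: only 15849 echoes the typo discussed later in this
# post; the other numbers are made up for illustration.
column = [120, 135, 128, 15849, 131]
suspects = flag_gross_anomalies(column)
print(suspects)  # → [3]
```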

    The Reef Fish archives come in handy:

    The DATA was taken from the 1975 SPSS Manual in which
    the data in that post was used to illustrate the output of a
    Multiple Regression. That was where "Lesson #1" was
    mentioned by me:

    RF> LESSON #1. ALWAYS examine the data for gross
    RF> (and not so gross) anomaly.

    RF> Jerry and Russell are making a good start toward
    RF> their Fish University "A".

    The 15849 is of course an obvious typo, not by ME (it took me
    about 10 minutes to type the data, about 30 minutes to write
    a multiple regression program in SPEAKEASY because I have
    NO access to any statistical package; and at least an hour to
    find and correct the half a dozen or so typos of MINE <G> by
    checking against the results I had done 30 years ago). The
    typos were in the 1975 SPSS Manual!
    ======= end excerpt

    The TYPO was what contributed to all THREE variables being
    statistically significant in the SPSS Manual -- without it ...
    that's the next chapter/Lesson. :)

    For Simple Regression, the dataset was used in the series of
    lectures on Model Building to show that a Simple Regression
    model was "better" than the SPSS's Multiple Regression Model
    (with 3 independent variables) when the ERRORS in the DATA
    were removed.

    3. What is Step 2 of a Simple Regression Application?

    THAT is where the Statistical Assumptions of the model are
    examined after each attempted fit. The i.i.d. N(0, sigma^2)
    assumption can be broken into three INDEPENDENT components:

    1. Normality of the errors (residuals)
    2. Independence of the errors (residuals)
    3. Homoscedasticity (equal variances) of the errors (residuals)

    All three MUST be satisfied before any statistical result of
    a simple regression can be validly used. These are
    independent assumptions in the sense that none implies
    the other and no two of them imply the third.
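
    The three checks can be sketched numerically as below. The specific
    diagnostics (skewness for normality, the Durbin-Watson statistic for
    independence, a split-half ratio of mean squared residuals for equal
    variances) are common choices assumed here for illustration; the
    lecture does not name particular tests, and residual plots remain the
    usual first look.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 are consistent with
    no lag-1 autocorrelation in the residuals (taken in data order)."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(r ** 2 for r in residuals)
    return num / den


def variance_ratio(x, residuals):
    """Mean squared residual in the upper half of x divided by that in
    the lower half; values far from 1 hint at unequal variances."""
    ordered = [r for _, r in sorted(zip(x, residuals), key=lambda p: p[0])]
    half = len(ordered) // 2
    low, high = ordered[:half], ordered[-half:]
    return (sum(r ** 2 for r in high) / half) / (sum(r ** 2 for r in low) / half)


def skewness(residuals):
    """Sample skewness; values near 0 are consistent with symmetry."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    return m3 / m2 ** 1.5


# Hypothetical residuals, listed in order of increasing x.
x = [1, 2, 3, 4]
residuals = [1.0, -1.0, 2.0, -2.0]
dw = durbin_watson(residuals)
ratio = variance_ratio(x, residuals)
skew = skewness(residuals)
print(dw, ratio, skew)
```

    In this toy case the spread of the residuals grows with x (ratio of
    4) and the signs alternate (Durbin-Watson well above 2), so a critic
    would question both homoscedasticity and independence.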

    Step 2 is what I had called LESSON 2 in the Model Building
    thread, using the SPSS data, in the post

    "LESSON 2 in Model Building. Iterative Loop of Sponsorship vs Critic"

    The Applied Statistician (Data Analyst) first sponsors a model (our
    initial simple regression model (1)). Once the model is FITTED,
    the analyst must then act as his own CRITIC, to see if any of the
    assumptions are violated. If so, he makes changes in the model,
    does a new fit, and acts as a critic of his own model AGAIN. This is
    the "iterative Loop of Sponsorship vs Critic" described in detail in
    George Box's JASA (1976) article, "Science and Statistics".

    The Data Analysis / SPSS / Model Building thread started on
    March 17, 2005. By Jun 29 2005 2:06 am, we began the post
    on LESSON 2. :)

    At this time, I'll skim over the steps of the validation of the THREE
    assumptions and how each violation might be accommodated.

    It was in the cited post above that I made this observation,

    RF> It's interesting in a way that all FOUR of us found the same
    RF> "hockey stick" in the scatterplot of the INVDEX variable vs the
    RF> GNP variable. All four of us took DIFFERENT actions!!

    RF> In that respect, if we were working as a TEAM, we would
    RF> put our heads together on the four TENTATIVE models
    RF> and decide what model to sponsor next (if any).

    That is the most interesting and rewarding part of being a Data
    Analyst or an Applied Statistician! There are no formulas that
    tell you what to do -- there are GUIDELINES on what you should
    avoid and what are valid continuations.

    That's where the SCIENCE of Statistics blends with the ART
    of Statistics, in which the "mathematical statistician" is most deficient.

    You can find numerous posts of Reef Fish which cited particular
    passages from George Box's JASA article -- which everyone
    really should read, re-read, and re-re-read, very carefully every
    time he analyses a real set of DATA. In my post below:

    I cited Box's indictment of "mathematical statisticians":

    "Mathematistry is characterized by development of theory for
    theory's sake, which since it seldom touches down with practice,
    has a tendency to redefine the problem rather than solve it."
    (p. 797, 1976 JASA paper on "Science and Statistics").

    The accommodation of a model with misbehaving residuals is
    commonly accomplished via transformation of either the
    dependent OR the independent variable (or both) in the model.

    Tukey and Mosteller's book "Data Analysis and Regression"
    has a nice Exhibit 1 in Chapter 4 "Straightening Curves and
    Plots" showing a "continuum ladder of power transformations"
    and where negative powers, root transformations and log
    transformations fall in the diagrammed exhibit.
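
    A rough sketch of walking down such a ladder of powers follows. It
    assumes strictly positive data, a small hand-picked set of powers,
    and uses the largest absolute correlation as a crude numerical
    stand-in for judging straightness by eye. None of these choices come
    from the book or from this lecture.

```python
import math


def ladder(values, power):
    """Re-express positive values: log for power 0, otherwise values**power.
    Negative powers are negated so the order of the data is preserved."""
    if power == 0:
        return [math.log(v) for v in values]
    sign = 1.0 if power > 0 else -1.0
    return [sign * v ** power for v in values]


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def straightest_power(x, y, powers=(-1, -0.5, 0, 0.5, 1, 2)):
    """Power for y whose re-expression is most nearly linear in x."""
    return max(powers, key=lambda p: abs(correlation(x, ladder(y, p))))


x = list(range(1, 11))
y = [xi ** 2 for xi in x]   # hypothetical curved data: y = x^2
best = straightest_power(x, y)
print(best)  # → 0.5
```

    A numerical pick like this is only a starting point for the
    sponsor-critic loop; the residuals of the re-expressed fit still have
    to be examined.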

    Going back to the SPSS example... Of those who dared to
    show what they tried, NONE used any power transformation
    to straighten out the "hockey stick" seen in the simple
    regression scatter. :)

    There ain't such a thing as "cook book" or "recipe" in an
    enlightened "data analysis", or "exploratory data analysis"
    guided by both MODEL and DATA.

    That's what Applied Statistics and Applied Simple Regression
    are all about.

    In the Preface to "Reef Fish Statistics for Dummies", I
    said there would be very few formulas or equations. So far,
    I have given only ONE, equation (1), which is the usual model
    for a simple regression.

    4. What comes next?

    This is where, after the model and statistical assumptions
    have been validated to apply the theoretical results derived
    by statisticians, almost everyone will have no trouble finding
    the FORMULAS and EQUATIONS used in constructing
    Confidence Intervals for the parameters, testing Statistical
    Hypotheses about the intercept bo or the slope b1 in (1),
    and obtaining prediction intervals for future observations.

    These are formulas that are easy to derive and even easier
    to apply, and they are the ones that all of my students have
    access to in their OPEN BOOK and OPEN NOTES exams,
    so I won't even bother to go over them here, once we have
    carefully carried out the APPLIED steps 1 and 2 to ensure
    that it is valid to apply the formulas and results.

    The only REMAINING STEP is outside of most Statistics
    textbooks on regression, which is labeled as STEP 3 in
    my Model Building lessons:

    "LESSON 3 in Model Building: Practical Significance"

    There is an ongoing discussion in two of the three sci.stat groups
    now, under a not-so-obvious topic of "confidence intervals". My
    opening post blossomed into a mini-thread of 8 posts, based on my
    statement in the initial post, in which I wrote:

    RF> Any statistician worth his salt would know that a highly
    RF> "statistically significant" result can be completely
    RF> worthless from a practical point of view of the usefulness
    RF> of the result.

    RF> Conversely, a statistical result that is not statistically
    RF> significant at some .05 or .10 level can be very useful.

    RF> The two concepts are TOTALLY different in terms of
    RF> knowing how to apply statistics sensibly and usefully.
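
    One way to see the distinction numerically is through the t statistic
    for a correlation, t = r * sqrt(n-2) / sqrt(1 - r^2), which is quoted
    later in this thread. The two cases below are hypothetical numbers
    chosen for illustration and are not the SPSS example. The value 2.306
    is the two-sided 5% critical point of t with 8 degrees of freedom.

```python
import math


def t_from_r(r, n):
    """t statistic for H0: rho = 0, given sample correlation r and size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Hypothetical case 1: huge sample, tiny correlation.
t_big_n = t_from_r(0.05, 10_000)   # about 5.0, far beyond usual critical values
r2_big_n = 0.05 ** 2               # yet X accounts for only 0.25% of the variance of Y

# Hypothetical case 2: small sample, sizeable correlation.
t_small_n = t_from_r(0.5, 10)      # about 1.63, below the 5% critical value 2.306
r2_small_n = 0.5 ** 2              # X accounts for 25% of the variance of Y

print(round(t_big_n, 2), r2_big_n, round(t_small_n, 2), r2_small_n)
```

    The first result is "highly significant" but of little use for
    predicting Y; the second is "not significant" yet may point to a
    relationship worth pursuing with more data.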

    That is a VERY important principle for an APPLIED statistician,
    which no "mathematical statistician" ever even thinks about.

    The 1975 SPSS example was a good illustration of how a highly
    statistically significant result can be completely USELESS in
    practice, as demonstrated in the Model Building threads.

    RF> In my re-analysis of the DATA in the SPSS Manual, I came to

    RF> INVDEX = -197.51  +  0.018234 * GNP
    RF>          (15.031)    (.000667)
    RF>          T=-13.14    T=27.33
    RF>                      p-value 10^(-12)

    RF> Multiple R-sq = 0.9726, MSE = 317.96.
    RF> All very impressive and highly statistically significant.

    And I punctured the euphoria of any social scientist or
    non-thinking statistician, in LESSON 3, showing that the
    simple regression model was completely USELESS in
    practice. :)

    I conclude this First Lesson of "Reef Fish Statistics for
    Dummies" with the sig of one Dr. Flash Gordon, M.D.,

    FG> in theory, there is no difference between
    FG> theory and practice. but in practice, there is.
    FG> flash gordon, m.d.

    Flash is a PRACTICAL Man (and M.D.) of many interests:

    -- Reef Fish Bob.
    Reef Fish, Oct 4, 2006

  2. Reef Fish

    Reef Fish Guest

    In the Applied Simple Regression lesson, Reef Fish gave the STEPS
    that a Data Analyst must follow:

    Model Assumptions
    For the Simple Regression Model, the answer was:
    For Simple regression, there is NO assumption about the
    Y which is the aggregate of Y(i), from n different distributions!
    That is the FIRST STEP in Applied Simple Regression, before
    doing any fit of the data or statistical inference.

    Now suppose we want to use the Simple Regression DATA to
    test the correlation between X and Y, and apply the same STEPS
    and Principles of Data Analysis, then we'll see what we HAVE
    to do and in the process see HOW and WHY m00es had erred
    in his insistence that the same TEST STATISTIC used to
    test the slope of the Simple Regression problem can be used to
    test the correlation Ho: rho(X,Y) = 0.

    For the problem of testing the correlation R, as we had seen,
    according to the result quoted in Hogg and Craig by m00es:

    m00> Hogg, R. V., & Craig, A. T. (1995). Introduction to mathematical
    m00> statistics (5th ed.).

    m00> On pages 478-480, the authors derive the distribution of r
    m00> under the bivariate normal assumption and show that
    m00> under rho = 0, r * sqrt(n-2)/ sqrt(1-r^2 ) is distributed t(n-2).
    m00> Now, on page 480, the authors mention EXPLICITLY that
    m00> a careful review of their proof reveals that nowhere was it
    m00> necessary to assume that the two variables are bivariate
    m00> normal. Only one of the variables must be normal.

    The problem changes drastically for the Applied Statistician
    and Data Analyst if he wants to use the simple regression DATA
    to test the correlation between X and Y. Unlike the regression
    problem, where there is NOTHING he can verify in terms of the
    statistical assumptions before fitting a regression line, he
    now must VALIDATE the NECESSARY ASSUMPTION (to test the
    correlation) that ONE of the two variables X or Y is NORMAL,
    before calculating the correlation coefficient, let alone
    making statistical inference about it based on the same test
    statistic T used in testing the slope in a regression problem.

    It is HERE that m00es should have realized that his repeated
    claim that "DATA is irrelevant" to the distribution of the
    TEST STATISTICS is drastically wrong, from the APPLIED
    Statistics and Data Analytic point of view.

    Because the CORRELATION MODEL assumption, relaxed from
    bivariate normality, requires X or Y to be normally
    distributed, the validation of the BIVARIATE NORMALITY of
    (X,Y) that a data analyst would previously have needed in the
    sponsor-critic iterative loop is now replaced by the
    necessity to validate that ONE of the two variables is
    normally distributed.

    In the problem and DATA under consideration where the
    Y in a Simple Regression Model is a mixture of n normals,
    and hence nonnormal a priori and a fortiori;
    and the X variable, which is the dichotomous Indicator
    variable, is clearly nonnormal -- the statistical theory about
    testing r(X,Y) = anything is NOT applicable, and there
    is no way of fixing the violation because Y consists of the
    DATA from two independent normal distributions with
    (statistically significantly) different means.

    One can still do a validation of the normality of Y, if the
    two populations of Y are not sufficiently far apart in their
    means to make the mixture distribution sufficiently
    nonnormal, and one COULD still proceed to test the
    correlation under Hogg and Craig's assumption.

    Alas, DATA is/are very relevant not only in the final
    execution of the correlation test, but the data would have
    FAILED the normality test in the mixture of the two groups
    whose data came from the Anderson and Sclove book
    on testing the equality of means, and illustrated in the
    Ling and Roberts Manual of using simple regression for
    the problem of testing MEANS.

    If the same data WERE used for testing correlation
    between the X and Y in the regression problem, the
    model-validation step would have failed for the data Y,
    and one would have to appeal to the mercy of "robustness"
    of the test statistic against the point-biserial-correlation
    when X is dichotomous and Y is NON-normal.

    This is precisely the way ANY test of correlation between
    two variables MUST be done. One cannot simply assume
    away bivariate normality OR normality of ONE of the two
    variables.

    One must VALIDATE that one of the two variables IS indeed
    normal, or not sufficiently non-normal to make the inference
    about the POPULATION correlation coefficient between X
    and Y.

    And FINALLY, even if everything goes well up to the point
    of actual testing of the correlation coefficient, that one of the
    variables can be said to be normal, and that Hogg and Craig's
    NECESSARY condition is satisfied, we still are confronted
    with the question of "practical significance" of the tested
    result about the correlation.

    What if X and Y are "significantly correlated" enough to have
    rho=0 rejected at the alpha = 0.001 level, or at a p-value
    less than 0.001? What does one make of the PRACTICAL
    usefulness or uselessness of that result?

    You can't eat the correlation, and you can't bite the
    correlation for its hardness or softness. You can't do
    ANYTHING with it, except looking at other evidence
    such as the predicted values or prediction intervals
    associated with the same significance.

    The SPSS example now comes vividly to mind.

    In said example, I showed

    RF> INVDEX = -197.51  +  0.018234 * GNP
    RF>          (15.031)    (.000667)
    RF>          T=-13.14    T=27.33
    RF>                      p-value 10^(-12)
    RF> Multiple R-sq = 0.9726

    The slope coefficient of the simple regression (and
    hence the correlation) would have shown a T value
    of 27.33 corresponding to a p-value of .000000000001,
    for a correlation R of 0.9862 that was judged to be
    USELESS in practice!
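
    The reported figures can be cross-checked with a few lines of
    arithmetic. The sketch below uses only the numbers quoted above and
    the standard relations T = coefficient / standard error and
    R = sqrt(R-sq) for a simple regression; the sample size is not
    stated in this post, so the p-value is not recomputed.

```python
import math

# Values as reported in the quoted SPSS re-analysis.
b0, se_b0 = -197.51, 15.031
b1, se_b1 = 0.018234, 0.000667
r_squared = 0.9726

t_intercept = b0 / se_b0      # reported as -13.14
t_slope = b1 / se_b1          # reported as 27.33 (agrees up to rounding)
r = math.sqrt(r_squared)      # reported as 0.9862; positive because b1 > 0

print(t_intercept, t_slope, r)
```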

    The only thing that comes to my mind is the reinforcement
    of something I always attributed to John Tukey (but I don't
    recall seeing it in PRINT, so I must have heard it from his
    mouth, because I can't make something like that up. :)),
    "Using correlation is like sweeping dirt under the rug, with
    a vengeance".

    I couldn't have characterized it any better!

    I hope this finally settles the m00es Lecture topic, which
    dragged on for weeks, with billions of wasted electrons,
    dozens of posts, and NOISE, about how to test a hypothesis
    about a correlation coefficient.

    As I had said in the Preface, I wish I had started with the
    APPLIED Simple Regression. It would have immediately
    flattened m00es's repeated argument that "DATA is irrelevant"
    in his posts about theory, and shown the relevance of DATA as
    well as the relevance of VALIDATING Hogg and Craig's
    assumption if one were to test the hypothesis Ho: rho = 0,
    or rho = anything.

    But the FINAL conclusion (in the light of the SPSS Simple
    Regression example) is most rewarding, in seeing how a
    correlation of .98+ that rejected Ho: rho = 0, at any alpha
    level of 0.00000000001 or greater, can be so ... utterly
    USELESS in practice.

    This concludes the Simple Regression for Dummies, with
    the bonus lesson of Testing Correlations for Dummies.

    -- Reef Fish Bob.
    Reef Fish, Oct 4, 2006

  3. m00es

    m00es Guest

    No, it does not, because what you wrote is still not correct.

    Data IS irrelevant for deriving the distribution of the test statistic.
    This has nothing to do with a viewpoint. It's a fact. I don't need to
    observe any data to derive that distribution.

    I'll explain this again. Why don't you explicitly point out in this
    proof the source of my error?

    1) The model: Y = beta0 + beta1 X + e, where e ~ iid N(0, sigma^2)

    2) beta1 = rho(X,Y) * SD(Y) / SD(X)

    3) Therefore, beta1 = 0 iff rho(X,Y) = 0

    (since SD(Y) and SD(X) can safely be assumed to be > 0).

    Now we want to test H0: beta1 = 0. As you have said yourself, we can
    use either

    t = b1/s(b1)
    s = r * sqrt(n-2) / sqrt( 1 - r^2 )

    to test H0: beta1 = 0. Why?

    4) under H0: beta1 = 0, t follows a t-distribution with n - 2 degrees
    of freedom
    5) s = t, so both MUST have the same distribution
    6) we can also use the result from Hogg & Craig to see that s has a
    t-distribution with n - 2 degrees of freedom under H0. Let's use my
    Under H0: beta1 = 0, then Y = beta0 + 0 X + e. Therefore Y ~ N(beta0,
    sigma^2). We see that under H0, Y is normal and not a mixture
    distribution. Therefore, s follows a t-distribution with n - 2 degrees
    of freedom.

    An important point: t (as well as s) only follows a t-distribution with
    n - 2 degrees of freedom when beta1 = 0 holds!

    7) When we reject H0: beta1 = 0, we automatically reject H0: rho = 0
    and vice-versa.
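
    The algebraic identity s = t used in step 5 can be checked
    numerically. The sketch below computes both statistics from the same
    sums of squares; the function name and the four data points are
    made up for illustration.

```python
import math


def slope_t_and_correlation_t(x, y):
    """Return (t, s): t = b1/s(b1) and s = r*sqrt(n-2)/sqrt(1-r^2)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    b1 = sxy / sxx
    sse = syy - b1 * sxy                     # residual sum of squares
    se_b1 = math.sqrt(sse / (n - 2) / sxx)   # standard error of the slope
    t = b1 / se_b1
    r = sxy / math.sqrt(sxx * syy)
    s = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t, s


# Hypothetical data: b1 = 0.6, r = 3/sqrt(10), and both statistics equal 3*sqrt(2).
t, s = slope_t_and_correlation_t([0, 1, 2, 3], [0, 1, 1, 2])
print(t, s)
```

    The identity holds for any data set with nonzero variation in X and
    Y. What it does not settle is which assumptions the data must
    satisfy for the t(n-2) reference distribution to apply.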


    So, why don't you actually point out where the error is? And don't say:
    Y follows a mixture distribution. Under H0, it does not. If it did,
    then NEITHER t NOR s would have a t-distribution with n - 2 degrees of
    freedom. But since s = t, they both have the SAME distribution -- it
    just won't be a central t-distribution when beta1 != 0.

    In fact, that's the whole idea of hypothesis testing:

    (a) Assume H0 holds.

    (b) Derive the distribution of the test statistic UNDER THE ASSUMPTION
    THAT H0 holds (and Y is normal under that assumption).

    (c) Then obtain data, calculate the test statistic in the sample, and
    see where the observed test statistic falls with respect to the
    critical bounds according to the distribution under H0.

    (d) When H0 holds, then using the critical bounds according to the
    distribution under H0 guarantees that we will only reject H0 in alpha *
    100% of the cases. But if H0 does not hold, then the distribution of
    the test statistic (i.e., the distribution of t, which is the same as
    the distribution of s) will be stochastically greater than the
    distribution of t (= s) under H0. Therefore, the probability of
    rejecting H0 increases, which is exactly what we would want.
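
    Steps (a) to (d) can be illustrated with a small simulation under
    the stated model. The sketch assumes a dichotomous X, normal errors,
    beta1 = 0, n = 20, 2000 replications and an arbitrary seed. The
    critical value 2.101 is the two-sided 5% point of t with 18 degrees
    of freedom. It illustrates only the null distribution when the model
    assumptions hold, and says nothing about whether a given data set
    satisfies them.

```python
import math
import random


def correlation_t(x, y):
    """s = r*sqrt(n-2)/sqrt(1-r^2) for paired samples x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


random.seed(5)                    # arbitrary seed for reproducibility
n, reps = 20, 2000
x = [0] * 10 + [1] * 10           # dichotomous, clearly non-normal X
critical = 2.101                  # two-sided 5% point of t with 18 df
rejections = 0
for _ in range(reps):
    # Under H0: beta1 = 0, every Y(i) is N(beta0, sigma^2); here N(0, 1).
    y = [random.gauss(0.0, 1.0) for _ in range(n)]
    if abs(correlation_t(x, y)) > critical:
        rejections += 1
rejection_rate = rejections / reps
print(rejection_rate)
```

    Under these assumptions the rejection rate should land close to the
    nominal 0.05.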

    But one more time: Under H0: beta1 = 0, both t and s have a central
    t-distribution. Moreover, beta1 = 0 iff rho = 0. Therefore, rejecting
    beta1 = 0 implies that we can reject rho = 0 and vice-versa.

    So, please, enlighten me where the error is.

    m00es, Oct 5, 2006
  4. But as I read RF's point, you don't *know* that the *data* support
    H0 a priori, so you can't validly run the test to show that it does
    *until* you check for normality. Kind of a Catch 22. :) Your point
    about deriving the distribution is moot in the case of the actual
    process of doing the data analysis. IOW what you show is true
    in theory under a set of assumptions may not be valid if those
    assumptions are violated, and you need to test the validity of the
    assumptions *before* proceeding. At least that is my reading of
    the situation. I'll admit I've only skimmed much of the voluminous
    exchange on this topic, in part because so much of it is repetitious
    because neither of you is trying (it seems to me) to *understand*
    what the other is saying, so I may have missed everyone's point.
    And here is where it seems to me you need:
    (c-1) Test that the data satisfies the requirements of the
    hypothesis test.
    Enlighten me, please, if you think I'm wrong.

    Russell.Martin, Oct 5, 2006
  5. TomC

    TomC Guest


    To Mr Martin - I think mOOes has been very consistent all along. Under the assumption that Ho is true what he has stated is also true.

    Time and time again mOOes has been very specific - he has stated his position repeatedly; under the assumption that Ho is true, Rho(x,y) = 0. mOOes has not claimed he is testing Rho(x,y) = 0 as has been claimed by some. As mOOes has stated it simply follows if H0 is true -

    Clearly at some point we have to test the data to see if Ho holds. Any reasonable reading of this whole debate would see this as another RF saga.

    So mOOes you will never get a concession from RF - you are wasting your time. RF is never wrong - if you think RF is wrong refer to the previous point.


    TomC, Oct 5, 2006
  6. Lou Thraki

    Lou Thraki Guest

    You need to check whether the hypothesis is true before you test it ;-)
    Lou Thraki, Oct 5, 2006
  7. Yes, he just keeps repeating the same thing, which of course
    is very consistent. RF is somewhat guilty of the same thing.
    That doesn't seem to be the point of contention.
    Yes, repeatedly. :)
    Russell.Martin, Oct 5, 2006
  8. Reef Fish

    Reef Fish Guest

    Russell, glad to see you (or anyone else) stepping in about the aspect
    of VALIDATING the ASSUMPTION in data analysis.

    You interpreted me correctly, almost. :) There is NO Catch 22
    involved. The ASSUMPTION is not in Ho. It is in what is needed
    for the DATA to satisfy in order to use a particular TEST STATISTIC.

    The point here is that t=s always (a trivial mathematical fact).

    To use it to test the SLOPE of a simple regression, there is NO
    ASSUMPTION about X or Y that can be, or need be, validated.
    That's in the First Lesson of this "Reef Fish Stat for Dummies".

    To use the same t to test a CORRELATION, the ASSUMPTION
    (re: Hogg and Craig) is that ONE of the X or Y need be normal.
    That's about the DATA that needs to be validated. In this case
    one can just think of ANY data X, Y, whether it was used in a
    simple regression or not. Someone comes with a set of
    data with X and Y and wants to test Ho: rho = 0. What must
    a data analyst do?

    He MUST validate that X or Y is normal, because that's the
    ASSUMPTION behind the theory of the distribution of the
    test statistic for testing CORRELATIONS.

    That is 100% correct.

    m00es is confused about the fact that the test statistic must
    INCORPORATE the value of the parameter under Ho.
    So, in testing rho = 0, the test stat would be (R - 0)/s(R).
    To test rho = .3, the test stat would be (R - .3)/s(R), but
    each of those has a T(n-2) ONLY when X or Y is normal!
    His argument of rho = 0 implies something about the data
    falls apart completely when you test rho = .3. That made
    your explanation below all the more ESSENTIAL to any
    statistical procedure, hypothesis testing or not.

    That is also 100% correct. In fact THAT is the WHOLE idea
    of needing to validate any assumption behind any statistical
    procedure BEFORE proceeding.

    The SAME validation about X and Y (that ONE must be normal)
    is necessary if one wants to use the test statistic to construct
    a Confidence Interval for R.

    That faulty inference of yours, Russell, was partly because you
    were not completely clear yourself, but mostly because you did
    not read me carefully enough, AND of course, much of the
    previous exchanges were repetitious -- m00es playing the same
    tune without alteration AND without reading my explanation,
    and I could only re-explain the same in different ways.

    I understood m00es's point from the start. He was confusing
    distribution theory and the role of the parameter in Ho, with the
    ASSUMPTION behind any statistical procedure.

    I SHOULD have gone into the current Lesson 1 and 2 in the
    "Reef Fish Statistics for Dummies" as soon as he stated
    "DATA is irrelevant".

    That's why I thought he would finally wake up to the POINT that

    1. T is t(n-2) in testing the SLOPE of the regression line without
    requiring either X or Y to be normal.

    2. T is t(n-2) in testing the CORRELATION between any X and Y
    ONLY if X or Y is normal. (Hogg and Craig necessity condition).

    Here, m00es is STUCK in that same HOLE he had been digging.

    That's only ONE aspect of the distribution of the test statistic, as
    I indicated above. To test rho = c, one would have to incorporate
    (R - c)/s(R) into the test statistic. But that is INDEPENDENT of
    the fact that the distribution of R must ALSO satisfy the
    underlying assumption about the distribution of R that X or Y
    must be normal in the DATA being used to test the hypothesis
    about rho.

    That is the same error made by Heitzman and Mueller in Lesson 1.
    Or more specifically, validate the ASSUMPTION that underlies
    the procedure for testing the PARAMETER in question.

    For correlation, one MUST validate that X or Y is normal.
    For simple regression slope, there is NOTHING to validate
    about X or Y. The ASSUMPTION lies with the ERRORS
    which can only be validated after a regression fit has been made.

    Snip m00es's same repetition of fallacious reasoning.

    I already HAD, in Lessons 1 and 2 of "Reef Fish Statistics
    for Dummies". I was merely re-stating the SAME, as I
    had restated the same a dozen times before.

    You were wrong ONLY in your confusion of Ho with the
    statistical assumption that underlies a statistical procedure.

    As such, ANY use of the correlation and related theory
    based on the normal theory that leads to a T distribution
    MUST validate that X or Y is from a normal population.

    The ASSUMPTION about the distribution of a Statistic
    does not depend on the parameter being tested in Ho.
    For confidence intervals on R, there is nothing to be
    tested at all!

    Russell, what's NEW in the two lessons is that I outlined
    the procedure, step by step, of what EVERY applied
    statistician should do, in doing a simple regression
    analysis. m00es should have realized that he had NEVER
    done a simple regression, in which he validated anything.

    m00es was apparently trained in "mathematical statistics"
    where data was never seen.

    I used the Simple Regression problem to set up the
    PARALLEL steps in the execution. THAT's where it
    should have been clear to any careful reader that
    the ASSUMPTIONS of the two procedures are
    DIFFERENT. In regression there is NO assumption
    about the data Y or X other than each Y comes from
    a different normal distribution, and there is NOTHING
    you can do to verify THAT. Which is why validation of
    that assumption has to wait until the RESIDUALS are
    observed.

    The CORRELATION problem, on the other hand,
    has an assumption about normality that CAN be, and
    MUST be, validated about the X and Y, in following
    the same data analytic steps of testing OR
    constructing a C.I. about R.

    Now, with the additional hints above, everyone can
    go back and RE-READ lessons 1 and 2 carefully.

    -- Reef Fish Bob.
    Reef Fish, Oct 5, 2006
  9. Reef Fish

    Reef Fish Guest

    TomC wrote:

    Welcome to the Gulf separating Math-stat and Applied-stat.
    Think about what you've typed! We were talking about Hypothesis
    Testing. What's the point of testing a Hypothesis if you assume it is true?

    Yes, he WAS talking about testing Rho(x,y) = 0, especially in the
    problem that testing beta1 = 0 is equivalent to testing rho = 0.

    In fact, m00es has NOT deviated from that hypothesis test -- or else
    his own argument would have fallen apart. It DOES NOT MATTER
    whether one is testing rho = 0 or rho = anything else.
    Here both of you are wrong. The DATA does not follow anything
    stated in Ho. The TEST STATISTIC used to test Ho incorporates
    the parameter value in Ho, but that is an entirely separate issue,
    stated in several different "Hypothesis Testing Lectures" of mine.
    And the DATA must satisfy the ASSUMPTIONS of the testing
    procedure for the parameter. That's where m00es never recovered
    because of his misguided training in MATHEMATICAL statistics,
    and never had any education on how DATA is used in applied statistics.
    Yes so far m00es has been wasting HIS time. But OTHERS who
    carefully read what I had posted, will undoubtedly learn from them.

    You can say that about the thousands of posts I have posted about
    STATISTICAL facts and procedures. If I made a minor slip, as I had
    done several times, they were instantly corrected, by myself, or
    pointed out by others.
    TomC, you're obviously deficient not only in Applied Statistics
    training, you couldn't even read what m00es had posted to say
    that he never said he was testing Ho: rho = 0.


    Perhaps the various comments by readers other than me,
    even completely WRONG ones like yours, may
    help m00es see his OWN errors.

    His biggest ERROR is his belief that "DATA is irrelevant" in
    any statistical procedure.

    I think if he stops digging his hole, and reads my Lessons 1 and
    2 in the "Statistics for Dummies" thread, he MAY actually finally
    see his own errors, in his inability to separate the role of the
    parameter in Ho, and the role of DATA, both in the execution
    of a Hypothesis Test, AND in validating the PROCEDURAL
    ASSUMPTIONS in working with any "statistic" in Statistics.

    -- Reef Fish Bob.
    Reef Fish, Oct 5, 2006
  10. Reef Fish

    Reef Fish Guest

    Even allowing your faulty characterization of my guilt, m00es is
    consistently WRONG in his repetition; while RF is consistently
    RIGHT in his re-explanations of why m00es was wrong.
    That's only because of Russell's own confusion. It IS a point of
    contention that Ho does not dictate the DATA. What is agreed
    is the fact that the TEST STATISTIC for Ho must incorporate the
    parameter being tested.

    Verbatim even. :) And SURPRISE!! A repeatedly stated
    falsehood is STILL a falsehood!

    I had already covered the rest of Tom C's errors in his own reading
    and understanding about Statistics.

    -- Reef Fish Bob.

    P.S. The TWO lessons in the "Reef Fish Statistics for Dummies"
    are so condensed and streamlined that they are each SELF-
    CONTAINED, about (1) Applied Simple Regression, and (2)
    Testing Correlations. They are meant to be READ, and READ
    carefully. If they had been, Russell would not have made some
    of his erroneous comments (while he was merely correct in
    something he had learned BEFORE about validating assumptions)
    but it's quite obvious that TomC is one of m00es' classmates or
    friends who dwell in the same "Mathematical Statistician Hole".

    -- Reef Fish Bob.
    Reef Fish, Oct 5, 2006
  11. Reef Fish

    Reef Fish Guest

    That is a concise and apt remark about Russell's opening remark. :))

    I made the same remark, in a less graphic and less concise fashion,
    about TomC's post about m00es assuming rho = 0 about the DATA,
    when m00es was testing Ho: rho = 0.

    Birds of the same feather. :)

    That was Russell's faux pas about the role of Ho and the role of
    checking ASSUMPTIONS underlying a Statistical Procedure.

    -- Reef Fish Bob.
    Reef Fish, Oct 5, 2006
  12. Reef Fish

    Reef Fish Guest

    m00es wrote, in his own interview by m00es:
    It's billions plus epsilon, where epsilon is in the millions. :)

    Data is VERY RELEVANT if the distribution of S depends on the
    ASSUMPTION that Y is normal, as in the case of testing the
    correlation R(X,Y).

    m00es replays HIS script in the Theatre of the Absurd:

    Noe Schitt Sherlock! That's a mathematical identity that we all knew.

    We heard that a few hundred times, from you, and from RF citing
    m00es citing Hogg and Craig. But m00es kept MISAPPLYING
    what m00es quoted Hogg.
    A new angle in the Netherland Statistics. The TEST STATISTIC
    for testing beta1 = 3 would be T = (beta1^ - 3)/se (beta1^).

    It is ALSO distributed as T with n-2 d.f. Didn't you know that?
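
    As a minimal sketch of the statistic just described, the snippet
    below computes T = (b1 - c)/se(b1) for a hypothesized slope c. The
    function name slope_t_statistic and the six data points are
    hypothetical, made up purely for illustration; the result would be
    compared against a T table with n-2 d.f.

```python
import math

def slope_t_statistic(x, y, beta1_null=0.0):
    """Return (b1, se_b1, T) for testing Ho: beta1 = beta1_null."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # least-squares slope
    b0 = y_bar - b1 * x_bar             # least-squares intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)   # standard error of the slope
    return b1, se_b1, (b1 - beta1_null) / se_b1

# Hypothetical data, not taken from the thread.
x = [1, 2, 3, 4, 5, 6]
y = [4.1, 6.9, 10.2, 12.8, 16.1, 19.0]

b1, se_b1, t3 = slope_t_statistic(x, y, beta1_null=3.0)  # Ho: beta1 = 3
_, _, t0 = slope_t_statistic(x, y, beta1_null=0.0)       # Ho: beta1 = 0
print(b1, se_b1, t3, t0)
```

    Only the numerator changes with the hypothesized value; the
    standard error and the degrees of freedom stay the same.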

    Furthermore, what do you conclude if you accepted OR rejected
    Ho: beta1 = 3?
    Quack Endeth Duck indeed!
    Pointed out countless times already. To test the correlation, Hogg
    and Craig say the DATA must come from a bivariate distribution where Y
    must be normal if X is not.

    And m00es kept forgetting to VALIDATE that assumption about Y
    when testing correlations!!

    A collective LAUGHTER is heard echoing in the halls of the USA, and
    even in the halls of Netherlands!

    If you assume Ho holds, why would you NEED to test if Ho is true?

    and the data on Y CAN be anything BUT normal. For example,
    the data on Y could be N(10, .00001) for the first 40 Ys and N(20,
    on the next 60 Ys.

    PERFECTLY good regression DATA for Y on X.

    PERFECTLY bad data for testing the correlation between X and Y
    no matter what the c is, in testing Ho: rho = c .
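
    A rough simulation of this kind of two-group data is sketched
    below. It is only an illustration under stated assumptions: the
    second variance in the example above is cut off, so the code uses a
    hypothetical common error s.d. of 1.0 in both groups, a fixed seed,
    and a helper named normal_plot_correlation (the correlation between
    the sorted values and normal quantiles, i.e. how straight a Normal
    Probability Plot would look).

```python
import random
from statistics import NormalDist, mean

def normal_plot_correlation(values):
    """Correlation between sorted values and standard normal quantiles."""
    n = len(values)
    ordered = sorted(values)
    quantiles = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    m_v, m_q = mean(ordered), mean(quantiles)
    num = sum((v - m_v) * (q - m_q) for v, q in zip(ordered, quantiles))
    den = (sum((v - m_v) ** 2 for v in ordered)
           * sum((q - m_q) ** 2 for q in quantiles)) ** 0.5
    return num / den

rng = random.Random(0)
x = [0] * 40 + [1] * 60
# ASSUMPTION: error s.d. of 1.0 in both groups (not the values in the post).
y = ([rng.gauss(10, 1.0) for _ in range(40)]
     + [rng.gauss(20, 1.0) for _ in range(60)])

# With a 0/1 X, the least-squares fitted values are the two group means.
group_mean = {g: mean(yi for xi, yi in zip(x, y) if xi == g) for g in (0, 1)}
residuals = [yi - group_mean[xi] for xi, yi in zip(x, y)]

r_y = normal_plot_correlation(y)              # marginal Y: a two-humped mixture
r_resid = normal_plot_correlation(residuals)  # regression residuals
print(r_y, r_resid)
```

    The plot correlation for the pooled Y should come out clearly lower
    than the one for the residuals, which is the contrast the example
    is pointing at.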

    See, m00es, that's the truism --- that DATA for testing a
    Hypothesis, any hypothesis, does NOT depend on the statement
    of the hypothesis being tested.

    That's also the motto of Harvard: Veritas. (Latin for TRUTH). :)
    I learned that motto when I taught at Harvard.

    The DATA could be anything from anywhere.

    But if you're going to test Ho: R(X, Y) = a given value, then
    it doesn't matter what the given value a is, the DATA must
    satisfy the NECESSARY condition that either X or Y must
    be normally distributed, if you're going to use the statistical
    theory behind the test statistic for testing R.

    That condition is independent of the USA, Portugal, or UK,
    or the Netherlands. It is universal. An applied statistician
    in any of those countries must VALIDATE the assumption
    if he is going to test the correlation between X and Y.

    So the missing link buried DEEP in the m00es HOLE is the
    "validation of ASSUMPTION in a statistical PROCEDURE"
    for testing OR constructing a Confidence Interval about
    an unknown parameter.

    In constructing a Confidence Interval on the correlation
    coefficient R, the statistic doesn't even have an Ho to

    That topic was covered under the Portuguese Statistics
    for Dummies on the difference between a Confidence
    Interval for (p1 - p2) using the s.e. for p1^ - p2^, whereas
    the s.e. for testing Ho: p1 - p2 = 0 must use a different
    s.e. incorporating p1 = p2.
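
    The two standard errors mentioned above can be sketched as
    follows. The counts (30 of 100 and 45 of 100) and the function name
    two_proportion_standard_errors are hypothetical, chosen only to
    show that the unpooled s.e. used for the interval and the pooled
    s.e. used under Ho: p1 = p2 are different numbers.

```python
import math

def two_proportion_standard_errors(x1, n1, x2, n2):
    """Return (p1^ - p2^, unpooled s.e., pooled s.e.)."""
    p1, p2 = x1 / n1, x2 / n2
    # Confidence interval: each sample keeps its own estimated proportion.
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Test of Ho: p1 = p2: one common proportion estimated from both samples.
    p_pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return p1 - p2, se_unpooled, se_pooled

# Hypothetical counts, not taken from the thread.
diff, se_ci, se_test = two_proportion_standard_errors(30, 100, 45, 100)
ci_95 = (diff - 1.96 * se_ci, diff + 1.96 * se_ci)  # approximate 95% interval
z = diff / se_test                                  # test statistic under Ho
print(ci_95, z)
```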

    Read THAT Lesson, m00es, which was given in 2005,
    and repeated in 2006, long before you enrolled in
    any of the Statistics for Dummy schools. :)

    In testing a correlation coefficient using the T(n-2), DATA
    must meet the assumption that X or Y must be normal!

    Think confidence interval, m00es, if that's what it takes
    to haul your posterior from that HOLE you dug so deep
    in making inferences about a correlation!

    Would you use T(n-2) to test Ho: rho = 0 if the DATA
    X and Y came from two samples of size 3 each from
    different uniform distributions? :) You can calculate
    the correlation coefficient and can use the Cauchy
    distribution table ya know? But SHOULD you?

    You had been enlightened hundreds of times, but you've been
    holding on to that confirmatory lamp post <tm>, used by
    classical statisticians, as a drunk uses it for SUPPORT
    rather than for enlightenment -- which is another quote by
    Tukey, this time I knew it came from his 1962 article in
    AMS on "The Future of Data Analysis" which I am sure
    you can find, because it was in the Annals of Mathematical
    Statistics which was (and still is) widely populated by
    statistical drunks of the "mathematical statistics" type,
    rather than by the enlightened Data Analysts and
    Applied Statisticians who followed Tukey's enlightened

    C'est la difference, mon ami.

    -- Reef Fish Bob.
    Reef Fish, Oct 5, 2006
  13. Reef Fish

    Guest Guest

    Reef Fish,
    I want to see what underlies the argument between you & m00es
    [to me it resembles Fisher vs Neyman-Pearson - does one perform
    "pure significance tests", or should one always have an H1 in mind
    (typically from a spectrum of models)?] I'd be grateful if you'd
    say explicitly what you'd do in the following situation.

    Suppose a client comes to you with the sort of data described:
    the X variable is 0/1 and the Y variable continuous.

    They are interested ONLY in testing the null hypothesis
    H0: rho(X,Y)=0. They ignore any suggestions that it might
    be better (say) to model Pr(X=1|Y=y), they ignore any suggestion
    that correlation may be a perverse way to think about the data
    (particularly if e.g. X has been assigned rather than observed).
    They aren't interested in a confidence interval for rho,
    and they don't want to test H0': rho=a for any nonzero a.
    They really, really just want to test H0: rho=0.

    You look at the marginal distribution of Y, and judge
    that it may reasonably be assumed to be Normal.

    What test do you apply? i.e.
    what is the test statistic T?
    what is its null distribution?
    for what values of T do you reject H0 (say with P<0.05)?

    Many thanks -- Ewart Shaw
    Guest, Oct 6, 2006
  14. Reef Fish

    Reef Fish Guest

    I am more than happy to explain to you that if there's a controversy,
    it's Tukey-Box vs Fisher-Neyman-Pearson and the rest of the non-
    thinking statisticians in terms of VALIDATING the assumptions
    behind any statistical procedure BEFORE using it!

    It appears that said idea of validating the statistical assumptions
    is as foreign to you as it was to m00es, and perhaps m00es is
    from the UK as well, as I always suspected.
    You stated the client's case VERY WELL! Of course I would
    first tell him what a foolish chap he is (in a polite British way
    of course) and then tell him since he is paying the fees and he
    knows exactly what he wants while ignoring all my suggestions,
    I would gladly test Ho: rho = 0 for him, in a statistically CORRECT
    way of course.

    It's one thing to satisfy a client's wishes to do some statistical
    procedure which is neither necessary nor wise for whatever
    problem he has in mind, it is an ethical matter to do the procedure
    "according to the book", without letting any error slip by, and
    that's the way I'll be your consultant on your problem.
    By that, I presume you mean the DATA for Y may reasonably
    be assumed to be Normal, after you've done a P-P or Q-Q
    plot (or what's also called Normal Probability plot) of the data
    Since you want to test Ho: rho = 0, I would use the usual test-
    statistic T = (r - 0)/se(r), the same one m00es used.
    T with (n-2) d.f. since it has been validated that Y is Normal,
    and according to Hogg and Craig, cited by m00es, it is
    sufficient to use the same t or s he used.
    For a two-tailed test, it would be | T | > t(.975, (n-2)) from
    the T-tables.
    That's the easiest $300 I've ever earned consulting on a problem
    in statistics. :).

    I'll use the remainder of your hour to tell you that your problem
    is a no-brainer because the DATA for one of your variables X
    and Y used in the correlation had been VALIDATED to satisfy
    the statistical assumption required for a test of the correlation
    coefficient. I would even use the same rejection region for
    testing any other value of rho for you Ho: rho = c, for free, by
    simply using T = (r - c)/se(r).

    Had your DATA not passed the interocular traumatic
    test of a Normal Probability Plot of Y, I would have told you
    that there is no known statistical theory on which to base the
    test of your Ho: rho = 0, because neither of your X nor Y
    is Normal, and then I would insist that you re-formulate
    your problem, or else I would turn you away, not taking any
    easy money from a fool for doing something wrong that he
    wouldn't know is right or wrong.

    Now, you should be a very satisfied client.

    You're welcome to come back for your next consulting problem.
    But bring some REAL British pounds next time. :)

    -- Reef Fish Bob.

    Reef Fish, Oct 7, 2006
  15. Reef Fish

    Guest Guest

    Thank you for your very clear & precise response...
    ....though there's no need to slip in insults like that.
    I did have a couple of further tweaks to the client's situation,
    clarifying why I thought that the argument might have been more
    about "pure significance testing vs the rest of the World"
    rather than of validating assumptions, but

    (1) The OP (Arnold, 19th Sept) asked about the use of the correlation
    coefficient "to analyse the dependence between a dichotomous
    dependent variable and a continuous independent variable",
    which involves more than just testing H0: rho=0;
    (2) I don't in any case want to defend using correlation here;
    (3) I don't have the time
    (4) or the energy
    (5) or the money.
    Guest, Oct 9, 2006
  16. Reef Fish

    Reef Fish Guest

    You're welcome. Your very clear & precise question made it easy.
    That was merely a statement of FACT. The fact that you had to
    ask such questions made it very clear and precise that validating
    statistical assumptions is as foreign to YOU as it was to m00es.
    But you certainly seem to grasp my response with much less
    labor than m00es.
    REAL British pounds sterling are scarce, aren't they?
    and ... I duly responded, as I did to you, that the use of
    correlations was NOT appropriate in his case, but a test of the
    slope in a regression WAS appropriate.
    (2), (3), and (4) would be an exercise in futility just as m00es's
    wasted time and energy.
    You would only have wasted your money, had you had it, IN
    ADDITION to wasting your time and energy.

    Reef Fish Bob,
    Reef Fish, Oct 9, 2006
  17. Reef Fish

    m00es Guest

    I think this discussion is really going nowhere at this point. I have
    repeatedly pointed out the fact that using t = b1/s(b1) and s = r *
    sqrt(n-2) / sqrt( 1 - r^2) are in fact identical tests for testing H0:
    beta1 = 0 and H0: rho = 0 (the one implies the other). Since these are
    identical tests, the assumptions are the same.
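
    The arithmetic identity between the two statistics can be checked
    numerically with a small sketch like the one below. The 0/1 X
    values and the Y values are hypothetical, made up for illustration,
    and the function name slope_t_and_correlation_s is not from any
    library. The check only shows that the two formulas give the same
    number on the same data; it says nothing about which assumptions
    need to be validated.

```python
import math

def slope_t_and_correlation_s(x, y):
    """Return (t, s): t = b1/s(b1) and s = r*sqrt(n-2)/sqrt(1-r^2)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = sxy / sxx                       # regression slope
    sse = syy - b1 * sxy                 # residual sum of squares
    t = b1 / math.sqrt(sse / (n - 2) / sxx)

    r = sxy / math.sqrt(sxx * syy)       # sample correlation
    s = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t, s

# Hypothetical two-group data with a 0/1 X, not taken from the thread.
x = [0, 0, 0, 0, 1, 1, 1, 1, 1]
y = [9.8, 10.4, 10.1, 9.5, 12.2, 11.6, 12.9, 12.0, 11.3]

t, s = slope_t_and_correlation_s(x, y)
print(t, s)
```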

    Yes, we assume H0 is true when conducting a hypothesis test. That's the
    initial assumption. Of course, we may conclude that H0 should be
    rejected after we have carried out the test. But comments from Reef
    Fish like "If you assume Ho holds, why would you NEED to test if Ho is
    true?" are a clear indication that he is either completely ignorant how
    hypothesis testing works or just playing dumb. I'll assume the latter
    applies, but in either case, it makes it impossible to actually have a
    reasonable discussion about this issue.

    When H0 holds, then Y is normal. Only then will t and s have a
    t-distribution with n - 2 degrees of freedom.

    Reef Fish keeps saying: But you first must check whether Y is really
    normal! Otherwise, s will not follow a t-distribution with n - 2
    degrees of freedom! But neither will t = b1/s(b1). So, we keep going
    round and round. I never said that one should not check the model
    assumptions. In fact, the exact same assumptions apply to t = b1/s(b1)
    and s = r * sqrt(n-2) / sqrt( 1 - r^2) for testing H0: beta1 = 0/H0:
    rho = 0.

    Well, at this point, I would say that this discussion isn't useful
    anymore. We will just have to agree to disagree.

    m00es, Oct 10, 2006
  18. Reef Fish

    Guest Guest

    That's mainly why I wondered to what extent this "discussion"
    was related to "significance testing vs hypothesis testing".
    If one adopts the regression/t-test model with conditional
    Normality of [Y|X=0] and [Y|X=1], then the Hogg & Craig result
    is a red herring: if you satisfy yourself (q-q plot or whatever)
    that Y can be assumed to be marginally Normally distributed,
    then this implies that H0 is true. You can't then use that
    as justification for carrying out the confirmatory test.

    I had assumed the discussion was based on the following scenario.

    A client has fixed (say equal) numbers of observations of Y
    at X=0 and X=1. You verify that the statistical assumptions
    underlying a 2-sample t-test are reasonable for their data.
    The client then says that all they want (however perversely)
    is a test of H0:rho=0. They are not interested in anything else.
    Maybe their boss has told them they will lose their job if that
    can't be done; maybe they also offer you an obscene amount of money.
    Can you carry out a formally correct test of H0?

    I think the discussants have been talking at cross purposes,
    I don't see why marginal Normality has been brought into the
    discussion, I don't see why Reef Fish insists that validating
    statistical assumptions is foreign to both you and me, and I
    agree that the discussion is really going nowhere at this point.
    Guest, Oct 10, 2006
  19. Reef Fish

    Reef Fish Guest

    That was, as others used it, a SARCASM of your being obtuse to
    the fact that DATA plays several roles in ANY test of a hypothesis,
    and your repeated FALSE claim that "DATA is irrelevant".

    That is your repeated blunder about the DATA of Y.

    Whatever is assumed to hold in Ho does NOT affect the DATA used
    to test whatever is stated in Ho.

    For the TEST STATISTICS used for testing Ho, it is necessary to
    assume the parameter value in Ho to be used in the test statistic.

    But the DATA used in any hypothesis test is completely independent
    of the STATEMENT of any Ho!

    But said DATA is used to VALIDATE any assumption that is in the
    process of testing the parameter in question.

    That's the same point you, and now the other chap Shaw from UK
    are still failing to see:

    1. For testing the SLOPE of a simple regression, you do NOT
    NEED to validate the distribution of the Y in the DATA, because
    that Y is assumed to have come from a mixture of n different
    normal populations, in the REGRESSION problem.

    You only need to validate that the RESIDUALS of the regression
    are normal.

    2. For testing the CORRELATION in X and Y, whether the same
    data is used in a regression or not, in order to use the TEST
    STATISTIC for testing the correlation, you not only need to
    use the value of rho in Ho, but you ALSO need to VALIDATE
    that either X or Y MUST be normal, according to what you
    cited from Hogg and Craig.

    (1) and (2) above are TWO DIFFERENT problems. TWO DIFFERENT
    tests of two different hypotheses!

    The fact that you are using the same (or different) DATA for those
    two different problems does not change the FACT that you need to
    do DIFFERENT VALIDATIONS for each of the two problems.

    Your statistical training in the UK apparently is foreign to the Data
    Analysis approach in the USA (since the teaching of Tukey and Box)
    that assumptions behind ANY statistical procedure must be validated
    before applying those procedures.

    That which has been widely and completely accepted in the
    APPLICATION of Neyman-Pearson type of hypothesis testing or
    interval estimation, is POST Neyman-Pearson, but easily seen
    to be a NECESSARY part which Neyman-Pearson overlooked in
    their formulation of theory, without touching on the logical
    necessity of VALIDATION of assumptions behind the theory.

    The above is true for testing the correlation ONLY when it can be
    validated that X or Y of the DATA is from a Normal population.

    See the above, spelled out more and more explicitly and SEPARATELY
    each time -- that's about the normality of the DATA which is necessary
    to perform a test of CORRELATION, but not necessary for a REGRESSION,
    under the different assumptions of those two DIFFERENT procedures.

    Otherwise, s will not follow a t-distribution with n - 2
    You misspelled "we". YOU are the only one BLIND to the difference
    between what is ASSUMED and what needs to be VALIDATED using
    the DATA,

    Really? Then why did you keep saying that the DATA is irrelevant?
    HOW do you check the model assumption about CORRELATION
    without checking that the DATA Y is normal? In the case of putting
    TWO different groups into the same Y, the DATA can be clearly
    nonnormal because it's the mixture of two different sets of DATA
    from two different normal populations.

    That's exactly the place that you were WRONG, and remain
    WRONG, because you failed to recognize that those two are
    DIFFERENT problems requiring DIFFERENT validations of
    No, you do not have the luxury to agree to disagree when you are
    100% WRONG, and had been proven to be WRONG when you
    muddled in your FAILURE to recognize what needs to be
    VALIDATED by the DATA in two different problems of hypothesis
    test, each of which REQUIRES a different assumption to be

    That is exactly where you went astray, and stayed astray.

    Go back to the DATA which started all of this, and discussed
    EXPLICITLY by me, as an example -- the DATA taken from the
    textbook of Anderson and Sclove, and used in the Manual by
    Ling and Roberts to illustrate the set up as a simple regression
    problem to test the equality of the means of two independent

    Go back to that DATA and show us WHEN you had ever said
    anything about the validation of assumptions anywhere, or the
    VALIDATION of the Y in the simple regression?

    You can go back and use a Normal Probability Plot (if you had
    ever used one) or any technique for VALIDATING the
    assumption of Normality, and show us what you found, and
    we can START from there.

    You NEVER ever looked at the DATA, let alone using it to
    validate any assumption. The validation of the normality
    of the RESIDUALS was illustrated in the Ling/Roberts
    application. There, it was not necessary to validate the
    normality of the Y because no correlation was tested.

    Learn how to READ, m00es, and learn how to recognize
    that (1) and (2) stated above are two DIFFERENT problems
    each requiring its own validation of assumptions.

    -- Reef Fish Bob.
    Reef Fish, Oct 10, 2006
  20. Reef Fish

    m00es Guest

    I said that data is irrelevant for deriving the distribution of the
    test statistic. You claimed otherwise. I pointed out that this is

    The entire time, we have been discussing the situation, where:

    Y = beta0 + beta1 X + e,

    where e ~ iid N(0, sigma^2). We do not know beta0, beta1, or sigma^2,
    since these are parameters. But we assume that e is normal. Therefore,
    when H0: beta1 = 0 holds (which is equivalent to H0: rho = 0), then Y
    is normal. And for testing H0: beta1 = 0, we start out by assuming that
    H0 holds. So, under H0, Y is normal. The assumption that e is normal
    may be off, but that gets into a different issue.

    By the way, Reef Fish, it is quite apparent that you actually got your
    statistical training in Iceland.

    m00es, Oct 10, 2006