# Reef Fish Statistics for Dummies: Applied Simple Regression

Discussion in 'Scientific Statistics Math' started by Reef Fish, Oct 4, 2006.

1. ### Reef Fish (Guest)

For those who have completed the first portion of any First Course in
Statistics to arrive at the "Simple Regression" topic, this lecture is
self-contained. It is elementary, but it contains some material seldom
taught in Mathematics and Statistics Departments, because it is easy
for the mathematically inclined to neglect the APPLIED aspects of
Statistics in general, and the necessary methodology in Simple
Regression in particular.

1. What is a "Simple Regression"?

Most regression problems are simple, in the English sense of the
word simple. In Statistics, the term "simple" in "simple regression"
is
synonymous with "regression with ONE independent variable X"
in the model:

Y(i) = bo + b1 X(i) + e(i),                                (1)

where e(i) is the random error of Y(i), independent and identically
distributed as a Normal random variable with mean 0 and variance
sigma^2, denoted as ~ i.i.d. N(0, sigma^2).
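A quick way to see model (1) in action is to simulate data that satisfies it exactly and fit a least-squares line. This is only an illustrative sketch; the parameter values (bo = -5, b1 = 2, sigma = 3) and the uniform choice for X are hypothetical:

```python
# A sketch of model (1) with hypothetical parameters: bo = -5, b1 = 2, sigma = 3.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)        # X carries NO distributional assumption
e = rng.normal(0, 3, n)          # the ONLY statistical assumption lives here
y = -5 + 2 * x + e               # Y(i) = bo + b1 X(i) + e(i)

b1_hat, b0_hat = np.polyfit(x, y, 1)   # least-squares straight-line fit
print(round(b0_hat, 2), round(b1_hat, 2))
```

Note that X was drawn from a uniform distribution: nothing in the model cares, since the only distributional assumption is on the e(i).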

2. Model Assumptions in (1)

bo and b1 are unknown parameters to be estimated from DATA
(X(i), Y(i)), i = 1, 2, ..., n, ASSUMED to have the error structure
of e(i).

Because of the model assumption, before one does ANY analysis
or statistical inference, one must first "validate" the model
assumptions, because if the assumptions are wrong, then the
estimation and inference theory and methodology would not be
applicable.

Q. What STATISTICAL assumptions can we examine or
validate before we proceed with our application?

A. NOTHING. Nothing at all!

That is where many graduate students stumble in applying
simple regression. They realize that each
Y(i) in the model comes from a Normal distribution with mean
bo + b1 X(i), and variance sigma^2, and they have both graphic
and analytic tests (SPSS, SAS, S, R, Maple, etc) for normality,
so they test the data values of Y for normality.

That is BLUNDER Number 1. While each Y(i) comes from a
normal distribution, the dependent variable Y is a mixture of n
different normal distributions, and there is no reason why the
mixture Y should resemble data from a normal distribution at all.
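The blunder is easy to demonstrate with simulated data. In this hypothetical sketch, X sits at two widely separated values, so the marginal Y is a 50/50 mixture of two normals and wildly non-normal (excess kurtosis near -2), even though the model holds exactly and the residuals are as normal as can be:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical setup: X takes two widely separated values.
x = np.repeat([0.0, 10.0], 100)
y = 1 + 3 * x + rng.normal(0, 1, 200)   # each Y(i) IS normal about bo + b1 X(i)

def excess_kurtosis(v):
    z = (v - v.mean()) / v.std()
    return (z ** 4).mean() - 3.0

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# Marginal Y: a 50/50 mixture of N(1,1) and N(31,1), strongly bimodal,
# with excess kurtosis near -2. Residuals: close to a single normal, near 0.
print(round(excess_kurtosis(y), 2), round(excess_kurtosis(resid), 2))
```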

Not only graduate students, but authors of statistics textbooks
sometimes make the same error. I caught the authors of the
textbook "Statistics for Business and Economics", Boston:
Allyn and Bacon, (1980), making that error in Figure 11-2,
on page 278, "Normal Distribution of the Population of Y
Regressed on X", suggesting by a sketch of a SINGLE normal
distribution that the observed Y in the aggregate should follow
a normal distribution.

The authors of that book were: Heitman, W.R. and Mueller, F.W.

Q. What about the distribution of X?

A. There is NO assumption about the distribution of X. They
can come from any distribution, and they can be fixed
constants or any given values.

It is BLUNDER Number 2 for anyone to examine the probability
distribution of X, or any "outlier" of X because they think X
should behave like a normal distribution, or any distribution.

The ONLY statistical ASSUMPTION behind a simple regression
model is that the ERRORS are ~ i.i.d. N(0, sigma^2), and there
is NOTHING you can do until you have tried some fit and have
observed the errors, in the form of "residuals" (left over from an
exact fit to a straight line).

The only thing one CAN, and SHOULD do, is to examine the
DATA for typographical and other non-statistical errors, verify
that they are indeed errors, and correct them before doing any
regression fit.
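A minimal screen in that spirit, before any fitting: flag values an order of magnitude away from the bulk of the data. The values below are hypothetical, with one planted "typo" of the 15849 variety:

```python
# Hypothetical data with one planted gross error, in the 15849 style.
values = [142, 151, 148, 15849, 139, 155, 147, 150]

# Flag values an order of magnitude away from the middle of the data.
middle = sorted(values)[len(values) // 2]
suspects = [v for v in values if v > 10 * middle or v < middle / 10]
print(suspects)
```

Any flagged value must then be VERIFIED as an error (against the source, as with the SPSS Manual results) before it is corrected.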

The Reef Fish archives come in handy:

The DATA was taken from the 1975 SPSS Manual in which
the data in that post was used to illustrate the output of a
Multiple Regression. That was where "Lesson #1" was
mentioned by me:

RF> LESSON #1. ALWAYS examine the data for gross
RF> (and not so gross) anomaly.

RF> Jerry and Russell are making a good start toward
RF> their Fish University "A".

The 15849 is of course an obvious typo, not by ME (it took me
about 10 minutes to type the data, about 30 minutes to write
a multiple regression program in SPEAKEASY because I have
NO access to any statistical package; and at least an hour to
find and correct the half a dozen or so typos of MINE <G> by
checking against the results I had done 30 years ago). The
typos were in the 1975 SPSS Manual!
======= end excerpt

The TYPO was what contributed to all THREE variables being
statistically significant in the SPSS Manual -- without it ...
that's the next chapter/Lesson.

For Simple Regression, the dataset was used in the series of
lectures on Model Building to show that a Simple Regression
model was "better" than the SPSS's Multiple Regression Model
(with 3 independent variables) when the ERRORS in the DATA
were removed.

3. What is Step 2 of a Simple Regression Application?

THAT is where the Statistical Assumptions of the model are
examined after each attempted fit. The i.i.d. N(0, sigma^2)
assumption can be broken into three INDEPENDENT components:

1. Normality of the errors (residuals)
2. Independence of the errors (residuals)
3. Homoscedasticity (equal variances) of the errors (residuals)

All three MUST be satisfied before any statistical result of
a simple regression can be validly used. These are
independent assumptions, in the sense that none implies
another, and no two of them imply the third.
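Once a fit has been attempted, all three can be checked on the residuals. A bare-bones sketch with hypothetical well-behaved data; the checks here are crude stand-ins for the graphical and formal tests a statistical package would provide:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(0, 10, n)
y = 2 + 0.5 * x + rng.normal(0, 1, n)   # hypothetical well-behaved data

b1, b0 = np.polyfit(x, y, 1)
r = y - (b0 + b1 * x)                   # residuals: stand-ins for the errors e(i)

# 1. Normality: skewness and excess kurtosis of residuals near 0.
z = (r - r.mean()) / r.std()
skew, exkurt = (z ** 3).mean(), (z ** 4).mean() - 3

# 2. Independence: Durbin-Watson statistic near 2 (data taken in order).
dw = np.sum(np.diff(r) ** 2) / np.sum(r ** 2)

# 3. Homoscedasticity: residual spread should not grow with fitted values.
order = np.argsort(b0 + b1 * x)
ratio = r[order[n // 2:]].std() / r[order[:n // 2]].std()

print(round(skew, 2), round(exkurt, 2), round(dw, 2), round(ratio, 2))
```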

Step 2 is what I had called LESSON 2 in the Model Building
thread, using the SPSS data, in the post

"LESSON 2 in Model Building. Iterative Loop of Sponsorship vs Critic"

The Applied Statistician (Data Analyst) first sponsors a model (our
initial simple regression model (1)). Once the model is FITTED,
the analyst must then act as his own CRITIC, to see if any of the
assumptions are violated. If so, he makes changes in the model,
does a new fit, and acts as a critic of his own model AGAIN. This is
the "Iterative Loop of Sponsorship vs Critic" described in detail in
George Box's JASA (1976) article, "Science and Statistics".

The Data Analysis / SPSS / Model Building thread started on
March 17, 2005. By Jun 29 2005 2:06 am, we began the post
on LESSON 2.

At this time, I'll skim over the steps of the validation of the THREE
assumptions and how each violation might be accommodated.

It was in the cited post above that I made this observation,

RF> It's interesting in a way that all FOUR of us found the same
RF> "hockey stick" in the scatterplot of the INVDEX variable vs the
RF> GNP variable. All four of us took DIFFERENT actions!!

RF> In that respect, if we were working as a TEAM, we would
RF> put our heads together on the four TENTATIVE models
RF> and decide what model to sponsor next (if any).

That is the most interesting and rewarding part of being a Data
Analyst or an Applied Statistician! There are no formulas that
tell you what to do -- there are GUIDELINES on what you should
avoid and what are valid continuations.

That's where the SCIENCE of Statistics blends with the ART
of Statistics, in which the "mathematical statistician" is most deficient.

You can find numerous posts of Reef Fish which cited particular
passages from George Box's JASA article -- which he cites every
time he analyses a real set of DATA. In my post below:

I cited Box's indictment of "mathematical statistician's
mathematistry":

"Mathematistry is characterized by development of theory for
theory's sake, which since it seldom touches down with practice,
has a tendency to redefine the problem rather than solve it."
(p. 797, 1976 JASA paper on "Science and Statistics").

The accommodation of a model with misbehaving residuals is
commonly accomplished via transformation of either the
dependent OR the independent (or both) variable in the
model.

Mosteller and Tukey's book "Data Analysis and Regression"
has a nice Exhibit 1 in Chapter 4 "Straightening Curves and
Plots" showing a "continuum ladder of power transformation"
and where negative powers, root transformations and log
transformation fall in the diagrammed exhibit.
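A crude way to walk that ladder by machine: try each rung and keep the power that makes the scatter straightest, here measured by the absolute correlation. The data below are hypothetical, built to be log-straight, so the log rung (coded as power 0) should win:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(1, 100, 200)
y = np.log(x) + rng.normal(0, 0.1, 200)   # hypothetical curved relation

def rung(v, p):
    # Power p on the ladder; p = 0 is the log rung.
    return np.log(v) if p == 0 else v ** p

powers = [-1, -0.5, 0, 0.5, 1, 2]
# Keep the rung that makes the scatter straightest (largest |correlation|).
best = max(powers, key=lambda p: abs(np.corrcoef(rung(x, p), y)[0, 1]))
print(best)
```

In practice the choice of rung is guided by looking at the straightened plots, not by an automatic criterion like this one.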

Going back to the SPSS example... Of those who dared to
show what they tried, NONE used any power transformation
to straighten out the "hockey stick" seen in the simple
regression scatter.

There ain't such a thing as "cook book" or "recipe" in an
enlightened "data analysis", or "exploratory data analysis"
guided by both MODEL and DATA.

That's what Applied Statistics and Applied Simple Regression
are all about.

In the Preface to "Reef Fish Statistics for Dummies", I
said there will be very few formulas or equations. So far,
I had given only ONE, equation (1), which is the usual model
for a simple regression.

4. What comes next?

This is where, after the model and statistical assumptions
have been validated to apply the theoretical results derived
by statisticians, almost everyone will have no trouble finding
the FORMULAS and EQUATIONS used in constructing
Confidence Intervals for the parameters, test Statistical
Hypotheses about the intercept or slope of the parameters
bo and b1 in (1), and obtain prediction intervals for future
observations.

These are formulas that are easy to derive and even easier
to apply, and they are the ones that all of my students have
access to in their OPEN BOOK and OPEN NOTES exams,
so I won't even bother to go over them here, once we have
carefully carried out the APPLIED steps 1 and 2 to ensure
that it is valid to apply the formulas and results.
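For completeness, those standard formulas, sketched once with hypothetical data (the 2.011 critical value is t(.975) with 48 degrees of freedom for this n = 50):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = rng.uniform(0, 10, n)
y = 1 + 0.8 * x + rng.normal(0, 1, n)    # hypothetical data satisfying (1)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
mse = np.sum(resid ** 2) / (n - 2)                  # estimate of sigma^2
se_b1 = np.sqrt(mse / np.sum((x - x.mean()) ** 2))  # standard error of slope

t = b1 / se_b1                                      # test statistic for Ho: b1 = 0
ci = (b1 - 2.011 * se_b1, b1 + 2.011 * se_b1)       # 95% CI; 2.011 = t(.975, 48)
print(round(t, 2), round(ci[0], 3), round(ci[1], 3))
```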

The only REMAINING STEP is outside of most Statistics
textbooks on regression, which is labeled as STEP 3 in
my Model Building lessons:

"LESSON 3 in Model Building: Practical Significance"

There is an ongoing discussion in two of the three sci.stat groups
now, under a not-so-obvious topic of "confidence intervals". In my
opening post

which blossomed into a mini-thread of 8 posts, based on my
statement in the initial post, in which I wrote:

RF> Any statisticians worth his salt would know that a highly
RF> "statistically significant" result can be completely
RF> worthless from a practical point of view of the usefulness
RF> of the result.

RF> Conversely, a statistical result that is not statistically
RF> significant at some .05 or .10 level can be very useful.

RF> The two concepts are TOTALLY different in terms of
RF> knowing how to apply statistics sensibly and usefully.

That is a VERY important principle for an APPLIED statistician,
which no "mathematical statistician" ever even thinks about.

The 1975 SPSS example was a good illustration of how a highly
statistically significant result can be completely USELESS in
practice, as demonstrated in the Model Building threads.

RF> In my re-analysis of the DATA in the SPSS Manual, I came to

RF> INVDEX = -197.51 + 0.018234 * GNP
RF> (15.031) (.000667)
RF> T=-13.14 T=27.33
RF> p-value 10^(-12)

RF> Multiple R-sq = 0.9726, MSE = 317.96.
RF> All very impressive and highly statistically significant.

And I punctured the euphoria of any social scientist or
non-thinking statistician, in LESSON 3, showing that the
simple regression model was completely USELESS in
practice.

I conclude this First Lesson of "Reef Fish Statistics for
Dummies" with the sig of one Dr. Flash Gordon, M.D.,

FG> in theory, there is no difference between
FG> theory and practice. but in practice, there is.
FG> flash gordon, m.d.

Flash is a PRACTICAL Man (and M.D.) of many interests:

-- Reef Fish Bob.

Reef Fish, Oct 4, 2006

2. ### Reef Fish (Guest)

In the Applied Simple Regression lesson, Reef Fish gave the STEPS
that a Data Analyst must follow, beginning with the Model
Assumptions. For the Simple Regression Model, the answer was:
there is NO assumption about Y, which is the aggregate of the Y(i)
from n different distributions! That is the FIRST STEP in Applied
Simple Regression, before doing any fit of the data or statistical
inference.

Now suppose we want to use the Simple Regression DATA to
test the correlation between X and Y, and apply the same STEPS
and Principles of Data Analysis. Then we'll see what we HAVE
to do, and in the process see HOW and WHY m00es erred
in his insistence that the same TEST STATISTIC used to
test the slope of the Simple Regression problem can be used to
test the correlation Ho: rho(X,Y) = 0.

For the problem of testing the correlation R, as we have seen,
according to the result quoted from Hogg and Craig by m00es:

m00> Hogg, R. V., & Craig, A. T. (1995). Introduction to mathematical
m00> statistics (5th ed.).

m00> On pages 478-480, the authors derive the distribution of r
m00> under the bivariate normal assumption and show that
m00> under rho = 0, r * sqrt(n-2)/ sqrt(1-r^2 ) is distributed t(n-2).
m00> Now, on page 480, the authors mention EXPLICITLY that
m00> a careful review of their proof reveals that nowhere was it
m00> necessary to assume that the two variables are bivariate
m00> normal. Only one of the variables must be normal.
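That t(n-2) statistic for r is algebraically identical to the t statistic for the slope b1, which anyone can confirm numerically. A sketch with hypothetical data (the identity holds for ANY data):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
x = rng.normal(0, 1, n)            # X normal, so Hogg & Craig's condition holds
y = 0.5 * x + rng.normal(0, 1, n)

# t statistic for the slope, from the regression fit.
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
se_b1 = np.sqrt(np.sum(resid ** 2) / (n - 2) / np.sum((x - x.mean()) ** 2))
t_slope = b1 / se_b1

# t statistic for the correlation: r * sqrt(n-2) / sqrt(1 - r^2).
r = np.corrcoef(x, y)[0, 1]
t_corr = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)

print(round(t_slope - t_corr, 10))   # the two are algebraically identical
```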

The problem changes drastically for the Applied Statistician
and Data Analyst if he wants to use the simple regression DATA
to test the correlation between X and Y. Unlike the regression
problem, where there is NOTHING he can verify about the
statistical assumptions before fitting a regression line, the
statistician now must VALIDATE the NECESSARY ASSUMPTION
(to test the correlation) that ONE of the two variables X or Y
must be NORMAL, before calculating the correlation coefficient,
let alone making statistical inference about it based on the same
test statistic T used in testing the slope in a regression problem.

It is HERE that m00es should have realized that his repeated
claim that "DATA is irrelevant" to the distribution of the
TEST STATISTICS is drastically wrong, from the APPLIED
Statistics and Data Analytic point of view.

Because the CORRELATION MODEL assumption requires
X or Y to be normally distributed, in the relaxed assumption
from bivariate normality, the data analyst's first step in the
sponsor-critic iterative loop is now replaced by the necessity
to validate that ONE of the two variables MUST be normally
distributed, rather than the full BIVARIATE NORMALITY of (X,Y).
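A sketch of that validation step, using a Jarque-Bera-style statistic built from the sample skewness and excess kurtosis (a stand-in for whatever normality test one's package provides); a dichotomous indicator variable, like the X in this thread, fails it spectacularly:

```python
import numpy as np

def jarque_bera(v):
    # Jarque-Bera-style statistic: large values signal non-normality.
    n = len(v)
    z = (v - v.mean()) / v.std()
    s, k = (z ** 3).mean(), (z ** 4).mean() - 3
    return n / 6 * (s ** 2 + k ** 2 / 4)

rng = np.random.default_rng(6)
x_normal = rng.normal(0, 1, 500)
x_indicator = np.repeat([0.0, 1.0], 250)   # a dichotomous indicator variable

# Roughly chi-square(2) under normality; the 95% cutoff is about 5.99.
print(round(jarque_bera(x_normal), 2), round(jarque_bera(x_indicator), 2))
```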

In the problem and DATA under consideration -- where the
Y in a Simple Regression Model is a mixture of n normals,
and hence nonnormal a priori; and the X variable, which is
the dichotomous Indicator variable, is clearly nonnormal --
the statistical theory about testing r(X,Y) = anything is NOT
applicable, and there is no way of fixing the violation, because
Y consists of the DATA from two independent normal
distributions with (statistically significantly) different means.

One can still do a validation of the normality of Y: if the
two populations of Y are not sufficiently far apart in their
means to make the mixture distribution noticeably
nonnormal, one COULD still proceed to test the
correlation under Hogg and Craig's assumption.

Alas, DATA is/are very relevant, not only in the final
execution of the correlation test: the data would have
FAILED the normality test in the mixture of the two groups
whose data came from the Anderson and Sclove book
on testing the equality of means, as illustrated in the
Ling and Roberts Manual using simple regression for
the problem of testing MEANS.

If the same data WERE used to test the correlation
between the X and Y in the regression problem, the
model-validation step would have failed by the data Y,
and one would have to appeal to the mercy of "robustness"
of the test statistic against the point-biserial-correlation
when X is dichotomous and Y is NON-normal.

This is precisely the way ANY test of correlation between
two variables MUST be done. One cannot simply assume
away bivariate normality OR normality of ONE of the two
variables.

One must VALIDATE that one of the two variables IS indeed
normal, or not sufficiently non-normal to make the inference
about the POPULATION correlation coefficient between X
and Y.

And FINALLY, even if everything goes well up to the point
of actually testing the correlation coefficient -- that one of the
variables can be said to be normal, and that Hogg and Craig's
NECESSARY condition is satisfied -- we still are confronted
with the question of "practical significance" of the tested result.

What if X and Y are "significantly correlated", with
rho=0 rejected at the alpha = 0.001 level, or at a p-value
less than 0.001 -- what does one make of the PRACTICAL
usefulness or uselessness of that result?

You can't eat the correlation, and you can't bite the
correlation for its hardness or softness. You can't do
ANYTHING with it, except looking at other evidence
such as the predicted values or prediction intervals
associated with the same significance.

The SPSS example now comes vividly to mind.

In said example, I showed

RF> INVDEX = -197.51 + 0.018234 * GNP
RF> (15.031) (.000667)
RF> T=-13.14 T=27.33
RF> p-value 10^(-12)
RF> Multiple R-sq = 0.9726

The slope coefficient of the simple regression (and
hence the correlation) would have shown a T value
of 27.33, corresponding to a p-value of .000000000001,
for a correlation R of 0.9862 that was judged to be
USELESS in practice!

The only thing that comes to my mind is the reinforcement
of something I always attributed to John Tukey (but I don't
recall seeing it in PRINT, so I must have heard it from his
mouth, because I can't make something like that up.),
"Using correlation is like sweeping dirt under the rug, with
a vengeance".

I couldn't have characterized it any better!

I hope this finally settles the m00es Lecture topic, which
dragged on for weeks, with billions of wasted electrons in
dozens of posts and NOISE, about how to test a hypothesis
about a correlation.

As I had said in the Preface, I wish I had started with the
APPLIED Simple Regression; it would have immediately
flattened m00es's repeated argument that "DATA is irrelevant"
in his posts about theory, and shown the relevance of DATA as
well as the relevance of VALIDATING Hogg and Craig's
assumption if one were to test the hypothesis Ho: rho = 0,
or rho = anything.

But the FINAL conclusion (in the light of the SPSS Simple
Regression example) is most rewarding, in seeing how a
correlation of .98+ that rejected Ho: rho = 0, at any alpha
level of 0.00000000001 or greater, can be so ... utterly
USELESS in practice.

This concludes the Simple Regression for Dummies, with
the bonus lesson of Testing Correlations for Dummies.

-- Reef Fish Bob.

Reef Fish, Oct 4, 2006

3. ### m00es (Guest)

No, it does not, because what you wrote is still not correct.

Data IS irrelevant for deriving the distribution of the test statistic.
This has nothing to do with a viewpoint. It's a fact. I don't need to
observe any data to derive that distribution.

I'll explain this again. Why don't you explicitly point out in this
proof the source of my error.

1) The model: Y = beta0 + beta1 x + e, where e ~ iid N(0, sigma^2)

2) beta1 = rho(X,Y) * SD(Y) / SD(X)

3) Therefore, beta1 = 0 iff rho(X,Y) = 0

(since SD(Y) and SD(X) can safely be assumed to be > 0).

Now we want to test H0: beta1 = 0. As you have said yourself, we can
use:

t = b1/s(b1)
s = r * sqrt(n-2) / sqrt( 1 - r^2 )

to test H0: beta1 = 0. Why?

4) under H0: beta1 = 0, t follows a t-distribution with n - 2 degrees
of freedom
5) s = t, so both MUST have the same distribution
6) we can also use the result from Hogg & Craig to see that s has a
t-distribution with n - 2 degrees of freedom under H0. Let's use my
quote:
Under H0: beta1 = 0, then Y = beta0 + 0 X + e. Therefore Y ~ N(beta0,
sigma^2). We see that under H0, Y is normal and not a mixture
distribution. Therefore, s follows a t-distribution with n - 2 degrees
of freedom.

An important point: t (as well as s) only follows a t-distribution with
n - 2 degrees of freedom when beta1 = 0 holds!

7) When we reject H0: beta1 = 0, we automatically reject H0: rho = 0
and vice-versa.

q.e.d.

So, why don't you actually point out where the error is. And don't say:
Y follows a mixture distribution. Under H0, it does not. If it did,
then NEITHER t NOR s would have a t-distribution with n - 2 degrees of
freedom. But since s = t, they both have the SAME distribution -- it
just won't be a central t-distribution when H0: beta1 != 0.

In fact, that's the whole idea of hypothesis testing:

(a) Assume H0 holds.

(b) Derive the distribution of the test statistic UNDER THE ASSUMPTION
THAT H0 holds (and Y is normal under that assumption).

(c) Then obtain data, calculate the test statistic in the sample, and
see where the observed test statistic falls with respect to the
critical bounds according to the distribution under H0.

(d) When H0 holds, then using the critical bounds according to the
distribution under H0 guarantees that we will only reject H0 in alpha *
100% of the cases. But if H0 does not hold, then the distribution of
the test statistic (i.e., the distribution of t, which is the same as
the distribution of s) will be stochastically greater than the
distribution of t (= s) under H0. Therefore, the probability of
rejecting H0 increases, which is exactly what we would want.
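Steps (a) through (d) can be sketched in a small simulation: generate data with beta1 = 0 (so H0 holds and Y IS normal), even with a dichotomous X, and the t test rejects at close to the nominal 5% rate. The sample size, replication count, and seed below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps, t_crit = 30, 2000, 2.048       # 2.048 = t(.975, 28)

x = np.repeat([0.0, 1.0], n // 2)       # dichotomous X, as in the example
rejections = 0
for _ in range(reps):
    y = 5 + rng.normal(0, 1, n)         # H0 holds: beta1 = 0, so Y IS normal
    b1, b0 = np.polyfit(x, y, 1)
    resid = y - (b0 + b1 * x)
    se = np.sqrt(np.sum(resid ** 2) / (n - 2) / np.sum((x - x.mean()) ** 2))
    rejections += abs(b1 / se) > t_crit

print(round(rejections / reps, 3))      # hovers near the nominal 0.05
```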

But one more time: Under H0: beta1 = 0, both t and s have a central
t-distribution. Moreover, beta1 = 0 iff rho = 0. Therefore, rejecting
beta1 = 0 implies that we can reject rho = 0 and vice-versa.

So, please, enlighten me where the error is.

m00es

m00es, Oct 5, 2006
4. ### Russell.Martin (Guest)

But as I read RF's point, you don't *know* that the *data* support
H0 a priori, so you can't validly run the test to show that it does
*until* you check for normality. Kind of a Catch 22. Your point
about deriving the distribution is moot in the case of the actual
process of doing the data analysis. IOW what you show is true
in theory under a set of assumptions may not be valid if those
assumptions are violated, and you need to test the validity of the
assumptions *before* proceeding. At least that is my reading of
the situation. I'll admit I've only skimmed much of the voluminous
exchange on this topic, in part because so much of it is repetitious
because neither of you are trying (it seems to me) to *understand*
what the other is saying, so I may have missed everyone's point
entirely.
And here is where it seems to me you need:
(c-1) Test that the data satisfies the requirements of the
hypothesis test.
Enlighten me, please, if you think I'm wrong.

Cheers,
Russell

Russell.Martin, Oct 5, 2006
5. ### TomC (Guest)

Hello,

To Mr Martin - I think mOOes has been very consistent all along. Under the assumption that Ho is true, what he has stated is also true.

Time and time again mOOes has been very specific - he has stated his position repeatedly; under the assumption that Ho is true, Rho(x,y) = 0. mOOes has not claimed he is testing Rho(x,y) = 0 as has been claimed by some. As mOOes has stated it simply follows if H0 is true -

Clearly at some point we have to test the data to see if Ho holds. Any reasonable reading of this whole debate would see this as another RF saga.

So mOOes you will never get a concession from RF - you are wasting your time. RF is never wrong - if you think RF is wrong refer to the previous point.

Regards

TomC.

TomC, Oct 5, 2006
6. ### Lou Thraki (Guest)

You need to check whether the hypothesis is true before you test it ;-)

Lou Thraki, Oct 5, 2006
7. ### Russell.Martin (Guest)

Yes, he just keeps repeating the same thing, which of course
is very consistent. RF is somewhat guilty of the same thing.
That doesn't seem to be the point of contention.
Yes, repeatedly. Cheers,
Russell

Russell.Martin, Oct 5, 2006
8. ### Reef Fish (Guest)

Russell, glad to see you (or anyone else) stepping in about the aspect
of VALIDATING the ASSUMPTION in data analysis.

You interpreted me correctly, almost. There is NO Catch 22
involved. The ASSUMPTION is not in Ho. It is in what is needed
for the DATA to satisfy in order to use a particular TEST STATISTIC.

The point here is that t=s always (a trivial mathematical fact).

To use it to test the SLOPE of a simple regression, there is NO
ASSUMPTION about X or Y that can be, or need be, validated.
That's in the First Lesson of this "Reef Fish Stat for Dummies".

To use the same t to test a CORRELATION, the ASSUMPTION
(re: Hogg and Craig) is that ONE of the X or Y need be normal.
That's about the DATA that needs to be validated. In this case
one can just think of ANY data X, Y, whether it was used in a
simple regression or not. Someone comes with a set of
data X and Y and wants to test Ho: rho = 0. What must
a data analyst do?

He MUST validate that X or Y is normal, because that's the
ASSUMPTION behind the theory of the distribution of the
test statistic for testing CORRELATIONS.

That is 100% correct.

m00es is confused by the fact that the test statistic must
INCORPORATE the value of the parameter under Ho.
So, in testing rho = 0, the test stat would be (R - 0)/s(R).
To test rho = .3, the test stat would be (R - .3)/s(R), but
each of those has a T(n-2) ONLY when X or Y is normal!
His argument that rho = 0 implies something about the data
falls apart completely when you test rho = .3. That makes
your explanation below all the more ESSENTIAL to any
statistical procedure, hypothesis testing or not.

That is also 100% correct. In fact THAT is the WHOLE idea
of needing to validate any assumption behind any statistical
procedure BEFORE proceeding.

The SAME validation about X and Y (that ONE must be normal)
is necessary if one wants to use the test statistic to construct
a Confidence Interval for R.

That faulty inference of yours, Russell, was partly because you
were not completely clear yourself, but mostly because you did
not read me carefully enough, AND of course, much of the
previous exchanges were repetitious -- m00es playing the same
tune without alteration AND without reading my explanation,
and I could only re-explain the same in different ways.

I understood m00es point from the start. He was confusing
distribution theory and the role of the parameter in Ho, with the
ASSUMPTION behind any statistical procedure.

I SHOULD have gone into the current Lesson 1 and 2 in the
"Reef Fish Statistics for Dummies" as soon as he stated
"DATA is irrelevant".

That's why I thought he would finally wake up to the POINT that

1. T is t(n-2) in testing the SLOPE of the regression line without
requiring either X or Y to be normal.

2. T is t(n-2) in testing the CORRELATION between any X and Y
ONLY if X or Y is normal. (Hogg and Craig's necessary condition).

Here, m00es is STUCK in that same HOLE he had been digging.

That's only ONE aspect of the distribution of the test statistic, as
I indicated above. To test rho = c, one would have to incorporate
(R - c)/s(R) into the test statistic. But that is INDEPENDENT of
the fact that the distribution of R must ALSO satisfy the
underlying assumption that X or Y must be normal in the
DATA being used to test the hypothesis.

That is the same error made by Heitzman and Mueller in Lesson 1.
More specifically, one must validate the ASSUMPTION that underlies
the procedure for testing the PARAMETER in question.

For correlation, one MUST validate that X or Y is normal.
For simple regression slope, there is NOTHING to validate
about X or Y. The ASSUMPTION lies with the ERRORS
which can only be validated after a regression fit has been
performed.

Snip m00es's same repetition of fallacious reasoning.

I already HAD, in Lessons 1 and 2 of "Reef Fish Statistics
for Dummies". I was merely re-stating the SAME, as I
had restated the same a dozen times before.

You were wrong ONLY in your confusion of Ho with the
statistical assumption that underlies a statistical procedure.

As such, ANY use of the correlation and related theory
based on the normal theory that leads to a T distribution
MUST validate that X or Y is from a normal population.

The ASSUMPTION about the distribution of a Statistic
does not depend on the parameter being tested in Ho.
For confidence intervals on R, there is nothing to be
tested at all!

Russell, what's NEW in the two lessons is that I outlined
the procedure, step by step, of what EVERY applied
statistician should do, in doing a simple regression
analysis. m00es should have realized that he had NEVER
done a simple regression in which he validated anything.

m00es was apparently trained in "mathematical statistics"
where data was never seen.

I used the Simple Regression problem to set up the
PARALLEL steps in the execution. THAT's where it
should have been clear to any careful reader that
the ASSUMPTIONS of the two procedures are
DIFFERENT. In regression there is NO assumption
about the data Y or X other than that each Y comes from
a different normal distribution, and there is NOTHING
you can do to verify THAT -- which is why that
assumption has to wait to be validated until
the RESIDUALS are observed.

The CORRELATION problem, on the other hand,
has an assumption about normality that CAN be, and
MUST be, validated about the X and Y, in following
the same data analytic steps of testing OR interval estimation.

Now, with the additional hints above, everyone can
go back and RE-READ lessons 1 and 2 carefully.

-- Reef Fish Bob.

Reef Fish, Oct 5, 2006
9. ### Reef Fish (Guest)

TomC wrote:

Welcome to the Gulf separating Math-stat and Applied-stat.
Testing!
What's the point of testing a Hypothesis if you assume it is true?

Yes, he WAS talking about testing Rho(x,y) = 0, especially in the
original problem, where testing beta1 = 0 is equivalent to testing
rho = 0.

In fact, m00es has NOT deviated from that hypothesis test -- or else
his own argument would have fallen apart. It DOES NOT MATTER
whether one is testing rho = 0 or rho = anything else.
Here both of you are wrong. The DATA does not follow anything
stated in Ho. The TEST STATISTIC used to test Ho incorporates
the parameter value in Ho, but that is an entirely separate issue,
stated in several different "Hypothesis Testing Lectures" of mine.
And the DATA must satisfy the ASSUMPTIONS of the testing
procedure for the parameter. That's where m00es never recovered
because of his misguided training in MATHEMATICAL statistics,
and never had any education on how DATA is used in applied statistics.
Yes so far m00es has been wasting HIS time. But OTHERS who

You can say that about the thousands of posts I have posted about
STATISTICAL facts and procedures. If I made a minor slip, as I had
done several times, they were instantly corrected, by myself, or
pointed out by others.
TomC, you're obviously deficient not only in Applied Statistics,
but also in your reading of the thread, in claiming
that he never said he was testing Ho: rho = 0.

LOL!

Even something completely WRONG, like yours, may
help m00es see his OWN errors.

His biggest ERROR is his belief that "DATA is irrelevant" in
any statistical procedure.

I think if he stops digging his hole, and read my Lessons 1 and
2 in the "Statistics for Dummies" thread, he MAY actually finally
see his own errors, in his inability to separate the role of the
parameter in Ho, and the role of DATA, both in the execution
of a Hypothesis Test, AND in validating the PROCEDURAL
ASSUMPTIONS in working with any "statistic" in Statistics.

-- Reef Fish Bob.

Reef Fish, Oct 5, 2006
10. ### Reef Fish (Guest)

Even allowing your faulty characterization of my guilt, m00es is
consistently WRONG in his repetition; while RF is consistently
RIGHT in his re-explanations of why m00es was wrong.
That's only because of Russell's own confusion. It IS a point of
contention that Ho does not dictate the DATA. What is agreed
is the fact that the TEST STATISTIC for Ho must incorporate the
parameter being tested.

Verbatim even. And SURPRISE!! A repeatedly stated
falsehood is STILL a falsehood!

-- Reef Fish Bob.

P.S. The TWO lessons in the "Reef Fish Statistics for Dummies"
are so condensed and streamlined that they are each SELF-
CONTAINED, about (1) Applied Simple Regression, and (2)
Testing Correlations. It's quite obvious that TomC is one of
m00es' classmates or friends who dwell in the same
"Mathematical Statistician Hole".

-- Reef Fish Bob.

Reef Fish, Oct 5, 2006
11. ### Reef Fish (Guest)

That is a concise and apt remark about Russell's opening remark.

I made the same remark, in a less graphic and less concise fashion,
when m00es was testing Ho: rho = 0.

Birds of the same feather. That was Russell's faux pas about the role Ho and the role of
checking ASSUMPTIONS underlying a Statistical Procedure.

-- Reef Fish Bob.

Reef Fish, Oct 5, 2006
12. ### Reef FishGuest

m00es wrote, in his own interview by m00es:

It's billions plus epsilon, where epsilon is in the millions.

Data is VERY RELEVANT if the distribution of S depends on the
ASSUMPTION that Y is normal, as in the case of testing the
correlation R(X,Y).

m00es replays HIS script in the Theatre of the Absurd:

Noe Schitt Sherlock! That's a mathematical identity that we all knew.

We heard that a few hundred times, from you, and from RF citing
m00es citing Hogg and Craig. But m00es kept MISAPPLYING
what he quoted from Hogg.
A new angle in Netherlands Statistics. The TEST STATISTIC
for testing Ho: beta1 = 3 would be T = (beta1^ - 3)/se(beta1^).

It is ALSO distributed as T with n-2 d.f. Didn't you know that?

Furthermore, what do you conclude if you accepted OR rejected
Ho: beta1 = 3?
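The statistic named above can be sketched numerically. A minimal illustration with made-up data (numpy and scipy assumed available; the variable names and the simulated sample are mine, not from the thread):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 30
x = rng.uniform(0, 10, n)
y = 1.0 + 3.0 * x + rng.normal(0, 2.0, n)   # simulated data; true slope is 3

# OLS estimates of intercept and slope
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b0 = y.mean() - b1 * x.mean()

# Residual variance and the standard error of the slope estimate
resid = y - (b0 + b1 * x)
s2 = resid @ resid / (n - 2)
se_b1 = np.sqrt(s2 / ((n - 1) * np.var(x, ddof=1)))

# Test statistic for Ho: beta1 = 3; under Ho it follows t with n-2 d.f.
T = (b1 - 3.0) / se_b1
p = 2 * stats.t.sf(abs(T), df=n - 2)
print(T, p)
```

The only change from the familiar Ho: beta1 = 0 test is subtracting 3 instead of 0 in the numerator; the reference distribution is still t with n-2 degrees of freedom.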
Quack Endeth Duck indeed!
Pointed out countless times already. To test the correlation, Hogg
and Craig say the DATA must come from a bivariate distribution
where Y must be normal if X is not.

And m00es kept forgetting to VALIDATE that assumption about Y
when testing correlations!!

A collective LAUGHTER is heard echoing in the halls of the USA, and
even in the halls of Netherlands!

If you assume Ho holds, why would you NEED to test if Ho is true?

and the data on Y CAN be anything BUT normal. For example,
the data on Y could be N(10, .00001) for the first 40 Ys and
N(20, .00001) for the next 60 Ys.

PERFECTLY good regression DATA for Y on X.

PERFECTLY bad data for testing the correlation between X and Y
no matter what the c is, in testing Ho: rho = c .
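The two-bump example above is easy to reproduce. A hedged sketch (numpy and scipy assumed available; the seed and group means follow the numbers in the post, everything else is mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sd = np.sqrt(1e-5)                      # variance .00001, as in the post

# 40 Ys from N(10, .00001) at X = 0, then 60 Ys from N(20, .00001) at X = 1
x = np.r_[np.zeros(40), np.ones(60)]
y = np.r_[rng.normal(10, sd, 40), rng.normal(20, sd, 60)]

# The marginal DATA on Y is a two-bump mixture:
# a Shapiro-Wilk check rejects normality decisively
print(stats.shapiro(y).pvalue)          # essentially zero

# But the regression residuals (Y minus its group mean) form a single
# normal sample, which is what the regression assumptions require
resid = y - np.where(x == 0, y[x == 0].mean(), y[x == 1].mean())
print(stats.shapiro(resid).pvalue)      # typically large
```

This is the whole point of the example: the same DATA passes a residual-normality check for the regression while failing a marginal-normality check on Y.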

See, m00es, that's the truism --- that the DATA for testing a
hypothesis, any hypothesis, does NOT depend on the statement
of the hypothesis being tested.

That's also the motto of Harvard: Veritas (Latin for TRUTH). I learned that motto when I taught at Harvard.

The DATA could be anything from anywhere.

But if you're going to test Ho: R(X, Y) = a given value, then
it doesn't matter what the given value a is, the DATA must
satisfy the NECESSARY condition that either X or Y must
be normally distributed, if you're going to use the statistical
theory behind the test statistic for testing R.

That condition is independent of the USA, Portugal, or UK,
or the Netherlands. It is universal. An applied statistician
in any of those countries must VALIDATE the assumption
if he is going to test the correlation between X and Y.

So the missing link buried DEEP in the m00es HOLE is the
"validation of ASSUMPTION in a statistical PROCEDURE"
for testing OR constructing a Confidence Interval about
an unknown parameter.

In constructing a Confidence Interval on the correlation
coefficient R, the statistic doesn't even have an Ho to
reference.

That topic was covered under the Portuguese Statistics
for Dummies on the difference between a Confidence
Interval for (p1 - p2) using the s.e. for p1^ - p2^, whereas
testing Ho: p1 = p2 must use a different
s.e. incorporating p1 = p2.
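The two standard errors in that Lesson are simple to compute side by side. A minimal sketch with hypothetical counts (the numbers are made up for illustration):

```python
import math

# Hypothetical counts: x1 successes out of n1 trials, x2 out of n2
x1, n1 = 45, 100
x2, n2 = 30, 100
p1, p2 = x1 / n1, x2 / n2

# Unpooled s.e. -- the one used for a confidence interval on p1 - p2
se_ci = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Pooled s.e. -- the one used for testing Ho: p1 = p2, which
# incorporates a common p estimated under the null
p_pool = (x1 + x2) / (n1 + n2)
se_test = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

ci = (p1 - p2 - 1.96 * se_ci, p1 - p2 + 1.96 * se_ci)
z_stat = (p1 - p2) / se_test
print(se_ci, se_test, z_stat)
```

The pooled form exists only because the test assumes p1 = p2; a confidence interval makes no such assumption, hence the unpooled form.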

Read THAT Lesson, m00es, which was given in 2005,
and repeated in 2006, long before you enrolled in
any of the Statistics for Dummy schools.

In testing a correlation coefficient using the T(n-2), the DATA
must meet the assumption that X or Y must be normal!

Think confidence interval, m00es, if that's what it takes
to haul your posterior from that HOLE you dug so deep
in making inferences about a correlation!

Would you use T(n-2) to test Ho: rho = 0 if the DATA
X and Y came from two samples of size 3 each from
different uniform distributions? You can calculate
the correlation coefficient and can use the Cauchy
distribution table ya know? But SHOULD you?
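The Cauchy quip is literal: with samples of size 3, the reference distribution t(n-2) = t(1) IS the standard Cauchy distribution. A quick check (scipy assumed available):

```python
from scipy import stats

# t with 1 degree of freedom coincides with the standard Cauchy,
# so for n = 3 the "t-table" for testing a correlation is a Cauchy table
for q in (0.75, 0.90, 0.975):
    print(q, stats.t.ppf(q, df=1), stats.cauchy.ppf(q))
```

The quantiles agree to machine precision; for instance, both distributions put their upper quartile exactly at 1.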

You had been enlightened hundreds of times, but you've been
holding on to that confirmatory lamp post <tm>, used by
classical statisticians, as a drunk uses it for SUPPORT
rather than for enlightenment -- which is another quote by
Tukey, this time I knew it came from his 1961 article in
AMS on "The Future of Data Analysis" which I am sure
you can find, because it was in the Annals of Mathematical
Statistics which was (and still is) widely populated by
statistical drunks of the "mathematical statistics" type,
rather than the by the enlightened Data Analysts and
Applied Statisticians who followed Tukey's enlightened
path.

C'est la difference, mon ami.

-- Reef Fish Bob.

Reef Fish, Oct 5, 2006
13. ### GuestGuest

Reef Fish,
I want to see what underlies the argument between you & m00es
[to me it resembles Fisher vs Neyman-Pearson - does one perform
"pure significance tests", or should one always have an H1 in mind
(typically from a spectrum of models)?] I'd be grateful if you'd
say explicitly what you'd do in the following situation.

Suppose a client comes to you with the sort of data described:
the X variable is 0/1 and the Y variable continuous.

They are interested ONLY in testing the null hypothesis
H0: rho(X,Y)=0. They ignore any suggestions that it might
be better (say) to model Pr(X=1|Y=y), they ignore any suggestion
that correlation may be a perverse way to think about the data
(particularly if e.g. X has been assigned rather than observed).
They aren't interested in a confidence interval for rho,
and they don't want to test H0': rho=a for any nonzero a.
They really, really just want to test H0: rho=0.

You look at the marginal distribution of Y, and judge
that it may reasonably be assumed to be Normal.

What test do you apply? i.e.
what is the test statistic T?
what is its null distribution?
for what values of T do you reject H0 (say with P<0.05)?

Many thanks -- Ewart Shaw

Guest, Oct 6, 2006
14. ### Reef FishGuest

I am more than happy to explain to you that if there's a controversy,
it's Tukey-Box vs Fisher-Neyman-Pearson and the rest of the non-
thinking statisticians in terms of VALIDATING the assumptions
behind any statistical procedure BEFORE using it!

It appears that said idea of validating the statistical assumptions
is as foreign to you as it was to m00es, and perhaps m00es is
from the UK as well, as I always suspected.
You stated the client's case VERY WELL! Of course I would
first tell him what a foolish chap he is (in a polite British way
of course) and then tell him since he is paying the fees and he
knows exactly what he wants while ignoring all my suggestions,
I would gladly test Ho: rho = 0 for him, in a statistically CORRECT
way of course.

It's one thing to satisfy a client's wishes to do some statistical
procedure which is neither necessary nor wise for whatever
problem he has in mind, it is an ethical matter to do the procedure
"according to the book", without letting any error slip by, and
By that, I presume you mean the DATA for Y may reasonably
be assumed to be Normal, after you've done a P-P or Q-Q
plot (or what's also called Normal Probability plot) of the data
Y.
Since you want to test Ho: rho = 0, I would use the usual test
statistic T = (r - 0)/se(r), the same one m00es used. It is
distributed as T with (n-2) d.f., since it has been validated that
Y is Normal; and according to Hogg and Craig, cited by m00es, it is
sufficient to use the same t or s he used.
For a two-tailed test, it would be | T | > t(.975, (n-2)) from
the T-tables.
That's the easiest $300 I've ever earned consulting on a problem
in statistics.

I'll use the remainder of your hour to tell you that your problem
is a no-brainer because the DATA for one of your variables X
and Y used in the correlation had been VALIDATED to satisfy
the statistical assumption required for a test of the correlation
coefficient. I would even use the same rejection region for
testing any other value of rho for your Ho: rho = c, for free, by
simply using T = (r - c)/se(r).
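The consultation boils down to a few lines of computation. A hedged sketch for the Ho: rho = 0 case only (simulated data standing in for the client's; the r·sqrt(n-2)/sqrt(1-r^2) form is the same statistic m00es cited from Hogg and Craig):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 50
x = np.r_[np.zeros(25), np.ones(25)]          # the client's 0/1 X
y = 5.0 + 0.5 * x + rng.normal(0, 1.0, n)     # a continuous Y (simulated)

r = np.corrcoef(x, y)[0, 1]
T = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)    # the usual t form of (r - 0)/se(r)
crit = stats.t.ppf(0.975, df=n - 2)           # two-tailed rejection cutoff at 5%
p = 2 * stats.t.sf(abs(T), df=n - 2)

print(T, crit, abs(T) > crit)
```

This reproduces exactly the p-value a canned correlation test would report, since the t form and the correlation test are the same computation.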

Had your DATA not passed the interocular traumatic
test of a Normal Probability Plot of Y, I would have told you
that there is no known statistical theory on which to base the
test of your Ho: rho = 0, because neither your X nor your Y
is Normal, and then I would insist that you re-formulate
your problem, or else I would turn you away, not taking any
easy money from a fool for doing something wrong that he
wouldn't know is right or wrong.

Now, you should be a very satisfied client.

You're welcome to come back for your next consulting problem.
But bring some REAL British pounds next time. -- Reef Fish Bob.

Reef Fish, Oct 7, 2006
15. ### GuestGuest

Thank you for your very clear & precise response...
...though there's no need to slip in insults like that.
I did have a couple of further tweaks to the client's situation,
clarifying why I thought that the argument might have been more
about "pure significance testing vs the rest of the World"
rather than of validating assumptions, but

(1) The OP (Arnold, 19th Sept) asked about the use of the correlation
coefficient "to analyse the dependence between a dichotomous
dependent variable and a continuous independent variable",
which involves more than just testing H0: rho=0;
(2) I don't in any case want to defend using correlation here;
(3) I don't have the time
(4) or the energy
(5) or the money.

Guest, Oct 9, 2006
16. ### Reef FishGuest

That was merely a statement of FACT. The fact that you had to
ask such questions made it very clear and precise that validating
statistical assumptions is as foreign to YOU as it was to m00es.
But you certainly seem to grasp my response with much less
labor than m00es.
REAL British pounds sterling are scarce, aren't they?
and ... I duly responded, as I did to you, that the use of
correlations was NOT appropriate in his case, but a test of the
slope in a regression WAS appropriate.
(2), (3), and (4) would be an exercise in futility just as m00es's
wasted time and energy.

Reef Fish Bob,

Reef Fish, Oct 9, 2006
17. ### m00esGuest

I think this discussion is really going nowhere at this point. I have
repeatedly pointed out the fact that using t = b1/s(b1) and s = r *
sqrt(n-2) / sqrt( 1 - r^2) are in fact identical tests for testing H0:
beta1 = 0 and H0: rho = 0 (the one implies the other). Since these are
identical tests, the assumptions are the same.
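m00es's claimed identity is easy to verify numerically. A minimal sketch with arbitrary simulated data (numpy assumed available; variable names mine):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# Slope test statistic: t = b1 / se(b1)
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
se_b1 = np.sqrt((resid @ resid / (n - 2)) / ((n - 1) * np.var(x, ddof=1)))
t_slope = b1 / se_b1

# Correlation test statistic: s = r * sqrt(n-2) / sqrt(1 - r^2)
r = np.corrcoef(x, y)[0, 1]
t_corr = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

print(t_slope, t_corr)   # numerically identical
```

The two expressions are the same statistic algebraically, so they agree to floating-point rounding for any dataset; the thread's dispute is over which assumptions must be validated before using it, not over this identity.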

Yes, we assume H0 is true when conducting a hypothesis test. That's the
initial assumption. Of course, we may conclude that H0 should be
rejected after we have carried out the test. But comments from Reef
Fish like "If you assume Ho holds, why would you NEED to test if Ho is
true?" are a clear indication that he is either completely ignorant how
hypothesis testing works or just playing dumb. I'll assume the latter
applies, but in either case, it makes it impossible to actually have a
meaningful discussion.

When H0 holds, then Y is normal. Only then will t and s have a
t-distribution with n - 2 degrees of freedom.

Reef Fish keeps saying: But you first must check whether Y is really
normal! Otherwise, s will not follow a t-distribution with n - 2
degrees of freedom! But neither will t = b1/s(b1). So, we keep going
round and round. I never said that one should not check the model
assumptions. In fact, the exact same assumptions apply to t = b1/s(b1)
and s = r * sqrt(n-2) / sqrt( 1 - r^2) for testing H0: beta1 = 0/H0:
rho = 0.

Well, at this point, I would say that this discussion isn't useful
anymore. We will just have to agree to disagree.

m00es

m00es, Oct 10, 2006
18. ### GuestGuest

That's mainly why I wondered to what extent this "discussion"
was related to "significance testing vs hypothesis testing".
If one adopts the regression/t-test model with conditional
Normality of [Y|X=0] and [Y|X=1], then the Hogg & Craig result
is a red herring: if you satisfy yourself (q-q plot or whatever)
that Y can be assumed to be marginally Normally distributed,
then this implies that H0 is true. You can't then use that
as justification for carrying out the confirmatory test.

I had assumed the discussion was based on the following scenario.

A client has fixed (say equal) numbers of observations of Y
at X=0 and X=1. You verify that the statistical assumptions
underlying a 2-sample t-test are reasonable for their data.
The client then says that all they want (however perversely)
is a test of H0:rho=0. They are not interested in anything else.
Maybe their boss has told them they will lose their job if that
can't be done; maybe they also offer you an obscene amount of money.
Can you carry out a formally correct test of H0?

I think the discussants have been talking at cross purposes,
I don't see why marginal Normality has been brought into the
discussion, I don't see why Reef Fish insists that validating
statistical assumptions is foreign to both you and me, and I
agree that the discussion is really going nowhere at this point.

Guest, Oct 10, 2006
19. ### Reef FishGuest

That was, as others used it, a SARCASM of your being obtuse to
the fact that DATA plays several roles in ANY test of a hypothesis,
and your repeated FALSE claim that "DATA is irrelevant".

Whatever is assumed to hold in Ho does NOT affect the DATA used
to test whatever is stated in Ho.

For the TEST STATISTICS used for testing Ho, it is necessary to
assume the parameter value in Ho to be used in the test statistic.

But the DATA used in any hypothesis test is completely independent
of the STATEMENT of any Ho!

But said DATA is used to VALIDATE any assumption that is in the
process of testing the parameter in question.

That's the same point you, and now the other chap Shaw from UK
are still failing to see:

1. For testing the SLOPE of a simple regression, you do NOT
NEED to validate the distribution of the Y in the DATA, because
that Y is assumed to have come from a mixture of n different
normal populations, in the REGRESSION problem.

You only need to validate that the RESIDUALS of the regression
are normal.

2. For testing the CORRELATION in X and Y, whether the same
data is used in a regression or not, in order to use the TEST
STATISTIC for testing the correlation, you not only need to
use the value of rho in Ho, but you ALSO need to VALIDATE
that either X or Y MUST be normal, according to what you
cited from Hogg and Craig.

(1) and (2) above are TWO DIFFERENT problems. TWO DIFFERENT
tests of two different hypotheses!

The fact that you are using the same (or different) DATA for those
two different problems does not change the FACT that you need to
do DIFFERENT VALIDATIONS for each of the two problems.

Your statistical training in the UK apparently is foreign to the Data
Analysis approach in the USA (since the teaching of Tukey and Box)
that assumptions behind ANY statistical procedure must be validated
before applying those procedures.

That which has been widely and completely accepted in the
APPLICATION of Neyman-Pearson type of hypothesis testing or
interval estimation, is POST Neyman-Pearson, but easily seen
to be a NECESSARY part which Neyman-Pearson overlooked in
their formulation of theory, without touching on the logical
necessity of VALIDATION of assumptions behind the theory.

The above is true for testing the correlation ONLY when it can be
validated that X or Y of the DATA is from a Normal population.

See the above, spelled out more and more explicitly and SEPARATELY
each time -- that's about the normality of the DATA which is necessary
to perform a test of CORRELATION, but not necessary for a REGRESSION,
under the different assumptions of those two DIFFERENT procedures.

Otherwise, s will not follow a t-distribution with n - 2
degrees of freedom!

You misspelled "we". YOU are the only one BLIND to the difference
between what is ASSUMED and what needs to be VALIDATED using
the DATA.

Really? Then why did you keep saying that the DATA is irrelevant?
HOW do you check the model assumption about CORRELATION
without checking that the DATA Y is normal? In the case of putting
TWO different groups into the same Y, the DATA can be clearly
nonnormal because it's the mixture of two different sets of DATA
from two different normal populations.

That's exactly the place that you were WRONG, and remains to
be WRONG, because you failed to recognize that those two are
DIFFERENT problems requiring DIFFERENT validations of
assumptions.
No, you do not have the luxury to agree to disagree when you are
100% WRONG, and had been proven to be WRONG when you
muddled in your FAILURE to recognize what needs to be
VALIDATED by the DATA in two different problems of hypothesis
testing, each of which REQUIRES a different assumption to be
satisfied.

That is exactly where you went astray, and stayed astray.

Go back to the DATA which started all of this, and discussed
EXPLICITLY by me, as an example -- the DATA taken from the
textbook of Anderson and Sclove, and used in the Manual by
Ling and Roberts to illustrate the set up as a simple regression
problem to test the equality of the means of two independent
populations.

Go back to that DATA and show us WHERE you ever said anything
about the validation of assumptions, or the
VALIDATION of the Y in the simple regression.

You can go back and use a Normal Probability Plot (if you had
ever used one) or any technique for VALIDATING the
assumption of Normality, and show us what you found, and
we can START from there.

You NEVER ever looked at the DATA, let alone using it to
validate any assumption. The validation of the normality
of the RESIDUALS was illustrated in the Ling/Roberts
application. There, it was not necessary to validate the
normality of the Y because no correlation was tested.

Learn how to READ, m00es, and learn how to recognize
that (1) and (2) stated above are two DIFFERENT problems
each requiring its own validation of assumptions.

-- Reef Fish Bob.

Reef Fish, Oct 10, 2006
20. ### m00esGuest

I said that data is irrelevant for deriving the distribution of the
test statistic. You claimed otherwise. I pointed out that this is
nonsense.

The entire time, we have been discussing the situation, where:

Y = beta0 + beta1 X + e,

where e ~ iid N(0, sigma^2). We do not know beta0, beta1, or sigma^2,
since these are parameters. But we assume that e is normal. Therefore,
when H0: beta1 = 0 holds (which is equivalent to H0: rho = 0), then Y
is normal. And for testing H0: beta1 = 0, we start out by assuming that
H0 holds. So, under H0, Y is normal. The assumption that e is normal
may be off, but that gets into a different issue.

By the way, Reef Fish, it is quite apparent that you actually got your
statistical training in Iceland.

m00es

m00es, Oct 10, 2006