# Reef Fish Statistics for Dummies: Applied Simple Regression

Discussion in 'Scientific Statistics Math' started by Reef Fish, Oct 4, 2006.

1. ### Reef Fish (Guest)

That is certainly true of YOU, m00es.

"The same" in (2) means "Y needs to be normal" as in (1).

But Y is the COLUMN of numbers used for either the calculation
of the correlation coefficient r*, OR used in a simple regression.

The column of Y in a simple regression, in THEORY, comes
from n different normal distributions, depending on the n X's,
if the ERRORS in the model are N(0, sigma^2). So, in theory,
it is already non-normal. But in practice, the OBSERVED
data for Y (as an aggregate in the column of numbers)
doesn't have to follow ANY distribution at all, and therefore
there is NOTHING to VALIDATE about Y when one is
doing a simple regression problem.

There is absolutely no NEED for the column of Y to be normal.

In fact, most of the time in a simple regression problem, the
Y is NON-NORMAL -- Y can easily be bimodal for the
data coming from two very different normal distributions
with very different means.
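The bimodal case described above can be checked numerically. The following is a minimal sketch, not part of the original post: it assumes two equal-sized groups (X = 0 and X = 1), an illustrative slope of 10, and N(0, 1) errors, and it uses sample kurtosis (about 3 for a normal distribution, well below 3 for a two-humped one) as a crude shape summary. The helper names `kurtosis` and `fit_line` are hypothetical.

```python
import random
import statistics


def kurtosis(values):
    """Sample kurtosis; a normal distribution gives a value near 3."""
    m = statistics.fmean(values)
    m2 = statistics.fmean([(v - m) ** 2 for v in values])
    m4 = statistics.fmean([(v - m) ** 4 for v in values])
    return m4 / m2 ** 2


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


rng = random.Random(2006)  # arbitrary seed, for repeatability only
n = 2000
x = [0.0] * (n // 2) + [1.0] * (n // 2)            # two groups of equal size
y = [10.0 * xi + rng.gauss(0.0, 1.0) for xi in x]  # assumed beta0 = 0, beta1 = 10, sigma = 1

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

kurt_y = kurtosis(y)
kurt_resid = kurtosis(residuals)
print(f"kurtosis of the Y column:  {kurt_y:.2f}")
print(f"kurtosis of the residuals: {kurt_resid:.2f}")
```

Under these assumptions the Y column is a mixture of two well-separated normals, while the residuals behave like a single normal sample, which is the distinction the post draws between the aggregate Y and the errors.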

You should REMAIN at this point until you understand this
very first step. And remember, the normality of Y or
nonnormality of it has to be VALIDATED by DATA, not
by any hypothesis to be tested in any regression model.

-- Reef Fish Bob.

Reef Fish, Oct 13, 2006

2. ### m00es (Guest)

Apparently, you are having difficulties distinguishing that which is
true in the population and that which we observe in a sample. In the
population, we assume that the model describing the relationship
between Y and X is given by Y = beta0 + beta1 X + e, where e ~ iid N(0,
sigma^2). This is a statement about the population. Now if in reality
beta1 = 0 (i.e., Y and X are completely unrelated), then Y = beta0 + e,
where e ~ iid N(0, sigma^2). Therefore, in the population Y ~ N(beta0,
sigma^2). Hence, if Y and X are unrelated, Y is normal.

Now, if we take a random sample from that population, then we are
either taking a random sample from a population where beta1 = 0 or
where beta1 != 0. In other words, we are either taking a random sample
from Y ~ N(beta0, sigma^2) or from Y ~ N(beta0 + beta1 X, sigma^2). We
don't know which one of these two cases holds. But we take our random
sample, fit the model, calculate b1 and s(b1) and then b1/s(b1). If H0:
beta1 = 0 holds (i.e., Y is normal), then b1/s(b1) will be a realization
of a random variable following a central t-distribution with n - 2
degrees of freedom. If H0 does not hold, then b1/s(b1) is a realization
of a random variable following a non-central t-distribution.

m00es

m00es, Oct 13, 2006

3. ### Reef Fish (Guest)

If you're going to follow up discussing STATISTICS, you follow
up with what I posted, not rehashing your old ERRORS.

The DATA is the SAMPLE from one or more populations.

The DATA is what you use to VALIDATE the necessary
assumption behind any statistical procedure.

The DATA does not depend on what you model or what
parameter value you want to test.

You get THAT part into your numb skull and go back to your
own step (2).

If you want to test R, using T, the Y must be Normal.

So you MUST VALIDATE the normality of Y.

If you want to test the slope in a regression, there is NO
distributional assumption on the aggregate Y (it came from
n different distributions in theory).

There is NOTHING to VALIDATE about Y.
You wait until you have observed residuals, then you
validate the distribution of the errors.

If you FAIL the validation in BOTH, none of your
inference based on the test statistic is valid.

If you SUCCEED in the validation of BOTH, then
both are valid, as I had explained to Ewart Shaw
in his given conditions and insistence.

It's in the ACTUAL problem of testing the equality
of means of two populations that

T(n-2) may be validly used for testing R while
not valid for testing beta1, if Y is validated to be
normal, but the residuals of the regression
seriously violated the normal validation,

OR

T(n-2) may be INVALID for testing R because
neither X nor Y is normal; but perfectly valid
for testing beta1 because the RESIDUALS
are validated to be normal.

At this point, I must ask OTHERS to explain the
above to m00es, or if anyone ELSE still doesn't

===============================
T(n-2) may be validly used for testing R while
not valid for testing beta1, if Y is validated to be
normal, but the residuals of the regression
seriously violated the normal validation,

OR

T(n-2) may be INVALID for testing R because
neither X nor Y is normal; but perfectly valid
for testing beta1 because the RESIDUALS
are validated to be normal.
===============================

because there is no other way on earth or in hell
that FACT can be explained in a more direct way
that is consistent with standard practice of
VALIDATION of statistical ASSUMPTIONS in
procedures before conducting any inference or test.

-- Reef Fish Bob.

Reef Fish, Oct 13, 2006

4. ### m00es (Guest)

You can just keep on ignoring what I wrote, but it's still correct. I
am not talking about verifying any assumptions. I am not even talking
about the correlation coefficient. I am just trying to teach something
to you that is apparently very difficult to understand. And that is:
the distribution of b1/s(b1) is only t-distributed with n-2 degrees of
freedom when H0: beta1 = 0 holds, which implies that Y is normal. Once
you understand this point, then we can proceed and talk about ways to
verify assumptions. But for now, that is reaching too far when you
can't even understand something so simple.

Here is the proof that b1/s(b1) ONLY follows a t-distribution with n -
2 degrees of freedom when Y is normal. The model: Y = beta0 + beta1 X +
e, where e ~ iid N(0, sigma^2). Let SSX = sum(x - xbar)^2.

b1 ~ N(beta1, sigma^2 / SSX )
MSE ~ chi^2(n-2) sigma^2 / (n - 2)
s^2(b1) = MSE / SSX

Therefore:

b1/s(b1) ~ N(beta1, sigma^2 / SSX) / sqrt{ chi^2(n-2) sigma^2 / [(n - 2) SSX] }
         ~ N(beta1 sqrt(SSX) / sigma, 1) / sqrt( chi^2(n-2) / (n - 2) ).

When beta1 = 0, then:

b1/s(b1) ~ N(0,1) / sqrt( chi^2(n-2) / (n - 2) )

and that is distributed t with n - 2 degrees of freedom. But ONLY when
beta1 = 0. Otherwise, it will be non-central t.
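The central versus non-central behaviour of b1/s(b1) can be illustrated with a small simulation. This is a rough sketch under assumptions that are not in the original post: n = 20 fixed design points x = 0, ..., 19, beta0 = 1, sigma = 1, and the tabulated two-sided 5% critical value 2.101 for t with 18 degrees of freedom. The function names are hypothetical.

```python
import math
import random

T_CRIT = 2.101  # two-sided 5% critical value, t with 18 d.f. (standard tables)


def slope_t(x, y):
    """Return b1 / s(b1) for the least-squares line through (x, y)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    ssx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ssx
    b0 = ybar - b1 * xbar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    return b1 / math.sqrt(mse / ssx)


def rejection_rate(beta1, reps, rng):
    """Fraction of simulated samples in which |b1 / s(b1)| exceeds T_CRIT."""
    x = [float(i) for i in range(20)]  # assumed fixed design, n = 20
    hits = 0
    for _ in range(reps):
        y = [1.0 + beta1 * xi + rng.gauss(0.0, 1.0) for xi in x]
        if abs(slope_t(x, y)) > T_CRIT:
            hits += 1
    return hits / reps


rng = random.Random(13)  # arbitrary seed
rate_null = rejection_rate(0.0, 4000, rng)  # beta1 = 0: central t(18)
rate_alt = rejection_rate(0.1, 4000, rng)   # beta1 != 0: non-central t
print(f"rejection rate when beta1 = 0:   {rate_null:.3f}")
print(f"rejection rate when beta1 = 0.1: {rate_alt:.3f}")
```

When beta1 = 0 the rejection rate should sit near the nominal 5%, and when beta1 differs from 0 it should be clearly higher, which is what a shift from the central to the non-central t-distribution implies.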

And when beta1 = 0, then in the population Y = beta0 + e, hence Y ~
N(beta0, sigma^2).

So, ONLY when Y is normal (i.e., beta1 = 0) will b1/s(b1) be
t-distributed with n - 2 degrees of freedom.

We have not yet drawn any data. These are just facts that hold when we
are correct in assuming that in the population the relationship between
Y and X is described by a linear model of the form Y = beta0 + beta1 X
+ e, where e ~ iid N(0, sigma^2).

So, why don't you skip all the rhetoric and insults and accept that
this is a fact. Or if you disagree with anything I wrote, then please
point out where the error is (we would have to rewrite 1000's of stat
books if you can find an error in what I wrote). But don't get started
again about assumptions and data. We aren't even there yet.

m00es

m00es, Oct 13, 2006

5. ### Reef Fish (Guest)

because you repeated the same errors you made on Day 1, with
absolutely nothing new.

I have already put it as succinctly as possible, and asked OTHER
readers to explain to you, or for THEM to ask me what they
hypotheses being incompatible if the DATA can validate one
procedure and not the other.

If you want to have anything to say, address these two, so
OTHERS can respond to you.

===============================
T(n-2) may be validly used for testing R while
not valid for testing beta1, if Y is validated to be
normal, but the residuals of the regression
seriously violated the normal validation,

OR

T(n-2) may be INVALID for testing R because
neither X nor Y is normal; but perfectly valid
for testing beta1 because the RESIDUALS
are validated to be normal.
===============================

because there is no other way on earth or in hell
that FACT can be explained in a more direct way
that is consistent with standard practice of
VALIDATION of statistical ASSUMPTIONS in
any statistical procedures before conducting any
inference or test.

The two procedures were:

Testing correlation given DATA (X, Y) using a T
distribution with (n-2) d.f.

Assumption to be validated: X or Y MUST be Normal.

Testing the slope of a simple regression line given
DATA (X, Y) using a T distribution with (n-2) d.f.

Assumption to be validated: The usual i.i.d. (0, sigma^2)
assumption about the ERRORS in the linear fit.
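The second scenario in the boxed statement (neither X nor Y normal, residuals normal) can be sketched as follows. The post does not name a specific validation method, so this example substitutes the Jarque-Bera statistic, a large-sample check based on skewness and kurtosis, compared against the 5% critical value 5.991 of chi-square with 2 degrees of freedom. The exponential X, the coefficients 2 and 3, and the function name `jarque_bera` are all illustrative assumptions.

```python
import math
import random

CHI2_2_CRIT = 5.991  # 5% critical value, chi-square with 2 d.f. (standard tables)


def jarque_bera(values):
    """Jarque-Bera statistic: n/6 * (skew^2 + (kurtosis - 3)^2 / 4)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)


rng = random.Random(1976)  # arbitrary seed
n = 500
x = [rng.expovariate(1.0) for _ in range(n)]              # skewed, non-normal X
y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]    # normal errors

# Least-squares fit and observed residuals.
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

jb_x = jarque_bera(x)
jb_y = jarque_bera(y)
jb_resid = jarque_bera(residuals)
for name, jb in (("X", jb_x), ("Y", jb_y), ("residuals", jb_resid)):
    verdict = "looks non-normal" if jb > CHI2_2_CRIT else "no evidence against normality"
    print(f"{name:>9}: JB = {jb:8.2f}  ({verdict})")
```

In this setup the X and Y columns should fail the check by a wide margin, while the residuals are drawn from a normal error model. Any other normality check (a normal probability plot, for instance) could be substituted for the Jarque-Bera statistic.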

-- Reef Fish Bob.

Reef Fish, Oct 13, 2006

6. ### m00es (Guest)

I am not making any errors. We haven't even gotten yet to the issue of
how to verify assumptions, and what the role of the correlation
coefficient is in all of this. I am trying to do this step by step,
because there is no point in discussing anything if you do not
understand a very basic fact about the distribution of Y.

All I am trying to explain to you is that Y is normal under H0: beta1 =
0. Once you get this point, we can discuss other things. But
apparently, you do not want to admit that I am correct.

Look, it's very very simple. When Y = beta0 + beta1 X + e, where e ~
iid N(0, sigma^2), then Y ~ N(beta0 + beta1 X, sigma^2). Therefore, the
conditional distribution of Y|x is N(beta0 + beta1 x, sigma^2). And the
marginal distribution of Y is indeed a mixture distribution. However,
when H0: beta1 = 0 holds, then Y ~ N(beta0, sigma^2) and the marginal
distribution of Y is normal.

Why is it so difficult for you to admit that this is true?

m00es

m00es, Oct 14, 2006

7. ### Reef Fish (Guest)

You not only made errors, but you were polluting the OTHER thread

Your errors are so OBVIOUS.

In a sense, Dick Startz explained to Kevin how you could have a valid
set of data for testing correlations while an INVALID one for testing
the slope of a simple regression.

If you just keep your BIG MOUTH SHUT and use your little ears to
LISTEN for awhile, you may learn something from the various readers
in these groups.

-- Reef Fish Bob.

Reef Fish, Oct 14, 2006

8. ### m00es (Guest)

Why do you keep evading what I wrote? Under the model Y = beta0 + beta1
X + e, where e ~ iid N(0, sigma^2), beta1 = 0 implies that Y is normal.
Why is it so difficult for you to admit that this is true?

m00es

m00es, Oct 15, 2006

9. ### Reef Fish (Guest)

Because the DATA could have come from beta1 = 100,000.

You are only TESTING if beta1 = 0.

That's how DENSE you are.

Read it in the thread for OTHER people and the post by Dick Startz
that tells exactly the same reason why you've been wrong all these
weeks, without the slightest sign of revival or recovery.

The HOLE you dug is so deep that even if you recover, you'll be
the laughing stock of this group for years to come, in the archives
of sci.stat.math.

-- Reef Fish Bob.

Reef Fish, Oct 15, 2006

10. ### m00es (Guest)

If the data came from a model where beta1 != 0, then the marginal
distribution of Y is not normal. That is correct. I never said
otherwise. However, IF beta1 = 0 holds, then Y is normal. So, you are
still unable to admit that Y is normal when beta1 = 0 holds.

Yes, and to test beta1 = 0, we initially assume that H0: beta1 = 0
holds. If H0 holds, then Y is normal. And if H0 holds, then b1/s(b1)
follows a t-distribution with n - 2 degrees of freedom. If H0 does not
hold, then Y is not normal. And then b1/s(b1) follows a non-central
t-distribution with n - 2 degrees of freedom.

But one more time. IF H0: beta1 = 0 holds, then Y is normal. So, again,
why are you having such a hard time admitting that this is true?

More insults. Zero substance.

m00es

m00es, Oct 16, 2006

11. ### Reef Fish (Guest)

That says it all about all you've been saying for weeks, NOT
recognizing that the DATA one tests has nothing to do with the
Ho being tested.

It has to validate the ASSUMPTION of the regression procedure.

That's the extent of your perpetual ignorance.

-- Reef Fish Bob.

Reef Fish, Oct 16, 2006

12. ### Russell (Guest)

So you're saying that it is an impossibility for there to
exist data which are not normally distributed for which
beta1 = 0? I don't believe that. In fact, given your model
Y = beta0 + beta1 X + e, if we take e to be distributed as,
say, Cauchy then Y isn't normal even if beta1 = 0. Now
I don't know how sensitive this analysis is to a violation
of e being normal. As I understand things, some tests are
more robust than others to violations of their assumptions,
but when I was learning about data analysis I was taught
to check for such violations. I've adopted as my creed a
slight modification of a statement in the book on spectral
analysis by Blackman and Tukey, _The Measurement of
Power Spectra From the Point of View of Communication
Engineering_: All too often the study of data requires care.
Cheers,
Russell

Russell, Oct 16, 2006

13. ### m00es (Guest)

That's correct. However, Reef Fish just doesn't want to admit that the
model Y = beta0 + beta1 X + e with e ~ iid N(0, sigma^2) implies that Y
is normal when beta1 = 0. Certainly e could follow any other
distribution, but then b1/s(b1) does not follow a t-distribution even
when beta1 = 0. Neither will r * sqrt(n-2)/sqrt(1-r^2). However, even
then b1/s(b1) and r * sqrt(n-2)/sqrt(1-r^2) will follow the exact same
distribution (whatever it may be). Since b1/s(b1) = r *
sqrt(n-2)/sqrt(1-r^2), they ALWAYS follow the same distribution.

Of course, checks on the assumption that e is normal should be carried
out. But that has nothing to do with the equivalence of the test of
beta1 = 0 and rho = 0.
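The algebraic identity b1/s(b1) = r * sqrt(n-2)/sqrt(1-r^2) can be verified numerically for any data set, whatever the error distribution. The sketch below is illustrative only: the sample size, the coefficients, the uniform X, and the function name `slope_t_and_corr_t` are assumptions, and the Cauchy errors are generated by the standard inverse-CDF transform tan(pi * (U - 0.5)).

```python
import math
import random


def slope_t_and_corr_t(x, y):
    """Return (b1 / s(b1), r * sqrt(n - 2) / sqrt(1 - r^2)) for the data."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

    # Regression route: slope divided by its standard error.
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    t_slope = b1 / math.sqrt(sse / (n - 2) / sxx)

    # Correlation route: transformed sample correlation coefficient.
    r = sxy / math.sqrt(sxx * syy)
    t_corr = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    return t_slope, t_corr


rng = random.Random(17)  # arbitrary seed
x = [rng.uniform(0.0, 10.0) for _ in range(30)]
normal_errors = [rng.gauss(0.0, 1.0) for _ in x]
cauchy_errors = [math.tan(math.pi * (rng.random() - 0.5)) for _ in x]

results = {}
for label, errors in (("normal", normal_errors), ("Cauchy", cauchy_errors)):
    y = [1.0 + 0.5 * xi + e for xi, e in zip(x, errors)]
    results[label] = slope_t_and_corr_t(x, y)
    t_slope, t_corr = results[label]
    print(f"{label:>6} errors: b1/s(b1) = {t_slope:.6f}, from r = {t_corr:.6f}")
```

The two numbers agree up to floating-point rounding in both cases. The sketch says nothing about which reference distribution is appropriate for the statistic; it only shows that the two computations produce the same value.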

m00es

m00es, Oct 17, 2006

14. ### m00es (Guest)

And again, you avoid the issue. Insults, on the other hand, seem to be

If Y = beta0 + beta1 X + e, where e ~ iid N(0, sigma^2), then the
marginal distribution of Y follows a mixture distribution (a mixture of
normals). However, if beta1 = 0, then the marginal distribution of Y is
normal. Why are you having such a hard time pointing out where the
error is? Oh right, there IS no error! Yes, that would make it
difficult to point out where the error is. I suppose throwing insults
around is an alternative then.

m00es

m00es, Oct 17, 2006

15. ### Reef Fish (Guest)

That's the END of m00es's arguments all these weeks.

Why more DOUBLE TALK?
NONE of that had anything to do with what I said about DATA that can
be validly used to test ONE (rho) and NOT THE OTHER (beta1).
NOT when you don't have any e and have ONLY X and Y to test the
correlation rho.

Precisely. The equivalence of beta1 = 0 and rho = 0 is IRRELEVANT
when you are testing ONLY correlation, and there you have to test
that either X or Y is normal, and nothing about e, because e
DOESN'T exist.

In the end, you have proven yourself WRONG once more, trying to
mouth dance and weasel your way out notwithstanding.

-- Reef Fish Bob.

Reef Fish, Oct 17, 2006

16. ### m00es (Guest)

What double-talk??? Wait, don't answer that. It's just going to be a
bunch of nonsense anyway.

The tests are the same. When are you going to get that into your head?

Once again, the test of rho = 0 is equivalent to the test of beta1 = 0.
It's really not that difficult to understand.

Funny; you are the one who keeps dodging my question. And I have not
proven myself wrong. On the other hand, you have proven that you are
unable to carry on an argument, you are unable to admit that you are
wrong, and you lack common decency, since you revert to insults
whenever you can.

m00es

m00es, Oct 17, 2006

17. ### Reef Fish (Guest)

But the tests require DIFFERENT assumptions to be validated.
is so mutilated by mathematical statistics in a vacuum of APPLIED
statistics, just couldn't understand that the validation of ASSUMPTIONS
in a procedure has nothing to do with the Ho being tested.

It has EVERYTHING to do with the DATA used.
It is trivial to understand. But it is hard for m00es to understand
that in the FORMER, one needs to validate the normality of Y and X.
In the LATTER, X and Y can be anything. NOTHING to validate in X and Y.

Any freshman with any sound instruction in simple regression would
have understood the difference.

Only a BLIND mathematical statistics student like m00es remains BLIND
to the simple fact, and POLLUTE the entire three newsgroups in his
incessant repeat of the same FALSEHOOD that had nothing to do with
the issue of VALIDATION of assumptions.
Proven in your current post, and shot your own foot with the same
weapon.

-- Reef Fish Bob.

Reef Fish, Oct 17, 2006

18. ### Luis A. Afonso (Guest)

Validating an assumption before the main test (checking, for example, whether the sample has the normal distribution) and only then performing the test SEEMS a good idea (to ignorant people), but IMO it is STATISTICALLY A DISASTER.
Why?
Because one jumps from a well-established situation to a CONDITIONAL TEST
__evaluation of p(H0|valid)___that we cannot solve.

______licas (Luis A. Afonso)

Luis A. Afonso, Oct 17, 2006

19. ### Reef Fish (Guest)

The keyword in the above is "IMO".

When "IMO" comes from someone ignorant about statistics and the well-
accepted and understood practice of the VALIDATION of assumptions in
Data Analysis and Applied Statistics, it remains the OPINION of the
uneducated.

Read George Box, "Science and Statistics", JASA 1976, the article which
I have referenced at least a dozen times to others (which was probably
not seen by Luis A. Afonso), and let George Box teach you a lesson or
three in what statistics is about.

THEN, you may be ready to read dozens and dozens of books and
articles in the statistical literature of WHY it makes sense to
VALIDATE the assumptions behind a procedure before plunging into
applying it.

SIMPLE LOGIC: If the assumptions are seriously violated, then all
the statistical results and conclusions based on the procedure are
INVALID, WRONG, and not worth the paper they're printed on (by your
computer program).

-- Reef Fish Bob.

Reef Fish, Oct 17, 2006

20. ### Luis A. Afonso (Guest)

What I got from Reef Fish's commentary

___Reef Fish relying only on an *authority* (?) (and the ordinary ad hominem insult).

What I didn't see

___His opinion why IT IS NOT A CONDITIONAL TEST.

_______licas (Luis A. Afonso)

Luis A. Afonso, Oct 17, 2006