# Factor analysis with many variables

Discussion in 'Scientific Statistics Math' started by A Fog, Nov 24, 2007.

1. ### A Fog (Guest)

I am trying to do a factor analysis on the standard cross-cultural sample in order to test certain hypotheses.

I am using the factanal function in R (www.r-project.org).
My problem is that this function doesn't work when the number of variables exceeds the number of cases. It gives the following error message:
"Error in solve.default(cv) : system is computationally singular: reciprocal condition number = 2.40481e-19"

Is there a fundamental limitation in factor analysis that the number of variables must be smaller than the number of cases (e.g. can't do a factor analysis of 200 variables on 100 test persons)?

I have also tried the "Multiple factor analysis" (MFA) function in the FactoMineR R package. This works with a higher number of variables, but the result appears to be a principal component analysis, not a factor analysis. I want a factor analysis because the variables have high measurement errors. The output of factanal gives nice factors with fairly obvious interpretations, while the dimensions output of the MFA (PCA) is messier, with no clear interpretations.

I would appreciate any advice on how to make a factor analysis with more variables than cases. I like R, but can use any other software if it is not too expensive.

(The standard cross-cultural sample is a file of more than 1000 traits of 186 non-industrial societies. I am using this for testing certain hypotheses on cultural and political reactions to collective dangers)
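To see concretely why solve() chokes, here is a minimal numpy sketch (random data, not the SCCS itself): a sample covariance matrix computed from n cases has rank at most n - 1, so with more variables than cases it is necessarily singular, which is exactly what the error message is reporting.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 200                      # 100 cases, 200 variables
X = rng.normal(size=(n, p))
C = np.cov(X, rowvar=False)          # p x p sample covariance matrix

# A sample covariance from n cases has rank at most n - 1, so with
# p >= n it is always singular, and inverting it must fail.
r = np.linalg.matrix_rank(C)
print(r)                             # 99, far below p = 200
```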

A Fog, Nov 24, 2007

2. ### Ray Koopman (Guest)

All the versions of factor analysis that I know of require the
covariance matrix to be nonsingular, which in turn implies more
cases than variables. You may be stuck with doing component analyses,
or with doing factor analyses of small subsets of your variables.

Have a look at Rudy Rummel's book _Applied Factor Analysis_
and his "Understanding Factor Analysis" website
http://www.hawaii.edu/powerkills/UFA.HTM

Ray Koopman, Nov 24, 2007

3. ### A Fog (Guest)

Thank you Ray. Now I don't have to try other software packages.

I can think of three ways to handle the problem of too many variables:

1. Remove the variables that have the highest uniqueness or the lowest factor loadings.

2. Remove variables that are unimportant to the hypothesis I want to test.

3. Look for variables that are highly correlated and combine them by addition to reduce the number of variables.

Method 1 is no good because it removes many of the variables that are most important to my hypothesis.

Method 2 gives a nice result that confirms the hypothesis I want to test. My only concern is: Can my cherry-picking of variables produce a bias in favour of my hypothesis?

Is method 3 valid?

I am replacing missing data with the mean for that variable. Is there anything better I can do with missing data?
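To make method 3 and the imputation concrete, here is a toy numpy sketch (synthetic data, not the actual SCCS variables): impute missing values with the column mean, then add a pair of near-duplicate variables into one composite.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 6 cases, 3 variables; variables 0 and 1 are near-duplicates.
X = rng.normal(size=(6, 3))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=6)
X[2, 2] = np.nan                          # one missing value

# Mean imputation, as described above (note that it shrinks the variance
# of the imputed variable and attenuates its correlations).
col_means = np.nanmean(X, axis=0)
X_imp = np.where(np.isnan(X), col_means, X)

# Method 3: standardize, then add the highly correlated pair so that
# one composite variable replaces the two originals.
Z = (X_imp - X_imp.mean(axis=0)) / X_imp.std(axis=0)
X_reduced = np.column_stack([Z[:, 0] + Z[:, 1], Z[:, 2]])
print(X_reduced.shape)                    # (6, 2)
```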

A Fog, Nov 24, 2007

4. ### Old Mac User (Guest)

At the risk of being a party pooper, I have several concerns.

1. If you are not thoroughly trained in multivariate analysis
(including principal components analysis, factor analysis, etc.), then
I suggest that you are headed for deep trouble using factor analysis.
There are a lot of thistles in FA, and your original post suggests you
are a novice with it.

2. Eliminating variables "to taste" is a nice way to get the answer(s)
you really desire.

3. Replacing missing values with the mean for that variable is a bad
idea.

Let's go back to the beginning.

Why are you using FA in the first place?

What are you trying to accomplish?

Tell us a little about the data and the source(s) of that data.

I've been in the "statistics business" for almost 53 years and haven't
found it necessary or fruitful to use FA so far. Was in charge of a
group of applied statisticians for about 37 years and no one in that
group ever found a valid use for FA.

OMU

Old Mac User, Nov 24, 2007

5. ### Richard Ulrich (Guest)

I liked the post by OMU. I can offer a related perspective.

Unlike OMU, who has never found a use for factor analysis
in his industrial context, I have used factor analysis many times -
it has given me an easy cross-check on the sensibility and
content of rating scales for attitudes or symptoms, etc.
I've used it with almost every new data set, over several decades.

But I have never done a factor analysis "to test ... hypotheses."
Most of that was killed off, I thought, when Spearman failed
to wrap up the argument about "g".

Modern attempts, I thought, would fall under the heading of
SEM (structural equation modeling). For that, I would assume
large Ns, little missing data, and a structure that is fairly obvious
before you start. What you *may* do is confirm that certain
hypotheses seem better than others.

If you have a citation that you are following, "to test certain
hypotheses" by factor analyses, please post it. I suspect that
you will find readers here who may provide a critique.

[snip, rest]

Richard Ulrich, Nov 25, 2007

6. ### Herman Rubin (Guest)

The maximum value of the likelihood function is infinite,
so such methods cannot work. There are ways to handle
things, such as Bayesian methods, or putting a lower bound
on the specific variances; either of these will remove the
problem of singularity.

If you believe that there are large measurement errors,
then you believe the specific variances are rather high.

You are in a situation where you MUST make far more
assumptions than the usual factor analysis assumptions,
of which normality is the least important. This situation
always holds, for any non-arbitrary method (one not based
on principles) if there are more parameters than there
are observations. These cases do occur, and any attempt
at "objectivity" is subject to failure.

Herman Rubin, Nov 25, 2007

7. ### Old Mac User (Guest)

Rich...

Hello again...

I've used PCA a few times (always on correlation matrices) to get a
sense of "how many dimensions do we have here" and to test for "zero"
eigenvalues because of hidden linear constraints. Both PCA and Factor
Analysis are sensitive to scaling... so I always work with a
correlation matrix when doing PCA.
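The scaling sensitivity is easy to demonstrate with a small numpy sketch (synthetic data, assumed for the example): multiplying one variable by 100 — say, a change of units — dominates a covariance-based PCA, while a correlation-based PCA is unaffected.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
X[:, 0] *= 100                       # change of units for one variable

C = np.cov(X, rowvar=False)
R = np.corrcoef(X, rowvar=False)

# Share of total variance carried by the first principal component:
share_cov = np.linalg.eigvalsh(C)[-1] / np.trace(C)
share_cor = np.linalg.eigvalsh(R)[-1] / np.trace(R)
print(round(share_cov, 3))           # near 1: the rescaled variable swamps the PCA
print(round(share_cor, 3))           # well below 1: no variable dominates
```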

When I saw "is it a good idea to replace missing data with their mean"
I just couldn't remain silent.

Regards... OMU

Old Mac User, Nov 25, 2007

8. ### Ray Koopman (Guest)

That depends on the method of estimation. For factor analysis there
are several scale-equivariant methods: canonical/maximum likelihood/
maximum determinant, generalized least squares, alpha, and various
versions of weighted least squares come to mind; for PCA there is a
method called Harris component analysis (which really deserves to be
more widely used than it is).

Ray Koopman, Nov 25, 2007

9. ### Ray Koopman (Guest)

I have been asked offline to provide a reference for and short
description of Harris component analysis. The basic reference is

Harris, Chester W. (1962). Some Rao-Guttman relationships.
Psychometrika, 27, 247-263.

It's a rescaled PCA: you rescale the covariances, get the principal
components of the rescaled matrix, then undo the rescaling to get
back into the original units. The rescaling matrix is a diagonal
matrix, say D^2, whose elements are the reciprocals of the diagonals
of the inverse of the covariance matrix. Instead of getting the
eigenvector matrix V and diagonal matrix of eigenvalues E of the
covariance matrix C, you get them for D^-1.C.D^-1; pre-multiplying
the resulting loadings V.E^(1/2) by D then puts things back into the
original units. (Notation: "^" is powering, "." is matrix
multiplication. "^" has a higher precedence than ".".)

The diagonal elements of D^2 are variances, in the same units as the
corresponding variables in C. Different variables do not have to be
in the same units. The analysis is scale-equivariant: if two people
analyze covariance matrices that are rescalings of one another,
their results will be the same rescalings of one another. So it
doesn't matter whether you analyze covariances or correlations; you
can always rescale the results into whatever units you want, and
they will be just as if you had done the analysis in those units in
the first place.

The variances in D^2 are sometimes called "empirical uniquenesses".
They are analogous to (and upper bounds for) the theoretical unique
variances in common factor analysis. With k variables in all, dj^2 is
the variance of the residuals when variable j is regressed on the k-1
other variables.
dj^2 = cjj * (1 - rj^2), where cjj is the variance of variable j
(i.e., the j'th diagonal element of C) and rj is the multiple
correlation of variable j with the k-1 other variables.
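That description translates into a few lines of code. The following numpy sketch (my own illustration, not code from Harris's paper) computes D^2 from the diagonal of the inverse covariance matrix, does the PCA of D^-1.C.D^-1, rescales the loadings by D, and checks the scale-equivariance claim on random data.

```python
import numpy as np

def harris_components(C):
    """Harris (1962) component analysis of a nonsingular covariance matrix C."""
    # D^2 = reciprocals of the diagonal of C^-1: the "empirical uniquenesses",
    # i.e. residual variances from regressing each variable on all the others.
    d = 1.0 / np.sqrt(np.diag(np.linalg.inv(C)))
    Cs = C / np.outer(d, d)                     # D^-1 . C . D^-1
    evals, V = np.linalg.eigh(Cs)
    evals, V = evals[::-1], V[:, ::-1]          # sort descending
    loadings = d[:, None] * V * np.sqrt(evals)  # D . V . E^(1/2): original units
    return evals, loadings

rng = np.random.default_rng(3)
C = np.cov(rng.normal(size=(30, 4)), rowvar=False)
e1, L1 = harris_components(C)

# The loadings reproduce C exactly (components, not common factors)...
print(np.allclose(L1 @ L1.T, C))                          # True

# ...and the analysis is scale-equivariant: rescaling the covariances
# by s rescales the loadings by s, with identical eigenvalues.
s = np.array([1.0, 10.0, 0.1, 2.0])
e2, L2 = harris_components(np.outer(s, s) * C)
print(np.allclose(np.abs(L2), np.abs(s[:, None] * L1)))   # True
print(np.allclose(e1, e2))                                # True
```

The equivariance works because rescaling C by S also rescales D by S, so the rescaled matrix D^-1.C.D^-1 — and hence V and E — is unchanged.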

Ray Koopman, Nov 26, 2007