Factor analysis with many variables

Discussion in 'Scientific Statistics Math' started by A Fog, Nov 24, 2007.

  1. A Fog

    A Fog Guest

    I am trying to do a factor analysis on the standard cross-cultural sample in order to test certain hypotheses.

    I am using the factanal function in R (www.r-project.org).
    My problem is that this function doesn't work when the number of variables exceeds the number of cases. It gives the following error message:
    "Error in solve.default(cv) : system is computationally singular: reciprocal condition number = 2.40481e-19"

    Is there a fundamental limitation in factor analysis that the number of variables must be smaller than the number of cases (e.g. can't do a factor analysis of 200 variables on 100 test persons)?

    I have also tried the "Multiple factor analysis" (MFA) function in the FactoMineR R package. This works with a higher number of variables, but the result appears to be a principal component analysis, not a factor analysis. I want a factor analysis because the variables have high measurement errors. The output of factanal gives nice factors with pretty obvious interpretations, while the dimensions output by the MFA (PCA) are messier, with no clear interpretations.

    I would appreciate any advice on how to do a factor analysis with more variables than cases. I like R, but can use any other software if it is not too expensive.

    (The standard cross-cultural sample is a file of more than 1000 traits of 186 non-industrial societies. I am using this for testing certain hypotheses on cultural and political reactions to collective dangers)
    A Fog, Nov 24, 2007

  2. A Fog

    Ray Koopman Guest

    All the versions of factor analysis that I know of require the
    covariance matrix to be nonsingular, which in turn implies more
    cases than variables. You may be stuck with doing component analyses,
    or with doing factor analyses of small subsets of your variables.

    Have a look at Rudy Rummel's book _Applied Factor Analysis_
    and his "Understanding Factor Analysis" website
    Ray Koopman, Nov 24, 2007
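
A minimal sketch of why more variables than cases forces a singular covariance matrix, as described in the reply above. It uses Python with NumPy rather than R, and simulated normal data with the dimensions from the original question (200 variables, 100 cases). The seed and the variable names are illustrative assumptions, not anything from the thread.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_vars = 100, 200            # more variables than cases
X = rng.normal(size=(n_cases, n_vars))

S = np.cov(X, rowvar=False)           # 200 x 200 sample covariance matrix
rank = np.linalg.matrix_rank(S)

# Centering each variable removes one degree of freedom,
# so the rank of S can be at most n_cases - 1, far below n_vars.
print(S.shape, rank)
```

Since the rank of S cannot exceed n_cases - 1, S has no inverse whenever there are at least as many variables as cases, which is consistent with the "computationally singular" error quoted in the first post.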

  3. A Fog

    A Fog Guest

    Thank you Ray. Now I don't have to try other software packages.

    I can think of three ways to handle the problem of too many variables:

    1. Remove the variables that have the highest uniqueness or lowest factor loading

    2. Remove variables that are unimportant to the hypothesis I want to test

    3. Look for variables that are highly correlated and combine them by addition to reduce the number of variables.

    Method 1 is no good because it removes many of the variables that are most important to my hypothesis.

    Method 2 gives a nice result that confirms the hypothesis I want to test. My only concern is: Can my cherry-picking of variables produce a bias in favour of my hypothesis?

    Is method 3 valid?

    I am replacing missing data with the mean for that variable. Is there anything better I can do with missing data?
    A Fog, Nov 24, 2007
  4. A Fog

    Old Mac User Guest

    At the risk of being a party pooper, I have several concerns.

    1. If you are not thoroughly trained in multifactor analysis
    (including principal components analysis, factor analysis, etc.) then
    I suggest that you are headed for deep trouble using factor analysis.
    There are a lot of thistles on FA, and your original post suggests you
    are a novice with FA.

    2. Eliminating variables "to taste" is a nice way to get the answer(s)
    you really desire.

    3. Replacing missing values with the mean for that variable is a bad
    idea.

    Let's go back to the beginning.

    Why are you using FA in the first place?

    What are you trying to accomplish?

    Tell us a little about the data and the source(s) of that data.

    I've been in the "statistics business" for almost 53 years and haven't
    found it necessary or fruitful to use FA so far. Was in charge of a
    group of applied statisticians for about 37 years and no one in that
    group ever found a valid use for FA.

    Old Mac User, Nov 24, 2007
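
One way to see the concern about mean substitution is a small simulation. The sketch below (Python with NumPy, simulated data) assumes two correlated variables and values missing completely at random. The 40% missingness rate, the seed, and all names are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 186                                   # number of societies in the sample
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)    # correlated with x by construction

# Assumption: about 40% of y is missing completely at random.
missing = rng.random(n) < 0.4
y_imputed = y.copy()
y_imputed[missing] = y[~missing].mean()   # mean substitution

var_observed = y[~missing].var(ddof=1)
var_imputed = y_imputed.var(ddof=1)
r_observed = np.corrcoef(x[~missing], y[~missing])[0, 1]
r_imputed = np.corrcoef(x, y_imputed)[0, 1]

print(var_observed, var_imputed)
print(r_observed, r_imputed)
```

In this setup the imputed variable has a smaller variance than the observed values, and its correlation with x is pulled toward zero. Both effects distort the correlation matrix that a factor analysis starts from.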
  5. I liked the post by OMU. I can offer a related perspective.

    Unlike OMU, who has never found a use for factor analysis
    in his industrial context, I have used factor analysis many times -
    it has given me an easy cross-check on the sensibility and
    content of rating scales for attitudes or symptoms, etc.
    I've used it with almost every new data set, over several decades.

    But I have never done a factor analysis "to test ... hypotheses."
    Most of that was killed off, I thought, when Spearman failed
    to wrap up the argument about "g".

    Modern attempts, I thought, would fall under the heading of
    SEM (structural equations methods). For that, I would assume
    large Ns, little missing, data structure that is fairly obvious
    before you start. What you *may* do is confirm that certain
    hypotheses seem better than others.

    If you have a citation that you are following, "to test certain
    hypotheses" by factor analyses, please post it. I suspect that
    you will find readers here who may provide a critique.

    [snip, rest]
    Richard Ulrich, Nov 25, 2007
  6. A Fog

    Herman Rubin Guest

    The maximum value of the likelihood function is infinite,
    so such methods cannot work. There are ways to handle
    things, such as Bayesian methods, or putting a lower bound
    on the specific variances; either of these will remove the
    problem of singularity.

    If you believe that there are large measurement errors,
    then you believe the specific variances are rather high.

    You are in a situation where you MUST make far more
    assumptions than the usual factor analysis assumptions,
    of which normality is the least important. This situation
    always holds, for any non-arbitrary method (one based
    on principles) if there are more parameters than there
    are observations. These cases do occur, and any attempt
    at "objectivity" is subject to failure.
    Herman Rubin, Nov 25, 2007
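
For reference, the effect of a lower bound on the specific variances can be written out with the usual common factor model. The notation here is assumed for illustration and does not appear in the thread: p variables, n cases, sample covariance S, loadings Λ, specific variances ψ_j, and a bound ε.

```latex
\begin{aligned}
\Sigma &= \Lambda\Lambda^{\top} + \Psi, \qquad \Psi = \operatorname{diag}(\psi_1, \dots, \psi_p), \\
\ell(\Lambda, \Psi) &= -\frac{n}{2}\Bigl[\log\det\Sigma + \operatorname{tr}\bigl(\Sigma^{-1} S\bigr)\Bigr] + \text{const}, \\
\psi_j \ge \varepsilon > 0 \;&\Longrightarrow\; \Sigma \succeq \varepsilon I \;\Longrightarrow\; \ell(\Lambda, \Psi) \le -\frac{n p}{2}\log\varepsilon + \text{const}.
\end{aligned}
```

With the bound in place every eigenvalue of Σ is at least ε, so log det Σ cannot run off to minus infinity and the normal log-likelihood stays bounded above, even when S itself is singular.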
  7. A Fog

    Old Mac User Guest


    Hello again...

    I've used PCA a few times (always on correlation matrices) to get a
    sense of "how many dimensions do we have here" and to test for "zero"
    eigenvalues because of hidden linear constraints. Both PCA and Factor
    Analysis are sensitive to scaling... so I always work with a
    correlation matrix when doing PCA.

    When I saw "is it a good idea to replace missing data with their mean"
    I just couldn't remain silent.

    Regards... OMU
    Old Mac User, Nov 25, 2007
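
A small sketch of the scaling point for PCA, in Python with NumPy on simulated data. One variable is re-expressed in different units (the factor of 1000 is an arbitrary assumption), and the eigenvalues of the covariance and correlation matrices are compared before and after.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(186, 4)) @ rng.normal(size=(4, 4))   # correlated columns

scale = np.array([1.0, 1000.0, 1.0, 1.0])   # change of units for the second variable
X_rescaled = X * scale

def eigenvalues(M):
    """Eigenvalues of a symmetric matrix, largest first."""
    return np.sort(np.linalg.eigvalsh(M))[::-1]

cov_changed = not np.allclose(eigenvalues(np.cov(X, rowvar=False)),
                              eigenvalues(np.cov(X_rescaled, rowvar=False)))
corr_same = np.allclose(eigenvalues(np.corrcoef(X, rowvar=False)),
                        eigenvalues(np.corrcoef(X_rescaled, rowvar=False)))
print(cov_changed, corr_same)
```

The covariance eigenvalues change with the units while the correlation eigenvalues do not, which is the usual reason for running an ordinary PCA on the correlation matrix.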
  8. A Fog

    Ray Koopman Guest

    That depends on the method of estimation. For factor analysis there
    are several scale-equivariant methods: canonical/maximum likelihood/
    maximum determinant, generalized least squares, alpha, and various
    versions of weighted least squares come to mind; for PCA there is a
    method called Harris component analysis (which really deserves to be
    more widely used than it is).
    Ray Koopman, Nov 25, 2007
  9. A Fog

    Ray Koopman Guest

    I have been asked offline to provide a reference for and short
    description of Harris component analysis. The basic reference is

    Harris, Chester W. (1962). Some Rao-Guttman relationships.
    Psychometrika, 27, 247-263.

    It's a rescaled PCA: you rescale the covariances, get the principal
    components of the rescaled matrix, then undo the rescaling to get
    back into the original units. The rescaling matrix is a diagonal
    matrix, say D^2, whose elements are the reciprocals of the diagonals
    of the inverse of the covariance matrix. Instead of getting the
    eigenvector matrix V and diagonal matrix of eigenvalues E of the
    covariance matrix C, you get them for D^-1.C.D^-1; the component
    loadings are then D.V.E^(1/2), where the premultiplication by D puts
    things back into the original units. (Notation: "^" is powering,
    "." is matrix multiplication. "^" has a higher precedence than ".".)

    The diagonal elements of D^2 are variances, in the same units as the
    corresponding variables in C. Different variables do not have to be
    in the same units. The analysis is scale-equivariant: if two people
    analyze covariance matrices that are rescalings of one another,
    their results will be the same rescalings of one another. So it
    doesn't matter whether you analyze covariances or correlations; you
    can always rescale the results into whatever units you want, and
    they will be just as if you had done the analysis in those units in
    the first place.

    The variances in D^2 are sometimes called "empirical uniquenesses".
    They are analogous to (and upper bounds for) the theoretical unique
    variances in common factor analysis. dj^2 is the variance of the
    residuals when variable j is regressed on the k-1 other variables.
    dj^2 = cjj * (1 - rj^2), where cjj is the variance of variable j
    (i.e., the j'th diagonal element of C) and rj is the multiple
    correlation of variable j with the k-1 other variables.
    Ray Koopman, Nov 26, 2007
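
The recipe above can be sketched in a few lines. This is a minimal illustration in Python with NumPy, not code from the thread or from Harris (1962). The function name harris_components, the simulated data, and the rescaling vector s are all hypothetical, and the covariance matrix is assumed to be nonsingular.

```python
import numpy as np

def harris_components(C):
    """Hypothetical helper: Harris loadings D.V.E^(1/2) for a nonsingular C."""
    d2 = 1.0 / np.diag(np.linalg.inv(C))     # diagonal of D^2 ("empirical uniquenesses")
    d = np.sqrt(d2)
    C_rescaled = C / np.outer(d, d)          # D^-1 . C . D^-1
    E, V = np.linalg.eigh(C_rescaled)        # eigh returns eigenvalues in ascending order
    E, V = E[::-1], V[:, ::-1]               # largest component first
    loadings = d[:, None] * V * np.sqrt(E)   # D . V . E^(1/2), back in original units
    return loadings, d2

rng = np.random.default_rng(3)
X = rng.normal(size=(186, 5)) @ rng.normal(size=(5, 5))   # simulated, correlated data
C = np.cov(X, rowvar=False)
loadings, d2 = harris_components(C)

# Re-express the variables in arbitrary other units and repeat the analysis.
s = np.array([1.0, 10.0, 0.1, 2.0, 5.0])
loadings_s, d2_s = harris_components(C * np.outer(s, s))
```

If the sketch is right, loadings_s should equal the original loadings with each row multiplied by the matching element of s (up to the sign of each column), and d2_s should equal s^2 times d2, which is the scale-equivariance described above. With all components retained, the loadings times their transpose reproduce C.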