Clustering to optimise proportion of first principal component

Discussion in 'Scientific Statistics Math' started by RossClement, Dec 17, 2010.

  1. RossClement

    RossClement Guest

    I would like to take vectors of eight integers, all in the range
0-99. There will be a very large number of these vectors, in the tens of

    A sample vector might be (99,99,30,70,50,50,30,0)

    I'd like to cluster these vectors into a reasonable number of clusters
    (a number plucked wildly from the top of my head would be ten
    clusters) such that, within clusters, the proportion of variability of
    the data which is explained by the first principal component is
    maximised.

    Are there any particular clustering techniques that would be useful
    for this application?

    Clearly I could use standard k-means (for crisp cluster membership) or
    EM (for fuzzy cluster membership) clustering, which I'm sure would
    increase the proportion of variability explained by the first
    component, but my gut feeling is that it would produce a result far
    from optimal. Alternatively I could use search-based optimisation,
    such as a genetic algorithm, to find clusters. But, as what I want to
    do seems a simple, straightforward concept, I'm sure there are
    well-developed techniques.
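    As a concrete baseline, here is a minimal sketch (pure NumPy; the
    function names and the random test data are purely illustrative) of the
    k-means route together with the quantity being optimised: the
    within-cluster fraction of variance explained by the first principal
    component.

    ```python
    import numpy as np

    def kmeans(X, k, iters=50, seed=0):
        """Plain Lloyd's-algorithm k-means (illustrative, not optimised)."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)].astype(float)
        for _ in range(iters):
            labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        return labels

    def first_pc_fraction(M):
        """Fraction of total variance explained by the first principal component."""
        eig = np.linalg.eigvalsh(np.cov(M - M.mean(axis=0), rowvar=False))
        return eig[-1] / eig.sum()

    rng = np.random.default_rng(1)
    X = rng.integers(0, 100, size=(500, 8))   # fake data standing in for the real vectors
    labels = kmeans(X, k=10)
    for j in range(10):
        members = X[labels == j]
        if len(members) > 8:                  # need more cases than dimensions
            print(j, len(members), round(first_pc_fraction(members), 3))
    ```

    Printing the per-cluster fraction like this makes it easy to see how far
    plain k-means falls short of whatever a purpose-built method achieves.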

    Any help/advice/discussion?
    RossClement, Dec 17, 2010

  2. Art Kendall

    Art Kendall Guest

    Depending on the nature of your data, you might be able to do some form
    of factor analysis first (principal components, principal axes, etc.).
    If you find meaningful factors then do several runs of k-means on the
    scale scores or factor scores.
    If you have SPSS you could do a few runs of TWOSTEP using Euclidean
    distance.
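    Art's factor-first suggestion can be sketched as follows — a hedged
    illustration in NumPy, not SPSS TWOSTEP itself: compute
    principal-component scores and then run whatever clustering routine you
    prefer on those scores (the component count of 3 is an arbitrary
    placeholder).

    ```python
    import numpy as np

    def pca_scores(X, n_components=3):
        """Project centred data onto its leading principal components."""
        Xc = X - X.mean(axis=0)
        vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
        W = vecs[:, ::-1][:, :n_components]   # eigenvectors, largest eigenvalue first
        return Xc @ W

    rng = np.random.default_rng(0)
    X = rng.integers(0, 100, size=(200, 8)).astype(float)
    scores = pca_scores(X)
    print(scores.shape)                       # → (200, 3)
    # k-means (or TWOSTEP, in SPSS) would then be run on `scores` instead of X
    ```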

    Art Kendall
    Social Research Consultants
    Art Kendall, Dec 17, 2010

  3. RossClement

    RossClement Guest

    Thanks Art.

    I've also found a paper on "Principal Component Analysis and Effective
    K-means Clustering" by Ding and He which investigates the relationship
    between them. I suspect that a careful reading of this paper will put
    me in a much better position to think about what I should do.
    RossClement, Dec 17, 2010
  4. Rich Ulrich

    Rich Ulrich Guest

    Art mentioned factoring as a good place to start.
    Here is another consideration....

    It occurs to me that those numbers in the example look like
    an integer scale from 0 to 10, times 10, and kept to 2 digits ("99"
    for "100"). That makes me wonder a couple of things.

    Does the content suggest that there might be a few "nominal"
    categories that might appear from a cross-tabulation?

    If this is a scale with min and max, or something like a proportion,
    I would wonder about scaling at the ends. Is there a ceiling
    effect or a floor effect? If these are proportions, it might be
    appropriate to look at the logit of each value instead of the
    proportion itself. A factor analysis of logits would not obtain
    a first component that maximized the variance in the original
    metric... but if the scores are proportions, the analysis of logits
    would be expected to obtain a more sensible partitioning of
    *meaning* across *multiple* factors.
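    As a sketch of the logit idea (assuming the 0-99 values are
    percentages; the +0.5 "empirical logit" adjustment is one common way to
    keep the 0 and 99/100 endpoints finite, not something specified above):

    ```python
    import numpy as np

    def empirical_logit(x, n=100):
        """Empirical logit of a count x out of n; the +0.5 offsets keep 0 and n finite."""
        return np.log((x + 0.5) / (n - x + 0.5))

    v = np.array([99, 99, 30, 70, 50, 50, 30, 0])   # the sample vector from the thread
    print(empirical_logit(50))        # → 0.0  (the midpoint maps to zero)
    print(np.round(empirical_logit(v), 2))
    ```

    The transform stretches the ends of the scale, so values piled up at 0
    or 99 no longer dominate the variance the way they do on the raw metric.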

    Or, to say it in another way -- A factor analysis looks at the
    matrix of correlations for the first factor, and then it looks
    at the matrix of partial correlations for each successive
    factor, after partialling out each obtained "factor". What
    is partialled out at each step is effectively a linear composite
    of scores. If those scores are "badly distributed" in certain
    ways, then the extraction can give you artifacts that reflect
    the bad scaling. The likeliest two things that I think of are
    both because of inhomogeneous variances, when the original
    data scores were proportions or were ranks.

    I do wonder why you consider clustering a good starting place.
    I've never found clustering to be attractive, partly because it
    leaves untouched so much of the task of describing what is there.

    What I mention above, about scaling, might be irrelevant, but
    the one place that clustering can function usefully is when it
    "corrects" for bad scaling by lumping cases according to the
    disparate scaling -- so that the separate "clusters" are now
    much more homogeneous. Other analyses can then be
    done with more reliability, using each homogeneous set.
    Rich Ulrich, Dec 17, 2010
  5. Ray Koopman

    Ray Koopman Guest

    Starting from an initial partition of your choice, iterate by moving
    one case at a time from its current cluster to some other cluster so
    as to minimize the sum of the smallest 7 eigenvalues of the
    within-cluster sum-of-products matrices. Since the data are integers,
    these matrices can be updated easily. Quit when no move improves
    things.
    Do this many times, from different starts and/or checking the cases
    in different orders. You won't necessarily get the absolutely best
    solution, but you'll get some very good ones.
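    A naive sketch of this relocation scheme (illustrative names; this
    version recomputes the objective from scratch instead of using the
    incremental updates mentioned above, and note that very small clusters
    trivially score zero, so in practice a minimum cluster size may be
    needed):

    ```python
    import numpy as np

    def objective(X, labels, k):
        """Sum over clusters of (trace minus largest eigenvalue) of the
        within-cluster sums-of-squares-and-products matrix -- i.e. the
        variation NOT captured by each cluster's first principal component."""
        total = 0.0
        for j in range(k):
            M = X[labels == j]
            if len(M) < 2:
                continue                      # tiny/empty clusters contribute nothing
            Mc = M - M.mean(axis=0)
            S = Mc.T @ Mc
            total += np.trace(S) - np.linalg.eigvalsh(S)[-1]
        return total

    def relocate(X, k, seed=0, sweeps=10):
        """Greedy single-case relocation from a random starting partition."""
        rng = np.random.default_rng(seed)
        labels = rng.integers(0, k, len(X))
        best = objective(X, labels, k)
        for _ in range(sweeps):
            improved = False
            for i in rng.permutation(len(X)):
                old = labels[i]
                for j in range(k):
                    if j != old:
                        labels[i] = j
                        val = objective(X, labels, k)
                        if val < best:
                            best, old, improved = val, j, True
                labels[i] = old               # settle on the best cluster found for case i
            if not improved:
                break                         # no single move helps: local optimum
        return labels, best

    rng = np.random.default_rng(2)
    X = rng.integers(0, 100, size=(100, 8)).astype(float)
    labels, best = relocate(X, k=4, sweeps=3)
    print(round(float(best), 1))
    ```

    Running `relocate` from several seeds and keeping the lowest `best`
    value implements the multiple-starts advice above.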
    Ray Koopman, Dec 18, 2010
  6. Ray Koopman

    Ray Koopman Guest

    You need to solve for only the largest eigenvalue of each matrix:
    because the eigenvalues of a symmetric matrix sum to its trace,
    minimizing the sum of the smallest 7 eigenvalues is equivalent to
    minimizing the difference between the trace of the matrix and the
    largest eigenvalue.
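    The equivalence is just the fact that the eigenvalues of a symmetric
    matrix sum to its trace; a quick numerical check (random data,
    illustrative only):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(20, 8))
    S = A.T @ A                               # a symmetric 8x8 sum-of-products matrix
    eig = np.linalg.eigvalsh(S)               # eigenvalues in ascending order
    # sum of the 7 smallest eigenvalues equals trace minus the largest
    print(np.isclose(eig[:7].sum(), np.trace(S) - eig[-1]))   # → True
    ```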
    Ray Koopman, Dec 19, 2010
