# Clustering to optimise proportion of first principal component

Discussion in 'Scientific Statistics Math' started by RossClement, Dec 17, 2010.

1. ### RossClement (Guest)

I would like to take vectors of eight integers, all in the range
0-99. There will be a very large number of these vectors, in the tens
of thousands.

A sample vector might be (99,99,30,70,50,50,30,0)

I'd like to cluster these vectors into a reasonable number of clusters
(a number plucked wildly from the top of my head would be ten
clusters) such that, within clusters, the proportion of variability of
the data which is explained by the first principal component is
maximised.

Are there any particular clustering techniques that would be useful
for this application?

Clearly I could use standard k-means (for crisp cluster membership) or
EM (for fuzzy cluster membership) clustering, which I'm sure would
increase the proportion of variability explained by the first
component, but my gut feeling is that the result would be far from
optimal. Otherwise I could use search-based optimisation, such as a
genetic algorithm, to find clusters. But as what I want to do seems a
simple, straightforward concept, I'm sure there are already
well-developed techniques.
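
To make the objective concrete, here is a rough sketch (Python with
NumPy chosen purely as an illustration; the function name is mine) of
the quantity I would like a clustering to maximise:

```python
# Rough sketch: for a given hard clustering, the pooled proportion of
# within-cluster variance explained by each cluster's first principal
# component -- the quantity I'd like the clustering to maximise.
import numpy as np

def first_pc_proportion(X, labels):
    """X: (n, 8) array of vectors; labels: (n,) integer cluster labels."""
    explained, total = 0.0, 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        if len(Xk) < 2:
            continue                            # singleton clusters contribute no variance
        cov = np.cov(Xk, rowvar=False)          # 8 x 8 within-cluster covariance
        eigvals = np.linalg.eigvalsh(cov)       # eigenvalues in ascending order
        explained += eigvals[-1]                # variance along the first PC
        total += eigvals.sum()                  # total within-cluster variance
    return explained / total
```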

RossClement, Dec 17, 2010

2. ### Art Kendall (Guest)

Depending on the nature of your data, you might be able to do some form
of factor analysis first (principal components, principal axes, etc.).
If you find meaningful factors, then do several runs of k-means on the
scale scores or factor scores.
If you have SPSS, you could do a few runs of TWOSTEP using Euclidean
distance.
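
For what it's worth, a minimal sketch of that "factor first, then
cluster the scores" workflow, using PCA as the factoring step (Python
with scikit-learn is just one convenient choice here; the 3 components,
10 clusters, and 5 runs are placeholders, not recommendations):

```python
# Sketch: PCA/factor step first, then several k-means runs on the scores.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.randint(0, 100, size=(20000, 8)).astype(float)  # stand-in for the real data

scores = PCA(n_components=3).fit_transform(X)   # keep a few "meaningful" components
runs = [KMeans(n_clusters=10, n_init=1, random_state=s).fit(scores) for s in range(5)]
best = min(runs, key=lambda km: km.inertia_)    # keep the run with the lowest SSE
labels = best.labels_
```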

Art Kendall
Social Research Consultants

Art Kendall, Dec 17, 2010

3. ### RossClement (Guest)

Thanks Art.

I've also found a paper on "Principal Component Analysis and Effective
K-means Clustering" by Ding and He, which investigates the relationship
between the two methods. I suspect that a careful reading of this paper
will put me in a much better position to think about what I should do.

RossClement, Dec 17, 2010

4. ### Rich Ulrich (Guest)

Art mentioned factoring as a good place to start.
Here is another consideration....

It occurs to me that the numbers in the example look like an integer
scale from 0 to 10, multiplied by 10 and kept to 2 digits ("99"
standing in for "100"). That makes me wonder about a couple of things.

Does the content suggest that a few "nominal" categories might emerge
from a cross-tabulation?

If this is a scale with a min and max, or something like a proportion,
I would wonder about scaling at the ends. Is there a ceiling
effect or a floor effect? If these are proportions, it might be
appropriate to look at the logit of each value instead of the
proportion itself. A factor analysis of logits would not obtain
a first component that maximized the variance in the original
metric... but if the scores are proportions, the analysis of logits
would be expected to yield a more sensible partitioning of
*meaning* across *multiple* factors.
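
If the scores really are percentages, a minimal sketch of that
transformation might look like the following (the assumption that 99
stands in for 100, and the small epsilon used to keep the endpoints
finite, are both mine):

```python
# Sketch: logit-transform proportion-like scores before any factoring.
import numpy as np

def to_logit(X, eps=0.005):
    p = np.where(X == 99, 100, X) / 100.0       # treat 99 as 100, rescale to [0, 1]
    p = np.clip(p, eps, 1 - eps)                # keep 0% and 100% finite
    return np.log(p / (1 - p))                  # logit of each proportion
```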

Or, to say it another way -- a factor analysis looks at the
matrix of correlations for the first factor, and then at the
matrix of partial correlations for each successive factor,
after partialling out each obtained "factor". What is
partialled out at each step is effectively a linear composite
of scores. If those scores are "badly distributed" in certain
ways, then the extraction can give you artifacts that reflect
the bad scaling. The two likeliest problems I can think of
both arise from inhomogeneous variances, which occur when the
original scores are proportions or ranks.

I do wonder why you consider clustering a good starting place.
I've never found clustering to be attractive, partly because it
leaves untouched so much of the task of describing what is there.

What I mention above about scaling might be irrelevant, but the
one place where clustering can function usefully is when it
"corrects" for bad scaling by lumping cases according to the
disparate scaling -- so that the separate "clusters" are now
much more homogeneous. Other analyses can then be done with
more reliability, using each homogeneous set.

Rich Ulrich, Dec 17, 2010

5. ### Ray Koopman (Guest)

Starting from a partition of your choice, iterate by moving one case
at a time from its current cluster to some other cluster so as to
minimize the sum of the smallest 7 eigenvalues of the within-cluster
sum-of-products matrices. Since the data are integers, these matrices
can be updated easily. Quit when no single move improves things.
Do this many times, from different starts and/or checking the cases
in different orders. You won't necessarily get the absolute best
solution, but you'll get some very good ones.
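
A naive sketch of that relocation search, in Python (the easy
incremental updating of the matrices is omitted, so this version
recomputes everything from scratch and is far slower than it needs to
be; the example data and number of clusters are placeholders):

```python
# Naive sketch: hill-climbing by single-case relocation, minimizing the sum
# over clusters of (trace - largest eigenvalue) of the within-cluster
# sum-of-products matrices, i.e. the sum of the 7 smallest eigenvalues.
import numpy as np

def objective(X, labels, n_clusters):
    total = 0.0
    for k in range(n_clusters):
        Xk = X[labels == k]
        if len(Xk) < 2:
            continue
        D = Xk - Xk.mean(axis=0)
        S = D.T @ D                             # within-cluster sum-of-products matrix
        total += np.trace(S) - np.linalg.eigvalsh(S)[-1]
    return total

def relocation_search(X, labels, n_clusters):
    labels = labels.copy()
    best = objective(X, labels, n_clusters)
    improved = True
    while improved:                             # keep passing over the cases until no move helps
        improved = False
        for i in range(len(X)):
            current = labels[i]
            for k in range(n_clusters):
                if k == current:
                    continue
                labels[i] = k                   # tentatively move case i to cluster k
                val = objective(X, labels, n_clusters)
                if val < best:
                    best, current, improved = val, k, True
                else:
                    labels[i] = current         # revert the move
    return labels, best

# Example: one run from a random start (repeat with many starts and keep the best).
X = np.random.randint(0, 100, size=(500, 8)).astype(float)
start = np.random.randint(0, 10, size=len(X))
labels, value = relocation_search(X, start, n_clusters=10)
```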

Ray Koopman, Dec 18, 2010

6. ### Ray Koopman (Guest)

You need to solve for only the largest eigenvalue of each matrix.
Minimizing the sum of the smallest 7 eigenvalues is equivalent to
minimizing the difference between the trace of the matrix and the
largest eigenvalue.
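
A quick numerical check of that equivalence (the NumPy calls and the
random example matrix are mine, purely for illustration):

```python
# The trace equals the sum of all 8 eigenvalues, so the sum of the 7
# smallest equals the trace minus the largest.
import numpy as np

D = np.random.randn(500, 8)
S = D.T @ D                                     # an example sum-of-products matrix
eig = np.linalg.eigvalsh(S)                     # eigenvalues in ascending order
assert np.isclose(eig[:7].sum(), np.trace(S) - eig[-1])
```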

Ray Koopman, Dec 19, 2010