# average if: average values in column if values in other colum are equal

Discussion in 'MATLAB' started by Daphne, Jul 15, 2011.

1. ### DaphneGuest

I would like to average values in column 2 of a matrix that correspond to equal (and unique) values in column 1, a simple example:

a= [1.0000 0.5000
1.0000 0.4500
1.0000 0.6000
2.0000 1.0000
2.0000 1.2000
3.0000 3.0000
3.0000 3.5000
3.0000 3.0000];

should become

b = [1.0000 0.5167
2.0000 1.1000
3.0000 3.1667];

is there a non-loop way to do this, for matrices with >100k elements?...
I have tried to do logicals in a loop, but these take too long (example of 126 sec on 8653 calls from the profiler). My try:

unique_a = unique(a,1));
b = NaN(size(unique_a,1),2);
for unique_ind = 1:size(unique_a,1)
data2avg = a(a,1) == unique_a(unique_ind),2);
b(unique_ind, = [unique_a(unique_ind),mean(data2avg) ];
end

Any ideas on how to improve on this or do this better?

Thanks,
Daphne

Daphne, Jul 15, 2011

2. ### Matt JGuest

================

accumarray(a,1),a,2))./accumarray(a,1),1)

Matt J, Jul 15, 2011

3. ### dpbGuest

....

Not w/o a loop, but preallocating result and cleaning up slightly may
help some...
b =
1.0000 0.5167
2.0000 1.1000
3.0000 3.1667
--

dpb, Jul 15, 2011
4. ### Roger StaffordGuest

- - - - - - - - -
If elements of a,1) are not necessarily indices, do this:

[u,~,n] = unique(a,1));
sz = [size(u,1),1];
b = [u,accumarray(n,a,2),sz)./accumarray(n,1,sz)];

Roger Stafford

Roger Stafford, Jul 15, 2011
5. ### Matt JGuest

=================

BTW, I deliberately do not recommend using
accumarray(...,@mean). This will be considerably slower, as demonstrated in the following test:

%fake data
N=1e6;
K=1000;
a=[randi(K,N,1), rand(N,1)];

tic;
result1=accumarray(a,1),a,2))./accumarray(a,1),1);
toc;
%Elapsed time is 0.060626 seconds.

tic;
result2=accumarray(a,1),a,2),[],@mean);
toc;
%Elapsed time is 0.270781 seconds.

Matt J, Jul 15, 2011
6. ### DaphneGuest

Thanks so much for the great suggestions!
The accumarray approach reduced calculation time to 1.27 seconds in a test case which I ran and took at least minutes (don't know exactly, because I stopped it and waited for a better option).

Thanks again!

Daphne, Jul 16, 2011
7. ### DaphneGuest

I've compared the two accumarray approaches suggested here and Matt J's is actually much faster although both give the same results.

A follow up question, how would you calculate the std of the means? that would be a third column in the results matrix.

Thanks!

Daphne, Jul 16, 2011
8. ### Matt JGuest

Same kind of thing...

N=accumarray(a,1),1);
av=accumarray(a,1),a,2));
avsq=accumarray(a,1),a,2).^2);

standardDevs=sqrt((avsq-av.^2)./N)

Matt J, Jul 16, 2011
9. ### DaphneGuest

Perfect!

I noticed a strange phenomenon, the average (and now also the std) gives me a vector that has several NaN and on the bottom the actual values I am looking for.
This didn't happen with the sample matrices, but does for my real data. Any ideas why?
Since the values themselves are fine, I just toss the NaNs and keep the values, but would be happy to know why this is happening.
It doesn't happen in the accumarry approach that Roger Stafford offered.

Thanks,
Daphne

Daphne, Jul 16, 2011
10. ### Matt JGuest

==========================

Although, it would be marginally more numerically stable to extend Roger's code as follows:

[u,i,j] = unique(a,1));
sz = size(i);

N=accumarray(j,1,sz);

col1=u;
col2=accumarray(j,a,2),sz)./N;
col3=sqrt(accumarray(j, ( a,2)-col2(j) ).^2 ,sz)./N);

result = [col1,col2,col3];

Matt J, Jul 16, 2011
11. ### Matt JGuest

===================

Because in your actual data unique(a,1)) contains fewer values than
1:max(a,1)). ACCUMARRAY assigns zeros to these missing locations and you end up doing a 0/0 operation there.

Roger's use of UNIQUE is more neat and tidy, and more efficient I guess since 0/0 operations are more time-consuming than well-defined division operations.

Matt J, Jul 16, 2011
12. ### DaphneGuest

Got it.

Well, my final version, which also works if the unique values in column 1 are not integers (thanks for noting that Roger):

[unique_val,~,ind_unique] = unique(a,1));
sz = [size(unique_val,1),1];
N = accumarray(ind_unique,1,sz); % number of measurements
mean_a = accumarray(ind_unique,a,2),sz)./ N; % average
std_a = sqrt(accumarray(ind_unique, ( a,2)-mean_a(ind_unique) ).^2 ,sz) ./ (N-1)); % sample std, use N for population std
b = [unique_val, mean_a, std_a];

Thanks a lot Matt and Roger!
Daphne

Daphne, Jul 16, 2011
13. ### Roger StaffordGuest

- - - - - - - - - - -
To avoid the zero-divided-by-zero phenomenon for single occurrences in a,1), you had better divide by max(N-1,1) rather than N-1, just as matlab's 'std' does.

Roger Stafford

Roger Stafford, Jul 16, 2011
14. ### DaphneGuest

thanks for the tip. I've done that.

Daphne

Daphne, Jul 16, 2011