average if: average values in column if values in other colum are equal

Discussion in 'MATLAB' started by Daphne, Jul 15, 2011.

  1. Daphne

    Daphne Guest

    I would like to average values in column 2 of a matrix that correspond to equal (and unique) values in column 1, a simple example:

    a= [1.0000 0.5000
    1.0000 0.4500
    1.0000 0.6000
    2.0000 1.0000
    2.0000 1.2000
    3.0000 3.0000
    3.0000 3.5000
    3.0000 3.0000];

    should become

    b = [1.0000 0.5167
    2.0000 1.1000
    3.0000 3.1667];

    is there a non-loop way to do this, for matrices with >100k elements?...
    I have tried to do logicals in a loop, but these take too long (example of 126 sec on 8653 calls from the profiler). My try:

    unique_a = unique(a:),1));
    b = NaN(size(unique_a,1),2);
    for unique_ind = 1:size(unique_a,1)
    data2avg = a(a:),1) == unique_a(unique_ind),2);
    b(unique_ind,:) = [unique_a(unique_ind),mean(data2avg) ];
    end

    Any ideas on how to improve on this or do this better?

    Thanks,
    Daphne
     
    Daphne, Jul 15, 2011
    #1
    1. Advertisements

  2. Daphne

    Matt J Guest

    ================

    accumarray(a:),1),a:),2))./accumarray(a:),1),1)
     
    Matt J, Jul 15, 2011
    #2
    1. Advertisements

  3. Daphne

    dpb Guest

    ....

    Not w/o a loop, but preallocating result and cleaning up slightly may
    help some...
    b =
    1.0000 0.5167
    2.0000 1.1000
    3.0000 3.1667
    --
     
    dpb, Jul 15, 2011
    #3
  4. - - - - - - - - -
    If elements of a:),1) are not necessarily indices, do this:

    [u,~,n] = unique(a:),1));
    sz = [size(u,1),1];
    b = [u,accumarray(n,a:),2),sz)./accumarray(n,1,sz)];

    Roger Stafford
     
    Roger Stafford, Jul 15, 2011
    #4
  5. Daphne

    Matt J Guest

    =================

    BTW, I deliberately do not recommend using
    accumarray(...,@mean). This will be considerably slower, as demonstrated in the following test:

    %fake data
    N=1e6;
    K=1000;
    a=[randi(K,N,1), rand(N,1)];

    tic;
    result1=accumarray(a:),1),a:),2))./accumarray(a:),1),1);
    toc;
    %Elapsed time is 0.060626 seconds.

    tic;
    result2=accumarray(a:),1),a:),2),[],@mean);
    toc;
    %Elapsed time is 0.270781 seconds.
     
    Matt J, Jul 15, 2011
    #5
  6. Daphne

    Daphne Guest

    Thanks so much for the great suggestions!
    The accumarray approach reduced calculation time to 1.27 seconds in a test case which I ran and took at least minutes (don't know exactly, because I stopped it and waited for a better option).

    Thanks again!



     
    Daphne, Jul 16, 2011
    #6
  7. Daphne

    Daphne Guest

    I've compared the two accumarray approaches suggested here and Matt J's is actually much faster although both give the same results.

    A follow up question, how would you calculate the std of the means? that would be a third column in the results matrix.

    Thanks!



     
    Daphne, Jul 16, 2011
    #7
  8. Daphne

    Matt J Guest

    Same kind of thing...

    N=accumarray(a:),1),1);
    av=accumarray(a:),1),a:),2));
    avsq=accumarray(a:),1),a:),2).^2);

    standardDevs=sqrt((avsq-av.^2)./N)
     
    Matt J, Jul 16, 2011
    #8
  9. Daphne

    Daphne Guest

    Perfect!

    I noticed a strange phenomenon, the average (and now also the std) gives me a vector that has several NaN and on the bottom the actual values I am looking for.
    This didn't happen with the sample matrices, but does for my real data. Any ideas why?
    Since the values themselves are fine, I just toss the NaNs and keep the values, but would be happy to know why this is happening.
    It doesn't happen in the accumarry approach that Roger Stafford offered.

    Thanks,
    Daphne
     
    Daphne, Jul 16, 2011
    #9
  10. Daphne

    Matt J Guest

    ==========================

    Although, it would be marginally more numerically stable to extend Roger's code as follows:


    [u,i,j] = unique(a:),1));
    sz = size(i);

    N=accumarray(j,1,sz);

    col1=u;
    col2=accumarray(j,a:),2),sz)./N;
    col3=sqrt(accumarray(j, ( a:),2)-col2(j) ).^2 ,sz)./N);

    result = [col1,col2,col3];
     
    Matt J, Jul 16, 2011
    #10
  11. Daphne

    Matt J Guest

    ===================

    Because in your actual data unique(a:),1)) contains fewer values than
    1:max(a:),1)). ACCUMARRAY assigns zeros to these missing locations and you end up doing a 0/0 operation there.

    Roger's use of UNIQUE is more neat and tidy, and more efficient I guess since 0/0 operations are more time-consuming than well-defined division operations.
     
    Matt J, Jul 16, 2011
    #11
  12. Daphne

    Daphne Guest

    Got it.

    Well, my final version, which also works if the unique values in column 1 are not integers (thanks for noting that Roger):

    [unique_val,~,ind_unique] = unique(a:),1));
    sz = [size(unique_val,1),1];
    N = accumarray(ind_unique,1,sz); % number of measurements
    mean_a = accumarray(ind_unique,a:),2),sz)./ N; % average
    std_a = sqrt(accumarray(ind_unique, ( a:),2)-mean_a(ind_unique) ).^2 ,sz) ./ (N-1)); % sample std, use N for population std
    b = [unique_val, mean_a, std_a];


    Thanks a lot Matt and Roger!
    Daphne
     
    Daphne, Jul 16, 2011
    #12
  13. - - - - - - - - - - -
    To avoid the zero-divided-by-zero phenomenon for single occurrences in a:),1), you had better divide by max(N-1,1) rather than N-1, just as matlab's 'std' does.

    Roger Stafford
     
    Roger Stafford, Jul 16, 2011
    #13
  14. Daphne

    Daphne Guest

    thanks for the tip. I've done that.

    Daphne


     
    Daphne, Jul 16, 2011
    #14
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.