Database like "left join" or SAS like Merge function

Discussion in 'Mathematica' started by Peter, Nov 23, 2007.

  1. Peter

    Peter Guest

    Is there a builtin function that can "join" (merge) two two-
    dimensional data lists based on a single or multiple key column?

    This seems like such a basic requirement that I feel like I must be
    missing something obvious.

    Thanks, Peter
     
    Peter, Nov 23, 2007
    #1
    1. Advertisements

  2. Here is an example that does what you want:

    Define a matrix where the first two columns contain keys and the third
    column contains data.
    In[1]:= mat1=Table[{RandomInteger[3],RandomInteger[2],RandomReal[]},{10}]
    Out[1]=
    {{2,0,0.719448},{2,0,0.155021},{3,0,0.138765},{1,2,0.714609},{0,0,0.244554},{2,0,0.582359},{1,2,0.354337},{0,2,0.498536},{0,2,0.159724},{1,2,0.624066}}

    Similarly, define a second matrix.
    In[2]:= mat2=Table[{RandomInteger[3],RandomInteger[2],RandomReal[]},{20}]
    Out[2]=
    {{1,0,0.756742},{3,2,0.112279},{2,1,0.091024},{2,0,0.177684},{2,1,0.77082},{3,1,0.25392},{0,0,0.508483},{2,2,0.201832},{1,2,0.738323},{2,2,0.139091},{1,2,0.918997},{3,1,0.801789},{1,1,0.0767979},{1,2,0.600228},{3,0,0.129787},{1,1,0.699115},{2,1,0.897158},{3,2,0.364663},{2,0,0.487728},{2,1,0.901181}}

    Join the matrices, then sort the result by the key columns, then split the
    sorted result into subsets according to their key.
    In[3]:= Split[Sort[Join[mat1,mat2]],#1[[2]]==#2[[2]]&]
    Out[3]=
    {{{0,0,0.244554},{0,0,0.508483}},{{0,2,0.159724},{0,2,0.498536}},{{1,0,0.756742}},{{1,1,0.0767979},{1,1,0.699115}},{{1,2,0.354337},{1,2,0.600228},{1,2,0.624066},{1,2,0.714609},{1,2,0.738323},{1,2,0.918997}},{{2,0,0.155021},{2,0,0.177684},{2,0,0.487728},{2,0,0.582359},{2,0,0.719448}},{{2,1,0.091024},{2,1,0.77082},{2,1,0.897158},{2,1,0.901181}},{{2,2,0.139091},{2,2,0.201832}},{{3,0,0.129787},{3,0,0.138765}},{{3,1,0.25392},{3,1,0.801789}},{{3,2,0.112279},{3,2,0.364663}}}

    This is a very basic example, but it can readily be generalised.

    Steve Luttrell
    West Malvern, UK
     
    Steve Luttrell, Nov 24, 2007
    #2
    1. Advertisements

  3. Peter

    Mark Fisher Guest

    Hi Peter,

    I'm not aware of any built-in function that does what you want. Here's
    something I wrote for myself that may be related to what you're
    looking for. The main function is MergeByKey. It calls a helper
    function SubsetPosition.

    --Mark

    (* code starts here *)

    MergeByKey::usage = "MergeByKey[{data1, data2, ...}, n] merges \
    the datasets treating the first n columns of each dataset \
    as the key. If n is omitted, the first column is used as the key. \
    MergeByKey[datasets, list] uses the list of column numbers to \
    specify the key columns which are placed first in the merged dataset.
    \
    Only rows that have matching keys in all datasets appear in the \
    merged dataset. MergeByKey assumes there are no duplicate keys \
    in each of the datasets."

    SubsetPosition::usage = "SubsetPosition[set, subset] returns the \
    positions of subset in the set, assuming both set and subset are \
    sorted and have no duplicates."

    MergeByKey[datalist:{__List}, n_:1] :=
    Module[{i, keylist, ikeys, poslist, dlist},
    i = Range[n];
    keylist = #[[All, i]] & /@ datalist;
    ikeys = Intersection @@ keylist;
    poslist = SubsetPosition[#, ikeys]& /@ keylist;
    dlist = MapThread[#1[[#2, n + 1 ;;]] &, {datalist, poslist}];
    Join @@@ Transpose[Prepend[dlist, ikeys]]
    ]

    MergeByKey[datalist:{__List}, index_List] :=
    Module[{orderlist},
    orderlist =
    Flatten[Join[index, Complement[Range[Length[#[[1]]]], index]]]& /@
    datalist;
    MergeByKey[MapThread[#1[[All, #2]]&, {datalist, orderlist}],
    Length[index]]
    ]

    SubsetPosition[superset:{__Integer}, subset:{__Integer}] :=
    sspCompiled[superset, subset]

    SubsetPosition[superset_, subset_] :=
    Module[{i = 1},
    (While[superset[] != #, i++]; i++) & /@ subset
    ]

    sspCompiled = Compile[{
    {superset, _Integer, 1},
    {subset, _Integer, 1}},
    Module[{i = 1},
    (While[superset[] != #, i++]; i++) & /@ subset
    ]]

    (* code ends here *)
     
    Mark Fisher, Nov 24, 2007
    #3
  4. Peter

    Albert Guest

    Hi Peter,
    I also don't know about a built in function which would allow to do this
    and there is nothing wrong with the solutions that have been sent
    already. Still I couldn't resist to post another method which takes
    advantage of using lookup tables, which is something which comes, via
    pattern matching, as a built in to mathematica but is often not
    recognized as that. Here is a simple function that builds a lookup table
    from a list with keys from the n'th column of the list. Note that you
    will need a symbol which holds the lookup table in it's downvalues, so
    this is the first argument here:

    makeLookupTable[symbol_,list_,column_]:=(
    symbol[___]=Table[0.,{Length[list[[1]]-1]}];
    Scan[
    With[{key=#[[column]]},symbol[key]=Delete[#,column]]&,
    list
    ]
    );

    Note that it uses a list with zeros as "default" if no key is found. Of
    course you might want a different behavior concerning the defaults
    depending on your use case.

    now here are two lists:

    m1=Table[{i,Random[],Random[]},{i,10}];
    m2=Table[{Random[],Random[],Random[],i},{i,2,12}];

    here I create two lookup tables using columns 1 respectively 4 as keys:

    makeLookupTable[lt1,m1,1]
    makeLookupTable[lt2,m2,4]

    these are all keys that are in either m1 or m2:

    allkeys=Union[m1[[All,1]],m2[[All,4]]]

    build the merged lists using all these keys:

    Join[{#},lt1[#],lt2[#]]&/@allkeys

    Note how the defaults are used for keys not existing in one of the two
    lists.

    Of course the details may vary and there are many possible variations.
    In general I found this a very useful programming pattern which allows
    me to write code which at least I can read easily and in many cases is
    also quite efficient, since the whole work of searching for keys is up
    to the pattern matcher, which is implemented at a low level and needs to
    do these things efficient to be of any use. Note that the keys are not
    limited at all but you can use any mathematica expression as keys, and
    by mixing with patterns you can implement defaults for non existing keys
    in a very powerful way. An earlier examination lead me to the conclusion
    that using plain strings as keys was very fast, while more complicated
    pattern were not so efficient, but this might have changed since version
    4.x ...

    When combining this to a function, you could use local variables to hold
    the lookup tables and which are cleaned up automatically:

    mergelists[lst1_List,lst2_List,n1_:1,n2_:1]:=Module[{
    lt1,lt2,keys
    },
    makeLookupTable[lt1,lst1,n1];
    makeLookupTable[lt2,lst2,n2];
    keys=Union[lst1[[All,n1]],lst2[[All,n2]]];
    Join[{#},lt1[#],lt2[#]]&/@keys
    ];

    hth,

    albert
     
    Albert, Nov 25, 2007
    #4
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.