# Database like "left join" or SAS like Merge function

Discussion in 'Mathematica' started by Peter, Nov 23, 2007.

1. ### PeterGuest

Is there a builtin function that can "join" (merge) two two-
dimensional data lists based on a single or multiple key column?

This seems like such a basic requirement that I feel like I must be
missing something obvious.

Thanks, Peter

Peter, Nov 23, 2007

2. ### Steve LuttrellGuest

Here is an example that does what you want:

Define a matrix where the first two columns contain keys and the third
column contains data.
In:= mat1=Table[{RandomInteger,RandomInteger,RandomReal[]},{10}]
Out=
{{2,0,0.719448},{2,0,0.155021},{3,0,0.138765},{1,2,0.714609},{0,0,0.244554},{2,0,0.582359},{1,2,0.354337},{0,2,0.498536},{0,2,0.159724},{1,2,0.624066}}

Similarly, define a second matrix.
In:= mat2=Table[{RandomInteger,RandomInteger,RandomReal[]},{20}]
Out=
{{1,0,0.756742},{3,2,0.112279},{2,1,0.091024},{2,0,0.177684},{2,1,0.77082},{3,1,0.25392},{0,0,0.508483},{2,2,0.201832},{1,2,0.738323},{2,2,0.139091},{1,2,0.918997},{3,1,0.801789},{1,1,0.0767979},{1,2,0.600228},{3,0,0.129787},{1,1,0.699115},{2,1,0.897158},{3,2,0.364663},{2,0,0.487728},{2,1,0.901181}}

Join the matrices, then sort the result by the key columns, then split the
sorted result into subsets according to their key.
In:= Split[Sort[Join[mat1,mat2]],#1[]==#2[]&]
Out=
{{{0,0,0.244554},{0,0,0.508483}},{{0,2,0.159724},{0,2,0.498536}},{{1,0,0.756742}},{{1,1,0.0767979},{1,1,0.699115}},{{1,2,0.354337},{1,2,0.600228},{1,2,0.624066},{1,2,0.714609},{1,2,0.738323},{1,2,0.918997}},{{2,0,0.155021},{2,0,0.177684},{2,0,0.487728},{2,0,0.582359},{2,0,0.719448}},{{2,1,0.091024},{2,1,0.77082},{2,1,0.897158},{2,1,0.901181}},{{2,2,0.139091},{2,2,0.201832}},{{3,0,0.129787},{3,0,0.138765}},{{3,1,0.25392},{3,1,0.801789}},{{3,2,0.112279},{3,2,0.364663}}}

This is a very basic example, but it can readily be generalised.

Steve Luttrell
West Malvern, UK

Steve Luttrell, Nov 24, 2007

3. ### Mark FisherGuest

Hi Peter,

I'm not aware of any built-in function that does what you want. Here's
something I wrote for myself that may be related to what you're
looking for. The main function is MergeByKey. It calls a helper
function SubsetPosition.

--Mark

(* code starts here *)

MergeByKey::usage = "MergeByKey[{data1, data2, ...}, n] merges \
the datasets treating the first n columns of each dataset \
as the key. If n is omitted, the first column is used as the key. \
MergeByKey[datasets, list] uses the list of column numbers to \
specify the key columns which are placed first in the merged dataset.
\
Only rows that have matching keys in all datasets appear in the \
merged dataset. MergeByKey assumes there are no duplicate keys \
in each of the datasets."

SubsetPosition::usage = "SubsetPosition[set, subset] returns the \
positions of subset in the set, assuming both set and subset are \
sorted and have no duplicates."

MergeByKey[datalist:{__List}, n_:1] :=
Module[{i, keylist, ikeys, poslist, dlist},
i = Range[n];
keylist = #[[All, i]] & /@ datalist;
ikeys = Intersection @@ keylist;
poslist = SubsetPosition[#, ikeys]& /@ keylist;
dlist = MapThread[#1[[#2, n + 1 ;;]] &, {datalist, poslist}];
Join @@@ Transpose[Prepend[dlist, ikeys]]
]

MergeByKey[datalist:{__List}, index_List] :=
Module[{orderlist},
orderlist =
Flatten[Join[index, Complement[Range[Length[#[]]], index]]]& /@
datalist;
Length[index]]
]

SubsetPosition[superset:{__Integer}, subset:{__Integer}] :=
sspCompiled[superset, subset]

SubsetPosition[superset_, subset_] :=
Module[{i = 1},
(While[superset[] != #, i++]; i++) & /@ subset
]

sspCompiled = Compile[{
{superset, _Integer, 1},
{subset, _Integer, 1}},
Module[{i = 1},
(While[superset[] != #, i++]; i++) & /@ subset
]]

(* code ends here *)

Mark Fisher, Nov 24, 2007
4. ### AlbertGuest

Hi Peter,
I also don't know about a built in function which would allow to do this
and there is nothing wrong with the solutions that have been sent
already. Still I couldn't resist to post another method which takes
advantage of using lookup tables, which is something which comes, via
pattern matching, as a built in to mathematica but is often not
recognized as that. Here is a simple function that builds a lookup table
from a list with keys from the n'th column of the list. Note that you
will need a symbol which holds the lookup table in it's downvalues, so
this is the first argument here:

makeLookupTable[symbol_,list_,column_]:=(
symbol[___]=Table[0.,{Length[list[]-1]}];
Scan[
With[{key=#[[column]]},symbol[key]=Delete[#,column]]&,
list
]
);

Note that it uses a list with zeros as "default" if no key is found. Of
course you might want a different behavior concerning the defaults

now here are two lists:

m1=Table[{i,Random[],Random[]},{i,10}];
m2=Table[{Random[],Random[],Random[],i},{i,2,12}];

here I create two lookup tables using columns 1 respectively 4 as keys:

makeLookupTable[lt1,m1,1]
makeLookupTable[lt2,m2,4]

these are all keys that are in either m1 or m2:

allkeys=Union[m1[[All,1]],m2[[All,4]]]

build the merged lists using all these keys:

Join[{#},lt1[#],lt2[#]]&/@allkeys

Note how the defaults are used for keys not existing in one of the two
lists.

Of course the details may vary and there are many possible variations.
In general I found this a very useful programming pattern which allows
me to write code which at least I can read easily and in many cases is
also quite efficient, since the whole work of searching for keys is up
to the pattern matcher, which is implemented at a low level and needs to
do these things efficient to be of any use. Note that the keys are not
limited at all but you can use any mathematica expression as keys, and
by mixing with patterns you can implement defaults for non existing keys
in a very powerful way. An earlier examination lead me to the conclusion
that using plain strings as keys was very fast, while more complicated
pattern were not so efficient, but this might have changed since version
4.x ...

When combining this to a function, you could use local variables to hold
the lookup tables and which are cleaned up automatically:

mergelists[lst1_List,lst2_List,n1_:1,n2_:1]:=Module[{
lt1,lt2,keys
},
makeLookupTable[lt1,lst1,n1];
makeLookupTable[lt2,lst2,n2];
keys=Union[lst1[[All,n1]],lst2[[All,n2]]];
Join[{#},lt1[#],lt2[#]]&/@keys
];

hth,

albert

Albert, Nov 25, 2007