Algorithm for searching "unstructured" dataset (of parameters and values) - algorithm

Given an unstructured dataset consisting of sets of various parameters with numeric quantities, what's an efficient and practical algorithm for searching? The parameters vary wildly, with no exhaustive list of parameters, and any parameter can be part of any set.
That was either a very good problem description, or just confusing. So let me try an example:
"dataset" : [
{"a": 4, "b": 1},
{"a": 4, "b": 1, "c": 0.5},
{"a": 1, "b": 3, "c": 0.5, "x": 1},
{"x": 3, "t": 0.01}
]
search input (to match/score against dataset):
q = {"a": 2, "b": 1}
I'm thinking a matching/scoring rule along the lines of:
For each "set" s in the dataset, scan through the parameters of s. If q contains the same parameter (name/key), let v be the quantity (value) of that parameter in s and w the corresponding value in q; the parameter is then scored as min(w/v, 1.0).
Repeat for each parameter of s, producing an overall score (as the product of all the "sub scores").
So, q scores
2/4 * 1/1 = 0.5 against the first two sets, 0.33 against the third set, and 0 against the last one. I'm not sure how to handle parameters in s that are not in q, but maybe those could give some secondary score (for those "hits" where score > 0).
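The rule described above can be sketched in Python. This is only an illustration of the scoring, not a recommendation of a particular search algorithm. It uses min(w/v, 1.0) per shared parameter (the form that reproduces the 2/4 * 1/1 = 0.5 example), returns 0 when no parameter is shared, and assumes stored values are never zero. The names score and rank are made up for the example.

```python
def score(q, s):
    """Score query q against one set s (both dicts of parameter -> number)."""
    shared = [name for name in s if name in q]
    if not shared:
        return 0.0  # assumption: no shared parameter means no match
    total = 1.0
    for name in shared:
        # assumption: s[name] is never zero
        total *= min(q[name] / s[name], 1.0)
    return total


def rank(q, dataset):
    """Return (score, index) pairs for all sets, best score first."""
    scored = [(score(q, s), i) for i, s in enumerate(dataset)]
    return sorted(scored, key=lambda pair: -pair[0])


dataset = [
    {"a": 4, "b": 1},
    {"a": 4, "b": 1, "c": 0.5},
    {"a": 1, "b": 3, "c": 0.5, "x": 1},
    {"x": 3, "t": 0.01},
]
q = {"a": 2, "b": 1}
print([round(score(q, s), 2) for s in dataset])  # [0.5, 0.5, 0.33, 0.0]
```

For a large dataset, an inverted index from parameter name to the sets containing it would avoid scanning sets that share nothing with q.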
Any tips on what to search (google) for here, any well-suited algorithms on something like this?

Related

equal value = equal rank

I would like to rank the elements of a list such that elements that have the same value also get the same rank:
list = {1, 2, 3, 4, 4, 5}
desired output:
ranks = {5, 4, 3, 2, 2, 1}
Ordering[] does almost what I want but assigns different ranks to the two instances of 4 in the list.
I am not sure that I cover everything you have in mind, but the following code will give the desired output. It assumes that the largest value gets rank 1 (so the smallest value gets the largest rank number), and it should work with numerical values, or with anything else as long as you are OK with the standard sorting order of Mathematica. The local variable dv is short for "distinct values".
FromListToRanks[k_List]:= Module[ {dv=Reverse[Union[k]]},
k /. Thread[dv -> Range[Length[dv]]] ]
FromListToRanks[list]
{5,4,3,2,2,1}
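For comparison, the same idea (rank the distinct values, then look each element up) can be sketched in Python. The function name ranks_descending is made up for this example.

```python
def ranks_descending(values):
    """Give equal values equal ranks; the largest value gets rank 1."""
    distinct = sorted(set(values), reverse=True)
    rank_of = {v: i + 1 for i, v in enumerate(distinct)}
    return [rank_of[v] for v in values]


print(ranks_descending([1, 2, 3, 4, 4, 5]))  # [5, 4, 3, 2, 2, 1]
```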

Efficient data structure for a list of index sets

I am trying to explain by example:
Imagine a list of numbered elements E = [elem0, elem1, elem2, ...].
One index set could now be {42, 66, 128}, referring to elements in E. The ordering in this set is not important, so {42, 66, 128} == {66, 128, 42}, but each element occurs at most once in any given index set (so it is an actual set).
What I want now is a space efficient data structure that gives me another ordered list M that contains index sets that refer to elements in E. Each index set in M will only occur once (so M is a set in this regard) but M must be indexable itself (so M is a List in this sense, whereby the precise index is not important). If necessary, index sets can be forced to all contain the same number of elements.
For example, M could look like:
0: {42, 66, 128}
1: {42, 66, 9999}
2: {1, 66, 9999}
I could now do the following:
for(i in M[2]) { element = E[i]; /* do something with E[1],E[66],and E[9999] */ }
You probably see where this is going: You may now have another map M2 that is an ordered list of sets pointing into M which ultimately point to elements in E.
As you can see in this example, index sets can be relatively similar (M[0] and M[1] share the first two entries, M[1] and M[2] share the last two) which makes me think that there must be something more efficient than the naive way of using an array-of-sets. However, I may not be able to come up with a good global ordering of index entries that guarantee good "sharing".
I could think of anything ranging from representing M as a tree (where M's index comes from the depth-first search ordering or something) to hash maps of union-find structures (no idea how that would work though:)
Pointers to any textbook datastructure for something like this are highly welcome (is there anything in the world of databases?) but I also appreciate if you propose a "self-made" solution or only random ideas.
Space efficiency is important for me because E may contain thousands or even few million elements, (some) index sets are potentially large, similarities between at least some index sets should be substantial, and there may be multiple layers of mappings.
Thanks a ton!
You may combine all numbers from M and remove duplicates and name it as UniqueM.
Convert every M[X] collection to a bit mask over UniqueM. For example, a single int can represent 32 elements and a long 64. To support an unlimited count, store an array of ints: an array of 10 ints can represent 320 different elements.
E: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}
M[0]: {6, 8, 1}
M[1]: {2, 8, 1}
M[2]: {6, 8, 5}
Will be converted to (bits are written in UniqueM order, least significant bit first):
UniqueM: {6, 8, 1, 2, 5}
M[0]: 11100 {this is 7}
M[1]: 01110 {this is 14}
M[2]: 11001 {this is 19}
Note:
Also, you may combine my approach with ring0's: instead of rearranging E, build a new UniqueM and use intervals inside it.
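A minimal Python sketch of the bit mask conversion follows. It relies on Python integers being arbitrary precision, so no array of ints is needed here. The names build_masks and decode are made up for the example, and the index sets are given as lists so that UniqueM comes out in the same order as above.

```python
def build_masks(index_sets):
    """Return (unique, masks): bit i of a mask is set when unique[i] is in the set."""
    unique = []
    position = {}
    for s in index_sets:
        for x in s:
            if x not in position:
                position[x] = len(unique)
                unique.append(x)
    masks = []
    for s in index_sets:
        mask = 0
        for x in s:
            mask |= 1 << position[x]
        masks.append(mask)
    return unique, masks


def decode(mask, unique):
    """Recover the members of a set from its mask."""
    return [x for i, x in enumerate(unique) if (mask >> i) & 1]


unique, masks = build_masks([[6, 8, 1], [2, 8, 1], [6, 8, 5]])
print(unique)  # [6, 8, 1, 2, 5]
print(masks)   # [7, 14, 19]
```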
It will be pretty hard to beat an index. You could save some space by using the right data type (eg in gnu C, short if less than 64k elements in E, int if < 4G...).
Besides,
Since you say the order in E is not important, you could sort E in a way that maximizes runs of consecutive elements, so that it matches the Ms as closely as possible.
For instance,
E: { 1,2,3,4,5,6,7,8 }
0: {1,3,5,7}
1: {1,3,5,8}
2: {3,5,7,8}
By re-arranging E
E: { 1,3,5,7,8,2,4,6 }
and using E indexes, not values, you could define the Ms based on subsets of E, giving indexes
0: {0-3} // E[0]: 1, E[1]: 3, E[2]: 5, E[3]: 7 etc...
1: {0-2,4}
2: {1-3,4}
this way
you use indexes instead of the raw numbers (indexes are usually smaller, no negative..)
the Ms are made of sub-sets, 0-3 meaning 0,1,2,3,
The difficult part is the algorithm that rearranges E so that the sub-range sizes are maximized and the Ms sizes are minimized.
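Once E has been rearranged, turning an index set into sub-ranges is straightforward. The Python sketch below assumes the rearranged E from the example and a hypothetical helper to_ranges; it does not attempt the hard part (choosing the order of E).

```python
def to_ranges(indices):
    """Collapse a collection of indices into inclusive (start, end) ranges."""
    ranges = []
    for i in sorted(indices):
        if ranges and i == ranges[-1][1] + 1:
            ranges[-1][1] = i  # extend the current run
        else:
            ranges.append([i, i])  # start a new run
    return [tuple(r) for r in ranges]


E = [1, 3, 5, 7, 8, 2, 4, 6]  # rearranged E from the example
index_of = {value: i for i, value in enumerate(E)}

for m in ([1, 3, 5, 7], [1, 3, 5, 8], [3, 5, 7, 8]):
    print(to_ranges(index_of[v] for v in m))
# [(0, 3)]
# [(0, 2), (4, 4)]
# [(1, 4)]
```

Note that the third set collapses to the single range 1-4, which is the same set as {1-3,4}.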
E rearrangement algo suggestion
sort all Ms
process all Ms:
algo to build a map, which gives for an element 'x' its list of neighbors 'y', along with points 'z', the number of times 'y' is just after 'x'
Map map (x,y) -> z
for m in Ms
    for e,f in m // e and f are consecutive elements
        if ( ! map(e,f)) map(e,f) = 1
        else map(e,f)++
    rof
rof
Get E rearranged
ER = {} // E rearranged
Map mas = sort_map(map) // mas(x) -> list(y) where 'y' are sorted desc based on 'z'
e = get_min_elem(mas) // init with lowest element (regardless its 'z' scores)
while (mas has elements)
    ER += e // add element e to ER
    if (empty(mas(e)))
        e = get_min_elem(mas) // Get next lowest remaining value
    else
        f = mas(e)[0] // get most likely neighbor of e (in f), ie first in the list
        delete mas(e)[0] // set next e neighbour in line
        e = f
    fi
elihw
The algo (map) should be O(n*m) space, with n elements in E, m elements in all Ms.
Bit arrays may be used. A bit array has elements a[i] that are 1 if i is in the set and 0 if it is not. So every set would occupy exactly size(E) bits, even if it contains few or no members. That is not space efficient, but if you compress the array with some compression algorithm it will be much smaller (possibly approaching the entropy limit). So you can try a dynamic Markov coder, RLE, or group Huffman and choose whichever is most efficient for you. Iteration could then use on-the-fly decompression followed by a linear scan for 1 bits. For long runs of 0s you could modify the decompression algorithm to detect and skip such runs (RLE is the simplest case for this).
If you find sets with only a small difference, you may store A and A xor B instead of A and B, saving space for the common parts. In this case, to iterate over B you'll have to unpack both A and A xor B and then xor them.
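The xor idea can be illustrated in Python, using arbitrary-precision integers as uncompressed bit arrays. This is a sketch only: compression of the stored values is left out, and the helper names to_bits and from_bits are made up for the example.

```python
def to_bits(index_set):
    """Build a bit array (as an int) with bit i set for every i in the set."""
    bits = 0
    for i in index_set:
        bits |= 1 << i
    return bits


def from_bits(bits):
    """List the positions of the 1 bits, lowest first."""
    members = []
    i = 0
    while bits:
        if bits & 1:
            members.append(i)
        bits >>= 1
        i += 1
    return members


a = to_bits({42, 66, 128})
b = to_bits({42, 66, 9999})
delta = a ^ b                # store a and delta instead of a and b

print(from_bits(delta))      # [128, 9999]  (only the differing members)
print(from_bits(a ^ delta))  # [42, 66, 9999]  (b is recovered)
```

The delta has few 1 bits when the sets are similar, so it should compress better than b itself.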
Another useful solution:
E: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}
M[0]: {1, 2, 3, 4, 5, 10, 14, 15}
M[1]: {1, 2, 3, 4, 5, 11, 14, 15}
M[2]: {1, 2, 3, 4, 5, 12, 13}
Cache frequently used items:
Cache[1] = {1, 2, 3, 4, 5}
Cache[2] = {14, 15}
Cache[3] = {-2, 7, 8, 9} //Not used just example.
M[0]: {-1, 10, -2}
M[1]: {-1, 11, -2}
M[2]: {-1, 12, 13}
Mark links to cached list as negative numbers.
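Reading a set back then means expanding the negative entries. A small Python sketch, assuming cache ids start at 1 and that cached lists may themselves contain links (as Cache[3] does); the name expand is hypothetical.

```python
def expand(entry, cache):
    """Replace every negative link -k in entry by the contents of cache[k]."""
    out = []
    for x in entry:
        if x < 0:
            out.extend(expand(cache[-x], cache))  # follow the link recursively
        else:
            out.append(x)
    return out


cache = {1: [1, 2, 3, 4, 5], 2: [14, 15], 3: [-2, 7, 8, 9]}

print(expand([-1, 10, -2], cache))  # [1, 2, 3, 4, 5, 10, 14, 15]
print(expand([-1, 12, 13], cache))  # [1, 2, 3, 4, 5, 12, 13]
```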

generate all combinations for list with repeated items

Related to this question, I am looking for algorithms (and actual code in java/c/c++/python/etc., if you have any!) to generate all combinations of r elements from a list with m elements in total. Some of these m elements may be repeated.
Thanks!
recurse for each element type
int recurseMe(list<list<item>> items, int r, list<item> container)
{
    if (r == container.length)
    {
        //print out your collection;
        return 1;
    }
    else if (container.length > r || items.isEmpty)
    {
        //too many items chosen, or no item types left to choose from
        return 0;
    }
    list<item> firstType = items[0];
    int score = 0;
    //take 0, 1, ..., up to all items of the first type
    for(int i = 0; i <= firstType.length; i++)
    {
        score += recurseMe(items without items[0], r, container + i items from firstType);
    }
    return score;
}
This takes as input a list containing lists of items, assuming each inner list represents a unique type of item. You may have to build a sorting function to feed as input to this.
//start with a list<item> original;
list<list<item>> grouped = new list<list<item>>();
list<item> sorted = original.sort();//use whichever method for this
list<item> temp = null;
item current = null;
for(int x = 0; x < sorted.length; x++)
{
    if (temp != null && sorted[x] == current)
    {
        temp.add(current);
    }
    else
    {
        if (temp != null && temp.isNotEmpty)
            grouped.add(temp);
        temp = new list<item>();
        current = sorted[x];//remember the value of the group being built
        temp.add(current);
    }
}
if (temp != null && temp.isNotEmpty)
    grouped.add(temp);
//grouped is the result
This sorts the list, then creates sublists containing elements that are the same, inserting them into the list of lists grouped
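The two pieces above (group equal items, then recurse over the groups) can be combined into a short Python sketch. It is an illustration of the same idea rather than a translation of the code above; multiset_combinations is a made-up name, and collections.Counter does the grouping.

```python
from collections import Counter


def multiset_combinations(items, r):
    """Yield each distinct combination of r items exactly once."""
    groups = sorted(Counter(items).items())  # [(value, count), ...]

    def rec(idx, remaining, chosen):
        if remaining == 0:
            yield tuple(chosen)
            return
        if idx == len(groups):
            return  # ran out of item types before reaching r items
        value, count = groups[idx]
        # take as many copies of this value as allowed, down to none
        for take in range(min(count, remaining), -1, -1):
            yield from rec(idx + 1, remaining - take, chosen + [value] * take)

    yield from rec(0, r, [])


print(list(multiset_combinations("aabb", 3)))
# [('a', 'a', 'b'), ('a', 'b', 'b')]
```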
Here is a recursion that I believe is closely related to Jean-Bernard Pellerin's algorithm, in Mathematica.
This takes input as the number of each type of element. The output is in similar form. For example:
{a,a,b,b,c,d,d,d,d} -> {2,2,1,4}
Function:
f[k_, {}, c__] := If[+c == k, {{c}}, {}]
f[k_, {x_, r___}, c___] := Join @@ (f[k, {r}, c, #] & /@ 0~Range~Min[x, k - +c])
Use:
f[4, {2, 2, 1, 4}]
{{0, 0, 0, 4}, {0, 0, 1, 3}, {0, 1, 0, 3}, {0, 1, 1, 2}, {0, 2, 0, 2},
{0, 2, 1, 1}, {1, 0, 0, 3}, {1, 0, 1, 2}, {1, 1, 0, 2}, {1, 1, 1, 1},
{1, 2, 0, 1}, {1, 2, 1, 0}, {2, 0, 0, 2}, {2, 0, 1, 1}, {2, 1, 0, 1},
{2, 1, 1, 0}, {2, 2, 0, 0}}
An explanation of this code was requested. It is a recursive function that takes a variable number of arguments. The first argument is k, length of subset. The second is a list of counts of each type to select from. The third argument and beyond is used internally by the function to hold the subset (combination) as it is constructed.
This definition therefore is used when there are no more items in the selection set:
f[k_, {}, c__] := If[+c == k, {{c}}, {}]
If the total of the values of the combination (its length) is equal to k, then return that combination, otherwise return an empty set. (+c is shorthand for Plus[c])
Otherwise:
f[k_, {x_, r___}, c___] := Join @@ (f[k, {r}, c, #] & /@ 0~Range~Min[x, k - +c])
Reading left to right:
Join is used to flatten out a level of nested lists, so that the result is not an increasingly deep tensor.
f[k, {r}, c, #] & calls the function, dropping the first position of the selection set (x), and adding a new element to the combination (#).
/@ 0 ~Range~ Min[x, k - +c]) maps over each integer between zero and the lesser of the first element of the selection set and k less the total of the combination, which is the maximum that can be selected without exceeding combination size k.
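For readers who do not use Mathematica, the same recursion over counts can be sketched in Python. The function name count_vectors is made up; the chosen parameter plays the role of the trailing arguments c.

```python
def count_vectors(k, counts, chosen=()):
    """Return all ways to pick k items given how many of each type exist.

    Each result lists how many items of each type were picked.
    """
    if not counts:
        # no types left: keep the combination only if it has exactly k items
        return [list(chosen)] if sum(chosen) == k else []
    first, rest = counts[0], counts[1:]
    results = []
    # take 0 .. min(first, room left) items of the first type
    for take in range(0, min(first, k - sum(chosen)) + 1):
        results.extend(count_vectors(k, rest, chosen + (take,)))
    return results


vectors = count_vectors(4, [2, 2, 1, 4])
print(len(vectors))  # 17
print(vectors[0])    # [0, 0, 0, 4]
print(vectors[-1])   # [2, 2, 0, 0]
```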
I'm going to make this an answer rather than a bunch of comments.
My original comment was:
"The CombinationGenerator Java class systematically generates all
combinations of n elements, taken r at a time. The algorithm is
described by Kenneth H. Rosen, Discrete Mathematics and Its
Applications, 2nd edition (NY: McGraw-Hill, 1991), pp. 284-286." See
merriampark.com/comb.htm. It has a link to source code.
As you pointed out in your comment, you want unique combinations. So, given the array ["a", "a", "b", "b"], you want it to generate aab, abb. The code I linked generates aab, aab, abb, abb.
With that array, removing duplicates is very easy. Depending on how you implement it, you either let it generate the duplicates and then filter them after the fact (i.e. selecting unique elements from an array), or you modify the code to include a hash table so that when it generates a combination, it checks the hash table before putting the item into the output array.
Looking something up in a hash table is an O(1) operation, so either of those is going to be efficient. Doing it after the fact will be a little bit more expensive, because you'll have to copy items. Still, you're talking O(n), where n is the number of combinations generated.
There is one complication: order is irrelevant. That is, given the array ["a", "b", "a", "b"], the code will generate aba, abb, aab, bab. In this case, aba and aab are duplicate combinations, as are abb and bab, and using a hash table isn't going to remove those duplicates for you. You could, though, create a bit mask for each combination, and use the hash table idea with the bit masks. This would be slightly more complicated, but not terribly so.
If you sort the initial array first, so that duplicate items are adjacent, then the problem goes away and you can use the hash table idea.
There's undoubtedly a way to modify the code to prevent it from generating duplicates. I can see a possible approach, but it would be messy and expensive. It would probably make the algorithm slower than if you just used the hash table idea. The approach I would take:
Sort the input array
Use the linked code to generate the combinations
Use a hash table or some other code to select unique items.
Although ... a thought that occurred to me.
Is it true that if you sort the input array, then any generated duplicates will be adjacent? That is, given the input array ["a", "a", "b", "b"], then the generated output will be aab, aab, abb, abb, in that order. This will be implementation dependent, of course. But if it's true in your implementation, then modifying the algorithm to remove duplicates is a simple matter of checking to see if the current combination is equal to the previous one.
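The sort-then-filter approach can be sketched in Python, with itertools.combinations standing in for the linked CombinationGenerator (an assumption; the linked Java code is not reproduced here). The sketch keeps the hash-based filter rather than the adjacent-duplicate check, because with this particular generator duplicates are not always adjacent: the sorted input a, a, b, c with r = 2 yields aa, ab, ac, ab, ac, bc.

```python
from itertools import combinations


def unique_combinations(items, r):
    """Distinct r-combinations of items, in order of first appearance."""
    seen = set()
    result = []
    # sorting makes equal combinations come out as equal tuples
    for combo in combinations(sorted(items), r):
        if combo not in seen:
            seen.add(combo)
            result.append(combo)
    return result


print(unique_combinations(["a", "b", "a", "b"], 3))
# [('a', 'a', 'b'), ('a', 'b', 'b')]
```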

A fast implementation in Mathematica for Position2D

I'm looking for a fast implementation for the following, I'll call it Position2D for lack of a better term:
Position2D[ matrix, sub_matrix ]
which finds the locations of sub_matrix inside matrix and returns the upper left and lower right row/column of a match.
For example, this:
Position2D[{
{0, 1, 2, 3},
{1, 2, 3, 4},
{2, 3, 4, 5},
{3, 4, 5, 6}
}, {
{2, 3},
{3, 4}
}]
should return this:
{
{{1, 3}, {2, 4}},
{{2, 2}, {3, 3}},
{{3, 1}, {4, 2}}
}
It should be fast enough to work quickly on 3000x2000 matrices with 100x100 sub-matrices. For simplicity, it is enough to only consider integer matrices.
Algorithm
The following code is based on an efficient custom position function to find positions of (possibly overlapping) integer sequences in a large integer list. The main idea is that we can first try to efficiently find the positions where the first row of the sub-matrix is in the Flatten-ed large matrix, and then filter those, extracting full sub-matrices and comparing to the sub-matrix of interest. This will be efficient for most cases except very pathological ones (those for which this procedure would generate a huge number of potential position candidates, while the true number of entries of the sub-matrix would be much smaller. But such cases seem rather unlikely generally, and then, further improvements to this simple scheme can be made).
For large matrices, the proposed solution will be about 15-25 times faster than the solution of @Szabolcs when a compiled version of the sequence positions function is used, and 3-5 times faster for the top-level implementation of the sequence positions-finding function. The actual speedup depends on matrix sizes; it is larger for larger matrices. The code and benchmarks are below.
Code
A generally efficient function for finding positions of a sub-list (sequence)
These helper functions are due to Norbert Pozar and taken from this Mathgroup thread. They are used to efficiently find starting positions of an integer sequence in a larger list (see the mentioned post for details).
Clear[seqPos];
fdz[v_] := Rest@DeleteDuplicates@Prepend[v, 0];
seqPos[list_List, seq_List] :=
Fold[
fdz[#1 (1 - Unitize[list[[#1]] - #2])] + 1 &,
fdz[Range[Length[list] - Length[seq] + 1] *
(1 - Unitize[list[[;; -Length[seq]]] - seq[[1]]])] + 1,
Rest@seq
] - Length[seq];
Example of use:
In[71]:= seqPos[{1,2,3,2,3,2,3,4},{2,3,2}]
Out[71]= {2,4}
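To make the contract of seqPos concrete, here is a naive Python equivalent. It is not the vectorized algorithm above, only a straightforward scan that returns the same 1-based starting positions, including overlapping matches. The name seq_positions is made up.

```python
def seq_positions(lst, seq):
    """1-based start positions of every (possibly overlapping) occurrence of seq."""
    n, k = len(lst), len(seq)
    return [i + 1 for i in range(n - k + 1) if lst[i:i + k] == seq]


print(seq_positions([1, 2, 3, 2, 3, 2, 3, 4], [2, 3, 2]))  # [2, 4]
```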
A faster position-finding function for integers
However fast seqPos might be, it is still the major bottleneck in my solution. Here is a compiled-to-C version of this, which gives another 5x performance boost to my code:
seqposC =
Compile[{{list, _Integer, 1}, {seq, _Integer, 1}},
Module[{i = 1, j = 1, res = Table[0, {Length[list]}], ctr = 0},
For[i = 1, i <= Length[list], i++,
If[list[[i]] == seq[[1]],
While[j < Length[seq] && i + j <= Length[list] &&
list[[i + j]] == seq[[j + 1]],
j++
];
If[j == Length[seq], res[[++ctr]] = i];
j = 1;
]
];
Take[res, ctr]
], CompilationTarget -> "C", RuntimeOptions -> "Speed"]
Example of use:
In[72]:= seqposC[{1, 2, 3, 2, 3, 2, 3, 4}, {2, 3, 2}]
Out[72]= {2, 4}
The benchmarks below have been redone with this function (also the code for main function is slightly modified )
Main function
This is the main function. It finds positions of the first row in a matrix, and then filters them, extracting the sub-matrices at these positions and testing against the full sub-matrix of interest:
Clear[Position2D];
Position2D[m_, what_,seqposF_:Automatic] :=
Module[{posFlat, pos2D,sp = If[seqposF === Automatic,seqposC,seqposF]},
With[{dm = Dimensions[m], dwr = Reverse@Dimensions[what]},
posFlat = sp[Flatten@m, First@what];
pos2D =
Pick[Transpose[#], Total[Clip[Reverse@dm - # - dwr + 2, {0, 1}]],2] &@
{Mod[posFlat, #, 1], IntegerPart[posFlat/#] + 1} &@Last[dm];
Transpose[{#, Transpose[Transpose[#] + dwr - 1]}] &@
Select[pos2D,
m[[Last@# ;; Last@# + Last@dwr - 1,
First@# ;; First@# + First@dwr - 1]] == what &
]
]
];
For integer lists, the faster compiled subsequence position-finding function seqposC can be used (this is a default). For generic lists, one can supply e.g. seqPos, as a third argument.
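The overall strategy (match the first row, then verify the full sub-matrix) can be sketched in plain Python. This is a simplified illustration, not a port: it scans rows directly instead of flattening, so out-of-bounds candidates never arise, and it returns 1-based {row, column} corners as in the question rather than the reversed convention used in this answer. The name position_2d is made up.

```python
def position_2d(matrix, sub):
    """Find sub inside matrix; return ((top, left), (bottom, right)), 1-based."""
    rows, cols = len(matrix), len(matrix[0])
    sub_rows, sub_cols = len(sub), len(sub[0])
    matches = []
    for r in range(rows - sub_rows + 1):
        for c in range(cols - sub_cols + 1):
            # cheap test first: does the first row of sub match here?
            if matrix[r][c:c + sub_cols] != sub[0]:
                continue
            # then verify the remaining rows
            if all(matrix[r + i][c:c + sub_cols] == sub[i]
                   for i in range(1, sub_rows)):
                matches.append(((r + 1, c + 1), (r + sub_rows, c + sub_cols)))
    return matches


m = [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]]
print(position_2d(m, [[2, 3], [3, 4]]))
# [((1, 3), (2, 4)), ((2, 2), (3, 3)), ((3, 1), (4, 2))]
```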
How it works
We will use a simple example to dissect the code and explain its inner workings. This defines our test matrix and sub-matrix:
m = {{0, 1, 2, 3}, {1, 2, 3, 4}, {2, 3, 4, 5}};
what = {{2, 3}, {3, 4}};
This computes the dimensions of the above (it is more convenient to work with reversed dimensions for a sub-matrix):
In[78]:=
dm=Dimensions[m]
dwr=Reverse@Dimensions[what]
Out[78]= {3,4}
Out[79]= {2,2}
This finds a list of starting positions of the first row ({2,3} here) in the Flattened main matrix. These positions are at the same time "flat" candidate positions of the top left corner of the sub-matrix:
In[77]:= posFlat = seqPos[Flatten@m, First@what]
Out[77]= {3, 6, 9}
This will reconstruct the 2D "candidate" positions of the top left corner of a sub-matrix in a full matrix, using the dimensions of the main matrix:
In[83]:= posInterm = Transpose@{Mod[posFlat,#,1],IntegerPart[posFlat/#]+1}&@Last[dm]
Out[83]= {{3,1},{2,2},{1,3}}
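The flat-to-2D conversion can be written in Python with divmod. This sketch uses the standard (p - 1) div n form for the row; it agrees with the expression above except for flat positions in the last column, where IntegerPart[p/n] + 1 gives a row one larger (such candidates are discarded by the bounds filter below whenever the sub-matrix is at least two columns wide). The name flat_to_2d is made up.

```python
def flat_to_2d(pos_flat, n_cols):
    """Convert a 1-based flat position to a 1-based (x, y) = (column, row) pair."""
    row, col = divmod(pos_flat - 1, n_cols)
    return (col + 1, row + 1)


print([flat_to_2d(p, 4) for p in (3, 6, 9)])  # [(3, 1), (2, 2), (1, 3)]
```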
We can then try using Select to filter them out, extracting the full sub-matrix and comparing to what, but we'll run into a problem here:
In[84]:=
Select[posInterm,
m[[Last@#;;Last@#+Last@dwr-1,First@#;;First@#+First@dwr-1]]==what&]
During evaluation of In[84]:= Part::take: Cannot take positions 3 through 4
in {{0,1,2,3},{1,2,3,4},{2,3,4,5}}. >>
Out[84]= {{3,1},{2,2}}
Apart from the error message, the result is correct. The error message itself is due to the fact that for the last position ({1,3}) in the list, the bottom-right corner of the sub-matrix will be outside the main matrix. We could of course use Quiet to simply ignore the error messages, but that is bad style. So, we will first filter those cases out, and this is what the line Pick[Transpose[#], Total[Clip[Reverse@dm - # - dwr + 2, {0, 1}]], 2] &@ is for. Specifically, consider
In[90]:=
Reverse@dm - # - dwr + 2 &@{Mod[posFlat, #, 1],IntegerPart[posFlat/#] + 1} &@Last[dm]
Out[90]= {{1,2,3},{2,1,0}}
The coordinates of the top-left corners should stay within the difference of the dimensions of the matrix and the sub-matrix. The above sub-lists are made from the x and y coordinates of the top-left corners. I added 2 to make all valid results strictly positive. We have to pick only the coordinates at those positions in Transpose@{Mod[posFlat, #, 1], IntegerPart[posFlat/#] + 1} &@Last[dm] (which is posInterm) at which both sub-lists above have strictly positive numbers. I used Total[Clip[...,{0,1}]] to recast this into picking only those positions at which the total is 2 (Clip converts all positive integers to 1, and Total sums the numbers in the 2 sub-lists; the only way to get 2 is when the numbers in both sub-lists are positive).
So, we have:
In[92]:=
pos2D=Pick[Transpose[#],Total[Clip[Reverse@dm-#-dwr+2,{0,1}]],2]&@
{Mod[posFlat,#,1],IntegerPart[posFlat/#]+1}&@Last[dm]
Out[92]= {{3,1},{2,2}}
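The same structural filter, written as an explicit bounds check in Python. This is a sketch of the test that the Pick/Clip/Total expression encodes, using the (x, y) convention of this answer; in_bounds is a made-up name.

```python
def in_bounds(top_left, matrix_dims, sub_dims):
    """True if a sub-matrix placed at top_left = (x, y) fits inside the matrix.

    matrix_dims and sub_dims are (rows, columns); x is the column, y the row.
    """
    x, y = top_left
    rows, cols = matrix_dims
    sub_rows, sub_cols = sub_dims
    return x + sub_cols - 1 <= cols and y + sub_rows - 1 <= rows


candidates = [(3, 1), (2, 2), (1, 3)]
print([p for p in candidates if in_bounds(p, (3, 4), (2, 2))])  # [(3, 1), (2, 2)]
```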
After the list of 2D positions has been filtered, so that no structurally invalid positions are present, we can use Select to extract the full sub-matrices and test against the sub-matrix of interest:
In[93]:=
finalPos =
Select[pos2D,m[[Last@#;;Last@#+Last@dwr-1,First@#;;First@#+First@dwr-1]]==what&]
Out[93]= {{3,1},{2,2}}
In this case, both positions are genuine. The final thing to do is to reconstruct the positions of the bottom - right corners of the submatrix and add them to the top-left corner positions. This is done by this line:
In[94]:= Transpose[{#,Transpose[Transpose[#]+dwr-1]}]&@finalPos
Out[94]= {{{3,1},{4,2}},{{2,2},{3,3}}}
I could have used Map, but for a large list of positions, the above code would be more efficient.
Example and benchmarks
The original example:
In[216]:= Position2D[{{0,1,2,3},{1,2,3,4},{2,3,4,5},{3,4,5,6}},{{2,3},{3,4}}]
Out[216]= {{{3,1},{4,2}},{{2,2},{3,3}},{{1,3},{2,4}}}
Note that my index conventions are reversed w.r.t. @Szabolcs' solution.
Benchmarks for large matrices and sub-matrices
Here is a power test:
nmat = 1000;
(* generate a large random matrix and a sub-matrix *)
largeTestMat = RandomInteger[100, {2000, 3000}];
what = RandomInteger[10, {100, 100}];
(* generate upper left random positions where to insert the submatrix *)
rposx = RandomInteger[{1,Last@Dimensions[largeTestMat] - Last@Dimensions[what] + 1}, nmat];
rposy = RandomInteger[{1,First@Dimensions[largeTestMat] - First@Dimensions[what] + 1},nmat];
(* insert the submatrix nmat times *)
With[{dwr = Reverse@Dimensions[what]},
Do[largeTestMat[[Last@p ;; Last@p + Last@dwr - 1,
First@p ;; First@p + First@dwr - 1]] = what,
{p,Transpose[{rposx, rposy}]}]]
Now, we test:
In[358]:= (ps1 = position2D[largeTestMat,what])//Short//Timing
Out[358]= {1.39,{{{1,2461},{100,2560}},<<151>>,{{1900,42},{1999,141}}}}
In[359]:= (ps2 = Position2D[largeTestMat,what])//Short//Timing
Out[359]= {0.062,{{{2461,1},{2560,100}},<<151>>,{{42,1900},{141,1999}}}}
(the actual number of sub-matrices is smaller than the number we try to generate, since many of them overlap and "destroy" the previously inserted ones - this is so because the sub-matrix size is a sizable fraction of the matrix size in our benchmark).
To compare, we should reverse the x-y indices in one of the solutions (level 3), and sort both lists, since positions may have been obtained in different order:
In[360]:= Sort@ps1===Sort[Reverse[ps2,{3}]]
Out[360]= True
I do not exclude a possibility that further optimizations are possible.
This is my implementation:
position2D[m_, k_] :=
Module[{di, dj, extractSubmatrix, pos},
{di, dj} = Dimensions[k] - 1;
extractSubmatrix[{i_, j_}] := m[[i ;; i + di, j ;; j + dj]];
pos = Position[ListCorrelate[k, m], ListCorrelate[k, k][[1, 1]]];
pos = Select[pos, extractSubmatrix[#] == k &];
{#, # + {di, dj}} & /@ pos
]
It uses ListCorrelate to get a list of potential positions, then filters those that actually match. It's probably faster on packed real matrices.
Following Leonid's suggestion, here's my solution. I know it isn't very efficient (it's about 600 times slower than Leonid's when I timed it), but it's very short, memorable, and a nice illustration of a rarely used function, PartitionMap. It's from the Developer package, so it needs a Needs["Developer`"] call first.
Given that, Position2D can be defined as:
Position2D[m_, k_] := Position[PartitionMap[k == # &, m, Dimensions[k], {1, 1}], True]
This only gives the upper-left coordinates. I feel the lower-right coordinates are really redundant, since the dimensions of the sub-matrix are known, but if the need arises one can add those to the output by prepending {#, Dimensions[k] + # - {1, 1}} & /@ to the above definition.
How about something like
Position2D[bigMat_?MatrixQ, smallMat_?MatrixQ] :=
Module[{pos, sdim = Dimensions[smallMat] - 1},
pos = Position[bigMat, smallMat[[1, 1]]];
Quiet[Select[pos, (MatchQ[
bigMat[[Sequence@@Thread[Span[#, # + sdim]]]], smallMat] &)],
Part::take]]
which will return the top left-hand positions of the submatrices.
Example:
Position2D[{{0, 1, 2, 3}, {1, 2, 3, 4}, {2, 3, 4, 5}, {3, 5, 5, 6}},
{{2, 3}, {3, _}}]
(* Returns: {{1, 3}, {2, 2}, {3, 1}} *)
And to search a 1000x1000 matrix, it takes about 2 seconds on my old machine
SeedRandom[1]
big = RandomInteger[{0, 10}, {1000, 1000}];
Position2D[big, {{1, 1, _}, {1, 1, 1}}] // Timing
(* {1.88012, {{155, 91}, {295, 709}, {685, 661},
{818, 568}, {924, 45}, {981, 613}}} *)

Conditional Data Manipulation in Mathematica

I am trying to prepare the best tools for efficient Data Analysis in Mathematica.
I have approximately 300 columns and 100,000 rows.
What would be the best tricks to :
"Remove", "Extract" or simply "Consider" parts of the data structure, for plotting for e.g.
One of the trickiest examples I could think of is :
Given a data structure,
Extract Column 1 to 3, 6 to 9 as well as the last One for every lines where the value in Column 2 is equal to x and the value in column 8 is different than y
I also welcome any general advice on data manipulation.
For a generic manipulation of data in a table with named columns, I refer you to this solution of mine, for a similar question. For any particular case, it might be easier to write a function for Select manually. However, for many columns, and many different queries, chances to mess up indexes are high. Here is the modified solution from the mentioned post, which provides a more friendly syntax:
Clear[getIds];
getIds[table : {colNames_List, rows__List}] := {rows}[[All, 1]];
ClearAll[select, where];
SetAttributes[where, HoldAll];
select[cnames_List, from[table : {colNames_List, rows__List}], where[condition_]] :=
With[{colRules = Dispatch[ Thread[colNames -> Thread[Slot[Range[Length[colNames]]]]]],
indexRules = Dispatch[Thread[colNames -> Range[Length[colNames]]]]},
With[{selF = Apply[Function, Hold[condition] /. colRules]},
Select[{rows}, selF @@ # &][[All, cnames /. indexRules]]]];
What happens here is that the function used in Select gets generated automatically from your specifications. For example (using @Yoda's example):
rows = Array[#1 #2 &, {5, 15}];
We need to define the column names (must be strings or symbols without values):
In[425]:=
colnames = "c" <> ToString[#] & /@ Range[15]
Out[425]= {"c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10", "c11", "c12",
"c13", "c14", "c15"}
(in practice, usually names are more descriptive, of course). Here is the table then:
table = Prepend[rows, colnames];
Here is the select statement you need (I picked x = 4 and y=2):
select[{"c1", "c2", "c3", "c6", "c7", "c8", "c9", "c15"}, from[table],
where["c2" == 4 && "c8" != 2]]
{{2, 4, 6, 12, 14, 16, 18, 30}}
Now, for a single query, this may look like a complicated way to do this. But you can do many different queries, such as
In[468]:= select[{"c1", "c2", "c3"}, from[table], where[EvenQ["c2"] && "c10" > 10]]
Out[468]= {{2, 4, 6}, {3, 6, 9}, {4, 8, 12}, {5, 10, 15}}
and similar.
Of course, if there are specific correlations in your data, you might find a particular special-purpose algorithm which will be faster. The function above can be extended in many ways, to simplify common queries (include "all", etc), or to auto-compile the generated pure function (if possible).
EDIT
On a philosophical note, I am sure that many Mathematica users (myself included) have found themselves from time to time writing similar code again and again. The fact that Mathematica has a concise syntax often makes it very easy to write for any particular case. However, as long as one works in some specific domain (like, for example, data manipulations in a table), the cost of repeating yourself will be high for many operations. What my example illustrates in a very simple setting is one possible way out: create a Domain-Specific Language (DSL). For that, one generally needs to define a syntax/grammar for it, and write a compiler from it to Mathematica (to generate Mathematica code automatically). Now, the example above is a very primitive realization of this idea, but my point is that Mathematica is generally very well suited for DSL creation, which I think is a very powerful technique.
data = RandomInteger[{1, 20}, {40, 20}]
x = 5;
y = 8;
Select[data, (#[[2]] == x && #[[8]] != y &)][[All, {1, 2, 3, 6, 7, 8, 9, -1}]]
==> {{5, 5, 1, 4, 18, 6, 3, 5}, {10, 5, 15, 3, 15, 14, 2, 5}, {18, 5, 6, 7, 7, 19, 14, 6}}
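The same row filter and column pick, sketched in plain Python for comparison. It uses 0-based indices, so column 2 is row[1] and column 8 is row[7]; the data is the deterministic 5 x 15 product table (the Python analogue of Array[#1 #2 &, {5, 15}]) rather than random numbers, and select_rows is a made-up name.

```python
def select_rows(data, x, y):
    """Columns 1-3, 6-9 and the last one, for rows with col 2 == x and col 8 != y."""
    cols = [0, 1, 2, 5, 6, 7, 8, -1]  # 0-based column indices
    return [[row[c] for c in cols]
            for row in data
            if row[1] == x and row[7] != y]


# 5 x 15 table whose entry (i, j) is i * j
data = [[i * j for j in range(1, 16)] for i in range(1, 6)]
print(select_rows(data, 4, 2))  # [[2, 4, 6, 12, 14, 16, 18, 30]]
```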
Some useful commands to get pieces of matrices and lists are Span (;;), Drop, Take, Select, Cases and more. See tutorial/GettingAndSettingPiecesOfMatrices and guide/PartsOfMatrices.
Part ([[...]]) in combination with ;; can be quite powerful. a[[All, 1;;-1;;2]], for instance, means take all rows and all odd columns (-1 having the usual meaning of counting from the end).
Select can be used to pick elements from a list (and remember a matrix is a list of lists), based on a logical function. Its twin brother is Cases, which does selection based on a pattern. The function I used here is a 'pure' function, where # refers to the argument to which this function is applied (the elements of the list in this case). Since the elements are lists themselves (the rows of the matrix), I can refer to the columns by using the Part ([[..]]) function.
To pull out columns (or rows) you can do it by part indexing
data = Array[#1 #2 &, {5, 15}];
data[[All, Flatten@{Range@3, Range @@ {6, 9}, -1}]]
MatrixForm@%
The last line is just to view it pretty.
As Sjoerd mentioned in his comment (and in the explanation in his answer), indexing a single range can be easily done with the Span (;;) command. If you are joining multiple disjoint ranges, using Flatten to combine the separate ranges created with Range is easier than entering them by hand.
I read:
Extract Column 1 to 3, 6 to 9 as well as the last One for every lines where the value in Column 2 is equal to x and the value in column 8 is different than y
as meaning that we want:
elements 1-3 and 6-9 from each row
AND
the last element from rows wherein [[2]] == x && [[8]] != y.
This is what I hacked together:
a = RandomInteger[5, {20, 10}]; (*define the array*)
x = 4; y = 0; (*define the test values*)
Join @@ Range @@@ {1 ;; 3, 6 ;; 9}; (*define the column ranges*)
#2 == x && #8 != y & @@@ a; (*test the rows*)
Append[%%, #] & /@ % /. {True -> -1, False :> Sequence[]}; (*complete the ranges according to the test*)
MapThread[Part, {a, %}] // TableForm (*extract and display*)

Resources