Best data structure for search? - data-structures

I have a list with items where each have number of properties (A, B, C, D) which I would like to filter using template containing same attributes (A, B, C, D). When I use a template I would like to filter all items matching this template. The match is assumed if item is equal to template or is smaller subsequence of it (0 match any item).
Example data
A B C D
1 0 1 0
2 0 0 0
0 0 2 3
2 0 2 1
2 0 2 0
0 0 0 0
Example templates
[2 0 0 0] will filter {[0 0 0 0], [2 0 0 0]}
[2 0 2 0] will filter {[0 0 0 0], [2 0 0 0], [2 0 2 0]}
[2 0 2 1] will filter {[0 0 0 0], [2 0 2 1]}
[3 4 5 6] will filter {[0 0 0 0]}
[0 0 2 0] will filter {[0 0 0 0], [0 0 2 3], [2 0 2 1], [2 0 2 0]}
The problem is that number of comparisons can easily reach 300k and can get slow sometimes. What tricks or structure I could use to make things quicker? Any ideas?

Assuming 4 properties, let's place all the items into 16 buckets.
First bucket is where there are no zero-values for the properties. Selecting from here - simple lookup based on key ABCD.
Second bucket is where the property A == 0. Selecting from here is a lookup on the template with value of BCD.
Third bucket is where B == 0. Selecting from here is a lookup on the template with value of ACD.
Fourth is where A == 0 and B == 0. Selecting from here is a lookup on the template with value of CD.
....
Fifteenth is where A,B,C == 0. the lookup is on D.
Sixteenth is where A,B,C,D == 0. This can be a boolean variable ;-)
Since all of the 16 buckets are 'exact match' - you can use methods like hash tables for the search inside them.
(this proposal is based on the assumption from the example that it's 0 in the prop value that counts as 'match any' and not in the template.) - because the 2000 selected only one value in your exaample. it will obviously be incorrect if the semantics is 'any' in both places.
--
update: corollary: you can have no more than 2^Nproperties matches.
Example:
Let's suppose we have 3 properties A,B,C and the following four items:
itemX[A=1, B=0, C=1] ---> B is a wildcard, so bucketAC[11] = itemX
itemY[A=2, B=0, C=0] ---> B and C are wildcards, so bucketA[2] = itemY
itemZ[A=2, B=1, C=0] ---> C is a wildcard, so bucketAB[21] = itemZ
now, the lookup for a key 'abc' would be as follows (I also include to the right the
contents of the buckets for ease of reading, and '<<' means 'accumulate' in this context)
1.results << bucketA[a] | '2' => itemY[A=2, B=0, C=0]
2.results << bucketB[b]
3.results << bucketAB[ab] | '21' => itemZ[A=2, B=1, C=0]
4.results << bucketC[c]
5.results << bucketAC[ac] | '11' => itemX[A=1, B=0, C=1]
6.results << bucketBC[bc]
7.results << bucketABC[abc]
8.results << bucket_item_all_wildcards
So if we use template [2 0 0], we get the results from key being A=2 in bucketA only.
If we use template [2 1 0], then we get the results from key being A=2 in bucketA,
and from key being AB=21 in bucketAB - two results.
NB: Of course, the above notation for keys is rather frivolous, it merely assumes "hashtable-like access with the concatenation of the said properties being the key".
If you are allowed to have items with the same properties multiple times, then you will need to have multiple elements in some slots - and then, obviously, you can have more
than 2^Nproperties search results, nonetheless you can track the maximum number of duplicates and hence always predict the worst-case maximum number of items.
Notably, if the number of properties grows, the total possible number of buckets will quickly blow up (e.g. 32 properties would mean maximum more than 4 billion buckets),
so this idea will no longer be applicable directly, and would need further
optimizations around the bucket traversal/allocation.

What about nested hash maps? For example, an item "it" will be stored as:
map(it.A)(it.B)(it.C).(it.D) = it
So [2 0 2 0] could be searched as:
map(2).keys.(2).keys

Related

Is there any kind of matrix filter in TCL?

Given the following matrix:
1 2 3 4
1 2 3 0
1 2 0 0
1 0 0 0
1 0 0 5
Return the rows that contain searched information.
For Example:
matrixname filter {{1} {} {3} {}}
The return would be:
1 2 3 4
1 2 3 0
and
matrixname filter {{1} {} {} {4}}
The return would be:
1 2 3 4
Does something like this already exist? I am thinking almost SQL-esk type of function WHERE COL = VALUE AND ORDER BY type of thing.
I have looked around and I am not finding anything.
---------------------------------EDIT
I have come up with the following to search the given fields.
::struct::list dbJoin -inner\
-keys FoundKeyList\
1 [::struct::list dbJoin -inner\
1 [MatrixName search -nocase column 2 $ITEM1]\
1 [MatrixName search -nocase column 1 $ITEM2]]\
1 [MatrixName search -nocase column 0 $ITEM3];
This will provide a list of row numbers that match the search criteria.
then you can just use MatrixName get row row or matrixName get rect column_tl row_tl column_br row_br to get the results.
Anyone have any feedback on this?
Two options come to mind.
In tcllib, there is the struct::matrix package. It has a search command. However, that command searches for patterns on individual cells (which can be constrained to particular columns) and you would need to write a procedure to perform the multiple searches required to achieve the conjunctive match you describe.
The other option is TclRAL. This will give you a relation value (aka a table) and you can perform a restrict command to obtain the subset matching an arbitrary expression, e.g.
set m [ral::relation table {C1 int C2 int C3 int C4 int}\
{1 2 3 4} {1 2 3 0} {1 2 0 0} {1 0 0 0} {1 0 0 5}]
set filt [ral::relation restrictwith $m {$C1 == 1 && $C3 == 3}]
However, both of these options are somewhat "heavyweight" and might be justified if there are more operations you need to perform on your tabular data than you indicate in your question. If the scope of your problem is as small as your questions indicates, then simply dashing off a procedure, as the other commenters have suggested, may be your best bet.
package require struct::matrix
struct::matrix xdata
xdata add columns 4
xdata add rows 5
xdata set rect 0 0 {
{1 2 3 4}
{1 2 3 0}
{1 2 0 0}
{1 0 0 0}
{1 0 0 5}
}
foreach {c r} [join [xdata search -regexp all {1*3}]] { puts [xdata get row $r] }
# 1 2 3 4
# 1 2 3 0
foreach {c r} [join [xdata search -regexp all {1*4}]] { puts [xdata get row $r] }
# 1 2 3 4
xdata destroy
This should achieve the expected results.

boost::compute stream compaction

How to do stream compaction with boost::compute?
E.g. if you want to perform heavy operation only on certain elements in the array. First you generate mask array with ones corresponding to elements for which you want to perform operation:
mask = [0 0 0 1 1 0 1 0 1]
Then perform exclusive scan (prefix sum) of mask array to get:
scan = [0 0 0 0 1 2 2 3 3]
Then compact this array with:
if (mask[i])
inds[scan[i]] = i;
To get final array of compacted indices (inds):
[3 4 6 8]
Size of the final array is scan.last() + mask.last()
#include <boost/compute/algorithm/copy_if.hpp>
using namespace boost::compute;
detail::copy_index_if(mask.begin(), mask.end(), inds.begin(), _1 == 1, queue);

Coming with a procedure to generate an array with special properties

I have an array of size n. Each element can hold any integer as long as the following properties holds:
1) All elements are non-negative
2) sum(array[0,i+1])<i for 0<=i<n
3) sum(array)=n-1
Let's call such an array a bucket.
I need to come up with a procedure that will generate the next bucket.
We can assume the first bucket is {0,0,0...n-1}
Example: For n=5, some possible combinations are
[0 0 0 0 4]
[0 0 0 1 3]
[0 0 0 2 2]
[0 0 0 3 1]
[0 0 1 2 1]
[0 0 2 1 1]
[0 1 1 1 1]
[0 1 0 0 3]
[0 1 1 0 2]
I'm having trouble coming up with a procedure that hits all the possible combinations. Any hints/tips? (Note I want to generate the next bucket. I'm not looking to print out all possible buckets at once)
You can use a simple backtracking procedure. The idea is to keep track of the current sum and the current index i. This would allow you to express the required constrains.
n = 5
a = [0] * n
def backtrack(i, sum):
if i > 0 and sum > i-1:
return
if i == n:
if sum == n-1:
print(a)
return
for e in range(n-sum):
a[i] = e
backtrack(i + 1, sum+e)
backtrack(0, 0)
test run

Julia: find row in matrix

Using Julia, I'd like to determine if a row is located in a matrix and (if applicable) where in the matrix the row is located. For example, in Matlab this can done with ismember:
a = [1 2 3];
B = [3 1 2; 2 1 3; 1 2 3; 2 3 1]
B =
3 1 2
2 1 3
1 2 3
2 3 1
ismember(B, a, 'rows')
ans =
0
0
1
0
From this, we can see a is located in row 3 of B. Is there a similar function to accomplish this in Julia?
You can also make use of array broadcasting by simply testing for equality (.==) without the use of comprehensions:
all(B .== a, dims=2)
Which gives you:
4x1 BitMatrix:
0
0
1
0
You can then use findall on this array:
findall(all(B .== a, 2))
However, this gives you a vector of CartesianIndex objects:
1-element Vector{CartesianIndex{2}}:
CartesianIndex(3, 1)
So if you expect to find multiple rows with the value defined in a you can either:
simplify this Vector by taking only the row index from each CartesianIndex:
[cart_idx[1] for cart_idx in findall(all(B .== a, 2))]
or pass one dimensional BitMatrix to findall (as suggested by Shep Bryan in the comment):
findall(all(B .== a, dims=2)[:, 1])
Either way you get an integer vector of column indices:
1-element Vector{Int64}:
3
Another pattern is using array comprehension:
julia> Bool[ a == B[i,:] for i=1:size(B,1) ]
4-element Array{Bool,1}:
false
false
true
false
julia> Int[ a == B[i,:] for i=1:size(B,1) ]
4-element Array{Int64,1}:
0
0
1
0
how about:
matchrow(a,B) = findfirst(i->all(j->a[j] == B[i,j],1:size(B,2)),1:size(B,1))
returns 0 when no matching row, or first row number when there is one.
matchrow(a,B)
3
should be as "fast as possible" and pretty simple too.
Though Julia doesn't have a built-in function, its easy enough as a one-liner.
a = [1 2 3];
B = [3 1 2; 2 1 3; 1 2 3; 2 3 1]
ismember(mat, x, dims) = mapslices(elem -> elem == vec(x), mat, dims)
ismember(B, a, 2) # Returns booleans instead of ints

Algorithm to swap two indices in a symmetric matrix

I am trying for a day now to find an algorithm to swap two indices in a symmetric matrix so that the result is also a symmetric matrix.
Let´s say I have following matrix:
0 1 2 3
1 0 4 5
2 4 0 6
3 5 6 0
Let´s say I want to swap line 1 and line 3 (where line 0 is the first line). Just swapping results in:
0 1 2 3
3 5 6 0
2 4 0 6
1 0 4 5
But this matrix is not symmetric anymore. What I really want is following matrix as a result:
0 3 2 1
3 0 6 5
2 6 0 4
1 5 4 0
But I am not able to find a suitable algorithm. And that really cracks me up, because it looks like in easy task.
Does anybody know?
UPDATE
Phylogenesis gave a really simple answer and I feel silly that I could not think of it myself. But here is a follow-up task:
Let´s say I store this matrix as a two-dimensional array. And to save memory I do not save the redundant values and I also leave out the diagonal which has always 0 values. My array looks like that:
[ [1, 2, 3], [4, 5], [6] ]
My goal is to transform that array to:
[ [3, 2, 1], [6, 5], [4] ]
How can I swap the rows and then the columns in an efficient way using the given array?
It is simple!
As you are currently doing, swap row 1 with row 3. Then swap column 1 with column 3.

Resources