Is there a search algorithm for huge two-dimensional arrays?


This is not a real-life question; it is just theory-crafting.
I have a big array whose elements are vectors like [1,140,245,123443], all
integers or floats with low selectivity: the number of unique values is ten
times smaller than the size of the array. B*-tree indexing is not a good fit in this case.
I also tried to implement bitmap indexing, but binary operations in Ruby are not that fast.
Are there any good algorithms for searching two-dimensional arrays of fixed-size vectors?
And the main question: how do I convert a vector into a single value, where the conversion function has to be monotonic, so that I can apply range queries such as:
(v[0]<10, v[2]>100, v[3]=32, 0.67*10^-8<v[4]<1.2154241410*10^-6)
The only idea I have is to create a separate sorted index for each component of the vector, binary search each one, and then merge the results. But that is a bad idea, because in the worst case it will require O(N*N) operations.

Assuming that each "column" is roughly evenly distributed over a known range, you could keep a series of buckets for each column, and for each bucket a list of the rows whose value falls into it. The number of buckets per column can be the same or different; it's totally arbitrary. More buckets is faster, but takes slightly more memory.
My table ("b" = billion):
range: {1 to 10} {1 to 4b} {-2b to 2b}
row1: {7 3427438335 420645075}
row2: {5 3862506151 -1555396554}
row3: {1 2793453667 -1743457796}
Buckets for column 1:
bucket{1-3} : row3
bucket{4-6} : row2
bucket{7-10} : row1
Buckets for column 2:
bucket{1-2b} :
bucket{2b-4b} : row1, row2, row3
Buckets for column 3:
bucket{-2b to -1b} : row2, row3
bucket{-1b to 0} :
bucket{0 to 1b} : row1
bucket{1b to 2b} :
Then, given a series of criteria: {v[0]<=5, v[1]>3*10^9}, we pull out the buckets that match those criteria:
Column 1:
v[0]<=5 matches buckets {1-3} and {4-6}, which is rows 2 and 3.
Column 2:
v[1]>3*10^9 matches bucket {2b-4b}, which is rows 1, 2 and 3.
Column 3:
no criterion, so it matches all buckets, which is rows 1, 2 and 3.
Now we know that the row(s) we're looking for meet all three criteria, so we list all the rows that are in the buckets that matched all the criteria, in this case, rows 2 and 3. At this point, the number of rows remaining will be small even for massive amounts of data, depending on the granularity of your buckets. You simply check each of the rows that is left at this point to see if they match. In this sample we see that row 2 matches, but row 3 doesn't.
This algorithm is technically O(n), but in practice, if you have large numbers of small buckets, this algorithm can be very fast.
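A minimal Python sketch of the bucket idea described above, assuming equal-width buckets over each column's known range and inclusive bounds on every criterion. The names bucket_of, build_index and query are made up for illustration, and the sample query (first column at most 5, second column at least 3*10^9) is only an example.

```python
from collections import defaultdict

def bucket_of(value, lo, hi, n_buckets):
    """Bucket number for a value, clamped into the column's known range [lo, hi]."""
    clamped = min(max(value, lo), hi)
    width = (hi - lo) / n_buckets
    return min(int((clamped - lo) / width), n_buckets - 1)

def build_index(rows, ranges, n_buckets):
    """One dict per column: bucket number -> set of row ids in that bucket."""
    index = []
    for col, (lo, hi) in enumerate(ranges):
        buckets = defaultdict(set)
        for row_id, row in enumerate(rows):
            buckets[bucket_of(row[col], lo, hi, n_buckets)].add(row_id)
        index.append(buckets)
    return index

def query(rows, ranges, index, n_buckets, criteria):
    """criteria maps column -> (low, high), both bounds inclusive (an assumption)."""
    candidates = set(range(len(rows)))
    for col, (low, high) in criteria.items():
        lo, hi = ranges[col]
        first = bucket_of(low, lo, hi, n_buckets)
        last = bucket_of(high, lo, hi, n_buckets)
        hits = set()
        for b in range(first, last + 1):
            hits |= index[col].get(b, set())
        candidates &= hits
    # Buckets only narrow the search; verify each surviving row exactly.
    return sorted(
        r for r in candidates
        if all(low <= rows[r][col] <= high for col, (low, high) in criteria.items())
    )

INF = float("inf")
rows = [
    (7, 3427438335, 420645075),
    (5, 3862506151, -1555396554),
    (1, 2793453667, -1743457796),
]
ranges = [(1, 10), (1, 4_000_000_000), (-2_000_000_000, 2_000_000_000)]
index = build_index(rows, ranges, n_buckets=4)
result = query(rows, ranges, index, 4, {0: (-INF, 5), 1: (3 * 10**9, INF)})
print(result)  # [1], i.e. row2 (row ids are 0-based here)
```

The final exact check is what keeps the result correct even when a bucket is only partially covered by a criterion.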

Using an index :)
The basic idea is to turn the two-dimensional array into a one-dimensional sorted array (while keeping the original positions) and apply binary search on the latter.
This method works for any n-dimensional array and is widely used by databases, which can be seen as n-dimensional arrays with variable lengths.
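One possible reading of this, sketched in Python: build a sorted list of (value, original row id) pairs for a column and use binary search to answer a range query on it. The function names are hypothetical, and indexing one column at a time is an assumption about what the answer intends.

```python
from bisect import bisect_left, bisect_right

def build_column_index(rows, col):
    """Sorted (value, row_id) pairs for one column; row_id keeps the original position."""
    return sorted((row[col], row_id) for row_id, row in enumerate(rows))

def rows_in_range(index, low, high):
    """Row ids whose indexed value v satisfies low <= v <= high."""
    # (low, -1) sorts before every real entry with value == low.
    start = bisect_left(index, (low, -1))
    # (high, inf) sorts after every real entry with value == high.
    stop = bisect_right(index, (high, float("inf")))
    return [row_id for _, row_id in index[start:stop]]

rows = [(7, 30), (5, 10), (1, 20), (5, 40)]
idx = build_column_index(rows, 0)
print(rows_in_range(idx, 2, 5))  # [1, 3]
```

Each lookup costs two binary searches plus the size of the matching slice; combining several columns still means intersecting the per-column results.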

Related

Best mapping between 2 sequences

I have two sequences of items:
S1 = [ A B C D E F ]
S2 = [ 1 2 3 4 5 6 7 8 ]
And I can determine "similarity" for each pair of items (s1, s2) as a number (for example on scale 0 to 10).
I want to find a mapping between S1/S2 items, such that ordering of each sequence is preserved and sum of "similarity" values between mapped items is maximum. It is not required that all S1/S2 items are part of mapping.
Example:
[ A B C D E F ]
[ 1 2 3 4 5 6 7 8 ]
In example above, mapping 'A on 3', 'D on 4' and 'F on 6' gives overall maximum "similarity".
Are there any existing problems (/algorithms) this could be turned into?

Looks like the Smith–Waterman algorithm, which is traditionally used for determining similar regions between two strings of nucleic acid sequences or protein sequences, should be perfect:
The Smith–Waterman algorithm aligns two sequences by matches/mismatches (also known as substitutions), insertions, and deletions. Both insertions and deletions are operations that introduce gaps, which are represented by dashes. The Smith–Waterman algorithm has several steps:
1. Determine the substitution matrix and the gap penalty scheme. A substitution matrix assigns each pair of items (s1, s2) a score for match or mismatch. Usually matches get positive scores, whereas mismatches get relatively lower scores. A gap penalty function determines the score cost for opening or extending gaps. It is suggested that users choose the appropriate scoring system based on their goals. In addition, it is good practice to try different combinations of substitution matrices and gap penalties.
2. Initialize the scoring matrix. The dimensions of the scoring matrix are 1 + the length of each sequence, respectively. All the elements of the first row and the first column are set to 0. The extra first row and first column make it possible to align one sequence to another at any position, and setting them to 0 makes the terminal gap free from penalty.
3. Scoring. Score each element from left to right, top to bottom in the matrix, considering the outcomes of substitutions (diagonal scores) or adding gaps (horizontal and vertical scores). If none of the scores are positive, this element gets a 0. Otherwise the highest score is used and the source of that score is recorded.
4. Traceback. Starting at the element with the highest score, trace back recursively based on the source of each score, until 0 is encountered. The segments that have the highest similarity score under the given scoring system are generated in this process. To obtain the second best local alignment, apply the traceback process starting at the second highest score outside the trace of the best alignment.
Just choose the substitution matrix to match yours:
    "And I can determine "similarity" for each pair of items (s1, s2) as a number (for example on scale 0 to 10)."
and set the gap and no-match penalties to zero:
    "I want to find a mapping between S1/S2 items, such that ordering of each sequence is preserved and sum of "similarity" values between mapped items is maximum. It is not required that all S1/S2 items are part of mapping."
More information can be found at: https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm#Scoring_matrix

The problem you described looks like a variation of the Longest Common Subsequence problem.
Use this recurrence relation instead of the original one:
ans[i][j] = max(
    ans[i-1][j],
    ans[i][j-1],
    ans[i-1][j-1] + similarity(S1[i], S2[j])
)
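The recurrence can be sketched in Python roughly as follows. This assumes similarity scores are non-negative (as with the 0 to 10 scale in the question); best_mapping and the traceback that recovers the mapped pairs are illustrative additions, not part of the answer.

```python
def best_mapping(s1, s2, similarity):
    """Return (best total similarity, list of mapped pairs in order)."""
    n, m = len(s1), len(s2)
    # ans[i][j] = best total using the first i items of s1 and first j of s2.
    ans = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ans[i][j] = max(
                ans[i - 1][j],
                ans[i][j - 1],
                ans[i - 1][j - 1] + similarity(s1[i - 1], s2[j - 1]),
            )
    # Walk back from the bottom-right corner to recover which pairs were mapped.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if ans[i][j] == ans[i - 1][j]:
            i -= 1
        elif ans[i][j] == ans[i][j - 1]:
            j -= 1
        else:
            pairs.append((s1[i - 1], s2[j - 1]))
            i -= 1
            j -= 1
    pairs.reverse()
    return ans[n][m], pairs

# Hypothetical scores: only three pairs are similar at all.
scores = {("A", 3): 10, ("D", 4): 10, ("F", 6): 10}
total, pairs = best_mapping("ABCDEF", [1, 2, 3, 4, 5, 6, 7, 8],
                            lambda a, b: scores.get((a, b), 0))
print(total, pairs)  # 30 [('A', 3), ('D', 4), ('F', 6)]
```

Time and memory are both O(len(S1) * len(S2)).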

fastest algorithm for sum queries in a range

Assume we have the following data, which consists of consecutive runs of 0s and 1s (the nature of the data is that there are very, very few 1s).
data =
[0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0]
so a huge number of zeros, and then possibly some ones (which indicate that some sort of an event is happening).
You want to query this data many times. The query is that given two indices i and j what is sum(data[i:j]). For example, sum_query(i=12, j=25) = 2 in above example.
Note that you have all these queries in advance.
What sort of a data structure can help me evaluate all the queries as fast as possible?
My initial thoughts:
Preprocess the data and obtain two shorter arrays: data_change and data_cumsum. data_change is filled with the indices where each run of 1s starts and where the following run of 0s starts, and so on. data_cumsum contains the corresponding cumulative sums up to the indices stored in data_change, i.e. data_cumsum[k] = sum(data[0:data_change[k]])
In above example, the preprocessing results in: data_change=[8,11,18,20,31,35] and data_cumsum=[0,3,3,5,5,9]
Then if query comes for i=12 and j=25, I will do a binary search in this sorted data_change array to find the corresponding index for 12 and then for 25, which will result in the 0-based indices: bin_search(data_change, 12)=2 and bin_search(data_change, 25)=4.
Then I simply output the corresponding difference from the cumsum array: data_cumsum[4] - data_cumsum[2]. (I won't go into the details of handling the situation where an endpoint of the query range falls in the middle of a run of 1s, but those cases can be handled easily with an if-statement.)
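For reference, a rough Python sketch of this preprocessing, including the case where an endpoint falls inside a run of 1s. The helper names preprocess, prefix_sum and sum_query are made up, and the query is treated as half-open, like sum(data[i:j]).

```python
from bisect import bisect_right

def preprocess(data):
    """Record every index where the value flips, plus the sum of data before it."""
    data_change, data_cumsum = [], []
    total, prev = 0, 0
    for idx, value in enumerate(data):
        if value != prev:
            data_change.append(idx)
            data_cumsum.append(total)
            prev = value
        total += value
    return data_change, data_cumsum

def prefix_sum(data_change, data_cumsum, pos):
    """sum(data[0:pos]) computed from the two short arrays."""
    k = bisect_right(data_change, pos) - 1  # last change point <= pos
    if k < 0:
        return 0  # pos lies before the first run of 1s
    total = data_cumsum[k]
    if k % 2 == 0:
        # Even entries start a run of 1s, so pos is inside (or at the start of) that run.
        total += pos - data_change[k]
    return total

def sum_query(data_change, data_cumsum, i, j):
    return prefix_sum(data_change, data_cumsum, j) - prefix_sum(data_change, data_cumsum, i)

data = [0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0]
data_change, data_cumsum = preprocess(data)
print(data_change, data_cumsum)  # [8, 11, 18, 20, 31, 35] [0, 3, 3, 5, 5, 9]
print(sum_query(data_change, data_cumsum, 12, 25))  # 2
```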

With linear space, linear preprocessing, and constant query time, you can store an array of prefix sums. The i-th position holds the sum of the elements up to and including index i. To get the inclusive query(i, j), take the difference of the sums: sums[j] - sums[i-1] (with the sums[i-1] term taken as 0 when i = 0).
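A short Python sketch of the prefix-sum idea. To avoid a special case at index 0, this version stores a leading 0 so that sums[i] is the sum of the first i elements, and it answers the half-open query sum(data[i:j]) used in the question; both conventions are choices made here for illustration.

```python
from itertools import accumulate

def build_prefix_sums(data):
    """sums[i] == sum(data[0:i]); the leading 0 removes the i == 0 special case."""
    return [0] + list(accumulate(data))

def sum_query(sums, i, j):
    """Sum of data[i:j] in O(1)."""
    return sums[j] - sums[i]

data = [0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0]
sums = build_prefix_sums(data)
print(sum_query(sums, 12, 25))  # 2
```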

I already gave an O(1) time, O(n) space answer. Here are some alternates that trade time for space.
1. Assuming that the number of 1s is O(log n) or better (say O(log n) for argument):
Store an array of ints representing the positions of the ones in the original array. so if the input is [1,0,0,0,1,0,1,1] then A = [0,4,6,7].
Given a query, use binary search on A for the start and end of the query in O(log(|A|)) = O(log(log(n))). If the element you're looking for isn't in A, find the smallest bigger index and the largest smaller index. E.g., for query (2,6) you'd return the indices for the 4 and the 6, which are (1,2). Then the answer is one more than the difference.
2. Take advantage of knowing all the queries up front (as mentioned by the OP in a comment to my other answer). Say Q = (Q1, Q2, ..., Qm) is the set of queries.
Process the queries, storing a map of start and end indices to the query. E.g., if Q1 = (12,92) then our map would include {92 => Q1, 12 => Q1}. This takes O(m) time and O(m) space. Take note of the smallest start index and the largest end index.
Process the input data, starting with the smallest start index. Keep track of the running sum. For each index, check your map of queries. If the index is in the map, associate the current running sum with the appropriate query.
At the end, each query will have two sums associated with it. Their difference, plus the value of the element at the start index if the running sum recorded there already included it, is the answer.
Worst case analysis:
O(n) + O(m) time, O(m) space. However, this is across all queries. The amortized time cost per query is O(n/m). This is the same as my constant time solution (which required O(n) preprocessing).
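Both alternatives can be sketched in Python along these lines. The function names are invented for illustration, queries are treated as half-open ranges (i, j) like sum(data[i:j]), and the running sum is recorded before the element at each index is added, so no off-by-one correction is needed.

```python
from bisect import bisect_left

def ones_positions(data):
    """Alternative 1: store only the positions of the 1s."""
    return [idx for idx, value in enumerate(data) if value == 1]

def count_ones(positions, i, j):
    """Number of 1s in data[i:j], via two binary searches."""
    return bisect_left(positions, j) - bisect_left(positions, i)

def answer_offline(data, queries):
    """Alternative 2: one pass over the data, answering all known queries."""
    wanted = {}
    for i, j in queries:
        wanted[i] = None
        wanted[j] = None
    running = 0
    for idx, value in enumerate(data):
        if idx in wanted:
            wanted[idx] = running  # sum(data[0:idx])
        running += value
    if len(data) in wanted:
        wanted[len(data)] = running
    return [wanted[j] - wanted[i] for i, j in queries]

data = [0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0]
positions = ones_positions(data)
print(count_ones(positions, 12, 25))                        # 2
print(answer_offline(data, [(12, 25), (0, 37), (8, 11)]))   # [2, 9, 3]
```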

I would probably go with something like this:
# boilerplate test data
from itertools import chain, permutations
data = [0,0,0,0,0,0,0,1,1,1]
chained = list(chain(*permutations(data,5))) # increase 5 to 10 if you dare
Preprocessing:
# indices of all the ones
frSet = frozenset(i for i, value in enumerate(chained) if value == 1)
"Counting":
# O(min(len(frSet), len(frozenset(range(200,500)))))
summa = frSet.intersection(frozenset(range(200,500))) # use two sets for faster intersect
counted = len(summa)
"Sanity check":
print(sum(1 for x in frSet if 200 <= x < 500))
print(summa)
print(len(summa))
No edge cases are needed; the intersection does all you need. Memory use is slightly higher, since you store each index rather than ranges of ones. Performance depends on the intersection implementation.
This might be helpful: https://wiki.python.org/moin/TimeComplexity#set

Reduce by Keys performance using linear index to rows index

I have been looking at the code of sum_rows.cu from the thrust examples, and I could not understand what exactly is happening in linear_index_to_row_index.
Can someone please explain, with an example, what converting a linear index to a row index means?
Reference: https://github.com/thrust/thrust/blob/master/examples/sum_rows.cu

In thrust, a common storage format is a vector container (e.g. device_vector and host_vector). In typical usage, these containers are a one-dimensional storage format. The concepts of "row" and "column" don't really apply to a 1D vector. We simply talk about the index of an element. In this case, it is a linear (1D) index.
In order to store a 2 dimensional item, such as a matrix where the concept of "row" and "column" indexing of elements is applicable, a common approach to using the 1D containers in thrust is to "flatten" or "linearize" the storage:
Matrix A:
        column:  0  1
row: 0           1  2
     1           3  4
Vector A:
index:   0 1 2 3
element: 1 2 3 4
At this point, we can conveniently use thrust operations. But what if we want to do operations on specific rows or columns of the original matrix? That information (row, column) is "lost" when we linearize or flatten it into a 1D vector. But if we know the dimensions of the original matrix, we can use that to convert a linear index:
0 1 2 3
into a row/column index:
(0,0) (0,1) (1,0) (1,1)
and the reverse (row/column to linear index).
In the case of the thrust example you linked, the functor linear_index_to_row_index converts a given linear index to its associated row index. This allows the programmer to write a thrust operation that works on specific rows of data, such as summing the "rows" of the original matrix, even though it is now stored in a linear 1D vector.
Specifically, when given the following linear indices:
0 1 2 3
For my example, the functor would return:
0 0 1 1
Because that represents the row of the original matrix that the specific element in the vector belongs to.
If I want to sum all the elements of each row together, producing one sum per row, then I can use the row index generated by that functor to identify the row of each element. At that point, reduce_by_key can easily sum together the elements that have the same row index, producing one result per row.
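As a plain-Python analogy (not thrust code), the idea looks roughly like this, assuming row-major storage: integer division of the linear index by the number of columns gives the row index, and grouping consecutive equal keys plays the role of reduce_by_key. The helper sum_rows is made up for this illustration.

```python
from itertools import groupby

def linear_index_to_row_index(i, num_cols):
    """Row of the element stored at linear index i in a row-major matrix."""
    return i // num_cols

def sum_rows(flat, num_rows, num_cols):
    """Sum each row of a flattened matrix by grouping on the row index."""
    keys = [linear_index_to_row_index(i, num_cols) for i in range(num_rows * num_cols)]
    # Like reduce_by_key: consecutive elements with the same key are reduced together.
    return [
        sum(value for _, value in group)
        for _, group in groupby(zip(keys, flat), key=lambda kv: kv[0])
    ]

flat = [1, 2, 3, 4]  # the 2x2 matrix A, flattened
print([linear_index_to_row_index(i, 2) for i in range(4)])  # [0, 0, 1, 1]
print(sum_rows(flat, 2, 2))                                 # [3, 7]
```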

Create Ancestor Matrix from given Binary Tree

The question is: given an ancestor matrix, as a bitmap of 1s and 0s, construct the corresponding binary tree. Can anyone give me an idea of how to do it? I found a solution on Stack Overflow, but the line a[root->data][temp[i]]=1 seems wrong: nothing guarantees that the nodes contain the values 1 to n. A node may contain, say, 2000, in which case there is no a[2000][some_column], since there are only 7 nodes and hence 7 rows and columns in the matrix.

Two ways:
1. Normalize your node values such that they are all from 1 to n. If you have nodes 1, 2, 5000 for example, make them 1, 2, 3. You can do this by sorting or hashing your labels and keeping something like normalized[i] = normalized value of node i. normalized can be a map / hash table if you have very large labels or even text labels.
2. You might be able to use a sparse matrix for this, implementable with a hash table or a set: keep a hash table of hash tables. H[x] stores another hash table that stores your y values. So if in a naive matrix solution you had a[2000][5000] = 1, you would use H.get(2000) => returns a hash table H' of values stored on the 2000th row => H'.get(5000) => returns the value you want.
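Both options might look like this in Python, as a small sketch: normalize_labels and SparseMatrix are hypothetical names, and dictionaries stand in for the hash tables mentioned above.

```python
from collections import defaultdict

def normalize_labels(labels):
    """Map arbitrary node labels to 1..n, in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)), start=1)}

class SparseMatrix:
    """Hash table of hash tables: only cells that were set take up memory."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def set(self, x, y, value=1):
        self.rows[x][y] = value

    def get(self, x, y):
        # Missing rows or cells read as 0, like an untouched cell of a dense matrix.
        return self.rows.get(x, {}).get(y, 0)

normalized = normalize_labels([1, 2, 5000])
print(normalized)  # {1: 1, 2: 2, 5000: 3}

H = SparseMatrix()
H.set(2000, 5000)
print(H.get(2000, 5000), H.get(2000, 1))  # 1 0
```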

Complexity of: One matrix is row/col permutation of another matrix

Given two m x n matrices A and B whose elements belong to a set S.
Problem: Can the rows and columns of A be permuted to give B?
What is the complexity of algorithms to solve this problem?
Determinants partially help (when m=n): a necessary condition is that det(A) = +/- det(B).
Also allow A to contain "don't cares" that match any element of B.
Also, if S is finite allow permutations of elements of A.
This is not homework - it is related to the solved 17x17 puzzle.

See below example of permuting rows and columns of a matrix:
Observe the start matrix and the end matrix. All elements in a row or column are retained; it's just that their order has changed. Also, the change in relative positions is uniform across rows and columns.
E.g. look at 1 in the start matrix and the end matrix. Its row has the elements 12, 3 and 14 along with it, and its column has 5, 9 and 2 along with it. This is maintained across the transformations.
Based on this fact, I am putting forward this basic algorithm to decide, for a given matrix A, whether its rows and columns can be permuted to give matrix B.
1. For each row in A, sort all elements in the row. Do the same for B.
2. Sort all rows of A (and B) by their contents, i.e. if row1 is {5,7,16,18} and row2 is {2,4,13,15}, then put row2 above row1.
3. Compare the resulting matrices A' and B'.
4. If both are equal, then do (1) and (2) for columns instead of rows, on the ORIGINAL matrices A and B.
5. Now compare the resulting matrices A'' and B''.
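The steps above can be sketched in Python as below. The function names are made up, and the sketch only checks that the row contents and column contents agree, which every valid permutation must satisfy; it does not construct the permutation or attempt to prove that one exists.

```python
def sorted_rows(matrix):
    """Steps 1 and 2: sort inside each row, then sort the rows themselves."""
    return sorted(sorted(row) for row in matrix)

def sorted_columns(matrix):
    """The same two steps applied to the columns of the original matrix."""
    return sorted(sorted(col) for col in zip(*matrix))

def passes_row_column_check(a, b):
    """Steps 3 to 5: compare the row forms, then the column forms."""
    return sorted_rows(a) == sorted_rows(b) and sorted_columns(a) == sorted_columns(b)

A = [[1, 2], [3, 4]]
B = [[4, 3], [2, 1]]   # A with its rows swapped and its columns swapped
C = [[1, 3], [2, 4]]   # transpose of A, not a row/column permutation
print(passes_row_column_check(A, B))  # True
print(passes_row_column_check(A, C))  # False
```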
