Efficient algorithms for merging hashes into a sparse matrix - ruby

I have time data with irregular intervals and I need to convert it to a sparse matrix for use with a graphing library.
The data is currently in the following format:
{
  :series1 => [entry, entry, entry, entry, ...],
  :series2 => [entry, entry, entry, entry, ...]
}
where each entry is an object with two properties: timestamp (a Unix timestamp) and value (an integer).
I need to put it in this format in as close to O(n) time as possible.
{
  timestamp1 => [ value, value, nil ],
  timestamp2 => [ value, nil, value ],
  timestamp3 => [ value, value, value ],
  ...
}
Here each row represents a point in time for which I have an entry. Each column represents a series (a line on a line graph). That's why it is very important to represent missing values with nil.
I have some pretty slow implementations but this seems like a problem that has been solved before so I'm hoping that there is a more efficient way to do this.

I'm slightly confused by your asking for O(n), so feel free to correct me, but as far as I can tell, O(n) is easily possible.
First find the length of your starting hash (the number of series in the data). This should be O(1), but is no worse than O(S) (where S is the number of series), and S <= n (assuming no series is empty), so it is still O(n).
Store this length somewhere, and then set up your hash for the sparse matrix to automatically initialise any row to an empty array of this size.
matrix = Hash.new {|hsh,k| hsh[k] = Array.new(S)}
Then simply go through each series, by index. And for each entry, set the appropriate cell in the array to be the right value.
For each entry, this is O(1) (average) for the lookup of the timestamp in the hash, then O(1) for setting the cell in the array. This happens n times, giving you O(n) there.
There will also be the creation of an Array for each row in the matrix. As far as I am aware this is O(1) for one Array, so O(T) (where T is number of timestamps) overall. As we are not creating empty rows where there are no entries with that timestamp, T must be <= n, so this is O(n) as well.
So overall we have O(n) + O(n) + O(n) = O(n). There are probably ways to speed this up in Ruby, but as far as I am aware this is not only close to, but actually O(n).

How about something like this:
num = series.count
timestamps = {}
series.each_with_index do |(k, entries), i|
  entries.each do |entry|
    timestamps[entry.timestamp] ||= Array.new(num)
    timestamps[entry.timestamp][i] = entry.value
  end
end
I'm not sure about the initial ordering of your series, though; I guess your real situation is a bit more complex than presented in the question.

Related

Approximation-tolerant map

I'm working with arrays of integers, all of the same length l.
I have a static set of them and I need to build a function to efficiently look them up.
The tricky part is that the elements in the array I need to search might be off by 1.
Given the arrays {A_1, A_2, ..., A_n}, and an array S, I need a function search such that:
search(S)=x iff ∀i: A_x[i] ∈ {S[i]-1, S[i], S[i]+1}.
A possible solution is treating each vector as a point in an l-dimensional space and looking for the closest point, but it'd cost something like O(l*n) in space and O(l*log(n)) in time.
Would there be a solution with a better space complexity (and/or time, of course)?
My arrays are pretty different from each other, and good heuristics might be enough.
Consider a search array S with the values:
S = [s1, s2, s3, ... , sl]
and the average value:
s̅ = (s1 + s2 + s3 + ... + sl) / l
and two matching arrays, one where every value is one greater than the corresponding value in S, and one where every value is one smaller:
A1 = [s1+1, s2+1, s3+1, ... , sl+1]
A2 = [s1−1, s2−1, s3−1, ... , sl−1]
These two arrays would have the average values:
a̅1 = (s1 + 1 + s2 + 1 + s3 + 1 + ... + sl + 1) / l = s̅ + 1
a̅2 = (s1 − 1 + s2 − 1 + s3 − 1 + ... + sl − 1) / l = s̅ − 1
So every matching array, whose values are at most 1 away from the corresponding values in the search array, has an average value that is at most 1 away from the average value of the search array.
If you calculate and store the average value of each array, and then sort the arrays based on their average value (or use an extra data structure that enables you to find all arrays with a certain average value), you can quickly identify which arrays have an average value within 1 of the search array's average value. Depending on the data, this could drastically reduce the number of arrays you have to check for similarity.
After having pre-processed the arrays and stored their average values, performing a search would mean iterating over the search array to calculate the average value, looking up which arrays have a similar average value, and then iterating over those arrays to check every value.
If you expect many arrays to have a similar average value, you could use several averages to detect arrays that are locally very different but similar on average. You could e.g. calculate these four averages:
the first half of the array
the second half of the array
the odd-numbered elements
the even-numbered elements
Analysis of the actual data should give you more information about how to divide the array and combine different averages to be most effective.
If the total sum of an array cannot overflow the integer type, you could store the total sum of each array and check whether it is within l of the total sum of the search array, instead of using averages. This would avoid having to use floats and divisions.
(You could expand this idea by also storing other properties which are easily calculated and don't take up much space to store, such as the highest and lowest value, the biggest jump, ... They could help create a fingerprint of each array that is near-unique, depending on the data.)
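Here is a minimal Python sketch of the sum-based filtering idea (the names, the grouping by exact sum, and the brute-force verification step are all illustrative assumptions, not a definitive implementation):

from collections import defaultdict

def build_index(arrays):
    # Group array ids by their total sum; the sum is just the average times l.
    by_sum = defaultdict(list)
    for idx, arr in enumerate(arrays):
        by_sum[sum(arr)].append(idx)
    return by_sum

def search_by_sum(arrays, by_sum, S):
    l, target = len(S), sum(S)
    # Any array matching S element-wise within +/-1 has a total sum within l of sum(S).
    for candidate_sum in range(target - l, target + l + 1):
        for idx in by_sum.get(candidate_sum, []):
            if all(abs(a - s) <= 1 for a, s in zip(arrays[idx], S)):
                return idx
    return None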
If the number of dimensions is not very small, then probably the best solution will be to build a decision tree that recursively partitions the set along different dimensions.
Each node, including the root, would be a hash table from the possible values for some dimension to either:
The list of points that match that value within tolerance, if it's small enough; or
Those same points in a similar tree partitioning on the remaining dimensions.
Since each level completely eliminates one dimension, the depth of the tree is at most L, and search takes O(L) time.
The order in which the dimensions are chosen along each path is important, of course -- the wrong choice could explode the size of the data structure, with each point appearing many times.
Since your points are "pretty different", though, it should be possible to build a tree with minimal duplication. I would try the ID3 algorithm to choose the dimensions: https://en.wikipedia.org/wiki/ID3_algorithm. That basically means you greedily choose the dimension that maximizes the overall reduction in set size, using an entropy metric.
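A simplified Python sketch of this kind of partitioning (without the ID3-style greedy choice of dimension; the names and the leaf cut-off are illustrative assumptions):

def build(arrays, ids, dims):
    # Bucket the remaining arrays by their value in one dimension; each array is
    # inserted under the three keys (v-1, v, v+1) it can match within tolerance 1.
    if len(ids) <= 2 or not dims:
        return ("leaf", ids)
    d = dims[0]  # a real implementation would choose d greedily (ID3-style)
    buckets = {}
    for i in ids:
        v = arrays[i][d]
        for key in (v - 1, v, v + 1):
            buckets.setdefault(key, []).append(i)
    return ("node", d, {k: build(arrays, b, dims[1:]) for k, b in buckets.items()})

def lookup(tree, arrays, S):
    node = tree
    while node[0] == "node":
        _, d, children = node
        if S[d] not in children:
            return None
        node = children[S[d]]
    for i in node[1]:  # small leaf: verify the few remaining candidates exactly
        if all(abs(a - s) <= 1 for a, s in zip(arrays[i], S)):
            return i
    return None

arrays = [[1, 2, 3], [2, 3, 4], [9, 9, 9]]
tree = build(arrays, list(range(len(arrays))), list(range(3)))
print(lookup(tree, arrays, [2, 2, 3]))  # one of the ids matching within +/-1 per element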
I would personally create something like a Trie for the lookup. I said "something like" because we have up to 3 values per index that might match, so we aren't creating a decision tree but a DAG, where sometimes we have choices.
That is straightforward and will run (with backtracking) in maximum time O(k*l).
But here is the trick. Whenever we see a choice of matching states that we can go into next, we can create a merged state which tries all of them. We can create a few or a lot of these merged states. Each one will defer a choice by 1 step. And if we're careful to keep track of which merged states we've created, we can reuse the same one over and over again.
In theory we can be generating partial matches for somewhat arbitrary subsets of our arrays, which can grow exponentially in the number of arrays. In practice we are likely to only wind up with a few of these merged states. But we can still guarantee a tradeoff: more states up front means faster matching later. So we optimize until we are done or have hit the limit of how much data we want to keep.
Here is some proof of concept code for this in Python. It will likely build the matcher in time O(n*l) and match in time O(l). However it is only guaranteed to build the matcher in time O(n^2 * l^2) and match in time O(n * l).
import pprint

class Matcher:
    def __init__(self, arrays, optimize_limit=None):
        # These are the partial states we could be in during a match.
        self.states = [{}]
        # By state, this is what we would be trying to match.
        self.state_for = ['start']
        # By combination we could try to match for, which state it is.
        self.comb_state = {'start': 0}
        for i in range(len(arrays)):
            arr = arrays[i]
            # Set up "matched the end".
            state_index = len(self.states)
            this_state = {'matched': [i]}
            self.comb_state[(i, len(arr))] = state_index
            self.states.append(this_state)
            self.state_for.append((i, len(arr)))
            for j in reversed(range(len(arr))):
                this_for = (i, j)
                prev_state = {}
                if 0 == j:
                    prev_state = self.states[0]
                matching_values = set((arr[k] for k in range(max(j-1, 0), min(j+2, len(arr)))))
                for v in matching_values:
                    if v in prev_state:
                        prev_state[v].append(state_index)
                    else:
                        prev_state[v] = [state_index]
                if 0 < j:
                    state_index = len(self.states)
                    self.states.append(prev_state)
                    self.state_for.append(this_for)
                    self.comb_state[this_for] = state_index
        # Theoretically optimization can take space
        # O(2**len(arrays) * len(arrays[0]))
        # We will optimize until we are done or hit a more reasonable limit.
        if optimize_limit is None:
            # Normally
            optimize_limit = len(self.states)**2
        # First we find all of the choices at the root.
        # This will be an array of arrays with format:
        # [state, key, values]
        todo = []
        for k, v in self.states[0].items():
            if 1 < len(v):
                todo.append([self.states[0], k, tuple(v)])
        while len(todo) and len(self.states) < optimize_limit:
            this_state, this_key, this_match = todo.pop(0)
            if this_key == 'matched':
                pass # We do not need to optimize this!
            elif this_match in self.comb_state:
                # Reuse the merged state we already built for this combination.
                this_state[this_key] = [self.comb_state[this_match]]
            else:
                # Construct a new state that is all of these.
                new_state = {}
                for state_ind in this_match:
                    for k, v in self.states[state_ind].items():
                        if k in new_state:
                            new_state[k] = new_state[k] + v
                        else:
                            new_state[k] = v
                i = len(self.states)
                self.states.append(new_state)
                self.comb_state[this_match] = i
                self.state_for.append(this_match)
                this_state[this_key] = [i]
                for k, v in new_state.items():
                    if 1 < len(v):
                        todo.append([new_state, k, tuple(v)])
        #pp = pprint.PrettyPrinter()
        #pp.pprint(self.states)
        #pp.pprint(self.comb_state)
        #pp.pprint(self.state_for)

    def match(self, list1, ind=0, state=0):
        this_state = self.states[state]
        if 'matched' in this_state:
            return this_state['matched']
        elif list1[ind] in this_state:
            answer = []
            for next_state in this_state[list1[ind]]:
                answer = answer + self.match(list1, ind+1, next_state)
            return answer
        else:
            return []
foo = Matcher([[1, 2, 3], [2, 3, 4]])
print(foo.match([2, 2, 3]))
Please note that I deliberately set up a situation where there are 2 matches. It reports both of them. :-)
I came up with a further approach derived from Matt Timmermans's answer: building a simple decision tree that might have some arrays in multiple branches. It works even if the error in the array I'm searching for is larger than 1.
The idea is the following: given the set of arrays As...
Pick an index and a pivot.
I fixed the pivot to a constant value that works well with my data, and tried all indices to find the best one. Trying multiple pivots might work better, but I didn't need to.
Partition As into two possibly-intersecting subsets, one for the arrays (whose index-th element is) smaller than the pivot, one for the larger arrays. Arrays very close to the pivot are added to both sets:
function partition( As, pivot, index ):
    return {
        As.filter( A => A[index] <= pivot + 1 ),
        As.filter( A => A[index] >= pivot - 1 ),
    }
Apply both previous steps to each subset recursively, stopping when a subset only contains a single element.
Here is an example of a possible tree generated with this algorithm (note that A2 appears both in the left and the right child of the root node):
              {A1, A2, A3, A4}
                  pivot: 15
                  index: 73
               /             \
              /               \
      {A1, A2}                 {A2, A3, A4}
       pivot: 7                  pivot: 33
       index: 54                 index: 0
      /       \                 /         \
     /         \               /           \
   A1           A2       {A2, A3}           A4
                           pivot: 5
                           index: 48
                          /       \
                         /         \
                       A2           A3
The search function then uses this as a normal decision tree: it starts from the root node and recurses either to the left or the right child depending on whether its value at index currentNode.index is greater or less than currentNode.pivot. It proceeds recursively until it reaches a leaf.
Once the decision tree is built, the time complexity is in the worst case O(n), but in practice it's probably closer to O(log(n)) if we choose good indices and pivots (and if the dataset is diverse enough) and find a fairly balanced tree.
The space complexity can be really bad in the worst case (O(2^n)), but it's closer to O(n) with balanced trees.

fastest algorithm for sum queries in a range

Assume we have the following data, which consists of consecutive 0s and 1s (the nature of the data is that there are very, very few 1s):
data =
[0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0]
so a huge number of zeros, and then possibly some ones (which indicate that some sort of an event is happening).
You want to query this data many times. The query is: given two indices i and j, what is sum(data[i:j])? For example, sum_query(i=12, j=25) = 2 in the above example.
Note that you have all these queries in advance.
What sort of a data structure can help me evaluate all the queries as fast as possible?
My initial thoughts:
preprocess the data and obtain two shorter arrays: data_change and data_cumsum. data_change will hold the indices where a run of 1s starts and where the next run of 0s starts, and so on. data_cumsum will contain the corresponding cumulative sums up to the indices in data_change, i.e. data_cumsum[k] = sum(data[0:data_change[k]]).
In the above example, the preprocessing results in: data_change=[8,11,18,20,31,35] and data_cumsum=[0,3,3,5,5,9]
Then if a query comes for i=12 and j=25, I do a binary search in the sorted data_change array to find the corresponding index for 12 and then for 25, which results in the 0-based indices bin_search(data_change, 12)=2 and bin_search(data_change, 25)=4.
Then I simply output the corresponding difference from the cumsum array: data_cumsum[4] - data_cumsum[2]. (I won't go into the detail of handling the situation where an endpoint of the query range falls in the middle of a run of 1s, but those cases can be handled easily with an if-statement.)
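A short Python sketch of this preprocessing (assuming data is the 0/1 list from above):

data_change = [i for i in range(1, len(data)) if data[i] != data[i - 1]]
prefix = [0]
for v in data:
    prefix.append(prefix[-1] + v)  # prefix[i] == sum(data[0:i])
data_cumsum = [prefix[i] for i in data_change]
# With the example data: data_change == [8, 11, 18, 20, 31, 35]
#                        data_cumsum == [0, 3, 3, 5, 5, 9]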
With linear space, linear preprocessing, constant query time, you can store an array of sums. The i'th position gets the sum of the first i elements. To get query(i,j) you take the difference of the sums (sums[j] - sums[i-1]).
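A minimal sketch of that prefix-sum idea in Python (here sums[k] holds the sum of the first k elements, so an inclusive range query uses sums[j + 1] - sums[i]):

def build_sums(data):
    sums = [0]
    for v in data:
        sums.append(sums[-1] + v)
    return sums

def sum_query(sums, i, j):
    # Sum of data[i..j], inclusive (0-based indices into data).
    return sums[j + 1] - sums[i]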
I already gave an O(1) time, O(n) space answer. Here are some alternates that trade time for space.
1. Assuming that the number of 1s is O(log n) or better (say O(log n) for argument):
Store an array of ints representing the positions of the ones in the original array, so if the input is [1,0,0,0,1,0,1,1] then A = [0,4,6,7].
Given a query, use binary search on A for the start and end of the query in O(log(|A|)) = O(log(log(n))). If the element you're looking for isn't in A, find the smallest bigger index and the largest smaller index. E.g., for query (2,6) you'd return the indices for the 4 and the 6, which are (1,2). Then the answer is one more than the difference (a small sketch follows this answer).
2. Take advantage of knowing all the queries up front (as mentioned by the OP in a comment to my other answer). Say Q = (Q1, Q2, ..., Qm) is the set of queries.
Process the queries, storing a map of start and end indices to the query. E.g., if Q1 = (12,92) then our map would include {92 => Q1, 12 => Q1}. This takes O(m) time and O(m) space. Take note of the smallest start index and the largest end index.
Process the input data, starting with the smallest start index. Keep track of the running sum. For each index, check your map of queries. If the index is in the map, associate the current running sum with the appropriate query.
At the end, each query will have two sums associated with it. Add one to the difference to get the answer.
Worst case analysis:
O(n) + O(m) time, O(m) space. However, this is across all queries. The amortized time cost per query is O(n/m). This is the same as my constant time solution (which required O(n) preprocessing).
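A small Python sketch of option 1 above, using bisect on the sorted positions of the 1s (assuming the data list from the question; the names are illustrative):

from bisect import bisect_left, bisect_right

ones = [i for i, v in enumerate(data) if v == 1]  # positions of the 1s, already sorted

def count_ones(i, j):
    # Number of 1s in data[i..j], inclusive.
    return bisect_right(ones, j) - bisect_left(ones, i)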
I would probably go with something like this:
# boilerplate testdata
from itertools import chain, permutations
data = [0,0,0,0,0,0,0,1,1,1]
chained = list(chain(*permutations(data,5))) # increase 5 to 10 if you dare
Preprocessing:
frSet = frozenset([i for i in range(len(chained)) if chained[i]==1])
"Counting":
# O(min(len(frSet), len(frozenset(range(200,500)))))
summa = frSet.intersection(frozenset(range(200,500))) # use two sets for faster intersect
counted=len(summa)
"Sanity-Check"
print(sum([1 for x in frSet if x >= 200 and x<500]))
print(summa)
print(len(summa))
No edge cases needed; intersection does all you need. Memory is slightly higher since you store each index rather than ranges of ones. Performance depends on the intersection implementation.
This might be helpful: https://wiki.python.org/moin/TimeComplexity#set

How to "sort" elements of 2 possible values in place in linear time? [duplicate]

This question already has answers here:
Stable separation for two classes of elements in an array
(3 answers)
Closed 9 years ago.
Suppose I have a function f and array of elements.
The function returns A or B for any element; you could visualize the elements this way: ABBAABABAA.
I need to sort the elements according to the function, so the result is: AAAAAABBBB
The number of A values doesn't have to equal the number of B values. The total number of elements can be arbitrary (not fixed). Note that you don't sort chars, you sort objects that have a single char representation.
A few more things:
the sort should take linear time - O(n),
it should be performed in place,
it should be a stable sort.
Any ideas?
Note: if the above is not possible, do you have ideas for algorithms sacrificing one of the above requirements?
If it has to be linear and in-place, you could do a semi-stable version. By semi-stable I mean that A or B could be stable, but not both. Similar to Dukeling's answer, but you move both iterators from the same side:
a = first A
b = first B
loop while next A exists
    if b < a
        swap a,b elements
        b = next B
        a = next A
    else
        a = next A
With the sample string ABBAABABAA, you get:
ABBAABABAA
AABBABABAA
AAABBBABAA
AAAABBBBAA
AAAAABBBBA
AAAAAABBBB
On each turn, if you make a swap you move both; if not, you just move a. This will keep A stable, but B will lose its ordering. To keep B stable instead, start from the end and work your way left.
It may be possible to do it with full stability, but I don't see how.
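A compact Python version of the A-stable idea described above (written as a single scan rather than with explicit A/B iterators; is_a stands in for the classifying function f):

def semi_stable_partition(items, is_a):
    # Keeps the relative order of the A's; the B's may end up reordered.
    a = 0  # next free slot for an A
    for i in range(len(items)):
        if is_a(items[i]):
            items[a], items[i] = items[i], items[a]
            a += 1
    return items

print("".join(semi_stable_partition(list("ABBAABABAA"), lambda x: x == "A")))
# -> AAAAAABBBB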
A stable sort might not be possible with the other given constraints, so here's an unstable sort that's similar to the partition step of quick-sort.
Have 2 iterators, one starting on the left, one starting on the right.
While there's a B at the right iterator, decrement the iterator.
While there's an A at the left iterator, increment the iterator.
If the iterators haven't crossed each other, swap their elements and repeat from 2.
Let's say,
Object_Array[1...N]
Type_A objs are A1,A2,...Ai
Type_B objs are B1,B2,...Bj
i+j = N
FOR i=1:N
    if Object_Array[i] is of Type_A
        obj_A_count = obj_A_count + 1
    else
        obj_B_count = obj_B_count + 1
LOOP
Fill the resultant array with Type_A objects followed by Type_B objects, using their respective counts.
The following should work in linear time for a doubly-linked list. Because up to N insertion/deletions are involved that may cause quadratic time for arrays though.
Find the location where the first B should be after "sorting". This can be done in linear time by counting As.
Start with 3 iterators: iterA starts from the beginning of the container, and iterB starts from the above location where As and Bs should meet, and iterMiddle starts one element prior to iterB.
With iterA skip over As, find the 1st B, and move the object from iterA to iterB->previous position. Now iterA points to the next element after where the moved element used to be, and the moved element is now just before iterB.
Continue with step 3 until you reach iterMiddle. After that all elements between first() and iterB-1 are As.
Now set iterA to iterB-1.
Skip over Bs with iterB. When A is found move it to just after iterA and increment iterA.
Continue step 6 until iterB reaches end().
This would work as a stable sort for any container. The algorithm includes O(N) insertions/deletions, which is linear time for containers with O(1) insertion/deletion, but, alas, O(N^2) for arrays. Applicability in your case depends on whether the container is an array rather than a list.
If your data structure is a linked list instead of an array, you should be able to meet all three of your constraints. You just skim through the list, and accumulating and moving the "B"s amounts to trivial pointer changes. Pseudo code below:
sort(list) {
    node = list.head, blast = null, bhead = null
    while(node != null) {
        nextnode = node.next
        if(node.val == "a") {
            if(blast != null){
                //move the 'a' to the front of the 'B' list
                bhead.prev.next = node, node.prev = bhead.prev
                blast.next = node.next, node.next.prev = blast
                node.next = bhead, bhead.prev = node
            }
        }
        else if(node.val == "b") {
            if(blast == null)
                bhead = blast = node
            else //accumulate the "b"s..
                blast = node
        }
        node = nextnode
    }
}
So, you can do this in an array, but the memcopies that emulate the list swap will make it quite slow for large arrays.
Firstly, assuming the array of A's and B's is either generated or read-in, I wonder why not avoid this question entirely by simply applying f as the list is being accumulated into memory into two lists that would subsequently be merged.
Otherwise, we can posit an alternative solution in O(n) time and O(1) space that may be sufficient depending on Sir Bohumil's ultimate needs:
Traverse the list and sort each segment of 1,000,000 elements in-place using the permutation cycles of the segment (once this step is done, the list could technically be sorted in-place by recursively swapping the inner-blocks, e.g., ABB AAB -> AAABBB, but that may be too time-consuming without extra space). Traverse the list again and use the same constant space to store, in two interval trees, the pointers to each block of A's and B's. For example, segments of 4,
ABBAABABAA => AABB AABB AA + pointers to blocks of A's and B's
Sequential access to A's or B's would be immediately available, and random access would come from using the interval tree to locate a specific A or B. One option could be to have the intervals number the A's and B's; e.g., to find the 4th A, look for the interval containing 4.
For sorting, an array of 1,000,000 four-byte elements (3.8MB) would suffice to store the indexes, using one bit in each element for recording visited indexes during the swaps; and two temporary variables the size of the largest A or B. For a list of one billion elements, the maximum combined interval trees would number 4000 intervals. Using 128 bits per interval, we can easily store numbered intervals for the A's and B's, and we can use the unused bits as pointers to the block index (10 bits) and offset in the case of B (20 bits). 4000*16 bytes = 62.5KB. We can store an additional array with only the B blocks' offsets in 4KB. Total space under 5MB for a list of one billion elements. (Space is in fact dependent on n but because it is extremely small in relation to n, for all practical purposes, we may consider it O(1).)
Time for sorting the million-element segments would be - one pass to count and index (here we can also accumulate the intervals and B offsets) and one pass to sort. Constructing the interval tree is O(nlogn) but n here is only 4000 (0.00005 of the one-billion list count). Total time O(2n) = O(n)
This should be possible with a bit of dynamic programming.
It works a bit like counting sort, but with a key difference. Make arrays of size n for both A and B: count_a[n] and count_b[n]. Fill these arrays with how many A's or B's there have been before index i.
After just one loop, we can use these arrays to look up the correct index for any element in O(1). Like this:
int final_index(char id, int pos){
    if(id == 'A')
        return count_a[pos];
    else
        return count_a[n-1] + count_b[pos];
}
Finally, to meet the total O(n) requirement, the swapping needs to be done in a smart order. One simple option is a recursive swapping procedure that doesn't actually perform any swap until both elements would land in their correct final positions. EDIT: this is actually not true; even naive swapping performs O(n) swaps. But the recursive strategy gives you the absolute minimum number of required swaps.
Note that in the general case this would be a very bad sorting algorithm, since it has a memory requirement of O(n * element value range).
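A Python sketch of this counting approach: build the counts, compute each element's final index, then apply the permutation cycle by cycle so every element is written only once (the names and the cycle-following detail are illustrative assumptions):

def counting_partition(items, is_a):
    n = len(items)
    a_before, b_before = [0] * n, [0] * n
    na = nb = 0
    for i, x in enumerate(items):
        a_before[i], b_before[i] = na, nb  # counts strictly before index i
        if is_a(x):
            na += 1
        else:
            nb += 1
    # Final index of the element originally at position i (stable for both A and B).
    def dest(i):
        return a_before[i] if is_a(items[i]) else na + b_before[i]
    perm = [dest(i) for i in range(n)]
    placed = [False] * n
    for start in range(n):  # apply the permutation cycle by cycle
        if placed[start]:
            continue
        current, i = items[start], start
        while not placed[i]:
            placed[i] = True
            j = perm[i]
            items[j], current = current, items[j]
            i = j
    return items

print("".join(counting_partition(list("ABBAABABAA"), lambda x: x == "A")))
# -> AAAAAABBBB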

How to design a data structure that allows one to search, insert and delete an integer X in O(1) time

Here is an exercise (3-15) in the book "Algorithm Design Manual".
Design a data structure that allows one to search, insert, and delete an integer X in O(1) time (i.e. , constant time, independent of the total number of integers stored). Assume that 1 ≤ X ≤ n and that there are m + n units of space available, where m is the maximum number of integers that can be in the table at any one time. (Hint: use two arrays A[1..n] and B[1..m].) You are not allowed to initialize either A or B, as that would take O(m) or O(n) operations. This means the arrays are full of random garbage to begin with, so you must be very careful.
I am not really seeking the answer, because I don't even understand what this exercise asks.
From the first sentence:
Design a data structure that allows one to search, insert, and delete an integer X in O(1) time
I can easily design a data structure like that. For example:
Because 1 <= X <= n, I can just have a bit vector of n slots and let X be the index into the array: when inserting, e.g., 5, set a[5] = 1; when deleting 5, set a[5] = 0; when searching for 5, I can simply return a[5], right?
I know this exercise is harder than I imagine, but what's the key point of this question?
You are basically implementing a multiset with bounded size, both in number of elements (#elements <= m), and valid range for elements (1 <= elementValue <= n).
Search: myCollection.search(x) --> return True if x inside, else False
Insert: myCollection.insert(x) --> add exactly one x to collection
Delete: myCollection.delete(x) --> remove exactly one x from collection
Consider what happens if you try to store 5 twice, e.g.
myCollection.insert(5)
myCollection.insert(5)
That is why you cannot use a bit vector. But it says "units" of space, so the elaboration of your method would be to keep a tally of each element. For example you might have [_,_,_,_,1,_,...] then [_,_,_,_,2,_,...].
Why doesn't this work however? It seems to work just fine for example if you insert 5 then delete 5... but what happens if you do .search(5) on an uninitialized array? You are specifically told you cannot initialize it, so you have no way to tell if the value you'll find in that piece of memory e.g. 24753 actually means "there are 24753 instances of 5" or if it's garbage.
NOTE: You must allow yourself O(1) initialization space, or the problem cannot be solved. (Otherwise a .search() would not be able to distinguish the random garbage in your memory from actual data, because you could always come up with random garbage which looked like actual data.) For example you might consider having a boolean which means "I have begun using my memory" which you initialize to False, and set to True the moment you start writing to your m words of memory.
If you'd like a full solution, here is the one I came up with. It's only a few lines of code, but the proofs are a bit longer:
SPOILER: FULL SOLUTION
Setup:
Use N words as a dispatch table: locationOfCounts is an array of size N, with values in the range [0,M]. locationOfCounts[i] is the location where the count of i would be stored, but we can only trust this value if we can prove it is not garbage.
(sidenote: This is equivalent to an array of pointers, but an array of pointers exposes you being able to look up garbage, so you'd have to code that implementation with pointer-range checks.)
To find out how many i's there are in the collection, you can look up the value counts[loc] from above. We use M words as the counts themselves: counts is an array of size M, with two values per element. The first value is the number this represents, and the second value is the count of that number (in the range [1,m]). For example a value of (5,2) would mean that there are 2 instances of the number 5 stored in the collection.
(M words is enough space for all the counts. Proof: We know there can never be more than M elements, therefore the worst-case is we have M counts of value=1. QED)
(We also choose to only keep track of counts >= 1, otherwise we would not have enough memory.)
Use a number called numberOfCountsStored that IS initialized to 0 but is updated whenever the number of item types changes. For example, this number would be 0 for {}, 1 for {5:[1 times]}, 1 for {5:[2 times]}, and 2 for {5:[2 times],6:[4 times]}.
                          1  2  3  4  5  6  7  8...
locationOfCounts[<N]: [☠, ☠, ☠, ☠, ☠, 0, 1, ☠, ...]
counts[<M]:           [(5,⨯2), (6,⨯4), ☠, ☠, ☠, ☠, ☠, ☠, ☠, ☠..., ☠]
numberOfCountsStored:          2
Below we flesh out the details of each operation and prove why it's correct:
Algorithm:
There are two main ideas: 1) we can never allow ourselves to read memory without verifying that it is not garbage first, or if we do we must be able to prove that it was garbage, 2) we need to be able to prove in O(1) time that a piece of counter memory has been initialized, with only O(1) space. To go about this, the O(1) space we use is numberOfCountsStored. Each time we do an operation, we will go back to this number to prove that everything was correct (e.g. see ★ below). The representation invariant is that we will always store counts in counts going from left-to-right, so numberOfCountsStored always marks how far into counts the entries are valid.
.search(e) -- Check locationsOfCounts[e]. We assume for now that the value is properly initialized and can be trusted. We proceed to check counts[loc], but first we check if counts[loc] has been initialized: it's initialized if 0<=loc<numberOfCountsStored (if not, the data is nonsensical so we return False). After checking that, we look up counts[loc] which gives us a number,count pair. If number!=e, we got here by following randomized garbage (nonsensical), so we return False (again as above)... but if indeed number==e, this proves that the count is correct (★proof: numberOfCountsStored is a witness that this particular counts[loc] is valid, and counts[loc].number is a witness that locationOfCounts[number] is valid, and thus our original lookup was not garbage.), so we would return True.
.insert(e) -- Perform the steps in .search(e). If it already exists, we only need to increment the count by 1. However if it doesn't exist, we must tack on a new entry to the right of the counts subarray. First we increment numberOfCountsStored to reflect the fact that this new count is valid: loc = numberOfCountsStored++. Then we tack on the new entry: counts[loc] = (e,⨯1). Finally we add a reference back to it in our dispatch table so we can look it up quickly locationOfCounts[e] = loc.
.delete(e) -- Perform the steps in .search(e). If it doesn't exist, throw an error. If the count is >= 2, all we need to do is decrement the count by 1. Otherwise the count is 1, and the trick here to preserve the whole numberOfCountsStored-counts[...] invariant (i.e. everything remains stored on the left part of counts) is to perform swaps. If the deletion gets rid of the last copy of the number, we will have lost a counts pair, leaving a hole in our array: [countPair0, countPair1, _hole_, countPair2, ..., countPair{numberOfCountsStored-1}, ☠, ☠, ☠..., ☠]. We swap this hole with the last countPair, decrement numberOfCountsStored to invalidate the hole, and update locationOfCounts[the_count_record_we_swapped.number] so it now points to the new location of the count record.
Here is an idea:
treat the array B[1..m] as a stack, and make a pointer p to point to the top of the stack (let p = 0 to indicate that no elements have been inserted into the data structure). Now, to insert an integer X, use the following procedure:
p++;
A[X] = p;
B[p] = X;
Searching should be pretty easy to see here (let X' be the integer you want to search for, then just check that 1 <= A[X'] <= p, and that B[A[X']] == X'). Deleting is trickier, but still constant time. The idea is to search for the element to confirm that it is there, then move something into its spot in B (a good choice is B[p]). Then update A to reflect the pointer value of the replacement element and pop off the top of the stack (e.g. set B[p] = -1 and decrement p).
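A small Python sketch of this scheme (in the exercise, A and B would be left uninitialized; here the zero fill merely stands in for garbage that is never trusted without the p / B cross-check):

class ConstantTimeSet:
    def __init__(self, n, m):
        self.A = [0] * (n + 1)  # "garbage" in the real exercise, indexed 1..n
        self.B = [0] * (m + 1)  # "garbage" in the real exercise, a stack of at most m values
        self.p = 0              # the only word we actually initialize

    def search(self, x):
        # A[x] is trusted only if it points into the live part of the stack
        # and the stack entry points back at x.
        return 1 <= self.A[x] <= self.p and self.B[self.A[x]] == x

    def insert(self, x):
        if not self.search(x):
            self.p += 1
            self.A[x] = self.p
            self.B[self.p] = x

    def delete(self, x):
        if self.search(x):
            slot = self.A[x]
            last = self.B[self.p]  # move the top of the stack into the freed slot
            self.B[slot] = last
            self.A[last] = slot
            self.p -= 1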
It's easier to understand the question once you know the answer: an integer is in the set if A[X]<total_integers_stored && B[A[X]]==X.
The question is really asking if you can figure out how to create a data structure that is usable with a minimum of initialization.
I first saw the idea in Cameron's answer and in Jon Bentley's Programming Pearls.
The idea is pretty simple but it's not straightforward to see why the initial random values that may be on the uninitialized arrays does not matter. This link explains pretty well the insertion and search operations. Deletion is left as an exercise, but is answered by one of the commenters:
remove-member(i):
    if not is-member(i): return
    j = dense[n-1];
    dense[sparse[i]] = j;
    sparse[j] = sparse[i];
    n = n - 1

Good hash function for permutations?

I have got numbers in a specific range (usually from 0 to about 1000). An algorithm selects some numbers from this range (about 3 to 10 numbers). This selection is done quite often, and I need to check if a permutation of the chosen numbers has already been selected.
E.g. one step selects [1, 10, 3, 18] and another one [10, 18, 3, 1]; then the second selection can be discarded because it is a permutation of the first.
I need to do this check very fast. Right now I put all arrays in a hashmap, and use a custom hash function: just sums up all the elements, so 1+10+3+18=32, and also 10+18+3+1=32. For equals I use a bitset to quickly check if elements are in both sets (I do not need sorting when using the bitset, but it only works when the range of numbers is known and not too big).
This works ok, but can generate lots of collisions, so the equals() method is called quite often. I was wondering if there is a faster way to check for permutations?
Are there any good hash functions for permutations?
UPDATE
I have done a little benchmark: generate all combinations of numbers in the range 0 to 6, with array lengths 1 to 9. There are 3003 possible permutations, and a good hash should generate close to this many different hashes (I use 32-bit numbers for the hash):
41 different hashes for just adding (so there are lots of collisions)
8 different hashes for XOR'ing values together
286 different hashes for multiplying
3003 different hashes for (R + 2e) and multiplying as abc has suggested (using 1779033703 for R)
So abc's hash can be calculated very fast and is a lot better than all the rest. Thanks!
PS: I do not want to sort the values when I do not have to, because this would get too slow.
One potential candidate might be this.
Fix an odd integer R.
For each element e you want to hash compute the factor (R + 2*e).
Then compute the product of all these factors.
Finally divide the product by 2 to get the hash.
The factor 2 in (R + 2e) guarantees that all factors are odd, which prevents the product from ever becoming 0. The division by 2 at the end is because the product will always be odd, so the division just removes a constant bit.
E.g. I choose R = 1779033703. This is an arbitrary choice, doing some experiments should show if a given R is good or bad. Assume your values are [1, 10, 3, 18].
The product (computed using 32-bit ints) is
(R + 2) * (R + 20) * (R + 6) * (R + 36) = 3376724311
Hence the hash would be
3376724311/2 = 1688362155.
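A tiny Python sketch of this hash, reduced modulo 2^32 to mimic 32-bit overflow (the function name is illustrative):

def perm_hash(values, R=1779033703, bits=32):
    h = 1
    for e in values:
        h = (h * (R + 2 * e)) % (1 << bits)  # every factor is odd, so h stays odd
    return h >> 1  # drop the constant low bit

print(perm_hash([1, 10, 3, 18]) == perm_hash([10, 18, 3, 1]))  # True: multiplication is order-independent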
Summing the elements is already one of the simplest things you could do. But I don't think it's a particularly good hash function w.r.t. pseudo randomness.
If you sort your arrays before storing them or computing hashes, every good hash function will do.
If it's about speed: Have you measured where the bottleneck is? If your hash function is giving you a lot of collisions and you have to spend most of the time comparing the arrays bit-by-bit the hash function is obviously not good at what it's supposed to do. Sorting + Better Hash might be the solution.
If I understand your question correctly you want to test equality between sets where the items are not ordered. This is precisely what a Bloom filter will do for you. At the expense of a small number of false positives (in which case you'll need to make a call to a brute-force set comparison) you'll be able to compare such sets by checking whether their Bloom filter hash is equal.
The algebraic reason why this holds is that the OR operation is commutative. This holds for other semirings, too.
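As a toy illustration of that commutativity argument (not a full Bloom filter; the hashing scheme here is an arbitrary assumption), an OR-combined fingerprint is identical for any permutation of the values, and equal fingerprints would still be confirmed by a brute-force comparison:

def bloom_fingerprint(values, bits=64, k=3):
    # OR together a few hash-derived bits per element; OR is commutative,
    # so the fingerprint is independent of the order of the values.
    fp = 0
    for v in values:
        h = hash((v, 0x9E3779B9))
        for i in range(k):
            fp |= 1 << ((h >> (i * 8)) & (bits - 1))  # bits must be a power of two
    return fp

print(bloom_fingerprint([1, 10, 3, 18]) == bloom_fingerprint([10, 18, 3, 1]))  # True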
Depending on whether you have a lot of collisions (the same hash but not a permutation), you might presort the arrays while hashing them. In that case you can do a more aggressive kind of hashing, where you don't only add up the numbers but also mix in some bit magic to get quite different hashes.
This is only beneficial if you get loads of unwanted collisions because the hash you are using now is too poor. If you hardly get any collisions, the method you are using seems fine.
I would suggest this:
1. Check if the lengths of the permutations are the same (if not, they are not equal).
2. Sort only one array. Instead of sorting the other array, iterate through the elements of the 1st array and search for the presence of each of them in the 2nd array (compare only while the elements in the 2nd array are smaller; do not iterate through the whole array).
Note: if you can have repeated numbers in your permutations (e.g. [1,2,2,10]), then you will need to remove an element from the 2nd array when it matches a member of the 1st one.
pseudo-code:
if length(arr1) <> length(arr2) return false;
sort(arr2);
for i=1 to length(arr1) {
    elem = arr1[i];
    j = 1;
    while (j <= length(arr2) and arr2[j] < elem) j = j + 1;
    if elem <> arr2[j] return false;
}
return true;
The idea is that instead of sorting the other array, we can just try to match all of its elements against the sorted array.
You can probably reduce the collisions a lot by using the product as well as the sum of the terms.
1*10*3*18=540 and 10*18*3*1=540
so the sum-product hash would be [32, 540].
You still need to do something about collisions when they do happen, though.
I like using a string's default hash code (Java, C#; not sure about other languages); it generates pretty unique hash codes.
So first sort the array, and then generate a unique string using some delimiter.
You can do the following (Java):
int[] arr = selectRandomNumbers();
Arrays.sort(arr);
int hash = (arr[0] + "," + arr[1] + "," + arr[2] + "," + arr[3]).hashCode();
If performance is an issue, you can replace the inefficient string concatenation with StringBuilder or String.format:
String.format("%d,%d,%d,%d", arr[0], arr[1], arr[2], arr[3]);
A string's hash code of course doesn't guarantee that two distinct strings have different hashes, but with this formatting, collisions should be extremely rare.

Resources