What's the best data structure for storing 2-tuples (a, b) that supports adding and deleting tuples, and comparing (on either a or b)? - algorithm

So here is my problem. I want to store 2-tuples (key, val) and perform the following operations:
keys are strings and values are integers
multiple keys can have the same value
adding new tuples
updating any key with a new value (any new or updated value is greater than the previous one, like timestamps)
fetching all the keys with values less than or greater than a given value
deleting tuples.
A hash table seems the obvious choice for updating a key's value, but then lookups by value will take O(n). The other option is a balanced binary search tree with key and value switched; now lookups by value are fast (O(log n)), but updating a key takes O(n). So is there a data structure that addresses both issues?
Thanks.

I'd use two data structures: a hash table from keys to values, and a search tree ordered by value and then by key. When inserting, insert the pair into both structures; when deleting by key, look up the value in the hash table and then remove the pair from the tree. Updating is basically delete + insert. Insert, delete and update are all O(log n). To fetch all the keys with values less than a given value, look up that value in the search tree and iterate backwards; this is O(log n + k).
The choices for good hash table and search tree implementations depend a lot on your particular distribution of data and operations. That said, a good general purpose implementation of both should be sufficient.
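As a sketch of this two-structure idea, here is a minimal Python version that stands in for the tree with a sorted list of (value, key) pairs maintained via bisect (note: list insertion/deletion is O(n), not the O(log n) a real balanced tree gives; the class and method names are illustrative, not a known library API):

```python
import bisect

class DualIndex:
    """Sketch: a dict for key lookups plus a sorted (value, key) list for range queries."""
    def __init__(self):
        self.by_key = {}    # key -> value
        self.by_value = []  # sorted list of (value, key) pairs

    def insert(self, key, value):
        if key in self.by_key:
            self.delete(key)                  # update = delete + insert
        self.by_key[key] = value
        bisect.insort(self.by_value, (value, key))

    def delete(self, key):
        value = self.by_key.pop(key)
        i = bisect.bisect_left(self.by_value, (value, key))
        del self.by_value[i]

    def keys_with_value_less_than(self, value):
        # All pairs strictly below (value, "") in sort order.
        i = bisect.bisect_left(self.by_value, (value, ""))
        return [k for _, k in self.by_value[:i]]

d = DualIndex()
d.insert("a", 1); d.insert("b", 3); d.insert("c", 2)
d.insert("b", 5)                       # timestamps only grow, per the question
print(d.keys_with_value_less_than(3))  # ['a', 'c']
```

Swapping the sorted list for a real balanced tree (or a B-tree-backed sorted container) recovers the O(log n) bounds described above.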

For a binary search tree, insert is an O(log n) operation on average and O(n) in the worst case. The same holds for lookup. So I believe this should be your choice.

Dictionary or Map types tend to be based on one of two structures:
Balanced tree (guarantees O(log n) lookup).
Hash based (best case is O(1), but a poor hash function for the data could result in O(n) lookups).
Any book on algorithms should cover both in lots of detail.
To provide operations both on keys and values, there are also multi-index based collections (with all the extra complexity) which maintain multiple structures (much like an RDBMS table can have multiple indexes). Unless you have a lot of lookups over a large collection the extra overhead might be a higher cost than a few linear lookups.

You can create a custom data structure which holds two dictionaries, i.e. a hash table from keys to values and another hash table from values to lists of keys.
class Foo:
    def __init__(self):
        self.keys = {}    # key -> value
        self.values = {}  # value -> list of keys

    def add_tuple(self, kd, vd):
        self.keys[kd] = vd
        if vd in self.values:
            self.values[vd].append(kd)
        else:
            self.values[vd] = [kd]

f = Foo()
f.add_tuple('a', 1)
f.add_tuple('b', 2)
f.add_tuple('c', 3)
f.add_tuple('d', 3)
print(f.keys)
print(f.values)
print(f.keys['a'])
print(f.values[3])
print([f.values[v] for v in f.values if v > 1])
OUTPUT:
{'a': 1, 'b': 2, 'c': 3, 'd': 3}
{1: ['a'], 2: ['b'], 3: ['c', 'd']}
1
['c', 'd']
[['b'], ['c', 'd']]

Related

How to order a list according to an arbitrary order

I searched for a relevant question but couldn't find one. So my question is: how do I sort an array based on an arbitrary order? For example, let's say the ordering is:
order_of_elements = ['cc', 'zz', '4b', '13']
and my list to be sorted:
list_to_be_sorted = ['4b', '4b', 'zz', 'cc', '13', 'cc', 'zz']
so the result needs to be:
ordered_list = ['cc', 'cc', 'zz', 'zz', '4b', '4b', '13']
please note that the reference list (order_of_elements) describes the ordering, and I'm not asking about sorting according to the alphabetically sorted indices of the reference list.
You can assume that the order_of_elements array includes all the possible elements.
Any pseudocode is welcome.
A simple and Pythonic way to accomplish this would be to compute an index lookup table for the order_of_elements array, and use the indices as the sorting key:
order_index_table = { item: idx for idx, item in enumerate(order_of_elements) }
ordered_list = sorted(list_to_be_sorted, key=lambda x: order_index_table[x])
The table reduces order lookup to O(1) (amortized) and thus does not change the time complexity of the sort.
(Of course it does assume that all elements in list_to_be_sorted are present in order_of_elements; if this is not necessarily the case then you would need a default return value in the key lambda.)
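For that missing-element case, a small sketch using dict.get with a past-the-end default index, so unknown elements sort after all known ones (the 'xx' element here is invented for illustration):

```python
order_of_elements = ['cc', 'zz', '4b', '13']
order_index_table = {item: idx for idx, item in enumerate(order_of_elements)}

# Unknown elements get index len(order_of_elements), so they sort last.
list_with_unknown = ['4b', 'xx', 'cc']
ordered = sorted(list_with_unknown,
                 key=lambda x: order_index_table.get(x, len(order_of_elements)))
print(ordered)  # ['cc', '4b', 'xx']
```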
Since you have a limited number of possible elements, and if these elements are hashable, you can use a kind of counting sort.
Put all the elements of order_of_elements in a hash map as keys, with counters as values. Traverse your list_to_be_sorted, incrementing the counter corresponding to the current element. To build ordered_list, go through order_of_elements and append each element the number of times indicated by its counter.
hashmap hm;
for e in order_of_elements {
    hm.add(e, 0);
}
for e in list_to_be_sorted {
    hm[e]++;
}
list ordered_list;
for e in order_of_elements {
    ordered_list.append(e, hm[e]); // append hm[e] copies of element e
}
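In Python, this counting-sort idea fits in a couple of lines with collections.Counter:

```python
from collections import Counter

order_of_elements = ['cc', 'zz', '4b', '13']
list_to_be_sorted = ['4b', '4b', 'zz', 'cc', '13', 'cc', 'zz']

# Count occurrences, then emit each element in reference order, count times.
counts = Counter(list_to_be_sorted)
ordered_list = [e for e in order_of_elements for _ in range(counts[e])]
print(ordered_list)  # ['cc', 'cc', 'zz', 'zz', '4b', '4b', '13']
```

This is O(n + m) for n elements and m reference entries, versus O(n log n) for a comparison sort.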
Approach:
1. create an auxiliary array which holds, for each element of the main array, its index in order_of_elements.
2. sort the auxiliary array.
2.1 re-arrange the values in the main array while sorting the auxiliary array.

Check if a value belongs to a hash

I'm not sure if this is actually possible, thus I ask here. Does anyone know of an algorithm that would allow something like this?
const values = ['a', 'b', 'c', 'd'];
const hash = createHash(values); // => xjaks14sdffdghj23h4kjhgd9f81nkjrsdfg9aiojd
hash.includes('b'); // => true
hash.includes('v'); // => false
What this snippet does is first create some sort of hash from a list of values, and then check whether a certain value belongs to that hash.
Hash functions in general
The primary idea of hash functions is to reduce the space; that is, the functions are not injective, as they map from a bigger domain to a smaller one.
So they produce collisions. That is, there are different elements x and y that get mapped to the same hash value:
h(x) = h(y)
So basically you lose information about the given argument x.
However, in order to answer the question whether all values are contained, you would need to keep all information (or at least all non-duplicates). This is obviously not possible for nearly all practical hash functions.
A possible hash function would be the identity function:
h(x) = x for all x
but this doesn't reduce the space, so it's not practical.
A natural idea would be to compute hash values of the individual elements and then concatenate them, like
h(a, b, c) = (h(a), h(b), h(c))
But this again doesn't reduce the space, hash values are as long as the message, not practical.
Another possibility is to drop all duplicates, so given values [a, b, c, a, b] we only keep [a, b, c]. But this, in most examples, only reduces the space marginally, again not practical.
But no matter what you do, you can not reduce more than the amount of non-duplicates. Else you wouldn't be able to answer the question for some values. For example if we use [a, b, c, a] but only keep [a, b], we are unable to answer "was c contained" correctly.
Perfect hash functions
However, there is the field of perfect hash functions (Wikipedia). Those are hash-functions that are injective, they don't produce collisions.
In some areas they are of interest.
For those you may be able to answer that question, for example if computing the inverse is easy.
Cryptographic hash functions
If you talk about cryptographic hash functions, the answer is no.
Those need to have three properties (Wikipedia):
Pre-image resistance - Given h it should be difficult to find m : hash(m) = h
Second pre-image resistance - Given m it should be difficult to find m' : hash(m) = hash(m')
Collision resistance - It should be difficult to find (m, m') : hash(m) = hash(m')
Informally you have especially:
A small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value.
If you could answer such membership queries on a hash value, you would be able to easily reconstruct the message by asking whether each possible value is contained. Using that, you could easily construct collisions on purpose, and so on.
Details would however depend on the specific hash algorithm.
For a toy-example let's use the previous algorithm that simply removes all duplicates:
[a, b, c, a, a] -> [a, b, c]
In that case we find messages like
[a, b, c]
[a, b, c, a]
[a, a, b, b, c]
...
that all map to the same hash value.
If the hash function produces collisions (as almost all hash functions do), this cannot be possible.
Think about it this way: if, for example, h('abc') = x and h('abd') = x, how can you decide based on x whether the original string contains 'd'?
You could arguably decide to use the identity as a hash function, which would do the job.
A trivial solution would be simple hash concatenation.
func createHash(values) {
    var hash;
    foreach (v in values)
        hash += MD5(v);
    return hash;
}
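A runnable Python version of that sketch (note the caveat it illustrates: the "hash" grows linearly with the number of values, so it is not a fixed-length hash; membership is a substring check over the concatenated digests):

```python
import hashlib

def create_hash(values):
    # Concatenate one MD5 digest per value; length grows with the input.
    return "".join(hashlib.md5(v.encode()).hexdigest() for v in values)

def includes(h, value):
    return hashlib.md5(value.encode()).hexdigest() in h

h = create_hash(['a', 'b', 'c', 'd'])
print(includes(h, 'b'))  # True
print(includes(h, 'v'))  # False (a chance cross-boundary match is astronomically unlikely)
```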
Can it be done with a fixed-length hash and variable input? I'd bet it's impossible.
In the case of a string hash (such as those used in HashMaps), because it is additive, I think we can match partially (a prefix match, but not a suffix).
const values = ['a', 'b', 'c', 'd'];
const hash = createStringHash(values); // => xjaks14sdffdghj23h4kjhgd9f81nkjrsdfg9aiojd
hash.includes('a'); // => true
hash.includes('a', 'b'); // => true
hash.includes('a', 'b', 'v'); // => false
Bit arrays
If you don't care what the resulting hash looks like, I'd recommend just using a bit array.
Take the range of all possible values
Map this to the range of integers starting from 0
Let each bit in our hash indicate whether or not this value appears in the input
This will require 1 bit for every possible value (which could be a lot of bits for large ranges).
Note: this representation is optimal in terms of the number of bits used, assuming there's no limit on the number of elements you can have (beyond 1 of each possible value) - if it were possible to use any fewer bits, you'd have an algorithm that's capable of providing guaranteed compression of any data, which is impossible by the pigeonhole principle.
For example:
If your range is a-z, you can map this to 0-25, then [a,d,g,h] would map to:
10010011000000000000000000 = 38535168 = 0x24c0000
(abcdefghijklmnopqrstuvwxyz)
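A minimal sketch of this bit-array scheme for the a-z example, with bit 25 standing for 'a' down to bit 0 for 'z', so the resulting integer matches the binary string above:

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def create_hash(values):
    # Set one bit per distinct value present in the input.
    h = 0
    for v in values:
        h |= 1 << (len(ALPHABET) - 1 - ALPHABET.index(v))
    return h

def includes(h, value):
    return bool((h >> (len(ALPHABET) - 1 - ALPHABET.index(value))) & 1)

h = create_hash(['a', 'd', 'g', 'h'])
print(h, hex(h))         # 38535168 0x24c0000
print(includes(h, 'a'))  # True
print(includes(h, 'v'))  # False
```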
More random-looking hashes
If you care what the hash looks like, you could take the output from the above and perform a perfect hash on it to map it either to the same length hash or a longer hash.
One trivial example of such a map would be to increment the resulting hash by a randomly chosen but deterministic value (i.e. it's the same for every hash we convert) - you can also do this for each byte (with wrap-around) if you want (e.g. byte0 = (byte0+5)%255, byte1 = (byte1+18)%255).
To determine whether an element appears, the simplest approach would be to reverse the above operation (subtract instead of add) and then just check if the corresponding bit is set. Depending on what you did, it might also be possible to only convert a single byte.
Bloom filters
If you don't mind false positives, I'd recommend just using a Bloom filter instead.
In short, this sets multiple bits for each value, and then checks each of those bits to check whether a value is in our collection. But the bits that are set for one value can overlap with the bits for other values, which allows us to significantly reduce the number of bits required at the cost of a few false positives (assuming the total number of elements isn't too large).
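A toy Bloom filter sketch, assuming arbitrary parameter choices (real implementations size num_bits and num_hashes from the expected element count and target false-positive rate):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive k bit positions from k independent digests of the value.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # True may be a false positive; False is always correct.
        return all(self.bits >> pos & 1 for pos in self._positions(value))

bf = BloomFilter()
for v in ['a', 'b', 'c', 'd']:
    bf.add(v)
print(bf.might_contain('b'))  # True (added values are never reported absent)
print(bf.might_contain('v'))  # False with high probability
```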

Does an ordered Map knows how to search efficiently for a key in Scala?

Does an ordered Map knows how to search efficiently for a key in Scala?
Imagine I have a Map:
val unorderedMap: Map[Int, String] = ...
val orderedMap: Map[Int, String] = unorderedMap.sort
Is the lookup operation for a key faster in orderedMap?
unorderedMap.get(i) // Slower???
orderedMap.get(i) // Faster???
Does the compiler know how to search efficiently?
Does the compiler perform the lookup operation differently in each case?
*EDIT:
I have a
case class A(key: Int, value1: String, value2: String, ...)
val SeqA: Seq[A] = Seq(A(1, "One", "Uno", ...), A(2, "Two", "Duo", ...), ..., A(20000, ..., ...))
I want to have fast lookup operations on key (that's ONLY what I'm interested in).
Is it better to make a Map out of it, like:
val mapA = SeqA.map(a => a.key -> a)(collection.breakOut)
Or is it better to leave it as a Seq (and maybe order it)?
Then, if I make it a Map, should I order it or not? The elements number around 20K - 30K.
Sorted maps are usually(*) slower than hash maps in any language. This is because sorted maps have O(log n) lookup complexity, compared to hash maps, which have O(1) amortized complexity.
You should have a look at the relevant wiki pages for a more in-depth explanation.
(*) That depends on many factors, like the size of the map. For small sets, sorted arrays with binary searches might do better if they fit in cache.
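The same trade-off sketched in Python as an analogy (the Scala collections differ in detail): build the hash map once, and every subsequent key lookup is O(1) amortized, regardless of any ordering of the entries.

```python
from dataclasses import dataclass

@dataclass
class A:
    key: int
    value1: str
    value2: str

# Small stand-in for the 20K-30K element Seq from the question.
seq_a = [A(1, "One", "Uno"), A(2, "Two", "Duo"), A(3, "Three", "Tre")]

# Build the index once: O(n). Each subsequent lookup is O(1) amortized.
map_a = {a.key: a for a in seq_a}

print(map_a[2].value1)  # Two
```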

O(1) find value from a key in a range

What kind of data structure would allow me to get the corresponding value for a given key in a set of ordered, range-like keys, where my key is not necessarily in the set?
Consider, [key, value]:
[3, 1]
[5, 2]
[10, 3]
Looking up 3 or 4 would return 1, 5 - 9 would return 2, and 10 would return 3. The ranges are not constant-sized.
O(1) or near-O(1) is important, if possible.
A balanced binary search tree will give you O(log n).
What about a key-indexed array? Say you know your keys are below 1000; then you can simply fill an int[1000] with values, like this:
[0,0]
[1,0]
[2,0]
[3,1]
[4,1]
[5,2]
......
and so on. That'll give you O(1) performance, but a huge memory overhead.
Otherwise, a hash table is the closest thing I know of. Hope it helps.
Edit: look up the red-black tree; it's a self-balancing tree with a worst case of O(log n) for searching.
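If the boundary keys can be kept sorted, the O(log n) lookup the tree suggestion describes can also be done with a plain binary search over the boundaries (Python's bisect here, using the keys from the question):

```python
import bisect

# Sorted lower-bound keys and their values: 3..4 -> 1, 5..9 -> 2, 10.. -> 3.
bounds = [3, 5, 10]
vals = [1, 2, 3]

def lookup(x):
    # Find the rightmost boundary <= x; O(log n) per query.
    i = bisect.bisect_right(bounds, x) - 1
    if i < 0:
        raise KeyError(x)
    return vals[i]

print(lookup(4))   # 1
print(lookup(7))   # 2
print(lookup(10))  # 3
```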
I would use a Dictionary in this scenario. Retrieving a value by its key is very fast, close to O(1).
Dictionary<int, int> myDictionary = new Dictionary<int, int>();
myDictionary.Add(3, 1);
myDictionary.Add(5, 2);
myDictionary.Add(10, 3);
// If you know the key exists, you can use the indexer:
int value = myDictionary[3];
// If you don't know whether the key is in the dictionary, use the TryGetValue method:
int result;
if (myDictionary.TryGetValue(3, out result))
{
    // The key was found and the corresponding value is stored in "result"
}
For more info: http://msdn.microsoft.com/en-us/library/xfhwa508.aspx

Efficient algorithms for merging hashes into a sparse matrix

I have time data with irregular intervals and I need to convert it to a sparse matrix for use with a graphing library.
The data is currently in the following format:
{
:series1 => [entry, entry, entry, entry, ...],
:series2 => [entry, entry, entry, entry, ...]
}
where entry is an object with two properties: timestamp (a unix timestamp) and value (an integer).
I need to put it in this format in as close to O(n) time as possible.
{
timestamp1 => [ value, value, nil ],
timestamp2 => [ value, nil, value ],
timestamp3 => [ value, value, value],
...
}
Here each row represents a point in time for which I have an entry. Each column represents a series (a line on a line graph). That's why it is very important to represent missing values with nil.
I have some pretty slow implementations but this seems like a problem that has been solved before so I'm hoping that there is a more efficient way to do this.
I'm slightly confused by your asking for O(n), so feel free to correct me, but as far as I can tell, O(n) is easily possible.
First find the length of your starting hash (the number of series in the data). This should be O(1), but no worse than O(S) (where S is the number of series), and S <= n (assuming no series without values), so it is still O(n).
Store this length somewhere, and then setup your hash for the sparse matrix to automatically initialise any row to an empty array of this size.
matrix = Hash.new {|hsh,k| hsh[k] = Array.new(S)}
Then simply go through each series, by index. And for each entry, set the appropriate cell in the array to be the right value.
For each entry, this is O(1) (average) for the lookup of the timestamp in the hash, then O(1) for setting the cell in the array. This happens n times, giving you O(n) there.
There will also be the creation of an Array for each row in the matrix. As far as I am aware this is O(1) per Array, so O(T) (where T is the number of timestamps) overall. As we are not creating empty rows where there are no entries with that timestamp, T must be <= n, so this is O(n) as well.
So overall we have O(n) + O(n) + O(n) = O(n). There are probably ways to speed this up in Ruby, but as far as I am aware this is not only close to, but actually O(n).
How about something like this:
num = series.count
timestamps = {}
series.each_with_index do |(k, entries), i|
entries.each do |entry|
timestamps[entry.timestamp] ||= Array.new(num)
timestamps[entry.timestamp][i] = entry.value
end
end
I'm not sure, though, about the initial ordering of your series; I guess your real situation is a bit more complex than presented in the question.
