O(1) find value from a key in a range - data-structures

What kind of data structure would allow me to get the corresponding value for a given key from a set of ordered, range-like keys, where my key is not necessarily in the set?
Consider, [key, value]:
[3, 1]
[5, 2]
[10, 3]
Looking up 3 or 4 would return 1, 5 through 9 would return 2, and 10 would return 3. The ranges are not of constant size.
O(1) or near-O(1) lookup is important, if possible.

A balanced binary search tree will give you O(log n).
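If the set of keys does not change, the same O(log n) bound is available without a tree: keep the range starts in a sorted array and binary-search it. A minimal sketch in Python, using the keys from the question; `range_lookup` is a hypothetical helper name, and mapping keys above 10 to the last range is an assumption the question does not settle.

```python
import bisect

# Range starts and their values, taken from the question's example.
range_starts = [3, 5, 10]
range_values = [1, 2, 3]

def range_lookup(key):
    """Return the value of the last range start <= key, or None if key is below every start."""
    # bisect_right gives the number of starts that are <= key.
    i = bisect.bisect_right(range_starts, key) - 1
    # Assumption: keys above the last start still belong to the last range.
    return range_values[i] if i >= 0 else None

print(range_lookup(4))  # 1
print(range_lookup(9))  # 2
```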

What about a key-indexed array? Say you know your keys are below 1000; you can simply fill an int[1000] with values, like this:
[0,0]
[1,0]
[2,0]
[3,1]
[4,1]
[5,2]
......
and so on. That will give you O(1) lookups, but with a large memory overhead.
Otherwise, a hash table is the closest I know of. Hope it helps.
Edit: look up the red-black tree; it's a self-balancing tree with a worst case of O(log n) for searching.
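The key-indexed array from this answer could be filled along these lines. This is only a sketch in Python: `build_lookup_table` is a hypothetical name, the bound of 1000 comes from the answer, and treating every key from 10 up to 999 as part of the last range is an assumption.

```python
def build_lookup_table(pairs, size):
    """Expand sorted (start_key, value) pairs into a dense list indexed by key."""
    table = [0] * size  # 0 marks keys below the first range, as in the listing above
    # Pair each range with the start of the next one; the last range runs to `size`.
    for (start, value), (next_start, _) in zip(pairs, pairs[1:] + [(size, None)]):
        for key in range(start, next_start):
            table[key] = value
    return table

table = build_lookup_table([(3, 1), (5, 2), (10, 3)], 1000)
print(table[4])  # 1, found with a single array access
```

The lookup itself is one index operation; all the cost is paid up front in memory and in the fill loop.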

I would use a Dictionary in this scenario. Retrieving a value by its key is very fast, close to O(1), as long as the key is actually stored in the dictionary.
Dictionary<int, int> myDictionary = new Dictionary<int, int>();
myDictionary.Add(3,1);
myDictionary.Add(5,2);
myDictionary.Add(10,3);
// If you know the key exists, you can use the indexer
int known = myDictionary[3];
// If you don't know whether the key is in the dictionary, use the TryGetValue method
int value;
if (myDictionary.TryGetValue(3, out value))
{
    // The key was found and the corresponding value is stored in "value"
}
For more info: http://msdn.microsoft.com/en-us/library/xfhwa508.aspx

Related

Hashing with division remainder method

I don't understand this exercise.
Hash the keys: (13,17,39,27,1,20,4,40,25,9,2,37) into a hash table of size 13 using the division-remainder method.
a) find a suitable value for m.
b) handle collisions using linked lists and visualize the result in a table like this
0→
1→
2→
3→
4→
5→
6→
...
c) Handle collisions with linear probing using the sequence s(j) = j and illustrate the development in a table by starting a new column for every insert (don’t forget to copy the cells already filled to the right) and by using downwards arrows to show the probing steps in case of collisions.
My attempt:
a) If the table size is 13, m also has to be 13 because of the residue classes
b) for example 0→ 39 -> 13 ....
c) I have no idea
It would be really great if someone could help me solve it. :)
Let me give a brief overview of the topics involved here.
A hash map is a data structure that uses a hash function to map identifying values, known as keys, to their associated values. It contains key-value pairs and allows retrieving a value by its key.
Just as you can get any element of an array using an index, you can get any value in a hash map using a key.
Basically, something like this happens: you are given a key, which is a string here; the key is hashed, and the value is put at that index in an array.
In our example image, if you want the value for "Billy", you hash "Billy" again and get 03. Now you just check index 3, and that's the stored value for the key "Billy".
In your case you have to hash integers, not strings.
Now, how do you hash keys?
There are several methods: for example, you could sum the ASCII values of the characters of a string, or use anything else you can think of.
Let's say you have this array: [100, 1, 3, 56, 80]
and you have to store it in a bucket array of size 13.
We can't use the array values directly as indices, because we would need both index 1 and index 100, which would force the bucket array to have more than 100 slots.
But if you take the remainder of each number when divided by 13, the remainder is always guaranteed to be between 0 and 12, so a bucket array of size 13 suffices if you hash keys using the division method.
[100, 1, 3, 56, 80] remainder with 13 -> [9, 1, 3, 4, 2]
Thus you store 100's value at index 9, and so on.
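The division method described here is a one-liner in code. A small sketch in Python, where `division_hash` is a hypothetical helper name and the table size of 13 comes from the exercise:

```python
def division_hash(key, table_size):
    """Division-remainder method: the bucket index is key mod table_size."""
    return key % table_size

keys = [100, 1, 3, 56, 80]
indices = [division_hash(k, 13) for k in keys]
print(indices)  # [9, 1, 3, 4, 2]
```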
Collision:
But what if the array contains both 2 and 80? Both give remainder 2 after hashing. What do we store at index 2 now?
In our example image, let's say "SACHU" also gives 03 after hashing. Now two keys map to the same index; this is called a collision, and it can be resolved using two methods:
Linked-list storage: store both values at the same index using a linked list.
Linear probing: in simple words, index 03 is already occupied, so we try to find the next empty index. With the simplest probing in our image example, 06 is empty, so we store the value for "SACHU" at 06, not 03.
(This part is a little hard, so I highly suggest you read up on hashing and collisions online.)
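Now, let h(x) denote the hash of an integer x.
For a number x, the first index tried is h1 = h(x).
If index h1 is not empty, we try another index derived from the probing sequence, and so on until an empty slot is found.
I am not sure, but I guess this is what is meant by the s(j) = j method: the j-th retry looks j slots away from h(x), wrapping around modulo the table size (whether the step goes up or down depends on your course's definition).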
THESE ARE THE METHODS WHICH YOU HAVE TO USE IN YOUR PROBLEM.
I'd prefer that you give it a try first.
You can read more about it online, and comment if you are still not able to solve it.
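For checking your own attempt afterwards, the two collision strategies can be sketched in Python with the keys from the exercise. The function names are hypothetical, new keys are appended at the tail of each chain, and the probe direction (upward, index h(x) + j mod m) is an assumption; some textbooks probe downward instead.

```python
KEYS = [13, 17, 39, 27, 1, 20, 4, 40, 25, 9, 2, 37]
M = 13  # table size, also used as the divisor

def insert_with_chaining(keys, m):
    """Separate chaining: every slot holds a list of all keys hashing to it."""
    table = [[] for _ in range(m)]
    for key in keys:
        table[key % m].append(key)
    return table

def insert_with_linear_probing(keys, m):
    """Open addressing with s(j) = j: try h(x), h(x)+1, h(x)+2, ... modulo m."""
    table = [None] * m
    for key in keys:
        home = key % m
        for j in range(m):
            slot = (home + j) % m  # assumption: probes move upward
            if table[slot] is None:
                table[slot] = key
                break
        else:
            raise RuntimeError("hash table is full")
    return table

chained = insert_with_chaining(KEYS, M)
probed = insert_with_linear_probing(KEYS, M)
for index in range(M):
    print(index, chained[index], probed[index])
```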

How to order a list according to an arbitrary order

I searched for a relevant question but couldn't find one. So my question is: how do I sort an array based on an arbitrary order? For example, let's say the ordering is:
order_of_elements = ['cc', 'zz', '4b', '13']
and my list to be sorted:
list_to_be_sorted = ['4b', '4b', 'zz', 'cc', '13', 'cc', 'zz']
so the result needs to be:
ordered_list = ['cc', 'cc', 'zz', 'zz', '4b', '4b', '13']
Please note that the reference list (order_of_elements) describes the ordering itself; I am not asking about sorting according to the alphabetically sorted indices of the reference list.
You can assume that order_of_elements array includes all the possible elements.
Any pseudocode is welcome.
A simple and Pythonic way to accomplish this would be to compute an index lookup table for the order_of_elements array, and use the indices as the sorting key:
order_index_table = { item: idx for idx, item in enumerate(order_of_elements) }
ordered_list = sorted(list_to_be_sorted, key=lambda x: order_index_table[x])
The table reduces order lookup to O(1) (amortized) and thus does not change the time complexity of the sort.
(Of course it does assume that all elements in list_to_be_sorted are present in order_of_elements; if this is not necessarily the case then you would need a default return value in the key lambda.)
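The default value mentioned in the parenthetical could look like the following sketch. Placing unknown items after all known ones is an assumption (`fallback_rank` is a hypothetical name), and the `'unknown'` entry is made up for illustration.

```python
order_of_elements = ['cc', 'zz', '4b', '13']
list_to_be_sorted = ['4b', 'unknown', 'zz', 'cc', '13', 'cc']

order_index_table = {item: idx for idx, item in enumerate(order_of_elements)}
# Items missing from the table get a rank past every known item.
fallback_rank = len(order_of_elements)
ordered_list = sorted(list_to_be_sorted,
                      key=lambda x: order_index_table.get(x, fallback_rank))
print(ordered_list)  # ['cc', 'cc', 'zz', '4b', '13', 'unknown']
```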
Since you have a limited number of possible elements, and if these elements are hashable, you can use a kind of counting sort.
Put all the elements of order_of_elements in a hashmap as keys, with counters as values. Traverse your list_to_be_sorted, incrementing the counter corresponding to the current element. To build ordered_list, go through order_of_elements and add each element the number of times indicated by its counter.
hashmap hm;
for e in order_of_elements {
    hm.add(e, 0);
}
for e in list_to_be_sorted {
    hm[e]++;
}
list ordered_list;
for e in order_of_elements {
    ordered_list.append(e, hm[e]); // Append hm[e] copies of element e
}
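A runnable version of this pseudocode might look as follows in Python. `sort_by_reference_order` is a hypothetical name; as in the pseudocode, an item that is missing from the reference order is treated as an error.

```python
def sort_by_reference_order(items, order):
    """Counting sort keyed on a reference order; O(len(items) + len(order))."""
    counts = dict.fromkeys(order, 0)
    for item in items:
        counts[item] += 1  # raises KeyError if item is not in the reference order
    result = []
    for element in order:
        result.extend([element] * counts[element])
    return result

order_of_elements = ['cc', 'zz', '4b', '13']
list_to_be_sorted = ['4b', '4b', 'zz', 'cc', '13', 'cc', 'zz']
print(sort_by_reference_order(list_to_be_sorted, order_of_elements))
# ['cc', 'cc', 'zz', 'zz', '4b', '4b', '13']
```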
Approach:
1. Create an auxiliary array that holds, for each element of the main array, its index in 'order_of_elements'.
2. Sort the auxiliary array.
2.1 Rearrange the values in the main array in step with the auxiliary array while sorting.

Write algorithm to return top 2 elements in terms of frequency from a long list of elements

I was asked this question during an interview and was not able to solve it. I wonder if anyone has a good idea how to solve it:
Given a long list of integers, return the two integers with the highest frequencies.
e.g. [1, 2, 3, 1, 4, 5, 6, 7, 8, 6, 1, 8, 8] returns [1,8]
Thank you.
Loop through the list and build a max heap keyed on count, holding each value together with its count.
There is definitely a challenge in how to keep track. Thinking of a quick solution (as is often the case in an interview), I'd probably keep a dictionary that records whether I've already created an object for a given int in the array/list and, if so, its current index in the heap. If it exists, I'd get that object, update its counter and trickle it up in the max heap.
I'll probably have a class that contains data, such as this:
public class MyData
{
    private readonly int _key;
    public MyData(int key)
    {
        _key = key;
        Count = 0;
    }
    public int GetKey()
    {
        return _key;
    }
    public int Count { get; set; }
}
I'll have a structure like this, where the tuple contains the object and its index in the heap array (I'm going for the array implementation of the heap):
var elementsInHeap = new Dictionary<int, Tuple<MyData, int>>();
When looping through the input list, check whether the dictionary has an entry for that int. If so, get the value, get the object, increase the counter, and then trickle it up in the heap. If not, create a new MyData object, have it trickle up in the max heap based on its counter, and when finished add it to the dictionary with its index in the tuple. For the heap you can use the MyData objects, comparing by the counter value when trickling up or down.
Hope this helps, I'm sure there is a smarter solution out there. Hopefully someone will help us with that.
I think the answers that suggest building a heap or sorting the array have O(n log n) complexity.
First build a hash map in which the keys are the (distinct) elements of the array and the values are their frequencies. This map can be easily built in O(n).
Then find the maximum and second maximum of the entries in the map. This can also be done easily in O(n) by iterating through the map entries only once. Even if you decide to iterate twice (find a maximum, remove it and find the next maximum), your complexity will still be O(n).
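The two O(n) steps above can be sketched in Python like this. `top_two_by_frequency` is a hypothetical name, and breaking ties in favour of the value seen first is an assumption the question does not specify.

```python
from collections import Counter

def top_two_by_frequency(values):
    """Return up to two values with the highest counts, most frequent first."""
    counts = Counter(values)  # step 1: O(n) frequency map
    best = second = None
    for value, count in counts.items():  # step 2: one pass over the distinct values
        if best is None or count > counts[best]:
            best, second = value, best
        elif second is None or count > counts[second]:
            second = value
    return [v for v in (best, second) if v is not None]

print(top_two_by_frequency([1, 2, 3, 1, 4, 5, 6, 7, 8, 6, 1, 8, 8]))  # [1, 8]
```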
If you know the range of the numbers (max and min elements), you can use an array and count the frequencies in one loop through the input;
you can also build a heap with the O(n) heap-construction algorithm and extract the max twice,
or use hashing (if you are able to implement it during the interview).

What's the best data structure for storing 2-tuples (a, b) that supports adding and deleting tuples and comparing on either a or b?

So here is my problem. I want to store 2-tuples (key, val) with the following properties and operations:
keys are strings and values are Integers
multiple keys can have same value
adding new tuples
updating any key with new value (any new value or updated value is greater than the previous one, like timestamps)
fetching all the keys with values less than or greater than given value
deleting tuples.
A hash table seems to be the obvious choice for updating a key's value, but then lookups by value will take longer (O(n)). The other option is a balanced binary search tree with key and value switched. Lookups by value will then be fast (O(lg(n))), but updating a key will take O(n). So is there any data structure that addresses both issues?
Thanks.
I'd use two data structures: a hash table from keys to values, and a search tree ordered by values and then by keys. When inserting, insert the pair into both structures; when deleting by key, look up the value in the hash table and then remove the pair from the tree. Updating is basically delete plus insert. Insert, delete and update are O(log n). To fetch all the keys with values less than a given value, look up the value in the search tree and iterate backwards. This is O(log n + k), where k is the number of keys returned.
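A rough Python sketch of this two-structure design follows. Python's standard library has no balanced search tree, so a sorted list maintained with `bisect` stands in for it here; that swap makes insert and delete O(n) because list elements shift, while the range queries stay O(log n + k). The class and method names are hypothetical, and integer values (as in the question) are assumed.

```python
import bisect

class ValueIndexedMap:
    def __init__(self):
        self._value_of = {}  # key -> value
        self._pairs = []     # sorted list of (value, key), standing in for the tree

    def set(self, key, value):
        """Insert a new pair or update an existing key (delete + insert)."""
        self.delete(key)
        self._value_of[key] = value
        bisect.insort(self._pairs, (value, key))

    def delete(self, key):
        if key in self._value_of:
            old_pair = (self._value_of.pop(key), key)
            del self._pairs[bisect.bisect_left(self._pairs, old_pair)]

    def keys_less_than(self, value):
        # (value,) sorts before every (value, key) pair, so this is the first pair >= value.
        end = bisect.bisect_left(self._pairs, (value,))
        return [k for _, k in self._pairs[:end]]

    def keys_greater_than(self, value):
        # Assumes integer values: the first pair with value >= value + 1.
        start = bisect.bisect_left(self._pairs, (value + 1,))
        return [k for _, k in self._pairs[start:]]

m = ValueIndexedMap()
for k, v in [('a', 1), ('b', 2), ('c', 3), ('d', 3)]:
    m.set(k, v)
print(m.keys_less_than(3))     # ['a', 'b']
print(m.keys_greater_than(1))  # ['b', 'c', 'd']
```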
The choices for good hash table and search tree implementations depend a lot on your particular distribution of data and operations. That said, a good general purpose implementation of both should be sufficient.
For a binary search tree, insert is an O(log N) operation on average and O(N) in the worst case (when the tree is unbalanced). The same goes for lookup. So this should be your choice, I believe.
Dictionary or Map types tend to be based on one of two structures.
Balanced tree (guarantee O(log n) lookup).
Hash based (best case is O(1), but a poor hash function for the data could result in O(n) lookups).
Any book on algorithms should cover both in lots of detail.
To provide operations both on keys and values, there are also multi-index based collections (with all the extra complexity) which maintain multiple structures (much like an RDBMS table can have multiple indexes). Unless you have a lot of lookups over a large collection the extra overhead might be a higher cost than a few linear lookups.
You can create a custom data structure that holds two dictionaries,
i.e.
a hash table from keys->values and another hash table from values->lists of keys.
class Foo:
    def __init__(self):
        self.keys = {}    # (KEY=key, VALUE=value)
        self.values = {}  # (KEY=value, VALUE=list of keys)
    def add_tuple(self, kd, vd):
        self.keys[kd] = vd
        if vd in self.values:
            self.values[vd].append(kd)
        else:
            self.values[vd] = [kd]
f = Foo()
f.add_tuple('a', 1)
f.add_tuple('b', 2)
f.add_tuple('c', 3)
f.add_tuple('d', 3)
print(f.keys)
print(f.values)
print(f.keys['a'])
print(f.values[3])
print([f.values[v] for v in f.values if v > 1])
OUTPUT:
{'a': 1, 'b': 2, 'c': 3, 'd': 3}
{1: ['a'], 2: ['b'], 3: ['c', 'd']}
1
['c', 'd']
[['b'], ['c', 'd']]

Storing a bucket of numbers in an efficient data structure

I have buckets of numbers, e.g. 1 to 4, 5 to 15, 16 to 21, 22 to 34, ...
I have roughly 600,000 such buckets. The range of numbers that falls in each bucket varies. I need to store these buckets in a suitable data structure so that lookups for a number are as fast as possible.
So my question is: what is a suitable data structure and sorting mechanism for this type of problem?
Thanks in advance.
If the buckets are contiguous and disjoint, as in your example, you need to store in a vector just the left bound of each bucket (i.e. 1, 5, 16, 22) plus, as the last element, the first number that doesn't fall in any bucket (35). (I assume, of course, that you are talking about integer numbers.)
Keep the vector sorted.
You can search for the bucket in O(log n) with a kind of binary search. To find which bucket a number x belongs to, look for the only index i such that vector[i] <= x < vector[i+1]. If x is strictly less than vector[0], or if it is greater than or equal to the last element of vector, then no bucket contains it.
EDIT. Here is what I mean:
#include <stdio.h>
// ~ Binary search. Should be O(log n)
int findBucket(int aNumber, int *leftBounds, int left, int right)
{
    int middle;
    if(aNumber < leftBounds[left] || leftBounds[right] <= aNumber) // cannot find
        return -1;
    if(left + 1 == right) // found
        return left;
    middle = left + (right - left)/2;
    if( leftBounds[left] <= aNumber && aNumber < leftBounds[middle] )
        return findBucket(aNumber, leftBounds, left, middle);
    else
        return findBucket(aNumber, leftBounds, middle, right);
}
#define NBUCKETS 12
int main(void)
{
    int leftBounds[NBUCKETS+1] = {1, 4, 7, 15, 32, 36, 44, 55, 67, 68, 79, 99, 101};
    // The buckets are 1-3, 4-6, 7-14, 15-31, ...
    int aNumber;
    for(aNumber = -3; aNumber < 103; aNumber++)
    {
        int index = findBucket(aNumber, leftBounds, 0, NBUCKETS);
        if(index < 0)
            printf("%d: Bucket not found\n", aNumber);
        else
            printf("%d belongs to the bucket %d-%d\n", aNumber, leftBounds[index], leftBounds[index+1]-1);
    }
    return 0;
}
You will probably want some kind of sorted tree, like a B-Tree, B+ Tree, or Binary Search tree.
If I understand you correctly, you have a list of buckets and you want, given an arbitrary integer, to find out which bucket it goes in.
Assuming that none of the bucket ranges overlap, I think you could implement this in a binary search tree. That would make the lookup possible in O(log n) (where n = number of buckets).
It would be simple to do this, just define the left branch to be less than the low end of the bucket, the right branch to be greater than the right end. So in your example we'd end up with a tree something like:
      16-21
      /   \
  5-15     22-34
  /
1-4
To search for, say, 7, you just check the root. Less than 16? Yes, go left. Less than 5? No. Greater than 15? No, you're done.
You just have to be careful to balance your tree (or use a self-balancing tree) in order to keep your worst-case performance down. This is especially important if your input (the bucket list) is already sorted, because inserting sorted input into a plain binary search tree degenerates it into a linked list.
+1 to the kind-of binary search idea. It's simple and gives good performance for 600000 buckets. That being said, if it's not good enough, you could create an array with MAX BUCKET VALUE - MIN BUCKET VALUE = RANGE elements, and have each element in this array reference the appropriate bucket. Then, you get a lookup in guaranteed constant [O(1)] time, at the cost of using a huge amount of memory.
If A) the probability of accessing buckets is not uniform and B) you knew / could figure out how likely a given set of buckets were to be accessed, you could probably combine these two approaches to create a kind of cache. For example, say bucket {0, 3} were accessed all the time, as was {7, 13}, then you can create an array CACHE. . .
int cache_low_value = 0;
int cache_hi_value = 13;
CACHE[0] = BUCKET_1
CACHE[1] = BUCKET_1
...
CACHE[6] = BUCKET_2
CACHE[7] = BUCKET_3
CACHE[8] = BUCKET_3
...
CACHE[13] = BUCKET_3
. . . which will allow you to find a bucket in O(1) time, assuming the value Y you're trying to associate with a bucket is between cache_low_value and cache_hi_value (if Y <= cache_hi_value && Y >= cache_low_value; then BUCKET = CACHE[Y]). On the upside, this approach wouldn't use all the memory on your machine; on the downside, it would add the equivalent of an additional operation or two to your bsearch whenever the number is not covered by the cache (since you had to check the cache in the first place).
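Combining the two approaches might look like the sketch below, in Python. The bucket bounds are made up for illustration (buckets 0-3, 4-6, 7-13, 14-19), bucket indices are 0-based rather than BUCKET_1, BUCKET_2, and so on, and `find_bucket` and `find_bucket_cached` are hypothetical names.

```python
import bisect

# Hypothetical sorted left bounds; the last entry is the first number past all buckets.
left_bounds = [0, 4, 7, 14, 20]
CACHE_LOW, CACHE_HIGH = 0, 13  # the frequently accessed range from the answer

def find_bucket(x):
    """Binary search: index of the bucket containing x, or -1 if none does."""
    if x < left_bounds[0] or x >= left_bounds[-1]:
        return -1
    return bisect.bisect_right(left_bounds, x) - 1

# Precompute the bucket index for every number in the hot range.
cache = [find_bucket(x) for x in range(CACHE_LOW, CACHE_HIGH + 1)]

def find_bucket_cached(x):
    """O(1) for numbers in the cached range, O(log n) binary search otherwise."""
    if CACHE_LOW <= x <= CACHE_HIGH:
        return cache[x - CACHE_LOW]
    return find_bucket(x)

print(find_bucket_cached(6))   # 1, served from the cache
print(find_bucket_cached(15))  # 3, served by the binary search
```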
A simple way to store and sort these in C++ is to use a pair of sorted arrays that represent the lower and upper bounds of each bucket. Then you can use int bucket_index = std::distance(upper_bounds.begin(), std::lower_bound(upper_bounds.begin(), upper_bounds.end(), value)) to find the first bucket whose upper bound is not less than the value, and if bucket_index is within range and lower_bounds[bucket_index] <= value, bucket_index is the bucket you want.
You can replace that with a single struct holding the bucket, but the principle will be the same.
Let me see if I can restate your requirement. It's analogous to having, say, the day of the year, and wanting to know which month that day falls in. So, given a year with 600,000 days (an interesting planet), you want to return a string that is one of "Jan", "Feb", "Mar" ... "Dec"?
Let me focus on the retrieval end first, and I think you can figure out how to arrange the data when initializing the data structures, given what has already been posted above.
Create a data structure...
typedef struct {
    unsigned int DayOfYear :20; // a bit-field, donating some bits for other uses
    unsigned int MonthSS :4;    // subscript to select months
    unsigned int Unused :8;     // can be used to make MonthSS 12 bits
} BUCKET_LIST;
const char *MonthStr[12] = { "Jan", "Feb", "Mar", /* ... */ "Dec" };
To initialize, use a for{} loop to set BUCKET_LIST.MonthSS to one of the 12 months in MonthStr.
On retrieval, do a binary search on a vector of BUCKET_LIST.DayOfYear (you'll need to write a trivial compare function for BUCKET_LIST.DayOfYear). Your result can be obtained by using the return from bsearch() as the subscript into MonthStr...
pBucket = (BUCKET_LIST *)bsearch( v_bucket_list);
MonthString = MonthStr[pBucket->MonthSS];
The general approach here is to have collections of "pointers" to the strings attached to the 600,000 entries. All of the pointers in a bucket point to the same string. I used a bit int as a subscript here, instead of 600k 4 byte pointers, because it takes less memory (4 bits vs 4 bytes), and BUCKET_LIST sorts and searches as a species of int.
Using this scheme you'll use no more memory or storage than storing a simple int key, get the same performance as a simple int key, and do away with all the range checking on retrieval, i.e. no if{ } testing. Save those if{ }s for initializing the BUCKET_LIST data structure, and then forget about them on retrieval.
I refer to this technique as subscript aliasing, as it resolves a many-to-one relationship by converting the subscript of the many to the subscript of the one - very efficiently I might add.
My application was to use an array of many UCHARs to index a much smaller array of double floats. The size reduction was enough to keep all of the hot-spot's data in L1 cache on the processor. 3X performance gain just from this one little change.
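The many-to-one idea behind subscript aliasing can be illustrated in a few lines of Python. This is a simplified sketch, not the bit-field layout above: it uses an ordinary 365-day non-leap year instead of 600,000 days, one byte per day instead of 4 bits, and direct indexing instead of bsearch(); `month_subscript` and `month_of` are hypothetical names.

```python
MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]  # non-leap year

# One small subscript per day: all days of a month share the same subscript,
# which in turn selects a single shared string.
month_subscript = bytearray()
for month_index, days in enumerate(DAYS_IN_MONTH):
    month_subscript.extend([month_index] * days)

def month_of(day_of_year):
    """Look up the month name for a 1-based day of year (callers must pass 1..365)."""
    return MONTH_NAMES[month_subscript[day_of_year - 1]]

print(month_of(32))   # Feb
print(month_of(365))  # Dec
```

All range tests happen once, while the table is filled; retrieval is two index operations with no comparisons.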

Resources