Queue-like data structure with fast search and insertion - performance

I need a data structure with the following properties:
It contains integer numbers.
Duplicates aren't allowed (that is, it stores at most one of any integer).
Once it reaches its maximum size, the oldest element is removed.
So if the capacity is 3, then this is how it would look when putting in it sequential numbers:
{}, {1}, {1, 2}, {1, 2, 3}, {2, 3, 4}, {3, 4, 5} etc.
Only two operations are needed: inserting a number into this container (INSERT) and checking if the number is already in the container (EXISTS).
The number of EXISTS operations is expected to be approximately 2 * number of INSERT operations.
I need these operations to be as fast as possible.
What would be the fastest data structure or combination of data structures for this scenario?

Sounds like a hash table using a ring buffer for storage.

O(1) for both insert and lookup (and delete if you eventually need it).
Data structures:
Queue of Nodes containing the integers, implemented as a linked list (queue)
and
HashMap mapping integers to Queue's linked list nodes (hashmap)
Insert:
if (!hashmap.exists(newInt)) { // remove this check to allow dupes
    if (queue.size >= MAX_SIZE) {
        // Evict oldest int from queue and hashmap
        hashmap.remove(queue.pop());
    }
    // Add new int to queue and hashmap
    Node n = new Node(newInt);
    queue.push(n);
    hashmap.add(newInt, n);
}
Lookup:
return hashmap.exists(lookupInt);
Note: With this implementation, you can also remove integers in O(1) since all you have to do is lookup the Node in the hashmap, and remove it from the linked list (by linking its neighbors together.)
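For what it's worth, in Java this exact combination comes built in: a LinkedHashMap is a hash table threaded with a doubly linked list, and removeEldestEntry gives you the eviction for free. A minimal sketch, assuming the no-op-on-duplicate semantics above (class and method names are mine):
import java.util.LinkedHashMap;
import java.util.Map;

class BoundedIntSet {
    private static final Object PRESENT = new Object();
    private final Map<Integer, Object> map;

    BoundedIntSet(final int capacity) {
        // Insertion-ordered map; evicts the eldest entry once capacity is exceeded.
        map = new LinkedHashMap<Integer, Object>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Object> eldest) {
                return size() > capacity;
            }
        };
    }

    void insert(int x) { map.put(x, PRESENT); }           // INSERT
    boolean exists(int x) { return map.containsKey(x); }  // EXISTS
}
Note that removeEldestEntry is only consulted when a put actually adds a new entry, and re-inserting an existing key leaves its position unchanged, so inserting a duplicate never evicts anything.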

You want a ring buffer. The best way to do this is to define an array of the size you want, then maintain indexes for where it starts and ends.
int buffer[3] = {0, 0, 0};
int start = 0;
int end = 0;
#define LAST_INDEX 2
void insert(int data)
{
    buffer[end] = data;
    end = (end == LAST_INDEX) ? 0 : end + 1;
}
void remove_oldest()
{
    start = (start == LAST_INDEX) ? 0 : start + 1;
}
int exists(int data)
{
    // linear search from start to end, jumping around the end of the
    // array if needed; return 1 if found, 0 otherwise
    return 0; // stub
}
start always points to the first item
end always points to the most recent item
the list may wrap over the end of the array
search O(n) (linear scan)
insert O(1)
delete O(1)
For true geek marks, though, build a Bloom filter: http://en.wikipedia.org/wiki/Bloom_filter
It can report false positives, so it isn't guaranteed to be 100% accurate, but it's faster than anything else.
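To make that concrete, here is a toy Bloom filter sketch in Java; the two hash mixers and the sizing are arbitrary choices of mine, not anything from this answer. One caveat: a plain Bloom filter has no delete, so on its own it can't track the ring buffer's evictions; you'd need a counting Bloom filter for that.
import java.util.BitSet;

class IntBloom {
    private final BitSet bits;
    private final int m; // number of bits

    IntBloom(int m) { this.m = m; this.bits = new BitSet(m); }

    // Two cheap integer mixers standing in for independent hash functions.
    private int h1(int x) { return Math.floorMod(x * 0x9E3779B1, m); }
    private int h2(int x) { return Math.floorMod((x ^ (x >>> 16)) * 0x85EBCA6B, m); }

    void insert(int x) { bits.set(h1(x)); bits.set(h2(x)); }

    // May return a false positive, never a false negative.
    boolean mightContain(int x) { return bits.get(h1(x)) && bits.get(h2(x)); }
}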

If you want to remove the lowest value, use a sorted list, and once you have more elements than needed, remove the lowest one.
If you want to remove the oldest value, use a set and a queue. Both the set and the queue contain a copy of each value. If the value is in the set, no-op. If the value isn't in the set, append the value to the queue and add it to the set. If you've exceeded your size, pop the queue and remove that value from the set.
If you need to move duplicated values to the back of the queue, you'll need to switch from a set to a hash table mapping values to stable iterators into the queue and be able to remove from the middle of the queue.
Alternatively, you could use a sorted list and a hash table. Instead of just putting your values into the sorted list, you could put in pairs (id, value) and then have the hash table map from value to (id, value). id would just be incremented after every insert. When you find a match in the hash table, you remove that (id, value) from the list and add a new (id, value) pair at the end of the list. Otherwise you just add to the end of the list and pop from the beginning if it's too long.
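A minimal Java sketch of the set + queue variant described above (evict the oldest, no-op on duplicates; the names are mine):
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

class BoundedDedupQueue {
    private final int capacity;
    private final Set<Integer> seen = new HashSet<>();
    private final ArrayDeque<Integer> order = new ArrayDeque<>();

    BoundedDedupQueue(int capacity) { this.capacity = capacity; }

    void insert(int x) {
        if (!seen.add(x)) return;           // value already present: no-op
        order.addLast(x);
        if (order.size() > capacity) {
            seen.remove(order.pollFirst()); // evict the oldest from both structures
        }
    }

    boolean exists(int x) { return seen.contains(x); }
}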

Related

Which data structure supports given operations efficiently

I need to think of a data structure that supports the following operations efficiently:
1) Add an integer x
2) Delete an integer with the maximum frequency (if more than one element has the same maximum frequency, delete all of them).
I am thinking of implementing a segment tree where each node stores the index of its child with the largest frequency.
Any ideas or suggestions on how to approach this problem or how it should be implemented would be appreciated.
We can use a combination of data structures: a hash_map to maintain the frequency mappings, where the key is the integer and the value is a pointer to a "frequency" node representing the frequency value and the set of integers having that frequency. The frequency nodes will be maintained in a list ordered by the values of the frequencies.
The Frequency node can be defined as
class Freq {
    int frequency;
    Set<Integer> values_with_frequency;
    Freq prev;
    Freq next;
}
The elements HashMap would then contain entries of the form
Entry<Integer, Freq>
So, for a snapshot of the dataset such as
a,b,c,b,d,d,a,e,a,f,b (where the letters denote integers), this is how the data structure would look:
c --+
e --+--> (1, [c, e, f])
f --+
a --+--> (3, [a, b])
b --+
d -----> (2, [d])
The Freq nodes would be maintained in a linked list, say freq_nodes, sorted by the frequency value. Note that, as explained below, there wouldn't be any log(n) operation needed for keeping the list sorted on the add/delete operations.
The add(x) and delete_max_freq() operations could be implemented as follows:
add(x) :
If x is not found in the elements map, check whether the first element of freq_nodes is a Freq object with frequency 1. If so, add x to that object's values_with_frequency set. Otherwise, create a new Freq object with 1 as the frequency value and x as the (for now only) element of its wrapped set values_with_frequency.
Otherwise (i.e. if x is already in the elements map), follow the pointer in x's entry to its Freq object in freq_nodes, and remove x from that object's values_with_frequency set, noting x's current frequency, which is the value of elements.get(x).frequency (call this value F). If the set values_with_frequency is left empty by this removal, delete the corresponding node from the freq_nodes linked list. Finally, if the next Freq node in the freq_nodes linked list has frequency F+1, just add x to that node's values_with_frequency set. Otherwise, create a new Freq node as was done above for the case where no Freq node with frequency 1 existed.
Finally, add the entry (x, Freq) to the elements map.
Note that this whole add(x) operation is going to be O(1) in time.
Here's an example of a sequence of add() operations with the subsequent state of the data structure.
add(a)
a -> N1 : freq_nodes : |N1 (1, {a}) | ( N1 is actually a Freq object)
add(b)
a -> N1 : freq_nodes : |N1 (1, {a, b}) |
b -> N1
add(a)
At this point ‘a’ points to N1; however, its current frequency is 2, so we need to insert a node N2 next to N1 in the DLL, after removing ‘a’ from N1’s values_with_frequency set {a, b}
a -> N2 : freq_nodes : |N1 (1, {b}) | --> |N2 (2, {a}) |
b -> N1
The interesting thing to note here is that any time we increase the frequency of an existing element from F to say F+1, we need to do the following
if (next node has a higher frequency than F+1, or we have reached the end of the list):
    create a new Freq node with frequency equal to F+1 (as is done above)
    and insert it next to the current node
else:
    add ‘a’ (the input to the add() operation) to the values_with_frequency set of the next node
The delete_max_freq() operation would just involve removing the last entry of the linked list freq_nodes, and iterating over the keys in the wrapped set values_with_frequency to remove the corresponding keys from the elements map. This operation would take O(k) time where k is the number of elements with maximum frequency.
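To make the description concrete, here is a compact Java sketch of the Freq-node list (field and method names are mine; the answer itself is language-agnostic):
import java.util.*;

class FreqTracker {
    static class Freq {
        int frequency;
        Set<Integer> values = new HashSet<>();
        Freq prev, next;
        Freq(int f) { frequency = f; }
    }

    private final Map<Integer, Freq> elements = new HashMap<>();
    private Freq head, tail; // freq_nodes, ascending by frequency

    void add(int x) {
        Freq cur = elements.get(x);
        if (cur == null) {                            // new element: lives in the frequency-1 node
            if (head == null || head.frequency != 1) {
                Freq n = new Freq(1);
                n.next = head;
                if (head != null) head.prev = n; else tail = n;
                head = n;
            }
            head.values.add(x);
            elements.put(x, head);
            return;
        }
        Freq next = cur.next;                         // move x from frequency F to F+1
        if (next == null || next.frequency != cur.frequency + 1) {
            Freq n = new Freq(cur.frequency + 1);     // splice a new node in after cur
            n.prev = cur; n.next = cur.next;
            if (cur.next != null) cur.next.prev = n; else tail = n;
            cur.next = n;
            next = n;
        }
        cur.values.remove(x);
        next.values.add(x);
        elements.put(x, next);
        if (cur.values.isEmpty()) unlink(cur);
    }

    List<Integer> deleteMaxFreq() {                   // O(k) for k elements removed
        if (tail == null) return Collections.emptyList();
        List<Integer> out = new ArrayList<>(tail.values);
        for (int v : out) elements.remove(v);
        unlink(tail);
        return out;
    }

    private void unlink(Freq n) {
        if (n.prev != null) n.prev.next = n.next; else head = n.next;
        if (n.next != null) n.next.prev = n.prev; else tail = n.prev;
    }
}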
Assuming "efficient" refers to the way the complexity of those operations scale, big-O style, I'd consider something consisting of:
a hashmap with the integers as keys and their frequencies as values
a tree structure (e.g., a binary search tree) whose nodes each hold a frequency and a hashset of the numbers that have that frequency.
When a number is inserted:
1. Look up the number in the hashmap to find its frequency. (O(1))
2. Look up the frequency in the tree (O(log N)). Remove the number from its collection (O(1)). If the collection is empty, remove the frequency from the tree (O(log N)).
3. Increment the number's frequency. Set that value in the hashmap (O(1)).
4. Look up its new frequency in the tree (O(log N)). If it's there, add the number to the collection there (O(1)). If not, add a new node with the number in its collection (O(log N)).
When deleting items with the maximum frequency:
1. Remove the highest-valued node from the tree (O(log N)).
2. For each number in that node's collection, remove that number's entry from the hashmap (O(1) for each number removed).
If you have N numbers to add and remove, your worst-case scenario should be O(N log N) regardless of the actual distribution of frequencies or the order in which numbers are added and removed.
If you know of any assumptions you can make about the numbers being added, it's possible you could make further enhancements like using an indexed array rather than an ordered tree. But if your inputs are fairly unbounded, this seems like a pretty good structure to handle all the operations you want without getting into O(n²) territory.
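A compact sketch of this combination in Java, with TreeMap standing in for the balanced search tree (names are mine):
import java.util.*;

class FreqTree {
    private final Map<Integer, Integer> freqOf = new HashMap<>();          // number -> frequency
    private final TreeMap<Integer, Set<Integer>> byFreq = new TreeMap<>(); // frequency -> numbers

    void add(int x) {
        int f = freqOf.getOrDefault(x, 0);
        if (f > 0) {                       // steps 1-2: detach x from its old frequency node
            Set<Integer> s = byFreq.get(f);
            s.remove(x);
            if (s.isEmpty()) byFreq.remove(f);
        }
        freqOf.put(x, f + 1);              // step 3
        byFreq.computeIfAbsent(f + 1, k -> new HashSet<>()).add(x); // step 4
    }

    List<Integer> deleteMaxFreq() {
        if (byFreq.isEmpty()) return Collections.emptyList();
        Map.Entry<Integer, Set<Integer>> top = byFreq.pollLastEntry(); // highest frequency
        for (int v : top.getValue()) freqOf.remove(v);
        return new ArrayList<>(top.getValue());
    }
}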
My thoughts:
You will need 2 maps.
Map 1: Integer as key with frequency as value.
Map 2: Have a map of frequencies as keys and list of integers as values.
Add Integer: Add the integer to map 1. Get the frequency. Add it to the list for that frequency key in map 2.
Delete Integer : We can obviously maintain maximum frequency in a variable across these operations. Now, remove the key from map2 which has this max frequency and decrement max frequency.
So, adding and deleting performance should be O(1) on average.
In the above scenario, we will still have integers in map 1 whose stored frequency is stale after the delete from map 2. In this case, when the same integer gets added again, we do an on-demand update in map 1: if the integer's current frequency in map 1 doesn't exist as a key in map 2, it means it was deleted, and we can reset its frequency to 1.
Implementation:
import java.util.*;
class Foo {
    Map<Integer, Integer> map1;      // integer -> frequency
    Map<Integer, Set<Integer>> map2; // frequency -> integers with that frequency
    int max_freq;

    Foo() {
        map1 = new HashMap<>();
        map2 = new HashMap<>();
        map2.put(0, new HashSet<>());
        max_freq = 0;
    }

    public void add(int x) {
        map1.putIfAbsent(x, 0);
        int curr_f = map1.get(x);
        if (!map2.containsKey(curr_f)) {
            map1.put(x, 1); // stale frequency (deleted from map 2): reset to 1
        } else {
            map1.merge(x, 1, Integer::sum);
        }
        map2.putIfAbsent(map1.get(x), new HashSet<>());
        map2.get(map1.get(x) - 1).remove(x); // remove from previous frequency list
        map2.get(map1.get(x)).add(x);        // add to current frequency list
        max_freq = Math.max(max_freq, map1.get(x));
        printState();
    }

    public List<Integer> delete() {
        List<Integer> ls = new ArrayList<>(map2.get(max_freq));
        map2.remove(max_freq);
        max_freq--;
        while (max_freq > 0 && map2.get(max_freq).size() == 0) max_freq--;
        printState();
        return ls;
    }

    public void printState() {
        System.out.println(map1.toString());
        System.out.println("Maximum frequency: " + max_freq);
        for (Map.Entry<Integer, Set<Integer>> m : map2.entrySet()) {
            System.out.println(m.getKey() + " " + m.getValue().toString());
        }
        System.out.println("----------------------------------------------------");
    }
}
Demo: https://ideone.com/tETHKV
Note: The cost of delete() is amortized: a single call is O(k) for the k elements removed, and the scan that lowers max_freq is paid for by the add() calls that raised it.

Grouping numbers in a list

I came across the following question,
You are given an array A of n elements. These elements are added to a new list L, which is initially empty, in a certain order based on the given q queries.
In each query you are given an integer i that corresponds to A[i] in the array A. This means you have to add the element A[i] to the list L.
After each element is added to the list L, make groups among the elements in the list L. Two elements will be in the same group if their indexes in the array A are consecutive.
For each group we define the group's value as a*b, where a is the largest value in that group and b is the size of that group.
Print the maximum group value among all the groups that are formed after each element is added to the list L.
My approach was to use a map<int,vector<int>> where the key is the group number and the value is a vector containing the group's size and its maximum. I also had an array g, where g[i] indicates the group number of a[i], or -1 if it is not in any group. The code below is part of my implementation, but I'm sure there are better ways to solve this question, as this solution gave TLE and WA in some cases, and I can't seem to figure out the correct approach. Please suggest an optimal way to solve this.
int g[a.size()+2]; // +2 because queries start with index 1, and g[i] corresponds to a[i-1]
for (int i = 0; i < a.size() + 2; i++)
    g[i] = -1;
int gno = 1;
map<int, vector<int>> m;
vector<int> ans;
int mx = 0;
for (unsigned int i = 0; i < queries.size(); i++) {
    int q = queries[i];
    if (g[q-1] == -1 && g[q+1] == -1) {
        // create new group with current element as first element
        g[q] = gno; // gno is the group number.
        vector<int> v;
        v.push_back(1);
        v.push_back(a[q-1]);
        m[gno] = v;
        mx = max(mx, m[gno][0] * m[gno][1]);
        gno++;
    }
    else if (g[q-1] != -1 && g[q+1] == -1) {
        // join current element to left group
        g[q] = g[q-1];
        m[g[q]][0]++;
        m[g[q]][1] = max(m[g[q]][1], a[q-1]);
        mx = max(mx, m[g[q]][0] * m[g[q]][1]);
    }
    else if (g[q-1] == -1 && g[q+1] != -1) {
        // join current element to right group
        g[q] = g[q+1];
        m[g[q]][0]++;
        m[g[q]][1] = max(m[g[q]][1], a[q-1]);
        mx = max(mx, m[g[q]][0] * m[g[q]][1]);
    }
    else {
        // join both left and right groups
        g[q] = g[q-1];
        int g1 = g[q];
        int i;
        m[g[q]][0] += 1 + m[g[q+1]][0];
        m[g[q]][1] = max(m[g[q]][1], max(a[q-1], m[g[q+1]][1]));
        for (i = q+1; g[i] == g[i+1]; i++) {
            g[i] = g1;
        }
        g[i] = g1;
        mx = max(mx, m[g[q]][0] * m[g[q]][1]);
    }
    ans.push_back(mx);
}
I would not actually build the list L. It may be too costly in time to figure out what to do with a new value: is it a new group by itself, does it extend an existing group, or do two groups need to merge into one? If the first values are all far apart, you'll have many groups, and you'd need to iterate over them with each new incoming value: this is not efficient.
I would just collect all the values first and only then see how they fit in groups.
There are two ways to collect the values:
Store them in a list, and when all values have been collected, sort the list in ascending order
Flag the entry in an array of booleans of size n. This way you do not have to sort it, but afterwards you do need to iterate the whole array to find the values in ascending order.
Method 1 will be the best when q is a lot less than n. Method 2 will be better for greater q.
With both methods you'll be able to iterate over the found values in ascending order, and while doing so you can identify the groups, their value, and also keep track of the largest group-value. Only one sweep is needed to find the answer.
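A sketch of that single sweep in Java, computing the final answer once all queries have been collected and sorted (method 1; 1-based query indices into A, as in the question; duplicate queries are tolerated):
import java.util.Arrays;

class GroupSweep {
    static long maxGroupValue(int[] a, int[] queries) {
        int[] idx = queries.clone();
        Arrays.sort(idx);
        long best = Long.MIN_VALUE; // assumes at least one query
        int i = 0;
        while (i < idx.length) {
            int j = i;
            int groupMax = a[idx[i] - 1];
            // extend the run while indices are consecutive (a difference of 0 skips duplicates)
            while (j + 1 < idx.length && idx[j + 1] - idx[j] <= 1) {
                j++;
                groupMax = Math.max(groupMax, a[idx[j] - 1]);
            }
            int size = idx[j] - idx[i] + 1; // distinct indices spanned by the run
            best = Math.max(best, (long) groupMax * size);
            i = j + 1;
        }
        return best;
    }
}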
Let's start with two simplifying assumptions:
no duplicates. Once a given index i has been "queried", it will never be queried again.
no negative numbers. All elements are positive or zero, so the largest value in a group is always positive or zero, so expanding a group (or merging two groups) will never cause the overall "maximum group value" to decrease.
(Further below I'll show how to not require those assumptions, but for now this will simplify the picture.)
So, whenever we "query" an index i, there are four cases:
i-1 is currently the right-endpoint of a group (by which I mean its greatest index) and i+1 is currently the left-endpoint of another group.
In this case, we need to merge the two groups into a single group, with i bridging the gap between them.
i-1 is currently the right-endpoint of a group, but i+1 is not currently in any group.
In this case we need to extend the group to cover i.
i-1 is not currently in any group, but i+1 is currently the left-endpoint of a group.
In this case, as in the previous case, we need to extend the group to cover i.
Neither i-1 nor i+1 is in a group.
In this case, we have a new group with just one element.
In all cases, the key thing to note is that we're only interested in the endpoints of groups. So we don't need a general mapping from indices to their groups . . . which is good, because when we merge two groups, it would be expensive to then go and update every single index from one group to point to the other.
So we just need three mappings:
std::unordered_map<int, int> map_from_left_endpoint_to_right_endpoint;
std::unordered_map<int, int> map_from_right_endpoint_to_left_endpoint;
std::unordered_map<int, int> map_from_left_endpoint_to_largest_value;
To distinguish the four cases, we use e.g. map_from_right_endpoint_to_left_endpoint.find(i - 1) (which returns an iterator pointing to the left-endpoint of the group that i-1 is the right-endpoint of, if applicable; otherwise it returns map_from_right_endpoint_to_left_endpoint.end()). We then delete entries as they become no-longer-applicable (due to groups being extended or merged in a given direction), in addition to (obviously) inserting new entries, and updating the values of existing entries.
In addition to those values, we also need an
int maximum_group_value = 0;
and whenever we extend a group or merge two groups, we check whether the value of the resulting group (meaning its largest_value * (right_endpoint - left_endpoint + 1) is greater than maximum_group_value. If so, we update maximum_group_value and return it; if not, we return maximum_group_value as-is.
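Here is a rough Java transcription of that four-case update (the answer's code is C++ with std::unordered_map; I use 0-based indices, keep the two stated assumptions of no duplicate queries and no negative numbers, and the names are mine):
import java.util.HashMap;
import java.util.Map;

class Groups {
    private final Map<Integer, Integer> rightOfLeft = new HashMap<>(); // left endpoint -> right endpoint
    private final Map<Integer, Integer> leftOfRight = new HashMap<>(); // right endpoint -> left endpoint
    private final Map<Integer, Integer> maxOfLeft = new HashMap<>();   // left endpoint -> largest value
    private long maximumGroupValue = 0;

    long query(int i, int[] a) {
        Integer leftStart = leftOfRight.get(i - 1); // group ending at i-1, if any
        Integer rightEnd = rightOfLeft.get(i + 1);  // group starting at i+1, if any
        int left, right, largest = a[i];
        if (leftStart != null && rightEnd != null) {     // case 1: merge across i
            left = leftStart; right = rightEnd;
            largest = Math.max(largest, Math.max(maxOfLeft.get(left), maxOfLeft.get(i + 1)));
            leftOfRight.remove(i - 1); rightOfLeft.remove(i + 1); maxOfLeft.remove(i + 1);
        } else if (leftStart != null) {                  // case 2: extend left group to cover i
            left = leftStart; right = i;
            largest = Math.max(largest, maxOfLeft.get(left));
            leftOfRight.remove(i - 1);
        } else if (rightEnd != null) {                   // case 3: extend right group to cover i
            left = i; right = rightEnd;
            largest = Math.max(largest, maxOfLeft.get(i + 1));
            rightOfLeft.remove(i + 1); maxOfLeft.remove(i + 1);
        } else {                                         // case 4: new single-element group
            left = i; right = i;
        }
        rightOfLeft.put(left, right);
        leftOfRight.put(right, left);
        maxOfLeft.put(left, largest);
        maximumGroupValue = Math.max(maximumGroupValue, (long) largest * (right - left + 1));
        return maximumGroupValue;
    }
}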
Now, what if duplicates are allowed, such that a given index i might be "queried" after it already belongs to a group?
The simplest approach is to simply keep track of which i-s have already been queried; but a more elegant approach, if desired, might be to change map_from_left_endpoint_to_right_endpoint from a std::unordered_map to a std::map, and then use something like this:
bool is_already_in_a_group(
    std::map<int, int> const & map_from_left_endpoint_to_right_endpoint,
    int const index) {
    // get iterator to first element *after* index (or to 'end()' if no such):
    auto iter = map_from_left_endpoint_to_right_endpoint.upper_bound(index);
    // if that iterator points to 'begin()', then there are no elements
    // at or before index:
    if (iter == map_from_left_endpoint_to_right_endpoint.begin()) {
        return false;
    }
    // otherwise, move iterator to point to the last element whose key is
    // less than or equal to index:
    --iter;
    // . . . and check whether the value of that element is greater than
    // or equal to index (meaning that [key, value] spans index):
    return iter->second >= index;
}
to check if the greatest key in map_from_left_endpoint_to_right_endpoint that is less than or equal to i is mapped to a value that is greater than or equal to i.
This adds a fifth case to our case analysis above — "if i is already inside a group, just do nothing and return maximum_group_value" — but other than that, has no effect.
Note that this same approach also lets us eliminate map_from_right_endpoint_to_left_endpoint, if we want: the above function could easily be tweaked to int get_left_endpoint_for_right_endpoint by changing its return statement to return iter->second == index ? iter->first : -1;.
At this point it becomes sensible to define a Group class with three fields (left_endpoint, right_endpoint, and largest_value), and just keep a single map_from_left_endpoint_to_group.
Lastly — what if negative values are allowed, such that the "maximum group value" can actually decrease as the result of a query? (For example, if the array elements are [-1, -10] and the queries are i=0, i=1, then the results are maximum_group_value=-1, maximum_group_value=-2.) In such a case, we need to keep track of the values of all current groups, because any one of them might suddenly become the maximum.
For that, instead of storing a single int maximum_group_value, we can maintain a heap of groups, ordered by value, that we push into every time we create/extend/merge groups. (We can just use a std::vector<Group> for this, plus std::push_heap with an appropriate comparator, or with an appropriate definition for operator<(Group const &, Group const &).) After each query, we check if the top group on the heap (the first element in the vector) is still a group that actually exists; if so, we return its value, otherwise we pop it (using std::pop_heap) and repeat.
As an optimization, we can also store int maximum_group_value, and eliminate the heap once we've encountered a nonnegative array-element (since as soon as a given group contains a nonnegative array-element, its value can never decrease again, and obviously the maximum group value will be the value of one of those groups).

Binary search with gaps

Let's imagine two arrays like this:
[8,2,3,4,9,5,7]
[0,1,1,0,0,1,1]
How can I perform a binary search only on the numbers with a 1 below them, ignoring the rest?
I know this can be done in O(log n) comparisons, but my current method is slower because it has to go through all the 0s until it hits a 1.
If you hit a number with a 0 below, you need to scan in both directions for a number with a 1 below until you find it -- or the local search space is exhausted. As the scan for a 1 is linear, the ratio of 0s to 1s determines whether the resulting algorithm can still be faster than linear.
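A sketch of that idea grafted onto binary search, in Java (assuming, as in the example, that the values flagged with 1 form a sorted subsequence):
class GappedSearch {
    // Returns an index i with vals[i] == target and flags[i] == 1, or -1.
    static int search(int[] vals, int[] flags, int target) {
        int lo = 0, hi = vals.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int probe = mid;
            while (probe <= hi && flags[probe] == 0) probe++; // scan right for a 1
            if (probe > hi) {                                  // none to the right: scan left
                probe = mid - 1;
                while (probe >= lo && flags[probe] == 0) probe--;
                if (probe < lo) return -1;                     // no flagged entries left in range
            }
            if (vals[probe] == target) return probe;
            if (vals[probe] < target) lo = probe + 1;
            else hi = probe - 1;
        }
        return -1;
    }
}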
This question is very old, but I've just discovered a wonderful little trick to solve this problem in most cases where it comes up. I'm writing this answer so that I can refer to it elsewhere:
Fast Append, Delete, and Binary Search in a Sorted Array
The need to dynamically insert or delete items from a sorted collection, while preserving the ability to search, typically forces us to switch from a simple array representation using binary search to some kind of search tree -- a far more complicated data structure.
If you only need to insert at the end, however (i.e., you always insert a largest or smallest item), or you don't need to insert at all, then it's possible to use a much simpler data structure. It consists of:
A dynamic (resizable) array of items, the item array; and
A dynamic array of integers, the set array. The set array is used as a disjoint set data structure, using the single-array representation described here: How to properly implement disjoint set data structure for finding spanning forests in Python?
The two arrays are always the same size. As long as there have been no deletions, the item array just contains the items in sorted order, and the set array is full of singleton sets corresponding to those items.
If items have been deleted, though, an item in the item array is only valid if there is a root set at the corresponding position in the set array. All sets that have been merged into a single root will be contiguous in the set array.
This data structure supports the required operations as follows:
Append (O(1))
To append a new largest item, just append the item to the item array, and append a new singleton set to the set array.
Delete (amortized effectively O(log N))
To delete a valid item, first call search to find the adjacent larger valid item. If there is no larger valid item, then just truncate both arrays to remove the item and all adjacent deleted items. Since merged sets are contiguous in the set array, this will leave both arrays in a consistent state.
Otherwise, merge the sets for the deleted item and adjacent item in the set array. If the deleted item's set is chosen as the new root, then move the adjacent item into the deleted item's position in the item array. Whichever position isn't chosen will be unused from now on, and can be nulled-out to release a reference if necessary.
If less than half of the item array is valid after a delete, then deleted items should be removed from the item array and the set array should be reset to an all-singleton state.
Search (amortized effectively O(log N))
Binary search proceeds normally, except that we need to find the representative item for every test position:
int find(item_array, set_array, itemToFind) {
    int pos = 0;
    int limit = item_array.length;
    while (pos < limit) {
        int testPos = pos + floor((limit - pos) / 2);
        if (item_array[find_set(set_array, testPos)] < itemToFind) {
            pos = testPos + 1;  // testPos is too low
        } else {
            limit = testPos;    // testPos is not too low
        }
    }
    if (pos >= item_array.length) {
        return -1; // not found
    }
    pos = find_set(set_array, pos);
    return (item_array[pos] == itemToFind) ? pos : -1;
}
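The find_set helper assumed above could look like the following Java sketch. This uses the plain parent-pointer convention (a root stores its own index), which is a simplification of the single-array representation the answer links to:
class DisjointSet {
    // Find the root of i's set, compressing the path along the way.
    static int findSet(int[] setArray, int i) {
        int root = i;
        while (setArray[root] != root) root = setArray[root]; // walk up to the root
        while (setArray[i] != root) {                          // second pass: point everything at the root
            int next = setArray[i];
            setArray[i] = root;
            i = next;
        }
        return root;
    }
}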

Write algorithm to return top 2 elements in terms of frequency from a long list of elements

I was asked this question during interview. I was not able to solve this. I wonder if anyone has a good idea how to solve it:
If I have a long list of integers, return the top two in terms of frequency.
e.g. [1, 2, 3, 1, 4, 5, 6, 7, 8, 6, 1, 8, 8] returns [1, 8]
Thank you.
Loop through the list and create a max heap of (value, count) pairs.
There is definitely a challenge in how to keep track of things. Thinking of a quick solution (as is often the case in an interview), I'd keep a dictionary recording whether I've created an object for any given int in the array/list and, if so, that object's current index in the heap. If it exists, I get that object, update its counter, and trickle it up in the max heap.
I'll probably have a class that contains data, such as this:
public class MyData
{
    private readonly int _key;
    public MyData(int key)
    {
        _key = key;
        Count = 0;
    }
    public int GetKey()
    {
        return _key;
    }
    public int Count { get; set; }
}
I'll have a structure like this (where the tuple contains the object and its index in the heap array; I'm going for the array implementation of the heap):
var elementsInHeap = new Dictionary<int, Tuple<MyData, int>>();
When looping through the input list, check if you have an entry in that dictionary for that int. If so, get that value, get the object, increase the counter, and then trickle up in the heap. For the heap you can use the MyData object; when doing trickle up or down, use the counter value. If not, create a new MyData object, have it trickle up in the max heap based on its counter, and when finished add it to the dictionary with its index in the tuple.
Hope this helps, I'm sure there is a smarter solution out there. Hopefully someone will help us with that.
I think the answers that suggest building a heap or sorting the array have O(n log n) complexity.
First build a hash map in which the keys are the (distinct) elements of the array and the values are their frequencies. This map can be easily built in O(n).
Then find the maximum and second maximum of the entries in the map. This can also be done easily in O(n) by iterating through the map entries only once. Even if you decide to iterate twice (find a maximum, remove it and find the next maximum), your complexity will still be O(n).
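A sketch of that O(n) approach in Java (ties are broken arbitrarily; assumes at least two distinct values):
import java.util.HashMap;
import java.util.Map;

class TopTwo {
    static int[] topTwoByFrequency(int[] xs) {
        Map<Integer, Integer> freq = new HashMap<>();
        for (int x : xs) freq.merge(x, 1, Integer::sum); // O(n) counting pass
        Integer first = null, second = null;
        for (Map.Entry<Integer, Integer> e : freq.entrySet()) { // one pass over the map
            int k = e.getKey(), f = e.getValue();
            if (first == null || f > freq.get(first)) { second = first; first = k; }
            else if (second == null || f > freq.get(second)) { second = k; }
        }
        return new int[] { first, second };
    }
}
// e.g. topTwoByFrequency(new int[]{1,2,3,1,4,5,6,7,8,6,1,8,8}) -> {1, 8} (in some order)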
If you know the range of the numbers (the max and min elements), you can use an array and count frequencies in one loop through the input;
you can also use the O(n) heap-construction algorithm and just extract the max twice;
or use hashing (if you are able to implement it during the interview).

Storing a bucket of numbers in an efficient data structure

I have buckets of numbers, e.g. 1 to 4, 5 to 15, 16 to 21, 22 to 34, ...
I have roughly 600,000 such buckets. The ranges of numbers that fall into each bucket vary. I need to store these buckets in a suitable data structure so that lookups for a number are as fast as possible.
So my question is what is the suitable data structure and a sorting mechanism for this type of problem.
Thanks in advance
If the buckets are contiguous and disjoint, as in your example, you need to store in a vector just the left bound of each bucket (i.e. 1, 5, 16, 22) plus, as the last element, the first number that doesn't fall in any bucket (35). (I assume, of course, that you are talking about integer numbers.)
Keep the vector sorted.
You can find the bucket in O(log n) with a kind-of binary search. To find which bucket a number x belongs to, just look for the only index i such that vector[i] <= x < vector[i+1]. If x is strictly less than vector[0], or greater than or equal to the last element of the vector, then no bucket contains it.
EDIT. Here is what I mean:
#include <stdio.h>

// ~ Binary search. Should be O(log n)
int findBucket(int aNumber, int *leftBounds, int left, int right)
{
    int middle;
    if (aNumber < leftBounds[left] || leftBounds[right] <= aNumber) // cannot find
        return -1;
    if (left + 1 == right) // found
        return left;
    middle = left + (right - left) / 2;
    if (leftBounds[left] <= aNumber && aNumber < leftBounds[middle])
        return findBucket(aNumber, leftBounds, left, middle);
    else
        return findBucket(aNumber, leftBounds, middle, right);
}

#define NBUCKETS 12

int main(void)
{
    int leftBounds[NBUCKETS+1] = {1, 4, 7, 15, 32, 36, 44, 55, 67, 68, 79, 99, 101};
    // The buckets are 1-3, 4-6, 7-14, 15-31, ...
    int aNumber;
    for (aNumber = -3; aNumber < 103; aNumber++)
    {
        int index = findBucket(aNumber, leftBounds, 0, NBUCKETS);
        if (index < 0)
            printf("%d: Bucket not found\n", aNumber);
        else
            printf("%d belongs to the bucket %d-%d\n", aNumber, leftBounds[index], leftBounds[index+1]-1);
    }
    return 0;
}
You will probably want some kind of sorted tree, like a B-Tree, B+ Tree, or Binary Search tree.
If I understand you correctly, you have a list of buckets and you want, given an arbitrary integer, to find out which bucket it goes in.
Assuming that none of the bucket ranges overlap, I think you could implement this in a binary search tree. That would make the lookup possible in O(logn) (whenere n=number of buckets).
It would be simple to do this, just define the left branch to be less than the low end of the bucket, the right branch to be greater than the right end. So in your example we'd end up with a tree something like:
16-21
/ \
5-15 22-34
/
1-4
To search for, say, 7, you just check the root. Less than 16? Yes, go left. Less than 5? No. Greater than 15? No, you're done.
You just have to be careful to balance your tree (or use a self-balancing tree) in order to keep your worst-case performance down. This is really important if your input (the bucket list) is already sorted.
+1 to the kind-of binary search idea. It's simple and gives good performance for 600000 buckets. That being said, if it's not good enough, you could create an array with MAX BUCKET VALUE - MIN BUCKET VALUE = RANGE elements, and have each element in this array reference the appropriate bucket. Then, you get a lookup in guaranteed constant [O(1)] time, at the cost of using a huge amount of memory.
If A) the probability of accessing buckets is not uniform and B) you knew / could figure out how likely a given set of buckets were to be accessed, you could probably combine these two approaches to create a kind of cache. For example, say bucket {0, 3} were accessed all the time, as was {7, 13}, then you can create an array CACHE. . .
int cache_low_value = 0;
int cache_hi_value = 13;
CACHE[0] = BUCKET_1
CACHE[1] = BUCKET_1
...
CACHE[6] = BUCKET_2
CACHE[7] = BUCKET_3
CACHE[8] = BUCKET_3
...
CACHE[13] = BUCKET_3
. . . which will allow you to find a bucket in O(1) time, assuming the value Y you're trying to match to a bucket is between cache_low_value and cache_hi_value (if Y >= cache_low_value && Y <= cache_hi_value, then BUCKET = CACHE[Y]). On the upside, this approach wouldn't use all the memory on your machine; on the downside, it adds the equivalent of an operation or two to your bsearch whenever you can't find your number/bucket pair in the cache (since you had to check the cache in the first place).
A simple way to store and sort these in C++ is to use a pair of sorted arrays that represent the lower and upper bounds of each bucket. Then you can use int bucket_index = std::distance(lower_bounds.begin(), std::upper_bound(lower_bounds.begin(), lower_bounds.end(), value)) - 1 to find the last bucket whose lower bound is <= value, and if (bucket_index >= 0 && upper_bounds[bucket_index] >= value), bucket_index is the bucket you want.
You can replace that with a single struct holding the bucket, but the principle will be the same.
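A Java equivalent of the same lookup, for what it's worth (both bounds arrays are assumed sorted, parallel, and non-overlapping):
import java.util.Arrays;

class BucketLookup {
    // Returns the bucket index containing value, or -1 if it falls in a gap or outside.
    static int findBucket(int[] lowerBounds, int[] upperBounds, int value) {
        int pos = Arrays.binarySearch(lowerBounds, value);
        int idx = pos >= 0 ? pos : -pos - 2; // last bucket whose lower bound is <= value
        if (idx >= 0 && upperBounds[idx] >= value) return idx;
        return -1;
    }
}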
Let me see if I can restate your requirement. It's analogous to having, say, the day of the year and wanting to know which month a given day falls in. So, given a year with 600,000 days (an interesting planet), you want to return a string that is either "Jan", "Feb", "Mar" ... "Dec"?
Let me focus on the retrieval end first, and I think you can figure out how to arrange the data when initializing the data structures, given what has already been posted above.
Create a data structure...
typedef struct {
    int DayOfYear : 20; // a bit-int, donating some bits for other uses
    int MonthSS   : 4;  // subscript to select months
    int Unused    : 8;  // can be used to make MonthSS 12 bits
} BUCKET_LIST;
char *MonthStr[12] = {"Jan", "Feb", "Mar", /* ... */ "Dec"};
To initialize, use a for{} loop to set each BUCKET_LIST.MonthSS to one of the 12 months in MonthStr.
On retrieval, do a binary search on a vector of BUCKET_LIST.DayOfYear (you'll need to write a trivial compare function for BUCKET_LIST.DayOfYear). Your result can be obtained by using the return from bsearch() as the subscript into MonthStr (the key, count, and comparator names below are illustrative)...
pBucket = (BUCKET_LIST *)bsearch(&key, v_bucket_list, n_buckets,
                                 sizeof(BUCKET_LIST), compare_day);
MonthString = MonthStr[pBucket->MonthSS];
The general approach here is to have collections of "pointers" to the strings attached to the 600,000 entries. All of the pointers in a bucket point to the same string. I used a bit-int as a subscript here, instead of 600k 4-byte pointers, because it takes less memory (4 bits vs 4 bytes), and BUCKET_LIST sorts and searches as a species of int.
Using this scheme you'll use no more memory or storage than storing a simple int key, get the same performance as a simple int key, and do away with all the range checking on retrieval. IE: no if{ } testing. Save those if{ }s for initializing the BUCKET_LIST data structure, and then forget about them on retrieval.
I refer to this technique as subscript aliasing, as it resolves a many-to-one relationship by converting the subscript of the many to the subscript of the one - very efficiently I might add.
My application was to use an array of many UCHARs to index a much smaller array of double floats. The size reduction was enough to keep all of the hot-spot's data in L1 cache on the processor. 3X performance gain just from this one little change.
