Write an algorithm to return the top 2 elements by frequency from a long list of elements

I was asked this question during an interview and was not able to solve it. I wonder if anyone has a good idea how to solve it:
Given a long list of integers, return the two integers with the highest frequencies.
e.g. [1, 2, 3, 1, 4, 5, 6, 7, 8, 6, 1, 8, 8] returns [1,8]
Thank you.

Loop through the list and create a max heap with the value and count.
There is definitely a challenge in keeping track of the counts. Thinking of a quick solution (as is often the case in an interview), I'd probably keep a dictionary that records whether I've already created an object for a given int in the array/list and, if so, its current index in the heap. If the entry exists, I get that object, update its counter, and trickle it up in the max heap.
I'll probably have a class that contains data, such as this:
public class MyData
{
private readonly int _key;
public MyData(int key)
{
_key = key;
Count = 0;
}
public int GetKey()
{
return _key;
}
public int Count { get; set; }
}
I'll have a structure like this, where the tuple contains the object and its index in the heap array (I'm going for the array implementation of the heap):
var elementsInHeap = new Dictionary<int, Tuple<MyData, int>>();
When looping through the input list, check whether the dictionary has an entry for that int. If so, get the object from the entry, increase its counter, and trickle it up in the heap. If not, create a new MyData object, trickle it up in the max heap based on its counter, and when finished add it to the dictionary with its heap index in the tuple. The heap stores the MyData objects and compares them by counter value when trickling up or down; whenever an object moves in the heap, its index in the dictionary has to be updated too.
Hope this helps, I'm sure there is a smarter solution out there. Hopefully someone will help us with that.

I think the answers that suggest building a heap or sorting the array have O(n log n) complexity.
First build a hash map in which the keys are the (distinct) elements of the array and the values are their frequencies. This map can be easily built in O(n).
Then find the maximum and second maximum of the entries in the map. This can also be done easily in O(n) by iterating through the map entries only once. Even if you decide to iterate twice (find a maximum, remove it and find the next maximum), your complexity will still be O(n).
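A minimal Python sketch of the hash-map approach described above, in case it helps. The function name top_two is made up for illustration, and ties are broken by whichever value appears first in the input, which is an assumption the question does not specify.

```python
from collections import Counter

def top_two(values):
    """Return up to two most frequent values, most frequent first."""
    counts = Counter(values)          # O(n): value -> frequency
    best = second = None              # (count, value) pairs, or None
    for value, count in counts.items():
        if best is None or count > best[0]:
            best, second = (count, value), best   # old best becomes second
        elif second is None or count > second[0]:
            second = (count, value)
    return [pair[1] for pair in (best, second) if pair is not None]

print(top_two([1, 2, 3, 1, 4, 5, 6, 7, 8, 6, 1, 8, 8]))  # [1, 8]
```

Both the counting pass and the pass over the map entries are linear, so the whole thing stays O(n).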

If you know the range of the numbers (the max and min elements), you can use an array and count the frequencies in one loop through the list.
You can also use the fast heap construction algorithm, which is O(n), and just extract the max 2 times,
or use hashing (if you are able to implement it during the interview).
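Here is a rough Python sketch of the heap variant, assuming the frequencies are counted with a hash map first. The name top_two_heap is hypothetical. Python's heapq is a min-heap, so the counts are negated to get max-heap behaviour; ties go to the smaller value.

```python
import heapq
from collections import Counter

def top_two_heap(values):
    """Return up to two most frequent values using a heap built in linear time."""
    counts = Counter(values)
    # Negate the counts so the most frequent value sits at the top of the min-heap.
    heap = [(-count, value) for value, count in counts.items()]
    heapq.heapify(heap)               # linear-time heap construction
    picks = min(2, len(heap))         # fewer than two distinct values is possible
    return [heapq.heappop(heap)[1] for _ in range(picks)]

print(top_two_heap([1, 2, 3, 1, 4, 5, 6, 7, 8, 6, 1, 8, 8]))  # [1, 8]
```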

Related

O(1) find value from a key in a range

What kind of data structure would allow me to get the corresponding value for a given key from a set of ordered, range-like keys, where my key is not necessarily in the set?
Consider, [key, value]:
[3, 1]
[5, 2]
[10, 3]
Looking up 3 or 4 would return 1, 5 - 9 would return 2 and 10 would return 3. The ranges are not constant sized.
O(1) or like-O(1) is important, if possible.
A balanced binary search tree will give you O(log n).
What about a key-indexed array? Say you know your keys are below 1000; you can simply fill an int[1000] with values, like this:
[0,0]
[1,0]
[2,0]
[3,1]
[4,1]
[5,2]
......
and so on. That'll give you O(1) performance, but a huge memory overhead.
Otherwise, a hash table is the closest I know of. Hope it helps.
Edit: look up the red-black tree; it's a self-balancing tree which has a worst case of O(log n) for searching.
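To make the key-indexed array idea concrete, here is a small Python sketch using the [key, value] pairs from the question. The helper name build_lookup and the limit parameter are made up; it also assumes keys below the first range start map to 0 (as in the listing above) and that the last range runs up to the limit.

```python
def build_lookup(pairs, limit):
    """Fill a dense table so that table[k] is the value of the largest range start <= k."""
    table = [0] * limit               # assumption: keys before the first range map to 0
    ordered = sorted(pairs)
    for i, (start, value) in enumerate(ordered):
        # Each range ends where the next one starts; the last one runs to the limit.
        end = ordered[i + 1][0] if i + 1 < len(ordered) else limit
        for k in range(start, min(end, limit)):
            table[k] = value
    return table

table = build_lookup([(3, 1), (5, 2), (10, 3)], 1000)
print(table[3], table[4], table[9], table[10])  # 1 1 2 3
```

Building the table costs time and memory proportional to the key range, but each lookup afterwards is a single array access.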
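I would use a Dictionary in this scenario. Retrieving a value by its key is very fast, close to O(1). Note that it only finds keys that were actually added, so every key in a range (e.g. both 3 and 4) would need its own entry.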
Dictionary<int, int> myDictionary = new Dictionary<int, int>();
myDictionary.Add(3,1);
myDictionary.Add(5,2);
myDictionary.Add(10,3);
//If you know the key exists, you can use
int value = myDictionary[3];
//If you don't know if the key is in the dictionary - use the TryGetValue method
int found;
if (myDictionary.TryGetValue(3, out found))
{
    //The key was found and the corresponding value is stored in "found"
}
For more info: http://msdn.microsoft.com/en-us/library/xfhwa508.aspx

minimum interval of an array of unique elements

How can I find the minimum interval of an integer array in which all the unique elements of that array are present?
For example, my array is: 1 1 1 2 3 1 1 4 3 3 3 2 1 2 2 4 1
The minimum interval is from index 3 to index 7.
I'm looking for an algorithm of O(nlogn) or less (n<=100000)
The strategy is to iterate from the end to the start, remembering where you last saw each integer. E.g. somewhere in the middle, you last saw 1 at index 15, 2 at index 20, 3 at index 17. The interval length is the maximum of those last-seen indices minus your current index, plus one.
To find the maximum index easily, you should use a self-balancing binary search tree (BST), because it has O(log n) insert and removal time, and constant lookup time for the largest index.
For example, if you have to update the index you last saw a 1, you remove the current last seen index (the 15), and insert the new last seen index.
By updating the self balancing BST with all the end indices allowed by each integer type, we can pick the largest, and say that we can end there.
The exact code depends on how the input is defined (eg. whether you know what all the integers are, ie. you know there exists all integers between 1 and 4 in array, then the code is simplified).
Iteration is O(n) and each BST operation is O(log n), so overall it is O(n log n).
Implementation Details
Implementation of this takes a little bit of work.
Initialize:
the interval length for each starting index.
an array for when you last saw a certain integer. (If you don't know what possible integers might be in the array, instead of using a normal array, use an associative array (eg. map<> in C++)).
a priority queue-like type heap, where the top of the queue is the maximum integer in it. You need to be able to easily remove stuff from it, so use a self-balancing binary search tree
Now inside the loop (looping index from end of input array to start of input array),
You can update your last seen array for this particular index.
Just check what integer you see, and update the entry in the index last seen array.
Using before and after in the last seen array, update the BST (remove old end index, add new index)
Update interval length for this starting index, based on largest end index required (from BST).
If you see an integer you haven't seen before, invalidate all interval lengths for starting indices above this index (or just avoid updating interval length until all integers have been seen at least once).
C++ code implementation
Assuming all integers 0-(k-1) are found in the input array
Disclaimer: untested
Ignores the main function
Code:
#include <climits>
#include <set>
#include <utility>
#include <vector>
// Returns {startindex, minlength}; your answer is [startindex, startindex+minlength).
// startindex is -1 if some integer in 0..k-1 never appears.
std::pair<int, int> minInterval(const std::vector<int>& input, int k) {
    const int n = static_cast<int>(input.size());
    std::vector<int> lastseen(k, -1);   // initialize lastseen
    std::multiset<int> pq;              // BST holding the last seen indices
    int minlength = INT_MAX;            // initialize to a very large number
    int startindex = -1;
    for (int i = n - 1; i >= 0; i--) {
        if (lastseen[input[i]] != -1)               // if lastseen[] already has index
            pq.erase(pq.find(lastseen[input[i]]));  // erase single copy
        lastseen[input[i]] = i;                     // update last seen
        pq.insert(i);                               // put last seen index into BST
        if (static_cast<int>(pq.size()) == k) {     // if all integers seen (nothing missing)
            // get (maximum of endindex requirements) - current index + 1
            int length = *pq.rbegin() - i + 1;
            if (length <= minlength) {              // better answer? (ties go to the earlier start)
                minlength = length;
                startindex = i;
            }
        }
    }
    return {startindex, minlength};
}
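For comparison, here is a Python sketch of a different technique, not the BST approach above: a two-pointer sliding window, which runs in O(n) with a hash map of counts. The function name min_interval is made up, and it returns inclusive 0-based indices.

```python
from collections import defaultdict

def min_interval(values):
    """Return (start, end), inclusive, of the shortest window holding every distinct value."""
    if not values:
        return None
    needed = len(set(values))         # number of distinct values to cover
    counts = defaultdict(int)         # occurrences of each value inside the window
    covered = 0                       # distinct values currently inside the window
    best = None
    left = 0
    for right, value in enumerate(values):
        counts[value] += 1
        if counts[value] == 1:
            covered += 1
        # Shrink from the left for as long as the window still covers everything.
        while covered == needed:
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            counts[values[left]] -= 1
            if counts[values[left]] == 0:
                covered -= 1
            left += 1
    return best

print(min_interval([1, 1, 1, 2, 3, 1, 1, 4, 3, 3, 3, 2, 1, 2, 2, 4, 1]))  # (3, 7)
```

Each index enters and leaves the window at most once, which is where the linear bound comes from.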

Queue-like data structure with fast search and insertion

I need a data structure with the following properties:
It contains integer numbers.
Duplicates aren't allowed (that is, it stores at most one of any integer).
After it reaches the maximal size the first element is removed.
So if the capacity is 3, then this is how it would look when putting in it sequential numbers:
{}, {1}, {1, 2}, {1, 2, 3}, {2, 3, 4}, {3, 4, 5} etc.
Only two operations are needed: inserting a number into this container (INSERT) and checking if the number is already in the container (EXISTS).
The number of EXISTS operations is expected to be approximately 2 * number of INSERT operations.
I need these operations to be as fast as possible.
What would be the fastest data structure or combination of data structures for this scenario?
Sounds like a hash table using a ring buffer for storage.
O(1) for both insert and lookup (and delete if you eventually need it).
Data structures:
Queue of Nodes containing the integers, implemented as a linked list (queue)
and
HashMap mapping integers to Queue's linked list nodes (hashmap)
Insert:
if (!hashmap.exists(newInt)) { // remove condition to allow dupes.
    if (queue.size >= MAX_SIZE) {
        // Remove oldest int from queue and hashmap
        hashmap.remove(queue.pop());
    }
    // add new int to queue and hashmap
    Node n = new Node(newInt);
    queue.push(n);
    hashmap.add(newInt, n);
}
Lookup:
return hashmap.exists(lookupInt);
Note: With this implementation, you can also remove integers in O(1) since all you have to do is lookup the Node in the hashmap, and remove it from the linked list (by linking its neighbors together.)
You want a ring buffer, the best way to do this is to define an array of the size you want and then maintain indexes as to where it starts and ends.
int buffer[3] = {0,0,0};
int start = 0;
int end = 0;
#define LAST_INDEX 2
void insert(int data)
{
    buffer[end] = data;
    end = (end == LAST_INDEX) ? 0 : end + 1;
}
void remove_oldest()
{
    start = (start == LAST_INDEX) ? 0 : start + 1;
}
int exists(int data)
{
    // search from start to end, jumping around the end of the array if needed
    // (assumes start == end means the buffer is empty)
    int i;
    for (i = start; i != end; i = (i == LAST_INDEX) ? 0 : i + 1)
        if (buffer[i] == data) return 1;
    return 0;
}
start always points to the first (oldest) item
end always points to the slot after the most recent item
the list may wrap over the end of the array
search O(n)
insert O(1)
delete O(1)
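For true geek marks, though, build a Bloom filter: http://en.wikipedia.org/wiki/Bloom_filter
It is not guaranteed to be 100% accurate (it can report false positives, though never false negatives), and a plain Bloom filter cannot remove elements, but lookups are very fast.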
If you want to remove the lowest value, use a sorted list and if you have more elements than needed, remove the lowest one.
If you want to remove the oldest value, use a set and a queue. Both the set and the queue contain a copy of each value. If the value is in the set, no-op. If the value isn't in the set, append the value to the queue and add it to the set. If you've exceeded your size, pop the queue and remove that value from the set.
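A minimal Python sketch of the set-plus-queue idea for removing the oldest value, in case it is useful. The class name BoundedUniqueQueue is made up, and it assumes the capacity is at least 1.

```python
from collections import deque

class BoundedUniqueQueue:
    """Keeps at most `capacity` distinct integers, evicting the oldest first."""

    def __init__(self, capacity):
        self.capacity = capacity      # assumed to be >= 1
        self.order = deque()          # insertion order, oldest on the left
        self.members = set()          # same values, for O(1) membership tests

    def insert(self, value):
        if value in self.members:     # duplicate: no-op
            return
        if len(self.order) >= self.capacity:
            self.members.discard(self.order.popleft())  # evict the oldest value
        self.order.append(value)
        self.members.add(value)

    def exists(self, value):
        return value in self.members

q = BoundedUniqueQueue(3)
for n in range(1, 6):
    q.insert(n)
print(list(q.order))  # [3, 4, 5]
```

Both INSERT and EXISTS are O(1) on average here, at the cost of storing every value twice.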
If you need to move duplicated values to the back of the queue, you'll need to switch from a set to a hash table mapping values to stable iterators into the queue and be able to remove from the middle of the queue.
Alternatively, you could use a sorted list and a hash table. Instead of just putting your values into the sorted list, you could put in pairs (id, value) and then have the hash table map from value to (id, value). id would just be incremented after every insert. When you find a match in the hash table, you remove that (id, value) from the list and add a new (id, value) pair at the end of the list. Otherwise you just add to the end of the list and pop from the beginning if it's too long.

Storing a bucket of numbers in an efficient data structure

I have buckets of numbers, e.g. 1 to 4, 5 to 15, 16 to 21, 22 to 34, ...
I have roughly 600,000 such buckets. The range of numbers that falls in each bucket varies. I need to store these buckets in a suitable data structure so that looking up a number is as fast as possible.
So my question is: what is a suitable data structure and sorting mechanism for this type of problem?
Thanks in advance
If the buckets are contiguous and disjoint, as in your example, you need to store in a vector just the left bound of each bucket (i.e. 1, 5, 16, 22) plus, as the last element, the first number that doesn't fall in any bucket (35). (I assume, of course, that you are talking about integer numbers.)
Keep the vector sorted.
You can find the bucket in O(log n) with a kind of binary search. To find which bucket a number x belongs to, look for the only index i such that vector[i] <= x < vector[i+1]. If x is strictly less than vector[0], or if it is greater than or equal to the last element of vector, then no bucket contains it.
EDIT. Here is what I mean:
#include <stdio.h>
// ~ Binary search. Should be O(log n)
int findBucket(int aNumber, int *leftBounds, int left, int right)
{
int middle;
if(aNumber < leftBounds[left] || leftBounds[right] <= aNumber) // cannot find
return -1;
if(left + 1 == right) // found
return left;
middle = left + (right - left)/2;
if( leftBounds[left] <= aNumber && aNumber < leftBounds[middle] )
return findBucket(aNumber, leftBounds, left, middle);
else
return findBucket(aNumber, leftBounds, middle, right);
}
#define NBUCKETS 12
int main(void)
{
int leftBounds[NBUCKETS+1] = {1, 4, 7, 15, 32, 36, 44, 55, 67, 68, 79, 99, 101};
// The buckets are 1-3, 4-6, 7-14, 15-31, ...
int aNumber;
for(aNumber = -3; aNumber < 103; aNumber++)
{
int index = findBucket(aNumber, leftBounds, 0, NBUCKETS);
if(index < 0)
printf("%d: Bucket not found\n", aNumber);
else
printf("%d belongs to the bucket %d-%d\n", aNumber, leftBounds[index], leftBounds[index+1]-1);
}
return 0;
}
You will probably want some kind of sorted tree, like a B-Tree, B+ Tree, or Binary Search tree.
If I understand you correctly, you have a list of buckets and you want, given an arbitrary integer, to find out which bucket it goes in.
Assuming that none of the bucket ranges overlap, I think you could implement this in a binary search tree. That would make the lookup possible in O(log n) (where n = number of buckets).
It would be simple to do this, just define the left branch to be less than the low end of the bucket, the right branch to be greater than the right end. So in your example we'd end up with a tree something like:
        16-21
       /     \
    5-15     22-34
    /
  1-4
To search for, say, 7, you just check the root. Less than 16? Yes, go left. Less than 5? No. Greater than 15? No, you're done.
You just have to be careful to balance your tree (or use a self-balancing tree) in order to keep your worst-case performance down. This is really important if your input (the bucket list) is already sorted.
+1 to the kind-of binary search idea. It's simple and gives good performance for 600000 buckets. That being said, if it's not good enough, you could create an array with MAX BUCKET VALUE - MIN BUCKET VALUE = RANGE elements, and have each element in this array reference the appropriate bucket. Then, you get a lookup in guaranteed constant [O(1)] time, at the cost of using a huge amount of memory.
If A) the probability of accessing buckets is not uniform and B) you knew / could figure out how likely a given set of buckets were to be accessed, you could probably combine these two approaches to create a kind of cache. For example, say bucket {0, 3} were accessed all the time, as was {7, 13}, then you can create an array CACHE. . .
int cache_low_value = 0;
int cache_hi_value = 13;
CACHE[0] = BUCKET_1
CACHE[1] = BUCKET_1
...
CACHE[6] = BUCKET_2
CACHE[7] = BUCKET_3
CACHE[8] = BUCKET_3
...
CACHE[13] = BUCKET_3
. . . which will allow you to find a bucket in O(1) time, assuming the value you're trying to associate with a bucket is between cache_low_value and cache_hi_value (if Y <= cache_hi_value && Y >= cache_low_value; then BUCKET = CACHE[Y]). On the upside, this approach wouldn't use all the memory on your machine; on the downside, it'd add the equivalent of an additional operation or two to your bsearch in the case where you can't find your number / bucket pair in the cache (since you had to check the cache in the first place).
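A rough Python sketch of the cache-in-front-of-binary-search idea, under some made-up assumptions: the bucket starts in lower_bounds are hypothetical (chosen so that 0-3, 4-6 and 7-13 are the first three buckets, as in the CACHE listing), the last bucket is open-ended, and bucket indices are 0-based.

```python
from bisect import bisect_right

# Hypothetical bucket starts: bucket i covers lower_bounds[i] .. lower_bounds[i+1] - 1.
lower_bounds = [0, 4, 7, 14, 30]
cache_low_value, cache_hi_value = 0, 13

def search(value):
    """Binary search: index of the last bucket start <= value, or -1 if none."""
    return bisect_right(lower_bounds, value) - 1

# Precompute the bucket index for every value in the hot range.
CACHE = [search(v) for v in range(cache_low_value, cache_hi_value + 1)]

def find_bucket_cached(value):
    if cache_low_value <= value <= cache_hi_value:
        return CACHE[value - cache_low_value]   # O(1) hit
    return search(value)                        # fall back to O(log n)

print(find_bucket_cached(6), find_bucket_cached(13), find_bucket_cached(20))  # 1 2 3
```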
A simple way to store and sort these in C++ is to use a pair of sorted arrays that represent the lower and upper bounds of each bucket. Then, you can use int bucket_index = std::distance(lower_bounds.begin(), std::upper_bound(lower_bounds.begin(), lower_bounds.end(), value)) - 1 to find the last bucket whose lower bound is not greater than the value, and if (bucket_index >= 0 && upper_bounds[bucket_index] >= value), bucket_index is the bucket you want.
You can replace that with a single struct holding the bucket, but the principle will be the same.
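The same lookup over a pair of sorted bound arrays can be sketched in Python with the standard bisect module. The function name find_bucket is made up, and the sample bounds are the buckets from the question.

```python
from bisect import bisect_right

def find_bucket(lower_bounds, upper_bounds, value):
    """Return the index of the bucket containing value, or -1 if none does."""
    # bisect_right counts the lower bounds <= value; the last of them is the candidate.
    i = bisect_right(lower_bounds, value) - 1
    if i >= 0 and value <= upper_bounds[i]:
        return i
    return -1

lower = [1, 5, 16, 22]
upper = [4, 15, 21, 34]
print(find_bucket(lower, upper, 7))   # 1
print(find_bucket(lower, upper, 35))  # -1
```

Keeping the upper bounds separately means this also works when there are gaps between buckets.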
Let me see if I can restate your requirement. It's analogous to having, say, the day of the year, and wanting to know which month a given day falls in. So, given a year with 600,000 days (an interesting planet), you want to return a string that is either "Jan","Feb","Mar"... "Dec"?
Let me focus on the retrieval end first, and I think you can figure out how to arrange the data when initializing the data structures, given what has already been posted above.
Create a data structure...
typedef struct {
    unsigned int DayOfYear :20; // a bit-field int, donating some bits for other uses
    unsigned int MonthSS :4;    // subscript to select months
    unsigned int Unused :8;     // can be used to make MonthSS 12 bits
} BUCKET_LIST;
const char *MonthStr[12] = {"Jan","Feb","Mar", /* ... */ "Dec"};
To initialize, use a for{} loop to set BUCKET_LIST.MonthSS to one of the 12 months in MonthStr.
On retrieval, do a binary search on a vector of BUCKET_LIST.DayOfYear (you'll need to write a trivial compare function for BUCKET_LIST.DayOfYear). Your result can be obtained by using the return from bsearch() as the subscript into MonthStr...
pBucket = (BUCKET_LIST *)bsearch( v_bucket_list);
MonthString = MonthStr[pBucket->MonthSS];
The general approach here is to have collections of "pointers" to the strings attached to the 600,000 entries. All of the pointers in a bucket point to the same string. I used a bit int as a subscript here, instead of 600k 4 byte pointers, because it takes less memory (4 bits vs 4 bytes), and BUCKET_LIST sorts and searches as a species of int.
Using this scheme you'll use no more memory or storage than storing a simple int key, get the same performance as a simple int key, and do away with all the range checking on retrieval. IE: no if{ } testing. Save those if{ }s for initializing the BUCKET_LIST data structure, and then forget about them on retrieval.
I refer to this technique as subscript aliasing, as it resolves a many-to-one relationship by converting the subscript of the many to the subscript of the one - very efficiently I might add.
My application was to use an array of many UCHARs to index a much smaller array of double floats. The size reduction was enough to keep all of the hot-spot's data in L1 cache on the processor. 3X performance gain just from this one little change.

Efficient algorithm to remove any map that is contained in another map from a collection of maps

I have a set (s) of unique maps (Java HashMaps currently) and wish to remove from it any maps that are completely contained by some other map in the set (i.e. remove m from s if m.entrySet() is a subset of n.entrySet() for some other n in s).
I have an n^2 algorithm, but it's too slow. Is there a more efficient way to do this?
Edit:
the set of possible keys is small, if that helps.
Here is an inefficient reference implementation:
public void removeSubmaps(Set<Map> s) {
Set<Map> toRemove = new HashSet<Map>();
for (Map a: s) {
for (Map b : s) {
if (a != b && a.entrySet().containsAll(b.entrySet())) // skip comparing a map with itself
toRemove.add(b);
}
}
s.removeAll(toRemove);
}
Not sure I can make this anything other than an n^2 algorithm, but I have a shortcut that might make it faster. Make a list of your maps together with the size of each map and sort it by size. A proper subset of a map must be strictly smaller than that map, so there's never any need to compare a map against one of the same or smaller size.
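A small Python sketch of the sort-by-size shortcut, using dicts in place of Java maps. The function name remove_submaps is hypothetical. It only compares each map against larger maps that were already kept, which is enough because containment is transitive; the worst case is still quadratic.

```python
def remove_submaps(maps):
    """Return the maps that are not strictly contained in another map."""
    ordered = sorted(maps, key=len, reverse=True)   # largest maps first
    kept = []
    for m in ordered:
        entries = m.items()
        # Only a strictly larger map that was kept can properly contain m.
        contained = any(len(big) > len(m) and entries <= big.items() for big in kept)
        if not contained:
            kept.append(m)
    return kept

maps = [{"a": 1}, {"a": 1, "b": 2}, {"b": 3}, {"a": 1, "b": 2, "c": 3}]
print(remove_submaps(maps))  # [{'a': 1, 'b': 2, 'c': 3}, {'b': 3}]
```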
Here's another stab at it.
Decompose all your maps into a list of key,value,map number. Sort the list by key and value. Go through the list, and for each group of key/value matches, create a permutation of all the map number pairs - these are all potential subsets. When you have the final list of pairs, sort by map numbers. Go through this second list, and count the number of occurrences of each pair - if the number matches the size of one of the maps, you've found a subset.
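Here is a Python sketch of that pair-counting idea, with one swap: it groups the (key, value) entries with a hash map instead of sorting them. The function name find_submaps is made up. It assumes keys and values are hashable and that the maps are unique, and it never flags an empty map because an empty map has no entries to count.

```python
from collections import defaultdict
from itertools import permutations

def find_submaps(maps):
    """Return the indices of maps whose entries all appear in some other map."""
    by_entry = defaultdict(list)      # (key, value) -> indices of the maps containing it
    for idx, m in enumerate(maps):
        for entry in m.items():
            by_entry[entry].append(idx)
    shared = defaultdict(int)         # (small, big) -> entries of small also found in big
    for owners in by_entry.values():
        for small, big in permutations(owners, 2):
            shared[(small, big)] += 1
    # A map is contained in another when every one of its entries is shared with it.
    return {small for (small, big), n in shared.items() if n == len(maps[small])}

maps = [{"a": 1}, {"a": 1, "b": 2}, {"b": 3}, {"a": 1, "b": 2, "c": 3}]
print(find_submaps(maps))  # {0, 1}
```

The cost depends on how many maps share each entry, so it does best when most key/value pairs occur in only a few maps.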
Edit: My original interpretation of the problem was incorrect, here is new answer based on my re-read of the question.
You can create a custom hash function for HashMap which returns the product of the hash values of all its entries. Sort the list of hash values, loop starting from the biggest value, and find all divisors among the smaller hash values; these are possible subsets of this hashmap. Use set.containsAll() to confirm before marking them for removal.
This effectively transforms the problem into a mathematical problem of finding possible divisor from a collection. And you can apply all the common divisor-search optimizations.
Complexity is O(n^2), but if many hashmaps are subsets of others, the actual time spent can be a lot better, approaching O(n) in the best-case scenario (if all hashmaps are subsets of one). But even in the worst-case scenario, a division would be a lot faster than set.containsAll(), which has to check every item in a hashmap.
You might also want to create a simple hash function for hashmap entry objects to return smaller numbers to increase multiply/division performance.
Here's a subquadratic (O(N**2 / log N)) algorithm for finding maximal sets from a set of sets: An Old Sub-Quadratic Algorithm for Finding Extremal Sets.
But if you know your data distribution, you can do much better in average case.
This is what I ended up doing. It works well in my situation as there is usually some value that is only shared by a small number of maps. Kudos to Mark Ransom for pushing me in this direction.
In prose: Index the maps by key/value pair, so that each key/value pair is associated with a set of maps. Then, for each map: Find the smallest set associated with one of its key/value pairs; this set is typically small for my data. Each of the maps in this set is a potential 'supermap'; no other map could be a 'supermap' as it would not contain this key/value pair. Search this set for a supermap. Finally, remove all the identified submaps from the original set.
private <K, V> void removeSubmaps(Set<Map<K, V>> maps) {
// index the maps by key/value
List<Map<K, V>> mapList = toList(maps);
Map<K, Map<V, List<Integer>>> values = LazyMap.create(HashMap.class, ArrayList.class);
for (int i = 0, uniqueRowsSize = mapList.size(); i < uniqueRowsSize; i++) {
Map<K, V> row = mapList.get(i);
Integer idx = i;
for (Map.Entry<K, V> entry : row.entrySet())
values.get(entry.getKey()).get(entry.getValue()).add(idx);
}
// find submaps
Set<Map<K, V>> toRemove = Sets.newHashSet();
for (Map<K, V> submap : mapList) {
// find the smallest set of maps with a matching key/value
List<Integer> smallestList = null;
for (Map.Entry<K, V> entry : submap.entrySet()) {
List<Integer> list = values.get(entry.getKey()).get(entry.getValue());
if (smallestList == null || list.size() < smallestList.size())
smallestList = list;
}
// compare with each of the maps in that set
for (int i : smallestList) {
Map<K, V> map = mapList.get(i);
if (isSubmap(submap, map))
toRemove.add(submap);
}
}
maps.removeAll(toRemove);
}
private <K,V> boolean isSubmap(Map<K, V> submap, Map<K,V> map){
if (submap.size() >= map.size())
return false;
for (Map.Entry<K,V> entry : submap.entrySet()) {
V other = map.get(entry.getKey());
if (other == null)
return false;
if (!other.equals(entry.getValue()))
return false;
}
return true;
}
