Retrieve the average count in count-min-sketch datastructure - data-structures

I am in love with probabilistic data structures. For my current problem, it seems that the count-min-sketch structure is almost the right candidate. I want to use count-min-sketch to store events per ID.
Let's assume I do have the following
Map<String, Int> {
[ID1, 10],
[ID2, 12],
[ID2, 15]
}
If I use a count-min-sketch, I can query the data structure by IDs and retrieve the ~counts.
Question
Actually I am interested in the average occurrence over all IDs, which in the example above would be: 12,33. If I am using the count-min then it seems that I need to store the Set of IDs and then iterate over the set and query the count-min for each ID and calculate the average. Is there an improved way without storing all IDs? Ideally I just want to retrieve the average right away without remembering all IDs.
Hope that make sense!?

You should be able to calculate the average count if you know the number of entries, and the number of distinct entries:
averageCount = totalNumberOfEntries / numberOfDistinctEntries
Right? And to calculate the number of distinct entries, you can use e.g. HyperLogLog. You already added the hyperloglog tag to your question, so maybe you already know this?

Related

Data structure to Filter Data Quickly

I am doing a bit of research into making an efficient filtering algorithm when it comes to many properties of specific data. This is kind of a fun project for me to learn new data structures.
for example, say I wanted All RPG's on Playstation Which had English releases.
Now I want to allow for much more complex queries.
Is there a good data structure to handle filtering attributes like this, without the need to give all of the attributes. Instead I can give only a few and still find the correct games?
I currently plan to have "buckets" which will describe an attribute, for example all Genre's game ID's will be in one bucket, and so forth. Then I will use a hash algorithm to add 1 to that game, and only use games which have the correct value after the search.
But I want to try to find a faster or easier method, any suggestions when it comes to filtering many attributes to find sets of items?
Thanks,
What do you mean by "without the need to give all of the attributes"? Are you saying you have N attributes and you want to find the items that match l < N of the attributes, or are you saying that you don't want to compute an index for each attribute?
Hashing each attribute into buckets will give you O(1) time at the expense of O(n) space to store each index.
You could sort your list by one or two attributes to make some lookups O(logn) at the expense of having to do the sorting up front for O(nlogn) time
You could get kinda clever with bloom filters for your attributes and let some attributes overlap. This would lead to some false-positives, but you could filter those out after the fact. This gives you constant-space with constant-time lookup in the average case (but O(n) time in the worse-case).

Which Data Structure should I choose?

I am thinking to list the top scores for a game. More than 300000 players are to be listed in order by their top score. Players can update their high score by typing in their name, and their new top score. Only 10 scores show up at a time, and the user can type in which place they want to start with. So if they type "100100" then the whole list should refresh, and show them the 100,100th score through the 100,109th score. So what data structure should I use in this case? I am thinking to use hashTable with users' names as keys, it would take constant time to update their scores. But what if a user's previous score is at 100,100th, and after he updated his score his score became the highest one in whole list? Then if by using hash table it would take linear time since I need to compare each score in the list to make sure is the highest one. Therefore, is there any better data structure to choose beside using hashTable?
You should choose the data structure that is optimized for the most common operation. By your description of an ordered list probably the most common operation will be viewing the list (and jumping around in it).
If you use a hashtable with the user's names as keys, then it will be very expensive to display the list ordered by score, and very expensive to compute different views when viewers skip around in the list.
Instead, using a simple list sorted by score will make all of the "view" operations very cheap and very easy to implement. When a user updates their score, simply do a linear (O(n)) search for the user by name and remove their old entry. Then, since the list is sorted, you can search it in O(log n) time to find where to re-insert their new entry in the list.
Use a map (ordered tree) based container with score keys and a hash with name keys. Let the values be a link to your entities stored in a list or array etc. i.e. store the data as you like an make indeces for the different access you need performed fast.

Sorting application difficulty

Currently I am reading a book on algorithms and found this usage of sorting.
Reconstructing the original order - How can we restore the original arrangment of a set of items after we permute them for some application? Add an extra field to the data record for the item, such that i-th record sets this field to i. Carry this field along whenever you move the record, and later sort on it when you want the initial order back.
I ve been trying hard to understand what does it mean. And I failed miserably. Pls somebody help?
Suppose you have list of items in random order:
itemC, itemB, itemA, itemD
you sorted them up:
itemA, itemB, itemC, itemD
and you didn't have enough memory to store them in a separate location, so original sequence is lost. Moreover, original order is random and it will be problematic/impossible to restore it.
This article gives a solution to this problem.
Add an extra field to the data record for the item, such that i-th record sets this field to i
So, we add an extra field for each of the items:
(itemC,1), (itemB,2), (itemA,3), (itemD, 4)
And after sort we have:
(itemA,3), (itemB,2), (itemC,1), (itemD, 4)
So we can easily restore initial order sorting by additional field
Let's say you have the data in an array, because it's the simplest structure that I can use to exemplify.
So, your node (i.e., element of the array) may look like this:
(some data type) data
The algorithm is suggesting you to add an integer field, so it looks like this:
(some data type) data,
int position
And then, you fill the positions with the actual index. Something like this pseudocode:
for current: 0 to lastElement
array[current].position = current
(that's not written in any language I know of, but it should be readable)
After doing that, you shuffle it (resort it) for whatever you need to.
When you want to restore the original ordering, all you need to do is sort by the position field.
Well, basically it's saying that you need some sort of thingy to keep track of the original order (which is destroyed by the permutation). One option would be to simply reverse the permutation (check out Steve Jessop's infrmative answer here).
Another option to invert the permutation would require fewer processing steps, but more memory. More specifically, each node in your input set would have an extra ID field, and all the elements in this input set are sorted based on this field. Once you apply the permutation, it's obvious that the IDs are no longer in a sorted order. If you wish to invert the permutation, all you have to do is sort the list again based on this field.

Best data structure for a given set of operations - Add, Retrieve Min/Max and Retrieve a specific object

I am looking for the optimal (time and space) optimal data structure for supporting the following operations:
Add Persons (name, age) to a global data store of persons
Fetch Person with minimum and maximum age
Search for Person's age given the name
Here's what I could think of:
Keep an array of Persons, and keep adding to end of array when a new Person is to be added
Keep a hash of Person name vs. age, to assist in fetching person's age with given name
Maintain two objects minPerson and maxPerson for Person with min and max age. Update this if needed, when a new Person is added.
Now, although I keep a hash for better performance of (3), I think it may not be the best way if there are many collisions in the hash. Also, addition of a Person would mean an overhead of adding to the hash.
Is there anything that can be further optimized here?
Note: I am looking for the best (balanced) approach to support all these operations in minimum time and space.
You can get rid of the array as it doesn't provide anything that the other two structures can't do.
Otherwise, a hashtable + min/max is likely to perform well for your use case. In fact, this is precisely what I would use.
As to getting rid of the hashtable because a poor hash function might lead to collisions: well, don't use a poor hash function. I bet that the default hash function for strings that's provided by your programming language of choice is going to do pretty well out of the box.
It looks like that you need a data structure that needs fast inserts and that also supports fast queries on 2 different keys (name and age).
I would suggest keeping two data structures, one a sorted data structure (e.g. a balanced binary search tree) where the key is the age and the value is a pointer to the Person object, the other a hashtable where the key is the name and the value is a pointer to the Person object. Notice we don't keep two copies of the same object.
A balanced binary search tree would provide O(log(n)) inserts and max/min queries, while the hastable would give us O(1) (amortized) inserts and lookups.
When we add a new Person, we just add a pointer to it to both data structures. For a min/max age query, we can retrieve the Object by querying the BST. For a name query we can just query the hashtable.
Your question does not ask for updates/deletes, but those are also doable by suitably updating both data structures.
It sounds like you're expecting the name to be the unique idenitifer; otherwise your operation 3 is ambiguous (What is the correct return result if you have two entries for John Smith?)
Assuming that the uniqueness of a name is guaranteed, I would go with a plain hashtable keyed by names. Operation 1 and 3 are trivial to execute. Operation 2 could be done in O(N) time if you want to search through the data structure manually, or you can do like you suggest and keep track of the min/max and update it as you add/delete entries in the hash table.

Do I need to implement a b-tree search for this?

I have an array of integers, which could run into the hundreds of thousands (or more), sorted numerically ascending since that's how they were originally stacked.
I need to be able to query the array to get the index of its first occurrence of a number >= some input, as efficiently as possible. The only way I would know how to do this without even thinking about it would be to iterate through the array testing the condition until it returns true, at which point I'd stop iterating. However, this is the most expensive solution to this problem and I'm looking for the best algorithm to solve it.
I'm coding in Objective-C, but I'll give an example in JavaScript to broaden the audience of people who are able to respond.
// Sample set
var numbers = [1, 7, 23, 23, 23, 89, 1002, 1003];
var indexAfter100 = getIndexOfValueGreaterThan(100);
var indexAfter7 = getIndexOfValueGreaterThan(7);
// (indexAfter100 == 6) == true
// (indexAfter7 == 2) == true
Putting this data into a DB in order to perform this search will only be a last-resort solution since I'm keen to see some sort of algorithm to tackle this quickly in memory.
I do have the ability to change the data structure, or to store an additional data structure as I'm building the array, since my program has already pushed each number one by one onto this stack, so I'd just modify the code that's adding them to the stack. Searching for the index as they're being added to the stack isn't possible since the search operation will be repeated frequently with different values after the fact.
Right now I'm thinking "B-Tree" but to be honest, I would have no idea how to implement one and before I go off and start figuring that out, I wonder if there's a nice algorithm that fits this single use-case better?
You should use binary search. Objective C could even have a built-in method for that (many languages I know do). B-tree won't probably help much, unless you want to store the data on disk.
I don't know about Objective-C, but C (plain 'ol C) comes with a function called bsearch (besides, AFAIK, Obj-C can call C functions just fine):
http://www.cplusplus.com/reference/clibrary/cstdlib/bsearch/
That basically does a binary search which sounds like it's what you need.
A fast search algorithm should be able to handle an array of ints of that size without taking too long, I should think (and the array is sorted, so a binary search would probably be the way to go).
I think a btree is probably overkill...
Since they are sorted in a particular ASCending order and you only need the bigger ones, I would serialize that array, explode it by the INT and keep the part of the serialized string that holds the bigger INTs, then unserialize it and voilá.
Linear search also referred to as sequential search looks at each element in sequence from the start to see if the desired element is present in the data structure. When the amount of data is small, this search is fast.Its easy but work needed is in proportion to the amount of data to be searched.Doubling the number of elements will double the time to search if the desired element is not present.
Binary search is efficient for larger array. In this we check the middle element.If the value is bigger that what we are looking for, then look in the first half;otherwise,look in the second half. Repeat this until the desired item is found. The table must be sorted for binary search. It eliminates half the data at each iteration.Its logarithmic

Resources