Implementing shuffle on the celestial jukebox - algorithm

How would one implement shuffle for the "Celestial Jukebox"?
More precisely, at each time t, return an uniform random number between 0..n(t), such that there are no repeats in the entire sequence, with n() increasing over time.
For the concrete example, assume a flat-rate music service which allows playing any song in the catalog by a 0 based index number. Every so often, new songs are added which increase range of index numbers. The goal is to play a new song each time (assuming no duplicates in the catalog).
an ideal solution would be feasible on existing hardware - how would I shoehorn a list of six million songs in 8MB of DRAM? Similarly, the high song count exacerbates O(n) selection timings.
-- For an LCG generator, given a partially exhausted LCG on 0..N0, can that be translated to a different LCG on 0..N1 (where N1 > N0), that doen't repeat the exhausted sequence.
-- Checking if a particular song has already been played seems to rapidly grow out of hand, although this might be the only way ? Is there an efficient data structure for this?

The way that I like to do that kind of non-repeating random selection is to have a list, and each time I select an item at random between [0-N), I remove it from that list. In your case, as new items get added to the catalog, it would also be added to the not-yet-selected list. Once you get to the end, simply reload all the songs back to the list.
EDIT:
If you take v3's suggestion into account, this can be done in basically O(1) time after the O(N) initialization step. It guarantees non-repeating random selection.
Here is the recap:
Add the initial items to a list
Pick index i at random (from set of [0,N))
Remove item at index i
Replace the hole at i with the Nth item (or null if i == Nth) and decrement N
For new items, simply append to the end of the list and increment N as necessary
If you ever get to playing through all the songs (which I doubt if you have 6M songs), then add all the songs back to the list, lather, rinse, and repeat.
Since you are trying to deal with rather large sets, I would recommend the use of a DB. A simple table with basically two fields: id and "pointer" (where "pointer" is what tells you the song to play which could be a GUID, FileName, etc, depending on how you want to do it). Have an index on id and you should get very decent performance with persistence between application runs.
EDIT for 8MB limit:
Umm, this does make it a bit harder... In 8 MB, you can store a maximum of ~2M entries using 32-bit keys.
So what I would recommend is to pre-select the next 2M entries. If the user plays through 2M songs in a lifetime, damn! To pre-select them, do a pre-init step using the above algorithm. The one change I would make is that as you add new songs, roll the dice and see if you want to randomly add that song to the mix. If yes, then pick a random index and replace it with the new song's index.

With a limit of 8MB for 6 million songs, there's plainly not room to store even a single 32 bit integer for each song. Unless you're prepared to store the list on disk (in which case, see below).
If you're prepared to drop the requirement that new items be immediately added to the shuffle, you can generate an LCG over the current set of songs, then when that is exhausted, generate a new LCG over only the songs that were added since you began. Rinse and repeat until you no longer have any new songs. You can also use this rather cool algorithm that generates an unguessable permutation over an arbitrary range without storing it.
If you're prepared to relax the requirement of 8MB ram for 6 million songs, or to go to disk (for example, by memory mapping), you could generate the sequence from 1..n at the beginning, shuffle it with fisher-yates, and whenever a new song is added, pick a random element from the so-far-unplayed section, insert the new ID there, and append the original ID to the end of the list.
If you don't care much about computational efficiency, you could store a bitmap of all songs, and repeatedly pick IDs uniformly at random until you find one you haven't played yet. This would take 6 million tries to find the last song (on average), which is still damn fast on a modern CPU.

While Erich's solution is probably better for your specific use case, checking if a song has already been played is very fast (amortized O(1)) with a hash-based structure, such as a set in Python or a hashset<int> in C++.

You could simply generate the sequence of numbers from 1 to n and then shuffle it using a Fisher-Yates shuffle. That way you can guarantee that the sequence won't repeat, regardless of n.

You could use a linked list inside an array:
To build the initial playlist, use an array containing a something like this:
struct playlistNode{
songLocator* song;
playlistNode *next;
};
struct playlistNode arr[N];
Also keep a 'head' and 'freelist' pointer;
Populate it in 2 passes:
1. fill in arr with all the songs in the catalog in order 0..N.
2. randomly iterate through all the indexes, filling in the next pointer;
Deletion of songs played is O(1):
head=cur->next;
cur->song=NULL;
freelist->next = freelist;
cur->next=freelist;
freelist=cur;
Insertion of new songs is O(1) also: pick an array index at random, and patch a new node.
node = freelist;
freelist=freelist->next;
do {
i=rand(N);
} while (!arr[i].song); //make sure you didn't hit a played node
node->next = arr[i].next;
arr[i].next=node;

Related

Garbage collection with a very large dictionary

I have a very large immutable set of keys that doesn't fit in memory, and an even larger list of references, which must be scanned just once. How can the mark phase be done in RAM? I do have a possible solution, which I will write as an answer later (don't want to spoil it), but maybe there are other solutions I didn't think about.
I will try to restate the problem to make it more "real":
You work at Facebook, and your task is to find which users didn't ever create a post with an emoji. All you have is the list of active user names (around 2 billion), and the list of posts (user name / text), which you have to scan, but just once. It contains only active users (you don't need to validate them).
Also, you have one computer, with 2 GB of RAM (bonus points for 1 GB). So it has to be done all in RAM (without external sort or reading in sorted order). Within two day.
Can you do it? How? Tips: You might want to use a hash table, with the user name as the key, and one bit as the value. But the list of user names doesn't fit in memory, so that doesn't work. With user ids it might work, but you just have the names. You can scan the list of user names a few times (maybe 40 times, but not more).
Sounds like a problem I tackled 10 years ago.
The first stage: ditch GC. The overhead of GC for small objects (a few bytes) can be in excess of 100%.
The second stage: design a decent compression scheme for user names. English has about 3 bits per character. Even if you allowed more characters, the average amount of bits won't rise fast.
Third stage: Create dictionary of usernames in memory. Use a 16 bit prefix of each username to choose the right sub-dictionary. Read in all usernames, initially sorting them just by this prefix. Then sort each dictionary in turn.
As noted in the question, allocate one extra bit per username for the "used emoji" result.
The problem is now I/O bound, as the computation is embarrassingly parallel. The longest phase will be reading in all the posts (which is going to be many TB).
Note that in this setup, you're not using fancy data types like String. The dictionaries are contiguous memory blocks.
Given a deadline of two days, I would however dump some of this this fanciness. The I/O bound for reading the text is severe enough that the creation of the user database may exceed 16 GB. Yes, that will swap to disk. Big deal for a one-off.
Hash the keys, sort the hashes, and store sorted hashes in compressed form.
TL;DR
The algorithm I propose may be considered as an extension to the solution for similar (simpler) problem.
To each key: apply a hash function that maps keys to integers in range [0..h]. It seems to be reasonably good to start with h = 2 * number_of_keys.
Fill all available memory with these hashes.
Sort the hashes.
If hash value is unique, write it to the list of unique hashes; otherwise remove all copies of it and write it to the list of duplicates. Both these lists should be kept in compressed form: as difference between adjacent values, compressed with optimal entropy coder (like arithmetic coder, range coder, or ANS coder). If the list of unique hashes was not empty, merge it with sorted hashes; additional duplicates may be found while merging. If the list of duplicates was not empty, merge new duplicates to it.
Repeat steps 1..4 while there are any unprocessed keys.
Read keys several more times while performing steps 1..5. But ignore all keys that are not in the list of duplicates from previous pass. For each pass use different hash function (for anything except matching with the list of duplicates from previous pass, which means we need to sort hashes twice, for 2 different hash functions).
Read keys again to convert remaining list of duplicate hashes into list of plain keys. Sort it.
Allocate array of 2 billion bits.
Use all unoccupied memory to construct an index for each compressed list of hashes. This could be a trie or a sorted list. Each entry of the index should contain a "state" of entropy decoder which allows to avoid decoding compressed stream from the very beginning.
Process the list of posts and update the array of 2 billion bits.
Read keys once more co convert hashes back to keys.
While using value h = 2*number_of_keys seems to be reasonably good, we could try to vary it to optimize space requirements. (Setting it too high decreases compression ratio, setting it too low results in too many duplicates).
This approach does not guarantee the result: it is possible to invent 10 bad hash functions so that every key is duplicated on every pass. But with high probability it will succeed and most likely will need about 1GB RAM (because most compressed integer values are in range [1..8], so each key results in about 2..3 bits in compressed stream).
To estimate space requirements precisely we might use either (complicated?) mathematical proof or complete implementation of algorithm (also pretty complicated). But to obtain rough estimation we could use partial implementation of steps 1..4. See it on Ideone. It uses variant of ANS coder named FSE (taken from here: https://github.com/Cyan4973/FiniteStateEntropy) and simple hash function implementation (taken from here: https://gist.github.com/badboy/6267743). Here are the results:
Key list loads allowed: 10 20
Optimal h/n: 2.1 1.2
Bits per key: 2.98 2.62
Compressed MB: 710.851 625.096
Uncompressed MB: 40.474 3.325
Bitmap MB: 238.419 238.419
MB used: 989.744 866.839
Index entries: 1'122'520 5'149'840
Indexed fragment size: 1781.71 388.361
With the original OP limitation of 10 key scans optimal value for hash range is only slightly higher (2.1) than my guess (2.0) and this parameter is very convenient because it allows using 32-bit hashes (instead of 64-bit ones). Required memory is slightly less than 1GB, which allows to use pretty large indexes (so step 10 would be not very slow). Here lies a little problem: these results show how much memory is consumed at the end, but in this particular case (10 key scans) we temporarily need more than 1 GB memory while performing second pass. This may be fixed if we drop results (unique hashes) of the first first pass and recompute them later, together with step 7.
With not so tight limitation of 20 key scans optimal value for hash range is 1.2, which means algorithm needs much less memory and allows more space for indexes (so that step 10 would be almost 5 times faster).
Loosening limitation to 40 key scans does not result in any further improvements.
Minimal perfect hashing
Create a minimal perfect hash function (MPHF).
At around 1.8 bits per key (using the
RecSplit
algorithm), this uses about 429 MB.
(Here, 1 MB is 2^20 bytes, 1 GB is 2^30 bytes.)
For each user, allocate one bit as a marker, about 238 MB.
So memory usage is around 667 MB.
Then read the posts, for each user calculate the hash,
and set the related bit if needed.
Read the user table again, calculate the hash, check if the bit is set.
Generation
Generating the MPHF is a bit tricky, not because it is slow
(this may take around 30 minutes of CPU time),
but due to memory usage. With 1 GB or RAM,
it needs to be done in segments.
Let's say we use 32 segments of about the same size, as follows:
Loop segmentId from 0 to 31.
For each user, calculate the hash code, modulo 32 (or bitwise and 31).
If this doesn't match the current segmentId, ignore it.
Calculate a 64 bit hash code (using a second hash function),
and add that to the list.
Do this until all users are read.
A segment will contain about 62.5 million keys (2 billion divided by 32), that is 238 MB.
Sort this list by key (in place) to detect duplicates.
With 64 bit entries, the probability of duplicates is very low,
but if there are any, use a different hash function and try again
(you need to store which hash function was used).
Now calculate the MPHF for this segment.
The RecSplit algorithm is the fastest I know.
The CHD algorithm can be used as well,
but needs more space / is slower to generate.
Repeat until all segments are processed.
The above algorithm reads the user list 32 times.
This could be reduced to about 10 if more segments are used
(for example one million),
and as many segments are read, per step, as fits in memory.
With smaller segments, less bits per key are needed
to the reduced probability of duplicates within one segment.
The simplest solution I can think of is an old-fashioned batch update program. It takes a few steps, but in concept it's no more complicated than merging two lists that are in memory. This is the kind of thing we did decades ago in bank data processing.
Sort the file of user names by name. You can do this easily enough with the Gnu sort utility, or any other program that will sort files larger than what will fit in memory.
Write a query to return the posts, in order by user name. I would hope that there's a way to get these as a stream.
Now you have two streams, both in alphabetic order by user name. All you have to do is a simple merge:
Here's the general idea:
currentUser = get first user name from users file
currentPost = get first post from database stream
usedEmoji = false
while (not at end of users file and not at end of database stream)
{
if currentUser == currentPostUser
{
if currentPost has emoji
{
usedEmoji = true
}
currentPost = get next post from database
}
else if currentUser > currentPostUser
{
// No user for this post. Get next post.
currentPost = get next post from database
usedEmoji = false
}
else
{
// Current user is less than post user name.
// So we have to switch users.
if (usedEmoji == false)
{
// No post by this user contained an emoji
output currentUser name
}
currentUser = get next user name from file
}
}
// at the end of one of the files.
// Clean up.
// if we reached the end of the posts, but there are still users left,
// then output each user name.
// The usedEmoji test is in there strictly for the first time through,
// because the current user when the above loop ended might have had
// a post with an emoji.
while not at end of user file
{
if (usedEmoji == false)
{
output currentUser name
}
currentUser = get next user name from file
usedEmoji = false
}
// at this point, names of all the users who haven't
// used an emoji in a post have been written to the output.
An alternative implementation, if obtaining the list of posts as described in #2 is overly burdensome, would be to scan the list of posts in their natural order and output the user name from any post that contains an emoji. Then, sort the resulting file and remove duplicates. You can then proceed with a merge similar to the one described above, but you don't have to explicitly check if post has an emoji. Basically, if a name appears in both files, then you don't output it.

Efficiently and dynamically rank many users in memory?

I run a Java game server where I need to efficiently rank players in various ways. For example, by score, money, games won, and other achievements. This is so I can recognize the top 25 players in a given category to apply medals to those players, and dynamically update them as the rankings change. Performance is a high priority.
Note that this cannot easily be done in the database only, as the ranks will come from different sources of data and different database tables, so my hope is to handle this all in memory, and call methods on the ranked list when a value needs to be updated. Also, potentially many users can tie for the same rank.
For example, let's say I have a million players in the database. A given player might earn some extra points and instantly move from 21,305th place to 23rd place, and then later drop back off the top 25 list. I need a way to handle this efficiently. I imagine that some kind of doubly-linked list would be used, but am unsure of how to handle quickly jumping many spots in the list without traversing it one at a time to find the correct new ranking. The fact that players can tie complicates things a little bit, as each element in the ranked list can have multiple users.
How would you handle this in Java?
I don't know whether there is library that may help you, but I think you can maintain a minimum heap in the memory. When a player's point updates, you can compare this to the root of the heap, if less than,do nothing.else adjust the heap.
That means, you can maintain a minimum heap that has 25 nodes which are the highest 25 of all the players in one category.
Forget linked list. It allows fast insertions, but no efficient searching, so it's of no use.
Use the following data
double threshold
ArrayList<Player> top;
ArrayList<Player> others; (3)
and manage the following properties
each player in top has a score greater or equal to threshold
each player in others has a score lower than threshold
top is sorted
top.size() >= 25
top.size() < 25 + N where N is some arbitrary limit (e.g., 50)
Whenever some player raises it score, do the following:
if they're in top, sort top (*)
if they're in others, check if their score promotes them to top
if so, remove them from others, insert in top, and sort top
if top grew too big, move the n/2 worst players from top to others and update threshold
Whenever some player lowers it score, do the following:
- if they're in others, do nothing
- if they're in top, check if their new score allows them to stay in top
- if so, sort top (1)
- otherwise, demote them to bottom, and check if top got too small
- if so, determine an appropriate new threshold and move all corresponding players to top. (2)
(1) Sorting top is cheap as it's small. Moreover, TimSort (i.e., the algorithm behind Arrays.sort(Object[])) works very well on partially sorted sequences. Instead of sorting, you can simply remember that top is unsorted and sort it later when needed.
(2) Determining a proper threshold can be expensive and so can be moving the players. That's why only N/2 player get moved away from it when it grows too big. This leaves some spare players and makes this case pretty improbable assuming that players rarely lose score.
EDIT
For managing the objects, you also need to be able to find them in the lists. Either add a corresponding field to Player or use a TObjectIntHashMap.
EDIT 2
(3) When removing an element from the middle of others, simply replace the element by the last one and shorten the list by one. You can do it as the order doesn't matter and you must do it because of speed. (4)
(4) The whole others list needn't be actually stored anywhere. All you need is a possibility to iterate all the players not contained in top. This can be done by using an additional Set or by simply iterating though all the players and skipping those scoring above threshold.
FINAL RECOMMENDATIONS
Forget the others list (unless I'm overlooking something, you won't need it).
I guess you will need no TObjectIntHashMap either.
Use a list top and a boolean isTopSorted, which gets cleared whenever a top score changes or a player gets promoted to top (simple condition: oldScore >= threshold | newScore >= threshold).
For handling ties, make top contain at least 25 differently scored players. You can check this condition easily when printing the top players.
I assume you may use plenty of memory to do that or memory is not a concern for you. Now as you want only the top 25 entries for any category, I would suggest the following:
Have a HashSet of Player objects. Player objects have the info like name, games won, money etc.
Now have a HashMap of category name vs TreeSet of top 25 player objects in that category. The category name may be a checksum of some columns say gamewon, money, achievent etc.
HashMap<String /*for category name */, TreeSet /*Sort based on the criteria */> ......
Whenever you update a player object, you update the common HashSet first and then check if the player object is a candidate for top 25 entries in any of the categories. If it is a candidate, some player object unfortunately may lose their ranking and hence may get kicked out of the corresponding treeset.
>> if you make the TreeSet sorted by the score, it'll break whenever the score changes (and the player will not be found in it
Correct. Now I got the point :) . So, I will do the following to mitigate the problem. The player object will have a field that indicate whether it is already in some categories, basically a set of categories it is already in. While updating a player object, we have to check if the player is already in some categories, if it is in some categories already, we will rearrange the corresponding treeset first; i.e. remove the player object and readjust the score and add it back to the treeset. Whenever a player object is kicked out of a category, we remove the category from the field which is holding a set of categories the player is in
Now, what you do if the look-up is done with a brand new search criteria (means the top 25 is not computed for this criteria already) ? ------
Traverse the HashMap and build the top entries for this category from "scratch". This will be expensive operation, just like indexing something afresh.

Which Data Structure should I choose?

I am thinking to list the top scores for a game. More than 300000 players are to be listed in order by their top score. Players can update their high score by typing in their name, and their new top score. Only 10 scores show up at a time, and the user can type in which place they want to start with. So if they type "100100" then the whole list should refresh, and show them the 100,100th score through the 100,109th score. So what data structure should I use in this case? I am thinking to use hashTable with users' names as keys, it would take constant time to update their scores. But what if a user's previous score is at 100,100th, and after he updated his score his score became the highest one in whole list? Then if by using hash table it would take linear time since I need to compare each score in the list to make sure is the highest one. Therefore, is there any better data structure to choose beside using hashTable?
You should choose the data structure that is optimized for the most common operation. By your description of an ordered list probably the most common operation will be viewing the list (and jumping around in it).
If you use a hashtable with the user's names as keys, then it will be very expensive to display the list ordered by score, and very expensive to compute different views when viewers skip around in the list.
Instead, using a simple list sorted by score will make all of the "view" operations very cheap and very easy to implement. When a user updates their score, simply do a linear (O(n)) search for the user by name and remove their old entry. Then, since the list is sorted, you can search it in O(log n) time to find where to re-insert their new entry in the list.
Use a map (ordered tree) based container with score keys and a hash with name keys. Let the values be a link to your entities stored in a list or array etc. i.e. store the data as you like an make indeces for the different access you need performed fast.

Algorithm to find top 10 search terms

I'm currently preparing for an interview, and it reminded me of a question I was once asked in a previous interview that went something like this:
"You have been asked to design some software to continuously display the top 10 search terms on Google. You are given access to a feed that provides an endless real-time stream of search terms currently being searched on Google. Describe what algorithm and data structures you would use to implement this. You are to design two variations:
(i) Display the top 10 search terms of all time (i.e. since you started reading the feed).
(ii) Display only the top 10 search terms for the past month, updated hourly.
You can use an approximation to obtain the top 10 list, but you must justify your choices."
I bombed in this interview and still have really no idea how to implement this.
The first part asks for the 10 most frequent items in a continuously growing sub-sequence of an infinite list. I looked into selection algorithms, but couldn't find any online versions to solve this problem.
The second part uses a finite list, but due to the large amount of data being processed, you can't really store the whole month of search terms in memory and calculate a histogram every hour.
The problem is made more difficult by the fact that the top 10 list is being continuously updated, so somehow you need to be calculating your top 10 over a sliding window.
Any ideas?
Frequency Estimation Overview
There are some well-known algorithms that can provide frequency estimates for such a stream using a fixed amount of storage. One is Frequent, by Misra and Gries (1982). From a list of n items, it find all items that occur more than n / k times, using k - 1 counters. This is a generalization of Boyer and Moore's Majority algorithm (Fischer-Salzberg, 1982), where k is 2. Manku and Motwani's LossyCounting (2002) and Metwally's SpaceSaving (2005) algorithms have similar space requirements, but can provide more accurate estimates under certain conditions.
The important thing to remember is that these algorithms can only provide frequency estimates. Specifically, the Misra-Gries estimate can under-count the actual frequency by (n / k) items.
Suppose that you had an algorithm that could positively identify an item only if it occurs more than 50% of the time. Feed this algorithm a stream of N distinct items, and then add another N - 1 copies of one item, x, for a total of 2N - 1 items. If the algorithm tells you that x exceeds 50% of the total, it must have been in the first stream; if it doesn't, x wasn't in the initial stream. In order for the algorithm to make this determination, it must store the initial stream (or some summary proportional to its length)! So, we can prove to ourselves that the space required by such an "exact" algorithm would be Ω(N).
Instead, these frequency algorithms described here provide an estimate, identifying any item that exceeds the threshold, along with some items that fall below it by a certain margin. For example the Majority algorithm, using a single counter, will always give a result; if any item exceeds 50% of the stream, it will be found. But it might also give you an item that occurs only once. You wouldn't know without making a second pass over the data (using, again, a single counter, but looking only for that item).
The Frequent Algorithm
Here's a simple description of Misra-Gries' Frequent algorithm. Demaine (2002) and others have optimized the algorithm, but this gives you the gist.
Specify the threshold fraction, 1 / k; any item that occurs more than n / k times will be found. Create an an empty map (like a red-black tree); the keys will be search terms, and the values will be a counter for that term.
Look at each item in the stream.
If the term exists in the map, increment the associated counter.
Otherwise, if the map less than k - 1 entries, add the term to the map with a counter of one.
However, if the map has k - 1 entries already, decrement the counter in every entry. If any counter reaches zero during this process, remove it from the map.
Note that you can process an infinite amount of data with a fixed amount of storage (just the fixed-size map). The amount of storage required depends only on the threshold of interest, and the size of the stream does not matter.
Counting Searches
In this context, perhaps you buffer one hour of searches, and perform this process on that hour's data. If you can take a second pass over this hour's search log, you can get an exact count of occurrences of the top "candidates" identified in the first pass. Or, maybe its okay to to make a single pass, and report all the candidates, knowing that any item that should be there is included, and any extras are just noise that will disappear in the next hour.
Any candidates that really do exceed the threshold of interest get stored as a summary. Keep a month's worth of these summaries, throwing away the oldest each hour, and you would have a good approximation of the most common search terms.
Well, looks like an awful lot of data, with a perhaps prohibitive cost to store all frequencies. When the amount of data is so large that we cannot hope to store it all, we enter the domain of data stream algorithms.
Useful book in this area:
Muthukrishnan - "Data Streams: Algorithms and Applications"
Closely related reference to the problem at hand which I picked from the above:
Manku, Motwani - "Approximate Frequency Counts over Data Streams" [pdf]
By the way, Motwani, of Stanford, (edit) was an author of the very important "Randomized Algorithms" book. The 11th chapter of this book deals with this problem. Edit: Sorry, bad reference, that particular chapter is on a different problem. After checking, I instead recommend section 5.1.2 of Muthukrishnan's book, available online.
Heh, nice interview question.
This is one of the research project that I am current going through. The requirement is almost exactly as yours, and we have developed nice algorithms to solve the problem.
The Input
The input is an endless stream of English words or phrases (we refer them as tokens).
The Output
Output top N tokens we have seen so
far (from all the tokens we have
seen!)
Output top N tokens in a
historical window, say, last day or
last week.
An application of this research is to find the hot topic or trends of topic in Twitter or Facebook. We have a crawler that crawls on the website, which generates a stream of words, which will feed into the system. The system then will output the words or phrases of top frequency either at overall or historically. Imagine in last couple of weeks the phrase "World Cup" would appears many times in Twitter. So does "Paul the octopus". :)
String into Integers
The system has an integer ID for each word. Though there is almost infinite possible words on the Internet, but after accumulating a large set of words, the possibility of finding new words becomes lower and lower. We have already found 4 million different words, and assigned a unique ID for each. This whole set of data can be loaded into memory as a hash table, consuming roughly 300MB memory. (We have implemented our own hash table. The Java's implementation takes huge memory overhead)
Each phrase then can be identified as an array of integers.
This is important, because sorting and comparisons on integers is much much faster than on strings.
Archive Data
The system keeps archive data for every token. Basically it's pairs of (Token, Frequency). However, the table that stores the data would be so huge such that we have to partition the table physically. Once partition scheme is based on ngrams of the token. If the token is a single word, it is 1gram. If the token is two-word phrase, it is 2gram. And this goes on. Roughly at 4gram we have 1 billion records, with table sized at around 60GB.
Processing Incoming Streams
The system will absorbs incoming sentences until memory becomes fully utilized (Ya, we need a MemoryManager). After taking N sentences and storing in memory, the system pauses, and starts tokenize each sentence into words and phrases. Each token (word or phrase) is counted.
For highly frequent tokens, they are always kept in memory. For less frequent tokens, they are sorted based on IDs (remember we translate the String into an array of integers), and serialized into a disk file.
(However, for your problem, since you are counting only words, then you can put all word-frequency map in memory only. A carefully designed data structure would consume only 300MB memory for 4 million different words. Some hint: use ASCII char to represent Strings), and this is much acceptable.
Meanwhile, there will be another process that is activated once it finds any disk file generated by the system, then start merging it. Since the disk file is sorted, merging would take a similar process like merge sort. Some design need to be taken care at here as well, since we want to avoid too many random disk seeks. The idea is to avoid read (merge process)/write (system output) at the same time, and let the merge process read form one disk while writing into a different disk. This is similar like to implementing a locking.
End of Day
At end of day, the system will have many frequent tokens with frequency stored in memory, and many other less frequent tokens stored in several disk files (and each file is sorted).
The system flush the in-memory map into a disk file (sort it). Now, the problem becomes merging a set of sorted disk file. Using similar process, we would get one sorted disk file at the end.
Then, the final task is to merge the sorted disk file into archive database.
Depends on the size of archive database, the algorithm works like below if it is big enough:
for each record in sorted disk file
update archive database by increasing frequency
if rowcount == 0 then put the record into a list
end for
for each record in the list of having rowcount == 0
insert into archive database
end for
The intuition is that after sometime, the number of inserting will become smaller and smaller. More and more operation will be on updating only. And this updating will not be penalized by index.
Hope this entire explanation would help. :)
You could use a hash table combined with a binary search tree. Implement a <search term, count> dictionary which tells you how many times each search term has been searched for.
Obviously iterating the entire hash table every hour to get the top 10 is very bad. But this is google we're talking about, so you can assume that the top ten will all get, say over 10 000 hits (it's probably a much larger number though). So every time a search term's count exceeds 10 000, insert it in the BST. Then every hour, you only have to get the first 10 from the BST, which should contain relatively few entries.
This solves the problem of top-10-of-all-time.
The really tricky part is dealing with one term taking another's place in the monthly report (for example, "stack overflow" might have 50 000 hits for the past two months, but only 10 000 the past month, while "amazon" might have 40 000 for the past two months but 30 000 for the past month. You want "amazon" to come before "stack overflow" in your monthly report). To do this, I would store, for all major (above 10 000 all-time searches) search terms, a 30-day list that tells you how many times that term was searched for on each day. The list would work like a FIFO queue: you remove the first day and insert a new one each day (or each hour, but then you might need to store more information, which means more memory / space. If memory is not a problem do it, otherwise go for that "approximation" they're talking about).
This looks like a good start. You can then worry about pruning the terms that have > 10 000 hits but haven't had many in a long while and stuff like that.
case i)
Maintain a hashtable for all the searchterms, as well as a sorted top-ten list separate from the hashtable. Whenever a search occurs, increment the appropriate item in the hashtable and check to see if that item should now be switched with the 10th item in the top-ten list.
O(1) lookup for the top-ten list, and max O(log(n)) insertion into the hashtable (assuming collisions managed by a self-balancing binary tree).
case ii)
Instead of maintaining a huge hashtable and a small list, we maintain a hashtable and a sorted list of all items. Whenever a search is made, that term is incremented in the hashtable, and in the sorted list the term can be checked to see if it should switch with the term after it. A self-balancing binary tree could work well for this, as we also need to be able to query it quickly (more on this later).
In addition we also maintain a list of 'hours' in the form of a FIFO list (queue). Each 'hour' element would contain a list of all searches done within that particular hour. So for example, our list of hours might look like this:
Time: 0 hours
-Search Terms:
-free stuff: 56
-funny pics: 321
-stackoverflow: 1234
Time: 1 hour
-Search Terms:
-ebay: 12
-funny pics: 1
-stackoverflow: 522
-BP sucks: 92
Then, every hour: If the list has at least 720 hours long (that's the number of hours in 30 days), look at the first element in the list, and for each search term, decrement that element in the hashtable by the appropriate amount. Afterwards, delete that first hour element from the list.
So let's say we're at hour 721, and we're ready to look at the first hour in our list (above). We'd decrement free stuff by 56 in the hashtable, funny pics by 321, etc., and would then remove hour 0 from the list completely since we will never need to look at it again.
The reason we maintain a sorted list of all terms that allows for fast queries is because every hour after as we go through the search terms from 720 hours ago, we need to ensure the top-ten list remains sorted. So as we decrement 'free stuff' by 56 in the hashtable for example, we'd check to see where it now belongs in the list. Because it's a self-balancing binary tree, all of that can be accomplished nicely in O(log(n)) time.
Edit: Sacrificing accuracy for space...
It might be useful to also implement a big list in the first one, as in the second one. Then we could apply the following space optimization on both cases: Run a cron job to remove all but the top x items in the list. This would keep the space requirement down (and as a result make queries on the list faster). Of course, it would result in an approximate result, but this is allowed. x could be calculated before deploying the application based on available memory, and adjusted dynamically if more memory becomes available.
Rough thinking...
For top 10 all time
Using a hash collection where a count for each term is stored (sanitize terms, etc.)
An sorted array which contains the ongoing top 10, a term/count in added to this array whenever the count of a term becomes equal or greater than the smallest count in the array
For monthly top 10 updated hourly:
Using an array indexed on number of hours elapsed since start modulo 744 (the number of hours during a month), which array entries consist of hash collection where a count for each term encountered during this hour-slot is stored. An entry is reset whenever the hour-slot counter changes
the stats in the array indexed on hour-slot needs to be collected whenever the current hour-slot counter changes (once an hour at most), by copying and flattening the content of this array indexed on hour-slots
Errr... make sense? I didn't think this through as I would in real life
Ah yes, forgot to mention, the hourly "copying/flattening" required for the monthly stats can actually reuse the same code used for the top 10 of all time, a nice side effect.
Exact solution
First, a solution that guarantees correct results, but requires a lot of memory (a big map).
"All-time" variant
Maintain a hash map with queries as keys and their counts as values. Additionally, keep a list f 10 most frequent queries so far and the count of the 10th most frequent count (a threshold).
Constantly update the map as the stream of queries is read. Every time a count exceeds the current threshold, do the following: remove the 10th query from the "Top 10" list, replace it with the query you've just updated, and update the threshold as well.
"Past month" variant
Keep the same "Top 10" list and update it the same way as above. Also, keep a similar map, but this time store vectors of 30*24 = 720 count (one for each hour) as values. Every hour do the following for every key: remove the oldest counter from the vector add a new one (initialized to 0) at the end. Remove the key from the map if the vector is all-zero. Also, every hour you have to calculate the "Top 10" list from scratch.
Note: Yes, this time we're storing 720 integers instead of one, but there are much less keys (the all-time variant has a really long tail).
Approximations
These approximations do not guarantee the correct solution, but are less memory-consuming.
Process every N-th query, skipping the rest.
(For all-time variant only) Keep at most M key-value pairs in the map (M should be as big as you can afford). It's a kind of an LRU cache: every time you read a query that is not in the map, remove the least recently used query with count 1 and replace it with the currently processed query.
Top 10 search terms for the past month
Using memory efficient indexing/data structure, such as tightly packed tries (from wikipedia entries on tries) approximately defines some relation between memory requirements and n - number of terms.
In case that required memory is available (assumption 1), you can keep exact monthly statistic and aggregate it every month into all time statistic.
There is, also, an assumption here that interprets the 'last month' as fixed window.
But even if the monthly window is sliding the above procedure shows the principle (sliding can be approximated with fixed windows of given size).
This reminds me of round-robin database with the exception that some stats are calculated on 'all time' (in a sense that not all data is retained; rrd consolidates time periods disregarding details by averaging, summing up or choosing max/min values, in given task the detail that is lost is information on low frequency items, which can introduce errors).
Assumption 1
If we can not hold perfect stats for the whole month, then we should be able to find a certain period P for which we should be able to hold perfect stats.
For example, assuming we have perfect statistics on some time period P, which goes into month n times.
Perfect stats define function f(search_term) -> search_term_occurance.
If we can keep all n perfect stat tables in memory then sliding monthly stats can be calculated like this:
add stats for the newest period
remove stats for the oldest period (so we have to keep n perfect stat tables)
However, if we keep only top 10 on the aggregated level (monthly) then we will be able to discard a lot of data from the full stats of the fixed period. This gives already a working procedure which has fixed (assuming upper bound on perfect stat table for period P) memory requirements.
The problem with the above procedure is that if we keep info on only top 10 terms for a sliding window (similarly for all time), then the stats are going to be correct for search terms that peak in a period, but might not see the stats for search terms that trickle in constantly over time.
This can be offset by keeping info on more than top 10 terms, for example top 100 terms, hoping that top 10 will be correct.
I think that further analysis could relate the minimum number of occurrences required for an entry to become a part of the stats (which is related to maximum error).
(In deciding which entries should become part of the stats one could also monitor and track the trends; for example if a linear extrapolation of the occurrences in each period P for each term tells you that the term will become significant in a month or two you might already start tracking it. Similar principle applies for removing the search term from the tracked pool.)
Worst case for the above is when you have a lot of almost equally frequent terms and they change all the time (for example if tracking only 100 terms, then if top 150 terms occur equally frequently, but top 50 are more often in first month and lest often some time later then the statistics would not be kept correctly).
Also there could be another approach which is not fixed in memory size (well strictly speaking neither is the above), which would define minimum significance in terms of occurrences/period (day, month, year, all-time) for which to keep the stats. This could guarantee max error in each of the stats during aggregation (see round robin again).
What about an adaption of the "clock page replacement algorithm" (also known as "second-chance")? I can imagine it to work very well if the search requests are distributed evenly (that means most searched terms appear regularly rather than 5mio times in a row and then never again).
Here's a visual representation of the algorithm:
The problem is not universally solvable when you have a fixed amount of memory and an 'infinite' (think very very large) stream of tokens.
A rough explanation...
To see why, consider a token stream that has a particular token (i.e., word) T every N tokens in the input stream.
Also, assume that the memory can hold references (word id and counts) to at most M tokens.
With these conditions, it is possible to construct an input stream where the token T will never be detected if the N is large enough so that the stream contains different M tokens between T's.
This is independent of the top-N algorithm details. It only depends on the limit M.
To see why this is true, consider the incoming stream made up of groups of two identical tokens:
T a1 a2 a3 ... a-M T b1 b2 b3 ... b-M ...
where the a's, and b's are all valid tokens not equal to T.
Notice that in this stream, the T appears twice for each a-i and b-i. Yet it appears rarely enough to be flushed from the system.
Starting with an empty memory, the first token (T) will take up a slot in the memory (bounded by M). Then a1 will consume a slot, all the way to a-(M-1) when the M is exhausted.
When a-M arrives the algorithm has to drop one symbol so let it be the T.
The next symbol will be b-1 which will cause a-1 to be flushed, etc.
So, the T will not stay memory-resident long enough to build up a real count. In short, any algorithm will miss a token of low enough local frequency but high global frequency (over the length of the stream).
Store the count of search terms in a giant hash table, where each new search causes a particular element to be incremented by one. Keep track of the top 20 or so search terms; when the element in 11th place is incremented, check if it needs to swap positions with #10* (it's not necessary to keep the top 10 sorted; all you care about is drawing the distinction between 10th and 11th).
*Similar checks need to be made to see if a new search term is in 11th place, so this algorithm bubbles down to other search terms too -- so I'm simplifying a bit.
sometimes the best answer is "I don't know".
Ill take a deeper stab. My first instinct would be to feed the results into a Q. A process would continually process items coming into the Q. The process would maintain a map of
term -> count
each time a Q item is processed, you simply look up the search term and increment the count.
At the same time, I would maintain a list of references to the top 10 entries in the map.
For the entry that was currently implemented, see if its count is greater than the count of the count of the smallest entry in the top 10.(if not in the list already). If it is, replace the smallest with the entry.
I think that would work. No operation is time intensive. You would have to find a way to manage the size of the count map. but that should good enough for an interview answer.
They are not expecting a solution, that want to see if you can think. You dont have to write the solution then and there....
One way is that for every search, you store that search term and its time stamp. That way, finding the top ten for any period of time is simply a matter of comparing all search terms within the given time period.
The algorithm is simple, but the drawback would be greater memory and time consumption.
What about using a Splay Tree with 10 nodes? Each time you try to access a value (search term) that is not contained in the tree, throw out any leaf, insert the value instead and access it.
The idea behind this is the same as in my other answer. Under the assumption that the search terms are accessed evenly/regularly this solution should perform very well.
edit
One could also store some more search terms in the tree (the same goes for the solution I suggest in my other answer) in order to not delete a node that might be accessed very soon again. The more values one stores in it, the better the results.
Dunno if I understand it right or not.
My solution is using heap.
Because of top 10 search items, I build a heap with size 10.
Then update this heap with new search. If a new search's frequency is greater than heap(Max Heap) top, update it. Abandon the one with smallest frequency.
But, how to calculate the frequency of the specific search will be counted on something else.
Maybe as everyone stated, the data stream algorithm....
Use cm-sketch to store count of all searches since beginning, keep a min-heap of size 10 with it for top 10.
For monthly result, keep 30 cm-sketch/hash-table and min-heap with it, each one start counting and updating from last 30, 29 .., 1 day. As a day pass, clear the last and use it as day 1.
Same for hourly result, keep 60 hash-table and min-heap and start counting for last 60, 59, ...1 minute. As a minute pass, clear the last and use it as minute 1.
Montly result is accurate in range of 1 day, hourly result is accurate in range of 1 min

How can I create a unique 7-digit code for an entity?

When a user adds a new item in my system, I want to produce a unique non-incrementing pseudo-random 7-digit code for that item. The number of items created will only number in the thousands (<10,000).
Because it needs to be unique and no two items will have the same information, I could use a hash, but it needs to be a code they can share with other people - hence the 7 digits.
My original thought was just to loop the generation of a random number, check that it wasn't already used, and if it was, rinse and repeat. I think this is a reasonable if distasteful solution given the low likelihood of collisions.
Responses to this question suggest generating a list of all unused numbers and shuffling them. I could probably keep a list like this in a database, but we're talking 10,000,000 entries for something relatively infrequent.
Does anyone have a better way?
Pick a 7-digit prime number A, and a big prime number B, and
int nth_unique_7_digit_code(int n) {
return (n * B) % A;
}
The count of all unique codes generated by this will be A.
If you want to be more "secure", do pow(some_prime_number, n) % A, i.e.
static int current_code = B;
int get_next_unique_code() {
current_code = (B * current_code) % A;
return current_code;
}
You could use an incrementing ID and then XOR it on some fixed key.
const int XORCode = 12345;
private int Encode(int id)
{
return id^XORCode;
}
private int Decode(int code)
{
return code^XORCode;
}
Honestly, if you want to generate only a couple of thousand 7-digit codes, while 10 million different codes will be available, I think just generating a random one and checking for a collision is good enough.
The chance of a collision on the first hit will be, in the worst case scenario, about 1 in a thousand, and the computational effort to just generate a new 7-digit code and check for a collision again will be much smaller than keeping a dictionary, or similar solutions.
Using a GUID instead of a 7-digit code as harryovers suggested will also certainly work, but of course a GUID will be slightly harder to remember for your users.
i would suggest using a guid instead of a 7 digit code as it will be more unique and you don't have to worry about generateing them as .NET will do this for you.
All solutions for a "unique" ID must have a database somewhere: Either one which contains the used IDs or one with the free IDs. As you noticed, the database with free IDs will be pretty big so most often, people use a "used IDs" database and check for collisions.
That said, some databases offer a "random ID" generator/sequence which already returns IDs in a range in random order.
This works by using a random number generator which can create all numbers in a range without repeating itself plus the feature that you can save it's state somewhere. So what you do is run the generator once, use the ID and save the new state. For the next run, you load the state and reset the generator to the last state to get the next random ID.
I assume you'll have a table of the generated ones. In that case, I don't see a problem with picking random numbers and checking them against the database, but I wouldn't do it individually. Generating them is cheap, doing the DB query is expensive relative to that. I'd generate 100 or 1,000 at a time and then ask the DB which of those exists. Bet you won't have to do it twice most of the time.
You have <10.000 items, so you need only 4 digits to store a unique number for all items.
Since you have 7 digits, you have 3 digits extra.
If you combine a unique sequence number of 4 digits with a random number of 3 digits, you will be unique and random. You increment the sequence number with every new ID you generate.
You can just append them in any order, or mix them.
seq = abcd,
rnd = ABC
You can create the following ID's:
abcdABC
ABCabcd
aAbBcCd
If you use only one mixing algorithm, you will have unique numbers, that look random.
I would try to use an LFSR (Linear feedback shift register) the code is really simple you can find examples everywhere ie Wikipedia and even though it's not cryptographically secure it looks very random. Also the implementation will be very fast since it's using mainly shift operations.
With only thousands of items in the database, your original idea seems sound. Checking the existance of a value in a sorted (indexed) list of a few tens of thousands of items would only require a few data fetches and comparisons.
Pre-generating the list doesn't sound like a good idea, because you will either store way more numbers than are necessary, or you will have to deal with running out of them.
Probability of having hits is very low.
For instance - you have 10^4 users and 10^7 possible IDs.
Probability that you pick used ID 10 times in row is now 10^-30.
This chance is lower than once in a lifetime of any person.
Well, you could ask the user to pick their own 7-digit number and validate it against the population of existing numbers (which you would have stored as they were used up), but I suspect you would be filtering a lot of 1234567, 7654321, 9999999, 7777777 type responses and might need a few RegExs to achieve the filtering, plus you'd have to warn the user against such sequences in order not to have a bad, repetitive, user input experience.

Resources