Short Version:
I'm looking for a technique to keep nearly-sorted data in nearly-sorted order over time, despite the values changing slightly.
Here's the scenario:
In the world of 3D graphics, it is often beneficial to order your objects from front-to-back before drawing. As your scene changes or your view of the scene changes, this data may require re-sorting, however it will usually be very close to the sorted order (i.e. it won't change very much between frames). It's also not critical that the data be exactly in sorted order. The worst thing that will happen is that a polygon will be rendered and then completely hidden. It's a small performance hit, but not the end of the world.
With this in mind, is it possible to sort the data once ahead of time and then apply a minimal patch to the data once per frame to ensure that the data stays mostly sorted? In this scenario, the data would be considered mostly sorted if most of the objects were in ascending order. That is, 1 object that is 10 steps away from it's proper location is much better (10x better) than 10 objects that are 1 step away from their proper location.
It's also worth noting that the data could continue to be patched on a semi regular basis, as the data is typically rendered 30 times per second (or so). As long as the calculation was efficient, it could continue to be done over time until the changes stop and the list was completely sorted.
Existing Idea:
My knee jerk reaction to this problem is:
Apply an n log n sort to the data when it is loaded, and on large changes (which I can track pretty easily).
When the data starts changing slowly (e.g. when the scene is rotated), apply a single (linear) pass of some sort on the data to swap backwards neighbors and try to maintain sort order (I think this is basically shell sort - maybe there is a better algorithm to use for this single pass).
Keep doing a single pass of the partial sort each frame until the changes stop and the data is completely sorted
Go back to step 2 and wait for more changes.
There are a variety of sorts that run in O(n) time if the input is mostly sorted, and O(n log n) if the data is not sorted. It sounds like you can use that pretty easily. Timsort is one such sort and, I believe, is the default sort now in both python and java. Smoothsort is another one that is fairly easy to implement yourself.
From your description it sounds like the sort order changes without you changing the data itself. E.g. you change the camera, so the sort order should change, even though you have not modified any polygons.
If so, you can't detect sort order changes directly when they happen. If you could, I would create buckets for the list of polygons, and resort buckets when 'enough' polygons in that bucket have been touched.
But I'm betting your system doesn't work that way. The sort is determined by the view port. In that case polygons at the front of the sort matter much more than ones at the end.
So I'd segment the poly list into fifths or something like that. Front to back, so that the first fifth is the part closest to the camera. I'd completely sort the first segment every frame. I'd divide the second segment into sub segments - say 5 again - and sort each sub segment every frame, such that every 5 frames the second fifth is completely sorted. segment the third through 5th segments into 15 sub segments and do those every 5 frames each such that the rest get sorted completely every 75 frames. At 60 fps you'd have the display list completely resorted a little more than once per second.
The nice thing about prioritizing the front of the list, is
1. Polys at the front are going to tend to be larger on the screen, and will fail depth test more often. Bad orders at the end of the list will more often than not just not matter.
2. the front of the list is more susceptible sort changes due to camera changes.
Also chose those segment ranges with a little overlap, so that polygons can migrate to their correct segment in 2 sorts.
#OP: Thinking about it a little more. You are probably more concerned with having the sorting cost stay bounded - instead of exploding with scene complexity. Especially since a very complex scene should - surprisingly - be less susceptible to bad sorts ( because generally the polys get smaller ).
You could define a fixed amount of sorting you are willing to do per frame. Use say 50% of the budget for as much of the front of the list as you can afford, 25% of the budget to sort the next region and 25% to spend equally on the rest.
Say you budget 1000 polys sorted per frame, and you have 10000 polys in the scene. Sort the first 500 polys every frame. Sort 250 polys every tenth frame for the next region. So 501-750 on frame 1, 751-1000 on frame 2 etc. And then divide the rest of the list into 250 frame segments and sort them round robin for however many frames you need to.
This keeps the sorting cost fixed s the scene gets more and less complex, and it is easy to tune, you just adjust the sorting budget to what you can afford.
I'll suggest a solution that borrows from a number of others here. Of course we start with a full sort of the objects on initialisation.
What I would do is always perform, say, 10 linear-time runs over your objects for every frame (with early termination if you find out that your objects are already completely sorted). Each run can be, say, one pass of bubble sort with a shell sort-style gap over the whole array: for all i from 0 to n-gap-1, compare A[i] and A[i+gap], and exchange them if they are not sorted. You can use a fixed sequence of gaps, or maybe better, let it vary between frames; either way, if you do sufficiently many frames where the objects do not change, you'll have a fully sorted sequence. You could even mix different types of sub-algorithms to do your runs, as long as each iteration improves the 'sortedness'.
You can add Rafael Baptista's idea of prioritizing the front of the scene easily by doing one extra run on the front segment, or choosing to divide the gap by two for the front half, or something like that.
It doesn't work out as neatly as the problem you've supposed because all you have to do is turn the camera 90 degrees and the basis for being sorted is on a different axis entirely. (X and Y axis are independent, for example -- looking down the X axis will cause the sort order to not rely on the X axis, and looking down the Y axis will cause the sort order to not rely on the Y axis.) Even a 5 degree turn can cause far away "close" (as far as Z-order is concerned) things to be suddenly "far".
Let's be honest -- generating the draw calls for the objects is normally going to take much more time than sorting them, especially if you have an optimized sorting algorithm for your scenario and your game is of modern visual complexity.
Sorting can be practically O(n), especially with histogram-based algorithms or radix-style algorithms. (Yes, radix sort applies to integers, so you'd have to scale your world coordinates to integers, but normally that's more than good enough unless you have a gigantic world.)
That being said, since you're already doing O(n) ops for everything you're drawing, resorting per frame isn't going to be a huge problem, especially with both high and low level optimization.
Another common way of addressing this issue is with a scene graph, but for your purposes it ends up essentially being a re-sort per frame. However, you can build frustum culling, shadow culling, and level of detail calculations into the scene graph traversal.
If you're looking for approximations, instead of doing a z-distance sort do a true distance sort and update the sort order more often for close by objects and less often for further objects (depending on distance the camera has traveled). This can work because if you're further away from an object, moving doesn't cause the angle to the viewer to change as often which, in turn, means the old sorting data is more likely to be valid. I'm not a fan of this because I like algorithms which allow my game to teleport across the map without any issues. (Mind you, streaming assets from disk becomes the real issue for teleporting.)
Shell sort is good for lists with few unique values and some scenarios that "need short code and do not use the call stack".
In your case, you need something called Adaptive sort, which means algorithms "takes advantage of existing order in its input".
If your space is tight, you can just use Straight Insertion Sort, which is adaptive and in place.
Otherwise you can try Timsort and Smoothsort as #RunningWild suggested, they are both adaptive sort algorithms.
We have two offline systems that normally can not communicate with each other. Both systems maintain the same ordered list of items. Only rarely will they be able to communicate with each other to synchronize the list.
Items are marked with a modification timestamp to detect edits. Items are identified by UUIDs to avoid conflicts when inserting new items (as opposed to using auto-incrementing integers). When synchronizing new UUIDs are detected and copied to the other system. Likewise for deletions.
The above data structure is fine for an unordered list, but how can we handle ordering? If we added an integer "rank", that would need renumbering when inserting a new item (thus requiring synchronizing all successor items due to only 1 insertion). Alternatively, we could use fractional ranks (use the average of the ranks of the predecessor and successor item), but that doesn't seem like a robust solution as it will quickly run into accuracy problems when many new items are inserted.
We also considered implementing this as a doubly linked-list with each item holding the UUID of its predecessor and successor item. However, that would still require synchronizing 3 items when 1 new items was inserted (or synchronizing the 2 remaining items when 1 item was deleted).
Preferably, we would like to use a data structure or algorithm where only the newly inserted item needs to be synchronized. Does such a data structure exist?
Edit: we need to be able to handle moving an existing item to a different position too!
There is really no problem with the interpolated rank approach. Just define your own numbering system based on variable length bit vectors representing binary fractions between 0 and 1 with no trailing zeros. The binary point is to the left of the first digit.
The only inconvenience of this system is that the minimum possible key is 0 given by the empty bit vector. Therefore you use this only if you're positive the associated item will forever be the first list element. Normally, just give the first item the key 1. That's equivalent to 1/2, so random insertions in the range (0..1) will tend to minimize bit usage. To interpolate an item before and after,
01 < newly interpolated = 1/4
1
11 < newly interpolated = 3/4
To interpolate again:
001 < newly interpolated = 1/8
01
011 < newly interpolated = 3/8
1
101 < newly interpolated = 5/8
11
111 < newly interpolated = 7/8
Note that if you wish you can omit storing the final 1! All keys (except 0 which you won't normally use) end in 1, so storing it is supefluous.
Comparison of binary fractions is a lot like lexical comparison: 0<1 and the first bit difference in a left-to-right scan tells you which is less. If no differences occur, i.e. one vector is a strict prefix of the other, then the shorter one is smaller.
With these rules it's pretty simple to come up with an algorithm that accepts two bit vectors and computes a result that's roughly (or exactly in some cases) between them. Just add the bit strings, and shift right 1, dropping unnecessary trailing bits, i.e. take the average of the two to split the range between.
In the example above, if deletions had left us with:
01
111
and we need to interpolate these, add 01(0) and and 111 to obtain 1.001, then shift to get 1001. This works fine as an interpolant. But note the final 1 unnecessarily makes it longer than either of the operands. An easy optimization is to drop the final 1 bit along with trailing zeros to get simply 1. Sure enough, 1 is about half way between as we'd hope.
Of course if you do many inserts in the same location (think e.g. of successive inserts at the start of the list), the bit vectors will get long. This is exactly the same phenomenon as inserting at the same point in a binary tree. It grows long and stringy. To fix this, you must "rebalance" during a synchronization by renumbering with the shortest possible bit vectors, e.g. for 14 you'd use the sequence above.
Addition
Though I haven't tried it, the Postgres bit string type seems to suffice for the keys I've described. The thing I'd need to verify is that the collation order is correct.
Also, the same reasoning works just fine with base-k digits for any k>=2. The first item gets key k/2. There is also a simple optimization that prevents the very common cases of appending and prepending elements at the end and front respectively from causing keys of length O(n). It maintains O(log n) for those cases (though inserting at the same place internally can still produce O(p) keys after p insertions). I'll let you work that out. With k=256, you can use indefinite length byte strings. In SQL, I believe you'd want varbinary(max). SQL provides the correct lexicographic sort order. Implementation of the interpolation ops is easy if you have a BigInteger package similar to Java's. If you like human-readable data, you can convert the byte strings to e.g. hex strings (0-9a-f) and store those. Then normal UTF8 string sort order is correct.
You can add two fields to each item - 'creation timestamp' and 'inserted after' (containing the id of the item after which the new item was inserted). Once you synchronize a list, send all the new items. That information is enough for you to be able to construct the list on the other side.
With the list of newly added items received, do this (on the receiving end): sort by creation timestamp, then go one by one, and use the 'inserted after' field to add the new item in the appropriate place.
You may face trouble if an item A is added, then B is added after A, then A is removed. If this can happen, you will need to sync A as well (basically syncing the operations that took place on the list since the last sync, and not just the content of the current list). It's basically a form of log-shipping.
You could have a look at "lenses", which is bidirectional programming concept.
For instance, your problem seems to be solved my "matching lenses", described in this paper.
I think the datastructure that is appropriate here is order statistic tree. In order statistic tree you also need to maintain sizes of subtrees along with other data, the size field helps easy to find element by rank as you need it to be. All operations like rank,delete,change position,insert are O(logn).
I think you can try kind of transactional approach here. For example you do not delete items physically but mark them for deletion and commit changes only during synchronization. I'm not absolutely sure which data type you should choose, it depends on which operations you want to be more productive (insertions, deletions, search or iteration).
Let we have the following initial state on both systems:
|1| |2|
--- ---
|A| |A|
|B| |B|
|C| |C|
|D| |D|
After that the first system marks element A for deletion and the second system inserts element BC between B and C:
|1 | |2 |
------------ --------------
|A | |A |
|B[deleted]| |B |
|C | |BC[inserted]|
|D | |C |
|D |
Both systems continue processing taking into account local changes, System 1 ignores element B and System 2 treats element BC as normal element.
When synchronization occurs:
As I understand, each system receives the list snapshot from other system and both systems freeze processing until synchronization will be finished.
So each system iterates sequentially through received snapshot and local list and writes changes to local list (resolving possible conflicts according to modified timestamp) after that 'transaction is commited', all local changes are finally applied and information about them erases.
For example for system one:
|1 pre-sync| |2-SNAPSHOT | |1 result|
------------ -------------- ----------
|A | <the same> |A | |A |
|B[deleted]| <delete B> |B |
<insert BC> |BC[inserted]| |BC |
|C | <same> |C | |C |
|D | <same> |D | |D |
Systems wake up and continue processing.
Items are sorted by insertion order, moving can be implemented as simultaneous deletion and insertion. Also I think that it will be possible not to transfer the whole list shapshot but only list of items that were actually modified.
I think, broadly, Operational Transformation could be related to the problem you are describing here. For instance, consider the problem of Real-Time Collaborative text editing.
We essentially have a sorted list of items( words) which needs to be kept synchronized, and which could be added/modified/deleted at random within the list. The only major difference I see is in the periodicity of modifications to the list.( You say it does not happen often)
Operational Transformation does happen to be well studied field. I could find this blog article giving pointers and introduction. Plus, for all the problems Google Wave had, they actually made significant advancements to the domain of Operational Transform. Check this out. . There is quite a bit of literature available on this subject. Look at this stackoverflow thread, and about Differential Synchronisation
Another parallel that struck me was the data structure used in Text Editors - Ropes.
So if you have a log of operations,lets say, "Index 5 deleted", "Index 6 modified to ABC","Index 8 inserted",what you might now have to do is to transmit a log of the changes from System A to System B, and then reconstruct the operations sequentially on the other side.
The other "pragmatic Engineer" 's choice would be to simply reconstruct the entire list on System B when System A changes. Depending on actual frequency and size of changes, this might not be as bad as it sounds.
I have tentatively solved a similar problem by including a PrecedingItemID (which can be null if the item is the top/root of the ordered list) on each item, and then having a sort of local cache that keeps a list of all items in sorted order (this is purely for efficiency—so you don't have to recursively query for or build the list based on PrecedingItemIDs every time there is a re-ordering on the local client). Then when it comes time to sync I do the slightly more expensive operation of looking for cases where two items are requesting the same PrecedingItemID. In those cases, I simply order by creation time (or however you want to reconcile which one wins and comes first), put the second (or others) behind it, and move on ordering the list. I then store this new ordering in the local ordering cache and go on using that until the next sync (just making sure to keep the PrecedingItemID updated as you go).
I haven't unit tested this approach yet—so I'm not 100% sure I'm not missing some problematic conflict scenario—but it appears at least conceptually to handle my needs, which sound similar to those of the OP.
I'm trying to come up with a weighted algorithm for an application. In the application, there is a limited amount of space available for different elements. Once all the space is occupied, the algorithm should choose the best element(s) to remove in order to make space for new elements.
There are different attributes which should affect this decision. For example:
T: Time since last accessed. (It's best to replace something that hasn't been accessed in a while.)
N: Number of times accessed. (It's best to replace something which hasn't been accessed many times.)
R: Number of elements which need to be removed in order to make space for the new element. (It's best to replace the least amount of elements. Ideally this should also take into consideration the T and N attributes of each element being replaced.)
I have 2 problems:
Figuring out how much weight to give each of these attributes.
Figuring out how to calculate the weight for an element.
(1) I realize that coming up with the weight for something like this is very subjective, but I was hoping that there's a standard method or something that can help me in deciding how much weight to give each attribute. For example, I was thinking that one method might be to come up with a set of two sample elements and then manually compare the two and decide which one should ultimately be chosen. Here's an example:
Element A: N = 5, T = 2 hours ago.
Element B: N = 4, T = 10 minutes ago.
In this example, I would probably want A to be the element that is chosen to be replaced since although it was accessed one more time, it hasn't been accessed in a lot of time compared with B. This method seems like it would take a lot of time, and would involve making a lot of tough, subjective decisions. Additionally, it may not be trivial to come up with the resulting weights at the end.
Another method I came up with was to just arbitrarily choose weights for the different attributes and then use the application for a while. If I notice anything obviously wrong with the algorithm, I could then go in and slightly modify the weights. This is basically a "guess and check" method.
Both of these methods don't seem that great and I'm hoping there's a better solution.
(2) Once I do figure out the weight, I'm not sure which way is best to calculate the weight. Should I just add everything? (In these examples, I'm assuming that whichever element has the highest replacementWeight should be the one that's going to be replaced.)
replacementWeight = .4*T - .1*N - 2*R
or multiply everything?
replacementWeight = (T) * (.5*N) * (.1*R)
What about not using constants for the weights? For example, sure "Time" (T) may be important, but once a specific amount of time has passed, it starts not making that much of a difference. Essentially I would lump it all in an "a lot of time has passed" bin. (e.g. even though 8 hours and 7 hours have an hour difference between the two, this difference might not be as significant as the difference between 1 minute and 5 minutes since these two are much more recent.) (Or another example: replacing (R) 1 or 2 elements is fine, but when I start needing to replace 5 or 6, that should be heavily weighted down... therefore it shouldn't be linear.)
replacementWeight = 1/T + sqrt(N) - R*R
Obviously (1) and (2) are closely related, which is why I'm hoping that there's a better way to come up with this sort of algorithm.
What you are describing is the classic problem of choosing a cache replacement policy. Which policy is best for you, depends on your data, but the following usually works well:
First, always store a new object in the cache, evicting the R worst one(s). There is no way to know a priori if an object should be stored or not. If the object is not useful, it will fall out of the cache again soon.
The popular squid cache implements the following cache replacement algorithms:
Least Recently Used (LRU):
replacementKey = -T
Least Frequently Used with Dynamic Aging (LFUDA):
replacementKey = N + C
Greedy-Dual-Size-Frequency (GDSF):
replacementKey = (N/R) + C
C refers to a cache age factor here. C is basically the replacementKey of the item that was evicted last (or zero).
NOTE: The replacementKey is calculated when an object is inserted or accessed, and stored alongside the object. The object with the smallest replacementKey is evicted.
LRU is simple and often good enough. The bigger your cache, the better it performs.
LFUDA and GDSF both are tradeoffs. LFUDA prefers to keep large objects even if they are less popular, under the assumption that one hit to a large object makes up lots of hits for smaller objects. GDSF basically makes the opposite tradeoff, keeping many smaller objects over fewer large objects. From what you write, the latter might be a good fit.
If none of these meet your needs, you can calculate optimal values for T, N and R (and compare different formulas for combining them) by minimizing regret, the difference in performance between your formula and the optimal algorithm, using, for example, Linear regression.
This is a completely subjective issue -- as you yourself point out. And a distinct possibility is that if your test cases consist of pairs (A,B) where you prefer A to B, then you might find that you prefer A to B , B to C but also C over A -- i.e. its not an ordering.
If you are not careful, your function might not exist !
If you can define a scalar function of your input variables, with various parameters for coefficients and exponents, you might be able to estimate said parameters by using regression, but you will need an awful lot of data if you have many parameters.
This is the classical statistician's approach of first reviewing the data to IDENTIFY a model, and then using that model to ESTIMATE a particular realisation of the model. There are large books on this subject.
I'm currently preparing for an interview, and it reminded me of a question I was once asked in a previous interview that went something like this:
"You have been asked to design some software to continuously display the top 10 search terms on Google. You are given access to a feed that provides an endless real-time stream of search terms currently being searched on Google. Describe what algorithm and data structures you would use to implement this. You are to design two variations:
(i) Display the top 10 search terms of all time (i.e. since you started reading the feed).
(ii) Display only the top 10 search terms for the past month, updated hourly.
You can use an approximation to obtain the top 10 list, but you must justify your choices."
I bombed in this interview and still have really no idea how to implement this.
The first part asks for the 10 most frequent items in a continuously growing sub-sequence of an infinite list. I looked into selection algorithms, but couldn't find any online versions to solve this problem.
The second part uses a finite list, but due to the large amount of data being processed, you can't really store the whole month of search terms in memory and calculate a histogram every hour.
The problem is made more difficult by the fact that the top 10 list is being continuously updated, so somehow you need to be calculating your top 10 over a sliding window.
Any ideas?
Frequency Estimation Overview
There are some well-known algorithms that can provide frequency estimates for such a stream using a fixed amount of storage. One is Frequent, by Misra and Gries (1982). From a list of n items, it find all items that occur more than n / k times, using k - 1 counters. This is a generalization of Boyer and Moore's Majority algorithm (Fischer-Salzberg, 1982), where k is 2. Manku and Motwani's LossyCounting (2002) and Metwally's SpaceSaving (2005) algorithms have similar space requirements, but can provide more accurate estimates under certain conditions.
The important thing to remember is that these algorithms can only provide frequency estimates. Specifically, the Misra-Gries estimate can under-count the actual frequency by (n / k) items.
Suppose that you had an algorithm that could positively identify an item only if it occurs more than 50% of the time. Feed this algorithm a stream of N distinct items, and then add another N - 1 copies of one item, x, for a total of 2N - 1 items. If the algorithm tells you that x exceeds 50% of the total, it must have been in the first stream; if it doesn't, x wasn't in the initial stream. In order for the algorithm to make this determination, it must store the initial stream (or some summary proportional to its length)! So, we can prove to ourselves that the space required by such an "exact" algorithm would be Ω(N).
Instead, these frequency algorithms described here provide an estimate, identifying any item that exceeds the threshold, along with some items that fall below it by a certain margin. For example the Majority algorithm, using a single counter, will always give a result; if any item exceeds 50% of the stream, it will be found. But it might also give you an item that occurs only once. You wouldn't know without making a second pass over the data (using, again, a single counter, but looking only for that item).
The Frequent Algorithm
Here's a simple description of Misra-Gries' Frequent algorithm. Demaine (2002) and others have optimized the algorithm, but this gives you the gist.
Specify the threshold fraction, 1 / k; any item that occurs more than n / k times will be found. Create an an empty map (like a red-black tree); the keys will be search terms, and the values will be a counter for that term.
Look at each item in the stream.
If the term exists in the map, increment the associated counter.
Otherwise, if the map less than k - 1 entries, add the term to the map with a counter of one.
However, if the map has k - 1 entries already, decrement the counter in every entry. If any counter reaches zero during this process, remove it from the map.
Note that you can process an infinite amount of data with a fixed amount of storage (just the fixed-size map). The amount of storage required depends only on the threshold of interest, and the size of the stream does not matter.
Counting Searches
In this context, perhaps you buffer one hour of searches, and perform this process on that hour's data. If you can take a second pass over this hour's search log, you can get an exact count of occurrences of the top "candidates" identified in the first pass. Or, maybe its okay to to make a single pass, and report all the candidates, knowing that any item that should be there is included, and any extras are just noise that will disappear in the next hour.
Any candidates that really do exceed the threshold of interest get stored as a summary. Keep a month's worth of these summaries, throwing away the oldest each hour, and you would have a good approximation of the most common search terms.
Well, looks like an awful lot of data, with a perhaps prohibitive cost to store all frequencies. When the amount of data is so large that we cannot hope to store it all, we enter the domain of data stream algorithms.
Useful book in this area:
Muthukrishnan - "Data Streams: Algorithms and Applications"
Closely related reference to the problem at hand which I picked from the above:
Manku, Motwani - "Approximate Frequency Counts over Data Streams" [pdf]
By the way, Motwani, of Stanford, (edit) was an author of the very important "Randomized Algorithms" book. The 11th chapter of this book deals with this problem. Edit: Sorry, bad reference, that particular chapter is on a different problem. After checking, I instead recommend section 5.1.2 of Muthukrishnan's book, available online.
Heh, nice interview question.
This is one of the research project that I am current going through. The requirement is almost exactly as yours, and we have developed nice algorithms to solve the problem.
The Input
The input is an endless stream of English words or phrases (we refer them as tokens).
The Output
Output top N tokens we have seen so
far (from all the tokens we have
seen!)
Output top N tokens in a
historical window, say, last day or
last week.
An application of this research is to find the hot topic or trends of topic in Twitter or Facebook. We have a crawler that crawls on the website, which generates a stream of words, which will feed into the system. The system then will output the words or phrases of top frequency either at overall or historically. Imagine in last couple of weeks the phrase "World Cup" would appears many times in Twitter. So does "Paul the octopus". :)
String into Integers
The system has an integer ID for each word. Though there is almost infinite possible words on the Internet, but after accumulating a large set of words, the possibility of finding new words becomes lower and lower. We have already found 4 million different words, and assigned a unique ID for each. This whole set of data can be loaded into memory as a hash table, consuming roughly 300MB memory. (We have implemented our own hash table. The Java's implementation takes huge memory overhead)
Each phrase then can be identified as an array of integers.
This is important, because sorting and comparisons on integers is much much faster than on strings.
Archive Data
The system keeps archive data for every token. Basically it's pairs of (Token, Frequency). However, the table that stores the data would be so huge such that we have to partition the table physically. Once partition scheme is based on ngrams of the token. If the token is a single word, it is 1gram. If the token is two-word phrase, it is 2gram. And this goes on. Roughly at 4gram we have 1 billion records, with table sized at around 60GB.
Processing Incoming Streams
The system will absorbs incoming sentences until memory becomes fully utilized (Ya, we need a MemoryManager). After taking N sentences and storing in memory, the system pauses, and starts tokenize each sentence into words and phrases. Each token (word or phrase) is counted.
For highly frequent tokens, they are always kept in memory. For less frequent tokens, they are sorted based on IDs (remember we translate the String into an array of integers), and serialized into a disk file.
(However, for your problem, since you are counting only words, then you can put all word-frequency map in memory only. A carefully designed data structure would consume only 300MB memory for 4 million different words. Some hint: use ASCII char to represent Strings), and this is much acceptable.
Meanwhile, there will be another process that is activated once it finds any disk file generated by the system, then start merging it. Since the disk file is sorted, merging would take a similar process like merge sort. Some design need to be taken care at here as well, since we want to avoid too many random disk seeks. The idea is to avoid read (merge process)/write (system output) at the same time, and let the merge process read form one disk while writing into a different disk. This is similar like to implementing a locking.
End of Day
At end of day, the system will have many frequent tokens with frequency stored in memory, and many other less frequent tokens stored in several disk files (and each file is sorted).
The system flush the in-memory map into a disk file (sort it). Now, the problem becomes merging a set of sorted disk file. Using similar process, we would get one sorted disk file at the end.
Then, the final task is to merge the sorted disk file into archive database.
Depends on the size of archive database, the algorithm works like below if it is big enough:
for each record in sorted disk file
update archive database by increasing frequency
if rowcount == 0 then put the record into a list
end for
for each record in the list of having rowcount == 0
insert into archive database
end for
The intuition is that after sometime, the number of inserting will become smaller and smaller. More and more operation will be on updating only. And this updating will not be penalized by index.
Hope this entire explanation would help. :)
You could use a hash table combined with a binary search tree. Implement a <search term, count> dictionary which tells you how many times each search term has been searched for.
Obviously iterating the entire hash table every hour to get the top 10 is very bad. But this is google we're talking about, so you can assume that the top ten will all get, say over 10 000 hits (it's probably a much larger number though). So every time a search term's count exceeds 10 000, insert it in the BST. Then every hour, you only have to get the first 10 from the BST, which should contain relatively few entries.
This solves the problem of top-10-of-all-time.
The really tricky part is dealing with one term taking another's place in the monthly report (for example, "stack overflow" might have 50 000 hits for the past two months, but only 10 000 the past month, while "amazon" might have 40 000 for the past two months but 30 000 for the past month. You want "amazon" to come before "stack overflow" in your monthly report). To do this, I would store, for all major (above 10 000 all-time searches) search terms, a 30-day list that tells you how many times that term was searched for on each day. The list would work like a FIFO queue: you remove the first day and insert a new one each day (or each hour, but then you might need to store more information, which means more memory / space. If memory is not a problem do it, otherwise go for that "approximation" they're talking about).
This looks like a good start. You can then worry about pruning the terms that have > 10 000 hits but haven't had many in a long while and stuff like that.
case i)
Maintain a hashtable for all the searchterms, as well as a sorted top-ten list separate from the hashtable. Whenever a search occurs, increment the appropriate item in the hashtable and check to see if that item should now be switched with the 10th item in the top-ten list.
O(1) lookup for the top-ten list, and max O(log(n)) insertion into the hashtable (assuming collisions managed by a self-balancing binary tree).
case ii)
Instead of maintaining a huge hashtable and a small list, we maintain a hashtable and a sorted list of all items. Whenever a search is made, that term is incremented in the hashtable, and in the sorted list the term can be checked to see if it should switch with the term after it. A self-balancing binary tree could work well for this, as we also need to be able to query it quickly (more on this later).
In addition we also maintain a list of 'hours' in the form of a FIFO list (queue). Each 'hour' element would contain a list of all searches done within that particular hour. So for example, our list of hours might look like this:
Time: 0 hours
-Search Terms:
-free stuff: 56
-funny pics: 321
-stackoverflow: 1234
Time: 1 hour
-Search Terms:
-ebay: 12
-funny pics: 1
-stackoverflow: 522
-BP sucks: 92
Then, every hour: If the list has at least 720 hours long (that's the number of hours in 30 days), look at the first element in the list, and for each search term, decrement that element in the hashtable by the appropriate amount. Afterwards, delete that first hour element from the list.
So let's say we're at hour 721, and we're ready to look at the first hour in our list (above). We'd decrement free stuff by 56 in the hashtable, funny pics by 321, etc., and would then remove hour 0 from the list completely since we will never need to look at it again.
The reason we maintain a sorted list of all terms that allows for fast queries is because every hour after as we go through the search terms from 720 hours ago, we need to ensure the top-ten list remains sorted. So as we decrement 'free stuff' by 56 in the hashtable for example, we'd check to see where it now belongs in the list. Because it's a self-balancing binary tree, all of that can be accomplished nicely in O(log(n)) time.
Edit: Sacrificing accuracy for space...
It might be useful to also implement a big list in the first one, as in the second one. Then we could apply the following space optimization on both cases: Run a cron job to remove all but the top x items in the list. This would keep the space requirement down (and as a result make queries on the list faster). Of course, it would result in an approximate result, but this is allowed. x could be calculated before deploying the application based on available memory, and adjusted dynamically if more memory becomes available.
Rough thinking...
For top 10 all time
Using a hash collection where a count for each term is stored (sanitize terms, etc.)
An sorted array which contains the ongoing top 10, a term/count in added to this array whenever the count of a term becomes equal or greater than the smallest count in the array
For monthly top 10 updated hourly:
Using an array indexed on number of hours elapsed since start modulo 744 (the number of hours during a month), which array entries consist of hash collection where a count for each term encountered during this hour-slot is stored. An entry is reset whenever the hour-slot counter changes
the stats in the array indexed on hour-slot needs to be collected whenever the current hour-slot counter changes (once an hour at most), by copying and flattening the content of this array indexed on hour-slots
Errr... make sense? I didn't think this through as I would in real life
Ah yes, forgot to mention, the hourly "copying/flattening" required for the monthly stats can actually reuse the same code used for the top 10 of all time, a nice side effect.
Exact solution
First, a solution that guarantees correct results, but requires a lot of memory (a big map).
"All-time" variant
Maintain a hash map with queries as keys and their counts as values. Additionally, keep a list f 10 most frequent queries so far and the count of the 10th most frequent count (a threshold).
Constantly update the map as the stream of queries is read. Every time a count exceeds the current threshold, do the following: remove the 10th query from the "Top 10" list, replace it with the query you've just updated, and update the threshold as well.
"Past month" variant
Keep the same "Top 10" list and update it the same way as above. Also, keep a similar map, but this time store vectors of 30*24 = 720 count (one for each hour) as values. Every hour do the following for every key: remove the oldest counter from the vector add a new one (initialized to 0) at the end. Remove the key from the map if the vector is all-zero. Also, every hour you have to calculate the "Top 10" list from scratch.
Note: Yes, this time we're storing 720 integers instead of one, but there are much less keys (the all-time variant has a really long tail).
Approximations
These approximations do not guarantee the correct solution, but are less memory-consuming.
Process every N-th query, skipping the rest.
(For all-time variant only) Keep at most M key-value pairs in the map (M should be as big as you can afford). It's a kind of an LRU cache: every time you read a query that is not in the map, remove the least recently used query with count 1 and replace it with the currently processed query.
Top 10 search terms for the past month
Using memory efficient indexing/data structure, such as tightly packed tries (from wikipedia entries on tries) approximately defines some relation between memory requirements and n - number of terms.
In case that required memory is available (assumption 1), you can keep exact monthly statistic and aggregate it every month into all time statistic.
There is, also, an assumption here that interprets the 'last month' as fixed window.
But even if the monthly window is sliding the above procedure shows the principle (sliding can be approximated with fixed windows of given size).
This reminds me of round-robin database with the exception that some stats are calculated on 'all time' (in a sense that not all data is retained; rrd consolidates time periods disregarding details by averaging, summing up or choosing max/min values, in given task the detail that is lost is information on low frequency items, which can introduce errors).
Assumption 1
If we can not hold perfect stats for the whole month, then we should be able to find a certain period P for which we should be able to hold perfect stats.
For example, assuming we have perfect statistics on some time period P, which goes into month n times.
Perfect stats define function f(search_term) -> search_term_occurance.
If we can keep all n perfect stat tables in memory then sliding monthly stats can be calculated like this:
add stats for the newest period
remove stats for the oldest period (so we have to keep n perfect stat tables)
However, if we keep only top 10 on the aggregated level (monthly) then we will be able to discard a lot of data from the full stats of the fixed period. This gives already a working procedure which has fixed (assuming upper bound on perfect stat table for period P) memory requirements.
The problem with the above procedure is that if we keep info on only top 10 terms for a sliding window (similarly for all time), then the stats are going to be correct for search terms that peak in a period, but might not see the stats for search terms that trickle in constantly over time.
This can be offset by keeping info on more than top 10 terms, for example top 100 terms, hoping that top 10 will be correct.
I think that further analysis could relate the minimum number of occurrences required for an entry to become a part of the stats (which is related to maximum error).
(In deciding which entries should become part of the stats one could also monitor and track the trends; for example if a linear extrapolation of the occurrences in each period P for each term tells you that the term will become significant in a month or two you might already start tracking it. Similar principle applies for removing the search term from the tracked pool.)
Worst case for the above is when you have a lot of almost equally frequent terms and they change all the time (for example if tracking only 100 terms, then if top 150 terms occur equally frequently, but top 50 are more often in first month and lest often some time later then the statistics would not be kept correctly).
Also there could be another approach which is not fixed in memory size (well strictly speaking neither is the above), which would define minimum significance in terms of occurrences/period (day, month, year, all-time) for which to keep the stats. This could guarantee max error in each of the stats during aggregation (see round robin again).
What about an adaption of the "clock page replacement algorithm" (also known as "second-chance")? I can imagine it to work very well if the search requests are distributed evenly (that means most searched terms appear regularly rather than 5mio times in a row and then never again).
Here's a visual representation of the algorithm:
The problem is not universally solvable when you have a fixed amount of memory and an 'infinite' (think very very large) stream of tokens.
A rough explanation...
To see why, consider a token stream that has a particular token (i.e., word) T every N tokens in the input stream.
Also, assume that the memory can hold references (word id and counts) to at most M tokens.
With these conditions, it is possible to construct an input stream where the token T will never be detected if the N is large enough so that the stream contains different M tokens between T's.
This is independent of the top-N algorithm details. It only depends on the limit M.
To see why this is true, consider the incoming stream made up of groups of two identical tokens:
T a1 a2 a3 ... a-M T b1 b2 b3 ... b-M ...
where the a's, and b's are all valid tokens not equal to T.
Notice that in this stream, the T appears twice for each a-i and b-i. Yet it appears rarely enough to be flushed from the system.
Starting with an empty memory, the first token (T) will take up a slot in the memory (bounded by M). Then a1 will consume a slot, all the way to a-(M-1) when the M is exhausted.
When a-M arrives the algorithm has to drop one symbol so let it be the T.
The next symbol will be b-1 which will cause a-1 to be flushed, etc.
So, the T will not stay memory-resident long enough to build up a real count. In short, any algorithm will miss a token of low enough local frequency but high global frequency (over the length of the stream).
Store the count of search terms in a giant hash table, where each new search causes a particular element to be incremented by one. Keep track of the top 20 or so search terms; when the element in 11th place is incremented, check if it needs to swap positions with #10* (it's not necessary to keep the top 10 sorted; all you care about is drawing the distinction between 10th and 11th).
*Similar checks need to be made to see if a new search term is in 11th place, so this algorithm bubbles down to other search terms too -- so I'm simplifying a bit.
sometimes the best answer is "I don't know".
Ill take a deeper stab. My first instinct would be to feed the results into a Q. A process would continually process items coming into the Q. The process would maintain a map of
term -> count
each time a Q item is processed, you simply look up the search term and increment the count.
At the same time, I would maintain a list of references to the top 10 entries in the map.
For the entry that was currently implemented, see if its count is greater than the count of the count of the smallest entry in the top 10.(if not in the list already). If it is, replace the smallest with the entry.
I think that would work. No operation is time intensive. You would have to find a way to manage the size of the count map. but that should good enough for an interview answer.
They are not expecting a solution, that want to see if you can think. You dont have to write the solution then and there....
One way is that for every search, you store that search term and its time stamp. That way, finding the top ten for any period of time is simply a matter of comparing all search terms within the given time period.
The algorithm is simple, but the drawback would be greater memory and time consumption.
What about using a Splay Tree with 10 nodes? Each time you try to access a value (search term) that is not contained in the tree, throw out any leaf, insert the value instead and access it.
The idea behind this is the same as in my other answer. Under the assumption that the search terms are accessed evenly/regularly this solution should perform very well.
edit
One could also store some more search terms in the tree (the same goes for the solution I suggest in my other answer) in order to not delete a node that might be accessed very soon again. The more values one stores in it, the better the results.
Dunno if I understand it right or not.
My solution is using heap.
Because of top 10 search items, I build a heap with size 10.
Then update this heap with new search. If a new search's frequency is greater than heap(Max Heap) top, update it. Abandon the one with smallest frequency.
But, how to calculate the frequency of the specific search will be counted on something else.
Maybe as everyone stated, the data stream algorithm....
Use cm-sketch to store count of all searches since beginning, keep a min-heap of size 10 with it for top 10.
For monthly result, keep 30 cm-sketch/hash-table and min-heap with it, each one start counting and updating from last 30, 29 .., 1 day. As a day pass, clear the last and use it as day 1.
Same for hourly result, keep 60 hash-table and min-heap and start counting for last 60, 59, ...1 minute. As a minute pass, clear the last and use it as minute 1.
Montly result is accurate in range of 1 day, hourly result is accurate in range of 1 min