How does the leaf node split in the physical space in innodb? - data-structures

If the keys are inserted in ascending order, according to normal B+-tree characteristics, when the leaf page is full, it will split and there will be a new page introduced to the B+-tree.
For an instance, if there is a leaf page with up to 3 keys.
(page0)|1|2|3|
Then the key 4 is inserted:
|1|3|*|(page0)
(page1)|1|2|*| |3|4|*|(page2)
After this, later keys will be inserted into page2 until the next split since they are in ascending order. All previous pages will remain half full.
In my example, I guess this will cause space to be wasted. However, in the database, it seems to be unreasonable. This really confuses me. I've read Jeremy Cole-B+Tree index structures in InnoDB, but I have probably misunderstood something.

Without additional optimizations, you're absolutely correct that as an index page filled it would be split in half and then remain half-filled forever. However, InnoDB optimizes index fill based on its perception of the insertion order. That is, if it detects that insertion is being done in-order (ascending or descending) it will, instead of splitting a page in half, just create a new empty page for an insertion at the "edge" of the page.
There is some information about this in the MySQL manual section The Physical Structure of an InnoDB Index. Additionally I illustrate an example of this behavior in my post Visualizing the impact of ordered vs. random index insertion in InnoDB.
In The physical structure of InnoDB index pages I describe the Last Insert Position, Page Direction, and Number of Inserts in Page Direction fields of each index page. This is how the tracking for ascending vs. descending order is done (as left vs. right, though). With each insert, the last inserted record is compared to the currently inserted one, and if the insert is in the same "direction", the counter is incremented. This counter is then checked to determine the page split behavior; whether to split in half or create a new, empty page.
In practice, this optimization is not perfect, and there's a big difference between insertions being mostly in-order, and exactly in-order. If inserts are only mostly in-order it can mean that the page direction may never get appropriately set, and pages will end up half-filled (as you described).

Related

How does linear probing handle deletions without breaking lookups?

Here is my understanding of linear probing.
For insertion:
- We hash to a certain position. If that position already has a value, we linearly increment to the next position, until we encounter an empty position, then we insert there. That makes sense.
My question revolves around lookup. From descriptions I have read, I believe lookup works like this:
We look at the position the item we are looking for hashes to.
If the position is empty, we return Not Found
If the position is full, we move linearly to positions until we either encounter the value we are looking for, or we encounter an empty position (meaning not found)
So how does this work when we delete an item from a hash? Wouldn't that mess up this lookup? Say two items hash to the same position. We add both items, then delete the first one we added. So now, the expected position of the second item (which had to be moved to a different position, since the first item originally occupied it) is empty. Does deleting handle this in some way?
Great question! You are absolutely right that just removing an item from a linear probing table would cause problems in exactly the circumstance that you are reporting.
There are a couple of solutions to this. One is to use tombstone deletion. In tombstone deletion, to remove an element, you replace the element with a marker called a tombstone that indicates "an element used to be here, but has since been removed." Then, when you do a lookup, you use the same procedure as before: jump to the hash location, then keep stepping forward until you find a blank spot. The idea here is that a tombstone doesn't count as a blank spot, so you'll keep scanning over it to find what you're searching for.
To keep the number of tombstones low, there are nice techniques you can use like overwriting tombstones during insertions or globally rebuilding the table if the number of tombstones becomes too great.
Another option is to use Robin Hood hashing and backward-shift deletion. In Robin Hood hashing, you store the elements in the table in a way that essentially keeps them sorted by their hash code (wraparound at the front of the table makes things a bit more complex than this, but that's the general idea). From there, when doing a deletion, you can shift elements backwards one spot to fill the gap from the removed element until you either hit a blank or an element that's already in the right place and doesn't need to be moved.
For more information on this, check out these lecture slides on linear probing and Robin Hood hashing.
Hope this helps!
Deletion in linear probing (open addressing) is done in such a way that index at which the value is deleted is assigned any marker such as "Deletion". [One can type any value at that index other than None to indicate that value at this index is deleted]. Keep a look at the below code snippet to indicate how i used "Deletion" marker to fill index where value is deleted
if self.table[index] == value:
print("key {} is found in the table and hence deletion tag is updated at that position".format(value))
self.table[index] = "Deletion"
Now, what happens when again search is done, this position is not None and search will continue. See the below snippet how search is implemented in the linear probe
def search(self, value):
index = value % self.table_size
if self.table[index] != value:
while self.table[index] is not None and self.table[index] != value:
index = (index + 1) % self.table_size
if self.table[index] == value:
print("Key is found in the table")
else:
print("key is not found in the table")
One can also look at the github code explaining deletion in linear probing without breaking lookups.

Bulk loading in a B-Tree

https://en.wikipedia.org/wiki/B-tree#Initial_construction
Currently I know of 2 ways for building a B-Tree : bulkloading and inserting key after key.
In the wiki example the keys are sorted, which is a precondition for bulkloading.
What is the advantage of bulkloading if the keys are unsorted ?
hence I have to sort them myself still resulting in O(nlogn) , same as inserting key after key in the B-Tree.
Thanks.
Consider the following scenarios:
If the data is already sorted, then you don't need to sort the data yourself. This may result in O(n) loading (I'm no expert in bulk loading).
If the tree is very large and stored on disk or on multiple machines, then memory locality may play a role. Bulk loading avoids 'loading' parts of the tree into memory before adding something.
A difference a regular say, red-black tree, and a 4-tree is one of hysteresis; the B-tree reserves space for future use. Only when the node has too many keys does it split in half. The inefficiency arises when one submits keys in order; that is, half-filled nodes that never become full because one is always adding to one side.
This is an example of a 3-tree (using Knuth's definition) that I've inserted the numbers 1 -- 8 in ascending order. Most of the tree is at the low-end of occupancy. The expected value of the number of nodes to access data is 2.5.
Bulk loading is a process by which we ignore the rules of splitting and pack it in as tight as possible, knowing that we probably have more to come from the right. It also helps avoid unnecessary copying, at the expense of maybe having to fix the tree to restore the B-tree invariants on the right. In this, I've bulk-loaded the same data.
Though the asymptotic space and runtime are the same, instead of using little-more than half the space, it uses almost all the space. Thus, the average lookup cost is less at this point, expected value of 1.75. This is more useful when loading a large volume of data that remains relatively static.

Is there a specific scenario of a hash table that isn't full yet an insertion can't occur?

What I mean to ask is for a hash-table following the standard size of a prime number, is it possible to have some scenario (of inserted keys) where no further insertion of a given element is possible even though there's some empty slots? What kind of hash-function would achieve that?
So, most hash functions allow for collisions ("Hash Collisions" is the phrase you should google to understand this better, by the way.) Collisions are handled by having a secondary data structure, like a list, to store all of the values inserted at keys with the same hash.
Because these data structures can generally store arbitrarily many elements, you will always be able to insert into the hash table, but the performance will get worse and worse, approaching the performance of the backing data structure.
If you do not have a backing data structure, then you can be unable to insert as soon as two things get added to the same position. Since a good hash function distributes things evenly and effectively randomly, this would happen pretty quickly (see "The Birthday Problem").
There are failure-to-insert scenarios for some but not all hash table implementations.
For example, closed hashing aka open addressing implementations use some logic to create a sequence of buckets in which they'll "probe" for values not found at the hashed-to bucket due to collisions. In the real world, sometimes the sequence-creation is pretty basic, for example:
the programmer might have hard-coded N prime numbers, thinking the odds of adding in each of those in turn and still not finding an empty bucket are low (but a malicious user who knows the hash table design may be able to calculate values to make the table fail, or it may simply be so full that the odds are no longer good, or - while emptier - a statistical freak event)
the programmer might have done something like picked a prime number they liked - say 13903 - to add to the last-probed bucket each time until a free one is found, but if the table size happens to be 13903 too it'll keep checking the same bucket.
Still, there are probing approaches such as linear probing that guarantee to try all buckets (unless the implementation goes out of its way to put a limit on retries). It has some other "issues" though, and won't always be the best choice.
If a hash table is implemented using open addressing instead of separate chaining, then it is a good idea to leave at least 1 slot empty to simplify the algorithm.
In open addressing when we are trying to find an element, we first compute the hash index i, then check the table at indexes {i, i + 1, i + 2, ... N - 1, (wrapping around) 0, 1, 2, ...}, until we either find the element we want or hit an empty slot. You can see that in this algorithm, if no slot is empty but the element can't be found, then the search would loop forever.
However, I should emphasize that enforcing merely simplifies the search algorithm. Because alternatively, the search algorithm can remember the starting index i, and halt the search if the entire table has been scanned and it lands back at index i.

What makes table lookups so cheap?

A while back, I learned a little bit about big O notation and the efficiency of different algorithms.
For example, looping through each item in an array to do something with it
foreach(item in array)
doSomethingWith(item)
is an O(n) algorithm, because the number of cycles the program performs is directly proportional to the size of the array.
What amazed me, though, was that table lookup is O(1). That is, looking up a key in a hash table or dictionary
value = hashTable[key]
takes the same number of cycles regardless of whether the table has one key, ten keys, a hundred keys, or a gigabrajillion keys.
This is really cool, and I'm very happy that it's true, but it's unintuitive to me and I don't understand why it's true.
I can understand the first O(n) algorithm, because I can compare it to a real-life example: if I have sheets of paper that I want to stamp, I can go through each paper one-by-one and stamp each one. It makes a lot of sense to me that if I have 2,000 sheets of paper, it will take twice as long to stamp using this method than it would if I had 1,000 sheets of paper.
But I can't understand why table lookup is O(1). I'm thinking that if I have a dictionary, and I want to find the definition of polymorphism, it will take me O(logn) time to find it: I'll open some page in the dictionary and see if it's alphabetically before or after polymorphism. If, say, it was after the P section, I can eliminate all the contents of the dictionary after the page I opened and repeat the process with the remainder of the dictionary until I find the word polymorphism.
This is not an O(1) process: it will usually take me longer to find words in a thousand page dictionary than in a two page dictionary. I'm having a hard time imagining a process that takes the same amount of time regardless of the size of the dictionary.
tl;dr: Can you explain to me how it's possible to do a table lookup with O(1) complexity?
(If you show me how to replicate the amazing O(1) lookup algorithm, I'm definitely going to get a big fat dictionary so I can show off to all of my friends my ninja-dictionary-looking-up skills)
EDIT: Most of the answers seem to be contingent on this assumption:
You have the ability to access any page of a dictionary given its page number in constant time
If this is true, it's easy for me to see. But I don't know why this underlying assumption is true: I would use the same process to to look up a page by number as I would by word.
Same thing with memory addresses, what algorithm is used to load a memory address? What makes it so cheap to find a piece of memory from an address? In other words, why is memory access O(1)?
You should read the Wikipedia article.
But the essence is that you first apply a hash function to your key, which converts it to an integer index (this is O(1)). This is then used to index into an array, which is also O(1). If the hash function has been well designed, there should only be one (or a few items) stored at each location in the array, so the lookup is complete.
So in massively-simplified pseudocode:
ValueType array[ARRAY_SIZE];
void insert(KeyType k, ValueType v)
{
int index = hash(k);
array[index] = v;
}
ValueType lookup(KeyType k)
{
int index = hash(k);
return array[index];
}
Obviously, this doesn't handle collisions, but you can read the article to learn how that's handled.
Update
To address the edited question, indexing into an array is O(1) because underneath the hood, the CPU is doing this:
ADD index, array_base_address -> pointer
LOAD pointer -> some_cpu_register
where LOAD loads data stored in memory at the specified address.
Update 2
And the reason a load from memory is O(1) is really just because this is an axiom we usually specify when we talk about computational complexity (see http://en.wikipedia.org/wiki/RAM_model). If we ignore cache hierarchies and data-access patterns, then this is a reasonable assumption. As we scale the size of the machine,, this may not be true (a machine with 100TB of storage may not take the same amount of time as a machine with 100kB). But usually, we assume that the storage capacity of our machine is constant, and much much bigger than any problem size we're likely to look at. So for all intents and purposes, it's a constant-time operation.
I'll address the question from a different perspective from every one else. Hopefully this will give light to why the accessing x[45] and accessing x[5454563] takes the same amount of time.
A RAM is laid out in a grid (i.e. rows and columns) of capacitors. A RAM can address a particular cell of memory by activating a particular column and row on the grid, so let's say if you have a 16-byte capacity RAM, laid out in a 4x4 grid (insanely small for modern computer, but sufficient for illustrative purpose), and you're trying to access the memory address 13 (1101), you first split the address into rows and column, i.e row 3 (11) column 1 (01).
Let's suppose a 0 means taking the left intersection and a 1 means taking a right intersection. So when you want to activate row 3, you send an army of electrons in the row starting gate, the row-army electrons went right, right to reach row 3 activation gate; next you send another army of electrons on the column starting gate, the column-army electrons went left then right to reach the 1st column activation gate. A memory cell can only be read/written if the row and column are both activated, so this would allow the marked cell to be read/written.
The effect of all this gibberish is that the access time of a memory address depends on the address length, and not the particular memory address itself; if an architecture uses a 32-bit address space (i.e. 32 intersections), then addressing memory address 45 and addressing memory address 5454563 both will still have to pass through all 32 intersections (actually 16 intersections for the row electrons and 16 intersections for the columns electrons).
Note that in reality memory addressing takes very little amount of time compared to charging and discharging the capacitors, therefore even if we start having a 512-bit length address space (enough for ~1.4*10^130 yottabyte of RAM, i.e. enough to keep everything under the sun in your RAM), which mean the electrons would have to go through 512 intersections, it wouldn't really add that much time to the actual memory access time.
Note that this is a gross oversimplification of modern RAM. In modern DRAM, if you want to access subsequent memory addresses you only change the columns and not spend time changing the rows, therefore accessing subsequent memory is much faster than accessing totally random addresses. Also, this description is totally ignorant about the effect of CPU cache (although CPU cache also uses a similar grid addressing scheme, however since CPU cache uses the much faster transistor-based capacitor, the negative effect of having large cache address space becomes very critical). However, the point still holds that if you're jumping around the memory, accessing any one of them will take the same amount of time.
You're right, it's surprisingly difficult to find a real-world example of this. The idea of course is that you're looking for something by address and not value.
The dictionary example fails because you don't immediately know the location of page say 278. You still have to look that up the same as you would a word because the page locations are not in your memory.
But say I marked a number on each of your fingers and then I told you to wiggle the one with 15 written on it. You'd have to look at each of them (assuming its unsorted), and if it's not 15 you check the next one. O(n).
If I told you to wiggle your right pinky. You don't have to look anything up. You know where it is because I just told you where it is. The value I just passed to you is its address in your "memory."
It's kind of like that with databases, but on a much larger scale than just 10 fingers.
Because work is done up front -- the value is put in a bucket that is easily accessible given the hashcode of the key. It would be like if you wanted to look up your work in the dictionary but had marked the exact page the word was on.
Imagine you had a dictionary where everything starting with letter A was on page 1, letter B on page 2...etc. So if you wanted to look up "balloon" you would know exactly what page to go to. This is the concept behind O(1) lookups.
Arbitrary data input => maps to a specific memory address
The trade-off of course being you need more memory to allocate for all the potential addresses, many of which may never be used.
If you have an array with 999999999 locations, how long does it take to find a record by social security number?
Assuming you don't have that much memory, then allocate about 30% more array locations that the number of records you intend to store, and then write a hash function to look it up instead.
A very simple (and probably bad) hash function would be social % numElementsInArray.
The problem is collisions--you can't guarantee that every location holds only one element. But thats ok, instead of storing the record at the array location, you can store a linked list of records. Then you scan linearly for the element you want once you hash to get the right array location.
Worst case this is O(n)--everything goes to the same bucket. Average case is O(1) because in general if you allocate enough buckets and your hash function is good, records generally don't collide very often.
Ok, hash-tables in a nutshell:
You take a regular array (O(1) access), and instead of using regular Int values to access it, you use MATH.
What you do, is to take the key value (lets say a string) calculate it into a number (some function on the characters) and then use a well known mathematical formula that gives you a relatively good distribution on the array's range.
So, in that case you are just doing like 4-5 calculations (O(1)) to get an object from that array, using a key which isn't an int.
Now, avoiding collisions, and finding the right mathematical formula for good distribution is the hard part. That's what is explained pretty well in wikipedia: en.wikipedia.org/wiki/Hash_table
Lookup tables know exactly how to access the given item in the table before hand.
Completely the opposite of say, finding an item by it's value in a sorted array, where you have to access items to check that it is what you want.
In theory, a hashtable is a series of buckets (addresses in memory) and a function that maps objects from a domain into those buckets.
Say your domain is 3 letter words, you'd block out 26^3=17,576 addresses for all the possible 3 letter words and create a function that maps all 3 letter words to those addresses, e.g., aaa=0, aab=1, etc. Now when you have a word you'd like to look up, say, "and", you know immediately from your O(1) function that it is address number 367.

Algorithm to find top 10 search terms

I'm currently preparing for an interview, and it reminded me of a question I was once asked in a previous interview that went something like this:
"You have been asked to design some software to continuously display the top 10 search terms on Google. You are given access to a feed that provides an endless real-time stream of search terms currently being searched on Google. Describe what algorithm and data structures you would use to implement this. You are to design two variations:
(i) Display the top 10 search terms of all time (i.e. since you started reading the feed).
(ii) Display only the top 10 search terms for the past month, updated hourly.
You can use an approximation to obtain the top 10 list, but you must justify your choices."
I bombed in this interview and still have really no idea how to implement this.
The first part asks for the 10 most frequent items in a continuously growing sub-sequence of an infinite list. I looked into selection algorithms, but couldn't find any online versions to solve this problem.
The second part uses a finite list, but due to the large amount of data being processed, you can't really store the whole month of search terms in memory and calculate a histogram every hour.
The problem is made more difficult by the fact that the top 10 list is being continuously updated, so somehow you need to be calculating your top 10 over a sliding window.
Any ideas?
Frequency Estimation Overview
There are some well-known algorithms that can provide frequency estimates for such a stream using a fixed amount of storage. One is Frequent, by Misra and Gries (1982). From a list of n items, it find all items that occur more than n / k times, using k - 1 counters. This is a generalization of Boyer and Moore's Majority algorithm (Fischer-Salzberg, 1982), where k is 2. Manku and Motwani's LossyCounting (2002) and Metwally's SpaceSaving (2005) algorithms have similar space requirements, but can provide more accurate estimates under certain conditions.
The important thing to remember is that these algorithms can only provide frequency estimates. Specifically, the Misra-Gries estimate can under-count the actual frequency by (n / k) items.
Suppose that you had an algorithm that could positively identify an item only if it occurs more than 50% of the time. Feed this algorithm a stream of N distinct items, and then add another N - 1 copies of one item, x, for a total of 2N - 1 items. If the algorithm tells you that x exceeds 50% of the total, it must have been in the first stream; if it doesn't, x wasn't in the initial stream. In order for the algorithm to make this determination, it must store the initial stream (or some summary proportional to its length)! So, we can prove to ourselves that the space required by such an "exact" algorithm would be Ω(N).
Instead, these frequency algorithms described here provide an estimate, identifying any item that exceeds the threshold, along with some items that fall below it by a certain margin. For example the Majority algorithm, using a single counter, will always give a result; if any item exceeds 50% of the stream, it will be found. But it might also give you an item that occurs only once. You wouldn't know without making a second pass over the data (using, again, a single counter, but looking only for that item).
The Frequent Algorithm
Here's a simple description of Misra-Gries' Frequent algorithm. Demaine (2002) and others have optimized the algorithm, but this gives you the gist.
Specify the threshold fraction, 1 / k; any item that occurs more than n / k times will be found. Create an an empty map (like a red-black tree); the keys will be search terms, and the values will be a counter for that term.
Look at each item in the stream.
If the term exists in the map, increment the associated counter.
Otherwise, if the map less than k - 1 entries, add the term to the map with a counter of one.
However, if the map has k - 1 entries already, decrement the counter in every entry. If any counter reaches zero during this process, remove it from the map.
Note that you can process an infinite amount of data with a fixed amount of storage (just the fixed-size map). The amount of storage required depends only on the threshold of interest, and the size of the stream does not matter.
Counting Searches
In this context, perhaps you buffer one hour of searches, and perform this process on that hour's data. If you can take a second pass over this hour's search log, you can get an exact count of occurrences of the top "candidates" identified in the first pass. Or, maybe its okay to to make a single pass, and report all the candidates, knowing that any item that should be there is included, and any extras are just noise that will disappear in the next hour.
Any candidates that really do exceed the threshold of interest get stored as a summary. Keep a month's worth of these summaries, throwing away the oldest each hour, and you would have a good approximation of the most common search terms.
Well, looks like an awful lot of data, with a perhaps prohibitive cost to store all frequencies. When the amount of data is so large that we cannot hope to store it all, we enter the domain of data stream algorithms.
Useful book in this area:
Muthukrishnan - "Data Streams: Algorithms and Applications"
Closely related reference to the problem at hand which I picked from the above:
Manku, Motwani - "Approximate Frequency Counts over Data Streams" [pdf]
By the way, Motwani, of Stanford, (edit) was an author of the very important "Randomized Algorithms" book. The 11th chapter of this book deals with this problem. Edit: Sorry, bad reference, that particular chapter is on a different problem. After checking, I instead recommend section 5.1.2 of Muthukrishnan's book, available online.
Heh, nice interview question.
This is one of the research project that I am current going through. The requirement is almost exactly as yours, and we have developed nice algorithms to solve the problem.
The Input
The input is an endless stream of English words or phrases (we refer them as tokens).
The Output
Output top N tokens we have seen so
far (from all the tokens we have
seen!)
Output top N tokens in a
historical window, say, last day or
last week.
An application of this research is to find the hot topic or trends of topic in Twitter or Facebook. We have a crawler that crawls on the website, which generates a stream of words, which will feed into the system. The system then will output the words or phrases of top frequency either at overall or historically. Imagine in last couple of weeks the phrase "World Cup" would appears many times in Twitter. So does "Paul the octopus". :)
String into Integers
The system has an integer ID for each word. Though there is almost infinite possible words on the Internet, but after accumulating a large set of words, the possibility of finding new words becomes lower and lower. We have already found 4 million different words, and assigned a unique ID for each. This whole set of data can be loaded into memory as a hash table, consuming roughly 300MB memory. (We have implemented our own hash table. The Java's implementation takes huge memory overhead)
Each phrase then can be identified as an array of integers.
This is important, because sorting and comparisons on integers is much much faster than on strings.
Archive Data
The system keeps archive data for every token. Basically it's pairs of (Token, Frequency). However, the table that stores the data would be so huge such that we have to partition the table physically. Once partition scheme is based on ngrams of the token. If the token is a single word, it is 1gram. If the token is two-word phrase, it is 2gram. And this goes on. Roughly at 4gram we have 1 billion records, with table sized at around 60GB.
Processing Incoming Streams
The system will absorbs incoming sentences until memory becomes fully utilized (Ya, we need a MemoryManager). After taking N sentences and storing in memory, the system pauses, and starts tokenize each sentence into words and phrases. Each token (word or phrase) is counted.
For highly frequent tokens, they are always kept in memory. For less frequent tokens, they are sorted based on IDs (remember we translate the String into an array of integers), and serialized into a disk file.
(However, for your problem, since you are counting only words, then you can put all word-frequency map in memory only. A carefully designed data structure would consume only 300MB memory for 4 million different words. Some hint: use ASCII char to represent Strings), and this is much acceptable.
Meanwhile, there will be another process that is activated once it finds any disk file generated by the system, then start merging it. Since the disk file is sorted, merging would take a similar process like merge sort. Some design need to be taken care at here as well, since we want to avoid too many random disk seeks. The idea is to avoid read (merge process)/write (system output) at the same time, and let the merge process read form one disk while writing into a different disk. This is similar like to implementing a locking.
End of Day
At end of day, the system will have many frequent tokens with frequency stored in memory, and many other less frequent tokens stored in several disk files (and each file is sorted).
The system flush the in-memory map into a disk file (sort it). Now, the problem becomes merging a set of sorted disk file. Using similar process, we would get one sorted disk file at the end.
Then, the final task is to merge the sorted disk file into archive database.
Depends on the size of archive database, the algorithm works like below if it is big enough:
for each record in sorted disk file
update archive database by increasing frequency
if rowcount == 0 then put the record into a list
end for
for each record in the list of having rowcount == 0
insert into archive database
end for
The intuition is that after sometime, the number of inserting will become smaller and smaller. More and more operation will be on updating only. And this updating will not be penalized by index.
Hope this entire explanation would help. :)
You could use a hash table combined with a binary search tree. Implement a <search term, count> dictionary which tells you how many times each search term has been searched for.
Obviously iterating the entire hash table every hour to get the top 10 is very bad. But this is google we're talking about, so you can assume that the top ten will all get, say over 10 000 hits (it's probably a much larger number though). So every time a search term's count exceeds 10 000, insert it in the BST. Then every hour, you only have to get the first 10 from the BST, which should contain relatively few entries.
This solves the problem of top-10-of-all-time.
The really tricky part is dealing with one term taking another's place in the monthly report (for example, "stack overflow" might have 50 000 hits for the past two months, but only 10 000 the past month, while "amazon" might have 40 000 for the past two months but 30 000 for the past month. You want "amazon" to come before "stack overflow" in your monthly report). To do this, I would store, for all major (above 10 000 all-time searches) search terms, a 30-day list that tells you how many times that term was searched for on each day. The list would work like a FIFO queue: you remove the first day and insert a new one each day (or each hour, but then you might need to store more information, which means more memory / space. If memory is not a problem do it, otherwise go for that "approximation" they're talking about).
This looks like a good start. You can then worry about pruning the terms that have > 10 000 hits but haven't had many in a long while and stuff like that.
case i)
Maintain a hashtable for all the searchterms, as well as a sorted top-ten list separate from the hashtable. Whenever a search occurs, increment the appropriate item in the hashtable and check to see if that item should now be switched with the 10th item in the top-ten list.
O(1) lookup for the top-ten list, and max O(log(n)) insertion into the hashtable (assuming collisions managed by a self-balancing binary tree).
case ii)
Instead of maintaining a huge hashtable and a small list, we maintain a hashtable and a sorted list of all items. Whenever a search is made, that term is incremented in the hashtable, and in the sorted list the term can be checked to see if it should switch with the term after it. A self-balancing binary tree could work well for this, as we also need to be able to query it quickly (more on this later).
In addition we also maintain a list of 'hours' in the form of a FIFO list (queue). Each 'hour' element would contain a list of all searches done within that particular hour. So for example, our list of hours might look like this:
Time: 0 hours
-Search Terms:
-free stuff: 56
-funny pics: 321
-stackoverflow: 1234
Time: 1 hour
-Search Terms:
-ebay: 12
-funny pics: 1
-stackoverflow: 522
-BP sucks: 92
Then, every hour: If the list has at least 720 hours long (that's the number of hours in 30 days), look at the first element in the list, and for each search term, decrement that element in the hashtable by the appropriate amount. Afterwards, delete that first hour element from the list.
So let's say we're at hour 721, and we're ready to look at the first hour in our list (above). We'd decrement free stuff by 56 in the hashtable, funny pics by 321, etc., and would then remove hour 0 from the list completely since we will never need to look at it again.
The reason we maintain a sorted list of all terms that allows for fast queries is because every hour after as we go through the search terms from 720 hours ago, we need to ensure the top-ten list remains sorted. So as we decrement 'free stuff' by 56 in the hashtable for example, we'd check to see where it now belongs in the list. Because it's a self-balancing binary tree, all of that can be accomplished nicely in O(log(n)) time.
Edit: Sacrificing accuracy for space...
It might be useful to also implement a big list in the first one, as in the second one. Then we could apply the following space optimization on both cases: Run a cron job to remove all but the top x items in the list. This would keep the space requirement down (and as a result make queries on the list faster). Of course, it would result in an approximate result, but this is allowed. x could be calculated before deploying the application based on available memory, and adjusted dynamically if more memory becomes available.
Rough thinking...
For top 10 all time
Using a hash collection where a count for each term is stored (sanitize terms, etc.)
An sorted array which contains the ongoing top 10, a term/count in added to this array whenever the count of a term becomes equal or greater than the smallest count in the array
For monthly top 10 updated hourly:
Using an array indexed on number of hours elapsed since start modulo 744 (the number of hours during a month), which array entries consist of hash collection where a count for each term encountered during this hour-slot is stored. An entry is reset whenever the hour-slot counter changes
the stats in the array indexed on hour-slot needs to be collected whenever the current hour-slot counter changes (once an hour at most), by copying and flattening the content of this array indexed on hour-slots
Errr... make sense? I didn't think this through as I would in real life
Ah yes, forgot to mention, the hourly "copying/flattening" required for the monthly stats can actually reuse the same code used for the top 10 of all time, a nice side effect.
Exact solution
First, a solution that guarantees correct results, but requires a lot of memory (a big map).
"All-time" variant
Maintain a hash map with queries as keys and their counts as values. Additionally, keep a list f 10 most frequent queries so far and the count of the 10th most frequent count (a threshold).
Constantly update the map as the stream of queries is read. Every time a count exceeds the current threshold, do the following: remove the 10th query from the "Top 10" list, replace it with the query you've just updated, and update the threshold as well.
"Past month" variant
Keep the same "Top 10" list and update it the same way as above. Also, keep a similar map, but this time store vectors of 30*24 = 720 count (one for each hour) as values. Every hour do the following for every key: remove the oldest counter from the vector add a new one (initialized to 0) at the end. Remove the key from the map if the vector is all-zero. Also, every hour you have to calculate the "Top 10" list from scratch.
Note: Yes, this time we're storing 720 integers instead of one, but there are much less keys (the all-time variant has a really long tail).
Approximations
These approximations do not guarantee the correct solution, but are less memory-consuming.
Process every N-th query, skipping the rest.
(For all-time variant only) Keep at most M key-value pairs in the map (M should be as big as you can afford). It's a kind of an LRU cache: every time you read a query that is not in the map, remove the least recently used query with count 1 and replace it with the currently processed query.
Top 10 search terms for the past month
Using memory efficient indexing/data structure, such as tightly packed tries (from wikipedia entries on tries) approximately defines some relation between memory requirements and n - number of terms.
In case that required memory is available (assumption 1), you can keep exact monthly statistic and aggregate it every month into all time statistic.
There is, also, an assumption here that interprets the 'last month' as fixed window.
But even if the monthly window is sliding the above procedure shows the principle (sliding can be approximated with fixed windows of given size).
This reminds me of round-robin database with the exception that some stats are calculated on 'all time' (in a sense that not all data is retained; rrd consolidates time periods disregarding details by averaging, summing up or choosing max/min values, in given task the detail that is lost is information on low frequency items, which can introduce errors).
Assumption 1
If we can not hold perfect stats for the whole month, then we should be able to find a certain period P for which we should be able to hold perfect stats.
For example, assuming we have perfect statistics on some time period P, which goes into month n times.
Perfect stats define function f(search_term) -> search_term_occurance.
If we can keep all n perfect stat tables in memory then sliding monthly stats can be calculated like this:
add stats for the newest period
remove stats for the oldest period (so we have to keep n perfect stat tables)
However, if we keep only top 10 on the aggregated level (monthly) then we will be able to discard a lot of data from the full stats of the fixed period. This gives already a working procedure which has fixed (assuming upper bound on perfect stat table for period P) memory requirements.
The problem with the above procedure is that if we keep info on only top 10 terms for a sliding window (similarly for all time), then the stats are going to be correct for search terms that peak in a period, but might not see the stats for search terms that trickle in constantly over time.
This can be offset by keeping info on more than top 10 terms, for example top 100 terms, hoping that top 10 will be correct.
I think that further analysis could relate the minimum number of occurrences required for an entry to become a part of the stats (which is related to maximum error).
(In deciding which entries should become part of the stats one could also monitor and track the trends; for example if a linear extrapolation of the occurrences in each period P for each term tells you that the term will become significant in a month or two you might already start tracking it. Similar principle applies for removing the search term from the tracked pool.)
Worst case for the above is when you have a lot of almost equally frequent terms and they change all the time (for example if tracking only 100 terms, then if top 150 terms occur equally frequently, but top 50 are more often in first month and lest often some time later then the statistics would not be kept correctly).
Also there could be another approach which is not fixed in memory size (well strictly speaking neither is the above), which would define minimum significance in terms of occurrences/period (day, month, year, all-time) for which to keep the stats. This could guarantee max error in each of the stats during aggregation (see round robin again).
What about an adaption of the "clock page replacement algorithm" (also known as "second-chance")? I can imagine it to work very well if the search requests are distributed evenly (that means most searched terms appear regularly rather than 5mio times in a row and then never again).
Here's a visual representation of the algorithm:
The problem is not universally solvable when you have a fixed amount of memory and an 'infinite' (think very very large) stream of tokens.
A rough explanation...
To see why, consider a token stream that has a particular token (i.e., word) T every N tokens in the input stream.
Also, assume that the memory can hold references (word id and counts) to at most M tokens.
With these conditions, it is possible to construct an input stream where the token T will never be detected if the N is large enough so that the stream contains different M tokens between T's.
This is independent of the top-N algorithm details. It only depends on the limit M.
To see why this is true, consider the incoming stream made up of groups of two identical tokens:
T a1 a2 a3 ... a-M T b1 b2 b3 ... b-M ...
where the a's, and b's are all valid tokens not equal to T.
Notice that in this stream, the T appears twice for each a-i and b-i. Yet it appears rarely enough to be flushed from the system.
Starting with an empty memory, the first token (T) will take up a slot in the memory (bounded by M). Then a1 will consume a slot, all the way to a-(M-1) when the M is exhausted.
When a-M arrives the algorithm has to drop one symbol so let it be the T.
The next symbol will be b-1 which will cause a-1 to be flushed, etc.
So, the T will not stay memory-resident long enough to build up a real count. In short, any algorithm will miss a token of low enough local frequency but high global frequency (over the length of the stream).
Store the count of search terms in a giant hash table, where each new search causes a particular element to be incremented by one. Keep track of the top 20 or so search terms; when the element in 11th place is incremented, check if it needs to swap positions with #10* (it's not necessary to keep the top 10 sorted; all you care about is drawing the distinction between 10th and 11th).
*Similar checks need to be made to see if a new search term is in 11th place, so this algorithm bubbles down to other search terms too -- so I'm simplifying a bit.
sometimes the best answer is "I don't know".
Ill take a deeper stab. My first instinct would be to feed the results into a Q. A process would continually process items coming into the Q. The process would maintain a map of
term -> count
each time a Q item is processed, you simply look up the search term and increment the count.
At the same time, I would maintain a list of references to the top 10 entries in the map.
For the entry that was currently implemented, see if its count is greater than the count of the count of the smallest entry in the top 10.(if not in the list already). If it is, replace the smallest with the entry.
I think that would work. No operation is time intensive. You would have to find a way to manage the size of the count map. but that should good enough for an interview answer.
They are not expecting a solution, that want to see if you can think. You dont have to write the solution then and there....
One way is that for every search, you store that search term and its time stamp. That way, finding the top ten for any period of time is simply a matter of comparing all search terms within the given time period.
The algorithm is simple, but the drawback would be greater memory and time consumption.
What about using a Splay Tree with 10 nodes? Each time you try to access a value (search term) that is not contained in the tree, throw out any leaf, insert the value instead and access it.
The idea behind this is the same as in my other answer. Under the assumption that the search terms are accessed evenly/regularly this solution should perform very well.
edit
One could also store some more search terms in the tree (the same goes for the solution I suggest in my other answer) in order to not delete a node that might be accessed very soon again. The more values one stores in it, the better the results.
Dunno if I understand it right or not.
My solution is using heap.
Because of top 10 search items, I build a heap with size 10.
Then update this heap with new search. If a new search's frequency is greater than heap(Max Heap) top, update it. Abandon the one with smallest frequency.
But, how to calculate the frequency of the specific search will be counted on something else.
Maybe as everyone stated, the data stream algorithm....
Use cm-sketch to store count of all searches since beginning, keep a min-heap of size 10 with it for top 10.
For monthly result, keep 30 cm-sketch/hash-table and min-heap with it, each one start counting and updating from last 30, 29 .., 1 day. As a day pass, clear the last and use it as day 1.
Same for hourly result, keep 60 hash-table and min-heap and start counting for last 60, 59, ...1 minute. As a minute pass, clear the last and use it as minute 1.
Montly result is accurate in range of 1 day, hourly result is accurate in range of 1 min

Resources