Operations on Two Streams of Data - Design Algorithm

I have seen this algorithm question or variants of it several times but have not found or been able to determine an optimal solution to the problem. The question is:
You are given two queues where each queue contains {timestamp, price} pairs. You have to print "price1, price2" pairs for all those timestamps where abs(ts1 - ts2) <= 1 second, where ts1 and price1 are from the first queue and ts2 and price2 are from the second queue.
How would you design a system to handle these requirements?
Then a follow-up to this question: what if one of the queues is slower than the other (data is delayed)? How would you handle this?

You can do this in a similar fashion to the merging algorithm from merge sort, only doubled.
I'm going to describe an algorithm in which I choose queue #1 to be my "main queue." This will only provide a partial solution; I'll explain how to complete it afterwards.
At any time you keep one entry from each queue in memory. Whenever the two entries you have uphold your condition of being at most one second apart, print out their prices. Whether or not you did, discard the one with the lower timestamp and get the next one from its queue. If at any point the timestamp from queue #1 is lower than that from queue #2, discard entries from queue #1 until that is no longer the case. If they both have the same timestamp, print the pair and advance the one from queue #1. Repeat until done.
This will print out all the pairs of "price1, price2" whose corresponding ts1 and ts2 uphold that 0 <= ts1 - ts2 <= 1.
Now, for the other half, do the same only this time choose queue #2 as your "main queue" (i.e. do everything I just said with the numbers 1 and 2 reversed) - except don't print out pairs with equal time stamps, since you've already printed those in the first part.
This will print out all the pairs of "price1, price2" whose corresponding ts1 and ts2 uphold that 0 < ts2 - ts1 <= 1, which is like saying 0 > ts1 - ts2 >= -1.
Together you get the printout for all the cases in which -1 <= ts1 - ts2 <= 1, i.e. in which abs(ts1 - ts2) <= 1.
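In Python, a sketch of the two half-passes (using two pointers over already-sorted lists rather than literal queue reads, and buffering a small window so that one entry can pair with several; names and data are illustrative):

    def half_pass(main, other, skip_equal=False):
        # Emit (main_price, other_price) for all pairs where
        # 0 <= ts_main - ts_other <= 1.
        pairs = []
        j = 0
        for ts_m, price_m in main:
            # entries more than 1s older than ts_m can never match again
            while j < len(other) and other[j][0] < ts_m - 1:
                j += 1
            k = j
            while k < len(other) and other[k][0] <= ts_m:
                ts_o, price_o = other[k]
                if not (skip_equal and ts_o == ts_m):
                    pairs.append((price_m, price_o))
                k += 1
        return pairs

    q1 = [(0.0, 100.0), (0.5, 101.0), (2.0, 102.0)]   # (timestamp, price), sorted
    q2 = [(0.4, 200.0), (2.0, 201.0)]

    result = half_pass(q1, q2)                        # 0 <= ts1 - ts2 <= 1
    result += [(p1, p2) for p2, p1 in half_pass(q2, q1, skip_equal=True)]  # 0 < ts2 - ts1 <= 1
    print(result)   # [(101.0, 200.0), (102.0, 201.0), (100.0, 200.0)]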

In addition to the queues, use two hashmaps (one exclusive to each queue).
As soon as a new item arrives, strip the seconds out and use the resulting minute as the key into the corresponding hashmap.
Using the very same key, retrieve all the items in the other hashmap.
One by one, compare whether the retrieved items are within 1 second of the item from bullet 2.
Note that this will fail to detect pairs whose keys differ by a minute: 10:00:59 and 10:01:00 will not be matched.
To solve this:
for items like XX:XX:59, you will need to hit the other hashmap twice, using keys XX:XX and XX:XX+1.
for items like XX:XX:00, you will need to hit the other hashmap twice, using keys XX:XX and XX:XX-1.
Note: do a date addition (not a plain arithmetic one), since it will automatically deal with rollovers like 01:59:59 + 1 = 02:00:00, or Monday 1 23:59:59 becoming Tuesday 2 00:00:00.
BTW, this algorithm also deals with the delay issue.
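A minimal Python sketch of this bucketing (probing both neighbouring minutes unconditionally, which subsumes the :59/:00 special cases above; names are illustrative):

    from datetime import datetime, timedelta

    buckets = ({}, {})   # one map per queue, keyed by the minute

    def on_arrival(queue_id, ts, price):
        minute = ts.replace(second=0, microsecond=0)
        other = buckets[1 - queue_id]
        # probing both neighbouring minutes covers pairs that straddle a minute
        for key in (minute - timedelta(minutes=1), minute, minute + timedelta(minutes=1)):
            for ts2, price2 in other.get(key, []):
                if abs((ts - ts2).total_seconds()) <= 1:
                    pair = (price, price2) if queue_id == 0 else (price2, price)
                    print(*pair)
        buckets[queue_id].setdefault(minute, []).append((ts, price))

    on_arrival(0, datetime(2023, 1, 1, 10, 0, 59), 100.0)
    on_arrival(1, datetime(2023, 1, 1, 10, 1, 0), 200.0)   # prints: 100.0 200.0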

The speed of the queues does not matter at all if the algorithm is based on the comparison of timestamps alone. If one queue is empty and you cannot proceed, just check periodically until you can continue.
You can solve this by managing a list for one of the queues. In the algorithm below the first queue was chosen, so the list is called l1. It works like a sliding window.
Dequeue the 2nd queue: d2.
While the timestamp of the head of l1 is smaller than that of d2 and the difference is greater than 1: remove the head from l1.
Go through the list and print all the pairs l1[i].price, d2.price as long as the difference of the timestamps is at most 1. If you don't reach the end of the list, continue with step 1.
Get the next element from the first queue and add it to the list. If the difference between the timestamps is at most 1, print the prices and repeat; if not, continue with step 1.
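A Python sketch of these steps, assuming both feeds are (timestamp, price) iterables sorted by timestamp (names are illustrative):

    from collections import deque

    def print_matches(feed1, feed2, tol=1.0):
        it1 = iter(feed1)
        window = deque()                    # the list "l1"
        done = False
        for ts2, price2 in feed2:           # step 1: dequeue d2
            while window and window[0][0] < ts2 - tol:
                window.popleft()            # step 2: drop heads that fell behind
            for ts1, price1 in window:      # step 3: pair d2 with the window
                if abs(ts1 - ts2) <= tol:
                    print(price1, price2)
            while not done and (not window or window[-1][0] <= ts2 + tol):
                try:
                    ts1, price1 = next(it1) # step 4: grow the window
                except StopIteration:
                    done = True
                    break
                window.append((ts1, price1))
                if abs(ts1 - ts2) <= tol:
                    print(price1, price2)

    print_matches([(0.0, 100), (0.9, 101), (2.5, 102)],
                  [(0.5, 200), (2.0, 201)])
    # prints: 100 200, 101 200, 102 201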

Here is my solution; you need the following services.
Design a service to read messages from Queue1 and push the data to a DB.
Design another service to read messages from Queue2 and push the data to the same DB.
Design another service to read the data from the DB and print the results at whatever frequency is needed.
Edit
The system above is designed keeping the following points in mind:
Scalability: if the load on the system increases, the number of services can be scaled up.
Slowness: as already mentioned, if one queue is slower than the other, chances are the first queue is receiving more messages than the second, so a direct pairing would not be able to produce the desired output; buffering both streams in the DB absorbs the delay.
Output frequency: if the requirement changes in the future and instead of a 1-second difference we want to show a 1-hour difference, that is also very much possible.
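As a toy sketch of the third service's query, with an in-memory SQLite DB standing in for the real one (schema and data are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")      # stand-in for the shared DB
    conn.executescript("""
        CREATE TABLE q1 (ts REAL, price REAL);
        CREATE TABLE q2 (ts REAL, price REAL);
    """)
    conn.executemany("INSERT INTO q1 VALUES (?, ?)", [(0.0, 100.0), (2.0, 102.0)])
    conn.executemany("INSERT INTO q2 VALUES (?, ?)", [(0.5, 200.0)])

    # the "printing" service: pairs within 1 second of each other
    rows = conn.execute("""
        SELECT q1.price, q2.price
        FROM q1 JOIN q2 ON ABS(q1.ts - q2.ts) <= 1
    """).fetchall()
    print(rows)   # [(100.0, 200.0)]

Changing the 1-second window to a 1-hour window is then just a change to the query's constant.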

Get the first element from both queues.
Compare the timestamps. If within one second, output the pair.
From the queue that gave the earlier timestamp, get the next element.
Repeat.
EDIT:
After @maraca's comment, I had to rethink my algorithm. And yes, if there are multiple events within a second on both queues, it will not produce all combinations.

Related

Different clustering algorithms to cluster timeseries events

I have a very large input file with the following format:
ID \t time \t duration \t Description \t status
The status column is limited to containing either lower case a, s, i, or upper case A, S, I, or a mix of the two (sample elements in the status column: a, si, I, asi, ASI, aSI, Asi...).
The ultimate purpose is to cluster events that start and end at "close enough" times in order to recognize that those events contribute to a bigger event. Close enough here is determined by a theta, let's say for now 1 hour (or it could be 2 hours, or more, etc.). If two events have start times within 1 hour and end times within 1 hour, we cluster them together and say that they are part of a big event.
One other thing is that I only care about events that have all lower case letters in the status column.
So below is my initial input processing:
I filter the input file so that it only contains rows that have all lower case letters
This task is already done in Python using MapReduce and Hadoop
Then I take the output file and split it into "boxes" where each box represents a time range (e.g. 11am-12pm - box 1, 12pm-1pm - box 2, 1pm-2pm - box 3, etc.)
Then use MapReduce again to sort each box based on start time (ascending order)
The final output is an ordered list of start times
Now I want to develop an algorithm to group "similar events" together in the output above. Similar events are determined by start and end time.
My current algorithm is:
pick the first item in the list
find any event in the list whose start time is within 1 hour of the first event's start time and whose duration is within +/- 1 hour of the first item's duration (the duration determines the end time).
Then cluster them together (basically I want to cluster events that happen in the same time frame)
If no other event is found, move to the next available event (one that has not been clustered).
Keep doing this until there are no more events to be clustered.
I'm not sure if my algorithm will work or whether it's efficient. I'm aiming for an algorithm better than O(n^2), so hierarchical clustering won't work. K-means might not work either, since I don't know ahead of time how many clusters I will need.
There could be some other clustering algorithms that might fit better in my case. I think I might need some equations in my algorithm to calculate the distance in my cluster in order to determine similarity. I'm pretty new to this field, so any help to direct me to the right path is greatly appreciated.
Thanks a lot in advance.
You could try the DBSCAN density-based clustering algorithm, which is O(n log n) (guaranteed ONLY when using an indexing data structure like a kd-tree, ball tree, etc.; otherwise it's O(n^2)). Events that are part of a bigger event should be in areas of high density. It seems to be a great fit for your problem.
You might need to implement your own distance function in order to perform neighborhood query (to find nearest events).
Also, it would be better to represent event time in POSIX-time format.
Here is an example.
Depending on the environment you use, DBSCAN implementations are available in Java (ELKI), Python (scikit-learn), and R (fpc).
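For instance, with scikit-learn (a sketch: Chebyshev distance over (start, end) POSIX times makes eps = 3600 mean "start times within 1 hour AND end times within 1 hour"; the data is made up):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # one row per event: (start_time, end_time) in POSIX seconds
    events = np.array([
        [0,     3600],
        [1800,  5400],    # starts and ends within 1 hour of the first event
        [90000, 93600],   # far away in time
    ])

    # Chebyshev distance = max(|start1 - start2|, |end1 - end2|)
    labels = DBSCAN(eps=3600, min_samples=2, metric="chebyshev").fit_predict(events)
    print(labels)   # [ 0  0 -1]: first two clustered, third is noise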

load balancing algorithms - special example

Let's pretend I have two buildings in which I can build different units.
A building can only build one unit at a time, but it has a FIFO queue of max 5 units, which will be built in sequence.
Every unit has a build time.
I need to know the fastest way to get my units built, considering the units already in the build queues of my buildings.
"Famous" algorithms like round-robin don't work here, I think.
Are there any algorithms which can solve this problem?
This reminds me a bit of StarCraft :D
I would just add an integer to the building queue which represents the time it is busy.
Of course you have to update this variable once per time unit. (Time units are "s" here, for seconds.)
So let's say we have a building and we are submitting 3 units, each taking 5s to complete, which sums to 15s total. We are at time = 0.
Then we have another building where we are submitting 2 units that each need 6 time units to complete.
So we can have a table like this:
Time 0
Building 1, 3 units, 15s to complete.
Building 2, 2 units, 12s to complete.
Time 1
Building 1, 3 units, 14s to complete.
Building 2, 2 units, 12s to complete.
Now we want to add another unit that takes 2s: we can simply loop through the selected buildings and pick the one with the lowest time to complete.
In this case this would be building 2. This leads to time 2...
Time 2
Building 1, 3 units, 13s to complete
Building 2, 3 units, 11s+2s=13s to complete
...
Time 5
Building 1, 2 units, 10s to complete (5s are over, the first unit pops out)
Building 2, 3 units, 10s to complete
And so on.
Of course you have to take care of the upper bounds of your production facilities: if a building already has 5 elements queued, don't assign anything to it; pick the next building with the lowest time to complete.
I don't know if you can implement this easily with your engine, or if it even supports some kind of time units.
This just results in updating all production facilities once per time unit, O(n) where n is the number of buildings that can produce something. Submitting a unit takes O(1), assuming you keep the buildings in sorted order, lowest time to complete first - just a first-element lookup. In that case you have to re-sort the list after manipulating the units, e.g. cancelling or adding.
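A minimal sketch of this busy-time bookkeeping, assuming tick() is called once per time unit (names and the example mirror the table above):

    class Building:
        def __init__(self, name, max_queue=5):
            self.name = name
            self.max_queue = max_queue
            self.queue = []              # remaining build times; front is in progress

        def busy_time(self):
            return sum(self.queue)       # the integer "time it is busy"

        def tick(self):                  # call once per time unit
            if self.queue:
                self.queue[0] -= 1
                if self.queue[0] == 0:
                    self.queue.pop(0)    # the finished unit pops out

    def submit(buildings, build_time):
        free = [b for b in buildings if len(b.queue) < b.max_queue]
        if not free:
            return None                  # all queues full
        best = min(free, key=lambda b: b.busy_time())
        best.queue.append(build_time)
        return best

    b1, b2 = Building("B1"), Building("B2")
    b1.queue = [5, 5, 5]                 # 15s to complete
    b2.queue = [6, 6]                    # 12s to complete
    print(submit([b1, b2], 2).name)      # B2: 12 < 15, as in the table above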
Otherwise amit's answer seems to be possible, too.
This is an NP-complete problem (proof at the end of the answer), so your best hope of finding the ideal solution is trying all possibilities (2^n of them, where n is the number of tasks).
A possible heuristic was suggested in the comments (and improved there by AShelly): sort the tasks from biggest to smallest and put them in one queue; each building now takes the next element from the queue whenever it finishes a task.
This is of course not always optimal, but I think it will give good results in most cases.
Proof that the problem is NP-complete:
Let S = {u | u is a unit that needs to be produced} (S is the set containing all 'tasks').
Claim: if a perfect split is possible (both queues finish at the same time), it is optimal. Let this time be HalfTime.
This is true because in any other schedule at least one of the queues has to finish at t > HalfTime, and thus that schedule is no better.
Proof:
Assume we had an algorithm A producing the best solution in polynomial time. Then we could solve the partition problem in polynomial time with the following algorithm:
1. Run A on the input.
2. If the 2 queues finish exactly at HalfTime, return True.
3. Else, return False.
This solves the partition problem because of the claim: if the partition exists, it will be returned by A, since it is optimal. All of steps 1, 2, 3 run in polynomial time (1 by assumption, 2 and 3 are trivial). So the suggested algorithm solves the partition problem in polynomial time, and thus our problem is NP-complete.
Q.E.D.
Here's a simple scheme:
Let U be the list of units you want to build, and F be the set of factories that can build them. For each factory, track its total time-til-complete, i.e. how long until its queue is completely empty.
Sort U by decreasing time-to-build. Maintain sort order when inserting new items.
At the start, or at the end of any time tick in which a factory completes a unit or runs out of work:
Make a ready list of all the factories with space in the queue
Sort the ready list by increasing time-til-complete
Get the factory that will be done soonest
Take the first item from U and add it to that factory.
Repeat until U is empty or all queues are full.
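A compact sketch of this scheme as a one-shot greedy assignment (the capacity of 5 and the names are illustrative):

    def lpt_assign(build_times, factories, capacity=5):
        # Longest Processing Time: longest job first, each to the ready
        # factory that will be done soonest.
        jobs = sorted(build_times, reverse=True)        # U, decreasing
        totals = {f: 0 for f in factories}              # time-til-complete
        queues = {f: [] for f in factories}
        for job in jobs:
            ready = [f for f in factories if len(queues[f]) < capacity]
            if not ready:
                break                                   # all queues full; stop
            f = min(ready, key=totals.get)              # done soonest
            queues[f].append(job)
            totals[f] += job
        return queues, totals

    queues, totals = lpt_assign([5, 5, 5, 6, 6, 2], ["B1", "B2"])
    print(queues)   # {'B1': [6, 5, 5], 'B2': [6, 5, 2]}
    print(totals)   # {'B1': 16, 'B2': 13}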
Googling "minimum makespan" may give you some leads into other solutions. This CMU lecture has a nice overview.
It turns out that if you know the set of work ahead of time, this problem is exactly Multiprocessor_scheduling, which is NP-Complete. Apparently the algorithm I suggested is called "Longest Processing Time", and it will always give a result no longer than 4/3 of the optimal time.
If you don't know the jobs ahead of time, it is a case of online Job-Shop Scheduling.
The paper "The Power of Reordering for Online Minimum Makespan Scheduling" says
for many problems, including minimum makespan scheduling, it is reasonable to not only provide a lookahead to a certain number of future jobs, but additionally to allow the algorithm to choose one of these jobs for processing next and, therefore, to reorder the input sequence.
Because you have a FIFO on each of your factories, you essentially do have the ability to buffer the incoming jobs, because you can hold them until a factory is completely idle, instead of trying to keep all the FIFOs full at all times.
If I understand the paper correctly, the upshot of the scheme is to
Keep a fixed-size buffer of incoming jobs. In general, the bigger the buffer, the closer to ideal scheduling you get.
Assign a weight w to each factory according to a given formula, which depends on buffer size. In the case where buffer size = number of factories + 1, use weights of (2/3, 1/3) for 2 factories; (5/11, 4/11, 2/11) for 3.
Once the buffer is full, whenever a new job arrives, you remove the job with the least time to build and assign it to a factory with a time-to-complete < w*T where T is total time-to-complete of all factories.
If there are no more incoming jobs, schedule the remainder of jobs in U using the first algorithm I gave.
The main problem in applying this to your situation is that you don't know when (if ever) there will be no more incoming jobs. But perhaps just replacing that condition with "if any factory is completely idle", and then restarting, will give decent results.

Algorithm to process jobs with same priority

I am solving exercise problems from a book called Algorithms by Papadimitrou and Vazirani.
The following is the question:
A server has n customers waiting to be served. The service time required by each customer is known in advance: it is ti minutes for customer i. So if, for example, the customers are served in order of increasing i, then the ith customer has to wait Sum(j = 1 to i) tj minutes.
We wish to minimize the total waiting time. Give an efficient algorithm for the same.
My Attempt:
I thought of a couple of approaches but couldn't decide which is best, or whether there is another approach that beats mine.
Approach 1:
Serve them in round-robin fashion with a time slice of 5. However, I need to be more careful when deciding the time slice. It shouldn't be too high or too low. So I thought of selecting the time slice as the average of the serving times.
Approach 2:
Assume jobs are sorted according to the time they take and are stored in an array A[1...n]
First serve A[1] then A[n] then A[2] then A[n-1] and so on.
I can't really decide which will be the more optimal solution for this problem. Am I missing something?
Thanks,
Chander
You can solve this problem by adding a sorting step and improving on your round-robin approach.
First sort the customers based on service time.
Now, instead of just giving each customer a time slice t in a round-robin manner, you can also check whether the customer has less than t/2 time remaining; if so, complete his task now.
So
for each customer in the sorted list, starting from the first:
    serve the customer for time t
    if his remaining time is < t/2, complete his service now
    else move on to the next customer
Let me assume the "total waiting time" is the sum of the time each customer waits before the server finishes serving him/her, and assume the customers are served in order of increasing i. Then customer C1 waits t1 minutes, customer C2 waits t1+t2 minutes, customer C3 waits t1+t2+t3 minutes, and ... customer Cn waits t1+t2+...+t{n-1}+tn minutes.
or:
C1 waits: t1
C2 waits: t1+t2
C3 waits: t1+t2+t3
...
Cn waits: t1+t2+t3+...tn
The total waiting time adds up to n*t1 + (n-1)*t2 + ... + 1*tn.
Again, this is based on the assumption that the customers are served in order of increasing i.
Now, which customer do you want to serve first?
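The shortest one: in the sum n*t1 + (n-1)*t2 + ... + 1*tn, the first job is counted n times, so the smallest t should get the largest multiplier; serving in increasing order of service time minimizes the total (a standard exchange argument makes this rigorous). A quick sketch to check it numerically:

    def total_waiting_time(times):
        # customer i waits t1 + ... + ti (including their own service time)
        waited, elapsed = 0, 0
        for t in times:
            elapsed += t
            waited += elapsed
        return waited

    times = [3, 1, 2]
    print(total_waiting_time(times))           # 3 + 4 + 6 = 13 (original order)
    print(total_waiting_time(sorted(times)))   # 1 + 3 + 6 = 10 (shortest first)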

Algorithm to find top 10 search terms

I'm currently preparing for an interview, and it reminded me of a question I was once asked in a previous interview that went something like this:
"You have been asked to design some software to continuously display the top 10 search terms on Google. You are given access to a feed that provides an endless real-time stream of search terms currently being searched on Google. Describe what algorithm and data structures you would use to implement this. You are to design two variations:
(i) Display the top 10 search terms of all time (i.e. since you started reading the feed).
(ii) Display only the top 10 search terms for the past month, updated hourly.
You can use an approximation to obtain the top 10 list, but you must justify your choices."
I bombed in this interview and still have really no idea how to implement this.
The first part asks for the 10 most frequent items in a continuously growing sub-sequence of an infinite list. I looked into selection algorithms, but couldn't find any online versions to solve this problem.
The second part uses a finite list, but due to the large amount of data being processed, you can't really store the whole month of search terms in memory and calculate a histogram every hour.
The problem is made more difficult by the fact that the top 10 list is being continuously updated, so somehow you need to be calculating your top 10 over a sliding window.
Any ideas?
Frequency Estimation Overview
There are some well-known algorithms that can provide frequency estimates for such a stream using a fixed amount of storage. One is Frequent, by Misra and Gries (1982). From a list of n items, it finds all items that occur more than n / k times, using k - 1 counters. This is a generalization of Boyer and Moore's Majority algorithm (Fischer-Salzberg, 1982), where k is 2. Manku and Motwani's LossyCounting (2002) and Metwally's SpaceSaving (2005) algorithms have similar space requirements, but can provide more accurate estimates under certain conditions.
The important thing to remember is that these algorithms can only provide frequency estimates. Specifically, the Misra-Gries estimate can under-count the actual frequency by (n / k) items.
Suppose that you had an algorithm that could positively identify an item only if it occurs more than 50% of the time. Feed this algorithm a stream of N distinct items, and then add another N - 1 copies of one item, x, for a total of 2N - 1 items. If the algorithm tells you that x exceeds 50% of the total, it must have been in the first stream; if it doesn't, x wasn't in the initial stream. In order for the algorithm to make this determination, it must store the initial stream (or some summary proportional to its length)! So, we can prove to ourselves that the space required by such an "exact" algorithm would be Ω(N).
Instead, the frequency algorithms described here provide an estimate, identifying any item that exceeds the threshold, along with some items that fall below it by a certain margin. For example, the Majority algorithm, using a single counter, will always give a result; if any item exceeds 50% of the stream, it will be found. But it might also give you an item that occurs only once. You wouldn't know without making a second pass over the data (using, again, a single counter, but looking only for that item).
The Frequent Algorithm
Here's a simple description of Misra-Gries' Frequent algorithm. Demaine (2002) and others have optimized the algorithm, but this gives you the gist.
Specify the threshold fraction, 1 / k; any item that occurs more than n / k times will be found. Create an empty map (like a red-black tree); the keys will be search terms, and the values will be a counter for that term.
Look at each item in the stream.
If the term exists in the map, increment the associated counter.
Otherwise, if the map has fewer than k - 1 entries, add the term to the map with a counter of one.
However, if the map has k - 1 entries already, decrement the counter in every entry. If any counter reaches zero during this process, remove it from the map.
Note that you can process an infinite amount of data with a fixed amount of storage (just the fixed-size map). The amount of storage required depends only on the threshold of interest, and the size of the stream does not matter.
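A sketch of the algorithm in Python (the stream and k are illustrative):

    def misra_gries(stream, k):
        # at most k - 1 counters; any item occurring more than n/k times
        # survives, though its count may be under-estimated by up to n/k
        counters = {}
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < k - 1:
                counters[item] = 1
            else:
                for key in list(counters):       # decrement every counter
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters

    stream = ["cat", "dog", "cat", "bird", "cat", "dog", "cat"]
    print(misra_gries(stream, k=3))   # {'cat': 3, 'dog': 1}; "cat" (4 > 7/3) survives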
Counting Searches
In this context, perhaps you buffer one hour of searches and perform this process on that hour's data. If you can take a second pass over this hour's search log, you can get an exact count of occurrences of the top "candidates" identified in the first pass. Or maybe it's okay to make a single pass and report all the candidates, knowing that any item that should be there is included, and any extras are just noise that will disappear in the next hour.
Any candidates that really do exceed the threshold of interest get stored as a summary. Keep a month's worth of these summaries, throwing away the oldest each hour, and you would have a good approximation of the most common search terms.
Well, looks like an awful lot of data, with a perhaps prohibitive cost to store all frequencies. When the amount of data is so large that we cannot hope to store it all, we enter the domain of data stream algorithms.
Useful book in this area:
Muthukrishnan - "Data Streams: Algorithms and Applications"
Closely related reference to the problem at hand which I picked from the above:
Manku, Motwani - "Approximate Frequency Counts over Data Streams" [pdf]
By the way, Motwani, of Stanford, was an author of the very important "Randomized Algorithms" book. The 11th chapter of this book deals with this problem. Edit: Sorry, bad reference; that particular chapter is on a different problem. After checking, I instead recommend section 5.1.2 of Muthukrishnan's book, available online.
Heh, nice interview question.
This is one of the research projects I am currently working on. The requirement is almost exactly the same as yours, and we have developed nice algorithms to solve the problem.
The Input
The input is an endless stream of English words or phrases (we refer to them as tokens).
The Output
Output the top N tokens we have seen so far (from all the tokens we have seen!).
Output the top N tokens in a historical window, say, the last day or the last week.
An application of this research is to find the hot topics or topic trends on Twitter or Facebook. We have a crawler that crawls the website and generates a stream of words, which feeds into the system. The system then outputs the words or phrases of top frequency, either overall or historically. Imagine that in the last couple of weeks the phrase "World Cup" would appear many times on Twitter. So would "Paul the octopus". :)
String into Integers
The system has an integer ID for each word. Though there are almost infinitely many possible words on the Internet, after accumulating a large set of words the possibility of finding new words becomes lower and lower. We have already found 4 million different words and assigned a unique ID to each. This whole set of data can be loaded into memory as a hash table, consuming roughly 300MB of memory. (We have implemented our own hash table; Java's implementation carries a huge memory overhead.)
Each phrase then can be identified as an array of integers.
This is important, because sorting and comparisons on integers are much, much faster than on strings.
Archive Data
The system keeps archive data for every token. Basically it's pairs of (Token, Frequency). However, the table that stores the data would be so huge that we have to partition it physically. One partition scheme is based on ngrams of the token. If the token is a single word, it is a 1gram. If the token is a two-word phrase, it is a 2gram. And this goes on. At roughly 4grams we have 1 billion records, with the table sized at around 60GB.
Processing Incoming Streams
The system absorbs incoming sentences until memory is fully utilized (yes, we need a MemoryManager). After taking N sentences and storing them in memory, the system pauses and starts tokenizing each sentence into words and phrases. Each token (word or phrase) is counted.
For highly frequent tokens, they are always kept in memory. For less frequent tokens, they are sorted based on IDs (remember we translate the String into an array of integers), and serialized into a disk file.
(However, for your problem, since you are counting only words, you can keep the whole word-frequency map in memory. A carefully designed data structure would consume only 300MB of memory for 4 million different words; a hint: use ASCII chars to represent strings. This is quite acceptable.)
Meanwhile, there is another process that activates whenever it finds a disk file generated by the system and starts merging it. Since the disk files are sorted, merging is a similar process to merge sort. Some design care is needed here as well, since we want to avoid too many random disk seeks. The idea is to avoid reading (merge process) and writing (system output) at the same time, and to let the merge process read from one disk while writing to a different disk. This is similar to implementing locking.
End of Day
At the end of the day, the system has many frequent tokens with their frequencies stored in memory, and many other less frequent tokens stored in several disk files (each file sorted).
The system flushes the in-memory map into a disk file (sorted). Now the problem becomes merging a set of sorted disk files. Using a similar process, we get one sorted disk file at the end.
Then, the final task is to merge the sorted disk file into archive database.
Depending on the size of the archive database, the algorithm below works well if it is big enough:
for each record in sorted disk file
    update archive database by increasing frequency
    if rowcount == 0 then put the record into a list
end for

for each record in the list (those with rowcount == 0)
    insert into archive database
end for
The intuition is that after some time the number of inserts becomes smaller and smaller; more and more operations are updates only, and updates are not penalized by the index.
Hope this entire explanation would help. :)
You could use a hash table combined with a binary search tree. Implement a <search term, count> dictionary which tells you how many times each search term has been searched for.
Obviously iterating the entire hash table every hour to get the top 10 is very bad. But this is Google we're talking about, so you can assume that the top ten will all get, say, over 10 000 hits (it's probably a much larger number though). So every time a search term's count exceeds 10 000, insert it into the BST. Then every hour, you only have to get the first 10 from the BST, which should contain relatively few entries.
This solves the problem of top-10-of-all-time.
The really tricky part is dealing with one term taking another's place in the monthly report (for example, "stack overflow" might have 50 000 hits for the past two months, but only 10 000 the past month, while "amazon" might have 40 000 for the past two months but 30 000 for the past month. You want "amazon" to come before "stack overflow" in your monthly report). To do this, I would store, for all major (above 10 000 all-time searches) search terms, a 30-day list that tells you how many times that term was searched for on each day. The list would work like a FIFO queue: you remove the first day and insert a new one each day (or each hour, but then you might need to store more information, which means more memory / space. If memory is not a problem do it, otherwise go for that "approximation" they're talking about).
This looks like a good start. You can then worry about pruning the terms that have > 10 000 hits but haven't had many in a long while and stuff like that.
case i)
Maintain a hashtable for all the search terms, as well as a sorted top-ten list separate from the hashtable. Whenever a search occurs, increment the appropriate item in the hashtable and check to see if that item should now be switched with the 10th item in the top-ten list.
O(1) lookup for the top-ten list, and max O(log(n)) insertion into the hashtable (assuming collisions managed by a self-balancing binary tree).
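A minimal sketch of case i), with a plain dict standing in for the hashtable and a small sorted list for the top ten (re-sorting 10 elements is effectively O(1); names are illustrative):

    counts = {}
    top10 = []   # (count, term) pairs, ascending, at most 10 entries

    def record_search(term):
        counts[term] = counts.get(term, 0) + 1
        c = counts[term]
        idx = next((i for i, (_, t) in enumerate(top10) if t == term), None)
        if idx is not None:
            top10[idx] = (c, term)          # already in the list: bump its count
        elif len(top10) < 10:
            top10.append((c, term))         # list not full yet
        elif c > top10[0][0]:
            top10[0] = (c, term)            # beats the current 10th place
        else:
            return
        top10.sort()                        # 10 elements: effectively O(1)

    for t in ["a", "b", "a", "c", "a", "b"]:
        record_search(t)
    print([t for _, t in reversed(top10)])  # ['a', 'b', 'c']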
case ii)
Instead of maintaining a huge hashtable and a small list, we maintain a hashtable and a sorted list of all items. Whenever a search is made, that term is incremented in the hashtable, and in the sorted list the term can be checked to see if it should switch with the term after it. A self-balancing binary tree could work well for this, as we also need to be able to query it quickly (more on this later).
In addition we also maintain a list of 'hours' in the form of a FIFO list (queue). Each 'hour' element would contain a list of all searches done within that particular hour. So for example, our list of hours might look like this:
Time: 0 hours
-Search Terms:
-free stuff: 56
-funny pics: 321
-stackoverflow: 1234
Time: 1 hour
-Search Terms:
-ebay: 12
-funny pics: 1
-stackoverflow: 522
-BP sucks: 92
Then, every hour: if the list is at least 720 hours long (that's the number of hours in 30 days), look at the first element in the list, and for each search term, decrement that element in the hashtable by the appropriate amount. Afterwards, delete that first hour element from the list.
So let's say we're at hour 721, and we're ready to look at the first hour in our list (above). We'd decrement free stuff by 56 in the hashtable, funny pics by 321, etc., and would then remove hour 0 from the list completely since we will never need to look at it again.
The reason we maintain a sorted list of all terms that allows for fast queries is that every hour, as we go through the search terms from 720 hours ago, we need to ensure the top-ten list remains sorted. So as we decrement 'free stuff' by 56 in the hashtable, for example, we check to see where it now belongs in the list. Because it's a self-balancing binary tree, all of that can be accomplished nicely in O(log(n)) time.
Edit: Sacrificing accuracy for space...
It might be useful to also implement a big list in the first one, as in the second one. Then we could apply the following space optimization on both cases: Run a cron job to remove all but the top x items in the list. This would keep the space requirement down (and as a result make queries on the list faster). Of course, it would result in an approximate result, but this is allowed. x could be calculated before deploying the application based on available memory, and adjusted dynamically if more memory becomes available.
Rough thinking...
For top 10 all time
Using a hash collection where a count for each term is stored (sanitize terms, etc.)
A sorted array which contains the ongoing top 10; a term/count is added to this array whenever the count of a term becomes equal to or greater than the smallest count in the array
For monthly top 10 updated hourly:
Use an array indexed on the number of hours elapsed since start, modulo 744 (the number of hours in a month), whose entries consist of hash collections where a count for each term encountered during that hour-slot is stored. An entry is reset whenever the hour-slot counter changes.
The stats in the hour-slot array need to be collected whenever the current hour-slot counter changes (at most once an hour), by copying and flattening the content of this array indexed on hour-slots.
Errr... make sense? I didn't think this through as I would in real life
Ah yes, forgot to mention, the hourly "copying/flattening" required for the monthly stats can actually reuse the same code used for the top 10 of all time, a nice side effect.
Exact solution
First, a solution that guarantees correct results, but requires a lot of memory (a big map).
"All-time" variant
Maintain a hash map with queries as keys and their counts as values. Additionally, keep a list of the 10 most frequent queries so far and the count of the 10th most frequent one (a threshold).
Constantly update the map as the stream of queries is read. Every time a count exceeds the current threshold, do the following: remove the 10th query from the "Top 10" list, replace it with the query you've just updated, and update the threshold as well.
"Past month" variant
Keep the same "Top 10" list and update it the same way as above. Also, keep a similar map, but this time store vectors of 30*24 = 720 count (one for each hour) as values. Every hour do the following for every key: remove the oldest counter from the vector add a new one (initialized to 0) at the end. Remove the key from the map if the vector is all-zero. Also, every hour you have to calculate the "Top 10" list from scratch.
Note: Yes, this time we're storing 720 integers instead of one, but there are much less keys (the all-time variant has a really long tail).
Approximations
These approximations do not guarantee the correct solution, but are less memory-consuming.
Process every N-th query, skipping the rest.
(For the all-time variant only) Keep at most M key-value pairs in the map (M should be as big as you can afford). It's a kind of LRU cache: every time you read a query that is not in the map, remove the least recently used query with count 1 and replace it with the currently processed query.
Top 10 search terms for the past month
Using a memory-efficient indexing data structure, such as tightly packed tries (see the Wikipedia entries on tries), approximately defines some relation between memory requirements and n, the number of terms.
In case the required memory is available (assumption 1), you can keep exact monthly statistics and aggregate them every month into all-time statistics.
There is also an assumption here that interprets the 'last month' as a fixed window.
But even if the monthly window is sliding, the above procedure shows the principle (sliding can be approximated with fixed windows of a given size).
This reminds me of a round-robin database, with the exception that some stats are calculated on 'all time' (in the sense that not all data is retained; rrd consolidates time periods, disregarding details, by averaging, summing up, or choosing max/min values; in the given task the detail that is lost is the information on low-frequency items, which can introduce errors).
Assumption 1
If we can not hold perfect stats for the whole month, then we should be able to find a certain period P for which we should be able to hold perfect stats.
For example, assuming we have perfect statistics on some time period P, which goes into month n times.
Perfect stats define a function f(search_term) -> search_term_occurrence.
If we can keep all n perfect stat tables in memory then sliding monthly stats can be calculated like this:
add stats for the newest period
remove stats for the oldest period (so we have to keep n perfect stat tables)
However, if we keep only the top 10 on the aggregated (monthly) level, then we will be able to discard a lot of data from the full stats of the fixed periods. This already gives a working procedure which has fixed memory requirements (assuming an upper bound on the perfect stat table for period P).
The problem with the above procedure is that if we keep info on only top 10 terms for a sliding window (similarly for all time), then the stats are going to be correct for search terms that peak in a period, but might not see the stats for search terms that trickle in constantly over time.
This can be offset by keeping info on more than top 10 terms, for example top 100 terms, hoping that top 10 will be correct.
I think that further analysis could relate the minimum number of occurrences required for an entry to become a part of the stats (which is related to maximum error).
(In deciding which entries should become part of the stats one could also monitor and track the trends; for example if a linear extrapolation of the occurrences in each period P for each term tells you that the term will become significant in a month or two you might already start tracking it. Similar principle applies for removing the search term from the tracked pool.)
The worst case for the above is when you have a lot of almost equally frequent terms and they change all the time (for example, if tracking only 100 terms, and the top 150 terms occur almost equally frequently, but the top 50 are more frequent in the first month and less frequent some time later, then the statistics would not be kept correctly).
Also, there could be another approach that is not fixed in memory size (well, strictly speaking neither is the above), which would define the minimum significance in terms of occurrences per period (day, month, year, all-time) for which to keep the stats. This could guarantee a maximum error in each of the stats during aggregation (see round robin again).
What about an adaptation of the "clock page replacement algorithm" (also known as "second chance")? I can imagine it working very well if the search requests are distributed evenly (meaning most searched terms appear regularly rather than 5 million times in a row and then never again).
The problem is not universally solvable when you have a fixed amount of memory and an 'infinite' (think very very large) stream of tokens.
A rough explanation...
To see why, consider a token stream that has a particular token (i.e., word) T every N tokens in the input stream.
Also, assume that the memory can hold references (word id and counts) to at most M tokens.
With these conditions, it is possible to construct an input stream where the token T will never be detected if N is large enough that the stream contains M different tokens between T's.
This is independent of the top-N algorithm details. It only depends on the limit M.
To see why this is true, consider an incoming stream made up of groups, each group being the token T followed by M distinct filler tokens:
T a1 a2 a3 ... a-M T b1 b2 b3 ... b-M ...
where the a's, and b's are all valid tokens not equal to T.
Notice that in this stream, T appears twice while each a-i and b-i appears only once. Yet T appears rarely enough locally to be flushed from the system.
Starting with an empty memory, the first token (T) will take up a slot in the memory (bounded by M). Then a1 will consume a slot, and so on up to a-(M-1), at which point the M slots are exhausted.
When a-M arrives, the algorithm has to drop one symbol, so let it be the T.
The next symbol will be b-1, which will cause a-1 to be flushed, etc.
So, the T will not stay memory-resident long enough to build up a real count. In short, any algorithm will miss a token of low enough local frequency but high global frequency (over the length of the stream).
Store the count of search terms in a giant hash table, where each new search causes a particular element to be incremented by one. Keep track of the top 20 or so search terms; when the element in 11th place is incremented, check if it needs to swap positions with #10* (it's not necessary to keep the top 10 sorted; all you care about is drawing the distinction between 10th and 11th).
*Similar checks need to be made to see if a new search term is in 11th place, so this algorithm bubbles down to other search terms too -- so I'm simplifying a bit.
sometimes the best answer is "I don't know".
I'll take a deeper stab. My first instinct would be to feed the results into a queue (Q). A process would continually process items coming into the Q. The process would maintain a map of
term -> count
each time a Q item is processed, you simply look up the search term and increment the count.
At the same time, I would maintain a list of references to the top 10 entries in the map.
For the entry that was just incremented, see if its count is greater than the count of the smallest entry in the top 10 (if it's not in the list already). If it is, replace the smallest with the entry.
I think that would work. No operation is time intensive. You would have to find a way to manage the size of the count map, but that should be good enough for an interview answer.
They are not expecting a solution; they want to see if you can think. You don't have to write the solution then and there....
One way is that for every search, you store that search term and its time stamp. That way, finding the top ten for any period of time is simply a matter of comparing all search terms within the given time period.
The algorithm is simple, but the drawback would be greater memory and time consumption.
What about using a Splay Tree with 10 nodes? Each time you try to access a value (search term) that is not contained in the tree, throw out any leaf, insert the value instead and access it.
The idea behind this is the same as in my other answer. Under the assumption that the search terms are accessed evenly/regularly this solution should perform very well.
edit
One could also store some more search terms in the tree (the same goes for the solution I suggest in my other answer) in order to not delete a node that might be accessed very soon again. The more values one stores in it, the better the results.
Dunno if I understand it right or not.
My solution is using heap.
Since we want the top 10 search items, I build a heap of size 10.
Then update this heap with each new search. If a new search's frequency is greater than the heap's top (max heap), update it; abandon the one with the smallest frequency.
But how to calculate the frequency of a specific search will depend on something else.
Maybe as everyone stated, the data stream algorithm....
Use a cm-sketch to store the counts of all searches since the beginning, and keep a min-heap of size 10 alongside it for the top 10.
For the monthly result, keep 30 cm-sketches/hash tables, each with its own min-heap, one counting and updating for each of the last 30, 29, ..., 1 days. As a day passes, clear the oldest and reuse it as day 1.
Same for the hourly result: keep 60 hash tables, each with a min-heap, counting for the last 60, 59, ..., 1 minutes. As a minute passes, clear the oldest and reuse it as minute 1.
The monthly result is accurate to within 1 day; the hourly result is accurate to within 1 minute.
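A minimal sketch of the all-time part (a toy count-min sketch plus a size-10 min-heap; widths, hashes, and data are illustrative, and since a sketch cannot enumerate what it has seen, the heap must be fed candidate terms, here gathered in a set for brevity):

    import heapq

    class CountMinSketch:
        def __init__(self, width=2048, depth=4):
            self.width = width
            self.table = [[0] * width for _ in range(depth)]

        def _cols(self, term):
            # one hash per row; Python's tuple hash stands in for real hash functions
            return [hash((row, term)) % self.width for row in range(len(self.table))]

        def add(self, term):
            for row, col in enumerate(self._cols(term)):
                self.table[row][col] += 1

        def estimate(self, term):
            # minimum over rows: may over-estimate, never under-estimates
            return min(self.table[row][col] for row, col in enumerate(self._cols(term)))

    cms = CountMinSketch()
    seen = set()   # candidate terms; in practice the heap is updated as terms stream in
    for term in ["obama"] * 5 + ["weather"] * 3 + ["python"]:
        cms.add(term)
        seen.add(term)

    heap = []   # size-10 min-heap over the estimates: push, then evict the smallest
    for term in seen:
        heapq.heappush(heap, (cms.estimate(term), term))
        if len(heap) > 10:
            heapq.heappop(heap)
    print(sorted(heap, reverse=True))   # [(5, 'obama'), (3, 'weather'), (1, 'python')]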

Which data structure(s) to back a Final Fantasy ATB-style queue? (a delay queue)

Situation: There are several entities in a simulated environment, which has an artificial notion of time called "ticks", which has no link to real time. Each entity takes it in turns to move, but some are faster than others. This is expressed by a delay, in ticks. So entity A might have a delay of 10, and B 25. In this case the turn order would go:
A A B A A
I'm wondering what data structure to use. At first I automatically thought "priority queue", but the delays are relative to "current time", which complicates matters. Also, there will be entities with larger delays, and it's not unforeseeable that the program will run through millions of ticks. It seems silly for an internal counter to be building higher and higher when the delays themselves stay relatively small and don't increase.
So how would you solve this?
You store the entities in a Heap and group them by their time left to wait. The group of entities that are next to move would be at the top of the Heap. You only have to update these entities. When their time remaining to wait drops to 0, you remove them from the heap. Put the next group of entities in line at the top of the Heap while decrementing their time to wait by the amount of time that just passed before the previous move.
For example:
Your Heap has 3 nodes (A, B, and C); the top is node A with two entities, both having 5 ticks remaining. The children have 10 and 12 ticks remaining, respectively.
At time t=5 you move all the entities that are bucketed in node A
Remove A from the Heap
B moves to the top of the Heap with 10-5 = 5 ticks remaining, then repeat.
It seems to me by your description that the concept of "what's next?" is more important than "how long is it until the next action?". This being the case, sort your queue by "next-to-go", i.e. lowest number of ticks remaining to highest. Inserts, of course, get entered in their appropriate order, and altered entries ("speed up" spells) get removed from the queue, altered, and then re-entered appropriately.
Then, you just pop the next job off the queue. Whatever number of ticks it had remaining must be the "time elapsed". Make a pass over the queue, decrementing the ticks-remaining field of each entry by the number of ticks you just discovered.
This has the advantage of keeping track of the concept of time remaining, but also of not having to fire events or execute any other code for ticks that go by when there is no action to take. You can afford this, since there is no relation to real time at all. There is only "what's next?", and "How long did it take to get there?".
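A sketch of that queue in Python, assuming integer ticks (the example reproduces the A A B A A order from the question):

    import heapq

    class TurnQueue:
        def __init__(self):
            self.entries = []                    # (ticks_remaining, entity)

        def push(self, entity, delay):
            heapq.heappush(self.entries, (delay, entity))

        def pop_next(self):
            # the entity with the fewest ticks remaining acts next; that many
            # ticks have "elapsed", so subtract them from everyone else
            elapsed, entity = heapq.heappop(self.entries)
            self.entries = [(t - elapsed, e) for t, e in self.entries]
            heapq.heapify(self.entries)
            return entity, elapsed

    q = TurnQueue()
    q.push("A", 10)
    q.push("B", 25)
    order = []
    for _ in range(5):
        entity, _ = q.pop_next()
        order.append(entity)
        q.push(entity, 10 if entity == "A" else 25)  # re-enqueue with its delay
    print(" ".join(order))   # A A B A A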
If we assume your entities are observing or watching the simulation time, they could each implement an interface which makes them track the ticks left and provides a method to get how many ticks are left for a particular entity. At each tick, the entity reduces its ticks left by 1.
You could then keep a sorted set queue (set because each entity will be in the queue only once) of these entities, sorted based on get ticks left, so that the 0th entity is the one to move next, and the Nth entity is the "slowest".
When the entity's get ticks left method is 0, it is removed from the sorted set, the ticks left timer is reset, and it is re-inserted into the sorted set.
Option #1: Polling
I would probably build a controller that can discover the delay for all the different entities and maintain a ticks-remaining for each entity. The controller would cycle through ticks and on each tick it would reduce the ticks-remaining for all entities in the game.
Once an entity's ticks-remaining value reaches zero, you know it's its turn, either controlled by the heartbeat method that handles the ticks or by a method that you call.
Option #2 Events
Think of the UI paradigm: the interface doesn't constantly poll the button to see when it is clicked. Rather, it lets the button notify the UI via events when it has been clicked. Have your Entities (or an EntityBattleContext) fire an event when they are ready. You will have to handle your game time in some manner; since it isn't based on real-world time at all, you may need to have all the Entities listen for a GameTick event and, when they receive that event, update their internal TicksRemaining variable.
Before following the event-driven route, make sure the polling route won't work. Remember the cardinal rule: optimize later, because more often than not you don't need the optimization.
Look at how Java's DelayQueue is implemented.
