Is there a probabilistic data structure for storing relations? - algorithm

I have a database with user subscriptions to topics.
There are currently about 20,000 topics,
20 million users and 200 million subscriptions stored in a SQL database.
Because of its size, the database is partitioned by topics,
so I can't get the info in one database query.
There are a couple of topics with 10 million subscriptions, a couple with 100,000, and the others have hundreds or fewer.
When an event occurs, it usually matches a couple of topics, so to inform users I need to perform a query like "give me all users subscribed to topics x, y, z and take the union of the sets", so that a user gets the news only once even if they subscribed to both topics x and z.
The constraints are:
There must be no duplicates in the union set. (users can't get the content twice)
There can be a bounded number of users missing from the union set. (If a user sometimes doesn't get the content, it is not that bad, but it can't always be the same user for the same topic.)
It is possible to subscribe to new topic without rebuilding whole thing.
I thought about using a set of Bloom filters, one per topic, but their constraints are the other way around: "user is either not subscribed for sure, or probably subscribed". I need something like "user is subscribed for sure, or probably not subscribed".
Lossy hash tables might be a good idea, but I am not sure whether they can be as memory efficient as Bloom filters, and I am afraid it would always be the same user missing the content in his topic.
Do you know any other data structures that may be good for solving this problem?

What if each user record had a BIT FIELD representing all of the topics.
TABLE Users(ID INT, UserName VARCHAR(16), Topics BINARY(8000))
A binary 8k would allow you to have 64000 topics. I would probably use multiple columns of BINARY(1024) each so I could add more topics easily.
Now when an event comes in that's tagged for topics 1, 10, 20, 30, 40.
I have to search every user, but this can be parallelized and will always be O(N) where N is the total number of users.
SELECT ID
FROM Users WITH (READPAST)
WHERE
    -- SUBSTRING is 1-based, hence the "/ 8 + 1" byte offset for each topic bit
    SUBSTRING(Topics, 1 / 8 + 1, 1) & POWER(2, 1 % 8) > 0
    OR SUBSTRING(Topics, 10 / 8 + 1, 1) & POWER(2, 10 % 8) > 0
    OR SUBSTRING(Topics, 20 / 8 + 1, 1) & POWER(2, 20 % 8) > 0
    OR SUBSTRING(Topics, 30 / 8 + 1, 1) & POWER(2, 30 % 8) > 0
    OR SUBSTRING(Topics, 40 / 8 + 1, 1) & POWER(2, 40 % 8) > 0
OPTION (MAXDOP = 64)
No duplicates: we're scanning Users once, so we don't have to worry about unions.
Some users missing: the READPAST hint will skip any rows that are currently locked (being updated), so some users may be missing from the result.
Subscribe: you can [un]subscribe to a topic simply by toggling the topic's bit in the Topics column.

As I said in comments, a memory-based exact solution is certainly feasible.
But if you really want an approximate data structure, then what you're looking for is a size-limited set (of users for each topic) with random eviction.
You also need to compute unions rapidly on the fly when queries arrive. There's no helpful pre-computation here. If topic sets tend to repeat, you can look at caching the frequently used unions.
All the usual methods of representing a set apply. Hash tables (both closed and open), trees, and skip lists (all containing user id keys; no values required) are most likely.
If you use a closed hash table with a good hash function, pseudo-random eviction happens automatically: on a collision, just replace the previous value. The problem with closed hashes is always picking a good table size for the set you need to represent. Remember that to recover the set elements, you'll have to traverse the whole table including null entries, so starting with a huge table isn't a good idea; rather start with a small one and reorganize, growing by a factor each time so that reorganization amortizes to constant time overhead per element stored.
With the other schemes, you can literally do pseudo-random eviction when the table gets too big. The easiest way to evict fairly is to store the user ids in a table and have the size-limited set store indices. Evict by generating a random index into the table and removing that id before adding a new one.
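For illustration, a minimal Python sketch of such a size-limited set with pseudo-random eviction (class and method names are mine, not from the answer): the ids live in a plain list so a uniformly random victim can be picked in O(1), and a dict maps each id to its list slot so membership tests and removals stay O(1).

import random

class SizeLimitedSet:
    """Sketch: a set of user ids capped at max_size, evicting a random member."""
    def __init__(self, max_size):
        self.max_size = max_size
        self.items = []    # user ids, in arbitrary order
        self.index = {}    # user id -> position in self.items

    def add(self, user_id):
        if user_id in self.index:
            return
        if len(self.items) >= self.max_size:
            self._evict_random()
        self.index[user_id] = len(self.items)
        self.items.append(user_id)

    def _evict_random(self):
        victim_pos = random.randrange(len(self.items))
        victim = self.items[victim_pos]
        last = self.items.pop()              # shrink the list by one
        if victim_pos < len(self.items):
            # the victim wasn't the last element: move the last one into its slot
            self.items[victim_pos] = last
            self.index[last] = victim_pos
        del self.index[victim]

    def __contains__(self, user_id):
        return user_id in self.index

    def members(self):
        return list(self.items)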
It's also possible to evict fairly from a BST set representation by using an order statistic tree: store the number of descendants in each node. Then you can always find the n'th element in key sorted order, where n is pseudo-random, and evict it.
I know you were looking for the bitwise space efficiency of a Bloom filter, but guaranteeing no false positives seems to rule that out.

This might not be the solution you were looking for, but you could utilize ElasticSearch's terms filter and have one document like this for each user:
{
  "id": 12345,
  "topics": ["Apache", "GitHub", "Programming"]
}
A terms filter directly answers the query "which users subscribe to at least one of these topics", and ES is very smart about how to cache and re-use filters.
It wouldn't be a probabilistic data structure, but it would solve this problem very efficiently. You'd need to use the scan API for retrieving the potentially large JSON responses. If necessary you can scale this solution to billions of users spread over multiple computers and have response times around 10-100 milliseconds. You could also find correlations between topics (significant terms aggregation) and use ES as an engine for further analysis.
Edit: I implemented searching and scan/scroll API usage in Python and got some interesting results. I ran "users who subscribe to any of these three topics" queries against that 20M-user, 200M-subscription dataset, and in general the search itself finishes in 4-8 milliseconds. Queries return 350,000-750,000 users.
Problems arise from getting the user ids out of ES, even with the scan/scroll API. On a Core i5 I seem to get only 8,200 users/second, so that's less than 0.5 million/minute (with "_source": false). The query itself looks like this:
{
  "filtered": {
    "filter": {
      "terms": {
        "topics": [ 123, 234, 345 ],
        "execution": "plain",
        "_cache": false
      }
    }
  }
}
In production I would use "execution": "bool" so that partial query results can be cached and re-used by other queries. I don't know what the bottleneck is in getting the results out; the server's CPU usage is 50% and I run the client's Python script on the same machine, utilizing elasticsearch.helpers.scan.
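For what it's worth, a rough Python sketch of how the scan helper mentioned above might be driven; the index name and the assumption that we only need the document ids are mine, and the filtered-query syntax is the old ES 1.x form used in the example:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch()

def subscribed_user_ids(topic_ids, index="users"):
    # "_source": False keeps the responses small, since only the ids are needed
    body = {
        "_source": False,
        "query": {
            "filtered": {
                "filter": {
                    "terms": {"topics": topic_ids, "execution": "plain"}
                }
            }
        }
    }
    for hit in scan(es, query=body, index=index):
        yield hit["_id"]

recipients = set(subscribed_user_ids([123, 234, 345]))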

[This solution is similar to Louis Ricci's, except inverted to the Topics table - which could make subscription updates less practical, be warned!]
(The probabilistic data structure approach is cool, but unnecessary for your current data size. I was initially looking at compressed bitsets for a non-probabilistic solution, as they are great at performing set operations in memory, but I think that's overkill as well. Here is a good implementation for this type of use case, if you're interested.)
But looking at the sparsity of your data, bitsets waste space compared to plain integer arrays. And even with integer arrays, the union operation is still pretty inexpensive given that you only have an average of 10,000 subscriptions per topic.
So maybe, just maybe, a dead-simple data-structure given your use-case is simply:
Topic 1 => [array of subscriber IDs]
Topic 2 => [array of subscriber IDs]
...
Topic 20,000 => [array of subscriber IDs]
Storing (on average) 10,000 subscriber IDs (assuming 32-bit integers) only requires about 40 KB of space per topic.
[In an array-type or BLOB, depending on your database]
With 20,000 topics, this adds only 800 MB of data to your topic table ... and very little of it (~200 KB on average) needs to be loaded into memory when a notification event occurs!
Then when an average event (affecting 5 topics) occurs, all that needs to happen is:
Query / Pull the data for the relevant topics (avg 5 records) into memory
(avg ~200kb of I/O)
Dump them into a Set data structure (de-duplicates subscriber list)
Alert the subscribers in the Set.
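A minimal sketch of that flow in Python, assuming a hypothetical fetch_subscriber_array() that pulls one topic's array/BLOB column from the database:

def users_to_notify(event_topic_ids, fetch_subscriber_array):
    recipients = set()                    # a set de-duplicates automatically
    for topic_id in event_topic_ids:      # avg ~5 topics per event, ~200 KB of I/O
        recipients.update(fetch_subscriber_array(topic_id))
    return recipients

# e.g. users_to_notify([1, 10, 20, 30, 40], fetch_subscriber_array)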

Related

How to setup CosmosDB when need to search for "like" in string tags

I have a 3 tables structure, Customer, Invoice, InvoiceItem that I would like to try to move from the relational DB and store it in CosmosDB. Currently, there are quite intensive queries being run on the InvoiceItem table. This InvoiceItem table has up to 10 optional TagX columns that are basically text that might include the brand, group, type, or something that would group this InvoiceItem and make it searchable by saying (simplified):
SELECT * FROM InvoiceItem WHERE Tag1 LIKE '%shirt%' AND Tag2 LIKE '%training%'
A query like this on a multi-million-row table can take more than 8 minutes. We are working on an archiving strategy and indexes to speed up the process, but it looked to me like CosmosDB could be worth trying in this case, since all of the data is a write-once-read-many scenario.
Back to CosmosDB: how do I deal with those string tags in CosmosDB? As a start, I thought about having Invoice and InvoiceItem in the same partition with a "type" property to tell them apart. But then I cannot put the tags anywhere that would make them easily searchable. Any ideas on how to set this up?
Thanks!
Textbook database performance issue, caused by either a lack of indexing or inefficient indexing.
With that many rows, index cardinality becomes important. You don't want to index the entire field; you only want to index the first n characters of the columns you're indexing, and only index columns you are searching on, whether by joins or direct WHERE clauses.
The idea is to keep the indexes as small as possible, while still giving you the query performance you need.
With 18 million rows you probably want to start with an index cardinality around the square root of 18 million.
That means that to hit the index segment you need, you only want to search no more than about 5,000 index rows, each of which has 400-5,000 rows in its segment, at least for sub-second result times.
Indexing the first 3-4 letters would be a good starting point, based on the square root of 18,000,000 being about 4,242, with the nearest power of 26 (26^3 = 17,576, assuming alphabetic characters only) overshooting that. Even if the tags are alphanumeric, 3 characters is still a good starting point.
If the queries then run super fast but the index takes forever to build, drop a character. This is called "index tuning": you pick a starting point and find the shortest prefix (the fewest characters indexed) that still gives you the query performance you need.
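Just to make the square-root arithmetic above concrete, a small Python sketch (the helper name is mine):

import math

def starting_prefix_length(row_count, alphabet_size=26):
    """Smallest prefix length n such that alphabet_size**n reaches ~sqrt(row_count)."""
    target = math.sqrt(row_count)
    n = 1
    while alphabet_size ** n < target:
        n += 1
    return n

print(math.isqrt(18_000_000))              # 4242
print(starting_prefix_length(18_000_000))  # 3, since 26**3 = 17576 overshoots that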
If I'm way off because index performance in this DB is way off the mark of a relational db, you'll need to experiment.
As far as I'm concerned, a select query that takes more than a few seconds is unacceptable, except in rare cases. I once worked for a security company. Their license management system took minutes to pull large customers.
After indexing the tables correctly the largest customer took less than 2 seconds. I had to sift through a table with billions of rows for number of downloads, and some of these queries had 7 joins.
If that database can't do this with 18m rows, I'd seriously consider a migration to a better architecture, hardware, software or otherwise.
As index cardinality increases, the performance gains drop to negative as the index cardinality approaches table cardinality, as compared to no index.
As in all things in life, moderation. At the other end of the spectrum, an index with a cardinality of 2 is just about useless. Half of 8 minutes is 4 minutes, assuming a nearly equal distribution.... useless, so indexing a boolean field isn't a great thing to do, usually. There are few hard and fast rules though. Lots of edge cases. Experimentation is your friend.

Determine if unique number has been seen from a range of numbers

I am trying to find the best, quickest and most efficient way to determine if a number has been seen in a range.
Example:
Key for record: Raffle Event 1 (database key)
Tickets Available: 1 - 1000000 (the range)
Ticket number 4 was turned in. Has it been turned in already for this event?
Ticket number 865401 was turned in. Has it been turned in already for this event?
I've thought about bit masks, storing data as buckets, etc. But none of these seem to be the answer I am trying to find. Maybe it does not exist.
We have 800,000 events, each with 1 million tickets. We are currently storing only the last number turned in, and anything lower is rejected. We want finer granularity, but we need efficiency, and storing each ticket individually would be impractical.
Data is stored using SQL
Any ideas?
EDIT
The best idea I've come up with so far is using a bitmap: have 10 columns for each event, each column storing 100,000 bits. This should allow for quick data retrieval; then just check if the bit is on or off. That is about 1 million bits (roughly 125 KB) of storage per event, or 100,000 bits (~12.5 KB) per column read.
I'm still searching for alternative ideas or recommendations.
You could use a bitmask if you anticipate your range to be moderate. Otherwise you could try a set structure; these can be implemented using binary search trees.
I went with a bitmap. I store a 1/0 for whether each ticket was used or not. I split this bitmap into 16 buckets for storage. That was my magic number, as it made each bucket just under 8 KB, the perfect size for SQL paging.
Each bucket is NULL at first, until needed. This saves space: each event starts out taking up zero space, and we only use ~8 KB for each 62,500-ticket chunk.
It is efficient and does everything I need. I played around with compression to save space, but the worst case for me was a few trillion records totaling 100 GB if all tickets were accounted for (all buckets for every event). This might sound like a lot of space, but at today's costs it's nearly negligible and not worth trying to compress the buckets of data.
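To make the layout concrete, a small Python sketch of the bucket/bit arithmetic (the helper names are mine; the actual SQL storage and reads are up to you):

BITS_PER_BUCKET = 1_000_000 // 16          # 62,500 bits, just under 8 KB per bucket

def locate(ticket_number):
    """Map a 1-based ticket number to (bucket index, byte offset, bit mask)."""
    pos = ticket_number - 1                 # 0-based bit position
    bucket = pos // BITS_PER_BUCKET         # which of the 16 buckets
    byte_index, bit_index = divmod(pos % BITS_PER_BUCKET, 8)
    return bucket, byte_index, 1 << bit_index

def is_turned_in(bucket_blob, byte_index, bit_mask):
    # a NULL (missing) bucket means no ticket in that chunk was turned in yet
    return bucket_blob is not None and bucket_blob[byte_index] & bit_mask != 0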

Optimize returning top results from intersecting Sorted Set and Set in Redis

I am trying to optimize my Redis code and I am currently having some difficulties scaling my solution. Redis is hosted on Redis Cloud, the app on Heroku and I am using Ruby for my language.
My Redis setup:
I have a series of sorted sets, each containing approx 1,000 scored members, and a corresponding set (which could be a hash, string, list, set, sorted set or any other structure) for each user in the system.
For example, the news:sports key has the following structure. Other key examples are news:entertainment and news:business.
news:sports -- Member Score
StoryOID1 1000
StoryOID2 999
StoryOID3 998
StoryOID4 997
StoryOID5 996
...
I also have a user-specific key (a set) for each of the master sorted sets (user1:news:sports) that contains the stories the user has already seen, i.e.
seen:user1:sports StoryOID2
My challenge:
On each user request, I need to return the top 3 members (those with the highest scores, although I don't need to know the scores) from a specific sorted set that the user hasn't already seen. I'd prefer not to save the result in Redis, as it has no long-term use and I would just have to delete the key.
Given the example above, when User1 requests news:sports, I would return:
StoryOID1
StoryOID3
StoryOID4
As part of my code, I loop through 10 sorted sets (10 genres) returning the top 3 unseen stories from each for a total of 30 OIDS being returned for each request.
For each user request:
Do this 10 times:
ZRANGEBYSCORE top 24 members
loop user genres key using SISMEMBER until I return 3 unseen members
end
Benchmarking at 60 dynos (Heroku), I can only get to 500 simultaneous connections (and at 1000 ms response times), with the Redis loop below being the bottleneck. I am aiming to scale to a few multiples of my current setup. Anything and everything can be changed to scale this process.
My Current Process (in Ruby):
def newslist(userOID, genres)
  # $redis: the application's Redis connection
  # pull top 24 stories for the given news:genre -- 24 could be replaced by 1, 3, 6, 12, etc.
  newsscores = $redis.zrevrangebyscore("news:#{genres}", "+inf", "-inf", :limit => [0, 24], :with_scores => true)
  newsstories = Array.new(3)
  i = 0       # news acceptance counter
  loopcnt = 0 # loop counter
  while i < 3
    if loopcnt >= newsscores.count # stop once every returned story has been checked
      break
    end
    seen = $redis.sismember("seen:#{userOID}:#{genres}", newsscores[loopcnt][0])
    if seen == false
      newsstories[i] = newsscores[loopcnt][0]
      i += 1
    end
    loopcnt += 1
  end
  if i == 3
    return newsstories # return 3 unseen stories
  else
    return 0           # return 0 -- this should cause a repick
  end
end
I know I am paying a large cost for the large number of Redis requests. My current thought is to basically translate the above into a Lua script I can run server-side, but I can't help but feel there is a more elegant solution that will scale far better.
Is there a better way?
First off, yes: you should 100% be using Lua for this. Check the CPU on your Redis box; I'm willing to bet it's not burning up. Your bottleneck at this point is almost certainly network round-trips, because you're requiring a back-and-forth call each time you hit SISMEMBER (which is up to 24 times PER USER). That's a lot of unnecessary network activity. It's especially unnecessary when you consider that the logic you're doing on top of SISMEMBER could very easily be done server-side, and there's really no reason to send anything back to your client until you finish the loop completely. That logic applies to the initial ZRANGEBYSCORE of the top 24 members as well. You can literally just translate the entire:
Do this 10 times:
ZRANGEBYSCORE top 24 members
loop user genres key using SISMEMBER until I return 3 unseen members
end
into Lua and go from 250 network hits per user to just 1 per user. That will be a huge, huge win. On top of that, you'll be sending a lot less information to Redis and back the one time you do initiate a Redis call. Here's a Lua sketch that should give you the idea of what you want to be doing:
local genres = {KEYS[1], KEYS[2], KEYS[3], KEYS[4], KEYS[5], KEYS[6], KEYS[7], KEYS[8], KEYS[9], KEYS[10]}
local user_seen_genre_sets = {KEYS[11], KEYS[12], KEYS[13], KEYS[14], KEYS[15], KEYS[16], KEYS[17], KEYS[18], KEYS[19], KEYS[20]}
local user_id = ARGV[1]
local to_return = {{},{},{},{},{},{},{},{},{},{}}
for i = 1, #genres do
  local possible_stories = redis.call('ZREVRANGEBYSCORE', genres[i], 'inf', 0, 'LIMIT', 0, 24)
  -- check SISMEMBER on each candidate against the matching seen-set key and
  -- keep the first 3 unseen stories for this genre
  for _, story in ipairs(possible_stories) do
    if redis.call('SISMEMBER', user_seen_genre_sets[i], story) == 0 then
      table.insert(to_return[i], story)
      if #to_return[i] == 3 then break end
    end
  end
end
return to_return
Why use Lua instead of a pipeline?
Itamar Haber brought up a good point that you might want to use broken-up pipelines for this instead of a single Lua script, because the Lua script might block your Redis server for too long. Here are a few reasons why you want to use a Lua script instead of broken-up pipelines:
I've never seen a single Lua script on Redis take longer than tens of milliseconds, unless it does something like KEYS *. None of the operations mentioned should cost more than O(log n) either, so you're pretty future-proofed in case you're expecting large data growth. If your Redis server is being blocked for too long, that's more an indication that you need a bigger server, because none of the operations you're running are very intense (O(log n) at most, as mentioned).
One of the major benefits of the Lua script is that you send your logic to run server-side instead of shipping a bunch of data back and forth to run the logic client-side (i.e. get all the possible stories and send them to the client, then send each one back to Redis to run a SISMEMBER). All of that data being sent over the network will be a much bigger bottleneck than running more operations within Redis and Lua, both of which are very, very fast.
So, all in all, despite the valid question, I firmly stand by the Lua approach. If you care to run benchmarks and share them with us, that would be really awesome, because I'm guessing switching this over will improve things by around two orders of magnitude.
There is no need to use Lua scripting here, though depending on your data size, Lua versions of this calculation may be faster (depending on SISMEMBER vs. ZUNIONSTORE vs. SDIFFSTORE + ZINTERSTORE performance). Generally, you can calculate everything you want without multiple round-trips and without Lua scripting, provided the following 3 assumptions hold:
Your ZSETs all use positive non-zero scores (it is easier if all scores are >1, I will assume this)
Your SETs all contain the same members as your sorted sets
The maximum score in your ZSETs is fixed, or can at least be bounded (something like 1000000 is perfectly reasonable)
The important operation here is ZUNIONSTORE, which can take SETs as input, and whose members behave as though all scores are 1.
To get the top 3 stories from your news ZSETs, excluding those stories already read by a given user, you can make the following calls:
ZUNIONSTORE temp 2 news:sports seen:user1:sports WEIGHTS 1 -1000000
ZREVRANGEBYSCORE temp inf 0 LIMIT 0 3
DEL temp
You can wrap that in a MULTI/EXEC transaction so you don't have any extra data lying around after the query, pipeline it all, etc. This approach has the limitation that it gets slower as the number of stories read and the number of stories in each category increase.
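As an illustration, here is roughly what that first variant could look like with redis-py, pipelined into a single MULTI/EXEC round-trip; the key naming and the per-request temp key are my assumptions, not part of the answer:

import redis

r = redis.Redis()

def top_unseen(genre, user_id, count=3):
    temp = f"temp:{user_id}:{genre}"        # per-request scratch key
    pipe = r.pipeline()                     # transaction=True by default -> MULTI/EXEC
    # seen stories get a large negative weight, pushing them below the 0 cutoff
    pipe.zunionstore(temp, {f"news:{genre}": 1, f"seen:{user_id}:{genre}": -1000000})
    pipe.zrevrangebyscore(temp, "+inf", 0, start=0, num=count)
    pipe.delete(temp)
    _, stories, _ = pipe.execute()
    return stories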
As an alternative, if you had a secondary SET that had the same members as (for example) your news:sports ZSET, you could instead perform:
SDIFFSTORE temp news:sports:set seen:user1:sports
ZINTERSTORE temp 2 temp news:sports WEIGHTS 0 1
ZREVRANGEBYSCORE temp inf -inf LIMIT 0 3
DEL temp
That would eliminate the score requirements, though it would increase the data size for each of your stories by an entry in a SET. This one also gets slower with every entry in the news SET, ZSET, and the user's seen SET, but the constants are different, so it may be faster depending on your data sizes.

Mongodb Performance issue

I am using MongoDB and I need to update my documents, say 1000 in total. My documents have a basic structure like:
{
    People: [1, 2, 3, 4],
    Place: "Auckland",
    Event: "Music Show"
}
I have 10,000 threads running concurrently in another VM. Each thread looks at these 1000 documents, checks which of them match its query, and pushes a number into the People array. Suppose thread 100 finds, say, 500 of these 1000 documents relevant; it then pushes the number 100 into the People array of all 500 documents.
For this, each of the 10,000 threads uses the commands:
update.append("$push", new BasicDBObject("People", serial_number));
updateMulti(query, update);
I observe poor performance for these in-place updates (multi-query).
Is this a problem due to a write lock?
Every thread (of the 10,000) updates the documents relevant to its query, so there seems to be a lot of "waiting".
Is there a more efficient way to do these "push" operations?
Is "UpdateMulti" the right way to approach this?
Thank you for a great response - editing and adding some more information.
Some design background :
Yes, your reading of our problem is correct. We have 10,000 threads, each representing one "actor", updating up to 1000 entities (based on the appropriate query) at a time with a $push.
Inverting the model leads us to a few broken use cases (from our domain perspective), forcing joins across "states" of the primary entity (which would now be spread across many collections). For example, each of these actions is a state change for that entity: E has states (e1, e2, e3, e4, e5), so e1 to e5 is represented as an aggregate array which gets updated by the 10,000 threads/processes that represent the actions of external apps.
We need close-to-real-time aggregation, as another set of "actors" looks at these "states" e1 to e5 and then responds appropriately via another channel to the "elements in the array".
What should be the "ideal" design strategy in such a case, to speed up the writes?
Will sharding help? Is there a "magnitude" heuristic for this, e.g. at what lock % should we shard?
This is a problem because of your schema design.
It is extremely inefficient to $push multiple values to multiple documents, especially from multiple threads. It's not so much that the write lock is the problem, it's that your design made it the problem. In addition, you are continuously growing documents which means that the updates are not "in place" and your collection is quickly getting fragmented.
It seems like your schema is "upside down". You have 10,000 threads looking to add numbers representing people (I assume a very large number of people) to a small number of documents (1000) which will grow to be huge. It seems to me that if you want to embed something in something else, you might consider collections representing people and then embedding events that those people are found at - at least then you are limiting the size of the array for each person to 1,000 at most, and the updates will be spread across a much larger number of documents, reducing contention significantly.
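For illustration only, a pymongo sketch of the inverted model suggested above, with one document per person and the matching events embedded; the collection and field names are hypothetical:

from pymongo import MongoClient, UpdateOne

db = MongoClient().mydb                     # database name is hypothetical

def record_event_for_people(event_id, person_ids):
    # one small $push per person document instead of growing 1000 huge ones
    ops = [
        UpdateOne({"_id": pid}, {"$push": {"events": event_id}}, upsert=True)
        for pid in person_ids
    ]
    if ops:
        db.people.bulk_write(ops, ordered=False)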
Another option is simply to record the event/person in attendance and then do aggregation over the raw data later, but without knowing exactly what your requirements for this application are, it's hard to know which way will produce the best results - the way you have picked is definitely one that's unlikely to give you good performance.

Algorithm to find top 10 search terms

I'm currently preparing for an interview, and it reminded me of a question I was once asked in a previous interview that went something like this:
"You have been asked to design some software to continuously display the top 10 search terms on Google. You are given access to a feed that provides an endless real-time stream of search terms currently being searched on Google. Describe what algorithm and data structures you would use to implement this. You are to design two variations:
(i) Display the top 10 search terms of all time (i.e. since you started reading the feed).
(ii) Display only the top 10 search terms for the past month, updated hourly.
You can use an approximation to obtain the top 10 list, but you must justify your choices."
I bombed in this interview and still have really no idea how to implement this.
The first part asks for the 10 most frequent items in a continuously growing sub-sequence of an infinite list. I looked into selection algorithms, but couldn't find any online versions to solve this problem.
The second part uses a finite list, but due to the large amount of data being processed, you can't really store the whole month of search terms in memory and calculate a histogram every hour.
The problem is made more difficult by the fact that the top 10 list is being continuously updated, so somehow you need to be calculating your top 10 over a sliding window.
Any ideas?
Frequency Estimation Overview
There are some well-known algorithms that can provide frequency estimates for such a stream using a fixed amount of storage. One is Frequent, by Misra and Gries (1982). From a list of n items, it finds all items that occur more than n / k times, using k - 1 counters. This is a generalization of Boyer and Moore's Majority algorithm (Fischer-Salzberg, 1982), where k is 2. Manku and Motwani's LossyCounting (2002) and Metwally's SpaceSaving (2005) algorithms have similar space requirements, but can provide more accurate estimates under certain conditions.
The important thing to remember is that these algorithms can only provide frequency estimates. Specifically, the Misra-Gries estimate can under-count the actual frequency by (n / k) items.
Suppose that you had an algorithm that could positively identify an item only if it occurs more than 50% of the time. Feed this algorithm a stream of N distinct items, and then add another N - 1 copies of one item, x, for a total of 2N - 1 items. If the algorithm tells you that x exceeds 50% of the total, it must have been in the first stream; if it doesn't, x wasn't in the initial stream. In order for the algorithm to make this determination, it must store the initial stream (or some summary proportional to its length)! So, we can prove to ourselves that the space required by such an "exact" algorithm would be Ω(N).
Instead, the frequency algorithms described here provide an estimate, identifying any item that exceeds the threshold, along with some items that fall below it by a certain margin. For example, the Majority algorithm, using a single counter, will always give a result; if any item exceeds 50% of the stream, it will be found. But it might also give you an item that occurs only once. You wouldn't know without making a second pass over the data (using, again, a single counter, but looking only for that item).
The Frequent Algorithm
Here's a simple description of Misra-Gries' Frequent algorithm. Demaine (2002) and others have optimized the algorithm, but this gives you the gist.
Specify the threshold fraction, 1 / k; any item that occurs more than n / k times will be found. Create an empty map (like a red-black tree); the keys will be search terms, and the values will be a counter for that term.
Look at each item in the stream.
If the term exists in the map, increment the associated counter.
Otherwise, if the map has fewer than k - 1 entries, add the term to the map with a counter of one.
However, if the map has k - 1 entries already, decrement the counter in every entry. If any counter reaches zero during this process, remove it from the map.
Note that you can process an infinite amount of data with a fixed amount of storage (just the fixed-size map). The amount of storage required depends only on the threshold of interest, and the size of the stream does not matter.
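For reference, a short Python version of the Frequent algorithm as just described (the function name is mine):

def frequent(stream, k):
    """Misra-Gries Frequent: any item occurring more than n/k times is guaranteed
    to survive among the (at most k - 1) counters; counts may under-estimate the
    true frequency by up to n/k."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # decrement every counter, dropping those that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# e.g. frequent(["a", "b", "a", "c", "a", "b", "a", "d"], k=3) keeps "a" as a candidate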
Counting Searches
In this context, perhaps you buffer one hour of searches and perform this process on that hour's data. If you can take a second pass over this hour's search log, you can get an exact count of occurrences of the top "candidates" identified in the first pass. Or maybe it's okay to make a single pass and report all the candidates, knowing that any item that should be there is included, and any extras are just noise that will disappear in the next hour.
Any candidates that really do exceed the threshold of interest get stored as a summary. Keep a month's worth of these summaries, throwing away the oldest each hour, and you would have a good approximation of the most common search terms.
Well, this looks like an awful lot of data, with a perhaps prohibitive cost to store all the frequencies. When the amount of data is so large that we cannot hope to store it all, we enter the domain of data stream algorithms.
Useful book in this area:
Muthukrishnan - "Data Streams: Algorithms and Applications"
Closely related reference to the problem at hand which I picked from the above:
Manku, Motwani - "Approximate Frequency Counts over Data Streams" [pdf]
By the way, Motwani, of Stanford, was an author of the very important "Randomized Algorithms" book. The 11th chapter of this book deals with this problem. Edit: sorry, bad reference, that particular chapter is on a different problem. After checking, I instead recommend section 5.1.2 of Muthukrishnan's book, available online.
Heh, nice interview question.
This is one of the research projects I am currently going through. The requirement is almost exactly like yours, and we have developed nice algorithms to solve the problem.
The Input
The input is an endless stream of English words or phrases (we refer them as tokens).
The Output
Output the top N tokens we have seen so far (from all the tokens we have seen!).
Output the top N tokens in a historical window, say, the last day or last week.
An application of this research is to find the hot topics or topic trends on Twitter or Facebook. We have a crawler that crawls the websites and generates a stream of words, which is fed into the system. The system then outputs the words or phrases with the top frequency, either overall or historically. Imagine that in the last couple of weeks the phrase "World Cup" would appear many times on Twitter. So does "Paul the octopus". :)
Strings into Integers
The system has an integer ID for each word. Though there are almost infinitely many possible words on the Internet, after accumulating a large set of words the chance of finding new words becomes lower and lower. We have already found 4 million different words and assigned a unique ID to each. This whole set of data can be loaded into memory as a hash table, consuming roughly 300 MB of memory. (We implemented our own hash table; Java's implementation has a huge memory overhead.)
Each phrase then can be identified as an array of integers.
This is important, because sorting and comparisons on integers are much, much faster than on strings.
Archive Data
The system keeps archive data for every token. Basically it's pairs of (Token, Frequency). However, the table that stores the data would be so huge that we have to partition it physically. The partition scheme is based on ngrams of the token: if the token is a single word, it is a 1gram; if the token is a two-word phrase, it is a 2gram; and so on. Roughly at 4grams we have 1 billion records, with the table sized at around 60 GB.
Processing Incoming Streams
The system absorbs incoming sentences until memory is fully utilized (yes, we need a MemoryManager). After taking N sentences and storing them in memory, the system pauses and starts tokenizing each sentence into words and phrases. Each token (word or phrase) is counted.
Highly frequent tokens are always kept in memory. Less frequent tokens are sorted by ID (remember we translate the String into an array of integers) and serialized into a disk file.
(However, for your problem, since you are counting only words, you can put the whole word-frequency map in memory. A carefully designed data structure would consume only 300 MB of memory for 4 million different words (some hint: use ASCII chars to represent strings), which is quite acceptable.)
Meanwhile, there is another process that activates once it finds any disk file generated by the system and starts merging it. Since the disk files are sorted, merging is a similar process to merge sort. Some design care is needed here as well, since we want to avoid too many random disk seeks. The idea is to avoid reads (merge process) and writes (system output) happening at the same time, and to let the merge process read from one disk while writing to a different one. This is similar to implementing locking.
End of Day
At the end of the day, the system will have many frequent tokens with their frequencies stored in memory, and many other less frequent tokens stored in several disk files (each file sorted).
The system flushes the in-memory map into a disk file (sorting it). Now the problem becomes merging a set of sorted disk files. Using a similar process, we get one sorted disk file at the end.
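A small Python sketch of that merge step, assuming each run is an iterable of (token_id, count) pairs already sorted by token_id (names are mine):

import heapq
from itertools import groupby
from operator import itemgetter

def merge_sorted_runs(runs):
    # heapq.merge does a k-way merge with sequential reads only, as in merge sort
    merged = heapq.merge(*runs, key=itemgetter(0))
    for token_id, group in groupby(merged, key=itemgetter(0)):
        yield token_id, sum(count for _, count in group)

# e.g. list(merge_sorted_runs([[(1, 2), (3, 1)], [(1, 5), (2, 4)]]))
#      -> [(1, 7), (2, 4), (3, 1)]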
Then, the final task is to merge the sorted disk file into archive database.
Depending on the size of the archive database, if it is big enough the algorithm works like below:
for each record in sorted disk file
    update archive database by increasing frequency
    if rowcount == 0 then put the record into a list
end for

for each record in the list having rowcount == 0
    insert into archive database
end for
The intuition is that after some time the number of inserts becomes smaller and smaller, and more and more operations are pure updates, which are not penalized by the index.
Hope this entire explanation would help. :)
You could use a hash table combined with a binary search tree. Implement a <search term, count> dictionary which tells you how many times each search term has been searched for.
Obviously iterating the entire hash table every hour to get the top 10 is very bad. But this is Google we're talking about, so you can assume that the top ten will all get, say, over 10,000 hits (it's probably a much larger number though). So every time a search term's count exceeds 10,000, insert it into the BST. Then every hour, you only have to get the first 10 from the BST, which should contain relatively few entries.
This solves the problem of top-10-of-all-time.
The really tricky part is dealing with one term taking another's place in the monthly report (for example, "stack overflow" might have 50,000 hits over the past two months but only 10,000 in the past month, while "amazon" might have 40,000 over the past two months but 30,000 in the past month. You want "amazon" to come before "stack overflow" in your monthly report). To do this, I would store, for all major (above 10,000 all-time searches) search terms, a 30-day list that tells you how many times that term was searched for on each day. The list would work like a FIFO queue: you remove the first day and insert a new one each day (or each hour, but then you might need to store more information, which means more memory/space; if memory is not a problem do it, otherwise go for that "approximation" they're talking about).
This looks like a good start. You can then worry about pruning the terms that have over 10,000 hits but haven't had many in a long while, and things like that.
case i)
Maintain a hashtable for all the search terms, as well as a sorted top-ten list separate from the hashtable. Whenever a search occurs, increment the appropriate item in the hashtable and check to see if that item should now be switched with the 10th item in the top-ten list.
O(1) lookup for the top-ten list, and max O(log(n)) insertion into the hashtable (assuming collisions managed by a self-balancing binary tree).
case ii)
Instead of maintaining a huge hashtable and a small list, we maintain a hashtable and a sorted list of all items. Whenever a search is made, that term is incremented in the hashtable, and in the sorted list the term can be checked to see if it should switch places with the term after it. A self-balancing binary tree could work well for this, as we also need to be able to query it quickly (more on this later).
In addition we also maintain a list of 'hours' in the form of a FIFO list (queue). Each 'hour' element contains a list of all searches done within that particular hour. So for example, our list of hours might look like this:
Time: 0 hours
-Search Terms:
-free stuff: 56
-funny pics: 321
-stackoverflow: 1234
Time: 1 hour
-Search Terms:
-ebay: 12
-funny pics: 1
-stackoverflow: 522
-BP sucks: 92
Then, every hour: if the list is at least 720 hours long (that's the number of hours in 30 days), look at the first element in the list, and for each search term there, decrement its count in the hashtable by the appropriate amount. Afterwards, delete that first hour element from the list.
So let's say we're at hour 721 and we're ready to look at the first hour in our list (above). We'd decrement "free stuff" by 56 in the hashtable, "funny pics" by 321, etc., and would then remove hour 0 from the list completely, since we will never need to look at it again.
The reason we maintain a sorted list of all terms (which allows for fast queries) is that every hour, as we go through the search terms from 720 hours ago, we need to ensure the top-ten list remains sorted. As we decrement 'free stuff' by 56 in the hashtable, for example, we check where it now belongs in the list. Because it's a self-balancing binary tree, all of that can be accomplished nicely in O(log(n)) time.
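A compact Python sketch of that bookkeeping, using a deque of per-hour Counters; for brevity it uses Counter.most_common for the top 10 instead of the self-balancing tree described above (names are mine):

from collections import Counter, deque

WINDOW_HOURS = 720                  # 30 days

totals = Counter()                  # term -> count over the sliding window
hours = deque()                     # FIFO of per-hour Counters

def end_of_hour(this_hours_counts):
    """Push the hour that just finished and expire the hour from 30 days ago."""
    hours.append(this_hours_counts)
    totals.update(this_hours_counts)
    if len(hours) > WINDOW_HOURS:
        expired = hours.popleft()
        totals.subtract(expired)
        for term in expired:        # drop terms whose windowed count fell to zero
            if totals[term] <= 0:
                del totals[term]

def top10():
    return totals.most_common(10)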
Edit: Sacrificing accuracy for space...
It might be useful to also implement a big list in the first case, as in the second one. Then we could apply the following space optimization to both cases: run a cron job to remove all but the top x items in the list. This would keep the space requirement down (and as a result make queries on the list faster). Of course, it would produce an approximate result, but this is allowed. x could be calculated before deploying the application based on available memory, and adjusted dynamically if more memory becomes available.
Rough thinking...
For top 10 all time
Using a hash collection where a count for each term is stored (sanitize terms, etc.)
A sorted array which contains the ongoing top 10; a term/count is added to this array whenever the count of a term becomes equal to or greater than the smallest count in the array.
For monthly top 10 updated hourly:
Use an array indexed on the number of hours elapsed since start, modulo 744 (the number of hours in a month), whose entries each consist of a hash collection storing a count for each term encountered during that hour-slot. An entry is reset whenever the hour-slot counter changes.
The stats in the array indexed on hour-slots need to be collected whenever the current hour-slot counter changes (once an hour at most), by copying and flattening the content of this array indexed on hour-slots.
Errr... does that make sense? I didn't think this through as I would in real life.
Ah yes, forgot to mention, the hourly "copying/flattening" required for the monthly stats can actually reuse the same code used for the top 10 of all time, a nice side effect.
Exact solution
First, a solution that guarantees correct results, but requires a lot of memory (a big map).
"All-time" variant
Maintain a hash map with queries as keys and their counts as values. Additionally, keep a list of the 10 most frequent queries so far and the count of the 10th most frequent query (a threshold).
Constantly update the map as the stream of queries is read. Every time a count exceeds the current threshold, do the following: remove the 10th query from the "Top 10" list, replace it with the query you've just updated, and update the threshold as well.
"Past month" variant
Keep the same "Top 10" list and update it the same way as above. Also keep a similar map, but this time store vectors of 30*24 = 720 counts (one for each hour) as values. Every hour, for every key: remove the oldest counter from the vector and add a new one (initialized to 0) at the end. Remove the key from the map if its vector is all zeros. Also, every hour you have to calculate the "Top 10" list from scratch.
Note: yes, this time we're storing 720 integers instead of one, but there are far fewer keys (the all-time variant has a really long tail).
Approximations
These approximations do not guarantee the correct solution, but are less memory-consuming.
Process every N-th query, skipping the rest.
(For the all-time variant only) Keep at most M key-value pairs in the map (M should be as big as you can afford). It's a kind of LRU cache: every time you read a query that is not in the map, remove the least recently used query with count 1 and replace it with the currently processed query.
Top 10 search terms for the past month
Use a memory-efficient index/data structure, such as a tightly packed trie (see the Wikipedia entry on tries); this roughly determines the relation between memory requirements and n, the number of terms.
In case the required memory is available (assumption 1), you can keep an exact monthly statistic and aggregate it every month into an all-time statistic.
There is also an assumption here that interprets 'last month' as a fixed window.
But even if the monthly window is sliding, the above procedure shows the principle (sliding can be approximated with fixed windows of a given size).
This reminds me of a round-robin database, with the exception that some stats are calculated over 'all time' (in the sense that not all data is retained; an rrd consolidates time periods, disregarding details, by averaging, summing up, or choosing max/min values; in the given task the detail that is lost is the information on low-frequency items, which can introduce errors).
Assumption 1
If we cannot hold perfect stats for the whole month, then we should be able to find a certain period P for which we can hold perfect stats.
For example, assume we have perfect statistics for some time period P, which goes into a month n times.
Perfect stats define a function f(search_term) -> search_term_occurrence.
If we can keep all n perfect stat tables in memory then sliding monthly stats can be calculated like this:
add stats for the newest period
remove stats for the oldest period (so we have to keep n perfect stat tables)
However, if we keep only the top 10 at the aggregated (monthly) level, then we will be able to discard a lot of data from the full stats of the fixed period. This already gives a working procedure which has fixed memory requirements (assuming an upper bound on the perfect stat table for period P).
The problem with the above procedure is that if we keep info on only the top 10 terms for a sliding window (similarly for all time), then the stats will be correct for search terms that peak within a period, but might miss search terms that trickle in constantly over time.
This can be offset by keeping info on more than top 10 terms, for example top 100 terms, hoping that top 10 will be correct.
I think further analysis could relate the minimum number of occurrences required for an entry to become part of the stats (which is related to the maximum error).
(In deciding which entries should become part of the stats one could also monitor and track the trends; for example if a linear extrapolation of the occurrences in each period P for each term tells you that the term will become significant in a month or two you might already start tracking it. Similar principle applies for removing the search term from the tracked pool.)
The worst case for the above is when you have a lot of almost equally frequent terms that change all the time (for example, if tracking only 100 terms, and the top 150 terms occur about equally frequently, but the top 50 are more frequent in the first month and less frequent some time later, then the statistics would not be kept correctly).
There could also be another approach, which is not fixed in memory size (well, strictly speaking neither is the above), which would define a minimum significance in terms of occurrences per period (day, month, year, all-time) above which to keep the stats. This could guarantee a maximum error in each of the stats during aggregation (see round robin again).
What about an adaptation of the "clock page replacement algorithm" (also known as "second-chance")? I can imagine it working very well if the search requests are distributed evenly (that means most searched terms appear regularly rather than 5 million times in a row and then never again).
The problem is not universally solvable when you have a fixed amount of memory and an 'infinite' (think very, very large) stream of tokens.
A rough explanation...
To see why, consider a token stream that has a particular token (i.e., word) T every N tokens in the input stream.
Also, assume that the memory can hold references (word id and counts) to at most M tokens.
With these conditions, it is possible to construct an input stream where the token T will never be detected, if N is large enough that the stream contains M different tokens between occurrences of T.
This is independent of the top-N algorithm details. It only depends on the limit M.
To see why this is true, consider an incoming stream made up of groups, each consisting of the token T followed by M distinct tokens:
T a1 a2 a3 ... a-M T b1 b2 b3 ... b-M ...
where the a's, and b's are all valid tokens not equal to T.
Notice that in this stream, T appears twice as often as any individual a-i or b-i. Yet it appears rarely enough to be flushed from the system.
Starting with empty memory, the first token (T) will take up a slot in the memory (bounded by M). Then a1 will consume a slot, and so on up to a-(M-1), when the memory is exhausted.
When a-M arrives, the algorithm has to drop one symbol, so let it be T.
The next symbol will be b-1 which will cause a-1 to be flushed, etc.
So T will not stay memory-resident long enough to build up a real count. In short, any algorithm will miss a token with low enough local frequency but high global frequency (over the length of the stream).
Store the counts of search terms in a giant hash table, where each new search causes a particular element to be incremented by one. Keep track of the top 20 or so search terms; when the element in 11th place is incremented, check whether it needs to swap positions with #10* (it's not necessary to keep the top 10 sorted; all you care about is drawing the distinction between 10th and 11th).
*Similar checks need to be made to see if a new search term is in 11th place, so this algorithm bubbles down to other search terms too -- so I'm simplifying a bit.
Sometimes the best answer is "I don't know".
I'll take a deeper stab. My first instinct would be to feed the results into a queue. A process would continually process items coming off the queue. The process would maintain a map of
term -> count
Each time a queue item is processed, you simply look up the search term and increment the count.
At the same time, I would maintain a list of references to the top 10 entries in the map.
For the entry that was just incremented, check whether its count is greater than the count of the smallest entry in the top 10 (if it's not in the list already). If it is, replace the smallest entry with it.
I think that would work. No operation is time-intensive. You would have to find a way to manage the size of the count map, but that should be good enough for an interview answer.
They are not expecting a solution; they want to see if you can think. You don't have to write the solution then and there...
One way is that for every search you store the search term and its timestamp. That way, finding the top ten for any period of time is simply a matter of comparing all search terms within the given time period.
The algorithm is simple, but the drawback would be greater memory and time consumption.
What about using a splay tree with 10 nodes? Each time you try to access a value (search term) that is not contained in the tree, throw out a leaf, insert the new value instead, and access it.
The idea behind this is the same as in my other answer. Under the assumption that the search terms are accessed evenly/regularly, this solution should perform very well.
edit
One could also store some more search terms in the tree (the same goes for the solution I suggest in my other answer) in order not to delete a node that might be accessed again very soon. The more values one stores, the better the results.
Dunno if I understand it right or not.
My solution is using heap.
Because we want the top 10 search items, I build a heap of size 10.
Then update this heap with each new search. If a new search's frequency is greater than the top of the heap (a min-heap, so the top holds the smallest of the current top 10), replace the top, abandoning the one with the smallest frequency.
But how to calculate the frequency of each specific search relies on something else.
Maybe, as everyone else has stated, a data stream algorithm...
Use a count-min sketch to store the counts of all searches since the beginning, and keep a min-heap of size 10 alongside it for the top 10.
For the monthly result, keep 30 cm-sketches/hash tables, each with its own min-heap, counting and updating for the last 30, 29, ..., 1 days. As a day passes, clear the oldest one and reuse it as day 1.
Same for the hourly result: keep 60 hash tables, each with a min-heap, counting for the last 60, 59, ..., 1 minutes. As a minute passes, clear the oldest one and reuse it as minute 1.
The monthly result is then accurate to within 1 day, and the hourly result to within 1 minute.
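A hedged Python sketch of the all-time part (a tiny count-min sketch plus a size-10 min-heap of candidates); the per-day and per-minute variants would just keep 30 or 60 of these and clear the oldest as described. The class, sizes and hashing choices are mine:

import hashlib
import heapq

class CountMinSketch:
    """depth hash rows of width counters; estimates never under-count,
    but may over-count due to collisions."""
    def __init__(self, width=2048, depth=5):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _bucket(self, term, row):
        digest = hashlib.md5(f"{row}:{term}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, term):
        for row in range(self.depth):
            self.rows[row][self._bucket(term, row)] += 1

    def estimate(self, term):
        return min(self.rows[row][self._bucket(term, row)] for row in range(self.depth))

def top_terms(stream, k=10):
    sketch, heap, members = CountMinSketch(), [], set()
    for term in stream:
        sketch.add(term)
        est = sketch.estimate(term)
        if term in members:
            # refresh this term's stored estimate (the heap is tiny, so re-heapify is cheap)
            heap = [(e, t) if t != term else (est, t) for e, t in heap]
            heapq.heapify(heap)
        elif len(heap) < k:
            heapq.heappush(heap, (est, term))
            members.add(term)
        elif est > heap[0][0]:
            _, evicted = heapq.heapreplace(heap, (est, term))
            members.discard(evicted)
            members.add(term)
    return sorted(heap, reverse=True)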
