Efficient distributed counting - algorithm

I have a series of events flowing through a system (e.g. a pizza-ordering system), and I want to count certain properties of those events over time. For example, I might want to see how many unique people ordered pepperoni pizza in the last 5 minutes, or how many pizzas John Doe ordered in the past week.
It is a LOT of events, so we're using something like Cassandra or HBase, because even the counts can't be stored in memory. Also, since we need to keep track of set membership (in order to count unique people ordering a particular kind of pizza, for example), the data gets even bigger.
We could store a list of orders and then query to count, but this is slow. And we mostly don't care who ordered pepperoni pizza, just how many unique orders were made in a given time window.
What's the best way to store this information, for example in Cassandra, such that the information can be retrieved in some time intervals?
I tried at first to use Redis + Bloom filters, but storing a Bloom filter bit vector would require transactions to avoid race conditions, so I used Redis sets instead.
Then I realized the whole thing was too big to fit in memory, so I decided to switch to a disk-backed store. However, there are no native sets like in Redis.
I looked at sketches / streaming algorithms like HyperLogLog, but the conclusion was that to save the HyperLogLog object I need to store the bit array (or pickle the object or whatever)... Is that kosher, and what are the best practices for this, if this is indeed the solution?
I was tempted to save each event individually with a timestamp, then query and count on demand, but this is slow. I'm looking for something better, if it exists.
Example Requests:
How many unique people had a pepperoni pizza order in the past 10 minutes
How many unique pepperoni pizzas were ordered by some person John Doe in the past 30 minutes

There are a few ways to approach this problem, from what I have learned:
1. Use locking plus a set-membership / counting data structure, e.g. a HyperLogLog or a Bloom filter (see the sketch after this list). As long as there isn't much contention on a particular lock, things should be okay.
2. Use a database that has built-in sets/collections support. They pretty much implement #1 internally.
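For the unique-count requests specifically, here's a minimal sketch of the HyperLogLog route using Redis's built-in PFADD/PFCOUNT commands, with one sketch per (property, time bucket). The key layout, the one-minute buckets, and the retention period are assumptions on my part; the same bucketing scheme carries over to a disk-backed store that can hold serialized HLL blobs.

import time

import redis  # assumes a reachable Redis instance (pip install redis)

r = redis.Redis()
BUCKET_SECONDS = 60            # one HyperLogLog per property per minute
RETENTION_SECONDS = 7 * 86400  # keep a week of buckets (assumption)

def record_order(property_key, person_id, ts=None):
    # e.g. property_key = "pizza:pepperoni", person_id = "john.doe"
    bucket = int((ts or time.time()) // BUCKET_SECONDS)
    key = "hll:%s:%d" % (property_key, bucket)
    r.pfadd(key, person_id)            # HLL adds are atomic; no locking needed
    r.expire(key, RETENTION_SECONDS)   # old buckets age out on their own

def unique_count(property_key, window_seconds):
    # approximate number of unique people over the last window_seconds
    now_bucket = int(time.time() // BUCKET_SECONDS)
    n = window_seconds // BUCKET_SECONDS
    keys = ["hll:%s:%d" % (property_key, b)
            for b in range(now_bucket - n + 1, now_bucket + 1)]
    return r.pfcount(*keys)            # PFCOUNT over several keys unions them

# unique pepperoni orderers in the past 10 minutes:
# unique_count("pizza:pepperoni", 600)

Counting a specific person's orders works the same way with plain counters instead of HLLs, one per (person, property, bucket).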

My guesses:
Cassandra supports counters (I think I saw an incr operation that works concurrently). By keeping a free-running counter per event, you just need to set up something that samples all counters at a specified interval (5 minutes?); then you can give estimates for any window that falls between two samples (sketch below).
(http://wiki.apache.org/cassandra/Counters)
Cassandra can also expire a column (TTL). I never really used it, but it might be worth a try.
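To make the free-running-counter idea concrete independently of Cassandra, here's a toy sketch: every event increments a counter, a scheduler snapshots all counters at a fixed interval, and a window's count is estimated as the difference between two snapshots. The in-memory dict stands in for Cassandra's counter columns, and all names are made up.

import time
from collections import defaultdict

counters = defaultdict(int)   # stands in for Cassandra counter columns
samples = []                  # (timestamp, snapshot) pairs taken on a schedule

def on_event(key):
    counters[key] += 1        # Cassandra equivalent: an atomic counter increment

def take_sample():
    # call this from a scheduler every 5 minutes
    samples.append((time.time(), dict(counters)))

def count_between(key, t_start, t_end):
    # estimate events for `key` between the latest samples at or before each time
    older = max((s for s in samples if s[0] <= t_start),
                key=lambda s: s[0], default=(0, {}))
    newer = max((s for s in samples if s[0] <= t_end),
                key=lambda s: s[0], default=(0, {}))
    return newer[1].get(key, 0) - older[1].get(key, 0)

The resolution of the answer is limited by the sampling interval, which is the trade-off this approach makes.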

Related

Implementing Trending in Elasticsearch

I'm building a project that indexes celebrity-related content across sites (TMZ, People, etc.) because I always thought it would be funny to "bet" on people (and maybe shows, directors, etc.) like horse racing or the stock market -- only, you know, not with real money -- where the value of the person changes day to day and hour to hour and even minute to minute, if we can figure this out together, Stack Overflow denizens.
I assign traffic values to users based on mentions in social media. I have some scrapers (probably violating some TOSes) and access to Twitter's API to get relative counts for search results for a time, so I have known "numbers" to associate w/ users outside of elasticsearch for periods of time to build the trends. Now to be clear, I am not looking to implement trending based on the number of documents in the system, that actually stays pretty consistent, but I need to rank documents that already exist based on trends.
So that's what I've got: a few hundred thousand articles with pre-determined associations to individual celebrities. Data for on-the-minute associations of a score to those celebrities which are then merged and applied to each article so that each article has a few scores associated (there's some complexity here that does not matter, but the bottom line is that I have 10 or so values that I want to assign to content to sort it when you're on the market page and I want to sort those w/ a function or script score).
So the question: how the heck do I assign these values without making Elasticsearch go crazy with re-indexing? I need to use these values to sort dozens of requests per second coming from feeds on the site, but I am running this on a Raspberry Pi... literally, I've maxed the poor thing out for memory.
We're real write-heavy, but if for some reason the celebrity stock market takes off, we're also real read-heavy at the same time. I swear I remember a plugin that had metadata associated with content, but I cannot find it.
I've tried enable=false and index=false, but they seem to still thrash the read times while writing the updates. The best I've gotten to is slowing down the refresh_interval, but that's still pretty expensive and starts to affect the "real-time" nature of the app.
I believe that this is impossible as you've laid it out. Any updates to a field will update _source and fire the full update process.
There are some alternatives that you might consider:
Replication, if another cluster is available
A separate write index on the same cluster, space allowing

Practical importance of efficient sorting algorithms

I've been looking around trying to learn of any practical applications where sorting is needed and its efficiency matters, but couldn't find anything.
The only examples I could find or think of either did not need total sorting (like when looking for the 100 top results or for the median) or were ones where sorting efficiency was hardly important (like sorting a spreadsheet of student names or past transactions once a year).
When sorting web search results, only a few dozen top-ranked results need to be found and sorted, not all of the Internet, so classical sorting algorithms are not needed or practical.
When sorting a spreadsheet, it hardly matters whether it is sorted by a triple-pivot Las Vegas randomised quicksort or by insertion sort.
Using sorted arrays as sets or associative arrays seems to be practically less efficient than using hash tables.
So my question is: what are practical ("real-life") examples where a total sorting is required and its efficiency is a bottleneck? I am particularly curious about applications for comparison sorting.
Update.
I've stumbled upon this phrase in lecture notes by Steven Skiena:
Computers spend more time sorting than anything else, historically 25% on mainframes.
With some details, that could make a perfect answer to my question. Where can I find the source for this statistic, ideally with some details about the kind and application of the sorting done on mainframes?
In some graphics rendering algorithms, objects need to be drawn in back to front order. A good example is transparent particles: there can be hundreds of thousands of them, and because of the transparency, traditional depth buffering doesn't work. So you need to sort these particles by distance from the camera, and keep them sorted, at 60 frames per second.
Interestingly, if the order of the particles doesn't change much (relatively slow particle motion, little camera movement), then the array of particles will already be "mostly sorted" in the next frame, and a simple bubble sort or insertion sort can actually work fine. But on frames where many particles are created, or the camera moves quickly, sort performance can become important, simply because there are so many other things to do each frame.
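A toy sketch of that "mostly sorted each frame" trick (the particle and camera representations here are made up): keep the particle list ordered farthest-to-nearest and re-run an insertion sort every frame, which is close to O(n) when little has moved.

def insertion_sort_by_key(items, key):
    # O(n) when the list is already nearly sorted, O(n^2) in the worst case
    for i in range(1, len(items)):
        item, k = items[i], key(items[i])
        j = i - 1
        while j >= 0 and key(items[j]) < k:   # descending order: farthest first
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = item

def sq_dist(p, cam):
    return sum((a - b) ** 2 for a, b in zip(p, cam))

camera = (0.0, 0.0, 0.0)
particles = [(3.0, 1.0, 2.0), (0.5, 0.5, 0.5), (9.0, 0.0, 1.0)]
# each frame: update positions, then restore the back-to-front draw order
insertion_sort_by_key(particles, key=lambda p: sq_dist(p, camera))

In a real renderer you would precompute the keys once per frame rather than inside the comparison loop, but the shape of the idea is the same.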
Imagine you have a daily list of transactions (deposits and withdrawals) for bank accounts. There are millions of accounts, and millions of transactions per day. Each night, you have to update the accounts to reflect those transactions, compute the interest accrued that day, and print a report, ordered by account, that shows each account with its daily activity.
One way to do that is to go through the list sequentially, reading a transaction and updating the account in the database. That will work, but it has several drawbacks, including:
1. If there are many transactions for a single account, you pay the price of retrieving and updating the account for every transaction. Considering that a business account can have thousands of transactions per day, those costs add up.
2. The typical rule is that deposits are recorded before withdrawals, so as to prevent an overdraft. If an account's balance is 0, and the transactions list has a withdrawal of $5 ahead of a $10 deposit, the system will record an overdraft when it shouldn't.
3. Printing the report would require a separate scan of the database, after all transactions are recorded.
The solution to those problems is to sort the transactions list by account and type (deposits first). Then, the update is a simple merge operation. You read the database and the transactions list in account number order, apply any transactions for that account, compute interest, print the output line, and write the updated record to the database.
The result is much faster than doing a read-update-write for every single transaction, and it eliminates the problems #2 and #3 I outlined above. Sort-and-merge makes the difference between the update taking all night, and the update taking a few hours.
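A compact sketch of that sort-and-merge pass (the account layout, the flat daily interest rule, and all names are assumptions; the dict stands in for a master file that would really be read and written sequentially in account order):

from itertools import groupby

def nightly_update(accounts, transactions, daily_rate=0.0001):
    # accounts: dict account_id -> balance
    # transactions: list of (account_id, kind, amount), kind in {"deposit", "withdrawal"}
    order = {"deposit": 0, "withdrawal": 1}
    # sort by account, deposits before withdrawals within each account
    transactions.sort(key=lambda t: (t[0], order[t[1]]))

    for account_id, txns in groupby(transactions, key=lambda t: t[0]):
        balance = accounts[account_id]          # one read per account
        for _, kind, amount in txns:
            balance += amount if kind == "deposit" else -amount
        balance += balance * daily_rate         # accrue interest on the day's result
        accounts[account_id] = balance          # one write per account
        print("%s: end-of-day balance %.2f" % (account_id, balance))

accounts = {"A1": 100.0, "A2": 0.0}
nightly_update(accounts, [("A2", "withdrawal", 5.0),
                          ("A2", "deposit", 10.0),
                          ("A1", "deposit", 50.0)])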
Also, MapReduce (and Hadoop), used for processing big data, makes good use of sorting. Those programming models simply would not be possible without high-performance sorting algorithms.
Any time you need to merge multiple large data streams into a single output stream (and those applications are legion), the sort-and-merge approach is useful. There are times when other techniques might be faster, but the sort-and-merge is reliable and durable, and, as shown by MapReduce, scales well.

How does a search engine rank millions of pages within 1 second?

I understand the basics of search engine ranking, including the ideas of "inverted index", "vector space model", "cosine similarity", "PageRank", etc.
However, when a user submits a popular query term, it is very likely that millions of pages contain this term. As a result, a search engine still needs to sort these millions of pages in real time. For example, I just tried searching "Barack Obama" in Google. It shows "About 937,000,000 results (0.49 seconds)". Ranking over 900M items within 0.5 seconds? That really blows my mind!
How does a search engine sort such a large number of items within 1 second? Can anyone give me some intuitive ideas or point out references?
Thanks!
UPDATE:
Most of the responses (including some older discussions) so far seem to credit the "inverted index". However, as far as I know, the inverted index only helps find the "relevant pages". In other words, via the inverted index Google could obtain the 900M pages containing "Barack Obama" (out of several billion pages). However, it is still not clear how to "rank" these millions of "relevant pages", based on the threads I have read so far.
MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet our request.
The question would only really be relevant if we were sure that the ranking was complete. It is quite possible that the ordering provided is approximate.
Given the fluidity of the ranking results, no answer that looks reasonable could be considered incorrect. For example, if an entire section of the web were excluded from the top results, you would not notice, provided they were included later.
This gives the developers a degree of latitude entirely unavailable in almost all other domains.
The real question to ask is - how precisely do the results match the actual rank assigned to each page?
There are two major factors that influence the time it takes for you to get a response from your search engine.
The first is whether you're storing your index on hard disk. If you're using a database, it's very likely that you're using the hard disk at least a little. From a cold boot, your queries will be slow until the data necessary for those queries has been pulled into the database cache.
The other is having a cache for your popular queries. It takes a lot longer to search for a query than it does to return results from a cache. Now, the random access time for a disk is too slow, so they need to have it stored in RAM.
To solve both of these problems, Google uses memcached. It's an application that caches the output of the Google search engine and feeds slightly old results to users. This is fine because most of the time the web doesn't change fast enough for it to be a problem, and because of the significant overlap in searches. You can be almost guaranteed that Barack Obama has been searched for recently.
Another issue that affects search engine latency is network overhead.
Google has been using a custom variant of Linux (IIRC) that has been optimised for use as a web server. They've managed to reduce some of the time it takes to start turning around results for a query.
The moment a query hits their servers, the server immediately responds back to the user with the header for the HTTP response, even before Google has finished processing the query terms.
I'm sure they have a bunch of other tricks up their sleeves, too.
EDIT:
They also keep their inverted lists sorted already, from the indexing process (it's better to process once than for each query).
With these pre-sorted lists, the most expensive operation is list intersection. That said, I'm fairly sure Google doesn't rely on a vector space model, so list intersection isn't as much of a factor for them.
The models that pay off the best according to the literature are the probabilistic models. As an example, you may wish to look up Okapi BM25. It does fairly well in practice within my area of research (XML Retrieval). When working with probabilistic models, it tends to be much more efficient to process document at a time instead of term at a time. What this means is that instead of getting a list of all of the documents that contain a term, we look at each document and rank it based on the terms it contains from our query (skipping documents that have no terms).
But if we want to be smart, we can approach the problem in a different way (but only when it appears to be better). If there's a query term that is extremely rare, we can rank with that first, because it has the highest impact. Then we rank with the next best term, and we continue until we've determined if it's likely that this document will be within our top k results.
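For illustration, here is a toy sketch of document-at-a-time ranking with a BM25-style score. The corpus, the constants k1 and b, and the function names are all made up; nothing here reflects how a real engine stores its index.

import math
from collections import Counter

def bm25_rank(query_terms, docs, k1=1.2, b=0.75, top_k=10):
    # document-at-a-time BM25 over a toy in-memory corpus (list of token lists)
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))   # document frequencies
    idf = {t: math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1) for t in query_terms}

    scores = []
    for doc_id, doc in enumerate(docs):             # one document at a time
        tf = Counter(doc)
        if not any(t in tf for t in query_terms):
            continue                                # skip documents with no query term
        s = sum(idf[t] * tf[t] * (k1 + 1) /
                (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
                for t in query_terms if t in tf)
        scores.append((s, doc_id))
    return sorted(scores, reverse=True)[:top_k]

docs = [["barack", "obama", "president"], ["pizza", "recipe"], ["obama", "speech"]]
print(bm25_rank(["barack", "obama"], docs))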
One possible strategy is to rank just the top k instead of the entire list.
For example, to find the top 100 results from 1 million hits, a selection algorithm takes O(n log k) time. Since k = 100 and n = 1,000,000, in practice we can ignore the log(k) factor.
So you only need roughly O(n) work to obtain the top 100 results out of 1 million hits.
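A sketch of that selection step in Python (the scores are hypothetical, precomputed relevance values): a bounded heap keeps only the best k seen so far, so the full result list is never sorted.

import heapq
import random

# toy stand-in for "1 million hits", each with a precomputed relevance score
hits = [("doc%d" % i, random.random()) for i in range(1_000_000)]

# O(n log k) selection of the 100 best; no need to sort all n hits
top100 = heapq.nlargest(100, hits, key=lambda h: h[1])
print(top100[:3])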
Also, I guess the use of NoSQL databases instead of an RDBMS helps.
NoSQL databases scale horizontally better and don't create the same bottlenecks. Big players like Google, Facebook, and Twitter use them.
As other comments/answers suggested, the data might already be sorted, and they return offsets into the data found instead of the whole batch.
The real question is not how they sort that many results that quickly, but how they do it when tens or hundreds of millions of people around the world are querying Google at the same time xD
As Xiao said, just rank the top-k instead of the entire list.
Google tells you there are 937,000,000 results, but it won't show them all to you. If you keep scrolling page after page, after a while it will truncate the results :)
Here you go, I looked it up for you and this is what I found! http://computer.howstuffworks.com/internet/basics/search-engine.htm
This is my theory... It's highly unlikely that you are the first person to search for a keyword. So for every keyword (or combination) searched on a search engine, it maintains a hash of links to relevant web pages. Every time you click a link in the search results, it gets a vote-up in the hash set for that keyword combination. Unfortunately, if you are the first person, it saves your search keyword (for suggesting future searches) and starts the hashing for that keyword. So you end up with fewer or no results at all.
The page ranking, as you might know, depends on many other factors too, like backlinks, the number of pages referring to a keyword in search, etc.
Regarding your update:
MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet our request.
MapReduce is not designed just for batch tasks. There are quite a lot of MapReduce-style frameworks supporting real-time computing: Apache Spark, Storm, Infinispan Distributed Executor, Hazelcast Distributed Executor Service.
Back to your question: MapReduce is the key to distributing the query task to multiple nodes and then merging the results together.
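A toy sketch of that distribute-then-merge step: each "node" ranks only its own shard and returns a local top-k, and the coordinator merges those small lists. The shard layout and the placeholder scoring function are assumptions.

import heapq
from concurrent.futures import ThreadPoolExecutor

def score(doc, query):
    return sum(doc.count(term) for term in query)      # placeholder scoring

def rank_shard(shard, query, k=10):
    # each node scores only its own slice of the index and returns a local top-k
    scored = ((score(doc, query), doc_id) for doc_id, doc in shard.items())
    return heapq.nlargest(k, scored)

def search(shards, query, k=10):
    with ThreadPoolExecutor() as pool:                  # stands in for remote nodes
        partials = pool.map(lambda s: rank_shard(s, query, k), shards)
    # merge step: global top-k out of the per-shard top-k lists
    return heapq.nlargest(k, (hit for part in partials for hit in part))

shards = [{"d1": "barack obama speech", "d2": "pizza"},
          {"d3": "obama barack obama", "d4": "weather"}]
print(search(shards, ["barack", "obama"], k=3))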
There's no way you can expect to get an accurate answer to this question here ;) Anyway, here are a couple of things to consider - Google uses a unique infrastructure in every part of it. We cannot even guess at the complexity of their network equipment or their database storage. That is all I know about the hardware component of this problem.
Now, for the software implementation - like the name says, PageRank is a rank by itself. It doesn't rank the pages when you enter the search query. I assume it ranks them on a totally independent part of the infrastructure every hour. And we already know that Google crawler bots are roaming the Web 24/7, so I assume that new pages are added to an "unsorted" hash map and then ranked on the next run of the algorithm.
Next, when you type your query, thousands of CPUs independently scan thousands of different parts of the PageRank database with a gapping factor. For example, if the gapping factor is 10, one machine queries the part of the database that has PageRank values from 0 to 9.99, another one queries the database from 10 to 19.99, and so on. Since resources aren't an obstacle for Google, they can set the gapping factor very low (for example 1) so that each machine queries fewer than 100k pages, which isn't too much for their hardware. Then, when they need to compile the results of your query, since they know which machine ranks exactly which part of the database, they can use the 'fill the pool' principle. Let n be the number of links on each Google results page. The algorithm that combines the pages returned from queries run on all those machines against all the different parts of the database only needs to fill the first n results. So they take the results from the machine querying the highest-ranked part of the database; if there are more than n, they're done, if not, they move to the next machine. This takes only O(s*g/r), where s is the quantity of pages Google serves, g is the gapping factor, and r is the highest value of PageRank. This assumption is supported by the fact that when you turn to the second page, your query is run once again (notice the different time taken to generate it).
This is just my two cents, but I think I'm pretty accurate with this hypothesis.
EDIT: You might want to check this out for complexity of high-order queries.
I don't know what Google really does, but surely they use approximation. For example, if the search query is 'search engine', then the number of results will be (number of documents with one or more occurrences of the word 'search') + (number of documents with one or more occurrences of the word 'engine'). This can be done in O(1) time. For details, read about the basic structure of Google: http://infolab.stanford.edu/~backrub/google.html

Efficient searching in huge multi-dimensional matrix

I am looking for an efficient way to search for data in a huge multi-dimensional matrix.
My application contains data that is characterized by multiple dimensions. Imagine keeping data about all sales in a company (my application is totally different, but this is just to demonstrate the problem). Every sale is characterized by:
the product that is being sold
the customer that bought the product
the day on which it has been sold
the employee that sold the product
the payment method
the quantity sold
I have millions of sales, done on thousands of products, by hundreds of employees, on lots of days.
I need a fast way to calculate e.g.:
the total quantity sold by an employee on a certain day
the total quantity bought by a customer
the total quantity of a product paid by credit card
...
I need to store the data in the most detailed way, and I could use a map where the key is the combination of all dimensions, like this:
class Combination
{
public:
    Product *product;
    Customer *customer;
    Day *day;
    Employee *employee;
    Payment *payment;

    // std::map needs a strict weak ordering on its key type (requires <tuple>)
    bool operator<(const Combination &o) const
    {
        return std::tie(product, customer, day, employee, payment)
             < std::tie(o.product, o.customer, o.day, o.employee, o.payment);
    }
};
std::map<Combination,quantity> data;
But since I don't know beforehand which queries are performed, I need multiple combination classes (where the data members are in different order) or maps with different comparison functions (using a different sequence to sort on).
Possibly, the problem could be simplified by giving each product, customer, ... a number instead of a pointer to it, but even then I end up with lots of memory.
Are there any data structures that could help in handling this kind of search efficiently?
EDIT:
Just to clarify some things: On disk my data is stored in a database, so I'm not looking for ways to change this.
The problem is that to perform my complex mathematical calculations, I have all this data in memory, and I need an efficient way to search this data in memory.
Could an in-memory database help? Maybe, but I fear that an in-memory database might have a serious impact on memory consumption and on performance, so I'm looking for better alternatives.
EDIT (2):
Some more clarifications: my application will perform simulations on the data, and in the end the user is free to save this data or not into my database. So the data itself changes the whole time. While performing these simulations, and the data changes, I need to query the data as explained before.
So again, simply querying the database is not an option. I really need (complex?) in-memory data structures.
EDIT: to replace earlier answer.
Can you imagine any other possible choice besides running qsort() on that giant array of structs? There's just no other way that I can see. Maybe you can sort it just once at time zero and keep it sorted as you do dynamic insertions/deletions of entries.
Using a database (in-memory or not) to work with your data seems like the right way to do this.
If you don't want to do that, you don't have to implement lots of combination classes; just use a collection that can hold any of the objects.
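To make the in-memory route concrete, here's a minimal sketch (in Python for brevity, with made-up names) that keeps one pre-aggregated map per query pattern and updates all of them as each sale is recorded, so every query becomes a single hash lookup; the same shape maps to hash maps keyed by small ID tuples in C++.

from collections import defaultdict

class SalesIndex:
    def __init__(self):
        # one aggregate per query pattern you actually need
        self.by_employee_day = defaultdict(int)     # (employee, day)    -> quantity
        self.by_customer = defaultdict(int)         # customer           -> quantity
        self.by_product_payment = defaultdict(int)  # (product, payment) -> quantity

    def record(self, product, customer, day, employee, payment, quantity):
        self.by_employee_day[(employee, day)] += quantity
        self.by_customer[customer] += quantity
        self.by_product_payment[(product, payment)] += quantity

idx = SalesIndex()
idx.record("margherita", "acme", "2024-05-01", "alice", "credit", 3)
print(idx.by_employee_day[("alice", "2024-05-01")])  # total sold by alice that day

The trade-off is extra memory and update cost for each aggregate you maintain, in exchange for constant-time queries while the simulation runs.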

Does someone really sort terabytes of data?

I recently spoke to someone who works for Amazon, and he asked me: how would I go about sorting terabytes of data using a programming language?
I'm a C++ guy and of course we spoke about merge sort, and one of the possible techniques is to split the data into smaller chunks, sort each of them, and finally merge them.
But in reality, do companies like Amazon or eBay sort terabytes of data? I know they store tons of information, but do they sort it?
In a nutshell, my question is: why wouldn't they keep the data sorted in the first place, instead of sorting terabytes of it?
"But in reality, do companies like Amazon/eBay sort terabytes of data? I know, they store tons of info, but sorting them???"
Yes. Last time I checked Google processed over 20 petabytes of data daily.
"Why wouldn't they keep them sorted in the first place instead of sorting terabytes of data, is my question in a nutshell."
EDIT: relet makes a very good point; you only need to keep indexes and have those sorted. You can easily and efficiently retrieve sorted data that way. You don't have to sort the entire dataset.
Consider log data from servers: Amazon must have a huge amount of data. The log data is generally stored as it is received, that is, sorted according to time. Thus if you want it sorted by product, you would need to sort the whole data set.
Another issue is that many times the data needs to be sorted according to the processing requirement, which might not be known beforehand.
For example: though not a terabyte, I recently sorted around 24 GB of Twitter follower network data using merge sort. The implementation I used was by Prof. Dan Lemire.
http://www.daniel-lemire.com/blog/archives/2010/04/06/external-memory-sorting-in-java-the-first-release/
The data was sorted according to user IDs, and each line contained a user ID followed by the user ID of the person following them. However, in my case I wanted data about who follows whom, so I had to sort it again by the second user ID in each line.
However, for sorting 1 TB I would use MapReduce with Hadoop.
Sorting is the default step after the map function. I would choose the identity map function, set NONE as the reduce function, and set up a streaming job.
Hadoop uses HDFS, which stores data in huge blocks of 64 MB (this value can be changed). By default it runs a single map task per block. After the map function runs, the map output is sorted, I guess by an algorithm similar to merge sort.
Here is the link to the identity mapper:
http://hadoop.apache.org/common/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html
If you want to sort by some element in that data, then I would make that element the key, and the whole line the value, in the output of the map.
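For example, a minimal Hadoop Streaming mapper along those lines (the field index and whitespace delimiter are assumptions): it emits the chosen field as the key, and the framework's sorting of map output by key, as described above, then does the actual ordering.

#!/usr/bin/env python3
# hypothetical streaming mapper: emit the chosen field as the key so that
# the sort of the map output (see above) orders the data by that field
import sys

KEY_FIELD = 1  # assumption: sort by the second whitespace-separated field

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    fields = line.split()
    if len(fields) <= KEY_FIELD:
        continue
    # key \t value: the original line is carried along as the value
    print("%s\t%s" % (fields[KEY_FIELD], line))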
Yes, certain companies certainly sort at least that much data every day.
Google has a framework called MapReduce that splits work - like a merge sort - onto different boxes, and handles hardware and network failures smoothly.
Hadoop is a similar Apache project you can play with yourself, to enable splitting a sort algorithm over a cluster of computers.
Every database index is a sorted representation of some part of your data. If you index it, you sort the keys - even if you do not necessarily reorder the entire dataset.
Yes. Some companies do. Or maybe even individuals. You can take high-frequency traders as an example. Some of them are well known, say Goldman Sachs. They run very sophisticated algorithms against the market, taking into account tick data for the last couple of years, which is every change in the price offering, real deal prices (trades, AKA prints), etc. For highly volatile instruments, such as stocks, futures and options, there are gigabytes of data every day, and they have to do scientific research on data for thousands of instruments for the last couple of years. Not to mention news that they correlate with the market, weather conditions and even moon phases. So, yes, there are guys who sort terabytes of data. Maybe not every day, but still, they do.
Scientific datasets can easily run into terabytes. You may sort them and store them in one way (say by date) when you gather the data. However, at some point someone will want the data sorted by another method, e.g. by latitude if you're using data about the Earth.
Big companies do sort terabytes and petabytes of data regularly. I've worked for more than one such company. Like Dean J said, companies rely on frameworks built to handle such tasks efficiently and consistently, so the users of the data do not need to implement their own sorting. But the people who built the framework had to figure out how to do certain things (not just sorting, but key extraction, enriching, etc.) at massive scale. Despite all that, there might be situations when you will need to implement your own sorting. For example, I recently worked on a data project that involved processing log files with events coming from mobile apps.
For security/privacy policies, certain fields in the log files needed to be encrypted before the data could be moved over for further processing. That meant that a custom encryption algorithm was applied to each row. However, since the same field value appears hundreds of times in a file, it was more efficient to sort the file first, encrypt each distinct value once, and reuse the cached result for every repeated value.
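A minimal sketch of that sort-then-cache idea (the row layout and the stand-in encrypt() are assumptions; the real routine was a custom, expensive algorithm):

import hashlib

def encrypt(value):
    # placeholder for the custom (expensive) encryption routine
    return hashlib.sha256(value.encode()).hexdigest()

def encrypt_column(rows, column=0):
    # sort so that equal values are adjacent, then encrypt each distinct value once
    rows = sorted(rows, key=lambda r: r[column])
    cache_key, cache_val = None, None
    out = []
    for row in rows:
        if row[column] != cache_key:
            cache_key = row[column]
            cache_val = encrypt(cache_key)   # only runs once per distinct value
        out.append(row[:column] + [cache_val] + row[column + 1:])
    return out

rows = [["user42", "click"], ["user7", "view"], ["user42", "view"]]
print(encrypt_column(rows))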

Resources