Implementing Trending in Elasticsearch

I'm building a project that indexes celebrity-related content across sites (TMZ, People, etc.) because I always thought it would be funny to "bet" on people (and maybe shows, directors, etc.) like horse racing or the stock market, only, you know, not with real money. The value of the person would change day to day, hour to hour, and even minute to minute, if we can figure this out together, Stack Overflow denizens.
I assign traffic values to users based on mentions in social media. I have some scrapers (probably violating some TOSes) and access to Twitter's API to get relative counts for search results for a time, so I have known "numbers" to associate w/ users outside of elasticsearch for periods of time to build the trends. Now to be clear, I am not looking to implement trending based on the number of documents in the system, that actually stays pretty consistent, but I need to rank documents that already exist based on trends.
So that's what I've got: a few hundred thousand articles with pre-determined associations to individual celebrities. Data for on-the-minute associations of a score to those celebrities which are then merged and applied to each article so that each article has a few scores associated (there's some complexity here that does not matter, but the bottom line is that I have 10 or so values that I want to assign to content to sort it when you're on the market page and I want to sort those w/ a function or script score).
So the question: How the heck do I assign these values without making elasticsearch go crazy with re-indexing? I need to use these values to sort dozens of requests per second coming from feeds on the site, but I am running this on a raspberry pi... literally, I've maxed the poor thing out for memory.
We're real write heavy, but if for some reason celebrity stock markets take off, we're also real read heavy at the same time. I swear I remember a plugin that had metadata associated with content, but I cannot find it.
I've tried enabled=false and index=false, but they still seem to thrash the read times while the updates are being written. The best I've gotten to is raising the refresh_interval so that refreshes happen less often, but that's still pretty expensive and starts to affect the "real-time" nature of the app.

I believe that this is impossible as you've laid it out. Any update to a field rewrites _source and reindexes the whole document.
There are some alternatives that you might consider:
Replication, if another cluster is available
A separate write index on the same cluster, space allowing

Related

User Search Strategy Mobile Dev

I want to implement a search feature on my app that re-filters upon each new character entered into the search bar so users can search for other users. This is a fairly common feature on apps, but as a beginner it would seem like a very computationally complex process. It would seem that one of two things happen:
For each new character typed, the frontend queries the backend, which applies filter and returns.
The frontend loads all (or many) possible results beforehand and updates filter on the stored info as new characters are entered.
It would seem that 1) would have time complexity issues, as it makes O(n) queries (where n is number of characters) per search. This is especially problematic because it's expected that the filtered search results update near instantaneously. Additionally, my average query time is probably slower than most, as I'm using a three tier architecture (frontend<->server<->graph database)
I don't like 2)--at least in its straightforward form--as the number of possible results can get very large. We can reduce the space complexity of this by querying only for a limited set of user attributes (perhaps only uid and name, and fetching details on the fly if needed), but the point remains.
Things get more interesting if we modify 2) to load only a sample of users (and here we can use data like Location as well as ML/AI to select). The problem with this is that the searching user could always be looking for someone we didn't select. It would be a horrible (even if rare) experience for a user to know their friend was on the app but was unable to find them because our algorithm was only accurate for 99% of searches.
I am sure this is possible--other apps seem to pull it off--so what am I missing?
First, you should avoid querying the server for each character typed. Most of the time the user types a bunch of characters very fast without looking at the suggested results, especially because with only a few characters the results wouldn't be specific enough. Most autocompletion systems adopt both of the following:
query only if the string is at least 2-3 chars long;
query only if the user has stopped typing, i.e. 300ms after the last keystroke.
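The two rules above can be sketched as a small debouncer. This is only an illustration: the class name, the method names and the 3-character threshold are assumptions, and in a real app the same logic would live in the frontend's event loop rather than in Python. Timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from typing import Optional

MIN_CHARS = 3          # assumed threshold; the answer suggests 2-3
QUIET_PERIOD_S = 0.3   # 300 ms since the last keystroke


class SearchDebouncer:
    """Decides when a typed prefix is worth sending to the backend."""

    def __init__(self, min_chars: int = MIN_CHARS,
                 quiet_period_s: float = QUIET_PERIOD_S) -> None:
        self.min_chars = min_chars
        self.quiet_period_s = quiet_period_s
        self._text = ""
        self._last_keystroke: Optional[float] = None
        self._last_sent: Optional[str] = None

    def on_keystroke(self, text: str, now: float) -> None:
        """Record the current contents of the search box."""
        self._text = text
        self._last_keystroke = now

    def poll(self, now: float) -> Optional[str]:
        """Return the query to send, or None if we should keep waiting."""
        if self._last_keystroke is None:
            return None
        if len(self._text) < self.min_chars:
            return None                      # too short to be specific
        if now - self._last_keystroke < self.quiet_period_s:
            return None                      # user is still typing
        if self._text == self._last_sent:
            return None                      # already queried this text
        self._last_sent = self._text
        return self._text


debouncer = SearchDebouncer()
debouncer.on_keystroke("St", now=0.0)
print(debouncer.poll(now=1.0))    # None: only two characters
debouncer.on_keystroke("Ste", now=1.0)
print(debouncer.poll(now=1.1))    # None: typed 100 ms ago
print(debouncer.poll(now=1.4))    # Ste
```

With this in place a burst of fast typing produces one backend query instead of one per character.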
To get all the pertinent results without huge data transfer you could implement a progressive data load. Just load enough results to fill the page height, then as the user scrolls down load more results. However if you reach a high number of results you should stop retrieving them and ask the user to type a more specific search.
If you want to make your users happy, try to sort the result by relevance. For example if you know where the users are located you may sort the results by distance, because if I live in Italy and I search for "Ste" it is more likely my friend is Stefano who lives in Rome, than Steve who lives in NY.

What are best known algorithms/techniques for updating edges on huge graph structures like social networks?

On social networks like twitter where millions follow single account, it must be very challenging to update all followers instantly when a new tweet is posted. Similarly on facebook there are fan pages with millions of followers and we see updates from them instantly when posted on page. I am wondering what are best known techniques and algorithms to achieve this. I understand with billion accounts, they have huge data centers across globe and even if we reduce this problem for just one computer in following manner - 100,000 nodes with average 200 edges per node, then every single node update will require 200 edge updates. So what are best techniques/algorithms to optimize such large updates. Thanks!
The best way is usually just to do all the updates. You say they can be seen "instantly", but actually the updates probably propagate through the network and can take up to a few seconds to show up in followers' feeds.
Having to do all those updates may seem like a lot, but on average a follower will check for updates much more often than the person being followed will produce them, and checking for updates has to be much faster.
The choices are:
Update 1 million followers, a couple times a day, within a few seconds; or
Respond to checks from 1 million followers, a couple hundred times a day, within 1/10 second or so.
There are in-between strategies involving clustering users and stuff, but usage patterns like you see on Facebook and Twitter are probably so heavily biased toward option (1) that such strategies don't pay off.
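Option (1), doing all the updates at write time, can be sketched as follows. This is a toy in-memory version; the names `follow`, `publish`, `read_feed` and the `FEED_LIMIT` cap are all assumptions for illustration, and a real system would keep the feeds in a shared store rather than in a Python dict.

```python
from collections import defaultdict, deque

FEED_LIMIT = 100  # assumed cap on stored feed entries per user

followers = defaultdict(set)                            # author -> follower ids
feeds = defaultdict(lambda: deque(maxlen=FEED_LIMIT))   # user -> newest-first post ids


def follow(follower: str, author: str) -> None:
    followers[author].add(follower)


def publish(author: str, post_id: str) -> int:
    """Fan out on write: push the post into every follower's feed.

    Cost is proportional to the number of followers, paid once per post.
    Returns the number of feeds that were updated.
    """
    for user in followers[author]:
        feeds[user].appendleft(post_id)
    return len(followers[author])


def read_feed(user: str, count: int = 10) -> list:
    """Reads touch one precomputed feed; no per-author lookups are needed."""
    return list(feeds[user])[:count]


follow("alice", "celeb")
follow("bob", "celeb")
publish("celeb", "post-1")
publish("celeb", "post-2")
print(read_feed("alice"))   # ['post-2', 'post-1']
```

The trade-off matches the numbers in the answer: with 1 million followers, two posts a day cost about 2 million feed insertions, while a couple hundred checks a day per follower would be on the order of 200 million reads, so making the read path trivial is usually the better deal.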

"Fan-out" indexing strategy

I'm planning to use Elasticsearch for a social network kind of platform where users can post "updates", be friends with other users and follow their friends' feed. The basic and probably most frequent query will be "get posts shared with me by friends I follow". This query could be augmented by additional constraints (like tags or geosearch).
I've learned that social networks usually take a fan-out-on-write approach to disseminate "updates" to followers so queries are more localized. So I can see 2 potential indexing strategies:
Store all posts in a single index and search for posts (1) shared with the requester and (2) whose author is among the list of users followed by the requester (the "naive" approach).
Create one index per user, inject posts that are created by followed users and directly search among this index (the "fan-out" approach).
The second option is obviously much more efficient from a search perspective, although it presents sync challenges (like the need to delete posts when I stop following a friend, for example). But the thing I would be most concerned with is the multiplication of indices; in a (successful) social network, we can expect at least tens of thousands of users...
So my questions here are:
how does ES cope with a very high number of indices? can it incur performance issues?
any thoughts about a better indexing strategy for my particular use-case?
Thanks
Each elasticsearch index shard is a separate Lucene index, which means several open file descriptors and memory overhead. Generally, even after reducing number of shards per index from default 5, the resource consumption in index-per-user scenario may be too large.
It is hard to give any concrete numbers, but my guess is that if you stick to two shards per index, you would be able to handle no more than 3000 users per m3.medium machine, which is prohibitive in my opinion.
However, you don't necessarily need to have dedicated index for every user. You can use filtered aliases to use one index for multiple users. From application point of view, it would look like a per-user scenario, without incurring overhead mentioned above. See this video for details.
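As a rough sketch of the filtered-alias idea, the snippet below only builds the JSON body for Elasticsearch's `_aliases` endpoint; it does not talk to a cluster. The index name `posts`, the field `recipient_id` and the `feed_user_<id>` naming scheme are assumptions made for the example, not something from the question.

```python
import json


def filtered_alias_action(index: str, user_id: int,
                          field: str = "recipient_id") -> dict:
    """One 'add' action: an alias that only exposes this user's documents."""
    return {
        "add": {
            "index": index,
            "alias": f"feed_user_{user_id}",          # hypothetical naming scheme
            "filter": {"term": {field: user_id}},     # applied to every search on the alias
            "routing": str(user_id),                  # keeps a user's docs on one shard
        }
    }


def aliases_request_body(index: str, user_ids: list) -> dict:
    """Body for POST /_aliases covering several users of one shared index."""
    return {"actions": [filtered_alias_action(index, uid) for uid in user_ids]}


body = aliases_request_body("posts", [42, 43])
print(json.dumps(body, indent=2))
```

The application then searches `feed_user_42` as if it were a dedicated index, while the cluster only carries the shards of the single `posts` index.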
With that being said, I don't think elasticsearch is particularly good fit for fan-out-on-write strategy. It is, however, very good solution to employ in fan-out-on-read scenario (something similar to what you've outlined as (1)):
The biggest advantage of using elasticsearch is that you are able to perform relevance scoring, typically based on some temporal features, like browsing context. Using elasticsearch to just retrieve documents sorted by timestamp means that you don't utilize its potential. Meanwhile, solutions like Redis will give you far superior read performance for such task.
Fan-out-on-write scenario means a lot of writes on each update (especially, if you have users with many followers). Elasticsearch is not a database and is not optimized for such usage-pattern. It is, however, prepared for frequent reads.
Fan-out-on-write also means that you are producing a lot of 'extra' data by duplicating info about posts. To keep this data in RAM, you need to store only metadata, like id of document in separate document storage and tags. Again, there are other formats than JSON to store and search this kind of structured data effectively.
Choosing between the two scenarios is a question about your requirements, like average number of followers, number of 'hubs' that nearly everybody follows, whether the feed is naturally ordered (e.g. by time) etc. I think that deciding whether to use elasticsearch needs to be a consequence of this analysis.

How does a search engine rank millions of pages within 1 second?

I understand the basics of search engine ranking, including the ideas of "reverse index", "vector space model", "cosine similarity", "PageRank", etc.
However, when a user submits a popular query term, it is very likely that millions of pages contain this term. As a result, a search engine still needs to sort these millions of pages in real time. For example, I just tried searching "Barack Obama" in Google. It shows "About 937,000,000 results (0.49 seconds)". Ranking over 900M items within 0.5 seconds? That really blows my mind!
How does a search engine sort such a large number of items within 1 second? Can anyone give me some intuitive ideas or point out references?
Thanks!
UPDATE:
Most of the responses (including some older discussions) so far seem to attribute the credit to the "reverse index". However, as far as I know, a reverse (inverted) index only helps find the "relevant pages". In other words, with an inverted index Google could obtain the 900M pages containing "Barack Obama" (out of several billion pages). However, based on the threads I have read so far, it is still not clear how to "rank" these millions of "relevant pages".
MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet our request.
The question would be really relevant if we were sure that the ranking was complete. It is quite possible that the ordering provided is approximate.
Given the fluidity of the ranking results, no answer that looks reasonable could be considered incorrect. For example, if an entire section of the web were excluded from the top results, you would not notice, provided they were included later.
This gives the developers a degree of latitude entirely unavailable in almost all other domains.
The real question to ask is - how precisely do the results match the actual rank assigned to each page?
There are two major factors that influence the time it takes for you to get a response from your search engine.
The first is if you're storing your index on hard disk. If you're using a database, it's very likely that you're using the hard disk at least a little. From a cold boot, your queries will be slow until the data necessary for those queries has been pulled into the database cache.
The other is having a cache for your popular queries. It takes a lot longer to search for a query than it does to return results from a cache. Now, the random access time for a disk is too slow, so they need to have it stored in RAM.
To solve both of these problems, Google uses memcached. It's an application that caches the output of the Google search engine and feeds slightly old results to users. This is fine because most of the time the web doesn't change fast enough for it to be a problem, and because of the significant overlap in searches. You can be almost guaranteed that Barack Obama has been searched for recently.
Another issue that affects search engine latency is network overhead.
Google has been using a custom variant of Linux (IIRC) that has been optimised for use as a web server. They've managed to reduce some of the time it takes to start turning around results to a query.
The moment a query hits their servers, the server immediately responds back to the user with the header for the HTTP response, even before Google has finished processing the query terms.
I'm sure they have a bunch of other tricks up their sleeves, too.
EDIT:
They also keep their inverted lists sorted already, from the indexing process (it's better to process once than for each query).
With these pre-sorted lists, the most expensive operation is list intersection. Although I'm fairly sure Google doesn't rely on a vector space model, so list intersection isn't so much a factor for them.
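To make the list-intersection step concrete, here is a minimal sketch that assumes each posting list is a list of document IDs sorted in ascending order (the answer does not say which sort key is used, so that is an assumption). Because both lists are already sorted, one linear pass with two cursors is enough.

```python
def intersect_sorted(a: list, b: list) -> list:
    """Return the doc IDs present in both sorted posting lists.

    Runs in O(len(a) + len(b)) because neither list needs re-sorting.
    """
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1          # a[i] cannot appear later in b
        else:
            j += 1          # b[j] cannot appear later in a
    return common


barack = [1, 3, 5, 7, 9]     # docs containing "barack" (made-up IDs)
obama = [3, 4, 5, 9, 10]     # docs containing "obama"
print(intersect_sorted(barack, obama))   # [3, 5, 9]
```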
The models that pay off the best according to the literature are the probabilistic models. As an example, you may wish to look up Okapi BM25. It does fairly well in practice within my area of research (XML Retrieval). When working with probabilistic models, it tends to be much more efficient to process document at a time instead of term at a time. What this means is that instead of getting a list of all of the documents that contain a term, we look at each document and rank it based on the terms it contains from our query (skipping documents that have no terms).
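For readers who want to see what document-at-a-time scoring with Okapi BM25 looks like, here is a small self-contained sketch. It uses the common defaults k1 = 1.2 and b = 0.75 and the IDF variant with +1 inside the logarithm (so it never goes negative); those choices, the function name and the toy corpus are assumptions, not details from the answer. It expects a non-empty list of tokenised documents.

```python
import math
from collections import Counter

K1 = 1.2    # term-frequency saturation (common default)
B = 0.75    # document-length normalisation (common default)


def bm25_scores(query_terms: list, docs: list) -> list:
    """Score every document against the query, one document at a time.

    docs is a non-empty list of token lists. Documents that contain
    none of the query terms are skipped and keep a score of 0.0.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))

    idf = {
        t: math.log((n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5) + 1.0)
        for t in query_terms
    }

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            f = tf[t]
            if f == 0:
                continue
            norm = f + K1 * (1 - B + B * len(d) / avg_len)
            score += idf[t] * f * (K1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    ["barack", "obama", "speech"],
    ["weather", "report"],
    ["obama", "obama", "visit"],
]
print(bm25_scores(["barack", "obama"], corpus))
```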
But if we want to be smart, we can approach the problem in a different way (but only when it appears to be better). If there's a query term that is extremely rare, we can rank with that first, because it has the highest impact. Then we rank with the next best term, and we continue until we've determined if it's likely that this document will be within our top k results.
One possible strategy is to rank just the top k instead of the entire list.
For example, to find the top 100 results from 1 million hits with a heap-based selection algorithm, the time complexity is O(n log k). Since k = 100 and n = 1,000,000, log(k) is a small constant that we can ignore in practice.
So you only need roughly O(n) to obtain the top 100 results out of 1 million hits.
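In Python the top-k selection described above is available as `heapq.nlargest`, which keeps a heap of at most k items while scanning the hits once. The `(doc_id, score)` tuples and the synthetic scores below are made up for the example.

```python
import heapq


def top_k(scored_hits, k: int) -> list:
    """Return the k highest-scoring (doc_id, score) pairs, best first.

    Only a k-element heap is kept in memory, so the cost is O(n log k)
    instead of the O(n log n) of sorting every hit.
    """
    return heapq.nlargest(k, scored_hits, key=lambda hit: hit[1])


# 1000 synthetic hits; 37 is coprime to 1000, so every score 0..999 occurs once.
hits = [(doc_id, (doc_id * 37) % 1000) for doc_id in range(1000)]
print([score for _, score in top_k(hits, 3)])   # [999, 998, 997]
```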
Also I guess the use of NoSQL databases instead of RDBMS helps.
NoSQL databases scale horizontally better and don't generate bottlenecks. Big players like Google, Facebook or Twitter use them.
As other comments/answers suggested the data might be already sorted, and they are returning offsets of the data found instead of the whole batch.
The real question is not how they sort that many results that quickly, but how do they do it when tens or hundreds of millions of people around the world are querying google at the same time xD
As Xiao said, just rank the top-k instead of the entire list.
Google tells you there are 937,000,000 results, but it won't show them all to you. If you keep scrolling page after page, after a while it will truncate the results :)
Here you go, i looked it up for you and this is what i found! http://computer.howstuffworks.com/internet/basics/search-engine.htm
This is my theory... It's highly unlikely that you are the first person to search for a keyword. So for every keyword (or combination of keywords) searched on a search engine, it maintains a hash of links to relevant web pages. Every time you click a link in the search results, it gets a vote-up in the hash set of that keyword combination. Unfortunately, if you are the first person, it saves your search keyword (for suggesting future searches) and starts the hashing of that keyword. So you end up with fewer results or none at all.
The page ranking, as you might know, depends on many other factors too, like backlinks, the number of pages referring to a keyword in the search, etc.
Regarding your update:
MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet our request.
MapReduce is not just designed for batch tasks. There are quite a lot of MapReduce frameworks supporting real-time computing: Apache Spark, Storm, Infinispan Distributed Executor, Hazelcast Distributed Executor Service.
Back to your question: MapReduce is the key to distributing the query task to multiple nodes and then merging the results together.
There's no way you expect to get an accurate answer to this question here ;) Anyway, here's a couple of things to consider - Google uses a unique infrastructure, in every part of it. We cannot even guess the order of complexity of their network equipment or their database storage. That is all I know about the hardware component of this problem.
Now, for the software implementation - like the name says the PageRank is a rank by itself. It doesn't rank the pages when you enter the search query. I assume it ranks it on a totally independent part of the infrastructure every hour. And we already know that Google crawler bots are roaming the Web 24/7 so I assume that new pages are added into an "unsorted" hash map and then they are ranked on the next run of the algorithm.
Next, when you type your query, thousands of CPUs independently scan thousands of different parts of the PageRank database with a gapping factor. For example, if the gapping factor is 10, one machine queries the part of the database that has PageRank values from 0-9.99, another queries the part from 10-19.99, etc. Since resources aren't an obstacle for Google, they can set the gapping factor low enough (for example 1) for each machine to query fewer than 100k pages, which isn't too much for their hardware. Then, when they need to compile the results of your query, since they know which machine ranks exactly which part of the database, they can use the 'fill the pool' principle. Let n be the number of links on each Google page. The algorithm that combines the pages returned from queries run on all those machines against all the different parts of the database needs to fill only the first n results. So they take the results from the machine querying against the highest rank of the database. If that yields n or more results they're done; if not, they move to the next machine. This takes only O(q*g/r), where q is the quantity of pages Google serves, g is the gapping factor and r is the highest value of PageRank. This assumption is encouraged by the fact that when you turn to the second page your query is run once again (notice the different time taken to generate it).
This is just my two cents, but I think I'm pretty accurate with this hypothesis.
EDIT: You might want to check this out for complexity of high-order queries.
I don't know what Google really does, but surely they use approximation. For example if the search query is 'Search engine' then the number of results will be = (no. of documents where there is one or more occurrence of the word 'search' + no. of documents where there is one or more occurrence of the word 'engine' ). This can be done in O(1) time complexity. For details read the basic structure of Google http://infolab.stanford.edu/~backrub/google.html.

Efficient distributed counting

I have a series of events flowing through a system (e.g. a pizza ordering system) and I want to count certain properties of each event through time. For example, I might want to see how many unique people ordered pepperoni pizza in the last 5 minutes, or how many pizzas John Doe ordered in the past week.
It is a LOT of events, so we're using something like Cassandra or HBase because even the counts can't be stored in memory. Also, since we need to keep track of set membership (in order to count unique people ordering a particular kind of pizza, for example), it gets bigger.
We could store a list of orders and then query to count, but this is slow. And we mostly don't care who ordered pepperoni pizza, just how many unique orders were made, and in a given time window.
What's the best way to store this information, for example in Cassandra, such that the information can be retrieved in some time intervals?
I tried at first to use Redis + bloom filters, but storing a bloom filter bit vector would require transactions to avoid race conditions, so then I used redis sets.
Then I realized the whole thing was too big to just be in memory, so I decided to switch to a disk-backed store. However, there are no native sets like in redis.
I looked at sketches / streaming algos like HyperLogLog but the conclusion was that to save the hyperloglog object, I need to store the bit array (or pickle the object or whatever)...is that kosher, and what are the best practices for this, if this is indeed the solution?
I was tempted to save each event individually with a timestamp, then query and count on demand, but this is slow. I'm looking for something better, if it exists.
Example Requests:
How many unique people had a pepperoni pizza order in the past 10 minutes
How many unique pepperoni pizzas were ordered by some person John Doe in the past 30 minutes
There are a few ways to approach this problem from what I have learned.
Use locking + a set membership / counting data structure, e.g. HyperLogLog or a bloom filter. As long as there's not that much fighting over a particular lock, things should be okay.
Use a database that has built-in sets/collections support. They pretty much implement #1 internally.
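One way to picture the first approach is a HyperLogLog per time bucket, stored as a plain byte blob and merged at query time. The sketch below is a simplified HyperLogLog (4096 registers, SHA-1 as the hash, only the small-range correction), not a production implementation. The per-minute bucketing, the `buckets` dict standing in for a blob column in Cassandra or HBase, and all names are assumptions for illustration.

```python
import hashlib
import math

P = 12          # 2**12 = 4096 registers, roughly 1.6% standard error
M = 1 << P


class HyperLogLog:
    def __init__(self, registers: bytes = b"") -> None:
        # One byte per register, so the whole sketch is a 4 KiB blob.
        self.registers = bytearray(registers) if registers else bytearray(M)

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")         # 64-bit hash
        idx = h >> (64 - P)                           # first P bits pick a register
        rest = h & ((1 << (64 - P)) - 1)              # remaining 52 bits
        rank = (64 - P) - rest.bit_length() + 1       # position of the leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        # Register-wise max: merging is commutative and idempotent.
        return HyperLogLog(bytes(max(a, b) for a, b in
                                 zip(self.registers, other.registers)))

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / M)
        estimate = alpha * M * M / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * M and zeros:
            return M * math.log(M / zeros)            # linear counting for small sets
        return estimate

    def to_bytes(self) -> bytes:
        return bytes(self.registers)


def unique_in_window(buckets: dict, start_minute: int, end_minute: int) -> float:
    """Estimate distinct items over [start_minute, end_minute] by merging blobs."""
    merged = HyperLogLog()
    for minute in range(start_minute, end_minute + 1):
        blob = buckets.get(minute)
        if blob:
            merged = merged.merge(HyperLogLog(blob))
    return merged.count()


# buckets maps minute -> serialized sketch for "pepperoni" orders.
buckets = {}
for minute, user_ids in {0: range(0, 500), 1: range(250, 750)}.items():
    sketch = HyperLogLog()
    for uid in user_ids:
        sketch.add(f"user{uid}")
    buckets[minute] = sketch.to_bytes()

print(round(unique_in_window(buckets, 0, 1)))   # close to 750
```

Storing the raw register bytes is all that is needed to save a sketch. Since a merge is just a register-wise max, a read-modify-write on one bucket still has to be serialized (the locking mentioned above), but buckets written by different workers can be combined afterwards without coordination.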
my guesses:
Cassandra supports counters. I think I saw an incr operation that should work concurrently. By keeping a free-running counter for your event, you just need to set up something that samples all counters at specified intervals (5 min?); then you can give estimates between two samples.
(http://wiki.apache.org/cassandra/Counters)
Cassandra can also expire a column after a timeout. I never really used it, but it might be worth a try.
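The sampling idea from the first guess can be sketched without any database at all: keep periodic snapshots of a free-running counter and subtract. The snapshot list and function name below are hypothetical, and the result counts events (orders), not unique people, so it only covers the plain counting part of the question.

```python
import bisect


def count_between(samples: list, start: int, end: int) -> int:
    """Events counted between two timestamps, at sample resolution.

    samples is a list of (timestamp, running_total) pairs sorted by
    timestamp, e.g. one snapshot of the counter every 300 seconds.
    """
    times = [t for t, _ in samples]

    def total_at(ts: int) -> int:
        # Latest snapshot taken at or before ts; 0 if there is none yet.
        i = bisect.bisect_right(times, ts) - 1
        return samples[i][1] if i >= 0 else 0

    return total_at(end) - total_at(start)


snapshots = [(0, 0), (300, 40), (600, 95), (900, 130)]   # (seconds, total orders)
print(count_between(snapshots, 300, 900))   # 90
```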
