Elastic Search Indexing the Internet - elasticsearch

This is mostly a Design Pattern Question for Elastic Search.
If I wanted to index The Internet with Elastic Search, what would be the most efficient way to organize such a task?
#kimchy talks about different patterns and Rafal Kuc discusses scaling massive clusters, but I didn't get a sense of how to organize an index of the internet after watching these.
I think logically you could organize such an effort by creating a new index for each domain. So you could shard heavily on indexes like Stackoverflow.com but maybe have as little as 1 shard for indexes like momandpopsite.com.
Does that look efficient to you ES Community? I'm not sure because we can very quickly get into millions of indexes not to mention their individual shards. And now I'm wondering if there is a lot of overhead associated with this type of design and it becomes bloated. (That is, does this pattern's structure create too much overhead?).
I know this question has to be theoretical because resources are not specified. But if you could use your imagination and try to stick purely to a design strategy -- how would you index the world wide web? Let's say there are 275 million domains. What is the most efficient design pattern for indexing the internet using Elastic Search?

An index per domain (so 275 million indexes) is not feasible. Indexes do have an overhead, and I've lost the reference, but I don't think you want more than ~100 indexes on a single "normal" server.
To get more sites into a single index, you would want to introduce routing and views, but I would imagine that a single index for everything would also introduce unneeded overhead. I'm guessing, but the routing rule lookup, for example, might become incredibly large. So you would want to find some way of splitting things across indexes. At such a high volume you can't design it all on paper, so I would advise proof-of-concept work to determine what kind of performance you get for different-sized indexes. You would then use aliases to map to the correct underlying index.
For further reading:
https://groups.google.com/forum/#!searchin/elasticsearch/index$20per$20user/elasticsearch/i-G5NlP1VeY/PK9vVP0myAgJ
https://groups.google.com/forum/#!msg/elasticsearch/9L5cWIAib94/K7zdHEW-4P0J
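To make the alias idea a bit more concrete, here is a minimal Python sketch, not a tested design. It assumes a fixed number of shared indexes, and the `web-NNN` index names, the bucket count of 64, the `site-` alias prefix and the `domain` field are all made up for illustration. Each domain is hashed to one shared index, and a filtered alias with routing makes it look as if the domain had an index of its own. The dictionary follows the shape of an `add` action for the `_aliases` API.

```python
import hashlib


def bucket_index(domain: str, n_buckets: int = 64, prefix: str = "web") -> str:
    """Map a domain to one of a fixed number of shared indexes (hypothetical naming)."""
    digest = hashlib.sha1(domain.lower().encode("utf-8")).hexdigest()
    return f"{prefix}-{int(digest, 16) % n_buckets:03d}"


def alias_action(domain: str, n_buckets: int = 64) -> dict:
    """Build one 'add' action: a filtered, routed alias that stands in for a per-domain index."""
    name = domain.lower()
    return {
        "add": {
            "index": bucket_index(name, n_buckets),
            "alias": f"site-{name}",               # hypothetical alias naming scheme
            "filter": {"term": {"domain": name}},  # assumes a keyword field called "domain"
            "routing": name,                       # keeps one domain's documents on one shard
        }
    }


if __name__ == "__main__":
    import json

    body = {"actions": [alias_action(d) for d in ("stackoverflow.com", "momandpopsite.com")]}
    print(json.dumps(body, indent=2))
```

Whether per-domain aliases hold up at hundreds of millions of domains is exactly the kind of thing the PoC would have to measure. The application could also compute the bucket and routing value itself and skip the alias.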

Related

Optimizing Elastic Search Index for many updates on few fields

We are working on a large Elasticsearch index (>1 billion documents) with many searchable fields, evenly distributed across >100 shards on 6 nodes.
Only a few of these fields are actually changed, but they are changed very often. About 25% of total requests are changes to these fields.
We have one field that simply holds a boolean, which accounts for more than 90% of the changes done to the document.
It looks like we are taking huge performance hits re-indexing the entire documents, even though a simple boolean is changing.
Reading around I found that this might be a case where one could store the boolean value within a parent-child field, as this would effectively place it in a separate index thus not forcing a recreation of the entire document. But I also read that this comes with disadvantages, such as more heap space usage for the relation.
What would be the best way to solve this challenge?
Yes. Since Elasticsearch is internally append-only (Lucene segments are immutable), every update effectively writes a new document and marks the old copy as deleted, to be cleaned up later when segments are merged.
Parent-child (aka join) or nested fields can help with this but they come with a significant performance hit of their own so your search performance probably will suffer.
Another approach is to use external fast storage like Redis (as described in this article) though it would be harder to maintain and might also turn out to be inefficient for searches, depending on your use case.
General rule of thumb here is: all use cases are different and you should carefully benchmark all feasible options.
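As a rough illustration of the external-storage option, here is a minimal Python sketch. It is only an assumption about how such a setup might look: `FlagStore` is a plain in-memory dictionary standing in for something like Redis, and `filter_hits` and the hit shape (`{"_id": ...}`) are hypothetical names, not part of any client library. The frequently changing boolean lives outside the index, so flipping it never reindexes the document.

```python
from typing import Dict, Iterable, List


class FlagStore:
    """In-memory stand-in for an external key-value store such as Redis."""

    def __init__(self) -> None:
        self._flags: Dict[str, bool] = {}

    def set_flag(self, doc_id: str, value: bool) -> None:
        # Updating the flag never touches the search index.
        self._flags[doc_id] = value

    def get_flag(self, doc_id: str, default: bool = False) -> bool:
        return self._flags.get(doc_id, default)


def filter_hits(hits: Iterable[dict], store: FlagStore, wanted: bool) -> List[dict]:
    """Post-filter search hits on the externally stored flag."""
    return [hit for hit in hits if store.get_flag(hit["_id"]) == wanted]


if __name__ == "__main__":
    store = FlagStore()
    store.set_flag("doc-1", True)
    hits = [{"_id": "doc-1"}, {"_id": "doc-2"}]  # pretend these came back from a search
    print(filter_hits(hits, store, wanted=True))
```

The sketch also shows where the search-side cost comes from: the flag cannot take part in the query itself, so filtering happens after the hits come back, which complicates paging and aggregations.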

How important is it to use separate indices for percolator queries and their documents?

The Elasticsearch documentation on the percolate query recommends using separate indices for the queries and the documents being percolated:
Given the design of percolation, it often makes sense to use separate indices for the percolate queries and documents being percolated, as opposed to a single index as we do in examples. There are a few benefits to this approach:
Because percolate queries contain a different set of fields from the percolated documents, using two separate indices allows for fields to be stored in a denser, more efficient way.
Percolate queries do not scale in the same way as other queries, so percolation performance may benefit from using a different index configuration, like the number of primary shards.
At the bottom of the page here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-percolate-query.html
I understand this in theory, but I'd like to know more about how necessary this is for a large index (say, 1 million registered queries).
The tradeoff in my case is that creating a separate index for the document is quite a bit of extra work to maintain, mainly because both indices need to stay "in sync". This is difficult to guarantee without transactions, so I'm wondering if the effort is worth it for the scale I need.
In general I'm interested in any advice regarding the design of the index/mapping so that it can be queried efficiently. Thanks!
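One part of keeping the two indices aligned is the field mappings, since the index holding the percolator field also needs mappings for the fields the stored queries refer to. A minimal Python sketch of deriving one from the other is below. It assumes the typeless mapping format of recent Elasticsearch versions, and the function names, the `query` field name and the shard count are illustrative choices, not something the documentation prescribes.

```python
import copy


def percolator_index_body(document_mapping: dict, query_field: str = "query", shards: int = 1) -> dict:
    """Derive a create-index body for the query index from the document index mapping."""
    properties = copy.deepcopy(document_mapping["properties"])
    if query_field in properties:
        raise ValueError(f"field name {query_field!r} collides with a document field")
    # The stored queries live in a field of type "percolator".
    properties[query_field] = {"type": "percolator"}
    return {
        "settings": {"number_of_shards": shards},  # can differ from the document index
        "mappings": {"properties": properties},
    }


def percolate_request(document: dict, query_field: str = "query") -> dict:
    """Search body that matches one document against the registered queries."""
    return {"query": {"percolate": {"field": query_field, "document": document}}}


if __name__ == "__main__":
    doc_mapping = {"properties": {"message": {"type": "text"}}}
    print(percolator_index_body(doc_mapping, shards=3))
    print(percolate_request({"message": "a new bonsai tree in the office"}))
```

Generating the query index body from the document mapping in one place at least removes the mapping drift; it does not address keeping the data in the two indices consistent.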

Performance features in Elasticsearch 5 that are not available in Solr 6

I am picking one of the two search engines above for a project, and so far both of them have proved similar in functionality.
At least for the requirements that I have:
Proximity Search
Boolean queries
query over all fields
Retrieval of original indexed document
Real-time search: as soon as I index a document, it should be available
Besides that, I will have a single type of document, about 40 million documents in total, roughly 2 TB of data.
That's basically what I need. My questions are:
Does one search engine perform better than the other considering my dataset, such as offering better indexing rates or search rates?
Am I losing anything by going with Solr (considering my requirements)?
Solr is my choice at the moment.
Some thoughts:
nobody can tell you which one would perform best for you unless you benchmark under your own realistic conditions
for 99% of users, either of the two would work perfectly
if you want to go with one of them (for any reason: you like it, your devs want to try it, you like the logo, whatever), then don't sweat it, both are very capable.

"Fan-out" indexing strategy

I'm planning to use Elasticsearch for a social network kind of platform where users can post "updates", be friends with other users and follow their friends' feed. The basic and probably most frequent query will be "get posts shared with me by friends I follow". This query could be augmented by additional constraints (like tags or geosearch).
I've learned that social networks usually take a fan-out-on-write approach to disseminate "updates" to followers so queries are more localized. So I can see 2 potential indexing strategies:
Store all posts in a single index and search for posts (1) shared with the requester and (2) whose author is among the list of users followed by the requester (the "naive" approach).
Create one index per user, inject posts that are created by followed users and directly search among this index (the "fan-out" approach).
The second option is obviously much more efficient from a search perspective, although it presents sync challenges (like the need to delete posts when I stop following a friend, for example). But the thing I would be most concerned with is the multiplication of indices; in a (successful) social network, we can expect at least tens of thousands of users...
So my questions here are:
how does ES cope with a very high number of indices? can it incur performance issues?
any thoughts about a better indexing strategy for my particular use-case?
Thanks
Each Elasticsearch index shard is a separate Lucene index, which means several open file descriptors and some memory overhead. Generally, even after reducing the number of shards per index from the default of 5, the resource consumption in an index-per-user scenario may be too large.
It is hard to give any concrete numbers, but my guess is that if you stick to two shards per index, you would be able to handle no more than 3000 users per m3.medium machine, which is prohibitive in my opinion.
However, you don't necessarily need a dedicated index for every user. You can use filtered aliases to share one index among multiple users. From the application's point of view, it would look like a per-user scenario, without incurring the overhead mentioned above. See this video for details.
That being said, I don't think Elasticsearch is a particularly good fit for a fan-out-on-write strategy. It is, however, a very good solution to employ in a fan-out-on-read scenario (something similar to what you've outlined as (1)):
The biggest advantage of using elasticsearch is that you are able to perform relevance scoring, typically based on some temporal features, like browsing context. Using elasticsearch to just retrieve documents sorted by timestamp means that you don't utilize its potential. Meanwhile, solutions like Redis will give you far superior read performance for such task.
Fan-out-on-write scenario means a lot of writes on each update (especially, if you have users with many followers). Elasticsearch is not a database and is not optimized for such usage-pattern. It is, however, prepared for frequent reads.
Fan-out-on-write also means that you are producing a lot of 'extra' data by duplicating info about posts. To keep this data in RAM, you need to store only metadata, like id of document in separate document storage and tags. Again, there are other formats than JSON to store and search this kind of structured data effectively.
Choosing between the two scenarios is a question about your requirements, like average number of followers, number of 'hubs' that nearly everybody follows, whether the feed is naturally ordered (e.g. by time) etc. I think that deciding whether to use elasticsearch needs to be a consequence of this analysis.
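For the fan-out-on-read side, the single-index query could look roughly like the Python sketch below. The field names (`author_id`, `shared_with`, `tags`, `created_at`) and the function name are assumptions about the mapping, not anything taken from the question; the function only builds the search body and sends nothing.

```python
from typing import Iterable, Optional


def feed_query(requester_id: str, followed_ids: Iterable[str],
               tag: Optional[str] = None, size: int = 20) -> dict:
    """Build a search body for 'posts shared with me by friends I follow'."""
    filters = [
        # (2) the author is one of the users the requester follows
        {"terms": {"author_id": sorted(set(followed_ids))}},
        # (1) the post is shared with the requester
        {"term": {"shared_with": requester_id}},
    ]
    if tag is not None:
        # Optional extra constraint, e.g. a tag
        filters.append({"term": {"tags": tag}})
    return {
        "size": size,
        "query": {"bool": {"filter": filters}},
        # Newest first; swap for a scoring clause if relevance ranking is wanted.
        "sort": [{"created_at": {"order": "desc"}}],
    }


if __name__ == "__main__":
    print(feed_query("user-1", ["user-7", "user-3"], tag="travel"))
```

All constraints sit in filter context, so they do not contribute to scoring. How well the `terms` filter behaves for users who follow a very large number of accounts is something to benchmark.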

elasticsearch - tips on how to organize my data

I'm trying out Elasticsearch by getting some data from Facebook and Twitter too.
The question is: how can I organize this data in an index?
/objects/posts
/objects/twits
or
/posts/post
/twits/twit
I'm trying queries such as, get posts by author_id = X
You need to think about the long term when deciding how to structure your data in Elasticsearch. How much data are you planning on capturing? Are search requests going to look into both Facebook and Twitter data? Also consider the volume of requests, the types of queries, and so on.
Personally I would start off with the first approach, localhost:9200/social/twitter,facebook/, as this avoids creating another index when it isn't necessarily required. You can easily search across both of the types, which has less overhead than searching across two indexes. There is quite an interesting article here about how to grow with intelligence.
Elasticsearch has many configuration options; essentially it's about finding a balance that fits your data.
The first one is the better approach, because creating two indices will create two Lucene instances, which will affect the response time.
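As a small illustration of the single-index approach, here is a Python sketch that builds the "posts by author_id = X" search. Because later Elasticsearch versions removed mapping types, it assumes a plain `source` field to tell Facebook and Twitter records apart instead of two types; the `social` index name is borrowed from the answer above, and `source`, `author_id` and the function name are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def author_search(author_id: str, sources: Optional[Sequence[str]] = None,
                  index: str = "social") -> Tuple[str, dict]:
    """Return (path, body) for a search by author, optionally limited to some sources."""
    filters = [{"term": {"author_id": author_id}}]
    if sources:
        # e.g. ["twitter"] or ["twitter", "facebook"]
        filters.append({"terms": {"source": list(sources)}})
    return f"/{index}/_search", {"query": {"bool": {"filter": filters}}}


if __name__ == "__main__":
    print(author_search("X"))
    print(author_search("X", sources=["twitter"]))
```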

Resources