elasticsearch - tips on how to organize my data


I'm trying out Elasticsearch by pulling in some data from Facebook and Twitter.
The question is: how should I organize this data into indices?
/objects/posts
/objects/twits
or
/posts/post
/twits/twit
I'm trying queries such as: get posts by author_id = X.

You need to think about the long term when deciding how to structure your data in Elasticsearch. How much data are you planning to capture? Will search requests look into both Facebook and Twitter data? What request volume and what types of queries do you expect?
Personally I would start off with the first approach, localhost:9200/social/twitter,facebook/, as this avoids creating another index when it isn't necessarily required. You can easily search across both types, which has less overhead than searching across two indexes. There is quite an interesting article here about how to grow with intelligence.
Elasticsearch has many configuration options; essentially it's about finding a balance that fits your data.
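As a rough sketch of the author_id = X lookup under the single-index layout, the helper below only builds the request path and body; it does not talk to a cluster. The index name social, the field name author_id and the helper name are assumptions for illustration. Paths with a type segment only work on older Elasticsearch versions, since mapping types were later removed.

```python
import json


def build_author_search(author_id, index="social", types=("twitter", "facebook")):
    """Return (path, body) for a search for documents by one author.

    The index, type and field names are assumptions for illustration.
    """
    # Comma-separated types in the path, as in localhost:9200/social/twitter,facebook/
    path = "/{}/{}/_search".format(index, ",".join(types))
    # A term query matches the exact value stored in author_id.
    body = {"query": {"term": {"author_id": author_id}}}
    return path, body


path, body = build_author_search("X")
print(path)
print(json.dumps(body))
```

The same body can be sent to either layout from the question; only the path changes.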

The first one is the better approach, because creating two indices creates at least two separate Lucene indexes (one per shard), which will affect the response time.

Related

Elastic Search: One index with custom type to differentiate document schemas VS multiple index, one per document type?

I am not experienced in ES (my background is more of relational databases) and I am trying to achieve the goal of having a search bar in my web application to search the entire content of it (or the content I will be willing to index in ES).
The architecture is Jamstack, with a Gatsby application fetching content (sometimes at build time, sometimes at runtime) from a Strapi application (headless CMS). In the middle, I developed a microservice that writes the documents created in the Strapi application to the ES database. At the moment there is only one index for all the documents, regardless of type.
My problem is that, as the application grows and different types of documents are created (sometimes very different from one another; for example, I can have a news article and a hospital), I am having a hard time querying the database correctly, because I have to define a lot of specific conditions in the query to cover all types of documents.
I see two solutions. The first is to keep a single index and break the query down into several queries: when the user hits the search button, those queries are run and the results are joined together before being presented. The second is to split the single index into several, one per document type. That leads me to another question: is it possible to query multiple indexes at once and define index-specific fields in the query?
Which is the best approach? I hope I have made myself clear.
Thanks in advance.

Given your example, where one kind of document is news and another is a hospital, it makes sense to create multiple indices (though it also depends on how many such different types you have). There are pros and cons to both approaches, and once you know them you can choose one based on your use case.
Before I list the pros and cons, the answer to your other question is yes: you can query multiple indices in a single search request by listing them comma-separated in the request path, and the multi-search API lets you send a different query to each index in one request.
Pros of having a single index
Less management overhead than with multiple indices (this is why I asked how many such indices you may have in your application).
More performant search queries, as the data is in a single place.
Cons
You are indexing different types of documents, so you will have to include a complex filter to get the data that you need.
Relevance will suffer, because the mix of documents affects the IDF component of the similarity algorithm (BM25).
Pros of having different indices
It's better to separate the data based on its properties, for more relevant results.
Your search queries will not be complex.
If you have a really large amount of data, it makes sense to split it, to get an optimal shard size and better performance.
Cons
More management overhead.
If you need to search in all indices, you have to implement multi-search and wait for the results from every index, which might be costly.
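To make the multi-search option more concrete, here is a minimal sketch of how a request body for the _msearch endpoint could be assembled, with a different field list per index. The index names (news, hospitals) and field names are assumptions; the code only builds the newline-delimited body and does not send it.

```python
import json


def build_msearch_body(searches):
    """Serialize (index, query) pairs into the newline-delimited _msearch format."""
    lines = []
    for index, query in searches:
        lines.append(json.dumps({"index": index}))  # header line: target index
        lines.append(json.dumps({"query": query}))  # body line: query for that index
    # The multi-search body has to end with a newline.
    return "\n".join(lines) + "\n"


text = "cardiology"
ndjson = build_msearch_body([
    # Hypothetical indices, each searched on its own fields.
    ("news", {"multi_match": {"query": text, "fields": ["title", "body"]}}),
    ("hospitals", {"multi_match": {"query": text, "fields": ["name", "specialties"]}}),
])
print(ndjson)
```

Each index gets its own response entry, so the application still has to decide how to merge the result lists before showing them.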

Is ElasticSearch suited for retrieving a very large number of search records?

So, our production environment has an ES cluster that contains all our products inventory (ID and attributes) where each product is mapped to one document. Internally, one of our use cases is to create a logical grouping of these products based on text matching on a bunch of these product attributes.
Often times, it's possible that a product set could contain a very large number of products, say, 5 million. That is, the query to create a product set could match about 5 million documents.
Now, my question is, is ES capable of handling such large retrievals of documents, or is it recommended to use a backing store like Cassandra or HBase to fetch a huge number of documents? Note that I'm not concerned about realtime use cases - I'm okay with having an asynchronous execution of the product set creation, so latency isn't a major concern for me. From what I understand, ES provides the Scroll API to retrieve a large number of documents, but, I'm approaching the problem more from a school of thought perspective.
Is it fine to use ES to fetch a very large number of documents, in the range of 5-10 million? Or should we use a parallel DB with big data capabilities to fetch the data and use ES only as the search store?

TL;DR: no, it is not meant for retrieving large sets of documents, although you can work around that with different approaches.
Note that the Scroll API might not be suitable for purposes other than reindexing:
Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one data stream or index into a new data stream or index with a different configuration.
Another way to achieve it would be the search_after parameter:
search_after is not a solution to jump freely to a random page but rather to scroll many queries in parallel. It is very similar to the scroll API but unlike it, the search_after parameter is stateless, it is always resolved against the latest version of the searcher. For this reason the sort order may change during a walk depending on the updates and deletes of your index.
Reconsider whether your use case really needs to paginate exhaustively over large sets of documents, since Elasticsearch's strength doesn't lie in large result sets.
Consult the documentation:
Paginate search results
index.max_result_window
Track total hits
Scroll API
Search after parameter
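For illustration, a search_after walk could look roughly like the sketch below. The search argument is a hypothetical caller-supplied function that takes a request body and returns the parsed search response; here it is replaced by an in-memory stand-in so the loop can be run without a cluster. The sort field product_id is an assumption.

```python
def iterate_all(search, page_size=1000):
    """Yield every hit by passing the last hit's sort values as search_after.

    `search` is a caller-supplied function (hypothetical here) that takes a
    request body dict and returns the parsed JSON response of a search.
    """
    body = {
        "size": page_size,
        "query": {"match_all": {}},
        # A deterministic sort on a unique field is needed; the field
        # name "product_id" is an assumption.
        "sort": [{"product_id": "asc"}],
    }
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            yield hit
        # Continue after the sort values of the last hit on this page.
        body["search_after"] = hits[-1]["sort"]


def fake_search(body, _data=list(range(1, 8))):
    """In-memory stand-in for a cluster, used only to exercise the loop."""
    after = body.get("search_after", [0])[0]
    page = [v for v in _data if v > after][: body["size"]]
    return {"hits": {"hits": [{"_id": str(v), "sort": [v]} for v in page]}}


ids = [hit["_id"] for hit in iterate_all(fake_search, page_size=3)]
print(ids)
```

Because the walk is stateless, documents added or deleted while it runs can shift what later pages return, as the quoted documentation warns.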

Best way to store votes in elasticsearch for a reddit like system

I am building a site similar to reddit using Elasticsearch and trying to decide on the best place to store the up/down votes. I can think of a couple of options.
Store as part of the document.
In this case, any vote will trigger an update of the document. According to the Elasticsearch documentation, this is essentially a replacement of the whole document, which seems to be a very expensive operation.
Store in another database.
Store votes in another database such as SQL/MongoDB and update Elasticsearch periodically. In this case, we have to tolerate some delay before new votes affect search results, which is not ideal, and it also increases complexity and maintenance cost.
Store in another index in Elasticsearch.
This separates the concerns by index: one mostly read-only, one read-write. Is there an efficient way to merge the two indices so that I can order by votes at query time?
Any suggestions on these options, or a better way to handle this?

There is a fourth option: store votes in separate documents of a different type, but in the same index as the original document. The votes type can be made a child of the article type. This setup lets you query articles and votes at the same time using has_child filters and queries. It also means that only a small votes document has to be reindexed every time a vote occurs, instead of the large article document. On the negative side, has_child and has_parent queries require loading the parent/child map into memory, so this approach has a non-trivial memory footprint compared to the other options you have described.
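A has_child query along these lines might be built as in the sketch below. It assumes one child document per vote, a child type named vote, and fields named title and direction; all of these are hypothetical. The code only constructs the request body.

```python
import json


def articles_by_upvotes(text, child_type="vote"):
    """Build a query that matches articles and scores them by number of up-votes.

    The child type "vote" and the fields "title" and "direction" are assumptions.
    """
    return {
        "query": {
            "bool": {
                # Restrict to articles matching the search text, without scoring.
                "filter": [{"match": {"title": text}}],
                "must": [{
                    "has_child": {
                        "type": child_type,
                        # Each matching up-vote contributes a constant score of 1.
                        "query": {"constant_score": {
                            "filter": {"term": {"direction": "up"}}}},
                        # Sum the child scores, so the score is the up-vote count.
                        "score_mode": "sum",
                    }
                }],
            }
        }
    }


query = articles_by_upvotes("elasticsearch")
print(json.dumps(query, indent=2))
```

Note that this form only returns articles with at least one up-vote; articles without votes would need a separate clause.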

"Fan-out" indexing strategy

I'm planning to use Elasticsearch for a social network kind of platform where users can post "updates", be friends with other users and follow their friends' feed. The basic and probably most frequent query will be "get posts shared with me by friends I follow". This query could be augmented by additional constraints (like tags or geosearch).
I've learned that social networks usually take a fan-out-on-write approach to disseminate "updates" to followers so queries are more localized. So I can see 2 potential indexing strategies:
Store all posts in a single index and search for posts (1) shared with the requester and (2) whose author is among the list of users followed by the requester (the "naive" approach).
Create one index per user, inject posts that are created by followed users and directly search among this index (the "fan-out" approach).
The second option is obviously much more efficient from a search perspective, although it presents sync challenges (like the need to delete posts when I stop following a friend, for example). But the thing I would be most concerned with is the multiplication of indices; in a (successful) social network, we can expect at least tens of thousands of users...
So my questions here are:
how does ES cope with a very high number of indices? can it incur performance issues?
any thoughts about a better indexing strategy for my particular use-case?
Thanks

Each Elasticsearch index shard is a separate Lucene index, which means several open file descriptors and some memory overhead. Generally, even after reducing the number of shards per index from the default of 5 (the default at the time; newer versions default to 1), the resource consumption in an index-per-user scenario may be too large.
It is hard to give concrete numbers, but my guess is that if you stick to two shards per index, you would be able to handle no more than 3000 users per m3.medium machine, which is prohibitive in my opinion.
However, you don't necessarily need a dedicated index for every user. You can use filtered aliases to share one index among multiple users. From the application's point of view it would look like a per-user scenario, without incurring the overhead mentioned above. See this video for details.
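As a sketch of the filtered-alias idea, the snippet below builds a request body for the _aliases endpoint that gives each user an alias over one shared index. The index name posts, the alias naming scheme and the owner_id field are assumptions for illustration; nothing is sent to a cluster.

```python
import json


def user_alias_action(user_id, index="posts"):
    """Build an _aliases 'add' action giving one user a filtered view of a shared index."""
    return {
        "add": {
            "index": index,
            "alias": "feed_user_{}".format(user_id),
            # Only documents whose (hypothetical) owner_id equals this user are visible.
            "filter": {"term": {"owner_id": user_id}},
            # Optional: requests through the alias use this routing value,
            # so one user's documents end up on one shard.
            "routing": str(user_id),
        }
    }


body = {"actions": [user_alias_action(uid) for uid in (1, 2, 3)]}
print(json.dumps(body, indent=2))
```

The application can then search feed_user_1 as if it were a dedicated index.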
That being said, I don't think Elasticsearch is a particularly good fit for a fan-out-on-write strategy. It is, however, a very good solution for a fan-out-on-read scenario (something similar to what you've outlined as (1)):
The biggest advantage of using Elasticsearch is that you can perform relevance scoring, typically based on some temporal features, like browsing context. Using Elasticsearch just to retrieve documents sorted by timestamp means that you don't use its potential. Meanwhile, solutions like Redis will give you far superior read performance for such a task.
A fan-out-on-write scenario means a lot of writes on each update (especially if you have users with many followers). Elasticsearch is not a database and is not optimized for such a usage pattern. It is, however, prepared for frequent reads.
Fan-out-on-write also means that you are producing a lot of 'extra' data by duplicating information about posts. To keep this data in RAM, you need to store only metadata, such as tags and the ID of the document in a separate document store. Again, there are formats other than JSON for storing and searching this kind of structured data effectively.
Choosing between the two scenarios depends on your requirements, such as the average number of followers, the number of 'hubs' that nearly everybody follows, whether the feed is naturally ordered (e.g. by time), and so on. I think the decision whether to use Elasticsearch needs to be a consequence of this analysis.
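For comparison, the fan-out-on-read query over a single shared index (the "naive" approach from the question) could be sketched as follows. The field names author_id, shared_with, tags and created_at are assumptions, and the code only builds the request body.

```python
def feed_query(requester_id, followed_ids, tags=None, size=20):
    """Build a search body for 'posts shared with me by friends I follow'.

    All field names are assumptions for illustration.
    """
    filters = [
        # (2) the author is among the users followed by the requester
        {"terms": {"author_id": list(followed_ids)}},
        # (1) the post is shared with the requester
        {"term": {"shared_with": requester_id}},
    ]
    if tags:
        # Optional additional constraint, e.g. tags.
        filters.append({"terms": {"tags": list(tags)}})
    return {
        "size": size,
        # Filter context: no scoring, and the clauses are cacheable.
        "query": {"bool": {"filter": filters}},
        "sort": [{"created_at": "desc"}],
    }


body = feed_query(7, [3, 5, 9], tags=["travel"])
print(body)
```

The list of followed users has to be supplied by the application on every request, which is the read-time cost of this approach.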

Elastic Search Indexing the Internet

This is mostly a Design Pattern Question for Elastic Search.
If I wanted to index The Internet with Elastic Search, what would be the most efficient way to organize such a task?
#kimchy talks about different patterns and Rafal Kuc discusses scaling massive clusters, but I didn't get a sense of how to organize an index of the internet after watching these.
I think logically you could organize such an effort by creating a new index for each domain. So you could shard heavily on indexes like Stackoverflow.com but maybe have as little as 1 shard for indexes like momandpopsite.com
Does that look efficient to you ES Community? I'm not sure because we can very quickly get into millions of indexes not to mention their individual shards. And now I'm wondering if there is a lot of overhead associated with this type of design and it becomes bloated. (That is, does this pattern's structure create too much overhead?).
I know this question has to be theoretical because resources are not specified. But if you could use your imagination and try to stick purely to a design strategy: how would you index the world wide web? Let's say there are 275 million domains. What is the most efficient design pattern for indexing the internet using Elastic Search?

An index per domain (so 275 million indexes) is not feasible. Indexes do have an overhead, and I've lost the reference, but I don't think you want more than ~100 indexes on a single "normal" server.
To get more sites into a single index, you would want to introduce routing and views, but I would imagine that a single index for everything would also introduce unneeded overhead. I'm guessing, but the routing rule lookup might become incredibly large, for example. So you would want to find some way of splitting things across indexes. At such a high volume you can't design it all on paper, so I would advise proof-of-concept work to determine what kind of performance you get for different index sizes. You would then use aliases to map correctly to the underlying index.
For further reading:
https://groups.google.com/forum/#!searchin/elasticsearch/index$20per$20user/elasticsearch/i-G5NlP1VeY/PK9vVP0myAgJ
https://groups.google.com/forum/#!msg/elasticsearch/9L5cWIAib94/K7zdHEW-4P0J
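One possible way of splitting domains across a fixed number of indexes, shown purely as an illustration and not as the approach of the answer above, is to hash each domain into a bucket and use the domain itself as the routing value. The bucket count, the index prefix web and the function name are assumptions; a suitable bucket count would have to come out of the proof-of-concept work.

```python
import hashlib


def index_for_domain(domain, buckets=64, prefix="web"):
    """Map a domain to one of `buckets` index names, plus a routing value.

    The bucket count and naming scheme are assumptions for illustration.
    """
    key = domain.lower()
    # A stable hash, so the same domain always lands in the same index.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    index = "{}_{:03d}".format(prefix, bucket)
    # Using the domain as routing value keeps one site's pages on one shard.
    return index, key


for d in ("Stackoverflow.com", "momandpopsite.com"):
    print(d, index_for_domain(d))
```

This keeps the number of indexes bounded regardless of how many domains are crawled, at the cost of mixing many small sites in one index.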
