Improve mapping performance on Elasticsearch

My Elasticsearch cluster contains indices with giant mappings, because some of my indices contain up to 60k different fields.
To elaborate a bit about my setup: each index contains information from a single source. Each source has several types of data (what I'll call layers), which are indexed as different types in the index corresponding to the source. Each layer has different attributes (20 on average). To avoid field name collisions, they are indexed as "LayerId_FieldId".
I'm trying to find a way to reduce the size of my mapping (as, to my understanding, it might cause performance issues). One option is having one index per layer (and perhaps spreading large layers over several indices, each responsible for a different time segment). I have around 4000 different layers indexed right now, so let's say that with this method I will have 5000 different indices. Is Elasticsearch fine with that? What should I be worried about (if at all) with such a large number of indices, some of them very small (some of the layers have as few as 100 items)?
A second possible solution is the following. Instead of saving a layer's data in the way it is sent to me, for example:
"LayerX_name" : "John Doe",
"LayerX_age" : 34,
"LayerX_isAdult" : true,
it will be saved as :
"value1_string" : "John Doe",
"value2_number" : 34,
"value3_boolean" : true,
In the latter option, I will have to keep a metadata index which links the generic names to the real field names. In the above example, I need to know that for layer X the field "value1_string" corresponds to "name". Thus, whenever I receive a new document to index, I have to query the metadata in order to know how to map the given fields to my generic names. This allows me to have a constant-size mapping (say, 50 fields for each value type, so several hundred fields overall). However, it introduces some overhead, and more importantly I feel it basically reduces my database to a relational one: I lose the ability to handle documents of arbitrary structure.
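For illustration, here is a minimal sketch of how that metadata lookup could work at indexing time. It assumes a hypothetical "layer_metadata" index that stores the generic-to-real field mapping per layer; all index, type, and field names are made up, and the calls follow the 2.x-era Python client:

# Minimal sketch of option 2: translate real field names to generic ones
# using a (hypothetical) metadata index before indexing.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

def load_field_map(layer_id):
    # e.g. {"name": "value1_string", "age": "value2_number", "isAdult": "value3_boolean"}
    doc = es.get(index="layer_metadata", doc_type="layer", id=layer_id)
    return doc["_source"]["field_map"]

def index_document(layer_id, raw_doc):
    field_map = load_field_map(layer_id)
    generic_doc = {field_map[field]: value for field, value in raw_doc.items()}
    generic_doc["layer_id"] = layer_id  # keep the layer id for filtering
    es.index(index="layers", doc_type="item", body=generic_doc)

index_document("LayerX", {"name": "John Doe", "age": 34, "isAdult": True})

The extra round trip to the metadata index is part of the overhead mentioned above; caching the field map per layer in the indexing process would remove most of it.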
Some technical details about my cluster:
Elasticsearch version 2.3.5
22 nodes, 3 of them dedicated masters; each node has 16 GB of RAM and 2 TB of disk storage. In total I currently have 6 TB of data spread over 1.2 billion docs, 55 indices, and 1500 shards.
I'd appreciate your input on the two solutions I suggested, or any other alternatives you have in mind!

Related

What are the points to be taken care of when using a parent-child relationship in Elasticsearch?

I am using ES 7.3 and I have approximately 5 TB of data to be indexed in ES in the form of a parent-child relationship. I know that searching is up to 100 times slower with this relationship approach, but my use case is such that I have to stick with it, so I wanted to know what points I should take care of to improve the performance of my cluster. I also have to run a lot of has_child and has_parent queries.
Note: the size of my parent docs is small, but the child docs are large, as each may consist of 100 fields.
For example, the hardware requirements, i.e. what hardware on the cloud, or which specific machines, would be suitable.
How to decide the number of shards.
How many nodes should there be in the cluster to handle such an amount of data.
Any other points that are important to consider.

Advice on efficient ElasticSearch document design

I'm working on a project that deals with listings (think: Craiglist, Ebay, Trulia, etc).
The basic unit of information is a "Listing", something like this:
{
"id": 1,
"title": "Awesome apartment!",
"price": 1000000,
// other stuff
}
Some fields can be searched on (e.g. price, location, etc.); others are just for display purposes in the application (e.g. title, or description, which contains lots of HTML, etc.).
My question is: should I store all the data in one document, or split it into two (one for searching, e.g. 'ListingSearchIndex', and one for display, e.g. 'ListingIndex')?
I also have to do some pretty hefty aggregations across the documents.
I guess the question is: would searching across smaller documents and then doing another call to fetch the results by id be faster than just searching across the full documents?
The main factor is obviously speed, but if I split the documents then maintenance would be a factor too.
Any suggestions on best practices?
Thanks :)
In my experience with Elasticsearch, shard configuration has been significant for cluster performance and speed when querying, aggregating, etc. Since every shard by itself consumes cluster resources (memory/CPU) and adds to cluster overhead, it is important to get the shard count right so the cluster is not overloaded. Our cluster was over-sharded, and it impacted loading search results, visualizations, heavy aggregations, etc. Once we fixed our shard count, it worked flawlessly!
https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
Aim to keep the average shard size between a few GB and a few tens of GB. For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size.
The number of shards you can hold on a node will be proportional to the amount of heap you have available, but there is no fixed limit enforced by Elasticsearch. A good rule-of-thumb is to ensure you keep the number of shards per node below 20 to 25 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600-750 shards, but the further below this limit you can keep it the better. This will generally help the cluster stay in good health.
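As a quick worked check of that rule of thumb (the numbers are taken straight from the quote above):

# Shards per node should stay below roughly 20-25 per GB of configured heap.
heap_gb = 30                       # the example heap size from the quote
low, high = 20 * heap_gb, 25 * heap_gb
print(f"With a {heap_gb} GB heap, keep the node below {low}-{high} shards")
# -> With a 30 GB heap, keep the node below 600-750 shards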
Besides performance, I think there are other aspects to consider here.
ElasticSearch offers weaker guarantees in terms of correctness and robustness than other databases (on this topic see their blog post ElasticSearch as a NoSQL database). Its focus is on search, and search performance.
For those reasons, as they mention in the blog post above:
Elasticsearch is commonly used in addition to another database
One way to go about following that pattern:
Store your data in a primary database (e.g. a relational DB)
Index only what you need for your search and aggregations, and to link search results back to items in your primary DB
Get what you need from the primary DB before displaying - i.e. the data for display should mostly come from the primary DB.
The gist of this approach is to not treat ElasticSearch as a source of truth; and instead have another source of truth that you index data from.
Another advantage of doing things that way is that you can easily reindex from your primary DB when you change your index mapping for a new search use case (or on changing index-time processing like analyzers etc...).
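A minimal sketch of that pattern might look like the following, using a listings use case as in the question; the index and field names are made up, and the calls roughly follow the pre-7.x Python client:

# Index only the searchable fields plus the primary key; fetch display data
# from the primary database after searching.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

def index_listing(listing_row):
    es.index(
        index="listing_search",
        doc_type="listing",
        id=listing_row["id"],
        body={"price": listing_row["price"], "location": listing_row["location"]},
    )

def search_listings(max_price):
    result = es.search(
        index="listing_search",
        body={"query": {"range": {"price": {"lte": max_price}}}},
    )
    ids = [hit["_id"] for hit in result["hits"]["hits"]]
    return fetch_from_primary_db(ids)  # hypothetical accessor for the primary DB

def fetch_from_primary_db(ids):
    ...  # e.g. SELECT * FROM listings WHERE id IN (...)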
I think you can't answer this question without knowing all your queries in advance. For example, suppose you split the documents and later decide that you need to filter on a field stored in one index and sort by a field stored in the other. That would be a big problem!
So my advice: if you are not sure where you are heading, just put everything in one index. You can always reindex and remodel later.
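Reindexing into a remodeled index can be as simple as the following sketch, assuming an Elasticsearch version that has the _reindex API (5.0+) and made-up index names:

# Copy everything from the old index into a new index with the revised mapping.
import requests

resp = requests.post(
    "http://localhost:9200/_reindex",
    json={"source": {"index": "listings_v1"}, "dest": {"index": "listings_v2"}},
)
resp.raise_for_status()
print(resp.json()["total"], "documents copied")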

How to split a multi-typed index to prepare for an ES upgrade (where multiple types are deprecated)

We are currently running a cluster with ES 2.3.2 that has one large index with the following properties:
762 GB (366 million docs)
25 data nodes; 3 master nodes; 3 client nodes
23 shards / 1 replica
This one index has 20+ types, each with a few common and many unique fields. I am redesigning the cluster with the following goals:
1) Remove multiple types in an index so that we can upgrade ES. Though multi-types are supported in v5, we want to do the work to prep for v6 now.
2) Break up the large index into more manageable smaller indexes
I have set up a new identical cluster. I modified the indexing so that I have one index per type. I allocated a shard count based on the relative size of the data with a minimum of 2, and a max of 5 shards. After indexing all of our data into this new cluster, I am finding that the same query against the new cluster is slower than against the old cluster.
I figured this was due to the explosion of shards (it was 23 primaries, and now it is 78). I closed all but one index (which has a shard count of 2), then ran a test where I targeted a single type against my old monolithic index and against the new single-typed index (using a homebrew tool to run requests in parallel and parse out the "took"). I find that if I do a "size: 0", my new cluster is faster. When I return 7 or 8 records they seem to be on par. It then goes downhill: our default query, which returns 30 records, is about twice as slow. I am guessing this is because there are fewer threads to do the actual retrieval in the smaller index with two shards vs. the large one with 23.
What is the recommendation for moving away from multi-typed indexes when the following is true:
- There are many types
- The types have very different mappings
- There is a huge variance in size per type, ranging from 4 MB to 154 GB
I am currently contemplating putting them all in one type with one massive mapping (I don't think there are any fields with the same name but different mappings), but that seems really ugly.
Any suggestions welcome,
Thanks,
~john
I don't know your data, but you could reduce the number of indexes in the following way.
Group the types that have similar mappings into one index. In that index, create a "type" property and handle it yourself in your queries (see the sketch below).
If every type has a completely different structure, I would put the smaller ones together in one index. After all, they were that way before.
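A minimal sketch of that "custom type field" approach, with illustrative index and field names:

# Merge the smaller types into one index, tag each document with its logical
# type, and always filter on that field in queries.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Indexing: add your own "type" field to each document.
es.index(index="small_types", doc_type="doc",
         body={"type": "invoice", "amount": 99.5, "customer": "ACME"})

# Querying: combine the real query with a filter on the custom field.
es.search(index="small_types", body={
    "query": {
        "bool": {
            "must": [{"match": {"customer": "ACME"}}],
            "filter": [{"term": {"type": "invoice"}}],
        }
    }
})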
Sharding is how Elasticsearch scales, so it makes sense that you observe performance degradation for a network/IO-bound operation when it's executed against 2 shards vs. 23, as that essentially means it was run on 2 nodes in parallel as opposed to 23.
If you want to split the index, you need to go over all of the types and identify the minimum number of shards each type needs for your target performance. It will depend on multiple factors such as the number of documents, document size, and request/indexing patterns. As you mentioned that the types vary significantly in size, the result will most likely be less balanced than your initial setup (2-5 shards): some of the indexes will need more shards, while some will do fine with fewer. For example, there is no need to split a 4 MB index (as in your example) into multiple shards, unless you expect it to grow significantly and have a high update rate and you want to scale indexing; otherwise 1 shard is fine.
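To see how the per-type indices actually turn out (and spot the unbalanced ones), something like the following sketch against the _cat/indices API can help; the host is illustrative:

# List primary shard count and size per index, largest first.
import requests

rows = requests.get("http://localhost:9200/_cat/indices?format=json&bytes=gb").json()
for row in sorted(rows, key=lambda r: float(r["pri.store.size"] or 0), reverse=True):
    print(row["index"], row["pri"], "primary shards,", row["pri.store.size"], "GB")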

Splitting a small amount of data over multiple indexes vs a single index?

We are in the process of jumping from ES 2.3.2 to 6.0. As part of this work, we are breaking up our monolithic index into multiple indexes. The index we have is split into 23 shards, each of which is about 50 GB. We are doing two things:
1) Split out our types (we have 26 types with widely varying fields) into individual indexes
2) Create date-based indexes
We are doing #1 because mapping types are being deprecated.
We are doing #2 as 80% of our queries are on data from just the last 30 days. We do have queries that go back over all time. Also, our data is mutable (any document from any date can be updated), so we are managing our index targeting in our api.
What I am dealing with now is that, when we split this all out, it works really well for some of our larger types. For the smaller ones, we end up with indexes that, once split out by type and date, are very small (like ~100 MB). I am concerned that for the small types we will end up with 15 indexes (we have a 15-month retention) that are all small, making searching inefficient. Is this really bad? I did a test of collapsing the smaller ones and not making them date-based, but I found performance actually went down. I am hypothesizing that this is because all the data was in one shard (I set it to one shard), so the search could not be parallelized.
My root question is to find out whether there is a penalty for having a small amount of data spread over multiple indexes vs collocating it in one index. We really need the date-based indexes for our larger types, and managing a mix of date-based vs non-date-based is undesirable.
thanks,
~john
If you are able to get the performance you need from time-sliced indices, I'd go with that approach. A simple indexing strategy is better than having another system to manage the number of indices based on size.
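A simple time-sliced strategy can be as small as deriving the target index name from the document's date, e.g. monthly buckets; the name pattern below is illustrative:

# Route each document to a monthly index such as events-2018-03.
from datetime import datetime

def target_index(doc_date: datetime, base: str = "events") -> str:
    return f"{base}-{doc_date:%Y-%m}"

print(target_index(datetime(2018, 3, 14)))   # events-2018-03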

ElasticSearch Scale Forever

ElasticSearch Community:
Suppose I have a customer named Twetter who has hired me today to build out their search capability for a 181 word social media site.
Assume I cannot predict the number of shards I will need for future scaling and the storage size is already in tens of terabytes.
Assume I do not need to edit any documents once they are indexed. This is strictly for searching.
Referencing the image above, there seem to be some documents (ref1, ref2, ref3) which point to 'rolling indexes', whereby I may create a single index (each index named tweets1 -> N) on the fly. When one index fills up, I can simply add a new machine with a new index, add it to the same cluster, and use an alias for searching.
Does this architecture hold water in production?
Are there any long term ramifications to this 'rolling index' architecture as opposed to predicting a shard count and scaling within that estimate?
A shard in elasticsearch is just a lucene index. An elasticsearch index is just a collection of lucene indices (shards). Given that, for capacity planning in your situation you simply need to figure out how many documents you can store in an index with only one shard and still get the query performance you want.
It is the underlying lucene indices that use up resources. Based on how your documents are indexed within the lucene indices, there is a finite number of shards that any single node in your cluster will be able to handle. You can always scale by adding more nodes to the cluster. Just monitor resource usage and query response times to know when to add more nodes.
It is perfectly reasonable to create indices named tweet_1, tweet_2, tweet_3, etc. rolling forward instead of worrying about resharding your data. It accomplishes the same thing in the end. Just use an index alias to hide the numbers.
Once you figure out how many documents you can store per shard while keeping your query performance, decide how many shards per index you want, multiply those numbers, and cap the index at that number of documents in your code. Once you reach the cap you just roll over to a new index. Here is what I do in my code to determine which index to send a document to (I have sequential ids):
// Bucket sequential ids into fixed-size indices: file_0, file_1, file_2, ...
$index = 'file_' . (int)($fid / $docsPerIndex);
Note that I am using index templates so it can automatically create a new index without me having to manually roll over when the cap is reached.
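An index template along these lines (sketched with the legacy pre-7.8 template syntax and made-up names; the exact syntax varies by Elasticsearch version) makes the rollover automatic and keeps all the rolling indices behind one search alias:

# Any new index matching tweet_* is created with these settings and joins
# the "tweets" alias, so searches never need to know the index numbers.
import requests

template = {
    "template": "tweet_*",
    "settings": {"number_of_shards": 5, "number_of_replicas": 1},
    "aliases": {"tweets": {}},
}
requests.put("http://localhost:9200/_template/tweets_template", json=template).raise_for_status()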
One other consideration is what type of queries you will be performing. As the data grows you have two options for scaling.
You need to have enough nodes in your cluster for parallelizing the query that it can easily search across all indices and still respond quickly.
or
You need to name your indices such that you know which to query and only need to query a subset of the indices in the cluster.
Keep in mind that if you have sequential or predictable ids then elasticsearch can perform id based queries efficiently without actually having to query the whole cluster. If you let ES automatically assign ids (assuming you are using ES >=1.4.0) it will use predictable ids (flake ids) already. This also speeds up indexing. Random ids create a worst case scenario.
If your queries are going to be time-based, then under this scheme each query will have to search the entire set of indices. For time-based queries you want to roll your indices over based on some amount of time (e.g. each day or month, depending on how much data you receive in that time frame) and name them something like tweets_2015_01, tweets_2015_02, etc. By doing so you can narrow the set of indices you have to search at query time based on the requested search time range (see the sketch below).
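A small sketch of that narrowing step, computing which monthly indices overlap a requested time range (the naming scheme follows the tweets_YYYY_MM example above; the commented-out search call is illustrative):

# Build the list of monthly index names covering [start, end].
from datetime import date

def monthly_indices(start: date, end: date, prefix: str = "tweets"):
    names, year, month = [], start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{year}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

indices = monthly_indices(date(2015, 1, 10), date(2015, 3, 5))
print(indices)   # ['tweets_2015_01', 'tweets_2015_02', 'tweets_2015_03']
# es.search(index=",".join(indices), body=query)  # query only the relevant indices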
