Scaling up Elasticsearch - elasticsearch

We have a problem with scaling up Elasticsearch.
We have a simple structure with multiple indexes; each index has 2 shards, 1 primary and 1 replica.
Right now we have ~3000 indexes, which means 6000 shards, and we think we will hit shard limits soon. (We are currently running 2 nodes with 32 GB of RAM and 4 cores at up to 65% usage; we are working on moving to 3 nodes so each index can have 1 primary shard and 2 replicas.)
Refresh interval is set to 60s.
Some indexes have 200 documents, others have 10 million; most have fewer than 200k.
The total number of documents is ~40 million (and it can grow fast).
Our search requests hit multiple indexes at the same time (we might search 50 or 500 indexes at once, and in the future we may need to search across all of them).
Searching needs to be fast.
Currently we synchronize all documents daily via bulk requests in chunks of 5000 documents (~7 MB), because our tests showed that works best: ~2.3 seconds per 5000-document (~7 MB) request, run by 10 async workers.
Sometimes several workers hit the same index at once and the bulk requests take longer, stretching to ~12.5 seconds per 5000-document (~7 MB) request.
The current synchronization process takes about 1 hour for 40 million documents.
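For reference, one synchronization chunk corresponds to a bulk request shaped roughly like this (the index name, IDs and fields are placeholders, not our real mapping):
    # One synchronization chunk, sent as a single bulk request
    POST /_bulk
    { "index": { "_index": "client_index_17", "_id": "0b0e3c4e-placeholder" } }
    { "synchronization_hash": "hash_of_run_42", "title": "example document", "value": 1 }
    { "index": { "_index": "client_index_17", "_id": "7f2d9a10-placeholder" } }
    { "synchronization_hash": "hash_of_run_42", "title": "another document", "value": 2 }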
Documents are stored by UUID (we use the UUIDs to fetch documents directly from Elasticsearch). Document values can be modified daily; sometimes we only change the synchronization_hash, which determines which documents were changed. After the whole synchronization process we run a delete on documents that still have the old synchronization_hash.
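The cleanup at the end of a run can be expressed as a delete-by-query; a minimal sketch, assuming a version where _delete_by_query is built in, a placeholder index name, and synchronization_hash mapped as a keyword:
    # Remove documents that were not touched by the latest synchronization run
    POST /client_index_17/_delete_by_query
    {
      "query": {
        "bool": {
          "must_not": {
            "term": { "synchronization_hash": "hash_of_run_42" }
          }
        }
      }
    }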
Another thing is that we think our data architecture is broken. We have ~300 clients (the number can increase), and each client is only allowed to search a specific set of indexes (from 50 to 500 of them). Those sets overlap between clients (client X has 50 indexes, client Y has 70, and X and Y most often need access to the same documents), which is why we store the data in separate shared indexes: that way we don't have to update every index where a given document would otherwise be duplicated.
To increase indexing speed we are even considering moving to 4 nodes (with 2 primaries and 2 replicas per index), or to 4 nodes with each index keeping just 2 shards (1 primary, 1 replica), but we need to test things to figure out what works best for us. We might need to double the number of documents in the next few months.
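Whichever layout we test, the per-index knobs involved (primary count, replica count, the 60 s refresh) are ordinary index settings; an illustrative example with made-up numbers:
    PUT /client_index_17
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 2,
        "refresh_interval": "60s"
      }
    }
Note that number_of_replicas and refresh_interval can be changed on a live index, while number_of_shards is fixed at creation time unless the index is reindexed.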
What do you think could be changed to increase indexing speed without reducing search speed?
What can be changed in our data architecture?
Is there another way our data should be organized to allow fast searching and faster indexing?
I have tried many chunk sizes for the synchronization; I haven't tried changing the architecture.
We are trying to achieve increased indexing speed without reducing search speed.

Related

Elasticsearch - Single Index vs Multiple Indexes

I have more than 4000 different fields in one of my indexes, and that number can grow over time.
Elasticsearch has a default limit of 1000 fields per index, and there must be some reason for it.
So I am thinking that I should not raise the limit set by Elasticsearch, and should instead break my single large index into multiple smaller indexes.
Before moving to multiple indexes I have a few questions:
The number of smaller indexes could grow to 50. Would searching all 50 indexes at once slow down search compared to a search on the single large index?
Is there really a need to break my single large index into multiple indexes because of the large number of fields?
When I use multiple smaller indexes, the total number of shards increases drastically (more than 250 shards), since each index would have 5 shards (the default, which I don't want to change). Searching these indexes means searching 250 shards at once. Will this affect my search performance? Note: the number of shards might grow over time as well.
When I use a single large index with only 5 shards and a large number of documents, won't that overload those 5 shards?
It strongly depends on your infrastructure. If you run a single node with 50 shards, a query will take longer than it would with only 1 shard. If you have 50 nodes holding one shard each, it will most likely run faster than one node with 1 shard (if you have a big dataset). In the end, you have to test with real data to be sure.
With a massive number of fields, ES runs into performance problems and errors become more likely. The main problem is that every field has to be stored in the cluster state, which takes a toll on your master node(s). Also, in a lot of cases you end up working with very sparse data (90% of fields empty).
As a rule of thumb, one shard should contain between 30 GB and 50 GB of data. I would not worry too much about overloading shards in your use-case. The opposite is true.
I suggest testing your use case with fewer shards; go down to 1 shard, 1 replica for your index. The overhead of searching multiple shards (5 primaries, multiplied by replicas) and then combining the results again is massive compared to your small dataset.
Keep in mind that document_type behaviour has changed and will change further. Since 6.x you can only have one document_type per index, and starting in 7.x document_type is removed entirely. As the API listens at _doc, _doc is the suggested document_type to use in 6.x. Either move to one index per _type, or introduce a new field that stores your type if you need the data in one index.
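As a sketch of those two suggestions combined for 6.x (the index and field names are invented for illustration): one shard, one replica, the _doc type, and an explicit keyword field carrying your own type.
    PUT /my_index
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      },
      "mappings": {
        "_doc": {
          "properties": {
            "type": { "type": "keyword" }
          }
        }
      }
    }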

How to split a multi-typed index to prepare for an ES upgrade (where multiple types are deprecated)

We are currently running a cluster with ES 2.3.2 that has one large index with the following properties:
762 GB (366 million docs)
25 data nodes; 3 master nodes; 3 client nodes
23 shards / 1 replica
This one index has 20+ types, each with a few common and many unique fields. I am redesigning the cluster with the following goals:
1) Remove multiple types in an index so that we can upgrade ES. Though multi-types are supported in v5, we want to do the work to prep for v6 now.
2) Break up the large index into more manageable smaller indexes
I have set up a new identical cluster. I modified the indexing so that I have one index per type. I allocated a shard count based on the relative size of the data with a minimum of 2, and a max of 5 shards. After indexing all of our data into this new cluster, I am finding that the same query against the new cluster is slower than against the old cluster.
I figured this was due to the explosion of shards (it was 23 primaries, and now it is 78). I closed all but one index (one with a shard count of 2), then ran a test where I targeted a single type against my old monolithic index and against the new single-typed index (using a homebrew tool to run requests in parallel and parse out the "took"). I find that if I do a "size: 0", my new cluster is faster. When I return 7 or 8 records they seem to be at parity. It then goes downhill: our default query returning 30 records is about twice as slow. I am guessing this is because there are fewer threads to do the actual retrieval in the smaller index with two shards vs. the large one with 23.
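For anyone reproducing this comparison, the test boils down to timing the same query at different sizes and comparing the "took" values; the index and field names below are placeholders:
    # Query phase only, no documents fetched
    GET /new_single_type_index/_search
    { "size": 0, "query": { "match": { "body": "some search terms" } } }

    # Same query returning the default 30 records
    GET /new_single_type_index/_search
    { "size": 30, "query": { "match": { "body": "some search terms" } } }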
What is the recommendation for moving away from multi-typed indexes when the following is true:
- There are many types
- The types have very different mappings
- There is a huge variance in size per type, ranging from 4 MB to 154 GB
I am currently contemplating putting them all in one type with one massive mapping (I don't think there are any fields with the same name but different mappings), but that seems really ugly.
Any suggestions welcome,
Thanks,
~john
I don't know your data, but you can try to reduce the number of indexes in the following way.
Group the types that have similar mappings into one index. In that index, create a "type" property and handle it yourself in your queries.
If every type has a completely different structure, I would still put the smaller ones together in one index. After all, they lived that way before.
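A sketch of what "handle it yourself in your queries" looks like at query time (the index name, type value and extra field are invented for illustration):
    GET /grouped_index/_search
    {
      "query": {
        "bool": {
          "filter": { "term": { "type": "invoice" } },
          "must": { "match": { "description": "overdue payment" } }
        }
      }
    }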
Sharding is how Elasticsearch scales, so it makes sense that you observe performance degradation for a network/IO-bound operation when it's executed against 2 shards vs. 23, as that essentially means it ran on 2 nodes in parallel as opposed to 23.
If you want to split the index, you need to go over all of the types and identify the minimum number of shards each type needs for your target performance. That depends on multiple factors such as the number of documents, document size, and request/indexing patterns. Since the types vary significantly in size, the result will most likely be less balanced than your initial setup (2-5 shards): some indexes will need more shards, while others will do fine with fewer. For example, there is no need to split a 4 MB index (as in your example) into multiple shards unless you expect it to grow significantly, have a high update rate, and want to scale indexing; otherwise 1 shard is fine.

ElasticSearch indexing with hundreds of indices

I have the following scenario:
More than 100 million items and counting (10 million added each month).
8 Elastic servers
12 Shards for our one index
Until now, all of those items were indexed in the same index (under different types). To improve the environment, we decided to index items by geohash code, with the mantra of no more than 30 GB per shard.
The current status is that we have more than 1500 indices with 12 shards per index, and every item is inserted into one of those indices. As you can understand, the number of shards has surpassed 20,000...
Our indices are in the format <Base_Index_Name>_<geohash>
My question is prompted by performance problems that made me question our method. A simple count query of the form GET */_count takes seconds!
If my intention is to query many indices, is this implementation bad? How many indices should a cluster with 8 virtual servers have? How many shards? We have a lot of data and it is growing fast.
Actually, it depends on your usage. A query against all of the indices takes a long time because it has to go to all of the shards and the results have to be merged afterwards. Querying 20K shards is not an easy task.
If your data is time based, I would advise adding month or date information to the index name and changing your query to GET indexname201602/_search or GET *201602/_search.
That way you can drastically reduce the number of shards your query hits, and it will take much less time.
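For example, with month information in the index names (the pattern below is hypothetical), a count or search touches only that month's shards instead of all 20,000:
    # Count only the indexes for February 2016
    GET /*201602/_count

    # Search only those indexes
    GET /*201602/_search
    {
      "query": { "match_all": {} }
    }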

Performance issues when pushing data at a constant rate to Elasticsearch on multiple indexes at the same time

We are experiencing some performance issues, or anomalies, with Elasticsearch on a system we are currently building.
The requirements:
We need to capture data for multiple customers, who will query and report on it on a near-real-time basis. All the documents received have the same format, with the same properties, and a flat structure (all fields are of primitive type, no nested objects). We want to keep each customer's information separate from the others.
Frequency of data received and queried:
We receive data for each customer at a fluctuating rate of 200 to 700 documents per second – with the peak being in the middle of the day.
Queries will mostly be aggregations over around 12 million documents per customer – histograms/percentiles to show patterns over time, and the occasional raw document retrieval to find out what happened at a particular point in time. We are aiming to serve 50 to 100 customers at varying insert rates – the smallest around 20 docs/sec, the largest peaking at 1000 docs/sec for some minutes.
How we are storing the data:
Each customer has one index per day. For example, if we have 5 customers, there will be a total of 35 indexes for a whole week. The reason we break it up per day is that it is mostly the latest two days that get queried, with occasional queries against the rest. We also do it that way so we can delete older indexes independently per customer (some may want to keep 7 days' worth of data, some 14 days).
How we are inserting:
We are sending data in batches of 10 to 2000 documents every second. One document is around 900 bytes raw.
Environment
AWS C3-Large – 3 nodes
All indexes are created with 10 shards and 2 replicas for test purposes
Both Elasticsearch 1.3.2 and 1.4.1
What we have noticed:
If I push data to one index only, response time starts at 80 to 100 ms per batch when the insert rate is around 100 documents per second. I can ramp it up to 1600 before the time per batch gets close to 1 second, and when I increase it to close to 1700 it hits a wall at some point because of concurrent insertions, and the time spirals to 4 or 5 seconds. That said, if I reduce the insert rate, Elasticsearch recovers nicely. CPU usage increases as the rate increases.
If I push to 2 indexes concurrently, I can reach a total of 1100, and CPU goes up to 93% at around 900 documents per second.
If I push to 3 indexes concurrently, I can reach a total of 150 and CPU goes up to 95 to 97%. I have tried it many times. The interesting thing is that the response time is around 109 ms at that point. I can increase the load to 900 and the response time will still be around 400 to 600 ms, but CPU stays up.
Question:
Looking at our requirements and the findings above, is the design suitable for what's being asked? Are there any tests I can do to find out more? Are there any settings I need to check (and change)?
I've been hosting thousands of Elasticsearch clusters on AWS over at https://bonsai.io for the last few years, and have had many a capacity planning conversation that sounds like this.
First off, it sounds to me like you have a pretty good cluster design and test rig going here. My first intuition here is that you are legitimately approaching the limits of your c3.large instances, and will want to bump up to a c3.xlarge (or bigger) fairly soon.
An index per tenant per day could be reasonable if you have relatively few tenants. You may consider an index per day for all tenants, using filters to focus your searches on specific tenants. Unless there are obvious cost savings in discarding old data, filters should suffice to enforce data retention windows as well.
The primary benefit of segmenting your indices per tenant would be to move your tenants between different Elasticsearch clusters. This could help if you have some tenants with wildly larger usage than others. Or to reduce the potential for Elasticsearch's cluster state management to be a single point of failure for all tenants.
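If you do go with a shared index per day, scoping a search to one tenant is just a filter; a minimal sketch in the 1.x query syntax, assuming each document carries a customer_id and a timestamp field (both names invented here):
    # Aggregation for one tenant only, over a single daily index
    GET /events-2015.01.15/_search
    {
      "size": 0,
      "query": {
        "filtered": {
          "query": { "match_all": {} },
          "filter": { "term": { "customer_id": "customer-42" } }
        }
      },
      "aggs": {
        "per_minute": {
          "date_histogram": { "field": "timestamp", "interval": "minute" }
        }
      }
    }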
A few other things to keep in mind that may help explain the performance variance you're seeing.
Most importantly here, indexing is incredibly CPU bottlenecked. This makes sense, because Elasticsearch and Lucene are fundamentally just really fancy string parsers, and you're sending piles of strings. (Piles are a legitimate unit of measurement here, right?) Your primary bottleneck is going to be the number and speed of your CPU cores.
In order to take the best advantage of your CPU resources while indexing, you should consider the number of primary shards you're using. I'd recommend starting with three primary shards to distribute the CPU load evenly across the three nodes in your cluster.
For production, you'll almost certainly end up on larger servers. The goal is for your total CPU load at peak indexing to end up under 50%, so you have some additional overhead for processing your searches. Aggregations are also fairly CPU hungry. The extra performance overhead is also helpful for gracefully handling any other unforeseen circumstances.
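On the 1.x versions you're testing, the easiest way to apply a shard count like that to every new daily index is an index template; the template name, pattern and numbers below are just a starting point to adjust:
    PUT /_template/customer_daily
    {
      "template": "customer-*",
      "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
      }
    }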
You mention pushing to multiple indices concurrently. I would avoid concurrency when bulk updating into Elasticsearch, in favor of batch updating with the Bulk API. You can bulk load documents for multiple indices with the cluster-level /_bulk endpoint. Let Elasticsearch manage the concurrency internally without adding to the overhead of parsing more HTTP connections.
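Concretely, one bulk body can mix destination indices, so the per-second pushes for several customers can travel in a single request (index names, type and fields are placeholders):
    POST /_bulk
    { "index": { "_index": "customer-a-2015.01.15", "_type": "event" } }
    { "timestamp": "2015-01-15T12:00:00Z", "value": 17 }
    { "index": { "_index": "customer-b-2015.01.15", "_type": "event" } }
    { "timestamp": "2015-01-15T12:00:01Z", "value": 3 }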
That's just a quick introduction to the subject of performance benchmarking. The Elasticsearch docs have a good article on Hardware which may also help you plan your cluster size.

Should I control the index size in Elasticsearch?

I have a fast-growing database and I'm using Elasticsearch to manage it. It has only one index, which gets 200K new documents per day. Each document contains about 5 KB of text.
Should I keep using only one index, or is it better to have one index per day, or something else?
If so, what are the benefits of having multiple indices?
You should definitely worry about the maximum size of your shards/index. We use daily indexes for data where we are inserting millions of records per day, and monthly indexes where we are inserting millions per month.
A good rule of thumb is that shards should max out at around 4 GB (remember, there is a configurable number of shards per index).
The advantage is that when you have daily/weekly/monthly indexes, you can eventually close/delete them when your cluster becomes too big or the data isn't useful anymore. If your data is time series data, you can craft your queries to only hit the indexes that are used for the given data. Also if you've made a mistake in how many shards you really need, you can correct it going forward (because you create a new index periodically).
The disadvantage is then that you have to manage all of the extra indexes, but there are tools to do that (elasticsearch-curator for example).
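The retention side of that is then a single call per expired index; a sketch, assuming daily indexes named like logs-YYYY.MM.DD:
    # Close an index you rarely need but want to keep on disk
    POST /logs-2015.06.01/_close

    # Delete an index that is past its retention window
    DELETE /logs-2015.06.01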

Resources