Elasticsearch reindexing is very slow for me. From an article I read, the refresh interval defaults to 1 second, and you should change it to -1 before reindexing and set it back to 1s after the reindex completes. My question is:
Is it OK to update the refresh interval to -1 while a reindex is already running and about 20% complete?
curl -XPUT 'localhost:9200/my_index/_settings' -H 'Content-Type: application/json' -d '
{
  "index" : {
    "refresh_interval" : "-1"
  }
}'
No, it won't hurt if you do that.
Also, if you are using Elasticsearch 5 then you really need to upgrade urgently; it has been EOL for a number of years now.
The refresh_interval controls how frequently Elasticsearch refreshes, i.e. makes newly written data available for search. Each refresh is extra work, so reducing the interval adds overhead. When reindexing, if you don't need to query the new index, you want to set this high to improve performance, or disable refreshes entirely.
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html
If you must read from the index while writing to it and must have the data available very quickly, indexing performance will be reduced.
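As a concrete illustration of that advice, here is a minimal Python sketch of the disable-reindex-restore sequence. It only builds the request bodies; the index name and the choice of HTTP client for actually sending them are assumptions, not part of the original answer:

```python
import json

def refresh_settings(interval):
    """Body for PUT /<index>/_settings that sets index.refresh_interval."""
    return json.dumps({"index": {"refresh_interval": interval}})

# 1. Disable refreshes before the heavy write phase.
disable_body = refresh_settings("-1")
# 2. ...run the reindex / bulk load here...
# 3. Restore the default once the reindex has completed.
restore_body = refresh_settings("1s")

print(disable_body)
print(restore_body)
```

Each body is what you would PUT to the index's `_settings` endpoint, exactly as in the curl example above.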
Related
I have a Flink job that's bulk writing/upserting a few thousand docs per second into Elasticsearch. Every time I query it, it takes ~10-20 seconds to get a response.
I have a second index that's exactly the same and equally full, on the same cluster, but writes to this index are now turned down to 0. When I query it, it takes milliseconds to get a response.
I.e. with writes off queries take milliseconds. With writes on queries take 10-20 seconds.
CPU utilization ~10%, JVM mem pressure ~70%. ES 7.8.
It would appear, then, that writes to shards are somehow slowing the reads down. This is odd, considering that with "profile": true it gives me query timings on the order of milliseconds, yet took (total request time) is in the seconds, as I'm seeing.
My question is why might this be happening, and how can I optimize it?
(I did think of maybe I could have read replica nodes, but ES doesn't support a read replica node type https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html#node-roles )
I'm not sure what your original refresh_interval value was, but the default is 1 second, which you have now set explicitly. It affects which documents are returned in search results.
A refresh writes documents into segments, which are what search queries read. Without a refresh you will get stale results for existing documents, and new documents written after the last refresh will not appear in search results.
But it shouldn't normally change the performance of search queries. In your case search queries take more time while indexing is also happening, which you need to debug further (check CPU, memory, and node stats such as the search and index thread-pool queue sizes).
By the way, replicas are per shard, not per node, and you can easily increase the number of replicas of an index dynamically to improve search performance:
PUT /my-index-000001/_settings
{
  "index" : {
    "number_of_replicas" : 2
  }
}
(where 2 is the number of replicas you want to set)
Setting refresh_interval to 1s seems to have fixed the issue. If anyone has an explanation as to why I'd appreciate it.
curl -X PUT "host/index/_settings?pretty&human=true" -H 'Content-Type: application/json' -d'
{
  "index" : {
    "refresh_interval" : "1s"
  }
}'
Edit:
The refresh rate changes dynamically based on read load unless it's explicitly set.
By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds. You can change this default interval using the index.refresh_interval setting.
https://www.elastic.co/guide/en/elasticsearch/reference/7.10/indices-refresh.html#refresh-api-desc
I have a big use case with Elasticsearch, with millions of records in it.
I will be updating the records frequently, say 1000 records per hour.
I don't want Elasticsearch to reindex on every one of my updates.
I am planning to reindex on a weekly basis.
Any idea how to stop auto-reindexing while updating?
Any other suggestions are welcome. Thanks in advance :)
Elasticsearch (ES) updates an existing doc in the following manner:
1. Delete the old doc.
2. Index a new doc with the changes applied to it.
According to the ES docs:
In Elasticsearch, this lightweight process of writing and opening a
new segment is called a refresh. By default, every shard is refreshed
automatically once every second. This is why we say that Elasticsearch
has near real-time search: document changes are not visible to search
immediately, but will become visible within 1 second.
Note that these changes will not be visible/searchable until ES writes them to the filesystem cache and then to disk. This is controlled by the soft commit (the ES refresh interval, 1 second by default) and the hard commit (which actually writes documents to disk, preventing them from being lost permanently, and is a more costly affair than a soft commit).
You need to make sure you tune your ES refresh interval and do proper load testing, as setting it very low or very high each has its own pros and cons.
For example, setting it very low, say 1 second, when you have too many updates happening causes a performance hit and might crash your system. Setting it very high, for example 1 hour, means you no longer have NRT (near-real-time) search; during that window the in-memory buffer could accumulate millions of docs (depending on your app), which can cause out-of-memory errors, and committing such a large buffer is a very costly affair.
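Given frequent updates like the 1000-per-hour case above, updates are usually batched through the `_bulk` API rather than sent one at a time, so each request amortizes the indexing overhead. A rough Python sketch of building such a payload (the index name and document shape are made up for illustration):

```python
import json

def bulk_update_payload(index, docs):
    """Build an NDJSON _bulk body of partial-document updates.

    Each update is two lines: an action line naming the target doc,
    then a {"doc": ...} line with the fields to change.
    """
    lines = []
    for doc_id, partial in docs.items():
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"doc": partial}))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline

payload = bulk_update_payload("my_index", {"1": {"status": "done"}})
print(payload)
```

The resulting body would be POSTed to `/_bulk` with `Content-Type: application/x-ndjson`; the documents still become searchable on the refresh schedule discussed above.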
I have a query which runs every time a website is loaded. This query aggregates over three different term fields and around 3 million documents, and therefore needs 6-7 seconds to complete. The data does not change that frequently and the freshness of the result is not critical.
I know that I can use an alias to create something "view"-like, as in the RDBMS world. Is it also possible to populate it so the query result gets cached? Is there any other way caching might help in this scenario, or do I have to create an additional index for the aggregated data and update it from time to time?
I know that the post is old, but regarding views: Elastic added data frame transforms in 7.3.0.
You could also use the _reindex api
POST /_reindex
{
"source": {
"index": "live_index"
},
"dest": {
"index": "caching_index"
}
}
But it will not change your ingestion problem.
Regarding the ingestion problem, I think the solution is sharding your index.
With 2 or more shards, and several nodes, Elasticsearch will be able to parallelize.
But an easier thing to test is disabling the refresh_interval while indexing and re-enabling it afterwards. It generally improves ingestion time a lot.
You can see a full article on this use case at
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html
You can create a materialized view. It is effectively a table holding the results of aggregate functions. Since the aggregated data is already stored, queries against it are faster, and there may be no need for caching at all. I have created MVs myself and they improve performance tremendously. That said, you can also go with Elasticsearch, where you can cache the aggregation queries if your data is not changing frequently. In my experience, MVs and Elasticsearch give similar performance.
I have an Elasticsearch instance for indexing log records. Naturally the data grows over time, and I would like to cap its size (at about 10 GB), something like a MongoDB capped collection.
I'm not interested in old log records anyway.
I haven't found any config for this, and I'm not sure that I can just remove data files.
Any suggestions?
The Elasticsearch "way" of dealing with "old" data is to create time-based indices. Meaning: for each day (or each week) you create an index, and everything belonging to that day/week gets indexed into it.
You decide how many days you want to keep around and stick to that number. Let's say the data for 7 days amounts to 10 GB. On the 8th day you create the new index as usual, then you delete the index from 8 days before.
That way you always have 7 indices in your cluster.
Using ttl as the other poster suggested is not recommended, because it is far more difficult and it creates additional pressure on the cluster. The ttl mechanism checks every indices.ttl.interval (60 seconds by default) for expired documents, builds bulk requests out of them, and deletes them. This means unnecessary requests coming to the cluster.
Instead, deleting an index is very easy and quick.
Take a look at this and how to easily manage time based indices with Curator.
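The rolling-window arithmetic above is simple enough to sketch. Assuming daily indices named like `logs-YYYY.MM.dd` (the prefix and date pattern here are illustrative, not prescribed by the answer), the index that falls out of a 7-day window is:

```python
from datetime import date, timedelta

def index_to_delete(today, retention_days=7, prefix="logs-"):
    """Name of the daily index that falls out of the retention window.

    With 7-day retention, on each new day the index from 7 days ago
    is deleted, so the cluster always holds 7 daily indices.
    """
    expired = today - timedelta(days=retention_days)
    return f"{prefix}{expired:%Y.%m.%d}"

print(index_to_delete(date(2024, 5, 10)))  # logs-2024.05.03
```

A cron job (or Curator, as mentioned above) would compute this name daily and issue a single DELETE for that index, which is far cheaper than deleting documents one by one.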
From what I remember, a capped collection in MongoDB is just a circular-buffer type of collection that removes the oldest entries when there's no more room. Unfortunately there's nothing like this out of the box in Elasticsearch; you have to add this functionality yourself by removing single documents (or batches of documents) using ES's API. A more performant way is described in their documentation under retiring data.
You can provide a per-index/type default _ttl (time to live) value as follows:
{
"tweet" : {
"_ttl" : { "enabled" : true, "default" : "1d" }
}
}
You will have more detail here: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-ttl-field.html
Regards,
Alain
I'm using the elastic4s client, which returns a Future in response to an index request, and when that Future completes I still have to do Thread.sleep(1000) before I can query for the indexed record. Mostly it is exactly 1 second. Is there an Elasticsearch setting I can change so that the record is available when the Future completes?
I tried using the Java client directly (client.prepareIndex....execute().actionGet();) and it ends up exactly the same way: I have to call Thread.sleep(1000).
Are there any settings I can change so that the record is ready as soon as the Future completes?
execute(index into(foo, bar) id uuid fields baz).await
Thread.sleep(1000) // This is mandatory for search to find it
execute {search in foo}.await // returns empty without Thread.sleep(1000)
It sounds like you may be having to wait for the default index refresh interval to come into play before you can query the newly indexed data. The refresh interval is 1 second by default and can be changed with the following
curl -XPUT localhost:9200/test/_settings -d '{
  "index" : {
    "refresh_interval" : "1s"
  }
}'
Alternatively, you can refresh the shard after the indexing operation by including the refresh parameter in the query string of the index operation. This may be more useful than changing the refresh interval globally
curl -XPUT 'http://localhost:9200/{index}/{type}/{id}?refresh=true' -d '{
"property" : "value"
}'
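On Elasticsearch 5.0 and later there is also refresh=wait_for, which blocks the indexing call until the next scheduled refresh makes the document searchable, instead of forcing an extra refresh the way refresh=true does. A small sketch of composing the request URL (the host, index, type, and id here are placeholders):

```python
def index_url(host, index, doc_type, doc_id, refresh="wait_for"):
    """Compose an index-document URL with an explicit refresh mode.

    refresh=true forces an immediate refresh; refresh=wait_for (ES 5.0+)
    waits for the next scheduled refresh, avoiding extra refresh work.
    """
    return f"http://{host}/{index}/{doc_type}/{doc_id}?refresh={refresh}"

print(index_url("localhost:9200", "test", "_doc", "1"))
# http://localhost:9200/test/_doc/1?refresh=wait_for
```

For the original poster's case, indexing with wait_for would make the Future-plus-sleep pattern unnecessary: the request itself would not return until the document was searchable.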
Russ's answer is correct, but I want to add a bit more about the Scala side.
When you do an index operation, the future returned is completed as soon as the Elasticsearch cluster has processed the command. That is not the same time as when the document is available to search. That is, as Russ pointed out, 1 second later (by default).
So your future completes at time t; your document is available at t + 1 second.
You can adjust the refresh interval when creating the index, eg in Elastic4s
create index "myindex" refreshInterval "200ms" mappings ...
In the next release you can use Scala durations eg
create index "myindex" refreshInterval 200.millis mappings ...
But be aware, by adjusting this too much you remove some of the optimizations that the refresh interval brings. If you are doing multiple indexes, etc then look at the bulk api. (In Elastic4s just wrap your calls in bulk(seq))