I'm bulk inserting hundreds of documents into elastic search from a worker.
At the same time a user is running queries from a UI.
The problem is the queries either timeout, or are very slow, if the bulk indexing is going on at the same time as the query.
Is that normal behaviour for Elastic search?
Is there a way to run a query at the same time as the indexing or does Elastic lock the whole index?
It's fine if the query doesnt run over the latest documents.
I'm running the latest version of elasticsearch-py and Elastic, on Windows.
Related
I insert multiple documents using ReactiveElasticsearchRepository.saveAll.
However, when I want to search just after that using findByName on my repository I get no results.
Is it possible to somehow wait/query the indexing process to finish?
I am trying to search and fetch the documents from Elasticsearch but in some cases, I am not getting the updated documents. By updated I mean, we update the documents periodically in Elasticsearch. The documents in ElasticSearch are updated at an interval of 30 seconds, and the number of documents could range from 10-100 Thousand. I am aware that the update is generally a slow process in Elasticsearch.
I am suspecting it is happening because Elasticsearch though accepted the documents but the documents were not available for searching. Hence I have the following questions:
Is there a way to measure the time between indexing and the documents being available for search? There is setting in Elasticsearch which can log more information in Elasticsearch logs?
Is there a setting in Elasticsearch which enables logging whenever the merge operation happens?
Any other suggestion to help in optimizing the performance?
Thanks in advance for your help.
By default the refresh_interval parameter is set to 1 second, so unless you changed this parameter each update will be searchable after maximum 1 second.
If you want to make the results searchable as soon as you have performed the update operation you can use the refresh parameter.
Using refresh=wait_for the endpoint will respond once a refresh has occured. If you use refresh=true a refresh operation will be triggered. Be careful using refresh=true if you have many update since it can impact performances.
My application allows updating multiple elasticsearch documents in single request.
I use ElasticSearch BulkRequestBuilder to update all such documents in Bulk.
BulkRequestBuilder bulkRequestBuilder = elasticSearchClient.prepareBulk();
documents.forEach(id -> {
UpdateRequest updateRequest = new UpdateRequestBuilder(elasticSearchClient)
.setType("MyDocumentType")
.setIndex("MyDocumentIndex")
.setId(id)
.setDoc("fieldName", "valueToBeUpdated")
.request();
bulkRequestBuilder.add(updateRequest);
});
//update in bulk
bulkRequestBuilder.get();
All the documents are updated with valueToBeUpdated but ElasticSearch internally takes time to update all the documents but the call to bulkRequestBuilder.get() returns even before documents are updated. (Indicating Async nature of ElasticSearch engine).
Could anyone please suggest how to make it a Sync updates of all documents?
Finally I found the core issue (may be default nature) with updates taking time by the ElasticSearch engine.
By default the ElasticSearch engines updates are ASYNC in nature (as I pointed in my question already). There are couple of links which are explaining this default behaviour.
e.g. ElasticSearch GET API Documentation states that in order to get the document , elasticsearch engine does a refresh in order to visible all previous updates if any. This hints that ASYNC nature of elastic search is causing immediate search of my documents not providing me updated documents.
As of now to continue with existing behaviour, trigger bulk update in SYNC as follows.
bulkRequestBuilder.setReplicationType(ReplicationType.SYNC).setRefresh(true).get();
Usually problems indexing/updating a lot of data comes from segment merging from ES .
One tip from ES people is to disable refresh before indexing/updating a lot of data.
You can achieve this updating index refresh_interval before indexing to refresh_interval=-1, and once all your data is indexed return it to your previous index configuration.
Tune-indexing-speed
i am using spring boot application to load messages into elastic search. i have a use case where i need to query the elastic search data , get some id value and populate it in elastic search json document before inserting it into elastic search
Querying the elastic search before insertion . Will this be expensive ? If yes is there some other way to approach this issue.
You can use update_by_query to do it in one step.
But otherwise it shouldn't be slow if you do it in two steps (get + update). It depends on many things - how often you do it, how much data is transferred, etc.
I'm using CouchDB river plugin with Elastic Search. In my web application, I am using CouchDB's bulk insert to insert documents into CouchDB. This triggers the changes feed and ES reads this to index my documents. The problem now is that my web ui isn't showing anything because ES is still indexing the documents.
I'm using PyES to "talk" to ES by the way. Is there any function I can call to know whether Elastic Search is busy indexing?
Thanks a million.
Even if ES is indexing, ES should answer to queries.
Could you check with a
curl localhost:9200/_search?q=*
That your index has docs in it while indexing from couchDb?
[UPDATE]
You have to know that Elasticsearch is a Near Real Time search engine. So, you have to wait some seconds to be able to search for your docs.
You can retrieve your docs immediatly but you need to wait for the refresh process.
You can trigger manually the refresh API. But it could slow down dramatically your insertions.
Does it help?