ElasticSearch 1.7 (Spring Data ElasticSearch) update by query takes lot of time to update documents

My application allows updating multiple Elasticsearch documents in a single request.
I use the ElasticSearch BulkRequestBuilder to update all such documents in bulk.
    BulkRequestBuilder bulkRequestBuilder = elasticSearchClient.prepareBulk();
    // queue one partial-document update per document id
    documents.forEach(id -> {
        UpdateRequest updateRequest = new UpdateRequestBuilder(elasticSearchClient)
            .setType("MyDocumentType")
            .setIndex("MyDocumentIndex")
            .setId(id)
            .setDoc("fieldName", "valueToBeUpdated")
            .request();
        bulkRequestBuilder.add(updateRequest);
    });
    // update in bulk
    bulkRequestBuilder.get();
All the documents are eventually updated with valueToBeUpdated, but ElasticSearch takes time internally to apply the updates, and the call to bulkRequestBuilder.get() returns before the documents are updated (which indicates the asynchronous nature of the ElasticSearch engine).
Could anyone please suggest how to make the update of all documents synchronous?

I finally found the core issue (which may simply be the default behaviour) behind the ElasticSearch engine taking time to apply updates.
By default, updates in the ElasticSearch engine are asynchronous in nature (as I already pointed out in my question). There are a couple of links that explain this default behaviour.
For example, the ElasticSearch GET API documentation states that a refresh can be performed before the get operation in order to make all previous updates visible. This hints that the asynchronous nature of ElasticSearch is why a search issued immediately after the update does not return my updated documents.
For now, to keep the existing behaviour, I trigger the bulk update synchronously as follows.
    bulkRequestBuilder.setReplicationType(ReplicationType.SYNC).setRefresh(true).get();

Problems when indexing or updating a lot of data usually come from segment merging in ES.
One tip from the ES team is to disable refresh before indexing or updating a lot of data.
You can achieve this by setting the index refresh_interval to -1 before indexing and, once all your data is indexed, restoring it to your previous index configuration.
See: Tune for indexing speed (Elasticsearch documentation)
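The refresh_interval tip above could be sketched in Java roughly as follows. This is only an illustration: the class name, the helper names, the base URL and the index name are all hypothetical, and it assumes Java 11+ so that java.net.http is available. It only builds the request for the index settings endpoint (PUT /<index>/_settings); sending it is left to the caller.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of any Elasticsearch client library.
final class RefreshIntervalSketch {

    // JSON body for the index settings API, e.g. {"index":{"refresh_interval":"-1"}}
    static String settingsBody(String interval) {
        return "{\"index\":{\"refresh_interval\":\"" + interval + "\"}}";
    }

    // Builds (but does not send) a PUT request against <baseUrl>/<index>/_settings.
    static HttpRequest settingsRequest(String baseUrl, String index, String interval) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(settingsBody(interval)))
                .build();
    }
}
```

A caller would send settingsRequest(url, index, "-1") before the bulk load and settingsRequest(url, index, "1s") (or whatever the previous value was) afterwards, for example via HttpClient.newHttpClient().send(...).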

Related

How about including JSON doc version? Is it possible for elastic search, to include different versions of JSON docs, to save and to search?

We are using ElasticSearch to save and manage information on complex transactions. We might need to add more information to every transaction in the near future.
How about including a JSON doc version?
Is it possible for ElasticSearch to save and search different versions of JSON docs?
How does this affect performance in ElasticSearch?

It is entirely possible. By default, Elasticsearch uses dynamic mapping to index every new document, such as your JSON documents. For each field in your documents, Elasticsearch builds a structure called an inverted index, and search queries are executed against it. So regardless of how the fields vary between documents, as long as you know which field you want to query, throughput and performance should not be affected.
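To make the per-field inverted index idea concrete, here is a toy Java sketch. It is not how Elasticsearch or Lucene is implemented; the class name, the whitespace tokenizer and the in-memory maps are assumptions made for illustration. It only shows that documents with different sets of fields can live in the same index and still be queried per field.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: field name -> term -> ids of documents containing that term.
final class FieldInvertedIndexSketch {
    private final Map<String, Map<String, Set<Integer>>> index = new HashMap<>();

    // Indexes a flat document; fields never seen before simply get a new inverted index.
    void add(int docId, Map<String, String> doc) {
        for (Map.Entry<String, String> field : doc.entrySet()) {
            Map<String, Set<Integer>> terms =
                    index.computeIfAbsent(field.getKey(), k -> new HashMap<>());
            // Naive tokenizer: lowercase and split on whitespace.
            for (String term : field.getValue().toLowerCase().split("\\s+")) {
                terms.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Looks up a single term in a single field; unknown fields or terms yield an empty set.
    Set<Integer> search(String field, String term) {
        return index.getOrDefault(field, Collections.emptyMap())
                    .getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}
```

In this model, a newer "version" of the JSON document that carries an extra field only adds entries under that field's map; lookups on the older fields are untouched.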

How to debug document not available for search in Elasticsearch

I am trying to search for and fetch documents from Elasticsearch, but in some cases I am not getting the updated documents. By updated I mean that we update the documents in Elasticsearch periodically. The documents are updated at an interval of 30 seconds, and the number of documents could range from 10-100 thousand. I am aware that updating is generally a slow process in Elasticsearch.
I suspect this happens because Elasticsearch had accepted the documents but they were not yet available for searching. Hence I have the following questions:
Is there a way to measure the time between indexing and the documents being available for search? Is there a setting in Elasticsearch that writes more information about this to the Elasticsearch logs?
Is there a setting in Elasticsearch that enables logging whenever a merge operation happens?
Any other suggestions to help optimize performance?
Thanks in advance for your help.

By default the refresh_interval parameter is set to 1 second, so unless you changed this parameter, each update will be searchable after at most 1 second.
If you want the results to be searchable as soon as the update operation has completed, you can use the refresh parameter.
With refresh=wait_for, the endpoint responds once a refresh has occurred. With refresh=true, a refresh operation is triggered immediately. Be careful with refresh=true if you have many updates, since it can hurt performance.
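For the question about measuring the time between indexing and search visibility, one client-side approach is to poll until the document shows up and record how long that took. The Java sketch below is an assumption-laden illustration: the class and method names are hypothetical, and the isVisible check is a placeholder that you would wire to your own search call for the document you just indexed.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper: measures how long it takes until a visibility check succeeds.
final class VisibilityTimer {

    // Polls isVisible every pollMillis; returns the elapsed milliseconds once it
    // reports true, or -1 if timeoutMillis passes (or the thread is interrupted) first.
    static long millisUntilVisible(BooleanSupplier isVisible, long pollMillis, long timeoutMillis) {
        long start = System.nanoTime();
        while (true) {
            if (isVisible.getAsBoolean()) {
                return (System.nanoTime() - start) / 1_000_000;
            }
            if ((System.nanoTime() - start) / 1_000_000 >= timeoutMillis) {
                return -1;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return -1;
            }
        }
    }
}
```

Call it right after the index or update request returns. The result is only accurate to within one poll interval, and the polling searches themselves add load, so this is better suited to occasional diagnostics than to every request.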

elasticsearch:update the doc if exists in all the shards of an index

I searched for how to update a doc, if it exists, across all the shards of an index in ES. I found a way (the /_bulk API), but it requires specifying the routing values, so I was not able to find a solution to my problem. If anybody knows about the points below, please let me know.
Is there any way to update the doc in all the shards of an index, if it exists, using a single update query?
If not, is there any way to generate routing values such that an update query hits all the shards?

Ideally, for bulk updates, ES recommends fetching the documents that need updating with a query using scan and scroll, updating each document, and indexing them again. Internally, ES never updates a document in place, even though it provides an Update API with scripting. It always reindexes a new document with the updated field/value and deletes the older document.
Is there any way to update the doc in all the shards of an index if exists using a single update query?.
You can check whether the Update API suits your purpose. There are also plugins that provide update by query. Check this.
Now for the routing part and updating all shards. If you specified a routing value when indexing the document for the very first time, then whenever you update the document you need to set the original routing value. Otherwise ES has no way of knowing which shard the document resides on, and it may send the request to any shard (chosen by its algorithm).
If you don't use a routing value, ES uses an algorithm based on the document ID to decide which shard the document goes to. Hence, when you update a document through the bulk API and keep the same ID without routing, the document will be saved on the same shard as before and you will see the update.
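The routing behaviour described above follows the documented rule shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document ID. The Java sketch below illustrates that rule only; the class and method names are hypothetical, and String.hashCode() is a stand-in for the hash function Elasticsearch actually uses, so the shard numbers it produces will not match a real cluster.

```java
// Illustration of the routing rule only; not Elasticsearch's real hash function.
final class ShardRoutingSketch {

    // Picks a shard from the custom routing value, or from the document ID
    // when no routing value was supplied at index time.
    static int shardFor(String id, String routing, int numberOfPrimaryShards) {
        String key = (routing != null) ? routing : id;
        // floorMod keeps the result in [0, numberOfPrimaryShards) even for negative hashes.
        return Math.floorMod(key.hashCode(), numberOfPrimaryShards);
    }
}
```

Because the result depends only on the key and the shard count, an update that reuses the same ID (or the same routing value) lands on the same shard as the original document, while an update that omits the original routing value may be sent to a different shard.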

Couchbase XDCR Elasticsearch speed and deletions

We are thinking about implementing some sort of message cache which would hold onto the messages we send to our search index so we could persist while the index was down for an extended period of time (for example a complete re-index) then 're-apply' the messages. These messages are creations or updates of the documents we index. If space were cheap enough, with something as scalable as Couchbase we may even be able to hold all messages but I haven't done any sort of estimations of message size and quantity yet. Anyway, I suggested Couchbase + XDCR + Elasticsearch for this task as most of the work would be done automatically however there are 4 questions I have remaining:
1. If we were implementing this as a cache, I would not want Elasticsearch to remove any documents that were not in Couchbase. Is this possible (perhaps it is even the default behaviour)?
2. Is it possible to apply some sort of versioning so that a document in the index is not overwritten by an older version coming from Couchbase?
3. If I were to add a new field to the index, I might need to re-index from the actual document datasource and then re-apply all the messages stored in Couchbase. I may have 100 million documents in Elasticsearch and, say, 500,000 documents in Couchbase that I want to re-apply to Elasticsearch. What would the speed be like?
4. Would I be able to apply any sort of logic in between Couchbase and Elasticsearch?
Update:
So we store documents in an RDBMS as we need instant access to inserted docs plus some other stuff. We send limited versions of the document to a search engine via messages. If we want to add a field to the index we need to re-index the system from the RDBMS somehow. If we have this Couchbase message cache we could add the field to messages first, then switch off the indexing of old messages and re-index from the RDBMS. We could then switch back on the indexing of the messages and the entire 'queue' of messages would be indexed without having lost anything.
This system (if it worked) would remove the need for an MQ server, a message listener and make sure no documents were missing from the index.
The versioning would be necessary as we don't want to apply an 'update' to the index which actually contains a more recent document (not sure if this would ever happen now I think about it).
I appreciate it's probably not too great a job to implement points 1 and 4 by changing the Elasticsearch plugin code but I would like to confirm that the idea is reasonable first!

The Couchbase-Elasticsearch integration today should be seen as an indexing engine for Couchbase. This means the index is "managed/controlled" by the data that is in Couchbase.
XDCR is used to send "all the events" to Elasticsearch. This means the index is updated every time a document stored in Couchbase is created, modified or deleted.
So "all the documents" stored in a Couchbase bucket are indexed into Elasticsearch.
Let's answer your questions one by one, based on the current implementation of the Couchbase-Elasticsearch integration.
1. When a document is removed from Couchbase, the Elasticsearch index is updated (the entry is removed).
2. I am not sure I understand the question. How could an "older" version come from Couchbase? In any case, once again, every time a document stored in Couchbase is modified, the index in Elasticsearch is updated.
3. I am not sure I understand where you want to add the new field. If it is in a document stored in Couchbase, the index will be updated when the document is sent to Elasticsearch. But as I said before, all documents stored in Couchbase will be present in the Elasticsearch index.
4. Not with the plugin as it is today, but as you know it is an open source project, so you can either add some logic to it or even contribute your ideas to the project ( https://github.com/couchbaselabs/elasticsearch-transport-couchbase )
So let me ask you some more questions:
- how do you insert the documents in your application? (and where: Couchbase? Elasticsearch?)
- what are the types of documents?
- what do you want to cache in Couchbase?

How to get a response from Elastic Search after indexing?

I'm using CouchDB river plugin with Elastic Search. In my web application, I am using CouchDB's bulk insert to insert documents into CouchDB. This triggers the changes feed and ES reads this to index my documents. The problem now is that my web ui isn't showing anything because ES is still indexing the documents.
I'm using PyES to "talk" to ES by the way. Is there any function I can call to know whether Elastic Search is busy indexing?
Thanks a million.

Even while ES is indexing, it should still answer queries.
Could you check with
curl localhost:9200/_search?q=*
that your index has docs in it while indexing from CouchDB?
[UPDATE]
You should know that Elasticsearch is a near-real-time search engine, so you have to wait a few seconds before you can search for your docs.
You can retrieve your docs by ID immediately, but to search them you need to wait for the refresh process.
You can trigger the refresh API manually, but that could slow down your insertions dramatically.
Does it help?
