How to keep track of Elasticsearch requests

In my Elasticsearch cluster I have 2 indices. I need to keep track of the requests that come to these indices. For example, I have customer and product indices. When a new customer document is added to the customer index, I need to get the id of the added document and its body.
Another example: when a product document is updated, I also need the id of that product and its body, or what changed in that document.
My Elasticsearch version is 7.17.
(I am writing in Node.js; if you have any code examples or solutions, I would appreciate it.)

You can do this via the Elasticsearch slow log, by reducing the thresholds to 0 so it logs every request, or via some other proxy that intercepts the requests. Unfortunately, Elasticsearch doesn't do this out of the box. A sketch of the slow log approach is below.
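A minimal sketch with the official Node.js client (@elastic/elasticsearch v7), assuming the indices are named customer and product as in your example. All thresholds here are set to 0ms so everything is logged, which can be very noisy on a busy cluster:

// npm install @elastic/elasticsearch@7
const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function enableSlowLog(index) {
  // 0ms trace thresholds make the slow log record every request;
  // source: true logs the whole document body with each entry.
  await client.indices.putSettings({
    index,
    body: {
      'index.indexing.slowlog.threshold.index.trace': '0ms',
      'index.indexing.slowlog.source': true,
      'index.search.slowlog.threshold.query.trace': '0ms',
      'index.search.slowlog.threshold.fetch.trace': '0ms',
    },
  });
}

enableSlowLog('customer')
  .then(() => enableSlowLog('product'))
  .catch(console.error);

With source enabled, the document id and body show up in each node's indexing slow log file, which you can then tail or ship to your own tracking store.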

Related

Elastic Search: Update of existing Record (which has custom routing param set) results in duplicate record, if custom routing is not set during update

Env Details:
Elasticsearch version 7.8.1
The routing param is optional in index settings.
As per the Elasticsearch docs - https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html
When indexing documents specifying a custom _routing, the uniqueness of the _id is not guaranteed across all of the shards in the index. In fact, documents with the same _id might end up on different shards if indexed with different _routing values.
We have landed in the same scenario: earlier we were using a custom routing param (let's say customerId), and for some reason we now need to remove custom routing.
This means the docId will now be used as the default routing param, which is creating duplicate records with the same id across different shards during index operations. Earlier (before removing custom routing) the same operation resulted in an update of the record, as expected.
I am thinking of the following approaches to get out of this; please advise if you have a better approach to suggest. The key here is to AVOID DOWNTIME.
Approach 1:
As we receive the update request, let the duplicate record get created. Once the record without custom routing is created, issue a delete request for the record with custom routing.
CONS: If there is no update on some records, those records will linger around with custom routing. We want to avoid this, as it might result in unforeseen scenarios in the future.
Approach 2:
Use the Reindex API to migrate the data to a new index (turning off custom routing during migration). The application will use the new index after a successful migration.
CONS: Some of our indexes are huge and take 12+ hrs to reindex, and since the Elasticsearch Reindex API uses a snapshot mechanism, it will not migrate newer records created during that 12-hour window. This would require downtime.
Please suggest alternative if you have faced this before.
Thanks @Val. I also found a few other approaches, like writing to both indexes and reading from the old one, then shifting reads to the new one after re-indexing is finished. Something along the following lines (a sketch of the alias swap and routing-free reindex follows below):
- Create aliases pointing to the old indices (*_v1)
- Point the application to these aliases instead of the actual indices
- Create new indices (*_v2) with the same mapping
- Move data from the old indices to the new ones using re-indexing, making sure we don't retain custom routing during this
- Post re-indexing, change the aliases to point to the new indices instead of the old ones (need to verify this, but there are easy alternatives if it doesn't work)
- Once verification is done, delete the old indices
What do we do in the transition period (the window between reindexing start and reindexing finish)?
- Write to both indices (old and new) and read from the old indices via the aliases
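A minimal sketch of this alias-based migration with the Node.js client; the names customer_v1, customer_v2 and the alias customer are hypothetical stand-ins for your *_v1/*_v2 indices:

const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function migrateOffCustomRouting() {
  // 1. Alias the old index so the application never references a
  //    concrete index name.
  await client.indices.updateAliases({
    body: { actions: [{ add: { index: 'customer_v1', alias: 'customer' } }] },
  });

  // 2. Copy the data; routing: 'discard' drops the custom routing
  //    value stored with each document.
  await client.reindex({
    wait_for_completion: true,
    body: {
      source: { index: 'customer_v1' },
      dest: { index: 'customer_v2', routing: 'discard' },
    },
  });

  // 3. Swap the alias; both actions in one call are applied
  //    atomically, so readers never see a missing alias.
  await client.indices.updateAliases({
    body: {
      actions: [
        { remove: { index: 'customer_v1', alias: 'customer' } },
        { add: { index: 'customer_v2', alias: 'customer' } },
      ],
    },
  });
}

migrateOffCustomRouting().catch(console.error);

routing: 'discard' in the reindex destination is what drops the stored custom routing, so the default _id-based routing applies in the new index.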

Facing an issue while reindexing documents to Elasticsearch using elasticsearch-7.9.0.jar

We are using Elasticsearch 7.9.0 with a single-node cluster. We have developed a plugin which sits inside the plugins folder of Elasticsearch. We are using the Elasticsearch SDK (org.elasticsearch:elasticsearch:7.9.0), the mustache client (org.elasticsearch.plugin:lang-mustache-client:7.9.0) and a couple of other jars. We have exposed a couple of REST endpoints, which our client services consume instead of the standard Elasticsearch URLs, in order to perform CRUD operations on documents.
We observed that when we try to re-index documents, certain fields often do not get updated. This behavior is not at all consistent; it is completely random in nature. We are really not sure which fields will be updated and which won't, and the issue does not happen for all documents of a particular index in a consistent manner. We really don't know which document will have this issue.
A few important code snippets for reindexing the document, along with a log statement, are below:
// index, id and the partial doc (placeholder variables here) must be set before executing
org.elasticsearch.action.update.UpdateRequestBuilder updateRequestBuilder = ((Client) client.getClient()).prepareUpdate();
updateRequestBuilder.setIndex(indexName).setId(documentId);
updateRequestBuilder.setDoc(partialDocJson, XContentType.JSON);
updateRequestBuilder.setDocAsUpsert(true);  // create the document if it doesn't exist yet
updateRequestBuilder.setRetryOnConflict(1); // retry once on a version conflict
UpdateResponse updateResponse = updateRequestBuilder.get();
logger.info("Response after updating the document: {}", updateResponse);
Sample response after updating the document is below:
UpdateResponse[index=sanity_616d2c81737fc26067d9d220_transaction,type=_doc,id=STZ_8411,version=39,seqNo=3955,primaryTerm=6,result=updated,shards=ShardInfo{total=2, successful=1, failures=}]
When we identify that a document was not updated properly in Elasticsearch and inspect the log, we usually observe a log line like the one above. The log always returns result=updated and successful=1, despite certain fields not getting updated.
Please note, this issue is not consistent. It happens very randomly.
Kindly help us on this!

ElasticSearch 1.7 (Spring Data ElasticSearch) update by query takes a lot of time to update documents

My application allows updating multiple Elasticsearch documents in a single request.
I use the Elasticsearch BulkRequestBuilder to update all such documents in bulk.
BulkRequestBuilder bulkRequestBuilder = elasticSearchClient.prepareBulk();
documents.forEach(id -> {
    UpdateRequest updateRequest = new UpdateRequestBuilder(elasticSearchClient)
            .setType("MyDocumentType")
            .setIndex("MyDocumentIndex")
            .setId(id)
            .setDoc("fieldName", "valueToBeUpdated")
            .request();
    bulkRequestBuilder.add(updateRequest);
});
// update in bulk
bulkRequestBuilder.get();
All the documents do get updated with valueToBeUpdated, but Elasticsearch internally takes time to update them all, and the call to bulkRequestBuilder.get() returns even before the documents are updated (indicating the async nature of the Elasticsearch engine).
Could anyone please suggest how to make these updates synchronous across all documents?
Finally I found the core issue (it may be the default behaviour) behind the updates taking time in the Elasticsearch engine.
By default, Elasticsearch's updates are ASYNC in nature (as I pointed out in my question already). There are a couple of links explaining this default behaviour.
E.g. the Elasticsearch GET API documentation states that in order to get a document, the engine does a refresh so that all previous updates become visible. This hints that the async nature of Elasticsearch is why searching immediately after the update does not return the updated documents.
For now, to keep the existing behaviour, trigger the bulk update synchronously as follows:
bulkRequestBuilder.setReplicationType(ReplicationType.SYNC).setRefresh(true).get();
Usually, problems indexing/updating a lot of data come from segment merging in ES.
One tip from the ES people is to disable refresh before indexing/updating a lot of data.
You can achieve this by setting the index's refresh_interval to -1 before indexing and, once all your data is indexed, returning it to your previous index configuration. A sketch is below.
See Tune-indexing-speed.
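A minimal sketch of that refresh toggle, shown with the modern Node.js client for brevity (the same settings body works against the 1.x update-settings endpoint; the index argument and doBulkUpdates callback are hypothetical):

const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function withRefreshDisabled(index, doBulkUpdates) {
  // Turn off the periodic refresh while the heavy write load runs.
  await client.indices.putSettings({
    index,
    body: { 'index.refresh_interval': '-1' },
  });
  try {
    await doBulkUpdates();
  } finally {
    // Restore the default interval and force a refresh so the
    // updates become visible to search immediately.
    await client.indices.putSettings({
      index,
      body: { 'index.refresh_interval': '1s' },
    });
    await client.indices.refresh({ index });
  }
}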

elasticsearch: update the doc if it exists in all the shards of an index

I googled how to update docs in ES across all the shards of an index, if they exist. I found a way (the /_bulk API), but it requires specifying the routing values. I was not able to find a solution to my problem. If anybody is aware of the things below, please update me.
Is there any way to update the doc in all the shards of an index, if it exists, using a single update query?
If not, is there any way to generate routing values such that we would be able to hit all shards with the update query?
Ideally, for bulk updates, ES recommends getting the documents that need updating by query using scan and scroll, updating them, and indexing them again (a sketch of this follows below). Internally, ES never updates a document in place, although it provides an Update API through scripting; it always reindexes a new document with the updated field/value and marks the older document as deleted.
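A scroll-then-bulk sketch of that recommendation, written against the modern Node.js client since this answer predates the built-in update-by-query; index, query and patch are hypothetical arguments:

const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function updateByScroll(index, query, patch) {
  // Open a scroll over the documents that need updating.
  let response = await client.search({
    index,
    scroll: '1m',
    size: 500,
    body: { query },
  });

  while (response.body.hits.hits.length > 0) {
    // Re-index each hit with the patched fields; reusing _id (and
    // _routing, when present) sends the doc back to the shard it
    // came from, where the old version is marked deleted.
    const body = response.body.hits.hits.flatMap((hit) => [
      { index: { _index: index, _id: hit._id, routing: hit._routing } },
      { ...hit._source, ...patch },
    ]);
    await client.bulk({ body });

    response = await client.scroll({
      scroll_id: response.body._scroll_id,
      scroll: '1m',
    });
  }
}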
Is there any way to update the doc in all the shards of an index, if it exists, using a single update query?
You can check the Update API to see if it suits your purpose. There are also plugins which can provide update-by-query. Check this.
Now comes the routing part and updating all shards. If you specified a routing value when indexing the document for the very first time, then whenever you update the document you need to supply the original routing value. Otherwise ES has no way to know which shard the document resides on, and it could address any shard (algorithm-based).
If you don't use a routing value, then ES uses an algorithm based on the ID of the document to decide which shard it goes to. Hence, when you update a document through the bulk API and keep the same ID without routing, the document is saved on the same shard as before, and you will see the update. An example of both cases is below.
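A minimal sketch of both cases with the Node.js client; the index, id, routing value and field are hypothetical:

const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function updateExamples() {
  // Document indexed with a custom routing value: the same routing
  // must be supplied on update, otherwise ES hashes the _id and may
  // address a shard the document doesn't live on.
  await client.update({
    index: 'product',
    id: 'P-100',
    routing: 'customer-42', // the routing value used at index time
    body: { doc: { price: 9.99 } },
  });

  // Without custom routing, the _id alone determines the shard, so a
  // plain update by id always reaches the right shard.
  await client.update({
    index: 'product',
    id: 'P-100',
    body: { doc: { price: 9.99 } },
  });
}

updateExamples().catch(console.error);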

Couchbase XDCR Elasticsearch speed and deletions

We are thinking about implementing some sort of message cache which would hold onto the messages we send to our search index, so we could persist them while the index was down for an extended period of time (for example a complete re-index) and then 're-apply' them. These messages are creations or updates of the documents we index. If space were cheap enough, with something as scalable as Couchbase we might even be able to hold all messages, but I haven't done any estimation of message size and quantity yet. Anyway, I suggested Couchbase + XDCR + Elasticsearch for this task, as most of the work would be done automatically; however, I have 4 questions remaining:
1. If we were implementing this as a cache, I would not want Elasticsearch to remove any documents that were not in Couchbase. Is this possible (perhaps it is even the default behaviour)?
2. Is it possible to apply some sort of versioning so that a document in the index is not overwritten by an older version coming from Couchbase?
3. If I were to add a new field to the index, I might need to re-index from the actual document datasource and then re-apply all the messages stored in Couchbase. I may have 100 million documents in Elasticsearch and, say, 500,000 documents in Couchbase that I want to re-apply to Elasticsearch. What would the speed be like?
4. Would I be able to apply any sort of logic in between Couchbase and Elasticsearch?
Update:
So we store documents in an RDBMS, as we need instant access to inserted docs plus some other stuff. We send limited versions of the documents to a search engine via messages. If we want to add a field to the index, we need to re-index the system from the RDBMS somehow. If we had this Couchbase message cache, we could add the field to the messages first, then switch off the indexing of old messages and re-index from the RDBMS. We could then switch the indexing of messages back on, and the entire 'queue' of messages would be indexed without anything having been lost.
This system (if it worked) would remove the need for an MQ server and a message listener, and would make sure no documents were missing from the index.
The versioning would be necessary as we don't want to apply an 'update' to the index which actually contains a more recent document (not sure this would ever happen, now that I think about it).
I appreciate it's probably not too great a job to implement points 1 and 4 by changing the Elasticsearch plugin code, but I would like to confirm that the idea is reasonable first!
The Couchbase-Elasticsearch integration today should be seen as an indexing engine for Couchbase. This means the index is "managed/controlled" by the data that is in Couchbase.
XDCR is used to send "all the events" to Elasticsearch. This means the index is updated/deleted every time a document (stored in Couchbase) is created, modified or deleted.
So "all the documents" stored in a Couchbase bucket are indexed in Elasticsearch.
Let's answer your questions one by one, based on the current implementation of the Couchbase-Elasticsearch integration.
1. When a document is removed from Couchbase, the Elasticsearch index is updated (the entry is removed).
2. I am not sure I understand the question. How could an "older" version come from Couchbase? Anyway, once again, every time a document stored in Couchbase is modified, the index in Elasticsearch is updated.
3. I am not sure I understand where you want to add the new field. If it is in a document stored in Couchbase, the index will be updated when the document is sent to Elasticsearch. But based on what I have said before: all documents "stored" in Couchbase will be present in the Elasticsearch index.
4. Not with the plugin as it is today, but as you know it is an open source project, so you can either add some logic to it or even contribute your ideas to the project ( https://github.com/couchbaselabs/elasticsearch-transport-couchbase )
So let me ask you some more questions:
- how do you insert the documents in your application? (and where? Couchbase? Elasticsearch?)
- what are the types of documents?
- what do you want to cache in Couchbase?
