What is the difference between modifying and updating in Elasticsearch? - elasticsearch

I am following Elasticsearch official docs where there is a section on Modifying Document: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/_modifying_your_data.html
So I already have a document under /customer/_doc/1:
{
"_index" : "customer",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"_seq_no" : 1,
"_primary_term" : 1,
"found" : true,
"_source" : {
"name" : "ajay"
}
}
Below is the request to "modify"
PUT /customer/_doc/1
{
"firstname": "ajay",
"lastname": "tanwar"
}
GET would return the updated document
{
"_index" : "customer",
"_type" : "_doc",
"_id" : "1",
"_version" : 2,
"_seq_no" : 2,
"_primary_term" : 1,
"found" : true,
"_source" : {
"firstname" : "ajay",
"lastname" : "tanwar"
}
}
On the next page of docs, Updating Documents https://www.elastic.co/guide/en/elasticsearch/reference/6.2/_updating_documents.html
Below is the request used to "update"
POST /customer/_doc/1/_update
{
"doc":{
"firstname": "ajay",
"lastname": "tanwar"
}
}
This also returns the same result as "modify".
I noticed two differences between them:
The "modify" request increments the _version on each request, whereas the "update" request keeps the _version the same.
The "modify" request's response contains "result" : "updated", whereas the "update" request's response contains "result" : "noop".
But I have a few doubts. First of all, why does "modify" return "result" : "updated"? The docs themselves say it's a modification operation. And why does "update" return "result" : "noop"? What is noop, by the way?
Logically, modifying and updating are the same thing. What is the purpose of these two different APIs?

When you modify a document, you delete the old document and insert an entirely new document in its place. This is similar to HTTP's PUT method, in that it simply replaces the old document with whatever is sent in the HTTP body.
When you update a document, you make changes to the old document. Internally, Elasticsearch still deletes the old document and inserts a new (updated) one. However, the operation should be treated as if it only changed the old document. This is similar to HTTP's PATCH method, in that it keeps the old document and applies only the changes sent in the HTTP body.
"result" : "updated" means changes were made to the data stored in Elasticsearch, whereas "result" : "noop" (no operation) means nothing happened (probably because the document after the update would have been the same as before it).

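The difference described in the answer above can be illustrated with a tiny in-memory model. This is only a sketch of the semantics, not Elasticsearch code: `TinyIndex`, `index`, and `update` are hypothetical names made up for illustration, and the merge here is shallow.

```python
class TinyIndex:
    """Toy in-memory model of index (replace) vs. update (merge) semantics."""

    def __init__(self):
        self.docs = {}  # doc_id -> {"_version": int, "_source": dict}

    def index(self, doc_id, source):
        """Modelled on PUT /customer/_doc/<id>: replace the whole document."""
        old = self.docs.get(doc_id)
        version = old["_version"] + 1 if old else 1
        self.docs[doc_id] = {"_version": version, "_source": dict(source)}
        return {"result": "updated" if old else "created", "_version": version}

    def update(self, doc_id, partial):
        """Modelled on an update with a "doc" body: merge into the old document."""
        old = self.docs[doc_id]
        merged = {**old["_source"], **partial}  # shallow merge (assumption)
        if merged == old["_source"]:
            # Nothing would change, so nothing is written and the version stays.
            return {"result": "noop", "_version": old["_version"]}
        version = old["_version"] + 1
        self.docs[doc_id] = {"_version": version, "_source": merged}
        return {"result": "updated", "_version": version}


idx = TinyIndex()
idx.index("1", {"name": "ajay"})
r1 = idx.index("1", {"firstname": "ajay", "lastname": "tanwar"})   # replaces, drops "name"
r2 = idx.update("1", {"firstname": "ajay", "lastname": "tanwar"})  # same content: noop
r3 = idx.update("1", {"lastname": "Tanwar"})                       # real change
```

In this model the second call replaces the document and bumps the version, the third changes nothing and reports "noop", and the fourth merges one field into the existing document.
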
Related

How to de-normalize the relational data in Elasticsearch

I am working on a social networking application and I am using Elasticsearch for service data. I have multiple joins in Elasticsearch. Users can share posts, and each post has one parent user. I have a scenario in which I have to show the posts of the users that you follow.
Type Post
{
"_index" : "xxxxxx",
"_type" : "_doc",
"_id" : "p-370648",
"_score" : null,
"_routing" : "2",
"_source" : {
"uid" : "9a73b0e0-a52c-11ec-aa58-37061b467b8c",
"user_id" : 87,
"id" : 370648,
"type" : {
"parent" : "u-87",
"name" : "post"
},
"item_type_number" : 2,
"source_key" : "youtube-5wcpIrpbvXQ#2"
}
}
Type User
{
"_index" : "trending",
"_type" : "_doc",
"_id" : "u-56432",
"_score" : null,
"_routing" : "1",
"_source" : {
"gender" : "female",
"picture" : "125252125.jpg",
"uid" : "928de1a5-cc93-4fd3-adec-b9fb220abc2b",
"full_name" : "Shannon Owens",
"dob" : "1990-08-18",
"id" : 56432,
"username" : "local_12556",
"type" : {
"name" : "user"
}
}
}
Type Follow
{
"_index" : "trending",
"_type" : "_doc",
"_id" : "fr-561763",
"_score" : null,
"_routing" : "6",
"_source" : {
"user_id" : 25358,
"id" : 561763,
"object_id" : 36768,
"status" : "U",
"type" : {
"parent" : "u-36768",
"name" : "followers"
}
}
}
So in this scenario, if a user follows someone, we save a record in Elasticsearch of type "followers", with object_id set to the followed user and user_id set to the user who follows. On the other hand, each post has one parent user. So when I fetch documents of type post from Elasticsearch, I need a two-level join.
The first join is from the post to its parent user, and the second checks the follow status against the user. This query works well when there is no traffic on the system. But when traffic arrives and concurrent requests are sent, the Elasticsearch query goes down under the processing load. I tried to fix this with a more powerful server (more CPU and RAM), but it still falls over.
So I decided to denormalize the post data, but the problem is that I am unable to check the follow status with the post.
If I run another query against the DB and use some caching, I hit a memory exhaustion issue when the data for thousands of followed users comes back in the query. So is there any way to check the follow status directly on documents of type post, instead of adding a parent join to the query?

Duplicated ElasticSearch documents

We use a Spring Boot application to insert/update Elasticsearch documents. Our data provider sends us data via Kafka. Our app processes events, tries to find a record, and inserts the record if it does not exist or updates it if the received record differs from the saved one. There shouldn't be any duplicated records in Elasticsearch.
The app inserts/updates documents with IMMEDIATE refresh.
Problem:
Occasionally we have to remove all data and load it again, because there are duplicated records. I found out that these cloned records differ only in insert date. The difference is usually a few hours.
Generally it works as expected, detailed integration tests on org.codelibs.elasticsearch-cluster-runner are green.
Example metadata from elastic search query:
{
"docs" : [
{
"_index" : "reference",
"_type" : "reference",
"_id" : "s0z-BHIBCvxpj4TjysIf",
"_version" : 1,
"_seq_no" : 17315835,
"_primary_term" : 40,
"found" : true,
"_source" : {
...
"insertedDate" : 1589221706262,
...
}
},
{
"_index" : "reference",
"_type" : "reference",
"_id" : "jdVCBHIBXucoJmjM8emL",
"_version" : 1,
"_seq_no" : 17346529,
"_primary_term" : 41,
"found" : true,
"_source" : {
...
"insertedDate" : 1589209395577,
...
}
}
]
}
Tests
I loaded data into a local instance of ES many times - no duplicates
I created a few long-running integration tests with a large number of inserts, updates, and queries on a local instance of org.codelibs.elasticsearch-cluster-runner with 1 to 5 in-memory nodes - no duplicates
Details:
Elasticsearch version - 7.5
ES connection with org.elasticsearch.client.RestHighLevelClient
The reason has been found: one of the nodes had trouble establishing a connection and disconnected from time to time.
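As a side note, the two duplicates shown above have different `_id` values that appear to be auto-generated. One common way to make a find-then-insert flow tolerant of such hiccups is to derive the `_id` from the fields that define a record's identity, so that a repeated insert overwrites the same document instead of creating a second one. This is not what the application above does; it is a minimal sketch in Python for illustration (the real app is Spring Boot), and `deterministic_id` and the field names `source` and `reference` are hypothetical.

```python
import hashlib
import json


def deterministic_id(record, key_fields):
    """Derive a stable document ID from the fields that define record identity."""
    key = {field: record[field] for field in key_fields}
    # Canonical JSON (sorted keys, fixed separators) so equal keys hash equally.
    canonical = json.dumps(key, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


# Two records that differ only in insert date, as in the duplicates above.
first = {"source": "provider-a", "reference": "R-100", "insertedDate": 1589221706262}
second = {"source": "provider-a", "reference": "R-100", "insertedDate": 1589209395577}

id_first = deterministic_id(first, ("source", "reference"))
id_second = deterministic_id(second, ("source", "reference"))
```

Both records map to the same ID here, so indexing either one under that ID would replace the other rather than add a duplicate.
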

Using field instead of "_id" for more-like-this query

I have a slug field that I want to use, instead of the "_id" field, to identify the object to use as a reference. But instead of using it as a reference, the doc seems to be used as a query to compare against. Since slug is a unique field with a simple analyzer, it just returns exactly one result, like the following. As far as I know, there is no way to use a custom field as the _id field:
https://github.com/elastic/elasticsearch/issues/6730
So is a double lookup (finding Elasticsearch's id first, then doing more_like_this) the only way to achieve what I am looking for? Someone seems to have asked a similar question three years ago, but it doesn't have an answer.
from elasticsearch_dsl import Q
# ArticleDocument is the Document class defined elsewhere in the project.
ArticleDocument.search().query(
    "bool",
    should=Q(
        "more_like_this",
        fields=["slug", "text"],
        like={"doc": {"slug": "OEXxySDEPWaUfgTT54QvBg"}, "_index": "article", "_type": "doc"},
        min_doc_freq=1,
        min_term_freq=1,
    ),
).to_queryset()
Returns:
<ArticleQuerySet [<Article: OEXxySDEPWaUfgTT54QvBg)>]>
You can use one of your document's fields as the "default" _id while ingesting data.
Logstash
output {
elasticsearch {
hosts => ["localhost:9200"]
index => "my_name"
document_id => "%{some_field_id}"
}
}
Spark (Scala)
DF.saveToEs("index_name" + "/some_type", Map("es.mapping.id" -> "some_field_id"))
Index API
PUT twitter/_doc/1
{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}
{
"_shards" : {
"total" : 2,
"failed" : 0,
"successful" : 2
},
"_index" : "twitter",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"result" : "created"
}
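If documents are ingested with the slug as their `_id`, as suggested above, the more_like_this query can reference the document directly through `like`, which accepts entries of the form `{"_index": ..., "_id": ...}`, so no second lookup is needed. The following is a minimal sketch that only builds the request body as a plain dictionary; `more_like_this_by_slug` is a hypothetical helper, and it assumes the index is named `article` and the slug really was used as `_id` at index time.

```python
def more_like_this_by_slug(slug, index="article", fields=("text",)):
    """Build a search body that asks for documents similar to the one whose _id is `slug`."""
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                # Reference an already indexed document instead of passing its content.
                "like": [{"_index": index, "_id": slug}],
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        }
    }


body = more_like_this_by_slug("OEXxySDEPWaUfgTT54QvBg")
```

The resulting `body` could then be sent with whichever client is in use. Only `text` is listed in `fields` here, on the assumption that similarity should be computed from the article text rather than from the slug itself.
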

Elasticsearch - How to update document

How does Elasticsearch update a document? Does it delete the original document and create a new one? I've heard this is how NoSQL databases update data. Does Elasticsearch do the same as other NoSQL databases, or does it replace/insert only the fields that need to change?
For example, I'm running Elasticsearch 7.0.0.
First, I created one document,
PUT /employee/_doc/1
{
"first_name" : "John",
"last_name" : "Snow",
"age" : 19,
"about" : "King in the north",
"sex" : "male"
}
Then I updated it via
POST /employee/_update/1/
{
"doc": {
"first_name" : "Aegon",
"last_name" : "Targaryen",
"skill": "fighting and leading"
}
}
Finally, I got the correct result from
GET /employee/_doc/1
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "1",
"_version" : 9,
"_seq_no" : 11,
"_primary_term" : 1,
"found" : true,
"_source" : {
"first_name" : "Aegon",
"last_name" : "Targaryen",
"age" : 19,
"about" : "King in the north",
"sex" : "male",
"skill" : "fighting and leading"
}
}
Documents in Elasticsearch are immutable objects. Updating a document is always a reindexing, and it consists of the following steps:
Retrieve the JSON (that you want to reindex)
Change it
Delete the old document
Index a new document
Elasticsearch documentation
For the answer you can check the documentation:
In addition to being able to index and replace documents, we can also
update documents. Note though that Elasticsearch does not actually do
in-place updates under the hood. Whenever we do an update,
Elasticsearch deletes the old document and then indexes a new document
with the update applied to it in one shot.
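The four steps listed above (retrieve, change, delete, index) can be sketched with a toy append-only store. This is a simplified model for illustration only, not how Elasticsearch is implemented; `AppendOnlyStore` and its methods are hypothetical names.

```python
class AppendOnlyStore:
    """Toy store in which written entries are never modified in place."""

    def __init__(self):
        self.entries = []     # every copy ever written, in write order
        self.live = {}        # doc_id -> position of the current copy in entries
        self.deleted = set()  # positions of copies that have been marked deleted

    def index(self, doc_id, source):
        if doc_id in self.live:
            self.deleted.add(self.live[doc_id])      # step 3: mark the old copy deleted
        self.entries.append((doc_id, dict(source)))  # step 4: write a new copy
        self.live[doc_id] = len(self.entries) - 1

    def get(self, doc_id):
        return dict(self.entries[self.live[doc_id]][1])

    def update(self, doc_id, partial):
        current = self.get(doc_id)   # step 1: retrieve the JSON
        current.update(partial)      # step 2: change it
        self.index(doc_id, current)  # steps 3 and 4: delete old, index new


store = AppendOnlyStore()
store.index("1", {"first_name": "John", "last_name": "Snow", "age": 19})
store.update("1", {"first_name": "Aegon", "last_name": "Targaryen",
                   "skill": "fighting and leading"})
```

After the update, the store holds two entries: the original copy, now marked as deleted, and the new copy that carries both the untouched `age` field and the changed fields.
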

Is the order of operations guaranteed in a bulk update?

I am sending delete and index requests to elasticsearch in bulk (the example is adapted from the docs):
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
The sequence above is intended to first delete a possible document with _id=1, then index a new document with the same _id=1.
Is the order of the actions guaranteed? In other words, for the example above, can I be sure that the delete will not touch the document indexed afterwards (because the order was not respected for one reason or another)?
The delete operation is unnecessary in this scenario: if you simply index a document with the same ID, it will automatically and implicitly delete/replace the previous document with the same ID.
So if a document with ID=1 already exists, simply sending the command below will replace it (read: delete and re-index it)
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
According to an Elastic Team Member:
Elasticsearch is distributed and concurrent. We do not guarantee that requests are executed in the order they are received.
https://discuss.elastic.co/t/are-bulk-index-operations-serialized/83770/6
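Following the first answer, a single index action per document is enough, which avoids depending on the relative order of a delete and an index for the same ID. Below is a minimal sketch of how such a bulk body could be assembled as newline-delimited JSON; `build_bulk_body` is a hypothetical helper, not part of any client library. The bulk format expects every line, including the last one, to end with a newline.

```python
import json


def build_bulk_body(actions):
    """Serialize (action, source) pairs into a newline-delimited bulk body."""
    lines = []
    for action, source in actions:
        lines.append(json.dumps(action))
        if source is not None:  # delete actions carry no source line
            lines.append(json.dumps(source))
    # The trailing newline after the last line is required by the bulk format.
    return "\n".join(lines) + "\n"


body = build_bulk_body([
    ({"index": {"_index": "test", "_type": "type1", "_id": "1"}}, {"field1": "value1"}),
])
```

Here `body` contains exactly the two lines from the answer above: the index action and the document source.
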

Resources