Detect changes during bulk indexing - elasticsearch

We are using Elasticsearch v5.6.12 as our database. We update it frequently using the bulk REST API. Some of the time the individual requests won't change anything (i.e. the document already stored in Elasticsearch is up to date). How can I detect these instances?
I saw this (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html) but I'm not sure it's applicable in our situation.

You can use noop detection when checking the results of your bulk queries.
When the bulk query returns, you can iterate over each update result and check whether the result field has the value noop (vs updated).
# Say the document is indexed
PUT test/doc/1
{
  "test": "123"
}
# Now you want to bulk update it
POST test/doc/_bulk
{"update":{"_id": "1"}}
{"doc":{"test":"123"}} <-- this will yield `result: noop`
{"update":{"_id": "1"}}
{"doc":{"test":"1234"}} <-- this will yield `result: updated`
{"update":{"_id": "2"}}
{"doc":{"test":"3456"}, "doc_as_upsert": true} <-- this will yield `result: created`
Result:
{
  "took" : 6,
  "errors" : false,
  "items" : [
    {
      "update" : {
        "_index" : "test",
        "_type" : "doc",
        "_id" : "1",
        "_version" : 2,
        "result" : "noop", <-- see "noop"
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "status" : 200
      }
    },
    {
      "update" : {
        "_index" : "test",
        "_type" : "doc",
        "_id" : "1",
        "_version" : 3,
        "result" : "updated", <-- see "updated"
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 2,
        "_primary_term" : 1,
        "status" : 200
      }
    },
    {
      "update" : {
        "_index" : "test",
        "_type" : "doc",
        "_id" : "2",
        "_version" : 1,
        "result" : "created", <-- see "created"
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 0,
        "_primary_term" : 1,
        "status" : 201
      }
    }
  ]
}
As you can see, when specifying doc_as_upsert: true for the document with id 2, the document is created and the result field value is created.
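To automate that check, here is a minimal sketch in Python using the requests library (the host, index, and documents are assumptions for illustration, not from the question):

import json
import requests

ES = "http://localhost:9200"  # assumed local cluster

# Two partial updates of doc 1: one changes nothing, one changes the value.
actions = [
    {"update": {"_id": "1"}},
    {"doc": {"test": "123"}},   # same value -> expect "result": "noop"
    {"update": {"_id": "1"}},
    {"doc": {"test": "1234"}},  # new value  -> expect "result": "updated"
]
# The bulk body is newline-delimited JSON and must end with a newline.
body = "\n".join(json.dumps(a) for a in actions) + "\n"

resp = requests.post(
    ES + "/test/doc/_bulk",
    data=body,
    headers={"Content-Type": "application/x-ndjson"},
).json()

# Collect the ids whose update changed nothing.
noop_ids = [
    item["update"]["_id"]
    for item in resp["items"]
    if item["update"].get("result") == "noop"
]
print(noop_ids)  # e.g. ['1'] when doc 1 already held "123"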

Related

Elasticsearch: how to find a document by number in logs

I have an error in Kibana:
"The length [2658823] of field [message] in doc[235892]/index[mylog-2023.02.10] exceeds the [index.highlight.max_analyzed_offset] limit [1000000]. To avoid this error, set the query parameter [max_analyzed_offset] to a value less than index setting [1000000] and this will tolerate long field values by truncating them."
I know how to deal with it (change "index.highlight.max_analyzed_offset" for the index, or set the query parameter), but I want to find the document with the long field and examine it.
If I try to find it by id, I get this:
Query:
GET mylog-2023.02.10/_search
{
  "query": {
    "terms": {
      "_id": [ "235892" ]
    }
  }
}
Response:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
Query:
GET mylog-2023.02.10/_doc/235892
Response:
{ "_index" : "mylog-2023.02.10", "_type" : "_doc", "_id" :
"235892", "found" : false }
Maybe this number (doc[235892]) is not the id? How can I find this document?
Try using an ids query:
GET /_search
{
  "query": {
    "ids" : {
      "values" : ["1", "4", "100"]
    }
  }
}
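If you want to run the same lookup from a script, here is a minimal sketch in Python with the requests library (the host is an assumption):

import requests

# ids query against the index from the question; host is assumed.
resp = requests.get(
    "http://localhost:9200/mylog-2023.02.10/_search",
    json={"query": {"ids": {"values": ["235892"]}}},
).json()
print(resp["hits"]["total"])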

Check for documents that don't exist in Elasticsearch

I have millions of indexed documents. After indexing, I noticed a document count mismatch. I want to send an array of hundreds of document ids, check in Elasticsearch whether those ids exist, and get back in the response the ids that have not been indexed.
example:
these are indexed documents
[497499, 497550, 498370, 498476, 498639, 498726, 498826, 500479, 500780, 500918]
I'm sending 4 at a time
[497499, 88888, 497550, 77777]
the response should be the ones that are not there
[88888, 77777]
You should consider using the _mget endpoint and then parsing the result, for instance:
GET someidx/_mget?_source=false
{
  "docs" : [
    {
      "_id" : "c37m5W4BifZmUly9Ni-X"
    },
    {
      "_id" : "2"
    }
  ]
}
Result:
{
  "docs" : [
    {
      "_index" : "someidx",
      "_type" : "_doc",
      "_id" : "c37m5W4BifZmUly9Ni-X",
      "_version" : 1,
      "_seq_no" : 0,
      "_primary_term" : 1,
      "found" : true
    },
    {
      "_index" : "someidx",
      "_type" : "_doc",
      "_id" : "2",
      "found" : false
    }
  ]
}
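Then you only have to keep the ids whose found flag is false. A minimal sketch in Python with the requests library (host and ids are assumptions):

import requests

ES = "http://localhost:9200"  # assumed local cluster

def missing_ids(index, ids):
    """Return the subset of `ids` not present in `index`."""
    resp = requests.get(
        ES + "/" + index + "/_mget",
        params={"_source": "false"},  # we only care about found / not found
        json={"ids": ids},
    ).json()
    return [doc["_id"] for doc in resp["docs"] if not doc["found"]]

# Batch of 4 ids from the example; only two of them are indexed.
print(missing_ids("someidx", ["497499", "88888", "497550", "77777"]))
# -> ['88888', '77777']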

Elasticsearch doesn't update documents

I'm running into a problem with document updates.
I'm able to index (create) documents and they are correctly added to the index.
Nevertheless, when I try to update one of them, the operation is not performed; the document is not updated.
When I first add the document it looks like:
{
  "user" : "user4",
  "timestamp" : "2016-12-16T15:00:22.645Z",
  "startTimestamp" : "2016-12-16T15:00:22.645Z",
  "dueTimestamp" : null,
  "closingTimestamp" : null,
  "matter" : "F1",
  "comment" : null,
  "status" : 0,
  "backlogStatus" : 20,
  "metainfos" : {
    "ceeaceaaaceeaceaaaceeaceaaaceeaaceaaaceeabceaaa" : [ "FZ11" ]
  },
  "resources" : [ ],
  "notes" : null
}
This is the code I'm using to build the UpdateRequest:
this.elasticsearchResources.getElasticsearchClient()
    .prepareUpdate()
    .setIndex(this.user.getMe().getUser())
    .setType(type)
    .setId(id.toString())
    .setDoc(source)       // partial document merged into the existing one
    .setUpsert(source)    // full document indexed if the id doesn't exist yet
    .setDetectNoop(true); // skip the write when the merge changes nothing
I've also been able to debug the content of this request before sending it to Elasticsearch. The document is:
{
  "user":"user4",
  "timestamp":"2016-12-16T15:00:22.645Z",
  "startTimestamp":"2016-12-16T15:00:22.645Z",
  "dueTimestamp":null,
  "closingTimestamp":null,
  "matter":"F1",
  "comment":null,
  "status":0,
  "backlogStatus":20,
  "metainfos":{ },
  "resources":[ ],
  "notes":null
}
As you can see, the only difference is that metainfos is empty when I try to update the document.
After performing this update request, the document is not updated; the content of metainfos stays the same as before:
curl -XGET 'http://localhost:9200/user4/fuas/_search?pretty'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "living_v1",
      "_type" : "fuas",
      "_id" : "327c9435-c394-11e6-aa90-02420a011808",
      "_score" : 1.0,
      "_routing" : "user4",
      "_source" : {
        "user" : "user4",
        "timestamp" : "2016-12-16T15:00:22.645Z",
        "startTimestamp" : "2016-12-16T15:00:22.645Z",
        "dueTimestamp" : null,
        "closingTimestamp" : null,
        "matter" : "F1",
        "comment" : null,
        "status" : 0,
        "backlogStatus" : 20,
        "metainfos" : {
>>>>>>>>  "ceeaceaaaceeaceaaaceeaceaaaceeaaceaaaceeabceaaa" : [ "FZ11" ]
        },
        "resources" : [ ],
        "notes" : null
      }
    } ]
  }
}
I can't quite figure out what's wrong. Any ideas?
Elasticsearch will not update an empty object: a partial update merges the new document into the existing one, and merging an empty metainfos object into the existing object changes nothing. You can try with:
"metainfos" : null
or
"metainfos" : { "ceeaceaaaceeaceaaaceeaceaaaceeaaceaaaceeabceaaa" : [ ] }
to clean the field.
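For example, a minimal sketch in Python with the requests library, using the index, type, id, and routing visible in the search result above (the host is assumed to be local):

import requests

ES = "http://localhost:9200"  # assumed local cluster

# Partial update: a JSON null replaces the field instead of merging into it.
resp = requests.post(
    ES + "/living_v1/fuas/327c9435-c394-11e6-aa90-02420a011808/_update",
    params={"routing": "user4"},  # the hit above shows "_routing" : "user4"
    json={"doc": {"metainfos": None}},
).json()
print(resp.get("result"))  # "updated" if the field was cleared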

Entries of ElasticSearch hits

I am performing an Elasticsearch query through
curl -XGET 'http://localhost:9200/_search?pretty' -d '{"_source":["data.js_event", "data.timestamp","data.uid","data.element"],"query":{"match":{"event":"first_time_user_event"}}}'
I am ONLY interested in the output of _source, but the retrieval leads to something like
{"took" : 46,
"timed_out" : false,
"_shards" : {
"total" : 71,
"successful" : 71,
"failed" : 0
},
"hits" : {
"total" : 2062326,
"max_score" : 4.8918204,
"hits" : [ {
"_index" : "logstash-2015.11.22",
"_type" : "fluentd",
"_id" : "AVEv_blDT1yShMIEDDmv",
"_score" : 4.8918204,
"_source":{"data":{"js_event":"leave_page","timestamp":"2015-11-22T16:18:47.088Z","uid":"l4Eys1T/rAETpysn7E/Jog==","element":"body"}}
}, {
"_index" : "logstash-2015.11.21",
"_type" : "fluentd",
"_id" : "AVEnZa5nT1yShMIEDDW8",
"_score" : 4.0081544,
"_source":{"data":{"js_event":"hover","timestamp":"2015-11-21T00:15:15.097Z","uid":"E/4Fdl5uETvhQeX/FZIWfQ==","element":"infographic-new-format-selector"}}
},
...
Is there a way to get rid of _index, _type, _id and _score? The reason is that the query is performed on a remote server and I would like to limit the size of the downloaded data.
Thanks.
Yes, you can use response filtering (only available since ES 1.7) by specifying what you want in the response (e.g. filter_path=hits.hits._source) and ES will filter the rest out for you:
curl -XGET 'http://localhost:9200/_search?filter_path=hits.hits._source&pretty' -d '{"_source":["data.js_event", "data.timestamp","data.uid","data.element"],"query":{"match":{"event":"first_time_user_event"}}}'
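The same request expressed in Python with the requests library (a sketch; host and fields as in the question):

import requests

query = {
    "_source": ["data.js_event", "data.timestamp", "data.uid", "data.element"],
    "query": {"match": {"event": "first_time_user_event"}},
}
# filter_path strips everything except the _source of each hit.
resp = requests.get(
    "http://localhost:9200/_search",
    params={"filter_path": "hits.hits._source"},
    json=query,
).json()
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["data"])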

_mget and _search differences on ElasticSearch

I've indexed 2 documents.
As you can see, after having indexed them, I can find both of them in a search result:
[root@centos7 ~]# curl 'http://ESNode01:9201/living/fuas/_search?pretty'
{
  "took" : 20,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2, <<<<<<<<<<<<<<<<
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "living",
      "_type" : "fuas",
      "_id" : "idFuaMerge1", <<<<<<<<<<<<<<<
      "_score" : 1.0,
      "_source" : {"timestamp":"2015-10-14T16:13:49.004Z","matter":"null","comment":"null","status":"open","backlogStatus":"unknown","metainfos":[],"resources":[{"resourceId":"idResourceMerge1","noteId":"null"},{"resourceId":"idResourceMerge2","noteId":null}]}
    }, {
      "_index" : "living",
      "_type" : "fuas",
      "_id" : "idFuaMerge2", <<<<<<<<<<<<<<<<<<
      "_score" : 1.0,
      "_source" : {"timestamp":"2015-10-14T16:13:49.004Z","matter":"null","comment":"null","status":"open","backlogStatus":"unknown","metainfos":[],"resources":[{"resourceId":"idResourceMerge3","noteId":null}]}
    } ]
  }
}
After that, I perform a multiget request, setting the document ids:
[root@centos7 ~]# curl 'http://ESNode01:9201/living/fuas/_mget?pretty' -d '
{
  "ids": ["idFuaMerge1", "idFuaMerge2"]
}
'
{
  "docs" : [ {
    "_index" : "living",
    "_type" : "fuas",
    "_id" : "idFuaMerge1",
    "found" : false <<<<<<<<<<<<<<<<<<<<!!!!!!!!!!!!!!
  }, {
    "_index" : "living",
    "_type" : "fuas",
    "_id" : "idFuaMerge2",
    "_version" : 4,
    "found" : true, <<<<<<<<<<<<<<<!!!!!!!!!!!!!!!!!
    "_source" : {"timestamp":"2015-10-14T16:13:49.004Z","matter":"null","comment":"null","status":"open","backlogStatus":"unknown","metainfos":[],"resources":[{"resourceId":"idResourceMerge3","noteId":null}]}
  } ]
}
How on earth, on a multiget request, is the first document NOT found while the other one is?
This can only happen if you used a routing key when indexing your documents (a parent/child relation implies the same thing, since children are routed by their parent id).
When a document is indexed, it is mapped to a unique shard by the routing mechanism: the routing value (by default the document id) is hashed, and a modulo of that hash determines which shard the document goes to.
So, in short:
For document A, the default routing (based on the doc id) might resolve to shard 1.
But because you supplied a routing key yourself, the document was actually stored on a different shard, say shard 0.
Now, when you try to get the document without the routing key, Elasticsearch computes the default route, looks directly in shard 1, doesn't find the document there, and hence your multiget fails.
The search works because a search operation is executed across all shards.
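If that is the case, the fix is to pass the routing key per document in the _mget request. A minimal sketch in Python with the requests library ("user4" is a placeholder routing value, not from the question; note that older Elasticsearch releases spell the field _routing instead of routing):

import requests

# Supply the routing key per document so each get is sent to the right shard.
body = {
    "docs": [
        {"_id": "idFuaMerge1", "routing": "user4"},
        {"_id": "idFuaMerge2", "routing": "user4"},
    ]
}
resp = requests.get("http://ESNode01:9201/living/fuas/_mget", json=body).json()
print([(d["_id"], d["found"]) for d in resp["docs"]])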
