_mget and _search differences on Elasticsearch

I've indexed 2 documents:
As you can see, after indexing them, both show up in a search result:
[root@centos7 ~]# curl 'http://ESNode01:9201/living/fuas/_search?pretty'
{
  "took" : 20,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,             <<<<<<<<<<<<<<<<
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "living",
      "_type" : "fuas",
      "_id" : "idFuaMerge1",   <<<<<<<<<<<<<<<
      "_score" : 1.0,
      "_source":{"timestamp":"2015-10-14T16:13:49.004Z","matter":"null","comment":"null","status":"open","backlogStatus":"unknown","metainfos":[],"resources":[{"resourceId":"idResourceMerge1","noteId":"null"},{"resourceId":"idResourceMerge2","noteId":null}]}
    }, {
      "_index" : "living",
      "_type" : "fuas",
      "_id" : "idFuaMerge2",   <<<<<<<<<<<<<<<<<<
      "_score" : 1.0,
      "_source":{"timestamp":"2015-10-14T16:13:49.004Z","matter":"null","comment":"null","status":"open","backlogStatus":"unknown","metainfos":[],"resources":[{"resourceId":"idResourceMerge3","noteId":null}]}
    } ]
  }
}
After that, I perform a multiget request setting the document ids:
[root@centos7 ~]# curl 'http://ESNode01:9201/living/fuas/_mget?pretty' -d '
{
  "ids": ["idFuaMerge1", "idFuaMerge2"]
}
'
{
  "docs" : [ {
    "_index" : "living",
    "_type" : "fuas",
    "_id" : "idFuaMerge1",
    "found" : false          <<<<<<<<<<<<<<<<<<<<!!!!!!!!!!!!!!
  }, {
    "_index" : "living",
    "_type" : "fuas",
    "_id" : "idFuaMerge2",
    "_version" : 4,
    "found" : true,          <<<<<<<<<<<<<<<!!!!!!!!!!!!!!!!!
    "_source":{"timestamp":"2015-10-14T16:13:49.004Z","matter":"null","comment":"null","status":"open","backlogStatus":"unknown","metainfos":[],"resources":[{"resourceId":"idResourceMerge3","noteId":null}]}
  } ]
}
How on earth, in a multiget request, is the first document NOT found while the other one is?

This can only happen if you used a routing key when indexing your documents. A parent/child relation can have the same effect, since it also routes documents.
When a document is submitted for indexing, it is mapped to a single shard by the routing mechanism: the routing value (the document ID by default) is hashed, and the hash modulo the number of primary shards determines which shard the document goes to.
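As a sketch of that default rule in Python (the names here are illustrative, not Elasticsearch internals):
from typing import Any

# Default document routing: the routing value is the document _id
# unless a custom routing key is supplied at index time.
def target_shard(routing_value: Any, num_primary_shards: int) -> int:
    # Elasticsearch uses a murmur3-based hash internally; Python's
    # built-in hash() merely stands in for it in this sketch.
    return hash(routing_value) % num_primary_shards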
So, in short:
For documentA, the default shard might be 1; the default shard is computed from the routing key, which defaults to the document ID.
But because you supplied a routing key yourself, this document was mapped to a different shard, say 0.
Now when you try to get the document without the routing key, Elasticsearch expects the document to be in shard 1, not shard 0, and your multiget fails because it looks directly in shard 1 for the document.
The search works because a search operation fans out across all shards.
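For illustration, here is a minimal sketch of the same multiget with an explicit per-document routing value (assuming elasticsearch-py; the routing key "userRouting1" is hypothetical and must match whatever was used at index time):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://ESNode01:9201")

# Each entry in "docs" may carry its own routing value; without it,
# _mget hashes the _id alone and may look on the wrong shard.
response = es.mget(index="living", body={
    "docs": [
        {"_id": "idFuaMerge1", "routing": "userRouting1"},  # hypothetical key
        {"_id": "idFuaMerge2"},
    ]
})

for doc in response["docs"]:
    print(doc["_id"], doc["found"])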

Related

Check for documents that do not exist in Elasticsearch

I have millions of indexed documents. After indexing, I noticed a document count mismatch. I want to send an array of hundreds of document IDs and check in Elasticsearch whether those document IDs exist, and get back in the response the IDs that have not been indexed.
example:
these are indexed documents
[497499, 497550, 498370, 498476, 498639, 498726, 498826, 500479, 500780, 500918]
I'm sending 4 at a time
[497499, 88888, 497550, 77777]
the response should be the ones that are not there:
[88888, 77777]
You should consider using the _mget endpoint and then parsing the result, for instance:
GET someidx/_mget?_source=false
{
  "docs" : [
    { "_id" : "c37m5W4BifZmUly9Ni-X" },
    { "_id" : "2" }
  ]
}
Result:
{
  "docs" : [
    {
      "_index" : "someidx",
      "_type" : "_doc",
      "_id" : "c37m5W4BifZmUly9Ni-X",
      "_version" : 1,
      "_seq_no" : 0,
      "_primary_term" : 1,
      "found" : true
    },
    {
      "_index" : "someidx",
      "_type" : "_doc",
      "_id" : "2",
      "found" : false
    }
  ]
}
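To turn that response into the list of missing IDs, here is a small parsing sketch (assuming elasticsearch-py; the index name someidx is from the example above and the IDs are the ones from the question):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

ids_to_check = ["497499", "88888", "497550", "77777"]

# _source=False keeps the response small: only metadata and "found".
response = es.mget(index="someidx", body={"ids": ids_to_check}, _source=False)

missing = [doc["_id"] for doc in response["docs"] if not doc["found"]]
print(missing)  # e.g. ['88888', '77777']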

Using field instead of "_id" for more-like-this query

I have a slug field that I want to use to identify the object to use as the reference, instead of the "_id" field. But instead of being used as a reference, the doc seems to be used as a query to compare against. Since slug is a unique field with a simple analyzer, it just returns exactly one result, like the following. As far as I know, there is no way to use a custom field as the _id field:
https://github.com/elastic/elasticsearch/issues/6730
So is a double lookup (finding Elasticsearch's ID first, then doing more_like_this) the only way to achieve what I am looking for? Someone seems to have asked a similar question three years ago, but it doesn't have an answer.
ArticleDocument.search().query(
    "bool",
    should=Q(
        "more_like_this",
        fields=["slug", "text"],
        like={
            "doc": {"slug": "OEXxySDEPWaUfgTT54QvBg"},
            "_index": "article",
            "_type": "doc",
        },
        min_doc_freq=1,
        min_term_freq=1,
    ),
).to_queryset()
Returns:
<ArticleQuerySet [<Article: OEXxySDEPWaUfgTT54QvBg)>]>
You can make one of your document's fields the "default" _id while ingesting data.
Logstash
output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    index       => "my_name"
    document_id => "%{some_field_id}"
  }
}
Spark (Scala)
DF.saveToEs("index_name" + "/some_type", Map("es.mapping.id" -> "some_field_id"))
Index API
PUT twitter/_doc/1
{
  "user" : "kimchy",
  "post_date" : "2009-11-15T14:12:12",
  "message" : "trying out Elasticsearch"
}
Result:
{
  "_shards" : {
    "total" : 2,
    "failed" : 0,
    "successful" : 2
  },
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "result" : "created"
}
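Applied to the question above, the same idea with the Python client would look roughly like this (a sketch, assuming elasticsearch-py; the document body is illustrative):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

article = {"slug": "OEXxySDEPWaUfgTT54QvBg", "text": "..."}

# Index with the slug as the _id, so a later more_like_this "like"
# clause can reference the document by slug directly instead of
# looking up the generated _id first.
es.index(index="article", id=article["slug"], body=article)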

Kibana - given an index, how to find saved objects relying on it?

In Kibana I have many dozens of indices.
Given one of them, I want a way to find all the saved objects (searches/dashboards/visualizations) that rely on this index.
Thanks
You can retrieve the document ID of your index pattern and then use it to search your .kibana index:
{
  "_index" : ".kibana",
  "_type" : "index-pattern",
  "_id" : "AWBWDmk2MjUJqflLln_o",   <---- take this id...
You can use this query on Kibana 5:
GET .kibana/_search?q=AWBWDmk2MjUJqflLln_o <---- ...and use it here
You'll find your visualizations:
{
  "_index" : ".kibana",
  "_type" : "visualization",
  "_id" : "AWBZNJNcMjUJqflLln_s",
  "_score" : 6.2450323,
  "_source" : {
    "title" : "CA groupe",
    "visState" : """{"title":"XXX","type":"pie","params":{"addTooltip":true,"addLegend":true,"legendPosition":"right","isDonut":false,"type":"pie"},"aggs":[{"id":"1","enabled":true,"type":"sum","schema":"metric","params":{"field":"XXX","customLabel":"XXX"}},{"id":"2","enabled":true,"type":"terms","schema":"segment","params":{"field":"XXX","size":5,"order":"desc","orderBy":"1","customLabel":"XXX"}}],"listeners":{}}""",
    "uiStateJSON" : "{}",
    "description" : "",
    "version" : 1,
    "kibanaSavedObjectMeta" : {
      "searchSourceJSON" : """{"index":"AWBWDmk2MjUJqflLln_o","query":{"match_all":{}},"filter":[]}"""
                                        ^
                                        |
                      this is where your index pattern is used
    }
  }
},
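The same lookup from the Python client (a sketch, assuming elasticsearch-py; the ID is the example one above):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Full-text search for the index pattern's document ID anywhere in
# .kibana; matching saved searches/visualizations/dashboards refer to it.
resp = es.search(index=".kibana", q="AWBWDmk2MjUJqflLln_o")

for hit in resp["hits"]["hits"]:
    print(hit["_type"], hit["_id"])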

Elasticsearch has_child not returning all parent documents

Here is the mapping data for both customer and customer_query documents, where customer is the parent and customer_query the child document.
When I run a generic search against all customer_query documents, I get back 127 documents.
However, when I run the following query against the parent
curl -XGET "http://localhost:9200/fts_index/customer/_search" -d'
{
"query": {
"has_child" : {
"type" : "customer_query",
"query" : { "match_all": {} }
}
}
}
}'
I get back only 23 documents. There should be 127 documents returned, since each customer_query document has a unique parent ID assigned to it that matches up to the customer type.
When I retry creating my customer_query documents, I get a different number of documents back each time, leading me to think it is some kind of shard issue. I have 5 shards assigned to the index.
{
  "took" : 59,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 23,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "fts_index",
      "_type" : "customer",
      "_id" : "7579f2c0-e4e4-4374-82d7-bf4c508fc51d",
      "_score" : 1.0,
      "_routing" : "8754248f-1c51-46bf-970a-493c349c70a7",
      "_parent" : "8754248f-1c51-46bf-970a-493c349c70a7",
      ....
I can't wrap my head around this issue. Any thoughts on what could be the issue? Is this a routing issue? If so, how do I rectify that with my search?
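For context, in the pre-5.x parent/child model a child document is routed by its parent ID so that both land on the same shard; a minimal sketch of how such a pair is typically indexed (assuming a 2.x-era elasticsearch-py client that still accepts doc_type and parent; the IDs are the ones from the output above, the bodies are illustrative):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

parent_id = "8754248f-1c51-46bf-970a-493c349c70a7"

# The parent is routed by its own _id...
es.index(index="fts_index", doc_type="customer",
         id=parent_id, body={"name": "some customer"})

# ...while the child is routed by its parent's _id, so parent and
# child end up on the same shard. A mismatch between a document's
# routing and where a query looks is one way to get the
# shard-dependent counts described above.
es.index(index="fts_index", doc_type="customer_query",
         id="7579f2c0-e4e4-4374-82d7-bf4c508fc51d",
         parent=parent_id, body={"query_text": "some query"})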

Entries of Elasticsearch hits

I am performing an ElasticSearch query through
curl -XGET 'http://localhost:9200/_search?pretty' -d '{"_source":["data.js_event","data.timestamp","data.uid","data.element"],"query":{"match":{"event":"first_time_user_event"}}}'
I am ONLY interested in the output of _source, but the retrieval leads to something like
{"took" : 46,
"timed_out" : false,
"_shards" : {
"total" : 71,
"successful" : 71,
"failed" : 0
},
"hits" : {
"total" : 2062326,
"max_score" : 4.8918204,
"hits" : [ {
"_index" : "logstash-2015.11.22",
"_type" : "fluentd",
"_id" : "AVEv_blDT1yShMIEDDmv",
"_score" : 4.8918204,
"_source":{"data":{"js_event":"leave_page","timestamp":"2015-11-22T16:18:47.088Z","uid":"l4Eys1T/rAETpysn7E/Jog==","element":"body"}}
}, {
"_index" : "logstash-2015.11.21",
"_type" : "fluentd",
"_id" : "AVEnZa5nT1yShMIEDDW8",
"_score" : 4.0081544,
"_source":{"data":{"js_event":"hover","timestamp":"2015-11-21T00:15:15.097Z","uid":"E/4Fdl5uETvhQeX/FZIWfQ==","element":"infographic-new-format-selector"}}
},
...
Is there a way to get rid of _index, _type, _id and _score? The reason is that the query is performed on a remote server and I would like to limit the size of downloaded data.
Thanks.
Yes, you can use response filtering (only available since ES 1.7) by specifying what you want in the response (e.g. filter_path=hits.hits._source), and ES will filter out the rest for you:
curl -XGET 'http://localhost:9200/_search?filter_path=hits.hits._source&pretty' -d '{"_source":["data.js_event","data.timestamp","data.uid","data.element"],"query":{"match":{"event":"first_time_user_event"}}}'
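The same works from the Python client, where filter_path is passed as a request parameter; multiple paths can be comma-separated (a sketch, assuming elasticsearch-py):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    body={
        "_source": ["data.js_event", "data.timestamp", "data.uid", "data.element"],
        "query": {"match": {"event": "first_time_user_event"}},
    },
    filter_path="hits.hits._source",  # drop _index/_type/_id/_score from each hit
)

for hit in resp.get("hits", {}).get("hits", []):
    print(hit["_source"])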
