Elasticsearch distinct query after setting fielddata to true

I am trying to get distinct values of a field "vip_name" on an index.
This is what I tried to begin with:
curl -XGET http://172.31.38.157:9200/cb_inventory/_search -d
'{"size":0,"aggs":{"vips":{"terms":{"field":"vip_name"}}}}'
{"error":{"root_cause":
[{"type":"illegal_argument_exception","reason":"Fielddata is disabled
on text fields by default. Set fielddata=true on [vip_name] in order to
load fielddata in memory by uninverting the inverted index. Note that
this can however use significant memory. Alternatively use a keyword
field
instead."}],"type":"search_phase_execution_exception","reason":"all
shards failed","phase":"query","grouped":true,"failed_shards": [{"shard":0,
"index":"cb_inventory","node":"7_t7zG82QsS__Q_vRHWy9A","reason":
{"type":"illegal_argument_exception","reason":"Fielddata is disabled on text fields
by default.
OK. So I set fielddata to true as below:
curl -XPUT http://172.31.38.157:9200/cb_inventory/_mapping/cb_inventory -d '{"properties":{"vip_name":{"type":"text","fielddata":true}}}'
{"acknowledged":true}
Now I do the search and get back the below:
curl -XGET http://172.31.38.157:9200/cb_inventory/_search?pretty=true -d '{"size":0,"aggs":{"vips":{"terms":{"field":"vip_name","size":1000}}}}'
{"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"vips" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "domain.com",
"doc_count" : 3
},
{
"key" : "ppcbcl00021",
"doc_count" : 3
}
]
}
}
}
This is odd, since the index contains only one distinct value, ppcbcl00021.domain.com, yet the aggregation returns it split into two separate buckets.
How do I get a single distinct value, "ppcbcl00021.domain.com"?

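This is because vip_name is mapped as text, not keyword. A text field is analyzed at index time, so ppcbcl00021.domain.com is stored as the separate tokens ppcbcl00021 and domain.com, and the terms aggregation buckets on those tokens rather than on the whole value.
Try again with vip_name mapped as keyword. The type of an existing field cannot be changed in place, so this mapping has to be applied to an index where vip_name is not yet mapped (for example a newly created index), and the documents then have to be reindexed: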
curl -XPUT http://172.31.38.157:9200/cb_inventory/_mapping/cb_inventory -d '{"properties":{"vip_name":{"type":"keyword"}}}'
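To see why a text field yields two buckets, the difference can be imitated offline. The sketch below is a rough Python approximation, not Elasticsearch's real analyzer: approx_standard_tokens is a hypothetical helper that only mimics how a dot is treated (kept between two letters or two digits, split otherwise) and lowercases the result. It then counts buckets roughly the way a terms aggregation would, once over tokens (text) and once over whole values (keyword).

```python
from collections import Counter


def approx_standard_tokens(value):
    """Very rough stand-in for the standard analyzer (an assumption, not the real thing).

    A dot is kept only when it sits between two letters or two digits;
    otherwise it ends the current token. Tokens are lowercased.
    """
    tokens, current = [], ""
    for i, ch in enumerate(value):
        if ch == ".":
            prev = value[i - 1] if i > 0 else ""
            nxt = value[i + 1] if i + 1 < len(value) else ""
            same_kind = (prev.isalpha() and nxt.isalpha()) or (
                prev.isdigit() and nxt.isdigit()
            )
            if same_kind:
                current += ch
                continue
            if current:
                tokens.append(current.lower())
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current.lower())
    return tokens


def terms_buckets(values, analyze):
    """Count documents per term, similar in spirit to a terms aggregation."""
    counts = Counter()
    for value in values:
        # A document is counted once per distinct term it contains.
        counts.update(set(analyze(value)))
    return dict(counts)


# Three documents that all hold the same vip_name, as in the question.
docs = ["ppcbcl00021.domain.com"] * 3

text_buckets = terms_buckets(docs, approx_standard_tokens)  # analyzed (text)
keyword_buckets = terms_buckets(docs, lambda v: [v])        # not analyzed (keyword)

print(text_buckets)
print(keyword_buckets)
```

With the keyword-style counting, the whole value stays in a single bucket, which is the behaviour the question is after.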

Related

Elasticsearch has_child not returning back all parent documents

Here is the mapping data for both customer and customer_query documents, where customer is the parent and customer_query the child document.
When I run a generic search against all customer_query documents, I get back 127 documents.
However, when I run the following query against the parent
curl -XGET "http://localhost:9200/fts_index/customer/_search" -d'
{
"query": {
"has_child" : {
"type" : "customer_query",
"query" : { "match_all": {} }
}
}
}'
I get back only 23 documents. I expected 127, since each customer_query document is assigned a unique parent id that matches a document of the customer type.
When I re-create my customer_query documents, I get a different number of documents back each time, which leads me to think it is some kind of shard issue. The index has 5 shards.
{
"took" : 59,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 23,
"max_score" : 1.0,
"hits" : [ {
"_index" : "fts_index",
"_type" : "customer",
"_id" : "7579f2c0-e4e4-4374-82d7-bf4c508fc51d",
"_score" : 1.0,
"_routing" : "8754248f-1c51-46bf-970a-493c349c70a7",
"_parent" : "8754248f-1c51-46bf-970a-493c349c70a7",
....
I can't wrap my head around this issue. Any thoughts on what could be the issue? Is this a routing issue? If so, how do I rectify that with my search?

How to get search hits results when executing aggregation?

As stated in the ElasticSearch documentation:
In Elasticsearch, you have the ability to execute searches returning hits and at the same time return aggregated results separate from the hits all in one response. This is very powerful and efficient in the sense that you can run queries and multiple aggregations and get the results back of both (or either) operations in one shot avoiding network roundtrips using a concise and simplified API.
I want my search to return hits along with the aggregation results, but I am not sure how to achieve this.
I am using the following query:
curl -XPOST 'localhost:9200/employee/_search?pretty' -d '
{
"size": 0,
"aggs": {
"group_by_domain": {
"terms": {
"field": "domain"
}
}
}
}'
and here is the result I am getting:
{
"took" : 92,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1000,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"group_by_domain" : {
"doc_count_error_upper_bound" : 5,
"sum_other_doc_count" : 744,
"buckets" : [ {
"key" : "finance",
"doc_count" : 30
}]
}
}
}
As we can see, the hits array is empty. I am not sure how to get the hits returned. Any suggestions?
The hits are empty because you set the size of the search to 0 when you specified:
"size": 0,
You can remove size completely, in which case you'll get 10 hits (the default), or you can set the size you want; for instance, if you specify 100 you'll get up to 100 hits in the response. This controls the search results only.
Now, if you also want hits inside each aggregation bucket, you can use the Top Hits aggregation for that.
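As an illustration of the two options above, here is a minimal Python sketch that builds a request body with a non-zero size (so the top-level hits array is populated) and a top_hits sub-aggregation under the terms aggregation. The function name build_terms_with_top_hits, the aggregation name top_docs, and the default sizes are assumptions made for the example; only the domain field and the group_by_domain name come from the question.

```python
import json


def build_terms_with_top_hits(field, hits_per_bucket=3, search_size=10):
    """Build a search body that returns ordinary hits and per-bucket top hits."""
    return {
        # A non-zero size means the top-level "hits" array is populated.
        "size": search_size,
        "aggs": {
            "group_by_domain": {
                "terms": {"field": field},
                "aggs": {
                    # "top_docs" is an arbitrary, hypothetical aggregation name.
                    "top_docs": {"top_hits": {"size": hits_per_bucket}}
                },
            }
        },
    }


body = build_terms_with_top_hits("domain")
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the -d body of the same _search request used in the question.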

Count the number of duplicates in elasticsearch

I have an application inserting a numbered sequence of logs into elasticsearch.
Under certain conditions, after stopping my application, I find that in elasticsearch there are more logs than I have actually generated.
This simple aggregation helped me find out that a few duplicates are present:
curl /logstash-*/_search?pretty -d '{
"size": 0,
"aggs": {
"msgnum_terms": {
"terms": {
"field": "msgnum.raw",
"min_doc_count": 2,
"size": 0
}
}
}
}'
msgnum is the field containing the numeric sequence. Normally each value should be unique, so no doc_count should exceed 1 and the aggregation should return no buckets. Instead I get something like:
{
"took" : 33,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 100683,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"msgnum_terms" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "4097",
"doc_count" : 2
}, {
"key" : "4099",
"doc_count" : 2
...
...
...
}, {
"key" : "5704",
"doc_count" : 2
} ]
}
}
}
How can I count the exact number of duplicates in order to make sure that they are the only cause of mismatch between number of generated log lines and number of hits in elasticsearch?
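One possible way to get that number is to post-process the aggregation response on the client: every bucket with doc_count n contributes n - 1 surplus documents. The Python sketch below assumes the response has the shape shown above and that the aggregation returned every bucket (sum_other_doc_count is 0). The function name count_duplicates and the sample response are hypothetical, with made-up numbers.

```python
def count_duplicates(response, agg_name="msgnum_terms"):
    """Return (duplicated_values, surplus_docs) from a terms aggregation response.

    surplus_docs is the number of documents beyond the first copy of each value.
    """
    agg = response["aggregations"][agg_name]
    if agg.get("sum_other_doc_count", 0) != 0:
        # Some buckets were cut off, so the count would be incomplete.
        raise ValueError("aggregation did not return all buckets")
    buckets = agg["buckets"]
    duplicated_values = len(buckets)
    surplus_docs = sum(bucket["doc_count"] - 1 for bucket in buckets)
    return duplicated_values, surplus_docs


# Hypothetical response, shaped like the one in the question.
sample_response = {
    "hits": {"total": 10},
    "aggregations": {
        "msgnum_terms": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "4097", "doc_count": 2},
                {"key": "4099", "doc_count": 3},
            ],
        }
    },
}

values, surplus = count_duplicates(sample_response)
generated = sample_response["hits"]["total"] - surplus
print(values, surplus, generated)
```

If hits.total minus the surplus equals the number of log lines the application generated, the duplicates account for the whole mismatch.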

Unable to search attachment type field in an ElasticSearch indexed document

Search does not return any results although I do have a document that should match the query.
I have the Elasticsearch mapper-attachments plugin installed per https://github.com/elasticsearch/elasticsearch-mapper-attachments. I have also googled the topic and browsed similar questions on Stack Overflow, but have not found an answer.
Here's what I typed into a Windows 7 command prompt:
c:\Java\elasticsearch-1.3.4>curl -XDELETE localhost:9200/tce
{"acknowledged":true}
c:\Java\elasticsearch-1.3.4>curl -XPUT localhost:9200/tce
{"acknowledged":true}
c:\Java\elasticsearch-1.3.4>curl -XPUT localhost:9200/tce/contact/_mapping -d{\"contact\":{\"properties\":{\"my_attachment\":{\"type\":\"attachment\"}}}}
{"acknowledged":true}
c:\Java\elasticsearch-1.3.4>curl -XPUT localhost:9200/tce/contact/1 -d{\"my_attachment\":\"SGVsbG8=\"}
{"_index":"tce","_type":"contact","_id":"1","_version":1,"created":true}
c:\Java\elasticsearch-1.3.4>curl localhost:9200/tce/contact/_search?pretty
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "tce",
"_type" : "contact",
"_id" : "1",
"_score" : 1.0,
"_source":{"my_attachment":"SGVsbG8="}
} ]
}
}
c:\Java\elasticsearch-1.3.4>curl localhost:9200/tce/contact/_search?pretty -d{\"query\":{\"term\":{\"my_attachment\":\"Hello\"}}}
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
Note that the base64 encoded value of "Hello" is "SGVsbG8=", which is the value I have inserted into the "my_attachment" field of the document.
I am assuming that the mapper-attachments plugin has been deployed correctly because I don't get an error executing the mapping command above.
Any help would be greatly appreciated.
What analyzer is running against the my_attachment field?
If it's the standard analyzer (I can't see any listed), then the Hello in the text will be lowercased in the index.
A term query is not analyzed, so it has to match the indexed token exactly. Try searching for hello instead:
curl localhost:9200/tce/contact/_search?pretty -d'
{"query":
{"term":
{"my_attachment":"hello"
}}}'
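The mismatch between the analyzed text and an unanalyzed term query can be imitated offline. The following Python sketch is only an illustration under stated assumptions: simple_analyze is a crude, hypothetical stand-in for an analyzer (split on non-alphanumerics, lowercase), and term_query_matches simply checks for an exact token match. Neither reproduces what the mapper-attachments plugin actually does.

```python
import base64
import re


def simple_analyze(text):
    """Crude stand-in for an analyzer: split on non-alphanumerics and lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def term_query_matches(indexed_tokens, term):
    """A term query compares the search term to the indexed tokens verbatim."""
    return term in indexed_tokens


# The attachment content from the question, base64-decoded.
decoded = base64.b64decode("SGVsbG8=").decode("utf-8")
indexed = simple_analyze(decoded)

print(decoded)                               # Hello
print(indexed)                               # ['hello']
print(term_query_matches(indexed, "Hello"))  # False
print(term_query_matches(indexed, "hello"))  # True
```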
You can also see which terms have been added to the index:
curl 'http://localhost:9200/tce/contact/_search?pretty=true' -d '{
"query" : {
"match_all" : { }
},
"script_fields": {
"terms" : {
"script": "doc[field].values",
"params": {
"field": "my_attachment"
}
}
}
}'

ElasticSearch count returned result

I want to count the number of documents returned by a query that has a size limit. For example, I run the following query:
curl -XGET http://localhost:9200/logs_-*/a_logs/_search?pretty=true -d '
{
"query" : {
"match_all" : { }
},
"size" : 5,
"from" : 8318
}'
and I get:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 159,
"successful" : 159,
"failed" : 0
},
"hits" : {
"total" : 8319,
"max_score" : 1.0,
"hits" : [ {
....
The total number of documents matching my query is 8319, but I fetch at most 5. Only 1 document was returned, since I queried from offset 8318.
The response does not tell me how many documents were returned. I want to write a query such that the number of documents returned is also present in some field. Maybe a facet would help, but I could not figure it out. Kindly help.
Your query:
{
"query" : {
"match_all" : { }
},
=> This means you are asking for all of your data.
"size" : 5,
=> You want at most 5 results returned.
"from" : 8318
=> You start from record 8318.
Elasticsearch response:
....
"hits" : {
"total" : 8319,
...
=> Elasticsearch is telling you that 8319 documents in the index match.
You asked for all the results, starting from offset 8318.
8319 - 8318 = 1, so you get 1 result.
Try removing the from.
Looking through the documentation, it's not clear how to make the query return this -- if indeed the API supports it. If you just want to have the count of the returned hits, the easiest way seems to be to actually count them yourself after parsing the response.
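Counting on the client side could look like the Python sketch below. It assumes the response has already been parsed into a dict with the hits.total and hits.hits fields shown above. The function names and the sample response are hypothetical, and expected_returned is just the arithmetic from the earlier answer (total minus offset, capped by size).

```python
def returned_count(response):
    """Number of documents actually present in the parsed response."""
    return len(response["hits"]["hits"])


def expected_returned(total, size, offset):
    """How many hits a page should contain, given total matches, size and from."""
    return max(0, min(size, total - offset))


# Hypothetical parsed response: 8319 matches, one hit on the last page.
sample_response = {
    "hits": {
        "total": 8319,
        "max_score": 1.0,
        "hits": [{"_id": "example-id", "_score": 1.0}],
    }
}

print(returned_count(sample_response))     # 1
print(expected_returned(8319, 5, 8318))    # 1
print(expected_returned(8319, 5, 0))       # 5
```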

Resources