Elasticsearch Ignoring Filter in Aggregations

The request looks something like:
{
"aggs": {
"contentType": {
"terms": {
"field": "contentType",
"size": 0
}
}
},
"query": {...},
"filter": {...}
}
The response looks something like:
{
"took": 300,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 68,
"max_score": 0,
"hits": []
},
"aggregations": {
"contentType": {
"doct_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 9
"doc_count": 7054
},
{
"key": 9
"doc_count": 7054
},
{
"key": 5
"doc_count": 6236
},
{
"key": 4
"doc_count": 1124
}
]
}
}
}
The "doc_count" in the aggregation is what the results would be without the "filter" and just the "query". The "filter" seems to be ignored.
This was working at some point, but all of a sudden doesn't seem to be working. Anyone have any clue?
Elasticsearch 1.5.2, NEST 1.4.3.
Thanks.

A filter used at the top level of the search body has been renamed to post_filter (see https://github.com/elastic/elasticsearch/issues/4119). Documentation for post_filter is here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-post-filter.html
Note that a post_filter is applied after the aggregations have been computed, so it only narrows the hits and leaves the bucket counts untouched.
I'm not sure whether it applies to your particular query, but you might want to use the filtered query type instead: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html
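On 1.x a filtered query applies the filter to both the hits and the aggregations. A minimal sketch of your request rewritten that way (the elided query and filter bodies stay exactly as they are in your original request):
{
  "query": {
    "filtered": {
      "query": {...},
      "filter": {...}
    }
  },
  "aggs": {
    "contentType": {
      "terms": {
        "field": "contentType",
        "size": 0
      }
    }
  }
}
With this form the bucket counts should reflect the filter; keep a post_filter only if you deliberately want the hits filtered while the aggregations stay unfiltered.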

Related

Group results returned by elasticsearch query based on query terms

I am very new with elasticsearch. I am facing an issue building a query. My document structure is like:
{
latlng: {
lat: '<some-latitude>',
lon: '<some-longitude>'
},
gmap_result: {<Some object>}
}
I am doing a search on a list of lat-long pairs. For each coordinate, I am fetching results that are within 100m. I have been able to do this part. But the tricky part is that I do not know which results in the output correspond to which query term. I think this requires using aggregations at some level, but I am currently clueless about how to proceed.
An aggregate query is the correct approach. You can learn about them here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
An example is below. In this example, I am using a match query to find all instances of the word test in the field title, and then aggregating on the field status to count how many of those results are in each status.
GET /my_index/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "*test*"
}
}
]
}
},
"aggs": {
"count_by_status": {
"terms": {
"field": "status"
}
}
},
"size": 0
}
The results look like this:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 346,
"max_score": 0,
"hits": []
},
"aggregations": {
"count_by_status": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Open",
"doc_count": 283
},
{
"key": "Completed",
"doc_count": 36
},
{
"key": "On Hold",
"doc_count": 12
},
{
"key": "Withdrawn",
"doc_count": 10
},
{
"key": "Declined",
"doc_count": 5
}
]
}
}
}
If you provide your query, it would help us give a more specific aggregate query for you to use.
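In the meantime, here is a rough sketch of one way to tie each bucket back to a query point: a filters aggregation with one geo_distance clause per coordinate, plus a top_hits sub-aggregation to return the matching documents. The index name and the coordinates are placeholders, and it assumes latlng is mapped as a geo_point:
GET /my_geo_index/_search
{
  "size": 0,
  "aggs": {
    "per_query_point": {
      "filters": {
        "filters": {
          "point_1": {
            "geo_distance": {
              "distance": "100m",
              "latlng": { "lat": 40.7128, "lon": -74.0060 }
            }
          },
          "point_2": {
            "geo_distance": {
              "distance": "100m",
              "latlng": { "lat": 51.5074, "lon": -0.1278 }
            }
          }
        }
      },
      "aggs": {
        "matches": {
          "top_hits": { "size": 10 }
        }
      }
    }
  }
}
Each named bucket then contains only the documents within 100m of that particular point, so it is explicit which results belong to which query term.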

Boosting elastic aggregation result

I have an Elasticsearch index for products; each product has a Brand attribute, and I "have to" create an aggregation that returns the Brands of the products.
My Sample Query:
GET /products/product/_search
{
"size": 0,
"aggs": {
"myFancyFilter": {
"filter": {
"match_all": {}
},
"aggs": {
"inner": {
"terms": {
"field": "Brand",
"size": 3
}
}
}
}
},
"query": {
"match_all": {}
}
}
And the result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 236952,
"max_score": 0,
"hits": []
},
"aggregations": {
"myFancyFilter": {
"doc_count": 236952,
"inner": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 139267,
"buckets": [
{
"key": "Brand1",
"doc_count": 3144
},
{
"key": "Brand2",
"doc_count": 1759
},
{
"key": "Brand3",
"doc_count": 1737
}
]
}
}
}
}
It works perfectly for me. Elasticsearch sorts the buckets by doc_count; however, I would like to manipulate the bucket order in the result. For example, assume that I have Brand5 and I want to bump it up to position #2, so that the result comes in the order Brand1, Brand5, Brand3.
If this were a query rather than an aggregation, I could use function_score, but here I don't have an idea. Any clues?
What you are looking for is to define your own sort order and have it applied to the aggregation in Elasticsearch. I've been able to come up with a solution by renaming the aggregation terms in the following manner:
Brand1 to a_Brand1
Brand5 to b_Brand5
Brand3 to c_Brand3
Then I apply sorting on the terms so that the sort happens lexicographically.
Of course this may not be the exact or the best solution, but I felt it could help.
Below is the query I've used. Note that my field name is brand, it is a multi-field, and I'm using brand.keyword.
POST testdataindex/_search
{
"size":0,
"query":{
"match_all":{
}
},
"aggs":{
"myFancyFilter":{
"filter":{
"match_all":{
}
},
"aggs":{
"inner":{
"terms":{
"script":{
"lang":"painless",
"inline":"if(params.newNames.containsKey(doc['brand.keyword'].value)) { return params.newNames[doc['brand.keyword'].value];} return null;",
"params":{
"newNames":{
"Brand1":"a_Brand1",
"Brand5":"b_Brand5",
"Brand3":"c_Brand3"
}
}
},
"order":{
"_term":"asc"
}
}
}
}
}
}
}
I've created sample data with the brand names Brand1, Brand3, and Brand5; below is how the results appear. Note the change in the term names.
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 8,
"max_score": 0,
"hits": []
},
"aggregations": {
"myFancyFilter": {
"doc_count": 8,
"inner": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "a_Brand1",
"doc_count": 2
},
{
"key": "b_Brand5",
"doc_count": 4
},
{
"key": "c_Brand3",
"doc_count": 2
}
]
}
}
}
}
Hope it helps!

Get Percentage of Values in Elasticsearch

I have some test documents that look like
"hits": {
...
"_source": {
"student": "DTWjkg",
"name": "My Name",
"grade": "A"
...
"student": "ggddee",
"name": "My Name2",
"grade": "B"
...
"student": "ggddee",
"name": "My Name3",
"grade": "A"
I want to get the percentage of students that have a grade of B; the result would be "33%", assuming there were only 3 students.
How would I do this in Elasticsearch?
So far I have this aggregation, which I feel like is close:
"aggs": {
"gradeBPercent": {
"terms": {
"field" : "grade",
"script" : "_value == 'B'"
}
}
}
This returns:
"aggregations": {
"gradeBPercent": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "false",
"doc_count": 2
},
{
"key": "true",
"doc_count": 1
}
]
}
}
I'm not necessarily looking for an exact answer; even terms and keywords I could google would help. I've read over the Elasticsearch docs and haven't found anything useful.
First off, you shouldn't need a script for this aggregation. If you want to limit your results to documents where _value == 'B', you should do that with a filter, not a script.
Elasticsearch won't return a percentage directly, but you can easily calculate one from the results of a terms aggregation.
Example:
GET devdev/audittrail/_search
{
"size": 0,
"aggs": {
"a1": {
"terms": {
"field": "uIDRequestID"
}
}
}
}
That returns:
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 25083,
"max_score": 0,
"hits": []
},
"aggregations": {
"a1": {
"doc_count_error_upper_bound": 9,
"sum_other_doc_count": 1300,
"buckets": [
{
"key": 556,
"doc_count": 34
},
{
"key": 393,
"doc_count": 28
},
{
"key": 528,
"doc_count": 15
}
]
}
}
}
So what does that return mean?
The hits.total field is the total number of records matching your query.
The doc_count field tells you how many items are in each bucket.
So for my example here: the key "556" shows up in 34 of 25083 documents, so its percentage is (34 / 25083) * 100, roughly 0.14%.
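As a variation for the grade question specifically, a filter aggregation avoids the script entirely. A sketch, assuming the index is called students and the grade field is not_analyzed (otherwise use the lowercased term "b"):
GET /students/_search
{
  "size": 0,
  "aggs": {
    "grade_b": {
      "filter": {
        "term": { "grade": "B" }
      }
    }
  }
}
The percentage is then aggregations.grade_b.doc_count divided by hits.total, times 100; with 1 of 3 students in grade B that gives roughly 33%.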

Elasticsearch: accuracy on a filter aggregation

I'm fairly new to Elasticsearch (using version 2.2).
To simplify my question, I have documents that have a field named termination, which can sometimes take the value transfer.
I currently run this request to aggregate, by month, the number of documents that have that termination:
{
"size": 0,
"sort": [{
"#timestamp": {
"order": "desc",
"unmapped_type": "boolean"
}
}],
"query": { "match_all": {} },
"aggs": {
"report": {
"date_histogram": {
"field": "#timestamp",
"interval": "month",
"min_doc_count": 0
},
"aggs": {
"documents_with_termination_transfer": {
"filter": {
"term": {
"termination": "transfer"
}
}
}
}
}
}
}
Here is the response:
{
"_shards": {
"failed": 0,
"successful": 206,
"total": 206
},
"aggregations": {
"report": {
"buckets": [
{
"calls_with_termination_transfer": {
"doc_count": 209163
},
"doc_count": 278100,
"key": 1451606400000,
"key_as_string": "2016-01-01T00:00:00.000Z"
},
{
"calls_with_termination_transfer": {
"doc_count": 107244
},
"doc_count": 136597,
"key": 1454284800000,
"key_as_string": "2016-02-01T00:00:00.000Z"
}
]
}
},
"hits": {
"hits": [],
"max_score": 0.0,
"total": 414699
},
"timed_out": false,
"took": 90
}
Why is the number of hits (414699) greater than the sum of the document counts (278100 + 136597 = 414697)? I had read about accuracy problems, but they didn't seem to apply in the case of filters...
Is there also an accuracy problem if I sum the total numbers of documents with transfer termination?
My guess is that some documents are missing the #timestamp field, so they never land in any date_histogram bucket even though they still count toward the total hits.
You could verify this by running an exists query on this field.
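For example, on 2.2 something along these lines should count the documents that lack the field (the index name is a placeholder):
GET /my_index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must_not": {
        "exists": { "field": "#timestamp" }
      }
    }
  }
}
If hits.total comes back as 2, that accounts for the difference between 414699 and 414697.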

How to get the count of most frequent pattern in elasticsearch?

I want to get the ten most frequent patterns from a search with Elasticsearch.
Example:
"cgn:4189, dfsdkfldslfs"
"cgn:4210, aezfvdsvgds"
"cgn:4189, fdsmpfjdjs"
"cgn:4195, cvsf"
"cgn:4189, mkpjd"
"cgn:4210, mfsfgkpjd"
I want to get :
4189 : 3
4210 : 2
4195 : 1
I know how to do that in MySQL or via awk/sort/head, but with Elasticsearch I'm lost.
Exactly how it will work depends on your analyzer, but if you are just using the default standard analyzer, you can probably get what you want pretty easily with a terms aggregation.
As a simple example, I set up a trivial index:
PUT /test_index
{
"settings": {
"number_of_shards": 1
}
}
Then I indexed the data you posted using the bulk API:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"msg":"cgn:4189, dfsdkfldslfs"}
{"index":{"_id":2}}
{"msg":"cgn:4210, aezfvdsvgds"}
{"index":{"_id":3}}
{"msg":"cgn:4189, fdsmpfjdjs"}
{"index":{"_id":4}}
{"msg":"cgn:4195, cvsf"}
{"index":{"_id":5}}
{"msg":"cgn:4189, mkpjd"}
{"index":{"_id":6}}
{"msg":"cgn:4210, mfsfgkpjd"}
Then I can run a simple terms aggregation to get back all the terms and how often they occur (ordered by descending doc_count by default):
POST /test_index/_search?search_type=count
{
"aggs": {
"msg_terms": {
"terms": {
"field": "msg"
}
}
}
}
which returns:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 0,
"hits": []
},
"aggregations": {
"msg_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "cgn",
"doc_count": 6
},
{
"key": "4189",
"doc_count": 3
},
{
"key": "4210",
"doc_count": 2
},
{
"key": "4195",
"doc_count": 1
},
{
"key": "aezfvdsvgds",
"doc_count": 1
},
{
"key": "cvsf",
"doc_count": 1
},
{
"key": "dfsdkfldslfs",
"doc_count": 1
},
{
"key": "fdsmpfjdjs",
"doc_count": 1
},
{
"key": "mfsfgkpjd",
"doc_count": 1
},
{
"key": "mkpjd",
"doc_count": 1
}
]
}
}
}
Here is the code I used:
http://sense.qbox.io/gist/a827095b675596c4e3d545ce963cde3fae932156
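If you only want the numeric codes (and at most ten of them), a small variation on the request above should work: the terms aggregation accepts an include regex and a size, so restricting the buckets to all-digit tokens would look roughly like this:
POST /test_index/_search?search_type=count
{
  "aggs": {
    "msg_terms": {
      "terms": {
        "field": "msg",
        "include": "[0-9]+",
        "size": 10
      }
    }
  }
}
That should leave just the 4189: 3, 4210: 2, 4195: 1 style buckets.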
