For the last two days my team has been trying to solve a problem with querying data from Elasticsearch (ES). Our goal is to aggregate documents by one field and compute two accumulated values per group.
Translated into SQL, we need something like this:
SELECT MAX(FIELD1) AS F1, MAX(FIELD2) AS F2 FROM ES GROUP BY FIELD3 HAVING F1 = 'SOME_TEXT'
Please note that F1 is a text field.
The only approach we have found so far is:
{
  "size": 0,
  "aggs": {
    "flowId": {
      "terms": {
        "field": "flowId.keyword"
      },
      "aggs": {
        "scenario": { "terms": { "field": "scnName.keyword" } },
        "max_time": { "max": { "field": "inFlowTimeNsec" } },
        "sales_bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "totalSales": "scenario"
            },
            "script": "params.totalSales != null && params.totalSales == 'Test' "
          }
        }
      }
    }
  }
}
The error we get back is:
{
"error": {
"root_cause": [],
"type": "search_phase_execution_exception",
"reason": "",
"phase": "fetch",
"grouped": true,
"failed_shards": [],
"caused_by": {
"type": "aggregation_execution_exception",
"reason": "buckets_path must reference either a number value or a single value numeric metric aggregation, got: org.elasticsearch.search.aggregations.bucket.terms.StringTerms"
}
},
"status": 503
}
As far as I understand, this issue has already been raised: https://github.com/elastic/elasticsearch/issues/23874
The output of the above query without the bucket_selector part looks as follows:
{
"took": 52,
"timed_out": false,
"_shards": {
"total": 480,
"successful": 480,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 15657901,
"max_score": 0,
"hits": []
},
"aggregations": {
"flowId": {
"doc_count_error_upper_bound": 4104,
"sum_other_doc_count": 9829317,
"buckets": [
{
"key": "0_66718_31120bfd_39ae_4258_81e8_08abd89a81bf",
"doc_count": 107816,
"scenario": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "GetPop",
"doc_count": 12
}
]
},
"max_time": {
"value": 121244876800
}
},
{
"key": "0_67116_31120bfd_39ae_4258_81e8_08abd89a81bf",
"doc_count": 107752,
"scenario": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "GetPop",
"doc_count": 12
}
]
},
"max_time": {
"value": 120955101184
}
},
…
}
The question: is there any other way to achieve what we need? That is, we need to filter the result of the aggregation by a text value.
Thanks a lot,
EG
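One workaround, sketched here in Python on the client side rather than inside the query, is to drop the bucket_selector and filter the returned flowId buckets yourself. This is only a sketch under assumptions: filter_buckets_by_scenario is a hypothetical helper, the response has the shape shown above, and a bucket is kept when any key of its scenario sub-aggregation equals the wanted text. It only sees the buckets the terms aggregation actually returned, so the terms "size" may need to be raised.

```python
def filter_buckets_by_scenario(response, wanted):
    """Keep flowId buckets whose 'scenario' sub-aggregation contains `wanted`.

    Hypothetical helper: expects the response shape shown in the question.
    """
    buckets = response["aggregations"]["flowId"]["buckets"]
    kept = []
    for bucket in buckets:
        scenario_keys = [b["key"] for b in bucket["scenario"]["buckets"]]
        if wanted in scenario_keys:
            kept.append({
                "flowId": bucket["key"],
                "max_time": bucket["max_time"]["value"],
            })
    return kept


# Trimmed copy of the two buckets from the response above.
sample_response = {
    "aggregations": {
        "flowId": {
            "buckets": [
                {
                    "key": "0_66718_31120bfd_39ae_4258_81e8_08abd89a81bf",
                    "doc_count": 107816,
                    "scenario": {"buckets": [{"key": "GetPop", "doc_count": 12}]},
                    "max_time": {"value": 121244876800},
                },
                {
                    "key": "0_67116_31120bfd_39ae_4258_81e8_08abd89a81bf",
                    "doc_count": 107752,
                    "scenario": {"buckets": [{"key": "GetPop", "doc_count": 12}]},
                    "max_time": {"value": 120955101184},
                },
            ]
        }
    }
}

for row in filter_buckets_by_scenario(sample_response, "GetPop"):
    print(row["flowId"], row["max_time"])
```

With the sample above, asking for "Test" keeps nothing, because both buckets only contain the "GetPop" scenario.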
Related
I am very new to Elasticsearch and I am having trouble building a query. My document structure is like this:
{
latlng: {
lat: '<some-latitude>',
lon: '<some-longitude>'
},
gmap_result: {<Some object>}
}
I am searching against a list of lat-long pairs. For each coordinate, I fetch the results that are within 100 m, and that part works. The tricky part is that I do not know which results in the output correspond to which query coordinate. I think this requires aggregations at some level, but I am currently clueless about how to proceed.
An aggregate query is the correct approach. You can learn about them here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
An example is below. Here I use a match query to find all documents with the word test in the title field, and then aggregate on the status field to count how many of those documents are in each status. (A match query does not interpret wildcards, so the search text is just the word itself.)
GET /my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "test"
          }
        }
      ]
    }
  },
  "aggs": {
    "count_by_status": {
      "terms": {
        "field": "status"
      }
    }
  },
  "size": 0
}
The results look like this:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 346,
"max_score": 0,
"hits": []
},
"aggregations": {
"count_by_status": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Open",
"doc_count": 283
},
{
"key": "Completed",
"doc_count": 36
},
{
"key": "On Hold",
"doc_count": 12
},
{
"key": "Withdrawn",
"doc_count": 10
},
{
"key": "Declined",
"doc_count": 5
}
]
}
}
}
If you provide your query, it would help us give a more specific aggregate query for you to use.
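For the lat-long question itself, one option is a filters aggregation with one named geo_distance filter per query coordinate, so every bucket is labelled with the coordinate it belongs to. The Python below only builds the request body. It is a sketch under assumptions: build_per_point_request and the point names are hypothetical, and latlng is assumed to be mapped as a geo_point.

```python
import json


def build_per_point_request(points, distance="100m", field="latlng"):
    """Build a search body with one named geo_distance filter per point.

    `points` maps a caller-chosen name to a (lat, lon) tuple.
    Assumes `field` is mapped as geo_point (not confirmed by the question).
    """
    named_filters = {}
    for name, (lat, lon) in points.items():
        named_filters[name] = {
            "geo_distance": {
                "distance": distance,
                field: {"lat": lat, "lon": lon},
            }
        }
    return {
        "size": 0,
        "aggs": {
            "per_point": {
                # Each named filter becomes its own bucket in the response.
                "filters": {"filters": named_filters},
                # Return a few matching documents inside each bucket.
                "aggs": {"matches": {"top_hits": {"size": 5}}},
            }
        },
    }


# Example coordinates are made up for illustration.
points = {"point_a": (40.7128, -74.0060), "point_b": (51.5074, -0.1278)}
print(json.dumps(build_per_point_request(points), indent=2))
```

Each bucket in the response is then keyed by the name given to its coordinate, which answers "which result belongs to which query term".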
I have some test documents that look like
"hits": {
...
"_source": {
"student": "DTWjkg",
"name": "My Name",
"grade": "A"
...
"student": "ggddee",
"name": "My Name2",
"grade": "B"
...
"student": "ggddee",
"name": "My Name3",
"grade": "A"
I want to get the percentage of students that have a grade of B. The result would be "33%", assuming there are only 3 students.
How would I do this in Elasticsearch?
So far I have this aggregation, which I feel is close:
"aggs": {
  "gradeBPercent": {
    "terms": {
      "field": "grade",
      "script": "_value == 'B'"
    }
  }
}
This returns:
"aggregations": {
"gradeBPercent": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "false",
"doc_count": 2
},
{
"key": "true",
"doc_count": 1
}
]
}
}
I'm not necessarily looking for an exact answer, perhaps just terms and keywords I could google. I've read over the Elasticsearch docs and have not found anything that could help.
First off, you shouldn't need a script for this aggregation. If you want to limit your results to everyone whose grade is 'B', you should do that with a filter, not a script.
Elasticsearch won't return a percentage directly, but you can easily calculate it from the result of a terms aggregation.
Example:
GET devdev/audittrail/_search
{
  "size": 0,
  "aggs": {
    "a1": {
      "terms": {
        "field": "uIDRequestID"
      }
    }
  }
}
That returns:
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 25083,
"max_score": 0,
"hits": []
},
"aggregations": {
"a1": {
"doc_count_error_upper_bound": 9,
"sum_other_doc_count": 1300,
"buckets": [
{
"key": 556,
"doc_count": 34
},
{
"key": 393,
"doc_count": 28
},
{
"key": 528,
"doc_count": 15
}
]
}
}
}
So what does that response mean?
The hits.total field is the total number of documents matching your query.
The doc_count of a bucket tells you how many documents fall into that bucket.
So for my example here, I can say that the key 556 shows up in 34 of 25083 documents, which is a percentage of (34 / 25083) * 100, roughly 0.14%.
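The same arithmetic can be done for every bucket on the client. A minimal Python sketch, assuming the response shape shown above, where hits.total is a plain integer (newer Elasticsearch versions report it as an object); bucket_percentages is a hypothetical helper name:

```python
def bucket_percentages(response, agg_name):
    """Return {bucket key: percentage of hits.total} for a terms aggregation."""
    total = response["hits"]["total"]
    if total == 0:
        return {}
    buckets = response["aggregations"][agg_name]["buckets"]
    return {b["key"]: 100.0 * b["doc_count"] / total for b in buckets}


# Trimmed copy of the response above.
sample_response = {
    "hits": {"total": 25083},
    "aggregations": {
        "a1": {
            "buckets": [
                {"key": 556, "doc_count": 34},
                {"key": 393, "doc_count": 28},
                {"key": 528, "doc_count": 15},
            ]
        }
    },
}

for key, pct in bucket_percentages(sample_response, "a1").items():
    print(f"{key}: {pct:.2f}%")
```

For the grades question, the same helper applied to a terms aggregation on grade gives the share of students with grade B.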
I'm fairly new to Elasticsearch (using version 2.2).
To simplify my question, I have documents that have a field named termination, which can sometimes take the value transfer.
I currently use this request to count, per month, the documents that have that termination:
{
  "size": 0,
  "sort": [{
    "#timestamp": {
      "order": "desc",
      "unmapped_type": "boolean"
    }
  }],
  "query": { "match_all": {} },
  "aggs": {
    "report": {
      "date_histogram": {
        "field": "#timestamp",
        "interval": "month",
        "min_doc_count": 0
      },
      "aggs": {
        "documents_with_termination_transfer": {
          "filter": {
            "term": {
              "termination": "transfer"
            }
          }
        }
      }
    }
  }
}
Here is the response:
{
"_shards": {
"failed": 0,
"successful": 206,
"total": 206
},
"aggregations": {
"report": {
"buckets": [
{
"calls_with_termination_transfer": {
"doc_count": 209163
},
"doc_count": 278100,
"key": 1451606400000,
"key_as_string": "2016-01-01T00:00:00.000Z"
},
{
"calls_with_termination_transfer": {
"doc_count": 107244
},
"doc_count": 136597,
"key": 1454284800000,
"key_as_string": "2016-02-01T00:00:00.000Z"
}
]
}
},
"hits": {
"hits": [],
"max_score": 0.0,
"total": 414699
},
"timed_out": false,
"took": 90
}
Why is the number of hits (414699) greater than the sum of the bucket document counts (278100 + 136597 = 414697)? I had read about accuracy problems, but they didn't seem to apply in the case of filters...
Is there also an accuracy problem if I sum the numbers of documents with the transfer termination?
My guess is that some documents are missing the #timestamp field, so they match the query but fall into no histogram bucket.
You could verify this by running an exists query on this field.
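A small Python sketch of that check, under assumptions: both helper names are hypothetical, the field name is copied as written in the question, and the query uses a bool query with must_not around exists to count documents that lack the field. The first helper computes how many hits ended up in no histogram bucket.

```python
def unbucketed_hits(response, agg_name="report"):
    """Number of hits that did not land in any histogram bucket."""
    buckets = response["aggregations"][agg_name]["buckets"]
    bucketed = sum(b["doc_count"] for b in buckets)
    return response["hits"]["total"] - bucketed


def missing_field_query(field):
    """Search body matching only documents that lack `field`."""
    return {
        "size": 0,
        "query": {"bool": {"must_not": {"exists": {"field": field}}}},
    }


# Numbers copied from the response in the question.
sample_response = {
    "hits": {"total": 414699},
    "aggregations": {
        "report": {
            "buckets": [
                {"doc_count": 278100},
                {"doc_count": 136597},
            ]
        }
    },
}

print(unbucketed_hits(sample_response))  # 2
print(missing_field_query("#timestamp"))
```

If the hits.total of the second query equals the difference from the first helper, the missing field explains the gap.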
Preface
I have 4 days of experience with Elasticsearch 1.7.2.
Setup
I have a collection of documents, where each document is a User. A User has a number of Answers, which are linked through UserAnswers. This gives a document path of user_answers.answer[], where answer is an array of objects.
The user_answers.answer[].correct field is a boolean that tells me whether the answer given by the user is correct.
Objective
I would like to list the users and also display the total number of correct and incorrect answers they have.
Approach
I have tried a number of different approaches; the one I include here is the closest I have got in 1.5 days of trying.
Use a terms aggregation to create a bucket for each User by username.
Filter each bucket to leave only correct or incorrect answers.
Count the number of filtered answers.
Query
{
  "size": 0,
  "filter": {
    "bool": {
      "must_not": {
        // Remove users who already have this award
        "term": {"awards_users.award_id": 2}
      }
    }
  },
  "aggs": {
    "users": {
      "terms": {"field": "username"},
      "aggs": {
        "correct": {
          "filter": {
            "term": {"user_answers.answer.correct": true}
          },
          "aggs": {
            "count": {
              "value_count": {
                "field": "user_answers.answer.id"
              }
            }
          }
        },
        // Same for incorrect, but inverted correct value
      }
    }
  }
}
Sample response
{
"key": "neon1024",
"doc_count": 1,
"correct": {
"doc_count": 1,
"count": {
"value": 7 // Expected 1 correct & 6 incorrect
}
}
},
This is the record I am testing against, and I expect 1 to be returned instead of 7. There are 7 answers in total, 6 incorrect and 1 correct. I have verified this in my document index.
The problem
For some reason the filter seems to be ignored, leaving all related answers in the bucket. Hence the aggregation sees them all, rather than producing the expected value.
Question
How can I use an aggregation to split my counts based on the value of the related answers?
Thanks for reading my long question!
As suggested, you probably have your answers mapped as the object type, while you should be using the nested type. With the object type, the answers are flattened into the user document, so a filter on correct matches the whole user and every answer id gets counted.
With the nested type, Elasticsearch stores your answers as individual documents linked to the root one, which lets you run the expected aggregations on them. You'll have to use a nested aggregation in your query to achieve that.
So I'd say it would be best to map your document like this:
PUT /test
{
  "mappings": {
    "your_type": {
      "properties": {
        "username": {
          "type": "string",
          "index": "not_analyzed"
        },
        "user_answers": {
          "type": "nested",
          "properties": {
            "id": {
              "type": "integer"
            },
            "answer": {
              "type": "string"
            },
            "correct": {
              "type": "boolean"
            }
          }
        }
      }
    }
  }
}
Test document:
PUT /test/your_type/1
{
  "username": "neon1024",
  "user_answers": [
    {
      "id": 1,
      "answer": "answer1",
      "correct": true
    },
    {
      "id": 2,
      "answer": "answer2",
      "correct": true
    },
    {
      "id": 3,
      "answer": "answer3",
      "correct": false
    }
  ]
}
Query:
POST /test/_search?search_type=count
{
  "aggs": {
    "users": {
      "terms": {
        "field": "username"
      },
      "aggs": {
        "DiveIn": {
          "nested": {
            "path": "user_answers"
          },
          "aggs": {
            "CorrectVsIncorrect": {
              "terms": {
                "field": "user_answers.correct",
                "size": 2
              }
            }
          }
        }
      }
    }
  }
}
And the final result:
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": []
},
"aggregations": {
"users": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "neon1024",
"doc_count": 1,
"DiveIn": {
"doc_count": 3,
"CorrectVsIncorrect": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "T",
"doc_count": 2
},
{
"key": "F",
"doc_count": 1
}
]
}
}
}
]
}
}
}
Here "key": "T" represents the correct answers and "doc_count": 2 is how many of them there are.
Can I limit an aggregation to return only a specific list of values? I have something like this:
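To turn that response into per-user totals on the client, something like the following Python sketch could be used. It assumes the aggregation names from the query above (users, DiveIn, CorrectVsIncorrect) and the "T"/"F" keys shown in this response; other Elasticsearch versions may report boolean keys differently. answer_counts is a hypothetical helper.

```python
def answer_counts(response):
    """Map each username to its number of correct and incorrect answers."""
    result = {}
    for user in response["aggregations"]["users"]["buckets"]:
        counts = {"correct": 0, "incorrect": 0}
        for bucket in user["DiveIn"]["CorrectVsIncorrect"]["buckets"]:
            # "T" / "F" are the boolean keys shown in the response above.
            if bucket["key"] == "T":
                counts["correct"] = bucket["doc_count"]
            elif bucket["key"] == "F":
                counts["incorrect"] = bucket["doc_count"]
        result[user["key"]] = counts
    return result


# Trimmed copy of the response above.
sample_response = {
    "aggregations": {
        "users": {
            "buckets": [
                {
                    "key": "neon1024",
                    "doc_count": 1,
                    "DiveIn": {
                        "doc_count": 3,
                        "CorrectVsIncorrect": {
                            "buckets": [
                                {"key": "T", "doc_count": 2},
                                {"key": "F", "doc_count": 1},
                            ]
                        },
                    },
                }
            ]
        }
    }
}

print(answer_counts(sample_response))
```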
{
  "aggs": {
    "province": {
      "terms": {
        "field": "province"
      }
    }
  },
  "query": {
    "bool": {
      //my query..
But let's say I know the list of provinces I want counts for ({'province1', 'province2', 'province3'}). Is it possible to restrict the returned list of provinces without affecting my query results?
I want to get:
//list of hits..
//
"aggregations": {
"province": {
"buckets": [
{
"key": "province1",
"doc_count": 200
},
{
"key": "province2",
"doc_count": 162
},
{
"key": "province3",
"doc_count": 162
}
// even if there is more possible provinces
// I don't want to see them
Sure, you can restrict the buckets with the include parameter of the terms aggregation.
Here's an example. Let's say I have visit stats for a bunch of different IP addresses, but I only want document counts for two of them. I could do this:
POST /test_index/_search?search_type=count
{
  "aggregations": {
    "ip": {
      "terms": {
        "field": "ip",
        "size": 10,
        "include": [
          "146.233.189.126",
          "193.33.153.89"
        ]
      }
    }
  }
}
and get back something like:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 7,
"max_score": 0,
"hits": []
},
"aggregations": {
"ip": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "146.233.189.126",
"doc_count": 3
},
{
"key": "193.33.153.89",
"doc_count": 3
}
]
}
}
}
Here is some code I used to play around with it:
http://sense.qbox.io/gist/68697646ef7afc9f0375995b6f84181a7ac4cba9
So your example might look like:
{
  "aggs": {
    "province": {
      "terms": {
        "field": "province",
        "include": [
          "province1",
          "province2",
          "province3"
        ]
      }
    }
  }
}
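If the list of values is only known at runtime, the request body can be built programmatically. A minimal Python sketch, assuming a hypothetical helper named terms_with_include; setting size to the number of values is my own choice so that lists longer than the default of 10 buckets are not cut off:

```python
import json


def terms_with_include(agg_name, field, values):
    """Build a terms aggregation restricted to an explicit list of values."""
    values = list(values)
    if not values:
        raise ValueError("at least one value is required")
    return {
        "aggs": {
            agg_name: {
                "terms": {
                    "field": field,
                    "include": values,
                    # One bucket per requested value at most.
                    "size": len(values),
                }
            }
        }
    }


body = terms_with_include(
    "province", "province", ["province1", "province2", "province3"]
)
print(json.dumps(body, indent=2))
```

Because include only affects which buckets the aggregation returns, the query part of the request and its hits stay unchanged.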