How to select fields after aggregation in Elastic Search 2.3 - elasticsearch

I have the following schema for an index:
PUT
{
"mappings": {
"event": {
"properties": {
"#timestamp": { "type": "date", "doc_values": true },
"partner_id": { "type": "integer", "doc_values": true },
"event_id": { "type": "integer", "doc_values": true },
"count": { "type": "integer", "doc_values": true, "index": "no" },
"device_id": { "type": "string", "index": "not_analyzed", "doc_values": true },
"product_id": { "type": "integer", "doc_values": true }
}
}
}
}
I need a result equivalent to the following SQL query:
SELECT product_id, device_id, sum(count) FROM index WHERE partner_id=5 AND timestamp<=end_date AND timestamp>=start_date GROUP BY device_id,product_id having sum(count)>1;
I am able to get the result with the following Elasticsearch query:
GET
{
"size":0,
"aggs":{
"matching_events":{
"filter":{
"bool":{
"must":[
{
"term":{
"partner_id":5
}
},
{
"range":{
"#timestamp":{
"from":1470904000,
"to":1470904999
}
}
}
]
}
},
"aggs":{
"group_by_productid": {
"terms":{
"field":"product_id"
},
"aggs":{
"group_by_device_id":{
"terms":{
"field":"device_id"
},
"aggs":{
"total_count":{
"sum":{
"field":"count"
}
},
"sales_bucket_filter":{
"bucket_selector":{
"buckets_path":{
"totalCount":"total_count"
},
"script": {"inline": "totalCount > 1"}
}
}
}
}}
}
}
}
}
}
However, for groups where the summed count is <= 1, the query still returns empty buckets keyed by product_id. Out of 40 million groups, only about 100k satisfy the condition, so I get back a huge result set, most of which is useless. How can I select only particular fields after aggregation? I tried this, but it doesn't work: `"fields": ["aggregations.matching_events.group_by_productid.group_by_device_id.buckets.key"]`
Edit:
I have the following data:
device_id                    partner_id   count
db63te2bd38672921ffw27t82    367          3
db63te2bd38672921ffw27t82    272          1
I got this output:
{
"took":6,
"timed_out":false,
"_shards":{
"total":5,
"successful":5,
"failed":0
},
"hits":{
"total":7,
"max_score":0.0,
"hits":[
]
},
"aggregations":{
"matching_events":{
"doc_count":5,
"group_by_productid":{
"doc_count_error_upper_bound":0,
"sum_other_doc_count":0,
"buckets":[
{
"key":367,
"doc_count":3,
"group_by_device_id":{
"doc_count_error_upper_bound":0,
"sum_other_doc_count":0,
"buckets":[
{
"key":"db63te2bd38672921ffw27t82",
"doc_count":3,
"total_count":{
"value":3.0
}
}
]
}
},
{
"key":272,
"doc_count":1,
"group_by_device_id":{
"doc_count_error_upper_bound":0,
"sum_other_doc_count":0,
"buckets":[
]
}
}
]
}
}
}
}
As you can see, the bucket with key 272 is empty, which makes sense, but shouldn't this bucket be removed from the result set altogether?

I've just found out that there is a fairly recent issue and PR that add a _bucket_count path to the buckets_path option, so that an aggregation can potentially filter the parent bucket based on the number of buckets another aggregation has. In other words, if _bucket_count is 0 for a parent bucket_selector, the bucket would be removed.
This is the github issue: https://github.com/elastic/elasticsearch/issues/19553
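Until something like that lands server-side, one workaround is to drop the empty parent buckets client-side after the response arrives. A minimal Python sketch against the response shape shown above (the function name and the trimmed sample response are mine, for illustration only):

```python
def prune_empty_buckets(response):
    """Drop product buckets whose inner device aggregation came back empty."""
    agg = response["aggregations"]["matching_events"]["group_by_productid"]
    agg["buckets"] = [
        b for b in agg["buckets"]
        if b["group_by_device_id"]["buckets"]  # keep only non-empty inner buckets
    ]
    return response

# Shape mirrors the response above: bucket 272 has no device buckets.
resp = {
    "aggregations": {
        "matching_events": {
            "group_by_productid": {
                "buckets": [
                    {"key": 367,
                     "group_by_device_id": {"buckets": [{"key": "db63te2bd38672921ffw27t82"}]}},
                    {"key": 272,
                     "group_by_device_id": {"buckets": []}},
                ]
            }
        }
    }
}

pruned = prune_empty_buckets(resp)
keys = [b["key"] for b in
        pruned["aggregations"]["matching_events"]["group_by_productid"]["buckets"]]
# keys == [367]
```

This does not reduce the payload transferred from the cluster, of course; it only cleans up the result before further processing.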

Related

Sorting aggregation resultset in Elasticsearch and filtering

Data sample in Elasticsearch index:
"_source": {
"Type": "SELL",
"Id": 31,
"status": "YES",
"base": "FIAT",
"orderDate": "2019-02-01T05:00:00.000Z"
}
I need to
1. filter the records on 'base' = ? and 'Type' = ?, THEN
2. get the latest record per Id from those filtered records, and THEN
3. from those results, keep only the records with 'status' = 'YES'.
Elasticsearch query I wrote:
{
"size":0,
"query":{
"bool":{
"must":[
{ "match":{ "base":"FIAT" } },
{ "match":{ "Type":"SELL" } }
]
}
},
"aggs":{
"sources":{
"terms":{ "field":"Id" },
"aggs":{
"latest":{
"top_hits":{
"size":1,
"_source":{
"includes":[
"Id",
"orderDate",
"status"
]
},
"sort":{ "orderDate":"desc" }
}
}
}
}
}
}
Did you try using composite aggregations?
Composite Aggregations in ElasticSearch
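A sketch of what the suggested composite-aggregation request might look like, expressed as a Python dict (the field names Id, orderDate, and status come from the question; the page size and source key name are my choices):

```python
# Composite aggregation variant of the "latest record per Id" query.
# Unlike a plain terms aggregation, composite buckets can be paged through
# exhaustively by feeding the returned "after_key" back in as "after".
composite_query = {
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {"match": {"base": "FIAT"}},
                {"match": {"Type": "SELL"}},
            ]
        }
    },
    "aggs": {
        "sources": {
            "composite": {
                "size": 1000,  # buckets per page; paginate with "after"
                "sources": [{"id": {"terms": {"field": "Id"}}}],
            },
            "aggs": {
                "latest": {
                    "top_hits": {
                        "size": 1,
                        "_source": {"includes": ["Id", "orderDate", "status"]},
                        "sort": {"orderDate": "desc"},
                    }
                }
            },
        }
    },
}
```

Step 3 (keeping only status = 'YES') would still be done client-side on the returned top hit of each bucket, since a top_hits result cannot be filtered by a parent bucket_selector on a non-metric value.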

How to join ElasticSearch query with multi_match, boosting, wildcard and filter?

I'm trying to achieve these goals:
Filter out results by a bool query, like "status=1"
Filter out results by a bool range query, like "distance: gte 10 AND lte 60"
Filter out results by matching at least one int value from an int array
Search words in many fields while calculating the document score. Some fields need wildcards, some boosting, like importantfield^2, somefield*, someotherfield^0.75
All of the above points are joined by the AND operator. All terms within one point are joined by the OR operator.
This is what I wrote so far, but the wildcards are not working: searching for "abc" doesn't find "abcd" in the "name" field.
How can I solve this?
{
"filtered": {
"query": {
"multi_match": {
"query": "John Doe",
"fields": [
"*name*^1.75",
"someObject.name",
"tagsArray",
"*description*",
"ownerName"
]
}
},
"filter": {
"bool": {
"must": [
{
"term": {
"status": 2
}
},
{
"bool": {
"should": [
{
"term": {
"someIntsArray": 1
}
},
{
"term": {
"someIntsArray": 5
}
}
]
}
},
{
"range": {
"distanceA": {
"lte": 100
}
}
},
{
"range": {
"distanceB": {
"gte": 50,
"lte": 100
}
}
}
]
}
}
}
}
Mappings:
{
"documentId": {
"type": "integer"
},
"ownerName": {
"type": "string",
"index": "not_analyzed"
},
"description": {
"type": "string"
},
"status": {
"type": "byte"
},
"distanceA": {
"type": "short"
},
"createdAt": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
"distanceB": {
"type": "short"
},
"someObject": {
"properties": {
"someObject_id": {
"type": "integer"
},
"name": {
"type": "string",
"index": "not_analyzed"
}
}
},
"someIntsArray": {
"type": "integer"
},
"tags": {
"type": "string",
"index": "not_analyzed"
}
}
You can make use of query_string if you want to apply wildcards across multiple fields and, at the same time, apply different boost values to individual fields.
Below is how your query would look:
POST <your_index_name>/_search
{
"query":{
"bool":{
"must":[
{
"query_string":{
"query":"abc*",
"fields":[
"*name*^1.75",
"someObject.name",
"tagsArray",
"*description*",
"ownerName"
]
}
}
],
"filter":{
"bool":{
"must":[
{
"term":{
"status":"2"
}
},
{
"bool":{
"minimum_should_match":1,
"should":[
{
"term":{
"someIntsArray":1
}
},
{
"term":{
"someIntsArray":5
}
}
]
}
},
{
"range":{
"distanceA":{
"lte":100
}
}
},
{
"range":{
"distanceB":{
"gte": 50,
"lte":100
}
}
}
]
}
}
}
}
}
Note that for the field someIntsArray, I've used "minimum_should_match": 1 so that you won't end up with documents that have neither of those values.
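The effect of "minimum_should_match": 1 on the someIntsArray clause can be illustrated with plain Python set logic (the sample documents here are invented for the example):

```python
# Invented sample docs illustrating minimum_should_match=1 semantics.
docs = [
    {"id": 1, "someIntsArray": [1, 3]},
    {"id": 2, "someIntsArray": [5]},
    {"id": 3, "someIntsArray": [7]},  # matches neither should-term
    {"id": 4, "someIntsArray": []},   # matches nothing at all
]

should_terms = {1, 5}

# A doc survives only if at least one should-term matches.
matching_ids = [
    d["id"] for d in docs
    if len(should_terms & set(d["someIntsArray"])) >= 1
]
# matching_ids == [1, 2]
```

Without the minimum_should_match constraint inside a filter context, a should clause that sits next to must clauses becomes purely optional, which is why docs 3 and 4 would otherwise slip through.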
Updated answer:
Going by the updated comment, you can have the wildcard-searched fields handled by query_string and use a simple match query with boosting as shown below. Include both queries (you can even add more match queries depending on your requirements) in a combined should clause. That way you can control where the wildcard query is used and where it isn't.
{
"query":{
"bool":{
"should":[
{
"query_string":{
"query":"joh*",
"fields":[
"name^2"
]
}
},
{
"match":{
"description":{
"query":"john",
"boost":15
}
}
}
],
"filter":{
"bool":{
"must":[
{
"term":{
"status":"2"
}
},
{
"bool":{
"minimum_should_match":1,
"should":[
{
"term":{
"someIntsArray":1
}
},
{
"term":{
"someIntsArray":5
}
}
]
}
},
{
"range":{
"distanceA":{
"lte":100
}
}
},
{
"range":{
"distanceB":{
"lte":100
}
}
}
]
}
}
}
}
}
Let me know if this helps.

Elasticsearch- sum specific field in specific datetime range?

I have a requirement to sum a single field in one query, restricted to a specific datetime range.
I tried an aggregation query to get the field's total over that datetime range, but the returned total is 0.
My document JSON looks like this:
{
"_index": "flow-2018.02.01",
"_type": "doc",
"_source": {
"#timestamp": "2018-02-01T01:02:40.701Z",
"dest": {
"ip": "120.119.37.237",
"mac": "d4:6d:50:21:f8:44",
"port": 3280
},
"final": true,
"flow_id": "EQQA////DP//////FP8BAAFw5CIXZxTUbVAh+ERn/7FMeHcl7S6z0Aw",
"last_time": "2018-02-01T01:01:48.349Z",
"source": {
"ip": "100.255.177.76",
"mac": "70:e4:30:15:67:14",
"port": 45870,
"stats": {
"bytes_total": 60,
"packets_total": 1
}
},
"start_time": "2018-02-01T01:01:48.349Z",
"transport": "tcp",
"type": "flow"
},
"fields": {
"start_time": [
1517446908349
],
"#timestamp": [
1517446960701
],
"last_time": [
1517446908349
]
},
"sort": [
1517446960701
]
}
My search query:
{
"size":0,
"query":{
"bool":{
"must":[
{
"range":{
"_source.#timestamp":{
"gte": "2018-02-01T01:00:00.000Z",
"lte": "2018-02-01T01:05:00.000Z"
}
}
}
]
}
},
"aggs":{
"total":{
"sum":{
"field":"stats.packets_total "
}
}
}
}
Please help me to solve this issue. Thank you
You're almost there: stats.packets_total should be source.stats.packets_total (and make sure to remove the space at the end of the field name). The range filter should also target #timestamp directly, not _source.#timestamp. Like this:
{
"size":0,
"query":{
"bool":{
"must":[
{
"range":{
"#timestamp":{
"gte": "2018-02-01T01:00:00.000Z",
"lte": "2018-02-01T01:05:00.000Z"
}
}
}
]
}
},
"aggs":{
"total":{
"sum":{
"field":"source.stats.packets_total"
}
}
}
}
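Mistyped field paths like this silently sum to 0 instead of erroring, so a quick sanity check is to resolve the dotted path against a sample document before running the aggregation. A small sketch (the helper function is mine), using the document from the question:

```python
def resolve(doc, dotted_path):
    """Walk a nested dict by a dotted path, e.g. 'source.stats.bytes_total'."""
    cur = doc
    for part in dotted_path.split("."):
        cur = cur[part]  # raises KeyError if any segment is wrong
    return cur

sample = {
    "source": {
        "ip": "100.255.177.76",
        "stats": {"bytes_total": 60, "packets_total": 1},
    }
}

total = resolve(sample, "source.stats.packets_total")
# total == 1

# The original path "stats.packets_total " (trailing space, missing the
# "source." prefix) matches no field, which is why the aggregation summed to 0.
```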

Elasticsearch - Cardinality over Full Field Value

I have a document that looks like this:
{
"_id":"some_id_value",
"_source":{
"client":{
"name":"x"
},
"project":{
"name":"x November 2016"
}
}
}
I am attempting to perform a query that will fetch me the count of unique project names for each client. For this, I am using a query with cardinality over the project.name. I am sure that there are only 4 unique project names for this particular client. However, when I run my query, I get a count of 5, which I know is wrong.
The project names all contain the name of the client. For instance, if a client is "X", project names will be "X Testing November 2016", or "X Jan 2016", etc. I don't know if that is a consideration.
This is the mapping for the document type
{
"mappings":{
"vma_docs":{
"properties":{
"client":{
"properties":{
"contact":{
"type":"string"
},
"name":{
"type":"string"
}
}
},
"project":{
"properties":{
"end_date":{
"format":"yyyy-MM-dd",
"type":"date"
},
"project_type":{
"type":"string"
},
"name":{
"type":"string"
},
"project_manager":{
"index":"not_analyzed",
"type":"string"
},
"start_date":{
"format":"yyyy-MM-dd",
"type":"date"
}
}
}
}
}
}
}
This is my search query
{
"fields":[
"client.name",
"project.name"
],
"query":{
"bool":{
"must":{
"match":{
"client.name":{
"operator":"and",
"query":"ABC systems"
}
}
}
}
},
"aggs":{
"num_projects":{
"cardinality":{
"field":"project.name"
}
}
},
"size":5
}
These are the results I get (I have only posted 2 hits for the sake of brevity). Note that the num_projects aggregation returns 5, but it should return 4, the actual number of projects.
{
"hits":{
"hits":[
{
"_score":5.8553367,
"_type":"vma_docs",
"_id":"AVTMIM9IBwwoAW3mzgKz",
"fields":{
"project.name":[
"ABC"
],
"client.name":[
"ABC systems Pvt Ltd"
]
},
"_index":"vma"
},
{
"_score":5.8553367,
"_type":"vma_docs",
"_id":"AVTMIM9YBwwoAW3mzgK2",
"fields":{
"project.name":[
"ABC"
],
"client.name":[
"ABC systems Pvt Ltd"
]
},
"_index":"vma"
}
],
"total":18,
"max_score":5.8553367
},
"_shards":{
"successful":5,
"failed":0,
"total":5
},
"took":4,
"aggregations":{
"num_projects":{
"value":5
}
},
"timed_out":false
}
FYI: The project names are ABC, ABC Nov 2016, ABC retest November, ABC Mobile App
You need the following mapping for your project.name field:
{
"mappings": {
"vma_docs": {
"properties": {
"client": {
"properties": {
"contact": {
"type": "string"
},
"name": {
"type": "string"
}
}
},
"project": {
"properties": {
"end_date": {
"format": "yyyy-MM-dd",
"type": "date"
},
"project_type": {
"type": "string"
},
"name": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
"project_manager": {
"index": "not_analyzed",
"type": "string"
},
"start_date": {
"format": "yyyy-MM-dd",
"type": "date"
}
}
}
}
}
}
}
It basically adds a subfield called raw: the same value indexed into project.name is also stored in project.name.raw, but without analyzing or tokenizing it. Then the query you need to use is:
{
"fields": [
"client.name",
"project.name"
],
"query": {
"bool": {
"must": {
"match": {
"client.name": {
"operator": "and",
"query": "ABC systems"
}
}
}
}
},
"aggs": {
"num_projects": {
"cardinality": {
"field": "project.name.raw"
}
}
},
"size": 5
}
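The reason the analyzed field overcounts: the standard analyzer splits each project name into lowercased word tokens, and cardinality then runs over those tokens rather than over whole names. A rough simulation (the whitespace-split here is a simplification of what the standard analyzer actually does):

```python
# The four real project names from the question.
names = ["ABC", "ABC Nov 2016", "ABC retest November", "ABC Mobile App"]

# Roughly what the standard analyzer indexes: lowercased word tokens.
tokens = {tok.lower() for name in names for tok in name.split()}

distinct_raw = len(set(names))  # cardinality over project.name.raw
distinct_tokens = len(tokens)   # what an analyzed project.name exposes

# distinct_raw == 4, while distinct_tokens == 7
# (abc, nov, 2016, retest, november, mobile, app), so an analyzed field
# can never give the true per-project count.
```

The exact count Elasticsearch reports over the analyzed field depends on which tokens each shard sees (cardinality is also approximate), but the underlying problem is the same: only the not_analyzed raw subfield treats "ABC Nov 2016" as one value.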

Return Documents with Null Field in Multi Level Aggregation

We are using a multi-level aggregation: we have buckets per CITY, and each CITY bucket has buckets per CLASS.
For a few documents CLASS is null, and in such cases an empty bucket is returned for the CITY. Please refer to the response below:
Sample Output:
"aggregations":
{
"CITY":{
"buckets":[
{
"key":"CITY 1",
"doc_count":2,
"CLASS":{
"buckets":[
{
"key":"CLASS A",
"top_tag_hits":{
}
}
]
}
},
{
"key":"CITY 2",
"doc_count":2,
"CLASS":{
"buckets":[
]
}
}
]
}
}
Here the key CITY 2 has an empty CLASS bucket list because all documents under CITY 2 have the field CLASS as null, yet we still get a doc count for it.
How can we return the documents under the bucket when the terms field is null?
Update:
Field Mapping for CLASS:
"CLASS":
{
"type": "string",
"index_analyzer": "text_with_autocomplete_analyzer",
"search_analyzer": "text_standard_analyzer",
"fields": {
"raw": {
"type": "string",
"null_value" : "na",
"index": "not_analyzed"
},
"partial_matching": {
"type": "string",
"index_analyzer": "text_with_partial_matching_analyzer",
"search_analyzer": "text_standard_analyzer"
}
}
}
Please refer to the mapping above when suggesting a solution.
You can use the missing setting of the terms aggregation in order to handle buckets with missing values. So in your case, you'd do it like this:
{
"aggs": {
"CITY": {
"terms": {
"field": "city_field"
},
"aggs": {
"CLASS": {
"terms": {
"field": "class_field",
"missing": "NO_CLASS"
}
}
}
}
}
}
With this setup, all documents that don't have a class_field (or have it set to null) will land in the NO_CLASS bucket.
PS: Note that this only works since ES 2.0 and not in prior releases.
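Conceptually, the missing parameter just substitutes a placeholder key before bucketing, which can be mimicked client-side on older releases. A small Python sketch (sample docs invented to match the CITY 1 / CITY 2 response above):

```python
from collections import Counter

docs = [
    {"CITY": "CITY 1", "CLASS": "CLASS A"},
    {"CITY": "CITY 1", "CLASS": "CLASS A"},
    {"CITY": "CITY 2", "CLASS": None},
    {"CITY": "CITY 2", "CLASS": None},
]

# What `"missing": "NO_CLASS"` effectively does: null/absent values fall into
# a named bucket instead of silently producing an empty sub-aggregation.
buckets = Counter(
    (d["CITY"], d.get("CLASS") or "NO_CLASS") for d in docs
)
# buckets[("CITY 2", "NO_CLASS")] == 2
```

With the server-side missing setting you get the same effect without post-processing, and the CITY 2 doc count and its CLASS buckets stay consistent.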
