Elasticsearch composite and sub-aggregations - elasticsearch

I'm using a composite aggregation to scroll through the whole data set (it works like pagination).
Suppose we have car sales data. For each day, I'd like to count the number of cars sold per brand:
{
  "day1": {
    "honda": 3,
    "bmw": 5
  },
  "day2": {
    "honda": 4,
    "audi": 1,
    "tesla": 5
  }
}
I'm doing something like the following, but it doesn't work:
GET _search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          {
            "date": {
              "date_histogram": {
                "field": "created_at",
                "calendar_interval": "1d"
              },
              "aggs": {
                "car_brand": {
                  "terms": {
                    "field": "car_brands"
                  }
                }
              }
            }
          }
        ]
      }
    }
  }
}
It fails with this error message:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "x_content_parse_exception",
        "reason" : "[14:17] [composite] failed to parse field [sources]"
      }
    ],
    "type" : "x_content_parse_exception",
    "reason" : "[14:17] [composite] failed to parse field [sources]",
    "caused_by" : {
      "type" : "illegal_state_exception",
      "reason" : "expected value but got [FIELD_NAME]"
    }
  },
  "status" : 400
}

The sources of a composite aggregation cannot contain sub-aggregations; each grouping must be its own source. Go with
GET _search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          {
            "date": {
              "date_histogram": {
                "field": "created_at",
                "calendar_interval": "1d"
              }
            }
          },
          {
            "car_brand": {
              "terms": {
                "field": "car_brands"
              }
            }
          }
        ]
      }
    }
  }
}
instead.
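Since the point of composite here is to scroll through the whole data set, note that each response carries an after_key; feeding it back as after fetches the next page of buckets. A minimal sketch (the after values shown are hypothetical):
GET _search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 100,
        "after": {
          "date": 1494201600000,
          "car_brand": "honda"
        },
        "sources": [
          {
            "date": {
              "date_histogram": {
                "field": "created_at",
                "calendar_interval": "1d"
              }
            }
          },
          {
            "car_brand": {
              "terms": {
                "field": "car_brands"
              }
            }
          }
        ]
      }
    }
  }
}
Repeat until a response comes back with no buckets.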

Related

Elasticsearch query - Most recent log for each user, for field logtype='x'

I am trying to query Elasticsearch to get the most recent log for each userID, including only logs with the field logtype='x'. In other words: where logtype='x', return one log per userID, namely the one whose date is the most recent for that userID.
Example log: {"logtype": "x", "number": 232423, "userID": 123, "time": "2021-02-03T20:25:44.603045+05:30"}
How can I create this query?
Assuming your index mapping looks like:
{
  "properties" : {
    "logtype" : {
      "type" : "text",
      "fields" : {
        "keyword" : {
          "type" : "keyword",
          "ignore_above" : 256
        }
      }
    },
    "number" : {
      "type" : "long"
    },
    "time" : {
      "type" : "date"
    },
    "userID" : {
      "type" : "long"
    }
  }
}
you'll need these aggregations: a terms ordered by the result of a max, plus a top_hits to fetch the most recent log per userID:
POST logs/_search
{
  "size": 0,
  "query": {
    "match": {
      "logtype": "x"
    }
  },
  "aggs": {
    "by_user_id": {
      "terms": {
        "field": "userID",
        "size": 1,
        "order": {
          "latest_date": "desc"
        }
      },
      "aggs": {
        "latest_date": {
          "max": {
            "field": "time"
          }
        },
        "latest_log": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "time": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
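Note that "size": 1 on the terms aggregation returns only the single user whose latest log is the most recent overall. To get one log per every userID, raise the terms size to at least the number of distinct users -- a sketch, assuming fewer than 10,000 of them:
"by_user_id": {
  "terms": {
    "field": "userID",
    "size": 10000,
    "order": {
      "latest_date": "desc"
    }
  }
}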

How to count number of fields inside nested field? - Elasticsearch

I created the following mapping. I would like to count the number of products in the nested field "products" for each document separately. I would also like to run a histogram aggregation on those counts, so that I know how many documents fall into each bucket size.
PUT /receipts
{
  "mappings": {
    "properties": {
      "id" : {
        "type": "integer"
      },
      "user_id" : {
        "type": "integer"
      },
      "date" : {
        "type": "date"
      },
      "sum" : {
        "type": "double"
      },
      "products" : {
        "type": "nested",
        "properties": {
          "name" : {
            "type" : "text"
          },
          "number" : {
            "type" : "double"
          },
          "price_single" : {
            "type" : "double"
          },
          "price_total" : {
            "type" : "double"
          }
        }
      }
    }
  }
}
I've tried this query, but I get the total number of products across all documents instead of the number of products for each document separately.
GET /receipts/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "terms": {
      "nested": {
        "path": "products"
      },
      "aggs": {
        "bucket_size": {
          "value_count": {
            "field": "products"
          }
        }
      }
    }
  }
}
Result of the query:
"aggregations" : {
"terms" : {
"doc_count" : 6552,
"bucket_size" : {
"value" : 0
}
}
}
UPDATE
Now I have this code where I make separate buckets for each id and count the number of products inside them.
GET /receipts/_search
{
  "query": {
    "match_all": {}
  },
  "size" : 0,
  "aggs": {
    "terms": {
      "terms": {
        "field": "_id"
      },
      "aggs": {
        "nested": {
          "nested": {
            "path": "products"
          },
          "aggs": {
            "bucket_size": {
              "value_count": {
                "field": "products.number"
              }
            }
          }
        }
      }
    }
  }
}
Result of the query:
"aggregations" : {
"terms" : {
"doc_count_error_upper_bound" : 5,
"sum_other_doc_count" : 490,
"buckets" : [
{
"key" : "1",
"doc_count" : 1,
"nested" : {
"doc_count" : 21,
"bucket_size" : {
"value" : 21
}
}
},
{
"key" : "10",
"doc_count" : 1,
"nested" : {
"doc_count" : 5,
"bucket_size" : {
"value" : 5
}
}
},
{
"key" : "100",
"doc_count" : 1,
"nested" : {
"doc_count" : 12,
"bucket_size" : {
"value" : 12
}
}
},
...
Is it possible to group these values (21, 5, 12, ...) into buckets to make a histogram of them?
products is only the path to the array of individual products, not an aggregatable field itself. So you'll need to run value_count on one of the product's sub-fields, such as number:
GET receipts/_search
{
  "size": 0,
  "aggs": {
    "terms": {
      "nested": {
        "path": "products"
      },
      "aggs": {
        "bucket_size": {
          "value_count": {
            "field": "products.number"
          }
        }
      }
    }
  }
}
Note that if a product has no number, it won't contribute to the total count. It's therefore best practice to always include an ID in each product and aggregate on that field.
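A sketch of what that could look like -- the id sub-field is hypothetical, and already-indexed documents would need to be reindexed (or updated) to populate it:
PUT receipts/_mapping
{
  "properties": {
    "products": {
      "type": "nested",
      "properties": {
        "id": {
          "type": "keyword"
        }
      }
    }
  }
}
With that in place, "value_count": { "field": "products.id" } counts every product, even those without a number.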
Alternatively, you could use a script to account for missing values. Luckily, value_count does not deduplicate -- if two products are alike and/or have empty values, they'll still be counted as two:
GET receipts/_search
{
  "size": 0,
  "aggs": {
    "terms": {
      "nested": {
        "path": "products"
      },
      "aggs": {
        "bucket_size": {
          "value_count": {
            "script": {
              "source": "doc['products.number'].toString()"
            }
          }
        }
      }
    }
  }
}
UPDATE
You could also use a nested composite aggregation, which will give you the histogrammed product counts together with the corresponding receipt IDs:
GET /receipts/_search
{
  "size": 0,
  "aggs": {
    "my_aggs": {
      "nested": {
        "path": "products"
      },
      "aggs": {
        "composite_parent": {
          "composite": {
            "sources": [
              {
                "receipt_id": {
                  "terms": {
                    "field": "_id"
                  }
                }
              },
              {
                "product_number": {
                  "histogram": {
                    "field": "products.number",
                    "interval": 1
                  }
                }
              }
            ]
          }
        }
      }
    }
  }
}
The interval is modifiable.
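As with any composite aggregation, results come back one page at a time; you can raise the page size and feed the response's after_key back in as after to walk through all receipt/number pairs. A sketch of just the composite portion (the page size of 1000 is arbitrary):
"composite": {
  "size": 1000,
  "sources": [
    { "receipt_id": { "terms": { "field": "_id" } } },
    { "product_number": { "histogram": { "field": "products.number", "interval": 1 } } }
  ]
}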

elasticsearch return hits found in aggregation

I am trying to get rows from my database that have a unique 'sku' field.
I have a working query which counts this number correctly. My query:
GET _search
{
  "size": 0,
  "aggs": {
    "unique_products": {
      "cardinality": {
        "field": "sku.keyword"
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "(merch1: 'Dog') AND ((store_name: 'walmart')) AND product_gap: 'yes'"
          }
        },
        {
          "range": {
            "capture_date": {
              "format": "date",
              "gte": "2020-05-13",
              "lte": "2020-08-03"
            }
          }
        }
      ]
    }
  }
}
It returns this result:
{
  "took" : 129,
  "timed_out" : false,
  "_shards" : {
    "total" : 514,
    "successful" : 514,
    "skipped" : 98,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 150,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "unique_products" : {
      "value" : 38
    }
  }
}
This correctly reports the number of unique_products as 38.
I am trying to edit this query so that it will actually return all 38 unique products, but am unsure how. I started by trying to return the top hit from the agg result:
GET _search
{
  "size": 0,
  "aggs": {
    "unique_products": {
      "cardinality": {
        "field": "sku.keyword"
      }
    },
    "top_hits": {
      "size": 1,
      "_source": {
        "include": [
          "sku", "source_store"
        ]
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "(merch1: 'Dog') AND ((store_name: 'walmart')) AND product_gap: 'yes'"
          }
        },
        {
          "range": {
            "capture_date": {
              "format": "date",
              "gte": "2020-05-13",
              "lte": "2020-08-03"
            }
          }
        }
      ]
    }
  }
}
But I got an error in the result saying:
{
  "error": {
    "root_cause": [
      {
        "type": "parsing_exception",
        "reason": "Expected [START_OBJECT] under [size], but got a [VALUE_NUMBER] in [top_hits]",
        "line": 10,
        "col": 13
      }
    ],
    "type": "parsing_exception",
    "reason": "Expected [START_OBJECT] under [size], but got a [VALUE_NUMBER] in [top_hits]",
    "line": 10,
    "col": 13
  },
  "status": 400
}
Is a cardinality agg still my best bet for returning all 38 unique products? Thanks!
While the cardinality aggregation gives the unique count, it cannot accept sub-aggregations -- in other words, top_hits cannot be nested under it.
Your approach was on the right track, but you first need to bucketize the SKUs with a terms aggregation and then retrieve the underlying docs using top_hits:
{
  "size": 0,
  "aggs": {
    "unique_products": {
      "cardinality": {
        "field": "sku.keyword"
      }
    },
    "terms_agg": {
      "terms": {
        "field": "sku.keyword",
        "size": 100
      },
      "aggs": {
        "top_hits_agg": {
          "top_hits": {
            "size": 1,
            "_source": {
              "include": [
                "sku",
                "source_store"
              ]
            }
          }
        }
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "(merch1: 'Dog') AND ((store_name: 'walmart')) AND product_gap: 'yes'"
          }
        },
        {
          "range": {
            "capture_date": {
              "format": "date",
              "gte": "2020-05-13",
              "lte": "2020-08-03"
            }
          }
        }
      ]
    }
  }
}
FYI: the reason your query threw an exception is that top_hits is an aggregation type and, just like unique_products, it needs to be wrapped in its own named aggregation.
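For reference, wrapping it in a name of its own (my_top_hit below is hypothetical) would have made the parse error go away, though as explained above it would then return only one hit overall, not one per unique sku:
"my_top_hit": {
  "top_hits": {
    "size": 1,
    "_source": {
      "include": [ "sku", "source_store" ]
    }
  }
}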

Count the percentage of character fields

I want to count the percentage of values of a specified field.
This is my REST API call:
GET _search
{
  "_source": {
    "includes": [ "FIRST_SWITCHED", "LAST_SWITCHED", "IPV4_DST_ADDR", "L4_DST_PORT", "IPV4_SRC_ADDR", "L7_PROTO_NAME", "IN_BYTES", "IN_PKTS", "OUT_BYTES", "OUT_PKTS" ]
  },
  "from" : 0,
  "size" : 10000,
  "query": {
    "bool": {
      "must": [
        {
          "match" : { "_index" : "logstash-2017.12.22" }
        },
        {
          "match_phrase" : { "IPV4_SRC_ADDR" : "192.168.0.159" }
        },
        {
          "range" : {
            "LAST_SWITCHED" : {
              "gte" : 1513683600
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "IN_PKTS": {
      "sum": {
        "field": "IN_PKTS"
      }
    },
    "IN_BYTES": {
      "sum": {
        "field": "IN_BYTES"
      }
    },
    "OUT_BYTES": {
      "sum": {
        "field": "OUT_BYTES"
      }
    },
    "OUT_PKTS": {
      "sum": {
        "field": "OUT_PKTS"
      }
    },
    "percent": {
      "significant_terms" : {
        "field" : "L7_PROTO_NAME",
        "percentage" : {}
      }
    },
    "protocol" : {
      "terms" : {
        "field" : "PROTOCOL",
        "include" : ["17", "6"]
      }
    },
    "Using_port_count" : {
      "cardinality" : {
        "field" : "L4_SRC_PORT"
      }
    }
  }
}
but there are some errors. This is the error message:
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [L7_PROTO_NAME] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
Thank you in advance!
OK, I found the answer!
Just add .keyword here and then it runs:
"field" : "L7_PROTO_NAME.keyword"

Elasticsearch sort inside top_hits aggregation

I have an index of messages where I also store a messageHash for each message, along with many other fields. There are multiple duplicate message fields in the index, e.g. "Hello". I want to retrieve unique messages.
Here is the query I wrote to search unique messages and sort them by date; that is, among all duplicates, the message with the latest date is the one I want returned.
{
  "query": {
    "bool": {
      "must": {
        "match_phrase": {
          "message": "Hello"
        }
      }
    }
  },
  "sort": [
    {
      "date": {
        "order": "desc"
      }
    }
  ],
  "aggs": {
    "top_messages": {
      "terms": {
        "field": "messageHash"
      },
      "aggs": {
        "top_messages_hits": {
          "top_hits": {
            "sort": [
              {
                "date": {
                  "order": "desc"
                }
              },
              "_score"
            ],
            "size": 1
          }
        }
      }
    }
  }
}
The problem is that it's not sorted by date; it's sorted by doc_count. I just get the sort values in the response, not actually sorted results. What's wrong? I'm now wondering if it is even possible to do this.
EDIT:
I tried substituting "terms" : { "field" : "messageHash", "order" : { "mydate" : "desc" } }, "aggs" : { "mydate" : { "max" : { "field" : "date" } } } for "terms": { "field": "messageHash" }, but I get:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "parsing_exception",
        "reason" : "Found two sub aggregation definitions under [top_messages]",
        "line" : 1,
        "col" : 412
      }
    ],
    "type" : "parsing_exception",
    "reason" : "Found two sub aggregation definitions under [top_messages]",
    "line" : 1,
    "col" : 412
  },
  "status" : 400
}
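That error means top_messages ended up with two separate aggs maps: the one introduced by the substitution and the one already wrapping top_messages_hits. Both sub-aggregations have to live under a single aggs object -- a sketch of the corrected nesting:
"aggs": {
  "top_messages": {
    "terms": {
      "field": "messageHash",
      "order": {
        "mydate": "desc"
      }
    },
    "aggs": {
      "mydate": {
        "max": {
          "field": "date"
        }
      },
      "top_messages_hits": {
        "top_hits": {
          "sort": [
            {
              "date": {
                "order": "desc"
              }
            }
          ],
          "size": 1
        }
      }
    }
  }
}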
