Count number of inner elements of array property (Including repeated values) - elasticsearch

Given that I have the following records:
[
  {
    "profile": "123",
    "inner": [
      { "name": "John" }
    ]
  },
  {
    "profile": "456",
    "inner": [
      { "name": "John" },
      { "name": "John" },
      { "name": "James" }
    ]
  }
]
I want to get something like:
"aggregations": {
"name": {
"buckets": [
{
"key": "John",
"doc_count": 3
},
{
"key": "James",
"doc_count": 1
}
]
}
}
I'm a beginner with Elasticsearch, and this seems like a pretty simple operation, but I can't find how to achieve it.
If I try a simple terms aggregation, it returns 2 for John instead of 3.
Here is the request I'm trying:
{
  "size": 0,
  "aggs": {
    "name": {
      "terms": { "field": "inner.name" }
    }
  }
}
How can I possibly achieve this?
Additional info: it will be used in Kibana later.
I can change the mapping to whatever I want, but AFAIK Kibana doesn't like the "nested" type. :(

You need a value_count sub-aggregation. By default, terms only reports a doc_count (the number of documents per term), whereas value_count counts how many values of the field occur within each bucket.
So, for your purposes:
{
  "size": 0,
  "aggs": {
    "name": {
      "terms": { "field": "inner.name" },
      "aggs": {
        "total": {
          "value_count": { "field": "inner.name" }
        }
      }
    }
  }
}
Which returns:
"aggregations" : {
"name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "John",
"doc_count" : 2,
"total" : {
"value" : 3
}
},
{
"key" : "James",
"doc_count" : 1,
"total" : {
"value" : 2
}
}
]
}
}
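To make the distinction concrete, here is a plain-Python sketch (not an Elasticsearch feature; the documents simply mirror the question's data) of doc_count versus per-occurrence counting:

```python
from collections import Counter

# Documents mirroring the question's data.
docs = [
    {"profile": "123", "inner": [{"name": "John"}]},
    {"profile": "456", "inner": [{"name": "John"},
                                 {"name": "John"},
                                 {"name": "James"}]},
]

# terms doc_count: one hit per *document* containing the term
doc_count = Counter()
for doc in docs:
    for name in {item["name"] for item in doc["inner"]}:
        doc_count[name] += 1

# What the asker wants: every occurrence counted, repeats included
occurrences = Counter(item["name"] for doc in docs for item in doc["inner"])

print(doc_count)    # Counter({'John': 2, 'James': 1})
print(occurrences)  # Counter({'John': 3, 'James': 1})
```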

Related

Elasticsearch query for finding and grouping objects by a field, and returning the unique values of another field

I have this kind of data (irrelevant fields omitted for simplicity's sake):
{
"endpoint": "endpoint_1",
"user_id": 1,
"session": "value2",
...
}
{
"endpoint": "endpoint_2",
"user_id": 1,
"session": "value3",
...
}
{
"endpoint": "endpoint_2",
"user_id": 2,
"session": "value2",
...
}
{
"endpoint": "endpoint_3",
"user_id": 3,
"session": "value3",
...
}
I want to find all users sharing at least one session, BUT only if they're in specific endpoints. I'm struggling to build a query that finds what I want, because the documentation hasn't been much help.
This is what I have so far, painstakingly pieced together from the documentation, but it seems overly complex and wrong:
{
"query": {
"bool": {
"should": [
{"match_phrase": {"endpoint": "endpoint_1"}},
{"match_phrase": {"endpoint": "endpoint_2"}},
{"match_phrase": {"endpoint": "endpoint_4"}},
{"match_phrase": {"endpoint": "endpoint_6"}},
{"match_phrase": {"endpoint": "endpoint_11"}}
],
"minimum_should_match": 1
}
},
"size": 0,
"aggregations": {
"shared_sessions": {
"terms": {
"size": 1000,
"field": "session",
"order": {
"users": "desc"
}
},
"aggregations": {
"users": {
"cardinality": {
"field": "user_id",
"precision_threshold": 100
}
},
"minimum": {
"bucket_selector": {
"buckets_path": {
"var1": "users"
},
"script": "params.var1 > 1"
}
},
"aggregations": {
"terms": {
"field": "user_id",
"size": 1000
}
}
}
}
}
}
This manages to find the users, but it returns hard-to-parse results and makes further alterations prohibitive. It returns something like this:
{
  "aggregations" : {
    "shared_sessions" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 29740,
      "buckets" : [
        {
          "key" : "abcdefg123456", # session
          "doc_count" : 6,
          "aggregations" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              { "key" : 1, "doc_count" : 4 }, # user_id
              { "key" : 2, "doc_count" : 1 }, # user_id
              { "key" : 3, "doc_count" : 1 }  # user_id
            ]
          },
          "users" : { "value" : 3 }
        }
      ]
    }
  }
}
So, is there a better way to do what I want?
Ideally I'd like to return arbitrary fields from the matching objects (including ones I've omitted from this question for simplicity's sake).
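For reference, the intended logic (sessions shared by more than one user, restricted to specific endpoints) can be sketched client-side in plain Python; the records and endpoint names are the hypothetical ones from the question:

```python
from collections import defaultdict

# Log records shaped like the question's data.
logs = [
    {"endpoint": "endpoint_1", "user_id": 1, "session": "value2"},
    {"endpoint": "endpoint_2", "user_id": 1, "session": "value3"},
    {"endpoint": "endpoint_2", "user_id": 2, "session": "value2"},
    {"endpoint": "endpoint_3", "user_id": 3, "session": "value2"},
]
allowed = {"endpoint_1", "endpoint_2"}  # the "specific endpoints"

# Group distinct users by session, considering only allowed endpoints
users_by_session = defaultdict(set)
for log in logs:
    if log["endpoint"] in allowed:
        users_by_session[log["session"]].add(log["user_id"])

# Keep only sessions shared by more than one user
shared = {s: u for s, u in users_by_session.items() if len(u) > 1}
print(shared)  # {'value2': {1, 2}}
```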

Sort Aggregated Buckets From Nested Object Array By Specific Field

I have indexed documents such as
// doc 1
{
...,
"list": [{
"value": "a",
"order": 1
}, {
"value": "b",
"order": 2
}]
,...
}
// doc 2
{
...,
"list": [{
"value": "b",
"order": 2
}, {
"value": "c",
"order": 3
}]
,...
}
If I use the aggregation on the list.value:
{
"aggs": {
"values": {
"terms": {
"field": "list.value.keyword"
}
}
}
}
I get buckets in order b, a, c:
"buckets" : [
{
"key" : "b",
"doc_count" : 2
},
{
"key" : "a",
"doc_count" : 1
},
{
"key" : "c",
"doc_count" : 1
}
]
since buckets are sorted by _count in descending order by default.
If I use the aggregation on list.value with a sub-aggregation for sorting, in the form of max(list.order):
{
"aggs": {
"values": {
"terms": {
"field": "list.value.keyword",
"order": { "max_order": "desc" }
},
"aggs": {
"max_order": { "max": { "field": "list.order" } }
}
}
}
}
I get buckets in order b, c, a
"buckets" : [
{
"key" : "b",
"doc_count" : 2,
"max_order" : {
"value" : 3.0
}
},
{
"key" : "c",
"doc_count" : 1,
"max_order" : {
"value" : 3.0
}
},
{
"key" : "a",
"doc_count" : 1,
"max_order" : {
"value" : 2.0
}
}
]
as both b and c have a max order of 3 among the documents that contain them (without a nested mapping, every list.order value in a matching document is considered).
However, I want to write a query that gets the buckets in order c, b, a, since their own orders are 3, 2, 1 respectively. How can I achieve that?
You need to use a nested aggregation to get the buckets in the order c, b, a.
Here is a working example with index mapping, index data, search query, and search result.
Index Mapping
PUT testidx1
{
"mappings":{
"properties": {
"list":{
"type": "nested"
}
}
}
}
Index Data:
POST testidx1/_doc/1
{
"list": [
{
"value": "a",
"order": 1
},
{
"value": "b",
"order": 2
}
]
}
POST testidx1/_doc/2
{
"list": [
{
"value": "b",
"order": 2
},
{
"value": "c",
"order": 3
}
]
}
Search Query:
POST testidx1/_search
{
  "size": 0,
  "aggs": {
    "resellers": {
      "nested": { "path": "list" },
      "aggs": {
        "unique_values": {
          "terms": {
            "field": "list.value.keyword",
            "order": { "max_order": "desc" }
          },
          "aggs": {
            "max_order": {
              "max": { "field": "list.order" }
            }
          }
        }
      }
    }
  }
}
Search Response:
"aggregations" : {
"resellers" : {
"doc_count" : 4,
"unique_values" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "c",
"doc_count" : 1,
"max_order" : {
"value" : 3.0
}
},
{
"key" : "b",
"doc_count" : 2,
"max_order" : {
"value" : 2.0
}
},
{
"key" : "a",
"doc_count" : 1,
"max_order" : {
"value" : 1.0
}
}
]
}
}
}
}
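To see why the nested mapping changes the ordering, here is a plain-Python sketch of both behaviors over the question's two documents (without nested, a document's fields are flattened, so b picks up order 3 from doc 2):

```python
docs = [
    {"list": [{"value": "a", "order": 1}, {"value": "b", "order": 2}]},
    {"list": [{"value": "b", "order": 2}, {"value": "c", "order": 3}]},
]

# Without a nested mapping: for each term, the max is taken over *all*
# list.order values of any document containing the term.
flat = {}
for doc in docs:
    doc_max = max(item["order"] for item in doc["list"])
    for value in {item["value"] for item in doc["list"]}:
        flat[value] = max(flat.get(value, 0), doc_max)

# With a nested mapping: the max is taken over the matching objects only.
nested = {}
for doc in docs:
    for item in doc["list"]:
        nested[item["value"]] = max(nested.get(item["value"], 0), item["order"])

print(flat)    # {'a': 2, 'b': 3, 'c': 3}  -> order b/c ambiguous
print(nested)  # {'a': 1, 'b': 2, 'c': 3}  -> order c, b, a
```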

ElasticSearch Max Agg on lowest value inside a list property of the document

I'm looking to do a max aggregation on a value of a property of my documents; the property is a list of complex objects (key and value). Here's my data:
[{
"id" : "1",
"listItems" :
[
{
"key" : "li1",
"value" : 100
},
{
"key" : "li2",
"value" : 5000
}
]
},
{
"id" : "2",
"listItems" :
[
{
"key" : "li3",
"value" : 200
},
{
"key" : "li2",
"value" : 2000
}
]
}]
When I do the nested max aggregation on "listItems.value", I expect the max value returned to be 200 (and not 5000). The logic should first find the MIN value under listItems for each document, then do the max aggregation over those. Is it possible to do something like this?
Thanks.
The search query performs the following aggregations:
A terms aggregation on the id field.
A min aggregation on listItems.value.
A max_bucket aggregation, a sibling pipeline aggregation that identifies the bucket(s) with the maximum value of a specified metric in a sibling aggregation and outputs both the value and the key(s) of those bucket(s).
Please refer to the nested aggregation documentation for a detailed explanation.
Adding a working example with index data, index mapping, search query, and search result.
Index Mapping:
{
"mappings": {
"properties": {
"listItems": {
"type": "nested"
},
"id":{
"type":"text",
"fielddata":"true"
}
}
}
}
Index Data:
{
"id" : "1",
"listItems" :
[
{
"key" : "li1",
"value" : 100
},
{
"key" : "li2",
"value" : 5000
}
]
}
{
"id" : "2",
"listItems" :
[
{
"key" : "li3",
"value" : 200
},
{
"key" : "li2",
"value" : 2000
}
]
}
Search Query:
{
  "size": 0,
  "aggs": {
    "id_terms": {
      "terms": { "field": "id" },
      "aggs": {
        "nested_entries": {
          "nested": { "path": "listItems" },
          "aggs": {
            "min_position": {
              "min": { "field": "listItems.value" }
            }
          }
        }
      }
    },
    "maxValue": {
      "max_bucket": {
        "buckets_path": "id_terms>nested_entries>min_position"
      }
    }
  }
}
Search Result:
"aggregations": {
"id_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "1",
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 100.0
}
}
},
{
"key": "2",
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 200.0
}
}
}
]
},
"maxValue": {
"value": 200.0,
"keys": [
"2"
]
}
}
The initial post mentioned a nested aggregation, so I was sure the question was about nested documents. Since I arrived at this solution before seeing the other answer, I'm keeping the whole thing for history; it actually differs only in adding the nested aggregation.
The whole process can be explained like this:
Bucket each document into its own bucket.
Use a nested aggregation to be able to aggregate on nested documents.
Use a min aggregation to find the minimum value among each document's nested documents, and thereby for the document itself.
Finally, use another aggregation to compute the maximum value among the results of the previous aggregation.
Given this setup:
// PUT /index
{
"mappings": {
"properties": {
"children": {
"type": "nested",
"properties": {
"value": {
"type": "integer"
}
}
}
}
}
}
// POST /index/_doc
{
"children": [
{ "value": 12 },
{ "value": 45 }
]
}
// POST /index/_doc
{
"children": [
{ "value": 7 },
{ "value": 35 }
]
}
I can use these aggregations in a request to get the required value:
{
  "size": 0,
  "aggs": {
    "document": {
      "terms": { "field": "_id" },
      "aggs": {
        "children": {
          "nested": { "path": "children" },
          "aggs": {
            "minimum": {
              "min": { "field": "children.value" }
            }
          }
        }
      }
    },
    "result": {
      "max_bucket": {
        "buckets_path": "document>children>minimum"
      }
    }
  }
}
Which returns:
{
"aggregations": {
"document": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "O4QxyHQBK5VO9CW5xJGl",
"doc_count": 1,
"children": {
"doc_count": 2,
"minimum": {
"value": 7.0
}
}
},
{
"key": "OoQxyHQBK5VO9CW5kpEc",
"doc_count": 1,
"children": {
"doc_count": 2,
"minimum": {
"value": 12.0
}
}
}
]
},
"result": {
"value": 12.0,
"keys": [
"OoQxyHQBK5VO9CW5kpEc"
]
}
}
}
There should also be a workaround using a script for calculating the max: all the script would need to do is find and return the smallest value in each document.
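Client-side, the whole terms-then-min-then-max_bucket computation reduces to a few lines; here is a plain-Python sketch over the question's data:

```python
docs = [
    {"id": "1", "listItems": [{"key": "li1", "value": 100},
                              {"key": "li2", "value": 5000}]},
    {"id": "2", "listItems": [{"key": "li3", "value": 200},
                              {"key": "li2", "value": 2000}]},
]

# min aggregation per document bucket
mins = {doc["id"]: min(item["value"] for item in doc["listItems"])
        for doc in docs}

# max_bucket pipeline: pick the bucket(s) with the largest min
best = max(mins.values())
keys = [k for k, v in mins.items() if v == best]

print(mins)        # {'1': 100, '2': 200}
print(best, keys)  # 200 ['2']
```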

Elasticsearch - find IPs from which only anonymous requests came

I have network logs in my Elasticsearch. Each log has an username and an IP field. Something like this:
{"username":"user1", "ip": "1.2.3.4"}
{"username":"anonymous", "ip": "1.2.3.4"}
{"username":"anonymous", "ip": "2.3.4.5"}
{"username":"user2", "ip": "3.4.5.6"}
I have a seemingly simple task: list all IPs from which only anonymous requests came. The problem is that I cannot simply filter for anonymous, because then I'd list false IPs that appear with anonymous but not exclusively. Manually I can do this in three steps:
List all unique IPs.
List the unique IPs that appear with a username other than anonymous.
Exclude the items of the 2nd list from the 1st.
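Those three steps are just set operations; a plain-Python sketch over the question's sample logs:

```python
logs = [
    {"username": "user1", "ip": "1.2.3.4"},
    {"username": "anonymous", "ip": "1.2.3.4"},
    {"username": "anonymous", "ip": "2.3.4.5"},
    {"username": "user2", "ip": "3.4.5.6"},
]

all_ips = {log["ip"] for log in logs}                                   # step 1
named_ips = {log["ip"] for log in logs
             if log["username"] != "anonymous"}                         # step 2
anonymous_only = all_ips - named_ips                                    # step 3

print(anonymous_only)  # {'2.3.4.5'}
```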
But is there a way to do this with a single ES query? My first instinct was to use a bool query. My current approach is this:
GET /sample1/_search
{
"query": {
"bool": {
"must": {
"wildcard": {
"ip": "*"
}
},
"must_not": {
"term": {
"username": "-anonymous"
}
}
}
},
"size": 0,
"aggs": {
"ips": {
"terms": {
"field": "ip.keyword"
}
}
}
}
I expect "2.3.4.5", but it returns all 3 unique IPs. I searched the web and tried different query types for hours. Any ideas?
Please find below the mapping, sample docs, the query for your scenario, and the response.
Mapping:
PUT my_ip_index
{
"mappings": {
"properties": {
"user":{
"type": "keyword"
},
"ip":{
"type": "ip"
}
}
}
}
Documents:
POST my_ip_index/_doc/1
{
"user": "user1",
"ip": "1.2.3.4"
}
POST my_ip_index/_doc/2
{
"user": "anonymous",
"ip": "1.2.3.4"
}
POST my_ip_index/_doc/3
{
"user": "anonymous",
"ip": "2.3.4.5"
}
POST my_ip_index/_doc/4
{
"user": "user2",
"ip": "3.4.5.6"
}
Aggregation Query:
POST my_ip_index/_search
{
  "size": 0,
  "aggs": {
    "my_valid_ips": {
      "terms": {
        "field": "ip",
        "size": 10
      },
      "aggs": {
        "valid_users": {
          "terms": {
            "field": "user",
            "size": 10,
            "include": "anonymous"
          }
        },
        "min_bucket_selector": {
          "bucket_selector": {
            "buckets_path": {
              "valid_users_count": "valid_users._bucket_count",
              "my_valid_ips_count": "_count"
            },
            "script": {
              "source": "params.valid_users_count == 1 && params.my_valid_ips_count == 1"
            }
          }
        }
      }
    }
  }
}
Note how I've made use of a terms aggregation and a bucket_selector aggregation in the above query.
I've added an include clause to the terms aggregation so that only anonymous users are considered, and the logic inside the bucket_selector keeps a bucket only if it has a single document count in the top-level terms aggregation (e.g. 2.3.4.5) and a single bucket count in the second-level terms aggregation.
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_valid_ips" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "2.3.4.5", <---- Expected IP/Answer
"doc_count" : 1,
"valid_users" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "anonymous",
"doc_count" : 1
}
]
}
}
]
}
}
}
Hope it helps!

How to get multiple fields returned in elasticsearch query?

How can I get multiple unique fields returned by an Elasticsearch query?
All of my documents have duplicate name and job fields. I would like to use a query to get all the unique values, with name and job together in the same response, so they stay tied to each other.
[
{
"name": "albert",
"job": "teacher",
"dob": "11/22/91"
},
{
"name": "albert",
"job": "teacher",
"dob": "11/22/91"
},
{
"name": "albert",
"job": "teacher",
"dob": "11/22/91"
},
{
"name": "justin",
"job": "engineer",
"dob": "1/2/93"
},
{
"name": "justin",
"job": "engineer",
"dob": "1/2/93"
},
{
"name": "luffy",
"job": "rubber man",
"dob": "1/2/99"
}
]
Expected result, in any format (I was trying to use aggs but I only get one field):
[
{
"name": "albert",
"job": "teacher"
},
{
"name": "justin",
"job": "engineer"
},
{
"name": "luffy",
"job": "rubber man"
},
]
This is what I've tried so far:
GET name.test.index/_search
{
"size": 0,
"aggs" : {
"name" : {
"terms" : { "field" : "name.keyword" }
}
}
}
Using the above query gets me this, which is good in that the values are unique:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 95,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Justin",
"doc_count" : 56
},
{
"key" : "Luffy",
"doc_count" : 31
},
{
"key" : "Albert",
"doc_count" : 8
}
]
}
}
}
I tried a nested aggregation, but that did not work. Is there an alternative way to get multiple unique values, or am I missing something?
That's a good start! There are a few ways to achieve what you want; each produces a different response format, so you can decide which one you prefer.
The first option is to leverage the top_hits sub-aggregation and return the two fields for each name bucket:
GET name.test.index/_search
{
  "size": 0,
  "aggs": {
    "name": {
      "terms": { "field": "name.keyword" },
      "aggs": {
        "top": {
          "top_hits": {
            "_source": [ "name", "job" ],
            "size": 1
          }
        }
      }
    }
  }
}
The second option is to use a script in your terms aggregation instead of a field to return a compound value:
GET name.test.index/_search
{
  "size": 0,
  "aggs": {
    "name": {
      "terms": {
        "script": "doc['name'].value + ' - ' + doc['job'].value"
      }
    }
  }
}
The third option is to use two levels of field collapsing:
GET name.test.index/_search
{
  "collapse": {
    "field": "name",
    "inner_hits": {
      "name": "by_job",
      "collapse": { "field": "job" },
      "size": 1
    }
  }
}
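Whichever option you pick, the goal is the set of unique (name, job) pairs; here is a plain-Python sketch of what the compound-key approach computes, using documents shaped like the question's data:

```python
docs = [
    {"name": "albert", "job": "teacher", "dob": "11/22/91"},
    {"name": "albert", "job": "teacher", "dob": "11/22/91"},
    {"name": "justin", "job": "engineer", "dob": "1/2/93"},
    {"name": "luffy", "job": "rubber man", "dob": "1/2/99"},
]

# One bucket per distinct (name, job) combination, duplicates collapsed
unique_pairs = sorted({(d["name"], d["job"]) for d in docs})
print(unique_pairs)
# [('albert', 'teacher'), ('justin', 'engineer'), ('luffy', 'rubber man')]
```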
