I have a use case where I need to get all unique user ids from Elasticsearch, sorted by timestamp.
What I'm currently using is a composite terms aggregation with a sub-aggregation that returns the latest timestamp.
(I can't sort on the client side, as it slows the script down.)
Sample data in Elasticsearch:
{
"_index": "logstash-2020.10.29",
"_type": "doc",
"_id": "L0Urc3UBttS_uoEtubDk",
"_version": 1,
"_score": null,
"_source": {
"#version": "1",
"#timestamp": "2020-10-29T06:56:00.000Z",
"timestamp_string": "1603954560",
"search_query": "example 3",
"user_uuid": "asdfrghcwehf",
"browsing_url": "https://www.google.com/search?q=example+3",
},
"fields": {
"#timestamp": [
"2020-10-29T06:56:00.000Z"
]
},
"sort": [
1603954560000
]
}
Expected Output:
[
{
"key" : "bjvexyducsls",
"doc_count" : 846,
"1" : {
"value" : 1.603948557E12,
"value_as_string" : "2020-10-29T05:15:57.000Z"
}
},
{
"key" : "lhmsbq2osski",
"doc_count" : 420,
"1" : {
"value" : 1.6039476E12,
"value_as_string" : "2020-10-29T05:00:00.000Z"
}
},
{
"key" : "m2wiaufcbvvi",
"doc_count" : 1,
"1" : {
"value" : 1.603893635E12,
"value_as_string" : "2020-10-28T14:00:35.000Z"
}
},
{
"key" : "rrm3vd5ovqwg",
"doc_count" : 1,
"1" : {
"value" : 1.60389362E12,
"value_as_string" : "2020-10-28T14:00:20.000Z"
}
},
{
"key" : "x42lk4t3frfc",
"doc_count" : 72,
"1" : {
"value" : 1.60389318E12,
"value_as_string" : "2020-10-28T13:53:00.000Z"
}
}
]
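For reference, the composite approach described above is presumably along these lines (a sketch reconstructed from the description, with field names taken from the sample document). Note that a composite aggregation can only order buckets by its sources, not by a sub-aggregation, which is why it cannot return users sorted by their latest timestamp:
{
  "size": 0,
  "aggs": {
    "users": {
      "composite": {
        "size": 1000,
        "sources": [
          { "user": { "terms": { "field": "user_uuid" } } }
        ]
      },
      "aggs": {
        "latest": {
          "max": { "field": "@timestamp" }
        }
      }
    }
  }
}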
Adding a working example with index mapping, index data, search query, and search result.
Index Mapping:
{
"mappings":{
"properties":{
"user":{
"type":"keyword"
},
"date":{
"type":"date"
}
}
}
}
Index Data:
{
"date": "2015-01-01",
"user": "user1"
}
{
"date": "2014-01-01",
"user": "user2"
}
{
"date": "2015-01-11",
"user": "user3"
}
Search Query:
{
"size": 0,
"aggs": {
"user_id": {
"terms": {
"field": "user",
"order": {
"sort_user": "asc"
}
},
"aggs": {
"sort_user": {
"min": {
"field": "date"
}
}
}
}
}
}
Search Result:
"aggregations": {
"user_id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "user2",
"doc_count": 1,
"sort_user": {
"value": 1.3885344E12,
"value_as_string": "2014-01-01T00:00:00.000Z"
}
},
{
"key": "user1",
"doc_count": 1,
"sort_user": {
"value": 1.4200704E12,
"value_as_string": "2015-01-01T00:00:00.000Z"
}
},
{
"key": "user3",
"doc_count": 1,
"sort_user": {
"value": 1.4209344E12,
"value_as_string": "2015-01-11T00:00:00.000Z"
}
}
]
}
}
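This sorts users by their earliest date, ascending. Since the original question asked for the latest timestamp per user, presumably the same pattern works with min swapped for max and the order reversed:
{
  "size": 0,
  "aggs": {
    "user_id": {
      "terms": {
        "field": "user",
        "order": {
          "sort_user": "desc"
        }
      },
      "aggs": {
        "sort_user": {
          "max": {
            "field": "date"
          }
        }
      }
    }
  }
}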
Related
I have a question about aggregation.
I want to aggregate on a field declared as an array.
I don't want a bucket per element, but a bucket per whole array value.
I have the following documents:
PUT value-list-index
{
"mappings": {
"properties": {
"server": {
"type": "keyword"
},
"users": {
"type": "keyword",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
PUT value-list-index/_doc/1
{
"server": "server1",
"users": ["user1"]
}
PUT value-list-index/_doc/2
{
"server": "server2",
"users": ["user1","user2"]
}
PUT value-list-index/_doc/3
{
"server": "server3",
"users": ["user2", "user3"]
}
PUT value-list-index/_doc/4
{
"server": "server4",
"users": ["user1","user2", "user3","user4"]
}
PUT value-list-index/_doc/5
{
"server": "server5",
"users": ["user2", "user3","user4"]
}
PUT value-list-index/_doc/6
{
"server": "server6",
"users": ["user3","user4"]
}
PUT value-list-index/_doc/7
{
"server": "server7",
"users": ["user1","user2", "user3","user4"]
}
PUT value-list-index/_doc/8
{
"server": "server8",
"users": ["user1","user2", "user3","user4"]
}
PUT value-list-index/_doc/9
{
"server": "server9",
"users": ["user1","user2", "user3","user4"]
}
GET value-list-index/_search
{
"size" : 0,
"aggs": {
"words": {
"terms": {
"field": "users"
},
"aggs": {
"total": {
"value_count": {
"field": "users"
}
}
}
}
}
}
I want the following:
"aggregations" : {
"words" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
**"key" : "user1",
"doc_count" : 1,**
"total" : {
"value" : xx
}
},
{
**"key" : "user1","user2",
"doc_count" : 1,**
"total" : {
"value" : xx
}
},
{
"key" : "user1","user2","user3","user4",
"doc_count" : 4,
"total" : {
"value" : xx
}
}
]
}
}
but it returns a per-element grouping like this:
"aggregations" : {
"words" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "user2",
"doc_count" : 7,
"total" : {
"value" : 23
}
},
{
"key" : "user3",
"doc_count" : 7,
"total" : {
"value" : 23
}
},
{
"key" : "user1",
"doc_count" : 6,
"total" : {
"value" : 19
}
},
{
"key" : "user4",
"doc_count" : 6,
"total" : {
"value" : 21
}
}
]
}
}
Is the aggregation I want possible?
Maybe this aggregation can help you: the frequent items aggregation.
But be careful with the performance.
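A minimal sketch of such a query, assuming Elasticsearch 8.7+ where the aggregation is called frequent_item_sets (earlier 8.x previews used frequent_items); minimum_support is set to 0.4 here just to keep the bucket list short:
GET value-list-index/_search
{
  "size": 0,
  "aggs": {
    "words": {
      "frequent_item_sets": {
        "minimum_set_size": 1,
        "minimum_support": 0.4,
        "fields": [
          { "field": "users" }
        ],
        "size": 10
      }
    }
  }
}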
Look at these results:
"aggregations": {
"words": {
"buckets": [
{
"key": {
"users": [
"user2"
]
},
"doc_count": 7,
"support": 0.7777777777777778
},
{
"key": {
"users": [
"user2",
"user3"
]
},
"doc_count": 6,
"support": 0.6666666666666666
},
{
"key": {
"users": [
"user3",
"user4"
]
},
"doc_count": 6,
"support": 0.6666666666666666
},
{
"key": {
"users": [
"user1"
]
},
"doc_count": 6,
"support": 0.6666666666666666
},
{
"key": {
"users": [
"user2",
"user3",
"user4"
]
},
"doc_count": 5,
"support": 0.5555555555555556
},
{
"key": {
"users": [
"user2",
"user1"
]
},
"doc_count": 5,
"support": 0.5555555555555556
},
{
"key": {
"users": [
"user2",
"user3",
"user4",
"user1"
]
},
"doc_count": 4,
"support": 0.4444444444444444
}
]
}
}
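An alternative, if scripting is acceptable: key a terms aggregation on the joined array. This is an untested sketch assuming a recent Painless version; note that doc values are deduplicated and sorted, so the key represents the set of users rather than their original order, and script-based terms can be slow on large indices:
GET value-list-index/_search
{
  "size": 0,
  "aggs": {
    "user_sets": {
      "terms": {
        "script": {
          "lang": "painless",
          "source": "String.join(',', doc['users'])"
        },
        "size": 10
      }
    }
  }
}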
I have an index with the following mapping:
{
"mappings": {
"properties": {
"typed_obj": {
"type": "nested",
"properties": {
"id": {"type": "keyword"},
"type": {"type": "keyword"}
}
}
}
}
}
and the following documents:
{"index" : {}}
{"typed_obj": [{"id": "1", "type": "one"}, {"id": "2", "type": "two"}]}
{"index" : {}}
{"typed_obj": [{"id": "1", "type": "one"}, {"id": "2", "type": "one"}]}
{"index" : {}}
{"typed_obj": [{"id": "1", "type": "one"}, {"id": "3", "type": "one"}]}
{"index" : {}}
{"typed_obj": [{"id": "1", "type": "one"}, {"id": "4", "type": "two"}]}
How can I group typed_obj by type and count the unique ids?
Something like:
{
"type": "one",
"count": 3
},
{
"type": "two",
"count": 2
}
I made up a query with an aggregation:
{
"query": {
"match_all": {}
},
"aggs": {
"obj_nested": {
"nested": {
"path": "typed_obj"
},
"aggs": {
"by_type_and_id": {
"multi_terms": {
"terms": [
{
"field": "typed_obj.type"
},
{
"field": "typed_obj.id"
}
]
}
}
}
}
},
"size": 0
}
and it returns:
"buckets": [
{
"key": [
"one",
"1"
],
"key_as_string": "one|1",
"doc_count": 4
},
{
"key": [
"one",
"2"
],
"key_as_string": "one|2",
"doc_count": 1
},
{
"key": [
"one",
"3"
],
"key_as_string": "one|3",
"doc_count": 1
},
{
"key": [
"two",
"2"
],
"key_as_string": "two|2",
"doc_count": 1
},
{
"key": [
"two",
"4"
],
"key_as_string": "two|4",
"doc_count": 1
}
]
In the backend app I can group the keys by their first element (the typed_obj type) and then take each group's length, but my question is: is it possible to get the per-type counts without fetching all id+type pairs from the index?
You need to use the cardinality aggregation to count distinct values.
Query:
{
"query": {
"match_all": {}
},
"aggs": {
"obj_nested": {
"nested": {
"path": "typed_obj"
},
"aggs": {
"type":{
"terms": {
"field": "typed_obj.type",
"size": 10
},
"aggs": {
"id": {
"cardinality": {
"field": "typed_obj.id"
}
}
}
}
}
}
},
"size": 0
}
Response:
"aggregations" : {
"obj_nested" : {
"doc_count" : 8,
"type" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "one",
"doc_count" : 6,
"id" : {
"value" : 3
}
},
{
"key" : "two",
"doc_count" : 2,
"id" : {
"value" : 2
}
}
]
}
}
}
Note: the cardinality aggregation is "a single-value metrics aggregation that calculates an approximate count of distinct values."
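Since the count is approximate (it is backed by HyperLogLog++), counts can deviate on high-cardinality fields. If the ids number in the thousands or fewer, raising precision_threshold keeps the count near-exact at the cost of some memory, e.g.:
"id": {
  "cardinality": {
    "field": "typed_obj.id",
    "precision_threshold": 1000
  }
}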
My Elasticsearch index contains products with a denormalized m:n relationship to categories.
My goal is to derive a categories index from it which contains the same information, but with the relationship inverted.
The index looks like this:
PUT /products
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"article_id": {
"type": "keyword"
},
"categories": {
"type": "nested",
"properties": {
"cat_name": {
"type": "keyword"
}
}
}
}
}
}
containing documents created like this:
POST /products/_doc
{
"name": "radio",
"article_id": "1001",
"categories": [
{ "cat_name": "audio" },
{ "cat_name": "electronics" }
]
}
POST /products/_doc
{
"name": "fridge",
"article_id": "1002",
"categories": [
{ "cat_name": "appliances" },
{ "cat_name": "electronics" }
]
}
I would like to get something like this back from Elasticsearch:
{
"name": "appliances",
"products": [
{
"name": "fridge",
"article_id": "1002"
}
]
},
{
"name": "audio",
"products": [
{
"name": "radio",
"article_id": "1001"
}
]
},
{
"name": "electronics",
"products": [
{
"name": "fridge",
"article_id": "1002"
},
{
"name": "radio",
"article_id": "1001"
}
]
}
which would eventually be put into an index such as:
PUT /categories
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"products": {
"type": "nested",
"properties": {
"name": {
"type": "keyword"
},
"article_id": {
"type": "keyword"
}
}
}
}
}
}
I cannot figure out how to do this without loading and grouping all products programmatically.
Here's what I have tried:
Bucket aggregation on field categories.cat_name
This gives me the document count per category but not the product documents. A top_hits sub-aggregation seems to be limited to 100 documents.
Group using collapse field with expansion
Collapsing is only possible on a single-valued field.
I'm using Elasticsearch 8.1.
The query you need is this one:
POST products/_search
{
"size": 0,
"aggs": {
"cats": {
"nested": {
"path": "categories"
},
"aggs": {
"categories": {
"terms": {
"field": "categories.cat_name",
"size": 10
},
"aggs": {
"root": {
"reverse_nested": {},
"aggs": {
"products": {
"terms": {
"field": "name",
"size": 10
}
}
}
}
}
}
}
}
}
}
Which produces exactly what you need (minus the article_id, but that's easy to add; see the sketch after the results):
"buckets" : [
{
"key" : "electronics",
"doc_count" : 2,
"root" : {
"doc_count" : 2,
"products" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "fridge",
"doc_count" : 1
},
{
"key" : "radio",
"doc_count" : 1
}
]
}
}
},
{
"key" : "appliances",
"doc_count" : 1,
"root" : {
"doc_count" : 1,
"products" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "fridge",
"doc_count" : 1
}
]
}
}
},
{
"key" : "audio",
"doc_count" : 1,
"root" : {
"doc_count" : 1,
"products" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "radio",
"doc_count" : 1
}
]
}
}
}
]
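To also bring back the article_id, one option (a sketch) is to nest a further terms sub-aggregation under products, so that each product bucket carries its id:
"products": {
  "terms": {
    "field": "name",
    "size": 10
  },
  "aggs": {
    "article_id": {
      "terms": {
        "field": "article_id",
        "size": 1
      }
    }
  }
}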
I need a query that returns only results that have exactly 1 bucket.
The query below returns a visitor's access data grouped by day.
{
"size" : 0,
"query" : {
"filtered" : {
"filter" : {
"bool" : {
"must" : [
{
"range" : {
"start_time" : {
"gte" : "2019-02-06 00:00:00",
"lte" : "2019-02-11 23:59:59"
}
}
}
]
}
}
}
},
"aggs" : {
"UNIQUE" : {
"terms" : {
"size" : 0,
"field" : "username"
},
"aggs" : {
"visits" : {
"date_histogram" : {
"field" : "start_time",
"interval" : "day",
"format" : "yyyy-MM-dd"
}
}
}
}
}
}
I need to know which ones visited only once in the period. So when a user has only 1 bucket, it's ONE. And if they visited on more than one day (buckets > 1), then it's RECURRENT.
If I understand correctly, you want a list of users who visited only once (i.e. on a single date) in a particular time frame, and you want both details, the date and the username, in the aggregation.
I've created a sample mapping, sample documents, an aggregation query, and the corresponding response.
Mapping:
PUT mytest
{
"mappings": {
"mydocs": {
"properties": {
"username": {
"type": "keyword"
},
"start_time": {
"type": "date",
"format": "yyyy-MM-dd"
}
}
}
}
}
Sample Documents:
You can see that I've created 6 documents: John visits twice on the same date, Jack visits the site on two different dates, while Jane and Rob visit only once in the time frame for which I will write the aggregation.
POST mytest/mydocs/1
{
"username": "john",
"start_time": "2018-08-01"
}
POST mytest/mydocs/2
{
"username": "john",
"start_time": "2018-08-01"
}
POST mytest/mydocs/3
{
"username": "jane",
"start_time": "2018-08-01"
}
POST mytest/mydocs/4
{
"username": "rob",
"start_time": "2018-08-01"
}
POST mytest/mydocs/5
{
"username": "jack",
"start_time": "2018-08-01"
}
POST mytest/mydocs/6
{
"username": "jack",
"start_time": "2018-08-02"
}
Updated Aggregation Request
Note that I've added two more documents: username jack visits the site on two different dates, and username john visits the site twice on the same date.
POST mytest/_search
{
"size": 0,
"query": {
"range": {
"start_time": {
"gte": "2017-08-01",
"lte": "2019-08-01"
}
}
},
"aggs": {
"myterms": {
"terms": {
"size": 100,
"field": "username"
},
"aggs": {
"visit_date": {
"date_histogram": {
"field": "start_time",
"interval" : "day",
"format" : "yyyy-MM-dd"
}
},
"count": {
"cardinality": {
"field": "start_time"
}
},
"equal_one":{
"bucket_selector":{
"buckets_path":{
"count":"count.value"
},
"script":"params.count == 1"
}
}
}
}
}
}
Response:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 0,
"hits": []
},
"aggregations": {
"myterms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "john",
"doc_count": 2,
"count": {
"value": 1
},
"visit_date": {
"buckets": [
{
"key_as_string": "2018-08-01",
"key": 1533081600000,
"doc_count": 2
}
]
}
},
{
"key": "jane",
"doc_count": 1,
"count": {
"value": 1
},
"visit_date": {
"buckets": [
{
"key_as_string": "2018-08-01",
"key": 1533081600000,
"doc_count": 1
}
]
}
},
{
"key": "rob",
"doc_count": 1,
"count": {
"value": 1
},
"visit_date": {
"buckets": [
{
"key_as_string": "2018-08-01",
"key": 1533081600000,
"doc_count": 1
}
]
}
}
]
}
}
}
You can see that john now appears in the result as expected, even though he visited the site multiple times on the same date.
Let me know if you have any queries.
The solution found was:
{
"size" : 0,
"query" : {
{
"range" : {
"start_time" : {
"gte" : "2019-02-11 00:00:00",
"lte" : "2019-02-11 23:59:59"
}
}
}
},
"aggs" : {
"UNIQUE" : {
"terms" : {
"size" : 0,
"field" : "username"
},
"aggs":{
"visit_date": {
"date_histogram": {
"field" : "start_time",
"interval" : "day",
"format" : "yyyy-MM-dd"
}
},
"count": {
"cardinality": {
"script": "new Date(doc['start_time'].value).format('yyyy-MM-dd')"
}
},
"equal_one":{
"bucket_selector":{
"buckets_path":{
"count":"count.value"
},
"script":"count == 1"
}
}
}
}
}
}
But performance remains a problem: in an environment with about 1 million records this query does not perform well.
Maybe a query using scripted metrics would solve it, but that demands more analysis (doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-scripted-metric-aggregation.html).
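For reference, a rough, untested sketch of what such a scripted metric could look like in Painless (syntax varies across Elasticsearch versions, and it holds every username in memory on the coordinating node, which is exactly the kind of cost that needs analysis first):
{
  "size": 0,
  "aggs": {
    "one_time_visitors": {
      "scripted_metric": {
        "init_script": "state.users = [:]",
        "map_script": "def u = doc['username'].value; def d = doc['start_time'].value.toLocalDate().toString(); if (!state.users.containsKey(u)) { state.users[u] = new HashSet(); } state.users[u].add(d);",
        "combine_script": "return state.users",
        "reduce_script": "def merged = [:]; for (s in states) { for (e in s.entrySet()) { if (!merged.containsKey(e.getKey())) { merged[e.getKey()] = new HashSet(); } merged[e.getKey()].addAll(e.getValue()); } } def once = []; for (e in merged.entrySet()) { if (e.getValue().size() == 1) { once.add(e.getKey()); } } return once;"
      }
    }
  }
}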
I have created an Elasticsearch query with function_score and top_hits. This query removes duplicates and returns the top record for each bucket.
GET employeeid/info/_search
{"size": 0,
"query" : {
"function_score" : {
"query" : {
"match" : {
"employeeID" : "23141A"
}
},
"functions" : [{
"linear" : {
"AcquiredDate" : {
"scale" : "90d",
"decay" : 0.5
}
}
}, {
"filter" : {
"match" : {
"name" : "sorna"
}
},
"boost_factor" : 10
}, {
"filter" : {
"match" : {
"name" : "lingam"
}
},
"boost_factor" : 7
}
],
"boost_mode" : "replace"
}
},
"aggs": {
"duplicateCount": {
"terms": {
"field": "employeehash",
"min_doc_count": 1
},
"aggs": {
"duplicateDocuments": {
"top_hits": {
"size":1
}
}
}
}
}
}
I am getting the expected result, but the problem is that I want to sort the result by _score.
The following is my (simplified) output:
{
"key": "567",
"doc_count": 2,
"duplicateDocuments": {
"hits": {
"total": 2,
"max_score": 0.40220365,
"hits": [
{
"_index": "employeeid",
"_type": "info",
"_id": "5",
"_score": 0.40220365,
"_source": {
"name": "John",
"organisation": "google",
"employeeID": "23141A",
"employeehash": "567",
"AcquiredDate": "2016-02-01T07:57:28Z"
}
}
]
}
}
},
{
"key": "102",
"doc_count": 1,
"duplicateDocuments": {
"hits": {
"total": 1,
"max_score": 2.8154256,
"hits": [
{
"_index": "employeeid",
"_type": "info",
"_id": "8",
"_score": 2.8154256,
"_source": {
"name": "lingam",
"organisation": "google",
"employeeID": "23141A",
"employeehash": "102",
"AcquiredDate": "2016-02-01T07:57:28Z"
}
}
]
}
}
}
Question: how do I sort by _score descending?
I have not enabled groovy, so I cannot use a script.