I need some help with querying in Elasticsearch.
The search API response looks something like this:
{
"took": 58,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1020900,
"max_score": 1,
"hits": [
{
"_index": "index-20192029",
"_type": "_doc",
"_id": "urn:22291760",
"_score": 1,
"_source": {
"user_id": 1234567,
"document": [
{
"documentType": "application/pdf",
"documentUrl": "http://somethingxyz1234.pdf"
},
{
"documentType": "application/xml",
"documentUrl": "http://somethingxyz1234.xml"
}
], .....
How do I get only the URL whose documentType is XML?
I tried:
"_source": ["user_id", "document.documentType", "document.documentUrl"],
"query": {
"bool": {
"match": { "document.documentType" :"application/xml"}
}
}
But that also returned the PDF entry.
I want documentUrl to contain only the XML URL.
Thanks
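If document is mapped as a nested field, you can use inner_hits to return only the nested objects that match the query. The example below matches application/pdf; use application/xml as the value to get the XML URL instead.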
GET test/_search
{
"query": {
"nested": {
"path": "document",
"query": {
"term": {
"document.documentType": {
"value": "application/pdf"
}
}
},
"inner_hits": {}
}
}
}
Results:
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "pv36jIIB-X7q7ErxEhyg",
"_score" : 0.6931471,
"_source" : {
"document" : [
{
"documentType" : "application/pdf",
"documentUrl" : "http://somethingxyz1234.pdf"
},
{
"documentType" : "application/xml",
"documentUrl" : "http://somethingxyz1234.xml"
}
]
},
"inner_hits" : {
"document" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.6931471,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "pv36jIIB-X7q7ErxEhyg",
"_nested" : {
"field" : "document",
"offset" : 0
},
"_score" : 0.6931471,
"_source" : {
"documentType" : "application/pdf",
"documentUrl" : "http://somethingxyz1234.pdf"
}
}
]
}
}
}
}
]
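As a rough sketch of how a client might read the matched URLs out of such a response: the helper name matched_document_urls and the trimmed sample_response are hypothetical, and the code assumes the full response body, where the hit list shown above sits under hits.hits.

```python
from typing import Any, Dict, List


def matched_document_urls(response: Dict[str, Any], path: str = "document") -> List[str]:
    """Collect documentUrl from every inner hit under the given nested path."""
    urls: List[str] = []
    for hit in response.get("hits", {}).get("hits", []):
        inner = hit.get("inner_hits", {}).get(path, {})
        for inner_hit in inner.get("hits", {}).get("hits", []):
            urls.append(inner_hit["_source"]["documentUrl"])
    return urls


# Trimmed copy of the result shown above (only the fields the helper reads).
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "pv36jIIB-X7q7ErxEhyg",
                "inner_hits": {
                    "document": {
                        "hits": {
                            "hits": [
                                {
                                    "_nested": {"field": "document", "offset": 0},
                                    "_source": {
                                        "documentType": "application/pdf",
                                        "documentUrl": "http://somethingxyz1234.pdf",
                                    },
                                }
                            ]
                        }
                    }
                },
            }
        ]
    }
}

print(matched_document_urls(sample_response))
```

The top-level _source still contains every entry of document; only the inner_hits section is limited to the nested objects that matched.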
I am running a terms query on Elasticsearch and can't figure out how to limit the results to unique values only.
Assume this is the query:
"query": {
"bool": {
"must": [{
"terms": {
"id": [
"1",
"2",
"3"
],
"boost": 1.0
}
}],
"adjust_pure_negative": true,
"boost": 1.0
}
},
"aggs": {
"top-results": {
"terms": {
"field": "id"
},
"aggs": {
"test": {
"top_hits": {
"size": 1
}
}
}
}
}
Ideally I would like exactly 3 results, one for each id of 1, 2, or 3, but this query returns many more than that.
To mimic your scenario, I indexed 5 employee records with different salaries in Elasticsearch. The query below fetches the listed salaries with one record (the top hit) each.
GET /employee/_doc/_search
{
"query": {
"bool": {
"should": [
{ "match": { "salary": 90000 }},
{ "match": { "salary": 80000 }}
]
}
},
"size" : 0,
"aggs": {
"salaries": {
"terms": {
"field": "salary",
"order": { "top_score": "desc" }
},
"aggs": {
"top_score": { "max": { "script": "_score" }},
"salary-num": { "top_hits": { "size": 1 }}
}
}
}
}
OUTPUT
{
...
"aggregations" : {
"salaries" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 80000,
"doc_count" : 2,
"top_score" : {
"value" : 1.0
},
"salary-num" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"id" : 10,
"name" : "Lydia",
"dept" : "HR",
"salary" : 80000
}
}
]
}
}
},
{
"key" : 90000,
"doc_count" : 1,
"top_score" : {
"value" : 1.0
},
"salary-num" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"id" : 20,
"name" : "Flora",
"dept" : "Accounts",
"salary" : 90000
}
}
]
}
}
}
]
}
}
}
The following are sample documents in Elasticsearch.
{
"_index": "social",
"_type": "social",
"_id": "1632560884596186633",
"_score": 1,
"_source": {
"created_date": "2017-10-24",
"reach": 1692,
"social_id": 200
}
},
{
"_index": "social",
"_type": "social",
"_id": "1626693964184981799",
"_score": 1,
"_source": {
"created_date": "2017-10-25",
"reach": 1692,
"social_id": 100
}
},
{
"_index": "social",
"_type": "social",
"_id": "162669396418498170",
"_score": 1,
"_source": {
"created_date": "2017-10-25",
"reach": 1692,
"social_id": 50
}
},
{
"_index": "social",
"_type": "social",
"_id": "1626693964184981756",
"_score": 1,
"_source": {
"created_date": "2017-10-25",
"reach": 1692,
"social_id": 25
}
}
Question: how do I sum reach over the top 2 documents, ordered by created date, for each social_id?
What I have tried:
{
"size": 0,
"aggs": {
"reach_bucket": {
"terms": {
"size": 200,
"field": "social_id"
},
"aggs": {
"media_reach_bucket": {
"terms": {
"field": "created_date",
"size": 200
},
"aggs": {
"top_sales_hits": {
"top_hits": {
"sort": [
{
"created_date": {
"order": "desc"
}
}
],
"_source": {
"includes": [
"created_date",
"reach"
]
},
"size": 2
}
}
}
}
}
}
}
}
Problem:
I cannot add a sub-aggregation under top_hits to sum the reach of those hits.
Any suggestions would be appreciated.
You may want to use date_histogram instead of terms when bucketing per day (I assume). But more importantly, you should sort your top_hits by reach, not created_date, since that's going to be the same in your per-day bucket.
{
"size": 0,
"aggs": {
"reach_bucket": {
"terms": {
"size": 200,
"field": "social_id"
},
"aggs": {
"media_reach_bucket": {
"date_histogram": {
"field": "created_date",
"calendar_interval": "day"
},
"aggs": {
"top_sales_hits": {
"top_hits": {
"sort": [
{
"reach": {
"order": "desc"
}
}
],
"_source": {
"includes": [
"reach"
]
},
"size": 2
}
}
}
}
}
}
}
}
yielding top hits like so
"aggregations" : {
"reach_bucket" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 100,
"doc_count" : 4,
"media_reach_bucket" : {
"buckets" : [
{
"key_as_string" : "2017-10-24T00:00:00.000Z",
"key" : 1508803200000,
"doc_count" : 4,
"top_sales_hits" : {
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "kart",
"_type" : "_doc",
"_id" : "3iLJRnEBZbobBB0NiV8R",
"_score" : null,
"_source" : {
"reach" : 40
},
"sort" : [
40
]
},
{
"_index" : "kart",
"_type" : "_doc",
"_id" : "3SLJRnEBZbobBB0Nhl-Y",
"_score" : null,
"_source" : {
"reach" : 30
},
"sort" : [
30
]
}
]
}
}
}
]
}
}
]
}
}
whose reach you can then sum up in your post processing functions.
I'm not familiar with a top-n sum, only with sums of docs above a certain threshold -- in which case I'd use filter aggregations.
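The post-processing step mentioned above could look like the following sketch. The function name sum_top_reach and the trimmed sample_aggregations are hypothetical; the aggregation names (reach_bucket, media_reach_bucket, top_sales_hits) are the ones used in the query above.

```python
from typing import Any, Dict, Tuple


def sum_top_reach(aggregations: Dict[str, Any]) -> Dict[Tuple[Any, str], int]:
    """Sum the reach of the returned top hits per (social_id, day) bucket."""
    totals: Dict[Tuple[Any, str], int] = {}
    for social in aggregations["reach_bucket"]["buckets"]:
        for day in social["media_reach_bucket"]["buckets"]:
            hits = day["top_sales_hits"]["hits"]["hits"]
            key = (social["key"], day["key_as_string"])
            totals[key] = sum(hit["_source"]["reach"] for hit in hits)
    return totals


# Trimmed copy of the aggregation output shown above.
sample_aggregations = {
    "reach_bucket": {
        "buckets": [
            {
                "key": 100,
                "media_reach_bucket": {
                    "buckets": [
                        {
                            "key_as_string": "2017-10-24T00:00:00.000Z",
                            "top_sales_hits": {
                                "hits": {
                                    "hits": [
                                        {"_source": {"reach": 40}},
                                        {"_source": {"reach": 30}},
                                    ]
                                }
                            },
                        }
                    ]
                },
            }
        ]
    }
}

print(sum_top_reach(sample_aggregations))
```

Because top_hits is limited to "size": 2, each sum covers at most two documents per bucket.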
I have an index in Elasticsearch whose documents contain duplicate field values. In the query result I need to remove all duplicates and get only distinct values. For example:
PUT localhost:9200/person
{
"mappings" : {
"person" : {
"properties" : {
"name" : { "type" : "keyword" }
}
}
}
}
POST localhost:9200/person/person
{
"name": "John"
}
{
"name": "John"
}
{
"name": "Marry"
}
{
"name": "Tomas"
}
I'm trying to remove duplicates with a terms aggregation on the field "name", but it doesn't work.
GET localhost:9200/person/person/_search
{
"size": 3,
"query": {
"function_score": {
"functions": [
{
"random_score": {
"seed": "dasdfdLBpnM0"
}
}
]
}
},
"aggs": {
"top-names": {
"terms": {
"field": "name",
"size": 3
},
"aggs": {
"top_names_hits": {
"top_hits": {
"size": 1
}
}
}
}
}
}
Result:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 10,
"max_score": 0.9506482,
"hits": [
{
"_index": "person",
"_type": "person",
"_id": "H-5D8GoB8pRyckNSVUeN",
"_score": 0.9506482,
"_source": {
"name": "Tomas"
}
},
{
"_index": "person",
"_type": "person",
"_id": "He5D8GoB8pRyckNSPEfa",
"_score": 0.7700638,
"_source": {
"name": "John"
}
},
{
"_index": "person",
"_type": "person",
"_id": "HO5D8GoB8pRyckNSN0fo",
"_score": 0.71723765,
"_source": {
"name": "John"
}
}
]
},
"aggregations": {
"top-names": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "John",
"doc_count": 2,
"top_names_hits": {
"hits": {
"total": 2,
"max_score": 0.7700638,
"hits": [
{
"_index": "person",
"_type": "person",
"_id": "He5D8GoB8pRyckNSPEfa",
"_score": 0.7700638,
"_source": {
"name": "John"
}
}
]
}
}
},
{
"key": "Marry",
"doc_count": 1,
"top_names_hits": {
"hits": {
"total": 1,
"max_score": 0.66815424,
"hits": [
{
"_index": "person",
"_type": "person",
"_id": "Iu5D8GoB8pRyckNScUdv",
"_score": 0.66815424,
"_source": {
"name": "Marry"
}
}
]
}
}
},
{
"key": "Tomas",
"doc_count": 1,
"top_names_hits": {
"hits": {
"total": 1,
"max_score": 0.9506482,
"hits": [
{
"_index": "person",
"_type": "person",
"_id": "H-5D8GoB8pRyckNSVUeN",
"_score": 0.9506482,
"_source": {
"name": "Tomas"
}
}
]
}
}
}
]
}
}
}
The aggregation was also applied to the document with name = "Marry", which is not among the returned hits. I don't understand why, or how I can apply the aggregation only to the query results.
Below is, more or less, the blueprint of an Elasticsearch query:
{
"size": n, // Return the n documents based on "query" section (to frontend)
"query": {
// Here is where you are supposed to mention what documents you want
// Any filter/bool/match query condition
// In your case, you haven't specified any correct condition.
// So basically, it would return all the documents or documents based on size parameter. In your case it returns 3.
},
"aggs":{
// This aggregation query would only be applied on documents
// based on documents filtered/matched by the "query" section.
// In your case it is applying aggregation on all documents of that index as per the comment I've mentioned in the above query section.
}
}
Aggregation Query:
To get what you are looking for, use the simplified version of your query below: a terms aggregation with top_hits as a sub-aggregation.
POST person/_search
{
"size": 0, <------- This is to say, I don't want "query" results to be returned and that I only want below aggregation results.
"aggs": {
"top-names": {
"terms": {
"field": "name",
"size": 10
},
"aggs": {
"top_hits_documents": { <------- Top hits would return the actual documents
"top_hits": {
"size": 1
}
}
}
}
}
}
Because there is no "query" section, the aggregation runs over all documents, and "size": 0 at the very top means no query hits are returned.
Only the aggregation results come back.
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 4,
"max_score" : 0.0,
"hits" : [ ] <------ Notice this. No query results returned
},
"aggregations" : { <------ Aggregation Result starts
"top-names" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "John", <------- This is to say there's a value called John
"doc_count" : 2, <------- John occurs in two documents.
"top_hits_documents" : {
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [
{
"_index" : "person",
"_type" : "person",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"name" : "John"
}
}
]
}
}
},
{
"key" : "Marry",
"doc_count" : 1,
"top_hits_documents" : {
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "person",
"_type" : "person",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"name" : "Marry"
}
}
]
}
}
},
{
"key" : "Thomas",
"doc_count" : 1,
"top_hits_documents" : {
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "person",
"_type" : "person",
"_id" : "4",
"_score" : 1.0,
"_source" : {
"name" : "Thomas"
}
}
]
}
}
}
]
}
}
}
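To turn that response into a plain list of distinct values with one representative document each, a client could walk the buckets as sketched below. The helper distinct_top_docs and the trimmed sample_response are hypothetical; the aggregation names are the ones from the query above.

```python
from typing import Any, Dict


def distinct_top_docs(
    response: Dict[str, Any],
    agg_name: str = "top-names",
    hits_name: str = "top_hits_documents",
) -> Dict[str, Dict[str, Any]]:
    """Map each bucket key to the _source of its first top hit."""
    result: Dict[str, Dict[str, Any]] = {}
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        if hits:
            result[bucket["key"]] = hits[0]["_source"]
    return result


# Trimmed copy of the response shown above.
sample_response = {
    "aggregations": {
        "top-names": {
            "buckets": [
                {"key": "John", "doc_count": 2,
                 "top_hits_documents": {"hits": {"hits": [{"_id": "2", "_source": {"name": "John"}}]}}},
                {"key": "Marry", "doc_count": 1,
                 "top_hits_documents": {"hits": {"hits": [{"_id": "3", "_source": {"name": "Marry"}}]}}},
                {"key": "Thomas", "doc_count": 1,
                 "top_hits_documents": {"hits": {"hits": [{"_id": "4", "_source": {"name": "Thomas"}}]}}},
            ]
        }
    }
}

print(list(distinct_top_docs(sample_response)))
```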
Hope that helps!
{
"took": 53,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 1.0,
"hits": [{
"_index": "db",
"_type": "users",
"_id": "AVOiyjHmzUObmc5euUGS",
"_score": 1.0,
"_source": {
"user": "james",
"lastvisited": "2016/01/20 02:03:11",
"browser": "chrome",
"offercode": "JB20"
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiyjIQzUObmc5euUGT",
"_score": 1.0,
"_source": {
"user": "james",
"lastvisited": "2016/01/20 03:04:15",
"browser": "firefox",
"offercode": "JB20,JB50"
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiyjIlzUObmc5euUGU",
"_score": 1.0,
"_source": {
"user": "james",
"lastvisited": "2016/01/21 00:15:21",
"browser": "chrome",
"offercode": "JB20,JB50,JB100"
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiyjJKzUObmc5euUGW",
"_score": 1.0,
"_source": {
"user": "peter",
"lastvisited": "2016/01/20 02:32:22",
"browser": "chrome",
"offercode": "JB20,JB50,JB100"
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiy4jhzUObmc5euUGX",
"_score": 1.0,
"_source": {
"user": "james",
"lastvisited": "2016/01/19 02:03:11",
"browser": "chrome",
"offercode": ""
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiyjI2zUObmc5euUGV",
"_score": 1.0,
"_source": {
"user": "adams",
"lastvisited": "2016/01/20 00:12:11",
"browser": "chrome",
"offercode": "JB10"
}
}]
}
}
I want to filter the documents by the user's last visited time, get the most recently accessed document for each user, and then group those documents by offer code.
I can get the most recently accessed document of each user with a top_hits aggregation, but I am not able to group the results of the top_hits aggregation by offercode.
ES query to get the most recent document of a user
curl -XGET localhost:9200/account/users/_search?pretty -d'{
"size": "0",
"query": {
"bool": {
"must": {
"range": {
"lastvisited": {
"gte": "2016/01/19",
"lte": "2016/01/21"
}
}
}
}
},
"aggs": {
"lastvisited_users": {
"terms": {
"field": "user"
}
,
"aggs": {
"top_user_hits": {
"top_hits": {
"sort": [
{
"lastvisited": {
"order": "desc"
}
}
],
"_source": {
"include": [
"user","offercode","lastvisited"
]
},
"size": 1
}
}
}
}
}}'
ES Output
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 6,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"lastvisited_users" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "james",
"doc_count" : 3,
"top_user_hits" : {
"hits" : {
"total" : 3,
"max_score" : null,
"hits" : [ {
"_index" : "accounts",
"_type" : "users",
"_id" : "AVOtexIEz1WBU8vnnZ2d",
"_score" : null,
"_source" : {
"lastvisited" : "2016/01/20 03:04:15",
"offercode" : "JB20,JB50",
"user" : "james"
},
"sort" : [ 1453259055000 ]
} ]
}
}
}, {
"key" : "adams",
"doc_count" : 1,
"top_user_hits" : {
"hits" : {
"total" : 1,
"max_score" : null,
"hits" : [ {
"_index" : "accounts",
"_type" : "users",
"_id" : "AVOtexJMz1WBU8vnnZ2h",
"_score" : null,
"_source" : {
"lastvisited" : "2016/01/20 00:12:11",
"offercode" : "JB10",
"user" : "adams"
},
"sort" : [ 1453248731000 ]
} ]
}
}
}, {
"key" : "adamsnew",
"doc_count" : 1,
"top_user_hits" : {
"hits" : {
"total" : 1,
"max_score" : null,
"hits" : [ {
"_index" : "accounts",
"_type" : "users",
"_id" : "AVOtexJhz1WBU8vnnZ2i",
"_score" : null,
"_source" : {
"lastvisited" : "2016/01/20 00:12:11",
"offercode" : "JB1010,aka10",
"user" : "adamsnew"
},
"sort" : [ 1453248731000 ]
} ]
}
}
}, {
"key" : "peter",
"doc_count" : 1,
"top_user_hits" : {
"hits" : {
"total" : 1,
"max_score" : null,
"hits" : [ {
"_index" : "accounts",
"_type" : "users",
"_id" : "AVOtexIoz1WBU8vnnZ2f",
"_score" : null,
"_source" : {
"lastvisited" : "2016/01/20 02:32:22",
"offercode" : "JB20,JB50,JB100",
"user" : "peter"
},
"sort" : [ 1453257142000 ]
} ]
}
}
} ]
}
}
}
Now, I want to aggregate the results of tophits aggregation.
Expected Output
{
"offercode_grouped": {
"JB20": 1,
"JB10": 1,
"JB20,JB50": 1,
"JB20,JB50,JB100": 2,
"":1
}
}
I tried using a pipeline aggregation, but I don't know how to group by the result of the top_hits aggregation.
I hope I understand your problem correctly. I think I found a somewhat hacky "solution".
It is a combination of function_score query, sampler aggregation and terms aggregation.
Create new index
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow" -d'
{
"mappings": {
"document": {
"properties": {
"user": {
"type": "string",
"index": "not_analyzed"
},
"lastvisited": {
"type": "date",
"format": "YYYY/MM/dd HH:mm:ss"
},
"browser": {
"type": "string",
"index": "not_analyzed"
},
"offercode": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}'
Index documents
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/1?routing=james" -d'
{
"user": "james",
"lastvisited": "2016/01/20 02:03:11",
"browser": "chrome",
"offercode": "JB20"
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/2?routing=james" -d'
{
"user": "james",
"lastvisited": "2016/01/20 03:04:15",
"browser": "firefox",
"offercode": "JB20,JB50"
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/3?routing=james" -d'
{
"user": "james",
"lastvisited": "2016/01/21 00:15:21",
"browser": "chrome",
"offercode": "JB20,JB50,JB100"
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/4?routing=peter" -d'
{
"user": "peter",
"lastvisited": "2016/01/20 02:32:22",
"browser": "chrome",
"offercode": "JB20,JB50,JB100"
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/5?routing=james" -d'
{
"user": "james",
"lastvisited": "2016/01/19 02:03:11",
"browser": "chrome",
"offercode": ""
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/6?routing=adams" -d'
{
"user": "adams",
"lastvisited": "2016/01/20 00:12:11",
"browser": "chrome",
"offercode": "JB10"
}'
Get aggregations
curl -XPOST "http://127.0.0.1:9200/stackoverflow/_search" -d'
{
"query": {
"function_score": {
"boost_mode": "replace", // we need to replace document score with the result of the functions
"query": {
"bool": {
"filter": [
{
"range": { // get documents within the date range
"lastvisited": {
"gte": "2016/01/19 00:00:00",
"lte": "2016/01/21 23:59:59"
}
}
}
]
}
},
"functions": [
{
"linear": {
"lastvisited": {
"origin": "2016/01/21 23:59:59", // same as lastvisited lte filter
"scale": "2d" // set the scale - please, see elasticsearch docs for more info https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-function-score-query.html#function-decay
}
}
}
]
}
},
"aggs": {
"user": {
"sampler": { // get top scored document per user
"field": "user",
"max_docs_per_value": 1
},
"aggs": {
"offers": { // aggregate user documents per `offercode`
"terms": {
"field": "offercode"
}
}
}
}
},
"size": 0
}'
Response
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 0,
"hits": []
},
"aggregations": {
"user": {
"doc_count": 3,
"offers": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "JB20,JB50,JB100",
"doc_count": 2
},
{
"key": "JB10",
"doc_count": 1
}
]
}
}
}
}
Unless you have only one shard per index, you need to specify routing when indexing the data, because the sampler aggregation is calculated per shard. All documents of a particular user must therefore live on the same shard in order to get the single highest-scoring document per user.
The sampler aggregation picks documents by score, which is why the document scores have to be modified. This is where the function_score query helps. With field_value_factor, the score would simply be the timestamp of the last visit, so the more recent the visit, the higher the score.
UPDATE: With field_value_factor there is probably a problem with _score accuracy. For more info see issue https://github.com/elastic/elasticsearch/issues/11872. That is why a decay function is used instead, as clintongormley suggested in the issue. A decay function works on both sides of the origin, so documents 1 day older and 1 day newer than the origin receive the same _score. That is why newer documents need to be filtered out (see the range filter in the query).
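The symmetry described above can be checked with a small sketch. It assumes the linear decay formula as I understand it from the Elasticsearch function_score documentation, score = max(0, (s - max(0, |value - origin| - offset)) / s) with s = scale / (1 - decay), and the default decay of 0.5; the function linear_decay is a hypothetical re-implementation, not Elasticsearch code.

```python
from datetime import datetime, timedelta


def linear_decay(
    value: datetime,
    origin: datetime,
    scale: timedelta,
    offset: timedelta = timedelta(0),
    decay: float = 0.5,
) -> float:
    """Approximate the linear decay score for a date field (assumed formula)."""
    s = scale / (1.0 - decay)                      # distance at which the score reaches 0
    distance = abs(value - origin)                 # same for older and newer values
    effective = max(timedelta(0), distance - offset)
    return max(0.0, (s - effective) / s)


origin = datetime(2016, 1, 21, 23, 59, 59)         # same origin as in the query
scale = timedelta(days=2)                          # "scale": "2d"

older = linear_decay(origin - timedelta(days=1), origin, scale)
newer = linear_decay(origin + timedelta(days=1), origin, scale)
print(older, newer)
```

Both calls give the same score, which is why the range filter has to exclude documents newer than the origin before the sampler picks the top document per user.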
NOTE: I tried this query only with the data shown in the example, so a bigger dataset is needed to test it properly. But I think it should work.
Check this solution: it's more limited, but it is suitable for production: https://stackoverflow.com/a/39788948/4769188
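Another option, if the number of users is small enough to fetch in one response, is to keep the terms + top_hits query from the question and do the grouping on the client. This is only a sketch: the helper count_top_hits_by and the trimmed sample_response are hypothetical, while the aggregation names lastvisited_users and top_user_hits come from the question's query.

```python
from collections import Counter
from typing import Any, Dict


def count_top_hits_by(
    response: Dict[str, Any],
    field: str,
    agg_name: str = "lastvisited_users",
    hits_name: str = "top_user_hits",
) -> Dict[str, int]:
    """Count the values of `field` across the top hit(s) of every bucket."""
    counts: Counter = Counter()
    for bucket in response["aggregations"][agg_name]["buckets"]:
        for hit in bucket[hits_name]["hits"]["hits"]:
            counts[hit["_source"].get(field, "")] += 1
    return dict(counts)


def _bucket(user: str, offercode: str) -> Dict[str, Any]:
    # Builds one bucket in the shape of the "ES Output" in the question.
    return {
        "key": user,
        "top_user_hits": {"hits": {"hits": [{"_source": {"user": user, "offercode": offercode}}]}},
    }


# Trimmed copy of the "ES Output" from the question.
sample_response = {
    "aggregations": {
        "lastvisited_users": {
            "buckets": [
                _bucket("james", "JB20,JB50"),
                _bucket("adams", "JB10"),
                _bucket("adamsnew", "JB1010,aka10"),
                _bucket("peter", "JB20,JB50,JB100"),
            ]
        }
    }
}

print(count_top_hits_by(sample_response, "offercode"))
```

Note that a terms aggregation returns only its top buckets (10 by default), so the size on the user aggregation would need to cover all users for the counts to be complete.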
This may solve your problem:
SELECT offercode, COUNT(offercode)
FROM users AS u1
WHERE u1.ID = (SELECT u2.ID FROM users AS u2 WHERE u2.user = u1.user ORDER BY u2.lastvisited DESC LIMIT 1)
AND u1.lastvisited >= '2016/01/20'
GROUP BY offercode;