I need some help with querying in Elasticsearch.
The API response looks something like this:
{
"took": 58,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1020900,
"max_score": 1,
"hits": [
{
"_index": "index-20192029",
"_type": "_doc",
"_id": "urn:22291760",
"_score": 1,
"_source": {
"user_id": 1234567,
"document": [
{
"documentType": "application/pdf",
"documentUrl": "http://somethingxyz1234.pdf"
},
{
"documentType": "application/xml",
"documentUrl": "http://somethingxyz1234.xml"
}
], .....
How do I get only the URL whose documentType is XML?
I tried doing
"_source": ["user_id", "document.documentType", "document.documentUrl"],
"query": {
"bool": {
"match": { "document.documentType" :"application/xml"}
}
}
But that also returned the PDF entry.
I want documentUrl to contain only the URL of the XML document.
Thanks
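If document is mapped as a nested field, you can use inner_hits to return only the nested objects that matched the query. The example below matches application/pdf; use application/xml for your case.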
GET test/_search
{
"query": {
"nested": {
"path": "document",
"query": {
"term": {
"document.documentType": {
"value": "application/pdf"
}
}
},
"inner_hits": {}
}
}
}
Results:
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "pv36jIIB-X7q7ErxEhyg",
"_score" : 0.6931471,
"_source" : {
"document" : [
{
"documentType" : "application/pdf",
"documentUrl" : "http://somethingxyz1234.pdf"
},
{
"documentType" : "application/xml",
"documentUrl" : "http://somethingxyz1234.xml"
}
]
},
"inner_hits" : {
"document" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.6931471,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "pv36jIIB-X7q7ErxEhyg",
"_nested" : {
"field" : "document",
"offset" : 0
},
"_score" : 0.6931471,
"_source" : {
"documentType" : "application/pdf",
"documentUrl" : "http://somethingxyz1234.pdf"
}
}
]
}
}
}
}
]
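To show how the matching URL could be read out of such a response on the client side, here is a minimal Python sketch. It assumes the response has already been parsed into a dict; the function name matched_urls and the trimmed sample response are hypothetical, and the sample uses an application/xml match rather than the pdf match shown above.

```python
def matched_urls(response):
    """Collect documentUrl values from the nested inner hits of a search response."""
    urls = []
    for hit in response.get("hits", {}).get("hits", []):
        # "document" is the nested path used in the query above
        inner = hit.get("inner_hits", {}).get("document", {})
        for nested in inner.get("hits", {}).get("hits", []):
            urls.append(nested["_source"]["documentUrl"])
    return urls


# Hypothetical, trimmed response for a query on application/xml
response = {
    "hits": {
        "hits": [
            {
                "_id": "pv36jIIB-X7q7ErxEhyg",
                "inner_hits": {
                    "document": {
                        "hits": {
                            "hits": [
                                {
                                    "_source": {
                                        "documentType": "application/xml",
                                        "documentUrl": "http://somethingxyz1234.xml",
                                    }
                                }
                            ]
                        }
                    }
                },
            }
        ]
    }
}

print(matched_urls(response))
```

Reading inner_hits rather than _source is the point here: _source always carries the whole document array, while inner_hits carries only the nested objects that matched.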
I have some documents that contain fields such as id, size, etc.
I want to find all the possible sizes where id = 1 or 2.
Is this possible?
You can use a terms query with source filtering. Here is a working example.
Index Data:
{
"id": 1,
"size": 1
}
{
"id": 2,
"size": 2
}
{
"id": 3,
"size": 3
}
Search Query:
{
"_source": "size",
"query": {
"terms": {
"id": [
1,
2
]
}
}
}
Search Result:
"hits": [
{
"_index": "66381642",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"size": 1
}
},
{
"_index": "66381642",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"size": 2
}
}
]
If you want to show the possible sizes for those ids, then you should use an aggregation.
POST your_index/_search
{
"size": 0,
"query": {
"terms": {
"id": [
"1",
"2"
]
}
},
"aggs": {
"sizes": {
"terms": {
"field": "size"
}
}
}
}
The response will contain each unique size along with the number of docs that have that size:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"sizes" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1,
"doc_count" : 1
},
{
"key" : 2,
"doc_count" : 1
}
]
}
}
}
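As a small illustration of how the unique sizes could be pulled out of that aggregation response, here is a Python sketch. It assumes the response is already a parsed dict; unique_sizes is a hypothetical helper name, and the sample is a trimmed copy of the response above.

```python
def unique_sizes(response, agg_name="sizes"):
    """Return the bucket keys (the distinct sizes) of a terms aggregation."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Trimmed copy of the aggregation response shown above
response = {
    "aggregations": {
        "sizes": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": 1, "doc_count": 1},
                {"key": 2, "doc_count": 1},
            ],
        }
    }
}

print(unique_sizes(response))
```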
Following are sample documents in Elasticsearch.
{
"_index": "social",
"_type": "social",
"_id": "1632560884596186633",
"_score": 1,
"_source": {
"created_date": "2017-10-24",
"reach": 1692,
"social_id": 200
}
},
{
"_index": "social",
"_type": "social",
"_id": "1626693964184981799",
"_score": 1,
"_source": {
"created_date": "2017-10-25",
"reach": 1692,
"social_id": 100
}
},
{
"_index": "social",
"_type": "social",
"_id": "162669396418498170",
"_score": 1,
"_source": {
"created_date": "2017-10-25",
"reach": 1692,
"social_id": 50
}
},
{
"_index": "social",
"_type": "social",
"_id": "1626693964184981756",
"_score": 1,
"_source": {
"created_date": "2017-10-25",
"reach": 1692,
"social_id": 25
}
}
Question: How do I get the sum of reach for the top 2 documents, ordered by created date, per social id?
What I have tried:
{
"size": 0,
"aggs": {
"reach_bucket": {
"terms": {
"size": 200,
"field": "social_id"
},
"aggs": {
"media_reach_bucket": {
"terms": {
"field": "created_date",
"size": 200
},
"aggs": {
"top_sales_hits": {
"top_hits": {
"sort": [
{
"created_date": {
"order": "desc"
}
}
],
"_source": {
"includes": [
"created_date",
"reach"
]
},
"size": 2
}
}
}
}
}
}
}
}
Problem:
top_hits does not accept sub-aggregations, so I cannot sum reach over the top hits.
Any suggestions would be appreciated.
You may want to use date_histogram instead of terms when bucketing per day (I assume). But more importantly, you should sort your top_hits by reach, not created_date, since that's going to be the same in your per-day bucket.
{
"size": 0,
"aggs": {
"reach_bucket": {
"terms": {
"size": 200,
"field": "social_id"
},
"aggs": {
"media_reach_bucket": {
"date_histogram": {
"field": "created_date",
"calendar_interval": "day"
},
"aggs": {
"top_sales_hits": {
"top_hits": {
"sort": [
{
"reach": {
"order": "desc"
}
}
],
"_source": {
"includes": [
"reach"
]
},
"size": 2
}
}
}
}
}
}
}
}
yielding top hits like so
"aggregations" : {
"reach_bucket" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 100,
"doc_count" : 4,
"media_reach_bucket" : {
"buckets" : [
{
"key_as_string" : "2017-10-24T00:00:00.000Z",
"key" : 1508803200000,
"doc_count" : 4,
"top_sales_hits" : {
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "kart",
"_type" : "_doc",
"_id" : "3iLJRnEBZbobBB0NiV8R",
"_score" : null,
"_source" : {
"reach" : 40
},
"sort" : [
40
]
},
{
"_index" : "kart",
"_type" : "_doc",
"_id" : "3SLJRnEBZbobBB0Nhl-Y",
"_score" : null,
"_source" : {
"reach" : 30
},
"sort" : [
30
]
}
]
}
}
}
]
}
}
]
}
}
whose reach you can then sum up in your post processing functions.
I'm not familiar with a top-n sum, only with sums of docs above a certain threshold -- in which case I'd use filter aggregations.
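The post-processing step mentioned above could look roughly like this in Python. This is only a sketch that assumes the response has been parsed into a dict; sum_top_reach is a hypothetical helper name, and the sample is a trimmed copy of the aggregation output shown above.

```python
def sum_top_reach(response):
    """Sum the reach of the returned top hits per (social_id, day) bucket."""
    totals = {}
    for social in response["aggregations"]["reach_bucket"]["buckets"]:
        for day in social["media_reach_bucket"]["buckets"]:
            hits = day["top_sales_hits"]["hits"]["hits"]
            key = (social["key"], day["key_as_string"])
            totals[key] = sum(hit["_source"]["reach"] for hit in hits)
    return totals


# Trimmed copy of the aggregation output shown above
response = {
    "aggregations": {
        "reach_bucket": {
            "buckets": [
                {
                    "key": 100,
                    "media_reach_bucket": {
                        "buckets": [
                            {
                                "key_as_string": "2017-10-24T00:00:00.000Z",
                                "top_sales_hits": {
                                    "hits": {
                                        "hits": [
                                            {"_source": {"reach": 40}},
                                            {"_source": {"reach": 30}},
                                        ]
                                    }
                                },
                            }
                        ]
                    },
                }
            ]
        }
    }
}

print(sum_top_reach(response))
```

Because top_hits already limits each bucket to 2 documents, the helper simply adds up whatever hits are returned.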
I have created an Elasticsearch query with function_score and top_hits. This query removes duplicates and returns the top 1 record for each bucket.
GET employeeid/info/_search
{"size": 0,
"query" : {
"function_score" : {
"query" : {
"match" : {
"employeeID" : "23141A"
}
},
"functions" : [{
"linear" : {
"AcquiredDate" : {
"scale" : "90d",
"decay" : 0.5
}
}
}, {
"filter" : {
"match" : {
"name" : "sorna"
}
},
"boost_factor" : 10
}, {
"filter" : {
"match" : {
"name" : "lingam"
}
},
"boost_factor" : 7
}
],
"boost_mode" : "replace"
}
},
"aggs": {
"duplicateCount": {
"terms": {
"field": "employeehash",
"min_doc_count": 1
},
"aggs": {
"duplicateDocuments": {
"top_hits": {
"size":1
}
}
}
}
}
}
I am getting the expected result, but the problem is that I want to sort the results by _score.
Following is my sample output:
{
"key": "567",
"doc_count": 2,
"duplicateDocuments": {
"hits": {
"total": 2,
"max_score": 0.40220365,
"hits": [
{
"_index": "employeeid",
"_type": "info",
"_id": "5",
"_score": 0.40220365,
"_source": {
"name": "John",
"organisation": "google",
"employeeID": "23141A",
"employeehash": "567",
"AcquiredDate": "2016-02-01T07:57:28Z"
}
}
]
}
}
},
{
"key": "102",
"doc_count": 1,
"duplicateDocuments": {
"hits": {
"total": 1,
"max_score": 2.8154256,
"hits": [
{
"_index": "employeeid",
"_type": "info",
"_id": "8",
"_score": 2.8154256,
"_source": {
"name": "lingam",
"organisation": "google",
"employeeID": "23141A",
"employeehash": "102",
"AcquiredDate": "2016-02-01T07:57:28Z"
}
}
]
}
}
}
Question: How do I sort the buckets by _score in descending order?
I have not enabled Groovy, so I cannot use scripts.
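One workaround that needs no scripting, assuming it is acceptable to reorder the buckets on the client after the response arrives, is sketched below in Python. The helper name sort_buckets_by_score is hypothetical, and the sample buckets are trimmed from the output above. Note that this only reorders the buckets the terms aggregation returned; it does not change which buckets are returned.

```python
def sort_buckets_by_score(buckets, agg_name="duplicateDocuments"):
    """Order terms buckets by the max_score of their top_hits, highest first."""
    return sorted(
        buckets,
        key=lambda bucket: bucket[agg_name]["hits"]["max_score"],
        reverse=True,
    )


# Trimmed copy of the two buckets shown above
buckets = [
    {"key": "567", "duplicateDocuments": {"hits": {"max_score": 0.40220365}}},
    {"key": "102", "duplicateDocuments": {"hits": {"max_score": 2.8154256}}},
]

for bucket in sort_buckets_by_score(buckets):
    print(bucket["key"], bucket["duplicateDocuments"]["hits"]["max_score"])
```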
{
"took": 53,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 1.0,
"hits": [{
"_index": "db",
"_type": "users",
"_id": "AVOiyjHmzUObmc5euUGS",
"_score": 1.0,
"_source": {
"user": "james",
"lastvisited": "2016/01/20 02:03:11",
"browser": "chrome",
"offercode": "JB20"
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiyjIQzUObmc5euUGT",
"_score": 1.0,
"_source": {
"user": "james",
"lastvisited": "2016/01/20 03:04:15",
"browser": "firefox",
"offercode": "JB20,JB50"
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiyjIlzUObmc5euUGU",
"_score": 1.0,
"_source": {
"user": "james",
"lastvisited": "2016/01/21 00:15:21",
"browser": "chrome",
"offercode": "JB20,JB50,JB100"
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiyjJKzUObmc5euUGW",
"_score": 1.0,
"_source": {
"user": "peter",
"lastvisited": "2016/01/20 02:32:22",
"browser": "chrome",
"offercode": "JB20,JB50,JB100"
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiy4jhzUObmc5euUGX",
"_score": 1.0,
"_source": {
"user": "james",
"lastvisited": "2016/01/19 02:03:11",
"browser": "chrome",
"offercode": ""
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiyjI2zUObmc5euUGV",
"_score": 1.0,
"_source": {
"user": "adams",
"lastvisited": "2016/01/20 00:12:11",
"browser": "chrome",
"offercode": "JB10"
}
}]
}
}
I want to filter the documents by each user's last visited time, keep only the most recently accessed document of each individual user, and then group those filtered documents by offer code.
I can get the most recently accessed document of a user with a top_hits aggregation, but I am not able to group the results of the top_hits aggregation by offercode.
ES Query to get most recent document of a user
curl -XGET localhost:9200/account/users/_search?pretty -d'{
"size": "0",
"query": {
"bool": {
"must": {
"range": {
"lastvisited": {
"gte": "2016/01/19",
"lte": "2016/01/21"
}
}
}
}
},
"aggs": {
"lastvisited_users": {
"terms": {
"field": "user"
}
,
"aggs": {
"top_user_hits": {
"top_hits": {
"sort": [
{
"lastvisited": {
"order": "desc"
}
}
],
"_source": {
"include": [
"user","offercode","lastvisited"
]
},
"size": 1
}
}
}
}
}}'
ES Output
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 6,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"lastvisited_users" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "james",
"doc_count" : 3,
"top_user_hits" : {
"hits" : {
"total" : 3,
"max_score" : null,
"hits" : [ {
"_index" : "accounts",
"_type" : "users",
"_id" : "AVOtexIEz1WBU8vnnZ2d",
"_score" : null,
"_source" : {
"lastvisited" : "2016/01/20 03:04:15",
"offercode" : "JB20,JB50",
"user" : "james"
},
"sort" : [ 1453259055000 ]
} ]
}
}
}, {
"key" : "adams",
"doc_count" : 1,
"top_user_hits" : {
"hits" : {
"total" : 1,
"max_score" : null,
"hits" : [ {
"_index" : "accounts",
"_type" : "users",
"_id" : "AVOtexJMz1WBU8vnnZ2h",
"_score" : null,
"_source" : {
"lastvisited" : "2016/01/20 00:12:11",
"offercode" : "JB10",
"user" : "adams"
},
"sort" : [ 1453248731000 ]
} ]
}
}
}, {
"key" : "adamsnew",
"doc_count" : 1,
"top_user_hits" : {
"hits" : {
"total" : 1,
"max_score" : null,
"hits" : [ {
"_index" : "accounts",
"_type" : "users",
"_id" : "AVOtexJhz1WBU8vnnZ2i",
"_score" : null,
"_source" : {
"lastvisited" : "2016/01/20 00:12:11",
"offercode" : "JB1010,aka10",
"user" : "adamsnew"
},
"sort" : [ 1453248731000 ]
} ]
}
}
}, {
"key" : "peter",
"doc_count" : 1,
"top_user_hits" : {
"hits" : {
"total" : 1,
"max_score" : null,
"hits" : [ {
"_index" : "accounts",
"_type" : "users",
"_id" : "AVOtexIoz1WBU8vnnZ2f",
"_score" : null,
"_source" : {
"lastvisited" : "2016/01/20 02:32:22",
"offercode" : "JB20,JB50,JB100",
"user" : "peter"
},
"sort" : [ 1453257142000 ]
} ]
}
}
} ]
}
}
}
Now, I want to aggregate the results of the top_hits aggregation.
Expected Output
{
"offercode_grouped": {
"JB20": 1,
"JB10": 1,
"JB20,JB50": 1,
"JB20,JB50,JB100": 2,
"":1
}
}
I tried using a pipeline aggregation, but I don't know how to group by the result of the top_hits aggregation.
I hope that I understand your problem correctly. I think I found a slightly hacky "solution".
It is a combination of a function_score query, a sampler aggregation and a terms aggregation.
Create new index
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow" -d'
{
"mappings": {
"document": {
"properties": {
"user": {
"type": "string",
"index": "not_analyzed"
},
"lastvisited": {
"type": "date",
"format": "YYYY/MM/dd HH:mm:ss"
},
"browser": {
"type": "string",
"index": "not_analyzed"
},
"offercode": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}'
Index documents
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/1?routing=james" -d'
{
"user": "james",
"lastvisited": "2016/01/20 02:03:11",
"browser": "chrome",
"offercode": "JB20"
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/2?routing=james" -d'
{
"user": "james",
"lastvisited": "2016/01/20 03:04:15",
"browser": "firefox",
"offercode": "JB20,JB50"
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/3?routing=james" -d'
{
"user": "james",
"lastvisited": "2016/01/21 00:15:21",
"browser": "chrome",
"offercode": "JB20,JB50,JB100"
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/4?routing=peter" -d'
{
"user": "peter",
"lastvisited": "2016/01/20 02:32:22",
"browser": "chrome",
"offercode": "JB20,JB50,JB100"
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/5?routing=james" -d'
{
"user": "james",
"lastvisited": "2016/01/19 02:03:11",
"browser": "chrome",
"offercode": ""
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/6?routing=adams" -d'
{
"user": "adams",
"lastvisited": "2016/01/20 00:12:11",
"browser": "chrome",
"offercode": "JB10"
}'
Get aggregations
curl -XPOST "http://127.0.0.1:9200/stackoverflow/_search" -d'
{
"query": {
"function_score": {
"boost_mode": "replace", // we need to replace document score with the result of the functions
"query": {
"bool": {
"filter": [
{
"range": { // get documents within the date range
"lastvisited": {
"gte": "2016/01/19 00:00:00",
"lte": "2016/01/21 23:59:59"
}
}
}
]
}
},
"functions": [
{
"linear": {
"lastvisited": {
"origin": "2016/01/21 23:59:59", // same as lastvisited lte filter
"scale": "2d" // set the scale - please, see elasticsearch docs for more info https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-function-score-query.html#function-decay
}
}
}
]
}
},
"aggs": {
"user": {
"sampler": { // get top scored document per user
"field": "user",
"max_docs_per_value": 1
},
"aggs": {
"offers": { // aggregate user documents per `offercode`
"terms": {
"field": "offercode"
}
}
}
}
},
"size": 0
}'
Response
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 0,
"hits": []
},
"aggregations": {
"user": {
"doc_count": 3,
"offers": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "JB20,JB50,JB100",
"doc_count": 2
},
{
"key": "JB10",
"doc_count": 1
}
]
}
}
}
}
Unless you have only one shard per index, you need to specify routing when indexing data, because the sampler aggregation is calculated per shard. We need to ensure that all data of a particular user ends up in the same shard in order to get the one document with the highest score per user.
The sampler aggregation returns documents by score, which is why we need to modify the score of the documents. This is where the function_score query can help. With field_value_factor, the score would be just the timestamp of the last visit, so the more recent the visit, the higher the score.
UPDATE: With field_value_factor there is probably a problem with _score accuracy. For more info see issue https://github.com/elastic/elasticsearch/issues/11872. That is why a decay function is used instead, as clintongormley suggested in the issue. The decay function works on both sides of the origin, which means that documents 1 day older and 1 day newer than the origin receive the same _score. That's why we need to filter out newer documents (see the range filter in the query).
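To make the symmetry of the decay function concrete, here is a small Python sketch of the linear decay formula as I understand it from the function_score documentation (s = scale / (1 - decay), score = max(0, (s - distance) / s)). The function linear_decay is a hypothetical re-implementation for illustration, not Elasticsearch code; decay defaults to 0.5 and offset to 0, matching the query above.

```python
from datetime import datetime, timedelta


def linear_decay(value, origin, scale, decay=0.5, offset=timedelta(0)):
    """Approximate the linear decay score for a date value around an origin."""
    s = scale / (1.0 - decay)  # distance at which the score reaches 0
    distance = max(timedelta(0), abs(value - origin) - offset)
    return max(0.0, (s - distance) / s)


origin = datetime(2016, 1, 21, 23, 59, 59)  # same origin as in the query
scale = timedelta(days=2)                   # "scale": "2d"

older = linear_decay(origin - timedelta(days=1), origin, scale)
newer = linear_decay(origin + timedelta(days=1), origin, scale)

# Both sides of the origin get the same score, hence the range filter
print(older, newer)
```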
NOTE: I tried this query only with the data shown in the example, so a bigger dataset is needed to test it properly. But I think it should work...
Check this solution: it's more limited, but it is suitable for production: https://stackoverflow.com/a/39788948/4769188
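If doing the grouping outside Elasticsearch is acceptable, the expected output can also be built from the top_hits response of the original query. The following Python sketch assumes that response has been parsed into a dict; group_by_offercode is a hypothetical helper name, and the sample is trimmed from the ES output in the question.

```python
from collections import Counter


def group_by_offercode(response):
    """Count the most recent document of each user per offercode."""
    counts = Counter()
    for bucket in response["aggregations"]["lastvisited_users"]["buckets"]:
        for hit in bucket["top_user_hits"]["hits"]["hits"]:
            counts[hit["_source"]["offercode"]] += 1
    return dict(counts)


def user_bucket(user, offercode):
    """Build a trimmed terms bucket holding one top hit (sample data only)."""
    source = {"user": user, "offercode": offercode}
    return {"key": user, "top_user_hits": {"hits": {"hits": [{"_source": source}]}}}


# Trimmed from the ES output in the question
response = {
    "aggregations": {
        "lastvisited_users": {
            "buckets": [
                user_bucket("james", "JB20,JB50"),
                user_bucket("adams", "JB10"),
                user_bucket("peter", "JB20,JB50,JB100"),
            ]
        }
    }
}

print({"offercode_grouped": group_by_offercode(response)})
```

This only sees the users returned by the terms aggregation, so the size of that aggregation would need to cover all users of interest.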
This may solve your problem (expressed in SQL):
SELECT offercode, COUNT(offercode)
FROM users AS u1
WHERE u1.ID = (SELECT u2.ID FROM users AS u2 WHERE u2.user = u1.user ORDER BY u2.lastvisited DESC LIMIT 1)
AND u1.lastvisited >= '2016/01/20'
GROUP BY offercode;