I have an issue sorting a field in Elasticsearch that contains both text and numbers.
My pattern is something like this: "text/number/year/text".
I'm trying to sort by this, but I get something like this:
"hits": [
{
"_source": {
"fields": {
"numbered": "text/1/year/text",
"numbered-number": "1"
}
},
"sort": [
"1"
]
},
{
"_source": {
"fields": {
"numbered": "text/10/year/text",
"numbered-number": "10"
}
},
"sort": [
"10"
]
},
{
"_source": {
"fields": {
"numbered": "text/11/year/text",
"numbered-number": "11"
}
},
"sort": [
"11"
]
},
...
{
"_source": {
"fields": {
"numbered": "text/19/year/text",
"numbered-number": "19"
}
},
"sort": [
"19"
]
},
{
"_source": {
"fields": {
"numbered": "text/2/year/text",
"numbered-number": "2"
}
},
"sort": [
"2"
]
},
Elasticsearch is sorting these as 1, 10, 11, 12, ..., 19, 2, 20, 21... How can I resolve this issue? I just need a natural sort.
UPDATED:
I tried this script, but it doesn't work either.
POST myindex/_search
{
"from": 0,
"size": 40,
"sort": [
{
"_script": {
"type": "string",
"script": {
"inline":
"if ('fields.myfield.sort' =~ /\\d+/) { return Integer.parseInt(doc['fields.myfield.sort'].value); }"
},
"order" : "asc"
}
}
],
"_source": { "include": ["fields.myfield"] }
}
Error
"reason": {
"type": "null_pointer_exception",
"reason": null
You could use a scripted sort to sort by integer value rather than lexicographically (which is how text sorts):
var searchResponse = client.Search<MyDocument>(s => s
.Sort(so => so
.Script(ss => ss
.Type("number") // a script sort needs an explicit type; "number" gives numeric ordering
.Script(sc => sc
.Inline("Integer.parseInt(doc['numbered-number'].value)")
)
)
)
);
The better way however would be to explicitly map numbered-number field as an integer. In doing so, you'll be able to sort on the field as expected.
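To see why the string sort produces 1, 10, 11, ..., 19, 2, here is a small sketch in plain Python (not Elasticsearch) that contrasts lexicographic and numeric ordering, and builds a natural-sort key for values shaped like "text/number/year/text". The helper name natural_key and the sample lists are only for illustration.

```python
import re


def natural_key(value: str):
    """Split a string into digit and non-digit runs so digit runs compare as integers.

    Assumes all values share the same text/number layout, so that integers are
    only ever compared with integers and strings with strings.
    """
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", value)]


# Values as they appear in "numbered-number" (stored as strings).
numbers = ["2", "19", "1", "11", "10"]

# Values as they appear in "numbered".
paths = [
    "text/2/year/text",
    "text/19/year/text",
    "text/1/year/text",
    "text/11/year/text",
    "text/10/year/text",
]

lexicographic = sorted(numbers)        # compares character by character
numeric = sorted(numbers, key=int)     # compares parsed integers
natural = sorted(paths, key=natural_key)

print(lexicographic)
print(numeric)
print(natural)
```

Mapping the field as an integer gives you the second ordering directly, without any script running at query time.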
index_name: my_data-2020-12-01
ticket_number: T123
ticket_status: OPEN
ticket_updated_time: 2020-12-01 12:22:12
index_name: my_data-2020-12-01
ticket_number: T124
ticket_status: OPEN
ticket_updated_time: 2020-12-01 12:32:11
index_name: my_data-2020-12-02
ticket_number: T123
ticket_status: INPROGRESS
ticket_updated_time: 2020-12-02 12:33:12
index_name: my_data-2020-12-02
ticket_number: T125
ticket_status: OPEN
ticket_updated_time: 2020-12-02 14:11:45
I want to create a saved search that groups by the ticket_number field and returns one document per ticket with its latest status (ticket_status). Is it possible?
You can do this with a simple query (I am assuming you are using Kibana for visualization). In your query, you need to filter on ticket_number and sort on ticket_updated_time in descending order.
Working example
Index mapping
{
"mappings": {
"properties": {
"ticket_updated_time": {
"type": "date"
},
"ticket_number" :{
"type" : "text"
},
"ticket_status" : {
"type" : "text"
}
}
}
}
Index sample docs
{
"ticket_number": "T123",
"ticket_status": "OPEN",
"ticket_updated_time": "2020-12-01T12:22:12"
}
{
"ticket_number": "T123",
"ticket_status": "INPROGRESS",
"ticket_updated_time": "2020-12-02T12:33:12"
}
Now as you can see, both the sample documents belong to the same ticket_number with different status and updated time.
Search query
{
"size" : 1, // fetch only the latest document; without this you also get the older documents (with earlier statuses) for this ticket
"query": {
"bool": {
"filter": [
{
"match": {
"ticket_number": "T123"
}
}
]
}
},
"sort": [
{
"ticket_updated_time": {
"order": "desc"
}
}
]
}
And search result
"hits": [
{
"_index": "65180491",
"_type": "_doc",
"_id": "2",
"_score": null,
"_source": {
"ticket_number": "T123",
"ticket_status": "INPROGRESS",
"ticket_updated_time": "2020-12-02T12:33:12"
},
"sort": [
1606912392000
]
}
]
If you need to group by the ticket_number field, you can use an aggregation as well:
Index Mapping:
{
"mappings": {
"properties": {
"ticket_updated_time": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
}
}
}
}
Search Query:
{
"size": 0,
"aggs": {
"unique_id": {
"terms": {
"field": "ticket_number.keyword",
"order": {
"latestOrder": "desc"
}
},
"aggs": {
"latestOrder": {
"max": {
"field": "ticket_updated_time"
}
}
}
}
}
}
Search Result:
"buckets": [
{
"key": "T125",
"doc_count": 1,
"latestOrder": {
"value": 1.606918305E12,
"value_as_string": "2020-12-02 14:11:45"
}
},
{
"key": "T123",
"doc_count": 2,
"latestOrder": {
"value": 1.606912392E12,
"value_as_string": "2020-12-02 12:33:12"
}
},
{
"key": "T124",
"doc_count": 1,
"latestOrder": {
"value": 1.606825931E12,
"value_as_string": "2020-12-01 12:32:11"
}
}
]
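The buckets above carry only the latest timestamp per ticket. To make the "latest status per ticket" logic explicit, here is a plain-Python sketch of the same grouping over the four sample documents from the question. It is not an Elasticsearch call, and the function name latest_per_ticket is made up for illustration.

```python
from datetime import datetime

docs = [
    {"ticket_number": "T123", "ticket_status": "OPEN", "ticket_updated_time": "2020-12-01 12:22:12"},
    {"ticket_number": "T124", "ticket_status": "OPEN", "ticket_updated_time": "2020-12-01 12:32:11"},
    {"ticket_number": "T123", "ticket_status": "INPROGRESS", "ticket_updated_time": "2020-12-02 12:33:12"},
    {"ticket_number": "T125", "ticket_status": "OPEN", "ticket_updated_time": "2020-12-02 14:11:45"},
]


def latest_per_ticket(documents):
    """Keep the most recently updated document for each ticket_number."""
    latest = {}
    for doc in documents:
        updated = datetime.strptime(doc["ticket_updated_time"], "%Y-%m-%d %H:%M:%S")
        current = latest.get(doc["ticket_number"])
        if current is None or updated > current[0]:
            latest[doc["ticket_number"]] = (updated, doc)
    # Order tickets by their latest update, newest first (like "order": {"latestOrder": "desc"}).
    ordered = sorted(latest.values(), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ordered]


result = latest_per_ticket(docs)
for doc in result:
    print(doc["ticket_number"], doc["ticket_status"], doc["ticket_updated_time"])
```

The ticket order matches the bucket order in the search result above; the difference is that each entry here also carries the status of the latest document.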
I want to aggregate the data on one field and also get the aggregated data sorted by name.
My data is:
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp001_local000000000000001",
"_score": 10.0,
"_source": {
"name": [
"Person 01"
],
"groupbyid": [
"group0001"
],
"ranking": [
"2.0"
]
}
},
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp002_local000000000000001",
"_score": 85146.375,
"_source": {
"name": [
"Person 02"
],
"groupbyid": [
"group0001"
],
"ranking": [
"10.0"
]
}
},
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp003_local000000000000001",
"_score": 20.0,
"_source": {
"name": [
"Person 03"
],
"groupbyid": [
"group0002"
],
"ranking": [
"-1.0"
]
}
},
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp004_local000000000000001",
"_score": 5.0,
"_source": {
"name": [
"Person 04"
],
"groupbyid": [
"group0002"
],
"ranking": [
"2.0"
]
}
}
My query :
{
"size": 0,
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "name:emp*^1000.0"
}
}
]
}
},
"aggs": {
"groupbyid": {
"terms": {
"field": "groupbyid.raw",
"order": {
"top_hit_agg": "desc"
},
"size": 10
},
"aggs": {
"top_hit_agg": {
"terms": {
"field": "name"
}
}
}
}
}
}
My mapping is :
{
"name": {
"type": "text",
"fielddata": true,
"fields": {
"lower_case_sort": {
"type": "text",
"fielddata": true,
"analyzer": "case_insensitive_sort"
}
}
},
"groupbyid": {
"type": "text",
"fielddata": true,
"index": "analyzed",
"fields": {
"raw": {
"type": "keyword",
"index": "not_analyzed"
}
}
}
}
I am getting data based on the average relevance of the grouped records. What I want is to first group the records by groupbyid and then, within each bucket, sort the data by the name field.
In other words, I want to group on one field and then sort each grouped bucket on another field. This is sample data.
There are other fields such as created_on and updated_on; I also want to be able to sort on those, and to get the groups in alphabetical order.
I want to sort on a non-numeric (string) field. I can already do it for numeric fields.
I can do it for the ranking field but not for the name field, which gives the error below.
Expected numeric type on field [name], but got [text];
You're asking for a few things, so I'll try to answer them in turn.
Step 1: Sorting buckets by relevance
"I am getting data based on the average of the relevance of grouped records."
If this is what you're attempting to do, it's not what the aggregation you wrote is doing. Terms aggregations default to sorting the buckets by the number of documents in each bucket, descending. To sort the groups by "average relevance" (which I'll interpret as "average _score of documents in the group"), you'd need to add a sub-aggregation on the score and sort the terms aggregation by that:
"aggregations": {
"most_relevant_groups": {
"terms": {
"field": "groupbyid.raw",
"order": {
"average_score": "desc"
}
},
"aggs": {
"average_score": {
"avg": {
"script": {
"inline": "_score",
"lang": "painless"
}
}
}
}
}
}
Step 2: Sorting employees by name
"Now, what I wanted is the first club the records based on the groupid and then in each bucket sort the data based on the name field."
To sort the documents within each bucket, you can use a top_hits aggregation:
"aggregations": {
"most_relevant_groups": {
"terms": {
"field": "groupbyid.raw",
"order": {
"average_score": "desc"
}
},
"aggs": {
"employees": {
"top_hits": {
"size": 10, // top_hits returns 3 hits per bucket by default - change to whatever you need
"sort": [
{
"name.lower_case_sort": {
"order": "asc"
}
}
]
}
}
}
}
}
Step 3: Putting it all together
Putting both of the above together, the following aggregation should suit your needs (note that I used a function_score query to simulate "relevance" based on ranking; your query can be anything that produces the relevance you need):
POST /testing-aggregation/employee/_search
{
"size": 0,
"query": {
"function_score": {
"functions": [
{
"field_value_factor": {
"field": "ranking"
}
}
]
}
},
"aggs": {
"groupbyid": {
"terms": {
"field": "groupbyid.raw",
"size": 10,
"order": {
"average_score": "desc"
}
},
"aggs": {
"average_score": {
"avg": {
"script": {
"inline": "_score",
"lang": "painless"
}
}
},
"employees": {
"top_hits": {
"size": 10,
"sort": [
{
"name.lower_case_sort": {
"order": "asc"
}
}
]
}
}
}
}
}
}
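If it helps to see the intended result shape, here is a plain-Python sketch of what the combined aggregation computes: group by groupbyid, order the groups by average _score, and sort the members of each group by lowercased name. It uses the names, groups and _score values from the sample data in the question; the function name group_and_sort is hypothetical and nothing here is Elasticsearch API.

```python
from collections import defaultdict
from statistics import mean

hits = [
    {"name": "Person 02", "groupbyid": "group0001", "_score": 85146.375},
    {"name": "Person 01", "groupbyid": "group0001", "_score": 10.0},
    {"name": "Person 04", "groupbyid": "group0002", "_score": 5.0},
    {"name": "Person 03", "groupbyid": "group0002", "_score": 20.0},
]


def group_and_sort(hits, top_n=10):
    """Group hits by groupbyid, order groups by average score, sort members by name."""
    buckets = defaultdict(list)
    for hit in hits:
        buckets[hit["groupbyid"]].append(hit)

    result = []
    for key, members in buckets.items():
        # Case-insensitive name sort, like sorting on name.lower_case_sort.
        by_name = sorted(members, key=lambda h: h["name"].lower())
        result.append({
            "key": key,
            "average_score": mean(h["_score"] for h in members),
            "employees": [h["name"] for h in by_name[:top_n]],
        })
    # Highest average score first, like "order": {"average_score": "desc"}.
    result.sort(key=lambda bucket: bucket["average_score"], reverse=True)
    return result


groups = group_and_sort(hits)
for bucket in groups:
    print(bucket["key"], bucket["average_score"], bucket["employees"])
```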
I am new to Elasticsearch and was exploring aggregation queries. The documents I have are in this format -
{"name":"A",
"class":"10th",
"subjects":{
"S1":92,
"S2":92,
"S3":92
}
}
We have about 40k such documents in our ES index, with the subjects varying from student to student. A typical query to the system is to aggregate all subject-wise scores for a given class. We tried to create a bucket aggregation query as explained in this guide here; however, this generates a single bucket per document and, in our understanding, requires an explicit mention of every subject.
We want the system to generate subject-wise aggregates by executing a single aggregation query. The problem is that in our data the subjects can vary from student to student and we don't have a global list of subject keys.
We wrote the following query, but it only works if we know all possible subjects.
GET student_data_v1_1/_search
{ "query" :
{"match" :
{ "class" : "' + query + '" }},
"aggs" : { "my_buckets" : { "terms" :
{ "field" : "subjects", "size":10000 },
"aggregations": {"the_avg":
{"avg": { "field": "subjects.value" }}} }},
"size" : 0 }'
This query only works for the following document structure, and does not work when multiple subjects are defined whose keys we may not know -
{"name":"A",
"class":"10th",
"subjects":{
"value":93
}
}
An alternate form of the document has the subjects as a list of dictionaries -
{"name":"A",
"class":"10th",
"subjects":[
{"S1":92},
{"S2":92},
{"S3":92}
]
}
Having an aggregation query to solve either of the 2 document formats would be helpful.
======EDITS======
After updating the document to hold weights for each subject -
{
"class": "10th",
"subject": [
{
"name": "s1",
"marks": 90,
"weight":30
},
{
"name": "s2",
"marks": 80,
"weight":70
}
]}
I have updated the query to be -
{
"query": {
"match": {
"class": "10th"
}
},
"aggs": {
"subjects": {
"nested": {
"path": "scores"
},
"aggs": {
"subjects": {
"terms": {
"field": "subject.name"
},
"aggs" : { "weighted_grade": { "weighted_avg": { "value": { "field": "subjects.score" }, "weight": { "field": "subjects.weight" } } } }
}
}
}
}
},
"size": 0
}
but it throws the error-
{u'error': {u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'root_cause': [{u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'type': u'unknown_named_object_exception'}],
u'type': u'unknown_named_object_exception'},
u'status': 400}
To achieve the required result, I would suggest keeping your index mapping as follows:
{
"properties": {
"class": {
"type": "keyword"
},
"subject": {
"type": "nested",
"properties": {
"marks": {
"type": "integer"
},
"name": {
"type": "keyword"
}
}
}
}
}
In the mapping above I have created subject as a nested type with two properties: name to hold the subject name and marks to hold the marks in that subject.
Sample doc:
{
"class": "10th",
"subject": [
{
"name": "s1",
"marks": 90
},
{
"name": "s2",
"marks": 80
}
]
}
Now you can use a nested aggregation together with multilevel aggregation (i.e. an aggregation inside an aggregation). I used a nested aggregation with a terms aggregation on subject.name to get buckets for all the available subjects. Then, to get the average for each subject, we add an avg sub-aggregation to the subjects aggregation, as below:
{
"query": {
"match": {
"class": "10th"
}
},
"aggs": {
"subjects": {
"nested": {
"path": "subject"
},
"aggs": {
"subjects": {
"terms": {
"field": "subject.name"
},
"aggs": {
"avg_score": {
"avg": {
"field": "subject.marks"
}
}
}
}
}
}
},
"size": 0
}
NOTE: I have added "size" : 0 so that Elasticsearch doesn't return matching docs in the result. Whether to include it depends entirely on your use case.
Sample result:
{
"took": 25,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"subjects": {
"doc_count": 6,
"subjects": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "s1",
"doc_count": 3,
"avg_score": {
"value": 80
}
},
{
"key": "s2",
"doc_count": 2,
"avg_score": {
"value": 75
}
},
{
"key": "s3",
"doc_count": 1,
"avg_score": {
"value": 80
}
}
]
}
}
}
}
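As you can see, the result contains buckets with the subject name as key and avg_score.value as the average of the marks.
UPDATE to include weighted_avg (this aggregation requires Elasticsearch 6.4 or later; an "Unknown BaseAggregationBuilder [weighted_avg]" error suggests the cluster is running an older version):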
{
"query": {
"match": {
"class": "10th"
}
},
"aggs": {
"subjects": {
"nested": {
"path": "subject"
},
"aggs": {
"subjects": {
"terms": {
"field": "subject.name"
},
"aggs": {
"avg_score": {
"avg": {
"field": "subject.marks"
}
},
"weighted_grade": {
"weighted_avg": {
"value": {
"field": "subject.marks"
},
"weight": {
"field": "subject.weight"
}
}
}
}
}
}
}
},
"size": 0
}
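As a cross-check on what the two sub-aggregations compute per subject bucket, here is a plain-Python sketch: avg is the sum of marks divided by the number of entries, and weighted_avg is sum(marks * weight) / sum(weight). The first document is the sample from the question; the second document and the function name subject_averages are assumptions added only so that one subject has more than one entry.

```python
from collections import defaultdict

docs = [
    {"class": "10th", "subject": [
        {"name": "s1", "marks": 90, "weight": 30},
        {"name": "s2", "marks": 80, "weight": 70},
    ]},
    # Hypothetical second student, not from the original post.
    {"class": "10th", "subject": [
        {"name": "s1", "marks": 70, "weight": 10},
    ]},
]


def subject_averages(documents, class_name):
    """Per-subject plain and weighted averages for one class."""
    entries = defaultdict(list)  # subject name -> list of (marks, weight)
    for doc in documents:
        if doc["class"] != class_name:
            continue
        for entry in doc["subject"]:
            entries[entry["name"]].append((entry["marks"], entry["weight"]))

    result = {}
    for name, pairs in entries.items():
        total_weight = sum(weight for _, weight in pairs)
        result[name] = {
            "avg_score": sum(marks for marks, _ in pairs) / len(pairs),
            "weighted_grade": sum(marks * weight for marks, weight in pairs) / total_weight,
        }
    return result


averages = subject_averages(docs, "10th")
print(averages)
```

Note that no list of subject names is needed up front: the keys come from the data, which is the same reason the terms aggregation on subject.name works without a global subject list.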
Following is my doc structure
'Order': {
u'properties': {
u'order_id': {u'type': u'integer'},
'Product': {
u'properties': {
u'product_id': {u'type': u'integer'},
u'product_category': {'type': 'text'},
},
u'type': u'nested'
}
}
}
Doc1
"Order": {
"order_id": "1",
"Product": [
{
"product_id": "1",
"product_category": "category_1"
},
{
"product_id": "2",
"product_category": "category_2"
},
{
"product_id": "3",
"product_category": "category_2"
}
]
}
Doc2
"Order": {
"order_id": "2",
"Product": [
{
"product_id": "4",
"product_category": "category_1"
},
{
"product_id": "1",
"product_category": "category_1"
},
{
"product_id": "2",
"product_category": "category_2"
}
]
}
I want to get the following output:
"aggregations": {
"Order": [
{
"order_id": "1",
"category_counts": [
{
"category_1": 1
},
{
"category_2": 2
}
]
},
{
"order_id": "2",
"category_counts": [
{
"category_1": 2
},
{
"category_2": 1
}
]
}
]
}
I tried using a nested aggregation:
"aggs": {
"Product-nested": {
"nested": {
"path": "Product"
},
"aggs": {
"category_counts": {
"terms": {
"field": "Product.product_category"
}
}
}
}
}
It does not give output for each order, but gives combined output for all orders:
{
"Product-nested": {
"category_counts": [
"category_1": 3,
"category_2": 3
]
}
}
I have two questions:
1. How do I get the desired output in the above scenario?
2. What if, instead of a single product_category, I have an array of product_categories? How would we achieve the same in that scenario?
I am using Elasticsearch >= 5.0.
I have an idea, but I don't think it's the best one.
You can make a terms aggregation on the "order_id" field, then a nested sub-aggregation on "Product.product_category".
Something like this:
{
"aggs": {
"all-order-id": {
"terms": {
"field": "order_id",
"size": 10
},
"aggs": {
"Product-nested": {
"nested": {
"path": "Product"
},
"aggs": {
"all-products-in-order-id": {
"terms": {
"field": "Product.product_category"
}
}
}
}
}
}
}
}
Sorry if it looks a bit messy; I'm not so good with this answer editor.
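For reference, this is the per-order result the terms-then-nested aggregation is meant to produce, written as a plain-Python sketch over the two sample orders (not an Elasticsearch call; the function name category_counts_per_order is made up). It also accepts a list of categories per product, which covers the second question under the assumption that such a field would simply hold several values.

```python
from collections import Counter

orders = [
    {"order_id": "1", "Product": [
        {"product_id": "1", "product_category": "category_1"},
        {"product_id": "2", "product_category": "category_2"},
        {"product_id": "3", "product_category": "category_2"},
    ]},
    {"order_id": "2", "Product": [
        {"product_id": "4", "product_category": "category_1"},
        {"product_id": "1", "product_category": "category_1"},
        {"product_id": "2", "product_category": "category_2"},
    ]},
]


def category_counts_per_order(orders):
    """Count products per category, separately for each order."""
    result = {}
    for order in orders:
        counts = Counter()
        for product in order["Product"]:
            categories = product["product_category"]
            # Accept a single category or a list of categories.
            if isinstance(categories, str):
                categories = [categories]
            counts.update(categories)
        result[order["order_id"]] = dict(counts)
    return result


per_order = category_counts_per_order(orders)
print(per_order)
```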
I have a shop which uses Elasticsearch 2.4 for faceted search.
At the moment the available filters (product attributes) are taken from MySQL. I want to get them from Elasticsearch aggregations instead.
But I have a problem: I do not need to aggregate all the attributes.
What I have:
Part of Mapping:
...
'is_active' => [
'type' => 'long',
'index' => 'not_analyzed',
],
'category_id' => [
'type' => 'long',
'index' => 'not_analyzed',
],
'attrs' => [
'properties' => [
'attr_name' => ['type' => 'string', 'index' => 'not_analyzed'],
'value' => [
'type' => 'string',
'index' => 'analyzed',
'analyzer' => 'attrs_analizer',
],
]
],
...
Example of data:
{
"id": 1,
"is_active": "1",
"category_id": 189,
...
"price": "48.00",
"attrs": [
{
"attr_name": "Brand",
"value": "TP-Link"
},
{
"attr_name": "Model",
"value": "TL-1"
},
{
"attr_name": "Other",
"value": "<div>Some text of 'Other' property<br><img src......><ul><li>......</ul></div>"
}
]
},
{
"id": 2,
"is_active": "1",
"category_id": 242,
...
"price": "12.00",
"attrs": [
{
"attr_name": "Brand",
"value": "Lenovo"
},
{
"attr_name": "Model",
"value": "B570"
},
{
"attr_name": "OS",
"value": "Linux"
},
{
"attr_name": "Other",
"value": "<div>Some text of 'Other' property<br><img src......><ul><li>......</ul></div>"
}
]
},
{
"id": 3,
"is_active": "1",
"category_id": 242,
...
"price": "24.00",
"attrs": [
{
"attr_name": "Brand",
"value": "Asus"
},
{
"attr_name": "Model",
"value": "QZ85"
},
{
"attr_name": "OS",
"value": "Windows"
},
{
"attr_name": "Other",
"value": "<div>Some text of 'Other' property<br><img src......><ul><li>......</ul></div>"
}
]
}
Attributes such as "Model" and "Other" are not used when filtering products; they are only displayed on the product page. For the other attributes (Brand, OS, and others ...) I want to receive aggregations.
When I try to aggregate the attrs.value field, of course I get aggregations for all data (including the large "Other" fields, in which there can be a lot of HTML).
"aggs": {
"facet_value": {
"terms": {
"field": "attrs.value",
"size": 0
}
}
}
How can I exclude "attrs.attr_name": ["Model", "Other"]?
Changing the mapping is a bad solution for me, but if it is inevitable, please tell me how to do it. I guess I'll need to make "attrs" nested?
UPD:
I want to receive:
1. All the attributes that the products in a certain category have, except for those that I indicate in my system's settings (in this example I will exclude "Model" and "Other").
2. The number of products next to each value.
It should look like this:
For category "Laptops":
Brand:
Lenovo (18)
Asus (19)
.....
OS:
Windows (19)
Linux (5)
...
For "computer monitors":
Brand:
Samsung (18)
LG (19)
.....
Resolution:
1360x768 (19)
1920x1080 (22)
....
This is a terms aggregation; I use it for the number of products in each category. I tried it for attrs.value, but I do not know how to exclude the "attrs.value" entries that belong to "attrs.attr_name": "Model" and "attrs.attr_name": "Other".
UPD2:
In my case, if I map attrs as a nested type, the size of the index increases by 30%,
from 2700Mi to 3510Mi.
If there is no other option, I'll have to put up with it.
You first have to map attrs as a nested type and then use nested aggregations.
PUT no_play
{
"mappings": {
"document_type" : {
"properties": {
"is_active" : {
"type": "long"
},
"category_id" : {
"type": "long"
},
"attrs" : {
"type": "nested",
"properties": {
"attr_name" : {
"type" : "keyword"
},
"value" : {
"type" : "keyword"
}
}
}
}
}
}
}
POST no_play/document_type
{
"id": 3,
"is_active": "1",
"category_id": 242,
"price": "24.00",
"attrs": [
{
"attr_name": "Brand",
"value": "Asus"
},
{
"attr_name": "Model",
"value": "QZ85"
},
{
"attr_name": "OS",
"value": "<div>Some text of 'Other' property<br><img src......><ul><li>......</ul></div>"
},
{
"attr_name": "Other",
"value": "<div>Some text of 'Other' property<br><img src......><ul><li>......</ul></div>"
}
]
}
Since you didn't mention how you want to aggregate, here are two cases.
Case 1) If you want to count the attrs individually. This metric gives you the count of term occurrences.
POST no_play/_search
{
"size": 0,
"aggs": {
"nested_aggregation_value": {
"nested": {
"path": "attrs"
},
"aggs": {
"value_term": {
"terms": {
"field": "attrs.value",
"size": 10
}
}
}
}
}
}
Case 2) If you want to count root documents per value instead, add a reverse_nested sub-aggregation:
POST no_play/_search
{
"size": 0,
"aggs": {
"nested_aggregation_value": {
"nested": {
"path": "attrs"
},
"aggs": {
"value_term": {
"terms": {
"field": "attrs.value",
"size": 10
},
"aggs": {
"reverse_back_to_roots": {
"reverse_nested": {
}
}
}
}
}
}
}
}
To get the count of root documents for each attrs value, you need to hook in a reverse_nested aggregation, which moves the aggregator back up to the level of the root document (as in the second query above).
Consider the following document.
{
"id": 3,
"is_active": "1",
"category_id": 242,
"price": "24.00",
"attrs": [
{
"attr_name": "Brand",
"value": "Asus"
},
{
"attr_name": "Model",
"value": "QZ85"
},
{
"attr_name": "OS",
"value": "repeated value"
},
{
"attr_name": "Other",
"value": "repeated value"
}
]
}
For the first query the count for 'repeated value' will be 2, and for the second query it will be 1.
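The difference between the two counts can be reproduced in a few lines of plain Python. This is only a sketch of the counting semantics (nested occurrences versus distinct root documents), not of how Elasticsearch implements it; the function names are made up.

```python
from collections import Counter

documents = [
    {"id": 3, "attrs": [
        {"attr_name": "Brand", "value": "Asus"},
        {"attr_name": "Model", "value": "QZ85"},
        {"attr_name": "OS", "value": "repeated value"},
        {"attr_name": "Other", "value": "repeated value"},
    ]},
]


def count_occurrences(documents):
    """Count every nested attrs entry (what the plain nested terms aggregation sees)."""
    return Counter(attr["value"] for doc in documents for attr in doc["attrs"])


def count_root_documents(documents):
    """Count each value at most once per root document (the reverse_nested view)."""
    counts = Counter()
    for doc in documents:
        counts.update({attr["value"] for attr in doc["attrs"]})
    return counts


occurrences = count_occurrences(documents)
root_docs = count_root_documents(documents)
print(occurrences["repeated value"], root_docs["repeated value"])
```

For facet counts shown next to a filter value ("number of products"), the per-root-document count is usually the one you want.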
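Note:
Here is how you can filter to exclude attributes by name (the example excludes "Model" and "Brand"; substitute the names you want to skip, such as "Model" and "Other"). The first query counts term occurrences, the second counts root documents: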
POST no_play/_search
{
"size": 0,
"aggs": {
"nested_aggregation_value": {
"nested": {
"path": "attrs"
},
"aggs": {
"filtered_results": {
"filter": {
"bool": {
"must_not": [{
"terms": {
"attrs.attr_name": ["Model", "Brand"]
}
}]
}
},
"aggs": {
"value_term": {
"terms": {
"field": "attrs.value",
"size": 10
}
}
}
}
}
}
}
}
POST no_play/_search
{
"size": 0,
"aggs": {
"nested_aggregation_value": {
"nested": {
"path": "attrs"
},
"aggs": {
"filtered_results": {
"filter": {
"bool": {
"must_not": [{
"terms": {
"attrs.attr_name": ["Model", "Brand"]
}
}]
}
},
"aggs": {
"value_term": {
"terms": {
"field": "attrs.value",
"size": 10
},
"aggs": {
"reverse_back_to_roots": {
"reverse_nested": {}
}
}
}
}
}
}
}
}
}
Thanks
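To show the facet shape asked for in the question (values grouped under each attribute name, with product counts, minus the excluded attributes), here is a plain-Python sketch over the three sample products. It is not Elasticsearch code; the function name facets is hypothetical and the long "Other" HTML values are shortened to a placeholder. In aggregation terms it roughly corresponds to filtering out the excluded attr_name values and then bucketing by attrs.attr_name before attrs.value.

```python
from collections import defaultdict

products = [
    {"id": 1, "category_id": 189, "attrs": [
        {"attr_name": "Brand", "value": "TP-Link"},
        {"attr_name": "Model", "value": "TL-1"},
        {"attr_name": "Other", "value": "<div>...</div>"},  # shortened placeholder
    ]},
    {"id": 2, "category_id": 242, "attrs": [
        {"attr_name": "Brand", "value": "Lenovo"},
        {"attr_name": "Model", "value": "B570"},
        {"attr_name": "OS", "value": "Linux"},
        {"attr_name": "Other", "value": "<div>...</div>"},
    ]},
    {"id": 3, "category_id": 242, "attrs": [
        {"attr_name": "Brand", "value": "Asus"},
        {"attr_name": "Model", "value": "QZ85"},
        {"attr_name": "OS", "value": "Windows"},
        {"attr_name": "Other", "value": "<div>...</div>"},
    ]},
]


def facets(products, category_id, excluded=("Model", "Other")):
    """Return {attr_name: {value: number_of_products}} for one category."""
    seen = defaultdict(lambda: defaultdict(set))  # attr_name -> value -> product ids
    for product in products:
        if product["category_id"] != category_id:
            continue
        for attr in product["attrs"]:
            if attr["attr_name"] in excluded:
                continue
            seen[attr["attr_name"]][attr["value"]].add(product["id"])
    return {
        name: {value: len(ids) for value, ids in values.items()}
        for name, values in seen.items()
    }


laptop_facets = facets(products, category_id=242)
print(laptop_facets)
```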