ElasticSearch is not sorting file names in correct order - elasticsearch

This is a contrived example to illustrate my problem. I have a bunch of filenames that I would like to sort alphabetically, the same way macOS does in a Finder window.
These are my indexed file names in the order I would expect to see them sorted:
A Tribe Called Quest - Can I Kick It (1).mp3
a.png
Bcc 05.png
Birling Gap Cliffs.jpg
Durdle Door.jpg
f.png
Frost.jpg
p.png
Users order.mp4
z.png
And this is what I'm doing in Kibana dev tools to test:
## sorting contrived example
# create the index with keyword filename for sorting
PUT /file-names
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"_doc" : {
"properties": {
"filename": { "type": "keyword" }
}
}
}
}
# create bunch of documents
POST file-names/_doc/_bulk
{ "index":{} }
{ "filename":"A Tribe Called Quest - Can I Kick It (1).mp3" }
{ "index":{} }
{ "filename":"a.png" }
{ "index":{} }
{ "filename":"Bcc 05.png" }
{ "index":{} }
{ "filename":"Birling Gap Cliffs.jpg" }
{ "index":{} }
{ "filename":"Durdle Door.jpg" }
{ "index":{} }
{ "filename":"Frost.jpg" }
{ "index":{} }
{ "filename":"f.png" }
{ "index":{} }
{ "filename":"Users order.mp4" }
{ "index":{} }
{ "filename":"p.png" }
{ "index":{} }
{ "filename":"z.png" }
# query with sort - bugged
GET /file-names/_search
{
"sort": {
"filename": {
"order": "asc"
}
}
}
The results I'm getting back are:
"hits" : [
{
"_index" : "file-names",..."_score" : null,
"_source" : {
"filename" : "A Tribe Called Quest - Can I Kick It (1).mp3"
},
"sort" : [
"A Tribe Called Quest - Can I Kick It (1).mp3"
]
},
{
...
"_source" : {
"filename" : "Bcc 05.png"
},
"sort" : [
"Bcc 05.png"
]
},
{
...
"_source" : {
"filename" : "Birling Gap Cliffs.jpg"
},
"sort" : [
"Birling Gap Cliffs.jpg"
]
},
{
...
"_source" : {
"filename" : "Durdle Door.jpg"
},
"sort" : [
"Durdle Door.jpg"
]
},
{
...
"_source" : {
"filename" : "Frost.jpg"
},
"sort" : [
"Frost.jpg"
]
},
{
...
"_source" : {
"filename" : "Users order.mp4"
},
"sort" : [
"Users order.mp4"
]
},
{
...
"_source" : {
"filename" : "a.png"
},
"sort" : [
"a.png"
]
},
{
...
"_source" : {
"filename" : "f.png"
},
"sort" : [
"f.png"
]
},
{
...
"_source" : {
"filename" : "p.png"
},
"sort" : [
"p.png"
]
},
{
...
"_source" : {
"filename" : "z.png"
},
"sort" : [
"z.png"
]
}
]
Which are not in the order I'd expect. You can see "a.png" is below "Users order.mp4" for reasons I cannot understand.
Any help appreciated to get sorting working in the order I'd expect!

As #Alper suggested, this has already been addressed.
If you for some reason need to stick with the keyword mapping, here's how you can script-sort:
GET /file-names/_search
{
"sort": {
"_script": {
"type": "string",
"script": {
"lang": "painless",
"source": "doc['filename'].value.toLowerCase()"
},
"order": "asc"
}
}
}
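The underlying reason is that a keyword field is compared by its raw characters, and every uppercase ASCII letter sorts before every lowercase one. The following is a minimal Python sketch of that behaviour, not Elasticsearch code; the list simply mirrors the file names from the question, and the variable names are illustrative only.

```python
# File names from the question, in the order they were indexed.
names = [
    "A Tribe Called Quest - Can I Kick It (1).mp3",
    "a.png",
    "Bcc 05.png",
    "Birling Gap Cliffs.jpg",
    "Durdle Door.jpg",
    "Frost.jpg",
    "f.png",
    "Users order.mp4",
    "p.png",
    "z.png",
]

# Plain code-point comparison: "U" (85) sorts before "a" (97),
# which is why "Users order.mp4" ends up above "a.png".
raw_order = sorted(names)

# Comparing lowercased keys, which is what the script sort does
# with toLowerCase(), gives the order expected in the question.
folded_order = sorted(names, key=str.lower)

print(raw_order)
print(folded_order)
```

Only the comparison key is lowercased; the stored file names are left untouched.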

Related

Elasticsearch Nested 2 Step Sorting

Given the following data with nested objects (members within teams), I need to do a 2 step sort:
Return the youngest member of each team.
Sort the teams by the name of that youngest member.
I have a query below that is close: it does get the youngest member of each team, but then it sorts the teams using the names of all the members, not just the one selected per team.
What would the query be to do this?
And would such a query be performant assuming there was a lot of data? (Probably a few million objects each having 1-3 nested objects.)
Note: Although it's not clear in this simple example, I cannot simply store the youngest member, since in my real world case, the sorting of the nested objects is determined by a formula that includes an external parameter. This is just a very simplified example of the many sorts like this I would have to do on a larger data set, where I need to get the single best matching nested document for each outer document sorted in one way, but then sort the outer objects based on some other property of that selected nested object.
Data
PUT nested_test
{
"mappings": {
"dynamic": "strict",
"properties": {
"team": { "type": "keyword", "index": true, "doc_values": true },
"members": {
"type": "nested",
"properties": {
"name": { "type": "keyword", "index": true, "doc_values": true },
"age": { "type": "integer", "index": true, "doc_values": true}
}
}
}
}
}
PUT nested_test/_doc/1
{
"team" : "A" ,
"members" :
[
{ "name" : "Curt" , "age" : "34" } ,
{ "name" : "Dave" , "age" : "33" }
]
}
PUT nested_test/_doc/2
{
"team" : "B" ,
"members" :
[
{ "name" : "Alex" , "age" : "36" } ,
{ "name" : "Earl" , "age" : "32" }
]
}
PUT nested_test/_doc/3
{
"team" : "C" ,
"members" :
[
{ "name" : "Brad" , "age" : "35" } ,
{ "name" : "Gary" , "age" : "31" }
]
}
Attempted Query
GET nested_test/_search?filter_path=hits.hits._source.team,hits.hits.sort.*,hits.hits.inner_hits.members.hits.hits._source.*,hits.hits.inner_hits.members.hits.hits.sort.*
{
"query": {
"bool": {
"filter": [
{
"nested": {
"path": "members",
"query": {
"match_all" : { }
} ,
"inner_hits": {
"size": 1,
"sort": {
"members.age": { "order": "asc" }
}
}
}
}
]
}
}
,
"sort": [
{ "members.name": {
"order": "asc" ,
"nested": {
"path": "members",
"filter": { "match_all" : { } }
}
} }
]
}
Results (If the query was correct, the teams would be in A, B, C order, but they are B, C, A)
{
"hits" : {
"hits" : [
{
"_source" : {
"team" : "B"
},
"inner_hits" : {
"members" : {
"hits" : {
"hits" : [
{
"_source" : {
"name" : "Earl",
"age" : "32"
}
}
]
}
}
}
},
{
"_source" : {
"team" : "C"
},
"inner_hits" : {
"members" : {
"hits" : {
"hits" : [
{
"_source" : {
"name" : "Gary",
"age" : "31"
}
}
]
}
}
}
},
{
"_source" : {
"team" : "A"
},
"inner_hits" : {
"members" : {
"hits" : {
"hits" : [
{
"_source" : {
"name" : "Dave",
"age" : "33"
}
}
]
}
}
}
}
]
}
}
It is not feasible with a nested sort, and you can't use the result of inner_hits to sort your documents.
You could maybe use a runtime field with a complex script to extract the name of the youngest member at search time, but it will certainly be ugly, and the query will perform poorly at scale.
Since you use a nested model, you have all the data you need at indexing time to store the youngest member's name in a dedicated field at the root of the document.
Then you will be able to use a standard sort for this use case.
It's the right way to do it in Elasticsearch if you want to keep the performance.
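As a rough illustration of that denormalization, here is a minimal Python sketch that copies the youngest member's name to the root of each document before indexing. The field name youngest_member_name and the helper with_youngest_member are hypothetical, and the mapping would need a matching keyword field (the mapping in the question is strict). The last expression mimics what the nested sort from the question compares, the smallest name over all members.

```python
def with_youngest_member(doc):
    """Return a copy of doc with the youngest member's name stored at the root."""
    youngest = min(doc["members"], key=lambda m: int(m["age"]))
    # "youngest_member_name" is an assumed field name, not part of the original mapping.
    return {**doc, "youngest_member_name": youngest["name"]}


teams = [
    {"team": "A", "members": [{"name": "Curt", "age": "34"}, {"name": "Dave", "age": "33"}]},
    {"team": "B", "members": [{"name": "Alex", "age": "36"}, {"name": "Earl", "age": "32"}]},
    {"team": "C", "members": [{"name": "Brad", "age": "35"}, {"name": "Gary", "age": "31"}]},
]

enriched = [with_youngest_member(t) for t in teams]

# Sorting on the root-level field gives the order the question expects.
expected_order = [
    t["team"] for t in sorted(enriched, key=lambda t: t["youngest_member_name"])
]

# What the nested sort in the question effectively compares:
# the smallest name among all members of each team.
nested_sort_order = [
    t["team"] for t in sorted(teams, key=lambda t: min(m["name"] for m in t["members"]))
]

print(expected_order)
print(nested_sort_order)
```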

Elasticsearch nested sort based on minimum values of child of child arrays

I've two orders and these orders have multiple shipments and shipments have multiple products.
How can I sort the orders based on the minimum product.quantity in a shipment?
For example, when ordering ascending, orderNo = 2 should be listed first because it has a shipment that contains a product with quantity = 1. (This is the minimum value among all product.quantity values; productName doesn't matter.)
{
"orders": [
{
"orderNo": "1",
"shipments": [
{
"products": [
{
"productName": "AAA",
"quantity": "2"
},
{
"productName": "AAA",
"quantity": "2"
}
]
},
{
"products": [
{
"productName": "AAA",
"quantity": "3"
},
{
"productName": "AAA",
"quantity": "6"
}
]
}
]
},
{
"orderNo": "2",
"shipments": [
{
"products": [
{
"productName": "AAA",
"quantity": "1"
},
{
"productName": "AAA",
"quantity": "6"
}
]
},
{
"products": [
{
"productName": "AAA",
"quantity": "4"
},
{
"productName": "AAA",
"quantity": "5"
}
]
}
]
}
]
}
Assuming that each order is a separate document, you could create an order-focused index where both shipments and products are nested fields to prevent array flattening.
The minimal index mapping could then look like:
PUT orders
{
"mappings": {
"properties": {
"shipments": {
"type": "nested",
"properties": {
"products": {
"type": "nested"
}
}
}
}
}
}
The next step is to ensure the quantity is always numeric -- not a string. When that's done, insert said docs:
POST orders/_doc
{"orderNo":"1","shipments":[{"products":[{"productName":"AAA","quantity":2},{"productName":"AAA","quantity":2}]},{"products":[{"productName":"AAA","quantity":3},{"productName":"AAA","quantity":6}]}]}
POST orders/_doc
{"orderNo":"2","shipments":[{"products":[{"productName":"AAA","quantity":1},{"productName":"AAA","quantity":6}]},{"products":[{"productName":"AAA","quantity":4},{"productName":"AAA","quantity":5}]}]}
Finally, you can use nested sorting:
POST orders/_search
{
"sort": [
{
"shipments.products.quantity": {
"nested": {
"path": "shipments.products"
},
"order": "asc"
}
}
]
}
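For an ascending sort, the nested sort compares each order by the smallest quantity found among its nested products. A minimal Python sketch of that comparison, using the two orders above with product names omitted (the helper min_quantity is illustrative, not an Elasticsearch API):

```python
orders = [
    {"orderNo": "1", "shipments": [
        {"products": [{"quantity": 2}, {"quantity": 2}]},
        {"products": [{"quantity": 3}, {"quantity": 6}]},
    ]},
    {"orderNo": "2", "shipments": [
        {"products": [{"quantity": 1}, {"quantity": 6}]},
        {"products": [{"quantity": 4}, {"quantity": 5}]},
    ]},
]


def min_quantity(order):
    """Smallest quantity across every product of every shipment in the order."""
    return min(p["quantity"] for s in order["shipments"] for p in s["products"])


# Ascending sort on the per-order minimum.
sorted_order_nos = [o["orderNo"] for o in sorted(orders, key=min_quantity)]
print(sorted_order_nos)
```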
Tip: To make the query even more useful, you could introduce sorted inner_hits to not only sort the top-level orders but also the individual products enclosed in a given order. These inner hits need a nested query so you could simply add a non-negative condition on shipments.products.quantity.
When you combine this query with the above sort and restrict the response to only relevant attributes with filter_path:
POST orders/_search?filter_path=hits.hits._id,hits.hits._source.orderNo,hits.hits.inner_hits.*.hits.hits._source
{
"_source": ["orderNo", "non_negative_quantities"],
"query": {
"nested": {
"path": "shipments.products",
"inner_hits": {
"name": "non_negative_quantities",
"sort": {
"shipments.products.quantity": "asc"
}
},
"query": {
"range": {
"shipments.products.quantity": {
"gte": 0
}
}
}
}
},
"sort": [
{
"shipments.products.quantity": {
"nested": {
"path": "shipments.products"
},
"order": "asc"
}
}
]
}
you'll end up with both sorted orders AND sorted products:
{
"hits" : {
"hits" : [
{
"_id" : "gVc0BHgBly0XYOUcZ4vd",
"_source" : {
"orderNo" : "2" <---
},
"inner_hits" : {
"non_negative_quantities" : {
"hits" : {
"hits" : [
{
"_source" : {
"quantity" : 1, <---
"productName" : "AAA"
}
},
{
"_source" : {
"quantity" : 4, <---
"productName" : "AAA"
}
},
{
"_source" : {
"quantity" : 5, <---
"productName" : "AAA"
}
}
]
}
}
}
},
{
"_id" : "gFc0BHgBly0XYOUcYosz",
"_source" : {
"orderNo" : "1"
},
"inner_hits" : {
"non_negative_quantities" : {
"hits" : {
"hits" : [
{
"_source" : {
"quantity" : 2,
"productName" : "AAA"
}
},
{
"_source" : {
"quantity" : 2,
"productName" : "AAA"
}
},
{
"_source" : {
"quantity" : 3,
"productName" : "AAA"
}
}
]
}
}
}
}
]
}
}

How can I find all documents in elasticsearch that contain a number in a certain field?

I have a keyword-typed field that can contain either a number or a string. If the field does not contain any letters, I would like to hit on that document. How can I do this?
My index mapping looks like:
{
"mappings": {
"Entry": {
"properties": {
"testField": {
"type": "keyword"
}
}
}
}
}
My documents look like this:
{
"testField":"123abc"
}
or
{
"testField": "456789"
}
I've tried the query:
{
"query": {
"range": {
"testField": {
"gte": 0,
"lte": 2000000
}
}
}
}
but it still hits on 123abc. How can I design this so that I only hit on the documents with a number in that particular field?
There is another more optimal option for achieving exactly what you want. You can leverage the ingest API pipelines and using a script processor you can create another numeric field at indexing time that you can then use more efficiently at search time.
The ingestion pipeline below contains a single script processor which will create another field called numField that will only contain numeric values.
POST _ingest/pipeline/_simulate
{
"pipeline": {
"processors": [
{
"script": {
"source": """
ctx.numField = /\D/.matcher(ctx.testField).replaceAll("");
"""
}
}
]
},
"docs": [
{
"_source": {
"testField": "123"
}
},
{
"_source": {
"testField": "abc123"
}
},
{
"_source": {
"testField": "123abc"
}
},
{
"_source": {
"testField": "abc"
}
}
]
}
Simulating this pipeline with 4 different documents having a mix of alphanumeric content, will yield this:
{
"docs" : [
{
"doc" : {
"_index" : "_index",
"_type" : "_type",
"_id" : "_id",
"_source" : {
"numField" : "123",
"testField" : "123"
},
"_ingest" : {
"timestamp" : "2019-05-09T04:14:51.448Z"
}
}
},
{
"doc" : {
"_index" : "_index",
"_type" : "_type",
"_id" : "_id",
"_source" : {
"numField" : "123",
"testField" : "abc123"
},
"_ingest" : {
"timestamp" : "2019-05-09T04:14:51.448Z"
}
}
},
{
"doc" : {
"_index" : "_index",
"_type" : "_type",
"_id" : "_id",
"_source" : {
"numField" : "123",
"testField" : "123abc"
},
"_ingest" : {
"timestamp" : "2019-05-09T04:14:51.448Z"
}
}
},
{
"doc" : {
"_index" : "_index",
"_type" : "_type",
"_id" : "_id",
"_source" : {
"numField" : "",
"testField" : "abc"
},
"_ingest" : {
"timestamp" : "2019-05-09T04:14:51.448Z"
}
}
}
]
}
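To see what the script processor does to each value, here is a small Python sketch of the same substitution. It is only an approximation of the Painless script: the character class [^0-9] stands in for \D, and the function name digits_only is made up for this example. Note that mixed values such as 123abc keep their digits, as in the simulated output above.

```python
import re


def digits_only(value):
    """Remove every character that is not an ASCII digit."""
    return re.sub(r"[^0-9]", "", value)


# The four test values from the simulated pipeline.
results = {v: digits_only(v) for v in ["123", "abc123", "123abc", "abc"]}
print(results)
```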
After indexing your documents using this pipeline, you can run your range query on numField instead of testField. Compared to the other solution (sorry #Kamal), this shifts the scripting burden so that it runs only once per document at indexing time, instead of every time on every document at search time.
{
"query": {
"range": {
"numField": {
"gte": 0,
"lte": 2000000
}
}
}
}
Afaik, Elasticsearch does not have a direct solution for this.
Instead you would need to write a Script Query. Below is what you are looking for:
POST <your_index_name>/_search
{
"query": {
"bool": {
"must": [
{
"script": {
"script": {
"lang": "painless",
"source": """
try {
// parseInt throws if the value is not a valid integer
Integer.parseInt(doc['testField'].value);
return true;
} catch (NumberFormatException e) {
return false;
}
"""
}
}
}
]
}
}
}
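The per-document check behind this script can be sketched in Python as follows. This is an assumption-laden stand-in rather than a port: it uses a digits-only regular expression instead of Integer.parseInt, so it rejects a leading sign and does not fail on very long numbers, and the names DIGITS_ONLY and is_numeric are hypothetical.

```python
import re

# One or more ASCII digits and nothing else.
DIGITS_ONLY = re.compile(r"[0-9]+")


def is_numeric(value):
    """True when the whole value consists of ASCII digits."""
    return DIGITS_ONLY.fullmatch(value) is not None


# The two sample documents from the question.
matches = [v for v in ["123abc", "456789"] if is_numeric(v)]
print(matches)
```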
Hope it helps!

Replacing OR/AND/NOT filters with bool filter creates a hard-to-understand query with too many levels?

I have the following filter in a filtered query. As seen, it has many OR/AND/NOT filters at different levels. I was advised to replace them with bool filters for performance reasons, and I am going to do that.
"filter" : {
"or" : [
{
"and" : [
{ "range" : { "start" : { "lte": 201407292300 } } },
{ "range" : { "end" : { "gte": 201407292300 } } },
{ "term" : { "condtion1" : false } },
{
"or" : [
{
"and" : [
{ "term" : { "condtion2" : false } },
{
"or": [
{
"and" : [
{ "missing" : { "field" : "condtion6" } },
{ "missing" : { "field" : "condtion7" } }
]
},
{ "term" : { "condtion6" : "nop" } },
{ "term" : { "condtion7" : "rst" } }
]
}
]
},
{
"and" : [
{ "term" : { "condtion2" : true } },
{
"or": [
{
"and" : [
{ "missing" : { "field" : "condtion3" } },
{ "missing" : { "field" : "condtion4" } },
{ "missing" : { "field" : "condtion5" } },
{ "missing" : { "field" : "condtion6" } },
{ "missing" : { "field" : "condtion7" } }
]
},
{ "term" : { "condtion3" : "abc" } },
{ "term" : { "condtion4" : "def" } },
{ "term" : { "condtion5" : "ghj" } },
{ "term" : { "condtion6" : "nop" } },
{ "term" : { "condtion7" : "rst" } }
]
}
]
}
]
}
]
},
{
"and" : [
{
"term": { "condtion8" : "TIME_POINT_1" }
},
{ "range" : { "start" : { "lte": 201407302300 } } },
{
"or": [
{ "term" : { "condtion9" : "GROUP_B" } },
{
"and" : [
{ "term" : { "condtion9" : "GROUP_A" } },
{ "ids" : { "values": [100, 10] } }
]
}
]
}
]
},
{
"and" : [
{
"term": { "condtion8" : "TIME_POINT_2" }
},
{ "ids" : { "values": [100, 10] } }
]
},
{
"and" : [
{
"term": { "condtion8" : "TIME_POINT_3" }
},
{
"or": [
{ "term" : { "condtion1" : true } },
{ "range" : { "end" : { "lt": 201407302300 } } }
]
},
{
"or": [
{ "term" : { "condtion9" : "GROUP_B" } },
{
"and" : [
{ "term" : { "condtion9" : "GROUP_A" } },
{ "ids" : { "values": [100, 10] } }
]
}
]
}
]
}
]
}
However, I feel replacing these OR/AND/NOT filters would create a query that has too many levels and is hard to understand. For example, replacing
"or": [
....
]
I have to have:
"bool": {
"should": [
]
}
Am I right that replacing OR/AND/NOT with bool filter in my case is at the expense of sacrificing understandability?
A related question
If I have to replace OR/AND/NOT filters for performance, should I replace ALL of these OR/AND/NOT filters, or just some of them such as the one at the top for example?
Thanks and regards.
You should replace all of them except geo/script/range filters. That said, understanding the likely impact of each filter also helps. For example, if one filter eliminates, say, 90% of the documents, you may want to put it first in an and filter: and/or filters are executed sequentially, so the remaining filters have fewer documents to process. With bool filters, all the filters are combined in a single bitset operation. You might have already read about it.
I don't think you will be sacrificing understandability by replacing OR/AND/NOT with bool filters. In the example you gave, converting a single or filter to a bool filter with should does add a level, but across the whole query the structure would be almost the same.
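To give a feel for how mechanical the rewrite is, here is a minimal Python sketch that converts a legacy filter tree recursively. It assumes the array form of and/or and the direct form of not, and it converts every and/or/not it finds, so it does not special-case geo/script/range filters. The function name to_bool and the sample filter are hypothetical.

```python
def to_bool(f):
    """Recursively rewrite legacy and/or/not filters as bool filters."""
    if not isinstance(f, dict) or len(f) != 1:
        return f
    kind, body = next(iter(f.items()))
    if kind == "and":
        return {"bool": {"must": [to_bool(c) for c in body]}}
    if kind == "or":
        return {"bool": {"should": [to_bool(c) for c in body]}}
    if kind == "not":
        return {"bool": {"must_not": [to_bool(body)]}}
    # Leaf filters (term, range, ids, missing, ...) are left unchanged.
    return f


legacy = {
    "or": [
        {"and": [
            {"term": {"condtion1": False}},
            {"not": {"term": {"condtion2": True}}},
        ]},
        {"term": {"condtion8": "TIME_POINT_2"}},
    ]
}

converted = to_bool(legacy)
print(converted)
```

Each and/or/not level maps to one bool plus one clause key, so the converted tree follows the original branch for branch.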

Search query for elastic search

I have documents in elastic search in the following format
{
"stringindex" : {
"mappings" : {
"files" : {
"properties" : {
"BaseOfCode" : {
"type" : "long"
},
"BaseOfData" : {
"type" : "long"
},
"Characteristics" : {
"type" : "long"
},
"FileType" : {
"type" : "long"
},
"Id" : {
"type" : "string"
},
"Strings" : {
"properties" : {
"FileOffset" : {
"type" : "long"
},
"RO_BaseOfCode" : {
"type" : "long"
},
"SectionName" : {
"type" : "string"
},
"SectionOffset" : {
"type" : "long"
},
"String" : {
"type" : "string"
}
}
},
"SubSystem" : {
"type" : "long"
}
}
}
}
}
}
My requirement is that when I search for a particular string (Strings.String), I want to get only the FileOffset (Strings.FileOffset) for that string.
How do I do this?
Thanks
I suppose that you want to perform a nested query and retrieve only one field as the result, but I see problems in your mapping, hence I will split my answer in 3 sections:
What is the problem I see:
How to query nested fields (this is more ES background):
How to find a solution:
1) What is the problem I see:
You want to query a nested field, but you don't have a nested field.
The nested field part:
The field "Strings" is not nested in the type "files" (nested data without a nested field may bring future problems), otherwise your mapping for the field "Strings" would be something like this:
{
"stringindex" : {
"mappings" : {
"files" : {
"properties" : {
"Strings" : {
"type" : "nested",
"properties" : {
"String" : {
"type" : "string"
}
}
}
}
}
}
}
}
Note: yes, I cut most of the fields, but I did this to easily show that you didn't create a nested field.
With a nested field "in hands", we need a nested query.
The specific field result part:
To retrieve only one field as result, you have to include the property "_source" in your query.
2) How to query nested fields:
This is more for ES background, if you have never worked with nested fields.
Small example:
You define a type with a nested field:
{
"nesttype" : {
"properties" : {
"name" : { "type" : "string" },
"parents" : {
"type" : "nested" ,
"properties" : {
"sex" : { "type" : "string" },
"name" : { "type" : "string" }
}
}
}
}
}
You create some inputs:
{ "name" : "Dan", "parents" : [{ "name" : "John" , "sex" : "m" },
{ "name" : "Anna" , "sex" : "f" }] }
{ "name" : "Lana", "parents" : [{ "name" : "Maria" , "sex" : "f" }] }
Then you query, but only fetch the nested field "parents.name":
{
"query": {
"nested": {
"path": "parents",
"query": {
"bool": {
"must": [
{
"term": {
"parents.sex": "m"
}
}
]
}
}
}
},
"_source" : [ "parents.name" ]
}
The output of this query is "the names of the parents of all people who have a parent of the sex 'm'". One entry (Dan) has a father, whereas the other (Lana) doesn't, so it will only retrieve Dan's parents' names.
3) How to find a solution:
To fix your mapping:
You only need to include the type "nested" in the field "Strings":
{
"files" : {
"properties" : {
...
"Strings" : {
"type" : "nested" ,
"properties" : {
"FileOffset" : { "type" : "long" },
"RO_BaseOfCode" : { "type" : "long" },
...
}
}
...
}
}
}
To query your data:
{
"query": {
"nested": {
"path": "Strings",
"query": {
"bool": {
"must": [
{
"term": {
"Strings.String": "my string"
}
}
]
}
}
}
},
"_source" : [ "Strings.FileOffset" ]
}
Great answer by dan, but I think he didn't cover everything.
His solution doesn't quite work for your question, though you may not have noticed that.
Consider a scenario where the data looks like this:
doc_1
{
"Id": 1,
"Strings": [
{
"string": "x",
"fileoffset": "f1"
},
{
"string": "y",
"fileoffset": "f2"
}
]
}
doc_2
{
"Id": 2,
"Strings": {
"string": "z",
"fileoffset": "f3"
}
}
When you run the query as dan suggested, say with the filter Strings.string=x, the response looks like this:
{
"hits": [
{
"_index": "stringindex",
"_type": "files",
"_id": "11961",
"_score": 1,
"_source": {
"Strings": [
{
"fileoffset": "f1"
},
{
"fileoffset": "f2"
}
]
}
}
]
}
This is because Elasticsearch returns a document when any of the objects inside the nested field (here Strings) passes the filter criteria. In this case, Strings.string=x matched in doc_1, so doc_1 is returned, but we don't know which nested object passed the criteria.
So you have to use a nested aggregation.
Here is a solution for you:
POST index/type/_search
{
"size": 0,
"aggs": {
"StringsNested": {
"nested": {
"path": "Strings"
},
"aggs": {
"StringFilter": {
"filter": {
"term": {
"Strings.string": "x"
}
},
"aggs": {
"FileOffsets": {
"terms": {
"field": "Strings.fileoffset"
}
}
}
}
}
}
}
}
So, response is like,
"aggregations": {
"StringsNested": {
"doc_count": 2,
"StringFilter": {
"doc_count": 1,
"FileOffsets": {
"buckets": [
{
"key": "f1",
"doc_count": 1
}
]
}
}
}
}
Remember to have mapping of Strings as nested, as dan said.
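The difference between the two approaches can be sketched in plain Python. This is only an illustration of the matching logic, not Elasticsearch code: the sample documents mirror doc_1 and doc_2 above (with doc_2's single object wrapped in a list), and both function names are made up.

```python
docs = [
    {"Id": 1, "Strings": [
        {"string": "x", "fileoffset": "f1"},
        {"string": "y", "fileoffset": "f2"},
    ]},
    {"Id": 2, "Strings": [{"string": "z", "fileoffset": "f3"}]},
]


def document_level_offsets(docs, wanted):
    """Nested query plus _source filtering: every offset of a matching document."""
    return [
        s["fileoffset"]
        for d in docs
        if any(s["string"] == wanted for s in d["Strings"])
        for s in d["Strings"]
    ]


def object_level_offsets(docs, wanted):
    """Nested aggregation with a filter: only offsets of the matching objects."""
    return [
        s["fileoffset"]
        for d in docs
        for s in d["Strings"]
        if s["string"] == wanted
    ]


print(document_level_offsets(docs, "x"))
print(object_level_offsets(docs, "x"))
```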
