Elasticsearch sort alphabetically then numerically - sorting

I looking for some elegant way to sort my results first by alphabet and then by numbers.
My current solution is inserting an "~" before numbers using the next sort script, "~" is lexicographically after "z":
"sort": {
"_script":{
"script" : "s = doc['name.raw'].value; n = org.elasticsearch.common.primitives.Ints.tryParse(s.split(' ')[0][0]); if (n != null) { '~' + s } else { s }",
"type" : "string"
}
}
but I wonder if there is a more elegant and perhaps more performant solution.
Input:
ZBA ABC ...
ABC SDK ...
123 RIU ...
12B BTE ...
11J TRE ...
BCA 642 ...
Desired output:
ABC SDK ...
BCA 642 ...
ZBA ABC ...
11J TRE ...
12B BTE ...
123 RIU ...

You can do the same thing at indexing time using a custom analyzer which leverages a pattern_replace character filter. It's more performant to do it at indexing than running a script sort at search time for each query.
It works in the same vein as your solution, i.e. if we detect a number, we prepend the value with a tilde ~, otherwise we don't do anything, yet we do it at indexing time and index the resulting value in the name.sort field.
PUT /tests
{
"settings": {
"analysis": {
"char_filter": {
"pre_num": {
"type": "pattern_replace",
"pattern": "(\\d)",
"replacement": "~$1"
}
},
"analyzer": {
"number_tagger": {
"type": "custom",
"tokenizer": "keyword",
"filter": [],
"char_filter": [
"pre_num"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"name": {
"type": "string",
"fields": {
"sort": {
"type": "string",
"analyzer": "number_tagger",
"search_analyzer": "standard"
}
}
}
}
}
}
}
Then you can index your data
POST /tests/test/_bulk
{"index": {}}
{"name": "ZBA ABC"}
{"index": {}}
{"name": "ABC SDK"}
{"index": {}}
{"name": "123 RIU"}
{"index": {}}
{"name": "12B BTE"}
{"index": {}}
{"name": "11J TRE"}
{"index": {}}
{"name": "BCA 642"}
Then your query can simply look like this:
POST /tests/_search
{
"sort": {
"name.sort": "asc"
}
}
And the response you'll get is:
{
"hits": {
"hits": [
{
"_source": {
"name": "ABC SDK"
}
},
{
"_source": {
"name": "BCA 642"
}
},
{
"_source": {
"name": "ZBA ABC"
}
},
{
"_source": {
"name": "11J TRE"
}
},
{
"_source": {
"name": "12B BTE"
}
},
{
"_source": {
"name": "123 RIU"
}
}
]
}
}

Related

How to find word 'food2u' by search 'food' in Elasticsearch?

I am a rookie who just started learning elasticsearch,And I want to find word like 'food2u' by search keyword 'food'.But I can only get the results like 'Food Repo','Give Food' etc. The field's Mapping is 'text' and this is my query
GET api/_search
{"query": {
"match": {
"Name": {
"query": "food"
}
}
},
"_source":{
"includes":["Name"]
}
}
You are getting the results like 'Food Repo','Give Food', as the text field uses a standard analyzer if no analyzer is specified. Food Repo gets tokenized into food and repo. Similarly Give Food gets tokenized into give and food.
But food2u gets tokenized into food2u. Since there is no matching token ("food"), you will not get the food2u document.
You need to use edge_ngram tokenizer to do a partial text match.
Adding a working example with index data, mapping, search query and search result
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 4,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
}
},
"max_ngram_diff": 10
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Index Data:
{
"name":"food2u"
}
Search Query:
{
"query": {
"match": {
"name": "food"
}
}
}
Search Result:
"hits": [
{
"_index": "67552800",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"name": "food2u"
}
}
]
If you don't want to change the mapping, you can even use a wildcard query to return the matching documents
{
"query": {
"wildcard": {
"Name": {
"value": "food*"
}
}
}
}
OR you can even use query_string with wildcard
{
"query": {
"query_string": {
"query": "food*",
"fields": [
"Name"
]
}
}
}

ElasticSearch Keyword usage with a prefix search

I have a requirement to be able to search a sentence as complete or with prefix. The UI library (reactive search) I am using is generating the query in this way:
"simple_query_string": {
"query": "\"Louis George Maurice Adolphe\"",
"fields": [
"field1",
"field2",
"field3"
],
"default_operator": "or"
}
I am expecting it to returns results for eg.
Louis George Maurice Adolphe (Roche)
but NOT just records containing partial terms like Louis or George
Currently, I have code like this but it only brings the record if I search with complete word Louis George Maurice Adolphe (Roche) but not a prefix Louis George Maurice Adolphe.
{
"settings": {
"analysis": {
"char_filter": {
"space_remover": {
"type": "mapping",
"mappings": [
"\\u0020=>"
]
}
},
"normalizer": {
"lower_case_normalizer": {
"type": "custom",
"char_filter": [
"space_remover"
],
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"field3": {
"type": "keyword",
"normalizer": "lower_case_normalizer"
}
}
}
}
}
Any guidance on the above is appreciated. Thanks.
You are not using the prefix query hence not getting result for prefix search terms, I used same mapping and sample doc, but changed the search query which gives the expected results
Index mapping
{
"settings": {
"analysis": {
"char_filter": {
"space_remover": {
"type": "mapping",
"mappings": [
"\\u0020=>"
]
}
},
"normalizer": {
"lower_case_normalizer": {
"type": "custom",
"char_filter": [
"space_remover"
],
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"field3": {
"type": "keyword",
"normalizer": "lower_case_normalizer"
}
}
}
}
Indexed sample doc
{
"field3" : "Louis George Maurice Adolphe (Roche)"
}
Search query
{
"query": {
"prefix": {
"field3": {
"value": "Louis George Maurice Adolphe"
}
}
}
}
Search result
"hits": [
{
"_index": "normal",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"field3": "Louis George Maurice Adolphe (Roche)"
}
}
]
The underlying issue stems from the fact that you're applying a whitespace remover. What this practically means is that when you ingest your docs:
GET your_index_name/_analyze
{
"text": "Louis George Maurice Adolphe (Roche)",
"field": "field3"
}
they're indexed as
{
"tokens" : [
{
"token" : "louisgeorgemauriceadolphe(roche)",
"start_offset" : 0,
"end_offset" : 36,
"type" : "word",
"position" : 0
}
]
}
So if you indend to use simple_string, you may want to rethink your normalizers.
#Ninja's answer fails when you search for George Maurice Adolphe, i.e. no prefix intersection.

How to exclude asterisks while searching with analyzer

I need to search by an array of values, and each value can be either simple text or text with askterisks(*).
For example:
["MYULTRATEXT"]
And I have the next index(i have a really big index, so I will simplify it):
................
{
"settings": {
"analysis": {
"char_filter": {
"asterisk_remove": {
"type": "pattern_replace",
"pattern": "(\\d+)*(?=\\d)",
"replacement": "1$"
}
},
"analyzer": {
"custom_search_analyzer": {
"char_filter": [
"asterisk_remove"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"name": {
"type": "text",
"analyzer":"keyword",
"search_analyzer": "custom_search_analyzer"
},
......................
And all data in the index is stored with asterisks * e.g.:
curl -X PUT "localhost:9200/locations/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
"name" : "MY*ULTRA*TEXT"
}
I need to return exact the same name value when I search by this string MYULTRATEXT
curl -XPOST 'localhost:9200/locations/_search?pretty' -d '
{
"query": { terms: { "name": ["MYULTRATEXT"] } }
}'
It Should return MY*ULTRA*TEXT, but it does not work, so can't find a workaround. Any thoughts?
I tried pattern_replace but seems like I am doing something wrong or I am missing something here.
So I need to replace all * to empty `` while searching
There appears to be a problem with the regex you provided and the replacement pattern.
I think what you want is:
"char_filter": {
"asterisk_remove": {
"type": "pattern_replace",
"pattern": "(\\w+)\\*(?=\\w)",
"replacement": "$1"
}
}
Note the following changes:
\d => \w (match word characters instead of only digits)
escape * since asterisks have a special meaning for regexes
1$ => $1 ($<GROUPNUM> is how you reference captured groups)
To see how Elasticsearch will analyze the text against an analyzer, or to check that you defined an analyzer correctly, Elasticsearch has the ANALYZE API endpoint that you can use: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
If you try this API with your current definition of custom_search_analyzer, you will find that "MY*ULTRA*TEXT" is analyzed to "MY*ULTRA*TEXT" and not "MYULTRATEXT" as you intend.
I have a personal app that I use to more easily interact with and visualize the results of the ANALYZE API. I tried your example and you can find it here: Elasticsearch Analysis Inspector.
This might help you - your regex pattern is the issue.
You want to replace all * occurrences with `` the pattern below will do the trick..
PUT my_index
{
"mappings": {
"doc": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer":"my_analyzer"
}
}
}
},
"settings": {
"analysis": {
"filter": {
"asterisk_remove": {
"type": "pattern_replace",
"pattern": "(?<=\\w)(\\*)(?=\\w)",
"replacement": ""
}
},
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase",
"asterisk_remove"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
}
Analyze query
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": ["MY*ULTRA*TEXT"]
}
Results of analyze query
{
"tokens": [
{
"token": "myultratext",
"start_offset": 0,
"end_offset": 13,
"type": "word",
"position": 0
}
]
}
Post a document
POST my_index/doc/1
{
"name" : "MY*ULTRA*TEXT"
}
Search query
GET my_index/_search
{
"query": {
"match": {
"name": "MYULTRATEXT"
}
}
}
Or
GET my_index/_search
{
"query": {
"match": {
"name": "myultratext"
}
}
}
Results search query
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_index": "my_index",
"_type": "doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"name": "MY*ULTRA*TEXT"
}
}
]
}
}
Hope it helps

Elasticsearch unexpected results when sorting against deeply nested attributes

I'm trying to perform some sorting based on the attributes of a document's deeply nested children.
Let's say we have an index filled with publisher documents. A publisher has a collection of books, and
each book has a title, a published flag, and a collection of genre scores. A genre_score represents how well
a particular book matches a particular genre, or in this case a genre_id.
First, let's define some mappings (for simplicity, we will only be explicit about the nested types):
curl -XPUT 'localhost:9200/book_index' -d '
{
"mappings": {
"publisher": {
"properties": {
"books": {
"type": "nested",
"properties": {
"genre_scores": {
"type": "nested"
}
}
}
}
}
}
}'
Here are our two publishers:
curl -XPUT 'localhost:9200/book_index/publisher/1' -d '
{
"name": "Best Books Publishing",
"books": [
{
"name": "Published with medium genre_id of 1",
"published": true,
"genre_scores": [
{ "genre_id": 1, "score": 50 },
{ "genre_id": 2, "score": 15 }
]
}
]
}'
curl -XPUT 'localhost:9200/book_index/publisher/2' -d '
{
"name": "Puffin Publishers",
"books": [
{
"name": "Published book with low genre_id of 1",
"published": true,
"genre_scores": [
{ "genre_id": 1, "score": 10 },
{ "genre_id": 4, "score": 10 }
]
},
{
"name": "Unpublished book with high genre_id of 1",
"published": false,
"genre_scores": [
{ "genre_id": 1, "score": 100 },
{ "genre_id": 2, "score": 35 }
]
}
]
}'
And here is the final definition of our index & mappings...
curl -XGET 'localhost:9200/book_index/_mappings?pretty=true'
...
{
"book_index": {
"mappings": {
"publisher": {
"properties": {
"books": {
"type": "nested",
"properties": {
"genre_scores": {
"type": "nested",
"properties": {
"genre_id": {
"type": "long"
},
"score": {
"type": "long"
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"published": {
"type": "boolean"
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
Now suppose we want to query for a list of publishers, and have them sorted by those who books performing
well in a particular genre. In other words, sort the publishers by the genre_score.score of one of their books
for the target genre_id.
We might write a search query like this...
curl -XGET 'localhost:9200/book_index/_search?pretty=true' -d '
{
"size": 5,
"from": 0,
"sort": [
{
"books.genre_scores.score": {
"order": "desc",
"nested_path": "books.genre_scores",
"nested_filter": {
"term": {
"books.genre_scores.genre_id": 1
}
}
}
}
],
"_source":false,
"query": {
"nested": {
"path": "books",
"query": {
"bool": {
"must": []
}
},
"inner_hits": {
"size": 5,
"sort": []
}
}
}
}'
Which correctly returns the Puffin (with a sort value of [100]) first and Best Books second (with a sort value of [50]).
But suppose we only want to consider books for which published is true. This would change our expectation to have Best Books first (with a sort of [50]) and Puffin second (with a sort of [10]).
Let's update our nested_filter and query to the following...
curl -XGET 'localhost:9200/book_index/_search?pretty=true' -d '
{
"size": 5,
"from": 0,
"sort": [
{
"books.genre_scores.score": {
"order": "desc",
"nested_path": "books.genre_scores",
"nested_filter": {
"bool": {
"must": [
{
"term": {
"books.genre_scores.genre_id": 1
}
}, {
"term": {
"books.published": true
}
}
]
}
}
}
}
],
"_source": false,
"query": {
"nested": {
"path": "books",
"query": {
"term": {
"books.published": true
}
},
"inner_hits": {
"size": 5,
"sort": []
}
}
}
}'
Suddenly, our sort values for both publishers has become [-9223372036854775808].
Why does adding an additional term to our nested_filter in the top-level sort have this impact?
Can anyone provide some insight as to why this behavior is happening? And additionally, if there are any viable solutions to the proposed query/sort?
This occurs in both ES1.x and ES5
Thanks!

How to get the number of hits of several matching fields in one record?

I have records similar to
{
"who": "John",
"hobby": [
{"name": "gardening",
"skills": 2
},
{"name": "sleeping",
"skills": 3
},
{"name": "darts",
"skills": 2
}
]
}
,
{
"who": "Mary",
"hobby": [
{"name": "gardening",
"skills": 2
},
{"name": "volleyball",
"skills": 3
},
{"name": "kung-fu",
"skills": 2
}
]
}
I am looking at building a query which would answer the question: "how many hobbies with skills=2 do we have?"
The answer for the example above would be 3 ("gardening" is common to both, and each have another unique one).
Every "query" or "query"+"aggs" I tried returns in ['hits']['hits'] or ['aggregations']['sources']['buckets'] the number of matching documents, that is two in the case above (one for "John" and one for "Mary", each of them satisfying the query).
Is there a way to build a query so that it returns the total number of fields (in the example above: the elements of the list "hobby") which matched that query? (fields, not documents)
Note: If my documents were flat:
{"who": "John", "name": "gardening", "skills": 2},
{"who": "John", "name": "sleeping", "skills": 3},
(...)
{"who": "Mary", "name": "kung-fu", "skills": 2}
then a simple "query" to match "skills": 2 + an aggregation on "name" would have done the work
Yes, you can achieve this with the nested type and using inner_hits and/or nested aggregations.
So here is the mapping you should use:
curl -XPUT localhost:9200/hobbies -d '{
"mappings": {
"hob": {
"properties": {
"who": {
"type": "string"
},
"hobby": {
"type": "nested", <--- the hobby list is of type nested
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
},
"skills": {
"type": "integer"
}
}
}
}
}
}
}
Then we can insert your two sample documents using the _bulk endpoint like this:
curl -XPOST localhost:9200/hobbies/hob/_bulk -d '
{"index":{}}
{"who":"John", "hobby":[{"name": "gardening","skills": 2},{"name": "sleeping","skills": 3},{"name": "darts","skills": 2}]}
{"index":{}}
{"who":"Mary", "hobby":[{"name": "gardening","skills": 2},{"name": "volley-ball","skills": 3},{"name": "kung-fu","skills": 2}]}
'
And finally, we can query your index for how many hobbies have skills: 2 like this:
curl -XPOST localhost:9200/hobbies/hob/_search -d '{
"_source": false,
"query": {
"nested": {
"path": "hobby",
"query": {
"term": {
"hobby.skills": 2
}
},
"inner_hits": {} <---- this will return only the matching nested fields with skills=2
}
},
"aggs": {
"hobbies": {
"nested": {
"path": "hobby"
},
"aggs": {
"skills": {
"filter": {
"term": {
"hobby.skills": 2
}
},
"aggs": {
"by_field": { <--- this will return a breakdown of the fields with skills=2
"terms": {
"field": "name"
}
}
}
}
}
}
}
}'
What this query will return you is
In the hits part, the four fields that have skills: 2
In the aggs part, a breakdown of the 3 distinct fields which have skills: 2

Resources