MySQL ORDER BY value equivalent in Elasticsearch 5.6

Elasticsearch version: 5.6
I have imported MySQL data into Elasticsearch and added the mappings as required. Below is the mapping for the column application_status.
Mappings:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_insensitive": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "lead": {
      "properties": {
        "application_status": {
          "type": "string",
          "analyzer": "case_insensitive",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
With the above mapping, I am able to do simple sorting (asc or desc) using the following query:
{
  "size": 50,
  "from": 0,
  "sort": [
    {
      "application_status.keyword": {
        "order": "asc"
      }
    }
  ]
}
which is the MySQL equivalent of
select * from <table_name> order by application_status asc limit 50;
I need help with the following problem:
I have a MySQL query which sorts based on application_status:
select * from vLoan_application_grid
order by
  CASE WHEN application_status = "IP_QUAL_REASSI" THEN application_status END desc,
  CASE WHEN application_status = "IP_COMPLE" THEN application_status END desc,
  CASE WHEN application_status LIKE "IP_FRESH%" THEN application_status END desc,
  CASE WHEN application_status LIKE "IP_%" THEN application_status END desc
Please help me write the same query in Elasticsearch. I am not able to find an order-by-value equivalent for strings in Elasticsearch. Searching online, I understood that I should use sort scripts, but I was not able to find any proper documentation.
I have the following query, which just does a simple sort.
{
  "size": 500,
  "from": 0,
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script": {
      "type": "string",
      "script": {
        "source": "doc['application_status.keyword'].value",
        "params": {
          "factor": ["IP_QUAL_REASS", "IP_COMPLE"]
        }
      },
      "order": "desc"
    }
  }
}
In the above query, I am not using the params section, as I do not know how to use it with type: string.
I realise I am asking a lot. Any help or relevant documentation links would be greatly appreciated. I hope the question is clear; I'll provide more details if necessary.

You have two options:
The most performant one is to index, at indexing time, an additional field that is a number. This number (your choice) is the numerical representation of that status; at search time you then simply sort by that number instead of the status (see the sketch after the script examples below).
At search time, use a script that does almost the same thing as the first option, but dynamically, and less performantly (though still quite fast).
Below you have the second option:
"sort": {
"_script": {
"type": "number",
"script": {
"source": "if (params.factor[0].containsKey(doc['application_status.keyword'].value)) return params.factor[0].get(doc['application_status.keyword'].value); else return 1000;",
"params": {
"factor": [{
"IP_QUAL_REASS":1,
"IP_COMPLE":2,
"whatever":3
}
]
}
},
"order": "asc"
}
}
If you also want things like LIKE 'WHATEVER%', my suggestion is to consider an indexing-time change rather than a search-time one, because the script gets more complex. But this is the version that handles wildcard (prefix) matches as well:
"sort": {
"_script": {
"type": "number",
"script": {
"source": "if (params.factor[0].containsKey(doc['application_status.keyword'].value)) return params.factor[0].get(doc['application_status.keyword'].value); else { params.wildcard_factors[0].entrySet().stream().filter(kv -> doc['application_status.keyword'].value.startsWith(kv.getKey())).map(Map.Entry::getValue).findFirst().orElse(1000)}",
"params": {
"factor": [
{
"IP_QUAL_REASS": 1,
"IP_COMPLE": 2,
"whatever": 3
}
],
"wildcard_factors": [
{
"REJ_": 66
}
]
}
},
"order": "asc"
}
}
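For completeness, here is a minimal sketch of the first option. The index name and the application_status_order field below are purely illustrative (you would compute the numeric rank yourself while importing from MySQL); the idea is simply to store the rank next to the status and sort on the number:
PUT lead_index
{
  "mappings": {
    "lead": {
      "properties": {
        "application_status": { "type": "keyword" },
        "application_status_order": { "type": "integer" }
      }
    }
  }
}
GET lead_index/_search
{
  "size": 50,
  "sort": [
    { "application_status_order": { "order": "asc" } },
    { "application_status": { "order": "desc" } }
  ]
}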

Related

Number of nested objects in Elasticsearch

Looking for a way to get the number of nested objects, for querying, sorting etc.
For example, given this index:
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "some_id": { "type": "long" },
      "user": {
        "type": "nested",
        "properties": {
          "first": {
            "type": "keyword"
          },
          "last": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
PUT my-index-000001/_doc/1
{
  "some_id": 111,
  "user": [
    {
      "first": "John",
      "last": "Smith"
    },
    {
      "first": "Alice",
      "last": "White"
    }
  ]
}
How can I filter by the number of users (e.g. a query fetching all documents with more than XX users)?
I was thinking of using a runtime_field, but this gives an error:
GET my-index-000001/_search
{
  "runtime_mappings": {
    "num": {
      "type": "long",
      "script": {
        "source": "emit(doc['some_id'].value)"
      }
    },
    "num1": {
      "type": "long",
      "script": {
        "source": "emit(doc['user'].size())" // <- this breaks with "No field found for [user] in mapping"
      }
    }
  },
  "fields": [
    "num", "num1"
  ]
}
Is it perhaps possible using aggregations?
It would also be nice to know if I can sort the results (e.g. all documents with more than XX users, sorted desc by that count).
Thanks.
You cannot query this efficiently.
It is possible to use the hack below, but I would only do it for one-time fetching, not for a regular use case: it uses params._source and is therefore really slow when you have a lot of docs.
{
  "query": {
    "function_score": {
      "min_score": 1, # -> min number of nested docs to filter by
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "script_score": {
            "script": "params._source['user'].size()"
          }
        }
      ],
      "boost_mode": "replace"
    }
  }
}
It basically calculates a new score for each doc, where the score is equal to the length of the users array, and then removes from the results all docs whose score is below min_score.
The best way to do this is to add a userCount field at indexing time (since you know how many elements there are) and then query that field using a range query. Very simple, efficient and fast.
Each element of the nested array is a document in itself, and thus, not queryable via the root-level document.
If you cannot re-create your index, you can leverage the _update_by_query endpoint in order to add that field:
POST my-index-000001/_update_by_query?wait_for_completion=false
{
  "script": {
    "source": """
      ctx._source.userCount = ctx._source.user.size()
    """
  }
}
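Once userCount is populated (dynamic mapping should map it as a long, or you can add it to the mapping explicitly), filtering and sorting become a plain range query plus a field sort. A minimal sketch, where the threshold 1 is only an example:
GET my-index-000001/_search
{
  "query": {
    "range": {
      "userCount": { "gt": 1 }
    }
  },
  "sort": [
    { "userCount": { "order": "desc" } }
  ]
}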

Composite and Terms Aggregations on a field with a high cardinality

I am facing a huge performance problem with ES, with response times of more than 2 minutes.
I have an index with more than 25M files that contains the following 4 fields (among others):
...
"group_write": {
  "type": "text",
  "fields": {
    "raw": {
      "type": "keyword"
    }
  }
},
"user_write": {
  "type": "text",
  "fields": {
    "raw": {
      "type": "keyword"
    }
  }
},
"group_read": {
  "type": "text",
  "fields": {
    "raw": {
      "type": "keyword"
    }
  }
},
"user_read": {
  "type": "text",
  "fields": {
    "raw": {
      "type": "keyword"
    }
  }
}
}
...
I have something like 100K unique users and groups and each field is a list of users/groups that holds ~100 values. For example:
"user_read": ["user_1", "group_1", ...],
"user_write": ["user_1", "group_2", ...]
...
I use 2 kinds of aggregations: composite and terms. Composite aggregations are for getting only the first X results to display, and terms aggregations are for prefix search.
Composite aggregation:
{
  "size": 0,
  "aggs": {
    "Group_Read_Permissions": {
      "composite": {
        "sources": [
          {
            "Group Read": {
              "terms": {
                "field": "group_read.raw"
              }
            }
          }
        ],
        "size": 10
      }
    },
    "Group_Write_Permissions": {
      "composite": {
        "sources": [
          {
            "Group Write": {
              "terms": {
                "field": "group_write.raw"
              }
            }
          }
        ]
      }
    },
    "User_Write_Permissions": {
      "composite": {
        "sources": [
          {
            "User Write": {
              "terms": {
                "field": "user_write.raw"
              }
            }
          }
        ]
      }
    },
    "User_Read_Permissions": {
      "composite": {
        "sources": [
          {
            "User Read": {
              "terms": {
                "field": "user_read.raw"
              }
            }
          }
        ]
      }
    }
  }
}
Terms aggregation:
{
  "size": 0,
  "aggs": {
    "Group_Read_Permissions": {
      "terms": {
        "field": "group_read.raw",
        "include": ".*[Ss].*"
      }
    },
    "Group Write Permissions": {
      "terms": {
        "field": "group_write.raw",
        "include": ".*[Ss].*"
      }
    },
    "User Read Permissions": {
      "terms": {
        "field": "user_read.raw",
        "include": ".*[Ss].*"
      }
    },
    "User Write Permissions": {
      "terms": {
        "field": "user_write.raw",
        "include": ".*[Ss].*"
      }
    }
  }
}
Composite aggregation returns results within 1 min and the terms aggregation can take up to 5 min.
What I have tried so far:
Adding a new field user_group_permissions and adding "copy_to": "user_group_permissions" to the above 4 fields
Adding the property "eager_global_ordinals": true to the above 4 fields and to the user_group_permissions field
Increasing the refresh_interval up to 200s
** I reindexed for the first 2 suggestions (it took something like 6 hours)
All of the above helped a little with the retrieval time, but the composite aggregation still takes up to 20s and the terms aggregation up to 3 min.
(The best results were on the user_group_permissions field created in the first suggestion, with eager_global_ordinals = true and refresh_interval = 120s.)
If someone has any idea how to improve the retrieval times, I will be grateful.
First of all, if you only need the first 10 results, you don't need to use the composite aggregation, which is meant to be used only if you need to paginate over all results. Simply use the terms aggregation with its default size of 10; that'll do the job.
Second, what you're doing with the terms aggregation is not prefix filtering but infix filtering, which is completely different in terms of performance. While it's easy to search for prefixes, searching for infixes requires the equivalent of a "full table scan", because each and every term must be visited.
A first optimization I would suggest is that, in your second query, you do your regex in the query part (a bool/should with one regexp query per field), so as to reduce the document set on which the terms aggregations need to run. That might help a bit; a sketch follows.
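A rough sketch of that first optimization, reusing the field names and regex from the question; only one of the four aggregations is shown for brevity:
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        { "regexp": { "group_read.raw": ".*[Ss].*" } },
        { "regexp": { "group_write.raw": ".*[Ss].*" } },
        { "regexp": { "user_read.raw": ".*[Ss].*" } },
        { "regexp": { "user_write.raw": ".*[Ss].*" } }
      ],
      "minimum_should_match": 1
    }
  },
  "aggs": {
    "Group_Read_Permissions": {
      "terms": {
        "field": "group_read.raw",
        "include": ".*[Ss].*"
      }
    }
  }
}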
A second optimization is to leverage the wildcard field type, which is a specialized field type made especially for grep-like wildcard and regexp queries (see the mapping sketch below).
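For reference, mapping one of the fields with a wildcard sub-field could look like the sketch below. Note that the wildcard type only exists in more recent Elasticsearch versions (7.9+), so it may require an upgrade:
"group_read": {
  "type": "text",
  "fields": {
    "raw": {
      "type": "wildcard"
    }
  }
}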
Another possible optimization is to lowercase all your permissions, so that you only need to search for .*s.* instead of the uppercase variant.
Depending on your comments, I'll add more optimizations as the discussion goes on.

ElasticSearch sort by value

I have Elasticsearch 5 and I would like to sort based on a field value. Imagine having documents with a category, e.g. genre, which could have values like sci-fi, drama, and comedy. While searching, I would like to order the results so that comedies come first, then sci-fi, and drama last. Within those groups I will of course order by other criteria. Could somebody point me to how to do this?
Elasticsearch Sort Using Manual Ordering
This is possible in Elasticsearch, where you can assign an order to particular values of a field.
I've implemented what you are looking for using script-based sorting, which makes use of a Painless script. You can refer to the links I've mentioned to learn more; the query below should do what you are looking for.
I'm also assuming you have genre and movie with keyword sub-fields, as in the mapping below.
PUT sampleindex
{
  "mappings": {
    "_doc": {
      "properties": {
        "genre": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        },
        "movie": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
You can use the query below to get what you are looking for:
GET sampleindex/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_script": {
        "type": "number",
        "script": {
          "lang": "painless",
          "inline": "if (params.scores.containsKey(doc['genre.raw'].value)) { return params.scores[doc['genre.raw'].value]; } return 100000;",
          "params": {
            "scores": {
              "comedy": 0,
              "sci-fi": 1,
              "drama": 2
            }
          }
        },
        "order": "asc"
      }
    },
    {
      "movie.raw": {
        "order": "asc"
      }
    }
  ]
}
Note that I've included a sort for both genre and movie. The sort happens on genre first, and then, within each genre, results are further sorted by movie.
Hope it helps.

Aggregation in Elasticsearch

I need help with aggregation in Elasticsearch. Is it possible to aggregate the values of a particular field as an array or list? This is more of a grouping: for example, instead of getting the result as
{"Book_Id":"102","Review_Text":"DescentRead"},{"Book_Id":"102","Review_Text":"For Kids."},{"Book_Id":"103","Review_Text":"Great"},{"Book_Id":"103","Review_Text":"Excellent"}
can I get all the reviews of each book as a list?
[ { "Book_Id": 102, "Review_Text": [ "DescentRead", "For Kids." ] }, { "Book_Id": 103, "Review_Text": [ "Great", "Excellent" ] } ]
I tried a few things with aggs but was not able to get it. Any pointers would help!
Could aggregations with top_hits work? The limitation is that you need to specify a maximum number of hits per aggregation (the example below returns the top 100 results per book ID, ordered by the review text), but apart from that you can run it as a normal query and specify which fields to return, how they should be sorted (to get the top hits), etc.
Example aggs query:
POST http://myserver:9200/books/book/_search
{
  "size": 0,
  "aggs": {
    "BookReviews": {
      "terms": {
        "field": "Book_Id.keyword"
      },
      "aggs": {
        "top_reviews": {
          "top_hits": {
            "sort": [ { "Review_Text.keyword": { "order": "desc" } } ],
            "size": 100,
            "_source": {
              "includes": [ "Review_Text" ]
            }
          }
        }
      }
    }
  }
}
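For the four sample reviews in the question, the aggregations part of the response would come back roughly in the shape below (hit metadata such as _index, _id and _score trimmed for brevity), which gives you the per-book list of reviews you asked for:
"aggregations": {
  "BookReviews": {
    "buckets": [
      {
        "key": "102",
        "doc_count": 2,
        "top_reviews": {
          "hits": {
            "hits": [
              { "_source": { "Review_Text": "For Kids." } },
              { "_source": { "Review_Text": "DescentRead" } }
            ]
          }
        }
      },
      {
        "key": "103",
        "doc_count": 2,
        "top_reviews": {
          "hits": {
            "hits": [
              { "_source": { "Review_Text": "Great" } },
              { "_source": { "Review_Text": "Excellent" } }
            ]
          }
        }
      }
    ]
  }
}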
Note that for the aggregation names ("BookReviews" and "top_reviews") you can use any name you choose; that same name will appear in the resulting aggregation tree. You can do multi-level aggregations on terms in your index, and include top hits on any level, typically for drill-down reporting or similar cases.
Mapping used:
{
  "books": {
    "mappings": {
      "book": {
        "properties": {
          "Book_Id": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Review_Text": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}
"size": 0 in the root node will omit any hits for the search and only return the aggs trees.
You can also add a normal "query": {} block on the same level as size and aggs if you need to filter the documents before Elasticsearch starts aggregating, for example:
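For instance, a hypothetical filter that only aggregates the reviews of book 102 (field names match the mapping above):
{
  "size": 0,
  "query": {
    "term": { "Book_Id.keyword": "102" }
  },
  "aggs": {
    "BookReviews": {
      "terms": { "field": "Book_Id.keyword" },
      "aggs": {
        "top_reviews": {
          "top_hits": {
            "size": 100,
            "_source": { "includes": [ "Review_Text" ] }
          }
        }
      }
    }
  }
}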
Read more in the elasticsearch documentation pages:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html
(If you provide a more complete example dataset, we can give a more realistic example query, as there isn't a lot of data in the example for sorting or scoring the results)

How to sort ordinal values in elasticsearch?

Say I've got a field 'spicey' with possible values 'hot', 'hotter', 'smoking'.
There's an intrinsic ordering in these values: they're ordinals.
I'd like to be able to sort or filter on them using their intrinsic order. For example: give me all documents where spicey > hot.
Sure, I can translate the values to integers 0, 1, 2, but this requires extra housekeeping on both the index and the query side, which I'd rather avoid.
Is this possible in some way? I've already contemplated using a multi-field mapping, but I'm not sure if that would help me.
You can sort based on string values by scripting the sort operation, so that you assign each spicey string a specific numeric value.
curl -XGET 'http://localhost:9200/yourindex/yourtype/_search' -d '
{
  "sort": {
    "_script": {
      "script": "factor.get(doc[\"spicey\"].value)",
      "type": "number",
      "params": {
        "factor": {
          "hot": 0,
          "hotter": 1,
          "smoking": 2
        }
      },
      "order": "asc"
    }
  }
}'
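That syntax dates back to the older scripting languages; on Elasticsearch 5+ with Painless, and assuming spicey is mapped as (or has a sub-field that is) a keyword, a roughly equivalent sort script would be:
{
  "sort": {
    "_script": {
      "type": "number",
      "script": {
        "lang": "painless",
        "source": "if (params.factor.containsKey(doc['spicey'].value)) { return params.factor[doc['spicey'].value]; } return 100;",
        "params": {
          "factor": {
            "hot": 0,
            "hotter": 1,
            "smoking": 2
          }
        }
      },
      "order": "asc"
    }
  }
}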
One solution could be to create a specific analyzer for spice levels. The idea is to map each level to a discrete value which increases with spiciness.
{
  "settings": {
    "analysis": {
      "char_filter": {
        "spices": {
          "type": "mapping",
          "mappings": [
            "mild=>1",
            "hot=>2",
            "hotter=>3",
            "smoking=>4"
          ]
        }
      },
      "analyzer": {
        "spice_synonyms": {
          "type": "custom",
          "char_filter": "spices",
          "tokenizer": "standard",
          "filter": [
            "standard"
          ]
        }
      }
    }
  },
  "mappings": {
    "ordinal": {
      "properties": {
        "spicy": {
          "type": "string",
          "fields": {
            "level": {
              "type": "string",
              "analyzer": "spice_synonyms"
            }
          }
        }
      }
    }
  }
}
In the above index settings and mappings, the spicy field would contain the plain English word (hot, mild, etc.) while the spicy.level field would contain a discrete value that you can then use in queries and sorting.
For instance, retrieving documents whose spice level is strictly bigger than hot and ordered in decreasing order (smoking first) could be done like this:
{
  "sort": {
    "spicy.level": "desc"
  },
  "query": {
    "query_string": {
      "query": "spicy.level:>2"
    }
  }
}
or a range query would work, too
{
  "sort": {
    "spicy.level": "desc"
  },
  "query": {
    "range": {
      "spicy.level": {
        "gt": 2
      }
    }
  }
}
