Sort documents by size of a field - elasticsearch

I have documents like below indexed,
1.
{
"name": "Gilly",
"hobbyName" : "coin collection",
"countries": ["US","France","Georgia"]
}
2.
{
"name": "Billy",
"hobbyName":"coin collection",
"countries":["UK","Ghana","China","France"]
}
Now I need to sort these documents by the array length of the "countries" field, so that the sorted result is in the order document 2, document 1. How can I achieve this using Elasticsearch?

You can use script based sorting to achieve this.
{
"query": {
"match_all": {}
},
"sort": {
"_script": {
"type": "number",
"script": "doc['countries'].values.size()",
"order": "desc"
}
}
}

I would suggest using token count type in Elasticsearch.
It can be done with scripts (you can check here for how to do it using scripts), but then the results won't be perfect.
Scripts mostly use the field data cache, and duplicates are removed in it.
You can read more on how to use token count type here.
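To see why a script-based count can be off, here is a minimal Python sketch. It assumes, as the answer above describes, that the values a script sees for a field have had duplicates removed; the countries list with a repeated entry is made up for illustration.

```python
# Hypothetical document whose "countries" array repeats a value.
countries = ["US", "France", "US", "Georgia"]

# Length of the array as it was written in the source document.
source_length = len(countries)

# Length a script would report if duplicates are removed first
# (assumption taken from the answer above).
deduplicated_length = len(set(countries))

print(source_length, deduplicated_length)
```

If the arrays never contain repeated values, the two counts agree and the script-based sort gives the expected order.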

Related

elasticsearch - how to combine results from two indexes

I have CDR log entries in Elasticsearch as in the below format. While creating this document, I won't have info about delivery_status field.
{
msgId: "384573847",
msgText: "Message text to be delivered",
submit_status: true,
...
delivery_status: //comes later
}
Later when delivery status becomes available, I can update this record.
But I have seen that update queries bring down the rate of ingestion. With pure inserts using bulk operations, I can reach up to 3000 or more transactions/sec, but if I combine them with updates, the ingestion rate becomes very slow and crawls at 100 or fewer txns/sec.
So, I am thinking that I could create another index like below, where I store the delivery status along with msgId:
{
msgId:384573847,
delivery_status: 0
}
With this approach, I end up with 2 indices (similar to master-detail tables in an RDBMS). Is there a way to query the record by joining these indices? I have heard of aliases, but could not fully understand its concept and whether it can be applied in my use case.
Thanks to anyone helping me out with suggestions.
As you mentioned, you can index both documents in separate indices and use the collapse functionality of Elasticsearch to retrieve both documents.
Let's say you have indexed the documents in index2 and index3 and both have a common msgId; then you can use the query below:
POST index2,index3/_search
{
"query": {
"match_all": {}
},
"collapse": {
"field": "msgId",
"inner_hits": {
"name": "most_recent",
"size": 5
}
}
}
But again, you need to consider query performance with a large data set. You can do some benchmarking to evaluate query performance and decide whether combining the data at index time or at query time will be better.
Regarding aliases: in the query above we provide index2,index3 (comma separated) as the index name. If you use an alias, you can use a single unified name to query both indices.
You can add both indices to a single alias using the command below:
POST _aliases
{
"actions": [
{
"add": {
"index": "index3",
"alias": "order"
}
},
{
"add": {
"index": "index2",
"alias": "order"
}
}
]
}
Now you can use the query below with the alias name instead of the index names:
POST order/_search
{
"query": {
"match_all": {}
},
"collapse": {
"field": "msgId",
"inner_hits": {
"name": "most_recent",
"size": 5
}
}
}
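The grouping by msgId that the collapse query asks for can be pictured with a short Python sketch. It only illustrates the idea on made-up hits in application code and is not what Elasticsearch does internally; merge_by_msg_id is a hypothetical helper, and msgId is assumed to be stored as the same type (a string here) in both indices.

```python
from collections import defaultdict

# Hypothetical hits from the two indices, reduced to their _source.
message_hits = [
    {"msgId": "384573847", "msgText": "Message text to be delivered", "submit_status": True},
]
status_hits = [
    {"msgId": "384573847", "delivery_status": 0},
]


def merge_by_msg_id(*hit_lists):
    """Merge documents that share a msgId into one record per msgId."""
    merged = defaultdict(dict)
    for hits in hit_lists:
        for source in hits:
            merged[source["msgId"]].update(source)
    return dict(merged)


records = merge_by_msg_id(message_hits, status_hits)
print(records["384573847"])
```

If msgId were a string in one index and a number in the other, the two documents would not end up under the same key, so it is worth keeping the type consistent.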

How to rank ElasticSearch documents based on scores

I have an Elasticsearch index that contains thousands of documents; each document represents a user.
Each document has a set of fields (is_verified: boolean, country: string, is_creator: boolean). I also have another service that calls ES search to look up documents. How can I rank the retrieved documents based on those fields? For example, a verified user that matches should come before an unverified one.
Is there some kind of document scoring while indexing the documents? If yes, can I modify it based on my criteria?
What should I read to understand how to rank in Elasticsearch?
Thanks.
I guess the sorting function mentioned by Mikael is pretty straightforward and should cover your use cases. Check the Elastic docs for more information on that.
But in case you want to do really fancy sorting, maybe you could use a bool query and different boost values to set your desired relevancy for each matched field. I tried to come up with a real-life example, but honestly didn't find one. For the sake of completeness, the following snippet should give you an idea of how to achieve similar results as with the sort API (but still, I would prefer using sort).
GET /yourindexname/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "Monica"
}
}
],
"should": [
{
"term": {
"is_verified": {
"value": true,
"boost": 2
}
}
},
{
"term": {
"is_creator": {
"value": true,
"boost": 2
}
}
}
]
}
}
}
"Is there some kind of document scoring while indexing the documents? If yes, can I modify it based on my criteria?"
I wouldn't assign a fixed score to a document while indexing, as the score should depend on the query. However, if you insist on having a predefined relevancy for each document, theoretically you could add a field relevancy holding that value for ordering and use it later in the query:
GET /yourindexname/_search
{
"query" : {
"match" : {
"name": "Monica"
}
},
"sort" : [
{
"relevancy": {
"order": "desc"
}
},
"_score"
]
}
You can consider using the Sort API inside your search queries. In the example below we search on the field country and sort the results by the Boolean field is_verified. You can also add the other Boolean field inside the sort brackets.
GET /yourindexname/_search
{
"query" : {
"match" : {
"country": "Iceland"
}
},
"sort" : [
{
"is_verified": {
"order": "desc"
}
}
]
}

Elasticsearch aggregate by field prefix

I have data entries of the form
{
"id": "ABCxxx",
// Other fields
}
Where ABC is a unique identifier that defines the "type" of this record. (For example a user would be USR1234..., an image would be IMG1234...)
I want to get a list of all the different types of records that I have in my ES. So in essence I want to do a sort by id but only looking at the first three characters of the id.
This doesn't work obviously, because it sorts by id (so USR123 is different than USR456):
{
"fields": ["id"],
"aggs": {
"group_by_id": {
"terms": {
"field": "id"
}
}
}
}
How do I write this query?
You can use the Painless scripting language to get this accomplished.
{
"fields": ["id"],
"aggs": {
"group_by_id": {
"terms": {
"script" : {
"inline": "doc['id'].value.substring(0,3)",
"lang": "painless"
}
}
}
}
}
More info here. Please note that the syntax for the substring method may not be exactly right.
As paqash already suggested, this can be achieved via a script, but I would suggest an alternative: storing "type" as a separate field altogether in your schema.
For example:
USR1234 : {id:"USR1234", type:"USR"}
IMG1234 : {id:"IMG1234", type:"IMG"}
This would avoid unnecessary complications in scripting and keep your query interface clean.
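What the prefix-based grouping computes can be sketched in a few lines of Python. The ids below are made up, and record_type is a hypothetical helper; the same function could be used to fill the separate type field at index time.

```python
from collections import Counter

# Hypothetical ids; the first three characters encode the record type.
ids = ["USR1234", "USR4567", "IMG1234", "IMG9999", "IMG0001"]


def record_type(doc_id):
    """Return the three-character type prefix of an id."""
    return doc_id[:3]


# Count how many records exist for each type prefix.
type_counts = Counter(record_type(doc_id) for doc_id in ids)
print(type_counts)
```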

elasticsearch sort by document id

I have a simple index in elasticsearch and all my ids are manually added, i.e. I do not add documents with automatic string ids.
Now the requirement is to get list of all documents page by page and sorted by the document id (i.e. _id)
When I tried this with _id, it did not work. Then I looked on forums and found that I have to use _uid for that. This actually works, although I have no clue how. But another problem is that the sorting is done as if the _id is a string. And it actually is a string. But I want the results as if the _id was a number.
So there are two issues here:
Why does sorting not work with _id while it does work with _uid?
Is there a way to get document ids sorted as numbers and not strings?
For e.g. if my doc ids are 1, 2, 3, ..... , 55
I am getting results in this order:
1, 10, 11, 12, ... , 19, 2, 20, ... so on
While I would like to get the results in this order:
1, 2, 3, ... so on
Any help is highly appreciated!
Have the _id indexed:
{
"mappings": {
"some_type": {
"_id": {
"index": "not_analyzed"
}
}
}
}
And use a script:
{
"sort": {
"_script": {
"type": "number",
"script": "doc['_id'].value?.isInteger()?doc['_id'].value.toFloat():null",
"order": "asc"
}
}
}
That said, I strongly recommend, if possible, changing the id to an integer rather than keeping it as a string that contains numbers.
And I kind of doubt that it worked with _uid, because _uid is a combination of type and id.
For some reason the code above didn't work for me ("dynamic method [java.lang.String, isInteger/0] not found").
However, the script below works (only if your _id values can be converted into integers):
GET ENDPOINT/INDEX/_search
{
"sort": {
"_script": {
"type": "number",
"script": "return Integer.parseInt(doc['_id'].value)",
"order": "desc" // I personally needed descending
}
}
}
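The difference between the string ordering from the question and the numeric ordering that the script produces can be reproduced in plain Python. This is only an illustration with made-up ids 1 to 12, not Elasticsearch code.

```python
# Document ids are strings, as in the question.
doc_ids = [str(n) for n in range(1, 13)]

# A plain string sort compares character by character, so "10" sorts before "2".
as_strings = sorted(doc_ids)

# Converting each id to an integer for the sort key restores numeric order.
as_numbers = sorted(doc_ids, key=int)

print(as_strings)
print(as_numbers)
```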
Instead of id, I used id.keyword and it worked. Sample code below:
GET index_name/_search
{
"query": {
"match_all": {}
},
"sort": [
{
"id.keyword": {
"order": "desc"
}
}
]
}

ElasticSearch - sort search results by relevance and custom field (Date)

For example, I have entities with two fields, Text and Date. I want to search these entities with the results sorted by Date. But if I do it naively, the result is unexpected.
For the search query "Iphone 6", the newest texts containing only "6" are at the top of the results, not those with "iphone 6". Without sorting the results seem fine, but they are not ordered by Date as I want.
How do I write a custom sort function that considers both relevance and Date? Or is there a way to give a weight to the Date field that will be considered in scoring?
In addition, I may want to suppress search results that contain only "6". How can I customize the search to find results only by bigrams, for example?
Did you try a bool query like this?
{
"query": {
"bool": {
"must": {
"match": {
"field": "iphone 6"
}
}
}
},
"sort": {
"date": {
"order": "desc"
}
}
}
Or, keeping your query, you can do this, which is a more appropriate way of doing it, I guess.
Just add this as the sort:
"sort": [
{ "date": { "order": "desc" }},
{ "_score": { "order": "desc" }}
]
All matching results are sorted first by date, then by relevance.
The solution is to use _score and the date field both in sort. _score as the first sort order and date field as secondary sort order.
You can use simple match query to perform relevance match.
Try it out.
Data setup:
POST ecom/prod
{
"name":"iphone 6",
"date":"2019-02-10"
}
POST ecom/prod
{
"name":"iphone 5",
"date":"2019-01-10"
}
POST ecom/prod
{
"name":"iphone 6",
"date":"2019-02-28"
}
POST ecom/prod
{
"name":"6",
"date":"2019-03-01"
}
Query for relevance and date based sorting:
POST ecom/prod/_search
{
"query": {
"match": {
"name": "iphone 6"
}
},
"sort": [
{
"_score": {
"order": "desc"
}
},
{
"date": {
"order": "desc"
}
}
]
}
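How a two-level sort orders the four sample documents can be sketched in Python. The score values here are invented for illustration (real _score values depend on the index and the query); only the names and dates come from the data setup above.

```python
from datetime import date

# Sample documents from the data setup; the scores are hypothetical.
hits = [
    {"name": "iphone 6", "date": date(2019, 2, 10), "score": 1.2},
    {"name": "iphone 5", "date": date(2019, 1, 10), "score": 0.4},
    {"name": "iphone 6", "date": date(2019, 2, 28), "score": 1.2},
    {"name": "6", "date": date(2019, 3, 1), "score": 0.7},
]

# Sort by score descending, then by date descending for equal scores.
ranked = sorted(hits, key=lambda h: (h["score"], h["date"]), reverse=True)

for hit in ranked:
    print(hit["score"], hit["date"].isoformat(), hit["name"])
```

Note that the date only breaks ties between documents with exactly the same score, so it matters most when many documents score identically.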
You could definitely use a phrase matching query for this.
It does position-aware matching, so documents will be considered a match for your query only if both "iphone" and "6" occur in the searched fields AND their occurrences respect this order, with "iphone" showing up before "6".
Looks like you want to sort first by relevance and then by date. This query will do it.
{ "query" : {
"match" : {
"my_field" : "my query"
}
},
"sort": {
"pubDate": {
"order": "desc",
"mode": "min"
}
}
}
When sorting on fields with more than one value, remember that the
values do not have any intrinsic order; a multivalue field is just a
bag of values. Which one do you choose to sort on? For numbers and
dates, you can reduce a multivalue field to a single value by using
the min, max, avg, or sum sort modes. For instance, you could sort on
the earliest date in each dates field by using the above query.
Source: Elasticsearch guide, sorting
I think your relevance is broken. You should use two different analyzers, one for setting up your index and another for searching, like this:
PUT /my_index/my_type/_mapping
{
"my_type": {
"properties": {
"name": {
"type": "string",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
}
Also, you can read more about this here: https://www.elastic.co/guide/en/elasticsearch/guide/master/_index_time_search_as_you_type.html
Once you fix the relevance then sorting should work correctly.
