Sorting by distance on a numeric field in ElasticSearch

For a project I need to select documents that are within filter bounds and closest to a numeric value. This is about a price, and I cannot seem to find whether this is possible.
Say I have 2 documents:
{
"name": "Document1",
"price": 46.12,
"tags": ["tag1", "tag2"]
}
{
"name": "Document2",
"price": 82.29,
"tags": ["tag1", "tag3"]
}
Is it possible to get the document with the price closest to 66.23?

The answer (thanks to keety) was to enable dynamic scripting and add a sorting method like this:
{
"query": {
"filtered" : {
"query":{
"match_all" : { }
},
"filter": {}
}
},
"sort" : {
"_script" : {
"script" : "cur = (factor - doc['age'].value); if (cur < 0) { cur = cur * -1 } else { cur = cur}",
"type" : "number",
"params" : {
"factor" : 45
},
"order" : "asc"
}
}
}
This sorts from closest to farthest and works like a charm. Thanks!
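Note that this uses the old filtered query and dynamic scripting. On current Elasticsearch versions the same idea can be expressed as a Painless script sort; here is a minimal sketch against the example documents above (the index name is made up):
GET products/_search
{
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script": {
      "type": "number",
      "order": "asc",
      "script": {
        "lang": "painless",
        "params": { "target": 66.23 },
        "source": "Math.abs(params.target - doc['price'].value)"
      }
    }
  }
}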

Related

Using NestedPath in Script Sort Elastic Search doesn't allow accessing outer properties

I need to sort based on two logical parts in a script. For each document, the minimum value (the distance of the HQ and of the offices from a given location) is calculated and returned for sorting. Since I need to return only one value, I need to combine the scripts that calculate the distance between the headquarters and the given location and between the multiple offices and the given location.
I tried to combine them, but Offices is a nested property and Headquarter is a non-nested property. If I use "NestedPath", somehow I am not able to access the Headquarter property; without "NestedPath", I am not able to use the Offices property. Here is the mapping:
"offices" : {
"type" : "nested",
"properties" : {
"coordinates" : {
"type" : "geo_point",
"fields" : {
"raw" : {
"type" : "text",
"index" : false
}
},
"ignore_malformed" : true
},
"state" : {
"type" : "text"
}
}
},
"headquarters" : {
"properties" : {
"coordinates" : {
"type" : "geo_point",
"fields" : {
"raw" : {
"type" : "text",
"index" : false
}
},
"ignore_malformed" : true
},
"state" : {
"type" : "text"
}
}
}
And here is the script that I tried:
"sort": [
{
"_script": {
"nested" : {
"path" : "offices"
},
"order": "asc",
"script": {
"lang": "painless",
"params": {
"lat": 28.9672,
"lon": -98.4786
},
"source": "def hqDistance = 1000000;if (!doc['headquarters.coordinates'].empty){hqDistance = doc['headquarters.coordinates'].arcDistance(params.lat, params.lon) * 0.000621371;} def officeDistance= doc['offices.coordinates'].arcDistance(params.lat, params.lon) * 0.000621371; if (hqDistance < officeDistance) { return hqDistance; } return officeDistance;"
},
"type": "Number"
}
}
],
When I run the script, it seems the Headquarters logic is not even executed; I get results based only on the offices distance.
Nested fields operate in a separate context and their content cannot be accessed from the outer level, nor vice versa.
You can, however, access a document's raw _source.
But there's a catch:
See, when iterating under the offices nested path, you were able to call .arcDistance because the coordinates are of type ScriptDocValues.GeoPoint.
But once you access the raw _source, you'll be dealing with an unoptimized set of java.util.ArrayLists and java.util.HashMaps.
This means that even though you can iterate an array list:
...
for (def office : params._source['offices']) {
// office.coordinates is a trivial HashMap of {lat, lon}!
}
calculating geo distances won't be directly possible…
…unless you write your own geoDistance function -- which is perfectly fine with Painless, but it'll need to be defined at the top of a script.
No need to reinvent the wheel though: Calculating distance between two points, using latitude longitude?
A sample implementation
Assuming your documents look like this:
POST my-index/_doc
{
"offices": [
{
"coordinates": "39.9,-74.92",
"state": "New Jersey"
}
],
"headquarters": {
"coordinates": {
"lat": 40.7128,
"lon": -74.006
},
"state": "NYC"
}
}
your sorting script could look like this:
GET my-index/_search
{
"sort": [
{
"_script": {
"order": "asc",
"script": {
"lang": "painless",
"params": {
"lat": 28.9672,
"lon": -98.4786
},
"source": """
// We can declare functions at the beginning of a Painless script
// https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-functions.html#painless-functions
double deg2rad(double deg) {
return (deg * Math.PI / 180.0);
}
double rad2deg(double rad) {
return (rad * 180.0 / Math.PI);
}
// https://stackoverflow.com/a/3694410/8160318
double geoDistanceInMiles(def lat1, def lon1, def lat2, def lon2) {
double theta = lon1 - lon2;
double dist = Math.sin(deg2rad(lat1)) * Math.sin(deg2rad(lat2)) + Math.cos(deg2rad(lat1)) * Math.cos(deg2rad(lat2)) * Math.cos(deg2rad(theta));
dist = Math.acos(dist);
dist = rad2deg(dist);
return dist * 60 * 1.1515;
}
// start off arbitrarily high
def hqDistance = 1000000;
if (!doc['headquarters.coordinates'].empty) {
hqDistance = doc['headquarters.coordinates'].arcDistance(params.lat, params.lon) * 0.000621371;
}
// assume office distance as large as hq distance
def officeDistance = hqDistance;
// iterate each office and compare it to the currently lowest officeDistance
for (def office : params._source['offices']) {
// the coordinates are formatted as "lat,lon" so let's split...
def latLong = Arrays.asList(office.coordinates.splitOnToken(","));
// ...and parse them before passing onwards
def tmpOfficeDistance = geoDistanceInMiles(Float.parseFloat(latLong[0]),
Float.parseFloat(latLong[1]),
params.lat,
params.lon);
// we're interested in the nearest office...
if (tmpOfficeDistance < officeDistance) {
officeDistance = tmpOfficeDistance;
}
}
if (hqDistance < officeDistance) {
return hqDistance;
}
return officeDistance;
"""
},
"type": "Number"
}
}
]
}
Shameless plug: I dive deep into Elasticsearch scripting in a dedicated chapter of my ES Handbook.

ElasticSearch Accessing Nested Documents in Script - Null Pointer Exception

Gist: Trying to write a custom filter on nested documents using Painless. Want to write error checks to get past the null_pointer_exception when there are no nested documents.
I have a mapping as such (simplified and obfuscated)
{
"video_entry" : {
"aliases" : { },
"mappings" : {
"properties" : {
"captions_added" : {
"type" : "boolean"
},
"category" : {
"type" : "keyword"
},
"is_votable" : {
"type" : "boolean"
},
"members" : {
"type" : "nested",
"properties" : {
"country" : {
"type" : "keyword",
},
"date_of_birth" : {
"type" : "date",
}
}
}
}
Each video_entry document can have 0 or more members nested documents.
Sample Document
{
"captions_added": true,
"category" : "Mental Health",
"is_votable: : true,
"members": [
{"country": "Denmark", "date_of_birth": "1998-04-04T00:00:00"},
{"country": "Denmark", "date_of_birth": "1999-05-05T00:00:00"}
]
}
If one or more nested documents exist, we want to write some Painless scripts that check certain fields across all the nested documents. My script works on mappings with a few documents, but when I try it on a larger set of documents I get null pointer exceptions despite having every null check possible. I've tried various access patterns and error-checking mechanisms, but I still get exceptions.
POST /video_entry/_search
{
"query": {
"script": {
"script": {
"source": """
// various NULL checks that I already tried
// also tried short circuiting on finding null values
if (!params['_source'].empty && params['_source'].containsKey('members')) {
def total = 0;
for (item in params._source.members) {
// custom logic here
// if above logic holds true
// total += 1;
}
return total > 3;
}
return true;
""",
"lang": "painless"
}
}
}
}
Other Statements That I've Tried
if (params._source == null) {
return true;
}
if (params._source.members == null) {
return true;
}
if (!ctx._source.contains('members')) {
return true;
}
if (!params['_source'].empty && params['_source'].containsKey('members') &&
params['_source'].members.value != null) {
// logic here
}
if (doc.containsKey('members')) {
for (mem in params._source.members) {
}
}
Error Message
&& params._source.members",
^---- HERE"
"caused_by" : {
"type" : "null_pointer_exception",
"reason" : null
}
I've looked into changing the structure (flattening the document) and the usage of must_not as indicated in this answer. They don't suit our use case as we need to incorporate some more custom logic.
Different tutorials use ctx, doc, and some use params. To add to the confusion, Debug.explain(doc.members) and Debug.explain(params._source.members) return empty responses, and I'm having a hard time figuring out the types.
Any help is appreciated.
TL;DR
Elasticsearch flattens objects, such that
{
"group" : "fans",
"user" : [
{
"first" : "John",
"last" : "Smith"
},
{
"first" : "Alice",
"last" : "White"
}
]
}
turns into:
{
"group" : "fans",
"user.first" : [ "alice", "john" ],
"user.last" : [ "smith", "white" ]
}
To access the inner values of members you need to reference them as doc['members.<field>'], as members will not exist on its own.
Details
As you may know, Elastic handles inner documents in its own way. [doc]
So you will need to reference them accordingly.
Here is what I did to make it work.
By the way, I have been using the Dev Tools console in Kibana.
PUT /so_test/
PUT /so_test/_mapping
{
"properties" : {
"captions_added" : {
"type" : "boolean"
},
"category" : {
"type" : "keyword"
},
"is_votable" : {
"type" : "boolean"
},
"members" : {
"properties" : {
"country" : {
"type" : "keyword"
},
"date_of_birth" : {
"type" : "date"
}
}
}
}
}
POST /so_test/_doc/
{
"captions_added": true,
"category" : "Mental Health",
"is_votable" : true,
"members": [
{"country": "Denmark", "date_of_birth": "1998-04-04T00:00:00"},
{"country": "Denmark", "date_of_birth": "1999-05-05T00:00:00"}
]
}
POST /so_test/_doc/
{
"captions_added": true,
"category" : "Mental breakdown",
"is_votable" : true,
"members": []
}
POST /so_test/_doc/
{
"captions_added": true,
"category" : "Mental success",
"is_votable" : true,
"members": [
{"country": "France", "date_of_birth": "1998-04-04T00:00:00"},
{"country": "Japan", "date_of_birth": "1999-05-05T00:00:00"}
]
}
And then I did this query (it is only a bool filter, but I guess making it work for your own use case should not prove too difficult)
GET /so_test/_search
{
"query":{
"bool": {
"filter": {
"script": {
"script": {
"lang": "painless",
"source": """
def flag = false;
// /!\ notice how the field is referenced /!\
if(doc['members.country'].size() != 0)
{
for (item in doc['members.country']) {
if (item == params.country){
flag = true;
}
}
}
return flag;
""",
"params": {
"country": "Japan"
}
}
}
}
}
}
}
By the way, you were saying you were a bit confused about the contexts for Painless; you can find some details about them in the documentation [doc].
In this case the filter context is the one we want to look at.
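For completeness, the same country check can be written without the explicit flag and loop, which also illustrates what a filter-context script can and cannot see; a sketch using the same field and params as above:
// Filter-context script (the "source" in the bool/filter query above).
// Available here:
//   doc['members.country'] -> indexed doc values, flattened across the members array
//   params.country         -> whatever you pass under "params"
// Not available here:
//   ctx._source            -> only exists in update, update_by_query and ingest contexts
def countries = doc['members.country'];
return countries.size() != 0 && countries.contains(params.country);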

Elasticsearch aggregation on a field with dynamic properties

Given the following document, where variants is a nested field and options is a flattened field in the mapping:
{
"doc_type" : "product",
"id" : 1,
"variants" : [
{
"options" : {
"Size" : "XS",
},
"price" : 1,
},
{
"options" : {
"Size" : "S",
"Material": "Wool"
},
"price" : 6.99,
},
]
}
I want to run an aggregation that produces data in the following format:
{
"variants.options.Size": {
"buckets" : [
{
"key" : "XS",
"doc_count" : 1
},
{
"key" : "S",
"doc_count" : 1
}
]
},
"variants.options.Material": {
"buckets" : [
{
"key" : "Wool",
"doc_count" : 1
}
]
}
}
I could very easily do something like:
"aggs": {
"variants.options.Size": {
"terms": {
"field": "variants.options.Size"
}
},
"variants.options.Material": {
"terms": {
"field": "variants.options.Material"
}
}
}
The caveat here is that we're using the flattened type for options because the fields in options are dynamic, so there is no way for me to know beforehand that we want to aggregate on Size and Material.
Essentially, I want to tell Elasticsearch that it should aggregate on whatever keys it finds under options. Is there a way to do this?
I want to tell Elasticsearch that it should aggregate on whatever keys it finds under options. Is there a way to do this?
Not directly. I had the same question a while back. I haven't found a clean solution to this day and I'm convinced there isn't one.
Luckily, there's a scripted_metric workaround that I outlined here. Applying it to your use case:
POST your_index/_search
{
"size": 0,
"aggs": {
"dynamic_variant_options": {
"scripted_metric": {
"init_script": "state.buckets = [:];",
"map_script": """
def variants = params._source['variants'];
for (def variant : variants) {
for (def entry : variant['options'].entrySet()) {
def key = entry.getKey();
def value = entry.getValue();
def path = "variants.options." + key;
if (state.buckets.containsKey(path)) {
if (state.buckets[path].containsKey(value)) {
state.buckets[path][value] += 1;
} else {
state.buckets[path][value] = 1;
}
} else {
state.buckets[path] = [value:1];
}
}
}
""",
"combine_script": "return state",
"reduce_script": "return states"
}
}
}
}
would yield:
"aggregations" : {
"dynamic_variant_options" : {
"value" : [
{
"buckets" : {
"variants.options.Size" : {
"S" : 1,
"XS" : 1
},
"variants.options.Material" : {
"Wool" : 1
}
}
}
]
}
}
You'll need to adjust the Painless code if you want the buckets to be arrays of key/doc_count pairs instead of hash maps like in my example.
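For instance, a reduce_script along these lines (a sketch, assuming the map_script above and combine_script staying "return state") merges the per-shard states and emits each path's values as an array of key/doc_count buckets:
// Drop-in replacement for  "reduce_script": "return states"  in the request above.
def merged = [:];
for (state in states) {
  if (state == null) { continue; }                  // shards that saw no documents
  for (pathEntry in state.buckets.entrySet()) {
    def path = pathEntry.getKey();
    if (!merged.containsKey(path)) { merged[path] = [:]; }
    for (valueEntry in pathEntry.getValue().entrySet()) {
      def soFar = merged[path].getOrDefault(valueEntry.getKey(), 0);
      merged[path][valueEntry.getKey()] = soFar + valueEntry.getValue();
    }
  }
}
// Reshape the per-path counts into terms-style bucket arrays.
def result = [:];
for (pathEntry in merged.entrySet()) {
  def buckets = [];
  for (valueEntry in pathEntry.getValue().entrySet()) {
    buckets.add(['key': valueEntry.getKey(), 'doc_count': valueEntry.getValue()]);
  }
  result[pathEntry.getKey()] = ['buckets': buckets];
}
return result;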

Get the number of appearances of a particular term in an elasticsearch field

I have an Elasticsearch index (posts) with the following mapping:
{
"id": "integer",
"title": "text",
"description": "text"
}
I want to simply find the number of occurrences of a particular term inside the description field for a single particular document (I have the document id and the term to find).
E.g. I have a post like this: {id: 123, title: "some title", description: "my city is LA, this post description has two occurrences of word city "}.
I have the document id / post id for this post and just want to find how many times the word "city" appears in the description of this particular post (the result should be 2 in this case).
I can't seem to find the right query for this; I don't want the occurrences across ALL the documents, just for a single document and inside one of its fields. Please suggest a query for this. Thanks
Elasticsearch Version: 7.5
You can use a terms aggregation on your description field, but you need to make sure fielddata is set to true on it.
PUT kamboh/
{
"mappings": {
"properties": {
"id": {
"type": "integer"
},
"title": {
"type": "text"
},
"description": {
"type": "text",
"fields": {
"simple_analyzer": {
"type": "text",
"fielddata": true,
"analyzer": "simple"
},
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Ingesting a sample doc:
PUT kamboh/_doc/1
{
"id": 123,
"title": "some title",
"description": "my city is LA, this post description has two occurrences of word city "
}
Aggregating:
GET kamboh/_search
{
"size": 0,
"aggregations": {
"terms_agg": {
"terms": {
"field": "description.simple_analyzer",
"size": 20
}
}
}
}
Yielding:
"aggregations" : {
"terms_agg" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "city",
"doc_count" : 1
},
{
"key" : "description",
"doc_count" : 1
},
...
]
}
}
Now, as you can see, the simple analyzer split the string into words and lowercased them, but "city" still shows a doc_count of 1: a terms aggregation counts how many documents contain each term, not how many times the term occurs within a document, so the duplicate city in your string is not reflected. I could not come up with an aggregation that keeps such within-document counts... With that being said,
It's advisable to do these word counts before you index!
You would split your string by whitespace and index them as an array of words instead of a long string.
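One way to do that at index time (a sketch; the pipeline and target field names are made up for illustration) is an ingest pipeline with a script processor that stores per-word counts alongside the text:
PUT _ingest/pipeline/description_word_counts
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
          if (ctx.description != null) {
            def counts = [:];
            // split on single spaces; a real pipeline would normalize case and punctuation first
            for (def word : ctx.description.splitOnToken(' ')) {
              if (word.length() == 0) { continue; }
              counts[word] = counts.getOrDefault(word, 0) + 1;
            }
            ctx.description_word_counts = counts;
          }
        """
      }
    }
  ]
}
Documents indexed with ?pipeline=description_word_counts then carry a description_word_counts object whose values can be read back directly instead of being recomputed at search time.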
This is also possible at search time, although it's very expensive, does not scale well, and requires script.painless.regex.enabled: true in your elasticsearch.yml:
GET kamboh/_search
{
"size": 0,
"aggregations": {
"terms_script": {
"scripted_metric": {
"params": {
"word_of_interest": ""
},
"init_script": "state.map = [:];",
"map_script": """
if (!doc.containsKey('description')) return;
def split_by_whitespace = / /.split(doc['description.keyword'].value);
for (def word : split_by_whitespace) {
if (params['word_of_interest'] !== "" && params['word_of_interest'] != word) {
continue;
}
if (state.map.containsKey(word)) {
state.map[word] += 1;
continue;
}
state.map[word] = 1;
}
""",
"combine_script": "return state.map;",
"reduce_script": "return states;"
}
}
}
}
yielding
...
"aggregations" : {
"terms_script" : {
"value" : [
{
"occurrences" : 1,
"post" : 1,
"city" : 2, <------
"LA," : 1,
"of" : 1,
"this" : 1,
"description" : 1,
"is" : 1,
"has" : 1,
"my" : 1,
"two" : 1,
"word" : 1
}
]
}
}
...
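If you only care about a single word, pass it via the existing word_of_interest parameter and only that word will be counted (for the sample document this would report city with a value of 2); only the params block of the request above changes:
"params": {
  "word_of_interest": "city"
}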

Elasticsearch query on array index

How do I query/filter by index of an array in elasticsearch?
I have a document like this:-
PUT /edi832/record/1
{
"LIN": [ "UP", "123456789" ]
}
I want to search if LIN[0] is "UP" and LIN[1] exists.
Thanks.
This might look like a hack, but it will work for sure.
First we apply the token_count type along with a multi-field to capture the number of tokens as a field.
So the mapping will look like this -
{
"record" : {
"properties" : {
"LIN" : {
"type" : "string",
"fields" : {
"word_count": {
"type" : "token_count",
"store" : "yes",
"analyzer" : "standard"
}
}
}
}
}
}
LINK - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#token_count
So to check whether the second element exists, it's as easy as checking whether this field's value is greater than or equal to 2.
Next we can check whether the token "up" exists at position 0.
We can use the scripted filter to check this.
Hence a query like below should work -
{
"query": {
"filtered": {
"query": {
"range": {
"LIN.word_count": {
"gte": 2
}
}
},
"filter": {
"script": {
"script": "for(pos : _index['LIN'].get('up',_POSITIONS)){ if(pos.position == 0) { return true}};return false;"
}
}
}
}
}
Advanced scripting - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-advanced-scripting.html
Script filters - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-script-filter.html
