Elasticsearch Sort By Length Of Text - sorting

I'm using Elasticsearch 7.13 and running my queries in Kibana.
This is my mapping:
{
"full_text" : {
"properties" : {
"title" : {
"type" : "text",
"fielddata" : true
}
}
}
}
This is my data
"full_text" : [
{
"title" : "Pkd chuyên cho thuê kingdom 101 1pn đến 3pn giá rẻ nhất thị trường chỉ 11 triệu/căn. lh 0919504***"
}
]
This is my code to sort by length of full_text.title
"sort": {
"_script": {
"type": "number",
"order": "desc",
"script": {
"lang": "painless",
"source": "doc['full_text.title'].value.length()"
}
}
}
So why does the sort return only 7?
"_source" : {
"full_text" : [
{
"title" : "Pkd chuyên cho thuê kingdom 101 1pn đến 3pn giá rẻ nhất thị trường chỉ 11 triệu/căn. lh 0919504***"
}
]
},
"sort": [
7.0
]

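Because full_text.title is an analyzed text field with fielddata enabled, doc['full_text.title'] holds the individual tokens of "title" as an array rather than the original string, and .value returns only the first of them (here most likely "0919504", which is 7 characters long). You need to join that array into a string.
Try this:
"source": "int length = String.join(' ',doc['full_text.title']).length(); return length;"
Note that this measures the joined tokens, so the result only approximates the length of the original text.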

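To see where the 7 in the first question comes from, here is a rough Python sketch. It assumes the analyzer lowercases the text and splits on non-word characters, and that the field's values are exposed as sorted, de-duplicated tokens; the helper name `analyzed_tokens` is made up for illustration and is not an Elasticsearch API.

```python
import re

def analyzed_tokens(text):
    """Approximate an analyzed text field: lowercase, split on non-word
    characters, then expose the tokens sorted and de-duplicated."""
    return sorted(set(re.findall(r"\w+", text.lower())))

title = ("Pkd chuyên cho thuê kingdom 101 1pn đến 3pn giá rẻ nhất "
         "thị trường chỉ 11 triệu/căn. lh 0919504***")

tokens = analyzed_tokens(title)

# What `.value.length()` measures: the length of the first token only.
first_value_length = len(tokens[0])

# What the join-based script measures: all tokens joined by single spaces.
joined_length = len(" ".join(tokens))

print(tokens[0], first_value_length)  # 0919504 7
print(joined_length, len(title))
```

The joined length is close to, but not the same as, the length of the original string, because punctuation is dropped by the tokenizer.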
Related

How to get the local day of week from timestamp in elasticsearch

I'm using the ingest pipeline script processors to extract the day of the week from the local time for each document.
I'm using the client_ip to extract the timezone, use that along with the timestamp to extract the local time, and then extract day of week (and other features) from that local time.
This is my ingest pipeline:
{
"processors" : [
{
"set" : {
"field" : "#timestamp",
"override" : false,
"value" : "{{_ingest.timestamp}}"
}
},
{
"date" : {
"field" : "#timestamp",
"formats" : [
"EEE MMM dd HH:mm:ss 'UTC' yyyy"
],
"ignore_failure" : true,
"target_field" : "#timestamp"
}
},
{
"convert" : {
"field" : "client_ip",
"type" : "ip",
"ignore_failure" : true,
"ignore_missing" : true
}
},
{
"geoip" : {
"field" : "client_ip",
"target_field" : "client_geo",
"properties" : [
"continent_name",
"country_name",
"country_iso_code",
"region_iso_code",
"region_name",
"city_name",
"location",
"timezone"
],
"ignore_failure" : true,
"ignore_missing" : true
}
},
{
"script" : {
"description" : "Extract details of Dates",
"lang" : "painless",
"ignore_failure" : true,
"source" : """
LocalDateTime local_time = LocalDateTime.ofInstant( Instant.ofEpochMilli(ctx['#timestamp']), ZoneId.of(ctx['client_geo.timezone']));
int day_of_week = local_time.getDayOfWeek().getValue();
int hour_of_day = local_time.getHour();
int office_hours = 0;
if (day_of_week<6 && day_of_week>0) { if (hour_of_day >= 7 && hour_of_day <= 19 ) {office_hours =1;} else {office_hours = -1;}} else {office_hours = -1;}
ctx['day_of_week'] = day_of_week;
ctx['hour_of_day'] = hour_of_day;
ctx['office_hours'] = office_hours;
"""
}
}
]
}
The first two processors were added before for other purposes. I've added the last 3.
An example document could be the following:
"docs": [
{
"_source": {
"#timestamp": 43109942361111,
"client_ip": "89.160.20.128"
}
}
]
I'm getting the GeoIP fields in the data now, but none of the fields created by the script processor. What am I doing wrong?
EDIT
A few notes about the index that is affected by these changes:
The Dynamic mapping is off.
I have manually added the client_geo.timezone field to the mapping of the index as a keyword.
When I run the following scripted search on the index
GET index_name/_search
{
"script_fields": {
"day_of_week": {
"script": "doc['#timestamp'].value.withZoneSameInstant(ZoneId.of(doc['client_geo']['timezone'])).getDayOfWeek().getValue()"
}
}
}
I get the following runtime error in script execution:
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "No field found for [client_geo] in mapping"
}
Thanks for the well-formed question and example.
I was able to replicate your problem and figured it out.
ctx is "the document source as it is". Consequently, a script processor does not automatically resolve dot-delimited field names.
Your client data is added as such:
"client_geo" : {
"continent_name" : "Europe"
//<snip>..</snip>
}
So, you have to access it directly as a nested hash map.
Meaning ctx['client_geo.timezone'] should actually be ctx['client_geo']['timezone']
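The same distinction can be illustrated outside Elasticsearch. The following Python sketch treats the document source as a plain nested dictionary; the timezone value and the `get_path` helper are assumptions made up for the example.

```python
# The document source after the geoip processor has run (values are illustrative).
ctx = {
    "client_ip": "89.160.20.128",
    "client_geo": {"continent_name": "Europe", "timezone": "Europe/Stockholm"},
}

# A dotted key is just a string; no such top-level key exists.
dotted = ctx.get("client_geo.timezone")

# Walking the nested maps step by step finds the value.
nested = ctx["client_geo"]["timezone"]

def get_path(source, dotted_path):
    """Hypothetical helper: resolve a dot-delimited path against nested dicts."""
    current = source
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

print(dotted)                                # None
print(nested)                                # Europe/Stockholm
print(get_path(ctx, "client_geo.timezone"))  # Europe/Stockholm
```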
Here is the full pipeline that worked for me:
"processors": [
{
"set": {
"field": "#timestamp",
"override": false,
"value": "{{_ingest.timestamp}}"
}
},
{
"date": {
"field": "#timestamp",
"formats": [
"EEE MMM dd HH:mm:ss 'UTC' yyyy"
],
"ignore_failure": true,
"target_field": "#timestamp"
}
},
{
"convert": {
"field": "client_ip",
"type": "ip",
"ignore_failure": true,
"ignore_missing": true
}
},
{
"geoip": {
"field": "client_ip",
"target_field": "client_geo",
"properties": [
"continent_name",
"country_name",
"country_iso_code",
"region_iso_code",
"region_name",
"city_name",
"location",
"timezone"
],
"ignore_failure": true,
"ignore_missing": true
}
},
{
"script": {
"description": "Extract details of Dates",
"lang": "painless",
"ignore_failure": true,
"source": """
LocalDateTime local_time = LocalDateTime.ofInstant(Instant.ofEpochMilli(ctx['#timestamp']), ZoneId.of(ctx['client_geo']['timezone']));
int day_of_week = local_time.getDayOfWeek().getValue();
int hour_of_day = local_time.getHour();
int office_hours = 0;
if (day_of_week<6 && day_of_week>0) { if (hour_of_day >= 7 && hour_of_day <= 19 ) {office_hours =1;} else {office_hours = -1;}} else {office_hours = -1;}
ctx['day_of_week'] = day_of_week;
ctx['hour_of_day'] = hour_of_day;
ctx['office_hours'] = office_hours;
"""
}
}
]
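For checking the date logic of the script processor in isolation, here is a small Python sketch of the same computation. It is only an approximation under stated assumptions: a fixed UTC offset stands in for the ZoneId looked up from client_geo.timezone, and the function name `extract_date_details` is invented for the example.

```python
from datetime import datetime, timedelta, timezone

def extract_date_details(epoch_millis, tz):
    """Mirror the Painless script: local day of week, hour, and an office-hours flag."""
    local_time = datetime.fromtimestamp(epoch_millis / 1000, tz=tz)
    day_of_week = local_time.isoweekday()  # Monday=1 ... Sunday=7, like Java's DayOfWeek
    hour_of_day = local_time.hour
    is_weekday = 1 <= day_of_week <= 5
    office_hours = 1 if is_weekday and 7 <= hour_of_day <= 19 else -1
    return {
        "day_of_week": day_of_week,
        "hour_of_day": hour_of_day,
        "office_hours": office_hours,
    }

# Epoch 0 is Thursday 1970-01-01 00:00 UTC; at UTC+9 that is 09:00 on a Thursday.
utc_plus_9 = timezone(timedelta(hours=9))
print(extract_date_details(0, utc_plus_9))
# {'day_of_week': 4, 'hour_of_day': 9, 'office_hours': 1}
```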

ElasticSearch Accessing Nested Documents in Script - Null Pointer Exception

Gist: I'm trying to write a custom filter on nested documents using Painless, and I want to add checks for the case where there are no nested documents, to avoid a null_pointer_exception.
I have a mapping as such (simplified and obfuscated)
{
"video_entry" : {
"aliases" : { },
"mappings" : {
"properties" : {
"captions_added" : {
"type" : "boolean"
},
"category" : {
"type" : "keyword"
},
"is_votable" : {
"type" : "boolean"
},
"members" : {
"type" : "nested",
"properties" : {
"country" : {
"type" : "keyword"
},
"date_of_birth" : {
"type" : "date"
}
}
}
}
}
}
}
Each video_entry document can have 0 or more members nested documents.
Sample Document
{
"captions_added": true,
"category" : "Mental Health",
"is_votable": true,
"members": [
{"country": "Denmark", "date_of_birth": "1998-04-04T00:00:00"},
{"country": "Denmark", "date_of_birth": "1999-05-05T00:00:00"}
]
}
If one or more nested documents exist, we want to write some Painless scripts that check certain fields across all the nested documents. My script works on indices with a few documents, but when I try it on a larger set of documents I get null pointer exceptions despite having every null check possible. I've tried various access patterns and error checking mechanisms, but I still get exceptions.
POST /video_entry/_search
{
"query": {
"script": {
"script": {
"source": """
// various NULL checks that I already tried
// also tried short circuiting on finding null values
if (!params['_source'].empty && params['_source'].containsKey('members')) {
def total = 0;
for (item in params._source.members) {
// custom logic here
// if above logic holds true
// total += 1;
}
return total > 3;
}
return true;
""",
"lang": "painless"
}
}
}
}
Other Statements That I've Tried
if (params._source == null) {
return true;
}
if (params._source.members == null) {
return true;
}
if (!ctx._source.contains('members')) {
return true;
}
if (!params['_source'].empty && params['_source'].containsKey('members') &&
params['_source'].members.value != null) {
// logic here
}
if (doc.containsKey('members')) {
for (mem in params._source.members) {
}
}
Error Message
&& params._source.members",
^---- HERE"
"caused_by" : {
"type" : "null_pointer_exception",
"reason" : null
}
I've looked into changing the structure (flattening the document) and the usage of must_not as indicated in this answer. They don't suit our use case as we need to incorporate some more custom logic.
Different tutorials use ctx, doc and some use params. To add to the confusion Debug.explain(doc.members), Debug.explain(params._source.members) return empty responses and I'm having a hard time figuring out the types.
Any help is appreciated.
TL;DR
Elasticsearch flattens arrays of objects, so that
{
"group" : "fans",
"user" : [
{
"first" : "John",
"last" : "Smith"
},
{
"first" : "Alice",
"last" : "White"
}
]
}
turns into:
{
"group" : "fans",
"user.first" : [ "alice", "john" ],
"user.last" : [ "smith", "white" ]
}
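To access the inner values of members, reference them as doc['members.<field>'], because members does not exist as a field on its own. Note that this flattening applies to plain object fields; the mapping below declares members without "type": "nested".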
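A minimal Python sketch of this flattening, for intuition only. The function name `flatten_object_arrays` is made up, and unlike the example above it keeps the original case and order, since lowercasing and sorting come from text analysis rather than from the flattening itself.

```python
from collections import defaultdict

def flatten_object_arrays(source):
    """Turn arrays of objects into dotted multi-valued fields."""
    flat = {}
    for key, value in source.items():
        is_object_array = (
            isinstance(value, list)
            and len(value) > 0
            and all(isinstance(item, dict) for item in value)
        )
        if is_object_array:
            merged = defaultdict(list)
            for inner in value:
                for inner_key, inner_value in inner.items():
                    merged[f"{key}.{inner_key}"].append(inner_value)
            flat.update(merged)
        else:
            flat[key] = value
    return flat

doc = {
    "group": "fans",
    "user": [
        {"first": "John", "last": "Smith"},
        {"first": "Alice", "last": "White"},
    ],
}
print(flatten_object_arrays(doc))
# {'group': 'fans', 'user.first': ['John', 'Alice'], 'user.last': ['Smith', 'White']}
```

Note how the association between "John" and "Smith" is lost after flattening, and why a field called user (or members) no longer exists by itself.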
Details
As you may know, Elastic handles inner documents in its own way. [doc]
So you will need to reference them accordingly.
Here is what I did to make it work.
By the way, I used the Dev Tools console in Kibana.
PUT /so_test/
PUT /so_test/_mapping
{
"properties" : {
"captions_added" : {
"type" : "boolean"
},
"category" : {
"type" : "keyword"
},
"is_votable" : {
"type" : "boolean"
},
"members" : {
"properties" : {
"country" : {
"type" : "keyword"
},
"date_of_birth" : {
"type" : "date"
}
}
}
}
}
POST /so_test/_doc/
{
"captions_added": true,
"category" : "Mental Health",
"is_votable" : true,
"members": [
{"country": "Denmark", "date_of_birth": "1998-04-04T00:00:00"},
{"country": "Denmark", "date_of_birth": "1999-05-05T00:00:00"}
]
}
POST /so_test/_doc/
{
"captions_added": true,
"category" : "Mental breakdown",
"is_votable" : true,
"members": []
}
POST /so_test/_doc/
{
"captions_added": true,
"category" : "Mental success",
"is_votable" : true,
"members": [
{"country": "France", "date_of_birth": "1998-04-04T00:00:00"},
{"country": "Japan", "date_of_birth": "1999-05-05T00:00:00"}
]
}
And then I did this query (it is only a bool filter, but I guess making it work for your own use case should not prove too difficult)
GET /so_test/_search
{
"query":{
"bool": {
"filter": {
"script": {
"script": {
"lang": "painless",
"source": """
def flag = false;
// /!\ notice how the field is referenced /!\
if(doc['members.country'].size() != 0)
{
for (item in doc['members.country']) {
if (item == params.country){
flag = true;
}
}
}
return flag;
""",
"params": {
"country": "Japan"
}
}
}
}
}
}
}
By the way, you said you were a bit confused about the contexts for Painless. You can find some details about them in the documentation. [doc]
In this case the filter context is the one we want to look at.
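The logic of the script filter above can be checked with a small Python sketch. It assumes each document exposes its flattened members.country values as a list (empty when there are no members); the function name `matches_country` is invented for the example.

```python
def matches_country(member_countries, wanted):
    """True if any flattened members.country value equals the wanted country.
    An empty list plays the role of doc['members.country'].size() == 0."""
    return any(country == wanted for country in member_countries)

# Flattened view of the three sample documents indexed above.
docs = [
    {"category": "Mental Health", "members.country": ["Denmark", "Denmark"]},
    {"category": "Mental breakdown", "members.country": []},
    {"category": "Mental success", "members.country": ["France", "Japan"]},
]

hits = [d["category"] for d in docs if matches_country(d["members.country"], "Japan")]
print(hits)  # ['Mental success']
```

Because an empty list simply yields no match, no separate null check is needed in this formulation.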

Get the number of appearances of a particular term in an elasticsearch field

I have an elasticsearch index (posts) with following mappings:
{
"id": "integer",
"title": "text",
"description": "text"
}
I simply want to find the number of occurrences of a particular term in the description field of a single document (I have the document id and the term to find).
E.g. I have a post like this: {id: 123, title:"some title", description: "my city is LA, this post description has two occurrences of word city "}.
I have the document id / post id for this post and just want to find how many times the word "city" appears in the description of this particular post (the result should be 2 in this case).
I can't seem to find a way to do this search. I don't want the occurrences across ALL the documents, just for a single document and within one of its fields. Please suggest a query for this. Thanks.
Elasticsearch Version: 7.5
You can use a terms aggregation on your description field, but you need to make sure fielddata is set to true on it.
PUT kamboh/
{
"mappings": {
"properties": {
"id": {
"type": "integer"
},
"title": {
"type": "text"
},
"description": {
"type": "text",
"fields": {
"simple_analyzer": {
"type": "text",
"fielddata": true,
"analyzer": "simple"
},
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Ingesting a sample doc:
PUT kamboh/_doc/1
{
"id": 123,
"title": "some title",
"description": "my city is LA, this post description has two occurrences of word city "
}
Aggregating:
GET kamboh/_search
{
"size": 0,
"aggregations": {
"terms_agg": {
"terms": {
"field": "description.simple_analyzer",
"size": 20
}
}
}
}
Yielding:
"aggregations" : {
"terms_agg" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "city",
"doc_count" : 1
},
{
"key" : "description",
"doc_count" : 1
},
...
]
}
}
Now, as you can see, the simple analyzer split the string into words and lowercased them, but the duplicate city in your string is not reflected: doc_count is the number of documents that contain a term, not the number of times the term occurs. I could not come up with an analyzer that'd keep the duplicates... With that being said,
It's advisable to do these word counts before you index!
You would split your string by whitespace and index them as an array of words instead of a long string.
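A minimal sketch of counting words before indexing, in Python. It splits on whitespace only, so punctuation stays attached to words (for example "LA,"); the function name `count_words` is made up for illustration.

```python
from collections import Counter

def count_words(text):
    """Split on whitespace and count how often each word occurs."""
    return Counter(text.split())

description = "my city is LA, this post description has two occurrences of word city "
word_counts = count_words(description)

print(word_counts["city"])  # 2
print(word_counts["LA,"])   # 1
```

The resulting counts (or the word array itself) could then be stored alongside the document at index time, so that no scripting is needed at search time.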
This is also possible at search time, although it's very expensive, does not scale well, and requires script.painless.regex.enabled: true in your elasticsearch.yml:
GET kamboh/_search
{
"size": 0,
"aggregations": {
"terms_script": {
"scripted_metric": {
"params": {
"word_of_interest": ""
},
"init_script": "state.map = [:];",
"map_script": """
if (!doc.containsKey('description.keyword') || doc['description.keyword'].size() == 0) return;
def split_by_whitespace = / /.split(doc['description.keyword'].value);
for (def word : split_by_whitespace) {
// skip words other than the requested one; an empty parameter counts every word
if (params['word_of_interest'] != "" && params['word_of_interest'] != word) {
continue;
}
if (state.map.containsKey(word)) {
state.map[word] += 1;
continue;
}
state.map[word] = 1;
}
""",
"combine_script": "return state.map;",
"reduce_script": "return states;"
}
}
}
}
yielding
...
"aggregations" : {
"terms_script" : {
"value" : [
{
"occurrences" : 1,
"post" : 1,
"city" : 2, <------
"LA," : 1,
"of" : 1,
"this" : 1,
"description" : 1,
"is" : 1,
"has" : 1,
"my" : 1,
"two" : 1,
"word" : 1
}
]
}
}
...

Update multi level nested document in elasticsearch

Using Elasticsearch 1.7.1, I have the following document structure:
"_source" : {
"questions" : {
"defaultQuestion" : {
"tag" : 0,
"gid" : 0,
"rid" : 0,
"caption" : "SRID"
},
"tableQuestion" : {
"rows" : [{
"ids" : {
"answerList" : ["3547", "3548"],
"tag" : "0",
"caption" : "Accounts"
},
"name" : {
"answerList" : ["Some Name"],
"tag" : "0",
"caption" : "Name"
}
}
],
"caption" : "BPI 1500541753537",
"id" : 644251570,
"tag" : ""
}
},
"id" : "447722821"
}
I want to add a new object to questions.tableQuestion.rows. My current script replaces the existing object with the new one. Kindly suggest how to append it instead. The following is my update script.
{ "update": {"_id": "935663867", "_retry_on_conflict" : 3} }
{ "script" : "ctx._source.questions += param1", "params" : {"param1" : {"tableQuestion": {"rows" : [ NEWROWOBJECT ]} } }}
You can build the full path through the nested fields, right down to the rows property, and then use the += operator. It's also good to check whether the rows array is null and initialize it in that case.
Checked with ES 2.4, but should be similar for earlier versions:
POST http://127.0.0.1:9200/sample/demo/{document_id}/_update
{
"script": {
"inline": "if (ctx._source.questions.tableQuestion.rows == null) ctx._source.questions.tableQuestion.rows = new ArrayList(); ctx._source.questions.tableQuestion.rows += param1;",
"params" : {
"param1" : {
"ids": {
"answerList": [
"478",
"255"
],
"tag": "2",
"caption": "My Test"
},
"name": {
"answerList": [
"My Name"
],
"tag": "1",
"caption": "My Demo"
}
}
}
}
}
For ES 5.x and Painless language the script is a bit different:
POST http://127.0.0.1:9200/sample/demo/{document_id}/_update
{
"script": {
"inline": "if (ctx._source.questions.tableQuestion.rows == null) { ctx._source.questions.tableQuestion.rows = new ArrayList();} ctx._source.questions.tableQuestion.rows.add(params.param1);",
"params" : {
"param1" : {
...
}
}
}
}
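The difference between replacing and appending can be sketched in Python on a plain dictionary. This assumes that += on the questions map behaves like a shallow merge, which would explain why the existing tableQuestion object was overwritten; the function names and the shortened sample data are made up for the example.

```python
import copy

def replace_questions(source, param):
    """Shallow merge at the questions level: tableQuestion is overwritten as a whole."""
    source["questions"].update(param)
    return source

def append_row(source, new_row):
    """Walk down to rows, initialize it if missing, then append."""
    table_question = source["questions"]["tableQuestion"]
    if table_question.get("rows") is None:
        table_question["rows"] = []
    table_question["rows"].append(new_row)
    return source

original = {
    "questions": {
        "tableQuestion": {
            "rows": [{"name": {"caption": "Name"}}],
            "caption": "BPI 1500541753537",
        }
    }
}
new_row = {"name": {"caption": "My Demo"}}

replaced = replace_questions(copy.deepcopy(original), {"tableQuestion": {"rows": [new_row]}})
appended = append_row(copy.deepcopy(original), new_row)

print(len(replaced["questions"]["tableQuestion"]["rows"]))  # 1
print(len(appended["questions"]["tableQuestion"]["rows"]))  # 2
```

In the replaced variant the sibling keys of rows (such as caption) are lost as well, which is another reason to target the rows array directly.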
Update to the additional comment
If some part of the path is dynamic, you can also use parameters to build the path - with get(param_name) method - try this syntax (I removed the null check for simplicity):
{
"script": {
"inline": "ctx._source.questions.get(param2).rows += param1;",
"params" : {
"param2" : "6105243",
"param1" : {
"ids": {
"answerList": [
"478",
"255"
],
"tag": "2",
"caption": "My Test"
},
"name": {
"answerList": [
"My Name"
],
"tag": "1",
"caption": "My Demo"
}
}
}
}
}

Sorting by distance on a numeric field in ElasticSearch

For a project I need to select documents which are within filter bounds and are closest to a numeric value. This is about a price, and I cannot seem to find out whether this is possible.
Say I have 2 documents:
{
"name": "Document1",
"price": 46.12,
"tags": ["tag1", "tag2"]
}
{
"name": "Document2",
"price": 82.29,
"tags": ["tag1", "tag3"]
}
Is it possible to get the document with the price closest to 66.23?
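The answer (thanks to keety) was to enable dynamic scripting and add a sort script like this (the example sorts on an age field with a target of 45; for the documents above, use the price field and 66.23 instead):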
{
"query": {
"filtered" : {
"query":{
"match_all" : { }
},
"filter": {}
}
},
"sort" : {
"_script" : {
"script" : "cur = (factor - doc['age'].value); if (cur < 0) { cur = cur * -1 } else { cur = cur}",
"type" : "number",
"params" : {
"factor" : 45
},
"order" : "asc"
}
}
}
This sorts from closest to farthest and works like a charm. Thanks!
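The sort script computes the absolute difference between a target value and the field, and sorts ascending on it. A minimal Python sketch of the same idea, applied to the two price documents from the question (the function name `sort_by_distance` is made up for illustration):

```python
def sort_by_distance(documents, field, target):
    """Order documents by the absolute distance of `field` from `target`, closest first."""
    return sorted(documents, key=lambda doc: abs(target - doc[field]))

docs = [
    {"name": "Document1", "price": 46.12, "tags": ["tag1", "tag2"]},
    {"name": "Document2", "price": 82.29, "tags": ["tag1", "tag3"]},
]

ranked = sort_by_distance(docs, "price", 66.23)
print([doc["name"] for doc in ranked])  # ['Document2', 'Document1']
```

Document2 comes first because 82.29 is about 16.06 away from 66.23, while 46.12 is about 20.11 away.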
