Split a string in Painless/ELK - elasticsearch

I have a string field "myfield.keyword", where entries have the following format:
AAA_BBBB_CC
DDD_EEE_F
I am trying to create three scripted fields: one that outputs the substring before the first _, one that outputs the substring between the first and second _, and one that outputs the substring after the second _.
I was trying to use .split('_') to do this, but found that this method is not available in Painless:
def newfield = "";
def path = doc['myfield.keyword'].value;
if (...) {
  newfield = path.split('_')[1];
} else {
  newfield = "null";
}
return newfield;
I then tried the workaround suggested here, but found that I must enable regexes in Elastic (which would not be possible in my case):
def newfield = "";
def path = doc['myfield.keyword'].value;
if (...) {
  newfield = /_/.split(path)[1];
} else {
  newfield = "null";
}
return newfield;
Is there a way to do this that does not require enabling regexes?
EDIT:
Thank you for such an elegant solution, Val. This answers the question I asked. My question, however, was not well formed. In particular, the string that needs to be split has four occurrences of '_'. Something like:
AAA_BB_CCC_DD_E
FFF_GGG_HH_JJJJ_KK
So, if I understand correctly, indexOf() and lastIndexOf() cannot give me BB, CCC or DD. I thought I could adapt your solution and find the index of the second and third occurrences of _ by using string.indexOf("_", 1) and string.indexOf("_", 2). However, I always get the same result as string.indexOf("_") without any extra parameters (i.e. the result is always the index of the first occurrence of _).

Enabling regular expressions is not terribly complicated, but it requires restarting your cluster and that might not be easy for you depending on the environment.
Another way to achieve this is to do it the "old way". First, you create a reusable stored script that all the script fields will share. That script simply finds the first, second, third and last occurrences of the _ symbol and returns the requested element of the split. It takes as input the field name to split and the index of the substring to return:
POST _scripts/my-split
{
  "script": {
    "lang": "painless",
    "source": """
      def str = doc[params.field].value;
      def first = str.indexOf("_");
      def second = first + 1 + str.substring(first + 1).indexOf("_");
      def third = second + 1 + str.substring(second + 1).indexOf("_");
      def last = str.lastIndexOf("_");
      def parts = [
        str.substring(0, first),
        str.substring(first + 1, second),
        str.substring(second + 1, third),
        str.substring(third + 1, last),
        str.substring(last + 1)
      ];
      return parts[params.index];
    """
  }
}
Then you can simply define one script field for each of the parts like this:
POST test/_search
{
  "script_fields": {
    "first": {
      "script": {
        "id": "my-split",
        "params": {
          "field": "myfield.keyword",
          "index": 0
        }
      }
    },
    "second": {
      "script": {
        "id": "my-split",
        "params": {
          "field": "myfield.keyword",
          "index": 1
        }
      }
    },
    "third": {
      "script": {
        "id": "my-split",
        "params": {
          "field": "myfield.keyword",
          "index": 2
        }
      }
    }
  }
}
The response you get will look like this:
{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "ykS-l3UBeO1HTBdDvTZd",
  "_score" : 1.0,
  "fields" : {
    "first" : [
      "AAA"
    ],
    "second" : [
      "BBBB"
    ],
    "third" : [
      "CC"
    ]
  }
}

You could use str.splitOnToken("_"), which returns the parts as an array that you can index into or loop over for any of your purposes.
You can even split on longer, multi-character tokens, such as:
def message = "[LOG] Something something WARNING: Your warning";
def reason = message.splitOnToken("WARNING: ")[1];
So reason will hold the remaining string: Your warning.
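Applied to the original question, a script field built on splitOnToken might look like this minimal sketch (only the "second" part is shown; the other parts would just use a different array index, and the index name test mirrors the earlier example):
POST test/_search
{
  "script_fields": {
    "second": {
      "script": {
        "lang": "painless",
        "source": """
          def parts = doc['myfield.keyword'].value.splitOnToken('_');
          // guard against values that have fewer parts than expected
          return parts.length > 1 ? parts[1] : null;
        """
      }
    }
  }
}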

Related

Elasticsearch nested phrase search within a distance

Sample ES Document
{
  // other properties
  "transcript" : [
    {
      "id" : 0,
      "user_type" : "A",
      "phrase" : "hi good afternoon"
    },
    {
      "id" : 1,
      "user_type" : "B",
      "phrase" : "hey"
    },
    {
      "id" : 2,
      "user_type" : "A",
      "phrase" : "hi"
    },
    {
      "id" : 3,
      "user_type" : "B",
      "phrase" : "my name is john"
    }
  ]
}
transcript is a nested field whose mapping looks like
{
  "type": "nested",
  "properties": {
    "id": {
      "type": "integer"
    },
    "phrase": {
      "type": "text",
      "analyzer": "standard"
    },
    "user_type": {
      "type": "keyword"
    }
  }
}
I need to search for two phrases inside transcript that are apart by at max a given distance d.
For example:
If the phrases are hi and name and d is 1, the above document matches, because hi is present in the third nested object and name is present in the fourth nested object. (Note: hi in the first nested object and name in the fourth nested object is NOT a valid match, as they are more than d=1 apart.)
If the phrases are good and name and d is 1, the above document does not match, because good and name are a distance of 3 apart.
If both phrases are present in same sentence, the distance is considered as 0.
Possible Solution:
I can fetch all documents where both phrases are present and, on the application side, discard documents where the phrases are more than the given threshold (d) apart. The problem with this is that I cannot get the count of such documents beforehand in order to show it in the UI as "found in 100 documents out of 1900" (without the application-side processing we can't be sure whether a document is indeed a match, and it's not feasible to process every document in the index).
A second possible solution is:
{
  "query": {
    "bool": {
      // suppose d = 2
      // if the first phrase occurs at the 0th offset, the second phrase can
      // ... occur at the 0th, 1st or 2nd offset
      // if the first phrase occurs at the 1st offset, the second phrase can
      // ... occur at the 1st, 2nd or 3rd offset
      // any one of the above permutations should match
      "should": [
        {
          // search for 1st permutation
        },
        {
          // search for 2nd permutation
        },
        ...
      ]
    }
  }
}
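To make the idea concrete, one permutation clause inside the should array might look something like the sketch below (the offsets and phrases are illustrative only: it pins the first phrase to id 0 and allows the second phrase anywhere within d = 2 of it):
{
  "bool": {
    "must": [
      {
        "nested": {
          "path": "transcript",
          "query": {
            "bool": {
              "must": [
                { "term": { "transcript.id": 0 } },
                { "match_phrase": { "transcript.phrase": "hi" } }
              ]
            }
          }
        }
      },
      {
        "nested": {
          "path": "transcript",
          "query": {
            "bool": {
              "must": [
                { "range": { "transcript.id": { "gte": 0, "lte": 2 } } },
                { "match_phrase": { "transcript.phrase": "name" } }
              ]
            }
          }
        }
      }
    ]
  }
}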
This is clearly not scalable: if d is large and the transcript is long, the query is going to be very big.
Kindly suggest an approach.

How can I script a field in Kibana that matches first 4 digits of a field?

I'm facing a steep learning curve with the syntax, and my data contains PII, so I can't describe it in much more detail.
I need a new field in Kibana for the already indexed documents. This field "C" would be a combination of the first 4 digits of a field "A" (which contains numbers up into the millions and is of type keyword) and a field "B" (also of type keyword, holding some large number).
Later I will use this field "C", which is a unique combination, to compare against a list/array of items (I will insert the list into a query DSL in Kibana, as I need to build some visualizations and reports from the returned documents).
I saw that I could use Painless to create this new field, but I don't know whether I need regex, or exactly how to go about it.
EDIT:
As requested, more info about the mapping, with a concrete example.
"fieldA" : {
"type: "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"fieldB" : {
"type: "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
Example of values:
FieldA = "9876443320134",
FieldB = "000000001".
I would like to sum the first 4 digits of FieldA and the full content of FieldB. FieldC would result in a value of "9877".
The raw query could look like this:
GET combination_index/_search
{
  "script_fields": {
    "a+b": {
      "script": {
        "source": """
          def a = doc['fieldA.keyword'].value;
          def b = doc['fieldB.keyword'].value;
          if (a == null || b == null) {
            return null;
          }
          def parsed_a = new BigInteger(a);
          def parsed_b = new BigInteger(b);
          return new BigInteger(parsed_a.toString().substring(0, 4)) + parsed_b;
        """
      }
    }
  }
}
Note 1: we're parsing the strings into BigInteger because Integer.MAX_VALUE may not be large enough for these values.
Note 2: we're first parsing fieldA and only then calling .toString on it again in order to handle the edge case of fieldA starting with zeros, like 009876443320134. It's assumed that you're looking for 9876, not 98, which would be the result of first calling .substring and then parsing.
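To illustrate note 2, here is a small sketch with a hypothetical value containing leading zeros (not taken from the question's data):
// hypothetical value with leading zeros
def a = "009876443320134";

// substring first, then parse: "0098" -> 98
new BigInteger(a.substring(0, 4));

// parse first, then take the substring of the normalized string: 9876
new BigInteger(new BigInteger(a).toString().substring(0, 4));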
If you intend to use it in Kibana visualizations, you'll need an index pattern first. Once you've got one, add a scripted field to that index pattern and paste the script in. After you click save, the new scripted field becomes available in numeric aggregations and queries.

How can I pass parameters to token filter script in Elasticsearch?

I want to do something like the example below, but this code clearly does not work. Is there any way I can pass a predefined parameter to the "source" section?
term_temp = "of kerosene, kerosene fuel"
predicate_token_filter_temp = {
    "type" : "predicate_token_filter",
    "script" : {
        "source": "String additional_terms = term_path; String curr_token = token.getTerm().toString(); !curr_token.contains(\" \") || additional_terms.contains(curr_token);",
        "params": {"term_path": term_temp}
    }
}
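Painless scripts are normally parameterized through the params object and referenced as params.<name> inside the source. Assuming that mechanism is honored for this token filter (a sketch, not verified for this particular analysis context), the filter definition might look like:
"predicate_token_filter_temp": {
  "type": "predicate_token_filter",
  "script": {
    "source": """
      String additional_terms = params.term_path;
      String curr_token = token.getTerm().toString();
      return !curr_token.contains(" ") || additional_terms.contains(curr_token);
    """,
    "params": {
      "term_path": "of kerosene, kerosene fuel"
    }
  }
}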

filtering / sorting on difference between two values as contained in nested array (using script filter and doc values only)

It's easier to illustrate my question with a use case, so let's use the example from the Elasticsearch guide.
It lists a product, and each product has a nested array containing the resellers that sell it:
{
  ...
  "product" : {
    "properties" : {
      "resellers" : {
        "type" : "nested",
        "properties" : {
          "name" : { "type" : "text" },
          "price" : { "type" : "double" }
        }
      }
    }
  }
}
How would I do the following if at all possible?
Filter all products where storeA is cheaper than storeB. E.g.: product.resellers[name=storeA].price < product.resellers[name=storeB].price
Order products by the difference between the prices of storeA and storeB
This likely needs a script filter and a script-based sort respectively, but I'm not sure how to go about it. Moreover, these types of queries are used frequently, so performance is important; I therefore probably need to stick with doc values instead of resorting to _source. Is this possible?
Yes, that's definitely possible and you can do it like this:
{
  "sort": {
    "_script": {
      "type": "number",
      "script": {
        "inline": "def store1 = _source.resellers.find{it.name == store1}; def store2 = _source.resellers.find{it.name == store2}; (store1 != null && store2 != null) ? store1.price - store2.price : 0",
        "lang": "groovy",
        "params": {
          "store1": "storeA",
          "store2": "storeB"
        }
      },
      "order": "asc"
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "script": {
            "script": {
              "inline": "def store1 = _source.resellers.find{it.name == store1}; def store2 = _source.resellers.find{it.name == store2}; (store1 != null && store2 != null) ? store1.price < store2.price : false",
              "lang": "groovy",
              "params": {
                "store1": "storeA",
                "store2": "storeB"
              }
            }
          }
        }
      ]
    }
  }
}
The sort script looks like this:
def store1 = _source.resellers.find{it.name == store1};
def store2 = _source.resellers.find{it.name == store2};
(store1 != null && store2 != null) ? store1.price - store2.price : 0
The filter script is a bit similar and looks like this:
def store1 = _source.resellers.find{it.name == store1};
def store2 = _source.resellers.find{it.name == store2};
(store1 != null && store2 != null) ? store1.price < store2.price : false
Both scripts take two parameters in input, namely the names of the reseller stores you want to compare.
UPDATE
Somehow I forgot to explain why it's not possible to do this with doc values. Doc values are effectively the inverse of the inverted index, i.e. each document is mapped to the tokens it contains. Couple this with the fact that nested documents are stored as standalone (yet hidden) documents in the index, and the doc values for a document like the one below
{
  "id": 1,
  "product": "Water",
  "resellers": [
    {
      "name": "storeA",
      "price": 20
    },
    {
      "name": "storeB",
      "price": 30
    }
  ]
}
would look like this:
Document        | Values
----------------+---------------------------
1 (top-level)   | water
1a (1st nested) | storea, 20
1b (2nd nested) | storeb, 30
Looking at the above table, and since scripts are executed in the context of each document (whether top-level or nested), it becomes evident that accessing doc values within a script will only yield the values of that document, and hence it is not possible to compare them with values from another document.
When accessing the source, we're effectively iterating over the resellers array, so it is possible to compare the values among them and yield something that is useful in your context.
This looks like a marketplace problem. So I would separate products by their master product ids - so products can have different descriptions, properties etc. - and add a priority field to them for sorting and filtering.
{
  ...
  "product" : {
    "properties" : {
      "masterProduct" : "int",
      "priority" : "int",
      "resellers" : {
        "type" : "nested",
        "properties" : {
          "name" : { "type" : "text" },
          "price" : { "type" : "double" }
        }
      }
    }
  }
}
Let me explain how.
First,
product.resellers[name=storeA].price < product.resellers[name=storeB].price
I guess this problem arose because you want to show the cheapest product in the search result. So I think you should have all resellers' prices available while indexing products.
If you know the cheapest one while indexing, make its priority a positive number like 1, and multiply the other products' priorities by -1, so you can sort them from cheap to expensive on the product detail page.
This solves the second problem (ordering products by the difference between the prices of storeA and storeB).
After all this, you have positive and negative priorities in your index, and a filter on priority > 0 returns the cheapest products. Likewise, if any reseller wants to be at the top of the search results or promote itself, you can do that by increasing its priority.

Elasticsearch: conditionally sort on 2 fields, 1 replaces the other if it exists

Without scripting, I need to sort records based on rating. The system-rating exists for all records, but a user-rating may or may not exist. If a user-rating does exist I want to use that value in the sort instead of the system-rating, for that particular record and only for that record.
I tried looking into the missing setting, but it only allows _first, _last or a custom value (which will be used as the sort value for docs that are missing the field):
{
  "sort" : [
    { "user_rating" : { "missing" : "_last" } }
  ],
  "query" : {
    "term" : { "meal" : "cabbage" }
  }
}
...but is there a way to specify the custom value should be system_rating when user_rating is missing?
I can do the following:
query_hash[:sort] = []
if user_rating.exist?
  query_hash[:sort] << {
    "user_rating" => {
      "order": sort_direction,
      "unmapped_type": "long",
      "missing": "_last",
    }
  }
end
query_hash[:sort] << {
  "system_rating" => {
    "order": sort_direction,
    "unmapped_type": "long",
  }
}
...but that will always sort user-rated records on top, regardless of the user_rating value.
I know that scripting would allow me to do it, but we cannot use scripting. Is it possible?
The only options are scripting, or building a custom field at indexing time that already contains the value to sort on.
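For reference, if scripting were allowed, the usual pattern would be a script-based sort that uses user_rating when present and falls back to system_rating otherwise. A minimal sketch, assuming both are numeric fields with doc values and that higher ratings should sort first:
GET _search
{
  "query": {
    "term": { "meal": "cabbage" }
  },
  "sort": {
    "_script": {
      "type": "number",
      "order": "desc",
      "script": {
        "lang": "painless",
        "source": """
          if (doc.containsKey('user_rating') && doc['user_rating'].size() > 0) {
            return doc['user_rating'].value;
          }
          return doc['system_rating'].value;
        """
      }
    }
  }
}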
