Matching by array elements in Elasticsearch - elasticsearch

I have to construct quite a non-trivial (or so it now seems) query in Elasticsearch.
Suppose I have a couple of entities, each with an array element consisting of strings:
1). ['A', 'B']
2). ['A', 'C']
3). ['A', 'E']
4). ['A']
The mapping for the array element is as follows (using dynamic templates):
{
  "my_array_of_strings": {
    "path_match": "stringArray*",
    "mapping": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}
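For context, a dynamic template entry like the above sits inside the dynamic_templates array of a type's mapping; a sketch of the full mapping (pre-5.x string/not_analyzed syntax as in the snippet, with a hypothetical type name my_type) would be:
PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "my_array_of_strings": {
            "path_match": "stringArray*",
            "mapping": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      ]
    }
  }
}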
The JSON representation of an entity looks like this:
{
  "stringArray": [
    "A",
    "B"
  ]
}
Then I have user input:
['A', 'B', 'C'].
What I want to achieve is to find entities which contain only elements specified in the input. The expected results are:
['A', 'B'], ['A', 'C'], ['A'], but NOT ['A', 'E'] (because 'E' is not present in the user input).
Can this scenario be implemented with Elasticsearch?
UPDATE:
Apart from the solution using scripts - which should work nicely, but will most likely slow the query down considerably when many records match - I have devised another one. Below I will try to explain its main idea, without a code implementation.
One important condition that I failed to mention (and which might have given other users a valuable hint) is that the arrays consist of enumerated elements, i.e. there is a finite number of possible elements. This makes it possible to flatten such an array into separate fields of the entity.
Let's say there are 5 possible values: 'A', 'B', 'C', 'D', 'E'. Each of these values becomes a boolean field: true if the array contains this element, and false otherwise.
Then each of the entities could be rewritten as follows:
1).
A = true
B = true
C = false
D = false
E = false
2).
A = true
B = false
C = true
D = false
E = false
3).
A = true
B = false
C = false
D = false
E = true
4).
A = true
B = false
C = false
D = false
E = false
With the user input of ['A', 'B', 'C'] all I would need to do is:
a) take all possible values (['A', 'B', 'C', 'D', 'E']) and subtract the user input from them -> the result will be ['D', 'E'];
b) find records where each of the resulting elements is false, i.e. 'D = false AND E = false'.
This would give records 1, 2 and 4, as expected. I am still experimenting with the code implementation of this approach, but so far it looks quite promising. It has yet to be tested, but I think it might perform faster, and be less resource demanding, than using scripts in the query.
To optimize this a little further, it might be possible not to index the 'false' fields at all, and to change the previous query to 'D does not exist AND E does not exist' - the result should be the same.
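For illustration, a minimal sketch of that 'not exists' variant as an actual query (field names D and E from the example above; bool/must_not with exists works in ES 2.x and later):
GET my_index/_search
{
  "query": {
    "bool": {
      "must_not": [
        { "exists": { "field": "D" } },
        { "exists": { "field": "E" } }
      ]
    }
  }
}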

You can achieve this with scripting. This is how it looks:
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "terms": {
                "name": [
                  "A",
                  "B",
                  "C"
                ]
              }
            },
            {
              "script": {
                "script": "if (user_input.containsAll(doc['name'].values)) { return true; } return false;",
                "params": {
                  "user_input": [
                    "A",
                    "B",
                    "C"
                  ]
                }
              }
            }
          ]
        }
      }
    }
  }
}
This Groovy script checks whether the document's list contains anything apart from ['A', 'B', 'C'] and returns false if it does, so it won't return ['A', 'E']. It is simply checking for a sublist match. The script might take a couple of seconds on a large index. You would need to enable dynamic scripting; also, the syntax might be different for ES 2.x, so let me know if it does not work.
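For ES 5.x and later, where the filtered query was removed and Painless replaced Groovy as the default scripting language, a rough, untested sketch of the equivalent would be:
{
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "name": [ "A", "B", "C" ]
          }
        },
        {
          "script": {
            "script": {
              "lang": "painless",
              "source": "params.user_input.containsAll(doc['name'])",
              "params": {
                "user_input": [ "A", "B", "C" ]
              }
            }
          }
        }
      ]
    }
  }
}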
EDIT 1
I have put both conditions inside the filter. First, only those documents that contain either A, B or C are returned, and then the script is applied to only those documents, so this should be faster than the previous version. More on filter ordering.
Hope this helps!!

In a similar case, I did the following steps:
First of all, I deleted the index in order to redefine the analyzer/settings with the Sense plugin.
DELETE my_index
Then I defined a custom analyzer for my_index:
PUT my_index
{
  "index": {
    "analysis": {
      "tokenizer": {
        "comma": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "analyzer": {
        "comma": {
          "type": "custom",
          "tokenizer": "comma"
        }
      }
    }
  }
}
Then I defined the mapping properties inside my code, but you can also do that with Sense; both are the same:
PUT /my_index/_mapping/my_type
{
  "properties": {
    "conduct_days": {
      "type": "string",
      "analyzer": "comma"
    }
  }
}
Then, for testing, run the steps below:
PUT /my_index/my_type/1
{
  "conduct_days": "1,2,3"
}
PUT /my_index/my_type/2
{
  "conduct_days": "3,4"
}
PUT /my_index/my_type/3
{
  "conduct_days": "1,6"
}
GET /my_index/_search
{
  "query": { "match_all": {} }
}
GET /my_index/_search
{
  "filter": {
    "or": [
      {
        "term": {
          "conduct_days": "6"
        }
      },
      {
        "term": {
          "conduct_days": "3"
        }
      }
    ]
  }
}
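To double-check that the comma analyzer tokenizes the field as intended, you can run a quick _analyze request (JSON body syntax as in ES 5+; older versions take URL parameters instead):
GET my_index/_analyze
{
  "analyzer": "comma",
  "text": "1,2,3"
}
This should return the tokens 1, 2 and 3, which are what the term filters above match against.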

Related

Elasticsearch nested phrase search within a distance

Sample ES Document
{
  // other properties
  "transcript": [
    {
      "id": 0,
      "user_type": "A",
      "phrase": "hi good afternoon"
    },
    {
      "id": 1,
      "user_type": "B",
      "phrase": "hey"
    },
    {
      "id": 2,
      "user_type": "A",
      "phrase": "hi"
    },
    {
      "id": 3,
      "user_type": "B",
      "phrase": "my name is john"
    }
  ]
}
transcript is a nested field whose mapping looks like this:
{
  "type": "nested",
  "properties": {
    "id": {
      "type": "integer"
    },
    "phrase": {
      "type": "text",
      "analyzer": "standard"
    },
    "user_type": {
      "type": "keyword"
    }
  }
}
I need to search for two phrases inside transcript that are at most a given distance d apart.
For example:
If the phrases are hi and name and d is 1, the above document matches, because hi is present in the third nested object and name is present in the fourth. (Note: hi in the first nested object and name in the fourth is NOT a valid match, as they are more than d=1 apart.)
If the phrases are good and name and d is 1, the above document does not match, because good and name are 3 apart.
If both phrases are present in the same sentence, the distance is considered to be 0.
Possible Solution:
I can fetch all documents where both phrases are present and, on the application side, discard documents where the phrases are more than the given threshold (d) apart. The problem in this case is that I cannot get the count of such documents beforehand in order to show it in the UI as "found in 100 documents out of 1900" (without the application-side processing we can't be sure whether a document is indeed a match, and it's not feasible to do that processing for all documents in the index).
A second possible solution:
{
  "query": {
    "bool": {
      // suppose d = 2
      // if the first phrase occurs at the 0th offset, the second phrase can occur at
      // ... the 0th, 1st or 2nd offset
      // if the first phrase occurs at the 1st offset, the second phrase can occur at
      // ... the 1st, 2nd or 3rd offset
      // any one of the above permutations should exist
      "should": [
        {
          // search for 1st permutation
        },
        {
          // search for 2nd permutation
        },
        ...
      ]
    }
  }
}
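For concreteness, a rough, untested sketch of what a single permutation clause might look like: a pair of nested queries pinning each phrase to a specific nested id (here hi at id 2 and name at id 3, which are d = 1 apart, following the sample document above):
{
  "bool": {
    "must": [
      {
        "nested": {
          "path": "transcript",
          "query": {
            "bool": {
              "must": [
                { "match_phrase": { "transcript.phrase": "hi" } },
                { "term": { "transcript.id": 2 } }
              ]
            }
          }
        }
      },
      {
        "nested": {
          "path": "transcript",
          "query": {
            "bool": {
              "must": [
                { "match_phrase": { "transcript.phrase": "name" } },
                { "term": { "transcript.id": 3 } }
              ]
            }
          }
        }
      }
    ]
  }
}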
This approach is clearly not scalable: if d is large and the transcript is long, the number of permutations (and hence the query) becomes enormous.
Kindly suggest an approach.

Split a string in Painless/ELK

I have a string field "myfield.keyword", where entries have the following format:
AAA_BBBB_CC
DDD_EEE_F
I am trying to create three scripted fields: one that outputs the substring before the first _, one that outputs the substring between the first and second _, and one that outputs the substring after the second _.
I was trying to use .split('_') to do this, but found that this method is not available in Painless:
def newfield = "";
def path = doc[''myfield.keyword].value;
if (...)
{newfield = path.split('_')[1];} else {newfield="null";}
return newfield
I then tried the workaround suggested here, but found that I must enable regexes in Elastic (which would not be possible in my case):
def newfield = "";
def path = doc[''myfield.keyword].value;
if (...)
{newfield = /_/.split(path)[1];} else {newfield="null";}
return newfield
Is there a way to do this that does not presuppose enabling regexes?
EDIT:
Thank you for such an elegant solution, Val. This answers the question I asked. My question, however, was not well formed. In particular, the string that needs to be split has four occurrences of '_'. Something like:
AAA_BB_CCC_DD_E
FFF_GGG_HH_JJJJ_KK
So, if I understand correctly, indexOf() and lastIndexOf() cannot give me BB, CCC or DD. I thought that I could adapt your solution and find the index of the second and third occurrences of _ by using string.indexOf("_", 1) and string.indexOf("_", 2). However, I always get the same result as string.indexOf("_") without any extra parameters (i.e. the result is always the index of _'s first occurrence), since the second argument is the position to start searching from, not the occurrence number.
Enabling regular expressions is not terribly complicated, but it requires restarting your cluster and that might not be easy for you depending on the environment.
Another way to achieve this is to do it the "old way". First you create a single reusable script shared by all the script fields. All that script does is find the first, second, third and last occurrences of the _ symbol and return the requested split element. It takes as input the field name to split and the index of the substring to return:
POST _scripts/my-split
{
  "script": {
    "lang": "painless",
    "source": """
      def str = doc[params.field].value;
      def first = str.indexOf("_");
      def second = first + 1 + str.substring(first + 1).indexOf("_");
      def third = second + 1 + str.substring(second + 1).indexOf("_");
      def last = str.lastIndexOf("_");
      def parts = [
        str.substring(0, first),
        str.substring(first + 1, second),
        str.substring(second + 1, third),
        str.substring(third + 1, last),
        str.substring(last + 1)
      ];
      return parts[params.index];
    """
  }
}
Then you can simply define one script field for each of the parts like this:
POST test/_search
{
  "script_fields": {
    "first": {
      "script": {
        "id": "my-split",
        "params": {
          "field": "myfield.keyword",
          "index": 0
        }
      }
    },
    "second": {
      "script": {
        "id": "my-split",
        "params": {
          "field": "myfield.keyword",
          "index": 1
        }
      }
    },
    "third": {
      "script": {
        "id": "my-split",
        "params": {
          "field": "myfield.keyword",
          "index": 2
        }
      }
    }
  }
}
The response you get will look like this:
{
  "_index": "test",
  "_type": "_doc",
  "_id": "ykS-l3UBeO1HTBdDvTZd",
  "_score": 1.0,
  "fields": {
    "first": [
      "AAA"
    ],
    "second": [
      "BBBB"
    ],
    "third": [
      "CC"
    ]
  }
}
You could use str.splitOnToken("_"), retrieve the result as an array, and loop over the array for any of your purposes.
You can even split on variable tokens such as:
def message = "[LOG] Something something WARNING: Your warning";
def reason = message.splitOnToken("WARNING: ")[1];
So reason will hold the remaining string: Your warning.
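As a concrete usage sketch, a script field built on splitOnToken might look like this (assuming an ES version recent enough to have splitOnToken in Painless, roughly 6.1+; index and field names follow the earlier examples):
GET test/_search
{
  "script_fields": {
    "second": {
      "script": {
        "lang": "painless",
        "source": "doc['myfield.keyword'].value.splitOnToken('_')[1]"
      }
    }
  }
}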

Terms query not returning results for list of strings

I have this Elastic query which fails to return the desired results for terms.letter_score. I'm certain there are available matches in the index. This query (excluding letter_score) returns the expected filtered results, but nothing matches on letter_score. The only difference, as far as I can tell, is that the cat_id values are a list of integers vs. strings. Any ideas what could be the issue here? I'm basically trying to get it to match ANY value from the letter_score list.
Thanks
{
  "size": 10,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "cat_id": [
              1,
              2,
              4
            ]
          }
        },
        {
          "terms": {
            "letter_score": [
              "A",
              "B",
              "E"
            ]
          }
        }
      ]
    }
  }
}
It sounds like your letter_score field is of type text and has therefore been analyzed, so the tokens A, B and E have been stored as a, b and e, and the terms query (which does not analyze its input) won't match them.
Also, if that's the case, the token a may even have been dropped at indexing time if your analyzer is configured to remove stop words.
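A quick way to confirm this is to run the text through the analyzer and inspect the tokens it produces:
GET _analyze
{
  "analyzer": "standard",
  "text": "A B E"
}
With the standard analyzer this returns the lowercased tokens a, b and e, which is why a terms query with uppercase values finds nothing.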
A first approach is to use a match query instead of terms, like this:
{
  "match": {
    "letter_score": "A B E"
  }
}
If that still doesn't work, I suggest that you change the mapping of your letter_score field to keyword (requires reindexing your data) and then your query will work as it is now
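Since an existing field's type can't be changed in place, that usually means creating a new index with the right mapping and reindexing into it. A minimal sketch of the mapping (hypothetical index name, 7.x-style typeless syntax):
PUT my_new_index
{
  "mappings": {
    "properties": {
      "letter_score": {
        "type": "keyword"
      }
    }
  }
}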

how to filter for results for which a property has a value contained in array X

Say I've got a dynamic array A of values [x,y,z].
I want to return all results for which property P has a value that exists in A.
I could write some recursive filter that concatenates 'or's for each value in A, but it's extremely clunky.
Any other out-of-the-box way to do this?
You can use the filter command in conjunction with the reduce and contains commands to accomplish this.
Example
Let's say you have the following documents:
{
  "id": "41e352d0-f543-4731-b427-6e16a2f6fb92",
  "property": [ 1, 2, 3 ]
}, {
  "id": "a4030671-7ad9-4ab9-a21f-f77cba9bfb2a",
  "property": [ 5, 6, 7 ]
}, {
  "id": "b0694948-1fd7-4293-9e11-9e5c3327933e",
  "property": [ 2, 3, 4 ]
}, {
  "id": "4993b81b-912d-4bf7-b7e8-e46c7c825793",
  "property": [ "b", "c" ]
}, {
  "id": "ce441f1e-c7e9-4a7f-9654-7b91579029be",
  "property": [ "a", "b", "c" ]
}
From this sequence, you want to get all documents that have either "a" or 1 in their property field. You can write a query that builds a chained contains statement using reduce.
r.table('30510212')
  // Filter documents
  .filter(function (row) {
    // Array of values you want to filter for
    return r.expr([ 1, 'a' ])
      // Insert `false` as the first value in the array
      // in order to make it the first value in the reduce's left
      .insertAt(0, false)
      // Chain up the `contains` statements
      .reduce(function (left, right) {
        return left.or(row('property').contains(right));
      });
  })
Update: Better way to do it
Actually, you can use two contains calls to execute the same query. This is shorter and probably a bit easier to understand.
r.table('30510212')
  .filter(function (row) {
    return row('property').contains(function (property) {
      return r.expr([ 1, 'a' ]).contains(property);
    });
  })

Difference with count result in Mongo group by query with Ruby/Javascript

I'm using Mongoid to get a count of certain types of records in a Mongo database. When running the query with the javascript method:
db.tags.group({
  cond: { tag: { $ne: 'donotwant' } },
  key: { tag: true },
  reduce: function(doc, out) { out.count += 1; },
  initial: { count: 0 }
});
I get the following results:
[
  { "tag": "thing", "count": 4 },
  { "tag": "something", "count": 1 },
  { "tag": "test", "count": 1 }
]
It does exactly what I want. However, when I use the corresponding Mongoid code to perform the same query:
Tag.collection.group(
  :cond => { :tag => { :$ne => 'donotwant' } },
  :key => [ :tag ],
  :reduce => "function(doc, out) { out.count += 1 }",
  :initial => { :count => 0 }
)
the count parameters are (seemingly) selected as floats instead of integers:
[
  { "tag" => "thing", "count" => 4.0 },
  { "tag" => "something", "count" => 1.0 },
  { "tag" => "test", "count" => 1.0 }
]
Am I misunderstanding what's going on behind the scenes? Do I need to (can I?) cast those counts or is the javascript result just showing it without the .0?
JavaScript doesn't distinguish between floats and ints; it has a single Number type, implemented as a double. So what you are seeing in Ruby is correct: the mongo shell output follows JavaScript printing conventions and displays Numbers that don't have a decimal component without the '.0'.
