ElasticSearch painless scripts - Way to output variable values besides the final score?

I am using a painless script to implement a custom scoring function when querying the ES index that serves as the basis for our recommendation engine. The final score in the painless script is the product of intermediate variables, such as recency and uniqueness, calculated within the script.
Now, it is trivial to get the final scores of the top documents, as they are returned in the query response. However, for detailed analysis, I'm trying to find a way to also get the values of the intermediate variables (recency and uniqueness in the example above). I understand these painless variables only exist within the context of the painless script, which has no standard REPL setup. So is there really no way to access these painless variables? Has anyone found a workaround? Thanks!
E.g., if I have the following simplified painless script:
def recency = 1 / doc['date'].value;
def uniqueness = doc['ctr'].value;
return recency * uniqueness;
In the final ES response, I get the scores, i.e. recency * uniqueness. However, I also want to know the values of the intermediate variables, i.e. recency and uniqueness.

You can try using a modular approach with multiple scripted fields:
recency -- gets the recency value
uniqueness -- gets the uniqueness value
Then access the fields like normal ES fields in your final painless script:
if (doc.containsKey('recency.keyword') && doc.containsKey('uniqueness.keyword')) {
    def val1 = doc['recency.keyword'].value;
    def val2 = doc['uniqueness.keyword'].value;
}
Hope it helps.

There is no direct way of printing it anywhere, I suppose.
But here is something you can try to check the intermediate output of any variable:
create another scripted field which returns only the value of that variable.
For example, in your case:
"script_fields": {
"derivedRecency": {
"script": {
"lang": "painless",
"source": """
return doc['recency'].value;
"""
}
}
}
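Putting both ideas together, here is a minimal sketch of a full request (the index name is hypothetical, and it assumes date is a numeric field holding epoch millis and ctr is a numeric field, as the question implies) that returns the intermediate values per hit via script_fields alongside the scored query:
POST myindex/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "lang": "painless",
        "source": "def recency = 1.0 / doc['date'].value; def uniqueness = doc['ctr'].value; return recency * uniqueness;"
      }
    }
  },
  "script_fields": {
    "recency": {
      "script": { "lang": "painless", "source": "1.0 / doc['date'].value" }
    },
    "uniqueness": {
      "script": { "lang": "painless", "source": "doc['ctr'].value" }
    }
  }
}
Each hit then carries a fields object with recency and uniqueness next to its _score, at the cost of evaluating the sub-scripts once more for each returned document.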

Related

Get date value in update query elasticsearch painless

I'm trying to get the millis value of two dates and subtract one from the other.
When I use ctx._source.begin_time.toInstant().toEpochMilli() (like doc['begin_time'].value.toInstant().toEpochMilli()) it gives me a runtime error.
And ctx._source.begin_time.date.getYear() (like in Update all documents of Elastic Search using existing column value) gives me a runtime error with the message:
"ctx._source.work_time = ctx._source.begin_time.date.getYear()",
" ^---- HERE"
What type do I get from ctx._source, given that doc['begin_time'].value.toInstant().toEpochMilli() works correctly?
I can't find in the Painless documentation how to get the values correctly. begin_time is definitely a date.
So, how can I write a script that gets the difference between two dates and writes it to another integer field?
If you look closely, the script language in the linked question is Groovy, but it's not supported anymore. What we use nowadays (2021) is called Painless.
The main point here is that the ctx._source attributes are the original JSON -- meaning the dates will be strings or integers (depending on the format) and not java.util.Date or any other data type that you could call .getYear() on. This means we'll have to parse the value first.
So, assuming your begin_time is of the format yyyy/MM/dd, you can do the following:
POST myindex/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": """
      DateTimeFormatter dtf = DateTimeFormatter.ofPattern("yyyy/MM/dd");
      LocalDate begin_date = LocalDate.parse(ctx._source.begin_time, dtf);
      ctx._source.work_time = begin_date.getYear();
    """
  }
}
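To get the actual difference the question asks about, here is a minimal sketch under the same assumptions (plus a hypothetical end_time field in the same yyyy/MM/dd format) that parses both dates and stores the difference in days as an integer:
POST myindex/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": """
      DateTimeFormatter dtf = DateTimeFormatter.ofPattern("yyyy/MM/dd");
      LocalDate begin_date = LocalDate.parse(ctx._source.begin_time, dtf);
      LocalDate end_date = LocalDate.parse(ctx._source.end_time, dtf);
      // whole-day difference, stored as an integer field
      ctx._source.work_time = (int) ChronoUnit.DAYS.between(begin_date, end_date);
    """
  }
}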
BTW the _update_by_query script context (what's accessible and what's not) is documented here, and working with datetime in Painless is nicely documented here.

Elasticsearch manipulate existing field value to add new field

I'm trying to add a new field whose value comes from hashing an existing field's value. So I want to do:
my_index.hashedusername (new field) = crc32(my_index.username) (existing field)
For example:
POST _update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": "ctx._source.hashedusername = crc32(ctx._source.username);"
  }
}
Please give me an idea of how to do this.
java.util.zip.CRC32 is not available in the shared Painless API, so mocking that package would be non-trivial -- perhaps even unreasonable.
I'd suggest computing the CRC32 hashes beforehand and only then sending the docs to ES. Alternatively, scroll through all your documents, compute the hash, and bulk-update your documents.
The Painless API was designed to perform comparatively simple tasks, and CRC32 is certainly outside of its purpose.
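If you go the precompute route, a minimal client-side sketch in plain Java (the field value and the surrounding indexing code are hypothetical) looks like this:
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class Crc32Example {
    public static void main(String[] args) {
        String username = "john.doe"; // existing field value

        // compute the CRC32 checksum of the username
        CRC32 crc = new CRC32();
        crc.update(username.getBytes(StandardCharsets.UTF_8));
        long hashedusername = crc.getValue();

        // include both values in the JSON document you send to ES, e.g.
        // { "username": "john.doe", "hashedusername": <checksum> }
        System.out.println(hashedusername);
    }
}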

Elastic Search - Tokenization and Multi Match query

I need to perform tokenization and multi-match in a single query in Elasticsearch.
Currently:
1) I am using the analyzer to get the tokens like below:
String text = // 4 line log data;
List<AnalyzeToken> analyzeTokenList = new ArrayList<AnalyzeToken>();
AnalyzeRequestBuilder analyzeRequestBuilder = this.client.admin().indices().prepareAnalyze();
for (String newIndex : newIndexes) {
    analyzeRequestBuilder.setIndex(newIndex);
    analyzeRequestBuilder.setText(text);
    analyzeRequestBuilder.setAnalyzer(analyzer);
    AnalyzeResponse analyzeResponse = analyzeRequestBuilder.get();
    analyzeTokenList.addAll(analyzeResponse.getTokens());
}
then I iterate through the AnalyzeToken list and collect the terms:
List<String> tokens = new ArrayList<String>();
for (AnalyzeToken token : analyzeTokenList) {
    tokens.add(token.getTerm().replaceAll("\\s+", " "));
}
then use the tokens to build the multi-match query like below:
String query = "";
for (String data : tokens) {
    query = query + data + " ";
}
MultiMatchQueryBuilder multiMatchQueryBuilder = new MultiMatchQueryBuilder(query, "abstract", "title");
Iterable<Document> result = documentRepository.search(multiMatchQueryBuilder);
Based on the result, I check whether similar data exists in the database.
Is it possible to combine the analyze and multi-match steps into a single query?
Any help is appreciated!
EDIT:
Problem statement: Say I have 90 entries in one index, in which every 10 entries are near-identical (not exactly, but with about a 70% match), so I have 9 such groups.
I need to process only one entry in each group, so I went with the following approach (which is not a good way, but it's what I've ended up with for now):
Approach:
Get each entry from the 90 entries in the index.
Tokenize it using the analyzer (this removes the unwanted keywords).
Search in the same index (to check whether the same kind of data is already there), also filtering on a processed flag. --> this flag is updated after the first log gets processed.
If no similar data (70% match) is flagged as processed, I process the log and mark its flag as processed.
If similar data already exists with the flag set to processed, I consider the data already processed and continue with the next one.
So the ideal goal is to process only one entry out of each group of 10 near-identical entries.
Thanks,
Harry
Multi-match queries internally use match queries, which are analyzed, meaning they apply the analyzer defined in the field's mapping (or the standard analyzer if none is defined).
From the multi-match query docs:
The multi_match query builds on the match query to allow multi-field queries:
Also accepts analyzer, boost, operator, minimum_should_match, fuzziness, lenient, as explained in match query.
So what you are trying to do is overkill. Even if you need different tokens at search time, you can specify a search analyzer instead of creating the tokens yourself and then feeding them into a multi-match query, as sketched below.
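For example, a minimal sketch (the index and analyzer names are hypothetical) that lets multi_match do the tokenization itself with an explicit search-time analyzer:
POST myindex/_search
{
  "query": {
    "multi_match": {
      "query": "the raw 4-line log data goes here",
      "fields": ["abstract", "title"],
      "analyzer": "my_custom_analyzer"
    }
  }
}
The analyzer strips the unwanted keywords at search time, so there is no need for a separate _analyze round-trip before searching.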

Elasticsearch NEST: specifying Id explicitly seems to cause inconsistent search scores

I have a model class that looks like this:
public class MySearchDocument
{
    public string ID { get; set; }
    public string Name { get; set; }
    public string Description { get; set; }
    public int DBID { get; set; }
}
We always use bulk indexing. By default our searches do a relatively simple multi_match with more weight given to ID and Name, like this:
{
  "query": {
    "multi_match": {
      "query": "burger",
      "fields": [
        "ID^1.2",
        "Name^1.1",
        "Description"
      ],
      "auto_generate_synonyms_phrase_query": true
    }
  }
}
I was previously just relying on Id inference, allowing Elasticsearch to use my ID property for its Id purposes, but for a few reasons it has become preferable to use DBID as the Id property in Elasticsearch. I tried this three different ways, separately and in combination:
Explicitly when bulk indexing: new BulkIndexOperation<MySearchDocument>(d) { Id = d.DBID }
In the ConnectionSettings using DefaultMappingFor<MySearchDocument>(d => d.IdProperty(p => p.DBID))
Using an attribute on MySearchDocument: [ElasticsearchType(IdProperty = nameof(DBID))]
Any and all of these seem to work as expected; the _id field in the indexed documents is being set to my DBID property. However, in my integration tests, the search results are anything but expected. Specifically, I have a test that:
Creates a new index from scratch.
Populates it with a handful of MySearchDocuments
Issues a Refresh on the index just to make sure it's ready.
Issues a search.
Asserts that the results come back in the expected order.
With Id inference, this test consistently passes. When switching the Id field using any or all of the techniques above, it passes maybe half the time. Looking at the raw results, the correct documents are always returned, but the _score often varies for the same document from test run to test run. Sometimes the varying score is the one associated with the document whose ID field matches the search term, other times it's the score of a different document.
I've tried coding the test to run repeatedly and in parallel. I've tried waiting several seconds after issuing Refresh, just to be sure the index is ready. Neither makes a difference: the test passes consistently with Id inference and is consistently inconsistent without it. I know nothing in this world is truly random, so I feel like I must be missing something here. Let me know if more details would be helpful. Thanks in advance.
Search relevancy scores are calculated per shard, and a hashing algorithm on the value of _id determines into which primary shard a given document will be indexed.
It sounds like you may be seeing the effects of this when indexing a small sample of documents across N > 1 primary shards; in this case, the local relevancy scores may be different enough to manifest in some odd looking _scores returned. With a larger set of documents and even distribution, differences in local shard scores diminish.
There are a couple of approaches that you can take to overcome this for testing purposes:
Use a single primary shard
or
Use dfs_query_then_fetch when making the search request. This tells Elasticsearch to first collect the term and document frequencies from all shards in order to calculate global relevancy scores, and then use those to calculate _score. There is a slight overhead to using dfs_query_then_fetch; both options are sketched below.
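A minimal sketch of both options (the index name is hypothetical):
PUT myindex
{
  "settings": {
    "number_of_shards": 1
  }
}
or, at search time:
POST myindex/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "multi_match": {
      "query": "burger",
      "fields": ["ID^1.2", "Name^1.1", "Description"]
    }
  }
}
In NEST this can be set via the SearchType parameter on the search request (DfsQueryThenFetch).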
Take a look also at the section "Relevance is Broken!" in the Elasticsearch Definitive Guide; although the guide refers to Elasticsearch 2.x, much of it is still very relevant for later versions.

Unable to loop through array field ES 6.1

I'm facing a problem in Elasticsearch 6.1 that I cannot solve, and I don't know why. I have read the docs several times and maybe I'm missing something.
I have a scripted query that needs to do some calculation before deciding whether a record is available or not.
Here is the script:
https://gist.github.com/dunice/a3a8a431140ec004fdc6969f77356fdf
What I'm trying to do is loop through an array field with the following source:
"unavailability": [
{
"starts_at": "2018-11-27T18:00:00+00:00",
"local_ends_at": "2018-11-27T15:04:00",
"local_starts_at": "2018-11-27T13:00:00",
"ends_at": "2018-11-27T20:04:00+00:00"
},
{
"starts_at": "2018-12-04T18:00:00+00:00",
"local_ends_at": "2018-12-04T15:04:00",
"local_starts_at": "2018-12-04T13:00:00",
"ends_at": "2018-12-04T20:04:00+00:00"
},
]
When the script is executed it throws the error: No field found for [unavailability] in mapping with types [aircraft]
Any clue as to how to make it work?
Thanks
UPDATE
Query:
https://gist.github.com/dunice/3ccd7d83ca6ddaa63c11013b84e659aa
UPDATE 2
Mapping:
https://gist.github.com/dunice/f8caee114bbd917115a21b8b9175a439
Data example:
https://gist.github.com/dunice/8ad0602bc282b4ca19bce8ae849117ad
You cannot access an array present in the source document via doc values (i.e. doc). You need to access the source document directly via the _source variable instead, like this:
for (int i = 0; i < params._source['unavailability'].length; i++) {
Note that depending on your ES version, you might need to use ctx._source or just _source instead of params._source.
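For context, a rough sketch of what the overlap check might look like in such a script query (check_in/check_out are hypothetical epoch-milli params, and the exact date handling depends on your version):
for (int i = 0; i < params._source['unavailability'].length; i++) {
  def slot = params._source['unavailability'][i];
  // parse the ISO-8601 strings from the source document
  long starts = ZonedDateTime.parse(slot['starts_at']).toInstant().toEpochMilli();
  long ends = ZonedDateTime.parse(slot['ends_at']).toInstant().toEpochMilli();
  // unavailable if the requested window overlaps this slot
  if (params.check_in < ends && params.check_out > starts) {
    return false;
  }
}
return true;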
I solved my use case with a different approach.
Instead of having a field as an array of objects like unavailability, I decided to create two fields as arrays of datetimes:
unavailable_from
unavailable_to
My script walks through the first field and then checks the second at the same position, roughly as in the sketch below.
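In Painless that approach might look roughly like this (the params are hypothetical, and note that doc values come back sorted, so the positional pairing only holds if the intervals don't overlap):
def froms = doc['unavailable_from'];
def tos = doc['unavailable_to'];
for (int i = 0; i < froms.size(); i++) {
  // ES 6.x doc values expose Joda dates; on 7+ use .toInstant().toEpochMilli()
  long starts = froms.get(i).getMillis();
  long ends = tos.get(i).getMillis();
  if (params.check_in < ends && params.check_out > starts) {
    return false;
  }
}
return true;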
UPDATE
Direct access to _source is disabled by default:
https://github.com/elastic/elasticsearch/issues/17558
