Sum-aggregation script for term frequencies without dynamic scripting - elasticsearch

I am trying to evaluate a web application for my master's thesis. For this I want to run a user study, where I prepare the data in Elastic Found and send my web application to the testers. As far as I know, Found does not allow dynamic scripting for security reasons. I am trying to reformulate the following dynamic script query:
GET my_index/document/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "stadt": {
      "sum": {
        "script": "_index['textBody']['frankfurt'].tf()"
      }
    }
  }
}
This query sums up the term frequencies of the term frankfurt in the document field textBody.
In order to reformulate the query without dynamic scripting, I have looked at Groovy file scripts, but I still get parsing errors.
My approach was:
GET my_index/document/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "stadt": {
      "sum": {
        "script": {
          "script_id": "termFrequency",
          "lang": "groovy",
          "params": {
            "term": "frankfurt"
          }
        }
      }
    }
  }
}
and the file termFrequency.groovy in the scripts directory:
_index['textBody'][term].tf()
I get the following parsing error:
Parse Failure [Unexpected token START_OBJECT in [stadt].]

This is the correct syntax, assuming your file is inside the config/scripts directory:
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "stadt": {
      "sum": {
        "script_file": "termFrequency",
        "lang": "groovy",
        "params": {
          "term": "frankfurt"
        }
      }
    }
  },
  "size": 0
}
Also, the term should be a variable rather than a string, so it should be
_index['textBody'][term].tf()
Hope this helps!
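If you cannot place files on the nodes of a hosted cluster, indexed scripts are another option in ES 1.4+, assuming your provider allows them (they are governed by the same script security settings). A sketch, reusing the same Groovy script body:

PUT /_scripts/groovy/termFrequency
{
  "script": "_index['textBody'][term].tf()"
}

GET my_index/document/_search
{
  "size": 0,
  "aggs": {
    "stadt": {
      "sum": {
        "script_id": "termFrequency",
        "lang": "groovy",
        "params": {
          "term": "frankfurt"
        }
      }
    }
  }
}

Note that script_id (like script_file above) sits directly under the sum aggregation next to params, not nested inside a "script" object; that extra nesting is what triggers the "Unexpected token START_OBJECT" parse error.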

Related

Is it possible to access a query term in a script field?

I would like to construct an Elasticsearch query in which I can search for a term and compute a new field on the fly for each matching document, calculated from some existing fields as well as the query term. Is this possible?
For example, let's say that in my ES query I am searching for documents which have the keyword "amsterdam" in the "text" field.
"filter": [
{
"match_phrase": {
"text": {
"query": "amsterdam"
}
}
}]
Now I would also like to have a script field in my query, which computes some value based on other fields as well as the query.
So far, I have only found how to access the other fields of a document, using doc['someOtherField'], for example:
"script_fields" : {
"new_field" : {
"script" : {
"lang": "painless",
"source": "if (doc['citizens'].value > 10000) {
return "large";
}
return "small";"
}
}
}
How can I integrate the query term, e.g. if I wanted to add to the if statement "if the query term starts with a-e"?
You're on the right track but script_fields are primarily used to post-process your documents' attributes — they won't help you filter any docs because they're run after the query phase.
With that being said, you can use scripts to filter your documents through script queries. Before you do that, though, you should explore alternatives.
In other words, scripts should be used when all other mechanisms and techniques have been exhausted.
Back to your example. I see three possibilities off the top of my head.
Match phrase prefix queries as a group of bool-should subqueries:
POST your-index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "match_phrase_prefix": {
                  "text_field": "a"
                }
              },
              {
                "match_phrase_prefix": {
                  "text_field": "b"
                }
              },
              {
                "match_phrase_prefix": {
                  "text_field": "c"
                }
              },
              ... till the letter "e"
            ]
          }
        }
      ]
    }
  }
}
A regexp query:
POST your-index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "regexp": {
            "text_field": "[a-e].+"
          }
        }
      ]
    }
  }
}
Script queries using .charAt comparisons:
POST your-index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "script": {
            "script": {
              "source": """
                char c = doc['text_field.keyword'].value.charAt(0);
                return c >= params.gte.charAt(0) && c <= params.lte.charAt(0);
              """,
              "params": {
                "gte": "a",
                "lte": "e"
              }
            }
          }
        }
      ]
    }
  }
}
If you're relatively new to ES and would love to see real-world examples, check out my recently released Elasticsearch Handbook. One chapter is dedicated to scripting and as it turns out, you can achieve a lot with scripts (if of course executed properly).

JsonQueryElasticSearch Processor in Nifi

I am working with the JsonQueryElasticSearch processor in NiFi (v1.9.2).
The query string is as below:
{
  "query": {
    "bool": {
      "must": [
        { "match": { "event": "New" } },
        { "match": { "uniqueId": "${unique_id}" } },
        { "match": { "header.schemaVersion": "1.3" } }
      ]
    }
  },
  "sort": {
    "header.sourceSystemCreationTimestamp": { "order": "desc" }
  }
}
It's not giving me any results because the value of the ${unique_id} flowfile attribute within the query is blank. If I hard-code the value in the query, it works as expected. At the processor level, I do see the value for the ${unique_id} flowfile attribute.
Thanks very much for your time and help.
(I'm the developer who wrote this processor)
I tried to duplicate the issue by doing the following:
Creating an index with several test documents.
Using GenerateFlowFile -> JsonQueryElasticsearch.
Putting this simple query in the query parameter of JsonQueryElasticsearch:
{
  "query": {
    "match": {
      "from": "${sender}"
    }
  },
  "aggs": {
    "senders": {
      "terms": {
        "field": "from",
        "size": 10
      }
    }
  }
}
All of the expected results were returned. If you are attempting to pass the query in via the flowfile content, you cannot use Expression Language (${unique_id}). That's expected behavior because Expression Language is not evaluated on the contents of flowfiles, only on configuration properties.
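For reference, a sketch of that property-based approach: keep ${unique_id} in the query, but set the query in the processor's Query property (a configuration property, where Expression Language is evaluated against the incoming flowfile's attributes) instead of passing it as flowfile content:

{
  "query": {
    "bool": {
      "must": [
        { "match": { "event": "New" } },
        { "match": { "uniqueId": "${unique_id}" } },
        { "match": { "header.schemaVersion": "1.3" } }
      ]
    }
  },
  "sort": {
    "header.sourceSystemCreationTimestamp": { "order": "desc" }
  }
}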

How to check field data is numeric when using inline Script in ElasticSearch

Per our requirements, we need to find the max ID of the existing documents before adding a new document. The problem is that the field may also contain string data, so I had to use an inline script in the query to find the max ID only for documents whose Key holds integer data, returning 0 otherwise. I am using the following inline script query to find the max key, but it is not working. Can you help me with this?
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "Name": {
              "value": "Test2"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "MaxId": {
      "max": {
        "field": "Key",
        "script": {
          "inline": "((doc['Key'].value).isNumber()) ? Integer.parseInt(doc['Key'].value) : 0"
        }
      }
    }
  }
}
The error occurs because the max aggregation only supports numeric fields, i.e. you cannot specify a string field (here, Key) in a max aggregation.
Simply remove the "field": "Key" part and keep only the script part:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "Name": "Test2"
          }
        }
      ]
    }
  },
  "aggs": {
    "MaxId": {
      "max": {
        "script": {
          "source": "((doc['Key'].value).isNumber()) ? Integer.parseInt(doc['Key'].value) : 0"
        }
      }
    }
  }
}
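As a side note, isNumber() is a Groovy String method; if your cluster defaults to Painless, a hedged alternative (assuming Key is mapped as a keyword) is to catch the parse failure instead:

{
  "size": 0,
  "aggs": {
    "MaxId": {
      "max": {
        "script": {
          "lang": "painless",
          "source": "try { return Integer.parseInt(doc['Key'].value); } catch (NumberFormatException e) { return 0; }"
        }
      }
    }
  }
}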

Elasticsearch terms query on array of values

I have data in an Elasticsearch index that looks like this:
{
  "title": "cubilia",
  "people": [
    "Ling Deponte",
    "Dana Madin",
    "Shameka Woodard",
    "Bennie Craddock",
    "Sandie Bakker"
  ]
}
Is there a way for me to search for all the people whose name starts with "ling" (it should be case-insensitive) and get the distinct terms properly cased ("Ling Deponte", not "ling deponte")?
I am fine with changing the mappings on the index in any way.
Edit: this does what I want, but it is a really bad query:
{
  "size": 0,
  "aggs": {
    "person": {
      "filter": {
        "bool": {
          "should": [
            {
              "regexp": {
                "people.raw": "(.* )?[lL][iI][nN][gG].*"
              }
            }
          ]
        }
      },
      "aggs": {
        "top-colors": {
          "terms": {
            "size": 10,
            "field": "people.raw",
            "include": {
              "pattern": ["(.* )?[lL][iI][nN][gG].*"]
            }
          }
        }
      }
    }
  }
}
people.raw is not_analyzed
Yes, and you can do it without a regular expression by taking advantage of Elasticsearch's full text capabilities.
GET /test/_search
{
  "query": {
    "match_phrase": {
      "people": "Ling"
    }
  }
}
Note: This could also be match or match_phrase_prefix in this case. The match_phrase* queries imply an order of the values in the text. match simply looks for any of the values. Since you only have one value, it's pretty much irrelevant.
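For the "starts with ling" part of your question specifically, a minimal sketch with match_phrase_prefix (case-insensitive because the people field is analyzed) would be:

GET /test/_search
{
  "query": {
    "match_phrase_prefix": {
      "people": "ling"
    }
  }
}

Keep in mind this matches any name containing a token that starts with "ling", not only names whose first word does.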
The problem is that you cannot limit the document responses to just that name because the search API returns documents. With that said, you can use nested documents and get the desired behavior via inner_hits.
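A minimal sketch of that nested approach, assuming you re-index people as an array of objects with a name field (mapping syntax shown for recent versions; the sub-field names are assumptions):

PUT /test
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "people": {
        "type": "nested",
        "properties": {
          "name": {
            "type": "text",
            "fields": { "raw": { "type": "keyword" } }
          }
        }
      }
    }
  }
}

GET /test/_search
{
  "query": {
    "nested": {
      "path": "people",
      "query": {
        "match_phrase_prefix": { "people.name": "ling" }
      },
      "inner_hits": {}
    }
  }
}

inner_hits then returns only the matching people entries, properly cased, instead of the whole document.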
You want to avoid leading wildcards in patterns whenever possible, because they simply do not work at scale. To put it in SQL terms, it's like doing a full table scan; you effectively lose the benefit of the inverted index because it has to be walked entirely to find the actual matches.
Combining the two should work pretty well, though. Here, I use the query to whittle the results down to the documents you are interested in, then I use your terms aggregation's include pattern to keep only the matching values.
{
  "size": 0,
  "query": {
    "match_phrase": {
      "people": "Ling"
    }
  },
  "aggs": {
    "person": {
      "terms": {
        "size": 10,
        "field": "people.raw",
        "include": {
          "pattern": ["(.* )?[lL][iI][nN][gG].*"]
        }
      }
    }
  }
}
Hi, please find the following query; it may help with your request.
GET skills/skill/_search
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {
              "wildcard": {
                "skillNames.raw": "jav*"
              }
            }
          ]
        }
      }
    }
  }
}
My intention is to find documents whose skill names start with "jav".

Elastic Search Aggregation with scripting giving exception

Query Failed [Failed to execute main query]]; nested: GroovyScriptExecutionException[ArrayIndexOutOfBoundsException[-1]];
I am using the following aggs query
{
  "aggs": {
    "hscodes_eval": {
      "terms": {
        "field": "fscode"
      },
      "aggs": {
        "top_6_fscodes": {
          "terms": {
            "field": "fscode",
            "script": "doc[\"fscode\"].value[0..6]"
          }
        }
      }
    }
  }
}
I want to get the count of documents grouped by the first 6 characters of the fscode field, but I am getting the above exception. Please help.
Try this:
{
  "aggs": {
    "hscodes_eval": {
      "terms": {
        "field": "fscode"
      },
      "aggs": {
        "top_6_fscodes": {
          "terms": {
            "script": "fieldValue = doc['fscode'].value; if (fieldValue.length() >= 7) fieldValue[0..6] else ''"
          }
        }
      }
    }
  }
}
But I wouldn't do it like this, i.e. with a script. Scripts are usually slow, and if you have many documents the cost adds up pretty quickly.
I haven't thought much about this, but my gut feeling is that I would try, at indexing time, to put the first 6 characters of fscode into a sub-field of fscode, maybe using a truncate filter. Then, at search time, I would run the terms aggregation not on fscode but on that already-defined sub-field, as sketched below.
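A hedged sketch of that indexing-time idea, using recent mapping syntax and made-up analyzer/sub-field names: a keyword tokenizer plus a truncate token filter keeps only the first 6 characters, and the terms aggregation then runs on the sub-field:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "first_6_chars": {
          "type": "truncate",
          "length": 6
        }
      },
      "analyzer": {
        "fscode_prefix": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["first_6_chars"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "fscode": {
        "type": "keyword",
        "fields": {
          "prefix": {
            "type": "text",
            "analyzer": "fscode_prefix",
            "fielddata": true
          }
        }
      }
    }
  }
}

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "top_6_fscodes": {
      "terms": {
        "field": "fscode.prefix"
      }
    }
  }
}

Note that enabling fielddata on a text sub-field has a memory cost; measure it against the scripted aggregation before committing to either approach.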
