Elastic Search Aggregation with scripting giving exception - elasticsearch

Query Failed [Failed to execute main query]]; nested: GroovyScriptExecutionException[ArrayIndexOutOfBoundsException[-1]];
I am using the following aggs query
{
"aggs": {
"hscodes_eval": {
"terms": {
"field": "fscode"
},
"aggs": {
"top_6_fscodes": {
"terms": {
"field": "fscode",
"script": "doc[\"fscode\"].value[0..6]"
}
}
}
}
}
}
I want to get the count of documents matching the first 6 characters of field fscode.
But I am getting above exception.Please help.

Try this:
{
"aggs": {
"hscodes_eval": {
"terms": {
"field": "fscode"
},
"aggs": {
"top_6_fscodes": {
"terms": {
"script": "fieldValue=doc['fscode'].value;if(fieldValue.length()>=7) fieldValue[0..6] else ''"
}
}
}
}
}
}
But I wouldn't do it like this, meaning using a script. Scripts are usually slow and if you have many documents, it can add up pretty quickly.
Haven't thought much about this, but my gut feeling says I would try, at indexing time, to put in a sub-field of fscode the first 6 characters of the initial fscode, maybe using a truncate filter. Then, at search time, I would use a terms aggregation not on fscode, but on the subfield already defined.

Related

"Filter then Aggregation" or just "Filter Aggregation"?

I am working on ES recently and I found that I could achieve the almost same result but I have no clear idea as to the DIFFERENCE between these two.
"Filter then Aggregation"
POST kibana_sample_data_flights/_search
{
"size": 0,
"query": {
"constant_score": {
"filter": {
"term": {
"DestCountry": "CA"
}
}
}
},
"aggs": {
"ca_weathers": {
"terms": { "field": "DestWeather" }
}
}
}
"Filter Aggregation"
POST kibana_sample_data_flights/_search
{
"size": 0,
"aggs": {
"ca": {
"filter": {
"term": {
"DestCountry": "CA"
}
},
"aggs": {
"_weathers": {
"terms": { "field": "DestWeather" }
}
}
}
}
}
My Questions
Why there are two similar functions? I believe I am wrong about it but what's the difference then?
(please do ignore the result format, it's not the question I am asking ;p)
Which is better if I want to filter out the unrelated/unmatched and start the aggregation on lots of documents?
When you use it in "query", you're creating a context on ALL the docs in your index. In this case, it acts like a normal filter like: SELECT * FROM index WHERE (my_filter_condition1 AND my_filter_condition2 OR my_filter_condition3...).
When you use it in "aggs", you're creating a context on ALL the docs that might have (or haven't) been previously filtered. Let's say that if you have an structure like:
#OPTION A
{
"aggs":{
t_shirts" : {
"filter" : { "term": { "type": "t-shirt" } }
}
}
}
Without a "query", is exactly the same as having
#OPTION B
{
"query":{
"filter" : { "term": { "type": "t-shirt" } }
}
}
BUT the results will be returned in different fields.
In the Option A, the results will be returned in the aggregations field.
In the Option B, the results will be returned in the hits field.
I would recommend to apply your filters always on the query part, so you can work with subsecuent aggregations of the already filtered docs. Also because Aggrgegations cost more performance than queries.
Hope this is helpful! :D
Both filters, used in isolation, are equivalent. If you load no results (hits), then there is no difference. But you can combine listing and aggregations. You can query or filter your docs for listing, and calculate aggregations on bucket further limited by the aggs filter. Like this:
POST kibana_sample_data_flights/_search
{
"size": 100,
"query": {
"bool": {
"filter": {
"term": {
... some other filter
}
}
}
},
"aggs": {
"ca_filter": {
"term": {
"TestCountry": "CA"
}
},
"aggs": {
"ca_weathers": {
"terms": { "field": "DestWeather" }
}
}
}
}
But more likely you will need the other way, ie. make aggregations on all docs, to display summary informations, while you display docs from specific query. In this case you need to combine aggragations with post_filter.
Answer from #Val's comment, I may just quote here for reference:
In option A, the aggregation will be run on ALL documents. In option B, the documents are first filtered and the aggregation will be run only on the selected documents. Say you have 10M documents and the filter select only a 100, it's pretty evident that option B will always be faster.

Elasticsearch terms query on array of values

I have data on ElasticSearch index that looks like this
{
"title": "cubilia",
"people": [
"Ling Deponte",
"Dana Madin",
"Shameka Woodard",
"Bennie Craddock",
"Sandie Bakker"
]
}
Is there a way for me to do a search for all the people whos name starts with
"ling" (should be case insensitive) and get distinct terms properly cased "Ling Deponte" not "ling deponte"?
I am find with changing mappings on the index in any way.
Edit does what I want but is really bad query:
{
"size": 0,
"aggs": {
"person": {
"filter": {
"bool":{
"should":[
{"regexp":{
"people.raw":"(.* )?[lL][iI][nN][gG].*"
}}
]}
},
"aggs": {
"top-colors": {
"terms": {
"size":10,
"field": "people.raw",
"include":
{
"pattern": ["(.* )?[lL][iI][nN][gG].*"]
}
}
}
}
}
}
}
people.raw is not_analyzed
Yes, and you can do it without a regular expression by taking advantage of Elasticsearch's full text capabilities.
GET /test/_search
{
"query": {
"match_phrase": {
"people": "Ling"
}
}
}
Note: This could also be match or match_phrase_prefix in this case. The match_phrase* queries imply an order of the values in the text. match simply looks for any of the values. Since you only have one value, it's pretty much irrelevant.
The problem is that you cannot limit the document responses to just that name because the search API returns documents. With that said, you can use nested documents and get the desired behavior via inner_hits.
You do not want to do wildcard prefixing whenever possible because it simply does not work at scale. To put it in SQL terms, that's like doing a full table scan; you effectively lose the benefit of the inverted index because it has to walk it entirely to find the actual start.
Combining the two should work pretty well though. Here, I use the query to widdle down results to what you are interested in, then I use your inner aggregation to only include based on the value.
{
"size": 0,
"query": {
"match_phrase": {
"people": "Ling"
}
}
"aggs": {
"person": {
"terms": {
"size":10,
"field": "people.raw",
"include": {
"pattern": ["(.* )?[lL][iI][nN][gG].*"]
}
}
}
}
}
Hi Please find the query it may help for your request
GET skills/skill/_search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"wildcard": {
"skillNames.raw": "jav*"
}
}
]
}
}
}
}
}
My intention is to find documents starting with the "jav"

elastic search by day aggregation, sum of two properties

I'm trying to aggregate on the sum of two fields, but can't seem to get the syntax right.
Let's say I have the following aggregation:
{
"aggregations": {
"byDay": {
"date_histogram": {
"field": "#timestamp",
"interval": "1d"
},
"aggregations": {
"sum_a": {
"sum": {
"field": "a"
}
},
"sum_b": {
"sum": {
"field": "b"
}
},
"sum_a_and_b": {
/* what goes here? */
}
}
}
}
}
What I really want is an aggregation that is the sum of fields a and b.
It seem like something that would be simple, but I've hit a brick wall trying to get it right. Online examples have either been too simple (summing only on one field), or tried to do much more than this, so I've not found them helpful.
Try Terms Aggregation generating the terms using a script :
"aggs": {
"sum_a_and_b": {
"terms": {
"script": "doc['a'].value + doc['b'].value"
}
}
}
In order to enable dynamic scripting add the following to your config file (elasticsearch.yml by default) :
script.aggs: true # enable just for aggregations

Getting cardinality of multiple fields?

How can I get count of all unique combinations of values of 2 fields that are present in documents of my database, i.e. achieve the same functionality as the "cardinality" aggregation provides, but for more than 1 field?
You can use a script to achieve this. Assuming the character '#' is not present in any value of both the fields (you can use anything else to act as a separator), the query you're looking for is as under. Mind you, scripting will come with a performance hit.
{
"aggs" : {
"multi_field_cardinality" : {
"cardinality" : {
"script": "doc['<field1>'].value + '#' + doc['<field2'].value"
}
}
}
}
Read more about it here.
A better solution is to use nested aggregations and then count the resulting buckets.
"aggs": {
"Group1": {
"terms": {
"field": "Field1",
"size": 0
},
"aggs": {
"Group2": {
"terms": {
"field": "Field2",
"size": 0
}
}
}
}
}

Sum-aggregation script for term frequencies without dynamic scripting

I try to evaluate a web-application for my masterthesis. For this I want to make a user study, where I prepare the data in elasitc found, and send my web application to the testers. As far as I know, elastic found does not allow dynamic scripting for security reasons. I try to refomulate the following dynamic script query:
GET my_index/document/_search
{
"query": {
"match_all":{}
},
"aggs": {
"stadt": {
"sum": {
"script": "_index['textBody']['frankfurt'].tf()"
}
}
}
}
This query sums up all term frequencies in the document field textBody for the term frankfurt.
In order to reformulate the query without dynamic scripting, I've taken a look on groovy scripts without dynamic scripting, but I still get parsing errors.
My approach to this was:
GET my_index/document/_search
{
"query": {
"match_all":{}
},
"aggs": {
"stadt": {
"sum": {
"script": {
"script_id": "termFrequency",
"lang" : "groovy",
"params": {
"term" : "frankfurt"
}
}
}
}
}
}
and the file termFrequency.groovy in the scripts directory:
_index['textBody'][term].tf()
I get the following parsing error:
Parse Failure [Unexpected token START_OBJECT in [stadt].]
This is the correct syntax assuming your file is inside config/scripts directory.
{
"query": {
"match_all": {}
},
"aggs": {
"stadt": {
"sum": {
"script_file": "termFrequency",
"lang": "groovy",
"params": {
"term": "frankfurt"
}
}
}
},
"size": 0
}
Also the term should be variable rather than string so it should be
_index['textBody'][term].tf()
Hope this helps!

Resources