Fast Elasticsearch CASE WHEN THEN ELSE equivalent?

I need to build an exclusive bucketing aggregation in Elasticsearch, i.e. each document is assigned to the FIRST bucket whose criterion it meets, not to ALL buckets it matches, since the filters might overlap. This is the same behavior as CASE WHEN THEN ELSE in SQL. Currently I am using a Filters Aggregation combined with a Bool Query to achieve this. The idea is to use the "must" and "must_not" parts of the Bool Query: "must" holds the filter for the current bucket, and "must_not" holds all the filters already used by earlier buckets. An example would be:
GET _search
{
"query":{"match_all":{}},
"size":0,
"aggs":{
"bin_1": {
"filter": {
"bool": {
"must": { <filter1> },
"must_not": { <empty> }
}
}
},
"bin_2": {
"filter": {
"bool": {
"must": { <filter2> },
"must_not": { <filter1> }
}
}
},
"bin_3": {
"filter": {
"bool": {
"must": { <filter3> },
"must_not": [ <filter1>, <filter2> ]
}
}
},
"bin_else": {
"filter": {
"bool": {
"must": { <empty> },
"must_not": [ <filter1>, <filter2>, <filter3> ]
}
}
}
}
}
In a relational approach, the same would be achieved by the CASE WHEN clause like so:
CASE WHEN <filter1> THEN <bin_1>
WHEN <filter2> THEN <bin_2>
WHEN <filter3> THEN <bin_3>
ELSE <bin_else>
END
The problem with this approach is that it gets slower with every bucket I add (in my real case I even have nested buckets). Does Elasticsearch have built-in support for exclusive bucketing like this? Or is there a faster approach that would yield the same results?
Thank you!
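The must/must_not chaining described in the question can be generated from an ordered list of filters rather than written by hand. The following is a minimal Python sketch of that idea, not a built-in Elasticsearch feature: the function name exclusive_bins and the example fields (price, category) are hypothetical, and it only builds the request body, so it does not by itself address the performance concern.

```python
from typing import Any, Dict, List, Tuple

Query = Dict[str, Any]


def exclusive_bins(ordered_filters: List[Tuple[str, Query]],
                   else_name: str = "bin_else") -> Dict[str, Query]:
    """Build filter aggregations where a document counts only in the first bin it matches."""
    if not ordered_filters:
        raise ValueError("at least one (name, filter) pair is required")
    aggs: Dict[str, Query] = {}
    seen: List[Query] = []  # filters already claimed by earlier bins
    for name, flt in ordered_filters:
        bool_query: Query = {"must": [flt]}
        if seen:
            # Exclude everything an earlier bin would have matched.
            bool_query["must_not"] = list(seen)
        aggs[name] = {"filter": {"bool": bool_query}}
        seen.append(flt)
    # The ELSE bin: whatever no earlier filter matched.
    aggs[else_name] = {"filter": {"bool": {"must_not": list(seen)}}}
    return aggs


# Hypothetical, overlapping filters for illustration only.
cheap = {"range": {"price": {"lt": 10}}}
books = {"term": {"category": "books"}}

body = {
    "query": {"match_all": {}},
    "size": 0,
    "aggs": exclusive_bins([("bin_1", cheap), ("bin_2", books)]),
}
```

A cheap book matches both filters but is counted only in bin_1, because bin_2 carries the first filter in its must_not list.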

I think the solution would be to use script fields. A script can use if/else logic, so the conditions are checked in order and no extra negated conditions are needed. I do not know what kind of filters you are using, but I think it should be possible to express most of them this way. Here is an equivalent of:
SELECT
CASE WHEN <filter1> THEN <bin_1>
WHEN <filter2> THEN <bin_2>
ELSE <bin_else>
END as binning
FROM SOMETHING
implemented using script fields in the Painless language. Script fields are described here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-script-fields.html
and painless here:
https://www.elastic.co/guide/en/elasticsearch/painless/5.6/painless-examples.html
GET _search
{
"query" : { "match_all": {} },
"script_fields" : {
"binning" : {
"script" : {
"lang": "painless",
"source": "if (<filter>) {return <bin1>;} else if (<filter2>) {return <bin2>;} else {return <bin3>;}"
}
}
}
}
where <filter> would be something like doc['my_field'].value == 'value1', and 'my_field' is the field that you use in the filter. The string literal is single-quoted so that it does not need escaping inside the JSON "source" string.
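As a rough illustration of the answer above, the if / else if / else chain can be assembled from an ordered list of (condition, bin) pairs. This is a minimal Python sketch: the helper name case_when_script and the field status are hypothetical, the conditions are assumed to be valid Painless expressions, and the bin labels are assumed to contain no single quotes. Keep in mind that script fields are evaluated per returned hit, so this labels hits rather than producing bucket counts.

```python
import json
from typing import List, Tuple


def case_when_script(branches: List[Tuple[str, str]], default: str) -> str:
    """Render ordered (condition, label) pairs as a Painless if/else-if/else chain."""
    if not branches:
        return f"return '{default}';"
    parts = []
    for i, (condition, label) in enumerate(branches):
        keyword = "if" if i == 0 else "else if"
        parts.append(f"{keyword} ({condition}) {{ return '{label}'; }}")
    parts.append(f"else {{ return '{default}'; }}")
    return " ".join(parts)


# Hypothetical conditions on a keyword field named "status".
source = case_when_script(
    [
        ("doc['status'].value == 'a'", "bin_1"),
        ("doc['status'].value == 'b'", "bin_2"),
    ],
    default="bin_else",
)

body = {
    "query": {"match_all": {}},
    "script_fields": {
        "binning": {"script": {"lang": "painless", "source": source}}
    },
}

# json.dumps takes care of escaping the script inside the request body.
request_json = json.dumps(body)
```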

Related

Is it possible to access a query term in a script field?

I would like to construct an elasticsearch query in which I can search for a term and on-the-fly compute a new field for each found document, which is calculated based on some existing fields as well as the query term. Is this possible?
For example, let's say in my ES query I am searching for documents which have the keyword "amsterdam" in the "text" field.
"filter": [
{
"match_phrase": {
"text": {
"query": "amsterdam"
}
}
}]
Now I would also like to have a script field in my query, which computes some value based on other fields as well as the query.
So far, I have only found how to access the other fields of a document, using doc['someOtherField'], for example:
"script_fields" : {
"new_field" : {
"script" : {
"lang": "painless",
"source": "if (doc['citizens'].value > 10000) { return 'large'; } return 'small';"
}
}
}
How can I integrate the query term, e.g. if I wanted to add to the if statement "if the query term starts with a-e"?
You're on the right track, but script_fields are primarily used to post-process your documents' attributes; they won't help you filter any docs because they run after the query phase.
With that being said, you can use scripts to filter your documents through script queries. Before you do that, though, you should explore alternatives.
In other words, scripts should be used when all other mechanisms and techniques have been exhausted.
Back to your example. I see three possibilities off the top of my head.
Match phrase prefix queries as a group of bool-should subqueries:
POST your-index/_search
{
"query": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"match_phrase_prefix": {
"text_field": "a"
}
},
{
"match_phrase_prefix": {
"text_field": "b"
}
},
{
"match_phrase_prefix": {
"text_field": "c"
}
},
... till the letter "e"
]
}
}
]
}
}
}
A regexp query:
POST your-index/_search
{
"query": {
"bool": {
"must": [
{
"regexp": {
"text_field": "[a-e].+"
}
}
]
}
}
}
Script queries using .charAt comparisons:
POST your-index/_search
{
"query": {
"bool": {
"must": [
{
"script": {
"script": {
"source": """
char c = doc['text_field.keyword'].value.charAt(0);
return c >= params.gte.charAt(0) && c <= params.lte.charAt(0);
""",
"params": {
"gte": "a",
"lte": "e"
}
}
}
}
]
}
}
}
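To make the difference between the regexp pattern and the charAt comparison concrete, here is a small Python sketch that mimics both checks on a single term. It runs no Elasticsearch code, the function names are hypothetical, and it assumes (as Lucene regular expressions do) that the pattern must match the whole term.

```python
import re


def matches_regexp(term: str, pattern: str = r"[a-e].+") -> bool:
    """Mimic the regexp query: the pattern has to match the entire term."""
    return re.fullmatch(pattern, term) is not None


def matches_char_range(term: str, gte: str = "a", lte: str = "e") -> bool:
    """Mimic the script query: compare the first character against a range."""
    if not term:
        # The Painless script would throw on an empty value; this sketch returns False.
        return False
    return gte[0] <= term[0] <= lte[0]


samples = ["amsterdam", "Amsterdam", "frankfurt", "a"]
results = {s: (matches_regexp(s), matches_char_range(s)) for s in samples}
```

Both checks are case-sensitive, so "Amsterdam" fails unless the field is lowercased at index time. They disagree on the one-letter term "a": the pattern [a-e].+ requires at least one more character, while the first-character comparison accepts it.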

Elasticsearch : filter results based on the date range

I'm using Elasticsearch 6.6 and trying to extract multiple records based on multiple values (email_address) passed to a bool query, restricted to a date range. For example, I want to extract information about a few employees based on their email_address (annie@test.com, charles@test.com, heman@test.com) from a given project_date onwards (2019-01-01).
I used a should clause, but unfortunately it pulls all the records in the date range, i.e. it also returns other employees' information from project_date 2019-01-01 onwards.
{
"query": {
"bool": {
"should": [
{ "match": { "email_address": "annie@test.com" }},
{ "match": { "email_address": "chalavadi@test.com" }}
],
"filter": [
{ "range": { "project_date": { "gte": "2019-08-01" }}}
]
}
}
}
I also tried a must clause, but got no results. Could you please help me find employees by their email_address within the date range?
Thanks in advance.
Should(Or) clauses are optional
Quoting from this article.
"In a query, if must and filter queries are present, the should query occurrence then helps to influence the score. However, if bool query is in a filter context or has neither must nor filter queries, then at least one of the should queries must match a document."
So in your query, should only influences the score and does not actually filter documents. You must wrap the should clauses in must, or move them into filter (if scoring is not required).
GET employeeindex/_search
{
"query": {
"bool": {
"filter": {
"range": {
"projectdate": {
"gte": "2019-01-01"
}
}
},
"must": [
{
"bool": {
"should": [
{
"term": {
"email.raw": "abc@text.com"
}
},
{
"term": {
"email.raw": "efg@text.com"
}
}
]
}
}
]
}
}
}
You can also replace the should clauses with a terms clause, as in @AlwaysSunny's answer.
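Another option, not mentioned in the answers here, is to keep the original should clauses and set the bool query's minimum_should_match parameter to 1, which makes at least one should clause mandatory even when a filter clause is present. A minimal Python sketch of that request body follows; the helper name emails_in_range is hypothetical and the field names are taken from the question.

```python
from typing import Any, Dict, List


def emails_in_range(emails: List[str], since: str) -> Dict[str, Any]:
    """Match any of the given addresses, restricted to project_date >= since."""
    return {
        "query": {
            "bool": {
                "should": [{"match": {"email_address": e}} for e in emails],
                # Without this, should is optional because a filter clause is present.
                "minimum_should_match": 1,
                "filter": [{"range": {"project_date": {"gte": since}}}],
            }
        }
    }


body = emails_in_range(["annie@test.com", "charles@test.com"], "2019-01-01")
```

The should clauses still contribute to the score in this form; if scoring is not needed, moving everything into filter avoids that work.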
You can do it more concisely with terms and range clauses, both placed inside filter. Your existing query doesn't work as expected because the should clause becomes optional once a filter clause is present, so it no longer restricts the results. Read more here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
{
"query": {
"bool": {
"filter": [
{
"terms": {
"email_address.keyword": [
"annie@test.com", "chalavedi@test.com"
]
}
},
{
"range": {
"project_date": {
"gte": "2019-08-01"
}
}
}
]
}
}
}

Elasticsearch should query without computing relevance (_score)

I'm creating filtering queries which operate on two fields. I would like to avoid having Elasticsearch compute relevance. How can I express an OR condition without moving to query context?
My simplified model has two boolean fields:
{
is_opened,
is_send
}
I'd like to prepare a query with this logic:
(is_opened == true AND is_send == true) OR (is_opened == false)
In other words, I want to exclude documents where:
is_opened == true AND is_send == false
My query looks like that:
GET documents/default/_search
{
"query": {
"bool": {
"should": [
{
"bool":{
"must":[
{"term": {"is_opened":true}},
{"term": {"is_send":true}}
]
}
},
{
"bool":{
"must":[
{"term": {"is_opened":false}}
]
}
}
]
}
}
}
Logically it works as I expected, but Elasticsearch computes relevance.
I don't need it because in the end I sort the results by another field, so this is a place to optimize the queries.
I'm asking because frequently used filters are cached automatically by Elasticsearch to speed up performance.
My results have the _score field computed, so I think the above query is executed in query context and Elasticsearch won't cache it automatically.
In the future I would like to create queries that operate on status fields, where the logic would be more complicated, so I still need to know how to prevent _score from being computed.
I noticed that changing should to filter stops _score from being computed, but it then behaves like the must operator. Is it possible to change the filter behavior?
Is it possible to use a query other than should?
How can I force Elasticsearch to stop computing _score?
Simply wrap your query inside the constant_score query:
GET documents/default/_search
{
"query": {
"constant_score": {
"filter": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"term": {
"is_opened": true
}
},
{
"term": {
"is_send": true
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"is_opened": false
}
}
]
}
}
]
}
}
}
}
}
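As an alternative sketch to the constant_score wrapper, the OR can also be kept inside a bool filter clause by nesting a second bool query with should clauses. This directly addresses the observation that a bare filter behaves like must. The Python below only builds the request body and checks the intended truth table; the helper names are hypothetical, and it assumes clauses under filter run in filter context so relevance is not computed for them.

```python
from itertools import product
from typing import Any, Dict

Query = Dict[str, Any]


def or_in_filter_context(*alternatives: Query) -> Query:
    """Wrap OR-ed alternatives in bool.filter so they do not contribute to _score."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"bool": {"should": list(alternatives), "minimum_should_match": 1}}
                ]
            }
        }
    }


opened_and_sent = {
    "bool": {"filter": [{"term": {"is_opened": True}}, {"term": {"is_send": True}}]}
}
not_opened = {"term": {"is_opened": False}}
body = or_in_filter_context(opened_and_sent, not_opened)


def wanted(is_opened: bool, is_send: bool) -> bool:
    """The logic from the question, written as a plain predicate."""
    return (is_opened and is_send) or (not is_opened)


# Enumerate all four field combinations and collect the ones that should be excluded.
excluded = [combo for combo in product([True, False], repeat=2) if not wanted(*combo)]
```

The only excluded combination is is_opened == true with is_send == false, which matches the requirement stated in the question.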

Elastic Search Filter performing much slower than Query

As my ES index/cluster has scaled up (# ~2 billion docs now), I have noticed more significant performance loss. So I started messing around with my queries to see if I could squeeze some perf out of them.
As I did this, I noticed that when I used a Boolean Query in my Filter, my results would take about 3.5-4 seconds to come back. But if I do the same thing in my Query it is more like 10-20ms
Here are the 2 queries:
Using a filter
POST /backup/entity/_search?routing=39cd0b95-efc3-4eee-93d1-93e6f5837d6b
{
"query": {"bool":{"should":[],"must":[{"match_all":{}}]}},
"filter": {
"bool": {
"must": [
{
"term": {
"serviceId": "39cd0b95-efc3-4eee-93d1-93e6f5837d6b"
}
},
{
"term": {
"subscriptionId": "3eb5021e-2f1d-4292-9fd5-95788ebfafa0"
}
},
{
"term": {
"subscriptionType": 0
}
},
{
"terms": {
"entityType": [
"4"
]
}
}
]
}
}
}
Using a query
POST /backup/entity/_search?routing=39cd0b95-efc3-4eee-93d1-93e6f5837d6b
{
"query": {"bool":{"should":[],"must":[
{
"term": {
"serviceId": "39cd0b95-efc3-4eee-93d1-93e6f5837d6b"
}
},
{
"term": {
"subscriptionId": "3eb5021e-2f1d-4292-9fd5-95788ebfafa0"
}
},
{
"term": {
"subscriptionType": 0
}
},
{
"terms": {
"entityType": [
"4"
]
}
}
]}}
}
Like I said, the second method where I don't use a Filter at all takes mere milliseconds, while the first query takes almost 4 seconds. This seems completely backwards from what the documentation says. They say that the Filter should actually be very quick and the Query should be the one that takes longer. So why am I seeing the exact opposite here?
Could it be something with my index mapping? If anyone has any idea why this is happening I would love to hear suggestions.
Thanks
The root-level filter element is actually another name for the post_filter element. It was supposed to be removed in ES 1.1, but it slipped through and still exists in the 2.x versions as well.
It is removed completely in ES 5, though.
So, your first query is not a "filter" query. It's a query whose results are first used in aggregations (if applicable), and only then is the post_filter/filter applied to the hits. So you basically have a two-step process there: https://www.elastic.co/guide/en/elasticsearch/reference/1.5/search-request-post-filter.html
More about its performance, quoting the documentation:
"While we have gained cacheability of the tag filter, we have potentially increased the cost of scoring significantly. Post filters are useful when you need aggregations to be unfiltered, but hits to be filtered. You should not be using post_filter (or its deprecated top-level synonym filter) if you do not have facets or aggregations."
A proper filter query is the following:
{
"query": {
"filtered": {
"query": {
"bool": {
"should": [],
"must": [
{
"match_all": {}
}
]
}
},
"filter": {
"bool": {
"must": [
{
"term": {
"serviceId": "39cd0b95-efc3-4eee-93d1-93e6f5837d6b"
}
},
{
"term": {
"subscriptionId": "3eb5021e-2f1d-4292-9fd5-95788ebfafa0"
}
},
{
"term": {
"subscriptionType": 0
}
},
{
"terms": {
"entityType": [
"4"
]
}
}
]
}
}
}
}
}
A filter is faster. Your problem is that you include the match_all query in your filter case. This matches on all 2 billion of your documents. A set operation has to then be done against the filter to cull the set. Omit the query portion in your filter test and you'll see that the results are much faster.
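Note that the filtered query used in the first answer was deprecated in Elasticsearch 2.0 and removed in 5.0; on later versions the same intent is expressed with a bool query and its filter clause. A minimal Python sketch of that form, using the field names and values from the question, is shown below. The helper name bool_filter_query is hypothetical.

```python
from typing import Any, Dict


def bool_filter_query(criteria: Dict[str, Any]) -> Dict[str, Any]:
    """Turn field -> value (or list of values) into non-scoring bool.filter clauses."""
    clauses = []
    for field, value in criteria.items():
        if isinstance(value, (list, tuple)):
            # Several accepted values for one field: use a terms clause.
            clauses.append({"terms": {field: list(value)}})
        else:
            clauses.append({"term": {field: value}})
    return {"query": {"bool": {"filter": clauses}}}


body = bool_filter_query(
    {
        "serviceId": "39cd0b95-efc3-4eee-93d1-93e6f5837d6b",
        "subscriptionId": "3eb5021e-2f1d-4292-9fd5-95788ebfafa0",
        "subscriptionType": 0,
        "entityType": ["4"],
    }
)
```

No match_all query is needed here: a bool query that contains only filter clauses matches exactly the documents that pass every filter.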

Elasticsearch where clause with constant rank? How to do this?

I'm new to Elasticsearch. How do I write the Elasticsearch equivalent of this query?
select * from response where pnrno='sampleid'
I know we have to use the 'filter' option in Elasticsearch, but we do not need any ranking (the ranking can be constant). How can I write a query that achieves this?
You are correct: you can use a filtered query with an empty query clause and only filters. Filtering first narrows the set of documents on which the query then acts to further match and calculate relevance. Filters are boolean: a document either matches or is rejected (1/0).
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [{
"term": {
"FIELD": "VALUE"
}
}]
}
}
}
}
}
The usual way of achieving this is by using the constant_score query with an embedded term filter, like this:
{
"query": {
"constant_score": {
"filter": {
"term": {
"pnrno": "sampleid"
}
}
}
}
}
