Elasticsearch: Exclude filter clause from scoring - elasticsearch

I have a filter clause deep inside a query clause but I think it doesn't make sense to calculate a score for the filter clause. How can I take this filter clause out? Would this improve performance?
{
  "size" : 30,
  "sort" : [
    {
      "_score" : {
        "order" : "desc"
      }
    }
  ],
  "query" : {
    "function_score" : {
      "score_mode" : "sum",
      "boost_mode" : "sum",
      "functions" : [
        {
          ...
          <filter_clause>
        }
      ]
    }
  }
}

You can wrap any (sub)query in the filter context which is a yes/no operation where no scoring occurs:
{
  "query": {
    "bool": {
      "filter": [
        {...}
      ]
    }
  }
}
That said, function_score exists to affect scoring, so I don't think it makes sense to take scoring out of it this way.
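As a minimal sketch of the wrapping described above, the Python helper below nests any query clause (given as a dict) under bool.filter, so the clause only includes or excludes documents. The helper name `in_filter_context` and the `term` clause in the usage line are hypothetical, chosen only for illustration; they are not part of any Elasticsearch client.

```python
import json

def in_filter_context(clause):
    """Wrap a query clause (a dict) in bool.filter so it is evaluated
    as a yes/no match and contributes nothing to the score."""
    return {"bool": {"filter": [clause]}}

# Hypothetical clause, used only for illustration.
body = {"query": in_filter_context({"term": {"is_active": True}})}
print(json.dumps(body, indent=2))
```

The printed body can be sent as the search request body with whatever client you already use.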

Related

How to filter with multiple fields and values in elasticsearch?

I've been reading through the docs and trying to filter results by multiple fields and values, but I keep getting a "malformed query" error.
I want to filter the result with exact equal values, such as the following:
is_active: true
category_id: [1,2,3,4]
brand: "addidas"
gender: "male"
To make it more clear what I intend to do, this is how I'd like it to run if it would be written in SQL:
SELECT .... WHERE
is_active= 1 AND category_id IN(1,2,3,4)
AND brand='addidas' AND gender='male'
My query in DSL goes as following:
{
"body": {
"query": {
"nested": {
"query": {
"bool": {
"must": {
"terms": {
"category_id": [
1,
2,
3
]
},
"term": {
"is_active": true
},
"term": {
"brand": "addidas"
}
}
}
}
}
}
}
}
How do I filter on multiple fields and values, as described, in Elasticsearch?
If you need extra information from me that is required to answer the question, leave a comment. If you add a link to the docs, please also provide an example (with query dsl) of how my current, or similar situations should be solved.
Use a bool query. Clauses under must have to appear in matching documents and contribute to the score:
"query": {
  "bool": {
    "must": [
      {"term": {"is_active": true}},
      {"term": {"gender": "male"}},
      {"term": {"brand": "addidas"}},
      {"terms": {"category_id": [1,2,3,4]}}
    ]
  }
}
Clauses under filter have to match as well, but have no effect on scoring:
"query": {
  "bool": {
    "filter": [
      {"term": {"is_active": true}},
      {"term": {"gender": "male"}},
      {"term": {"brand": "addidas"}},
      {"terms": {"category_id": [1,2,3,4]}}
    ]
  }
}
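If you build such requests from code, the mapping from the SQL-style conditions to filter clauses can be sketched as follows. This is only an illustration: the function name `exact_match_filter` is hypothetical, and it assumes the fields are indexed as exact values (not analyzed text), which is what term and terms need in order to behave like SQL equality and IN.

```python
import json

def exact_match_filter(conditions):
    """Build a bool query whose clauses all run in filter context.
    A list value becomes a `terms` clause (like SQL IN);
    a scalar value becomes a `term` clause (like SQL =)."""
    clauses = []
    for field, value in conditions.items():
        if isinstance(value, (list, tuple)):
            clauses.append({"terms": {field: list(value)}})
        else:
            clauses.append({"term": {field: value}})
    return {"query": {"bool": {"filter": clauses}}}

# Field names and values taken from the question.
body = exact_match_filter({
    "is_active": True,
    "category_id": [1, 2, 3, 4],
    "brand": "addidas",
    "gender": "male",
})
print(json.dumps(body, indent=2))
```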

Elastic Search must not queries are slow

I have a test index of 50K documents.
I'm firing 500 identical queries against it, each with a clause saying that a field (which is an array of values) "must not" contain "some value".
Out of these 500 queries several fail/time out. (Sometimes it's 5, sometimes it's 9, sometimes it's 18 queries...) Is there a way to make the "must not" queries faster? In production the index is going to be several million docs, and the majority of queries are going to have "must not" clauses.
Mapping is as follows:
{
"jobs_en":{
"mappings":{
"index":{
"_all":{
"enabled":false
},
"properties":{
"GUID":{
"type":"string",
"index":"not_analyzed"
},
"channel":{
"type":"string",
"index":"not_analyzed"
},
"country":{
"type":"string",
"analyzer":"standard"
}
}
}
}
}
}
The query is as follows:
{
"bool" : {
"must" : [ {
"bool" : {
"must" : {
"bool" : { }
},
"must_not" : {
"term" : {
"channel" : "Email"
}
}
}
}, {
"bool" : {
"must" : {
"match" : {
"country" : {
"query" : "US",
"type" : "boolean"
}
}
}
}
} ]
}
}
We have a large database in ES, though I don't think it is as large as yours. Several things help me:
1. Use must if you can.
2. Use must_not together with must.
3. If you are able to, limit the returned fields with _source.
"query" : {
"bool" : {
"must": [
{"term": {
"createUser": {
"value": "processor.imsignal"
}
}
},
{"terms" : {
"imcampaignid" : [70191,66983,70188,70235,70190]
}
}
],
"must_not": [
{"term": {
"source": {
"value": "EMAIL"
}
}
},
{"terms" : {
"category" : ["campaign_email","unsubscribe","from_email"]
}
}
]
}
},
"_source": ["category","source","accountPlatformID"]
Specifying a must clause first speeds up the query. Adding must_not can reduce the number of returned records, which can otherwise be a real performance hit. Finally, reducing what is returned for each record can really help.
Since there was no other answer, I figured I'd help with what I knew. Believe it or not, for my purposes this query with the must_not outperforms the identical query with only the must clauses by tens of seconds. Tell the query what a document should be first, then filter by what it should not be.
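Applied to the fields from the question's mapping, the three points above could look like the sketch below. The helper name `narrowed_search` is hypothetical, and the match clause is a simplified form of the one in the question. Whether this shape is actually faster on your index is something to measure, not something the code guarantees.

```python
import json

def narrowed_search(must, must_not, source_fields):
    """Combine positive clauses (must), exclusions (must_not) and a
    restricted _source list into a single search request body."""
    return {
        "query": {"bool": {"must": list(must), "must_not": list(must_not)}},
        "_source": list(source_fields),
    }

body = narrowed_search(
    must=[{"match": {"country": "US"}}],
    must_not=[{"term": {"channel": "Email"}}],
    source_fields=["GUID", "channel"],
)
print(json.dumps(body, indent=2))
```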

Converting SQL query to ElasticSearch Query

I want to convert the following SQL query to an Elasticsearch query. Can anyone help with this?
select csgg, sum(amount) from table1
where type in ('a','b','c') and year=2016 and fc="33" group by csgg having sum(amount)=0
I tried the following:
{
"size": 500,
"query" : {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{"term" : {"fc" : "33"}},
{"term" : {"year" : 2016}}
],
"should" : [
{"terms" : {"type" : ["a","b","c"] }}
]
}
}
}
},
"aggs": {
"group_by_csgg": {
"terms": {
"field": "csgg"
},
"aggs": {
"sum_amount": {
"sum": {
"field": "amount"
}
}
}
}
}
}
But I'm not sure I am doing it right, as I cannot validate the results.
It seems the query needs to be added inside the aggregation.
Assuming that you use Elasticsearch 2.x, it is possible to get HAVING semantics in Elasticsearch.
I'm not aware of a way to do this prior to 2.0.
You can use the new Bucket Selector Aggregation, a pipeline aggregation that keeps only the buckets that meet a given criterion:
POST test/test/_search
{
"size": 0,
"query" : {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{"term" : {"fc" : "33"}},
{"term" : {"year" : 2016}},
{"terms" : {"type" : ["a","b","c"] }}
]
}
}
}
},
"aggs": {
"group_by_csgg": {
"terms": {
"field": "csgg",
"size": 100
},
"aggs": {
"sum_amount": {
"sum": {
"field": "amount"
}
},
"no_amount_filter": {
"bucket_selector": {
"buckets_path": {"sumAmount": "sum_amount"},
"script": "sumAmount == 0"
}
}
}
}
}
}
However, there are two caveats. Depending on your configuration, it might be necessary to enable scripting like this:
script.aggs: true
script.groovy: true
Moreover, because the selector works on the buckets returned by the parent terms aggregation, you are not guaranteed to get all buckets with a sum of 0. If the terms aggregation returns only terms whose summed amount is != 0, you will get no results.
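To make the second caveat concrete, here is a small local emulation in Python. It is not Elasticsearch code: the function name `having_sum_zero` and the sample rows are made up, it ignores sharding, and it assumes the terms aggregation keeps the `size` buckets with the most documents, with ties broken by key. It only illustrates why a selector that runs after the bucket cut-off can miss groups.

```python
from collections import defaultdict

def having_sum_zero(rows, size):
    """Emulate terms(size) -> sum -> bucket_selector(sum == 0) on local rows.
    `rows` is a list of (csgg, amount) pairs."""
    counts = defaultdict(int)
    sums = defaultdict(int)
    for key, amount in rows:
        counts[key] += 1
        sums[key] += amount
    # terms aggregation: keep only the `size` buckets with the most documents
    top = sorted(counts, key=lambda k: (-counts[k], k))[:size]
    # bucket_selector: runs only on the buckets that survived the cut above
    return [k for k in top if sums[k] == 0]

rows = [("x", 5), ("x", 7), ("x", 1), ("y", 3), ("y", -3), ("z", 0)]
print(having_sum_zero(rows, size=3))  # ['y', 'z']: every bucket was considered
print(having_sum_zero(rows, size=1))  # []: only "x" was considered
```

With size=1 the zero-sum groups "y" and "z" exist in the data but never reach the selector, which is why the terms size needs to be large enough to cover the groups you care about.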

Difference between using multiple filters and specifying multiple filters in a single "and" clause

I am new to Elasticsearch and don't know what the difference between these two queries is. Is it just processing time, or are they fundamentally different queries?
1) filters : { and: [{
"bool" : {
"should" : {
"term" : {
"Code" : "1510"
}
}
}
}
,
{
"bool" : {
"should" : {
"term" : {
"Id" : "Id3"
}
}
}
}] }
2) filter: [{
  "bool" : {
    "must" : [
      { "term" : { "Code" : "1510" } },
      { "term" : { "Id" : "Id3" } }
    ]
  }
}]
The queries in the OP are logically equivalent.
That said, I find 2) more intuitive, readable, and simpler.
Generally, bool filters are preferred over and filters for performance reasons, although for the queries in question I doubt the difference is perceptible.
Also, for the and filter, the query in 1) is better written as follows:
"filter": {
"and": [
{
"term": {
"Code": "1510"
}
},
{
"term": {
"Id": "Id3"
}
}
]
}
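The logical equivalence can be checked with a toy evaluator. The Python sketch below is not how Elasticsearch executes filters; `matches` is a hypothetical helper that understands only term, and, and bool.must, and the sample documents are invented. It simply shows that both forms accept and reject the same documents.

```python
def matches(filt, doc):
    """Evaluate a tiny subset of the filter DSL (term, and, bool.must)
    against a plain dict standing in for a document."""
    if "term" in filt:
        (field, value), = filt["term"].items()
        return doc.get(field) == value
    if "and" in filt:
        return all(matches(f, doc) for f in filt["and"])
    if "bool" in filt:
        return all(matches(f, doc) for f in filt["bool"].get("must", []))
    raise ValueError("unsupported filter: %r" % (filt,))

terms = [{"term": {"Code": "1510"}}, {"term": {"Id": "Id3"}}]
and_filter = {"and": terms}
bool_filter = {"bool": {"must": terms}}

docs = [
    {"Code": "1510", "Id": "Id3"},
    {"Code": "1510", "Id": "Id4"},
    {"Code": "9999", "Id": "Id3"},
]
for doc in docs:
    # Both forms must agree on every sample document.
    assert matches(and_filter, doc) == matches(bool_filter, doc)
print([matches(bool_filter, d) for d in docs])  # [True, False, False]
```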

filter by child frequency in ElasticSearch

I currently have parent documents indexed in Elasticsearch, with child documents (comments) related to them.
My first objective was to search for a document with more than N comments, based on a child query. Here is how I did it:
documents/document/_search
{
"min_score": 0,
"query": {
"has_child" : {
"type" : "comment",
"score_type" : "sum",
"boost": 1,
"query" : {
"range": {
"date": {
"lte": 20130204,
"gte": 20130201,
"boost": 1
}
}
}
}
}
}
I used the score to calculate the number of comments a document has, and then I filtered the documents by this number using "min_score".
Now, my objective is to search not just comments, but several other child documents related to the document, always based on frequency. Something like the query below:
documents/document/_search
{
"query": {
"match_all": {
}
},
"filter" : {
"and" : [{
"query": {
"has_child" : {
"type" : "comment",
"query" : {
"range": {
"date": {
"lte": 20130204,
"gte": 20130201
}
}
}
}
}
},
{
"or" : [
{"query": {
"has_child" : {
"type" : "comment",
"query" : {
"match": {
"text": "Finally"
}
}
}
}
},
{ "query": {
"has_child" : {
"type" : "comment",
"query" : {
"match": {
"text": "several"
}
}
}
}
}
]
}
]
}
}
The query above works fine, but it doesn't filter based on frequency as the first one does. As filters are computed before scores are calculated, I cannot use min_score to filter each child query.
Any solutions to this problem?
There is no score at all associated with filters. I'd suggest moving the whole logic to the query part and using a bool query to combine the different queries.
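A rough sketch of that suggestion, written as a Python dict builder: the and/or filters from the question become must/should clauses of a bool query, and score_type is reused from the question's first query. The helper `has_child` is hypothetical, and the use of minimum_should_match to require at least one of the two text matches is my assumption about how to keep the or semantics. How the combined score relates to the per-child counts depends on your version and scoring settings, so treat this as a starting point.

```python
import json

def has_child(child_type, query, score_type=None):
    """Build a has_child query clause; score_type is only set when given."""
    clause = {"type": child_type, "query": query}
    if score_type is not None:
        clause["score_type"] = score_type
    return {"has_child": clause}

date_range = {"range": {"date": {"gte": 20130201, "lte": 20130204}}}

body = {
    "query": {
        "bool": {
            # was the first entry of the "and" filter
            "must": [has_child("comment", date_range, score_type="sum")],
            # was the "or" filter: at least one of these has to match
            "should": [
                has_child("comment", {"match": {"text": "Finally"}}),
                has_child("comment", {"match": {"text": "several"}}),
            ],
            "minimum_should_match": 1,
        }
    }
}
print(json.dumps(body, indent=2))
```

Note that min_score, if added, would apply to the combined score of the whole bool query rather than to each child query separately.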
