Difference between using multiple filters and specifying multiple filters in a single "and" clause - elasticsearch

I am new to Elasticsearch and don't know what the difference is between the two queries below. Is it just processing time, or are they fundamentally different queries?
1) filters : { and: [{
"bool" : {
"should" : {
"term" : {
"Code" : "1510"
}
}
}
}
,
{
"bool" : {
"should" : {
"term" : {
"Id" : "Id3"
}
}
}
}] }
2) filter: [{
"bool" : {
"must" : [{
"term" : {
"Code" : "1510"
}
},
{
"term" : {
"Id" : "Id3"
}
}]
}
}]

The queries in the OP are logically equivalent.
That said, I find 2) more intuitive, readable and simpler.
Generally, bool filters are preferred over and filters for performance reasons, although for the queries in question I doubt the difference is perceptible.
Also, if you do use the and filter, the query in 1) is better written as follows:
"filter": {
"and": [
{
"term": {
"Code": "1510"
}
},
{
"term": {
"Id": "Id3"
}
}
]
}
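
As a minimal sketch of the equivalence described above, the following Python snippet builds both forms from the same list of term conditions. The helper names term_clauses, as_and_filter and as_bool_filter are hypothetical, and the code only constructs request bodies; nothing is sent to a cluster. Keep in mind that the and filter belongs to older Elasticsearch releases and was later deprecated in favour of bool.

```python
import json

def term_clauses(conditions):
    """Turn {"field": value} pairs into a list of term clauses."""
    return [{"term": {field: value}} for field, value in conditions.items()]

def as_and_filter(conditions):
    """Form 1: the legacy `and` filter wrapping plain term clauses."""
    return {"filter": {"and": term_clauses(conditions)}}

def as_bool_filter(conditions):
    """Form 2: a bool filter whose `must` array holds the same clauses."""
    return {"filter": {"bool": {"must": term_clauses(conditions)}}}

conditions = {"Code": "1510", "Id": "Id3"}
and_body = as_and_filter(conditions)
bool_body = as_bool_filter(conditions)

# Both bodies carry exactly the same term clauses; only the wrapper differs.
print(json.dumps(bool_body, indent=2))
```

Because the clauses are generated once and reused, the two request bodies cannot drift apart, which makes it easy to benchmark one form against the other.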

Related

Elasticsearch: Exclude filter clause from scoring

I have a filter clause deep inside a query clause but I think it doesn't make sense to calculate a score for the filter clause. How can I take this filter clause out? Would this improve performance?
{
"size" : 30,
"sort" : [
{
"_score" : {
"order" : "desc"
}
}
],
"query" : {
"function_score" : {
"score_mode" : "sum",
"boost_mode" : "sum",
"functions" : [
{
...
<filter_clause>
}
]
}
}
}
You can wrap any (sub)query in the filter context, which is a yes/no operation where no scoring occurs:
{
"query": {
"bool": {
"filter": [
{...}
]
}
}
}
That said, function_score functions are supposed to affect the scoring, so I don't think it makes sense to exclude them from scoring in this way.
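
A minimal sketch of that wrapping in Python: the hypothetical helper in_filter_context puts a clause into the filter array of a bool query, next to an optional scoring query in must. The field names and values are placeholders for illustration and do not come from the question.

```python
import json

def in_filter_context(filter_clause, scoring_query=None):
    """Wrap filter_clause in bool/filter so it only matches yes/no.

    scoring_query, if given, goes into `must` and still contributes to _score.
    """
    bool_body = {"filter": [filter_clause]}
    if scoring_query is not None:
        bool_body["must"] = [scoring_query]
    return {"query": {"bool": bool_body}}

# Placeholder clauses for illustration only.
body = in_filter_context(
    filter_clause={"term": {"status": "active"}},
    scoring_query={"match": {"title": "elasticsearch"}},
)
print(json.dumps(body, indent=2))
```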

Elasticsearch - EXISTS syntax + Filter not working

I am trying to query for a date range where a particular field exists. This seems like it would be easy but I am sensing that the keyword "exists" has changed per the documentation. I am on 5.4. https://www.elastic.co/guide/en/elasticsearch/reference/5.4/query-dsl-exists-filter.html
I use #timestamp for dates and the field "error_data" is in the mapping and only appears if an error condition is found.
Here is my query....
GET /filebeat-2017.07.25/_search
{
"query": {
"bool" : {
"filter" : {
"range" : {
"#timestamp" : {
"gte" : "now-5m",
"lte" : "now-1m"
}
}
},
"exists": {
"field": "error_data"
}
}
}
}
This fails with "[bool] query does not support [exists]". The following variant does not work either; it fails with the parsing error "[exists] malformed query, expected [END_OBJECT] but found [FIELD_NAME]" on line 6, column 9. Thanks for your help.
GET /filebeat-2017.07.25/_search
{
"query": {
"exists": {
"field": "error_data"
},
"bool" : {
"filter" : {
"range" : {
"#timestamp" : {
"gte" : "now-5m",
"lte" : "now-1m"
}
}
}
}
}
}
You're almost there. Try like this:
GET /filebeat-2017.07.25/_search
{
"query": {
"bool" : {
"filter" : [
{
"range" : {
"#timestamp" : {
"gte" : "now-5m",
"lte" : "now-1m"
}
}
},
{
"exists": {
"field": "error_data"
}
}
]
}
}
}
That is, the bool/filter clause must be an array if you have several clauses to put in it.
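
The same structure can be sketched in Python. The helper name bool_filter is hypothetical; it always emits filter as an array, so a second clause such as exists is added as another array element rather than as a sibling key of bool or query.

```python
import json

def bool_filter(*clauses):
    """Build a bool query whose filter is always an array of clauses."""
    return {"query": {"bool": {"filter": list(clauses)}}}

body = bool_filter(
    {"range": {"#timestamp": {"gte": "now-5m", "lte": "now-1m"}}},
    {"exists": {"field": "error_data"}},
)

# Each clause is its own object inside the array, so `exists` never ends up
# as a direct (unsupported) key of `bool`.
print(json.dumps(body, indent=2))
```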

Return every nth record in Elastic Search

I have time series data and I want to query Elasticsearch by using time ranges with a fixed set of 2000 records.
I have this query
GET http://IP:9200/MYINDEX/_search
{
"_source": ["XXX1", "XXX2","timestamp"],
"sort" :
{ "#timestamp" : {"order" : "asc"}},
"query" : {
"range" : {
"#timestamp" : {
"gte" : "2017-02-10T10:55:31,259Z",
"lte" : "2017-02-10T10:55:32,272Z"
}
}
}
}
Is it possible to return only every 5th or 10th record?
I found some filter scripts but none of them seems to work.
Since there could be millions of records in one index, it's crucial to limit the number of returned values!
EDIT: reworked the query because filtered was replaced by bool:
{
"_source":[
"XXX1",
"XXX2",
"timestamp"
],
"sort":{
"#timestamp":{
"order":"asc"
}
},
"query":{
"bool":{
"must":{
"range":{
"#timestamp":{
"gte":"2017-02-10T10:55:31,259Z",
"lte":"2017-02-10T10:55:32,272Z"
}
}
},
"filter":{
"script":{
"script":"doc['#timestamp'].value % 5 == 0"
}
}
}
}
}
There is one way to do it. You can add a field which can behave like an auto increment field of a DB.
Then you can add a filter to the query that you want to run.
"filter": {
"script": {
"script": "doc['auto_increment'].value % n == 0",
"params" : {
"n" : 5
}
}
}
This should work for indices that hold time series data and are searched by range. It will not work properly if the query also includes a text search, because the matching documents then no longer have consecutive counter values.
Applied to the query you are trying, it would look something like this:
GET http://IP:9200/MYINDEX/_search
{
"_source": ["XXX1", "XXX2","timestamp"],
"sort" :
{ "#timestamp" : {"order" : "asc"}},
"query" : {
"filtered": {
"query": {
"range" : {
"#timestamp" : {
"gte" : "2017-02-10T10:55:31,259Z",
"lte" : "2017-02-10T10:55:32,272Z"
}
}
},
"filter": {
"script": {
"script": "doc['auto_increment'].value % 5 == 0"
}
}
}
}
}
For reference do look into this
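
To illustrate the idea, the following Python sketch simulates it locally, without a cluster. The field name auto_increment comes from the answer; the helper names and the sample timestamps are made up for the example. It also shows why applying the modulo to the timestamp itself, as in the question's edit, does not generally return every 5th record: it keeps only records whose timestamp value happens to be divisible by 5.

```python
def add_auto_increment(docs, start=0):
    """Assign a sequential counter to each document at index time."""
    return [dict(doc, auto_increment=start + i) for i, doc in enumerate(docs)]

def every_nth(docs, n):
    """Local equivalent of the script filter: auto_increment % n == 0."""
    return [doc for doc in docs if doc["auto_increment"] % n == 0]

# Twelve sample records with irregular epoch-millisecond timestamps.
timestamps = [1000, 1003, 1004, 1011, 1012, 1013,
              1019, 1021, 1022, 1027, 1031, 1033]
docs = add_auto_increment([{"timestamp": ts} for ts in timestamps])

sampled = every_nth(docs, 5)
print([doc["auto_increment"] for doc in sampled])

# Modulo on the timestamp depends on the values, not on the record order.
by_timestamp = [doc for doc in docs if doc["timestamp"] % 5 == 0]
print([doc["timestamp"] for doc in by_timestamp])
```

The counter has to be assigned when documents are indexed, so this approach assumes you control the indexing pipeline.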

Elastic Search must not queries are slow

I have a test index of 50K documents.
I'm firing 500 (same) queries against it, which have a clause that a field (that is an array of values) "must not" be of "some value".
Out of these 500 queries several fail/time out. (Sometimes it's 5, sometimes it's 9, sometimes it's 18 queries...) Is there a way to make the "must not" queries faster? In production the index is going to be several million docs, and the majority of queries are going to have "must not" clauses.
Mapping is as follows:
{
"jobs_en":{
"mappings":{
"index":{
"_all":{
"enabled":false
},
"properties":{
"GUID":{
"type":"string",
"index":"not_analyzed"
},
"channel":{
"type":"string",
"index":"not_analyzed"
},
"country":{
"type":"string",
"analyzer":"standard"
}
}
}
}
}
}
The query is as follows:
{
"bool" : {
"must" : [ {
"bool" : {
"must" : {
"bool" : { }
},
"must_not" : {
"term" : {
"channel" : "Email"
}
}
}
}, {
"bool" : {
"must" : {
"match" : {
"country" : {
"query" : "US",
"type" : "boolean"
}
}
}
}
} ]
}
}
We have a large database in ES, though I don't think it is as large as yours. Several things help me:
1. Use must if you can.
2. Use must_not together with must.
3. If you are able to, use _source to limit the returned fields.
"query" : {
"bool" : {
"must": [
{"term": {
"createUser": {
"value": "processor.imsignal"
}
}
},
{"terms" : {
"imcampaignid" : [70191,66983,70188,70235,70190]
}
}
],
"must_not": [
{"term": {
"source": {
"value": "EMAIL"
}
}
},
{"terms" : {
"category" : ["campaign_email","unsubscribe","from_email"]
}
}
]
}
},
"_source": ["category","source","accountPlatformID"]
Specifying a must clause first speeds up the query. Adding must_not then reduces the number of returned records, which can otherwise be a real performance hit. Finally, reducing what is returned for each record can really help.
Since there was no other answer, I figured I'd help with what I knew. Believe it or not, for my purposes this query with the must_not outperforms the identical query with only the must clauses by tens of seconds. Telling Elasticsearch what a document should be is essential; then filter by what it should not be.
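
The three tips can be combined in a small request builder. This is only a sketch: build_search_body is a hypothetical helper, and the clauses reuse field names from the example above. Whether it speeds up a given workload has to be measured on that workload.

```python
import json

def build_search_body(must, must_not, source_fields):
    """Combine positive clauses, exclusions and a trimmed _source list."""
    if not must:
        # Tips 1 and 2: always pair exclusions with a positive clause.
        raise ValueError("provide at least one must clause alongside must_not")
    return {
        "query": {"bool": {"must": list(must), "must_not": list(must_not)}},
        "_source": list(source_fields),  # Tip 3: return only the needed fields.
    }

body = build_search_body(
    must=[{"terms": {"imcampaignid": [70191, 66983]}}],
    must_not=[{"term": {"source": {"value": "EMAIL"}}}],
    source_fields=["category", "source", "accountPlatformID"],
)
print(json.dumps(body, indent=2))
```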

Why would this cached geo query be slower in elasticsearch than the uncached?

Looking at the query below, I've added a cached geo_bounding_box filter in front of my geo_shape filter. My expectation after reading https://www.elastic.co/guide/en/elasticsearch/guide/current/geo-caching.html was that this query should be faster. However, in my benchmarking the query with both filters turns out to be slightly slower on average, and MUCH slower in the worst case. Am I doing something wrong, or misinterpreting the doc?
{
"query": {
"filtered": {
"filter": {
"bool" : {
"must" : [
{"geo_bounding_box" : {
"_cache": True,
"properties.center" : {
"top_left" : {
"lat" : math.ceil(float(lat)),
"lon" : math.floor(float(lon))
},
"bottom_right" : {
"lat" : math.floor(float(lat)),
"lon" : math.ceil(float(lon))
}
}
}},
{"geo_shape": {
"geometry": {
"relation": "intersects",
"shape": {
"coordinates": [lon,lat],
"type": "point"
}
}
}}
]
}
}
}
}
}
Use lowercase JSON boolean values:
"_cache": true
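
If the request body is built in Python, as the math.ceil calls in the question suggest, one way to avoid this class of mistake is to keep the body as a dict and let json.dumps serialise it, since it writes Python's True as JSON true. A minimal sketch, with a hypothetical helper name and made-up sample coordinates:

```python
import json
import math

def bounding_box_filter(lat, lon):
    """Build the cached geo_bounding_box clause around a point."""
    return {
        "geo_bounding_box": {
            "_cache": True,  # Python boolean; serialised as JSON true below
            "properties.center": {
                "top_left": {"lat": math.ceil(lat), "lon": math.floor(lon)},
                "bottom_right": {"lat": math.floor(lat), "lon": math.ceil(lon)},
            },
        }
    }

clause = bounding_box_filter(lat=40.7, lon=-74.2)
serialized = json.dumps(clause)
print(serialized)
```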
