Elasticsearch "must not" queries are slow

I have a test index of 50K documents.
I'm firing 500 identical queries against it, each with a clause requiring that a field (which is an array of values) "must not" contain some value.
Out of these 500 queries several fail/time out. (Sometimes it's 5, sometimes it's 9, sometimes it's 18 queries...) Is there a way to make the "must not" queries faster? In production the index is going to be several million docs, and the majority of queries are going to have "must not" clauses.
Mapping is as follows:
{
"jobs_en":{
"mappings":{
"index":{
"_all":{
"enabled":false
},
"properties":{
"GUID":{
"type":"string",
"index":"not_analyzed"
},
"channel":{
"type":"string",
"index":"not_analyzed"
},
"country":{
"type":"string",
"analyzer":"standard"
}
}
}
}
}
}
The query is as follows:
{
"bool" : {
"must" : [ {
"bool" : {
"must" : {
"bool" : { }
},
"must_not" : {
"term" : {
"channel" : "Email"
}
}
}
}, {
"bool" : {
"must" : {
"match" : {
"country" : {
"query" : "US",
"type" : "boolean"
}
}
}
}
} ]
}
}

We have a large database in ES, though I don't think it is as large as yours. Several things help me:
1. Use must if you can.
2. Use must_not together with must.
3. If you are able to, use _source to limit the fields that are returned.
"query" : {
"bool" : {
"must": [
{"term": {
"createUser": {
"value": "processor.imsignal"
}
}
},
{"terms" : {
"imcampaignid" : [70191,66983,70188,70235,70190]
}
}
],
"must_not": [
{"term": {
"source": {
"value": "EMAIL"
}
}
},
{"terms" : {
"category" : ["campaign_email","unsubscribe","from_email"]
}
}
]
}
},
"_source": ["category","source","accountPlatformID"]
Specifying a must clause first speeds up the query. Adding must_not can reduce the number of returned records, and returning many records can be a real performance hit. Finally, limiting which fields are returned for those records can really help.
Since there was no other answer, I figured I'd help with what I knew. Believe it or not, for my purposes this query with the must_not outperforms the identical query with only the must clauses by tens of seconds. Telling the query what a document should be is essential; then filter by what it is not.
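As a rough illustration of the three tips applied to the asker's query, here is a minimal Python sketch that flattens the nested bool clauses into one must list, one must_not list, and a _source filter. The helper name build_search_body and the choice of _source fields are hypothetical, and the sketch assumes the empty inner bool in the question was meant to match everything. It only builds the request body; whether it is faster on a given cluster has to be measured.

```python
import json


def build_search_body(must, must_not, source_fields):
    """Assemble a search body from positive clauses, exclusions, and a _source filter."""
    return {
        "query": {
            "bool": {
                "must": list(must),          # what a document has to match
                "must_not": list(must_not),  # what it must not match
            }
        },
        "_source": list(source_fields),      # return only these fields
    }


# Field names come from the mapping in the question.
body = build_search_body(
    must=[{"match": {"country": "US"}}],
    must_not=[{"term": {"channel": "Email"}}],
    source_fields=["GUID", "channel", "country"],
)
print(json.dumps(body, indent=2))
```

The resulting dict can be sent as the body of a _search request by whatever client is already in use.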

Return every nth record in Elastic Search

I have time series data and I want to query Elasticsearch by using time ranges with a fixed set of 2000 records.
I have this query
GET http://IP:9200/MYINDEX/_search
{
"_source": ["XXX1", "XXX2","timestamp"],
"sort" :
{ "#timestamp" : {"order" : "asc"}},
"query" : {
"range" : {
"#timestamp" : {
"gte" : "2017-02-10T10:55:31,259Z",
"lte" : "2017-02-10T10:55:32,272Z"
}
}
}
}
Is it possible to return only every 5th or 10th record?
I found some filter scripts but none of them seems to work.
Since there could be millions of records in one index, it's crucial to limit the number of returned values!
EDIT: reworked the query because filtered was replaced by bool:
{
"_source":[
"XXX1",
"XXX2",
"timestamp"
],
"sort":{
"#timestamp":{
"order":"asc"
}
},
"query":{
"bool":{
"must":{
"range":{
"#timestamp":{
"gte":"2017-02-10T10:55:31,259Z",
"lte":"2017-02-10T10:55:32,272Z"
}
}
},
"filter":{
"script":{
"script":"doc['#timestamp'].value % 5 == 0"
}
}
}
}
}
There is one way to do it. You can add a field which can behave like an auto increment field of a DB.
Then you can add a filter to the query that you want to run.
"filter": {
"script": {
"script": "doc['auto_increment'].value % n == 0",
"params" : {
"n" : 5
}
}
}
This should work for indexes that hold time series data and are searched by range. It will not work properly if you add a text search to the query, because the matching documents will no longer have consecutive counter values.
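To see what the modulo filter actually selects, here is a small pure-Python simulation; it does not talk to Elasticsearch. The documents, the tag field, and the helper name every_nth are made up for illustration.

```python
def every_nth(docs, n, field="auto_increment"):
    """Keep documents whose counter is divisible by n, like the script filter does."""
    return [d for d in docs if d[field] % n == 0]


# 20 hypothetical documents with consecutive counters 1..20.
docs = [{"auto_increment": i, "tag": "a" if i % 3 == 0 else "b"} for i in range(1, 21)]

# Range-only search: the counters are consecutive, so we get an even sample.
sampled = every_nth(docs, 5)
print([d["auto_increment"] for d in sampled])  # [5, 10, 15, 20]

# With an extra condition (tag == "a") the survivors are 3, 6, 9, 12, 15, 18,
# and the modulo keeps only one of those six instead of every 5th match.
matching = [d for d in docs if d["tag"] == "a"]
sparse = every_nth(matching, 5)
print([d["auto_increment"] for d in sparse])  # [15]
```

The second result shows the limitation: once another condition thins out the matches, the counter no longer lines up with the position of a document in the result set.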
For the query that you are trying it would transform into something like this.
GET http://IP:9200/MYINDEX/_search
{
"_source": ["XXX1", "XXX2","timestamp"],
"sort" :
{ "#timestamp" : {"order" : "asc"}},
"query" : {
"filtered": {
"query": {
"range" : {
"#timestamp" : {
"gte" : "2017-02-10T10:55:31,259Z",
"lte" : "2017-02-10T10:55:32,272Z"
}
}
},
"filter": {
"script": {
"script": "doc['auto_increment'].value % 5 == 0"
}
}
}
}
}
For reference do look into this

Range query in elasticsearch does not work properly

I have an index that contains objects with eventvalue and eventtime fields. I want to write a query that returns an aggregated event count per eventvalue for the last 30 seconds. I also need empty buckets for any second in which there were no events, because I need to display this data on a graph.
So I wrote the following query:
{
"query" : {
"bool" : {
"must" : [
{
"range" : {
"eventtime" : {
"gte" : "now-30s/s",
"lte" : "now/s",
"format" : "yyyy-MM-dd HH:mm:ss",
"time_zone": "+03:00"
}
}
},
{
"range" : {
"eventvalue" : {
"lte" : 3
}
}
}
]
}
},
"aggs": {
"values_agg": {
"terms": {
"field": "eventvalue",
"min_doc_count" : 0,
"order": {
"_term": "asc"
}
},
"aggs": {
"events_over_time" : {
"date_histogram" : {
"field" : "eventtime",
"interval" : "1s",
"min_doc_count" : 0,
"extended_bounds" : {
"min" : "now-30s/s",
"max" : "now/s"
},
"format" : "yyyy-MM-dd HH:mm:ss",
"time_zone": "+03:00"
}
}
}
}
}
}
This query is not working properly and I don't know why. Specifically, the first "range" query gives me the desired interval (if I remove it I get values from all time), but the second "range" query seems to have no effect. Eventvalue can be anywhere from 1 to 10, and the desired effect is three buckets for eventvalues 1-3. However, I get all 10 buckets with all events.
How can I fix this query so it still returns empty buckets but only for the selected eventvalues?
I believe you need to remove the "min_doc_count": 0 from your terms aggregation. To achieve the empty buckets you're aiming for, you need only use min_doc_count in the date_histogram aggregation.
Per the documentation for the terms aggregation:
Setting min_doc_count=0 will also return buckets for terms that didn’t match any hit.
This explains why you are seeing buckets for eventvalues that are greater than 3. They were filtered out by the query, but brought back in by the terms aggregation.
UPDATE
Since there is a possibility that the eventvalues may not exist anywhere in the 30sec time slice, the other approach I would recommend is to manually specify the discrete values you want to use as buckets using a filters aggregation. See the documentation here.
Try using this for your aggregations:
"aggs": {
"values_agg": {
"filters": {
"filters": {
"1": { "term": { "eventvalue": 1 }},
"2": { "term": { "eventvalue": 2 }},
"3": { "term": { "eventvalue": 3 }}
}
},
"aggs": {
"events_over_time" : {
"date_histogram" : {
"field" : "eventtime",
"interval" : "1s",
"min_doc_count" : 0,
"extended_bounds" : {
"min" : "now-30s/s",
"max" : "now/s"
},
"format" : "yyyy-MM-dd HH:mm:ss",
"time_zone": "+03:00"
}
}
}
}
}
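If the set of values changes, the filters aggregation above can be generated instead of written by hand. This is a minimal Python sketch; the helper name value_buckets is hypothetical, and the date_histogram settings are copied unchanged from the answer.

```python
import json


def value_buckets(field, values, sub_aggs):
    """Build a filters aggregation with one named bucket per requested value."""
    return {
        "filters": {
            "filters": {str(v): {"term": {field: v}} for v in values}
        },
        "aggs": sub_aggs,
    }


# Sub-aggregation taken as-is from the answer.
events_over_time = {
    "events_over_time": {
        "date_histogram": {
            "field": "eventtime",
            "interval": "1s",
            "min_doc_count": 0,
            "extended_bounds": {"min": "now-30s/s", "max": "now/s"},
            "format": "yyyy-MM-dd HH:mm:ss",
            "time_zone": "+03:00",
        }
    }
}

aggs = {"values_agg": value_buckets("eventvalue", [1, 2, 3], events_over_time)}
print(json.dumps(aggs, indent=2))
```

Because each bucket is named explicitly, a bucket for every listed value appears in the response even when no event with that value falls inside the 30 second window.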

Difference between using multiple filters and specifying multiple filters in a single "and" clause

I am new to Elasticsearch and don't know what the difference between these two queries is. Is it just processing time, or are they fundamentally different queries?
1) filters : { and: [{
"bool" : {
"should" : {
"term" : {
"Code" : "1510"
}
}
}
}
,
{
"bool" : {
"should" : {
"term" : {
"Id" : "Id3"
}
}
}
}] }
2) filter: [{
"bool" : {
"must" : [{
"term" : {
"Code" : "1510"
}
}, {
"term" : {
"Id" : "Id3"
}
}]
}
}]
The queries in the OP are logically equivalent.
That being said, I find 2) more intuitive, readable, and simpler.
Generally, for performance reasons, bool filters are preferred over and, although for the queries in question I doubt the difference is perceptible.
Also, for the and filter, the query in 1) is better written as follows:
"filter": {
"and": [
{
"term": {
"Code": "1510"
}
},
{
"term": {
"Id": "Id3"
}
}
]
}
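The equivalence can be made concrete with a short Python sketch that rewrites an and filter as a bool filter with must clauses. The function name and_to_bool is hypothetical; it only reshapes the dict and does not contact Elasticsearch.

```python
def and_to_bool(and_filter):
    """Rewrite {"and": [f1, f2, ...]} as {"bool": {"must": [f1, f2, ...]}}."""
    clauses = and_filter["and"]
    return {"bool": {"must": list(clauses)}}


and_filter = {
    "and": [
        {"term": {"Code": "1510"}},
        {"term": {"Id": "Id3"}},
    ]
}

bool_filter = and_to_bool(and_filter)
print(bool_filter)
```

Both forms require every clause to match, so the set of matching documents is the same; only the way the filter is executed may differ.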

Filtering on Elasticsearch Optional Fields

I'm using Elasticsearch to query a document type that has an optional location field. When searching, documents without that field should still be returned, while documents that have it should be filtered by location.
It seems like the OR filter in Elasticsearch does not short circuit, as this:
"query": {
"filtered": {
"query": {
"match_phrase_prefix": {
"display_name": "SearchQuery"
}
},
"filter": {
"or": [
{
"missing": {
"field": "location"
}
},
{
"geo_distance" : {
"distance" : "20mi",
"location" : {
"lat" : 33.47,
"lon" : -112.07
}
}
}
]
}
}
}
Fails with "failed to find geo_point field [location]".
Is there any way to perform this (or something along the same vein) in ES?
I don't know why yours isn't working, but I've used the bool filter with great success in the past. The should option is essentially an or: it makes sure at least one clause is true. Give it a try and comment on my answer if it still doesn't work. Also double check that I copied your query terms properly :)
{
"filtered" : {
"query" : {
"match_phrase_prefix": {
"display_name": "SearchQuery"
}
},
"filter" : {
"bool" : {
"should" : [
{
"missing": { "field": "location" }
},
{
"geo_distance" : {
"distance" : "20mi",
"location" : {
"lat" : 33.47,
"lon" : -112.07
}
}
}
]
}
}
}
}
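For readers on Elasticsearch 5.0 or later, where the missing filter no longer exists, the usual substitute is a bool with must_not around exists. The Python sketch below builds the same "no location, or within range" condition that way. The helper name optional_geo_filter is hypothetical, and since the thread does not show the mapping, it is an open question whether this form avoids the "failed to find geo_point field" error.

```python
import json


def optional_geo_filter(field, lat, lon, distance):
    """Match documents that lack `field` or lie within `distance` of (lat, lon)."""
    no_location = {"bool": {"must_not": {"exists": {"field": field}}}}
    nearby = {
        "geo_distance": {
            "distance": distance,
            field: {"lat": lat, "lon": lon},
        }
    }
    # At least one of the two should clauses has to match.
    return {"bool": {"should": [no_location, nearby], "minimum_should_match": 1}}


geo_filter = optional_geo_filter("location", 33.47, -112.07, "20mi")
print(json.dumps(geo_filter, indent=2))
```

The returned dict is meant to be placed in the filter part of the search body, in the same position as the bool filter in the answer above.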
For anyone with the same issue, I kind of just hacked around it. For any documents that were missing a "location", I added one with a lat/lon of 0/0. Then I altered my query to be:
"filter": {
"or": [
{
"geo_distance": {
"distance": "0.1mi",
"location": {
"lat": 0,
"lon": 0
}
}
},
{
"geo_distance": {
"distance": "30mi",
"location": {
"lat": [lat variable],
"lon": [lon variable]
}
}
}
]
}

Why would this cached geo query be slower in elasticsearch than the uncached?

Looking at the query below, I've added a cached geo_bounding_box filter in front of my geo_shape filter. My expectation after reading https://www.elastic.co/guide/en/elasticsearch/guide/current/geo-caching.html was that this query should be faster. However, in my benchmarking the query with both filters turns out to be slightly slower on average, and MUCH slower in the worst case. Am I doing something wrong, or misinterpreting the doc?
{
"query": {
"filtered": {
"filter": {
"bool" : {
"must" : [
{"geo_bounding_box" : {
"_cache": True,
"properties.center" : {
"top_left" : {
"lat" : math.ceil(float(lat)),
"lon" : math.floor(float(lon))
},
"bottom_right" : {
"lat" : math.floor(float(lat)),
"lon" : math.ceil(float(lon))
}
}
}},
{"geo_shape": {
"geometry": {
"relation": "intersects",
"shape": {
"coordinates": [lon,lat],
"type": "point"
}
}
}}
]
}
}
}
}
}
Use lowercase JSON boolean values:
"_cache": true
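The math.ceil and math.floor calls suggest the body in the question is assembled in Python rather than written as raw JSON, although the thread does not say so. Under that assumption, this minimal sketch shows what is sent over the wire: json.dumps turns Python's True into lowercase true, whereas a literal True inside a hand-written JSON string is not valid JSON. The helper name bounding_box_filter and the sample coordinates are hypothetical.

```python
import json
import math


def bounding_box_filter(lat, lon):
    """Build the cached one-degree bounding box filter from the question."""
    lat, lon = float(lat), float(lon)
    return {
        "geo_bounding_box": {
            "_cache": True,  # Python boolean; serialized as lowercase true
            "properties.center": {
                "top_left": {"lat": math.ceil(lat), "lon": math.floor(lon)},
                "bottom_right": {"lat": math.floor(lat), "lon": math.ceil(lon)},
            },
        }
    }


box = bounding_box_filter("33.47", "-112.07")
encoded = json.dumps(box)
print(encoded)
```

If the request body is typed out as a JSON string instead, the boolean has to be written as true by hand, as the answer says.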
