Elasticsearch: performance with complex filters and a lot of records

I am new to Elasticsearch. I need to build a query that scores on two text fields and also applies complex filters. Here is what I have so far (with the help of kind folks such as Dan Tuffery, John Petrone, and dark_shadow at SO), and it works:
{
"filter": {
"or": [
{
"and": [
{
"range": {
"start": {
"lte": 201407292300
}
}
},
{
"range": {
"end": {
"gte": 201407292300
}
}
},
{
"term": {
"condtion1": false
}
},
{
"or": [
{
"and": [
{
"term": {
"condtion2": false
}
},
{
"or": [
{
"and": [
{
"missing": {
"field": "condtion6"
}
},
{
"missing": {
"field": "condtion7"
}
}
]
},
{
"term": {
"condtion6": "nop"
}
},
{
"term": {
"condtion7": "rst"
}
}
]
}
]
},
{
"and": [
{
"term": {
"condtion2": true
}
},
{
"or": [
{
"and": [
{
"missing": {
"field": "condtion3"
}
},
{
"missing": {
"field": "condtion4"
}
},
{
"missing": {
"field": "condtion5"
}
},
{
"missing": {
"field": "condtion6"
}
},
{
"missing": {
"field": "condtion7"
}
}
]
},
{
"term": {
"condtion3": "abc"
}
},
{
"term": {
"condtion4": "def"
}
},
{
"term": {
"condtion5": "ghj"
}
},
{
"term": {
"condtion6": "nop"
}
},
{
"term": {
"condtion7": "rst"
}
}
]
}
]
}
]
}
]
},
{
"and": [
{
"term": {
"condtion8": "TIME_POINT_1"
}
},
{
"range": {
"start": {
"lte": 201407302300
}
}
},
{
"or": [
{
"term": {
"condtion9": "GROUP_B"
}
},
{
"and": [
{
"term": {
"condtion9": "GROUP_A"
}
},
{
"ids": {
"values": [
100,
10
]
}
}
]
}
]
}
]
},
{
"and": [
{
"term": {
"condtion8": "TIME_POINT_2"
}
},
{
"ids": {
"values": [
100,
10
]
}
}
]
},
{
"and": [
{
"term": {
"condtion8": "TIME_POINT_3"
}
},
{
"or": [
{
"term": {
"condtion1": true
}
},
{
"range": {
"end": {
"lt": 201407302300
}
}
}
]
},
{
"or": [
{
"term": {
"condtion9": "GROUP_B"
}
},
{
"and": [
{
"term": {
"condtion9": "GROUP_A"
}
},
{
"ids": {
"values": [
100,
10
]
}
}
]
}
]
}
]
}
]
}
}
I am wondering whether Elasticsearch will perform well with such queries against hundreds of thousands of records.
Basically, I am facing a choice of technologies. I am wondering whether a traditional database with full-text search features would do a better job. I do like what Elasticsearch offers, including features I might use in my project in the future.

I can see you are using a lot of and/or/not filters. I strongly recommend going through these links:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/filter-caching.html
http://www.elasticsearch.org/blog/all-about-elasticsearch-filter-bitsets/
You should use the bool filter instead of and/or/not, because the bool filter can reuse the cached bitsets of its clauses, so it is much faster. Also, the term and missing filters you are using are inherently fast because they operate at the term level.
A last piece of advice: analyze your use case carefully and try to reduce the number of filters by making effective choices. Elasticsearch can handle these filters very well, and with caching it won't be too slow.
Thanks
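To make the and/or/not to bool rewrite concrete, here is a minimal Python sketch that converts a filter tree like the one in the question. The helper name to_bool is made up for illustration, and the mapping it uses (and becomes must, or becomes should, not becomes must_not) is the usual correspondence, not something taken from the question. It assumes every node is a single-key dict, as in the question.

```python
import json


def to_bool(node):
    """Recursively rewrite and/or/not filters as bool filters (hypothetical helper)."""
    if not isinstance(node, dict) or len(node) != 1:
        return node
    (name, body), = node.items()
    if name in ("and", "or"):
        # Accept both the array form and the {"filters": [...]} form.
        clauses = body["filters"] if isinstance(body, dict) else body
        key = "must" if name == "and" else "should"
        return {"bool": {key: [to_bool(clause) for clause in clauses]}}
    if name == "not":
        # Accept both {"not": {...}} and {"not": {"filter": {...}}}.
        inner = body.get("filter", body)
        return {"bool": {"must_not": [to_bool(inner)]}}
    return node  # leaf filter such as term, range, missing or ids


# One branch of the filter from the question.
old = {
    "or": [
        {"term": {"condtion9": "GROUP_B"}},
        {"and": [
            {"term": {"condtion9": "GROUP_A"}},
            {"ids": {"values": [100, 10]}},
        ]},
    ]
}
converted = to_bool(old)
print(json.dumps(converted, indent=2))
```

Leaf filters are passed through untouched, so the same function can be applied to the whole top-level "filter" of the question.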

Personally, I think Elasticsearch will be a good choice of technology for what you are trying to achieve. I have used FAST, Solr and SQL in the past, but I find ES much better.
Do have a look at Queries vs. Filters, as it is important to know when to use filters and when to use queries, because Elasticsearch does some caching.
I have run complex histograms over 800 million records on one server (16 cores, 64 GB RAM, 500 GB SAN) and it works very well. I would prefer to cluster the instance, but my client does not wish to add a couple more Linux servers (madness, really). Ideally you should set ES up with 3 nodes, as this gives you great performance and high availability; I have done this at another client's setup and it works like a dream.

Related

Elasticsearch nested_filter with multiple term queries

I'm trying to use Elasticsearch (6.7) sorting with multiple term queries.
But the data is not sorted when there are 3 term queries. It works when I specify only
{
"term": {
"instance.instFields.sourceFieldId": {
"value": "16044"
}
}
},
Below is the sort query with all 3 terms.
"sort": [
{
"instance.instFields.fieldDate": {
"order": "desc",
"nested_path": "instance.instFields",
"nested_filter": {
"bool": {
"must": [
{
"term": {
"instance.instFields.sourceFieldId": {
"value": "16044"
}
}
},
{
"term": {
"instance.dataSourceId": {
"value": "819"
}
}
},
{
"term": {
"instance.dsTypeId": {
"value": "2301"
}
}
}
]
}
}
}
}
],
I would appreciate any help resolving this issue.
The instance.dataSourceId and instance.dsTypeId fields are outside of your declared nested path (instance.instFields), so no inner objects match the nested filter and they are not taken into account by the sort.
BTW, as of ES 6.1 the nested_path and nested_filter options have been deprecated in favor of a nested object containing path and filter.
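As a rough sketch of what that could look like, the Python snippet below builds the sort with the newer nested object and keeps only the condition that lives inside instance.instFields in the sort filter. Moving the other two term conditions into the main query is an assumption about what the asker intends (restricting which documents are returned); the helper name nested_sort is made up for illustration.

```python
def nested_sort(field, path, filter_clause, order="desc"):
    """Build one sort entry using the nested object (path + filter)."""
    return {
        field: {
            "order": order,
            "nested": {"path": path, "filter": filter_clause},
        }
    }


body = {
    # Assumption: the conditions on fields outside the nested path are meant
    # to select documents, so they go into the main query instead.
    "query": {
        "bool": {
            "filter": [
                {"term": {"instance.dataSourceId": {"value": "819"}}},
                {"term": {"instance.dsTypeId": {"value": "2301"}}},
            ]
        }
    },
    "sort": [
        nested_sort(
            "instance.instFields.fieldDate",
            "instance.instFields",
            # Only fields inside the nested path can match here.
            {"term": {"instance.instFields.sourceFieldId": {"value": "16044"}}},
        )
    ],
}
```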

Filtered bool vs Bool query

I have two queries in ES. Both are conceptually doing the same thing, but they have different turnaround times on the same set of documents. I have a few doubts:
1- What is the difference between these two?
2- Which one is better to use?
3- If both are the same, why do they perform differently?
1. Filtered bool
{
"from": 0,
"size": 5,
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"called_party_address_number": "1987112602"
}
},
{
"term": {
"original_sender_address_number": "6870340319"
}
},
{
"range": {
"x_event_timestamp": {
"gte": "2016-07-01T00:00:00.000Z",
"lte": "2016-07-30T00:00:00.000Z"
}
}
}
]
}
}
}
},
"sort": [
{
"x_event_timestamp": {
"order": "desc",
"ignore_unmapped": true
}
}
]
}
2. Simple Bool
{
"query": {
"bool": {
"must": [
{
"term": {
"called_party_address_number": "1277478699"
}
},
{
"term": {
"original_sender_address_number": "8020564722"
}
},
{
"term": {
"cause_code": "573"
}
},
{
"range": {
"x_event_timestamp": {
"gt": "2016-07-13T13:51:03.749Z",
"lt": "2016-07-16T13:51:03.749Z"
}
}
}
]
}
},
"from": 0,
"size": 10,
"sort": [
{
"x_event_timestamp": {
"order": "desc",
"ignore_unmapped": true
}
}
]
}
Mapping:
{
"ccp": {
"mappings": {
"type1": {
"properties": {
"original_sender_address_number": {
"type": "string"
},
"called_party_address_number": {
"type": "string"
},
"cause_code": {
"type": "string"
},
"x_event_timestamp": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
.
.
.
}
}
}
}
}
Update 1:
I tried a bool/must query and a bool/filter query on the same set of data, but I found strange behaviour.
1-
The bool/must query is able to find the desired document:
{
"query": {
"bool": {
"must": [
{
"term": {
"called_party_address_number": "8701662243"
}
},
{
"term": {
"cause_code": "401"
}
}
]
}
}
}
2-
The bool/filter query, however, is not able to find the document. If I remove the second field condition, it finds the same record, whose cause_code value is 401.
{
"query": {
"bool": {
"filter": [
{
"term": {
"called_party_address_number": "8701662243"
}
},
{
"term": {
"cause_code": "401"
}
}
]
}
}
}
Update 2:
I found a way to suppress the scoring phase of the bool/must query: wrapping it in "constant_score".
{
"query": {
"constant_score": {
"filter": {
"bool": {
"must": [
{
"term": {
"called_party_address_number": "1235235757"
}
},
{
"term": {
"cause_code": "304"
}
}
]
}
}
}
}
}
The record we are trying to match has "called_party_address_number": "1235235757" and "cause_code": "304".
The first one uses the old 1.x query/filter syntax (filtered queries have since been deprecated in favor of bool/filter).
The second one uses the new 2.x syntax, but not in a filter context (i.e. you're using bool/must instead of bool/filter). The query in 2.x syntax that is equivalent to your first query (i.e. one that runs in a filter context without score calculation, and is therefore faster) would be this one:
{
"query": {
"bool": {
"filter": [
{
"term": {
"called_party_address_number": "1277478699"
}
},
{
"term": {
"original_sender_address_number": "8020564722"
}
},
{
"term": {
"cause_code": "573"
}
},
{
"range": {
"x_event_timestamp": {
"gt": "2016-07-13T13:51:03.749Z",
"lt": "2016-07-16T13:51:03.749Z"
}
}
}
]
}
},
"from": 0,
"size": 10,
"sort": [
{
"x_event_timestamp": {
"order": "desc",
"ignore_unmapped": true
}
}
]
}
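If there are many old-style requests to migrate, the rewrite can be done mechanically. The following Python sketch shows one way; filtered_to_bool is a hypothetical helper name, and it only handles the simple case where "filtered" sits directly under the top-level "query".

```python
def filtered_to_bool(body):
    """Rewrite a 1.x filtered query as a 2.x bool query (hypothetical helper)."""
    query = body.get("query", {})
    if "filtered" not in query:
        return body  # nothing to rewrite
    filtered = query["filtered"]
    bool_query = {}
    if "query" in filtered:
        # The scoring part stays in must.
        bool_query["must"] = [filtered["query"]]
    if "filter" in filtered:
        # The non-scoring part moves into the filter context.
        bool_query["filter"] = [filtered["filter"]]
    return {**body, "query": {"bool": bool_query}}


old_body = {
    "from": 0,
    "size": 5,
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        {"term": {"called_party_address_number": "1987112602"}},
                        {"term": {"original_sender_address_number": "6870340319"}},
                    ]
                }
            }
        }
    },
}
new_body = filtered_to_bool(old_body)
```

The old bool filter is kept as a single clause inside bool/filter, so it still runs without score calculation.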

Combining and with or conditions

I am stuck with a query that has to combine several conditions.
The properties of the catalog are the following:
_id: integer
parentID: integer
path: string
level: integer
I have absolutely no clue how to combine them so that the query returns what I need:
a) _id has to be one of a given list ("_id": ["7","10"]) OR
b) parentID has to be a given integer ("_parentID": "1") OR
c) path has to match a special pattern ("regexp": {"path": "/foobar.*"}) AND level has to be between two integers ("range": {"level": {"gte": 2, "lte": 3 } })
Additionally, all entries have to be from one defined catalog.
I will not write down all my attempts. I tried to use a bool query with must and should, but this does not implement c):
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"type": {
"value": "category"
}
}
],
"should": [
{
"regexp": {
"path": "/foobar.*"
}
},
{
"range": {
"level": {
"gte": 2,
"lte": 3
}
}
},
{
"term": {
"_id": [
"7",
"10"
]
}
}
]
}
}
}
}
}
What is the best way to combine and and or conditions? I am kind of lost.
I think this should be pretty darn close to what you need.
GET devdev/alert/_search
{
"filter": {
"or": {
"filters": [
{
"terms": {
"_id": [
"eee75eJpRua4HasVzz0PeA",
"VALUE2"
]
}
},
{
"term": {
"_parentID": "SE.SE.0000"
}
},
{
"and": {
"filters": [
{
"regexp": {
"path": "/foobar.*"
}
}
{
"range": {
"level": {
"from": 2,
"to": 3
}
}
}
]
}
}
]
}
}
}
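The same tree can also be written with nested bool filters, which makes the and/or structure easier to see. Below is a minimal Python sketch under a few assumptions: all_of and any_of are made-up helper names, the field is assumed to be called parentID as in the property list, and the type restriction is copied from the question's own attempt.

```python
def all_of(*clauses):
    """AND: every clause has to match."""
    return {"bool": {"must": list(clauses)}}


def any_of(*clauses):
    """OR: at least one clause has to match."""
    return {"bool": {"should": list(clauses)}}


catalog_filter = all_of(
    {"type": {"value": "category"}},             # all entries from one catalog
    any_of(
        {"terms": {"_id": ["7", "10"]}},         # a) terms, not term, for a list
        {"term": {"parentID": 1}},               # b) assumed field name
        all_of(                                  # c) pattern AND level range
            {"regexp": {"path": "/foobar.*"}},
            {"range": {"level": {"gte": 2, "lte": 3}}},
        ),
    ),
)

body = {"query": {"filtered": {"filter": catalog_filter}}}
```

Condition c) gets its own inner bool, which is the part the flat must/should attempt in the question could not express.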

Multiple filters and an aggregate in elasticsearch

How can I use a filter together with an aggregation in Elasticsearch?
The official documentation gives only trivial examples for filters and for aggregations, and no formal description of the query DSL (compare it with e.g. the Postgres documentation).
By trial and error I found the following query, which is accepted by Elasticsearch (no parsing errors) but ignores the given filters:
{
"filter": {
"and": [
{
"term": {
"_type": "logs"
}
},
{
"term": {
"dc": "eu-west-12"
}
},
{
"term": {
"status": "204"
}
},
{
"range": {
"@timestamp": {
"from": 1398169707,
"to": 1400761707
}
}
}
]
},
"size": 0,
"aggs": {
"time_histo": {
"date_histogram": {
"field": "@timestamp",
"interval": "1h"
},
"aggs": {
"name": {
"percentiles": {
"field": "upstream_response_time",
"percents": [
98.0
]
}
}
}
}
}
}
Some people suggest using query instead of filter, but the official documentation generally recommends the opposite for filtering on exact values. Another issue with query: while filters offer an and, query does not.
Can somebody point me to documentation, a blog, or a book that describes writing non-trivial queries: at least an aggregation plus multiple filters?
I ended up using a filter aggregation, not a filtered query. So now I have 3 nested aggs elements.
I also use a bool filter instead of and, as recommended by @alex-brasetvik, because of http://www.elasticsearch.org/blog/all-about-elasticsearch-filter-bitsets/
My final implementation:
{
"aggs": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"_type": "logs"
}
},
{
"term": {
"dc": "eu-west-12"
}
},
{
"term": {
"status": "204"
}
},
{
"range": {
"@timestamp": {
"from": 1398176502000,
"to": 1400768502000
}
}
}
]
}
},
"aggs": {
"time_histo": {
"date_histogram": {
"field": "@timestamp",
"interval": "1h"
},
"aggs": {
"name": {
"percentiles": {
"field": "upstream_response_time",
"percents": [
98.0
]
}
}
}
}
}
}
},
"size": 0
}
Put your filter in a filtered query.
The top-level filter only filters the search hits, not facets or aggregations. It was renamed to post_filter in 1.0 because of this quite common confusion.
Also, you might want to look into this post on why you often want to use bool and not and/or: http://www.elasticsearch.org/blog/all-about-elasticsearch-filter-bitsets/
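To illustrate the filtered-query suggestion, here is a small Python sketch that puts the filters inside the query so that they also restrict what the aggregations see. The helper name search_with_aggs is hypothetical, the timestamp field name is passed in by the caller (it is an assumption here), and the filtered syntax matches the 1.x era of this answer.

```python
def search_with_aggs(filters, aggs):
    """Wrap filters in a filtered query so they apply to hits and aggregations."""
    return {
        "size": 0,  # only the aggregation results are of interest
        "query": {
            "filtered": {
                "filter": {"bool": {"must": list(filters)}}
            }
        },
        "aggs": aggs,
    }


timestamp_field = "@timestamp"  # assumed field name, adjust to your mapping

body = search_with_aggs(
    filters=[
        {"term": {"dc": "eu-west-12"}},
        {"term": {"status": "204"}},
        {"range": {timestamp_field: {"from": 1398176502000, "to": 1400768502000}}},
    ],
    aggs={
        "time_histo": {
            "date_histogram": {"field": timestamp_field, "interval": "1h"},
            "aggs": {
                "name": {
                    "percentiles": {
                        "field": "upstream_response_time",
                        "percents": [98.0],
                    }
                }
            },
        }
    },
)
```

Compared with the filter aggregation, this keeps the aggs section one level flatter, at the cost of applying the same filters to every aggregation in the request.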
More on @geekQ's answer: to support filter strings that contain spaces when searching on multiple terms, use the following:
{
"aggs": {
"aggresults": {
"filter": {
"bool": {
"must": [
{
"match_phrase": {
"term_1": "some text with space 1"
}
},
{
"match_phrase": {
"term_2": "some text with also space 2"
}
}
]
}
},
"aggs" : {
"all_term_3s" : {
"terms" : {
"field":"term_3.keyword",
"size" : 10000,
"order" : {
"_term" : "asc"
}
}
}
}
}
},
"size": 0
}
Just for reference, as of version 7.2, I tried something like the following to apply multiple filters to an aggregation:
use a filter aggregation to restrict the documents that are aggregated
use bool to set up the compound query
POST movies/_search?size=0
{
"size": 0,
"aggs": {
"test": {
"filter": {
"bool": {
"must": {
"term": {
"genre": "action"
}
},
"filter": {
"range": {
"year": {
"gte": 1800,
"lte": 3000
}
}
}
}
},
"aggs": {
"year_hist": {
"histogram": {
"field": "year",
"interval": 50
}
}
}
}
}
}

How to do nested AND and OR filters in ElasticSearch?

My filters are grouped together into categories.
I would like to retrieve documents where a document can match any filter in a category, but if two (or more) categories are set, then the document must match any of the filters in ALL categories.
If written in pseudo-SQL it would be:
SELECT * FROM Documents WHERE (CategoryA = 'A') AND (CategoryB = 'B' OR CategoryB = 'C')
I've tried Nested filters like so:
{
"sort": [{
"orderDate": "desc"
}],
"size": 25,
"query": {
"match_all": {}
},
"filter": {
"and": [{
"nested": {
"path":"hits._source",
"filter": {
"or": [{
"term": {
"progress": "incomplete"
}
}, {
"term": {
"progress": "completed"
}
}]
}
}
}, {
"nested": {
"path":"hits._source",
"filter": {
"or": [{
"term": {
"paid": "yes"
}
}, {
"term": {
"paid": "no"
}
}]
}
}
}]
}
}
But evidently I don't quite understand the ES syntax. Is this on the right track, or do I need to use another filter?
This should be it (translated from the given pseudo-SQL):
{
"sort": [
{
"orderDate": "desc"
}
],
"size": 25,
"query":
{
"filtered":
{
"filter":
{
"and":
[
{ "term": { "CategoryA":"A" } },
{
"or":
[
{ "term": { "CategoryB":"B" } },
{ "term": { "CategoryB":"C" } }
]
}
]
}
}
}
}
I realize you're not mentioning facets, but just for the sake of completeness:
You could also use a filter as the basis (like you did) instead of a filtered query (like I did). The resulting JSON is almost identical, the difference being:
a filtered query will filter both the main results and the facets
a filter will only filter the main results, NOT the facets.
Lastly, nested filters (which you tried using) have nothing to do with 'nesting filters', as you seemed to believe; they are for filtering on nested documents (objects mapped with the nested type).
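Since the categories come from user selections, the filter is probably built in code anyway. Here is a minimal Python sketch of that idea: any value within a category, all categories required. The helper name category_filter is made up, and it relies on the terms filter matching any of the listed values; note that term and terms compare against the indexed tokens, so analyzed string fields may need lowercased values.

```python
def category_filter(selected):
    """Build a bool filter from {field: [accepted values]} (hypothetical helper).

    Values inside one category are alternatives (terms = OR),
    while the categories themselves are all required (must = AND).
    """
    return {
        "bool": {
            "must": [
                {"terms": {field: values}}
                for field, values in selected.items()
                if values  # skip categories with nothing selected
            ]
        }
    }


# (CategoryA = 'A') AND (CategoryB = 'B' OR CategoryB = 'C')
body = {
    "sort": [{"orderDate": "desc"}],
    "size": 25,
    "query": {
        "filtered": {
            "filter": category_filter({
                "CategoryA": ["A"],
                "CategoryB": ["B", "C"],
                "CategoryC": [],
            })
        }
    },
}
```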
Although I have not completely understood your structure, this might be what you need.
You have to think tree-wise. You create a bool whose embedded bools must all be fulfilled (must = and). Each embedded bool checks whether the field does not exist or else (using should here instead of must) whether the field has one of the values in the list (terms here).
I am not sure whether there is a better way, and I do not know how it performs. The match_all query inside filtered is not necessary and can be omitted.
{
"sort": [
{
"orderDate": "desc"
}
],
"size": 25,
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"not": {
"exists": {
"field": "progress"
}
}
},
{
"terms": {
"progress": [
"incomplete",
"complete"
]
}
}
]
}
},
{
"bool": {
"should": [
{
"not": {
"exists": {
"field": "paid"
}
}
},
{
"terms": {
"paid": [
"yes",
"no"
]
}
}
]
}
}
]
}
}
}
}
}

Resources