Elasticsearch nested_filter with multiple term queries

I'm trying to use Elasticsearch (6.7) sorting with multiple term queries.
But the data is not sorted when the nested filter contains all 3 term queries. It works when I specify only this one:
{
"term": {
"instance.instFields.sourceFieldId": {
"value": "16044"
}
}
},
Below is the sort query with all 3 terms.
"sort": [
{
"instance.instFields.fieldDate": {
"order": "desc",
"nested_path": "instance.instFields",
"nested_filter": {
"bool": {
"must": [
{
"term": {
"instance.instFields.sourceFieldId": {
"value": "16044"
}
}
},
{
"term": {
"instance.dataSourceId": {
"value": "819"
}
}
},
{
"term": {
"instance.dsTypeId": {
"value": "2301"
}
}
}
]
}
}
}
}
],
Appreciate any help to resolve this issue.

The instance.dataSourceId and instance.dsTypeId fields are outside of your declared nested path (instance.instFields), so no inner object can match all three term queries in the nested filter. With no matching inner objects, there are no values left to be taken into account by the sort.
BTW, as of ES 6.1 the nested_path and nested_filter options have been deprecated in favor of a nested object with path and filter options.
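A minimal sketch of what the answer describes, written as Python that builds the request body. It keeps only the instFields term inside the nested sort filter and uses the nested object with path and filter. Moving the other two terms into the main query is an assumption about what the asker wants (restricting which documents are returned), and it also assumes instance is a plain object and only instance.instFields is mapped as nested. The helper names term and nested_sort are made up for illustration.

```python
import json


def term(field, value):
    """Build a term query in the same shape the question uses."""
    return {"term": {field: {"value": value}}}


def nested_sort(field, path, nested_filter, order="desc"):
    """Build a sort clause with the `nested` object (`path` + `filter`)."""
    return {field: {"order": order, "nested": {"path": path, "filter": nested_filter}}}


body = {
    # Assumption: these two fields live outside the nested path, so they
    # are applied to the whole document in the main query instead.
    "query": {
        "bool": {
            "filter": [
                term("instance.dataSourceId", "819"),
                term("instance.dsTypeId", "2301"),
            ]
        }
    },
    # Only the field that lives inside instance.instFields stays in the
    # nested sort filter.
    "sort": [
        nested_sort(
            "instance.instFields.fieldDate",
            "instance.instFields",
            term("instance.instFields.sourceFieldId", "16044"),
        )
    ],
}

print(json.dumps(body, indent=2))
```

If instance itself turns out to be a nested type, the two terms in the main query would need to be wrapped in a nested query instead.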

Related

Trying to understand ElasticSearch search latency issue

I have set up an ES index for user-centered data. Each document contains the relevant user ID (either in an owner field or in a contributor field) and 2 fields that need to be searched with "contains" semantics. The index contains about 100M documents, each about 15K in size, with a complex nested structure. The index is set up with dynamic_templates that index all fields as keywords (since no free-text search is needed, tokenizing seemed redundant); some fields are also normalized with a lowercase filter to enable case-insensitive search. The reasoning behind indexing all fields at this point is to avoid having to reindex in order to allow searches on other fields, so that new features can be added quickly (the size of the index makes reindexing a bit painful). The cluster is configured with 3 nodes and 5 shards with a replication factor of 1. The query I use looks like this:
{
"query": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"wildcard": {
"document.name": {
"value": "*SEARCH_TERM*"
}
}
},
{
"wildcard": {
"externalData.properties.displayName": {
"value": "*SEARCH_TERM*"
}
}
}
]
}
}
],
"filter": [
{
"bool": {
"should": [
{
"term": {
"contributorIds": {
"value": "deadbeef-cafe-babe-cafe-deadbeefcafe"
}
}
},
{
"term": {
"document.ownerId": {
"value": "deadbeef-cafe-babe-cafe-deadbeefcafe"
}
}
}
],
"filter": [
{
"term": {
"deleted": {
"value": "false"
}
}
}
]
}
}
]
}
},
"size": 50,
"sort": [
{
"_doc": {
"order": "asc"
}
}
]
}
I've noticed high-latency searches (at very low RPM), with latency varying between 300ms and 1500ms per search (I assume the variance is related to some caching mechanism). I am trying to understand the pain point in this query, so as to know whether the latency can be lowered by a solution that does not require reindexing (unlike, say, using an ngram tokenizer on the relevant searchable fields).
I've also tried using a filtered query with constant_score:
{
"query": {
"constant_score": {
"filter": {
"bool": {
"should": [
{
"wildcard": {
"document.name": {
"value": "*SEARCH_TERM*"
}
}
},
{
"wildcard": {
"externalData.properties.displayName": {
"value": "*SEARCH_TERM*"
}
}
}
],
"must": [
{
"term": {
"contributorIds": {
"value": "deadbeef-cafe-babe-cafe-deadbeefcafe"
}
}
},
{
"term": {
"document.ownerId": {
"value": "deadbeef-cafe-babe-cafe-deadbeefcafe"
}
}
},
{
"term": {
"deleted": {
"value": "false"
}
}
}
]
}
}
}
},
"size": 50,
"sort": [
{
"_doc": {
"order": "asc"
}
}
]
}
but the latency has not changed. Can anyone shed some light on what the pain point in this query is? I am trying to understand possible scaling paths (adding 2 more nodes, for instance) vs. reindexing the data in a different way (for instance, using an ngram tokenizer), which I would rather avoid if possible.
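For readers unfamiliar with the ngram option mentioned in the question, here is a toy illustration in plain Python of how an n-gram index can answer a "contains" search through exact term lookups. This is not Elasticsearch's implementation and makes no claim about the latency it would give on this index; the functions ngrams, build_index and contains, and the sample documents, are all made up for the sketch.

```python
def ngrams(text, n=3):
    """Return the set of lowercase character n-grams of `text`."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(docs, n=3):
    """Map every n-gram to the set of document ids whose value contains it."""
    index = {}
    for doc_id, value in docs.items():
        for gram in ngrams(value, n):
            index.setdefault(gram, set()).add(doc_id)
    return index


def contains(index, docs, needle, n=3):
    """Find ids of documents whose value contains `needle`, ignoring case."""
    grams = ngrams(needle, n)
    if not grams:
        # Needle shorter than n: no n-grams to look up, so scan everything.
        return {d for d, v in docs.items() if needle.lower() in v.lower()}
    # Each n-gram is an exact lookup; a match must contain all of them.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Verify the candidates, since sharing all n-grams does not guarantee
    # that they appear contiguously in the right order.
    return {d for d in candidates if needle.lower() in docs[d].lower()}


docs = {1: "Quarterly Report", 2: "Port of call", 3: "Budget"}
index = build_index(docs)
print(contains(index, docs, "port"))
```

The trade-off the question already hints at is that the n-grams have to be produced at index time, which is why this route implies reindexing.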

use wildcard with Terms in elasticsearch

I wanted to simulate SQL's IN, so I used a terms filter, but terms does not support wildcards, such as the asterisks in "*egypt*".
So how can I achieve the following query?
PS: I am using Elastica.
{
"query": {
"bool": {
"should": [
{
"terms": {
"country_name": [
"*egypt*",
"*italy*"
]
}
}
]
}
},
"sort": [
{
"rank": {
"order": "desc"
}
}
]
}
The terms query does not support wildcards. You can use a match or wildcard query instead. If your problem is filtering on multiple values, you can combine several queries inside should, so it will look like this:
{
"query": {
"bool": {
"should": [
{
"wildcard": {
"country_name": "*egypt*"
}
},
{
"wildcard": {
"country_name": "*italy*"
}
}
]
}
},
"sort": [
{
"rank": {
"order": "desc"
}
}
]
}
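The same idea can be generated for any number of values. The following is a small sketch in Python that builds the request body shown above from a list; the helper name wildcard_any is made up for illustration, and the body would still need to be sent with whatever client is in use (Elastica in the asker's case).

```python
import json


def wildcard_any(field, values):
    """Build a bool/should of wildcard queries, one per value (an IN-like OR)."""
    return {
        "bool": {
            "should": [{"wildcard": {field: f"*{value}*"}} for value in values]
        }
    }


body = {
    "query": wildcard_any("country_name", ["egypt", "italy"]),
    "sort": [{"rank": {"order": "desc"}}],
}

print(json.dumps(body, indent=2))
```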

Elasticsearch: multiple sorts for nested fields

I have a situation where a treatment has a price and the hospital may or may not want to display it.
So there is a price field PLUS an lp_low_priority field.
The value of lp_low_priority is 1 (true) when the price is not set (price is null).
The hospital doc is saved with its nested treatments.
When a user searches for a treatment, they get a list of hospitals with the minimum price of the treatment.
The sort on price works fine.
BUT I want hospitals whose matching treatment has lp_low_priority = 1 to come last.
The search code looks like this:
{
"sort": [
{
"treatments.lowest_price": {
"nested_filter": {
"term": {
"treatments.treatment_slug": "heart-surgery"
}
},
"mode": "avg",
"order": "asc"
}
},
{
"treatments.lp_low_priority": {
"order": "asc",
"nested_filter": {
"term": {
"treatments.treatment_slug": "heart-surgery"
}
},
"mode": "max"
}
}
],
"query": {
"filtered": {
"filter": [
{
"term": {
"treatments.treatment_slug": "heart-surgery"
}
},
{
"term": {
"treatments.status": "active"
}
},
{
"term": {
"treatments.treatment_status": "active"
}
},
{
"term": {
"hospital_status": "active"
}
},
{
"terms": {
"location.country": [
"India"
]
}
}
]
}
}
}
The result is very odd.
If I only use
{
"sort": [
{
"treatments.lowest_price": {
"nested_filter": {
"term": {
"treatments.treatment_slug": "heart-surgery"
}
},
"mode": "avg",
"order": "asc"
}
}
]
}
the sorting is in order, but the hospitals with lp_low_priority = 1 come first, which is OK (but not the requirement).
Can I even use more than one sort for nested fields?
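Sort keys are applied in the order they are listed, with later keys only breaking ties among earlier ones. Based on that, one thing to try is listing lp_low_priority before the price, so that hospitals with lp_low_priority = 1 fall to the end and price orders the rest. The sketch below builds such a sort array in Python; it is untested against the asker's mapping, keeps the nested_filter syntax and the modes from the question, and the helper names are made up for illustration.

```python
import json


def nested_sort(field, slug, mode, order="asc"):
    """Build one nested sort key, filtered to a single treatment slug."""
    return {
        field: {
            "order": order,
            "mode": mode,
            "nested_filter": {"term": {"treatments.treatment_slug": slug}},
        }
    }


def treatment_sort(slug):
    """Primary key: low-priority flag (0 before 1). Secondary key: price."""
    return [
        nested_sort("treatments.lp_low_priority", slug, mode="max"),
        nested_sort("treatments.lowest_price", slug, mode="avg"),
    ]


print(json.dumps({"sort": treatment_sort("heart-surgery")}, indent=2))
```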

Elasticsearch: performance in case of complex filters and lot of records

I am new to Elasticsearch. I need to build a query with querying (scoring) on two text fields plus complex filters. Here is what I got so far (with the help of kind folks such as Dan Tuffery, John Petrone, and dark_shadow at SO) and it works:
{
"filter": {
"or": [
{
"and": [
{
"range": {
"start": {
"lte": 201407292300
}
}
},
{
"range": {
"end": {
"gte": 201407292300
}
}
},
{
"term": {
"condtion1": false
}
},
{
"or": [
{
"and": [
{
"term": {
"condtion2": false
}
},
{
"or": [
{
"and": [
{
"missing": {
"field": "condtion6"
}
},
{
"missing": {
"field": "condtion7"
}
}
]
},
{
"term": {
"condtion6": "nop"
}
},
{
"term": {
"condtion7": "rst"
}
}
]
}
]
},
{
"and": [
{
"term": {
"condtion2": true
}
},
{
"or": [
{
"and": [
{
"missing": {
"field": "condtion3"
}
},
{
"missing": {
"field": "condtion4"
}
},
{
"missing": {
"field": "condtion5"
}
},
{
"missing": {
"field": "condtion6"
}
},
{
"missing": {
"field": "condtion7"
}
}
]
},
{
"term": {
"condtion3": "abc"
}
},
{
"term": {
"condtion4": "def"
}
},
{
"term": {
"condtion5": "ghj"
}
},
{
"term": {
"condtion6": "nop"
}
},
{
"term": {
"condtion7": "rst"
}
}
]
}
]
}
]
}
]
},
{
"and": [
{
"term": {
"condtion8": "TIME_POINT_1"
}
},
{
"range": {
"start": {
"lte": 201407302300
}
}
},
{
"or": [
{
"term": {
"condtion9": "GROUP_B"
}
},
{
"and": [
{
"term": {
"condtion9": "GROUP_A"
}
},
{
"ids": {
"values": [
100,
10
]
}
}
]
}
]
}
]
},
{
"and": [
{
"term": {
"condtion8": "TIME_POINT_2"
}
},
{
"ids": {
"values": [
100,
10
]
}
}
]
},
{
"and": [
{
"term": {
"condtion8": "TIME_POINT_3"
}
},
{
"or": [
{
"term": {
"condtion1": true
}
},
{
"range": {
"end": {
"lt": 201407302300
}
}
}
]
},
{
"or": [
{
"term": {
"condtion9": "GROUP_B"
}
},
{
"and": [
{
"term": {
"condtion9": "GROUP_A"
}
},
{
"ids": {
"values": [
100,
10
]
}
}
]
}
]
}
]
}
]
}
}
I am wondering whether Elasticsearch will perform well with such queries against hundreds of thousands of records.
Basically I am facing a choice of technologies. I am wondering whether a traditional database plus full-text search features would do a better job. I do like what Elasticsearch offers and the features I could possibly use in my project in the future.
I can see you are using a lot of AND/OR/NOT Filters. I strongly recommend going through these links:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/filter-caching.html
http://www.elasticsearch.org/blog/all-about-elasticsearch-filter-bitsets/
You should use the bool filter instead of and/or/not, as bool filters are internally cached, so it's much faster. Also, you are using term and missing filters, which are inherently fast as they operate at the term level.
A last piece of advice would be to properly analyze your use case and approach your problem better. Try to reduce the number of filters by making effective choices. Elasticsearch can handle these filters very well, and with caching it won't be too slow.
Thanks
Personally I think Elasticsearch will be a good choice of technology for what you are trying to achieve. I have used FAST, Solr and SQL in the past, but I find ES much better.
Do have a look at Queries vs. Filters, as it's important to know when to use filters vs. queries, since Elasticsearch does some caching.
I have run complex histograms over 800 million records on one server (16 cores, 64GB RAM, 500GB SAN) and it works very well. I would prefer to cluster the instance, but my client does not wish to add a couple more Linux servers (madness really). You should ideally set ES up with 3 nodes, as this gives you great performance and high availability; I have done this at another client's setup and it works like a dream.
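The first answer's advice to replace and/or/not with bool can be applied mechanically. Here is a rough sketch in Python that rewrites a filter tree like the one in the question. It assumes the array form of and/or and the direct form of not (as used in the question), not the alternative forms that wrap the clauses in a "filters" or "filter" key; the function name to_bool is made up for illustration.

```python
def to_bool(flt):
    """Recursively rewrite and/or/not filters as equivalent bool filters."""
    if not isinstance(flt, dict) or len(flt) != 1:
        return flt
    (key, value), = flt.items()
    if key == "and":
        # Every clause has to match.
        return {"bool": {"must": [to_bool(f) for f in value]}}
    if key == "or":
        # At least one clause has to match.
        return {"bool": {"should": [to_bool(f) for f in value]}}
    if key == "not":
        # The clause must not match.
        return {"bool": {"must_not": [to_bool(value)]}}
    # Leaf filters (term, range, missing, ids, ...) are left untouched.
    return flt


old = {
    "or": [
        {"and": [{"term": {"condtion2": False}}, {"missing": {"field": "condtion6"}}]},
        {"term": {"condtion6": "nop"}},
    ]
}
print(to_bool(old))
```

Adjacent must clauses could be flattened further by hand, which would also reduce the nesting depth of the original filter.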

How to do nested AND and OR filters in ElasticSearch?

My filters are grouped together into categories.
I would like to retrieve documents where a document can match any filter in a category, but if two (or more) categories are set, then the document must match any of the filters in ALL categories.
If written in pseudo-SQL it would be:
SELECT * FROM Documents WHERE (CategoryA = 'A') AND (CategoryB = 'B' OR CategoryB = 'C')
I've tried Nested filters like so:
{
"sort": [{
"orderDate": "desc"
}],
"size": 25,
"query": {
"match_all": {}
},
"filter": {
"and": [{
"nested": {
"path":"hits._source",
"filter": {
"or": [{
"term": {
"progress": "incomplete"
}
}, {
"term": {
"progress": "completed"
}
}]
}
}
}, {
"nested": {
"path":"hits._source",
"filter": {
"or": [{
"term": {
"paid": "yes"
}
}, {
"term": {
"paid": "no"
}
}]
}
}
}]
}
}
But evidently I don't quite understand the ES syntax. Is this on the right track or do I need to use another filter?
This should be it (translated from given pseudo-SQL)
{
"sort": [
{
"orderDate": "desc"
}
],
"size": 25,
"query":
{
"filtered":
{
"filter":
{
"and":
[
{ "term": { "CategoryA":"A" } },
{
"or":
[
{ "term": { "CategoryB":"B" } },
{ "term": { "CategoryB":"C" } }
]
}
]
}
}
}
}
I realize you're not mentioning facets, but just for the sake of completeness:
You could also use a filter as the basis (like you did) instead of a filtered query (like I did). The resulting JSON is almost identical, with the difference being:
a filtered query will filter both the main results and the facets
a filter will only filter the main results, NOT the facets.
Lastly, nested filters (which you tried using) are not about 'nesting filters' as you seemed to believe, but about filtering on nested documents (objects mapped with the nested type).
Although I have not completely understood your structure, this might be what you need.
You have to think tree-wise. You create a bool whose embedded bools must (= and) all be fulfilled. Each embedded bool checks whether the field does not exist or else (using should here instead of must) whether the field is one of the values in the list (terms here).
Not sure if there is a better way, and I do not know how it performs.
{
"sort": [
{
"orderDate": "desc"
}
],
"size": 25,
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"not": {
"exists": {
"field": "progress"
}
}
},
{
"terms": {
"progress": [
"incomplete",
"completed"
]
}
}
]
}
},
{
"bool": {
"should": [
{
"not": {
"exists": {
"field": "paid"
}
}
},
{
"terms": {
"paid": [
"yes",
"no"
]
}
}
]
}
}
]
}
}
}
}
}
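The "any value within a category, all categories together" rule from the question can be built from a plain mapping. Below is a minimal sketch in Python, assuming each category is a single field and using a terms filter for the OR inside a category and bool must for the AND across categories. It uses the filtered query from the answers above, which only exists in older Elasticsearch versions; the helper name category_filter is made up for illustration.

```python
import json


def category_filter(categories):
    """AND across categories (bool/must), OR within one category (terms)."""
    must = [{"terms": {field: list(values)}} for field, values in categories.items()]
    return {"bool": {"must": must}}


# Equivalent of: WHERE (CategoryA = 'A') AND (CategoryB = 'B' OR CategoryB = 'C')
body = {
    "sort": [{"orderDate": "desc"}],
    "size": 25,
    "query": {
        "filtered": {
            "filter": category_filter({"CategoryA": ["A"], "CategoryB": ["B", "C"]})
        }
    },
}

print(json.dumps(body, indent=2))
```

Categories with no selected values should simply be left out of the mapping, so they do not restrict the results.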

Resources