Trying to understand ElasticSearch search latency issue - performance

I have set up an ES index for user-centered data. Each document contains the relevant user ID (either in an owner field or in a contributor field) and 2 fields that need to be searched with "contains" semantics. The index contains about 100M documents, each about 15K in size, with a complex nested structure.
The index uses dynamic_templates that index all fields as keywords (no free-text search is needed, so tokenizing seemed redundant); some fields are also normalized with a lowercase filter to enable case-insensitive search. The reasoning behind indexing all fields at this point is to avoid having to reindex later in order to search on other fields, so that new features can be added quickly (the size of the index makes reindexing a bit painful).
The cluster is configured with 3 nodes and 5 shards with a replication factor of 1. The query I use looks like this:
{
"query": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"wildcard": {
"document.name": {
"value": "*SEARCH_TERM*"
}
}
},
{
"wildcard": {
"externalData.properties.displayName": {
"value": "*SEARCH_TERM*"
}
}
}
]
}
}
],
"filter": [
{
"bool": {
"should": [
{
"term": {
"contributorIds": {
"value": "deadbeef-cafe-babe-cafe-deadbeefcafe"
}
}
},
{
"term": {
"document.ownerId": {
"value": "deadbeef-cafe-babe-cafe-deadbeefcafe"
}
}
}
],
"filter": [
{
"term": {
"deleted": {
"value": "false"
}
}
}
]
}
}
]
}
},
"size": 50,
"sort": [
{
"_doc": {
"order": "asc"
}
}
]
}
I've noticed high search latency (at very low RPM), varying between 300ms and 1500ms per search; I assume the variance is related to some caching mechanism. I am trying to understand the pain point in this query, so I can tell whether a solution that does not require reindexing can lower the latency, as opposed to one that does (such as using an ngram tokenizer on the relevant searchable fields).
I've also tried using a filtered query with constant_score:
{
"query": {
"constant_score": {
"filter": {
"bool": {
"should": [
{
"wildcard": {
"document.name": {
"value": "*SEARCH_TERM*"
}
}
},
{
"wildcard": {
"externalData.properties.displayName": {
"value": "*SEARCH_TERM*"
}
}
}
],
"must": [
{
"term": {
"contributorIds": {
"value": "deadbeef-cafe-babe-cafe-deadbeefcafe"
}
}
},
{
"term": {
"document.ownerId": {
"value": "deadbeef-cafe-babe-cafe-deadbeefcafe"
}
}
},
{
"term": {
"deleted": {
"value": "false"
}
}
}
]
}
}
}
},
"size": 50,
"sort": [
{
"_doc": {
"order": "asc"
}
}
]
}
but the latency has not changed. Can anyone shed some light on what is the pain point in this query? I am trying to understand possible scaling paths (adding 2 more nodes for instance) vs. re-indexing the data in a different way (for instance using an ngram tokenizer) which I would rather avoid if possible.
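For readers weighing the ngram option mentioned above, here is a minimal plain-Python sketch of the idea, not Elasticsearch's own implementation. It assumes fixed-size 3-grams; the names `ngrams` and `contains_candidate` are hypothetical. The point it illustrates is that a "contains" search becomes a set of exact gram lookups at query time, at the cost of indexing many more terms per value.

```python
def ngrams(text, n=3):
    """Return the set of lowercase character n-grams of `text`."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def contains_candidate(indexed_value, search_term, n=3):
    """True when every n-gram of the search term occurs in the indexed value.

    This is a necessary condition for a substring match, not a proof of one,
    and terms shorter than `n` produce no grams and never match.
    """
    query_grams = ngrams(search_term, n)
    return bool(query_grams) and query_grams <= ngrams(indexed_value, n)
```

With an index built this way, the leading-wildcard scan over every keyword term is replaced by lookups of a handful of grams, which is why this route requires reindexing the two searchable fields.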

Related

Elasticsearch nested_filter with multiple term queries

I'm trying to use Elasticsearch (6.7) sorting with multiple term queries.
But it doesn't sort the data when there are 3 term queries. It works when I specify only
{
"term": {
"instance.instFields.sourceFieldId": {
"value": "16044"
}
}
},
Below is the sort query with all 3 terms.
"sort": [
{
"instance.instFields.fieldDate": {
"order": "desc",
"nested_path": "instance.instFields",
"nested_filter": {
"bool": {
"must": [
{
"term": {
"instance.instFields.sourceFieldId": {
"value": "16044"
}
}
},
{
"term": {
"instance.dataSourceId": {
"value": "819"
}
}
},
{
"term": {
"instance.dsTypeId": {
"value": "2301"
}
}
}
]
}
}
}
}
],
Appreciate any help to resolve this issue.
The instance.dataSourceId and instance.dsTypeId fields are outside your declared nested path (instance.instFields). The nested filter is evaluated against the inner objects under that path, so with those two terms included no inner object matches and nothing is taken into account for sorting.
BTW, as of ES 6.1 the nested_path and nested_filter options are deprecated in favor of a nested object with path and filter.
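A minimal sketch of what that could look like, written as a Python function that builds the request body. It assumes the two outer conditions are meant to restrict which documents are returned, so they move to the main query, and that `instance` itself is a plain object rather than a nested type; the function name `build_search` is hypothetical.

```python
def build_search(source_field_id, data_source_id, ds_type_id):
    """Build a search body that sorts on a nested date field (ES 6.1+ syntax)."""
    return {
        # Conditions on fields outside the nested path go in the main query.
        "query": {
            "bool": {
                "filter": [
                    {"term": {"instance.dataSourceId": {"value": data_source_id}}},
                    {"term": {"instance.dsTypeId": {"value": ds_type_id}}},
                ]
            }
        },
        "sort": [
            {
                "instance.instFields.fieldDate": {
                    "order": "desc",
                    # `nested` with `path`/`filter` replaces nested_path/nested_filter.
                    "nested": {
                        "path": "instance.instFields",
                        # Only fields under the nested path belong here.
                        "filter": {
                            "term": {
                                "instance.instFields.sourceFieldId": {
                                    "value": source_field_id
                                }
                            }
                        },
                    },
                }
            }
        ],
    }
```

Calling `build_search("16044", "819", "2301")` yields a body that can be serialized with `json.dumps` and sent to the `_search` endpoint.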

Elasticsearch Remove duplicate results if greater than some value

I have news articles from multiple sources saved, and each source has a different category. I need to write a query that sorts the articles in reverse chronological order, in chunks of 15 at a time. I also don't want more than 3 articles from any particular source. I am using the query below but the results are wrong. Can anyone tell me what I am doing wrong?
{
"query": {
"bool": {
"must": [
{
"match_phrase": {
"category": "Digital"
}
},
{
"match_phrase": {
"type": "Local"
}
}
]
}
},
"collapse": {
"field": "source.keyword",
"max_concurrent_group_searches": 3
},
"sort": [
{
"pub_date": {
"order": "desc"
}
}
]
}

Can ElasticSearch perform multiple aggregations with different query conditions in a single request?

I am looking for a solution to get aggregations, one of each field, but apply different query conditions at different aggregations.
I have a collection of products, which has attributes: type, color, brand.
The user selected brand=Gap, color=White and type=Sandal. To display the counts of similar products, each aggregation should apply the other two selections:
Query condition for brand aggregation: color=White and type=Sandal
Query condition for color aggregation: brand=Gap and type=Sandal
Query condition for type aggregation: brand=Gap and color=White
Can this be done in a single ElasticSearch query?
You'd create three aggregations with a filter agg for each and add the queries you'd like in there. I used the simplest one - bool with term - just to show the high level approach:
"aggs": {
"brand_agg": {
"filter": {
"bool": {
"must": [
{
"term": {
"color": "white"
}
},
{
"term": {
"type": "sandal"
}
}
]
}
}
},
"color_agg": {
"filter": {
"bool": {
"must": [
{
"term": {
"brand": "gap"
}
},
{
"term": {
"type": "sandal"
}
}
]
}
}
},
"type_agg": {
"filter": {
"bool": {
"must": [
{
"term": {
"color": "white"
}
},
{
"term": {
"brand": "gap"
}
}
]
}
}
}
}
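The same pattern can be generated for any number of facets. The sketch below is one possible way to do it in Python: for each facet it builds a filter aggregation from the other selections and nests a terms aggregation inside it so that per-value counts come back. The function name `facet_aggs` and the sub-aggregation name `values` are hypothetical, and the field names are assumed to be keyword-style fields holding the lowercase values used above.

```python
def facet_aggs(selected):
    """Build one filter aggregation per facet, each excluding its own selection.

    `selected` maps field name -> chosen value, e.g. {"brand": "gap", ...}.
    """
    aggs = {}
    for facet in selected:
        # Apply every selection except the one on the facet being counted.
        others = [
            {"term": {field: value}}
            for field, value in selected.items()
            if field != facet
        ]
        aggs[facet + "_agg"] = {
            "filter": {"bool": {"must": others}},
            # Nested terms agg returns a count per distinct value of the facet.
            "aggs": {"values": {"terms": {"field": facet}}},
        }
    return aggs
```

For example, `facet_aggs({"brand": "gap", "color": "white", "type": "sandal"})` produces `brand_agg`, `color_agg` and `type_agg`. If the hits themselves must be narrowed by all three selections, putting those conditions in `post_filter` rather than in the main query keeps them from affecting the aggregations.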

Terrible has_child query performance

The following query has terrible performance.
100% sure it is the has_child. Query without it runs under 300ms, with it it takes 9 seconds.
Is there some better way to use the has_child query? It seems like I could query parents, and then children by id and then join client side to do the has child check faster than the ES database engine is doing it...
{
"query": {
"filtered": {
"query": {
"bool": {
"must": [
{
"has_child": {
"type": "status",
"query": {
"term": {
"stage": "s3"
}
}
}
},
{
"has_child": {
"type": "status",
"query": {
"term": {
"stage": "es"
}
}
}
}
]
}
},
"filter": {
"bool": {
"must": [
{
"term": {
"source": "IntegrationTest-2016-03-01T23:31:15.023Z"
}
},
{
"range": {
"eventTimestamp": {
"from": "2016-03-01T20:28:15.028Z",
"to": "2016-03-01T23:33:15.028Z"
}
}
}
]
}
}
}
},
"aggs": {
"digests": {
"terms": {
"field": "digest",
"size": 0
}
}
},
"size": 0
}
Cluster info:
CPU and memory usage is low. It is an AWS ES Service cluster (v1.5.2). There are many small documents, and since the version AWS is running is old, doc values aren't on by default. Not sure if that is helping or hurting.
Since "stage" is not analyzed (based on your comment) and you are therefore not interested in scoring the documents that match on that field, you might realize slight performance gains by using the has_child filter instead of the has_child query, and a term filter instead of a term query.
In the documentation for has_child, you'll notice:
The has_child filter also accepts a filter instead of a query:
The main performance benefits of using a filter come from the fact that Elasticsearch can skip the scoring phase of the query. Also, filters can be cached which should improve the performance of future searches that use the same filters. Queries, on the other hand, cannot be cached.
Try this instead:
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"source": "IntegrationTest-2016-03-01T23:31:15.023Z"
}
},
{
"range": {
"eventTimestamp": {
"from": "2016-03-01T20:28:15.028Z",
"to": "2016-03-01T23:33:15.028Z"
}
}
},
{
"has_child": {
"type": "status",
"filter": {
"term": {
"stage": "s3"
}
}
}
},
{
"has_child": {
"type": "status",
"filter": {
"term": {
"stage": "es"
}
}
}
}
]
}
}
}
},
"aggs": {
"digests": {
"terms": {
"field": "digest",
"size": 0
}
}
},
"size": 0
}
I bit the bullet and just performed the parent:child join in my application. Instead of waiting 7 seconds for the has_child query, I fire off two consecutive term queries and do some post processing: 200ms.
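The post-processing step is not shown, so the following is only a sketch of one way such an application-side join could work. It assumes each term query returns the matching child documents and that the application has reduced every hit to a dict carrying its parent's ID under a hypothetical `parent_id` key.

```python
def parents_with_all_stages(hits_by_stage):
    """Return the parent IDs that have at least one child in every stage.

    `hits_by_stage` maps a stage value (e.g. "s3") to the list of child hits
    returned by the term query for that stage.
    """
    parent_sets = [
        {hit["parent_id"] for hit in hits}
        for hits in hits_by_stage.values()
    ]
    if not parent_sets:
        return set()
    # A parent qualifies only if it appears in the result of every stage query.
    return set.intersection(*parent_sets)
```

The resulting ID set plays the role of the two has_child clauses combined under `must`; the remaining source and time-range conditions would still have to be applied to the parents.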

ElasticSearch multi_match: if field exists apply filter, otherwise don't worry about it?

So we have an Elasticsearch instance, but a job requires a "combo search" (a single search field, with checkboxes for types across a specific index).
This is fine, I simply apply this kind of search to my index (for brevity: /posts):
{
"query": {
"multi_match": {
"query": querystring,
"type":"cross_fields",
"fields":["title","name"]
}
}
}
As you may guess from the need for the multi_match here, the schemas to each of these types differs in one way or another. And that's my challenge right now.
In one of the types, just one, there is a field that doesn't exist in the other types. It's called active and it's a basic boolean, 0 or 1.
We want to index inactive items in the type for administration search purposes, but we don't want inactive items in this type to be exposed to the public when searching.
To my knowledge and understanding, I want to use a filter. But when I supply a filter asking for active to be 1, I only get results from that type and nothing else, because now it's explicitly looking for items that have that field set to 1.
How can I do a conditional "if field exists, make sure it equals 1, otherwise ignore this condition"? Can this even be achieved?
if field exists, make sure it equals 1, otherwise ignore this condition
I think it can be implemented like this:
{
"query": {
"filtered": {
"filter": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"exists": {
"field": "active"
}
},
{
"term": {
"active": 1
}
}
]
}
},
{
"missing": {
"field": "active"
}
}
]
}
}
}
}
}
and the complete query:
{
"query": {
"filtered": {
"query": {
"multi_match": {
"query": "whatever",
"type": "cross_fields",
"fields": [
"title",
"name"
]
}
},
"filter": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"exists": {
"field": "active"
}
},
{
"term": {
"active": 1
}
}
]
}
},
{
"missing": {
"field": "active"
}
}
]
}
}
}
}
}
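The `filtered` query and the `missing` filter used above were removed in Elasticsearch 5.0. On newer versions the same condition can be expressed with a `bool` query whose `filter` clause holds the "equals 1 or field absent" logic. The Python sketch below builds such a body; the function names are hypothetical, and the value `1` follows the question, so it may need to be `True` if `active` is mapped as a boolean.

```python
def active_or_missing(field="active", value=1):
    """Match documents where `field` equals `value` or is not present at all."""
    return {
        "bool": {
            "should": [
                # A term match already implies that the field exists.
                {"term": {field: value}},
                # must_not + exists is the replacement for the old `missing`.
                {"bool": {"must_not": {"exists": {"field": field}}}},
            ],
            "minimum_should_match": 1,
        }
    }


def combo_search(text):
    """Combine the scored multi_match with the non-scoring conditional filter."""
    return {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": text,
                        "type": "cross_fields",
                        "fields": ["title", "name"],
                    }
                },
                "filter": active_or_missing(),
            }
        }
    }
```

`combo_search("whatever")` returns a dict that can be serialized with `json.dumps` and posted to `/posts/_search`.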
