Terrible has_child query performance - elasticsearch

The following query performs terribly.
I am 100% sure the cause is the has_child clauses: without them the query runs in under 300 ms; with them it takes 9 seconds.
Is there a better way to use the has_child query? It seems I could query the parents, then query the children by id, and join client side to do the has-child check faster than Elasticsearch does it.
{
"query": {
"filtered": {
"query": {
"bool": {
"must": [
{
"has_child": {
"type": "status",
"query": {
"term": {
"stage": "s3"
}
}
}
},
{
"has_child": {
"type": "status",
"query": {
"term": {
"stage": "es"
}
}
}
}
]
}
},
"filter": {
"bool": {
"must": [
{
"term": {
"source": "IntegrationTest-2016-03-01T23:31:15.023Z"
}
},
{
"range": {
"eventTimestamp": {
"from": "2016-03-01T20:28:15.028Z",
"to": "2016-03-01T23:33:15.028Z"
}
}
}
]
}
}
}
},
"aggs": {
"digests": {
"terms": {
"field": "digest",
"size": 0
}
}
},
"size": 0
}
Cluster info:
CPU and memory usage are low. It is an AWS ES Service cluster (v1.5.2). There are many small documents, and since the version AWS runs is old, doc values aren't on by default. I'm not sure whether that helps or hurts.

Since "stage" is not analyzed (based on your comment) and, therefore, you are not interested in scoring the documents that match on that field, you might realize slight performance gains by using the has_child filter instead of the has_child query, and a term filter instead of a term query.
In the documentation for has_child, you'll notice:
The has_child filter also accepts a filter instead of a query:
The main performance benefit of using a filter is that Elasticsearch can skip the scoring phase of the query. Filters can also be cached, which should improve the performance of future searches that use the same filters. Queries, on the other hand, cannot be cached.
Try this instead:
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"source": "IntegrationTest-2016-03-01T23:31:15.023Z"
}
},
{
"range": {
"eventTimestamp": {
"from": "2016-03-01T20:28:15.028Z",
"to": "2016-03-01T23:33:15.028Z"
}
}
},
{
"has_child": {
"type": "status",
"filter": {
"term": {
"stage": "s3"
}
}
}
},
{
"has_child": {
"type": "status",
"filter": {
"term": {
"stage": "es"
}
}
}
}
]
}
}
}
},
"aggs": {
"digests": {
"terms": {
"field": "digest",
"size": 0
}
}
},
"size": 0
}

I bit the bullet and just performed the parent:child join in my application. Instead of waiting 7 seconds for the has_child query, I fire off two consecutive term queries and do some post processing: 200ms.
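A minimal sketch of that application-side join, assuming the parent query and the child queries have already been run and their hits reduced to parent ids. The function name and the sample ids are hypothetical; how the ids are fetched depends on your client.

```python
from typing import Iterable, List, Set


def parents_with_all_stages(parent_ids: Iterable[str],
                            child_parent_ids_by_stage: List[Iterable[str]]) -> Set[str]:
    """Keep only the parents that have at least one child in every stage."""
    matching = set(parent_ids)
    for stage_parent_ids in child_parent_ids_by_stage:
        # A parent survives only if some child of this stage points at it.
        matching &= set(stage_parent_ids)
    return matching


# Hypothetical ids: parents returned by the source/timestamp query, and the
# parent ids of the "status" children found for each stage.
parents = ["p1", "p2", "p3"]
s3_children_parents = ["p1", "p2", "p2"]
es_children_parents = ["p2", "p3"]

joined = parents_with_all_stages(parents, [s3_children_parents, es_children_parents])
print(sorted(joined))
```

Set intersection does the same check as the two has_child clauses: a parent is kept only if both stages have a child that references it.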

Related

Elasticsearch aggregations on nested inner hits

I have a large amount of data in Elasticsearch. My documents have a nested field called "records" that contains a list of objects with several fields.
I want to query specific objects from the records list, so I use inner_hits in my query, but it doesn't help: the aggregation request uses size 0, so no hits are returned.
I haven't managed to make an aggregation work only on the inner hits; the aggregation returns results for all the objects inside records, regardless of the query.
This is the query I am using:
(Each document has first_timestamp and last_timestamp fields, and each object in the records list has a timestamp field)
curl -XPOST 'localhost:9200/_msearch?pretty' -H 'Content-Type: application/json' -d'
{
"index":[
"my_index"
],
"search_type":"count",
"ignore_unavailable":true
}
{
"size":0,
"query":{
"filtered":{
"query":{
"nested":{
"path":"records",
"query":{
"term":{
"records.data.field1":"value1"
}
},
"inner_hits":{}
}
},
"filter":{
"bool":{
"must":[
{
"range":{
"first_timestamp":{
"gte":1504548296273,
"lte":1504549196273,
"format":"epoch_millis"
}
}
}
]
}
}
}
},
"aggs":{
"nested_2":{
"nested":{
"path":"records"
},
"aggs":{
"2":{
"date_histogram":{
"field":"records.timestamp",
"interval":"1s",
"min_doc_count":1,
"extended_bounds":{
"min":1504548296273,
"max":1504549196273
}
}
}
}
}
}
}'
Your query is pretty complex.
In short, here is the query you asked for:
{
"size": 0,
"aggregations": {
"nested_A": {
"nested": {
"path": "records"
},
"aggregations": {
"bool_aggregation_A": {
"filter": {
"bool": {
"must": [
{
"term": {
"records.data.field1": "value1"
}
}
]
}
},
"aggregations": {
"reverse_aggregation": {
"reverse_nested": {},
"aggregations": {
"bool_aggregation_B": {
"filter": {
"bool": {
"must": [
{
"range": {
"first_timestamp": {
"gte": 1504548296273,
"lte": 1504549196273,
"format": "epoch_millis"
}
}
}
]
}
},
"aggregations": {
"nested_B": {
"nested": {
"path": "records"
},
"aggregations": {
"my_histogram": {
"date_histogram": {
"field": "records.timestamp",
"interval": "1s",
"min_doc_count": 1,
"extended_bounds": {
"min": 1504548296273,
"max": 1504549196273
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
Now, let me explain each step by aggregation name:
size: 0 -> we are not interested in hits, only in aggregations
nested_A -> data.field1 is under records, so we narrow the scope to records
bool_aggregation_A -> filter by data.field1: value1
reverse_aggregation -> first_timestamp is not in the nested document, so we step back out of records
bool_aggregation_B -> filter by the first_timestamp range
nested_B -> scope into records again for the timestamp field (located under records)
my_histogram -> finally, build the date histogram on the timestamp field
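Reading the result follows the same chain of names. A small sketch, assuming the response nests each sub-aggregation under its parent's name as Elasticsearch does for single-bucket aggregations; the helper name and the sample response (including its doc counts) are made up for illustration.

```python
from typing import Any, Dict, List

# Aggregation names in the order they are nested in the request.
AGG_PATH = ["nested_A", "bool_aggregation_A", "reverse_aggregation",
            "bool_aggregation_B", "nested_B", "my_histogram"]


def histogram_buckets(response: Dict[str, Any],
                      path: List[str] = AGG_PATH) -> List[Dict[str, Any]]:
    """Walk down the nested aggregation names and return the histogram buckets."""
    node = response["aggregations"]
    for name in path:
        node = node[name]
    return node["buckets"]


# Hypothetical response body, trimmed to the aggregations part.
sample_response = {
    "aggregations": {
        "nested_A": {
            "doc_count": 40,
            "bool_aggregation_A": {
                "doc_count": 12,
                "reverse_aggregation": {
                    "doc_count": 5,
                    "bool_aggregation_B": {
                        "doc_count": 3,
                        "nested_B": {
                            "doc_count": 9,
                            "my_histogram": {
                                "buckets": [
                                    {"key": 1504548296000, "doc_count": 4},
                                    {"key": 1504548297000, "doc_count": 5},
                                ]
                            },
                        },
                    },
                },
            },
        }
    }
}

buckets = histogram_buckets(sample_response)
print([(b["key"], b["doc_count"]) for b in buckets])
```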
Aggregating on inner_hits is not supported by Elasticsearch. The reason is that inner_hits is a very expensive operation, and applying an aggregation on top of it would sharply increase the complexity of the operation.
Here is the GitHub link of the issue.
If you want aggregation on inner_hits, you can probably use one of the following approaches:
Make a flexible query that returns only the required hits from Elasticsearch and aggregate over them. Repeat it multiple times to collect all the hits, aggregating as you go. This approach may leave you issuing many search queries, which is not advisable.
Have your application layer handle the aggregation logic by writing an aggregation parser and running it on the response from Elasticsearch. This approach is a little better, but you carry the overhead of maintaining the parser as needs change.
I would personally recommend changing your data mapping in Elasticsearch so that you can run aggregations on it directly.
You can also try something like this:
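As a rough illustration of the application-layer approach, the sketch below buckets the timestamps found in the returned inner hits into one-second intervals. It assumes the search was run with a non-zero size, that the inner hits are keyed by the nested path ("records"), and that each nested hit's _source holds the nested object itself; the function name and the sample response are hypothetical. It only sees the inner hits that were actually returned, so the inner_hits size may need raising.

```python
from collections import Counter
from typing import Any, Dict


def histogram_from_inner_hits(response: Dict[str, Any],
                              inner_name: str = "records",
                              field: str = "timestamp",
                              interval_ms: int = 1000) -> Dict[int, int]:
    """Count returned inner hits per interval, keyed by the interval start."""
    counts: Counter = Counter()
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get(inner_name, {})
        for nested_hit in inner.get("hits", {}).get("hits", []):
            ts = nested_hit["_source"][field]
            # Round the epoch-millis timestamp down to the interval start.
            counts[ts - ts % interval_ms] += 1
    return dict(sorted(counts.items()))


# Hypothetical response with one parent document and three matching records.
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "doc1",
                "inner_hits": {
                    "records": {
                        "hits": {
                            "hits": [
                                {"_source": {"timestamp": 1504548296273}},
                                {"_source": {"timestamp": 1504548296900}},
                                {"_source": {"timestamp": 1504548297100}},
                            ]
                        }
                    }
                },
            }
        ]
    }
}

histogram = histogram_from_inner_hits(sample_response)
print(histogram)
```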
PUT records
{
"mappings": {
"properties": {
"records": {
"type": "nested"
}
}
}
}
POST records/_doc
{
"records": [
{
"data": "test1",
"value": 1
},
{
"data": "test2",
"value": 2
}
]
}
GET records/_search
{
"size": 0,
"aggs": {
"all_nested_count": {
"nested": {
"path": "records"
},
"aggs": {
"bool_aggs": {
"filter": {
"bool": {
"must": [
{
"term": {
"records.data": "test2"
}
}
]
}
},
"aggs": {
"filtered_aggs": {
"sum": {
"field": "records.value"
}
}
}
}
}
}
}
}
Ref: https://www.elastic.co/guide/en/elasticsearch/reference/current/inner-hits.html

Can ElasticSearch perform multiple aggregations with different query conditions in a single request?

I am looking for a way to get one aggregation per field, with different query conditions applied to each aggregation.
I have a collection of products with the attributes type, color, and brand.
The user selected brand=Gap, color=White, and type=Sandal. To display the counts of similar products for each aggregation, I need:
Query condition for brand aggregation: color=White and type=Sandal
Query condition for color aggregation: brand=Gap and type=Sandal
Query condition for type aggregation: brand=Gap and color=White
Can this be done in a single Elasticsearch query?
You'd create three aggregations, each wrapped in a filter agg, and add the queries you'd like in there. I used the simplest one, bool with term, just to show the high-level approach:
"aggs": {
"brand_agg": {
"filter": {
"bool": {
"must": [
{
"term": {
"color": "white"
}
},
{
"term": {
"type": "sandal"
}
}
]
}
}
},
"color_agg": {
"filter": {
"bool": {
"must": [
{
"term": {
"brand": "gap"
}
},
{
"term": {
"type": "sandal"
}
}
]
}
}
},
"type_agg": {
"filter": {
"bool": {
"must": [
{
"term": {
"color": "white"
}
},
{
"term": {
"brand": "gap"
}
}
]
}
}
}
}
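The same pattern can be generated from the user's selection, so each facet is filtered by every selection except its own. This is only a sketch: the function name is hypothetical, and the terms sub-aggregation added under each filter (to get the per-value counts) is an assumption that is not part of the snippet above.

```python
from typing import Any, Dict


def facet_aggs(selected: Dict[str, str], bucket_size: int = 10) -> Dict[str, Any]:
    """Build one filter aggregation per facet, excluding that facet's own selection."""
    aggs: Dict[str, Any] = {}
    for facet in selected:
        # Conditions come from every selected field except the facet itself.
        others = [{"term": {field: value}}
                  for field, value in selected.items() if field != facet]
        aggs[f"{facet}_agg"] = {
            "filter": {"bool": {"must": others}},
            # Assumed addition: count documents per value of the facet field.
            "aggs": {f"{facet}_values": {"terms": {"field": facet, "size": bucket_size}}},
        }
    return aggs


selection = {"brand": "gap", "color": "white", "type": "sandal"}
body = {"size": 0, "aggs": facet_aggs(selection)}
print(sorted(body["aggs"]))
```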

elasticsearch to apply a sort to a query, then select top N for aggregate

The query below aggregates over the entire result of the query, and size only affects what is returned rather than what is aggregated.
How would I modify the search so that only the top N results after the sort are processed by the average aggregation?
It seems such a simple requirement that I'm expecting it to be possible but so far all my efforts have failed, and similar questions on SO have gone unanswered.
{
"size": 0,
"query": {
"constant_score": {
"filter": {
"bool": {
"must": [
{
"term": {
"jobType": "LiveEventScoring"
}
},
{
"term": {
"host": "MTVMDANS"
}
},
{
"term": {
"dataSourceCode": "AU_VIRT"
}
},
{
"term": {
"measurement": "EventDataLoadFromCacheDuration"
}
}
]
}
}
}
},
"sort": {
"timestamp": {
"order": "desc"
}
},
"aggs": {
"avgDuration": {
"avg": {
"field": "elapsedMs"
}
}
}
}

Select distinct values of bool query elastic search

I have a query that gets me some user post data from an Elasticsearch index. I am happy with that query, but I need it to return rows with unique usernames. Currently it displays relevant posts by users, but it may display one user twice.
{
"query": {
"bool": {
"should": [
{ "match_phrase": { "gtitle": {"query": "voice","boost": 1}}},
{ "match_phrase": { "gdesc": {"query": "voice","boost": 1}}},
{ "match": { "city": {"query": "voice","boost": 2}}},
{ "match": { "gtags": {"query": "voice","boost": 1} }}
],"must_not": [
{ "term": { "profilepicture": ""}}
],"minimum_should_match" : 1
}
}
}
I have read about aggregations but didn't understand much (I also tried to use aggs, but that didn't work either). Any help is appreciated.
You need a terms aggregation to get all unique users and then a top_hits aggregation to get only one result for each user. This is how it looks:
{
"query": {
"bool": {
"should": [
{
"match_phrase": {
"gtitle": {
"query": "voice",
"boost": 1
}
}
},
{
"match_phrase": {
"gdesc": {
"query": "voice",
"boost": 1
}
}
},
{
"match": {
"city": {
"query": "voice",
"boost": 2
}
}
},
{
"match": {
"gtags": {
"query": "voice",
"boost": 1
}
}
}
],
"must_not": [
{
"term": {
"profilepicture": ""
}
}
],
"minimum_should_match": 1
}
},
"aggs": {
"unique_user": {
"terms": {
"field": "userid",
"size": 100
},
"aggs": {
"only_one_post": {
"top_hits": {
"size": 1
}
}
}
}
},
"size": 0
}
Here the size inside the unique_user aggregation is 100; you can increase it if you have more unique users (the default is 10). The outermost size is zero so that only aggregation results are returned. One important thing to remember is that your user ids have to be unique as exact values, i.e. ABC and abc will be considered different users; you might have to make your userid field not_analyzed to be sure about that. More on that.
Hope this helps!!
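To turn that response back into a flat list with one post per user, something like the following could work. It is a sketch that assumes the aggregation names used above (unique_user, only_one_post); the function name and the sample response are hypothetical.

```python
from typing import Any, Dict, List


def one_post_per_user(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the top post of every user bucket, in bucket order."""
    posts = []
    for bucket in response["aggregations"]["unique_user"]["buckets"]:
        top = bucket["only_one_post"]["hits"]["hits"]
        if top:  # top_hits with size 1 yields at most one hit per bucket
            posts.append(top[0]["_source"])
    return posts


# Hypothetical response, trimmed to the aggregations part.
sample_response = {
    "aggregations": {
        "unique_user": {
            "buckets": [
                {"key": "u1", "doc_count": 3, "only_one_post": {"hits": {"hits": [
                    {"_source": {"userid": "u1", "gtitle": "voice lessons"}}]}}},
                {"key": "u2", "doc_count": 1, "only_one_post": {"hits": {"hits": [
                    {"_source": {"userid": "u2", "gtitle": "voice over work"}}]}}},
            ]
        }
    }
}

posts = one_post_per_user(sample_response)
print([p["userid"] for p in posts])
```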

Why script in custom_filters_score behaves as boost?

{
"query": {
"custom_filters_score": {
"query": {
"term": {
"name": "user1234"
}
},
"filters": [
{
"filter": {
"term": {
"subject": "math"
}
},
"script": "_score + doc['subject_score'].value"
}
]
}
}
}
With the script as above, it gives the error: unresolvable property or identifier: _score
If the script is "script": "doc['subject_score'].value", it multiplies the _score the same way a boost does. I want to replace the Elasticsearch _score with a custom score.
If I understood you correctly, you would like to use Elasticsearch scoring if the subject is not math and custom scoring when the subject is math. If you are using Elasticsearch v0.90.4 or higher, this can be achieved with the new function_score query:
{
"query": {
"function_score": {
"query": {
"term": {
"name": "user1234"
}
},
"functions": [{
"filter": {
"term": {
"subject": "math"
}
},
"script_score": {
"script": "doc[\"subject_score\"].value"
}
}, {
"boost_factor": 0
}],
"score_mode": "first",
"boost_mode": "sum"
}
}
}
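To make the arithmetic of that query concrete, here is a plain Python model of how the final score comes out, under my reading of the two settings: score_mode "first" takes the first function whose filter matches, and boost_mode "sum" adds the function's value to the query score. The function name and sample documents are hypothetical; this is not Elasticsearch code.

```python
from typing import Any, Dict


def combined_score(query_score: float, doc: Dict[str, Any]) -> float:
    """Model of score_mode=first and boost_mode=sum for the query above."""
    # score_mode "first": only the first function whose filter matches applies.
    if doc.get("subject") == "math":
        function_value = float(doc["subject_score"])  # the script_score function
    else:
        function_value = 0.0  # the unfiltered boost_factor: 0 fallback
    # boost_mode "sum": the function value is added to the query score.
    return query_score + function_value


math_doc = {"name": "user1234", "subject": "math", "subject_score": 4.0}
art_doc = {"name": "user1234", "subject": "art", "subject_score": 9.0}

print(combined_score(1.5, math_doc))
print(combined_score(1.5, art_doc))
```

Under this model a math document scores its query score plus subject_score, and every other document keeps its plain query score.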
Prior to v0.90.4 you would have to resort to a combination of custom_score and custom_filters_score:
{
"query": {
"custom_score": {
"query": {
"custom_filters_score": {
"query": {
"term": {
"name": "user1234"
}
},
"filters": [{
"filter": {
"term": {
"subject": "math"
}
},
"script": "-1.0"
}]
}
},
"script": "_score < 0.0 ? _score * -1.0 + doc[\"subject_score\"].value : _score"
}
}
}
or, as @javanna suggested, use multiple custom_score queries combined by a bool query:
{
"query": {
"bool": {
"disable_coord": true,
"should": [{
"filtered": {
"query": {
"term": {
"name": "user1234"
}
},
"filter": {
"bool": {
"must_not": [{
"term": {
"subject": "math"
}
}]
}
}
}
}, {
"filtered": {
"query": {
"custom_score": {
"query": {
"term": {
"name": "user1234"
}
},
"script": "doc['subject_score'].value"
}
},
"filter": {
"term": {
"subject": "math"
}
}
}
}]
}
}
}
Firstly, I'd like to say that there are many ways of customising the scoring in Elasticsearch, and it seems you may have accidentally picked the wrong one. I will summarize two, and you will see what the problem is:
Custom Filters Score
If you read the docs (carefully) on custom_filters_score, you will see that it is there for performance reasons: it lets scoring use Elasticsearch's faster filter machinery. (Filters are faster because scoring is not calculated when computing the hit set, and they are cached between requests.)
At the end of the docs, it mentions that custom_filters_score can take a "script" parameter instead of a "boost" parameter per filter. The best way to think of this is that the script calculates a number, which is passed up to the parent query to be combined with the other sibling queries to calculate the total score for the document.
Custom Score Query
According to the docs, this is used when you want to take the score from the query and change it however you wish. A _score variable is available in your "script"; it is the score of the query inside the custom_score query.
Try this:
"query": {
"filtered": {
"query": {
"custom_score": {
"query": {
"match_all": {}
},
"script": "doc['subject_score'].value" //*see note below
}
},
"filter": {
"and": [
{
"term": {
"subject": "math"
}
},
{
"term": {
"name": "user1234"
}
}
]
}
}
}
*NOTE: If you wanted to, you could use _score here. Also, I moved both of your "term" clauses into filters, since any match on a term gets the same score and filters are faster.
Good luck!
