Composite and Terms Aggregations on a field with a high cardinality - performance

I am facing a serious performance problem with ES: responses take more than 2 minutes.
I have an index that holds more than 25M files and includes the following 4 fields (among others):
...
"group_write": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"user_write": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"group_read": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"user_read": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
...
I have something like 100K unique users and groups and each field is a list of users/groups that holds ~100 values. For example:
"user_read": ["user_1", "group_1", ...],
"user_write": ["user_1", "group_2", ...]
...
I use 2 kinds of aggregation, composite and terms: the composite aggregation to get only the first X results to display, and the terms aggregation for prefix search.
Composite aggregation:
{
"size": 0,
"aggs": {
"Group_Read_Permissions": {
"composite": {
"sources": [
{
"Group Read": {
"terms": {
"field": "group_read.raw"
}
}
}
],
"size": 10
}
},
"Group_Write_Permissions": {
"composite": {
"sources": [
{
"Group Write": {
"terms": {
"field": "group_write.raw"
}
}
}
]
}
},
"User_Write_Permissions": {
"composite": {
"sources": [
{
"User Write": {
"terms": {
"field": "user_write.raw"
}
}
}
]
}
},
"User_Read_Permissions": {
"composite": {
"sources": [
{
"User Read": {
"terms": {
"field": "user_read.raw"
}
}
}
]
}
}
}
}
Terms aggregation:
{
"size": 0,
"aggs": {
"Group_Read_Permissions": {
"terms": {
"field": "group_read.raw",
"include": ".*[Ss].*"
}
},
"Group Write Permissions": {
"terms": {
"field": "group_write.raw",
"include": ".*[Ss].*"
}
},
"User Read Permissions": {
"terms": {
"field": "user_read.raw",
"include": ".*[Ss].*"
}
},
"User Write Permissions": {
"terms": {
"field": "user_write.raw",
"include": ".*[Ss].*"
}
}
}
}
Composite aggregation returns results within 1 min and the terms aggregation can take up to 5 min.
What I have tried so far:
Adding a new field user_group_permissions and adding "copy_to": "user_group_permissions" to the above 4 fields
Adding the following property to the above 4 fields and to the field user_group_permissions: "eager_global_ordinals": true
Increasing the refresh_interval up to 200s
** I reindexed for the first 2 suggestions [took something like 6 hours]
All of the above did help a little with the retrieval time but still: composite aggregation takes up to 20s and terms aggregation takes up to 3 min.
[The best results were on the fields user_group_permissions which has been created in the first suggestion, with eager_global_ordinals = true and refresh_interval = 120s].
Please, if someone has any idea how to improve the retrieval times I will be grateful.

First of all, if you only need the first 10 results, you don't need the composite aggregation, which is meant for paginating over all buckets. A terms aggregation with the default size of 10 will do the job.
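As a rough sketch of that first point, the request body below swaps the four composite aggregations for plain terms aggregations. It is built as a Python dict only so that it can be checked without a cluster; the helper name top_terms_request is made up for this example, and the field names are the ones from the question.

```python
import json

# Field names come from the question's mapping; the helper name is illustrative.
PERMISSION_FIELDS = [
    "group_read.raw",
    "group_write.raw",
    "user_read.raw",
    "user_write.raw",
]


def top_terms_request(fields, size=10):
    """Build a search body with one terms aggregation per field."""
    aggs = {}
    for field in fields:
        # Name each aggregation after the parent field: "group_read.raw" -> "group_read"
        name = field.split(".")[0]
        aggs[name] = {"terms": {"field": field, "size": size}}
    # "size": 0 suppresses the hits so only the aggregation buckets are returned
    return {"size": 0, "aggs": aggs}


body = top_terms_request(PERMISSION_FIELDS)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request against the index.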
Second, what you're doing with the terms aggregation is not prefix filtering but infix filtering, which is completely different in terms of performance. While it's easy to search for prefixes, searching for infixes requires the equivalent of a "full table scan" because each and every term must be visited.
A first optimization I would suggest is to run your regex in the query part of the second request (bool/should with one regexp query per field), so as to reduce the document set on which the terms aggregations need to run. That might help a bit.
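A minimal sketch of that first optimization, assuming the same four .raw fields as in the question. The helper name infix_request is hypothetical, and the body is built as a Python dict so it can be inspected without a cluster.

```python
import json


def infix_request(fields, pattern, size=10):
    """Restrict the documents with regexp queries, then aggregate matching terms."""
    # One regexp clause per field; a document qualifies if any field matches
    should = [{"regexp": {field: {"value": pattern}}} for field in fields]
    aggs = {
        # "include" keeps only the bucket keys that match the same pattern
        field.split(".")[0]: {
            "terms": {"field": field, "include": pattern, "size": size}
        }
        for field in fields
    }
    return {
        "size": 0,
        "query": {"bool": {"should": should, "minimum_should_match": 1}},
        "aggs": aggs,
    }


request = infix_request(
    ["group_read.raw", "group_write.raw", "user_read.raw", "user_write.raw"],
    ".*[Ss].*",
)
print(json.dumps(request, indent=2))
```

The include pattern stays on each terms aggregation because every document holds around 100 values per field, so a matching document still carries many values that do not match.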
A second optimization is to leverage the wildcard field type, a specialized field type made for grep-like wildcard and regexp queries.
Another possible optimization is to lowercase all your permissions, so that you only need to search for .*s.* instead of .*[Ss].*.
Depending on your comments, I'll add more optimizations as the discussion goes on.

Related

Composite aggregation query with bucket_sort does not work properly

I have an index to store financial transactions:
{
"mappings": {
"_doc": {
"properties": {
"amount": {
"type": "long"
},
"currencyCode": {
"type": "keyword"
},
"merchantId": {
"type": "keyword"
},
"merchantName": {
"type": "text"
},
"partnerId": {
"type": "keyword"
},
"transactionDate": {
"type": "date"
},
"userId": {
"type": "keyword"
}
}
}
}
}
Here's my query:
GET /transactions/_search
{
"aggs": {
"date_merchant": {
"aggs": {
"amount": {
"sum": {
"field": "amount"
}
},
"amount_sort": {
"bucket_sort": {
"sort": [
{
"amount": {
"order": "desc"
}
}
]
}
},
"top_hit": {
"top_hits": {
"_source": {
"includes": [
"merchantName",
"currencyCode"
]
},
"size": 1
}
}
},
"composite": {
"size": 1,
"sources": [
{
"date": {
"date_histogram": {
"calendar_interval": "day",
"field": "transactionDate"
}
}
},
{
"merchant": {
"terms": {
"field": "merchantId"
}
}
}
]
}
}
},
"query": {
"bool": {
"filter": [
{
"term": {
"userId": "AAA"
}
},
{
"term": {
"partnerId": "BBB"
}
},
{
"range": {
"transactionDate": {
"gte": "2022-07-01"
}
}
},
{
"term": {
"currencyCode": "EUR"
}
}
]
}
},
"size": 0
}
Please note the "size": 1 in the composite aggregation.
If I change it to 3 (based on my data)... I get different results!
That means the bucket_sort operation doesn't work on the whole list of buckets, but just on the returned ones (if it's just one, that means it's not going to be sorted at all!)
How can I sort on ALL the buckets instead?
EDIT
Based on Benjamin's answer I changed my query to use normal aggregations instead of composite, and a large bucket size for merchant IDs (default is 10, while for date histogram there's no limit)
Composite agg design
The composite aggregation is designed to iterate all buckets in the most efficient way possible.
How can I sort on ALL the buckets instead?
To fully sort over ALL buckets, all buckets would have to be enumerated ahead of time, defeating the design of the composite aggregation.
So, how to actually sort over all buckets?
Aggregate over all buckets in a single call: set your size to the largest number of buckets your query can produce.
That number is at most the cardinality of merchantId multiplied by the number of days in the date histogram.
Another option is to paginate over all the composite buckets and then sort them client side. If you choose this path, it may be good to have each page of the composite aggregation be sorted so that sorting them client side will be faster.
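The client-side option could look roughly like the sketch below. It assumes a caller-supplied fetch_page(after_key) function (hypothetical, not part of any client library) that runs the search from the question with "after": after_key inside the composite block and returns the parsed JSON response. The pages used here are hand-written sample data, not real results.

```python
def collect_composite_buckets(fetch_page, agg_name):
    """Page through a composite aggregation and return every bucket."""
    buckets, after_key = [], None
    while True:
        agg = fetch_page(after_key)["aggregations"][agg_name]
        if not agg["buckets"]:
            break  # an empty page means everything has been read
        buckets.extend(agg["buckets"])
        after_key = agg.get("after_key")
        if after_key is None:
            break
    return buckets


def sort_by_amount(buckets):
    """Sort buckets by the "amount" sum sub-aggregation, largest first."""
    return sorted(buckets, key=lambda b: b["amount"]["value"], reverse=True)


# Hand-written sample pages standing in for real responses
_fake_pages = [
    {"aggregations": {"date_merchant": {
        "after_key": {"date": 1656633600000, "merchant": "m2"},
        "buckets": [
            {"key": {"date": 1656633600000, "merchant": "m1"}, "doc_count": 2, "amount": {"value": 5.0}},
            {"key": {"date": 1656633600000, "merchant": "m2"}, "doc_count": 1, "amount": {"value": 20.0}},
        ]}}},
    {"aggregations": {"date_merchant": {
        "after_key": {"date": 1656633600000, "merchant": "m3"},
        "buckets": [
            {"key": {"date": 1656633600000, "merchant": "m3"}, "doc_count": 4, "amount": {"value": 12.0}},
        ]}}},
    {"aggregations": {"date_merchant": {"buckets": []}}},
]

page_iter = iter(_fake_pages)
all_buckets = collect_composite_buckets(lambda after_key: next(page_iter), "date_merchant")
ranked = sort_by_amount(all_buckets)
print([b["key"]["merchant"] for b in ranked])
```

With this approach the bucket_sort sub-aggregation is no longer needed for the global ordering, since the final sort happens after all pages have been collected.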

Nested Fields, Wildcard Queries and Aggregations in Elasticsearch

I have an index that collects web redirects data for various sites. I am using a nested field to collect the data as shown in the mapping below:
"chain": {
"type": "nested",
"properties": {
"url.position": {
"type": "long"
},
"url.full": {
"type": "text"
},
"url.domain": {
"type": "keyword"
},
"url.path": {
"type": "keyword"
},
"url.query": {
"type": "text"
}
}
}
As you can imagine, each document contains an array of url chains, the size of the array being equal to number of web redirects. I want to get aggregations based on wildcard/regexp matches to url.query field. Here is a sample query:
GET push_url_chain/_search
{
"query": {
"nested": {
"path": "chain",
"query": {
"regexp": {
"chain.url.query": "aff_c.*"
}
}
}
},
"size": 0,
"aggs": {
"dataFields": {
"nested": {
"path": "chain"
},
"aggs": {
"offers": {
"terms": {
"field": "chain.url.domain",
"size": 30
}
}
}
}
}
}
The above query does produce aggregated results but not the way I want.
I want to see chain.url.domain aggregations for the urls that contain the aff_c.* phrase. Right now it is looking at all the urls in the chain and then aggregating the buckets by doc_count regardless of whether that url/domain has the particular phrase. I hope I have been able to explain this clearly. How do I get my results to show bucket aggregations that contain domains that have aff_c.* phrase match to the query field of the url.
I would also like to know how I can use = or / in my wildcard or regexp queries. It is not producing any results if I use the above symbols in my queries.
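A nested query returns every document in which at least one nested document matches the condition; the matching nested documents themselves are only available through inner_hits.
The aggregation then runs over all nested documents of those top-level documents, so every domain shows up in the terms buckets.
To get only the matching terms, put a filter aggregation inside the nested aggregation (matched_doc below, which filters on the url query) and run the terms aggregation under that filter, so it only sees the matched urls.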
{
"size": 0,
"aggs": {
"Name": {
"nested": {
"path": "chain"
},
"aggs": {
"matched_doc": {
"filter": {
"match_phrase_prefix": {
"chain.url.query": "abc"
}
},
"aggs": {
"domain": {
"terms": {
"field": "chain.url.domain",
"size": 10
}
}
}
}
}
}
}
}
You can use match_phrase_prefix instead of regexp; it performs better.
The standard analyzer drops characters such as "/" and "=" when it generates tokens. If you want to match them with regexp or wildcard queries, you need to query a keyword field, not a text field.

Aggregation in elastic search

Need help with aggregation in Elasticsearch. Is it possible to aggregate the values of a particular field as an array or list? This is more of a grouping. For example, instead of getting the result as
{"Book_Id":"102","Review_Text":"DescentRead"},{"Book_Id":"102","Review_Text":"For Kids."},{"Book_Id":"103","Review_Text":"Great"},{"Book_Id":"103","Review_Text":"Excellent"}
can I get all the reviews of each book as a list?
[ { Book_Id: 102, Review_Text: [ "DescentRead", "For Kids." ] }, { Book_Id: 103, Review_Text: [ "Great", "Excellent" ] } ]
I tried a few things with aggs but was not able to get it. Any pointers would help!!
Could aggregations with top hits work? The limitation is that you need to specify a maximum number of hits per bucket (the example returns the top 100 results per book ID, ordered by review text), but apart from that you can run it as a normal query and specify which fields to return, how they should be sorted (to get the top hits), etc.
Example aggs query:
POST
http://myserver:9200/books/book/_search
{
"size": 0,
"aggs": {
"BookReviews": {
"terms": {
"field": "Book_Id.keyword"
},
"aggs": {
"top_reviews": {
"top_hits": {
"sort": [ { "Review_Text.keyword": { "order": "desc" } } ],
"size": 100,
"_source": {
"includes": [ "Review_Text" ]
}
}
}
}
}
}
}
Note that the aggregation names ("BookReviews" and "top_reviews") are arbitrary; whatever names you choose will appear in the resulting aggregation tree. You can do multi-level aggregations on terms in your index and include top hits on any level, typically for drill-down reporting or similar cases.
Mapping used:
{
"books": {
"mappings": {
"book": {
"properties": {
"Book_Id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"Review_Text": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
"size": 0 in the root node will omit any hits for the search and only return the aggs trees.
You can also add a normal "query": {} block on the same level as size and aggs if you need to filter the results before elastic starts aggregating.
Read more in the elasticsearch documentation pages:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html
(If you provide a more complete example dataset, we can give a more realistic example query, as there isn't a lot of data in the example for sorting or scoring the results)
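To get from the aggregation tree to the list shape asked for in the question, the response still has to be reshaped on the client. A minimal sketch, assuming the aggregation names "BookReviews" and "top_reviews" from the example query; the function name reviews_by_book and the sample response are made up for illustration.

```python
def reviews_by_book(response):
    """Turn the terms/top_hits aggregation tree into one entry per book."""
    grouped = []
    for bucket in response["aggregations"]["BookReviews"]["buckets"]:
        hits = bucket["top_reviews"]["hits"]["hits"]
        grouped.append({
            "Book_Id": bucket["key"],
            # Each hit carries only the fields requested via "_source"
            "Review_Text": [hit["_source"]["Review_Text"] for hit in hits],
        })
    return grouped


# Hand-written sample shaped like a search response, not real output
sample_response = {
    "aggregations": {"BookReviews": {"buckets": [
        {"key": "102", "doc_count": 2, "top_reviews": {"hits": {"hits": [
            {"_source": {"Review_Text": "For Kids."}},
            {"_source": {"Review_Text": "DescentRead"}},
        ]}}},
        {"key": "103", "doc_count": 2, "top_reviews": {"hits": {"hits": [
            {"_source": {"Review_Text": "Great"}},
            {"_source": {"Review_Text": "Excellent"}},
        ]}}},
    ]}}
}

result = reviews_by_book(sample_response)
print(result)
```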

Elasticsearch: how to scope aggregations to your query and filter?

I have been playing around with elasticsearch query and filter for some time now but never worked with aggregations before. The idea that we can scope the aggregations with our query seems quite amazing to me but I want to understand how to do it properly so that I do not make any mistakes. Currently all my search queries are designed this way:
{
"query": {
},
"filter": {
},
"from": 0,
"size": 60
}
Now, when I added some aggregation buckets, the structure became this:
{
"aggs": {
"all_colors": {
"terms": {
"field": "color.name"
}
},
"all_brands": {
"terms": {
"field": "brand_slug"
}
},
"all_sizes": {
"terms": {
"field": "sizes"
}
}
},
"query": {
},
"filter": {
},
"from": 0,
"size": 60
}
However, the results of the aggregation are always the same irrespective of what info I provide in filter.
Now, when I changed the query structure to something like this, it started showing different results:
{
"aggs": {
"all_colors": {
"terms": {
"field": "color.name"
}
},
"all_brands": {
"terms": {
"field": "brand_slug"
}
},
"all_sizes": {
"terms": {
"field": "sizes"
}
}
},
"query": {
"filtered": {
"query": {
},
"filter": {
}
}
},
"from": 0,
"size": 60
}
Does it mean I will have to change the structure of my search queries everywhere to this new filtered type of structure ? Is there any other workaround which allows me to achieve desired results without having to change that much of code ?
Also, another thing I observed is that if my brand_slug field contains multiple keywords like "peter england", then both of these are returned in separate buckets like this:
{
"buckets": [
{
"key": "england",
"doc_count": 368
},
{
"key": "peter",
"doc_count": 368
}
]
}
How can I ensure that both these end up in a same bucket like this:
{
"buckets": [
{
"key": "peter england",
"doc_count": 368
}
]
}
UPDATE: This second part I have been able to accomplish by indexing brand, color and sizes differently like this:
"sizes": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
What you've noticed is by design. Have a look at my answer to a similar question on SO. Basically, the aggregations and the top-level filter both take the output of the query section as their input, so the filter never affects the aggregations. A filtered query, as you've suggested, is the best way to get the results you want. There is another way too: a filter aggregation. You would not need to change your query and filter sections, only copy the filter into each aggregation, but in my opinion that is overkill and a violation of the DRY principle.
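One way to limit the amount of code that has to change is to build the request body in a single helper. The sketch below is an assumption-laden illustration, not the answerer's code: it uses a bool query with a filter clause, which replaced the filtered query when the latter was removed in Elasticsearch 5.0, and the helper name scoped_search plus the sample query and filter values are hypothetical.

```python
import json


def scoped_search(query, filters, aggs, from_=0, size=60):
    """Wrap query and filters in one bool query so the aggregations see the filtered set."""
    return {
        "query": {"bool": {"must": [query], "filter": filters}},
        "aggs": aggs,
        "from": from_,
        "size": size,
    }


facets = {
    "all_colors": {"terms": {"field": "color.name"}},
    "all_brands": {"terms": {"field": "brand_slug"}},
    "all_sizes": {"terms": {"field": "sizes"}},
}

# The match and term values are placeholders for illustration only
search_body = scoped_search(
    query={"match": {"name": "shirt"}},
    filters=[{"term": {"color.name": "red"}}],
    aggs=facets,
)
print(json.dumps(search_body, indent=2))
```

On clusters old enough to still use the filtered query, the same helper could emit {"filtered": {"query": ..., "filter": ...}} instead; the point is only that the filter moves inside the query section.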

Elasticsearch: Limit filtered query to 5 items per type per day

I'm using Elasticsearch to gather data for the front page of my event portal. The current query is as follows:
{
"query": {
"function_score": {
"filter": {
"and": [
{
"geo_distance": {
"distance": "50km",
"location": {
"lat": 50.78,
"lon": 6.08
},
"_cache": true
}
},
{
"or": [
{
"and": [
{
"term": {
"type": "event"
}
},
{
"range": {
"datetime": {
"gt": "now"
}
}
}
]
},
{
"not": {
"term": {
"type": "event"
}
}
}
]
}
]
},
"functions": [
...
]
}
}
}
So basically all events within a 50km distance that are either future events or of other types. Other types could be status, photo, video, soundcloud etc... All these items have a datetime field and a parent field identifying the account the item belongs to. There are some functions after the filter for scoring objects based on their distance and age.
Now my question:
Is there a way to filter the query to get only the first (or even better highest scored) 5 items per type per account per day?
So currently I have accounts which upload 20 images at the same time. This is too much to display on the frontpage.
I thought about using filter scripts in a post_filter, but I am not very familiar with this topic.
Any ideas?
many thanks in advance
DTFagus
I solved it this way:
"aggs": {
"byParent": {
"terms": {
"field": "parent_id"
},
"aggs": {
"byType": {
"terms": {
"field": "type"
},
"aggs": {
"perDay": {
"date_histogram" : {
"field" : "datetime",
"interval": "day"
},
"aggs": {
"topHits": {
"top_hits": {
"size": 5,
"_source": {
"include": ["path"]
}
}
}
}
}
}
}
}
}
}
Unfortunately there is no pagination for aggregations (or other way around: the pagination of the query is not used). So I will get the paginated query results and the aggregation of all hits and intersect the arrays in js. Does not sound very efficient but I currently have no better idea. Anyone?
The only way around this I see would be to index all data into two indices. One containing all data and one with only the top 5 per day per type per account. This would be less time consuming to query but more time and storage consuming while indexing :/
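The intersection step described above could be sketched as follows, in Python rather than JavaScript. It assumes the aggregation names from the snippet (byParent, byType, perDay, topHits) and a response that contains both the paginated hits and the aggregations; the function names and the sample response are illustrative only.

```python
def allowed_ids(aggregations):
    """Collect the ids of the top hits per account, per type, per day."""
    ids = set()
    for parent in aggregations["byParent"]["buckets"]:
        for type_bucket in parent["byType"]["buckets"]:
            for day in type_bucket["perDay"]["buckets"]:
                for hit in day["topHits"]["hits"]["hits"]:
                    ids.add(hit["_id"])
    return ids


def filter_page(response):
    """Keep only the page hits that also appear among the aggregated top hits."""
    keep = allowed_ids(response["aggregations"])
    return [hit for hit in response["hits"]["hits"] if hit["_id"] in keep]


# Hand-written sample shaped like a search response, not real output
sample = {
    "hits": {"hits": [{"_id": "a1"}, {"_id": "a2"}, {"_id": "a9"}]},
    "aggregations": {"byParent": {"buckets": [
        {"key": "acc1", "byType": {"buckets": [
            {"key": "photo", "perDay": {"buckets": [
                {"key_as_string": "2014-05-01", "topHits": {"hits": {"hits": [
                    {"_id": "a1"}, {"_id": "a2"},
                ]}}},
            ]}},
        ]}},
    ]}},
}

page = filter_page(sample)
print([hit["_id"] for hit in page])
```

Keep in mind that a terms aggregation returns only 10 buckets by default, so the size of byParent and byType may need to be raised for the id set to be complete.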
You can limit the number of results returned by your query using the "size" parameter. If you set size to 5, you will get the first 5 results returned by your query.
Check the documentation http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/pagination.html
Hope this helps!

Resources