Elasticsearch terms aggregation performance - elasticsearch

I have a basic aggregation on an index with about 40 million documents.
{
aggs: {
countries: {
filter: {
bool: {
must: my_filters,
}
},
aggs: {
filteredCountries: {
terms: {
field: 'countryId',
min_doc_count: 1,
size: 15,
}
}
}
}
}
}
The index:
{
"settings": {
"number_of_shards": 5,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter",
"unique"
]
}
}
}
},
"mappings": {
"properties": {
"id": {
"type": "integer"
},
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
},
"countryId": {
"type": "short"
}
}
}
}
The search response time is 100ms, but the aggregation response time is about 1.5s and keeps increasing as we add more documents (it was about 200ms with 5 million documents). There are about 20 distinct countryId values right now.
What I tried so far:
Allocating more RAM (from 4GB to 32GB): same results.
Changing the countryId field data type to keyword and adding the eager_global_ordinals option: it made things worse.
The Elasticsearch version is 7.8.0. Elasticsearch has 8GB of RAM; the server has 64GB of RAM and 16 CPUs, with 5 shards on 1 node.
I use this aggregation to build the filters shown alongside search results, so I need it to respond as fast as possible. For a large number of results I don't need precision, so an approximate count, or even a count capped at some number (e.g. "100 or more"), would be fine.
Any ideas on how to speed up this aggregation?

Reason for the slowness:
Bucket explosion is the reason. As per the docs, breadth-first collect mode can speed things up further:
"Even though the number of actors may be comparatively small and we want only 50 result buckets there is a combinatorial explosion of buckets during calculation - a single actor can produce n² buckets where n is the number of actors." The example there is finding 10 popular actors and their 5 top co-actors.
I would suggest setting an execution hint. Since you have very few unique values, I suggest setting the hint to map.
Another optimization: if, say, some documents have not been accessed in the last few weeks, you can use a field from your filter to restrict the aggregation to a particular set of documents.
Another optimization: if your use case allows it, use include/exclude to restrict the aggregation to the countries you need (see value filtering in the terms aggregation docs).
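The execution hint and include suggestions can be sketched as a request body. This is only a minimal sketch in Python that builds the body without sending it; `build_country_facets` and its parameters are hypothetical names, and `my_filters` stands for the filter clauses from the question. To my knowledge the hint applies to keyword fields and is ignored where it is not applicable, so this assumes the keyword mapping the question mentions trying.

```python
import json

def build_country_facets(my_filters, wanted_countries=None, size=15):
    """Build the filtered terms aggregation from the question, plus the hints above."""
    terms = {
        "field": "countryId",
        "min_doc_count": 1,
        "size": size,
        # "map" aggregates on the field values directly instead of global ordinals.
        # Assumption: countryId is mapped as keyword.
        "execution_hint": "map",
    }
    if wanted_countries:
        # Exact-value include list: only these buckets are created.
        terms["include"] = [str(c) for c in wanted_countries]
    return {
        "aggs": {
            "countries": {
                "filter": {"bool": {"must": my_filters}},
                "aggs": {"filteredCountries": {"terms": terms}},
            }
        }
    }

# Hypothetical filter clause, for illustration only.
body = build_country_facets([{"match": {"name": "par"}}], wanted_countries=[33, 44])
print(json.dumps(body, indent=2))
```

Whether either setting helps on a given data set is something to measure; the question reports that the keyword mapping with eager_global_ordinals made things worse.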

Related

Elastic search Ngram tokenizer performance for UUID

I would like to do partial filtering on UUID, reference_id, and postal_code. For reference_id and postal_code, I know that they will be shorter than 36 characters, but UUIDs are 36 characters long. I'm thinking of setting up an ngram tokenizer with:
min ngram 1
max ngram 36
Will this get really bad over time in terms of speed and memory? Is there a better way to do partial search on UUIDs?
For example, I have 7e222584-0818-49b0-875b-2774f4bf939b and I want to be able to find it by searching for 9b0.
Yes, that will create an awful lot of tokens: 36 + 35 + 34 + 33 + ... + 1 = (1 + 36) * (36 / 2) = 666 tokens for each UUID, and that's discouraged. Even when creating an ngram token filter, the default maximum allowed difference between min_gram and max_gram is 1, so you'd have to override that in the index settings, which gives you a first indication that it might not be the right thing to do.
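The token count can be checked with a few lines of Python. This is a plain-Python imitation of what an ngram filter with min_gram 1 and max_gram 36 emits for a single keyword token, not Elasticsearch itself; `ngrams` is a hypothetical helper.

```python
def ngrams(text, min_gram=1, max_gram=36):
    """Return every substring of length min_gram..max_gram, in emission order."""
    return [
        text[start:start + n]
        for n in range(min_gram, max_gram + 1)
        for start in range(len(text) - n + 1)
    ]

uuid = "7e222584-0818-49b0-875b-2774f4bf939b"
tokens = ngrams(uuid)
print(len(tokens))      # 666
print("9b0" in tokens)  # True
```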
You might want to try the new wildcard field type, which might do a better job.
You can easily compare both approaches by creating two indexes and indexing the same amount (but a substantial one) of UUIDs in both and then comparing their size.
First index with ngrams:
PUT uuid1
{
"settings": {
"index.max_ngram_diff": 36,
"analysis": {
"analyzer": {
"uuid": {
"tokenizer": "keyword",
"filter": [
"ngram"
]
}
},
"filter": {
"ngram": {
"type": "ngram",
"min_gram": 1,
"max_gram": 36
}
}
}
},
"mappings": {
"properties": {
"uuid": {
"type": "text",
"analyzer": "uuid",
"search_analyzer": "standard"
}
}
}
}
Second index with wildcard:
PUT uuid2
{
"mappings": {
"properties": {
"uuid": {
"type": "wildcard"
}
}
}
}
Then you index the same data in both:
POST _bulk
{ "index": {"_index": "uuid1"}}
{ "uuid": "7e222584-0818-49b0-875b-2774f4bf939b"}
{ "index": {"_index": "uuid2"}}
{ "uuid": "7e222584-0818-49b0-875b-2774f4bf939b"}
And finally you can compare their sizes; you will see that the uuid1 index is bigger than the uuid2 index. Here it is by a factor of 3, but you might want to index a bit more data to get a better ratio:
GET _cat/shards/uuid*?v
index shard prirep state   docs store  ip         node
uuid1 0     p      STARTED    1 10.6kb 10.0.33.86 instance-0000000062
uuid2 0     p      STARTED    1  3.5kb 10.0.12.26 instance-0000000042
Searching the second index with a wildcard query is very easy, as simple as the match query you'd run on the index with ngrams:
POST uuid2/_search
{
"query": {
"wildcard": {
"uuid": "*9b0*"
}
}
}

Store Hotel Availabilities with daily information

I have to store some millions of hotel rooms with some requirements:
The hotel gives the number of identical rooms available, daily
The price can change daily; this data is only stored in ES, not indexed
The index will only be used for search (not for monitoring), using the hotel's geolocation
Size: let's say about 50k hotels, 10 rooms each, 1 year+ of availability => about 200 million
So we have to manage the data at a "daily" level.
Each time a room is booked in our application, the number of rooms should be updated. We also store a "cache" from partners (other hotel providers) working worldwide, and we query them at regular intervals to update our cache.
I am pretty familiar with Elasticsearch, but I still hesitate between 2 mappings. I removed some fields (breakfast, amenities, smoking...) to keep it simple:
The first one: 1 document per room, each containing 365 children (one per day).
{
"mappings": {
"room": {
"properties": {
"room_code": {
"type": "keyword"
},
"hotel_id": {
"type": "keyword"
},
"isCancellable": {
"type": "boolean"
},
"location": {
"type": "geo_point"
},
"price_summary": {
"type": "keyword",
"index": false
}
}
},
"availability": {
"_parent": {
"type": "room"
},
"properties": {
"date": {
"type": "date",
"format": "date"
},
"number_available": {
"type": "integer"
},
"overwrite_price_summary": {
"type": "keyword",
"index": false
}
}
}
}
}
pros:
Updates and reindexing are isolated at the child level
Only one index
Adding future availabilities is easy (just add child documents to a room)
cons:
Queries will be a little slower because of the join (looping over the availability children)
Children and parents both need to be returned, so the query has to include inner_hits.
A lot of hotels create temporary rooms (for vacations, local events...) that are only available 1 month a year, for example; this adds useless rooms to the index for the 11 remaining months.
The second: I create one index per month (Jan, Feb...), using nested documents instead of children.
{
"mappings": {
"room": {
"properties": {
"room_code": {
"type": "keyword"
},
"hotel_id": {
"type": "keyword"
},
"isCancellable": {
"type": "boolean"
},
"location": {
"type": "geo_point"
},
"price_summary": {
"type": "keyword",
"index": false
},
"availability": {
"type": "nested"
}
}
},
"availability": {
"properties": {
"day_of_month": {
"type": "integer"
},
"number_available": {
"type": "integer"
},
"overwrite_price_summary": {
"type": "keyword",
"index": false
}
}
}
}
}
pros:
Faster, no join, smaller index
Resolves the temporary-room issue, thanks to the 12 monthly indices
cons:
Updates: booking a room for 1 night forces a reindex of the whole room document (for the matching month)
If a customer is looking for a room with a check-in on 31 March, for example, we will have to query 2 indices, March and April
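The two-index case in the last point can be made concrete with a small helper that works out which monthly indices a stay touches. This is a sketch under assumptions: the `rooms-MM` index naming and the `monthly_indices` function are hypothetical, not part of the mappings above.

```python
from datetime import date, timedelta

def monthly_indices(check_in, nights, prefix="rooms-"):
    """Return the monthly index names covering every night of a stay, in order."""
    names = []
    for offset in range(nights):
        night = check_in + timedelta(days=offset)
        name = f"{prefix}{night.month:02d}"  # hypothetical naming: rooms-01 .. rooms-12
        if name not in names:
            names.append(name)
    return names

print(monthly_indices(date(2024, 3, 31), nights=2))  # ['rooms-03', 'rooms-04']
print(monthly_indices(date(2024, 3, 10), nights=3))  # ['rooms-03']
```

The resulting names could then be passed as a comma-separated list in a single search request, so the extra cost is one more index to search rather than a second round trip.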
For the search/query, the second option is better in theory.
The main problem is the room updates:
In my production environment, about 30 million daily availabilities change every 24 hours.
I also have to read and compare the cache from the partners, and update it if needed: about 130 million reads every 12 hours on average, with roughly one update per 10 reads.
I have 6 other indexed fields in my mappings at the room level. This is not a lot, so maybe a nested solution is OK...
So, which one is the best?
Note: I read "How to store date range data in elastic search (aws) and search for a range?", but my case is a little different because of the daily information.
Any help/advice is welcome.

ElasticSearch (5.5) query or algorithm required to extract values against timestamp with an interference pattern

I have a very large volume of documents in ElasticSearch (5.5) which hold recorded data at regular time intervals, let's say every 3 seconds.
{
"#timestamp": "2015-10-14T12:45:00Z",
"channel1": 24.4
},
{
"#timestamp": "2015-10-14T12:48:00Z",
"channel1": 25.5
},
{
"#timestamp": "2015-10-14T12:51:00Z",
"channel1": 26.6
}
Let's say that I need to get results back for a query that asks for the point value every 5 seconds. An interference pattern arises where sometimes there will be an exact match (for simplicity's sake, let's say in the example above that 12:45 is the only sample to land on a multiple of five).
On these times, I want elastic to provide me with the exact value recorded at that time if there is one. So at 12:45 there is a match so it returns value 24.4
In the other cases, I require the last (previously recorded) value. So at 12:50, having no data at that precise time, it would return the value at 12:48 (25.5), being the last known value.
Previously I have used aggregations, but they don't help in this case because I don't want an average computed from a bucket of data; I need either an exact value for an exact time match or the previous value if there is no match.
I could do this programmatically, but performance is a real issue here, so I need the most performant method possible to retrieve the data in the way stated. Returning ALL the Elastic data, iterating over the results, and checking for a match at each time interval (else keeping the item at index i-1) sounds slow, and I wonder whether it is the best way.
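To pin down the semantics being asked for (exact match if present, otherwise the last earlier sample), here is a minimal client-side sketch. It is not an Elasticsearch solution; it assumes the samples are already fetched and sorted by time, and it uses the minute marks from the example as plain integers. `sample_last_known` is a hypothetical helper.

```python
from bisect import bisect_right

def sample_last_known(times, values, start, end, step):
    """For each tick in [start, end], return the value recorded at or just before it."""
    out = []
    tick = start
    while tick <= end:
        idx = bisect_right(times, tick) - 1  # index of the last sample with time <= tick
        out.append((tick, values[idx] if idx >= 0 else None))
        tick += step
    return out

times = [45, 48, 51]         # minute marks from the example, sorted ascending
values = [24.4, 25.5, 26.6]
print(sample_last_known(times, values, start=45, end=50, step=5))
# [(45, 24.4), (50, 25.5)]
```

The binary search avoids scanning every sample per tick, but all samples in the range still have to be transferred, which is the cost the question is trying to avoid.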
Perhaps I am missing a trick with Elastic. Perhaps somebody knows a method to do exactly what I am after?! It would be much appreciated...
The mapping is like so:
"mappings": {
"sampleData": {
"dynamic": "true",
"dynamic_templates": [{
"pv_values_template": {
"match": "GroupId", "mapping": { "doc_values": true, "store": false, "type": "keyword" }
}
}],
"properties": {
"#timestamp": { "type": "date" },
"channel1": { "type": "float" },
"channel2": { "type": "float" },
"item": { "type": "object" },
"keys": { "properties": { "count": { "type": "integer" }}},
"values": { "properties": { "count": { "type": "integer" }}}
}
}
}
and the (NEST) method being called looks like so:
channelAggregation => channelAggregation.DateHistogram("HistogramFilter", histogram => histogram
.Field(dataRecord => dataRecord["#timestamp"])
.Interval(interval)
.MinimumDocumentCount(0)
.ExtendedBounds(start, end)
.Aggregations(aggregation => DataFieldAggregation(channelNames, aggregation)));
@Nikolay there may be up to around 1400 buckets (a maximum of one value to be returned per pixel available on the chart)

Terms aggregation (to achieve hierarchical faceting) query performance slow

I am indexing metric names in elastic search. Metric names are of the form foo.bar.baz.aux. Here is the index I use.
{
"index": {
"analysis": {
"analyzer": {
"prefix-test-analyzer": {
"filter": "dotted",
"tokenizer": "prefix-test-tokenizer",
"type": "custom"
}
},
"filter": {
"dotted": {
"patterns": [
"([^.]+)"
],
"type": "pattern_capture"
}
},
"tokenizer": {
"prefix-test-tokenizer": {
"delimiter": ".",
"type": "path_hierarchy"
}
}
}
}
}
{
"metrics": {
"_routing": {
"required": true
},
"properties": {
"tenantId": {
"type": "string",
"index": "not_analyzed"
},
"unit": {
"type": "string",
"index": "not_analyzed"
},
"metric_name": {
"index_analyzer": "prefix-test-analyzer",
"search_analyzer": "keyword",
"type": "string"
}
}
}
}
The above index creates the following terms for a metric name foo.bar.baz
foo
bar
baz
foo.bar
foo.bar.baz
If I have a bunch of metrics, like below
a.b.c.d.e
a.b.c.d
a.b.m.n
x.y.z
I have to write a query to grab the nth level of tokens. In the example above
for level = 0, I should get [a, x]
for level = 1, with 'a' as first token I should get [b]
with 'x' as first token I should get [y]
for level = 2, with 'a.b' as first token I should get [c, m]
I couldn't think of any other way, other than to write terms aggregation. To figure out level 2 tokens of a.b, here is the query I came up with.
time curl -XGET http://localhost:9200/metrics_alias/metrics/_search\?pretty\&routing\=12345 -d '{
"size": 0,
"query": {
"term": {
"tenantId": "12345"
}
},
"aggs": {
"metric_name_tokens": {
"terms": {
"field" : "metric_name",
"include": "a[.]b[.][^.]*",
"execution_hint": "map",
"size": 0
}
}
}
}'
This would result in the following buckets. I parse the output and grab [c, m] from there.
"buckets" : [ {
"key" : "a.b.c",
"doc_count" : 2
}, {
"key" : "a.b.m",
"doc_count" : 1
} ]
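The parsing step mentioned above could look like the following. It is a sketch that assumes the buckets have already been read from the response into a list of dicts; `next_level_tokens` is a hypothetical helper.

```python
def next_level_tokens(buckets, prefix):
    """Extract the token that directly follows `prefix` from each bucket key."""
    head = prefix + "."
    tokens = {
        bucket["key"][len(head):].split(".", 1)[0]
        for bucket in buckets
        if bucket["key"].startswith(head)
    }
    return sorted(tokens)

buckets = [
    {"key": "a.b.c", "doc_count": 2},
    {"key": "a.b.m", "doc_count": 1},
]
print(next_level_tokens(buckets, "a.b"))  # ['c', 'm']
```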
So far so good. The query works great for most of the tenants (notice the tenantId term query above). For certain tenants which have large amounts of data (around 1 million), the performance is really slow. I am guessing the terms aggregation is what takes the time.
I am wondering if terms aggregation is the right choice for this kind of data and also looking for other possible kinds of queries.
Some suggestions:
"mirror" the filter at the aggregations level in the query part as well. So, for a.b. matching, use the following as a query and keep the same aggs section:
"bool": {
"must": [
{
"term": {
"tenantId": 123
}
},
{
"prefix": {
"metric_name": {
"value": "a.b."
}
}
}
]
}
or even use regexp with the same regular expression as in the aggregation part. This way, the aggregation has to evaluate fewer buckets, since fewer documents reach the aggregation part.
You mentioned that regexp is working better for you; my initial guess was that prefix would perform better.
change "size": 0 in the aggregation to "size": 100. After testing, you mentioned this doesn't make any difference.
remove "execution_hint": "map" and let Elasticsearch use the defaults. After testing, you mentioned that the default execution_hint performed far worse.
the only other thing I can think of is to relieve the pressure at search time by moving the work to indexing time. What I mean by that: at indexing time, in your own application or whatever indexing method you are using, split the text to be indexed programmatically (rather than having ES do it) and index each element of the hierarchy in a separate field of the same document. For example, a.b in field2, a.b.c in field3, and so on. Then, at search time, you look at specific fields depending on what the search text is. This whole idea, though, requires some additional work outside ES.
Of all the suggestions above, the first one had the greatest impact: query response times improved from 23 seconds to 11 seconds.
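The index-time splitting idea can be sketched in a few lines of application code. The field names follow the field2/field3 example in the suggestion; `hierarchy_fields` is a hypothetical helper and the exact naming is an assumption.

```python
def hierarchy_fields(metric_name, field_prefix="field"):
    """Map each depth of a dotted metric name to its own field: field1, field2, ..."""
    parts = metric_name.split(".")
    return {
        f"{field_prefix}{depth}": ".".join(parts[:depth])
        for depth in range(1, len(parts) + 1)
    }

doc = {"metric_name": "a.b.c", **hierarchy_fields("a.b.c")}
print(doc)
# {'metric_name': 'a.b.c', 'field1': 'a', 'field2': 'a.b', 'field3': 'a.b.c'}
```

With documents shaped like this, finding the level 2 tokens under a.b becomes a term filter on field2 plus a terms aggregation on field3, with no regular expression evaluated per term.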

Field norm calculation on ElasticSearch array fields

Here's the mapping for one of the fields in my index:
"resourceId": {
"type": "string",
"index_analyzer": "partial_match",
"search_analyzer": "lowercase",
"include_in_all": true
}
Here are the custom analyzers used in the index:
"analysis": {
"filter": {
"partial_match_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 50
}
},
"analyzer": {
"partial_match": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"partial_match_filter"
]
},
"lowercase": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
}
}
This field will contain an array of strings, which are the multiple IDs that a resource can have (it can have multiple IDs due to different systems calling each resource by a different id).
Now let's suppose that resource #1 has three IDs:
resourceId: [3]
0: "ID:MATCH"
1: "MATCH"
2: "ID:ALT"
And that resource #2 has only one ID:
resourceId: [1]
0: "ID:MATCHFIVE"
And let's suppose that we run this query against my index:
{
"from": 0,
"size": 30,
"query": {
"query_string": {
"query": "resourceId:ID\\:MATCH"
}
}
}
What I'd like is for resource #1 to show up first, since its array contains an exact match. However, resource #2 is the one coming out on top.
When I used the explain parameter on the query request, I saw that the tf and idf scores were the same for both resources. However, the norm score was lower for resource #1.
My theory is that since resource #1 has three items in the array (which I assume are concatenated together during indexing), the field is considered larger, and thus the norm value is decreased. When it comes to resource #2, it has only one item (and it's shorter than the concatenation of the other array), so the norm is higher, bumping the resource to the top.
My question, therefore, is: when calculating the score, is it possible for the norm calculation to only consider the size of the item that matched in the array?
For example: the search for "ID:MATCH" would find the exact match on resource #1 on resourceId[0]. At this point, all other items in the array would be put aside and the norm would be calculated based on that single item (resourceId[0]), showing a perfect match. As for resource #2, the norm would be lower, since the resourceId field would be larger.
If this isn't possible, would there be workarounds to get the exact match to the top? Or maybe I'm completely off on my theory?
