Elasticsearch ngram tokenizer performance for UUID

I would like to do partial filtering on UUID, reference_id, and postal_code. For reference_id and postal_code, I know that they will be shorter than 36 characters, but UUIDs are 36 chars long. I'm thinking of setting up an ngram tokenizer with:
min ngram 1
max ngram 36
Will this get really bad over time in terms of speed and memory? Is there a better way to partially search a UUID?
For example, I have 7e222584-0818-49b0-875b-2774f4bf939b and I want to be able to find it by searching for 9b0.

Yes, that will create an awful lot of tokens, actually 36 + 35 + 34 + 33 + ... + 1 = (1 + 36) * (36 / 2) = 666 tokens for each UUID, and that's discouraged. Even when creating an ngram token filter, the default accepted distance between min and max is 1, so you'd have to override that in the index settings, which gives you a first indication that it might not be the right thing to do.
You might want to give the new wildcard field type a try, as it might do a better job.
You can easily compare both approaches by creating two indexes, indexing the same (but substantial) number of UUIDs in both, and then comparing their sizes.
First index with ngrams:
PUT uuid1
{
  "settings": {
    "index.max_ngram_diff": 36,
    "analysis": {
      "analyzer": {
        "uuid": {
          "tokenizer": "keyword",
          "filter": [
            "ngram"
          ]
        }
      },
      "filter": {
        "ngram": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 36
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "uuid": {
        "type": "text",
        "analyzer": "uuid",
        "search_analyzer": "standard"
      }
    }
  }
}
Second index with wildcard:
PUT uuid2
{
  "mappings": {
    "properties": {
      "uuid": {
        "type": "wildcard"
      }
    }
  }
}
Then you index the same data in both:
POST _bulk
{ "index": {"_index": "uuid1"}}
{ "uuid": "7e222584-0818-49b0-875b-2774f4bf939b"}
{ "index": {"_index": "uuid2"}}
{ "uuid": "7e222584-0818-49b0-875b-2774f4bf939b"}
And finally you can compare their sizes, and you can see that the uuid1 index is bigger than the uuid2 index. Here it is by a factor of 3, but you might want to index a bit more data to figure out a better ratio:
GET _cat/shards/uuid*?v

index shard prirep state   docs store  ip         node
uuid1 0     p      STARTED 1    10.6kb 10.0.33.86 instance-0000000062
uuid2 0     p      STARTED 1    3.5kb  10.0.12.26 instance-0000000042
Searching on the second index leveraging the wildcard field can be done very easily like this, so it's as simple as the match query you'd run on the index with ngrams:
POST uuid2/_search
{
  "query": {
    "wildcard": {
      "uuid": "*9b0*"
    }
  }
}
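For reference, the equivalent search on the ngram-based uuid1 index is a plain match query (a quick sketch based on the mapping above: the field is indexed with ngrams but searched with the standard analyzer, so the partial string is matched against the stored ngram tokens):
POST uuid1/_search
{
  "query": {
    "match": {
      "uuid": "9b0"
    }
  }
}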

Related

Elasticsearch became case-sensitive after adding a synonym analyzer

After I added a synonym analyzer to my_index, the index became case-sensitive.
I have one property called nationality that has a synonym analyzer. It seems that this property became case-sensitive because of the synonym analyzer.
Here is my /my_index/_mappings
{
  "my_index": {
    "mappings": {
      "items": {
        "properties": {
          ...
          "nationality": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            },
            "analyzer": "synonym"
          },
          ...
        }
      }
    }
  }
}
Inside the index, I have the phrase India COUNTRY. When I search for India nation using the command below, I get the result:
POST /my_index/_search
{
  "query": {
    "match": {
      "nationality": "India nation"
    }
  }
}
But when I search for india (notice the letter i is lowercase), I get nothing.
My assumption is that this happened because I put an uppercase filter before the synonym filter. I did this because the synonyms are uppercased, so the query India becomes INDIA after passing through this filter.
Here is my /my_index/_settings
{
  "my_index": {
    "settings": {
      "index": {
        "number_of_shards": "1",
        "provided_name": "my_index",
        "similarity": {
          "default": {
            "type": "BM25",
            "b": "0.9",
            "k1": "1.8"
          }
        },
        "creation_date": "1647924292297",
        "analysis": {
          "filter": {
            "synonym": {
              "type": "synonym",
              "lenient": "true",
              "synonyms": [
                "NATION, COUNTRY, FLAG"
              ]
            }
          },
          "analyzer": {
            "synonym": {
              "filter": [
                "uppercase",
                "synonym"
              ],
              "tokenizer": "whitespace"
            }
          }
        },
        "number_of_replicas": "1",
        "version": {
          "created": "6080099"
        }
      }
    }
  }
}
Is there a way to make this property case-insensitive? All the solutions I've found only show that I should set all the text inside nationality to be either lowercase or uppercase. But what if I have both uppercase and lowercase letters inside the index?
Did you apply the synonym filter after adding your data to the index?
If so, the phrase "India COUNTRY" was probably indexed exactly as "India COUNTRY". When you sent a match query to the index, your query was analyzed and sent as "INDIA COUNTRY" because you now have the uppercase filter. It matched because with a match query it is enough to match one of the words, and the word "COUNTRY" provides this.
But when you sent the one-word query "india", it was analyzed and converted to "INDIA" because of your uppercase filter, and there is no matching word in your index. You just have a document containing "India COUNTRY".
My answer involves a bit of assumption. I hope it is useful for understanding your problem.
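You can check this assumption with the _analyze API (a quick sketch using the index and analyzer names from the question). Note that it shows what the current analyzer emits, which can differ from the tokens stored before the analyzer was changed, which is exactly the mismatch suspected here:
GET my_index/_analyze
{
  "analyzer": "synonym",
  "text": "India COUNTRY"
}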
I have found the solution!
I didn't realize that the analyzer configured in the settings is applied both when indexing and when searching the data. At first, I did this:
Create the index with the synonym filter
Insert the data
Add the uppercase filter before the synonym filter
By doing that, the uppercase filter was not applied to my data. What I should have done is:
Create the index with the uppercase and synonym filters (pay attention to the order)
Insert the data
Then the filter is applied to my data.
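A minimal sketch of that creation order, reusing the filter and analyzer definitions from the question (settings only). Since existing documents are never re-analyzed, data indexed before the change still has to be reindexed, for example with the _reindex API:
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym": {
          "type": "synonym",
          "lenient": "true",
          "synonyms": ["NATION, COUNTRY, FLAG"]
        }
      },
      "analyzer": {
        "synonym": {
          "tokenizer": "whitespace",
          "filter": ["uppercase", "synonym"]
        }
      }
    }
  }
}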

Store hotel availabilities with daily information

I have to store some millions of hotel rooms with some requirements:
The hotel gives the number of identical rooms available, daily
The price can change daily; this data is only stored in ES, not indexed
The index will only be used for search (not for monitoring), using the hotel's geolocation
Size: let's say about 50k hotels, 10 rooms each, 1+ year of availability => ~200 million entries
So we have to manage things at a "daily" level.
Each time a room is booked in our application, the number of rooms has to be updated. We also store a "cache" from partners (other hotel providers) working worldwide, which we request at a regular interval to update.
I am pretty familiar with Elasticsearch, but I still hesitate between 2 mappings. I removed some fields (breakfast, amenities, smoking...) to keep it simple.
The first one: one document per room, each of which has 365 children (one per day):
{
  "mappings": {
    "room": {
      "properties": {
        "room_code": {
          "type": "keyword"
        },
        "hotel_id": {
          "type": "keyword"
        },
        "isCancellable": {
          "type": "boolean"
        },
        "location": {
          "type": "geo_point"
        },
        "price_summary": {
          "type": "keyword",
          "index": false
        }
      }
    },
    "availability": {
      "_parent": {
        "type": "room"
      },
      "properties": {
        "date": {
          "type": "date",
          "format": "date"
        },
        "number_available": {
          "type": "integer"
        },
        "overwrite_price_summary": {
          "type": "keyword",
          "index": false
        }
      }
    }
  }
}
pros:
Updates and reindexing are isolated at the child level
Only one index
Adding future availabilities is easy (just add child documents to a room)
cons:
Queries will be a little slower because of the join (looping over the availability children)
Children AND parents need to be returned, so the query has to include inner_hits
A lot of hotels create temporary rooms (for vacations, local events...) that are only available 1 month a year; this adds useless rooms to the index for the 11 remaining months
The second: I create one index per month (Jan, Feb...) using nested documents instead of children:
{
  "mappings": {
    "room": {
      "properties": {
        "room_code": {
          "type": "keyword"
        },
        "hotel_id": {
          "type": "keyword"
        },
        "isCancellable": {
          "type": "boolean"
        },
        "location": {
          "type": "geo_point"
        },
        "price_summary": {
          "type": "keyword",
          "index": false
        },
        "availability": {
          "type": "nested",
          "properties": {
            "day_of_month": {
              "type": "integer"
            },
            "number_available": {
              "type": "integer"
            },
            "overwrite_price_summary": {
              "type": "keyword",
              "index": false
            }
          }
        }
      }
    }
  }
}
pros:
Faster, no join, smaller indices
Resolves the issue of temporary rooms, thanks to the 12 monthly indices
cons:
Updates: booking a room for 1 night reindexes the whole room document (of the matching month)
If a customer is looking for a room with a check-in on the 31st of March, for example, we have to query 2 indices, March and April
For the search/query, the second option is better in theory.
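For illustration, a sketch of a one-night search against the second mapping (the index name, coordinates, and day are made up; since each nested document is a single day, a multi-night stay would need one nested clause per night):
POST rooms_2021_03/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "geo_distance": {
            "distance": "10km",
            "location": { "lat": 48.86, "lon": 2.35 }
          }
        },
        {
          "nested": {
            "path": "availability",
            "query": {
              "bool": {
                "filter": [
                  { "term": { "availability.day_of_month": 15 } },
                  { "range": { "availability.number_available": { "gt": 0 } } }
                ]
              }
            }
          }
        }
      ]
    }
  }
}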
The main problem is the updates of the rooms:
According to my production data, about 30 million daily availabilities change every 24 hours.
I also have to read/compare, and update if needed, the cache from the partners: about 130 million reads with possible updates (one update per 10 reads) every 12 hours on average.
I have 6 other indexed fields in my mapping at the room level; this is not a lot, so maybe the nested solution is OK...
So, which one is the best?
Note: I read this How to store date range data in elastic search (aws) and search for a range?
But my case is a little different because of the daily information.
Any help/advice is welcome.

Elasticsearch terms aggregation performance

I have a basic aggregation on an index with about 40 million documents.
{
  aggs: {
    countries: {
      filter: {
        bool: {
          must: my_filters,
        }
      },
      aggs: {
        filteredCountries: {
          terms: {
            field: 'countryId',
            min_doc_count: 1,
            size: 15,
          }
        }
      }
    }
  }
}
The index:
{
  "settings": {
    "number_of_shards": 5,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "integer"
      },
      "name": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      },
      "countryId": {
        "type": "short"
      }
    }
  }
}
The search response time is 100ms, but the aggregation response time is about 1.5s and increasing as we add more documents (it was about 200ms with 5 million documents). There are about 20 distinct countryId values right now.
What I tried so far:
Allocating more RAM (from 4GB to 32GB): same results.
Changing the countryId field data type to keyword and adding the eager_global_ordinals option: it made things worse.
The Elasticsearch version is 7.8.0; Elasticsearch has 8GB of RAM, the server has 64GB of RAM and 16 CPUs, 5 shards, 1 node.
I use this aggregation to put filters in search results, so I need it to respond as fast as possible. For a large number of results I don't need precision, so if it is approximate, or even just bounded (e.g. gte 100), that's great.
Any ideas how to speed up this aggregation?
Reason for the slowness:
Bucket explosion is the reason, and breadth-first collect mode would speed things up.
As per the docs, you can optimize further with breadth-first collect mode:
"Even though the number of actors may be comparatively small and we want only 50 result buckets there is a combinatorial explosion of buckets during calculation - a single actor can produce n² buckets where n is the number of actors" (the docs' example finds the 10 most popular actors and their 5 top co-actors).
I would also suggest you set an execution hint. Since you have very few unique values, I suggest you set the hint to map.
Another optimization: if, say, some documents have not been accessed in the last few weeks, you can use a field from your filter to partition the aggregation to a particular set of documents.
Another optimization: you could use include/exclude to filter which countries are needed, if possible in your use case.
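Putting the collect mode and execution hint together, the aggregation from the question would look roughly like this (a sketch; note that the execution hint only applies to keyword-like fields, so it assumes countryId is mapped as keyword rather than short):
{
  aggs: {
    countries: {
      filter: {
        bool: {
          must: my_filters,
        }
      },
      aggs: {
        filteredCountries: {
          terms: {
            field: 'countryId',
            size: 15,
            execution_hint: 'map',
            collect_mode: 'breadth_first',
          }
        }
      }
    }
  }
}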

Elasticsearch (5.5) query or algorithm required to extract values against timestamp with an interference pattern

I have a very large volume of documents in Elasticsearch (5.5) which hold recorded data at regular time intervals, let's say every 3 minutes:
{
  "#timestamp": "2015-10-14T12:45:00Z",
  "channel1": 24.4
},
{
  "#timestamp": "2015-10-14T12:48:00Z",
  "channel1": 25.5
},
{
  "#timestamp": "2015-10-14T12:51:00Z",
  "channel1": 26.6
}
Let's say that I need to get results back for a query that asks for the point value every 5 minutes. An interference pattern arises where sometimes there will be an exact match (for simplicity's sake, let's say in the example above that 12:45 is the only sample to land on a multiple of five).
At these times, I want Elasticsearch to provide me with the exact value recorded at that time if there is one. So at 12:45 there is a match, and it returns the value 24.4.
In the other cases, I require the last (previously recorded) value. So at 12:50, having no data at that precise time, it would return the value at 12:48 (25.5), being the last known value.
Previously I have used aggregations, but in this case that doesn't help, because I don't want some average made from a bucket of data; I need either an exact value for an exact time match, or the previous value if there is no match.
I could do this programmatically, but performance is a real issue here, so I need to come up with the most performant method possible to retrieve the data in the way stated. Returning ALL the data from Elasticsearch and iterating over the results, checking for a match at each time interval and otherwise keeping the item at index i-1, sounds slow, and I wonder if it isn't the best way.
Perhaps I am missing a trick with Elasticsearch. Perhaps somebody knows a method to do exactly what I am after?! It would be much appreciated...
The mapping is like so:
"mappings": {
"sampleData": {
"dynamic": "true",
"dynamic_templates": [{
"pv_values_template": {
"match": "GroupId", "mapping": { "doc_values": true, "store": false, "type": "keyword" }
}
}],
"properties": {
"#timestamp": { "type": "date" },
"channel1": { "type": "float" },
"channel2": { "type": "float" },
"item": { "type": "object" },
"keys": { "properties": { "count": { "type": "integer" }}},
"values": { "properties": { "count": { "type": "integer" }}}
}
}
}
and the (NEST) method being called looks like so:
channelAggregation => channelAggregation.DateHistogram("HistogramFilter", histogram => histogram
    .Field(dataRecord => dataRecord["#timestamp"])
    .Interval(interval)
    .MinimumDocumentCount(0)
    .ExtendedBounds(start, end)
    .Aggregations(aggregation => DataFieldAggregation(channelNames, aggregation)));
@Nikolay there may be up to around 1400 buckets (a maximum of one value to be returned per pixel available on the chart)
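One direction worth trying (a sketch, not from the thread): keep the date_histogram with min_doc_count 0 and add a top_hits sub-aggregation sorted by #timestamp descending with size 1, so each non-empty bucket returns the last sample recorded inside it. The client then only has to carry the previous bucket's value forward across empty buckets, instead of iterating over every raw document. The index name and bounds here are made up:
POST sample-data/_search
{
  "size": 0,
  "aggs": {
    "per_interval": {
      "date_histogram": {
        "field": "#timestamp",
        "interval": "5m",
        "min_doc_count": 0,
        "extended_bounds": { "min": "2015-10-14T12:45:00Z", "max": "2015-10-14T13:00:00Z" }
      },
      "aggs": {
        "last_sample": {
          "top_hits": {
            "size": 1,
            "sort": [{ "#timestamp": { "order": "desc" } }],
            "_source": ["#timestamp", "channel1"]
          }
        }
      }
    }
  }
}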

Terms aggregation (to achieve hierarchical faceting) query performance slow

I am indexing metric names in Elasticsearch. Metric names are of the form foo.bar.baz.aux. Here is the index I use:
{
  "index": {
    "analysis": {
      "analyzer": {
        "prefix-test-analyzer": {
          "filter": "dotted",
          "tokenizer": "prefix-test-tokenizer",
          "type": "custom"
        }
      },
      "filter": {
        "dotted": {
          "patterns": [
            "([^.]+)"
          ],
          "type": "pattern_capture"
        }
      },
      "tokenizer": {
        "prefix-test-tokenizer": {
          "delimiter": ".",
          "type": "path_hierarchy"
        }
      }
    }
  }
}
{
  "metrics": {
    "_routing": {
      "required": true
    },
    "properties": {
      "tenantId": {
        "type": "string",
        "index": "not_analyzed"
      },
      "unit": {
        "type": "string",
        "index": "not_analyzed"
      },
      "metric_name": {
        "index_analyzer": "prefix-test-analyzer",
        "search_analyzer": "keyword",
        "type": "string"
      }
    }
  }
}
The above index creates the following terms for a metric name foo.bar.baz
foo
bar
baz
foo.bar
foo.bar.baz
If I have a bunch of metrics, like the ones below:
a.b.c.d.e
a.b.c.d
a.b.m.n
x.y.z
I have to write a query to grab the nth level of tokens. In the example above
for level = 0, I should get [a, x]
for level = 1, with 'a' as first token I should get [b]
with 'x' as first token I should get [y]
for level = 2, with 'a.b' as first token I should get [c, m]
I couldn't think of any way other than to write a terms aggregation. To figure out the level-2 tokens of a.b, here is the query I came up with:
time curl -XGET http://localhost:9200/metrics_alias/metrics/_search\?pretty\&routing\=12345 -d '{
  "size": 0,
  "query": {
    "term": {
      "tenantId": "12345"
    }
  },
  "aggs": {
    "metric_name_tokens": {
      "terms": {
        "field": "metric_name",
        "include": "a[.]b[.][^.]*",
        "execution_hint": "map",
        "size": 0
      }
    }
  }
}'
This would result in the following buckets. I parse the output and grab [c, m] from there:
"buckets" : [ {
  "key" : "a.b.c",
  "doc_count" : 2
}, {
  "key" : "a.b.m",
  "doc_count" : 1
} ]
So far so good. The query works great for most of the tenants (notice the tenantId term query above). For certain tenants which have large amounts of data (around 1 million), the performance is really slow. I am guessing all the terms aggregation work takes time.
I am wondering if a terms aggregation is the right choice for this kind of data, and I am also looking for other possible kinds of queries.
Some suggestions:
"mirror" the filter from the aggregations level in the query part as well. So, for matching a.b., use the following as the query and keep the same aggs section:
"bool": {
"must": [
{
"term": {
"tenantId": 123
}
},
{
"prefix": {
"metric_name": {
"value": "a.b."
}
}
}
]
}
or even use a regexp query with the same regular expression as in the aggregation part. In this way, the aggregation will have to evaluate fewer buckets, as fewer documents reach the aggregation phase. You mentioned that regexp is working better for you; my initial guess was that the prefix query would perform better.
change "size": 0 in the aggregation to "size": 100. After testing, you mentioned this doesn't make any difference.
remove "execution_hint": "map" and let Elasticsearch use the defaults. After testing, you mentioned that the default execution_hint was performing far worse.
the only other thing I could think of is to relieve the pressure at search time by moving it to indexing time. What I mean by that: at indexing time, in your own application or whatever indexing method you are using, split the text to be indexed programmatically (not ES doing it) and index each element in the hierarchy in a separate field. For example, a.b in field2, a.b.c in field3 and so on, all in the same document. Then, at search time, you look at specific fields depending on what the search text is. This whole idea, though, requires some additional work outside ES (see the sketch below).
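A minimal sketch of that last idea, in the same string/not_analyzed style as the mapping above (the index and field names are made up): each hierarchy level is written to its own not_analyzed field at indexing time, and the level-2 children of a.b become a plain terms aggregation on level3 filtered by level2.
PUT metrics_flat
{
  "mappings": {
    "metrics": {
      "properties": {
        "tenantId": { "type": "string", "index": "not_analyzed" },
        "level1": { "type": "string", "index": "not_analyzed" },
        "level2": { "type": "string", "index": "not_analyzed" },
        "level3": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}
PUT metrics_flat/metrics/1?routing=12345
{ "tenantId": "12345", "level1": "a", "level2": "a.b", "level3": "a.b.c" }
POST metrics_flat/metrics/_search?routing=12345
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "term": { "tenantId": "12345" } },
        { "term": { "level2": "a.b" } }
      ]
    }
  },
  "aggs": {
    "children": {
      "terms": { "field": "level3" }
    }
  }
}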
From all the suggestions above, the first one had the greatest impact: query response times improved from 23 seconds to 11 seconds.
