Best setup for live data in Elasticsearch

I am trying to use Elasticsearch for live data filtering. Right now I use a single machine that constantly receives new data (every 3 seconds via _bulk). Even though I set up a TTL, the index gets quite big after a day or so, and then Elasticsearch hangs. My current mapping:
curl -XPOST localhost:9200/live -d '{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"analyzer": {
"lowercase_keyword": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
},
"no_keyword": {
"type": "custom",
"tokenizer": "whitespace",
"filter": []
}
}
}
},
"mappings": {
"log": {
"_timestamp": {
"enabled": true,
"path": "datetime"
},
"_ttl":{
"enabled":true,
"default":"8h"
},
"properties": {
"url": {
"type": "string",
"search_analyzer": "lowercase_keyword",
"index_analyzer": "lowercase_keyword"
},
"q": {
"type": "string",
"search_analyzer": "no_keyword",
"index_analyzer": "no_keyword"
},
"datetime" : {
"type" : "date"
}
}
}
}
}'
I think the problem is purging the old documents, but I could be wrong. Any ideas on how to optimize my setup?

To keep Elasticsearch from hanging, you might want to increase the amount of memory available to the Java process.
If all your documents have the same 8 hour life span, it might be more efficient to use rolling aliases instead of TTL. The basic idea is to create a new index periodically (every hour, for example) and use aliases to keep track of the current indices. As time goes by, you update the list of indices in the alias that you search and simply delete indices that are more than 8 hours old. Deleting an index is much quicker than removing documents using TTL. Sample code that demonstrates how to create a rolling aliases setup can be found here.
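The rolling alias idea can be sketched roughly as follows. This is only a planning helper under stated assumptions: the alias name `live_search`, the `live-` index prefix, and the hourly naming pattern are hypothetical and not taken from the question. It builds a request body for `POST /_aliases` and lists the indices that have aged out; sending the requests is left to whatever client you use.

```python
from datetime import datetime, timedelta

ALIAS = "live_search"    # hypothetical alias that searches go through
PREFIX = "live-"         # hypothetical prefix of the hourly indices
RETENTION_HOURS = 8      # the 8 hour life span from the question


def index_name(ts):
    """Name of the hourly index that holds documents for timestamp ts."""
    return PREFIX + ts.strftime("%Y.%m.%d.%H")


def plan_rollover(now, existing):
    """Return (alias_body, expired) for the hour containing `now`.

    alias_body is a body for POST /_aliases that adds the alias to every
    existing index still inside the retention window. expired lists the
    hourly indices outside the window; deleting an index with
    DELETE /<index> also drops its aliases.
    """
    keep = {index_name(now - timedelta(hours=h)) for h in range(RETENTION_HOURS)}
    hourly = [name for name in existing if name.startswith(PREFIX)]
    expired = sorted(name for name in hourly if name not in keep)
    actions = [
        {"add": {"index": name, "alias": ALIAS}}
        for name in sorted(hourly)
        if name in keep
    ]
    return {"actions": actions}, expired
```

New documents would be written to `index_name(now)` directly, while searches go through the alias, so an expired hour disappears in one cheap index deletion instead of many per-document deletes.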
I am not quite sure how much live data you are trying to keep, but if you are just testing incoming data against a set of queries, you might also consider using the Percolate API instead of indexing the data.

Related

Store hotel availabilities with daily information

I have to store several million hotel rooms, with these requirements:
Each hotel reports the number of identical rooms available, per day
The price can change daily; this data is only stored in ES, not indexed
The index will only be used for search (not for monitoring), using the hotel's geolocation
Size: let's say about 50k hotels, 10 rooms each, 1 year+ of availability => about 200 million
So we have to manage availability at a daily level.
Each time a room is booked in our application, the number of rooms has to be updated. We also store a "cache" from partners (other hotel providers) working worldwide, and we query them at regular intervals to update our cache.
I am pretty familiar with Elasticsearch, but I still hesitate between 2 mappings. I removed some fields (breakfast, amenities, smoking...) to keep it simple:
The first one: 1 document per room, each with 365 children (one per day)
{
"mappings": {
"room": {
"properties": {
"room_code": {
"type": "keyword"
},
"hotel_id": {
"type": "keyword"
},
"isCancellable": {
"type": "boolean"
},
"location": {
"type": "geo_point"
},
"price_summary": {
"type": "keyword",
"index": false
}
}
},
"availability": {
"_parent": {
"type": "room"
},
"properties": {
"date": {
"type": "date",
"format": "date"
},
"number_available": {
"type": "integer"
},
"overwrite_price_summary": {
"type": "keyword",
"index": false
}
}
}
}
}
pros:
Updates and reindexing are isolated at the child level
Only one index
Adding future availabilities is easy (just add child documents to a room)
cons:
Queries will be a little slower because of the join (looping over the availability children)
Children and parents both need to be returned, so the query has to include inner_hits.
A lot of hotels create temporary rooms (for vacations, local events...) that are only available 1 month a year, for example; this adds useless rooms to the index for the remaining 11 months.
The second one: one index per month (Jan, Feb...), using nested documents instead of children.
{
"mappings": {
"room": {
"properties": {
"room_code": {
"type": "keyword"
},
"hotel_id": {
"type": "keyword"
},
"isCancellable": {
"type": "boolean"
},
"location": {
"type": "geo_point"
},
"price_summary": {
"type": "keyword",
"index": false
},
"availability": {
"type": "nested"
}
}
},
"availability": {
"properties": {
"day_of_month": {
"type": "integer"
},
"number_available": {
"type": "integer"
},
"overwrite_price_summary": {
"type": "keyword",
"index": false
}
}
}
}
}
pros:
Faster, no join, smaller indices
Solves the temporary room issue, thanks to the 12 monthly indices
cons:
Updates: booking a room for 1 night forces a reindex of the room document (in the matching month)
If a customer is looking for a room with a check-in on March 31st, for example, we have to query 2 indices, March and April
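To illustrate the last point, here is a minimal sketch of how an application could work out which monthly indices a stay touches. The `rooms-MM` naming is hypothetical; the question only says there is one index per month.

```python
from datetime import date, timedelta


def indices_for_stay(check_in, nights, prefix="rooms-"):
    """List the monthly indices covering every night of a stay.

    The prefix and the rooms-MM naming scheme are assumptions made
    for this example only.
    """
    names = []
    for offset in range(nights):
        night = check_in + timedelta(days=offset)
        name = f"{prefix}{night.month:02d}"
        if name not in names:
            names.append(name)
    return names


print(indices_for_stay(date(2024, 3, 31), 2))  # ['rooms-03', 'rooms-04']
```

The names can be joined with commas to search both indices in one request, but the March and April nights then live in different room documents, so the application would presumably still have to combine the per-month results itself.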
For search, the second option is better in theory.
The main problem is updating the rooms:
In production, about 30 million daily availabilities change every 24 hours.
I also have to read and compare the partner cache and update it if needed: about 130 million reads, with roughly one update per 10 reads, every 12 hours on average.
I have 6 other indexed fields at the room level in my mappings, which is not a lot, so maybe the nested solution is OK...
So, which one is best?
Note: I read "How to store date range data in elastic search (aws) and search for a range?"
But my case is a little different because of the daily information.
Any help or advice is welcome.

Elasticsearch terms aggregation performance

I have a basic aggregation on an index with about 40 million documents.
{
aggs: {
countries: {
filter: {
bool: {
must: my_filters,
}
},
aggs: {
filteredCountries: {
terms: {
field: 'countryId',
min_doc_count: 1,
size: 15,
}
}
}
}
}
}
The index:
{
"settings": {
"number_of_shards": 5,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter",
"unique"
]
}
}
}
},
"mappings": {
"properties": {
"id": {
"type": "integer"
},
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
},
"countryId": {
"type": "short"
}
}
}
}
The search response time is 100ms, but the aggregation response time is about 1.5s, and is increasing as we add more documents (was about 200ms with 5 million documents). There are about 20 distinct countryId right now.
What I tried so far:
Allocating more RAM (from 4GB to 32GB), same results.
Changing the countryId field type to keyword and adding the eager_global_ordinals option, which made things worse.
The Elasticsearch version is 7.8.0. Elasticsearch has 8GB of RAM; the server has 64GB of RAM and 16 CPUs, with 5 shards on 1 node.
I use this aggregation to build the filters shown with search results, so I need it to respond as fast as possible. For large result sets I don't need precision, so an approximate count, or even a capped one (e.g. "100 or more"), would be fine.
Any ideas on how to speed up this aggregation?
Reason for the slowness:
Bucket explosion is the likely reason, and the breadth_first collect mode could speed things up further (it applies when a terms aggregation has sub-aggregations).
The docs illustrate this with an aggregation that finds the 10 most popular actors and their 5 top co-stars:
Even though the number of actors may be comparatively small and we want only 50 result buckets there is a combinatorial explosion of buckets during calculation - a single actor can produce n² buckets where n is the number of actors.
I would also suggest setting an execution hint. Since you have very few unique values, I suggest setting the hint to map.
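As a rough sketch, the aggregation from the question with the suggested hint could be built like this. The helper name `country_facets` and the example filter are hypothetical. As far as I know, `execution_hint` is only honored for keyword fields and is silently ignored when it does not apply, so this assumes `countryId` is mapped as keyword.

```python
import json


def country_facets(filters, size=15):
    """Build the aggregation from the question with execution_hint set to map.

    `filters` is the list of query clauses the question calls my_filters.
    """
    return {
        "aggs": {
            "countries": {
                "filter": {"bool": {"must": filters}},
                "aggs": {
                    "filteredCountries": {
                        "terms": {
                            "field": "countryId",
                            "min_doc_count": 1,
                            "size": size,
                            # Build buckets from field values directly
                            # instead of going through global ordinals.
                            "execution_hint": "map",
                        }
                    }
                },
            }
        }
    }


# Hypothetical filter clause, only to show the resulting request body.
print(json.dumps(country_facets([{"term": {"active": True}}]), indent=2))
```

Whether this actually helps depends on how many documents pass the filter, so it is worth measuring before and after.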
Another optimization: if some documents have not been accessed in the last few weeks, you can use a field from your filter to restrict the aggregation to a particular subset of documents.
Another optimization is to use the include and exclude options to restrict the aggregation to the countries you need, if your use case allows it.

Auto Suggestions in Elastic Search after 3 letters

I've a search query which does basic search after a complete word is typed in. I'm looking for auto suggestions after 3 letters.
For Example,
Title- samsung galaxy s4
I want to see auto suggestions after "sam" instead of complete word "samsung".
While the ngram filter works, there is a dedicated suggester for this use case, called the completion suggester. It uses a different internal data structure that lets you execute suggestions in the millisecond range, which makes it much faster than a regular query using edge n-grams. Check out the documentation here:
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-suggesters-completion.html
You need to use an edgeNGram tokenizer (or token filter) for this.
{
"analysis": {
"tokenizer": {
"autocomplete_tokenizer": {
"type": "edgeNGram",
"min_gram": "3",
"max_gram": "20"
}
},
"analyzer": {
"autocomplete_edge_ngram": {
"filter": ["lowercase"],
"type": "custom",
"tokenizer": "autocomplete_tokenizer"
}
}
}
}
and mapping will be
{
"title_edge_ngram": {
"type": "text",
"analyzer": "autocomplete_edge_ngram",
"search_analyzer": "standard"
}
}
Or you can use the completion suggester in Elasticsearch.
In that case the three-character check has to be done on the client side.
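To see why min_gram 3 gives suggestions from the third letter on, here is a small simulation of what an edge n-gram tokenizer followed by a lowercase filter would emit. It is not Elasticsearch code, only a plain Python approximation, and it assumes the tokenizer's default behavior of treating the whole input as a single token (no token_chars configured).

```python
def edge_ngrams(text, min_gram=3, max_gram=20):
    """Approximate the prefixes an edge n-gram tokenizer would index.

    Assumes the whole input is one token and a lowercase filter is applied.
    """
    text = text.lower()
    longest = min(len(text), max_gram)
    return [text[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("Samsung"))
# ['sam', 'sams', 'samsu', 'samsun', 'samsung']
```

Because "sam" is one of the indexed terms, a query for "sam" analyzed with the standard search analyzer can match it, while "sa" matches nothing.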

Tokenize a big word into combination of words

Suppose Super Bowl is the value of a document's property in Elasticsearch. How can the term query superbowl match Super Bowl?
I read about the letter tokenizer and the word delimiter filter, but neither seems to solve my problem. Basically, I want to be able to split one large word into a meaningful combination of words.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-letter-tokenizer.html
I know this is quite late, but you could use a synonym filter.
You could define that super bowl is the same as "s bowl", "SuperBowl", etc.
There are ways to do this without changing what you actually index. For example, if you are on at least 5.2 (where normalizers were introduced; earlier versions can work too, but 5.x makes it easier), you can define a normalizer that lowercases the text without otherwise changing it, and then use a fuzzy query at search time to account for the space between super and bowl. This solution is specific to the example you have given, though. As is usually the case with Elasticsearch, you need to think about what kind of data goes in and what is required at search time.
In any case, if you are interested in an approach here it is:
DELETE test
PUT /test
{
"settings": {
"analysis": {
"normalizer": {
"my_normalizer": {
"type": "custom",
"char_filter": [],
"filter": ["lowercase", "asciifolding"]
}
}
}
},
"mappings": {
"test": {
"properties": {
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"normalizer": "my_normalizer"
}
}
}
}
}
}
}
POST test/test/1
{"title":"Super Bowl"}
GET /test/_search
{
"query": {
"fuzzy": {
"title.keyword": "superbowl"
}
}
}
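The reason the fuzzy query matches is that the normalized keyword "super bowl" is only one edit (an inserted space) away from "superbowl". A plain Levenshtein implementation shows this; it is a sketch for illustration, not what Elasticsearch runs internally.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance using two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb
            ))
        previous = current
    return previous[-1]


print(edit_distance("superbowl", "super bowl"))  # 1
```

One edit is, as far as I know, well within what the default fuzziness allows for a term of this length, which is also why the approach would not stretch to inputs that differ by several spaces.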

Can I have multiple filters in an Elasticsearch index's settings?

I want an Elasticsearch index that simply stores "names" of features. I want to be able to issue phonetic queries and also type-ahead style queries separately. I would think I would be able to create one index with two analyzers and two filters; each analyzer could use one of the filters. But I do not seem to be able to do this.
Here is the index settings json I'm trying to use:
{
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "ngram"]
}
},
"analyzer": {
"phonetic_analyzer": {
"tokenizer": "standard",
"filter": "double_metaphone_filter"
}
},
"filter": {
"double_metaphone_filter": {
"type": "phonetic",
"encoder": "double_metaphone"
}
},
"filter": {
"ngram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 15
}
}
}
}
}
When I attempt to create an index with these settings:
http://hostname:9200/index/type
I get an HTTP 400, saying
Custom Analyzer [phonetic_analyzer] failed to find filter under name [double_metaphone_filter]
Don't get me wrong, I fully realize what that sentence means. I looked and looked for an erroneous comma or quote but I don't see any. Otherwise, everything is there and formatted correctly.
If I delete the phonetic analyzer, the index is created but ONLY with the autocomplete analyzer and ngram filter.
If I delete the ngram filter, the index is created but ONLY with the phonetic analyzer and phonetic filter.
I have a feeling I'm missing a fundamental concept of ES, like only one analyzer per index, or one filter per index, or I must have some other logical dependencies set up correctly, etc. It sure would be nice to have a logical diagram or complete API spec of the Elasticsearch infrastructure, i.e. any index can have 1..n analyzers, only 1 filter, query must need any one of bool, match, etc. But that unicorn does not seem to exist.
I see tons of documentation, blog posts, etc on how to do each of these functionalities, but with only one analyzer and one filter on the index. I'd really like to do this dual functionality on one index (for reasons out of scope).
Can someone offer some help, advice here?
You are just missing the proper formatting for your settings object. You cannot have two analyzer or filter keys, as there can only be one value per key in this settings map object. Providing a list of your filters seems to work just fine. When you were creating your index object, the second key was overriding the first.
Look here:
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"double_metaphone_filter": {
"type": "phonetic",
"encoder": "double_metaphone"
},
"ngram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 15
}
},
"analyzer": {
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "ngram"]
},
"phonetic_analyzer": {
"tokenizer": "standard",
"filter": "double_metaphone_filter"
}
}
}
}
}
I downloaded the plugin to confirm this works.
You can now test this out at the _analyze endpoint with a payload:
{
"analyzer":"autocomplete_analyzer",
"text":"Jonnie Smythe"
}
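The overriding behavior is easy to reproduce outside Elasticsearch. The sketch below uses Python's json module, which also keeps only the last value for a repeated key; it is an illustration and not the parser Elasticsearch uses. The `load_strict` helper is a hypothetical name for a small check that rejects duplicate keys before the settings are sent.

```python
import json

# Trimmed-down version of the settings from the question:
# "analyzer" appears twice inside the same object.
RAW = """
{
  "analyzer": {"autocomplete_analyzer": {"tokenizer": "standard"}},
  "analyzer": {"phonetic_analyzer": {"tokenizer": "standard"}}
}
"""


def load_strict(text):
    """Parse JSON but raise ValueError if any object repeats a key."""
    def check(pairs):
        keys = [key for key, _ in pairs]
        repeated = sorted({key for key in keys if keys.count(key) > 1})
        if repeated:
            raise ValueError("duplicate keys: " + ", ".join(repeated))
        return dict(pairs)
    return json.loads(text, object_pairs_hook=check)


# The default parser silently keeps only the last "analyzer" object.
print(list(json.loads(RAW)["analyzer"]))  # ['phonetic_analyzer']

try:
    load_strict(RAW)
except ValueError as err:
    print(err)  # duplicate keys: analyzer
```

Running a check like this over a settings file before creating the index would have flagged both the duplicated analyzer key and the duplicated filter key.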
