I have to store several million hotel rooms, with some requirements:
Each hotel gives the number of identical rooms available, per day
Prices can change daily; this data is only stored in Elasticsearch, not indexed
The index will only be used for search (not for monitoring), using the hotel's geolocation
Size: let's say about 50k hotels, 10 rooms each, 1 year+ of availability => ~200 million entries
So we have to manage availability at a daily level.
Each time a room is booked through our application, the number of available rooms must be updated. We also store a "cache" from our partners (other hotel providers working worldwide), which we refresh by requesting them at a regular interval.
I am pretty familiar with Elasticsearch, but I still hesitate between two mappings. I removed some fields (breakfast, amenities, smoking...) to keep it simple.
The first one: one document per room, each containing 365 children (one per day):
{
"mappings": {
"room": {
"properties": {
"room_code": {
"type": "keyword"
},
"hotel_id": {
"type": "keyword"
},
"isCancellable": {
"type": "boolean"
},
"location": {
"type": "geo_point"
},
"price_summary": {
"type": "keyword",
"index": false
}
}
},
"availability": {
"_parent": {
"type": "room"
},
"properties": {
"date": {
"type": "date",
"format": "date"
},
"number_available": {
"type": "integer"
},
"overwrite_price_summary": {
"type": "keyword",
"index": false
}
}
}
}
}
pros:
Updates/reindexing are isolated at the child level
Only one index
Adding future availabilities is easy (just add child documents to a room)
cons:
Queries will be a little slower, because of the join (looping over the availability children)
Children AND parents need to be returned, so the query has to include inner_hits (see the sketch after this list)
A lot of hotels create temporary rooms (for vacations, local events...) that are only available one month a year; this adds useless rooms to the index for the 11 remaining months
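For instance, a 3-night search for rooms near a point could look roughly like the sketch below (the index name rooms, the distance, the dates, and the idea of requiring one matching child per night via min_children are all assumptions, not tested against this exact mapping):
GET rooms/room/_search
{
  "query": {
    "bool": {
      "filter": [
        { "geo_distance": { "distance": "10km", "location": { "lat": 48.85, "lon": 2.35 } } },
        {
          "has_child": {
            "type": "availability",
            "query": {
              "bool": {
                "filter": [
                  { "range": { "date": { "gte": "2018-03-01", "lt": "2018-03-04" } } },
                  { "range": { "number_available": { "gt": 0 } } }
                ]
              }
            },
            "min_children": 3,
            "inner_hits": {}
          }
        }
      ]
    }
  }
}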
The second: I create one index per month (Jan, Feb...), using nested documents instead of children:
{
"mappings": {
"room": {
"properties": {
"room_code": {
"type": "keyword"
},
"hotel_id": {
"type": "keyword"
},
"isCancellable": {
"type": "boolean"
},
"location": {
"type": "geo_point"
},
"price_summary": {
"type": "keyword",
"index": false
},
"availability": {
"type": "nested"
}
}
},
"availability": {
"properties": {
"day_of_month": {
"type": "integer"
},
"number_available": {
"type": "integer"
},
"overwrite_price_summary": {
"type": "keyword",
"index": false
}
}
}
}
}
pros:
Faster: no join, and smaller indices
Resolves the issue of temporary rooms, thanks to the 12 monthly indices
cons:
Updates: booking a room for one night reindexes the whole room document (of the matching month)
If a customer is looking for a room with a check-in on March 31st, for example, we have to query two indices, March and April (see the sketch after this list)
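For a stay within a single month, the query could look roughly like this (the index name rooms_2018_03 is an assumption; the cross-month case would add the April index to a comma-separated index list and split the day conditions per index):
GET rooms_2018_03/_search
{
  "query": {
    "bool": {
      "filter": [
        { "geo_distance": { "distance": "10km", "location": { "lat": 48.85, "lon": 2.35 } } },
        {
          "nested": {
            "path": "availability",
            "query": {
              "bool": {
                "filter": [
                  { "term": { "availability.day_of_month": 31 } },
                  { "range": { "availability.number_available": { "gt": 0 } } }
                ]
              }
            }
          }
        }
      ]
    }
  }
}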
For the search/query, the second option is better in theory.
The main problem is the updates of the rooms:
According to my production data, about 30 million daily availabilities change every 24 hours.
I also have to read/compare (and update if needed) the cache from our partners: about 130 million reads, with roughly one update per 10 reads, every 12 hours on average. A sketch of what a single-room update costs in each mapping follows below.
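To make that cost concrete: with the parent/child mapping, decrementing availability touches only one small child document, whereas with the nested mapping a script has to walk the array and Elasticsearch reindexes the whole room document. A minimal sketch, assuming hypothetical document ids (av_123_2018-03-31, room_123); the scripts are Painless, and on 5.x releases before 5.6 the script body key is inline rather than source:
POST rooms/availability/av_123_2018-03-31/_update?parent=room_123
{
  "script": {
    "source": "ctx._source.number_available -= params.n",
    "params": { "n": 1 }
  }
}
POST rooms_2018_03/room/room_123/_update
{
  "script": {
    "source": "for (def a : ctx._source.availability) { if (a.day_of_month == params.d) { a.number_available -= params.n } }",
    "params": { "d": 31, "n": 1 }
  }
}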
I have 6 other indexed fields at the room level in my mappings; this is not a lot, so maybe a nested solution is OK...
So, which one is the best?
Note: I read How to store date range data in elastic search (aws) and search for a range?
But my case is a little different because of the daily information.
Any help/advice is welcome.
Related
We have the following index schema:
PUT index_data
{
"aliases": {},
"mappings": {
"properties": {
"Id": {
"type": "integer"
},
"title": {
"type": "text"
},
"data": {
"properties": {
"FieldName": {
"type": "text"
},
"FieldValue": {
"type": "text"
}
},
"type": "nested"
}
}
}
}
Id is a unique identifier. Here the data field is an array that could have 300,000 objects, or maybe more. Is it a sensible and correct way to index this kind of data? Or should we change our design and make the schema like the following:
PUT index_data
{
"aliases": {},
"mappings": {
"properties": {
"Id": {
"type": "integer"
},
"title": {
"type": "text"
},
"FieldName": {
"type": "text"
},
"FieldValue": {
"type": "text"
}
}
}
}
In this design, we can't use Id as the document id, because it would repeat: if we have 300,000 FieldName/FieldValue pairs for one Id, that Id would be repeated 300,000 times. The challenge is to generate a custom id using some mechanism, because we need to handle both insert and update cases (see the sketch below).
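One common mechanism is a deterministic composite id, for example Id plus FieldName, so that re-indexing the same pair overwrites it instead of creating a duplicate. A minimal sketch, assuming FieldName is unique within an Id; the # separator and the values are illustrative:
POST index_data/_bulk
{ "index": { "_id": "42#color" } }
{ "Id": 42, "title": "sample title", "FieldName": "color", "FieldValue": "red" }
{ "index": { "_id": "42#size" } }
{ "Id": 42, "title": "sample title", "FieldName": "size", "FieldValue": "large" }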
In the first approach, a single document would be very large, since it would contain an array of 300,000 objects or more.
In the second approach, we would have too many documents: 75370611530 is the number of FieldNames and FieldValues we currently have. How should we handle this kind of data? Which approach would be better? What should be the size of the shards in this index?
I noticed that the current mapping is not nested. I assume it would need to be, as the query seems to be "find value for key = key1".
If 300K objects are expected, it may not be a good idea: ES's soft limit is 10K nested objects per document. Indexing issues are going to give you trouble with this approach, in addition to possibly slow queries.
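For reference, that soft limit is the index.mapping.nested_objects.limit index setting (default 10000). A sketch of raising it at index creation, though doing so only trades the error for the indexing and query problems above:
PUT index_data
{
  "settings": {
    "index.mapping.nested_objects.limit": 300000
  }
}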
I doubt that indexing 75 billion documents for this purpose is useful, given the resources required, though it is feasible and will work.
Maybe consider an RDBMS?
I have a basic aggregation on an index with about 40 million documents.
{
aggs: {
countries: {
filter: {
bool: {
must: my_filters,
}
},
aggs: {
filteredCountries: {
terms: {
field: 'countryId',
min_doc_count: 1,
size: 15,
}
}
}
}
}
}
The index:
{
"settings": {
"number_of_shards": 5,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter",
"unique"
]
}
}
}
},
"mappings": {
"properties": {
"id": {
"type": "integer"
},
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
},
"countryId": {
"type": "short"
}
}
}
}
The search response time is 100ms, but the aggregation response time is about 1.5s, and it increases as we add more documents (it was about 200ms with 5 million documents). There are about 20 distinct countryId values right now.
What I have tried so far:
Allocating more RAM (from 4GB to 32GB): same results.
Changing the countryId field data type to keyword and adding the eager_global_ordinals option: it made things worse.
The Elasticsearch version is 7.8.0; Elasticsearch has 8GB of RAM, the server has 64GB of RAM and 16 CPUs, 5 shards, 1 node.
I use this aggregation to build filters for search results, so I need it to respond as fast as possible. For large result counts I don't need precision, so an approximate count, or even one capped at a threshold (e.g. "100+"), would be great.
Any ideas how to speed up this aggregation?
Reason for the slowness:
Bucket explosion is the reason.
As per the docs, you can optimize with breadth-first collect mode. The docs' example:
"Even though the number of actors may be comparatively small and we want only 50 result buckets, there is a combinatorial explosion of buckets during calculation - a single actor can produce n² buckets where n is the number of actors" (to find the 10 most popular actors and their 5 top co-actors).
I would also suggest setting an execution hint. Since you have very few unique values, set the hint to map.
Another optimization: if, say, some documents have not been accessed in the last few weeks, you can use a field from your filter to partition the aggregation onto that particular set of documents.
Yet another optimization: include/exclude only the countries you need, if your use case allows, via the filter values option of the terms aggregation. The sketch below combines the collect-mode and execution-hint suggestions.
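Applied to the aggregation from the question (my_filters is the question's placeholder), both options are documented parameters of the terms aggregation, and would look roughly like this:
{
  "aggs": {
    "countries": {
      "filter": { "bool": { "must": my_filters } },
      "aggs": {
        "filteredCountries": {
          "terms": {
            "field": "countryId",
            "size": 15,
            "collect_mode": "breadth_first",
            "execution_hint": "map"
          }
        }
      }
    }
  }
}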
I have a very large volume of documents in Elasticsearch (5.5) which hold recorded data at regular time intervals, let's say every 3 seconds:
{
"#timestamp": "2015-10-14T12:45:00Z",
"channel1": 24.4
},
{
"#timestamp": "2015-10-14T12:48:00Z",
"channel1": 25.5
},
{
"#timestamp": "2015-10-14T12:51:00Z",
"channel1": 26.6
}
Let's say that I need to get results back for a query that asks for the point value every 5 seconds. An interference pattern arises where sometimes there will be an exact match (for simplicity's sake, let's say in the example above that 12:45 is the only sample to land on a multiple of five).
At those times, I want Elasticsearch to give me the exact value recorded at that time, if there is one. So at 12:45 there is a match, and it returns the value 24.4.
In the other cases, I require the last (previously recorded) value. So at 12:50, having no data at that precise time, it would return the value at 12:48 (25.5), being the last known value.
Previously I have used aggregations, but in this case that doesn't help, because I don't want some average made from a bucket of data; I need either an exact value for an exact time match, or the previous value if there is no match.
I could do this programmatically, but performance is a real issue here, so I need the most performant method possible to retrieve the data in the way stated. Returning ALL the data from Elasticsearch, iterating over the results, and checking for a match at each time interval (else keeping the item at index i-1) sounds slow, and I wonder whether it is the best way.
Perhaps I am missing a trick with Elasticsearch. Perhaps somebody knows a method to do exactly what I am after?! It would be much appreciated...
The mapping is like so:
"mappings": {
"sampleData": {
"dynamic": "true",
"dynamic_templates": [{
"pv_values_template": {
"match": "GroupId", "mapping": { "doc_values": true, "store": false, "type": "keyword" }
}
}],
"properties": {
"#timestamp": { "type": "date" },
"channel1": { "type": "float" },
"channel2": { "type": "float" },
"item": { "type": "object" },
"keys": { "properties": { "count": { "type": "integer" }}},
"values": { "properties": { "count": { "type": "integer" }}}
}
}
}
and the (NEST) method being called looks like so:
channelAggregation => channelAggregation.DateHistogram("HistogramFilter", histogram => histogram
.Field(dataRecord => dataRecord["#timestamp"])
.Interval(interval)
.MinimumDocumentCount(0)
.ExtendedBounds(start, end)
.Aggregations(aggregation => DataFieldAggregation(channelNames, aggregation)));
@Nikolay there may be up to around 1400 buckets (a maximum of one value to be returned per pixel available on the chart)
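One possible direction (a sketch, not a verified solution; the index name sampledata is assumed): a date_histogram with min_doc_count: 0 plus a top_hits sub-aggregation sorted descending by timestamp returns the latest sample within each bucket, so the i-1 carry-forward for empty buckets only has to run client-side over at most ~1400 buckets instead of over every raw document:
GET sampledata/_search
{
  "size": 0,
  "aggs": {
    "per_interval": {
      "date_histogram": {
        "field": "#timestamp",
        "interval": "5s",
        "min_doc_count": 0,
        "extended_bounds": { "min": "2015-10-14T12:45:00Z", "max": "2015-10-14T13:00:00Z" }
      },
      "aggs": {
        "last_sample": {
          "top_hits": {
            "size": 1,
            "sort": [ { "#timestamp": { "order": "desc" } } ],
            "_source": [ "#timestamp", "channel1" ]
          }
        }
      }
    }
  }
}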
Suppose you have an index with documents describing vehicles.
Your index needs to deal with two different types of vehicles: motorcycle and car.
Which of the following mappings is better from a performance point of view?
(nested is required for my purposes)
"vehicle": {
"type": "nested",
"properties": {
"car": {
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
}
}
},
"motorcycle": {
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
}
}
}
}
}
or this one:
"vehicle": {
"type": "nested",
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
},
"vehicle_type": {
"type": "string" ### "car", "motorcycle"
}
}
}
The second one is more readable and compact.
But the drawback is that when I build my queries, if I want to focus only on "car", I need to add this condition to the query.
If I use the first mapping, I just access the stored field directly, without adding overhead to the query.
The first mapping, where cars and motorcycles are isolated in different fields, is likely to be faster. The reason is that you have one less filter to apply, as you already know, and the queries have increased selectivity (e.g. fewer documents for a given value of vehicle.car.model than for vehicle.model).
Another option would be to create two distinct indices, car and motorcycle, possibly from the same index template.
In Elasticsearch, a query is processed by a single thread per shard. That means that if you split your index in two and query both in a single request, the work will be executed in parallel (see the sketch below).
So, when you need to query only one of cars or motorcycles, it's faster simply because the indices are smaller. And when it comes to querying both cars and motorcycles, it could also be faster by using more threads.
EDIT: one drawback of the latter option you should know about: the inner Lucene dictionary will be duplicated, and if the values in cars and motorcycles are largely identical, it doubles the list of indexed terms.
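Querying both indices in one request is just a comma-separated index list (index and field names assumed):
GET car,motorcycle/_search
{
  "query": {
    "match": { "model": "roadster" }
  }
}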
I am creating a hotel booking system using Elasticsearch and am trying to find a way to return hotels that have a variable number of sequential dates available (for example, 7 days) across a range of dates.
I am currently storing dates and prices as child documents of the hotel, but I am unsure how to undertake the search, or whether it is even possible with my current setup.
Edit: added mappings
Hotel Mapping
{
"hotel":{
"properties":{
"city":{
"type":"string"
},
"hotelid":{
"type":"long"
},
"lat":{
"type":"double"
},
"long":{
"type":"double"
},
"name":{
"type":"multi_field",
"fields":{
"name":{
"type":"string"
},
"name.exact":{
"type":"string",
"index":"not_analyzed",
"omit_norms":true,
"index_options":"docs",
"include_in_all":false
}
}
},
"star":{
"type":"double"
}
}
}
}
Date Mapping
{
"dates": {
"_parent": {
"type": "hotel"
},
"_routing": {
"required": true
},
"properties": {
"date": {
"type": "date",
"format": "dateOptionalTime"
},
"price": {
"type": "double"
}
}
}
}
I am currently using a date range to select the available dates and then a field query to match the city - the other fields will be used later
I had to do something similar, and I ended up storing, at indexing time, every day the hotel (room) was booked (so a list like [2014-02-15, 2014-02-16, 2014-02-17, etc.]).
After that it was fairly trivial to write a query to find all hotel rooms that were free during a certain date range (a sketch follows below).
It still seems like there should be a more elegant solution, but this ended up working great for me.
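A minimal sketch of that query, assuming the booked days are stored in a date-array field called booked_days: a room is free for the stay if none of its booked days falls inside the requested range.
GET rooms/_search
{
  "query": {
    "bool": {
      "must_not": {
        "range": { "booked_days": { "gte": "2014-02-15", "lt": "2014-02-22" } }
      }
    }
  }
}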
In my project we have a limited set of stay lengths, so I created an object in the mapping and specified the available booking dates for each stay length (a document sketch follows the list), like:
list of available booking dates for 2 days stay
list of available booking dates for 3 days stay
....
list of available booking dates for 10 days stay
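A sketch of what such a document could look like (field names are illustrative): each list holds the valid check-in dates for that stay length, so a search for an n-day stay becomes a simple term match on the corresponding field.
{
  "hotelid": 123,
  "stays": {
    "2_days": ["2014-02-15", "2014-02-16", "2014-02-20"],
    "3_days": ["2014-02-15", "2014-02-20"],
    "10_days": ["2014-02-15"]
  }
}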