Elasticsearch monthly rolling index with custom routing

I am trying to figure out how to create a monthly rolling index with custom routing (multi-tenancy scenario), with these requirements:
WRITE flow: Each document has a timestamp and should be indexed into the backing index that matches that timestamp, not necessarily the latest index. Write requests also carry a custom routing key (e.g. customerId) so they hit a specific shard.
READ flow: Requests must be routed to all backing indexes. Each request specifies a custom routing key (e.g. customerId), and the results must be aggregated and returned.
Index creation: Rolling the index should be automated. Each index should use a custom routing key (e.g. customerId).
What options are available?

This very feature, called time-series data streams (TSDS), is coming in the upcoming ES 8.5 release.
The big difference between normal data streams and time-series data streams is that all backing indexes of a TSDS are sorted by timestamp, and every document is written to the backing index that covers the document's time frame, even if that backing index is not the current write index. This means that if your data source lags (even by a few hours), the data will still land in the right index. Also, all documents related to the same dimension (i.e. customerId in your case) will end up on the same shard.
Another difference is that the ID of each document is computed as a function of the timestamp and the dimension(s) contained in the document, which means there can only be a single occurrence of a given timestamp/dimension pair (i.e. no duplicates).
Technically, you can already achieve much the same with normal data streams. However, you lose the underlying optimizations related to storing docs on the same shard, and you cannot write documents to older backing indexes, since you can only index documents into the current write index.
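As a rough illustration of what such a setup might look like, the sketch below builds an index template body for a time-series data stream in Python. The settings index.mode and index.routing_path and the mapping flag time_series_dimension are taken from the Elasticsearch TSDS documentation as I understand it. The function name, the metrics-tenants-* pattern and the shard count are assumptions made up for this example, so check everything against the version you actually run.

```python
def build_tsds_template(pattern: str, dimension: str, shards: int = 3) -> dict:
    """Build an index template body for a time-series data stream.

    `pattern`, `dimension` and `shards` are illustrative inputs,
    not values taken from the question.
    """
    return {
        "index_patterns": [pattern],
        # Marks the template as a data stream template.
        "data_stream": {},
        "template": {
            "settings": {
                # Turns the data stream into a time-series data stream.
                "index.mode": "time_series",
                # Documents are routed to shards by their dimension values.
                "index.routing_path": [dimension],
                "index.number_of_shards": shards,
            },
            "mappings": {
                "properties": {
                    "@timestamp": {"type": "date"},
                    # The tenant key is declared as a dimension.
                    dimension: {
                        "type": "keyword",
                        "time_series_dimension": True,
                    },
                }
            },
        },
    }


if __name__ == "__main__":
    import json

    body = build_tsds_template("metrics-tenants-*", "customerId")
    print(json.dumps(body, indent=2))
```

The resulting dict would be sent as the body of a PUT _index_template/<name> request, after which documents indexed into a matching data stream should be placed by timestamp and dimension rather than by an explicit routing parameter.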

Related

Is it more efficient to query multiple Elasticsearch indices at once or one big index

I have an ElasticSearch cluster and my system handles events coming from an API.
Each event is a document stored in an index, and I create a new index per source (the company calling the API). Sources come and go, so I have new sources every week and most sources become inactive after a few weeks. Each source sends between 100k and 10M new events every day.
Right now my indices are named api-events-sourcename
The documents contain a datetime field and most of my queries look like "fetch the data for that source between those dates".
I frequently use Kibana and I have configured a filter that matches all my indices (api-events-*) at once, and I then add terms to filter a specific source and specific days.
My requests can be slow at times and they tend to slow down the ingestion of new data.
Given that workflow, would I see any performance benefit from creating an index per source and per day, instead of the index per source only that I use today?
Are there other easy tricks to avoid putting too much strain on the cluster?
Thanks!

ElasticSearch: querying most recent snapshot design

I'm trying to decide how to structure the data in ElasticSearch.
I have a system that is producing metrics on a daily basis. I would like to put those metrics into ES so I can do some advanced querying/sorting. I also only care about the most recent data that's in there. The system producing the data could also be late.
Currently I can think of two options:
I can have one index with a date field that contains the date the metric was created. I am unsure, however, how to write the query so that if multiple days' worth of data are in the index, I filter it down to just the most recent set.
I could also try and split the data up into different indexes (recent and past) and have some sort of process that migrates data from the recent index to the past index. I think the challenge with this would be having downtime where the data is being moved and/or added into the recent.
Thoughts?
A common approach to solving this problem with Elasticsearch is to store the data once in a form that allows historic querying, and again in a second form that allows querying the most recent data. For example, if your metric update looked like:
{
"type":"OperationsPerSecond",
"name":"Questions",
"value":10
}
Then it can be indexed into our current values index using a composite key constructed from the document (obviously, for this to work you'd need to be able to construct a composite key from your document!). For example, your identity for this document might be the type and name concatenated. You then leverage the upsert API to allow you to write your updates to the same document:
POST current_metrics/_update/OperationsPerSecond-Questions
{
"doc": {
"type":"OperationsPerSecond",
"name":"Questions",
"value":10
},
"doc_as_upsert": true
}
Every time you call this API with the same composite key it will update the existing document, rather than create a new document. This will give you an index that only contains a single record per metric you are monitoring, and you can query that index to get your most recent values.
To store your historic data, you change your primary key strategy. It would probably be most straightforward to use the index API and let Elasticsearch generate a primary key for you:
POST all_metrics/_doc/
{
"type":"OperationsPerSecond",
"name":"Questions",
"value":10
}
This API will create a new document for every request made to it. So as long as you have something in your data that you can use in an elastic range query, such as a field like createdDate with a value that looks like a date time, then you should be able to query historic data.
The main thing is: don't worry about duplicating your data for different purposes. Elasticsearch does a good job of compressing this stuff on disk and in memory. Storing data multiple times is called denormalization and is a pretty common technique in data warehousing and big data.
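The two writes described above can be sketched in Python as a function that prepares both requests for one metric. This is only an illustration under assumptions: the helper names (composite_key, build_requests) and the createdDate value passed in are hypothetical, and the update body uses the doc / doc_as_upsert form of the update API so that the first write creates the document and later writes overwrite it.

```python
def composite_key(metric: dict) -> str:
    """Identity of a metric: its type and name concatenated."""
    return f'{metric["type"]}-{metric["name"]}'


def build_requests(metric: dict, created_date: str) -> list:
    """Return (method, path, body) tuples for the two writes.

    The first tuple upserts into current_metrics under the composite key,
    the second appends a new document to all_metrics with a generated ID.
    """
    doc = {**metric, "createdDate": created_date}
    upsert = (
        "POST",
        f"current_metrics/_update/{composite_key(metric)}",
        {"doc": doc, "doc_as_upsert": True},
    )
    history = ("POST", "all_metrics/_doc/", doc)
    return [upsert, history]


if __name__ == "__main__":
    metric = {"type": "OperationsPerSecond", "name": "Questions", "value": 10}
    for method, path, body in build_requests(metric, "2022-10-01T00:00:00Z"):
        print(method, path, body)
```

Sending both requests for every incoming metric keeps current_metrics at one document per metric while all_metrics grows by one document per update.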

Elasticsearch index with historical versions of documents

I have an Elasticsearch index continuously being updated and I'm creating a second index with the same mappings for doing offline analytics: I need to store changes for certain fields, in order to retrieve the values that were associated in specific time in the past. Therefore, in this second index I store multiple versions of the same document (same id but different _id fields).
My objective is to get ranked results for a given query and reference date. I've tried with aggregations but rather than modifying the hits fields you get a new aggregations one with unordered results.
Is there any way other than removing duplicates at the client side?
This is similar to, but different from, this previous question, as the proposed solution of just having a boolean current field only allows for removing duplicates when querying the present.

Updating existing documents in ElasticSearch (ES) while using rollover API

I have a data source which will create a high number of entries that I'm planning to store in ElasticSearch.
The source creates two entries for the same document in ElasticSearch:
the 'init' part which records init-time and other details under a random key in ES
the 'finish' part which contains the main data, and updates the initially created document (merges) in ES under the init's random key.
I will need to use time-based indexes in Elasticsearch, with an alias pointing to the current index, using the rollover API.
For updates I'll use the update API to merge init and finish.
Question: If the init document with the random key is not in the current index (but in an older one that has already been rolled over), would updating it using its key execute successfully? If not, what is the best practice to perform the update?
After some quietness I've set out to test it.
Short answer: After the index is rolled over under an alias, an update operation using the alias refers to the new index only, so it will create the document in the new index, resulting in two separate documents.
One way of solving it is to perform a search in the last 2 (or more if needed) indexes and figure out which non-alias index name to use for the update.
Another solution, which I prefer, is to avoid the rollover altogether: calculate the index name from the required date field of the document, and create new indexes from the application, using a template to define the mapping. This way, event sourcing and replaying the documents in order will yield the same indexes.
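A minimal sketch of computing the index name from the document's date field, in Python. The events prefix, the monthly yyyy.MM suffix and the choice to normalize to UTC (and to treat naive timestamps as UTC) are assumptions for this example, not something the setup above prescribes.

```python
from datetime import datetime, timedelta, timezone


def index_name_for(prefix: str, timestamp: datetime) -> str:
    """Derive a monthly index name such as 'events-2022.10' from a timestamp."""
    if timestamp.tzinfo is None:
        # Assumption: naive timestamps are already in UTC.
        timestamp = timestamp.replace(tzinfo=timezone.utc)
    ts = timestamp.astimezone(timezone.utc)
    return f"{prefix}-{ts:%Y.%m}"


if __name__ == "__main__":
    # 01:00 at UTC+2 on 1 November is still 31 October in UTC.
    late = datetime(2022, 11, 1, 1, 0, tzinfo=timezone(timedelta(hours=2)))
    print(index_name_for("events", late))
```

Because the name depends only on the document itself, both the 'init' write and the later 'finish' update resolve to the same index, no matter when they arrive.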

I am looking for storing particular field in particular shard in elasticsearch

With routing, we can place a particular file/doc/JSON on a particular shard, which makes it easy to extract data.
But I am wondering whether it would be possible to store a particular field of a JSON document on a particular shard.
For example:
I have three fields: username, message and time. I have created 3 shards for indexing.
Now I want username stored on one shard, message on another shard and time on another shard.
Thanks
No, this is not possible. The whole document (the JSON doc) will be stored on one shard. If you want to do what you describe, you should split the data up into separate docs, which you can then route differently.
As for the reasoning, imagine there was a username query which matched document5. If document5 was spread over many shards, all of them would have to be queried to get the other parts of document5 back and compile the results. Imagine further a complex AND query across different fields: there would be a lot of traffic (and waiting) just to find out whether both fields match and the document is a hit.
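To make the "split the data up into separate docs" suggestion concrete, here is a small Python sketch. All names in it (split_by_field, parent_id, and the idea of using the field name as the routing value) are hypothetical choices for illustration. Note also that distinct routing values are not guaranteed to land on distinct shards, because the shard is derived from a hash of the routing value.

```python
def split_by_field(doc_id: str, doc: dict) -> list:
    """Split one document into one small document per field.

    Each part carries its own routing value (here: the field name) and a
    parent_id so the parts can be joined again on the client side.
    """
    return [
        {
            "id": f"{doc_id}-{field}",
            "routing": field,
            "body": {"parent_id": doc_id, field: value},
        }
        for field, value in doc.items()
    ]


if __name__ == "__main__":
    original = {"username": "bob", "message": "hello", "time": "2020-01-01T00:00:00Z"}
    for part in split_by_field("5", original):
        print(part)
```

Each part would then be indexed with its routing value passed as the routing parameter of the index request, at the cost of having to reassemble the original document yourself.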

Resources