Setting up a daily partitioned index - elasticsearch

I'm looking to setup my index such that it is partitioned into daily sub-indices that I can adjust the individual settings of depending on the age of that index, i.e. >= 30 days old should be moved to slower hardware etc. I am aware I can do this with a lifecycle policy.
What I'm unable to join-the-dots on is how to setup the original index to be partitioned by day. When adding data/querying, do I need to specify the individual daily indicies or is there something in Elasticsearch that will do this for me? If the later, how does it work with adding/querying (assuming they are different?)...how does it determine the partitions that are relevant for the query/partition to add a document to? (I'm assuming there is a timestamp field - but I can't see from the docs how its all linked together)
I'm using the base Elasticsearch OSS v7.7.1 without any plugins installed.

there's no such thing as sub indices or partitions in Elasticsearch. if you want to use ilm, which you should, then you are using aliases and multiple indices
you will need to upgrade from 7.7 - which is EOL - and use the default distribution to get access to ilm as well
getting back to your conceptual questions, https://www.elastic.co/guide/en/elasticsearch/reference/current/overview-index-lifecycle-management.html and the following few chapters dive into it. but to your questions;
the major assumption of using ilm is that data being ingested is current, so on a rough level, data from today will end up in an index from today
if you are indexing historic data then you may want to put that into "traditional" index names, eg logs-2021.08.09 and then attach them to the ilm policy as per https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-with-existing-indices.html
when querying, Elasticsearch will handle accessing all the indices it needs based on the request it receives. it does this via https://www.elastic.co/guide/en/elasticsearch/reference/current/search-field-caps.html

Related

Is is more efficient to query multiple ElasticSearch indices at once or one big index

I have an ElasticSearch cluster and my system handles events coming from an API.
Each event is a document stored in an index and I create a new index per source (the company calling the API). Sources come and go, so I have new sources every week and most sources become inactive after a few weeks. Each source send between 100k and 10M new events every day.
Right now my indices are named api-events-sourcename
The documents contain a datetime field and most of my queries look like "fetch the data for that source between those dates.
I frequently use Kibana and I have configured a filter that matches all my indices (api-events-*) at once, and I then add terms to filter a specific source and specific days.
My requests can be slow at times and they tend to slow down the ingestion of new data.
Given that workflow, should I see any performance benefits to create an index per source and per day, instead of the index per source only that I use today ?
Are there other easy tricks to avoid putting to much strain on the cluster ?
Thanks!

Reasons & Consequences of putting a Date in Elastic Index Name

I am looking at sending my App logs to Elastic (6.x) via FileBeat and Logstash. As mentioned in Configure the Logstash output and recommended elsewhere, it seems that I need add the Date to the Index name. The reason for doing so was that when the time came to delete old data, it was easier to delete an entire Index by date, rather than individual documents. Is this true?
If I should be following this recommendation of adding the Date to the Index Name, I’m curious what additional things I need to do to ensure seamless querying? By this I mean querying esp. in Kibana, for e.g. over the past day which would need to look at today’s index as well as yesterday’s index.
Speaking of querying in Kibana, is there a way of simply working with the base index name without the date stamp i.e. setting it up so that I do not see or have to deal with the date named indexes?
Edit: Kamal raised a good point that I have not provided any information about my cluster and my needs. The following is what I'm working with:
What is your daily data creation/expected count
I'm not sure. I don't expect anything more than a GB of data day, and no more than a couple of 100K documents a day. Since these are logs, I don't expect any updates to the documents once they are created.
Growth rate of the data in the future (1 year - 5 years)
At the moment, I don't see the growth rate to cross a GB a day.
How many teams are using the same cluster apart from yours if there is
any
The cluster would be used (actually queried) by just my team. We are about 5 right now, but I don't see more than 10 users (and that's not concurrent, just over a day or month)
Usage patterns, type of queries used etc.
I'm not sure, but there certainly would not be updates to the data other than deletions
Hardware details
I've not worked this out with management. For most part I expect 3 nodes. Also this is not critical i.e. if we lose all of our logs for some reason, I would not lose sleep over it.
First of all you need to take a step back and understand do you really need multiple index or single one(where you need to filter documents while querying using a date field for a particular date).
Some of questions you must have before you take on such decision
What is your daily data creation/expected count
Growth rate of the data in the future (1 year - 5 years)
How many teams are using the same cluster apart from yours if there is any
Usage patterns, type of queries used etc.
Hardware details
Advantages
In a way, having multiple indexes(with date field as its index name) would be more beneficial.
You can delete the old indexes without affecting new ones.
In case if you have to change the mapping, you can do so with the new index without affecting the old ones. Comparatively less overhead while for single index, you have to reindex all the documents which would take lot more time if size is pretty huge. And if this keeps happening every now and then, you would need to come up with solution where you have to execute such operations at the times of minimal usages. That means, it can harm productivity.
searching using multiple indexes still is convenient.
not really sure but its easier for scaling using multiple indexes.
Disadvantages are:
Additional shards are created for each and every index that can waste some storage space.
Overhead to maintain multiple indexes by monitoring/operations team.
At times can lead to over-creation of indexes.
No mapping changes and less documents insertion(in 100s or few 100s), it'd be better to use single index.
The only way and the only correct way to figure out what's best is to have a cluster that closely resembles the production one with data too resembling to production, try various configurations and see which solution fits best.
Speaking of querying in Kibana, is there a way of simply working with
the base index name without the date stamp i.e. setting it up so that
I do not see or have to deal with the date named indexes?
Yes there is. If you have indexes with names like logs-0001, logs-0002, you can use logs-* as indexname when you query.
Including date in an index name is a very common use case implemened by many Elasticsearch users. It helps with archiving/ purging old indices as you mentioned. You dont need to do anything additionally to be able to query. Setup your index basename as an index pattern for your indices for ex. logstash-* and you can query on that particular index pattern in Kibana.

Elasticseach & Kibana - best practice for visualizing one year's logs

I am using ElasticSearch with Kibana to store and visualize data from my logs. I know it is customary to use Logstash, but I just use the elasticsearch Rest API and POST new elements to it.
I am trying to look for best practices in terms of how I should manage my indices, given I have about 50k logs per day, and I want to visualize sometimes weekly, sometimes monthly and sometimes yearly data. And also I have no need for more than one node. I don't need a high available cluster.
So I am basically trying to determine:
-How should I store my indexes, by time? Monthly? Weekly? One index for everything?
-What are the disadvantages of a huge index (one index that contains all my data)? Does it mean that the entire index is in memory?
Thank you.
I like to match indexes to the data retention policy. Daily indexes work very well for log files, so you can expire one day's worth after X days of retention.
The fewer indexes/shards you have, the less RAM is used in overhead by Elasticsearch to manage them.
The mapping for a field is frozen when the field is added to the index. With a daily index, I can update the mapping and have it take effect for the new indexes, and wait for the old ones to expire. With a longer-term indexes, you'd probably need to reindex the data, which I always try to avoid.
The settings for shards and replicas are also frozen when you create the index.
You can visualize them in Kibana regardless of how they're stored. Use the #timestamp field as your X-axis and change the "interval" to the period you want.
Using logstash would be important if you wanted to alter your logs at all. We do a lot of normalization and creation of new fields, so it's very helpful. If it's not a requirement for you, you might also look into filebeats, which can write directly to elasticsearch.
Lots to consider...

Is solr cloud applicable for use case where indexing is offline?

Solr cloud seems to be the suggested method to scale solr in future. I understand that legacy scaling methods (like master slave and replication) still exists. My use case with solr does not have to be near real time (NRT). It is fine if the newly indexed data is visible for searchers after about 1 day.
In the master slave (legacy scaling), I could replicate it once a day. In Solr cloud do i have an option like this?
Also i don't want the indexing to impact the searcher performance during index time. Is there a way to isolate the indexer from searcher shards in solr cloud?
You could skip SolrCloud and just index on a dedicate separate collection.
Then, you bring the new content to each machine individually and do a Core Swap.
Or similar thing using Aliases to point to the newest core/collection. Which also allows you to segment old content and new content into different collections and search them together.
I also used collection aliases in such cases. You can build your index once a day and when it is ready you simply change the alias. I'll give an example
At very begining you create index called: index_2014_12_01. This index is aliased by index_2014_12_01. The next day you build index_2014_12_02 and changing the alias now to point index_2014_12_02 instead of index_2014_12_01.

Couchbase XDCR Elasticsearch speed and deletions

We are thinking about implementing some sort of message cache which would hold onto the messages we send to our search index so we could persist while the index was down for an extended period of time (for example a complete re-index) then 're-apply' the messages. These messages are creations or updates of the documents we index. If space were cheap enough, with something as scalable as Couchbase we may even be able to hold all messages but I haven't done any sort of estimations of message size and quantity yet. Anyway, I suggested Couchbase + XDCR + Elasticsearch for this task as most of the work would be done automatically however there are 4 questions I have remaining:
If we were implementing this as a cache, I would not want Elasticsearch to remove any documents that were not in Couchbase, is this possible to do (perhaps it is even the default behaviour)?
Is it possible to apply some sort of versioning so that a document in the index is not over-written by an older version coming from Couchbase?
If I were to add a new field to the index, I might need to re-index from the actual document datasource then re-apply all the messages stored in Couchbase. I may have 100 million documents in Elasticsearch and say 500,000 documents in Couchbase that I want to re-apply to Elasticsearch? What would the speed be like.
Would I be able to apply any sort of logic in-between Couchbase and Elasticsearch?
Update:
So we store documents in an RDBMS as we need instant access to inserted docs plus some other stuff. We send limited versions of the document to a search engine via messages. If we want to add a field to the index we need to re-index the system from the RDBMS somehow. If we have this Couchbase message cache we could add the field to messages first, then switch off the indexing of old messages and re-index from the RDBMS. We could then switch back on the indexing of the messages and the entire 'queue' of messages would be indexed without having lost anything.
This system (if it worked) would remove the need for an MQ server, a message listener and make sure no documents were missing from the index.
The versioning would be necessary as we don't want to apply an 'update' to the index which actually contains a more recent document (not sure if this would ever happen now I think about it).
I appreciate it's probably not too great a job to implement points 1 and 4 by changing the Elasticsearch plugin code but I would like to confirm that the idea is reasonable first!
The Couchbase-Elasticsearch integration today should be seen as an indexing engine for Couchbase. This means the index is "managed/controlled" by the data that are in Couchbase.
The XDCR is used to sent "all the events" to Elasticsearch. This means the index is update/delete every time a document (stored in Couchbase) is created, modified or deleted.
So "all the documents" stored into a Couchbase bucket are indexed into Elasticsearch.
Let's answer your questions one by one, based on the current implementation of the Couchbase-Elasticsearch.
When a document is removed from Couchbase, the Elasticsearch index is update (entry removed).
Not sure to understand the question. How an "older" version could come from Couchbase? Anyway once again everytime the document that is stored into Couchbase is modified, the index in Elasticsearch is updated.
Not sure to understand where you want to add a new field? If this is into a document that is stored into Couchbase, when the document will be sent to Elasticsearch the index will be updated. But based on what I have said before : all document "stored" into Couchbase will be present in Elasticsearch index.
Not with the plugin as it is today, but as you know it is an open source project so you can either add some logic to it or even contribute your ideas to the project ( https://github.com/couchbaselabs/elasticsearch-transport-couchbase )
So let me ask you more questions:
- how do you inser the document into you application? (and where Couchbase? Elasticsearch?)
- what are the types of documents?
- what do you want to cache into Couchbase?

Resources