I have around 250 GB of indexes on each of 3 hosts, i.e. 750 GB of data in the ELK cluster.
How can I rotate the ELK logs so that only three months of data are kept in the cluster, with older logs pushed somewhere else?
You could create your indices using an "indexname-%{+YYYY.MM}" naming format. This will create a distinct index every month.
You could then filter these indices by age using a tool like Curator.
Curator, scheduled via a cron job, can purge the older indexes or back them up to an S3 repository.
Reference - Backup or Restore using Curator
Moreover, you can restore these backed-up indexes directly from the S3 repository whenever they are needed for historical analysis.
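For reference, a snapshot repository has to be registered before Curator (or you) can snapshot to S3. A minimal sketch, assuming the repository-s3 plugin is installed and using placeholder bucket, repository, snapshot and index names:
PUT _snapshot/my_s3_repo
{
  "type": "s3",
  "settings": {
    "bucket": "my-elk-archive"
  }
}
Restoring an archived index from that repository later would then look roughly like:
POST _snapshot/my_s3_repo/snapshot_2019.01/_restore
{
  "indices": "indexname-2019.01"
}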
The answer by dexter_ is correct, but since it is old, a better answer today would be:
Version 7.x of the Elastic Stack provides Index Lifecycle Management (ILM) policies, which can be managed easily from the Kibana GUI and are native to the stack.
PS: you still have to name the indices like "indexname-%{+YYYY.MM}", as suggested by dexter_.
elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html
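As a minimal sketch (the policy name is a placeholder and 90d stands in for the three-month requirement), such an ILM policy can be created from Kibana Dev Tools like this:
PUT _ilm/policy/three-month-retention
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
The policy is then attached to new indices via an index template that sets index.lifecycle.name to the policy name.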
It took me a while to figure out the exact syntax and rules, so I'll post the final policy I used to remove old indexes (it's based on the example from https://aws.amazon.com/blogs/big-data/automating-index-state-management-for-amazon-opensearch-service-successor-to-amazon-elasticsearch-service/):
{
  "policy": {
    "description": "Removes old indexes",
    "default_state": "active",
    "states": [
      {
        "name": "active",
        "transitions": [
          {
            "state_name": "delete",
            "conditions": {
              "min_index_age": "14d"
            }
          }
        ]
      },
      {
        "name": "delete",
        "actions": [
          {
            "delete": {}
          }
        ],
        "transitions": []
      }
    ],
    "ism_template": {
      "index_patterns": [
        "mylogs-*"
      ]
    }
  }
}
It will automatically apply the policy to any new mylogs-* indexes, but you'll need to apply it manually to existing ones (under "Index Management" -> "Indices").
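If you prefer the API over the UI, the ISM plugin should also let you attach the policy to existing indices in one request (the policy ID is whatever name you saved the policy under; on older Open Distro clusters the prefix is _opendistro instead of _plugins):
POST _plugins/_ism/add/mylogs-*
{
  "policy_id": "<your-policy-id>"
}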
Is it possible to create conditional indexing by using ingest node pipelines? I feel this could be done with the script processor, but can someone tell me if this is possible?
I am in a scenario where I need to decide on the best way to do custom indexing. I can specify conditions in the metricbeat.yml / filebeat.yml files to get this done, but is that the best way? There is no Logstash in my Elastic Stack.
output.elasticsearch:
  indices:
    - index: "metricbeat-dev-%{[agent.version]}-%{+yyyy.MM.dd}"
      when.equals:
        kubernetes.namespace: "dev"
This is how I have implemented custom indexing in Metricbeat/Filebeat right now. I have 20+ namespaces in my Kubernetes cluster. Please help me decide whether this could be done with an ingest node pipeline or not.
Yes, you can achieve this with an ingest pipeline set processor. Ingest pipelines support access to metadata fields, so you can read and update the index name via the _index field.
Below is a sample ingest pipeline which will update the index name when the namespace is dev:
[
  {
    "set": {
      "field": "_index",
      "value": "metricbeat-dev",
      "if": "ctx.kubernetes?.namespace == 'dev'"
    }
  }
]
Update 1: append the agent version to the index name. I have assumed the agent version field name is agent.version.
[
  {
    "set": {
      "field": "_index",
      "value": "metricbeat-dev-{{agent.version}}",
      "if": "ctx.kubernetes?.namespace == 'dev'"
    }
  }
]
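To use this from Metricbeat/Filebeat without Logstash, the processor list has to be stored as a named pipeline and referenced from the Beats output. A rough sketch, with route-by-namespace as a made-up pipeline name:
PUT _ingest/pipeline/route-by-namespace
{
  "description": "Route events to per-namespace indices",
  "processors": [
    {
      "set": {
        "field": "_index",
        "value": "metricbeat-dev-{{agent.version}}",
        "if": "ctx.kubernetes?.namespace == 'dev'"
      }
    }
  ]
}
and in metricbeat.yml:
output.elasticsearch:
  pipeline: "route-by-namespace"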
I have the following tables, which have millions of records and change frequently. Is there a way to load that data into Elasticsearch (for eventual consistency) with Spring Boot, both initially and incrementally?
Tables :
Employee
Role
contactmethod (Phone/email/mobile)
channel
department
Status
Address
The document will look like this:
{
  "id": 1,
  "name": "tom john",
  "Contacts": [
    {
      "mobile": 123,
      "type": "MOBILE"
    },
    {
      "phone": 223333,
      "type": "PHONE"
    }
  ],
  "Address": [
    {
      "city": "New york",
      "ZIP": 12343,
      "type": "PERMANENT"
    },
    {
      "city": "New york",
      "ZIP": 12343,
      "type": "TEMPORARY"
    }
  ]
}
... similar data for the ROLE, DEPT, etc. tables
How do I make sure that, e.g., a change to "tom john"'s mobile number in the relational DB will be propagated to Elasticsearch?
You could have a background job in your application which pulls the data from the DB (you know when there is a change in the DB, of course) and, based on what you need (filtering, massaging), reindexes it into your Elasticsearch index.
Or you can use Logstash with the JDBC input to keep your data in sync; please refer to the Elastic blog on how to do it.
The first is flexible but not an out-of-the-box solution, while the second is out of the box. There are pros and cons to both approaches, so choose what fits your use case best.
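As a rough sketch of the Logstash option (connection string, credentials, table name and tracking column are all placeholders, and each nested entity would need its own query or a JOIN), a JDBC pipeline could look like:
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://localhost:3306/hr"
    jdbc_user => "user"
    jdbc_password => "pass"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    # run every minute and only fetch rows changed since the last run
    schedule => "* * * * *"
    statement => "SELECT * FROM employee WHERE updated_at > :sql_last_value"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "employee"
    # reuse the primary key so updates overwrite the existing document
    document_id => "%{id}"
  }
}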
Can I filter the documents in Elasticsearch before rolling them up, or can I define a filter query in the roll-up job? If yes, how?
There's no way to filter data before rolling it up into a new rolled up index. However, you can achieve what you want by first defining a filtered alias and then rolling up on that alias.
Say you want to roll up index test but only for customers 1, 2 and 3. You can create the following filtered alias:
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "test",
        "alias": "filtered-test",
        "filter": { "terms": { "customer.id": [1, 2, 3] } }
      }
    }
  ]
}
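Before creating the roll-up job, you can sanity-check that the alias exists and that its filter is applied (the count through the alias should only include customers 1, 2 and 3), for example:
GET _alias/filtered-test
GET filtered-test/_count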
And then you can roll up on the filtered-test alias instead of the test index and that will only roll up data from customers 1, 2 and 3:
PUT _rollup/job/sensor
{
  "index_pattern": "filtered-test",
  "rollup_index": "customer_rollup",
  ...
}
PS: It is worth noting that you're not alone; the Elastic folks specifically decided not to allow filtering in roll-ups for various reasons (you can read more in the issue I linked to). The issue has been reopened because there's a big refactoring of the roll-up feature going on. Stay tuned...
I have set a simple ILM policy on my fluentd.* indices so that they are deleted after (for testing) a short period of time.
ILM:
PUT _ilm/policy/fluentd
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_size": "1gb"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "delete": {
        "min_age": "4d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
Index Template:
PUT _template/fluentd
{
  "order": 0,
  "index_patterns": [
    "fluentd.*"
  ],
  "settings": {
    "index": {
      "lifecycle": {
        "name": "fluentd"
      }
    }
  },
  "aliases": {
    "fluent": {}
  }
}
With these settings, I expected ES to delete indices older than 5-6 days, but there are still indices from 3 weeks ago in ES. Currently, it says there are 108 indices linked to this ILM policy.
What is it actually doing? It seems it's not doing anything at all... How do I delete indices after x days?
I first tried to use the index template, but it seems useless: it does not apply the settings to each index (or maybe it does, but only on creation?).
Then I applied the ILM policy to the indices by hand (another annoyance: you can't select all indices and hit "add ILM policy"; you have to add the policy one by one), which required about 600 clicks.
The next problem was that I had a hot phase defined but it never triggered (is it buggy?). Because the hot phase didn't trigger (I set it to roll over 1 day after index creation), the delete phase didn't either. When I removed the hot phase and applied the ILM policy to the indices again with only the delete phase, it worked! But adding and removing all this is flaky; I get "Oops, something went wrong" errors here and there.
I don't understand why I have to remove the ILM policy and reapply it to each index when I change something in the policy. It's 1000% inconvenient.
ES really needs to put some work into this; it still feels too beta, and I got a lot of status code 500 errors, even though I am using the most recent version directly on Elastic Cloud.
With these settings, I expected ES to delete indices older than 5-6 days, but there are still indices from 3 weeks ago in ES. Currently, it says there are 108 indices linked to this ILM policy.
With your settings, the delete phase starts 4 days after rollover. If you want the delete phase to start 4 days after index creation, you need to remove the rollover action from the hot phase:
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "set_priority": {
            "priority": 100
          }
        }
      },
      "delete": {
        "min_age": "4d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
I first tried to use the index template, but it seems useless: it does not apply the settings to each index (or maybe it does, but only on creation?).
Yes, index templates only apply their settings at index creation time.
Then I applied the ILM policy to the indices by hand (another annoyance: you can't select all indices and hit "add ILM policy"; you have to add the policy one by one), which required about 600 clicks.
Kibana does not allow you to apply an ILM policy to all indices at once, but the Elasticsearch API does!
Simply open Kibana Dev Tools and run the following request:
PUT fluentd.*/_settings
{
  "index": {
    "lifecycle": {
      "name": "fluentd"
    }
  }
}
The next problem was that I had a hot phase defined but it never triggered (is it buggy?). Because the hot phase didn't trigger (I set it to roll over 1 day after index creation), the delete phase didn't either. When I removed the hot phase and applied the ILM policy to the indices again with only the delete phase, it worked! But adding and removing all this is flaky; I get "Oops, something went wrong" errors here and there.
If the rollover action is never triggered, ILM cannot progress past the hot phase.
I don't understand why I have to remove the ILM policy and reapply it to each index when I change something in the policy. It's 1000% inconvenient.
Because the ILM definition is cached on each index.
see the doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-index-lifecycle.html#ilm-phase-execution
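To see which phase and step each index is actually in (and why it might be stuck), the ILM explain API helps, e.g.:
GET fluentd.*/_ilm/explain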
A little bit late, but maybe it will help somebody.
Another possible reason is the one mentioned here:
ILM is not really intended to be used on a 1m lifecycle. I do not believe you will achieve your desired behavior. My understanding is that ILM is an opportunistic background task; it is not preemptive, so it is not going to execute on the exact time frame. It's designed to work on the order of hours or days, not minutes.
I had the same situation with my indices, and I checked: the indices do get deleted, just later than the time I had configured.
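For what it's worth, the interval at which ILM evaluates its conditions is controlled by the indices.lifecycle.poll_interval cluster setting (10 minutes by default), so phase transitions are never exact to the minute. For testing only, it can be lowered, e.g.:
PUT _cluster/settings
{
  "persistent": {
    "indices.lifecycle.poll_interval": "1m"
  }
}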
I want to back up Elasticsearch data in a different physical location.
At first I tried to put all the Elasticsearch nodes into one cluster, but when the program queries or updates Elasticsearch, a lot of data is transferred over the internet. That costs a lot of money in network traffic, and there is network delay.
Is there an easy way to sync data between two Elasticsearch clusters, so that only the changed data goes over the internet?
PS:
I don't care much about the sync delay; less than 1 minute is acceptable.
If you are running a recent version of Elasticsearch (5.0 or 5.2+), you need to have (or add) a date field such as updatedAt, and then on the destination cluster run a cron job every minute that issues a Reindex API request like this:
POST _reindex
{
  "source": {
    "remote": {
      "host": "http://sourcehost:9200",
      "username": "user",
      "password": "pass"
    },
    "index": "source",
    "query": {
      "range": {
        "updatedAt": {
          "gte": "2015-01-01 00:00:00"
        }
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}
More information on the range query used to select the changed documents is available here - https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-range-query.html
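Note that reindex-from-remote also requires the source host to be whitelisted in elasticsearch.yml on the destination cluster (the hostname below is a placeholder):
reindex.remote.whitelist: "sourcehost:9200"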
If you are using an older Elasticsearch (<5.0), then you can use the elasticdump tool (https://github.com/taskrabbit/elasticsearch-dump) to transfer the data, using a similar approach with the updatedAt field.