Is it possible to organize data between elasticsearch shards based on stored data?

I want to build a data store with three nodes. The first one should keep all data, the second one the data of the last month, and the third the data of the last week. Is it possible to configure Elasticsearch so that shards automatically relocate between nodes to achieve this?

If you want to move existing shards from one node to another, you can use the _cluster/reroute API.
Combining this with automatic allocation can be risky, though: right after a shard has been moved to the target node, the cluster may try to rebalance itself and move shards again.
Alternatively, you can disable automatic allocation, in which case only your manual allocations apply. That can be hard to manage for a large data set.
POST /_cluster/reroute
{
  "commands" : [
    {
      "move" : {
        "index" : "test", "shard" : 0,
        "from_node" : "node1", "to_node" : "node2"
      }
    },
    {
      "allocate_replica" : {
        "index" : "test", "shard" : 1,
        "node" : "node3"
      }
    }
  ]
}
Source: the Elasticsearch Cluster Reroute documentation.
You should also read the documentation on customizing document routing.
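As a rough sketch, the reroute request body shown above could be assembled programmatically before being sent to POST /_cluster/reroute. The helper names (build_move_command, build_reroute_body) are made up for this illustration, and the index and node names are the placeholders from the answer. Nothing here contacts a cluster; it only builds and prints the JSON.

```python
import json


def build_move_command(index, shard, from_node, to_node):
    """Return one 'move' command for the _cluster/reroute request body."""
    return {
        "move": {
            "index": index,
            "shard": shard,
            "from_node": from_node,
            "to_node": to_node,
        }
    }


def build_reroute_body(commands):
    """Wrap a sequence of commands in the top-level 'commands' array."""
    return {"commands": list(commands)}


# Placeholder index and node names, taken from the answer above.
body = build_reroute_body([
    build_move_command("test", 0, "node1", "node2"),
])

# This JSON is what would be sent as the body of POST /_cluster/reroute.
print(json.dumps(body, indent=2))
```

Keeping the body construction separate from the HTTP call makes it easy to review which shards would move before anything is sent.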

Related

Remove ECS data from metricbeat for smaller documents

I use the graphite beat to get graphite protocol metrics into es.
The metric document is much bigger than the metric data itself (timestamp, value, metric name).
I also get all the ECS data inserted and I think it will make my queries much slower (and my documents much bigger) and I don't need this data.
Can I remove the ECS data somehow in the metricbeat configuration?

You might be able to use Metricbeat's drop_fields processor, but it might not remove all the fields you specify, as some are added after the processor chain has run.
Acting on the Elasticsearch side, on the other hand, guarantees that you can reshape the event source any way you like. Also, if you have many Beats deployed, you only need to configure this in a single place.
One way to achieve this is to create an index template for Metricbeat events and attach an ingest pipeline to it.
PUT _index_template/my-template
{
  "index_patterns" : [
    "metricbeat-*"
  ],
  "template" : {
    "settings" : {
      "index" : {
        "lifecycle" : {
          "name" : "metric-lifecycle"
        },
        "codec" : "best_compression",
        "default_pipeline" : "metric-pipeline"
      }
    },
    ...
Then the metric-pipeline would simply look like this and remove all the fields listed in the field array:
PUT _ingest/pipeline/metric-pipeline
{
  "processors": [
    {
      "remove": {
        "field": ["agent", "host", "..."]
      }
    }
  ]
}
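To get a feel for what such a pipeline does to an event, here is a small local sketch. build_remove_pipeline and preview_remove are hypothetical helpers, the sample event is invented, and the preview only approximates the remove processor by dropping top-level keys; it is not Elasticsearch's implementation. The ignore_missing option is an addition not present in the answer's pipeline; it keeps the processor from failing on events that lack one of the listed fields.

```python
def build_remove_pipeline(fields, ignore_missing=True):
    """Build an ingest pipeline body with a single 'remove' processor."""
    return {
        "processors": [
            {"remove": {"field": list(fields), "ignore_missing": ignore_missing}}
        ]
    }


def preview_remove(event, fields):
    """Local approximation only: return a copy of the event without the given top-level fields."""
    drop = set(fields)
    return {key: value for key, value in event.items() if key not in drop}


# Invented sample event, roughly shaped like a Metricbeat document.
sample_event = {
    "@timestamp": "2021-01-01T00:00:00Z",
    "agent": {"type": "metricbeat"},
    "host": {"name": "example-host"},
    "graphite": {"example": {"value": 1.5}},
}

pipeline_body = build_remove_pipeline(["agent", "host"])
slim_event = preview_remove(sample_event, ["agent", "host"])
print(sorted(slim_event))
```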

Conditional indexing in metricbeat using Ingest node pipeline creates a datastream

I am trying to achieve conditional indexing for namespaces in Elasticsearch using ingest node pipelines. I used the pipeline below, but when I add it to metricbeat.yml, what gets created is a data stream rather than a regular index.
PUT _ingest/pipeline/sample-pipeline
{
  "processors": [
    {
      "set": {
        "field": "_index",
        "copy_from": "metricbeat-dev",
        "if": "ctx.kubernetes?.namespace==\"dev\"",
        "ignore_failure": true
      }
    }
  ]
}
The expected index name is metricbeat-dev, but the value I get in _index is .ds-metricbeat-dev.
This works fine when I test with one document, but when I configure it in the yml file the index name starts with .ds-. Why is this happening?
Update with the template:
{
  "metricbeat" : {
    "order" : 1,
    "index_patterns" : [
      "metricbeat-*"
    ],
    "settings" : {
      "index" : {
        "lifecycle" : {
          "name" : "metricbeat",
          "rollover_alias" : "metricbeat-metrics"
        },

If a matching index template has data streams enabled, writing to that index name can create a data stream instead of a regular index. In my experience this depends on how the template priority is configured: with no priority set, a legacy index is created, whereas a template with a priority higher than 100 creates a data stream (the legacy index has priority 100, so use a priority value above 100 if you do want a data stream).
If a data stream is created and you did not expect one, check whether a template with data streams enabled matches the index you are writing to. That was the reason in my case.
This is what I have observed after working with this for a few months.
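The check suggested above can be sketched as a small helper that scans the output of GET _index_template for templates that match an index name and have data streams enabled. The assumptions: the response has the shape {"index_templates": [{"name": ..., "index_template": {...}}]}, index patterns use only * wildcards (so Python's fnmatch is a close enough stand-in), and the function name and sample data are invented for illustration.

```python
from fnmatch import fnmatchcase


def data_stream_templates(response, index_name):
    """Return names of templates that match index_name and define a 'data_stream' section."""
    matches = []
    for entry in response.get("index_templates", []):
        template = entry.get("index_template", {})
        patterns = template.get("index_patterns", [])
        if isinstance(patterns, str):
            patterns = [patterns]
        has_data_stream = "data_stream" in template
        if has_data_stream and any(fnmatchcase(index_name, p) for p in patterns):
            matches.append(entry.get("name"))
    return matches


# Invented sample, shaped like a GET _index_template response (an assumption).
sample_response = {
    "index_templates": [
        {
            "name": "metricbeat-ds",
            "index_template": {
                "index_patterns": ["metricbeat-*"],
                "data_stream": {},
                "priority": 200,
            },
        },
        {
            "name": "metricbeat-plain",
            "index_template": {"index_patterns": ["metricbeat-*"], "priority": 50},
        },
    ]
}

print(data_stream_templates(sample_response, "metricbeat-dev"))
```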

Autobalance the shards in ElasticSearch

We have 4 Elasticsearch nodes on version 5.6.9 that, because of some earlier rules, hold an unbalanced number of shards per node.
We have found that we can move one shard at a time to another node, but that is incredibly slow.
Apart from creating a script that uses the ElasticSearch API to balance the shards, is there another way?

You can do so using the Cluster Reroute API, which allows manual changes to the allocation of individual shards in the cluster. Check out the Cluster Reroute docs.
POST /_cluster/reroute
{
  "commands" : [
    {
      "move" : {
        "index" : "test", "shard" : 0,
        "from_node" : "node1", "to_node" : "node2"
      }
    },
    {
      "allocate_replica" : {
        "index" : "test", "shard" : 1,
        "node" : "node3"
      }
    }
  ]
}

We found the issue: the cluster was not rebalancing its indices automatically because we had cluster.routing.rebalance.enable set to none.
We found the information here.
The problem we had with _cluster/reroute was that, according to the documentation, the system will try to balance itself again. Either way, thanks for your help.
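For reference, re-enabling rebalancing comes down to changing that one cluster setting through PUT /_cluster/settings. The sketch below only builds the request body; the helper name rebalance_settings_body is made up, and the accepted values listed are the ones documented for cluster.routing.rebalance.enable.

```python
import json

# Values accepted by cluster.routing.rebalance.enable.
VALID_MODES = ("all", "primaries", "replicas", "none")


def rebalance_settings_body(mode="all", persistent=True):
    """Build the body for PUT /_cluster/settings that sets the rebalance mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported rebalance mode: {mode!r}")
    scope = "persistent" if persistent else "transient"
    return {scope: {"cluster.routing.rebalance.enable": mode}}


settings_body = rebalance_settings_body("all")
print(json.dumps(settings_body))
```

A transient setting is lost on a full cluster restart, so persistent is the default in this sketch.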

ElasticSearch Filtered Aliases Creation - Best Practice

We are planning to use Filtered Aliases as mentioned here - https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html
Our input data is going to be a stream with each line of the stream corresponding to an object we would like to store in ES.
Each object contains an 'id', which we are using for routing and filtering.
QUESTION -
How do we create aliases and index data in a performant way?
-- Do we index all data, keep track of all the unique 'id's, and create the filtered aliases at the very end? OR
-- For each object, check whether an alias for that 'id' exists and create one if it doesn't?
I'm leaning towards the first approach. Is it advisable, and is it more performant than the second approach?
TIA.

Based on our discussion above and after having glanced over the blog article you posted, I'm pretty positive that in your case you don't need aliases at all and the routing key would suffice. Again, that is only because you have a single index; if you had many indices this would no longer be true!
You simply need to specify the routing key to use when indexing your document. Before ES 2.0, you can use the path setting of the _routing field in the mapping for that purpose. It has been deprecated since ES 1.5, but in your case it still does the job.
{
  "customer" : {
    "_routing" : {
      "required" : true,
      "path" : "customer_id" <----- the field you use as the routing key
    },
    "properties": { ... }
  }
}
Then when searching you simply need to specify &routing=<customer_id> in your search URL in addition to your customer id filter (since a given shard can host documents for different customers). Your search will go directly to the shard identified by the given routing key, and thus, only retrieve data from the specified customer.
Using a filtered alias for this brings nothing as the filter and routing key you'd include in your alias definition would not contribute anything additional, since the retrieved documents are already "filtered" (kind of) by the routing key. This is way easier than trying to detect (on each new document to index) if an alias exists or not and create it if it doesn't.
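A routed search of the kind described above could be put together as follows. This is a sketch only: routed_search_request is a hypothetical helper, the base URL and index name are the ones used elsewhere in this answer, and the term query is just one way to express the customer filter (the exact query syntax depends on your Elasticsearch version).

```python
from urllib.parse import urlencode


def routed_search_request(base_url, index, customer_id):
    """Return (url, body) for a search that is routed to the customer's shard."""
    query_string = urlencode({"routing": customer_id})
    url = f"{base_url}/{index}/_search?{query_string}"
    # The filter is still needed: one shard can hold documents of several customers.
    body = {"query": {"term": {"customer_id": customer_id}}}
    return url, body


url, body = routed_search_request("http://localhost:9200", "customers", "cid1")
print(url)
print(body)
```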
UPDATE:
Now if you absolutely have/want to create filtered aliases, the more performant way would be the first one you mentioned:
1. First index your daily data
2. Then run a terms aggregation on your customer_id field with size high enough (i.e. higher than the cardinality of the field, which was ~100 in your case) to make sure you capture all unique customer ids to create your aliases
3. Loop over all the buckets to retrieve all unique customer ids
4. Create all aliases in one shot using one action for each customer_id
curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions" : [
    {
      "add" : {
        "index" : "customers",
        "alias" : "alias_cid1",
        "routing" : "cid1",
        "filter" : { "term" : { "customer_id" : "cid1" } }
      }
    },
    {
      "add" : {
        "index" : "customers",
        "alias" : "alias_cid2",
        "routing" : "cid2",
        "filter" : { "term" : { "customer_id" : "cid2" } }
      }
    },
    {
      "add" : {
        "index" : "customers",
        "alias" : "alias_cid3",
        "routing" : "cid3",
        "filter" : { "term" : { "customer_id" : "cid3" } }
      }
    },
    ...
  ]
}'
Note that you don't have to worry about aliases that already exist: the whole command won't fail, it will silently ignore the existing alias.
Once this command has run, you'll have all your aliases on your single index, each properly configured with a filter and a routing key.
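The loop over the aggregation buckets and the generation of one alias action per customer id could look roughly like this. It assumes the buckets come from a terms aggregation response (each bucket carrying a "key"); the function name, the sample buckets, and the alias_ prefix default are illustrative choices that mirror the curl example above.

```python
import json


def alias_actions_from_buckets(buckets, index="customers",
                               field="customer_id", prefix="alias_"):
    """Build the _aliases request body with one 'add' action per bucket key."""
    actions = []
    for bucket in buckets:
        customer_id = str(bucket["key"])
        actions.append({
            "add": {
                "index": index,
                "alias": prefix + customer_id,
                "routing": customer_id,
                "filter": {"term": {field: customer_id}},
            }
        })
    return {"actions": actions}


# Invented buckets, shaped like the 'buckets' array of a terms aggregation.
sample_buckets = [
    {"key": "cid1", "doc_count": 12},
    {"key": "cid2", "doc_count": 7},
]

aliases_body = alias_actions_from_buckets(sample_buckets)
print(json.dumps(aliases_body, indent=2))
```

The resulting dictionary is what would be posted to the _aliases endpoint in a single request.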

Elasticsearch querying alias with routing giving partial results

In an effort to create a multi-tenant architecture for my project, I've created an Elasticsearch cluster with an index 'tenant':
"tenant" : {
  "some_type" : {
    "_routing" : {
      "required" : true,
      "path" : "tenantId"
    },
I've also created some aliases:
"tenant" : {
  "aliases" : {
    "tenant_1" : {
      "index_routing" : "1",
      "search_routing" : "1"
    },
    "tenant_2" : {
      "index_routing" : "2",
      "search_routing" : "2"
    },
    "tenant_3" : {
      "index_routing" : "3",
      "search_routing" : "3"
    },
    "tenant_4" : {
      "index_routing" : "4",
      "search_routing" : "4"
    }
I've added some data with tenantId = 2.
After all that, I tried to query 'tenant_2' but I only got partial results, while querying the 'tenant' index directly returns the full results.
Why is that?
I was sure that routing would query all the shards on which documents with tenantId = 2 reside.

Once you have created aliases in Elasticsearch, you have to perform all operations through the aliases only, be it indexing, updating, or searching.
If possible, try reindexing the data and check again (I hope it is a test index).
Remove all the indices:
curl -XDELETE 'localhost:9200/' # Warning: don't use this in production. Use this command only on a test index.
Then create the index again and create the aliases again. Do all indexing, search, and delete operations on the alias name. Even the data import should be done via the alias name.
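To illustrate why mixing alias-based and direct operations can yield partial results, here is a toy model of routing. Elasticsearch picks the shard from a hash of the routing value modulo the number of primary shards; the sketch uses zlib.crc32 purely as a stand-in for the real hash function, and the shard count, documents, and helper names are all invented. It shows one possible cause, not a diagnosis of the cluster in the question.

```python
import zlib

NUM_SHARDS = 5  # assumed number of primary shards


def shard_for(routing, num_shards=NUM_SHARDS):
    """Toy shard selection: stand-in hash of the routing value, modulo shard count."""
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def index_doc(shards, doc, routing):
    """Store the document on the shard chosen by its routing value."""
    shards[shard_for(routing, len(shards))].append(doc)


def search(shards, tenant_id, routing=None):
    """Search every shard, or only the routed shard when a routing value is given."""
    if routing is None:
        targets = shards
    else:
        targets = [shards[shard_for(routing, len(shards))]]
    return [doc for shard in targets for doc in shard if doc["tenantId"] == tenant_id]


shards = [[] for _ in range(NUM_SHARDS)]

# Indexed through the alias: routed by the tenant's routing value "2".
index_doc(shards, {"id": "doc-a", "tenantId": "2"}, routing="2")
# Hypothetical documents indexed some other way, routed by their own ids.
for doc_id in ("doc-b", "doc-c", "doc-d"):
    index_doc(shards, {"id": doc_id, "tenantId": "2"}, routing=doc_id)

routed_hits = search(shards, "2", routing="2")  # what the alias search sees
all_hits = search(shards, "2")                  # what the plain index search sees
print(len(routed_hits), "of", len(all_hits), "documents found via routing")
```

A routed search only looks at one shard, so any document that was stored under a different routing value is invisible to it unless it happens to land on the same shard.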
