My ES cluster has 12 servers, but when I created my index I specified only 3 shards. Should I use the routing parameter on every write and read to reduce latency?
If you want to control shard allocation, there are a few options.
One option is to set node.rack: rack1 in the config yml file.
Then, when you create or update the index:
PUT test/_settings
{
"index.routing.allocation.include.rack": "rack1"
}
In addition, it depends on the size of your index. For instance, in my app I use different types of indexes: some have 1 shard (they are settings indexes), others have 3 shards and 1 replica, and I don't worry about allocation because it is already very fast. So if you care about latency, it may be better to think about upgrading your network.
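For completeness, since the question asks about the routing parameter: custom routing on writes and reads looks roughly like this. This is only a sketch; my_index, my_type, and the routing value user_42 are placeholders, and the exact URL layout (and any required Content-Type header) depends on your ES version.
# Index a document with an explicit routing value so it lands on one specific shard
curl -XPUT 'localhost:9200/my_index/my_type/1?routing=user_42' -d '
{"user": "user_42", "message": "hello"}'
# Search with the same routing value so only that shard is queried
curl -XGET 'localhost:9200/my_index/_search?routing=user_42' -d '
{"query": {"term": {"user": "user_42"}}}'
Keep in mind that routing only helps queries that can be scoped to a single routing value; queries spanning many values still hit all shards, and skewed values can create hot shards.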
I need a way such that whenever a shard grows past a given size limit, its data is redistributed into two equal-size shards: add one more shard to the same index and move half of the oversized shard's data into the newly created shard.
I can get the shard state as shown below, but I need help finding a way to redistribute the data:
{
"index": "public",
"shard": "0",
"store": "20GB"
}
P.S. I have tried the Split Index API, but it doesn't serve the purpose, as it requires a new, non-existing index and cannot do the magic on the existing index. In the example above, the index 'public' needs to stay the same, but the shard count should increase and the data should be distributed among the shards.
This is not possible: you can't change the number of primary shards of an Elasticsearch index in place, because routing (which shard each document lands on) depends on the number of primary shards fixed at index creation time.
If you changed it, Elasticsearch would have to change the routing algorithm and redistribute the data again to spread it evenly across all the shards (including replicas). Doing that on a distributed, large-scale, stateful application is not an easy feat, and Elasticsearch does not currently support it.
You cannot just add a shard without reindexing (but you can add a replica)
If part of your data is read-only and you can activate a basic licence (probably not possible on AWS), you can define an ILM policy.
In Open Distro, you can use the equivalent, ISM:
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/ism.html
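If the real constraint is only that clients keep using the name public, a common workaround is to split into a new index and swap an alias over, as sketched below. This is just a sketch: public_v2 is a made-up name, the target shard count must be a multiple of the current one, and on 6.x the source index must have been created with index.number_of_routing_shards for _split to work.
# Block writes on the source index (required before a split)
curl -XPUT 'localhost:9200/public/_settings' -H 'Content-Type: application/json' -d '
{"index.blocks.write": true}'
# Split into a new index with more primary shards
curl -XPOST 'localhost:9200/public/_split/public_v2' -H 'Content-Type: application/json' -d '
{"settings": {"index.number_of_shards": 2}}'
# Once the split has finished and been verified, drop the old index and
# add an alias so applications can keep using the name "public"
curl -XDELETE 'localhost:9200/public'
curl -XPOST 'localhost:9200/_aliases' -H 'Content-Type: application/json' -d '
{"actions": [{"add": {"index": "public_v2", "alias": "public"}}]}'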
I am running Elasticsearch 6.2.4. I have a program that will automatically create an index for me as well as the mappings necessary for my data. For this issue, I created an index called "landsat" but it needs to actually be named "landsat_8", so I chose to reindex. The original "landsat" index has 2 shards and 0 read replicas. The store size is ~13.4gb with ~6.6gb per shard and the index holds just over 515k documents.
I created a new index called "landsat_8" with 5 shards, 1 read replica, and started a reindex with no special options. On a very small Elastic Cloud cluster (4GB RAM), it finished in 8 minutes. It was interesting to see that the final store size was only 4.2gb, yet it still held all 515k documents.
After it was finished, I realized that I had failed to create my mappings before reindexing, so I blew it away and started over. I was shocked to find that after an hour, the _cat/indices endpoint showed that only 7.5gb of data and 154,800 documents had been reindexed. 4 hours later, the entire job seemed to have died at 13.1gb, but it only showed 254,000 documents had been reindexed.
On this small 4gb cluster, this reindex operation was maxing out CPU. I increased the cluster to the biggest one Elastic Cloud offered (64gb ram), 5 shards, 0 RR and started the job again. This time, I set the refresh_interval on the new index to -1 and changed the size for the reindex operation to 2000. Long story short, this job ended in somewhere between 1h10m and 1h19m. However, this time I ended up with a total store size of 25gb, where each shard held ~5gb.
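For reference, the tweaks described above were roughly the following (the batch size lives under source in the reindex body; "1s" at the end is just the default refresh interval being restored):
# Disable refresh on the destination index while bulk loading
curl -XPUT 'localhost:9200/landsat_8/_settings' -H 'Content-Type: application/json' -d '
{"index": {"refresh_interval": "-1"}}'
# Reindex with a batch (scroll) size of 2000 instead of the default 1000
curl -XPOST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
{"source": {"index": "landsat", "size": 2000}, "dest": {"index": "landsat_8"}}'
# Restore the refresh interval once the job is done
curl -XPUT 'localhost:9200/landsat_8/_settings' -H 'Content-Type: application/json' -d '
{"index": {"refresh_interval": "1s"}}'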
I'm very confused as to why the reindex operation causes such wildly different results in store size and reindex performance. Why, when I don't explicitly define any mappings and let ES automatically create mappings, is the store size so much smaller? And why, when I use the exact same mappings as the original index, is the store so much bigger?
Any advice would be greatly appreciated. Thank you!
UPDATE 1:
Here are the only differences in mappings:
The left image is "landsat" and the right image is "landsat_8". There is a root level "type" field and a nested "properties.type" field in the original "landsat" index. I forgot one of my goals was to remove the field "properties.type" from the data during the reindex. I seem to have been successful in doing so, but at the same time, accidentally renamed the root-level "type" field mapping to "provider", thus "landsat_8" has an unused "provider" mapping and an auto-created "type" mapping.
So there are some problems here, but I wouldn't think this would nearly double my store size...
I have an ELK Stack set up and accepting log data from 2 of my applications, and everything is working OK. It's been running for 25 days and I have nearly 4GB of data/documents on a 25GB server.
My questions
I have 8 applications in total that I would like to hook up to my ELK Stack.
Is one cluster OK for this, or do I need to add more clusters, say a cluster for each application's data? If so, how do I do that without having to re-index my data?
Why does cluster health say "yellow (244 of 488)"?
Should I index each application into its own index rather than the default "logstash-{todays-date}", like my-app-1-{todays-date}, my-app-2-{todays-date}, etc.?
Your help is greatly appreciated.
G
Your cluster is yellow because your logstash-* indices are configured with 1 replica and you probably have a single node. 244 of 488 means that you have 488 shards in all your indices but only 244 are assigned on your single node and 244 remain to be assigned to new nodes. This is not a problem per se, but if your current node were to fail for some reason, you'd probably lose some data, whereas if you had 2+ nodes, the data would be replicated on other nodes, your cluster would be green (and you'd see 488 of 488) and you'd have a lower risk of losing data.
As for your second question, nothing prevents you from storing all the logs from your eight applications in the same daily logstash indices. You just need to make sure that your logstash configuration accounts for each of the different apps and adds a field with the application name (e.g. app: app1, app: app2, etc.) to the indexed log events, so that you can then tell within Kibana which app each log event was issued from.
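For illustration (the app field name and values are just the ones from the example above, and the query DSL shown assumes ES 2.x or later), filtering by application then becomes a simple term filter:
# Count log events coming from a single application across the daily logstash indices
curl -XGET 'localhost:9200/logstash-*/_search' -d '
{"size": 0, "query": {"bool": {"filter": {"term": {"app": "app1"}}}}}'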
I have only used Elasticsearch and not the complete ELK stack, but I can give some ideas and guess what is going on. 488 = 2 × 244, so I guess there are unassigned replica shards in the single-machine cluster. You can update this setting ad hoc and set it to zero:
curl -XPUT 'localhost:9200/my_index/_settings' -d '
{"index" : {"number_of_replicas" : 0}}'
You should update the logstash index template not to use replicas when you are running just a single machine. Also, your shards seem to be only about 20 MB in size, so I'd recommend each index use just one shard instead of five; each shard consumes extra resources. Having multiple shards can increase indexing speed but slows down queries, so you should check whether one shard is sufficient.
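A minimal sketch of such a template (the template name is made up, and the legacy "template" key shown here was replaced by "index_patterns" in later Elasticsearch versions):
curl -XPUT 'localhost:9200/_template/logstash_single_node' -d '
{
  "template": "logstash-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'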
One index per application per day would speed up querying if dashboards are mostly application-specific, and you can create a day-specific alias to be used by cross-application queries, as sketched below.
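For example (the alias name and dates here are made up), the per-application daily indices can be grouped under a single day-wide alias:
curl -XPOST 'localhost:9200/_aliases' -d '
{
  "actions": [
    {"add": {"index": "my-app-1-2015.06.01", "alias": "all-apps-2015.06.01"}},
    {"add": {"index": "my-app-2-2015.06.01", "alias": "all-apps-2015.06.01"}}
  ]
}'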
I am using ElasticSearch version 1.0.1 and want to achieve two things at the same time -
1. Allow new indices to be created (the primary and replica shards need to be allocated as per the usual logic).
2. Prevent existing shards to be rebalanced on node failure.
What combination of settings will allow me to achieve the same? I tried the settings from the cluster module documented at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-cluster.html. But I am unable to achieve both of them at the same time.
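As a sketch of the kind of combination involved (whether these exact setting names exist in 1.0.1 should be verified against the cluster module docs linked above), you could leave allocation enabled so new indices get their primaries and replicas assigned, while turning off rebalancing of shards that are already assigned:
curl -XPUT 'localhost:9200/_cluster/settings' -d '
{
  "transient": {
    "cluster.routing.allocation.enable": "all",
    "cluster.routing.rebalance.enable": "none"
  }
}'
Note that after a node failure, re-assigning the lost shards to surviving nodes counts as allocation rather than rebalancing, so this combination alone would not stop that; this distinction may be why achieving both goals at once is difficult.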
Thanks,
Below I have 2 nodes on two servers, with 2 indexes each. Each index has 2 shards and 1 replica.
The "Thor" node had some downtime, so "Iron_man" took over. That's fine.
As you can see, events_v1 is an index created before the downtime and venue_v1 was created after the downtime. Shouldn't "Thor", after coming back up, automatically take over one shard in the same way it handles the newly created venue index?
If yes how should I configure the settings?
You don't need to configure anything for the above scenario, because Elasticsearch's default behavior is exactly what you are asking for. A replica shard is never allocated on the same node as its primary, so it stays unassigned while only one node is available. When you add a new node (or the old node comes back), the replica shard will be allocated on it.
For more information, watch this video.
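If you want to double-check where the shards ended up once both nodes are back, a quick look at the shard table helps (no special setup assumed):
curl -XGET 'localhost:9200/_cat/shards?v'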