On my single test server with 8 GB of RAM (1955m allocated to the JVM) running Elasticsearch v7.4, I have 12 application indices plus a few system indices (.monitoring-es-7-2021.08.02, .monitoring-logstash-7-2021.08.02, .monitoring-kibana-7-2021.08.02) created daily. So on average Elasticsearch creates about 15 indices per day.
Today I can see that only two indices were created:
curl --silent -u elastic:xxxxx 'http://127.0.0.1:9200/_cat/indices?v' | grep '2021.08.03'
yellow open metricbeat-7.4.0-2021.08.03 KMJbbJMHQ22EM5Hfw 1 1 110657 0 73.9mb 73.9mb
green open .monitoring-kibana-7-2021.08.03 98iEmlw8GAm2rj-xw 1 0 3 0 1.1mb 1.1mb
I think the reason is the following. Looking into the Elasticsearch logs, I found:
[2021-08-03T12:14:15,394][WARN ][o.e.x.m.e.l.LocalExporter] [elasticsearch_1] unexpected error while indexing monitoring document org.elasticsearch.xpack.monitoring.exporter.ExportException: org.elasticsearch.common.ValidationException: Validation Failed: 1: this action would add [1] total shards, but this cluster currently has [1000]/[1000] maximum shards open;
Logstash logs for the application index and the Filebeat index:
[2021-08-03T05:18:05,246][WARN ][logstash.outputs.elasticsearch][main] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"ping_server-2021.08.03", :_type=>"_doc", :routing=>nil}, #LogStash::Event:0x44b98479], :response=>{"index"=>{"_index"=>"ping_server-2021.08.03", "_type"=>"_doc", "_id"=>nil, "status"=>400, "error"=>{"type"=>"validation_exception", "reason"=>"Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [1000]/[1000] maximum shards open;"}}}}
[2021-08-03T05:17:38,230][WARN ][logstash.outputs.elasticsearch][main] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"filebeat-7.4.0-2021.08.03", :_type=>"_doc", :routing=>nil}, #LogStash::Event:0x1e2c70a8], :response=>{"index"=>{"_index"=>"filebeat-7.4.0-2021.08.03", "_type"=>"_doc", "_id"=>nil, "status"=>400, "error"=>{"type"=>"validation_exception", "reason"=>"Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [1000]/[1000] maximum shards open;"}}}}
Active plus unassigned shards add up to 1000:
"active_primary_shards" : 512,
"active_shards" : 512,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 488,
"delayed_unassigned_shards" : 0,
"active_shards_percent_as_number" : 51.2
If I check with the command below, I see that all unassigned shards are replica shards:
curl --silent -XGET -u elastic:xxxx http://localhost:9200/_cat/shards | grep 'UNASSIGNED'
.
.
dev_app_server-2021.07.10 0 r UNASSIGNED
apm-7.4.0-span-000028 0 r UNASSIGNED
ping_server-2021.07.02 0 r UNASSIGNED
api_app_server-2021.07.17 0 r UNASSIGNED
consent_app_server-2021.07.15 0 r UNASSIGNED
Q. For now, can I safely delete the unassigned shards to free up shard capacity, since it's a single-node cluster?
Q. Can I change the settings online from allocating 2 shards per index (1 primary and 1 replica) to 1 primary shard, since it's a single server?
Q. If I have to keep one year of indices, is the calculation below correct?
15 indices daily with one primary shard * 365 days = 5475 total shards (or say 6000, rounded up)
Q. Can I set 6000 shards as the shard limit for this node so that I never face this shard issue again?
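The arithmetic above can be sanity-checked in a few lines of Python (numbers taken from the question):

```python
# 15 indices per day, each with 1 primary shard and no replicas,
# retained for a full year.
indices_per_day = 15
shards_per_index = 1  # primary only
days_retained = 365

total_shards = indices_per_day * shards_per_index * days_retained
print(total_shards)  # 5475, i.e. ~6000 when rounded up for headroom
```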
Thanks,
You have a lot of unassigned shards (probably because you have a single node and all indices have replicas=1), so it's easy to get rid of all of them, and of the error at the same time, by running the following command:
PUT _all/_settings
{
"index.number_of_replicas": 0
}
Regarding the index count, you probably don't have to create one index per day if those indices stay small (i.e. below 10GB each). So the default limit of 1000 shards is more than enough, without you having to change anything.
You should simply leverage Index Lifecycle Management in order to keep your index size at bay and not create too many small indices.
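As a sketch, an ILM policy that rolls an index over at 10GB (or monthly) and deletes it after a year could look like the following; the policy name and thresholds are illustrative, not from the question:

PUT _ilm/policy/app-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "10gb",
            "max_age": "30d"
          }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

With something like this attached to an index template, you end up with far fewer, larger indices instead of 15 small ones per day.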
Related
The health column shows yellow for the Logstash indices; even after deleting the old ones, they are recreated with yellow health. I have a two-node cluster for this setup and have checked the shards as shown below.
GET _cluster/health:
{
"cluster_name" : "elasticsearch",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 12,
"active_shards" : 22,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 3,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 88.46153846153845
}
Any idea how this can be turned green?
Also, the indices are not getting created daily due to this issue.
The yellow health indicates that your primary shards are allocated but the replicas are not. This may be because your Elasticsearch is deployed on one node only. Elasticsearch does not allocate a primary and its replica on the same node, as that would serve no purpose. When you have multiple nodes and multiple shards, Elasticsearch by default allocates the primary and the replicas to different nodes.
As seen from the data you provided, you have 22 active shards and only 2 nodes. The 3 unassigned shards are what is causing the yellow cluster health.
In order to solve this, you can do one of two things.
If you are using Elasticsearch for testing, you can initiate the server with one shard and no replicas. In this case you have one node in your Elasticsearch service.
If you are in production and want multiple shard copies (primary + replicas), then the number of nodes should be at least the total number of copies of each shard. For instance, if you have 1 primary and 2 replicas, then initiate the server with 3 nodes.
Please remember to do this when you are initiating your Elasticsearch server.
The harm in yellow health is that if a primary shard goes bad, you will lose the service and possibly the data as well.
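As a minimal sketch, on a two-node cluster like this one you can usually turn yellow into green by making sure no index asks for more replicas than the remaining nodes can hold (the index pattern logstash-* here is illustrative):

PUT logstash-*/_settings
{
  "index.number_of_replicas": 1
}

With 2 nodes, 1 replica per primary is the most that can actually be allocated.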
I am posting a more general question, after having found that I may have more issues than low disk space:
optimise server operations with elasticsearch : addressing low disk watermarks and authentication failures
My issue is that my ES server crashes occasionally, and I cannot figure out why.
I want to ensure reliability for at least several days at a time, and if an error occurs, restart the instance automatically.
Which best practices could I follow to debug ES on a small server instance, using a single node?
This is what I am looking at:
(useful resource - https://www.datadoghq.com/blog/elasticsearch-unassigned-shards/)
Check the available disk space - optimise server operations with elasticsearch : addressing low disk watermarks and authentication failures
Check the ES log (/var/log/elasticsearch):
...
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:536) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:490) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873) [netty-common-4.1.6.Final.jar:4.1.6.Final]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
Caused by: org.elasticsearch.action.NoShardAvailableActionException
... 60 more
[2020-05-12T15:05:56,874][INFO ][o.e.c.r.a.AllocationService] [awesome3-master] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[en-awesome-wiki][2]] ...]).
[2020-05-12T15:10:48,998][DEBUG][o.e.a.a.c.a.TransportClusterAllocationExplainAction] [awesome3-master] explaining the allocation for [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false], found shard [[target-validation][4], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2020-05-12T15:05:54.260Z], delayed=false, allocation_status[no_attempt]]]
I spotted a shard allocation error somewhere, so I check:
curl -s 'localhost:9200/_cat/allocation?v'
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
15 616.2mb 10.6gb 12.5gb 23.1gb 45 127.0.0.1 127.0.0.1 awesome3-master
15 UNASSIGNED
What does this mean? Are the indices duplicated across multiple replicas (see below)?
I check:
curl -s -XGET 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
target-validation 4 r UNASSIGNED CLUSTER_RECOVERED
target-validation 2 r UNASSIGNED CLUSTER_RECOVERED
target-validation 1 r UNASSIGNED CLUSTER_RECOVERED
target-validation 3 r UNASSIGNED CLUSTER_RECOVERED
target-validation 0 r UNASSIGNED CLUSTER_RECOVERED
it-tastediscovery-expo 4 r UNASSIGNED CLUSTER_RECOVERED
it-tastediscovery-expo 2 r UNASSIGNED CLUSTER_RECOVERED
it-tastediscovery-expo 1 r UNASSIGNED CLUSTER_RECOVERED
it-tastediscovery-expo 3 r UNASSIGNED CLUSTER_RECOVERED
it-tastediscovery-expo 0 r UNASSIGNED CLUSTER_RECOVERED
en-awesome-wiki 4 r UNASSIGNED CLUSTER_RECOVERED
en-awesome-wiki 2 r UNASSIGNED CLUSTER_RECOVERED
en-awesome-wiki 1 r UNASSIGNED CLUSTER_RECOVERED
en-awesome-wiki 3 r UNASSIGNED CLUSTER_RECOVERED
en-awesome-wiki 0 r UNASSIGNED CLUSTER_RECOVERED
And here I have a question: is ES trying to create new replicas each time an error brings the system down?
So I look at an explanation:
curl -XGET localhost:9200/_cluster/allocation/explain?pretty
{
"shard" : {
"index" : "target-validation",
"index_uuid" : "ONFPE7UQQzWjrhG0ztlSdw",
"id" : 4,
"primary" : false
},
"assigned" : false,
"shard_state_fetch_pending" : false,
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2020-05-12T15:05:54.260Z",
"delayed" : false,
"allocation_status" : "no_attempt"
},
"allocation_delay_in_millis" : 60000,
"remaining_delay_in_millis" : 0,
"nodes" : {
"Ynm6YG-MQyevaDqT2n9OeA" : {
"node_name" : "awesome3-master",
"node_attributes" : { },
"store" : {
"shard_copy" : "AVAILABLE"
},
"final_decision" : "NO",
"final_explanation" : "the shard cannot be assigned because allocation deciders return a NO decision",
"weight" : 9.5,
"decisions" : [
{
"decider" : "same_shard",
"decision" : "NO",
"explanation" : "the shard cannot be allocated on the same node id [Ynm6YG-MQyevaDqT2n9OeA] on which it already exists"
}
]
}
}
}
Now, I would like to better understand what a shard is and what ES is attempting to do.
Should I delete the unused replicas?
And finally, what should I do to test that the service is "sufficiently" reliable?
Kindly let me know if there are best practices to follow for debugging ES and tuning the server.
My constraint is a small server; I would be happy if the server simply doesn't crash, even if requests take a little longer.
EDIT
Found this very useful question :
Shards and replicas in Elasticsearch
and this answer may offer a solution:
https://stackoverflow.com/a/50641899/305883
Before testing it out as an answer, could you kindly help me figure out whether/how to back up the indexes and estimate the correct parameters?
I run a single server and assume, given the above configuration, that number_of_shards should be 1 (a single machine) and that the maximum number_of_replicas could be 2 (the disk size should handle it):
curl -XPUT 'localhost:9200/sampleindex?pretty' -H 'Content-Type: application/json' -d '
{
"settings":{
"number_of_shards":1,
"number_of_replicas":2
}
}'
I am running a 2-node cluster on version 5.6.12.
I followed this rolling upgrade guide: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/rolling-upgrades.html
After reconnecting the last upgraded node back into my cluster, the health status remained as yellow due to unassigned shards.
Re-enabling shard allocation seemed to have no effect:
PUT _cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}
My query results when checking cluster health:
GET _cat/health:
1541522454 16:40:54 elastic-upgrade-test yellow 2 2 84 84 0 0 84 0 - 50.0%
GET _cat/shards:
v2_session-prod-2018.11.05 3 p STARTED 6000 1016kb xx.xxx.xx.xxx node-25
v2_session-prod-2018.11.05 3 r UNASSIGNED
v2_session-prod-2018.11.05 1 p STARTED 6000 963.3kb xx.xxx.xx.xxx node-25
v2_session-prod-2018.11.05 1 r UNASSIGNED
v2_session-prod-2018.11.05 4 p STARTED 6000 1020.4kb xx.xxx.xx.xxx node-25
v2_session-prod-2018.11.05 4 r UNASSIGNED
v2_session-prod-2018.11.05 2 p STARTED 6000 951.4kb xx.xxx.xx.xxx node-25
v2_session-prod-2018.11.05 2 r UNASSIGNED
v2_session-prod-2018.11.05 0 p STARTED 6000 972.2kb xx.xxx.xx.xxx node-25
v2_session-prod-2018.11.05 0 r UNASSIGNED
v2_status-prod-2018.11.05 3 p STARTED 6000 910.2kb xx.xxx.xx.xxx node-25
v2_status-prod-2018.11.05 3 r UNASSIGNED
Is there another way to try to get shard allocation working again so I can get my cluster health back to green?
The other node within my cluster had a "high disk watermark [90%] exceeded" warning message so shards were "relocated away from this node".
I updated the config to:
cluster.routing.allocation.disk.watermark.high: 95%
After restarting the node, shards began to allocate again.
This is a quick fix - I will also attempt to increase the disk space on this node to ensure I don't lose reliability.
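If a node restart is undesirable, the same watermark can also be raised at runtime through the cluster settings API (a sketch; worth reverting once disk space has been freed):

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}

Transient settings take effect immediately and are lost on a full cluster restart, which makes them a reasonable fit for this kind of temporary workaround.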
I'm new to using Elasticsearch; I use it to aggregate logs. My problem is with storage: I deleted all indices and now I have only one index.
When I call /_cat/allocation?v, disk.indices is 23.9mb but disk.used is 16.4gb. Why this difference? How can I remove unused data, or how can I properly remove indices?
I ran the command:
curl -XPOST "elasticsearch:9200/_forcemerge?only_expunge_deletes=true"
But I didn't see any improvement.
Output of _cat/allocation?v :
shards disk.indices disk.used disk.avail
12 24.3mb 16.4gb 22.7gb
Output of _cat/shards?v :
index shard prirep state docs store ip node
articles 0 p STARTED 3666 24.2mb 192.168.1.21 lW9hsd5
articles 0 r UNASSIGNED
storage_test 2 p STARTED 0 261b 192.168.1.21 lW9hsd5
storage_test 2 r UNASSIGNED
storage_test 3 p STARTED 0 261b 192.168.1.21 lW9hsd5
storage_test 3 r UNASSIGNED
storage_test 4 p STARTED 0 261b 192.168.1.21 lW9hsd5
storage_test 4 r UNASSIGNED
storage_test 1 p STARTED 0 261b 192.168.1.21 lW9hsd5
storage_test 1 r UNASSIGNED
storage_test 0 p STARTED 0 261b 192.168.1.21 lW9hsd5
storage_test 0 r UNASSIGNED
twitter 3 p STARTED 1 4.4kb 192.168.1.21 lW9hsd5
twitter 3 r UNASSIGNED
twitter 2 p STARTED 0 261b 192.168.1.21 lW9hsd5
twitter 2 r UNASSIGNED
twitter 4 p STARTED 0 261b 192.168.1.21 lW9hsd5
twitter 4 r UNASSIGNED
twitter 1 p STARTED 0 261b 192.168.1.21 lW9hsd5
twitter 1 r UNASSIGNED
twitter 0 p STARTED 0 261b 192.168.1.21 lW9hsd5
twitter 0 r UNASSIGNED
.kibana 0 p STARTED 4 26.4kb 192.168.1.21 lW9hsd5
Thanks
https://www.elastic.co/guide/en/elasticsearch/guide/current/delete-doc.html
As already mentioned in Updating a Whole Document, deleting a document
doesn’t immediately remove the document from disk; it just marks it as
deleted. Elasticsearch will clean up deleted documents in the
background as you continue to index more data.
You might be facing some side effects of a _forcemerge on a non-read-only index:
Warning: Force merge should only be called against read-only indices. Running force merge against a read-write index can cause very large segments to be produced (>5Gb per segment), and the merge policy will never consider it for merging again until it mostly consists of deleted docs. This can cause very large segments to remain in the shards.
In this case I would suggest to first make the index read-only:
PUT your_index/_settings
{
"index": {
"blocks.read_only": true
}
}
Then run the force merge again, and afterwards re-enable writing to the index:
PUT your_index/_settings
{
"index": {
"blocks.read_only": false
}
}
In case this does not work, you can reindex from the old index into a new index and then delete the old one.
Is there a better way of deleting old logs?
Looks like you want to delete old log messages. Although you could issue a delete-by-query, there is in fact a better way: the Rollover API.
The idea is to create a new index every time the old one gets too big. Writes happen through a fixed alias, and the Rollover API makes the alias point to a new index when the old one is too old or too big. Then, to delete the old data, you only have to delete the old indices.
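A minimal sketch of such a rollover call, assuming a write alias named logs_write (the alias name and thresholds here are hypothetical):

POST /logs_write/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_size": "50gb"
  }
}

When either condition is met, Elasticsearch creates a new index and repoints the alias to it, so writers never have to change the name they index into.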
Hope that helps!
I have only one node on one computer, and the index has 5 shards without replicas. Here are some parameters describing my Elasticsearch node (healthy indexes are ignored in the following list):
GET /_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
red open datas 5 0 344999414 0 43.9gb 43.9gb
GET _cat/shards
datas 4 p STARTED 114991132 14.6gb 127.0.0.1 Eric the Red
datas 3 p STARTED 114995287 14.6gb 127.0.0.1 Eric the Red
datas 2 p STARTED 115012995 14.6gb 127.0.0.1 Eric the Red
datas 1 p UNASSIGNED
datas 0 p UNASSIGNED
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
14 65.9gb 710gb 202.8gb 912.8gb 77 127.0.0.1 127.0.0.1 Eric the Red
3 UNASSIGNED
Although deleting created shards doesn't seem to be supported, as mentioned in the comments above, reducing the number of replicas to zero for the indexes with UNASSIGNED shards might do the job, at least for single-node clusters.
PUT /{my_index}/_settings
{
"index" : {
"number_of_replicas" : 0
}
}
reference
You can try deleting unassigned shards the following way (not sure if it works for data indices, but it works for Marvel indices):
1) Install the elasticsearch-head plugin. Refer to Elastic Search Head Plugin Installation.
2) Open your elasticsearch-head URL in a browser. From there you can easily check which shards are unassigned, along with other related info. It will display information like this for such a shard:
{
"state": "UNASSIGNED",
"primary": true,
"node": null,
"relocating_node": null,
"shard": 0,
"index": ".marvel-es-2016.05.18",
"version": 0,
"unassigned_info": {
"reason": "DANGLING_INDEX_IMPORTED",
"at": "2016-05-25T05:59:50.678Z"
}
}
From here you can copy the index name, i.e. .marvel-es-2016.05.18.
3) Now you can run this query in Sense:
DELETE .marvel-es-2016.05.18
Hope this helps!