How to get the number of currently open shards in an Elasticsearch cluster? - elasticsearch

I can't find where to get the number of currently open shards.
I want to set up monitoring to avoid cases like this:
this cluster currently has [999]/[1000] maximum shards open
I can get the maximum limit, max_shards_per_node:
$ curl -X GET "${ELK_HOST}/_cluster/settings?include_defaults=true&flat_settings=true&pretty" 2>/dev/null | grep cluster.max_shards_per_node
"cluster.max_shards_per_node" : "1000",
$
But I can't find out how to get the number of currently open shards (999).

A very simple way to get this information is to call the _cat/shards API and count the number of lines using the wc shell command:
curl -s -XGET ${ELK_HOST}/_cat/shards | wc -l
That will yield a single number that represents the number of shards in your cluster.
Another option is to retrieve the shard list in JSON format, pipe the results into jq and then grab whatever you want, e.g. below I'm counting all STARTED shards:
curl -s -XGET "${ELK_HOST}/_cat/shards?format=json" | jq ".[].state" | grep "STARTED" | wc -l
Yet another option is to query the _cluster/stats API:
curl -s -XGET "${ELK_HOST}/_cluster/stats?filter_path=indices.shards.total"
That will return a JSON document with the shard count:
{
"indices" : {
"shards" : {
"total" : 302
}
}
}
To my knowledge, no ES API returns the count of currently open shards as a single number. To be sure of that, let's look at the source code.
The error is thrown from IndicesService.java
To see how currentOpenShards is computed, we can then go to Metadata.java.
As you can see, the code iterates over the index metadata retrieved from the cluster state, pretty much like running the following command and counting the number of shards, but only for indices with "state" : "open":
GET _cluster/state?filter_path=metadata.indices.*.settings.index.number_of*,metadata.indices.*.state
From that evidence, we can pretty much be sure that the single number you're looking for is nowhere to be found, but needs to be computed by one of the methods I showed above. You're free to open a feature request if needed.
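The computation described above can be sketched in Python. This is only an illustration, not the code from Metadata.java: it assumes the response of the _cluster/state request shown above has already been parsed into a dict, and that each open index contributes number_of_shards × (1 + number_of_replicas) shards. The function name and the sample data are made up for the example.

```python
from typing import Any, Dict


def count_open_shards(cluster_state: Dict[str, Any]) -> int:
    """Sum primary and replica shards over indices whose state is "open".

    Assumption: every open index counts as
    number_of_shards * (1 + number_of_replicas).
    """
    total = 0
    indices = cluster_state.get("metadata", {}).get("indices", {})
    for meta in indices.values():
        if meta.get("state") != "open":
            continue  # closed indices are skipped
        index_settings = meta["settings"]["index"]
        # The cluster state API returns these settings as strings.
        primaries = int(index_settings["number_of_shards"])
        replicas = int(index_settings["number_of_replicas"])
        total += primaries * (1 + replicas)
    return total


# Hypothetical, hand-written sample in the shape of the filtered response.
sample_state = {
    "metadata": {
        "indices": {
            "logs-1": {
                "state": "open",
                "settings": {"index": {"number_of_shards": "5", "number_of_replicas": "1"}},
            },
            "logs-old": {
                "state": "close",
                "settings": {"index": {"number_of_shards": "5", "number_of_replicas": "1"}},
            },
            "metrics": {
                "state": "open",
                "settings": {"index": {"number_of_shards": "1", "number_of_replicas": "0"}},
            },
        }
    }
}

if __name__ == "__main__":
    print(count_open_shards(sample_state))
```

For monitoring, the result can then be compared with the limit read from cluster.max_shards_per_node.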

The problem: it seems that your Elasticsearch cluster is hitting its limit on the number of shards per node.
Solution:
Check the current number of shards against the configured limit, and raise the limit using the Elasticsearch API.
To get the number of shards, use the _cluster/stats API:
curl -s -XGET 'localhost:9200/_cluster/stats?filter_path=indices.shards.total'
From elastic docs:
The Cluster Stats API allows to retrieve statistics from a cluster
wide perspective. The API returns basic index metrics (shard numbers,
store size, memory usage) and information about the current nodes that
form the cluster (number, roles, os, jvm versions, memory usage, cpu
and installed plugins).
To raise or lower the shard limit, use the _cluster/settings API:
For example:
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/_cluster/settings' -d '{ "persistent" : {"cluster.max_shards_per_node" : 5000}}'
From elastic docs:
With specifications in the request body, this API call can update
cluster settings. Updates to settings can be persistent, meaning they
apply across restarts, or transient, where they don’t survive a full
cluster restart.
You can reset persistent or transient settings by assigning a null
value. If a transient setting is reset, the first one of these values
that is defined is applied:
the persistent setting, the setting in the configuration file, the
default value. The order of precedence for cluster settings is:
transient cluster settings, persistent cluster settings, settings in the
elasticsearch.yml configuration file. It’s best to set all
cluster-wide settings with the settings API and use the
elasticsearch.yml file only for local configurations. This way you can
be sure that the setting is the same on all nodes. If, on the other
hand, you define different settings on different nodes by accident
using the configuration file, it is very difficult to notice these
discrepancies.

curl -s '127.1:9200/_cat/indices' | awk '{ if ($2 == "open") C+=$5*($6+1)} END {print C}'

This works:
GET /_stats?level=shards&filter_path=_shards.total
Reference:
https://stackoverflow.com/a/38108448/4271117

Related

Wazuh - Filebeat - Elasticsearch non-zero metrics

Could you please help me solve this Filebeat error?
This is a Wazuh manager server. Everything is working: I can connect to the Kibana web UI, open the Wazuh app, and see my three Wazuh agents connected and active.
I want FIM monitoring. If I change a file on an agent server, an alert is created and I can see that alert in alert.log on the manager server. The issue is that Filebeat won't send this alert to Elasticsearch, so I can't see that alert in the Kibana web UI.
Wazuh manager:
Wazuh 4.2.5
Filebeat 7.14.2
Elasticsearch 7.14.2
Kibana 7.14.2
Wazuh alert log - /var/ossec/logs/alerts/2022/Feb/ and /var/ossec/logs/alerts
systemctl status filebeat shows the service as active, but I can see these lines there:
WARN [elasticsearch] elasticsearch/client.go:405 Cannot>
This is the output of filebeat -e:
2022-02-03T12:46:20.386+0100 INFO [monitoring] log/log.go:153 Total non-zero metrics {"monitoring": {"metrics": {"beat":{"cgroup":{"memory":{"id":"session-248447.scope","mem":{"limit":{"bytes":9223372036854771712},"usage":{"bytes":622415872}}}},"cpu":{"system":{"ticks":70,"time":{"ms":72}},"total":{"ticks":300,"time":{"ms":311},"value":300},"user":{"ticks":230,"time":{"ms":239}}},"handles":{"limit":{"hard":262144,"soft":1024},"open":9},"info":{"ephemeral_id":"641d7fdd-47a0-4b10-bda9-36f29c29fdef","uptime":{"ms":98413},"version":"7.14.2"},"memstats":{"gc_next":18917616,"memory_alloc":14197072,"memory_sys":75383816,"memory_total":71337840,"rss":115638272},"runtime":{"goroutines":11}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":2,"starts":2},"reloads":1,"scans":1},"output":{"events":{"active":0},"type":"elasticsearch"},"
And here is the error found in /var/log/messages:
Feb 3 10:27:54 filebeat[2531915]: 2022-02-03T10:27:54.707+0100#011WARN#011[elasticsearch]#011elasticsearch/client.go:405#011Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0xc07705e669760167, ext:958857091513, loc:(*time.Location)(0x5620964fb2a0)}, Meta:{"pipeline":"filebeat-7.14.0-wazuh-alerts-pipeline"}, Fields:{"agent":{"ephemeral_id":"33cb9baa-af71-4b44-99a6-1379c747722f","hostname":"xlc","id":"03fb57ca-9940-4886-9e6e-a3b3e635cd35","name":"xlc","type":"filebeat","version":"7.14.0"},"ecs":{"version":"1.10.0"},"event":{"dataset":"wazuh.alerts","module":"wazuh"},"fields":{"index_prefix":"wazuh-monitoring-"},"fileset":{"name":"alerts"},"host":{"name":"xlc"},"input":{"type":"log"},"log":{"file":{"path":"/var/ossec/logs/alerts/alerts.json"},"offset":122695554},"message":"{\"timestamp\":\"2022-02-03T10:27:52.438+0100\",\"rule\":{\"level\":5,\"description\":\"Registry Value Integrity Checksum Changed\",\"id\":\"750\",\"mitre\":{\"id\":[\"T1492\"],\"tactic\":[\"Impact\"],\"technique\":[\"Stored Data Manipulation\"]},\"firedtimes\":7,\"mail\":false,\"groups\":[\"ossec\",\"syscheck\",\"syscheck_entry_modified\",\"syscheck_registry\"],\"pci_dss\":[\"11.5\"],\"gpg13\":[\"4.13\"],\"gdpr\":[\"II_5.1.f\"],\"hipaa\":[\"164.312.c.1\",\"164.312.c.2\"],\"nist_800_53\":[\"SI.7\"],\"tsc\":[\"PI1.4\",\"PI1.5\",\"CC6.1\",\"CC6.8\",\"CC7.2\",\"CC7.3\"]},\"agent\":{\"id\":\"006\",\"name\":\"CPP\",\"ip\":\"10.74.37.3\"},\"manager\":{\"name\":\"xlc\"},\"id\":\"1643880472.68132386\",\"full_log\":\"Registry Value '[x32] HKEY_LOCAL_MACHINE\\\\System\\\\CurrentControlSet\\\\Services\\\\W32Time\\\\Config\\\\LastKnownGoodTime' modified\\nMode: scheduled\\nChanged attributes: md5,sha1,sha256\\nOld md5sum was: '5df5b1598b729d98734105148103abf2'\\nNew md5sum is : '361334bf60bdd83e30894c4f313d16ec'\\nOld sha1sum was: 'c233c8ccb56fbd363c44b51a9d51c7fa32512474'\\nNew sha1sum is : '7163cffa48f1a7c0bcb4a3ddff6278ae9a4895a6'\\nOld sha256sum was: 
'3aad3da22f2d53e8ac33c46c73f40c3e8f5db07188d166e24957d8a20b62b5f1'\\nNew sha256sum is : 'bee8072335d870a1624a541cb13ca5085ba85646a8417d4d894deff71c3f4a92'\\n\",\"syscheck\":{\"path\":\"HKEY_LOCAL_MACHINE\\\\System\\\\CurrentControlSet\\\\Services\\\\W32Time\\\\Config\",\"mode\":\"scheduled\",\"arch\":\"[x32]\",\"value_name\":\"LastKnownGoodTime\",\"size_after\":\"8\",\"md5_before\":\"5df5b1598b729d98734105148103abf2\",\"md5_after\":\"361334bf60bdd83e30894c4f313d16ec\",\"sha1_before\":\"c233c8ccb56fbd363c44b51a9d51c7fa32512474\",\"sha1_after\":\"7163cffa48f1a7c0bcb4a3ddff6278ae9a4895a6\",\"sha256_before\":\"3aad3da22f2d53e8ac33c46c73f40c3e8f5db07188d166e24957d8a20b62b5f1\",\"sha256_after\":\"bee8072335d870a1624a541cb13ca5085ba85646a8417d4d894deff71c3f4a92\",\"changed_attributes\":[\"md5\",\"sha1\",\"sha256\"],\"event\":\"modified\"},\"decoder\":{\"name\":\"syscheck_registry_value_modified\"},\"location\":\"syscheck\"}","service":{"type":"wazuh"}}, Private:file.State{Id:"native::1049-64776", PrevId:"", Finished:false, Fileinfo:(*os.fileStat)(0xc000fc9380), Source:"/var/ossec/logs/alerts/alerts.json", Offset:122697450, Timestamp:time.Time{wall:0xc07704f6d4cb3764, ext:510354422, loc:(*time.Location)(0x5620964fb2a0)}, TTL:-1, Type:"log", Meta:map[string]string(nil), FileStateOS:file.StateOS{Inode:0x419, Device:0xfd08}, IdentifierName:"native"}, TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:common.MapStr(nil)}} (status=400): {"type":"illegal_argument_exception","reason":"data_stream [<wazuh-monitoring-{2022.02.03||/d{yyyy.MM.dd|UTC}}>] must not contain the following characters [ , \", *, \\, <, |, ,, >, /, ?]"}
Could you please help with this? I tried google but with no success. Thank you.
Filebeat reads from alerts.json, you can check this file to see if the alerts are being generated. Judging from the log you provided, it looks like filebeat cannot send some logs to elasticsearch (Cannot index event publisher.Event), but we would need more details about the complete error and source logs causing that error. The output of the command # journalctl -f -u filebeat will be useful in this case to provide further assistance.
Based on previous experience, the problem could be that you have reached the maximum number of open shards, which is set to 1000 by default. If this is the case, you will see an error like the following: {"type":"validation_exception","reason":"Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [1000]/[1000] maximum shards open;"}
If that's the case, you can either reduce the number of shards or increase the limit to resolve the situation right now. I'd recommend the first approach if you only have 1 Elasticsearch node: having 1000 shards is not healthy for the environment in that case.
To reduce the number of shards, check the shard count configured in /etc/filebeat/wazuh-template.json and change it to "1", then restart filebeat. These actions will only affect indices created from now on. This guide can help you with cases like this one.
You can also try removing old indices. I would first check which indices you have stored. I suppose some of them are related to statistics or other stuff, so I would first try to remove those before the actual data (wazuh-alerts-).
You can use:
GET /_cat/indices
Since indices are created per day by default, you can remove, for instance, the indices older than 1 month and keep only the last month of data.
To prevent this from happening in the future, you may try implementing an Index Management Policy after you solve the issue at hand.
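As a rough sketch of the "remove indices older than 1 month" step, the helper below picks out index names whose date suffix is older than a cutoff. It assumes daily index names end in YYYY.MM.DD (for example wazuh-alerts-4.x-2022.02.03); that naming pattern, the function name, and the sample names are assumptions for illustration. It only selects names from a list you already fetched with GET /_cat/indices and deletes nothing itself.

```python
import re
from datetime import date, timedelta
from typing import Iterable, List

# Assumed naming convention: the index name ends with a YYYY.MM.DD date.
DATE_SUFFIX = re.compile(r"(\d{4})\.(\d{2})\.(\d{2})$")


def indices_older_than(names: Iterable[str], today: date, max_age_days: int) -> List[str]:
    """Return the index names whose date suffix is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    old = []
    for name in names:
        match = DATE_SUFFIX.search(name)
        if not match:
            continue  # no date suffix, leave the index alone
        year, month, day = (int(part) for part in match.groups())
        try:
            index_day = date(year, month, day)
        except ValueError:
            continue  # suffix looked like a date but is not a valid one
        if index_day < cutoff:
            old.append(name)
    return old


# Hypothetical index names for illustration.
sample_names = [
    "wazuh-alerts-4.x-2022.01.01",
    "wazuh-alerts-4.x-2022.02.01",
    "wazuh-monitoring-2021.12.15",
    ".kibana_1",
]

if __name__ == "__main__":
    for name in indices_older_than(sample_names, date(2022, 2, 3), 30):
        print(name)
```

Review the resulting list before deleting anything, since a deleted index cannot be recovered without a snapshot.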

Not able to disallocate shard from ES cluster

I have created an ES cluster with ES running on three different machines. To make them form a cluster, I added the unicast config below to the elasticsearch.yml config file on all 3 machines.
discovery.zen.ping.unicast.hosts: [IP1, IP2, IP3]
When I run
curl -XGET 'localhost:9200/_cluster/health?pretty'
I get number_of_nodes as 3. Now I want to remove one node from the cluster,
so without changing any config file I ran the command below:
curl -XPUT localhost:9200/_cluster/settings -d '{
"transient" :{
"cluster.routing.allocation.exclude._ip" : "IP_adress_of_Node3"
}
}';
After this I ran the health command again to get the cluster details. I expected number_of_nodes to be 2, but the result still shows 3 nodes even after excluding the node. It would be a great help if someone could tell me what is wrong with the steps I followed to remove the node.
Thanks
The cluster.routing.allocation.exclude._ip setting that you sent to your cluster will not actually remove the node from your cluster, but rather prepare it for removal. It instructs Elasticsearch to move all shards held on this node away from it and store them on other nodes instead.
This allows you to then remove the node once it is empty, without causing under-replication of the shards stored on this node.
To actually remove the node from your cluster you would need to remove it from your list of unicast hosts. Of course you can also just shut it down and leave it in the list until you next need to restart your cluster anyway, as far as I am aware that won't hurt anything.
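One way to tell when the excluded node is empty is to list the shards still located on its IP. The sketch below assumes you have already parsed the output of _cat/shards?format=json into a list of dicts, and that each row carries the default index, shard, prirep and ip columns; the function name and sample rows are made up for illustration.

```python
from typing import Any, Dict, List, Optional


def shards_on_ip(cat_shards: List[Dict[str, Optional[Any]]], ip: str) -> List[str]:
    """Return "index/shard/prirep" labels for shards located on the given IP."""
    remaining = []
    for row in cat_shards:
        # Unassigned shards have no ip, so they never match.
        if row.get("ip") == ip:
            remaining.append(f'{row["index"]}/{row["shard"]}/{row["prirep"]}')
    return remaining


# Hypothetical rows in the shape of _cat/shards?format=json output.
sample_rows = [
    {"index": "logs", "shard": "0", "prirep": "p", "state": "STARTED", "ip": "10.0.0.1"},
    {"index": "logs", "shard": "0", "prirep": "r", "state": "STARTED", "ip": "10.0.0.3"},
    {"index": "logs", "shard": "1", "prirep": "r", "state": "UNASSIGNED", "ip": None},
]

if __name__ == "__main__":
    left = shards_on_ip(sample_rows, "10.0.0.3")
    print("safe to shut down" if not left else f"still holding: {left}")
```

An empty list suggests the node no longer holds data and can be shut down without under-replicating any shard.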

OpenDJ vlv index error: # Server-side sort failed: Unwilling to Perform

I'm using OpenDJ 3.0.0 release version.
I have two base DNs: the first is dc=tenant1 and the second is dc=tenant2. The VLV index I created is based on dc=tenant1, but the LDAP search runs against dc=tenant2.
Here is the VLV index:
filter:
(&(objectClass=ns-nationsky-base-subject)(uid=)(cn=))
base dn: dc=tenant1
sort order:uid cn mail
scope: one level
I get "# Server-side sort failed: Unwilling to Perform" when I use ldapsearch with a VLV control, like below:
/ldapsearch -p 1389 -h localhost -D 'cn=Directory Manager' -w 'password' -b 'ou=People,ou=Subjects,dc=tenant2' -G 0:2000:1:0 -s one --sortorder uid "(uid=a)" cn
It works fine with few entries, but I always get the "# Server-side sort failed: Unwilling to Perform" error when there are too many entries in my server (say 15000).
From the access log, I can see an unindexed search:
[19/Sep/2016:23:06:38 +0800] SEARCH REQ conn=35 op=1 msgID=2 base="ou=People,ou=Subjects,dc=tenant2" scope=one filter="(uid=a)" attrs="cn"
[19/Sep/2016:23:06:40 +0800] SEARCH RES conn=35 op=1 msgID=2 result=0 nentries=8458 unindexed etime=2543
Any idea how I can fix it?
A VLV index and its queries are really meant to browse a well-known set of entries (like all users), not varying sets of entries.
So, in order to use a VLV index, the search request must match the base, the scope, the filter and the sort order defined for that index (and filters should be constant).
If the VLV index was defined with (&(objectClass=ns-nationsky-base-subject)(uid=)(cn=)), then a search with (uid=a) will not match the index, and thus the index cannot be used.
Server-side sorting is a very expensive operation. This is why, when there is no index, the server will refuse to sort many entries (governed by index-entry-limit). While it is possible to increase this limit, doing so has very serious implications for the amount of resources used by the server and may seriously degrade its performance.

percolate returns empty matches under heavy load during elasticsearch cluster resizing

We have an Elasticsearch cluster that resizes dynamically according to the number of percolate messages in a RabbitMQ queue.
We have a single shard and ~18K queries in our index, and we use auto_expand_replicas: "0-all" in the index settings to copy the single shard to every node as soon as it becomes available.
But under heavy load, while the cluster is resizing, some requests produce unexpected empty matches.
We send ~1M percolate requests daily and we were losing ~1K pieces of content. We added a cluster status check to our code: if the cluster status is not green before and after a percolate request, we wait for green status and resend the request. This reduced the lost content count from ~1K to ~100. We do not see this problem in a cluster with a fixed number of nodes.
Unfortunately, any loss is unacceptable in our scenario, and we don't want to give up auto scaling, so we need to find a workaround for this problem.
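The status check described above can be sketched roughly as follows. This is not our production code: the three callables (percolate, cluster_status, wait_for_green) are hypothetical stand-ins for whatever client calls you use, and max_attempts is an arbitrary choice. As noted, this approach reduced the losses but did not eliminate them.

```python
from typing import Any, Callable


def percolate_when_green(
    percolate: Callable[[], Any],
    cluster_status: Callable[[], str],
    wait_for_green: Callable[[], None],
    max_attempts: int = 5,
) -> Any:
    """Send a percolate request and accept the result only if the cluster
    was green both before and after the request; otherwise wait and resend."""
    for _ in range(max_attempts):
        if cluster_status() != "green":
            wait_for_green()  # block until the cluster reports green
        result = percolate()
        if cluster_status() == "green":
            return result  # green before and after: trust this result
    raise RuntimeError("cluster did not stay green during any attempt")


if __name__ == "__main__":
    # Fake callables simulating a cluster that turns yellow during the first request.
    statuses = iter(["green", "yellow", "yellow", "green"])
    calls = {"percolate": 0, "wait": 0}

    def fake_percolate():
        calls["percolate"] += 1
        return {"total": 1, "matches": [{"_index": "my-index", "_id": "tree"}]}

    def fake_wait():
        calls["wait"] += 1

    print(percolate_when_green(fake_percolate, lambda: next(statuses), fake_wait))
    print(calls)
```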
To reproduce the problem, you can use the following bash script:
https://gist.github.com/ekesken/de41598a1e7e54c6f33c
This script will download and install elasticsearch 1.5.2 in your current directory, create a local cluster with 10 nodes, create the index and percolation queries, and start testing.
Normally we expect the following output for a single percolate request:
curl -XGET 'localhost:9200/my-index/my-type/_percolate' -d '{
"doc" : {
"message" : "A new bonsai tree in the office"
}
}'
{"took":95,"_shards":{"total":1,"successful":1,"failed":0},"total":1,"matches":[{"_index":"my-index","_id":"tree"}]}
After running the script, if all shards on all nodes show as started in the http://localhost:9200/_cat/shards response and the test script is still running, you have not reproduced the problem. Try increasing the node count, which is 10 by default:
./repeat_percolation_loss.sh 15 test-only
When you reproduce the problem, the script will exit with the following output:
{"took":209,"_shards":{"total":1,"successful":1,"failed":0},"total":0,"matches":[]}
Problem repeated! Congratulations.
You can shut down all servers and clean up all directories and files created by the script with this command:
./repeat_percolation_loss.sh 15 clean
Replace the node count above with the latest node count you tried.

Elasticsearch: evacuate all data before shutdown of a data node?

Is there a way to tell a node to give up all of its data (spreading it back out among the other nodes) so that I can shut it down and not deal with a rebalance/re-replicate once it's down?
If I have 2 copies of each shard, and I drop one node, some of the shards now only have 1 live copy and it has to be re-replicated. I'd prefer to not drop down to 1 live copy for any period of time if I can.
After posting to the ES mailing list, I was informed the proper answer lies in the _cluster/settings api, specifically the cluster.routing.allocation.exclude._ip option.
From the docs: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-cluster.html
curl -XPUT localhost:9200/_cluster/settings -d '{
"transient" : {
"cluster.routing.allocation.exclude._ip" : "10.0.0.1"
}
}'
The IP address can be a comma separated list. To 'un-exclude', just remove the IP from the list (or set the list to "" to remove all excluded IPs).
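For scripting this, the request body with a comma-separated list can be built like so. This is only a small sketch: the function name is made up, and it produces just the JSON body, which you would then send to _cluster/settings with curl -XPUT as shown above.

```python
import json
from typing import Iterable


def exclusion_body(ips: Iterable[str]) -> str:
    """Build the _cluster/settings body that excludes the given IPs.

    An empty iterable yields "", which clears the exclusion list.
    """
    return json.dumps(
        {"transient": {"cluster.routing.allocation.exclude._ip": ",".join(ips)}}
    )


if __name__ == "__main__":
    print(exclusion_body(["10.0.0.1", "10.0.0.2"]))  # exclude two nodes
    print(exclusion_body([]))                        # un-exclude everything
```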
Hopefully this helps others looking for the answer to this same question.

Resources