Elasticsearch disk usage issue

I have a strange issue in my Elasticsearch cluster.
I have 5 nodes (4 combined data+master nodes and 1 dedicated master-only node), and each node has 5.7 TB of disk space.
On the first node the disk is almost completely full, while on the others it is only about half full, even though the number of shards is approximately the same on all nodes.
df -h from the first node:
/dev/mapper/vg1-data 5.8T 5.1T 717G 88% /var/lib/elasticsearch
And here is the _cat/allocation output:
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
354 5tb 5tb 714.9gb 5.7tb 87 10.0.5.21 10.0.5.21 elastic-01
392 3.2tb 3.2tb 2.5tb 5.7tb 55 10.0.5.23 10.0.5.23 elastic-02
393 3.8tb 3.8tb 1.8tb 5.7tb 67 10.0.5.28 10.0.5.28 elastic-07
392 3.9tb 3.9tb 1.7tb 5.7tb 69 10.0.5.27 10.0.5.27 elastic-06
I tried summing the sizes from _cat/shards | grep elastic-01, and it came out to only about 3.5 TB for all shards on this node:
curl -X GET http://10.0.5.22:9200/_cat/shards | grep elastic-01 | awk '{ print $6 }'
94.3kb
279b
1.4gb
333.9mb
13.2gb
260.4mb
11.5gb
20.3gb
28.5gb
10gb
12.8gb
365.7mb
9.1gb
263.3mb
92.5gb
951.1kb
266.4mb
35.9gb
10.8gb
299.6mb
22gb
526kb
31.2mb
110.1mb
1mb
46.9gb
19.3gb
358.1kb
17.9gb
22.4kb
11.7gb
3.9gb
5.1gb
427.2mb
1.1mb
48.4gb
elastic-01
75.3mb
6.7gb
30.6gb
43.8gb
31.1mb
21.3gb
10.7gb
1.1gb
17gb
5.1gb
38.4gb
49.1gb
20.2mb
7gb
7.3mb
7.3mb
383.1mb
322.7mb
130.9gb
18.5gb
34.1gb
291.8mb
537.3mb
1.6gb
15.6gb
96.4mb
7.4mb
5.8gb
114.3gb
4.3gb
25gb
7.4gb
7.4gb
638.1kb
10.5gb
175.6kb
275.9mb
33.2mb
806.8kb
35.5gb
40.1gb
17.1gb
408.6mb
115.2mb
69mb
20.3gb
542.4kb
28.4gb
385.6mb
12.9gb
1.3mb
5.5mb
66.6mb
17.5gb
18.7gb
35.6gb
10.9gb
986.3kb
10.3gb
19.1gb
412.8mb
34.4gb
22.6gb
5.1gb
883.4kb
5.3gb
10.4gb
276.4mb
31.9gb
34.5gb
58.1gb
22.3gb
18.8gb
93.9kb
176.5gb
249.3mb
38.1kb
12.1gb
19.7gb
7.6gb
24.7gb
779.9kb
11.2gb
4.9mb
19.1gb
1.2gb
21.1gb
30.4gb
3.8gb
276.5kb
26.3gb
379.9mb
10.4gb
5.5gb
31gb
802.4kb
868.3kb
43.9gb
5.8gb
463.5mb
18.7gb
3.3gb
12gb
4.3gb
32.1gb
3.3gb
11.3gb
1.2mb
944kb
118.2mb
25.8gb
23.9gb
799kb
410.4mb
6mb
5.1gb
32gb
30gb
7.8gb
32.3gb
24.9gb
25.1gb
18gb
16.4gb
1.2gb
915.2kb
4.9mb
29.2gb
59.5kb
1.3gb
150.8gb
1.6gb
11.2gb
17.4gb
439.4mb
6.3mb
21.6gb
394.9mb
26.9gb
23.5gb
43.8gb
28gb
8.9gb
19.5gb
30.3gb
31.8gb
14.7gb
19gb
34.9gb
41.3kb
63.4gb
41.8gb
22.7gb
15gb
32.6gb
281.4mb
379.5mb
8.6mb
3.6mb
37.7gb
10.9gb
818.7kb
19gb
115kb
112.3kb
10gb
7.4mb
685.2kb
332.9mb
5gb
20.2gb
39.5gb
8.6mb
289.5mb
19.3mb
289.6mb
1.1gb
1.6gb
24.8gb
18.1mb
915kb
22.4gb
5.8mb
429mb
261b
20.3gb
930.8kb
19.2gb
25.6gb
31gb
26.6gb
20.1gb
20.2gb
538.4kb
27.4gb
1.2mb
290.6mb
403.6mb
77.4mb
41.7gb
2.7gb
3gb
17.7gb
11.3gb
15.9gb
282.4mb
10.7gb
962.9kb
888.6kb
16.9gb
176.9gb
11.6gb
21.4gb
5.1mb
26.1gb
331.1mb
3.9gb
9.6gb
29.6gb
7.8gb
17.8gb
19.2gb
7.5gb
388.8mb
43.4gb
31.5gb
3gb
21.6mb
15.2gb
11.2gb
54.1gb
17.4gb
1.5gb
34.8gb
273.1mb
32.3gb
17.7gb
2.2gb
17.5gb
22.6gb
820.7kb
1gb
6.6gb
7.8mb
9.3gb
34.5gb
24.1gb
32.9gb
25.2gb
2.9gb
2.6gb
4.6mb
42.8gb
9.3gb
17.9kb
23.4gb
1.1gb
20.6gb
18.1gb
27gb
25.7gb
5mb
32.5gb
29.1gb
42kb
22.5gb
3.1mb
22.6gb
9.8gb
11gb
28.5gb
14.2gb
89.2kb
34.5gb
41.8gb
25gb
410.2mb
20.6gb
16.5gb
16.2gb
19.8gb
7.3gb
13.4gb
11.4gb
10.4gb
11.8gb
7.3mb
1.1gb
46.9gb
10.4gb
535.6mb
55.5gb
19.2gb
14.1gb
20.3gb
28.9gb
30.5gb
4.7gb
49.4gb
7.7gb
9.7gb
6.6gb
20.7gb
29.2gb
18.9gb
9.3gb
19gb
757.4kb
902.4kb
So why do both Elasticsearch and du -hs report that more space is in use? du -hs inside /var/lib/elasticsearch shows 5.1 TB as well:
du -hs *
5.1T nodes
4.0K range

You should change your URL to
http://10.0.5.22:9200/_cat/shards?bytes=b
so that you get whole numbers (bytes) instead of human-readable values. Your sum is probably off because adding up kb/mb/gb figures by hand is error-prone: when all figures are plain byte counts without units, I get 5,265,110,799,379, which matches the results you get from _cat/allocation and du -hs pretty well.
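For reference, a quick way to do that sum on the command line (a sketch, assuming the same endpoint and that the store size is still the 6th column, as in your awk command):
curl -s 'http://10.0.5.22:9200/_cat/shards?bytes=b' | grep elastic-01 | awk '{ sum += $6 } END { printf "%.0f bytes\n", sum }'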
As for why elastic-01 uses more disk space than the other nodes: it appears to have some very big shards on it. From the list you shared, these are the largest shard sizes (sorted descending):
176 900 000 000
176 500 000 000
150 800 000 000
130 900 000 000
114 300 000 000
92 500 000 000
63 400 000 000
58 100 000 000
55 500 000 000
54 100 000 000
49 400 000 000
49 100 000 000
We can see five shards whose size is well above 100 GB each, which is usually not a good sign: those shards have grown too big. Remember that shards are the unit of partitioning of your indices.
I'm pretty sure there are no shards that big on your other nodes.
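You can verify that by letting the _cat API sort shards by store size (a sketch, assuming the same endpoint and an Elasticsearch version recent enough to support the s= sort parameter):
curl -s 'http://10.0.5.22:9200/_cat/shards?v&bytes=b&h=index,shard,prirep,store,node&s=store:desc' | head -n 20
If that's right, the biggest shards will all show up on elastic-01.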
There are a couple of ways forward.
First, I would check whether the indices containing those shards hold time-based data and are not using Index Lifecycle Management (ILM).
Second, I would check whether those shards contain a lot of deleted documents, by looking at the docs.deleted column in the output of the following command:
GET _cat/indices?v
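Or, with curl and only the relevant columns (a sketch, assuming the same endpoint as your other commands; s= again requires a reasonably recent version):
curl -s 'http://10.0.5.22:9200/_cat/indices?v&h=index,docs.count,docs.deleted,store.size&s=docs.deleted:desc'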
If that's the case (i.e. if the index holds documents that are frequently updated), it might be possible to regain some space by running
POST <index_name>/_forcemerge?only_expunge_deletes=true
The previous command should be run with great care, because it needs free disk space to do its work and you don't have much left, so it might not be feasible in your case.
There are other options, but I would investigate these two points first.

Related

Daily index not created

On my single test server with 8 GB of RAM (1955m for the JVM heap) running Elasticsearch 7.4, I have 12 application indices plus a few system indices (such as .monitoring-es-7-2021.08.02, .monitoring-logstash-7-2021.08.02, .monitoring-kibana-7-2021.08.02) created daily. So on average Elasticsearch creates about 15 indices per day.
Today I can see that only two indices were created:
curl --silent -u elastic:xxxxx 'http://127.0.0.1:9200/_cat/indices?v' -u elastic | grep '2021.08.03'
Enter host password for user 'elastic':
yellow open metricbeat-7.4.0-2021.08.03 KMJbbJMHQ22EM5Hfw 1 1 110657 0 73.9mb 73.9mb
green open .monitoring-kibana-7-2021.08.03 98iEmlw8GAm2rj-xw 1 0 3 0 1.1mb 1.1mb
I think the reason for the above is the following. Looking into the Elasticsearch logs, I found:
[2021-08-03T12:14:15,394][WARN ][o.e.x.m.e.l.LocalExporter] [elasticsearch_1] unexpected error while indexing monitoring document org.elasticsearch.xpack.monitoring.exporter.ExportException: org.elasticsearch.common.ValidationException: Validation Failed: 1: this action would add [1] total shards, but this cluster currently has [1000]/[1000] maximum shards open;
logstash logs for application index and filebeat index
[2021-08-03T05:18:05,246][WARN ][logstash.outputs.elasticsearch][main] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"ping_server-2021.08.03", :_type=>"_doc", :routing=>nil}, #LogStash::Event:0x44b98479], :response=>{"index"=>{"_index"=>"ping_server-2021.08.03", "_type"=>"_doc", "_id"=>nil, "status"=>400, "error"=>{"type"=>"validation_exception", "reason"=>"Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [1000]/[1000] maximum shards open;"}}}}
[2021-08-03T05:17:38,230][WARN ][logstash.outputs.elasticsearch][main] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"filebeat-7.4.0-2021.08.03", :_type=>"_doc", :routing=>nil}, #LogStash::Event:0x1e2c70a8], :response=>{"index"=>{"_index"=>"filebeat-7.4.0-2021.08.03", "_type"=>"_doc", "_id"=>nil, "status"=>400, "error"=>{"type"=>"validation_exception", "reason"=>"Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [1000]/[1000] maximum shards open;"}}}}
Adding up active and unassigned shards totals 1000:
"active_primary_shards" : 512,
"active_shards" : 512,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 488,
"delayed_unassigned_shards" : 0,
"active_shards_percent_as_number" : 51.2
If I check with the command below, I see that all unassigned shards are replica shards:
curl --silent -XGET -u elastic:xxxx http://localhost:9200/_cat/shards | grep 'UNASSIGNED'
.
.
dev_app_server-2021.07.10 0 r UNASSIGNED
apm-7.4.0-span-000028 0 r UNASSIGNED
ping_server-2021.07.02 0 r UNASSIGNED
api_app_server-2021.07.17 0 r UNASSIGNED
consent_app_server-2021.07.15 0 r UNASSIGNED
Q. For now, can I safely delete the unassigned shards to free up some shard slots, since this is a single-node cluster?
Q. Can I change the settings from allocating 2 shards per index (1 primary and 1 replica) to just 1 primary shard, online, since this is a single server?
Q. If I have to keep one year of indices, is the calculation below correct?
15 indices daily with one primary shard each * 365 days = 5475 total shards (say 6000, rounded up)
Q. Can I set 6000 shards as the shard limit for this node so that I never run into this shard limit issue again?
Thanks,
You have a lot of unassigned shards (probably because you have a single node and all indices have number_of_replicas set to 1), so it's easy to get rid of all of them, and of the error at the same time, by running the following command:
PUT _all/_settings
{
"index.number_of_replicas": 0
}
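If you prefer curl, the same call looks like this (a sketch, reusing the host and credentials from your own commands):
curl --silent -u elastic:xxxxx -X PUT 'http://127.0.0.1:9200/_all/_settings' -H 'Content-Type: application/json' -d '{ "index.number_of_replicas": 0 }'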
Regarding the number of indices, you probably don't need to create one index per day if those indices stay small (i.e. below roughly 10 GB each). In that case the default limit of 1000 shards is more than enough and you don't have to change anything.
You should simply leverage Index Lifecycle Management (ILM) to keep your index sizes in check and avoid creating lots of small indices.
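As a starting point, a minimal ILM policy could roll indices over at a reasonable size and delete them after a retention period. This is only a sketch: the policy name, the 10 GB / 1 day rollover thresholds and the 30-day retention are example values to adapt, and you still need to attach the policy to your indices via an index template and a rollover alias.
curl --silent -u elastic:xxxxx -X PUT 'http://127.0.0.1:9200/_ilm/policy/daily-apps-policy' -H 'Content-Type: application/json' -d '{
  "policy": {
    "phases": {
      "hot": { "actions": { "rollover": { "max_size": "10gb", "max_age": "1d" } } },
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}'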

Elasticsearch bulk load performance issue

We want to increase the speed of our bulk loading.
We currently use Java to bulk-load documents into Elasticsearch. We plan to import 10M documents, each almost 8 MB in size, but at the moment we can only import about 400K documents per day, i.e. roughly 5 documents per second.
Our ES infrastructure is 3 master nodes with 4 GB ES_JAVA_OPTS (heap size), plus 2 data nodes and 2 client nodes with 2 GB of memory each. Whenever we try to increase the bulk-load speed, we run into heap size errors. The cluster runs on Kubernetes.
The disk I/O looks like this:
dd if=/dev/zero of=/data/tmp/test1.img bs=1G count=10 oflag=dsync
10737418240 bytes (11 GB) copied, 50.7528 s, 212 MB/s
dd if=/dev/zero of=/data/tmp/test2.img bs=512 count=100000 oflag=dsync
51200000 bytes (51 MB) copied, 336.107 s, 152 kB/s
Any advice for improving this? Here is the bulk-loading loop:
for (int x = 0; x < 200000; x++) {
    BulkRequest bulkRequest = new BulkRequest();
    for (int k = 0; k < 50; k++) {
        Order order = generateOrder();
        IndexRequest indexRequest = new IndexRequest("orderpot", "orderpot");
        Object esDataMap = objectToMap(order);
        String source = JSONObject.valueToString(esDataMap);
        indexRequest.source(source, XContentType.JSON);
        bulkRequest.add(indexRequest);
    }
    // synchronous bulk call: 50 documents (~400 MB at 8 MB each) per request
    rhlclient.bulk(bulkRequest, RequestOptions.DEFAULT);
}
This is where we hit the "over heap size" errors.
It seems you need more memory for your data nodes: 10M documents at almost 8 MB each will cost a lot of memory. You can reduce the memory of the master nodes and give it to the data nodes, since master nodes need less memory than data nodes; and if you cannot add more nodes, you can combine the client nodes with the data nodes so that more data nodes share the load.
Some other advice:
1. Disable refresh by setting index.refresh_interval to -1 and set index.number_of_replicas to 0 while indexing (a curl sketch follows below).
2. Set an explicit mapping for your index instead of relying on the default dynamic mapping: some fields can be integer rather than long, some text fields will never be queried as keyword (so they don't need the keyword sub-field), and some fields will only ever be used as text.
See also the official guide on tuning for indexing speed: https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-indexing-speed.html
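For point 1, a sketch of the settings change, assuming an index named orderpot as in your code and a cluster reachable on localhost:9200 (remember to restore the values once the load is done):
# before the bulk load
curl -s -X PUT 'http://localhost:9200/orderpot/_settings' -H 'Content-Type: application/json' -d '{ "index": { "refresh_interval": "-1", "number_of_replicas": 0 } }'
# after the bulk load
curl -s -X PUT 'http://localhost:9200/orderpot/_settings' -H 'Content-Type: application/json' -d '{ "index": { "refresh_interval": "1s", "number_of_replicas": 1 } }'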

Error 503, NoShardAvailableActionException with Elasticsearch

I am having problems when looking up a record inside an index, and the error message is the following:
TransportError: TransportError(503, u'NoShardAvailableActionException[[new_gompute_history][2] null]; nested: IllegalIndexShardStateException[[new_gompute_history][2] CurrentState[POST_RECOVERY] operations only allowed when started/relocated]; ')
This happens when I search by id within the index.
The health of my cluster is green:
GET _cat/health?v
epoch timestamp cluster status node.total node.data shards pri relo init unassign
1438678496 10:54:56 juan green 5 4 212 106 0 0 0
GET _cat/allocation?v
shards disk.used disk.avail disk.total disk.percent host ip node
53 3.1gb 16.8gb 20gb 15 bc10-05 10.8.5.15 Anomaloco
53 6.4gb 80.8gb 87.3gb 7 bc10-03 10.8.5.13 Algrim the Strong
0 0b l8a 10.8.0.231 logstash-l8a-5920-4018
53 6.4gb 80.8gb 87.3gb 7 bc10-03 10.8.5.13 Harry Leland
53 3.1gb 16.8gb 20gb 15 bc10-05 10.8.5.15 Hypnotia
I worked around it by putting a sleep between consecutive PUTs, but I do not like this solution.
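A slightly less arbitrary version of that workaround (a sketch, assuming the cluster is reachable on localhost:9200) is to wait until the index's primary shards are actually started before querying, instead of sleeping for a fixed time:
curl -s 'http://localhost:9200/_cluster/health/new_gompute_history?wait_for_status=yellow&timeout=30s'
The call returns as soon as all primaries of new_gompute_history are started (i.e. no longer in POST_RECOVERY), or after the timeout.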

WHM cPanel Disk Usage Incorrect

I have cPanel/WHM installed on a 40 GB partition; however, WHM shows that 8.9 GB out of 9.9 GB is in use. How do I correct this?
This is on an AWS EC2 instance. The root volume is configured to 40gb.
After running df -h :
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
9.9G 8.9G 574M 95% /
/dev/hda1 99M 26M 69M 28% /boot
tmpfs 1006M 0 1006M 0% /dev/shm
So that shows that /dev/mapper/VolGroup00-LogVol00 is 9.9 GB. However, if I run parted and print the configuration, I can see:
Model: QEMU HARDDISK (ide)
Disk /dev/hda: 42.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number Start End Size Type File system Flags
1 32.3kB 107MB 107MB primary ext3 boot
2 107MB 21.5GB 21.4GB primary lvm
I need the whole 40 GB for cPanel/WHM. Why would it limit itself to 1/4 of the disk?
After Running vgs
VG #PV #LV #SN Attr VSize VFree
VolGroup00 1 2 0 wz--n- 19.88G 0
pvs:
PV VG Fmt Attr PSize PFree
/dev/hda2 VolGroup00 lvm2 a-- 19.88G 0
lvs:
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
LogVol00 VolGroup00 -wi-ao 10.22G
LogVol01 VolGroup00 -wi-ao 9.66G
fdisk -l
Disk /dev/hda: 42.9 GB, 42949672960 bytes
255 heads, 63 sectors/track, 5221 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/hda1 * 1 13 104391 83 Linux
/dev/hda2 14 2610 20860402+ 8e Linux LVM
Disk /dev/dm-0: 10.9 GB, 10972299264 bytes
255 heads, 63 sectors/track, 1333 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/dm-0 doesn't contain a valid partition table
Disk /dev/dm-1: 10.3 GB, 10368319488 bytes
255 heads, 63 sectors/track, 1260 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/dm-1 doesn't contain a valid partition table
Where are you checking disk usage in WHM? Can you please share the output of the following command so that I can assist you with this:
df -h
I think there is free space on your server in the LVM layout. Can you please check this with the following commands and let me know:
vgs
pvs
lvs
fdisk -l
And if you find any free space in your volume group, you will have to grow the logical volume with the lvextend command. You can see how at http://www.24x7servermanagement.com/blog/how-to-increase-the-size-of-the-logical-volume/ (a short sketch follows below).
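For reference, once the free space has actually been made available to the volume group (your vgs output shows VFree 0, so the hda2 partition/PV has to be grown or a new PV added first), the resize itself is just two commands. A sketch, assuming the root LV is /dev/VolGroup00/LogVol00 with ext3 on it, as in your lvs and df output:
lvextend -l +100%FREE /dev/VolGroup00/LogVol00   # grow the logical volume into the free extents
resize2fs /dev/VolGroup00/LogVol00               # grow the ext3 filesystem to fill the LV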

Cassandra read latency high even with row caching, why?

I am testing Cassandra performance with a simple data model:
CREATE TABLE "NoCache" (
key ascii,
column1 ascii,
value ascii,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='ALL' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
I am fetching 100 columns of a row key using pycassa's get/xget functions, but I am getting a read latency of about 15 ms on the server:
colums=COL_FAM.get(row_key, column_count=100)
nodetool cfstats
Column Family: NoCache
SSTable count: 1
Space used (live): 103756053
Space used (total): 103756053
Number of Keys (estimate): 128
Memtable Columns Count: 0
Memtable Data Size: 0
Memtable Switch Count: 0
Read Count: 20
Read Latency: 15.717 ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 976
Compacted row minimum size: 4769
Compacted row maximum size: 557074610
Compacted row mean size: 87979499
A latency like this is astonishing, given that nodetool info shows the reads hit the row cache directly:
Row Cache : size 4834713 (bytes), capacity 67108864 (bytes), 35 hits, 38 requests, 1.000 recent hit rate, 0 save period in seconds
Can anyone tell me why Cassandra is taking so much time when reading from the row cache?
Enable tracing and see what it's doing: http://www.datastax.com/dev/blog/tracing-in-cassandra-1-2
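A sketch of what that can look like with probabilistic tracing (assuming a Cassandra 1.2+ node where nodetool and the system_traces keyspace are available; the 0.1 probability is just an example):
nodetool settraceprobability 0.1   # trace roughly 10% of requests coordinated by this node
# run the pycassa reads again, then inspect the recorded sessions in cqlsh:
# SELECT session_id, duration, request FROM system_traces.sessions LIMIT 20;
The duration column (in microseconds) and the per-step breakdown in system_traces.events show where the time is actually being spent.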
