EC2 storage: can't find which directory is using the major chunk of the storage - amazon-ec2

Below are the stats for my EC2 server:
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/root       146G  135G    11G   93%  /
The command sudo du -shc /* reports 33G total,
and sudo du -shc /home/ubuntu/* reports 2.7G total.
Where is the remaining ~100GB of disk space being used?
There seems to be a process writing to disk; is heavy disk I/O the reason?
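A common cause of a gap like this between df and du (not something the stats above can confirm on their own) is a process holding deleted files open, or data hidden underneath a mount point. A couple of checks along those lines:
sudo lsof +L1                            # open files whose directory entry was deleted (link count 0); their space is freed only when the process closes them or is restarted
sudo du -xsh /* 2>/dev/null | sort -h    # per-directory totals, staying on the root filesystem only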

Related

How to create a RAM disk with Golang in a platform-independent way?

A RAM disk can be created from the command line with different commands on different operating systems.
For example, to create a RAM disk of 512MB:
On macOS:
diskutil eraseVolume HFS+ "RAMDisk" `hdiutil attach -nomount ram://1048576`
On Ubuntu:
mkdir /mnt/ramdisk
mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk
Is there any cross-platform method/way to create a RAM disk in Golang?
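For reference, the Ubuntu tmpfs example above can be verified and removed again from the shell like this:
df -h /mnt/ramdisk          # should show a 512M tmpfs filesystem
sudo umount /mnt/ramdisk    # detach the RAM disk; its contents are discarded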

How to know the disk space of data nodes in Hadoop?

Is there a way or a command to find out the disk space of each datanode, or the total cluster disk space?
I tried the command
dfs -du -h /
but it seems that I do not have permission on many directories, so I cannot get the actual disk space.
From the UI:
http://namenode:50070/dfshealth.html#tab-datanode
This will give you all the details about the datanodes.
From the command line:
To get the disk space of each datanode:
sudo -u hdfs hdfs dfsadmin -report
This will give you the details of the entire HDFS filesystem and of each individual datanode. Or:
sudo -u hdfs hdfs dfs -du -h /
This will give you the total disk usage of each folder under the root / directory.
You can view information about all datanodes and their disk usage in the Datanodes tab of the namenode UI.
The total cluster disk space can be seen in the summary section of the main page:
http://namenode-ip:50070
If your Hadoop cluster is configured with simple security, you can execute the commands below to get the usage of the data nodes:
export HADOOP_USER_NAME=hdfs
hadoop dfsadmin -report
* The export gives you admin privileges under simple security. If you use a different user as the HDFS admin, replace hdfs with that admin user.
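If you prefer not to leave the variable set in your shell session, the same report can be produced by setting it only for that one command:
HADOOP_USER_NAME=hdfs hadoop dfsadmin -report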
An alternative is to log in to each datanode and run the Unix command below to get the disk utilization of that server:
df -h
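If you have SSH access to the datanodes, the same check can be scripted from a single machine (a sketch; datanode1, datanode2, datanode3 are placeholder hostnames):
for host in datanode1 datanode2 datanode3; do
  echo "== $host =="
  ssh "$host" df -h    # local disk utilization on that datanode
done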
Hadoop 3.2.0:
hduser@hadoop-node1:~$ hdfs dfs -df
Filesystem                          Size          Used     Available  Use%
hdfs://hadoop-node1:54310  3000457228288  461352007680  821808787456   15%
hduser@hadoop-node1:~$
For human-readable numbers, use:
hduser@hadoop-node1:~$ hdfs dfs -df -h
Filesystem                  Size     Used  Available  Use%
hdfs://hadoop-node1:54310  2.7 T  429.7 G    765.4 G   15%
hduser@hadoop-node1:~$

Spark job crash with no space left on device because of hdfs "non DFS usage"

I am running a Spark SQL job on a small dataset (25 GB), and I always end up filling up the disks and eventually crashing my executors.
$ hdfs dfs -df -h
Filesystem              Size  Used  Available  Use%
hdfs://x.x.x.x:8020  138.9 G  3.9 G     14.2 G    3%
$ hdfs dfs -du -h
2.5 G .sparkStaging
0 archive
477.3 M checkpoint
When this happens, I have to leave safe mode: hdfs dfsadmin -safemode leave
Looking at the Spark job itself, it is obviously a problem of shuffling or caching DataFrames. However, any idea why df reports such inconsistent Used/Available sizes, and why du does not list the files?
I understand that it's related to the "non DFS usage" I can see in the namenode overview. But why does Spark use up so much "hidden" space, to the point that it makes my job crash?
Configured Capacity: 74587291648 (69.46 GB)
DFS Used: 476200960 (454.14 MB)
Non DFS Used: 67648394610 (63.00 GB)
DFS Remaining: 6462696078 (6.02 GB)
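One way to see where that non-DFS space goes is to look at the local disks on a worker node directly; the path below is only illustrative, since the actual scratch locations depend on your yarn.nodemanager.local-dirs / spark.local.dir settings:
df -h                                                                           # compare local usage with the HDFS report above
sudo du -sh /path/to/yarn/local/dirs/usercache/* 2>/dev/null | sort -h | tail   # shuffle and cache spill under the NodeManager local dirs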

HDFS space usage on fresh install

I just installed HDFS and launched the service,
and there is already more than 800MB of used space.
What does it represent?
$ hdfs dfs -df -h
Filesystem                         Size     Used  Available  Use%
hdfs://quickstart.cloudera:8020  54.5 G  823.7 M     43.4 G    1%
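To see which HDFS paths account for that usage, you can list the per-directory totals, the same command used in the answers above:
sudo -u hdfs hdfs dfs -du -h /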

Caching a directory in RHEL / CentOS

How do I cache a particular directory in RHEL / CentOS? Suppose I have a directory that contains 10 GB of data and I have 48 GB of RAM. How can I cache all the data inside that directory (only this specific directory) in memory for a specific amount of time, or indefinitely?
There is a standard memory-backed filesystem on every Linux system, mounted at /dev/shm.
When you run the mount command you will see:
tmpfs on /dev/shm type tmpfs (rw)
Generally it is about half the size of the system's memory, so if you have 48GB of RAM its size will be about 24GB (you can check this by running df -h).
You can use /dev/shm as if it were a normal hard drive; for example, you can copy a directory to it:
cp -r "YOUR DIRECTORY" /dev/shm/
That will do the trick.
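If you would rather not share /dev/shm with everything else on the system, a dedicated tmpfs mount sized to the data works the same way (a sketch; /mnt/cache and the 12g size are just examples):
sudo mkdir -p /mnt/cache
sudo mount -t tmpfs -o size=12g tmpfs /mnt/cache
cp -r "YOUR DIRECTORY" /mnt/cache/
# the copy lives in RAM until the mount is removed or the machine reboots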
