HDFS storage check shows different values - hadoop

I faced a weird situation where I'm getting different results from the hdfs dfs -du command and what I see in the Cloudera Manager UI. I read about the differences between the two but didn't find any clue that could help me catch the issue and solve it.
I also deleted all the snapshots and disallowed them, but the storage didn't change.
Below is the output :
[cloudera-scm@roor-chc101 root]$ hdfs dfs -du -h -s .
2.3 G 5.8 G .
[cloudera-scm@roor-chc101 root]$ hdfs dfs -du -h -s /
250.3 T 749.3 T
[Cloudera Manager UI screenshot]
I also checked hdfs dfsadmin -report, which shows the same results as the UI:
Configured Capacity: 1.54 PB
DFS Used: 897.77 TB
Non DFS Used: 2.98 GB
DFS Remaining: 682.27 TB
DFS Used%: 56.82%
DFS Remaining%: 43.18%
Block Pool Used: 897.77 TB
Block Pool Used%: 56.82%
DataNodes usages% (Min/Median/Max/stdDev): 11.17% / 58.94% / 69.35% / 13.31%
Live Nodes 45 (Decommissioned: 0)
Dead Nodes 0 (Decommissioned: 0)
Decommissioning Nodes 0
Total Datanode Volume Failures 0 (0 B)
Number of Under-Replicated Blocks 0
Number of Blocks Pending Deletion 0
Block Deletion Start Time 8/14/2017, 10:57:30 AM

I included both hdfs dfs -du -h -s . and hdfs dfs -du -h -s / just to show that there is nothing I may have missed under /user/cloudera-scm.
The Cloudera Manager UI shows the same value as hdfs dfsadmin -report.
BTW, I solved the issue: I found that /tmp/logs/cloudera-scm used 150 T. I'm still interested in why that directory wasn't taken into account when I ran hdfs dfs -du -h -s .
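For anyone hitting the same gap, keep in mind that hdfs dfs -du prints the raw size first and the size including replication second, while dfsadmin's DFS Used is physical usage across all DataNodes, so 749.3 T is the column to compare against 897.77 TB. A minimal sketch of how one can drill down level by level with du to locate large consumers (the descent path below simply retraces the /tmp/logs directory found above):
# Walk the tree one level at a time to find where the unaccounted bytes live
# (run as the HDFS superuser so no subtree is skipped for lack of permissions).
hdfs dfs -du -h /           # raw size / size with replication, per top-level dir
hdfs dfs -du -h /tmp        # descend into the largest entry...
hdfs dfs -du -h /tmp/logs   # ...until the heavy directory shows up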

Related

“hdfs dfs -du” vs “hdfs dfs -count”: differences when expecting the same results

Why do hdfs dfs -du -s and hdfs dfs -count -v (supposedly the same byte count, via the CONTENT_SIZE field) give (nearly but) not the same values?
Example
# at user1@borderNode1
hdfs dfs -count -v "hdfs://XYZ/apps/hive/warehouse/p_xx_db.db"
# DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
# 9087 1610048 141186781009632 hdfs://XYZ/apps/hive/warehouse/p_xx_db.db
hdfs dfs -du -s "hdfs://XYZ/apps/hive/warehouse/p_xx_db.db"
#141186781010380 hdfs://XYZ/apps/hive/warehouse/p_xx_db.db
The value 141186781009632 is not 141186781010380.
The difference, 141186781010380 - 141186781009632 = 748 bytes, is less than the block size (134217728 in this example)... so perhaps one is exact and the other is not, but I don't see this kind of documentation for Hadoop.
PS: no clues here, nor in the guide:
hdfs dfs -count: "Count the number of ... bytes under the directory ... output column CONTENT_SIZE".
hdfs dfs -du: "Displays sizes of files ... contained in the given directory".
The guide says only that both are the number of bytes contained under the directory.
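One way to narrow down where the two counters diverge (a sketch; the database path is the one from the example above, and the -C flag of ls, which prints bare paths, needs a reasonably recent Hadoop release):
# compare du's byte count with count's CONTENT_SIZE for each immediate child
# and print only the entries where the two disagree
for d in $(hdfs dfs -ls -C "hdfs://XYZ/apps/hive/warehouse/p_xx_db.db"); do
  du=$(hdfs dfs -du -s "$d" | awk '{print $1}')
  cs=$(hdfs dfs -count "$d" | awk '{print $3}')
  [ "$du" != "$cs" ] && echo "$d du=$du content_size=$cs"
done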

How to know the exact block size of a file on a Hadoop node?

I have a 1 GB file that I've put on HDFS. So, it would be broken into blocks and sent to different nodes in the cluster.
Is there any command to identify the exact size of the block of the file on a particular node?
Thanks.
You should use the hdfs fsck command:
hdfs fsck /tmp/test.txt -files -blocks
This command will print information about all the blocks the file consists of:
/tmp/test.tar.gz 151937000 bytes, 2 block(s): OK
0. BP-739546456-192.168.20.1-1455713910789:blk_1073742021_1197 len=134217728 Live_repl=3
1. BP-739546456-192.168.20.1-1455713910789:blk_1073742022_1198 len=17719272 Live_repl=3
As you can see, the len field in each row shows the actual size used by each block.
There are many other useful features of hdfs fsck, which you can see on the official Hadoop documentation page.
You can try:
hdfs getconf -confKey dfs.blocksize
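Note that getconf returns the cluster-wide default, and a file can be written with a non-default block size. To read the block size actually recorded for a particular file, the stat format specifier %o can be used (the path below reuses the example file from the fsck answer above):
hdfs dfs -stat "%o" /tmp/test.txt   # per-file block size, in bytes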
I do not have the reputation to comment.
Have a look at the documentation page for setting various properties, which covers
dfs.blocksize
Apart from the configuration, you can view the actual size of a file with
hadoop fs -ls fileNameWithPath
e.g.
hadoop fs -ls /user/edureka
output:
-rwxrwxrwx 1 edureka supergroup 391355 2014-09-30 12:29 /user/edureka/cust

HDFS file blocks distribution in two node cluster

Environment
Hadoop : 0.20.205.0
Number of machines in cluster : 2 nodes
Replication : set to 1
DFS Block size : 1MB
I put a 7.4 MB file into HDFS using the put command. I ran the fsck command to check the block distribution of the file among the DataNodes, and I see that all 8 blocks of the file went to only one node. This hurts load distribution, and only one node ever gets used when running mapred tasks.
Is there a way I can distribute the file's blocks across more than one DataNode?
bin/hadoop dfsadmin -report
Configured Capacity: 4621738717184 (4.2 TB)
Present Capacity: 2008281120783 (1.83 TB)
DFS Remaining: 2008281063424 (1.83 TB)
DFS Used: 57359 (56.01 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 2 (6 total, 4 dead)
Name: 143.215.131.246:50010
Decommission Status : Normal
Configured Capacity: 2953506713600 (2.69 TB)
DFS Used: 28687 (28.01 KB)
Non DFS Used: 1022723801073 (952.49 GB)
DFS Remaining: 1930782883840(1.76 TB)
DFS Used%: 0%
DFS Remaining%: 65.37%
Last contact: Fri Jul 18 10:31:51 EDT 2014
bin/hadoop fs -put /scratch/rkannan3/hadoop/test/pg20417.txt /user/rkannan3
bin/hadoop fs -ls /user/rkannan3
Found 1 items
-rw------- 1 rkannan3 supergroup 7420270 2014-07-18 10:40 /user/rkannan3/pg20417.txt
bin/hadoop fsck /user/rkannan3 -files -blocks -locations
FSCK started by rkannan3 from /143.215.131.246 for path /user/rkannan3 at Fri Jul 18 10:43:13 EDT 2014
/user/rkannan3 <dir>
/user/rkannan3/pg20417.txt 7420270 bytes, 8 block(s): OK <==== All the 8 blocks in one DN
0. blk_3659272467883498791_1006 len=1048576 repl=1 [143.215.131.246:50010]
1. blk_-5158259524162513462_1006 len=1048576 repl=1 [143.215.131.246:50010]
2. blk_8006160220823587653_1006 len=1048576 repl=1 [143.215.131.246:50010]
3. blk_4541732328753786064_1006 len=1048576 repl=1 [143.215.131.246:50010]
4. blk_-3236307221351862057_1006 len=1048576 repl=1 [143.215.131.246:50010]
5. blk_-6853392225410344145_1006 len=1048576 repl=1 [143.215.131.246:50010]
6. blk_-2293710893046611429_1006 len=1048576 repl=1 [143.215.131.246:50010]
7. blk_-1502992715991891710_1006 len=80238 repl=1 [143.215.131.246:50010]
If you want distribution at the file level, use at least a replication factor of 2. The first replica is always placed on the node where the writer is located
(see the introduction paragraph in http://waset.org/publications/16836/optimizing-hadoop-block-placement-policy-and-cluster-blocks-distribution), and normally a file has only one writer, so the first replica of every block of a file will always land on that node. You probably don't want to change that behaviour, because you want to keep the option of increasing the minimum split size when you want to avoid spawning too many mappers, without losing data locality for the mappers.
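For an already-written file, the replication factor can be raised after the fact, which forces the NameNode to place the additional replicas on other DataNodes (a sketch using the file from the question):
# raise replication to 2; -w waits until the new replicas are actually placed
hadoop fs -setrep -w 2 /user/rkannan3/pg20417.txt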
You must use the Hadoop balancer command. Details below.
Balancer
Runs a cluster balancing utility. You can simply press Ctrl-C to stop the rebalancing process.
Usage: hadoop balancer [-threshold <threshold>]
-threshold <threshold> Percentage of disk capacity. This overwrites the default threshold.
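For example (the 5% threshold is illustrative; on a 0.20-era release the entry point is hadoop balancer, newer releases use hdfs balancer):
# rebalance until each DataNode's utilization is within 5% of the cluster average
hadoop balancer -threshold 5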

Centralized Cache failed in hadoop-2.3

I want to use Centralized Cache Management in Hadoop 2.3.
Here are my steps (10 nodes, 6 GB of memory per node).
1. My file (45 MB) to be cached:
[hadoop@Master ~]$ hadoop fs -ls /input/pics/bundle
Found 1 items
-rw-r--r-- 1 hadoop supergroup 47185920 2014-03-09 19:10 /input/pics/bundle/bundle.chq
2. Create a cache pool:
[hadoop@Master ~]$ hdfs cacheadmin -addPool myPool -owner hadoop -group supergroup
Successfully added cache pool myPool.
[hadoop@Master ~]$ hdfs cacheadmin -listPools -stats
Found 1 result.
NAME OWNER GROUP MODE LIMIT MAXTTL BYTES_NEEDED BYTES_CACHED BYTES_OVERLIMIT FILES_NEEDED FILES_CACHED
myPool hadoop supergroup rwxr-xr-x unlimited never 0 0 0 0 0
3. Add a cache directive:
[hadoop@Master ~]$ hdfs cacheadmin -addDirective -path /input/pics/bundle/bundle.chq -pool myPool -force -replication 3
Added cache directive 2
4. List directives:
[hadoop@Master ~]$ hdfs cacheadmin -listDirectives -stats -path /input/pics/bundle/bundle.chq -pool myPool
Found 1 entry
ID POOL REPL EXPIRY PATH BYTES_NEEDED BYTES_CACHED FILES_NEEDED FILES_CACHED
2 myPool 3 never /input/pics/bundle/bundle.chq 141557760 0 1 0
BYTES_NEEDED is right, but BYTES_CACHED is zero. It seems the size has been calculated, but the cache action that puts the file into memory has not been carried out. So how do I cache my file into memory?
Thank you very much.
There were a bunch of caching bugs in Hadoop 2.3 that we have since fixed. I would recommend using at least Hadoop 2.4 for HDFS caching.
To get into more detail I would need to see the log messages.
Including the output of hdfs dfsadmin -report would also be useful, as well as ensuring that you have followed the setup instructions here (namely, increasing the ulimit and setting dfs.datanode.max.locked.memory):
http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html
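For reference, a minimal check of the two prerequisites named above (the 1 GB figure is illustrative; dfs.datanode.max.locked.memory is set in hdfs-site.xml on each DataNode and must not exceed the DataNode user's memlock ulimit):
# memlock ulimit for the user running the DataNode, reported in KB;
# it must cover dfs.datanode.max.locked.memory (e.g. 1073741824 bytes = 1 GB)
ulimit -l
# once caching is configured and the DataNodes restarted, cache counters
# (Cache Used / Cache Remaining) should show up in the report
hdfs dfsadmin -report | grep -i cache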

Checking filesize and its distribution in HDFS

Is it possible to know the file size in blocks and its distribution over DataNodes in Hadoop?
Currently I am using:
frolo@A11:~/hadoop> $HADOOP_HOME/bin/hadoop dfs -stat "%b %o %r %n" /user/frolo/input/rmat-*
318339 67108864 1 rmat-10.0
392835957 67108864 1 rmat-20.0
This does not show the actual number of blocks created after uploading the file to HDFS, and I don't know how to find out their distribution.
Thanks,
Alex
The %r in your stat command shows the replication factor of the queried file. If this is 1, it means there will be only a single replica across the cluster for blocks belonging to this file. The hadoop fs -ls output also shows this value for listed files as one of its numeric columns, as the replication factor is a per-file FS attribute.
If you are looking to find where the blocks reside, you are looking for hdfs fsck (or hadoop fsck if using a dated release) instead. The below, for example, will let you see the list of block IDs and their respective resident locations for any file:
hdfs fsck /user/frolo/input/rmat-10.0 -files -blocks -locations
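As a side note, the block count that -stat does not print directly can be derived from its own output: ceil(%b / %o). A quick check with the figures from the question (a sketch in plain shell arithmetic):
# blocks = ceil(file size / block size); for rmat-20.0 above:
# 392835957 bytes at a 67108864-byte block size -> 6 blocks
echo $(( (392835957 + 67108864 - 1) / 67108864 ))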
