No. of files Vs No. of blocks in HDFS - hadoop

I am running a single-node Hadoop environment. When I ran $ hadoop fsck /user/root/mydatadir -blocks, I was really confused by the output it gave:
Status: HEALTHY
Total size: 998562090 B
Total dirs: 1
Total files: 50 (Files currently being written: 1)
Total blocks (validated): 36 (avg. block size 27737835 B) (Total open file blocks (not validated): 1)
Minimally replicated blocks: 36 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 36 (100.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 72 (200.0 %)
Number of data-nodes: 1
Number of racks: 1
It says I have written 50 files, yet it only uses 36 blocks (I just ignore the file currently being written).
From my understanding, each file uses at least one block even if its size is less than the HDFS block size (64 MB for me, the default). That is, I expected 50 blocks for 50 files. What is wrong with my understanding?

Files do not each require a full block. The concern is the overhead of managing them and, if you have truly many of them, NameNode memory utilization:
From Hadoop - The Definitive Guide:
Small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.
Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files.
However, a single block only contains data from a single file, unless the files are packed into a container such as HAR or SequenceFile, or read together with CombineFileInputFormat. Note also that a file occupies a block only if it has content: zero-length files are counted as files by fsck but have no blocks, which is most likely why you see 36 blocks for 50 files. For more background, see the HDFS small files problem.
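As a hedged illustration (the archive and directory names here are made up), packing the directory above into a HAR looks roughly like this; note that hadoop archive runs a MapReduce job to build the archive:
# Pack the small files under /user/root/mydatadir into a single archive,
# written as /user/root/archives/mydata.har.
hadoop archive -archiveName mydata.har -p /user/root mydatadir /user/root/archives
# The archived files stay transparently readable through the har:// scheme.
hdfs dfs -ls har:///user/root/archives/mydata.har/mydatadir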

Related

In Hadoop, what do under-replication and over-replication mean, and how do they work?

What do under-replicated and over-replicated blocks mean in HDFS, and how does HDFS balance them?
I think you are aware that the default replication factor is 3.
Over-replicated blocks are blocks that exceed the target replication for the file they belong to. Normally, over-replication is not a problem, and HDFS will automatically delete excess replicas; that is how balance is restored in this case.
Under-replicated blocks are blocks that do not meet their target replication for the file they belong to.
To balance these HDFS will automatically create new replicas of under-replicated blocks until they meet the target replication.
You can get information about the blocks being replicated (or waiting to be replicated) using
hdfs dfsadmin -metasave.
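For example (the output file name is just a placeholder), -metasave dumps that state to a file under the NameNode's log directory:
# Writes blocks waiting for replication, blocks currently being replicated,
# and DataNode status to meta.txt under hadoop.log.dir on the NameNode.
hdfs dfsadmin -metasave meta.txt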
If you execute the command below, you will get detailed statistics:
hdfs fsck /
......................
Status: HEALTHY
Total size: 511799225 B
Total dirs: 10
Total files: 22
Total blocks (validated): 22 (avg. block size 23263601 B)
Minimally replicated blocks: 22 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 4
Number of racks: 1
The filesystem under path '/' is HEALTHY
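If the configured target can never be met, for example the single-DataNode cluster with a default replication factor of 2 in the first fsck output above, blocks will stay under-replicated until you add DataNodes or lower the target. A minimal sketch, assuming the path below is yours (dfs.replication in hdfs-site.xml only affects files written afterwards):
# Lower the target replication of every file under the directory to 1;
# -w waits until the existing blocks satisfy the new target.
hdfs dfs -setrep -w 1 /user/root/mydatadir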

HDFS disk usage showing different information

I got the details below from hadoop fsck /:
Total size: 41514639144544 B (Total open files size: 581 B)
Total dirs: 40524
Total files: 124348
Total symlinks: 0 (Files currently being written: 7)
Total blocks (validated): 340802 (avg. block size 121814540 B) (Total open file blocks (not validated): 7)
Minimally replicated blocks: 340802 (100.0 %)
I am using a 256 MB block size,
so 340802 blocks * 256 MB = 83.2 TB, * 3 (replicas) = 249.6 TB.
But Cloudera Manager shows 110 TB of disk used. How is that possible?
You cannot simply multiply the block count by the block size and the replication factor. The block size and replication factor can be set, and changed, on a per-file basis, so they are not uniform across the filesystem.
Hence the computation in the second part of your question is not necessarily correct; in particular, fsck reports an average block size of roughly 120 MB, not 256 MB, because most blocks are not full.
In this case about 40 TB of data occupies around 110 TB of raw storage, so the replication factor is also not 3 for all files. The value you see in Cloudera Manager is correct.
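To check the actual per-file values instead of assuming cluster-wide defaults, two hedged examples (the paths are placeholders; the %o/%r stat specifiers need a reasonably recent Hadoop release):
# Replication factor (%r), block size (%o) and length in bytes (%b) of one file.
hdfs dfs -stat "%r %o %b %n" /path/to/some/file
# Per-file block counts, block IDs and replication as recorded by the NameNode.
hdfs fsck /path/to/some/dir -files -blocks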

HDFS Corrupt block pool needs some explanation

I have a cluster up and running (HDP-2.3.0.0-2557). It consists of 10 physical servers (2 management servers and 8 data nodes, all of which are healthy). The cluster (HDFS) was loaded with an initial dataset of roughly 4 TB of data over a month ago. Most importantly, after loading there were no reports of any missing or corrupt blocks!
I loaded up the Ambari dashboard after a month of not using the system at all, and under the HDFS summary - Block Error section I am seeing "28 missing / 28 under replicated". The servers have not been used at all: no MapReduce jobs, and no new files read from or written to HDFS. How is it possible that 28 blocks are now reported as corrupt?
The original data source which resides on one 4Tb disk has no missing blocks, no corrupt files or anything of the sort and is working just fine! Having the data in triplicate using HDFS should surely safeguard me against files being lost/corrupt.
I have run all the suggested fsck commands and can see lines such as:
/user/ambari-qa/examples/input-data/rawLogs/2010/01/01/01/40/log05.txt: MISSING 1 blocks of total size 15 B...........
/user/ambari-qa/examples/src/org/apache/oozie/example/DemoMapper.java: CORRUPT blockpool BP-277908767-10.13.70.142-1443449015470 block blk_1073742397
I convinced my manager that Hadoop was the way forward because of its impressive resilience claims, but this example proves (to me at least) that HDFS is flawed? Perhaps I'm doing something wrong, but surely I should not have to go searching around a file system for missing blocks. I need to get back to my manager with an explanation; if one of these 28 missing files had been critical, HDFS would have landed me in hot water! At this point in time my manager thinks HDFS is not fit for purpose!
I must be missing something or doing something wrong; surely files/blocks stored in triplicate are three times less likely to go missing?! The idea is that if one data node is taken offline, a file is marked as under-replicated and eventually copied to another data node.
In summary: a default install of HDP was installed with all services started. 4 TB of data was copied to HDFS with no reported errors (all blocks stored with the default triplicate replication). Everything was left standing for one month. The HDFS summary now reports 28 missing files (no disk errors were encountered on any of the 8 data nodes).
Has anyone else had a similar experience?
The last section of output from the "hdfs fsck /" command:
Total size: 462105508821 B (Total open files size: 1143 B)
Total dirs: 4389
Total files: 39951
Total symlinks: 0 (Files currently being written: 13)
Total blocks (validated): 41889 (avg. block size 11031667 B) (Total open file blocks (not validated): 12)
********************************
UNDER MIN REPL'D BLOCKS: 40 (0.09549046 %)
dfs.namenode.replication.min: 1
CORRUPT FILES: 40
MISSING BLOCKS: 40
MISSING SIZE: 156470223 B
CORRUPT BLOCKS: 28
********************************
Minimally replicated blocks: 41861 (99.93316 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.998138
Corrupt blocks: 28
Missing replicas: 0 (0.0 %)
Number of data-nodes: 8
Number of racks: 1
FSCK ended at Thu Dec 24 03:18:32 CST 2015 in 979 milliseconds
The filesystem under path '/' is CORRUPT
Thanks for reading!
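For reference, these are the fsck invocations commonly used to drill into output like the above (the paths reuse the ones from the question; -move and -delete modify the filesystem, so they are normally run only once the affected files are confirmed expendable or restorable from the source):
# List only the files that currently have corrupt or missing blocks.
hdfs fsck / -list-corruptfileblocks
# Show which blocks back a suspect path and which DataNodes hold them.
hdfs fsck /user/ambari-qa/examples -files -blocks -locations
# Either move files with unrecoverable blocks to /lost+found, or delete them.
hdfs fsck /user/ambari-qa/examples -move
hdfs fsck /user/ambari-qa/examples -delete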

hdfs fsck displays wrong replication factor

I just started using Hadoop and have been playing around with it. I googled a bit and found out that I have to change the properties in hdfs-site.xml to change the default replication factor, so that's what I did, and to be honest it works like a charm. When I add new files they are automatically replicated with the new replication factor. But when I do something like:
hdfs fsck /
Then the output says that the default replication is 1. I may just be being pedantic about it, but I'd rather have that fixed. Or rather: I've been relying on that output, and it therefore took a long time before I realised there is nothing wrong... or maybe there is something wrong? Can someone help me interpret this fsck output?
..Status: HEALTHY
Total size: 1375000000 B
Total dirs: 1
Total files: 2
Total symlinks: 0
Total blocks (validated): 12 (avg. block size 114583333 B)
Minimally replicated blocks: 12 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 2.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 4
Number of racks: 1
Sometimes Hadoop answers queries with the information it has in the .xml files on the client machine, and sometimes with the values on the various server machines. Make sure hdfs-site.xml has the same value on the DataNodes, on the client node (where you ran hdfs from), and on the NameNode. I maintain a central repository for the configuration files (customized for the particulars of each node) and push them out globally whenever they change.
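A hedged way to cross-check (the file paths are placeholders): the "Default replication factor" line in fsck merely echoes a configured dfs.replication value, while "Average block replication" (2.0 above) reflects the replicas that actually exist, so comparing the configured and actual values per node usually shows which hdfs-site.xml is stale:
# What dfs.replication resolves to with the configuration visible on this machine
# (run it on the client, the NameNode and a DataNode and compare).
hdfs getconf -confKey dfs.replication
# The actual replication (%r) of a file written after the change.
hdfs dfs -stat "%r %n" /path/to/recent/file
# If older files should also follow the new factor, set it explicitly.
hdfs dfs -setrep -w 2 /path/to/older/file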

Need clarity on Hadoop block size in single Node cluster

I have a single-node Hadoop cluster, version 2.x. The block size I have set is 64 MB. I have an input file in HDFS of size 84 MB. Now, when I run the MR job, I see that there are 2 splits, which is valid since 84 MB / 64 MB ~ 2, and so 2 splits.
But when I run the command "hadoop fsck -blocks" to see the details of the blocks, I see this:
Total size: 90984182 B
Total dirs: 16
Total files: 7
Total symlinks: 0
Total blocks (validated): 7 (avg. block size 12997740 B)
Minimally replicated blocks: 7 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
As you can see, the average block size is close to 13 MB. Why is this? Ideally, the block size should be 64 MB, right?
The maximum block size is 64 MB as you specified, but you'd have to be pretty lucky to have your average block size be equal to the maximum block size.
Consider the one file you mentioned:
1 file, 84 MB
84 MB / 64 MB → 2 blocks (one full 64 MB block plus one 20 MB block)
84 MB / 2 blocks = 42 MB/block on average
You must have some other files bringing the average down even more.
Other than the memory requirement on the NameNode for the blocks, and possibly a loss of parallelism if your block size is too high (obviously not an issue in a single-node cluster), there isn't much of a problem with the average block size being smaller than the maximum.
Having a 64 MB maximum block size does not mean every block takes up 64 MB on disk.
When you configure the block size you set the maximum size a block can be. It is highly unlikely that your files are an exact multiple of the block size, so many blocks will be smaller than the configured block size.
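To see the individual block sizes rather than just the average, a hedged example (the path stands in for the 84 MB input file from the question):
# Lists each block of the file with its length; for an 84 MB file and a 64 MB
# block size this should show one ~64 MB block and one ~20 MB block.
hdfs fsck /path/to/84mb-input -files -blocks
# Configured block size (%o) versus actual file length (%b) for comparison.
hdfs dfs -stat "%o %b %n" /path/to/84mb-input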
