Hadoop, MapReduce - Cannot obtain block length for LocatedBlock

I have a 1.3 GB file on HDFS at the path 'test/test.txt'.
The output of the ls and du commands is:
hadoop fs -du test/test.txt -> 1379081672 test/test.txt
hadoop fs -ls test/test.txt ->
Found 1 items
-rw-r--r-- 3 testuser supergroup 1379081672 2014-05-06 20:27 test/test.txt
I want to run a MapReduce job on this file, but when I start the job it fails with the following error:
hadoop jar myjar.jar test.TestMapReduceDriver test output
14/05/29 16:42:03 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/05/29 16:42:03 INFO input.FileInputFormat: Total input paths to process : 1
14/05/29 16:42:03 INFO mapred.JobClient: Running job: job_201405271131_9661
14/05/29 16:42:04 INFO mapred.JobClient: map 0% reduce 0%
14/05/29 16:42:17 INFO mapred.JobClient: Task Id : attempt_201405271131_9661_m_000004_0, Status : FAILED
java.io.IOException: Cannot obtain block length for LocatedBlock{BP-428948818-namenode-1392736828725:blk_-6790192659948575136_8493225; getBlockSize()=36904392; corrupt=false; offset=1342177280; locs=[datanode4:50010, datanode3:50010, datanode1:50010]}
at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:319)
at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:263)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:205)
at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:198)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1117)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:249)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:82)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:746)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:83)
at org.apache.hadoop.mapred.Ma`
I tried the following commands.
hadoop fs -cat test/test.txt gives the following error:
cat: Cannot obtain block length for LocatedBlock{BP-428948818-10.17.56.16-1392736828725:blk_-6790192659948575136_8493225; getBlockSize()=36904392; corrupt=false; offset=1342177280; locs=[datanode3:50010, datanode1:50010, datanode4:50010]}
Additionally, I can't copy the file. hadoop fs -cp test/test.txt tmp gives the same error:
cp: Cannot obtain block length for LocatedBlock{BP-428948818-10.17.56.16-1392736828725:blk_-6790192659948575136_8493225; getBlockSize()=36904392; corrupt=false; offset=1342177280; locs=[datanode1:50010, datanode3:50010, datanode4:50010]}
Output of the hdfs fsck /user/testuser/test/test.txt command:
Connecting to namenode via `http://namenode:50070`
FSCK started by testuser (auth:SIMPLE) from /10.17.56.16 for path /user/testuser/test/test.txt at Thu May 29 17:00:44 EEST 2014
Status: HEALTHY
Total size: 0 B (Total open files size: 1379081672 B)
Total dirs: 0
Total files: 0 (Files currently being written: 1)
Total blocks (validated): 0 (Total open file blocks (not validated): 21)
Minimally replicated blocks: 0
Over-replicated blocks: 0
Under-replicated blocks: 0
Mis-replicated blocks: 0
Default replication factor: 3
Average block replication: 0.0
Corrupt blocks: 0
Missing replicas: 0
Number of data-nodes: 5
Number of racks: 1
FSCK ended at Thu May 29 17:00:44 EEST 2014 in 0 milliseconds
The filesystem under path /user/testuser/test/test.txt is HEALTHY
By the way, I can see the content of the test.txt file from the web browser.
The Hadoop version is Hadoop 2.0.0-cdh4.5.0.

I had the same issue and fixed it with the following steps.
In my case, some files had been opened by Flume but never closed (I am not sure the cause is the same for you).
First, find the names of the files that are still open with this command:
hdfs fsck /directory/of/locked/files/ -files -openforwrite
You can then try to recover each file with this command:
hdfs debug recoverLease -path <path-of-the-file> -retries 3
Or remove them with this command:
hdfs dfs -rmr <path-of-the-file>
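As a rough sketch of how the two commands above could be chained, the function below pulls the paths of open files out of fsck output so they can be handed to recoverLease one at a time. It assumes that each open file appears on a line of the form "<path> <size> bytes, ... OPENFORWRITE ..."; the exact layout differs between Hadoop versions, so check it against your own output first. The function name open_files is made up for this example.

```shell
# open_files: read `hdfs fsck <dir> -files -openforwrite` output on stdin and
# print the path of every file flagged OPENFORWRITE.
# Assumption: the line starts with the path, followed by "<size> bytes,".
open_files() {
  sed -n 's/^\(\/.*\) [0-9][0-9]* bytes,.*OPENFORWRITE.*/\1/p'
}

# Hypothetical usage on a live cluster (not run here):
#   hdfs fsck /directory/of/locked/files/ -files -openforwrite | open_files |
#     while IFS= read -r f; do
#       hdfs debug recoverLease -path "$f" -retries 3 </dev/null
#     done
```

Keeping the parsing separate from the recovery step makes it easy to review the list of paths before touching anything.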

I had the same error, but it was not due to a full disk. I think it was the inverse: the namenode referenced files and blocks that did not exist on any datanode.
Thus, hdfs dfs -ls shows the files, but any operation on them fails, e.g. hdfs dfs -copyToLocal.
In my case, the hard part was isolating which files were listed but corrupted, as they existed in a tree having thousands of files. Oddly, hdfs fsck /path/to/files/ did not report any problems.
My solution was:
1. Isolate the location using copyToLocal, which resulted in copyToLocal: Cannot obtain block length for LocatedBlock{BP-1918381527-10.74.2.77-1420822494740:blk_1120909039_47667041; getBlockSize()=1231; corrupt=false; offset=0; locs=[10.74.2.168:50010, 10.74.2.166:50010, 10.74.2.164:50010]} for several files.
2. Get a list of the local directories using ls -1 > baddirs.out.
3. Get rid of the local files from the first copyToLocal.
4. Use for files in `cat baddirs.out`; do echo $files; hdfs dfs -copyToLocal $files; done. This produces a list of the directories checked, with errors where bad files are found.
5. Get rid of the local files again, and now get lists of files from each affected subdirectory. Use that as input to a file-by-file copyToLocal, at which point you can echo each file as it's copied, then see where the error occurs.
6. Use hdfs dfs -rm <file> for each bad file.
7. Confirm you got 'em all by removing all local files again and rerunning the original copyToLocal on the top-level directory where you had problems.
A simple two hour process!
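The manual narrowing-down above could probably be shortened by testing every file individually. The sketch below is one way to do that, not the procedure this answer actually used: it reads file paths on stdin and prints those for which a given check command fails. The function name list_unreadable is invented here, and the usage in the comment assumes that the path is the eighth column of hdfs dfs -ls -R output and contains no spaces.

```shell
# list_unreadable CHECK_CMD [ARGS...]
# Read one path per line on stdin, run the check command with the path
# appended, and print every path for which the command exits non-zero.
list_unreadable() {
  while IFS= read -r path; do
    [ -n "$path" ] || continue
    # </dev/null keeps the check command from swallowing the path list.
    "$@" "$path" >/dev/null 2>&1 </dev/null || printf '%s\n' "$path"
  done
}

# Hypothetical usage on a live cluster (not run here):
#   hdfs dfs -ls -R /path/to/files | awk '$1 ~ /^-/ { print $8 }' |
#     list_unreadable hdfs dfs -cat > badfiles.out
```

Note that hdfs dfs -cat reads each file completely, so this can be slow on a large tree; it only reports files and leaves the decision to delete them to you.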

You have some corrupted files that have an entry in the namenode but no blocks on any datanode. It is best to follow this answer:
https://stackoverflow.com/a/19216037/812906

According to this, the error may be caused by a full disk. I came across the same problem recently with an old file, and my server metrics confirmed that a disk had indeed been full while that file was being created. Most solutions just say to delete the file and pray that it does not happen again.

Related

How to know the exact block size of a file on a Hadoop node?

I have a 1 GB file that I've put on HDFS. So, it would be broken into blocks and sent to different nodes in the cluster.
Is there any command to identify the exact size of the block of the file on a particular node?
Thanks.
Use the hdfs fsck command:
hdfs fsck /tmp/test.txt -files -blocks
This command prints information about all the blocks that make up the file:
/tmp/test.tar.gz 151937000 bytes, 2 block(s): OK
0. BP-739546456-192.168.20.1-1455713910789:blk_1073742021_1197 len=134217728 Live_repl=3
1. BP-739546456-192.168.20.1-1455713910789:blk_1073742022_1198 len=17719272 Live_repl=3
The len field in each row shows the number of bytes that block actually uses.
hdfs fsck has many other useful options, which are described in the official Hadoop documentation.
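To check that the per-block lengths account for the whole file, the len values can be added up. This is a small sketch that assumes the len=<bytes> field format shown in the output above; sum_block_lengths is a name chosen for this example.

```shell
# sum_block_lengths: read `hdfs fsck <path> -files -blocks` output on stdin
# and print the sum of all len=<bytes> fields.
sum_block_lengths() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^len=[0-9]+$/) { sub(/^len=/, "", $i); total += $i }
  }
  END { printf "%.0f\n", total }'
}

# Hypothetical usage on a live cluster (not run here):
#   hdfs fsck /tmp/test.txt -files -blocks | sum_block_lengths
```

For the two blocks listed above, 134217728 + 17719272 gives 151937000, which matches the file size on the first line of the output.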
You can try the following, which prints the configured default block size:
hdfs getconf -confKey dfs.blocksize
I do not have reputation to comment.
Have a look at the documentation page on setting configuration properties, which covers
dfs.blocksize
Apart from the configuration, you can view the actual size of a file with
hadoop fs -ls fileNameWithPath
e.g.
hadoop fs -ls /user/edureka
output:
-rwxrwxrwx 1 edureka supergroup 391355 2014-09-30 12:29 /user/edureka/cust

HDFS file blocks distribution in two node cluster

Environment
Hadoop : 0.20.205.0
Number of machines in cluster : 2 nodes
Replication : set to 1
DFS Block size : 1MB
I put a 7.4 MB file into HDFS using the put command, then ran the fsck command to check how the file's blocks are distributed among the datanodes. All 8 blocks of the file go to a single node. This hurts load distribution, and only one node is ever used when running mapred tasks.
Is there a way that I can distribute the files to more than one datanode?
bin/hadoop dfsadmin -report
Configured Capacity: 4621738717184 (4.2 TB)
Present Capacity: 2008281120783 (1.83 TB)
DFS Remaining: 2008281063424 (1.83 TB)
DFS Used: 57359 (56.01 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 2 (6 total, 4 dead)
Name: 143.215.131.246:50010
Decommission Status : Normal
Configured Capacity: 2953506713600 (2.69 TB)
DFS Used: 28687 (28.01 KB)
Non DFS Used: 1022723801073 (952.49 GB)
DFS Remaining: 1930782883840(1.76 TB)
DFS Used%: 0%
DFS Remaining%: 65.37%
Last contact: Fri Jul 18 10:31:51 EDT 2014
bin/hadoop fs -put /scratch/rkannan3/hadoop/test/pg20417.txt /user/rkannan3
bin/hadoop fs -ls /user/rkannan3
Found 1 items
-rw------- 1 rkannan3 supergroup 7420270 2014-07-18 10:40 /user/rkannan3/pg20417.txt
bin/hadoop fsck /user/rkannan3 -files -blocks -locations
FSCK started by rkannan3 from /143.215.131.246 for path /user/rkannan3 at Fri Jul 18 10:43:13 EDT 2014
/user/rkannan3 <dir>
/user/rkannan3/pg20417.txt 7420270 bytes, 8 block(s): OK <==== All the 8 blocks in one DN
0. blk_3659272467883498791_1006 len=1048576 repl=1 [143.215.131.246:50010]
1. blk_-5158259524162513462_1006 len=1048576 repl=1 [143.215.131.246:50010]
2. blk_8006160220823587653_1006 len=1048576 repl=1 [143.215.131.246:50010]
3. blk_4541732328753786064_1006 len=1048576 repl=1 [143.215.131.246:50010]
4. blk_-3236307221351862057_1006 len=1048576 repl=1 [143.215.131.246:50010]
5. blk_-6853392225410344145_1006 len=1048576 repl=1 [143.215.131.246:50010]
6. blk_-2293710893046611429_1006 len=1048576 repl=1 [143.215.131.246:50010]
7. blk_-1502992715991891710_1006 len=80238 repl=1 [143.215.131.246:50010]
If you want distribution at the file level, use a replication factor of at least 2. The first replica is always placed on the node where the writer is located (see the introduction paragraph in http://waset.org/publications/16836/optimizing-hadoop-block-placement-policy-and-cluster-blocks-distribution). Normally a file has only one writer, so the first replica of every block of that file ends up on the same node. You probably don't want to change that behaviour, because it lets you increase the minimum split size to avoid spawning too many mappers without losing data locality for those mappers.
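To see how replicas are actually spread before and after changing the replication factor, the fsck output can be tallied per datanode. This is a minimal sketch that assumes the older location format shown in the question, with a single [host:port, ...] list at the end of each block line; newer Hadoop releases print a more verbose format that this would not parse. The name blocks_per_datanode is made up for this example.

```shell
# blocks_per_datanode: read `fsck <path> -files -blocks -locations` output on
# stdin and print "<replica count> <datanode>" for each datanode seen.
blocks_per_datanode() {
  awk '/blk_/ && match($0, /\[[^]]*\]/) {
    # Take the text between the brackets and split it on commas.
    list = substr($0, RSTART + 1, RLENGTH - 2)
    n = split(list, nodes, /, */)
    for (i = 1; i <= n; i++) count[nodes[i]]++
  }
  END { for (d in count) print count[d], d }' | sort -k2
}

# Hypothetical usage on a live cluster (not run here):
#   bin/hadoop fsck /user/rkannan3 -files -blocks -locations | blocks_per_datanode
```

Run against the fsck output in the question, this would report all 8 replicas on 143.215.131.246:50010.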
Use the Hadoop balancer command. Details below.
Balancer
Runs a cluster balancing utility. You can press Ctrl-C to stop the rebalancing process.
Usage: hadoop balancer [-threshold <threshold>]
-threshold <threshold> Percentage of disk capacity. This overwrites the default threshold.

Centralized Cache failed in hadoop-2.3

I want to use Centralized Cache in hadoop-2.3.
Here are my steps (10 nodes, 6 GB of memory per node):
1. My file (45 MB) to be cached
[hadoop@Master ~]$ hadoop fs -ls /input/pics/bundle
Found 1 items
-rw-r--r-- 1 hadoop supergroup 47185920 2014-03-09 19:10 /input/pics/bundle/bundle.chq
2. Create the cache pool
[hadoop@Master ~]$ hdfs cacheadmin -addPool myPool -owner hadoop -group supergroup
Successfully added cache pool myPool.
[hadoop@Master ~]$ hdfs cacheadmin -listPools -stats
Found 1 result.
NAME OWNER GROUP MODE LIMIT MAXTTL BYTES_NEEDED BYTES_CACHED BYTES_OVERLIMIT FILES_NEEDED FILES_CACHED
myPool hadoop supergroup rwxr-xr-x unlimited never 0 0 0 0 0
3. Add a directive
[hadoop@Master ~]$ hdfs cacheadmin -addDirective -path /input/pics/bundle/bundle.chq -pool myPool -force -replication 3
Added cache directive 2
4. List directives
[hadoop@Master ~]$ hdfs cacheadmin -listDirectives -stats -path /input/pics/bundle/bundle.chq -pool myPool
Found 1 entry
ID POOL REPL EXPIRY PATH BYTES_NEEDED BYTES_CACHED FILES_NEEDED FILES_CACHED
2 myPool 3 never /input/pics/bundle/bundle.chq 141557760 0 1 0
BYTES_NEEDED is right (47185920 bytes x 3 replicas = 141557760 bytes), but BYTES_CACHED is zero. It seems the size has been calculated, but the file has not actually been loaded into memory. How do I get my file cached in memory?
Thank you very much.
There were a bunch of bugs in Hadoop 2.3 that we have since fixed. I would recommend using at least Hadoop 2.4 for HDFS caching.
To get into more detail I would need to see the log messages.
Including the output of hdfs dfsadmin -report would also be useful, as well as ensuring that you have followed the setup instructions here (namely, increasing the ulimit and setting dfs.datanode.max.locked.memory):
http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html
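As a quick way to check the ulimit part of that setup, the locked-memory limit can be compared with the configured value. This is only a sketch under the assumption, taken from the linked documentation, that dfs.datanode.max.locked.memory is given in bytes while ulimit -l reports kilobytes. The function name locked_memory_ok is invented for this example.

```shell
# locked_memory_ok LIMIT_KB WANTED_BYTES
# Succeed if a `ulimit -l` value (in KB, or the word "unlimited") is large
# enough for a dfs.datanode.max.locked.memory value given in bytes.
locked_memory_ok() {
  limit_kb=$1
  wanted_bytes=$2
  [ "$limit_kb" = "unlimited" ] && return 0
  [ $(( limit_kb * 1024 )) -ge "$wanted_bytes" ]
}

# Hypothetical usage on a datanode, as the user that runs the DataNode
# process (not run here):
#   locked_memory_ok "$(ulimit -l)" \
#     "$(hdfs getconf -confKey dfs.datanode.max.locked.memory)" ||
#     echo "ulimit -l is too small for the configured locked memory"
```

A configured value of 0 passes this check trivially but means caching is switched off, which would also leave BYTES_CACHED at zero.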

Checking filesize and its distribution in HDFS

Is it possible to find out a file's size in blocks and its distribution over DataNodes in Hadoop?
Currently I am using:
frolo@A11:~/hadoop> $HADOOP_HOME/bin/hadoop dfs -stat "%b %o %r %n" /user/frolo/input/rmat-*
318339 67108864 1 rmat-10.0
392835957 67108864 1 rmat-20.0
This does not show the actual number of blocks created after uploading the file to HDFS, and I don't know of any way to find out its distribution.
Thanks,
Alex
The %r in your stat command shows the replication factor of the queried file. If this is 1, there will be only a single replica across the cluster for each block belonging to this file. The hadoop fs -ls output also shows this value for listed files as one of its numeric columns, since the replication factor is a per-file FS attribute.
If you want to find out where the blocks reside, use hdfs fsck (or hadoop fsck on a dated release). The command below, for example, lists the block IDs of a file and the set of locations each block resides on:
hdfs fsck /user/frolo/input/rmat-10.0 -files -blocks -locations
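The expected number of blocks can also be worked out from the %b (file size) and %o (block size) values that stat already prints, by dividing and rounding up. The helper below is a minimal sketch of that arithmetic; block_count is a name made up for this example, and fsck remains the way to see the blocks that really exist.

```shell
# block_count SIZE_BYTES BLOCK_SIZE_BYTES
# Print how many blocks a file of the given size needs (ceiling division).
block_count() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Values taken from the stat output in the question:
block_count 318339 67108864      # rmat-10.0
block_count 392835957 67108864   # rmat-20.0
```

With a 64 MB block size this gives 1 block for rmat-10.0 and 6 blocks for rmat-20.0.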

How to find different fragments of a file in HDFS

Is there a way to find out where the fragments of a file I have put into HDFS have gone? I mean, where are the file's fragments stored in HDFS?
You can use the fsck command:
#> hadoop fsck /path/to/file -files -blocks -locations -racks
For the given file, this lists its blocks and their associated metadata:
block name/ID
block length
block replication
locations (datanodeIp:port)
rack (each datanode IP is prefixed with its rack ID)
For example:
/user/chris/file1.txt 123 bytes, 1 block(s): OK
0. blk_432678432632_3426532 len=123 repl=2 [/rack1/1.2.3.4:50010, /rack2/4.5.6.7:50010]
