copyFromLocal - extra/empty files getting copied to HDFS - hadoop

I created two sample files, 'file1.txt' and 'file2.txt', in a folder on my local filesystem in Ubuntu and copied them to HDFS using the dfs command copyFromLocal.
On browsing the copied files through the NameNode UI, I can see two extra files (file1.txt~ and file2.txt~, each 0 KB) copied to HDFS, even though the folder contained no files other than file1.txt and file2.txt.
Below is the information from the NameNode UI -
Name Type Size
file1.txt file 0.01 KB
file1.txt~ file 0 KB
file2.txt file 0.01 KB
file2.txt~ file 0 KB
Any suggestion/idea why these two extra files got created and how to fix this?


Does Hadoop copyFromLocal create 2 copies - one inside HDFS and the other inside the datanode?

I have installed a single-node, pseudo-distributed Hadoop on Ubuntu, which runs inside VMware on my Windows 10 machine.
I downloaded a file from the internet and copied it into the Ubuntu local directory /lab/data.
I have created namenode and datanode folders (not the hadoop folder) named namenodep and datan1 in Ubuntu. I have also created a folder inside HDFS, /input.
When I copied the file from the Ubuntu local filesystem to HDFS, why is the file present in both of the directories below?
$ hadoop fs -copyFromLocal /lab/data/Civil_List_2014.csv /input
$ hadoop fs -ls /input/
input/Civil_List_2014.csv ?????
$ cd lab/hdfs/datan1/current
blk_3621390486220058643 ?????
Basically I want to understand whether it created 2 copies, one inside the datan1 folder and the other inside HDFS?
No. Only one copy is created.
When you create a file in HDFS, the contents of the file are stored on one of the disks of the Data Node. The disk location where the Data Node stores the data is determined by the configuration parameter: (present in hdfs-site.xml)
Check the description of this property:
<description>Determines where on the local filesystem an DFS data node
should store its blocks. If this is a comma-delimited
list of directories, then data will be stored in all named
directories, typically on different devices.
Directories that do not exist are ignored.
So above, the contents of your HDFS file "/input/Civil_List_2014.csv" are stored in the physical location lab/hdfs/datan1/current/blk_3621390486220058643.
"blk_3621390486220058643_1121.meta" contains the checksum of the data stored in "blk_3621390486220058643".
This file may be small enough to fit in a single block. But if a file is bigger than the Hadoop block size (say, larger than 256 MB with a block size of 256 MB), then Hadoop splits the contents of the file into 'n' blocks and stores them on disk. In that case, you will see 'n' "blk_*" files in the data node's data directory.
Also, since the replication factor is typically set to "3", 3 replicas of each block are created, each on a different data node.
The output of the hadoop fs -ls /input/ command shows metadata, not a physical file; it is a logical abstraction over the blocks hosted by the datanodes. This metadata is stored by the NameNode.
The actual physical files are split into blocks and hosted by the datanodes in the path specified in the configuration, in your case lab/hdfs/datan1/current.
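The block arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function name block_layout and the sample sizes are made up for the example, and the real block size and replication factor come from your cluster configuration.

```python
def block_layout(file_size_bytes, block_size_bytes, replication=3):
    """Return (number of blocks, total block replicas) for one HDFS file.

    Illustrative helper, not part of Hadoop. The replication default of 3
    mirrors the typical value mentioned above and is an assumption.
    """
    # Ceiling division: a partially filled last block still counts as a block.
    blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    return blocks, blocks * replication


MB = 1024 * 1024

# A small file fits in one block, so one blk_* file appears per replica.
small = block_layout(5 * MB, 256 * MB, replication=1)

# A 600 MB file with 256 MB blocks needs three blocks (256 + 256 + 88 MB).
large = block_layout(600 * MB, 256 * MB, replication=3)
```

Here small is (1, 1) and large is (3, 9): three blk_* files per copy of the file, nine block replicas across the cluster.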

Uncompress recursively using 7-Zip from command line

I'm attempting to uncompress several .gz files using 7-Zip from the command line. My files are in directories like so:
I would like to recursively uncompress all the .gz files into their original locations and delete the .gz files once they have been uncompressed.
I have tried the following command with no luck:
7z.exe x C:\Users\MYUSERNAME\Desktop\copyto\*\*.gz -r
I assumed that this would extract recursively. I get the error:
Processing archive: C:\Users\MYUSERNAME\Desktop\copyto\1\file1.gz
Can not open output file file1
Sub items Errors: 1
Any idea what's going on?
Given your command line, my guess is that your current working directory isn't any subdirectory of your home directory (C:\Users\MYUSERNAME) or the public user directory (C:\Users\Public), which means you probably don't have access rights. For example, if I run the following from C:\Program Files\7-Zip, I get the same error with a 7-Zip file:
C:\Program Files\7-Zip>7z x C:\Users\MYUSERNAME\Desktop\migrated\annex_k.7z -r
7-Zip [64] 9.38 beta Copyright (c) 1999-2014 Igor Pavlov 2015-01-03
Processing archive: C:\Users\MYUSERNAME\Desktop\migrated\annex_k.7z
ERROR: Can not open output file : .\annex_k\include\annex_k\errno.h
Skipping annex_k\include\annex_k\errno.h
ERROR: Can not open output file : .\annex_k\include\annex_k\handler.h
Skipping annex_k\include\annex_k\handler.h
Extracting annex_k\include\annex_k
Extracting annex_k\include
Extracting annex_k
Sub items Errors: 10
Archives with Errors: 1
Sub items Errors: 10
Kernel Time = 0.031 = 39%
User Time = 0.031 = 39%
Process Time = 0.062 = 78% Virtual Memory = 3 MB
Global Time = 0.080 = 100% Physical Memory = 4 MB
Notice that not even an annex_k directory was created:
C:\Program Files\7-Zip>dir /b
The solution is to extract to a directory in which you have access rights. You can specify an output directory using something like -oC:\Users\MYUSERNAME\Desktop\copyto\1. If you absolutely need to do this in a directory in which you don't have write access ordinarily, you'd need to run the command prompt as an administrator and extract the file as usual.
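If a scripted alternative to 7-Zip is acceptable, the original goal (uncompress every .gz next to itself, then delete the archive) can be sketched with the Python standard library. This swaps 7-Zip for Python's gzip module; the function name gunzip_tree is hypothetical, and the same write-permission requirement on the target directories still applies.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_tree(root):
    """Decompress every *.gz file under root next to its archive,
    then delete the archive. Returns the list of extracted paths."""
    extracted = []
    # sorted() materialises the listing before any file is added or removed.
    for archive in sorted(Path(root).rglob("*.gz")):
        if not archive.is_file():
            continue
        target = archive.with_suffix("")  # e.g. 1/file1.gz -> 1/file1
        with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        # Reached only if decompression raised no error.
        archive.unlink()
        extracted.append(target)
    return extracted
```

Calling gunzip_tree with the top-level folder (the copyto directory in the question) would walk all subdirectories. If an archive fails to decompress, the exception stops the run before that archive is deleted.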

Hadoop, Mapreduce - Cannot obtain block length for LocatedBlock

I have a file on HDFS at the path 'test/test.txt', which is 1.3 GB.
The output of the ls and du commands is:
hadoop fs -du test/test.txt -> 1379081672 test/test.txt
hadoop fs -ls test/test.txt ->
Found 1 items
-rw-r--r-- 3 testuser supergroup 1379081672 2014-05-06 20:27 test/test.txt
I want to run a MapReduce job on this file, but when I start the job it fails with the following error:
hadoop jar myjar.jar test.TestMapReduceDriver test output
14/05/29 16:42:03 WARN mapred.JobClient: Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
14/05/29 16:42:03 INFO input.FileInputFormat: Total input paths to process : 1
14/05/29 16:42:03 INFO mapred.JobClient: Running job: job_201405271131_9661
14/05/29 16:42:04 INFO mapred.JobClient: map 0% reduce 0%
14/05/29 16:42:17 INFO mapred.JobClient: Task Id : attempt_201405271131_9661_m_000004_0, Status : FAILED Cannot obtain block length for LocatedBlock{BP-428948818-namenode-1392736828725:blk_-6790192659948575136_8493225; getBlockSize()=36904392; corrupt=false; offset=1342177280; locs=[datanode4:50010, datanode3:50010, datanode1:50010]}
at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(
at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(
at org.apache.hadoop.hdfs.DFSInputStream.<init>(
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(
at org.apache.hadoop.mapred.Ma`
I tried the following commands:
hadoop fs -cat test/test.txt gives the following error:
cat: Cannot obtain block length for LocatedBlock{BP-428948818-; getBlockSize()=36904392; corrupt=false; offset=1342177280; locs=[datanode3:50010, datanode1:50010, datanode4:50010]}
Additionally, I can't copy the file; hadoop fs -cp test/test.txt tmp gives the same error:
cp: Cannot obtain block length for LocatedBlock{BP-428948818-; getBlockSize()=36904392; corrupt=false; offset=1342177280; locs=[datanode1:50010, datanode3:50010, datanode4:50010]}
output of the hdfs fsck /user/testuser/test/test.txt command:
Connecting to namenode via `http://namenode:50070`
FSCK started by testuser (auth:SIMPLE) from / for path
/user/testuser/test/test.txt at Thu May 29 17:00:44 EEST 2014
Total size: 0 B (Total open files size: 1379081672 B)
Total dirs: 0
Total files: 0 (Files currently being written: 1)
Total blocks (validated): 0 (Total open file blocks (not validated): 21)
Minimally replicated blocks: 0
Over-replicated blocks: 0
Under-replicated blocks: 0
Mis-replicated blocks: 0
Default replication factor: 3
Average block replication: 0.0
Corrupt blocks: 0
Missing replicas: 0
Number of data-nodes: 5
Number of racks: 1
FSCK ended at Thu May 29 17:00:44 EEST 2014 in 0 milliseconds
The filesystem under path /user/testuser/test/test.txt is HEALTHY
By the way, I can see the content of the test.txt file from the web browser.
The Hadoop version is: Hadoop 2.0.0-cdh4.5.0
I had the same issue and fixed it with the following steps.
In my case, some files had been opened by Flume but never closed (I am not sure what the cause is in your case).
First, find the names of the files that are still open for write with the command:
hdfs fsck /directory/of/locked/files/ -files -openforwrite
You can then try to recover a file with the command:
hdfs debug recoverLease -path <path-of-the-file> -retries 3
Or remove the files with the command:
hdfs dfs -rmr <path-of-the-file>
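The numbers in the question's error message fit this diagnosis. The failing block's offset plus its size equals the file length, so it is the last block of the file, and fsck lists the file as currently being written. A small Python sketch of that check follows; the function name and the abbreviated message text are only for illustration.

```python
import re


def describe_located_block(message, file_length):
    """Pull offset and getBlockSize() out of a 'Cannot obtain block length'
    message and report whether the block ends exactly at the file length."""
    size = int(re.search(r"getBlockSize\(\)=(\d+)", message).group(1))
    offset = int(re.search(r"offset=(\d+)", message).group(1))
    return {
        "offset": offset,
        "size": size,
        "is_last_block": offset + size == file_length,
    }


# Values copied from the question; the message is shortened here.
msg = ("Cannot obtain block length for LocatedBlock{...; "
       "getBlockSize()=36904392; corrupt=false; offset=1342177280; locs=[...]}")
info = describe_located_block(msg, file_length=1379081672)
```

With the values from the question, info["is_last_block"] is True (1342177280 + 36904392 = 1379081672). If the block size is 64 MB, which is an assumption, the offset is exactly 20 full blocks, which would make this the 21st block and match the 21 open file blocks reported by fsck.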
I had the same error, but it was not due to a full-disk problem. I think it was the inverse: the namenode referenced files and blocks that did not exist on any datanode.
Thus, hdfs dfs -ls shows the files, but any operation on them fails, e.g. hdfs dfs -copyToLocal.
In my case, the hard part was isolating which files were listed but corrupted, as they existed in a tree having thousands of files. Oddly, hdfs fsck /path/to/files/ did not report any problems.
My solution was:
Isolate the location using copyToLocal which resulted in copyToLocal: Cannot obtain block length for LocatedBlock{BP-1918381527-; getBlockSize()=1231; corrupt=false; offset=0; locs=[,,]} for several files
Get a list of the local directories using ls -1 > baddirs.out
Get rid of the local files from the first copyToLocal.
Use for files in $(cat baddirs.out); do echo $files; hdfs dfs -copyToLocal $files; done. This will produce a list of the directories checked, with errors where bad files are found.
Get rid of the local files again, and now get lists of files from each affected subdirectory. Use that as input to a file-by-file copyToLocal, at which point you can echo each file as it's copied, then see where the error occurs.
Use hdfs dfs -rm <file> for each bad file.
Confirm you got them all by removing all local files again and rerunning the original copyToLocal on the top-level directory where you had problems.
A simple two hour process!
You have some corrupted files with no blocks on any datanode but an entry in the namenode. Best to follow this:
According to this, the problem may be caused by a full disk. I came across the same problem recently with an old file, and my servers' metrics confirmed that a disk had indeed been full when that file was created. Most solutions just say to delete the file and pray it does not happen again.

Two copies of each file being copied from local to HDFS

I am using fs.copyFromLocalFile(local path, Hdfs dest path) in my program.
I delete the destination path on HDFS every time before copying files from the local machine. But after copying the files from the local path and running MapReduce on them, there are two copies of each file, so the word count doubles.
To be clear, my local path is "Home/user/desktop/input/" and the HDFS destination path is "/input".
When I check the HDFS destination path, i.e. the folder on which MapReduce was run, this is the result:
hduser@rallapalli-Lenovo-G580:~$ hdfs dfs -ls /input
14/03/30 08:30:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
-rw-r--r-- 1 hduser supergroup 62 2014-03-30 08:28 /input/1.txt
-rw-r--r-- 1 hduser supergroup 62 2014-03-30 08:28 /input/1.txt~
-rw-r--r-- 1 hduser supergroup 21 2014-03-30 08:28 /input/2.txt
-rw-r--r-- 1 hduser supergroup 21 2014-03-30 08:28 /input/2.txt~
When I provide a single file, Home/user/desktop/input/1.txt, as input, there is no problem and only that file is copied. But specifying the directory causes the problem.
Manually placing each file in the HDFS destination through the command line also causes no problem.
I am not sure if I am missing some simple logic of the file system, but it would be great if anyone could suggest where I am going wrong.
I am using Hadoop 2.2.0.
I have tried deleting the local temporary files and made sure the text files are not open. I am looking for a way to avoid copying the temporary files.
Thanks in advance.
The files /input/1.txt~ and /input/2.txt~ are temporary backup files created by the text editor you are using on your machine. You can press Ctrl + H to see all hidden temporary files in your local directory and delete them.
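If deleting the backups by hand is not practical, one option is to build the upload list yourself and skip names ending in "~". A minimal Python sketch follows; the function name files_to_upload is hypothetical, and the upload itself is left out.

```python
from pathlib import Path


def files_to_upload(local_dir):
    """Return the regular files in local_dir, sorted by name,
    leaving out editor backup files whose names end in '~'."""
    return sorted(
        p for p in Path(local_dir).iterdir()
        if p.is_file() and not p.name.endswith("~")
    )
```

Each returned path can then be copied individually, which the question reports works without creating duplicates.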

how to prevent hadoop corrupted .gz file

I'm using the following simple code to upload files to HDFS.
FileSystem hdfs = FileSystem.get(config);
hdfs.copyFromLocalFile(src, dst);
The files are generated by a webserver Java component and rotated and closed by logback in .gz format. I've noticed that sometimes a .gz file is corrupted.
> gunzip logfile.log_2013_02_20_07.close.gz
gzip: logfile.log_2013_02_20_07.close.gz: unexpected end of file
But the following command does show me the content of the file
> hadoop fs -text /input/2013/02/20/logfile.log_2013_02_20_07.close.gz
The impact of such files is quite disastrous, since the aggregation for the whole day fails and several slave nodes are blacklisted.
What can I do in such a case?
Can the Hadoop copyFromLocalFile() utility corrupt the file?
Has anyone met a similar problem?
It shouldn't. This error is normally associated with gzip files which weren't closed out when originally written to local disk, or which are copied to HDFS before they have finished being written.
You should be able to check by running md5sum on the original file and on the copy in HDFS; if they match, then the original file is corrupt:
hadoop fs -cat /input/2013/02/20/logfile.log_2013_02_20_07.close.gz | md5sum
md5sum /path/to/local/logfile.log_2013_02_20_07.close.gz
If they don't match, then check the timestamps on the two files; the one in HDFS should have been modified after the one on the local file system.
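To keep truncated archives out of HDFS in the first place, the uploader could test each local .gz before copying it. Below is a minimal sketch using Python's standard gzip module. The function name gzip_is_complete is hypothetical, and this only catches files that are already incomplete or damaged at the time of the check.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the whole gzip file at path decompresses cleanly,
    False if it is truncated or otherwise unreadable."""
    try:
        with gzip.open(path, "rb") as f:
            # Read to the end so the end-of-stream marker and CRC are checked.
            while f.read(chunk_size):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        # EOFError: the stream ended early (gunzip's "unexpected end of file").
        # OSError / zlib.error: not a gzip file, or corrupt data.
        return False
```

A file that fails this check on the local disk was broken before the copy, which points at the rotation step rather than at copyFromLocalFile.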
