Difference between hadoop fs -put and hadoop fs -copyFromLocal

-put and -copyFromLocal are documented as identical, while most examples use the verbose variant -copyFromLocal. Why?
Same thing for -get and -copyToLocal

-copyFromLocal is similar to -put command, except that the source is restricted to a local file reference.
So basically, everything you can do with -copyFromLocal you can also do with -put, but not vice versa.
Similarly,
-copyToLocal is similar to get command, except that the destination is restricted to a local file reference.
Hence, you can use get instead of -copyToLocal, but not the other way round.
Reference: Hadoop's documentation.
Update: For the latest as of Oct 2015, please see this answer below.

Let's make an example:
If your HDFS contains the path: /tmp/dir/abc.txt
If your local disk also contains this path, then the HDFS API won't know which one you mean unless you specify a scheme like file:// or hdfs://, and it may pick the path you did not want to copy.
That is why -copyFromLocal exists: it prevents you from accidentally copying the wrong file by restricting the argument you give to the local filesystem.
-put is for more advanced users who know which scheme to put in front.
New Hadoop users often find it confusing which filesystem they are currently in and where their files actually are.
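To illustrate the ambiguity, here is a minimal sketch using the FileSystem API; the namenode address is an assumption for the example:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val p = new Path("/tmp/dir/abc.txt") // no scheme: resolved against fs.defaultFS

// Explicit schemes remove the ambiguity (namenode address is illustrative)
val localFs = FileSystem.get(new URI("file:///"), conf)
val hdfs    = FileSystem.get(new URI("hdfs://namenode:8020/"), conf)

println(localFs.exists(p)) // checks /tmp/dir/abc.txt on the local disk
println(hdfs.exists(p))    // checks /tmp/dir/abc.txt in HDFS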

Despite what is claimed by the documentation, as of now (Oct. 2015), both -copyFromLocal and -put are the same.
From the online help:
[cloudera@quickstart ~]$ hdfs dfs -help copyFromLocal
-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst> :
Identical to the -put command.
And this is confirmed by looking at the sources, where you can see that the CopyFromLocal class extends the Put class, but without adding any new behavior:
public static class CopyFromLocal extends Put {
  public static final String NAME = "copyFromLocal";
  public static final String USAGE = Put.USAGE;
  public static final String DESCRIPTION = "Identical to the -put command.";
}

public static class CopyToLocal extends Get {
  public static final String NAME = "copyToLocal";
  public static final String USAGE = Get.USAGE;
  public static final String DESCRIPTION = "Identical to the -get command.";
}
As you can see, it is exactly the same for get/copyToLocal.

Both are the same, except that -copyFromLocal is restricted to copying from the local filesystem, while -put can take the source file from anywhere (another HDFS, the local filesystem, etc.).

They're the same. This can be seen by printing usage for hdfs (or hadoop) on a command-line:
$ hadoop fs -help
# Usage: hadoop fs [generic options]
# [ . . . ]
# -copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst> :
# Identical to the -put command.
# -copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst> :
# Identical to the -get command.
Same for hdfs (the Hadoop command specific to HDFS filesystems):
$ hdfs dfs -help
# [ . . . ]
# -copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst> :
# Identical to the -put command.
# -copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst> :
# Identical to the -get command.

Both the -put and -copyFromLocal commands work exactly the same. You cannot use the -put command to copy files from one HDFS directory to another. Let's see this with an example: say your root has two directories, named 'test1' and 'test2'. If 'test1' contains a file 'customer.txt' and you try copying it to the test2 directory:
$ hadoop fs -put /test1/customer.txt /test2
It will result in a 'no such file or directory' error, since -put looks for the source file on the local filesystem, not HDFS.
Both commands are meant to copy files (or directories) from the local filesystem to HDFS only; to copy between HDFS locations, use -cp instead.

Related

How can we set the block size in hadoop specific to each file?

For example, if my input file is 500 MB I want it split into 250 MB blocks; if my input file is 600 MB, the block size should be 300 MB.
If you are loading files into HDFS, you can -put them with the dfs.blocksize option; you can calculate the parameter in a shell depending on the file size.
hdfs dfs -D dfs.blocksize=268435456 -put myfile /some/hdfs/location
If you already have files in HDFS and want to change their block size, you need to rewrite them.
(1) Move the file to a tmp location:
hdfs dfs -mv /some/hdfs/location/myfile /tmp
(2) Copy it back with -D dfs.blocksize=268435456
hdfs dfs -D dfs.blocksize=268435456 -cp /tmp/myfile /some/hdfs/location
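If you are writing files through the Java API instead, the block size can also be chosen per file at create time. A minimal Scala sketch; the path, buffer size, and replication factor are illustrative values:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

// create(path, overwrite, bufferSize, replication, blockSize) lets each file
// have its own block size; 256 MB (in bytes) here
val out = fs.create(new Path("/some/hdfs/location/myfile"),
  true,         // overwrite
  4096,         // io buffer size
  3.toShort,    // replication factor
  268435456L)   // block size in bytes
// ... write data to `out` ...
out.close()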

Difference between 'hdfs dfs -ls' and 'hdfs dfs -ls /'

Why does hdfs dfs -ls point to a different location than hdfs dfs -ls /?
It can clearly be seen that the two commands give different output.
What is the main cause of the difference?
From the official source code, org.apache.hadoop.fs.shell.Ls.java: just search for the word DESCRIPTION and you will find the following:
public static final String DESCRIPTION =
"List the contents that match the specified file pattern. If " +
"path is not specified, the contents of /user/<currentUser> " +
"will be listed. For a directory a list of its direct children " +
"is returned (unless -" + OPTION_DIRECTORY +
" option is specified)"
hadoop fs -ls lists the contents of the current user's home directory.
hadoop fs -ls / lists the direct children of the root directory.
The default location for -ls in Hadoop is the home directory of the user, in this case /user/root.
Adding the / makes the -ls command point at the root directory of the file system.
The / refers to the root folder of HDFS.
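The same distinction can be seen programmatically; a minimal sketch using the FileSystem API (the namenode address in the comment is illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

// What `hdfs dfs -ls` lists: the current user's home directory
println(fs.getHomeDirectory)  // e.g. hdfs://namenode:8020/user/root
fs.listStatus(fs.getHomeDirectory).foreach(s => println(s.getPath))

// What `hdfs dfs -ls /` lists: the direct children of the root directory
fs.listStatus(new Path("/")).foreach(s => println(s.getPath))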

Checksum verification in Hadoop

Do we need to verify the checksum after we move files to Hadoop (HDFS) from a Linux server through WebHDFS?
I would like to make sure the files on HDFS have no corruption after they are copied. But is checking the checksum necessary?
I read that the client computes checksums before data is written to HDFS.
Can somebody help me understand how I can make sure that the source file on the Linux system is the same as the ingested file on HDFS when using WebHDFS?
If your goal is to compare two files residing on HDFS, I would not use "hdfs dfs -checksum URI" as in my case it generates different checksums for files with identical content.
In the below example I am comparing two files with the same content in different locations:
Old-school md5sum method returns the same checksum:
$ hdfs dfs -cat /project1/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a -
$ hdfs dfs -cat /project2/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a -
However, checksum generated on the HDFS is different for files with the same content:
$ hdfs dfs -checksum /project1/file.txt
0000020000000000000000003e50be59553b2ddaf401c575f8df6914
$ hdfs dfs -checksum /project2/file.txt
0000020000000000000000001952d653ccba138f0c4cd4209fbf8e2e
A bit puzzling, as I would expect identical checksums to be generated for identical content. (The value returned by hdfs dfs -checksum is an MD5-of-MD5-of-CRC checksum that depends on the block size and bytes-per-checksum settings the file was written with, so two files with identical content can still report different checksums.)
Checksum for a file can be calculated using hadoop fs command.
Usage: hadoop fs -checksum URI
Returns the checksum information of a file.
Example:
hadoop fs -checksum hdfs://nn1.example.com/file1
hadoop fs -checksum file:///path/in/linux/file1
Refer to the Hadoop documentation for more details.
So if you want to compare file1 on both Linux and HDFS, you can use the above utility.
I wrote a library with which you can calculate the checksum of local file, just the way hadoop does it on hdfs files.
So, you can compare the checksum to cross check.
https://github.com/srch07/HDFSChecksumForLocalfile
If you are doing this check via the API (hadoopConfiguration below is whatever org.apache.hadoop.conf.Configuration instance you have in scope, e.g. from Spark):
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
Option 1: for the value b9fdea463b1ce46fabc2958fc5f7644a
val md5:String = MD5Hash.digest(FileSystem.get(hadoopConfiguration).open(new Path("/project1/file.txt"))).toString
Option 2: for the value 3e50be59553b2ddaf401c575f8df6914
val md5:String = FileSystem.get(hadoopConfiguration).getFileChecksum(new Path("/project1/file.txt")).toString.split(":")(0)
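To cross-check against the source file on the local Linux filesystem, the same MD5-over-content approach can be applied to a local stream; a minimal sketch (the local path is hypothetical):

import java.io.FileInputStream
import org.apache.hadoop.io.MD5Hash

// MD5 of the raw bytes of the local file; comparable to Option 1 above
val localMd5: String = MD5Hash.digest(new FileInputStream("/path/in/linux/file.txt")).toString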
It does CRC checks: for each and every file it creates a .crc to make sure there is no corruption.

Hadoop fs lookup for block size?

In Hadoop fs how to lookup the block size for a particular file?
I was primarily interested in a command line, something like:
hadoop fs ... hdfs://fs1.data/...
But it looks like that does not exist. Is there a Java solution?
The fsck commands in the other answers list the blocks and allow you to see the number of blocks. However, to see the actual block size in bytes with no extra cruft do:
hadoop fs -stat %o /filename
Default block size is:
hdfs getconf -confKey dfs.blocksize
Details about units
The units for the block size are not documented for the hadoop fs -stat command; however, looking at the source line and the docs for the method it calls, we can see that it uses bytes and cannot report block sizes over about 9 exabytes.
The units for the hdfs getconf command may not be bytes: it returns whatever string is used for dfs.blocksize in the configuration file. (This is seen in the source for the final function and its indirect caller.)
It seems hadoop fs doesn't have an option for this, but hadoop fsck does. You can try this:
$HADOOP_HOME/bin/hadoop fsck /path/to/file -files -blocks
I think it should be doable with:
hadoop fsck /filename -blocks
but I get Connection refused
Try the code below:
path=hdfs://a/b/c
size=`hdfs dfs -count ${path} | awk '{print $3}'`
echo $size
For displaying the actual block size of the existing file within HDFS I used:
[pety@master1 ~]$ hdfs dfs -stat %o /tmp/testfile_64
67108864
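Since the question also asks about a Java solution: the same lookup can be done through the FileSystem API (callable from Java or Scala). A minimal sketch; the path is illustrative:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

// Block size (in bytes) of an existing file, equivalent to `hadoop fs -stat %o`
val blockSize: Long = fs.getFileStatus(new Path("/tmp/testfile_64")).getBlockSize

// Cluster default block size used for new files, equivalent to
// `hdfs getconf -confKey dfs.blocksize`
val defaultBlockSize: Long = fs.getDefaultBlockSize(new Path("/"))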

Hadoop copy a directory?

Is there an HDFS API that can copy an entire local directory to the HDFS? I found an API for copying files but is there one for directories?
Use the Hadoop FS shell. Specifically:
$ hadoop fs -copyFromLocal /path/to/local hdfs:///path/to/hdfs
If you want to do it programmatically, create two FileSystems (one local and one HDFS) and use the FileUtil class.
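A minimal sketch of that FileUtil approach; the paths are illustrative:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val localFs = FileSystem.getLocal(conf) // local filesystem
val hdfs    = FileSystem.get(conf)      // default filesystem (HDFS)

// Recursively copies the whole local directory into HDFS
FileUtil.copy(localFs, new Path("/path/to/local"),
              hdfs, new Path("/path/to/hdfs"),
              false /* deleteSource */, conf)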
I tried copying from the directory using
/hadoop/core/bin/hadoop fs -copyFromLocal /home/grad04/lopez/TPCDSkew/ /export/hadoop1/lopez/Join/TPCDSkew
It gave me an error saying "Target is a directory". I then modified it to
/hadoop/core/bin/hadoop fs -copyFromLocal /home/grad04/lopez/TPCDSkew/*.* /export/hadoop1/lopez/Join/TPCDSkew
and it works.
In Hadoop version:
Hadoop 2.4.0.2.1.1.0-390
(And probably later; I have only tested this specific version as it is the one I have)
You can copy entire directories recursively, without any special notation, using copyFromLocal, e.g.:
hadoop fs -copyFromLocal /path/on/disk /path/on/hdfs
which works even when /path/on/disk is a directory containing subdirectories and files.
You can also use the put command:
$ hadoop fs -put /local/path hdfs:/path
For programmers, you can also use copyFromLocalFile. Here is an example:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
val hdfsConfig = new Configuration
val hdfsURI = "hdfs://127.0.0.1:9000/hdfsData"
val hdfs = FileSystem.get(new URI(hdfsURI), hdfsConfig)
val oriPath = new Path("#your_localpath/customer.csv")
val targetFile = new Path("hdfs://your_hdfspath/customer.csv")
hdfs.copyFromLocalFile(oriPath, targetFile)
