Difference between 'hdfs dfs -ls' and 'hdfs dfs -ls /'

Why does hdfs dfs -ls point to a different location than hdfs dfs -ls /?
Running the two commands clearly shows that they produce different output.
What is the main cause of this difference?

From the official source code, org.apache.hadoop.fs.shell.Ls.java (just search for the word DESCRIPTION), you will find the following:
public static final String DESCRIPTION =
    "List the contents that match the specified file pattern. If " +
    "path is not specified, the contents of /user/<currentUser> " +
    "will be listed. For a directory a list of its direct children " +
    "is returned (unless -" + OPTION_DIRECTORY +
    " option is specified)";
hadoop fs -ls lists the contents of the current user's home directory.
hadoop fs -ls / lists the direct children of the root directory.

The default location for -ls in Hadoop is the home directory of the user, in this case /user/root.
Adding the / makes the -ls command point at the root directory of the file system.
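For illustration, a hypothetical session (the user name and the listings are made up) might look like this:
$ whoami
alice
$ hdfs dfs -ls          # lists /user/alice, the current user's home directory
Found 1 items
drwxr-xr-x   - alice alice          0 2019-01-01 10:00 /user/alice/data
$ hdfs dfs -ls /        # lists the root of the filesystem
Found 2 items
drwxrwxrwt   - hdfs  hdfs           0 2019-01-01 10:00 /tmp
drwxr-xr-x   - hdfs  hdfs           0 2019-01-01 10:00 /user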

The / refers to the root folder of HDFS.

Related

How can we set the block size in Hadoop specifically for each file?

For example, if my input file is 500 MB I want it split into 250 MB blocks, and if my input file is 600 MB the block size should be 300 MB.
If you are loading files into HDFS, you can -put them with the dfs.blocksize option; you can calculate the parameter in the shell depending on the file size:
hdfs dfs -D dfs.blocksize=268435456 -put myfile /some/hdfs/location
If you already have files in HDFS and want to change their block size, you need to rewrite them.
(1) move file to tmp location:
hdfs dfs -mv /some/hdfs/location/myfile /tmp
(2) Copy it back with -D dfs.blocksize=268435456
hdfs dfs -D dfs.blocksize=268435456 -cp /tmp/myfile /some/hdfs/location
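To compute the value in the shell, here is a minimal sketch (the local path is an assumption; the size is rounded up to a multiple of 512 bytes because HDFS requires the block size to be a multiple of the checksum chunk size, dfs.bytes-per-checksum):
f=/home/me/myfile                               # local file to upload (assumed path)
size=$(stat -c %s "$f")                         # file size in bytes (GNU stat)
half=$(( ((size + 1) / 2 + 511) / 512 * 512 ))  # half the size, rounded up to a 512-byte multiple
hdfs dfs -D dfs.blocksize=$half -put "$f" /some/hdfs/location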

Pass argument in quotes with percent signs in Windows cmd

I'm trying to use Hadoop's stat command to retrieve file information from my HDFS. On Linux, you pass formatting strings to stat (similar to GNU stat), like:
$ hdfs dfs -stat "type:%F perm:%a %u:%g size:%b mtime:%y atime:%x name:%n" /file
But I can't figure out how to get this to work in Windows' cmd prompt:
C:\> hdfs dfs -stat "type:%F" /file
C:\> hdfs dfs -stat "type:%F" /
-stat: java.net.URISyntaxException: Relative path in absolute URI: type:F
...
...it looks like it's trying to interpret the first argument as the path, instead of the second one. So I thought "maybe I need to include literal quotes?" Trying to escape the quotes with ^" doesn't work:
C:\> hdfs dfs -stat "^"type:%F^"" /
-stat: java.net.URISyntaxException: Relative path in absolute URI: ^^type:F
...
...in fact, it looks like it auto-escaped the ^ I sent in, but didn't send any quotes at all. Trying to surround the whole argument in single quotes rather than double quotes also doesn't work:
C:\> hdfs dfs -stat '^"type:%F^"' /
-stat: java.net.URISyntaxException: Relative path in absolute URI: 'type:F'
...
This time, it included the single quotes but again skipped the double quotes. Double-escaping the carets also doesn't work:
C:\> hdfs dfs -stat '^^"type:%F^^"' /
-stat: java.net.URISyntaxException: Relative path in absolute URI: 'type:F%5E%5E'
...
Triple-escaping the carets yields the same result as using a single caret.
I've found that a kludgey solution is to begin the formatting string with %3 and not surround it with quotes:
C:\> hdfs dfs -stat %3%u /
3Andrew.Watsonuu
C:\> hdfs dfs -stat %3%u%g /
3Andrew.Watsonsupergroupgg
...but you can see that the returned string then has a 3 at the start, and the last flag character is doubled at the end (uu or gg). I think this is because each %N is expanded into one of the arguments passed to the underlying command, like:
C:\> hdfs dfs -stat %0 /
stat: `hadoop': No such file or directory
2019-09-25 12:28:00
C:\> hdfs dfs -stat %1 /
stat: `fs': No such file or directory
2019-09-25 12:28:00
C:\> hdfs dfs -stat "%2" /
-stat: Illegal option -stat
You can see that %0, %1, and %2 correspond to the first, second, and third tokens in that command. So when I call %3, it substitutes the fourth argument into itself. This explains the weird repeated, glitchy output:
C:\> hdfs dfs -stat %3"repeat" /
3repeat"repeat"repeat
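(For reference, this %N expansion is standard batch-file behavior; on Windows, hdfs is a .cmd wrapper script. A minimal standalone demo, saved as a hypothetical args.cmd, shows the mechanism:)
@echo off
rem args.cmd: inside a batch file, %0 is the script name, %1 the first argument, %2 the second
echo zero=%0 one=%1 two=%2
Calling args.cmd foo bar prints zero=args.cmd one=foo two=bar.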
So the best solution I've come up with so far is to pass a superfluous argument at the end (which will throw an error), but then reference that argument earlier in the command, like:
C:\> hdfs dfs -stat -R %6 / "%%u %%g %%Y" 2> nul
Andrew.Watson supergroup 1569414480510
Andrew.Watson supergroup 1568728730673
...
Andrew.Watson supergroup 1568103636381
Andrew.Watson supergroup 1568103590659
It throws that error at the end, which I hide by redirecting stderr to nul. There's got to be a better way to do this. Any ideas?

hdfs ls on directory returns No such file or directory error

Running hdfs dfs -ls on each of the two directories below returns a No such file or directory error.
[mybox]$ hdfs dfs -ls /data/tdc/dv1/corp/base/dpp/raw/load_date=2018-05-01/ | grep Tenant
drwxr-xr-x - tdcdv1r tdcdv1c 0 2018-05-01 18:28 /data/tdc/dv1/corp/base/dpp/raw/load_date=2018-05-01/rtng_ky=Access.NBNOrder.Amend.Info.{Tenant}.Rejected.v2.event
drwxr-xr-x - tdcdv1r tdcdv1c 0 2018-05-01 15:35 /data/tdc/dv1/corp/base/dpp/raw/load_date=2018-05-01/rtng_ky=Access.NBNOrder.Amend.Info.{Tenant}.v2.event
See the error:
[mybox]$ hdfs dfs -ls /data/tdc/dv1/corp/base/dpp/raw/load_date=2018-05-01/rtng_ky=Access.NBNOrder.Amend.Info.{Tenant}.Rejected.v2.event
ls: `/data/tdc/dv1/corp/base/dpp/raw/load_date=2018-05-01/rtng_ky=Access.NBNOrder.Amend.Info.{Tenant}.Rejected.v2.event': No such file or directory
I can't understand this: it's a directory, so it should list its contents, but instead it returns an error.
You just need to escape the special characters ({ and }, which HDFS treats as glob metacharacters) in the path:
hdfs dfs -ls /data/tdc/dv1/corp/base/dpp/raw/load_date=2018-05-01/rtng_ky=Access.NBNOrder.Amend.Info.\\{Tenant\\}.Rejected.v2.event
EDIT
As said in the comments, you can quote the path to avoid escaping the special characters.
This should work fine:
hdfs dfs -ls '/data/tdc/dv1/corp/base/dpp/raw/load_date=2018-05-01/rtng_ky=Access.NBNOrder.Amend.Info.{Tenant}.Rejected.v2.event'
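As a side note, the braces are what HDFS uses for glob alternation ({a,b} matches either alternative). An untested sketch of a single pattern that would match both directories, with the literal braces escaped inside the quotes:
hdfs dfs -ls '/data/tdc/dv1/corp/base/dpp/raw/load_date=2018-05-01/rtng_ky=Access.NBNOrder.Amend.Info.\{Tenant\}.{Rejected.v2,v2}.event'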

Copy files from a UNC path to HDFS using a shell script

I have UNC path folders under "//aloha/log/folderlevel1/folderlevel2/".
Each of these level-2 folders will have files like "empllog.txt", "deptlog.txt", "adminlog.txt", and a few other files as well.
I want to copy the content of these folders to the HDFS Cloudera cluster if they were created in the last 24 hours, and only if all 3 files are present. If one of these files is not present, that particular folder should not be copied. I also need to preserve the folder structure,
i.e. in HDFS it should be "/user/test/todaydate/folderlevel1/folderlevel2".
I have written the shell script below to copy files to HDFS into a date folder, but I'm not sure how to proceed further with the UNC paths and the other criteria.
day=$(date +%Y-%m-%d)
srcdir="/home/test/sparkjops"
stdir="/user/test/$day/"
hadoop dfs -mkdir $day /user/test
for f in ${srcdir}/*
do
    if [ $f == "$srcdir/empllog.txt" ]
    then
        hadoop dfs -put $f $stdir
    elif [ $f == "$srcdir/deptlog.txt" ]
    then hadoop dfs -put $f $stdir
    elif [ $f == "$srcdir/adminlog.txt" ]
    then hadoop dfs -put $f $stdir
    fi
done
I have tried changing the UNC path as below. It did nothing: no error, and it did not copy the content either.
srcdir="//aloha/log/*/*"
srcdir='//aloha/log/*/*'
srcdir="\\aloha\log\*\*"
Appreciate all help.
Thanks.
EDIT 1:
I ran it in debug mode with sh -x (and also with bash -x, just to check), but it returned the file not found errors below:
test#ubuntu:~/sparkjops$ sh -x ./hdfscopy.sh
+ date +%Y-%m-%d
+ day=2016-12-24
+ srcdir= //aloha/logs/folderlevel1/folderlevel2
+ stdir=/user/test/2016-12-24/
+ hadoop dfs -mkdir 2016-12-24 /user/test
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
mkdir: `2016-12-24': File exists
mkdir: `/user/test': File exists
+ //aloha/logs/folderlevel1/folderlevel2/* = //aloha/logs/folderlevel1/folderlevel2/empllog.txt.txt
./hdfscopy.sh: 12: ./hdfscopy.sh: //aloha/logs/folderlevel1/folderlevel2/*: not found
+ //aloha/logs/folderlevel1/folderlevel2/* = //aloha/logs/folderlevel1/folderlevel2/deptlog.txt.txt
./hdfscopy.sh: 12: ./hdfscopy.sh: //aloha/logs/folderlevel1/folderlevel2/*: not found
+ //aloha/logs/folderlevel1/folderlevel2/* = //aloha/logs/folderlevel1/folderlevel2/adminlog.txt.txt
./hdfscopy.sh: 12: ./hdfscopy.sh: //aloha/logs/folderlevel1/folderlevel2/*: not found
test#ubuntu:~/sparkjops$
I am not able to understand why it is not reading from that path. I have tried different escape sequences as well (a double slash for each slash, forward slashes as in a Windows folder path), but none work; all throw the same error message. I am not sure how to read these files in the script. Any help would be appreciated.
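One hedged sketch of the whole task, assuming the share is first mounted on the Linux box (a plain shell cannot open //aloha/... UNC paths directly, which is why the glob never matches); the CIFS mount point /mnt/aloha, the credentials, and the HDFS target are all assumptions, not tested:
# one-time mount of the UNC share (options are placeholders):
#   sudo mount -t cifs //aloha/log /mnt/aloha -o ro,username=...
day=$(date +%Y-%m-%d)
srcroot="/mnt/aloha"                            # mounted UNC root (assumed)
for d in "$srcroot"/*/*/; do                    # folderlevel1/folderlevel2/
    # copy only if all three logs exist and the folder changed in the last 24 hours
    if [ -f "${d}empllog.txt" ] && [ -f "${d}deptlog.txt" ] && [ -f "${d}adminlog.txt" ] &&
       [ -n "$(find "$d" -maxdepth 0 -mtime -1)" ]; then
        rel=${d#"$srcroot"/}                    # relative part, preserving the folder structure
        hdfs dfs -mkdir -p "/user/test/$day/$rel"
        hdfs dfs -put "$d"* "/user/test/$day/$rel"
    fi
done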

Difference between hadoop fs -put and hadoop fs -copyFromLocal

-put and -copyFromLocal are documented as identical, while most examples use the verbose variant -copyFromLocal. Why?
Same thing for -get and -copyToLocal
-copyFromLocal is similar to -put command, except that the source is restricted to a local file reference.
So basically, you can do everything with -put that you can do with -copyFromLocal, but not vice versa.
Similarly,
-copyToLocal is similar to get command, except that the destination is restricted to a local file reference.
Hence, you can use get instead of -copyToLocal, but not the other way round.
Reference: Hadoop's documentation.
Update: for the latest behavior as of Oct 2015, please see the answer below.
Let's take an example:
If your HDFS contains the path /tmp/dir/abc.txt,
and your local disk also contains this path, then the HDFS API won't know which one you mean unless you specify a scheme like file:// or hdfs://. It might pick the path you did not want to copy.
Therefore you have -copyFromLocal, which prevents you from accidentally copying the wrong file by restricting the parameter to the local filesystem.
-put is for more advanced users who know which scheme to put in front.
It is always a bit confusing to new Hadoop users which filesystem they are currently in and where their files actually are.
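For instance, an explicit scheme removes the ambiguity (the paths here are hypothetical):
hdfs dfs -put file:///tmp/dir/abc.txt hdfs:///tmp/dir/     # local -> HDFS, stated explicitly
hdfs dfs -ls hdfs:///tmp/dir/abc.txt                       # inspects the HDFS copy
hdfs dfs -ls file:///tmp/dir/abc.txt                       # inspects the local copy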
Despite what is claimed by the documentation, as of now (Oct. 2015), both -copyFromLocal and -put are the same.
From the online help:
[cloudera#quickstart ~]$ hdfs dfs -help copyFromLocal
-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst> :
Identical to the -put command.
And this is confirmed by looking at the sources, where you can see that the CopyFromLocal class extends the Put class, but without adding any new behavior:
public static class CopyFromLocal extends Put {
    public static final String NAME = "copyFromLocal";
    public static final String USAGE = Put.USAGE;
    public static final String DESCRIPTION = "Identical to the -put command.";
}

public static class CopyToLocal extends Get {
    public static final String NAME = "copyToLocal";
    public static final String USAGE = Get.USAGE;
    public static final String DESCRIPTION = "Identical to the -get command.";
}
As you might notice it, this is exactly the same for get/copyToLocal.
Both are the same, except that
-copyFromLocal is restricted to copying from the local filesystem, while -put can take a file from anywhere (another HDFS, the local filesystem, ...).
They're the same. This can be seen by printing usage for hdfs (or hadoop) on a command-line:
$ hadoop fs -help
# Usage: hadoop fs [generic options]
# [ . . . ]
# -copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst> :
# Identical to the -put command.
# -copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst> :
# Identical to the -get command.
Same for hdfs (the hadoop command specific for HDFS filesystems):
$ hdfs dfs -help
# [ . . . ]
# -copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst> :
# Identical to the -put command.
# -copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst> :
# Identical to the -get command.
Both the -put and -copyFromLocal commands work exactly the same. You cannot use the -put command to copy files from one HDFS directory to another. Let's see this with an example: say your root has two directories, named 'test1' and 'test2'. If 'test1' contains a file 'customer.txt' and you try copying it to the test2 directory:
$ hadoop fs -put /test1/customer.txt /test2
It will result in a 'no such file or directory' error, since -put will look for the file in the local file system, not in HDFS.
They are both meant to copy files (or directories) from the local file system to HDFS, only.
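To copy within HDFS itself, use -cp instead (reusing the example paths above):
hadoop fs -cp /test1/customer.txt /test2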
