Checking if a directory in HDFS already exists or not - shell

I have the following directory structure in HDFS:
/analysis/alertData/logs/YEAR/MONTH/DATE/HOURS
That is, data arrives hourly and is stored in year/month/day/hour folders.
I have written a shell script to which I pass the path up to
"/analysis/alertData/logs" (this prefix varies depending on which product's data I am handling).
The script then walks through the year/month/date/hour folders and returns the most recent path.
For example, the directories present in HDFS have the following structure:
/analysis/alertData/logs/2014/10/22/01
/analysis/alertData/logs/2013/5/14/04
The shell script is given the path "/analysis/alertData/logs"
and it outputs the most recent directory: /analysis/alertData/logs/2014/10/22/01
My question is: how can I validate whether the HDFS directory path passed to the shell script is valid? Let's say I pass a wrong path as input, or a path that does not exist; how do I handle that in the shell script?
Sample wrong paths:
wrong path: /analysis/alertData (correct path: /analysis/alertData/logs/)
wrong path: /abc/xyz/ (path does not exist in HDFS)
I tried using the hadoop dfs -test -z/-d/-e options, but they did not work for me.
Any suggestions?
NOTE: I am not posting my original code here, as the solution to my problem does not depend on it.
Thanks in advance.

Try it without the [ ] test command:
if hadoop fs -test -d "$yourdir"; then echo "ok"; else echo "not ok"; fi

Since
hdfs dfs -test -d "$yourdir"
returns 0 if the directory exists, you can check the exit status:
if [ $? -eq 0 ]; then
echo "exists"
else
echo "dir does not exist"
fi
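Applied to the question, a minimal validation sketch to run before searching for the latest sub-directory (the variable names are illustrative):
basepath="$1"    # e.g. /analysis/alertData/logs
if ! hadoop fs -test -d "$basepath"; then
    echo "Invalid or non-existent HDFS directory: $basepath" >&2
    exit 1
fi
# ... continue walking year/month/day/hour to find the latest path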

hadoop dfs is deprecated; use hdfs dfs instead.
Usage: hdfs dfs -test -[ezd] URI
Options:
The -e option will check to see if the file exists, returning 0 if true.
The -z option will check to see if the file is zero length, returning 0 if true.
The -d option will check to see if the path is a directory, returning 0 if true.
Example: hdfs dfs -test -d $yourdir
Please check the following for more info: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
Regards

Hi, I have used the following script to test whether an HDFS directory exists or not. I saw in your question that you tried this test command and it did not work. Could you please provide a trace of why it is not working?
hadoop fs -test -d $dirpath
if [ $? != 0 ]
then
hadoop fs -mkdir $dirpath
else
echo "Directory already present in HDFS"
fi

This works for Scala with Spark:
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val fileExists = fs.exists(new Path(<HDFSPath>)) // returns true or false

In Java we can verify this by using the FileSystem class (e.g., FileSystem.exists(Path)).

Related

copy file from unc to hdfs using shellscript

I have UNC path folders under this path: "//aloha/log/folderlevel1/folderlevel2/"
Each of these level2 folders will have files like "empllog.txt", "deptlog.txt", "adminlog.txt" and a few other files as well.
I want to copy the content of these folders to the HDFS Cloudera cluster, but only if they were created in the last 24 hours and only if all 3 of these files are present. If any one of these files is not present, that particular folder should not be copied. I also need to preserve the folder structure,
i.e. in HDFS it should be "/user/test/todaydate/folderlevel1/folderlevel2".
I have written the shell script below to copy files to HDFS with a date folder created, but I am not sure how to proceed further with the UNC paths and the other criteria.
day=$(date +%Y-%m-%d)
srcdir="/home/test/sparkjops"
stdir="/user/test/$day/"
hadoop dfs -mkdir $day /user/test
for f in ${srcdir}/*
do
if [ $f == "$srcdir/empllog.txt" ]
then
hadoop dfs -put $f $stdir
elif [ $f == "$srcdir/deptlog.txt" ]
then hadoop dfs -put $f $stdir
elif [ $f == "$srcdir/adminlog.txt" ]
then hadoop dfs -put $f $stdir
fi
done
I have tried changing the UNC path as below. It did not do anything: no error, but it did not copy the content either.
srcdir="//aloha/log/*/*"
srcdir='//aloha/log/*/*'
srcdir="\\aloha\log\*\*"
Appreciate all help.
Thanks.
EDIT 1:
I ran it in sh -x debug mode, and also with bash -x (just to check), but it returned a "not found" error as below:
test#ubuntu:~/sparkjops$ sh -x ./hdfscopy.sh
+ date +%Y-%m-%d
+ day=2016-12-24
+ srcdir= //aloha/logs/folderlevel1/folderlevel2
+ stdir=/user/test/2016-12-24/
+ hadoop dfs -mkdir 2016-12-24 /user/test
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
mkdir: `2016-12-24': File exists
mkdir: `/user/test': File exists
+ //aloha/logs/folderlevel1/folderlevel2/* = //aloha/logs/folderlevel1/folderlevel2/empllog.txt.txt
./hdfscopy.sh: 12: ./hdfscopy.sh: //aloha/logs/folderlevel1/folderlevel2/*: not found
+ //aloha/logs/folderlevel1/folderlevel2/* = //aloha/logs/folderlevel1/folderlevel2/deptlog.txt.txt
./hdfscopy.sh: 12: ./hdfscopy.sh: //aloha/logs/folderlevel1/folderlevel2/*: not found
+ //aloha/logs/folderlevel1/folderlevel2/* = //aloha/logs/folderlevel1/folderlevel2/adminlog.txt.txt
./hdfscopy.sh: 12: ./hdfscopy.sh: //aloha/logs/folderlevel1/folderlevel2/*: not found
test#ubuntu:~/sparkjops$
But I am not able to understand why it is not reading from that path. I have tried different escaping sequences as well (double slashes for each slash, forward slashes as in a Windows folder path), but none of them work; all throw the same error message. I am not sure how to read these files in the script. Any help would be appreciated.
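A minimal sketch of the copy logic, assuming the UNC share //aloha/log has first been mounted on the Linux host (the /mnt/aloha/log mount point is hypothetical); a plain Linux shell cannot read Windows UNC paths directly, which is why the //aloha/... glob patterns never expand. Run it with bash:
day=$(date +%Y-%m-%d)
srcroot="/mnt/aloha/log"        # hypothetical mount point of //aloha/log
dstroot="/user/test/$day"
for dir in "$srcroot"/*/*/ ; do
    # only consider folders modified within the last 24 hours
    [ -n "$(find "$dir" -maxdepth 0 -mtime -1)" ] || continue
    # copy only if all three required files are present
    if [ -f "${dir}empllog.txt" ] && [ -f "${dir}deptlog.txt" ] && [ -f "${dir}adminlog.txt" ]; then
        rel=${dir#"$srcroot"/}                 # e.g. folderlevel1/folderlevel2/
        hdfs dfs -mkdir -p "$dstroot/$rel"     # preserve the folder structure
        hdfs dfs -put "${dir}empllog.txt" "${dir}deptlog.txt" "${dir}adminlog.txt" "$dstroot/$rel"
    fi
done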

Difference between 'hdfs dfs -ls' and 'hdfs dfs -ls /'

Why does hdfs dfs -ls point to a different location than hdfs dfs -ls /?
The two commands clearly give different output.
What is the main cause of this difference?
From the official source code, org.apache.hadoop.fs.shell.Ls.java, search for the word DESCRIPTION. It contains the following:
public static final String DESCRIPTION =
"List the contents that match the specified file pattern. If " +
"path is not specified, the contents of /user/<currentUser> " +
"will be listed. For a directory a list of its direct children " +
"is returned (unless -" + OPTION_DIRECTORY +
" option is specified)"
hadoop fs -ls will list the home directory contents of the current user.
hadoop fs -ls / will list the direct children of the root directory.
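For example, assuming the current user is root:
hdfs dfs -ls             # lists the user's HDFS home directory, i.e. /user/root
hdfs dfs -ls /user/root  # equivalent to the above
hdfs dfs -ls /           # lists the direct children of the filesystem root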
The default location for -ls in Hadoop is the home directory of the user, in this case /user/root.
Adding the / makes the -ls command point at the root directory of the file system.
The / refers to the root folder of HDFS.

Checksum verification in Hadoop

Do we need to verify the checksum after we move files to Hadoop (HDFS) from a Linux server through WebHDFS?
I would like to make sure the files on HDFS have no corruption after they are copied. But is checking the checksum necessary?
I read that the client computes a checksum before data is written to HDFS.
Can somebody help me understand how I can make sure that the source file on the Linux system is the same as the ingested file on HDFS when using WebHDFS?
If your goal is to compare two files residing on HDFS, I would not use "hdfs dfs -checksum URI" as in my case it generates different checksums for files with identical content.
In the below example I am comparing two files with the same content in different locations:
Old-school md5sum method returns the same checksum:
$ hdfs dfs -cat /project1/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a -
$ hdfs dfs -cat /project2/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a -
However, the checksum generated by HDFS is different for files with the same content:
$ hdfs dfs -checksum /project1/file.txt
0000020000000000000000003e50be59553b2ddaf401c575f8df6914
$ hdfs dfs -checksum /project2/file.txt
0000020000000000000000001952d653ccba138f0c4cd4209fbf8e2e
A bit puzzling, as I would expect identical checksums to be generated for identical content. (The value returned by hdfs dfs -checksum is a composite MD5-of-MD5-of-CRC value that also depends on the block size and bytes-per-checksum settings used when each file was written, so identical content can still yield different checksums.)
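Given that, a practical way to verify a WebHDFS upload is to hash the byte stream on both sides and compare (the paths are illustrative):
# on the Linux source host
md5sum /path/in/linux/file1
# stream the HDFS copy through the same hash
hdfs dfs -cat /path/on/hdfs/file1 | md5sum
# matching digests indicate the ingested file is identical to the source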
The checksum for a file can be calculated using the hadoop fs command.
Usage: hadoop fs -checksum URI
Returns the checksum information of a file.
Example:
hadoop fs -checksum hdfs://nn1.example.com/file1
hadoop fs -checksum file:///path/in/linux/file1
Refer to the Hadoop documentation for more details.
So if you want to compare file1 on both Linux and HDFS, you can use the above utility.
I wrote a library with which you can calculate the checksum of a local file, just the way Hadoop does it for HDFS files.
So, you can compare the checksums to cross-check.
https://github.com/srch07/HDFSChecksumForLocalfile
If you are doing this check via the API:
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
Option 1: for the value b9fdea463b1ce46fabc2958fc5f7644a
val md5:String = MD5Hash.digest(FileSystem.get(hadoopConfiguration).open(new Path("/project1/file.txt"))).toString
Option 2: for the value 3e50be59553b2ddaf401c575f8df6914
val md5:String = FileSystem.get(hadoopConfiguration).getFileChecksum(new Path("/project1/file.txt")).toString.split(":")(0)
It does a CRC check: for each and every file it creates a .crc file to make sure there is no corruption.

Spark - on EMR saveAsTextFile won't write data to local dir

Running Spark on EMR (AMI 3.8). When trying to write an RDD to a local file, I am getting no results on the name/master node.
On my previous EMR cluster (same version of Spark installed with bootstrap script instead of as an add-on to EMR), the data would write to the local dir on the name node. Now I can see it appearing in "/home/hadoop/test/_temporary/0/task*" directories on the other nodes in the cluster, but only the 'SUCCESS' file on the master node.
How can I get the file to write to the name/master node only?
Here is an example of the command I am using:
myRDD.saveAsTextFile("file:///home/hadoop/test")
I can do this in a roundabout way by pushing to HDFS first and then writing the results to the local filesystem with shell commands, but I would love to hear if others have a more elegant approach.
// write an RDD to a single local text file (via HDFS)
import org.apache.spark.rdd.RDD
import scala.sys.process._

def rddToFile(rdd: RDD[_], filePath: String) = {
  // set up the bash commands
  val createFileStr = "hadoop fs -cat " + filePath + "/part* > " + filePath
  val removeDirStr = "hadoop fs -rm -r " + filePath
  // remove the HDFS dir in case it already exists
  Process(Seq("bash", "-c", removeDirStr)).!
  // save the data to HDFS
  rdd.saveAsTextFile(filePath)
  // merge the part files into a local file at the same path
  Process(Seq("bash", "-c", createFileStr)).!
  // remove the HDFS dir
  Process(Seq("bash", "-c", removeDirStr)).!
}
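As a side note, the cat-and-redirect step can also be done with hdfs dfs -getmerge, which concatenates the part files into a single local file (the paths below are illustrative):
hdfs dfs -getmerge /user/hadoop/test /home/hadoop/test.txt
hdfs dfs -rm -r /user/hadoop/test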

Hadoop fs mkdir and testing using FileSystem.exists

I can create directories in my Hadoop cluster using hadoop fs -mkdir /test/input. I can check this by browsing localhost:50070, and it works:
/test
/tmp
But when I check for existence from Java:
FileSystem fs = FileSystem.get(conf);
fs.exists(new Path("/tmp")); // returns true
fs.exists(new Path("/test")); // returns false
The same thing happens even when I create test inside /tmp. What's wrong?
Thanks,
FileSystem.get(conf) may return the local file system, where the /tmp/ folder exists but /test/ does not. Try specifying the file system that you want to get:
FileSystem fs = new Path("hdfs://localhost:8020/").getFileSystem(conf);
I'm not sure about the port; you may need 9000.
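As a quick sanity check, you can also print which default filesystem the client configuration resolves to (the value shown is just an example):
hdfs getconf -confKey fs.defaultFS
# e.g. hdfs://localhost:9000; if this prints file:///, FileSystem.get(conf) returns the local filesystem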
