How to check whether a file exists or not using hdfs shell commands - hadoop

I am new to Hadoop and need a little help.
Suppose I run a job in the background from a shell script. How do I know whether the job has completed? I am asking because once the job is done, my script has to move the output file to another location. How can I check whether the job has completed, or whether the output file exists, using HDFS commands?

Be careful about detecting completion this way, because output can appear before the job has completely finished.
To answer your direct question: to test for existence I typically run hadoop fs -ls $output | wc -l and make sure the count is greater than 0.
My suggestion is to use && to tack on the move:
hadoop ... myjob.jar ... && hadoop fs -mv $output $new_output &
This runs the job and performs the move only after the job has exited successfully.
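
The existence check can also be done with hadoop fs -test -e, which reports the result through its exit status instead of through output that has to be counted. A minimal sketch, assuming hadoop is on the PATH; the function name move_if_exists and its arguments are hypothetical:

```shell
# Move an HDFS path to a new location only if it exists.
# Usage: move_if_exists <hdfs_src> <hdfs_dst>
move_if_exists() {
    src=$1
    dst=$2
    # `hadoop fs -test -e` exits 0 when the path exists.
    if hadoop fs -test -e "$src"; then
        hadoop fs -mv "$src" "$dst"
    else
        echo "not found: $src" >&2
        return 1
    fi
}
```

It could be called as move_if_exists "$output" "$new_output". The caveat above still applies: an existing path does not prove the job has finished.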

You can use JobConf.setJobEndNotificationURI() to get notified when the job completes.
I think you can also use the ps command to check for the PID of the process that started the Hadoop job.


Script didn't Finish execution but cron job started again

I am trying to run a cron job that executes my shell script, which in turn runs Hive and Pig scripts. The cron job is set to run every 2 minutes, but it starts again before the previous run of my script has finished. Will this affect my results, or will cron wait until the script finishes before starting it again? I am in a bit of a dilemma here. Please help.
I think there are two ways to better resolve this, a long way and a short way:
Long way (probably most correct):
Use something like Luigi to manage job dependencies, then run that with Cron (it won't run more than one of the same job).
Luigi will handle all your job dependencies for you and you can make sure that a particular job only executes once. It's a little more work to get set-up, but it's really worth it.
Short Way:
Lock files have already been mentioned, but you can do this on HDFS too, that way it doesn't depend on where you run the cron job from.
Instead of checking for a lock file, put a flag on HDFS when you start and finish the job, and have this as a standard thing in all of your cron jobs:
# at start
hadoop fs -touchz /jobs/job1/2016-07-01/_STARTED
# at finish
hadoop fs -touchz /jobs/job1/2016-07-01/_COMPLETED
# Then check them (pseudocode):
if(!started && !completed): run_job; add_completed; remove_started
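
The pseudocode check above can be written with hadoop fs -test -e. This is only a sketch, assuming hadoop is on the PATH and a flag directory laid out as in the example; the function name run_once is hypothetical. The check and the creation of _STARTED are two separate steps, so two runs starting at the same instant could still both pass the check.

```shell
# Run a command once per flag directory, using HDFS marker files.
# Usage: run_once <hdfs_flag_dir> <command> [args...]
run_once() {
    flag_dir=$1
    shift
    # Skip if a run is in progress or has already finished.
    if hadoop fs -test -e "$flag_dir/_STARTED" ||
       hadoop fs -test -e "$flag_dir/_COMPLETED"; then
        echo "skipping: already started or completed" >&2
        return 0
    fi
    hadoop fs -mkdir -p "$flag_dir" || return 1
    hadoop fs -touchz "$flag_dir/_STARTED" || return 1
    "$@"
    status=$?
    # Mark completion only when the command succeeded.
    if [ "$status" -eq 0 ]; then
        hadoop fs -touchz "$flag_dir/_COMPLETED"
    fi
    hadoop fs -rm "$flag_dir/_STARTED"
    return "$status"
}
```

A cron entry could then call something like run_once /jobs/job1/2016-07-01 ./job1.sh, where job1.sh stands in for your own script.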
At the start of the script, have a check:
if [ -e /tmp/file.lock ]; then
exit # Lock file exists, which means the previous execution has not completed.
fi
touch /tmp/file.lock # No lock file exists, so create one and continue.
.... # Your script here
rm /tmp/file.lock # Remove the lock when the script finishes.
There are many other ways of achieving the same thing. This is just a simple example.
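
Checking for a lock file and then creating it leaves a small window in which two runs can both pass the check. One common variation uses mkdir, which fails if the directory already exists, so checking and acquiring the lock happen in a single step. A sketch only; the function name with_lock and the lock path are hypothetical:

```shell
# Run a command while holding a lock directory; skip if it is already held.
# Usage: with_lock <lock_dir> <command> [args...]
with_lock() {
    lock_dir=$1
    shift
    # mkdir either creates the directory or fails, so only one caller wins.
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "previous run still in progress, exiting" >&2
        return 1
    fi
    "$@"
    status=$?
    rmdir "$lock_dir"   # release the lock
    return "$status"
}
```

For example, with_lock /tmp/myjob.lock ./myscript.sh. If the script is killed before it reaches rmdir, the lock directory stays behind and has to be removed by hand.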

What are the different ways to check if the mapreduce program ran successfully

If we need to automate a MapReduce program or run it from a script, what are the different ways to check whether it ran successfully? One way is to check whether a _SUCCESS file was created in the output directory. Does the command "hadoop jar program.jar hdfs:/input.txt hdfs:/output" return 0 or 1 depending on success or failure?
Just like any other command in Linux, you can check the exit status of a hadoop jar command using the built-in variable $?.
You can use:
echo $?
immediately after the hadoop jar command to check its status.
The exit status ranges from 0 to 255. An exit status of zero means the command executed successfully, while a non-zero value indicates that it failed.
Edit: To see how to achieve automation or run it from a script, refer to "Hadoop job fails when invoked by cron".
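
In a script it is often simpler to test the command directly in an if statement than to echo $? afterwards. A small sketch; the function name run_job is hypothetical:

```shell
# Run a command and report success or failure from its exit status.
# Usage: run_job <command> [args...]
run_job() {
    if "$@"; then
        echo "job succeeded"
    else
        status=$?   # exit status of the command tested by `if`
        echo "job failed with exit status $status" >&2
        return "$status"
    fi
}
```

With the command from the question, this would be run_job hadoop jar program.jar hdfs:/input.txt hdfs:/output.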

How to Kill Hadoop fs -copyToLocal task

I ran the following command on my local filesystem:
hadoop fs -copyToLocal <HDFS Path>
But in the middle of the task (after entering the command in the terminal and before it completes), I want to cancel the copy. How can I do this?
Also, is -copyToLocal executed as an MR job internally? Can someone point me to a reference?
It uses the FileSystem API to stream and copy the file to the local filesystem. There is no MR job involved.
You can find the process on the machine and kill it. It is usually a JVM process.
If you started the command with nohup and/or &, you can find the process by searching for copyToLocal in the output of ps -eaf and then kill it. If you ran it normally in the foreground, Ctrl+C will kill it; Ctrl+Z only suspends it, so the suspended process still has to be killed afterwards.
In both cases, any partially copied or temporary files that were created remain in place, so after killing the process you have to clean them up before running the same copy again.
It will not create any MR job.

Running script on my local computer when jobs submitted by qsub on a server finish

I am submitting jobs via qsub to a server and want to analyze the results on my local machine after the jobs have finished. I can find a way to submit the analysis job on the server, but I don't know how to run that script on my local machine.
qsub -W depend=afterok:$jobID
But instead of the above, I want something like
if(qsub -W depend=afterok:$jobID) finished successfully
some script
How can I accomplish the above task?
Thank you very much.
I've faced a similar issue, and I'll try to sketch the solution that worked for me:
After submitting your actual job, I would create a loop in your script that checks whether the job is still running using
qstat $jobID | grep $jobID | awk '{print $5}'
I'm not 100% sure that the status is in the 5th column, so you had better double-check. While the job is idling, the status will be I or Q; while running, R; and afterwards, C.
Once it's finished, I usually grep the output files for signs of whether the run succeeded, and then run the appropriate post-processing script.
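
The polling loop described above might look like the following. It is a sketch under assumptions: the state is taken from the 5th column as in the qstat command above (verify this for your scheduler), a job that no longer shows up in qstat is treated as finished, and the function name wait_for_job and the 30-second default interval are hypothetical.

```shell
# Poll qstat until the job reaches state C or disappears from the listing.
# Usage: wait_for_job <job_id> [poll_seconds]
wait_for_job() {
    job_id=$1
    interval=${2:-30}
    while :; do
        # Assumption: the job state is the 5th column of the job's line.
        state=$(qstat "$job_id" 2>/dev/null |
            awk -v id="$job_id" 'index($0, id) { print $5; exit }')
        case "$state" in
            C|"") return 0 ;;   # completed, or no longer known to qstat
        esac
        sleep "$interval"
    done
}
```

After wait_for_job "$jobID" returns, the script can inspect the output files and start the local post-processing.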
One thing that works for me is to run qsub synchronously with the option
qsub -sync y
(either on the command line or as
#$ -sync y
in the script itself).
qsub will then exit with code 0 only if the job (or all array jobs) finished successfully.
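
Because qsub blocks until the job ends in this mode, its exit status can gate the local follow-up step directly. A sketch, assuming a qsub that supports -sync y as described above; the function name submit_then and the script names in the usage line are hypothetical:

```shell
# Submit a job synchronously, then run a local follow-up command on success.
# Usage: submit_then <job_script> <local_command> [args...]
submit_then() {
    job_script=$1
    shift
    # Assumption: this qsub supports `-sync y` and waits for the job to end.
    if qsub -sync y "$job_script"; then
        "$@"
    else
        echo "job failed, skipping post-processing" >&2
        return 1
    fi
}
```

For example, submit_then job.sh ./analyze_results.sh.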

How to clear hadoop fifo queue?

I have set up a cluster in pseudo-distributed mode. The FIFO scheduler got stuck at some point, and a lot of jobs that I had scheduled through cron piled up. Now, when I restart the YARN ResourceManager, it gets stuck after a while and the jobs keep piling up.
Is there a way to clear the whole queue? Or is my understanding of Hadoop scheduling flawed somewhere? Please help.
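If you're trying to kill all the jobs in your queue, you can use this shell command:
$HADOOP_HOME/bin/hadoop job -list | awk '/^job_/ { system("$HADOOP_HOME/bin/hadoop job -kill " $1) }'
The /^job_/ pattern skips the header lines that hadoop job -list prints before the job rows. HADOOP_HOME must be exported so that the shell started by system() can expand it.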
