How to kill a Hadoop fs -copyToLocal task

I ran the following command on my local filesystem:
hadoop fs -copyToLocal <HDFS Path>
But in the middle of the task (after entering the command in the terminal and before it completes), I want to cancel the copy. How can I do this?
Also, is -copyToLocal executed as an MR job internally? Can someone point me to a reference?
Thanks.

It uses the FileSystem API to stream and copy the file to the local filesystem. There is no MR job involved.
You can find the process on the machine and kill it. It is usually a JVM process that gets invoked.

If you ran the command with nohup and/or &, you can find the process by searching for CopyToLocal in the output of ps -eaf. If you ran it in the foreground, Ctrl+C (or Ctrl+Z followed by a kill) will stop it.
In both cases, any partial and temporary files created by the copy remain on disk, so after killing the process you have to clear them out before running the same copy again.
It will not create any MR job.
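The find-and-kill step described above can be sketched as a small shell helper. This is only a sketch: the pattern "copyToLocal" and the escalation from SIGTERM to SIGKILL are illustrative conventions, not an official Hadoop mechanism.

```shell
#!/bin/sh
# kill_by_pattern: find the first process whose command line matches a
# pattern and terminate it, escalating to SIGKILL if SIGTERM is ignored.
# The pattern "copyToLocal" used below is an assumption; adjust it to
# match however you launched the copy.
kill_by_pattern() {
    # grep -v excludes this shell itself from the matches
    pid=$(pgrep -f "$1" | grep -v "^$$\$" | head -n 1)
    if [ -z "$pid" ]; then
        echo "no process matching '$1'"
        return 1
    fi
    kill "$pid" 2>/dev/null            # polite SIGTERM first
    sleep 1
    if kill -0 "$pid" 2>/dev/null; then
        kill -9 "$pid" 2>/dev/null     # force it if still alive
    fi
    echo "killed $pid"
}

# Usage: kill_by_pattern copyToLocal
```

Remember that the partially copied local file is left behind and should be removed before retrying the copy.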

Related

HBase export task mysteriously stopped logging to output file

I recently attempted to do an export of a table from an HBase instance using a 10 data node Hadoop cluster. The command line looked like the following:
nohup hbase org.apache.hadoop.hbase.mapreduce.Export documents /export/documents 10 > ~/documents_export.out &
As you can see, I ran the process under nohup so it wouldn't die prematurely when my SSH session closed, and I put the whole thing in the background. To capture the output, I redirected it to a file.
As expected, the process started to run, and in fact ran for several hours before the output mysteriously stopped appearing in the file. It stopped at about 31% through the map phase of the MapReduce job being run. However, per Hadoop, the MapReduce job itself was still running, and in fact ran to completion by the next morning.
So, my question is: why did output stop going to my log file? My best guess is that the parent HBase process I invoked exited normally once it had finished the initial setup for the MapReduce job involved in the export.
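Even after the submitting client exits, the MapReduce job behind the export can be followed from the cluster side. A minimal sketch, assuming the YARN CLI is on the PATH; the application id is a placeholder you would copy from the -list output:

```shell
# Sketch: follow a running export independently of the client that
# submitted it. The application id passed in is a placeholder taken
# from `yarn application -list`.
track_export() {
    yarn application -status "$1"                 # overall state/progress
    yarn logs -applicationId "$1" | tail -n 50    # recent log output
}

# Usage (on the cluster):
# yarn application -list -appStates RUNNING   # find the application id
# track_export application_1500000000000_0042
```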

Kill hive queries without exiting from hive shell

Is there any way to kill a Hive query without exiting from the Hive shell? For example, I mistakenly ran a select statement on a table with millions of rows of data, and I just wanted to stop it without exiting the shell. If I press Ctrl+Z, it drops out of the shell.
You have two options:
press Ctrl+C and wait till the command terminates; this will not exit the Hive CLI. Press Ctrl+C a second time and the session will terminate immediately, exiting to the shell
from another shell run
yarn application -kill <Application ID> or
mapred job -kill <JOB_ID>
First, look for Job ID by:
hadoop job -list
And then kill it by ID:
hadoop job -kill <JOB_ID>
Go with the second option:
yarn application -kill <Application ID>. Get the application ID from another session.
This is the only way I think you can kill the current query. I use it via Beeline on the Hortonworks platform.
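The second-terminal route can be sketched as below. The application and job ids are placeholders you would copy from the list commands, and the yarn/mapred CLIs are assumed to be on the PATH:

```shell
# Sketch: kill the job behind a runaway Hive query from another terminal.
# The argument is a placeholder id copied from the -list output.
kill_hive_query() {
    yarn application -kill "$1"               # newer clusters: YARN app id
    # older clusters: mapred job -kill "$1"   (or: hadoop job -kill "$1")
}

# Usage:
# yarn application -list        # copy the application id of your query
# kill_hive_query application_1500000000000_0042
```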

Stopping Flume Agent

I have a requirement where I want to run a Flume agent with a spooling directory as the source. After all the files from the spooling directory are copied to HDFS (the sink), I want the agent to stop, since I know all the files have been pushed to the channel.
I also want to run these steps for a different spooling directory each time, and stop the agent when all files from the directory are marked as .COMPLETED.
Is there any way to stop the flume agent?
For now, I can suggest that you keep open the terminal in which the Flume agent is running. Press Ctrl+C in that terminal and the agent is gone.
Two ways to stop the Flume agent:
Go to the terminal where the Flume agent is running and press Ctrl+C to forcefully kill it.
Run jps from any terminal and look for the 'Application' process. Note down its process id, then run kill -9 <process id> to terminate it.
Open another session window, then use the command below:
ps -ef | grep flume
Take the process id, and use the command below to kill it:
kill -9 <process_id>
This worked for me.

How to interrupt PIG from DUMP -ing a huge file/variable in grunt mode?

How do we interrupt pig dump command (EDIT: when it has completed the MapReduce jobs and is now just displaying the result on grunt shell) without exiting the grunt shell?
Sometimes, if we dump a HUGE file by mistake, it goes on forever!
I know we can use Ctrl+C to stop it, but that also quits the grunt shell, and then we have to type all the commands again.
We can execute the following command in the grunt shell:
kill <jobid>
We can find the job’s ID by looking at Hadoop’s JobTracker GUI, which lists all jobs currently running on the cluster. Note that this command kills a particular MapReduce job. If the Pig job contains other MapReduce jobs that do not depend on the killed MapReduce job, these jobs will still continue. If you want to kill all of the MapReduce jobs associated with a particular Pig job, it is best to terminate the process running Pig using CTRL+C, and then use this command to kill any MapReduce jobs that are still running.
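A short grunt session sketch of the above; the job id shown is a placeholder copied from the JobTracker GUI:

```
grunt> DUMP huge_data;        -- started by mistake, output floods the shell
-- JobTracker GUI shows the running job as job_201701011200_0042 (placeholder)
grunt> kill job_201701011200_0042
```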

How to check whether a file exists or not using hdfs shell commands

I am new to Hadoop and need a little help.
Suppose I run a job in the background using a shell script: how do I know whether the job has completed or not? The reason I ask is that once the job is complete, my script has to move the output file to some other location. How can I check whether the job has completed or the output file exists using HDFS commands?
Thanks
MRK
You need to be careful detecting job completion this way, because there might be output before your job is completely finished.
To answer your direct question, to test for existence I typically do hadoop fs -ls $output | wc -l and then make sure the number is greater than 0.
My suggestion is you use && to tack on the move:
hadoop ... myjob.jar ... && hadoop fs -mv $output $new_output &
This runs the job to completion and then performs the move.
You can use JobConf.setJobEndNotificationURI() to get notified when the job gets completed.
I think you can also check for the pid of the process that started the Hadoop job using the ps command.
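The && pattern above can be sketched as a helper. As a hedged alternative to counting -ls lines, hadoop fs -test -e exits with status 0 when the path exists. The jar name and paths below are placeholders:

```shell
# Sketch: run the job, then move the output only if the job succeeded
# and the output path actually exists. All names are placeholders.
run_and_move() {
    output="$1"; new_output="$2"; shift 2
    hadoop jar myjob.jar "$@" \
        && hadoop fs -test -e "$output" \
        && hadoop fs -mv "$output" "$new_output"
}

# Usage (in the background, as in the question):
# run_and_move /user/me/out /user/me/archive arg1 arg2 &
```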
