HBase export task mysteriously stopped logging to output file - hadoop

I recently attempted to do an export of a table from an HBase instance using a 10 data node Hadoop cluster. The command line looked like the following:
nohup hbase org.apache.hadoop.hbase.mapreduce.Export documents /export/documents 10 > ~/documents_export.out &
As you can see, I nohup'd the process so it wouldn't die prematurely when my SSH session closed, and I put the whole thing in the background. To capture the output, I redirected it to a file.
As expected, the process started to run, and in fact ran for several hours before the output mysteriously stopped appearing in the file I was redirecting to. It stopped at about 31% through the map phase of the MapReduce job. However, according to Hadoop, the MapReduce job itself was still running and had in fact run to completion by the next morning.
So, my question is why did output stop going to my log file? My best guess is that the parent HBase process I invoked exited normally when it was done with the initial setup for the mapreduce job involved in the export.
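A hedged sketch of one way to rerun the export so the log file captures everything, and to track the job independently of the client process. It assumes the same table and paths as above; whether the job client's progress lines go to stderr rather than stdout depends on the cluster's log4j configuration, so treat that as an assumption.
# Variant of the export command that also captures stderr, in case the job
# client's progress messages are written there:
nohup hbase org.apache.hadoop.hbase.mapreduce.Export documents /export/documents 10 > ~/documents_export.out 2>&1 &
# The MapReduce job keeps running on the cluster even if the client process
# exits, so its progress can also be checked independently of the log file:
mapred job -list                 # list running jobs and their ids
mapred job -status <job_id>      # <job_id> taken from the list above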

Related

How to avoid series of commands running under sql plus using shell script

I have created a script which runs a series of other jobs in a shell while loop. The problem is that one of the jobs connects to SQL*Plus, and the remaining commands end up running under sqlplus.
For ex. my input file --> job.txt
job1
job2
job3
Now, using the script below, I am calling the jobs one by one, so the next job won't start until the current one finishes. But the catch comes when a job connects to SQL*Plus: once it does, the current job's process completes, and the remaining jobs end up being run as SQL statements inside sqlplus instead of in the Unix environment.
while read -r line
do
    "$line"    # run the job named on the current line of job.txt
done < job.txt
This is the error message I get in sqlplus after the current instance exits:
SP2-0042: unknown command
How can I avoid having the remaining jobs run under sqlplus?
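One common cause, sketched below (this is an assumption, not something stated in the question): the while loop and sqlplus share the same stdin, so once sqlplus starts it reads the remaining lines of job.txt as SQL, which is why SP2-0042 appears. Reading the job list on a separate file descriptor, or giving each job /dev/null as stdin, avoids that:
while read -r line <&3
do
    "$line" < /dev/null    # the job (and sqlplus inside it) cannot consume job.txt
done 3< job.txt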

How to get exceptions, errors, and logs for a HIVE-SQOOP based batch job?

I have a Hadoop cluster with 6 datanodes and 1 namenode. I have a few (4) HIVE jobs which run every day and push some data from log files to our OLTP database using sqoop. I do not have oozie installed in the environment. Everything is written in HIVE script files (.sql files), and I run them from unix shell scripts (.sh files). Those shell scripts are attached to different OS cron jobs so they run at different times.
Now the requirement is this:
Generate a log/status for each job separately on a daily basis, so that at the end of the day we can look at those logs and identify which jobs ran successfully and how long they took, and which jobs failed, along with the dump/stack trace for each failed job. (The future plan is that we will have a mail server, and the shell script for every failed or successful job will send mail to the respective stakeholders with the log/status file as an attachment.)
Now my problem is: how can I capture errors/exceptions, if any, when I run those batch jobs / shell scripts, and how can I also generate a success log with the execution time?
I tried to get the output of each query run in HIVE into a text file by redirecting the output, but that is not working.
For example:
Select * from staging_table;>>output.txt
Is there any way to do this by configuring the HIVE logs for each and every HIVE job on a day-to-day basis?
Please let me know if anyone has faced this issue and how I can resolve it.
Select * from staging_table;>>output.txt
This is redirecting the output. If that is the option you are looking for, then below is the way to do it from the console.
hive -e 'Select * from staging_table' > /home/user/output.txt
This will simply redirect the output. It won't display job-specific log information.
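A small extension of the same idea (not part of the answer above, but standard Hive CLI behaviour): query results go to stdout while the job/progress log lines go to stderr, so the two can be captured in separate files.
hive -e 'Select * from staging_table' > /home/user/output.txt 2> /home/user/job.log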
However, I am assuming that you are running on YARN. If you are expecting to see application (job) specific logs, see the following.
Resulting log file locations:
During run time you will see all the container logs in ${yarn.nodemanager.log-dirs}.
Using the UI you can see the logs at both the job level and the task level.
Another way is to dump application/job specific logs from the command line:
yarn logs -applicationId your_application_id
Please note that using the yarn logs -applicationId <application_id> method is preferred but it does require log aggregation to be enabled first.
Also see a much better explanation here.
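For the per-job status/log requirement in the question, a minimal wrapper-script sketch is shown below. It is only an illustration: the job name, script path, log directory and mail step are hypothetical, and it assumes hive is on the PATH of the cron environment.
#!/bin/bash
# Hypothetical wrapper for one daily HIVE job: capture all output, record the
# exit status and elapsed time, and leave a dated log file that a mail step
# could attach later.
JOB_NAME="daily_job1"                              # hypothetical job name
HQL_FILE="/home/user/hql/${JOB_NAME}.sql"          # hypothetical script path
LOG_FILE="/home/user/logs/${JOB_NAME}_$(date +%F).log"

START=$(date +%s)
hive -f "$HQL_FILE" > "$LOG_FILE" 2>&1             # stderr holds the job log
STATUS=$?
ELAPSED=$(( $(date +%s) - START ))

if [ "$STATUS" -eq 0 ]; then
    echo "SUCCESS: ${JOB_NAME} finished in ${ELAPSED}s" >> "$LOG_FILE"
else
    echo "FAILED: ${JOB_NAME} exited with status ${STATUS} after ${ELAPSED}s" >> "$LOG_FILE"
    # mail -s "${JOB_NAME} failed" team@example.com < "$LOG_FILE"   # hypothetical mail step
fi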

EMR kill PIG script

Is there a way of killing a running Pig script, not only the current Hadoop job?
As you know, a Pig script is translated into a DAG of Hadoop jobs. Assume everything runs smoothly up to some point in this graph but, for some reason, I want to stop the execution of this script/"DAG". Is there an EMR command to do that?
I tried to kill the current Hadoop job, and it looks like the execution of the Pig script is CANCELLED, but the cluster/master node is left in a weird state which makes all subsequent Pig scripts fail instantly.
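One manual approach, sketched below as an assumption rather than an official EMR command: stop the Pig client process on the master node first, then clean up the MapReduce jobs it had already launched.
# On the master node: find and stop the Pig client process, then kill any
# MapReduce jobs the script had already submitted.
ps -ef | grep -i '[p]ig'          # find the Pig client's process id
kill <pid>                        # <pid> taken from the output above
mapred job -list                  # list jobs still running on the cluster
mapred job -kill <job_id>         # repeat for each leftover job from the script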

How to Kill Hadoop fs -copyToLocal task

I ran the following command on my local filesystem:
hadoop fs -copyToLocal <HDFS Path>
But, in the middle of the task (after issuing the command in the terminal and before it completes its task), I want to cancel the copy. How can I do this?
Also, is -copyToLocal executed as an MR job internally? Can someone point me to a reference?
Thanks.
It uses the FileSystem API to stream & copy the file to local. There is no MR.
You could find the process on the machine & kill the process. It is usually a JVM process which gets invoked.
If you are using nohup and/or & to run the process, you can find it by searching for copyToLocal in the ps -eaf output and then kill it; if you are running the command normally in the foreground, Ctrl+C will kill the process (Ctrl+Z only suspends it).
In both cases, the temporary/partial files that were created remain behind, so after killing the process you have to clear out those temp files before performing the same copy again.
It will not create any MR job.
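A short sketch of the approach described above (the destination path is hypothetical, and the exact name of any temporary/partial file depends on the Hadoop version):
ps -eaf | grep -i '[c]opyToLocal'   # find the client JVM doing the copy
kill <pid>                          # <pid> taken from the output above
# Clean up whatever partial file was left at the local destination before
# retrying the copy (hypothetical path).
rm -f /local/dest/partially_copied_file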

How to interrupt PIG from DUMP -ing a huge file/variable in grunt mode?

How do we interrupt the pig dump command (EDIT: when it has completed the MapReduce jobs and is now just displaying the result on the grunt shell) without exiting the grunt shell?
Sometimes, if we dump a HUGE file by mistake, it goes on forever!
I know we can use CTRL+C to stop it but it also quits the grunt shell and then we have to write all the commands again.
We can execute the following command in the grunt shell
kill jobid
We can find the job’s ID by looking at Hadoop’s JobTracker GUI, which lists all jobs currently running on the cluster. Note that this command kills a particular MapReduce job. If the Pig job contains other MapReduce jobs that do not depend on the killed MapReduce job, these jobs will still continue. If you want to kill all of the MapReduce jobs associated with a particular Pig job, it is best to terminate the process running Pig using CTRL+C, and then use this command to kill any MapReduce jobs that are still running.
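A small sketch of that workflow from the command line side, as an alternative to the JobTracker GUI (the job id shown is hypothetical):
mapred job -list                         # find the id of the running MapReduce job
mapred job -kill job_201701011234_0042   # or run "kill job_201701011234_0042" inside grunt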
