How to get exceptions, errors, and logs for a HIVE-SQOOP based batch job? - shell

I have a Hadoop cluster with 6 datanodes and 1 namenode. I have a few (4) HIVE jobs which run every day and push some data from log files to our OLTP database using Sqoop. I do not have Oozie installed in the environment. All of them are written in HIVE script files (.sql files) and I run them from Unix shell scripts (.sh files). Those shell scripts are attached to different OS cron jobs so they run at different times.
Now the requirement is this:
Generate a log/status for each job separately on a daily basis, so that at the end of the day, by looking at those logs, we can identify which jobs ran successfully and how long they took, and which jobs failed, along with the dump/stack trace for each failed job. (The future plan is that we will have a mail server, and the shell script for every failed or successful job will send mail to the respective stakeholders with those log/status files as attachments.)
Now my problem is: how can I find the error/exception, if any, when I run those batch jobs/shell scripts, and how can I also generate a success log with the execution time?
I tried to get the output of each query run in HIVE into a text file by redirecting the output, but that is not working.
For example:
Select * from staging_table;>>output.txt
Is there any way to do this by configuring HIVE logging for each and every HIVE job on a day-to-day basis?
Please let me know if anyone has faced this issue and how it can be resolved.

Select * from staging_table;>>output.txt
This is redirecting output. If that is the option you are looking for, then below is the way to do it from the console:
hive -e 'Select * from staging_table' > /home/user/output.txt
This will simply redirect the output. It won't display job-specific log information.
However, I am assuming that you are running on YARN. If you are expecting to see application (job) specific logs, please see the following.
Resulting log file locations:
During run time you will see all the container logs in ${yarn.nodemanager.log-dirs}.
Using the UI you can see the logs at the job level and the task level.
The other way is to look at and dump application/job-specific logs from the command line:
yarn logs -applicationId your_application_id
Please note that using the yarn logs -applicationId <application_id> method is preferred, but it does require log aggregation to be enabled first.
Also see a much better explanation here.
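To cover the per-job daily log/status requirement from the question, a wrapper shell script along the following lines can be placed around each HIVE script and called from cron. This is only a sketch: the job name, paths, and log directory are assumptions, not values from the original post.

#!/bin/bash
# Sketch of a per-job wrapper; JOB_NAME, HIVE_SCRIPT and LOG_DIR are placeholders.
JOB_NAME="daily_job1"
HIVE_SCRIPT="/home/user/hive/${JOB_NAME}.sql"
LOG_DIR="/home/user/job_logs"
LOG_FILE="${LOG_DIR}/${JOB_NAME}_$(date +%Y%m%d).log"

mkdir -p "$LOG_DIR"
START=$(date +%s)

# hive prints progress and stack traces to stderr and query results to stdout;
# capture both in the daily log file.
hive -f "$HIVE_SCRIPT" >> "$LOG_FILE" 2>&1
STATUS=$?

END=$(date +%s)
echo "Job ${JOB_NAME} finished with exit code ${STATUS} in $((END - START)) seconds" >> "$LOG_FILE"

# This is also the place where a success/failure mail could later be sent,
# e.g. with mailx, attaching "$LOG_FILE".
exit $STATUS

The exit code of hive -f is non-zero when the script fails, so the wrapper's own exit status can be used by cron or a mail step to tell success from failure.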

Related

After triggering a Jenkins job remotely via a Bash script, when should I retrieve the job id?

I have already built a script, trigger_jenkins_job.sh, which works perfectly fine for now. It is composed mainly of 3 functions:
input_checkpoint
run_remotejob #: Runs the Jenkins job remotely using the JSON API.
sleep 10      #: 10 sec estimated time until the pending duration is over
              #  and the Jenkins job starts running, i.e. a given slave has
              #  been assigned to run the job.
get_buildID   #: Retrieves the build state, last build ID and last stable
              #  build ID using
The problem is that I want to get rid of that 10-second sleep, and at the same time I want to be sure, before executing the function get_buildID, that the remotely triggered job is actually running on a node.
That way I will be retrieving the triggered job's ID, and not the ID of the last job in the queue before triggering this one.
Regarding the Jenkinsfile of the job, I specified:
agent {
label 'linux-node'
}
So, I guess the question is: I need somehow, from my bash script, to test whether linux-node is running the remotely triggered job, and if so, execute the function get_buildID.
Get rid of the sleep command and use the wait command.
If you are triggering the job with tokens, the command itself should return the build number.
Another way could be the REST API. Please see the "nextBuildNumber" field there (if the build is still pending), else "number".
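As a hedged sketch of the REST API approach: when a build is triggered remotely, Jenkins answers with a Location header pointing at the queue item, and that queue item exposes an "executable" entry once a node has actually started the build. Polling it replaces the fixed sleep. The URL, job name, and credentials below are placeholders, and jq is assumed to be available.

#!/bin/bash
# Sketch: trigger the job and wait for it to leave the queue.
JENKINS_URL="https://jenkins.example.com"   # assumption: your Jenkins base URL
JOB="my-job"                                # assumption: the job name
AUTH="user:api_token"                       # assumption: credentials

# Trigger the build and capture the queue item URL from the Location header.
QUEUE_URL=$(curl -s -u "$AUTH" -X POST -D - -o /dev/null \
  "$JENKINS_URL/job/$JOB/build" | awk '/^Location:/ {print $2}' | tr -d '\r')

# Poll the queue item until Jenkins assigns a node and the build gets a number.
BUILD_NUMBER=""
while [ -z "$BUILD_NUMBER" ]; do
  BUILD_NUMBER=$(curl -s -u "$AUTH" "${QUEUE_URL}api/json" | jq -r '.executable.number // empty')
  [ -z "$BUILD_NUMBER" ] && sleep 2
done

echo "Triggered build number: $BUILD_NUMBER"

At this point get_buildID can be called (or skipped, since the build number is already known) without racing against the queue.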

In Oozie, how would I be able to use script output

I have to create a cron-like coordinator job and collect some logs.
/mydir/sample.sh >> /mydir/cron.log 2>&1
Can I use a simple Oozie workflow, of the kind I use for any shell command?
I'm asking because I've seen that there are specific workflow actions for executing .sh scripts.
Sure, you can execute a Shell action (on any node in the YARN cluster) or use the SSH action if you'd like to target specific hosts. You have to keep in mind that the "/mydir/cron.log" file will be created on the host the action is executed on, and the generated file might not be available to other Oozie actions.
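If the log does need to be visible to later actions, one option (a sketch only; the local and HDFS paths are assumptions) is to have the script that the Shell action runs push its log into HDFS at the end:

#!/bin/bash
# Sketch of the script executed by the Oozie Shell action.
/mydir/sample.sh >> /mydir/cron.log 2>&1
STATUS=$?

# Copy the log off the (arbitrary) worker node so that other actions,
# or a later coordinator run, can read it from a known HDFS location.
hdfs dfs -mkdir -p /user/me/cron-logs
hdfs dfs -put -f /mydir/cron.log /user/me/cron-logs/cron_$(date +%Y%m%d).log

exit $STATUS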

HBase export task mysteriously stopped logging to output file

I recently attempted to export a table from an HBase instance using a 10-data-node Hadoop cluster. The command line looked like the following:
nohup hbase org.apache.hadoop.hbase.mapreduce.Export documents /export/documents 10 > ~/documents_export.out &
As you can see, I ran the process with nohup so it wouldn't prematurely die when my SSH session closed, and I put the whole thing in the background. To capture the output, I redirected it to a file.
As expected, the process started to run and in fact ran for several hours before the output mysteriously stopped appearing in the file I was writing to. It stopped at about 31% through the map phase of the MapReduce job being run. However, according to Hadoop, the MapReduce job itself was still going and in fact ran to completion by the next morning.
So, my question is: why did output stop going to my log file? My best guess is that the parent HBase process I invoked exited normally once it was done with the initial setup for the MapReduce job involved in the export.
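That guess is consistent with the launcher process simply finishing its client-side work: once it exits, nothing is left writing to the redirected file, even though the MapReduce job keeps running on the cluster. If progress still needs to land in the same file, a rough sketch (assuming an MRv2-style mapred job -status output and a placeholder job ID taken from the initial client output) is to poll the job from the shell:

#!/bin/bash
# Sketch: keep appending job status to the export log after the client exits.
JOB_ID="job_201801010000_0001"      # placeholder: take the real ID from the client output
LOG_FILE=~/documents_export.out

while true; do
  STATUS_OUT=$(mapred job -status "$JOB_ID" 2>&1) || break
  echo "$STATUS_OUT" >> "$LOG_FILE"
  # Stop once the job reports a terminal state (assumption: MRv2 prints a "Job state:" line).
  echo "$STATUS_OUT" | grep -Eq "Job state: (SUCCEEDED|FAILED|KILLED)" && break
  sleep 300
done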

Submitting jobs to different fair scheduler pools while using jar option

I am relatively new to Hadoop and was trying to have different jobs from the same user submitted to different pools of the fair scheduler at run time while using the hadoop jar option.
Based on the solution in http://osdir.com/ml/hive-user-hadoop-apache/2009-03/msg00162.html, I used the -D option while running the job.
Specifically, I ran the command: bin/hadoop jar hadoop-examples-1.0.4.jar grep input output 'dfs[a-z.]+' -D pool.name=sample_pool
I can see the pool on the job tracker scheduler page, but the job is still submitted to the user pool. I found that the -D option is not supported by the jar option: http://hadoop.apache.org/docs/r1.0.4/commands_manual.html#job.
How can I specify this at run time?
A couple of suggestions:
Have you restarted the job tracker since you made the changes suggested in the first link?
You've previously needed to set all -D properties before the other arguments (I'm not sure if this has changed in more recent versions). Try:
bin/hadoop jar hadoop-examples-1.0.4.jar -Dpool.name=sample_pool grep input output 'dfs[a-z.]+'
It probably doesn't matter, but I always bunch up the -Dkey=value options (no space between the -D and the key=value); I find it makes it more obvious that this is not part of the variable args list.
One way to verify this has been picked up correctly is to check the job's job.xml in the job tracker: does it have the pool.name property listed, and does it have the value you configured?
EDIT: Just reading up on how the examples are bundled, you'll need to add the -D after the program name and before the other arguments:
bin/hadoop jar hadoop-examples-1.0.4.jar grep -Dpool.name=sample_pool input output 'dfs[a-z.]+'
I think you can specify the parameter mapred.fairscheduler.pool or mapred.fairscheduler.poolnameproperty.
For instance, you can run the command (with the -D placed after the program name, as noted in the edit above):
bin/hadoop jar hadoop-examples-1.0.4.jar grep -Dmapred.fairscheduler.pool=sample_pool input output 'dfs[a-z.]+'
mapred.fairscheduler.pool:
Specify the pool that a job belongs in. If this is specified then mapred.fairscheduler.poolnameproperty is ignored.
mapred.fairscheduler.poolnameproperty:
Specify which jobconf property is used to determine the pool that a job belongs in. String, default: user.name (i.e. one pool for each user). Another useful value is mapred.job.queue.name to use MapReduce's "queue" system for access control lists (see below). mapred.fairscheduler.poolnameproperty is used only for jobs in which mapred.fairscheduler.pool is not explicitly set.
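Following the poolnameproperty description above, a hedged variant (an illustration, not from the original answer) is to point the scheduler at MapReduce's queue system by setting mapred.fairscheduler.poolnameproperty to mapred.job.queue.name in the scheduler configuration, and then submit the job with the queue name; -D again goes after the program name for the examples jar:

bin/hadoop jar hadoop-examples-1.0.4.jar grep -Dmapred.job.queue.name=sample_pool input output 'dfs[a-z.]+'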
References:
Hadoop Fair Scheduler

DATASTAGE: how to run multiple instances of a job in parallel using DSJOB

I have a question.
I want to run multiple instances of the same job in parallel from within a script: I have a loop in which I invoke the jobs with dsjob, without the options "-wait" and "-jobstatus".
I want the jobs to complete before the script terminates, but I don't know how to verify whether a job instance has terminated.
I thought about using the wait command, but it is not appropriate here.
Thanks in advance
First, you should make sure the job compile option "Allow Multiple Instance" is selected.
Second:
#!/bin/bash
. /home/dsadm/.bash_profile
INVOCATION=(1 2 3 4 5)
cd $DSHOME/bin
for id in "${INVOCATION[@]}"
do
./dsjob -run -mode NORMAL -wait test demo.$id
done
project -- test
job -- demo
$id -- invocation id
The first two lines in the shell script guarantee that the environment path is set up correctly.
Run the jobs as you describe, without -wait, and then loop around running dsjob -jobinfo and parsing the output for a job status of 1 or 2. When all jobs return one of these statuses, they are all finished.
You might find, though, that you check the status of a job before it actually starts running and pick up an old status. You might be able to fix this by first resetting the job instance and waiting for a status of "Not running" before running the job.
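As a sketch of that polling approach (the project, job name, and the exact "Job Status" output format of dsjob -jobinfo are assumptions here):

#!/bin/bash
# Sketch: launch all instances without -wait, then poll until each one
# reports a finished status (1 = run OK, 2 = run with warnings).
. /home/dsadm/.bash_profile
cd $DSHOME/bin

PROJECT=test
JOB=demo
INVOCATION=(1 2 3 4 5)

for id in "${INVOCATION[@]}"; do
  ./dsjob -run -mode NORMAL "$PROJECT" "$JOB.$id"
done

for id in "${INVOCATION[@]}"; do
  # Assumption: -jobinfo prints a line like "Job Status : RUN OK (1)".
  until ./dsjob -jobinfo "$PROJECT" "$JOB.$id" | grep -Eq "Job Status.*\((1|2)\)"; do
    sleep 30
  done
done
echo "All job instances have finished."

Resetting each instance first (as suggested above) avoids matching a status left over from a previous run.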
Invoke the jobs in a loop without the -wait or -jobstatus option.
After your loop, check the job status with the dsjob command.
Example - dsjob -jobinfo projectname jobname.invocationid
You can code one more loop for this as well and use the sleep command inside it.
Then write your further logic based on the status of the jobs.
However, it is better to create a job sequence to invoke this multi-instance job simultaneously with the help of different invocation IDs:
Create a sequence job if these belong to the same process.
Create different sequences, or directly create different scripts to trigger these jobs simultaneously with invocation IDs, and schedule them at the same time.
The best option is to create a standard, generalized script where everything is created or receives its value from command-line parameters (see the sketch below).
Example - log files based on jobname + invocation-id
Then schedule the same script with different parameters or invocations.
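A minimal sketch of such a generalized trigger script (the project name, log directory, and environment file are assumptions):

#!/bin/bash
# Usage: ./run_instance.sh <jobname> <invocation_id>
JOB_NAME=$1
INVOCATION_ID=$2
PROJECT=test                                   # assumption
LOG_DIR=/home/dsadm/logs                       # assumption
LOG_FILE=$LOG_DIR/${JOB_NAME}.${INVOCATION_ID}.$(date +%Y%m%d).log

. /home/dsadm/.bash_profile
cd $DSHOME/bin
mkdir -p "$LOG_DIR"

# Run one multi-instance invocation and keep a per-invocation daily log.
./dsjob -run -mode NORMAL -wait "$PROJECT" "${JOB_NAME}.${INVOCATION_ID}" > "$LOG_FILE" 2>&1
echo "dsjob exit code: $?" >> "$LOG_FILE"

The same script can then be scheduled several times with different jobname/invocation-id pairs to run the instances simultaneously.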
