Oozie Sqoop Workflow Refresh table - sqoop

I update Impala tables through a workflow created in the Oozie Editor. At the end of the workflow I need to run a "refresh" statement, but I don't know how to do that without a bash script. Can Oozie execute Impala DDL by itself?

You can add one additional shell action and run INVALIDATE METADATA through the impala-shell command:
impala-shell -q "invalidate metadata"
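For context, a minimal wrapper script that such a shell action could invoke might look like the sketch below; the impalad host and table name are placeholders, not values from the original question.
#!/bin/bash
# -i selects the impalad to connect to; -q runs a single statement.
impala-shell -i impalad-host:21000 -q "INVALIDATE METADATA my_db.my_table"
# If only new data files were added to an already-known table, REFRESH is the lighter option:
impala-shell -i impalad-host:21000 -q "REFRESH my_db.my_table"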

Related

Is it possible to execute more than one Hive query in parallel?

I have a script which reads and executes one HQL file at a time, but I want to execute more than one HQL at a time. Please let me know if there is any way to do so.
If you use hive -e 'some command' you can use Bash &:
hive -e 'some command' &
hive -f someFile.hql &
etc..
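A minimal sketch of this approach, with placeholder .hql files, that backgrounds each query and then waits for all of them to finish:
#!/bin/bash
# Start each Hive query in the background.
hive -f query1.hql &
hive -f query2.hql &
hive -e 'select count(*) from db.some_table' &
# Block until every background job has completed.
wait
echo "All Hive queries finished"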
Approach 1 (Oozie):
One of the easiest and most straightforward ways to run all your HQLs is to use Oozie. Create an Oozie job, define the Hive actions to run in parallel, and submit the job.
Approach 2 (shell):
Create multiple shell scripts, each containing a hive -e '<<query>>', and run all of the shell scripts in parallel with a cron job (or, again, use Oozie to run the shell scripts).
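As a rough illustration of approach 2 (the script names, query, and schedule here are made up):
# query_a.sh and query_b.sh each wrap a single query, e.g.:
#   hive -e 'insert overwrite table db.a select * from db.staging_a'
# A crontab entry can then start both wrappers at the same time:
0 2 * * * /path/to/query_a.sh >> /var/log/query_a.log 2>&1
0 2 * * * /path/to/query_b.sh >> /var/log/query_b.log 2>&1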
Although approach 2 works, I'd recommend approach 1, since Oozie is the natural way to run Hive scripts in parallel.

Need to pass Variable from Shell Action to Oozie Shell using Hive

All,
Looking to pass a variable from a shell action back to the Oozie workflow. I am running commands such as this in my script:
#!/bin/sh
evalDate="hive -e 'set hive.execution.engine=mr; select max(cast(create_date as int)) from db.table;'"
evalPartition=$(eval $evalBaais)
echo "evaldate=$evalPartition"
The trick is that it is a Hive command run from the shell.
Then I reference it in Oozie with:
${wf:actionData('getPartitions')['evaldate']}
But it pulls back a blank every time! I can run those commands fine in my shell, and they also run fine on the other boxes of the cluster, but under Oozie the value is empty. Any ideas?
The issue turned out to be configuration on my cluster: when running as the oozie user, I had write-permission issues on /tmp/yarn. Because of that, I changed the command to run as:
baais="export HADOOP_USER_NAME=functionalid; hive yarn -hiveconf hive.execution.engine=mr -e 'select max(cast(create_date as int)) from db.table;'"
Where hive allows me to run as yarn.
The solution to your problem is to use the "-S" switch in the hive command for silent output (see below).
Also, what is "evalBaais"? You probably need to replace it with "evalDate". So your code should look like this:
#!/bin/sh
evalDate="hive -S -e 'set hive.execution.engine=mr; select max(cast(create_date as int)) from db.table;'"
evalPartition=$(eval $evalDate)
echo "evaldate=$evalPartition"
Now you should be able to capture the output.
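A slightly expanded sketch of the same script, with a guard for empty results; note (my assumption, not something stated in the answers above) that the Oozie shell action must declare capture-output for wf:actionData to return anything at all:
#!/bin/sh
# -S keeps Hive banners and log lines out of stdout, so only the value is captured.
evalDate="hive -S -e 'set hive.execution.engine=mr; select max(cast(create_date as int)) from db.table;'"
evalPartition=$(eval $evalDate)
# Fail fast instead of handing Oozie an empty value.
if [ -z "$evalPartition" ]; then
  echo "ERROR: could not determine the latest partition" >&2
  exit 1
fi
# Oozie only captures lines in key=value form; 'evaldate' is the key read by
# ${wf:actionData('getPartitions')['evaldate']} in the workflow.
echo "evaldate=$evalPartition"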

How to write Sqoop jobs in a shell script and run them sequentially?

I need to run a set of Sqoop jobs one after another inside a shell script. How can I achieve this? By default they all run in parallel, which hurts performance. Should I remove the "-m" parameter and rerun?
The -m parameter sets the number of parallel map tasks for a single Sqoop command; it does not control parallelism across the commands you issue, so removing it will not solve the problem.
First, write a shell script file containing your Sqoop commands. A shell script runs its commands one after another, so each job starts only when the previous one has finished:
#!/bin/bash
sqoop_command_1
sqoop_command_2
sqoop_command_3
Save the script with a name like sqoop_jobs.sh, then make it executable:
chmod 777 sqoop_jobs.sh
Now you can run the script from your terminal:
./sqoop_jobs.sh
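For a more concrete sketch (the connection string, credentials, and table names below are invented), each import runs only after the previous one finishes, and the script stops at the first failure:
#!/bin/bash
set -e   # abort immediately if any sqoop command exits with a non-zero status

sqoop import --connect jdbc:mysql://dbhost/sales --username etl \
    --password-file /user/etl/.dbpass \
    --table orders --target-dir /user/etl/orders -m 4

sqoop import --connect jdbc:mysql://dbhost/sales --username etl \
    --password-file /user/etl/.dbpass \
    --table customers --target-dir /user/etl/customers -m 4

echo "All sqoop imports completed"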
I hope this will help

How to get information about completed PBS or Torque jobs?

I have the IDs of completed jobs. How do I check their detailed information, such as execution time, allocated nodes, etc.? I remember SGE has a command for this (qacct?), but I could not find one for PBS or Torque. Thanks.
Since viewing completed jobs requires either root access to the job accounting logs or cluster admins who have installed pbstools (both outside a user's control), I've found that the easiest thing to do is to place a
tracejob $PBS_JOBID
on the last line of the submission script. If the scheduler is Maui, then checkjob -vv $PBS_JOBID is another alternative. These commands can be redirected to a separate output file:
tracejob $PBS_JOBID > $PBS_O_WORKDIR/$PBS_JOBID.tracejob
It should also be possible to run this as a user epilog script to make it more reusable from job to job.
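A sketch of what that looks like at the bottom of a submission script (the resource requests and payload are just examples):
#!/bin/bash
#PBS -N example_job
#PBS -l nodes=1:ppn=4
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR
./run_simulation              # hypothetical payload

# Record scheduling/accounting details for this job next to its other outputs.
tracejob $PBS_JOBID > $PBS_O_WORKDIR/$PBS_JOBID.tracejob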
I came across this thread while searching for how to do this on my HPC system, which runs PBSPro 19.2.3. As of PBSPro 18, the solution is similar to John Damm Sørensen's reply, but the -w flag is used instead of -1 to print each field on a single line, and you need to add the -x flag to also see the details of finished jobs, so you don't need to run the command from within the job script (p. 203, section 2.59.2.2 of the Reference Guide):
qstat -fxw $PBS_JOBID
You can then grep the information you need out of it, such as resources used, exit status, etc.:
qstat -fxw $PBS_JOBID | grep -E "resources_used|Exit_status|array_index"
For Torque, you can get at least part of the information you are looking for with the "tracejob" command.
Official documentation:
http://docs.adaptivecomputing.com/torque/Content/topics/11-troubleshooting/usingTracejobToLocateFailures.htm
Note that this tool is a convenience that parses the logs, and by default it only checks the last day. Be sure to read the documentation for the "-n" option.
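For example, to search the last three days of logs for a job (the job ID here is hypothetical):
tracejob -n 3 1234.server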
On a Torque-based system, I find that the best way to get stats for a job is to add this to the end of the submitted job script. The output will be appended to the STDOUT file.
qstat -f -1 $PBS_JOBID
Right now the only way to get this in TORQUE is to look at the accounting logs. You can grep for the job ID and view the accounting records for the job, which look like this:
04/30/2014 15:20:18;Q;5000.bob;queue=batch
04/30/2014 15:33:00;S;5000.bob;user=dbeer group=dbeer jobname=STDIN queue=batch ctime=1398892818 qtime=1398892818 etime=1398892818 start=1398893580 owner=dbeer#bob exec_host=bob/0
04/30/2014 15:36:20;E;5000.bob;user=dbeer group=dbeer jobname=STDIN queue=batch ctime=1398892818 qtime=1398892818 etime=1398892818 start=1398893580 owner=dbeer#bob exec_host=bob/0 session=22933 end=1398893780 Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=2580kb resources_used.vmem=37072kb resources_used.walltime=00:03:20
Unfortunately, reading these logs directly requires root access. To get around that, there are tools such as pbsacct that make it easier to browse the accounting records; pbsacct is part of the pbstools package.
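If you do have read access, a rough example of pulling a job's records straight from the accounting logs (the path is a common default, but it varies by installation, and the job ID is taken from the sample above):
grep "5000.bob" /var/spool/torque/server_priv/accounting/*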

Catch data errors in Informatica

I have a shell script which calls an Informatica workflow, and I want to add functionality to the script to catch data errors that occur while the workflow is processing, then print a message on screen such as "Error is coming due to wrong data. Please refer to the logs." Currently a log is generated, but I am unable to show an on-screen message from the shell script.
Below is the command that calls the workflow:
pmcmd startworkflow -sv CSA_DEV_INT -d Domain_CSADevelopment -u Administrator -p Administrator -f Sumit -wait wf_ERROR_LOG_TESTING
pwc_status=$?
But the value of pwc_status comes back as 0 even though I processed wrong data and the Informatica logs catch the error.
As long as the pmcmd call itself is successful (i.e. the server is found, the user can be authenticated, and the workflow starts), it will return 0 even if there are errors while processing data. Use the getworkflowdetails or gettaskdetails commands of the pmcmd utility to obtain details about the workflow execution.
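A rough sketch of how the script could surface that on screen (the grep pattern is an assumption about what getworkflowdetails prints, so adjust it to your version's actual output):
pmcmd startworkflow -sv CSA_DEV_INT -d Domain_CSADevelopment -u Administrator -p Administrator -f Sumit -wait wf_ERROR_LOG_TESTING
pmcmd getworkflowdetails -sv CSA_DEV_INT -d Domain_CSADevelopment -u Administrator -p Administrator -f Sumit wf_ERROR_LOG_TESTING > wf_details.txt
if grep -qi "failed" wf_details.txt; then
    echo "Error is coming due to wrong data. Please refer to the logs."
fi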
For more information about these commands, see the Command Reference; you can find it in the Informatica installation directory on your server or download it from the Informatica My Support site (you need to be a registered user).
