TEZ as execution engine at job level - hadoop

How to selectively set TEZ as execution engine for PIG jobs?
We can set the execution engine in pig.properties, but that is a cluster-wide setting and impacts all jobs on the cluster.

It's possible if the jobs are submitted through Templeton.
Example of PowerShell usage:
New-AzureHDInsightPigJobDefinition -Query $QueryString -StatusFolder $statusFolder -Arguments @("-x", "tez")
Example of CURL usage:
curl -s -d file=<file name> -d arg=-v -d arg=-x -d arg=tez 'https://<dnsname.azurehdinsight.net>/templeton/v1/pig?user.name=admin'
Source: http://blogs.msdn.com/b/tiny_bits/archive/2015/09/19/pig-tez-as-execution-at-job-level.aspx

You can pass the execution engine as a parameter as shown below; for MapReduce it is mr and for Tez it is tez.
pig -useHCatalog -Dexectype=mr -Dmapreduce.job.queuename=<queue name> -param_file dummy.param dummy.pig
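For contrast, a minimal sketch of the Tez variant of the same invocation (queue name and file names are placeholders carried over from the example above, not real values):
# Same script and queue, submitted on Tez instead of MapReduce;
# pig.properties on the cluster is left untouched.
pig -useHCatalog -Dexectype=tez -Dmapreduce.job.queuename=myqueue -param_file dummy.param dummy.pig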

Related

Hive query CLI works, same via Hue fails

I have a weird issue with Hue (version 3.10).
I have a very simple Hive query:
drop table if exists csv_dump;
create table csv_dump row format delimited fields terminated by ',' lines terminated by '\n' location '/user/oozie/export' as select * from sample;
running this query in the Hive editor works
running this query as an Oozie workflow from the command line works
running this query from the command line with beeline works
running this query via an Oozie workflow from Hue fails
Fail in that case means:
drop and create are not run, or at least do not have any effect
a prepare action in the workflow will be executed
the hive2 step in the workflow still says succeeded
a following step will be executed.
Now I did try with different users (oozie and ambari, adapting the location as relevant), with exactly the same success/failure cases.
I cannot find any relevant logs, except maybe from hue:
------------------------
Beeline command arguments :
-u
jdbc:hive2://ip-10-0-0-139.eu-west-1.compute.internal:10000/default
-n
oozie
-p
DUMMY
-d
org.apache.hive.jdbc.HiveDriver
-f
s.q
-a
delegationToken
--hiveconf
mapreduce.job.tags=oozie-e686d7aaef4a29c020059e150d36db98
Fetching child yarn jobs
tag id : oozie-e686d7aaef4a29c020059e150d36db98
Child yarn jobs are found -
=================================================================
>>> Invoking Beeline command line now >>>
0: jdbc:hive2://ip-10-0-0-139.eu-west-1.compu> drop table if exists csv_dump; create table csv_dump0 row format delimited fields terminated by ',' lines terminated by '\n' location '/user/ambari/export' as select * from sample;
<<< Invocation of Beeline command completed <<<
Hadoop Job IDs executed by Beeline:
<<< Invocation of Main class completed <<<
Oozie Launcher, capturing output data:
=======================
#
#Thu Jul 07 13:12:39 UTC 2016
hadoopJobs=
=======================
Oozie Launcher, uploading action data to HDFS sequence file: hdfs://ip-10-0-0-139.eu-west-1.compute.internal:8020/user/oozie/oozie-oozi/0000011-160707062514560-oozie-oozi-W/hive2-f2c9--hive2/action-data.seq
Oozie Launcher ends
There I can see that beeline is started, but I do not see any mappers allocated, as I do when running from the command line.
Would anybody have any idea of what could go wrong?
Thanks,
Guillaume
As explained by @romain in the comments, newlines need to be added in the SQL script. Then all is good.
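For illustration, a rough sketch of what the fixed script could look like, assuming the fix is simply to break the one-line query into separate lines (the statements are the ones from the question; the exact line breaks are an assumption):
# Write the statements on separate lines so the generated s.q is not one long line
cat > s.q <<'EOF'
drop table if exists csv_dump;
create table csv_dump row format delimited fields terminated by ','
lines terminated by '\n' location '/user/oozie/export'
as select * from sample;
EOF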

error in running pig script in tez mode with HCatalog

I was running a pig script with tez as the execution engine and using hcatalog. Below is my pig script.
set exectype=tez;
a = load 'hive table' using org.apache.pig.hcatalog.hive.HCatloader();
When I entered the following on the command line,
pig -useHCatalog -x tez /home/script.pig
I got an error:
"error encountered during parsing " ";" "; " at line1, column 17.
Can anyone tell me what the issue is? Is there any different way to set the execution engine inside a script?
I think you should use:
set exectype tez
instead of :
set exectype=tez;
And anyway, isn't specifying "-x tez" enough to set the execution type? Why do you need to add it in the script as well?
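If the command-line flag alone is used, a minimal sketch of the two invocations would be as follows (script path taken from the question; the point is simply that the set exectype line is removed from the script itself):
# Let the -x flag pick the engine; the script contains no exectype setting
pig -useHCatalog -x tez /home/script.pig
# For comparison, the same script on classic MapReduce
pig -useHCatalog -x mr /home/script.pig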

How to pass a parameter to an EMR job

I have an EMR job that I run as follows:
/opt/emr/elastic-mapreduce --jobflow $JOB_FLOW_ID --jar s3n://mybucket/example1.jar
Now I need to pass a parameter to the job: maperd.job.name="My job"
I have tried passing it via the -D flag but that did not work:
/opt/emr/elastic-mapreduce --jobflow $JOB_FLOW_ID -Dmaperd.job.name="My job" --jar s3n://mybucket/example1.jar  # does not work
Any idea?
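No answer is recorded here, but as a hedged sketch only: the Hadoop property is normally spelled mapred.job.name, and with the old elastic-mapreduce CLI, arguments for the jar are typically passed with --arg rather than inline (this assumes the jar's main class goes through ToolRunner/GenericOptionsParser so that -D options are honoured):
# Hedged sketch, not a verified fix: property name corrected to mapred.job.name
# and passed to the jar as a step argument via --arg.
/opt/emr/elastic-mapreduce --jobflow $JOB_FLOW_ID \
  --jar s3n://mybucket/example1.jar \
  --arg -Dmapred.job.name="My job"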

How to execute multiple PIG scripts in parallel?

I have multiple PIG scripts, and currently I am executing them sequentially using the command pig -x mapreduce /path/to/Script/Script1.pig && /path/to/Script/Script2.pig && /path/to/Script/Script3.pig
Now I am looking to execute those scripts in parallel to improve performance, as they are all independent of each other. I tried to search for a way to do this but could not find one.
So is there any way through which I can execute all the PIG scripts in parallel?
#!/bin/bash
# Start each independent script as a background job with '&'
pig -x mapreduce /path/to/Script/Script1.pig &
pig -x mapreduce /path/to/Script/Script2.pig &
pig -x mapreduce /path/to/Script/Script3.pig &
# 'wait' blocks until all background jobs have finished
wait
echo "Done!"
You should be able to use Apache Oozie http://oozie.apache.org/

Difference between Hadoop jar command and job command

What is the difference between the two commands "jar" and "job"?
Below is my understanding:
The command "jar" can be used to run MR jobs locally.
The "hadoop job" command is deprecated and is used to submit a job to the cluster; the alternative to it is the mapred command.
Also, the jar command runs the MR job locally on the same node where we execute the command, and not anywhere else on the cluster, whereas submitting a job would run it on some non-deterministic node on the cluster.
Let me know if my understanding is correct, and if not, what exactly the difference is.
Thanks
They are completely different and I don't think they are comparable. Both co-exist, have separate functions, and neither is deprecated AFAIK.
job isn't used to submit a job to the cluster; rather, it is used to get information on jobs that have already run or are running, and also to kill a running job or even a specific task.
jar, on the other hand, is simply used to execute a custom MapReduce jar, for example:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
hadoop jar
Runs a jar file. Users can bundle their Map Reduce code in a jar file and execute it using this command.
Usage: hadoop jar <jar> [mainClass] args...
hadoop job
Command to interact with Map Reduce Jobs.
Usage: hadoop job [GENERIC_OPTIONS] [-submit <job-file>] | [-status <job-id>] | [-counter <job-id> <group-name> <counter-name>] | [-kill <job-id>] | [-events <job-id> <from-event-#> <#-of-events>] | [-history [all] <jobOutputDir>] | [-list [all]] | [-kill-task <task-id>] | [-fail-task <task-id>] | [-set-priority <job-id> <priority>]
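For instance, a short sketch of typical usage of the two commands (the job id below is a placeholder, not a real one from this cluster):
# Inspect and manage jobs that are already running or finished
hadoop job -list
hadoop job -status job_201607070001_0042
hadoop job -kill job_201607070001_0042
# Launch a new job by running a bundled jar (same example as above)
hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output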
For more info, see the Hadoop commands documentation.
