How to pass a parameter to an EMR job - hadoop

I have an EMR job that I run as follows:
/opt/emr/elastic-mapreduce --jobflow $JOB_FLOW_ID --jar s3n://mybucket/example1.jar
Now I need to pass a parameter to the job: mapred.job.name="My job".
I have tried passing it via the -D flag, but that did not work:
/opt/emr/elastic-mapreduce --jobflow $JOB_FLOW_ID -Dmapred.job.name="My job" --jar s3n://mybucket/example1.jar   # does not work
Any ideas?
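For what it's worth, a hedged sketch: the classic elastic-mapreduce Ruby client forwards anything supplied via --arg to the jar as a program argument, so if the jar's main class parses generic options (e.g. through ToolRunner), something along these lines might work. The --arg flag is an assumption about your CLI version:
# Assumes the Ruby elastic-mapreduce client's --arg flag and a ToolRunner-based main class
/opt/emr/elastic-mapreduce --jobflow $JOB_FLOW_ID \
  --jar s3n://mybucket/example1.jar \
  --arg -Dmapred.job.name="My job"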

Related

Coursera Bigdata Grader and how to set Hadoop Streaming number of reducers?

I'm trying to pass a course task on Coursera, but I fail at a unit test with the following error:
RES1_6 description: The first job should have more than 1 reducer or
shouldn't have them at all. Please set the appropriate number in -D mapreduce.job.reduces. It can be 0 or more than 1.
BUT, I use NUM_REDUCERS=4 in the following script!
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapreduce.job.name="somename" \
-D mapreduce.job.reduces=${NUM_REDUCERS} \
-files mapper.py,reducer.py,somesupplimentary.txt \
-mapper "python3 mapper.py" \
-reducer "python3 reducer.py" \
-input someinput.txt \
-output ${OUT_DIR} > /dev/null 2> $LOGS
And when I read the logs, I see the following:
Job Counters
Killed reduce tasks=1
Launched map tasks=2
Launched reduce tasks=9
Data-local map tasks=2
So I feel stupid and totally do not understand what the grader wants from me; it simply doesn't add up. It seems that I use MORE than one reducer, and the log seems to confirm it. Why does the unit test fail? Or am I missing some obvious truth?
It turned out I needed to delete the "2> $LOGS" part. That was the grader's issue: it implied that I should not capture the logs and write them into a file myself.
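For reference, the invocation that satisfies the grader is the same script with only the stderr redirection removed:
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="somename" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper.py,reducer.py,somesupplimentary.txt \
    -mapper "python3 mapper.py" \
    -reducer "python3 reducer.py" \
    -input someinput.txt \
    -output ${OUT_DIR} > /dev/null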

TEZ as execution engine at job level

How to selectively set TEZ as execution engine for PIG jobs?
We can set the execution engine in pig.properties, but that is cluster-wide and impacts all the jobs on the cluster.
It's possible if the jobs are submitted through Templeton.
Example of PowerShell usage
New-AzureHDInsightPigJobDefinition -Query $QueryString -StatusFolder $statusFolder -Arguments @("-x", "tez")
Example of CURL usage:
curl -s -d file=<file name> -d arg=-v -d arg=-x -d arg=tez 'https://<dnsname.azurehdinsight.net>/templeton/v1/pig?user.name=admin'
Source: http://blogs.msdn.com/b/tiny_bits/archive/2015/09/19/pig-tez-as-execution-at-job-level.aspx
You can also pass the execution engine as a parameter, as shown below; for MapReduce it is mr and for Tez it is tez.
pig -useHCatalog -Dexectype=mr -Dmapreduce.job.queuename=<queue name> -param_file dummy.param dummy.pig
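Equivalently, the engine can be switched per invocation with Pig's -x flag (Tez mode requires Pig 0.14 or later; the script name below is a placeholder):
# Run the same script on Tez or on MapReduce without touching pig.properties
pig -x tez dummy.pig
pig -x mapreduce dummy.pig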

start-stop-daemon: pass arguments to application (vertx)

I'm trying to build an upstart configuration that's used in combination with monit.
I'd like to pass some arguments to vertx as well (multiple instances of the verticle), but I'm already failing to get a working statement on the plain shell, so I don't think there's any need to quote the upstart script here yet.
start-stop-daemon --start --chdir /my/app/dir --exec /usr/bin/vertx runzip myverticle-mod.zip -instances 20
I have no idea how to pass the '-instances 20' argument to the exec statement; somehow it is always interpreted as an option to start-stop-daemon:
start-stop-daemon: invalid option -- 'i'
I already tried putting the whole --exec statement into braces...
Maybe I missed something in Unix basics and didn't manage to properly escape the --exec string, so my pragmatic workaround was to create a custom parameterized start script:
#!/bin/sh
export JAVA_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=$1 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=192.168.0.100"
/usr/bin/vertx runzip $2 -instances $3
Upstart config (running 10 instances of a verticle via JMX RMI on port 33002)
script
VERTX_OPTS=" 33002 mymodule-mod.zip 10"
exec start-stop-daemon --start --exec /usr/bin/myVertxStartup --$VERTX_OPTS
end script
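A note on why the exec line works: start-stop-daemon passes everything after a bare -- unmodified to the program being started, and --$VERTX_OPTS expands to exactly that (-- followed by the three arguments). Under the same rule, the original command should presumably also work as a one-liner:
# '--' ends start-stop-daemon's own option parsing; the rest goes to vertx
start-stop-daemon --start --chdir /my/app/dir --exec /usr/bin/vertx -- runzip myverticle-mod.zip -instances 20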

Trigger oozie from shell script

I am trying to run a shell script which contains an oozie job, and to trigger this shell script from crontab. Oozie is not getting triggered!
shell script myshell.sh contains
#!/bin/bash
oozie job -run -config $1
crontab
*/5 * * * * /path/myshell.sh example.properties
Is there something I need to set in my environment, or am I missing something?
Thanks
It looks like you're missing the -oozie argument to specify the Oozie API URL:
oozie job -oozie http://ooziehost:11000/oozie -run -config $1
You could also set the OOZIE_URL environment variable:
#!/bin/bash
export OOZIE_URL=http://ooziehost:11000/oozie
oozie job -run -config $1
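Also worth checking: cron runs jobs with a minimal environment (PATH is often just /usr/bin:/bin), so the oozie binary may not even be found from crontab. A hedged sketch, with the install path and log file as hypothetical placeholders:
#!/bin/bash
# cron provides a bare environment, so set everything the oozie CLI needs explicitly
export OOZIE_URL=http://ooziehost:11000/oozie
export PATH=/usr/lib/oozie/bin:$PATH              # hypothetical install location
oozie job -run -config "$1" >> /tmp/oozie-cron.log 2>&1   # keep output for debugging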

Difference between Hadoop jar command and job command

What is the difference between the two commands "jar" and "job"?
Below is my understanding:
The command "jar" could be used to run MR jobs locally.
The "hadoop job" is deprecated and used to submit a job to the cluster. The alternative to that is the mapred command.
Also, the jar command would run the MR job locally, on the same node where we are executing the command, and not anywhere else on the cluster. If we were to submit a job, it would run on some non-deterministic node on the cluster.
Let me know if my understanding is correct and if not what exactly is the difference.
Thanks
They are completely different, and I don't think they are comparable. Both co-exist, have separate functions, and neither is deprecated AFAIK.
job isn't used to submit a job to the cluster; rather, it is used to get information on jobs that have already run or are running, and also to kill a running job or even a specific task.
jar, on the other hand, is simply used to execute a custom MapReduce jar, for example:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
hadoop jar
Runs a jar file. Users can bundle their Map Reduce code in a jar file and execute it using this command.
Usage: hadoop jar <jar> [mainClass] args...
hadoop job
Command to interact with Map Reduce Jobs.
Usage: hadoop job [GENERIC_OPTIONS] [-submit <job-file>] | [-status <job-id>] | [-counter <job-id> <group-name> <counter-name>] | [-kill <job-id>] | [-events <job-id> <from-event-#> <#-of-events>] | [-history [all] <jobOutputDir>] | [-list [all]] | [-kill-task <task-id>] | [-fail-task <task-id>] | [-set-priority <job-id> <priority>]
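For illustration, a few common invocations (the job ID is a placeholder):
hadoop job -list                           # list running jobs
hadoop job -status job_201310191043_0001   # completion status and counters
hadoop job -kill job_201310191043_0001     # kill a running job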
For more info, see the Hadoop commands documentation.
