Why "hadoop -jar" command only launch local job? - hadoop

I use "hadoop -jar" rather than "hadoop jar" by mistake when I submit a job.
In this case, my jar package cannot not be submit to the clusters, and only "local job runner" will be launched, which puzzled me so much.
Anyone knows the reason for that? Or the difference between "hadoop jar" and "hadoop -jar" command ?
Thank you!

jar (as in /usr/bin/hadoop jar) is the argument your Hadoop's $HADOOP_HOME/bin/hadoop script expects, where $HADOOP_HOME is where you keep your Hadoop-related files.
Excerpt from the hadoop script:
elif [ "$COMMAND" = "jar" ] ; then
CLASS=org.apache.hadoop.util.RunJar
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
and,
elif [[ "$COMMAND" = -* ]] ; then
# class and package names cannot begin with a -
echo "Error: No command named \`$COMMAND' was found. Perhaps you meant \`hadoop ${COMMAND#-}'"
exit 1
Here COMMAND="jar" and when COMMAND=-*, or -jar it should throw an exception as coded above. I'm not sure how you can even run a local jar.

Related

Not entering while loop in shell script

I was trying to implement PageRank in Hadoop. I created a shell script to run map-reduce iteratively, but the while loop just doesn't work. I have two map-reduce jobs: one finds the initial page rank and prints the adjacency list; the other takes the output of the first reducer as input to its mapper.
The shell script
#!/bin/sh
CONVERGE=1
ITER=1
rm W.txt W1.txt log*
$HADOOP_HOME/bin/hadoop dfsadmin -safemode leave
hdfs dfs -rm -r /task-*
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.3.jar \
    -mapper "'$PWD/mapper.py'" \
    -reducer "'$PWD/reducer.py' '$PWD/W.txt'" \
    -input /assignment2/task2/web-Google.txt \
    -output /task-1-output
echo "HERE $CONVERGE"
while [ "$CONVERGE" -ne 0 ]
do
echo "############################# ITERATION $ITER #############################"
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.3.jar \
    -mapper "'$PWD/mapper2.py' '$PWD/W.txt' '$PWD/page_embeddings.json'" \
    -reducer "'$PWD/reducer2.py'" \
    -input task-1-output/part-00000 \
    -output /task-2-output
touch w1
hadoop dfs -cat /task-2-output/part-00000 > "$PWD/w1"
CONVERGE=$(python3 $PWD/check_conv.py $ITER>&1)
ITER=$((ITER+1))
hdfs dfs -rm -r /task-2-output/x
echo $CONVERGE
done
The first mapper runs perfectly fine and I get output for it. The condition for the while loop, [ "$CONVERGE" -ne 0 ], just evaluates to false, so the loop never runs the second map-reduce. I removed the quotes around $CONVERGE and tried that; it still doesn't work.
CONVERGE is defined at the beginning of the file and is updated in the while loop with the output of check_conv.py. The while loop just doesn't run.
What could I be doing wrong?
Self answer:
I tried everything I could think of to correct the mistakes, but later I was told to install dos2unix and run the script through it. Surprisingly, that worked and the file was read properly. I don't know why that happened.
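As a side note, a quick way to check whether a script has Windows CRLF line endings (the kind of problem dos2unix fixes) is shown below; the script name is just a placeholder:
file pagerank.sh           # "with CRLF line terminators" in the output means the endings are wrong
cat -A pagerank.sh | head  # CRLF endings show up as a trailing ^M on each line
dos2unix pagerank.sh       # convert the file to Unix line endings in place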

executing command on vagrant-mounted

I'm trying to run a command after a share is mounted with Vagrant, but I've never written an upstart script before. What I have so far is:
start on vagrant-mounted
script
if [ "$MOUNTPOINT" = "/vagrant" ]
then
env CMD="echo $MOUNTPOINT mounted at $(date)"
elif [ "$MOUNTPOINT" = "/srv/website" ]
then
env CMD ="echo execute command"
fi
end script
exec "$CMD >> /var/log/vup.log"
Of course that's not the actual script I want to run; I haven't gotten that far yet, but the structure is what I need. My starting point has been this article. I had a different version that was simply:
echo $MOUNTPOINT mounted at `date`>> /var/log/vup.log
That version did write to the log.
Trying to use init-checkconf fails with "failed to ask Upstart to check conf file".
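A sketch of how that structure is often written (untested here; in Upstart, env is a configuration stanza rather than a runtime command, so a variable can't be set conditionally inside script), with the commands run directly inside the script block:
start on vagrant-mounted
script
    if [ "$MOUNTPOINT" = "/vagrant" ]; then
        echo "$MOUNTPOINT mounted at $(date)" >> /var/log/vup.log
    elif [ "$MOUNTPOINT" = "/srv/website" ]; then
        echo "execute command" >> /var/log/vup.log
    fi
end script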

NodeManager not started in Hadoop Yarn

I have set up hadoop and yarn in standalone mode for now.
I am trying to start all of the yarn processes. All of them start except the nodemanager, which throws a JVM error every time.
[root@ip-10-100-223-16 hadoop-0.23.7]# sbin/yarn-daemon.sh start nodemanager
starting nodemanager, logging to /root/hadoop-0.23.7/logs/yarn-root-nodemanager-ip-10-100-223-16.out
Unrecognized option: -jvm
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
What can be the issue? Any help is appreciated.
The following link has a patch for the above issue: https://issues.apache.org/jira/browse/MAPREDUCE-3879
In the bin/yarn script, we need to change the following lines, where:
'-' : marks lines to remove
'+' : marks lines to add
elif [ "$COMMAND" = "nodemanager" ] ; then
CLASSPATH=${CLASSPATH}:$YARN_CONF_DIR/nm-config/log4j.properties
CLASS='org.apache.hadoop.yarn.server.nodemanager.NodeManager'
- if [[ $EUID -eq 0 ]]; then
- YARN_OPTS="$YARN_OPTS -jvm server $YARN_NODEMANAGER_OPTS"
- else
- YARN_OPTS="$YARN_OPTS -server $YARN_NODEMANAGER_OPTS"
- fi
+ YARN_OPTS="$YARN_OPTS -server $YARN_NODEMANAGER_OPTS"
elif [ "$COMMAND" = "proxyserver" ] ; then
CLASS='org.apache.hadoop.yarn.server.webproxy.WebAppProxyServer'
YARN_OPTS="$YARN_OPTS $YARN_PROXYSERVER_OPTS"
The above patch is available at this location.
Courtesy of LorandBendig for helping me.
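After applying that change to bin/yarn, the nodemanager can be started again with the same command as before:
sbin/yarn-daemon.sh start nodemanager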

Difference between Hadoop jar command and job command

What is the difference between the two commands "jar" and "job"?
Below is my understanding:
The "jar" command could be used to run MR jobs locally.
The "hadoop job" command is deprecated and is used to submit a job to the cluster; the alternative to that is the mapred command.
Also, the jar command would run the MR job locally, on the same node where we execute the command, and not anywhere else on the cluster. If we were to submit a job, it would run on some non-deterministic node on the cluster.
Let me know if my understanding is correct, and if not, what exactly the difference is.
Thanks
They are completely different and I don't think they are comparable. Both co-exist, have separate functions, and neither is deprecated AFAIK.
job isn't used to submit a job to the cluster; rather, it is used to get information on jobs that have already run or are running, and it can also kill a running job or even a specific task.
While jar is simply used to execute a custom MapReduce jar, for example:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
hadoop jar
Runs a jar file. Users can bundle their Map Reduce code in a jar file and execute it using this command.
Usage: hadoop jar <jar> [mainClass] args...
hadoop job
Command to interact with Map Reduce Jobs.
Usage: hadoop job [GENERIC_OPTIONS] [-submit <job-file>] | [-status <job-id>] | [-counter <job-id> <group-name> <counter-name>] | [-kill <job-id>] | [-events <job-id> <from-event-#> <#-of-events>] | [-history [all] <jobOutputDir>] | [-list [all]] | [-kill-task <task-id>] | [-fail-task <task-id>] | [-set-priority <job-id> <priority>]
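For instance, typical job invocations look like the following (the job ID is illustrative):
hadoop job -list                           # list running jobs
hadoop job -status job_201309171413_0001   # show the status and counters of a job
hadoop job -kill job_201309171413_0001     # kill a running job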
For more info, read here.

cron not working with hadoop command in shell script

I'm trying to schedule a cron job using crontab to execute a shell script which runs a list of hadoop commands sequentially, but when I look at the hadoop folder, the folders are not being created or dropped. The hadoop connectivity on our cluster is pretty slow, so these hadoop commands might take some time to execute due to the number of retries.
Cron expression
*/5 * * * * sh /test1/a/bin/ice.sh >> /test1/a/run.log
shell script
#!/bin/sh
if [ $# == 1 ]
then
TODAY=$1
else
TODAY=`/bin/date +%m%d%Y%H%M%S`
fi
# define seed folder here
#filelist = "ls /test1/a/seeds/"
#for file in $filelist
for file in `/bin/ls /test1/a/seeds/`
do
echo $file
echo $TODAY
INBOUND="hadoop fs -put /test1/a/seeds/$file /apps/hdmi-set/inbound/$file.$TODAY/$file"
echo $INBOUND
$INBOUND
SEEDDONE="hadoop fs -put /test1/a/seedDone /apps/hdmi-set/inbound/$file.$TODAY/seedDone"
echo $SEEDDONE
$SEEDDONE
done
echo "hadoop Inbound folders created for job1 ..."
Since no output has been captured that could be used for debugging, I can only speculate.
But from my past experience, one of the common reasons hadoop jobs fail when spawned through scripts is that HADOOP_HOME is not available when the commands are executed.
Usually that is not the case when working directly from the terminal. Try adding the following to both ".bashrc" and ".bash_profile" or ".profile":
export HADOOP_HOME=/usr/lib/hadoop
You may have to change the path based on your specific installation.
And yes, as the comment says, don't redirect just standard output to the file; redirect standard error too.
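For example, the crontab entry from the question could be changed to capture both streams (same script and log paths as above):
*/5 * * * * sh /test1/a/bin/ice.sh >> /test1/a/run.log 2>&1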
