Differences between hadoop jar and yarn -jar - hadoop

what's the difference between run a jar file with commands "hadoop jar " and "yarn -jar " ?
I've used the "hadoop jar" command on my MAC successfully but I want be sure that the execution is being correct and parallel on my four cores.
Thanks!!!

Short Answer
They are probably identical for you, but even if they aren't, they should both utilize your cluster to the best of its ability.
Longer Answer
The /usr/bin/yarn script sets up the execution environment so that all of the yarn commands can be run. The /usr/bin/hadoop script isn't quite as concerned about yarn specific functionality. However, if you have your cluster set up to use yarn as the default implementation of mapreduce (MRv2), then hadoop jar will probably act the same as yarn jar for a mapreduce job.
Either way you're probably fine, but you can always check the resource manager (or job tracker) web interface to see how your job is distributed across the cluster (whether it's a single node cluster or not)

Related

Why MR2 map task is running under 'yarn' user and not under user I ran hadoop job?

I'm trying to run mapreduce job on MR2, Hadoop ver. 2.6.0-cdh5.8.0. Job has relative path to directory which has a lot of files to be compressed based on some criteria(not really necessary for this question). I'm running my job as following:
sudo -u my_user hadoop jar my_jar.jar com.example.Main
There is a folder on HDFS under path /user/my_user/ with files. But when I'm running my job I got following exception:
java.io.FileNotFoundException: File /user/yarn/<path_from_job> does not exist.
I'm migrating this job from MR1 where this job is working correctly. My suggestion is this is happening due to YARN, because each container started under YARN user. In my job configuration I've tried to set mapreduce.job.user.name="my_user" but this didn't help.
I've found ${user.home} usage in me Job configuration, but I don't know aware where it is set and is it possible to change this.
The only solution I found so far is to provide absolute path to folder. Is there any other way around, because I feel like this is not correct approach.
Thank you

Install spark on yarn cluster

I am looking for a guide regarding how to install spark on an existing virtual yarn cluster.
I have a yarn cluster consisting of two nodes, ran map-reduce job which worked perfect. Looked for results in log and everything is working fine.
Now I need to add the spark installation commands and configurations files in my vagrantfile. I can't find a good guide, could someone give me a good link ?
I used this guide for the yarn cluster
http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/#single-node-installation
Thanks in advance!
I don't know about vagrant, but I have installed Spark on top of hadoop 2.6 (in the guide referred to as post-YARN) and I hope this helps.
Installing Spark on an existing hadoop is really easy, you just need to install it only on one machine. For that you have to download the one pre-built for your hadoop version from it's official website (I guess you can use the without hadoop version but you need to point it to the direction of hadoop binaries in your system). Then decompress it:
tar -xvf spark-2.0.0-bin-hadoop2.x.tgz -C /opt
Now you only need to set some environment variables. First in your ~/.bashrc (or ~/.zshrc) you can set SPARK_HOME and add it to your PATH if you want:
export SPARK_HOME=/opt/spark-2.0.0-bin-hadoop-2.x
export PATH=$PATH:$SPARK_HOME/bin
Also for this changes to take effect you can run:
source ~/.bashrc
Second you need to point Spark to your Hadoop configuartion directories. To do this set these two environmental variables in $SPARK_HOME/conf/spark-env.sh:
export HADOOP_CONF_DIR=[your-hadoop-conf-dir usually $HADOOP_PREFIX/etc/hadoop]
export YARN_CONF_DIR=[your-yarn-conf-dir usually the same as the last variable]
If this file doesn't exist, you can copy the contents of $SPARK_HOME/conf/spark-env.sh.template and start from there.
Now to start the shell in yarn mode you can run:
spark-shell --master yarn --deploy-mode client
(You can't run the shell in cluster deploy-mode)
----------- Update
I forgot to mention that you can also submit cluster jobs with this configuration like this (thanks #JulianCienfuegos):
spark-submit --master yarn --deploy-mode cluster project-spark.py
This way you can't see the output in the terminal, and the command exits as soon as the job is submitted (not completed).
You can also use --deploy-mode client to see the output right there in your terminal but just do this for testing, since the job gets canceled if the command is interrupted (e.g. you press Ctrl+C, or your session ends)

Apache Giraph on EMR

Has any tried Apache Giraph on EMR?
It seems to me the only requirements to run on EMR are to add proper bootstrap scripts to the Job Flow configuration. Then I should just need to use a standard Custom JAR launch step to launch the Giraph Runner with appropriate arguments for my Giraph program.
Any documentation/tutorial or if you could just share your experience with Giraph on EMR, that will be much appreciated.
Yes, I run Giraph jobs on EMR regularly but I don't use "Job Flows", I manually login to the master node and use it as a normal Hadoop cluster (I just submit the job with hadoop jar command).
You are right, you need to add bootstrap scripts to run Zookeeper and to add Zookeeper details to core-site config. Here is how I did it :
Bootstrap actions -
Configure Hadoop s3://elasticmapreduce/bootstrap-actions/configure-hadoop --site-key-value, io.file.buffer.size=65536, --core-key-value, giraph.zkList=localhost:2181, --mapred-key-value, mapreduce.job.counters.limit=1200
Run if s3://elasticmapreduce/bootstrap-actions/run-if instance.isMaster=true, s3://hpc-chikitsa/zookeeper_install.sh
The contents of zookeeper_install.sh are :
#!/bin/bash
wget --no-check-certificate http://apache.mesi.com.ar/zookeeper/zookeeper3.4./zookeeper3.4.5.tar.gz
tar zxvf zookeeper-3.4.5.tar.gz
cd zookeeper-3.4.5
mv conf/zoo_sample.cfg conf/zoo.cfg
sudo bin/zkServer.sh start
Then copy your Giraph jar file to master node (using scp) and then ssh to master node and submit the job using hadoop jar command.
Hope that helps.
Here is a relevant mail-thread on giraph-user mailing list :
https://www.mail-archive.com/user%40giraph.apache.org/msg01240.html

Using different hadoop-mapreduce-client-core.jar to run hadoop cluster

I'm working on a hadoop cluster with CDH4.2.0 installed and ran into this error. It's been fixed in later versions of hadoop but I don't have access to update the cluster. Is there a way to tell hadoop to use this jar when running my job through the command line arguments like
hadoop jar MyJob.jar -D hadoop.mapreduce.client=hadoop-mapreduce-client-core-2.0.0-cdh4.2.0.jar
where the new mapreduce-client-core.jar file is the patched jar from the ticket. Or must hadoop be completely recompiled with this new jar? I'm new to hadoop so I don't know all the command line options that are possible.
I'm not sure how that would work as when you're executing the hadoop command you're actually executing code in the client jar.
Can you not use MR1? The issue says this issue only occurs when you're using MR2, so unless you really need Yarn you're probably better using the MR1 library to run your map/reduce.

How to tell if I am about to run Hadoop streaming job on a cluster or in "local" mode?

Hadoop streaming will run the process in "local" mode when there is no hadoop instance running on the box. I have a shell script that is controlling a set of hadoop streaming jobs in sequence and I need to condition copying files from HDFS to local depending on whether the jobs have been running locally or not. Is there a standard way to accomplish this test? I could do a "ps aux | grep something" but that seems ad-hoc.
Hadoop streaming will run the process in "local" mode when there is no hadoop instance running on the box.
Can you pl point to the reference for this?
A regular or a streaming job will run the way it is configured, so we know ahead of time in which mode a Job is run. Check the documentation for configuring Hadoop on a Single Node and Cluster in different modes.
Rather than trying to detect at run time which mode the process is operating, it is probably better to wrap the tool you are developing in a bash script that explicitly selects local vs cluster operatide. The O'Reilly Hadoop describes how to explicitly choose local using a configuration file override:
hadoop v2.MaxTemperatureDriver -conf conf/hadoop-local.xml input/ncdc/micro max-temp
where conf-local.xml is an XML file configured for local operation.
I haven't tried this yet, but I think you can just read out the mapred.job.tracker configuration setting.

Resources