Apache Giraph on EMR - hadoop

Has any tried Apache Giraph on EMR?
It seems to me the only requirements to run on EMR are to add proper bootstrap scripts to the Job Flow configuration. Then I should just need to use a standard Custom JAR launch step to launch the Giraph Runner with appropriate arguments for my Giraph program.
Any documentation/tutorial or if you could just share your experience with Giraph on EMR, that will be much appreciated.

Yes, I run Giraph jobs on EMR regularly but I don't use "Job Flows", I manually login to the master node and use it as a normal Hadoop cluster (I just submit the job with hadoop jar command).
You are right, you need to add bootstrap scripts to run Zookeeper and to add Zookeeper details to core-site config. Here is how I did it :
Bootstrap actions -
Configure Hadoop s3://elasticmapreduce/bootstrap-actions/configure-hadoop --site-key-value, io.file.buffer.size=65536, --core-key-value, giraph.zkList=localhost:2181, --mapred-key-value, mapreduce.job.counters.limit=1200
Run if s3://elasticmapreduce/bootstrap-actions/run-if instance.isMaster=true, s3://hpc-chikitsa/zookeeper_install.sh
The contents of zookeeper_install.sh are :
#!/bin/bash
wget --no-check-certificate http://apache.mesi.com.ar/zookeeper/zookeeper3.4./zookeeper3.4.5.tar.gz
tar zxvf zookeeper-3.4.5.tar.gz
cd zookeeper-3.4.5
mv conf/zoo_sample.cfg conf/zoo.cfg
sudo bin/zkServer.sh start
Then copy your Giraph jar file to master node (using scp) and then ssh to master node and submit the job using hadoop jar command.
Hope that helps.
Here is a relevant mail-thread on giraph-user mailing list :
https://www.mail-archive.com/user%40giraph.apache.org/msg01240.html

Related

Install spark on yarn cluster

I am looking for a guide regarding how to install spark on an existing virtual yarn cluster.
I have a yarn cluster consisting of two nodes, ran map-reduce job which worked perfect. Looked for results in log and everything is working fine.
Now I need to add the spark installation commands and configurations files in my vagrantfile. I can't find a good guide, could someone give me a good link ?
I used this guide for the yarn cluster
http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/#single-node-installation
Thanks in advance!
I don't know about vagrant, but I have installed Spark on top of hadoop 2.6 (in the guide referred to as post-YARN) and I hope this helps.
Installing Spark on an existing hadoop is really easy, you just need to install it only on one machine. For that you have to download the one pre-built for your hadoop version from it's official website (I guess you can use the without hadoop version but you need to point it to the direction of hadoop binaries in your system). Then decompress it:
tar -xvf spark-2.0.0-bin-hadoop2.x.tgz -C /opt
Now you only need to set some environment variables. First in your ~/.bashrc (or ~/.zshrc) you can set SPARK_HOME and add it to your PATH if you want:
export SPARK_HOME=/opt/spark-2.0.0-bin-hadoop-2.x
export PATH=$PATH:$SPARK_HOME/bin
Also for this changes to take effect you can run:
source ~/.bashrc
Second you need to point Spark to your Hadoop configuartion directories. To do this set these two environmental variables in $SPARK_HOME/conf/spark-env.sh:
export HADOOP_CONF_DIR=[your-hadoop-conf-dir usually $HADOOP_PREFIX/etc/hadoop]
export YARN_CONF_DIR=[your-yarn-conf-dir usually the same as the last variable]
If this file doesn't exist, you can copy the contents of $SPARK_HOME/conf/spark-env.sh.template and start from there.
Now to start the shell in yarn mode you can run:
spark-shell --master yarn --deploy-mode client
(You can't run the shell in cluster deploy-mode)
----------- Update
I forgot to mention that you can also submit cluster jobs with this configuration like this (thanks #JulianCienfuegos):
spark-submit --master yarn --deploy-mode cluster project-spark.py
This way you can't see the output in the terminal, and the command exits as soon as the job is submitted (not completed).
You can also use --deploy-mode client to see the output right there in your terminal but just do this for testing, since the job gets canceled if the command is interrupted (e.g. you press Ctrl+C, or your session ends)

Running Spark Jobs via Oozie

Is it possible to run Spark Jobs e.g. Spark-sql jobs via Oozie?
In the past we have used Oozie with Hadoop. Since we are now using Spark-Sql on top of YARN, looking for a way to use Oozie to schedule jobs.
Thanks.
Yup its possible ... The procedure is also same, that you have to provide Oozia a directory structure having coordinator.xml, workflow.xml and a lib directory containing your Jar files.
But remember Oozie starts the job with java -cp command, not with spark-submit, so if you have to run it with Oozie, Here is a trick.
Run your jar with spark-submit in background.
Look for that process in process list. It will be running under java -cp command but with some additional Jars, that are added by spark-submit. Add those Jars in CLASS_PATH. and that's it. Now you can run your Spark applications through Oozie.
1. nohup spark-submit --class package.to.MainClass /path/to/App.jar &
2. ps aux | grep '/path/to/App.jar'
EDITED: You can also use latest Oozie, which has Spark Action also.
To run Spark SQL by Oozie you need to use Oozie Spark Action.
You can locate oozie.gz on your distribution. Usually in cloudera you can find this oozie examples directory at below path.
]$ locate oozie.gz
/usr/share/doc/oozie-4.1.0+cdh5.7.0+267/oozie-examples.tar.gz
Spark SQL need hive-site.xml file for execution which you need to provide in workflow.xml
< spark-opts>--file /hive-site.xml < /spark-opts>

Differences between hadoop jar and yarn -jar

what's the difference between run a jar file with commands "hadoop jar " and "yarn -jar " ?
I've used the "hadoop jar" command on my MAC successfully but I want be sure that the execution is being correct and parallel on my four cores.
Thanks!!!
Short Answer
They are probably identical for you, but even if they aren't, they should both utilize your cluster to the best of its ability.
Longer Answer
The /usr/bin/yarn script sets up the execution environment so that all of the yarn commands can be run. The /usr/bin/hadoop script isn't quite as concerned about yarn specific functionality. However, if you have your cluster set up to use yarn as the default implementation of mapreduce (MRv2), then hadoop jar will probably act the same as yarn jar for a mapreduce job.
Either way you're probably fine, but you can always check the resource manager (or job tracker) web interface to see how your job is distributed across the cluster (whether it's a single node cluster or not)

Using different hadoop-mapreduce-client-core.jar to run hadoop cluster

I'm working on a hadoop cluster with CDH4.2.0 installed and ran into this error. It's been fixed in later versions of hadoop but I don't have access to update the cluster. Is there a way to tell hadoop to use this jar when running my job through the command line arguments like
hadoop jar MyJob.jar -D hadoop.mapreduce.client=hadoop-mapreduce-client-core-2.0.0-cdh4.2.0.jar
where the new mapreduce-client-core.jar file is the patched jar from the ticket. Or must hadoop be completely recompiled with this new jar? I'm new to hadoop so I don't know all the command line options that are possible.
I'm not sure how that would work as when you're executing the hadoop command you're actually executing code in the client jar.
Can you not use MR1? The issue says this issue only occurs when you're using MR2, so unless you really need Yarn you're probably better using the MR1 library to run your map/reduce.

Running jar on hadoop server as a service

i made one jar
which analyzes system logs .. for running this jar on HADOOP server i can do it using command line like "bin/hadoop jar log.jar"
but my problem is i want to make this jar executable in background as a service on Ubuntu master machine.
can any one help me how can i make HADOOP jar as a service so it can run like a background service on Ubuntu Machine .. runs after every 1hrs.
You have a few options, here's two:
Configure a crontab job to run your job every hour, something like (you'll need to fully qualify the path to hadoop and the jar itself):
0 * * * * /usr/lib/hadoop/bin/hadoop jar /path/to/jar/log.jar
Run an OOZIE server and configure a coordinator to submit the job on an hourly basis. More effort that the above suggestion but worth a look.

Resources