Spark History Server on Yarn only shows Python application - hadoop

I have two spark contexts running on a box, 1 from python and 1 from scala. They are similarly configured, yet only the python application appears in the spark history page pointed to by the yarn tracking URL. Is there extra configuration I am missing here? (both run in yarn-client mode)

Related

Setting spark yarn client

I would like to make spark yarn client (link). Does it need to install hadoop ? or is it ok to install only yarn? ( by this
link)
No Spark do not require Hadoop for running. Apache Spark is an independent project which can run on its own. If you want you can even run it without apache yarn.
Spark support 3 type of cluster manager which are mesos, yarn and standalone. if you do not have yarn installed then it can use mesos and standalone and by default it uses standalone when you do not mention any preference for cluster manager.Links which you have mentioned is fine to use but I think more better resources are available on google.

Install spark on yarn cluster

I am looking for a guide regarding how to install spark on an existing virtual yarn cluster.
I have a yarn cluster consisting of two nodes, ran map-reduce job which worked perfect. Looked for results in log and everything is working fine.
Now I need to add the spark installation commands and configurations files in my vagrantfile. I can't find a good guide, could someone give me a good link ?
I used this guide for the yarn cluster
http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/#single-node-installation
Thanks in advance!
I don't know about vagrant, but I have installed Spark on top of hadoop 2.6 (in the guide referred to as post-YARN) and I hope this helps.
Installing Spark on an existing hadoop is really easy, you just need to install it only on one machine. For that you have to download the one pre-built for your hadoop version from it's official website (I guess you can use the without hadoop version but you need to point it to the direction of hadoop binaries in your system). Then decompress it:
tar -xvf spark-2.0.0-bin-hadoop2.x.tgz -C /opt
Now you only need to set some environment variables. First in your ~/.bashrc (or ~/.zshrc) you can set SPARK_HOME and add it to your PATH if you want:
export SPARK_HOME=/opt/spark-2.0.0-bin-hadoop-2.x
export PATH=$PATH:$SPARK_HOME/bin
Also for this changes to take effect you can run:
source ~/.bashrc
Second you need to point Spark to your Hadoop configuartion directories. To do this set these two environmental variables in $SPARK_HOME/conf/spark-env.sh:
export HADOOP_CONF_DIR=[your-hadoop-conf-dir usually $HADOOP_PREFIX/etc/hadoop]
export YARN_CONF_DIR=[your-yarn-conf-dir usually the same as the last variable]
If this file doesn't exist, you can copy the contents of $SPARK_HOME/conf/spark-env.sh.template and start from there.
Now to start the shell in yarn mode you can run:
spark-shell --master yarn --deploy-mode client
(You can't run the shell in cluster deploy-mode)
----------- Update
I forgot to mention that you can also submit cluster jobs with this configuration like this (thanks #JulianCienfuegos):
spark-submit --master yarn --deploy-mode cluster project-spark.py
This way you can't see the output in the terminal, and the command exits as soon as the job is submitted (not completed).
You can also use --deploy-mode client to see the output right there in your terminal but just do this for testing, since the job gets canceled if the command is interrupted (e.g. you press Ctrl+C, or your session ends)

H2O: unable to connect to h2o cluster through python

I have a 5 node hadoop cluster running HDP 2.3.0. I setup a H2O cluster on Yarn as described here.
On running following command
hadoop jar h2odriver_hdp2.2.jar water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx 512m -nodes 3 -output /user/hdfs/H2OTestClusterOutput
I get the following ouput
H2O cluster (3 nodes) is up
(Note: Use the -disown option to exit the driver after cluster formation)
(Press Ctrl-C to kill the cluster)
Blocking until the H2O cluster shuts down...
When I try to execute the command
h2o.init(ip="10.113.57.98", port=54321)
The process remains stuck at this stage.On trying to connect to the web UI using the ip:54321, the browser tries to endlessly load the H2O admin page but nothing ever displays.
On forcefully terminating the init process I get the following error
No instance found at ip and port: 10.113.57.98:54321. Trying to start local jar...
However if I try and use H2O with python without setting up a H2O cluster, everything runs fine.
I executed all commands as the root user. Root user has permissions to read and write from the /user/hdfs hdfs directory.
I'm not sure if this is a permissions error or that the port is not accessible.
Any help would be greatly appreciated.
It looks like you are using H2O2 (H2O Classic). I recommend upgrading your H2O to the latest (H2O 3). There is a build specifically for HDP2.3 here: http://www.h2o.ai/download/h2o/hadoop
Running H2O3 is a little cleaner too:
hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output hdfsOutputDirName
Also, 512mb per node is tiny - what is your use case? I would give the nodes some more memory.

I cannot see the running applications in hadoop 2.5.2 (yarn)

I installed hadoop 2.5.2, and I can run the wordcount sample successfully. However, when I want to see the application running on yarn (job running), I cannot as all applictaions interface is always empty (shown in the following screen).
Is there anyway to make the jobs visible?
Please try localhost:19888 or check value of the the property for web url for job history (mapreduce.jobhistory.webapp.address) configured in you yarn config file.

Spark Standalone Mode: Worker not starting properly in cloudera

I am new to the spark, After installing the spark using parcels available in the cloudera manager.
I have configured the files as shown in the below link from cloudera enterprise:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.8.1/Cloudera-Manager-Installation-Guide/cmig_spark_installation_standalone.html
After this setup, I have started all the nodes in the spark by running /opt/cloudera/parcels/SPARK/lib/spark/sbin/start-all.sh. But I couldn't run the worker nodes as I got the specified error below.
[root#localhost sbin]# sh start-all.sh
org.apache.spark.deploy.master.Master running as process 32405. Stop it first.
root#localhost.localdomain's password:
localhost.localdomain: starting org.apache.spark.deploy.worker.Worker, logging to /var/log/spark/spark-root-org.apache.spark.deploy.worker.Worker-1-localhost.localdomain.out
localhost.localdomain: failed to launch org.apache.spark.deploy.worker.Worker:
localhost.localdomain: at java.lang.ClassLoader.loadClass(libgcj.so.10)
localhost.localdomain: at gnu.java.lang.MainThread.run(libgcj.so.10)
localhost.localdomain: full log in /var/log/spark/spark-root-org.apache.spark.deploy.worker.Worker-1-localhost.localdomain.out
localhost.localdomain:starting org.apac
When I run jps command, I got:
23367 Jps
28053 QuorumPeerMain
28218 SecondaryNameNode
32405 Master
28148 DataNode
7852 Main
28159 NameNode
I couldn't run the worker node properly. Actually I thought to install a standalone spark where the master and worker work on a single machine. In slaves file of spark directory, I given the address as "localhost.localdomin" which is my host name. I am not aware of this settings file. Please any one cloud help me out with this installation process. Actually I couldn't run the worker nodes. But I can start the master node.
Thanks & Regards,
bips
Please notice error info below:
localhost.localdomain: at java.lang.ClassLoader.loadClass(libgcj.so.10)
I met the same error when I installed and started Spark master/workers on CentOS 6.2 x86_64 after making sure that libgcj.x86_64 and libgcj.i686 had been installed on my server, finally I solved it. Below is my solution, wish it can help you.
It seem as if your JAVA_HOME environment parameter didn't set correctly.
Maybe, your JAVA_HOME links to system embedded java, e.g. java version "1.5.0".
Spark needs java version >= 1.6.0. If you are using java 1.5.0 to start Spark, you will see this error info.
Try to export JAVA_HOME="your java home path", then start Spark again.

Resources