Connect Spark on yarn-cluster in CDH 5.4 - hadoop

I am trying to understand the "concept" of connecting to a remote server. What I have are four servers on CentOS using CDH 5.4.
What I want to do is run Spark on YARN on all four of these nodes.
My problem is that I do not understand how to set HADOOP_CONF_DIR as specified here. Where and what value should I set for this variable? And do I need to set this variable on all four nodes, or will the master node alone suffice?
The documentation says "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster".
I have read many similar questions before asking this here. Please let me know what I can do to solve this problem. I am able to run spark and pyspark in standalone mode on all nodes.
Thanks for your help.
Ashish

Where and what value should I set for this variable?
The variable HADOOP_CONF_DIR should point to the directory that contains yarn-site.xml. You usually set it in ~/.bashrc. I found the documentation for CDH:
http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-common/ClusterSetup.html
Basically, all nodes need to have the configuration files pointed to by the environment variable. As that page puts it: "Once all the necessary configuration is complete, distribute the files to the HADOOP_CONF_DIR directory on all the machines."
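For example, on a CDH 5.x node the client configuration typically lives under /etc/hadoop/conf. A minimal sketch, assuming that default path and a packaged Spark layout (verify both on your own nodes; the example jar path is illustrative):

# ~/.bashrc on every node from which you submit Spark jobs
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf

# Then submit a test job against YARN
spark-submit --master yarn-cluster \
  --class org.apache.spark.examples.SparkPi \
  /usr/lib/spark/lib/spark-examples*.jar 10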

Related

Hive installation

After installing Hive 3.2.1 on Hadoop 3.3.0 in Ubuntu, we start the Hive services. I am not sure how Hive identifies the Hadoop services, since we don't provide anything related to Hadoop during the Hive setup process. Does Hive identify Hadoop by means of the HADOOP_HOME environment variable defined in the .bashrc file?
Can someone please confirm my understanding?
Thanks!
Yes, Hive uses the Hadoop configuration found via HADOOP_HOME to discover the cluster; HADOOP_HOME can also be specified in hive-env.sh.
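For instance, a minimal sketch of hive-env.sh (copied from hive-env.sh.template under $HIVE_HOME/conf), assuming an illustrative Hadoop install path:

# $HIVE_HOME/conf/hive-env.sh
export HADOOP_HOME=/usr/local/hadoop-3.3.0        # assumed install path; adjust to yours
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop    # where core-site.xml and hdfs-site.xml live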

Which configuration file is the client using to connect to the Hadoop cluster?

When an edge node has multiple Hadoop distributions, there can be multiple configuration files scattered across directories.
In those cases, how do you know which configuration file the client is referring to when it connects to the cluster (say, for YARN)? One option is to look at the .bashrc file to find out whether the HADOOP_HOME variable is set.
Are there any other options to find this out? (Obviously, using the find command to search for a file will not serve the purpose.)
Hadoop provides the classpath command. The description of the command reads:
classpath: prints the class path needed to get the Hadoop jar and the required libraries
You can execute this command as:
hadoop classpath
or
yarn classpath
Both of these commands should give you almost identical results.
For example, I got the following output for hadoop classpath:
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\bin>hadoop classpath
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\etc\hadoop;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\common\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\common\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\hdfs;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\hdfs\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\hdfs\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\yarn\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\yarn\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\mapreduce\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\mapreduce\*;
e:\hdp\tez-0.7.0.2.3.0.0-2557\conf\;
e:\hdp\tez-0.7.0.2.3.0.0-2557\*;
e:\hdp\tez-0.7.0.2.3.0.0-2557\lib\*;
All these paths contain HADOOP_HOME as the parent path. In my case it is "e:\hdp\hadoop-2.7.1.2.3.0.0-2557". From this path, you can easily figure out which distribution of Hadoop your client is referring to.
In my case, my client is using the Hadoop configuration and jars from the "e:\hdp\hadoop-2.7.1.2.3.0.0-2557" directory.
You can also run the env command to get HADOOP_HOME for the session. Even if you overwrite HADOOP_HOME, env will give the current value for the session.
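On a Linux edge node, a quick check along those lines might look like this (standard commands; the output depends on your machine, and the classpath separator is ';' on Windows):

env | grep -E 'HADOOP_(HOME|CONF_DIR)'   # values in the current session
which hadoop                             # which hadoop launcher is on the PATH
hadoop classpath | tr ':' '\n' | head    # first few classpath entries, one per line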

Not able to deploy workers on Spark-1.2.0

I am new to Spark and am using spark-1.2.0 with hadoop 2.4.1. I have set up a master and four slave nodes, but two of my nodes are not starting.
I have defined the IP addresses of the nodes in the slaves file in the spark-1.2.0/conf/ directory.
But when I try to run ./sbin/start-all.sh, the error is as follows:
failed to launch org.apache.spark.deploy.worker.Worker
could not find or load main class org.apache.spark.deploy.worker.Worker
This is happening for two nodes; the other two are working fine.
I've also set up spark-env.sh on the master as well as on the slaves. The master also has passwordless SSH connectivity to the slaves.
I've also tried running ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
It gives the same error as before. Can someone help me with this? Where am I making a mistake?
So I figured out the solution. For all those starting out with Spark, please check all the jar files in the lib folder. The spark-assembly-1.2.0-hadoop2.4.0.jar file was missing on my slave.
I also encountered the same issue. If this is a local-mode cluster setup, then you can run instead:
./sbin/start-master.sh
./sbin/start-slave.sh spark://localhost:7077
Then run:
MASTER=spark://localhost:7077 ./bin/pyspark
I was able to execute my jobs from the shell.
Do remember to set up conf/slaves and conf/spark-env.sh as described here (a sketch follows below):
http://pulasthisupun.blogspot.com/2013/11/how-to-set-up-apache-spark-cluster-in.html
Also change localhost to your hostname.
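For reference, a minimal sketch of those two files for a standalone Spark 1.x cluster, with placeholder hostnames and resource sizes:

# conf/slaves -- one worker hostname per line (placeholders)
worker1.example.com
worker2.example.com

# conf/spark-env.sh -- sourced by the start scripts on every node
export SPARK_MASTER_IP=master.example.com   # the master's hostname or IP
export SPARK_WORKER_CORES=4                 # adjust to your hardware
export SPARK_WORKER_MEMORY=4g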

Spark: Run Spark shell from a different directory than where Spark is installed on slaves and master

I have a small cluster (4 machines) set up with 3 slaves and a master node, all installed to /home/spark/spark (i.e., $SPARK_HOME is /home/spark/spark).
When I use the Spark shell, /home/spark/spark/bin/pyspark --master spark://192.168.0.11:7077, everything works fine. However, I'd like my colleagues to be able to connect to the cluster from a local instance of Spark installed on their machines in whatever directory they wish.
Currently, if somebody has Spark installed in, say, /home/user12/spark and runs /home/user12/spark/bin/pyspark --master spark://192.168.0.11:7077, the Spark shell connects to the master without problems but fails with an error when they try to run code:
class java.io.IOException: Cannot run program
"/home/user12/bin/compute-classpath.sh"
(in directory "."): error=2, No such file or directory)
The problem here is that Spark is looking for the Spark installation in /home/user12/spark/, whereas I'd like to tell Spark to look in /home/spark/spark/ instead.
How do I do this?
You need to edit three files: spark-submit, spark-class and pyspark (all in the bin folder).
Find the line
export SPARK_HOME=[...]
Then change it to
SPARK_HOME=[...]
Finally, make sure you set SPARK_HOME to the directory where Spark is installed on the cluster.
This worked for me.
Here you can find a detailed explanation.
http://apache-spark-user-list.1001560.n3.nabble.com/executor-failed-cannot-find-compute-classpath-sh-td859.html
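As a rough sketch of that edit in bin/spark-class (the same change applies to bin/spark-submit and bin/pyspark); the "before" line is approximately what Spark 1.x ships, and /home/spark/spark is the cluster-side install from the question:

# Before: the script derives SPARK_HOME from its own location, roughly:
#   export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
# After: hard-code the directory where Spark lives on the cluster machines
SPARK_HOME=/home/spark/spark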

How did Pig automatically connect to the default HDFS?

I just started learning Hadoop and Pig (over the last two days!) for one of my future projects.
For experiments, I've installed Hadoop (HDFS on the default localhost:9000) in pseudo-distributed mode and Pig (in MapReduce mode).
When I initialized Pig by typing the ./bin/pig command, it launched the Grunt command line and I got a message that Pig had connected to HDFS (localhost:9000); later I was able to access HDFS through Pig.
I was expecting to have to perform some manual configuration for Pig to access HDFS (as per various internet articles).
My question is: where did Pig pick up the default HDFS configuration (localhost:9000)? I checked pig.properties, but I didn't find anything there. I need this information, as I might change the default HDFS configuration in the future.
BTW, I have HADOOP_HOME and PIG_HOME defined as OS environment variables and on my PATH.
When installing Pig (I assume v0.10.0), you have to tell it how to connect to HDFS.
I don't know how you did this, but generally it is done by adding the Hadoop conf dir path to the PIG_CLASSPATH environment variable. You can also set HADOOP_CONF_DIR.
When you start the Grunt shell, Pig locates the directory of the Hadoop configuration XMLs and takes the values of fs.default.name (core-site.xml) and mapred.job.tracker (mapred-site.xml), i.e. the locations of the NameNode and JobTracker.
For reference, you may have a look at the Pig shell script to see how the environment variables are collected and evaluated.
Pig can connect to the underlying HDFS in 3 ways:
1-
Pig uses HADOOP_HOME to find the Hadoop client to run.
Your HADOOP_HOME should already have been set up in your bash_profile:
export HADOOP_HOME=~/myHadoop/hadoop-2.5.2
2-
Alternatively, your HADOOP_CONF_DIR may already have been set up; it points to the directory containing the XML files for the Hadoop configuration:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
3-
If neither of these is set up, you can also connect to the underlying HDFS by changing pig.properties, which is present under the PIG_HOME/conf directory (see the sketch below).
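As an illustration of option 3, a sketch of what the pig.properties entries might look like for the pseudo-distributed setup in the question; the keys mirror the pre-Hadoop-2 property names mentioned above, and the JobTracker address is only an assumed example:

# PIG_HOME/conf/pig.properties -- keys mirror core-site.xml / mapred-site.xml
fs.default.name=hdfs://localhost:9000
mapred.job.tracker=localhost:9001    # assumed JobTracker address; adjust to your cluster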
