Idle hadoop master - how to make it do some work? - hadoop

I have launched a small cluster of two nodes and noticed that the master stays completely idle while the slave does all the work. I was wondering how to let the master run some of the tasks as well. I understand that for a larger cluster a dedicated master may be necessary, but on a 2-node cluster it seems like overkill.
Thanks for any tips,
Vaclav
Some more details:
The two boxes have 2 CPUs each. The cluster has been set up on Amazon Elastic MapReduce, but I am running Hadoop from the command line.
The cluster I just tried it on has:
Hadoop 0.18
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) Server VM (build 11.2-b01, mixed mode)
hadoop jar /home/hadoop/contrib/streaming/hadoop-0.18-streaming.jar \
-jobconf mapred.job.name=map_data \
-file /path/map.pl \
-mapper "map.pl x aaa" \
-reducer NONE \
-input /data/part-* \
-output /data/temp/mapped-data \
-jobconf mapred.output.compress=true
where the input consists of 18 files.

Actually, the Hadoop master is not the one doing the work (the tasks you run).
You can start a datanode and a tasktracker on the same machine the master runs on.
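As a minimal sketch, assuming the stock Hadoop daemon scripts in the distribution's bin directory, that would look like this on the master:
bin/hadoop-daemon.sh start datanode    # let the master store HDFS blocks as well
bin/hadoop-daemon.sh start tasktracker # let the master accept map/reduce tasks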

Steve Loughran on the hadoop-users list suggested that starting a tasktracker on the master would do the trick.
$ bin/hadoop-daemon.sh start tasktracker
Seems to work. You may want to adjust the number of slots for this tasktracker.
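A sketch of how those slots could be capped, assuming the classic per-tasktracker properties (conf/hadoop-site.xml in 0.18, mapred-site.xml in later releases); the values are only examples:
<property>
  <!-- example: run at most one map task at a time on the master -->
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>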

It may be different for Hadoop 0.18, but you can try adding the IP address of the master to the conf/slaves file and then restarting the cluster.
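A rough sketch of that approach (the hostname is a placeholder):
echo master-hostname >> conf/slaves   # or the master's private IP address
bin/stop-all.sh && bin/start-all.sh   # restart so a datanode/tasktracker also comes up on the master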

Related

Spark on Hadoop YARN - executor missing

I have a cluster of 3 macOS machines running Hadoop and Spark-1.5.2 (though with Spark-2.0.0 the same problem exists). With 'yarn' as the Spark master URL, I am running into a strange issue where tasks are only allocated to 2 of the 3 machines.
Based on the Hadoop dashboard (port 8088 on the master) it is clear that all 3 nodes are part of the cluster. However, any Spark job I run only uses 2 executors.
For example here is the "Executors" tab on a lengthy run of the JavaWordCount example:
"batservers" is the master. There should be an additional slave, "batservers2", but it's just not there.
Why might this be?
Note that none of my YARN or Spark (or, for that matter, HDFS) configurations are unusual, except provisions for giving the YARN resource- and node-managers extra memory.
Remarkably, all it took was a detailed look at the spark-submit help message to discover the answer:
YARN-only:
...
--num-executors NUM Number of executors to launch (Default: 2).
If I specify --num-executors 3 in my spark-submit command, the 3rd node is used.
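For example, a run of the word-count job with the flag added might look like this (the jar path and input are placeholders, not taken from the question):
./bin/spark-submit --class org.apache.spark.examples.JavaWordCount \
  --master yarn \
  --num-executors 3 \
  lib/spark-examples-1.5.2-hadoop2.6.0.jar hdfs:///data/words.txt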

Can't get pyspark job to run on all nodes of hadoop cluster

Summary: I can't get my python-spark job to run on all nodes of my hadoop cluster.
I've installed the Spark build for Hadoop, 'spark-1.5.2-bin-hadoop2.6'. When launching a Java Spark job, the load gets distributed over all nodes; when launching a Python Spark job, only one node takes the load.
Setup:
hdfs and yarn configured for 4 nodes: nk01 (namenode), nk02, nk03, nk04, running on xen virtual servers
versions: jdk1.8.0_66, hadoop-2.7.1, spark-1.5.2-bin-hadoop2.6
hadoop installed on all 4 nodes
spark only installed on nk01
I copied a bunch of Gutenberg files (thank you, Johannes!) onto HDFS, and tried doing a wordcount using Java and Python on a subset of the files (the files that start with an 'e'):
Python:
Using a homebrew python script for doing wordcount:
/opt/spark/bin/spark-submit wordcount.py --master yarn-cluster \
--num-executors 4 --executor-cores 1
The Python code assigns 4 partitions:
tt=sc.textFile('/user/me/gutenberg/text/e*.txt',4)
Load on the 4 nodes over 60 seconds:
Java:
Using the JavaWordCount found in the spark distribution:
/opt/spark/bin/spark-submit --class JavaWordCount --master yarn-cluster \
--num-executors 4 jwc.jar '/user/me/gutenberg/text/e*.txt'
Conclusion: the java version distributes its load across the cluster, the python version just runs on 1 node.
Question: how do I get the python version also to distribute the load across all nodes?
The Python program name was indeed in the wrong position, as Shawn Guo suggested. It should have been run this way:
/opt/spark/bin/spark-submit --master yarn-cluster --num-executors 4 \
--executor-cores 1 wordcount.py
That gives this load on the nodes:
Spark-submit
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
There is a difference from the Scala/Java submit in the parameter position.
For Python applications, simply pass a .py file in the place of
application-jar instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.
You should use the command below instead:
/opt/spark/bin/spark-submit --master yarn-cluster \
--num-executors 4 --executor-cores 1 wordcount.py
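If wordcount.py imported local helper modules, those could be shipped alongside it with --py-files, as the quoted documentation mentions (helpers.zip is hypothetical):
/opt/spark/bin/spark-submit --master yarn-cluster \
--py-files helpers.zip \
--num-executors 4 --executor-cores 1 wordcount.py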

Spark over Yarn - Incorrect Application Master selection

I'm trying to fire some jobs with Spark over YARN with the following command (this is just an example; actually I'm using different amounts of memory and cores):
./bin/spark-submit --class org.mypack.myapp \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
lib/myapp.jar
When I look at the Web UI to see what is really happening under the hood, I notice that YARN is picking as Application Master a node that is not the Spark Master. This is a problem because the real Spark Master node is forcefully involved in the distributed computation, leading to unnecessary network transfers of data (because, of course, the Spark Master has no data to start with).
From what I saw during my tests, YARN is picking the AM in a totally random fashion and I can't find a way to force it to pick the Spark Master as the AM.
My cluster is made of 4 nodes (3 Spark slaves, 1 Spark Master) with 64GB of total RAM and 32 cores, built on HDP 2.4 from Hortonworks. The Spark Master only hosts the namenode; the three slaves are datanodes.
You want to be able to specify that the Application Master runs on a particular node, one which does not have any DataNodes. This, as far as I know, is not possible out of the box.
What you could do is run the job in yarn-client mode from the node which is running the NameNode, so that the driver at least stays there, but this is probably not what you are looking for.
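A sketch of that client-mode workaround, submitted from the NameNode/Spark Master machine itself (resource settings copied from the question; in yarn-client mode the driver stays on the submitting machine, but YARN still places the executors and the AM):
./bin/spark-submit --class org.mypack.myapp \
  --master yarn-client \
  --num-executors 3 \
  --executor-memory 2g \
  --executor-cores 1 \
  lib/myapp.jar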
Another way would be to create your own Spark client, where you use the YARN API to prefer certain nodes over others for the Application Master.

Spark not able to run in yarn cluster mode

I am trying to execute my code on a yarn cluster
The command which I am using is
$SPARK_HOME/bin/spark-submit \
--class "MyApp" \
target/scala-2.10/my-application_2.10-1.0.jar \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 6g \
--executor-memory 7g \
<outputPath>
But I can see that this program is running only on localhost.
It's able to read the file from HDFS.
I have tried this in standalone mode and it works fine.
Please suggest where it is going wrong.
I am using Hadoop 2.4 with Spark 1.1.0. I was able to get it running in cluster mode.
To solve it, we simply removed all the configuration files from all the slave nodes. Earlier we were running in standalone mode, and that led to duplicating the configuration on all the slaves. Once that was done, it ran as expected in cluster mode, although performance is not on par with standalone mode.
Thanks.

Spark on yarn concept understanding

I am trying to understand how Spark runs on a YARN cluster/client. I have the following questions in mind.
Is it necessary that Spark is installed on all the nodes in the YARN cluster? I think it should be, because the worker nodes in the cluster execute tasks and need to be able to run the code (Spark APIs) in the Spark application sent to the cluster by the driver.
It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster". Why does the client node have to install Hadoop when it is sending the job to the cluster?
Adding to other answers.
Is it necessary that Spark is installed on all the nodes in the YARN cluster?
No, not if the Spark job is scheduled on YARN (either client or cluster mode). Spark needs to be installed on all the nodes only for standalone mode.
These are the visualizations of the Spark app deployment modes.
Spark Standalone Cluster
In cluster mode the driver sits in one of the Spark worker nodes, whereas in client mode it is within the machine which launched the job.
YARN cluster mode
YARN client mode
It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster". Why does the client node have to install Hadoop when it is sending the job to the cluster?
A full Hadoop installation is not mandatory, but the configuration files (not all of them) are! Such nodes are often called gateway nodes. This is for two main reasons.
1. The configuration contained in the HADOOP_CONF_DIR directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.
2. In YARN mode the ResourceManager's address is picked up from the Hadoop configuration (yarn-default.xml); thus, the --master parameter is simply yarn.
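A minimal sketch of what such a gateway node needs (paths, the hostname and the examples jar below are assumptions, not part of the original answer):
export HADOOP_CONF_DIR=/etc/hadoop/conf   # client-side config files only, no Hadoop daemons running here
# yarn-site.xml in that directory mainly has to point at the ResourceManager, e.g.
#   <name>yarn.resourcemanager.hostname</name> <value>rm-host.example.com</value>
./bin/spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi lib/spark-examples.jar 100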
Update (2017-01-04): Spark 2.0+ no longer requires a fat assembly jar for production deployment (source).
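For completeness, under Spark 2.0+ the analogous knob is spark.yarn.jars or spark.yarn.archive, pointing at a pre-staged location on HDFS instead of a single assembly jar; for example (the HDFS path is just an illustration):
--conf spark.yarn.jars=hdfs://master:8020/spark/jars/*.jar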
We are running spark jobs on YARN (we use HDP 2.2).
We don't have Spark installed on the cluster; we only added the Spark assembly jar to HDFS.
For example to run the Pi example:
./bin/spark-submit \
--verbose \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar \
--num-executors 2 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 4 \
hdfs://master:8020/spark/spark-examples-1.3.1-hadoop2.6.0.jar 100
--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar - This config tells YARN where to take the Spark assembly from. If you don't use it, the jar will be uploaded from the machine where you run spark-submit.
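Staging the assembly on HDFS is a one-time step along these lines (the local path of the assembly jar is an assumption):
hadoop fs -mkdir -p /spark
hadoop fs -put /opt/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar /spark/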
About your second question: the client node doesn't need Hadoop installed. It only needs the configuration files. You can copy the directory from your cluster to your client.
1 - Spark follows a slave/master architecture, so on your cluster you have to install a Spark master and N Spark slaves. You can run Spark in standalone mode, but using the YARN architecture will give you some benefits.
There is a very good explanation of it here : http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
2 - It is necessary if you want to use YARN or HDFS, for example, but as I said before you can run it in standalone mode.
Let me try to cut through the clutter and make it short for the impatient.
6 components: 1. client, 2. driver, 3. executors, 4. application master, 5. workers, and 6. resource manager; 2 deploy modes; and 2 resource (cluster) managers.
Here's the relation:
Client
Nothing special; it is the one submitting the Spark app.
Worker, executors
Nothing special, one worker holds one or more executors.
Master, & resource (cluster) manager
(no matter client or cluster mode)
in YARN, the resource manager and the master sit in two different nodes;
in standalone, resource manager == master, i.e. the same process in the same node.
Driver
in client mode, it sits with the client
in yarn cluster mode, it sits with the (application) master (in this case, the client process exits after the app is submitted)
in standalone cluster mode, it sits with one of the workers
Voilà!
