SparkR: verify the number of functioning worker nodes

After starting a spark-ec2 cluster, I start sparkR from /root with
$ ./spark/bin/sparkR
A few lines of the resulting message include:
16/11/20 10:13:51 WARN SparkConf:
SPARK_WORKER_INSTANCES was detected (set to '1').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with --num-executors to specify the number of executors
- Or set SPARK_EXECUTOR_INSTANCES
- spark.executor.instances to configure the number of instances in the spark config.
So, following that suggestion, I added the last line shown below to spark-defaults.conf:
$ pwd
/root/spark/conf
$ cat spark-defaults.conf
spark.executor.memory 512m
spark.executor.extraLibraryPath /root/ephemeral-hdfs/lib/native/
spark.executor.extraClassPath /root/ephemeral-hdfs/conf
spark.executor.instances 2
This resulted in the message no longer being printed.
In sparkR, how can I verify the number of worker nodes that will be accessed?

After you start your Spark cluster, you can check the current workers and executors in the Spark UI at Master_IP:8080 (for a local setup, that is localhost:8080).
You can also check that your configuration is applied correctly at localhost:4040, under the Environment tab.
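If you want to check this from code rather than the browser, the standalone master serves the same status as JSON on its UI port. A minimal Python sketch (my addition; the /json endpoint and the field names are those of the standalone master's web UI, so treat this as a sketch rather than a guaranteed interface):

import json
from urllib.request import urlopen

MASTER_UI = "http://localhost:8080"  # replace with http://Master_IP:8080

with urlopen(MASTER_UI + "/json") as resp:
    status = json.load(resp)

# Count the workers the master currently considers alive.
alive = [w for w in status["workers"] if w["state"] == "ALIVE"]
print("alive workers:", len(alive))
for w in alive:
    print(w["host"], "cores:", w["cores"], "memory(MB):", w["memory"])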

Related

How to set yarn.app.mapreduce.am.command-opts for spark yarn cluster job

I am getting a "Container... is running beyond virtual memory limits" error while running a Spark job in yarn-cluster mode.
It is not possible for me to ignore this error or to increase the vmem-to-pmem ratio.
The job is submitted through spark-submit with "--conf spark.driver.memory=2800m".
I think this is because the default value of yarn.app.mapreduce.am.command-opts is 1G, so YARN kills my driver/AM as soon as it uses more than 1G of memory.
So I would like to pass "yarn.app.mapreduce.am.command-opts" to spark-submit in a bash script. Passing it via "spark.driver.extraJavaOptions" errors out with "Not allowed to specify max heap(Xmx) memory settings through java options".
So how do I pass it?
EDIT: I cannot edit the conf files, as that would change the setting for all MR and Spark jobs.
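One direction worth noting (my assumption, not from this thread): spark.driver.extraJavaOptions rejects -Xmx, but spark-submit does accept per-job YARN overhead settings such as spark.yarn.driver.memoryOverhead, which affect only the submitted job and leave the cluster-wide conf files untouched. A hedged sketch, driving spark-submit from Python:

# Hedged sketch: raise the YARN-side memory headroom for the driver/AM for
# this job only, instead of passing an -Xmx through extraJavaOptions (which
# spark-submit rejects). Values and the script name are illustrative.
import subprocess

cmd = [
    "spark-submit",
    "--master", "yarn-cluster",
    "--conf", "spark.driver.memory=2800m",
    # Extra off-heap allowance (in MB) that YARN grants the driver container:
    "--conf", "spark.yarn.driver.memoryOverhead=1024",
    "my_job.py",  # hypothetical application script
]
subprocess.check_call(cmd)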

Spark on Hadoop YARN - executor missing

I have a cluster of 3 macOS machines running Hadoop and Spark-1.5.2 (though with Spark-2.0.0 the same problem exists). With 'yarn' as the Spark master URL, I am running into a strange issue where tasks are only allocated to 2 of the 3 machines.
Based on the Hadoop dashboard (port 8088 on the master) it is clear that all 3 nodes are part of the cluster. However, any Spark job I run only uses 2 executors.
For example here is the "Executors" tab on a lengthy run of the JavaWordCount example:
"batservers" is the master. There should be an additional slave, "batservers2", but it's just not there.
Why might this be?
Note that none of my YARN or Spark (or, for that matter, HDFS) configurations are unusual, except provisions for giving the YARN resource- and node-managers extra memory.
Remarkably, all it took was a detailed look at the spark-submit help message to discover the answer:
YARN-only:
...
--num-executors NUM Number of executors to launch (Default: 2).
If I specify --num-executors 3 in my spark-submit command, the 3rd node is used.
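To confirm the fix at runtime rather than in the UI, a small check is possible from PySpark (my addition; it goes through a private attribute of the Python gateway, so it is a debugging aid, not a stable API):

# Sketch: count registered executors from inside a running PySpark job.
# getExecutorMemoryStatus reports one entry per executor JVM plus one for
# the driver; _jsc is PySpark-private, so use this only for debugging.
# Submit with: spark-submit --master yarn --num-executors 3 check.py
from pyspark import SparkContext

sc = SparkContext(appName="executor-count-check")
print("entries (executors + driver):", sc._jsc.sc().getExecutorMemoryStatus().size())
sc.stop()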

Running an application using spark-submit in Apache Spark gives a WARN message

I have configured an Apache Spark standalone cluster on two Ubuntu 14.04 VMs: one VM is the Master and the other is the Worker, and both are connected with passwordless SSH as described here.
After that, from the Master, I started the master as well as the worker with the following command from the Spark home directory -
sbin/start-all.sh
Then I ran the following command on the Master as well as the Worker VMs.
jps
On the Master VM, it shows -
6047 Jps
6048 Master
And on the Worker VM -
6046 Jps
6045 Worker
So the Master and Worker seem to be running properly, and the Web UI shows no errors.
But when I try to run an application using the following command -
spark-1.6.0/bin/spark-submit spark.py
It gives a WARN message in the console -
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Here is my test application-
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Point the application at the standalone master and give it a name.
conf = SparkConf().setMaster('spark://SparkMaster:7077').setAppName("My_App")
sc = SparkContext(conf=conf)
SQLCtx = SQLContext(sc)

# Read the CSV, split each line on commas, and collect everything to the driver.
list_of_list = sc.textFile("ver1_sample.csv").map(lambda line: line.split(",")).collect()
print("type_of_list_of_list===========", type(list_of_list), list_of_list)
I am new to Apache Spark. Please help.
The problem could be with resource (memory/cores) availability. By default, Spark takes its defaults from spark-defaults.conf.
Try using:
bin/spark-submit --executor-memory 1g spark.py
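Alternatively, the same limits can be set inside the application's SparkConf, so no extra spark-submit flags are needed; a sketch with illustrative values:

from pyspark import SparkConf, SparkContext

# Keep the per-executor footprint small enough for the worker to grant it.
conf = (SparkConf()
        .setMaster('spark://SparkMaster:7077')
        .setAppName("My_App")
        .set("spark.executor.memory", "1g")   # per-executor heap
        .set("spark.cores.max", "2"))         # cap on total cores for this app
sc = SparkContext(conf=conf)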

Presto - Query ... No worker nodes available

Using Amazon EMR, Hive .13, Hadoop 2.x, and Presto Server 0.89. Trying to set up Presto to query data that is usually queried through Hive. Hive metadata is stored in MySQL. Presto Server is installed and set up on all nodes. For the most part, everything is set up as documented on prestodb.io.
I first start the server on all nodes (coordinator and workers), and then start the CLI on the coordinator/name node. When I try to run a query using the below commands I get a "Query ... No worker nodes available" error:
presto-cli --server localhost:8080 --catalog jmx --schema default
presto:default> SELECT * FROM sys.node;
"Query ... No worker nodes available"
If I include the node-scheduler.include-coordinator=true in my coordinator config.properties file, 1 node is returned from this query.
Configs:
etc/config.properties (only on coordinator)
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=http://aws.internal.ip.of.coordinator:8080
etc/config.properties (only on workers)
coordinator=false
http-server.http.port=8080
task.max-memory=1GB
discovery.uri=http://aws.internal.ip.of.coordinator:8080
etc/catalog/hive.properties (all nodes)
connector.name=hive-hadoop2
hive.metastore.uri=thrift://aws.internal.ip.of.coordinator:9083
etc/catalog/jmx.properties (all nodes)
connector.name=jmx
etc/jvm.config (all nodes)
-server
-Xmx16G
-XX:+UseConcMarkSweepGC
-XX:+ExplicitGCInvokesConcurrent
-XX:+CMSClassUnloadingEnabled
-XX:+AggressiveOpts
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
-XX:ReservedCodeCacheSize=150M
etc/log.properties
com.facebook.presto=INFO
etc/node.properties
node.environment=production
node.id=unique-uuid #used uuidgen
node.data-dir=/mnt/presto-data
A simple mistake on my part was keeping this from running: I had a stray semicolon instead of a period in my aws.internal.ip.of.coordinator IP address. Looking at my configs, I just didn't see it.
With that fixed, the above configuration works on an Amazon EMR multi-node cluster similar to the one described.
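For diagnosing this class of problem (my addition, not from the thread): the coordinator can be asked directly which workers it has discovered via Presto's built-in /v1/node REST endpoint; with a mistyped discovery.uri, the list stays empty (or contains only the coordinator):

# Sketch: list the nodes the Presto coordinator has discovered. The host
# below is the same placeholder used in the configs above.
import json
from urllib.request import urlopen

COORDINATOR = "http://aws.internal.ip.of.coordinator:8080"

with urlopen(COORDINATOR + "/v1/node") as resp:
    nodes = json.load(resp)

print("discovered nodes:", len(nodes))
for node in nodes:
    print(node["uri"])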

Spark on yarn concept understanding

I am trying to understand how Spark runs on a YARN cluster/client. I have the following questions in mind.
Is it necessary that Spark is installed on all the nodes in the YARN cluster? I think it should be, because the worker nodes in the cluster execute tasks and need to be able to decode the code (Spark APIs) in the Spark application sent to the cluster by the driver.
The documentation says: "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster". Why does the client node have to install Hadoop when it is sending the job to the cluster?
Adding to other answers.
Is it necessary that Spark is installed on all the nodes in the YARN cluster?
No, not if the Spark job is scheduled on YARN (either client or cluster mode). A Spark installation on every node is needed only for standalone mode.
These are visualizations of the Spark app deployment modes (the diagrams are images in the original answer).
Spark Standalone Cluster
In cluster mode the driver sits in one of the Spark Worker nodes, whereas in client mode it sits in the machine that launched the job.
YARN cluster mode
YARN client mode
A table in the original answer (image) offers a concise list of the differences between these modes.
It says in the documentation: "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster". Why does the client node have to install Hadoop when it is sending the job to the cluster?
A Hadoop installation is not mandatory, but the configuration files (not all of them) are! We can call such machines gateway nodes. The configs matter for two main reasons (a concrete sketch follows this list):
The configuration contained in HADOOP_CONF_DIR directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.
In YARN mode, the ResourceManager's address is picked up from the Hadoop configuration (yarn-default.xml). Thus, the --master parameter is simply yarn.
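To make reason 2 concrete, a minimal sketch of what a gateway node runs (my addition; the config path is hypothetical, and setMaster("yarn") is the Spark 2.x spelling, while 1.x used yarn-client/yarn-cluster):

# Sketch: a gateway node needs only the Hadoop client configs, not a full
# Hadoop installation. HADOOP_CONF_DIR tells Spark where the ResourceManager
# is, so the master can simply be "yarn".
import os
from pyspark import SparkConf, SparkContext

os.environ.setdefault("HADOOP_CONF_DIR", "/etc/hadoop/conf")  # hypothetical path

conf = SparkConf().setMaster("yarn").setAppName("gateway-check")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(1000)).sum())
sc.stop()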
Update: (2017-01-04)
Spark 2.0+ no longer requires a fat assembly jar for production
deployment. source
We are running spark jobs on YARN (we use HDP 2.2).
We don't have spark installed on the cluster. We only added the Spark assembly jar to the HDFS.
For example, to run the Pi example:
./bin/spark-submit \
--verbose \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar \
--num-executors 2 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 4 \
hdfs://master:8020/spark/spark-examples-1.3.1-hadoop2.6.0.jar 100
--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar - this config tells YARN where to take the Spark assembly from. If you don't use it, the jar will be uploaded from the machine where you run spark-submit.
About your second question: the client node does not need Hadoop installed. It only needs the configuration files. You can copy the directory from your cluster to your client.
1 - Spark follows a slave/master architecture, so on your cluster you have to install one Spark master and N Spark slaves. You can run Spark in standalone mode, but using the YARN architecture will give you some benefits.
There is a very good explanation of it here : http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
2 - It is necessary if you want to use YARN or HDFS, for example, but as I said before, you can run it in standalone mode.
Let me try to cut through the glue and make it short for the impatient.
Six components: 1. client, 2. driver, 3. executors, 4. application master, 5. workers, and 6. resource manager; two deploy modes; and two kinds of resource (cluster) management.
Here's the relation:
Client
Nothing special; this is the one submitting the Spark app.
Worker, executors
Nothing special, one worker holds one or more executors.
Master, & resource (cluster) manager
(no matter client or cluster mode)
in YARN, the resource manager and the master sit on two different nodes;
in standalone, resource manager == master: the same process on the same node.
Driver
in client mode, it sits with the client
in yarn-cluster mode, it sits with the master (in this case, the client process exits after submitting the app)
in standalone-cluster mode, it sits with one worker
Voilà!
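To see which of these layouts a running application actually got (my addition, using standard SparkContext accessors):

# Sketch: inspect the master URL and deploy mode from inside a running app.
from pyspark import SparkContext

sc = SparkContext(appName="where-am-i")
print("master:", sc.master)  # e.g. yarn, spark://host:7077, local[*]
print("deploy mode:", sc.getConf().get("spark.submit.deployMode", "client"))
sc.stop()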
