Optimal settings for apache spark based on the hardware - hadoop

is there a mapping/translation for the number of hardware systems, cpu cores and their associated memory to the spark-submit tunables of:
executor-memory
executor-cores
num-executors
The application is certaionly bound to have something to do with these tunables, I am however looking for a "basic rule of thumb"
Apache spark is running on yarn with hdfs in cluster mode.
Not all the hardware systems in the spark/hadoop yarn cluster have the same number of cpu cores or RAM.

There is no thumb rule, but after considering
off heap memory
Number of applications and other hadoop dameons running
Resource manager needs
HDFS IO
etc.
You can derive a suitable configuration. Please check this url

Related

tracking memory usage or consumption for apache spark 2.0.2

I m beginner in apache spark and have installed a prebuilt distribution of apache spark with hadoop. I look to get the consumption or the usage of memory while running the example PageRank implemented within spark. I have my cluster standalone mode with 1 maser and 4 workers (Virtual machines)
I have tried external tools like ganglia and graphite but they give the memory usage at resource or system level (more general) but what i need exactly is "to track the behavior of the memory (Storage, execution) while running the algorithm does it means, memory usage for a spark application-ID ". Is there anyway to get it into text-file for further exploitation? Please help me on this, Thanks

Why is the Hadoop job slower in cloud (with multi-node clustering) than on normal pc?

I am using cloud Dataproc as a cloud service for my research. Running Hadoop and spark job on this platform(cloud) is a bit slower than that of running the same job on a lower capacity virtual machine. I am running my Hadoop job on 3-node cluster(each with 7.5gb RAM and 50GB disk) on the cloud which took 4min49sec, while the same job took 3min20sec on the single node virtual machine(my pc) having 3gb RAM and 27GB disk. Why is the result slower in the cloud with multi-node clustering than on normal pc?
First of all:
not easy to answer without knowing the complete configuration and the type of job your running.
possible reasons are:
missconfiguration
http://HOSTNAME:8080
open ressourcemanager webapp and compare available vcores and memory
job type
Job adds more overhead when running parallelized so that it is slower
hardware
Selected virtual Hardware is slower than the local one. Thourgh low disk io and network overhead
I would say it is something like 1. and 2.
For more detailed answer let me know:
size and type of the job and how you run it.
hadoop configuration
cloud architecture
br
to be a bit more detailed here the numbers/facts which are interesting to find out the reason for the "slower" cloud environment:
job type &size:
size of data 1mb or 1TB
xml , parquet ....
what kind of process (e.g wordcount, format change, ml,....)
and of course the options (executors and drivers ) for your spark-submit or spark-shell
Hadoop Configuration:
do you use a distribution (hortonworks or cloudera?)
spark standalone or in yarn mode
how are nodemangers configured

Calculating yarn.nodemanager.resource.cpu-vcores for a yarn cluster with multiple spark clients

If I have 3 spark applications all using the same yarn cluster, how should I set
yarn.nodemanager.resource.cpu-vcores
in each of the 3 yarn-site.xml?
(each spark application is required to have it's own yarn-site.xml on the classpath)
Does this value even matter in the client yarn-site.xml's ?
If it does:
Let's say the cluster has 16 cores.
Should the value in each yarn-site.xml be 5 (for a total of 15 to leave 1 core for system processes) ? Or should I set each one to 15 ?
(Note: Cloudera indicates one core should be left for system processes here: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ however, they do not go into details of using multiple clients against the same cluster)
Assume Spark is running with yarn as the master, and running in cluster mode.
Are you talking about the server-side configuration for each YARN Node Manager? If so, it would typically be configured to be a little less than the number of CPU cores (or virtual cores if you have hyperthreading) on each node in the cluster. So if you have 4 nodes with 4 cores each, you could dedicate for example 3 per node to the YARN node manager and your cluster would have a total of 12 virtual CPUs.
Then you request the desired resources when submitting the Spark job (see http://spark.apache.org/docs/latest/submitting-applications.html for example) to the cluster and YARN will attempt to fulfill that request. If it can't be fulfilled, your Spark job (or application) will be queued up or there will eventually be a timeout.
You can configure different resource pools in YARN to guarantee a specific amount of memory/CPU resources to such a pool, but that's a little bit more advanced.
If you submit your Spark application in cluster mode, you have to consider that the Spark driver will run on a cluster node and not your local machine (that one that submitted it). Therefore it will require at least 1 virtual CPU more.
Hope that clarifies things a little for you.

MapReduce to yarn Config conversion

how do I change my MapReduce v1 configuration (eg number of slots) to yarn configuration (eg container virtual memory and virtual cpu)?
There is no longer the concept of a "slot". Instead there are containers, and each container is configured to have a certain amount of RAM and virtual CPU. Go to http://tiny.cloudera.com/yarn-tuning for a detailed writeup on the conversion process.

how to check if hadoop is tuned for job,

I have a job (handles data of 4 GB) and I checked the CPU usage and memory usage they both are under 10%.
Your job most likely doesn't need more than that. You could try stress-testing your cluster with TeraSort (included in the examples jar), if your nodes are still operating at a very low amount of usage it may be a problem with your configuration.
Hadoop comes with Benchmark utility to verify Hadoop cluster settings.
Check Hadoop TestDFSIO.

Resources