Zeppelin Interpreter Memory - driver memory - apache-spark-mllib

I'm unsuccessfully trying to increase the driver memory for my Spark interpreter.
I just set spark.driver.memory in the interpreter settings, and everything looks great at first.
But in the Docker container that Zeppelin runs in (Zeppelin 0.6.2, Spark 2.0.1), the interpreter process is started as
2:06 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -cp /usr/zeppelin/int.....-2.7.2/share/hadoop/tools/lib/* -Xmx1g ..... --class org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer /usr/zeppelin/interpreter/spark/zeppelin-spark_2.11-0.6.2.jar 42651
i.e. with a -Xmx1g max heap setting that kind of breaks everything.
My main issue is that I am trying to run the Latent Dirichlet Allocation from MLlib, and it always runs out of memory and just dies on the driver.
The docker container has 26g RAM now so that should be enough.
Zeppelin itself should be fine with its 1 GB of RAM, but the Spark driver simply needs more.
My executor processes have RAM, but the driver is reported in the UI as:
Executor ID:    driver
Address:        172.17.0.6:40439
Status:         Active
RDD Blocks:     0
Storage Memory: 0.0 B / 404.7 MB
Disk Used:      0.0 B
Cores:          20
Active Tasks:   0
Failed Tasks:   0
Complete Tasks: 1
Total Tasks:    1
Task Time (GC Time): 1.4 s (0 ms)
Input:          0.0 B
Shuffle Read:   0.0 B
Shuffle Write:  0.0 B
which is pretty abysmal.
Setting ZEPPELIN_INTP_MEM='-Xms512m -Xmx12g' does not seem to change anything.
I thought zeppelin-env.sh was not loaded correctly, so I passed this variable directly in the docker create -e ZE..., but that did not change anything.
SPARK_HOME is set, and Zeppelin connects to a standalone Spark cluster. That part works; only the driver runs out of memory.
I also tried starting a local[*] process with 8g of driver memory and 6g of executor memory, but I got the same abysmal ~450 MB of driver memory.
The interpreter reports a Java heap out-of-memory error, which halts the LDAModel training.

Just came across this in a search while running into the exact same problem! Hopefully you've found a solution by now, but just in case anyone else runs across this issue and is looking for a solution like me, here's the issue:
The process you're looking at here isn't considered an interpreter process by Zeppelin; it's actually a Spark driver process. That means its options are set differently than with the ZEPPELIN_INTP_MEM variable. Add this to your zeppelin-env.sh:
export SPARK_SUBMIT_OPTIONS="--driver-memory 12G"
Restart Zeppelin and you should be all set! (Tested and working with the latest 0.7.3; I assume it works with earlier versions too.)
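To make the distinction concrete, here is a zeppelin-env.sh sketch (the values are illustrative assumptions, not recommendations):

```shell
# ZEPPELIN_INTP_MEM sizes the plain interpreter JVMs; the Spark driver is
# launched through spark-submit, so its heap comes from SPARK_SUBMIT_OPTIONS.
export ZEPPELIN_INTP_MEM='-Xms512m -Xmx1g'          # non-Spark interpreters
export SPARK_SUBMIT_OPTIONS='--driver-memory 12G'   # Spark driver heap
echo "$SPARK_SUBMIT_OPTIONS"
```

After restarting Zeppelin, the driver process should show the larger heap instead of the default -Xmx1g.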

https://issues.apache.org/jira/browse/ZEPPELIN-1263 fixed this issue. Since that change you can use any standard Spark configuration; e.g. you can specify driver memory by setting spark.driver.memory in the Spark interpreter settings.

Related

Why is the Hadoop job slower in cloud (with multi-node clustering) than on normal pc?

I am using cloud Dataproc as a cloud service for my research. Running a Hadoop or Spark job on this platform (cloud) is a bit slower than running the same job on a lower-capacity virtual machine. My Hadoop job on a 3-node cluster in the cloud (each node with 7.5 GB RAM and a 50 GB disk) took 4 min 49 sec, while the same job took 3 min 20 sec on a single-node virtual machine (my PC) with 3 GB RAM and a 27 GB disk. Why is the result slower in the cloud with multi-node clustering than on a normal PC?
First of all: this is not easy to answer without knowing the complete configuration and the type of job you are running. Possible reasons are:
1. Misconfiguration
Open the ResourceManager web app (http://HOSTNAME:8080) and compare the available vcores and memory.
2. Job type
The job adds more overhead when running parallelized, so it is slower.
3. Hardware
The selected virtual hardware is slower than the local one, e.g. through low disk I/O and network overhead.
I would say it is something like 1. and 2.
For a more detailed answer, let me know:
the size and type of the job and how you run it
the Hadoop configuration
the cloud architecture
br
To be a bit more detailed, here are the numbers/facts that are interesting for finding the reason for the "slower" cloud environment:
Job type & size:
size of the data: 1 MB or 1 TB?
format: XML, Parquet, ...
what kind of process (e.g. wordcount, format change, ML, ...)
and of course the options (executors and driver) for your spark-submit or spark-shell
Hadoop configuration:
do you use a distribution (Hortonworks or Cloudera)?
Spark standalone or in YARN mode?
how are the NodeManagers configured?

Pig script runs fine on Sandbox but fails on a real cluster

Environments:
Hortonworks Sandbox running HDP 2.5
Hortonworks HDP 2.5 Hadoop cluster managed by Ambari
We are facing a tricky situation. We run the Pig script from the Hadoop tutorial. The script works with tiny data and runs fine on the Sandbox, but it fails on the real cluster, where it complains about insufficient memory for the container. A
container is running beyond physical memory limit
message can be seen in the logs.
The tricky part is that the Sandbox has far less memory available than the real cluster (about 3 times less). Also, most memory settings in the Sandbox (MapReduce memory, YARN memory, YARN container sizes) allow much less memory than the corresponding settings in the real cluster. Still, it is sufficient for Pig on the Sandbox but not in the real cluster.
Another note: Hive queries doing a similar job also work fine (in both environments); they do not complain about memory.
Apparently there is some setting somewhere (within Environment 2) that makes Pig request too much memory. Can anybody please recommend which parameter should be modified to stop the Pig script from requesting too much memory?

Incorrect memory allocation for Yarn/Spark after automatic setup of Dataproc Cluster

I'm trying to run Spark jobs on a Dataproc cluster, but Spark will not start due to Yarn being misconfigured.
I receive the following error when running spark-shell from the shell (locally on the master), as well as when submitting a job through the web GUI or the gcloud command-line utility from my local machine:
15/11/08 21:27:16 ERROR org.apache.spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Required executor memory (38281+2679 MB) is above the max threshold (20480 MB) of this cluster! Please increase the value of 'yarn.scheduler.maximum-allocation-mb'.
I tried modifying the value in /etc/hadoop/conf/yarn-site.xml, but it didn't change anything. I don't think it pulls the configuration from that file.
I've tried multiple cluster combinations at multiple sites (mainly Europe), and I only got this to work with the low-memory version (4 cores, 15 GB memory).
I.e. this is only a problem on nodes configured for more memory than the YARN default allows.
Sorry about these issues you're running into! It looks like this is part of a known issue where certain memory settings end up computed based on the master machine's size rather than the worker machines' size, and we're hoping to fix this in an upcoming release soon.
There are two current workarounds:
1. Use a master machine type with memory equal to or smaller than the worker machine types.
2. Explicitly set spark.executor.memory and spark.executor.cores, either using the --conf flag if running from an SSH connection, like:
spark-shell --conf spark.executor.memory=4g --conf spark.executor.cores=2
or if running gcloud beta dataproc, use --properties:
gcloud beta dataproc jobs submit spark --properties spark.executor.memory=4g,spark.executor.cores=2
You can adjust the number of cores/memory per executor as necessary; it's fine to err on the side of smaller executors and let YARN pack lots of executors onto each worker. However, you can save some per-executor overhead by setting spark.executor.memory to the full size available in each YARN container and spark.executor.cores to all the cores of each worker.
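As a rough sketch of that packing trade-off (the worker and executor sizes below are assumptions, not Dataproc defaults):

```shell
# How many executors YARN can pack onto one worker, given the worker's
# usable memory and a per-executor request plus its ~10% overhead.
NODE_MB=12288                    # assumed usable YARN memory per worker
EXEC_MB=4096                     # spark.executor.memory in MB
OVERHEAD_MB=$(( EXEC_MB / 10 ))  # spark.yarn.executor.memoryOverhead (min 384)
echo $(( NODE_MB / (EXEC_MB + OVERHEAD_MB) ))   # executors per worker: 2
```

Smaller executors raise this count but pay the overhead more often; one big executor per container pays it once.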
EDIT: As of January 27th, new Dataproc clusters will now be configured correctly for any combination of master/worker machine types, as mentioned in the release notes.

Mismatch in no of Executors(Spark in YARN Pseudo distributed mode)

I am running Spark using YARN (Hadoop 2.6) as the cluster manager. YARN is running in pseudo-distributed mode. I started the Spark shell with 6 executors and was expecting the same:
spark-shell --master yarn --num-executors 6
But in the Spark web UI, I see only 4 executors.
Any reason for this?
PS: I ran the nproc command on my Ubuntu (14.04) machine, and the result is given below. I believe this means my system has 8 cores.
mountain#mountain:~$ nproc
8
Did you take into account spark.yarn.executor.memoryOverhead?
Possibly it creates a hidden memory requirement, and in the end YARN could not provide all the resources.
Also, note that YARN rounds container sizes up to yarn.scheduler.increment-allocation-mb.
All the details are here:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
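A small sketch of that rounding arithmetic (the increment value is an assumption; check yarn-site.xml for the real one):

```shell
# Container size = requested executor memory + overhead (at least 384 MB),
# rounded up to the YARN allocation increment.
EXEC_MB=4096
OVERHEAD_MB=$(( EXEC_MB / 10 ))
if [ "$OVERHEAD_MB" -lt 384 ]; then OVERHEAD_MB=384; fi
INCREMENT_MB=512                 # yarn.scheduler.increment-allocation-mb (assumed)
RAW_MB=$(( EXEC_MB + OVERHEAD_MB ))
echo $(( (RAW_MB + INCREMENT_MB - 1) / INCREMENT_MB * INCREMENT_MB ))   # 4608
```

So a "4g" executor can actually cost the cluster 4608 MB, which is how the requested executors can fail to fit.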
This happens when there are not enough resources on your cluster to start more executors. The following things are taken into account:
A Spark executor runs inside a YARN container, whose size is determined by the value of yarn.scheduler.minimum-allocation-mb in yarn-site.xml. Check this property. If your existing containers consume all the available memory, then no memory will be available for new containers, and no new executors will be started.
The storage memory column in the UI displays the amount of memory used for execution and RDD storage. By default, this equals (HEAP_SPACE - 300 MB) * 75%. The rest of the memory is used for internal metadata, user data structures, and other things. Ref: (Spark on YARN: Less executor memory than set via spark-submit)
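Plugging a 1 GB heap into that formula makes it concrete (75% is the figure quoted above; the exact fraction depends on the Spark version and spark.memory.fraction):

```shell
# Storage memory shown in the UI ≈ (heap - 300 MB reserved) * 75%
HEAP_MB=1024
RESERVED_MB=300
echo $(( (HEAP_MB - RESERVED_MB) * 75 / 100 ))   # 543
```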
I hope this helps.

Hadoop 2.x on amazon ec2 t2.micro

I'm trying to install and configure Hadoop 2.6 on an Amazon EC2 t2.micro instance (the free one, with only 1 GB RAM) in pseudo-distributed mode.
I could configure and start all the daemons (i.e. NameNode, DataNode, ResourceManager, NodeManager). But when I tried to run a MapReduce wordcount example, it failed.
I don't know if it is failing due to low memory (since the t2.micro has only 1 GB of memory and some of it is taken up by the host OS, Ubuntu in my case), or whether there is some other reason.
I'm using the default memory settings. If I tweak everything down to the minimum memory settings, will that solve the problem? What is the minimum memory in MB that can be assigned to containers?
Thanks a lot, guys. I'd appreciate it if you could provide me with some information.
Without tweaking any memory settings, I could only sometimes run a pi example with 1 mapper and 1 reducer on the free-tier t2.micro instance; it fails most of the time.
Using the memory-optimized r3.large instance with 15 GB RAM, everything works perfectly; all jobs complete without failure.
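For anyone who still wants to try squeezing into 1 GB, the relevant knobs live in yarn-site.xml; a low-memory sketch (all values are untested assumptions for a t2.micro, not recommendations):

```xml
<!-- yarn-site.xml: shrink containers so a single small task fits in ~1 GB -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>768</value>   <!-- memory YARN may hand out on this node -->
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>128</value>   <!-- smallest container YARN will allocate -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>768</value>
</property>
```

Even so, JVM overhead per daemon may leave too little headroom on 1 GB, which matches the experience above.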
