How to work with a group of people using Zeppelin? - hadoop

I am trying to work with Zeppelin on my Hadoop cluster:
1 edge node
1 name node
1 secondary node
16 data nodes.
Node specification:
CPU: Intel(R) Xeon(R) CPU E5345 @ 2.33GHz, 8 cores
Memory: 32 GB DDR2
I have some issues with this tool when more than 20 people want to use it at the same time.
This happens mainly when I am using pyspark, with either Spark 1.6 or 2.0.
Even if I set zeppelin.execution.memory = 512 mb and spark.executor.memory = 512 mb, it is still the same. I have tried a few interpreter options for pyspark, such as Per User in scoped/isolated mode and others, and it is still the same. It is a little better with the globally shared option, but after a while I still cannot do anything there. I was watching the Edge Node and saw that its memory usage grows very fast, even though I want to use the Edge Node only as an access point.

If your deploy mode is yarn-client, then your driver will always run on the access point server (the Edge Node in your case).
Every notebook (per-note mode) or every user (per-user mode) instantiates a Spark context, allocating memory on the driver and on the executors. Reducing spark.executor.memory will relieve the cluster but not the driver. Try reducing spark.driver.memory instead.
The Spark interpreter can be instantiated globally, per note or per user. I don't think sharing the same interpreter (globally) is a solution in your case, since you can only run one job at a time: users would end up waiting for everyone else's paragraphs to finish before being able to run their own.
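As a rough, illustrative sketch only (the exact values depend on how much headroom the Edge Node actually has), the relevant properties in the Zeppelin Spark interpreter settings (or in spark-defaults.conf) would look something like:
master                  yarn-client
spark.driver.memory     1g
spark.executor.memory   512m
Keep in mind that with per-user scoped or isolated instantiation, each concurrent user gets their own driver JVM on the Edge Node, so with ~20 users the total driver footprint is roughly 20 x (spark.driver.memory + JVM overhead).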

Related

Nifi memory continues to expand

I run a three-node NiFi cluster on version 1.16.3. Each node has 8 cores, 32 GB of memory and a 2 TB high-speed SSD. The OS is CentOS 7.9 on ARM64 hardware.
The initial JVM configuration of NiFi is -Xms12g and -Xmx12g (bootstrap.conf).
It is a native installation (Docker is not used), with only NiFi installed on all those machines, using the embedded ZooKeeper.
I run 20 workflows every day from 00:00 to 03:00, with a total data size of 1.2 GB, collecting CSV documents into a Greenplum database.
My problem is that NiFi's memory usage increases by about 0.2 GB every day, on all three nodes. The memory slowly fills up and then the machine dies; with the heap set to 12 GB this takes about a month.
That is to say, I need to restart the cluster every month. I use only native processors in my workflows.
I can't locate the problem. Who can help me?
If any description is missing, please feel free to let me know. Thanks.
I have made the following attempts:
I set the initial heap to 18 GB and to 6 GB; the speed of workflow processing did not change. The only difference is that after setting it to 18 GB, the freezes are shorter.
I used OpenJRE 1.8 and tried upgrading to 11, but it did not help.
I added the following configuration, which was also useless:
java.arg.7=-XX:ReservedCodeCacheSize=256m
java.arg.8=-XX:CodeCacheMinimumFreeSpace=10m
java.arg.9=-XX:+UseCodeCacheFlushing
The daily scheduled tasks consume little in the way of resources: even with the heap set to 6 GB and 20 tasks running at the same time, memory consumption is about 30%, and they finish within half an hour.
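For reference, the heap settings mentioned above live in NiFi's conf/bootstrap.conf; a minimal sketch (the java.arg numbers may differ in your file):
java.arg.2=-Xms12g
java.arg.3=-Xmx12g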

Using hadoop cluster with different machine configuration

I have two Linux machines, both with different configurations:
Machine 1: 16 GB RAM, 4 Virtual Cores and 40 GB HDD (Master and Slave Machine)
Machine 2: 8 GB RAM, 2 Virtual Cores and 40 GB HDD (Slave machine)
I have set up a hadoop cluster between these two machines.
I am using Machine 1 as both master and slave.
And Machine 2 as slave.
I want to run my Spark application and utilise as many virtual cores and as much memory as possible, but I am unable to figure out the right settings.
My spark code looks something like:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext, SQLContext, SparkSession

conf = SparkConf().setAppName("Simple Application")
sc = SparkContext('spark://master:7077', conf=conf)  # standalone master
hc = HiveContext(sc)
sqlContext = SQLContext(sc)
# Note: this names a different master (YARN) than the SparkContext above
spark = SparkSession.builder.appName("SimpleApplication").master("yarn-cluster").getOrCreate()
So far, I have tried the following:
When I process my 2 GB file only on Machine 1 (in local mode, as a single-node cluster), it uses all 4 CPUs of the machine and completes in about 8 minutes.
When I process my 2 GB file with the cluster configuration above, it takes slightly longer than 8 minutes, though I expected it to take less time.
What number of executors, cores, memory do I need to set to maximize the usage of cluster?
I have referred to the articles below, but because my machines have different configurations, I am not sure which parameters would fit best.
Apache Spark: The number of cores vs. the number of executors
Any help will be greatly appreciated.
When I process my 2 GB file with the cluster configuration above, it takes slightly longer than 8 minutes, though I expected it to take less time.
It's not clear where your file is stored.
I see you're using Spark Standalone mode, so I'll assume it's not split on HDFS into about 16 blocks (given a block size of 128 MB).
In that scenario, your entire file will be processed at least once in whole, plus the overhead of shuffling that data across the network.
If you used YARN as the Spark master with HDFS as the file system, and a splittable file format, then the computation would go "to the data", and you could expect quicker run times.
As far as optimal settings go, there are tradeoffs between cores, memory and the number of executors, but there is no magic number for a particular workload, and you will always be limited by the smallest node in the cluster. Keep in mind that the memory of the Spark driver and of other OS processes should be accounted for when calculating sizes.
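As a rough starting point only (assuming the job is submitted through YARN rather than the standalone master, and that roughly one core and 1-2 GB per node are left for the OS and the Hadoop daemons), something like the following could be used and then tuned; all sizes here are illustrative, not a recommendation:

from pyspark.sql import SparkSession

# Illustrative sizing only: small executors so that at least one also fits on
# the 2-core / 8 GB node, leaving headroom per machine for the OS, the HDFS
# DataNode and the YARN NodeManager.
spark = (SparkSession.builder
         .appName("SimpleApplication")
         .master("yarn")
         .config("spark.executor.instances", "3")
         .config("spark.executor.cores", "1")
         .config("spark.executor.memory", "2g")
         .config("spark.driver.memory", "1g")
         .getOrCreate())

Whether 1-core or larger executors work better depends on the workload; the point is that the executor size has to fit the smaller machine, as noted above.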

Memory Management in H2O

I am curious to know how memory is managed in H2O.
Is it completely 'in-memory' or does it allow swapping in case the memory consumption goes beyond available physical memory? Can I set -mapperXmx parameter to 350GB if I have a total of 384GB of RAM on a node? I do realise that the cluster won't be able to handle anything other than the H2O cluster in this case.
Any pointers are much appreciated, Thanks.
H2O-3 stores data completely in memory, in a distributed, column-compressed key-value store.
No swapping to disk is supported.
Since you are alluding to mapperXmx, I assume you are talking about running H2O in a YARN environment. In that case, the total YARN container size allocated per node is:
mapreduce.map.memory.mb = mapperXmx * (1 + extramempercent/100)
extramempercent is another (rarely used) command-line parameter to h2odriver.jar. Note the default extramempercent is 10 (percent).
mapperXmx is the size of the Java heap, and the extra memory referred to above is for additional overhead of the JVM implementation itself (e.g. the C/C++ heap).
YARN is extremely picky about this, and if your container tries to use even one byte over its allocation (mapreduce.map.memory.mb), YARN will immediately terminate the container. (And for H2O-3, since it's an in-memory processing engine, the loss of one container terminates the entire job.)
You can set mapperXmx and extramempercent to as large a value as YARN has room to start containers with.
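For example, with the numbers from the question and the default extramempercent of 10, -mapperXmx 350g would request a YARN container of roughly:

mapreduce.map.memory.mb ≈ 350 GB * (1 + 10/100) = 385 GB

which is more than the node's 384 GB of physical RAM, so YARN could not schedule it (and the OS and Hadoop daemons also need some memory). Something closer to -mapperXmx 340g (340 * 1.10 = 374 GB) would be needed; the exact ceiling is whatever yarn.nodemanager.resource.memory-mb is set to on the node.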

Spark streaming application configuration with YARN

I'm trying to squeeze every single bit out of my cluster when configuring the Spark application, but it seems I'm not understanding everything completely right. I'm running the application on an AWS EMR cluster with 1 master and 2 core nodes of type m3.xlarge (15 GB RAM and 4 vCPUs per node). This means that by default 11.25 GB are reserved on every node for applications scheduled by YARN. The master node is used only by the resource manager (YARN), which means the remaining 2 core nodes will be used to schedule applications (so we have 22.5 GB for that purpose). So far so good. But here comes the part I don't get. I'm starting the Spark application with the following parameters:
--driver-memory 4G --num-executors 4 --executor-cores 7 --executor-memory 4G
My understanding (from what I have found) is that 4 GB will be allocated for the driver and 4 executors will be launched with 4 GB each. A rough estimate makes that 5*4 = 20 GB (let's call it 21 GB with the expected memory reserves), which should be fine as we have 22.5 GB for applications. Here's a screenshot from the Hadoop YARN UI after the launch:
What we can see is that 17.63 GB are used by the application, which is a little less than the expected ~21 GB, and that triggers the first question: what happened here?
Then I go to the spark UI's executors page. Here comes the bigger question:
There are 3 executors (not 4), and the memory allocated for each of them and for the driver is 2.1 GB (not the specified 4 GB). So Hadoop YARN says 17.63 GB are used, but Spark says 8.4 GB are allocated. What is happening here? Is this related to the Capacity Scheduler? (From the documentation I could not come to that conclusion.)
Can you check whether spark.dynamicAllocation.enabled is turned on? If it is, your Spark application may give resources back to the cluster when they are no longer used. The minimum number of executors to launch at startup is decided by spark.executor.instances.
If that is not the case: what is the source for your Spark application, and what partition size is set for it? Spark will literally map the partitions to Spark cores; if your source has only 10 partitions and you try to allocate 15 cores, it will only use 10, because that is all it needs. I guess this might be why Spark launched 3 executors instead of 4. Regarding memory, I would recommend revisiting your settings: you are asking for 4 executors and 1 driver with 4 GB each, which is roughly 5*4 GB + 5*384 MB ≈ 22 GB. You are trying to use up everything, and not much is left for the OS and the NodeManager to run, which is not ideal.
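As a hedged sketch of the container math alluded to above (Spark 1.6/2.x defaults): each YARN container is the requested memory plus spark.yarn.executor.memoryOverhead (or spark.yarn.driver.memoryOverhead for the driver), which defaults to max(384 MB, 10% of the requested memory):

per 4 GB container ≈ 4 GB + max(384 MB, 0.10 * 4096 MB) ≈ 4.4 GB
4 containers ≈ 17.6 GB, 5 containers ≈ 22 GB

so the 17.63 GB reported by YARN is in the same ballpark as four such containers being up, while all five would be about 22 GB. To check dynamic allocation from a running session you can use, for example, spark.conf.get("spark.dynamicAllocation.enabled", "false") in Spark 2.x, or look at the Environment tab in the Spark UI.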

Spark not detecting all CPU cores

I have a cluster running Spark with 4 servers, each having 8 cores. Somehow the master is not detecting all available cores; it is using 18 out of 32.
I have not set anything relating to the number of cores in any Spark conf file (at least not that I am aware of).
I am positive each cluster member has the same number of cores (8).
Is there a way to make Spark detect/use the other cores as well?
I found the cause, but it is still somewhat unclear:
One node that was only contributing 1 out of 8 cores was having this setting turned on in $SPARK_HOME/conf/spark-env.sh:
SPARK_WORKER_CORES=1
Commenting it out did the trick for that node; Spark will grab all cores by default (the same goes for memory).
But on the other node that was only contributing 1 core, this setting was not activated, yet Spark still did not grab all 8 cores until I specifically told it to:
SPARK_WORKER_CORES=8
But at least it is grabbing all resources now.
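For reference, a minimal sketch of the relevant worker settings in $SPARK_HOME/conf/spark-env.sh (the values are only an example; when unset, a worker offers all of its cores and all of its memory minus roughly 1 GB):

SPARK_WORKER_CORES=8      # cores this worker offers to Spark applications
SPARK_WORKER_MEMORY=28g   # memory this worker offers (leave some for the OS)

These have to be set (or left unset) consistently on every worker, which is why the two nodes above behaved differently.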
