Hello, I just finished creating my first Spark application, and now I have access to a cluster (12 nodes, where each node has 2 Intel(R) Xeon(R) E5-2650 2.00GHz processors, and each processor has 8 cores). I want to know what criteria can help me tune my application and observe its performance.
I have already visited the official Spark website. It talks about data serialization, but I couldn't understand what that is exactly or how to specify it. It also talks about "memory management" and "level of parallelism", but I didn't understand how to control these.
One more thing: I know that the size of the data has an effect, but all the .csv files I have are small. How can I get large files (10 GB, 20 GB, 30 GB, 50 GB, 100 GB, 300 GB, 500 GB)?
Please try to explain this well for me, because cluster computing is new to me.
To tune your application you need to know a few things:

1) You need to monitor your application: is your cluster under-utilized, and how many resources does the application you created actually use?

Monitoring can be done with various tools, e.g. Ganglia. From Ganglia you can find CPU, memory and network usage.

2) Based on your observations about CPU and memory usage, you can get a better idea of what kind of tuning your application needs.
From Spark's point of view: in spark-defaults.conf you can specify what kind of serialization is needed, how much driver memory and executor memory your application needs, and you can even change the garbage-collection algorithm.

Below are a few examples; you can tune these parameters based on your requirements:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.memory 3g
spark.executor.extraJavaOptions -XX:MaxPermSize=2G -XX:+UseG1GC
spark.driver.extraJavaOptions -XX:MaxPermSize=6G -XX:+UseG1GC
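The same settings can also be passed per job at submit time rather than cluster-wide; a sketch of the equivalent spark-submit invocation (the class name and jar path are placeholders, not real artifacts):

```shell
# com.example.MyApp and myapp.jar are placeholders for your own application
spark-submit \
  --class com.example.MyApp \
  --driver-memory 5g \
  --executor-memory 3g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  myapp.jar
```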
For more details refer to http://spark.apache.org/docs/latest/tuning.html
Hope this helps!
Related
I use a three-node NiFi cluster; the NiFi version is 1.16.3. The hardware is 8 cores and 32 GB of memory per node, with a 2 TB solid-state drive. The OS is CentOS 7.9 on ARM64 hardware.
The initial JVM configuration of NiFi is -Xms12g and -Xmx12g (bootstrap.conf).
It is a native installation; Docker is not used, and only NiFi is installed on all those machines, using the embedded ZooKeeper.
We run 20 workflows every day from 00:00 to 03:00, and the total data size is 1.2 GB: we collect CSV documents into a Greenplum database.
My problem is that NiFi's memory usage increases every day, by about 0.2 GB per day, and all three nodes behave like this. The memory slowly fills up and then the machine dies. This takes about a month (with the heap set to 12 GB).
That is to say, I need to restart the cluster every month. I use only native processors and workflows.
I can't locate the problem. Can anyone help me?
I may have missed some details; please feel free to let me know, thanks.
I have made the following attempts:
I set the initial memory to 18 GB or 6 GB; the speed of workflow processing did not change. The difference is that, after setting it to 18 GB, it freezes for a shorter time.
I used OpenJDK 1.8, and I tried upgrading to 11, but it was useless.
I added the following configuration, which was also useless:
java.arg.7=-XX:ReservedCodeCacheSize=256m
java.arg.8=-XX:CodeCacheMinimumFreeSpace=10m
java.arg.9=-XX:+UseCodeCacheFlushing
Every day's scheduled tasks consume few resources. Even with the heap set to 6 GB and 20 tasks running at the same time, memory consumption is about 30%, and the tasks finish in half an hour.
Current Setup
We have a 10-node discovery cluster.
Each node of this cluster has 24 cores and 264 GB of RAM. Keeping some memory and CPU aside for background processes, we plan to use 240 GB of memory.
Now, when it comes to container setup, since each container may need 1 core, at most we can have 24 containers, each with 10 GB of memory.
Usually clusters have containers with 1-2 GB of memory, but we are restricted by the cores available to us, or maybe I am missing something.
Problem statement
As our cluster is used extensively by data scientists and analysts, having just 24 containers does not suffice. This leads to heavy resource contention.
Is there any way we can increase the number of containers?
Options we are considering
If we ask the team to run many Tez queries in a single file (not separately), then at most we will use one container per batch.
Requests
Is there any other way to manage our discovery cluster?
Is there any possibility of reducing the container size?
Can a vcore (as it's a logical concept) be shared by multiple containers?
Vcores are just a logical unit and not in any way related to a CPU core, unless you are using YARN with cgroups and have yarn.nodemanager.resource.percentage-physical-cpu-limit enabled. Most tasks are rarely CPU-bound, but more typically network-I/O-bound. So if you look at your cluster's overall CPU and memory utilization, you should be able to resize your containers based on the wasted (spare) capacity.
You can measure utilization with a host of tools; sar, Ganglia and Grafana are the obvious ones, but you can also look at Brendan Gregg's Linux performance tools for more ideas.
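If measurement shows CPU is the spare resource, one common approach is to declare more vcores per NodeManager than physical cores, so YARN can pack more containers onto each node. A sketch of the relevant yarn-site.xml properties (the values are illustrative for the 24-core/240 GB nodes described above, not a recommendation for your exact workload):

```xml
<!-- yarn-site.xml: illustrative values only -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>245760</value> <!-- 240 GB -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>48</value> <!-- 2x the 24 physical cores -->
</property>
```

Because vcores are purely logical (absent cgroup enforcement), oversubscribing them simply lets the scheduler hand out more, smaller containers.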
I will have 200 million files in my HDFS cluster. We know each file will occupy about 150 bytes in NameNode memory, plus 3 blocks, so there are roughly 600 bytes per file in the NameNode.
So I plan to give my NameNode 250 GB of memory to handle 200 million files. My question is: will such a big heap of 250 GB cause too much GC pressure? Is a 250 GB heap for the NameNode feasible?
Can someone please say something? Why has nobody answered?
The ideal NameNode memory size is about the total space used by the metadata + OS + size of the daemons, plus 20-30% headroom for processing-related data.
You should also consider the rate at which data comes into your cluster. If you have data coming in at 1 TB/day, then you must plan for more memory, or you will soon run out.
It's always advised to have at least 20% of memory free at any point in time. This helps avoid the NameNode going into a full garbage collection.
As Marco specified earlier, you may refer to NameNode Garbage Collection Configuration: Best Practices and Rationale for GC configuration.
In your case 256 GB looks good if you aren't going to ingest a lot of data or do lots of operations on the existing data.
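As a rough sanity check using the question's own figures (150 bytes per object, 3 blocks per file), the metadata itself would occupy on the order of:

```python
files = 200_000_000
blocks_per_file = 3
bytes_per_object = 150  # rough heap cost per file/block object in the NameNode

file_bytes = files * bytes_per_object
block_bytes = files * blocks_per_file * bytes_per_object
total_bytes = file_bytes + block_bytes

print(total_bytes / 10**9)  # 120.0 -> about 120 GB of metadata
```

So a 250 GB heap would leave roughly 130 GB of headroom on top of the metadata itself, comfortably more than the 20-30% suggested above.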
Refer: How to Plan Capacity for Hadoop Cluster?
Also refer: Select the Right Hardware for Your New Hadoop Cluster
You can have 256 GB of physical memory in your NameNode. If your data grows to huge volumes, consider HDFS federation. I assume you already have multiple cores (with or without hyper-threading) in the NameNode host. The link below should address your GC concerns:
https://community.hortonworks.com/articles/14170/namenode-garbage-collection-configuration-best-pra.html
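The advice in those articles typically translates into JVM options set via hadoop-env.sh. A sketch for a large NameNode heap (the values are illustrative; the exact GC flags depend on your Hadoop and JDK versions):

```shell
# hadoop-env.sh - illustrative GC settings for a large NameNode heap
export HADOOP_NAMENODE_OPTS="-Xms250g -Xmx250g \
  -XX:+UseG1GC -XX:MaxGCPauseMillis=400 \
  -Xloggc:/var/log/hadoop/namenode-gc.log \
  ${HADOOP_NAMENODE_OPTS}"
```

Setting -Xms equal to -Xmx avoids heap resizing, and GC logging lets you verify pause times before committing to a heap this large.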
The following is my configuration:
**mapred-site.xml**
mapreduce.map.memory.mb: 4096, mapreduce.map.java.opts: -Xmx3072m
mapreduce.reduce.memory.mb: 8192, mapreduce.reduce.java.opts: -Xmx6144m
**yarn-site.xml**
yarn.nodemanager.resource.memory-mb: 40 GB
yarn.scheduler.minimum-allocation-mb: 1 GB
The vCores value displayed in the Hadoop cluster UI is 8, but I don't know how it is computed or where to configure it.
I hope someone can help me.
Short Answer
It most probably doesn't matter if you are just running Hadoop out of the box on your single-node cluster, or even on a small personal distributed cluster. You only need to worry about memory.
Long Answer
vCores are used on larger clusters in order to limit CPU for different users or applications. If you are using YARN for yourself, there is no real reason to limit your containers' CPU. That is why vCores are not even taken into consideration by default in Hadoop!
Try setting your available NodeManager vcores to 1. It doesn't matter! Your number of containers will still be 2 or 4... or whatever the value of:
yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb
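Plugging in the values from the question (40 GB per NodeManager, 4 GB map tasks, 8 GB reduce tasks), that formula gives:

```python
node_mem_mb = 40 * 1024  # yarn.nodemanager.resource.memory-mb
map_mb = 4096            # mapreduce.map.memory.mb
reduce_mb = 8192         # mapreduce.reduce.memory.mb

max_map_containers = node_mem_mb // map_mb        # 10
max_reduce_containers = node_mem_mb // reduce_mb  # 5
print(max_map_containers, max_reduce_containers)  # prints: 10 5
```

Notice vcores appear nowhere in this calculation; with the default resource calculator, memory alone bounds the container count.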
If you really do want the number of containers to take vCores into consideration and be limited by:
yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu.vcores
then you need to use a different resource calculator. Go to your capacity-scheduler.xml config and change DefaultResourceCalculator to DominantResourceCalculator.
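That change looks like this in capacity-scheduler.xml (this is the standard property name; verify it against your Hadoop version):

```xml
<!-- capacity-scheduler.xml -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```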
In addition to using vCores for container allocation, do you want vCores to actually limit the CPU usage of each node? Then you need to change even more configuration, to use the LinuxContainerExecutor instead of the DefaultContainerExecutor, because it can manage Linux cgroups, which are what limit CPU resources. Follow this page if you want more info on this.
yarn.nodemanager.resource.cpu-vcores - Number of CPU cores that can be allocated for containers.
mapreduce.map.cpu.vcores - The number of virtual CPU cores allocated for each map task of a job
mapreduce.reduce.cpu.vcores - The number of virtual CPU cores for each reduce task of a job
I accidentally came across this question and I eventually managed to find the answers that I needed, so I will try to provide a complete answer.
**Entities and their relations**

For each Hadoop application/job, you have an Application Master that communicates with the ResourceManager about available resources on the cluster. The ResourceManager receives information about the available resources on each node from each NodeManager. The resources are called Containers (memory and CPU). For more information see this.
**Resource declaration on the cluster**

Each NodeManager provides information about its available resources. The relevant settings are yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores in $HADOOP_CONF_DIR/yarn-site.xml. They declare the memory and CPUs that can be allocated to Containers.
**Ask for resources**

For your jobs you can configure what resources each map/reduce task needs. This can be done as follows (this is for the map tasks):
conf.set("mapreduce.map.cpu.vcores", "4");
conf.set("mapreduce.map.memory.mb", "2048");
This will ask for 4 virtual cores and 2048MB of memory for each map task.
You can also configure the resources that are necessary for the Application Master the same way with the properties yarn.app.mapreduce.am.resource.mb and yarn.app.mapreduce.am.resource.cpu-vcores.
Those properties can have default values in $HADOOP_CONF_DIR/mapred-default.xml.
For more options and default values, I recommend taking a look at this and this.
Hi, I am trying to set up a Hadoop environment. In short, the problem I am trying to solve involves billions of XML files of a few MB each: extract the relevant information from them using Hive, and do some analytic work with that information. I know this is a trivial problem in the Hadoop world, but if the Hadoop solution works well for me, the size and number of files I deal with will increase in geometric progression.
I did research in various books like "Hadoop: The Definitive Guide" and "Hadoop in Action", and in resources like documents by Yahoo and Hortonworks, but I am not able to figure out the hardware/software specifications for establishing the Hadoop environment. In the resources I have referred to so far I found more or less standard solutions like:
NameNode/JobTracker: 2 x 1 Gb/s Ethernet, 16 GB of RAM, 4 CPUs, 100 GB disk
DataNode: 2 x 1 Gb/s Ethernet, 8 GB of RAM, 4 CPUs, multiple disks totalling 500+ GB
But if anyone can give some suggestions, that would be great. Thanks.
First I would suggest you consider which you need more of, processing or storage, and select hardware from that point of view. Your case sounds more processing-heavy than storage-heavy.
I would specify standard Hadoop hardware a bit differently:
NameNode: high-quality disks in a mirror, 16 GB of RAM.
DataNodes: 16-24 GB of RAM, dual quad-core or dual six-core CPUs, 4 to 6 SATA drives of 1-3 TB each.
I would also consider a 10 Gbit network option. I think if it does not add more than 15% to the cluster price, it makes sense. The 15% comes from a rough estimate that shipping data from mappers to reducers takes about 15% of job time.
In your case I would be more willing to sacrifice disk sizes to save money, but not CPU/memory/number of drives.
"extract relevant information from them using HIVE"
That is going to be a bit tricky, since Hive doesn't really handle XML files well.
You are going to want to build a parsing script in another language (Ruby, Python, Perl, etc.) that can parse the XML files and produce columnar output that you will load into Hive. You can then have Hive call that external parsing script with a TRANSFORM, or just use Hadoop Streaming to prepare the data for Hive.
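As a minimal sketch of such a parsing step (the record, id and value element names are hypothetical; a real script would match your XML schema), something like this could feed Hadoop Streaming or a Hive TRANSFORM:

```python
import xml.etree.ElementTree as ET

def xml_to_tsv(xml_text):
    """Flatten one XML document into tab-separated rows that Hive can load."""
    root = ET.fromstring(xml_text)
    rows = []
    for rec in root.iter("record"):  # hypothetical element name
        rid = rec.findtext("id", default="")
        val = rec.findtext("value", default="")
        rows.append(f"{rid}\t{val}")
    return "\n".join(rows)

if __name__ == "__main__":
    sample = "<data><record><id>1</id><value>alpha</value></record></data>"
    print(xml_to_tsv(sample))  # prints: 1<TAB>alpha
```

In a streaming job, the script would read file contents from stdin and write the TSV rows to stdout, which a Hive external table over the output directory can then query.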
Then it is just a matter of how fast you need the work done and how much space you need to hold the data you are going to have.
You could build the process with a handful of files on a single system to test it. But you really need a better handle on your overall planned workload to properly scale your cluster. The minimum production cluster size would be 3 or 4 machines, just for data redundancy. Beyond that, add nodes as necessary to meet your workload needs.