I have two Linux machines with different configurations:
Machine 1: 16 GB RAM, 4 Virtual Cores and 40 GB HDD (Master and Slave Machine)
Machine 2: 8 GB RAM, 2 Virtual Cores and 40 GB HDD (Slave machine)
I have set up a Hadoop cluster between these two machines.
I am using Machine 1 as both master and slave, and Machine 2 as a slave only.
I want to run my Spark application and utilise as many virtual cores and as much memory as possible, but I am unable to figure out the right settings.
My spark code looks something like:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext, SQLContext, SparkSession

conf = SparkConf().setAppName("Simple Application")
sc = SparkContext('spark://master:7077')   # points at the standalone master
hc = HiveContext(sc)
sqlContext = SQLContext(sc)
# note: this requests a different master ("yarn-cluster") than the SparkContext above
spark = SparkSession.builder.appName("SimpleApplication").master("yarn-cluster").getOrCreate()
So far, I have tried the following:
When I process my 2 GB file only on Machine 1 (in local mode, as a single-node cluster), it uses all 4 CPUs of the machine and completes in about 8 minutes.
When I process my 2 GB file with the cluster configuration above, it takes slightly longer than 8 minutes, though I expected it to take less time.
What number of executors, cores, and memory do I need to set to maximize the usage of the cluster?
I have referred to the article below, but because my machines have different configurations, I am not sure which parameters would fit best.
Apache Spark: The number of cores vs. the number of executors
Any help will be greatly appreciated.
When I process my 2 GB file with the cluster configuration above, it takes slightly longer than 8 minutes, though I expected it to take less time.
It's not clear where your file is stored.
I see you're using Spark standalone mode, so I'll assume the file is not split on HDFS into about 16 blocks (given a block size of 128 MB).
In that scenario, your entire file will be processed at least once in whole, plus the overhead of shuffling that data across the network.
If you used YARN as the Spark master with HDFS as the file system, and a splittable file format, then the computation would go "to the data", and you could expect quicker run times.
As far as optimal settings go, there are tradeoffs between cores, memory, and the number of executors, but there's no magic number for a particular workload, and you'll always be limited by the smallest node in the cluster. Keep in mind that the memory for the Spark driver and the other processes on the OS should be accounted for when calculating sizes.
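To make that concrete for the two machines in the question, here is a rough back-of-envelope sizing sketch. The reservations (1 core and 1 GB per node for the OS and daemons, plus 1 core and 1 GB on Machine 1 for the driver) are assumptions for illustration, not fixed rules:

# Back-of-envelope executor sizing for the 2-node cluster in the question.
# Assumed reservations: 1 core + 1 GB per node for OS/daemons, and
# 1 core + 1 GB extra on machine1 for the Spark driver (client mode).
nodes = {
    "machine1": {"cores": 4, "ram_gb": 16, "runs_driver": True},
    "machine2": {"cores": 2, "ram_gb": 8,  "runs_driver": False},
}

usable = {}
for name, n in nodes.items():
    cores = n["cores"] - 1 - (1 if n["runs_driver"] else 0)
    ram   = n["ram_gb"] - 1 - (1 if n["runs_driver"] else 0)
    usable[name] = (cores, ram)
    print(name, "usable:", cores, "cores,", ram, "GB")

# Standalone mode uses one executor memory setting for the whole application,
# so with one executor per worker the smaller machine sets the ceiling.
exec_mem_gb = min(ram for _, ram in usable.values())      # 7 GB here
total_cores = sum(cores for cores, _ in usable.values())  # 3 here
print(f"suggested: spark.executor.memory={exec_mem_gb}g, spark.cores.max={total_cores}")

Even then, a single 2 GB input gives you at most ~16 HDFS blocks (at 128 MB) to parallelise over, so don't expect a dramatic speed-up over the 4-core local run.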
Related
I'm trying to squeeze every single bit out of my cluster when configuring the Spark application, but it seems I'm not understanding everything completely right. I'm running the application on an AWS EMR cluster with 1 master and 2 core nodes of type m3.xlarge (15 GB RAM and 4 vCPUs per node). This means that by default 11.25 GB are reserved on every node for applications scheduled by YARN.

The master node is used only by the resource manager (YARN), which means the remaining 2 core nodes will be used to schedule applications (so we have 22.5 GB for that purpose). So far so good. But here comes the part which I don't get. I'm starting the Spark application with the following parameters:
--driver-memory 4G --num-executors 4 --executor-cores 7 --executor-memory 4G
My understanding (from what I found as information) is that 4 GB will be allocated for the driver, and 4 executors will be launched with 4 GB each. A rough estimate makes it 5*4 = 20 GB (let's make it 21 GB with the expected memory overhead), which should be fine as we have 22.5 GB for applications. Here's a screenshot from the Hadoop YARN UI after the launch:
What we can see is that 17.63 GB are used by the application, but this is a little bit less than the expected ~21 GB, and this triggers the first question: what happened here?
Then I go to the spark UI's executors page. Here comes the bigger question:
There are 3 executors (not 4), and the memory allocated for them and the driver is 2.1 GB (not the specified 4 GB). So Hadoop YARN says 17.63 GB are used, but Spark says 8.4 GB are allocated. So, what is happening here? Is this related to the Capacity Scheduler (I couldn't come to that conclusion from the documentation)?
Can you check whether spark.dynamicAllocation.enabled is turned on? If it is, your application may give resources back to the cluster when they are no longer used. The number of executors launched at startup is decided by spark.executor.instances.
If that is not the case, what is the source for your Spark application, and what partition size is set for it? Spark will literally map the partition count to the Spark cores: if your source has only 10 partitions and you try to allocate 15 cores, it will only use 10 cores because that is all that is needed. I guess this might be why Spark launched 3 executors instead of 4. Regarding memory, I would recommend revisiting the settings, because you are asking for 4 executors and 1 driver with 4 GB each, which is roughly 5*4 GB + 5*384 MB ≈ 22 GB. You are trying to use up everything, and not much is left for your OS and the NodeManager to run, which is not ideal.
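To make the memory arithmetic concrete, here is a sketch of what gets requested from YARN. The max(384 MB, 10%) overhead is the usual Spark-on-YARN default, while the 256 MB allocation increment and the 11.25 GB per-node limit are assumptions based on the question's EMR setup and may differ on your cluster:

import math

# Rough sketch of what YARN is asked for with:
#   --driver-memory 4G --num-executors 4 --executor-memory 4G
def container_mb(heap_mb, min_alloc_mb=256):
    overhead = max(384, int(0.10 * heap_mb))   # spark.yarn.*.memoryOverhead default
    # YARN rounds requests up to a multiple of the minimum allocation
    return math.ceil((heap_mb + overhead) / min_alloc_mb) * min_alloc_mb

container = container_mb(4096)       # ~4608 MB per executor (or driver/AM)
node_limit_mb = 11520                # 11.25 GB usable per core node
per_node = node_limit_mb // container            # 2 containers fit per node
cluster_total = 2 * per_node                     # 4 containers on 2 core nodes

print(f"container size: {container} MB, {per_node} per node, {cluster_total} total")
# If the driver/ApplicationMaster occupies one of those containers (cluster
# mode), only 3 executors of this size can be placed, which would match the UI.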
So I'm trying to run some Hadoop jobs on AWS r3.4xlarge machines. They have 16 vcores and 122 GB of RAM available.
Each of my mappers requires about 8 gigs of ram and one thread, so these machines are very nearly perfect for the job.
I have mapreduce.map.memory.mb set to 8192,
and mapreduce.map.java.opts set to -Xmx6144m.
This should result in approximately 14 mappers (in practice nearer to 12) running on each machine.
This is in fact the case for a 2 slave setup, where the scheduler shows 90 percent utilization of the cluster.
When scaling to, say, 4 slaves, however, it seems that Hadoop simply doesn't create more mappers. In fact it creates FEWER.
On my 2 slave setup I had just under 30 mappers running at any one time, on four slaves I had about 20. The machines were sitting at just under 50 percent utilization.
The vcores are there, the physical memory is there. What the heck is missing? Why is hadoop not creating more containers?
So it turns out that this is one of those Hadoop things that never makes sense, no matter how hard you try to figure it out.
There is a setting in yarn-default.xml called yarn.nodemanager.heartbeat.interval-ms.
It is set to 1000. Apparently it controls the minimum period, in milliseconds, between container assignments.
This means only one new map task is created per second, so the number of containers running at any time is roughly limited to (containers assigned per second) × (the time it takes a container to finish).
By setting this value to 50, or better yet 1, I was able to get the kind of scaling that is expected from a Hadoop cluster. Honestly, this should be documented better.
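For what it's worth, here is a purely illustrative model of why that matters; it assumes the scheduler hands out at most one container per NodeManager heartbeat and assumes an average map-task duration, neither of which is a documented guarantee:

# Illustrative only: steady-state concurrent containers if each NodeManager
# heartbeat assigns at most one new container. The 10 s average task duration
# and 14 slots per node are assumptions, not measurements.
def concurrent_containers(nodes, avg_task_s, heartbeat_ms, slots_per_node=14):
    per_node = min(slots_per_node, avg_task_s / (heartbeat_ms / 1000.0))
    return int(nodes * per_node)

for hb in (1000, 50, 1):
    print(f"heartbeat {hb:>4} ms -> up to ~{concurrent_containers(4, 10, hb)} concurrent mappers on 4 slaves")
# Only with a short heartbeat does the cluster actually reach the
# 4 nodes * 14 slots = 56 mappers the memory and vcores would allow.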
I have a cluster running Spark with 4 servers, each having 8 cores. Somehow the master is not detecting all available cores; it is using 18 out of 32 cores:
I have not set anything relating to the number of cores in any Spark conf file (at least not that I am aware of).
I am positive each cluster member has the same number of cores (8).
Is there a way to make Spark detect/use the other cores as well?
I found it, but it is still somewhat unclear:
One node that was only contributing 1 out of 8 cores had this setting turned on in $SPARK_HOME/conf/spark-env.sh:
SPARK_WORKER_CORES=1
Commenting it out did the trick for that node. Spark will grab all cores by default (the same goes for memory).
But... on the other node with only 1 core this setting was not activated, yet Spark still did not grab 8 cores until I specifically told it to:
SPARK_WORKER_CORES=8
But at least it is grabbing all resources now.
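If you want a quick sanity check from inside an application (a sketch, not an official procedure): in standalone mode, with spark.default.parallelism and spark.cores.max left unset, sc.defaultParallelism reflects the total number of executor cores the master actually granted:

from pyspark import SparkConf, SparkContext
import time

# Minimal check of how many cores the standalone master actually handed out.
# "spark://master:7077" is a placeholder for your real master URL.
conf = SparkConf().setAppName("core-check").setMaster("spark://master:7077")
sc = SparkContext(conf=conf)
time.sleep(5)   # give the workers a moment to register their executors
print("total executor cores granted:", sc.defaultParallelism)
sc.stop()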
I want to ask: why does setting mapreduce.map/reduce.memory.mb and mapreduce.map/reduce.java.opts in mapred-site.xml to values bigger than the defaults make my job slower?
But if I configure them too low, then I get task failures. And under those conditions, I start to think my memory configuration in Hadoop is not even necessary...
Can you give me an explanation?
What might be happening in your environment is this: when you increase the values of the mapreduce.map/reduce.memory.mb and mapreduce.map/reduce.java.opts configurations towards the upper bound, it reduces the number of containers allowed to execute map/reduce tasks on every node, which eventually causes slowness in the overall job time.
If you have 2 nodes, each with 25 GB of free RAM, and you configure mapreduce.map/reduce.memory.mb as 4 GB, then you might get at least 6 containers on every node, 12 in total. So you would get the chance of running 12 mapper/reducer tasks in parallel.
If you configure mapreduce.map/reduce.memory.mb as 10 GB, then you might get only 2 containers on every node, 4 containers in total, to execute your mapper/reducer tasks in parallel. The mapper/reducer tasks would then mostly run in sequence due to the lack of free containers, causing a delay in the overall job completion time.
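For the numbers in this example, the container math is just integer division; a quick sketch:

# Container math for the example above: 2 nodes with 25 GB free for YARN each.
node_mem_gb, nodes = 25, 2
for container_gb in (4, 10):
    per_node = node_mem_gb // container_gb
    total = per_node * nodes
    print(f"{container_gb} GB containers: {per_node} per node, {total} map/reduce tasks in parallel")
# 4 GB  -> 6 per node, 12 in parallel
# 10 GB -> 2 per node,  4 in parallel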
You should choose appropriate values for these configurations by considering the resources available and the amount of resources required by the map/reduce containers in your environment. Hope this makes sense.
You can allocate memory for map/reduce containers based on two factors:
available memory on each DataNode
total number of cores (vcores) you have
Try to create a number of containers equivalent to the number of cores you have in each DataNode (including hyper-threading).
For example, if you have 10 physical cores (20 cores including hyper-threading),
then the total number of containers you can plan for is 19 (leaving 1 core for other processes).
Assume that you have X GB of RAM in each DataNode, then
leave some memory (say Y GB) for the heaps of other processes such as the DataNode, NodeManager, RegionServer, etc.
Now the memory available for YARN is Z = X - Y.
Memory for a map container = Z / number of containers per node
Memory for a reduce container = 2 × (Z / number of containers per node)
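Putting those steps together as a sketch: the core count is the example above, while X = 64 GB, Y = 8 GB, the 2x rule for reduce containers, and the -Xmx at roughly 80% of the container size are illustrative assumptions you would adjust for your own environment:

# Sizing sketch using the steps above. X_gb and Y_gb are illustrative values.
X_gb = 64                 # total RAM per DataNode (assumed)
Y_gb = 8                  # reserved for DataNode, NodeManager, RegionServer, ... (assumed)
vcores = 20               # 10 physical cores with hyper-threading

containers_per_node = vcores - 1     # leave 1 core for other processes
Z_gb = X_gb - Y_gb                   # memory available to YARN

map_container_gb = Z_gb / containers_per_node
reduce_container_gb = 2 * map_container_gb        # common 2x rule of thumb

print(f"containers per node: {containers_per_node}, YARN memory: {Z_gb} GB")
print(f"map container   : {map_container_gb:.1f} GB (java.opts ~ -Xmx{int(map_container_gb * 0.8 * 1024)}m)")
print(f"reduce container: {reduce_container_gb:.1f} GB")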
I'm trying to use Hadoop to process a lot of small files which are stored in sequence files. My program is highly IO-bound, so I want to make sure that the IO throughput is high enough.
I wrote an MR program that reads small sample files from a sequence file and writes these files to a RAM disk (/dev/shm/test/). There's another standalone program that deletes the files written to the RAM disk without any computation. So the test should be almost purely IO-bound. However, the IO throughput is not as good as I expected.
I have 5 datanodes, and each datanode has 5 data disks. Each disk can provide about 100 MB/s of throughput. Theoretically this cluster should be able to provide 100 MB/s * 5 (disks) * 5 (machines) = 2500 MB/s. However, I get only about 600 MB/s. I ran "iostat -d -x 1" on the 5 machines and found that the IO load is not well balanced. Usually only a few of the disks are at 100% utilization, some disks have very low utilization (10% or less), and some machines even have no IO load at times. Here's the screenshot. (Of course, the load on each disk/machine varies quickly.)
Here's another screenshot that shows the CPU usage from the "top -cd1" command:
Here is some more detailed configuration for my case:
Hadoop cluster hardware: 5 Dell R620 machines equipped with 128 GB of RAM and 32 CPU cores (actually 2x Xeon E5-2650). 2 HDDs form a RAID 1 volume for CentOS, and there are 5 data disks for HDFS, so you can see 6 disks in the above screenshot.
Hadoop settings: block size 128 MB; datanode handler count 8; 15 maps per TaskTracker; 2 GB heap for each MapReduce child process.
Testing file set: about 400,000 small files, 320 GB in total, stored in 160 sequence files, each about 2 GB in size. I tried storing all the files in sequence files of different sizes (1 GB, 512 MB, 256 MB, 128 MB), but the performance didn't change much.
I don't expect the whole system to reach 100% of the IO throughput (2500 MB/s), but I think 40% (1000 MB/s) or more should be reasonable. Can anyone provide some guidance for performance tuning?
I solved the problem myself. Hint: the high CPU usage.
It's very abnormal that the CPU usage is so high, since this is an almost purely IO-bound job.
The root cause is that each task node gets about 500 map tasks, and each map task uses exactly one JVM. By default, Hadoop MapReduce is configured to create a new JVM for each new map task.
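As a rough check on the scale of that overhead (the per-JVM start-up cost below is an assumed ballpark, not a measured number):

# 320 GB of sequence files at a 128 MB block size is roughly 2560 map tasks,
# about 512 per node on 5 datanodes -- the "about 500 maps" mentioned above.
total_gb, block_mb, nodes = 320, 128, 5
jvm_startup_s = 2                     # assumed cost of starting one child JVM

map_tasks = (total_gb * 1024) // block_mb         # ~2560
per_node = map_tasks // nodes                     # ~512
print(f"{map_tasks} map tasks (~{per_node} per node), "
      f"~{map_tasks * jvm_startup_s / 60:.0f} CPU-minutes spent just starting JVMs")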
Solution: modify the value of "mapred.job.reuse.jvm.num.tasks" from 1 to -1, which indicates that the JVM will be reused without limit.