NiFi CPU Spike in MergeContent - apache-nifi

Apache NiFi is spiking the CPU to 100% and the affected node gets disconnected from the cluster.
We have a 3-node cluster, each node with 8 cores, 64 GB RAM, and 40 GB of memory allocated to the JVM.
In MergeContent, when processing 2 million+ small flow files (total size of all files around 300 MB), the NiFi node's CPU reaches 100%, the node gets disconnected, and the data never gets processed. I am using the Defragment merge strategy in MergeContent.
I tried changing nifi.queue.swap.threshold to 1,000,00 but it does not help.
Any hints on how to deal with this?
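For reference, the swap threshold mentioned above lives in conf/nifi.properties; a sketch of the entry (the value shown is only illustrative):

nifi.queue.swap.threshold=100000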

Related

Why Spark fails for a huge dataset with containers getting killed while Hive works

I am trying to run a simple query. I assume that running queries with spark.sql("query") has no performance difference compared to DataFrames, since I am using Spark 2.1.0 and have the Catalyst optimizer taking care of the optimization, with Tungsten enabled.
Here I am joining 2 tables with a left-outer join. My first table is 200 GB and is the driving table (being on the left side), and the second table is 2 GB; there can be no filters, as per our business requirement.
Configuration of my cluster: as this is a shared cluster, I have been assigned a specific queue which allows me to use 3 TB of memory (yes, 3 terabytes), but the number of VCORES is 480. That means I can only run 480 parallel tasks. On top of that, at the YARN level I have a constraint of a maximum of 8 cores per node and a maximum of 16 GB of container memory. Because of that I cannot give my executor memory (which is per node) more than 12 GB, as I am giving 3 GB as executor memory overhead to be on the safer side, which becomes 15 GB of per-node memory utilization.
So, after calculating with 480 total allowed vcores and the 8-cores-per-node limit, I get 480/8 = 60 nodes for my computation, which comes to 60 * 15 = 900 GB of usable memory (I don't know why the total queue memory is set to 3 TB). And this is at peak, i.e. if I am the only one using the queue, but that's not always the case.
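For reference, the executor sizing described above would look roughly like the following if expressed through the SparkSession builder (a sketch only, using the numbers from the question; the app name is a placeholder, and in practice these values are usually passed to spark-submit rather than hard-coded):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("left-outer-join-sketch")                     // placeholder app name
  .config("spark.executor.memory", "12g")                // per-executor heap, as described above
  .config("spark.yarn.executor.memoryOverhead", "3072")  // 3 GB overhead => 15 GB per YARN container
  .config("spark.executor.cores", "8")                   // YARN limit of 8 cores per node
  .config("spark.executor.instances", "60")              // 480 vcores / 8 cores per executor = 60 executors
  .getOrCreate()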
Now the doubt is how Spark uses this whole 900 GB of memory. From the numbers and stats I would say my job should run without any issues, as the data size I am trying to process is just 210-250 GB max and I have 900 GB of available memory.
But I keep getting container-killed error messages. And I cannot increase the YARN container size, because that is set at the YARN level and the whole cluster would get the increased container size, which is not the right thing to do. I have also tried disabling the vmem-check.enabled property (setting it to false) in my code using sparksession.config(property), but that doesn't help either; maybe I am not allowed to change anything at the YARN level, so it might be ignoring that.
Now, on what basis does Spark split the data initially? Is it based on the block size defined at the cluster level (assuming 128 MB)? I am thinking this because when my job starts I see that my big table of around 200 GB gets 2,000 tasks (200 GB / 128 MB ≈ 1,600, which is roughly that ballpark), so on what basis does Spark calculate these 2,000 tasks (partitions)? I thought maybe the default partition size when Spark loads my table is quite big, judging by the Input Size/Records and Shuffle Write Size/Records under the Stages tab of the Spark UI, and that this is why I am getting the container-killed error and the suggestion to increase executor memory overhead, which did not help either.
I tried repartitioning the data from 10k to 100k partitions and tried persisting with MEMORY_ONLY, MEMORY_AND_DISK, and DISK_ONLY, but nothing helped. Many of my tasks kept failing and in the end the job would fail, sometimes with container killed, sometimes with direct buffer errors, and others.
Now, what is the use of persist/caching here and how does it behave? I am doing
val result = spark.sql("query big_table").repartition(10000, $<column name>).persist()
The column in repartition is the joining key, so the data gets distributed. To make this work before the join, I am doing result.show(1), so the action is performed and the data gets persisted on disk, and Spark will read the persisted data from disk for the join, and there will be no load on memory as it is stored in small chunks on disk (am I correct here?).
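Spelled out a little more fully, the pattern described above looks roughly like this (a sketch only; spark is the SparkSession from the question, and the query, the "join_key" column, and the small_table name are placeholders; note that data only lands on disk if a disk-backed storage level is requested, since a plain persist() on a Dataset defaults to MEMORY_AND_DISK):

import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// Repartition the big table on the join key and pin it to disk,
// then force materialization with a cheap action before the join.
val result = spark.sql("SELECT ... FROM big_table")        // placeholder query
  .repartition(10000, col("join_key"))                     // "join_key" is a placeholder column name
  .persist(StorageLevel.DISK_ONLY)                         // explicit disk-only storage level
result.show(1)                                             // action that triggers the persist

val smallTable = spark.table("small_table")                // placeholder 2 GB table
val joined = result.join(smallTable, Seq("join_key"), "left_outer")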
Why does this same job, with the same big table plus some additional tables in a left join, complete in Hive? It takes time, but it completes successfully, whereas it fails in Spark. Why? Is Spark not a complete replacement for Hive? Doesn't Spark work like Hive when it comes to spilling to disk and writing data to disk when using DISK for persisting?
Does the YARN container size play a role if we have a small container size but a good number of nodes?
Does Spark combine the memory of all the available nodes (15 GB per node, as per the container size) to load a large partition?

HDFS data write process for nodes with different disk sizes

We have a 10-node HDFS cluster (Hadoop 2.6, Cloudera 5.8); 4 nodes have 10 TB disks and 6 nodes have 3 TB disks. The disks on the smaller nodes are constantly getting full, while free space is still available on the larger nodes.
I am trying to understand how the namenode writes data/blocks to nodes with different disk sizes: is the data divided equally, or does each node get some percentage of it?
You should look at dfs.datanode.fsdataset.volume.choosing.policy. By default this is set to round-robin, but since you have an asymmetric disk setup you should change it to available space.
You can also fine-tune disk usage with the other two choosing properties.
For more information see:
https://www.cloudera.com/documentation/enterprise/5-8-x/topics/admin_dn_storage_balancing.html
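For reference, that switch is typically made in hdfs-site.xml; a sketch in the same style as the property entries used elsewhere on this page (the class name is the stock available-space policy; the threshold and fraction values shown are just the usual defaults):

<property><name>dfs.datanode.fsdataset.volume.choosing.policy</name><value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value></property>
<property><name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name><value>10737418240</value></property>
<property><name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name><value>0.75</value></property>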

HDFS sequence file performance tuning

I'm trying to use Hadoop to process a lot of small files which are stored in a sequence file. My program is highly IO-bound, so I want to make sure that the IO throughput is high enough.
I wrote an MR program that reads small sample files from the sequence file and writes these files to a RAM disk (/dev/shm/test/). There's another standalone program that deletes the files written to the RAM disk without doing any computation, so the test should be almost purely IO-bound. However, the IO throughput is not as good as I expected.
I have 5 datanodes and each datanode has 5 data disks. Each disk can provide about 100 MB/s of throughput. Theoretically this cluster should be able to provide 100 MB/s * 5 (disks) * 5 (machines) = 2500 MB/s. However, I get only about 600 MB/s. I ran "iostat -d -x 1" on the 5 machines and found that the IO load is not well balanced: usually only a few of the disks are at 100% utilization, some disks have very low utilization (10% or less), and some machines even have no IO load at all at times. Here's the screenshot. (Of course the load on each disk/machine varies quickly.)
Here's another screenshot that shows the CPU usage reported by the "top -cd1" command:
Here are some more detailed configs for my case:
Hadoop cluster hardware: 5 Dell R620 machines, each equipped with 128 GB RAM and a 32-core CPU (actually 2x Xeon E5-2650). Two HDDs make up a RAID 1 disk for CentOS, plus 5 data disks for HDFS, so you can see 6 disks in the above screenshot.
Hadoop settings: block size 128 MB; datanode handler count 8; 15 maps per task tracker; 2 GB map-reduce child heap (see the property sketch after this list).
Testing file set: about 400,000 small files with a total size of 320 GB, stored in 160 sequence files, each about 2 GB in size. I tried storing the files in sequence files of several different sizes (1 GB, 512 MB, 256 MB, 128 MB), but the performance didn't change much.
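Spelled out as properties, the Hadoop settings above correspond roughly to the following (a sketch only; these are the older MR1-style names, and the exact keys depend on the Hadoop version in use):

<property><name>dfs.block.size</name><value>134217728</value></property>
<property><name>dfs.datanode.handler.count</name><value>8</value></property>
<property><name>mapred.tasktracker.map.tasks.maximum</name><value>15</value></property>
<property><name>mapred.child.java.opts</name><value>-Xmx2048m</value></property>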
I don't expect the whole system to reach 100% of the theoretical IO throughput (2500 MB/s), but I think 40% (1000 MB/s) or more should be achievable. Can anyone provide some guidance on performance tuning?
I solved the problem myself. Hint: the high CPU usage.
It's very abnormal for the CPU usage to be so high, since this is an almost purely IO job.
The root cause is that each task node gets about 500 maps and each map uses exactly one JVM. By default, Hadoop MapReduce is configured to create a new JVM for each new map.
Solution: change the value of "mapred.job.reuse.jvm.num.tasks" from 1 to -1, which indicates that the JVM will be reused without limit.
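In mapred-site.xml (or in the per-job configuration) that change is a single property entry, in the same style as the entries quoted further down this page:

<property><name>mapred.job.reuse.jvm.num.tasks</name><value>-1</value></property>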

How to decide on the number of parallel mappers/reducers along with heap memory?

Say I have an EMR job running on an 11-node cluster: an m1.small master node and 10 m1.xlarge slave nodes.
Now one m1.xlarge node has 15 GB of RAM.
How do I then decide on the number of parallel mappers and reducers that can be set?
My jobs are memory intensive and I would like as much heap as possible allotted to the JVM.
Another related question:
If we set the following parameter:
<property><name>mapred.child.java.opts</name><value>-Xmx4096m</value></property>
<property><name>mapred.job.reuse.jvm.num.tasks</name><value>1</value></property>
<property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property>
<property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>2</value></property>
So will this 4 GB be shared by 4 processes (2 mappers and 2 reducers), or will they each get 4 GB?
They will each get 4 GB.
You should check what your heap settings are for the task trackers and the datanodes; then you'll have an idea of how much memory you have left over to allocate to the children (the actual mappers/reducers).
Then it's just a balancing act: if you need more memory, you'll want fewer mappers/reducers, and vice versa.
Also try to keep in mind how many cores your CPU has; you don't want 100 map tasks on a single core. To tune this, it's best to monitor both heap usage and CPU utilization over time so you can adjust the knobs.
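As a quick sanity check with the settings quoted in the question: 2 map slots plus 2 reduce slots at -Xmx4096m each is 4 * 4 GB = 16 GB of child heap alone, which already exceeds the 15 GB of RAM on an m1.xlarge before the TaskTracker and DataNode daemons are even counted, so either the child heap or the slot counts would need to come down.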

Cassandra compaction taking too much time to complete

Initially we had 12 nodes in the Cassandra cluster, and with a 500 GB data load on each node a major compaction used to complete in 20 hours.
Now we have upgraded the cluster to 24 nodes, and with the same data size (500 GB on each node) a major compaction takes 5 days. (The hardware configuration of each node is exactly the same, and we are using cassandra-0.8.2.)
So what could be the possible reason for this slowdown?
Is the increased cluster size causing this issue?
Compaction is a completely local operation, so cluster size would not affect it. Request volume would, and so would data volume.
