Apache Nifi slow cluster issue - performance

I am using a Apache nifi for one of my clickstream projects to do some ETL.
I am getting traffic around 300 messages per second currently with the following infra:
RAM - 16 GB
Swap - 6 GB
CPU - 16 cores
Disk - 100GB (Persistance not required)
Cluster - 6 nodes
The entire cluster UI has become extremely slow with the following issues
Processors giving back pressure when some failure happens, which consumes lot of threads
Provenance writing becomes very slow
Heartbeat across nodes becomes slow
Cluster Heart beat
I have the following questions on the setup
Is RPG use recommended, as it is a HTTP call, which i using to spread
across all the nodes, as there is an existing issue with EMQTT
process for consumer group.
What is the recommended value of thread count that should be allotted
per core?
What are the guidelines for infrastructure sizing
What are the tuning parameters for a large cluster with high incoming requests and lot of heavy JSON parsing for transformation

A couple of suggestions
Yes RPG usage is recommended, at least from what I've experienced, RPG seems to offer better distribution. Take a look at [3] below
Some processors are CPU intensive then others so there's no clear cut answer for what value can be set for Concurrent Tasks. This is more of trial and error or testing and fine tuning approach that you'd have to master. One suggestion is, if you set too many Concurrent Tasks for a CPU intensive processor, it will have serious impact on the nodes.
Hortonworks have made a detailed guide regarding this. I've provided the link below. [1]
Some best practices and handy guides:
https://community.hortonworks.com/articles/7882/hdfnifi-best-practices-for-setting-up-a-high-perfo.html
http://ijokarumawak.github.io/nifi/2016/11/22/nifi-jolt/
https://pierrevillard.com/2017/02/23/listfetch-pattern-and-remote-process-group-in-apache-nifi/

Related

Apache NiFi tuning issues

I've developed a NiFi flow prototype for data ingestion in HDFS. Now I would like to improve the overall performances but it seems I cannot really move forward.
​
​The flow takes in input csv files (each row has 80 fields), split them at row level, applies some transformations to the fields (using 4 custom processors executed sequentially), buffers the new rows into csv files, outputs them into HDFS. I've developed the processors in such a way the content of the flow file is accessed only once when each individual record is read and its fields are moved to flowfile attributes. Tests have been performed on a amazon EC2 m4.4xlarge instance (16 cores CPU, 64 GB RAM).
​​This is what I tried so far:
​​Moved the flowfile repository and the content repository on different SSD drives
Moved the provenance repository in memory (NiFi could not keep up with the events rate)
Configuring the system according to the ​configuration best practices
I've tried assigning multiple threads to each of the processors in order to reach different numbers of total threads
I've tried increasing the nifi.queue.swap.threshold and setting backpressure to never reach the swap limit
Tried different JVM memory settings from 8 up to 32 GB (in combination with the G1GC)
I've tried increasing the instance specifications, nothing changes
From the monitoring I've performed it looks like disks are not the bottleneck (they are basically idle a great part of the time, showing the computation is actually being performed in-memory) and the average CPU load is below 60%.
​The most I can get is 215k rows/minute, which is 3,5k rows/second. In terms of volume, it's just 4,7 MB/s. I am aiming to something definitely greater than this.
​
​Just as a comparison, I created a flow that reads a file, splits it in rows, merges them together in blocks and outputs on disk. Here I get 12k rows/second, or 17 MB/s. Doesn't look surprisingly fast too and let me think that probably I am doing something wrong.
​
​Does anyone has suggestions about how to improve the performances? How much will I benefit from running NiFi on cluster instead of growing with the instance specs? Thank you all
It turned out the poor performances were a combination of both the custom processors developed, and the merge content built-in processor. The same question mirrored on the hortonworks community forum got interesting feedback.
Regarding the first issue, a suggestion is to add the SupportsBatching annotation to the processors. This allows the processors to batch together several commits, and allows the NiFi user to favor latency or throughput with the processor execution from the configuration menu. Additional info can be found on the documentation here.
The other finding was that the MergeContent built-in processor doesn't seem to have optimal performances itself, therefore if possible one should consider modifying the flow and avoid the merging phase.

Impact of more executors than cpu/cores in Storm Cluster

I have started using Apache Storm recently. Right now focusing on some performance testing and tuning for one of my applications (pulls data from a NoSQL database, formats, and publishes to a JMS Queue for consumption by the requester) to enable more parallel request processing at a time. I have been able to tune the topology in terms of changing no. of bolts, MAX_SPENDING_SPOUT etc. and to throttle data flow within topology using some ticking approach.
I wanted to know what happens when we define more parallelism than the no of cores we have. In my case I have a single node, single worker topology and the machine has 32 cores. But total no of executors (for all the spouts and bolts) = 60. So my questions are:
Does this high number really helps processing requests or is it actually degrades the performance, since I believe there will more context switch between bolt tasks to utilize cores.
If I define 20 (just a random selection) executors for a Bolt and my code flow never needs to utilize the Bolt, will this be impacting performance? How does storm handles this situation?
This is a very general question, so the answer is (as always): it depends.
If your load is large and a single executor fully utilizes a core completely, having more executors cannot give you any throughput improvements. If there is any impact, it might be negative (also with regard to contention of internally used queues to which all executers need to read from and write into for tuple transfer).
If you load is "small" and does not fully utilize your CPUs, it wound matter either -- you would not gain or loose anything -- as your cores are not fully utilized you have some left over head room anyway.
Furthermore, consider that Storm spans some more threads within each worker. Thus, if your executors fully utilize your hardware, those thread will also be impacted.
Overall, you should not run your topologies to utilize core completely anyway but leave form head room for small "spikes" etc. In operation, maybe 80% CPU utilization might be a good value. As a rule of thumb, one executor per core should be ok.

Are Hadoop and Map/Reduce useful for BIG parallel processes?

I have a superficial understanding of Hadoop and Map/Reduce. I see it can be useful for running many instances of small independent processes. But can I use this infrastructure (with its fault tolerance, scalability and ease of use) to run BIG independent processes?
Let's say I want to run certain analysis of the status of the clients of my company (600), and this analysis requires about 1 min of process, accessing a variety of static data, but the analysis of one client is not related to the others. So now I have 10 hs of centralized processing, but if I can distribute this processing in 20 nodes, I can expect to finish it in about half hour (plus some overhead due to replication of data). And if I can rent 100 nodes in Amazon EC2 for an affordable price, it will be done in about 6 minutes and that will change radically the usability of my analysis.
Is Hadoop the right tool to solve my problem? Can it run big Mapper processes that take 1 min each? If not, where should I look?
Thanks in advance.

Performance issue for batch insertion into marklogic

I have the requirement to insert 10,000 docs into marklogic in less than 10 seconds.
I tested in one single-node marklogic server in the following way:
use xdmp:spawn to pass the doc insertion task to task server;
use xdmp:document-insert without specify forest explicitly;
the task server has 8 theads to process tasks;
We have enabled CPF.
The performance is very bad: it took 2 minutes to finish the 10,000 doc creation.
I'm sure the performance will be better if I tested it in a cluster environment, but I'm not sure whether it can finish in less than 10 seconds.
Please advise the way of improving the performance.
I would start by gathering more information. What version of MarkLogic is this? What OS is it running on? What's the CPU? RAM? What's the storage subsystem? How many forests are attached to the database?
Then gather OS-level metrics, to see if one of the subsystems is an obvious bottleneck. For now I won't speculate beyond that.
If you need a fast load, I wouldn't use xdmp:spawn for each individual document, nor use CPF. But 2 minutes for 10k docs doesn't necessarily sound slow. On the other hand, I have reached up to 3k/sec, but without range indexes, transforms, whatsoever. And a very fast disk (e.g. ssd)..
HTH!
Assuming 2 socket server, 128GB-256GB of ram, fast IO (400-800MB/sec sustained)
Appropriate number of forests (12 primary or 6 primary/6 secondary)
More than 8 threads assuming enough cores
CPF off
Turn on perf history, look in metrics, and you will see where the bottleneck is.
SSD is not required - just IO throughput...which multiple spinning disks provide without issue.

cassandra massive write perfomance problem

I have server with 4 GB RAM and 2x 4 cores CPU. When I start perform massive writes in Cassandra all works fine initially, but after a couple hours with 10K inserts per second database grows up to 25+ GB, and performance go down to 500 insert per seconds!
I find out this because compacting operations is very slow but I don't understand why? I set 8 concurrent compacting threads but Cassandra don't use 8 threads; only 2 cores are loaded.
Appreciate any help.
We've seen similar problems with Cassandra out-the-box, see:
http://www.acunu.com/blogs/richard-low/cassandra-under-heavy-write-load-part-ii/
One solution to these sort of performance degradation issues (but by no means the only) is to consider a different storage engine, like Castle, used in the above blog post - its opensource (GPL v2), has much better performance and degrades much more gracefully. The code is here (I've just pushed up a branch for Cassandra 0.8 support):
https://bitbucket.org/acunu/fs.hg
And instructions on how to get started are here:
http://support.acunu.com/entries/20216797-castle-build-instructions
(Full disclosure: I work for Acunu, so may be a little biased ;-)

Resources