NiFi process groups performance (output ports) - apache-nifi

I use NiFi process groups to simplify the view of the entire process.
However, to use process groups, we have to pass the output to an output port and then the next processor has to be fed from that process group "via" the output port.
I have noticed that I experience performance degradation when I do that. It seems that the downstream processors are waiting for the output port to send files although the files are "available" in the upstream process group's output port.
I removed the process groups and directly connected the processors, and I see a drastic improvement in the flows, although the result looks messy and unreadable (avoiding that is the whole purpose of using process groups).
There is no configuration available on the output port, and it seems to be just a passthrough mechanism (as it should be), so I am not sure why it is acting as a bottleneck.
Any views or insight on this would be very helpful.
1) Option that is slower: Input -----> A Process Group(Containing Input port+Extract text+Replace text+Output port) ------> Output
2) Faster performing flow: Input ------->Extract text+Replace text ------------> Output

There is a thread about this on HCC.
Some things to look into:
If there is too much in the queues, swapping may occur
Timer-based microbatching is used to move data between process groups. This in itself should not add significant overhead, but you will want to make sure that Maximum Timer Driven Thread Count is set high enough.

Related

Long duration soak tests in jmeter

JMeter tests are run in master/slave fashion with around 8 slave machines. However, with the remote batching mode set to MODE_STRIPPED_BATCH, I am not able to run tests for more than 64 hours. Throughput is around 450 requests per minute, and each slave machine ends up creating jtl files of around 1.5 GB. All 8 slaves send these to the master (1.5 GB x 8), and the I/O probably gets to be too much for the master to handle. The master machine has 16 GB of RAM and around 250 GB of disk storage.
I was wondering if the JMeter distributed architecture has any provision to make long-running soak tests possible without unexplained stress on the master machine. Obviously I have the option to abandon the master/slave setup and go for 8 independent nodes; however, in that case I'll run into complications with serving the data CSV files (which I currently serve from the master using the Simple Table Server plugin) and with aggregating the result files.
Any suggestions, please? It would be great to be able to run tests for at least around 4 days (96 hours or so).
I would suggest to go for an independent JMeter workers + external data collector setup.
Actually, JMeter's out-of-the-box "distributed scaling" abilities are weak, badly outdated and overall pretty ridiculous, as are its data collection/aggregation/processing abilities.
This situation actually puzzles me a lot. Mind you, rivals are even worse, so there's literally NOTHING in the field (except for, perhaps, some SaaS solutions trying to monetize this gap).
But it is what it is...
So that's about why-s, now to how-s.
If I were you, I would:
Containerize the JMeter worker
Equip each container with a watchdog to quickly restart the worker if things go south locally (or perhaps even on a schedule, to refresh it periodically). Whether it is an internal watchdog or an external one like cloud services provide doesn't matter.
Set up a time-series database. I recommend InfluxDB: it's an excellent product and it's free in the basic version (which is going to be enough for your purposes).
Send your test results/metrics into that DB; do not collect them locally! You can do it right from your tests with a pretty simple custom listener (the Influx line protocol is ridiculously simple and fast), or you can have an external agent watching the result files as they are written. I just suggest that you not use the so-called Backend Listener for the job: it's garbage, it won't shape your data right, so you'd have to do additional operations to bring them into order.
If you shape your test result/metrics data properly, you get them already time-synced in a single set, and the further processing options are amazingly powerful!
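To give a feel for how little a custom listener has to do, here is a minimal sketch in Python of formatting one sample as an Influx line protocol record. The measurement name, tag names and field names (jmeter_sample, worker, label, elapsed_ms, success) are made up for illustration, and the function only builds the line; actually sending it to the database is left out.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=value field=value timestamp."""

    def esc_tag(text):
        # Tag keys, tag values and field keys escape commas, equals signs and spaces.
        return text.replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")

    def esc_measurement(text):
        # Measurement names escape commas and spaces.
        return text.replace(",", r"\,").replace(" ", r"\ ")

    def fmt_field(value):
        if isinstance(value, bool):          # check bool before int (bool is an int subclass)
            return "true" if value else "false"
        if isinstance(value, int):
            return f"{value}i"               # integers carry an 'i' suffix
        if isinstance(value, float):
            return repr(value)
        # Strings are double-quoted; backslashes and quotes are escaped.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    tag_part = "".join(
        f",{esc_tag(key)}={esc_tag(str(val))}" for key, val in sorted(tags.items())
    )
    field_part = ",".join(f"{esc_tag(key)}={fmt_field(val)}" for key, val in fields.items())
    return f"{esc_measurement(measurement)}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical sample from a worker named "node 1".
line = to_line_protocol(
    "jmeter_sample",
    {"worker": "node 1", "label": "login"},
    {"elapsed_ms": 231, "success": True},
    1500000000000000000,
)
print(line)
```

Because every worker stamps its own points and tags them with its own name, the database ends up with one time-ordered set that can be grouped per worker or per label later.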
My expectation is that you're looking for the StrippedAsynch sampler sender mode.
As per the documentation:
Asynch
samples are temporarily stored in a local queue. A separate worker thread sends the samples. This allows the test thread to continue without waiting for the result to be sent back to the client. However, if samples are being created faster than they can be sent, the queue will eventually fill up, and the sampler thread will block until some samples can be drained from the queue. This mode is useful for smoothing out peaks in sample generation. The queue size can be adjusted by setting the JMeter property asynch.batch.queue.size (default 100) on the server node.
StrippedAsynch
remove responseData from successful samples, and use Async sender to send them.
So on each slave node add the following line to the user.properties file:
mode=StrippedAsynch
and, as the documentation quoted above says, define asynch.batch.queue.size on the same (server) node. Set it high enough that it does not reduce JMeter's throughput and low enough that it does not overwhelm the master. I would start with 1000.
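As a rough illustration of the behaviour the quoted documentation describes, here is a toy model in Python. It is not JMeter's implementation, and the class name and the sample fields (success, response_data) are invented for the sketch. The test thread only blocks in sample_occurred when the bounded queue is full, and response data is dropped from successful samples before they are queued.

```python
import queue
import threading


class AsyncSampleSender:
    """Toy model of a stripped, asynchronous sample sender with a bounded queue."""

    def __init__(self, send, queue_size=100):
        self._queue = queue.Queue(maxsize=queue_size)
        self._send = send  # callable that delivers one sample to the "master"
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def sample_occurred(self, sample):
        # Strip the response body from successful samples only.
        if sample["success"]:
            sample = {**sample, "response_data": None}
        # put() returns immediately while there is room and blocks when the queue is full.
        self._queue.put(sample)

    def _drain(self):
        # Separate worker thread: takes samples off the queue and sends them.
        while True:
            sample = self._queue.get()
            if sample is None:  # sentinel: no more samples
                break
            self._send(sample)

    def close(self):
        self._queue.put(None)
        self._worker.join()


received = []
sender = AsyncSampleSender(received.append, queue_size=2)
for i in range(5):
    sender.sample_occurred({"id": i, "success": i % 2 == 0, "response_data": "body"})
sender.close()
print(len(received))
```

A larger queue_size absorbs longer bursts before the test thread has to wait, at the cost of more samples held in memory on the sending side.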
Another option is StrippedDiskStore, but you will have to collect the serialized results manually after the test completes (make sure the slave processes do not shut down first, because the results are deleted when a slave process finishes).
You could use JMeter PerfMon Plugin to monitor memory and network usage on master and slaves.

Synchronize NiFi process groups or flows that don't/can't connect?

Like the question states, is there some way to synchronize NiFi process groups or pipelines that don't/can't connect in the UI?
Eg. I have a process where I want to getFTP->putHDFS->moveHDFS (which ends up actually being getFTP->putHDFS->listHDFS->moveHDFS, see https://stackoverflow.com/a/50166151/8236733). However, listHDFS does not seem to take any incoming connections. Trying to do something with process groups like P1{getFTP->putHDFS->outport}->P2{inport->listHDFS->moveHDFS} also runs into the same problem (listHDFS can't seem to take any incoming connections). We don't want to moveHDFS before we ever even get anything from getFTP, but given the above, I don't see how these actions can be synchronized to occur in the right order.
New to NiFi, but I imagine this is a common use case and there must be some NiFi-ish way of doing this that I am missing. Advice in this would be appreciated. Thanks.
I'm not sure what requirement is preventing you from writing the file retrieved from FTP directly to the desired HDFS location, or if this is a "write n files to HDFS with a . starting the filename and then rename all when some certain threshold is reached" scenario.
ListHDFS does not take any incoming relationships because it should not be triggered by an incoming event, but rather on a timer/CRON schedule. Every time it runs, it will produce n flowfiles, where each references an HDFS file that has been detected to be written to the filesystem since the last execution. To do this, the processor stores local state.
Your flow segments do not need to be connected in this case. You'll have "flow segment A" which performs the FTP -> HDFS writing (GetFTP -> PutHDFS) and you'll have an independent "flow segment B" which lists the HDFS directory, reads the file descriptors (but not the content of the file unless you use FetchHDFS as well) and moves them (ListHDFS -> MoveHDFS). The ListHDFS processor will run constantly, but if it does not detect any new files during a run, it will simply yield and perform a no-op. Once the PutHDFS processor completes the task of writing a file to the HDFS file system, on the next ListHDFS execution, it will detect that file and generate a flowfile describing it.
You can tune the scheduling to your liking, but in general this is a very common pattern in NiFi flows.
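The "list on a schedule, remember what was already seen" idea can be sketched in a few lines of Python. This is only a toy model of the pattern, not how ListHDFS is implemented; the class name, the in-memory dict standing in for the HDFS directory and the file paths are all assumptions made for the example.

```python
class StatefulLister:
    """Emits each path once: only entries not seen on earlier runs are returned."""

    def __init__(self, list_paths):
        self._list_paths = list_paths  # callable returning the current directory listing
        self._seen = set()             # state kept between scheduled runs

    def run_once(self):
        current = set(self._list_paths())
        new_paths = sorted(current - self._seen)
        self._seen |= current
        return new_paths               # empty list means nothing to do on this run


# A dict stands in for the target directory that the independent writer flow fills.
fake_dir = {}
lister = StatefulLister(lambda: list(fake_dir))

first = lister.run_once()            # nothing written yet
fake_dir["/data/in/a.csv"] = b"..."  # the writer segment finishes a file
second = lister.run_once()           # the next scheduled run picks it up
third = lister.run_once()            # already seen, so nothing is emitted
print(first, second, third)
```

The two segments never talk to each other directly; the ordering comes from the fact that the lister can only report a file after the writer has put it there.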

Parallel Processing with Starting New Task - front end screen timeout

I am running an ABAP program to work with a huge amount of data. The SAP documentation gives the information that I should use
Remote Function Modules with the addition STARTING NEW TASK to process the data.
So my program first selects all the data, breaks the data into packages and calls a function module with a package of data for further processing.
So that's my pseudo code:
    Select KEYFIELD from MYSAP_TABLE into table KEY_TABLE package size 500.
      append KEY_TABLE to ALL_KEYS_TABLE.
    Endselect.
    Loop at ALL_KEYS_TABLE assigning <fs_table> .
      call function 'Z_MASS_PROCESSING'
        starting new TASK 'TEST' destination in group default
        exporting
          IT_DATA = <fs_table> .
    Endloop .
But I am surprised to see that the calls to my function module run in dialog processes instead of background processes.
As a result, one of my dialog processes was killed after 60 minutes because of a timeout.
For me, it seems that STARTING NEW TASK is not the right solution for parallel processing of mass data.
What will be the alternative?
As already mentioned, that's not an easy topic that can be handled with a few lines of code. The general steps you have to carry out carefully to gain the desired benefit are:
1) Get free work processes available for parallel processing
2) Slice your data in packages to be processed
3) Call an RFC enabled function module asynchronously for each package with the available work processes. Handle waiting for free work processes, if packages > available processes
4) Receive your results asynchronously
5) Wait till everything is processed and merge the data together again and assure that every package was handled properly
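The shape of these five steps can be sketched in a language-neutral way. The following is Python, not ABAP, and it is only an illustration under assumptions: worker stands in for the RFC-enabled function module, max_workers for the number of free work processes, and the thread pool's internal queue plays the role of "waiting for a free work process" when there are more packages than workers.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def slice_packages(keys, package_size):
    # Step 2: cut the key list into packages of at most package_size entries.
    return [keys[i:i + package_size] for i in range(0, len(keys), package_size)]


def process_in_parallel(keys, worker, package_size=500, max_workers=4):
    packages = slice_packages(keys, package_size)
    results, failed = {}, {}
    # Steps 1 and 3: a fixed number of workers; extra packages wait in the pool's queue.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, pkg): idx for idx, pkg in enumerate(packages)}
        # Step 4: receive results as they complete, in whatever order they finish.
        for future in as_completed(futures):
            idx = futures[future]
            try:
                results[idx] = future.result()
            except Exception as exc:
                failed[idx] = exc  # one failed package does not abort the others
    # Step 5: merge in package order and report which packages still need handling.
    merged = [item for idx in sorted(results) for item in results[idx]]
    return merged, failed


def double_all(package):
    # Hypothetical worker: fails for the package containing key 600.
    if 600 in package:
        raise ValueError("package could not be processed")
    return [key * 2 for key in package]


merged, failed = process_in_parallel(list(range(1200)), double_all)
print(len(merged), sorted(failed))
```

The failed dictionary is the important part of step 5: the caller can retry or log exactly those packages instead of assuming everything went through.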
Although it is bad practice to just post links, the code is very long and would make this answer very messy, therefore take a look at the following links:
Example1-aRFC
Example2-aRFC
Example3-aRFC
Other RFC variants (e.g. qRFC, tRFC etc.) can be found here with a short description, but sadly I cannot give you further insight on them.
EDIT:
Regarding process type of aRFC:
In parallel processing, a job step is started as usual in a background processing work process. (...) While the job itself runs in a background process, the parallel processing tasks that it starts run in dialog work processes. Such dialog work processes may be located on any SAP server.
The server group is specified with GROUP (default: parallel_generators; see transaction RZ12) and can have its own resources reserved just for parallel processing. If a process times out, you have to slice your data into differently sized (smaller) packages.
I think the best way to do parallel processing in SAP is the Bank Parallel Processing framework, as Jagger mentioned. Unfortunately it is rarely mentioned in any resource and is not documented well.
Actually, the best documentation I found was in this book:
https://www.sap-press.com/abap-performance-tuning_2092/
Yes, it's tricky. It cost me about 5 or 6 days to get it going, but the results were good.
Everything is located in package BANK_PP_JOBCTRL, and you can use that name for googling.
The main idea there is to divide all your work into steps (simplified):
1. Preparation
2. Parallel processing
2.1. Processing preparation
2.2. Processing
(Actually there are more steps than these.)
The first step is not parallelized. Here you should prepare all your data for parallel processing and divide it into 'pieces' which will be processed in parallel.
The content of a piece, in turn, can be IDs or preloaded data.
After that, you can run step 2 in parallel processing.
A great benefit of all this is that an error in one piece of parallel work won't crash all of your processing.
I recommend you check the demo in function group BANK_API_PP_DEMO.
To implement parallel processing, you need to do a bit more than just add that clause. The information is contained in this help topic. A lot of design effort needs to be devoted to ensure that the communication and result merging overhead of the parallel processing does not negate the performance advantage gained by the parallel processing in the first place and that referential integrity of the data is maintained even when some of the parallel tasks fail. Do not under-estimate the complexity of this task.
You could make use of the bgRFC technique. This is a new method of background processing made by SAP.
bgRFC offers, in addition to what the already existing IN BACKGROUND TASK provides, the possibility to configure and monitor all calls which run through this method.
You can read more documentation about the different possibilities here. This all depends (of course) on your SAP version.

How to keep webserver responsive while executing many asynchronous background tasks

I am working on a web application that lets its users optionally execute long-running processes 'in background'. An example would be some long-running report generation, or deleting thousands of objects simultaneously.
I've implemented this using an ExecutorService defined as FixedThreadPool using a ThreadFactory. The ThreadFactory is built like this:
new ThreadFactoryBuilder()
    .setNameFormat(clientId + "-BackgroundTask-%d")
    .setDaemon(true)
    .setPriority(Thread.MIN_PRIORITY)
    .build()
I execute the task like this:
Future<TaskStatus> future = clientExecutors.get(clientId).submit(
        backgroundTask::execute);
taskFutures.put(backgroundTask.getTaskId(), future);
How can I force my web server to always prioritize handling new incoming requests (as fast as possible) over executing background tasks?
In other words: it should never ever happen that a user has to wait a long time while browsing the site just because there are a lot of background tasks executing. As you can see from above, I tried to do this by setting .setPriority(Thread.MIN_PRIORITY). However, that does not seem to be sufficient.
Furthermore, as for now, I've set some arbitrary value for the FixedThreadPool size (10) and use it globally for the entire background-handling of the application (and all its customers).
Instead I would like to define a threadpool for each customer, to make sure each customer has the same privilege to run a certain amount of tasks in the background. Say, each customer has a FixedThreadPool of size 5, and on the server I'll have a max. of 50 different customers. That would add up to 250 running background tasks at the same time.
The most important requirement here is: it does not matter, how long these background-tasks need to execute (say 2 minutes, or 20 minutes). What is important, is that each customer has the ability to send 5 tasks to be executed in background, and each of those are worked on equally.
I've tested running 30 cpu-intensive background tasks and it turns out that while these are running and cpu is near 100%, new incoming requests take a very long time to be handled.
So obviously, I am doing it wrong.
Update 12.09.2017
I've read about microservices and while it sounds great I see a great challenge in splitting the necessary parts from our monolithic application. Mostly because nearly every operation might turn into a long running process given a big enough data selection.
Furthermore, wouldn't I run into the same problem with my microservice, i.e. wouldn't the server running the microservice suffer the same performance degradation? Well, the only good thing would be that the rest of the web app would not suffer from it anymore.
I've read some posts about introducing Thread.sleep(1), or Thread.sleep in general, into CPU-heavy operations to reduce the amount of CPU used in these operations. I've also read about someone who introduced this as an aspect so that he can even change the amount of time waited dynamically in order to have some control over how much CPU is used.
However, my gut tells me that ain't right either. What do you think about introducing Thread.sleep to lower the amount of CPU used for a task? Is this common practice? If not, what would be the right approach?
I would strongly consider changing your system architecture to offload these long-running requests to a separate instance instead of running them in-process with the general request-serving application. In general I think it is an anti-pattern to handle both batch and online (or long-running and short-running) processing in the same application instance.
Ideally you'd build a standalone microservice to handle these requests, but you could also simply just deploy X instances of your existing application, and configure your load balancer to route requests to the long running invocation paths (e.g. POST /myapp/longrunningjob) only to the instances dedicated to running these long-running processes.
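The routing rule itself is small. As a sketch only (real load balancers express this in their own configuration language, and the instance names web-1, web-2 and batch-1 are invented here), the decision looks like this in Python, using the example path from above:

```python
import itertools

# Paths that start long-running work; taken from the example above.
LONG_RUNNING_PREFIXES = ("/myapp/longrunningjob",)


class Router:
    """Sends long-running invocations to dedicated instances, everything else to the rest."""

    def __init__(self, online_instances, batch_instances):
        # Simple round-robin within each pool.
        self._online = itertools.cycle(online_instances)
        self._batch = itertools.cycle(batch_instances)

    def pick(self, method, path):
        if method == "POST" and path.startswith(LONG_RUNNING_PREFIXES):
            return next(self._batch)
        return next(self._online)


router = Router(["web-1", "web-2"], ["batch-1"])
print(router.pick("GET", "/myapp/home"))             # served by an online instance
print(router.pick("POST", "/myapp/longrunningjob"))  # served by the batch instance
```

With this split, CPU saturation on the batch instances cannot slow down page requests, because those never land on the same machine.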

Computing usage of independent cores and binding a process to a core

I am working with MPI, and I have a certain hierarchy of operations. For a particular value of a parameter _param, I launch 10 trials, each running a specific process on a distinct core. For n values of _param, the code runs in a certain hierarchy as:
driver_file ->
launches one process which checks if available processes are more than 10. If more than 10 are available, then it launches an instance of a process with a specific _param value passed as an argument to coupling_file
coupling_file ->
does some elementary computation, and then launches 10 processes using MPI_Comm_spawn(), each corresponding to a trial_file while passing _trial as an argument
trial_file ->
computes work, returns values to the coupling_file
I am facing two dilemmas, namely:
1) How do I evaluate the required condition for the cores in driver_file? As in, how do I find out how many processes have terminated, so that I can correctly schedule processes on idle cores? I thought about adding a blocking MPI_Recv() and using it to pass a variable which would tell me when a certain process has finished, but I'm not sure if this is the best solution.
2) How do I ensure that processes are assigned to different cores? I had thought about using something like mpiexec --bind-to-core --bycore -n 1 coupling_file to launch one coupling_file. This would be followed by something like mpiexec --bind-to-core --bycore -n 10 trial_file launched by the coupling_file. However, if I am binding processes to a core, I don't want the same core to have two or more processes. As in, I don't want _trial_1 of _coupling_1 to run on core x, and then launch another process of coupling_2 which launches _trial_2 which also gets bound to core x.
Any input would be appreciated. Thanks!
If it is an option for you, I'd drop the spawning processes thing altogether, and instead start all processes at once.
You can then easily partition them into chunks working on a single task. A translation of your concept could for example be:
Use one master (rank 0)
Partition the rest into groups of 10 processes, maybe create a new communicator for each group if needed, each group has one leader process, known to the master.
In your code you then can do something like:
if master:
    send a specific _param to each group leader (with a non-blocking send)
    loop over all your different _params
        use MPI_Waitany or MPI_Waitsome to find groups that are ready
else:
    if groupleader:
        loop endlessly
            MPI_Recv _params from master
            coupling_file
            MPI_Bcast to group
            process trial_file
    else:
        loop endlessly
            MPI_Bcast (get data from groupleader)
            process trial_file
I think following this approach would allow you to solve both your issues. Availability of process groups gets detected by MPI_Wait*, though you might want to change the logic above so that each group notifies the master at the end of its task and the master only sends new data then, rather than while the previous trial is still running, when another process group might become free sooner. And pinning is resolved because you have a fixed number of processes, which can be properly pinned during the usual startup.
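The partitioning into one master plus groups of 10 is plain rank arithmetic. Here is a small Python sketch of one possible numbering, assuming rank 0 is the master and the first rank of each group acts as its leader; the function names are made up. In a real MPI program the group index could be used as the color argument of MPI_Comm_split to create the per-group communicators.

```python
def role_for_rank(rank, group_size=10):
    """Return (role, group_index) for a rank; rank 0 is the master and has no group."""
    if rank == 0:
        return ("master", None)
    group = (rank - 1) // group_size          # ranks 1..10 -> group 0, 11..20 -> group 1, ...
    is_leader = (rank - 1) % group_size == 0  # the first rank of each group leads it
    return ("leader" if is_leader else "worker", group)


def leader_ranks(world_size, group_size=10):
    """Ranks the master sends _param values to."""
    return list(range(1, world_size, group_size))


# With 31 processes: one master and three groups of 10.
print(role_for_rank(0), role_for_rank(1), role_for_rank(10), role_for_rank(11))
print(leader_ranks(31))
```

Since every process exists from startup and its role is a pure function of its rank, the usual binding options of the launcher pin each one to its own core once, and no core is handed out twice.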

Resources