NiFi - Parallel and concurrent execution with ExecuteStreamCommand

Currently, I have NiFi running on an edge node that has 4 cores. Say I have 20 incoming flowfiles and I set Concurrent Tasks to 10 on an ExecuteStreamCommand processor; does that mean I get only concurrent execution, or both concurrent and parallel execution?

In this case you get concurrency and parallelism, as noted in the Apache NiFi User Guide:
Next, the Scheduling Tab provides a configuration option named
Concurrent tasks. This controls how many threads the Processor will
use. Said a different way, this controls how many FlowFiles should be
processed by this Processor at the same time. Increasing this value
will typically allow the Processor to handle more data in the same
amount of time. However, it does this by using system resources that
then are not usable by other Processors. This essentially provides a
relative weighting of Processors — it controls how much of the
system’s resources should be allocated to this Processor instead of
other Processors. This field is available for most Processors. There
are, however, some types of Processors that can only be scheduled with
a single Concurrent task.
If there are locking issues or race conditions with the command you are invoking, this could be problematic, but if they are independent, you are only limited by JVM scheduling and hardware performance.
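As a rough analogy (a plain Java sketch of a fixed thread pool, not NiFi internals), you can picture the Concurrent tasks setting as the size of a worker pool draining the incoming queue: the pool gives you concurrency, and the cores underneath determine how much of it actually runs in parallel.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy illustration: 10 "concurrent tasks" processing 20 queued flowfiles.
// On a 4-core machine all 10 threads exist concurrently, but only ~4 of them
// make progress in parallel at any instant; the OS time-slices the rest.
public class ConcurrentTasksSketch {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(10); // "Concurrent tasks" = 10
        for (int i = 0; i < 20; i++) {                           // 20 queued flowfiles
            final int flowFile = i;
            pool.submit(() -> {
                System.out.printf("flowfile %d on %s%n", flowFile, Thread.currentThread().getName());
                // stand-in for the external command ExecuteStreamCommand would invoke
                try { TimeUnit.SECONDS.sleep(1); } catch (InterruptedException ignored) { }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}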
Response to a question in the comments (too long for a comment):
Question:
Thanks Andy. When there are 4 cores, can I assume that there will be 4 parallel executions, within which multiple threads run to handle the 10 concurrent tasks? In the best possible way, how are these 20 flowfiles executed in the scenario I mentioned? – John
Response:
John, JVM thread handling is a fairly complex topic, but yes, in general there would be n+C JVM threads, where C is some constant (main thread, VM thread, GC threads) and n is the number of "individual" threads created by the flow controller to execute the processor tasks. JVM threads map 1:1 to native OS threads, so on a 4-core system with 10 processor threads running, you would have "4 parallel executions". My belief is that, at a high level, your OS would use time slicing to cycle through the 10 threads 4 at a time, and each thread would process ~2 flowfiles.
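A quick way to see the numbers involved on your own machine (again a plain JVM sketch, not NiFi code):

// Prints the hardware parallelism the scheduler has to work with versus the
// number of worker threads requested.
public class ThreadCountSketch {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors(); // hardware threads the JVM sees, e.g. 4
        int concurrentTasks = 10;                               // worker threads the processor would request
        System.out.printf("cores=%d, requested workers=%d, truly parallel at once=%d%n",
                cores, concurrentTasks, Math.min(cores, concurrentTasks));
        // Each Java thread maps 1:1 to a native OS thread, so at most `cores` of the
        // 10 workers make progress at the same instant; the rest are time-sliced.
    }
}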
Again, very rough idea (assume 1 flowfile = 1 unit of work = 1 second):
Cores | Threads | Flowfiles/thread | Relative time
1 | 1 | 20 | 20 s (normal)
4 | 1 | 20 | 20 s (wasting 3 cores)
1 | 4 | 5 | 20 s (time slicing 1 core for 4 threads)
4 | 4 | 5 | 5 s (1:1 thread to core ratio)
4 | 10 | 2 | 5+x s (see execution table below)
If we are assuming each core can handle one thread, and each thread can handle 1 flowfile per second, and each thread gets 1 second of uninterrupted operation (obviously not real), the execution sequence might look like this:
Flowfiles A - T
Cores α, β, γ, δ
Threads 1 - 10
Time/thread 1 s
Time | Core α | Core β | Core γ | Core δ
0 | 1/A | 2/B | 3/C | 4/D
1 | 5/E | 6/F | 7/G | 8/H
2 | 9/I | 10/J | 1/K | 2/L
3 | 3/M | 4/N | 5/O | 6/P
4 | 7/Q | 8/R | 9/S | 10/T
In 5 seconds, all 10 threads have executed twice, each completing 2 flowfiles.
However, assume the thread scheduler only assigns each thread a cycle of .5 seconds each iteration (again, not a realistic number, just to demonstrate). The execution pattern then would be:
Flowfiles A - T
Cores α, β, γ, δ
Threads 1 - 10
Time/thread .5 s
Time | Core α | Core β | Core γ | Core δ
0 | 1/A | 2/B | 3/C | 4/D
.5 | 5/E | 6/F | 7/G | 8/H
1 | 9/I | 10/J | 1/A | 2/B
1.5 | 3/C | 4/D | 5/E | 6/F
2 | 7/G | 8/H | 9/I | 10/J
2.5 | 1/K | 2/L | 3/M | 4/N
3 | 5/O | 6/P | 7/Q | 8/R
3.5 | 9/S | 10/T | 1/K | 2/L
4 | 3/M | 4/N | 5/O | 6/P
4.5 | 7/Q | 8/R | 9/S | 10/T
In this case, the total execution time is the same (* there is some overhead from the thread switching) but specific flowfiles take "longer" (total time from 0, not active execution time) to complete. For example, flowfiles C and D are not complete until time=2 in the second scenario, but are complete at time=1 in the first.
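If you want to play with these numbers, here is a small, purely illustrative Java simulation of the 0.5 s time-slice table above. It is my own toy model of round-robin scheduling, not how a real OS or JVM scheduler works:

// Toy round-robin simulation: 4 cores, 10 threads, 20 flowfiles, 1 s of work
// per flowfile, 0.5 s time slices.
public class TimeSliceSimulation {
    public static void main(String[] args) {
        int cores = 4, threads = 10, flowfiles = 20;
        double slice = 0.5, workPerFlowfile = 1.0;
        double[] remaining = new double[threads];   // work left on each thread's current flowfile
        int[] current = new int[threads];           // which flowfile each thread holds (-1 = none)
        int nextFlowfile = 0, completed = 0, nextThread = 0;
        for (int t = 0; t < threads; t++) { current[t] = nextFlowfile++; remaining[t] = workPerFlowfile; }
        double time = 0;
        while (completed < flowfiles) {
            for (int c = 0; c < cores; c++) {       // each core runs one thread for one slice
                int t = nextThread; nextThread = (nextThread + 1) % threads;
                if (current[t] < 0) continue;       // this thread has nothing left to do
                remaining[t] -= slice;
                if (remaining[t] <= 0) {
                    completed++;
                    System.out.printf("t=%.1f s: thread %d finished flowfile %c%n",
                            time + slice, t + 1, 'A' + current[t]);
                    current[t] = nextFlowfile < flowfiles ? nextFlowfile++ : -1;
                    remaining[t] = workPerFlowfile;
                }
            }
            time += slice;
        }
        System.out.printf("all %d flowfiles done around t=%.1f s%n", flowfiles, time);
    }
}

Running it reproduces the completion times in the table: A and B finish at t=1.5, C and D at t=2.0, and everything is done around t=5.0 (ignoring thread-switching overhead).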
To be honest, the OS and JVM have people much smarter than me working on this, as does our project (luckily), so there are gross over-simplifications here, and in general I would recommend you let the system worry about hyper-optimizing the threading. I would not expect setting the concurrent tasks to 10 to yield vast improvements over setting it to 4 in this case.
I just did a quick test in my local 1.5.0 development branch -- I connected a simple GenerateFlowFile running with 0 sec schedule to a LogAttribute processor. The GenerateFlowFile immediately generates so many flowfiles that the queue enables the back pressure feature (pausing the input processor until the queue can drain some of the 10,000 waiting flowfiles). I stopped both and re-ran this, giving the LogAttribute processor more concurrent tasks. By setting the LogAttribute concurrent tasks to 2:1 of the GenerateFlowFile, the queue never built up past about 50 queued flowfiles.
tl;dr Setting your concurrent tasks to the number of cores you have should be sufficient.
Update 2:
Checked with one of our resident JVM experts and he mentioned two things to note:
The command is not solely CPU limited; if I/O is heavy, more concurrent tasks may be beneficial.
The max number of concurrent tasks for the entire flow controller is set to 10 by default.

Related

What is the "spool time" of Intel Turbo Boost?

Just like a turbo engine has "turbo lag" due to the time it takes for the turbo to spool up, I'm curious what is the "turbo lag" in Intel processors.
For instance, the i9-8950HK in my MacBook Pro 15" 2018 (running macOS Catalina 10.15.7) usually sits around 1.3 GHz when idle, but when I run a CPU-intensive program, the CPU frequency shoots up to, say, 4.3 GHz or so (initially). The question is: how long does it take to go from 1.3 to 4.3 GHz? 1 microsecond? 1 millisecond? 100 milliseconds?
I'm not even sure this is up to the hardware or the operating system.
This is in the context of benchmarking some CPU-intensive code which takes a few tens of milliseconds to run. The thing is, right before this piece of CPU-intensive code is run, the CPU is essentially idle (and thus the clock speed will drop down to, say, 1.3 GHz). I'm wondering what slice of my benchmark is running at 1.3 GHz and what is running at 4.3 GHz: 1%/99%? 10%/90%? 50%/50%? Or even worse?
Depending on the answer, I'm thinking it would make sense to run some CPU-intensive code prior to starting the benchmark as a way to "spool up" Turbo Boost. And this leads to another question: for how long should I run this "spooling-up" code? Probably one second is enough, but what if I'm trying to minimize this -- what's a safe amount of time for "spooling-up" code to run, to make sure the CPU will run the main code at the maximum frequency from the very first instruction executed?
The paper Evaluation of CPU frequency transition latency presents transition latencies of various Intel processors. In brief, the latency depends on the state the core is currently in and on the target state. For the evaluated Ivy Bridge processor (i7-3770 @ 3.4 GHz) the latencies varied from 23 microseconds (1.6 GHz -> 1.7 GHz) to 52 microseconds (2.0 GHz -> 3.4 GHz).
At the Hot Chips 2020 conference, a major transition-latency improvement in the upcoming Ice Lake processor was presented, which should mostly benefit partially vectorised code that uses AVX-512 instructions. Since these instructions do not support frequencies as high as SSE or AVX2 instructions, an isolated island of such instructions causes the processor frequency to scale down and then back up.
Pre-heating the processor obviously makes sense, as does "pre-heating" memory. One second of a prior workload is enough to reach the highest available turbo frequency, but you should also take into account the temperature of the processor, which may scale the frequency down (actually the CPU core and uncore frequencies, if we are speaking about one of the latest Intel processors). You will not reach the temperature limit within a second. It depends on what you want your benchmark to measure and whether you want to take the temperature limit into account. When speaking about the temperature limit, be aware that your processor also has a power limit, which is another possible reason for the frequency to scale down during the application run.
Another thing to take into account when benchmarking your code is that its runtime is very short, so be aware of the reliability of the runtime/resource-consumption measurement. I would suggest artificially extending the runtime (run the code 10 times and measure the overall consumption) for better results.
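As a minimal sketch of that suggestion (the workload method below is only a hypothetical stand-in for the code under test):

// Measure many repetitions and report the average, so timer overhead and
// frequency ramp-up are amortized across runs.
public class RepeatedBenchmark {
    static volatile long sink; // keeps the JIT from eliminating the work

    static void runWorkload() {           // placeholder for your CPU-intensive code
        long acc = 0;
        for (int i = 0; i < 5_000_000; i++) acc += i * 31L;
        sink = acc;
    }

    public static void main(String[] args) {
        int repetitions = 10;
        long start = System.nanoTime();
        for (int i = 0; i < repetitions; i++) runWorkload();
        long elapsed = System.nanoTime() - start;
        System.out.printf("average per run: %.2f ms%n", elapsed / 1e6 / repetitions);
    }
}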
I wrote some code to check this, with the aid of the Intel Power Gadget API. It sleeps for one second (so the CPU goes back to its slowest speed), measures the clock speed, runs some code for a given amount of time, then measures the clock speed again.
I only tried this on my 2018 15" MacBook Pro (i9-8950HK CPU) running macOS Catalina 10.15.7. The specific CPU-intensive code being run between clock speed measurements may also influence the result (is it integer only? FP? SSE? AVX? AVX-512?), so don't take these as exact numbers, but only order-of-magnitude/ballpark figures. I have no idea how the results translate into different hardware/OS/code combinations.
The minimum clock speed when idle in my configuration is 1.3 GHz. Here are the results I obtained, in tabular form.
+--------+-------------+
| T (ms) | Final clock |
| | speed (GHz) |
+--------+-------------+
| <1 | 1.3 |
| 1..3 | 2.0 |
| 4..7 | 2.5 |
| 8..10 | 2.9 |
| 10..20 | 3.0 |
| 25 | 3.0-3.1 |
| 35 | 3.3-3.5 |
| 45 | 3.5-3.7 |
| 55 | 4.0-4.2 |
| 66 | 4.6-4.7 |
+--------+-------------+
So 1 ms appears to be the minimum amount of time to get any kind of change. 10 ms gets the CPU to its nominal frequency, and from then on the ramp-up is slower, apparently taking over 50 ms to reach the maximum turbo frequencies.
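Based on those ballpark figures, a "spool-up" phase of very roughly 100 ms of busy work before the measured region should bring the CPU close to its maximum turbo frequency. The sketch below assumes that figure; adjust it for your own machine.

// Spin on the CPU for a while before the measured region so the benchmark
// starts closer to the maximum turbo frequency.
public class TurboWarmup {
    static volatile long sink; // keeps the JIT from removing the busy loop

    static void spoolUp(long millis) {
        long deadline = System.nanoTime() + millis * 1_000_000L;
        long acc = 0;
        while (System.nanoTime() < deadline) acc = acc * 31 + 1; // pointless but busy
        sink = acc;
    }

    public static void main(String[] args) {
        spoolUp(100);                      // assumed warm-up time, machine dependent
        long start = System.nanoTime();
        // ... benchmarked code goes here ...
        long elapsed = System.nanoTime() - start;
        System.out.printf("measured region: %.3f ms%n", elapsed / 1e6);
    }
}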

Memory Allocation Algorithm (non-contiguous)

Just a curious question though!
While going through various techniques of memory allocation and management, especially paging, in which fixed-size blocks are assigned to a particular request, everything goes fine until a process exits memory, leaving behind gaps that force non-contiguous allocation for other processes. In that case the page table data structure keeps track of the mapping from page number to frame number.
Could an algorithm be designed so that pages are always allocated right after the last allocated page in memory, while dynamically shifting pages down at short intervals to cover the empty space left when a process frees pages (from somewhere in the middle), thus maintaining one contiguous run of allocated memory at any given time? This would preserve contiguous memory allocation for processes and might enable quicker memory access.
For example:
----------                                    ----------
| Page 1 |   After Page 2 is deallocated      | Page 1 |
----------   rather than assigning            ----------
| Page 2 |   the space to some other process  | Page 3 |
----------   in a non-contiguous fashion,     ----------
| Page 3 |   there can be something           | Page 4 |
----------   like this -->                    ----------
| Page 4 |                                    |        |
----------                                    ----------
The point is that the memory allocation can be contiguous and always after the last allocated page.
I would appreciate being told about the design flaws, or the parameters one has to take care of, when thinking about any such algorithm.
This is called 'compacting' garbage collection, usually part of a mark-compact garbage collection algorithm: https://en.wikipedia.org/wiki/Mark-compact_algorithm
As with any garbage collector, the collection/compaction is the easy part. The hard part is letting the program that you're collecting for continue to work as you're moving its memory around.
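As a toy sketch of the sliding step from the diagram above (in Java, and deliberately ignoring the hard part: fixing up the page tables and pointers that still refer to the old frames):

import java.util.Arrays;

// Frames hold page IDs; 0 marks a free frame. Compaction slides all live pages
// down so the free space becomes one contiguous region at the end.
public class CompactSketch {
    static int compact(int[] frames) {
        int write = 0;
        for (int read = 0; read < frames.length; read++) {
            if (frames[read] != 0) {                  // live page: slide it down
                frames[write++] = frames[read];
            }
        }
        Arrays.fill(frames, write, frames.length, 0); // tail becomes contiguous free space
        return write;                                 // index of the first free frame
    }

    public static void main(String[] args) {
        int[] frames = {1, 0, 3, 4};                  // page 2 was freed, leaving a hole
        int firstFree = compact(frames);
        System.out.println(Arrays.toString(frames) + ", next allocation at frame " + firstFree);
        // prints [1, 3, 4, 0], next allocation at frame 3
    }
}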

Too big efficiency - parallel computing

This is only a theoretical question.
In parallel computing, is it possible to achieve efficiency greater than 100%?
E.g. 125% efficiency:
+-------------+------+
| Processors | Time |
+-------------+------+
| 1 | 10s |
| 2 | 4s |
+-------------+------+
I don't mean situations where the parallel environment is configured wrong or there is some bug in the code.
Efficiency definition:
https://stackoverflow.com/a/13211093/2265932
Yes, it is possible. It is called superlinear speedup, and it is usually caused by improved cache usage, though the resulting efficiency is usually less than 125%.
See, for example, Where does super-linear speedup come from?
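Using the efficiency definition linked in the question, E = T1 / (p * Tp), the numbers from the table indeed work out to 125%:

// Efficiency for the question's example: T1 = 10 s on 1 processor, Tp = 4 s on p = 2.
public class Efficiency {
    static double efficiency(double t1, int processors, double tp) {
        return t1 / (processors * tp);
    }

    public static void main(String[] args) {
        System.out.printf("speedup    = %.2f%n", 10.0 / 4.0);                   // 2.50 (superlinear)
        System.out.printf("efficiency = %.0f%%%n", efficiency(10, 2, 4) * 100); // 125%
    }
}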

Independent memory channels: What does it mean for a programmer

I read the Datasheet for an Intel Xeon Processor and saw the following:
The Integrated Memory Controller (IMC) supports DDR3 protocols with four
independent 64-bit memory channels with 8 bits of ECC for each channel (total of
72-bits) and supports 1 to 3 DIMMs per channel depending on the type of memory
installed.
I need to know what exactly this means from a programmer's point of view.
The documentation on this seems to be rather sparse and I don't have someone from Intel at hand to ask ;)
Can this memory controller execute 4 loads of data simultaneously from non-adjacent memory regions (and fetch each piece of data from up to 3 memory DIMMs)? I.e. 4x64 bits, striped from up to 3 DIMMs, e.g.:
| X | _ | X | _ | X | _ | X |
(X is loaded data, _ an arbitrarily large region of unloaded data)
Or can this IMC execute one load which will fetch up to 1x256 bits from a contiguous memory region?
| X | X | X | X | _ | _ | _ | _ |
This seems to be implementation specific, depending on the compiler, OS, and memory controller. The standard is available at http://www.jedec.org/standards-documents/docs/jesd-79-3d . It seems that if your controller is fully compliant, there are specific bits that can be set to indicate interleaved or non-interleaved mode. See pages 24, 25, and 143 of the DDR3 spec, but even in the spec the details are light.
For the i7/i5/i3 series specifically, and likely all newer Intel chips, the memory is interleaved as in your first example. For these newer chips, and presumably a compiler that supports it, yes, one Asm/C/C++-level call to load something large enough to be interleaved/striped would initiate the required number of independent hardware-level loads, one to each channel of memory.
In the triple-channel section of the Multi-channel memory architecture page on Wikipedia there is a small (and likely incomplete) list of CPUs that do this: http://en.wikipedia.org/wiki/Multi-channel_memory_architecture
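From the programmer's point of view the interleaving is transparent: you cannot target a particular channel, you can only choose access patterns that keep the memory system busy. A crude (and heavily confounded) Java micro-benchmark like the one below is about as close as you can get to observing the effect from user code; any difference between the two timings also mixes in caches, prefetching, and TLB behaviour, so treat it only as a hint:

// Rough comparison of sequential vs. page-strided reads over an array much
// larger than the caches. Both loops touch every element exactly once.
public class StrideSketch {
    public static void main(String[] args) {
        int[] data = new int[1 << 25];      // 128 MiB, far larger than any cache
        long sink = 0;

        long t0 = System.nanoTime();
        for (int i = 0; i < data.length; i++) sink += data[i];              // sequential
        long sequential = System.nanoTime() - t0;

        t0 = System.nanoTime();
        int stride = 4096 / Integer.BYTES;                                  // one page apart
        for (int s = 0; s < stride; s++)
            for (int i = s; i < data.length; i += stride) sink += data[i];  // strided
        long strided = System.nanoTime() - t0;

        System.out.printf("sequential %.1f ms, strided %.1f ms (sink=%d)%n",
                sequential / 1e6, strided / 1e6, sink);
    }
}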

What does % CPU utilization mean in application runtime statistics?

When I run an application and, at the same time, use a runtime evaluator to profile my program, I get many statistics at the end of the process. One of these is the CPU utilization.
Typically, what is the CPU utilization? You might tell me it is the percentage calculated by dividing the total time the process spent on the CPU by the overall simulation time. Unfortunately, my statistics go deeper than that: the program I am using is very precise and gives me a chart of the CPU utilization over time.
So in my chart I have time on the x axis and the CPU utilization in % on the y axis.
So something like this:
cpu%
^
|
|
|       *
|      * *         *
|     *   *       *
|    *     *     *
|   *       *   *
|  *         * *
| *           *
-------------------------------------------------> time
So, what does it mean? How should I interpret the following sentence?
"The CPU utilization percentage for process 'MyProcess' at time '5.23 s' is 12%"
It's the percent of the CPU's cycles being spent on your process. If your process was using 100% of the CPU's possible time it would be 100%. At 12% it's probably waiting on I/O or something like that. See http://en.wikipedia.org/wiki/CPU_usage.
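To make that concrete, here is a rough Java sketch of how such a number could be sampled for the current process (assuming the JVM supports per-thread CPU time measurement): CPU time consumed during an interval, divided by the interval length times the number of cores.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Samples CPU utilization of this JVM over a one-second interval by summing
// the CPU time of its Java threads (GC/JIT threads are not included).
public class CpuUtilSketch {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        int cores = Runtime.getRuntime().availableProcessors();

        long cpu0 = totalCpuTime(threads);
        long wall0 = System.nanoTime();
        Thread.sleep(1000);                          // sampling interval; mostly idle here
        long cpuUsed = totalCpuTime(threads) - cpu0;
        long wallElapsed = System.nanoTime() - wall0;

        double utilization = 100.0 * cpuUsed / (wallElapsed * (double) cores);
        System.out.printf("~%.1f%% of available CPU cycles used in this interval%n", utilization);
    }

    static long totalCpuTime(ThreadMXBean bean) {
        long sum = 0;
        for (long id : bean.getAllThreadIds()) {
            long t = bean.getThreadCpuTime(id);      // -1 if unsupported or thread has died
            if (t > 0) sum += t;
        }
        return sum;
    }
}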

Resources