Too big efficiency - parallel computing

This is only a theoretical question.
In parallel computing, is it possible to achieve efficiency greater than 100%?
E.g. 125% efficiency:
+------------+------+
| Processors | Time |
+------------+------+
| 1          | 10 s |
| 2          | 4 s  |
+------------+------+
I don't mean situations where the parallel environment is configured incorrectly or there is some bug in the code.
Efficiency definition:
https://stackoverflow.com/a/13211093/2265932

Yes, it is possible. It is called superlinear speedup, and it is usually caused by improved cache usage: with more processors, more of the working set fits into the combined caches. Though it is usually less than 125%.
See, for example, Where does super-linear speedup come from?
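Plugging the numbers from the table above into the linked definition (speedup divided by the number of processors), this is exactly such a superlinear case:

$$S = \frac{T_1}{T_p} = \frac{10\,\mathrm{s}}{4\,\mathrm{s}} = 2.5, \qquad E = \frac{S}{p} = \frac{2.5}{2} = 1.25 = 125\%$$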

What is the "spool time" of Intel Turbo Boost?

Just like a turbo engine has "turbo lag" due to the time it takes for the turbo to spool up, I'm curious what is the "turbo lag" in Intel processors.
For instance, the i9-8950HK in my MacBook Pro 15" 2018 (running macOS Catalina 10.15.7) usually sits around 1.3 GHz when idle, but when I run a CPU-intensive program, the CPU frequency shoots up to, say, 4.3 GHz or so (initially). The question is: how long does it take to go from 1.3 to 4.3 GHz? 1 microsecond? 1 millisecond? 100 milliseconds?
I'm not even sure this is up to the hardware or the operating system.
This is in the context of benchmarking some CPU-intensive code which takes a few tens of milliseconds to run. The thing is, right before this piece of CPU-intensive code is run, the CPU is essentially idle (and thus the clock speed will have dropped down to, say, 1.3 GHz). I'm wondering what slice of my benchmark is running at 1.3 GHz and what slice is running at 4.3 GHz: 1%/99%? 10%/90%? 50%/50%? Or even worse?
Depending on the answer, I'm thinking it would make sense to run some CPU-intensive code prior to starting the benchmark as a way to "spool up" TurboBoost. And this leads to another question: for how long should I run this "spooling-up" code? Probably one second is enough, but what if I'm trying to minimize this -- what's a safe amount of time for "spooling-up" code to run, to make sure the CPU will run the main code at the maximum frequency from the very first instruction executed?
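One straightforward way to do that spool-up (a minimal sketch, not tied to any particular benchmark harness; RunBenchmark and the 100 ms duration are placeholders to be tuned against measurements like the ones in the answers below):

using System;
using System.Diagnostics;

static class TurboWarmup
{
    // Keep one core busy for a fixed interval so it can ramp up to its
    // turbo frequency before the real measurement starts.
    public static double SpoolUp(TimeSpan duration)
    {
        var sw = Stopwatch.StartNew();
        double x = 1.0;
        while (sw.Elapsed < duration)
            x = Math.Sqrt(x + 1.234567);   // arbitrary dependent FP work
        return x;                          // returned so the loop cannot be optimized away
    }
}

// Usage, right before the benchmark proper:
//   TurboWarmup.SpoolUp(TimeSpan.FromMilliseconds(100));   // placeholder duration
//   RunBenchmark();                                        // hypothetical benchmark entry point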
The paper Evaluation of CPU frequency transition latency presents transition latencies of various Intel processors. In brief, the latency depends on the state the core is currently in and on the target state. For the evaluated Ivy Bridge processor (i7-3770 @ 3.4 GHz), the latencies varied from 23 µs (1.6 GHz -> 1.7 GHz) to 52 µs (2.0 GHz -> 3.4 GHz).
At the Hot Chips 2020 conference, a major transition-latency improvement in the upcoming Ice Lake processors was presented. It should matter mostly for partially vectorised code that uses AVX-512 instructions: these instructions do not support frequencies as high as SSE or AVX2 instructions, so an isolated island of AVX-512 code causes the processor frequency to scale down and then back up afterwards.
Pre-heating the processor obviously makes sense, as does "pre-heating" memory. One second of a prior workload is enough to reach the highest available turbo frequency. However, you should also take into account the temperature of the processor, which may scale the frequency down (CPU core and uncore frequencies, in the case of recent Intel processors), and you cannot reach the temperature limit within a second. Whether that matters depends on what you want your benchmark to measure and whether the temperature limit should be part of it. Be aware that the processor also has a power limit, which is another possible reason for the frequency being scaled down during the application run.
Another thing you should take into account when benchmarking your code is that its runtime is very short, which makes the runtime/resource-consumption measurement less reliable. I would suggest artificially extending the runtime (e.g. run the code 10 times and measure the overall consumption) for better results.
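A sketch of that suggestion (runWorkload is a stand-in for whatever code is being benchmarked):

using System;
using System.Diagnostics;

static class RepeatedBenchmark
{
    // Run the workload several times and report the average per-run time;
    // this smooths out frequency-scaling and scheduling noise for workloads
    // that only take a few tens of milliseconds.
    public static double AverageMilliseconds(Action runWorkload, int repetitions = 10)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < repetitions; i++)
            runWorkload();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds / repetitions;
    }
}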
I wrote some code to check this, with the aid of the Intel Power Gadget API. It sleeps for one second (so the CPU goes back to its slowest speed), measures the clock speed, runs some code for a given amount of time, then measures the clock speed again.
I only tried this on my 2018 15" MacBook Pro (i9-8950HK CPU) running macOS Catalina 10.15.7. The specific CPU-intensive code being run between clock speed measurements may also influence the result (is it integer only? FP? SSE? AVX? AVX-512?), so don't take these as exact numbers, but only order-of-magnitude/ballpark figures. I have no idea how the results translate into different hardware/OS/code combinations.
The minimum clock speed when idle in my configuration is 1.3 GHz. Here are the results I obtained, in tabular form (T is how long the CPU-intensive code runs between the two clock-speed measurements):
+--------+-------------+
| T (ms) | Final clock |
|        | speed (GHz) |
+--------+-------------+
| <1     | 1.3         |
| 1..3   | 2.0         |
| 4..7   | 2.5         |
| 8..10  | 2.9         |
| 10..20 | 3.0         |
| 25     | 3.0-3.1     |
| 35     | 3.3-3.5     |
| 45     | 3.5-3.7     |
| 55     | 4.0-4.2     |
| 66     | 4.6-4.7     |
+--------+-------------+
So 1 ms appears to be the minimum amount of time needed to get any change at all. Around 10 ms gets the CPU to its nominal frequency; from then on the ramp is slower, apparently taking over 50 ms to reach the maximum turbo frequencies.

NiFi - Parallel and concurrent execution with ExecuteStreamCommand

Currently, I have NiFi running on an edge node that has 4 cores. Say I have 20 incoming flowfiles and I set Concurrent Tasks to 10 for an ExecuteStreamCommand processor: does that mean I get only concurrent execution, or both concurrent and parallel execution?
In this case you get concurrency and parallelism, as noted in the Apache NiFi User Guide (emphasis added):
Next, the Scheduling Tab provides a configuration option named
Concurrent tasks. This controls how many threads the Processor will
use. Said a different way, this controls how many FlowFiles should be
processed by this Processor at the same time. Increasing this value
will typically allow the Processor to handle more data in the same
amount of time. However, it does this by using system resources that
then are not usable by other Processors. This essentially provides a
relative weighting of Processors — it controls how much of the
system’s resources should be allocated to this Processor instead of
other Processors. This field is available for most Processors. There
are, however, some types of Processors that can only be scheduled with
a single Concurrent task.
If there are locking issues or race conditions with the command you are invoking, this could be problematic, but if they are independent, you are only limited by JVM scheduling and hardware performance.
Response to a question in the comments (too long for a comment):
Question:
Thanks Andy. When there are 4 cores, can I assume that there will be
4 parallel executions, within which multiple threads run to handle
the 10 concurrent tasks? In the best possible case, how are these 20
flowfiles executed in the scenario I mentioned? – John
Response:
John, JVM thread handling is a fairly complex topic, but yes, in general there would be n+C JVM threads, where C is some constant (main thread, VM thread, GC threads) and n is the number of "individual" threads created by the flow controller to execute the processor tasks. JVM threads map 1:1 to native OS threads, so on a 4-core system with 10 processor threads running, you would have "4 parallel executions". My belief is that, at a high level, your OS would use time slicing to cycle through the 10 threads 4 at a time, and each thread would process ~2 flowfiles.
Again, very rough idea (assume 1 flowfile = 1 unit of work = 1 second):
Cores | Threads | Flowfiles/thread | Relative time
  1   |    1    |        20        | 20 s (normal)
  4   |    1    |        20        | 20 s (wasting 3 cores)
  1   |    4    |         5        | 20 s (time slicing 1 core for 4 threads)
  4   |    4    |         5        |  5 s (1:1 thread to core ratio)
  4   |   10    |         2        | 5+x s (see execution table below)
If we are assuming each core can handle one thread, each thread can handle 1 flowfile per second, and each thread gets 1 second of uninterrupted operation (obviously not realistic), the execution sequence might look like this:
Flowfiles:   A - T
Cores:       α, β, γ, δ
Threads:     1 - 10
Time/thread: 1 s
Time | Core α | Core β | Core γ | Core δ
 0   |  1/A   |  2/B   |  3/C   |  4/D
 1   |  5/E   |  6/F   |  7/G   |  8/H
 2   |  9/I   |  10/J  |  1/K   |  2/L
 3   |  3/M   |  4/N   |  5/O   |  6/P
 4   |  7/Q   |  8/R   |  9/S   |  10/T
In 5 seconds, all 10 threads have executed twice, each completing 2 flowfiles.
However, assume the thread scheduler only assigns each thread a time slice of 0.5 seconds per iteration (again, not a realistic number, just to demonstrate). The execution pattern would then be:
Flowfiles:   A - T
Cores:       α, β, γ, δ
Threads:     1 - 10
Time/thread: 0.5 s
Time | Core α | Core β | Core γ | Core δ
 0   |  1/A   |  2/B   |  3/C   |  4/D
 0.5 |  5/E   |  6/F   |  7/G   |  8/H
 1   |  9/I   |  10/J  |  1/A   |  2/B
 1.5 |  3/C   |  4/D   |  5/E   |  6/F
 2   |  7/G   |  8/H   |  9/I   |  10/J
 2.5 |  1/K   |  2/L   |  3/M   |  4/N
 3   |  5/O   |  6/P   |  7/Q   |  8/R
 3.5 |  9/S   |  10/T  |  1/K   |  2/L
 4   |  3/M   |  4/N   |  5/O   |  6/P
 4.5 |  7/Q   |  8/R   |  9/S   |  10/T
In this case, the total execution time is the same (though there is some overhead from the thread switching), but specific flowfiles take "longer" (total time from 0, not active execution time) to complete. For example, flowfiles C and D are not complete until time=2 in the second scenario, but are complete at time=1 in the first.
To be honest, the OS and JVM have people much smarter than me working on this, as does our project (luckily), so there are gross over-simplifications here and in general I would recommend you let the system worry about hyper-optimizing the threading. I would not think setting the concurrent tasks to 10 would yield vast improvements over setting it to 4 in this case. You can read more about JVM threading here and here.
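To make the concurrency-vs-parallelism point concrete outside of NiFi, here is a toy sketch (not NiFi or JVM code; just a generic pool of 10 worker threads draining a queue of 20 one-second work items, standing in for the flowfiles and the invoked command):

using System;
using System.Collections.Concurrent;
using System.Threading;

static class WorkerPoolDemo
{
    public static void Main()
    {
        // 20 units of work ("flowfiles") in a shared queue.
        var queue = new ConcurrentQueue<int>();
        for (int i = 0; i < 20; i++) queue.Enqueue(i);

        // 10 worker threads ("concurrent tasks"). On a 4-core box the OS
        // time-slices these 10 threads across the 4 cores.
        var workers = new Thread[10];
        for (int w = 0; w < workers.Length; w++)
        {
            workers[w] = new Thread(() =>
            {
                while (queue.TryDequeue(out int flowfile))
                {
                    Thread.Sleep(1000);   // stand-in for 1 s of real work per flowfile
                    Console.WriteLine($"flowfile {flowfile} done on thread {Thread.CurrentThread.ManagedThreadId}");
                }
            });
            workers[w].Start();
        }
        foreach (var t in workers) t.Join();
    }
}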
I just did a quick test in my local 1.5.0 development branch -- I connected a simple GenerateFlowFile running with 0 sec schedule to a LogAttribute processor. The GenerateFlowFile immediately generates so many flowfiles that the queue enables the back pressure feature (pausing the input processor until the queue can drain some of the 10,000 waiting flowfiles). I stopped both and re-ran this, giving the LogAttribute processor more concurrent tasks. By setting the LogAttribute concurrent tasks to 2:1 of the GenerateFlowFile, the queue never built up past about 50 queued flowfiles.
tl;dr Setting your concurrent tasks to the number of cores you have should be sufficient.
Update 2:
Checked with one of our resident JVM experts and he mentioned two things to note:
The command is not solely CPU limited; if I/O is heavy, more concurrent tasks may be beneficial.
The max number of concurrent tasks for the entire flow controller is set to 10 by default.

Memory Allocation Algorithm (non-contiguous)

Just a curious question though!
While going through various techniques of memory allocation and management, especially paging, in which a fixed-size block is assigned to a particular request, everything goes fine until a process exits and leaves behind holes that force non-contiguous allocation for other processes. For that case, the page table data structure keeps track of the mapping from page number to frame number.
Could an algorithm be designed so that pages are always allocated after the last allocated page in memory, while the empty page space left behind by freed processes (somewhere in the middle) is shifted over and covered at regular intervals, maintaining a contiguous run of memory at all times? This could preserve contiguous memory allocation for processes and thus enable quicker memory access.
For example:
----------                                     ----------
| Page 1 |   After Page 2 is deallocated       | Page 1 |
----------   rather than assigning             ----------
| Page 2 |   the space to some other process   | Page 3 |
----------   in a non-contiguous fashion,      ----------
| Page 3 |   there can be something            | Page 4 |
----------   like this -->                     ----------
| Page 4 |                                     |        |
----------                                     ----------
The point is that memory allocation would stay contiguous, always immediately after the last allocated page.
I would appreciate being told about the design flaws, or the parameters one has to take care of, when thinking about such an algorithm.
This is called 'compacting' garbage collection, usually part of a mark-compact garbage collection algorithm: https://en.wikipedia.org/wiki/Mark-compact_algorithm
As with any garbage collector, the collection/compaction is the easy part. The hard part is letting the program that you're collecting for continue to work as you're moving its memory around.
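A toy sketch of the sliding-compaction step itself (illustrative only: frames and pageTable are stand-ins, and a real OS or collector would also have to stop or synchronise with the running program while entries move, which is the hard part mentioned above):

using System;
using System.Collections.Generic;

static class Compactor
{
    // frames[i] holds a page id, or null if the frame is free.
    // pageTable maps page id -> frame index and must be fixed up as pages move.
    public static void Compact(string[] frames, Dictionary<string, int> pageTable)
    {
        int next = 0;   // next slot in the contiguous region at the start of memory
        for (int i = 0; i < frames.Length; i++)
        {
            if (frames[i] == null) continue;      // skip holes left by freed pages
            if (i != next)
            {
                frames[next] = frames[i];         // slide the live page down
                frames[i] = null;
                pageTable[frames[next]] = next;   // update the mapping that points at it
            }
            next++;
        }
    }
}

// E.g. frames ["Page 1", null, "Page 3", "Page 4"] becomes
//             ["Page 1", "Page 3", "Page 4", null], as in the diagram above.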

WebRTC - CPU reduction, settings to tweak

PRE-SCRIPTUM:
I have searched over StackOverflow and there is no Q/A explaining all possibilities of tweaking WebRTC to make it more viable for end products.
PROBLEM:
WebRTC has a very nice UX and it is cutting-edge. It should be perfect for mesh calls (3-8 people), but it is not yet. The biggest issue with mesh calls (where all participants exchange streams with each other) is resource consumption, especially CPU.
Here are some stats I would like to share:
2.3 GHz Intel Core i5 (2 cores), OSX 10.10.2 (14C109), 4GB RAM, Chrome 40.0.2214.111 (64-bit)
+------------------------------------+----------+----------+
| Condition                          | CPU      | Delta    |
+------------------------------------+----------+----------+
| Chrome (idle after getUserMedia)   | 11%      | 11%      |
| Chrome-Chrome                      | 55%      | 44%      |
| Chrome-Chrome-Chrome               | 74%      | 19%      |
| Chrome-Chrome-Chrome-Chrome        | 102%     | 28%      |
+------------------------------------+----------+----------+
QUESTION:
I would like to create a table with WebRTC tweaks, which can improve resource consumption and make overall experience better. Are there any other settings I can play with apart from those which are in the table below?
+------------------------------------+--------------+----------------------+
| Tweak                              | CPU Effect   | Affects              |
+------------------------------------+--------------+----------------------+
| Lower FPS                          | Low to high  | Video quality lower  |
| Lower video bitrate                | Low to high  | Video quality lower  |
| Turn off echo cancellation         | Low          | Audio quality lower  |
| Lower source video resolution      | Low to high  | Video quality lower  |
| Get audio only source              | Very high    | No video             |
| Codecs? Compression? More?..       |              |                      |
+------------------------------------+--------------+----------------------+
P.S.
I would like to leave the same architecture (mesh), so MCU is not the thing I am searching for.
You can change the audio rate and codec (Opus -> PCMA/U), and you could also reduce the number of channels. Changing audio will help, but video is your main CPU hog.
Firefox does support H.264. Using it could bring a significant reduction in CPU utilization, as a ton of different architectures support hardware encoding/decoding of H.264. I am not 100% sure whether Firefox will take advantage of that, but it is worth a shot.
As for Chrome, VP8 is really your only option for video at the moment, and codec-agnostic changes (resolution, bitrate, etc.) are really the only way to address the cycles utilized there.
You may also be able to force Chrome to use a lower-quality stream by negotiating the maximum bandwidth in your SDP, though in the past this has not worked with Firefox.
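A rough sketch of that b=AS approach (illustrative only; 256 kbps is just an example cap, and in a real app you would apply this kind of edit to the SDP string before handing it to setLocalDescription / setRemoteDescription):

using System;
using System.Collections.Generic;

static class SdpBandwidth
{
    // Insert a "b=AS:<kbps>" line into the video media section so the peers
    // negotiate a maximum video bitrate.
    public static string LimitVideoBandwidth(string sdp, int kbps)
    {
        var lines = new List<string>(sdp.Split(new[] { "\r\n" }, StringSplitOptions.None));
        for (int i = 0; i < lines.Count; i++)
        {
            if (!lines[i].StartsWith("m=video")) continue;
            int insertAt = i + 1;
            // b= belongs after the media-level c= line when one is present
            if (insertAt < lines.Count && lines[insertAt].StartsWith("c="))
                insertAt++;
            lines.Insert(insertAt, "b=AS:" + kbps);
            break;
        }
        return string.Join("\r\n", lines);
    }
}

// Usage: sdp = SdpBandwidth.LimitVideoBandwidth(sdp, 256);   // cap video at ~256 kbps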

Optimal bcrypt work factor

What would be an ideal bcrypt work factor for password hashing?
If I use a factor of 10, it takes approximately 0.1 s to hash a password on my laptop. If we end up with a very busy site, that turns into a good deal of work just checking people's passwords.
Perhaps it would be better to use a work factor of 7, reducing the total password-hashing work to about 0.01 s per login?
How do you decide the tradeoff between brute force safety and operational cost?
Remember that the work factor is stored in the password hash itself: $2a$(2 chars work)$(22 chars salt)(31 chars hash). So it is not a fixed value across your user base.
If you find the load is too high, just make it so that the next time they log in, you re-hash with something faster to compute. Similarly, as time goes on and you get better servers, if load isn't an issue, you can upgrade the strength of their hash when they log in.
The trick is to keep it taking roughly the same amount of time forever into the future, tracking Moore's Law. The work factor is a base-2 logarithm of the iteration count, so every time computers double in speed, add 1 to the default number.
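In other words (a restatement of that rule, with $c_0$ the cost you choose today and $k$ the relative speedup of future hardware):

$$c = c_0 + \log_2 k, \qquad k = 2 \ \Rightarrow\ c = c_0 + 1$$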
Decide how long you want it to take to brute force a user's password. For some common dictionary word, for instance, your account creation probably already warned them their password was weak. If it's one of 1000 common words, say, and it takes an attacker 0.1s to test each, that buys them 100s (well, some words are more common...). If a user chose 'common dictionary word' + 2 numbers, that's over two hours. If your password database is compromised, and the attacker can only get a few hundred passwords a day, you've bought most of your users hours or days to safely change their passwords. It's a matter of buying them time.
http://www.postgresql.org/docs/8.3/static/pgcrypto.html has some times for cracking passwords for you to consider. Of course, the passwords they list there are random letters. Dictionary words... Practically speaking you can't save the guy whose password is 12345.
Short Version
The number of iterations that gives at least 250 ms to compute
Long Version
When BCrypt was first published, in 1999, they listed their implementation's default cost factors:
normal user: 6
super user: 8
A bcrypt cost of 6 means 64 rounds (2^6 = 64).
They also note:
Of course, whatever cost people choose should be reevaluated
from time to time
At the time of deployment in 1976, crypt could hash fewer than 4 passwords per second. (250 ms per password)
In 1977, on a VAX-11/780, crypt (MD5) could be evaluated about 3.6 times per second. (277 ms per password)
That gives you a flavor of the kind of delays that the original implementers were considering when they wrote it:
~250 ms for normal users
~1 second for super users.
But, of course, the longer you can stand, the better. Every BCrypt implementation I've seen uses 10 as the default cost, and my implementation used that. I believe it is time for me to increase the default cost to 12.
We've decided we want to target no less than 250ms per hash.
My desktop PC is an Intel Core i7-2700K CPU @ 3.50 GHz. I originally benchmarked a BCrypt implementation on 1/23/2014:
1/23/2014 Intel Core i7-2700K CPU @ 3.50 GHz
| Cost | Iterations        | Duration    |
|------|-------------------|-------------|
|    8 | 256 iterations    |    38.2 ms  | <-- minimum allowed by BCrypt
|    9 | 512 iterations    |    74.8 ms  |
|   10 | 1,024 iterations  |   152.4 ms  | <-- current default (BCRYPT_COST=10)
|   11 | 2,048 iterations  |   296.6 ms  |
|   12 | 4,096 iterations  |   594.3 ms  |
|   13 | 8,192 iterations  | 1,169.5 ms  |
|   14 | 16,384 iterations | 2,338.8 ms  |
|   15 | 32,768 iterations | 4,656.0 ms  |
|   16 | 65,536 iterations | 9,302.2 ms  |
Future Proofing
Rather than having a fixed constant, it should be a fixed minimum.
Rather than having your password hash function be:
String HashPassword(String password)
{
    return BCrypt.HashPassword(password, BCRYPT_DEFAULT_COST);
}
it should be something like:
String HashPassword(String password)
{
    /*
      Rather than using a fixed default cost, run a micro-benchmark
      to figure out how fast the CPU is.
      Use that to make sure that it takes **at least** 250 ms to calculate
      the hash.
    */
    Int32 costFactor = this.CalculateIdealCost();

    // Never use a cost lower than the default hard-coded cost
    if (costFactor < BCRYPT_DEFAULT_COST)
        costFactor = BCRYPT_DEFAULT_COST;

    return BCrypt.HashPassword(password, costFactor);
}

Int32 CalculateIdealCost()
{
    // Benchmark using a cost of 5 (the second-lowest allowed)
    Int32 cost = 5;

    var sw = new Stopwatch();
    sw.Start();
    BCrypt.HashPassword("microbenchmark", cost);   // hash once at the known low cost
    sw.Stop();

    Double durationMS = sw.Elapsed.TotalMilliseconds;

    // Increasing the cost by 1 doubles the run time.
    // Keep increasing the cost until the estimated duration is over 250 ms.
    while (durationMS < 250)
    {
        cost += 1;
        durationMS *= 2;
    }

    return cost;
}
And ideally this would be part of everyone's BCrypt library, so rather than relying on users of the library to periodically increase the cost, the cost periodically increases itself.
The question was about the optimal and practical determination of the cost factor for bcrypt password hashes.
On a system where you're calculating user password hashes for a service that you expect to grow in user population over time, why not make the length of time a user has had an account with your service the determining factor, perhaps also including their login frequency as part of that determination?
Bcrypt cost factor = 6 + (number of years of user membership, or some factor thereof), with an optional ceiling on the total cost, perhaps modified in some way by the login frequency of that user (sketched below).
Keep in mind, though, that with such a system (or ANY system for determining the cost factor dynamically) you must consider that the cost of calculating the hash could be used as a method of DDoS attack against the server itself, if an attacker can factor whatever increases the cost into their attack.
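A sketch of that formula (the base cost of 6 comes from the line above; the ceiling value, the helper name, and leaving out the login-frequency modifier are all illustrative choices):

Int32 CostForUser(DateTime accountCreatedUtc)
{
    const Int32 baseCost = 6;    // starting cost, per the formula above
    const Int32 maxCost  = 14;   // optional ceiling on the total cost

    // Cost grows by 1 for each full year of membership, up to the ceiling.
    Int32 years = (Int32)((DateTime.UtcNow - accountCreatedUtc).TotalDays / 365);
    return Math.Min(baseCost + years, maxCost);
}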
