PRE-SCRIPTUM:
I have searched StackOverflow and there is no Q/A explaining all the possibilities for tweaking WebRTC to make it more viable for end products.
PROBLEM:
WebRTC has a very nice UX and is cutting-edge. It should be perfect for mesh calls (3-8 people), but it is not yet. The biggest issue with mesh calls (all participants exchange streams with each other) is resource consumption, especially CPU.
Here are some stats I would like to share:
2.3 GHz Intel Core i5 (2 cores), OSX 10.10.2 (14C109), 4GB RAM, Chrome 40.0.2214.111 (64-bit)
+------------------------------------+------+-------+
| Condition                          | CPU  | Delta |
+------------------------------------+------+-------+
| Chrome (idle after getUserMedia)   |  11% |   11% |
| Chrome-Chrome                      |  55% |   44% |
| Chrome-Chrome-Chrome               |  74% |   19% |
| Chrome-Chrome-Chrome-Chrome        | 102% |   28% |
+------------------------------------+------+-------+
QUESTION:
I would like to build a table of WebRTC tweaks that can reduce resource consumption and improve the overall experience. Are there any other settings I can play with apart from those in the table below?
+------------------------------------+-------------+---------------------+
| Tweak                              | CPU Effect  | Affects             |
+------------------------------------+-------------+---------------------+
| Lower FPS                          | Low to high | Video quality lower |
| Lower video bitrate                | Low to high | Video quality lower |
| Turn off echo cancellation         | Low         | Audio quality lower |
| Lower source video resolution      | Low to high | Video quality lower |
| Get audio-only source              | Very high   | No video            |
| Codecs? Compression? More?         |             |                     |
+------------------------------------+-------------+---------------------+
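For reference, here is a minimal sketch (TypeScript, modern constraint syntax; the numbers are only examples and exact support varies by browser) of how several of the rows above can be applied at capture time:

// Illustrative capture constraints: lower resolution, lower FPS,
// echo cancellation off. Values are arbitrary examples, not recommendations.
const constraints: MediaStreamConstraints = {
  video: {
    width: { ideal: 320 },     // lower source video resolution
    height: { ideal: 240 },
    frameRate: { max: 15 },    // lower FPS
  },
  audio: {
    echoCancellation: false,   // turn off echo cancellation
  },
};

// For the "audio only" row, request { audio: true, video: false } instead.
navigator.mediaDevices.getUserMedia(constraints)
  .then((stream) => {
    // Attach the stream to each RTCPeerConnection as usual.
    console.log("tracks:", stream.getTracks().map((t) => t.kind));
  })
  .catch(console.error);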
P.S.
I would like to keep the same architecture (mesh), so an MCU is not what I am looking for.
You can change the audio sample rate and codec (Opus -> PCMA/U), and you could also reduce the number of channels. Changing audio will help, but video is your main CPU hog.
Firefox does support H.264. Using it could significantly reduce CPU utilization, as many architectures support hardware encoding/decoding of H.264. I am not 100% sure whether Firefox will take advantage of that, but it is worth a shot.
As for Chrome, VP8 is really your only option for video at the moment, and your codec-agnostic changes (resolution, bitrate, etc.) are really the only way to address the cycles used there.
You may also be able to force Chrome to use a lower-quality stream by negotiating the maximum bandwidth in your SDP. Though, in the past, this has not worked with Firefox.
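As a sketch of that SDP approach (the 256 kbps cap is just an example value, and browsers differ in how strictly they honour b=AS), the offer/answer can be munged before it is applied:

// Hedged sketch: insert a "b=AS:<kbps>" bandwidth line into the video
// m-section of an SDP string. The cap value is arbitrary, and handling of
// b=AS differs between browsers (historically it did not work in Firefox).
function capVideoBandwidth(sdp: string, kbps: number): string {
  const lines = sdp.split("\r\n");
  const out: string[] = [];
  let inVideoSection = false;
  for (const line of lines) {
    out.push(line);
    if (line.startsWith("m=")) {
      inVideoSection = line.startsWith("m=video");
    }
    // Bandwidth lines conventionally follow the "c=" line of the section.
    if (inVideoSection && line.startsWith("c=")) {
      out.push(`b=AS:${kbps}`);
    }
  }
  return out.join("\r\n");
}

// Usage (illustrative): munge the local offer before setLocalDescription.
// const offer = await pc.createOffer();
// await pc.setLocalDescription({ type: offer.type, sdp: capVideoBandwidth(offer.sdp!, 256) });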
I encode camera images with several parallel H.264 baseline I-frame-only encoder cores in an FPGA (more parallel cores are required to get the encoding done fast enough without external memory). I use an open-source encoder. On the simulator I use a 352x288 YUV420 input image and slice it into two side-by-side halves.
The decoder is on another embedded device and I can decode the separated slices correctly. But, instead of several decoded video sequences, I need just one video sequence (covering the whole image) from a single H.264 byte stream. It is preferable to merge the H.264 streams on the FPGA (instead of post-processing the decoded slice video sequences on the embedded host).
So, the input image data looks like this:
.--.-----.--.--.-----.--.
|MB| |MB|MB| |MB|
| 0| ... |10|11| ... |21|
|--. '--|--' '--|
|MB| | |
|22| | |
|--' | |
| Slice | Slice |
| 0 | 1 |
'-----------'----------'
And the encoder's block diagram is this (just one encoder core shown, without merging):
YUV420 MacroBlocks (16*16 pixel)
| |
chroma | | luma
| |
.----------------.-------. |
| | | | |
| v v v v
reconstruct intra8x8cc intra4x4 ---> header
^ | | |
| v v |
inv.transform coretransform |
^ | | |
| | v |
dequantise | dctransform |
^ ^ | | |
| | v v |
inv.dc.transform <-'---------- quantise |
| |
v |
buffer |
| |
v |
cavlc |
| |
v |
tobytes <------------'
|
v
H.264 byte stream
The encoded H.264 data from a single core (parsed with H264Naked; see parse.txt, ref.yuv420 and ref.264) contains 3 NALs: the first is a Sequence Parameter Set, the second a Picture Parameter Set, and the third a Coded slice of an IDR picture. The SPS and PPS headers are static and are generated on the host. The IDR header is generated by the encoder on the FPGA. Currently I use a constant quantization parameter (i.e. VBR; it is planned to implement CBR with variable QP later).
To get the encoded streams merged, I removed the IDR header from the slices (except from slice 0) in the header block. Then I buffer the output of the CAVLC blocks from each encoder for a whole MB row (this is the encoded MB data; MB headers can be identified from the output of the header block). Then from this buffer I drive all the CAVLC data into a new instance of the tobytes block in this sequence:
CAVLC data of one MB row from slice 0
CAVLC data of one MB row from slice 1
The above is repeated until the end of the frame.
When I decode this merged H.264 stream, I get the following error with ffmpeg (see merged_tobytes.264 and ffplay_error.txt):
[h264 @ 0x7f5f0c00ac00] dquant out of range (-112) at 13 1B f=0/0
[h264 @ 0x7f5f0c00ac00] error while decoding MB 13 1
[h264 @ 0x7f5f0c00ac00] concealing 396 DC, 396 AC, 396 MV errors in I frame
With JM reference decoder I get this error (jm_dec_error.txt):
mb_qp_delta is out of range (-112)
illegal chroma intra pred mode!
I think the problem is that the decoder does not know about the slices and expects neighboring MB data from the other slice, which is not available in the encoded data. I have checked the Macroblock prediction syntax in the standard (chapter 7.3.5.1), but I do not see how I should correct this (I have very little knowledge of H.264). On the encoder side, I see that intra4x4 and intra8x8cc use neighboring MB prediction and pixel data from the top and left MB neighbors when those are available (there are no top MB neighbors in the first MB row, and no left MB neighbor at the first MB of each MB row, in each slice).
I have also tried to use FMO and declared slice groups (slice_group_map_type=2) as described here, but FMO and slice groups are not supported by typical H.264 decoders (see the ffmpeg issue; I think it is used only in broadcast equipment). In fact, I also got an error from the JM reference decoder even when I declared the slice groups (see merged_tobytes_with_slice_grops.264 and jm_dec_slice_groups_error.txt):
warning: Intra_8x8_Horizontal prediction mode not allowed at mb 0
illegal chroma intra pred mode!
Any help is appreciated.
Reference photo is taken from here.
PS: if someone can advise a deblocker which could be used with this FPGA core, that would also be nice.
You can't just interleave slice data like that. From a decoder's point of view, each slice is supposed to be a complete and independent decoding unit, so a decoder can parse and decode multiple slices at the same time using different slice decoding instances (e.g. threads).
Slices are also supposed to be contiguous in scan order. So, in your example, you'd want to cut the image horizontally (into two slices, one on top of the other) rather than vertically, and then append the second slice after the first, fixing up the header(s) to signal the correct frame height. This should decode correctly. To do this for more slices, just cut the individual slices smaller, but they should always be contiguous in scan order.
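To make that concrete for the 352x288 input in the question, here is an illustrative calculation (the split into exactly two row-aligned slices is only an example; the variable names mirror the H.264 SPS and slice-header syntax elements):

// Illustrative numbers for splitting a 352x288 (CIF) frame into two
// horizontal slices, one on top of the other, as suggested above.
const width = 352;
const height = 288;
const mbCols = width / 16;   // 22 macroblocks per row
const mbRows = height / 16;  // 18 macroblock rows

// The SPS keeps describing the full frame:
const pic_width_in_mbs_minus1 = mbCols - 1;        // 21
const pic_height_in_map_units_minus1 = mbRows - 1; // 17

// Each slice then covers a contiguous run of macroblocks in scan order:
const rowsInSlice0 = mbRows / 2;                    // top 9 MB rows
const first_mb_in_slice_0 = 0;                      // slice 0: MBs 0..197
const first_mb_in_slice_1 = rowsInSlice0 * mbCols;  // slice 1 starts at MB 198

console.log({
  pic_width_in_mbs_minus1,
  pic_height_in_map_units_minus1,
  first_mb_in_slice_0,
  first_mb_in_slice_1,
});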
Just like a turbocharged engine has "turbo lag" due to the time it takes for the turbo to spool up, I'm curious what the "turbo lag" of Intel processors is.
For instance, the i9-8950HK in my MacBook Pro 15" 2018 (running macOS Catalina 10.15.7) usually sits around 1.3 GHz when idle, but when I run a CPU-intensive program, the CPU frequency shoots up to, say, 4.3 GHz or so (initially). The question is: how long does it take to go from 1.3 to 4.3 GHz? 1 microsecond? 1 millisecond? 100 milliseconds?
I'm not even sure whether this is up to the hardware or the operating system.
This is in the context of benchmarking some CPU-intensive code which takes a few tens of milliseconds to run. The thing is, right before this piece of CPU-intensive code is run, the CPU is essentially idle (and thus the clock speed will drop down to, say, 1.3 GHz). I'm wondering what slice of my benchmark is running at 1.3 GHz and what is running at 4.3 GHz: 1%/99%? 10%/90%? 50%/50%? Or even worse?
Depending on the answer, I'm thinking it would make sense to run some CPU-intensive code prior to starting the benchmark as a way to "spool up" Turbo Boost. And this leads to another question: for how long should I run this "spooling-up" code? Probably one second is enough, but what if I'm trying to minimize this -- what's a safe amount of time for the "spooling-up" code to run, to make sure the CPU will run the main code at the maximum frequency from the very first instruction executed?
The paper Evaluation of CPU frequency transition latency presents transition latencies of various Intel processors. In brief, the latency depends on the state the core is currently in and on the target state. For the evaluated Ivy Bridge processor (i7-3770 @ 3.4 GHz) the latencies varied from 23 microseconds (1.6 GHz -> 1.7 GHz) to 52 microseconds (2.0 GHz -> 3.4 GHz).
At the Hot Chips 2020 conference, a major transition-latency improvement in the upcoming Ice Lake processors was presented, which should mostly benefit partially vectorised code that uses AVX-512 instructions. Since these instructions do not support frequencies as high as SSE or AVX2 instructions do, an island of such instructions causes the processor frequency to scale down and then back up.
Pre-heating the processor obviously makes sense, as does "pre-heating" memory. One second of prior workload is enough to reach the highest available turbo frequency; however, you should also take into account the temperature of the processor, which may scale the frequency down (actually both the core and uncore frequencies on the latest Intel processors). You cannot reach the temperature limit in a second, but it depends on what you want your benchmark to measure and whether you want to take the temperature limit into account. Speaking of the temperature limit, be aware that your processor also has a power limit, which is another possible reason for the frequency to scale down during the application run.
Another thing you should take into account when benchmarking your code is that its runtime is very short, which hurts the reliability of the runtime/resource-consumption measurements. I would suggest artificially extending the runtime (e.g. run the code 10 times and measure the overall consumption) for better results.
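A rough sketch of both suggestions (the 200 ms warm-up duration and the 10 repetitions are arbitrary example values, not measured ones):

// Spin on CPU-bound work before timing, and repeat the workload to stretch
// a very short runtime into something measurable.
function warmUp(ms: number = 200): void {
  const end = Date.now() + ms;
  let sink = 0;
  while (Date.now() < end) {
    sink += Math.sqrt(sink + 1); // keep the core busy so it can ramp up
  }
  if (sink < 0) console.log(sink); // keep the loop from being optimised away
}

function benchmark(workload: () => void, repeats: number = 10): number {
  warmUp();
  const t0 = performance.now();
  for (let i = 0; i < repeats; i++) {
    workload();
  }
  return (performance.now() - t0) / repeats; // mean time per run, in ms
}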
I wrote some code to check this, with the aid of the Intel Power Gadget API. It sleeps for one second (so the CPU goes back to its slowest speed), measures the clock speed, runs some code for a given amount of time, then measures the clock speed again.
I only tried this on my 2018 15" MacBook Pro (i9-8950HK CPU) running macOS Catalina 10.15.7. The specific CPU-intensive code being run between clock speed measurements may also influence the result (is it integer only? FP? SSE? AVX? AVX-512?), so don't take these as exact numbers, but only order-of-magnitude/ballpark figures. I have no idea how the results translate into different hardware/OS/code combinations.
The minimum clock speed when idle in my configuration is 1.3 GHz. Here are the results I obtained, in tabular form.
+--------+-------------+
| T (ms) | Final clock |
|        | speed (GHz) |
+--------+-------------+
| <1     | 1.3         |
| 1..3   | 2.0         |
| 4..7   | 2.5         |
| 8..10  | 2.9         |
| 10..20 | 3.0         |
| 25     | 3.0-3.1     |
| 35     | 3.3-3.5     |
| 45     | 3.5-3.7     |
| 55     | 4.0-4.2     |
| 66     | 4.6-4.7     |
+--------+-------------+
So 1 ms appears to be the minimum amount of time needed to get any change at all. 10 ms gets the CPU to its nominal frequency; from then on the ramp is slower, apparently taking over 50 ms to reach the maximum turbo frequencies.
Currently, I have NiFi running on an edge node that has 4 cores. Say I have 20 incoming flowfiles and I set concurrent tasks to 10 for an ExecuteStreamCommand processor: does that mean I get only concurrent execution, or both concurrent and parallel execution?
In this case you get concurrency and parallelism, as noted in the Apache NiFi User Guide (emphasis added):
Next, the Scheduling Tab provides a configuration option named
Concurrent tasks. This controls how many threads the Processor will
use. Said a different way, this controls how many FlowFiles should be
processed by this Processor at the same time. Increasing this value
will typically allow the Processor to handle more data in the same
amount of time. However, it does this by using system resources that
then are not usable by other Processors. This essentially provides a
relative weighting of Processors — it controls how much of the
system’s resources should be allocated to this Processor instead of
other Processors. This field is available for most Processors. There
are, however, some types of Processors that can only be scheduled with
a single Concurrent task.
If there are locking issues or race conditions with the command you are invoking, this could be problematic, but if they are independent, you are only limited by JVM scheduling and hardware performance.
Response to question in comments too long for a comment:
Question:
Thanks Andy. When there are 4 cores, can i assume that there shall be
4 parallel executions within which they would be running multiple
threads to handle 10 concurrent tasks? In the best possible way, how
are these 20 flowfiles executed in the scenario I mentioned. – John 30
mins ago
Response:
John, JVM thread handling is a fairly complex topic, but yes, in general there would be n+C JVM threads, where C is some constant (main thread, VM thread, GC threads) and n is the number of "individual" threads created by the flow controller to execute the processor tasks. JVM threads map 1:1 to native OS threads, so on a 4 core system with 10 processor threads running, you would have "4 parallel executions". My belief is that at a high level, your OS would use time slicing to cycle through the 10 threads 4 at a time, and each thread would process ~2 flowfiles.
Again, very rough idea (assume 1 flowfile = 1 unit of work = 1 second):
Cores | Threads | Flowfiles/thread | Relative time
  1   |    1    |        20        | 20 s  (normal)
  4   |    1    |        20        | 20 s  (wasting 3 cores)
  1   |    4    |         5        | 20 s  (time slicing 1 core for 4 threads)
  4   |    4    |         5        |  5 s  (1:1 thread to core ratio)
  4   |   10    |         2        | 5+x s (see execution table below)
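A toy model of the "Relative time" column above (each flowfile costs 1 s of CPU, and the context-switch overhead, the "+x", is ignored; purely illustrative):

// Only min(cores, threads) flowfiles can make progress at any instant.
function relativeTimeSeconds(cores: number, threads: number, flowfiles: number): number {
  const trulyParallel = Math.min(cores, threads);
  return Math.ceil(flowfiles / trulyParallel);
}

console.log(relativeTimeSeconds(1, 1, 20));  // 20
console.log(relativeTimeSeconds(4, 1, 20));  // 20 (3 cores idle)
console.log(relativeTimeSeconds(1, 4, 20));  // 20 (time slicing 1 core)
console.log(relativeTimeSeconds(4, 4, 20));  // 5
console.log(relativeTimeSeconds(4, 10, 20)); // 5 (+ switching overhead in practice)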
If we are assuming each core can handle one thread, and each thread can handle 1 flowfile per second, and each thread gets 1 second of uninterrupted operation (obviously not real), the execution sequence might look like this:
Flowfiles A - T
Cores α, β, γ, δ
Threads 1 - 10
Time/thread 1 s
Time | Core α | Core β | Core γ | Core δ
0 | 1/A | 2/B | 3/C | 4/D
1 | 5/E | 6/F | 7/G | 8/H
2 | 9/I | 10/J | 1/K | 2/L
3 | 3/M | 4/N | 5/O | 6/P
4 | 7/Q | 8/R | 9/S | 10/T
In 5 seconds, all 10 threads have executed twice, each completing 2 flowfiles.
However, assume the thread scheduler only assigns each thread a cycle of .5 seconds each iteration (again, not a realistic number, just to demonstrate). The execution pattern then would be:
Flowfiles A - T
Cores α, β, γ, δ
Threads 1 - 10
Time/thread .5 s
Time | Core α | Core β | Core γ | Core δ
0 | 1/A | 2/B | 3/C | 4/D
.5 | 5/E | 6/F | 7/G | 8/H
1 | 9/I | 10/J | 1/A | 2/B
1.5 | 3/C | 4/D | 5/E | 6/F
2 | 7/G | 8/H | 9/I | 10/J
2.5 | 1/K | 2/L | 3/M | 4/N
3 | 5/O | 6/P | 7/Q | 8/R
3.5 | 9/S | 10/T | 1/K | 2/L
4 | 3/M | 4/N | 5/O | 6/P
4.5 | 7/Q | 8/R | 9/S | 10/T
In this case, the total execution time is the same (* there is some overhead from the thread switching) but specific flowfiles take "longer" (total time from 0, not active execution time) to complete. For example, flowfiles C and D are not complete until time=2 in the second scenario, but are complete at time=1 in the first.
To be honest, the OS and the JVM have people much smarter than me working on this, as does our project (luckily), so there are gross over-simplifications here. In general I would recommend you let the system worry about hyper-optimizing the threading. I would not expect setting the concurrent tasks to 10 to yield vast improvements over setting it to 4 in this case. You can read more about JVM threading here and here.
I just did a quick test in my local 1.5.0 development branch -- I connected a simple GenerateFlowFile running with 0 sec schedule to a LogAttribute processor. The GenerateFlowFile immediately generates so many flowfiles that the queue enables the back pressure feature (pausing the input processor until the queue can drain some of the 10,000 waiting flowfiles). I stopped both and re-ran this, giving the LogAttribute processor more concurrent tasks. By setting the LogAttribute concurrent tasks to 2:1 of the GenerateFlowFile, the queue never built up past about 50 queued flowfiles.
tl;dr Setting your concurrent tasks to the number of cores you have should be sufficient.
Update 2:
Checked with one of our resident JVM experts and he mentioned two things to note:
The command is not solely CPU limited; if I/O is heavy, more concurrent tasks may be beneficial.
The max number of concurrent tasks for the entire flow controller is set to 10 by default.
This is only a theoretical question.
In parallel computing, is it possible to achieve efficiency greater than 100%?
E.g. 125% efficiency:
+------------+------+
| Processors | Time |
+------------+------+
| 1          | 10 s |
| 2          |  4 s |
+------------+------+
I don't mean situations where the parallel environment is configured wrong or there is some bug in the code.
Efficiency definition:
https://stackoverflow.com/a/13211093/2265932
Yes, it is possible. It is called superlinear speedup, and it is usually caused by improved cache usage, though it is usually less than 125%.
See, for example, Where does super-linear speedup come from?
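Applying the linked definition (speedup S = T1/Tp, efficiency E = S/p) to the numbers in the question:

// Worked example with the question's table: 10 s on 1 processor, 4 s on 2.
const t1 = 10; // seconds with 1 processor
const tp = 4;  // seconds with p processors
const p = 2;

const speedup = t1 / tp;        // 2.5
const efficiency = speedup / p; // 1.25 -> 125%, i.e. superlinear

console.log(`speedup = ${speedup}, efficiency = ${(efficiency * 100).toFixed(0)}%`);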
I read the datasheet for an Intel Xeon processor and saw the following:
The Integrated Memory Controller (IMC) supports DDR3 protocols with four
independent 64-bit memory channels with 8 bits of ECC for each channel (total of
72-bits) and supports 1 to 3 DIMMs per channel depending on the type of memory
installed.
I need to know what exactly this means from a programmer's point of view.
The documentation on this seems to be rather sparse and I don't have someone from Intel at hand to ask ;)
Can this memory controller execute 4 loads of data simultaneously from non-adjacent memory regions (and fetch each piece of data from up to 3 memory DIMMs)? I.e. 4x64 bits, striped across up to 3 DIMMs, e.g.:
| X | _ | X | _ | X | _ | X |
(X is loaded data, _ an arbitrarily large region of unloaded data)
Can this IMC execute 1 load which loads up to 1x256 bits from a contiguous memory region?
| X | X | X | X | _ | _ | _ | _ |
This seems to be implementation-specific, depending on the compiler, OS and memory controller. The standard is available at: http://www.jedec.org/standards-documents/docs/jesd-79-3d . It seems that if your controller is fully compliant, there are specific bits that can be set to indicate interleaved or non-interleaved mode. See pages 24, 25 and 143 of the DDR3 spec, but even in the spec the details are light.
For the i7/i5/i3 series specifically, and likely all newer Intel chips, the memory is interleaved as in your first example. For these newer chips, and presumably a compiler that supports it, yes: one asm/C/C++-level call that loads something large enough to be interleaved/striped would initiate the required number of independent hardware channel-level loads to each channel of memory.
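As a purely illustrative sketch of what that interleaving can look like (assuming naive 64-byte, cache-line-granularity round-robin across four channels, which is an assumption; real controllers use chipset-specific and often hashed mappings):

// Illustrative only: a naive address-to-channel mapping. The real Xeon IMC
// mapping is chipset-specific and typically more complex than this.
const CHANNELS = 4;
const LINE_BYTES = 64;

function channelOf(physAddr: number): number {
  return Math.floor(physAddr / LINE_BYTES) % CHANNELS;
}

// A contiguous 256-byte read (4 cache lines) touches every channel once,
// so the four 64-bit channels can be kept busy in parallel.
for (let addr = 0; addr < 256; addr += LINE_BYTES) {
  console.log(`0x${addr.toString(16).padStart(3, "0")} -> channel ${channelOf(addr)}`);
}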
In the Triple-channel section of the Multi-channel memory architecture page on Wikipedia there is a small list of CPUs that do this; it is likely incomplete: http://en.wikipedia.org/wiki/Multi-channel_memory_architecture