How to restrict ffmpeg to use only 50% of my CPU? - ffmpeg

I'm using ffmpeg to stream my desktop over UDP, but the ffmpeg process takes 100% of the CPU the whole time it is running, leaving no room for other applications. How can I restrict the ffmpeg process to about 50-60% of the CPU?
My CPU has a single core.
I have 2 GB of RAM.

FFmpeg has a -threads option. It defaults to auto, or you can use it to limit the number of threads (and therefore CPU cores) that are used. A common recommendation is the number of available threads minus 1 or 2, so with 8 threads, -threads 6 would be a good setting. With only one core and 2 GB of RAM, though, I'm not sure this will help: a single thread can still keep a single core fully busy, and FFmpeg needs a lot of resources.
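The "available threads minus 1 or 2" rule can be sketched as below. This is only an illustration, not part of the original answer: the helper names pick_thread_count and build_ffmpeg_cmd, the reserve parameter, and the file names are hypothetical. The script just builds an argument list and does not run ffmpeg.

```python
import os


def pick_thread_count(available, reserve=2):
    """Leave `reserve` threads free for other programs, but never go below 1."""
    return max(1, available - reserve)


def build_ffmpeg_cmd(input_args, output_args, threads):
    """Return an ffmpeg argument list with -threads placed among the output options."""
    return ["ffmpeg", *input_args, "-threads", str(threads), *output_args]


if __name__ == "__main__":
    available = os.cpu_count() or 1
    threads = pick_thread_count(available)
    # "input.mkv" and "output.mp4" are placeholder file names.
    print(build_ffmpeg_cmd(["-i", "input.mkv"], ["output.mp4"], threads))
```

On a single-core machine pick_thread_count returns 1, which is why the thread limit alone is unlikely to bring usage down to 50-60% there.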

Related

Is there a way to predict the amount of memory needed for ffmpeg?

I've just started using ffmpeg and I want to create a VR180 video from a list of images with a resolution of 11520x5760. (The images are 80 MB each; for now I have just 225 for testing.)
I used this command:
ffmpeg -framerate 30 -i "%06d.png" "output.mp4"
I ran out of my 8 GB of RAM and ffmpeg crashed.
So I created a 10 GB swap; ffmpeg filled it up and crashed.
Is there a way to know how much memory an ffmpeg command needs to run properly?
Please provide output of the ffmpeg command when you run it.
I'm assuming FFmpeg will transcode to H.264, so it will create an H.264 encoder. Most of the memory sits in the lookahead queue and the reference buffers. For H.264, the default for --rc-lookahead is 40. I believe H.264 allows something like 2x4=8 references (?) plus the current frame(s) (there can be frame threading), so let's say roughly 50 frames in total. A frame of YUV420P data takes 1.5 bytes per pixel, so 1.5 x 11520 x 5760 x 50 = ~5 GB. Add to that encoder-specific data, which roughly doubles this, so 10 GB should be enough.
If 8+10 GB is not enough, my rough, handwavy calculation is probably not precise enough. Your options are:
Significantly reduce --rc-lookahead, --threads and --level so there are fewer frames alive at a time. Read the documentation for each of these options to understand what they do, what their defaults are, and what to change them to in order to reduce memory usage (see e.g. note here for --rc-lookahead).
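The back-of-the-envelope estimate above can be reproduced with a few lines of Python. This is a rough sketch under the answer's own assumptions (about 50 frames alive, 1.5 bytes per pixel for YUV420P, encoder overhead roughly doubling the total); the function name and parameter names are made up for illustration.

```python
def estimate_frame_buffer_bytes(width, height, frames_alive=50, bytes_per_pixel=1.5):
    """Bytes needed to hold `frames_alive` raw frames at the given resolution."""
    return int(width * height * bytes_per_pixel * frames_alive)


if __name__ == "__main__":
    frames = estimate_frame_buffer_bytes(11520, 5760)
    # The factor of 2 is the answer's rough allowance for encoder-specific data.
    total = 2 * frames
    print(f"frame buffers: {frames / 1e9:.2f} GB, with encoder data: {total / 1e9:.2f} GB")
```

Lowering frames_alive (for example by reducing the lookahead) shrinks the estimate proportionally, which is the reasoning behind the options listed in the answer.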
You can also use a different (less complex) codec that has smaller memory requirements.

What determines the number of CPU cores used by the filter select=gt(scene,0.1)?

I've noticed that when using the filter gt(scene,0.1), for example:
ffmpeg -i big_buck_bunny.mkv -filter:v "select='gt(scene,0.1)',showinfo" -f null -
Depending on the video, the number of CPU cores used varies a lot (sometimes 3 cores, other times 12 cores for a different video).
What determines that logic?
I tried to read the ffmpeg source code but I'm not familiar with it. A general explanation would be enough, but I'd much appreciate it if you could point out the line or directory that determines this logic in https://github.com/FFmpeg/FFmpeg.
(I'm not asking how to reduce CPU usage; I'm interested in the logic that determines it.)
In the source code there is no mention of multithreading in this filter (keywords for searching the source: get_scene_score, scene_sad).
Further searching suggests that this is due to video decoding and is not related to the filter. The filter always uses 1 CPU core, so -threads 1 can be used to limit CPU usage; the position of the -threads parameter also matters.
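To illustrate why the position of -threads matters: options placed before -i apply to that input (here, the decoder), while options placed after it apply to the output. The sketch below only assembles the argument list for the command from the question; the function name scene_detect_cmd and its parameters are hypothetical.

```python
def scene_detect_cmd(src, decoder_threads=None, threshold=0.1):
    """Build the scene-detection command, optionally limiting decoder threads."""
    cmd = ["ffmpeg"]
    if decoder_threads is not None:
        # Placed before -i, so it is an input option and limits the decoder.
        cmd += ["-threads", str(decoder_threads)]
    cmd += [
        "-i", src,
        "-filter:v", f"select='gt(scene,{threshold})',showinfo",
        "-f", "null", "-",
    ]
    return cmd


if __name__ == "__main__":
    print(scene_detect_cmd("big_buck_bunny.mkv", decoder_threads=1))
```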

How to get better performance in ProxmoxVE + CEPH cluster

We have been running ProxmoxVE since 5.0 (now on 6.4-15) and we have noticed a drop in performance whenever there is heavy reading/writing.
We have 9 nodes, 7 with CEPH and 56 OSDs (8 on each node). The OSDs are hard drives (HDD), WD Gold or better (4~12 TB). The nodes have 64/128 GB of RAM and dual Xeon CPU mainboards (various models).
We already tried simple tests like "ceph tell osd.* bench", getting a stable 110 MB/s data transfer to each of them with a +-10 MB/s spread during normal operations. Apply/Commit latency is normally below 55 ms, with a couple of OSDs reaching 100 ms and one-third below 20 ms.
The front and back networks are both 1 Gbps (separated in VLANs). We are trying to move to 10 Gbps, but we ran into trouble that we are still trying to figure out how to solve (unstable OSD disconnections).
The pool is defined as "replicated" with 3 copies (2 needed to keep running). The total amount of disk space is now 305 TB (72% used); reweight is in use because some OSDs were getting much more data than others.
Virtual machines run on the same 9 nodes, most are not CPU intensive:
Avg. VM CPU Usage < 6%
Avg. Node CPU Usage < 4.5%
Peak VM CPU Usage 40%
Peak Node CPU Usage 30%
But I/O Wait is a different story:
Avg. Node IO Delay 11
Max. Node IO delay 38
Disk writing load is around 4 Mbytes/sec average, with peaks up to 20 Mbytes/sec.
Anyone with experience in getting better Proxmox+CEPH performance?
Thank you all in advance for taking the time to read,
Ruben.
Here are some Ceph pointers that you could follow:
Get some good NVMe drives (one or two per server, but with 8 HDDs per server 1 should be enough) and use them for DB/WAL (make sure they have power-loss protection).
ceph tell osd.* bench is not that relevant to real-world workloads; I suggest trying some FIO tests, see here.
Set osd_memory_target for the OSDs to at least 8 GB of RAM.
To save some writes on your HDDs (data is not replicated X times), create your RBD pool as EC (erasure-coded pool), but please do some research on that because there are tradeoffs. Recovery takes some extra CPU calculations.
All in all, hyper-converged clusters are good for training, small projects and medium projects without such a big workload on them... Keep in mind that planning is gold.
Just my 2 cents,
B.
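The point about erasure coding saving writes can be made concrete with some simple arithmetic. This is a simplified sketch that ignores metadata, WAL traffic and padding; the EC profile k=4, m=2 is an assumption chosen for illustration, not something the question or answer specifies, and the function names are made up.

```python
def bytes_written_replicated(payload_bytes, copies=3):
    """Raw bytes hitting disks when every object is stored `copies` times."""
    return payload_bytes * copies


def bytes_written_ec(payload_bytes, k, m):
    """Raw bytes hitting disks with k data chunks plus m coding chunks."""
    return payload_bytes * (k + m) / k


if __name__ == "__main__":
    payload = 4 * 1000**2  # the ~4 MB/s average write load from the question
    print("replicated x3:", bytes_written_replicated(payload))
    print("EC k=4, m=2:  ", bytes_written_ec(payload, k=4, m=2))
```

With these assumed numbers the EC pool writes 1.5 times the payload instead of 3 times, at the cost of the extra CPU work and recovery tradeoffs the answer mentions.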

Using more than 2 NV_ENC at a time with FFMPEG

I'm currently generating timelapse videos using a thread on my CPU with fluent-ffmpeg running on nodejs. It takes roughly 1 minute to generate a 10 second timelapse. I'm generating many at the same time (basically one per thread) such that I tend to get the best performance at 8 worker threads. ... overall system throughput is about one video per 12 seconds.
GPU processing using h264_nvenc takes the single-thread time to about 3-4 seconds. Yippie! I went out and bought some nVidia 1660's to take advantage.
Unfortunately, when I try to generate a 3rd simultaneous video, I get a "Conversion Failed!" error from FFmpeg.
Some basic research seems to show you can only run 2 at a time, perhaps 3 with updated drivers.
Is there a way around this? Posts like this one indicate that the limit is artificial and can be worked around: https://www.techpowerup.com/268495/nvidia-silently-increases-geforce-nvenc-concurrent-sessions-limit-to-3
Perhaps there is a way to use all the CUDA/tensor/etc. cores to render timelapse videos instead of relying only on the limited NVENC?
The current limit is 3 renders on both my GTX 1060 and my RTX 2080 Ti. Another post says the GTX 1660 is the same, so this is obviously an artificial limit. Looking at the nVidia link posted above, which has a list of cards and their NVENC/NVDEC capabilities, it looks like most nVidia gaming cards have this 3-render limit. However, most of the modern (Pascal and up) Quadro cards allow unlimited renders per card.
As another workaround, you can put multiple gaming cards in a system. FFmpeg has an option to send a particular job to the card of your choosing. The same encoder module is in the GTX 1660 as in the RTX 2080 Ti, so there shouldn't be much speed difference between low-end and high-end cards. There may be some minor difference from the memory bus width, but I haven't compared the 1660 and 2080 Ti directly. What I'm saying is: if you need more encoding horsepower, just buy another couple of low-end cards and divide up the workload using FFmpeg's built-in functionality.
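Dividing the workload across cards could look roughly like the sketch below. It assumes the per-card selection the answer refers to is the h264_nvenc encoder's -gpu option (check the output of ffmpeg -h encoder=h264_nvenc on your build), and it takes the 3-session limit from the answer as a default. The function name plan_encodes and the file names are hypothetical; the script only builds argument lists and does not launch anything.

```python
from itertools import cycle


def plan_encodes(jobs, gpu_count, sessions_per_gpu=3):
    """Assign (src, dst) jobs to GPUs round-robin.

    Returns the maximum number of jobs that should run at once and
    one ffmpeg argument list per job.
    """
    max_parallel = gpu_count * sessions_per_gpu
    plan = []
    for (src, dst), gpu in zip(jobs, cycle(range(gpu_count))):
        plan.append(["ffmpeg", "-i", src, "-c:v", "h264_nvenc", "-gpu", str(gpu), dst])
    return max_parallel, plan


if __name__ == "__main__":
    jobs = [(f"frames_{i}/%06d.png", f"timelapse_{i}.mp4") for i in range(4)]
    max_parallel, plan = plan_encodes(jobs, gpu_count=2)
    print("run at most", max_parallel, "encodes at once")
    for cmd in plan:
        print(" ".join(cmd))
```

A worker pool sized to max_parallel would then keep each card at or below its session limit.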

Low GPU usage in CUDA

I implemented a program which uses different CUDA streams from different CPU threads. Memory copying is implemented via cudaMemcpyAsync using those streams. Kernel launches also use those streams. The program does double-precision computations (and I suspect this is the culprit; however, cuBlas reaches 75-85% CPU usage for multiplication of matrices of doubles). There are also reduction operations, but they are implemented via if(threadIdx.x < s) with s halving in each iteration, so stalled warps should be available to other blocks. The application is GPU and CPU intensive; it starts another piece of work as soon as the previous one has finished. So I expect it to reach 100% of either CPU or GPU.
The problem is that my program generates 30-40% GPU load (and about 50% CPU load), if GPU-Z 1.9.0 is to be trusted. Memory Controller Load is 9-10%, Bus Interface Load is 6%. This is with the number of CPU threads equal to the number of CPU cores. If I double the number of CPU threads, the loads stay about the same (including the CPU load).
So why is that? Where is the bottleneck?
I am using GeForce GTX 560 Ti, CUDA 8RC, MSVC++2013, Windows 10.
One guess of mine is that Windows 10 applies some aggressive power saving, even though GPU and CPU temperatures are low, the power plan is set to "High performance" and the power supply is 700W while power consumption with max CPU and GPU TDP is about 550W.
Another guess is that double-precision speed is 1/12 of the single-precision speed because there is 1 double-precision CUDA core per 12 single-precision CUDA cores on my card, and GPU-Z takes as 100% the situation when all single-precision and double-precision cores are used. However, the numbers do not quite match.
Apparently the reason was low occupancy due to CUDA threads using too many registers by default. To tell the compiler the limit on the number of registers per thread, __launch_bounds__ can be used, as described here. So to be able to launch all 1536 threads per multiprocessor on the 560 Ti, for block size 256 the following can be specified:
__global__ void __launch_bounds__(256, 6) MyKernel(...) { ... }
After limiting the number of registers per CUDA thread, the GPU usage rose to 60% for me.
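The arithmetic behind __launch_bounds__(256, 6) can be sketched as follows. The 1536 resident threads per multiprocessor come from the answer; the register file size of 32768 registers per multiprocessor is my assumption for a compute capability 2.x card such as the 560 Ti, so check the occupancy documentation for your device. The function name is hypothetical, and real compilers allocate registers with some granularity, so treat the result as an upper bound.

```python
def launch_bounds_budget(max_threads_per_sm, regs_per_sm, block_size):
    """Return (blocks per multiprocessor, register budget per thread) for full occupancy."""
    blocks_per_sm = max_threads_per_sm // block_size
    resident_threads = blocks_per_sm * block_size
    regs_per_thread = regs_per_sm // resident_threads
    return blocks_per_sm, regs_per_thread


if __name__ == "__main__":
    # 32768 registers per multiprocessor is an assumed value, see the note above.
    blocks, regs = launch_bounds_budget(1536, 32768, 256)
    print(f"{blocks} blocks of 256 threads, at most {regs} registers per thread")
```

The block count of 6 matches the second argument of __launch_bounds__ in the kernel declaration, and the register budget shows how tightly the compiler has to pack each thread to reach it.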
By the way, 5xx series cards are still supported by NSight v5.1 for Visual Studio. It can be downloaded from the archive.
EDIT: the following flags have further increased GPU usage to 70% in an application that uses multiple GPU streams from multiple CPU threads:
cudaSetDeviceFlags(cudaDeviceScheduleYield | cudaDeviceMapHost | cudaDeviceLmemResizeToMax);
cudaDeviceScheduleYield lets other threads execute when a CPU thread is waiting on a GPU operation, rather than spinning while waiting for the result.
cudaDeviceLmemResizeToMax, as I understood it, makes kernel launches themselves asynchronous and avoids excessive local memory allocations and deallocations.

Resources