I'm currently generating timelapse videos on the CPU with fluent-ffmpeg running on Node.js. It takes roughly 1 minute to generate a 10-second timelapse. I'm generating many at the same time (basically one per thread), and I tend to get the best performance at 8 worker threads. ... overall system throughput is about one video per 12 seconds.
GPU encoding with h264_nvenc brings the per-video time down to about 3-4 seconds. Yippee! I went out and bought some NVIDIA GTX 1660s to take advantage of this.
Unfortunately, when I go to generate the third simultaneous video, I get a "Conversion Failed!" error from FFmpeg.
Some basic research seems to show you can only run 2 NVENC sessions at a time, perhaps 3 with updated drivers.
Is there a way around this? This post indicates the limit is artificial and can be worked around: https://www.techpowerup.com/268495/nvidia-silently-increases-geforce-nvenc-concurrent-sessions-limit-to-3
Or perhaps there is a way to use the CUDA/tensor/etc. cores to render the timelapse videos instead of relying only on the session-limited NVENC?
The current limit is 3 simultaneous NVENC sessions on both my GTX 1060 and my RTX 2080 Ti, and another post says the GTX 1660 is the same, so this is clearly an artificial limit. Looking at the NVIDIA link posted above, which has a list of cards and their NVENC/NVDEC capabilities, most NVIDIA gaming cards have this 3-session limit, while most modern (Pascal and up) Quadro cards allow unlimited sessions per card.
As another workaround, you can put multiple gaming cards in one system; FFmpeg lets you send a particular job to the card of your choosing. The same encoder block is in the GTX 1660 as in the RTX 2080 Ti, so there shouldn't be much speed difference between low-end and high-end cards. Maybe some minor difference from memory bus width, but I haven't compared the 1660 and the 2080 Ti directly. What I'm saying is: if you need more encoding horsepower, just buy another couple of low-end cards and divide up the workload using FFmpeg's built-in functionality.
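For illustration, here is a minimal sketch (in Python, since the fluent-ffmpeg specifics vary) of dividing jobs across cards, assuming an ffmpeg build whose h264_nvenc encoder exposes the -gpu index option; the input and output file names are made up:

import subprocess
from concurrent.futures import ThreadPoolExecutor

jobs = ["clip0.mkv", "clip1.mkv", "clip2.mkv", "clip3.mkv"]  # hypothetical inputs
NUM_GPUS = 2  # how many NVENC-capable cards are installed

def encode(index, src):
    gpu = index % NUM_GPUS  # round-robin the jobs across the cards
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-c:v", "h264_nvenc", "-gpu", str(gpu),  # choose which card encodes this job
        "out_%d.mp4" % index,
    ], check=True)

# Keep at most 3 workers per card so each card stays under the driver's session cap.
with ThreadPoolExecutor(max_workers=NUM_GPUS * 3) as pool:
    list(pool.map(encode, range(len(jobs)), jobs))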
Related
I've noticed that when using the filter gt(scene,0.1), for example:
ffmpeg -i big_buck_bunny.mkv -filter:v "select='gt(scene,0.1)',showinfo" -f null -
Depending on the video, the number of CPU cores used varies a lot (sometimes 3 cores are in use, other times 12 cores for a different video).
I would like to ask: what determines that behaviour?
I tried to read the FFmpeg source code but I'm not familiar with it. A general explanation would be enough, but I'd much appreciate it if you could point out the lines/directories that determine this logic in https://github.com/FFmpeg/FFmpeg.
(I'm also not asking how to reduce CPU usage; I'm interested in the logic that determines it.)
In the source code there is no mention of multithreading in the filter; keywords for searching the source: get_scene_score, scene_sad.
Further searching suggests that this is due to the video decoding rather than the filter (the filter always uses 1 CPU core), so you could use -threads 1 to limit CPU usage; note that the position of the -threads parameter also matters.
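For example, a quick sketch of the placement difference (assuming the same big_buck_bunny.mkv input as above; only the option that appears before -i limits the decoder):

import subprocess

# -threads 1 before -i caps the decoder's thread count,
# so the select/showinfo run stays on one core.
subprocess.run([
    "ffmpeg", "-threads", "1", "-i", "big_buck_bunny.mkv",
    "-filter:v", "select='gt(scene,0.1)',showinfo",
    "-f", "null", "-",
])

# The same option placed after -i only applies to the output side,
# so the decoder is still free to spin up many threads.
subprocess.run([
    "ffmpeg", "-i", "big_buck_bunny.mkv",
    "-filter:v", "select='gt(scene,0.1)',showinfo",
    "-threads", "1", "-f", "null", "-",
])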
We have been porting parts of our CPU pipeline to Metal to speed up some of the slowest stages, with success. However, since only parts are ported, we are transferring data back and forth to the GPU, and I want to know how much time this actually takes.
Using the frame capture in Xcode, it tells me that the kernels take around 5-20 ms each, for a total of 149.5 ms (all encoded in the same command buffer).
Using Instruments I see quite different numbers:
The entire operation takes 1.62 seconds (Points - Code 1).
MTLTexture replaceRegion takes up the first 180 ms, followed by the CPU being stalled for the next 660 ms at MTLCommandBuffer waitUntilCompleted (highlighted area), and then the last 800 ms is spent in MTLTexture getBytes, which maxes out that CPU thread.
Using the Metal instruments I get a few more measurements: 46 ms for "Compute Command 0", 460 ms for "Command Buffer 0", and 210 ms for "Page Off". But I'm not seeing how any of this relates to the workload.
The closest thing to an explanation of "Page off" I could find is this:
Texture Page Off Data (Non-AGP)
The number of bytes transferred for texture page-off operations. Under most conditions, textures are not paged off but are simply thrown away, since a backup exists in system memory. Texture page-off traffic usually happens when VRAM pressure forces a page-off of a texture that only has valid data in VRAM, such as a texture created using the function glCopyTexImage, or modified using the functions glCopyTexSubImage or glTexSubImage.
Source: Xcode 6 - OpenGL Driver Monitor Parameters
This makes me think it could be the part that copies the memory off the GPU, but then there would be no reason for getBytes to take that long. And I can't see where the 149.5 ms from Xcode fits into the data from Instruments.
Questions
When exactly is the data transferred? If this cannot be inferred from the measurements I did, how do I obtain those measurements?
Does the GPU code actually take only 149.5 ms to execute, or is Xcode lying to me? If not, where are the remaining 660 - 149.5 ms being spent?
I'm using an STM32F411 with the USB CDC library, and the max speed for this library is ~1Mb/s.
I'm creating a project where I have 8 microphones connected to ADC lines (this part works fine). I need a 16-bit signal, so I'm increasing the accuracy by summing the first 16 readings from one line (the ADC gives only a 12-bit signal). I need 96k 16-bit samples per second for one line, so that's 0.768M samples per second for all 8 lines. This data needs 12000Kb of space, but the STM32 has only 128Kb of SRAM, so I decided to send about 120 packets of 100Kb of data each second.
The conclusion is that I need ~11.72Mb/s to send this.
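For reference, that figure comes from this arithmetic (a quick check in Python, assuming 8 lines at 96k samples per second and 16 bits per sample):

lines = 8
samples_per_second = 96_000      # per line, after the oversampling step
bits_per_sample = 16

required = lines * samples_per_second * bits_per_sample  # bits per second
print(required / 1024**2)   # ~11.72 Mb/s, counting 1 Mb as 1024*1024 bits
print(required / 8 / 1024)  # ~1500 kilobytes of raw data every second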
The problem is that I'm unable to do that because USB CDC limits me to ~1Mb/s.
The question is how to increase the USB speed to 12Mb/s on the STM32F4. I need some pointers or a library.
Or should I maybe set up an "audio device" in CubeMX instead?
If the lowercase 'b' in your question means bytes, the answer is: it is not possible, as your micro has a full-speed (FS) USB peripheral whose maximum raw speed is 12 Mbits per second.
If it means bits, then your assumption that you are limited to ~1Mb (bit)/s is wrong. But you will still not reach 12 Mbit/s of payload transfer.
You may try to write your own device class (only if 'b' means bit), but I'm afraid you will not find a ready-made library. You will also need to write the device driver on the host computer.
I am running a Python program that calls H2O for deep learning (training and testing). The program runs in a loop of 20 iterations, and each iteration calls H2ODeepLearningEstimator() 4 times, along with the associated predict() and model_performance(). I am doing h2o.remove_all() and cleaning up all data-related Python objects after each iteration.
Data size: training set of 80,000 rows with 122 features (all float), with 20% for validation (10-fold CV); test set of 20,000 rows. Doing binary classification.
Machine 1: Windows 7, 4-core Xeon, 3.5 GHz per core, 32 GB memory
Takes about 24 hours to complete
Machine 2: CentOS 7, 20-core Xeon, 2.0 GHz per core, 128 GB memory
Takes about 17 hours to complete
I am using h2o.init(nthreads=-1, max_mem_size = 96)
So, the speed-up is not that much.
My questions:
1) Is the speed-up typical?
2) What can I do to achieve substantial speed-up?
2.1) Will adding more cores help?
2.2) Are there any H2O configuration options or tips that I am missing?
Thanks very much.
- Mohammad,
Graduate student
If the training time is the main effort, and you have enough memory, then the speed-up will be proportional to cores times core speed. So you might have expected a 40/14 ≈ 2.86x speed-up (20 cores x 2.0 GHz = 40 vs. 4 cores x 3.5 GHz = 14), i.e. your 24 hrs coming down to the 8-10 hour range.
There is a typo in your h2o.init(): 96 should be "96g". However, I think that was a typo when writing the question, as h2o.init() would return an error message. (And H2O would fail to start if you'd tried "96", with the quotes but without the "g".)
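For reference, the form that starts H2O with all cores and a 96 GB heap looks like this in the Python client (note the quoted size with the "g" suffix):

import h2o
h2o.init(nthreads=-1, max_mem_size="96g")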
You didn't show your h2o.deeplearning() command, but I am guessing you are using early stopping. And that can be unpredictable. So, what might have happened is that your first 24hr run did, say, 1000 epochs, but your second 17hr run did 2000 epochs. (1000 vs. 2000 would be quite an extreme difference, though.)
It might be that you are spending too much time scoring. If you've not touched the defaults, this is unlikely. But you could experiment with train_samples_per_iteration (e.g. set it to 10 times the number of your training rows).
What can I do to achieve substantial speed-up?
Stop using cross-validation. That might be a bit controversial, but personally I think 80,000 training rows is going to be enough to do an 80%/10%/10% split into train/valid/test. That will be 5-10 times quicker.
If it is for a paper, and you want to show more confidence in the results, then once you have your final model, and you've checked that the test score is close to the validation score, rebuild it a couple of times using a different seed for the 80/10/10 split, and confirm you end up with the same metrics. (*)
*: By the way, take a look at the score for each of the 10 CV models you've already made; if they are fairly close to each other, then this approach should work well. If they are all over the place, you might have to reconsider the train/valid/test splits - or just think about what it is in your data that might be causing that sensitivity.
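If it helps, here is a minimal sketch of that split-based setup in the Python client; the file name and the assumption that the response is the last column are made up, and all estimator parameters are left at their defaults:

import h2o
from h2o.estimators import H2ODeepLearningEstimator

h2o.init(nthreads=-1, max_mem_size="96g")
df = h2o.import_file("training_data.csv")          # hypothetical file name
train, valid, test = df.split_frame(ratios=[0.8, 0.1], seed=42)

x, y = df.columns[:-1], df.columns[-1]              # assumes the response is the last column
model = H2ODeepLearningEstimator()                  # no nfolds, so no 10-fold CV overhead
model.train(x=x, y=y, training_frame=train, validation_frame=valid)
print(model.model_performance(test))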
I'm using ffmpeg to stream my desktop over UDP, but my problem is that ffmpeg's process always takes 100% CPU for the entire time it is running, leaving no room for other applications. My question is: how can I restrict ffmpeg's process to use only 50-60% of the CPU?
My CPU has a single core
2 GB RAM
FFmpeg has a -threads option. You can either leave it on auto (the default) or use it to limit the number of threads (CPU cores) used. The usual recommendation is to set it to the number of available threads minus 1 or 2, so if you have 8 threads, -threads 6 would be a good choice. But with only one core and 2 GB of RAM, I'm not sure this will help much - FFmpeg requires a lot of resources.
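As an illustration only, here is a minimal sketch of passing -threads from a Python script; the x11grab input and the UDP address are made-up placeholders, and on a single-core machine -threads mainly limits how many threads are created rather than guaranteeing a 50-60% CPU cap:

import subprocess

subprocess.run([
    "ffmpeg",
    "-f", "x11grab", "-i", ":0.0",              # placeholder screen-capture input
    "-threads", "1",                             # cap the encoder's thread count
    "-f", "mpegts", "udp://127.0.0.1:1234",      # placeholder UDP destination
])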