I have a function that splits a multi-page TIFF into single pages, and it uses the Windows BitBlt function. In terms of performance, would the video card have any influence on doing the split? Would it be worth using a straight C/C++ library instead?
The video card won't participate in any of this unless it is the destination HDC of the BitBlt. A library dedicated to imaging functions should perform better for this task, since ultimately you will be writing these pages to disk.
If you were making alterations to the image data, then there is the possibility that using your video card could help, but only if you are rendering a lot of new image data for the destination TIFFs, particularly 3D scenes and the like.
If BitBlt can map the pages into video memory, there is a very good chance that your video card will be much, much faster than the CPU. This is for a few reasons:
The card will run in parallel with your CPU, so you can do other work while it is running.
The video card is optimized to perform the memory copies on its own, instead of requiring the CPU to copy each word from one place to another. This frees your CPU bus up for other things.
The video card probably has a larger word size for data moves, and if your blit has any operation flags attached, those are likely optimized in hardware. Also, the memory on most video cards is faster than system memory.
Note that these things aren't always true. For example, if your card shares system memory then it won't have faster access to the memory than the CPU. However, you still get the parallel support.
Finally, there is the possibility that the overhead of transferring the image to the card and back will overwhelm the speed improvement you get by doing it on the card. So you just need to experiment.
I should add - I believe that you need to specify that the memory is on-card in the device context. I don't think that just creating a memory context does anything particular with the video card.
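To illustrate that last point, here is a minimal sketch of a per-page copy through GDI; it assumes a hypothetical `hdcSrc` with the decoded page already selected into it, and `CreateCompatibleDC` by itself only gives you a memory DC, with the driver deciding where the bits actually live:

```cpp
// Minimal sketch: copy one decoded TIFF page via GDI. hdcSrc is assumed to be
// a DC with the source page already selected into it (hypothetical handle).
#include <windows.h>

HBITMAP CopyPage(HDC hdcSrc, int width, int height)
{
    // A memory DC by itself says nothing about where the bits live;
    // the driver decides whether the compatible bitmap sits in VRAM.
    HDC hdcDst = CreateCompatibleDC(hdcSrc);
    HBITMAP hbmPage = CreateCompatibleBitmap(hdcSrc, width, height);
    HGDIOBJ hbmOld = SelectObject(hdcDst, hbmPage);

    // Plain SRCCOPY: a straight pixel copy, no raster-op flags to accelerate.
    BitBlt(hdcDst, 0, 0, width, height, hdcSrc, 0, 0, SRCCOPY);

    SelectObject(hdcDst, hbmOld);
    DeleteDC(hdcDst);
    return hbmPage;   // caller owns the page bitmap (DeleteObject when done)
}
```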
I am studying a display device driver for Linux that runs a TFT display; the framebuffer stores all the data that is to be displayed.
Question: does the display driver have an equivalent buffer of its own to handle the framebuffer from the kernel?
My concern is that the processor has to take the output from the GPU and produce a framebuffer to be sent out to the display driver, but depending on the display there might be some latencies and other issues. So does the display driver access the framebuffer directly, or does it use a buffer of its own as well?
This is a rabbit-hole question; it seems simple on the surface, but a factual answer is bound to end up in fractal complexity.
It's literally impossible to give a generalized answer.
The short version is: GPUs have their own memory, which is directly visible to the CPU in the form of a memory mapping (you can query the actual range of physical addresses from e.g. /sys/class/drm/card0/device/resource). Somewhere in there is also the memory used for the display scanout buffer. When using GPU-accelerated graphics, the GPU will write directly to those scanout buffers – possibly to memory that's on a different graphics card (that's how e.g. hybrid graphics works).
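For reference, a minimal sketch that just dumps those ranges (assuming `card0` is the GPU in question; the path may differ on your system):

```cpp
// Minimal sketch: print the PCI BAR ranges of the first DRM card.
// Each line of the 'resource' file is "start end flags" as hexadecimal values.
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream res("/sys/class/drm/card0/device/resource");
    std::string line;
    while (std::getline(res, line))
        std::cout << line << '\n';   // one BAR (base address register) per line
    return 0;
}
```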
My concern is that the processor has to take the output from the GPU and produce a framebuffer to be sent out to the display driver
Usually that's not the case. However even if there's a copy involved, these days bus bandwidths are large enough for that copy operation not to matter.
I am studying the display device driver for linux that runs TFT display
If this is a TFT display connected over SPI or a parallel bus made from GPIOs, then yes, there'll be some memory reserved for the image to reside in. Strictly speaking this can be in the CPU's RAM, or in the VRAM of a GPU, if there is one. However, as far as latencies go, the copy operations for scanout don't really matter these days.
20 years ago, yes, they did; and even back then, with clever scheduling, you could avoid the latencies.
I'm developing an app (Windows) that will allow essentially random access to about 1000 1920x1080 images. In effect this is a movie, but using stills and not presented sequentially -- the user can "scrub" to any image very rapidly.
I gather there are three factors to trade off: load time, decode time (if needed) and presentation time. I can specify the hardware, within limits, so SSD and good graphics card can be assumed.
Compressed images (PNG, JPG etc) will load more quickly but have an added decode step. Raw or BMP images will be slower to load but avoid the decode step. Presentation time should be the same in all cases, right, once in the proper form?
Is there an obviously superior approach, codec, library, hardware etc? Can anyone point to a study of the tradeoffs, or offer personal experience as a guide?
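Not part of the original post, but one way to get hard numbers for your own hardware is to time the load and decode steps separately. In the sketch below, `Decode()` and the file name are hypothetical placeholders for whichever codec and test data you choose:

```cpp
// Minimal timing sketch: measure disk load vs. decode separately.
#include <chrono>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

static std::vector<char> LoadFile(const char* path)
{
    std::ifstream f(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(f), {});
}

// Hypothetical stand-in: replace with the real decoder you are evaluating.
static std::vector<char> Decode(const std::vector<char>& compressed)
{
    return compressed;   // placeholder; a real decoder returns raw pixels
}

int main()
{
    using clock = std::chrono::steady_clock;
    auto ms = [](auto d) { return std::chrono::duration<double, std::milli>(d).count(); };

    auto t0 = clock::now();
    auto bytes = LoadFile("frame_0001.jpg");   // hypothetical test image
    auto t1 = clock::now();
    auto pixels = Decode(bytes);
    auto t2 = clock::now();

    std::cout << "load: " << ms(t1 - t0) << " ms, decode: " << ms(t2 - t1)
              << " ms (" << pixels.size() << " bytes out)\n";
}
```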
Disclaimer: I know very little about memory management or performance, and I code in C#.
Question:
Does "caching" medium-sized data (in the order of, say, dozens of MBs), especially media that will be sent at any time to a device (audio and images), on disk (instead of "keeping it in (virtual) memory"), in face of the fact that any OS will swap (maybe "page" is the correct word) unused memory to disk?
This may not have been clear, so I'll post examples.
It is mainly related to user interfaces, not network I/O.
Examples of what I'm talking about:
FooSlideshow app could store slides on disk instead of allocating virtual memory for them.
BarGame could store sounds of different, numerous events on disk and load them for playing.
BazRenderer could store bitmaps of the several layers in a composite image, if they're not prone to constant change (if only one layer changes, the rest just have to be read back from disk).
Examples of what I'm not talking about:
FooPlayer caches a buffer of the song while it streams from the server.
BarBrowser caches images because the user may visit the same page.
Why I should care:
Because, let's say a slideshow, when shown fullscreen on a 1024x768 screen at 32 bits/pixel, would take 1024 * 768 * 4 bytes = 3 MiB per slide (about 8 MiB for an HD screen). So for a 10-slide slideshow, that would be 30-80 MiB just to cache the images. A short song, converted to 16-bit samples at 44.1 kHz (CD quality), would also weigh about that much on average.
From my C# code (but it could be Java, Python, whatever), should I care about building a complex caching system to free memory whenever possible, or should I trust the OS to swap that out? (And would the result be the same? Would one approach be better than the other? Why?)
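Not an answer from the thread, but to make "trusting the OS" concrete: one option is to memory-map the asset files, so the pager brings data in on first touch and can discard clean pages under memory pressure. The sketch below uses the Win32 API directly (the file name is a placeholder); in C# the same idea is exposed through System.IO.MemoryMappedFiles.

```cpp
// Minimal sketch (Win32): memory-map an asset and let the OS do the paging.
#include <windows.h>
#include <iostream>

int main()
{
    HANDLE file = CreateFileA("slide01.bmp", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (!mapping) return 1;

    const BYTE* data = static_cast<const BYTE*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
    if (!data) return 1;

    // Touching data[] faults in only the pages actually used; untouched
    // slides cost nothing, and clean pages can be dropped without a swap write.
    std::cout << "first byte: " << static_cast<int>(data[0]) << '\n';

    UnmapViewOfFile(data);
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}
```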
What is the overhead of continually uploading textures to the GPU (and replacing old ones)? I'm working on a new cross-platform 3D windowing system that uses OpenGL, and am planning on uploading a single Bitmap for each window (containing the UI elements). That Bitmap would be updated in sync with the GPU (using the VSync). I was wondering if this is a good idea, or if constantly writing bitmaps would incur too much of a performance overhead. Thanks!
Well something like nVidia's Geforce 460M has 60GB/sec bandwidth on local memory.
PCI express 2.0 x16 can manage 8GB/sec.
As such, if you are trying to transfer too many textures over the PCIe bus, you can expect to come up against memory bandwidth problems. That gives you about 136 MB per frame at 60 Hz. An uncompressed 24-bit 1920x1080 frame is roughly 6 MB, so, suffice to say, you could upload a fair few frames of video per frame on an x16 graphics card.
Sure, it's not quite as simple as that: there is PCIe overhead of around 20%, and all draw commands must be uploaded over that link too.
In general though you should be fine providing you don't over do it. Bear in mind that it would be sensible to upload a texture in one frame that you aren't expecting to use until the next (or even later). This way you don't create a bottleneck where the rendering is halted waiting for a PCIe upload to complete.
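Not from the answer above, but one common way to implement that "upload now, use next frame" advice in OpenGL is a pixel buffer object. A minimal sketch follows; the GL loader setup, texture creation and the window's pixel source are assumed to already exist:

```cpp
// Sketch: stream a window's bitmap into a texture through a pixel buffer
// object, so glTexSubImage2D returns quickly and the actual DMA happens later.
#include <GL/glew.h>   // any GL loader exposing GL 2.1+ pixel buffer objects
#include <cstring>

void UploadWindowBitmap(GLuint tex, GLuint pbo,
                        const void* pixels, int width, int height)
{
    const GLsizeiptr size = GLsizeiptr(width) * height * 4;   // assumes 32-bit BGRA

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, size, nullptr, GL_STREAM_DRAW);  // orphan old data
    void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
    std::memcpy(dst, pixels, size_t(size));
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

    // With a PBO bound, the last argument is an offset into the buffer (0 here),
    // not a client-memory pointer, so the call can return before the copy finishes.
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_BGRA, GL_UNSIGNED_BYTE, nullptr);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
    // Ideally sample 'tex' only on the following frame, per the advice above.
}
```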
Ultimately, your answer is going to be profiling. However, some early optimizations you can make are to avoid updating a texture if nothing has changed. Depending on the size of the textures and the pixel format, this could easily be prohibitively expensive.
Profile with a simpler situation that simulates the kind of usage you expect. I suspect the performance overhead (without the optimization I mentioned, at least) will become unusable once you have more than a handful of windows, depending on their size.
I have a Direct3D 9 application and I would like to monitor the memory usage.
Is there a tool to know how much system and video memory is used by Direct3D?
Ideally, it would also report how much is allocated for textures, vertex buffers, index buffers...
You can use the old DirectDraw interface to query the total and available memory.
The numbers you get that way are not reliable though.
The free memory may change at any instant, and the available memory often takes AGP memory into account (which is strictly speaking not video memory). You can use the numbers to make a good guess about the default texture resolutions and detail level of your application/game, but that's it.
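A minimal sketch of that query, assuming the legacy DirectDraw headers and libraries (ddraw.lib, dxguid.lib) are still available in your SDK; as noted, treat the numbers as rough:

```cpp
// Sketch: query total/free video memory through the legacy DirectDraw interface.
// Link against ddraw.lib and dxguid.lib.
#include <ddraw.h>
#include <iostream>

int main()
{
    LPDIRECTDRAW7 dd = nullptr;
    if (FAILED(DirectDrawCreateEx(nullptr, reinterpret_cast<void**>(&dd),
                                  IID_IDirectDraw7, nullptr)))
        return 1;

    DDSCAPS2 caps = {};
    caps.dwCaps = DDSCAPS_VIDEOMEMORY | DDSCAPS_LOCALVIDMEM;   // exclude AGP memory

    DWORD total = 0, freeMem = 0;
    if (SUCCEEDED(dd->GetAvailableVidMem(&caps, &total, &freeMem)))
        std::cout << "total: " << total / (1024 * 1024) << " MiB, free: "
                  << freeMem / (1024 * 1024) << " MiB\n";

    dd->Release();
    return 0;
}
```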
You may wonder why there is no way to get better numbers; after all, it can't be too hard to track the resource usage.
From an application point of view that is correct. You may think that video memory just contains surfaces, textures, index and vertex buffers and some shader programs, but that's not true on the low-level side.
There are lots of other resources as well, all created and managed by the Direct3D driver to make rendering as fast as possible. Among others there are hierarchical z-buffer acceleration structures and pre-compiled command lists (i.e. the data required to render something, in the format understood by the GPU). The driver may also queue rendering commands for multiple frames in advance to even out the frame rate and increase parallelism between the GPU and CPU.
The driver also does a lot of work under the hood for you. Heuristics are used to detect draw calls with static geometry and constant rendering settings, and the driver may decide to optimize the geometry in these cases for better cache usage. This all happens in parallel and under the control of the driver. All this stuff needs space as well, so the free memory may change at any time.
However, the driver also does caching for your resources, so you don't really need to know the resource usage in the first place.
If you need more space than is available, that's no problem. The driver will move the data between system RAM, AGP memory and video RAM for you. In practice you never have to worry about running out of video memory. Sure, once you need more video memory than is available the performance will suffer, but that's life :-)
Two suggestions:
You can call GetAvailableTextureMem at various times to obtain a (rough) estimate of overall memory usage progression; a minimal sketch is at the end of this answer.
Assuming you develop on nVidia hardware, PerfHUD includes a graphical representation of consumed AGP/video memory (shown separately).
You probably won't be able to obtain a nice clean matrix of memory consumers (vertex buffers etc.) vs. memory location (AGP, VID, system), as -
(1) the driver has a lot of freedom in transferring resources between memory types, and
(2) the actual variety of memory consumers is far greater than the exposed D3D interfaces.
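As mentioned in the first suggestion, a minimal sketch of sampling that number, assuming `device` is your existing IDirect3DDevice9*:

```cpp
// Sketch: log D3D9's rough "available texture memory" estimate at any point
// of interest (e.g. after creating a batch of resources) to watch the trend.
#include <d3d9.h>
#include <cstdio>

void LogAvailableTextureMem(IDirect3DDevice9* device)
{
    // The runtime rounds the value to the nearest MB; treat it as an estimate.
    UINT bytes = device->GetAvailableTextureMem();
    std::printf("approx. available texture memory: %u MiB\n", bytes / (1024 * 1024));
}
```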