I am trying to design a convolution kernel in CUDA. It will take relatively small pictures (typically, for my application, a 19 x 19 image).
In my research, I found most notably this paper: https://www.evl.uic.edu/sjames/cs525/final.html
I understand the concept of it, but I wonder: for small images, is it fast enough to use one block per pixel of the original image, with the threads of that block fetching the pixels to be weighted and then performing a block-wide reduction? I made a basic implementation that keeps global memory accesses coalesced. Is that a good design for small pictures, or should I follow the "traditional" method?
It all depends upon your eventual application for your program. If you intend to only convolve a few "relatively small pictures", as you mention, then a naive approach should be sufficient. In fact, a serial approach may even be faster due to memory transfer overhead between the CPU and GPU if you're not processing much data. I would recommend first writing the kernel which accesses global memory, as you mention, and if you will be working with a larger dataset in the future, it would make sense to attempt the "traditional" approach as well, and compare runtimes.
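For reference, here is a minimal sketch of the block-per-output-pixel design described in the question: one block per output pixel, one thread per filter tap, and a shared-memory reduction across the block. All names (d_img, d_filter, d_out, W, H, K) and the clamp-to-edge border handling are illustrative assumptions, not taken from the question.

    // Sketch only: one block per output pixel, one thread per filter tap.
    // Launch: convolvePerPixel<<<dim3(W, H), dim3(K, K), K * K * sizeof(float)>>>(...)
    __global__ void convolvePerPixel(const float* d_img, const float* d_filter,
                                     float* d_out, int W, int H, int K)
    {
        extern __shared__ float partial[];              // K * K floats, size passed at launch

        int outX = blockIdx.x;                          // output pixel handled by this block
        int outY = blockIdx.y;
        int tapX = threadIdx.x;                         // filter tap handled by this thread
        int tapY = threadIdx.y;
        int tid  = tapY * blockDim.x + tapX;
        int n    = blockDim.x * blockDim.y;

        // Clamp-to-edge addressing for taps that fall outside the image (an assumption).
        int srcX = min(max(outX + tapX - K / 2, 0), W - 1);
        int srcY = min(max(outY + tapY - K / 2, 0), H - 1);

        partial[tid] = d_img[srcY * W + srcX] * d_filter[tapY * K + tapX];
        __syncthreads();

        // Block-wide tree reduction that also works when n is not a power of two.
        for (int stride = 1; stride < n; stride *= 2) {
            if (tid % (2 * stride) == 0 && tid + stride < n)
                partial[tid] += partial[tid + stride];
            __syncthreads();
        }

        if (tid == 0)
            d_out[outY * W + outX] = partial[0];
    }

For a 19 x 19 image this launches only 361 blocks, so the GPU will be mostly idle; as the answer says, profiling both designs is the only way to know which wins for your data sizes.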
When you look at programs written for several old computers of the 80s, like the Commodore 64, Atari, and NES, they are extremely small in size, with most of them only a few hundred kilobytes.
Not to mention these computers had very little memory to run in; the Commodore 64 had 64 KB of RAM and yet managed to run a GUI OS!
How were these programs written to be so small?
Many of them seem unbelievable given the hardware constraints they had.
On a Commodore 64, a 320 x 200 screen at 4 bpp would have eaten up half its 64K of memory,
while the Atari 2600 had just 128 bytes of RAM.
In today's apps, 80% or more of the on-disk size is graphical elements.
When space was expensive relative to programmer time, programmers spent more time optimizing for size, and often went to raw assembly. Today, space is cheap, so it doesn't pay for a company to save space.
Compare Notepad to Edlin. Both are the simplest reasonable text editor for their paradigm. Edlin fits program and data comfortably into less than 64 K, but there is no way one could claim Notepad is merely a graphical Edlin.
The C64 did not have a GUI OS. It had a rudimentary menu system, and a skilled programmer could use custom hardware sprites to overlay small graphical icons.
In multi-colour (low-resolution) mode you had 2 bits per pixel, each dot being one of 4 colours drawn from the 16-colour palette. In high-resolution mode you had 1 bit per pixel (monochrome). Today's systems presume 16 bits per pixel or better at 1080p (1920 x 1080 pixels). Even a monochrome frame has ballooned from about 8K to hundreds of kilobytes, and with modern displays expecting 24-bit colour depth or better, the minimum storage required for a single frame is several megabytes. Add to that the working space for buffering and the other things a modern graphics card does, and it doesn't take long for your graphics needs to run to gigabytes. There is a reason the high-resolution mode on that generation of computer was rarely used.
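As a rough illustration of that growth, here is a back-of-the-envelope calculation (figures are illustrative, assuming 1 bpp for the C64 high-resolution screen and 24 bpp for a 1080p frame):

    #include <cstdio>

    // Back-of-the-envelope frame-buffer sizes; purely illustrative.
    int main() {
        const long c64HiresBytes = 320L * 200 * 1 / 8;       // 8,000 bytes at 1 bpp
        const long hd24Bytes     = 1920L * 1080 * 24 / 8;    // ~6.2 million bytes at 24 bpp
        std::printf("C64 hires frame:    %ld bytes\n", c64HiresBytes);
        std::printf("1080p 24-bit frame: %ld bytes (about %.1f MB)\n",
                    hd24Bytes, hd24Bytes / (1024.0 * 1024.0));
        return 0;
    }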
When you loaded a program, you unloaded the OS. You ran only one program at a time. Today I regularly run twenty or more apps at a time, not to mention the dozens of background processes necessary to do my work.
I can't speak for other systems, but for the C64 (and C128)...
As pojo-guy stated, we often went straight to assembly, which cut down on operating-system overhead: the OS was in ROM, so it used no RAM. Moreover, you could flip the ROM in and out by playing with memory registers, virtually "doubling" the available memory (although some of that "available" memory was read-only, the key point being that you didn't waste precious RAM on OS routines). By utilizing the ROM routines and straight assembly, a large amount of (memory) overhead was eliminated.
For bitmaps you had two choices: high resolution or multi-colour mode. In high resolution (320 x 200), each pixel showed either the foreground or the background colour, so you only needed 320 x 200 = 64,000 bits, or 8,000 bytes.
Standard multi-colour mode offered four colours at the expense of horizontal resolution. To quote the C64 programmers reference manual:
Each dot in multi-colour mode can be one of 4 colours: screen colour (background colour register #0), the colour in background colour register #1, the colour in background colour register #2, or character colour. The only sacrifice is the horizontal resolution, because each multi-colour mode dot is twice as wide as a high-resolution dot. The minimal loss of resolution is more than compensated for by the extra capabilities of multi-colour mode.
The reduction in overhead, a simpler OS (which could be completely swapped out), and simpler functionality (e.g. 2-bit colour, which allowed you four colours) made things much smaller. As techniques improved, coders also applied overlays: loading parts of a program in while other parts were running.
Also, more advanced architectures (e.g. the C128) had separate video RAM (either 16K or 64K, depending on your model of C128), which gave you even more space to flex your coding muscles, since graphics (or text) did not take up main memory.
Look up any of the 4K demo competitions to see what can really be done on a machine with such a small memory footprint.
As the C64 is an 8-bit computer, all machine-language opcodes are 8 bits (one byte) long. In addition to that, an instruction might have 0-2 operand bytes after it, so each instruction takes 1-3 bytes of RAM.
Now, when we move forward to more modern systems, the CPU is often already 64-bit.
Basically, all CPUs have some "preferred size" for variables (one they can handle efficiently), and usually it's exactly the same as the number of bits the processor has. This is usually what "int" is in C (with the exception that it's guaranteed to always be at least 16 bits, while the "preferred" size on an 8-bit CPU is obviously 8 bits).
So for an integer, no matter how small its value might be, it's most efficient to use this "preferred size".
So on 8-bit systems that would be 8 bits (which obviously can't be a C int), and on 64-bit systems that would be 64 bits. So that's 8x the size.
Of course you can use smaller types, but that is often less efficient, and it also affects struct padding.
And with pointers, you're usually stuck with the number of bits the CPU has (as you need to be able to address the whole memory range).
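A small, hedged illustration of that size difference on a typical 64-bit desktop compiler (exact sizes and padding are ABI- and compiler-dependent; the struct names are made up for the example):

    #include <cstdint>
    #include <cstdio>

    struct TwoBytes  { std::uint8_t a; std::uint8_t b; };     // usually 2 bytes
    struct PaddedMix { std::uint8_t a; std::uint64_t b; };    // usually 16 bytes: 7 padding bytes after 'a'

    int main() {
        std::printf("int:       %zu bytes\n", sizeof(int));        // commonly 4 on 64-bit desktops
        std::printf("void*:     %zu bytes\n", sizeof(void*));      // 8 when the CPU uses 64-bit addresses
        std::printf("TwoBytes:  %zu bytes\n", sizeof(TwoBytes));
        std::printf("PaddedMix: %zu bytes\n", sizeof(PaddedMix));  // padding inflates the struct
        return 0;
    }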
And while data values are generally bigger, so are the machine instructions. On the other hand, that allows for more complex instructions, which can perform operations that would need several 8-bit instructions.
Of course there are exceptions, like the Thumb instruction set on ARM.
What I am saying is that efficient assembly language code on a modern platform takes more space than C64 assembly language (but it is less restricted and can do fancy stuff, such as multiply/divide, etc.).
As for graphical operating systems on the C64, the two best known are GEOS and Contiki. Final Cartridge 3 also had a built-in windowing system, but IIRC it only allowed its built-in programs, and there wasn't anything that useful among them.
GEOS is rather "restricted": it doesn't do any real multitasking (you can select which program is displayed in the main area of the screen, but e.g. the clock is always running), and that's basically it. Even though it is restricted, there is e.g. a rather nice word processor (GEOWrite) for it, which I used back in the day.
Contiki is more "modern" (and actually mostly written in C, IIRC), and it's much simpler than you might think. It runs in character graphics mode (so 1,000 bytes for the on-screen characters, 2K for the character set, and 1K of colour RAM), so only about 4K is used there.
And I'd say Contiki is more of a "proof of concept" than an actually useful operating system, but unlike GEOS it does real (co-operative) multitasking.
I guess you're seriously overestimating what would be needed for a really simple graphical operating system. Instead, you could compare to AmigaOS, which was very modern for its time and still rather small, and it runs on a CPU that's (internally) 32-bit, so much closer to modern processors.
I could not get pixel buffer objects (bound to GL_PIXEL_PACK_BUFFER) to make glReadPixels asynchronous on OS X 10.10; well, it works, but there is no speed-up.
I switched the type argument of glReadPixels from GL_UNSIGNED_BYTE to GL_UNSIGNED_INT_8_8_8_8_REV,
and glReadPixels dropped from 20 ms to 0.6 ms; in other words, it started to work asynchronously in a real sense.
My question is:
Will setting GL_UNSIGNED_INT_8_8_8_8_REV as the pixel type work on other Mac systems, or do I need to test them all?
If you want to be confident that it will perform well on all configurations, you'll have to test them all. Whether a certain path is slow or fast often depends on the GPU vendor. The result can also differ between the drivers for different GPU generations, and can even change from software release to software release.
What you're measuring in this specific example is quite odd. GL_UNSIGNED_BYTE and GL_UNSIGNED_INT_8_8_8_8_REV are actually the same format on a little-endian machine. There's no good reason why one of them should be faster than the other; it's most likely just an omission in the driver's check for whether a fast path can be used.
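For reference, a minimal sketch of the usual double-buffered readback pattern with GL_PIXEL_PACK_BUFFER. The GL_BGRA + GL_UNSIGNED_INT_8_8_8_8_REV combination is the one commonly reported as the fast path on Apple drivers, but as noted above, the only way to be sure on a given configuration is to measure. Everything except the GL calls themselves is an illustrative assumption: a current GL context exists, and the two PBOs were created beforehand with glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, nullptr, GL_STREAM_READ).

    #include <GL/glew.h>   // or the platform's own GL headers

    // Sketch: issue this frame's read into one PBO and consume the pixels that were
    // read into the other PBO on the previous frame, so glReadPixels can return
    // without waiting for the GPU.
    void readbackFrame(GLuint pbo[2], int width, int height, unsigned frame)
    {
        int writeIdx = frame % 2;          // PBO the GPU writes into this frame
        int readIdx  = (frame + 1) % 2;    // PBO filled last frame, safe to map now

        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[writeIdx]);
        // With a pack buffer bound, the last argument is an offset into the PBO, not a pointer.
        glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, nullptr);

        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[readIdx]);
        if (const void* pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY)) {
            // ... consume last frame's pixels here ...
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }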
What is the overhead of continually uploading textures to the GPU (and replacing old ones)? I'm working on a new cross-platform 3D windowing system that uses OpenGL, and I am planning on uploading a single bitmap for each window (containing the UI elements). That bitmap would be updated in sync with the GPU (using the vsync). I was wondering if this is a good idea, or if constantly writing bitmaps would incur too much of a performance overhead. Thanks!
Well, something like nVidia's GeForce 460M has 60 GB/s of bandwidth to its local memory.
PCI Express 2.0 x16 can manage 8 GB/s.
As such, if you are trying to transfer too many textures over the PCIe bus, you can expect to run up against memory bandwidth problems. That link gives you about 136 MB per frame at 60 Hz, and an uncompressed 24-bit 1920x1080 frame is roughly 6 MB. So, suffice to say, you could upload a fair few frames' worth of texture data per displayed frame on an x16 graphics card.
Sure, it's not quite that simple: there is PCIe overhead of around 20%, and all draw commands must be uploaded over that link too.
In general, though, you should be fine provided you don't overdo it. Bear in mind that it is sensible to upload a texture in one frame that you aren't expecting to use until the next (or even later); this way you don't create a bottleneck where rendering is halted waiting for a PCIe upload to complete.
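A hedged sketch of that idea using an unpack PBO, so the copy into driver-owned memory and the actual texture update are decoupled from the draw that uses the texture. The function and variable names are illustrative; a current GL context, an existing texture, and a pre-created PBO are assumed, and the window bitmap is assumed to be 4-byte BGRA.

    #include <GL/glew.h>
    #include <cstring>

    // Sketch: stage this frame's window bitmap into a GL_PIXEL_UNPACK_BUFFER, then
    // let glTexSubImage2D pull from it so the upload can overlap with other work.
    void updateWindowTexture(GLuint tex, GLuint unpackPbo,
                             const void* bitmap, int width, int height)
    {
        const GLsizeiptr bytes = (GLsizeiptr)width * height * 4;   // BGRA8 assumption

        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, unpackPbo);
        // Orphan the old storage so we never wait on a previous in-flight upload.
        glBufferData(GL_PIXEL_UNPACK_BUFFER, bytes, nullptr, GL_STREAM_DRAW);
        if (void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY)) {
            std::memcpy(dst, bitmap, (size_t)bytes);
            glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
        }

        glBindTexture(GL_TEXTURE_2D, tex);
        // With an unpack buffer bound, the data pointer is an offset into the PBO.
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                        GL_BGRA, GL_UNSIGNED_BYTE, nullptr);
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
    }

Skipping the whole update when a window's contents haven't changed, as mentioned below, remains the cheapest optimization of all.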
Ultimately, your answer is going to come from profiling. However, one early optimization you can make is to avoid updating a texture if nothing has changed; depending on the size of the textures and the pixel format, needless updates could easily become prohibitively expensive.
Profile with a simpler situation that simulates the kind of usage you expect. I suspect the performance overhead (without the optimization I mentioned, at least) will be unacceptable once you have more than a handful of windows, depending on their size.
I have a Direct3D 9 application and I would like to monitor the memory usage.
Is there a tool to know how much system and video memory is used by Direct3D?
Ideally, it would also report how much is allocated for textures, vertex buffers, index buffers...
You can use the old DirectDraw interface to query the total and available memory.
The numbers you get that way are not reliable though.
The free memory may change at any instant, and the available memory often takes AGP memory into account (which is strictly not video memory). You can use the numbers to make a good guess about the default texture resolutions and detail level for your application/game, but that's it.
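A rough sketch of that legacy query, assuming the DirectDraw 7 headers and libraries are available (link against ddraw.lib and dxguid.lib); the flags and error handling are illustrative, and as noted above the numbers are only estimates:

    #include <ddraw.h>

    // Sketch: query total/free "video" memory via the legacy IDirectDraw7 interface.
    bool queryVideoMemory(DWORD* totalBytes, DWORD* freeBytes)
    {
        LPDIRECTDRAW7 dd = nullptr;
        if (FAILED(DirectDrawCreateEx(nullptr, (void**)&dd, IID_IDirectDraw7, nullptr)))
            return false;

        DDSCAPS2 caps = {};
        caps.dwCaps = DDSCAPS_VIDEOMEMORY | DDSCAPS_LOCALVIDMEM;   // exclude AGP/non-local memory

        HRESULT hr = dd->GetAvailableVidMem(&caps, totalBytes, freeBytes);
        dd->Release();
        return SUCCEEDED(hr);
    }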
You may wonder why there is no way to get better numbers; after all, it can't be too hard to track resource usage.
From an application point of view this is correct. You may think that the video memory just contains surfaces, textures, index and vertex buffers, and some shader programs, but that's not true on the low-level side.
There are lots of other resources as well, all created and managed by the Direct3D driver to make rendering as fast as possible. Among others there are hierarchical z-buffer acceleration structures and pre-compiled command lists (i.e. the data required to render something, in the format understood by the GPU). The driver may also queue rendering commands for multiple frames in advance to even out the frame rate and increase parallelism between the GPU and CPU.
The driver also does a lot of work under the hood for you. Heuristics are used to detect draw calls with static geometry and constant rendering settings, and the driver may decide to optimize the geometry in those cases for better cache usage. This all happens in parallel and under the control of the driver. All this stuff needs space as well, so the free memory may change at any time.
However, the driver also does caching for your resources, so you don't really need to know the resource usage in the first place.
If you need more space than is available, that's no problem. The driver will move data between system RAM, AGP memory and video RAM for you. In practice you never have to worry about running out of video memory. Sure, once you need more video memory than is available, performance will suffer, but that's life :-)
Two suggestions:
You can call GetAvailableTextureMem at various times to obtain a (rough) estimate of how overall memory usage progresses (a small sketch follows below).
Assuming you develop on nVidia hardware, PerfHUD includes a graphical representation of consumed AGP/video memory (shown separately).
You probably won't be able to obtain a nice clean matrix of memory consumers (vertex buffers etc.) vs. memory location (AGP, VID, system), as -
(1) the driver has a lot of freedom in transferring resources between memory types, and
(2) the actual variety of memory consumers is far greater than the exposed D3D interfaces.
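For the first suggestion, a minimal sketch (the device pointer and the logging are illustrative; the value is rounded to the nearest MB by the runtime and covers more than just textures, so use it for trends rather than exact accounting):

    #include <d3d9.h>
    #include <cstdio>

    // Sketch: sample available texture memory at interesting points (after loading
    // a level, after creating large buffers, ...) and watch how it progresses.
    void logAvailableTextureMem(IDirect3DDevice9* device, const char* label)
    {
        UINT availableBytes = device->GetAvailableTextureMem();   // rough, rounded estimate
        std::printf("[%s] available texture memory: %u MB\n",
                    label, availableBytes / (1024u * 1024u));
    }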