Linux device driver for display | Framebuffer - linux-kernel

I am studying a display device driver for Linux that drives a TFT display; the framebuffer stores all the data that is to be displayed.
Question: does the display driver have an equivalent buffer of its own to handle the framebuffer from the kernel?
My concern is that the processor has to take the output from the GPU and produce a framebuffer to be sent out to the display driver, but depending on the display there might be some latencies and other issues, so does the display driver access the framebuffer directly, or does it use a buffer of its own as well?

This is a rabbit-hole question; it seems simple on the surface, but a factual answer is bound to end up in fractal complexity.
It's literally impossible to give a generalized answer.
The cliff notes version is: GPUs have their own memory, which is directly visible to the CPU in the form of a memory mapping (you can query the actual range of physical addresses from e.g. /sys/class/drm/card0/device/resource). Somewhere in there, there's also the memory used for the display scanout buffer. When using GPU accelerated graphics, the GPU will write directly to those scanout buffers – possibly to memory that's on a different graphics card (that's how e.g. Hybrid graphics work).
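As a concrete illustration, here is a small C sketch that dumps those ranges by parsing the sysfs resource file mentioned above; each line of that file holds the start, end and flags of one region, and card0 is only an example path:

    /* Hedged sketch: print the physical address ranges a card exposes as BARs,
     * by reading /sys/class/drm/card0/device/resource. */
    #include <stdio.h>
    #include <inttypes.h>

    int main(void)
    {
        FILE *f = fopen("/sys/class/drm/card0/device/resource", "r");
        if (!f) { perror("fopen"); return 1; }

        uint64_t start, end, flags;
        int region = 0;
        /* Each line is "0xSTART 0xEND 0xFLAGS"; a line of zeros means the region is unused. */
        while (fscanf(f, "%" SCNx64 " %" SCNx64 " %" SCNx64, &start, &end, &flags) == 3) {
            if (end > start)
                printf("region %d: %#" PRIx64 " - %#" PRIx64 " (%" PRIu64 " bytes)\n",
                       region, start, end, end - start + 1);
            region++;
        }
        fclose(f);
        return 0;
    }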
My concern is that the processor has to take the output from the GPU and produce a framebuffer to be sent out to the display driver
Usually that's not the case. However, even if there's a copy involved, these days bus bandwidths are large enough that the copy operation doesn't matter.
I am studying the display device driver for linux that runs TFT display
If this is a TFT display connected over SPI or a parallel bus made from GPIOs, then yes, there'll be some memory reserved for the image to reside in. Strictly speaking this can be in the RAM for the CPU, or in the VRAM of a GPU, if there is one. However, as far as latencies go, the copy operations for scanout don't really matter these days.
20 years ago, yes, and even back then with clever scheduling you could avoid the latencies.
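To make the original question concrete, here is a minimal sketch of a driver that keeps a buffer of its own for an SPI-attached TFT and flushes only changed lines of the kernel-visible framebuffer; tft_set_window(), spi_send(), the panel size and the RGB565 format are all hypothetical placeholders for whatever the real controller and bus provide:

    #include <stdint.h>
    #include <string.h>

    #define WIDTH  320
    #define HEIGHT 240

    static uint16_t framebuffer[WIDTH * HEIGHT]; /* image the kernel/applications draw into */
    static uint16_t shadow[WIDTH * HEIGHT];      /* driver's own copy: what was last sent */

    void tft_set_window(int x0, int y0, int x1, int y1); /* hypothetical controller command */
    void spi_send(const void *buf, size_t len);          /* hypothetical bus transfer */

    /* Push only the scanlines that changed since the last flush. */
    void tft_flush(void)
    {
        for (int y = 0; y < HEIGHT; y++) {
            uint16_t *src = &framebuffer[y * WIDTH];
            uint16_t *old = &shadow[y * WIDTH];
            if (memcmp(src, old, WIDTH * sizeof(uint16_t)) == 0)
                continue;                               /* line unchanged, skip it */
            tft_set_window(0, y, WIDTH - 1, y);
            spi_send(src, WIDTH * sizeof(uint16_t));
            memcpy(old, src, WIDTH * sizeof(uint16_t)); /* remember what the panel now shows */
        }
    }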

Related

Which instructions does a CPU use to communicate with PCIe cards?

I want to understand how a CPU works and so I want to know how it communicates with a PCIe card.
Which instructions does the CPU use to initialize a PCIe port and then read and write to it?
For example OUT or MOV.
A CPU mainly communicates with PCIe cards through memory ranges they expose. This memory may be small for network or sound cards, and very large for graphics cards. Integrated GPUs also have a small amount of their own memory but share most of the main memory. Most other cards also have read/write access to main memory.
To set up the PCIe device, the configuration space is written to. On x86, the BIOS or bootloader will provide the location of this data. PCI devices are connected in a tree which may include hubs and bridges on larger computers; this can be shown with lspci -t. Thunderbolt can even connect to external devices. This is why the OS needs to recursively "probe" the tree to find PCI devices and configure them.
Synchronization uses interrupts and ring buffers. The device can send a prenegotiated interrupt to the CPU when it's done doing work. The CPU writes work to a ring buffer. It then writes another memory location that contains the head pointer. This memory location is located on the device so it can listen to writes there and wake up when there is work to do.
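A minimal sketch of that ring-buffer-plus-doorbell pattern; the descriptor layout and the doorbell register here are invented for illustration, as real devices define their own formats:

    #include <stdint.h>

    #define RING_ENTRIES 256

    struct desc {               /* one unit of work for the device */
        uint64_t buf_addr;      /* DMA address of the data buffer */
        uint32_t len;
        uint32_t flags;
    };

    static struct desc ring[RING_ENTRIES];   /* lives in DMA-visible main memory */
    static uint32_t head;                    /* next free slot (CPU side) */

    static volatile uint32_t *doorbell;      /* points into the device's BAR mapping,
                                                set up elsewhere */

    void submit(uint64_t buf_addr, uint32_t len)
    {
        struct desc *d = &ring[head % RING_ENTRIES];
        d->buf_addr = buf_addr;
        d->len = len;
        d->flags = 1;                        /* e.g. "descriptor valid" */

        __sync_synchronize();                /* make the descriptor globally visible
                                                before the doorbell write */
        head++;
        *doorbell = head;                    /* MMIO write the device listens for */
    }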
Most of the interaction for modern devices will use MOV instead of OUT. The I/O port concept is very old and not very suitable for the massive amounts of data on modern systems. Having devices expose their functionality as a type of memory instead of a separate mechanism allows vectorized variants of MOV to move 32 bytes or so at a time. With graphics cards and modern network cards supporting offload, they can also use their own hardware to write results back to main memory when instructed to do so. The CPU can then read the results when it's free later, again using MOV.
Before this memory access works, the OS will need to set up the memory mapping properly. The memory mapping is set in the PCI configuration space as BARs. On the CPU side it is set up in the page tables. CPUs usually have caches to keep data locally because access to RAM is slower. This causes a problem when the data needs to get to a PCI device, so the OS will set certain memory as write-through or even uncacheable so this is ensured.
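On Linux, user space can lean on the kernel for that setup by mmap()ing a BAR through sysfs; the kernel then picks suitable page-table attributes. A small sketch (the PCI address used here is only an example):

    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        size_t len = 4096;                        /* map the first page of the BAR */
        volatile uint32_t *regs = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (regs == MAP_FAILED) { perror("mmap"); return 1; }

        printf("register 0: %#x\n", regs[0]);     /* from here on it's plain MOV loads/stores */

        munmap((void *)regs, len);
        close(fd);
        return 0;
    }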
The term BAR often shows up in GPU vendor marketing (as "Resizable BAR"). What is being sold there is the ability to map a larger region of device memory at a time; without it, the OS and driver have to keep remapping a limited window of memory. This illustrates how central MOV-based access to PCIe devices has become.

Does modern PC video hardware support VGA text mode in HW, or does the BIOS emulate it (with System Management Mode)?

What really happens on modern PC hardware booted in 16-bit legacy BIOS MBR mode when you store a byte such as '1' (0x31) into the VGA text (mode 03) framebuffer at physical linear address B8000? How slow is a mov [es:di], eax store with the MTRR for that region set to UC? (Experimental testing on one Kaby Lake iGPU laptop indicates that clflushopt on WC was roughly the same speed as UC for VGA memory. But without clflushopt, mov stores to WC memory never leave the CPU and don't update the screen at all, running super fast.)
If it's not an SMI for every store, is there any way to approximate this cost on a chunk of WB memory in user-space, for performance experiments without actually rebooting into real mode? (e.g. using a BSS page as a pretend framebuffer that doesn't actually display anywhere).
The corresponding font glyph appears on screen in the next refresh, but is hardware scan-out really reading that ASCII char from VRAM (or DRAM for an iGPU) and mapping to bitmap font glyphs on the fly? Or is there some software interception on each store or once per vblank so the real hardware only has to handle a bitmapped framebuffer?
Legacy BIOS booting is well known to use System Management Mode (SMM) to emulate USB kbd/mouse as PS/2 devices. I'm wondering if it's also used for the VGA text mode framebuffer. I assume it is used for VGA I/O ports for mode-setting, but it's plausible that a text framebuffer could be supported by hardware. However, most computers spend all their time in graphics mode, so leaving out HW support for text mode seems like something vendors might want to do. (OTOH this blog suggests that a homebrew verilog VGA controller can implement text mode fairly simply.)
I'm specifically interested in systems using the iGPU in Intel Skylake, but would be interested in earlier / later iGPUs from Intel and AMD, and new or old discrete GPUs.
(Including vendors other than AMD and NVidia; there are some Skylake motherboards with PCI slots, not PCIe. If modern GPU firmware drivers do emulate text mode, presumably there are some old PCI video cards with hardware VGA text mode. And maybe such a card could make stores just be a PCI transaction instead of an SMI.)
My own desktop is an i7-6700k in an Asus Z170 Pro Gaming mobo, no add-on cards just iGPU with a 1920x1200 monitor on the DVI-D output. I don't know the details of the Kaby Lake i5-7300HQ system #Eldan is testing on, only the CPU model.
I found Phoenix BIOS's patent US20120159520 from 2011,
Emulating legacy video using uefi. Instead of requiring video hardware vendors to supply both UEFI and native 16-bit real mode option-ROM drivers, they propose a real-mode VGA driver (int 10h functions and so on) that calls a vendor-supplied UEFI video driver via SMM hooks.
Abstract
[...] The generic video option ROM notifies a generic video SMM driver of the request for video services. Such notification may be performed using a software system management interrupt (SMI). Upon notification, the generic video SMM driver notifies a third party UEFI video driver of the request for video services. The third party video driver provides the requested video services to the operating system. In this way, a third party UEFI graphics driver may support a wide variety of operating systems, even those that do not natively support the UEFI display protocols.
Much of the description covers handling int 10h calls and stuff like that which already obviously trap through the IVT, thus can easily run custom code that triggers an SMI on purpose. The relevant part is what they describe for direct stores into the text-mode framebuffer which need to work even for code that doesn't trigger any software or hardware interrupts. (Other than HW triggering SMI on such stores, which they say they can use if supported.)
Text Buffer Support
[0066] In certain embodiments, applications may manipulate the VGA's text buffer directly. In such an embodiment, generic video SMM driver 130 support this in one of two ways, depending on whether the hardware provides SMI trapping on read/write access to the 740 KB-768 KB memory region (where the text buffers are located).
[0067] When SMI trapping is available, the hardware generates an SMI on each read or write access. Using the trap address of the SMI trap, the exact text column and row may be calculated and the corresponding row and column in the virtual text screen accessed. Alternately, normal memory is enabled for this region and, using a periodic SMI, generic video SMM driver 130 scans for changes in the emulated hardware text buffer and updates the corresponding virtual text screen maintained by the video driver. In both cases, when a change is detected, the character is redrawn on the virtual text screen.
This is just one BIOS vendor's patent, and doesn't tell us which way most hardware actually works, or if other vendors do different things. It does essentially confirm that some hardware exists which can trap on stores in that range, though. (Unless that's just a hypothetical possibility that they decided to cover in their patent.)
For the use-case I have in mind, trapping only on screen refresh would be vastly faster than trapping on every store so I'm curious which hardware / firmware works which way.
Motivation for this question
Optimizing an incrementing ASCII decimal counter in video RAM on 7th gen Intel Core - repeatedly storing new digits for an ASCII text counter into the same few bytes of video RAM.
I tested a version of the code in 32-bit user-space under Linux, on WB memory, hoping to approximate the situation with movnti and different ways of getting the CPU to sync its WC buffer to video RAM after each store (or perhaps occasionally in a timer interrupt). But this is not realistic if the real-mode bootloader situation isn't just storing to DRAM, but instead triggering an SMI.
On WB memory, flushing movnti stores with a lock xor byte [esp], 0 is somewhat faster than flushing with clflushopt. But #Eldan reports no speed improvement for those on VGA memory after programming an MTRR to make it WC. (And the same speed as for the original doing normal stores, indicating that by default the VGA framebuffer was UC. Some older BIOSes had an option to make VGA memory WC, which they called USWC = Uncached Speculative Write Combining.)
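For what it's worth, a user-space approximation of that experiment could look like the sketch below: ordinary WB memory in the BSS stands in for VGA memory, the movnti store is issued via the _mm_stream_si32 intrinsic, and it is flushed with clflushopt (this needs a compiler flag such as -mclflushopt, and movnti works on dword/qword operands rather than single bytes):

    #include <immintrin.h>
    #include <stdint.h>

    /* Ordinary WB memory used as a pretend framebuffer; not a real one. */
    static char fake_vram[4096] __attribute__((aligned(64)));

    void store_digit(int offset, uint32_t value)
    {
        _mm_stream_si32((int *)&fake_vram[offset], (int)value); /* movnti */
        _mm_clflushopt(&fake_vram[offset]);                     /* flush that cache line */
        _mm_sfence();                                           /* order the NT store and flush */
    }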
It's not a real-world problem so I'm not looking for actual workarounds; although it would be interesting to know if manually storing pixel bytes into a VGA graphics mode could be much faster.
Summary
Do any / all real modern systems trigger an SMI on every store to the text-mode framebuffer?
If no, can we approximate a WC store+clflush to the framebuffer, using a movnti + something in user-space on WB memory? So we can easily profile with perf for performance counters.
If different BIOSes and/or hardware use different strategies, what are those strategies? (I don't want details, just a high level like "SMI every vblank to sync the VGA framebuffer to the actual hardware framebuffer")
Would a PCIe or PCI video card with hardware VGA textmode be faster than whatever integrated GPUs actually do? I'm guessing an actual PCIe write transaction would be slower than waiting for a store to hit DRAM, but that a PCIe write would be cheaper than an SMI on every store. A ballpark / order of magnitude comparison would be interesting.
These questions are all highly related, but I can split this up if there isn't as much overlap as I expect.
Do any / all real modern systems trigger an SMI on every store to the text-mode framebuffer?
For video cards, I very much doubt it. Video card manufacturers have had the "get pixel data from char+attribute" logic built into hardware since the 1980s (it predates VGA and hasn't changed much since CGA), and just cut&paste that logic into each newer design without caring much about it.
For things that are not video cards at all (e.g. remote system management tools using LAN) I don't know but suspect not (often they use a special management CPU rather than the main CPU/s so that it works even if the computer is turned "off").
If no, can we approximate a WC store+clflush to the framebuffer, using a movnti + something in user-space on WB memory?
If you're not in user-space, you can change MTRRs (on all CPUs - MTRRs must match and there's a special sequence involved) to make an area of RAM "uncached"; or use PAT in the page tables (much easier than messing with MTRRs, especially if you're using paging anyway, but slightly different behavior due to still needing cache coherency). If you are in user-space then you will have to rely on whatever the OS/kernel provides, and (depending on which OS it is) the OS/kernel may not provide any way to do this at all.
However; even if you find a way to make (an area of) RAM uncached it still won't be very similar, because you'll be writing directly to something attached to a memory controller built into the CPU (which the CPU can write to extremely quickly) instead of talking to something at the other end of a PCI link (which will have higher latency and lower bandwidth from the CPU's side). Even for integrated video (where it's technically the same RAM chips in the end) writes to VRAM go through a very different path (subject to remapping/GART/paging in the video card, affected by a "write mode" VGA register, affected by bit/plane mask VGA registers, etc).
Would a PCIe or PCI video card with hardware VGA textmode be faster than whatever integrated GPUs actually do?
For writes from CPU to VRAM; typically integrated video is significantly faster than discrete cards (at least for plain writes from CPU to linear frame buffers where none of the VGA's "write logic" is involved).
For extremely rough ballpark estimates; I'd expect a single write to RAM to be around 150 cycles and a single write to PCI to be close to 1000 cycles. For SMI I'd expect a few hundred cycles of latency before the SMI arrives at the CPU, then the cost of a CPU pipeline flush, then about 500 cycles to save the CPU's state (and the same to load state on the return path); then the firmware's code would have to find the cause of the SMI (another few hundred cycles?) before it could know it was a write to VRAM and not something else; then it'd have to examine the saved CPU state and find and decode the instruction that made the write (because it can't know what data was being written, if it was a byte/word/dword write, etc) while taking into account previous CPU state (which mode the CPU was in, code size, etc) and keeping track of how emulating the instruction affects the future CPU state (advancing RIP, etc - don't forget that they'll be emulating every instruction that can cause a write, including things like XADD, etc). Next it would have to analyze the state of (emulated) VGA registers (write mode, write mask, plane enable, whatever controls which 64 KiB bank is mapped into the legacy area, font height, ...). Basically; for SMI emulation of a write to the text mode frame buffer; I'd expect it to take tens of thousands of cycles before the firmware's code overlooks a minor but important detail buried among a huge amount of complexity, causing it to do the wrong thing and be unusably broken.
Other Notes
I found Phoenix BIOS's patent US20120159520 from 2011, Emulating legacy video using uefi.
I doubt this was ever implemented, because I doubt it can ever work. There are far too many (common and obscure) things you can do with the legacy interfaces (e.g. detect vertical refresh, set up non-standard video modes like "mode X", fiddle with "display start" to implement smooth scrolling and/or page flipping, use "CRTC info" in VBE to alter video timings, etc.) that aren't supported by UEFI and can't be done via a third-party video driver for UEFI.
Instead, video card manufacturers didn't bother providing UEFI drivers for about 10 years and UEFI firmware used the legacy interface to emulate UEFI services (often breaking secure boot while they were at it); until almost everything was UEFI anyway.
I assume it (SMM) is used for VGA I/O ports for mode-setting.
I assume not. The only thing vaguely related to video that I'd suspect SMM may be used for is controlling the brightness of the screen's backlight in laptops (especially for older laptops, and especially for "lid open/close events") during early boot (before OS takes over).
.. leaving out HW support for text mode seems like something vendors might want to do
I still believe that the (eventual, after the already too long "hybrid BIOS+UEFI" transition phase) removal of 30+ years of accumulated legacy mess (A20, VGA, PS/2, PIT, PIC, ...) from hardware is one of the main reasons hardware manufacturers (Intel) are/have been pushing for UEFI adoption.
Reading through various modern Intel CPU and Platform Controller Hub (PCH) datasheets, it doesn't appear that the necessary hardware is implemented. There doesn't seem to be any way to generate an SMI (System Management Interrupt) in response to processor accesses of the VGA frame buffer (physical addresses 0xA0000 - 0xBFFFF).
The memory controller in the CPU will route accesses to the VGA frame buffer either to the integrated graphics controller, to the PCI Express port connected directly to the CPU, or to the DMI interface connecting the CPU to the PCH. While it's possible to route parts of the VGA frame buffer separately, this appears only meant to support a separate MDA (Monochrome Display Adapter) device. The integrated graphics controller is not well documented, so it's possible that it can be configured to generate an SMI on VGA frame buffer accesses, but this seems unlikely. In any case, it wouldn't work with discrete graphics.
Intel PCHs also don't seem to have any support for generating SMIs in response to VGA frame buffer accesses. This would be the most natural place for it, as the PCH already has support for generating SMIs in response to I/O accesses to the keyboard controller, IDE controller and other legacy devices. It's possible that there's some undocumented feature that does this, but it's not included in the lists of possible SMI sources given in the PCH datasheets.
Theoretically, it would be possible for a motherboard manufacturer to connect a fake VGA device to the PCH through a PCI Express port and then generate SMIs using a PCH GPIO pin. However, I'm not sure this would work in practice. By the time the CPU gets the SMI it could have moved on to executing other instructions, and it wouldn't be possible to examine the CPU state at the time of the frame buffer access.
(A similar problem happened with SoundBlaster 16 emulation on the SoundBlaster Live. It would generate a PCI SERR# when the legacy SoundBlaster ports were accessed, which would generate a NMI on the CPU. Unfortunately the emulation would break on many Pentium 4 motherboards because the NMI would arrive on the next or subsequent instruction.)

Why is data transfer from the GPU to the CPU slow?

Today I figured out something that really made me wonder. I have the Samsung Exynos 4412, an SoC with ARM Cortex-A9 CPU cores and a quad-core Mali-400 GPU. I tried to get a texture from the GPU to the CPU by all known methods and it's really slow. The same scenario and slow speed also occur with modern CPUs and GPUs on the PC platform. What I'm wondering is how that can happen, given that the Exynos is an SoC where the CPU and GPU share the same memory, so the bus should not be a concern.
Why that happens ?
I have tried transferring the data from the GPU to the CPU by many methods: glReadPixels, glTexSubImage2D, glTexImage2D, and FBOs.
The frame rate drops from 40 FPS to 7 FPS while using any of those methods, on a 1024×1024 24-bit texture.
Possible answers taken from the OpenGL forums:
Latency: it takes time for the read command to reach the hardware.
OpenGL command buffering: Reading the data requires the OpenGL driver to complete all outstanding commands.
Hardware buffering: Hardware must empty all GPU core pipelines before doing a readback.
Possible solution:
- Copy the data internally on the GPU to another location and read it back some number of frames after computing it. This should allow everything writing to that location to have completed before you attempt to read it.
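A hedged sketch of that approach using two pixel buffer objects (available in desktop GL 2.1+ and OpenGL ES 3.0): glReadPixels into the bound PBO returns immediately, and the buffer you map is one frame old, so the map should not stall:

    #include <GLES3/gl3.h>   /* or the desktop GL headers */
    #include <string.h>

    #define W 1024
    #define H 1024

    static GLuint pbo[2];
    static int frame;

    void init_readback(void)
    {
        glGenBuffers(2, pbo);
        for (int i = 0; i < 2; i++) {
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[i]);
            glBufferData(GL_PIXEL_PACK_BUFFER, W * H * 4, NULL, GL_STREAM_READ);
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }

    /* Call once per frame; dst receives the pixels rendered one frame ago
     * (the very first call returns the buffer's initial contents). */
    void async_readback(void *dst)
    {
        int cur  = frame & 1;
        int prev = (frame + 1) & 1;

        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[cur]);
        glReadPixels(0, 0, W, H, GL_RGBA, GL_UNSIGNED_BYTE, (void *)0); /* async copy into PBO */

        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[prev]);
        void *p = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, W * H * 4, GL_MAP_READ_BIT);
        if (p) {
            memcpy(dst, p, W * H * 4);              /* previous frame's pixels */
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
        frame++;
    }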

What happens when we plug a piece of hardware into a computer system?

When we plug a piece of hardware into a computer system, say a NIC (Network Interface Card) or a sound card, what happens under the hood so that we could use that piece of hardware?
I can think of the following 2 scenarios, correct me if I am wrong.
If the hardware has its own memory chips, someone will arrange for a range of address space to map to those memory chips.
If the hardware doesn't have its own memory chips, someone will allocate a range of addresses in the main memory of the computer system to accommodate that hardware.
I am not sure whether the aforementioned someone is the operating system or the CPU.
And another question: Does hardware always need some memory to work?
Am I right on this?
Many thanks.
The world is not that easily defined.
First off, look at the hardware and what it does. Take a mouse for example: it is trying to deliver x and y coordinate changes and button status. That can be as little as a few bytes, or even a single byte where two bits define what the other six mean (update x, update y, update buttons, that kind of thing), and the memory requirement is just enough to hold those bytes. Take a serial mouse: there is already at least one byte of storage in the serial port, so do you need any more? USB is another story; just to speak USB back and forth takes memory for the messages, but that memory can be in the USB logic, so do you need any more for such small amounts of information?
NICs and sound cards are another category and more interesting. For NICs you have packets of data coming and going, and you need some buffer space (ring, FIFO, etc.) to allow multiple packets to be in flight in both directions, for efficiency, interrupt latency and the like. You also need registers; these have their storage in the hardware/logic itself and won't need main memory. In both the sound card case and the NIC case you can either have memory on the board with the hardware, or have it use system memory that it can access semi-directly (DMA, etc.). Sound cards are similar but different in that you can think of the packets as being fixed-sized and continuous. Basically you need to ping-pong buffers to or from the card at some rate: 44100 Hz, 16 bits per sample, stereo, is 44100 * 2 * 2 = 176400 bytes per second. Say, for example, the driver/software prepares the next 8192 bytes at a time: while the hardware is playing the pong buffer the software is filling the ping buffer; when the hardware drains the pong buffer it indicates this to the software, starts draining the ping buffer, and the software refills the pong buffer.
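A stripped-down sketch of that ping-pong loop follows; hw_start_playback(), hw_buffer_drained() and fill_next_audio() are hypothetical stand-ins for whatever the real card and codec interfaces provide:

    #include <stdint.h>

    #define CHUNK 8192                      /* bytes per half-buffer, as in the text */

    static uint8_t ping[CHUNK], pong[CHUNK];

    void hw_start_playback(const void *buf, int len);  /* hypothetical: kick off DMA/playback */
    int  hw_buffer_drained(void);                      /* hypothetical: 1 when the buffer is done */
    void fill_next_audio(void *buf, int len);          /* decode/generate the next samples */

    void audio_loop(void)
    {
        uint8_t *playing = ping, *filling = pong;

        fill_next_audio(playing, CHUNK);
        for (;;) {
            hw_start_playback(playing, CHUNK);  /* hardware drains this half... */
            fill_next_audio(filling, CHUNK);    /* ...while software fills the other */
            while (!hw_buffer_drained())
                ;                               /* in a real driver: sleep until the IRQ */
            uint8_t *t = playing; playing = filling; filling = t;  /* swap halves */
        }
    }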
All interesting stuff, but to get to the point: with the NIC or sound card you could have as little as two registers, an address/command register and a data register. Quite painful, but it was often used in the old days in restricted systems, and is still used as well. Or you could go to the other extreme and desire to have all of the memory on the device mapped into the system memory address space, as well as each register having its own unique address. With audio you don't really need random access to the memory, so you don't really need this; with graphics you do. With NIC cards you could argue either way: do you leave the packet on the NIC, or do you make a copy in system memory where you can have a much larger software buffer/ring, freeing the hardware's limited buffer/ring? If you leave it on the NIC then you would want random access; if not, then you don't.
For ISA/PCI/PCIe, etc. on x86 systems, the hardware is usually mapped directly into the processor's memory space. So for 32-bit systems you can address up to 4 GB, but even if you have 4 GB worth of memory, some of that memory you cannot get to, because video cards, hardware registers, PCI, etc. consume some of that address space (registers or memory or both, whatever the hardware was designed to use). As distasteful as it may appear today, this is why there was a distinction between I/O-mapped I/O and memory-mapped I/O on x86 systems; it's another address bit, if you will. You could have all of your registers in I/O space and not lose memory space, and map memory into nice neat aligned chunks, requiring less of your RAM to be replaced with hardware. Either way, ISA had basically vendor-specific ways of mapping into the memory space available to the ISA bus: jumpers, interesting detection schemes with programmable address decoders, etc. PCI and its successors came up with something more standard. When the computer boots (talking x86 machines in general now), the BIOS goes out on the PCIe bus and looks to see who is out there by talking to config space, which is mapped per card in a known place. Using a known protocol, the cards indicate the amount of memory they require; the BIOS then allocates chunks of the processor's flat memory space for each device and tells the device what address and how much it has been allocated. It is certainly possible for the operating system to redo or override this, but typically the BIOS does this discovery for the system, and the operating system simply reads the config space on each device, which includes the vendor ID and device ID, and then knows how and where to talk to the device. For this memory space I believe the hardware contains the memory/registers. For general system memory to DMA to/from, I believe the operating system and device drivers have to provide the mechanism for allocating that system memory and then telling the hardware what address to DMA to/from.
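For a flavor of what that discovery looks like at the lowest level, here is a hedged sketch of scanning bus 0 through the legacy 0xCF8/0xCFC configuration mechanism on x86 (firmware- or kernel-level code; under a running OS you would normally go through the kernel's PCI subsystem instead):

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/io.h>    /* outl()/inl(); x86 Linux, needs iopl(3) and root */

    static uint32_t pci_cfg_read32(int bus, int dev, int fn, int reg)
    {
        uint32_t addr = (1u << 31) | (uint32_t)(bus << 16) | (uint32_t)(dev << 11)
                      | (uint32_t)(fn << 8) | (uint32_t)(reg & 0xFC);
        outl(addr, 0xCF8);              /* select bus/device/function/register */
        return inl(0xCFC);              /* read the selected 32-bit config dword */
    }

    void scan_bus0(void)
    {
        for (int dev = 0; dev < 32; dev++) {
            uint32_t id = pci_cfg_read32(0, dev, 0, 0x00);  /* low 16: vendor, high 16: device */
            if ((id & 0xFFFF) == 0xFFFF)
                continue;                                   /* no device in this slot */
            printf("00:%02x.0 vendor %04x device %04x\n",
                   dev, id & 0xFFFF, id >> 16);
            /* offsets 0x10..0x24 hold the BARs the firmware programmed */
        }
    }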
The x86 way of doing it, with the BIOS handling the ugly details and with system memory address space and PCI address space being the same address space, has its pros and cons. A pro is that the hardware can easily DMA to/from system memory because it does not have to know how to get from PCIe address space to system address space. The negative is the case of a 32-bit system, where PCIe normally consumes up to 1 GB of address space and the DRAM you bought for that hole is not available. The transition from 32-bit to 64-bit has been slow and painful; BIOSes and PCIe chipsets still limit things to the lower 4 GB and to 1 GB for all the PCIe devices, even if the chipset has a 64-bit mode, and this is with 64-bit processors and more than 4 GB of RAM. The MMU allows for fragmented memory, so that is not an issue. Slowly the chipsets and BIOSes are catching up, but it is taking time.
USB: these are serial, mostly master/slave protocols. Like a serial port but bigger and faster and more complicated, and like a serial port, both the master and slave hardware need RAM to store the messages, very much like a NIC. Like a NIC, in theory, you can be register-based and pull the memory sequentially, or have it mapped into system memory and have random access to it, etc. Think of it this way: the USB interface can and does sit on a PCIe interface even if it is on the motherboard. A number of devices are PCIe devices on your motherboard even if there is no actual PCIe connector with a card, and they fall into the same PCIe category of how you might design your interface and who has what memory where.
Some devices like video cards have lots of memory on board, more than is practical (or at least not painful) to map into PCIe memory space all at once. These would want to use a sliding-window type arrangement: tell the video card you want to look at address 0x0000 in the video card's address space, but your window may only be 0x1000 bytes (for example) in system/PCIe space. When you want to look at addresses 0x1000 to 0x1FFF in video memory space, you write some register to move the window, and then the same PCIe memory space accesses different memory on the video card.
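A hedged sketch of that sliding-window access pattern; the aperture size, the window-base register and all the names here are invented for illustration:

    #include <stdint.h>
    #include <string.h>

    #define WINDOW_SIZE 0x1000

    static volatile uint8_t  *aperture;      /* 4 KiB of the card's memory, mapped via its BAR */
    static volatile uint32_t *window_base;   /* register selecting which 4 KiB is visible */

    /* Read len bytes starting at vram_addr in the card's own address space. */
    void vram_read(uint32_t vram_addr, void *dst, uint32_t len)
    {
        while (len) {
            uint32_t off   = vram_addr % WINDOW_SIZE;
            uint32_t chunk = WINDOW_SIZE - off;
            if (chunk > len)
                chunk = len;

            *window_base = vram_addr - off;                  /* slide the window */
            memcpy(dst, (const void *)(aperture + off), chunk);

            vram_addr += chunk;
            dst = (uint8_t *)dst + chunk;
            len -= chunk;
        }
    }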
x86, being the dominant architecture, has this overlapped PCIe and system memory addressing, but that is not how the whole world works. Other solutions include having independent system and PCIe address spaces, with sliding windows like the video card case above, allowing you to have, say, a 2 GB video card mapped flat in PCIe space while limiting the window into PCIe space to something not painful for the host system.
Hardware designs are as varied as software designs. Take 100 software engineers and give them a specification and you may get as many as 100 different solutions; the same goes for hardware: give them a specification and you may get 100 different PCIe designs. Some standards are in place to limit that, and so is cloning (where you want to make a Sound Blaster compatible card, you don't change the interface), but given the same freedom software has, hardware can and will vary, and with the number of types of PCIe devices (sound, hard disk controllers, video, USB, networking, etc.) you will get that many different mixes of registers and addressable memory.
Sorry for the long answer; hope this helps. I would dig through Linux and/or BSD sources for device drivers, along with programmers' reference manuals if you can get access to them, and see how different hardware designs use register and memory space, and which designs are painful for the software folks and which are elegant and well done.
The answer depends on the interface of the hardware: is it over USB or PCI Express? (There could be other connectivity methods too; USB and PCI Express are the most common.)
With USB
The host learns about the newly arrived device by reading its descriptors and loads the appropriate device driver. The device will have presented the ID that is used for Plug and Play, and it is also assigned an address by the host. Once the device driver kicks in, it configures the device and makes it ready for data transfer. The data transfer is done using IRPs; the transfer technique and how the IRPs are loaded depend on whether the transfer is isochronous, bulk, or another mode.
So to answer your second question: yes, the hardware needs some memory to work. The device driver and the USB host controller driver together set up the memory on the host for the USB device; the USB device driver then communicates with and drives the device accordingly.
With PCI-Express
It is similar; sorry, I do not have hands-on experience with PCI Express.

BitBlt Performance

I have a function that splits a multipage tiff into single pages and it uses the windows BitBlt function. In terms of performance, would the video card have any influence in doing the split? Would it be worth using a straight C/C++ library instead?
The video card won't participate in any activity unless it is the destination HDC of the BitBlt. A library dedicated to imaging functions should perform better for this task, since ultimately you will be writing these to disk.
If you were making alterations to the image data, then there is the possibility that using your video card could help; but only if you are rendering a lot of new image data for the destination tiffs, particularly 3D scenes and the like.
If BitBlt can map the pages into video memory, there is a very good chance that your video card will be much, much faster than the CPU. This is for a few reasons:
The card will run in parallel with your CPU, so you can do other work while it is running.
The video card is optimized to perform the memory copies on its own, instead of having the CPU copy each word from one place to another. This frees your CPU bus up for other things.
The video card probably has a larger word size for data moves, and if your blit has any operation flags attached, those are likely optimized by the hardware. Also, the memory on most video cards is faster than system memory.
Note that these things aren't always true. For example, if your card shares system memory then it won't have faster access to the memory than the CPU. However, you still get the parallelism.
Finally, there is the possibility that the overhead of transferring the image to the card and back will overwhelm the speed improvement you get by doing it on the card. So you just need to experiment.
I should add - I believe that you need to specify that the memory is on-card in the device context. I don't think that just creating a memory context does anything particular with the video card.
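As a hedged illustration of that distinction, the sketch below blits the same page into a device-compatible bitmap (which the driver may keep in video memory) and into a DIB section (always plain system memory that the CPU can read back, e.g. for writing to disk):

    #include <windows.h>

    void blit_page(HDC screen, HBITMAP page, int w, int h)
    {
        /* Source DC holding the page we want to copy */
        HDC src = CreateCompatibleDC(screen);
        HBITMAP oldSrc = (HBITMAP)SelectObject(src, page);

        /* Destination 1: device-compatible bitmap; a candidate for video memory */
        HDC dstVid = CreateCompatibleDC(screen);
        HBITMAP vidBmp = CreateCompatibleBitmap(screen, w, h);
        HBITMAP oldVid = (HBITMAP)SelectObject(dstVid, vidBmp);
        BitBlt(dstVid, 0, 0, w, h, src, 0, 0, SRCCOPY);

        /* Destination 2: DIB section; always system memory the CPU can access directly */
        BITMAPINFO bmi;
        ZeroMemory(&bmi, sizeof(bmi));
        bmi.bmiHeader.biSize = sizeof(bmi.bmiHeader);
        bmi.bmiHeader.biWidth = w;
        bmi.bmiHeader.biHeight = -h;          /* negative height = top-down rows */
        bmi.bmiHeader.biPlanes = 1;
        bmi.bmiHeader.biBitCount = 32;
        void *bits = NULL;
        HDC dstSys = CreateCompatibleDC(screen);
        HBITMAP dib = CreateDIBSection(screen, &bmi, DIB_RGB_COLORS, &bits, NULL, 0);
        HBITMAP oldSys = (HBITMAP)SelectObject(dstSys, dib);
        BitBlt(dstSys, 0, 0, w, h, src, 0, 0, SRCCOPY);
        GdiFlush();                           /* make sure the pixels in 'bits' are ready to read */

        /* cleanup */
        SelectObject(dstSys, oldSys); DeleteObject(dib);    DeleteDC(dstSys);
        SelectObject(dstVid, oldVid); DeleteObject(vidBmp); DeleteDC(dstVid);
        SelectObject(src, oldSrc);    DeleteDC(src);
    }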
