Why does an HBITMAP use so little memory? - winapi

I ran into an interesting problem:
Load a big (4500x6000) JPEG into memory (RGBRGBRGB...) with libjpeg (this costs about 200 MB of memory).
Call CreateDIBitmap() to create an HBITMAP from the data.
Free the memory that was used.
Now I find that the whole process uses only about 5 MB of memory. I wonder where the data of the HBITMAP lives. (I have disabled the pagefile.)
Update:
I wrote the following code for testing:
// initialise
BITMAP bitmap;
BITMAPINFO info;
// ....
void *data = NULL;
HDC hdc = ::GetDC(NULL);
HBITMAP hBitmap = ::CreateDIBSection(hdc, &info, DIB_RGB_COLORS, &data, NULL, 0);
::ReleaseDC(NULL, hdc);
if (hBitmap) {
    ::GetObject(hBitmap, sizeof(bitmap), &bitmap);
}
The returned data pointer is 0x2d0000 (clearly in user space), and bitmap.bmBits is also 0x2d0000, so I have confirmed that CreateDIBSection uses user-space memory for the bitmap.
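For reference, the initialisation elided above ("// ....") could be filled in roughly as follows; the top-down 32bpp DIB matching the question's 4500x6000 image is assumed here purely for illustration:

ZeroMemory(&info, sizeof(info));
info.bmiHeader.biSize        = sizeof(BITMAPINFOHEADER);
info.bmiHeader.biWidth       = 4500;
info.bmiHeader.biHeight      = -6000;      // negative height -> top-down rows
info.bmiHeader.biPlanes      = 1;
info.bmiHeader.biBitCount    = 32;         // 32bpp, so no colour table is needed
info.bmiHeader.biCompression = BI_RGB;
// 'data' then receives the user-space pixel memory owned by the DIB section;
// it is released when DeleteObject(hBitmap) is called.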

How about this for a test: create HBITMAPs in a loop, counting the number of bytes theoretically used (based on the bit depth of your video card).
How many bytes' worth of HBITMAPs can you allocate before they start to fail (or, alternatively, until you do start to see an impact on memory)?
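A rough sketch of such a loop; the 2048x2048 bitmap size and the 32-bits-per-pixel assumption are arbitrary illustrative choices:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    HDC screen = GetDC(NULL);
    unsigned long long total = 0;   /* bytes theoretically consumed */
    int count = 0;

    /* Create 2048x2048 DDBs until CreateCompatibleBitmap() fails.
       Note: the per-process GDI object quota (10,000 by default) may
       be reached before memory actually runs out. */
    for (;;) {
        HBITMAP bmp = CreateCompatibleBitmap(screen, 2048, 2048);
        if (bmp == NULL)
            break;
        count++;
        total += 2048ULL * 2048ULL * 4ULL;   /* assumes 32 bits per pixel */
    }

    printf("Created %d DDBs, roughly %llu MB before failure\n",
           count, total / (1024 * 1024));
    ReleaseDC(NULL, screen);
    return 0;   /* leaked bitmaps are reclaimed when the process exits */
}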
DDBs are managed by device drivers, so they tend to be stored in one of two places: kernel-mode paged pool or the video card's memory itself. Neither is reflected in any process's memory count.
In theory, device drivers can allocate system-memory storage for bitmaps and move them across to VRAM as and when needed... but some video card drivers assume that video memory should be enough and simply allocate all HBITMAPs on the card. That means you run out of space for HBITMAPs at either the 2 GB mark (if they're allocated in kernel paged pool; depending on available RAM and assuming 32-bit Windows editions) or the 256 MB mark (or however much memory the video card has).
That discussion covers device-dependent bitmaps (DDBs).
DIBSections are a special case: they are allocated in memory that is accessible from kernel mode but mapped into user space. As such, any application that uses a lot of bitmaps should probably use DIBSections where possible, as that leaves far less opportunity to starve the system of space to store DDBs.
I suspect there is still a system-wide limit of up to 2 GB worth of DIBSections (on 32-bit Windows versions), as there is no concept of a 'current process' in kernel mode, where the video device drivers need to access them.

Related

Allocate more than 2 GB of dma common buffer

I am developing a driver for PCI Express in a Windows environment.
I use Windows 7 and Windows 10; the hardware is an i7-7700K with 16 GB of RAM.
There has been no problem using buffers of up to 2 GB allocated so far, but more than 2 GB cannot be allocated.
Here is the code snippet that succeeds in allocating a 2 GB DMA common buffer.
DEVICE_DESCRIPTION dd;
RtlZeroMemory(&dd, sizeof(dd));
dd.Version = DEVICE_DESCRIPTION_VERSION;
dd.InterfaceType = InterfaceTypeUndefined;
dd.MaximumLength = 0x200000; // 2MB DMA transfer size at a time
dd.Dma32BitAddresses = FALSE;
dd.Dma64BitAddresses = TRUE;
dd.Master = TRUE;
pdx->AdapterObject = IoGetDmaAdapter(pdx->Pdo, &dd, &nMapRegisters);
pdx->vaCommonBuffer = (*pdx->AdapterObject->DmaOperations->AllocateCommonBuffer)
(pdx->AdapterObject, 0x80000000, &pdx->paCommonBuffer, FALSE);
What is the size limit for the DMA common buffer allocation and why?
Changing the length above from 0x80000000 (2 GB) to 0xC0000000 (3 GB) causes the buffer allocation to fail.
How can I allocate a DMA common buffer of 4 GB or more using AllocateCommonBuffer()?
Thank you very much for your valuable time and comments in advance.
Respectfully,
KJ
I believe that DMAv2 (which is the best Windows 7 supports) cannot do more than 4GB in a single allocation, so there is a hard limit.
But even before you hit the hard limit, you're going to start seeing nondeterministic allocation failures, because AllocateCommonBuffer gives you physically contiguous pages.
As a thought experiment: all it takes is 8 unmovable pages placed at pathologically-inconvenient locations before you can't find 2GB of contiguous pages out of 16GB of RAM. And a real system will have a lot more than 8 unmovable pages, although hopefully they aren't placed at such unfortunate intervals.
Compounding matters, DMAv2 refuses to allocate memory that straddles a 4GB boundary (for historical reasons). So of the 3,670,017 different pages that this allocation could start at, 50% of them are not even considered. At 3GB, 75% of possible allocations are not considered.
We've put a lot of work into the DMA subsystem over the years: DMAv3 is a more powerful API, with fewer weird quirks. Its implementation in Windows 10 and later has some fragmentation-resistance features, and the kernel's memory manager is better at moving pages. That doesn't eliminate the fundamental problem of fragmentation; but it does make it statistically less likely.
Very recent versions of the OS can actually take advantage of the IOMMU (if available and enabled) to further mitigate the problem: you don't need the physical pages to be contiguous anymore, if we can just use the IOMMU to make them seem contiguous to the device. (An MMU is exactly why usermode apps don't need to worry about physical memory fragmentation when they allocate massive buffers via malloc).
At the end of the day, though, you simply can't assume that the system has any 2 contiguous pages of memory for your device, especially if you need your device to operate on a wide variety of systems. I've seen production systems with oodles of RAM routinely fail to allocate 64KB of common buffer, which is "merely" 16 contiguous pages. You will need to do one of the following:
Fall back to using many smaller allocations. They won't be contiguous with each other, but you'll be more successful at allocating them (see the sketch after this list).
Fall back to a smaller buffer. For example, on my home turf of networking, a NIC can use a variety of buffer sizes, and the only visible effect of a smaller buffer might just be degraded throughput.
Ask the user to reboot the device (which is a very blunt way to defragment memory!).
Try to allocate memory once, early in boot, (before memory is fragmented), and never let go of it.
Update the OS, update to DMAv3 API, and ensure you have an IOMMU.
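To make the first option concrete, here is a rough sketch of falling back to many smaller common buffers. The 4 MB chunk size, the DMA_CHUNK type and the AllocateChunkedCommonBuffer name are made up for illustration, and DEVICE_EXTENSION is assumed to be the driver-defined type behind the question's pdx:

#define CHUNK_SIZE   (4 * 1024 * 1024)   // 4 MB per chunk (illustrative choice)
#define CHUNK_COUNT  512                 // 512 x 4 MB = 2 GB in total (illustrative)

typedef struct _DMA_CHUNK {
    PVOID            va;   // kernel virtual address of the chunk
    PHYSICAL_ADDRESS pa;   // device-visible logical address of the chunk
} DMA_CHUNK;

static ULONG AllocateChunkedCommonBuffer(DEVICE_EXTENSION *pdx, DMA_CHUNK *chunks)
{
    ULONG allocated = 0;

    for (ULONG i = 0; i < CHUNK_COUNT; i++) {
        chunks[i].va = pdx->AdapterObject->DmaOperations->AllocateCommonBuffer(
            pdx->AdapterObject, CHUNK_SIZE, &chunks[i].pa, FALSE);
        if (chunks[i].va == NULL)
            break;                 // stop at the first failure and use what we got
        allocated++;
    }

    // The chunks are physically contiguous internally but not with each other,
    // so the device must be given a list of regions (a scatter/gather-style
    // descriptor table) instead of a single base address. Each chunk must later
    // be released with FreeCommonBuffer().
    return allocated;
}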

Why do we need DMA pool?

I'm reading https://www.kernel.org/doc/Documentation/DMA-API.txt and I don't understand why the DMA pool is needed.
Why not have PAGE_SIZE DMA memory allocated with dma_alloc_coherent and use offsets into it?
Also, why is dynamic DMA mapping useful for a network device driver instead of reusing the same DMA memory?
What is the most performant option for <1 KB data transfers?
Warning: I'm not an expert on the Linux kernel.
The LDD3 book (which may be a better place to start) says that a DMA pool works better for small DMA regions (shorter than a page) - see https://static.lwn.net/images/pdf/LDD3/ch15.pdf page 447 or the "DMA pools" section of https://www.oreilly.com/library/view/linux-device-drivers/0596005903/ch15.html:
A DMA pool is an allocation mechanism for small, coherent DMA mappings. Mappings obtained from dma_alloc_coherent may have a minimum size of one page. If your device needs smaller DMA areas than that, you should probably be using a DMA pool. DMA pools are also useful in situations where you may be tempted to perform DMA to small areas embedded within a larger structure. Some very obscure driver bugs have been traced down to cache coherency problems with structure fields adjacent to small DMA areas. To avoid this problem, you should always allocate areas for DMA operations explicitly, away from other, non-DMA data structures. ... Allocations are handled with dma_pool_alloc
The same is stated in https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt:
If your driver needs lots of smaller memory regions, you can write
custom code to subdivide pages returned by dma_alloc_coherent(),
or you can use the dma_pool API to do that. A dma_pool is like
a kmem_cache, but it uses dma_alloc_coherent(), not __get_free_pages().
Also, it understands common hardware constraints for alignment,
like queue heads needing to be aligned on N byte boundaries.
So DMA pools are an optimization for small allocations. You can call dma_alloc_coherent for every small piece of DMA memory individually (with larger overhead), or you can build your own pool (more custom code to manage offsets and allocations), but DMA pools are already implemented and ready to use.
The performance of each method should be profiled for your case.
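A minimal sketch of the dma_pool API itself; the pool name, the 64-byte size/alignment and the my_setup_descriptors function are illustrative choices, not taken from any particular driver:

#include <linux/dmapool.h>
#include <linux/pci.h>

static int my_setup_descriptors(struct pci_dev *pdev)
{
    struct dma_pool *pool;
    void *desc;
    dma_addr_t desc_dma;

    /* One pool of small, 64-byte, 64-byte-aligned coherent buffers. */
    pool = dma_pool_create("my_desc_pool", &pdev->dev, 64, 64, 0);
    if (!pool)
        return -ENOMEM;

    /* Each allocation returns both a CPU pointer and a DMA address. */
    desc = dma_pool_alloc(pool, GFP_KERNEL, &desc_dma);
    if (!desc) {
        dma_pool_destroy(pool);
        return -ENOMEM;
    }

    /* ... program desc_dma into the device, access desc from the CPU ... */

    dma_pool_free(pool, desc, desc_dma);
    dma_pool_destroy(pool);
    return 0;
}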
Example of dynamic DMA mapping in a network driver (used for skb fragments):
https://elixir.bootlin.com/linux/v4.6/source/drivers/net/ethernet/realtek/r8169.c
static struct sk_buff *rtl8169_alloc_rx_data
    mapping = dma_map_single(d, rtl8169_align(data), rx_buf_sz, DMA_FROM_DEVICE);
static int rtl8169_xmit_frags
    mapping = dma_map_single(d, addr, len, DMA_TO_DEVICE);
static netdev_tx_t rtl8169_start_xmit
    mapping = dma_map_single(d, skb->data, len, DMA_TO_DEVICE);
static void rtl8169_unmap_tx_skb
    dma_unmap_single(d, le64_to_cpu(desc->addr), len, DMA_TO_DEVICE);
Registering skb fragments for DMA in place can be better (if scatter-gather DMA is supported by the NIC chip) than copying every fragment from the skb into some dedicated DMA memory. See the "dev_queue_xmit Function" section and Chapter 21 of "Understanding Linux Network Internals", and skb_linearize.
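Stripped of driver context, the streaming-mapping pattern those r8169 calls follow looks roughly like the sketch below; my_map_rx_buffer is a made-up helper, and d is a struct device * as in the r8169 code:

#include <linux/dma-mapping.h>
#include <linux/errno.h>

/* Map an existing kernel buffer so the device can DMA into it. */
static int my_map_rx_buffer(struct device *d, void *buf, size_t len,
                            dma_addr_t *mapping_out)
{
    dma_addr_t mapping = dma_map_single(d, buf, len, DMA_FROM_DEVICE);

    if (dma_mapping_error(d, mapping))
        return -ENOMEM;            /* never hand a failed mapping to the device */

    *mapping_out = mapping;        /* program this address into the RX descriptor */
    return 0;
}

/* When the device has finished writing into the buffer, the matching
 * dma_unmap_single(d, mapping, len, DMA_FROM_DEVICE) hands cache
 * ownership of the buffer back to the CPU before it reads the data. */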
Example of DMA pool usage - the nvme driver (a PRP, Physical Region Page, is a 64-bit pointer that is part of a Submission Queue Entry, and the "PRP List contains a list of PRPs with generally no offsets."):
https://elixir.bootlin.com/linux/v4.6/source/drivers/nvme/host/pci.c#L1807
static int nvme_setup_prp_pools(struct nvme_dev *dev)
{
    dev->prp_page_pool = dma_pool_create("prp list page", dev->dev,
                                         PAGE_SIZE, PAGE_SIZE, 0);

static bool nvme_setup_prps
    prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);

static void nvme_free_iod
    dma_pool_free(dev->prp_small_pool, list[0], prp_dma);

Is a buffer within kmalloc also a DMA safe buffer?

I'm in the middle of writing a framebuffer driver for an SPI-connected LCD. I use kmalloc to allocate the buffer, which is quite large - 150 KB. Given the way kmalloc allocates the buffer, ksize reports that considerably more memory is actually used - 256 KB or so.
The SPI spi_transfer structure takes pointers to tx and rx buffers, both of which have to be DMA-safe. As I want the tx buffer to be about 16 KB, can I allocate that buffer within the kmalloc'ed video buffer and still be DMA-safe?
This could be considered premature optimisation, but there's so much spare space within the video buffer that it feels bad not to use it! Essentially there is no difference in allocated memory between:
kmalloc(videosize)
and
kmalloc(PAGE_ALIGN(videosize) + txbufsize)
so one could take the kptr returned and do:
txbuf = (u8 *)kptr + PAGE_ALIGN(videosize);
I'm aware that part of the requirement for "DMA safe" is appropriate alignment - to the CPU cache line size, I believe - but shouldn't page alignment be OK for this?
As an aside, I'm not sure whether tx and rx can point to the same place. The spi.h header is unclear (explicitly unclear, actually). Given that the rx buffer will never be more than a few bytes, it would be silly to cause trouble by trying to find out!
The answer appears to be yes, with provisos (specifically, "it's more complicated than that").
If you acquire your memory via __get_free_page*() or the generic memory allocator (kmalloc), then you may DMA to/from that memory using the addresses returned by those routines. The underlying implication is that a page-aligned buffer within a kmalloc allocation, even one spanning multiple pages, will be DMA-safe, as the underlying physical memory is guaranteed to be contiguous and a page-aligned buffer is guaranteed to start on a cache line boundary.
One proviso is whether the device is capable of driving the full bus width (e.g. ISA). The physical address of the memory must therefore be within the dma_mask of the device.
Another is cache coherency. This operates at the granularity of the cache line width. To prevent two separate memory regions from sharing one cache line, the memory used for DMA must begin exactly on a cache line boundary and end exactly on one. Given that this may not be known, it is recommended (in the DMA API documentation) to only map virtual regions that begin and end on page boundaries (as these are guaranteed to also be cache line boundaries, as stated above).
A driver can use dma_alloc_coherent() to allocate DMA-able space in this case, which guarantees that the DMA region is uncacheable. As this may be expensive, a streaming method also exists - for one-way communication - where coherency is limited to cache flushes on write: use dma_map_single() on a previously allocated buffer.
In my case, passing the tx and rx buffers to spi_sync without dma_map_single is fine - the SPI routines will do it for me. I could use dma_map_single myself, along with either dma_unmap_single or dma_sync_single_for_cpu() to keep everything in sync. I won't bother at the moment though - performance tweaking after the driver works is a better strategy.
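A hedged sketch of the layout described above, with spi_sync() left to handle the mapping; my_lcd_send is a made-up helper, and kptr, videosize and txbuf follow the question's naming:

#include <linux/mm.h>
#include <linux/spi/spi.h>

/* Carve the tx buffer out of the tail of the kmalloc'ed video buffer,
 * i.e. kptr = kmalloc(PAGE_ALIGN(videosize) + txbufsize, GFP_KERNEL),
 * and let the SPI core map it for DMA inside spi_sync(). */
static int my_lcd_send(struct spi_device *spi, u8 *kptr,
                       size_t videosize, size_t txlen)
{
    u8 *txbuf = kptr + PAGE_ALIGN(videosize);   /* page aligned => cache-line aligned */
    struct spi_transfer xfer = {
        .tx_buf = txbuf,
        .len    = txlen,
    };
    struct spi_message msg;

    spi_message_init(&msg);
    spi_message_add_tail(&xfer, &msg);
    return spi_sync(spi, &msg);
}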
See also:
Does every dma_map_single call require a corresponding dma_unmap_single?
Linux kernel device driver to DMA into kernel space

Implement malloc which is backed by a disk file (dmalloc)

General malloc and mmap description
malloc (or any allocation function) is supposed to allocate memory for applications. The standard glibc malloc implementation uses the sbrk() system call to allocate the memory. The memory allocated to an application is not backed by disk; only when the application is swapped out are the contents of memory moved to disk (the pre-configured swap device).
The other method of allocating memory is through the use of mmap. The mmap system call creates a mapping in the virtual address space of the calling process. The following is the mmap function declaration as per the POSIX standard.
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
/* Describe the meaning of a few important parameters of mmap */
The mmap system call can also be used to allocate memory. Typically this is used to load application binaries or shared libraries. For example, the following mmap call will allocate memory without a backing file.
address = __mmap (0, length, PROT_READ|PROT_WRITE,MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
Flags
MAP_ANONYMOUS: The mapping is not backed by any file; its contents
are initialized to zero.
MAP_PRIVATE: Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file.
dmalloc
dmalloc is a new API which allocates memory using a disk-backed file, i.e. without passing MAP_ANONYMOUS and MAP_PRIVATE to mmap. dmalloc would be particularly useful with SSDs, which have very low read/write latency compared to HDDs. Since the file is mapped into RAM, dmalloc will also benefit from high-speed RAM.
Alternatives
An SSD can also be configured as the highest-priority swap device; however, this approach suffers from the HDD-optimized swapping algorithm inside the Linux kernel. The swapping algorithm tries to cluster application pages on the swap device, and when data from swap is needed it reads in the complete cluster (read-ahead). If an application is doing random I/O, the read-ahead data causes unnecessary I/O to disk.
Questions:
1. What is meant by "allocates memory using a disk backed file i.e. without MAP_ANONYMOUS and MAP_PRIVATE to mmap"? Which flag should I use apart from those two?
2. How do I create an on-write backup of the memory allocated to an application?
I have never heard of dmalloc, but from your description it looks like a mix between malloc (pure memory allocation) and mmap (direct mapping of memory to disk). dmalloc seems to allocate memory backed by disk, but to be more performant than mmap on such disks (e.g. SSDs). I could imagine that it groups write operations before actually flushing them to disk, whereas mmap is more or less a "virtual memory window" onto a disk file.
As for your questions:
1) MAP_ANONYMOUS and MAP_PRIVATE are flags for use with mmap. The fact that these flags are mentioned as not being used makes me think dmalloc is a fresh implementation with no relationship to mmap.
2) dmalloc seems to be suited to what you describe: it "backs up" memory to disk, similar to mmap. You may need to read the documentation in detail to know exactly when you have a guarantee that the data is effectively on disk (caching, ...).
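On question 1, one way to get a disk-backed mapping with mmap itself (i.e. without MAP_ANONYMOUS and MAP_PRIVATE) is a real file descriptor combined with MAP_SHARED. A minimal user-space sketch; the /tmp path and 1 MiB size are arbitrary and not related to any dmalloc implementation:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t length = 1 << 20;                      /* 1 MiB, arbitrary */
    int fd = open("/tmp/dmalloc.bin", O_RDWR | O_CREAT, 0600);

    if (fd < 0 || ftruncate(fd, length) != 0) {
        perror("backing file");
        return 1;
    }

    /* MAP_SHARED with a real fd: stores eventually reach the file on disk. */
    void *mem = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memset(mem, 0xab, length);        /* use it like malloc'ed memory */
    msync(mem, length, MS_SYNC);      /* force the contents to disk now */

    munmap(mem, length);
    close(fd);
    return 0;
}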

Memory management in OpenCL

When I started programming in OpenCL I used the following approach for providing data to my kernels:
cl_mem buff = clCreateBuffer(cl_ctx, CL_MEM_READ_WRITE, object_size, NULL, NULL);
clEnqueueWriteBuffer(cl_queue, buff, CL_TRUE, 0, object_size, (void *) object, 0, NULL, NULL);
This obviously required me to partition my data into chunks, ensuring that each chunk would fit into the device memory. After performing the computations, I'd read the data back with clEnqueueReadBuffer(). However, at some point I realised I could just use the following line:
cl_mem buff = clCreateBuffer(cl_ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, object_size, (void*) object, NULL);
When doing this, the partitioning of the data became unnecessary. And to my surprise, I experienced a great boost in performance. That is something I don't understand. From what I understand, when using a host pointer the device memory works as a cache, but all the data still needs to be copied to it for processing and then copied back to main memory once finished. How come an explicit copy (clEnqueueRead/WriteBuffer) is an order of magnitude slower, when in my mind it should be basically the same? Am I missing something?
Thanks.
Yes: it's the CL_TRUE in the clEnqueueWriteBuffer call. That makes the write operation blocking, which stalls the CPU while the copy is made. When you use the host pointer instead, the OpenCL implementation can "optimize" the copy by making it asynchronous, so the overall performance is better.
Note that this depends on the CL implementation, and there's no guarantee it will be faster/equal/slower.
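For comparison, the explicit copy itself can be made asynchronous by passing CL_FALSE and synchronizing on an event later; a sketch using the question's cl_queue, buff, object and object_size:

cl_event write_done;
cl_int err;

// Non-blocking: the call returns immediately and the copy proceeds in the
// background while the host continues with other work.
err = clEnqueueWriteBuffer(cl_queue, buff, CL_FALSE, 0, object_size,
                           (void *) object, 0, NULL, &write_done);

// Enqueue kernels that depend on the data with write_done in their event
// wait list, or simply wait for the copy before using the buffer:
if (err == CL_SUCCESS) {
    clWaitForEvents(1, &write_done);
    clReleaseEvent(write_done);
}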
In some cases the CPU and GPU can share the same physical DRAM. For example, if the memory block satisfies the CPU and GPU alignment rules, then Intel interprets CL_MEM_USE_HOST_PTR as permission to share physical DRAM between the CPU and GPU, so there is no actual copying of data. Obviously, that's very fast!
Here is a link that explains it:
https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics
PS: I know my reply is far too late for the OP, but other readers may be interested.
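If I recall the article correctly, the zero-copy path requires the host block to be aligned to 4096 bytes with a size that is a multiple of 64 bytes. A hedged sketch of allocating such a block before handing it to clCreateBuffer; the create_zero_copy_buffer helper and the posix_memalign-based allocation are illustrative choices, not taken from the article:

#include <CL/cl.h>
#include <stdlib.h>

// Create a buffer whose host block already satisfies the alignment rules
// described above, so the runtime can alias it instead of copying it.
static cl_mem create_zero_copy_buffer(cl_context ctx, size_t object_size,
                                      void **host_block_out)
{
    size_t aligned_size = (object_size + 63) & ~(size_t) 63;  // round up to 64 bytes
    void *host_block = NULL;
    cl_int err;

    if (posix_memalign(&host_block, 4096, aligned_size) != 0)
        return NULL;

    *host_block_out = host_block;   // caller fills this with data and frees it later
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                          aligned_size, host_block, &err);
}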

Resources