Magic number with MmMapIoSpace - Windows

So upon mapping a memory space with MmMapIoSpace, I noticed that past a certain point, the data was just being discarded when written. No errors, breakpoints, or even bugchecks were thrown. Everything appeared to work normally; the writes simply had no effect.
I decided to do a write/read test: the driver would write 1s to every byte of the intended buffer size, and a user-mode reader would read the buffer back and report where the 1s ended.
The number it came up with was 3208, which looks suspiciously round (3208/8 = 401, for instance).
What's up with this? How come I can't map the full buffer space?
EDIT: And in 64-bit it drops to 2492.

I'm no expert, but I don't see how MmMapIoSpace can be relied upon to do what you're asking it to, because there's no guarantee that the user-space buffer is contiguous in physical memory.
Instead, I think you should be using IoAllocateMdl and MmProbeAndLockPages to lock down the user buffer and then MmGetSystemAddressForMdlSafe to map it into the system address space. This process is described here.
As previously stated, I think the point at which the mapping fails (3208/2492 bytes into the buffer) is probably just the end of the page, but that's easy enough for you to verify: have the user-space application report the (virtual) address of the first byte that didn't get written, rather than the offset, and check whether it is a multiple of 4096.
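A minimal sketch of that MDL sequence (untested; UserBuffer and Length are placeholder names standing in for a pointer and byte count taken from a validated IOCTL):

    PMDL mdl = IoAllocateMdl(UserBuffer, Length, FALSE, FALSE, NULL);
    if (mdl == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    __try {
        /* Fault in and pin the user pages; unlike what MmMapIoSpace
           expects, they need not be physically contiguous. */
        MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(mdl);
        return GetExceptionCode();
    }

    /* Map the locked pages into system address space. */
    PVOID systemVa = MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority);
    if (systemVa == NULL) {
        MmUnlockPages(mdl);
        IoFreeMdl(mdl);
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    /* ... the driver can now read/write all Length bytes via systemVa ... */

    MmUnlockPages(mdl);
    IoFreeMdl(mdl);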

Related

VirtualAlloc a writeable, unbacked "throw-away" garbage range?

Is it possible in Win32 to get a writeable (or write-only) range of "garbage" virtual address space (e.g., via VirtualAlloc, VirtualAlloc2, VirtualAllocEx, or otherwise) that never needs to be persisted, and thus ideally is never backed by physical memory or pagefile?
This would be a "hole" in memory.
The scenario is simulating a dry run of a sequential memory-writing operation just to obtain the size it would actually consume. You would be able to use the exact same code used for actual writing, but instead pass in an unbacked "garbage" address range that simply ignores or discards anything written to it. In this example, the size of the "void" address range could be 2⁶⁴ bytes ≈ 18.4 EB (why not? it's nothing, after all), and all you're interested in is the final value of an advancing pointer.
[edit:] see the comments section for the cleverest answer. Namely: map a single 4K page multiple times in sequence, tiling the entire "empty" range.
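A rough sketch of that trick (untested), with one caveat: MapViewOfFileEx requires each view to start on the 64 KiB allocation granularity, so the tile below is 64 KiB rather than a single 4K page. Every tile aliases the same backing, which is fine for a dry run that only discards data:

    #include <windows.h>

    #define TILE_SIZE   (64 * 1024)
    #define RANGE_SIZE  (1ULL << 30)   /* 1 GiB "garbage" range, for example */

    void *reserve_garbage_range(void)
    {
        /* One small pagefile-backed section, mapped over and over. */
        HANDLE section = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL,
                                            PAGE_READWRITE, 0, TILE_SIZE, NULL);
        if (!section)
            return NULL;

        /* Find a free region, then release it so views can be placed there.
           (Racy in a multithreaded process; VirtualAlloc2 placeholders
           close that gap on newer Windows.) */
        char *base = VirtualAlloc(NULL, RANGE_SIZE, MEM_RESERVE, PAGE_NOACCESS);
        if (!base) {
            CloseHandle(section);
            return NULL;
        }
        VirtualFree(base, 0, MEM_RELEASE);

        for (SIZE_T off = 0; off < RANGE_SIZE; off += TILE_SIZE) {
            if (!MapViewOfFileEx(section, FILE_MAP_WRITE, 0, 0, TILE_SIZE,
                                 base + off))
                return NULL;   /* cleanup omitted for brevity */
        }
        return base;
    }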
This isn't possible. If you have code that attempts to write to memory, then the virtual memory needs to be backed by something.
However, if you modified your code to use the stream pattern, you could provide a stream implementation that ignores the writes and just tracks the size.
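For instance, a counting "stream" sketch (hypothetical names) needs almost nothing:

    #include <stddef.h>

    /* Discards all data and records only how many bytes were "written". */
    typedef struct {
        size_t size;
    } counting_stream;

    static void counting_stream_write(counting_stream *s,
                                      const void *data, size_t len)
    {
        (void)data;      /* payload is ignored */
        s->size += len;  /* only the advancing size is kept */
    }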

Does cache line flush write the whole line to the memory?

When a dirty cache line is flushed (for any reason), is the whole cache line written to memory, or does the CPU track which words were written and reduce the number of memory writes?
If this differs among architectures, I'm primarily interested in knowing this for Blackfin, but it would be nice to hear practices in x86, ARM, etc...
Generally, if you have a write buffer, the flush goes through the write buffer (the entire cache line), and the write buffer completes the writes to RAM at some later point. I have not heard of a cache that tracks, per item within a line, which parts are dirty and which are not; that is why you have a cache line in the first place. So in the cases I have heard of, the whole line goes out. Another point is that it is not uncommon for the slow memory on the back side of a cache (DDR, for example) to be accessed through some fixed width: 32 bits at a time, 64 bits at a time, 128 bits at a time, or each part at that width with multiple parts. To avoid a read-modify-write you want to write in complete RAM-width units. Cache lines are multiples of that width, sure, so the opportunity to skip some of the writes is there. Also, if there is ECC on that RAM, then you need to write a whole ECC word at once to avoid a read-modify-write.
You would need a dirty bit per writeable item in the cache line, which would multiply the dirty-bit storage by some amount; that may or may not have a real impact on size or cost. There may also be a per-transaction overhead on the RAM side, and it may be cheaper to do one multi-word transaction than even two separate transactions, so this scheme might create a performance hit rather than a boost (the same problem exists inside the write buffer: instead of one transaction with a start address and a length, you now have multiple transactions).
It just seems like a lot of work for something that may or may not result in a gain. If you find one that does please post it here.
I'm dusting off my cobwebby computer architecture knowledge from classes taken 15 years ago -- please be kind if I'm totally wrong.
I seem to remember that on x86, MIPS, and Motorola, the whole line gets written. This is because the cache line matches the bus width (except in very odd circumstances, such as the moldy old 386-SX line, which was a 32-bit architecture with a 16-bit bus), so there's no point in trying to do word-wise optimization; the whole line is going to be written anyway.
I can't imagine any scenario in which a hardware architecture of any kind would do anything different, but I've been known to be wrong in the past.

dma mapping which does not cross some power-of-2 boundary

I want to set up a DMA mapping for a memory buffer allocated outside my control. dma_map_single appears to be the right API to use, but my HW has a restriction due to which the mapping must not cross some power-of-two boundary, say 1K. The buffer being mapped is always smaller than the boundary value but is otherwise variable in size. So it looks like DMA pools may not work, since they need a fixed size, even though the "allocation" part is sort of what I need.
Should I just keep calling dma_map_single and check whether the mapping meets my requirement, releasing it if it does not? Could the same mapping be returned every time, causing a never-ending search? If so, I could hang on to the unfit mappings until a fitting one is found and then release all the unfit ones in one shot. Neither of these sounds like a good idea, though.
Does anyone have other/better ideas?
Thanks.
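(For what it's worth, the "check if the mapping meets my requirement" test itself is cheap; a sketch, assuming the boundary is a power of two:)

    #include <linux/types.h>

    /* True iff [dma, dma + len) crosses a boundary-aligned line. */
    static bool crosses_boundary(dma_addr_t dma, size_t len, size_t boundary)
    {
        dma_addr_t mask = ~((dma_addr_t)boundary - 1);

        return (dma & mask) != ((dma + len - 1) & mask);
    }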
If you can't guarantee that the buffer you are passed meets your criteria, you may need to allocate an auxiliary buffer and copy to/from that buffer before you DMA. On platforms without an IOMMU or other address-translation hardware (e.g. classic x86, ARM, etc.), the DMA mapping operation is really just a conversion to a physical address. So if you unmap and try again with the same buffer, you'll always get back the same DMA address.
On most (all?) other platforms that do have an IOMMU, the translation is still done on chunks >= PAGE_SIZE. In other words, if you're on a platform with 4K pages, and you do DMA mapping on a buffer at 0xABCDExxx, you'll always get a DMA address like 0xFGHIJxxx where the low part of the address "xxx" stays the same. (This is because the IOMMU works like a normal MMU and only looks up the page translation, and leaves the low 12 or whatever bits alone)
So in essentially all cases on all platforms, you can't use the DMA API to fix up the alignment of the buffer you get passed. As I said, I think the only alternative if the buffers you get passed in don't meet your alignment requirements is to use a bounce buffer. The DMA pool API is a fine way to allocate these bounce buffers -- it's no problem if you sometimes need a smaller buffer; it's fine to leave some of the memory you get back unused.
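One detail worth noting: dma_pool_create's last parameter is exactly such a boundary; the pool guarantees that no block it hands out crosses it. A sketch of the bounce-buffer path (untested; dev, src, and len are placeholder names):

    #include <linux/dmapool.h>
    #include <linux/string.h>

    static struct dma_pool *bounce_pool;

    /* At driver init: fixed-size 1K blocks, 1K-aligned, and (the key part)
       guaranteed never to cross a 1K boundary via the last parameter. */
    static int bounce_init(struct device *dev)
    {
        bounce_pool = dma_pool_create("bounce1k", dev, 1024, 1024, 1024);
        return bounce_pool ? 0 : -ENOMEM;
    }

    /* Per transfer: copy the caller's buffer into a conforming one. */
    static int bounce_tx(const void *src, size_t len, dma_addr_t *dma)
    {
        void *buf = dma_pool_alloc(bounce_pool, GFP_KERNEL, dma);

        if (!buf)
            return -ENOMEM;
        memcpy(buf, src, len);   /* len <= 1024 per the question */
        /* ... hand *dma to the hardware, then dma_pool_free() when done ... */
        return 0;
    }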

allocate large (32mb) contiguous region

Is it at all possible to allocate large (i.e. 32 MB) physically contiguous memory regions from kernel code at runtime (i.e. not using bootmem)? From my experiments, it seems it's not possible to get anything more than a 4 MB chunk, no matter what GFP flags I use. According to the documentation I've read, __GFP_NOFAIL is supposed to make kmalloc just wait as long as necessary to free the requested amount, but from what I can tell it just makes the request hang indefinitely if you ask for more than is available - it doesn't seem to be actively trying to free memory to fulfil the request (i.e. kswapd doesn't seem to be running). Is there some way to tell the kernel to aggressively start swapping stuff out in order to free up the requested allocation?
Edit: So I see from Eugene's response that it's not going to be possible to get a 32mb region from a single kmalloc.... but is there any possibility of getting it done in more of a hackish kind of way? Like identifying the largest available contiguous region, then manually migrating/swapping away data on either side of it?
Or how about something like this:
1) Grab a bunch of 4 MB chunks until you're out of memory.
2) Check them all to see if any of them happen to be contiguous; if so, combine them.
3) kfree the rest.
4) goto 1)
Might that work, if given enough time to run?
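(Aside: the 4 MB ceiling is the buddy allocator's MAX_ORDER limit; with 4K pages the largest single allocation is typically 2^(MAX_ORDER-1) = 1024 pages = 4 MB. Step 2's contiguity test is at least easy to express; a sketch, untested:)

    #include <linux/gfp.h>
    #include <linux/mm.h>

    #define CHUNK_ORDER (MAX_ORDER - 1)     /* usually order 10: 4 MB */
    #define CHUNK_PAGES (1UL << CHUNK_ORDER)

    /* Two maximal-order chunks are physically contiguous iff their
       page frame numbers differ by exactly one chunk. */
    static bool chunks_adjacent(struct page *a, struct page *b)
    {
        return page_to_pfn(b) == page_to_pfn(a) + CHUNK_PAGES;
    }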
You might want to take a look at the Contiguous Memory Allocator patches. Judging from the LWN article, these patches are exactly what you need.
Mircea's link is one option; if you have an IOMMU on your device you may be able to use that to present a contiguous view over a set of non-contiguous pages in memory as well.

Getting the lowest free virtual memory address in windows

The title pretty much says it all: is there a way to get the lowest free virtual memory address under Windows? I should add that I am interested in this information at the beginning of the program (before any dynamic memory allocation has been done).
Why I need it: I'm trying to build a malloc implementation under Windows. If it is not possible, I would have to rely on whatever VirtualAlloc() returns when given NULL as the first parameter. While you would expect it to do something sensible, like allocating memory at the bottom of what is available, there are no guarantees.
This can be implemented yourself by using VirtualQuery, looking for regions that are marked as free. It would be relatively slow, though. (You will also need to consider allocation granularity, which is different from the page size.)
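A sketch of that scan (untested; it rounds each free region's base up to the allocation granularity before considering it):

    #include <windows.h>

    /* Returns the lowest free region that can hold `size` bytes, or NULL. */
    static void *lowest_free_address(SIZE_T size)
    {
        SYSTEM_INFO si;
        MEMORY_BASIC_INFORMATION mbi;
        char *p;

        GetSystemInfo(&si);
        p = (char *)si.lpMinimumApplicationAddress;

        while (p < (char *)si.lpMaximumApplicationAddress &&
               VirtualQuery(p, &mbi, sizeof(mbi)) != 0) {
            if (mbi.State == MEM_FREE) {
                /* VirtualAlloc rounds explicit base addresses down to the
                   allocation granularity, so only aligned bases count. */
                UINT_PTR g = si.dwAllocationGranularity;
                char *base = (char *)(((UINT_PTR)mbi.BaseAddress + g - 1)
                                      & ~(g - 1));
                if (base + size <= (char *)mbi.BaseAddress + mbi.RegionSize)
                    return base;
            }
            p = (char *)mbi.BaseAddress + mbi.RegionSize;
        }
        return NULL;
    }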
I will say that unless you need contiguous blocks of memory, trying to keep everything close together is mostly meaningless: even if two pages are next to each other in the virtual address space, there is no reason to assume they are close to each other in physical memory. In fact, even if they are close at some point in time, once those pages are written to backing store and later faulted back in, they will generally not land in the same physical pages.
The OS uses more complicated metrics than just what is the "lowest" memory address available. Specifically, VirtualAlloc allocates pages of memory, so depending on how much you're asking for, at least one page of unused address space has to be available at the starting address. So even if you think there's a "lower" address that it should have used, that address might not have been compatible with the operation that you asked for.
