mmap /dev/mem, read performance is very slow

I have written a test program which is like this:
fd = open("/dev/mem", O_RDWR);
src = mmap(0x0, 0x1000000, PROT_READ, MAP_SHARED, fd, 0x80000000); /* 0x80000000 is the physical start address of DDR on my Cortex-A8 platform */
dst = malloc(0x1000000);
start_time = get_time();
memcpy(dst, src, 0x1000000);
end_time = get_time();
print_speed();
On my ARM Cortex-A8 based board, this gives me about 400 MB/s. Then I changed the test program so that the src buffer is also allocated by malloc and ran it again; now it gives me about 1400 MB/s, roughly 3~4x faster.
I tried to figure out the reason. First, I suspected that the src memory mapped through mmap is uncached, so I checked the code in drivers/char/mem.c in the kernel. In the mmap_mem function, I used printk to print the page protection attribute (vma->vm_page_prot) of the mapped address; it shows 0x10f, so it is not uncached.
Furthermore, I changed the code to force the uncached type via vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot) and tested again; the result is about 30 MB/s. So we can definitely confirm that the /dev/mem mapped memory is indeed cached, yet its read performance is still much slower than that of malloced memory.
Then how to explain this test result?
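For reference, a self-contained version of the test described above might look like the sketch below (timing with clock_gettime() in place of the get_time()/print_speed() helpers; the 16 MiB length and the 0x80000000 DDR base are taken from the question and are board-specific assumptions):

/* Minimal sketch of the benchmark described above. Assumes a Linux target
 * where physical DDR starts at 0x80000000; may need -D_FILE_OFFSET_BITS=64
 * on 32-bit targets so the mmap offset fits in off_t. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define LEN       0x1000000UL   /* 16 MiB, as in the question        */
#define PHYS_BASE 0x80000000UL  /* DDR start on the poster's board   */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    int fd = open("/dev/mem", O_RDWR);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    unsigned char *src = mmap(NULL, LEN, PROT_READ, MAP_SHARED, fd, PHYS_BASE);
    if (src == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned char *dst = malloc(LEN);
    if (!dst) { perror("malloc"); return 1; }

    double t0 = now_sec();
    memcpy(dst, src, LEN);
    double t1 = now_sec();

    printf("copied %lu MiB in %.3f s -> %.1f MB/s\n",
           LEN >> 20, t1 - t0, LEN / (t1 - t0) / 1e6);

    munmap(src, LEN);
    free(dst);
    close(fd);
    return 0;
}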

Related

MMAP buffer kernel writes are not seen by user space

I have a kernel driver which shares a buffer with the user-space layer.
Everything seemed to work fine in my VM prototype (Ubuntu, kernel 5.4), but when I moved my code to the target (same kernel, but an embedded distro) I can clearly see that kernel writes to the buffer (using memcpy or memset) are not reflected on the user-space side of the buffer.
Note that I use direct buffer accesses on both sides. There is no concurrency issue, as the kernel writes first and user space reads afterwards.
I ended up believing this is a cache issue ... as the same code works perfectly in my VM.
The buffer size is 4 * PAGE_SIZE.
It is allocated as follows:
int _size = (SFP_BUFFER_SIZE + (PAGE_SIZE-1)) & ~(PAGE_SIZE-1);

input_buffer = (char *) kzalloc(_size, GFP_KERNEL); // aligned on page boundary
if (!input_buffer) {
    dev_dbg(&dev, "open/ENOMEM (input_buffer)\n");
    status = -ENOMEM;
    goto err_all;
}
When mmap'ing, I used the following code pattern:
vma->vm_ops = &fpgadrv_vm_ops;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
pfn = virt_to_phys((void *)(input_buffer)) >> PAGE_SHIFT;

if (remap_pfn_range(vma, vma->vm_start, pfn, size, vma->vm_page_prot))
{
    printk(KERN_DEBUG "remap page range failed\n");
    return -EAGAIN;
}
Both the user-space code and the kernel code use memcpy to update the buffer. Note also that I cannot use the write/read entry points, as they are already used for very specific operations.
The user code is calling mmap as follows:
buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, device_fd, 0);
if (buf == MAP_FAILED)
{
    perror("USERDRV:cannot mmap");
    return -1; // for testing, ignore the return code and continue
}
and upon IOCTL call, the kernel would fill up the mmap buffer as follows:
case IOCTL_RESET:
    printk(KERN_DEBUG "FPGADRV: IOCTL RESET");
    // reset the buffer (zero + put back the signature)
    memset(input_buffer, 0xA5, SFP_BUFFER_SIZE);
    memcpy((void *)(input_buffer), (void *)signature, 10);
    break;
Is there something more I should do to make sure the pages are not cached (assuming this is the cause of my problem)?
Thanks,
Jacques
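One approach that is often suggested for this kind of kernel/user shared buffer (a sketch only, not taken from the driver above; the device pointer and field names are assumptions) is to allocate it with the coherent DMA API, so that the kernel-side pointer and the user-space mapping agree on cacheability. dma_mmap_coherent() then replaces the manual pgprot_noncached()/remap_pfn_range() pair:

#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Assumed driver state. */
static struct device *fpga_dev;      /* assumption: saved at probe time   */
static void *input_buffer;           /* kernel-side virtual address       */
static dma_addr_t input_buffer_dma;  /* DMA/bus address of the buffer     */

static int fpgadrv_alloc_buffer(size_t size)
{
    /* Coherent memory: kernel memcpy()/memset() results are visible
     * through any mapping derived from it, with no manual flushing. */
    input_buffer = dma_alloc_coherent(fpga_dev, size, &input_buffer_dma,
                                      GFP_KERNEL);
    return input_buffer ? 0 : -ENOMEM;
}

static int fpgadrv_mmap(struct file *filp, struct vm_area_struct *vma)
{
    size_t size = vma->vm_end - vma->vm_start;

    /* Sets up vma->vm_page_prot consistently with the coherent allocation
     * and maps the pages; no pgprot_noncached()/remap_pfn_range() needed. */
    return dma_mmap_coherent(fpga_dev, vma, input_buffer,
                             input_buffer_dma, size);
}

If the kzalloc()/remap_pfn_range() scheme is kept instead, the kernel side has a cached view of the buffer while the user side sees an uncached one, so kernel writes can sit in the CPU cache and never reach the user mapping, which would match the symptom described above.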

Simulating (lazy) NAND memory on Windows

I'm running a firmware simulation in a DLL which has simulated NAND (256MB or 1GB). I want to avoid allocating memory for this on the heap and instead allocate using virtual memory.
The memory initially needs to be cleared to 0xFF (like NAND is). However I don't want to pay for that initialization (nor commit un-accessed pages). So ideally it should only allocate upon access. And I do not need to retain the data following exit of the simulation.
Initial ideas are:
1) VirtualAlloc. Not sure, but perhaps I could use a guard page and then trap the exception on first access. I'm not sure it's ideal for a DLL to handle such SEH exceptions, though. Or is there a better way?
2) Create a big file that's initialized to 0xFF, then map a view of the file with copy-on-write.
Anyone know if it is possible to create a file with a callback for providing the initial data?
I think 1) is probably the way to go, but I'm wondering if that's really the best option.
Edit:
3) I've come up with another method that avoids the exception handler and also avoids creating a huge file:
Create a file that is the same size as dwAllocationGranularity (typically 64 KiB) and fill it with 0xFF. Then create multiple copy-on-write views of it in contiguous memory using MapViewOfFileEx + FILE_MAP_COPY (after an initial VirtualAlloc/VirtualFree to obtain a suitable base address at which we can hope to place the juxtaposed views). I need to test this a bit more fully - there is a slight concern about potential thread races. I'm only actually using a single thread, but the CRT does start a few too.
This means that any code that only reads the virtual NAND also does not result in all pages getting committed.
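A rough sketch of option 3 (an assumption-laden illustration, not tested code from the project): it assumes a 64 KiB allocation granularity and substitutes a pagefile-backed section, filled with 0xFF once through a temporary writable view, for the on-disk file; names and sizes are illustrative:

#include <windows.h>
#include <string.h>

#define VIEW_SIZE (64 * 1024)        /* assumed dwAllocationGranularity    */
#define NAND_SIZE (16 * 1024 * 1024) /* simulated NAND size (illustrative) */

static void *map_lazy_nand(void)
{
    /* One 64 KiB pagefile-backed section, pre-filled with 0xFF. */
    HANDLE hMap = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL,
                                     PAGE_READWRITE, 0, VIEW_SIZE, NULL);
    if (!hMap) return NULL;

    void *init = MapViewOfFile(hMap, FILE_MAP_WRITE, 0, 0, VIEW_SIZE);
    if (!init) { CloseHandle(hMap); return NULL; }
    memset(init, 0xFF, VIEW_SIZE);          /* erased-NAND pattern */
    UnmapViewOfFile(init);

    /* Reserve a contiguous range, then release it and try to re-map the
     * views at the same base (this is where the thread race can bite). */
    char *base = (char *)VirtualAlloc(NULL, NAND_SIZE, MEM_RESERVE, PAGE_NOACCESS);
    if (!base) { CloseHandle(hMap); return NULL; }
    VirtualFree(base, 0, MEM_RELEASE);

    for (SIZE_T off = 0; off < NAND_SIZE; off += VIEW_SIZE) {
        /* FILE_MAP_COPY = copy-on-write: reads share the 0xFF pages,
         * writes create private copies on demand. */
        if (!MapViewOfFileEx(hMap, FILE_MAP_COPY, 0, 0, VIEW_SIZE, base + off)) {
            while (off) { off -= VIEW_SIZE; UnmapViewOfFile(base + off); }
            CloseHandle(hMap);
            return NULL;
        }
    }
    return base;
}

The returned base can then be handed to the firmware simulation as its NAND base address: read-only access keeps sharing the single 64 KiB of 0xFF data, and only pages that are actually written get private copies.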
Yes, option 1 is basically the best solution, with two changes. First, use a vectored exception handler (VEH) instead of SEH: an SEH handler is invoked only when the faulting access happens inside its guarded scope, whereas a VEH catches the access from any context and any thread. Second, instead of guard pages, initially just reserve the region without committing any memory, so that any access to it raises an access violation; the VEH then commits the touched page and fills it with the 0xFF pattern. Demo code:
PVOID g_NandBegin;
SIZE_T g_NandSize = 0x1000000;

LONG NTAPI Vex(::PEXCEPTION_POINTERS ExceptionInfo)
{
    ::PEXCEPTION_RECORD ExceptionRecord = ExceptionInfo->ExceptionRecord;

    if (ExceptionRecord->ExceptionCode == STATUS_ACCESS_VIOLATION &&
        ExceptionRecord->NumberParameters > 1)
    {
        // ExceptionInformation[1] holds the faulting address.
        PVOID pv = (PVOID)ExceptionRecord->ExceptionInformation[1];

        if ((ULONG_PTR)pv - (ULONG_PTR)g_NandBegin < g_NandSize)
        {
            // Commit the page containing the faulting address; the system
            // rounds pv down and RegionSize up to page boundaries.
            SIZE_T RegionSize = 1;
            if (0 <= NtAllocateVirtualMemory(NtCurrentProcess(), &pv, 0,
                                             &RegionSize, MEM_COMMIT, PAGE_READWRITE))
            {
                // Emulate erased NAND: fill the freshly committed page with 0xFF.
                RtlFillMemoryUlong(pv, RegionSize, MAXULONG);
                return EXCEPTION_CONTINUE_EXECUTION;
            }
        }
    }

    return EXCEPTION_CONTINUE_SEARCH;
}

void dc()
{
    if (PVOID pv = AddVectoredExceptionHandler(TRUE, Vex))
    {
        // Reserve (but do not commit) the simulated NAND region.
        if (g_NandBegin = VirtualAlloc(0, g_NandSize, MEM_RESERVE, PAGE_READWRITE))
        {
            ULONG seed = ~GetTickCount();
            int n = 0x100;
            do
            {
                // Read random locations; every byte must come back as 0xFF.
                if (*(UCHAR*)((PBYTE)g_NandBegin +
                    (((ULONG64)RtlRandomEx(&seed) * g_NandSize) >> 32)) != 0xFF)
                {
                    __debugbreak();
                }
            } while (--n);

            VirtualFree(g_NandBegin, 0, MEM_RELEASE);
        }

        RemoveVectoredExceptionHandler(pv);
    }
}

Is reusing a variable but memory not being released by the process considered a memory leak?

In Vala I have a TreeMultiMap from the Gee library created as a private variable of a class. When I use the tree multi map and fill it with data, the memory consumption of the process increases to 14.2 MiB. When I clear the tree multi map which is still the same variable and use it again, but add less data to it, the memory consumption of the process doesn't increase, but it doesn't decrease either. It stays at 14.2 MiB.
The code is as follows (MultiMapTest.vala):
using Gee;

private TreeMultiMap<string, TreeMultiMap<string, string> > rootTree;

public static int main () {
    // Initialize rootTree
    rootTree = new TreeMultiMap<string, TreeMultiMap<string, string> > (null, null);

    // Add data repeatedly to the tree to make the process consume memory
    for (int i = 0; i < 10000; i++) {
        TreeMultiMap<string, string> nestedTree = new TreeMultiMap<string, string> (null, null);
        nestedTree.@set ("Lorem ipsum", "Lorem ipsum");
        rootTree.@set ("Lorem ipsum", nestedTree);
    }

    stdout.printf ("Press ENTER to clear the tree...");
    // Wait for the user to press enter
    var input = stdin.read_line ();

    // Clear the tree
    rootTree.clear ();

    stdout.printf ("Press ENTER to continue and refill the tree with less data...");
    // Wait for the user to press enter
    input = stdin.read_line ();

    // Refill the tree but with much less data
    for (int i = 0; i < 10; i++) {
        TreeMultiMap<string, string> nestedTree = new TreeMultiMap<string, string> (null, null);
        nestedTree.@set ("Lorem ipsum", "Lorem ipsum");
        rootTree.@set ("Lorem ipsum", nestedTree);
    }

    stdout.printf ("Press ENTER to quit...");
    // Wait for the user to press enter
    input = stdin.read_line ();

    return 0;
}
Compiled with valac --pkg gee-0.8 -g MultiMapTest.vala
Is this considered a memory leak? If so, is there any way to properly approach the situation, such that the memory gets released to the OS once the tree multi map is cleared, even if it involves using other data structures?
I used valgrind, but could not detect any memory leaks. My take on it is that once memory is allocated for the TreeMultiMap variable, unless the variable goes out of scope, the program will keep that memory allocated until the end of its lifetime instead of releasing it back to the operating system, even if the TreeMultiMap is emptied.
This has little to do with Vala and more to do with the way UNIX programs deal with memory.
When a program starts, the UNIX kernel allocates memory for the program itself, the stack, and an area called the heap where memory can be dynamically allocated by the program. From the UNIX kernel's perspective, the heap is a large chunk of memory and the program can request more using sbrk.
In C, and most other languages, you need small chunks of memory quite often. So the C standard library has code to do memory allocation via malloc and free. When you allocate memory using malloc, it takes it out of the free space it has in the heap. When you free it, it can be reused by a later malloc. If there isn't enough memory, malloc will call sbrk to get the program more memory. No matter how much you free, the C standard library will not give memory back to the kernel until the program ends.
Valgrind and Vala are talking about memory leaks, where you malloc without a matching free. ps or top show the total memory that was allocated via sbrk.
That means if you malloc a large chunk of memory, then free it, Valgrind will show it as correctly freed and it is available to your program for reuse, but the kernel still considers it in use by the program.
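To see this effect directly, here is a small demonstration sketch (assuming Linux with glibc; the exact numbers will vary) that compares the process's resident set size before and after freeing a large number of small allocations:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>   /* malloc_trim() - glibc extension */

static void print_rss(const char *label)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) return;
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%-10s %s", label, line);
    fclose(f);
}

int main(void)
{
    enum { N = 100000, SZ = 512 };   /* ~50 MB of small allocations */
    static char *blocks[N];

    print_rss("start");

    for (int i = 0; i < N; i++) {
        blocks[i] = malloc(SZ);
        memset(blocks[i], 0xAA, SZ); /* touch the pages so they count */
    }
    print_rss("allocated");

    for (int i = 0; i < N; i++)
        free(blocks[i]);
    print_rss("freed");              /* RSS typically stays high here */

    malloc_trim(0);                  /* glibc-only: return free pages */
    print_rss("trimmed");
    return 0;
}

malloc_trim() is a glibc-specific extension and is about the closest thing to explicitly handing freed heap pages back to the kernel; without it, the "freed" line usually shows nearly the same RSS as the "allocated" line, which is exactly what the question observes with the cleared TreeMultiMap.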

Why doesn't free execute munmap?

I have the following code:
unsigned char *p = (unsigned char *)valloc(page_size);
if (!p) {
    ret = -1;
    goto out;
}

printf("valloc: allocated %d bytes, virtual address: %p\n", page_size, p);

memset(p, 0xFF, page_size);
memcpy(p, s, sizeof(s));
trace_mem(p, sizeof(s));

printf("Memory: %p - press any key\n", p);
getchar();

if (ioctl(fd, MY_IOC_PATCH) == -1) {
    fprintf(stderr, "ioctl %s error(%d): %s\n ", "MY_IOC_PATCH", errno, strerror(errno));
    ret = -1;
    goto out;
}

if (p) {
    printf("free: freed %d bytes, virtual address: %p\n", page_size, p);
    free(p);
}
.........................
Then I use strace to observe the system calls (strace ./my_program) and I get the following:
fstat64(1, {st_mode=S_IFREG|0644, st_size=1533, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7730000
brk(0) = 0x9d81000
brk(0x9da4000) = 0x9da4000
fstat64(0, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb772f000
read(0, "\n", 1024) = 1
ioctl(3, RTC_IRQP_SET, 0x1000) = 0
read(0, "\n", 1024) = 1
ioctl(3, RTC_EPOCH_READ, 0x9d82000) = 0
read(0, "\n", 1024) = 1
close(3) = 0
valloc: allocated 4096 bytes, virtual address: 0x9d82000
After the last IOCTL I don't see a munmap. I suppose that free must use munmap to unmap the memory, but it doesn't. What is the reason for that?
I think that Paramagnetic Croissant's comment, above, qualifies as "the Answer" to this one. It is ordinary practice for malloc() implementations to ask the operating system for more memory when they need it, but then never to give it back, on any operating system.
You see, there's really no need to "give it back." Pestering the kernel, asking it to carve out more VM space and to update the memory-management data structures, is a comparatively expensive operation, whereas it costs very little to keep the storage around. (Releasing the pages gains you nothing, especially if you turn right around and have to ask for them again!) So you just do it once.
If you stop using those pages, they'll eventually get swapped out, and the physical resource (page frames) will automagically get used for other purposes. "No harm, no foul." But if you suddenly start using that storage again, there's no reason to "pester the kernel" a second (or third) time. The pages just get swapped in again, and off you go.
malloc/valloc (the page-aligned variant of malloc) actually hands out addresses from the virtual address space. These addresses are mapped to physical addresses by page tables that are specific to the particular process. Hence, in my opinion, all the kernel has to do in the case of [vm]alloc is:
1) attach an anonymous segment to the process;
2) associate a bunch of virtual address (heap area) entries with physical pages, of course on first use.
In the case of free, it just needs to disassociate the virtual memory entries from the physical pages. Note that since these are anonymous pages, it doesn't need to care where the data goes, whereas when mmapping a file it may need to write the data back to disk.
The physical pages are tracked and managed by the memory manager independently and are governed by caching principles (hot/cold, coloring, etc.). So there is no question of free trying to give memory back to the kernel, since all it got was a virtual address. It gives the virtual address back to the glibc library, which maintains chunks of virtual address space for use by the specific process.
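As a footnote to the title question, whether free() ends up calling munmap() depends on how the block was obtained in the first place. With glibc's default tuning (a sketch, assuming the default M_MMAP_THRESHOLD of 128 KiB), requests above the threshold are served by anonymous mmap() and are unmapped on free(), while small blocks such as the 4 KiB valloc() above come from the brk heap and are merely returned to the allocator's free lists:

#include <stdlib.h>
#include <malloc.h>

int main(void)
{
    void *small = valloc(4096);            /* page-aligned, from the brk heap       */
    void *large = malloc(4 * 1024 * 1024); /* above the threshold: anonymous mmap() */

    free(small);  /* no syscall: the chunk goes back to the allocator's pool */
    free(large);  /* strace shows a munmap() for this one                    */
    return 0;
}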

How to directly write to the frame buffer in windows driver

I am writing a driver that writes data directly to the frame buffer, so that I can show a secret message on the screen while applications in user space can't get at it. Below is my code that tries to write values to the frame buffer, but after I write them, the values I read back from the frame buffer are all 0.
I am puzzled; does anyone know the reason? Or does anyone know how to display a message on the screen in such a way that applications in user space can't get the content of the message? Thanks a lot!
#define FRAME_BUFFER_PHYSICAL_ADDRESS 0xA0000
#define BUFFER_SIZE 0x20000

void showMessage()
{
    int i;
    int *vAddr;
    PHYSICAL_ADDRESS pAddr;

    pAddr.QuadPart = FRAME_BUFFER_PHYSICAL_ADDRESS;
    vAddr = (int *)MmMapIoSpace(pAddr, BUFFER_SIZE, MmNonCached);
    KdPrint(("Virtual address is %p", vAddr));

    for (i = 0; i < BUFFER_SIZE / 4; i++)
    {
        vAddr[i] = 0x11223344;
    }

    for (i = 0; i < 0x80; i++)
    {
        KdPrint(("Value: %d", vAddr[i])); // output is all zeros
    }

    MmUnmapIoSpace(vAddr, BUFFER_SIZE);
}
You must map the shared memory during device startup. I assume that showMessage isn't called during startup. See more here.
Regarding displaying a message on the screen: it must involve user-space interaction, since the GUI is a user-space component. I suppose you could notify some GUI listener without involving other applications.
Memory-mapped IO isn't designed to act exactly like memory (retrieving data that is placed there in the same form it was stored). The writes into the 0xA0000+ range are writes into PORTS in the video device's IO space (from its perspective); so long as the appropriate writes result in the appropriate pixels lighting up, the video device has done its job from the perspective of people who write drivers for screen rendering (or old DOS code, where memory was a free-for-all without a user-space/kernel-space division). But such code never had a need to store data and later retrieve it from the video segment, so typical memory semantics would generally not have been implemented (a waste of hardware and effort). Here, others talk about it:
Magic number with MmMapIoSpace
