use hugepages in kernel driver - linux-kernel

I refer to below two links to use huge page in my linux driver:
Sequential access to hugepages in kernel driver
http://nuncaalaprimera.com/2014/using-hugepage-backed-buffers-in-linux-kernel-driver
Below is my code:
#define PAGE_SHIFT_2M 21
pages = vmalloc(nr_pages * sizeof(struct page*));
down_read(&current->mm->mmap_sem);
get_nr_pages = get_user_pages(current, current->mm, buffer_start, nr_pages,
1 /* Write enable */, 0 /* Force */, pages, NULL);
up_read(&current->mm->mmap_sem);
nid = page_to_nid(pages[0]); // Remap on the same NUMA node.
remapped_addr = vm_map_ram(pages, nr_pages, nid, PAGE_KERNEL);
printf("page pfn [0]=%lX, [1]=0x%lX, [2]=0x%lX\n",
page_to_pfn(pages[0]),
page_to_pfn(pages[1]),
page_to_pfn(pages[2]));
printf("page physical [0]=%lX, [1]=0x%lX, [2]=0x%lX\n",
page_to_pfn(pages[0])<<PAGE_SHIFT_2M,
page_to_pfn(pages[1])<<PAGE_SHIFT_2M,
page_to_pfn(pages[2])<<PAGE_SHIFT_2M);
printf("page logical addr [0]=%p, [1]=%p, [2]=%p\n",
__va(page_to_pfn(pages[0])<<PAGE_SHIFT_2M),
__va(page_to_pfn(pages[1])<<PAGE_SHIFT_2M),
__va(page_to_pfn(pages[2])<<PAGE_SHIFT_2M));
printf("page_address [0]=%p, [1]=%p, [2]=%p\n",
page_address(pages[0]),
page_address(pages[1]),
page_address(pages[2]));
Log print:
page pfn [0]=154A00, [1]=0x154A01, [2]=0x154A02
page physical [0]=2A940000000, [1]=0x2A940200000, [2]=0x2A940400000
page logical addr [0]=ffff8aa940000000, [1]=ffff8aa940200000, [2]=ffff8aa940400000
page_address [0]=ffff880154a00000, [1]=ffff880154a01000, [2]=ffff880154a02000
I have several questions:
1) I'm wondering whether vm_map_ram() can works with huge page. From kernel source code, I can see vm_map_ram() use PAGE_SIZE and PAGE_SHIFT, which's value should for default 4KB page size.
In my case, after write to the virtual address returned from vm_map_ram(), I encounter "BUG: unable to handle kernel paging request at XXXX" issue.
2) page_address return values for two pages are 0x1000(4KB) gap, not 2MB gap. Why is that?
3) Did I use right with "__va(page_to_pfn(pages[0])<
Thanks in advance!

Related

Memory loads experience different latency on the same core

I am trying to implement a cache based covert channel in C but noticed something weird. The physical address between the sender and the receiver is shared by using the mmap() call that maps to the same file with the MAP_SHARED option. Below is the code for the sender process which flushes an address from the cache to transmit a 1 and loads an address into the cache to transmit a 0. It also measures the latency of a load in both cases:
// computes latency of a load operation
static inline CYCLES load_latency(volatile void* p) {
CYCLES t1 = rdtscp();
load = *((int *)p);
CYCLES t2 = rdtscp();
return (t2-t1);
}
void send_bit(int one, void *addr) {
if(one) {
clflush((void *)addr);
load__latency = load_latency((void *)addr);
printf("load latency = %d.\n", load__latency);
clflush((void *)addr);
}
else {
x = *((int *)addr);
load__latency = load_latency((void *)addr);
printf("load latency = %d.\n", load__latency);
}
}
int main(int argc, char **argv) {
if(argc == 2)
{
bit = atoi(argv[1]);
}
// transmit bit
init_address(DEFAULT_FILE_NAME);
send_bit(bit, address);
return 0;
}
The load operation takes around 0 - 1000 cycles (during a cache-hit and a cache-miss) when issued by the same process.
The receiver program loads the same shared physical address and measures the latency during a cache-hit or a cache-miss, the code for which has been shown below:
int main(int argc, char **argv) {
init_address(DEFAULT_FILE_NAME);
rdtscp();
load__latency = load_latency((void *)address);
printf("load latency = %d\n", load__latency);
return 0;
}
(I ran the receiver manually after the sender process terminated)
However, the latency observed in this scenario is very much different as compared to the first case. The load operation takes around 5000-1000 cycles.
Both the processes have been pinned to the same core-id by using the taskset command. So if I'm not wrong, during a cache-hit, both processes will experience a load latency of the L1-cache on a cache-hit and DRAM on a cache-miss. Yet, these two processes experience a very different latency. What could be the reason for this observation, and how can I have both the processes experience the same amount of latency?
The initial access to an mmaped region will page-fault (lazy mapping/allocation by the kernel), unless you use mmap(MAP_POPULATE), or mlock, or touch some other cache line of the page first.
You're probably timing a page fault if you only do one time measurement per mmap, or per run of a whole program.
(Also you don't seem to be doing anything to warm up the CPU frequency, so once core cycle could be many reference cycles. Some of the time for an L3 miss is fixed in terms of memory clock cycles, but another part of it scales with core/uncore clock.)
Also note that unless you run the 2nd process immediately (e.g. from the same shell command), the OS will get a chance to put that core into a deep sleep. On Intel CPUs at least, that empties L1d and L2 so it can power them down in the deeper C states. Probably also the TLBs.
It's also strange that you you cast away volatile in load = *((int *)p);
Assigning the load result to a global(?) variable inside the timed region is also pretty questionable; that could also soft page fault. If so, RDTSCP will have to wait for it, because the store can't retire.
(But on TLB hit for the store, it doesn't have to wait for it commit to cache, since there's no mfence before rdtscp. A store can retire while the store is still in the store buffer. In fact it must retire before the store buffer entry is known to be non-speculative so it can commit.)

How to find dma_request_chan() failure reason details?

In an external kernel module, using DMA Engine, when calling dma_request_chan() returns an error pointer of value -19, i.e. ENODEV or "No such device".
Now, in the active device tree, I do find a dma-names entry with what I'm trying to get a channel for, so my suspicion is that something else deeper in the forest is already not found.
How do I find out what's wrong?
Background:
I have a Zynq MP Ultrascale+ board here, with an FPGA design which uses AXI VDMA block to provide one channel of data to be received on the Cortex A's Linux, where the data is written to DDR4 by the FPGA and to be read from Linux.
I found that there is a Xilinx DMA driver included in the kernel, in the Xilinx source repo anyway, currently kernel version 5.6.0.
And that that driver has no user space interface, such that an intermediate kernel driver is needed.
This is depicted, and they have an example here: Section "4 DMA Proxy Design". I modified the code in the dma-proxy.c of the zip file linked there such that it uses only the RX channel, i.e. also only tries to request it.
The code for that is here, to not make this post huge:
Modified dma-proxy.c at onlinegdb.com
Line 407 has the function create_channel(), which used to use dma_request_slave_channel() which ditches the error code of the function it wraps, so to see the error, I am using that one instead: dma_request_chan().
The function create_channel() is called in function dma_proxy_probe() # line 470 (the occurences before that are deactivated by compile switch).
So by way of this call, dma_request_chan() will be called with the parameters:
create_channel(pdev, &channels[RX_CHANNEL], "dma_proxy_rx", DMA_DEV_TO_MEM);
The Device Tree for my board has an added node for dma-proxy driver as is shown at the top of the dma-proxy.c
dma_proxy {
compatible ="xlnx,dma_proxy";
dmas = <&axi_dma_0 0>;
dma-names = "dma_proxy_rx";
};
The name "axi_dma_0" matches with the name in the axi DMA device tree node:
axi_dma_0: dma#a0000000 {
#dma-cells = <0x1>;
clock-names = "s_axi_lite_aclk", "m_axi_s2mm_aclk";
clocks = <0x3 0x47 0x3 0x47>;
compatible = "xlnx,axi-dma-7.1", "xlnx,axi-dma-1.00.a";
interrupt-names = "s2mm_introut";
interrupt-parent = <0x1d>;
interrupts = <0x0 0x2>;
reg = <0x0 0xa0000000 0x0 0x1000>;
xlnx,addrwidth = <0x28>;
xlnx,sg-length-width = <0x1a>;
phandle = <0x1e>;
dma-channel#a0000030 {
compatible = "xlnx,axi-dma-s2mm-channel";
dma-channels = <0x1>;
interrupts = <0x0 0x2>;
xlnx,datawidth = <0x40>;
xlnx,device-id = <0x0>;
};
If I now look here:
% cat /proc/device-tree/dma_proxy/dma-names
dma_proxy_rx
Looks like my dma_proxy_rx, that I'm trying to request the channel for, is in there.
Edit:
In the boot log, I see this:
xilinx-vdma a0000000.dma: Please ensure that IP supports buffer length > 23 bits
irq: no irq domain found for interrupt-controller#a0010000 !
xilinx-vdma a0000000.dma: unable to request IRQ 0
xilinx-vdma a0000000.dma: WARN: Device release is not defined so it is not safe to unbind this driver while in use
xilinx-vdma a0000000.dma: Xilinx AXI DMA Engine Driver Probed!!
There are warnings - but in the end, the Xilinx AXI DMA Engine got "probed", meaning the lowest level driver loaded and is ready, right?
So it looks to me like there should be my device, but the kernel disagrees.
I've got the same problem with similar configuration. After digging a lot of kernel source code (especially drivers/dma/xilinx/xilinx_dma.c) I've solved this problem by changing channel number in dmas parameter from 0 to 1 in dma-proxy device tree entry like this:
dma_proxy {
compatible ="xlnx,dma_proxy";
dmas = <&axi_dma_0 1>;
dma-names = "dma_proxy_rx";
};
It seems that dma-proxy example is written for AXI DMA block with both mm2s (channel #0) and s2mm (channel #1) channels. And if we remove mm2s channel from AXI DMA block, the s2mm channel stays #1.

Accessing 0xCxxxxxxx guest kernel pointers within qemu-system-mips

In my QEMU-based project (system emulation) I analyse various kernel structures of the guest Linux. To read the guest virtual memory I use cpu_memory_rw_debug() function.
In particular, I search struct module linked list in the kernel memory using some kind of heuristics.
Lest assume that the relevant part of an element in this list looks like this:
--------------------- ---------------------
| prev = 0xc1231234 | | prev = 0xc5675678 |
--------------------- ---------------------
| next = 0xc1122334 | | next = 0xc5566778 |
--------------------- ---------------------
| etc. | | etc. |
--------------------- ---------------------
When QEMU emulates x86 or ARM, prev/next pointers can be accessed by cpu_memory_rw_debug() and they actually point to previous/next list elements.
However, when QEMU emulates MIPS, I observe the following strange behavior: while prev/next pointers look like a valid kernel pointers in every element in the list, I cannot access their pointees by means of cpu_memory_rw_debug(), because finding the corresponding physical address fails: the access permissions are ok, the virtual CPU is in kernel mode, but tlb->map_address() fails.
Since I can't walk through the linked list, I tried to find the elements one by one - just to see what their prev/next pointers look like - and I actually found all the elements, but all of them reside at 0xAxxxxxxx addresses, not 0xCxxxxxxx, as prev/next imply.
The function r4k_map_address(), which performs physical address lookup looks like this (only the relevant excerpt):
#define KSEG0_BASE 0x80000000UL
#define KSEG1_BASE 0xA0000000UL
#define KSEG2_BASE 0xC0000000UL
#define KSEG3_BASE 0xE0000000UL
//..............
if (address < (int32_t)KSEG1_BASE) {
/* kseg0 */
if (kernel_mode) {
*physical = address - (int32_t)KSEG0_BASE;
*prot = PAGE_READ | PAGE_WRITE;
} else {
ret = TLBRET_BADADDR;
}
} else if (address < (int32_t)KSEG2_BASE) {
/* kseg1 */
if (kernel_mode) {
*physical = address - (int32_t)KSEG1_BASE;
*prot = PAGE_READ | PAGE_WRITE;
} else {
ret = TLBRET_BADADDR;
}
} else if (address < (int32_t)KSEG3_BASE) {
/* sseg (kseg2) */
if (supervisor_mode || kernel_mode) {
ret = env->tlb->map_address(env, physical, prot, real_address, rw, access_type);
} else {
ret = TLBRET_BADADDR;
}
That is, on MIPS 0xC0000000...0xE0000000 range is mapped differently from lower kernel ranges.
If I replace the TLB access with *physical = address - (int32_t)KSEG1_BASE direct mapping, I get the things working, but certainly that's not the solution.
Does it look like QEMU-related issue or a MIPS-related one? I'd appreciate any idea or debugging direction.
The bottom line is that cpu_memory_rw_debug() doesn't work reliably in qemu-system-mips.
The reason is that QEMU emulates MIPS software-managed TLB. With this approach, whenever virtual->physical address mapping does not exist in the TLB cache, QEMU emulates "TLB-miss" exception, which should be handled by the OS. It is OS responsibility to walk through the page directory and fill the TLB -- QEMU (just like real MIPS) won't do that.
While this approach works for the guest code, it results in inability
to read guest virtual memory using cpu_memory_rw_debug() - it
doesn't work reliably for mapped segments.
As for the question why kernel structs that actually reside in KSEG2 where observed in KSEG1 - that's just because some virtual ranges of KSEG1 and KSEG2 correspond to the same physical pages.

Wrong result from GlobalMemoryStatusEx() on Windows

I'm trying to get total system memory using GlobalMemoryStatusEx():
MEMORYSTATUSEX memory;
GlobalMemoryStatusEx(&memory);
#define PRINT(v) {printf("%s ~%.3fGB\n", (#v), ((double)v)/(1024.*1024.*1024.));}
PRINT(memory.ullAvailPhys);
PRINT(memory.ullTotalPhys);
PRINT(memory.ullTotalVirtual);
PRINT(memory.ullAvailPageFile);
PRINT(memory.ullTotalPageFile);
#undef PRINT
fflush(stdout);
But the result is very weired and not understandable.
memory.ullAvailPhys ~1.002GB
memory.ullTotalPhys ~1.002GB
memory.ullTotalVirtual ~0.154GB
memory.ullAvailPageFile ~0.002GB
memory.ullTotalPageFile ~1.002GB
My total physical memory is 8GB but non of result is close it. All values are much smaller.
Also, the 'total' values keep changing whenever I execute. For instance, another result is here:
memory.ullAvailPhys ~0.979GB
memory.ullTotalPhys ~0.979GB
memory.ullTotalVirtual ~0.154GB
memory.ullAvailPageFile ~0.002GB
memory.ullTotalPageFile ~0.979GB
What am I doing wrong?
This is the part you are missing:
MEMORYSTATUSEX memory = { sizeof memory };
MSDN:
dwLength
The size of the structure, in bytes. You must set this member before calling GlobalMemoryStatusEx.
If you checked value returned by GlobalMemoryStatusEx, you could see the problem by getting error indication there.

msync equivalent in Windows

what is the equivalent of msync [unix sys call] in windows? I am looking for MSDN api in c,C++ space.
More info on msync can be found at http://opengroup.org/onlinepubs/007908799/xsh/msync.html
FlushViewOfFile
Checkout the Python 2.6 mmapmodule.c for an example of FlushViewOfFile and msync in use:
/*
/ Author: Sam Rushing <rushing#nightmare.com>
/ Hacked for Unix by AMK
/ $Id: mmapmodule.c 65859 2008-08-19 17:47:13Z thomas.heller $
/ Modified to support mmap with offset - to map a 'window' of a file
/ Author: Yotam Medini yotamm#mellanox.co.il
/
/ mmapmodule.cpp -- map a view of a file into memory
/
/ todo: need permission flags, perhaps a 'chsize' analog
/ not all functions check range yet!!!
/
/
/ This version of mmapmodule.c has been changed significantly
/ from the original mmapfile.c on which it was based.
/ The original version of mmapfile is maintained by Sam at
/ ftp://squirl.nightmare.com/pub/python/python-ext.
*/
static PyObject *
mmap_flush_method(mmap_object *self, PyObject *args)
{
Py_ssize_t offset = 0;
Py_ssize_t size = self->size;
CHECK_VALID(NULL);
if (!PyArg_ParseTuple(args, "|nn:flush", &offset, &size))
return NULL;
if ((size_t)(offset + size) > self->size) {
PyErr_SetString(PyExc_ValueError, "flush values out of range");
return NULL;
}
#ifdef MS_WINDOWS
return PyInt_FromLong((long) FlushViewOfFile(self->data+offset, size));
#elif defined(UNIX)
/* XXX semantics of return value? */
/* XXX flags for msync? */
if (-1 == msync(self->data + offset, size, MS_SYNC)) {
PyErr_SetFromErrno(mmap_module_error);
return NULL;
}
return PyInt_FromLong(0);
#else
PyErr_SetString(PyExc_ValueError, "flush not supported on this system");
return NULL;
#endif
}
UPDATE:
I don't think you are going to find complete parity in the win32 mapped file APIs. The FlushViewOfFile API doesn't have a synchronous flavor (probably because of the possible impact of the cache manager). If precise control over when data is written to disk is required perhaps you can use the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags with the CreateFile API when you create the handle to your mapped file?
Windows equivalent for flushing all filemapping is
void FlushToHardDrive(LPVOID fileMapAddress,HANDLE hFile)
{
FlushViewOfFile(fileMappAddress,0); //Async flush of dirty pages
FlushFileBuffers(hFiles); // flush metadata and wait
}
And for flushing part of filemapping
void FlushToHardDrive(LPVOID address,DWORD size, HANDLE hFile)
{
FlushViewOfFile(address,size); //Async flush of region
FlushFileBuffers(hFiles); // flush metadata and wait
}
This is described in MSDN here
Flag FILE_FLAG_NO_BUFFERING actually do nothing for memory mapped files (described here), also even file handle is created with these flags, metadata of file can be cached and not flushed, so FlushFileBuffers is always required for IO and MM, if you want to be completely sure that all data (including file access time) is saved. This behavior is described here
P.S. Some real world example: SQLite use MM-files almost only for reading, so when you are using MM-files for write/update you need understand all side effects for this scenario
I suspect FlushViewOfFile actually is the right thing. When I read the man page for msync, I would not assume that it is actually flushing the disk cache (the cache in the disk unit, as opposed to the system cache in main memory).
FlushViewOfFile won't return until the disk stack has completed the writes; like the msync documentation, it says nothing about what happens in the disk cache. We should probably take a look at making that more clear in the documentation.

Resources