Accessing 0xCxxxxxxx guest kernel pointers within qemu-system-mips - linux-kernel

In my QEMU-based project (system emulation) I analyse various kernel structures of the guest Linux. To read the guest virtual memory I use cpu_memory_rw_debug() function.
In particular, I search struct module linked list in the kernel memory using some kind of heuristics.
Lest assume that the relevant part of an element in this list looks like this:
--------------------- ---------------------
| prev = 0xc1231234 | | prev = 0xc5675678 |
--------------------- ---------------------
| next = 0xc1122334 | | next = 0xc5566778 |
--------------------- ---------------------
| etc. | | etc. |
--------------------- ---------------------
When QEMU emulates x86 or ARM, prev/next pointers can be accessed by cpu_memory_rw_debug() and they actually point to previous/next list elements.
However, when QEMU emulates MIPS, I observe the following strange behavior: while prev/next pointers look like a valid kernel pointers in every element in the list, I cannot access their pointees by means of cpu_memory_rw_debug(), because finding the corresponding physical address fails: the access permissions are ok, the virtual CPU is in kernel mode, but tlb->map_address() fails.
Since I can't walk through the linked list, I tried to find the elements one by one - just to see what their prev/next pointers look like - and I actually found all the elements, but all of them reside at 0xAxxxxxxx addresses, not 0xCxxxxxxx, as prev/next imply.
The function r4k_map_address(), which performs physical address lookup looks like this (only the relevant excerpt):
#define KSEG0_BASE 0x80000000UL
#define KSEG1_BASE 0xA0000000UL
#define KSEG2_BASE 0xC0000000UL
#define KSEG3_BASE 0xE0000000UL
//..............
if (address < (int32_t)KSEG1_BASE) {
/* kseg0 */
if (kernel_mode) {
*physical = address - (int32_t)KSEG0_BASE;
*prot = PAGE_READ | PAGE_WRITE;
} else {
ret = TLBRET_BADADDR;
}
} else if (address < (int32_t)KSEG2_BASE) {
/* kseg1 */
if (kernel_mode) {
*physical = address - (int32_t)KSEG1_BASE;
*prot = PAGE_READ | PAGE_WRITE;
} else {
ret = TLBRET_BADADDR;
}
} else if (address < (int32_t)KSEG3_BASE) {
/* sseg (kseg2) */
if (supervisor_mode || kernel_mode) {
ret = env->tlb->map_address(env, physical, prot, real_address, rw, access_type);
} else {
ret = TLBRET_BADADDR;
}
That is, on MIPS 0xC0000000...0xE0000000 range is mapped differently from lower kernel ranges.
If I replace the TLB access with *physical = address - (int32_t)KSEG1_BASE direct mapping, I get the things working, but certainly that's not the solution.
Does it look like QEMU-related issue or a MIPS-related one? I'd appreciate any idea or debugging direction.

The bottom line is that cpu_memory_rw_debug() doesn't work reliably in qemu-system-mips.
The reason is that QEMU emulates MIPS software-managed TLB. With this approach, whenever virtual->physical address mapping does not exist in the TLB cache, QEMU emulates "TLB-miss" exception, which should be handled by the OS. It is OS responsibility to walk through the page directory and fill the TLB -- QEMU (just like real MIPS) won't do that.
While this approach works for the guest code, it results in inability
to read guest virtual memory using cpu_memory_rw_debug() - it
doesn't work reliably for mapped segments.
As for the question why kernel structs that actually reside in KSEG2 where observed in KSEG1 - that's just because some virtual ranges of KSEG1 and KSEG2 correspond to the same physical pages.

Related

How to find dma_request_chan() failure reason details?

In an external kernel module, using DMA Engine, when calling dma_request_chan() returns an error pointer of value -19, i.e. ENODEV or "No such device".
Now, in the active device tree, I do find a dma-names entry with what I'm trying to get a channel for, so my suspicion is that something else deeper in the forest is already not found.
How do I find out what's wrong?
Background:
I have a Zynq MP Ultrascale+ board here, with an FPGA design which uses AXI VDMA block to provide one channel of data to be received on the Cortex A's Linux, where the data is written to DDR4 by the FPGA and to be read from Linux.
I found that there is a Xilinx DMA driver included in the kernel, in the Xilinx source repo anyway, currently kernel version 5.6.0.
And that that driver has no user space interface, such that an intermediate kernel driver is needed.
This is depicted, and they have an example here: Section "4 DMA Proxy Design". I modified the code in the dma-proxy.c of the zip file linked there such that it uses only the RX channel, i.e. also only tries to request it.
The code for that is here, to not make this post huge:
Modified dma-proxy.c at onlinegdb.com
Line 407 has the function create_channel(), which used to use dma_request_slave_channel() which ditches the error code of the function it wraps, so to see the error, I am using that one instead: dma_request_chan().
The function create_channel() is called in function dma_proxy_probe() # line 470 (the occurences before that are deactivated by compile switch).
So by way of this call, dma_request_chan() will be called with the parameters:
create_channel(pdev, &channels[RX_CHANNEL], "dma_proxy_rx", DMA_DEV_TO_MEM);
The Device Tree for my board has an added node for dma-proxy driver as is shown at the top of the dma-proxy.c
dma_proxy {
compatible ="xlnx,dma_proxy";
dmas = <&axi_dma_0 0>;
dma-names = "dma_proxy_rx";
};
The name "axi_dma_0" matches with the name in the axi DMA device tree node:
axi_dma_0: dma#a0000000 {
#dma-cells = <0x1>;
clock-names = "s_axi_lite_aclk", "m_axi_s2mm_aclk";
clocks = <0x3 0x47 0x3 0x47>;
compatible = "xlnx,axi-dma-7.1", "xlnx,axi-dma-1.00.a";
interrupt-names = "s2mm_introut";
interrupt-parent = <0x1d>;
interrupts = <0x0 0x2>;
reg = <0x0 0xa0000000 0x0 0x1000>;
xlnx,addrwidth = <0x28>;
xlnx,sg-length-width = <0x1a>;
phandle = <0x1e>;
dma-channel#a0000030 {
compatible = "xlnx,axi-dma-s2mm-channel";
dma-channels = <0x1>;
interrupts = <0x0 0x2>;
xlnx,datawidth = <0x40>;
xlnx,device-id = <0x0>;
};
If I now look here:
% cat /proc/device-tree/dma_proxy/dma-names
dma_proxy_rx
Looks like my dma_proxy_rx, that I'm trying to request the channel for, is in there.
Edit:
In the boot log, I see this:
xilinx-vdma a0000000.dma: Please ensure that IP supports buffer length > 23 bits
irq: no irq domain found for interrupt-controller#a0010000 !
xilinx-vdma a0000000.dma: unable to request IRQ 0
xilinx-vdma a0000000.dma: WARN: Device release is not defined so it is not safe to unbind this driver while in use
xilinx-vdma a0000000.dma: Xilinx AXI DMA Engine Driver Probed!!
There are warnings - but in the end, the Xilinx AXI DMA Engine got "probed", meaning the lowest level driver loaded and is ready, right?
So it looks to me like there should be my device, but the kernel disagrees.
I've got the same problem with similar configuration. After digging a lot of kernel source code (especially drivers/dma/xilinx/xilinx_dma.c) I've solved this problem by changing channel number in dmas parameter from 0 to 1 in dma-proxy device tree entry like this:
dma_proxy {
compatible ="xlnx,dma_proxy";
dmas = <&axi_dma_0 1>;
dma-names = "dma_proxy_rx";
};
It seems that dma-proxy example is written for AXI DMA block with both mm2s (channel #0) and s2mm (channel #1) channels. And if we remove mm2s channel from AXI DMA block, the s2mm channel stays #1.

use hugepages in kernel driver

I refer to below two links to use huge page in my linux driver:
Sequential access to hugepages in kernel driver
http://nuncaalaprimera.com/2014/using-hugepage-backed-buffers-in-linux-kernel-driver
Below is my code:
#define PAGE_SHIFT_2M 21
pages = vmalloc(nr_pages * sizeof(struct page*));
down_read(&current->mm->mmap_sem);
get_nr_pages = get_user_pages(current, current->mm, buffer_start, nr_pages,
1 /* Write enable */, 0 /* Force */, pages, NULL);
up_read(&current->mm->mmap_sem);
nid = page_to_nid(pages[0]); // Remap on the same NUMA node.
remapped_addr = vm_map_ram(pages, nr_pages, nid, PAGE_KERNEL);
printf("page pfn [0]=%lX, [1]=0x%lX, [2]=0x%lX\n",
page_to_pfn(pages[0]),
page_to_pfn(pages[1]),
page_to_pfn(pages[2]));
printf("page physical [0]=%lX, [1]=0x%lX, [2]=0x%lX\n",
page_to_pfn(pages[0])<<PAGE_SHIFT_2M,
page_to_pfn(pages[1])<<PAGE_SHIFT_2M,
page_to_pfn(pages[2])<<PAGE_SHIFT_2M);
printf("page logical addr [0]=%p, [1]=%p, [2]=%p\n",
__va(page_to_pfn(pages[0])<<PAGE_SHIFT_2M),
__va(page_to_pfn(pages[1])<<PAGE_SHIFT_2M),
__va(page_to_pfn(pages[2])<<PAGE_SHIFT_2M));
printf("page_address [0]=%p, [1]=%p, [2]=%p\n",
page_address(pages[0]),
page_address(pages[1]),
page_address(pages[2]));
Log print:
page pfn [0]=154A00, [1]=0x154A01, [2]=0x154A02
page physical [0]=2A940000000, [1]=0x2A940200000, [2]=0x2A940400000
page logical addr [0]=ffff8aa940000000, [1]=ffff8aa940200000, [2]=ffff8aa940400000
page_address [0]=ffff880154a00000, [1]=ffff880154a01000, [2]=ffff880154a02000
I have several questions:
1) I'm wondering whether vm_map_ram() can works with huge page. From kernel source code, I can see vm_map_ram() use PAGE_SIZE and PAGE_SHIFT, which's value should for default 4KB page size.
In my case, after write to the virtual address returned from vm_map_ram(), I encounter "BUG: unable to handle kernel paging request at XXXX" issue.
2) page_address return values for two pages are 0x1000(4KB) gap, not 2MB gap. Why is that?
3) Did I use right with "__va(page_to_pfn(pages[0])<
Thanks in advance!

Running a kernel with CUDAfy .NET using OpenCL asynchronously

I am trying to make asynchronous kernel calls to my GPGPU using CUDAfy .NET.
When I pass values to the kernel and copy them back to the host, I do not always get the value I expect.
I have a structure Foo with a byte Bar:
[Cudafy]
public struct Foo {
public byte Bar;
}
And I have a kernel I want to call:
[Cudafy]
public static void simulation(GThread thread, Foo[] f)
{
f[0].Bar = 3;
thread.SyncThreads();
}
I have a single thread with streamID = 1 (I tried using multiple threads, and noticed the issue. Reducing to a single thread didn't seem to fix the issue though).
//allocate
streamID = 1;
count = 1;
gpu.CreateStream(streamID);
Foo[] sF = new Foo[count];
IntPtr hF = gpu.HostAllocate<Foo>(count);
Foo[] dF = gpu.Allocate<Foo>(sF);
while (true)
{
//set value
sF[0].Bar = 1;
byte begin = sF[0].Bar;
//host -> pinned
GPGPU.CopyOnHost<Foo>(sF, 0, hF, 0, count);
sF[0].Bar = 2;
lock (gpu)
{
//pinned -> device
gpu.CopyToDeviceAsync<Foo>(hF, 0, dF, 0, count, streamID);
//run
gpu.Launch().simulation(dF);
//device -> pinned
gpu.CopyFromDeviceAsync<Foo>(dF, 0, hF, 0, count, streamID);
}
//WAIT
gpu.SynchronizeStream(streamID);
//pinned -> host
GPGPU.CopyOnHost<Foo>(hF, 0, sF, 0, count);
byte end = sF[0].Bar;
}
//de-allocate
gpu.Free(dF);
gpu.HostFree(hF);
gpu.DestroyStream(streamID);
First I create a stream on the GPU.
I am creating a regular structure Foo array of size 1 (sF) and setting it's Bar value to 1. Then I create pinned memory on the host (hF) for Foo as well. I also create memory on the device for Foo (dF).
I initialize the structure's Bar value to 1 then I copy it to the pinned memory (As a check, I set the value to 2 for the structure after copying to pinned, you'll see why later). Then I use a lock to ensure I have full access to the GPU and I queue a copy to dF, a run for the kernel, and a copy from dF. At this point I don't know when this will all actually run on the GPU... so I can call SynchronizeStream to wait on the host until the device is done.
When it's done, I can copy the pinned memory (hF) to the shared memory (sF). When I get the value, it's usually a 3 (which was set on the device) or a 1 (which means either the value wasn't set in the kernel, or the new value wasn't copied to the pinned memory). I do know that the pinned memory is copied to the structure because the structure never has the value of 2.
Over many runs, a small percentage is runs results in something other than begin=1 and end=3. It would always be begin=1, end=1 and it happens about 5-10% of the time.
I have no idea why this happens. I know it generally highlights a race condition, but by calling the sync calls, I would expect the async calls to work in a predictable fashion.
Why would I be encountering this kind of issue with this code?
Thank you so much!
-Phil
I just figured out the issue that was occurring. While the launch was being done asynchronously... I didn't include the stream for the launch.
Changing my launch to be:
gpu.Launch(gridsize,blocksize,streamID).simulation(dF);
resolved the problem. It seems that the launches were occurring on stream 0 and the stream 1 and 2 were being synced. So sometimes the data gets set, sometimes it doesn't. A race condition.

Linux kernel network device driver and skb pointers

I am writing a network device driver.
Kernel 2.6.35.12
The device is supposed to be working when it is connected to a bridge port.
I am trying to intercept ICMPv6 RA and NS messages (Router/ Neighbor solicitation) forwarded to the interface from the bridge.
eth <–> br0 <–> mydevice
In the device start_xmit function I am doing to following:
Check that the protocol field after the Ethernet header is IPV6 (0x86dd)
Check that the ipv6 next header is ICMPv6 and check its type:
__u8 nexthdr = ipv6_hdr(skb)->nexthdr;
if (nexthdr == htons (IPPROTO_ICMPV6))
{
struct icmp6hdr *hdr = icmp6_hdr(skb);
u8 type = hdr->icmp6_type;
if(type == htons (NDISC_NEIGHBOUR_SOLICITATION) || type == htons (NDISC_ROUTER_SOLICITATION))
{
….Do something here…
}
}
When RS/NS are sent from within the device (e.g br0), I see that the code is working right.
The problem is when traffic is forwarded through the bridge from the other port.
I see that the icmp6_hdr(skb) returns an incorrect header.
Debugging some more, it seems that the
skb->network_header and the skb->transport_header are pointing to the same place.
icmp6_hdr is using the transport_header which explain why it is incorrect.
Dumping the skb data it looks that all the headers and payload are at the right offset (also compared it with tcpdump)
I suspect that it might be related to the bridge code, before going to dive into it,
I thought that maybe anyone had come up against anything similar or have any other ideas?
Part of the problem is that you are assuming that Netfilter did anything more than just figure out what was the next header. In my experience (albeit not very long) you want to do something like this:
struct icmp6hdr *icmp6;
// Obviously don't do this unless you check to make sure that it's the right protocol
struct ipv6_hdr *ip6hdr = (struct ipv6_hdr*)skb->network_header;
// You need to move the headers around
// Notice the memory address of skb->data and skb->network_header are the same
// that means that the IP header hasn't been "pulled"
skb->transport_header = skb_pull(skb, sizeof(struct ipv6_hdr));
if(ntohs(ip6hdr->nexthdr) == IPPROTO_ICMPV6) {
icmp6 = (struct icmp6hdr*)skb->transport_header;
// Doing this is more efficient, since you only are calling the
// Network to Host function once
__u8 type = ntohs(hdr->icmp6_type);
switch(type) {
case NDISC_NEIGHBOUR_SOLICITATION:
case NDISC_ROUTER_SOLICITATION:
// Do your stuff
break;
}
}
Hopefully this was helpful. I just started diving into writing Netfilter code, so I am not exactly certain 100%, but I found this out when I was trying to do something similar with IPv4 on the NF_IP_LOCAL_IN hook.

pci_driver.probe function not called so pci_device_id wrong?

I am moving my first steps into Linux Kernel Device Driver development.
I learnt that for pci-e cards I have to call pci_register_driver providing information via an object of type pci_driver ( below an example ).
When I load my module ( via insmod ) If the information passed via .id_table is found than the .probe function is called.
As I am now I cannot see my .probe function called at all ( I added some logging via printk ) so I must assume that the information contained in pci_device_id must be wrong, right?
Is there any way to retrieve this information directly from the hardware itself?
Once I plug my PCI-E card on my Linux box, where I can find all information about it?
Maybe reading BIOS or some file in sys?
Any help is appreciated.
AFG
static struct pci_driver my_driver = {
// other here
.id_table = pci_datatable,
.probe = driver_add
//
};
static struct pci_device_id pci_datatable[] __devinitdata =
{
{ VendorID, PciExp_0041, PCI_ANY_ID, PCI_ANY_ID },
{ 0 },
};
int __devinit DmaDriverAdd(
struct pci_dev * pPciDev,
const struct pci_device_id * pPciEntry
)
{
// my stuff!
}
While the accepted answer does indeed answer the question, I want to elaborate a bit about the probe function not being called.
According to the Documentation/PCI/pci.txt (How To Write Linux PCI Drivers) the probing function is called for all existing PCI devices that are not owned by the other drivers yet. So, even if you have the correct vendor and device IDs you will not see the function being called if the device is owned by another driver.
To see which drivers own which devices run:
lspci -knn
If you temporarily change both vendor ID and device ID to PCI_ANY_ID your probe function will be called for every available (i.e. not owned) device.
The command you want is lspci.
With no arguments it will give you a list of all PCI devices, eg:
$ lspci
00:00.0 Host bridge: Intel Corporation 2nd Generation Core Processor Family DRAM Controller (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family
03:00.0 Network controller: Intel Corporation Centrino Advanced-N 6205 (rev 34)
...
Then to get the ids, use:
$ lspci -v -n -s 03:00.0
03:00.0 0280: 8086:0085 (rev 34)
Subsystem: 8086:1311
Flags: bus master, fast devsel, latency 0, IRQ 52
You can also find the same information in /sys:
$ cd /sys/bus/pci/devices/0000:03:00.0
$ cat vendor device
0x8086
0x0085
$ cat subsystem_vendor subsystem_device
0x8086
0x1311

Resources