Intel VT-x: Configuring debug registers to debug from host - debugging

I am trying to configure the debug registers in the host so that I can monitor an address of a guest running on Intel VT-x. For this I have called KVM_SET_GUEST_DEBUG IOCTL.
struct kvm_guest_debug guest_debug;
guest_debug.control = KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_USE_HW_BP;
guest_debug.arch.debugreg[0] = addr; // DR0
guest_debug.arch.debugreg[7] = encode_dr7(0, len, bpType);
if (ioctl(vcpu_fd, KVM_SET_GUEST_DEBUG, &guest_debug) < 0)
return false;
It successfully sets up the debug register. But upon debug register read/write, it causes VM_EXIT with EXIT_REASON_EXCEPTION_NMI. Although I am expecting to have EXIT_REASON_DR_ACCESS. What is the reason that causes an nmi exit instead of DR_ACCESS exit? Did I set the registers right?

Related

cgroup skb egress ebpf hook - skb->cb[0-4] updated values are reset in tc egress ebpf hook

I am just updating the skb->cb with constant values (0xfeed) and trying to get the same packet data at egress tc layer ebpf hook. It is all zero. Am i missing something here?
Is there anyway to send meta data between hooks with skb itself (LRU_HASH map may help but having in-band metadata will be faster, I think).
SEC("cgroup/skb")
int cgrp_dump_pkt(struct __sk_buff *skb) {
void *data_end = (void *) (long) skb->data_end;
void *data = (void *) (long) skb->data;
if(data < data_end)
{
skb->cb[0] = 0xfeed;
skb->cb[1] = 0xfeed;
skb->cb[2] = 0xfeed;
skb->cb[3] = 0xfeed;
skb->cb[4] = 0xfeed;
bpf_printk("hash:%x mark:%x pri:%x", skb->hash, skb->mark, skb->priority);
bpf_printk("cb0:%x cb1:%x cb2:%x", skb->cb[0], skb->cb[1], skb->cb[2]);
bpf_printk("cb3:%x cb4:%x", skb->cb[3], skb->cb[4]);
}
return 1;
}
The above hook is attached to a docker instance cgroup and I am trying to see the same packets by attaching the hook in tc egress (will just print the packet info includes cb[0-4]). cb[0-4] values are shown as 0. Is it an expected behavior? How to pass the metadata between ebpf hook from cgroup skb to tc egress or lower layers?
I've only ever used the cb fields to pass data across BPF tail calls. To pass data from one hook point to another in the stack, you could use skb->mark (32-bit value).
Do make sure that no other software is using the same bits you want to use though. See https://github.com/fwmark/registry for example.

MMAP buffer kernel writes are not seen by user space

i have a kernel driver which shares a buffer with the user space layer.
Everything seemed to work fine in my VM prototype (Ubuntu, Kernel 5.4) but when i moved my code to the target (same kernel but this is an embedded distro) I can clearly see that Kernel writes to the buffer (using memcpy, or memset) are not reflected in the User space side of the buffer.
Note that, i use direct buffer accesses on both sides. There is no concurrency issue, as the Kernel writes to, then the user space reads from.
I ended up believing this is a cache issue ... as the same code works perfectly in my VM.
The buffer size is 4 * PAGE_SIZE.
It is allocated as follows:
int _size = (SFP_BUFFER_SIZE + (PAGE_SIZE-1)) & ~(PAGE_SIZE-1);
input_buffer = (char*) kzalloc (_size, GFP_KERNEL); // aligned on page boundary
if (!input_buffer) {
dev_dbg(&dev, "open/ENOMEM (input_buffer)\n");
status = -ENOMEM;
goto err_all
When mmap'ing, i used the following code pattern:
vma->vm_ops = &fpgadrv_vm_ops;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
pfn = virt_to_phys((void*)(input_buffer)) >> PAGE_SHIFT;
if (remap_pfn_range (vma, vma->vm_start, pfn, size, vma->vm_page_prot))
{
printk(KERN_DEBUG "remap page range failed\n");
return -EAGAIN;
}
User space code, and kernel code user memcpy to update the buffer. Note also that I cannot use write/read entry points, as they are already used for very specific operations.
The user code is calling mmap as follows:
buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, device_fd, 0);
if (buf == MAP_FAILED)
{
perror("USERDRV:cannot mmap");
return -1; // for testing, ignore the return code and continue
}
and upon IOCTL call, the kernel would fill up the mmap buffer as follows:
case IOCTL_RESET:
printk(KERN_DEBUG "FPGADRV: IOCTL RESET");
// reset the buffer (zero + put back the signature)
memset(input_buffer, 0xA5, SFP_BUFFER_SIZE);
memcpy((void*)(input_buffer), (void*)signature, 10);
break;
Is there something more i should do to make sure the pages are not cached (assuming this is the cause of my pb) ?
Thanks,
Jacques

Linux kernel Hardware breakpoint registration failed with EACCES

I am using below code in my driver, but the register_wide_hw_breakpoint returns an EACESS error.
hw_breakpoint_init(&attr);
attr.bp_addr = kallsyms_lookup_name(ksym_name);
attr.bp_len = HW_BREAKPOINT_LEN_4;
attr.bp_type = HW_BREAKPOINT_W | HW_BREAKPOINT_R;
sample_hbp = register_wide_hw_breakpoint(&attr, sample_hbp_handler, NULL);
if (IS_ERR((void __force *)sample_hbp)) {
ret = PTR_ERR((void __force *)sample_hbp);
printk(KERN_INFO "Breakpoint registration failed:%d\n",ret);
}
What could be possible reasons for this? Am I missing a CONFIG option which will grant access to hardware registers?
Additional info: My platform is X86_64, Linux
Please help.

Interrupt performance on linux kernel with RT patches - should be better?

I have bumped into a bit inconsistent IRQ/ISR performance on Freescales imx.233 running linux kernel (3.8.13) with CONFIG_PREEMPT_RT patches.
I am little bit surprised why this processor (ARM9, 454mhz) is unable to keep up even with 74kHz IRQ requests.. ?
In my kernel config I have set following flags:
CONFIG_TINY_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_RT_BASE=y
CONFIG_HAVE_PREEMPT_LAZY=y
CONFIG_PREEMPT_LAZY=y
CONFIG_PREEMPT_RT_FULL=y
CONFIG_PREEMPT_COUNT=y
CONFIG_DEBUG_PREEMPT=y
On the system there is basically nothing running (created by buildroot), and I set PWM to generate a pulse of 74kHz, that serves as interrupt.
Then in the ISR, I just trigger another GPIO output pin, and check the output.
What I find is that sometimes I miss an interrupt -
You can see the missed interrupt here:
And also the the triggering of output pin seems to be a bit inconsistent, the output pin is triggered usually within "5% window", that might still be acceptable. But I worry, that when I start implementing data transfer logic, instead of just triggering the pin, I might run into further problems...
My simple driver code looks like this:
#needed includes
uint16_t INPUT_IRQ = 39;
uint16_t OUTPUT_GPIO = 38;
struct test_device *device;
//Prototypes
void irqtest_exit(void);
int irqtest_init(void);
void free_device(void);
//Default functions
module_init(irqtest_init);
module_exit(irqtest_exit);
//triggering flag
uint16_t pulse = 0x1;
irqreturn_t irq_handle_function(int irq, void *device_id)
{
pulse = !pulse;
gpio_set_value(OUTPUT_GPIO, pulse);
return IRQ_HANDLED;
}
struct test_device {
int huuhaa;
};
void free_device() {
if (device)
kfree(device);
}
int irqtest_init(void) {
int result = 0;
device = kmalloc(sizeof *device, GFP_KERNEL);
device->huuhaa = 10;
printk("IRB/irqtest_init: Inserting IRQ module\n");
printk("IRB/irqtest_init: Requesting GPIO (%d)\n", INPUT_IRQ);
result = gpio_request_one(INPUT_IRQ, GPIOF_IN, "PWM input");
if (result != 0) {
free_device();
printk("IRB/irqtest_init: Failed to set GPIO (%d) as input.. exiting\n", INPUT_IRQ);
return -EINVAL;
}
result = gpio_request_one(OUTPUT_GPIO, GPIOF_OUT_INIT_LOW , "IR OUTPUT");
if (result != 0) {
free_device();
printk("IRB/irqtest_init: Failed to set GPIO (%d) as output.. exiting\n", OUTPUT_GPIO);
return -EINVAL;
}
//Set our desired interrupt line as input
result = gpio_direction_input(INPUT_IRQ);
if (result != 0) {
printk("IRB/irqtest_init: Failed to set IRQ as input.. exiting\n");
free_device();
return -EINVAL;
}
//Set flags for our interrupt, guessing here..
irq_flags |= IRQF_NO_THREAD;
irq_flags |= IRQF_NOBALANCING;
irq_flags |= IRQF_TRIGGER_RISING;
irq_flags |= IRQF_NO_SOFTIRQ_CALL;
//register interrupt
result = request_irq(gpio_to_irq(INPUT_IRQ), irq_handle_function, irq_flags, "irq testing", device);
if (result != 0) {
printk("IRB/irqtest_init: Failed to reserve GPIO 38\n");
return -EINVAL;
}
printk("IRB/irqtest_init: insert success\n");
return 0;
}
void irqtest_exit(void) {
if (device)
kfree(device);
gpio_free(INPUT_IRQ);
gpio_free(OUTPUT_GPIO);
printk("IRB/irqtest_exit: Removing irqtest module\n");
}
int irqtest_open(struct inode *inode, struct file *filp) {return 0;}
int irqtest_release(struct inode *inode, struct file *filp) {return 0;}
In the system, I have following interrupts registered, after the driver is loaded:
# cat /proc/interrupts
CPU0
16: 36379 - MXS Timer Tick
17: 0 - mxs-spi
18: 2103 - mxs-dma
60: 0 gpio-mxs irq testing
118: 0 - mxs-spi
119: 0 - mxs-dma
120: 0 - RTC alarm
124: 0 - 8006c000.serial
127: 68050 - uart-pl011
128: 151 - ci13xxx_imx
Err: 0
I wonder if the flags I declare to my IRQ are good ? I noticed that with this configuration, I can no longer reach console, so kernel seems totally consumed with servicing this 74kHz trigger now.. this can't be right ?
I suppose it's not a big deal for me since this is only during data transfer, but still I feel I'm doing something wrong..
Also, I wonder if it would be more efficient to map the registers with ioremap, and trigger the output with direct memory writes ?
Is there some way I could increase the priority of the interrupt even higher ? Or could I somehow lock the kernel for the duration of the data transfer (~400ms), and generate somehow else my timing for the output ?
Edit: Forgot to add /proc/interrupts output to the question...
What you experience here is interrupt jitter. This is to be expected on Linux, because the kernel regularly disables the interrupts for various tasks (entering a spinlock, handling an interrupt, etc.).
This will happen, regardless wether you have PREEMPT_RT or not, so expecting to generate 74kHz signal with regular interrupts is pretty much unrealistic.
Now, ARM has higher priority interrupts called FIQs, that will never be masked or disabled.
Linux doesn't use FIQ, and is not built to deal with the fact that an FIQ could be used, so you won't be able to use the generic kernel framework.
From Linux driver development point of view however, it's not really different as long as you keep this in mind: you have to write a handler, and associate it to an IRQ. You'll also have to poke into the interrupt controller to make it generate a FIQ for the interrupt you want to use (the details on how to change it are platform-dependant. Some platforms have functions to do that (like imx25 and mxc_set_irq_fiq), some others don't. imx23/28 don't, so you'll have to do it by hand).
The only thing that the functions to setup a fiq handler only work with a assembly-written handler, so you'll have to rewrite your handler in assembly (with your current code, it should be trivial though).
You can grab additional details to the blog post Alexandre posted (http://free-electrons.com/blog/fiq-handlers-in-the-arm-linux-kernel/), where you'll find working code, samples, and explanations on how it all works together.
You can have a look at what my colleague Maxime Ripard did using an FIQ on a similar SoC (i.mx28) :
http://free-electrons.com/blog/fiq-handlers-in-the-arm-linux-kernel/
Try this flags:
int irq_flags;
...
irq_flags = IRQF_TRIGGER_RISING | IRQF_EARLY_RESUME
I had a kernel 3.8.11 and can't find IRQF_NO_SOFTIRQ_CALL define. It's only for 3.8.13?
Also I didn't see irq_flags define. Where is it?

Implementing PCIe Linux device driver (want to access my card registers from kernel driver)

I'm writing a device driver to access the memory in a FPGA on a PCIe card.
The card boots and is probed/found :-
/proc/iomem
80000000-840fffff : PCI Bus #03
80000000-83ffffff : 0000:03:00.0
84000000-840fffff : 0000:03:00.0
So reading ldd/etc I coded up a call to request_mem_region at the 80000000, and requested a pointer to it via ioremap_nocache
1) Do I need to request_mem_region as well as a ioremap_nocache, cant I use just the latter?
/proc/iomem After insmod my device driver :-
80000000-840fffff : PCI Bus #03
80000000-83ffffff : 0000:03:00.0
80000000-8003ffff : fp2
84000000-840fffff : 0000:03:00.0
2) Doesnt look quite right to me...?
Anyway, reads don't work (its not coded like below, it has checks etc):-
#define BAR_ADDR 0x80000000
void *base = ioremap_nocache(BAR_ADDR, 0x40000);
void *address = base + KNOWN_REG_LOCATION;
int data = ioread32(address);
printk("fp2: address:0x%08x, data:0x%08x\n", address, data);
Outputs :-
address:0xfd500000, data:0xffffffff
I can read the x80000000+KNOWN_REG_LOCATION from mmap userspace.
3) I've tried __raw_readl/readl with no luck as well.
4) Can I just read at the currently mapped address x80000000?
Ian,
I wrote a PCI driver for a device (full source). The mapping of the register space should be the same though. Here is how I do it.
dm7820_device->pci[region].virt_addr = ioremap_nocache(address, length);
if (dm7820_device->pci[region].virt_addr == NULL) {
printk(KERN_ERR "%s: ERROR: BAR%u remapping FAILED\n",
&((dm7820_device->device_name)[0]), region);
dm7820_release_resources();
return -ENOMEM;
}
if (request_mem_region(address, length, &((dm7820_device->device_name)[0])) == NULL) {
printk(KERN_ERR "%s: ERROR: I/O memory range %#lx-%#lx allocation FAILED\n",
&((dm7820_device->device_name)[0]), address, (address + length - 1));
dm7820_release_resources();
return -EBUSY;
}
The address and length values are returned from pci_resource_start() and pci_resource_length() calls.
Then you can access it using ioread32() using dm7820_device->pci[region].virt_addr + <register offset>
Let me know if you have any questions.

Resources