How to add device tree blob to Linux x86 kernel boot? - linux-kernel

My custom development board is based on x86, and one of the electronic components connected to it (mainly through SPI) cannot easily be controlled without the vendor's kernel driver (and the vendor won't provide support unless I use it). This module requires some configuration parameters that it gets from the device tree. I believe it is mostly used on ARM platforms, where device trees are common.
On x86 the device tree is generally not needed, so it is disabled by default in the Linux kernel configuration. I changed the configuration to enable it, but I cannot find a way to put the device tree blob (DTB) into the boot image. There is only one DTS file for the x86 architecture in the kernel sources, but it doesn't seem to be used at all, so it doesn't help.
From the kernel documentation, I understand I need to put it in the setup_data field of the x86 real-mode kernel header, but I don't understand how or when to do that (at kernel build time? when building the bootloader?). Am I supposed to hack arch/x86/boot/header.S directly?
Right now I've replaced the module configuration with hard-coded values, but using the device tree would be better.

On x86, the boot loader adds the Device Tree binary data (DTB) to the linked list of setup_data structures before calling the kernel entry point. The DTB can be loaded from a storage device or embedded into the boot loader image.
The following code shows how it's implemented in U-Boot.
http://git.denx.de/?p=u-boot.git;a=blob;f=arch/x86/lib/zimage.c:
static int setup_device_tree(struct setup_header *hdr, const void *fdt_blob)
{
	int bootproto = get_boot_protocol(hdr);
	struct setup_data *sd;
	int size;

	if (bootproto < 0x0209)
		return -ENOTSUPP;

	if (!fdt_blob)
		return 0;

	size = fdt_totalsize(fdt_blob);
	if (size < 0)
		return -EINVAL;

	size += sizeof(struct setup_data);
	sd = (struct setup_data *)malloc(size);
	if (!sd) {
		printf("Not enough memory for DTB setup data\n");
		return -ENOMEM;
	}

	sd->next = hdr->setup_data;
	sd->type = SETUP_DTB;
	sd->len = fdt_totalsize(fdt_blob);
	memcpy(sd->data, fdt_blob, sd->len);
	hdr->setup_data = (unsigned long)sd;

	return 0;
}
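For reference, the setup_data record being chained here is defined by the x86 boot protocol (see arch/x86/include/uapi/asm/bootparam.h), and SETUP_DTB is one of its defined types:

struct setup_data {
	__u64 next;	/* physical address of the next node (0 ends the list) */
	__u32 type;	/* SETUP_DTB identifies a flattened device tree payload */
	__u32 len;	/* length of data[] in bytes */
	__u8 data[];	/* the DTB itself, in this case */
};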

Related

How can I check whether a kernel address belongs to the Linux kernel executable, and not just the core kernel text?

Coming from the Windows world, I assume that vmlinuz is the equivalent of ntoskrnl.exe, i.e. the kernel executable that gets mapped into kernel memory.
Now I want to figure out whether an address inside the kernel belongs to the kernel executable or not. Is using core_kernel_text the correct way of finding this out?
I ask because core_kernel_text doesn't return true for some addresses that clearly should belong to the Linux kernel executable.
For example, core_kernel_text doesn't return true when I give it the syscall entry handler address, which can be found with the following code:
unsigned long system_call_entry;
rdmsrl(MSR_LSTAR, system_call_entry);
return (void *)system_call_entry;
When I use this snippet, the resulting syscall entry handler address belongs neither to the core kernel text nor to any kernel module (checked with get_module_from_addr).
So how can the address of a handler that clearly belongs to the Linux kernel executable, such as the syscall entry, belong neither to the core kernel nor to any kernel module? What does it belong to, then?
Which API do I need to use for this type of address to confirm that it indeed belongs to the kernel?
I need such an API because I need to write detection for malicious kernel modules that patch such handlers, and for that I need to make sure the address belongs to the kernel, not to some third-party kernel module or a random kernel address. (Please do not discuss methods that can be used to bypass my detection; obviously it can be bypassed, but that's another story.)
The target kernel is 4.15.0-112-generic on Ubuntu 16.04, running as a VMware guest.
Reproducible code as requested:
typedef int (*core_kernel_text_t)(unsigned long addr);
core_kernel_text_t core_kernel_text_;

core_kernel_text_ = (core_kernel_text_t)kallsyms_lookup_name("core_kernel_text");

unsigned long system_call_entry;
rdmsrl(MSR_LSTAR, system_call_entry);

int isInsideCoreKernel = core_kernel_text_(system_call_entry);
printk("%d, 0x%pK\n", isInsideCoreKernel, (void *)system_call_entry);
EDIT1: So in the MSR_LSTAR example I gave above, it turns out that it's related to Kernel Page Table Isolation and CONFIG_RETPOLINE=y in the config:
system_call value is different each time when I use rdmsrl(MSR_LSTAR, system_call)
That's why I am getting the address 0xfffffe0000006000, a.k.a. SYSCALL64_entry_trampoline, the same as in the question above.
So now the question remains: why doesn't this SYSCALL64_entry_trampoline address belong to anything? It doesn't belong to any kernel module, and it doesn't belong to the core kernel, so which executable does this address belong to, and how can I check that with an API similar to core_kernel_text? It seems to belong to cpu_entry_area, but what is that, and how can I check whether an address belongs to it?
You are seeing this "weird" address in MSR_LSTAR (IA32_LSTAR) because of Kernel Page-Table Isolation (KPTI), which mitigates Meltdown. As existing answers(1) you already found point out, the address you see is that of a small trampoline (entry_SYSCALL_64_trampoline) that is dynamically remapped at boot time by the kernel for each CPU, and thus does not have an address within the kernel text.
(1) By the way, the answer linked above wrongly states that the corresponding config option for KPTI is CONFIG_RETPOLINE=y. This is wrong: "retpoline" is a mitigation for Spectre, not Meltdown. The config option that enables KPTI is CONFIG_PAGE_TABLE_ISOLATION=y.
You don't have many options. Either:
Tell VMware to emulate a recent CPU that is not vulnerable to Meltdown.
Detect and implement support for the KPTI trampoline.
You can implement support for this by detecting whether the kernel supports KPTI (CONFIG_PAGE_TABLE_ISOLATION) and, if so, checking whether the current CPU has KPTI enabled. The code in arch/x86/kernel/cpu/bugs.c that provides the information for /sys/devices/system/cpu/vulnerabilities/meltdown shows how this can be detected:
ssize_t cpu_show_meltdown(struct device *dev,
			  struct device_attribute *attr, char *buf)
{
	if (!boot_cpu_has_bug(X86_BUG_CPU_MELTDOWN))
		return sprintf(buf, "Not affected\n");
	if (boot_cpu_has(X86_FEATURE_PTI))
		return sprintf(buf, "Mitigation: PTI\n");
	return sprintf(buf, "Vulnerable\n");
}
The actual trampoline is set up at boot and its address is stored in each CPU's "entry area" for later use (e.g. here when setting up IA32_LSTAR). This answer on Unix & Linux SE explains the purpose of the cpu entry area and its relation to KPTI.
In your module you can do the following detection:
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kallsyms.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
#include <asm/cpufeature.h>
#include <asm/cpu_entry_area.h>

// ...

typedef int (*core_kernel_text_t)(unsigned long addr);
core_kernel_text_t core_kernel_text_;

bool syscall_entry_64_ok(void)
{
	unsigned long entry;
	bool ok = false;
	/* Disable preemption so the MSR read and the CPU id below
	 * refer to the same CPU. */
	int cpu = get_cpu();

	rdmsrl(MSR_LSTAR, entry);

	if (core_kernel_text_(entry)) {
		ok = true;
	}
#ifdef CONFIG_PAGE_TABLE_ISOLATION
	else if (this_cpu_has(X86_FEATURE_PTI)) {
		unsigned long trampoline =
			(unsigned long)get_cpu_entry_area(cpu)->entry_trampoline;

		if ((entry & PAGE_MASK) == trampoline)
			ok = true;
	}
#endif

	put_cpu();
	return ok;
}

static int __init modinit(void)
{
	core_kernel_text_ = (core_kernel_text_t)kallsyms_lookup_name("core_kernel_text");
	if (!core_kernel_text_)
		return -EOPNOTSUPP;

	pr_info("syscall_entry_64_ok() -> %d\n", syscall_entry_64_ok());
	return 0;
}
module_init(modinit);

MODULE_LICENSE("GPL");

Add attribute to an existing kobject

I'm working on porting a driver I've written for the 2.6.x kernel series to 3.x (i.e. RHEL 6 -> RHEL 7). My driver solution actually comes in two modules: a modified form of ahci.c (from the kernel tree) and my own upper-layer character driver for accessing AHCI registers and even executing commands against the drive.
In porting to CentOS 7, I've run into an interesting problem. Changes to the drivers I'm building on remove access to the scsi_host attributes in sysfs. My question then is: can I append attributes to devices already existing in sysfs? Every example I've come across shows the attributes being created at the point of device creation, e.g.:
static ssize_t port_show(struct kobject *kobj, struct kobj_attribute *attr,
			 char *buff);

static struct kobj_attribute pxclb_attr = __ATTR(pxclb, 0664, port_show, NULL);

static struct attribute *attrs[] = {
	&pxclb_attr.attr,
	NULL,
};

static struct attribute_group port_group = {
	.attrs = attrs,
};

/* much other code */

sysfs_create_group(&kobj, &port_group);
This example comes from my character driver. All the searching I've done on Google, and my Linux Foundation drivers class book, show attribute creation being done at device-creation time. Can I add attributes at any time? It would seem that one could call sysfs_create_group() at any time. Is this true?
You can add attributes in sysfs whenever you like.
The reason almost all drivers add attributes in probe is that userspace has strict expectations about when attributes get created. When a new device is registered with the kernel, a uevent is generated to notify userspace (such as udev) that a new device is available. If attributes are added after the device is registered, userspace is not notified and will not know about the new attributes.
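If you do create attributes late, you can emit a "change" uevent yourself so that udev re-examines the device. A minimal sketch (assuming dev is a struct device pointer you hold a reference to, reusing port_group from the question):

#include <linux/device.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

/* Attach the group to an already-registered device, then tell
 * userspace to re-read the device's attributes. */
int err = sysfs_create_group(&dev->kobj, &port_group);
if (!err)
	kobject_uevent(&dev->kobj, KOBJ_CHANGE);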

Can I query device tree items without creating a platform device?

I am writing a kernel module intended to functionally test a device driver kernel module for an ARM+FPGA SoC system. My approach involves finding which interrupts the device driver is using by querying the device tree. In the device driver itself, I register a platform driver using platform_driver_register, and in the .probe function I am passed a platform_device pointer that contains the device pointer. With this I can call of_match_device and irq_of_parse_and_map, retrieving the IRQ numbers.
I don't want to register a second platform driver just to query the device tree this way from the test module. Is there some other way I can query the device tree (perhaps more directly, e.g. by name)?
This is what I've found so far, and it seems to work: of_find_compatible_node does what I want. Once I have the device_node pointer, I can call irq_of_parse_and_map (since of_irq_get_byname doesn't seem to compile for me). I can use it like the following:
#include <linux/of.h>
#include <linux/of_irq.h>

/* ... */

int get_dut_irq(const char *dev_compatible_name)
{
	struct device_node *dev_node;
	int irq = -1;

	dev_node = of_find_compatible_node(NULL, NULL, dev_compatible_name);
	if (!dev_node)
		return -1;

	irq = irq_of_parse_and_map(dev_node, 0);
	of_node_put(dev_node);
	return irq;
}
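Usage then looks like this (the compatible string here is hypothetical; use whatever your DUT's node declares):

int irq = get_dut_irq("vendor,my-fpga-dut"); /* hypothetical compatible string */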

Load kernel module prior to device tree probing

I have developed a working driver for my custom_hardware that relies on the device tree. Because my driver may evolve, I do not want it to be part of the kernel (by 'part of the kernel' I mean compiled into the kernel image at kernel build time).
Here is a glimpse of my dts:
custom_hardware: custom_hardware@41006000 {
	compatible = "mfg,custom";
	reg = <0x41006000 0x1000>;
	#interrupt-cells = <0x1>;
	interrupt-controller;
};

existing_hardware: existing_hardware@41004000 {
	compatible = "mfg,existing";
	reg = <0x41004000 0x1000>;
	interrupt-parent = <&custom_hardware>;
	interrupts = <0>;
};
The existing_hardware driver is already built into the kernel (it was compiled in at kernel build time).
What I would like to do is add my custom_hardware driver to the initramfs and have the kernel load it before the existing_hardware driver.
This matters because the existing_hardware driver requests a virq from the irq_domain of the custom_hardware driver; to get that irq_domain, the custom_hardware driver must be loaded first.
Note that the existing_hardware driver gets loaded during device tree probing, which seems to happen early in the kernel boot sequence.
That is not the way to do it. Module/driver load order must not matter. What you need to do is return -EPROBE_DEFER from existing_hardware's probe when getting the IRQ fails. The driver will then be probed again later, hopefully after custom_hardware has been probed.
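A minimal sketch of this pattern (function and variable names are illustrative, not from the original drivers):

static int existing_hw_probe(struct platform_device *pdev)
{
	int irq = platform_get_irq(pdev, 0);

	if (irq < 0) {
		/* custom_hardware's irq_domain may not be registered yet;
		 * ask the driver core to retry this probe later. */
		return -EPROBE_DEFER;
	}

	/* ... normal initialization using irq ... */
	return 0;
}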
Also, you can apply this patch, which ensures that when request_irq() fails because the domain is not present yet, -EPROBE_DEFER is returned:
https://lkml.org/lkml/2014/2/13/114
I had a similar problem (probe order was wrong), and the only simple solution I found was to put the modules into the Makefile in the desired probe order.
I've found the solution here: What is the Linux built-in driver load order?
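For built-in drivers this works because initcalls at the same level run in link order, so the Makefile listing order decides probe order. A sketch (object names are illustrative):

obj-y += custom_hardware.o	# linked first, so probed first
obj-y += existing_hardware.o	# linked second, so probed second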

Retrieving PCI coordinates by Windows' API (user mode)

Is there a way to obtain the PCI coordinates (bus/slot/function numbers) of devices using the Windows C/C++ API (e.g. the PnP Configuration Manager API)? I already know how to do it in kernel mode; I need a user-mode solution. My target system is Windows XP 32-bit.
I eventually found a simple solution (it was just a matter of digging into MSDN).
This minimal code finds a device's PCI coordinates in terms of bus/slot/function:
DWORD bus, addr, slot, func;
HDEVINFO h;        // Obtained by SetupDiGetClassDevs
SP_DEVINFO_DATA d; // Filled by SetupDiGetDeviceInterfaceDetail

SetupDiGetDeviceRegistryProperty(h, &d, SPDRP_BUSNUMBER, NULL,
                                 (PBYTE)&bus, sizeof(bus), NULL);
SetupDiGetDeviceRegistryProperty(h, &d, SPDRP_ADDRESS, NULL,
                                 (PBYTE)&addr, sizeof(addr), NULL);

slot = (addr >> 16) & 0xFFFF; // high word of SPDRP_ADDRESS = device (slot) number
func = addr & 0xFFFF;         // low word = function number
Note: in production code, the required output buffer size should be obtained by a preliminary call to the API function so the buffer can be allocated dynamically, and error checks must be added, of course.
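For completeness, here is a hedged sketch of how h and d might be obtained (the "PCI" enumerator string restricts enumeration to PCI devices; error handling omitted):

#include <windows.h>
#include <setupapi.h>

HDEVINFO h = SetupDiGetClassDevs(NULL, TEXT("PCI"), NULL,
                                 DIGCF_ALLCLASSES | DIGCF_PRESENT);
SP_DEVINFO_DATA d;
d.cbSize = sizeof(d);
for (DWORD i = 0; SetupDiEnumDeviceInfo(h, i, &d); i++) {
    // ... query SPDRP_BUSNUMBER / SPDRP_ADDRESS as shown above ...
}
SetupDiDestroyDeviceInfoList(h);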